
Lecture Notes in Computer Science 588

Edited by G. Goos and J. Hartmanis

Advisory Board: W. Brauer D. Gries J. Stoer


G. Sandini (Ed.)

Computer Vision - ECCV '92
Second European Conference on Computer Vision
Santa Margherita Ligure, Italy, May 19-22, 1992
Proceedings

Springer-Verlag
Berlin Heidelberg New York
London Paris Tokyo
Hong Kong Barcelona
Budapest
Series Editors
Gerhard Goos
Universität Karlsruhe
Postfach 69 80
Vincenz-Priessnitz-Straße 1
W-7500 Karlsruhe, FRG

Juris Hartmanis
Cornell University
Department of Computer Science
5149 Upson Hall
Ithaca, NY 14853, USA

Volume Editor
Giulio Sandini
Dept. of Communication, Computer, and Systems Science, University of Genova
Via Opera Pia, 11A, I-16145 Genova, Italy

CR Subject Classification (1991): I.3, I.5, I.2.9-10

ISBN 3-540-55426-2 Springer-Verlag Berlin Heidelberg New York


ISBN 0-387-55426-2 Springer-Verlag New York Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, re-use of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other way,
and storage in data banks. Duplication of this publication or parts thereof is permitted
only under the provisions of the German Copyright Law of September 9, 1965, in its
current version, and permission for use must always be obtained from Springer-Verlag.
Violations are liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1992
Printed in Germany
Typesetting: Camera ready by author
Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr.
45/3140-543210 - Printed on acid-free paper
Foreword

This volume collects the papers accepted for presentation at the Second European Con-
ference on Computer Vision held in Santa Margherita Ligure, Italy, May 19 - 22, 1992.
The selection of the papers has been extremely difficult because of the high number
and excellent quality of the papers submitted. I wish to thank my friends on the Pro-
gramme Committee who, with the help of other qualified referees, have done a tremendous
job in reviewing the papers at short notice.
In order to maintain a single-track conference, and to keep this book within a rea-
sonable size, 16 long papers, 41 short papers and 48 posters have been selected from the
308 submissions, and structured into 14 sections reflecting the major research topics in
computer vision currently investigated worldwide.
I personally would like to thank all the authors for the quality of their work and for
their collaboration in keeping the length of the papers within the limits requested (we
all know how painful it is to delete lines after they have been printed). Particular thanks
to those who submitted papers from outside Europe for the credit given to the Euro-
pean computer vision community. Their contribution has been fundamental in outlining
the current state of the art in computer vision and in producing further evidence of the
maturity reached by this important research field. Thanks to ESPRIT and other collab-
orative projects the number of "transnational" (and even "transcontinental") papers is
increasing, with political implications which outlast, to some extent, the scientific ones.
It will be a major challenge for the computer vision community to take advantage of
the recent political changes worldwide in order to bring new ideas into this challenging
research field.
I wish to thank all the persons who contributed to make ECCV-92 a reality, in
particular Piera Ponta of Genova Ricerche, Therese Bricheteau and Cristine Juncker of
INRIA and Lorenza Luceti of DIST who helped in keeping things under control during
the hot phases of the preparation.
I give special thanks to Olivier Faugeras who, as the chairman of ECCV-90, established
the high standard of this conference, thus contributing significantly to attracting
so many good papers to ECCV-92.
Finally let me thank Anna, Pietro and Corrado for their extra patience during these
last months.

Genova, March 1992 Giulio Sandini



Chairperson
Giulio Sandini DIST, University of Genova
Board
Bernard Buxton GEC Marconi, Hirst Research Center
Olivier Faugeras INRIA - Sophia Antipolis
Goesta Granlund Linköping University
John Mayhew Sheffield University
Hans H. Nagel Karlsruhe University Fraunhofer Inst.
Programme Committee
Nicholas Ayache INRIA Rocquencourt
Andrew Blake Oxford University
Mike Brady Oxford University
Hans Burkhardt University Hamburg-Harburg
Hilary Buxton Queen Mary and Westfield College
James Crowley LIFIA - INPG, Grenoble
Rachid Deriche INRIA Sophia Antipolis
Ernest Dickmanns University München
Jan Olof Eklundh Royal Institute of Technology, Stockholm
David Hogg Leeds University
Jan Koenderink Utrecht State University
Hans Knutsson Linköping University
Roger Mohr LIFIA - INPG, Grenoble
Bernd Neumann Hamburg University
Carme Torras Institute of Cybernetics, Barcelona
Vincent Torre University of Genova
Video Proceedings:
Giovanni Garibotto Elsag Bailey S.p.a.
Experimental Sessions:
Massimo Tistarelli DIST, University of Genova
ESPRIT Day Organization:
Patrick Van Hove CEC, DG XIII
ESPRIT Workshops Coordination:
James L. Crowley LIFIA - INPG, Grenoble
Coordination:
Piera Ponta Consorzio Genova Ricerche
Cristine Juncker INRIA, Sophia-Antipolis
Therese Bricheteau INRIA
Lorenza Luceti DIST, University of Genova
Nicoletta Piccardo Eurojob, Genova
Referees
Garibotto G. Italy   Nordström N. Sweden
Amat J. Spain   Giraudon G. France
Andersson M.T. Sweden   Gong S. U.K.   Olofsson G. Sweden
Aubert D. France   Granlund G. Sweden
Ayache N. France   Gros P. France   Pahlavan K. Sweden
Grosso E. Italy   Pampagnin L.H. France
Bårman H. Sweden   Gueziec A. France   Papadopoulo T. France
Bascle B. France   Paternak B. Germany
Bellissant C. France   Haglund L. Sweden   Petrou M. France
Benayoun S. France   Heitz F. France   Puget P. France
Berger M.O. France   Hérault H. France
Bergholm F. Sweden   Herlin I.L. France   Quan L. France
Berroir J.P. France   Hoehne H.H. Germany
Berthod M. France   Hogg D. U.K.   Radig B. Germany
Besañez L. Spain   Horaud R. France   Reid I. U.K.
Betsis D. Sweden   Howarth R. U.K.   Richetin M. France
Beyer H. France   Hugog D. U.K.   Rives G. France
Blake A. U.K.   Hummel R. France   Robert L. France
Boissier O. France
Bouthemy P. France   Inglebert C. France   Sagerer G. Germany
Boyle R. U.K.   Izuel M.J. Spain   Sandini G. Italy
Brady M. U.K.   Sanfeliu A. Spain
Burkhardt H. Germany   Juvin D. France   Schroeder C. Germany
Buxton B. U.K.   Seals B. France
Buxton H. U.K.   Kittler J. U.K.   Simmeth H. Germany
Knutsson H. Sweden   Sinclair D. U.K.
Calean D. France   Koenderink J. The Netherlands   Skordas Th. France
Carlsson S. Sweden   Koller D. Germany   Sommer G. Germany
Casals A. Spain   Sparr G. Sweden
Castan S. France   Lange S. Germany   Sprengel R. Germany
Celaya E. Spain   Lapreste J.T. France   Stein Th. von Germany
Chamley S. France   Levy-Vehel J. France   Stiehl H.S. Germany
Chassery J.M. France   Li M. Sweden
Chehikian A. France   Lindeberg T. Sweden   Thirion J.P. France
Christensen H. France   Lindsey P. U.K.   Thomas B. France
Cinquin Ph. France   Ludwig K.-O. Germany   Thomas F. Spain
Cohen I. France   Luong T. France   Thonnat M. France
Cohen L. France   Lux A. France   Tistarelli M. Italy
Crowley J.L. France   Toal A.F. U.K.
Curwen R. U.K.   Magrassi M. Italy   Torras C. Spain
Malandain G. France   Torre V. Italy
Dagless E. France   Martinez A. Spain   Trävén H. Sweden
Daniilidis K. Germany   Maybank S.J. France
De Micheli E. Italy   Mayhew J. U.K.   Uhlin T. Sweden
Demazeau Y. France   Mazer E. France   Usoh M. U.K.
Deriche R. France   McLauchlan P. U.K.
Devillers O. France   Mesrabi M. France   Veillon F. France
Dhome M. France   Milford D. France   Verri A. Italy
Dickmanns E. Germany   Moeller R. Germany   Vieville T. France
Dinten J.M. France   Mohr R. France   Villanueva J.J. Spain
Dreschler-Fischer L. Germany   Monga O. France
Drewniok C. Germany   Montseny E. Spain   Wahl F. Germany
Morgan A. France   Westelius C.J. Sweden
Eklundh J.O. Sweden   Morin L. France   Westin C.F. Sweden
Wieske L. Germany
Faugeras O.D. France   Nagel H.H. Germany   Wiklund J. Sweden
Ferrari F. Italy   Nastar C. France   Winroth H. Sweden
Fossa M. Italy   Navab N. France   Wysocki J. U.K.
Fua P. France   Neumann B. Germany
Neumann H. Germany   Zerubia J. France
Gårding J. Sweden   Nordberg K. Sweden   Zhang Z. France
Organization and Support

Organized by:
DIST, University of Genova

In Cooperation with:
Consorzio Genova Ricerche
INRIA - Sophia Antipolis
Commission of the European Communities, DGXIII - ESPRIT

Supported by:
C.N.R. Special Project on Robotics
European Vision Society

Major Corporate Sponsor


Elsag Bailey S.p.a.

Corporate Sponsors
Digital Equipment Corporation - Italy
Sincon - Fase S.p.A. - Italy
Sun Microsystems - Italy
Contents

Features
Steerable-Scalable Kernels for Edge Detection and Junction Analysis . . . . . . . . . . . . . . . . 3
P.Perona
Families of Tuned Scale-Space Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
L.J.Florack, B.M.ter Haar Romeny, J.J.Koenderink, M.A.Viergever
Contour Extraction by Mixture Density Description Obtained from Region
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
M.Etoh, Y.Shirai, M.Asada
The Möbius Strip Parameterization for Line Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
C.-F.Westin, H.Knutsson
Edge Tracing in a priori Known Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
A.Nowak, A.Fiorek, T.Piascik
Features Extraction and Analysis Methods for Sequences of Ultrasound Images . . . . 43
I.L.Herlin, N.Ayache
Figure-Ground Discrimination by Mean Field Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
L.Hérault, R.Horaud
Deterministic Pseudo-Annealing: Optimization in Markov-Random-Fields
An Application to Pixel Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
M.Berthod, G. Giraudon, J.P.Stromboni
A Bayesian Multiple Hypothesis Approach to Contour Grouping . . . . . . . . . . . . . . . . . . . 72
I.J.Cox, J.M.Rehg, S.Hingorani
Detection of General Edges and Keypoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
L.Rosenthaler, F.Heitger, O.Kübler, R.von der Heydt
Distributed Belief Revision for Adaptive Image Processing Regulation . . . . . . . . . . . . . . 87
V.Murino, M.F.Peri, C.S.Regazzoni
Finding Face Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
I.Craw, D.Tock, A.Bennett

Color
Detection of Specularity Using Color and Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
S. W.Lee, R.Bajcsy
Data and Model-Driven Selection Using Color Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
T.F.Syeda-Mahmood
Recovering Shading from Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.V.Funt, M.S.Drew, M.Brockington
Texture and Shading
Shading Flows and Scenel Bundles: A New Approach to Shape from Shading . . . . . 135
P.Breton, L.A.Iverson, M.S.Langer, S.W.Zucker
Texture: Plus ça Change, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
M. M. Fleck
Texture Parametrization Method for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 160
A.Casals, J.Amat, A.Grau
Texture Segmentation by Minimizing Vector-Valued Energy Functionals:
The Coupled-Membrane Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
T.S.Lee, D.Mumford, A.Yuille
Boundary Detection in Piecewise Homogeneous Textured Images . . . . . . . . . . . . . . . . . 174
S.Casadei, S.Mitter, P.Perona

Motion Estimation
Surface Orientation and Time to Contact from Image Divergence and Deformation .. 187
R. Cipolla, A.Blake
Robust and Fast Computation of Unbiased Intensity Derivatives in Images . . . . . . . . 203
T. Vieville, O.D.Faugeras
Testing Computational Theories of Motion Discontinuities:
A Psychophysical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
L.M. Vaina, N.M.Grzywacz
Motion and Structure Factorization and Segmentation of Long Multiple
Motion Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
C.Debrunner, N.Ahuja
Motion and Surface Recovery Using Curvature and Motion Consistency . . . . . . . . . . 222
G.Soucy, F.P.Ferrie
Finding Clusters and Planes from 3D Line Segments with Application to 3D
Motion Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Z.Zhang, O.D.Faugeras
Hierarchical Model-Based Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
J.R.Bergen, P.Anandan, K.J.Hanna, R.Hingorani
A Fast Method to Estimate Sensor Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
V.Sundareswaran
Identifying Multiple Motions from Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
A.Rognone, M.Campani, A. Verri
A Fast Obstacle Detection Method Based on Optical Flow . . . . . . . . . . . . . . . . . . . . . . . 267
N.Ancona
A Parallel Implementation of a Structure-from-Motion Algorithm . . . . . . . . . . . . . . . . 272
H.Wang, C.Bowman, M.Brady, C.Harris
Structure from Motion Using the Ground Plane Constraint . . . . . . . . . . . . . . . . . . . . . . . 277
T.N.Tan, G.D.Sullivan, K.D.Baker

Detecting and Tracking Multiple Moving Objects Using Temporal Integration . . . . 282
M.Irani, B.Rousso, S.Peleg

Calibration and Matching


A Study of Affine Matching with Bounded Sensor Error . . . . . . . . . . . . . . . . . . . . . . . . . . 291
W.E.L.Grimson, D.P.Huttenlocher, D.W.Jacobs
Epipolar Line Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
S.I.Olsen
Camera Calibration Using Multiple Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
P.Beardsley, D.Murray, A.Zisserman
Camera Self-Calibration: Theory and Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
O.D.Faugeras, Q.-T.Luong, S.J.Maybank
Model-Based Object Pose in 25 Lines of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
D.F.DeMenthon, L.S.Davis

Depth
Image Blurring Effects due to Depth Discontinuities:
Blurring that Creates Emergent Image Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
T.C.Nguyen, T.S.Huang
Ellipse Based Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
J.Buurman
Applying Two-Dimensional Delaunay Triangulation to Stereo Data Interpretation.. 368
E.Bruzzone, M.Cazzanti, L.De Floriani, F.Mangili
Local Stereoscopic Depth Estimation Using Ocular Stripe Maps . . . . . . . . . . . . . . . . . . 373
K.-O.Ludwig, H.Neumann, B.Neumann
Depth Computations from Polyhedral Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
G.Sparr
Parallel Algorithms for the Distance Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
H.Embrechts, D.Roose

Stereo-motion
A Computational Framework for Determining Stereo Correspondence from
a Set of Linear Spatial Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
D. G. Jones, J.Malik
On Visual Ambiguities due to Transparency in Motion and Stereo . . . . . . . . . . . . . . . . 411
M.Shizawa
A Deterministic Approach for Stereo Disparity Calculation . . . . . . . . . . . . . . . . . . . . . . . 420
C.Chang, S.Chatterjee
Occlusions and Binocular Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
D.Geiger, B.Ladendorf, A. Yuille

Tracking
Model-Based Object Tracking in Traffic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
D.Koller, K.Daniilidis, T.Thórhallsson, H.-H.Nagel
Tracking Moving Contours Using Energy-Minimizing Elastic Contour Models . . . . . 453
N.Ueda, K.Mase
Tracking Points on Deformable Objects Using Curvature Information . . . . . . . . . . . . . 458
I.Cohen, N.Ayache, P.Sulger
An Egomotion Algorithm Based on the Tracking of Arbitrary Curves . . . . . . . . . . . . . 467
E.Arbogast, R.Mohr
Region-Based Tracking in an Image Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
F.Meyer, P.Bouthemy
Combining Intensity and Motion for Incremental Segmentation and Tracking
over Long Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
M. J. Black

Active Vision
Active Egomotion: A Qualitative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Y.Aloimonos, Z.Duric
Active Perception Using DAM and Estimation Techniques . . . . . . . . . . . . . . . . . . . . . . . 511
W.Pölzleitner, H.Wechsler
Active-Dynamic Stereo for Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
E. Grosso, M. Tistarelli, G.Sandini
Integrating Primary Ocular Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
K.Pahlavan, T. Uhlin, J.-O.Eklundh
Where to Look Next Using a Bayes Net: Incorporating Geometric Relations . . . . . . 542
R.D.Rimey, C.M.Brown
An Attentional Prototype for Early Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
S.M.Culhane, J.K.Tsotsos

Binocular Heads
What Can Be Seen in Three Dimensions with an Uncalibrated Stereo Rig? . . . . . . . 563
O.D.Faugeras
Estimation of Relative Camera Positions for Uncalibrated Cameras . . . . . . . . . . . . . . . 579
R.I.Hartley
Gaze Control for a Binocular Camera Head . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
J.L. Crowley, P.Bobet, M.Mesrabi

Curved Surfaces a n d Objects


Computing Exact Aspect Graphs of Curved Objects: Algebraic Surfaces.......... 599
J.Ponce, S.Petitjean, D.J. Kriegman

Surface Interpolation Using Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615


A.P.Pentland
Smoothing and Matching of 3D Space Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
A.Guéziec, N.Ayache
Shape from Texture for Smooth Curved Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
J.Gårding
Recognizing Rotationally Symmetric Surfaces from Their Outlines . . . . . . . . . . . . . . . . 639
D.A.Forsyth, J.L.Mundy, A.Zisserman, C.A.Rothwell
Using Deformable Surfaces to Segment 3D Images and Infer Differential Structures .. 648
I.Cohen, L.D.Cohen, N.Ayache
Finding Parametric Curves in an Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
A.Leonardis, R.Bajcsy

Reconstruction and Shape


Determining Three-Dimensional Shape from Orientation and Spatial
Frequency Disparities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
D. G. Jones, J.Malik
Using Force Fields Derived from 3D Distance Maps for Inferring the
Attitude of a 3D Rigid Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
L.Brunie, S.Lavallée, R.Szeliski
Segmenting Unstructured 3D Points into Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
P.Fua, P.Sander
Finding the Pose of an Object of Revolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
R.Glachet, M.Dhome, J.T.Lapreste
Extraction of Line Drawings from Gray Value Images by Non-Local
Analysis of Edge Element Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
M.Otte, H.-H.Nagel
A Method for the 3D Reconstruction of Indoor Scenes from Monocular Images . . . 696
P.Olivieri, M.Gatti, M.Straforini, V.Torre
Active Detection and Classification of Junctions by Foveation with a
Head-Eye System Guided by the Scale-Space Primal Sketch . . . . . . . . . . . . . . . . . . . . . . 701
K.Brunnström, T.Lindeberg, J.-O.Eklundh
A New Topological Classification of Points in 3D Images . . . . . . . . . . . . . . . . . . . . . . . . . 710
G.Bertrand, G.Malandain
A Theory of 3D Reconstruction of Heterogeneous Edge Primitives from
Two Perspective Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
M.Xie, M. Thonnat
Detecting 3D Parallel Lines for Perceptual Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 720
X.Lebègue, J.K.Aggarwal
Integrated Skeleton and Boundary Shape Representation for Medical Image
Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
G.P. Robinson, A. C.F. Colchester, L.D. Griffin, D.J. Hawkes

Critical Sets for 3D Reconstruction Using Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730


Th.Buchanan
Intrinsic Surface Properties from Surface Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
X. Chen, F.Schmitt
Edge Classification and Depth Reconstruction by Fusion of Range and
Intensity Edge Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
G.Zhang, A.Wallace
Image Compression and Reconstruction Using a 1D Feature Catalogue . . . . . . . . . . . 749
B. Y.K.Aw, R.A.Owens, J.Ross

Recognition
Canonical Frames for Planar Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
C.A.Rothwell, A.Zisserman, D.A.Forsyth, J.L.Mundy
Measuring the Quality of Hypotheses in Model-Based Recognition . . . . . . . . . . . . . . . . 773
D.P.Huttenlocher, T.A.Cass
Using Automatically Constructed View-Independent Relational Model in
3D Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
S.Zhang, G.D.Sullivan, K.D.Baker
Learning to Recognize Faces from Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
S.Edelman, D.Reisfeld, Y. Yeshurun
Face Recognition Through Geometrical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
R.Brunelli, T.Poggio
Fusion Through Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
M.J.L.Orr, J.Hallam, R.B.Fisher
3D Object Recognition Using Passively Sensed Range Data . . . . . . . . . . . . . . . . . . . . . . 806
K.M.Dawson, D. Vernon
Interpretation of Remotely Sensed Images in a Context of Multisensor Fusion . . . . 815
V.Clément, G.Giraudon, S.Houzelle
Limitations of Non Model-Based Recognition Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Y.Moses, S. Ullman
Constraints for Recognizing and Locating Curved 3D Objects from
Monocular Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
D.J.Kriegman, B. Vijayakumar, J.Ponce
Polynomial-Time Object Recognition in the Presence of Clutter, Occlusion,
and Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
T.A. Cass
Hierarchical Shape Recognition Based on 3D Multiresolution Analysis . . . . . . . . . . . . 843
S.Morita, T.Kawashima, Y.Aoki
Object Recognition by Flexible Template Matching Using Genetic Algorithms . . . . 852
A.Hill, C.J. Taylor, T.Cootes
Matching and Recognition of Road Networks from Aerial Images . . . . . . . . . . . . . . . . . 857
S.Z.Li, J.Kittler, M.Petrou

Applications
Intensity and Edge-Based Symmetry Detection Applied to Car-Following . . . . . . . . . 865
T.Zielke, M.Brauckmann, W.von Seelen
Indexicality and Dynamic Attention Control in Qualitative Recognition
of Assembly Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
Y.Kuniyoshi, H.Inoue
Real-Time Visual Tracking for Surveillance and Path Planning . . . . . . . . . . . . . . . . . . . 879
R.Curwen, A.Blake, A.Zisserman
Spatio-Temporal Reasoning Within a Traffic Surveillance System . . . . . . . . . . . . . . . . . 884
A.F. Toal, H.Buxton
Template Guided Visual Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
A.Noble, V.D.Nguyen, C.Marinos, A.T.Tran, J.Farley, K.Hedengren, J.L.Mundy
Hardware Support for Fast Edge-Based Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 902
P.Courtney, N.A. Thacker, C.R.Brown

Author Index ...................................................... 907


Steerable-Scalable Kernels for Edge Detection and
Junction Analysis*

Pietro Perona 1,2

1 California Institute of Technology 116-81, Pasadena CA 91125, USA


e-mail: perona@verona.caltech.edu
2 Università di Padova - DEI, via Gradenigo 6A, 35131 Padova, Italy

Abstract. Families of kernels that are useful in a variety of early vision algorithms may be obtained by rotating and scaling a 'template' kernel in a continuum. These multi-scale, multi-orientation families may be approximated by linear interpolation of a discrete finite set of appropriate 'basis' kernels. A scheme for generating such a basis together with the appropriate interpolation weights is described. Unlike previous schemes by Perona, and Simoncelli et al., it is guaranteed to generate the most parsimonious one. Additionally, it is shown how to exploit two symmetries in edge-detection kernels for reducing storage and computational costs and generating simultaneously endstop- and junction-tuned filters for free.

1 Introduction

Points, lines, edges, textures and motions are present in almost all images of the everyday world. These elementary visual structures often encode a great proportion of the information contained in the image; moreover, they can be characterized using a small set of parameters that are locally defined: position, orientation, characteristic size or scale, phase, curvature, velocity. It is therefore reasonable to start visual computations with measurements of these parameters. The earliest stage of visual processing, common to all the classical early vision modules, could consist of a collection of operators that calculate one or more dominant orientations, curvatures, scales and velocities at each point of the image or, alternatively, assign an 'energy', or 'probability', value to points of a position-orientation-phase-scale-etc. space. Ridges and local maxima of this energy would mark special interest loci such as edges and junctions. The idea that biological visual systems might analyze images along dimensions such as orientation and scale dates back to work by Hubel and Wiesel [19, 18] in the 1960's. In the computational vision literature the idea of analyzing images along multiple orientations appears at the beginning of the seventies with the Binford-Horn linefinder [17, 3] and later work by Granlund [14].
A computational framework that may be used to perform this proto-visual analysis is the convolution of the image with kernels of various shapes, orientations, phases, elongations and scales. This approach is attractive because it is simple to describe, implement and analyze. It has been proposed and demonstrated for a variety of early vision tasks [23, 22, 5, 1, 6, 15, 40, 30, 28, 31, 10, 26, 4, 41, 20, 21, 11, 36, 2]. Various 'general' computational justifications have been proposed for basing visual processing on the output of a rich set of linear filters: (a) Koenderink has argued that a structure of this type is an adequate substrate for local geometrical computations [24] on the image brightness, (b) Adelson and Bergen [2] have derived it from the 'first principle' that the visual system computes derivatives of the image along the dimensions of wavelength, parallax, position, time, (c) a third point of view is the one of 'matched filtering', where the kernels are synthesized to match the visual events that one looks for.

* This work was partially conducted while at MIT-LIDS with the Center for Intelligent Control Systems sponsored by ARO grant DAAL 03-86-K-0171.
The kernels that have been proposed in the computational literature have typically been chosen according to one or more of three classes of criteria: (a) 'generic optimality' (e.g. optimal sampling of space-frequency space), (b) 'task optimality' (e.g. signal to noise ratio, localization of edges), (c) emulation of biological mechanisms. While there is no general consensus in the literature on precise kernel shapes, there is convergence on kernels roughly shaped like either Gabor functions, or derivatives or differences of either round or elongated Gaussian functions - all these functions have the advantage that they can be specified and computed easily. A good rule of thumb in the choice of kernels for early vision tasks is that they should have good localization in space and frequency, and should be roughly tuned to the visual events that one wants to analyze.
Since points, edges, lines, textures and motions can exist at all possible positions, orientations, scales of resolution and curvatures, one would like to be able to use families of filters that are tuned to all orientations, scales and positions. Therefore, once a particular convolution kernel has been chosen one would like to convolve the image with deformations (rotations, scalings, stretchings, bendings etc.) of this 'template'. In reality one can afford only a finite (and small) number of filtering operations, hence the common practice of 'sampling' the set of orientations, scales, positions, curvatures and phases³. This operation has the strong drawback of introducing anisotropies and algorithmic difficulties in the computational implementations. It would be preferable to keep thinking in terms of a continuum, of angles for example, and be able to localize the orientation of an edge with the maximum accuracy allowed by the filter one has chosen.
This aim may sometimes be achieved by means of interpolation: one convolves the image with a small set of kernels, say at a number of discrete orientations, and obtains the result of the convolution at any orientation by taking linear combinations of the results. Since convolution is a linear operation the interpolation problem may be formulated in terms of the kernels (for the sake of simplicity the case of rotations in the plane is discussed here): given a kernel $F : \mathbb{R}^2 \to \mathbb{C}^1$, define the family of 'rotated' copies of $F$ as $F_\theta = F \circ R_\theta$, $\theta \in S^1$, where $S^1$ is the circle and $R_\theta$ is a rotation. Sometimes it is possible to express $F_\theta$ as

$$F_\theta(x) = \sum_{i=1}^{n} b_i(\theta)\, G_i(x) \qquad \forall \theta \in S^1,\ \forall x \in \mathbb{R}^2 \qquad (1)$$

³ Motion flow computation using spatiotemporal filters has been proposed by Adelson and Bergen [1] as a model of human vision and has been demonstrated by Heeger [15] (his implementation had 12 discrete spatiotemporal orientations and 3 scales of resolution). Work on texture with multiple-resolution multiple-orientation kernels is due to Knutsson and Granlund [23] (4 scales, 4 orientations, 2 phases), Turner [40] (4 scales, 4 orientations, 2 phases), Fogel and Sagi [10] (4 scales, 4 orientations, 2 phases), Malik and Perona [26] (11 scales, 6 orientations, 1 phase) and Bovik et al. [4] (n scales, m orientations, 1 phase). Work on stereo by Kass [22] (12 filters, scales, orientations and phases unspecified) and Jones and Malik [20, 21] (see also the two articles in this book) (6 scales, 2-6 orientations, 2 phases). Work on curved line grouping by Parent and Zucker [31] (1 scale, 8 orientations, 1 phase) and Malik and Gigus [25] (9 curvatures, 1 scale, 18 orientations, 2 phases). Work on brightness edge detection by Binford and Horn [17, 3] (24 orientations), Canny [6] (1-2 scales, ∞-6 orientations, 1 phase), Morrone, Owens and Burr [30, 28] (1-3 scales, 2-4 orientations, ∞ phases), unpublished work on edge and illusory contour detection by Heitger, Rosenthaler, Kübler and von der Heydt (6 orientations, 1 scale, 2 phases). Image compression by Zhong and Mallat [41] (4 scales, 2 orientations, 1 phase).
Fig. 1. [Image figure: original image (left) and T-junction detail (right); both panels are described in the caption of Fig. 2.]

a finite linear combination of functions $G_i : \mathbb{R}^2 \to \mathbb{C}^1$. It must be noted that, at least for positions and phases, the mechanism for realizing this in a systematic way is well understood: in the case of positions the sampling theorem gives conditions and an interpolation technique for calculating the value of the filtered image at any point in a continuum; in the case of phases a pair of filters in quadrature can be used for calculating the response at any phase [1, 29]. Rotations, scalings and other deformations are less well understood.
An example of 'rotating' families of kernels that have a finite representation is well known: the first derivative along an arbitrary direction of a round ($\sigma_x = \sigma_y$) Gaussian may be obtained by linear combination of the X- and Y-derivatives of the same. The common implementations of the Canny edge detector [6] are based on this principle. Unfortunately the kernel obtained this way has poor orientation selectivity and therefore it is unsuited for edge detection if one wants to recover edge-junctions (see in Fig. 2 the comparison with a detector that uses narrow orientation-selective filters). Freeman and Adelson have recently proposed [11, 12] to construct orientation-selective kernels that can be exactly rotated by interpolation (they call this property "steerability") and have shown that higher order derivatives of round Gaussians, indeed all polynomials multiplied by a radially symmetric function, are steerable. They have also shown that functions that may be written as finite sums of polar-separable kernels with sinusoidal $\theta$ component are also steerable. These functions may be designed to have higher orientation selectivity and can be used for contour detection and signal processing [11]. However, one must be aware of the fact that for most kernels $F$ of interest a finite decomposition of $F_\theta$ as in Eq. (1) cannot be found. For example the elongated kernels used in edge detection by [35, 36] (see Fig. 2 top right) do not have a finite decomposition as in Eq. (1).
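As a minimal numerical sketch of the round-Gaussian example mentioned above (not part of the original paper; the function names and sampling parameters are illustrative assumptions), the first derivative of an isotropic Gaussian at any orientation is an exact linear combination of its X- and Y-derivative kernels:

```python
import numpy as np

def gaussian_xy_derivatives(size=21, sigma=3.0):
    """Return the x- and y-derivative kernels of an isotropic Gaussian."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    G = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    Gx = -X / sigma**2 * G          # d/dx of the Gaussian
    Gy = -Y / sigma**2 * G          # d/dy of the Gaussian
    return Gx, Gy

def steered_first_derivative(theta, size=21, sigma=3.0):
    """First derivative of a round Gaussian along direction theta (radians),
    obtained exactly by steering two fixed basis kernels."""
    Gx, Gy = gaussian_xy_derivatives(size, sigma)
    return np.cos(theta) * Gx + np.sin(theta) * Gy

# Because convolution is linear, the same weights cos(theta), sin(theta) also
# steer the two filtered images, so only two convolutions are ever needed.
```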
Perona [32, 33] has proposed an approximation technique that, given an $F_\theta$, allows one to generate a function $G^{[n]}_\theta$ which is sufficiently similar to $F_\theta$ and that is steerable, i.e. can be expressed as a finite sum of $n$ terms as in (1). This technique is guaranteed to find the most parsimonious steerable approximation to a given kernel $F_\theta$, i.e. given a tolerable amount $\delta$ of error it computes an approximating $G^{[n]}_\theta$ that has the minimum number $n$ of components and is within a distance $\delta$ from $F_\theta$. Perona [32, 33] and Simoncelli et al. [9] have proposed non-optimal extensions to the case of joint rotation and scaling.
[Figure 2 panels: needle plots 'energies 2-sided 8x8' and 'orientations 8x8' over the T-junction region, and edge maps 'Perona-Malik σx = 3, σy = 1' and 'Canny σ = 1'.]
Fig. 2. Example of the use of orientation-selective filtering on a continuum of orientations (see Perona and Malik [35, 36]). Fig. 1 (Left) Original image. Fig. 1 (Right) A T-junction (64x64 pixel detail from a region roughly at the centre of the original image). The kernel of the filter for the edge-detector is elongated to have high orientation selectivity; it is depicted in Fig. 3. (Top-left) Modulus $R(x, y, \theta)$ of the output of the complex-valued filter (polar plot shown for 8x8 pixels in the region of the T-junction). (Top-right) The local maxima of $|R(x, y, \theta)|$ with respect to $\theta$. Notice that in the region of the junction one finds two local maxima in $\theta$ corresponding to the orientations of the edges. Searching for local maxima in $(x, y)$ in a direction orthogonal to the maximizing $\theta$'s one can find the edges (Bottom left) with high accuracy (error around 1 degree in orientation and 0.1 pixels in position). (Bottom right) Comparison with the output of a Canny detector using the same kernel width ($\sigma$ in pixel units).

In this paper the general case of compact deformations is reviewed in section 2. Some results of functional analysis are recalled to formulate the decomposition technique in all generality. The case of rotations is briefly recalled in section 3 to introduce some notation which is used later in the paper. In section 4 it is shown how to generate a steerable and scalable family. Experimental results and implementation issues are presented and discussed. Finally, in section 5 some basic symmetries of edge-detection kernels are studied and their use described in (a) reducing calculations and storage, and (b) implementing filters useful for junction analysis at no extra cost.
2 Deformable functions

In order to solve the approximation problem one needs of course to define the 'quality' of the approximation $G^{[n]}_\theta \approx F_\theta$. There are two reasonable choices: (a) a distance $D(F_\theta, G^{[n]}_\theta)$ in the space $L^2(\mathbb{R}^2 \times S^1)$ where $F_\theta$ is defined; (b) if $F_\theta$ is the kernel of some filter one is interested in the worst-case error in the 'output' space: the maximum distance $d(\langle F_\theta, f\rangle, \langle G^{[n]}_\theta, f\rangle)$ over all unit-norm $f$ defined on $\mathbb{R}^2$. The symbols $\Delta_n$ and $\delta_n$ will indicate the 'optimal' distances, i.e. the minimum possible approximation errors using $n$ components. These quantities may be defined using the distances induced by the $L^2$-norm:

Definition.

$$\Delta_n^2(F_\theta) = \inf_{G^{[n]}} \left\| F_\theta - G^{[n]}_\theta \right\|^2$$

$$\delta_n(F_\theta) = \inf_{G^{[n]}} \sup_{\|f\|=1} \left\| \left\langle F_\theta - G^{[n]}_\theta,\, f \right\rangle \right\|_{S^1}$$
The existence of the optimal finite-sum approximation of the kernel $F_\theta(x)$ as described in the introduction is not peculiar to the case of rotations. This is true in more general circumstances: this section collects a few facts of functional analysis that show that one can compute finite optimal approximations to continuous families of kernels whenever certain 'compactness' conditions are met.
Consider a parametrized family of kernels $F(x; \theta)$ where $x \in X$ now indicates a generic vector of variables in a set $X$ and $\theta \in T$ a vector of parameters in a set $T$. (The notation is changed slightly from the previous section.) Consider the sets $A$ and $B$ of continuous functions from $X$ and $T$ to the complex numbers, and call $a(x)$ and $b(\theta)$ the generic elements of these two sets. Consider the operator $L : A \to B$ defined by $F$ as:

$$(L\,a(\cdot))(\theta) = \langle F(\cdot;\theta),\, a(\cdot)\rangle_A \qquad (2)$$

A first theorem says that if the kernel $F$ has bounded norm then the associated operator $L$ is compact (see [7], p. 316):

Theorem 1. Let $X$ and $T$ be locally compact Hausdorff spaces and $F \in L^2(X \times T)$. Then $L$ is well defined and is a compact operator.

Such a kernel is commonly called a Hilbert-Schmidt kernel.


A second result tells us that if a linear operator is compact, then it has a discrete spectrum (see [8], p. 323):

Theorem 2. Let $L$ be a compact operator on (complex) normed spaces; then the spectrum $S$ of $L$ is at most denumerable.

A third result says that if $L$ is continuous and operates on Hilbert spaces then the compactness property transfers to the adjoint of $L$ (see [8], p. 329):

Theorem 3. Let $L$ be a compact operator on Hilbert spaces; then the adjoint $L^*$ is compact.
Trivially, the composition of two compact operators is compact, so the operators $LL^*$ and $L^*L$ are compact and have a discrete spectrum as guaranteed by Theorem 2. The singular value decomposition (SVD) of the operator $L$ can therefore be computed as the collection of triples $(\sigma_i, a_i, b_i)$, $i = 0, \ldots$, where the $\sigma_i$ constitute the spectra of both $LL^*$ and $L^*L$ and the $a_i$ and $b_i$ are the corresponding eigenvectors.
The last result can now be stated (see [37], Chap. IV, Theorem 2.2):

Theorem 4. Let $L : A \to B$ be a linear compact operator between two Hilbert spaces. Let $(\sigma_i, a_i, b_i)$ be the singular value decomposition of $L$, where the $\sigma_i$ are in decreasing order of magnitude. Then

1. An optimal $n$-dimensional approximation to $L$ is $L_n = \sum_{i=1}^{n} \sigma_i\, a_i\, b_i$.

2. The approximation errors are $\delta_n(L) = \sigma_{n+1}$ and $\Delta_n^2(L) = \sum_{i=n+1}^{N} \sigma_i^2$.

As a result we know that when our original template kernel $F(x)$ and the chosen family of deformations $R(\theta)$ define a Hilbert-Schmidt kernel $F(x; \theta) = (F \circ R(\theta))(x)$, then it is possible to compute a finite discrete approximation as for the case of 2D rotations.
Are the families of kernels $F(x; \theta)$ of interest in vision Hilbert-Schmidt kernels? In the cases of interest for vision applications the 'template' kernel $F(x)$ typically has a finite norm, i.e. it belongs to $L^2(X)$ (all kernels used in vision are bounded compact-support kernels such as Gaussian derivatives, Gabors etc.). However, this is not a sufficient condition for the family $F(x; \theta) = (F \circ R(\theta))(x)$ obtained by composing $F(x)$ with deformations $R(\theta)$ (rotations, scalings) to be a Hilbert-Schmidt kernel: the norm of $F(x; \theta)$ could be unbounded (e.g. if the deformation is a scaling in the unbounded interval $(0, \infty)$). A sufficient condition for the associated family $F(x; \theta)$ to be a Hilbert-Schmidt kernel is that the inverse of the Jacobian of the transformation $R$, $|J_R|^{-1}$, belongs to $L^2(T)$ (see [34]).
A typical condition in which this arises is when the transformation $R$ is unitary, e.g. a rotation, translation, or an appropriately normalized scaling, and the set $T$ is bounded. In that case the norm of $|J_R|^{-1}$ is equal to the measure of $T$. The following sections of this paper will illustrate the power of these results by applying them to the decomposition of rotated and scaled kernels.
A useful subclass of kernels $F$ for which the finite orthonormal approximation can be in part explicitly computed is obtained by composing a template function with transformations $T_\theta$ belonging to a compact group. This situation arises in the case of n-dimensional rotations and is useful for edge detection in tomographic data and spatiotemporal filtering. It is discussed in [32, 33, 34].
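As a practical illustration of how Theorem 4 is used (a sketch, not the author's implementation; the sampling grids, the error criterion and all function names are assumptions), one can discretize the family $F(x; \theta)$ into a matrix, take its SVD, and truncate at the smallest $n$ whose residual energy meets a prescribed tolerance:

```python
import numpy as np

def finite_approximation(F, xs, thetas, rel_err=0.01):
    """Discrete counterpart of Theorem 4 for a sampled kernel family.

    F       : callable F(x, theta) -> complex value of the deformed kernel
    xs      : iterable of spatial sample points x
    thetas  : iterable of deformation parameters theta
    rel_err : tolerated relative L2 approximation error
    """
    # M[j, k] = F(x_j; theta_k); its SVD is the discrete analogue of the
    # triples (sigma_i, a_i, b_i) of the operator L.
    M = np.array([[F(x, th) for th in thetas] for x in xs])
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    # Theorem 4: Delta_n^2 equals the sum of the squared singular values left out.
    total = np.sum(s ** 2)
    resid = total - np.cumsum(s ** 2)
    n = int(np.argmax(resid <= (rel_err ** 2) * total)) + 1
    # Truncated expansion: M[j, k] ~ sum_i s[i] * A[j, i] * B[k, i]
    A = U[:, :n]          # sampled a_i(x)
    B = Vh[:n, :].T       # sampled b_i(theta)
    return s[:n], A, B
```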

3 Rotation

To make the paper self-contained the formula for generating a steerable approximation is recalled here. The $F^{[n]}_\theta$ which is the best $n$-dimensional approximation of $F_\theta$ is defined as follows:

Definition. Call $F^{[n]}_\theta$ the $n$-term sum:

$$F^{[n]}_\theta(x) = \sum_{i=1}^{n} \sigma_i\, b_i(\theta)\, a_i(x) \qquad (3)$$

with $\sigma_i$, $a_i$ and $b_i$ defined in the following way: let $\hat{h}(\nu)$ be the (discrete) Fourier transform of the function $h(\theta)$ defined by:
[Figure 3 panels: (gaus-3) (sfnc.0) (sfnc.1) (sfnc.2) (sfnc.3) (sfnc.4) (sfnc.5) (sfnc.6) (sfnc.7) (sfnc.8)]

Fig. 3. The decomposition $(\sigma_i, b_i, a_i)$ of a complex kernel used for brightness-edge detection [36]. (Left) The template function (gaus-3) is shown rotated counterclockwise by 120°. Its real part (above) is the second derivative along the vertical (Y) axis of a Gaussian with $\sigma_x : \sigma_y$ ratio of 1:3. The imaginary part (below) is the Hilbert transform of the real part along the Y axis. The singular values $\sigma_i$ (not shown here - see [34]) decay exponentially: $\sigma_{i+1} \approx 0.75\,\sigma_i$. (Right) The functions $a_i$ (sfnc.i) are shown for $i = 0 \ldots 8$. The real part is above; the imaginary part below. The functions $b_i(\theta)$ are complex exponentials (see text) with associated frequencies $\nu_i = i$.

$$h(\theta) = \int_{\mathbb{R}^2} F_\theta(x)\, \overline{F_{\theta=0}(x)}\, dx \qquad (4)$$

and let $\nu_i$ be the frequencies on which $\hat{h}$ is defined, ordered in such a way that $\hat{h}(\nu_i) \ge \hat{h}(\nu_j)$ if $i \le j$. Call $N \le \infty$ the number of nonzero terms $\hat{h}(\nu_i)$. Finally, define the quantities:

$$\sigma_i = \hat{h}(\nu_i)^{1/2} \qquad (5)$$

$$b_i(\theta) = e^{j 2\pi \nu_i \theta} \qquad (6)$$

$$a_i(x) = \sigma_i^{-1} \int_{S^1} F_\theta(x)\, \overline{b_i(\theta)}\, d\theta \qquad (7)$$
See Fig. 3 and [32, 33, 34] for details and a derivation of these formulae.
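A numerical reading of Eqs. (4)-(7), as reconstructed above (a sketch, not the author's code; the callable F, the grid and the angle-in-turns convention are assumptions): sample the rotated copies of the template, compute h(θ), take its DFT, keep the dominant frequencies and project the sampled family onto the corresponding exponentials.

```python
import numpy as np

def rotation_decomposition(F, grid, n_angles=64, n_keep=9):
    """Steerable decomposition of a template kernel under 2D rotation.

    F    : callable F(x, y) -> complex, vectorized over coordinate arrays
    grid : tuple (X, Y) of coordinate arrays on which the kernels are sampled
    """
    X, Y = grid
    ts = np.arange(n_angles) / n_angles               # theta in turns, [0, 1)
    def F_theta(t):                                   # rotated copy F o R_theta
        c, s = np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)
        return F(c * X - s * Y, s * X + c * Y)
    stack = np.array([F_theta(t) for t in ts])        # (n_angles, H, W)
    # Eq. (4): h(theta) = <F_theta, F_{theta=0}>
    h = np.tensordot(stack, np.conj(stack[0]), axes=([1, 2], [0, 1]))
    h_hat = np.fft.fft(h) / n_angles
    nu = np.argsort(-np.abs(h_hat))[:n_keep]          # dominant FFT bins nu_i
    sigma = np.sqrt(np.abs(h_hat[nu]))                # Eq. (5)
    # Eq. (7): a_i = sigma_i^{-1} * mean over theta of F_theta * conj(b_i(theta))
    phases = np.exp(-2j * np.pi * np.outer(nu, ts))
    a = np.einsum('it,txy->ixy', phases, stack) / (n_angles * sigma[:, None, None])
    # F_theta(x) is then approximated by sum_i sigma_i * exp(2j*pi*nu_i*theta) * a_i(x)
    return sigma, nu, a
```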

4 Rotation and scale

A number of filter-based early vision and signal processing algorithms analyze the image at multiple scales of resolution. Although most of the algorithms are defined on, and would take advantage of, the availability of a continuum of scales, only a discrete and small set of scales is usually employed due to the computational costs involved with filtering and storing images. The problem of multi-scale filtering is somewhat analogous to the multi-orientation filtering problem: given a template function $F(x)$ and defining $F_\sigma$ as $F_\sigma(x) = F(\sigma^{-1}x)$, $\sigma \in (0, \infty)$, one would like to be able to write $F_\sigma$ as a (small) linear combination:

$$F_\sigma(x) = \sum_i s_i(\sigma)\, G_i(x) \qquad \sigma \in (0, \infty) \qquad (8)$$

Unfortunately the domain of definition of $\sigma$ is not bounded (it is the real line) and therefore the kernel $F_\sigma(x)$ is not Hilbert-Schmidt (it has infinite norm). As a consequence the spectrum of the $LL^*$ and $L^*L$ operators is continuous and no discrete approximation may be computed.

One therefore has to renounce the idea of generating a continuum of scales spanning the whole positive line. This is not a great loss: the range of scales of interest is never the entire real line. An interval of scales $(\sigma_1, \sigma_2)$, with $0 < \sigma_1 < \sigma_2 < \infty$, is a very realistic scenario; if one takes the human visual system as an example, the range of frequencies to which it is most sensitive goes from approximately 2 to 16 cycles per degree of visual angle, i.e. a range of 3 octaves. In this case the interval of scales is compact and one can apply the results of section 2 and calculate the SVD and therefore an $L^2$-optimal finite approximation.
In this section the optimal scheme for doing so is proposed. The problem of simultaneously steering and scaling a given kernel $F(x)$, generating a family $F_{(\sigma,\theta)}(x)$ which has a finite approximation, will be tackled. Previous non-optimal schemes are due to Perona [32, 33] and Simoncelli et al. [9, 12].

4.1 Polar-separable decomposition

Observe first that the functions $a_i$ defined in Eq. (7) are polar-separable. In fact $x$ may be written in polar coordinates as $x = \|x\|\, R_{\phi(x)}\, u$, where $u$ is some fixed unit vector (e.g. the versor of the first coordinate axis), $\phi(x)$ is the angle between $x$ and $u$, and $R_{\phi(x)}$ is a rotation by $\phi(x)$. Substituting the definition of $F_\theta$ in (7) we get:

$$a_i(x) = \sigma_i^{-1} \int_{S^1} F\!\left(\|x\|\, R_{\theta+\phi(x)}(u)\right) e^{-j 2\pi \nu_i \theta}\, d\theta
        = \sigma_i^{-1}\, e^{j 2\pi \nu_i \phi(x)} \int_{S^1} F\!\left(\|x\|\, R_\zeta(u)\right) e^{-j 2\pi \nu_i \zeta}\, d\zeta$$

so that (3) may also be written as:

$$F_\theta(x) = \sum_{i=1}^{N} \sigma_i\, c_i(\|x\|)\, e^{j 2\pi \nu_i (\theta + \phi(x))} \qquad (9)$$

$$c_i(\|x\|) = \sigma_i^{-1} \int_{S^1} F\!\left(\|x\|\, R_\zeta(u)\right) e^{-j 2\pi \nu_i \zeta}\, d\zeta \qquad (10)$$
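A numerical illustration of Eq. (10) (a sketch, not the author's code; the polar sampling grid, the angle-in-turns convention and the function names are assumptions): the radial profiles c_i(ρ) are, up to the factor σ_i^{-1}, the angular Fourier coefficients of the template kernel sampled on circles of radius ρ.

```python
import numpy as np

def radial_profiles(F, radii, nus, sigmas, n_angles=256):
    """Radial profiles c_i(rho) of Eq. (10): angular Fourier coefficients of
    the template F on circles of radius rho, divided by the weights sigma_i."""
    zetas = np.arange(n_angles) / n_angles            # angle in turns, [0, 1)
    cosz, sinz = np.cos(2 * np.pi * zetas), np.sin(2 * np.pi * zetas)
    C = np.zeros((len(nus), len(radii)), dtype=complex)
    for j, rho in enumerate(radii):
        ring = F(rho * cosz, rho * sinz)              # F sampled on the circle
        coeffs = np.fft.fft(ring) / n_angles          # angular Fourier coefficients
        C[:, j] = coeffs[np.asarray(nus) % n_angles] / sigmas
    return C                                          # C[i, j] = c_i(radii[j])
```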

4.2 Scaling is a 1D problem

The scaling operation only affects the radial components $c_i$ and does not affect the angular components. The problem of scaling the kernels $a_i$, and therefore $F_\theta$ through its decomposition, is then the problem of finding a finite (approximate) decomposition of continuously scaled versions of functions $c(\rho)$:

$$c_\sigma(\rho) = \sum_k s_k(\sigma)\, r_k(\rho) \qquad \sigma \in (\sigma_1, \sigma_2) \qquad (11)$$

If the scale interval $(\sigma_1, \sigma_2)$ and the function $c$ are such that the operator $L$ associated to $F$ is compact, then we can obtain the optimal finite decomposition via the singular value decomposition. The conditions for compactness of $L$ are easily met in the cases of practical importance: it is sufficient that the interval $(\sigma_1, \sigma_2)$ is bounded and that the norm of $c(\rho)$ is bounded ($\rho \in \mathbb{R}^+$).
Even if these conditions are met, the calculations usually cannot be performed analytically. One can employ a numerical routine (see e.g. [38]) and for each $c_i$ obtain an SVD expansion of the form:

[Figure 4 panels: 'Gaussian 3:1 - singular functions' (plots of the radial profiles, labelled polar-s. 0 through 8) and 'polar decomposition G3' (2D renderings of sfnc 0, sfnc 4 and sfnc 8).]

Fig. 4. (Right) The plots of $c_i(\rho)$, the radial part of the singular functions $a_i$ (cf. Eq. 9). The $\theta$ part is always a complex exponential. The original kernel is the same as in Fig. 3. (Left) The 0th, 4th and 8th components $c_0$, $c_4$ and $c_8$ represented in two dimensions.

$$c^i_\sigma(\rho) = \sum_k \gamma^i_k\, s^i_k(\sigma)\, r^i_k(\rho) \qquad (12)$$
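A possible numerical counterpart of Eq. (12) (a sketch, not the routine of [38]; the scaling convention c_σ(ρ) = c(ρ/σ) and the names are assumptions): sample the scaled profiles over a grid of scales in (σ1, σ2) and radii, and truncate the SVD of the resulting matrix.

```python
import numpy as np

def scale_decomposition(c, rhos, sigmas, rel_err=0.01):
    """Finite decomposition of scaled radial profiles, cf. Eq. (12).

    c      : callable, radial profile c(rho) of one singular function a_i
    rhos   : 1D array of radii
    sigmas : 1D array of scales sampled inside the compact interval (s1, s2)
    """
    # Rows indexed by scale, columns by radius: C[k, j] = c_{sigma_k}(rho_j)
    C = np.array([c(rhos / s) for s in sigmas])
    U, g, Vh = np.linalg.svd(C, full_matrices=False)
    total = np.sum(g ** 2)
    resid = total - np.cumsum(g ** 2)
    n = int(np.argmax(resid <= (rel_err ** 2) * total)) + 1
    gammas = g[:n]          # weights gamma_k of Eq. (12)
    s_funcs = U[:, :n]      # sampled scale components s_k(sigma)
    r_funcs = Vh[:n, :]     # sampled radial components r_k(rho)
    return gammas, s_funcs, r_funcs
```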

As discussed before (Theorem 4) one can calculate the approximation error from the sequence of the singular values $\gamma^i_k$. Finally, substituting (12) into (10) the scale-orientation expansion takes the form (see Fig. 6):

$$F_{\sigma,\theta}(x) = \sum_{i=1}^{N} \sigma_i\, b_i(\theta) \sum_{k=1}^{n_i} \gamma^i_k\, s^i_k(\sigma)\, r^i_k(\|x\|)\, e^{j 2\pi \nu_i \phi(x)} \qquad (13)$$

Filtering an image $I$ with a deformable kernel built this way proceeds as follows: first the image is filtered with the kernels $a^i_k(x) = e^{j 2\pi \nu_i \phi(x)}\, r^i_k(\|x\|)$, $i = 1, \ldots, N$, $k = 1, \ldots, n_i$; the outputs $I^i_k$ of this operation can then be combined as $I_{\sigma,\theta}(x) = \sum_{i=1}^{N} \sigma_i\, b_i(\theta) \sum_{k=1}^{n_i} \gamma^i_k\, s^i_k(\sigma)\, I^i_k(x)$ to yield the result.
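The two-stage procedure just described can be summarized by the following sketch (hypothetical helper names; the basis kernels and the combination weights are assumed to come from the decompositions above):

```python
import numpy as np
from scipy.signal import fftconvolve

def filter_deformable(image, basis_kernels, combine_weights, sigma, theta):
    """Two-stage filtering with a steerable-scalable kernel (cf. Eq. 13).

    basis_kernels   : dict {(i, k): complex 2D array, the kernel a^i_k(x)}
    combine_weights : callable (i, k, sigma, theta) -> complex weight, i.e.
                      sigma_i * b_i(theta) * gamma^i_k * s^i_k(sigma)
    """
    # Stage 1 (done once per image): convolve with every basis kernel.
    responses = {key: fftconvolve(image, kern, mode='same')
                 for key, kern in basis_kernels.items()}
    # Stage 2 (cheap, repeatable for any sigma and theta): linear combination.
    out = np.zeros_like(image, dtype=complex)
    for (i, k), resp in responses.items():
        out += combine_weights(i, k, sigma, theta) * resp
    return out
```

The expensive convolutions are thus performed only once; responses at any orientation and scale in the continuum are then obtained by re-weighting the stored outputs.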

4.3 Polar-separable decomposition, experimental results


An orientation-scale decomposition was performed on the kernel of Fig. 3 (second derivative of a Gaussian and its Hilbert transform, $\sigma_x : \sigma_y = 3 : 1$). The decomposition recalled in sec. 3 and shown in Fig. 3 was taken as a starting point. The corresponding functions $c_i(\rho)$ of Eq. (9) are shown in Fig. 4.
The interval of scales chosen was $(\sigma_1, \sigma_2)$ s.t. $\sigma_1 : \sigma_2 = 1 : 8$, an interval which is ample enough for a wide range of visual tasks.
The range of scales was discretized in 128 samples for computing numerically the singular value decomposition $(\gamma^i_k, s^i_k, r^i_k)$ of $c^i_\sigma(\rho)$. The computed weights $\gamma^i_k$ are plotted on a logarithmic scale in Fig. 5 (Top). The 'X' axis corresponds to the $k$ index; each curve is indexed by $i$, $i = 0, \ldots, 8$. One can see that for all the $c_i$ the error decreases exponentially at approximately the same rate. The components $r^i_k(\rho)$ and $s^i_k(\sigma)$, $i = 4$, $k = 0, \ldots, 3$ are shown in the two plots at the bottom of Fig. 5.

"
Gims~lm

,..o,-
, - ~,~..\
., _ - . ~ .

5-
3:1 - s.f. scale d e c o m p o s i t i o n

~lt~l~.
-- w e i g h t ~

-~-r-
_;,;~,i~--

- ; = ~ , h ~ '1" -
mm
(cos(2~.@)so'(p)) (cos(2~,,.0)~O))

i "N
le-04 -
I
0.00

decomposition weights 7~
I
~.00
:
I 0 O0
X

(cos(2,~,0)s~(p)) (cos(2~,0)s~'(p))
Gatmlun 3:1 -~ s i n g u l a r tuition nA - radius Gau~ian 3:1 - singular tuncUon nA ~ ~ale
Y x 10 . 3 Y x I~3
I I ~ale 4-0
300.~ - 'aldllul 4 !
~ 4 - 2 I00.~- / "f'... / -i~',~-"~....
2~,~ - ~ ~-,f_~" --'
200.00-
I~@.00 -

IOOJDO-
50.00 -

~'|
-lfl@.00 -
!i/. ,'~ :
-150.00 -

-900.00 -

-urn.00 - !!
3
-300~00 - ~ -2~.~ - ~
-3S0.00 - I I I- X i i x
0.00 5.00 10.00 O.SO 1.00

sfnc 4.0 - radius sfnc 4.0 - scale

Fig. 5. Scale-decomposition of the radial component of the functions $a_i$. The interval of scales is $\sigma \in (0.125, 1.00)$. See also Fig. 6. (Top-left) The weights $\gamma^i_k$ of each polar function's decomposition ($i = 0, \ldots, 8$, $k$ along the x axis). The decay of the weights is exponential in $k$; 5 to 8 components are needed to achieve 1% error (e.g. 5 for the 0th, 7 for the 4th and 8 for the 8th shown in Fig. 4). (Bottom) The first four radial (left) and scale (right) components of the 5th singular function: $r^4_k(\rho)$ and $s^4_k(\sigma)$, $k = 0, \ldots, 3$ (see Eq. (12)). (Top-right) The real parts of the first four scale-components of the 5th singular function $a_4$: $\cos(2\pi\nu_4\theta)\, s^4_k(\rho)$ with $k = 0, \ldots, 3$ (see Eq. (13)).

In Fig. 6 reconstructions of the kernel based on a 1% error decomposition are shown for various scales and angles. A maximum of 1% error was imposed on the original steerable decomposition, and again on the scale decomposition of each single $a_i$. The measured error was 2.5%, independently of angle and scale. The total number of filters required to implement a 3-octave, 1% (nominal, 2.5% real) approximation error of the 3:1 Gaussian pair is 16 (rotation) times 8 (scale) = 128. If 10% approximation error is allowed the number of filters decreases by approximately a factor of 4, to 32.

Fig. 6. The kernel at different scales and orientations: the scales are (left to right) 0.125, 0.33, 0.77, 1.00. The orientations are (left to right) 30°, 66°, 122°, 155°. The kernels shown here were obtained from the scale-angle decomposition shown in the previous figures.

5 Kernel symmetries and junction analysis


The Hilbert-pair kernels used by [27, 11, 36] for edge detection have a number of interesting symmetries that may be exploited to reduce the computational and storage costs by a factor of two. Moreover, these symmetries may be used to reconstruct the response of two associated kernels, endstopped and one-sided, that are useful for the analysis of edge junctions. The kernels of Fig. 7 are used here as specific examples.
An illustration of the use of these kernels for the analysis of edges and junctions is proposed in Fig. 8, where response maxima w.r.t. orientation $\theta$ as in Fig. 2 are shown for a different image, a synthetic T-junction (Fig. 7, right). The kernels employed for this demonstration have shape as in Fig. 7 and are derived from an elongated Gaussian function of variance $\sigma_y = 1.2$ pixels and $\sigma_x : \sigma_y = 3 : 1$.
From equation (10) one can see that the coefficients $c_i(\rho)$ (where $\rho = \|x\|$) are, for each value of $\rho$, the Fourier coefficients of $F_{\theta=0}(x)$ along a circular path of radius $\rho$ and

Fig. 7. Three complex-valued kernels used in edge and junction analysis (the real parts are shown above and imaginary parts below). The first one (2-sided) is 'tuned' to edges that are combinations of steps and lines (see [36]); it is the same as in Fig. 3 top left, shown at an orientation of 0°. The second kernel (endstopped) is tuned to edge endings and 'crisscross' junctions [13, 16, 39]: it is equivalent to a 1st derivative of the 2-sided kernel along its axis direction. The third one (1-sided) may be used to analyze arbitrary junctions. All three kernels may be obtained at any orientation by combining suitably the 'basis' kernels $a_i$ shown in Fig. 3.

center in the origin. The circular path begins and ends at the positive side of the X axis.
Consider now such a path for the 2-sided kernel of Fig. 7: observe that for every $\rho$ we
have at least two symmetries.
For the real part:

(E) the function is even-symmetric,


( / / + ) a translation of the function by lr returns the same function (i.e. it is r-periodic).

For the imaginary part:

(O) the function is odd-symmetric,


( H - ) a translation of the function by Ir returns the function multiplied by - 1 .

These symmetries imply corresponding properties in the discrete Fourier transform (DFT) of the functions: symmetry (E) implies a DFT with zero coefficients for the sinusoidal components; symmetry (O) a DFT with zero cosinusoidal components; symmetry (Π+) implies that the odd-frequency components are zero; symmetry (Π−) that the even-frequency components are zero.
As a consequence, the DFT of the real part of the 2-sided kernel is only made up of even-frequency cosine components, while the imaginary part is only made up of odd-frequency sine components. If the complex-exponential, rather than the sine-cosine, notation is used, as in Eqs. (7) and (10), this implies that the odd-frequency coefficients only depend on the imaginary part of the kernel, while the even-frequency components depend on the real part. The negative-frequency components $\sigma_i a_i$ are equal to the positive-frequency components for even frequencies and to their opposite for odd frequencies. The negative-frequency components therefore do not need to be computed and stored, thus saving a factor of 2 in storage and computations. Equation (3) may therefore be re-written as follows (for convenience of notation suppose that the number of components $n$ is odd, $n = 2b+1$, and that the $n$ frequencies $\nu_i$ involved are the ones from $-b$ to $b$):
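A small numerical check of these DFT properties (not in the paper; the sample functions are hypothetical), with the angular variable measured in turns and the half-turn shift implemented as a roll by half the number of samples:

```python
import numpy as np

n = 256
t = np.arange(n) / n                       # angle along the circular path, in turns
f_even = np.cos(2 * 2 * np.pi * t)         # even-symmetric and half-turn periodic (E, Pi+)
f_odd  = np.sin(3 * 2 * np.pi * t)         # odd-symmetric, sign change on a half-turn (O, Pi-)

def dft(f):
    return np.fft.fft(f) / len(f)

# (E) + (Pi+): only even-frequency (cosine) coefficients survive.
print(np.allclose(dft(f_even)[1::2], 0))   # odd-frequency bins vanish -> True
# (O) + (Pi-): only odd-frequency (sine) coefficients survive.
print(np.allclose(dft(f_odd)[0::2], 0))    # even-frequency bins vanish -> True
# A half-turn shift multiplies the component at frequency nu by (-1)**nu.
shifted = np.roll(f_odd, n // 2)
print(np.allclose(dft(shifted), dft(f_odd) * (-1.0) ** np.arange(n)))  # True
```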

[Figure 8 panels: A - 2-sided; B - 1-sided; C - endstop; D - endstop along 1-sided maxima. Each panel shows needle plots of the orientation maxima over a 16x16 pixel neighbourhood of the T-junction.]

Fig. 8. Demonstration of the use of the kernels shown in Fig. 7 for the analysis of orientation
and position of edges and junctions. For each pixel in s 16z16 neighbourhood of the T-junction
in Fig. 7 (right) the local maxima in orientation of the modulus of corresponding filter responses
are shown. A - 2-s|ded: (equivalent to Fig. 2 top-right) Within a distance of approximately
2 - 2.5~v from an isolated edge this kernel gives an accurate estimate of edge orientation. Near
the junction there is a distortion in the estimate of orientation; notice that the needles indicating
the orientation of the horizontal edge bend clockwise by approximately 15~ within a distance of
approx. 1.5ux from the junction. The periodicity of the maxima is 180~ making it difficult to
take a local decision about the identity of the junction (L, T, X). B - 1-slded: Notice the good
estimate of orientation near the junction; from the disposition of the local maxima it is possible
to identify the junction as a T-junction. The estimate of edge orientation near an isolated edge is
worse than with the 2-sided kernel since the 1-sided kernel has a 360~ symmetry. C - e n d s t o p :
The response along an 'isolated' edge (far from the junction) is null along the orientation of
the edge, while the response in the region of the junction has maxima along the directions of
the intervening edges. D - e n d s t o p along 1-sided m a x i m a : Response of the endstop kernel
along the orientations of maximal response of the 1-sided kernel. Notice that there is significant
response only in the region of the junction. The junction may be localized at the position with
maximal total endstop response.
16

n b b
F(t-2s)a = = = + 04)
i=1 u=-b u=0

where the indexing is now by frequency: a~ and a~ denote the ai and ai associated to
the frequency y = vl, and av=0 = 89 i =arg(ui = 0).
Consider now the endstopped kernel (Fig. 7, middle): the same symmetries are found
in a different combination: the real part has symmetries (E) and ( / / - ) while the imaginary
part has symmeries (O) and (/7+). A kernel of this form may be clearly obtained from
the coefficients of the 2-sided kernel exchanging the basis finctions: sinusoids for the even
frequencies and cosinusoids for the odd frequencies (equivalent to taking the Hilbert
transform of the 2-sided kernel along the circular concentric paths):
b
= + (15)
b'----0

The endstopped kernel shown in Fig. 7 has been obtained following this procedure from
the decomposition (ai, al, bi) of the 2-sided kernel in the same figure.
A kernel of the form 1-sided can now be obtained by summing the 2-sided and end-
stopped kernels previously constructed. It is the one shown in Fig. 7, right side. The
corresponding reconstruction equation is:
b

(ls)ev'J = (16)
Y~0

6 Conclusions

A technique has been presented for implementing families of deformable kernels for early
vision applications. A given family of kernels obtained by deforming continuously a tem-
plate kernel is approximated by interpolating a finite discrete set of kernels. The technique
may be applied if and only if the family of kernels involved satisfy a compactness con-
dition. This improves upon previous work by Freeman and Adelson on steerable filters
and Perona and Simoncelli et al. on scalable filters in that (a) it is formulated with max-
imum generality to the case of any compact deformation, or, equivalently any compact
family of kernels, and (b) it provides a design technique which is guaranteed to find the
most parsimonious discrete approximation. It has also been shown how to build edge-
terminator- and junction-tuned kernels out of a same family of 'basis' function.
Unlike common techniques used in early vision where the set of orientations is dis-
cretized, here the kernel and the response of the corresponding filter may be computed
in a continuum for any value of the deformation parameters, with no anisotropies. The
approximation error is computable a priori and it is constant with respect to the defor-
mation parameter. This allows one, for example, to recover edges with great spatial and
angular accuracy.

7 Acknowledgements
I have had useful conversations concerning this work with Ted Adelson, Stefano Casadei, Charles
Desoer, David Donoho, Peter Falb, Bill Freeman, Fedefico Gizosi, Takis Konst~ntopoulos, Paul
17

Kube, Olaf Kfibler, Jitendra Malik, Stephane Mallat, Sanjoy Mitter, Richard Murray, Mas-
simo Porrati. Federico Girosi and Peter Falb helped with references to the functional analysis
textbooks. The simulations have been carried out using Paul Kube's "viz" image-manipulation
package. The images have been printed with software provided by Eero Simoncelli. Some of the
simulations have been run on a workstation generously made available by prof. Canali of the
Universit/t di Padova. Part of this work was conducted while at the M.I.T.. I am very grateful
to Sanjoy Mitter and the staff of LIDS for their warm year-long hospitality.

References
1. ADELSON, E., AND BERGEN, J. Spatiotemporal energy models for the perception of motion.
J. Opt. Soc. Am. ~, 2 (1985), 284-299.
2. ADELSON, E., AND BERGEN, J. Computational models of visual processing. M. Landy and
J. Movshon eds. MIT press, 1991, ch. "The plenoptic function and the elements of early
vision". Also appeared as MIT-MediaLab-TR148. September 1990.
3. BINFORD, T. Inferring surfaces from images. Artificial Intelligence 17 (1981), 205-244.
4. BOVIK, A., CLARK, M., AND GEISLER, W. Multichannel texture analysis using localized
spatial filters. IEEE trans. Pattern Anal. Mach. Intell. 1P, l (1990), 55-73.
5. BURT, P., AND ADELSON, E. The laplacian algorithm as a compact image code. IEEE
Transactions on Communications 31 (1983), 532-540.
6. CANNY, J. A computational approach to edge detection. IEEE trans. Pattern Anal. Mach.
Inteli. 8 (1986), 679-698.
7. CHOQUET, G. Lectures on analysis, vol. I. W. A. Benjamin Inc., New York, 1969.
8. DIEUDONNE, J. Foundations of modern analysis. Academic Press, New York, 1969.
9. E. SIMONCELLI, W. FREEMAN, E. A., AND HEEGER, D. Shiftable multi-scale transforms.
Tech. Rep. 161, MIT-Media Lab, 1991.
10. FOGEL, I., AND SAGI, D. Cabot filters as texture discriminators. Biol. Cybern. 61 (1989),
103-113.
11. FREEMAN, W., AND ADELSON, E. Steerable filters for early vision, image analysis and
wavelet decomposition. In Third International Conference on Computer Vision (1990),
IEEE Computer Society, pp. 406-415.
12. FREEMAN, W., AND ADELSON, E. The design and use of steerable filters for image analy-
sis, enhancement and multi-scale representation. IEEE trans. Pattern Anal. Mach. Intell.
(1991).
13. FREEMAN,W., AND ADELSON, E. Junction detection and classification. Invest. Ophtalmol.
Vis. Sci. (Supplement) 3~, 4 (1991), 1279.
14. GRANLVND, G. H. In seaxch of a general picture processing operator. Computer Graphics
and Image Processing 8 (1978), 155-173.
15. HEEGER, D. Optical flow from spatiotemporal filters. In Proceedings o] the First Interna-
tional Conference on Computer Vision (1987), pp. 181-190.
16. HEITGER, F., ROSENTHALER, L., VON DER HEYDT, P~., FETERHAN$, E., AND KUBLER, O.
Simulation of neural contour mechanismn: From single to end-stopped cells. Tech. Rep.
126, IKT/Image science lab ETH-Zuerich, 1991.
17. HORN, B. The binford-horn linefinder. Tech. rep., MIT AI Lab. Memo 285, 1971.
18. HUEEL, D., AND Vr T. Receptive fields of single neurones in the cat's striate cortex.
J. Physiol. (Land.) I48 (1959), 574-591.
19. HUBEL, D., AND ~rlESEL, T. Receptive fields, binocular interaction and functional archi-
tecture in the cat's visual cortex. J. Physiol. (Lond.) 160 (1962), 106-154.
20. JONES, D., AND MALIK, J. Computational stereopsis-beyond zero-crossings. Invest. Oph.
taimol. Via. Sci. (Supplement) 31, 4 (1990), 529.
21. JONES, D., AND MALIK, J. Using orientation and spatial frequency disparities to recover
3d surface shape - a computational model. Invest. Ophtalmol. Vis. Sci. (Supplement) 3~,
4 (1991), 710.
18

22. KASS, M. Computing visual correspondence. In Proceedings: Image Understanding Work-


shop (McLean, Virginia, June 1983), Science Applications, Inc, pp. 54-60.
23. KNUTTSON,H., AND GrtANLUND,G. H. Texture analysis using two-dimensional quadrature
filters. In Workshop on Computer Architecture ]or Pattern Analysis ans Image Database
Management (1983), IEEE Computer Society, pp. 206-213.
24. KOENDERINK, J., AND VAN DOORN, A. Representation of local geometry in the visual
system. Biol. Cybern. 55 (1987), 367-375.
25. MALIK, J., AND GIGUS, Z. A model for curvilinear segregation. Invest. Ophtalmol. Vis.
Sci. (Supplement) 32, 4 (1991), 715.
26. ~IALIK, 3., AND PERONA, P. Preattentive texture discrimination with early vision mecha-
nisms. Journal of the Optical Society o] America - A 7, 5 (1990), 923-932.
27. MORRONE, M., AND BURR, D. Feature detection in human vision: a phase dependent
energy model. Proc. R. Soc. Lond. B P35 (1988), 221-245.
28. MORRONE, M., AND BURR, D. Robots and biological systems. Academic Press, eds. P.
Dario and G. Sandini, 1990, ch. : "A model of human feature detection based on matched
filters".
29. MORRONE, M., BURR, D., ROSS, 3., AND OWENS, R. Mach bands depend on spatial
phase. Nature, 324 (1986), 250-253.
30. MORRONE, M., AND OWENS, R. Feature detection from local energy. Pattern Recognition
Letters 6 (1987), 303-313.
31. PARENT, P., AND ZUCKER, S. Trace inference, curvature consistency, and curve detection.
IEEE trans. Pattern Anal. Mach. Intell. 11, 8 (1989), 823-839.
32. PERONA, P. Finite representation of deformable functions. Tech. Rep. 90-034, International
Computer Science Institute, 1947 Center st., Berkeley CA 94704, 1990.
33. PERONA, P. Deformable kernels for early vision. IEEE Con]erence on Computer Vision
and Pattern Recognition (June 1991), 222-227.
34. PERONA, P. Deformable kernels for early vision. Tech. Rep. 2039, LIDS-MIT, October
1991. Submitted to IEEE PAMI.
35. PERONA, P., AND MALIK, J. Detecting and localizing edges composed of steps, peaks and
roofs. Tech. Rep. UCB/CSD 90/590, Computer Science Division (EECS), U.C.Berkeley,
1990.
36. PERONA, P., AND MALIK, 3. Detecting and localizing edges composed of steps, peaks and
roofs. In Proceedings o] the Third International Con]erence of Computer Vision (Osaka,
1990), IEEE Computer Society, pp. 52-57.
37. PINKUS, A. n-Widths in Approzimation Theory. Springer Verlag, 1985.
38. PRESS, W., FLANNERY,B., TEUKOLSKY, S., AND VETTERLING,W. Numerical Recipes in
C. Cambridge University Press, 1988.
39. ROSENTHALER, L., HEITGER, F., VON DER HEYDT, R., AND KUBLER, O. Detecting general
edges and keypoints. Tech. Rep. 130, IKT/Image science lab ETH-Zuerich, 1991.
40. TURNER, M. Texture discrimination by gabor functions. Biol. Cybern. 55 (1986), 71-82.
41. ZHONG, S., AND MALLAT, S. Compact image representation from multiscale edges. In
Proceedings o] the Third International Conference o] Computer Vision (Osaka, 1990), IEEE
Computer Society.

This article was processed using the IbTEX macro package with ECCV92 style
Families of Tuned Scale-Space Kernels *

L.M.J. Florack 1, B.M. ter Haar Romeny 1, J.J. Koenderink 2, M.A. Viergever 1
1 3D Computer Vision Research Group, University Hospital, Room E.02.222,
Heidelberglaan 100, 3584 CX Utrecht, The Netherlands
2 Dept. of Medical and Physiological Physics, University of Utrecht,
Princetonplein 5, 3584 CC Utrecht, The Netherlands

Abstract.
We propose a formalism for deriving parametrised ensembles of local
neighbourhood operators on the basis of a complete family of scale-space
kernels, which are apt for the measurement of a specific physical observable.
The parameters are introduced in order to associate a continuum of a priori
equivalent kernels with each scale-space kernel, each of which is tuned to a
particular parameter value.
Ensemble averages, or other functional operations in parameter space,
may provide robust information about the physical observable of interest.
The approach gives a possible handle on incorporating multi-valuedness
(transparancy) and visual coherence into a single model.
We consider the case of velocity tuning to illustrate the method. The
emphasis, however, is on the formalism, which is more generally applicable.

1 Introduction
The problem of finding a robust operational scheme for determining an image's differ-
entiM structure is intimately related to the concept of resolution or scale. The concept
of resolution has been given a well-defined meaning by the introduction of a scale-space.
This is a 1-parameter family of images, derived from a given image by convolution with
a gaussian kernel, which defines a spatial aperture for measurements carried out on the
image and thus sets the "inner scale" (i.e. inverse resolution).
The gaussian emerges as the unique smooth solution from the requirement of absence
of spurious detail, as well as some additional constraints [1, 2, 3]. Alternatively, it is
uniquely fixed by the requirement of linearity and a set of basic symmetry assumptions,
i.c. translation, rotation and scale invariance [4, 5]. These symmetries express the absence
of a priori knowledge concerning the spatial location, orientation and scale of image
features that might be of interest.
Although several fundamental problems are yet to be solved, the crucial role of res-
olution in any front end vision system cannot be ignored. Indeed, scale-space theory is
gaining more and more appreciation in computer vision and image analysis. Neurophys-
iological evidence obtained from mammalian striate cortex also bears witness of its vital
importance [6]. There is also psychophysical support for the gaussian model [7].
Once the role of scale in a physical observable has been appreciated and a smooth
scale-space kernel has been established, the problem of finding derivatives that depend
* This work was performed as part of the 3D Computer Vision Research Program, sup-
ported by the Dutch Ministry of Economic Affairs through a SPIN grant, and by
the companies Agfa-Gevaert, Philips Medical Systems and KEMA. We thank J. Biota,
M. van Eert, R. van Maarseveen and A. Salden for their stimulating discussions and software
implementation.
20

continuously on the image (i.e. are well-posed in the sense of Hadamard), has a trivial
solution [4, 5, 8, 9]: just note that ifD is any linear differential operator, f is a given image
and gr is the scale-space kernel on a sc~e ~ (fairly within the available scale-range), then
the convolution f . Dgr precisely yields the derivative of f on scale a, i.e. D ( f * ga). The
1-parameter family containing the scaled gaussian and its linear derivatives constitutes
a complete family of scaled differential operators or local neighbourhood operators [10].
Despite its completeness, however, the gaussian family is not always the most conve-
nient one. For example, local optic flow in a time varying image can be obtained directly
from the output of a gaussian family of space-time filters, at least in principle [11], but it
may be more convenient to first tune these filters to the physical parameter of interest,
i.c. a velocity vector field. This way, the filters have a more direct relation to the quantity
one wishes to extract.
To illustrate the formalism we will present an example of filter tuning, i.c. velocity
tuning [12, 13, 14, 15]. The emphasis, however, is on the formalism based on (Lie-group)
symmetries, expressing the a priori equivalence of parameter values, i.e. velocities. The
formalism is readily applicable to a more general class of physical tuning parameters, e.g.
frequency, stereo disparity, etc.

2 Filter Tuning

Fourier's powerful method of dimensional analysis has received a solid mathematical


formulation in the So-called Pi theorem. This theorem, together with the introduction
of the physical tuning parameter of interest and the symmetry assumptions mentioned
in the introduction, provides the main ingredients for the derivation of a family of lo-
cal neighbourhood operators tuned to that observable. Basically, this theorem states
that, given a physical relation f ( x l , . . . , a:/)) = 0, there exists an equivalent relation
] ( 7 r l , . . . , ~ro-R) = 0 in terms of dimensionless variables, which can be found by solving
a linear equation. For more details, the reader is referred to [16].
In order to illustrate the tuning procedure, we will now turn to the case of velocity
tuning. Our starting point will be a complete family of scaled spacetime differential
operators Fm...u, ( X ):

D e f i n i t i o n I ( T h e G a u s s i a n S p a c e t i m e F a m i l y ) . The gaussian spacetime family is


defined as:
(Ftt,,.,#~ (X__)d:~fVa" f--(D+I)~ot~l...~, exp {-X_.. X__}}nvr 0 (1)

in which X is a ( D + 1)-vector in spacetime, whose components are given by X0 = t/2~2~r2r 2


and Xi = xl/2v/2-~2 for i = 1 . . . D and in which am...v, denotes the n-th order differential
operator O" /OX m ... OXo,.

We have the following relationship between the kernels/'m...~," (X), given in dimen-
sionless, scale-invariant coordinates, and the scMe-parametrised kernels Gm...~,,(x, t; o', r):

~%//~T2ra ~V~ff2n-mel~,. .IJ. (x, t; o', "f) dxdt deafF#I. .Ij. (X) dX__ (2)

in which m is the number of zero-valued indices among Pl .../~n. Although the temporal
part of (1) obviously violates temporal causality, it can be considered as a limiting case
of a causal family in some precise sense [17]. Sofar for the basic, non-tuned gaussian
spaeetime family.
21

The tuning parameter of interest will be a spacetime vector ~_= (~0; ~). Apart from
this, the following variables are relevant: the scale parameters ~ and r, the frequency
variables ~ and o~0 for addressing the fourier domain, and the variables s and s rep-
resenting the scaled and original input image values in fourier space. According to the
Pi theorem, we may replace these by: A de_f ~//'~'0, ~ = (~'~0; ~e~) de_f (~0 ~ - ' ~ ; r ~ " ~ ) '
-~ = (~0; ~') de__f(~0/2X/~;~/2Vt~-~). Moreover we will use the conjugate, dimension-
less variables X = (X0; X) d~__f( t / ~ ; x/2X/~). Their dependency is expressed by
A = g (/2, ~), in which g is some unknown, scalar function.
In the ~, -~ 0 limit, the ~'--tuned kernels should converge to (1). Reversely, by applying
a spacetime tuning operation to this underlying family, we may obtain a complete family
of spacetime-tuned kernels and, more specifically, of velocity-tuned kernels:

Definition 2 (Spaeetime Tuning). A spacetime quantity Q (X__,~) is calledspacetime-


tuned with respect to a given point ~ if V ~ ~ the following symmetry holds: Q (X, ~) =
Q (x__- __-%__---__--') de=,T:_-,Q (x__,__--)
Applying the operator T=- on a given operand, we obtain a spacetime-tuned version of
it: TgQ (X__)= Q (X - ~). In this way the gaussian family can be tuned so as to yield:

Result 3 (Family of Spacetime Tuned Kernels). A complete family of spacetime-


tuned kernels is given by:

(3)

The construction of velocity-tuned kernels from the basic gaussian family is a special
case of spacetime tuning, viz. one in which the tuning point is the result of a galilean
boost applied to the origin of a cartesian coordinate frame. This is a transvection in
spacetime, i.e. a linear, hypervolume preserving, unit eigenvalue transformation of the
following type:

D e f i n i t i o n 4 ( G a l i l e a n B o o s t ) . A galilean boost T~ is a spacetime transvection of the


type T~: (60;~i) ~ (60 + "rk6k ; ~i), in which the ~ constitute an orthonormal basis in
spacetime and "r is an arbitrary D-vector.

T~ transforms static stimuli (straight curves of "events" parallel to the T-axis) into
dynamic stimuli, moving with a constant velocity "/. Note that TO is the identity and
T~-1 = T_~ is the inverse boost. Since a galilean boost is just a special type of spacetime
tuning, we immediately arrive at the following result:

R e s u l t 5 ( F a m i l y o f V e l o c i t y - T u n e d K e r n e l s ) . A complete family of velocity-tuned


kernels is obtained by applying a galilean boost T~ = Tx_(-O, with ~(-/) def = to
the gaussian spacetime family:

(F/~ (X;'y) def Fp (X ~.~(,.~)))oo


t --.~n ~ 1.../a~ , n.--O (4)

The relevance of using a (seemingly redundant) parametrised ensemble of local oper-


ators is best illustrated by means of an example.
22

Example 1. Consider a point stimulus L0(x, t) = AS(x - et), moving at fixed velocity e.
According to (4), the lowest order velocity-tuned filter is given in by:

G(x,t; a, r, v) =
1 1 { (x-vt).(x-vt) t2 ~
(5)
~ 2D 2V/~~ exp
2V/~"~ 2cr2 2r2

def
in which the velocity v is related to the parameter vector "7 in (4) by v = ~ / r .
Convolving the above input with this kernel yields the following result:

Lv,,,r(x,t)- A exp{ ( x - ct)2 } (6)


with A ~ f lie-vii
IJ and o'~ d=cr~x/T+ 32. This shows an ensemble of fuzzy blobs centered
at the expecte~ location of the stimulus, the most pronounced member of which is indeed
the one for which the tuning velocity coincides with the stimulus velocity (A = 0).
However, it is the velocity ensemble rather than any individual member that carries the
information about stimulus velocity and that allows us to extract this in a way that is
both robust and independent of contrast A. In order to appreciate the kind of ensemble
operations one can think of, consider for example the average < v >. A straightforward
symmetry argument shows that this equals:

< v > (x,t)def f d v v L v , a , r ( X , t )


= fdvLv,~,~(x,t) =e
V(x,t) (z)
(just note that < v - c > vanishes identically). In our example, this average turns out to
be a global constant, which, of course, is due to the uniform motion of the input stimulus.
A similar, though non-trivial ensemble integral can be evaluated to obtain the variance
Av = X/< ]Iv -- c IP >. It is important to realise, however, that by proceeding in this way
we enforce a singlue-valued "cross-section" of the observable, i.e. velocity field. It is clear
that this is by no means necessary: one could think of segmenting parameter space into
subdomains and applying similar operations on each subdomain independently, leading
to a multi-valued result (transparency). This is a conceivable thing to do especially if such
a segmentation is apparent, as in the case of a superposition of two point stimuli for two
clearly distinct values of c. In that case we can use (7) for each component of transparent
motion by restricting the integrations to the respective velocity segments. Although an
N-valued representation is certainly plausible in the limiting case when a segmentation
of parameter space into N segments is "obvious", it remains an intriguing problem of
how to deal with transient regions, occuring e.g. when there is an apparent jump in the
velocity field of the input stimulus. Clearly, our representation entails all these cases.
The problem as such is one of pattern extraction from the output of the tuned local
neighbourhood operators in the product space of locations and tuning parameters [18].

3 Conclusion and Discussion

In this paper we have shown how the basic family of local scale-space operators may
give rise to a gamut of other families, each of which is characterised, apart from scale,
by some physical tuning parameter. We have presented a formalism for generating such
families from the underlying gaussian scale-space family in a way that makes the a priori
equivalence of all tuning parameter values manifest. We have illustrated the formalism
23

by an example of velocity tuning, incorporating all possible galilean boosts so as to yield


ensembles of velocity-tuned local scale-space operators (of. Reichardt operators).
We have argued t h a t ensemble averages, or other functional operations in p a r a m e t e r
space, rather than the o u t p u t of individual kernels as such, m a y provide a robust, oper-
ational m e t h o d for extracting valuable information a b o u t the observable of interest. T h e
appealing aspect of this m e t h o d is t h a t it does not aim for a single-valued "expectation
value" for the observable right from the beginning and t h a t single-valuedness is t r e a t e d
on equal foot with multi-valuedness.
T h e formalism should be readily applicable to other p a r a m e t e r s of physical interest,
such as frequency, stereo disparity, etc., yielding ensembles of frequency- or disparity-
tuned local neighbourhood operators, etc.

References

1. A. Witkin, "Scale space filtering," in Proc. International Joint Conference on Artificial


Intelligence, (Karlsruhe, W. Germany), pp. 1019-1023, 1983.
2. J. J. Koenderink, "The structure of images," Biol. Cybern., vol. 50, pp. 363-370, 1984.
3. T. Lindeberg, "Scale-space for discrete signals," 1EEE Trans. Pattern Analysis and Ma-
chine Intelligence, vol. 12, no. 3, pp. 234-245, 1990.
4. B. M. ter Haar Romeny, L. M. J. Florack, J. J. Koenderink, and M. A. Viergever, "Scale-
space: Its natural operators and differential invariants," in International Conf. on Informa-
tion Processing in Medical Imaging, vol. 511 of Lecture Notes in Computer Science, (Berlin),
pp. 239-255, Springer-Verlag, July 1991.
5. L. Florack, B. ter Haar Romeny, J. Koenderink, and M. Viergever, "Scale-space." Submit-
ted to IEEE PAMI, November 1991.
6. R. A. Young, "The gaussian derivative model for machine vision: I. retinal mechanisms,"
Spatial Vision, vol. 2, no. 4, pp. 273-293, 1987.
7. P. Bijl, Aspects of Visual Contrast Detection. PhD thesis, University of Utrecht, University
of Utrecht, Dept. of Med. Phys., Princetonplein 5, Utrecht, the Netherlands, May 1991.
8. J. J. Koenderink and A. J. van Doom, "Representation of local geometry in the visual
system," Biol. Cybern., vol. 55, pp. 367-375, 1987.
9. J. J. Koenderink and A. J. Van Doorn, "Operational significance of receptive field assem-
blies," Biol. Cybern., vol. 58, pp. 163-171, 1988.
10. J. J. Koenderink and A. J. van Doom, "Receptive field families," Biol. Cybern., vol. 63,
pp. 291-298, 1990.
11. P. Werkhoven, Visual Perception of Successive Order. PhD thesis, University of Utrecht,
University of Utrecht, Dept. of Med. Phys., Princetonplein 5, Utrecht, the Netherlands,
May 1990.
12. D. J. Heeger, "Model for the extraction of image flow," Journal of the Optical Society o]
America.A, vol. 4, no. 8, pp. 1455-1471, 1987.
13. D. Heeger, "Optical flow using spatiotemporal filters," International Journal of Computer
Vision, vol. 1, pp. 279-302, 1988.
14. E. H. Adelson and J. R. Bergen, "Spatiotemporal energy models for the perception of
motion," Journal o] the Optical Society of America-A, vol. 2, no. 2, pp. 284-299, 1985.
15. W. E. Reichardt and R. W. SchSgl, "A two dimensional field theory for motion computa-
tion," Biol. Cybern., vol. 60, pp. 23-35, 1988.
16. P. J. Olver, Applications of Lie Groups to Differential Equations, vol. 107 of Graduate Texts
in Mathematics. Springer-Verlag, 1986.
17. J. J. Koenderink, "Scale-time," Biol. Cybern., vol. 58, pp. 159-162, 1988.
18. A. J. Noest and J. J. Koenderink, "Visual coherence despite transparency or partial occlu-
sion," Perception, vol. 19, p. 384, 1990. Abstract of poster presented at the ECVP 1990,
Paris.
Contour Extraction by Mixture Density Description
Obtained from Region Clustering

Minoru E T O H 1, Yoshiaki S H I R A I 2 and Minoru A S A D A 2

i Central Research Laboratories, Matsushita Electric Ind., Moriguchi, Osaka 570, Japan
2 Mech. Eng. for Computer-Controlled Machinery, Osaka University, Suita, Osaka 565, Japan

Abstract. This paper describes a contour extraction scheme which re-


fines a roughly estimated initial contour to outline a precise object bound-
ary. In our approach, mixture density descriptions, which are parametric
descriptions of decomposed sub-regions, are obtained from region cluster-
ing. Using these descriptions, likelihoods that a pixel belongs to the object
and its background are evaluated. Unlike other active contour extraction
schemes, region- and edge-based estimation schemes are integrated into an
energy minimization process using log-likelihood functions based on the
mixture density descriptions. Owing to the integration, the active contour
locates itself precisely to the object boundary for complex background im-
ages. Moreover, C 1 discontinuity of the contour is realized as changes of the
object sub-regions' boundaries. The experiments show these advantages.

1 Introduction
The objective of this work is to extract an object contour from a given initial rough
estimate. Contour extraction is a basic operation in computer vision, and has many
applications such as cut/paste image processing of authoring systems, medical image
processing, aerial image analyses and so on.
There have been several works which takes an energy minimization approach to the
contour extraction(e.g.[i][2]). An active contour model (ACM) proposed by Kass [3] has
demonstrated an interactivity with a higher visual process for shape corrections. It results
in smooth and closed contour through energy minimization. The ACM, however, has the
following problems:

C o n t r o l : It is common for typical ACMs to look for maxima in intensity gradient


magnitude. In complex images, however, neighboring and stronger edges may trap
the contour into a false, unexpected boundary. Moreover, if an initial contour is placed
too far from an object boundary, or if there is not sufficient gradient magnitude, the
ACM will shrink into a convex closed curve even if the object is concave. In order
to avoid these cases, a spatially smoothed edge representation[3], a distance-to-edge
map[4][5], successive lengthening of an active contour[6] and an internal pressure
force[5] have been introduced. Unfortunately, even if those techniques were applied,
the edge-based active contours might be trapped by unexpected edges.
Scaling : Many optimization methods including ACMs suffer from a scaling problem[7].
Even if we can set parameters of an optimization method through experiments for a
group of images, the validity of the parameters is not assured for images with different
contrasts.
D i s c o n t i n u i t y : The original ACM requires the object contour to be C 1 continuous. This
requirement implies the approximation errors for C 1 discontinuous boundaries which
often appear due to occlusion. A popular approach to the discontinuity control is that
25

the discontinuities are just set at the high curvature points[4]. Generally, however,
it is difficult to interpret the high curvature points as corners or occluding points
automatically without knowledge about the object.

Taking into account the interactivity for correction, we adopt an ACM for the object
contour extraction. In light of the above problems, we will focus not on the ACM itself but
on an underlying image structure which guides an active contour precisely to the object
boundary against the obstacles. In this paper we propose a miztnre density description
as the underlying image structure. Mizture density has been noted to give mathematical
basis for clustering in terms of distribution characteristics[8][9][10]. It refers to a prob-
ability density for patterns which are drawn from a population composed of different
classes. We introduce this notion to a low level image representation. The features of our
approach are:

- Log-likelihoods that a pixel belongs to an object(inside) and its background(outside)


regions enforce the ACM to converge to their equilibrium. On the other hand, the
log-likelihood gradients enforce the ACM to locate the contour on the precise ob-
ject boundary. They are integrated into an energy minimization process. These log-
likelihood functions are derived from the mixture density descriptions.
- Parameter setting for the above enforcing strengths is robust for a variety of images
with various intensity. Because, the strengths are normalized by statistical distribu-
tions described in the mixture density description.
- The object boundary is composed of sub-regions' boundaries. The C 1 discontinuity
of the ACM can be represented as changes of sub-regions' boundaries.

First, we present a basic idea of the mixture density description, its definition with
assumptions and a region clustering. Second, we describe the active contour model based
on the mixture density descriptions. The experimental results are presented thereafter.

2 Mixture Density Description by Region Clustering

2.1 B a s i c I d e a

Our ACM seeks for a boundary between an object and its background regions according
to their mixture density descriptions. The mixture density descriptions describe the po-
sitions and measurements of the sub-regions in the both sides of the object boundary. In
our approach, the mixture density description can be obtained from a region clustering
in which pixel positions are considered to be a part of features. Owing to the combination
of the positions and the measurements, the both side regions can be decomposed into
locally distributed sub-regions. Similarly, Izumi et a1.[11] and Crisman et a1.[12] decom-
posed a region into sub-regions and integrated the sub-regions' boundaries into an object
boundary. We do not take such boundary straightforwardly because they may be not
precise and jagged for our purpose.
For a pixel to be examined, by selecting a significant sub-region, which is nearest to the
pixel with respect to the position and the measurement, we can evaluate position-sensitive
likelihoods of inside and outside regions. Fig. 1 illustrates an example of mixture density
descriptions. In Fig.l, suppose that region descriptions were not for decomposed sub-
regions. The boundary between inside "black" and outside "blue" might not be correctly
obtained, because the both side regions include blue components and the likelihoods
of the both side regions would not indicate significant difference. On the other hand,
26

higher-level vlsusl process

active contour
rnide-level

low-level [ "~'~'~"~" .-~:'~.-:.description


:i-
mixturedenslty

dlscontinult
object region backgroundreg[on
(inside region) (out sloe region)

Fig. 1. Example of mixture density description.

using the mixture density description, the likelihoods of the both regions can indicate
significant difference knowing the sub-regions' positions. Moreover, the false edges can
be canceled by the equal likelihoods in the both sides of the false edges.

2.2 D e f i n i t i o n s a n d A s s u m p t i o n s

We introduce the probability density function for mixture densities [8]. In the mixture
densities, it is known that the patterns are drawn from a population of c classes. The
underlying probability density function for class ~i is given as a conditional density
p(xlwl, Oi) where Oi is a vector of unknown parameters. If a priori probability of class wi
is known as p(wl), ~'~i=1
c P (0~i ) = 1, a density function for the mixture densities can be
written as:
c

p(xl0) = (1)
i=1

where 0 = (01,02, ...,0c) 9 The conditional densities p(x[to/, 0i) are called component
densities, the priori probabilities p(w/) are called mixing parameters.
We assume that a region of interest R consists of c classes such that R = {wl, o;2, ..., a~ }.
In order to evaluate the likelihood that a pixel belongs to the region R, we take a model
that overlapping of the component densities are negligible and the most significant (high-
est) component density can dominate the pixel's mixture densities. Thus, the mixing pa-
rameters of the region R should be conditional with respect to position p(e.g, row,column
values) and measurements x(e.g. RGB values) of the pixel.
The mixing parameters are given by:

1 if p(x,p[wi,Oi) > p(x,p[wj,Oj),Vj,j # i, 1 < i,j < c


p @ d R , x, p) ~ - - (2)
0 otherwise.

Thus, we can rewrite the mixture density function of the region R for (x, p) as:
c

p(x, piR, 0) -- ~ p(x, ply,, 0~)p(~i IR, x, p ) . (3)


i=l
27

Owing to the approximation of (2), a log-likelihood function of a region R with a mixture


density description 8 can be given by:
In p(x, plR,/9) = m i a x In p(x, p ]wi, 04) . (4)

For convenience of notation, we introduce the notation y to represent the joint vector of
x and p: y = ( p ) . We assume that the component densities take general multivariate
normal densities. According to this assumption, the component densities can be written
as: 1
1
exp[-~-X2(y; u4,27)] (5)
p(ylwi, Oi) = (27r)d/21S411/2 z
and
X2(y;u/, 574) = (y - ui)t L'/"I(y - u4) , (6)
where 0i = (ui,,U4), d, ui and 27i are the dimension of y, the mean vector and the
covariance matrix of y belonging to the class w4, (.)t denotes the transpose of a matrix,
and X2(.) is called a Mahalanobis distance function. For the multivariate normal densities,
the mixture density description is a set of means and covariance matrices for c classes:
M i x t u r e D e n s i t y D e s c r i p t i o n : 0 = ((ul, 271), (u2,272), ...., (ue, 27e)) 9 (7)
The log-likelihood function for the multivariate normal densities is given by:

L o g - L i k e l i h o o d F u n c t i o n : l ( y [ R , 0) = m a x - l ( d ln(27r) + In [274[ + X2(Y; ui, 274)) 9


(8)
Henceforth, we use the log-likelihood function and the mixture density description of (7)
and (8).

2.3 R e g i o n C l u s t e r i n g
A mixture density description defined by (7) is obtained from decomposed sub-regions.
If the number of classes and parameters are all unknown, as noted in [8], there is no
analytical singular decomposition scheme. Our algorithm is similar to the Supervising
ISODATA 3. of Carman et al.[14] in which Akaike's information criterion(AIC)[15] is
adopted to reconcile the description error with the description compactness, and to eval-
uate goodness of the decomposition. The difference is that we assume the general multi-
variate normal distribution for the component densities and use a distance metric based
on (5), while Carman et. al. assume a multivariate normal distribution with diagonal
covariance matrices and used Euclidian distance metric.
By eliminating the constant terms from the negated logarithm of (5), our distance
metric between sample y and class wl is given by:
d(y; u~, El) = In [2~1 + X2(y; ui, 27i) 9 (9)
For general multivariate normal distributions (with no constraints on covariance ma-
trices), AIC is defined as:
c

A I C ( R , 0) = - 2 E / ( Y l R ' 0)+211011 = E{ni(d(l+ln(27r))+ln([Ei[))"l-(2d"l-d(d"l-1))},


y i=1
(10)
3 For review of clustering algorithms see [13].
28

where n~ and ]]0 H represent the number of samples y in the ith class and the degree
of free parameters to be estimated, respectively (note that E(X~(.)) = d). In (10), the
first term is the description error while the second term is a compactness measure which
increases with the number of classes.
The algorithm starts with a small number of clusters, and iterates split-and-merge
processes until the AIC has minima or other stopping rules, such as the number of
iterations, the number of class samples and intra-class distributions, are satisfied.

3 Active Contour Model


based on the Mixture Density Descriptions

We assume that 8in: a mixture density description of an object region R in and 8~


a mixture density description of the object background R ~ have been given by the
region clustering. According to a maximum likelihood estimate (MLE) which maximize
the sum of the two log-likelihoods of R in and R ~ an active contour can be localized
to a balanced position as illustrated in Fig.2. In addition, we will provide an edge-based
estimate for precise boundary estimation. Because there are vague features or outliers
caused by shading, low-pass filtering and other contaminatiors. Assuming that the log-
likelihoods indicate step changes at the boundary, the boundary position can be estimated
as the maxima of the log-likelihood gradients. In our ACM, the both estimates are realized
by a region force and an edge force, and they are integrated into an energy minimization
process 4.

log-likelihood ~ log-likelihood
of the inside Sd~foroe of the outside
Rerglo.form MLE
>
inside~ positionon the normal outside
to the contour

Fig. 2. Boundary location.

3.1 E n e r g y M i n i m i z a t i o n Process

We employ a dynamic programming(DP) algorithm for the energy minimizing of the


ACM. It provides hard constraints to obtain more desirable behavior. Equation (11) is a
standard DP form as is of Amini 5 except the external forces.

minv,_l {SI-l(Vl, Vi-1) "~-Estretch(Vi, Vi-i) "~ Ebend(Vi+l, Vi, Vi-1)


S~(Vi+l, v,) = [ +Ereg,o,*(vi) + Eedge(v,)} 9
(11)
4 For more details see[16]
For review of an ACM implemented by a DP see[17].
29

In (11), we assume that N control points vl on a active contour are spaced at equal
intervals. The energy minimization process proceeds by iterating the DP for the N con-
trol points until the number of position changes of the control points converges to a small
number or the number of iteration exceeds a predefined limit. At each control point vi,
we define two coordinates ti and ni, which are an approximated tangent and an approx-
imated normal to the contour, respectively. In the current implementation, each control
point vi is allowed to move to itself and its two neighbors as {vi, vl +~ni, v i - ~ n i } at each
iteration of the DP. For convenience of notation, we express them as {middle, outward,
inward }. Estretch(') and E~na(') in (11) are the first and the second order continuity
constraints, and they are called internal energy. In the ACM notation, Er~gion(') and
E~aa,(') in (11) are called external forces(image forces).
In the following subsections, we will describe the external forces.

3.2 E x t e r n a l F o r c e s

Two external forces of our ACM are briefly illustrated in Fig. 3. A region force has the
capability to globally guide an active contour to a true boundary. After guided nearly to
the true boundary, an edge force has the capability to locally enforce an active contour
to outline the precise boundary.

'!?:iiiii!ii,.troi
.............................
~::iii::i~::~......
:~ii!i!ili!i!i!i::i::i::i::::~::. :..
obl ~i::i)i::i::i::i::i::iii::::~?~:... n ~ t o r t tn~
reg

Point '~ii::iiiiiliiii~:
0

out In
I(ylR ,g) < I(ylR ,U) in equi um zone
at control point at control point
region force
region force region force + edge force
(squash) (protrude)

Fig. 3. Boundary seeking model: a region force and an edge force.

The region force is given by:

f Po~t(vi) if vi moves inward


-Eregion(Vi)=lPoin(Vi) if vi moves outward (12)
otherwise (if vi stays in the middle),

and
po,,t(vi) = - p l n ( v / ) = I(y[R~ 0~ - l(yl Rin, oin)v, , (13)
where l(.)v denotes the log-likelihood taking feature vectors at or nearly at the position
V.
In order to introduce the edge force, two auxiliary points v + and v~- are provided for
the control point v{, where v + = v{ + T/n/,v~- = vi - r/n{. The inside parameter Ovi is
selected from/9 in so that Ovi is the parameter of the highest component density at v~-.
30

The outside parameter O+i is selected at v + from 0 ~ in the same way. The edge force
is given by:

- - Eedge(Vi) = ( ~162 Ou + ~215 O(-u) if vi is in equilibrium


0 otherwise,
(14)
where W , t and u are a window function and variables on ti, ni coordinates of vi,
respectively. In the current implementation, W has been implemented as a Gaussian
window(c~ and C~n are scale factors). In (14), differentialof log-likelihoods is calculated
from Mahalanobis distances. Note that the edge force can be applied on condition that vi+
and v~- are placed across the true boundary. W e call this condition equilibrium condition
which can be easily determined by likelihoods of v + and v~-.

3.3 Discontinuity Control


In our ACM, the second order continuity constraint will be ignored at point where the
vi is in equilibrium and the inside parameter Ovi is different from its neighbors Ovi_l or
Ovi+l.

4 Experiments

Throughout the experiments, the conditions are 1.input image size:512 480 pixel RGB
, 2. feature vectors: five dimensional vectors (r,g,b,row,column).
Fig. 4 shows an input image with an initial contour drawn by a mouse. The initial
contour is used as a "band" which specifies the inside and outside regions. According to
the band description, the region clustering result is modified by splitting classes crossing
the band. Given an initial contour in Fig.4, we have obtained 24 classes for the inside and
61 classes for the outside through the region clustering. Mixture density descriptions are
obtained from the inside and outside classes. Using these mixture density descriptions,
the ACM performs the energy minimizing to outline the object boundary. Fig. 5 shows
that the discontinuous parts are precisely outlined with the discontinuity control.
Fig. 6 (a) shows a typical example of a trapped edge-based active contour. In contrast
with (a), (b) shows a fair result against the stronger background edges.
We can apply our ACM to an object tracking by iterating the following steps:l) extract
a contour using the ACM, 2) refine mixture density descriptions using the extracted
contour, 3) apply the region clustering to a successive picture taking the mixture density
description into the initial class data. In Fig.6 (c), the active contour tracked the boundary
according to the descriptions newly obtained from the previous result of (b).

5 Conclusion

We have proposed the precise contour extraction scheme using the mixture density de-
scriptions. In this scheme, region- and edge-based contour extraction processes are in-
tegrated. The ACM is guided against the complex backgrounds by the region force and
edge force based on the log-likelihood functions. Owing to the statistical measurement,
our model is robust for parameter setting. Throughout the experiments including other
pictures, the smoothing parameter has not been changed. In addition, the mixture den-
sity descriptions have enabled to represent the C 1 discontinuity. Its efficiency is also
demonstrated in the experiment.
31

Regarding the assumptions for the mixture density description, we have assumed
that the component densities take general multivariate normal densities. To be exact,
the position vector p is not in accordance with normal distribution. So far, however, the
assumption has not exhibited any crucial problems.
Further work is needed in getting initial contours in more general manner. Issues
for feature research include the initial contour estimation and extending our scheme to
describe a picture sequence.

A c k n o w l e d g m e n t s : The authors would like to thank Yoshihiro Fujiwara and Takahiro


Yamada for their giving the chance to do this research.

References
1. Blake, A., Zisserman, A.: Visual Reconstructiort The MIT Press (1987)
2. Mumford, D., Shah, J.: Boundary Detection by Minimizing Functionals. Proc. CVPR'85
(1985) 22-26
3. Kass, M., Witikin, A., Terzoponlos, D.: SNAKES: Active Contour Models. Proc. 1st ICCV
(1987)259-268
4. Menet, S., Saint-Marc, P., Mendioni, G.: B-snakes: Implementation and Application to
Stereo. Proc. DARPA Image Understanding Workshop '90 (1990) 720-726
5. Cohen, L., Cohen, I.: A Finite Element Method Applied to New Active Contour Models
and 3D Reconstruction from Cross Sections. Proc. 3rd ICCV (1990) 587-591
6. Berger, M., Mohr, R.: Towards Autonomy in Active Contour Models. Proc. 11th ICPR
(1990) 847-851
7. Dennis, J., Schnabel, R.: Numerical Methods .for Unconstrained Optimization and Linear
Equations. Prentice-Hall (1988)
8. Dnda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley and Sons (1973)
9. Scoive, S.: Application of the Conditional Population-Mixture Model to Image Segmenta-
tion. IEEE Trans. on Putt. Anal. & Ma~h. Intell. 5 (1983) 429-433
10. Yarman-Vnral, F.: Noise,Histogram and Cluster Validity for Gaussian-Mixtured Data. Pat-
tern Recognition 20 (1987) 385-501
11. Izumi, N., Morikawa, H., Harashima, H.: Combining Color and Spatial Information for
Segmentation. IEICE 1991 Spring Nat. Convention Record, Part 7 (1991) 392 (in Japanese)
12. Crisman, J., Thorpe, C.: UNSCARF, A Color Vision System for the Detection of Unstruc-
tured Roads. Proc. Proc. Int. Conf. on Robotics & Auto. (1991) 2496-2501
13. Join, A., Dubes, R.: Algorithms .for Clustering Data. Prentice-Hall (1988)
14. Carman, C., Merickel,M: Supervising ISODATA with an Information Theoretic Stopping
Rule. Pattern Recognition 23 (1990) 185-197
15. Akaike, H.: A New Look at Statistical Model Identification. IEEE Trans. on Automat.
Contr. 19 (1974) 716-723
16. Etoh, M., Shiral, Y., Asada, M.: Active Contour Extraction by Mixture Density Description
Obtained from Region Clustering, SIGPRU Tech. Rep. 91-81. IEICE of Japan (1991)
17. Amini, A., Weymouth, T., Join, R.: Using Dynamic Programming for Solving Variational
Problems in Vision. IEEE Trans. on Putt. Anal. & Ma~h. InteU. 12 (1990) 855-867

This article was processed using the I~EX macro package with ECCV92 style
32

(a) an initial Contour (b) a clustering result

Fig. 4. An example image.

(a) without the discontinuity control (b) with the discontinuity control

Fig. 5. Extracted contours of Fig.4.

(a) an edge-based AC (b) the proposed AC (c) loci to another picture

Fig. 6. Comparison with a conventional edge-based ACM and an application to tracking.


T h e M~Jbius S t r i p P a r a m e t e r i z a t i o n
for Line E x t r a c t i o n *

Carl-Fredrik Westin and Hans Knutsson

Department of Electrical Engineering, Computer Vision Laboratory


LinkSping University~ 581 83 LinkSping, Sweden
Phone: +46 13 282460, Fax: +46 13 138526, email: westin@isy.liu.se
A b s t r a c t . A parameter mapping well suited for line segmentation is de-
scribed. We start with discussing some intuitively natural mappings for line
segmentation, including the popular Hough transform. Then, we proceed
with describing the novel parameter mapping and argue for its properties.
The topology of the mapping introduces its name, the "MSbius strip" pa-
rameterization. This mapping has topological advantages over previously
proposed mappings.

1 Introduction
The reason for using a parameter mapping is often to convert a difficult global detection
problem in image space into a local one. Spatially extended patterns are transformed so
that they produce spatially compact features in a space of parameter values. In the case
of line segmentation the idea is to transform the original image into a new domain so that
colinear subsets, i.e. global lines, fall into clusters. The topology of the mapping must
reflect closeness between wanted features, in this case features describing properties of a
line. The metric describing closeness should also be uniform throughout the space with
respect to the features. If the metric and topology do not meet these requirements, sig-
nificant bias and ambiguities will be introduced into any subsequent classification process.

2 Parameter Mappings
In this section some problems with standard map-
pings for line segmentation will be illuminated.
The Hough transform, HT, was introduced by P. V.
C. Hough in 1962 as a method for detecting complex
patterns [Hough, 1962]. It has found considerable ap-
plication due to its robustness when using noisy or in-
complete data. A comprehensive review of the Hough
transform covering the years 1962-1988 can be found
in [Illingworth and Kittler, 1988].
Severe problems with standard Hough parameteriza-
tion are that the space is unbounded and will contain
singularities for large slopes. The difficulties of un- \ "-- X

limited ranges of the values can be solved by using


two plots, the second corresponding to interchanging
the axis. This is of course not a satisfactory solution. Fig. 1. The normal parameteriza-
Duda and Hart [Duda and Hart, 1972] suggested that lion, (p,~), of a line. p is the mag-
straight lines might be most usefully parameterized nitude of displacement vector from
by the length, p, and orientation ~o, of the normal the origin and ~ is its argument.

* This work has been supported by the Swedish National Board for Techn. Development, STU.
34

vector to the line from the origin, the n o r m a l p a r a m e t e r i z a t i o n , see figure 1. This is
a mapping has the advantage of having no singularities.
Measuring local orientation provides additional information about the slope of the
line, or the angle V when using the normal parameterization. This reduces the standard
HT to a one-to-one mapping. With one-to-one we do not mean that the mapping is
invertible, but that there is only one point in the parameter space that defines the
parameter s that could have produced it.
Duda and Hart discussed this briefly in [Duda and Hart, 1973]. They suggested that
this mapping could be useful when fitting lines to a collection of short line segments.
Dudani and Luk [Dudani and Luk, 1978] use this technique for grouping measured edge
elements. Princen, Illingworth and Kittler do line extraction using a pyramid structure
[Princen et al., 1990]. At the lowest level they use the ordinary pv-HT on subimages
for estimating small line segments. In the preceding levels they use the additional local
orientation information for grouping the segments.
Unfortunately, however, the normal parameterization has problems when p is small.
The topology here is very strange. Clusters can be divided into parts very far away
from each other. Consider for example a line going through the origin in a xy-coordinate
system. When mapping the coordinates according to the normal parameterization, two
clusters will be produced separated in the v-dimension by ~r, see figure 2. Note that this
will happen even if the orientation estimates are perfectly correct. A line will always
have at least an infinitesimal thickness and will therefore be projected on both sides of
the origin. A final point to note is that a translation of the origin outside the image plane
will not remove this topological problem. It will only be transferred to other lines.
Granlund introduced a double angle notation [Granlund, 1978] in order to achieve a
suitable continuous representation for local orientation. However, using this double angle
notation for global lines, removes the ability of distinguishing between lines with the same
orientation and distance, p, at opposite side of the origin. The problem near p --- 0 is
removed, but unfortunately we have introduced another one. The two horizontal lines
(marked a and c), located at the same distance p from the origin, are in the double angle
normal parameterization mixed into one cluster, see figure 2.
It seems that we need a "double angle" representation around the origin and a "single
angle" representation elsewhere. This raises a fundamental dilemma: is it possible to
achieve a mapping that fulfills both the single angle and the double angle requirements
simultaneously?
We have been concerned with the problem of the normal parameterization spreading
the coordinates around the origin unsatisfactorily although they are located very close
in the cartesian representation. Why do we not express the displacement vector, i.e. the
normal vector to the line from the origin, in cartesian coordinates, ( X , Y ) , since the
topology is satisfactory? This parameterization is defined by

rX = 9 cos=(v) + v cos(v) sin(v)


= Ysin= (V) + x cos(v) sin(v)

where V is as before the argument of the normal vector (same as the displacement vector
of the line).
Davis uses the p~o-parameterization in this way by storing the information in a carte-
sian array [Davis, 1986]. This gives the (X, Y) parameterization. There are two reasons
for not using this parameterization. First, the spatial resolution is very poor near the
origin. Secondly, and worse, all lines having p equal to 0 will be mapped to the same
cluster.
35

2~
2n 2n
b
IC

J b
II ~ a,c

,p ,P

Fig. 2. A test image containing three lines and its transformation to the p~-domain, the normal
parameterization of a line. The cluster from the line at 45~ is divided into two par~s. This
mapping has topological problems near p = O. The p2~-domain, however, folds the space so the
topology is good near p = O, but unfortunately it is now bad elsewhere. The two horizontal lines,
marked a and c, have in this parameter space been mixed in the same cluster.

The first problem, the poor resolution near the origin, can at least be solved by
mapping the XY-plane onto a logarithmic cone. That would stretch the XY-plane so
the points close to the origin get more space. However, the second problem still remains.

3 The MSbius Strip Parameterization

In this section we shall present a new parameter space and discuss its advantages with
respect to the arguments of the previous section. The Möbius strip mapping is based
on a transformation to a 4D space by taking the normal parameterization in figure
1, expressed in cartesian coordinates (X, Y), and adding a "double angle" dimension
(consider the Z-axis in an XYZ-coordinate system). The problem with the cartesian
normal parameterization is, as mentioned, that all clusters from lines going through the
origin mix into one cluster. The additional dimension, ψ = 2φ, separates the clusters
on the origin and close to the origin if the clusters originate from lines with different
orientation. Moreover, the wrap-around requirement for ψ is ensured by introducing a
fourth dimension, R.

The 4D-mapping

X = x cos²(φ) + y cos(φ) sin(φ)
Y = y sin²(φ) + x cos(φ) sin(φ)
ψ = 2φ
R = R0 ∈ ℝ+

The first two parameters, X and Y, form the normal vector in fig. 1, expressed in cartesian
coordinates. The two following parameters, ψ and R, define a circle with radius R0 in
the Rψ-plane. Any R0 > 0 is suitable. This gives an XYψ-system with wrap-around
in the ψ-dimension.
In the mapping above, the parameters are dependent. As the argument of the vector
in the XY-plane is φ and the fourth dimension is constant, it follows that for a specific
(X, Y) all the parameters are given. Hence, the degree of freedom is limited to two, the
dimension of the XY-plane. Thus, all the mapped image points lie in a 2D subspace of
the 4D parameter space, see figure 3.

Fig. 3. The XY2φ parameter mapping. The wrap-around in the double angle dimension gives the
interpretation of a Möbius strip

The 2D-surface
The regular form of the 2D-surface makes it possible to find a two-parameter form for
the wanted mapping. Let us consider an ηψ-plane corresponding to the flattened surface
in figure 3. Let

ρ² = X² + Y² = (x cos²(φ) + y cos(φ) sin(φ))² + (y sin²(φ) + x cos(φ) sin(φ))²

ρ = x cos(φ) + y sin(φ)

Then the (η, ψ) mapping can be expressed as

η = ρ    for 0 ≤ φ < π
η = −ρ   for π ≤ φ < 2π

ψ = 2φ

η is the variable "across" the strip, with 0 value meaning the position in the middle of the
strip, i.e. on the 2φ axis. The wrap-around in the ψ dimension makes the interpretation
that the surface is a Möbius strip easy, see figure 3.
Finally, using the same test image as before, we can see that we can distinguish
between the two lines on opposite sides of the origin, at the same time as the cluster cor-
responding to the line going through the origin is not divided, see figure 4.
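As an illustration of the mapping, a minimal sketch is given below (it is not the implementation used for the experiments; the accumulator size, the bin limits and the function names are illustrative choices). An edge point with position (x, y) and estimated normal-vector argument φ is sent to the bounded (η, ψ) space and accumulated Hough-style:

```python
import numpy as np

def moebius_map(x, y, phi):
    """Map a point (x, y) with normal-vector argument phi to the
    (eta, psi) Moebius-strip parameters defined in the text."""
    rho = x * np.cos(phi) + y * np.sin(phi)       # signed normal distance
    psi = (2.0 * phi) % (2.0 * np.pi)             # double-angle dimension
    eta = rho if (phi % (2.0 * np.pi)) < np.pi else -rho
    return eta, psi

def accumulate(points, eta_bins=64, psi_bins=64, eta_max=100.0):
    """Hough-style voting in the bounded (eta, psi) space.
    `points` is a list of (x, y, phi) triples."""
    acc = np.zeros((eta_bins, psi_bins))
    for x, y, phi in points:
        eta, psi = moebius_map(x, y, phi)
        i = int((eta + eta_max) / (2 * eta_max) * (eta_bins - 1))
        j = int(psi / (2 * np.pi) * (psi_bins - 1))
        if 0 <= i < eta_bins:
            acc[i, j] += 1
    return acc
```

Lines then appear as isolated cluster maxima in the accumulator, with the wrap-around in ψ handled by the modulo operation.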

4 Conclusion

The main contribution of this paper is the novel Möbius strip parameter mapping. The name, as
mentioned above, reflects the topology of the ηψ parameter surface, its twisted wrap-around in
the ψ-dimension. The proposed mapping has the following properties:

- The parameter space is bounded and has no singularities such as the standard HT
  parameterization for large slopes.
- The metric reflects closeness between features.
- The mapping does not share the topological problem of the ρφ-mapping near ρ = 0. It has been
  shown that any line passing through the origin there produces two clusters, separated by π in
  the φ-dimension.

Note that these properties have been achieved without increasing the dimension of the parameter
space compared to, for example, the normal parameterization, see figure 1.

Fig. 4. The (η, ψ) parameter mapping (the "Möbius strip" mapping). We can see that we can
distinguish between the two lines, a and c, on opposite sides of the origin, at the same time as
the cluster corresponding to the line going through the origin is not divided.
Line segmentation results from using the Möbius strip parameterization can be found in
[Westin, 1991].

References

[Davis, 1986] Davis, E. R. (1986). Image space transforms for detecting straight edges in industrial parts. Pattern Recognition Letters, 4:447-456.
[Duda and Hart, 1972] Duda, R. O. and Hart, P. E. (1972). Use of the Hough transform to detect lines and curves in pictures. Communications of the Association for Computing Machinery, 15.
[Duda and Hart, 1973] Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley-Interscience, New York.
[Dudani and Luk, 1978] Dudani, S. A. and Luk, A. L. (1978). Locating straight-line edge segments on outdoor scenes. Pattern Recognition, 10:145-157.
[Granlund, 1978] Granlund, G. H. (1978). In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155-178.
[Hough, 1962] Hough, P. V. C. (1962). A method and means for recognizing complex patterns. U.S. Patent 3,069,654.
[Illingworth and Kittler, 1988] Illingworth, J. and Kittler, J. (1988). A survey of the Hough transform. Computer Vision, Graphics and Image Processing, 44.
[Princen et al., 1990] Princen, J., Illingworth, J., and Kittler, J. (1990). A hierarchical approach to line extraction based on the Hough transform. Computer Vision, Graphics, and Image Processing, 52.
[Westin, 1991] Westin, C.-F. (1991). Feature extraction based on a tensor image description. LiU-Tek-Lic-1991:28, ISY, Linköping University, S-581 83 Linköping, Sweden. Thesis No. 288, ISBN 91-7870-815-X.

This article was processed using the LaTeX macro package with ECCV92 style
Edge tracing in a priori known direction

Andrzej Nowak, Andrzej Florek, Tomasz Piascik


Laboratory of Automation and Robotics, Technical University of Poznan, Poland

Abstract. This paper presents a contour tracing algorithm based on a
priori knowledge about the searched edge. The algorithm is intended to trace
edges having long fragments of approximately constant direction. This enables
the use of only one edge detector mask in a given area and simplifies the
thinning procedure. The search for starting points is discussed, as well as
the choice and joining of edge fragments that assure the best correspondence
between the found edge and the knowledge possessed. The algorithm shows good
insensitivity to noise and local edge distortions.

1 Introduction

Edge detection and tracing is a crucial problem in the area of digital image processing.
An edge can be defined as a boundary between two homogeneous areas of different
luminance. Local luminance changes and the edges corresponding to them are among the
characteristic image features providing information necessary in the process of scene analysis
and object classification [BB1], [Prl]. Most contour extraction algorithms consist of
two basic steps: edge detection (sometimes with thresholding) and thinning
and linking. They are efficient when applied to images of nearly homogeneous objects
differing significantly from the background (e.g. tools, industrial parts, writing etc.) if
the image is not contaminated by noise. When the level of noise increases, the obtained
contours are often broken and deformed. That makes the process of interpretation and
recognition more difficult. More sophisticated methods should then be implemented, e.g.
incorporating a feedback path or a local edge enhancement [CS1]. However, all the universal
edge tracing algorithms may still fail when applied to noisy images. In these
cases, the use of a priori knowledge about the edge to be traced significantly facilitates
the construction of an appropriate tracing algorithm.

2 Edge Tracing

The first step of the algorithm is the edge detection. It is made by convolving an image
(or a fragment of it) with the single mask that is most sensitive to luminance changes
in a chosen direction. The choice of this mask is based on our expectations about the
searched edge direction (called here the assumed edge direction). The obtained edges
are then thinned. The edge tracing begins with finding a starting chain (edge fragment).
Then the edge is traced, point after point. When a gap is encountered, a procedure
for seeking and estimating all the chains passing close to the current boundary point is
activated. The chains are estimated according to a criterion examining their usefulness
for further tracing. The best chain, fulfilling also some threshold conditions, is accepted as
a continuation of the broken boundary. This chain is then connected with the previously
found boundary fragment and the tracing procedure continues. In the case when no
chains are found (or when none of them fulfils the threshold conditions), searching for a
new starting chain begins (it is led in the assumed edge direction).

2.1 Edge Detection and Thinning
In the present algorithm, the edge direction is quantized into one of eight directions.
This is a common approach assuring good detection of edges in any direction. For detecting
edges, a set of eight masks of size 3x3 pixels is used, as proposed in [Rol]. The
edge direction (from 1 to 8) to which a given mask is the most sensitive is called here a
mask direction. For a chosen image part, the one mask whose direction is the closest to the
assumed edge direction is used. The edge detection is carried out by convolving
the considered image fragment with the appropriate mask. The result of this convolution
is the edge magnitude image, having "lines" where edges previously existed. Since
edges in real images are usually blurred over some area, and after convolution with a 3x3
mask this blurring increases further, the resulting lines are at least a few pixels wide. The
point of maximum edge magnitude on an edge cross-section is assumed to be a boundary
point. Because of blurring and the presence of noise, a few local maxima can appear on
this cross-section. It is then difficult to estimate whether these maxima originate from noise
or from a few edges passing close to one another. A simple solution to this problem is
to assume a minimum distance between two edges. If the distance between neighbouring
maxima is shorter than this minimum, then the bigger maximum is considered an
edge point. This minimum distance should be taken as short as possible, to
prevent attenuating the "weaker" edge by a "stronger" neighbouring one. It was fixed
in the implementation that the minimum distance between edges cannot be shorter than
three pixels. This also ensures that the obtained chains will not be branched.
Thinning of the previously detected edges is achieved in two passes. In the first one,
all the points of the image fragment are analyzed in rows, from left to right. For each
analyzed point, the following points (in a direction perpendicular to the assumed edge
direction) are checked (see Fig. 1). If the gradient magnitude of the analyzed point is
smaller than the gradient magnitude of one of the two following points, then it is set to
zero. This procedure is repeated in the second pass, while moving from right to left.


Fig. 1. Neighbourhoods of the analyzed point (x) checked during thinning, for the first (1) and
the second (2) pass, for different mask directions

As a result of the thinning procedure, fragments of edges (called chains) are obtained.
Their direction is close to the direction of the used edge detector mask and they are one
pixel wide (measured in the thinning direction, i.e. perpendicularly to the mask direction).
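The two-pass procedure can be sketched as follows (a minimal illustration rather than the original implementation; the function name, the `perp` offset convention and the handling of a single mask direction are assumptions):

```python
import numpy as np

def thin_two_pass(mag, perp=(0, 1)):
    # Two-pass thinning of an edge-magnitude image `mag`: a point is set to
    # zero if its magnitude is smaller than that of one of the two following
    # points along `perp`, the offset perpendicular to the assumed edge
    # direction; the second pass repeats this while moving the other way.
    out = mag.astype(float).copy()
    rows, cols = out.shape
    dr, dc = perp
    for sign in (+1, -1):                       # first pass, then second pass
        col_order = range(cols) if sign > 0 else range(cols - 1, -1, -1)
        for r in range(rows):
            for c in col_order:
                for step in (1, 2):             # the two following points
                    rr, cc = r + sign * step * dr, c + sign * step * dc
                    if 0 <= rr < rows and 0 <= cc < cols and out[r, c] < out[rr, cc]:
                        out[r, c] = 0.0
                        break
    return out
```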

2.2 Edge Tracing
The first step of the edge tracing is searching for a starting chain. This procedure is
analogous to the one of finding the best chain in a window, described below. The difference
is that the starting chain must fulfil much stronger conditions concerning its length and
average gradient magnitude. Also, the size of the search window is usually bigger in this
case. Additional conditions, such as the position of the edge in relation to some characteristic
points of the image, derive from the possessed a priori knowledge and must be defined
separately for each implementation.
After finding the starting chain, the edge tracing begins from its first point (the starting
point). For each already found edge point, the next point is searched for in a strictly
defined neighbourhood. The choice of the analyzed neighbours depends on the direction
of the edge detector mask and on the tracing direction (in accordance with or in opposition to
the mask direction - see Fig. 2). If the gradient of any of these neighbours is different
from zero, it is accepted as the next edge point. The tracing procedure is continued until
a margin of the analyzed area or a break in the traced edge is reached. In the latter
case, searching for a new chain (which could be assumed as a continuation of the broken
edge) is activated. The search is led in the area limited by a window whose center is the
last found edge point. The window is square and its sides are parallel to the image margins.
Its size depends on possible local changes in the edge direction and on the distance from
a possible "strong" edge.

Fig. 2. Neighbours analyzed in searching for the next edge point according to the mask direction
In the window, all the chains originating in it are searched for (except for the already found
boundary). All the window points are checked along lines perpendicular to the mask
direction, starting from one of the corners. For each point whose gradient magnitude
is higher than zero, all its neighbours from the previously checked line are analyzed. If
their gradient equals zero, then the next chain number (in this window) is assigned
to this point and its coordinates are stored. Thanks to the use of one edge detector mask
and the described thinning procedure, it is not possible for chains beginning in two separate
points to join into one chain. Thus, after the whole window has been analyzed, the number
of chains beginning in it and their starting coordinates are known. Afterwards, their
properties are analyzed from the point of view of their usefulness for tracing continuation
of the broken edge. In order to attain this, each chain is traced from its (already known)
starting point to the end (but not further than a given maximum distance). The total
gradient magnitude of all its points and its length (the number of pixels) are counted,
as well as the average gradient magnitude and the deviation from the assumed direction.
For computing this deviation, only the coordinates of the first and the last point of the chain
are used. The best chain is chosen from among the chains satisfying specified threshold
conditions (concerning their length and average gradient magnitude). The best chain
is assumed to be the one maximizing a given criterion. An exemplary criterion can be as
follows:

Q = a1 (n / nmax) + a2 (m / mmax) + a3 (1 − l/L) + a4 (1 − |tg α|)     (1)
where:

ai - non-negative weight coefficients,
n - length of a chain in pixels,
m - average gradient magnitude,
l - deviation of the first point of the chain from the assumed direction (in pixels),
L - window size (its side is 2L+1 pixels),
max - maximum values in the window.

The way of calculating parameters l and α is shown in Fig. 3.

Fig. 3. The way of calculating parameters l and α when the mask of direction 1 or 5 was used


The chain maximizing the Q criterion is connected to the previously found edge
fragment. In the case when no chains are found in the search window, or when no chains fulfil
the specified threshold conditions, searching for a next starting point begins. The search
is carried out in the expected direction of the edge. A negative result of the search in the
analyzed image part, or reaching its margin, ends the algorithm in this part of the image.
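For illustration, criterion (1) can be evaluated as in the following sketch (the weight values and the function interface are arbitrary illustrative choices, consistent with the notation above):

```python
def chain_score(n, m, l, tan_alpha, n_max, m_max, L, a=(1.0, 2.0, 1.0, 1.0)):
    """Evaluate the chain-selection criterion of Eq. (1).
    n         - chain length in pixels
    m         - average gradient magnitude along the chain
    l         - deviation of the chain's first point from the assumed direction (pixels)
    tan_alpha - tangent of the chain's deviation angle (from its first and last points)
    n_max, m_max - maximum values found in the window
    L         - window half-size (the window side is 2L+1 pixels)
    a         - non-negative weights a1..a4 (illustrative values)."""
    a1, a2, a3, a4 = a
    return (a1 * n / n_max
            + a2 * m / m_max
            + a3 * (1.0 - l / L)
            + a4 * (1.0 - abs(tan_alpha)))
```

The best chain is then simply the candidate maximizing this score among those passing the length and average-magnitude thresholds.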

3 Experimental Results

The presented edge tracing algorithm was implemented to find the outlines of the outer
fat layer of halved pig carcasses. It enables finding the maximum width of this layer
as well as other parameters necessary for meat classification. The resolution of the analyzed
images was 512x512 pixels with 256 grey levels. The images were additionally low-pass
filtered (with a 3x3 mask). For edge detection, the masks detecting vertical edges were
used. The summed-up result of applying both masks is shown in Fig. 4.b. After
having analyzed about 100 images, the following parameter values were set:
- the minimum length of the best chain in a window was 10 pixels,
- the window size was 21 pixels (41 by 21 for a starting chain),
- the weight coefficients a1, a3, a4 were established so that the maximum of each product
in (1) was 1 (only a2 was set bigger).
Exemplary results are shown in Fig. 4. Edges obtained after detection and thinning
(Fig. 4.b) are broken in many points. This causes frequent searching for a chain which
could be accepted as a continuation of a broken boundary. In the places where chains are joined,

Fig. 4. The result of applying the edge tracing algorithm to find the outlines of the outer
fat layer of a halved pig carcass: a) smoothed outlines shown on the original image, b) result of
applying masks detecting vertical edges, c) obtained outlines of the fat layer

local changes in the outline direction can appear (Fig. 4.c). These outlines, after smooth-
ing, were put on a real image (see Fig. 4.a). On the basis of the obtained outlines char-
acteristic parameters of the fat layer were calculated. (Areas of interest are marked with
horizontal lines.)

4 Summary
A simple edge tracing algorithm was presented here. Applying universal edge tracing
algorithms is unjustified when analyzing images about which some a priori knowledge is
available. The computational complexity of these algorithms is usually high and their
efficiency is low, especially for noisy images. The presented algorithm (based on a priori
knowledge about the analyzed scene) is efficient even when applied to noisy images
and distorted edges. However, for each implementation it requires setting the values of all
parameters as well as defining additional conditions (characteristic for a particular
implementation) simplifying the tracing procedure.

References
[BB1] Ballard, D., Brown, C.: Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1982
[Prl] Pratt, W.: Digital Image Processing. New York: Wiley-Interscience, 1978
[CS1] Chen, B., Siy, P.: Forward/backward contour tracing with feedback. IEEE Trans. Pattern
Anal. Machine Intell., vol. PAMI-9, no. 3, pp. 438-446, May 1987
[Rol] Robinson, G.: Edge detection by compass gradient masks. Comput. Graphics Image Pro-
cessing, vol. 6, pp. 482-492, 1977
This article was processed using the LaTeX macro package with ECCV92 style
Features Extraction and Analysis Methods for
Sequences of Ultrasound Images

Isabelle L. HERLIN and Nicholas AYACHE

INRIA - B.P. 105 - 78153 Le Chesnay - FRANCE

Abstract. Our principal motivation is to study time sequences of echocar-


diographic raw data to track specific anatomical structures. First, we show
that the image processing can make direct use of the raw signal data,
avoiding loss of information and yielding optimal results.
Secondly, we develop a strategy which takes a time sequence of raw
data as input, computes edges, initiates a segmentation of a pre-selected
anatomical structure and uses a deformable model for its temporal tracking.
This approach is validated in a real time sequence of ultrasound images of
the heart to track the left auricle and the mitral valve.

1 Introduction
1.1 Motivation and Objectives
There is a continuously increasing demand for the automated analysis of 2D and 3D
medical images at the hospital [1]. Among these images, ultrasound images play a crucial
role, because they can be produced at video-rate and therefore allow a dynamic analysis
of moving structures. Moreover, the acquisition of these images is non-invasive and the
cost of acquisition is relatively low compared to other medical imaging techniques.
On the other hand, the automated analysis of ultrasound images is a real challenge for
active vision, because it combines most of the difficult problems encountered in computer
vision in addition to some specific ones related to the acquisition mode:
- images are usually provided in polar geometry instead of cartesian geometry,
- images are degraded by a very high level of corrupting noise,
- observed objects usually correspond to non-static, non-polyhedric and non-rigid
structures.
The geometric transformation (called scan correction) which transforms the data from
a polar representation to the correct cartesian representation is usually applied through
a bilinear interpolation.
We show in this paper the limitations of this scheme, which does not account for the
varying resolution of the data, and we propose a new method, called sonar-space filtering,
which consists of computing the scan conversion with a low-pass filtering of the cartesian
image applied directly to the available polar data, and which can be used to optimally
reconstruct the data with a chosen level of spatial linear filtering.
Furthermore, we develop a methodology to automatically track a physiological structure
on an echocardiographic sequence. Interactivity is used to initiate the process on the
first image of the sequence. Then edges are computed and an approximate segmentation
of the structure is obtained by using deterministic algorithms. This information is finally
combined with a deformable model to obtain the temporal tracking of the pre-selected
structure.

Finally, to further demonstrate the efficiency of our approach, we apply it to a difficult


time sequence of ultrasound images of the heart to track the left auricle and the mitral
valve.

1.2 Previous work
Our approach is different from previous ones. This comes from the fact that we
study the ultrasonic data directly. More commonly, feature extraction is applied to the
cartesian video data. To our knowledge, there is only one study where all processing
is performed on sector scans in polar coordinate form. This was published by Taxt [13]
and reports noise reduction and segmentation in time-varying ultrasound images. But a
comparative study of scan correction methods to obtain cartesian images has apparently
not been pursued yet. For cartesian images, the most commonly used approach to obtain
the contour of the left ventricle (in echocardiography) is radial search [5] [2] [7]: the procedure
starts from a point inside the heart chamber and searches along different radial lines for
edge points. The best-known dynamic approach is the one by Zhang and Geiser [15], who
compute temporal cooccurrences to obtain both stationary points and moving points. The
temporal information has also been used to filter images obtained at the same instant of
the cardiac cycle [14].

2 Acquisition of an echographic image

The purpose of this section is to present some primary characteristics of image
formation using echographic technologies.
A basic imaging system, called a pulsed system, is illustrated in Fig. 1. When the
switch is in the transmit position, the pulse waveform p(t) excites the transducer [9].¹
This results in a wavefront that is propagated into the body. The transducer produces
a relatively narrow beam of propagation whose angular direction of propagation into
the body is known. Immediately following the transmission, the system switches to the
receive position, using the same transducer. The pulse is attenuated when it propagates
through the body. When the wavefront hits a discontinuity, a scattered wave is produced.
This scattered wave is received by the transducer and the resultant signal is processed
and displayed along a line representing the direction of the beam.


Fig. 1. Elementary pulsed ultrasonic system

¹ Other types of echographs, using pseudo-random code correlation, are studied in the literature [11]

3 Image representation in cartesian coordinates

The process of converting from the polar coordinate representation to the cartesian
coordinate representation is necessary for the convenience of the users. Physicians are
accustomed to viewing images in cartesian coordinates and it would be difficult for them to
interpret polar data. Moreover, visualization hardware and image processing algorithms
are designed for data in cartesian coordinates.
Let us suppose that M different orientations are used to obtain an echocardiographic
image, and that each return signal is digitized to L points. Fig. 2 shows an echographic
image, with M rows and L columns, obtained with a commercial echographic machine,
providing an image represented in polar coordinates. Fig. 3 shows the cartesian image
corresponding to the same data.

Fig. 2. Ultrasound image on raw data

Fig. 3. Cartesian image after conversion

Scan conversion requires the knowledge of the following set of parameters (see Fig. 4):
- the angular extent α of the data acquisition wedge,
- the minimal distance d for data acquisition,
- the total distance D for data acquisition (these distances being calculated from the skin), and
- the number of rows, N, desired in the output cartesian image (the number of columns
will be related to α, and will assume square pixels).


Fig. 4. Parameters of the conversion process

Several methods may be used for the conversion process. Usually, the video image on
the echographic machine is obtained by assigning to a cartesian point the grey level of
the nearest available point in polar coordinates, or the value of the bilinear interpolation
of its four nearest points. In fact, we found that these methods do not make an optimal
use of the available original data, and we introduced a new method, called sonar-space
filtering, which can be used to optimally reconstruct the data with a chosen level of
spatial linear filtering.

3.1 Conversion by sonar-space filtering

We assume that it is desired that the continuous input cartesian image I(x, y) be
filtered by the impulse response filter f(x, y). The resulting image R(x, y), in continuous
space, is given by the convolution product:

R(x, y) = ∫∫ f(x − u, y − v) · I(u, v) du dv .     (1)

However, the input is only available in the polar coordinate space. We thus apply the
following change of variables:

ρ = e · √((x − x0)² + y²) − ΔρN

where:

- ΔρN = d (L − 1)/(D − d) represents the distance from the surface of the skin where the
acquisition process begins, measured in pixel units along a scan line of the raw data,
- Δα = α/(M − 1) is the angular difference between two successive angular positions of
the probe,
- e = (D/(D − d)) · ((L − 1)/(N − 1)) performs the change of pixel sampling rates along the axial
direction of the beam, according to the desired height N of the cartesian image.
We obtain:

R(x, y) = ∫(0..2π) ∫(0..∞) f(x − x(ρ, θ), y − y(ρ, θ)) · I(ρ, θ) · |J(ρ, θ)| dρ dθ .     (2)

Here |J(ρ, θ)| is the determinant of the Jacobian matrix corresponding to the inverse
transformation of variables:

x(ρ, θ) = x0 + ((ρ + ΔρN)/e) cos(θ Δα + (π − α)/2)
y(ρ, θ) = ((ρ + ΔρN)/e) sin(θ Δα + (π − α)/2) .

It is easily seen that

|J(ρ, θ)| = (ρ + ΔρN) Δα / e²

We have transformed the convolution in the cartesian coordinates into an infinite integral
in polar coordinates, corresponding to the domain of the raw data.
Once the two-dimensional convolution filter f is chosen, we define its rectangle of
essential support, say a rectangular window of width 2X and 2Y. Outside this region
of support the absolute value of the impulse response must be lower than a pre-selected
threshold s, i.e.:

|f(u, v)| < s   if ((|u| ≥ X) OR (|v| ≥ Y)) .

Therefore the integral is approximated by a finite integral over the domain

(x − X ≤ u ≤ x + X) AND (y − Y ≤ v ≤ y + Y) .

The filter is also sampled in this domain in order to approximate the integral by
a discrete summation. Filtered numerical outputs are evaluated at original data point
locations within the continuous domain.
We thus obtain the following equation:

R(x, y) = C Σk f(x − x(ρk, θk), y − y(ρk, θk)) · I(ρk, θk) · |J(ρk, θk)| ,     (3)

where the summation is over the discrete collection of points (ρk, θk) in polar coordinates,
where C is used to normalize the data and:

|J(ρk, θk)| = (ρk + ΔρN) Δα / e² .

We have transformed the computation of R(x, y) into a discrete summation on a
window of size 2X by 2Y, making use of image data only at points where it is defined
in the polar coordinate domain. Note that for the raw data, or more generally for any
sonar-like data, the sampling of the filter is not regular along the x and y axes but
rather conforms to the sampling density of the raw data. In this formula, the |J(ρk, θk)|
value represents the surface area of the polar pixel patch in the cartesian domain. We
will present in the following subsections the different values that we have chosen for the
function f for different processing of the echocardiographic raw data.
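A direct, unoptimized sketch of Eq. (3) is given below. It assumes the geometry (ΔρN, Δα, e) stated above, and a Gaussian stands in for the smoothing filter f; all names and default values are illustrative choices rather than the implementation used to produce the results.

```python
import numpy as np

def polar_to_cartesian(rho, theta, x0, delta_rho, delta_alpha, alpha, e):
    """Cartesian pixel position of the polar sample (rho, theta)."""
    r = (rho + delta_rho) / e                         # radius in cartesian pixels
    ang = theta * delta_alpha + (np.pi - alpha) / 2   # angle from the x-axis
    return x0 + r * np.cos(ang), r * np.sin(ang)

def sonar_space_filter(I_polar, x, y, x0, d, D, alpha, N, sigma=1.0, support=3.0):
    """Evaluate Eq. (3) at the cartesian point (x, y).
    I_polar is the raw (M rays x L samples) image; a Gaussian of std
    `sigma` (cartesian pixels) plays the role of f, with an essential
    support of `support` standard deviations."""
    M, L = I_polar.shape
    delta_alpha = alpha / (M - 1)
    delta_rho = d * (L - 1) / (D - d)
    e = (D / (D - d)) * ((L - 1) / (N - 1))
    acc, norm = 0.0, 0.0
    for theta in range(M):
        for rho in range(L):
            u, v = polar_to_cartesian(rho, theta, x0, delta_rho, delta_alpha, alpha, e)
            if abs(u - x) > support * sigma or abs(v - y) > support * sigma:
                continue                              # outside the essential support
            w = np.exp(-((u - x) ** 2 + (v - y) ** 2) / (2 * sigma ** 2))
            jac = (rho + delta_rho) * delta_alpha / e ** 2   # |J(rho, theta)|
            acc += w * I_polar[theta, rho] * jac
            norm += w * jac
    return acc / norm if norm > 0 else 0.0            # `norm` plays the role of 1/C
```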

3.2 Visualization application

In practice, the convolution filter f(x, y) is typically separable and is denoted by
f(x)g(y). Then, classical 1-D smoothing and derivation filters can be used.
For visualization applications of sonar-space filtering, we used Deriche's smoothing
function [4] with f(x) = g(x) = L(x):

L(x) = k2 (α sin(ω|x|) + ω cos(ω|x|)) e^(−α|x|) .     (4)

The conversion algorithm simultaneously performs a conversion to cartesian coordinates
and a smoothing of the data (whose amplitude can be adjusted with α), thus
producing a cartesian image with reduced speckle. Other smoothing functions could be
used instead.
It will be noted that the visualization quality is not significantly different, or better,
with sonar-space filtering than with the classical bilinear interpolation method. But our
objective is not to improve visualization, but rather to improve the automatic analysis of
echocardiographic sequences.

3.3 Edge detection application

For further automatic boundary tracking, our goal is to use spatio-temporal approaches [10].
A time-varying edge may be represented as a surface in 3-D space, in
which x and y are the two spatial dimensions (in the cartesian coordinate space) and t is
the temporal dimension. We modify Deriche's edge detector for this goal. Another approach
could be to generalize Deriche's detector with spatio-temporal functions as in [6].
We denote Gx and Gy the two spatial components of the gradient vector and I(x, y, t)
the 3-dimensional grey level function. Let D be the Deriche differentiation filter and L
the associated smoothing filter:

L(x) = k (α sin(ω|x|) + ω cos(ω|x|)) e^(−α|x|) ,     (5)

D(x) = k sin(ω x) e^(−α|x|) .     (6)

The two components of the gradient vector have the following expressions:

Gx = (Dx Ly Lt) ⊗ I(x, y, t)
Gy = (Lx Dy Lt) ⊗ I(x, y, t)

where the subscripts indicate along which axis the corresponding filter is
applied. Each component is obtained by differentiation in the associated direction and
filtering in the other spatial direction and in the temporal direction.
The norm of the gradient is defined by:

|G|(x, y, t) = √(Gx² + Gy²) .

The edges are obtained as local maxima of the gradient norm in the direction of the 2D
gradient vector. The temporal dimension is only used to smooth the result. This produces
a significant image enhancement in regions that are not moving too fast.
We denote αx, αy and αt the filtering parameters of the Deriche filters (cf. Eqs. 5
and 6) for the respective axes x, y and t. Since the 2D space is homogeneous, we can
choose αx = αy. The value of αt is independent and must be chosen according to the
temporal resolution.
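The separable spatio-temporal filtering can be sketched as below, where truncated FIR approximations of the impulse responses (5) and (6) are convolved along the t, y and x axes; the truncation radius, the small value of ω and the normalization of the smoothing kernel are illustrative choices, not those of the original implementation.

```python
import numpy as np
from scipy.ndimage import convolve1d

def deriche_kernels(alpha, omega=1e-3, radius=10):
    # Sampled (truncated) versions of the smoothing filter L (Eq. 5) and
    # the differentiation filter D (Eq. 6).
    xs = np.arange(-radius, radius + 1, dtype=float)
    L = (alpha * np.sin(omega * np.abs(xs))
         + omega * np.cos(omega * np.abs(xs))) * np.exp(-alpha * np.abs(xs))
    D = np.sin(omega * xs) * np.exp(-alpha * np.abs(xs))
    L /= L.sum()                      # unit DC gain for the smoother (our choice)
    return L, D

def spatiotemporal_gradient(I, alpha_s=1.0, alpha_t=2.0, omega=1e-3):
    """I is a (T, rows, cols) image sequence.  Returns (Gx, Gy, |G|)."""
    Ls, Ds = deriche_kernels(alpha_s, omega)
    Lt, _ = deriche_kernels(alpha_t, omega)
    smooth_t = convolve1d(I.astype(float), Lt, axis=0)      # temporal smoothing Lt
    # Gx = (Dx Ly Lt) * I ; Gy = (Lx Dy Lt) * I   (axes: 0 = t, 1 = y, 2 = x)
    Gx = convolve1d(convolve1d(smooth_t, Ls, axis=1), Ds, axis=2)
    Gy = convolve1d(convolve1d(smooth_t, Ds, axis=1), Ls, axis=2)
    return Gx, Gy, np.hypot(Gx, Gy)
```

Edge points would then be kept as local maxima of the returned norm along the 2D gradient direction, as described above.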

4 Temporal tracking

At this stage, we assume that we will work on ultrasound cartesian images and
on edges represented in a cartesian space, whatever the methods used to obtain this
information.
Our objective is to perform temporal tracking of a pre-selected anatomical structure
by combining different kinds of information. First, we want to obtain an approximate
segmentation of the structure by using simple deterministic processing. Secondly we
want to use the edges computed directly on the raw data. These will be combined by a
regularisation process that takes an initial segmentation and deforms it from its initial
position to make it better conform to the pre-detected edges. This approach is the idea
behind the use of deformable models.

4.1 Estimation of the boundaries of the anatomical structure

To obtain a crude estimation of the boundaries of anatomical structures, we use
techniques from mathematical morphology. The model of a cardiac cavity is very simple:
it is an ovoid region with low intensity. These regions cannot be obtained by simple
thresholding because of the speckle noise. But the fine structures of the speckle may be
easily suppressed by the following morphological operations [12]:

- A first order opening eliminates the small bright structures on a dark background.
- The dual operation (first order closing) suppresses the small dark structures.

After these operations, a simple thresholding gives an image C where all the cardiac
cavities are represented in white. This detection can be refined by the use of higher level
information. The specialist points out, using a computer mouse, the chosen cavity on the
first image of the sequence. The whole cavity is then obtained by a conditional dilatation
which begins at this point.
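A possible sketch of this processing chain with standard morphological operators is given below; the structuring-element size, the threshold and the function names are illustrative values only, not those of the original implementation.

```python
import numpy as np
from scipy import ndimage

def estimate_cavity(image, seed, size=5, threshold=60, n_dilations=200):
    """Crude cavity segmentation: grey-level opening then closing to
    suppress speckle, thresholding of the dark regions, and a
    conditional dilation grown from the point `seed` (row, col) picked
    by the specialist on the first image."""
    selem = np.ones((size, size), dtype=bool)
    smoothed = ndimage.grey_closing(ndimage.grey_opening(image, footprint=selem),
                                    footprint=selem)
    cavities = smoothed < threshold            # dark ovoid regions -> True
    # conditional dilation: grow the seed while staying inside the cavity mask
    region = np.zeros_like(cavities)
    region[seed] = cavities[seed]
    for _ in range(n_dilations):
        grown = ndimage.binary_dilation(region) & cavities
        if np.array_equal(grown, region):
            break
        region = grown
    return region
```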

4.2 Use of a deformable model
The previous operations usually provide an approximately-correct but locally-inaccurate
positioning of the structure boundaries. In order to improve this crude segmentation
to an accurate determination of the boundaries, we use the deformable models
of [3], in the spirit of [8].
The deformable model is initialized in the first image by the crude approximation of
the structure boundary. It evolves under the action of image forces, which are counterbalanced
by its own internal forces to preserve its regularity. Image forces are computed
as the derivative of an attraction potential related to the previously computed spatio-temporal
edges. Typically, the potential is inversely proportional to the distance of the
nearest edge point.
Deformable models may be used independently on each frame or iteratively on the
sequence: once the model has converged in the first frame, its final position is used as
the initial one in the next frame, and the process is repeated.
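For concreteness, one common semi-implicit iteration for such a deformable model, in the spirit of [8] and [3], is sketched below; the pentadiagonal internal-energy matrix, the potential derived from the distance to the nearest edge point and all parameter values are illustrative assumptions, not the software of L. Cohen and I. Cohen.

```python
import numpy as np

def snake_step_matrix(n, elasticity, rigidity, gamma):
    """Circulant internal-energy matrix for a closed contour of n points,
    returned as the semi-implicit step operator (A + gamma*I)^-1."""
    a, b = elasticity, rigidity
    row = np.zeros(n)
    row[0] = 2 * a + 6 * b
    row[1] = row[-1] = -a - 4 * b
    row[2] = row[-2] = b
    A = np.array([np.roll(row, i) for i in range(n)])
    return np.linalg.inv(A + gamma * np.eye(n))

def track_contour(contour, potential, elasticity=0.1, rigidity=0.1,
                  gamma=1.0, kappa=1.0, n_iter=200):
    """Deform an initial closed contour (n x 2 array of (row, col)) under
    image forces derived from `potential`, e.g. the distance to the
    nearest detected edge point.  Illustrative parameter values."""
    fy, fx = np.gradient(-potential.astype(float))   # image force = -grad(potential)
    step = snake_step_matrix(len(contour), elasticity, rigidity, gamma)
    c = contour.astype(float)
    for _ in range(n_iter):
        r = np.clip(c[:, 0], 0, potential.shape[0] - 1).astype(int)
        col = np.clip(c[:, 1], 0, potential.shape[1] - 1).astype(int)
        force = kappa * np.stack([fy[r, col], fx[r, col]], axis=1)
        c = step @ (gamma * c + force)               # semi-implicit update
    return c
```

For sequence tracking, the contour returned for one frame would simply be passed as the initial contour for the next frame, as described above.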

5 Experimental results for sonar-space filtering


This section gives the results obtained by bilinear interpolation and sonar-space
filtering evaluated in terms of the visualization and edge detection capabilities of the
methods.
A simple example concerns polar scanning of a thin dark structure (represented by
horizontal lines in polar data) on a white background. Fig. 5 presents the polar data: 512
rays of 128 pixels. The dark structure has a width of 3 rays. Fig. 6 presents edges obtained
on cartesian images reconstructed by bilinear interpolation and sonar-space filtering.
Because bilinear interpolation does not make use of all the available polar data, information
is lost which can never be retrieved by further processing such as edge detection or
segmentation.
To conclude this section with ultrasound data, one can see in Fig. 7 the reconstructed
image using sonar-space filtering with α = 1, and the same image with the detected edges
superimposed. The value of the parameter α is determined experimentally but is the
same for all ultrasound images provided by an echographic machine.

6 Experimental results for temporal tracking


We first note that temporal smoothing reduces some local distortions on the deeper
edges of the left auricle (compare the bottom right cavity of Fig. 8). Simultaneously,
temporal smoothing can cause a problem for the mitral valve (the middle thin structure of
Fig. 8), which is moving fast with respect to the temporal resolution. The strategy is thus
to use temporal smoothing only to study cavities, and to apply spatial gradient techniques
to study fast moving structures like the valves.
Secondly, we present the use of deformable models to analyse echocardiographic data
after the scan-correction process has been applied.
For structures moving slowly (heart cavities), deformable models may be applied iteratively
using an initialization process and the results of edge detection. The software
of L. Cohen and I. Cohen [3] requires three parameters. The elasticity and rigidity coefficients
model the properties of the cavity boundary curve. The third coefficient is a weight
representing the attraction of the edges. The results of this application of the deformable
model may be seen in the first frame of Fig. 12 for the segmentation of the left auricle.

Fig. 5. Polar data

Fig. 6. Left: edges on bilinear interpolation image. Right: sonar-space filtering

Fig. 7. Cartesian image and best edges obtained by sonar-space filtering



Fig. 8. Left: no temporal smoothing of edges, right: temporal smoothing of edges.

The result is then used to initialize the second frame. The parameters are the same and the
process is repeated sequentially through all frames, as can be seen in Fig. 12.
For structures moving fast (mitral valve), deformable models are applied independently
on each frame and the results of these applications may be seen in Fig. 11.
We summarize the advantages of using deformable models to analyze echocardio-
graphic data:

- Deformable models allow a compromise between an initial segmentation based on


grey levels and texture properties and an edge detection process performed directly
on raw data.
- The values of the parameters required by the deformable model are the same for both
the regularization application on a single frame and for the tracking application on
a sequence. They can be chosen interactively on the first image of the sequence.

The methods presented in this paper were applied to four different sequences obtained
from two different echographs. The data presented here were obtained in a polar coordinate
form on a VIGMED echograph at the Henri Mondor hospital in Creteil, France. A
sequence contains 38 images from a cardiac cycle. Fig. 9 shows a cartesian representation
of the original data. (Only one image in four is displayed.) The left heart cavities (auricle
and ventricle) and the mitral valve are visible in a typical image. Our aim is to track
them. Tracking of the latter structure is successfully achieved in this example due to the
fact that the edges were obtained from sonar-space filtering. Other methods (bilinear
interpolation followed by edge detection) generally do not give accurate edges for the
deep structures and cannot therefore be used for further temporal tracking. Edges are
shown in Fig. 10, and temporal tracking is presented in Figs. 11 and 12.

7 Conclusions

We showed in this paper the importance of using an appropriate conversion method


when dealing with images produced in polar coordinates. We introduced a new method
which computes both the conversion and a convolution of the polar data with a smoothing
filter in a single process. Using this approach, the quality of the edges and features that

Fig. 9. Data after scan-correction



Fig. 10. Edges

are extracted can be enhanced. This approach is more flexible because it allows a variable
level of smoothing to be chosen according to the actual resolution of the original data.
This is not the case when an additional smoothing is required after a conversion by other
algorithms. We showed the enhancement produced on edge detection by our approach.
Finally, we demonstrated the effectiveness of this approach by solving a complete
application. We used morphological operators to initialize a deformable model in the first
image of a time sequence. Then we applied our edge detector and we let the deformable

model converge toward the detected edges. Using the solution as an initialization in the
following image, we tracked the left auricle boundary in a sequence of 38 images.
Our future research will concentrate on the generalization of these methods to be
applied to 3-D ultrasound images produced in spherical coordinates.

8 Acknowledgements
We gratefully acknowledge Gabriel Pelle (INSERM, CHU Henri MONDOR, FRANCE)
for providing the data and for helpful discussions, and Robert Hummel for a significant
improvement of the final manuscript.
This work was partially supported by MATRA Espace and Digital Equipment Corporation.

References
1. N. Ayache, J.D. Boissonnat, L. Cohen, B. Geiger, J. Levy-Vehel, O. Monga, and P. Sander. Steps toward the automatic interpretation of 3-D images. In H. Fuchs, K. Hohne, and S. Pizer, editors, 3D Imaging in Medicine, pages 107-120. NATO ASI Series, Springer-Verlag, 1990.
2. A.J. Buda, E.J. Delp, J.M. Meyer, J.M. Jenkins, D.N. Smith, F.L. Bookstein, and B. Pitt. Automatic computer processing of digital 2-dimensional echocardiograms. In Amer. J. Cardiol., volume 51, pages 383-389, 1983.
3. L.D. Cohen and I. Cohen. A finite element method applied to new active contour models and 3D reconstruction from cross sections. In Proceedings of the International Conference on Computer Vision, Osaka, Japan, December 1990.
4. R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge detector. International Journal of Computer Vision, 1(2), May 1987.
5. F. Faure, J.P. Gambotto, G. Montserrat, and F. Patat. Space medical facility study. Technical report, ESA, 1988. Final report, 6961/86/NL/PB.
6. T. Hwang and J.J. Clark. A spatio-temporal generalization of Canny's edge detector. In 10th International Conference on Pattern Recognition, Atlantic City, New Jersey, USA, June 1990.
7. J.M. Jenkins, O. Qian, M. Besozzi, E.J. Delp, and A.J. Buda. Computer processing of echocardiographic images for automated edge detection of left ventricular boundaries. In Computers in Cardiology, volume 8, 1981.
8. Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. In Proceedings of the First International Conference on Computer Vision, pages 259-268, London, June 1987.
9. A. Macovski. Medical Imaging Systems. Prentice Hall, 1983.
10. O. Monga and R. Deriche. 3D edge detection using recursive filtering: application to scanner images. Technical Report 930, INRIA, November 1988.
11. V.L. Newhouse. Progress in Medical Imaging. Springer Verlag, 1988.
12. J. Serra. Image analysis and mathematical morphology. Academic Press, London, 1982.
13. T. Taxt, A. Lundervold, and B. Angelsen. Noise reduction and segmentation in time-varying ultrasound images. In 10th International Conference on Pattern Recognition, Atlantic City, New Jersey, USA, June 1990.
14. M. Unser, L. Dong, G. Pelle, P. Brun, and M. Eden. Restoration of echocardiograms using time warping and periodic averaging on a normalized time scale. In Medical Imaging, number III, 1989. January 29 - February 3, Newport Beach.
15. L.F. Zhang and E.A. Geiser. An approach to optimal threshold selection on a sequence of two-dimensional echocardiographic images. In IEEE Transactions on Biomedical Engineering, volume BME-29, August 1982.

Fig. 11. Temporal tracking of the mitral valve



Fig. 12. Temporal tracking of the left auricle


Figure-Ground Discrimination
by Mean Field Annealing*

Laurent Hérault¹ and Radu Horaud²

¹ CEA-LETI, avenue des Martyrs, 85X, 38041 Grenoble

² LIFIA-IRIMAG, 46, avenue Félix Viallet, 38031 Grenoble, FRANCE

Abstract. We formulate the figure-ground discrimination problem as a


combinatorial optimization problem. We suggest a cost function that makes
explicit a definition of shape based on interactions between image edges.
These interactions have some mathematical analogy with interacting spin
systems - a model that is well suited for solving combinatorial optimization
problems. We devise a mean field annealing method for finding the global
minimum of such a spin system and the method successfully solves for the
figure-ground problem.

1 Introduction and background

The problem of separating figure from ground is a central one in computer vision. One
aspect of this problem is the problem of separating shape from noise. Two-dimensional
shapes are the input data of high-level visual processes such as recognition. In order to
keep the complexity of recognition as low as possible it is important to determine at
an early level what is shape and what is noise. Therefore one needs a definition of shape,
a definition of noise, and a process that takes image elements as input and separates
them into shape and noise.
In this paper we suggest an approach whose goal is twofold: (i) it groups image
elements that are likely to belong to the same (locally circular) shape while (ii) noisy
image elements are eliminated. More precisely, the method that we devised builds a cost
function over the entire image. This cost function sums up image element interactions and
it has two terms, i.e., the first enforces the grouping of image elements into shapes and
the second enforces noise elimination. Therefore the shape/noise discrimination problem
becomes a combinatorial optimization problem, namely the problem of finding the global
minimum for the cost function just described. In theory, the problem can be solved by
any combinatorial optimization algorithm that is guaranteed to converge towards the
global minimum of the cost function.
In practice, we implemented three combinatorial optimization methods: simulated
annealing (SA), mean field annealing (MFA), and microcanonical annealing (MCA) [3].
Here we concentrate on mean field annealing.
The figure-ground or shape/noise separation is best illustrated by an example. Fig. 1
shows a synthetic image. Fig. 2 shows the image elements that were labelled "shape" by
the mean field annealing algorithm.
The interest in shape/noise separation stems from Gestalt psychologists' figure-ground
demonstrations [6]: certain image elements are organized to produce an emergent
figure. Ever since, the figure-ground discrimination problem has been seen as a side effect
of feature grouping. Edge detection is in general performed first. Edge grouping is done
* This research has been sponsored in part by "Commissariat à l'Energie Atomique," in part by
the ORASIS project, and in part by CEC through the ESPRIT-BRA 3274 (FIRST) project.

Fig. 1. A synthetic image with 1250 elements. Circles, a straight line, and a sinusoid are plunged
into randomly generated elements.

Fig. 2. The result of applying mean field annealing to the synthetic image. The elements shown
on this figure were labelled "shape".

on the basis of their connectivity or by using a clustering technique. Noise is eliminated


by thresholding.
The connectivity analysis produces edge chains which are further fitted with piecewise
analytic curves (lines, conics, splines, etc.). The clustering technique maps image edges
into a parameter space and such a parameter space is associated with each curve type:
this is the well known Hough transform. There are two problems with the techniques
just mentioned: (i) one has to specify analytically the type of curve or curves that are to
be sought in the image and this is done at a low-level of the visual process and (ii) the
notion of noise is not clearly specified with respect to the notion of shape.
Our work is related to the influential paper of Parent & Zucker [8], who introduced
the notion of cocircularity between two edge elements, and to the work of Sha'ashua &
Ullman [11], and Sejnowski & Hinton [10].
The work described here has two main contributions. First, we suggest a mathematical
encoding of the figure-ground discrimination problem that consists of separating shape
from noise using a combinatorial optimization method. Second, we suggest mean field
annealing, a deterministic global optimization method.

2 A combinatorial optimization formulation

We consider a particular class of combinatorial optimization problems for which the cost
function has a mathematical structure that is analogous to the global energy of a complex
physical system, that is, an interacting spin system. First, we briefly describe the state of
such a physical system and give the mathematical expression of its energy. We also show
the analogy with the energy of a recursive neural network. Second, we suggest that the
figure-ground discrimination problem can be cast into a global optimization problem of
the type mentioned above.
The state of an interacting spin system is defined by: (i) A spin state-vector of N
elements σ = [σ1, ..., σN] whose components are described by discrete labels which correspond
to up or down Ising spins: σi ∈ {−1, +1}. The components σi may well be viewed
as the outputs of binary neurons. (ii) A symmetric matrix J describing the interactions
between the spins. These interactions may well be viewed as the synaptic weights between
neurons in a network. (iii) A vector δ = [δ1, ..., δN] describing an external field in
which the spins are plunged.
Therefore, the interacting spin system has a "natural" neural network encoding associated
with it which describes the microscopic behaviour of the system. A macroscopic
description is given by the energy function which evaluates each spin configuration. This
energy is given by:

E(σ1, ..., σN) = −(1/2) Σ(i=1..N) Σ(j=1..N) Jij σi σj − Σ(i=1..N) δi σi     (1)

The main property of interacting spin systems is that at low temperatures the number
of local minima of the energy function grows exponentially with the number of spins.
Hence the correspondence between the mathematical model of interacting spin systems and
combinatorial optimization problems with many local minima is natural.
We consider now N image elements. Each such element has a label associated with
it, pi, which can take two values: 0 or 1. The set of N labels forms the state vector
p = [p1, ..., pN]. We seek a state vector such that the "shape" elements have a label
equal to 1 and the "noise" elements have a label equal to 0. If cij designates an interaction
between elements i and j, one may write, by analogy with physics, an interaction energy:

Esaliency(p) = − Σ(i=1..N) Σ(j=1..N) cij pi pj .

Obviously, the expression above is minimized when all the labels are equal to 1. In
order to avoid this trivial solution we introduce the constraint that some of the elements
in the image are not significant and therefore should be labelled "noise":

Econstraint(p) = (Σ(i=1..N) pi)² .

The function to be minimized could be something like the sum of these energies:

E(p) = Esaliency(p) + λ Econstraint(p)     (2)


In this expression λ is a positive real parameter that has to be adjusted and is closely
related to the signal-to-noise ratio. With the substitution pi = (σi + 1)/2 in eq. (2),
the function to be minimized is given by eq. (1) where:

Jij = (cij − λ)/2   and   δi = (Σ(j=1..N) cij − Nλ)/2 .
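In matrix form, this substitution amounts to the following small sketch (illustrative code only, with c the edgel interaction matrix of the next section and lam the parameter λ):

```python
import numpy as np

def spin_parameters(c, lam):
    """Build the spin couplings J and external field delta of Eq. (1)
    from the interaction matrix c and the noise parameter lambda,
    using the substitution p_i = (sigma_i + 1)/2 described above."""
    N = c.shape[0]
    J = (c - lam) / 2.0
    delta = (c.sum(axis=1) - N * lam) / 2.0
    return J, delta
```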

3 Computing image interactions

An image array contains two types of information: changes in intensity and local geometry.
Therefore the choice of the image elements mentioned so far is crucial. Edge elements,
or edgels, are the natural candidates for making explicit the two pieces of information
just mentioned.
An edgel can be obtained by one of the many edge detectors now available. An edgel
is characterized by its position in the image (xi, yi) and by its gradient computed once
the image has been low-pass filtered. The x and y components of the gradient vector are:

gx(xi, yi) = ∂If(xi, yi)/∂x   and   gy(xi, yi) = ∂If(xi, yi)/∂y .

If is the low-pass filtered image. From the gradient vector one can easily compute the
gradient direction and magnitude, θi and gi, e.g., Fig. 3:

θi = arctan(gy(xi, yi)/gx(xi, yi))   and   gi = (gx(xi, yi)² + gy(xi, yi)²)^(1/2)

Let i and j be two edgels. We want the interaction between these two edgels to encapsulate
the concept of shape. That is, if i and j belong to the same shape then their
interaction is high. Otherwise their interaction is low. Notice that a weak interaction
between two edgels has several interpretations: (i) i belongs to one shape and j belongs
to another one, (ii) i belongs to a shape and j is noise, or (iii) both i and j are noise.
The interaction coefficient must therefore be a co-shapeness measure. In our approach,
co-shapeness is defined by a combination of cocircularity, proximity, and contrast.
The definition of cocircularity is derived from [8] and it constrains the shapes to be
as circular as possible, or as a special case, as linear as possible. Proximity restricts
the interaction to occur between nearby edgels. As a consequence, cocircularity is
constrained to be a local shape property. The combination of cocircularity and proximity
will therefore allow a large variety of shapes that are circular (or linear) only locally.
Contrast enforces edgels with a high gradient module to have a higher interaction coefficient
than edgels with a low gradient module.
Following [8] and from Fig. 3 it is clear that two edgels belong to the same circle if
and only if: λi + λj = π. In this formula, λi is the angle made by one edgel with the line
joining the two edgels. Notice that a circle is uniquely defined if the relative positions
and orientations of the two edgels verify the equation above. This equation is also a local
symmetry condition consistent with the definition of local symmetry of Brady & Asada.
Moreover, linearity appears as a special case of cocircularity, namely when λi = 0 and
λj = π.

Fig. 3. The definition of cocircularity between two edgels (i and j).

From this cocircularity constraint we may derive a weaker constraint which will measure
the closeness of a two-edgel configuration to a circular shape: Δij = |λi + λj − π|.
Δij will vary between 0 (a perfect shape) and π (no shape). Finally the cocircularity
coefficient is allowed to vary between 1 for a circle and 0 for noise and is defined by the
formula:

cij^cocir = (1 − Δij²/π²) exp(−Δij²/k) .

The parameter k is chosen such that the cocircularity coefficient vanishes rapidly for non-circular shapes.
The surrounding world is not constituted only of circular shapes. Cocircularity must
therefore be a local property. That is, the class of shapes we are interested in detecting at a
given scale of resolution are shapes that can be approximated by a sequence of smoothly
connected circular arcs and straight lines. The proximity constraint is best described by
multiplying the cocircularity coefficient with a coefficient that vanishes smoothly as the
two edgels get farther away from each other:

cij^prox = exp(−dij²/(2σd²))

where dij is the distance between the two edgels and σd is the standard deviation of these
distances over the image. Hence, the edgel interaction will adjust itself to the image
distribution of the edgel population.
A classical approach to figure-ground discrimination is to compare the gradient value

at an edgel against a threshold and to eliminate those edgels that fall under this threshold.
An improvement of this simple-minded nonlinear filtering is to consider two thresholds
such that edgel connectivity is better preserved [1]. Following the same idea, selection of
shapes with high contrast can be enforced by multiplying the interaction coefficient with
a term whose value depends on contrast:

cij^contrast = gi gj / gmax²

where gmax is the highest gradient value over the edgel population. Finally the interaction
coefficient between two edgels becomes:

cij = cij^cocir · cij^prox · cij^contrast .
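The co-shapeness measure can be sketched as follows; the convention used to measure the angles λi and the value of k are illustrative choices rather than the exact ones used in the experiments.

```python
import numpy as np

def interaction(pi, ti, gi, pj, tj, gj, sigma_d, g_max, k=1.0):
    # Co-shapeness interaction c_ij between two edgels, combining the
    # cocircularity, proximity and contrast terms of this section.
    # pi, pj : (x, y) positions;  ti, tj : edgel (tangent) directions;
    # gi, gj : gradient magnitudes.
    dx, dy = pj[0] - pi[0], pj[1] - pi[1]
    d = np.hypot(dx, dy)
    line = np.arctan2(dy, dx)                    # direction of the joining line
    lam_i = (ti - line) % (2 * np.pi)            # angle edgel i makes with that line
    lam_j = (tj - line) % (2 * np.pi)
    diff = (lam_i + lam_j - np.pi) % (2 * np.pi)
    delta = min(diff, 2 * np.pi - diff)          # Delta_ij = |lambda_i + lambda_j - pi|
    cocir = (1.0 - delta ** 2 / np.pi ** 2) * np.exp(-delta ** 2 / k)
    prox = np.exp(-d ** 2 / (2.0 * sigma_d ** 2))
    contrast = gi * gj / g_max ** 2
    return cocir * prox * contrast
```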

4 Mean field annealing (MFA)

The states reachable by the system described by eq. (1) correspond to the vertices of an N-dimensional
hypercube. We are looking for the state which corresponds to the absolute
minimum of the energy function. Typically, when N = 1000, the number of possible
configurations is 2^N ≈ 10^301. The problem of finding the absolute minimum is complex
because of the large number of local minima of the energy function and hence this
problem cannot be tackled with local minimization methods (unless a good initialization
is available).
We already mentioned that the functional to be minimized has the same structure as
the global energy of an interacting spin system. To find a near ground state of such a
physical system we will use statistical methods. Two analyses are possible depending on
the interaction of the system with its environment: either the system can exchange heat
with its environment (the case of the canonical analysis) or the system is isolated (the case of
the microcanonical analysis) [3]. We will consider here the canonical analysis.
This analysis makes the hypothesis that the physical system can exchange heat with
its environment. At equilibrium, statistical thermodynamics shows that the free energy
F is minimized. The free energy is given by F = E − TS, where E is the internal
energy (the energy associated with the optimization problem) and S is the entropy (which
measures the internal disorder). Hence, there is a competition between E and S. At low
temperatures and at equilibrium, F is minimal and TS is close to zero. Therefore, the
internal energy E is minimized. However the minimum of E depends on how the temperature
parameter decreases towards absolute zero. It was shown that annealing is
a very good way to decrease the temperature.
We are interested in physical systems for which the internal energy is given by eq. (1).
The remarks above are expressed in the most fundamental result of statistical physics,
the Boltzmann (or Gibbs) distribution:

$$\Pr\big(E(\sigma) = E_i\big) = \frac{\exp\left(-E_i/(kT)\right)}{Z(T)} \qquad (4)$$

which gives the probability of finding a system in a state i with the energy $E_i$, assuming that the system is at equilibrium with a large heat bath at temperature T (k is Boltzmann's constant). Z(T) is called the partition function and is a normalization factor: $Z(T) = \sum_n \exp\left(-E_n/(kT)\right)$. This sum runs over all possible spin configurations. Using eq. (4) one can compute at a given temperature T the mean value over all possible configurations of some macroscopic physical parameter A:
eq. (4) one can compute at a given temperature T the mean value over all possible
configurations of some macroscopic physical parameter A:

$$\langle A \rangle = \sum_n A_n \Pr\big(E(\sigma) = E_n\big) = \sum_n A_n\,\frac{\exp\left(-E_n/(kT)\right)}{Z(T)} \qquad (5)$$

Unfortunately, the partition function Z(T) is usually impossible to compute. Nevertheless, when the system is described by eq. (1) one can use eq. (5) which, with an additional hypothesis, is a basic equation for mean field approximation and for mean field annealing.
In order to introduce the mean field annealing algorithm, we first introduce the somewhat more classical mean field approximation method which has been used to solve optimization problems [7, 9, 4]: it is a simple analytic approximation of the behaviour of interacting spin systems in thermal equilibrium. We start by developing eq. (1) around $\sigma_i$:
$$E(\sigma) = -\sigma_i\Big(\sum_{j=1,\,j\neq i}^{N} J_{ij}\,\sigma_j + \delta_i\Big) - \frac{1}{2}\sum_{k=1,\,k\neq i}^{N}\ \sum_{j=1,\,j\neq i}^{N} J_{kj}\,\sigma_k\sigma_j - \sum_{k=1,\,k\neq i}^{N}\delta_k\,\sigma_k$$
$\Phi_i$ is the total field which affects the spin $\sigma_i$:
$$\Phi_i = \sum_{j=1,\,j\neq i}^{N} J_{ij}\,\sigma_j + \delta_i \qquad (6)$$
The mean field $\langle\Phi_i\rangle$ affecting $\sigma_i$ is computed from the sum of the fields created on the spin $\sigma_i$ by all the other spins "frozen" in their mean states, and of the external field $\delta_i$ viewed by the spin $\sigma_i$. The mean state of a spin, $\langle\sigma_i\rangle$, is the mean value of the $\sigma_i$'s computed over all possible states that may occur at the thermal equilibrium. We obtain:
$$\langle\Phi_i\rangle = \sum_{j=1,\,j\neq i}^{N} J_{ij}\,\langle\sigma_j\rangle + \delta_i$$

We introduce now the following approximation [12]: the system composed of N interacting spins is viewed as the union of N systems, each composed of a single spin. Such a single-spin system $\{\sigma_i\}$ is subject to the mean field $\langle\Phi_i\rangle$ created by all the other single-spin systems. Let us study such a single-spin system. It has two possible states: $\{-1\}$ or $\{+1\}$. The probability for the system to be in one of these states is given by the Boltzmann distribution law, eq. (4):

$$P\big(X_i = \sigma_i^0\big) = \frac{\exp\left(\langle\Phi_i\rangle\,\sigma_i^0/T\right)}{\exp\left(\langle\Phi_i\rangle/T\right) + \exp\left(-\langle\Phi_i\rangle/T\right)}\,,\qquad \sigma_i^0 \in \{-1, 1\} \qquad (7)$$
$X_i$ is the random variable associated with the value of the spin state. Notice that in the case of a single-spin system the partition function (the denominator of the expression above) has a very simple analytical expression. By combining eq. (5), (6), and (7), the mean state of $\sigma_i$ can now be easily derived:
$$\langle\sigma_i\rangle = \frac{(+1)\exp\left(\langle\Phi_i\rangle/T\right) + (-1)\exp\left(-\langle\Phi_i\rangle/T\right)}{\exp\left(\langle\Phi_i\rangle/T\right) + \exp\left(-\langle\Phi_i\rangle/T\right)} = \tanh\left(\frac{\sum_{j=1}^{N} J_{ij}\,\langle\sigma_j\rangle + \delta_i}{T}\right) \qquad (8)$$

We consider now the whole set of single-spin systems. We therefore have N equations
of the form:
$$\mu_i = \tanh\left(\frac{\sum_{j=1}^{N} J_{ij}\,\mu_j + \delta_i}{T}\right) \qquad (9)$$
where $\mu_i = \langle\sigma_i\rangle$. The problem of finding the mean state of an interacting spin system at thermal equilibrium is now mapped into the problem of solving a system of N coupled

non-linear equations, i.e., eq. (9). In the general case, an analytic solution is rather difficult to obtain. Instead, the solution for the vector $\mu = [\mu_1, \dots, \mu_N]$ may well be the stationary solution of the following system of N differential equations:
$$\tau\,\frac{d\mu_i}{dt} = \tanh\left(\frac{\sum_{j=1}^{N} J_{ij}\,\mu_j + \delta_i}{T}\right) - \mu_i \qquad (10)$$
where $\tau$ is a time constant introduced for homogeneity. In the discrete case the temporal derivative term can be written as:
$$\left(\frac{d\mu_i}{dt}\right)_{t_n} = \frac{\mu_i^{n+1} - \mu_i^n}{\Delta t} + o(\Delta t) \qquad (11)$$
where $\mu_i^n$ is the value of $\mu_i$ at time $t_n$. By substituting in eq. (10) and by choosing $\tau = \Delta t$, we obtain an iterative solution for the system of differential equations described by eq. (10):
$$\forall i \in \{1,\dots,N\},\qquad \mu_i^{n+1} = \tanh\left(\frac{\sum_{j=1}^{N} J_{ij}\,\mu_j^n + \delta_i}{T}\right),\qquad n \geq 1 \qquad (12)$$
where $\mu_i^n$ is an estimate of $\langle\sigma_i\rangle$ at time $t_n$, i.e., at the n-th iteration.


Starting with an initial solution $\mu^0 = [\mu_1^0, \dots, \mu_N^0]$, convergence is reached at the n-th iteration such that $\mu^n = [\mu_1^n, \dots, \mu_N^n]$ becomes stationary. The physicists would say that the thermal equilibrium has been nearly reached. When the vector $\delta$ is the null vector, one can start with a vector $\mu^0$ close to the obvious unstable solution $[0,\dots,0]$. In practice, even when the vector $\delta$ is not null, one starts with an initial configuration obtained by adding noise to $[0,\dots,0]$. For instance the $\mu_i^0$'s are chosen randomly in the interval $[-10^{-5}, +10^{-5}]$. This initial state is plausible for an interacting spin system in a heat bath at high temperature. In fact the spin values (+1 or -1) are equally likely at high temperatures and hence $\langle\sigma_i\rangle = 0$ for every spin i. During the iterative process, the $\mu_i$'s converge to values between -1 and +1.
Two convergence modes are possible:

- Synchronous mode. At each step of the iterative process all the $\mu_i^n$'s are updated using the $\mu_j^{n-1}$'s previously calculated.
- Asynchronous mode. At each step of the iterative process a spin $\mu_i^n$ is randomly selected and updated using the $\mu_j^{n-1}$'s.

In practice, the asynchronous mode produces better results because the convergence process is less subject to the oscillations frequently encountered in synchronous mode. In order to obtain a solution for the vector $\sigma$ from the vector $\mu$, one simply looks at the signs of the $\mu_i$'s. A positive sign implies that the probability that the corresponding spin has a value of +1 is greater than 0.5: if $\frac{1}{2}(1 + \mu_i) > 0.5$ then $\sigma_i = +1$, else $\sigma_i = -1$.
A practical difficulty with mean field approximation is the choice of the temperature
T at which the iterative process must occur. To avoid such a choice one of us [5] and
other authors [13] have proposed to combine the mean field approximation process with
an annealing process giving rise to mean field annealing. Hence, rather than fixing the
temperature, the temperature is decreased during the convergence process according to
two possible annealing schedules:

- Initially the temperature has a high value and as soon as every spin has been updated
at least once, the temperature is decreased to a smaller value. Then the temperature
continues to slightly decrease at each step of the convergence process. This does not
guarantee that a near equilibrium state is reached at each temperature value but
when the temperature is small enough then the system is frozen in a good stable
state. Consequently, the convergence time is reduced since at low temperatures the
convergence to a stationary solution is accelerated. This strategy was successfully
used to solve hard NP-complete graph combinatorial problems [4], [5];
- Van den Bout & Miller [13] tried to estimate the critical temperature $T_c$. At this temperature, some of the mean field variables $\mu_i$ begin to move significantly towards either -1 or +1. Hence their strategy consists of performing two sets of iterations: one iteration process at this critical temperature until a near-equilibrium state is reached, and another iteration process at a temperature value that is close to 0. However, the critical temperature is quite difficult to estimate.
We currently use the first of the annealing schedules described above.
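The procedure of this section can be summarized compactly. The sketch below (a minimal illustration, not the authors' implementation) assumes the couplings J, for instance built from the interaction coefficients above, and the external field delta are given; it implements eq. (12) with asynchronous updates and the first annealing schedule. The starting temperature, the cooling factor and the stopping temperature are illustrative choices.

```python
import numpy as np

# Sketch of mean field annealing: eq. (12), asynchronous sweeps, first schedule.
# J is the symmetric interaction matrix, delta the external field; T0, the cooling
# factor and T_min are assumptions, not the paper's settings.

def mean_field_annealing(J, delta, T0=10.0, cooling=0.97, T_min=1e-3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(delta)
    mu = rng.uniform(-1e-5, 1e-5, size=n)    # noisy start near the unstable point [0,...,0]
    T = T0
    while T > T_min:
        for i in rng.permutation(n):         # asynchronous update of randomly ordered spins
            mu[i] = np.tanh((J[i] @ mu + delta[i]) / T)
        T *= cooling                         # slight decrease at each sweep
    return np.where(mu >= 0.0, 1, -1)        # sigma_i = +1 iff (1 + mu_i)/2 > 0.5
```

At the end, the sign of each $\mu_i$ gives the figure/ground decision for the corresponding edgel, exactly as described above.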

5 Example and discussion


We tested these algorithms over a wide variety of images. The images are preprocessed
as follows. Edges are first extracted using the Canny/Deriche operator [2]. The tangent
direction associated with such an edge is computed by fitting a straight line in the least-
square sense to a small set of connected edges. Then this small set of edges is replaced
by an edgel, i.e., the fitted line. The position of the edgel is given by its midpoint, its
direction is given by the direction of the line, and its contrast is given by the average
contrast of the edges forming the edgel. Fig. 4 shows the input data of our example.
Fig. 2 and Fig. 5 show the result of applying mean field annealing.

Fig. 4. Set of edgels obtained with no noise elimination. Fig. 5. The result of applying MFA to the image on the left.

In this paper we attacked the problem of figure-ground discrimination with special


emphasis on the problem of separating image data into curve and noise. We proposed
a global approach using combinatorial optimization. We suggested a mathematical en-
coding of the problem which takes into account such image properties as cocircularity,
proximity, and contrast and this encoding fits the constraints of the statistical modelling
of interacting spin systems.
We conclude that the interacting spin model is well suited for encoding the figure-
ground problem. Moreover, the analogy of the energy of such a model with the energy of

Fig. 6. Evolution of the $\mu_i$ variables in MFA for the image above.

a recursive neural network allows one to assert that the MFA algorithm proposed here is
implementable on a fine-grained parallel machine. In such an implementation, a processor
is associated with an edgel (or a spin, or a neuron) and each processor communicates
with all the other processors.

References
1. J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-8(6):679-698, November 1986.
2. R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge de-
tector. International Journal of Computer Vision, 1(2):167-187, 1987.
3. L. Hérault and R. Horaud. Figure-ground discrimination: a combinatorial optimization
approach. Technical Report RT 73, LIFIA-IMAG, October 1991.
4. L. Hérault and J.J. Niez. Neural Networks and Graph K-Partitioning. Complex Systems,
3(6):531-576, December 1989.
5. L. Hérault and J.J. Niez. Neural Networks and Combinatorial Optimisation: A Study of
NP-Complete Graph Problems. In E. Gelenbe, editor, Neural Networks: Advances and
Applications, pages 165-213. North Holland, 1991.
6. W. Köhler. Gestalt Psychology. Meridian, New York, 1980.
7. H. Orland. Mean field theory for optimization problems. Journal de Physique Lettres, 46:L-
763-L-770, 1985.
8. P. Parent and S.W. Zucker. Trace Inference, Curvature Consistency, and Curve Detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):823-839, August
1989.
9. C. Peterson. A new method for mapping optimization problems onto neural networks.
International Journal of Neural Systems, 1(1):3-22, 1989.
10. T. J. Sejnowski and G. E. Hinton. Separating Figure from Ground with a Boltzmann
Machine. In Michael Arbib and Allen Hanson, editors, Vision, Brain, and Cooperative
Computation, pages 703-724. MIT Press, 1988.
11. A. Sha'ashua and S. Ullman. Structural Saliency: The Detection of Globally Salient Struc-
tures Using a Locally Connected Network. In Proc. IEEE International Conference on
Computer Vision, pages 321-327, Tampa, Florida, USA, December 1988.
12. H.E. Stanley. Introduction to Phase Transitions and Critical Phenomena. Oxford Univer-
sity Press, 1971.
13. D.E. Van den Bout and T. K. Miller. Graph Partitioning Using Annealed Neural Networks.
In Int. Joint Conf. on Neural Networks, pages 521-528, Washington D.C., June 1989.
Deterministic Pseudo-Annealing:
Optimization in Markov-Random-Fields
An Application to Pixel Classification

Marc Berthod 1, Gérard Giraudon 1, Jean Paul Stromboni 2


1 INRIA, Sophia-Antipolis, BP109, F-06561 Valbonne - Tel.: (33) 936 57857
e-mail berthod@sophia.inria.fr, e-mail giraudon@sophia.inria.fr
2 Université de Nice Sophia-Antipolis,
Laboratoire Signaux et Systèmes, URA 1376 CNRS,
41, Boulevard Napoléon III, 06041 Nice, France
e-mail strombon@puce.inria.fr
Abstract. We present in this paper a new deterministic and massively parallel algorithm for combinatorial optimization in a Markov Random Field. This algorithm is an extension of previous relaxation labeling by optimization algorithms. First, the a posteriori probability of a tentative labeling, defined in terms of a Markov Random Field, is generalized to continuous labelings. This merit function of probabilistic vectors is then convexified by changing its domain. Global optimization is performed, and the maximum is tracked down while the original domain is restored. On an application to contextual pixel quantization, it compares favorably to recent stochastic (simulated annealing) or deterministic (graduated non-convexity) methods popularized for low-level vision.

1 Introduction

Since the seminal paper of Geman [6], which popularized the Hammersley-Clifford theorem, Markov Random Fields (M.R.F.) have been increasingly used for the last few years for many low-level tasks in image processing and interpretation, and many heuristic algorithms have been proposed to solve them: iterated conditional modes [2], simulated annealing [6], dynamic programming [4], etc.
Starting from Relaxation Labeling [5], we propose here Deterministic Pseudo-Annealing (D.P.A.), a variation on annealing which shares some common flavors with mean-field approximation [8], as well as Graduated Non-Convexity [3]. The basic idea is to extend the probability of a labeling (a function defined on a discrete set) to a merit function defined on continuous labelings (a subset of $\mathbb{R}^{NM}$): a polynomial with non-negative coefficients. The only extrema of this function, under suitable constraints, occur for discrete labelings.
D.P.A. consists of changing the constraints so as to convexify this function, find its
unique global maximum, and then track down the solution, by a continuation method,
until the original constraints are restored, and a discrete labeling can be obtained.
We describe in Section 2 the optimization scheme. In Section 3, we relate an applica-
tion to image quantization, or segmentation, with comparisons with other methods.

2 Deterministic Pseudo-Annealing: the Method

Let $S = \{S_i,\ 1 \leq i \leq N\}$ be a set of sites (pixels in this paper), each of which may take any label from 1 to M. A global discrete labeling L assigns one label $L_i$ to each site $S_i$

in S. We assume that the a priori probability of L is modeled by an M.R.F., defined by a graph G. An edge $E_{ij}$ of G connects the sites $S_i$ and $S_j$ (4-connected pixels in this paper), and $V_i$ is the set of sites connected to $S_i$. C is the set of all the cliques c of G. Also $C_i = \{c : S_i \in c\}$. The number of sites in a clique is its degree, deg(c), and $\deg(G) = \max_{c\in C} \deg(c)$.
The restriction of L to the sites of a given clique c is denoted by $L_c$. The M.R.F. is completely determined by the clique potentials $V_{cL}$ (shorthand for $V_c(L_c)$) for $c \in C$ and $L \in \mathcal{L}$, where $\mathcal{L}$ is the set of the $M^N$ discrete labelings. Thus, following Hammersley-Clifford, and assuming the positivity condition P(L) > 0:

$$P(L) = \frac{\prod_{c\in C} \exp\left(V_{cL}\right)}{Z} \qquad (1)$$
where Z is the partition function.
Let $y_i$ be the grey-level (for example) of pixel $S_i$, and $Y = (y_1 \dots y_N)^t$.
Following Bayes, the problem at hand is to find L which maximizes the a posteriori probability P(L/Y), given by P(L/Y) = P(Y/L)P(L)/P(Y). Actually P(Y) may easily be dropped (as Z in equation 1), as it does not depend on L. Besides, it is reasonable (at least commonplace) to assume that: $P(Y/L) = \prod_{i=1}^{N} P(y_i/L) = \prod_{i=1}^{N} P(y_i/L_i)$.
So, we may write:
$$P(L/Y) \simeq \prod_{i=1}^{N} P(y_i/L_i)\ \prod_{c\in C} \exp\left(V_{cL}\right) \qquad (2)$$

(from now on, $\simeq$ denotes equality up to a constant scale factor). Thus P(L/Y) also derives from an M.R.F., obtained by incorporating cliques of order 1 corresponding to the $P(y_i/L_i)$'s, and the problem at hand is strictly equivalent to maximizing:
$$f(L) = \sum_{c\in C} W_{cL} \qquad (3)$$
where the $W_{cL}$ are the clique potentials of this new M.R.F. (the logarithms of the corresponding factors in eq. (2)).

It is important to notice that the W's can always be made positive by shifting, without
changing the solution.
We propose to cast this combinatorial optimization problem into a more comfortable
maximization problem in a compact subset of $\mathbb{R}^{NM}$. Let $f : \mathbb{R}^{NM} \to \mathbb{R}$ be defined by:
$$f(X) = \sum_{c\in C}\ \sum_{l_c\in\mathcal{L}_c} W_{c\,l_c} \prod_{j=1}^{\deg(c)} x_{c_j,\,l_{c_j}} \qquad (4)$$
where $c_j$ denotes the j-th site of clique c, $l_{c_j}$ the label assigned to this site by $l_c$, and the inner sum runs over the labelings $\mathcal{L}_c$ of the clique. f is a polynomial in the $x_{i,k}$'s, linear in each $x_{i,k}$; its degree is the maximum degree of the cliques.
Let us now restrict X to $\mathcal{P}^{NM}$, defined by:
$$\forall i,k :\ x_{i,k} \geq 0 \qquad \&\qquad \forall i :\ \sum_{k=1}^{M} x_{i,k} = 1$$

It turns out that, generically, if $X^*$ is a maximum of f on $\mathcal{P}^{NM}$ then it is on the border, i.e.:
$$\forall i,\ \exists k :\ x^*_{i,k} = 1,\quad \forall l \neq k :\ x^*_{i,l} = 0 \qquad (5)$$

Thus, any maximum of f on $\mathcal{P}^{NM}$ directly yields a discrete labeling, and the absolute
maximum of f (which has many local maxima) yields the solution to our problem.
The basic idea in D.P.A. is to maximize f on a subset on which it is concave, and to track the maximum while slowly restoring the original subset. Let $\mathcal{Q}^{NM}_d$ be the compact subset of $\mathbb{R}^{NM}$ defined by:
$$\forall i,k :\ x_{i,k} \geq 0 \qquad \&\qquad \forall i :\ \sum_{k=1}^{M} x_{i,k}^{\,d} = 1$$

It can be proven that f admits a unique maximum on $\mathcal{Q}^{NM}_d$. When d = 2 and N = 1, this reduces to the Perron-Frobenius theorem on non-negative matrices.
Besides, it turns out that the iterative power method (for finding the unique non-negative eigenvector of a non-negative matrix) is here also very efficient.
Maximization is performed, starting from some $X^0$, by applying:
$$X^{n+1} \propto \left(\nabla f(X^n)\right)^{\frac{1}{d-1}} \qquad (6)$$

This simply means that, at each iteration, we select on the pseudo-sphere of degree
d the point where the normal is parallel to the gradient of f . Obviously, the only stable
point is singular, and thus is the maximum we are looking for. We have only proved
experimentally that the algorithm does converge very fast to this maximum.
This procedure, already suggested in [1], yields a maximum which, as in the case d = 2, is inside $\mathcal{Q}^{NM}_d$ (degeneracies apart), and thus does not yield a discrete labeling. So we actually track down the solution, maximizing f on successive $\mathcal{Q}^{NM}_\beta$'s, with $\beta$ decreasing from d to 1, starting from the last maximum.
This iterative decrease of $\beta$ can be compared, up to a point, to a cooling schedule, or
better to a Graduated Non-Convexity strategy.
It is important to notice that, though shifting the coefficients does not change the
discrete problem nor the maximization problem on $\mathcal{P}^{NM}$, it changes it on $\mathcal{Q}^{NM}_d$, and thus there is no guarantee that the same solution is reached. Besides, it is not guaranteed
that the process converges toward the global optimum; actually, it is not difficult to build
simple counterexamples on toy problems. Experiments show nevertheless that, on real
problems, a very good solution is reached.
Finally, experiments have shown that the speed with which $\beta$ is decreased is not crucial: typically, 5 to 10 steps are enough to go from 2 to 1.
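As an illustration, the sketch below implements the D.P.A. iteration for the common case of unary and pairwise cliques (the case used in the next section). The $\beta$ schedule, the per-$\beta$ iteration count and the starting point are assumptions chosen for brevity, not the authors' settings.

```python
import numpy as np

# Sketch (under assumptions) of the D.P.A. iteration for unary potentials W1[i, k]
# and a single shared pairwise potential W2[k, l] on the edges (i, j) of the
# neighbourhood graph; all W's are assumed already shifted to be positive.

def dpa(W1, W2, edges, n_beta_steps=8, n_iter=20):
    n_sites, n_labels = W1.shape
    X = np.full((n_sites, n_labels), 1.0 / n_labels)      # arbitrary interior starting point

    neighbours = [[] for _ in range(n_sites)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    for beta in np.linspace(2.0, 1.001, n_beta_steps):     # slowly restore the original domain
        for _ in range(n_iter):
            grad = W1.copy()                               # d f / d x_{i,k}
            for i in range(n_sites):
                for j in neighbours[i]:
                    grad[i] += W2 @ X[j]
            grad /= grad.max(axis=1, keepdims=True)        # rescale to avoid overflow
            X = grad ** (1.0 / (beta - 1.0))               # point whose normal is parallel to
            X /= (X ** beta).sum(axis=1, keepdims=True) ** (1.0 / beta)   # the gradient
    return X.argmax(axis=1)                                # read off a discrete labeling
```

Raising the gradient to the power 1/(β - 1) and renormalizing is exactly the "normal parallel to the gradient" condition on the pseudo-sphere of degree β; as β approaches 1 the update becomes winner-take-all, which is how the discrete labeling emerges.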

3 An Application to Image Segmentation

We want to quantize an image into nicely connected areas, so that isolated pixels, or
small isolated areas with a grey level different from their background are eliminated. The
sites are the pixels, and the labels are the quantized grey-levels (typically 2 to 5). $P(y_i/L_i)$ is modeled by $N(m_{L_i}, \sigma)$. The clique potentials (corresponding to the observations) are the logs of these quantities, suitably shifted to become positive.
The World Model (cliques of order 2) favours similar classes for neighbouring pixels, and penalizes different labels. For example, if two neighbouring sites have the same label, the energy is 0, else it is -1. This actually means that the ratio between the a priori probability of having the same labels and the a priori probability of having different labels is exp(1), i.e. approximately 2.7.
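As a usage note, these potentials could be plugged into the dpa() sketch above roughly as follows; the Gaussian log-likelihood is kept up to an additive constant and both sets of W's are shifted to be positive, as required. The particular shift values are illustrative.

```python
import numpy as np

# Sketch: unary W's from log Gaussian likelihoods of the grey levels, pairwise W's
# a Potts-style bonus for equal labels (0 / -1 shifted by +1 to become 1 / 0).

def potentials(y, means, sigma):
    W1 = -0.5 * ((y[:, None] - means[None, :]) / sigma) ** 2   # log P(y_i | L_i), up to a constant
    W1 -= W1.min()                                             # shift to make every W positive
    W2 = np.eye(len(means))                                    # 1 for equal labels, 0 otherwise
    return W1, W2
```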
Figure 1 shows the results on an indoor scene, with five classes, of mean values 0, a, 2a, 3a, 4a, where a = 63.75 and σ = a/2.

Fig. 1. Example results with picture Desk: a - original indoor scene; b - five classes (22 iterations, 10 seconds).

Another example is a synthesized (128x128x8) noisy chessboard, obtained by corrupting a binary picture with noise at a -5dB S/N ratio (Fig. 2). It is used to illustrate a comparison with Graduated Non-Convexity (GNC) [3] for segmenting the image into two classes. The class mean values are 0 and 255, with σ = 255 in our method.
For every method, parameters are optimized to obtain the best result (window range, iterations, thresholding, etc.).
Visual results (as well as objective ones) are quite similar, but D.P.A. is 10 times faster (1 s on a Connection Machine). Objective results are much better than with other, faster methods (mean, median or anisotropic smoothing [7]).

4 Conclusion

The method presented here is a new deterministic alternative to recent stochastic meth-
ods for combinatorial optimization problems. It certainly is heavier than standard image
processing techniques, but compares favorably with these optimization methods. More
precisely, the experiments so far show that this method leads to results as good as G.N.C.,
another deterministic method, and is faster. It thus offers a better cost/performance
trade-off thml simulated annealing. Actually, we ran experiments on small graph label-
ing problems, and found out that the results were much better than realistic runs of
simulated annealing (i.e. with fast cooling schedules), and were not far from the ideal
solution (found by exhaustive search based on dynamic programming).
Determinism may well prove to be an important advantage, when a massive par-
allelization is realized. This has still to be investigated from a theoretical as well as

Fig. 2. Comparison between G.N.C. and D.P.A.: 1 - original image (S/B = -5 dB); 2 - G.N.C.; 3 - D.P.A.

practical point of view. Other application domains (stereo, graph matching) are also to
be considered.

References
1. M. Berthod. Definition of a consistent labeling as a global extremum. In Proceedings ICPR6,
Munich, pages 339-341, 1982.
2. J. Besag. On the statistical analysis of dirty pictures. Jl. Roy. Statis. Soc. B., 1986.
3. A. Blake. Comparison of the efficiency of deterministic and stochastic algorithms for visual
reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1:2-12, 1989.
4. H. Derin, H. Elliott, R. Cristi, and D. Geman. Bayes smoothing algorithms for segmentation
of binary images modeled by Markov random fields. IEEE Trans. on Pattern Analysis and
Machine Intelligence, Vol 6, 1984.
5. O. Faugeras and M. Berthod. Improving consistency and reducing ambiguity in stochastic
labeling: an optimization approach. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence, 3(4):412-423, 1981.
6. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721-
741, 1984.
7. P. Saint-Marc, J. Chen, and G. Medioni. Adaptive smoothing: A general tool for early vision.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(6):514-529, 1991.
8. J. Zerubia and R. Chellappa. Mean field approximation using compound Gauss-Markov ran-
dom field for edge detection and image restoration. Proc. ICASSP, Albuquerque, USA, 1990.

This article was processed using the LaTeX macro package with the ECCV92 style.


A Bayesian Multiple Hypothesis Approach to
Contour Grouping
Ingemar J. Cox 1 and James M. Rehg 2 and Sunita Hingorani 1

1 NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.


2 Dept. of Electrical and Computer Eng., Carnegie Mellon University, Pittsburgh, PA

Abstract.
We present an approach to contour grouping based on classical tracking
techniques. Edge points are segmented into smooth curves so as to min-
imize a recursively updated Bayesian probability measure. The resulting
algorithm employs local smoothness constraints and a statistical descrip-
tion of edge detection, and can accurately handle corners, bifurcations, and
curve intersections. Experimental results demonstrate good performance.

1 Introduction
The set of image contours produced by objects in a scene encode important information
about their shape, position, and orientation. Image contours arise from discontinuities in
the underlying intensity pattern, due to the interaction of surface geometry and illumi-
nation. A large body of work, from such areas as model-based object recognition [8] and
contour motion flow [7], depends critically on the reliable extraction of image contours.
The contour grouping problem [1, 10, 6, 12] involves assigning edge pixels produced by
an edge detector [4, 3] to a set of continuous curves. Associating edge points with contours
is difficult because the input data (from edge detectors) is noisy; there is uncertainty in the
position of the edge, there may be false and/or missing points, and contours may intersect
and interfere with one another. There are four basic requirements for a successful contour
segmentation algorithm. First, there must be a mechanism for integrating information in
the neighborhood of an edgel to avoid making irrevocable grouping decisions based on
insufficient data. Second, there must be a prior model for the smoothness of the curve
to base grouping decisions on. This model must have an intuitive parameterization and
sufficient generality to describe arbitrary curves of interest. Third, it must incorporate
noise models for the edge detector, to optimally incorporate noisy measurements, and
detect and remove spurious edges. And finally, since intersecting curves are common, the
algorithm must be able to handle these as well. We believe our algorithm is the first
unified framework to incorporate these four requirements.
We formulate contour grouping as a Bayesian multiple hypothesis "tracking" problem.
Our work is based on an algorithm originally developed by Reid [11] in the context
of military and surveillance tracking of multiple targets (aircraft) in the presence of
noise. The algorithm has three main components: A "dynamic" contour model that
encodes the smoothness prior, a measurement model that incorporates edge detector
noise characteristics, and a Bayesian hypothesis tree that encodes the likelihood of each
possible edge assignment and permits multiple hypotheses to develop in parallel until
sufficient information is available to make a decision.
Section (2) develops the Bayesian hypothesis tree, and associated probability calcu-
lations. Section (3) describes some important implementation details, and is followed by
a presentation of experimental results in Section (4). We conclude in Section (5) with a
discussion of directions for future work.

2 Multiple Hypothesis Framework for Contour Grouping


In this section we describe the Bayesian multiple hypothesis algorithm [11] and demon-
strate its application to the extraction of contours from an edge image. Figure (1) illus-
trates the principal conceptual components of the contour grouping algorithm. At each
iteration, there are a set of hypotheses (initially null), each one representing a different
interpretation of the edge points. Each hypothesis is a collection of contours and at each
iteration each contour predicts the location of the next edgel as the algorithm follows
the contour in unit increments of arc length. An adaptive search region is created about
each of these predicted locations as shown in Figure (2) and detailed in Section (3). Measurements are extracted from these surveillance regions and matched to predictions based on the statistical Mahalanobis distance. This matching process reveals ambiguities in the assignment of measurements to contours. As a result, the hypothesis tree grows another level in depth, a parent hypothesis generating a series of hypotheses, each being
a possible interpretation of the measurements. The probability of each new hypothesis is
calculated based on assumptions described later. Finally, a pruning stage is invoked to
constrain the exponentially growing hypothesis tree. This completes one iteration of the
algorithm.
With this overview in mind, a detailed description of the algorithm is now given, following [2, 9]. At iteration k we have a set of association hypotheses $\Omega^k$ obtained from the set of hypotheses $\Omega^{k-1}$ at iteration k - 1 and the latest set of measurements, $Z(k) = \{z_i(k)\}_{i=1}^{m_k}$, where $m_k$ is the number of measurements $z_i(k)$ at measurement interval k. The measurement set Z(k) is obtained by establishing a surveillance or search region around the predicted end point of each hypothesized contour. The precise description of this surveillance region is postponed until Section (3).
An association hypothesis, $Z^{k,t}$, groups together edgels originating from a single contour. We define $Z^{k,t} \equiv \{z_{i_1,t}(1), z_{i_2,t}(2), \dots, z_{i_k,t}(k)\}$ to be the set of all measurements originating from contour t. An association hypothesis is one possible interpretation of the current edgels. Due to ambiguity in interpreting edgels, there will be many association hypotheses. The resulting partitions are disjoint, however, since an edgel can only belong to a single contour. By means of a Bayesian multiple hypothesis tree, each hypothesis can be assigned a probability value through a recursive formula.
New hypotheses are formed by associating each measurement as either belonging to a previously known contour, or being the start of a new contour or a false alarm. In addition, for contours that are not assigned measurements, there is the possibility of termination of a contour. We define a particular global hypothesis at iteration k by $\Theta^k_m$. Let $\Theta^{k-1}_{l(m)}$ denote the parent hypothesis from which $\Theta^k_m$ is derived, and $\theta_m(k)$ denote the event that indicates the specific status of all contours postulated by $\Theta^{k-1}_{l(m)}$ at iteration k and the specific origin of all measurements received at iteration k. In general, $\theta_m(k)$ will consist of $\tau$ measurements from known contours, $\nu$ measurements from new contours, $\phi$ false alarms and $\chi$ terminated (or ended) contours from the parent hypothesis. Figure (2) illustrates a situation in which we have two known contours ($T_1$ and $T_2$) and three new measurements $\{z_1(k), z_2(k), z_3(k)\}$.
We can construct an event $\theta_m(k)$ by creating a hypothesis matrix in which known contours are represented by the columns of the matrix and the current measurements by the rows. A non-zero element at matrix position $c_{i,j}$ denotes that measurement $z_i(k)$ is contained in the validation region of contour $t_j$. In addition to the T known contours, the hypothesis matrix has appended to it a column 0 denoting false alarms and a column T + 1 denoting new contours. Hypothesis generation is then performed by picking one unit per row and at most one unit per column, except for the false-alarm and new-contour columns, where the number

of false alarms and new targets is not restricted. In this manner we impose the dual
constraints that (1) a measurement can originate from only one contour (disjointness)
and (2) that a contour produces only one measurement per iteration (this is guaranteed
by our adaptive search strategy).
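A small illustrative sketch (not the authors' implementation) of this hypothesis-generation step: given the validation matrix, it enumerates by brute force every assignment of the current measurements that respects the two constraints above.

```python
from itertools import product

# Labels: 0 = false alarm, 1..T = existing contour, T+1 = new contour.
# valid[i][t] is True when measurement i falls in the validation region of
# contour t.  Brute-force enumeration, for illustration only.

def enumerate_assignments(valid, n_contours):
    FALSE_ALARM, NEW_CONTOUR = 0, n_contours + 1
    options = []
    for row in valid:                                   # choices available to each measurement
        opts = [FALSE_ALARM, NEW_CONTOUR]
        opts += [t + 1 for t in range(n_contours) if row[t]]
        options.append(opts)

    hypotheses = []
    for assignment in product(*options):                # one choice per row (per measurement)
        used = [t for t in assignment if 1 <= t <= n_contours]
        if len(used) == len(set(used)):                 # a contour gets at most one measurement
            hypotheses.append(assignment)
    return hypotheses

# e.g. two known contours and three measurements, as in the situation of Fig. 2:
# enumerate_assignments([[True, False], [True, True], [False, True]], 2)
```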
Following [2, 9], we derive a recursive expression for the likelihood of a given segmentation hypothesis conditioned on a set of edge measurements. A given hypothesis at iteration k, $\Theta^k_m$, is composed of a current event and a previous hypothesis resulting from measurements up to and including iteration k - 1: $\Theta^k_m = \{\Theta^{k-1}_{l(m)}, \theta_m(k)\}$. We wish to calculate the probability $P\{\Theta^k_m \mid Z^k\} = P\{\theta_m(k), \Theta^{k-1}_{l(m)} \mid Z(k), Z^{k-1}\}$. Our derivation, presented in detail in [5], is based on two assumptions. First, the numbers of false alarms and new edgels are Poisson distributed with densities $\lambda_F$ and $\lambda_N$, respectively, while contour initiations are distributed uniformly. Second, measurements assigned to a given contour are Gaussian distributed, while false edge measurements are uniformly distributed over the surveillance volume. Under these conditions, we obtain [2]:
$$P\{\Theta^k_m \mid Z^k\} = \frac{1}{c}\,\lambda_F^{\phi}\,\lambda_N^{\nu}\,\prod_{i=1}^{m_k}\big[N_{t_i}(z_i(k))\big]^{\tau_i}\; P\{\Theta^{k-1}_{l(m)} \mid Z^{k-1}\}$$
where c is a normalization constant, $\tau_i$ indicates whether measurement $z_i(k)$ is associated with a known contour, and $N_{t_i}(z_i(k))$ is the Gaussian likelihood of that association.

3 Implementation
In this section, we briefly describe five important components of the segmentation algorithm: the structure and pruning of the hypothesis tree, contour and measurement models, a termination probability model, the surveillance volume, and a post-processing stage called contour fusion.
The hypothesis tree organizes segmentation alternatives and their associated like-
lihoods. Efficient implementation of this tree is critical to the practical success of the
algorithm. A single hypothesized contour may be present in many global hypotheses.
Rather than replicate this contour for each hypothesis with the associated memory and
computational overheads, Kurien [9] proposed the construction of a contour (or track)
tree, in which the root denotes the creation of a new contour and each branch denotes an
alternative measurement assignment. Each node in the global hypothesis tree contains a
set of pointers to track trees. Each set represents a different permutation of contour leaf
nodes from different contour trees, i.e. the global hypotheses enforce the assumptions of
disjoint partitions. The contour tree provides considerable savings, and is discussed in
detail in [9].
Pruning is an essential part of any practical contour segmentation algorithm. In our
implementation, pruning is based on a combination of the "N-scan-back" algorithm [9]
and a simple lower limit probability threshold. The "N-scan-back" algorithm assumes
that any ambiguity at iteration k is resolved by iteration k + N. Then, if hypothesis $\Theta^k$ at iteration k has m children, the sum of the probabilities of the leaf nodes is calculated for each of the m branches. The branch with the highest probability is retained
and all others are pruned. This gives the tree a particular structure: below the decision
node it has a depth of N, while above the node it has degenerated into a simple list of
assignments. In our experiments, N is set to 3. While this may seem quite small, previous
tracking results [9] suggest that even N = 2 can provide near optimum solutions. After
this procedure, the number of leaf nodes can still be very high. A second phase of pruning
removes all nodes whose probability is less than a lower limit, which is currently set to
0.01.
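The pruning rules just described can be sketched on a toy hypothesis tree as follows; the tree representation and the way probabilities are stored are assumptions made for illustration, and the 0.01 threshold mirrors the lower limit mentioned in the text.

```python
# Sketch of "N-scan-back" pruning plus the probability lower limit, on a simple
# node/children hypothesis tree.  Not the authors' data structures.

class Node:
    def __init__(self, prob=1.0):
        self.prob = prob
        self.children = []

def leaf_sum(node):
    """Total probability carried by the leaves below a node."""
    if not node.children:
        return node.prob
    return sum(leaf_sum(c) for c in node.children)

def n_scan_back(decision_node):
    """Keep only the child branch whose leaves carry the most probability."""
    best = max(decision_node.children, key=leaf_sum)
    decision_node.children = [best]

def prune_weak_leaves(node, threshold=0.01):
    """Second pruning phase: drop leaf hypotheses below the probability limit."""
    if node.children:
        for c in node.children:
            prune_weak_leaves(c, threshold)
        node.children = [c for c in node.children
                         if c.children or c.prob >= threshold]
```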

A key step in assigning probabilities to segmentation hypotheses is the computation of the likelihood that a given measurement originated from a certain contour. This likelihood computation depends on two things: a dynamic model that describes the evolution of the curve in the image, and a measurement model that describes how curves produce edgels. In our formulation, the curve state vector is $[x\ \dot{x}\ y\ \dot{y}]^t$ and its dynamics are described by a linear noise-driven acceleration model common in the tracking literature ([2], Chapter 2). The autocorrelation of the white Gaussian acceleration noise can be varied to model curves of arbitrary smoothness. Thus the tip of the contour as a function of arc length, t, is (x(t), y(t)) and has tangent $(\dot{x}(t), \dot{y}(t))$. Since many edge detectors provide gradient
information, we assume that the entire state vector is available for measurement. A
Kalman filter is then employed to estimate curve state and predict the location of edgels.
These predictions are combined with actual measurements to produce likelihoods.
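A minimal sketch of the kind of Kalman predict/update step described here, assuming a noise-driven constant-velocity model for the contour tip with state [x, dx, y, dy] and unit arc-length steps; the process and measurement noise levels q and r are illustrative, not the authors' values.

```python
import numpy as np

# Sketch: white-acceleration ("constant velocity") model for the contour tip,
# full state measured (position and tangent), Mahalanobis distance for gating.

def make_model(q=0.1, r=1.0, ds=1.0):
    F1 = np.array([[1.0, ds], [0.0, 1.0]])
    F = np.kron(np.eye(2), F1)                          # block-diagonal for (x,dx) and (y,dy)
    Q1 = q * np.array([[ds**3 / 3, ds**2 / 2], [ds**2 / 2, ds]])
    Q = np.kron(np.eye(2), Q1)                          # acceleration-noise covariance
    H = np.eye(4)                                       # entire state vector is measured
    R = r * np.eye(4)
    return F, Q, H, R

def predict(x, P, F, Q):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    S = H @ P @ H.T + R                                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                      # Kalman gain
    innov = z - H @ x
    d2 = innov @ np.linalg.inv(S) @ innov               # Mahalanobis distance used for matching
    return x + K @ innov, (np.eye(len(x)) - K @ H) @ P, d2
```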
Once the location of a given curve has been predicted by the Kalman filter and dis-
cretized to image coordinates, a surveillance region is employed to extract measurements.
A surveillance region is a set of concentric circles, of radius 1, √2, 2, √5, that are searched for edgels (the radii define discrete pixel neighborhoods). It is these measurements that
form segmentation hypotheses whose probabilities are computed as described earlier.
The probability of termination controls the distance over which a contour will be extended if no measurements are present. One may want this gap size to be tens of pixels. However, if N-scan is set to 3 the tree is always pruned such that the contour continues, i.e. the contour is never terminated. Making N-scan larger substantially increases the size of the tree. Instead, the probability of termination is set to $P_\chi = 1 - \exp(-m/\lambda_\chi)$, where $\lambda_\chi$ is a parameter set by the user and m is specific to each contour: it is the number of consecutive iterations in which no measurement was assigned to the contour.
Since the algorithm scans the edge image by "walking" along contours, it may en-
counter a new contour at any point along its length. When tracking begins in the interior
of a curve, it is usually partitioned, erroneously, into two or more segments sharing com-
mon boundary points. These multiple contours can be merged to recover the correct segmentation, compensating for the incorrect initial conditions. The Mahalanobis distance provides a simple contour merge test: two contours with state estimates $\hat{x}_i$ and $\hat{x}_j$ at a common boundary are merged if $dx_{ij}^{t}\,T_{ij}^{-1}\,dx_{ij} < \gamma$, where $dx_{ij} = \hat{x}_i - \hat{x}_j$, $T_{ij}$ is its covariance, and $\gamma$ is obtained from $\chi^2$ tables. This test is applied after the algorithm produces an initial segmentation.
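The merge test can be written directly from the expression above. Assuming the two estimates are independent, the covariance of their difference is taken as the sum of the individual covariances; that assumption and the 95% confidence level are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

# Sketch of the contour-fusion test: merge two contour state estimates at a
# shared boundary when their Mahalanobis distance is below a chi-square threshold.

def should_merge(x_i, P_i, x_j, P_j, level=0.95):
    dx = x_i - x_j
    T = P_i + P_j                        # covariance of the difference (independence assumed)
    gamma = chi2.ppf(level, df=len(dx))  # threshold from chi-square tables
    return dx @ np.linalg.inv(T) @ dx < gamma
```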
4 Experimental Results
It is very difficult to quantify the performance of a contour grouping algorithm because the definition of a "correct" segmentation is application-dependent. The experiments presented here are therefore qualitative, illustrating the performance and limitations of the algorithm. For information on parameter settings, the reader is directed to [5].
Figure (3) shows the contour groupings for various images of a fork, a computer
mouse and a reasonably complex cutting tool. The ends of the contours are denoted by
circles. For the fork, the curvature at the end of the handle was smooth enough to allow
the contour to continue around, while the curvature at the prongs of the teeth is too
great and the contour is broken. Note that corners, bifurcations and intersections are
all reasonably partitioned, even though these concepts are not explicitly represented in
the algorithm. The algorithm also correctly segments circular contours even though the
contour model is piecewise linear.
5 Discussion
We presented a grouping algorithm that finds the optimal Bayesian maximum a poste-
riori partitioning of edges into contours. It is based on two observations: First, grouping

decisions must be based on a prior model for curve smoothness. Second, the difficulty
of the segmentation problem can be expressed by sensor/environmental statistics such
as mean rates of false edgels and new contours. The resulting algorithm is physically
grounded, i.e. all free parameters are physical quantities of the sensor and/or scene that
can be physically measured. We believe the algorithm is the first unified framework to
incorporate a measurement noise model, scene statistics, optimal state estimation for
the contour model, statistical distance measures to quantify "closeness" and Bayesian
decision trees in a recursive formulation that is independent of the image sampling grid.
There are several avenues for future work. First, it would be useful to investigate other
contour models, such as a piecewise constant curvature model. It would also be interesting
to tightly couple the edge detection and contour grouping stages, with the latter providing
the former with expectations of where to look for edges. This unification may improve
upon standard techniques for curve enhancement, such as hysteresis [4]. Finally, we would
like to extend the Bayesian formulation to incorporate application specific segmentation
requirements. In a recognition scenario, for example, partial identification of the scene
could modify the prior probabilities associated with the contour grouping algorithm.
The authors would like to thank Y. Bar-Shalom, H. Durrant-Whyte, T. Kurien and
J. J. Leonard for valuable discussion on issues related to target tracking.
References

1. D. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recog-
nition, 13(2):111-122, 1981.
2. Y. Bar-Shalom and T. E. Fortmann. Tracking and Data Association. Academic Press,
1988.
3. R. A. Boie and I. J. Cox. Two dimensional optimum edge detection using matched and
Wiener filters for machine vision. In IEEE First International Conference on Computer
Vision, pages 450-456. IEEE, June 1987.
4. J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Analysis and
Machine Intelligence, 8(6):34-43, 1986.
5. I. J. Cox, J. M. Rehg, and S. Hingorani. A Bayesian multiple hypothesis approach to con-
tour grouping. Technical report, NEC Research Institute, Princeton, USA, 1991.
6. C. David and S. Zucker. Potentials, valleys, and dynamic global coverings. Int. J. Com-
puter Vision, 5(3):219-238, 1990.
7. E. Hildreth. Computation underlying the measurement of visual motion. Artificial Intelli-
gence, 27(3):309-355, 1984.
8. D. Kriegman and J. Ponce. On recognizing and positioning 3-d objects from image con-
tours. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(12):1127-1137, Decem-
ber 1990.
9. T. Kurien. Issues in the design of practical multitarget tracking algorithms. In Y. Bar-
Shalom, editor, Multitarget-Multisensor Tracking: Advanced Applications, pages 43-83.
Artech House, 1990.
10. A. Martelli. An application of heuristic search methods to edge and contour detection.
Communications of the ACM, 19(2):73-83, 1976.
11. D. B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic
Control, AC-24(6):843-854, December 1979.
12. A. Shashua and S. Ullman. Grouping contours by iterated pairing network. In Richard P.
Lippmann, John Moody, and David S. Touretzky, editors, Advances in Neural Information
Processing Systems 3. Morgan Kaufmann, 1991. Proc. NIPS'90, Denver CO.

This article was processed using the LaTeX macro package with the ECCV92 style.

Fig. 1. Outline of the contour grouping algorithm.
Fig. 2. Predicted contour locations, a surveillance region and statistical Mahalanobis (elliptical) regions.

Fig. 3. Contour groupings for (a) fork, (b) mouse and (c) cutter. Circles denote contour end points.
Detection of General Edges and Keypoints *

L. Rosenthaler 1, F. Heitger 1, O. Kübler 1 and R. von der Heydt 2

1 Communication Technology Laboratory, ETH-Zürich, CH-8092 Zürich
Email: rosenth@vision.ethz.ch
2 Department of Neurology, University Hospital Zürich, CH-8091 Zürich

Abstract. A computational framework for extracting (1) edges with an arbitrary profile function and (2) keypoints such as corners, vertices and terminations is presented. Using oriented filters with even and odd symmetry, we combine their convolution outputs into oriented energy, resulting in a unified representation of edges, lines and combinations thereof. We derive an "edge quality" measure which allows testing the validity of a general edge model. A detection scheme for keypoints is proposed, based on an analysis of oriented energy channels using differential geometry.

1 Introduction
The interpretation of static, monocular grey-valued images is usually based on the hy-
pothesis that the loci of strong intensity variation are tightly coupled to physical events
such as 3D discontinuities (foreground/background) or changes of surface orientation.
However, the corresponding intensity variations normally differ from ideal edges. They
often have complex profiles and may not be perfectly straight. Therefore, edge detection
has to be a truly 2D process and linear operators based on idealized edge-models (e.g.
Canny [4]) seem to be inadequate for detecting complex intensity distributions.
Perona & Malik [16] have pointed out that, in general, linear operators cannot correctly detect and localize edges and lines simultaneously. This problem becomes relevant in real images as many "natural" edges are neither a pure edge nor a line, but rather have complex intensity profiles. Second order non-linearities in the form of local energy may
provide a solution to the problem [6] [1] [13] [12] [15] [16].
Yet, energy models and their "linear" precursors are intrinsically one-dimensional.
They cannot account for another important class of image features: corners, vertices,
terminations, junctions, etc. These two-dimensional intensity variations indicate, for ex-
ample, strong variations in contour orientation, terminations occurring in occlusion sit-
uations and many other relevant 2D features.
In this paper we propose a dual processing scheme which emphasizes the detection
of 1D signal variations on the one hand, and of points of strong 2D variations on the
other. We present a method which allows (1) to derive a valid indicator for the presence
of 1D edges with arbitrary profiles and (2) to detect and localize complex 2D intensity
variations. We use the term general edge (GE) for regions of 1D intensity variation and the
term keypoint for points of strong 2D intensity variation. The concept of a general edge
allows to develop a fully 2D filter model based on linear filters which are polar separable
in the Fourier domain. Even and odd filter outputs are then combined to oriented energy.
Two aspects of our approach are new: (a) The use of a contrast independent measure
for deviations from a general edge; this enables us to limit the application of the edge
model to points that qualify as general edge points. The local maxima of oriented energy
* The research described in this paper has been supported by the Swiss National Science Foun-
dation, Grant no. 32-8968.86.

can then be used to localize the edges [16],[12]. (b) The use of differential geometry ap-
plied to oriented energy maps, yielding a representation of strong 2D intensity variations
(keypoints).
The work presented here was partially motivated by our interest in biological mech-
anisms of contour processing [20] [19] [17] [7].

2 General Edge, Orientation Filters and Local Energy


We define general edges (GE) to be image features with an arbitrary intensity variation in
one direction and constant intensity orthogonal to it. The spectrum of a GE is restricted
to a central slice in Fourier space (cf. [2]). This property of GEs makes orientation filters that are polar separable in the Fourier domain a natural choice, because they allow separating the filter responses into a term which depends on the profile of the GE and a term that depends only on the difference between filter and edge orientation. Using polar coordinates in the frequency plane, we define the 2-D filters $F_n(\nu, \phi) = H(\nu)\,\Omega_n(\phi)$, where $H(\nu)$ defines the radial bandpass characteristics and $\Omega_n$ (satisfying $\Omega_n(\phi) = \Omega_n(\phi + \pi)$) controls the orientation selectivity of the filters (n: orientation index). In this paper, we use $\Omega_n(\phi) = \cos^{2p}(\phi - \theta_n - \frac{\pi}{2})$, with $\theta_n$ defining the filter orientation.
With $f_n$ being the inverse Fourier transform of $F_n$, the convolution with a GE of orientation $\theta$ has the form $s(x,y) * f_n(x,y) = \Omega(\theta - \theta_n)\cdot g(\xi, \eta)$. Choosing $(\xi, \eta)$ as the rotated coordinate system with $\xi$ in the direction of the GE, $g(\xi, \eta)$ becomes a 1D function $g(\eta)$, which is the convolution of the 1D GE profile with the radial term of the filter $H(\nu)$. In other words, polar separable filters allow us to split up the convolution result with a GE into a term depending on orientation and a term depending on the profile of the GE. This reduces the design of the bandpass characteristics to a 1D problem.
Since we allow GEs to have arbitrary profiles, simple linear filtering (e.g. [11] [4])
cannot warrant correct localization. Local energy concepts, however, as proposed by
[6] [13] [12] [15], seem to overcome this deficit in that they unify the detection of edges,
lines and hybrid forms. Local energy requires for each orientation n a pair of even and
odd filters whose convolution output is combined by quadrature pair summation to form
local energy. A common method is to construct these filter pairs as Hilbert transforms
of each other [13] [16].
In general, however, Hilbert pairs do not guarantee a monomodal line response in local
energy. Gabor pairs, as proposed by Adelson et al. [1], show a monomodal line response
but have the drawback that the even Gabor does not integrate to zero. We modified the
Gabor scheme by introducing a frequency sweep such that both filters integrate to zero
(cf. [7]). The Fourier transform of these 1D filters provides the radial term $H(\nu)$ of the polar separable 2D filters. In the present paper, local energy is then defined as the square root of the sum of the squared responses of the odd and even filters.
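A sketch of computing one oriented-energy map with a polar-separable quadrature pair. The radial term below is a log-Gabor band-pass, used here as a stand-in for the swept-Gabor pair of the text (an assumption), while the angular term is the $\cos^{2p}$ tuning defined above; restricting the filter to one half of the Fourier plane makes the real and imaginary parts of its spatial response an even/odd pair, so the magnitude of the complex response is the local energy.

```python
import numpy as np

# Sketch (not the authors' exact filters): oriented energy for one channel theta_n.
# f0, sigma_f (log-Gabor radial term) and the half-plane trick are assumptions.

def oriented_energy(img, theta_n, f0=0.1, sigma_f=0.55, p=2):
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing='ij')
    nu = np.hypot(fx, fy)
    phi = np.arctan2(fy, fx)

    radial = np.exp(-np.log(np.maximum(nu, 1e-12) / f0) ** 2 / (2 * np.log(sigma_f) ** 2))
    radial[0, 0] = 0.0                                  # zero DC response
    angular = np.cos(phi - theta_n) ** (2 * p)          # cos^{2p} orientation tuning
    half = (np.cos(phi - theta_n) > 0)                  # one half-plane -> quadrature pair

    F = radial * angular * half
    response = np.fft.ifft2(np.fft.fft2(img) * F)       # real part: even filter, imag: odd filter
    return np.abs(response)                             # sqrt(even^2 + odd^2) = oriented energy
```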

3 Local Orientation and Contour Quality


Using a sufficient number of different orientations, it is possible to determine the exact orientation $\theta$ of a GE. Let $E_j$ be the energy response in orientation j and $E_{max}$ the maximum over all orientations. Now we define $Q(\phi)$ as
$$Q(\phi) = \sum_{j=0}^{N-1}\left(\frac{E_j(\mathbf{r})}{E_{max}(\mathbf{r})} - \frac{\Omega(|\phi - \theta_j|)}{\Omega\big(\min_{k=0,\dots,N-1}|\phi - \theta_k|\big)}\right)^{2} \qquad (1)$$
Since the filters are polar separable in the Fourier domain, it is possible to write $E_j = S\cdot\Omega(\theta - \theta_j)$ and, since $\Omega$ is a monomodal function, $E_{max} = S\cdot\Omega(\min_{k=0,\dots,N-1}|\theta - \theta_k|)$. Therefore $Q(\phi) = 0 \Leftrightarrow \phi = \theta$. Finding the exact orientation $\theta$ of a GE amounts to searching for $Q_{min} = \min_{\phi} Q(\phi)$, with $\phi_{min}$ being the angle where $Q(\phi) = Q_{min}$.

The value $Q_{min}$ is a measure of GE conformity. It is zero for a GE and increases with increasing deviation from a perfect GE. If $Q_{min}$ is greater than a threshold value $Q_{th}$ we consider the image structure to differ significantly from a GE. We take the local orientation as given by $\phi_{min}$ if $Q_{min} < Q_{th}$. Otherwise a unique orientation is not defined. Fig. 1 shows in the top row four samples of a trihedral vertex with increasing levels of additive Gaussian noise. The center row represents the values of $Q_{min}$ and $\phi_{min}$. The orientation of the lines corresponds to $\phi_{min}$, whereas the match with the general edge model is expressed by the line length $L = (1 + Q_{min})^{-1}$. L can be considered an estimate of "GE quality". Fig. 1 clearly shows how L decreases in the neighbourhood of a vertex. Noise also leads to a degradation of L. The bottom row shows the result of edge detection given by the local maxima in energy orthogonal to the orientation $\phi_{min}$. Edge processing stops where the GE model is no longer valid.
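A pixel-wise sketch of this orientation and edge-quality estimate: the N oriented-energy responses at one location are compared against the ideal angular tuning over a grid of candidate orientations, following the reconstruction of eq. (1) above; the sampling density of $\phi$ is an implementation choice.

```python
import numpy as np

# Sketch: given N energy responses E at one pixel, find phi_min and Q_min.
# Omega is the cos^{2p} tuning; filter orientations are assumed to be n*pi/N.

def edge_quality(E, p=2, n_phi=180):
    N = len(E)
    thetas = np.arange(N) * np.pi / N
    E_max = E.max() + 1e-12
    omega = lambda a: np.cos(a) ** (2 * p)

    phis = np.linspace(0.0, np.pi, n_phi, endpoint=False)
    # angular distance of each candidate phi to each filter orientation, folded to [0, pi/2]
    d = np.abs(((phis[:, None] - thetas[None, :]) + np.pi / 2) % np.pi - np.pi / 2)
    Q = ((E[None, :] / E_max - omega(d) / omega(d.min(axis=1, keepdims=True))) ** 2).sum(axis=1)

    i = Q.argmin()
    return phis[i], Q[i]          # phi_min, Q_min
```

The local edge orientation is accepted only when the returned Q_min lies below the threshold $Q_{th}$, as described above.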

Fig. 1. Edge quality maps. The top row shows a sample vertex, corrupted by increasing levels of noise (from left to right: no noise, 20dB, 10dB and 5dB SNR). The center row shows a plot of local orientation and edge quality, and the bottom row shows the result of edge detection. The computing was done on 128x128 pixel images to avoid border effects. Shown are 32x32 pixel cuts from the central part of the images.

4 Keypoints
In the previous section we have described a method of finding GEs using a measure of
quality. On the other hand, there is a class of important image features with pronounced
2D variation of intensity such as line endings, corners, junctions etc. (keypoints). In this
section we present a method for detecting these keypoints. It is based on the oriented
energy maps and does not rely on an explicit model of any particular 2D intensity dis-
tribution.
The basic idea is to exploit the fact that deviations from a GE result in changes
of local energy magnitude along the edge. Deviations may be induced by all sort of
image features such as (a) a loss of contrast (e.g. line ending), (b) two or more edges of
different orientation meeting at one point (e.g. corner, vertex) or (c) continuous changes
of orientation (curvature). Directional derivatives in the orientation of a contour seem to
be a straight-forward way for detecting keypoints. Local extrema of the first directional
derivatives would indicate features like line-endings, corners and junctions. For strong
curvature and blobs the second directional derivatives would be more appropriate.
On general edges, the derivatives along the edge orientation are zero. At keypoints
a unique orientation cannot be assigned and thus derivatives in a single orientation are
inadequate for representing such features. Using the property that oriented energy sepa-
rates orientational components, we propose to take for each energy channel the directional
derivatives parallel to its orientation. We expect keypoints to have local extrema in deriva-
tive magnitudes. However, the directional derivatives are non-zero, also on GEs, for all

orientations that differ from the edge orientation. We show that these "false responses" can be selectively eliminated by a compensation scheme that makes use of the systematic nature of these errors. This compensation scheme is based on derivatives orthogonal to each oriented energy channel. We use the terms p-derivative and o-derivative for directional derivatives parallel and orthogonal to the orientation of a channel, respectively.
Assuming N filters with orientations given by $\theta_n = \frac{n\pi}{N}$ and a GE with orientation $\theta$, we define $\mathbf{e}_n$ as the unit vector parallel to the filter orientation and $\mathbf{e}_{n,\perp}$ as the unit vector orthogonal to the filter orientation. At location $\mathbf{r}$ we define

$$P_n^{(1)}(\mathbf{r}) = \left|\frac{\partial E_n}{\partial \mathbf{e}_n}(\mathbf{r})\right| \quad\text{and}\quad P_n^{(2)}(\mathbf{r}) = \left[-\frac{\partial^2 E_n}{\partial \mathbf{e}_n^2}(\mathbf{r})\right]^{+},\quad\text{with}\quad [x]^{+} = \max(0, x) \qquad (2)$$
as the gradient magnitude parallel to the filter orientation (1st p-derivative) and as an estimate of the negative curvature of the magnitude of local energy along the filter orientation (2nd p-derivative). The latter corresponds to 'bumps' of local energy along the filter orientation.
Since we are not interested in local minima along oriented energy, $P_n^{(2)}(\mathbf{r})$ is defined to be zero for positive values of the second directional derivative. Fig. 2 shows oriented energy $E_n$ and its directional p-derivatives for a sample corner. Each column represents one orientation channel. With the above definitions, and in analogy to local energy, we may define a scalar keypoint map $\tilde{K}(\mathbf{r})$:
$$\tilde{K}(\mathbf{r}) = \max_{n=0,\dots,N-1}\sqrt{P_n^{(1)}(\mathbf{r})^2 + P_n^{(2)}(\mathbf{r})^2}$$
In Fig. 3 the raw keypoint map $\tilde{K}$ is depicted for a sample corner, a line ending and a T-junction. As mentioned above, the 1st and 2nd p-derivatives will be zero on a GE only if $(\theta - \theta_n) = 0$. For $(\theta - \theta_n) \neq 0$, 1st p-derivatives will be zero only at the exact location of a GE and 2nd derivatives will always be > 0, with a local maximum on the GE. Because of these "false responses" to GEs, $\tilde{K}$ is insufficient for detecting keypoints selectively. The systematic nature of these false responses allows the construction of a compensation map C(r).

Using the properties of a separable filter (orientation selectivity given by $\cos^{2p}(\theta - \theta_n)$) and the properties of GEs (the local energy of a GE is a GE too), it is easy to show the following relations:
$$P_n^{(1)}(\mathbf{r}) = s(\mathbf{r})\cos^{2p}(\theta - \theta_n)\,|\sin(\theta - \theta_n)|\,,\qquad P_n^{(2)}(\mathbf{r}) = s(\mathbf{r})\cos^{2p}(\theta - \theta_n)\,\sin^2(\theta - \theta_n) \qquad (3)$$
where $s(\mathbf{r})$ depends only on the profile of the GE and on the distance from its center. Eqn. (3) suggests using the directional derivatives orthogonal to the filter orientation as the compensation signal for the systematic error of $\tilde{K}$. In analogy to the p-derivatives we define the 1st and 2nd o-derivatives as
$$O_n^{(1)}(\mathbf{r}) = \left|\frac{\partial E_n}{\partial \mathbf{e}_{n,\perp}}(\mathbf{r})\right| \quad\text{and}\quad O_n^{(2)}(\mathbf{r}) = \left[-\frac{\partial^2 E_n}{\partial \mathbf{e}_{n,\perp}^2}(\mathbf{r})\right]^{+}$$
On a GE, the following relations hold:
$$O_n^{(1)}(\mathbf{r}) = s(\mathbf{r})\cos^{2p+1}(\theta - \theta_n) \quad\text{and}\quad O_n^{(2)}(\mathbf{r}) = s(\mathbf{r})\cos^{2p+2}(\theta - \theta_n) \qquad (4)$$
For all $\theta$ of a GE, the maxima of the 1st and 2nd o-derivatives are greater than the maxima of the 1st and 2nd p-derivatives. Therefore,
$$\sum_{k=0}^{N-1}\left(O_k^{(1)}(\mathbf{r}) + O_k^{(2)}(\mathbf{r})\right) \;\geq\; \sqrt{\left(P_n^{(1)}(\mathbf{r})\right)^2 + \left(P_n^{(2)}(\mathbf{r})\right)^2}\,,\qquad n = 0,\dots,N-1.$$

Using the sum over all orientations as compensation has the advantage that it is robust in discrete implementations, and we can define the compensation map as
$$\tilde{C}(\mathbf{r}) = \sum_{k=0}^{N-1}\left(O_k^{(1)}(\mathbf{r}) + O_k^{(2)}(\mathbf{r})\right) \quad\text{and}\quad K(\mathbf{r}) = \left[\tilde{K}(\mathbf{r}) - \tilde{C}(\mathbf{r})\right]^{+}$$
However, $\tilde{C}$ is not zero at keypoints (e.g. at a line ending), as can be seen in Fig. 3. Keypoints are characterized by the fact that the orientation distribution of local energy differs significantly from the distribution on a GE. This fact can be used to implement a correction mechanism by combining orthogonal pairs of $O_k^{(2)}(\mathbf{r})$ to form a new map $\tilde{R}$:
$$\tilde{R}(\mathbf{r}) = \sum_{k=0}^{N/2-1}\sqrt{O_k^{(2)}(\mathbf{r})\;O_{k+N/2}^{(2)}(\mathbf{r})} \qquad (5)$$
Fig. 3 shows $\tilde{R}$ for the three sample keypoints. Obviously $\tilde{R} > 0$ on GEs, but it remains almost constant with varying $\theta$ (substitute (4) into (5)).

Fig. 2. Responses to a 90° corner (top image) of oriented energy $E_n$, first p-derivative $P^{(1)}$, second p-derivative $P^{(2)}$, first o-derivative $O^{(1)}$ and second o-derivative $O^{(2)}$. Image dimensions are 32 x 32 pixels; filter parameters are p = 2, σ = 3.
Fig. 3. Keypoint detection using a corner, a line-end and a T-junction (32x32 pixels). Top rows: original image, final keypoint map (K), corrected compensation map (C), corrected combination of o-derivatives (R). Bottom rows: uncompensated keypoint map ($\tilde{K}$), raw compensation map ($\tilde{C}$) and uncorrected R map ($\tilde{R}$).

The extrema $\tilde{R}_{max}$ and $\tilde{R}_{min}$ differ by less than 10 percent of $\tilde{R}_{max}$ (using filters with N = 6 and p = 2). It is interesting to note that, supposing that N is even and N > (p+1), the sum over all 2nd o-derivatives is constant and does not depend on $\theta$. Therefore we estimate the error of $\tilde{R}(\mathbf{r})$ on GEs with the sum over all 2nd o-derivatives and define
$$R(\mathbf{r}) = \left[\sum_{k=0}^{N/2-1}\sqrt{O_k^{(2)}(\mathbf{r})\;O_{k+N/2}^{(2)}(\mathbf{r})} \;-\; \gamma\,\sum_{k=0}^{N-1}O_k^{(2)}(\mathbf{r})\right]^{+} \qquad (6)$$

An estimate of γ can be easily derived by setting θ = 0, substituting (4) into (6) and
solving the resulting equation for R(r) = 0. With this definition of R(r) we finally
define the following compensation map C:

C(r) = [ Σ_{k=0}^{N−1} ( O_k^(1)(r) + O_k^(2)(r) ) − R(r) ]^+   and   K(r) = [ K̃(r) − C(r) ]^+

This compensation map fulfills all requirements: (1) it successfully cancels all systematic
errors of the raw keypoint map at general edges, and (2) it is zero at the location of
keypoints. Fig. 4 shows the different steps of keypoint detection on a simple gray-valued
image. The keypoint detection scheme has been tested on a wide variety of complex
natural scenes. Results will be shown in the next section.

Fig. 4. Keypoint detection scheme: (A) input image (512 × 512 pixels). (B) Raw keypoint map K̃. (C) Compensation map C. (D) Corrected keypoint map K. (E) Binary edge-map with keypoint localization indicated by the center of the circles.

5 Experimental results

Implementation: Convolutions with the twelve filter kernels were carried out in the
Fourier domain. Six maps of oriented energy were generated by quadrature-pair summation
of even and odd filter convolution outputs (we took the square root of oriented
energy to reduce the signal dynamics to those of the original filter outputs). Binary edge
maps were generated by finding local maxima orthogonal to the orientation of the best
responding energy channel at each location.
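As a rough illustration of this step, the sketch below (Python with NumPy/SciPy, not the authors' code) computes one oriented-energy map per channel by quadrature-pair summation; the even- and odd-symmetric kernels themselves (e.g. Gabor-like pairs with the cos^(2p) orientation tuning used above) are an assumption and must be supplied by the caller.

import numpy as np
from scipy.signal import fftconvolve

def oriented_energy(image, even_kernels, odd_kernels):
    # Quadrature-pair summation: for each orientation channel, convolve with
    # an even- and an odd-symmetric kernel (FFT-based, as in the text) and
    # take the square root of the summed squared responses.
    maps = []
    for ke, ko in zip(even_kernels, odd_kernels):
        re = fftconvolve(image, ke, mode='same')
        ro = fftconvolve(image, ko, mode='same')
        maps.append(np.sqrt(re ** 2 + ro ** 2))
    return np.stack(maps)          # shape (N, H, W), one map per orientation

Binary edge maps would then be obtained by non-maximum suppression orthogonal to the best-responding channel, as described above.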
To compute the derivatives in the various directions, the oriented energy maps were sampled
at discrete offset positions around the center pixels. First derivatives corresponded to
the difference in value at the two offset pixels, and second derivatives to their average minus
the value of the center pixel.
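A minimal sketch of this sampling scheme, under the assumption of one-pixel offsets and wrap-around borders (neither of which is specified above), might look as follows; the o-derivatives are obtained with the same routine by passing the offset orthogonal to the filter orientation.

import numpy as np

def p_derivatives(E, dx, dy):
    # Sample the energy map at +/- one offset along the filter direction:
    # first derivative = difference of the two offset samples,
    # second derivative = their average minus the centre value.
    plus  = np.roll(np.roll(E, -dy, axis=0), -dx, axis=1)
    minus = np.roll(np.roll(E,  dy, axis=0),  dx, axis=1)
    d1 = plus - minus
    d2 = 0.5 * (plus + minus) - E
    return d1, d2

def raw_keypoint_map(E_maps, offsets):
    # K~(r) = max_n sqrt(P_n^(1)^2 + P_n^(2)^2); the second derivative is
    # rectified so that only maxima of E along the filter orientation count.
    K = np.zeros_like(E_maps[0])
    for E, (dx, dy) in zip(E_maps, offsets):      # one offset per orientation
        d1, d2 = p_derivatives(E, dx, dy)
        p2 = np.where(d2 > 0.0, 0.0, -d2)
        K = np.maximum(K, np.sqrt(d1 ** 2 + p2 ** 2))
    return K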

Complex scenes: We have tested our contour and keypoint detection scheme with a
variety of images of different complexity. One example is shown in Fig. 5A, an outdoor
scene. The scene contains primarily corners and vertices, but also some T-junctions (fence
in the lower part of the image). A large proportion are occlusion features generated by
the foreground object occluding structures of the building in the background. The curved
shape of the sculpture thus introduces some interesting variation in termination angles.
The result of the keypoint detection scheme is shown in Fig. 5B.

Fig. 5. Application of the keypoint scheme to an outdoor scene: (A) input image. (B)
keypoint map superimposed on a contrast reduced version of the original image. Image size
is 512 × 512 pixels and filter parameters are p = 2 and σ = 2.

Fig. 6. Edge map (A) and keypoint localization (B) for a subsection of image in Fig. 5.
Keypoint positions (pixel accuracy) are indicated by crosses.

The dark blobs correspond to the keypoint map superimposed on a contrast reduced
version of the original image. Note that the keypoint strength is a function of local image
contrast. Thus weaker markings do not necessarily indicate weaker evidence for the
presence of a corner, vertex, etc. It can be seen that no markings whatsoever occur on
straight contour segments. This shows that our compensation scheme is effective also with
more complex 2D intensity configurations. Fig. 6A shows, for a part of the image, the
contour map extracted with the non-maximum suppression scheme described above. A
threshold of 8% of the maximal oriented energy response was applied. One can see that
while straight parts of contour are well represented we often find gaps or distortions in
the neighbourhood of keypoints. Fig. 6B shows the location of keypoints indicated by
crosses (threshold again 8% of the global keypoint maximum). As both general edge and
keypoint location have been derived from oriented energy, the maps are commensurable
and complementary.

6 Discussion
We have presented a computational framework for extracting (1) intensity discontinuities
which can be described as 1D variations (general edges) and (2) keypoints with a true
2D intensity distribution (corners, vertices, terminations etc.).
Using oriented filters with even and odd symmetry we combine their convolution
outputs to oriented energy. This quadrature pair summation has the advantage that edges
and lines and combinations thereof are treated in a unified way and can be unambiguously
localized [12], [15], [16]. The edge quality measure we derived is used to select for the
edge map only those pixels that exceed a predefined quality. This way we can be sure
edge detection is valid at these locations.
The detection scheme for keypoints represents a novel approach to the problem of
detecting and localizing image features like corners, junctions or terminations. This is
more difficult than detecting edges. The abundant richness of two-dimensional intensity
variations seems to prohibit approaches of the form of simplified model prototypes as,
for example, the Heaviside function used in edge models. What seems to be important
is to reduce the dimensionality of the problem by generating invariant representations.
Oriented energy is invariant with respect to the polarity and the type of edge [15].
We propose to detect keypoints by taking first and second derivatives on the energy
maps in the filter direction (p-derivatives). The idea is that true 2D features produce
strong variations in the energy signal parallel to its orientation. However, markings also
occur for general edges if their orientation and the direction of the derivative differ. We
have introduced a scalar compensation map that selectively suppresses these unwanted
derivative signals.
It seems that differential geometry is an adequate way to attack two-dimensional
intensity variations. However, compared to other approaches that also use methods of
differential geometry (e.g. [3], [10], [5]) , our model is not gray-level based (smoothed
version of the original image) but uses oriented energy maps which have the advantage
of representing different edge types in a unified way. Furthermore, our approach does
not contain any specific model of keypoints, as for example a corner [9], [14], [18], vertex
[5], T- or L-junctions. In this respect we cannot expect selectivities for these specific
2-D intensity variations. Our scheme detects and accurately localizes corners (di-, tri-,
tetra-hedral junctions of different angles and contrasts) as well as line-terminations, T-
junctions, strong curvature and blobs. However, the information given by the first and
second p-derivatives in different orientations may be used to classify the keypoints. We
are currently working on a processing scheme to classify keypoints paying attention to
occlusion situations and to the distinction between foreground/background structures.

Parallels in biological vision: Some of the ideas of the work presented here originated
from our interest in the simulation of neural contour mechanisms [7]. In fact the
different stages of our computational approach can be compared to stages in cortical
processing of visual information: even and odd symmetrical, orientation-selective filters
can be compared to the properties of simple cells. The oriented energy representation is
consistent with complex cells which are known to exhibit phase independence [12]. Hy-
percomplex or end-stopped cells respond well to short bars, line-ends and corners [8]. We
have related the response behaviour of single- and double-stopped cells to first and second
derivative operations based on orientation selective complex cells [7]. These endstopped-
operators can be used to generate "subjective contours". The results have demonstrated
the importance of detecting keypoints when dealing with the problem of spatial occlusion.

References
1. Adelson, E. H. & Bergen, J. R.: Spatio-temporal energy models for the perception of motion.
Journal of the Optical Society of America A 2 (1985) 284-299
2. Barrett, H. B., & Swindell, W.: Analog Reconstruction for Transaxial Tomography. Pro-
ceedings of the IEEE 65 (1977) 89-107
3. Beaudet, P. R.: Rotationally invariant image operators. 4th International Joint Conference
on Pattern Recognition, Kyoto, Japan (1978) 578-583
4. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence 8 (1986) 679-698
5. Giraudon, G. & Deriche, R.: On corner and vertex detection. IEEE Proc. CVPR'91, Maui,
Hawaii (1991) 650-655
6. Granlund, G. H.: In search of a general picture processing operator. Computer Graphics
and Image Processing 8 (1978) 155-173
7. Heitger, F., Rosenthaler, L., von der Heydt, R., Peterhans, E. and Kübler, O.: Simulation of
neural contour mechanisms: From simple to end-stopped cells. Vision Research 32 (1992)
in press
8. Hubel, D. H. & Wiesel, T. N.: Receptive fields and functional architecture of monkey striate
cortex. Journal of Physiology, London 195 (1968) 215-243
9. Kitchen, L. & Rosenfeld, A.: Gray level corner detection. Pattern Recognition Letters 1
(1982) 95-102
10. Koenderink, J. J. & van Doorn, A. J.: Representation of local geometry in the visual system.
Biological Cybernetics 55 (1987) 367-376
11. Marr, D. & Hildreth, E.: Theory of edge detection. Proceedings of the Royal Society, London
Series B 207 (1980) 181-217
12. Morrone, M. C. & Burr, D. C.: Feature detection in human vision: A phase-dependent
energy model. Proceedings of the Royal Society, London Series B 235 (1988) 221-245
13. Morrone, M. C. & Owens, R. A.: Feature detection from local energy. Pattern Recognition
Letters 6 (1987) 303-313
14. Noble, J. A.: Finding corners. Image Vision and Computing 6 (1988) 121-128
15. Owens, R., Venkatesh, S. & Ross, J.: Edge detection is a projection. Pattern Recognition
Letters 9 (1989) 233-244
16. Perona, P. & Malik, J.: Detecting and localizing edges composed of steps, peaks and roofs.
UCB Technical Report, UCB/CSD90/590 (1990)
17. Peterhans, E. & von der Heydt, R.: Mechanisms of contour perception in monkey visual
cortex. II. Contours bridging gaps. Journal of Neuroscience 9 (1989) 1749-1763
18. Rangarajan, M., Shah & Brackle, D. V.: Optimal corner detector. Computer Vision Graph-
ics and Image Processing 48 (1989) 230-245
19. von der Heydt, R. & Peterhans, E.: Mechanisms of contour perception in monkey visual
cortex. I. Lines of pattern discontinuity. Journal of Neuroscience 9 (1989) 1731-1748
20. von der Heydt, R., Peterhans, E. & Baumgartner, G.: Illusory contours and cortical neuron
responses. Science 224 (1984) 1260-1262
Distributed Belief Revision
for Adaptive Image Processing Regulation*
Vittorio Murino, Massimiliano F. Peri, and Carlo S. Regazzoni

Department of Biophysical and Electronic Engineering (D.I.B.E.), University of Genova


Via all'Opera Pia 11A, 16145 Genova - ITALY

Abstract. A theoretical approach to the problem of intelligent regulation of data-processing
parameters is proposed in terms of joint probability maximization. It is shown
that, under suitable hypotheses, the problem can be solved by maximizing, in a distributed
way, the product of computationally more tractable conditional probabilities. As a case
study, the implementation of an architecture made up of four units is investigated.

1 Introduction
Image understanding can be represented according to three principal phases: low-level processing,
feature extraction and descriptive-primitive grouping, and high-level reasoning [HR1]. Each of
these phases may affect the goodness of the final results. Little has been proposed in the literature for the
adaptive regulation of processing parameters. The nearest and main effort has been that of Hanson and
Riseman, who designed VISIONS [HR2], a Knowledge-Based (KB) system able to recognize 3D objects
while supervising the low-level processing phases. However, this approach is neither dynamic (i.e., it may
only be operated after the end of an interpretation cycle), nor distributed (i.e., it is based on the interpretation
phase, and not on the knowledge available to each local low-level processing unit). In this paper, we
present a distributed algorithm for adaptively regulating image-processing parameters at multiple levels:
at each level, the loss of information occurring while mapping the input data into the local data representation is
minimized. The same regulation strategy is applied to every level, so allowing the design of a general prototype
for the regulation module. This prototypical inference engine is specialized at each level, in accordance
with the different peculiarities of the processing units. Processing-parameter regulation is based on
evaluations of local data quality, and is performed by mapping quality features into the actual values of
the parameters to be regulated. Results on a set of indoor images are provided in order to assess the validity
of the proposed approach.

2 Problem Statement
In some applications, the architecture of an image understanding system can be represented as a singly
connected network (Fig. 1) to which distributed problem-solving techniques, using probabilistic reasoning
as an inference paradigm, can be efficiently applied [Pe1]. According to this model, each node of the
network is associated with a variable to be estimated, and is connected with parent and son nodes by bidirectional
communication channels. An intermediate (not terminal) node can receive two types of messages:
the evidence λ, that is, the information coming from lower-level nodes, and the expectation π,
coming from parent nodes. If x_i is the variable to be estimated by the i-th node, a Belief function [Pe1]
can be defined as BEL(x_i) = α·λ(x_i)·π(x_i), where α is a normalizing constant. By maximizing BEL over
all received messages, the locally optimum x_i value can be obtained. Moreover, messages based on the
new variable settings can be computed and propagated to the neighbouring nodes in a distributed fashion.
When the estimation of the optimal variables is achieved, a stable status is reached (i.e., no message is
* This work was carried out and supported within the framework of the MOBIUS project (no. MAST-0028-C), which
is included in the CEC Marine Science and Technology (MAST) programme.

present in the network). Different kinds of messages can be used, depending on the type of problem considered.
In the case of Belief Revision (BR) [Pe1], the optimization criterion is the Most Probable
Explanation (MPE), which is a generalization of the Maximum A-Posteriori (MAP) probability criterion.
By using this criterion, the joint probability of x_i is made explicit by using the Bayes rule:

Pr(x_i | e) = max_{x_{i−}, x_i, x_{i+}} Pr(x_{i−} | x_i) · Pr(x_i) · Pr(x_{i+} | x_i) = β · max_{x_i} λ(x_i) · γ(x_i) · π(x_i)   (1)

where λ(x_i) = max_{x_{i−}} Pr(x_{i−} | x_i), π(x_i) = max_{x_{i+}} Pr(x_{i+} | x_i), γ(x_i) = Pr(x_i), and x_{i−}, x_{i+} are the lower- and higher-level
variables, respectively. In our application, each node is considered as a virtual sensor whose "acquisition"
phase depends on specific sets of parameters, and which produces an output representation to be sent to
higher-level modules. The status of each module can be identified by the set of parameters P_i regulating
the local transformation process. This means that, once the real scene (d_0) to be considered has been
fixed, the datum considered as input at each level i, d_i, with i ≠ 0, can be univocally determined if the
parameters P_j, j < i, are known. As a consequence, by estimating the optimal set of processing parameters,
an optimal set of image representations d_i can be obtained. If we pose x = P, we can write equation (1) as:

max_{P_i} { Pr(P_i) [ max_{P_{i−1}} Pr(P_{i−1} | P_i) ] [ max_{P_{i+1}} Pr(P_{i+1} | P_i) ] } = max_{P_i} { γ(P_i) λ(P_i) π(P_i) }

The bottom-up (BU) message λ is computed by the lower-level module, and indicates the optimal conditional probability
distribution for each P_{i−1} when the conditioning factor is P_i. The top-down (TD) message π represents an analogous
distribution: it is computed by the higher-level module, and indicates the optimal conditional-probability
distribution for each P_{i+1} when the conditioning factor is P_i. The term γ represents local regularizing
knowledge for the optimization of the local status variable P_i. This amounts to defining default parameter
values (specific for each level) that are usually expected to give good local results. Depending on
higher- and lower-level status, further local tunings may become necessary to improve the quality of the
global data flow in the processing chain. The same reasoning mechanism can be extended directly to the
space-varying case by performing regulation-parameter tunings to improve particular subimages, disregarding
possible effects on the rest of the image. By applying the MPE criterion to all the subwindows
separately, the best data quality at all levels of abstraction may be obtained for each subwindow.
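As a small illustration of the local decision rule (a sketch, not the authors' implementation), each node can evaluate the product in eq. (1) over its discrete set of candidate parameter settings and keep the maximizer:

import numpy as np

def local_mpe(gamma, lam, pi):
    # Belief over the discrete candidate settings of P_i: product of the
    # local prior gamma, the bottom-up message lambda and the top-down
    # message pi, maximised as in eq. (1).
    bel = np.asarray(gamma) * np.asarray(lam) * np.asarray(pi)
    k = int(np.argmax(bel))
    return k, bel[k] / bel.sum()

Each regulation module would run this over its own parameter grid and build the outgoing λ and π messages from the winning setting.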

3 Description of the Regulation Module Prototype

The functioning of a generic module [MR1] includes two phases: first, the quality of a locally transformed
datum is estimated; then, on the basis of this quality evaluation, the processing parameters are modified.
At the generic level i, a transformation is applied to the datum provided by the lower level, say d_{i−1},
and is regulated by the vector of parameters P_i, thus producing the local datum d_i: d_i = T_i(d_{i−1}, P_i). As the lower-level
datum is held fixed at the BU solution proposed by the lower-level module, the local datum may be expressed
in terms of solely the local parameters: d_i = S_i(P_i). Quality measures q_ij are computed for a datum by using
the quality functions g_ij: q_ij = g_ij(d_i) = g_ij(S_i(P_i)), where j = 1, ..., N_i, and N_i is the number of quality measures
at level i. A weighted cost function is used to obtain a single value, Q_i, indicating the global degree
of quality of the image (or of the sub-image): the datum is considered to be of poor quality if the cost
value is high. By analogy to Gibbs models, the cost function is associated with the probability distribution
of local regulation parameters, and is indicated as a datum "energy". Hence, a suitable criterion to achieve
the best quality is to minimize the datum energy. The cost function at level i is expressed as (see Fig. 2):

Q_i = Σ_{j=1}^{N_i} w_ij f_ij(q_ij) = Σ_{j=1}^{N_i} w_ij f_ij(g_ij(d_i)) = Σ_{j=1}^{N_i} w_ij f_ij(g_ij(S_i(P_i))) = Q_i(P_i)

where f_ij(q_ij) are the functions that penalize the terms far from the optimal ones, according to the type of
quality parameter q_ij, and w_ij are the weights associated with all the terms: they both represent a scaling
factor for mapping quality measurements into a single measure space, and for exploiting their meaningfulness
in the global quality-assessment formulation. The γ term can be computed by using the Gibbs
model as

γ(P_i) = Pr{P_i} = (Z_i)^(−1) · exp(−Q_i(P_i)),

where Z_i is the partition function at level i. In this way, in the absence of input messages, the parameter which gives the highest quality is regarded as being the most probable on
the basis of the local regularizing knowledge. Every module propagates a BU evaluated datum (a datum
associated with its "energy"): on the basis of this message, the receiving module can compute a proper λ
as λ(P_i) = (Z_i)^(−1) · exp(−U_{m_i}(P_i)), ∀ P_i, where m_i = D_i(Q_{i−1}(P_{i−1})) is a function mapping, by means of the discretization
process D_i, the energy values Q_{i−1} into the number m_i. Such a number is selected in a discrete
set (m_i ∈ {1, ..., M_i}) and addresses the probability distribution U_{m_i}(P_i) at level i, according to the quality assessment
at level i−1 (in our case, m_i ∈ {1, ..., 3}, corresponding to the set {Low, Medium, High} of quality assessments).
As the propagated datum is the one corresponding to the best quality assessment, the maximization
over all possible P_{i−1} is implicitly performed at level i−1, and λ(P_i) may be considered as one of a
set of possible fixed distributions at level i. At the moment, as TD messages, we propagate requests for
focusing the attention on particular subareas. In the bootstrap phase the π messages may be viewed as
π(P_i) = (dim{P_i})^(−1), dim{P_i} being the number of all possible different parameter settings at level i. This
means that local regulation parameters are equally probable (capable of obtaining the same quality
judgement) in the absence of a-priori expectations. If focus-of-attention suggestions are propagated, all previously
used regulation parameters, say {P̄_i}, were not suited with respect to the higher level's quality assessments:
thus, their associated probabilities may be reassigned among all not-yet-used regulation parameters
at level i. In this way, the maximization over all possible P_{i+1} no longer has any effect, that is:
π(P_i) = (dim{P_i} − dim{P̄_i})^(−1) if P_i ∉ {P̄_i} and π(P_i) = 0 if P_i ∈ {P̄_i}. In the proposed system, every module
thus receives fixed λ and π messages: the maximization of γ(P_i) over all possible P_i is the only one to
be performed.
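A compact sketch of this mapping from quality measures to messages is given below; the penalty functions f_ij, the weights w_ij and the fixed distributions U_m are module-specific and are assumed here rather than taken from the paper.

import numpy as np

def datum_energy(q, w, f):
    # Q_i(P_i) = sum_j w_ij * f_ij(q_ij) for one candidate parameter setting;
    # f are the module-specific penalty functions, w their weights.
    return sum(wj * fj(qj) for qj, wj, fj in zip(q, w, f))

def gamma_message(energies):
    # gamma(P_i) = exp(-Q_i(P_i)) / Z_i over all candidate settings of level i.
    g = np.exp(-np.asarray(energies, dtype=float))
    return g / g.sum()

def lambda_message(U_m):
    # lambda(P_i) = exp(-U_m(P_i)) / Z_i, with U_m the fixed distribution
    # addressed by the discretised quality level m_i (Low/Medium/High).
    l = np.exp(-np.asarray(U_m, dtype=float))
    return l / l.sum()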

4 Module Description and Results

The software-implemented system is composed of four independent processing modules (Camera, Preprocessing,
Edge-Extraction, and Line-Detection, see Fig. 1), each one associated with a specific phase of
the low-level interpretation process [MR1]. The goal of each module is to find the local regulation-parameter
vector producing the best possible result of the overall processing chain for every window the datum
is divided into. On the basis of the quality judgements, the module can decide to tune the local parameters,
to propagate the datum, or to propagate a TD message specifying the windows that must be improved.
The camera module simulates the optical sensor's behaviour. The optical sensor allows the regulation of
four acquisition parameters: focus, aperture, black level, and electronic gain. This module does not explicitly
receive BU messages, but obtains new evidence from the environment by acquiring new data. The
criteria to evaluate the quality of image acquisition are provided by classic low-level algorithms
[Kr1, Ho1] regarding the analysis of the histogram and of the high-frequency content of an image
(gradient analysis). Fig. 3 shows a pair of images (aerial_2, left, and aerial_4) of an indoor scene
with different degrees of focusing. In Fig. 4, it can be noticed that the cost function on the image windows
(which are indicated by y1-x1, y1-x2, ..., y4-x4, from left to right and from top to bottom) succeeds in
discriminating between better- and worse-defined image areas (aerial_4), so allowing the detection of the
windows to be improved.
The Perona-Malik filter [PM1] is applied to an image in order to eliminate noise, while preserving the
information about the image contours. The Perona-Malik filter is an adaptive (anisotropic) type of filtering
based on a diffusion process, which encourages smoothing operations in those image parts where there are
no edges. The outcome of this algorithm is a filtered image, so the quality parameters and the procedures
for obtaining a global quality assessment are the same as the ones described for the camera module. The
regulation phase controls the number of iterations of the filtering process in order to prevent the algorithm
from reaching a saturation state. Preprocessing results are not shown, as the application of the Perona-Malik
filter to nearly noiseless images (like the analyzed ones) yields negligible results, according to a purely
visual criterion.
The third module consists of an edge-detection filtering performed by means of the Canny algorithm
[Ca1]. For this algorithm, we regulate the two thresholds of the hysteresis process that is applied
to the detected edge points. Heuristic criteria have been chosen for quality assessment, such as the number
of edges, the number of edge points, the number of long edges, and the number of connections among
edges. These criteria give a quantitative judgement on the informative content of the image. Fig. 5 gives
the results of the edge-extraction process performed on one of the images (aerial_2); they were obtained
by using two different settings of the Canny algorithm thresholds. In the right image, the parameters have
been relaxed by the regulation process in order to take into account also less "informative" edges.
The main goal of the Line-Detection module is the production of the "best" scene representation by
means of straight lines extracted through the Hough transform [IK1], on the basis of the differently processed
edge images obtained by the edge-extractor module. Actually, this module can neither propagate
evaluated data nor receive TD suggestions from higher modules, and cannot yet adaptively regulate
its local processing parameters. The goal is reached by performing a data fusion on all the binary edge
images based on their related quality assessments (i.e., accumulators in the Hough space are incremented by
a value proportional to the quality judgement on the window containing the related edge). Fig. 6 shows the
results of the Line-Detection module after having fused in the Hough space and backprojected into the
Cartesian space one (left) and three (right) of the edge images (the edges of the aerial_4 image and the ones of the
aerial_2 image, respectively obtained by using default and relaxed settings). An improvement in the detection
of rectilinear segments can be noted; this is particularly evident after the third fusion (in the right
image), where meaningful segments appear, though with no substantial modifications to the full scene
structure.
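The quality-weighted fusion in the Hough space can be sketched as follows; this is an illustration only, with the subwindow size, the accumulator resolution and the (ρ, θ) parameterization chosen arbitrarily rather than taken from the system (window_quality is one quality grid per edge image, indexed by subwindow).

import numpy as np

def fuse_hough(edge_maps, window_quality, n_rho=256, n_theta=180, win=128):
    # Quality-weighted Hough accumulation over several binary edge images:
    # each vote is scaled by the quality judgement of the subwindow that
    # contains the edge pixel.
    H, W = edge_maps[0].shape
    diag = np.hypot(H, W)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_rho, n_theta))
    for edges, quality in zip(edge_maps, window_quality):
        ys, xs = np.nonzero(edges)
        for y, x in zip(ys, xs):
            wgt = quality[y // win, x // win]
            rho = x * np.cos(thetas) + y * np.sin(thetas)
            r = np.round((rho + diag) / (2 * diag) * (n_rho - 1)).astype(int)
            acc[r, np.arange(n_theta)] += wgt
    return acc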

MODULE | QUALITY PARAMETERS | REGULATION PARAMETERS
Camera | Tenengrad, Flatness, Entropy, Low/High Saturation, Lowest/Highest Level, Histogram Modes | Focus, Aperture, Black Level, Electronic Gain
Preprocessing | (as for Camera) | Iterations Number
Edge-Extraction | Edges Number, Long Edges Number, Connected Edges Number, Edge Points Number | Low Hysteresis Threshold, High Hysteresis Threshold
Line-Detection | none | none

Table 1. Quality and regulation parameters for each module in the processing chain.

6 Conclusions
A framework for the regulation of image-processing phases has been presented. The actual system includes
four regulation modules, each coupled to a different processing step. This subdivision allows one
to define a set of intermediate abstraction levels at which to perform the matchings necessary to gradually
obtain a robust image interpretation. Each module is provided with a-priori knowledge represented by
regulation parameters for controlling local data transformations, and by quality parameters for a quantitative
assessment of transformed data. Estimation of the best transformation parameters to be progressively
applied to the input data is based on belief revision theory, according to the MPE criterion. When a solution
to the BR problem is found, a suboptimal estimation of the regulation parameters at all system levels is
reached. As a single prototype has been developed for each module, the system can be implemented on
a parallel architecture.

Acknowledgment
The authors would like to thank Thomson CSF - LER (Rennes, France) and Thomson Sintra
ASM (Brest, France) for providing all the images processed for this work.

References
[HR1] A.R. Hanson, E.M. Riseman: A Summary of Image Understanding Research at the University of Massachusetts. COINS Technical Report 83-85, Dept. of Computer and Information Science, University of Massachusetts (1985)
[HR2] A.R. Hanson, E.M. Riseman: The VISIONS Image Understanding System-86. Advances in Computer Vision, Erlbaum (1987)
[Pe1] J. Pearl: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publ. Inc., San Mateo, CA (1988)
[Kr1] E. Krotkov: Active Computer Vision by Cooperative Focus and Stereo. Springer-Verlag, New York, NY (1989)
[Ho1] B.K.P. Horn: Focusing. Technical Report, Massachusetts Institute of Technology (1968)
[PM1] P. Perona, J. Malik: Scale-Space and Edge Detection Using Anisotropic Diffusion. IEEE Trans. on PAMI 12, No. 7 (1990) 629-639
[Ca1] J. Canny: A Computational Approach to Edge Detection. IEEE Trans. on PAMI 8, No. 6 (1986) 679-698
[IK1] J. Illingworth, J. Kittler: A survey of the Hough transform. Computer Vision, Graphics, and Image Processing 44 (1988) 87-116
[MR1] V. Murino, C.S. Regazzoni: A Distributed Algorithm for Adaptive Regulation of Image Processing Parameters. IEEE Int. Conf. on Systems, Man, and Cybernetics, Charlottesville, VA, USA (1991)

Fig. 1. System architecture as a Markovian network. Level 0 is the real scene d_0; level 1 the camera module, d_1 = T_1(P_1, d_0); level 2 the preprocessing module, d_2 = T_2(P_2, d_1); level 3 the edge-extraction module, d_3 = T_3(P_3, d_2); level 4 the rectilinear-segment-detection module, d_4 = T_4(P_4, d_3).

Fig. 2. λ message generation: the quality measures q_{i−1,j} are weighted by w_{i−1,j} and mapped into the fixed distributions λ_Low(P_i), λ_Medium(P_i), λ_High(P_i).

Fig. 3. Sample of the set of images acquired with different focus settings (aerial_2 and aerial_4, right).

Fig. 4. Profile of the cost function at the camera level on all windows of the image of Fig. 3 (aerial_4).

Fig. 5. Outcome of the edge-extraction module on the aerial_2 image with different settings of the S1 and S2 thresholds (right, after regulation).

Fig. 6. Outcome of the Line-Detection module after the backprojection of one (left) and three (right) images after edge-extraction regulation.
Finding Face Features *

Ian Craw, David Tock, and Alan Bennett


Department of Mathematical Sciences, University of Aberdeen, Aberdeen AB9 2UB, UK.

Abstract. We describe a computer program which understands a greyscale


image of a face well enough to locate individual face features such as eyes
and mouth. The program has two distinct components: modules designed to
locate particular face features, usually in a restricted area; and the overall
control strategy which activates modules on the basis of the current solution
state, and assesses and integrates the results of each module.
Our main tool is statistical knowledge obtained by detailed measure-
ments of many example faces. We describe results when working to high
accuracy, in which the aim is to locate 40 pre-specified feature points cho-
sen for their use in indexing a mugshot database. A variant is presented
designed simply to find eye locations, working at close to video rates.

1 Introduction
We describe a program to recognise and measure human facial features. Our original
motivation was to provide a way of indexing police mugshots for retrieval purposes.
Another use comes when identifying points on a face with a view to cartooning the face
[2] or to obtain a more useful Principal Component Analysis of face images [3], leading
to a compact representation suitable for matching. The work on blink rate is part of a
PROMETHEUS (CED2 - Proper Vehicle Operation) study by the Ford Motor Company
on driver awareness, for which blink rate is a useful indicator. The aim is to correlate
the blink rate, obtained during normal driving conditions, with the output from other
sensors. Ideally this would enable blink rate to be predicted from measurements which
can be made more cheaply as part of the normal sensor input available from a modern
car.
The system aims to locate a total of 40 feature points within a grey-scale digitized full
face image of an adult male. We initially choose to ignore glasses and facial hair, although
such images are occasionally used to test the robustness of the system. The points chosen
are those described in Shepherd [8]; thus allowing us to utilise data originally recorded by
hand. A total of 1000 faces were measured, and the locations of the 40 points on each face
were recorded. This data has been normalized and forms the basis of the model described
below. Identification is confirmed by overlaying the points on the image; the points are
usually linked with straight lines to form a wire frame face as shown in figure 1.
Our system is in two distinct parts:

- a number of independent recognition modules specialised to respond to individual


parts of the face; and
* This research was supported by a United Kingdom Science and Engineering Research Council
project grant number GR/E 84617 to Ian Craw and J Rowland Lishman (Computing Science,
Aberdeen). DT was funded by this project; AB was supported by an SERC studentship. We
thank Roly Lishman for many valuable discussions on faces.

- a control structure, driven by a high level hypothesis about the location of the face,
which invokes the feature finding modules in order to support or refute its current
hypothesis.

The program endeavours to confirm that a face is located within the image in a two
part process. Possible coarse locations for the face are sought ab initio providing contexts
for further search, using modules capable of providing reliable, although not necessarily
accurate locations even when given a wide search area. Contexts are then refined and
assessed. In this phase, feature finding modules are usually called with very restricted
search areas, determined by statistical knowledge of the relative positions of features
within the face. When all the required feature points are located in a single context, this
identifies the face itself; and the features' locations provide mutual support.

2 Feature Experts

Feature experts obtain information directly from the image. Our current set range from
simple template matchers to much more elaborate deformable template models. Imple-
menting a template matcher is trivial, and execution is rapid, but problems arise from
their heavy dependence on scale and orientation, and multiple responses from many
parts of the image: however FindFace is normally confirming a good working hypothesis;
our templates are generated dynamically, tuned to the expected size of the feature, and
applied to a small search area.
We describe methods designed primarily for initial location as "global methods" as
opposed to the "local methods" used to assess or verify a location proposed by the existing
context. So, for example the global methods for locating a single eye look either for dark,
compact blobs completely surrounded by lighter regions or for areas of the image with
a substantial high frequency component. Locally the eyes use a probabilistic eye locater
rather like the outline finder we describe below, or even the blob detector confined to a
small area, if the uncertainty of the location is still high. A full list of feature experts can
be found in [4].
A number of algorithms have been proposed for finding the outline of a head, dating
at least from Kelly [6]. Our approach is inspired by the work of Grenander et al.[5] and [7],
in which a polygonal template outline is transformed at random to fit the data, governed
by our detailed statistics [1]. The advantage of this approach is that the background
can be cluttered (cf figure 3), and the initial placing of the outline is not required to be
outside the head.
The approximate location, scale and orientation of the head is found by repeatedly
deforming the whole template at random by scaling, rotation and translation, until it
matches best with the image whilst remaining a feasible head shape. This feasibility is
determined by imposing statistical constraints on the range of allowable transformations.
The optimisation, in both stages, is achieved using simulated annealing; although this
means the method is not rapid, it appears to be particularly reliable.
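A minimal sketch of this global stage is given below, assuming a user-supplied cost function that scores a transformed outline against the image and penalizes infeasible head shapes; the step sizes and cooling schedule are illustrative choices, not the paper's values.

import numpy as np

def anneal_outline(template, cost, n_iter=5000, T0=1.0, cooling=0.999, rng=None):
    # Global fit of the average head outline by simulated annealing over
    # similarity transforms (scale, rotation, translation). `template` is an
    # (n, 2) array of outline points; `cost(points)` scores the transformed
    # outline against the image and should penalise infeasible head shapes.
    if rng is None:
        rng = np.random.default_rng()
    params = np.array([1.0, 0.0, 0.0, 0.0])          # scale, angle, tx, ty

    def apply(p):
        s, a, tx, ty = p
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        return s * template @ R.T + np.array([tx, ty])

    cur = best = cost(apply(params))
    best_p = params.copy()
    T = T0
    for _ in range(n_iter):
        cand = params + rng.normal(0.0, [0.02, 0.02, 1.0, 1.0])
        c = cost(apply(cand))
        # Metropolis acceptance: always take improvements, sometimes take
        # worse fits while the temperature is still high.
        if c < cur or rng.random() < np.exp((cur - c) / T):
            params, cur = cand, c
            if c < best:
                best, best_p = c, cand.copy()
        T *= cooling
    return apply(best_p), best_p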
A further refinement is then achieved by transforming the individual vectors within
the polygon under certain statistical constraints. Consider the outline as a list of vectors
[v_1, ..., v_n] with v_i ∈ ℝ² for each i, where the representation is obtained by regarding
each vector as based at the head of the previous one in the list. Since the outline is closed,
v_1 + ... + v_n = 0; a new outline may be generated by applying (n − 1) transformations
from the group O(2) × US(2), where O is the orthogonal group and US is the group of
uniform scale change, to the first n − 1 vectors in the list. We represent elements of this

Fig. 1. Wire frame model. Fig. 2. The template for a head outline. Fig. 3. Identification in a cluttered image.

group as matrices of the form

( u   v )
( −v  u )

When u = 1 and v = 0, each transformation is the identity map, and the resulting
outline is the (initial) average one. Our first approximation to generating a variable
outline is to choose u_i ∈ N(1, σ_u) and v_i ∈ N(0, σ_v) for 1 ≤ i ≤ n − 1, where the
corresponding variances are calculated from our detailed measurements. This then gives
a shape (behaviour) score of the form

where the constants b_u and b_v are associated with independent behaviour at each site.
This alone allows too much uncoordinated variation between neighbouring vectors.
To ensure more coordinated, head-like, polygons, we also place a measure on the change
in transformations generating adjacent vectors, making the assumption that the u's and
v's are realisations of Markov random chains of order 1. This gives a component of the
shape (acceptance) score in the form

A = exp( −½ ( a_u Σ_i (u_i − u_{i−1})² + a_v Σ_i (v_i − v_{i−1})² ) )
where the constants au and av scale the bonding relations.
By varying the parameters a_u, a_v, b_u and b_v, we control the variability and coordination
of the transformations. High values of a_u and a_v force coordination, whilst high
values of b_u and b_v allow greater variability. The σ's represent the variability of particular
vector transformations and allow variability in the distributions from vector to vector.
Our description means that the first point of the outline will remain fixed; in fact we
apply the above procedure only to a sublist of the original outline list (a sweep site in
Grenander's terminology). Since we regard the list as cyclic, the base point can move,
and the outline slowly change location.
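The per-vector refinement step can be sketched as follows; the 2×2 matrix form and the closure of the polygon follow the description above, while the sweep-site sublist and the measured variances are simplified away and treated as free parameters.

import numpy as np

def deform_outline(vectors, sigma_u, sigma_v, a_u, a_v, rng=None):
    # One refinement step: draw u_i ~ N(1, sigma_u), v_i ~ N(0, sigma_v) for
    # the first n-1 vectors, apply the transform [[u, v], [-v, u]] to each,
    # and recompute the last vector so the polygon stays closed.  Also
    # returns the acceptance score A that bonds adjacent transformations.
    if rng is None:
        rng = np.random.default_rng()
    vectors = np.asarray(vectors, dtype=float)
    n = len(vectors)
    u = rng.normal(1.0, sigma_u, n - 1)
    v = rng.normal(0.0, sigma_v, n - 1)
    new = np.empty((n, 2))
    for i in range(n - 1):
        M = np.array([[u[i], v[i]], [-v[i], u[i]]])
        new[i] = M @ vectors[i]
    new[n - 1] = -new[:n - 1].sum(axis=0)             # closure: vectors sum to zero
    A = np.exp(-0.5 * (a_u * np.sum(np.diff(u) ** 2)
                       + a_v * np.sum(np.diff(v) ** 2)))
    return new, A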

3 Control

A feature is either a feature point, or recursively, a suitable grouping of features or feature


points; natural groupings include the mouth, consisting of five feature points, the eyes
and of course the whole face itself. As a putative location for each new feature is returned
by the model feature expert, the best affine transformation between face and image co-
ordinates is calculated using a least squares fit of the located features and the mean model
feature locations. The new location is only accepted if the residual in this matching is
reduced with the additional point.
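A minimal sketch of this fit and acceptance test, assuming at least three features are already in the context and reading "residual reduced" as "residual not increased", is:

import numpy as np

def affine_fit(model_pts, image_pts):
    # Least-squares affine map from model/face coordinates to the image
    # coordinates of the features located so far (needs >= 3 points).
    X = np.hstack([np.asarray(model_pts, float), np.ones((len(model_pts), 1))])
    Y = np.asarray(image_pts, float)
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)          # A has shape (3, 2)
    residual = np.sqrt(np.mean((X @ A - Y) ** 2))
    return A.T, residual                               # 2x3 affine matrix, RMS residual

def accept_new_point(model_pts, image_pts, new_model_pt, new_image_pt):
    # Keep the proposed feature only if the matching residual does not grow.
    _, r_old = affine_fit(model_pts, image_pts)
    _, r_new = affine_fit(np.vstack([model_pts, new_model_pt]),
                          np.vstack([image_pts, new_image_pt]))
    return r_new <= r_old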

Fig. 4. Initial context destined for rejection in favour of the one in figure 5. Fig. 5. Initial context for which refinement succeeds. Fig. 6. Final result obtained for the context shown in figure 5.

The model expert is responsible for creating an initial set of contexts. Two such
contexts are shown diagrammatically in figures 4 and 5. We deliberately generate a number
of contexts at this stage; ultimate success requires a context to be relatively close to the
correct position; in practice we rarely need to refine more than three contexts.
The location of the remaining features and subsequent refinement of all features forms
the second phase of the operation of FindFace. The feature's location in the model is
transformed into image coordinates; a search area is also defined based on the model
feature's variance and the current context residual. Each feature expert in the list is
consulted in turn until one returns a positive result. Since the residual decreases mono-
tonically apart from when each of a finite number of feature points is added to the
context, convergence necessarily occurs; in fact it happens quite quickly. A completed
context is shown in figure 6.

4 Results

The FindFace system has successfully been demonstrated on many interested visitors
to the department, and their faces often include attributes FindFace is not designed to
work with - - glasses, beards or females. More rigorous testing has been performed on
random batches of 50 images from our library of faces, including subjects with beards
and glasses. Another test used a sequence of 64 images of a moving subject, at about 8
frames per second. In a typical batch,
- the head position is correctly located in all images, with the outline completely de-
tected in 43 cases - - the region normally missing from the remainder is the chin;

- the absence of feature experts for the eyebrows reduces the number of possible feature
point locations to 1462, of which the system claims to identify 1292;
- of the located points, 6% were inaccurately or incorrectly identified - - again the
mouth and chin region were usually in error.

The problem with the mouth and chin region is partially attributable to the inclusion
of subjects with beards and moustaches; somewhat surprisingly, glasses do not interfere
as much as originally anticipated. On the sequence of 64 images, the overall success rate
increased to greater than 95%. This result was achieved by processing each image ab
initio; better results would have been obtained by using a priori knowledge obtained
from the previous image(s).

5 Working Faster

A second implementation based on the same design philosophy aims to make detailed
measurements of the eye region from a real time video sequence, and hence detect eye-
lid separation and so blink rate. The other points in the model need only be located
for corroboration purposes; valuable since the image sequence originates from a camera
mounted on the dashboard of a car, pointing towards the driver. The resulting images
suffer from poor contrast, and a variety of different noise elements caused by vehicle mo-
tion, changing lighting conditions etc. We now incorporate facilities for initialising the
system with a new subject, and tracking movement between frames. A system has been
successfully demonstrated that tracks the eye movement at approximately 5 frames per
second, although the detailed eye measurements were not being produced.

References
1. A. D. Bennett and I. Craw. Finding image features using deformable templates and detailed
prior statistical knowledge. In P. Mowforth, editor, British Machine Vision Conference 1991,
pages 233-239, London, 1991. Springer Verlag.
2. P. J. Benson and D. I. Perrett. Perception and recognition of photographic quality facial car-
icatures: Implications for the recognition of natural images. European Journal of Cognitive
Psychology, 3(1):105-135, 1991.
3. I. Craw and P. Cameron. Parameterising images for recognition and reconstruction. In
P. Mowforth, editor, British Machine Vision Conference 1991, pages 367-370, London and
Berlin, 1991. British Machine Vision Association, Springer Verlag.
4. I. Craw, D. Tock, and A. Bennett. Finding face features. Technical Report 92-15, Depart-
ments of Mathematical Sciences, University of Aberdeen, Scotland, 1991.
5. U. Grenander, Y. Chow, and D. Keenan. Hands: A Pattern Theoretic Study of Biological
Shapes. Research Notes in Neural Computing. Springer-Verlag, New York, 1991.
6. M. Kelly. Edge detection in pictures by computer using planning. In B. Meltzer and
D. Michie, editors, Handbook of research on face processing, pages 397-409. Edinburgh Uni-
versity Press, Edinburgh, 1971.
7. A. Knoerr. Global models of natural boundaries: Theory and applications. Pattern Analysis
Technical Report 148, Brown University, Providence, RI, 1988.
8. J. W. Shepherd. An interactive computer system for retrieving faces. In H. D. Ellis, M. A.
Jeeves, F. Newcombe, and A. Young, editors, Aspects of Face Processing, chapter 10, pages
398-409. Martinus Nijhoff, Dordrecht, 1986. NATO ASI Series D: Behavioural and Social
Sciences - No. 28.
Detection of Specularity Using Color and Multiple
Views *

Sang Wook Lee and Ruzena Bajcsy


GRASP Laboratory
Department of Computer and Information Science, University of Pennsylvania
3401 Walnut St., Philadelphia, PA 19082, USA

Abstract. This paper presents a model and an algorithm for the detection
of specularities from Lambertian reflections using multiple color images from
different viewing directions. The algorithm, called spectral differencing, is
based on the Lambertian consistency that color image irradiance from Lam-
bertian reflection at an object surface does not change depending on view-
ing directions, but color image irradiance from specular reflection or from
a mixture of Lambertian and specular reflections does change. The spectral
differencing is a pixelwise parallel algorithm, and it detects specularities
by color differences between a small number of images without using any
feature correspondence or image segmentation. Applicable objects include
uniformly or nonuniformly colored dielectrics and metals, under extended
and multiply colored scene illumination. Experimental results agree with the
model, and the algorithm performs well within the limitations discussed.

1 Introduction

Recently there has been a growing interest in the visual measurement of surface re-
flectance properties in both basic and applied computer vision research. Most vision
algorithms are based on the assumption that visually observable surfaces consist only of
Lambertian reflection. Specularity is one of the major hindrances to vision tasks such
as image segmentation, object recognition and shape or structure determination. With-
out any means of correctly identifying reflectance types, image segmentation algorithms
can be easily misled into interpreting specular highlights as separate regions or as dif-
ferent objects with high albedo. Algorithms such as shape from shading and structure
from stereo or motion can also produce false surface orientation or depth from the non-
Lambertian nature of specularity. Therefore it is desirable to have algorithms for esti-
mating reflectance properties as a very early stage or an integral part of many visual
processes. In many industrial applications, there is a great demand for visual inspection
of surface reflectance which is directly related to the quality of surface finish and paint.
Although the measurement of surface reflectance properties in applied physics has
been the topic of many research efforts, only a few attempts in computer vision have
been made until recently. There has been an approach to the detection of specularity
with a single gray-level image using the Lambertian constraints by Brelstaff and Blake
[BB88]. They attempted to extract maximal information from a single gray-scale image.
* This work was partly supported by E. I. du Pont de Nemours and Company, Inc. and partly by
the following grants: Navy Grant N0014-88-K-0630, AFOSR Grants 88-0244 and 88-0296;
Army/DAAL 03-89-C-0031PRI; NSF Grants CISE/CDA 88-22719, IRI 89-06770, and ASC
91-0813. We thank Ales Leonardis at the University of Ljubljana, Slovenija, for his collaboration
on our color research during his stay at the GRASP Lab. Special thanks to Steve Shafer at
Carnegie Mellon University for helpful discussions and comments.

Although moderate success was demonstrated in detecting some apparent specularities,


the problem is physically underconstrained.
In order to overcome the inherent limitations of a lack of information in a single
image, the natural development is of course to collect more images in physically sensible
ways, using optical and physical models which describe how surfaces appear according to
the reflectance properties and sensor characteristics. Recently the computer vision field
has increasingly incorporated methodologies derived from physical principles of image
formation and sensing. There have been three types of approaches so far in solving the
problem: collection of more images (1) with different light directions, (2) with different
sensor polarization and (3) with different spectral sensors.
The photometric-stereo-type approaches are the most comprehensive methods in in-
vestigating surface reflectance properties. The objective of the approaches is to obtain
object shape and both Lambertian and specular reflectances separately, and more than
two light sources are required for recently proposed algorithms [NIK90] [Td91]. The ba-
sic technique of the photometric-stereo-type approaches is the switching of illumination
sources. The direction and the degree of the collimation of the illumination need to be
strictly controlled. Therefore, application is restricted to dark-room environments where
the illumination can be strictly controlled.
Wolff proposed a method of detecting specularities using the analysis of the polar-
ization of reflected light [Wo189]. The polarization approach places restrictions on illu-
mination directions with only two polarizer angles. Although many angles of polarizer
filters are suggested for extended light, it is yet to be demonstrated extensively how the
algorithm performs for rough surfaces under extended light sources.
The dichromatic model [Sha85] proposed by Shafer has been the key model to the
recent specularity detection algorithms using color, such as the ones by Klinker, Shafer
and Kanade, by Gershon, by Healey and Binford, and more recently by Bajcsy, Lee and
Leonardis [KSK88] [Ger87] [HB89] [BLL90]. The basic limitation of the color algorithms
is that objects have to be only colored dielectrics to use the dichromatic model. Another
limitation is the requirement for image segmentation as an essential part of the algorithm.
For image segmentation, it is usually assumed that object surface reflectance is spatially
piecewise uniform and scene illumination is singly colored. The algorithms using color
detect only probable specularities, since variation in object reflectances or in illumination
color can result in the spectral variation of the scene that may be interpreted as the
presence of specularities.
All the algorithms mentioned above have their limitations as well as advantages. The
assumptions involved with each algorithm pose limitations on the applicable domains
of the objects and illumination. The primary objective of the research presented in this
paper was to develop a model applicable to more general object and illumination domains
than the ones of previous algorithms, using color and multiple views. As mentioned above
the photometric-stereo-type approaches require strict control of the illumination light, and
the polarization method has a restriction in illumination directions. Illumination control
is not possible in general environments. Examples include outdoor inspection, and indoor
or outdoor navigation or exploratory environments. Even for indoor inspection, a well
controlled dark room is not always available.
The color segmentation approaches impose restrictions on the object and illumination
color, because of the limited information in a single color image. Therefore it would be
desirable to have any extra information in order to overcome the limitations in the object
and illumination domains. The idea of moving the observer is motivated by the concept of
active observer [Baj88]. It is well accepted that with extra views added, extra geometric

information can be obtained. For low-level vision problems of shape or structure, it has
been demonstrated that many ill-posed problems become well-posed if more information
is collected by active sensors [AB87]. Although the paradigms for shape or structure
based on feature correspondence cannot be directly applied to the study of reflectance
properties, the idea of an active observer motivates the investigation of new principles by
physical modeling in obtaining more information. A question to be answered is what kind
of extra spectral information can be obtained by a moving camera without considering
object geometry. If there is any, it may alleviate the limiting assumptions required for
color segmentation approaches and provide higher confidence in detecting specularities.
In this paper, a model is presented for explaining extra spectral information from
two or more views, and a specularity detection algorithm, called spectral differencing, is
proposed. The algorithm does not require any assistance from image segmentation since
it does not rely on the dichromatic model. The algorithm only exploits the variation of
different spectral composition of reflection depending on viewing directions, therefore it
does not require any geometric manipulation using feature correspondence. An important
principle used is the Lambertian consistency that the Lambertian reflection does not
change its brightness and spectral content depending on viewing directions, but the
specular reflection or the mixture of Lambertian and specular reflections can change.
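Purely as an illustration of how Lambertian consistency can be exploited without correspondence (and not as the paper's spectral-differencing operator, which is defined in Sect. 4), one naive check is to flag pixels of one view whose quantised color does not occur anywhere in the other view; the bin width tau is an assumed parameter.

import numpy as np

def spectral_flags(view1, view2, tau):
    # Flag pixels of view1 whose quantised RGB color (bin width tau) does not
    # occur anywhere in view2.  Under Lambertian consistency, purely
    # body-reflected colors should reappear in the other view, while specular
    # or mixed colors may not.  Views are (H, W, 3) float arrays.
    c1 = view1.reshape(-1, 3)
    c2 = view2.reshape(-1, 3)
    occupied = set(map(tuple, np.unique((c2 / tau).astype(int), axis=0)))
    missing = [tuple(q) not in occupied for q in (c1 / tau).astype(int)]
    return np.array(missing).reshape(view1.shape[:2])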
Basic spectral models for reflection mechanisms are introduced in Sect. 2, and Sect.
3 explains how the measured color appears in a three-dimensional color space. A model
is also established in Sect. 3 for explaining the spectral difference between different
views for uniform dielectrics under singly colored illumination. The detection algorithm
of spectral differencing is then described in Sect. 4, and Sect. 5 discusses the spectral
differencing for various objects that include nonuniformly colored dielectrics and metals,
under multiply colored illumination. Experimental results are presented in Sect. 6.

2 Reflection Model

Physical models for light-surface interaction and for sensing are crucial in developing the
algorithms for detection of specularity. Several computer vision researchers have intro-
duced useful models based on the physical process of image-forming [TS67] [BS63] [Sha85]
[LBS90] [HB89]. Although there are certain approximations, the models introduced in
this section are generally well accepted in computer vision for their good approximation
of the physical phenomena.

2.1 Reflection Type

There are two physically different types of reflections for dielectric materials according
to the dichromatic model proposed by Shafer [Sha85], interface or surface reflection and
body or sub-surface reflection. Reflection types are summarized in Fig. 1. The surface or
interface reflection occurs at the interface of air and object surface. When light reaches
an interface between two different media, some portion of the light is reflected at the
boundary, resulting in the interface reflection, and some refracted into the material. The
ratio of the reflected to the refracted light is determined by the angle of incidence and
the refractive indices of the media. Since the refractive indices of dielectric material are
nearly independent of wavelength (A) over the visible range of light (400 n m to 700 n m of
wavelength), interface reflectance of dielectrics can be well approximated as flat spectrum
as shown in Fig. 1 [LBS90].

The refracted light going into a sub-surface is scattered from the internal pigments and
some of the scattered light is re-emitted randomly resulting in the body or sub-surface
reflection. Thus the reflected light has the Lambertian property due to the randomness
of the re-emitted light direction. The Lambertian reflection means that the amount of
reflected light does not depend on the viewing direction, but only on the incident light.
Depending on the pigment material and distribution, the reflected light undergoes a spec-
tral change, i.e., the spectral power distribution (SPD) of the reflected light is the product
of the SPD of the illumination and the body reflectance. The fact that the interface and
the body reflections are often spectrally different is the key concept of the dichromatic
model, and central to many detection Mgorithms by color image segmentation.
For metals, electromagnetic waves cannot penetrate into the material by more than
skin depth because of the large conductance that results in large refactive index. Therefore
all the reflections occur at the interface, and due to the lack of the body reflection, metals
are unichromatic [HB89]. Interface reflections from most metals are white or grey, e.g.,
from silver, iron, aluminum. However, there are reddish metals such as gold and copper.

Fig. 1. Reflection models: interface (surface) reflection and body (sub-surface) reflection for dielectrics and metals; specular (glossy), diffuse (non-glossy) and perfectly diffuse reflection as a function of surface roughness; and the effect of extended illumination. Interface reflectance of dielectrics is approximately flat over 400-700 nm, while body reflectance carries the spectral (Lambertian) color.

Reflections can also be categorized as glossy specular and nonglossy diffuse reflections
depending on the appearance. This categorization depends on the degree of diffusion in
the reflected light direction. Body reflection can be modeled as a perfectly diffuse re-
flection, i.e., Lambertian reflection. Specularity results from interface reflection, and the
reflected direction of the specularity depends both on the illumination direction and sur-
face orientation. The specularity is diffused depending on the surface roughness. There
have been some models that describe the scattering of light by rough surfaces. The
physical modeling of Beckman and Spizzichino [BS63] is based on the electromagnetic
scattering of light waves at rough surfaces. The simpler geometric modeling by Torrance
and Sparrow is widely accepted in computer vision and graphics as a good approximation
of the physical phenomenon [TS67]. The Torrance-Sparrow model assumes that a sur-
face is composed of small, randomly oriented, mirror-like microfacets, and the Gaussian
function is used for the distribution of the microfacets. A rougher surface has a wider
distribution of the microfacets, and the direction of the reflected light is more diffuse, as
is illustrated in Fig. 1.
Diffusion in the direction of the interface reflection may also result from extended and diffuse illumination, as shown in Fig. 1. When the illumination is extended and diffuse, the range of incidence angles at a surface patch is extended and the interface reflections appear diffuse even for a smooth surface. When the surface is rough, the reflection is more diffuse.
In this paper, "specular reflection" is used to denote interface reflection, while "Lam-
bertian reflection" is used to denote body reflection. Use of "surface reflection" for denot-
ing the interface reflection is avoided, since, in a wider sense, it means all the reflections
from surface and sub-surface. In this paper, "surface reflection" is used in the wider sense.
When "diffuse reflection" is used, it can be either diffuse interface or body reflection.

2.2 Representation and Sensing
For singly colored illumination $e(\lambda)$, whether geometrically collimated or extended, the scene radiance is given as the product of illumination and reflection, i.e.,
\[
L_r(\lambda) = e(\lambda)\, s(\lambda), \tag{1}
\]
where $s(\lambda)$ is the reflection and $\lambda$ is the wavelength of light. The surface reflection is the linear combination of specular and Lambertian reflections with different geometric weighting factors, i.e.,
\[
s(\lambda) = \rho_S(\lambda)\, G_S(\theta_r, \phi_r) + \rho_B(\lambda)\, G_B, \tag{2}
\]
where $\rho_S(\lambda)$ and $\rho_B(\lambda)$ are the specular and the Lambertian reflectances, i.e., Fresnel reflectance and albedo, respectively, $(\theta_r, \phi_r)$ denotes the reflection direction, and $G_S(\theta_r, \phi_r)$ and $G_B$ are purely geometric factors which are independent of spectral information. The geometric factors are determined by the illumination and viewing directions with respect to the surface orientation. Observation of the specular reflection is highly dependent both on the viewer and on the illumination directions, while observation of the body reflection depends only on the illumination direction.
Note that, for metals, $\rho_B(\lambda) G_B$ is 0, and for dielectrics, $G_B$ is independent of the viewing angle $(\theta_r, \phi_r)$. It has been reported that the spectral composition of Lambertian reflection changes slightly when the incident light direction approaches $90^\circ$ with respect to the surface normal (glancing incidence) [HB89]. However, this effect is small even near glancing incidence, and thus is neglected in the model.
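
As a minimal numerical sketch of the dichromatic model in (1) and (2), the following code evaluates the scene radiance of a highlight pixel and a matte pixel; the wavelength grid, the flat illuminant, and the Gaussian body reflectance are illustrative assumptions, not data from this paper.

```python
import numpy as np

# Wavelength samples over the visible range (assumed grid).
lam = np.linspace(400.0, 700.0, 31)                    # nm

# Assumed spectra: a flat (neutral) illuminant, a flat interface reflectance
# (reasonable for dielectrics, Sect. 2.1), and a bluish body reflectance.
e     = np.ones_like(lam)                              # e(lambda)
rho_S = np.ones_like(lam)                              # interface (specular) reflectance
rho_B = np.exp(-((lam - 460.0) / 60.0) ** 2)           # body (Lambertian) reflectance

def scene_radiance(G_S, G_B):
    """Eqs. (1)-(2): L_r(lambda) = e(lambda) [rho_S(lambda) G_S + rho_B(lambda) G_B]."""
    return e * (rho_S * G_S + rho_B * G_B)

L_highlight = scene_radiance(G_S=0.8, G_B=0.3)         # large specular geometric factor
L_matte     = scene_radiance(G_S=0.0, G_B=0.3)         # body reflection only
```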
When there is more than one illumination source, with different colors from different directions, the sum of the reflections under the different illumination sources,
\[
L_r(\lambda) = e_1(\lambda) s_1(\lambda) + e_2(\lambda) s_2(\lambda) + \cdots, \tag{3}
\]
is used for establishing the models presented in this paper.
Color image sensing is usually performed with a CCD camera using filters with different spectral responses. With three filters (usually R, G and B), the quantum catch, or measured signal from the camera, is given by
\[
q_k = \int_{\lambda_1}^{\lambda_2} Q_k(\lambda)\, L_r(\lambda)\, d\lambda, \qquad k = 0, 1, 2, \tag{4}
\]
where $Q_k(\lambda)$ and $q_k$ are the spectral response of the $k$-th filter and the camera output through the $k$-th filter, respectively. The wavelengths $\lambda_1 = 400$ nm and $\lambda_2 = 700$ nm cover the range of the visible spectrum.

In many works in color science, interpretation of measured color is often performed in a three-dimensional color space defined by three basis functions, as well as in the RGB sensor space, and several sets of basis functions have been suggested and used [Coh64] [MW86]. In this paper, the scene radiance model is explained in a general three-dimensional color space, although the proposed algorithm is implemented with the RGB space.
The scene radiance $L_r(\lambda)$ can be approximated with three basis functions $S_0(\lambda)$, $S_1(\lambda)$ and $S_2(\lambda)$ as
\[
L_r(\lambda) = \sum_{i=0}^{2} \gamma_i S_i(\lambda), \tag{5}
\]

where the $\gamma_i$'s are scalar weighting factors. The relationship between the sensor responses $q_k$ and the coefficients $\gamma_i$ is a linear transformation given as
\[
\mathbf{q} = A \boldsymbol{\gamma}, \qquad \boldsymbol{\gamma} = V \mathbf{q}, \qquad A_{ki} = \int_{\lambda_1}^{\lambda_2} S_i(\lambda)\, Q_k(\lambda)\, d\lambda, \tag{6}
\]
where $\mathbf{q} = [q_0, q_1, q_2]^T$, $\boldsymbol{\gamma} = [\gamma_0, \gamma_1, \gamma_2]^T$, $V = A^{-1}$, and $A_{ki}$ is the element of $A$ in the $k$-th row and $i$-th column.
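
To make the sensing model of (4)-(6) concrete, a small sketch is given below that discretizes the integrals as sums over a wavelength grid; the Gaussian filter responses standing in for $Q_k(\lambda)$ and the rescaled Fourier basis are assumptions made only for illustration.

```python
import numpy as np

lam = np.linspace(400.0, 700.0, 31)                    # wavelength samples (nm)
dlam = lam[1] - lam[0]

# Assumed sensor responses Q_k(lambda): three Gaussian filters (stand-ins for R, G, B).
Q = np.stack([np.exp(-((lam - c) / 40.0) ** 2) for c in (610.0, 540.0, 460.0)])

# Basis functions S_i(lambda): a Fourier basis, rescaled so one period spans 400-700 nm.
x = 2.0 * np.pi * (lam - 400.0) / 300.0
S = np.stack([np.ones_like(lam), np.sin(x), np.cos(x)])

# A_{ki} = integral of S_i(lambda) Q_k(lambda) dlambda, approximated by a sum (eq. 6).
A = (Q @ S.T) * dlam
V = np.linalg.inv(A)

def sensor_response(L_r):
    """q_k = integral of Q_k(lambda) L_r(lambda) dlambda (eq. 4), discretized."""
    return (Q @ L_r) * dlam

def basis_coefficients(q):
    """gamma = V q (eq. 6): coordinates of the measured radiance in the S space."""
    return V @ q
```

Because the transformation is linear, any property stated below for the $\gamma_i$'s (e.g., the collinearity of shaded Lambertian points) holds equally in the RGB sensor space.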

3 Spectral Scene Radiance from Different Views

The vector $\mathbf{q}$, or its linear transformation $\boldsymbol{\gamma}$, represents the measured scene radiance that results from the illumination and reflectance colors and from the geometric weighting. This section explains how the measured $\mathbf{q}$'s or $\boldsymbol{\gamma}$'s from a color image appear in a general three-dimensional color space, and establishes a model for a specularity detection algorithm using color information from different views. The three-dimensional spectral space constructed from the RGB values or from the basis functions $S_0(\lambda)$, $S_1(\lambda)$ and $S_2(\lambda)$ is generally called the S space in this paper.
In this section, the spectral scene radiance is considered only for dielectric objects with uniform reflectance under singly colored illumination. Dielectric materials with reflectance variation, and metals, under multiply colored illumination will be discussed in Sect. 5.

3.1 Lambertian Reflection

For Lambertian surfaces, shading results from the variation in surface orientation relative to the illumination direction. In the S space, the scene radiance generated by shaded Lambertian reflections forms linear clusters.
The scene radiance from Lambertian reflection is given from (1), (2) and (5) by
\[
L_r(\lambda) = e(\lambda)\, \rho_B(\lambda)\, G_B = \sum_{i=0}^{2} \gamma_i S_i(\lambda). \tag{7}
\]

Shading on a surface of uniform reflectance is due to variations in the geometric factor $G_B$ in (7), and the spectral curve of $L_r(\lambda)$ is scaled by $G_B$. When the spectral curves of $e(\lambda)$ and $\rho_B(\lambda)$ are assumed to be constant over the differently shaded object patches, ratios such as $\gamma_1/\gamma_0$ and $\gamma_2/\gamma_0$ are independent of the shading factor $G_B$, and thus the ratios of the $\gamma_i$'s are constant over the differently shaded patches. Therefore the $\gamma_i$'s form a linear cluster in the S space, as shown in Fig. 2 (a). This property has been previously suggested
[Figure 2: panels (a)-(d) as described in the caption; the simulated geometry is parameterized by the viewing direction $(\theta_v, \phi_v)$ and the illumination direction $(\theta_i, \phi_i)$.]

Fig. 2. Shading; (a) linear cluster (b) coordinates for simulated images (c) Lambertian shading (d) linear cluster in S space

and used in segmentation [Sha85] [KSK88]. The orientation of the vector $\boldsymbol{\gamma}$ is determined by the Lambertian reflectance and the illumination, and is independent of geometry.
Examples are shown by simulation with a spherical object for the geometry shown in Fig. 2 (b). Figure 2 (c) and (d) show a simulated image of a sphere and its color cluster in the S space, with $(\theta_v, \phi_v) = (35^\circ, 0^\circ)$ and $(\theta_i, \phi_i) = (0^\circ, 0^\circ)$, respectively. For the simulation, a spectrum measured from a real blue color plate is used for the reflectance, and a linear cluster is shown in the S space for spectrally flat neutral light. The Fourier basis functions $S_0(\lambda) = 1$, $S_1(\lambda) = \sin\lambda$ and $S_2(\lambda) = \cos\lambda$ are used for the S space in Fig. 2 (d).

[Figure 3: a shaded Lambertian surface observed from two views, and the corresponding color clusters in the S space.]

Fig. 3. Lambertian surface from multiple views; (a) geometric illustration (b) color clusters in S space

Figure 3 (a) illustrates Lambertian surfaces seen from two different views. When the illumination is the same for the different views, Lambertian reflections from a surface appear in the same locations in the S space regardless of the viewing angle. However, occlusion of surfaces by other object surfaces can affect the distribution of color points in the linear cluster. In view 1, not all the Lambertian surfaces are visible due to occlusion, and color clusters from only a part of the object are observable in the S space. The visible color clusters are shown as a dark solid line in Fig. 3 (b). On the other hand, those invisible surfaces are disoccluded in view 0. Disocclusion is the emergence of object points or patches into visibility from behind occlusion. Depending on the shading of the disoccluded part, the emerging color clusters of the object can be included in the linear cluster, or can appear outside the cluster but on its extended line, since the disoccluded part belongs to the same object. An example of the spectral difference between the two views is shown as the gray lines in Fig. 3 (b). Note that orthographic projections are illustrated in Fig. 3 (a), but the above explanation also applies to perspective projections.

3.2 Specular Reflection

Highlights are due to specular reflections from dielectrics or metals. The scene radiance from specular reflection is given by
\[
L_r(\lambda) = e(\lambda)\, \rho_S(\lambda)\, G_S(\theta_r, \phi_r) = \sum_{i=0}^{2} \gamma_i S_i(\lambda). \tag{8}
\]
Specular reflections alone, e.g., from metals or from black dielectrics, form linear clusters in the S space like the Lambertian reflections. Because of the neutral reflectance, the direction of the linear cluster from dielectrics is the same as the illumination direction in the S space. On the other hand, the direction of a linear cluster from a metal is determined by both the spectral reflectance and the illumination.
For dielectrics, specular reflections are added to Lambertian reflections as shown in
Fig. 4 (a). With extended illumination or with roughened surfaces, the distribution of
specular reflections can spatially vary over a wide area of the shaded surface as shown
in Fig. 4 (a). Therefore specular reflections form planar clusters which include the linear clusters formed by shading in the S space. The orientation of the plane depends on the illumination color. When illumination is well collimated and the surface is smooth, the
color clusters form generally skewed T or L shapes as suggested by Shafer [Sha85], since
the specular reflection is distributed in a small range of shading and forms a linear cluster
connected to a linear cluster of the Lambertian reflections. However, when illumination
is spatially extended or the surface is rough, the color clusters generally form skewed P
shapes, and the color cluster of specular reflections is planar and coplanar with a linear
cluster of Lambertian reflections.

[Figure 4: planar specular clusters in the S space and the viewing geometry for two views.]

Fig. 4. Specularity; (a) in S space (b) geometric illustration for multiple views (c) color clusters in S space for smooth surface (d) for rough surface

Figure 4 (b) illustrates the positions of specularities on a shaded Lambertian surface for two different views. Depending on the viewing direction, the specularities $h_0$ and $h_1$ in Fig. 4 (b) are located on spatially different shaded parts of the surface.
In the S space, color clusters from the specularities are located on differently shaded
Lambertian clusters as shown in Fig. 4 (c) and (d). The shape and position of the spec-
ular clusters depend on viewing directions and surface roughness as well as on surface
orientations and illumination directions. Figures 5, 6 and 7 show the color clusters of
specular and Lambertian reflections by simulation for different surface roughness and for
different collimation of neutral illumination. As shown in the figures, the planar color
clusters in the S space are shaped differently depending on the viewing directions.

Fig. 5. Reflections for a smooth surface and collimated illumination, for $(\theta_i, \phi_i) = (0^\circ, 0^\circ)$, $(\theta_v, \phi_v) = (0^\circ, 0^\circ), (35^\circ, 0^\circ), (70^\circ, 0^\circ)$, and relative surface roughness 0.1

Fig. 6. Reflections for a smooth surface and extended illumination, for $0^\circ \le \theta_i \le 30^\circ$, $0^\circ \le \phi_i \le 360^\circ$, $(\theta_v, \phi_v) = (0^\circ, 0^\circ), (35^\circ, 0^\circ), (70^\circ, 0^\circ)$, and relative surface roughness 0.1

Except for Lambertian disocclusions, the spectral difference between the views results from specularities, although it does not account for all the specularities, because of the overlap of specularities in the S space. In Fig. 4 (c), all the specularities contribute to the spectral difference, but in Fig. 4 (d) only part of the specularities does, since there is an overlap between the specular clusters of the two views. Since the amount of spectral displacement of the specularities is determined by the difference in viewing angles, the object shape, variations in the object shape, and the illumination distribution, it is difficult to predict in a simple manner for general objects and illumination. The general rule, however, is that as the difference in the viewing directions increases, the spectral overlap between the specularities decreases. If the object shape varies more, the specularities are likely to change more. Specularities often disappear completely depending on the view.
A point to note is occlusion by specularity. In some views, specularities can be distributed such that some Lambertian shading may not be visible at all. In other views, that Lambertian shading may appear as new clusters in the S space, and can therefore be detected as a spectral difference.

4 Spectral Differencing Algorithm

Fig. 7. Reflections for a rough surface and collimated illumination, for $(\theta_i, \phi_i) = (0^\circ, 0^\circ)$, $(\theta_v, \phi_v) = (0^\circ, 0^\circ), (35^\circ, 0^\circ), (70^\circ, 0^\circ)$, and relative surface roughness 0.3

[Figure 8: two images α and β of an object taken from different views, and the corresponding MSD computation in the S space.]

Fig. 8. Spectral differencing; (a) images from different views (b) color clusters in S space

For two color images with different viewpoints, spectral differencing is an algorithm for finding the color points of one image which do not overlap with any color points of the other image in a three-dimensional spectral space (e.g., the S space or a sensor space with RGB values). In order to detect the view-inconsistent color points, the spectral differencing algorithm computes minimum spectral distance (MSD) images. The computation of an MSD image is explained as follows with the example shown in Fig. 8.
Let α and β be two color images obtained from two different views. The notation
MSD(α ← β)
represents the MSD image of α from β. A pixel value of the MSD image MSD(α ← β) is the minimum of all the spectral distances between that pixel of image α and all the pixels of image β. The spectral distance is defined as the Euclidean distance between two color points in a three-dimensional spectral space. Any MSD above a threshold indicates the presence of specular reflections or Lambertian disocclusions. The threshold for the MSD image is determined only by sensor noise, and no adjustment is required for different environments.
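
A brute-force sketch of the MSD computation just described is shown below; it assumes the two views are available as H x W x 3 arrays of color values, and the k-d tree is used only to speed up the nearest-neighbour search, not as part of the algorithm itself.

```python
import numpy as np
from scipy.spatial import cKDTree

def msd_image(img_a, img_b):
    """MSD(a <- b): for every pixel of image a, the minimum Euclidean distance
    to the color points of image b in the three-dimensional spectral space
    (e.g., RGB or the S space)."""
    pts_a = img_a.reshape(-1, 3).astype(float)
    pts_b = img_b.reshape(-1, 3).astype(float)
    dist, _ = cKDTree(pts_b).query(pts_a)      # nearest spectral neighbour in image b
    return dist.reshape(img_a.shape[:2])

def view_inconsistent_pixels(img_a, img_b, threshold=2.0):
    """Pixels of view a that are view-inconsistent with view b, i.e. candidate
    specularities or Lambertian disocclusions; the threshold depends only on
    sensor noise (2 gray levels in the experiments of Sect. 6)."""
    return msd_image(img_a, img_b) > threshold
```

Since every pixel of image a is treated independently, the computation is pixelwise parallel, as noted later in this section.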
Figure 8 (a) illustrates two images of an object with specularity from two different viewpoints, and the corresponding color clusters in the S space are shown in Fig. 8 (b). The pixel P of image α is located far from the specular and Lambertian color points of image β, which indicates specular reflection at P. On the other hand, the Lambertian reflections of views α and β share the same linear cluster. Since the pixel R in the region of Lambertian reflection of image α is close to the Lambertian points of image β in the S space, it should not be detected by spectral differencing.
Spectral differencing does not detect all the specularities in a view. In Fig. 8 (b), the color point from the pixel Q is located in the region where the planar clusters of views α and β overlap. Since Q lies within the planar cluster formed by specular reflection in view β, it is hard to detect Q as a specular reflection when the color points of that planar cluster are densely populated. The specular reflection at Q can be detected by this algorithm only when the color points in the planar cluster of view β are sparsely distributed around Q.
In this paper, no study of faster algorithms for spectral differencing is presented. However, an important point to note is that the algorithm is pixelwise parallel. Therefore, with a parallel machine, the computation time depends only on the degree of parallelism achievable on the machine.
Spectral differencing is performed for the three simulated images shown in Fig. 7, and
the three images and the MSD images are shown in Fig. 9. The table of image arrange-
ment is also shown in Fig. 9. All the MSD images in Fig. 9 show detected specularities.

view 0        MSD(1←0)     MSD(2←0)
MSD(0←1)      view 1       MSD(2←1)
MSD(0←2)      MSD(1←2)     view 2

Fig. 9. Spectral differencing for the simulated images in Fig. 7

The detection is always an underestimation of the specular region, except for disocclusions. The disoccluded Lambertian reflections are shown in MSD(0←2), and there is a region detected due to specular disocclusion in MSD(1←2). In view 2, the brightest shading is occluded by specularities.

5 Extended Object and Illumination Domain

In the previous sections, spectral differencing was explained only for dielectrics with uniform reflectance under singly colored illumination. In this section, it is shown that spectral differencing is effective for a wider range of objects under multiply colored illumination as well.

5.1 Dielectrics with Reflectance Variation

When the reflectance $\rho_B(\lambda)$ is not uniform in color over a surface but varies gradually, the measured colors from the shaded Lambertian surface do not form a linear cluster. The color cluster is dispersed depending on the degree of variation in the reflectance, as illustrated in Fig. 10 (a). Some natural surfaces such as wood grain, leaves and human faces have such variation in reflectance.
Figure 10 illustrates the color clusters of a dielectric object with varying Lambertian reflectance. The Lambertian cluster is not linear due to the variation of $\rho_B(\lambda)$ in (1) and (2), which are combined and written again below:
\[
L_r(\lambda) = e(\lambda)\left[\rho_S(\lambda)\, G_S(\theta_r, \phi_r) + \rho_B(\lambda)\, G_B\right]. \tag{9}
\]
Even with volume or planar clusters from Lambertian reflection, the Lambertian consistency applies (except for disocclusion), since the geometric factor $G_B$ is independent of the viewing angle $(\theta_r, \phi_r)$. On the other hand, specularities are mixed with differently shaded and colored Lambertian reflections depending on the viewing direction, since the geometric factor $G_S(\theta_r, \phi_r)$ for specular reflection varies with $(\theta_r, \phi_r)$ in (9). Therefore spectral differencing can detect the specularities, which have different spectral values in different views.

[Figure 10: panels (a)-(c) show color clusters in the S space for a dielectric with varying reflectance.]

Fig. 10. Dielectric with variation in reflectance

5.2 Dielectrics under Multiply Colored Illumination

Under multiply colored illumination, Lambertian clusters even from uniform dielectrics are not linear, but are dispersed in the S space due to the variation in the illumination. However, the distribution of Lambertian color points is invariant with respect to the viewing direction, except for occlusions and disocclusions. On the other hand, the distribution of color points from specularities changes with the viewing direction due to the different mixtures of colors between the specular reflections, or between the specular and Lambertian reflections.
From (1) and (3), the scene radiance with two illumination sources is given as
\[
L_r(\lambda) = \rho_S(\lambda)\left[e_1(\lambda)\, G_{S1}(\theta_r, \phi_r) + e_2(\lambda)\, G_{S2}(\theta_r, \phi_r)\right] + \rho_B(\lambda)\left[e_1(\lambda)\, G_{B1} + e_2(\lambda)\, G_{B2}\right], \tag{10}
\]
and Fig. 11 shows an example with two illumination sources. The Lambertian reflection forms a planar cluster spanned by the two illumination colors, and is independent of $(\theta_r, \phi_r)$. When there are more than two illumination sources with different colors, or when $\rho_B(\lambda)$ varies, the Lambertian reflection generally forms a volume cluster.

[Figure 11: color clusters in the S space for a dielectric under two illumination colors, seen from three viewpoints, and a sketch of inter-reflection between an object and the global illumination.]

Fig. 11. (a)-(c) Dielectric under multiply colored illumination from varying viewpoints (d) inter-reflection

The specular reflection is a linear combination of the two components from the two illumination colors $e_1$ and $e_2$. Each component, as well as the combination, varies depending on the viewing angle $(\theta_r, \phi_r)$. When the specularities appear on different Lambertian surfaces without overlap, they represent the illumination colors of the two directions separately, as shown in Fig. 11 (a). When the viewing geometry changes, the specularities can mix on a surface and produce new specular points in the S space, as shown in Fig. 11 (b) and
(c). Therefore spectral differences result from specular reflections except for Lambertian
disocclusions.

Inter-reflection. When there are many objects, the object surface of interest receives not only the light from the illumination sources but also the light reflected from other objects. The latter causes a local change of illumination, as illustrated in Fig. 11 (d). Reflection involving more than one surface is called inter-reflection. The object surfaces producing the first reflection act as secondary light sources, which are generally extended depending on the object size. Together with the direct global illumination, the first reflection provides multiply colored illumination for other surfaces, as shown in Fig. 11 (d), and influences the distribution of the color clusters of those surfaces. In indoor environments, reflections from walls and ceiling are the major sources of ambient light.

5.3 Metals

Since metals have only specular reflectance, there is no Lambertian reflection. When there is only a single illumination source for a uniform metallic object without any ambient light, only a linear cluster appears in the S space. In most cases, however, metals are observed with reflections from many light sources, which include the scene illumination sources and many surrounding objects. Shiny metals, in particular, reflect all the incoming light from surrounding objects. The mixture of reflected light from direct illumination sources and inter-reflections changes with the viewing direction, with different geometric weightings of the light coming from different directions. Therefore the color changes due to the different mixtures of light can be detected by spectral differencing.
An example is shown in Fig. 12 for the scene radiance under two different illumination sources. Without any Lambertian component in (10), the scene radiance is a combination of two specular components,
\[
L_r(\lambda) = \rho_S(\lambda)\left[e_1(\lambda)\, G_{S1}(\theta_r, \phi_r) + e_2(\lambda)\, G_{S2}(\theta_r, \phi_r)\right]. \tag{11}
\]

When the two components appear separately in a measured image without being mixed, the color points in the S space form two different linear clusters, as shown in Fig. 12 (a). Depending on the viewing direction, the two colors can be combined differently, as shown in Fig. 12 (b) and (c), and spectral differencing can detect the reflections that differ in color.

Fig. 12. Metal under multiply colored illumination from varying viewpoints

6 Experimental Results and Discussion

In order to test the algorithm, experiments were carried out on various objects, all under multiply colored illumination. Illumination was provided by fluorescent lights from two directions on the ceiling of the room and by a tungsten light from another direction located closer to the objects. Four large fluorescent tubes were used, two in each direction; half of the tungsten light bulb was screened with white paper to diffuse the light, while the remaining half was left exposed. White walls and ceiling provide some ambient illumination. The illumination environment is a normal indoor one, not a dark room with collimated light.

Fig. 13. Specularity with variation in Lambertian reflectance

Figure 13 shows images of dielectric objects with smooth reflectance variation. The
arrangement of measured images and MSD images is the same as that in Fig. 9. The
porcelain horse has variation in its Lambertian reflectance, especially near its shoulder
and the saddle. The MSD images show nonzero values where most of the sharp and diffuse
specularities are. The threshold for the MSD images was experimentally determined as
2 in terms of the RGB input values (0-255).

Fig. 14. Specularity from metal

Figure 14 shows the results for a metallic object. The MSD images clearly show most of the sharp specularities, indicating that the spectral movement of the sharp specularities is large. Some of the diffuse specularities are also detected. For a given object shape, the diffuse specularities are better detected with wider angles between the views. In fact, all the reflections from metals are specular reflections. However, the very diffuse reflections are not detectable when they form densely populated color clusters, like Lambertian reflections, in the three-dimensional color space, so that the different viewing geometry does not generate enough spectral difference.
The experimental results with real objects demonstrate that spectral differencing is a remarkably simple and effective way of detecting specularities without any geometric reasoning. The algorithm does not require any geometric information or image segmentation. Therefore it can provide independent information to other algorithms such as structure from stereo, structure from motion, or image segmentation. Since spectral differencing does not depend on any image segmentation, there are no assumptions of uniformly colored dielectric objects or singly colored illumination.
A limitation of the spectral differencing algorithm is that disocclusions are detected together with specularities and the two are indistinguishable. Separation of specularity from disocclusion may be achieved with other algorithms, such as color image segmentation [BLL90]. As mentioned above, the spectral differencing algorithm can be easily integrated with a color segmentation algorithm, and we are currently developing such integrated methods.

7 Conclusion
In this paper, an algorithm is proposed for the detection of specularities based on physical models of reflection mechanisms. The algorithm, called spectral differencing, is pixelwise parallel, and it detects specularities from the color differences between a small number of color images taken from different views, without any geometric correspondence or image segmentation. The key contribution of the spectral differencing algorithm is to suggest the use of multiple views for understanding reflection properties: although multiple views have been one of the major cues in computer vision for obtaining object shape or structure, they have not been used for obtaining reflection properties. The spectral differencing algorithm is based on the Lambertian consistency, and the object and illumination domains include nonuniformly colored dielectrics and metals under multiply colored scene illumination. The experimental results conform well to our model based on the Lambertian consistency.

References
[AB87] J. Aloimonos and A. Bandyopadhyay. Active vision. In Proc. 1st Int. Conf. on Computer Vision, pages 35-54, 1987.
[Baj88] R. Bajcsy. Active perception. Proceedings of the IEEE, 76:996-1005, 1988.
[BB88] G. Brelstaff and A. Blake. Detecting specular reflections using lambertian constraints. In Proc. of IEEE Int. Conf. on Computer Vision, pages 297-302, Tarpon Springs, FL, 1988.
[BLL90] R. Bajcsy, S.W. Lee, and A. Leonardis. Color image segmentation with detection of highlights and local illumination induced by inter-reflections. In Proc. 10th International Conf. on Pattern Recognition, Atlantic City, NJ, June 1990.
[BS63] P. Beckman and A. Spizzichino. Scattering of Electromagnetic Waves from Rough Surfaces. Pergamon Press, London, UK, 1963.
[Coh64] J. Cohen. Dependency of the spectral reflectance curves of the Munsell color chips. Psychon. Sci., 1:369-370, 1964.
[Ger87] R. Gershon. The Use of Color in Computational Vision. PhD thesis, Department of Computer Science, University of Toronto, 1987.
[HB89] G.H. Healey and T.O. Binford. Using color for geometry-insensitive segmentation. Journal of the Optical Society of America, 6, 1989.
[KSK88] G.J. Klinker, S.A. Shafer, and T. Kanade. Image segmentation and reflection analysis through color. In Proceedings of the DARPA Image Understanding Workshop, pages 838-853, Pittsburgh, PA, 1988.
[LBS90] H.-C. Lee, E.J. Breneman, and C.P. Schulte. Modeling light reflection for computer vision. IEEE Trans. PAMI, 12:402-409, 1990.
[MW86] L.T. Maloney and B.A. Wandell. A computational model of color constancy. Journal of the Optical Society of America, 1:29-33, 1986.
[NIK90] S.K. Nayar, K. Ikeuchi, and T. Kanade. Determining shape and reflectance of hybrid surfaces by photometric sampling. IEEE Trans. Robo. Autom., 6:418-431, 1990.
[Sha85] S.A. Shafer. Using color to separate reflection components. COLOR Research and Application, 10:210-218, 1985.
[Td91] H.D. Tagare and R.J. deFigueiredo. Photometric stereo for diffuse non-lambertian surface. IEEE Trans. PAMI, 13, 1991.
[TS67] K.E. Torrance and E.M. Sparrow. Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America, 57:1105-1114, 1967.
[Wol89] L.B. Wolff. Using polarization to separate reflection components. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 363-369, San Diego, CA, 1989.
This article was processed using the LaTeX macro package with ECCV92 style
Data and Model-Driven Selection using Color
Regions *

Tanveer Fathima Syeda-Mahmood


Artificial Intelligence Laboratory, M.I.T., Cambridge, MA 02139.

Abstract. A key problem in model-based object recognition is selection, namely, the problem of determining which regions in an image are likely to come from a single object. In this paper we present an approach that uses color as a cue to perform selection either based solely on image data (data-driven) or based on knowledge of the color description of the model (model-driven). It presents a method of color specification by color categories, which are used to design a fast segmentation algorithm to extract perceptual color regions. Data-driven selection is then achieved by selecting salient color regions, while model-driven selection is achieved by locating instances of the model in the image using the color region description of the model. The approach presented here tolerates some of the problems of occlusion, pose and illumination changes that make a model instance in an image appear different from its original description.

1 Introduction
A key problem in object recognition is selection, namely, the problem of isolating
regions in an image that are likely to come from a single object. This isolation can
be either based solely on image data (data-driven) or can incorporate the knowledge
of the model (task-driven or model-driven). It has been shown that the search in the
matching stage of recognition can be considerably reduced if recognition systems were
equipped with a selection mechanism thus allowing the search to be focused on those
matches that are more likely to lead to a correct solution [3]. Even though selection can
be of help in recognition, it has largely remained unsolved. The lack of knowledge of
illumination conditions and surface geometries of objects in the scene, and the problems
of occlusion, shadowing, specularities, and interreflections in the image make it difficult to
interpret groups of data features as belonging to a single object. Previous approaches to
selection have focused on the problem of data-driven selection by grouping data features such as edges and lines based on constraints such as parallelism, collinearity, distance, and orientation [4][3]. But ensuring the reliability of such groupings has proved difficult, restricting their effectiveness in reducing the search complexity of recognition.
In this paper we present a way of performing data and model-driven selection by
extracting color regions from an image. A color region almost always comes entirely from
a single object, giving, therefore, more reliable groups than existing grouping methods
and this can be useful for data-driven selection. Because objects tend to show color
constancy under most illumination conditions, color when specified appropriately, can be
a stable cue for most appearances of objects in scenes, thus making it also suitable for
model-driven selection.
* This paper describes research done at the AI Lab., M.I.T. Support for the lab's research is
provided in part by Office of Naval Research and in part by the Advanced Research Projects
Agency of the Dept. of Defense. The author is supported by an IBM Fellowship.

2 Color Specification for Selection


Existing approaches to color have either tried to recover the surface color, i.e., the surface reflectance function [6][7], or the image color, i.e., the color of the objects as they
appear under the present illumination conditions [5]. The recovery of surface color is
known to be an underconstrained problem and the solutions usually make some assump-
tions either about the colored surface or the illumination conditions [6][7]. The image
color, on the other hand, is a very unstable description, changing easily with illumina-
tion conditions. For the purposes of selection, therefore, we propose that the color of a
region be specified by its perceived color. Using the perceptual color, two adjacent color
regions would be distinguished if their perceived colors were different, and this is suffi-
cient for data-driven selection. Because objects tend to obey color constancy under most
changes in illumination, their perceived color remains more or less the same thus making
it sufficient also for model-driven selection.
We now present a way of specifying the perceptual color of image regions. The color of the pixels constituting color regions can be described by a triplet <R, G, B> (called the specific color henceforth), representing the components of image intensity at that point along three wavelengths (usually red, green and blue as dominant wavelengths, corresponding to the filters used in color cameras). When all possible triples are mapped into a 3-dimensional color space with axes standing for pure red, green and blue respectively, we get a color space that represents the entire spectrum of computer-recordable colors. Such a color space must therefore be partitionable into subspaces where the color remains perceptually the same and is distinctly different from that of neighboring subspaces. Such subspaces can be called perceptual color categories. Now the color of each pixel maps to a point in this color space, and hence falls into one of the categories. The perceptual color of the pixel can, therefore, be specified by this color category. To get the perceived colors of regions, we note that although the individual pixels of an image color region may show considerable variation in their specific colors, the overall color of the region is fairly well determined by the color of the majority of pixels (called the dominant color henceforth). Therefore, the perceived color of a region can be specified by the color category corresponding to the dominant color in the region.
The category-based specification of perceptual color (of pixels or regions) remains
fairly stable under changes in illumination conditions and as we show next, can be used
to give a reliable segmentation of the scene. In addition, since the perceptual categories
depend on the color space and are independent of the image, they can be found in advance
and stored. Finally, a category-based description is in keeping with the idea of perceptual
categorization that has been explored extensively through psychophysical studies [8].
To find the perceptual color categories, we performed some rather informal but extensive psychophysical experiments that systematically examined a color space and recorded the places where qualitative color changes occur, thus determining the number of distinct color categories that can be perceived. The details of these experiments are skipped here except to mention the following. The entire spectrum of computer-recordable colors (2^24 colors) was quantized into 7200 bins, corresponding to a 5-degree resolution in hue and 10 levels of quantization of the saturation and intensity values, and the color in each such bin was then observed to generate the categories. From our studies, we found that about 220 different color categories were sufficient to describe the color space. The color category information was then summarized in a color-look-up table. Similarly, the categories that can be grouped to give an even rougher description of a particular hue were found and stored in a category-look-up table, to be indexed using the color categories given by the color-look-up table.
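
A minimal sketch of how such a color-look-up table might be built and queried is shown below. The quantization (5-degree hue resolution, 10 saturation and 10 intensity levels, 7200 bins) follows the text, but the assignment of bins to the roughly 220 perceptual categories was obtained psychophysically, so it appears here only as a placeholder array.

```python
import colorsys
import numpy as np
from collections import Counter

N_HUE, N_SAT, N_VAL = 72, 10, 10     # 5-degree hue bins, 10 levels each: 7200 bins

# Placeholder color-look-up table: the real one maps each bin to one of the
# ~220 psychophysically determined categories; here each bin gets its own label.
color_lut = np.arange(N_HUE * N_SAT * N_VAL).reshape(N_HUE, N_SAT, N_VAL)

def bin_index(r, g, b):
    """Quantize an (r, g, b) triple in [0, 255] into its (hue, sat, val) bin."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (min(int(h * N_HUE), N_HUE - 1),
            min(int(s * N_SAT), N_SAT - 1),
            min(int(v * N_VAL), N_VAL - 1))

def pixel_category(r, g, b):
    """Perceptual color category of a pixel, via the color-look-up table."""
    return color_lut[bin_index(r, g, b)]

def region_perceived_color(pixel_categories):
    """Perceived color of a region: the category of its dominant color,
    i.e., the most frequent pixel category in the region."""
    return Counter(pixel_categories).most_common(1)[0][0]
```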

3 Color Region Segmentation


The previous section described how to specify the color of regions, after they have
been isolated. But the more crucial problem is to identify these regions. If each surface in the scene were a Mondrian, then all its pixels would belong to a single color category, so that by grouping spatially close pixels belonging to a category, the desired segmentation of the image could be obtained. Even for real surfaces, an analysis assuming a single light source and the separability of surface reflectance has shown that the color variations over a surface are mostly in intensity [2]. In practice, even when these assumptions are not satisfied, the general observation is that the intensity and purity of colors are affected but the hue remains fairly constant. In terms of categories, this means that different pixels on a surface belong to compatible categories, i.e., they have the same overall hue but vary in intensity and saturation. Conversely, if we group pixels belonging to a single category, then each physical surface is spanned by multiple overlapping regions belonging to such compatible color categories. These are the categories that were grouped in the category-look-up table mentioned earlier.
The algorithm for color image segmentation performs the following steps. (1) First,
it maps all pixels to their categories in color space. (2) It then groups pixels belonging
to the same category, (3) and finally merges overlapping regions in the image that are of
compatible color categories. The grouping is done by dividing the image into small-sized
bins and running a connected component algorithm to assemble the groups in linear time.
Similarly, the overlapping regions of compatible color categories are found and merged
by using the bin-wise representation of the image, also in linear time.
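
The three steps of the segmentation algorithm might be sketched as follows. The per-pixel category map and the category-look-up table (here a function mapping each category to its rougher hue group) are assumed to be available, and scipy's connected-component labelling over compatible categories stands in for the paper's bin-wise, linear-time merging of overlapping regions.

```python
import numpy as np
from scipy import ndimage

def segment_by_color_categories(category_map, hue_group_of):
    """category_map : (H, W) array of per-pixel perceptual color categories (step 1).
    hue_group_of   : maps a category to its rougher hue group, so that
                     compatible categories share a group.
    Returns (region_labels, merged_labels)."""
    # Step 2: connected components of pixels with identical categories.
    region_labels = np.zeros(category_map.shape, dtype=int)
    offset = 0
    for cat in np.unique(category_map):
        comp, n = ndimage.label(category_map == cat)
        region_labels[comp > 0] = comp[comp > 0] + offset
        offset += n

    # Step 3: merge spatially connected regions of compatible categories by
    # labelling connected components of the hue-group map.
    group_map = np.vectorize(hue_group_of)(category_map)
    merged_labels = np.zeros_like(region_labels)
    offset = 0
    for grp in np.unique(group_map):
        comp, n = ndimage.label(group_map == grp)
        merged_labels[comp > 0] = comp[comp > 0] + offset
        offset += n

    return region_labels, merged_labels
```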
Figure 1 demonstrates the color region segmentation algorithm. Figure 1a shows a 256 x 256 image of a color pattern on a plastic bag. The result of step 2 of the algorithm is shown in Figure 1b, where it can be seen that the glossy portions on the big blue Y and the red S cause overlapping color regions. These are merged in step 3 and the result is shown in Figure 1c. Similarly, Figure 2 shows another example of color region segmentation on an image of a realistic indoor scene.

4 Color-based Data-driven Selection


We now present an approach to data-driven selection using color regions. The seg-
mentation algorithm described above gives a large number of color regions, some of which
may span more than one object, while others may come from the scene clutter rather
then objects of interest in the scene. It would be useful for the purposes of recognition,
therefore, to order and consider only some salient color regions. This is based on the
observation that an object stands out in a scene because of some salient features (such
as, say, color) that are usually localized to some portion of the object. Therefore isolat-
ing salient regions is more likely to point to a single object and hence to a more reliable
grouping strategy. The next section describes how such salient color regions can be found.

4.1 Finding Salient Color Regions in Images


In finding salient color regions, we focus on the sensory components of their dis-
tinctiveness and propose that the saliency be a linear combination of two components,
namely, self-saliency and relative saliency. Self-saliency determines how conspicuous a
region is on its own and measures some intrinsic properties of the region, while relative
saliency measures how distinctive the region appears when there are regions of com-
peting distinctiveness in the neighborhood. To determine these components some region
features were selected and weighting functions were designed to appropriately reflect sen-
sory judgments of saliency. Specifically, the color of a region and its size were used as
features for determining self-saliency and were measured as follows. The color was given
by $(s(R), v(R))$, where $s(R)$ is the saturation or purity of the color of region $R$, $v(R)$ is its brightness, and $0 \le s(R), v(R) \le 1.0$. The size is simply the normalized size given by $r(R) = \mathrm{Size}(R)/\text{Image-size}$. Similarly, the color and size contrast were chosen as features for determining relative saliency. The color contrast measure chosen enhances a region $R$'s contrast if it is surrounded by a region $T$ of a different hue, and is given by $c(R, T)$ below:
\[
c(R, T) =
\begin{cases}
k_1\, d(C_R, C_T) & \text{if } R \text{ and } T \text{ are of the same hue} \\
k_2 + k_1\, d(C_R, C_T) & \text{otherwise,}
\end{cases}
\tag{1}
\]
where $k_2 = 0.5$ and $k_1$ is a constant chosen so that $0 \le c(R, T) \le 1.0$, and $d(C_R, C_T)$ is the cie-distance (chromaticity distance) between the two regions $R$ and $T$ with specific colors $C_R = (r_0, g_0, b_0)^T$ and $C_T = (r, g, b)^T$, given by
\[
d(C_R, C_T) = \sqrt{\left(\frac{r_0}{r_0 + g_0 + b_0} - \frac{r}{r + g + b}\right)^2 + \left(\frac{g_0}{r_0 + g_0 + b_0} - \frac{g}{r + g + b}\right)^2}.
\]
The size contrast is simply the relative size and is given by $t(R, T) = \min\!\left(\frac{\mathrm{size}(R)}{\mathrm{size}(T)}, \frac{\mathrm{size}(T)}{\mathrm{size}(R)}\right)$. In both cases the neighboring region $T$ is the rival neighbor that ranks highest when all neighbors are sorted first by size, then by extent of surround, and finally by contrast (size or color contrast, as the case may be), and it is left implicit here.
The weighting functions for these features were chosen both from the point of view of data-driven selection and for the extent to which they reflect our sensory judgments. Thus, for example, the functions weighting intrinsic color and color contrast, $f_1(s(R))$, $f_2(v(R))$ and $f_4(c(R))$, were chosen to be linear ($f_1(s(R)) = 0.5\, s(R)$, $f_2(v(R)) = 0.5\, v(R)$, and $f_4(c(R)) = c(R)$, respectively) to emphasize brighter and purer colors and higher contrast. The size of a region is given a non-linear weight to de-emphasize both very small and very large regions. Very small regions are usually spurious, while very large regions tend to span more than one object, making both unsuitable for selection. The corresponding weighting function $f_3(r(R))$ was found by performing some informal psychophysical experiments and is given by
\[
f_3(n) =
\begin{cases}
-\ln(1 - n)/c_1 & 0 \le n \le t_1 \\
1 - e^{-c_2 n} & t_1 < n \le t_2 \\
s_2 - c_3 \ln(1 - n + t_2) & t_2 < n \le t_3 \\
s_3\, e^{-c_4 (n - t_3)} & t_3 < n \le t_4 \\
0 & t_4 < n \le 1.0
\end{cases}
\tag{2}
\]
where $t_1 = 0.1$, $t_2 = 0.4$, $t_3 = 0.5$, $t_4 = 0.75$, $s_1 = 0.8$, $s_2 = 1.0$, $s_3 = 0.7$, $s_4 = 10^{-5}$, $c_1 = -\ln(1 - t_1)/s_1$, $c_2 = -\ln(1 - s_1)/t_1$, $c_3 = -(s_2 - s_3)/\ln(1 + t_2 - t_3)$, $c_4$ is a constant, and $n$ is the size of region $R$, $n = r(R)$. A function $f_5(t(R)) = 1 - e^{-t_2 t(R)}$ for the relative size was similarly designed.
The color saliency of region R was obtained by combining all these features as

\[
\text{Color-saliency}(R) = f_1(s(R)) + f_2(v(R)) + f_3(r(R)) + f_4(c(R)) + f_5(t(R)). \tag{3}
\]

Figures 1d-1f and 2c-2f show the four most distinctive regions found by applying the color-saliency measure to all the color regions extracted from the scenes shown in Figures 1a and 2a, respectively. In the experiments done so far, the color-saliency measure was found to select fairly large, bright-colored regions that showed good contrast with their neighbors and appeared perceptually significant.
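
A compact sketch of how the saliency terms might be combined, in the spirit of (1)-(3), is given below. The region attributes are assumed to be precomputed; the size weighting is shown only schematically as a smooth bump that de-emphasizes very small and very large regions rather than the exact piecewise form of (2), and the constants in the contrast terms are assumed values, not the paper's.

```python
import math

def color_saliency(region, rival):
    """region / rival: dicts with keys 's' (saturation), 'v' (brightness),
    'r' (normalized size), 'size' (pixel count), 'chroma' ((r, g, b) specific
    color) and 'hue_group'. Returns the combined saliency as in eq. (3)."""
    f1 = 0.5 * region['s']                      # purer colors are more salient
    f2 = 0.5 * region['v']                      # brighter colors are more salient

    # Schematic stand-in for the piecewise f3 of eq. (2): favors mid-sized regions.
    f3 = math.exp(-((region['r'] - 0.25) / 0.15) ** 2)

    # Color contrast against the rival neighbor (eq. 1), using a chromaticity distance.
    def chromaticity(c):
        r, g, b = c
        total = float(r + g + b) or 1.0
        return (r / total, g / total)
    cr, ct = chromaticity(region['chroma']), chromaticity(rival['chroma'])
    d = math.hypot(cr[0] - ct[0], cr[1] - ct[1])
    k1, k2 = 0.5, 0.5                           # assumed constants
    f4 = k1 * d if region['hue_group'] == rival['hue_group'] else min(1.0, k2 + k1 * d)

    # Relative-size contrast t(R, T) and an assumed exponential weighting f5.
    t = min(region['size'] / rival['size'], rival['size'] / region['size'])
    f5 = 1.0 - math.exp(-12.0 * t)

    return f1 + f2 + f3 + f4 + f5
```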
4.2 Use of Salient Color-based Selection in Recognition
Data-driven selection based on salient color regions is primarily useful when the object
of interest has at least one of its regions appearing salient in the given scene, since the
search for data features that match model features can be restricted to these regions.
Selecting salient regions gives a small number of large-sized groups which were shown to
be very useful for indexing into the library of models [1]. But to recognize a single object,
it is desirable to have small-sized groups. For this, existing grouping techniques can be
applied to the data features found within the color regions to obtain reliable small-sized
groups.
To estimate the search reduction that can be achieved with such a selection mechanism, let $(M, N)$ be the total number of features (such as edges, lines, etc.) in the model and image respectively, $(M_R, N_R)$ the total number of color regions in the model and image respectively, and $N_s$ the number of salient regions that are retained in an image. Let $g$ be the average size of a group of data features within a model or image, $(G_M, G_N)$ the number of groups formed (using any existing grouping scheme) in the model and image respectively, and $G_{N_i}$ the number of groups in the salient image region $i$. Using the alignment method of recognition [3], at least three corresponding data features are needed to solve for the pose (appearance) of the model in the image. If no selection of the data features is done, then the brute-force search required to try all possible triples is $O(M^3 N^3)$. If selection is done by grouping methods only (i.e., without color region selection), then the number of matches that need to be tried is $O(G_M G_N g^3 g^3)$, since only triples within groups need to be tried. When grouping is done within color regions, the groups obtained are fewer in number and more reliable, so that the overall effect is to reduce the search (by as much as a factor of $10^7$). When grouping is restricted to salient color regions, the number of matches further reduces to $O\!\left(\sum_{j=1}^{N_s} G_{N_j} G_M\, g^3 g^3\right)$.
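
The estimates above amount to a few lines of arithmetic; a sketch is given below, where the counts passed in are hypothetical and only the formulas follow the text.

```python
def matches_brute_force(M, N):
    # All triples of model features against all triples of image features.
    return (M ** 3) * (N ** 3)

def matches_with_grouping(G_M, G_N, g):
    # Triples are formed only within groups of average size g.
    return G_M * G_N * (g ** 3) * (g ** 3)

def matches_with_salient_regions(G_M, groups_per_salient_region, g):
    # Grouping restricted to the N_s most salient color regions.
    return sum(G_Nj * G_M * (g ** 3) * (g ** 3)
               for G_Nj in groups_per_salient_region)

# Hypothetical counts, for illustration only (not the values of Table 1).
print(matches_brute_force(M=500, N=2500))
print(matches_with_grouping(G_M=70, G_N=350, g=7))
print(matches_with_salient_regions(G_M=70, groups_per_salient_region=[30, 25, 20, 15], g=7))
```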
To get an estimate of the number of matches and the time taken for matching in real scenes when color-based selection is used, we recorded the number of color regions and the number of data features within regions for some selected models and scenes (Figures 2 and 3 show typical examples of the models and scenes tried). The regions were ordered using the color saliency measure and the four most salient regions were retained. Search estimates were then obtained using the above formulas, assuming a grouping scheme in which the number of groups within a region is bounded by the number of features in the region divided by the average size of the groups in the region (a good bound for simple grouping schemes such as grouping 'g' closely-spaced parallel lines in a region). The results of these studies are shown in Table 1. As can be seen from this table, the number of matches is always smaller when salient color regions are used for selection.

5 Color-based Model-driven Selection


When the object of interest is not salient in color, saliency-based data-driven selection
will no longer be useful. In such cases, the color description of the model can be used
to perform selection. Previous approaches to using model color information to search
for instances of objects have used histogram matching techniques [9] that cause a lot of
false positive identifications since they do not explicitly address some of the problems
such as pose changes, occlusions, or illumination conditions that make a model instance
appear different from its original description. Our approach to color-based model-driven
selection handles some of these problems by using a rich description of model color regions
and a location strategy that exploits global relational information about the regions
provided in this description. In addition, it provides correspondence between model and
image regions, which can help reduce the search in recognition as matching can now be
restricted to the corresponding regions. Since the model description affects the design of
the location strategy, it is described first.

5.1 Model Description
The color region information in the model (that is, in an image or view of the model) is represented as a region adjacency graph (RAG) $M_G = \langle V_m, E_m, C_m, R_m, S_m, B_{rm}, B_{am} \rangle$, where $V_m$ is the set of color regions in the model, $E_m$ the set of adjacencies between color regions, $C_m(u)$ the color of region $u \in V_m$, $R_m(u, v)$ the relative size of region $v$ with respect to region $u$, $S_m(u)$ the size of region $u$, $B_{rm}$ a bound on the relative size of regions given by $R_m$, and $B_{am}$ a bound on the absolute size of regions given by $S_m$.
This description exploits features of regions, such as color and adjacency information, that tend to remain more or less invariant in most scenes in which the model appears. Also, the bounds $B_{rm}$ and $B_{am}$ indicate the extent of pose changes and occlusions that the selection mechanism is expected to tolerate. The description, therefore, is fairly rich: it contains structural information about color regions that can be used to restrict the number of false positives, and constraints on the relative and absolute size changes that can be used to restrict the number of false negatives made by the selection mechanism.
Finally, the color region information in the image is similarly organized as an image region adjacency graph $I_G = \langle V_I, E_I, C_I, R_I, S_I \rangle$, where each term has a meaning analogous to $\langle V_m, E_m, C_m, R_m, S_m \rangle$ respectively.
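
One possible way to hold the model and image region adjacency graphs described above is sketched below; the class and field names are hypothetical, and the two bounds are stored as plain attributes with assumed default values.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

@dataclass
class RegionAdjacencyGraph:
    """Color-region RAG: nodes are regions, edges are adjacencies."""
    color: Dict[int, int] = field(default_factory=dict)            # C(u): color category
    size: Dict[int, float] = field(default_factory=dict)           # S(u): region size
    adjacent: Dict[int, Set[int]] = field(default_factory=dict)    # E: adjacency lists
    rel_size: Dict[Tuple[int, int], float] = field(default_factory=dict)  # R(u, v)
    rel_size_bound: float = 2.0     # B_rm: allowed change in relative size (assumed value)
    abs_size_bound: float = 2.0     # B_am: allowed change in absolute size (assumed value)

    def add_region(self, u, color, size):
        self.color[u] = color
        self.size[u] = size
        self.adjacent.setdefault(u, set())

    def add_adjacency(self, u, v):
        self.adjacent[u].add(v)
        self.adjacent[v].add(u)
        self.rel_size[(u, v)] = self.size[v] / self.size[u]
        self.rel_size[(v, u)] = self.size[u] / self.size[v]
```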
5.2 Location Strategy
Given the image region adjacency graph $I_G$, the model object, if present in the scene, will form a subgraph of $I_G$. The location strategy therefore regards the problem of selection as the problem of searching for suitable subgraphs that satisfy the model description. Although the number of subgraphs is exponential, a set of unary and binary constraints supplied in the model description restricts the subgraphs to a small number of feasible subgraphs. The perceptual color of a region and its absolute size bound ($B_{am}$) were used as the unary constraints, while region adjacency and relative size were used as the binary constraints. Specifically, the lack of adjacency between two model regions was used to prune false matches to two adjacent image regions. The bound $B_{rm}$ in the model was used to discard matches when the relative size exceeded this bound.
The location strategy searches among the feasible subgraphs for a subgraph (or subgraphs) that in some sense best matches the given model description. Such a subgraph $I_g = \langle V_g, E_g, C_g, R_g, S_g \rangle$, with $\|V_g\| \le \|V_m\|$ and $\|E_g\| \le \|E_m\|$, has associated with it a node correspondence vector $T = \{(u_m, u_g) \mid u_m \in V_m,\ u_g \in V_g \cup \{\bot\}\}$, where $\bot$ denotes a null match, and is chosen to be the one that minimizes the following measure:
\[
\mathrm{SCORE}(I_g) = \left(1 - \frac{\|V_g\|}{\|V_m\|}\right) + \frac{\sum R_{mg}(u_m, v_m, u_g, v_g)}{\|E_m\|}, \tag{4}
\]
where $R_{mg}(u_m, v_m, u_g, v_g)$ expresses the change in relative size when adjacent model regions $(u_m, v_m)$ are paired with corresponding image regions $(u_g, v_g)$, and is given by
\[
R_{mg}(u_m, v_m, u_g, v_g) = \frac{|R_m(u_m, v_m) - R_I(u_g, v_g)|}{\max\!\left(R_m(u_m, v_m),\, R_I(u_g, v_g)\right)}.
\]
$\mathrm{SCORE}(I_g)$ rewards making as many correspondences as possible, as indicated by the first term, and penalizes mismatches in relative size, as indicated by the second term, which accounts for occlusions and pose changes in a more refined way than the binary constraints alone. A branch and bound version of interpretation tree search [3] was then used to search for the best subgraph.
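
The scoring of a candidate subgraph against the model, as in (4), might be implemented as in the sketch below, using the hypothetical RegionAdjacencyGraph structure given earlier; the enumeration of feasible subgraphs and the branch-and-bound interpretation-tree search themselves are not shown.

```python
def score_subgraph(model, image, correspondence):
    """SCORE(I_g) of eq. (4). correspondence maps each model region to an image
    region or to None (a null match); lower scores are better."""
    matched = {um: ug for um, ug in correspondence.items() if ug is not None}
    n_model_regions = len(model.color)
    n_model_edges = sum(len(nbrs) for nbrs in model.adjacent.values()) // 2

    # First term: reward making as many correspondences as possible.
    score = 1.0 - len(matched) / n_model_regions

    # Second term: relative-size mismatch over matched, adjacent region pairs.
    mismatch = 0.0
    for um, ug in matched.items():
        for vm in model.adjacent[um]:
            if vm < um:                 # count each model edge once (integer region ids)
                continue
            vg = matched.get(vm)
            if vg is None or vg not in image.adjacent.get(ug, set()):
                continue
            r_m = model.rel_size[(um, vm)]
            r_i = image.rel_size[(ug, vg)]
            mismatch += abs(r_m - r_i) / max(r_m, r_i)
    if n_model_edges:
        score += mismatch / n_model_edges
    return score
```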
The result of using color-based model-driven selection is illustrated in Figure 3. Figures 3a and 3b show a model object and its color description obtained by using the color-region segmentation algorithm of Section 3. Figure 3c shows a scene in which the model object occurs. The scene contains several other objects with one or more of the model colors. Also, the model appears in a different pose, being rotated to the left about the vertical axis. Figure 3d shows the result of applying the unary color constraints, and Figure 3e the subsequent use of the absolute size constraint. Finally, the subgraph with the lowest value of SCORE is shown in Figure 3f. As can be seen from this figure, a region containing most of the model object has been identified even with an imperfect color image segmentation.
5.3 Search Reduction using Color-based Model-driven Selection
The color-based model-driven selection mechanism provides a correspondence of model
regions to some image regions. The matching of model features to image features can
be restricted to within corresponding regions, and when this is combined with grouping
within regions as described in Section 4.2, the number of matches to be tried for recogni-
tion reduces further. To estimate the search reduction in this case, let Nz be the number
of solution subgraphs given by the selection mechanism, and let Ik represent one such
subgraph with the number of nodes = Nk. Let (G,,i,G~) = the n u m b e r of groups in
region uj of the solution subgraph Ih, and region vi of the model R A G t h a t corresponds
to uj as implied by the correspondence vector T associated with Ik. Then assuming, as
before, the average size of the group = g, the number of matches t h a t need to be tried are
0 ( ~ = 1 ~~1~1 G~,#G~,.ga.g3). By trying several models and images of scenes where they
occurred, we recorded the average number of subgraphs generated by the model-driven
selection mechanism. The search estimates were obtained using the above formula for
model-driven selection with grouping, and the formulas for other m e t h o d s mentioned in
Section 4.2. T h e results are shown in Table II. The bound on the number of groups in a
region was the same as used in Section 4.2. As can be seen from the table, the number of
matches using correspondence between model and image color regions is always lower.

6 Conclusions
In this paper we have shown how color can be used as a cue to perform both data and model-driven selection. Unlike other approaches to color, we have used the intended task to constrain the kind of color information to be extracted from images. This led to a fast color image segmentation algorithm based on perceptual categorization of colors, which then formed the basis of data and model-driven selection. Future work will be directed towards integrating the selection mechanism with a 3D-from-2D recognition system to obtain statistics of false positives and negatives and the actual search reduction due to selection.

References
1. D.T. Clemens and D.W. Jacobs, "Space and time bounds on indexing 3D models from 2D images," IEEE Trans. Pattern Anal. and Machine Intelligence, vol. 13, Oct. 1991.
2. T.F. Syeda-Mahmood, "Data and model-driven selection using color regions," AI Memo 1270, Artificial Intelligence Lab., M.I.T., 1992.
3. W.E.L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints, MIT Press: Cambridge, 1990.
4. D.G. Lowe, Perceptual Organization and Visual Recognition, Kluwer Academic: Boston, 1985.
5. G.J. Klinker, S.A. Shafer, and T. Kanade, "A physical approach to color image understanding," Intl. Jl. Computer Vision, vol. 4, no. 1, pp. 7-38, Jan. 1990.
6. E. Land, "Recent advances in retinex theory," in Central and Peripheral Mechanisms of Colour Vision, T. Ottoson and S. Zeki, Eds., pp. 5-17, London: McMillan, 1985.
7. L.T. Maloney and B. Wandell, "Color constancy: A method for recovering surface spectral reflectance," Jl. Optical Society of America, vol. 3, pp. 29-33, 1986.
8. E. Sternheim and R. Boynton, "Uniqueness of perceived hues investigated with a continuous judgmental technique," Jl. of Experimental Psychology, vol. 72, pp. 770-776, 1966.
9. M.J. Swain and D. Ballard, "Indexing via color histograms," Third Int. Conf. Computer Vision, 1990.

Fig. 1. Illustration of color region segmentation and color-saliency. (a) Input image consisting of regions of 3 different colors: red, green and blue against an almost white background. (b) Result of Step 2 of the algorithm with regions colored differently from the original image. (c) Final segmentation of the image of Fig. 1a. (d)-(f) The three most distinctive regions found using the color saliency measure.

Table 1. Search reduction using color-based data-driven selection. The last column shows the match time when color-based data-driven selection is combined with grouping. The color-based selection is done by choosing the four most salient regions. Here g = 7, time per match = 1 microsecond, and the grouping method is as described in the text.

S.No  M    N     MR  NR    No selection                Only grouping            Salient color + grouping
                           Num. matches   Time         Num. matches   Time      Num. matches   Time
1.    229  1170  1   8     1.92x10^16     610 yrs      6.52x10^8      11 min    3.37x10^8      5 min
2.    507  2655  2   20    2.4x10^18      77,341 yrs   3.22x10^9      54 min    1.32x10^9      22 min
3.    124  2655  2   -     3.57x10^16     1131 yrs     8.05x10^8      13 min    3.3x10^8       6 min
4.    507  2247  2   14    1.48x10^18     46,884 yrs   2.72x10^9      46 min    7.8x10^8       13 min

Table 2. Search reduction using color-based model-driven selection. The last column shows the match time when model-color-based selection is combined with grouping. Here g = 7, time per match = 1 microsecond, and the grouping method is as described in the text.

S.No  M    N     MR  NR  Objects N_l N_k   No selection                 Only grouping            Model-driven selection
                                           Num. matches   Time          Num. matches   Time      Num. matches   Time
1.    786  3268  5   30  20 1 (3)          1.69x10^19     530,000 yrs   5.15x10^9      103 min   4.55x10^7      45 sec
2.    83   3078  1   20  14 3 (1,1,1)      1.67x10^16     528 yrs       5.2x10^8       11 min    1.7x10^8       3 min
3.    507  2655  2   20  114 2 (2,1)       2.4x10^18      77,341 yrs    3.22x10^9      54 min    3.72x10^8      6 min
4.    507  2247  2   14  16 1 (2)          1.48x10^18     46,884 yrs    2.72x10^9      46 min    3.16x10^8      5 min

Fig. 2. Illustration of color region segmentation and color-saliency. (a) Input image depicting a scene of objects of different materials and having occlusions and inter-reflections. (b) Segmented image using the color region segmentation algorithm. (c)-(f) The four most distinctive regions detected using the color-saliency measure. The white portion in the red book appears so because of the white background.

Fig. 3. Illustration of color-based model-driven selection. (a) The object serving as the model. (b) Its color description produced by the segmentation algorithm of Section 3. (c) A cluttered scene in which the object appears. (d) Regions selected based on the unary color constraint. (e) Regions of (d) pruned after using the unary size constraint. (f) Regions corresponding to the best subgraph that matched the model specifications.

This article was processed using the LaTeX macro package with ECCV92 style
Recovering Shading from Color Images *

Brian V. Funt, Mark S. Drew, and Michael Brockington


School of Computing Science, Simon Fraser University, Vancouver, B.C., Canada V5A 1S6,
(604) 291-3126, funt@cs.sfu.ca

Abstract.
Existing shape-from-shading algorithms assume constant reflectance
across the shaded surface. Multi-colored surfaces are excluded because both
shading and reflectance affect the measured image intensity. Given a stan-
dard RGB color image, we describe a method of eliminating the reflectance
effects in order to calculate a shading field that depends only on the rel-
ative positions of the illuminant and surface. Of course, shading recovery
is closely tied to lightness recovery and our method follows from the work
of Land [10, 9], Horn [7] and Blake [1]. In the luminance image, R + G + B ,
shading and reflectance are confounded. Reflectance changes are located and
removed from the luminance image by thresholding the gradient of its loga-
rithm at locations of abrupt chromaticity change. Thresholding can lead to
gradient fields which are not conservative (do not have zero curl everywhere
and are not integrable) and therefore do not represent realizable shading
fields. By applying a new curl-correction technique at the thresholded lo-
cations, the thresholding is improved and the gradient fields are forced to
be conservative. The resulting Poisson equation is solved directly by the
Fourier transform method. Experiments with real images are presented.

1 Introduction

Color presents a problem for shape-from-shading methods because it affects the apparent
"shading" and hence the apparent shape as well. Color variation violates one of the
main assumptions of existing shape-from-shading work, namely, that of constant albedo.
Pentland [11] and Zheng [16] give examples of the errors that arise in violating this
assumption.
We address the problem of recovering (up to an overall multiplicative constant) the
intrinsic shading field underlying a color image of a multi-colored scene. In the ideal
case, the recovered shading field would be a graylevel image of the scene as it would have
appeared had all the objects in the scene been gray. We take as our definition of shading
that it is the sum of all the processes affecting the image intensity other than changes in
surface color (hue or brightness). Shading arises from changes in surface orientation and
illumination intensity.
It is quite surprising how well some shape-from-shading (SFS) algorithms work when
they are applied directly to graylevel images of multi-colored scenes [16]. This is encour-
aging since it means that shading recovery may not need to be perfect for successful
shape recovery. Nonetheless, the more accurate the shading the more accurate we can
expect the shape to be.
Consider the image of Fig. 1(a), which is a black and white photograph of a color image of a cereal box. The lettering is yellow on a deep blue background. Applying Pentland's [11] remarkably simple linear SFS method to the graylevel luminance version (i.e. R + G + B) of this color image generates the depth map in Fig. 1(b). Although the
image violates Pentland's assumptions somewhat in that the light source was not very
* M.S. Drew is indebted to the Centre for Systems Science at Simon Fraser University for partial
support; B.V. Funt thanks both the CSS and the Natural Sciences and Engineering Research
Council of Canada for their support.

distant and the algorithm is known to do poorly on flat surfaces, it is clear that the
yellow lettering creates serious flaws in the recovered shape. Note also that the errors are
not confined to the immediate area of the lettering.
The goal of our algorithm is to create a shaded intensity image in which the effects of
varying color have been removed in order to improve the performance of SFS algorithms
such as Pentland's. Similar to previous work on lightness, the idea is to separate intensity
changes caused by change in color from those caused by change in shape on the basis
that color-based intensity changes tend to be very abrupt. Most lightness work, however,
has considered only planar "mondrian" scenes and has processed the color channels sepa-
rately. In lightness computations, the slowly varying intensity changes are removed from
each color channel by thresholding on the derivative of the logarithm of the intensity in
that channel. We instead remove intensity gradients from the logarithm of the luminance
image by thresholding whenever the chromaticity changes abruptly. Both the luminance
and the chromaticity combine information from all three color channels.
Many examples of lightness computation in the literature [13, 14, 1, 8, 4] use only
synthetic images. A notable exception is in Horn [7] in which he discusses the problem of
thresholding and the need for appropriate sensor spacing. He conducts experiments on
a few very simple real images. Choosing an appropriate threshold is notoriously difficult
and the current problem is no exception. By placing the emphasis on shading rather
than lightness, however, fewer locations are thresholded because it is the large gradients
that are set to zero, not the small ones. When a portion of a large gradient change
remains after thresholding due to the threshold being too high, the curl of the remaining
luminance gradient becomes non-zero. Locations of non-zero curl are easily identified and
the threshold modified by a technique called "curl-correction."
In what follows, we first analyze the case of one-dimensional images before proceeding
to the two-dimensional case. Then we elaborate on curl-correction and present results of
tests with real images.

2 One-dimensional Case
2.1 Color Images with Shading
Let us consider as a starting point the surface described by the one-dimensional depth
map shown in Fig. 2(a). If this surface has Lambertian reflectance and is illuminated by
a point source from an angle of 135° (i.e., from the upper left), the resulting intensity
distribution will be as shown in Fig. 2(b). So far there is no color variation, so all the
intensity variation is due to shading.
If instead, the surface has regions of different color, each described by its own color
triple (R,G,B) in the absence of shading, (see Fig. 2(c)), then in a color image of the
surface the RGB values will be these original color triples modulated by the shading
field, as shown in Fig. 2(d). The combined effect of color edges and shading edges leads
to discontinuities in the observed RGB values at image locations corresponding to both
types. Fig. 2(d) has both kinds of edges--there are color discontinuities where there are
no shape discontinuities, and there are also shape discontinuities without accompanying
color ones.
To differentiate between the two kinds of edges, we note that if we form the analog
of the chromaticity [15] in RGB space, i.e.,
r = R/(R+G+B), g = G/(R+G+B)
then r and g are independent of the shading (cf. [12, 6, 2]) as can be seen in Fig. 2(e).
This, of course, must be the case because rg-chromaticity is simply a way of normalizing
the magnitude of the RGB triple. Both r and g are fixed throughout a region of constant
color. So long as we can assume that color edges never coincide with shape edges,
rg-chromaticity will distinguish between them.
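For concreteness, a minimal NumPy sketch of this rg-chromaticity computation (our own illustration; the array layout and the small epsilon guard are assumptions):

import numpy as np

def rg_chromaticity(rgb, eps=1e-8):
    # rgb: H x W x 3 float array.  Returns r = R/(R+G+B), g = G/(R+G+B);
    # eps guards against division by zero in dark pixels.
    s = rgb.sum(axis=2) + eps
    return rgb[..., 0] / s, rgb[..., 1] / s

# A flat-colored patch under varying shading keeps constant (r, g):
patch = np.array([[[0.8, 0.4, 0.2]]]) * np.linspace(0.2, 1.0, 5)[:, None, None]
r, g = rg_chromaticity(patch)
print(r.ravel(), g.ravel())   # r ~ 0.571, g ~ 0.286 at every shading level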

2.2 Shading Recovery
In a sense, the recovery of shading from color images is the obverse of the recovery
of lightness from graylevel images [7]. In the case of lightness, it is the sharp intensity

changes that are taken to represent reflectance changes and are retained, while the small
intensity changes are factored out on the basis that the illumination varies smoothly;
whereas, in the case of shading it is the large reflectance changes that are factored out
and the small ones retained. A significant difference, however, is that the reflectance
changes are identified, not by sharp intensity changes, but by sharp chromaticity changes.
Small chromaticity changes are assumed to be caused by changes in the spectrum of the
illuminant and are retained as part of the shading.
We begin by following the usual lightness recovery strategy [7, 1], but to do so we
first need to transform a color image into a graylevel luminance image by forming
I' = R + G + B.
Under the assumption that the luminance is described well as the product of a shading component S' and a color component C', the two components are separated by taking logarithms:
I(x) = log I'(x) = log S'(x) + log C'(x).
Differentiating, thresholding away all components of the derivative coincident with large chromaticity changes, and integrating yields the logarithm of the shading.
Chromaticity changes (dr, dg) are determined from the derivative of the chromaticity (r, g), where the threshold function locates pixels with high |dr| or |dg|. The threshold function is defined as T(x) = 1 at pixels where (dr, dg) is small and T(x) = 0 where it is large. T will be mostly 1; whereas, for lightness T will be mostly 0.
Applying T to the derivative of the luminance image eliminates C = log C', so S = log S' can be recovered by integrating the thresholded intensity derivative. In other words,
dS = T(dI),   S = ∫ dS,   S' = exp S.
It is easy to see from Figs. 2(a)-(e) that this algorithm will recover the shading properly for the case of perfect step edges, and the correct result is in fact obtained by the algorithm as shown in Fig. 2(f).
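To make the one-dimensional procedure concrete, the following minimal NumPy sketch (our own illustration, not the authors' implementation) thresholds the derivative of the log-luminance wherever the rg-chromaticity changes abruptly and then integrates; the synthetic signal and the threshold level are arbitrary choices.

import numpy as np

def recover_shading_1d(R, G, B, thresh=0.02):
    # Log-luminance and rg-chromaticity of the 1-D color signal.
    s = R + G + B
    I = np.log(s)
    r, g = R / s, G / s
    # Threshold T: 1 where the chromaticity is locally flat, 0 at color edges.
    dI, dr, dg = np.diff(I), np.diff(r), np.diff(g)
    T = (np.abs(dr) < thresh) & (np.abs(dg) < thresh)
    dS = np.where(T, dI, 0.0)                    # dS = T(dI)
    S = np.concatenate([[0.0], np.cumsum(dS)])   # S = integral of dS (constant unknown)
    return np.exp(S)                             # S' = exp S

# Synthetic test: smooth shading times a two-colored (step) reflectance.
x = np.linspace(0.0, 1.0, 200)
shading = 0.5 + 0.4 * np.sin(2 * np.pi * x)
albedo = np.where(x[:, None] < 0.5, [0.6, 0.3, 0.1], [0.10, 0.25, 0.15])
rgb = albedo * shading[:, None]
S_rec = recover_shading_1d(rgb[:, 0], rgb[:, 1], rgb[:, 2])
# S_rec is approximately proportional to `shading`; the color step is removed.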

2.3 Integration by Fourier Expansion
A fast, direct method of integrating the thresholded derivative of the shading field in
the two-dimensional case is to apply Fourier transforms. While not efficient in the one-
dimensional case it is easy to understand how the method works. Firstly, if the discrete
Fourier transform of dS is F(dS), the effect of differentiation is given by
F(dS) = 2πiu F(S)
where the frequency variable is u. This expression no longer holds exactly, however, when
the derivative is calculated by convolution with a finite-differences mask. For the case of
convolution by a derivative mask, after both the mask and the image are extended with
zeroes to avoid wraparound error [5], Frankot and Chellappa [3] show how to write the
Fourier transform of the derivative operation in terms of u. Call this transform H. H
is effectively the Fourier transform of the derivative mask, and integration of dS simply
involves dividing by H and taking the inverse transform:
F(S) = F(dS)/H
This division will not be carried out at u = 0, of course, so that integration by this
method does not recover the DC term representing the unknown constant of integration.
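As an illustration of this integration-by-division, here is a minimal one-dimensional NumPy sketch (our own, with an arbitrary three-tap derivative mask); the test signal decays to zero at its ends so that zero-padding introduces no noticeable edge effects.

import numpy as np

def integrate_fourier(dS, mask=(-0.5, 0.0, 0.5)):
    # Divide the DFT of dS by H, the DFT of the (zero-padded) derivative mask.
    n = len(dS)
    N = 2 * n                                   # zero-pad to avoid wraparound error
    h = np.zeros(N)
    h[:len(mask)] = mask
    H = np.fft.fft(h)
    D = np.fft.fft(dS, N)
    F = np.zeros(N, dtype=complex)
    nz = np.abs(H) > 1e-12                      # u = 0 (the DC term) is skipped
    F[nz] = D[nz] / H[nz]                       # F(S) = F(dS) / H
    return np.real(np.fft.ifft(F))[:n]

# Round trip: convolve a smooth signal with the mask, then integrate it back.
S = np.exp(-np.linspace(-3, 3, 128) ** 2)
dS = np.convolve(S, (-0.5, 0.0, 0.5))[:len(S)]
S_rec = integrate_fourier(dS)
print(np.allclose(S_rec - S_rec[0], S - S[0], atol=1e-2))   # True, up to a constant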

3 Two-dimensional Shading Recovery

To generalize the method to real, two-dimensional images two main problems need to be
addressed: how to properly deal with non-step edges and, since in two dimensions the gradient replaces the derivative, how to easily integrate the gradient image.
As in the one-dimensional case, the procedure is to determine a threshold image T
from the chromaticity image, apply the threshold to the derivative of the logarithm of the
luminance image I and integrate the result to obtain the shading field S. The threshold
function T comes from the chromaticity itself, not its log, and for two-dimensional images
is based on the gradient of the chromaticity vector field

‖∇(r, g)‖ = √( ∇r · ∇r + ∇g · ∇g ) .


For this type of problem, the boundary conditions must be treated appropriately
as Blake [1] points out. He addresses lightness recovery and uses a threshold based on
the value of the log-intensity gradient and applies it to the log-intensity gradient. In the
present case we apply a threshold based on the chromaticity gradient to the log-luminance
gradient; nevertheless, Blake's results can still be applied.
Blake proves that given a threshold image T, the (log of) lightness L can be recovered
from the (log of) intensity I by inverting the Laplacian of I provided correct boundary
conditions are used and provided the thresholded intensity gradient forms an irrotational field.
Specifically, he shows that

∇L = T∇I   ⟹   { ∇²L = ∇·(T∇I),   n·∇L = n·(T∇I) on the boundary }.

However, the converse holds only if the field T∇I has zero curl:

{ ∇²L = ∇·(T∇I),   n·∇L = n·(T∇I) on the boundary,   ∇×(T∇I) = 0 }   ⟹   ∇L = T∇I .
Blake argues that in theory T∇I will have zero curl and thus forms a conservative
field. Furthermore, he points out that if this condition is violated in practice, the best
solution is robust in the sense that it minimizes a least-squares energy integral for L.
The demand that the curl be approximately zero is important because it amounts to the
condition that the recovered lightness (or shading field in our case) be integrable from
its gradient.
The fact that we use a threshold T that is not derived from I itself, but is instead
derived from (r, g), does not make any difference in the proof of Blake's theorem. In fact,
any T will do--the formal statement of the theorem follows through no matter how T
is chosen. The crucial point is that while the theorem holds for continuous images and
step edges, in practice the curl may not be zero because of edges that are not perfect
steps. With a non-step edge, thresholding may zero out only half the effective edge, say,
in which case T∇I will not be conservative.
Another situation that can affect integrability is when some chromaticity edges are
slightly stronger than others so that some of the weaker edges are missed by the threshold.
For example, consider the case of a square of a different color in the middle of an otherwise
uniform image. If for some reason the horizontal edges are slightly stronger than the
vertical ones, so that only the horizontal ones are thresholded away, then the curl--
necessarily zero everywhere in the input gradient image---will become non-zero at the
corners of the square. The non-zero curl indicates that the resulting integrated image
will not make sense and we cannot hope to recover the correct, flat shading image from
this thresholded gradient.
What should be done to enforce integrability? Blake further differentiates the thresh-
olded gradient calculating its divergence, which results in the Laplacian of the lightness
field. In essence, Blake's method enforces integrability by mapping the two components
of the gradient image back to a single image L. Differentiating L will clearly result in an
irrotational gradient field.
In the context of shape-from-shading, Frankot and Chellappa [3] enforce integrability
of the gradient image (p, q) by projecting in the Fourier domain onto an integrable set.
This turns out to be equivalent to taking another derivative of (p, q) and assuming the
resulting sum equals the Laplacian of z [13]. For the lightness problem, then, integrating
by forming the Laplacian and inverting is a method of enforcing integrability of T∇I. The
most efficient method for inverting the Laplacian is integration in the Fourier domain,
as set forward in [13].
While these methods of projecting T∇I onto an integrable vector field generate the
optimal result in the sense that it is closest to the non-integrable original, in the case of
the thresholded shading gradient "closest" is not necessarily best. For example, consider

the luminance edge associated with the color change shown in Fig. 3(a) and its gradient
image Fig. 3(b) (using one dimension for illustrative purposes). Since the chromaticity
edge is not a perfect step, we can expect thresholding to eliminate only part of the edge
as shown in Fig. 3(c). The projection method of integration by forming a Laplacian and
inverting uses the integrable gradient that is best in the sense that it is closest to Fig. 3(c).
Fig. 3(d) shows the result after integration. The problem is that while the gradient of
Fig.3 (d) may be curl-free, a lot of the edge that should have been screened out remains.
We would prefer a method that enforces integrability while also removing more of the
unwanted edge.

3.1 Curl-Correction
If the thresholding step had succeeded in zeroing the entire gradient at the edge, then
the resulting image would have had zero curl. To accomplish the dual goals of creating
an integrable field and of eliminating the edge, we propose thresholding out the gradient
wherever the curl is non-zero. This must be done iteratively, since further thresholding
may itself generate more pixels with non-zero curl. Iteration continues until the maximum
curl has become acceptably small. Since the portion of the edge that was missed by the
initial thresholding created the curl problem, the thresholded region will expand until
the whole edge has been removed.
An alternative curl-correction scheme is to distribute the contributions of the x and
y partial derivatives of the gradient that make the curl non-zero among the pixel and its
neighboring pixels so that the result has zero curl. As an example of this type of scheme
one can determine which part of the curl, the x derivative of the y-gradient or the y
derivative of the x-gradient, is larger in absolute value. Then the larger part is made
equal to the other by adjusting the larger gradient value contributing to the curl. Tests
with this method did not show that it worked any better than the simpler scheme of
simply zeroing the gradient. Although it might work better in some other context, all
results reported in this paper simply zero the gradient.
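As an illustration, a minimal NumPy sketch of this zeroing scheme (our own; the finite-difference stencils, the numerical tolerance and the periodic boundary handling via np.roll are simplifying assumptions):

import numpy as np

def curl(P, Q):
    # Discrete curl d/dy(P) - d/dx(Q) with central differences
    # (np.roll gives periodic boundaries, a simplification).
    dPdy = 0.5 * (np.roll(P, -1, axis=0) - np.roll(P, 1, axis=0))
    dQdx = 0.5 * (np.roll(Q, -1, axis=1) - np.roll(Q, 1, axis=1))
    return dPdy - dQdx

def curl_correct(P, Q, target=0.5, tol=1e-6):
    # Iteratively zero the gradient at (and around) locations of non-zero curl
    # until the maximum |curl| falls below `target` times its initial value.
    P, Q = P.copy(), Q.copy()
    c0 = np.abs(curl(P, Q)).max() + 1e-12
    for _ in range(2 * P.size):                  # guaranteed to terminate
        c = np.abs(curl(P, Q))
        if c.max() <= target * c0:
            break
        bad = c > tol * c0
        grown = bad.copy()                       # include the surrounding neighborhood
        for axis in (0, 1):
            for shift in (1, -1):
                grown |= np.roll(bad, shift, axis=axis)
        P[grown], Q[grown] = 0.0, 0.0
    return P, Q

The target of 0.5 mirrors the 50% criterion used in the experiments described later in the text.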

3.2 Boundary Conditions
Blake imposes Neumann boundary conditions in which the derivative at the boundary
is specified. The process of integration by Fourier expansion is simplified slightly if,
instead, Dirichlet boundary conditions are used in which the values at the border are
fixed. Surrounding the image with a black border accomplishes this.
In the case of lightness, the Dirichlet conditions will not work because the intensity
variation removed via thresholding does not balance out from one edge of the image to
another. For shading recovery, however, as long as the color changes are contained within
the image, what is thresholded does balance across the image. Color changes are generally
completely contained within the image as, for example, with colored letters on a colored
background. For convenience, our current implementation uses Dirichlet conditions, but
could straightforwardly be changed to Neumann conditions if necessary.

3.3 Algorithm
To summarize the above discussion, the shading-recovery algorithm is as follows:

1. Find color edges.
   - Smooth the chromaticity images by convolution with a Gaussian.
   - Form the gradient images ∇r, ∇g.
   - Form the threshold image: T = 0 if ‖∇(r, g)‖ is larger than a percentage of the maximum in the initial image; else T = 1.
2. Make the log-of-luminance image, I = log(R + G + B).
3. Make thresholded x- and y-gradient images of I. Denote these images by P and Q since they are analogous to the gradient (p, q) in the SFS problem.
   - Surround image I with a border of zeroes.
   - Form P = ∂I/∂x, Q = ∂I/∂y.
   - Apply T to both P and Q, yielding thresholded gradient images T(P) and T(Q).
4. Curl-correction.
   (a) Integrability requires ∂/∂y(T(P)) − ∂/∂x(T(Q)) = 0. For the input log-intensity image I, this is precisely true up to numerical accuracy.
   (b) If the maximum curl is not sufficiently small, then set to zero all locations of non-zero curl. The locations to be zeroed also include the immediately surrounding neighborhood. For example, when a 3×1 mask (−0.5, 0, 0.5) is used for the partial derivatives, then the horizontal and vertical neighbors of the 3×3 surrounding square should also be made zero.
   (c) Repeat curl-correction until the maximum curl value decreases sufficiently--for example, until it reaches 50% of the original maximum curl.
5. Combine the thresholded gradient components by taking another derivative. This gives the Laplacian of S: ∇²S = ∂/∂x(T(P)) + ∂/∂y(T(Q)).
6. Solve this Poisson equation by the method of Fourier expansion [13]. Exponentiate to find the actual shading image S'.
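Putting the six steps together, a compact NumPy/SciPy sketch of the whole pipeline might look as follows. This is our own illustration rather than the authors' implementation: the parameter values are arbitrary, periodic central differences stand in for the masks above, and the Fourier inversion in step 6 plays the role of the method of [13].

import numpy as np
from scipy.ndimage import gaussian_filter

def dx(A):   # central difference along x (columns); periodic via np.roll
    return 0.5 * (np.roll(A, -1, axis=1) - np.roll(A, 1, axis=1))

def dy(A):   # central difference along y (rows)
    return 0.5 * (np.roll(A, -1, axis=0) - np.roll(A, 1, axis=0))

def recover_shading(rgb, sigma=2.0, thresh_frac=0.4, curl_target=0.5):
    # rgb: H x W x 3 float image.  sigma, thresh_frac and curl_target are
    # illustrative parameter choices, not the settings used in the paper.
    s = rgb.sum(axis=2) + 1e-6

    # Step 1: threshold image T from the smoothed chromaticity gradient.
    r = gaussian_filter(rgb[..., 0] / s, sigma)
    g = gaussian_filter(rgb[..., 1] / s, sigma)
    grad_rg = np.sqrt(dx(r)**2 + dy(r)**2 + dx(g)**2 + dy(g)**2)
    T = grad_rg <= thresh_frac * grad_rg.max()          # T = 0 at color edges

    # Steps 2-3: log-luminance with a zero border; thresholded gradients P, Q.
    I = np.pad(np.log(s), 1)
    T = np.pad(T, 1, constant_values=True)
    P, Q = np.where(T, dx(I), 0.0), np.where(T, dy(I), 0.0)

    # Step 4: curl-correction by zeroing the gradient where the curl stays large.
    c0 = np.abs(dy(P) - dx(Q)).max() + 1e-12
    while np.abs(dy(P) - dx(Q)).max() > curl_target * c0:
        bad = np.abs(dy(P) - dx(Q)) > 0.01 * c0         # tolerance: illustrative choice
        for axis in (0, 1):
            for shift in (1, -1):
                bad |= np.roll(bad, shift, axis=axis)   # grow to the neighborhood
        P, Q = np.where(bad, 0.0, P), np.where(bad, 0.0, Q)

    # Step 5: Laplacian of the log-shading field.
    lap = dx(P) + dy(Q)

    # Step 6: invert the Laplacian in the Fourier domain and exponentiate.
    H, W = lap.shape
    sx = np.sin(2 * np.pi * np.arange(W) / W)[None, :]
    sy = np.sin(2 * np.pi * np.arange(H) / H)[:, None]
    denom = -(sx**2 + sy**2)                            # matched to dx, dy above
    L = np.fft.fft2(lap)
    Sf = np.zeros_like(L)
    nz = np.abs(denom) > 1e-12
    Sf[nz] = L[nz] / denom[nz]
    S = np.real(np.fft.ifft2(Sf))
    return np.exp(S)[1:-1, 1:-1]                        # shading up to a scale factor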

3.4 Test Image


Figure 4 illustrates several of the above steps on the synthetic image shown in part (a).
The actual image is in color but since the colors are not crucial to the argument, it is
reproduced only in black and white. It consists of a colored square (128 by 128 pixels) on
a colored background. The color of the square and the background are each uniform, but
there is a luminance gradient in all three bands increasing from left to right. To generate
non-step edges, the image has been smoothed by convolution with a Gaussian. Part (b)
shows a graph of the intensity along the middle row. The red and green chromaticity
images (scaled for display) are shown in parts (c) and (d). The chromaticities are flat
within the square and the surround. The thresholded log-luminance image, before curl-
correction, is shown in part (e). Curl-correction was repeated until the maximum value
of the curl anywhere in the image decreased to 50% of the maximum value of what it
was originally. The number of iterations was 13, which is high compared to non-synthetic
images (see below). The final, curl-corrected threshold image is shown in part (f). Part (g)
shows the (non-log) recovered shading image. The intensity gradient across this final
shading image is plotted in part (h).

4 Results on Real Images

The luminance image derived from the cereal box, color image of Fig. 1 is shown in
Fig. 5 (a). The corresponding chromaticity images r, g (scaled) are Figs. 5 (b,c). Apply-
ing the gradient operator to these chromaticity images and thresholding at 40% of the
gradient yields the initial threshold image in Fig. 5 (d). Reducing the maximum curl to
60% of its original maximum via curl correction generates the extended threshold image,
Fig. 5 (e). The number of curl-correction iterations required was 5. The recovered shading
image is Fig. 5 (f) with the difference between figures (a) and (f) shown in Fig. 5 (g).
In order to compare the algorithm's performance with "ground truth," we also con-
sidered the image of Fig. 6 (a), which was created by Lambertian shading of a laser
range-finder depth map of a plaster bust of Mozart. 2 Fig. 6 (b) overlays this shading
field with color by multiplication with the colors measured in the cereal box image. Thus,
both the shading and the color edges come from natural objects, but in a controlled fash-
ion, so the result is a synthetic image constructed from real shapes and colors including
noise. The image shown is actually the luminance image derived from the color image.
To take into account the color and not shape of the box, the colors were extracted from
the chromaticity images Fig. 5 (b,c), with the b image formed as 1 - r - g, rather than
2 The laser range data for the bust of Mozart is due to Fridtjof Stein of the USC Institute for
Robotics and Intelligent Systems.

using the RGB directly. So that the intensity image would not simply equal the original
shading image, the r, g, b components were multiplied by unequal amounts.
The chromaticity images of the input color image are thus precisely those of Fig. 5 (b,c) (because Fig. 6 (a) contains no pixels that are exactly zero). The initial chromaticity-
derived threshold function, with a threshold level of 30%, is shown in Fig. 6 (c). Requiring
the maximum curl value to be reduced to 60% of its original value took a single iteration--
lowering the initial threshold further, from 40% in Fig. 5 to 30% substantially speeds up
the curl-correction step.
Curl-correction extends the threshold function as in Fig. 6 (d). Applying the algo-
rithm to the luminance image Fig. 6 (b), results in Fig. 6 (e). Comparing to Fig. 6 (a),
the shading is recovered well in that the difference between figures (a) and (e) is negligible.

5 A s s u m p t i o n s and L i m i t a t i o n s
Stated explicitly the assumptions and limitations of the algorithm are:
- Color edges must not coincide with shading edges.
- All color edges must involve a change of hue/chromaticity, not just brightness (e.g.
not orange to dark orange, or perfect gray to another shade of perfect gray).
- Surfaces are Lambertian reflectors. Strong specularities will be mistaken for re-
flectance changes, while weak specular components will be attributed to shading.
- The spectral power distribution of the illumination should be constant, but of course
its intensity can vary. Gradual changes will be attributed to the shading to the
extent that they affect the luminance image. Abrupt changes in intensity are allowed
and will be correctly attributed to shading because they will not cause an abrupt
chromaticity change. This is unlike retinex algorithms, which will be fooled by sharp
intensity changes because they treat each color channel separately.
- The shading is recovered up to an overall multiplicative scaling constant.

6 Conclusions
Color creates problems for shape-from-shading algorithms which assume that surfaces
are of constant albedo. We have implemented and tested on real images an algorithm
that recovers shading fields from color images which are equivalent to what they would
have been had the surfaces been all one color. It uses chromaticity to separate the surface
reflectance from surface shading and involves thresholding the gradient of the logarithm
of the image luminance. The resulting Poisson equation is inverted by the direct, Fourier
transform method.

References
1. A. Blake. Boundary conditions for lightness computation in Mondrian world. Computer
Vision, Graphics, and Image Processing, 32:314-327, 1985.
2. P. T. Eliason, L. A. Soderblom, and P. S. Chavez Jr. Extraction of topographic and spectral
albedo information from multispectral images. Photogrammetric Engineering and Remote
Sensing, 48:1571-1579, 1981.
3. R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape from shading
algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10:439--451,
1988.
4. B.V. Funt and M.S. Drew. Color constancy computation in near-Mondrian scenes using a
finite dimensional linear model. In Computer Vision and Pattern Recognition Proceedings,
pages 544-549. IEEE Computer Society, June 1988.
5. R. C. Gonzalez and P. Wintz. Digital Image Processing. Addison/Wesley, 2nd edition,
1987.
6. G. Healey. Using color for geometry-insensitive segmentation. J. Opt. Soc. Am. A, 6:920-
937, 1989.
7. B. K. P. Horn. Determining lightness from an image. Computer Vision, Graphics, and Image Processing, 3:277-299, 1974.
8. A. Hurlbert. Formal connections between lightness algorithms. J. Opt. Soc. Am. A, 3:1684-
1692, 1986.
9. E.H. Land. Recent advances in retinex theory. Vision Res., 26:7-21, 1986.
10. E.H. Land and J.J. McCann. Lightness and retinex theory. J. Opt. Soc. Amer., 61:1-11,
1971.
11. A. P. Pentland. Linear shape from shading. Int. J. Comput. Vision, 4:153-162, 1990.
12. T. Poggio. MIT progress in understanding images. In DARPA Image Understanding Workshop, pages 56-74, 1989.
13. T. Simchony, R. Chellappa, and M. Shao. Direct analytical methods for solving Poisson
equations in computer vision problems. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 12:435-445, 1990.
14. D. Terzopoulos. Image analysis using multigrid relaxation methods. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 8:129-139, 1986.
15. G. Wyszecki and W.S. Stiles. Color Science: Concepts and Methods, Quantitative Data
and Formulas. Wiley, New York, 2nd edition, 1982.
16. Q. Zheng and R. Chellappa. Estimation of illuminant direction, albedo, and shape from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:680-702,
1991.

Figure 1. (a) Black and white photograph of a color image of a cereal box--blue background with yellow lettering. (b) Recovered depth image using the shape from shading algorithm of [11].


Figure 2. One-dimensional case: (a) Initial depth map. (b) Shading field for (gray) Lambertian surface illuminated by a point source from upper left. (c) Same surface with colored stripes (red--solid, green--dotted, blue--dashed). (d) Colors in image (c) multiplied by shading field of image (b). (e) Chromaticities formed from observed camera values. (f) Shading field recovered by algorithm.

Figure 3. Thresholding: (a) Smooth step. (b) Derivative. (c) Thresholded derivative. (d) Integra-
tion of (c)--in 2 dimensions would be integration of gradient images under integrable projection.

Figure 4. (a) Black and white photograph of synthetic color image: flat square and surround with illumination gradient, smoothed with Gaussian. (b) Intensity across the image. (c) Red chromaticity image. (d) (2nd row) Green chromaticity image. (e) Initial threshold. (f) Threshold after curl-correction. (g) (3rd row) Output shading image. (h) Intensity across output image.

Figure 6. Mozart bust overlaid with color: (a) Intensity image. (b) Red chromaticity image. (c) Green chromaticity image. (d) (2nd row) Initial threshold. (e) Threshold after region-growing. (f) Output shading image. (g) (3rd row) Difference between (a) and (f).

Figure 5. Cereal box: (a) Intensity image. (b) Red chromaticity image. (c) Green chromaticity image. (d) (2nd row) Initial threshold. (e) Threshold after curl-correction. (f) Output shading image. (g) (3rd row) Difference between (a) and (f).
This article was processed using the LaTeX macro package with ECCV92 style
Shading Flows and Scenel Bundles: A New Approach
to Shape from Shading

Pierre Breton 1, Lee A. Iverson 1, Michael S. Langer 1, Steven W. Zucker 1'2


1 McGill University, Research Centre for Intelligent Machines, 3480 rue University, Montreal, Quebec, Canada, H3A 2A7, e-mail: zucker@mcrcim.mcgill.edu
2 Fellow, Canadian Institute for Advanced Research
A b s t r a c t . The classical approach to shape from shading problems is to
find a numerical solution of the image irradiance partial differential equa-
tion. It is always assumed that the parameters of this equation (the light
source direction and surface albedo) can be estimated in advance. For images
which contain shadows and occluding contours, this decoupling of problems
is artificial. We develop a new approach to solving these equations. It is
based on modern differential geometry, and solves for light source, surface
shape, and material changes concurrently. Local scene elements (scenels)
are estimated from the shading flow field, and smoothness, material, and
light source compatibility conditions resolve them into consistent scene de-
scriptions. Shadows and related difficulties for the classical approach are
discussed.

1 Introduction

The shape from shading problem is classical in vision; E. Mach (1866) was perhaps the
first to formulate a formal relationship between image [1] and scene domains, and to
capture their inter-relationships in a partial differential equation. Horn set the modern
approach by focusing on the solution of such equations by classical and numerical tech-
niques [2, 3, 4], and others have built upon it [5, 6]. Nevertheless, problems remain which
are not naturally treated in the classical sense, especially those related to discontinuities
and shadows. We present a new approach to the shape from shading problem motivated
by modern notions of fibre bundles in differential geometry [9]. The global shape from
shading problem is posed as a coupled collection of "local" problems, each of which at-
tempts to find that local scene element (or scenel3) that captures the local photometry,
and which are then coupled together to form global piecewise smooth solutions.
The paper is written in a discursive style to convey a sense of the "picture" be-
hind our approach rather than the formal treatment. We begin with an overview of the
classical formulation, then proceed to define the structure of a "scener' and our new
conceptualization.

1.1 The "Classical" Shape from Shading Problem
We take the classical setting in computer vision for shape from shading to be the following:
a point light source at infinity uniformly illuminates a smooth matte surface of constant
albedo whose image is formed by orthographic projection.
The matte surface is traditionally modeled with Lambert's reflectance function so the
image irradiance equation is
I(x, y) = ρλ L · N(x, y)
3 cf. pixel, voxel, ..., scenel.

where I(x, y) is the intensity of an image point (x, y); ρ, the albedo of the surface, i.e. the fraction of the incident light which is reflected; λ, the illumination, i.e. the amount of incident light; L, the light source direction; and N(x, y), the normal at the surface point corresponding to the image point (x, y).
The literature cited above describes the various attempts to solve this (or a closely
related) problem 4 from first principles. We emphasize, however, that, to make these
approaches tractable, certain parameters are assumed known (e.g. typically ρ, λ and L).
Operationally this decouples problems; e.g., it decouples the shape from shading problem
from light source estimation problems[10].

1.2 Piecewise Smooth Shape from Shading Problems
We submit that such decoupling, while appropriate for certain highly engineered situa-
tions, is not always necessary; moreover, it can make shading analysis impotent precisely
when it should be useful. For example, a human observer confronted with a static, monoc-
ular view of a scene will succeed in obtaining some estimate of the shapes of the surfaces
within it even when some of the classical setting's constraints are relaxed. The presence
of a shadow, a diffuse light source, or even a patterned surface does not necessarily in-
terfere with our ability to recover shape from shading. Thus the classical constraints can
be relaxed in principle; but how far, and once relaxed by what mechanism can solutions
be found? These are precisely the questions with which we shall be concerned.
We retain the basic assumption that smooth variation in intensity is entirely due to
smooth variation in surface orientation; thus:

∇I(x, y) = ρλ L · ∇N(x, y) .    (1)


But we diverge from the classical model in two ways. First, we develop an approach
that handles light source and surface properties concurrently; neither problem must be
solved "before" the other. Second, we allow for discontinuities. Geometric discontinuities
(in curvature, orientation, and depth) are unavoidable and their projection into the image
has widely recognized importance [11, 12, 13]. We will therefore assume that the scene
is composed of piecewise smooth surfaces. 5
Since we are requiring that all smooth variation in the image intensity arises from
variation in surface normal, we also assume that the albedo is constant on each smooth
surface region.
When several light sources (point or non-point sources) are present, one can consider
instead a single equivalent light source under the condition that all light sources are visible from the Lambertian surface patch. As long as this equivalent light source is constant over
a surface patch, (1) will be valid.
Shadow boundaries are problematic since their effect is to change the direction and
magnitude of the equivalent light source. These problems are discussed in Sect. 5. Another
problem is the mutual illumination between bright surfaces which can be considered as
nearby large light sources. However, (1) is only valid when light sources are far from the
surface.
In summary, we attribute all smooth variation in intensity to smooth variation in
orientation of surface elements, and thus to surface shape. Thus, in general, any shape
from shading process should reconstruct shape information identically for a photograph
4 Other reflectance functions have been studied, e.g. for glossy surfaces or the moon.
5 Since the reflectance function depends on the existence of a differentiable surface normal,
allowing surfaces that are nowhere smooth is clearly inappropriate.

(where the albedo varies continuously), a projected slide (where the illumination on the
screen varies continuously) and the scene itself.

2 Shape from Shading as a Coupled Family of Local Problems:


Outline of the Scenel Bundle Approach

The key idea underlying our approach is to consider the shape from shading problem as
a coupled family of local problems. Each of these is a "micro"-version of the shape from
shading problem in which a lighting and surface model interact to produce the locally
observed shading structure. We call each of these different models a scene element, or
scenel, and, since many different scenels may be consistent with the local image structure,
utilize fibre bundles to provide a framework to couple them together.


Fig. 1. Depiction of an abstract scene element, or scenel, corresponding to an image patch (A).
The scenel (B) consists of a surface patch, described by its image coordinates, surface normal,
and curvature. Its material properties (albedo) are also represented. Finally, a virtual light source
completes the photometry.

The notion of FIBRE BUNDLES is fundamental to modern differential geometry [7].

A fibre bundle consists of a triple (E, π, M), where E is called the total space, π is a projection operator, and M is the base space. For each point p ∈ M, the subset π⁻¹(p) ⊂ E is called the FIBRE over p. Denoting the fibre F, an example is the product bundle (M × F, π, M), which illustrates how the total space can be viewed as the base manifold crossed with the fibre space. While this construction is quite abstract, we use it

in the manner shown in Figs. 1, 2 and 5. Essentially it allows us to consider the global shape from shading problem as a coupled family of local problems.
We take the image manifold as the base space, and consider the photometry for each point on it. Locally, in a neighbourhood around the point (x, y), the shading information can be described by a combination of local light source and surface property values (including curvature, albedo, etc.). Each of these defines a scenel (Fig. 1), and the space of all possible scenels defines the fibre over that point. Together, the collection of scenel fibres defines the scenel bundle (Fig. 2).


Fig. 2. Depiction of a Scenel Bundle over an image. At each point in the image there are many
possible scene elements, or scenels. Each of these scenels is depicted along a fibre, or vertical space
above each image coordinate. The union of scenel fibres over the entire image is called a scenel
bundle. The shape from shading problem is formulated as determining sections through the
scenel bundle. Such a section is depicted by the shaded scenels, and represents a horizontal slice
across the bundle. Scenel participation in a horizontal section is governed by surface smoothness
and material and light source constancy constraints.

We seek a solution of the shape from shading problem as connected sets of scenels in which neighbours are consistent; such a solution is called a CROSS SECTION through the scenel bundle. Formally, a cross section of a bundle (E, π, M) is a map s : M → E such that πs = 1_M. In other words, a cross section assigns a member of each fibre to each position in the manifold.
A sub-bundle of the scenel bundle is the TANGENT BUNDLE, in which the fibres consist of the tangent spaces at each point and sections correspond to vector fields; Sander and Zucker [8] previously used this bundle in their study of inferring principal direction fields on surfaces.
We define what is meant by consistency shortly; first, we introduce the shading flow
field as our initial data.

2.1 The Shading Flow Field as Initial Data

Observe that a sensitivity issue arises in the scenel framework; spatial quantization of
the image induces a quantization of the scene domain. Analogously to the manner in
which integer solutions are not always possible for algebraic equations, we begin with
"quantized" initial data as well. In particular, we derive our initial estimates from the
shading/low field instead of directly from the intensity image. This field is the first order
differential structure of the intensity image expressed as the isoluminance direction and
gradient magnitude (Fig. 3(a)); we supplement it with the intensity "edge" image (Fig.
3(b)). We suggest that dealing with uncertainties at the level of the shading flow field
will expose more of the natural spatial consistency of the intensity variation, and will
thus lead to more robust processing than the raw intensities. The shading flow field ideas
are related to Koenderink's isophotes [14].
Traditionally, the gradient of an image is computed by estimating directional derivatives by

∂I/∂x = I ∗ G'_σ(x) G_σ(y),    ∂I/∂y = I ∗ G_σ(x) G'_σ(y) .

The gradient estimate follows immediately, and the isoluminance direction is simply perpendicular to the gradient. The one limitation of this approach is that (depending on the magnitude of σ) it always infers a smooth gradient field, even when the underlying image is non-continuous. We have investigated methods of obtaining stable, discontinuous shading flow fields using logical/linear operators [15, 16].
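For illustration, a minimal sketch of such a derivative-of-Gaussian estimate of the shading flow field (our own; SciPy's gaussian_filter with per-axis derivative orders stands in for the G_σ, G'_σ pair, and σ is arbitrary):

import numpy as np
from scipy.ndimage import gaussian_filter

def shading_flow(I, sigma=2.0):
    # Estimate the image gradient with x- and y-derivatives of a Gaussian,
    # then return the isoluminance orientation and the gradient magnitude.
    Ix = gaussian_filter(I, sigma, order=(0, 1))    # derivative along columns (x)
    Iy = gaussian_filter(I, sigma, order=(1, 0))    # derivative along rows (y)
    magnitude = np.hypot(Ix, Iy)
    orientation = np.arctan2(Iy, Ix) + np.pi / 2    # isoluminance dir. is perpendicular to the gradient
    return orientation, magnitude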
Our motivation for starting from the shading flow field is also biological. We take shading analysis to be an inherently geometric process, and hence handled within the same cortical systems that provide orientation selection and texture flow analysis. Shading flow is simply a natural extension.

2.2 Constraints between Local Scenels

The coupling between the local scenel problems dictates a consistency relationship over
them, and derives from three principal considerations:

1. A SURFACE SMOOTHNESS CONSTRAINT, which states that the surface normal and
curvatures must vary according to a Lipschitz condition between pairs of scenels
which project to neighbouring points in the image domain. This notion is subtle
to implement, because it involves comparison of normal vectors following parallel
transport to the proper position (see [8]).
2. A SURFACE MATERIAL CONSTRAINT, which states that the surface material (albedo
and reflectance) is constant between pairs of scenels which project to neighbouring
points in the image domain.
3. A LIGHT SOURCE CONSTRAINT, which states that the virtual light source is constant
for pairs of scenels which project to neighbouring points in the image domain.

Fig. 3. Typical shading flow (a) and edge (b) fields. The left shading flow field depicts the first
order differential structure of the image intensities, and the right is an edge map. Our shading
analysis is based on these data, and not on the raw image intensities. This is to focus on the
geometry of shape from shading analysis, and perhaps to capture something implicit in the
biological approach.

Viewed globally, the solution we seek consists of sections in which a single (equivalent)
light source illuminates a collection of surface patches with constant material properties
but whose shape properties vary smoothly. The above constraints are embedded into a
functional, and consistent sections through the scenel bundle are stationary points of
this functional. More specifically, the constraints are expressed as compatibility relation-
ships between pairs of neighbouring estimates within a relaxation labelling process. As
background, we next sketch the framework of relaxation labelling.

2.3 Introduction to Relaxation Labelling

Relaxation labelling is an inference procedure for selecting labels attached to a graph according to optimal (symmetric) or variational (asymmetric) principles. Think of nodes in the graph as scenels, and edges in the graph as links connecting "nearby" scenels, with weights representing their compatibility. Formally, let i ∈ Z denote the discrete scenel i, and let Λ_i denote the set of labels for scenel i. The labels at each position are ordered according to the measure p_i(l) such that 0 ≤ p_i(l) ≤ 1 and Σ_{l∈Λ_i} p_i(l) = 1 for all i. (In biological terms, think of p_i(l) as the firing rate for a neuron coding label l for scenel i.) Compatibility functions r_ij(l, l') are defined between label l at position i and label l' at position j such that increasingly positive values represent stronger compatibility. The abstract network structure is obtained from the support s_i(l) that label l obtains from the labelling of its neighbours N(i); in symbols,

s_i(l) = Σ_{j∈N(i)} Σ_{l'∈Λ_j} r_ij(l, l') p_j(l') .

The final labelling is selected such that it maximizes the average local support

A(p) = Σ_i Σ_{l∈Λ_i} s_i(l) p_i(l) = Σ_i Σ_{l∈Λ_i} Σ_{j∈N(i)} Σ_{l'∈Λ_j} r_ij(l, l') p_j(l') p_i(l) .
Such a labelling is said to be consistent [17]. We now simply remark that such "com-
putational energy" forms have become common in neural networks, and observe that
Hopfield [19] networks are a special case, as are polymatrix games, under certain condi-
tions [20].
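To fix ideas, here is a toy NumPy sketch of the support computation and a simple support-following update; the update rule, the sizes, the neighbourhood structure and the random compatibilities are our own placeholders, not the scheme of [17].

import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_labels = 6, 4                       # toy sizes (hypothetical)
p = rng.random((n_nodes, n_labels))
p /= p.sum(axis=1, keepdims=True)              # p_i(l) in [0, 1], sums to 1 per node
r = rng.normal(0, 1, (n_nodes, n_nodes, n_labels, n_labels))   # r_ij(l, l')
neighbours = [np.delete(np.arange(n_nodes), i) for i in range(n_nodes)]

def support(p):
    # s_i(l) = sum_{j in N(i)} sum_{l'} r_ij(l, l') p_j(l')
    s = np.zeros_like(p)
    for i in range(n_nodes):
        for j in neighbours[i]:
            s[i] += r[i, j] @ p[j]
    return s

def average_local_support(p):
    # A(p) = sum_i sum_l s_i(l) p_i(l)
    return float(np.sum(support(p) * p))

# A heuristic update step in the direction of the support (a stand-in
# for the full relaxation dynamics):
for _ in range(20):
    q = np.clip(p + 0.05 * support(p), 1e-12, None)
    p = q / q.sum(axis=1, keepdims=True)
print(average_local_support(p))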

2.4 Overview of the Paper
The paper is organized as follows from this point on. We first formally define a scene
element. We then show how to find those scene elements that are consistent with the
shading flow field, and solve the forward problem of calculating the shading flow field
expected for each scene element. The remainder of the paper is concerned with the
inverse problems of inferring those scenels that are consistent with the shading flow
field. The subtle interaction between expressing the scenel variables and the relaxation
compatibilities is described, and scene element ambiguity addressed. An advantage of
our technique is that, since both surface and lighting geometry are estimated, different
types of shadow and illumination discontinuities can be handled; these are discussed in
the Sect. 5.

3 Initial Estimation of Scene Attributes

We adopt a coarse coding of surface and light source attributes by quantizing the range
of values of each attribute. Each scene element is defined by an assignment of a value
to each of the scene attributes. The set of scene elements is viewed as a set of existence
hypotheses of a surface patch of a fixed shape and orientation at a fixed image position,
illuminated from a fixed direction with a fixed product of albedo and illumination.

3.1 The Scene Element
The attributes we consider are
1. IMAGE POSITION: The image pixels are themselves a set of discrete values of x and y position.
2. VIRTUAL ILLUMINANT DIRECTION: We take the light sources in the scene as a set of M distant point sources⁶, {λ^(i) L^(i) : 1 ≤ i ≤ M}, where λ^(i) is the intensity and L^(i) is a unit vector. Let V_i(x, y) be a binary "view" function such that V_i(x, y) = 1 if and only if light source L^(i) directly illuminates the surface element corresponding to pixel (x, y).
We define the virtual point source at (x, y) by the two attributes λ and L such that

λL(x, y) = Σ_i λ^(i) L^(i) V_i(x, y) .    (2)

This virtual point source is constant for any neighbourhood in which all the V_i functions are constant. In such a neighbourhood, surface luminance satisfies
I(x, y) = ρλ L · N(x, y) .
6 M would be quite large in the case of a diffuse source

The possible virtual light source directions map onto a unit sphere. We sample this sphere as uniformly as possible to get a discrete set of virtual illuminant directions. In viewer-centered coordinates, this unit vector is given as
L = (L_x, L_y, L_z) .

3. MATERIAL PROPERTIES, or the product ρλ: We need only consider the product of the albedo and the illuminance (see (1)). Usually the imaging process normalizes to some maximum value, so we can assume a range between zero and one, and discretely sample this range.
4. SURFACE SHAPE DESCRIPTORS: The two principal curvatures (κ₁, κ₂) describe the shape up to rotation. Two angles, slant σ and tilt τ, are needed to describe the surface tangent plane orientation with respect to the viewer's coordinate frame. An additional angle φ is needed to describe the principal direction of the Darboux frame in the surface tangent plane.
(a) The two angles needed to orient the surface tangent plane in space describe the surface normal
N = (N_x, N_y, N_z) = (cos τ sin σ, sin τ sin σ, cos σ) .
The set of all such normals forms a unit sphere. The surface of this sphere is sampled as uniformly as possible to derive a discrete set of normals. Of these, only the ones in the hemisphere facing the viewer are used; the others are not visible; see Fig. 4.

Fig. 4. The shape of surface patches is represented as a function of the principal curvatures mapped onto an abstract sphere through "curveness" and "shape index" measures (see text). Nearby positions on the sphere indicate smooth changes in either the shape or orientation of a scenel. This representation on the sphere facilitates the definition of scenel compatibilities later in the text.

(b) The two principal curvatures (κ₁, κ₂) are mapped into a curveness measure

c = max(|κ₁|, |κ₂|)

and a shape index measure

s = cos⁻¹( (κ₁ + κ₂) / (2c) ) .

These are analogous to Koenderink's curveness and shape index [21], with the choice of norm for the curveness and the spreading function for the shape index modified slightly.
The angles 2φ and s are, respectively, the longitude and latitude of a spherical coordinate system covering shape variation in the tangent plane. As we stated previously, the angle φ represents the principal direction, while s values 0 and π represent umbilic surfaces where principal directions are not defined. Any smooth curve on the (2φ, s) sphere represents a smooth deformation or rotation of the surface. We define a unit vector K as follows

K = (K₁, K₂, K₃) = (cos 2φ sin s, sin 2φ sin s, cos s) .

Thus by sampling the surface of the sphere uniformly we derive a discrete set of parameters which cover all variations of smooth, oriented shape in the tangent plane. Augmenting this with the curveness index provides a complete, discretely sampled shape descriptor.
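A small numerical sketch of this shape encoding (our own illustration; the shape-index expression s = cos⁻¹((κ₁+κ₂)/(2c)) is the one given above, and the example curvature values are arbitrary):

import numpy as np

def shape_descriptor(k1, k2, phi):
    # Curveness c, shape index s and unit vector K for principal curvatures
    # (k1, k2) and principal direction phi, as defined above.
    c = max(abs(k1), abs(k2))
    s = np.arccos((k1 + k2) / (2 * c)) if c > 0 else 0.0
    K = np.array([np.cos(2 * phi) * np.sin(s),
                  np.sin(2 * phi) * np.sin(s),
                  np.cos(s)])
    return c, s, K

# Umbilic (k1 = k2) patches map to the poles of the (2*phi, s) sphere:
print(shape_descriptor(0.2, 0.2, 0.0)[1])    # s = 0  (convex umbilic)
print(shape_descriptor(-0.2, -0.2, 0.0)[1])  # s = pi (concave umbilic)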

Given the discrete sampling of the scene attributes as defined above, we derive a set of scenel labels
l = {x, y, ρλ, L, N, c, K}
which represent all potential assignments of these scene attributes. Thus each i represents the hypothesis that the scene can be locally described by the scenel (x_i, y_i, (ρλ)_i, L_i, N_i, c_i, K_i). To relate this to the relaxation labelling paradigm [17], we distribute a measure p_i over each scenel i representing confirmation of the hypothesis. The first step is to obtain an initial estimate for the confidence measure p_i from the shading flow field. The second is to extract locally consistent sets of hypotheses by relaxation labelling. The third step is to prune these sets by imposing appropriate boundary conditions. This is a large scale parallel computation, and we are currently implementing it on a massively parallel machine (MasPar MP-1).

3.2 The Expected Shading Flow Field

For each scenel i, we need an initial estimate for the associated weight p_i. We take this weight to reflect the match between the local properties of the shading flow field and the EXPECTED VALUES for the scene element i.
The EXPECTED SHADING FLOW FIELD is obtained by computing the light intensity gradient for a surface locally described by the scenel i. The paraboloid is an arbitrarily curved surface with the local parametric form

(u, v, w)   where   w = (κ₁u² + κ₂v²) / 2 .

The u and v axes correspond to the two principal directions. The normal to the surface at a point (u, v) is given by:

N(u, v) = (N_u(u, v), N_v(u, v), N_w(u, v)) = (−κ₁u, −κ₂v, 1) / (κ₁²u² + κ₂²v² + 1)^{1/2} .

Two rotations relate the viewer's coordinate frame to the paraboloid local frame. The first rotation takes care of the surface orientation and the second, of the principal directions. In matrix form

M = M_n M_c

where

M_n = [ cos²τ cos σ + sin²τ              sin τ cos τ cos σ − sin τ cos τ    cos τ sin σ ]
      [ sin τ cos τ cos σ − sin τ cos τ   sin²τ cos σ + cos²τ               sin τ sin σ ]
      [ −cos τ sin σ                      −sin τ sin σ                      cos σ       ]

and

M_c = [ cos φ   −sin φ   0 ]
      [ sin φ    cos φ   0 ]
      [ 0        0       1 ] .
Since the inverse of the matrix M is simply its transpose, the light source expressed in the local surface coordinates is:

(L_u, L_v, L_w) = Mᵗ (L_x, L_y, L_z) .

Therefore the surface luminance can be computed:

I(u, v) = ρλ L · N(u, v) = ρλ (L_u N_u(u, v) + L_v N_v(u, v) + L_w N_w(u, v)) .
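As a forward-model illustration, here is a minimal NumPy sketch that evaluates this expected Lambertian luminance on a scenel's osculating paraboloid. It is our own code: the rotation entries follow the slant/tilt/principal-direction construction above, and every numeric input in the example call is arbitrary.

import numpy as np

def rotation(sigma, tau, phi):
    # M = Mn Mc: Mn orients the tangent plane (slant sigma, tilt tau),
    # Mc rotates by phi to align with the principal directions.
    ct, st, cs, ss = np.cos(tau), np.sin(tau), np.cos(sigma), np.sin(sigma)
    Mn = np.array([[ct*ct*cs + st*st,  st*ct*cs - st*ct,  ct*ss],
                   [st*ct*cs - st*ct,  st*st*cs + ct*ct,  st*ss],
                   [-ct*ss,            -st*ss,            cs   ]])
    Mc = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                   [np.sin(phi),  np.cos(phi), 0.0],
                   [0.0,          0.0,         1.0]])
    return Mn @ Mc

def expected_luminance(u, v, k1, k2, sigma, tau, phi, rho_lambda, L_viewer):
    # I(u, v) = rho*lambda * L . N(u, v) on the osculating paraboloid,
    # with the light source expressed in the local surface frame (M^t L).
    N = np.array([-k1 * u, -k2 * v, 1.0]) / np.sqrt(k1**2 * u**2 + k2**2 * v**2 + 1.0)
    L_local = rotation(sigma, tau, phi).T @ np.asarray(L_viewer)
    return rho_lambda * float(L_local @ N)

# Hypothetical scenel: gently convex patch, frontal light source.
print(expected_luminance(0.0, 0.0, 0.3, 0.1, 0.4, 0.2, 0.1, 0.8, (0.0, 0.0, 1.0)))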
Thus we can derive the local properties of the expected flow field at (x_i, y_i): the orientation θ_i of the isoluminance line, the gradient magnitude |∇I(x, y)|, and the curvature κ_i of the isoluminance line, each obtained in closed form from I(u, v) above in terms of κ₁, κ₂, (L_u, L_v, L_w) and the entries of M.
The initial weight p_i is a decreasing function of some distance measure between the observed flow field in the neighbourhood of (x_i, y_i) and the expected flow field for scenel i:

p_i = G(θ_i − θ_obs) · G(|∇I|_i − |∇I|_obs) · G(κ_i − κ_obs)

where θ_obs, |∇I|_obs and κ_obs are extracted from the initial flow field.

3.3 Scene Element Ambiguity

It should be clear that a local shading flow field will not select a unique scenel. In general,
for an arbitrary shading flow field, several scene elements will be assigned a significant
weight for each image position. Identical flow fields can be generated by surfaces of
different shapes because of

- Intrinsic ambiguity. For example, the cases of convex (κ₁, κ₂), concave (−κ₁, −κ₂), hyperbolic (−κ₁, κ₂) or (κ₁, −κ₂) surfaces facing the viewer and the light source all have the same intensity profile.
- Accidental correspondence between light source and surface orientations. For exam-
ple, a surface with ~1 = ~2 will generate concentric circular isoluminance lines in
the plane facing the light source. If the curvature is small, the isoluminance lines
projected in the image plane form concentric ellipses. Such isoluminance lines can
also be seen when an elliptic surface faces the viewcr and the light source.

The example of elliptic isoluminance lines is particularly revealing. Such lines could be
due to the shading of a spherical patch directly facing the light source but slanted away
from the viewer; or it could be due to the shading of an elliptical (convex or concave)
patch directly facing the light source and the viewer. Here, two phenomena are closely
coupled: the formation of the isoluminance lines on the surface (related to the slant of the
surface with respect to the light source) and the projection on the image plane (related
to the slant of the surface with respect to the viewer).
However, these local ambiguities are not always accompanied by global ambiguities.
Although, as Mach observed in 1866, "many curved surfaces may correspond to one light
surface even if they are illuminated in the same manner" [1], these global ambiguities are
often finite [18]. In Sect. 4, we propose a relaxation labelling process to disambiguate
the local surface geometry.

4 The relaxation labelling process

Recall that we are imposing three constraints on our surfaces: locally constant albedo,
locally constant lighting conditions, and locally smooth geometry. For the relaxation
labelling process, these translate into the following:

- A scenel j is compatible with the scenel i if they have the same constant ρ^λ and
the same virtual illuminance direction L, and if scenel j's surface descriptors fall on
scenel i's extrapolated surface at the corresponding relative position.
- A scenel j is incompatible with the scenel i if they have the same constant ρ^λ and the
same virtual illuminance direction, and if there exists another scenel j', neighbouring
scenel j along the fibre, that better fits the extrapolated surface from scenel i than
scenel j. Observe that this incompatibility serves to localize information along each
fibre.
- Otherwise a scenel j is unrelated to the scenel i.

Using these guiding principles, we assign a value to the compatibility r_ij between two
scenels i and j. This compatibility will be positive for compatible hypotheses, negative
for incompatible hypotheses, and zero otherwise. In general, r_ij is assumed to vary
smoothly between nearby points in the parameter space I. The process is
illustrated in Fig. 5.

Fig. 5. Illustration of the compatibility relationship for scenel consistency. Two scenels are
shown on the fibre at image location (x', y'), and are evaluated against the scenel (i) at (x, y).
The surface represented in scenel_{x,y} is modeled by the osculating paraboloid, and extended to
(x', y'). It is now clear that one scenel (j') at (x', y') is consistent, because its surface patch
lies on this paraboloid and light source and albedo agree. The other scenel (j) is inconsistent,
because its surface does not match the extended paraboloid. Such osculating paraboloids are
used to simulate the parallel transport of scenel_{x',y'} onto scenel_{x,y}.

Consider the unique paraboloid S_i(u, v) such that the neighbourhood of S_i(0, 0) is
described by the surface parameters of scenel i. The compatibility of a scenel j with i (r_ij)
is then defined in terms of the relationship between scenel j and the point S_i(u, v) ∈ S_i
which is "closest to" scenel j. This operation is equivalent to the minimization of the
distance measure

(x_j - x*(u, v))² + (y_j - y*(u, v))² + ( cos⁻¹(N_j · N*(u, v)) / Δ_N )²

        + ( cos⁻¹(K_j · K*(u, v)) / Δ_K )² + ( (e_j - e*(u, v)) / Δ_e )²

over (u, v). Here Δ_N, Δ_K, Δ_e are the distances between neighbouring scenels for the
given attribute and (x*, y*, N*, K*, e*) are the surface descriptors for the point S_i(u, v).

So by physical consideration, the compatibility between scenel i and scenel j can
be expressed as

r_ij = δ_{ρ^λ_j, ρ^λ_i} · δ_{L_j, L_i} · G_σ( ((x_j - x*)² + (y_j - y*)²)^{1/2} )
       · G″_{σ_N}( cos⁻¹(N_j · N*) ) · G″_{σ_K}( cos⁻¹(K_j · K*) ) · G″_{σ_e}( e_j - e* )

where G_σ(x) is the Gaussian and G″_σ(x) is its second derivative.
Notice that these values depend only on the relationship between scenel i and scenel
j, which is fixed and constant throughout the computation. Therefore, these compati-
bilities can be calculated once and then stored either in a lookup table or as the weights
in some sort of network.
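A minimal sketch of the corresponding relaxation step is shown below (Python with NumPy), assuming the compatibilities have already been tabulated per neighbour offset and label pair. The particular update rule (a gradient step followed by projection back onto the simplex), the step size, the toroidal border handling and the array layout are illustrative assumptions in the spirit of [17] rather than the exact scheme used.

import numpy as np

def relaxation_step(p, compat, step=0.1):
    # p      : array (H, W, K) of confidences p_i for K scenel labels at every image position.
    # compat : dict mapping a neighbour offset (dx, dy) to a (K, K) table of precomputed
    #          compatibilities r_ij between the labels of the two positions.
    H, W, K = p.shape
    support = np.zeros_like(p)
    for (dx, dy), R in compat.items():
        # Confidences of the neighbour at (x + dx, y + dy); toroidal border handling.
        shifted = np.roll(np.roll(p, -dx, axis=1), -dy, axis=0)
        support += shifted @ R.T                       # s_a = sum_b r_ab * p_b
    p = np.clip(p + step * support, 0.0, None)         # move along the support, keep p >= 0
    return p / (p.sum(axis=2, keepdims=True) + 1e-12)  # renormalize over labels

Because the compatibility tables are fixed, they can indeed be precomputed once and reused at every iteration, exactly as the lookup-table or network-weight implementation suggested above.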

5 A Case Study of Discontinuities

We have described a framework for computing sections of the scenel fibre bundle that
are constrained to have a constant virtual light source, constant albedo, and smooth
surface geometry. Any discontinuity in these attributes will demarcate a section boundary.
Discontinuities can arise in many ways: the albedo can change suddenly along a smooth
surface; a shadow can be cast across a smooth surface; the surface normal can change
abruptly along a contour. In all three of these cases there will be a discontinuity in the
image intensity. In this section, we address the question of how scenel discontinuities are
manifested as discontinuities in the image. We restrict our attention to two types of image
discontinuities: intensity discontinuities (edges) and shading flow discontinuities.

5.1 Illumination Discontinuities (Shadows)
Shadows are produced by variations in the virtual point source along a single surface
patch (recall Sect. 3.1 for the definition of ΔL). These variations are the direct result of
a discontinuity in some V_i(x, y). We say that there is a shadow boundary along this
discontinuity. The image intensity cannot satisfy our model (1) in the neighbourhood of
a shadow boundary since the image intensity on each side of the shadow boundary is
due to a different virtual point source. We now briefly examine the two types of shadow
boundary.
In general, an attached shadow boundary lies between two nearby pixels (x₀, y₀) and
(x₁, y₁) when for some point source, L(j), it is the case that N(x₀, y₀) · L(j) > 0 and
N(x₁, y₁) · L(j) < 0.
The image intensity is continuous across the attached shadow boundary even though
V_j is discontinuous since N · L(j) = 0 at the boundary. However, the shading flow is
typically not continuous at the attached shadow boundary. The shading flow due to the
source L(j) will be parallel to the shadow boundary on the side where V_j = 1, but it will
be zero on the side where V_j = 0. The shading flow due to the rest of the light sources
will be smooth across the boundary but this shading flow will not typically be parallel
to the boundary. Hence the sum of the two shading flows will typically be discontinuous
at the boundary.
A cast shadow boundary is produced between two nearby points (x₀, y₀) and (x₁, y₁)
when for some point source, L(j), it is the case that both N(x₀, y₀) · L(j) > 0 and
N(x₁, y₁) · L(j) > 0, and either V_j(x₀, y₀) = 0 or V_j(x₁, y₁) = 0 (but not both).
Examining (2), we note that for cast shadows, the discontinuity in V_j results in a
discontinuity in image intensity, since N · L(j) > 0. Furthermore, there is typically a

Fig. 6. An illustration of shadow boundaries and how they interact with flow fields. In (a) a
shadow is cast across the hood of a car. In (b) we show a subimage of the cast shadow on the
left fender, and in (c) we show the shading flow field (represented as a direction field with no
arrowheads) and the intensity edges (represented as short arrows; the dark side of the edge is
to the left of the arrow). Observe how the shading field remains continuous (in fact, virtually
constant) across the intensity edge. This holds because the surface is cylindrical, and the shading
flow field is parallel to the axis of the cylinder.

discontinuity in the shading flow since the virtual light source defined on the side of the
boundary where Vj = 0 will usually produce a different shading flow across the boundary
than is produced by L(j) on the side of the boundary where Vj = 1.
In the special case of parabolic (e.g. cylindrical) surfaces, the shading flow remains
continuous across both cast and attached shadow boundaries because the flow is parallel
to the axis of the cylinder. Note however that the attached shadow is necessarily parallel
to the shading flow field. This case is illustrated in Fig. 6.
To summarize, the image intensity in the neighbourhood of a pixel (x₀, y₀) can be
modelled using a single virtual point source as long as there are neither attached nor cast
shadow boundaries in that neighbourhood. Attached shadow boundaries produce con-
tinuous image intensities, but discontinuities in the shading flow. Cast shadows produce
both intensity discontinuities and shading flow discontinuities.

5.2 Geometric Discontinuities
There are two different ways that the geometry of the scene can produce discontinuities
in the image. There can be a discontinuity in N along a continuous surface, or there can
be a discontinuity in the surface itself when one surface occludes another. In the latter
case, even if there is no discontinuity in the virtual light source direction there will still
typically be a discontinuity in N which will usually result in both discontinuities in the
image intensity and in the shading flow.

5.3 Material Discontinuities

If there is a discontinuity in the albedo along a smooth surface, then there will be a
discontinuity in luminance across this material boundary. However, the shading flow will
not vary across the boundary in the sense that the magnitude of the luminance gradient
will change but the direction will not.

5.4 Summary of Discontinuities
In summary, shading flow discontinuities which are not accompanied by intensity discon-
tinuities usually indicate attached shadows on a smooth surface. Intensity discontinuities
which are not accompanied by shading flow discontinuities usually indicate material
changes on a smooth surface. The presence of both types of image discontinuities indi-
cates that either there is a cast shadow on a smooth surface, or that there is a geometric
discontinuity.
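This summary can be phrased as a small decision rule; the sketch below (Python) is only a convenient restatement of the four cases, with the returned strings chosen for illustration rather than taken from the original text.

def classify_discontinuity(intensity_edge, flow_edge):
    # The two booleans say whether an intensity discontinuity and a shading-flow
    # discontinuity were detected at a given image location.
    if flow_edge and not intensity_edge:
        return "attached shadow on a smooth surface"
    if intensity_edge and not flow_edge:
        return "material (albedo) change on a smooth surface"
    if intensity_edge and flow_edge:
        return "cast shadow or geometric discontinuity"
    return "no discontinuity"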

6 Conclusions

We have proposed a new solution to the shape from shading problem based on notions
from modern differential geometry. It differs from the classical approach in that light
source and surface material consistency are solved for concurrently with shape proper-
ties, rather than independently. This has important implications for understanding light
source and surface interactions, e.g., shadows, both cast and attached, and an example
illustrating a cast shadow is included.
The approach is based on the notion of scenel, or unit scene element. This is defined
to abstract the local photometry of a scene configuration, in which a single (virtual) light
source illuminates a patch of surface. Since the image irradiance equation typically admits

many solutions, each patch of the image gives rise to a collection of scenels. These are
organized into a fibre space at that point, and the collection of scenel fibres is called the
scenel bundle. Algebraic and topological properties of the scenel bundle will be developed
in a subsequent paper.
The solution of the shape from shading problem thus reduces to finding sections
through the scenel bundle, and these sections are defined by material, light source, and
surface shape consistency relationships. The framework thus provides a unification of
these different aspects of photometry, and should be sufficiently powerful to indicate the
limitations of unification as well.

References

1. Ratliff, F.: Mach Bands: Quantitative Studies on Neural Networks in the Retina,
Holden-Day, San Francisco (1965)
2. Horn, B.K.P.: "Obtaining Shape from Shading Information," P.H. Winston. Ed. in The
Psychology of C o m p u t e r Vision, McGraw-Hill, New York (1975)
3. Ikeuchi, K. and Horn, B.K.P.: "Numerical Shape from Shading and Occluding Boundaries,"
Artificial Intelligence, 17 (1981) 141-184
4. Horn, B.K.P. and Brooks, M.J.: "The Variational Approach to Shape from Shading," Com-
put. Vision, Graph. Image Process., 33 (1986) 174-208
5. Nayar, S.K., Ikeuchi, K. and Kanade, T.: "Shape From Interreflections," International
Journal of Computer Vision, 6 (1991) 173-195
6. Pentland, A.: "Linear Shape From Shading," International Journal of Computer Vision, 4
(1990) 153-162
7. Husemoller, D.: Fibre Bundles, Springer, New York (1966)
8. Sander, P., and Zucker, S.W.: "Inferring Surface Trace and Differential Structure from 3-D
Images," IEEE Trans. Pattern Analysis and Machine Intelligence, 9 (1990) 833-854
9. Spivak, M.: A Comprehensive I n t r o d u c t i o n to Differential Geometry, Publish or
Perish, Berkeley (1979)
10. Pentland, A.: "Finding the Illuminant Direction," J. Opt. Soc. Amer, 72 (1982) 448-455
11. Koenderink, J.J.: "What Does the Occluding Contour Tell us about Solid Shape?," Percep-
tion, 13 (1984) 321-330
12. Koenderink, J.J. and van Doorn, A.J.: "The Shape of Smooth Objects and the Way Contours
End," Perception, 11 (1982) 129-137
13. Biederman, I.: "Recognition-by-Components: A Theory of Human Image Understanding,"
Psychological Review, 94 (1987) 115-147
14. Koenderink, J.J. and van Doorn, A.J.: "Photometric Invariants Related to Solid Shape,"
Optica Acta, 27 (1980) 981-996
15. Iverson, L. and Zucker, S.W.: "Logical/Linear Operators for Measuring Orientation and
Curvature," TR-CIM-90-06, McGill University, Montreal, Canada (1990)
16. Zucker,S.W., Dobbins, A., and Iverson, L.: "Two Stages of Curve Detection Suggest Two
Styles of Visual Computation," Neural Computation, 1 (1989) 68-81
17. Hummel, A.R. and Zucker, S.W.: "On the Foundations of Relaxation Labeling Processes,"
IEEE Trans. Pattern Anal. Machine Intell., 5 (1983) 267-287
18. Oliensis, J.: "Shape from Shading as a Partially Well-Constrained Problem," Comp. Vis.
Graph. Im. Proc., 54 (1991) 163-183
19. Hopfield, J.J.: "Neurons with Graded Response Have Collective Computational Properties
like Those of Two-State Neurons," Proc. Natl. Acad. Sci. USA, 81 (1984) 3088-3092
20. Miller, D.A. and Zucker, S.W.: "Efficient Simplex-like Methods for Equilibria of Nonsym-
metric Analog Networks," TR-CIM-91-3, McGill University, Montreal, Canada (1991);
Neural Computation, (in press)
21. Koenderink, J.J.: Solid Shape, MIT Press, Cambridge, Mass. (1990) p. 320
Texture: Plus ça change, ...*

Margaret M. Fleck
Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA

Abstract. This paper presents an edge finder for textured images. Using


rough constraints on the size of image regions, it estimates the local amount
of variation in image values. These estimates are constructed so that they
do not rise at boundaries. This enables subsequent smoothing and edge de-
tection to find coarse-scale boundaries to the full available resolution, while
ignoring changes within uniformly textured regions. This method extends
easily to vector valued images, e.g. 3-color images or texture features. Signif-
icant groups of outlier values are also identified, enabling the edge finder to
detect cracks separating regions as well as certain changes in texture phase.

1 Introduction
The input to an edge finding algorithm consists of a 2D array of values for one or more
properties, e.g. raw intensities, color, texture features (e.g. striping orientation), or stereo
disparities. Its goal is to model these property values as a set of underlying property
values, plus a pattern of fast variation in these values (e.g. camera noise, fine texture)
(Fig. 1). The underlying property values are reconstructed as varying "smoothly," i.e.
obeying bounds on their higher derivatives, except at a sparse set of locations. These
locations are the boundaries in the input.

Fig. 1. A sequence of input property values (left) is modelled as a sequence of underlying values
plus a pattern of fine variation (right).

Currently-available edge finders work robustly only when the fast variation has a
known distribution that is constant across the image. This assumption is roughly correct
for "blocks world" type images, in which all fast variation is due to camera noise, but it
* The research described in this paper was done at the Department of Engineering Science,
Oxford University. The author was supported by a Junior Research Fellowship funded by
British Petroleum.

fails when there is non-trivial surface texture within each region, because the amount of
fast variation depends on the particular surface being viewed. The amount of variation in
texture feature values (i.e. the amount of mismatch between the texture and the feature
model) also varies from surface to surface, as does the amount of error in stereo disparity
estimates.
There have been many previous attempts to extend edge finders to these more general
conditions, but none can produce output of the quality needed by later processing on the
wide range of inputs encountered in typical vision applications. Some make implausible
assumptions about their inputs: [2] assumes that each image contains only two textures,
[8] and [9] require that the number of textures in the image be known and small, [13]
and [15] provide their algorithms with training samples for all images present. Others
produce poor quality or blurred boundaries [2, 20, 31, 33] or seem difficult to extend to
2D [16, 17].
This paper presents a new algorithm which estimates the scale of variation, i.e. the
amplitude of the fine variation, within image regions. It depends on two key ideas:

1. Minimize the scale estimate over all neighborhoods of the target location, to prevent
corruption of scale estimates near boundaries, and
2. Use a robust estimate for each neighborhood, to prevent scale estimates from being
corrupted by outliers or boundary blur.

Given reliable scale estimates, underlying values can be reconstructed using a standard
iterative edge-preserving smoother. Boundaries are trivial to extract from its output. The
method extends easily to multi-dimensional inputs, such as color images (Fig. 2) and sets
of texture features (Fig. 3), producing good-quality preliminary output. 2 The iterative
smoother is also used to detect outliers, values which differ from those in all nearby
regions. Previous texture boundary finders have looked only for differences in average
value between adjacent regions. The phase change and contrast reversal images in Fig. 4
have traditionally proved difficult to segment [2, 20, 19] because the two regions have the
same average values for most proposed texture features. However, as Fig. 4 illustrates,
these boundaries show up clearly as lines of outliers.

2 Estimating the scale of variation

The basic ideas behind the scale estimator are best presented in 1D. Consider estimating
the scale of variation for the slice shown in Fig. 1. Let Nw(x) be the neighborhood of
width 2w + 1 pixels centered about the location x. The most obvious estimate for the scale
at x is the standard deviation of the (2w + 1) values in Nw(x). The spatial scale of the
edge finder output is then determined by the choice of w: output boundaries can be no
closer than about 2w + 1. Unfortunately, if x lies near a boundary, Nw (x) will contain
values from both sides of the boundary, so the standard deviation computed for Nw(x)
will be far higher than the true scale of the fine variation. This will cause later processing
(iterative smoothing and boundary detection) to conclude that there is no significant
boundary near x.
Therefore, the scale estimate at x should be computed from some neighborhood Nw (y)
containing x that does not cross a boundary. Such a neighborhood must exist because, by
definition, output boundaries are not spaced closer than about 2w+1. Neighborhoods that
do not cross boundaries generate much lower scale estimates than neighborhoods which
2 See appendix for details of texture features.

Fig. 2. A 300 by 300 color image and boundaries extracted from it (w = 8). Log intensity is at
the top left, red vs. green at the top right, and blue vs. yellow at the bottom left.

cross boundaries. Therefore, we can obtain a scale estimate from a neighborhood entirely
within one region by taking the minimum scale estimate from all neighborhoods Nw(y)
which contain x (where w is held fixed and y is varied).3 Several authors [16, 17, 31]
use this minimization idea, but embedded in complex statistical tests. Choosing the
average value from the neighborhood with minimum scale [13, 24, 30] is not equivalent:
the minimum scale is well-defined but the neighborhood with minimum scale is not.
Even the best neighborhood containing x may, however, be corrupted: it may overlap
the blur region of the boundary or it may contain extreme outlier values (e.g. spots,
highlights, stereo mismatches, see Fig. 1). Since these outliers can significantly inflate
the scale estimates, the standard deviation should be replaced by a method from robust
statistics [11, 10, 12, 26] which can ignore small numbers of outliers. Simple robust filters
(e.g. the median) have been used extensively in computer vision and more sophisticated
methods have recently been introduced [14, 25, 27]. Because I expect only a small number
of outliers per neighborhood, the new scale estimator uses a simple α-trimmed standard
deviation: remove the 3 lowest and 3 highest values and then compute the standard
deviation. The combination of this estimator with choosing the minimum estimate over
all neighborhoods seems to work well and is, I believe, entirely new.

3 This also biases the estimates downwards: calculating the amount of bias is a topic of on-going
research.
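A minimal 1D sketch of this scale estimator is given below (Python with NumPy); the restriction to neighborhoods that lie entirely inside the signal, and the behaviour on signals shorter than 2w + 1, are assumptions of the sketch.

import numpy as np

def trimmed_std(values, n_trim=3):
    # alpha-trimmed standard deviation: drop the n_trim lowest and highest values.
    v = np.sort(np.asarray(values, dtype=float))
    return v[n_trim:len(v) - n_trim].std()

def scale_estimate(signal, x, w):
    # Minimum alpha-trimmed standard deviation over all neighbourhoods Nw(y)
    # of half-width w that contain x and lie entirely inside the signal.
    n = len(signal)
    best = np.inf
    for y in range(max(x - w, w), min(x + w, n - w - 1) + 1):
        best = min(best, trimmed_std(signal[y - w:y + w + 1]))
    return best

Any neighborhood that straddles a boundary produces a large trimmed deviation, so the minimum is effectively taken over the neighborhoods that stay within one region.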

Fig. 3. Boundaries from texture features: a natural textured image (256 by 256, w = 8), a pair
of textures from Brodatz's volume [3] normalized to the same mean log intensity (200 by 200,
w = 12), and a synthetic test image containing sine waves and step edges (250 by 150, w = 8).

Fig. 4. The thin bar generates outliers in intensities. The change in phase and the contrast
reversal generate outliers in various texture features. White and black occupy equal percentages
of the contrast-reversal image. The images are 200 by 100 and were analyzed with w = 8.

There are robust estimators which can tolerate neighborhoods containing up to 50%
outlier values [10, 11]. However, despite some recent suggestions [22, 29], it is not pos-
sible to eliminate the minimization step by using such an estimator. The neighborhood
centered about a location very close to a boundary typically has more than 50% "bad"
values: values from the wrong region, values from the blur area, and random wild outliers.
This effect becomes worse in 2D: the neighborhood of a point inside a sharp corner can
contain over 75% "bad" values. Furthermore, real patterns of variation have bimodal or
even binary distributions (e.g. a sine wave of period 4 can digitize as binary). Robust es-
timators tolerating high percentages of outliers are all based on medians, which perform
very poorly on such distributions [1, 32].

3 Extending the scale estimator to 2D

I am currently exploring three possible ways of extending this scale estimator to 2D. In
2D, it is not practical to enumerate all neighborhoods containing the target location x, so
the estimator must consider only a selection. Which neighborhoods are considered deter-
mines which region shapes the edge detector can represent accurately. At each location
x, the current implementation computes 1D estimates along lines passing through x in 8
directions. The estimate at x is then the median of these 8 estimates. Although its results
(see Fig 2-4) are promising, it cannot match human ability to segment narrow regions
containing coarse-ish texture. This suggests it is not making full use of the information
contained in the locations near x. Furthermore, it rounds corners sharper than 90 degrees
and makes some mistakes inside other corners.
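The current 2D implementation just described can be sketched as follows (Python with NumPy); the particular set of eight sampling directions and the border handling are assumptions of the sketch.

import numpy as np

# Offsets for the eight line directions used here (an assumed direction set).
DIRECTIONS = [(1, 0), (0, 1), (1, 1), (1, -1), (2, 1), (1, 2), (2, -1), (1, -2)]

def scale_estimate_2d(image, x, y, w, n_trim=3):
    # Median over eight directions of the 1D minimum trimmed-deviation estimate
    # computed along the line through (x, y).
    H, W = image.shape
    estimates = []
    for dx, dy in DIRECTIONS:
        ts = np.arange(-2 * w, 2 * w + 1)
        xs, ys = x + ts * dx, y + ts * dy
        ok = (xs >= 0) & (xs < W) & (ys >= 0) & (ys < H)
        line = image[ys[ok], xs[ok]].astype(float)
        centre = int(np.where(ts[ok] == 0)[0][0])
        best = np.inf
        for c in range(max(centre - w, w), min(centre + w, len(line) - w - 1) + 1):
            window = np.sort(line[c - w:c + w + 1])
            best = min(best, window[n_trim:-n_trim].std())
        if np.isfinite(best):
            estimates.append(best)
    return float(np.median(estimates))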
Another option would be to compute scale estimates for a large range of neighborhood
shapes, e.g. the pie-wedge neighborhoods proposed in [17]. Such an algorithm would be
reliable but very slow, unless tricks can be found to speed up computation. Finally, one
might compute scale only for a small number of neighborhoods, e.g. the round neigh-
borhood centered about each location x, and then propagate good scale estimates to
nearby locations in the spirit of [18]. The difficulty here is to avoid growing very jagged
neighborhoods and, thus, hypothesizing jagged region boundaries.

4 The edge detection algorithm

Boundaries and outliers are detected using a modification of iterative edge-preserving


smoothing [7, 21, 28]. Edge-preserving smoothing differs from standard Gaussian smooth-
ing in that it is gradually inhibited as nearby values become sufficiently different from
one another. The current implementation prohibits interactions entirely, i.e. becomes
committed to placing a boundary between two adjacent values, if they differ by more
than 6S, where S is the scale estimate from Sect. 2-3. If the distributions of values were
Gaussian and S the standard deviation, 6S would produce an expected rate of about
five false positives per 512 by 512 image. This threshold may need to be adjusted, be-
cause the actual scale estimates are biased downwards and the shape of actual empirical
distributions has not yet been measured.
Specifically, to start each iteration, the algorithm first estimates the scale of variation
as in Sect. 2-3. A minimum scale (currently 1 intensity unit) is imposed to suppress very
low amplitude boundaries and effects of quantization. The value at each location is then
replaced by a weighted average of values at locations in a ±3 by ±3 cell neighborhood.4
Suppose that the value and scale at the center location are V and S. The weight for a
value v_i is then

w_i = 1                          if |v_i - V| ≤ 3S;
w_i = 2 (1 - |v_i - V| / (6S))   if 3S ≤ |v_i - V| < 6S;
w_i = 0                          otherwise.
This is a one-step W-estimate, a convenient way of approximating an M-estimate (the
multi-step versions are asymptotically equivalent) [10].5 A wide variety of weighting
4 This is the smallest smoothing neighborhood that still allows smoothing to "jump over" a
thin (one cell wide) streak of outliers.
5 Note that in this method of repeatedly applying the estimator and smoothing, the scale
estimates converge to zero, because information is diffused across the image. A traditional
multi-step estimator [10, 11, 14] is very different: scale estimates converge to a non-zero value.

functions are possible (e.g. see [10, 11]). This one has a shape similar to the better
behaved ones (e.g. a smooth cutoff) but is easy to compute.
In order to eventually identify outliers, the smoother also computes a second field
which measures how much each value resembles the neighboring ones. Initially, these
strengths are set to a constant value (currently 200). In each iteration, the strength at
location x is replaced by a weighted sum of the strengths in a ±3 by ±3 neighborhood of x:

    Σ_i s_i w_i / 49

where s_i is the input strength at location i in the neighborhood and the w_i are the same
weights used in smoothing.
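One iteration of this smoother can be sketched as below (Python with NumPy), using the weighting function given above; the border handling (edge replication), the normalization of the value average and the exact constants are assumptions of the sketch.

import numpy as np

def smooth_iteration(values, scales, strengths):
    # One pass of the edge-preserving smoother: each value is replaced by a weighted
    # average over its +/-3 cell neighbourhood and each strength by the weighted sum
    # of neighbouring strengths divided by 49.  Borders are handled by edge replication.
    padded_v = np.pad(values.astype(float), 3, mode='edge')
    padded_s = np.pad(strengths.astype(float), 3, mode='edge')
    new_v = np.empty(values.shape, dtype=float)
    new_s = np.empty(values.shape, dtype=float)
    H, W = values.shape
    for i in range(H):
        for j in range(W):
            V, S = float(values[i, j]), max(float(scales[i, j]), 1.0)   # minimum scale of 1
            nb_v = padded_v[i:i + 7, j:j + 7]
            nb_s = padded_s[i:i + 7, j:j + 7]
            d = np.abs(nb_v - V)
            wgt = np.where(d <= 3 * S, 1.0,
                           np.where(d < 6 * S, 2.0 * (1.0 - d / (6 * S)), 0.0))
            new_v[i, j] = (wgt * nb_v).sum() / (wgt.sum() + 1e-12)
            new_s[i, j] = (wgt * nb_s).sum() / 49.0
    return new_v, new_s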
In the current implementation, smoothing is repeated 3 times.6 Boundaries7 are then
detected by locating sharp changes and outlier values. First, scale is re-estimated at all
locations. Two values are then considered significantly different if they differ by more
than 6 times the smaller of their associated scales. If the values at two adjacent cells are
significantly different, a boundary is marked between the cells. If there is a significant
difference between two opposite neighbors of some cell A, but not between A and either
one of them, the whole cell A is marked as part of the boundaries. Any cell with strength
less than 20 is marked as an outlier. This outlier map is then pruned to remove all outliers
that are not either (a) in the middle of a sharp change or (b) in a band of outliers at
least 2 cells wide. Any cell still marked as an outlier after pruning is marked as part of
the boundaries.

5 Extension to vector-valued images

The new edge finder extends easily to vector-valued images, e.g. color images or sets of
texture features. I assume that the pattern of variation in the vectors can be accurately
represented by the scale of variation in each individual dimension. This assumption seems
plausible for most computer vision applications and removing it seems to be very difficult
(cf. [26]). The current implementation uses an L∞ metric (maximum distance in any
component), because it simplifies coding.
Specifically, in the vector algorithm, scale is estimated separately for each feature.8 In
each smoothing iteration, the weighted average is computed separately for each feature,
but using a set of weights common to all features. Specifically, the weights are first
computed for each feature map individually and the minimum of these values is used
as the common weight. The common weights are also used to compute a single strength
map, so only one common set of outliers is detected. Sharp changes, however, are detected
in each map individually and AND-ed into the common boundary map.
It is essential to use a common set of weights for outlier detection if boundaries in
different features may not be exactly aligned. Suppose that a change from A to A' in
one map occurs, rapidly followed by a change from B to B' in another. Cells in the two
regions have value AB or A'B', but cells in a tiny strip between the regions have value
6 This seems empirically to be sufficient, but more detailed theoretical and practical study of
convergence is needed.
7 See the theoretical model in [4, 5, 6]. The boundaries are a closed set of vertices, edges between
cells, and entire cells. An arbitrary set of boundary markings can be made closed by ensuring
that all edges of a boundary cell are in the boundaries and all vertices of a boundary edge are
in the boundaries.
8 Note that the minimum scale for different features can be different.

A'B. These cells will not appear to be outliers in either map individually, but they stand
out clearly when the maps are considered jointly.
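The common-weight computation for a single vector-valued cell can be sketched as follows (Python with NumPy); the 7×7 neighbourhood layout and array shapes are assumptions of the sketch.

import numpy as np

def common_weights(neigh_values, centre_values, scales):
    # neigh_values: (7, 7, F) array of neighbouring vector values; centre_values and
    # scales: (F,) arrays.  Per-feature weights are computed as in the scalar smoother
    # and the minimum over features is taken, so a cell that is an outlier in any single
    # feature map (such as the strip of A'B cells above) receives a low common weight.
    d = np.abs(neigh_values - centre_values)
    per_feature = np.where(d <= 3 * scales, 1.0,
                           np.where(d < 6 * scales, 2.0 * (1.0 - d / (6 * scales)), 0.0))
    return per_feature.min(axis=2)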
Interesting issues arise when one feature is available at higher resolution than another.
For example, people see intensities at much higher resolution than hue or saturation of
color. As it stands, the algorithm may not localize a boundary to the full resolution
available from the highest resolution feature, but may report that other locations near
the boundary also have outlier values (i.e. due to the blurring in the low-resolution
features). In a sense, this is correct, because values at these locations are genuinely
inaccurate and should not be averaged into estimates of region properties. However, it
might be useful to add further algorithms that would refine boundary locations using
the more reliable feature values, making appropriate corrections to the corrupted values
from other features as cells are removed from the boundaries.

6 Conclusions

This paper has presented a new method for estimating the scale of variation in values
within image regions. These estimates were used to extract boundaries at high resolution
from both color images and textured images. Compared to previous edge finders for tex-
tured data, these results are very promising. In particular, because it can detect outliers
as well as step changes in values, the new algorithm can segment a new class of examples
(as in Fig 4). This poses an interesting problem for studies of human preattentive texture
discrimination: if the human segmentation algorithm also detects outliers then being able
to preattentively segment a pair of textures does not automatically imply that some fea-
ture assigns different values to them. This implies that traditional texture discrimination
experiments may require additional controls, but it also opens up possibilities for new
types of experiments that would examine which sorts of texture mis-matches do, and do
not, generate visible outliers.
My ultimate goal is to bring the new edge finder's performance up to the standards
of conventional edge finders. By this yardstick its performance is far from perfect and
barely approaching the point where it would be suitable for later applications, such as
shape analysis and object recognition. It is quite slow. Many of the algorithm details,
particularly parameter settings and the 2D extension of the scale estimator, need further
tuning. Many theoretical issues (e.g. bias in the scale estimator, convergence) still need
to be examined. I believe that there is much scope for further work in this area.

Acknowledgements

Mike Brady and Max Mintz supplied useful comments and/or pointers.

Appendix: Details of Texture Features

The features used for the texture examples are a new set currently under development.
The new features were chosen because they have small support and reasonable noise
resistance, and they return a constant output field on their ideal input patterns. The
closely related features proposed in [2, 20, 23] have either much larger support or large
fluctuations in value even on ideal input patterns. Comparative testing of texture features
is well beyond the scope of this paper and I make no claims that these features are the
best available.

This method models the texture as a sine wave and the five features measure its mean
intensity, orientation, frequency, and amplitude. The first feature is log intensity L:

L = 179 log₁₀(I + 10) - 179


where I is the original input intensity. The log intensities are smoothed with the edge
finder's iterative smoother, blurred with a σ = 0.5 Gaussian, and then subtracted from
the original log intensities to yield a difference image D containing texture but no inten-
sity boundaries. The image D is then smoothed with a Gaussian of standard deviation
1 cell and the first four finite differences D¹, D², D³, and D⁴ are taken in each of four
directions.
In each direction θ, compute:

E_θ = √( D²D² - D¹D³ ) ,    F_θ = √( D³D³ - D²D⁴ )

When D contains a perfect sine wave with amplitude A and frequency ω, E_θ = Aω² and
F_θ = Aω³. The four texture features are then:

E = E₀ + E₄₅ + E₉₀ + E₁₃₅ ,    F = F₀ + F₄₅ + F₉₀ + F₁₃₅

X = E₀ - E₉₀ ,    Y = E₄₅ - E₁₃₅
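A sketch of these feature computations is given below (Python with NumPy and SciPy); the edge finder's iterative smoother is replaced here by a plain Gaussian blur, which, together with the chosen blur width and the wraparound border handling, is an assumption of the sketch.

import numpy as np
from scipy.ndimage import gaussian_filter

def texture_features(I):
    # Log intensity, texture-only image D, and the energy features E, F, X, Y.
    L = 179.0 * np.log10(I + 10.0) - 179.0
    base = gaussian_filter(gaussian_filter(L, 2.0), 0.5)   # smoothed log intensity, then 0.5 blur
    D = gaussian_filter(L - base, 1.0)                     # texture with intensity boundaries removed

    def diff(im, dx, dy, order):
        # order-th finite difference of im along the offset (dx, dy), wraparound borders.
        out = im
        for _ in range(order):
            out = np.roll(out, (-dy, -dx), axis=(0, 1)) - out
        return out

    offsets = {0: (1, 0), 45: (1, 1), 90: (0, 1), 135: (-1, 1)}
    E, F = {}, {}
    for ang, (dx, dy) in offsets.items():
        D1, D2 = diff(D, dx, dy, 1), diff(D, dx, dy, 2)
        D3, D4 = diff(D, dx, dy, 3), diff(D, dx, dy, 4)
        E[ang] = np.sqrt(np.maximum(D2 * D2 - D1 * D3, 0.0))
        F[ang] = np.sqrt(np.maximum(D3 * D3 - D2 * D4, 0.0))
    return {'L': L,
            'E': E[0] + E[45] + E[90] + E[135],
            'F': F[0] + F[45] + F[90] + F[135],
            'X': E[0] - E[90],
            'Y': E[45] - E[135]}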

References

1. Jaakko Astola, Pekka Heinonen, and Yrjö Neuvo (1987) "On Root Structures of Median and
Median-Type Filters," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-35/8, pp. 1199-
1201.
2. Alan C. Bovik, Marianna Clark, and Wilson S. Geisler (1990) "Multichannel Texture Anal-
ysis using Localized Spatial Filtering," IEEE Trans. Patt. Analy. and Mach. Intell. 12/1,
pp. 55-73.
3. P. Brodatz (1966) Textures, Dover.
4. Margaret M. Fleck (1988) "Boundaries and Topological Algorithms," Ph.D. thesis, MIT,
Dept. of Elec. Eng. and Comp. Sci., available as MIT Artif. Intell. Lab TR-1065.
5. Margaret M. Fleck (1991) "A Topological Stereo Matcher," Inter. Jour. Comp. Vis. 6/3,
pp. 197-226.
6. Margaret M. Fleck (1990) "Some Defects in Finite-Difference Edge Finders," OUEL Report
No. 1826/90, Oxford Univ., Dept. of Eng. Science, to appear in IEEE Trans. Patt. Analy.
Mach. InteU.
7. Davi Geiger and Federico Girosi (1991) "Parallel and Deterministic Algorithms from
MRF's: Surface Reconstruction," IEEE Trans. Patt. Analy. and Mach. Intell. 13/5, pp.
401-412.
8. S. Geman and D. Geman (1984) "Stochastic Relaxation, Gibbs distributions, and the
Bayesian Restoration of Images," IEEE Trans. Patt. Analy. and Mach. Intell. 6/6, pp.
721-741.
9. Donald Geman, Stuart Geman, Christine Graffigne, and Ping Dong (1990) "Boundary
Detection by Constrained Optimization," IEEE Trans. Patt. Analy. and Mach. Intell. 12/7,
pp. 609-628.
10. Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel (1986)
Robust Statistics: The Approach Based on Influence Functions, John Wiley, New York.
11. David C. Hoaglin, Frederick Mosteller, and John W. Tukey, eds.(1983) Understanding Ro-
bust and Exploratory Data Analysis, John Wiley, New York.
12. Peter J. Huber (1981) Robust Statistics, Wiley, NY.

13. John Y. Hsiao and Alexander A. Sawchuk (1989) "Supervised Textured Image Segmentation
using Feature Smoothing and Probabilistic Relaxation Techniques," IEEE Trans. Patt.
Analy. and Mach. Intell. 11/12, pp. 1279-1292.
14. Rangasami L. Kashyap and Kie-Bum Eom (1988) "Robust Image Modelling Techniques
with an Image Restoration Application," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-
36/8, pp. 1313-1325.
15. Kenneth I. Laws (1979) "Texture Energy Measures," Proc. DARPA Im. Underst. Work.
1979, pp. 47-51.
16. Yvan Leclerc and Steven W. Zucker (1987) "The Local Structure of Image Discontinuities
in One Dimension," IEEE Trans. Patt. Analy. and Mach. Intell. 9/3, pp. 341-355.
17. Yvan Leclerc (1985) "Capturing the Local Structure of Image Discontinuities in Two Di-
mensions," Proc. IEEE Conf. on Comp. Vis. and Patt. Recogn. 1985, pp. 34-38.
18. Aleš Leonardis, Alok Gupta, and Ruzena Bajcsy (1990) "Segmentation as the Search of the
Best Description of an Image in Terms of Primitives," Proc. Inter. Conf. on Comp. Vis.
1990, pp. 121-125.
19. Joseph T. Maleson, Christopher M. Brown, and Jerome A. Feldman, "Understanding Nat-
ural Texture," Proc. DARPA Im. Underst. Work. 1977, Palo Alto, CA, 19-27.
20. Jitendra Malik and Pietro Perona (1990) "Preattentive Texture Discrimination with Early
Vision Mechanisms," Journ. Opt. Soc. Amer. A 7/5, pp. 923-932.
21. Jitendra Malik and Pietro Perona (1990) "Scale-Space and Edge Detection using
Anisotropic Diffusion," IEEE Trans. Patt. Analy. and Mach. Intell. 12/7, pp. 629-639.
22. Peter Meer, Doron Mintz, Dong Yoon Kim, and Azriel Rosenfeld (1991) "Robust Regression
Methods for Computer Vision: A Review," Inter. Jour. Comp. Vis. 6/1, pp. 59-70.
23. M. Concetta Morrone and Robyn Owens (1987) "Feature Detection from Local Energy,"
Pattern Recognition Letters 6/5, pp. 303-313.
24. Makoto Nagao and Takashi Matsuyama, "Edge Preserving Smoothing," Comp. Graph. and
Im. Proc. 9/4 (1979) 394-407.
25. Carlos A. Pomalaza-Raez and Clare D. McGillem (1984) "An Adaptive, Nonlinear Edge-
Preserving Filter," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-32/3, pp. 571-576.
26. Peter J. Rousseeuw and Annick M. Leroy (1987) Robust Regression and Outlier Detection,
John Wiley, New York.
27. Schunck, Brian G. (1989) "Image Flow Segmentation and Estimation by Constraint Line
Clustering," IEEE Trans. Patt. Analy. and Mach. Intell. 11/10, pp. 1010-1027.
28. Philippe Saint-Marc, Jer-Sen Chen, and Gérard Medioni (1991) "Adaptive Smoothing:
A General Tool for Early Vision," IEEE Trans. Patt. Analy. and Mach. Intell. 13/6, pp.
514-529.
29. Sarvajit S. Sinha and Brian G. Schunck (1992) "A Two-Stage Algorithm for Discontinuity-
Preserving Surface Reconstruction," IEEE Trans. Patt. Analy. and Mach. Intell. 14/1, pp.
36-55.
30. Fumiaki Tomita and Saburo Tsuji (1977) "Extraction of Multiple Regions by Smoothing
in Selected Neighborhoods," IEEE Trans. Syst., Man, Cybern. 7/2, 107-109.
31. Richard Vistnes (1989) "Texture Models and Image Measures for Texture Discrimination,"
Inter. Jour. Comp. Vis. 3/4, pp. 313-336.
32. S. G. Tyan (1981) "Median Filtering: Deterministic Properties," in T.S. Huang, ed., Two-
Dimensional Digital Signal Processing II: Transforms and Median Filters, Springer-Verlag,
Berlin, pp. 197-217.
33. Harry Voorhees and Tomaso Poggio (1987) "Detecting Textons and Texture Boundaries in
Natural Images," Proc. Inter. Conf. on Comp. Vis. 1987, London, pp. 250-258.

This article was processed using the LaTeX macro package with ECCV92 style
TEXTURE PARAMETRIZATION METHOD
FOR IMAGE SEGMENTATION

A. Casals, J. Amat and A. Grau


Dept. Enginyeria de Sistemes, Automàtica i Informàtica Industrial
Universitat Politècnica de Catalunya
c/ Pau Gargallo n. 5, 08028 Barcelona (Spain)

Abstract. In this paper we define six parameters addressed to parametrize
the texture characteristics of an image towards its segmentation. With the
aim of operating at high speed, these parameters have been defined looking for
an acceptable compromise between discrimination capacity and the ease of
implementing a specific architecture for them.

1 Basic Textures Considered

Texture is a qualitative property of images and so its parametrization is a difficult
task. Approaches to quantifying image texture have been based, on one hand, on
structural analysis [Tsu], [Mat], which analyzes the regularity and uniformity of the
image intensity, while other methods are based on statistical analysis of the local
properties that occur repeatedly throughout the image [Har], [Bou]. Zucker [Zuc]
proposes a model based on these two approaches, while in [Shi] a study of the human
texture visual field is presented.
The goal of the work presented on texture parametrization is to obtain the most
discriminating vector, with a view to segmenting, from their observable texture, the
different parts or objects that appear in the scene, while looking for a compromise
between the attainable efficiency and the ease of obtaining, by means of high speed
specific hardware, the quantifiable values that define texture.
In order to attain a certain discrimination capability of the texture functions T,
complementary to other functions such as color, it is necessary to define a given
number of parameters τ_i quantified with sufficient resolution to reliably differentiate
the textures of an image. In a previous work, only a subset of the proposed parameters
has been used to segment images in some specific environments [Cas].
These parameters do not seek to be more discriminant than others described by
numerous authors, which in most cases require a higher level of data processing [Man],
[Buf], and which are used in applications that do not present computing time restrictions,
such as terrain classification, satellite imagery, etc. Our aim is to obtain these
description parameters at high speed for applications that need real time operation.
The vector of characteristics we use is based on the following parameters:
Blurriness. A region with this texture presents soft intensity changes in any direction.
Granularity. This parameter quantifies the presence of a high density of non
concatenated gradients in a given region.
Discontinuity. This parameter measures the number of line discontinuities.
Abruptness. Abruptness measures sudden changes of direction of the lines of a region.
Straightness. This parameter indicates the density of straight lines in a region.
Curviness. This parameter measures the density of non straight lines in a region.

The research described in this paper has been supported partially by SKIDS (ESPRIT - 1560)

Fig. 1 shows the affinities between different clearly defined textures and the six
texture parameters considered.
With these six parameters, which are not mutually exclusive, a texture vector
T[τ₁ ... τ₆] is defined, which quantifies the degree of affinity of every region of an image to
each one of these parameters.
These selected parameters are not essentially different from some of the classic
ones utilized, such as coarseness, directionality, regularity, etc. On the other hand, the
natural texture patterns defined by Brodatz [Bro] can also be defined from these parameters;
some of them can be described in terms of exclusively one parameter, while others need
up to three different ones.
The use of these parameters, as they are defined, does not provide information such as
regularity/irregularity or symmetry/asymmetry, which correspond to a more global
scope that surpasses the region evaluated and that has to be obtained at a higher
processing level.
Once the six texture parameters have been obtained for every pixel of the image,
taking into account its 8×8 pixel environment, a new lower resolution image is
generated. This new texture image, of 1/8 resolution with respect to the original one, is
oriented to contribute to image segmentation, mainly complementing the information
provided by the contours and solving the conflicts produced by the lack of continuity
of the contours.

Fig. 1. The six texture parameters defined



2 Quantification Algorithms

The defined parameters are quantified both by their modulus and their argument, which
corresponds to the main direction of the pattern considered, except for the parameters
blurriness and granularity, which are considered isotropic.
The six parameters τ_i are quantified by applying to each 8×8 pixel region, organized in 4
subregions of 4×4 pixels, the following algorithms:

Blurriness:     τ_B = K_B [ (A - D) + (C - B) ] / ( 1 + Σ P_ij )

where A, B, C and D are the average gray levels corresponding to the four subregions
which constitute a texel, and P_ij are the pixels belonging to a contour within this
region.

Granularity:    τ_G = K_G Σ Q_ij

being Q_ij the pixels contained in the 8×8 region which are detected as contours but in
chains of length L ≤ 2.

Discontinuity:  τ_D = K_D Σ T_ij

The terms T_ij are the end pixels of contours of length L > 2 in a texel.

Curviness:      τ_C = Σ ( Q_ij - L_ij )

Q_ij are the pixels concatenated in chains of length > 3 and L_ij are the aligned pixels.

Straightness:   τ_L = K_L L

The value of τ_L is restricted to be 0 < τ_L < 8, being L the number of aligned pixels within
a texel (5 < L < 8).

Abruptness:     τ_A = K_A α

where α is the angle of the smallest vertex contained in a texel.
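As an illustration, the blurriness and granularity of a single texel can be computed as below (Python with NumPy), assuming the contour map and chain lengths have been produced by the earlier pipeline stages; the quadrant layout of the subregions A, B, C, D and the constants K_B, K_G are assumptions of the sketch.

import numpy as np

def blurriness_granularity(texel_gray, texel_contours, chain_lengths, K_B=1.0, K_G=1.0):
    # texel_gray     : 8x8 array of gray levels.
    # texel_contours : 8x8 boolean array marking contour pixels from the previous stage.
    # chain_lengths  : 8x8 array giving, for each contour pixel, the length of its chain.
    # Average gray levels of the four 4x4 subregions (assumed quadrant layout A B / C D).
    A = texel_gray[:4, :4].mean()
    B = texel_gray[:4, 4:].mean()
    C = texel_gray[4:, :4].mean()
    D = texel_gray[4:, 4:].mean()
    n_contour = texel_contours.sum()                       # number of contour pixels P_ij
    tau_B = K_B * ((A - D) + (C - B)) / (1.0 + n_contour)

    short_chains = texel_contours & (chain_lengths <= 2)   # contour pixels Q_ij in chains of length <= 2
    tau_G = K_G * short_chains.sum()
    return tau_B, tau_G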

3 Hardware Implementation

The maximum operation speed in a vision system is obtained when each processing
phase is executed at video rate. To attain this goal it has been necessary to design a
specific processor with a pipeline structure. In a first step, operating over 3×3 pixels, the
contours are obtained; in a second step, the contours are thinned using a 4×4 pixel
operator; in the third one, operating on a bigger area of 8×8 pixels, the texture
function is obtained.
The calculation of each one of the six parameters is done in parallel all along the
sweeping of the image, but it would require making use of six operators working
simultaneously over the same data, i.e., the contours obtained in the previous phase.
The implementation of the logical function to be carried out in this step requires
some considerations in order to be physically feasible. Since the operator size is 8×8, the
combinatorial complexity is 2⁶⁴; that is the reason why it has been decomposed into four

parts, performing each one separately a function of complexity 216, which can be
already implemented with a single memory module. Fig. 2.
The decomposition of the texel in four quadrants presents some restrictions since
they can not interchange information, but it allows to perform some global operations
over an 8x8 environment, as it happens in the process of straight lines detection where
both, the straight line slope and its continuity between adjacent subregions can be
analyzed.
The processor operation time will be the time required to obtain the six texture
parameters and that required to process these data. The six parameters are obtained at
video rate, but with a delay of 3+4+8 pixels.The algorithms used to process the texture
parameters depend on the complexity of the scene with a range from 20 to 60 ms.

Fig. 2. Architecture of the processor designed to measure
in real time the parameters τ_i

4 Results

The texture images obtained from images of 256×256 pixels, each texel having a size of
8×8 pixels, result in a much lower resolution, 32×32 texels. In spite of the low resolution
of the texture function generated, the results obtained are considered satisfactory for
image segmentation of scenes in structured environments. The segmentation of the
image is done based on the circumscribed contour of the elements of the scene, and the
texture information is used to differentiate the regions. Fig. 3 shows the results obtained
applying successively the straightness (3b), granularity (3c) and blurriness (3d)
operators. These results can still be improved using color as complementary
information for scene segmentation.

References

[Bou] Bouman, C., Liu, B.: Multiple Resolution Segmentation of Textured Images.
IEEE Trans. PAMI Vol. 13, No. 2 (1991)
[Bro] Brodatz, P.: Textures. New York, Dover (1966)

Fig. 3. Visualization over the circumscribed contours of the
image shown in a) of the parameters: straightness b),
blurriness c), and granularity d)

[Buf] du Buf, J.M.H. et al.: Texture Feature Performance for Image Segmentation.
Pattern Recognition, vol.23 No.4 291-309 (1990).
[Cas] Casals, A. and Pages, J.: A Vision System for Agricultural Machines
Guidance. IARPW on Robotics in Agriculture and the food Industry. (1990).
[Har] Haralick, R.M.: Statistical and Structural Approaches to Texture. Proc. Int.
Joint Conference on Pattern Recognition, vol.4 45-69, Kyoto (1978)
[Man] Manjunath, B.S. and Chellappa, R.: Unsupervised Texture Segmentation
Using Markov Random Field Models. PAMI vol.13 No.5 (1991)
[Mat] Matsuyama et al.: A structural Analyzer for Regularity Arranged Textures.
Computer Graphics and Image Processing 18 259-279 (1982)
[Shi] Shipley, T. and Shore, T.: The Human Texture Visual Field: Fovea-to-
Periphery Pattern Recognition. Pattern Recognition, vol.23 No.11 (1990).
[Tsu] Tsuji, S. and Tomita F.: A Structural Analyzer for a Class of Textures.
Computer Graphics and Image Processing 2 216-231 (1973)
[Zuc] Zucker, S.W.: Toward a Model of Texture. Computer Graphics and Image
Processing 5 190-202 (1976)
Texture Segmentation by Minimizing Vector-Valued
Energy Functionals: The Coupled-Membrane Model

Tai Sing Lee, David Mumford, Alan Yuille


Harvard Robotics Laboratory,
Division of Applied Sciences, Harvard University, Cambridge, MA 02138.

Abstract. This paper presents a computational model that segments
images based on the textural properties of object surfaces. The proposed
Coupled-Membrane model applies the weak membrane approach to an im-
age WI(σ, θ, x, y), derived from the power responses of a family of self-
similar quadrature Gabor wavelets. While segmentation breaks are allowed
in x and y only, coupling is introduced in all 4 dimensions. The result-
ing spatial and spectral diffusion prevents minor variations in local tex-
tures from producing segmentation boundaries. Experiments showed that
the model is adequate in segmenting a class of synthetic and natural texture
images.

1 Introduction

This paper presents a computational model that segments images based on the textural
properties of object surfaces. The proposed model distinguishes itself from the previous
models in texture segmentation [Turner 1986, Voorhees and Poggio 1988, Malik and
Perona 1989, Fogel and Sagi 1989, Bovik, Clark and Geisler 1990, Reed and Wechsler
1990, Geman et al 1990] in the following way.
Previous models have started with the extraction from the image I(x, y) of some set of
texture features which can be viewed as forming auxiliary texture images I~ (x, y). Then
applying either region growing, boundary detection, or (in the single paper [Geman et al
1990]) a membrane-like method combining these two, a segmentation is derived. In our
model, the texture features are the power responses of quadrature Gabor filters. These
filters form a continuous family depending on two variables ~, 6, and can be derived like
wavelets from dilation and rotation of a single filter. Thus we think of the texture features
as combining into a single image "l,VI(tr, O,x, y) depending on ~ continuous variables.
We apply the weak membrane approach to segmenting this signal, in which coupling is
introduced in all 4 dimensions, but breaks are allowed in x and y only. We call this the
Coupled- Membrane model.
Why is this model useful? Previous methods generally deal only with textures that are
statistically stationary (i.e. approximately translationally invariant) and not too granular
(e.g. with widely spaced textons, or large local variations). But natural textures do not
satisfy either: Firstly, they show considerable texture 'gradients', in which the power dis-
tribution of the texture among various channels changes slowly but systematically over
* This research is supported in part by Harvard-MIT Division of Health Sciences & Technology
and Harvard's Division of Applied Sciences' fellowship to T.S. Lee, Army Research Office
grant DAAL03-86-0171 to D. Mumford, and NSF grant IRI-9003306 to A. Yuille. Interesting
discussion and technical help from John Daugman, Mark Nitzberg, Peter Hallinan, Peter
Belhumeur, Michael Weisman, and Petros Maragos are greatly appreciated.


a region, due for instance to the perspective affine distortion imposed on surface features
of solid objects in a 3-dimensional world, and to the deformation caused by the non-
planarity of the objects' body shapes. J. J. Gibson [Gibson 1979] has emphasized how
these texture gradients are ubiquitous clues to the 3D structure of the world. Secondly,
they show random local fluctuations, due to the stochasticity in their generation pro-
cesses, which are often quite large (compare the four subimages in 'Mosaic' below, taken
from [Brodatz 1966]). The inter-membrane couplings unique to our Coupled-Membrane
model allow interaction between neighboring components in the spectral vector and pre-
vent minor local variations from producing segmentation boundaries. At the same time,
they introduce explicitly the appropriate metric between texture channels so that a shift
in the peak of the power spectrum to a nearby frequency or orientation is treated differ-
ently from a shift to a distant frequency or orientation. As we shall see, this allows us to
begin to solve these problems for natural textures.
This paper is organized as follows: first, we will discuss how texture is represented in
our model and how texture disparity can be computed from this representation. Then,
we will discuss the Coupled-Membrane model for texture segmentation in its continuous
formulation and discrete approximation. Finally, we will present our experimental results.

2 Gabor-Wavelet Representation of Texture

Texture segmentation requires a description of local texture properties in an image. Pre-


vious methods include texton statistics [Voorhees and Poggio 1988], DOG filters [Malik
and Perona 1989], windowed Fourier transform or Gabor filtering [Turner 1986, Fogel
and Sagi 1989, Bovik, Clark and Geisler 1990, Reed and Wechsler 1990]. While the first
two of these methods emphasize feature detection, the Gabor-Fourier method is based
on power spectrum analysis or autocorrelation.
In our model, the texture features are the power responses of quadrature Gabor filters.
These filters form a continuous family depending on two variables σ, θ, and can be derived
like wavelets from dilation and rotation of a single filter. We call this Gabor-Wavelet
Representation. Physiological evidence suggests that the visual cortex is employing a
similar representation for encoding visual information. We impose the constraints derived
from physiological data [Pollen et al. 1989, Daugman 1985] and obtain the following
family of self-similar Gabor filters centered at (x = 0, y = 0) in the spatial domain. (For
details, readers are referred to our technical report [Lee, et al 1991].)

0"2 - 1-i~o2o(4(x cos #-by sin 0)24-(-x sin 04-y cos 0) ~ e i ( a cos O:v4-a sin #y)
x , y ) = 5-6 e 9 (1)

where a is the radial frequency, and 0 is the angular orientation of the filter.
In a manner completely analogous to the generation of wavelet bases from a single
basic wavelet, this whole family of Gabor filters can be generated by rotation and dilation
from the following single Gabor filter (as shown in figure 1):

G_{1,0}(x, y) = (1/(50π)) exp[ -(1/200)(4x² + y²) ] e^{ix}     (2)
Self-similar Gabor filters from this family serve both as band-pass filters and multi-
scale matched filters, producing a representational scheme that unifies power-spectrum
analysis and feature detection.
The convolution of this family of filters with the image produces a single image
WI(σ, θ, x, y), which is the normalized power modulus of the filter ensemble, as follows:

Fig. 1. Rotated and Dilated quadrature Gabor filters

WI(σ, θ, x, y) = (1/Z) |(G_{σ,θ} * I)(x, y)|²   (3)

where

(G_{σ,θ} * I)(x₀, y₀) = ∫∫ G_{σ,θ}(x − x₀, y − y₀) I(x, y) dx dy   (4)

and

Z = ∫∫ |(G_{σ,θ} * I)(x, y)|² dσ dθ   (5)
Since each Gabor filter has a Gaussian spread in the frequency plane, the local power
spectrum of an image can be sampled in a parsimonious and discrete manner. In our
implementation, we use Gabor wavelets with a sampling interval of 1 octave in frequency
and 22.5° in orientation to pave the spatial frequency plane, as shown in figure 2.

Fig. 2. Tiling of Spatial Frequency Plane by Gabor Wavelets
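As a rough illustration of how a representation like that of equations (1)-(5) and the sampling of figure 2 might be computed, the following Python sketch builds a 3-frequency, 8-orientation bank of Gabor wavelets, convolves it with an image and normalizes the per-pixel power responses. The kernel support, the particular frequency values and the use of NumPy/SciPy are our own illustrative assumptions, not taken from the paper.

```python
# A minimal sketch (not the authors' code) of the Gabor-wavelet power
# representation of equations (1)-(5); all numerical choices are illustrative.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(sigma, theta, half=16):
    """Complex Gabor filter of equation (1) on a (2*half+1)^2 grid."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(sigma**2 / (8 * np.pi**2)) * (4 * xr**2 + yr**2))
    carrier = np.exp(1j * sigma * xr)
    return (sigma**2 / (8 * np.pi)) * envelope * carrier

def spectral_signature(image, sigmas=(np.pi/2, np.pi/4, np.pi/8), n_orient=8):
    """24-component normalized power responses WI(sigma, theta, x, y)."""
    responses = []
    for sigma in sigmas:                      # 3 frequencies, one octave apart
        for k in range(n_orient):             # 8 orientations, 22.5 deg apart
            theta = k * np.pi / n_orient
            g = gabor_kernel(sigma, theta)
            power = np.abs(fftconvolve(image, g, mode='same'))**2
            responses.append(power)
    WI = np.stack(responses, axis=0)           # shape (24, H, W)
    Z = WI.sum(axis=0) + 1e-12                 # per-pixel normalization (eq. 5)
    return WI / Z                              # unit spectral signature vectors

image = np.random.rand(64, 64)                 # stand-in for a texture image
WI = spectral_signature(image)
print(WI.shape, WI.sum(axis=0).max())          # (24, 64, 64), sums to ~1
```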



3 Texture Disparity and Spectral Proximity


Using this sampling scheme, we construct a spectral signature vector of 24 components
(3 frequencies and 8 orientations) at each point (x, y) in the spatial domain. This vector
is then normalized, i.e. divided by Z, to discount the luminance effect. As a result, each
particular texture corresponds to a unit vector within this unit 24-dimensional ball. We
advocate the use of the L2 norm as the appropriate metric to compute the distance
between two spectral signature vectors for the following reason.
The L2 norm is superior to the L∞ norm in computing texture disparity because it
does not discard the proximity information between the spectral signatures of two texture
patterns. When an L∞ norm is computed, components in a spectral vector are treated
as independent and their spectral proximity relationships are ignored. For instance, the
L∞ norm induced by rotating a texture by 30° will be the same as that induced by
a 90° rotational shift. This is true independent of the sampling scheme. Although the
L2 norm behaves the same way as the L∞ norm in a minimal sampling scheme with
orthogonal bases, its value decreases for spectrally proximal textures when the bases
become increasingly nonorthogonal in an oversampling scheme. In this case, the L2 norm
induced by a 30° rotation is smaller than that induced by a 90° rotation, as illustrated
in figure 3.
Because the parsimonious scheme we use is not strictly a minimal one with or-
thogonal bases, it benefits from the use of the L2 norm. The visual cortex, however,
oversamples the power spectrum by at least two or three times in both the σ and
θ dimensions [Webster & De Valois 1985, Silverman et al 1989, Hubel and Wiesel 1977].
The proximity effect due to the L2 norm would therefore be even more pronounced.
The parsimonious scheme saves computational effort, but also decreases the proxim-
ity effect. To compensate, we introduce smoothing in the spectral domain by coupling
together the spectrally proximal components in the spectral vector, as will be discussed
in the next section.
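The following toy Python example (our own construction, not from the paper) illustrates the spectral-proximity argument numerically: orientation channels are modeled as Gaussian tuning curves, and the L2 distance between signature vectors is compared for a 30° and a 90° rotation under nearly orthogonal and under overlapping (oversampled) channel layouts.

```python
# Toy illustration: with nearly orthogonal channels the L2 distance cannot tell
# a 30-degree shift from a 90-degree one, while with overlapping (oversampled)
# channels it becomes smaller for the nearby shift.
import numpy as np

def signature(peak_deg, bandwidth_deg, n_channels=8):
    """Responses of evenly spaced orientation channels with Gaussian tuning."""
    centers = np.arange(n_channels) * 180.0 / n_channels
    d = np.abs(((peak_deg - centers) + 90.0) % 180.0 - 90.0)  # circular distance
    v = np.exp(-0.5 * (d / bandwidth_deg)**2)
    return v / np.linalg.norm(v)                              # unit signature vector

for bw, label in [(5.0, "narrow (nearly orthogonal) channels"),
                  (30.0, "broad (overlapping) channels")]:
    base = signature(0.0, bw)
    d30 = np.linalg.norm(base - signature(30.0, bw))
    d90 = np.linalg.norm(base - signature(90.0, bw))
    print(f"{label}:  L2(30 deg) = {d30:.3f}   L2(90 deg) = {d90:.3f}")
```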

Fig. 3. Effect of L2 norm.    Fig. 4. Couplings in the 4-D Lattice



4 Energy Functional for Texture Segmentation

The Coupled-Membrane model we developed for texture segmentation is a generalization
of the Weak-Membrane Model [Blake and Zisserman 1987, Marroquin 1984, and Mumford
and Shah 1985] or equivalently the Markov Random Field model [Geman and Geman
1984]. While the Weak-Membrane deals with intensity values in a 2-D image plane, our
model deals with spectral responses in a 4-D spatial-spectral domain. The continuous
formulation of the model is defined as follows.
Given the spectral signature image WI(σ, θ, x, y), we are to find a piecewise con-
tinuous function f(σ, θ, x, y) that is its smooth estimate, with its texture noise and
variations removed. Within a texture region, f(σ, θ, x, y) is continuous. Discontinuity in
f(σ, θ, x, y) is allowed at the boundary in the spatial domain between two texture regions.
These objectives are captured by the following energy functional, which is to be minimized:

E(f, B) = ∫∫_R ∫∫_S ‖f(σ, θ, x, y) − WI(σ, θ, x, y)‖² d log σ dθ dx dy
        + ∫∫_{R−B} ∫∫_S [ λ²((∂f/∂x)² + (∂f/∂y)²) + γ_θ²(∂f/∂θ)² + γ_σ²(∂f/∂ log σ)² ] d log σ dθ dx dy
        + α ∫_B ds

where R and S are the finite 2-dimensional spatial and spectral domains respectively; the
boundary set B ⊂ R is a finite set of piecewise C¹ contours which meet ∂R and meet
each other only at their endpoints. The contours of B cut R into a finite set of disjoint
regions R₁, ..., R_m, the connected components of R − B. The integration over S is done with
(d log σ) dθ = (dσ/σ) dθ because the power spectrum is represented in log-polar form.
The first term of the energy functional forces the smoothed spectral response f(σ, θ, x, y)
to be as close as possible to the measured spectral response WI(σ, θ, x, y). The second
term asks the spectral response to be as smooth as possible in both the spatial and spectral
domains. These two potentially antagonistic demands arrive at a compromise that
is determined by λ, γ_σ and γ_θ.
Since f(σ, θ, x, y) is required to be smooth only within each R_i but not across B, the
third integral term is needed to prevent breaks from appearing everywhere. This term
imposes a penalty α against each break and provides the binding force within a region.

5 Computer Implementation
To solve the functional minimization problem computationally, the energy functional is
discretized as follows:

E(f, B) = Σ_{i,j,k,l} (f(i,j,k,l) − WI(i,j,k,l))²
        + γ_σ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j,k,l+1)]²
        + γ_θ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j,k+1,l)]²
        + λ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i+1,j,k,l)]² (1 − v(i + 1/2, j))
        + λ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j+1,k,l)]² (1 − h(i, j + 1/2))
        + α Σ_{i,j} [v(i + 1/2, j) + h(i, j + 1/2)]

where i, j, k, l are indices for x, y, θ, and log σ respectively in the 4-dimensional spatial-spectral
sampling lattice; v and h are the vertical and horizontal breaks between the lattice points in
the spatial domain.
Figure 4 illustrates the couplings among the nodes in the 4-D sampling lattice: Each
membrane corresponds to WI(k, l) for a frequency l, and an orientation k. Within each
membrane, each node is coupled to the nearest 4 neighboring nodes. At each spatial
location, a membrane is coupled with 4 other membranes which are its nearest spectral
neighbors.
As the segmentation-diffusion process unfolds, spectral response is allowed to diffuse
from one node to its 4 spatial and 4 spectral nearest neighbors. Breaks, however, can only
occur in the spatial domain. When the L2 norm of the evolving membranes exceeds the
texture disparity threshold √α/λ at a spatial location, a break will occur at that location
to cut across all the membranes.
Given a set of values for the parameters λ, γ_σ, γ_θ, and α, an optimal compromise among
the three terms in the energy functional produces a set of segmentation boundaries and
smoothed spectral responses. Because the energy functional has many local minima due
to its nonconvexity, the globally optimal compromise has to be sought using special math-
ematical programming methods. This paper presents results obtained using a stochastic
method called Simulated Annealing [Kirkpatrick 1983], and a deterministic method called
Graduated Non-Convexity [Blake and Zisserman 1987]. We implemented both meth-
ods on a DEC 5000 workstation and on a massively parallel computer called MASPAR.
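To make the bookkeeping in the discretized energy concrete, here is a small Python sketch (our own, not the authors' implementation) that evaluates E(f, B) for given membranes f and break fields v, h. The array shapes are a toy example and the parameter values are borrowed from the figure captions; the minimization itself by Simulated Annealing or GNC is not shown.

```python
# Schematic evaluation of the discretized Coupled-Membrane energy above.
# f and WI are indexed (i, j, k, l) = (x, y, orientation, log-frequency);
# v and h are the spatial break (line-process) fields.
import numpy as np

def coupled_membrane_energy(f, WI, v, h, lam=6.0, gamma_sigma=4.0,
                            gamma_theta=2.0, alpha=0.02):
    data = np.sum((f - WI)**2)
    # spectral couplings (no breaks are allowed in the spectral directions)
    freq = gamma_sigma**2 * np.sum((f[:, :, :, :-1] - f[:, :, :, 1:])**2)
    orient = gamma_theta**2 * np.sum((f[:, :, :-1, :] - f[:, :, 1:, :])**2)
    # spatial couplings, switched off across vertical / horizontal breaks
    dx = (f[:-1, :, :, :] - f[1:, :, :, :])**2 * (1.0 - v)[:, :, None, None]
    dy = (f[:, :-1, :, :] - f[:, 1:, :, :])**2 * (1.0 - h)[:, :, None, None]
    spatial = lam**2 * (np.sum(dx) + np.sum(dy))
    breaks = alpha * (np.sum(v) + np.sum(h))
    return data + freq + orient + spatial + breaks

# toy problem: 8x8 spatial grid, 8 orientations, 3 frequencies
rng = np.random.default_rng(0)
WI = rng.random((8, 8, 8, 3))
f = WI.copy()
v = np.zeros((7, 8))          # vertical breaks between columns of spatial nodes
h = np.zeros((8, 7))          # horizontal breaks between rows of spatial nodes
print(coupled_membrane_energy(f, WI, v, h))
```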

6 Experimental Results
A class of texture images, 256 × 256 pixels in size, is used to test the model. Percep-
tual boundaries in these images are defined primarily by differences in texture, and not
by luminance contrast. The segmentation-diffusion is performed on a 64 × 64 spatial
sampling grid. We use a simple annealing schedule for Simulated Annealing:
T_n = 0.985 T_{n-1} at each temperature step, with a starting temperature of 25. It takes
about 24 hours on a DEC 5000 or 6 hours on a MASPAR to process each image. For GNC,
the error resolution ε needs to be 2^{-12} to ensure the solution is close to the optimal one.
It takes 140 hours on a DEC 5000 or 7 hours on a MASPAR. Despite the fast annealing
schedule, the Simulated Annealing performs reasonably well. The answer provided by
GNC, however, is closer to the global minimum. These algorithms have also been imple-
mented in 1-D so that their solutions can be compared with the exact optimal solution
yielded by dynamic programming.
Three images are presented here as illustrations: 'Vase' (figure 5), 'Mondrian' (figure
7a), and 'Mosaic' (figure 7b). 'Vase' is used to demonstrate the model's tolerance to
texture 'gradient' due to inter-membrane coupling. When this coupling is disabled, the
segmentation is not perceptually valid (figure 5d). The initial response and the final
response of the filters to 'Vase' (figure 6) demonstrate the diffusion effect in both the
spatial and spectral domains.
'Mondrian' and 'Mosaic' both demonstrate the model's ability to segment
synthetic and natural textures while withstanding significant texture noise and local


Fig. 5. (a) 'Vase' and its segmentations: (b) Simulated Annealing result
with α = 0.02, λ = 6, γ_θ = 2, γ_σ = 4; (c) GNC result with α = 0.02, λ = 6, γ_θ = 2, γ_σ = 4;
(d) GNC with α = 1.25, λ = 6, γ_θ = γ_σ = 0, i.e. without inter-membrane coupling.

Fig. 6. (a) Initial filter response map for 'Vase'. (b) Final filter response map at the end of the
segmentation-diffusion process (figure 5c). Each small square is the response map of a particular
filter to the image. The maps are arranged in frequency rows (three frequencies) and orientation
columns (eight orientations).

variation in scale and orientation. The initial and final response maps of 'Mondrian'
(figure 8) underscore the cooperative effect of the diffusion and segmentation processes
in producing a sharp texture boundary from fuzzy input.
The parameter values used for the segmentation-diffusion process are shown in the
figure caption. For the series of images we tested, the values needed to produce a seg-
mentation similar to our perception are fairly close together.

7 Discussion

The Coupled-Membrane Model with the Gabor-wavelet representation has produced
promising results in the segmentation of a class of texture images. It combines the several
sequential steps of filtering, smoothing and boundary detection in the previous texture
segmentation models into a coherent and unified framework with a simple and elegant
formalism. The model requires only three parameters (as γ_σ and γ_θ are related) and is
more parsimonious in many aspects than the model Geman et al. [1990] proposed. The
issue of spectral proximity, ignored by the previous models, is addressed in our model by
the introduction of spectral smoothing and the use of the L2 norm with oversampling.
The model needs to be further developed to address a wider class of natural images.
In the form presented in this paper, the model has difficulty at the boundary between
non-texture regions. This problem can be solved by incorporating into the model the

Fig. 7. (a) 'Mondrian' and its segmentation. Parameters: α = 0.02, λ = 6, γ_θ = 2, γ_σ = 4. (b)
'Mosaic' and its segmentation. Parameters: α = 0.02, λ = 12, γ_θ = 2, γ_σ = 4. Top segmentation:
SA. Bottom segmentation: GNC.

Fig. 8. (a) Initial filter response map for 'Mondrian'. (b) Final filter response map at the end
of the segmentation-diffusion process.

luminance edge information derived from the same Gabor-Wavelet representation, and
by modifying the domain of integration in the energy functional. This effort will be
reported in another paper.
A similar approach can be taken to the problem of speech segmentation: speech seg-
mentation is presently done with either Hidden Markov models or time-warping. We
propose that segmentation of time by a Coupled-String model applied to the power spec-
trum of speech, with couplings between adjacent values of time and frequency, provides
a third approach. The Coupled-String model is amenable to dynamic programming and
hence fast, and will be effective for all phonemes without the need to model each phoneme
in detail.
The model uses neurophysiological components as its processing elements, and can
be implemented in a locally connected parallel network. There is a strong possibility that
it can be linked to the computational processes in the visual cortex. For instance, the
segmentation process is related to boundary perception, while the diffusion process can
be linked to texture grouping or the diffusion phenomenon in psychology. Our work suggests
that when cortical complex cells are coupled in a particular fashion, a successive gradient
descent type of algorithm can solve a class of image segmentation problems that are
essential to visual perception.

References

1. Blake, A. & Zisserman, A. (1987) Visual Reconstruction. The MIT Press, Cambridge,
Massachusetts.

2. Bovik, A.C., Clark, M. & Geisler, W.S. (1990) Multichannel Texture Analysis Using Lo-
calized Spatial Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 12, No. 1, January.
3. Brodatz, P. (1966) Texture - A Photographic Album for Artists and Designers. New York:
Dover.
4. Daugman, J.G. (1985) Uncertainty relation for resolution in space, spatial frequency, and
orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Amer., Vol.
2, No. 7, pp 1160-1169.
5. Fogel, I. & Sagi, D. (1989) Gabor Filters as Texture Discriminator. Biological Cybernetics
61, 103-113.
6. Geiger, D. & Yuille, A. (1991) A Common Framework for Image Segmentation. Intl.
Journal of Computer Vision, 6:3, 227-243.
7. Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990) Boundary Detection by Con-
strained Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol 12, No. 7, July.
8. Gibson, J.J. (1979) The Ecological Approach to Visual Perception, Houghton-Mifflin.
9. Heeger, D.J. (1989) Computational Model of Cat Striate Physiology. MIT Media Labora-
tory Technical Report 125. October, 1989.
10. Hubel, D.H., & Wiesel, T.N. (1977) Functional Architecture of macaque monkey visual
cortex. Proc. R. Soc. Lond. B. 198.
11. Kirkpatrick, S., Gelatt, C.D. & Vecchi, M.P. (1983) Optimization by simulated annealing.
Science, 220, 671-680.
12. Lee, T.S., Mumford, D. & Yuille, A. (1991) Texture Segmentation by Minimizing Vector-
Valued Energy Functionals: The Coupled-Membrane Model. Harvard Robotics Laboratory
Technical Report no. 91-22.
13. Malik, J. & Perona, P. (1989) A computational model for texture segmentation. IEEE
CVPR Conference Proceedings.
14. Marroquin, J.L. (1984). Surface Reconstruction Preserving Discontinuities (Artificial In-
telligence Lab. Memo 792). MIT, Cambridge, MA. A more refined version "A Probabilistic
Approach to Computational Vision" appeared in Image Understanding 1989. Ed. by Ull-
man, S. & Whitman Richards. Ablex Publishing Corporation, New Jersey 1990.
15. Mumford, D. & Shah, J. (1985) Boundary Detection by Minimizing Functionals, I. IEEE
CVPR Conference Proceedings, June 19-23. A more detailed and refined version appeared
in Image Understanding 1989. Ed. by Ullman, S. & Whitman Richards. Ablex Publishing
Corporation, New Jersey 1990.
16. Pollen, D.A. & Gaska, J.P. & Jacobson, L.D. (1989) Physiological Constraints on Models
of Visual Cortical Function. In Models of Brain Function, Ed. Rodney M.J. Cotterill.
Cambridge University Press, England.
17. Reed, T.R. & Wechsler, H. (1990) Segmentation of Textured Images and Gestalt Orga-
nization Using Spatial/Spatial-Frequency Representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 12, No. 1, January.
18. Silverman, M.S., Grosof, D.H., De Valois, R.L., & Elfar, S.D. (1989) Spatial-frequency
Organization in Primate Striate Cortex. Proc. Natl. Acad. Sci. U.S.A., Vol. 86, January.
19. Voorhees, H. & T. Poggio. (1988) Computing texture boundaries in images. Nature,
333:364-367. A detailed version exists as Voorhees' master thesis, "Finding Texture Bound-
aries in Images". MIT AI Lab Technical Report No. 968, June 1987.
20. Webster, M.A. & De Valois, R.L. (1985) Relationship between Spatial-frequency and Ori-
entation tuning of Striate-Cortex Cells. J. Opt. Soc. Am. A Vol 2, No. 7 July 1985.

This article was processed using the LaTeX macro package with ECCV92 style
Boundary Detection in Piecewise Homogeneous
Textured Images*
Stefano Casadei 1'2, Sanjoy Mitter 1,2 and Pietro Perona 3,4
1 Massachusetts Institute of Technology 35-308, Cambridge MA 02139, USA
e-mail: casadei@lids.mit.edu
2 Scuola Normale Superiore, Pisa, Italy
3 California Institute of Technology 116-81, Pasadena CA 91125, USA
4 Università di Padova, Italy
Abstract. We address the problem of scale selection in texture analysis.
Two different scale parameters, feature scale and statistical scale, are de-
fined. Statistical scale is the size of the regions used to compute averages.
We define the class of homogeneous random functions as a model of tex-
ture. A dishomogeneity function is defined and we prove that it has useful
asymptotic properties in the limit of infinite statistical scale. We describe
an algorithm for image partitioning which has performed well on piecewise
homogeneous synthetic images. This algorithm is embedded in a redundant
pyramid and does not require any ad-hoc information. It selects the optimal
statistical scale at each location in the image.

1 Introduction

The problem of texture analysis (recognition and segmentation) has traditionally been
approached by computing locally defined vector-valued "descriptors" of each region in
the image. The texture recognition problem is thus reduced to a conventional classifi-
cation problem, and boundary detection may be performed by locating areas of rapid
change in the descriptor vectors, or, dually, by clustering regions with similar descriptors.
Constraints on texture analysis algorithms come from the physics and the geometry of
image formation: often one would like to ensure invariance with respect to changes in
illumination, scaling and rotation, sometimes also to tilt and slant.
The search for good local descriptors has been the focus of much work with two main
classes of descriptors being favored in the more recent literature: (a) linear filters followed
by elementary nonlinearities and smoothing (e.g. [1, 2, 3, 4]), and (b) different statistics
of brightness computed on image patches (e.g. [5, 6, 7]). The filtering framework has
natural characteristics for addressing the scale- and rotation-invariance issues: if each
filter category is present at multiple scales and orientations, and if the discretization is
fine enough, the representation of image properties given by the filter outputs is roughly
scale and rotation-invariant.
A number of important issues having to do with scale selection and response normal-
ization have remained virtually unanswered: What are the basic regularity hypotheses
that a texture has to satisfy for the local descriptor approach to work? How does one
identify automatically the proper scale for analyzing a texture, and how does one choose
the thresholds for declaring a boundary?
In this paper we make explicit and formalize a general assumption about texture
regularity. We show how this assumption can be used to find texture boundaries and we
* Research supported by Air Force grant AFOSR-89-0276-C and ARO grant DAAL03-86-K-0171,
Center for Intelligent Control Systems

discuss its relation with the scale selection problem. We observe that two independent
scale parameters exist and must be dealt with. Finally, we show some results of an
efficient redundant-pyramid algorithm for computing homogeneous texture regions. This
algorithm is (approximately) translation-, scale- and illumination invariant.

1.1 Overview of the contents

A general assumption that forms the basis of all approaches to texture analysis is that
a texture at some level of description is homogeneous. Although this is not a new idea,
a precise definition of what this means is still missing. The difficulty in defining texture
homogeneity is rooted in the two conflicting natures of texture, namely, randomness
and regularity. A specific texture may be modeled as a realization of a random function.
We propose to define a random function as homogeneous if, with probability one, spatial
averages of any local operator are constant with respect to space in the limit of infinite
size of the averaging region. To quote Wilson [8, 9], when the scale of averaging is infinite
the class-localization of the texture pattern must become infinitely accurate. Of course,
in practical circumstances we cannot take averages over infinite regions: for practical
purposes the averaged texture features (i.e. the averaged output of the local operators)
must be close to space-independence for finite averaging regions. When this happens we
say that we are near the thermodynamic limit of a given texture. Due to the trade-off
which exists between reliably estimating texture properties (which requires large averag-
ing windows) and localizing the textured regions (which can be done more accurately if
the data are not too blurry) it is important to be able to estimate the smallest averaging
scale that approaches the thermodynamic limit.
The scale of averaging will also be called statistical scale or external scale. It must
be distinguished from a different scale parameter: the feature scale or internal scale. The
latter corresponds to the size of the support of a given local operator. These two scale
parameters are unrelated (apart from the obvious fact that the external scale must be
greater than the internal scale). This is illustrated in figure 1. Note that even if the scale of
the texture elements is the same for all images, the behavior along the statistical scale is
quite different. A multiscale representation of a textured image should therefore contain
two scale parameters (this observation has been independently made by Eric Saund,
personal communication).
In order to select the most appropriate external scale(s) for analyzing a given texture
we need to quantify the level of variability of the descriptors at any given statistical scale.
To this purpose we define a real positive function, the dishomogeneity function, of two
arguments, spatial position and statistical scale. Low dishomogeneity values will indicate
that the thermodynamical limit has been reached. In order to be useful in texture analysis
such a dishomogeneity function must satisfy two elementary properties:

1. The dishomogeneity of a region which contains only one type of texture is zero.
2. The dishomogeneity of a region which contains a texture boundary is strictly positive
if the set of local descriptors is rich enough.

In section 2 we state two results (Theorems 1,2) which ensure that the dishomogeneity
function satisfies these two properties in the thermodynamic limit for piecewise homoge-
neous textures.
Figure 2 demonstrates these concepts on a finite image. Note how the dishomogeneity
of a given texture approaches 0 when the statistical scale is sufficiently large (left).
However when the analyzed region goes over a texture boundary the dishomogeneity

Fig. 1. Averages over larger regions are required to detect the regularity of the pattern in tex-
tures which are sparser or contain more "randomness" since the dishomogeneity of these textures
approaches zero at larger statistical scales than dense periodic textures. At right the dishomo-
geneity function, maximized over the image, is plotted vs. statistical scale k (i.e. max_x g'(2^k, x)
vs. k, see section 2.5) for the three different textures. Note that the dishomogeneity value of
e.g. 0.3 is attained approximately at k=4, k=5 and k=6 for the left, center and right textures
respectively. This suggests that a multiscale representation of the image should treat feature
scale and statistical scale as independent dimensions of analysis.



Fig. 2. The dishomogeneity at different scales and positions. The dishomogeneity inside each
square region shown in the upper figure is plotted below. Left: the dishomogeneity of a given
type of texture as a function of statistical scale is shown for a constant location in the image.
The x-coordinate, k, is the logarithm of the statistical scale. The linear size (length of the side in
pixels) of each square is 2^{k+1} (k = 2, 3, 4, 5). The dishomogeneity at fine (k = 1, 2) scales is zero
because at the considered location only the constant background colour is present in the smaller
squares. Right: the dishomogeneity at several positions in the image. The x-coordinate now
represents position in the image; the scale is held constant at k = 5. Note that dishomogeneity
is high when the region contains a texture boundary.

becomes suddenly larger (right). Figure 3 shows how the dishomogeneity function is
constructed.

Fig. 3. Computation of the dishomogeneity function g'(L, x).

Our segmentation algorithm selects the optimal statistical scale at each location in the
image by minimizing the dishomogeneity function (plus a penalty against small scales)
among all possible scales.

2 Multiscale Representations for textured images.


2.1 Notation.

Let us introduce some notation. An image g will be for us a real function g : R² → R.
The set of all images is G. For x ∈ R² let T_x : G → G be the translation operator defined
as (T_x g)(x') = g(x' − x). Similarly, for λ > 0, D_λ : G → G is the dilation operator defined
by: (D_λ g)(x) = g(λ^{-1} x). Let B_l(x) be the "box" centered at x = (u, v) having size l:

B_l(x) = [u − l/2, u + l/2] × [v − l/2, v + l/2].
Definition 1. An image descriptor q : G → Rⁿ is a real functional which depends only
on the image inside B_1(0): q · g = q · (g I_{B_1(0)}). (We use the dot product notation to denote
the operation of applying an image descriptor to an image, i.e. q · g = q(g).) I_{B_1(0)} is
the characteristic function of B_1(0): I_{B_1(0)}(x) = 1 for x ∈ B_1(0) and I_{B_1(0)}(x) = 0 for
x ∉ B_1(0).
Definition 2. The image description ξ(x) generated by q is defined by: ξ(x) = q · T_{-x} g.
For each component we then have ξ_i(x) = q_i · T_{-x} g.

The definitions and results below are valid for any image descriptor "regular" enough
(see [10] for a precise definition). Most of the texture features used in the literature can
be adapted to fit into this formalism. However, to give an example (and because we use
it in our implementation) we discuss a particular case in more detail. Namely, given a
set of functions h_i, i = 1, . . . , n, such that h_i(x) = 0 for x ∉ B_1(0), consider the following
descriptor:

q_i · g = f_i [ ∫ h_i(x') g(−x') dx' ] .   (1)

The purpose of the function f_i : R → R is to introduce a non-linearity. The image
description ξ(x) = q · T_{-x} g is then obtained from g by a convolution cascaded with the
non-linearity f_i: ξ_i(x) = q_i · T_{-x} g = f_i [∫ h_i(x') g(x − x') dx'] = f_i [(h_i * g)(x)].
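A minimal Python sketch of the convolution-plus-nonlinearity descriptor of equation (1) is given below; the choice of oriented step filters for h_i and of the absolute value for f_i is only illustrative (the paper leaves the specific h_i and f_i open), and the use of SciPy is our own assumption.

```python
# Sketch of a descriptor of the form xi_i(x) = f_i[(h_i * g)(x)],
# with oriented step filters h_i and the absolute value as the nonlinearity.
import numpy as np
from scipy.ndimage import convolve

def step_filter(theta, size=8):
    """A crude oriented step (edge) filter supported on a size x size box."""
    y, x = np.mgrid[-size//2:size//2, -size//2:size//2] + 0.5
    return np.sign(x * np.cos(theta) + y * np.sin(theta)) / size**2

def image_description(g, n_orient=4, nonlinearity=np.abs):
    """Stack of descriptions xi_i(x), one component per filter orientation."""
    thetas = [k * np.pi / n_orient for k in range(n_orient)]
    return np.stack([nonlinearity(convolve(g, step_filter(t))) for t in thetas])

g = np.random.rand(64, 64)            # stand-in textured image
xi = image_description(g)
print(xi.shape)                       # (4, 64, 64)
```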

2.2 Homogeneous Random Images.

Let (Ω, F, P) be a probability space. A random image φ is a function defined on Ω into
the set of images: φ : Ω → G. That is, if ω ∈ Ω, then φ(ω) ∈ G is an image.

Definition 3. A random image φ is homogeneous if for any image descriptor q, for all
x_0 ∈ R² and for almost all ω ∈ Ω:

lim_{l→∞} (1/l²) ∫_{B_l(x_0)} q · T_{-x} φ(ω) dx = ξ̄_q   (2)

where ξ̄_q ∈ Rⁿ is the asymptotic description of the random homogeneous image φ given
by the descriptor q.

To clarify the meaning of (2) let us rewrite it for the convolution-plus-nonlinearity case de-
fined in (1): lim_{l→∞} (1/l²) ∫_{B_l(x_0)} q_i · T_{-x} φ(ω) dx = lim_{l→∞} (1/l²) ∫_{B_l(x_0)} f_i[(h_i * φ(ω))(x)] dx =
lim_{l→∞} (1/l²) ∫_{B_l(x_0)} ξ_i(x) dx = ξ̄_{q,i}, where ξ̄_q = (ξ̄_{q,i} : i = 1, . . . , n) and ξ_i(x) is given by
ξ_i(x) = f_i[(h_i * φ(ω))(x)]. That is, homogeneity requires that averages of image descrip-
tions exist in the thermodynamical limit and be spatially independent.
This definition is quite general. It includes all ergodic random functions, in which
case ξ̄_q is the ensemble average of the image description. It is also valid for periodic
deterministic functions such as those generated by regular repetitive texture.
However, if illumination inhomogeneities or distortions such as those created by
perspective are present, then our model of texture is no longer valid at very large scales.
A more complex model is needed in which our homogeneous random function is cou-
pled with (perturbed by) a smooth, slowly varying function. If this perturbation is small
enough it is still possible to approach the thermodynamic limit before the long range
variations become important. The greatest degree of homogeneity is then attained at a
finite statistical scale.

2.3 Multiscale Representations.

The image description ξ(x) = q · T_{-x} g is obtained by applying the descriptor q to
the translated image. However, translation is not the only natural symmetry of images.
Another one is dilation⁵. We are then led to the following definition:

Definition 4. The multiscale description of g ∈ G given by the descriptor q is the func-
tion ξ : R⁺ × R² → Rⁿ given by:

ξ_i(λ, x) = K_i(λ) (q_i · D_λ^{-1} T_{-x} g).   (3)

where K_i(λ) is a normalization factor.

⁵ Rotation is also a natural symmetry. However, since this paper focuses on scale issues we do
not deal with orientation explicitly.

For the convolution-plus-nonlinearity descriptor defined in (1) we have, after some simple
calculation: ξ_i(λ, x) = K_i(λ) f_i [((D_λ h_i) * g)(x)]. In this case, the i-th component of the
multiscale description is obtained by convolving the image with a bank of filters generated
by dilating the template filter h_i. Wavelet representations can be expressed in this way
by letting λ = 2^k and x = (l 2^k, m 2^k), l, m, k ∈ Z, and choosing h_i, f_i in the appropriate
way.
Note that ξ(λ, x) depends only on the image inside B_λ(x).

2.4 Averaged Multiscale Representations.

Definition 5. The averaged multiscale description of g ∈ G given by the descriptor q is
the function ξ̄ : R⁺ × R⁺ × R² → Rⁿ given by, for λ < L:

ξ̄(L, λ, x) = (1 / (L − λ)²) ∫_{B_{L−λ}(x)} ξ(λ, x') dx'.   (4)

For λ ≥ L we let ξ̄(L, λ, x) = 0.

Note that the average is taken in such a way that ξ̄(L, λ, x) depends only on the
image inside B_L(x).
If g = φ(ω) is a homogeneous random image then, by definition of homogeneity (see
(2)), we have with probability 1 and for all x ∈ R² (assuming K_i(λ) = K(λ) for clarity):

lim_{L→∞} ξ̄(L, λ, x) = lim_{L→∞} (1/(L − λ)²) K(λ) ∫_{B_{L−λ}(x)} q · D_λ^{-1} T_{-x'} φ(ω) dx' = ξ̄^∞(λ)   (5)

Definition 6. The function ξ̄^∞ : R⁺ → Rⁿ will be called the multiscale asymptotic descrip-
tion of g.

2.5 Definition of the dishomogeneity function.

We are going to define a dishomogeneity function g'(L, x) ≥ 0 such that g'(L, x) depends
only on the image inside B_{2L}(x) and on averages of size L. We start by defining the
maximum and minimum value of ξ̄_i(L, λ, x) inside B_{2L}(x):

ξ̄_i^M(L, λ, x) = max_{x' ∈ B_L(x)} ξ̄_i(L, λ, x')        ξ̄_i^m(L, λ, x) = min_{x' ∈ B_L(x)} ξ̄_i(L, λ, x')   (6)

Now, let s_i(M, m) be a positive real function defined for M ≥ m ≥ 0 such that
s_i(M, m) = 0 if M = m and, for each m, s_i(M, m) is strictly increasing in M (the
simplest example would be s_i(M, m) = M − m).

Definition 7. The specific dishomogeneity is: g'_i(L, λ, x) = s_i(ξ̄_i^M(L, λ, x), ξ̄_i^m(L, λ, x)).

Definition 8. The dishomogeneity function is then defined by taking the most dishomo-
geneous "channel":

g'(L, x) = max_{i,λ} g'_i(L, λ, x)   (7)
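As a rough sketch of how Definitions 6-8 could be computed, the Python fragment below takes a precomputed averaged multiscale description ξ̄_i(L, λ, x) on a pixel grid and forms g'(L, x) with the simplest choice s_i(M, m) = M − m; the window-size convention and the use of SciPy's rank filters are our own assumptions.

```python
# Dishomogeneity g'(L, x) from an array xibar[i, lam, y, x] of averaged
# descriptions, using s(M, m) = M - m and a sliding-window max/min.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def dishomogeneity(xibar, L):
    """Max over channels (i, lambda) of the oscillation of the averaged
    description inside a box of side L pixels centered at each point."""
    n_i, n_lam, H, W = xibar.shape
    g = np.zeros((H, W))
    for i in range(n_i):
        for l in range(n_lam):
            M = maximum_filter(xibar[i, l], size=L)
            m = minimum_filter(xibar[i, l], size=L)
            g = np.maximum(g, M - m)        # s(M, m) = M - m, then max over i, lambda
    return g

xibar = np.random.rand(4, 3, 64, 64)        # toy averaged descriptions
print(dishomogeneity(xibar, L=9).shape)     # (64, 64)
```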

2.6 Asymptotic properties of the dishomogeneity function.


A first property of the dishomogeneity function is the following (for proofs see [10]):

Theorem 9. Let φ be a homogeneous random image and let g'[φ](L, x) be its dishomo-
geneity function. Then lim_{L→∞} g'[φ](L, x) = 0 for all x ∈ R² and with probability 1.
To deal with images which contain more than one type of texture we need a model
for them. A simple one is a piecewise homogeneous random function. Let then P =
{R_1, . . . , R_n} be a partition of R²: R_i ⊂ R²; R_i ∩ R_j = ∅ if i ≠ j; ∪_i R_i = R².
We assume that boundaries are regular enough, for instance, piecewise smooth. Let
{φ_i : i = 1, . . . , n} be a set of homogeneous random images.

Definition 10. A piecewise homogeneous random image over the partition P is the ran-
dom function: φ = Σ_{i=1}^n I_{R_i} φ_i.

Definition 11. We say that φ is asymptotically discriminable (by the image descriptor
q, used to define the dishomogeneity function) if all φ_i, i = 1, . . . , n, have different
asymptotic descriptions, i.e. ξ̄_i ≠ ξ̄_j for i ≠ j.

Then we have to define what the thermodynamic limit means for this class of images.
For, it is no longer possible to let L go to infinity without mixing together different types
of texture. A solution to this problem is to make the random functions φ_i "shrink" while
leaving the boundaries unchanged. Then, for φ defined as above and γ > 0 we define φ_γ
as: [φ_γ(ω)](x) = Σ_{i=1}^n I_{R_i}(x) [φ_i(ω)](γx).

Definition 12. For each ω ∈ Ω the asymptotic dishomogeneity function of φ(ω), g'_∞[φ(ω)] :
R⁺ × R² → R⁺, is given by: g'_∞[φ(ω)](L, x) = lim_{γ→∞} g'[φ_γ(ω)](L, x).

Theorem 13. With the above definitions we have with probability 1:
1) The above limit exists for all x ∈ R², L > 0
2) If B_{2L}(x) ⊂ R_i for some i then g'_∞[φ(ω)](L, x) = 0
3) If φ is asymptotically discriminable, and if B_{2L}(x) contains at least two different
types of texture, i.e. B_{2L}(x) ∩ R_i ≠ ∅, B_{2L}(x) ∩ R_j ≠ ∅, i ≠ j, then g'_∞[φ(ω)](L, x) > 0

This theorem suggests that in finite images the dishomogeneity function can provide
useful information wherever the thermodynamic limit is a reasonable approximation.

3 Finding Texture Boundaries.

In real images the thermodynamic limit is at best an ideal approximation. Moreover, it
can not be attained near boundaries with high curvature, even if all homogeneous regions
are very large. Nonetheless, the dishomogeneity function can be very useful for detecting
the statistical scale at which relevant events - - namely, the presence of a homogeneous
texture - - occur. This can be done at each point x in the image by looking for minima of
the dishomogeneity function g'(L, x) with respect to L. g'(L, x) can also be useful to find
boundaries, since we expect abrupt increases of g'(L, x) when the corresponding window,
B_{2L}(x), "invades" a nearby texture.
A complete description of our algorithm can be found in [11]. Here we just give a
brief sketch of it. The algorithm is embedded in an overlapped pyramid of the type
shown in figure 4. Each node in the pyramid corresponds to a square window. A cost

c_{kij} = g'_{kij} − α t_k is associated with each node. t_k is increasing in k. The negative term
−α t_k introduces a bias toward large statistical scales. This allows the selection of a unique scale
and the generation of a consistent segmentation in those cases where the underlying texture
is near the thermodynamic limit at more than one statistical scale (for instance, think
of a checkerboard). The dishomogeneity is computed as described in section 2 by using a
filter-plus-nonlinearity descriptor of the type shown in (1). Step filters at 4 different
orientations have been used.

Fig. 4. One dimensional oversampled pyramid. The vertical displacement between intervals of
the same level has been introduced for clarity and has no other meaning. Each node of the
pyramid has a dishomogeneity value and a cost associated with it.

The algorithm is region based, i.e. its output is a partition of the image. Therefore it
assumes that boundaries form closed curves and are sharp enough everywhere. It works
in two steps: first it selects nodes in the pyramid which minimize locally the cost function
c_{kij} and then merges neighboring selected nodes into larger regions. In the selection phase
each pixel of the image selects a unique node (the one having the lowest cost) among all
those in which it is contained.
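The scale-selection step can be caricatured in a few lines of Python; this is our own condensed sketch (the full algorithm, including the overlapped pyramid and the node-merging phase, is in [11]), and the choices t_k = k and the value of α are purely illustrative.

```python
# Per-pixel selection of the statistical scale of lowest cost c_k = g'_k - alpha * t_k.
import numpy as np

def select_scale(dishomogeneity_maps, alpha=0.05):
    """dishomogeneity_maps[k] holds g'(2**k, x) upsampled to full resolution."""
    costs = np.stack([g - alpha * k for k, g in enumerate(dishomogeneity_maps)])
    return np.argmin(costs, axis=0)       # per-pixel optimal statistical scale k

maps = [np.random.rand(64, 64) for _ in range(6)]   # k = 0..5, toy data
print(np.bincount(select_scale(maps).ravel()))      # how many pixels chose each scale
```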

4 Experiments

We now describe some of the experiments we have done with synthetic images. All
three images shown in this section are 256 × 256. The CPU time required to run one
image is approximately 9 minutes on a Sun SparcStation II. Most of the time goes into
the computation of ξ̄_i^M(L, λ, x) and ξ̄_i^m(L, λ, x) from ξ̄_i(L, λ, x') (see section 2.5).
Figure 5 shows the segmentation of a collage of textures which reach the thermody-
namic limit at several statistical scales.
Figure 6-left illustrates the segmentation of an "order versus disorder" image. This
example shows that looking for the optimal statistical scale can significantly enhance
discriminative capabilities making possible the detection of very subtle differences.
Finally, figure 6-right shows that this scheme can also be valid for textures whose
properties change smoothly across the image (as occurs when tilt or slant are present).

5 Conclusions

In this paper we have addressed the problem of scale selection in texture analysis. We
have proposed to make a clear distinction between two different scale parameters: sta-
tistical scale and feature scale. Both scale parameters should be taken into account in

Fig. 5. Top-right: a 256 × 256 textured image. The black lines are the boundaries found by
the algorithm. Left and bottom: the dishomogeneity g'(L, x) for L = 2^k, k = 1, . . . , 5. k grows
anti-clockwise. Homogeneous regions are black. Note that the thermodynamic limit is attained
at different statistical scales by different textures.

Fig. 6. Two 256 × 256 textured images: "Order versus disorder" and "tilted texture".

constructing image representations but they should be dealt with in very different ways.
In particular, we claim that it is necessary to find the optimal statistical scale(s) at
each location in the image. In doing this there is a natural trade-off between the reliable
estimation of image properties and the localization of texture regions. It is possible to
extract texture boundaries reliably only if a good enough trade-off can be found.
We have formalized the notion of homogeneity by the definition of homogeneous
random functions. When local operators are applied to these functions and the result is
averaged over regions of increasing size, we obtain a description of the image which is
asymptotically deterministic and space independent. In practical circumstances, we say
that the thermodynamic limit has been reached when this holds to a sufficient degree.
We have defined a dishomogeneity function and proved that in the thermodynamic limit
it is zero if and only if the analyzed region does not contain a texture boundary.
Our algorithm has performed well on images which satisfy the piecewise-homogeneous
assumption. However, it did not perform well on images which violate the piecewise
homogeneous property, mainly because in such images boundaries are not sharp enough
everywhere and are not well defined closed curves. Our node-merging phase is not robust
with respect to this problem. We are currently designing an algorithm which is more
edge-based and should be able to deal with boundaries which are not closed curves. Also,
we need to use a better set of filters.

References
1. H. Knutsson and G. H. Granlund. Texture analysis using two-dimensional quadrature fil-
ters. In Workshop on Computer Architecture for Pattern Analysis and Image Database
Management, pages 206-213. IEEE Computer Society, 1983.
2. M.R. Turner. Texture discrimination by Gabor functions. Biol. Cybern., 55:71-82, 1986.
3. J. Malik and P. Perona. Preattentive texture discrimination with early vision mechanisms.
Journal of the Optical Society of America - A, 7(5):923-932, 1990.
4. A.C. Bovik, M. Clark, and W.S. Geisler. Multichannel texture analysis using localized
spatial filters. IEEE Trans. Pattern Anal. Machine Intell., 12(1):55-73, 1990.
5. B. Julesz. Visual pattern discrimination. IRE Transactions on Information Theory IT-8,
pages 84-92, 1962.
6. R. L. Kashyap and K. Eom. Texture boundary detection based on the long correlation
model. IEEE transactions on Pattern Analysis and Machine Intelligence, 11:58-67, 1989.
7. D. Geman, S. Geman, C. Graffigne, and P. Dong. Boundary detection by constrained opti-
mization. IEEE Trans. Pattern Anal. Machine Intell., 12(7):609, 1990.
8. R. Wilson and G.H. Granlund. The uncertainty principle in image processing. IEEE Trans.
Pattern Anal. Machine Intell., 6(6):758-767, Nov. 1984.
9. M. Spann and R. Wilson. A quad-tree approach to image segmentation which combines
statistical and spatial information. Pattern Recogn., 18:257-269, 1985.
10. S. Casadei. Multiscale image segmentation by dishomogeneity evaluation and local opti-
mization (master thesis). Master's thesis, MIT, Cambridge, MA, May 1991.
11. S. Casadei, S. Mitter, and P. Perona. Boundary detection in piecewise homogeneous tex-
tured images (to appear). Technical Report -, MIT, Cambridge, MA, - -.

This article was processed using the LaTeX macro package with ECCV92 style
Surface Orientation and Time to Contact from
Image Divergence and Deformation

Roberto Cipolla* and Andrew Blake


Department of Engineering Science, University of Oxford, OX1 3PJ, England

Abstract. This paper describes a novel method to measure the differential
invariants of the image velocity field robustly by computing average values
from the integral of normal image velocities around image contours. This is
equivalent to measuring the temporal changes in the area of a closed contour.
This avoids having to recover a dense image velocity field and taking partial
derivatives. It also does not require point or line correspondences. Moreover
integration provides some immunity to image measurement noise.
It is shown how an active observer making small, deliberate motions can
use the estimates of the divergence and deformation of the image velocity
field to determine the object surface orientation and time to contact. The re-
sults of real-time experiments are presented in which arbitrary image shapes
are tracked using B-spline snakes and the invariants are computed efficiently
as closed-form functions of the B-spline snake control points. This informa-
tion is used to guide a robot manipulator in obstacle collision avoidance,
object manipulation and navigation.

1 Introduction

Relative motion between an observer and a scene induces deformation in image detail
and shape. If these changes are smooth they can be economically described locally by
the first order differential invariants of the image velocity field [16] - the curl (vorticity),
divergence (dilatation), and shear (deformation) components. These invariants have sim-
ple geometrical meanings which do not depend on the particular choice of co-ordinate
system. Moreover they are related to the three dimensional structure of the scene and
the viewer's motion - in particular the surface orientation and the time to contact² - in
a simple geometrically intuitive way. Better still, the divergence and deformation com-
ponents of the image velocity field are unaffected by arbitrary viewer rotations about
the viewer centre. They therefore provide an efficient, reliable way of recovering these
parameters.
Although the analysis of the differential invariants of the image velocity field has
attracted considerable attention [16, 14] their application to real tasks requiring visual
inferences has been disappointingly limited [23, 9]. This is because existing methods have
failed to deliver reliable estimates of the differential invariants when applied to real im-
ages. They have attempted the recovery of dense image velocity fields [4] or the accurate
extraction of points or corner features [14]. Both methods have attendant problems con-
cerning accuracy and numerical stability. An additional problem concerns the domain of

* Toshiba Fellow, Toshiba Research and Development Center, Kawasaki 210, Japan.
2 The time duration before the observer and object collide if they continue with the same
relative translational motion [10, 20]

applications to which estimates of differential invariants can be usefully applied. First or-
der invariants of the image velocity field at a single point in the image cannot be used to
provide a complete description of shape and motion as attempted in numerous structure
from motion algorithms [27]. This in fact requires second order spatial derivatives of the
image velocity field [21, 29]. Their power lies in their ability to efficiently recover reliable
but incomplete solutions to the structure from motion problem which can be augmented
with other information to accomplish useful visual tasks.
The reliable, real-time extraction of these invariants from image data and their ap-
plication to visual tasks will be addressed in this paper. First we present a novel method
to measure the differential invariants of the image velocity field robustly by computing
average values from the integral of simple functions of the normal image velocities around
image contours. This is equivalent to measuring the temporal changes in the area of a
closed contour and avoids having to recover a dense image velocity field and taking partial
derivatives. It also does not require point or line correspondences. Moreover integration
provides some immunity to image measurement noise.
Second we show that the 3D interpretation of the differential invariants of the image
velocity field is especially suited to the domain of active vision in which the viewer makes
deliberate (although sometimes imprecise) motions, or in stereo vision, where the relative
positions of the two cameras (eyes) are constrained while the cameras (eyes) are free to
make arbitrary rotations (eye movements). Estimates of the divergence and deformation
of the image velocity field, augmented with constraints on the direction of translation, are
then sufficient to efficiently determine the object surface orientation and time to contact.
The results of preliminary real-time experiments in which arbitrary image shapes are
tracked using B-spline snakes [6] are presented. The invariants are computed as closed-
form functions of the B-spline snake control points. This information is used to guide a
robot manipulator in obstacle collision avoidance, object manipulation and navigation.

2 Differential Invariants of the Image Velocity Field

2.1 Review

For a sufficiently small field of view (defined precisely in [26, 5]) and smooth change
in viewpoint the image velocity field and the change in apparent image shape is well
approximated by a linear (affine) transformation [16]. The latter can be decomposed
into independent components which have simple geometric interpretations. These are an
image translation (specifying the change in image position of the centroid of the shape);
a 2D rigid rotation (vorticity), specifying the change in orientation, curlv; an isotropic
expansion (divergence) specifying a change in scale, divv; and a pure shear or deformation
which describes the distortion of the image shape (expansion in a specified direction with
contraction in a perpendicular direction in such a way that area is unchanged) described
by a magnitude, defy, and the orientation of the axis of expansion (maximum extension),
#. These quantities can be defined as combinations of the partial derivatives of the image
velocity field, v = (u, y), at an image point (z, y):

divv = (u~ + vu) (1)


curly = - ( u y - v~) (2)
(defy) cos2# = (u= - vy) (3)
(defy) sin 2# = (uy + v~) (4)

where subscripts denote differentiation with respect to the subscript parameter. The curl,
divergence and the magnitude of the deformation are scalar invariants and do not depend
on the particular choice of image co-ordinate system [16, 14]. The axes of maximum
extension and contraction change with rotations of the image plane axes.
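For concreteness, the following short Python function (ours) simply transcribes equations (1)-(4) for an affine velocity field whose matrix of partial derivatives is known; the example matrix is an arbitrary illustration.

```python
# First-order invariants of an affine image velocity field v = v0 + A [x, y]^T,
# where A holds the partial derivatives [[u_x, u_y], [v_x, v_y]].
import numpy as np

def invariants(A):
    ux, uy = A[0]
    vx, vy = A[1]
    div = ux + vy                               # eq. (1)
    curl = -(uy - vx)                           # eq. (2)
    def_cos, def_sin = ux - vy, uy + vx         # eqs. (3)-(4)
    deformation = np.hypot(def_cos, def_sin)
    mu = 0.5 * np.arctan2(def_sin, def_cos)     # axis of maximum extension
    return div, curl, deformation, mu

# pure expansion plus a small shear
A = np.array([[0.10, 0.02],
              [0.02, 0.10]])
print(invariants(A))   # div = 0.2, curl = 0.0, def = 0.04, mu = pi/4 (45 degrees)
```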

2.2 Relation to 3D Shape and Viewer Motion

The differential invariants depend on the viewer motion (translational velocity, U, and
rotational velocity, Ω), depth, λ, and the relation between the viewing direction (ray di-
rection Q) and the surface orientation in a simple and geometrically intuitive way. Before
summarising these relationships let us define two 2D vector quantities: the component of
translational velocity parallel to the image plane scaled by depth λ, A, where:

A = (U − (U·Q)Q) / λ   (5)

and the depth gradient scaled by depth³, F, to represent the surface orientation, and
which we define in terms of the 2D vector gradient:

F = (grad λ) / λ   (6)

The magnitude of the depth gradient, |F|, determines the tangent of the slant of the
surface (angle between the surface normal and the visual direction). It vanishes for a
frontal view and is infinite when the viewer is in the tangent plane of the surface. Its
direction, ∠F, specifies the direction in the image of increasing distance. This is equal to
the tilt of the surface tangent plane. The exact relationship between the magnitude and
direction of F and the slant and tilt of the surface (σ, τ) is given by:

|F| = tan σ   (7)
∠F = τ .   (8)

With this new notation the relations between the differential invariants, the motion
parameters and the surface position and orientation are given by [15]:

curl v = −2Ω·Q + |F ∧ A|   (9)
div v = (2U·Q)/λ + F·A   (10)
def v = |F||A|   (11)

where μ (which specifies the axis of maximum extension) bisects A and F:

μ = (∠A + ∠F) / 2   (12)

The geometric significance of these equations is easily seen with a few examples. For
example, a translation towards the surface patch leads to a uniform expansion in the

³ Koenderink [15] defines F as a "nearness gradient", grad(log(1/λ)). In this paper F is defined
as a scaled depth gradient. These two quantities differ by a sign.

image, i.e. positive divergence. This encodes the distance to the object which, due to the
speed-scale ambiguity⁴, is more conveniently expressed as a time to contact, t_c:

t_c = λ / (U·Q) .   (13)

Translational motion perpendicular to the visual direction results in image deformation
with a magnitude which is determined by the slant of the surface, σ, and with an axis
depending on the tilt of the surface, τ, and the direction of the viewer translation. Diver-
gence (due to foreshortening) and curl components may also be present.
Note that divergence and deformation are unaffected by (and hence insensitive to
errors in) viewer rotations such as panning or tilting of the camera whereas these lead to
considerable changes in point image velocities or disparities⁵. As a consequence the defor-
mation component efficiently encodes the orientation of the surface while the divergence
component can be used to provide an estimate of the time to contact or collision.
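The way equations (9)-(13) would be used by an active observer can be sketched as follows; this Python fragment is our own reading of those equations and assumes that the scaled translation A is fully known (in practice only its direction may be available, in which case the recovered depth gradient is defined only up to the bas-relief scaling discussed below).

```python
# Recovering slant, tilt and time to contact from measured div, def and mu,
# given the scaled translation A parallel to the image plane.
import numpy as np

def orientation_and_ttc(div, deform, mu, A):
    angle_A = np.arctan2(A[1], A[0])
    angle_F = 2.0 * mu - angle_A            # eq. (12): mu bisects A and F
    mag_F = deform / np.linalg.norm(A)      # eq. (11): def v = |F||A|
    F = mag_F * np.array([np.cos(angle_F), np.sin(angle_F)])
    slant = np.arctan(mag_F)                # eq. (7): |F| = tan(slant)
    tilt = angle_F                          # eq. (8)
    ttc = 2.0 / (div - F @ A)               # eqs. (10), (13): div v = 2/t_c + F.A
    return np.degrees(slant), np.degrees(tilt), ttc

# deliberate sideways translation of the camera (A along the image x axis)
print(orientation_and_ttc(div=0.15, deform=0.10, mu=np.radians(30),
                          A=np.array([0.2, 0.0])))
```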
This formulation clearly exposes both the speed-scale ambiguity and the bas-relief
ambiguity [11]. The latter manifests itself in the appearance of surface orientation, F,
with A. Increasing the slant of the surface F while scaling the movement by the same
amount will leave the local image velocity field unchanged. Thus, from two weak perspec-
tive views and with no knowledge of the viewer translation, it is impossible to determine
whether the deformation in the image is due to a large |A| (large "turn" of the object
or "vergence angle") and a small slant or a large slant and a small rotation around
the object. Equivalently a nearby "shallow" object will produce the same effect as a far
away "deep" structure. We can only recover the depth gradient F up to an unknown
scale. These ambiguities are clearly exposed with this analysis whereas this insight is
sometimes lost in the purely algorithmic approaches to solving the equations of motion
from the observed point image velocities. A consequence of the latter is the numerically
ill-conditioned nature of structure from motion solutions when perspective effects are
small.

3 Extraction of Differential Invariants

The analysis above treated the differential invariants as observables of the image. There
are a number of ways of extracting the differential invariants from the image. These are
summarised below before presenting a novel method based on the temporal derivatives
of the moments of the area enclosed by a closed curve.

3.1 Summary of Existing Methods

1. Partial derivatives of image velocity field
4 Translational velocities appear scaled by depth making it impossible to determine whether
the effects are due to a nearby object moving slowly or a far-away object moving quickly.
5 This is somewhat related to the reliable estimation of relative depth from the relative image
velocities of two nearby points - motion parallax [21, 24, 6]. Both motion parallax and the
deformation of the image velocity field relate local measurements of relative image velocities
to scene structure in a simple way which is uncorrupted by the rotational image velocity
component. In the case of parallax, the depths are discontinuous and differences of discrete
velocities axe related to the difference of inverse depths. Equation (11) on the otherhand
assumes a smooth and continuous surface and derivatives of image velocities are related to
derivatives of inverse depth.

This is the most commonly stressed approach. It is based on recovering a dense
field of image velocities and computing the partial derivatives using discrete approx-
imations to derivatives [17] or a least squares estimation of the affine transformation
imation to derivatives [17] or a least squares estimation of the affine transformation
parameters from the image velocities estimated by spatio-temporal methods [23, 4].
The recovery of the image velocity field is usually computationally expensive and
ill-conditioned [12].
2. Point velocities in a small neighbourhood
The image velocities of a minimum of three points in a small neighbourhood are
sufficient, in principle, to estimate the components of the affine transformation and
hence the differential invariants [14, 18]. In fact it is only necessary to measure the
change in area of the triangle formed by the three points and the orientations of its
sides [7]. There is, however, no redundancy in the data and hence this method requires
very accurate image positions and velocities. In [7] this is attempted by tracking large
numbers of "corner" features [28] and using Delaunay triangulation [3] in the image
to approximate the physical world by planar facets. Preliminary results showed that
the localisation of "corner" features was insufficient for reliable estimation of the
differential invariants.
3. Relative Orientation of Line Segments
Koenderink [15] showed how temporal texture density changes can yield estimates
of the divergence. He also presented a method for recovering the curl and shear
components that employs the orientations of texture elements. Orientations are not
affected by the divergence term. They are only affected by the curl and deformation
components. In particular the curl component changes all the orientations by the
same amount. It does not affect the angles between the image edges. These are only
affected by the deformation component. The relative changes in orientation can be
used to recover deformation in a simple way since the effects of the curl component are
cancelled out. Measurement at three oriented line segments is sufficient to completely
specify the deformation components. The main advantage is that point velocities or
partial derivatives are not required.
4. Curves and Closed Contours
The methods described above require point and line correspondences. Sometimes
these are not available or are poorly localised. Often we can only reliably extract
portions of curves (although we can not always rely on the end points) or closed
contours.
Image shapes or contours only "sample" the image velocity field. At contour edges it
is only possible to measure the normal component of image velocity. This information
can in certain cases be used to recover the image velocity field. Waxman and Wohn
[30] showed how to recover the full velocity field from the normal components at
image contours. In principle, measurement of eight normal velocities around a contour
allow the characterisation of the full velocity field for a planar surface. Kanatani [13]
related line integrals of image velocities around closed contours to the motion and
orientation parameters of a planar contour. In the following we will not attempt
to solve for these structure and motion parameters directly but only to recover the
divergence and deformation.
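As a small illustration of item 3 above (this sketch is not from the paper; it spells out the standard consequence of the affine model that orientation rates depend only on the curl and the two deformation components; all names and sample values are illustrative):

```python
import numpy as np

def curl_and_def_from_orientation_rates(thetas, theta_dots):
    """Solve theta_dot_i = 0.5*(curl + s2*cos(2*theta_i) - s1*sin(2*theta_i))
    for [curl, s1, s2], where s1 = u_x - v_y and s2 = u_y + v_x are the two
    deformation components.  Three (or more) oriented segments suffice."""
    thetas = np.asarray(thetas, dtype=float)
    A = 0.5 * np.column_stack([np.ones_like(thetas),
                               -np.sin(2 * thetas),
                               np.cos(2 * thetas)])
    curl, s1, s2 = np.linalg.lstsq(A, np.asarray(theta_dots, float), rcond=None)[0]
    return curl, s1, s2            # |def v| = np.hypot(s1, s2)

# Example: a pure rotation at 0.1 rad/frame turns every segment at the same
# rate, so the recovered curl is 0.2 and both deformation components are zero.
print(curl_and_def_from_orientation_rates([0.0, 0.7, 1.9], [0.1, 0.1, 0.1]))
```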

3.2 Recovery of Invariants from Area Moments of Closed Contours
It has been shown that the differential invariants of the image velocity field conveniently
characterise the changes in apparent shape due to relative motion between the viewer and
scene. Contours in the image sample this image velocity field. It is usually only possible,

however, to recover the normal image velocity component from local measurements at a
curve [27, 12]. It is now shown that this information is often sufficient to estimate the
differential invariants within closed curves.
Our approach is based on relating the temporal derivative of the area of a closed
contour and its moments to the invariants of the image velocity field. This is a general-
isation of the result derived by Maybank [22] in which the rate of change of area scaled
by area is used to estimate the divergence of the image velocity field. The advantage is
that point or line correspondences are not used. Only the correspondence between shapes
is required. The computationally difficult, ill-conditioned and poorly defined process of
making explicit the full image velocity field [12] is avoided. Moreover, since taking tem-
poral derivatives of area (and its moments) is equivalent to the integration of normal
image velocities (scaled by simple functions) around closed contours our approach is ef-
fectively computing average values of the differential invariants (not point properties)
and has better immunity to image noise leading to reliable estimates. Areas can also be
estimated accurately, even when the full set of first order derivatives can not be obtained.
The moments of area of a contour, $I_f$, are defined in terms of an area integral with
boundaries defined by the contour in the image plane:

$$I_f = \iint_{a(t)} f \, dx \, dy \qquad (14)$$

where $a(t)$ is the area of the contour of interest at time $t$ and $f$ is a scalar function of image
position $(x, y)$ that defines the moment of interest. For instance, setting $f = 1$ gives us
area. Setting $f = x$ or $f = y$ gives the first-order moments about the image $x$ and $y$ axes
respectively.
The moments of area can be measured directly from the image (see below for a novel
method involving the control points of a B-spline snake). Better still, their temporal
derivatives can also be measured. Differentiating (14) with respect to time and using a
result from calculus⁶ we can relate the temporal derivative of the moment of area to an
integral of the normal component of image velocities at an image contour, $\mathbf{v}\cdot\mathbf{n}$, weighted
by a scalar $f(x, y)$. By Green's theorem, this integral around the contour $c(t)$ can be
re-expressed as an integral over the area enclosed by the contour, $a(t)$.

$$\frac{d}{dt}\left(I_f\right) = \oint_{c(t)} \left[f\,\mathbf{v}\cdot\mathbf{n}\right] ds \qquad (15)$$
$$= \iint_{a(t)} \left[\mathrm{div}(f\mathbf{v})\right] dx\, dy \qquad (16)$$
$$= \iint_{a(t)} \left[f\,\mathrm{div}\,\mathbf{v} + (\mathbf{v}\cdot\mathrm{grad}\, f)\right] dx\, dy \; . \qquad (17)$$

If the image velocity field, v, can be represented by constant partial derivatives in the area
of interest, substituting the coefficients of the affine transformation for the velocity field
into (17) leads to a linear equation in which the left hand side is the temporal derivative of
the moment of area described by f (which can be measured, see below) while the integrals
on the right-hand side are moments of area (also directly measurable). The coefficients
of each term are the required parameters of the affine transformation. In summary, the
⁶ This equation can be derived by considering the flux linking the area of the contour. This
changes with time since the contour is carried by the velocity field. The flux field, f, in our
example does not change with time. Similar integrals appear in fluid mechanics, e.g. the flux
transport theorem [8].

image velocity field deforms the shape of contours in the image. Shape can be described
by moments of area. Hence measuring the change in the moments of area is an alternative
way of characterising the transformation. In this way the changes in the moments of area
have been expressed in terms of the parameters of the affine transformation.
If we initially set up the x - y co-ordinate system at the centroid of the image contour
of interest so that the first moments are zero, (17) with f = x and f = y shows that
the centroid of the deformed shape specifies the mean translation [u0, v0]. Setting f = 1
leads to the simple and useful result that the divergence of the image velocity field can
be estimated as the derivative of area scaled by area:

$$\frac{d}{dt}a(t) = a(t)\,\mathrm{div}\,\mathbf{v} \; . \qquad (18)$$

Increasing the order of the moments, i.e. different values of $f(x, y)$, generates new equa-
tions and additional constraints. In principle, if it is possible to find six linearly indepen-
dent equations, we can solve for the affine transformation parameters and combine the
coefficients to recover the differential invariants. The validity of the affine approxima-
tion can be checked by looking at the error between the transformed and observed image
contours. The choice of which moments to use is a subject for further work. Listed below
are some of the simplest equations which have been useful in the experiments presented
here.

$$\frac{d}{dt}\begin{bmatrix} a \\ I_x \\ I_y \\ I_{xx} \\ I_{xy} \\ I_{yy} \end{bmatrix} =
\begin{bmatrix}
0 & 0 & a & 0 & 0 & a \\
a & 0 & 2I_x & I_y & 0 & I_x \\
0 & a & I_y & 0 & I_x & 2I_y \\
2I_x & 0 & 3I_{xx} & 2I_{xy} & 0 & I_{xx} \\
I_y & I_x & 2I_{xy} & I_{yy} & I_{xx} & 2I_{xy} \\
0 & 2I_y & I_{yy} & 0 & 2I_{xy} & 3I_{yy}
\end{bmatrix}
\begin{bmatrix} u_0 \\ v_0 \\ u_x \\ u_y \\ v_x \\ v_y \end{bmatrix} \qquad (19)$$
(Note that in this equation subscripts are used to label the moments of area. The left-hand
side represents the temporal derivative of the moments in the column vector.) In practice
certain contours may lead to equations which are not independent and their solution is
ill-conditioned. The interpretation of this is that the normal components of image velocity
are insufficient to recover the true image velocity field globally, e.g. a fronto-parallel circle
rotating about the optical axis. This was termed the "aperture problem in the large" by
Waxman and Wohn [30] and investigated by Bergholm and Carlsson [2]. Note however,
that it is always possible to recover the divergence from a closed contour.
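To make the use of (19) concrete, the following minimal Python sketch assumes the moment and parameter ordering written above (itself a reconstruction): given measured area moments and finite-difference estimates of their temporal derivatives, it solves the linear system and forms the divergence, curl and deformation magnitude. All function and variable names are illustrative.

```python
import numpy as np

def affine_from_moments(a, Ix, Iy, Ixx, Ixy, Iyy, dmom_dt):
    """Solve equation (19): dmom_dt is the 6-vector of temporal derivatives of
    (a, Ix, Iy, Ixx, Ixy, Iyy); returns (u0, v0, ux, uy, vx, vy)."""
    M = np.array([
        [0,     0,     a,      0,      0,     a    ],
        [a,     0,     2*Ix,   Iy,     0,     Ix   ],
        [0,     a,     Iy,     0,      Ix,    2*Iy ],
        [2*Ix,  0,     3*Ixx,  2*Ixy,  0,     Ixx  ],
        [Iy,    Ix,    2*Ixy,  Iyy,    Ixx,   2*Ixy],
        [0,     2*Iy,  Iyy,    0,      2*Ixy, 3*Iyy],
    ], dtype=float)
    return np.linalg.solve(M, np.asarray(dmom_dt, dtype=float))

def invariants(u0, v0, ux, uy, vx, vy):
    """First-order differential invariants of the affine velocity field."""
    div  = ux + vy
    curl = vx - uy
    defo = np.hypot(ux - vy, uy + vx)   # magnitude of the deformation
    return div, curl, defo
```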

4 Recovery of Surface Orientation and Time to Contact

Applications of the estimates of the image divergence and deformation of the image
velocity field are summarised below. It has already been noted that measurement of the
differential invariants in a single neighbourhood is insufficient to completely solve for
the structure and motion since (9,10,11,12) are four equations in the six unknowns of
scene structure and motion. In a single neighbourhood a complete solution would require
the computation of second order derivatives [21, 29] to generate sufficient equations to
solve for the unknowns. Even then the solution of the resulting set of non-linear equations
is non-trivial.
In the following, the information available from the first-order differential invariants
alone is investigated. It will be seen that the differential invariants are sufficient to con-
strain surface position and orientation and that this partial solution can be used to

perform useful visual tasks when augmented with additional information. Useful appli-
cations include providing information which is used by pilots when landing aircraft [10],
estimating time to contact in braking reactions [20] and in the recovery of 3D shape up
to a relief transformation [18, 19]. We now show how surface orientation and position
(expressed as a time to contact) can be recovered from the estimates of image divergence
and the magnitude and axis of the deformation.

1. With knowledge of translation but arbitrary rotation
An estimate of the direction of translation is usually available when the viewer is
making deliberate movements (in the case of active vision) or in the case of binocular
vision (where the camera or eye positions are constrained). It can also be estimated
from image measurements by motion parallax [21, 24].
If the viewer translation is known, (10),(11) and (12) are sufficient to unambiguously
recover the surface orientation and the distance to the object in temporal units. Due
to the speed-scale ambiguity the latter is expressed as a time to contact. A solution
can be obtained in the following way.
(a) The axis of expansion of the deformation component and the projection in
the image of the direction of translation, ∠A, allow the recovery of the tilt of
the surface from (12).
(b) We can then subtract the contribution due to the surface orientation and viewer
translation parallel to the image axis from the image divergence (10). This is
equal to |def v| cos(τ − ∠A). The remaining component of divergence is due to
movement towards or away from the object. This can be used to recover the time
to contact, t_c. This can be recovered despite the fact that the viewer translation
may not be parallel to the visual direction.
(c) The time to contact fixes the viewer translation in temporal units. It allows the
specification of the magnitude of the translation parallel to the image plane (up
to the same speed-scale ambiguity), A. The magnitude of the deformation can
then be used to recover the slant, σ, of the surface from (11).
The advantage of this formulation is that camera rotations do not affect the estima-
tion of shape and distance. The effects of errors in the direction of translation are
clearly evident as scalings in depth or by a relief transformation [15].
2. With fixation
If the cameras or eyes rotate to keep the object of interest in the middle of the image
(null the effect of image translation) the magnitude of the rotations needed to bring
the object back to the centre of the image determines A and hence allows us to solve
for surface orientation, as above. Again the major effect of any error in the estimate
of rotation is to scale depth and orientations.
3. With no additional information - constraints on motion
Even without any additional assumptions it is still possible to obtain useful infor-
mation from the first-order differential invariants. The information obtained is best
expressed as bounds. For example, inspection of (10) and (11) shows that the time to
contact must lie in an interval given by:

$$\frac{1}{t_c} = \frac{\mathrm{div}\,\mathbf{v}}{2} \pm \frac{|\mathrm{def}\,\mathbf{v}|}{2} \qquad (20)$$

(a small numerical sketch of these bounds is given at the end of this section). The
upper bound on time to contact occurs when the component of viewer transla-
tion parallel to the image plane is in the opposite direction to the depth gradient.
The lower bound occurs when the translation is parallel to the depth gradient. The
upper and lower estimates of time to contact are equal when there is no deformation

component. This is the case in which the viewer translation is along the ray or when
viewing a fronto-parallel surface (zero depth gradient locally). The estimate of time
to contact is then exact. A similar equation was recently described by Subbarao [25].
4. With no additional information - the constraints on 3D shape
Koenderink and Van Doorn [18] showed that surface shape information can be ob-
tained by considering the variation of the deformation component alone in a small field
of view when weak perspective is a valid approximation. This allows the recovery of
3D shape up to a scale and relief transformation. That is, they effectively recover the
axis of rotation of the object but not the magnitude of the turn. This yields a family
of solutions depending on the magnitude of the turn. Fixing the latter determines the
slants and tilts of the surface. This has recently been extended in the affine structure
from motion theorem [19].
The solutions presented above use knowledge of a single viewer translation and mea-
surement of the divergence and deformation of the image velocity field. An alternative
solution exists if the observer is free to translate along the ray and also in two orthogonal
directions parallel to the image plane. In this case measurement of divergence alone is
sufficient to recover the surface orientation and the time to contact.
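As flagged in item 3 of the list above, here is a minimal sketch of the time-to-contact bounds implied by (20); the function name and the sample numbers are illustrative assumptions.

```python
import numpy as np

def time_to_contact_bounds(div_v, def_v):
    """Bounds on the time to contact from measured divergence and deformation
    magnitude (equation (20)): 1/t_c = div/2 +/- |def|/2."""
    inv_hi = 0.5 * (div_v + abs(def_v))      # largest possible 1/t_c
    inv_lo = 0.5 * (div_v - abs(def_v))      # smallest possible 1/t_c
    t_min = 1.0 / inv_hi
    t_max = 1.0 / inv_lo if inv_lo > 0 else np.inf
    return t_min, t_max

# e.g. div v = 0.4 /frame and |def v| = 0.1 /frame bound t_c between 4 and ~6.7 frames.
print(time_to_contact_bounds(0.4, 0.1))
```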

5 Implementation and Experimental Results

5.1 Tracking Closed Loop Contours
The implementation and results follow. Multi-span closed loop B-spline snakes [6] are
used to localise and track closed image contours. The B-spline is a curve in the image
plane

$$\mathbf{x}(s) = \sum_i f_i(s)\, \mathbf{q}_i \qquad (21)$$

where the $f_i$ are the spline basis functions with coefficients $\mathbf{q}_i$ (the control points of the curve)
and $s$ is a curve parameter (not necessarily arc length) [1]. The snakes are initialised as
points in the centre of the image and are forced to expand radially outwards until they
are in the vicinity of an edge, where image "forces" make the snake stabilise close to a
high contrast closed contour. Subsequent image motion is automatically tracked by the
snake [5].
B-spline snakes have useful properties such as local control and continuity. They also
compactly represent image curves. In our applications they have the additional advantage
that the area enclosed is a simple function of the control points. This also applies to the
other area moments. From Green's theorem in the plane it is easy to show that the area
enclosed by a curve with parameterisation x(s) and y(s) is given by:

$$a = \oint x(s)\, y'(s)\, ds \qquad (22)$$

where x(s) and y(s) are the two components of the image curve and y'(s) is the derivative
with respect to the curve parameter s. For a B-spline, substituting (21) and its derivative:

$$a(t) = \oint \Big(\sum_i f_i(s)\, q_{x_i}\Big)\Big(\sum_j f_j'(s)\, q_{y_j}\Big)\, ds \qquad (23)$$
$$= \sum_i \sum_j q_{x_i} q_{y_j} \oint f_i(s)\, f_j'(s)\, ds \; . \qquad (24)$$

Note that for each span of the B-spline and at each time instant the basis functions
remain unchanged. The integrals can thus be computed off-line in closed form. (At most
16 coefficients need be stored. In fact, due to symmetry, there are only 10 possible values
for a cubic B-spline.) At each time instant multiplication with the control point positions
gives the area enclosed by the contour. This is extremely efficient, giving the exact area
enclosed by the contour. The same method can be used for higher moments of area as
well. The temporal derivatives of the area and its moments are then used to estimate image
divergence and deformation.
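A minimal Python sketch of this computation (not the authors' implementation): it assumes a closed uniform periodic cubic B-spline, precomputes the span integrals of basis function products once, and then evaluates the enclosed signed area as a quadratic form in the control points, as in (24).

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Uniform cubic B-spline basis on one span, parameter u in [0, 1];
# the four polynomials weight control points q_k, ..., q_{k+3}.
b = [P([1, -3, 3, -1]) / 6,        # (1 - u)^3 / 6
     P([4, 0, -6, 3]) / 6,         # (3u^3 - 6u^2 + 4) / 6
     P([1, 3, 3, -3]) / 6,         # (-3u^3 + 3u^2 + 3u + 1) / 6
     P([0, 0, 0, 1]) / 6]          # u^3 / 6

# Span coefficients C[m, l] = integral_0^1 b_m(u) b_l'(u) du, computed once off-line.
C = np.array([[(b[m] * b[l].deriv()).integ()(1.0) - (b[m] * b[l].deriv()).integ()(0.0)
               for l in range(4)] for m in range(4)])

def bspline_area(qx, qy):
    """Signed area enclosed by a closed uniform cubic B-spline snake with
    control points (qx, qy): a = sum over spans of qx[k+m] qy[k+l] C[m, l]."""
    n, a = len(qx), 0.0
    for k in range(n):
        for m in range(4):
            for l in range(4):
                a += qx[(k + m) % n] * qy[(k + l) % n] * C[m, l]
    return a

# From equation (18), the divergence of the flow inside the contour is then
# approximately (a(t + dt) - a(t)) / (dt * a(t)) for two tracked snapshots.
```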

5.2 Applications

Here we present the results of a preliminary implementation of the theory. The examples
are based on a camera mounted on a robot arm whose translations are deliberate while
the rotations around the camera centre are performed to keep the target of interest in the
centre of its field of view. The camera intrinsic parameters (image centre, scaling factors
and focal length) and orientation are unknown. The direction of translation is assumed
known and expressed with bounds due to uncertainty.

Braking. Figure 1 shows four samples from a sequence of images taken by a moving
observer approaching the rear windscreen of a stationary car in front. In the first frame
(time t = 0) the relative distance between the two cars is approximately 7m. The velocity
of approach is uniform and approximately 1m/time unit.
A B-spline snake is initialised in the centre of the windscreen, and expands out until
it localises the closed contour of the edge of the windscreen. The snake can then auto-
matically track the windscreen over the sequence. Figure 2 plots the apparent area, a(t)
(relative to the initial area, a(0)) as a function of time, t. For uniform translation along
the optical axis the relationship between area and time can be derived from (10) and
(18) by solving the first-order partial differential equation:

$$\frac{d}{dt}\big(a(t)\big) = \left(\frac{2}{t_c(0) - t}\right) a(t) \; . \qquad (25)$$

Its solution is given by:


$$a(t) = a(0)\left(\frac{t_c(0)}{t_c(0) - t}\right)^{2} \qquad (26)$$

where $t_c(0)$ is the time to contact at time $t = 0$:

$$t_c(0) = \frac{\lambda(0)}{\mathbf{U}\cdot\mathbf{Q}} \; . \qquad (27)$$

This is in close agreement with the data (Fig. 2a). This is more easily seen if we look at
the variation of the time to contact with time. For uniform motion this should decrease
linearly. The experimental results are plotted in Fig. 2b. These are obtained by dividing
the area of the contour at a given time by its temporal derivative (estimated by finite
differences). The variation is linear, as predicted. These results are of useful accuracy,
predicting the collision time to the nearest half time unit (corresponding to 50cm in this
example).
For non-uniform motion the profile of the time to contact as a function of time is a
very important cue for braking and landing reactions [20].
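A small sketch of this estimate (an illustration under the stated assumptions, not the authors' code): given tracked areas a(t_k), the divergence follows from (18) and, for motion along the optical axis, t_c = 2 / div v.

```python
import numpy as np

def time_to_contact_from_areas(areas, dt=1.0):
    """Per-frame time-to-contact estimates from a sequence of contour areas,
    assuming translation along the optical axis: div v = (da/dt)/a, t_c = 2/div v."""
    a = np.asarray(areas, dtype=float)
    dadt = np.gradient(a, dt)          # finite-difference temporal derivative
    return 2.0 * a / dadt

# For a uniform approach these estimates should decrease by dt from frame to
# frame, as in Fig. 2b.
```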

Collision avoidance. It is well known that image divergence can be used in obstacle
collision avoidance. Nelson and Aloimonos [23] demonstrated a robotics system which
computed divergence by spatio-temporal techniques applied to the images of highly tex-
tured visible surfaces. We describe a real-time implementation based on image contours
which "acts" on the visually derived information.
Figure 3 shows the results of a camera mounted on an Adept robot manipulator and
pointing in the direction of a target contour. (We hope to extend this so that the robot
initially searches by rotation for a contour of interest. In the present implementation,
however, the target object is placed in the centre of the field of view.) The closed contour
is then localised automatically by initialising a closed loop B-spline snake in the centre
of the image. The snake "explodes" outwards and deforms under the influence of image
forces which cause it to be attracted to high contrast edges.
The robot manipulator then makes a deliberate motion towards the target. Tracking
the area of the contour and computing its rate of change allows us to estimate the diver-
gence. For motion along the visual ray this is sufficient information to estimate the time
to contact or impact. The estimate of time to contact - decreased by the uncertainty in
the measurement and any image deformation (20) - can be used to guide the manipulator
so that it stops just before collision (Fig. 3d). The manipulator, in fact, travels "blindly"
after its sensing actions (above) and at a uniform speed for the time remaining until
contact. In repeated trials, image divergences measured at distances of 0.5m to 1.0m were
estimated accurately to the nearest half of a time unit. This corresponds to a positional
accuracy of 20mm for a manipulator translational velocity of 40mm/s.
The affine transformation approximation breaks down at close proximity to the target.
This may lead to a degradation in the estimate of time to contact when very close to the
target.

Landing reactions and object manipulation. If the translational motion has a com-
ponent parallel to the image plane, the image divergence is composed of two components.
The first is the component which determines immediacy or time to contact. The other
term is due to image foreshortening when the surface has a non-zero slant. The two ef-
fects can be computed separately by measuring the deformation. The deformation also
allows us to recover the surface orientation.
Note that unlike stereo vision, the magnitude of the translation is not needed. Nor are
the camera parameters known or calibrated (focal length and aspect ratio are not needed
for divergence). Nor are the magnitudes and directions of the camera rotations needed to
keep the target in the field of view. Simple measurements of area and its moments -
obtained in closed form as a function of the B-spline snake control points - were used to
estimate divergence and deformation. The only assumption was of uniform motion and
known direction of translation.
Figure 3 shows an example in which a robot manipulator uses these estimates of time
to contact and surface orientation to approach the object surface perpendicularly so as to
position a suction gripper for manipulation. The image contours are shown in Fig. 3a and
3b highlighting the effect of deformation due to the sideways component of translation.
The successful execution is shown in Fig. 3c and 3d.

Qualitative visual navigation. Existing techniques for visual navigation have typically
used stereo or the analysis of image sequences to determine the camera ego-motion and
then the 3D positions of feature points. The 3D data are then analysed to determine, for
example, navigable regions, obstacles or doors. An example of an alternative approach
is presented. This computes qualitative information about the orientation of surfaces
and times to contact from estimates of image divergence and deformation. The only
requirement is that the viewer can make deliberate movements or has stereoscopic vision.
Figure 4a shows the image of a door and an object of interest, a pallet. Movement towards
the door and pallet produce a deformation in the image. This is seen as an expansion
in the apparent area of the door and pallet in Fig. 4b. This can be used to determine
the distance to these objects, expressed as a time to contact - the time needed for
the viewer to reach the object if the viewer continued with the same speed. The image
deformation is not significant. Any component of deformation can, anyhow, be absorbed
by (20) as a bound on the time to contact. A movement to the left (Fig. 4c) produces
image deformation, divergence and rotation. This is immediately evident from both the
door (positive deformation and a shear with a horizontal axis of expansion) and the
pallet (clockwise rotation with shear with diagonal axis of expansion). These effects, with
the knowledge of the direction of translation between the images taken at Figs. 4a and
4c, are consistent with the door having zero tilt, i.e. horizontal direction of increasing
depth, while the pallet has a tilt of approximately 90°, i.e. vertical direction of increasing
depth. These are the effects predicted by (9, 10, 11 and 12) even though there are also
strong perspective effects in the images. They are sufficient to determine the orientation
of the surface qualitatively (Fig. 4d). This has been done without knowledge of the
intrinsic properties of the cameras (camera calibration), the orientations of the cameras,
their rotations or translational velocities. No knowledge of epipolar geometry is used
to determine exact image velocities or disparities. The solution is incomplete. It can,
however, be easily augmented into a complete solution by adding additional information.
Knowing the magnitude of the sideways translational velocity, for example, can determine
the exact quantitative orientations of the visible surfaces.

6 Conclusions

We have presented a simple and efficient method for estimating image divergence and
deformation by tracking closed image contours with B-spline snakes. This information
has been successfully used to estimate surface orientation and time to contact.

Acknowledgements

The authors acknowledge discussions with Mike Brady, Kenichi Kanatani, Christopher
Longuet-Higgins, and Andrew Zisserman. This work was partially funded by Esprit BRA
3274 (FIRST) and the SERC. Roberto Cipolla also gratefully acknowledges the support
of the IBM UK Scientific Centre, St. Hugh's College, Oxford and the Toshiba Research
and Development Centre, Japan.

References

1. R.H. Bartels, J.C. Beatty, and B.A. Barsky. An Introduction to Splines for use in Computer
Graphics and Geometric Modeling. Morgan Kaufmann, 1987.
2. F. Bergholm. Motion from flow along contours: a note on robustness and ambiguous case.
Int. Journal of Computer Vision, 3:395-415, 1989.
3. J.D. Boissonat. Representing solids with the Delaunay triangulation. In Proc. ICPR, pages
745-748, 1984.

4. M. Campani and A. Verri. Computing optical flow from an overconstrained system of linear
algebraic equations. In Proc. 3rd Int. Conf. on Computer Vision, pages 22-26, 1990.
5. R. Cipolla. Active Visual Inference of Surface Shape. PhD thesis, University of Oxford,
1991.
6. R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Proc. 3rd Int.
Conf. on Computer Vision, pages 616-623, 1990.
7. R. Cipolla and P. Kovesi. Determining object surface orientation and time to impact from
image divergence and deformation. (University of Oxford (Memo)), 1991.
8. H.F. Davis and A.D. Snider. Introduction to vector analysis. Allyn and Bacon, 1979.
9. E. Francois and P. Bouthemy. Derivation of qualitative information in motion analysis.
Image and Vision Computing, 8(4):279-288, 1990.
10. J.J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
11. C.G. Harris. Structure from motion under orthographic projection. In O. Faugeras, editor,
Proc. 1st European Conference on Computer Vision, pages 118-123. Springer-Verlag, 1990.
12. E.C. Hildreth. The measurement of visual motion. The MIT press, Cambridge Mas-
sachusetts, 1984.
13. K. Kanatani. Detecting the motion of a planar surface by line and surface integrals. Com-
puter Vision, Graphics and Image Processing, 29:13-22, 1985.
14. K. Kanatani. Structure and motion from optical flow under orthographic projection. Com-
puter Vision, Graphics and Image Processing, 35:181-199, 1986.
15. J.J. Koenderink. Optic flow. Vision Research, 26(1):161-179, 1986.
16. J.J. Koenderink and A.J. Van Doorn. Invariant properties of the motion parallax field due
to the movement of rigid bodies relative to an observer. Optica Acta, 22(9):773-791, 1975.
17. J.J. Koenderink and A.J. Van Doorn. How an ambulant observer can construct a model
of the environment from the geometrical structure of the visual inflow. In G. Hauske and
E. Butenandt, editors, Kybernetik. Oldenburg, Munchen, 1978.
18. J.J. Koenderink and A.J. Van Doorn. Depth and shape from differential perspective in the
presence of bending deformations. J. Opt. Soc. Am., 3(2):242-249, 1986.
19. J.J. Koenderink and A.J. van Doorn. Affine structure from motion. Journal of the Optical
Society of America, 1991.
20. D.N. Lee. The optic flow field: the foundation of vision. Phil. Trans. R. Soc. Lond., 290,
1980.
21. H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proc.
R. Soc. Lond., B208:385-397, 1980.
22. S. J. Maybank. Apparent area of a rigid moving body. Image and Vision Computing,
5(2):111-113, 1987.
23. R.C. Nelson and J. Aloimonos. Using flow field divergence for obstacle avoidance: towards
qualitative vision. In Proc. 2nd Int. Conf. on Computer Vision, pages 188-196, 1988.
24. J.H. Rieger and D.L. Lawton. Processing differential image motion. J. Optical Soc. of
America, A2(2), 1985.
25. M. Subbarao. Bounds on time-to-collision and rotational component from first-order
derivatives of image flow. Computer Vision, Graphics and Image Processing, 50:329-341,
1990.
26. D.W. Thompson and J.L. Mundy. Three-dimensional model matching from an uncon-
strained viewpoint. In Proceedings of IEEE Conference on Robotics and Automation, 1987.
27. S. Ullman. The interpretation of visual motion. MIT Press, Cambridge, USA, 1979.
28. H. Wang, C. Bowman, M. Brady, and C. Harris. A parallel implementation of a structure
from motion algorithm. In Proc. 2nd European Conference on Computer Vision, 1992.
29. A.M. Waxman and S. Ullman. Surface structure and three-dimensional motion from image
flow kinematics. Int. Journal of Robotics Research, 4(3):72-94, 1985.
30. A.M. Waxman and K. Wohn. Contour evolution, neighbourhood deformation and global
image flow: planar surfaces in motion. Int. Journal of Robotics Research, 4(3):95-108, 1985.

Fig. 1. Using image divergence to estimate time to contact.

Four samples of a video sequence taken from a moving observer approaching a stationary car
at a uniform velocity (approximately 1m per time unit). A B-spline snake automatically tracks
the area of the rear windscreen (Fig. 2a). The image divergence is used to estimate the time to
contact (Fig. 2b). The next image in the sequence corresponds to collision!

[Fig. 2 plots: (a) relative area a(t)/a(0) versus time (frame number); (b) time to contact (frames) versus time (frame number).]

Fig. 2. Apparent area of windscreen for approaching observer and the estimated time to contact.

Fig. 3. Visually guided object manipulation using image divergence and deformation.

(a) The image of a planar contour (zero tilt and positive slant, i.e. the direction of increasing
depth, F, is horizontal and from left to right). The image contour is localised automatically by
a B-spline snake initialised in the centre of the field of view. (b) The effect on apparent shape
when the viewer translates to the right while fixating on the target (i.e. A is horizontal, left to
right). The apparent shape undergoes an isotropic expansion (positive divergence which increases
the area) and a deformation in which the axis of expansion is horizontal. Measurement of the
divergence and deformation can be used to estimate the time to contact and surface orientation.
This is used to guide the manipulator so that it comes to rest perpendicular to the surface with
a pre-determined clearance. Estimates of divergence and deformation made approximately 1m
away were sufficient to estimate the target object position and orientation to the nearest 2cm
in position and 1° in orientation. This information is used to position a suction gripper in the
vicinity of the surface. A contact sensor and small probing motions can then be used to refine
the estimate of position and guide the suction gripper before manipulation (d).

Fig. 4. Qualitative visual navigation using image divergence and deformation.

(a) The image of a door and an object of interest, a pallet. (b) Movement towards the door
and pallet produces a deformation in the image seen as an expansion in the apparent area of
the door and pallet. This can be used to determine the distance to these objects, expressed as a
time to contact - the time needed for the viewer to reach the object if it continued with the same
speed. (c) A movement to the left produces combinations of image deformation, divergence and
rotation. This is immediately evident from both the door (positive deformation and a shear with
a horizontal axis of expansion) and the pallet (clockwise rotation with shear with diagonal axis of
expansion). These effects, combined with knowledge of the movement between the images,
are consistent with the door having zero tilt, i.e. horizontal direction of increasing depth, while
the pallet has a tilt of approximately 90°, i.e. vertical direction of increasing depth. They are
sufficient to determine the orientation of the surface qualitatively (d). This has been done with
no knowledge of the intrinsic properties of the camera (camera calibration), its orientations or
the translational velocities. Estimates of divergence and deformation can also be recovered by
comparison of apparent areas and the orientation of edge segments.
Robust and fast computation of unbiased intensity
derivatives in images

Thierry Vieville and Olivier D. Faugeras


INRIA-Sophia, 2004 Route des Lucioles, 06560 Valbonne, France

Abstract. In this paper we develop high order non-biased spatial deriva-
tive operators, with subpixel accuracy. Our approach is discrete and pro-
vides a way to obtain some of the spatio-temporal parameters from an image
sequence. In this paper we concentrate on spatial parameters.

1 Introduction
Edges are important features in an image. Detecting them in static images is now a well
understood problem. In particular, an optimal edge-detector using Canny's criterion has
been designed [8,7]. In subsequent studies this method has been generalized to the com-
putation of 3D-edges [5]. This edge-detector however has not been designed to compute
edge geometric and dynamic characteristics, such as curvature and velocity.
It is also well known that robust estimates of the image geometric and dynamic
characteristics should be computed at points in the image with a high contrast, that is,
edges. Several authors have attempted to combine an edge-detector with other operators,
in order to obtain a relevant estimate of some components of the image features, or the
motion field [2], but they use the same derivative operators for both problems.
However, it is not likely that the computation of edge characteristics has to be done
in the same way as edge detection, and we would like to analyse this fact in this paper.
Since edge geometric characteristics are related to the spatial derivatives of the picture
intensity [2], we have to study how to compute "good" intensity derivatives, that is,
derivatives suitable for estimating edge characteristics.
In this paper, we attempt to answer this question, and propose a way to compute
image optimal intensity derivatives, in the discrete case.

2 Computing optimal spatial derivatives


2.1 Position of the problem
We consider the following two properties for a derivative filter :

- A derivative filter is unbiased if it outputs only the required derivative, but not lower
or higher order derivatives of the signal.
- Among these filters, a derivative filter is optimal if it minimizes the noise present in
the signal. In our case we minimize the output noise.

Please note that we are not dealing with filters for detecting edges here, but rather
- edges having been already detected - with derivative filters to compute edge charac-
teristics. It is thus not relevant to consider other criteria used in optimal edge detection
such as localization or false edge detection [1].
In fact, spatial derivatives are often computed in order to detect edges with accuracy
and robustness. Performances of edge detectors are given in terms of localization and signal
to noise ratio [1,8]. Although the related operators are optimal for this task, they might
not be suitable to compute unbiased intensity derivatives on the detected edge. Moreover,
it has been pointed out [9] that an important requirement of derivative filters, in the
case where one wants to use differential equations of the intensity, is the preservation
of the intensity derivatives, which is not the case for usual filters. However, this author
limits his discussion to Gaussian filters, whereas we would like to derive a general set of
optimal filters for the computation of temporal or spatial derivatives. We are first going
to demonstrate some properties of such filters in the continuous or discrete case and then
use an equivalent formulation in the discrete case.

2.2 Unbiased filters with minimum output noise


A condition for unbiasness.
Let us note ⊗ the convolution product. According to our definition of unbiasness, a
1D-filter $f_r$ is an unbiased $r$th-order derivator if and only if:

$$f_r(x) \otimes u(x) = \frac{d^r u(x)}{dx^r}$$

for all $C^r$ functions $u$.
In particular, for $u(x) = x^n$, we have a set of necessary conditions:

$$f_r(x) \otimes x^n = n(n-1)\cdots(n-r+1)\, x^{n-r} = \frac{n!}{(n-r)!}\, x^{n-r}$$

which is a generalization of the condition proposed by Weiss [9].
But, considering a Taylor expansion $u(x) = \sum_q \frac{d^q u}{dx^q}\Big|_{x=0} \frac{x^q}{q!}$ around zero for a
$C^r$ function, and using the fact that polynomials form a dense family over the set of $C^r$
functions, this enumerable set of conditions is also sufficient.
The previous conditions can be rewritten as:

$$\int f_r(t)\,(x-t)^n\, dt = \sum_{q=0}^{n} x^{n-q} \binom{n}{q} \int f_r(t)\,(-t)^q\, dt = \frac{n!}{(n-r)!}\, x^{n-r}$$

and these $x$-polynomial equations are verified if and only if all the coefficients are equal,
that is:

$$\int f_r(t)\, \frac{(-t)^q}{q!}\, dt = \delta_{qr} \; . \qquad (1)$$
Equations (1) are thus necessary and sufficient conditions of unbiasness. Moreover, if
$f_r$ is an unbiased $r$-order filter, $f_{r+1} = f_r'$ is an unbiased $(r+1)$-order filter since, integrating
by parts,

$$f_r'(x) \otimes x^n = \int f_r'(x-t)\, t^n\, dt = \left[-f_r(x-t)\, t^n\right] + n \int f_r(x-t)\, t^{n-1}\, dt
= n\, \frac{(n-1)!}{((n-1)-r)!}\, x^{(n-1)-r} = \frac{n!}{(n-(r+1))!}\, x^{n-(r+1)} \; .$$

If equation (1) is true for all $q$, the filter will be an unbiased derivative filter. It is
important to note that this condition should be verified for $q < r$, but also for $q > r$.
If not, high-order derivatives will have a response through the filter and the output will
be biased. This is the case for Canny-Deriche filters, and this is an argument to derive
another set of filters.
In fact, the only solution to this problem is the $r$th derivative of the Dirac distri-
bution, $\delta^{(r)}$. This is not an interesting solution because it is just the "filter" whose output
noise is maximal (no filtering!). However, in practice, the input signal's high-order deriva-
tives are negligible, and we can only consider unbiasness conditions for $0 \le q \le Q < \infty$.

Minimizing the output noise. In the last paragraph we have obtained a set of condi-
tions for unbiasness. Among all filters which satisfy these conditions, let us compute the
best one, considering a criterion related to the noise.
The mean-squared noise response, considering a white noise of variance 1, is (see
[1], for instance) $\int f_r(t)^2\, dt$, and a reasonable optimality condition is to find the filter
which minimizes this quantity and satisfies the constraints given by equation (1). Using
the opposite of the standard Lagrange multipliers $\lambda_p$, this might be written as:

$$\min \;\; \frac{1}{2} \int f_r(t)^2\, dt - \sum_{p=0}^{Q} \lambda_p \left[ \int f_r(t)\, \frac{(-t)^p}{p!}\, dt - \delta_{pr} \right] .$$
From the calculus of variations, one can derive the Euler equation, which is a necessary
condition and which turns out to be, with the constraints, also sufficient in our case, since
we have a positive quadratic criterion with linear constraints.
The optimal filter equations (Euler equations and constraints) are then:

$$f_r(t) = \sum_{p=0}^{Q} \lambda_p\, t^p, \qquad
\int f_r(t)\, \frac{(-t)^q}{q!}\, dt = \delta_{qr}, \quad 0 \le q \le Q \; .$$
These equations are necessary conditions for the filter to be optimal. They yield
polynomial filters. Functions verify these equations only where they are defined, and
polynomial filters are only integrable on finite supports. Thus these equations are convergent if
and only if $f_r(t)$ has a finite support. That is, we obtain optimal filters minimizing output
noise only on a finite window.
These equations have the following consequence: the optimal derivative filter is a
polynomial filter and is thus only defined on a finite window. If not, the Euler equations
are no longer defined. In fact, we also studied infinite response filters, but we came up with a
negative answer: even when considering special families of infinite response filters (such as
products of polynomials with Gaussians or exponentials) and applying the same constrained
optimality criterion, it is not possible to obtain analytic filters as an infinite series of the
original basis of functions, because the summation is divergent (see however section 2.5
for a discussion of sub-optimal solutions).
We thus have to work on finite windows, and in this case we can compute the values
of $\lambda_p$ from a set of linear equations, since from the Euler equation and the constraints
we obtain:

$$\sum_{p=0}^{Q} \lambda_p \int \frac{(-t)^q\, t^p}{q!}\, dt = \delta_{qr} \qquad (2)$$

for $0 \le q \le Q$.
Equations (2) define a unique optimal unbiased $r$-order filter. However, if $f_r$ is this
optimal unbiased $r$-order filter, $f_{r+1} = f_r'$ is not the optimal unbiased $(r+1)$-order filter,
as can easily be verified, so each filter has to be computed separately.
2.3 An equivalent parametric approach using polynomial approximation
There is another way to compute these derivatives, considering the Taylor expansion of
the input as a parametric model. Writing:

$$x(t) \simeq \sum_{q=0}^{Q} x_q\, \frac{t^q}{q!} + \mathcal{N}oise \qquad (3)$$

one can minimize $J = \frac{1}{2} \int \left[ x(t) - \sum_{q=0}^{Q} x_q\, \frac{t^q}{q!} \right]^2 dt$, which is just a least-squares criterion
with a similar interpretation, since we minimize the variance of the residual error.

This positive quadratic criterion is minimum for $\frac{\partial J}{\partial x_p} = 0$, which provides a set of
linear equations in the $x_q$:

$$\int x(t)\, t^p\, dt = \sum_{q=0}^{Q} \frac{\int t^{p+q}\, dt}{q!}\, x_q \; . \qquad (4)$$

But the quantities $x_q$ are just equal to the outputs of the optimal filters computed
previously, $f_q() \otimes x()$ evaluated at $t = 0$, so both approaches are equivalent.
Considering a signal with derivatives up to a given order $Q$, it is thus possible to
compute unbiased estimators of these derivatives with a minimum of output noise by
solving a least-squares problem, as in equation (3). This result is not a surprise for someone
familiar with optimization but is crucial when implementing such filters in the discrete
case, as done now.
Please note that the integration $\int \ldots dt$ is to be made over a bounded domain, in
order for this integration to be convergent for polynomials, but all the computations are
valid for any Lebesgue integral. In particular, this is still valid for a finite summation, a
finite summation of definite integrals, etc. This will be used in the next sections.
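A minimal discrete illustration of this equivalence in Python. Note that it uses the naive point-sampling model; the discretization actually advocated in section 2.6 integrates the intensity over each pixel instead. The window size, model order and test polynomial are arbitrary choices.

```python
import numpy as np
from math import factorial

# Discrete window t = -W..W and signal model x(t) ~ sum_q x_q t^q / q!  (equation (3)).
W, Q = 3, 4
t = np.arange(-W, W + 1, dtype=float)
A = np.stack([t**q / factorial(q) for q in range(Q + 1)], axis=1)

# The least-squares estimates are pinv(A) @ samples, so row r of pinv(A) *is* a
# discrete unbiased r-th order derivative filter with minimum output noise.
filters = np.linalg.pinv(A)

# Check: applied to samples of any polynomial of degree <= Q, the r-th filter
# returns its r-th derivative at t = 0 (up to rounding).
p = np.poly1d([0.5, -1.0, 2.0, 3.0])        # 0.5 t^3 - t^2 + 2 t + 3
print(filters[2] @ p(t))                    # close to p''(0) = -2
```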
2.4 Continuous implementation of unbiased filters
While the continuous implementation of such filters is not directly usable in image
processing, it is very helpful for studying the properties and characteristics of these filters. In
addition, we can compare these filters with other derivative filters, as used in edge
detection.
Let us consider a finite window. For reasons of isotropy this window has to be sym-
metric, $[-W, W]$, and it corresponds to a zero-phase non-causal filter. Moreover, by changing
the scale factor it is always possible to consider $W = 1$.
We compute filters for $0 \le r \le 3$ and $r \le Q \le 6$ and obtain the curves given in Fig. 2.
The related output noise $\int f_r(t)^2\, dt$ is shown in Table 1.

Q = 0 ... 7 (model order increasing from left to right within each row)
Smoother        0.5/W       1.1/W        1.8/W         2.4/W
First Order     1.5/W^3     9.4/W^3      25.0/W^3      65.0/W^3
Second Order    23.0/W^5    250.0/W^5    1400.0/W^5
Third Order     700.0/W^7   16000.0/W^7  120000.0/W^7

Table 1. Computation of the output noise for different unbiased filters

We can make the following remarks:

- For a given window, there is a trade-off between unbiasness and output-noise limita-
tion, as in standard filtering. The more high-order derivatives the signal model contains,
the more noise is output.
- The amount of output noise is very high as soon as the order of the model increases,
especially for high-order derivatives. But it decreases very quickly with the increase of
the window size. It is thus possible to tune the window size to maintain this amount
of noise at a reasonable value¹.
- Contrary to usual filters, the number of zero-crossings is not equal to the order of the
derivative but higher or equal. In particular, the unbiased smoother has a number of
zero-crossings equal to the order of the model, as shown in Fig. 1.
- There are no simple algebraic relations between the members of this series of polynomials.

¹ We have, in fact, $\int f_r(t)^2\, dt = O(1/W^{2r+1})$.



Fig. 1. A few examples of unbiased optimal smoothers (Q = 0, 2, 4, 6, 8, 10)

Fig. 2. A few examples of unbiased optimal derivators


2.5 What about infinite response filters ?
We can also design infinite response unbiased filters.
Consider for instance the family of filters :
d
S (t) =

p----O

which correspond to the set of recursively implemented digital filters (see for instance
[6]), having an implementation of the form :
P q
Yt = E blxt-i -- E ajyt-i
i=O j--1
Applying the unbiasness condition of equation (1) to these functions leads to a finite
set of linear equations :
/ ~ ---d "t'+'
9 ~ J o!
p--O ~"

defining an affine functional subspace of finite co-dimension.



In particular, for $d = Q$ there is a unique unbiased filter, while if $d > Q$ we have a
$(d - Q)$-dimensional space of solutions. If $d > Q$, one can again choose the solution
minimizing the output noise, that is, the one for which:

$$\sum_{p=0}^{d} \sum_{q=0}^{d} \lambda_p\, \lambda_q \int t^{p+q}\, e^{-2\alpha |t|}\, dt$$

is minimum. This leads to the minimization of a positive quadratic criterion in the pres-
ence of linear constraints, which has a unique solution obtained from the derivation of the
related normal equations.
In order to illustrate this point, we derive these equations for $Q \le 2$ and $d \ge 2$ for
$r = 1$. In that case we obtained $f_1^d(t) = \beta\, t\, e^{-\alpha |t|}$, which corresponds precisely to
the Canny-Deriche recursive optimal derivative filter. More generally, if the signal contains
derivatives only up to the order of the desired derivative, usual derivative filters such as Canny-
Deriche filters are unbiased filters and can be used to estimate edge characteristics.
However, such a filter is not optimal among all infinite response operators, but only
in the small parametric family of exponential filters². The problem of finding an
optimal filter among all infinite response operators is an undefined problem, because the
Euler equation obtained in the previous section (a necessary condition for the optimum)
is undefined, as pointed out.
Since this family is dense in the functional space of derivable functions, it is indeed
possible to approximate any optimal filter using a combination of exponential filters,
but the order $n$ might be very high, while the computation window has to be increased.
Moreover, in practice, on real-time vision machines, these operators are truncated (thus
biased!) and it is much more relevant to consider finite response filters.

2.6 An optimal approach in the discrete 2D-case

Let us now apply these results in the discrete case.


Whereas most authors derive optimal continuous filters and then use a non-optimal
method to obtain a discrete version of these operators, we would like to stress the fact
that the discretization of an optimal continuous filter is not necessarily the optimal discrete
filter.
In addition, the way the discretization is made depends upon a model for the sampling
process. For instance, in almost all implementations [8,3], the authors make the implicit
assumption that the intensity measured for one pixel is related to the true intensity by a
Dirac distribution, that is, corresponds to the point value of the intensity at this point.
This is not a very realistic assumption, and in our implementation we will use another
model.
The key point here is that since we have obtained a formulation of the optimal filter
using any Lebesgue integration over a bounded domain, then the class of obtained filters
is still valid for the discrete case. Let us apply this result now.
In the previous section we have shown that optimal estimators of the intensity deriva-
tives should be computed on a bounded domain, and we are going to consider here a
square window of N × N pixels in the picture, from (0, 0) to (N - 1, N - 1). We would
like to obtain an estimate of the derivatives around the middle point (N/2, N/2).
This is straightforward if we use the equivalent parametric approach obtained in
section 2.3.

2 The same parametric approach could have been developed using Gaussian kernels.

Generalizing the previous approach to 2D data, we can use the following model of the
intensity, a Taylor expansion, the origin being at $(\frac{N}{2}, \frac{N}{2})$:

$$I(x,y) = I_0 + I_x x + I_y y + \frac{I_{xx}}{2} x^2 + I_{xy}\, xy + \frac{I_{yy}}{2} y^2
+ \frac{I_{xxx}}{6} x^3 + \frac{I_{xxy}}{2} x^2 y + \frac{I_{xyy}}{2} x y^2 + \frac{I_{yyy}}{6} y^3 + \cdots$$

where the development is made not up to the order of derivative to be computed, but up
to the order of derivative the signal is supposed to contain.
Let us now model the fact that the intensity obtained for one pixel is related to
the image irradiance over its surface. We consider rectangular pixels, with homogeneous
surfaces, and no gap between two pixels. Since one pixel of a CCD camera integrates the
light received on its surface, a realistic model for the intensity measured
for a pixel $(i, j)$ is, under the previous assumptions:

$$I_{ij} = \int_i^{i+1}\!\!\int_j^{j+1} I(x, y)\, dx\, dy
= I_0 P_0(i) + I_x P_1(i) + I_y P_1(j) + I_{xx} P_2(i) + I_{xy} P_1(i) P_1(j) + I_{yy} P_2(j) + \cdots$$

where $P_k(i) = \int_i^{i+1} \frac{x^k}{k!}\, dx = \sum_{l=0}^{k} c_l\, i^l$.
Now, the related least-squares problem is

$$J = \frac{1}{2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left[ I_{ij} - \big( I_0 P_0(i) + I_x P_1(i) + I_y P_1(j) + \cdots \big) \right]^2$$

and its resolution provides optimal estimates of the intensity derivatives
$\{I_0, I_x, I_y, I_{xx}, I_{xy}, I_{yy}, \ldots\}$ as a function of the intensity values $I_{ij}$ in the N × N window.
In other words, we obtain the intensity derivatives as a linear combination of the
intensity values $I_{ij}$, as for usual finite response digital filters.
For a 5 × 5 or 7 × 7 window, for instance, and for an intensity model taken up to the
fourth order, one obtains the convolution masks given in Fig. 3.
This approach is very similar to what was proposed by Haralick [3], and we call these
filters Haralick-like filters. In both methods the filters depend upon two integers:
(1) the size of the window, (2) the order of expansion of the model. In both methods,
we obtain polynomial linear filters. However, it has been shown [4] that Haralick filters
reduce to Prewitt filters, while our filters do not correspond to already existing filters.
The key point, which is - we think - the main improvement, is to consider the intensity
at one pixel not as the simple value at that location, but as the integral of the intensity
over the pixel surface, which is closer to reality.
Contrary to Haralick's original filters, these filters are not all separable. However this is not
a drawback, because separable filters are only useful when the whole image is processed.
In our case we only compute the derivatives in a small area along edges, and for that
reason efficiency is not as much an issue³.
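A compact Python sketch of this construction (an illustration under assumptions, not the authors' code): the window coordinates are taken with the origin at the window centre, each pixel integrates the model over its unit square, and each row of the pseudo-inverse of the resulting design matrix is one derivative mask. The masks obtained this way will generally differ from those of Fig. 3 by a normalisation factor.

```python
import numpy as np
from math import factorial

def pixel_moment(i, k, N):
    """P_k(i): integral of x^k / k! over pixel i, origin at the window centre."""
    lo = i - N / 2.0
    hi = lo + 1.0
    return (hi**(k + 1) - lo**(k + 1)) / factorial(k + 1)

def haralick_like_masks(N=5, D=2):
    """Least-squares masks for {I_0, I_x, I_y, I_xx, I_xy, I_yy, ...} (all terms
    x^k y^l / (k! l!) with k + l <= D), under the pixel-integration model."""
    terms = [(k, d - k) for d in range(D + 1) for k in range(d + 1)]
    A = np.zeros((N * N, len(terms)))
    for i in range(N):              # i indexes x, j indexes y (a convention choice)
        for j in range(N):
            for c, (k, l) in enumerate(terms):
                A[i * N + j, c] = pixel_moment(i, k, N) * pixel_moment(j, l, N)
    rows = np.linalg.pinv(A)        # row c is the mask estimating the term c derivative
    return {t: rows[c].reshape(N, N) for c, t in enumerate(terms)}

masks = haralick_like_masks(N=5, D=2)
Ix_mask = masks[(1, 0)]             # correlate with a 5x5 neighbourhood to estimate I_x
```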
2.7 Conclusion
We have designed a new class of unbiased optimal filters dedicated to the computation
of intensity derivatives, as required for the computation of edge characteristics. Because
these filters are computed through a simple least-squares minimization problem, we have
been able to implement these operators in the discrete case, taking the CCD subpixel
mechanisms into account.
These filters are dedicated to the computation of edge characteristics; they are well
implemented in finite windows, and correspond to unbiased derivators with minimum
output noise. They do not correspond to optimal filters for edge detection.
³ Anyway, separable filters are quicker than general filters if and only if they are used on a
whole image, not on a small set of points.

[Numerical mask coefficients omitted here; the 7 × 7 first-order mask I_x carries a normalisation of 1/28224, and by symmetry I_y = I_x^T, I_yy = I_xx^T, I_xyy = I_xxy^T and I_yyy = I_xxx^T.]

Fig. 3. Some Haralick-like 5 x 5 and 7 x 7 improved filters


3 Experimental results: computing edge curvature
In order to illustrate the previous developments, we have experimented with our operators for
the computation of edge curvature. Under reasonable assumptions, the edge curvature
can be computed as

$$\kappa = \frac{I_{xx} I_y^2 - 2 I_{xy} I_x I_y + I_{yy} I_x^2}{\left(I_x^2 + I_y^2\right)^{3/2}} \; .$$
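For completeness, a one-function sketch of this formula (the derivative estimates would come, for instance, from masks such as those of section 2.6; the function name is an assumption):

```python
def edge_curvature(Ix, Iy, Ixx, Ixy, Iyy):
    """Curvature of the iso-intensity contour, from first and second derivatives."""
    return (Ixx * Iy**2 - 2.0 * Ixy * Ix * Iy + Iyy * Ix**2) / (Ix**2 + Iy**2) ** 1.5
```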
We have used noisy synthetic pictures, containing horizontal, vertical or oblique edges
with step and roof intensity profiles.
Noise has been added both to the intensity (typically 5% of the intensity range) and
to the edge location (typically 1 pixel). Noise on the intensity will be denoted "I-Noise",
its unit being a percentage of the intensity range, while noise on the edge location will
be denoted "P-Noise", its unit being in pixels.
We have computed the curvature for non-rectilinear edges, either circular or elliptic.
The curvature range is between 0 for a rectilinear edge and 1, since a curve with a
curvature higher than 1 will be inside a pixel. We have computed the curvature along
an edge, and have compared the results with the expected values. Results are plotted in
Fig. 4, the expected values being shown as a dashed curve.
We have also computed the curvature for different circles, in the presence of noise,
and evaluated the error on this estimation. Results are shown in Table 2. The results are
the radius of curvature, the inverse of the curvature, expressed in pixels. The circle radius
was 100 pixels.
Although the error is almost 10%, it appears that for important edge localization
errors the edge curvature is simply not computable. This is due to the fact that we use a

Fig. 4. Computation of the curvature along an elliptic edge

I-Noise            2%    5%    10%   0     0     0
P-Noise            0     0     0     0.5   1     2
Error (in pixels)  2.1   6.0   10.4  6.0   12.2  huge

Table 2. Computation of the curvature at different levels of noise

5 × 5 window, and that our model is only locally valid. In the last case, the second order
derivatives are used at the border of the neighbourhood and are no longer valid.

References
1. J. F. Canny. Finding edges and lines in images. Technical Report AI Memo 720, MIT Press,
Cambridge, 1983.
2. R. Deriche and G. Giraudon. Accurate corner detection : an analytical study. In Proceedings
of the 3rd ICCV, Osaka, 1990.
3. R. M. Haralick. Digital step edges from zero crossing of second directional derivatives. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 6, 1984.
4. A. Huertas and G. Medioni. Detection of intensity changes with subpixel accuracy using
Laplacian-Gaussian masks. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 8:651-664, 1986.
5. O. Monga, J. Rocchisani, and R. Deriche. 3D edge detection using recursive filtering. CVGIP:
Image Understanding, 53, 1991.
6. R.Deriche. Separable recursive filtering for efficient multi-scale edge detection. In Int. Work-
shop Machine and Machine Intelligence, Tokyo, pages 18-23, 1987.
7. R. Deriche. Fast algorithms for low-level vision. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 12, 1990.
8. R.Deriche. Using Canny's criteria to derive a recursively implemented optimal edge detector.
International Journal of Computer Vision, pages 167-187, 1987.
9. I. Weiss. Noise-resistant invariants of curves. In Applications of Invariance in Computer
Vision, DARPA-ESPRIT, Iceland, 1991.

This article was processed using the LaTeX macro package with ECCV92 style
Testing Computational Theories of Motion
Discontinuities: A Psychophysical Study *

Lucia M. Vaina¹ and Norberto M. Grzywacz²

¹ Intelligent Systems Laboratory, College of Engineering and Department of Neurology, Boston
University, ERB-331, Boston, MA 02215, and Harvard-MIT Division of Health Sciences and
Technology, Massachusetts Institute of Technology, USA
² The Smith-Kettlewell Eye Research Institute, 2232 Webster Street, San Francisco, CA 94115,
USA

Abstract. This study reports results from three patients with bilateral
brain lesions (A.F., C.D., and O.S.) and normal observers on psychophys-
ical tasks, which examined the contribution of motion mechanisms to the
extraction of image discontinuities. The data do not support the suggestion
that the visual system extracts motion discontinuities by comparing fully
encoded velocity signals ([NL]; [Clo]). Moreover, the data do not support
the suggestion that the computations underlying discontinuity localization
must occur simultaneously with the spatial integration of motion signals
([Kea]). We propose a computational scheme that can account for the data.

1 Introduction

In this paper, we test theoretical proposals on the organization of motion processing
underlying discontinuity extraction. We investigate performance on three psychophysi-
cal motion tasks of three patients with focal brain lesions involving the neural circuits
mediating specific aspects of motion perception and address the possible implication of
these results for the validity of theories of motion discontinuity.

2 Subjects and Methods

Normal naive observers with good acuity and contrast sensitivity, and no known neu-
rological or psychiatric disorders, and three patients (A.F., C.D., and O.S.) with focal
bilateral brain damage resulting from a single stroke participated in an extensive psy-
chophysical study of motion perception. MRI studies revealed that the patients' lesions
directly involved or disconnected anatomical areas believed to mediate visual analysis.
The rationale for including these three patients in the study was their good performance
on static psychophysical tasks, their normal contrast sensitivity, and good performance
on some motion tasks but their selective poor performance on several other visual mo-
tion tasks. All the patients and healthy volunteers signed the Informed Consent form
according to the Boston University human subjects committee regulations. Details of
the psychophysical experiments and of the experimental setting can be found in [Vea1],
[Vea2], and [Vea3].

* This research was supported by the NIH grant EY07861-3 to L.V. and by the AFOSR grant
92-00564 to Dr. Suzanne McKee and N.M.G.

3 Results

3.1 Experiment 1: Localization of Discontinuities


We addressed the problem in which the subjects had to localize the position of discon-
tinuities defined by relative direction of motion. The stimuli (similar to those used by
Hildreth [Hil]) were dense dynamic random-dot patterns. The display was constructed
in such a way that there was a discontinuity in the velocity field along a vertical line
(Figure 1a). Along the side was a 1.4 deg² notch whose distance from the point of fixa-
tion varied along the vertical axis from trial to trial, but which remained within 2 deg
of visual angle from the black fixation mark. The vertical boundary and the notch were
entirely defined by the difference in direction of motion between the left and right of the
boundary, and were not visible in any static frame. Each displacement was performed in
one screen refresh and was synchronized with the screen to reduce flicker (see Notes and
Comments).

Fig. 1. Localization of discontinuities. (a) Stimulus; (b) percent correct as a function of angle difference (degrees) for controls (n = 15) and patients.

Figure 1b shows that the normal subjects and O.S. performed the task essentially without error for all conditions. In contrast, patient C.D. was severely impaired at all conditions. Because the patients A.F. and C.D. performed at chance in the pure temporal-frequency condition (0°), we conclude that they could not use this cue well enough to localize discontinuities.

3.2 Experiment 2: Local-Speed Discrimination


In this experiment, the stimuli consisted of two sparse dynamic random dot cinemato-
grams displayed in two rectangular apertures (Figure 2a). In any single trial, each dot
took a two-dimensional random-walk of constant step size, which was defined by the
speed. The direction in which any dot moved was independent of its previous direction
and also of the displacements of the other dots. The speed of the dots was uniform and
was assigned independently to each aperture. A base speed of 4.95 deg/sec was always

compared to five other speeds, giving speed ratios of 1.1, 1.47, 2.2, 3.6, and 5.5. The assignment of the highest speed to the top or bottom aperture was pseudo-randomly selected.
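As an illustration of the random-walk rule just described, the per-frame update of one dot might look like the sketch below. The function and parameter names, the frame interval, and the use of the standard C rand() generator are our assumptions, not details of the original display code.

    #include <math.h>
    #include <stdlib.h>

    /* One dot of a sparse dynamic random-dot cinematogram. */
    typedef struct { double x, y; } Dot;

    /* Advance one dot by one frame: a two-dimensional random walk of
     * constant step size (set by the aperture's assigned speed), with the
     * direction drawn independently of the previous direction and of the
     * other dots. */
    void random_walk_step(Dot *d, double speed_deg_per_sec, double frame_dt_sec)
    {
        const double PI = 3.14159265358979323846;
        double step  = speed_deg_per_sec * frame_dt_sec;          /* constant step size */
        double theta = 2.0 * PI * (rand() / (RAND_MAX + 1.0));    /* uniform direction  */
        d->x += step * cos(theta);
        d->y += step * sin(theta);
    }

Assigning the base speed of 4.95 deg/sec to one aperture and the base speed times the chosen ratio to the other reproduces the comparison described above.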

Fig. 2. Local speed discrimination. (a) Stimulus; (b) performance as a function of speed ratio for controls and patients.

Subjects were asked to determine which of the two apertures contained the faster
moving dots. Figure 2b shows that, in comparison to the control group and O.S., who were performing almost perfectly at the 1.47 speed ratio, A.F. had a very severe deficit on this speed discrimination task. Similarly, C.D. was also impaired on this task, but to a lesser degree than A.F.

3.3 Experiment 3: Motion Coherence

In the third experiment, the stimuli were dynamic random-dot cinematograms with a correlated motion signal of variable strength embedded in motion noise. The strength of the motion signal, that is, the percentage of the dots moving in the same, predetermined direction, varied from 0% to 100% (Figure 3a). The algorithm by which the dots were generated was similar to that of Newsome and Paré ([NP]), which is described in detail in [Vea1], [Vea2], and [Vea3]. The aim of this task was to determine the threshold of motion correlation for which a subject could reliably discriminate the direction of motion.
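For concreteness, one common way of realizing such a variable-strength signal is sketched below: on each update a dot carries the signal with probability equal to the coherence and is otherwise replotted at a random position. This is only our illustrative reading of a Newsome-and-Paré-style generator; the names and parameters are hypothetical.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { double x, y; } Dot;

    static double unit_rand(void) { return rand() / (RAND_MAX + 1.0); }

    /* Update a field of dots with the given motion coherence.
     * coherence    : fraction of dots (0..1) carrying the correlated signal
     * signal_dir   : predetermined direction of the signal motion (radians)
     * step         : displacement per update for signal dots
     * width,height : extent of the display aperture                        */
    void coherence_step(Dot *dots, int n, double coherence,
                        double signal_dir, double step,
                        double width, double height)
    {
        for (int i = 0; i < n; ++i) {
            if (unit_rand() < coherence) {
                /* signal dot: move in the predetermined direction */
                dots[i].x += step * cos(signal_dir);
                dots[i].y += step * sin(signal_dir);
            } else {
                /* noise dot: replotted at a random location */
                dots[i].x = width  * unit_rand();
                dots[i].y = height * unit_rand();
            }
        }
    }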
Figure 3b shows that the mean of the motion coherence threshold of the normal
subjects (n=16) was 6.5% for left fixation and 6.9% for right fixation. The patient A.F.
was significantly impaired on this task. His direction discrimination threshold was 28.4%
for left fixation and 35.2% for right fixation. Similarly, O.S. was very impaired on this
task. In contrast, C.D.'s performance was normal when the stimulus was presented in
the intact visual field, but she could not do the task when the stimulus was presented in
the blind visual field.

Fig. 3. Motion coherence. (a) Stimulus correlation levels (0%, 50%, 100%); (b) coherence thresholds for controls (n = 16) and patients A.F., C.D., and O.S.

4 Discussion

4.1 Computation of Visual Motion

It has been theorized that comparisons between fully encoded velocity signals underlie
localization of discontinuities ([NL]; [Clo]). However, our data do not support this sug-
gestion, since A.F., who could not discriminate speed well, had a good performance in
the localization-of-discontinuities task.
Our data also address the issue of whether the computations underlying discontinuity
localization and motion coherence occur simultaneously in the brain. These possibilities
are suggested by theories based on Markov random fields and line processes ([Kea]).
Such theories would predict that if the computation of coherence is impaired, then so
is the computation of discontinuity. Our data do not support this simultaneous-damage
prediction as C.D. performed well on the coherence task, but failed in the localization-
of-discontinuities task. Further evidence against a simultaneous computation of discon-
tinuities and coherence comes from A.F. and O.S., who were good in the localization-
of-discontinuities task but very impaired in the coherence task.
From a computational perspective, it seems desirable to account for the data by
postulating that the computation of motion coherence receives inputs from two parallel
pathways (Figure 4). One pathway would bring data from basic motion measurements
(directional, temporal, and speed signals). The other pathway would bring information
about discontinuity localization (see [Hil] and [GY] for theoretical models) to provide
boundary conditions for the spatial integration in the computation of motion coherence
([YG1]; [YG2]). According to this hypothesis, it is possible that different lesions may
cause independent impairments of discontinuity localization and motion coherence.

Notes and Comments. In Experiment 1, flicker was not completely eliminated, since at 0° angular difference the notch was still visible, as if it were a twinkling border. A possible explanation for this apparent flicker is that the dots inside the notch had shorter lifetimes and thus were turned on and off at a higher temporal frequency.

Fig. 4. Two models of motion coherence and discontinuities

References
[Clo] Clocksin, W.F.: Perception of surface slant and edge labels from optical flow: A computational approach. Perception, 9 (1980) 253-269
[GY] Grzywacz, N.M., Yuille, A.L.: A model for the estimate of local image velocity by cells in the visual cortex. Phil. Trans. R. Soc. Lond. B, 239 (1990) 129-161
[Hil] Hildreth, E.C.: The Measurement of Visual Motion. Cambridge, USA: MIT Press (1984)
[Kea] Koch, C., Wang, H.T., Mathur, B., Hsu, A., Suarez, H.: Computing optical flow in resistive networks and in the primate visual system. Proc. of the IEEE Workshop on Visual Motion, Irvine, CA, USA (1989) 62-72
[NL] Nakayama, K., Loomis, J.M.: Optical velocity patterns, velocity-sensitive neurons, and space perception: A hypothesis. Perception, 3 (1974) 63-80
[NP] Newsome, W.T., Paré, E.B.: A selective impairment of motion perception following lesions of the middle temporal visual area (MT). J. Neurosci. 8 (1988) 2201-2211
[Vea1] Vaina, L.M., LeMay, M., Bienfang, D.C., Choi, A.Y., Nakayama, K.: Intact "biological motion" and "structure from motion" perception in a patient with impaired motion mechanisms. Vis. Neurosci. 5 (1990) 353-371
[Vea2] Vaina, L.M., Grzywacz, N.M., LeMay, M.: Structure from motion with impaired local-speed and global motion-field computations. Neural Computation, 2 (1990) 420-435
[Vea3] Vaina, L.M., Grzywacz, N.M., LeMay, M.: Perception of motion discontinuities in patients with selective motion deficits. (Submitted for publication)
[YG1] Yuille, A.L., Grzywacz, N.M.: A computational theory for the perception of coherent visual motion. Nature, 333 (1988) 71-74
[YG2] Yuille, A.L., Grzywacz, N.M.: A mathematical analysis of the motion coherence theory. Intl. J. Comp. Vision, 3 (1989) 155-175

This article was processed using the LaTeX macro package with ECCV92 style
Motion and Structure Factorization and Segmentation of Long
Multiple Motion Image Sequences

Chris Debrunner 1 and Narendra Ahuja


Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
1 Currently at Martin Marietta Corporation, Denver CO 80201

Abstract. This paper presents a computer algorithm which, given a dense tem-
poral sequence of intensity images of multiple moving objects, will separate the
images into regions showing distinct objects, and for those objects which are ro-
tating, will calculate the three-dimensional structure and motion. The method in-
tegrates the segmentation of trajectories into subsets corresponding to different
objects with the determination of the motion and structure of the objects. Trajecto-
ries are partitioned into groups corresponding to the different objects by fitting the
trajectories from each group to a hierarchy of increasingly complex motion mod-
els. This grouping algorithm uses an efficient motion estimation algorithm based
on the factorization of a measurement matrix into motion and structure compo-
nents. Experiments are reported using two real image sequences of 50 frames each
to test the algorithm.

1 Introduction
This paper is concerned with three-dimensional structure and motion estimation for scenes
containing multiple independently moving rigid objects. Our algorithm uses the image motion to
separate the multiple objects from the background and from each other, and to calculate the three-
dimensional structure and motion of each such object. The two-dimensional motion in the image
sequence is represented by the image plane trajectories of feature points. The motion of each ob-
ject, which describes the three-dimensional rotation and translation of the object between the im-
ages of the sequence, is computed from the object's feature trajectories. If the object on which a
particular group of feature points lie is rotating, the relative three-dimensional positions of the
feature points, called the structure of the object, can also be calculated.
Our algorithm is based on the following assumptions: (1) the objects in the scene are rigid,
i.e., the three-dimensional distance between any pair of feature points on a particular object is
constant over time, (2) the feature points are orthographically projected onto the image plane, and
(3) the objects move with constant rotation per frame. This algorithm integrates the task of seg-
menting the images into distinctly moving objects with the task of estimating the motion and
structure for each object. These tasks are performed using a hierarchy of increasingly complex
motion models, and using an efficient and accurate factorization-based motion and structure es-
timation algorithm.
This paper makes use of an algorithm for factorization of a measurement matrix into separate
motion and structure matrices as reported by the authors in [DA1]. Subsequently in [TK1], To-
masi and Kanade present a similar factorization-based method which allows arbitrary rotations,
but does not have the capability to process trajectories starting and ending at arbitrary frames.
Furthermore, it appears that some assumptions about the magnitude or smoothness of motion are

* Supported by DARPA and the NSF under grant IRI-89-02728, and the State of Illinois Department of Commerce and Community Affairs under grant 90-103.

still necessary to obtain feature trajectories. Kanade points out [Ka1] that with our assumption of
constant rotation we are absorbing the trajectory noise primarily in the structure parameters
whereas their algorithm absorbs them in both the motion and structure parameters.
Most previous motion-based image sequence segmentation algorithms use optical flow to
segment the images based on consistency of image plane motion. Adiv in [Ad1] and Bergen et al. in [BB1] instead segment on the basis of a fit to an affine model. Adiv further groups the resulting
regions to fit a model of a planar surface undergoing 3-D motions in perspective projection. In
[BB2] Boult and Brown show how Tomasi and Kanade's motion factorization method can be
used to split the measurement matrix into parts consisting of independently moving rigid objects.

2 Structure and Motion Estimation


Our method relies heavily on the motion and structure estimation algorithm presented in
[DA1], [De1], and [DA2]. The input to this algorithm is a set of trajectories of orthographically
projected feature points lying on a single rigid object rotating around a fixed-direction axis and
translating along an arbitrary path. If these constraints do not hold exactly the algorithm will pro-
duce structure and motion parameters which only approximately predict the input trajectories.
Given a collection of trajectories (possibly all beginning and ending at different frames) for which
the constraints do hold, our algorithm finds accurate estimates of the relative three-dimensional
positions of the feature points at the start of the sequence and the angular and translational veloc-
ities of the object. The algorithm also produces a confidence number, in the form of an error be-
tween the predicted and the actual feature point image positions. Aside from SVDs, the algorithm
is closed form and requires no iterative optimization.
In Section 4, the results of applying our motion and structure estimation algorithm to real im-
age sequences are presented in terms of the rotational parameters ω and â and the translational motion parameter t. The parameter ω represents the angular speed of rotation about the axis along the unit vector â, where â is chosen such that it points toward the camera (ω is a signed quantity). Since we are assuming orthographic projection, the depth component of the translation cannot be recovered, so t is a two-vector describing the image-plane projection of the translational motion. Although the motion and structure estimation algorithm can accommodate arbitrary motion, most of the objects in the experimental image sequences are moving with constant velocity, and their translational velocity is given in terms of t.
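The motion model assumed here (a constant rotation of ω radians per frame about a fixed unit axis â, plus an image-plane translation under orthography) can be written out as in the small sketch below. This is our illustration of the model only, not the factorization algorithm itself, and the function names are ours.

    #include <math.h>

    /* Per-frame rotation matrix R (row-major 3x3) for a constant rotation
     * of angle omega (radians/frame) about the unit axis a, via Rodrigues'
     * formula. */
    void rotation_from_axis_angle(const double a[3], double omega, double R[3][3])
    {
        double c = cos(omega), s = sin(omega), t = 1.0 - c;
        R[0][0] = t*a[0]*a[0] + c;      R[0][1] = t*a[0]*a[1] - s*a[2]; R[0][2] = t*a[0]*a[2] + s*a[1];
        R[1][0] = t*a[0]*a[1] + s*a[2]; R[1][1] = t*a[1]*a[1] + c;      R[1][2] = t*a[1]*a[2] - s*a[0];
        R[2][0] = t*a[0]*a[2] - s*a[1]; R[2][1] = t*a[1]*a[2] + s*a[0]; R[2][2] = t*a[2]*a[2] + c;
    }

    /* Orthographic projection of a rotated and translated 3D point: only
     * the image-plane components (t[0], t[1]) of the translation are
     * observable, which is why the depth component cannot be recovered. */
    void project_point(const double R[3][3], const double t[2],
                       const double X[3], double x_img[2])
    {
        x_img[0] = R[0][0]*X[0] + R[0][1]*X[1] + R[0][2]*X[2] + t[0];
        x_img[1] = R[1][0]*X[0] + R[1][1]*X[1] + R[1][2]*X[2] + t[1];
    }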

3 Image Sequence Segmentation and Motion and Structure Estimation


The segmentation of the feature point trajectories into groups corresponding to the differently
moving 3D objects and the estimation of the structure and motion of these objects are highly in-
terrelated processes: if the correct segmentation is not known, the motion and structure of each
object cannot be accurately computed, and if the 3D motion of each object is not accurately
known, the trajectories cannot be segmented on the basis of their 3D motion. To circumvent this
circular dependency, we integrate the segmentation and the motion and structure estimation steps
into a single step, and we incrementally improve the segmentation and the motion and structure
estimates as each new frame is received.
The general segmentation paradigm is split and merge. Each group of trajectories (or region)
in the segmentation has associated with it one of three region motion models, two of which de-
scribe rigid motion (the translational and rotational motion models), and the third (unmodeled
motion) which accounts for all motions which do not fit the two rigid motion models and do not
contain any local motion discontinuities. When none of these motion models accurately account

for the motion in the region, the region is split using a region growing technique. When splitting
a region, a measure of motion consistency is computed in a small neighborhood around each tra-
jectory in the region. If the motion is consistent for a particular trajectory, we assume that the
trajectories in the neighborhood all arise from points on a single object. Thus the initial subre-
gions for the split consist of groups of trajectories with locally consistent motion, and these are
grown out to include the remaining trajectories.
Initially all the trajectories are in a single region. Processing then continues in a uniform fash-
ion: the new point positions in each new frame are added to the trajectories of the existing regions,
and then the regions are processed to make them compatible with the new data. The processing
of the regions is broken into four steps: (1) if the new data does not fit the old region motion mod-
el, find a model which does fit the data or split the region, (2) add any newly visible points or
ungrouped points to a compatible region, (3) merge adjacent regions with compatible motions,
(4) remove outliers from the regions.
Compatibility among feature points is checked using the structure and rotational motion esti-
mation algorithm or the translational motion estimation algorithm described in [De1]. A region's
feature points are considered incompatible if the fit error returned by the appropriate motion es-
timation algorithm is above a threshold. We assume that the trajectory detection algorithm can
produce trajectories accurate to the nearest pixel, and therefore we use a threshold (which we call
the error threshold) of one half of a pixel per visible trajectory point per frame. The details of the four steps listed above may be found in [De1] or [DA3].
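A minimal sketch of this error-threshold rule is given below; the data layout (a flat array of per-point, per-frame prediction errors) is our assumption, not the authors' data structure.

    #include <stddef.h>

    #define ERROR_THRESHOLD 0.5   /* pixels per visible trajectory point per frame */

    /* A region's feature points are considered compatible with a fitted
     * motion model when the average discrepancy between predicted and
     * observed feature positions stays below the error threshold. */
    int region_is_compatible(const double *point_errors,  /* one entry per visible point-frame */
                             size_t n_visible_point_frames)
    {
        double total = 0.0;
        if (n_visible_point_frames == 0)
            return 1;                      /* nothing contradicts the model */
        for (size_t i = 0; i < n_visible_point_frames; ++i)
            total += point_errors[i];
        return (total / (double)n_visible_point_frames) < ERROR_THRESHOLD;
    }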

4 Experiments
Our algorithm was tested on two real image sequences of 50 frames: (1) the cylinder sequence,
consisting of images of a cylinder rotating around a nearly vertical axis and a box moving right
with respect to the cylinder and the background, and (2) the robot arm sequence, consisting of
images of a Unimate PUMA Mark III robot arm with its second and third joints rotating in
opposite directions. These sequences show the capabilities of the approach, and also demonstrate
some inherent limitations of motion based segmentation and of monocular image sequence based
motion estimation.
Trajectories were detected using the algorithm described in [De1] (using a method described in [BH1]), which found 2598 trajectories in the cylinder sequence and 202 trajectories in the robot
arm sequence. These trajectories were input to the image sequence segmentation algorithm de-
scribed in Section 3, which partitioned the trajectories into groups corresponding to different rigid
objects and estimated the motion and structure parameters.
The segmentation for the cylinder sequence is shown in Fig. 1. The algorithm separated out the three image regions: the cylinder, the box, and the background. The cylinder is rotating, and thus its structure can be recovered from the image sequence. Fig. 2 shows a projection along the cylinder axis of the 3D point positions calculated from the 1456 points on the cylinder. The points lie very nearly on a cylindrical surface. Table 1 shows the estimated and the actual motion parameters for the cylinder.
Table 1. Comparison of the parameters estimated by the algorithm and the true parameters for the cylinder image sequence experiment.

Parameter   Estimated        Actual
ω           -0.022           -0.017
â           (0, .99, .12)    (0, .98, .19)
t           (.29, -.19)      (.14, 0)

Fig. 1. The image sequence segmentation found for the cylinder sequence (the segmentation is superimposed on the last frame of the sequence).

Fig. 2. An end-on view of the three-dimensional point positions calculated by our structure and motion estimation algorithm from point trajectories derived from the cylinder image sequence.

Fig. 3. The image sequence segmentation found for the robot arm sequence (the segmentation is superimposed on the last frame of the sequence).

Table 2. Comparison of the estimated and the true parameter values for the second (larger) segment of the robot arm.

Parameter   Estimated           Actual
ω           .0133               .0131
â           (-.67, -.01, .74)   (-.62, .02, .79)
t           (.02, -.07)         (0, 0)

Table 3. Comparison of the estimated and the true parameter values for the third (smaller) segment of the robot arm.

Parameter   Estimated           Actual
ω           -.0127              -.0131
â           (-.58, .06, .81)    (-.62, .02, .79)
The error in the ω estimate is large because the cylinder is rotating around an axis nearly parallel to the image plane and, as pointed out in [WH1], a rotation about an axis parallel to the image plane is inherently difficult to distinguish from translation parallel to the image plane and perpendicular to the rotation axis (this also explains the error in t). Note that the predicted trajectory point positions still differ from the actual positions by an average of less than the error threshold of 0.5 pixel. The accuracy of the motion and structure estimation algorithm for less ambiguous motion is illustrated in the experiments on the robot arm sequence.
The image sequence segmentation for the robot arm sequence is shown in Fig. 3. Note that
several stationary feature points (only two visible in Fig. 3) on the background are grouped with

the second segment of the arm. This occurs because any stationary point lying on the projection
of a rotation axis with no translational motion will fit the motion parameters of the rotating object.
Thus these points are grouped incorrectly due to an inherent limitation of segmenting an image
sequence on the basis of motion alone. The remaining points are grouped correctly into three im-
age regions: the second and the third segments of the robot arm, and the background. The two
robot arm segments are rotating and their three-dimensional structure was recovered by the mo-
tion and structure estimation algorithm. Only a small number of feature points were associated with the robot arm segments, making it difficult to illustrate the structure on paper, but the esti-
mated motion parameters of the second and third robot arm segments are shown in Table 2 and
Table 3, respectively. Note that all the motion parameters were very accurately determined.

5 Conclusions
The main features of our method are: (1) motion and structure estimation and segmentation
processes are integrated, (2) frames are processed sequentially with continual update of motion
and structure estimates and segmentation, (3) the motion and structure estimation algorithm fac-
tors the trajectory data into separate motion and structure matrices, (4) aside from SVDs, the mo-
tion and structure estimation algorithm is closed form with no nonlinear iterative optimization
required, (5) the motion and structure estimation algorithm provides a confidence measure for
evaluating any particular segmentation.

References

[Ad1] Adiv, G.: Determining Three-Dimensional Motion and Structure from Optical Flow Generated by Several Moving Objects. IEEE Transactions on PAMI 7 (1985) 384-401
[BB1] Bergen, J., Burt, P., Hingorani, R., Peleg, S.: Multiple Component Image Motion: Motion Estimation. Proc. of the 3rd ICCV, Osaka, Japan (December 1990) 27-32
[BB2] Boult, T., Brown, L.: Factorization-based Segmentation of Motions. Proc. of the IEEE Motion Workshop, Princeton, NJ (October 1991) 21-28
[BH1] Blostein, S., Huang, T.: Detecting Small, Moving Objects in Image Sequences using Sequential Hypothesis Testing. IEEE Trans. on Signal Proc. 39 (July 1991) 1611-1629
[DA1] Debrunner, C., Ahuja, N.: A Direct Data Approximation Based Motion Estimation Algorithm. Proc. of the 10th ICPR, Atlantic City, NJ (June 1990) 384-389
[DA2] Debrunner, C., Ahuja, N.: Estimation of Structure and Motion from Extended Point Trajectories. (submitted)
[DA3] Debrunner, C., Ahuja, N.: Motion and Structure Factorization and Segmentation of Long Multiple Motion Image Sequences. (submitted)
[De1] Debrunner, C.: Structure and Motion from Long Image Sequences. Ph.D. dissertation, University of Illinois at Urbana-Champaign, Urbana, IL (August 1990)
[Ka1] Kanade, T.: personal communication (October 1991)
[KA1] Kung, S., Arun, K., Rao, D.: State-Space and Singular-Value Decomposition-Based Approximation Methods for the Harmonic Retrieval Problem. J. of the Optical Society of America (December 1983) 1799-1811
[TK1] Tomasi, C., Kanade, T.: Factoring Image Sequences into Shape and Motion. Proc. of the IEEE Motion Workshop, Princeton, NJ (October 1991) 21-28
[WH1] Weng, J., Huang, T., Ahuja, N.: Motion and Structure from Two Perspective Views: Algorithms, Error Analysis, and Error Estimation. IEEE Transactions on PAMI 11 (1989) 451-476
Motion and Surface Recovery Using
Curvature and Motion Consistency

Gilbert Soucy and Frank P. Ferrie


McGill University, Research Centre for Intelligent Machines, 3480 rue Université, Montréal, Québec, Canada, H3A 2A7, e-mail: ferrie@mcrcim.mcgill.edu

1 Introduction

This paper describes an algorithm for reconstructing surfaces obtained from a sequence of
overlapping range images in a common frame of reference. It does so without explicitly
computing correspondence and without invoking a global rigidity assumption. Motion
parameters (rotations and translations) are recovered locally under the assumption that
the curvature structure at a point on a surface varies slowly under transformation. The
recovery problem can thus be posed as finding the set of motion parameters that preserves
curvature across adjacent views. This might be viewed as a temporal form of the curvature
consistency constraint used in image and surface reconstruction [2, 3, 4, 5, 6].
To reconstruct a 3-D surface from a sequence of overlapping range images, one can
attempt to apply local motion estimates to successive pairs of images in pointwise fashion.
However this approach does not work well in practice because estimates computed locally
are subject to the effects of noise and quantization error. This problem is addressed
by invoking a second constraint that concerns the properties of physical surfaces. The
motions of adjacent points are coupled through the surface, where the degree of coupling is
proportional to surface rigidity. We interpret this constraint to mean that motion varies
smoothly from point to point and attempt to regularize local estimates by enforcing
smooth variation of the motion parameters. This is accomplished by a second stage of
minimization which operates after local motion estimates have been applied.
The remainder of this paper will focus on the key points of the algorithm, namely:
how the problem of locally estimating motion parameters can be formulated as a convex
minimization problem, how local estimates can be refined using the motion consistency
constraint, and how the combination of these two stages can be used to piece together
a 3-D surface from a sequence of range images. An example of the performance of the
algorithm on laser rangefinder data is included in the paper.

2 Recovery of Local Motion Parameters

Our approach is based on the minimization of a functional form that measures the sim-
ilarity between a local neighbourhood in one image and a corresponding neighbour-
hood in an adjacent image. Following the convention of [6], we describe the local struc-
ture of a surface S in the vicinity of a point P with the augmented Darboux frame
2)(P) = (P, Mp,.]~p, Np,~Mp,~./~p), where gp is the unit normal vector to S at P,
Mp and ~4p are the directions of the principal maximum and minimum curvatures re-
spectively at P, and gMp and ~ p are the scalar magnitudes of these curvatures [1].
Now let x and x t be points on adjacent surfaces S and S ~ ~,'ith corresponding frames
:D(x) and :D(x'), and let ~2 and T be the rotation matrix an i translation vector that
map x to x', i.e., x' = $2x + T. The relationship between :D(x) and D(x') is then

$$\mathcal{D}(x') = \mathcal{T}(\mathcal{D}(x), \Omega, T) , \qquad (1)$$

where $\mathcal{T}(\cdot)$ is defined as follows:

$$x' = \Omega x + T, \quad M_{x'} = \Omega M_x, \quad \bar{M}_{x'} = \Omega \bar{M}_x, \quad N_{x'} = \Omega N_x, \quad \kappa_{M_{x'}} = \kappa_{M_x}, \quad \kappa_{\bar{M}_{x'}} = \kappa_{\bar{M}_x} . \qquad (2)$$

The task is to find Ω and T such that ||D(x) − D(x')|| is minimum. However, for reasons of uniqueness and robustness which are beyond the scope of this paper, the minimization must be defined over an extended neighbourhood that includes the frames in an i × i neighbourhood of x,

$$\min_{\Omega, T} \sum_i \left\| \mathcal{D}_i(x') - \mathcal{T}(\mathcal{D}_i(x), \Omega, T) \right\| . \qquad (3)$$
If an appropriate functional $D_{\Omega T}$ can be found that is convex in Ω and T, then these parameters can easily be determined by an appropriate gradient descent procedure without explicitly computing correspondence. Let λ ⊂ S be a patch containing a point x and D_i(x) a set of frames that describe the local neighbourhood of x. The patch λ is now displaced according to Ω and T. Specifically, we claim that if

1. Ω and T are such that the projection of x on S' lies somewhere on the image of λ, λ' on S',
2. λ meets existence and uniqueness requirements with respect to D_i(x),
3. λ is completely elliptic or hyperbolic,

then a $D_{\Omega T}$ can be found that is convex. We define $D_{\Omega T}$ as follows:

$$D_{\Omega T} = \sum_i \left[ 3 + \frac{|\kappa_{M_{x_i}} - \kappa_{M_{x'_i}}| + |\kappa_{\bar{M}_{x_i}} - \kappa_{\bar{M}_{x'_i}}|}{|\kappa_{M_{x_i}}| + |\kappa_{M_{x'_i}}| + |\kappa_{\bar{M}_{x_i}}| + |\kappa_{\bar{M}_{x'_i}}|} - (M_{x_i} \cdot M_{x'_i})^2 - (\bar{M}_{x_i} \cdot \bar{M}_{x'_i})^2 - (N_{x_i} \cdot N_{x'_i})^2 \right] , \qquad (4)$$

where $(\kappa_{M_{x_i}}, \kappa_{\bar{M}_{x_i}}, M_{x_i}, \bar{M}_{x_i}, N_{x_i})$ and $(\kappa_{M_{x'_i}}, \kappa_{\bar{M}_{x'_i}}, M_{x'_i}, \bar{M}_{x'_i}, N_{x'_i})$ are the components of D_i(x) ∈ S and D_i(x') ∈ S' respectively.
An algorithm for the local recovery of Ω and T follows directly from (4) and is described in [8]. Given two range images R(i, j) and R'(i, j), the curvature consistency algorithm described in [2, 3, 4] is applied to each image to obtain D(x) and D(x') for each discrete sample. Then for each point x for which Ω and T are required, the following steps are performed:

1. Obtain an initial estimate of Ω and T. This can either be provided by a manipulator system or estimated from the velocity of an object in motion. It is assumed that $D_{\Omega T}$ is convex in the vicinity of the initial estimates.
2. Minimize $D_{\Omega T}$ with respect to each parameter of Ω and T, i.e. θ_x, θ_y, θ_z, T_x, T_y, and T_z, using an appropriate gradient descent procedure.
3. Apply the resulting Ω and T to D(x) to obtain D(x'). Validate the result by checking the compatibility of the local neighbourhood of D(x') with that of D(x).
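To make the inner loop of step 2 concrete, the per-neighbourhood similarity term of D_ΩT might be evaluated as in the sketch below, assuming the reading of Eq. (4) given above (the exact form of the curvature term should be checked against [8]). The cost vanishes for perfectly matched frames, and the gradient descent would sum it over the i × i neighbourhood.

    #include <math.h>

    /* Augmented Darboux frame at a surface point. */
    typedef struct {
        double M[3];       /* direction of maximum curvature */
        double Mbar[3];    /* direction of minimum curvature */
        double N[3];       /* unit surface normal            */
        double kM, kMbar;  /* principal curvature magnitudes */
    } Frame;

    static double dot3(const double a[3], const double b[3])
    {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    }

    /* Similarity cost between a frame and its counterpart in the adjacent
     * view: aligned directions drive the squared dot products to 1, and
     * matched curvatures drive the normalized curvature difference to 0,
     * so the cost is zero for a perfect match. */
    double frame_cost(const Frame *a, const Frame *b)
    {
        double denom = fabs(a->kM) + fabs(b->kM) + fabs(a->kMbar) + fabs(b->kMbar);
        double curv  = (denom > 0.0)
                     ? (fabs(a->kM - b->kM) + fabs(a->kMbar - b->kMbar)) / denom
                     : 0.0;
        double dM = dot3(a->M, b->M), dMb = dot3(a->Mbar, b->Mbar), dN = dot3(a->N, b->N);
        return 3.0 + curv - dM*dM - dMb*dMb - dN*dN;
    }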

3 3-D Surface Reconstruction

A brute force solution to the 3-D reconstruction problem would be to estimate Ω and T for each element of R(i, j) and map it to R'(i, j), eliminating multiple instantiations of points in overlapping regions in the process. However, a more efficient and effective approach is to determine Ω and T for a subset of points on these surfaces and then use the solutions for these points to map the surrounding neighbourhood. Besides being more efficient, this strategy acknowledges that solutions may not be found for each point, either because a point has become occluded, or because a weak solution has been vetoed. Still, as proposed, this strategy does not take the behaviour of real surfaces into account. Because of coupling through the surface, the velocities of adjacent points are related. A reasonable constraint would be to insist that variations in the velocities between adjacent regions be smooth. We refer to this as motion consistency and apply this constraint to the local estimates of Ω and T in a second stage of processing. However, rather than explicitly correcting each locally determined Ω and T, we correct instead the positions and orientations of the local neighbourhoods to which these transformations are applied.

3.1 Motion Consistency

The updating of position and orientation can be dealt with separately provided that certain constraints are observed, as there are different compositions of rotation and translation that can take a set of points from one position to another. While the final position depends on the amount of rotation and translation applied, the final orientation depends solely on the rotation applied. This provides the necessary insight into how to separate the problem.
Within the local neighbourhood of a point P we assume that the motion is approximately rigid, i.e. that the motion of P and its near-neighbours, Q_i, can be described by the same Ω and T. The problem, then, is to update the local motion estimate at P, Ω_P and T_P, given the estimates computed independently at each of its neighbours, Ω_i and T_i. Since motion is assumed to be locally rigid, the relative displacements between P and each Q_i should be preserved under transformation. One can exploit this constraint by noting that

$$p_1 = q_{1i} + r_{1i} , \qquad (5)$$

where p_1 and q_{1i} are vectors corresponding to P and Q_i in view 1, and r_{1i} is the relative displacement of point P from point Q_i. It is straightforward to show that the position of P in view 2 as predicted by its neighbour Q_i is given by

$$p_{2i} = \Omega_i (q_{1i} - T_i) + \Omega_i r_{1i} . \qquad (6)$$

A maximum likelihood estimate of P can be obtained by computing the predicted position of P for each of its neighbours and taking a weighted mean, i.e.,

$$p_2 = \frac{\sum_{i=1}^{n} w_i \, p_{2i}}{\sum_{i=1}^{n} w_i} , \qquad (7)$$

where the w_i take into account the rigidity of the object and the distance between neighbouring points. The weights w_i and the size of the local neighbourhood determine the rigidity of the reconstructed surface. In our experiments a Gaussian weighting was used over neighbourhood sizes ranging from 3 × 3 to 11 × 11.
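A direct transcription of the weighted-mean update of Eq. (7) could look like the following; the array interface is ours, and the Gaussian weights w_i are assumed to have been computed elsewhere from the inter-point distances and the rigidity setting.

    #include <stddef.h>

    /* Update the position of P as the weighted mean of the positions
     * p_2i predicted by each of its n neighbours (Eq. (7)). */
    void update_position(const double predicted[][3],  /* p_2i from Eq. (6) */
                         const double *w,              /* weights w_i       */
                         size_t n, double p2[3])
    {
        double sw = 0.0;
        p2[0] = p2[1] = p2[2] = 0.0;
        for (size_t i = 0; i < n; ++i) {
            p2[0] += w[i] * predicted[i][0];
            p2[1] += w[i] * predicted[i][1];
            p2[2] += w[i] * predicted[i][2];
            sw    += w[i];
        }
        if (sw > 0.0) {
            p2[0] /= sw; p2[1] /= sw; p2[2] /= sw;
        }
    }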
The second part of the updating procedure seeks to enforce the relative orientation between point P and each of its neighbours Q_i. However, under the assumption of locally rigid motion, this is equivalent to saying that each point in the local neighbourhood should have the same rotation component in its individual motion parameters. Of course one cannot simply average the parameters of Ω as in the case of position: rotation parameters are not invariant with respect to reference frame. To get around this problem we convert the estimates Ω_i into their equivalent unit quaternions Q_i [9, 7]. The locus of unit quaternions traces out a unit sphere in 4-dimensional space. One of their desirable properties is that the metric distance between two quaternions is given by the great circle distance on this sphere. For small distances, which is precisely the case for the rotations associated with each Q_i, a reasonable approximation is given by their scalar product.

The computational task is now to estimate the quaternion at P, Q_P, given the quaternions Q_i at each Q_i, by minimizing the distance between Q_P and each Q_i. Using standard methods it can be shown that the solution to this minimization problem amounts to an average of the quaternions Q_i normalized to unit length. An example of the effect of applying these updating rules is shown later in this section.
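The quaternion-averaging step can be sketched as below, assuming all the Ω_i have already been converted to unit quaternions expressed in a common hemisphere (q and -q encode the same rotation); the type and function names are ours.

    #include <math.h>
    #include <stddef.h>

    typedef struct { double w, x, y, z; } Quat;   /* unit quaternion */

    /* Estimate the quaternion at P as the renormalized average of the
     * quaternions Q_i of its neighbours.  Because the rotations differ
     * only slightly, the Q_i lie close together on the unit sphere and
     * the normalized mean approximates the great-circle barycentre. */
    Quat average_quaternion(const Quat *q, size_t n)
    {
        Quat m = {0.0, 0.0, 0.0, 0.0};
        for (size_t i = 0; i < n; ++i) {
            m.w += q[i].w; m.x += q[i].x; m.y += q[i].y; m.z += q[i].z;
        }
        double norm = sqrt(m.w*m.w + m.x*m.x + m.y*m.y + m.z*m.z);
        if (norm > 0.0) {
            m.w /= norm; m.x /= norm; m.y /= norm; m.z /= norm;
        }
        return m;
    }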

3.2 Experimental Results

The reconstruction procedure that we use can now be summarized as follows:


1. Selection of tokens. A set of points that spans the surface S and the frames D_i(x) corresponding to the local neighbourhoods of each are selected.
2. Estimation of local correspondence. The algorithm described in Section 2 is applied to each candidate point to estimate Ω and T.
3. Mapping of S into S'. Each point on S is mapped into S' according to the Ω and T of the closest token point that has a valid correspondence in S'.
4. Application of motion consistency. The updating rules for position and orientation are applied to the data mapped in the previous step.
Some results from our motion estimation and reconstruction procedure are now presented. Figure 1 shows three views of an owl statuette obtained with a laser rangefinder in our laboratory and taken at 0°, 45° and 90° relative to the principal axis of the owl. The effects of the motion consistency constraint are shown next in Figure 2a and b. In this experiment local estimates of Ω and T are used to map the surface of the 0° view into the coordinates of the 45° view. As can be seen in Figure 2a, small perturbations due to noise are evident when the constraint is not applied. Figure 2b shows that application of this constraint is very effective in correctly piecing together adjacent surface patches. The algorithm is now applied to the three range images in Figure 1 and the results shown in Figure 2d. Next to it in Figure 2c is a laser rangefinder scan of the owl taken from the viewpoint at which the reconstructed surfaces are rendered. As can be seen, the three surfaces align quite well, especially considering that the surfaces of the object are smooth and the displacement angles large.

4 Conclusions

In this paper we have sketched out an algorithm for reconstructing a sequence of over-
lapping range images based on two key constraints: minimizing the local variation of
curvature across adjacent views, and minimizing the variation of motion parameters
across adjacent surface points. It operates without explicitly computing correspondence
and without invoking a global rigidity assumption. Preliminary results indicate that
the resulting surface reconstruction is both robust and accurate.

Fig. 1. Laser rangefinder images of an owl statuette at (a) 0°, (b) 45° and (c) 90°. Resolution is 256 × 256 by 10 bits/rangel.

Fig. 2. First pair of images: Surface of the owl at 0° rotation mapped into the coordinates of a second frame at 45° using local motion estimates. (a) Motion consistency not applied. (b) Motion consistency applied. Second pair of images: (c) A laser rangefinder image rendered as a shaded surface showing the owl from the viewpoint of the reconstructed surfaces shown next. (d) Reconstruction of three views of the owl taken at 0°, 45° and 90° and rendered as a shaded surface.

References

1. M. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1976.
2. F. Ferrie and J. Lagarde. Curvature consistency improves local shading analysis. In Proceedings 10th International Conference on Pattern Recognition, pages 70-76, Atlantic City, New Jersey, June 1990.
3. F. Ferrie, J. Lagarde, and P. Whaite. Darboux frames, snakes, and super-quadrics: Geometry from the bottom-up. In Proceedings IEEE Workshop on Interpretation of 3D Scenes, pages 170-176, Austin, Texas, Nov. 27-29, 1989.
4. F. Ferrie, J. Lagarde, and P. Whaite. Recovery of volumetric object descriptions from laser rangefinder images. In Proceedings First European Conference on Computer Vision, Antibes, France, April 1990.
5. P. Parent and S. Zucker. Curvature consistency and curve detection. J. Opt. Soc. Amer., Ser. A, 2(13), 1985.
6. P. Sander and S. Zucker. Inferring differential structure from 3-D images: Smooth cross sections of fiber bundles. IEEE Trans. PAMI, 12(9):833-854, 1990.
7. K. Shoemake. Animating rotations with quaternion curves. ACM Computer Graphics, 21(5):365-373, 1985.
8. G. Soucy. View correspondence using curvature and motion consistency. Master's thesis, Dept. of E.E., McGill Univ., 1991. To appear.
9. K. Spring. Euler parameters and the use of quaternion algebra in the manipulation of finite rotations: a review. Mechanism and Machine Theory, 21(5):365-373, 1986.
Finding Clusters and Planes from 3D Line Segments
with Application to 3D Motion Determination*

Zhengyou Zhang and Olivier D. Faugeras


INRIA Sophia-Antipolis, 2004 route des Lucioles, F-06565 Valbonne Cedex, France

Abstract. We address in this paper how to find clusters based on proxim-


ity and planar facets based on coplanarity from 3D line segments obtained
from stereo. The proposed methods are efficient and have been tested with
many real stereo data. These procedures are indispensable in many applica-
tions including scene interpretation, object modeling and object recognition.
We show their application to 3D motion determination. We have developed
an algorithm based on the hypothesize-and-verify paradigm to register two
consecutive 3D frames and estimate their transformation/motion. By group-
ing 3D line segments in each frame into clusters and planes, we can reduce
effectively the complexity of the hypothesis generation phase.
1 Introduction
We address in this paper how to divide 3D line segments obtained from stereo into
geometrically compact entities (clusters) and coplanar facets. Such procedures are indis-
pensable in many applications including scene interpretation, object modeling and object
recognition. We show in this paper their application to 3D motion determination. The
problem arises in the context of a mobile robot navigating in an unknown indoor scene
where an arbitrary number of rigid mobile objects may be present. Although the structure
of the environment is unknown, it is assumed to be composed of straight line segments.
Many planar faced objects are expected in an indoor scene. A stereo rig is mounted on the
mobile robot. Our current stereo algorithm uses three cameras and has been described
in [1,2]. The 3D primitives used are line segments corresponding to significant intensity
discontinuities in the images obtained by the stereo rig.
Given two sets of primitives observed in two views, the task of matching is to estab-
lish a correspondence between them. By a correspondence, we mean that the two paired
primitives are the different observations (instances) of a single primitive undergoing mo-
tion. The matching problem has been recognized as a very difficult problem. It has an
exponential complexity in general. The rigidity assumption about the environment and
objects is used in most matching algorithms. We have developed an algorithm using the
rigidity constraints to guide a hypothesize-and-verify method. Our idea is simple. We use
the rigidity constraints to generate some hypothetical primitive correspondences between
two successive frames. The rigidity constraints include the invariants of the length of a
segment, the distance and angle between two segments, etc. Due to inherent noise in mea-
surements, the equalities in the rigidity constraints no longer hold exactly. Unlike other
methods using some empirical fixed thresholds, we determine dynamically the threshold
from the error measurements of 3D data given by a stereo system. Two pairings of 3D
line segments form a matching hypothesis if they satisfy the rigidity constraints. Since
we apply the rigidity constraints to any pair of segments, if we explore all possible pairs,
the number of operations required is $2\binom{m}{2}\binom{n}{2}$, where m is the number of segments in the first frame and n that in the second frame. Therefore, the complexity is O(m²n²).
* This work was supported in part by Esprit project P940.

For each hypothesis, we compute an initial estimate of the displacement using the iterative extended Kalman filter. We then evaluate the validity of these hypothetical displacements. We apply this estimate to the first frame and compare the transformed frame with the second frame. If a transformed segment of the first frame is similar enough to a segment in the second frame, this pairing is considered as matched and the extended Kalman filter is again used to update the displacement estimation. After all segments have been processed, we obtain, for each hypothesis, the optimal estimate of motion, the estimated error given by the filter and the number of matches. As one can observe, the complexity of verifying each hypothesis is O(mn). Once all hypotheses are evaluated, the hypothesis which gives the minimal estimated error and the largest number of matches is considered as the best one, and its corresponding optimal estimate is retained as the displacement between the two frames. Due to the rigidity constraints, the number of hypothetical displacements is usually very small, and computational efficiency is achieved.

We exploit the rigidity constraint locally in the hypothesis generation phase and globally in the hypothesis verification phase, as shown later. Figure 1 illustrates diagrammatically the principle of our hypothesize-and-verify method.

Fig. 1. Diagram of the displacement analysis algorithm based on the hypothesize-and-verify paradigm
A number of methods have been used in our implementation of the above method to
reduce the complexities of both the hypothesis generation and verification stages. The
interested reader is referred to [3,4]. In this paper, we show how grouping can reduce
the complexity of the hypothesis generation stage. We use proximity and coplanarity to
organize segments into geometrically compact clusters and planes. These procedures are
also useful to scene interpretation.

2 Grouping Speeds Up the Hypothesis Generation Process


If we can divide segments in each frame into several groups such that segments in each group are likely to have originated from a single object, we can considerably speed up the hypothesis generation process. This can be seen from the following. Assume we have divided the two consecutive frames into g1 and g2 groups. For simplicity, assume each group in a frame contains the same number of segments. Thus a group in the first frame, G1i, has m/g1 segments, and a group in the second frame, G2j, has n/g2 segments, where m and n are the total numbers of segments in the first and second frames. Possible matches of the segments in G1i (i = 1, ..., g1) are restricted to one of the G2j's (j = 1, ..., g2). That is, it is not possible that one segment in G1i is matched to one segment in G2j and

another segment in the same G1i is matched to one segment in G2k (k ≠ j). We call this condition the completeness of grouping. We show in Fig. 2 a noncomplete grouping in the second frame. The completeness of a grouping implies that we need only apply the hypothesis generation process to each pairing of groups. As we have g1 × g2 such pairings and the complexity for each pairing is O((m/g1)²(n/g2)²), the total complexity is now O(m²n²/(g1g2)). Thus we have a speedup of O(g1g2).

Fig. 2. A grouping which is not complete

Fig. 3. Illustration of how grouping speeds up the hypothesis generation process
Take a concrete example (see Fig. 3). We have 6 and 12 segments in Frame 1 and Frame 2, respectively. If we directly apply the hypothesis generation algorithm to these two frames, we need 6 × 5 × 12 × 11 / 2 = 1980 operations. If the first frame is divided into 2 groups and the second into 3, we have 6 pairings of groups. Applying the hypothesis generation algorithm to each pairing requires 3 × 2 × 4 × 3 / 2 = 36 operations. The total number of operations is 6 × 36 = 216, and we have a speedup of 9. We should remember that the O(g1g2) speedup is achieved at the cost of a prior grouping process. Whether the speedup is significant depends upon whether the grouping process is efficient.
One of the most influential grouping techniques is called perceptual grouping, pioneered by Lowe [5]. In our algorithm, grouping is performed based on proximity and
coplanarity of 3D line segments. Use of proximity allows us to roughly divide the scene
into clusters, each constituting a geometrically compact entity. As we mainly deal with
indoor environment, many polyhedral objects can be expected. Use of coplanarity allows
us to further divide each cluster into semantically meaningful entities, namely planar
facets.

3 Finding Clusters Based on Proximity

Two segments are said to be proximally connected if one segment is in the neighborhood
of the other one. There are many possible definitions of a neighborhood. Our definition
is: the neighborhood of a segment S is a cylindrical space C with radius r whose axis is
coinciding with the segment S. This is shown in Fig. 4. The top and bottom of the cylinder

are chosen such that the cylinder C contains completely the segment S. S intersects the
two planes at A and B. The segment AB is called the extended segment of S. The distance
from one endpoint of S to the top or bottom of the cylinder is b. Thus the volume V of the neighborhood of S is equal to πr²(l + 2b), where l is the length of S. We choose b = r. The volume of the neighborhood is then determined by r.

Fig. 4. Definition of a segment's neighborhood
A segment Si is said to be in the neighborhood of S if Si intersects the cylindrical space C. From this definition, Si intersects C if either of the following conditions is satisfied:
1. At least one of the endpoints of Si is in C.
2. The distance between the supporting lines of S and Si is less than r and the common
perpendicular intersects both Si and the extended segment of S.
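As a sketch of the first of the two conditions above (an endpoint of Si falling inside the cylindrical neighborhood of S), a point-in-cylinder test could be written as follows; the types are hypothetical, and the second, line-to-line condition is omitted for brevity.

    #include <math.h>

    typedef struct { double p1[3], p2[3]; } Segment3D;

    static void sub3(const double a[3], const double b[3], double r[3])
    { r[0] = a[0]-b[0]; r[1] = a[1]-b[1]; r[2] = a[2]-b[2]; }

    static double dot3(const double a[3], const double b[3])
    { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

    /* Is the point q inside the cylindrical neighborhood of S, of radius r
     * and extended by b = r beyond each endpoint along the axis? */
    int point_in_neighborhood(const Segment3D *S, const double q[3], double r)
    {
        double axis[3], v[3];
        sub3(S->p2, S->p1, axis);
        double len = sqrt(dot3(axis, axis));
        if (len == 0.0) return 0;
        axis[0] /= len; axis[1] /= len; axis[2] /= len;

        sub3(q, S->p1, v);
        double t = dot3(v, axis);               /* position along the axis    */
        if (t < -r || t > len + r) return 0;    /* beyond top or bottom (b=r) */

        double d2 = dot3(v, v) - t * t;         /* squared radial distance    */
        return d2 <= r * r;
    }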

We define a cluster as a group of segments, every two of which are proximally connected in the sense defined above, either directly or through one or more segments in the same group. A simple implementation that finds clusters by testing the above conditions leads to a complexity of O(n²) in the worst case, where n is the number of segments in a frame. In the following, we present a method based on a bucketing technique to find clusters. First, the minima and maxima of the x, y and z coordinates are computed, denoted by x_min, y_min, z_min and x_max, y_max, z_max. Then the parallelepiped formed by the minima and maxima is partitioned into p³ buckets W_ijk (p = 16 in our implementation). To each bucket W_ijk we attach the list of segments L_ijk intersecting it. The key idea of bucketing techniques is that on average the number of segments intersecting a bucket is much smaller than the total number of segments in the frame. The computation of attaching segments to buckets can be performed very fast by an algorithm whose complexity is linear in the number of segments. Finally, a recursive search is performed to find clusters. We can write an algorithm to find the cluster containing segment S in pseudo C code as:

List find_cluster(S)
Segment S;
{
    List cluster = NULL;
    if (is_visited(S)) return NULL;
    mark_segment_visited(S);
    list_buckets  = find_buckets_intersecting_neighborhood_of(S);
    list_segments = union_of_all_segments_in(list_buckets);
    for (Si = each_segment_in(list_segments))
        cluster = cluster ∪ {Si} ∪ find_cluster(Si);
    return cluster;
}

where List is a structure storing a list of segments, defined as

typedef struct cell {
    Segment seg;
    struct cell *next;
} *List;

From the above discussion, we see that the operations required to find a cluster are really very simple with the aid of a bucketing technique, except perhaps for the function find_buckets_intersecting_neighborhood_of(S). This function, as indicated by its name, should find all buckets intersecting the neighborhood of the segment S. That is, we must examine whether a bucket intersects the cylindrical space C or not, which is by no means a simple operation. The gain in efficiency through using a bucketing technique may become nonsignificant. Fortunately, we have a very good approximation, as described below, which allows for an efficient computation.

Fig. 5. Approximation of a neighborhood of a segment
Fill the cylindrical space C with many spheres, each just fitting into the cylinder, i.e., whose radius is equal to r. We allow intersection between spheres. Fig. 5a illustrates the situation using only a section passing through the segment S. The union of the set of spheres gives an approximation of C. When the distance d between successive sphere centers approaches zero (i.e., d → 0), the approximation is almost perfect, except at
the top and bottom of C. The error of the approximation in this case is (2/3)πr³ (the two end caps of C are covered only by hemispheres). This part is not very important because it is the farthest from the segment. Although the operation

to examine the intersection between a bucket and a sphere is simpler than between a bucket and a cylinder, it is not beneficial to use too many spheres. What we do is go a step further, as illustrated in Fig. 5b. Spheres are not allowed to intersect with each other (i.e., d = 2r), with the exception of the last sphere. The center of the last sphere is always at the endpoint of S, so it may intersect with the previous sphere. The number
of spheres required is equal to ⌈l/(2r) + 1⌉, where ⌈a⌉ denotes the smallest integer greater
than or equal to a. It is obvious that the union of these spheres is always smaller than the
cylindrical space C. Now we replace each sphere by a cube circumscribing it and aligned
with the coordinate axes (represented by a dashed rectangle in Fig. 5b). Each cube has
a side length of 2r. It is now almost trivial to find which buckets intersect a cube. Let
the center of the cube be [x, y, z]^T. Let

$$i_{\min} = \max\left[0, \lfloor (x - x_{\min} - r)/d_x \rfloor\right], \qquad i_{\max} = \min\left[p - 1, \lceil (x - x_{\min} + r)/d_x - 1 \rceil\right],$$
$$j_{\min} = \max\left[0, \lfloor (y - y_{\min} - r)/d_y \rfloor\right], \qquad j_{\max} = \min\left[p - 1, \lceil (y - y_{\min} + r)/d_y - 1 \rceil\right],$$
$$k_{\min} = \max\left[0, \lfloor (z - z_{\min} - r)/d_z \rfloor\right], \qquad k_{\max} = \min\left[p - 1, \lceil (z - z_{\min} + r)/d_z - 1 \rceil\right],$$

where ⌊a⌋ denotes the greatest integer less than or equal to a, p is the number of buckets along each axis, d_x = (x_max − x_min)/p, d_y = (y_max − y_min)/p, and d_z = (z_max − z_min)/p. The buckets intersecting the cube are simply {W_ijk | i = i_min, ..., i_max, j = j_min, ..., j_max, k = k_min, ..., k_max}.
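The bucket-range computation for one circumscribing cube translates almost directly into code; the sketch below follows the formulas above (with p buckets along each axis), and the clamping helper and structure names are ours.

    #include <math.h>

    typedef struct { int imin, imax, jmin, jmax, kmin, kmax; } BucketRange;

    static int clamp_int(int v, int lo, int hi)
    { return v < lo ? lo : (v > hi ? hi : v); }

    /* Index range of the buckets intersected by an axis-aligned cube of
     * side 2r centred at (x, y, z); dx, dy, dz are the bucket sizes. */
    BucketRange buckets_for_cube(double x, double y, double z, double r,
                                 double xmin, double ymin, double zmin,
                                 double dx, double dy, double dz, int p)
    {
        BucketRange b;
        b.imin = clamp_int((int)floor((x - xmin - r) / dx),       0, p - 1);
        b.imax = clamp_int((int)ceil ((x - xmin + r) / dx - 1.0), 0, p - 1);
        b.jmin = clamp_int((int)floor((y - ymin - r) / dy),       0, p - 1);
        b.jmax = clamp_int((int)ceil ((y - ymin + r) / dy - 1.0), 0, p - 1);
        b.kmin = clamp_int((int)floor((z - zmin - r) / dz),       0, p - 1);
        b.kmax = clamp_int((int)ceil ((z - zmin + r) / dz - 1.0), 0, p - 1);
        return b;
    }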
If a segment is parallel to either the x, y or z axis, it can be shown that the union of the cubes is bigger than the cylindrical space C. However, if a segment is near 45° or 135° to an axis (as shown in Fig. 5b), there are some gaps in approximating C. These gaps are usually filled by buckets, because a whole bucket intersecting one cube is now considered as part of the neighborhood of S.

4 Finding Planes
Several methods have been proposed in the literature to find planes. One common method
is to directly use data from range finders [6,7]. The expert system of Thonnat [8] for
scene interpretation is capable of finding planes from 3D line segments obtained from
stereo. The system being developed by Grossmann [9,10] aims at extracting, also from
3D line segments, visible surfaces including planar, cylindrical, conical and spherical
ones. In the latter system, each coplanar crossing pair of 3D line segments (i.e., they are
neither collinear nor parallel) forms a candidate plane. Each candidate plane is tested for
compatibility with the already existing planes. If compatibility is established, the existing
plane is updated by the candidate plane. In this section, we present a new method to
find planes from 3D line segments.
As we do not know how many planes there are or where they are, we first try to find two coplanar line segments which can define a plane, and then try to find more line segments
which lie in this hypothetical plane until all segments are processed. The segments in
this plane are marked visited. For those unvisited segments, we repeat the above process
to find new planes.
Let a segment be represented by its midpoint m, its unit direction vector u, and its length l. Because segments reconstructed from stereo are always corrupted by noise, we attach to each segment an uncertainty measure (covariance matrix) of m and u, denoted by Λ_m and Λ_u. The uncertainty measure of l is not required. For two noncollinear segments (m1, u1) with (Λ_m1, Λ_u1) and (m2, u2) with (Λ_m2, Λ_u2), the coplanarity condition is

$$c \triangleq (m_2 - m_1) \cdot (u_1 \wedge u_2) = 0 , \qquad (1)$$

where · and ∧ denote the dot product and the cross product of two vectors, respectively.

In reality, condition (1) is unlikely to be met exactly. Instead, we impose that |c| be less than some threshold. We determine the threshold in a dynamic manner by relating it to the uncertainty measures. The variance of c, denoted by Λ_c, is computed from the covariance matrices of the two segments by

$$\Lambda_c = (u_1 \wedge u_2)^T (\Lambda_{m_1} + \Lambda_{m_2}) (u_1 \wedge u_2) + [u_1 \wedge (m_2 - m_1)]^T \Lambda_{u_2} [u_1 \wedge (m_2 - m_1)] + [u_2 \wedge (m_2 - m_1)]^T \Lambda_{u_1} [u_2 \wedge (m_2 - m_1)] .$$

Here we assume there is no correlation between the two segments. Since c²/Λ_c follows a χ² distribution with one degree of freedom, two segments are said to be coplanar if

$$c^2 / \Lambda_c < \kappa , \qquad (2)$$

where κ can be chosen by looking up the χ² table such that Pr(χ² < κ) = α. In our implementation, we set α = 50%, or κ = 0.5.
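A minimal sketch of this dynamic-threshold coplanarity test, combining Eq. (1), the variance formula, and the chi-square criterion of Eq. (2), is given below; the segment structure and helper names are our assumptions.

    #include <math.h>

    typedef struct {
        double m[3];        /* midpoint        */
        double u[3];        /* unit direction  */
        double Lm[3][3];    /* covariance of m */
        double Lu[3][3];    /* covariance of u */
    } Seg;

    static void cross3(const double a[3], const double b[3], double r[3])
    {
        r[0] = a[1]*b[2] - a[2]*b[1];
        r[1] = a[2]*b[0] - a[0]*b[2];
        r[2] = a[0]*b[1] - a[1]*b[0];
    }

    static double dot3(const double a[3], const double b[3])
    { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

    /* v^T A v for a 3x3 covariance matrix A. */
    static double quad3(const double v[3], const double A[3][3])
    {
        double Av0 = A[0][0]*v[0] + A[0][1]*v[1] + A[0][2]*v[2];
        double Av1 = A[1][0]*v[0] + A[1][1]*v[1] + A[1][2]*v[2];
        double Av2 = A[2][0]*v[0] + A[2][1]*v[1] + A[2][2]*v[2];
        return v[0]*Av0 + v[1]*Av1 + v[2]*Av2;
    }

    /* Returns 1 if the two segments are accepted as coplanar at threshold
     * kappa (0.5 in the paper's implementation). */
    int segments_coplanar(const Seg *s1, const Seg *s2, double kappa)
    {
        double d[3] = { s2->m[0]-s1->m[0], s2->m[1]-s1->m[1], s2->m[2]-s1->m[2] };
        double n[3], w1[3], w2[3];
        cross3(s1->u, s2->u, n);     /* u1 ^ u2        */
        cross3(s1->u, d, w1);        /* u1 ^ (m2 - m1) */
        cross3(s2->u, d, w2);        /* u2 ^ (m2 - m1) */

        double c  = dot3(d, n);      /* coplanarity measure, Eq. (1) */
        double Lc = quad3(n, s1->Lm) + quad3(n, s2->Lm)
                  + quad3(w1, s2->Lu) + quad3(w2, s1->Lu);
        if (Lc <= 0.0) return c == 0.0;
        return (c * c) / Lc < kappa;
    }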
As discussed in the last paragraph, the two segments used must not be collinear. Two segments are collinear if and only if the following two conditions are satisfied:

$$u_1 - u_2 = 0 , \quad \text{and} \quad u_1 \wedge (m_2 - m_1) = 0 . \qquad (3)$$

The first says that two collinear segments should have the same orientation (remark: segments are oriented in our stereo system). The second says that the midpoint of the second segment lies on the first segment. In reality, of course, these conditions are rarely satisfied exactly. A treatment similar to that for coplanarity can be performed.
Once two segments are identified to lie in a single plane, we estimate the parameters of the plane. A plane is described by

$$ux + vy + wz + d = 0 , \qquad (4)$$

where n = [u, v, w]^T is parallel to the normal of the plane, and |d|/||n|| is the distance of the origin to the plane. It is clear that for an arbitrary scalar λ ≠ 0, λ[u, v, w, d]^T describes the same plane as [u, v, w, d]^T. Thus the minimal representation of a plane has only three parameters. One possible minimal representation is to set w = 1, which gives

$$ux + vy + z + d = 0 .$$

However, it cannot represent planes parallel to the z-axis. To represent all planes in 3D space, we should use three maps [11]:

Map 1: ux + vy + z + d = 0 for planes nonparallel to the z-axis, (5)
Map 2: x + vy + wz + d = 0 for planes nonparallel to the x-axis, (6)
Map 3: ux + y + wz + d = 0 for planes nonparallel to the y-axis. (7)
In order to choose which map to use, we first compute an initial estimate of the plane
normal n_0 = u_1 ∧ u_2. If the two segments are parallel, n_0 = u_1 ∧ (m_2 − m_1). If the z
component of n_0 has the maximal absolute value, Map 1 (5) will be used; if the x component
has the maximal absolute value, Map 2 (6) will be used; otherwise, Map 3 (7) will be used.
An initial estimate of d can then be computed using the midpoint of a segment. In the
sequel, we use Map 1 for explanation; the derivations are easily extended to the other
maps.
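As an illustration of this selection and initialization step, a small sketch is given below; the helper name, the packing of the three free parameters into a state vector, and the parallelism threshold are our own choices.

```python
import numpy as np

def init_plane(m1, u1, m2, u2, eps=1e-8):
    """Choose Map 1/2/3 and an initial 3-parameter plane estimate x0
    from two coplanar segments (illustrative helper; names are ours)."""
    n0 = np.cross(u1, u2)
    if np.linalg.norm(n0) < eps:                 # segments (nearly) parallel
        n0 = np.cross(u1, m2 - m1)
    k = int(np.argmax(np.abs(n0)))               # dominant normal component
    n = n0 / n0[k]                               # scale that component to 1
    d0 = -n @ m1                                 # plane passes through a midpoint
    free = [i for i in range(3) if i != k]       # the two remaining components
    x0 = np.array([n[free[0]], n[free[1]], d0])
    map_index = {2: 1, 0: 2, 1: 3}[k]            # z -> Map 1, x -> Map 2, y -> Map 3
    return map_index, x0
```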
We use an extended Kalman filter [12] to estimate the plane parameters. Let the
state vector be x = [u, v, d]^T. We have an initial estimate x_0 available, as described
just above. Since this estimate is not good, we set the diagonal elements of the initial
covariance matrix Λ_{x_0} to a very big number and the off-diagonal elements to zero. Sup-
pose a 3D segment with parameters (u, m) is identified as lying in the plane. Define the
measurement vector as z = [u^T, m^T]^T. We have two equations relating z to x:
$$\mathbf{f}(\mathbf{x}, \mathbf{z}) = \begin{bmatrix} \mathbf{n}^T \mathbf{u} \\ \mathbf{n}^T \mathbf{m} + d \end{bmatrix} = \mathbf{0} \ , \qquad (8)$$
where n = [u, v, 1]^T. The first equation says that the segment is perpendicular to the
plane normal, and the second says that the midpoint of the segment is on the plane. In
order to apply the Kalman filter, we must linearize the above equation [13], which gives
$$\mathbf{y} = M\mathbf{x} + \boldsymbol{\xi} \ , \qquad (9)$$
where y is the new measurement vector, M is the observation matrix, and ξ is the noise
disturbance in y. Denoting by (u_x, u_y, u_z) and (m_x, m_y, m_z) the components of u and m, they are given by
$$M = \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{x}} = \begin{bmatrix} u_x & u_y & 0 \\ m_x & m_y & 1 \end{bmatrix} \ , \qquad (10)$$
$$\mathbf{y} = -\mathbf{f}(\mathbf{x}, \mathbf{z}) + M\mathbf{x} = \begin{bmatrix} -u_z \\ -m_z \end{bmatrix} \ , \qquad (11)$$
$$\boldsymbol{\xi} = \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{z}} \, \boldsymbol{\xi}_z \ , \quad \text{with} \quad \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{z}} = \begin{bmatrix} \mathbf{n}^T & \mathbf{0}^T \\ \mathbf{0}^T & \mathbf{n}^T \end{bmatrix} \ , \qquad (12)$$
where 0 is the 3D zero vector and ξ_z is the noise disturbance in z. Now that we have two
segments which have been identified to be coplanar and an initial estimate of the plane
parameters, we can apply the extended Kalman filter based on the above formulation to
obtain a better estimate of x and its error covariance matrix Λ_x.
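A minimal sketch of one such measurement update is shown below for Map 1. It assumes that the measurement noise of a segment is block-diagonal, with covariances Λ_u and Λ_m and no cross-correlation between u and m; this assumption, and all the names, are our own illustration rather than the original implementation.

```python
import numpy as np

def ekf_plane_update(x, Px, u, m, Lu, Lm):
    """One EKF measurement update of the plane state x = [u, v, d] (Map 1)
    from a segment (u, m) with covariances (Lu, Lm), following Eqs. (8)-(12).
    Px is the current state covariance (initially large and diagonal)."""
    n = np.array([x[0], x[1], 1.0])
    f = np.array([n @ u, n @ m + x[2]])          # Eq. (8), should be ~0
    M = np.array([[u[0], u[1], 0.0],
                  [m[0], m[1], 1.0]])            # Eq. (10)
    y = -f + M @ x                               # Eq. (11): [-u_z, -m_z]
    # Measurement noise covariance propagated through df/dz (Eq. (12)),
    # assuming u and m are uncorrelated
    R = np.array([[n @ Lu @ n, 0.0],
                  [0.0, n @ Lm @ n]])
    S = M @ Px @ M.T + R                         # innovation covariance
    K = Px @ M.T @ np.linalg.inv(S)              # Kalman gain
    x_new = x + K @ (y - M @ x)                  # equivalent to x - K @ f
    Px_new = (np.eye(3) - K @ M) @ Px
    return x_new, Px_new
```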
Once we have estimated the parameters of the plane, we try to find more evidence
for the plane, i.e., more segments lying in the same plane. If a 3D segment z = [u^T, m^T]^T with
(Λ_u, Λ_m) lies in the plane, it must satisfy Eq. (8). Since the data are noisy, we do not expect
the quantity p ≜ f(x, z) to be exactly 0. Instead, we compute the covariance
matrix Λ_p of p as follows:
$$\Lambda_p = \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{x}} \Lambda_x \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{x}}^T + \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{z}} \Lambda_z \frac{\partial \mathbf{f}(\mathbf{x}, \mathbf{z})}{\partial \mathbf{z}}^T \ , \qquad (13)$$
where ∂f(x, z)/∂x and ∂f(x, z)/∂z are computed by Eqs. (10) and (12), respectively. If
$$\mathbf{p}^T \Lambda_p^{-1} \mathbf{p} \le \kappa_p \ , \qquad (14)$$
then the segment is considered as lying in the plane. Since p^T Λ_p^{-1} p follows a χ² distri-
bution with 2 degrees of freedom, we can choose an appropriate κ_p by looking up the χ²
table such that Pr(χ² ≤ κ_p) = α_p. We choose α_p = 50%, or κ_p = 1.4. Each time we find
a new segment in the plane, we update the plane parameters x and Λ_x and try to find
still more. Finally, we obtain a set of segments supporting the plane together with an estimate
of the plane parameters accounting for all these segments.
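The support test can be sketched as follows, reusing the same block-diagonal measurement-noise assumption as in the update sketch above; κ_p is obtained from the χ² quantile with 2 degrees of freedom (≈ 1.39, which the text rounds to 1.4). Names are ours.

```python
import numpy as np
from scipy.stats import chi2

def supports_plane(x, Px, u, m, Lu, Lm, alpha=0.5):
    """Mahalanobis test of Eqs. (13)-(14): does segment (u, m) lie in plane x?"""
    n = np.array([x[0], x[1], 1.0])
    p = np.array([n @ u, n @ m + x[2]])              # residual f(x, z)
    Fx = np.array([[u[0], u[1], 0.0],
                   [m[0], m[1], 1.0]])               # df/dx, Eq. (10)
    Lp = Fx @ Px @ Fx.T + np.array([[n @ Lu @ n, 0.0],
                                    [0.0, n @ Lm @ n]])   # Eq. (13)
    return p @ np.linalg.solve(Lp, p) <= chi2.ppf(alpha, df=2)
```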

5 Experimental Results

In this section we show the results of grouping using an indoor scene. A stereo rig takes
three images, one of which is displayed in Fig. 6. After performing edge detection, edge
linking and linear segment approximation, the three images are supplied to a trinocular
stereo system, which reconstructs a 3D frame consisting of 137 3D line segments. Fig-
ure 7 shows the front view (projection on the plane in front of the stereo system and
perpendicular to the ground plane) and the top view (projection on the ground plane)
of the reconstructed 3D frame.
We then apply the bucketing technique to this 3D frame to sort segments into buckets,
which takes about 0.02 seconds of user time on a Sun 4/60 workstation. The algorithm
described in Sect. 3 is then applied, which takes again 0.02 seconds of user time to find
two clusters. They are respectively shown in Figs. 8 and 9. Comparing these with Fig. 7,
we observe that the two clusters do correspond to two geometrically distinct entities.
Fig. 6. Image taken by the first camera
Fig. 7. Front and top views of the reconstructed 3D frame
Fig. 8. Front and top views of the first cluster
Fig. 9. Front and top views of the second cluster
Finally we apply the algorithm described in Sect. 4 to each cluster, and it takes 0.35
seconds of user time to find in total 11 planes. The four largest planes contain 17, 10,
25 and 13 segments, respectively, and they are shown in Figs. 10 to 13. The other planes
contain fewer than 7 segments each, corresponding to the box faces, the table and the terminal.
From these results, we observe that our algorithm can reliably detect planes from 3D line
segments obtained from stereo, but a detected plane does not necessarily correspond to
a physical plane. The planes shown in Figs. 11 to 13 correspond respectively to segments
on the table, the wall and the door. The plane shown in Fig. 10, however, is composed of
segments from different objects, although they do satisfy the coplanarity constraint. This is because
in our current implementation any segment in a cluster satisfying the coplanarity test is
retained as a support of the plane. One possible solution to this problem is to grow
the plane by looking for segments in the neighborhood of the segments already retained as
supports of the plane.

Fig. 10. Front and top views of the first plane
Fig. 11. Front and top views of the second plane
Due to space limitations, the reader is referred to [4] and [13] for the application to 3D
motion determination.
Fig. 12. Front and top views of the third plane
Fig. 13. Front and top views of the fourth plane
6 Conclusion
We have described how to speed up the motion determination algorithm through group-
ing. A formal analysis has been done. A speedup of O(g_1 g_2) can be achieved if two con-
secutive frames have been segmented into g_1 and g_2 groups. Grouping must be complete
in order not to miss a hypothesis in the hypothesis generation process. Two criteria sat-
isfying the completeness condition have been proposed, namely proximity, to find clusters
which are geometrically compact, and coplanarity, to find planes. Implementation details
have been described. Many real stereo data have been used to test the algorithm and
good results have been obtained. We should note that the two procedures are also useful
for scene interpretation.

References
1. F. Lustman, Vision Stéréoscopique et Perception du Mouvement en Vision Artificielle. PhD thesis, University of Paris XI, Orsay, Paris, France, December 1987.
2. N. Ayache, Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception. MIT Press, Cambridge, MA, 1991.
3. Z. Zhang, O. Faugeras, and N. Ayache, "Analysis of a sequence of stereo scenes containing multiple moving objects using rigidity constraints," in Proc. Second Int'l Conf. Comput. Vision, (Tampa, FL), pp. 177-186, IEEE, December 1988.
4. Z. Zhang and O. D. Faugeras, "Estimation of displacements from two 3D frames obtained from stereo," Research Report 1440, INRIA Sophia-Antipolis, 2004 route des Lucioles, F-06565 Valbonne cedex, France, June 1991.
5. D. Lowe, Perceptual Organization and Visual Recognition. Kluwer Academic, Boston, MA, 1985.
6. W. Grimson and T. Lozano-Perez, "Model-based recognition and localization from sparse range or tactile data," Int'l J. Robotics Res., vol. 5, pp. 3-34, Fall 1984.
7. O. Faugeras and M. Hebert, "The representation, recognition, and locating of 3D shapes from range data," Int'l J. Robotics Res., vol. 5, no. 3, pp. 27-52, 1986.
8. M. Thonnat, "Semantic interpretation of 3-D stereo data: Finding the main structures," Int'l J. Pattern Recog. Artif. Intell., vol. 2, no. 3, pp. 509-525, 1988.
9. P. Grossmann, "Building planar surfaces from raw data," Technical Report R4.1.2, ESPRIT Project P940, 1987.
10. P. Grossmann, "From 3D line segments to objects and spaces," in Proc. IEEE Conf. Comput. Vision Pattern Recog., (San Diego, CA), pp. 216-221, 1989.
11. O. D. Faugeras, Three-Dimensional Computer Vision. MIT Press, Cambridge, MA, 1991, to appear.
12. P. Maybeck, Stochastic Models, Estimation and Control, Vol. 2. Academic, New York, 1982.
13. Z. Zhang, Motion Analysis from a Sequence of Stereo Frames and its Applications. PhD thesis, University of Paris XI, Orsay, Paris, France, 1990 (in English).

This article was processed using the LaTeX macro package with ECCV92 style
Hierarchical Model-Based Motion Estimation

James R. Bergen, P. Anandan, Keith J. Hanna, and Rajesh Hingorani


David Sarnoff Research Center, Princeton NJ 08544, USA

Abstract. This paper describes a hierarchical estimation framework for
the computation of diverse representations of motion information. The key
features of the resulting framework (or family of algorithms) are a global
model that constrains the overall structure of the motion estimated, a local
model that is used in the estimation process, and a coarse-fine refinement
strategy. Four specific motion models: affine flow, planar surface flow, rigid
body motion, and general optical flow, are described along with their appli-
cation to specific examples.

1 Introduction

A large body of work in computer vision over the last 10 or 15 years has been con-
cerned with the extraction of motion information from image sequences. The motivation
of this work is actually quite diverse, with intended applications ranging from data com-
pression to pattern recognition (alignment strategies) to robotics and vehicle navigation.
In tandem with this diversity of motivation is a diversity of representation of motion
information: from optical flow, to affine or other parametric transformations, to 3-d ego-
motion plus range or other structure. The purpose of this paper is to describe a common
framework within which all of these computations can be represented.
This unification is possible because all of these problems can be viewed from the
perspective of image registration. That is, given an image sequence, compute a repre-
sentation of motion that best aligns pixels in one frame of the sequence with those in
the next. The differences among the various approaches mentioned above can then be
expressed as different parametric representations of the alignment process. In all cases
the function minimized is the same; the difference lies in the fact that it is minimized
with respect to different parameters.
The key features of the resulting framework (or family of algorithms) are a global
model that constrains the overall structure of the motion estimated, a local model that is
used in the estimation process 1, and a coarse-fine refinement strategy. An example of a
global model is the rigidity constraint; an example of a local model is that displacement
is constant over a patch. Coarse-fine refinement or hierarchical estimation is included in
this framework for reasons that go well beyond the conventional ones of computational
efficiency. Its utility derives from the nature of the objective function common to the
various motion models.

1.1 Hierarchical Estimation
Hierarchical approaches have been used by various researchers (e.g., see [2, 10, 11, 22, 19]).
More recently, a theoretical analysis of hierarchical motion estimation was described in
1 Because this model will be used in a multiresolution data structure, it is "local" in a slightly
unconventional sense that will be discussed below.
[8] and the advantages of using parametric models within such a framework have also
been discussed in [5].
Arguments for use of hierarchical (i.e. pyramid based) estimation techniques for mo-
tion estimation have usually focused on issues of computational efficiency. A matching
process that must accommodate large displacements can be very expensive to compute.
Simple intuition suggests that if large displacements can be computed using low resolu-
tion image information great savings in computation will be achieved. Higher resolution
information can then be used to improve the accuracy of displacement estimation by
incrementally estimating small displacements (see, for example, [2]). However, it can also
be argued that it is not only efficient to ignore high resolution image information when
computing large displacements; in a sense it is necessary to do so. This is because of
aliasing of high spatial frequency components undergoing large motion. Aliasing is the
source of false matches in correspondence solutions or (equivalently) local minima in the
objective function used for minimization. Minimization or matching in a multiresolution
framework helps to eliminate problems of this type. Another way of expressing this is
to say that many sources of non-convexity that complicate the matching process are not
stable with respect to scale.
With only a few exceptions ([5, 9]), much of this work has concentrated on using a
small family of "generic" motion models within the hierarchical estimation framework.
Such models involve the use of some type of a smoothness constraint (sometimes allow-
ing for discontinuities) to constrain the estimation process at image locations containing
little or no image structure. However, as noted above, the arguments for use of a mul-
tiresolution, hierarchical approach apply equally to more structured models of image
motion.
In this paper, we describe a variety of motion models used within the same hierar-
chical framework. These models provide powerful constraints on the estimation process
and their use within the hierarchical estimation framework leads to increased accuracy,
robustness and efficiency. We outline the implementation of four new models and present
results using real images.

1.2 Motion Models
Because optical flow computation is an underconstrained problem, all motion estimation
algorithms involve additional assumptions about the structure of the motion computed.
In many cases, however, this assumption is not expressed explicitly as such, rather it is
presented as a regularization term in an objective function [14, 16] or described primarily
as a computational issue [18, 4, 2, 20].
Previous work involving explicitly model-based motion estimation includes direct
methods [17, 21, 13] as well as methods for estimation under restricted conditions [7, 9].
The first class of methods uses a global egomotion constraint while those in the second
class of methods rely on parametric motion models within local regions. The description
"direct methods" actually applies equally to both types.
With respect to motion models, these algorithms can be divided into three categories:
(i) fully parametric, (ii) quasi-parametric, and (iii) non-parametric. Fully parametric
models describe the motion of individual pixels within a region in terms of a parametric
form. These include affine and quadratic flow fields. Quasi-parametric models involve
representing the motion of a pixel as a combination of a parametric component that is
valid for the entire region and a local component which varies from pixel to pixel. For
instance, the rigid motion model belongs to this class: the egomotion parameters constrain
the local flow vector to lie along a specific line, while the local depth value determines the
exact value of the flow vector at each pixel. By non-parametric models, we mean those
such as are commonly used in optical flow computation, i.e. those involving the use of
some type of a smoothness or uniformity constraint.
A parallel taxonomy of motion models can be constructed by considering local models
that constrain the motion in the neighborhood of a pixel and global models that describe
the motion over the entire visual field. This distinction becomes especially useful in ana-
lyzing hierarchical approaches where the meaning of "local" changes as the computation
moves through the multiresolution hierarchy. In this scheme fully parametric models are
global models, non-parametric models such as smoothness or uniformity of displacement
are local models, and quasi-parametric models involve both a global and a local model.
The reason for describing motion models in this way is that it clarifies the relationship
between different approaches and allows consideration of the range of possibilities in
choosing a model appropriate to a given situation. Purely global (or fully parametric)
models in essence trivially imply a local model so no choice is possible. However, in the
case of quasi- or non-parametric models, the local model can be more or less complex.
Also, it makes clear that by varying the size of local neighborhoods, it is possible to move
continuously from a partially or purely local model to a purely global one.
The reasons for choosing one model or another are generally quite intuitive, though
the exact choice of model is not always easy to make in a rigorous way. In general,
parametric models constrain the local motion more strongly than the less parametric
ones. A small number of parameters (e.g., six in the case of affine flow) are sufficient
to completely specify the flow vector at every point within their region of applicability.
However, they tend to be applicable only within local regions, and in many cases are
approximations to the actual flow field within those regions (although they may be very
good approximations). From the point of view of motion estimation, such models allow
the precise estimation of motion at locations containing no image structure, provided the
region contains at least a few locations with significant image structure.
Quasi-parametric models constrain the flow field less, but nevertheless constrain it
to some degree. For instance, for rigidly moving objects under perspective projection,
the rigid motion parameters (same as the egomotion parameters in the case of observer
motion), constrain the flow vector at each point to lie along a line in the velocity space.
One dimensional image structure (e.g., an edge) is generally sufficient to precisely esti-
mate the motion of that point. These models tend to be applicable over a wide region
in the image, perhaps even the entire image. If the local structure of the scene can be
further parametrized (e.g., planar surfaces under rigid motion), the model becomes fully
parametric within the region.
Non-parametric models require local image structure that is two-dimensional (e.g.,
corner points, textured areas). However, with the use of a smoothness constraint it is
usually possible to "fill-in" where there is inadequate local information. The estimation
process is typically more computationally expensive than the other two cases. These
models are more generally applicable (not requiring parametrizable scene structure or
motion) than the other two classes.

1.3 Paper Organization

The remainder of the paper consists of an overview of the hierarchical motion estimation
framework, a description of each of the four models and their application to specific
examples, and a discussion of the overall approach and its applications.
2 Hierarchical Motion Estimation

Figure 1 describes the hierarchical motion estimation framework. The basic components
of this framework are: (i) pyramid construction, (ii) motion estimation, (iii) image warp-
ing, and (iv) coarse-to-fine refinement.
There are a number of ways to construct the image pyramids. Our implementation
uses the Laplacian pyramid described in [6], which involves simple local computations
and provides the necessary spatial-frequency decomposition.
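For concreteness, a minimal sketch of such a pyramid is given below, using a separable 5-tap binomial kernel and simple REDUCE/EXPAND operations; this is only an illustration of the structure, not the implementation of [6], and all names are ours.

```python
import numpy as np
from scipy.ndimage import convolve1d

W = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # 5-tap binomial kernel

def reduce(img):
    """Blur with the separable kernel and subsample by 2."""
    blurred = convolve1d(convolve1d(img, W, axis=0), W, axis=1)
    return blurred[::2, ::2]

def expand(img, shape):
    """Upsample by 2 (zero interleave), blur, and rescale to `shape`."""
    up = np.zeros(shape)
    up[::2, ::2] = img[:(shape[0] + 1) // 2, :(shape[1] + 1) // 2]
    return 4.0 * convolve1d(convolve1d(up, W, axis=0), W, axis=1)

def laplacian_pyramid(img, levels):
    """Band-pass (Laplacian) images L_0..L_{levels-1} plus the final low-pass."""
    pyr, g = [], img.astype(float)
    for _ in range(levels):
        g_next = reduce(g)
        pyr.append(g - expand(g_next, g.shape))
        g = g_next
    pyr.append(g)
    return pyr
```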
The motion estimator varies according to the model. In all cases, however, the estima-
tion process involves SSD minimization, but instead of performing a discrete search (such
as in [3]), Gauss-Newton minimization is employed in a refinement process. The basic
assumption behind SSD minimization is intensity constancy, as applied to the Laplacian
pyramid images. Thus,
$$I(\mathbf{x}, t) = I(\mathbf{x} - \mathbf{u}(\mathbf{x}), t - 1) \ ,$$
where x = (x, y) denotes the spatial image position of a point, I the (Laplacian pyramid)
image intensity and u(x) = (u(x, y), v(x, y)) denotes the image velocity at that point.
The SSD error measure for estimating the flow field within a region is
$$E(\{\mathbf{u}\}) = \sum_{\mathbf{x}} \left( I(\mathbf{x}, t) - I(\mathbf{x} - \mathbf{u}(\mathbf{x}), t - 1) \right)^2 \ , \qquad (1)$$

where the sum is computed over all the points within the region and {u} is used to denote
the entire flow field within that region. In general this error (which is actually the sum
of individual errors) is not quadratic in terms of the unknown quantities {u}, because
of the complex pattern of intensity variations. Hence, we typically have a non-linear
minimization problem at hand.
Note that the basic structure of the problem is independent of the choice of a motion
model. The model is in essence a statement about the function u(x). To make this
explicit, we can write
$$\mathbf{u}(\mathbf{x}) = \mathbf{u}(\mathbf{x}; \mathbf{p}_m) \ , \qquad (2)$$
where p_m is a vector representing the model parameters.
A standard numerical approach for solving such a problem is to apply Newton's
method. However, for errors which are sum of squares a good approximation to Newton's
method is the Gauss-Newton method, which uses a first order expansion of the individual
error quantities before squaring. If {u}_i is the current estimate of the flow field during the ith
iteration, the incremental estimate {δu} can be obtained by minimizing the quadratic
error measure
$$E(\{\delta\mathbf{u}\}) = \sum_{\mathbf{x}} \left( \Delta I + \nabla I \cdot \delta\mathbf{u}(\mathbf{x}) \right)^2 \ , \qquad (3)$$

where
$$\Delta I(\mathbf{x}) = I(\mathbf{x}, t) - I(\mathbf{x} - \mathbf{u}_i(\mathbf{x}), t - 1) \ ,$$
that is, the difference between the two images at corresponding pixels, after taking the
current estimate into account.
As such, the minimization problem described in Equation 3 is underconstrained. The
different motion models constrain the flow field in different ways. When these are used
to describe the flow field, the estimation problem can be reformulated in terms of the un-
known (incremental) model parameters. The details of these reformulations are described
in the various sections corresponding to the individual motion models.
The third component, image warping, is achieved by using the current values of the
model parameters to compute a flow field, and then using this flow field to warp I(t - 1)
towards I(t), which is used as the reference image. Our current warping algorithm uses
bilinear interpolation. The warped image (as against the original second image) is then
used for the computation of the error ΔI for further estimation$^2$. The spatial gradient
∇I computations are based on the reference image.
The final component, coarse-to-fine refinement, propagates the current motion esti-
mates from one level to the next level where they are then used as initial estimates. For
the parametric component of the model, this is easy; the values of the parameters are
simply transmitted to the next level. However, when a local model is also used, that
information is typically in the form of a dense image (or images)---e.g., a flow field or a
depth map. This image (or images) must be propagated via a pyramid expansion opera-
tion as described in [6]. The global parameters in combination with the local information
can then be used to generate the flow field necessary to perform the initial warping at
this next level.
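The overall loop can be sketched, for the simplest possible global model (a single image-wide translation), as follows; it reuses the laplacian_pyramid helper sketched above and scipy.ndimage.shift for bilinear warping. The structure (warp, Gauss-Newton increment, coarse-to-fine propagation) is what matters here; the function names and iteration counts are our own choices, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import shift, sobel

def translation_increment(ref, warped):
    """One Gauss-Newton step of Eq. (3) for a single global displacement."""
    dI = ref - warped                              # Delta I
    Ix = sobel(ref, axis=1) / 8.0                  # x gradient of the reference
    Iy = sobel(ref, axis=0) / 8.0                  # y gradient of the reference
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = np.array([np.sum(Ix * dI), np.sum(Iy * dI)])
    return -np.linalg.solve(G, b)                  # delta u = (dx, dy)

def coarse_to_fine_translation(I0, I1, levels=3, iters=3):
    """Estimate a global displacement of I1 (time t-1) towards I0 (time t)."""
    pyr0 = laplacian_pyramid(I0, levels)           # reference pyramid
    pyr1 = laplacian_pyramid(I1, levels)           # previous-frame pyramid
    u = np.zeros(2)
    for level in range(levels, -1, -1):            # coarsest to finest
        for _ in range(iters):
            warped = shift(pyr1[level], (u[1], u[0]), order=1)
            u += translation_increment(pyr0[level], warped)
        if level > 0:
            u *= 2.0                               # propagate to the finer level
    return u
```

For the models described below, essentially only the increment step changes; the surrounding warp and coarse-to-fine structure stays the same.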

3 Motion Models

3.1 Affine Flow


The Model: When the distance between the background surfaces and the camera is
large, it is usually possible to approximate the motion of the surface as an affine trans-
formation:
$$u(x, y) = a_1 + a_2 x + a_3 y \ ,$$
$$v(x, y) = a_4 + a_5 x + a_6 y \ . \qquad (4)$$
Using vector notation this can be rewritten as follows:
$$\mathbf{u}(\mathbf{x}) = X(\mathbf{x}) \, \mathbf{a} \ , \qquad (5)$$
where a denotes the vector (a_1, a_2, a_3, a_4, a_5, a_6)^T, and
$$X(\mathbf{x}) = \begin{bmatrix} 1 & x & y & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & x & y \end{bmatrix} \ .$$
Thus, the motion of the entire region is completely specified by the parameter vector a,
which is the unknown quantity that needs to be estimated.

The Estimation Algorithm: Let a_i denote the current estimate of the affine param-
eters. After using the flow field represented by these parameters in the warping step, an
incremental estimate δa can be determined. To achieve this, we insert the parametric
form of δu into Equation 3, and obtain an error measure that is a function of δa:
$$E(\delta\mathbf{a}) = \sum_{\mathbf{x}} \left( \Delta I + (\nabla I)^T X \, \delta\mathbf{a} \right)^2 \ . \qquad (6)$$
Minimizing this error with respect to δa leads to the equation:
$$\left[ \sum_{\mathbf{x}} X^T (\nabla I)(\nabla I)^T X \right] \delta\mathbf{a} = - \sum_{\mathbf{x}} X^T (\nabla I) \, \Delta I \ . \qquad (7)$$
2 We have avoided using the standard notation k in order to avoid any confusion about this
point.
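One Gauss-Newton step of Eq. (7) can be sketched as below; ΔI, the reference-image gradients and the pixel coordinates of the region are assumed precomputed and flattened, and the function name is ours.

```python
import numpy as np

def affine_increment(dI, Ix, Iy, xs, ys):
    """Solve Eq. (7) for the affine update, given per-pixel Delta I,
    gradients (Ix, Iy) and pixel coordinates (xs, ys), all flattened."""
    # (grad I)^T X has entries (Ix, Ix*x, Ix*y, Iy, Iy*x, Iy*y) at each pixel
    J = np.stack([Ix, Ix * xs, Ix * ys, Iy, Iy * xs, Iy * ys], axis=1)
    A = J.T @ J                          # sum of X^T (grad I)(grad I)^T X
    b = -J.T @ dI                        # right-hand side of Eq. (7)
    return np.linalg.solve(A, b)         # delta a (6-vector)
```

The increment is added to the current parameters, the warped image is regenerated, and the step is repeated within the coarse-to-fine loop sketched earlier.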
Experiments with the affine motion model: To demonstrate use of the affine flow
model, we show its performance on an aerial image sequence. A frame of the original
sequence is shown in Figure 2a and the unprocessed difference between two frames of
this sequence is shown in Figure 2b. Figure 2c shows the result of estimating an affine
transformation using the hierarchical warp motion approach, and then using this to com-
pensate for camera motion induced flow. Although the terrain is not perfectly flat, we
still obtain encouraging compensation results. In this example the simple difference be-
tween the compensated and original image is sufficient to detect and locate a helicopter
in the image. We use extensions of the approach, like integration of compensated differ-
ence images over time, to detect smaller objects moving more slowly with respect to the
background.

3.2 Planar Surface Flow
The Model: It is generally known that the instantaneous motion of a planar surface
undergoing rigid motion can be described as a second order function of image coordinates
involving eight independent parameters (e.g., see [15]). In this section we provide a brief
derivation of this description and make some observations concerning its estimation.
We begin by observing that the image motion induced by a rigidly moving object (in
this case a plane) can be written as
$$\mathbf{u}(\mathbf{x}) = \frac{1}{Z(\mathbf{x})} A(\mathbf{x}) \, \mathbf{t} + B(\mathbf{x}) \, \boldsymbol{\omega} \ , \qquad (8)$$
where Z(x) is the distance from the camera of the point (i.e., depth) whose image position
is x, and
$$A(\mathbf{x}) = \begin{bmatrix} -f & 0 & x \\ 0 & -f & y \end{bmatrix} \ , \qquad B(\mathbf{x}) = \begin{bmatrix} (xy)/f & -(f + x^2/f) & y \\ (f^2 + y^2)/f & -(xy)/f & -x \end{bmatrix} \ .$$
The A and the B matrices depend only on the image positions and the focal length f,
and not on the unknowns: t, the translation vector, ω, the angular velocity vector, and
Z.
A planar surface can be described by the equation
$$k_1 X + k_2 Y + k_3 Z = 1 \ , \qquad (9)$$
where (k_1, k_2, k_3) relate to the surface slant, tilt, and the distance of the plane from
the origin of the chosen coordinate system (in this case, the camera origin). Dividing
throughout by Z, we get
$$\frac{1}{Z} = k_1 \frac{x}{f} + k_2 \frac{y}{f} + k_3 \ .$$
Using k to denote the vector (k_1, k_2, k_3) and r to denote the vector (x/f, y/f, 1) we obtain
$$\frac{1}{Z(\mathbf{x})} = \mathbf{r}(\mathbf{x})^T \mathbf{k} \ .$$

Substituting this into Equation 8 gives
$$\mathbf{u}(\mathbf{x}) = (A(\mathbf{x})\mathbf{t}) \, (\mathbf{r}(\mathbf{x})^T \mathbf{k}) + B(\mathbf{x}) \boldsymbol{\omega} \ . \qquad (10)$$


This flow field is quadratic in x and can also be written as
$$u(\mathbf{x}) = a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 x y \ ,$$
$$v(\mathbf{x}) = a_4 + a_5 x + a_6 y + a_7 x y + a_8 y^2 \ , \qquad (11)$$
where the 8 coefficients (a_1, ..., a_8) are functions of the motion parameters t, ω and the
surface parameters k. Since this 8-parameter form is rather well-known (e.g., see [15]) we
omit its details.
If the egomotion parameters are known, then the three-parameter vector k can be used
to represent the motion of the planar surface. Otherwise the 8-parameter representation
can be used. In either case, the flow field is linear in the unknown parameters.
The problem of estimating planar surface motion has been extensively studied
before [21, 1, 23]. In particular, Negahdaripour and Horn [21] suggest iterative meth-
ods for estimating the motion and the surface parameters, as well as a method of estimat-
ing the 8 parameters and then decomposing them into the five rigid motion parameters
and the three surface parameters in closed form. Besides the embedding of these computations
within the hierarchical estimation framework, we also take a slightly different approach
to the problem.
We assume that the rigid motion parameters are already known or can be estimated
(e.g., see Section 3.3 below). Then, the problem reduces to that of estimating the three
surface parameters k. There are several practical reasons to prefer this approach: First, in
many situations the rigid motion model may be more globally applicable than the planar
surface model, and can be estimated using information from all the surfaces undergoing
the same rigid motion. Second, unless the region of interest subtends a significant field
of view, the second order components of the flow field will be small, and hence the
estimation of the eight parameters will be inaccurate and the process may be unstable.
On the other hand, the information concerning the three parameters k is contained in the
first order components of the flow field, and (if the rigid motion parameters are known)
their estimation will be more accurate and stable.

The Estimation Algorithm: Let k_i denote the current estimate of the surface pa-
rameters, and let t and ω denote the motion parameters. These parameters are used to
construct an initial flow field that is used in the warping step. The residual information
is then used to determine an incremental estimate δk.
By substituting the parametric form of δu,
$$\delta\mathbf{u} = \mathbf{u} - \mathbf{u}_i = (A(\mathbf{x})\mathbf{t})\,(\mathbf{r}(\mathbf{x})^T(\mathbf{k}_i + \delta\mathbf{k})) + B(\mathbf{x})\boldsymbol{\omega} - \left[ (A(\mathbf{x})\mathbf{t})\,(\mathbf{r}(\mathbf{x})^T\mathbf{k}_i) + B(\mathbf{x})\boldsymbol{\omega} \right] = (A(\mathbf{x})\mathbf{t})\,\mathbf{r}(\mathbf{x})^T \delta\mathbf{k} \ , \qquad (12)$$
into Equation 3, we can obtain the incremental estimate δk as the vector that minimizes
$$E(\delta\mathbf{k}) = \sum_{\mathbf{x}} \left( \Delta I + (\nabla I)^T (A\mathbf{t}) \, \mathbf{r}^T \delta\mathbf{k} \right)^2 \ . \qquad (13)$$

Minimizing this error leads to the equation
$$\left[ \sum_{\mathbf{x}} \mathbf{r} \, (\mathbf{t}^T A^T)(\nabla I)(\nabla I)^T (A\mathbf{t}) \, \mathbf{r}^T \right] \delta\mathbf{k} = - \sum_{\mathbf{x}} \mathbf{r} \, (\mathbf{t}^T A^T)(\nabla I) \, \Delta I \ , \qquad (14)$$
which can be solved to obtain the incremental estimate δk.
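A sketch of this step is given below, using the A(x) written out in Eq. (8) above; inputs are assumed flattened over the analysis window, and the function and argument names are ours.

```python
import numpy as np

def planar_k_increment(dI, Ix, Iy, xs, ys, t, f):
    """Solve Eq. (14) for the surface-parameter update delta k,
    given flattened Delta I, gradients and pixel coordinates."""
    # s(x) = (grad I)^T A(x) t, with A(x) = [[-f, 0, x], [0, -f, y]]
    s = Ix * (-f * t[0] + xs * t[2]) + Iy * (-f * t[1] + ys * t[2])
    r = np.stack([xs / f, ys / f, np.ones_like(xs)], axis=1)   # r(x)
    J = r * s[:, None]                       # rows: s(x) * r(x)^T
    lhs = J.T @ J                            # 3x3 matrix of Eq. (14)
    rhs = -J.T @ dI
    return np.linalg.solve(lhs, rhs)         # delta k (3-vector)
```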


Experiments with the planar surface motion model: We demonstrate the appli-
cation of the planar surface model using images from an outdoor sequence. One of the
input images is shown in Figure 3a, and the difference between both input images is
shown in Figure 3b. After estimating the camera motion between the images using the
algorithm described in Section 3.3, we applied the planar surface estimation algorithm
to a manually selected image window placed roughly over a region on the ground plane.
These parameters were then used to warp the second frame towards the first (this process
should align the ground plane alone). The difference between this warped image and the
original image is shown in Figure 3c. The figure shows compensation of the ground plane
motion, leaving residual parallax motion of the trees and other objects in the background.
Finally, in order to demonstrate the plane-fit, we graphically projected a rectangular grid
onto that plane. This is shown superimposed on the input image in Figure 3d.

3.3 Rigid Body Model

The Model: The motion of arbitrary surfaces undergoing rigid motion cannot usually
be described by a single global model. We can however make use of the global rigid body
model if we combine it with a local model of the surface. In this section, we provide a
brief derivation of the global and the local models. Hanna [12] provides further details
and results, and also describes how the local and global models interact at corner-like
and edge-like image structures.
As described in Section 3.2, the image motion induced by a rigidly moving object
can be written as
$$\mathbf{u}(\mathbf{x}) = \frac{1}{Z(\mathbf{x})} A(\mathbf{x}) \, \mathbf{t} + B(\mathbf{x}) \, \boldsymbol{\omega} \ , \qquad (15)$$
where Z(x) is the distance from the camera of the point (i.e., its depth), whose image
position is x, and
$$A(\mathbf{x}) = \begin{bmatrix} -f & 0 & x \\ 0 & -f & y \end{bmatrix} \ , \qquad B(\mathbf{x}) = \begin{bmatrix} (xy)/f & -(f + x^2/f) & y \\ (f^2 + y^2)/f & -(xy)/f & -x \end{bmatrix} \ .$$
The A and the B matrices depend only on the image positions and the focal length
f, and not on the unknowns: t, the translation vector, ω, the angular velocity vector, and
Z. Equation 15 relates the parameters of the global model, ω and t, with the parameters of
the local scene structure, Z(x).
A local model we use is the frontal-planar model, which means that over a local image
patch, we assume that Z(x) is constant. An alternative model uses the assumption that
δZ(x), the difference between a previous estimate and a refined estimate, is constant
over each local image patch.
We refine the local and global models in turn, using initial estimates of the local struc-
ture parameters, Z(x), and the global rigid body parameters ω and t. This local/global
refinement is iterated several times.

The Estimation Algorithm: Let the current estimates be denoted as Z_i(x), t_i and
ω_i. As in the other models, we can use the model parameters to construct an initial flow
field, u_i(x), which is used to warp one of the image frames towards the next. The residual
error between the warped image and the original image to which it is warped is used to
refine the parameters of the local and global models. We now show how these models are
refined.
We begin by writing equation 15 in an incremental form so that
$$\delta\mathbf{u}(\mathbf{x}) = \frac{1}{Z(\mathbf{x})} A(\mathbf{x})\mathbf{t} + B(\mathbf{x})\boldsymbol{\omega} - \frac{1}{Z_i(\mathbf{x})} A(\mathbf{x})\mathbf{t}_i - B(\mathbf{x})\boldsymbol{\omega}_i \ . \qquad (16)$$
Inserting the parametric form of δu into Equation 3 we obtain the pixel-wise error as
$$E(\mathbf{t}, \boldsymbol{\omega}, 1/Z(\mathbf{x})) = \left( \Delta I + (\nabla I)^T A\mathbf{t}/Z(\mathbf{x}) + (\nabla I)^T B\boldsymbol{\omega} - (\nabla I)^T A\mathbf{t}_i/Z_i(\mathbf{x}) - (\nabla I)^T B\boldsymbol{\omega}_i \right)^2 \ . \qquad (17)$$
To refine the local models, we assume that 1/Z(x) is constant over 5 x 5 image patches
centered on each image pixel. We then algebraically solve for this Z both in order to
estimate its current value, and to eliminate it from the global error measure. Consider
the local component of the error measure,
$$E_{\mathrm{local}} = \sum_{5 \times 5} E(\mathbf{t}, \boldsymbol{\omega}, 1/Z(\mathbf{x})) \ . \qquad (18)$$
Differentiating equation 17 with respect to 1/Z(x) and setting the result to zero, we get
$$1/Z(\mathbf{x}) = \frac{ - \sum_{5 \times 5} \left( \Delta I - (\nabla I)^T A\mathbf{t}_i/Z_i(\mathbf{x}) + (\nabla I)^T B\boldsymbol{\omega} - (\nabla I)^T B\boldsymbol{\omega}_i \right) (\nabla I)^T A\mathbf{t} }{ \sum_{5 \times 5} \left( (\nabla I)^T A\mathbf{t} \right)^2 } \ . \qquad (19)$$

To refine the global model, we minimize the error in Equation 17 summed over the
entire image:
$$E_{\mathrm{global}} = \sum_{\mathrm{image}} E(\mathbf{t}, \boldsymbol{\omega}, 1/Z(\mathbf{x})) \ . \qquad (20)$$
We insert the expression for 1/Z(x) given in Equation 19 (not the current numerical
value of the local parameter) into Equation 20. The result is an expression for E_global
that is non-quadratic in t but quadratic in ω. We recover refined estimates of t and ω
by performing one Gauss-Newton minimization step using the previous estimates of the
global parameters, t_i and ω_i, as starting values. Expressions are evaluated numerically
at t = t_i and ω = ω_i.
We then repeat the estimation algorithm several times at each image resolution.
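The local half of this alternation can be sketched as below: Eq. (19) is evaluated at every pixel by forming the 5 x 5 sums with a uniform filter. The per-pixel terms use the A(x) and B(x) written out above; the function and argument names are ours, the small regularizer is an assumption, and the global (t, ω) step is not shown.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def refine_inverse_depth(dI, Ix, Iy, xs, ys, t, w, t_i, w_i, invZ_i, f):
    """Local step of the rigid-body model: evaluate Eq. (19) for 1/Z(x)
    over 5x5 patches.  Inputs are full-size 2D arrays except the motion
    parameters; (xs, ys) are pixel coordinate grids."""
    # s = (grad I)^T A(x) t, and its analogue for the current estimate t_i
    s   = Ix * (-f * t[0]   + xs * t[2])   + Iy * (-f * t[1]   + ys * t[2])
    s_i = Ix * (-f * t_i[0] + xs * t_i[2]) + Iy * (-f * t_i[1] + ys * t_i[2])
    def rot_term(om):
        # (grad I)^T B(x) omega, using the B(x) of Eq. (15)
        return (Ix * ((xs * ys / f) * om[0] - (f + xs**2 / f) * om[1] + ys * om[2])
              + Iy * ((f + ys**2 / f) * om[0] - (xs * ys / f) * om[1] - xs * om[2]))
    q = dI - s_i * invZ_i + rot_term(w) - rot_term(w_i)
    num = uniform_filter(q * s, size=5)       # 5x5 sums (up to a common factor)
    den = uniform_filter(s * s, size=5)
    return -num / (den + 1e-12)               # Eq. (19)
```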

Experiments with the rigid body motion model: We have chosen an outdoor scene
to demonstrate the rigid body motion model. Figure 4a shows one of the input images,
and Figure 4b shows the difference between the two input images. The algorithm was
performed beginning at level 3 (subsampled by a factor of 8) of a Laplacian pyramid. The
local surface parameters 1/Z(x) were all initialized to zero, and the rigid-body motion
parameters were initialized to t_0 = (0, 0, 1)^T and ω = (0, 0, 0)^T. The model parameters
were refined 10 times at each image resolution. Figure 4c shows the difference image
between the second image and the first image after being warped using the final estimates
of the rigid-body motion parameters and the local surface parameters. Figure 4d shows
an image of the recovered local surface parameters 1/Z(x) such that bright points are
nearer the camera than dark points. The recovered inverse ranges are plausible almost
everywhere, except at the image border and near the recovered focus of expansion. The
bright dot at the bottom right hand side of the inverse range map corresponds to a leaf
in the original image that is blowing across the ground towards the camera. Figure 4e
shows a table of rigid-body motion parameters that were recovered at the end of each
resolution of analysis.
More experimental results and a detailed discussion of the algorithm's performance
on various types of scenes can be found in [12].

3.4 General Flow Fields

The Model: Unconstrained general flow fields are typically not described by any global
parametric model. Different local models have been used to facilitate the estimation pro-
cess, including constant flow within a local window and locally smooth or continuous flow.
The former facilitates direct local estimation [18, 20], whereas the latter model requires
iterative relaxation techniques [16]. It is also not uncommon to use a combination of
these two types of local models (e.g., [3, 10]).
The local model chosen here is constant flow within 5 x 5 pixel windows at each level
of the pyramid. This is the same model as used by Lucas and Kanade [18], but here it is
embedded as a local model within the hierarchical estimation framework.

The Estimation Algorithm: Assume that we have an approximate flow field from
previous levels (or previous iterations at the same level). Assuming that the incremental
flow vector δu is constant within the 5 x 5 window, Equation 3 can be written as
$$E(\delta\mathbf{u}) = \sum_{\mathbf{x}} \left( \Delta I + (\nabla I)^T \delta\mathbf{u} \right)^2 \ , \qquad (21)$$
where the sum is taken within the 5 x 5 window. Minimizing this error with respect to
δu leads to the equation
$$\left[ \sum_{\mathbf{x}} (\nabla I)(\nabla I)^T \right] \delta\mathbf{u} = - \sum_{\mathbf{x}} (\nabla I) \, \Delta I \ .$$

We make some observations concerning the singularities of this relationship. If the sum-
ming window consists of a single element, the 2 x 2 matrix on the left-hand side is an
outer product of a 2 x 1 vector and hence has a rank of at most unity. In our case, when
the summing window consists of 25 points, the rank of the matrix on the left-hand side
will be two unless the directions of the gradient vectors ∇I everywhere within the window
coincide. This situation is the general case of the aperture effect.
In our implementation of this technique, the flow estimate at each point is obtained by
using a 5 x 5 window centered around that point. This amounts to assuming implicitly
that the flow field varies smoothly over the image.
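A sketch of this local estimator is given below; ΔI and the reference-image gradients are assumed precomputed, the 5 x 5 sums are formed with a uniform filter, and a small regularizer (our own addition) stands in for a proper treatment of the rank-deficient aperture-effect case.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_flow_increment(dI, Ix, Iy, eps=1e-3):
    """Per-pixel incremental flow from 5x5 neighborhood sums (Eq. (21))."""
    S = lambda a: uniform_filter(a, size=5) * 25.0     # 5x5 sums
    Sxx, Sxy, Syy = S(Ix * Ix), S(Ix * Iy), S(Iy * Iy)
    bx, by = -S(Ix * dI), -S(Iy * dI)
    det = Sxx * Syy - Sxy * Sxy + eps                  # guard near-singular windows
    du = (Syy * bx - Sxy * by) / det                   # closed-form 2x2 solve
    dv = (Sxx * by - Sxy * bx) / det
    return du, dv
```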

Experiments with the general flow model: We demonstrate the general flow algo-
rithm on an image sequence containing several independently moving objects, a case for
which the other motion models described here are not applicable. Figure 5a shows one
image of the original sequence. Figure 5b shows the difference between the two frames
that were used to compute image flow. Figure 5c shows little difference between the com-
pensated image and the other original image. Figure 5d shows the horizontal component
of the computed flow field, and figure 5e shows the vertical component. In local image
regions where image structure is well-defined, and where the local image motion is sim-
ple, the recovered motion estimates appear plausible. Errors predictably occur however
at motion boundaries. Errors also occur in image regions where the local image structure
is not well-defined (like some parts of the road), but for the same reason, such errors do
not appear as intensity errors in the compensated difference image.
4 Discussion

Thus far, we have described a hierarchical framework for the estimation of image motion
between two images using various models. Our motivation was to generalize the notion
of direct estimation to model-based estimation and unify a diverse set of model-based
estimation algorithms into a single framework. The framework also supports the combined
use of parametric global models and local models which typically represent some type of
a smoothness or local uniformity assumption.
One of the unifying aspects of the framework is that the same objective function
(SSD) is used for all models, but the minimization is performed with respect to different
parameters. As noted in the introduction, this is enabled by viewing all these problems
from the perspective of image registration.
It is interesting to contrast this perspective (of model-based image registration) with
some of the more traditional approaches to motion analysis. One such approach is to
compute image flow fields, which involves combining the local brightness constraint with
some sort of a global smoothness assumption, and then interpret them using appropriate
motion models. In contrast, the approach taken here is to use the motion models to
constrain the flow field computation. The obvious benefit of this is that the resulting
flow fields may generally be expected to be more consistent with models than general
smooth flow fields. Note, however, that the framework also includes general smooth flow
field techniques, which can be used if the motion model is unknown.
In the case of models that are not fully parametric, local image information is used to
determine local image/scene properties (e.g., the local range value). However, the accu-
racy of these can only be as good as the available local image information. For example,
in homogeneous areas of the scene, it may be possible to achieve perfect registration even
if the surface range estimates (and the corresponding local flow vectors) are incorrect.
However, in the presence of significant image structures, these local estimates may be
expected to be accurate. On the other hand, the accuracy of the global parameters (e.g.,
the rigid motion parameters) depends only on having sufficient and sufficiently diverse
local information across the entire region. Hence, it may be possible to obtain reliable
estimates of these global parameters, even though estimated local information may not
be reliable everywhere within the region. For fully parametric models, this problem does
not exist.
The image registration problem addressed in this paper occurs in a wide range of
image processing applications, far beyond the usual ones considered in computer vision
(e.g., navigation and image understanding). These include image compression via motion
compensated encoding, spatiotemporal analysis of remote sensing type of images, image
database indexing and retrieval, and possibly object recognition. One way to state this
general problem is as that of recovering the coordinate system that relates two images of
a scene taken from two different viewpoints. In this sense, the framework proposed here
unifies motion analysis across these different applications as well.

Acknowledgements: Many individuals have contributed to the ideas and results pre-
sented here. These include Peter Burt and Leonid Oliker from the David Sarnoff Research
Center, and Shmuel Peleg from Hebrew University.
References

1. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence,
7(4):384-401, July 1985.
2. P. Anandan. A unified perspective on computational techniques for the measurement of
visual motion. In International Conference on Computer Vision, pages 219-230, London,
May 1987.
3. P. Anandan. A computational framework and an algorithm for the measurement of visual
motion. International Journal of Computer Vision, 2:283-310, 1989.
4. J. R. Bergen and E. H. Adelson. Hierarchical, computationally efficient motion estimation
algorithm. J. Opt. Soc. Am. A., 4:35, 1987.
5. J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three
frames. In International Conference on Computer Vision, Osaka, Japan, December 1990.
6. P. J. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. IEEE
Transactions on Communication, 31:532-540, 1983.
7. P.J. Burt, J.R. Bergen, R. Hingorani, R. Kolczynski, W.A. Lee, A. Leung, J. Lubin, and
H. Shvaytser. Object tracking with a moving camera, an application of dynamic motion
analysis. In IEEE Workshop on Visual Motion, pages 2-12, Irvine, CA, March 1989.
8. P.J. Burt, R. Hingorani, and R. J. Kolczynski. Mechanisms for isolating component pat-
terns in the sequential analysis of multiple motion. In IEEE Workshop on Visual Motion,
pages 187-193, Princeton, NJ, October 1991.
9. Stefan Carlsson. Object detection using model based prediction and motion parallax. In
Stockholm workshop on computational vision, Stockholm, Sweden, August 1989.
10. J. Dengler. Local motion estimation with the dynamic pyramid. In Pyramidal systems for
computer vision, pages 289-298, Maratea, Italy, May 1986.
11. W. Enkelmann. Investigations of multigrid algorithms for estimation of optical flow fields in
image sequences. Computer Vision, Graphics, and Image Processing, 43:150-177, 1988.
12. K. J. Hanna. Direct multi-resolution estimation of ego-motion and structure from motion.
In Workshop on Visual Motion, pages 156-162, Princeton, NJ, October 1991.
13. J. Heel. Direct estimation of structure and motion from multiple frames. Technical Report
1190, MIT AI LAB, Cambridge, MA, 1990.
14. E. C. Hildreth. The Measurement of Visual Motion. The MIT Press, 1983.
15. B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.
16. B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-
203, 1981.
17. B. K. P. Horn and E. J. Weldon. Direct methods for recovering motion. International
Journal of Computer Vision, 2(1):51-76, June 1988.
18. B.D. Lucas and T. Kanade. An iterative image registration technique with an application
to stereo vision. In Image Understanding Workshop, pages 121-130, 1981.
19. L. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for estimating
depth from image-sequences. In International Conference on Computer Vision, pages 199-
213, Tampa, FL, 1988.
20. H. H. Nagel. Displacement vectors derived from second order intensity variations in image
sequences. Computer Vision, Graphics, and Image Processing, 21:85-117, 1983.
21. S. Negahdaripour and B.K.P. Horn. Direct passive navigation. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 9(1):168-176, January 1987.
22. A. Singh. An estimation theoretic framework for image-flow computation. In International
Conference on Computer Vision, Osaka, Japan, November 1990.
23. A.M. Waxman and K. Wohn. Contour evolution, neighborhood deformation and global
image flow: Planar surfaces in motion. International Journal of Robotics Research, 4(3):95-
108, Fall 1985.
Fig. 1. Diagram of the hierarchical motion estimation framework.

Fig. 2. Affine motion estimation: a) Original. b) Raw difference. c) Compensated difference.


Fig. 3. Planar surface motion estimation.


a) Original image.
b) Raw difference.
c) Difference after planar compensation.
d) Planar grid superimposed on the original image.
Resolution | ω                        | T
           | (.0000, .0000, .0000)    | (.0000, .0000, 1.0000)
 32 x 30   | (.0027, .0039, -.0001)   | (-.3379, -.1352, .9314)
 64 x 60   | (.0038, .0041, .0019)    | (-.3319, -.0561, .9416)
128 x 120  | (.0037, .0012, .0008)    | (-.0660, -.0383, .9971)
256 x 240  | (.0029, .0006, .0013)    | (-.0255, -.0899, .9956)

Fig. 4. Egomotion based flow model.


a) Original image from an outdoor sequence.
b) Raw difference.
c) Difference after ego-motion compensation.
d) Inverse range map.
e) Rigid body parameters recovered at each resolution.
Fig. 5. Optical flow estimation.


a) Original image.
b) Raw difference.
c) Difference after motion compensation.
d) Horizontal component of the recovered flow field.
e) Vertical component of the recovered flow field.
A Fast Method to Estimate Sensor Translation

V. Sundareswaran*

Courant Institute, New York University, New York, NY 10012

Abstract. An important problem in visual motion analysis is to deter-
mine the parameters of egomotion. We present a simple, fast method that
computes the translational motion of a sensor that is generating a sequence
of images. This procedure computes a scalar function from the optical flow
field induced on the image plane due to the motion of the sensor and uses
the norm of this function as an error measure. Appropriate values of the
parameters used in the computation of the scalar function yield zero error;
this observation is used to locate the Focus of Expansion which is directly
related to the translational motion.

1 Introduction
We consider the motion of a sensor in a rigid, static environment. The motion produces a
sequence of images containing the changing scene. We want to estimate the motion of the
sensor, given the optical flow fields computed from the sequence. We model the motion
using a translational velocity T and a rotational velocity ω. These are the instantaneous
motion parameters.
Many procedures exist to compute the optical flow field [1,4]. Also, several methods
have been proposed to compute the motion parameters from the optical flow field. One
feature of most of these methods is that they operate locally. Recovering structure, which
is contained in local information, seems to be the motivation for preferring local methods.
However, the motion parameters are not local and they are better estimated by employing
global techniques. In addition, using more data usually results in better performance in
the presence of noise. Non-local algorithms are given in [3] and [8], and more recently, in
[6]. The algorithm presented in [3] requires search over grid points on a unit sphere. The
method of Prazdny [8] is based on a non-linear minimization. Faster methods have been
presented recently [7,10]. Though all these methods work well on noiseless flow fields,
there is insufficient data about their performance on real images. The work in this paper
has been motivated by the observation that making certain approximations to an exact
procedure gives a method that produces robust results from real data.
The algorithm presented here determines the location of the focus of expansion (FOE)
which is simply the projection of the translation vector T on the imaging plane. It is well
known that once the FOE is located, the rotational parameters can be computed from
the optical flow equations [2]. Alternative methods to directly compute the rotational
parameters from the flow field have also been proposed [11]. We begin by reviewing the
flow equations, then describe the algorithm and present experimental results.

2 T h e flow equations
We consider the case of motion of a sensor in a static environment. We choose the
coordinate system to be centered at the sensor which uses perspective projection for
* Supported under Air Force contract F33615-89-C-1087 reference 87-02-PMRE. The author
wishes to thank Bob Hummel for his guidance.
imaging onto a planar image surface (Fig. 1). The sensor moves with a translational
velocity of T = (v_1, v_2, v_3) and an angular velocity of ω = (ω_1, ω_2, ω_3).
The transformation from spatial coordinates to the image coordinates is given by the
equations
$$x = fX/Z \ , \qquad y = fY/Z \ ,$$
where (X, Y, Z) = (X(x, y), Y(x, y), Z(x, y)) is the position of the point in three-space
that is imaged at (x, y) and f is the focal length. The optical flow V = (u, v) at the
image point (x, y) is easily obtained [2,5,9]:
$$u(x, y) = \frac{1}{Z}\left[ -f v_1 + x v_3 \right] + \omega_1 \frac{xy}{f} - \omega_2 \left[ f + \frac{x^2}{f} \right] + \omega_3 y \ ,$$
$$v(x, y) = \frac{1}{Z}\left[ -f v_2 + y v_3 \right] + \omega_1 \left[ f + \frac{y^2}{f} \right] - \omega_2 \frac{xy}{f} - \omega_3 x \ . \qquad (1)$$
Here, u(x, y) and v(x, y) are the x and y components of the optical flow field V(x, y).
In this context, we are interested in determining the location (τ = f v_1/v_3, η = f v_2/v_3),
which is nothing but the projection of the translational velocity T onto the image plane.
This location is referred to as the Focus of Expansion (FOE).
Looking at Eqn. 1, we note that the vector flow field V(x, y) is simply the sum of the
vector field V_T(x, y) arising from the translation T and the vector field V_ω(x, y) due to
the rotation ω:
$$\mathbf{V}(x, y) = \mathbf{V}_T(x, y) + \mathbf{V}_\omega(x, y) \ .$$

3 Algorithm Description

The observation behind the algorithm is that a certain circular component, computed from
the flow field by choosing a center (x_0, y_0), is a scalar function whose norm is quadratic in
the two variables x_0 and y_0. The norm is zero (in the absence of noise) at the FOE. This
procedure will be referred to as the Norm of the Circular Component (NCC) algorithm.

3.1 The circular component

For each candidate (x_0, y_0), we consider the circular component of the flow field about
(x_0, y_0) defined by
$$U_{(x_0, y_0)}(x, y) = \mathbf{V}(x, y) \cdot (-y + y_0, \, x - x_0) \ . \qquad (2)$$
Note that this is nothing but the projection onto concentric circles whose center is located
at (x_0, y_0). Since V = V_T + V_ω, we further define
$$U^T_{(x_0, y_0)}(x, y) = \mathbf{V}_T(x, y) \cdot (-y + y_0, \, x - x_0) \ ,$$
$$U^\omega_{(x_0, y_0)}(x, y) = \mathbf{V}_\omega(x, y) \cdot (-y + y_0, \, x - x_0) \ ,$$
so that
$$U_{(x_0, y_0)}(x, y) = U^T_{(x_0, y_0)}(x, y) + U^\omega_{(x_0, y_0)}(x, y) \ ,$$
where, denoting ρ(x, y) = 1/Z(x, y),
$$U^T_{(x_0, y_0)}(x, y) = v_3 \, \rho(x, y) \left[ (y_0 - \eta)x + (\tau - x_0)y + \eta x_0 - \tau y_0 \right] \ , \qquad (3)$$
and
$$U^\omega_{(x_0, y_0)}(x, y) = \left[ \omega_1 \frac{xy}{f} - \omega_2\left(f + \frac{x^2}{f}\right) + \omega_3 y \right](y_0 - y) + \left[ \omega_1\left(f + \frac{y^2}{f}\right) - \omega_2\frac{xy}{f} - \omega_3 x \right](x - x_0) \ . \qquad (4)$$
At the focus of expansion, when (x_0, y_0) = (τ, η),
$$U^T_{(\tau, \eta)}(x, y) = v_3 \, \rho(x, y) \, (x - \tau, \, y - \eta) \cdot (-y + \eta, \, x - \tau) = 0 \ , \qquad (5)$$
so that U_{(x_0, y_0)} = U^ω_{(x_0, y_0)} for (x_0, y_0) = (τ, η). Eqn. 5 is merely a result of the radial
structure of the translational component of the flow field. In other words, pure translation
produces a field that is orthogonal to concentric circles drawn with the FOE as the center.
Observations about the quadratic nature of U^ω_{(x_0, y_0)} (Eqn. 4) lead to the convolution
and subspace projection methods described in [6]. Here, we obtain a method that is
approximate but quick and robust.
To this end, we define an error function E(x_0, y_0) as the norm of U_{(x_0, y_0)}(x, y):
$$E(x_0, y_0) = \| U_{(x_0, y_0)}(x, y) \|^2 \ . \qquad (6)$$
The important observation is that U_{(x_0, y_0)}(x, y) is linear in the parameters x_0 and y_0.
As a result, the norm defined in Eqn. 6 will be quadratic in x_0 and y_0; that is, E(x_0, y_0)
will be a quadratic polynomial in x_0 and y_0. The minimum of this quadratic surface is
purported to occur at the Focus of Expansion (FOE). We will justify this claim shortly.
But first, if the claim is correct, we have a simple algorithm, which we describe now.

3.2 The NCC algorithm
The first step is to choose six sets of values for (x_0, y_0) in a non-degenerate configuration
(in this case, non-collinear). Next, for each of these candidates, compute the circular
component and define E(x_0, y_0) to be the norm of the circular component (NCC). In a
discrete setting, the error value is simply the sum of the squares of the circular component
values. Note that this can be done even in the case of a sparse flow field. The error
function values at these six points completely define the error surface because of its
quadratic nature, and so the location of the minimum can be found using a closed-form
expression. That location is the computed FOE.
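The whole procedure fits in a few lines; the sketch below assumes a dense flow field (u, v) on coordinate grids (xs, ys) and six candidate centers in a non-degenerate configuration, fits the quadratic surface E(x_0, y_0) by solving a 6 x 6 linear system, and returns its stationary point. Names are ours.

```python
import numpy as np

def ncc_foe(u, v, xs, ys, centers):
    """Estimate the FOE by the NCC algorithm: sample E(x0, y0) at six
    candidate centers, fit the quadratic surface, return its minimum."""
    def E(x0, y0):                                   # norm of the circular component
        circ = u * (y0 - ys) + v * (xs - x0)         # Eq. (2) at every pixel
        return np.sum(circ ** 2)
    # Quadratic model E = a*x0^2 + b*y0^2 + c*x0*y0 + d*x0 + e*y0 + g
    A = np.array([[x0**2, y0**2, x0*y0, x0, y0, 1.0] for x0, y0 in centers])
    coeffs = np.linalg.solve(A, np.array([E(x0, y0) for x0, y0 in centers]))
    a, b, c, d, e, _ = coeffs
    # Stationary point: 2a*x0 + c*y0 + d = 0 and c*x0 + 2b*y0 + e = 0
    H = np.array([[2 * a, c], [c, 2 * b]])
    return np.linalg.solve(H, -np.array([d, e]))
```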
Let us now examine the claim about the minimum being at the location of the FOE.
Note that the function U_{(x_0, y_0)}(x, y) is made up of two parts; one is the translational
part shown in Eqn. 3, and the other is the rotational part (Eqn. 4). The translational
part U^T_{(x_0, y_0)}(x, y) vanishes at the FOE, as shown in Eqn. 5, and it is non-zero elsewhere.
Thus, the norm ||U^T_{(x_0, y_0)}(x, y)||² is a positive quadratic with minimum (equal to zero) at
the FOE. This is no longer true once we add the rotational part. However, as long as the
contribution from the rotational part is small compared to that from the translational
part, we can approximate the behavior of ||U_{(x_0, y_0)}(x, y)||² by ||U^T_{(x_0, y_0)}(x, y)||².
The method is exact for pure translation and is approximate when the rotation is small
compared to the translation or when the depth of objects is small (i.e., high ρ(x, y)), as
would be the case in indoor situations. Also, there is no apparent reason for this method
to fail in the case where a planar surface occupies the whole field of view. Previous
methods [7,10] are known to fail in such a case. Indeed, in two experiments reported
here, a large portion of the view contains a planar surface. In all experiments done with
synthetic as well as actual data, this algorithm performs well. We present results from
actual image sequences here.
4 Experiments

For all the sequences used in the experiments, the flow field was computed using an
implementation of Anandan's algorithm [1]. The dense flow field thus obtained (on a 128
by 128 grid) is used as input to the NCC algorithm. The execution time per frame is on
an average less than 0.45 seconds for a casual implementation on a SUN Sparcstation-2.
The helicopter sequences, provided by NASA, consist of frames shot from a moving
helicopter that is flying over a runway. For the straight line motion, the helicopter has
a predominantly forward motion, with little rotation. The turning flight motion has
considerable rotation. The results of applying the circular component algorithm to these
sequences are shown in Figure 2 for ten frames (nine flow fields). This is an angular
error plot, the angular error being the angle between the actual and computed directions
of translation. The errors are below 6 degrees for all the frames of the straight flight
sequence. Notice the deterioration in performance towards the end of the turning flight
sequence due to the high rotation (about 0.15 rad/sec).
The results from a third sequence (titled ridge, courtesy David Heeger) are shown in
Figure 2. Only frames 10 through 23 are shown because the actual translation data was
readily available only for these frames. In this sequence, the FOEs are located relatively
high above the optical axis. Such sequences are known to be hard for motion parameter
estimation because of the confounding effect between the translational and rotational
parameters (see the discussion in [6]). The algorithm presented here performs extremely
well, in spite of this adverse situation.

5 Conclusions

In most practical situations, the motion is predominantly translational. However, even in


situations where only translation is intended, rotation manifests itself due to imperfections in
the terrain on which the camera vehicle is traveling or due to other vibrations in the vehi-
cle. Algorithms that assume pure translation will break down under such circumstances
if they are sensitive to such deviations. However, the algorithm described here seems to
tolerate small amounts of rotation. So, it can be expected to work well under the real
translational situations and for indoor motion where the small depth values make the
translational part dominant.
In addition to the above situations, the method described here could also be used to
provide a quick initial guess for more complicated procedures that are designed to work
in the presence of large rotational values.

References

1. P. Anandan. A computational framework and an algorithm for the measurement of visual


motion. International Journal of Computer Vision, 2:283-310, 1989.
2. D. Heeger and A. Jepson. Subspace methods for recovering rigid motion I: Algorithm and
implementation. Research in Biological and Computational Vision Tech. Rep. RBCV-TR-
90-35, University of Toronto.
3. D. Heeger and A. Jepson. Simple method for computing 3d motion and depth. In Proceed-
ings of the 3rd International Conference on Computer Vision, pages 96-100, Osaka, Japan,
December 1990.
4. David J. Heeger. Optical flow using spatiotemporal filters. International Journal of Com-
puter Vision, 1:279-302, 1988.

5. B.K.P. Horn. Robot Vision. The MIT Press, 1987.


6. Robert Hummel and V. Sundareswaran. Motion parameter estimation from global flow
field data. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear.
7. A. Jepson and D. Heeger. A fast subspace algorithm for recovering rigid motion. In IEEE
Workshop on Visual Motion, Princeton, New Jersey, Oct 1991.
8. K. Prazdny. Determining the instantaneous direction of motion from optical flow generated
by a curvilinearly moving observer. Computer Vision, Graphics and Image Processing,
17:238-248, 1981.
9. H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proc.
Royal Soc. Lond. B, 208:385-397, 1980.
10. V. Sundareswaran. Egomotion from global flow field data. In IEEE Workshop on Visual
Motion, Princeton, New Jersey, Oct 1991.
11. V. Sundareswaran and R. Hummel. Motion parameter estimation using the curl of the flow
field. In Eighth Israeli Conference on AI and Computer Vision, Tel-Aviv, Dec 1991.

Fig. 1. The coordinate systems and the motion parameters

Fig. 2. Angular error plots for the helicopter sequences (left: straight line flight in solid line and turning flight in dotted line) and the ridge sequence (right)
Identifying multiple motions from optical flow *

Alessandra Rognone, Marco Campani, and Alessandro Verri


Dipartimento di Fisica dell'Università di Genova
Via Dodecaneso 33, 16146 Genova, Italy

Abstract. This paper describes a method which uses optical flow, that is,
the apparent motion of the image brightness pattern in time-varying images,
in order to detect and identify multiple motions. Homogeneous regions are
found by analysing local linear approximations of optical flow over patches
of the image plane, which determine a list of the possibly viewed motions,
and, finally, by applying a technique of stochastic relaxation. The presented
experiments on real images show that the method is usually able to identify
regions which correspond to the different moving objects, is also rather
insensitive to noise, and can tolerate large errors in the estimation of optical flow.

1 Introduction

Vision is a primary source of information for the understanding of complex scenarios in


which different objects may be moving non-rigidly and independently. Computer vision
systems should be capable of detecting and identifying the image regions which corre-
spond to single moving objects and interpreting the viewed motions in order to interact
profitably with the environment. This capability could also be usefully employed to drive
the focus of attention and track moving objects in cluttered scenes.
The relative motion of the viewed surfaces with respect to the viewing camera pro-
duces spatial and temporal changes in the image brightness pattern which provide a vast
amount of information for segmenting the image into the different moving parts [1,2]. As
the image motion of nearby points in space which belong to the same surface are very
similar, optical flow, i.e., the apparent motion of the image brightness pattern on the
image plane [3], is a convenient representation of this information. In addition, simple
interpretations of first order spatial properties of optical flow make possible meaningful
qualitative and quantitative descriptions of the relative viewed motion which are proba-
bly sufficient for a number of applications [2,4-7]. This paper proposes a method, which
is based on optical flow, for the detection and identification of multiple motions from
time-varying images.
The proposed method consists of three steps. In the first step, a number of linear
vector fields which approximate optical flow over non-overlapping squared patches of
the image plane are computed. In the second step, these linear vector fields are used to
produce a list of the "possible" viewed motions, or labels. Finally, in the third step, a
label, that is, a possible motion, is attached to each patch by means of a technique of
stochastic relaxation. The labeling of image patches depending on the apparent motion
* This work has been partially funded by the ESPRIT project VOILA, the Progetto
Finalizzato Robotica, the Progetto Finalizzato Trasporti (PROMETHEUS), and by
the Agenzia Spaziale Italiana. M.C. has been partially supported by the Consorzio
Genova Ricerche. Clive Prestt kindly checked the English.

by means of relaxation techniques was first proposed in [2,8]. The presented method has
several very good features. Firstly, although accurate pointwise estimates of optical flow
are difficult to obtain, the spatial coherence of optical flow appears to be particularly well
suited for a qualitative characterisation of regions which correspond to the same moving
surface independently of the complexity of the scene. Secondly, even rather cluttered
scenes are segmented into a small number of parts. Thirdly, the computational load is
almost independent of the data. Lastly, the choice of the method for the computation
of optical flow is hardly critical since the proposed algorithm is insensitive to noise and
tolerates large differences in the flow estimates.
The paper is organised as follows. Section 2 discusses the approximation of optical
flow in terms of linear vector fields. In Section 3, the proposed method is described in
detail. Section 4 presents the experimental results which have been obtained on sequences
of real images. The main differences between the proposed method and previous schemes
are briefly discussed in Section 5. Finally, the conclusions are summarised in Section 6.

2 Spatial properties of optical flow

The interpretation of optical flow over small regions of the image plane is often ambiguous
[9]. Let us discuss this fact in some detail by looking at a simple example of a sequence
of real images.
Fig. 1A shows a frame of a sequence in which the camera is moving toward a picture
posted on the wall. The angle between the optical axis and the direction orthogonal to
the wall is 30°. The optical flow which is obtained by applying the method described in
[10] to the image sequence and relative to the frame of Fig. 1A is shown in Fig. 1B. It
is evident that the qualitative structure of the estimated optical flow is correct. It can
be shown [7] that the accuracy with which the optical flow of Fig. 1B and its first order
properties can be estimated is sufficient to recover quantitative information, like depth
and slant of the viewed planar surface. The critical assumption that makes it possible to
extract reliable quantitative information from optical flow is that the relative motion is
known to be rigid and translational.
In the absence of similar "a priori" information (or in the presence of more complex
scenes) the interpretation of optical flow estimates is more difficult. In this case, a local
analysis of the spatial properties of optical flow could be deceiving. Fig. 1C, for example,
shows the vector field which has been obtained by dividing the image plane in 64 non-
overlapping squared patches of 32 x 32 pixels and computing the linear rotating vector
field which best approximates the optical flow of Fig. 1B over each patch. Due to the
presence of noise and to the simple spatial structure of optical flow, the correlation
coefficient of this "bizarre" local approximation is very high. On a simple local (and
deterministic) basis there is little evidence that the vector field of Fig. 1B is locally
expanding. However, a more coherent interpretation can be found by looking at the
distributions of Fig. 1D. The squares locate the foci of expansion of the linear expanding
vector fields which best approximate the estimated optical flow in each patch, while the
crosses locate the centers of rotation of the rotating vector field which have been used to
produce the vector field of Fig. 1C. It is evident that while the foci of expansion tend to
clusterise in the neighbourhood of the origin of the image plane (identified by the smaller
frame), the centers of rotation are spread around. This observation lies at the basis of the
method for the identification of multiple motions which is described in the next Section.
Fig. 1. A) A frame of a sequence in which the viewing camera is moving toward a picture posted on the wall. The angle between the normal vector to the wall and the optical axis is 30°. B) Optical flow computed by means of the method described in [10], associated with the frame of A). C) The optical flow which is obtained by dividing the field of view into 64 squared patches (32 x 32 pixels each) and computing the linear rotating field which best approximates the optical flow of B) in each patch (in the least mean square sense). D) Distributions of the foci of expansion (squares) and centers of rotation (crosses) of the linear expanding and clockwise rotating vector fields respectively which lie within an area four times larger than the field of view (identified by the solid frame).

3 A method for detecting multiple motions

In this Section a method for detecting and identifying multiple motions from optical flow is proposed. The three main steps of the method are discussed separately by looking at an example of a synthetic image sequence.

3.1 Computing linear approximations of optical flow

Fig. 2A shows a frame of a computer generated sequence in which the larger sphere is translating toward the image plane, while the smaller sphere is translating away from the image plane and the background is rotating. The optical flow relative to the frame of Fig. 2A, and computed through a procedure described in [10], is shown in Fig. 2B. In order to identify the different motions in Fig. 2B, the first step of the method analyses the first

order spatial properties of optical flow. The optical flow is divided into patches of fixed
size and the expanding (EVF), contracting (CVF), clockwise (CRVF) and anticlockwise
(ARVF) rotating, and constant (TVF) vector fields which best approximate the optical
flow in each patch si, i = 1, ..., N , are computed. Roughly speaking, this is equivalent to
reducing the possible 3D motions to translation in space with a fairly strong component
along the optical axis (EVF and CVF), rotation around an axis nearly orthogonal to
the image plane (CRVF and ARVF), and translation nearly parallel to the image plane
(TVF). This choice, which is somewhat arbitrary and incomplete, does not allow an
accurate recovery of 3D motion and structure (the shear terms, for example, are not
taken into account), but usually appears to be sufficient in order to obtain a qualitative
segmentation of the viewed image in the different moving objects (see Section 4).
As a result of the first step, five vectors x_si^j, j = 1, ..., 5, are associated with each patch s_i: the vector x_si^1, the position over the image plane of the focus of expansion of the EVF; x_si^2, the position of the focus of contraction of the CVF; x_si^3, the position of the center of the CRVF; x_si^4, the position of the center of the ARVF; and the unit vector x_si^5, parallel to the direction of the TVF.
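As an illustration of this first step, the following sketch (not the authors' code) fits one of the five canonical fields, the expanding field u = k(x - a), v = k(y - b), to the flow samples of a single patch by linear least squares; the fitted singular point (a, b) plays the role of the vector x_si^1 attached to the patch. The contracting, rotating and constant fields can be fitted with the same pattern.

```python
# Hedged sketch: least-squares fit of a linear expanding field to one flow patch.
import numpy as np

def fit_expanding_field(x, y, u, v):
    """x, y: pixel coordinates in the patch; u, v: flow samples (1-D arrays)."""
    n = x.size
    # unknowns [k, k*a, k*b]:  u = k*x - k*a,  v = k*y - k*b
    A_top = np.column_stack([x, -np.ones(n), np.zeros(n)])
    A_bot = np.column_stack([y, np.zeros(n), -np.ones(n)])
    A = np.vstack([A_top, A_bot])
    b = np.concatenate([u, v])
    (k, ka, kb), *_ = np.linalg.lstsq(A, b, rcond=None)
    focus = (ka / k, kb / k) if abs(k) > 1e-12 else (np.nan, np.nan)
    residual = np.linalg.norm(A @ np.array([k, ka, kb]) - b)
    return k, focus, residual   # k > 0: expanding, k < 0: contracting
```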

3.2 Determining the possible motions

In order to produce a list of the "possible" motions in the second step, global properties
of the obtained EVFs, CVFs, CRVFs, ARVFs, and TVFs are analysed. This step is
crucial, since the pointwise agreement between each of the computed local
vector fields and the optical flow of each patch usually makes it difficult, if not impossible,
to select the most appropriate label (see Section 2). Figs. 2C and D respectively show
the distribution of the foci of expansion and contraction, and centers of clockwise and
anticlockwise rotation, associated with the EVFs, CVFs, CRVFs, and ARVFs of the
optical flow of Fig. 2B. A simple clustering algorithm has been able to find two clusters
in the distribution of Fig. 2C, and these clusters clearly correspond to the expansion
and contraction along the optical axis of Fig. 2B. The same algorithm, applied to the
distribution of the centers of rotation (shown in Fig. 2D), reveals the presence of a
single cluster in the vicinity of the image plane center corresponding to the anticlockwise
rotation in Fig. 2B. On the other hand, in the case of translation, the distribution of the
unit vectors parallel to the directions of the TVFs is considered (see Fig. 2E). For the
optical flow of Fig. 2B the distribution of Fig. 2E is nearly flat indicating the absence
of preferred translational directions. Therefore, as a result of this second step, a label l
is attached to each "possible" motion which can be characterised by a certain cluster of points x_si^c(l), where c(l) equals 1, ..., 4, or 5 depending on l. In the specific example of
Fig. 2, one label of expansion, one of contraction, and one of anticlockwise rotation, are
found.
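The paper does not specify which clustering algorithm is used; the sketch below is one plausible stand-in, a greedy agglomeration of singular points by a distance threshold, in which only clusters with enough members are kept as "possible" motions. The radius and minimum cluster size are assumptions.

```python
# Hedged stand-in for the "simple clustering algorithm" mentioned in the text.
import numpy as np

def find_clusters(points, radius=30.0, min_size=8):
    """points: (N, 2) array of foci/centers; returns a list of cluster centroids."""
    points = np.asarray(points, dtype=float)
    unassigned = list(range(len(points)))
    clusters = []
    while unassigned:
        members = [unassigned.pop(0)]        # seed a new cluster
        changed = True
        while changed:
            changed = False
            centroid = points[members].mean(axis=0)
            for i in list(unassigned):       # grow by distance to the centroid
                if np.linalg.norm(points[i] - centroid) < radius:
                    members.append(i)
                    unassigned.remove(i)
                    changed = True
        if len(members) >= min_size:         # keep only well-populated clusters
            clusters.append(points[members].mean(axis=0))
    return clusters
```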

3.3 Labeling through deterministic relaxation

In the third and final step, each patch of the image plane is assigned one of the possible
labels by means of an iterative relaxation procedure [11]. The key idea is that of defining
a suitable energy function which not only depends on the optical flow patches but also on
the possible motions, and reaches its minimum when the correct labels are attached to
the flow patches. In the current implementation, the energy function is a sum extended
over each pair of neighbouring patches in which the generic term u(s_i, s_j), where s_i and
s_j are a pair of neighbouring patches, is given by the formula


Fig. 2. A) A frame of a synthetic sequence in which the larger sphere is translating toward the
image plane, while the smaller sphere is moving away and the background is rotating anticlock-
wise. B) The corresponding optical flow computed by means of the method described in [10].
C) Distributions of the foci of expansion (squares) and contraction (crosses) of the EVFs and
CVFs respectively which lie within an area four times larger than the field of view (identified
by the solid frame). D) Distribution of the centers of anticlockwise rotation of the ARVFs. E)
Distribution of the directions of the TVFs on the unit circle. F) Colour coded segmentation of
the optical flow of B) obtained through the algorithm described in the text.

u(s_i, s_j) = (||x_si^c(l) - x̄_l|| + ||x_sj^c(l) - x̄_l||) δ    (1)

where x̄_l is the center of mass of the cluster corresponding to the label l, and δ = 1 if the labels of the two patches, l_i and l_j respectively, equal l, and δ = 0 otherwise. The relaxation procedure has been implemented through an iterative deterministic algorithm in which, at each iteration, each patch is visited and assigned the label which minimises
the current value of the energy function, keeping all the other labels fixed. The procedure
applied to the optical flow of Fig. 2B, starting from a random configuration, produces
the colour coded segmentation shown in Fig. 2F after twenty iterations. From Fig. 2F, it
is evident that the method is able to detect and correctly identify the multiple motions
of the optical flow of Fig. 2B. Extensive experimentation indicates that the deterministic
version usually converges on the desired solution. This is probably due to the fact that,
for the purpose of detecting multiple motions, the true solution can be approximated
equally well by nearly optimal solutions.
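The following sketch conveys the flavour of this deterministic relaxation loop (an ICM-style update). The energy is a paraphrase rather than the paper's exact Eq. 1: a data term (distance between the patch's singular point and the centroid of the candidate label's cluster) plus a Potts-style smoothness term that charges a fixed amount for every 4-connected neighbour carrying a different label; the shape penalty functions discussed below are omitted.

```python
# Hedged sketch: iterative deterministic (ICM-style) relabeling of flow patches.
import numpy as np

def relax_labels(singular_points, centroids, n_iter=20, beta=50.0, seed=0):
    """singular_points: array (L, H, W, 2), the singular point of each patch under
    each candidate label; centroids: array (L, 2) of cluster centres."""
    L, H, W, _ = singular_points.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, L, size=(H, W))          # random initial labelling
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                costs = np.empty(L)
                for l in range(L):
                    data = np.linalg.norm(singular_points[l, i, j] - centroids[l])
                    smooth = 0.0
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W and labels[ni, nj] != l:
                            smooth += beta            # Potts disagreement penalty
                    costs[l] = data + smooth
                labels[i, j] = int(np.argmin(costs))  # greedy local minimisation
    return labels
```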
To conclude, it has to be said that the profile of the segmented regions can be suitably
modeled by adding ad hoc terms to the energy (or "penalty functions") which tend
to penalise regions of certain shapes. The choice of the appropriate penalty functions
reflects the available "a priori" knowledge, if any, on the expected shapes. In the current
implementation, in which no "a priori" knowledge is available, only narrow regions have
been inhibited (configurations in which, in a square region of 3 x 3 patches, there are fewer than five patches with the same label are given infinite energy).

4 Experimental results on real images

Let us now discuss two experiments on real images. Fig. 3A shows a frame of a sequence
in which the viewing camera is translating toward the scene while the box is moving
toward the camera. The optical flow associated with the frame of Fig. 3A is shown in
Fig. 3B. From Fig. 3B it is evident that the problem of finding different moving objects
from the reconstructed optical flow is difficult. Due to the large errors in the estimation of
optical flow, simple deterministic (and local) procedures which detect flow edges, or sharp
changes in optical flow, are doomed to failure. In addition, the viewed motion consists
of two independent expansions and even in the presence of precisely computed optical
flow, no clear flow edge can be found as the flow direction in the vicinity of the top, right
side, and bottom of the box agrees with the flow direction of the background. Fig. 3C
shows the distribution of the foci of expansion associated with the EVFs computed as
described above. Two clusters are found which correspond to the (independent) motion of
the camera and of the box of Fig. 3A. On the contrary, no clusters are found in the other
distributions. Therefore, it can be concluded that, at most, two different motions (mainly
along the optical axis) are present in the viewed scene. The colour coded segmentation
which is obtained by applying the third step of the proposed method is shown in Fig. 3D.
It is evident that the algorithm detects and correctly identifies the two different motions
of the viewed scene.
In the second experiment (Fig. 4A), a puppet is moving away from the camera, while
the plant in the lower part of Fig. 4A is moving toward the image plane. The optical
flow associated with the frame of Fig. 4A is reproduced in Fig. 4B. As can be easily seen
from Fig. 4C both the distributions of the foci of expansion (squares) and contraction
(crosses) clusterise in the neighbourhood of the origin. No cluster has been found in the
other distributions, which is consistent with the optical flow of Fig. 4B. The segmentation
which is obtained by applying the relaxation step is shown in Fig. 4D.

Fig. 3. A) A frame of a sequence in which the box is translating toward the camera, while the
camera is translating toward an otherwise static environment. B) The corresponding optical flow
computed by means of the method described in [10]. C) Distribution of the foci of expansion
of the EVFs. D) Colour coded segmentation of the optical flow of B) obtained through the
algorithm described in the text.

This example clarifies the need for two distinct labels for expansion and contraction
(and, similarly, for clockwise and anticlockwise rotation). The energy term of Eq. 1,
which simply measures distances between singular points, would not be sufficient to
distinguish between expanding and contracting patches. In order to minimise the number
of parameters which enter the energy function, it is better to consider a larger number
of different local motions than to add extra-terms to the right-hand-side of Eq. 1.
To summarise, the proposed method appears to be able to detect multiple motions and
correctly segment the viewed image into the different moving objects even if the estimates
of optical flow are rather noisy and imprecise.

5 Differences from previous methods

It is evident that the presented method is very different from the deterministic schemes
which attempt to identify multiple motions by extracting flow edges [12-13]. Important
similarities, instead, can be found with the technique proposed in [2]. Firstly, the same
mathematical machinery (stochastic relaxation) is used. Secondly, in both cases first


Fig. 4. A) A frame of a sequence in which the puppet is moving away from the camera, while
the plant is translating toward the image plane. B) The corresponding optical flow computed
by means of the method described in [10]. C) Distribution of the foci of expansion (squares) and
contraction (crosses) of the EVFs and CVFs respectively. D) Colour coded segmentation of the
optical flow of B) obtained through the algorithm described in the text.

order spatial properties of optical flow, such as expansion and rotation, are employed to
determine the different types of motion. However, the two methods are basically different.
In [2] regions are segmented and only at a later stage local spatial properties of optical
flow are used to interpret the viewed motions. The possible motions are data-independent
and the resolution is necessarily fairly low. On the contrary, the method described in the
previous Section computes the possible motions first and then identifies the regions which
correspond to the different moving objects. Consequently, the number of labels remains
small and stochastic relaxation always runs efficiently. In addition, since the possible
motions are data-dependent, the resolution is sufficiently high to allow for the detection
of "expansion within expansion" (see Fig. 3D) or the determination of arbitrary direction
of translation.

6 Conclusion

In this paper a method for the detection and identification of multiple motions from
optical flow has been presented. The method, which makes use of linear approximations of

optical flow over relatively large patches, is essentially based on a technique of stochastic
relaxation. Experimentation on real images indicates that the method is usually capable
of segmenting the viewed image into the different moving parts robustly against noise,
and independently of large errors in the optical flow estimates. Therefore, the technique
employed in the reconstruction of optical flow does not appear to be critical. Due to the
coarse resolution at which the segmentation step is performed, the proposed algorithm
only takes a few seconds on a Sun SPARCStation for a 256x256 image, apart from the
computation of optical flow.
To conclude, future work will focus on the extraction of quantitative information on the segmented regions and will be directed toward the theoretical (and empirical) study of the local motions which must be added in order to increase the capability of the method.

References

1. Adiv, G. Determining three-dimensional motion and structure from optical flow gen-
erated by several moving objects. IEEE Trans. Pattern Anal. Machine Intell. 7
(1985), 384-401.
2. Francois, E. and P. Bouthemy. Derivation of qualitative information in motion anal-
ysis. Image and Vision Computing 8 (1990), 279-288.
3. Gibson, J.J. The perception of the Visual World. (Boston, Houghton Mifflin, 1950).
4. Koenderink, J.J. and van Doorn, A.J. How an ambulant observer can construct
a model of the environment from the geometrical structure of the visual inflow. In
Kybernetik 1977, G. Hauske and E. Butenandt (Eds.), (Oldenbourg, München, 1977).
5. Verri, A., Girosi, F., and Torre, V. Mathematical properties of the two-dimensional
motion field: from Singular Points to Motion Parameters. J. Optical Soc. Amer. A 6
(1989), 698-712.
6. Subbarao, M. Bounds on time-to-collision and rotational component from first order
derivatives of image flow. CVGIP 50 (1990), 329-341.
7. Campani, M., and Verri, A. Motion analysis from first order properties of optical
flow. CVGIP: Image Understanding in press (1992).
8. Bouthemy, P. and Santillana Rivero, J. A hierarchical likelihood approach for region
segmentation according to motion-based criteria. In Proc. 1st Intern. Conf. Comput.
Vision London (UK) (1987), 463-467.
9. Adiv, G. Inherent ambiguities in recovering 3D motion and structure from a noisy
flow field. Pattern Anal. Machine Intell. 11 (1989), 477-489.
10. Campani, M. and A. Verri. Computing optical flow from an overconstrained system
of linear algebraic equations. Proc. 3rd Intern. Conf. Comput. Vision Osaka (Japan)
(1990), 22-26.
11. Geman, D., Geman, S., Graffigne, C., and P. Dong. Boundary detection by con-
strained optimization. IEEE Trans. Pattern Anal. Machine Intell. 12 (1990), 609-
628.
12. Thompson, W.B. and Ting Chuen Pong. Detecting moving objects. IJCV 4 (1990),
39-58.
13. Verri, A., Girosi, F., and Torre, V. Differential techniques for optical flow. J. Optical
Soc. Amer. A 7 (1990), 912-922.
A Fast Obstacle Detection Method based on Optical
Flow *

Nicola Ancona

Tecnopolis CSATA Novus Ortus 70010 Valenzano - Bari - Italy.


e-mail: ancona@minsky.csata.it

Abstract. This paper presents a methodology, based on the estimation


of the optical flow, to detect static obstacles during the motion of a mobile
robot. The algorithm is based on a correlation scheme. At any time, we
estimate the position of the focus of expansion and stabilize it by using the
Kalman filter. We use the knowledge of the focus position of the flow field
computed in the previous time to reduce the search space of corresponding
patches and to predict the flow field in the successive one. Because of its
intrinsic recursive aspect, the method can be seen as an on-off reflex which
detects obstacles lying on the ground during the path of a mobile platform.
No calibration procedure is required. The key aspect of the method is that
we compute the optical flow only on one row of the image, the one relative to the ground plane.

1 Introduction

The exploitation of robust techniques for visual processing is certainly a key aspect in
robotic vision applications. In this work, we investigate an approach for the detection of static obstacles on the ground, by evaluation of optical flow fields. A simple way to model an obstacle is as a plane rising from the ground, orthogonal to it and high enough to be perceived. In this framework, we are interested in the changes happening on the ground plane, rather than in the environmental aspect of the scene. Several constraints help in analysing and tackling the problem. Among them:

1. the camera attention is on the ground plane, considerably reducing the amount of required
computational time and data;
2. the motion of a robot on a plane exhibits only three degrees of freedom; further, the
height of the camera from this plane remains constant in time.

The last constraint is a powerful one on the system geometry, because, in pure transla-
tional motion, only differences of the vehicle's velocity and depth variations can cause
changes in the optical flow field. Then the optical flow can be analysed looking for the
anomalies with respect to a predicted velocity field [SA1].
A number of computational issues have to be taken into account: a) the on-line
aspect, that is the possibility to compute the optical flow using at most two frames;
b)the capability of detecting obstacles on the vehicle's path in a reliable and fast way;
c)the possibility of updating the system status when a new frame is available. The above
considerations led us to use a recursive token matching scheme as suitable for the problem
at hand. The developed algorithm is based on a correlation scheme [LI1] for the estimation
* Acknowledgements: this paper describes research done at the Robotic and Automation
Laboratory of the Tecnopolis CSATA. Partial support is provided by the Italian PRO-ART
section of PROMETHEUS.

of the optical flow fields. It uses two frames at a time to compute the optical flow and
so it is a suitable technique for on-line control strategies. We show how the estimation
of the optical flow on only one row of reference on the image plane is robust enough to
predict the presence of an obstacle. We estimate the flow field in the next frame using
a predictive Kalman filter, in order to have an adaptive search space of corresponding
patches, according to the environmental conditions. The possibility of changing the search
space is one of the key aspects of the algorithm's performance. We have to enlarge the
search space only when it is needed that is only when an obstacle enters the camera's
field of view. Moreover, it is important to point out that no calibration procedure is
required. The developed methodology is different from [SA1] and [EN1] because we are
interested in the temporal evolution of the predicted velocity field.

2 Obstacle model
Let us suppose that a camera moves with pure translational motion on a floor plane β (fig. 1a), and that the distance h between the optical center C and the reference plane stays constant in time. Let V(V_x, V_y, V_z) be the velocity vector of C and let us suppose that it is constant and parallel to the plane β. Let us consider a plane γ (obstacle) orthogonal to β, having its normal parallel to the motion direction. Moreover, let us consider a point P(P_x, P_y, P_z) lying on β and let p(p_u, p_v) be its perspective projection on the image plane: T(P) = p. When the camera moves, γ intersects the ray projected by p in a point Q(Q_x, Q_y, Q_z), with Q_z < P_z. In other words, Q lies on the straight line through C and P. It is worth pointing out that the points P and Q are acquired from the element p at different temporal instants, because we have assumed the hypothesis of opacity of the objects' surfaces in the scene.


Fig. 1. (a) The geometrical aspects of the adopted model. The optical axis of the camera is directed toward the floor plane β, forming a fixed angle with it. (b) The field vectors relative to the points p and q on r in a general situation.

Let us consider the field vectors, W_P and W_Q, projected on the image plane by the
camera motion, relative to P and Q. At this point, let us make some useful considerations:

1. W_P and W_Q have the same direction. This statement holds because, in pure translational motion, the vectors of the 2D motion field converge to the focus of expansion F, independently of the objects' surfaces in the scene. Then, W_P and W_Q are parallel and so the following proposition holds: ∃λ > 0 : W_Q = λ W_P.
2. As Q_z < P_z, the following relation holds: ||W_P|| < ||W_Q||.

So we can claim that, under the constraint of a constant velocity vector, a point Q
(obstacle), rising from the floor plane, does not change the direction of the vector flow,
with respect to the corresponding point on the floor plane P, but it only increases its
length. So, the variation of the modulus of the optical flow at one point, generated by the
presence of a point rising from the floor plane, is a useful indicator to detect obstacles
along the robot path.
We know that the estimation of the optical flow is very sensitive to noise. To this end, let us take a row on the image plane, rather than single points, from which to extract the flow field. The considerations above still hold. Let us suppose the X axis of the camera coordinate system to be parallel to the plane β. We consider a straight line r : v = k on the image plane and let w be the corresponding line on β obtained by back projecting r: w = T⁻¹(r) ∩ β. Under these constraints, all points on w have the same depth value. Let us consider two elements of r, p(p_u, k) and q(q_u, k), and let P(P_x, P_y, z) and Q(Q_x, Q_y, z) be two points on w such that T(P) = p, T(Q) = q. We can state that the end points of the 2D motion field vectors, W_p and W_q, lie on a straight line s (fig. 1b). In particular,
when the camera views the floor plane without obstacles, the straight lines r and s stay
parallel and maintain the same displacement during the time. When an obstacle enters
into the camera's field of view, the line parameters change and they can be used to detect
the presence of obstacles.
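A minimal sketch of this cue follows, under the assumption that the test is implemented as reported in the experiments (Section 4): fit the line s_t through the end points of the flow vectors measured along the reference row, and watch its parameters m and n for a sustained monotonic drift. The window length k and the function names are assumptions.

```python
# Hedged sketch: line fit for s_t and a simple monotonic-drift obstacle indicator.
import numpy as np

def line_through_flow_endpoints(cols, row, flow_u, flow_v):
    """cols: pixel columns of the reference row; row: its image row index;
    flow_u, flow_v: flow components measured at those columns."""
    xe = cols + flow_u            # end points of the displacement vectors
    ye = row + flow_v
    A = np.stack([xe, np.ones_like(xe)], axis=1)   # least-squares fit of y = m*x + n
    (m, n), *_ = np.linalg.lstsq(A, ye, rcond=None)
    return m, n

def monotonic_drift(values, k=5):
    """True if the last k parameter values increase or decrease monotonically."""
    v = np.asarray(values[-k:], dtype=float)
    d = np.diff(v)
    return len(v) == k and (np.all(d > 0) or np.all(d < 0))
```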

3 Search space reduction

A first analysis of the optical flow estimation process shows that the performances of
the algorithm are related to the magnitude of the expected velocity of a point on the
image plane. This quantity is proportional to the search space (SS) of corresponding
brightness patches, in two successive frames. In other words, SS is the set of possible
displacements of a point during the time unit. We focus our attention on the search space
reduction, one of the key aspects of many correspondence based algorithms, to make the
performance of our approach close to real time and to obtain more reliable results. The
idea is to adapt the size of the search space according to the presence or absence of an
obstacle in the scene.
Let us consider (fig. 1b) a row r on the image plane and let p and q be two points on it. Let W_p and W_q be the relative field vectors. Moreover, let s_t be the straight line on which all the end points of the field vectors lie at time t. The search space SS_t, at time t, is defined by the following rectangular region:

SS_t = [min{W_p^u, W_q^u}, max{W_p^u, W_q^u}] x [min{W_p^v, W_q^v}, max{W_p^v, W_q^v}]    (1)

We want to stress that SS_t is constrained by F_t and by the straight line s_t. In the light of these considerations, wishing to predict SS_{t+Δt} at time t + Δt, it is enough to predict s_{t+Δt}, knowing F_t and s_t at the previous time. To realize this step, which we call optical flow prediction, we assume the temporal continuity constraint to be true, in other words that the possibility of using a high sample rate of input data holds.

Suppose we know an estimate F_t of the FOE at time t [AN1]. The end points of W_p and W_q determine a straight line s_t, whose equation is y = m_t x + n_t. As we are only considering pure translational motion, the straight lines determined by the vectors W_p and W_q converge to F_t. So, let us consider the straight lines l_p = [F_t, p] and l_q = [F_t, q], determined by F_t and p and q respectively. We denote by A and B the intersections of l_p and l_q with s_t. As these points lie on s_t, to predict s_{t+Δt}, the position of s_t at the instant t + Δt, it is sufficient to predict the positions of A and B at the instant t + Δt. In the following, we consider only the point A, because the same considerations hold for the point B. As the positions of F and p are constant, we can affirm that A moves on the line l_p. To describe the kinematic equation relative to this point, let us represent l_p in a parametric way. So the motion of A on l_p can be described using the temporal variations of the real parameter λ:

λ(t) = λ(τ) + λ'(τ)(t - τ) + (1/2) λ''(τ)(t - τ)²,   where τ < t    (2)

This equation, describing the temporal evolution of A, can be written in a recursive way. Setting τ = (k - 1)T and t = kT, where T is the unit time, we get:

x_1(k) = x_1(k - 1) + x_2(k - 1) T + (1/2) a(k - 1) T²
x_2(k) = x_2(k - 1) + a(k - 1) T    (3)

where x_1(k) denotes the value of the parameter λ, x_2(k) its velocity and a(k) its acceleration. In this model, a(k) is regarded as white noise. Using a vectorial representation, the following equation holds: x(k) = A x(k - 1) + w(k - 1), describing the dynamical model of the signal. At each instant of time it is possible to observe only the value of λ, so the observation model of the signal is given by the following equation: y(k) = C x(k) + v(k), with C = (1, 0), where E[v(k)] = 0 and E[v²(k)] = σ_v². The last two equations, describing the system and observation models, can be solved by using the predictive Kalman filtering equations. At each step, we get the best prediction of the parameter λ for A and B and so we are able to predict the estimate of the optical flow field for all points of r.
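A sketch of one predict/update cycle of this predictive Kalman filter for the scalar parameter λ is given below. The state vector x = (λ, its velocity), the transition matrix A and the observation matrix C = (1, 0) follow the text; the noise levels sigma_a and sigma_v, and the code itself, are assumptions rather than the author's implementation.

```python
# Hedged sketch: white-noise-acceleration Kalman filter for the line parameter lambda.
import numpy as np

def kalman_step(x, P, z, T=1.0, sigma_a=1.0, sigma_v=1.0):
    """x: state (2,), P: covariance (2, 2), z: measured value of lambda."""
    A = np.array([[1.0, T], [0.0, 1.0]])                     # constant-velocity model
    Q = sigma_a ** 2 * np.array([[T ** 4 / 4, T ** 3 / 2],
                                 [T ** 3 / 2, T ** 2]])      # acceleration as white noise
    C = np.array([[1.0, 0.0]])
    R = np.array([[sigma_v ** 2]])
    # prediction: this is what bounds the search space for the next frame
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # update with the measured value of lambda
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ (np.array([z]) - C @ x_pred)).ravel()
    P_new = (np.eye(2) - K @ C) @ P_pred
    return x_new, P_new, x_pred
```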

4 Experimental results

The sequence, fig. 2, was acquired from a camera mounted on a mobile platform moving
at a speed of 100 mm/sec. The camera optical axis was pointing towards the ground. In this experiment, we used human legs as the obstacle. The size of each image is 64 x 256 pixels. The estimation of the optical flow was performed only on the central row (the 32nd) of each image. Fig. 3 shows the parameters m and n of s during the sequence. It is possible to note that at the beginning of the sequence the variations of the parameters m and n are not very strong. Only when the obstacle is close to the camera can the perception module detect its presence. This phenomenon is due to the experimental set-up: the camera's focal length and the angle between the optical axis and the ground plane. The algorithm perceives the presence of an obstacle when one of the above parameters increases or decreases in a monotonic way. Our implementation runs on an IBM RISC 6000 at a rate of 0.25 sec per frame.
A c k n o w l e d g e m e n t s : we would like to thank Piero Cosoli for helpful comments on the
paper. Antonella Semerano checked the English.

Fig. 2. Ten images of the sequence and the relative flow fields computed on the reference row.

Fig. 3. The values of the parameters m and n of s computed during the sequence.

References

[LI1] Little J., Bulthoff H. and Poggio T.: Parallel Optical Flow Using Local Voting. IEEE
2nd International Conference in Computer Vision, 1988
[SA1] Sandini G. and Tistarelli M.: Robust Obstacle Detection Using Optical Flow. IEEE
Workshop on Robust Computer Vision, 1-3 October 1990, Seattle - USA
[AN1] Ancona N.: A First Step Toward a Temporal Integration of Motion Parameters.
IECON'91, October 28 1991, Kobe - Japan
[EN1] Enkelmann W.: Obstacle Detection by Evaluation of Optical Flow Fields from Image
Sequences. First European Conference on Computer Vision, April 1990, Antibes - France
A parallel implementation of a structure-from-motion algorithm

Han Wang 1, Chris Bowman 2, Mike Brady 1 and Chris Harris 3

1 Oxford University, Robotics Research Group, Oxford, OX1 3PJ, UK


2 DSIR Industrial Development, 24 Balfour Road, Auckland, NZ
3 Roke Manor Research Ltd, Roke Manor, Romsey, SO51 0ZN, UK

Abstract. This paper describes the implementation of a 3D vision algo-


rithm, Droid, on the Oxford parallel vision architecture, PARADOX, and
the results of experiments to gauge the algorithm's effectiveness in pro-
viding navigation data for an autonomous guided vehicle. The algorithm
reconstructs 3D structure by analysing image sequences obtained from a
moving camera. In this application, the architecture delivers a performance
of greater than 1 frame per second - 17 times the performance of a Sun-4
alone.

1 Introduction
PARADOX [5] is a hybrid parallel architecture which has been commissioned at Oxford
in order to improve the execution speed of vision algorithms and to facilitate their in-
vestigation in time-critical applications such as autonomous vehicle guidance. Droid[3] is
a structure-from-motion vision algorithm which estimates 3-dimensional scene structure
from an analysis of passive image sequences taken from a moving camera. The motion of
the camera (ego-motion) is unconstrained, and so is the structure of the viewed scene.
Until recently, because of the large amount of computation required, Droid has been
applied off-line using prerecorded image sequences, thus making real-time evaluation of
performance difficult.
Droid functions by detecting and tracking discrete image features through the image
sequence, and determining from their image-plane trajectories both their 3D locations
and the 3D motion of the camera. The extracted image features are assumed to be
the projection of objective 3D features. Successive observations of an image feature are
combined by use of a Kalman filter to provide optimum 3D positional accuracy.
The image features originally used by Droid are determined from the image, I, by forming at each pixel location the 2 x 2 matrix A = w ⊗ [(∇I)(∇I)^T], where w is a Gaussian smoothing mask. Feature points are placed at maxima of the response function R [3], R = det(A) - k (trace(A))², where k is a weighting constant. Often, features are
located near image corners, and so the operator tends to be referred to as a corner finder.
In fact, it also responds to local textural variations in the grey-level surface where there
are no extracted edges. Such features arise naturally in unstructured environments such
as natural scenes. Manipulation and matching of corners are quite straightforward and
relatively accurate geometric representation of the viewed scene can be achieved. In the
current implementation, the depth map is constructed from tracked 3D points using a
local interpolation scheme based on Delaunay triangulation [2].
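A sketch of the interest operator described above is given below: the 2 x 2 matrix A of Gaussian-smoothed gradient products is formed at every pixel and the response R = det(A) - k (trace(A))² is returned; feature points would then be taken at local maxima of R. The smoothing scale and the value of k are assumptions, and scipy is used here only for convenience.

```python
# Hedged sketch of the corner/interest response used by Droid (Harris-style).
import numpy as np
from scipy.ndimage import gaussian_filter

def corner_response(image, sigma=1.5, k=0.04):
    img = image.astype(float)
    Iy, Ix = np.gradient(img)                     # image gradients
    Axx = gaussian_filter(Ix * Ix, sigma)         # smoothed products (the matrix A)
    Axy = gaussian_filter(Ix * Iy, sigma)
    Ayy = gaussian_filter(Iy * Iy, sigma)
    det = Axx * Ayy - Axy * Axy
    trace = Axx + Ayy
    return det - k * trace ** 2                   # R = det(A) - k * trace(A)^2
```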
Droid runs in two stages: the first stage is the booting stage, called boot mode, in
which Droid uses the first two images to start the matching process; the second stage is
the run stage called run mode.

In the boot mode, points in the two 2D images are matched using epipolar constraints.
The matched points provide disparity information which is then used for estimation of
ego-motion and 3D instantiation. Ego-motion is described as a 6-vector (3 in translation
and 3 in rotation).
The run mode of Droid includes a 3D-2D match which associates the 3D points with
the newly detected 2D points, an updated ego-motion estimation and a 2D-2D match,
between residual points in the feature points list and unmatched points from the previous
frame, to identify new 3D features. Also, 3D points which have been unmatched over a
period are retired.

2 PARADOX Architecture

PARADOX is a hybrid architecture, designed and configured especially for vision/image


processing algorithms. It consists of three major functional parts: a Datacube pipelined
system, a transputer network and a Sun4 workstation. The architecture of PARADOX,
as applied to Droid, is shown in Figure 1.
The Datacube family contains more than 20 types of VME-based pipelined processing
and input/output modules which can perform a wide range of image processing operations
at video rates. Image data is passed between modules via a Datacube bus-standard
known as the MaxBus. System control is by means of the VME bus from the host Sun
workstation. The Datacube can be used for image digitisation, storage and display and
also for a wide range of video frame rate pixel-based processing operations.
The transputer network consists of a fully populated Transtech MCP 1000 board. This
contains 32 T800 transputers - each with one Mbyte of RAM - and both hardwired and
programmable switching devices to allow network topology to be altered. A wide range
of network topologies can be implemented including parallel one dimensional arrays [6],
a 2D array or a ring structure. This board delivers a peak performance of 320 MIPS. The
connection between the Datacube and the transputer network is by way of an interface
board designed by the British Aerospace Sowerby Research Centre [4].
In the parallel implementation of Droid [5], the Datacube is used to digitise, store
and display the image sequences and graphics overlays; the corner detection is carried
out by the transputer array and the 3D-isation is computed on the Sun workstation.

3 Performance Evaluation

Figure 2 shows an image from a sequence of frames with a superimposed Cartesian grid
plot of the interpreted 3D surface by Droid. The driveable region can be clearly identified.
An algorithm has been developed by D. Charnley [1] to extract the drivable region by
computing the surface normal of each grid.
The above demonstrates qualitatively the performance of Droid in live situations, but
not quantitatively. A series of experiments has been conducted at Oxford and at Roke
Manor to measure the performance of Droid in both live and static environments. The
intention has been to demonstrate the competence of dynamic vision in a real environ-
ment.
The performance obtained from PARADOX for parallel Droid was 0.87 seconds per
frame which is 17 times faster than a pure Sun-4 implementation. The overall performance
is limited primarily by the parallel execution of the 3D-isation and corner detection
algorithms which have comparable execution times. The Datacube control and visual
display functions contribute negligible fraction of execution time.

Fig. 1. Machine architecture of PARADOX (Droid incarnation)

Fig. 2. Droid reconstructed 3D surface; a driveable region can be clearly identified



The laser scanner on the vehicle can determine the AGV's location (2D position and
orientation) by detecting fixed bar-coded navigation beacons. This allows comparison
between the "true" AGV trajectory and that predicted by Droid. The following results
were obtained from an experiment where the AGV was programmed to run in a straight
line with varying speeds.

Fig. 3. Plane view of AGV trajectory. Solid line: Droid predicted motion; dashed line: laser scanner readings

Figure 3 depicts a plane view of the AGV's trajectories: the solid line represents the
AGV trajectory reported by Droid and the dashed line as reported by the laser scanner.
In this particular run, the AGV has been programmed to move in a straight line at
two different speeds. For the first part of the run it travels at about 8 cm/sec and for
the second it travels at about 4 cm/sec. Droid reports the camera position (6 degrees of
freedom - 3 translation and 3 rotation) from the starting point of the vehicle, which has
coordinates (x0, z0) = (0, 0) and the laser reported trajectory is re-aligned accordingly.
It can be seen from Figure 3 that the alignment between the laser scanner readouts and
the Droid prediction is very close.
During the run, the vehicle has been stopped twice manually to test this system's
tolerance under different situations. Figure 4 shows the speed of the AGV as determined
by Droid (solid line) and by the on-board laser scanner(dashed line). The speed plots
in figure 4 agree closely apart from the moments when the vehicle alters its speed, where Droid consistently overshoots. This can be improved using non-critical damping.

4 Conclusion and future work

Droid constructs an explicit three-dimensional representation from feature points ex-


tracted from a sequence of images taken by a moving camera. This paper has described
the algorithm, the PARADOX parallel vision architecture and the implementation of
Droid on PARADOX. Experiments have demonstrated the competence of Droid and the
performance of PARADOX in dealing with real world problems. The results show that
this system is capable of identifying basic surface structure and can be used to supply
purely passive guidance information for autonomous vehicles where other sensory mechanisms find it hard or impossible. Recently, an improved corner detection algorithm has

Fig. 4. Comparison of AGV speed. Solid line: speed reported using Droid; dashed line: speed reported using laser scanner

been developed and is under test at Oxford [7]. This uses second order directional deriva-
tives with the direction tangential to an edge. This algorithm has improved accuracy of
corner localisation and reduced computational complexity. Consequently, it allows faster
execution speed (14 frames per second) than the original Droid corner detection algo-
rithm. This, together with parallelisation of the 3D-isation algorithms, will offer further
improvements to overall execution speed. Future work will include (1) the incorporation
of the new fast corner detection algorithm into Droid, (2) the use of odometry information taken from the AGV to provide Droid with more accurate motion estimations, and (3) to eventually close the control loop of the AGV, that is, to control the AGV by utilising the information provided by Droid.

References
1. D Charnley and R Blisset. Surface reconstruction from outdoor image sequences. Image
and Vision Computing, 7(1):10-16, 1989.
2. L. De Floriani. Surface representation based on triangular grids. The Visual Computer,
3:27-50, 1987.
3. C G Harris and J M Pike. 3D positional integration from image sequences. In Proc. 3rd
Alvey Vision Conference, Cambridge, Sept. 1987.
4. J.A. Sheen. A parallel architecture for machine vision. In Colloquium on Practical applica-
tions of signal processing. Institution of the Electrical Engineers, 1988. Digest no: 1988/111.
5. H Wang and C C Bowman. The Oxford distributed machine for 3D vision system. In
IEE colloquium on Parallel Architectures for Image Processing Applications, pages 1/2-5/2,
London, April 1991.
6. H Wang, P M Dew, and J A Webb. Implementation of Apply on a transputer array. CON-
CURRENCY: Practice and Experience, 3(1):43-54, February 1991.
7. Han Wang and Mike Brady. Corner detection for 3D vision using array processors. In
BARNAIMAGE 91, Barcelona, Sept. 1991. Springer-Verlag.
Structure from Motion Using the Ground Plane Constraint
T. N. Tan, G. D. Sullivan & K. D. Baker
Department of Computer Science, University of Reading
Reading, Berkshire RG6 2AY, ENGLAND

Abstract. This paper concerns the interactive construction of geometric models


of objects from image sequences. We show that when the objects are constrained
to move on the ground plane, a simple direct SFM algorithm is possible, which is
vastly superior to conventional methods. The proposed algorithm is non-iterative,
and in general requires a minimum of three points in two frames. Experimental
comparisons with other methods are presented in the paper. It is shown to be
greatly superior to general linear SFM algorithms not only in computational cost
but also in accuracy and noise robustness. It provides a practical method for
modelling moving objects from monocular monochromatic image sequences.

1. Introduction
The work described here was carried out as part of the ESPRIT II project P2152 (VIEWS -
Visual Inspection and Evaluation of Wide-area Scenes). It is concerned with semi-automatic
methods to construct geometric object models using monocular monochromatic image
sequences. A common feature in the images used in the VIEWS project is that the movement of
objects satisfies the ground plane constraint [1, 2], i.e., the objects move on a ground surface
which, locally at least, is approximately flat. We approximate the flat ground surface by the X-Y
plane of a world coordinate system (WCS), whose Z-axis points upwards. In this WCS, an
object can only translate along the X- and Y-axis, and rotate about the Z-axis, leaving 3 degrees
of freedom of motion.
We show in this paper that, in order to make the most effective use of the ground plane
constraint in structure from motion (SFM), it is necessary to formulate structure (and motion)
constraint equations in the WCS. This allows us to derive simple yet robust SFM algorithms.
The paper is organised as follows. We first discuss the use of the ground plane constraint to
simplify the constraint equations on the relative depths (i.e., the structure) of the given rigid
points. We then describe simple robust methods for solving the constraint equations to recover
the structure and motion parameters. Experimental studies of both a conventional 6 degrees of
freedom SFM algorithm [3] and the proposed 3 degrees of freedom algorithm are then reported.

2. Constraint Equations
We assume a pinhole camera model with perspective projection as shown in Fig.1. Under this

Figure I. Coordinate Systems and Imaging Geometry



imaging model, the squared distance d²_mn measured in the WCS between two points P_m, with image coordinates (U_m, V_m), and P_n, with image coordinates (U_n, V_n), is given by

d²_mn = (λ_m U_m - λ_n U_n)² + (λ_m V_m - λ_n V_n)² + (λ_m W_m - λ_n W_n)²    (1)

where λ_m and λ_n are the depths (scales) of P_m and P_n respectively (λ_P = Z_P), and U, V and W are terms computable from known camera parameters and image coordinates [1, 2]. A similar equation can be written for the point pair in a subsequent frame. Using primed notation to indicate the new frame, we have

d'²_mn = (λ'_m U'_m - λ'_n U'_n)² + (λ'_m V'_m - λ'_n V'_n)² + (λ'_m W'_m - λ'_n W'_n)²    (2)


It is shown in [1, 2] that by using the distance invariance property of the rigidity constraint [4, 5]
and the height invariance property of the ground plane constraint, we can obtain from (1) and
(2) the following second-order polynomial equation:
A_m λ²_m + B_mn λ_m λ_n + A_n λ²_n = 0    (3)

where A_m, B_mn and A_n are terms computable from U, V and W [1, 2]. (3) is the basic constraint on the relative depths of two points of a rigid object whose movement satisfies the ground plane constraint. For N such points P_1, P_2, ..., P_N, there are N(N - 1)/2 different point pairs and thus N(N - 1)/2 constraint equations of type (3):

A_m λ²_m + B_mn λ_m λ_n + A_n λ²_n = 0,   n, m ∈ {1, 2, ..., N};  n > m    (4)

which can be solved for the N unknown depths λ_n, n = 1, 2, ..., N.

3. Estimation of Structure and Motion Parameters


Since the equations in (4) are homogeneous in the N unknown depths, the point depths can only
be determined up to a global scale factor as is the case in all SFM algorithms [6], and we can set
the depth of an arbitrary point (the reference point) to be an arbitrary value. For instance, we can
set the depth (λ_1) of the first point P_1 to be 1, then the N - 1 constraint equations associated with P_1 in (4) become N - 1 quadratic equations each of which specifies the depth of a single additional point:

A_m λ²_m + B_1m λ_m + A_1 = 0,   m ∈ {2, 3, ..., N}    (5)

The correct root may usually be determined by imposing the physical constraint on the depths λ_m > 0, m = 2, 3, ..., N. In situations where both roots are positive, we need to use the constraint between the point P_m and one additional point [1], i.e., in general, at most three points in two frames are required to solve the constrained SFM uniquely.
We take each of the given points in turn as the reference point, and repeat the above procedure, to obtain N sets of depths for the set of N points:

{(λ^n_1, λ^n_2, ..., λ^n_i, ..., λ^n_N) : n = 1, 2, ..., N}    (6)

where λ^n_i = 1 for i = n and the superscript n indicates the depths computed under reference point P_n. The depths of each set in (6) are normalised with respect to the depth of the same single point (say P_1) in the corresponding set to get N normalized sets of depths as

{(λ̄^n_1, λ̄^n_2, ..., λ̄^n_i, ..., λ̄^n_N) : n = 1, 2, ..., N}    (7)

where λ̄^n_1 = 1, n = 1, 2, ..., N. A unique solution for the depths of the N points is obtained by computing

λ_m = median{λ̄^n_m, n = 1, 2, ..., N},   m = 1, 2, ..., N    (8)

Equation (8) is justified by the fact that all sets of normalized depth scales in (7) describe the same relative structure of the given N points.
Other approaches have also been explored to solve (4) but are not included here because of
space limitation. Once the depths are known, the computation of point coordinates in the WCS,
and the estimation of the three motion parameters (i.e., the translations T_x and T_y along the X- and Y-axis, and the rotation angle about the Z-axis in the WCS) are straightforward. Details may be found in [1, 2].
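A sketch of the depth-recovery procedure of Eqs. 5-8 is given below, assuming the coefficient terms A_m and B_mn have already been computed from U, V and W: each non-reference depth is taken as a positive root of the quadratic, every point is used in turn as the reference, and the normalised sets are fused by the median. The disambiguation step used when both roots are positive is omitted.

```python
# Hedged sketch: ground-plane-constrained depth recovery (Eqs. 5-8), assuming
# precomputed coefficient arrays A (diagonal terms A_m) and B (cross terms B_mn).
import numpy as np

def depth_from_reference(A, B, ref):
    """Solve A_m*lam^2 + B_{ref,m}*lam + A_ref = 0 for each m, with lam_ref = 1."""
    N = len(A)
    depths = np.full(N, np.nan)
    depths[ref] = 1.0
    for m in range(N):
        if m == ref:
            continue
        roots = np.roots([A[m], B[ref, m], A[ref]])
        positive = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0]
        if positive:
            depths[m] = min(positive)   # third-point disambiguation omitted here
    return depths

def fuse_depths(A, B):
    """Take every point as reference, normalise each set to point 0, take medians (Eq. 8)."""
    N = len(A)
    sets = []
    for n in range(N):
        d = depth_from_reference(A, B, n)
        sets.append(d / d[0])           # normalise w.r.t. the same single point P_1
    return np.median(np.vstack(sets), axis=0)
```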

4. Experimental Results
We have compared the performance of the new algorithm with that of a recent linear SFM
algorithm proposed by Weng et al. [3]. For convenience, we call the proposed algorithm the
TSB algorithm, and the algorithm in [3] the WHA algorithm in the subsequent discussions.
Using synthetic image data, Monte Carlo simulations were conducted to investigate the
noise sensitivity of, and the influence of the number of point correspondences on the two
algorithms. Comprehensive testing has been carried out [1, 2]. Numerous results show that in
general the TSB algorithm performs much better than the WHA algorithm especially under high
noise conditions.
With real image sequences, the assessment of the accuracy of the recovered structure is not
straightforward as the ground truth is usually unknown. Proposals have been made [3] to use the
standard image error (SIE) Δe defined as follows [3]:
$$\Delta e = \left[\frac{1}{2N}\sum_{i=1}^{N}\left(d_i^2 + d_i'^2\right)\right]^{1/2} \qquad (9)$$
where N is the number of points, and d_i and d_i' are the distances in the two images between the
projection of the reconstructed 3D point i and its observed positions in the images. The SIEs of
the two algorithms applied to five different moving objects are listed in Table 1.
Table 1. SIE (in pixels) of Algorithms TSB and WHA Under Real Image Sequences

Moving Object    SIE, Algorithm WHA    SIE, Algorithm TSB    Error Ratio (WHA/TSB)
Lorry            0.160                 0.00134               119
Estate1          0.455                 0.00144               316
Estate2          0.532                 0.00147               362
Saloon1          0.514                 0.00177               290
Saloon2          1.188                 0.00171               695

In these terms, the TSB algorithm performs several hundred times better than the WHA algorithm. It is argued
in [1, 2] that the SIE gives a poor measure of performance. In addition to the observed error in
the two frames, the performance should be analysed by a qualitative assessment of the projected
wire-frame model of all given points from other views. For example, Fig. 2 shows two images of
a moving lorry and an intermediate view of the partial wire-frame lorry model recovered by the
two algorithms.

Figure 2. For detailed captions, see text.

Fig. 2(a) shows 12 lorry points in the first frame and Fig. 2(b) the same points in
the second frame. The inter-frame point correspondences were known. The recovered point
coordinates are converted into a partial wire-frame lorry model simply by connecting
appropriate point pairs. The model recovered by the WHA algorithm appears to fit the two
original images reasonably well as illustrated in Fig.2(c) and (e), as indicated by the small
standard image error (0.160) given in Table 1. We can now perturb the recovered model slightly
(as if the lorry had undergone a small motion between (c) and (e)). The outcome (Fig.2(d)) is far
from expectation, and the recovered model is clearly not lorry-shaped. In contrast, the partial
wire-frame lorry model recovered by our algorithm proves to be accurate and consistent as can
be seen in Fig.2(f)-(h). Other image pairs from the lorry sequence were used, and the results
obtained were similar. The disastrous performance of the WHA algorithm shown in the above
example is attributed to several causes. Firstly, the linear constraint used by [3] is not the same
as the rigidity constraint [8], and as a consequence there are non-rigid motions that would
satisfy the linear constraint (see Fig.2(d)). Secondly, no explicit use of the ground plane
constraint is made in [3]. Finally, the number of points used in the given example is small.
To further assess the performance of the TSB algorithm, the WCS coordinates of the 12
lorry points recovered by the algorithm from the two images Fig.2(a) and (b) were converted by
means of an interactive models-from-motion tool [7] into a full polyhedral lorry model, which is
displayed in Fig. 3 under three different viewpoints. The completed model has been matched
against the lorry image sequence and tracked automatically using the methods reported in [9],
as illustrated in Fig. 4. The match is very good.

Figure 3. Three different views of the lorry model recovered by the new algorithm
We have also investigated the sensitivity of the proposed algorithm to systematic errors
such as errors in rotational camera parameters. It was found that such errors have small effects
on the estimation of the rotation angle, and a moderate impact on that of the translational
parameters. Detailed results cannot be reported here due to space limitations.

Figure 4. Matching between the recovered lorry model and four lorry images

5. Discussion
In the real world, the movement of many objects (e.g., cars, objects on conveyor belts, etc.) is
constrained in that they only move on a fixed plane or surface (e.g., the ground). A new SFM
algorithm has been presented in this paper which, by formulating motion constraint equations in
the world coordinate system, makes effective use of this physical motion constraint (the ground
plane constraint). The algorithm is computationally simple and gives a unique and closed-form
solution to the motion and structure parameters of rigid 3-D points. It is non-iterative, and
usually requires two points in two frames. The algorithm has been shown to be greatly superior
to existing linear SFM algorithms in accuracy and robustness, especially under high noise
conditions and when there are only a small number of corresponding points. The recovered 3-D
coordinates of object points from outdoor images enable us to construct 3-D geometric object
models which match the 2-D image data with good accuracy.

References
[1] T.N. Tan, G. D. Sullivan, and K. D. Baker, Structure from Constrained Motion, ESPRIT II
P2152 project report, RU-03-WP.T411-01, University of Reading, March 1991.
[2] T.N. Tan, G. D. Sullivan, and K. D. Baker, Structure from Constrained Motion Using Point
Correspondences, Proc. of British Machine Vision Conf., 24-26 September 1991, Glasgow,
Scotland, Springer-Verlag, 1991, pp.301-309.
[3] J.Y. Weng, T. S. Huang, and N. Ahuja, Motion and Structure from Two Perspective Views:
Algorithms, Error Analysis, and Error Estimation, IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 11, no. 5, 1989, pp. 451-477.
[4] A. Mitiche and J. K. Aggarwal, A Computational Analysis of Time-Varying Images, in
Handbook of Pattern Recognition and Image Processing, T.Y. Young and K. S. Fu, Eds.
New York: Academic Press, 1986.
[5] S. Ullman, The Interpretation of Visual Motion, MIT Press, 1979.
[6] J.K. Aggarwal and N. Nandhakumar, On the Computation of Motion from Sequences of
Images - A Review, Proc. of IEEE, vol. 76, no. 8, 1988, pp. 917-935.
[7] T.N. Tan, G. D. Sullivan, and K. D. Baker, 3-D Models from Motion (MFM) - an
application support tool, ESPRIT II P2152 project report, RU-03-WP.T411-02, University
of Reading, June 1991.
[8] D.J. Heeger and A. Jepson, Simple Method for Computing 3D Motion and Depth, Proc. of
IEEE 3rd Inter. Conf. on Computer Vision, December 4-7, 1990, Osaka, Japan, pp. 96-100.
[9] A.D. Worrall, G. D. Sullivan, and K. D. Baker, Model-based Tracking, Proc. of British
Machine Vision Conf., 24-26 September 1991, Glasgow, Scotland, Springer-Verlag, 1991,
pp.310-318.
Detecting and Tracking Multiple Moving Objects
Using Temporal Integration*
Michal Irani, Benny Rousso, Shmuel Peleg
Dept. of Computer Science
The Hebrew University of Jerusalem
91904 Jerusalem, ISRAEL

A b s t r a c t . Tracking multiple moving objects in image sequences involves


a combination of motion detection and segmentation. This task can become
complicated as image motion may change significantly between frames, for
example with camera vibrations. Such vibrations make tracking in longer sequences
harder, as temporal motion constancy cannot be assumed.
A method is presented for detecting and tracking objects, which uses
temporal integration without assuming motion constancy. Each new frame
in the sequence is compared to a dynamic internal representation image
of the tracked object. This image is constructed by temporally integrating
frames after registration based on the motion computation. The temporal
integration serves to enhance the region whose motion is being tracked, while
blurring regions having other motions. These effects help motion analysis
in subsequent frames to continue tracking the same motion, and to segment
the tracked region.

1 Introduction

Motion analysis, such as optical flow [7], is often performed on the smallest possible re-
gions, both in the temporal domain and in the spatial domain. Small regions, however,
carry little motion information, and such motion computation is therefore very inac-
curate. Analysis of multiple moving objects based on optical flow [1] suffers from this
inaccuracy.
The major difficulty in increasing the size of the spatial region of analysis is the
possibility that larger regions will include more than a single motion. This problem
has been treated for image-plane translations with the dominant translation approach
[3, 4]. Methods with larger temporal regions have also been introduced, mainly using a
combined spatio-temporal analysis [6, 10]. These methods assume motion constancy in
the temporal regions, i.e., motion should be constant in the analyzed sequence.
In this paper we propose a method for detecting and tracking multiple moving objects
using both a large spatial region and a large temporal region without assuming temporal
motion constancy. When the large spatial region of analysis has multiple moving objects,
the motion parameters and the locations of the objects are computed for one object after
another. The method has been applied successfully to parametric motions such as affine
and projective transformations. Objects are tracked using temporal integration of images
registered according to the computed motions.
Sec. 2 describes a method for segmenting the image plane into differently moving
objects and computing their motions using two frames. Sec. 4 describes a method for
tracking the detected objects using temporal integration.
* This research has been supported by the Israel Academy of Sciences.

2 Detection of Multiple Moving Objects in Image Pairs

To detect differently moving objects in an image pair, a single motion is first computed,
and a single object which corresponds to this motion is identified. We call this motion the
dominant motion, and the corresponding object the dominant object. Once a dominant
object has been detected, it is excluded from the region of analysis, and the process is
repeated on the remaining region to find other objects and their motions.

2.1 Detection of a Single Object and its Motion

The motion parameters of a single translating object in the image plane can be recovered
accurately, by applying the iterative translation detection method mentioned in Sec. 3 to
the entire region of analysis. This can be done even in the presence of other differently
moving objects in the region of analysis, and with no prior knowledge of their regions of
support [5]. It is, however, rarely possible to compute the parameters of a higher order
parametric motion of a single object (e.g. affine, projective, etc.) when differently moving
objects are present in the region of analysis.
Following is a summary of the procedure to compute the motion parameters of an
object among differently moving objects in an image pair:

1. Compute the dominant translation in the region by applying a translation computation
technique (Sec. 3) to the entire region of analysis.
2. Segment the region which corresponds to the computed motion (Sec. 3). This confines
the region of analysis to a region containing only a single motion.
3. Compute a higher order parametric transformation (affine, projective, etc.) for the
segmented region to improve the motion estimation.
4. Iterate Steps 2-3-4 until convergence.

The above procedure segments an object (the dominant object), and computes its
motion parameters (the dominant motion) using two frames. An example for the de-
termination of the dominant object using an affine motion model between two frames
is shown in Fig. 2.c. In this example, noise has strongly affected the segmentation and
motion computation. The problem of noise is overcome once the algorithm is extended
to handle longer sequences using temporal integration (Sec. 4).
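A minimal sketch of this detect-and-refine loop is given below. The three callables are hypothetical stand-ins for the translation computation, segmentation and higher-order motion fitting of Sec. 3; this is an illustration of the control flow only, not the authors' implementation.

```python
def detect_dominant_object(frame1, frame2, region,
                           estimate_translation, segment_stationary,
                           fit_parametric_motion, n_refinements=5):
    """Sketch of the Sec. 2.1 procedure for one dominant object in an image pair.

    estimate_translation, segment_stationary and fit_parametric_motion are
    hypothetical placeholders for the routines described in Sec. 3. The paper
    iterates the refinement until convergence; a fixed number of refinement
    passes is used here for simplicity.
    """
    # Step 1: dominant translation over the whole region of analysis.
    motion = estimate_translation(frame1, frame2, region)
    for _ in range(n_refinements):
        # Step 2: keep only the pixels consistent with the current motion.
        region = segment_stationary(frame1, frame2, motion, region)
        # Step 3: refine with a higher-order (e.g. affine) parametric model.
        motion = fit_parametric_motion(frame1, frame2, region, model="affine")
    return region, motion
```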

3 Motion Analysis and Segmentation

This section describes briefly the methods used for motion computation and segmentation.
A more detailed description can be found in [9].

Motion Computation. It is assumed that the motion of the objects can be approximated
by 2D parametric transformations in the image plane. We have chosen to use an iter-
ative, multi-resolution, gradient-based approach for motion computation [2, 3, 4]. The
parametric motion models used in our current implementation are: pure translations (two
parameters), affine transformations (six parameters [3]), and projective transformations
(eight parameters [1]).

Segmentation. Once a motion has been determined, we would like to identify the region
having this motion. To simplify the problem, the two images are registered using the
detected motion. The motion of the corresponding region is therefore cancelled, and the
problem becomes that of identifying the stationary regions.
In order to classify correctly regions having uniform intensity, a multi-resolution
scheme is used, as in low resolution pyramid levels the uniform regions are small. The
lower resolution classification is projected on the higher resolution level, and is updated
according to higher resolution information (gradient or motion) when it conflicts with the
classification from the lower resolution level.
Moving pixels are detected in each resolution level using only local analysis. A simple
grey level difference is not sufficient for determining the moving pixels. However, the grey
level difference normalized by the gradient gives better results, and was sufficient for our
experiments. Let I(x, y, t) be the grey level of pixel (x, y) at time t, and let ∇I(x, y, t) be
its spatial intensity gradient. The motion measure D(x, y, t) used is the weighted average
of the intensity differences normalized by the gradients over a small neighborhood N(x, y)
of (x, y):
$$D(x,y,t) \stackrel{\mathrm{def}}{=} \frac{\sum_{(x_i,y_i)\in N(x,y)} |I(x_i,y_i,t+1) - I(x_i,y_i,t)|\,|\nabla I(x_i,y_i,t)|}{\sum_{(x_i,y_i)\in N(x,y)} |\nabla I(x_i,y_i,t)|^2 + C} \qquad (1)$$
where the constant C is used to avoid numerical instabilities. The motion measure (1)
is propagated in the pyramid according to its certainty at each pixel. At the highest
resolution level a threshold is taken to segment the image into moving and stationary
regions. The stationary region M(t) represents the tracked object.
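As a rough illustration of equation (1), a dense version of the measure can be computed with numpy and a box filter; the window size and constant C below are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def motion_measure(img_t, img_t1, window=5, C=1e-3):
    """Sketch of the normalized-difference motion measure of equation (1).

    img_t, img_t1: consecutive grey-level frames as float arrays. The
    neighborhood sums are approximated by box-filter means (which only
    rescales the constant C); window and C are illustrative choices.
    """
    gy, gx = np.gradient(img_t)                  # spatial intensity gradient
    grad_mag = np.hypot(gx, gy)
    num = uniform_filter(np.abs(img_t1 - img_t) * grad_mag, size=window)
    den = uniform_filter(grad_mag ** 2, size=window) + C
    return num / den      # thresholded at the top pyramid level to obtain M(t)
```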

4 Tracking Objects Using Temporal Integration

The algorithm for the detection of multiple moving objects described in Sec. 2 is extended
to track objects in long image sequences. This is done by using temporal integration of
images registered with respect to the tracked motion, without assuming temporal motion
constancy. The temporally integrated image serves as a dynamic internal representation
image of the tracked object.
Let {I(t)} denote the image sequence, and let M(t) denote the segmentation mask of
the tracked object computed for frame I(t), using the segmentation method described in
Sec. 3. Initially, M(0) is the entire region of analysis. The temporally integrated image
is denoted by Av(t), and is constructed as follows:
$$Av(0) \stackrel{\mathrm{def}}{=} I(0)$$
$$Av(t+1) \stackrel{\mathrm{def}}{=} w \cdot I(t+1) + (1 - w)\cdot \mathrm{register}(Av(t), I(t+1)) \qquad (2)$$
where currently w = 0.3, and register(P, Q) denotes the registration of images P and
Q by warping P towards Q according to the motion of the tracked object computed
between them. A temporally integrated image is shown in Fig. 1.
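The update of equation (2) itself is a one-liner; the small sketch below (illustrative only) takes the already-registered integrated image as an argument so that the registration step stays abstract.

```python
def update_integrated_image(new_frame, registered_av, w=0.3):
    """Equation (2): Av(t+1) = w * I(t+1) + (1 - w) * register(Av(t), I(t+1)).

    registered_av is Av(t) already warped towards the new frame using the
    tracked object's motion (the register() operation of the text).
    """
    return w * new_frame + (1.0 - w) * registered_av

# With w = 0.3 the integrated image is roughly an exponentially weighted average of
# registered past frames: frame I(t-k) enters with weight about 0.3 * 0.7**k, so the
# most recent few frames dominate while older frames fade out.
```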
Following is a summary of the algorithm for detecting and tracking the dominant
object in an image sequence, starting at t = 0:

1. Compute the dominant motion parameters between the integrated image Av(t) and
the new frame I(t + 1), in the region M(t) of the tracked object (Sec. 2).
2. Warp the temporally integrated image Av(t) and the segmentation mask M(t) to-
wards the new frame I(t + 1) according to the computed motion parameters.

Fig. 1. An example of a temporally integrated image.


a) A single frame from a sequence. The scene contains four moving objects.
b) The temporally integrated image after 5 frames. The tracked motion is that of
the ball which remains sharp, while all other regions blur out.

3. Identify the stationary regions in the registered images above (Sec. 3), using the
registered mask M(t) as an initial guess. This will be the tracked region in I(t + 1).
4. Compute the integrated image Av(t + 1) using (2), and process the next frame (a sketch of this loop is given below).
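A compact sketch of the tracking loop above is shown next. The callables estimate_motion, warp and segment_stationary are hypothetical stand-ins for the routines of Secs. 2 and 3, so this is only an outline of the control flow.

```python
import numpy as np

def track_dominant_object(frames, estimate_motion, warp, segment_stationary, w=0.3):
    """Sketch of the Sec. 4 tracking loop; yields a segmentation mask per frame.

    estimate_motion(av, frame, mask) -> motion, warp(image, motion) -> image and
    segment_stationary(av_reg, frame, mask) -> mask are hypothetical callables.
    """
    av = frames[0].astype(float)                 # Av(0) = I(0)
    mask = np.ones(av.shape, dtype=bool)         # M(0): entire region of analysis
    for new_frame in frames[1:]:
        # Step 1: dominant motion between Av(t) and I(t+1), inside M(t).
        motion = estimate_motion(av, new_frame, mask)
        # Step 2: warp the integrated image and the mask towards the new frame.
        av_reg = warp(av, motion)
        mask = warp(mask, motion).astype(bool)
        # Step 3: re-segment the stationary region, seeded by the warped mask.
        mask = segment_stationary(av_reg, new_frame, mask)
        # Step 4: temporal integration, equation (2).
        av = w * new_frame + (1.0 - w) * av_reg
        yield mask, motion
```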

When the motion model approximates well enough the temporal changes of the
tracked object, shape changes relatively slowly over time in registered images. There-
fore, temporal integration of registered frames produces a sharp and clean image of the
tracked object, while blurring regions having other motions. An example of a temporally
integrated image of a tracked rolling ball is shown in Fig. 1. Comparing each new frame
to the temporally integrated image rather than to the previous frame gives a strong
bias to keep tracking the same object. Since additive noise is reduced in the average
image of the tracked object, and since image gradients outside the tracked object decrease
substantially, both segmentation and motion computation improve significantly.
In the example shown in Fig. 2, temporal integration is used to detect and track
a single object. Comparing the segmentation shown in Fig. 2.c to the segmentation in
Fig. 2.d emphasizes the improvement in segmentation using temporal integration.

Fig. 2. Detecting and tracking the dominant object using temporal integration.
a-b) Two frames in the sequence. Both the background and the helicopter are moving.
c) The segmented dominant object (the background) using the dominant affine motion computed
between the first two frames. Black regions are those excluded from the dominant object.
d) The segmented tracked object after a few frames using temporal integration.

Another example for detecting and tracking the dominant object using temporal inte-
gration is shown in Fig. 3. In this sequence, taken by an infrared camera, the background
moves due to camera motion, while the car has another motion. It is evident that the
tracked object is the background, as other regions were blurred by their motion.

Fig. 3. Detecting and tracking the dominant object in an image sequence using temporal inte-
gration.
a-b) Two frames in an infrared sequence. Both the background and the car are moving.
c) The temporally integrated image of the tracked object (the background). The background
remains sharp with less noise, while the moving car blurs out.
d) The segmented tracked object (the background) using an affine motion model. White regions
are those excluded from the tracked region.

This temporal integration approach has characteristics similar to human motion de-
tection. For example, when a short sequence is available, processing the sequence back
and forth improves the results of the segmentation and motion computation, in a similar
way that repeated viewing helps human observers to understand a short sequence.

4.1 Tracking Other Objects
After segmentation of the first object, and the computation of its affine motion between
every two successive frames, attention is given to other objects. This is done by applying
once more the tracking algorithm to the "rest" of the image, after excluding the first
detected object. To increase stability, the displacement between the centers of mass of
the regions of analysis in successive frames is given as the initial guess for the computation
of the dominant translation. This increases the chance to detect fast small objects.
After computing the segmentation of the second object, it is compared with the
segmentation of the first object. In case of overlap between the two segmentation masks,
pixels which appear in the masks of both the first and the second objects are examined.
They are reclassified by finding which of the two motions fits them better.
Following the analysis of the second object, the scheme is repeated recursively for
additional objects, until no more objects can be detected. In cases when the region of
analysis consists of many disconnected regions and motion analysis does not converge,
the largest connected component in the region is analyzed.
In the example shown in Fig. 4, the second object is detected and tracked. The de-
tection and tracking of several moving objects can be performed in parallel, by keeping
a delay of one or more frames between the computations for different objects.

5 Concluding Remarks

Temporal integration of registered images proves to be a powerful approach to motion


analysis, enabling human-like tracking of moving objects. The tracked object remains
sharp while other objects blur out, which improves the accuracy of the segmentation and
the motion computation. Tracking can then proceed on other objects.
Enhancement of the tracked objects now becomes possible, like reconstruction of
occluded regions, and improvement of image resolution [8].

Fig. 4. Detecting and tracking the second object using temporal integration.
a) The initial segmentation is the complement of the first dominant region (from Fig. 3.d).
b) The temporally integrated image of the second tracked object (the car). The car remains
sharp while the background blurs out.
c) Segmentation of the tracked object after 5 frames.

References

1. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence,
7(4):384-401, July 1985.
2. J.R. Bergen and E.H. Adelson. Hierarchical, computationally efficient motion estimation
algorithm. J. Opt. Soc. Am. A., 4:35, 1987.
3. J.R. Bergen, P.J. Burt, K. Hanna, R. Hingorani, P. Jeanne, and S. Peleg. Dynamic
multiple-motion computation. In Y.A. Feldman and A. Bruckstein, editors, Artificial Intel-
ligence and Computer Vision: Proceedings of the Israeli Conference, pages 147-156. Elsevier
(North Holland), 1991.
4. J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three
frames. In International Conference on Computer Vision, pages 27-32, Osaka, Japan,
December 1990.
5. P.J. Burt, R. Hingorani, and R.J. Kolczynski. Mechanisms for isolating component patterns
in the sequential analysis of multiple motion. In IEEE Workshop on Visual Motion, pages
187-193, Princeton, New Jersey, October 1991.
6. D.J. Heeger. Optical flow using spatiotemporal filters. International Journal of Computer
Vision, 1:279-302, 1988.
7. B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-
203, 1981.
8. M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical
Models and Image Processing, 53:231-239, May 1991.
9. M. Irani, B. Rousso, and S. Peleg. Detecting multiple moving objects using temporal inte-
gration. Technical Report 91-14, The Hebrew University, December 1991.
10. M. Shizawa and K. Mase. Simultaneous multiple optical flow estimation. In International
Conference on Pattern Recognition, pages 274-278, Atlantic City, New Jersey, June 1990.

This article was processed using the LaTeX macro package with ECCV92 style
A Study of Affine Matching With Bounded Sensor
Error *

W. Eric L. Grimson 1, Daniel P. Huttenlocher 2 and David W. Jacobs 1

1 AI Lab, Massachusetts Institute of Technology, Cambridge MA 02139, USA


2 Department of Computer Science, Cornell University, Ithaca NY 14853, USA

A b s t r a c t . Affine transformations of the plane have been used by model-


based recognition systems to approximate the effects of perspective projec-
tion. Because the underlying mathematics are based on exact data, in prac-
tice various heuristics are used to adapt the methods to real data where
there is positional uncertainty. This paper provides a precise analysis of
affine point matching under uncertainty. We obtain an expression for the
range of affine-invariant values consistent with a given set of four points,
where each data point lies in an e-disc. This range is shown to depend on
the actual x-y-positions of the data points. Thus given uncertainty in the
data, the representation is no longer invariant with respect to the Carte-
sian coordinate system. This is problematic for methods, such as geometric
hashing, that depend on the invariant properties of the representation. We
also analyze the effect that uncertainty has on the probability that recogni-
tion methods using affine transformations will find false positive matches.
We find that such methods will produce false positives with even moderate
levels of sensor error.

1 Introduction

In the model-based approach to object recognition, a set of geometric features from an


object model are compared against like features from an image of a scene (cf. [3, 9]). This
comparison generally involves determining a valid correspondence between a subset of the
model features and a subset of the image features, where valid means there exists some
transformation of a given type mapping each model feature onto its corresponding image
feature. The quality of an hypothesized transformation is then evaluated by determining
if the number of features brought into correspondence accounts for a sufficiently large
portion of the model and the data.
Several recent systems have used affine transformations of the plane to represent the
mapping from a 2D model to a 2D image (e.g. [4, 5, 15, 16, 20, 21, 22, 24, 25, 28]).
This type of transformation also approximates the 2D image of a planar object at an
arbitrary orientation in 3D space, and is equivalent to a 3D rigid motion of the object,
followed by orthographic projection and scaling (dilation). The scale factor accounts for
the perceptual shrinking of objects with distance. This affine viewing model does not

* This report describes research done in part at the Artificial Intelligence Laboratory of the
Massachusetts Institute of Technology. Support for the laboratory's research is provided in
part by an ONR URI grant under contract N00014-86-K-0685, and in part by DARPA un-
der Army contract number DACA76-85-C-0010 and under ONR contract N00014-85-K-0124.
WELG is supported in part by NSF contract number IRI-8900267. DPH is supported at Cor-
nell University in part by NSF grant IRI-9057928 and matching funds from General Electric
and Kodak, and in part by AFOSR under contract AFOSR-91-0328.

capture the perspective distortions of real cameras, but it is a reasonable approximation


to perspective except when an object is deep with respect to its distance from the viewer.
Recognition systems that use 2D affine transformations fall into two basic classes.
Methods in the first class explicitly compute an affine transformation based on the corre-
spondence of a set of 'basis features' in the image and the model. This transformation is
applied to the remaining model features, mapping them into the image coordinate frame
where they are compared with image features [2, 15, 16, 24]. Methods in the second class
compute and directly compare affine invariant representations of the model and the im-
age [4, 5, 20, 21, 22, 25, 28] (there is also recent work on deriving descriptions of shapes
that are invariant under perspective projection [8, 27]). In either case, recognition sys-
tems that employ affine transformations generally use some heuristic means to allow for
uncertainty in the location of sensory data. One notable exception is [4] who formulate
a probabilistic method. [26] also discusses bounds on the effects of error on invariants,
and [7] addresses this problem for simpler similarity transformations. In previous work
[17] we provided a precise account of how uncertainty in the image measurements affects
the range of transformations consistent with a given configuration of points acting under
an affine transformation. Here, we show that many existing recognition methods are not
actually able to find instances of an object, without also admitting a large number of
false matches. The analysis further suggests techniques for developing new recognition
methods that will explicitly account for uncertainty.

1.1 Affine Transformations and Invariant Representations

An affine transformation of the plane can be represented as a nonsingular 2 x 2 matrix


L, and a 2-vector, t, such that a given point x is transformed to x' = Lx + t. Such a
transformation can be defined to map any triple of points to any other triple (except in
degenerate cases). As well, three points define an affine coordinate frame (analogous to a
Cartesian coordinate frame in the case of Euclidean transformations) [6, 18], e.g., given
a set of points {m_1, m_2, m_3}, any other point x can be expressed as:
$$x = m_1 + \alpha(m_2 - m_1) + \beta(m_3 - m_1) \qquad (1)$$
α and β remain unchanged when any affine transformation A is applied to the points:
$$A(x) = A(m_1) + \alpha(A(m_2) - A(m_1)) + \beta(A(m_3) - A(m_1)).$$
Thus the pair (α, β) constitutes the affine-invariant coordinates of x with respect to the basis
(m_1, m_2, m_3). We can think of (α, β) as a point in a 2D space, termed the α-β-plane.
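As a concrete illustration of equation (1), the affine coordinates of a point with respect to a basis triple can be obtained by solving a 2 x 2 linear system. The following minimal numpy sketch is only an illustration and is not taken from the paper.

```python
import numpy as np

def affine_coords(x, m1, m2, m3):
    """Affine-invariant coordinates (alpha, beta) of x w.r.t. the basis (m1, m2, m3).

    Solves x - m1 = alpha*(m2 - m1) + beta*(m3 - m1); points are 2D numpy
    arrays and the basis must not be (near-)collinear.
    """
    B = np.column_stack((m2 - m1, m3 - m1))   # 2x2 basis matrix
    return np.linalg.solve(B, x - m1)

# The pair (alpha, beta) is unchanged by any affine map (L, t):
# affine_coords(L @ x + t, L @ m1 + t, L @ m2 + t, L @ m3 + t) gives the same values.
```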
The main issue we wish to explore is: Given a model basis of three points and some
other model point, what sets of four image features are possible transformed instances of
these points? The exact location of each image feature is unknown, and thus we model
image features as discs of radius e. The key question is what effect this uncertainty has
on which image quadruples are possible transformed instances of a model quadruple.
We assume that a set of model points is given in a Cartesian coordinate frame, and
some distinguished basis triple is also specified. Similarly a set of image points is given
in their coordinate frame. Two methods can be used to map between the model and
the image. One method, used by geometric hashing [20], maps both model and image
points to (c~,/3) values using the basis triples. The other method, used by alignment [15],
computes the transformation mapping the model basis to the image basis, and uses it to
map all model points to image coordinates. In both cases, a distinguished set of three

model and image points is used to map a fourth point (or many such points) into some
other space. We consider the effects of uncertainty on these two methods.
First we characterize the range of image measurements in the x-y (Euclidean) plane
that are consistent with the (α, β) pair computed for a given quadruple of model points,
as specified by equation (1). This corresponds to explicitly computing a transformation
from one Cartesian coordinate frame (the model) to another (the image). We find that if
the sensor points' locational uncertainty is bounded by a disc of radius ε, then the range of
possible image measures consistent with a given (α, β) pair is a disc with radius between
ε(1 + |α| + |β|) and 2ε(1 + |α| + |β|). This defines the set of image points that could match
a specific model point, given both an image and model basis.
We then perform the same analysis for the range of affine coordinate values (α, β)
that are consistent with a given quadruple of points. This corresponds to mapping both
the model and image points to (α, β) values. To do this, we use the expressions derived
for the Euclidean case to show that the region of α-β-space that is consistent with a
given point and basis is, in general, an ellipse containing the point (α, β). The parameters
of the ellipse depend on the actual locations of the points defining the basis. Hence the
set of possible values in the α-β-plane cannot be computed independently of the actual
locations of the image basis points. In other words there is an interaction between the
uncertainty in the sensor values and the actual locations of the sensor points. This limits
the applicability of methods which assume that these are independent of one another. For
example, the geometric hashing method requires that the α-β coordinates be independent
of the actual location of the basis points in order to construct a hash table.

2 Image Uncertainty and Affine Coordinates
Consider a set of three model points, m_1, m_2, m_3, and the affine coordinates (α, β) of a
fourth model point x defined by
$$x = m_1 + \alpha(m_2 - m_1) + \beta(m_3 - m_1) \qquad (2)$$
plus a set of three sensor points s_1, s_2, s_3, such that s_i = T(m_i) + e_i, where T is some
affine transformation, and e_i is an arbitrary vector of magnitude at most ε_i. That is, T
is some underlying affine transformation that cannot be directly observed in the data
because each data point is known only to within a disc of radius ε_i.
We are interested in the possible locations of a fourth sensor point, call it s̃, such that
s̃ could correspond to the ideally transformed point T(x). The possible positions of s̃ are
affected both by the error in measuring each image basis point, s_i, and by the error in
measuring the fourth point itself. Thus the possible locations are given by transforming
equation (2) and adding in the error e_0 from measuring x:
$$\tilde{s} = T(m_1 + \alpha(m_2 - m_1) + \beta(m_3 - m_1)) + e_0
       = s_1 + \alpha(s_2 - s_1) + \beta(s_3 - s_1) - e_1 + \alpha(e_1 - e_2) + \beta(e_1 - e_3) + e_0.$$

The measured point s̃ can lie in a range of locations about the ideal location s_1 + α(s_2 -
s_1) + β(s_3 - s_1), with deviation given by the linear combination of the four error vectors:
$$-e_1 + \alpha(e_1 - e_2) + \beta(e_1 - e_3) + e_0 = -[(1 - \alpha - \beta)e_1 + \alpha e_2 + \beta e_3 - e_0]. \qquad (3)$$

The set of possible locations specified by a given e_i is a disc of radius ε_i about the origin:
$$C(\epsilon_i) = \{e_i \mid \|e_i\| \le \epsilon_i\}.$$



Similarly, the product of any constant k with e_i yields a disc C(kε_i) of radius |k|ε_i
centered about the origin. Thus, substituting the expressions for the discs into equation (3),
the set of all locations about the ideal point s_1 + α(s_2 - s_1) + β(s_3 - s_1) is:
$$C([1 - \alpha - \beta]\epsilon_1) \oplus C(\alpha\epsilon_2) \oplus C(\beta\epsilon_3) \ominus C(\epsilon_0), \qquad (4)$$
where ⊕ is the Minkowski sum, i.e. A ⊕ B = {p + q | p ∈ A, q ∈ B} (and similarly for ⊖).


In order to simplify the expression for the range of s̃ we make use of the following
fact, which follows directly from the definition of the Minkowski sum for sets.

Claim 1. $C(r_1) \oplus C(r_2) = C(r_1) \ominus C(r_2) = C(r_1 + r_2)$, where $C(r_i)$ is a disc of radius
$r_i$ centered about the origin, $r_i > 0$.

If we assume that ε_i = ε for all i, then Claim 1 simplifies equation (4) to
$$C(\epsilon\,[\,|1 - \alpha - \beta| + |\alpha| + |\beta| + 1\,]).$$

The absolute values arise because α and β can become negative, but the radius of a
disc is a positive quantity. Clearly the radius of the error disc grows with increasing
magnitude of α and β, but the actual expression governing this growth is different for
different portions of the α-β-plane, as shown in figure 1.

Fig. 1. Diagram of error effects. The region of feasible points is a disc, whose radius is given by
the indicated expression, depending on the values of α and β. The diagonal line is 1 - α - β = 0.

We can bound the expressions defining the radius of the uncertainty disc by noting:
$$1 + |\alpha| + |\beta| \le (|1 - \alpha - \beta| + |\alpha| + |\beta| + 1) \le 2(1 + |\alpha| + |\beta|).$$
We have thus established the following result, illustrated in figure 2:

Proposition 1. The range of image locations that is consistent with a given pair of affine
coordinates (α, β) is a disc of radius r, where
$$\epsilon(1 + |\alpha| + |\beta|) \le r \le 2\epsilon(1 + |\alpha| + |\beta|)$$
and where ε > 0 is a constant bounding the positional uncertainty of the image data.

Fig. 2. Diagram of error effects. On the left are four model points, on the right are four image
points, three of which are used to establish a basis. The actual position of each transformed
model point corresponding to the basis image points is offset by an error vector of bounded
magnitude. The coordinates of the fourth point, written in terms of the basis vectors, can thus
vary from the ideal case, shown in solid lines, to cases such as that shown in dashed lines. This
leads to a disc of variable size in which the corresponding fourth model point could lie.

The expression in Proposition 1 allows the calculation of error bounds for any method
based on 2D affine transformations, such as [2, 15, 24]. In particular, if |α| and |β| are
both less than 1, then the error in the position of a point is at most 6ε. This condition
can be met by using as the affine basis three points m_1, m_2 and m_3 that lie on the
convex hull of the set of model points, and are maximally separated from one another.
The expression is independent of the actual locations of the model or image points,
so that the possible positions of the fourth point vary only with the sensor error and the
values of α and β. They do not vary with the configuration of the model basis (e.g., even
if close to collinear) nor do they vary with the configuration of the image basis. Thus,
the error range does not depend on the viewing direction. Even if the model is viewed
end on, so that all three model points appear nearly collinear, or if the model is viewed
at a small scale, so that all three model points are close together, the size of the region
of possible locations of the fourth model point in the image will remain unchanged.
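As a small illustration (not from the paper), the radius expression and the bounds of Proposition 1 can be evaluated directly:

```python
def uncertainty_disc_radius(alpha, beta, eps):
    """Radius expressions from Proposition 1 for a point with affine coords (alpha, beta).

    Returns (exact, lower, upper): the radius eps*(|1 - alpha - beta| + |alpha| + |beta| + 1)
    together with its bounds eps*(1 + |alpha| + |beta|) and 2*eps*(1 + |alpha| + |beta|).
    """
    exact = eps * (abs(1 - alpha - beta) + abs(alpha) + abs(beta) + 1)
    lower = eps * (1 + abs(alpha) + abs(beta))
    return exact, lower, 2 * lower

# e.g. with |alpha| < 1 and |beta| < 1 the upper bound stays below 6*eps, as noted above.
```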
The viewing direction does, however, greatly affect the affine coordinate system de-
fined by the three projected model points. Thus the set of possible affine coordinates of
the fourth point, when considered directly in α-β-space, will vary greatly. Proposition 1
defines the set of image locations consistent with a fourth point. This implicitly defines
the set of affine transformations that produce possible fourth image point locations, which
can be used to characterize the range of (α, β) values consistent with a set of four points.
We will do the analysis using the upper bound on the radius of the error disc from
Proposition 1. In actuality, the analysis is slightly more complicated, because the expres-
sion governing the disc radius varies as shown in figure 1. For our purposes, however,
considering the extreme case is sufficient. It should also be noted from the figure that the
extreme case is in fact quite close to the actual value over much of the range of α and β.
Given a triple of image points that form a basis, and a fourth image point, s_4, we
want the range of affine coordinates for the fourth point that are consistent with the
possibly erroneous image measurements. In effect, each sensor point s_i takes on a range
of possible values, and each quadruple of such values produces a possibly distinct value
using equation (1). As illustrated in figure 3 we could determine all the feasible values
by varying the basis vectors over the uncertainty discs associated with their endpoints,
finding the set of (α', β') values such that the resulting point in this affine basis lies
within ε of the original point. By our previous results, however, it is equivalent to find
affine coordinates (α', β') such that the Euclidean distance from s_1 + α'(s_2 - s_1) + β'(s_3 - s_1)
to s_1 + α(s_2 - s_1) + β(s_3 - s_1) is bounded above by 2ε(1 + |α'| + |β'|).


Fig. 3. On the left is a canonical example of affine coordinates. The fourth point is offset from
the origin by a scaled sum of basis vectors, αu + βv. On the right is a second consistent set of
affine coordinates. By taking other vectors that lie within the uncertainty regions of each image
point, we can find different sets of affine coordinates α', β' such that the new fourth point based
on these coordinates also lies within the uncertainty bound of the image point.

The boundary of the region of such points (α', β') occurs when the distance from the
nominal image point s_4 = s_1 + α(s_2 - s_1) + β(s_3 - s_1) is 2ε(1 + |α'| + |β'|), i.e.
$$[2\epsilon(1 + |\alpha'| + |\beta'|)]^2 = [(\alpha - \alpha')u]^2 + 2(\alpha - \alpha')(\beta - \beta')\,uv\cos\phi + [(\beta - \beta')v]^2 \qquad (5)$$

where $\mathbf{u} = s_2 - s_1$, $\mathbf{v} = s_3 - s_1$, $u = \|\mathbf{u}\|$, $v = \|\mathbf{v}\|$, and where the angle made by the
image basis vectors s_2 - s_1 and s_3 - s_1 is φ. Considered as an implicit function of α', β',
equation (5) defines a conic. If we expand out equation (5), we get
$$a_{11}(\alpha')^2 + 2a_{12}\alpha'\beta' + a_{22}(\beta')^2 + 2a_{13}\alpha' + 2a_{23}\beta' + a_{33} = 0 \qquad (6)$$

where
$$a_{11} = u^2 - 4\epsilon^2 \qquad\qquad a_{22} = v^2 - 4\epsilon^2$$
$$a_{12} = uv\cos\phi - 4s_\alpha s_\beta\epsilon^2 \qquad\qquad a_{13} = -u\,[\alpha u + \beta v\cos\phi] - 4s_\alpha\epsilon^2$$
$$a_{23} = -v\,[\alpha u\cos\phi + \beta v] - 4s_\beta\epsilon^2 \qquad\qquad a_{33} = \alpha^2 u^2 + 2\alpha\beta\,uv\cos\phi + \beta^2 v^2 - 4\epsilon^2$$

and where s_α denotes the sign of α, with s_0 = 1.


We can use this form to compute the invariant characteristics of a conic [19]:
$$I = u^2 + v^2 - 8\epsilon^2 \qquad (7)$$
$$D = u^2 v^2 \sin^2\phi - 4\epsilon^2\left(u^2 - 2uv\,s_\alpha s_\beta\cos\phi + v^2\right) \qquad (8)$$
$$\Delta = -4\epsilon^2 u^2 v^2 \sin^2\phi\,(1 + s_\alpha\alpha + s_\beta\beta)^2 \qquad (9)$$

If u^2 + v^2 > 8ε^2, then Δ < 0. Furthermore, if
$$u^2 v^2 \sin^2\phi > 4\epsilon^2\left(u^2 - 2uv\,s_\alpha s_\beta\cos\phi + v^2\right)$$



then D > 0 and the conic defined by equation (5) is an ellipse. These conditions are not
met only when the image basis points are very close together, or when the image basis
points are nearly collinear. For instance, if the image basis vectors u and v are each at
least 2ε in length then u^2 + v^2 > 8ε^2. Similarly, if sin φ is not small, D > 0. In fact, cases
where these conditions do not hold will be very unstable and should be avoided.
We can now compute characteristics of the ellipse. The area of the ellipse is given by
$$\frac{4\pi\epsilon^2 u^2 v^2 \sin^2\phi\,(1 + s_\alpha\alpha + s_\beta\beta)^2}{\left[u^2 v^2 \sin^2\phi - 4\epsilon^2\left(u^2 - 2uv\,s_\alpha s_\beta\cos\phi + v^2\right)\right]^{3/2}}. \qquad (10)$$
The center of the ellipse is at
$$\alpha_0 = D^{-1}\left[\alpha u^2 v^2 \sin^2\phi - 4\epsilon^2\left(\alpha u^2 - s_\alpha(1 + s_\beta\beta)v^2 + uv\cos\phi\,(\beta + s_\beta(1 - s_\alpha\alpha))\right)\right]$$
$$\beta_0 = D^{-1}\left[\beta u^2 v^2 \sin^2\phi - 4\epsilon^2\left(\beta v^2 - s_\beta(1 + s_\alpha\alpha)u^2 + uv\cos\phi\,(\alpha + s_\alpha(1 - s_\beta\beta))\right)\right] \qquad (11)$$
The angle θ of the principal axes with respect to the α axis is given by
$$\tan 2\theta = \frac{2\left[uv\cos\phi - 4\epsilon^2 s_\alpha s_\beta\right]}{u^2 - v^2} \qquad (12)$$
Thus we have established the following:

Proposition 2. Given bounded errors of ε in the measurement of the image points, the
region of uncertainty associated with a pair of affine coordinates (α, β) in α-β-space is
an ellipse. The area of this ellipse is given by equation (10), the center is at (α_0, β_0) as
given by equation (11), and the orientation is given by equation (12).
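A small numerical sketch of these quantities is given below. It relies on the reconstruction of equations (7)-(10) above, so the exact terms should be treated as illustrative rather than as the paper's code.

```python
import math

def ellipse_area(alpha, beta, u, v, phi, eps):
    """Invariants (7)-(9) and area (10) of the uncertainty ellipse in alpha-beta-space.

    u, v: lengths of the image basis vectors; phi: the angle between them;
    eps: the positional uncertainty bound. Based on the reconstruction of
    equations (7)-(10) above, so the exact terms are illustrative.
    """
    sa = 1.0 if alpha >= 0 else -1.0      # sign of alpha (s_0 = 1)
    sb = 1.0 if beta >= 0 else -1.0
    s2 = math.sin(phi) ** 2
    I = u * u + v * v - 8 * eps**2                                                   # (7)
    D = u*u * v*v * s2 - 4 * eps**2 * (u*u - 2*u*v*sa*sb*math.cos(phi) + v*v)        # (8)
    if I <= 0 or D <= 0:
        return None    # basis too short or too close to collinear: not a (real) ellipse
    return 4 * math.pi * eps**2 * u*u * v*v * s2 * (1 + sa*alpha + sb*beta)**2 / D**1.5  # (10)

# e.g. ellipse_area(2.0, 1.0, 100.0, 80.0, math.pi / 3, 3.0) gives the area of the
# region of affine coordinates consistent with the nominal (alpha, beta) = (2, 1).
```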

Hence, given four points whose locations are only known to within ε-discs, there is an
elliptical region of possible (α, β) values specifying the location of one point with respect
to the other three. Thus if we compare (α, β) values generated by some object model with
those specified by an ε-uncertain image, each image datum actually specifies an ellipse
of (α, β) values, whose area depends on ε, α, β, and the configuration of the three image
points that form the basis. To compare the model values with image values one must see
if the affine-invariant coordinates for each model point lie within the elliptical region of
possible affine-invariant values associated with the corresponding image point.
The elliptical regions of consistent parameters in α-β-space cause some difficulties
for discrete hashing schemes. For example, geometric hashing uses affine coordinates
of model points, computed with respect to some choice of basis, as the hash keys to
store the basis in a table. In general, the implementations of this method use square
buckets to tessellate the hash space (the α-β-space). Even if we chose buckets whose size
is commensurate with the ellipse, several such buckets are likely to intersect any given
ellipse due to the difference in shape. Thus, one must hash to multiple buckets, which
increases the probability that a random pairing of model and image bases will receive a
large number of votes.
A further problem for discrete hashing schemes is that the size of the ellipse increases
as a function of (1 + |α| + |β|)^2. Thus points with larger affine coordinates give rise to
larger ellipses. Either one must hash a given value to many buckets, or one must account
for this effect by sampling the space in a manner that varies with (1 + |α| + |β|)^2.
The most critical issue for discrete hashing schemes, such as geometric hashing, is
that the shape, orientation and position of the ellipse depend on the specific image basis
chosen. Because the error ellipse associated with a given (α, β) pair depends on the

characteristics of the image basis, which are not known until run time, there is no way
to pre-compute the error regions and thus no clear way to fill the hash table as a pre-
processing step, independent of a given image. It is thus either necessary to approximate
the ellipses by assuming bounds on the possible image basis, which will allow both false
positive and false negative hits in the hash table, or to compute the ellipse to access
at run time. Note that the geometric hashing method does not address these issues. It
simply assumes that some 'appropriate' tessellation of the image space exists.
In summary, in this section we have characterized the range of image coordinates and
the range of (α, β) values that are consistent with a given point, with respect to some
basis, when there is uncertainty in the image data. In the following section we analyze
what fraction of all possible points (in some bounded image region) are consistent with a
given range of (α, β) values. This can then be used to estimate the probability of a false
match for various recognition methods that employ affine transformations.

3 The Selectivity of Affine-Invariant Representations

What is the probability that an object recognition system will erroneously report an
instance of an object in an image? Recall that such an instance in general is specified by
giving a transformation from model coordinates to image coordinates, and a measure of
'quality' based on the number of model features that are paired with image features under
this transformation. Thus we are interested in whether a random association of model
and image features can occur in sufficient number to masquerade as a correct solution.
We use the results developed above to determine the probability of such a false match.
There are two stages to this analysis; the first is a statistical analysis that is independent
of the given recognition method, and the second is a combinatorial analysis that depends
on the particular recognition method. In this section we examine the first stage. In the
following section we apply the analysis to the alignment method.
To determine the probability that a match will be falsely reported we need to know
the 'selectivity' of a quadruple of model points. Recall that each model point is mapped
to a point in α-β-space, with respect to a particular model basis. Similarly each image
point, modeled as a disc, is mapped to an elliptical region of possible points in α-β-space.
Each such region that contains one or more model points specifies an image point that is
consistent with the given model. Thus we need to estimate the probability that a given
image basis and fourth image point chosen at random will map to a region of α-β-space
that is consistent with one of the model points written in terms of some model basis.
This is characterized by the proportion of α-β-space consistent with a given basis and
fourth point (where the size of the space is bounded in some way). As shown above, the
elliptical regions in α-β-space are equivalent to circular regions in image space. Thus, for
ease of analysis we use the formulation in terms of circles in image space.
To determine the selectivity, assume we are given some image basis and a potential
corresponding model basis. Each of the remaining m - 3 model points are defined as affine
coordinates relative to the model basis. These can then be transformed into the image
domain, by using the same affine coordinates, with respect to the image basis. Because
of the uncertainty of the image points, there is an uncertainty in the associated affine
transformation. This manifests itself as a range of possible positions for the model points,
as they are transformed into the image. Previously we determined that a transformed
model point had to be within 2ε(1 + |α| + |β|) of an image point in order to match it.
That calculation took into account error in the matched image point as well as the basis
image points. Therefore, placing an appropriately sized disc about each model point is
equivalent to placing an ε-sized disc about each image point. We thus represent each
transformed model point as giving rise to a disc of some radius, positioned relative to the
nominal position of the model point with respect to the image basis. For convenience, we
use the upper bound on the size of the radius, 2ε(1 + |α| + |β|). For each model point,
we need the probability that at least one image point lies in the associated error disc
about the model point transformed to the image, because if this happens then there is
a consistent model and image point for the given model and image basis. To estimate
this probability, we need the expected size of the disc. Since the disc size varies with
|α| + |β|, this means we need an estimate of the distribution of points with respect to
affine coordinates. By figure 1 we should find the distribution of points as a function of
(α, β). This is messy, and thus we use an approximation instead.
For this approximation, we measure the distribution with respect to p = |α| + |β|, since
both the upper and lower bounds on the disc size are functions of p. Intuitively we expect
the distribution to vary inversely with p. To verify this, we ran the following experiment.
A set of 25 points was generated at random, such that their pairwise separation was
between 25 and 250 pixels. All possible bases were selected, and for each basis for which
the angle between the axes was at least π/16, all the other model points were rewritten
in terms of affine invariant coordinates (α, β). This gave roughly 300,000 samples, which
we histogrammed with respect to p. We found that the maximum value for p in this case
was roughly 51. In general, however, almost all of the values were much smaller, and
indeed, the distribution showed a strong inverse drop-off (see figure 4). Thus, we use
the following distribution of points in affine coordinates:
$$\delta(\alpha, \beta) = \begin{cases} kp & p \le 1 \\ kp^{-2} & p \ge 1 \end{cases} \qquad (13)$$


Fig. 4. Histogram of the distribution of |α| + |β| values. The vertical axis is the ratio of the number of samples
to the total number of samples, the horizontal axis is the value of |α| + |β|. The maximum over 300,000 samples was
51. Only the first portion of the graph is displayed. Overlaid with this is the distribution given
in equation (13).

Note that this model underestimates the probability for large values of p, while over-
estimating it for small values of p. Since we want the expected size of the error disc, and
this grows with p, such an approximation will underestimate the size of the disc.
First, we integrate equation (13) and normalize to 1 to deduce the constant: since
$\int_0^1 kp\,dp + \int_1^{p_m} kp^{-2}\,dp = k\left(\frac{3}{2} - \frac{1}{p_m}\right)$, we obtain
$$k = \frac{2}{3 - 2/p_m} \qquad (14)$$

where p_m is the maximum value for p.
Next, we want to find the expected area of a disc in image space. Recall that we are
using the upper bound on disc size, so in principle this area is just 4πε^2(1+p)^2. We could
simply integrate this with respect to the distribution from equation (13). This, however,
ignores the fact that the image is of finite size (say each dimension is 2r), and some of
the disc may lie beyond the bounds of the image. We therefore separate out four cases.
The first case is for p ≤ 1. Here we get an expected area
$$A_1 = \int_0^1 4\pi\epsilon^2(1+p)^2\,kp\,dp = \frac{17}{3}\pi\epsilon^2 k. \qquad (15)$$

The second case considers discs that lie entirely within the image. For convenience,
assume that the coordinate frame of the basis is centered at the image center (since the
circle is entirely inside the image), and the image dimensions are 2r by 2r. In this case,
we have r - ρ ≥ γ, where ρ is the distance of the disc center from the image center and
γ = 2ε(1 + p) is the disc radius. In general, we have ρ ≤ pd, where d is the
separation between two of the basis points in the image, and thus if 1 ≤ p ≤ c_1, where
$$c_1 = \min\left\{p_m,\ \frac{r - 2\epsilon}{2\epsilon + d}\right\},$$
then the discs will all lie entirely within the image. Thus the second case is
$$A_2 = \int_1^{c_1} 4\pi\epsilon^2(1+p)^2\,kp^{-2}\,dp = 4\pi\epsilon^2 k\left[c_1 + 2\log c_1 - \frac{1}{c_1}\right]
     = 4\pi\epsilon^2 k\left[\frac{r - 2\epsilon}{d + 2\epsilon} + 2\log\frac{r - 2\epsilon}{d + 2\epsilon} - \frac{d + 2\epsilon}{r - 2\epsilon}\right]. \qquad (16)$$
The final expansion assumes that p_m > c_1, which is true for virtually all cases of interest.
Two other cases deal with discs that are partially truncated by the image bound-
aries. Details of these areas A_3 and A_4 are found in [14]. Because these areas contribute
minimally to the overall expected area, we focus on the cases described above.
Depending on the specific values for p_m and c_1 we can add in the appropriate contri-
butions of A_1, ..., A_4, together with the value for k (equation (14)), to obtain an underesti-
mate for the expected area of an error disc -- the expected area of a circle in image space
that will be consistent with a point expressed in terms of some affine basis. Since such
discs can in general occur with equal probability anywhere in the image, the probability
that a model point lies within a disc associated with an image point is simply the ratio
of this area to the area of the image. Thus by normalizing these equations, we have an
underestimate for the selectivity of the scheme. This leads to the following result:

Proposition 3. Given a model basis and a fourth model point, the probability that a
corresponding image basis and fourth image point will map at random to a region of
α-β-space consistent with the model point and basis is given by
$$\mu = \frac{A_1 + A_2 + A_3 + A_4}{4r^2} \qquad (17)$$
where the A_i are the areas for the four cases considered above.
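Ignoring the boundary-truncation terms A_3 and A_4 (which the text notes are small), a rough numerical sketch of equations (14)-(17) could look as follows; it relies on the reconstruction of those equations given above, so the constants should be treated as illustrative.

```python
import math

def selectivity(eps, d, r=250.0, p_m=100.0):
    """Approximate selectivity mu of equation (17), keeping only A1 and A2.

    eps: positional error bound; d: image basis vector length; r: image
    half-dimension (the image is 2r x 2r); p_m: maximum value of |alpha|+|beta|.
    p_m ~ 100 roughly corresponds to M/m = 10 and phi_0 = pi/16 via equation (18).
    """
    k = 2.0 / (3.0 - 2.0 / p_m)                                          # equation (14)
    a1 = (17.0 / 3.0) * math.pi * eps**2 * k                             # equation (15)
    c1 = min(p_m, (r - 2 * eps) / (2 * eps + d))
    a2 = 4 * math.pi * eps**2 * k * (c1 + 2 * math.log(c1) - 1.0 / c1)   # equation (16)
    return (a1 + a2) / (4 * r**2)                                        # equation (17), A3 = A4 = 0

# e.g. selectivity(3.0, 100.0) gives a value of the order of 10**-3,
# comparable to the 'predicted' column of Table 1.
```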

This uses the upper bound on the radius of the error discs. As noted earlier, a simple
lower bound can be obtained by substituting ε/2 in place of ε, reflecting the use of the
bound ε(1 + p) in place of 2ε(1 + p). In this case, the bounds c_1 and c_2 will change slightly.

We can use this to compute example values for the selectivity, which depends on p_m
(the maximum value of |α| + |β|). If we allow any possible triple of points to form a basis,
then p_m can be arbitrarily large. Consider a point at distance P from the origin that makes
an angle θ with the u axis, where u, v make an angle φ. The value of p associated with this point is
$$p = \frac{P\left(u\,|\sin\theta| + v\,|\sin(\phi - \theta)|\right)}{uv\,|\sin\phi|}.$$
As φ approaches 0, this value becomes unbounded. We can exclude unstable bases if
we set limits on the allowable range of values for φ; in particular, we can restrict our
attention to bases with the property that
$$\phi_0 \le \phi \le \pi - \phi_0 \quad\text{or}\quad \pi + \phi_0 \le \phi \le 2\pi - \phi_0.$$
By applying standard minimization methods, the maximum value for p is given by
$$p_m \le \frac{M}{m}\,\frac{1}{\sin\frac{\phi_0}{2}} \qquad (18)$$
where m and M are the minimum and maximum distances between any two model points.
To evaluate the selectivity, we also need to know d, the length of the basis vector,
1 ≤ d ≤ r. Given a specific value for d, we can compute the selectivity. To get a sense of
the variation of μ, it is plotted as a function of d in figure 5, for ε = 3.


Fig. 5. Graph of the selectivity μ for ε = 3 as the basis vector length d varies.

In general, d will take on a variety of values, as the choice of basis points in the
image is varied. To estimate the expected degree of selectivity, we perform the following
analysis. We assume, for simplicity, that the origin of the image basis is at the center of
the image. The second point used to establish the basis vector can lie anywhere in the
image, with equal probability. Hence the probability distribution for d is roughly proportional to d. We
could explicitly integrate equation (17) with respect to this distribution for d to obtain an
expected selectivity. This is messy, and instead we pursue two other options.
First, we can integrate this numerically for a set of examples, shown in Table 1 under
the column marked predicted, which lists values for μ as a function of noise in the image
(with an image dimension of 2r = 500). The value of p_m was set using φ_0 = π/16,
and a ratio of maximum to minimum model point separation of M/m = 10. It should
be noted that varying φ_0 over the range π/8 to π/32 produced results very similar to
those reported in the table. As expected, the probability of a consistent match increases
(selectivity decreases) with increasing error in the measurements. Thus, for ranges of
parameters that one would find in many recognition situations, a considerable fraction
of the space of possible α and β values is consistent with a given feature and basis.

To test the validity of our formal development, we ran a series of simulations to


test the selectivity values μ predicted by equation (17). We generated sets of model and
image features at random, chose bases for each at random, then checked empirically the
probability that a model point, rewritten in the image basis, lay within the associated
error disc of an image point. We chose to consider only cases in which the error disc fit
entirely within the image bounds, since we know that our predictions are underestimates
for the other cases. Table 1 summarizes the results, under the measured column.

Table 1. Table comparing simulated and predicted selectivities. See text for discussion.

    Case     Measured   Predicted   Approximation
    ε = 1    .000116    .000117     .000118
    ε = 3    .001146    .001052     .001064
    ε = 5    .003142    .002911     .002955

Second, we can approximate the selectivity expression. By applying power series ex-
pansions and keeping only first and second order terms, we get:

    μ ≈ (k π ε² / (4 r²)) [ 17/4 − 2 log(d/r) + (r² − d²)/r² ]        (19)

We can find the expected value for equation 19 over the distribution for d, where g ≤
d ≤ r, for some minimum value g. If we assume g << r, this expected value reduces to

    μ̄ ≈ (k π ε² / (2 r²)) log(r/g) .        (20)

This predicts values close to those in Table 1, as shown in the approximation column.
Note that the selectivity is clearly not linear in sensor error. For a fixed size image, in-
creasing the error ε by some amount should decrease the selectivity by at least a quadratic
effect (perhaps more since there are higher order terms). This is reflected in Table 1. This
expected value of the selectivity allows us to analyze the probability that a match will be
reported at random by some recognition method that uses affine transformations. The
selectivity, μ, in essence reflects the power of a given quadruple of features to distinguish
a particular model. Now we consider the manner in which information from multiple
quadruples is combined. This analysis differs slightly for different recognition methods.
As an illustration of how the analysis applies, we consider the alignment method.

4 The Sensitivity of Alignment in the Presence of Noise

The initial version of the affine-invariant alignment method was restricted to planar
objects [15], whereas later versions operate on 3D models (unlike affine hashing which
uses 2D models) [16]. We consider the 2D case. The basic method is:

- Choose an ordered triple of image features and an ordered triple of model features,
and hypothesize that these are in correspondence.
- Use this correspondence to compute an affine transformation mapping model into
image.

- Apply this transformation to all of the remaining model features, thereby mapping
them into the image.
- Search over an appropriate neighborhood about each projected model feature for a
matching image feature, and count the total number of matched features.

This operation is in principle repeated for each ordered triple of model and image
features, although it may be terminated after one or more matches are found, or after a
certain number of triples are tried without finding a match.
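As a rough illustration of this loop (not the authors' implementation), the following Python sketch solves for the affine transformation defined by a hypothesized triple correspondence, maps the remaining model features into the image, and counts how many land within eps of some image feature; the array layout, function name and error bound eps are our own assumptions.

    import numpy as np

    def alignment_score(model, image, model_triple, image_triple, eps):
        """Hypothesize that the ordered triples correspond, compute the affine map,
        and count remaining model features projecting within eps of an image feature."""
        # Solve the 2D affine transform mapping the model triple onto the image triple.
        A = np.hstack([model[list(model_triple)], np.ones((3, 1))])   # rows [x, y, 1]
        T = np.linalg.solve(A, image[list(image_triple)])             # 3 x 2 affine parameters
        rest = np.ones(len(model), dtype=bool)
        rest[list(model_triple)] = False                              # remaining model features
        projected = np.hstack([model[rest], np.ones((rest.sum(), 1))]) @ T
        # For each projected feature, search for an image feature within eps.
        dists = np.linalg.norm(projected[:, None, :] - image[None, :, :], axis=2)
        return int(np.sum(dists.min(axis=1) <= eps))

In a full implementation this score would be computed inside the loop over hypothesized triples, with the termination criteria mentioned above.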
We can use the expressions derived above to analyze the sensitivity of the alignment
method. The key question is whether a random collection of sensor points can masquerade
as a correct interpretation. In this case, we can investigate the probability of such false
positive identifications as follows:

1. The selectivity of a given quadruple of points is given by μ (equation (17)).


2. Since each model point is projected into the image, the probability that a given model
point matches at least one image point is

    p = 1 − (1 − μ)^(s−3)

because the probability that a particular model point is not consistent with a partic-
ular image point is (1 − μ) and by independence, the probability that all s − 3 points
are not consistent with this model point is (1 − μ)^(s−3).
3. The process is repeated for each model point, so the probability of exactly k of them
having a match is

    q_k = ( (m−3) choose k ) p^k (1 − p)^(m−3−k) .        (21)

Further, the probability of a false positive identification of size at least k is

    w_k = 1 − Σ_{i=0}^{k−1} q_i .

Note that this is the probability of a false positive for a particular sensor basis and
a particular model basis.
4. This process is repeated for all choices of model bases, so the probability of a false
positive identification for a given sensor basis with respect to any model basis is

    e_k = 1 − (1 − w_k)^(m choose 3) .        (22)
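The chain of probabilities in steps 1-4 is easy to evaluate numerically. The sketch below (ours, in Python) takes a selectivity μ, the number of sensor features s and the number of model features m; the exponent C(m, 3) in the last step follows the reconstruction of equation (22) used here and should be treated as an assumption.

    from math import comb

    def false_positive_probability(mu, s, m, k):
        """Probability of a false positive of size at least k (steps 1-4)."""
        # Step 2: a projected model point matches at least one of the s - 3 image points.
        p = 1.0 - (1.0 - mu) ** (s - 3)
        # Step 3: exactly i of the m - 3 remaining model points find a match.
        q = [comb(m - 3, i) * p**i * (1.0 - p) ** (m - 3 - i) for i in range(m - 2)]
        w_k = 1.0 - sum(q[:k])                       # at least k matches, one basis pair
        # Step 4: repeat over all model bases (assumed here to number C(m, 3)).
        e_k = 1.0 - (1.0 - w_k) ** comb(m, 3)
        return w_k, e_k

    # Example with the eps = 3 selectivity of Table 1, m = 25 model and s = 100 image features.
    print(false_positive_probability(0.001052, 100, 25, 15))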

4.1 Testing the model

To check the correctness of our model, we ran a series of experiments based on equa-
tion 21. In particular, we used our analysis to generate a distribution for the probability
of a false positive of size k, given e -- 3 and r -- ~ , and using a model with 25 features
and images with 25, 50,100 and 200 features. For comparison, we also generated sets of
model and image points of the same sizes, selected bases for each at random, and de-
termined the size of vote associated with that pairing of bases. That is, for each model
point (other than the basis points) we computed the affine coordinates relative to the
chosen basis. Then we used the affine coordinates to determine the nominal position of
an associated image point. If at least one image point was contained within a given model
point's error disc, then we incremented the vote for this pairing of bases. This trial was
repeated 1000 times. The results are shown in figure (6).


Fig. 6. Comparison of predicted and measured probabilities of false positives. Each graph com-
pares the probability of a false peak of size k observed at random. The cases are for m = 25
and s = 25 and 50 in the top row, and s = 100 and 200 in the bottom row. In each case, ε = 3.
The graphs drawn with triangles show the predicted probability, while the graphs drawn with
squares show the observed empirical probabilities.


Fig. 7. Graph of probability of false positives. In each graph, the vertical axis is the probability
of a false positive of size k, the horizontal axis is k. Each family of plots represents a different
number of sensor features, starting with s = 25 for the leftmost plot, and increasing by increments
of 25. For the three families in the top row, the model consisted of 25 features, and the sensor
error was ε = 1, 3 and 5 (from right to left). For the three families in the bottom row, ε was fixed
at 3, and the model had 25, 38 and 50 features (from right to left).

One can see that the cases are in good agreement. In fact, our model tends to overes-
timate the probability of small false positives, and underestimate the probability of large
false positives, so our results will tend to be conservative.
Next, what does e_k look like? As an illustration, we graph in figure (7) the probability
of a false positive based on equation (22). In particular, we use a selectivity based on
ε = 3, obtained from Table 1, and plot the value of e_k for an object with 25, 38 or 50
model features, for different values of k and a given number of sensor features s. This is
graphed in figure (7). The process was repeated for different numbers of sensor features

s, generating the family of graphs in the figure.


In figure (7), we show the false positive rate as the error rate changes. Each figure
plots the false positive rate, for model features m = 25, and for sensor features varying
from s = 25 to s = 200 by increments of 25. The individual plots are for varying numbers
of sensory features, and the process is repeated for changes in the bound on the sensor
error, given a fixed threshold on angle of φ0 = π/16. If the error is very small, the
method performs well, i.e. the probability of a false positive rapidly drops to zero even
for small numbers of model features. As the error increases, however, the probability of
a false positive rapidly increases, as can be seen by comparing different families of plots
in figure (7). Note that the best possible correct solution would be for k = 22.
While we have used our methods to analyse the alignment method, very similar results
hold for the case of geometric hashing. For details, see [14].

5 Summary
The computation of an affine-invariant representation in terms of a coordinate frame
(m1, m2, m3) has been used in a number of model-based recognition methods. Nearly
all of these recognition methods were developed assuming no uncertainty in the sensory
data, and then various heuristics were used to allow for error in the locations of sensed
points. In this paper we have formally examined the effect of sensory uncertainty on such
recognition methods. This analysis involves considering both the Euclidean plane used
by the alignment method, and the space of affine-invariant (α, β) coordinates used by
the geometric hashing method. Our analysis models each sensor point in terms of a disc of
possible locations, where the size of this disc is bounded by an uncertainty factor, ε.
Under the bounded uncertainty error model, in the Euclidean space the set of possible
values for a given point x and a basis (m1, m2, m3) forms a disc whose radius is bounded
by r = kε(1 + |α| + |β|), where 1 ≤ k ≤ 2. That is, assuming that each image point has
a sensing uncertainty of magnitude ε, the range of image locations that are consistent
with x forms a circular region. In the α-β-space, the set of possible values of the affine
coordinates of a point x in terms of a basis (m1, m2, m3) forms an ellipse (except in
degenerate cases). The area, center and orientation of this ellipse are given by somewhat
complicated expressions that depend on the actual configuration of the basis points.
The most important consequence of our analysis is that the set of possible values in the
α-β-plane cannot be computed independently of the actual locations of the model or image
basis points. This means that the table constructed by the geometric hashing method
can only approximate the correct values, because the locations of the image points are
not known at construction time. We further find that for even moderately large positional
uncertainty, methods that use affine transformations have a substantial probability of
false positive matches. These methods only check the consistency of each matched point
with a set of basis matches. They do not ensure the global consistency of all matched
points. Our results suggest that such methods will require that a substantial number of
hypothesized matches be ruled out by some subsequent verification stage.

References
1. Ballard, D.H., 1981, "Generalizing the Hough Transform to Detect Arbitrary Patterns,"
Pattern Recognition 13(2): 111-122.
2. Basri, R. & S. Ullman, 1988, "The Alignment of Objects with Smooth Surfaces," Second
Int. Conf. Comp. Vision, 482-488.

3. Besl, P.J. & R.C. Jain, 1985, "Three-dimensional Object Recognition," ACM Computing
Surveys, 17(1):75-154.
4. Costa, M., R.M. Haralick & L.G. Shapiro, 1990, "Optimal Affine-Invariant Point Match-
ing," Proc. 6th Israel Conf. on AI, pp. 35-61.
5. Cyganski, D. & J.A. Orr, 1985, "Applications of Tensor Theory to Object Recognition and
Orientation Determination", IEEE Trans. PAMI, 7(6):662-673.
6. Efimov, N.V., 1980, Higher Geometry, translated by P.C. Sinha. Mir Publishers, Moscow.
7. Ellis, R.E., 1989, "Uncertainty Estimates for Polyhedral Object Recognition," IEEE
Int. Conf. Rob. Aut., pp. 348-353.
8. Forsythe, D., J.L. Mundy, A. Zisserman, C. Coelho, A. Heller, & C. Rothwell, 1991, "In-
variant Descriptors for 3-D Object Recognition and Pose", IEEE Trans. PAMI, 13(10):971-
991.
9. Grimson, W.E.L., 1990, Object Recognition by Computer: The role of geometric constraints,
MIT Press, Cambridge.
10. Grimson, W.E.L., 1990, "The Combinatorics of Heuristic Search Termination for Object
Recognition in Cluttered Environments," First Europ. Conf. on Comp. Vis., pp. 552-556.
11. Grimson, W.E.L. & D.P. Huttenlocher, 1990, "On the Sensitivity of the Hough Transform
for Object Recognition," IEEE Trans. PAMI 12(3):255-274.
12. Grimson, W.E.L. & D.P. Huttenlocher, 1991, "On the Verification of Hypothesized Matches
in Model-Based Recognition", IEEE Trans. PAMI 13(12):1201-1213.
13. Grimson, W.E.L. & D.P. Huttenlocher, 1990, "On the Sensitivity of Geometric Hashing",
Proc. Third Int. Conf. Comp. Vision, pp. 334-338.
14. Grimson, W.E.L., D.P. Huttenlocher, & D.W. Jacobs, 1991, "Affine Matching With
Bounded Sensor Error: A Study of Geometric Hashing & Alignment," MIT AI Lab Memo
1250.
15. Huttenlocher, D.P. and S. Ullman, 1987, "Object Recognition Using Alignment",
Proc. First Int. Conf. Comp. Vision, pp. 102-111.
16. Huttenlocher, D.P. & S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an
Image," Inter. Journ. Comp. Vision 5(2):195-212.
17. Jacobs, D., 1991, "Optimal Matching of Planar Models in 3D Scenes," IEEE Conf. Comp.
Vis. and Patt. Recog. pp. 269-274.
18. Klein, F., 1939, Elementary Mathematics from an Advanced Standpoint: Geometry, MacMil-
lan, New York.
19. Korn, G.A. & T.M. Korn, 1968, Mathematical Handbook for Scientists and Engineers,
McGraw-Hill, New York.
20. Lamdan, Y., J.T. Schwartz & H.J. Wolfson, 1988, "Object Recognition by Affine Invariant
Matching," IEEE Conf. Comp. Vis. and Patt. Recog. pp. 335-344.
21. Lamdan, Y., J.T. Schwartz & H.J. Wolfson, 1990, "Affine Invariant Model-Based Object
Recognition," IEEE Trans. Rob. Aut., vol. 6, pp. 578-589.
22. Lamdan, Y. & H.J. Wolfson, 1988, "Geometric Hashing: A General and Efficient Model-
Based Recognition Scheme," Second Int. Conf. Comp. Vis. pp. 238-249.
23. Lamdan, Y. & H.J. Wolfson, 1991, "On the Error Analysis of 'Geometric Hashing'," IEEE
Conf. Comp. Vis. and Patt. Recog. pp. 22-27.
24. Thompson, D. & J.L. Mundy, 1987, "Three-Dimensional Model Matching From an Uncon-
strained Viewpoint", Proc. IEEE Conf. Rob. Aut. pp. 280.
25. Van Gool, L., P. Kempenaers & A. Oosterlinck, 1991, "Recognition and Semi-Differential
Invariants," IEEE Conf. Comp. Vis. and Patt. Recog. pp. 454-460.
26. Wayner, P.C., 1991, "Efficiently Using Invariant Theory for Model-based Matching," IEEE
Conf. Comp. Vis. and Patt. Recog. pp. 473-478.
27. Weiss, I., 1988, "Projective Invariants of Shape," DARPA IU Workshop pp. 1125-1134.
28. Wolfson, H.J., 1990, "Model Based Object Recognition by Geometric Hashing," First Eu-
rop. Conf. Comp. Vis. pp. 526-536.

This article was processed using the LaTeX macro package with ECCV92 style
EPIPOLAR LINE ESTIMATION *

Søren I. Olsen
University of Copenhagen, Universitetsparken 1, DK-2100, Copenhagen, Denmark

Abstract. This paper describes a method by which the epipolar line equa-


tion for binocular stereo, i.e. the invariant relating the image coordinates
of corresponding image points, can be estimated directly by analyzing the
images. No camera calibration or detailed knowledge of the stereo geometry
is required.

1 INTRODUCTION

In the past decades a large number of algorithms for solving the computational stereo
correspondence problem have been proposed. To reduce the size of the correspondence
problem most algorithms have assumed that the equation for the epipolar lines is known
in advance. The right image epipolar line to an observed left image point is the projection
of the 3D line through the observed point and the focal point of the left camera on the
right image plane. In most algorithms it is assumed that the epipolar lines coincide with
the scan lines. This situation occurs only when the two image planes are contained in a
common 3D plane, and when the vertical camera coordinate axes are parallel. In practice,
it is nearly impossible to orient the cameras to satisfy this assumption. Alternatively,
by using a specially designed calibration object, the transformation between the two
camera coordinate systems may be estimated prior to the application. Based on the
transformation the equation for the epipolar lines can be found. The method of camera
calibration is well suited for experiments in laboratory or industrial environments where
time is available, where the cameras are not moved after the calibration has been made,
or where the change in the stereo geometry is known very accurately. Alternatively,
the epipolar line equation may be estimated directly from the images. This approach
is relevant when measures of disparity suffice, and may be useful when no calibration
object is available. In the present work it is shown that for a broad class of stereo images
the epipolar line equation can be obtained directly from the images without restrictive
assumptions on the stereo geometry, knowledge of the values of the focal lengths, etc.
We assume that the two cameras, referred to as the left and right camera, can be
modeled by pin-hole cameras with focal lengths of fL and fR. The two camera coordinate
systems are both assumed to be left-handed, and related by an affine transformation. The
image point of, say, the left camera focal center is denoted by (aL, bL). Each of the image
coordinate systems is assumed to align with the camera coordinate system by fixing the
z-component to the value of the focal length, and by translating the origin by (aL, bL).
A point P with three-dimensional left camera coordinates (X, Y, Z) projects on the left
image coordinates (xL, yL):

    xL = fL X/Z + aL    and    yL = fL Y/Z + bL        (1)

* This work was supported by The Danish Natural Science Research Council, program no.
11-6969, and program no. 11-8345.

The point P is then located on the line: (Xs, Ys, Zs) = s (xL − aL, yL − bL, fL). By
transforming this line, and by solving the two equations (eliminating the parameter s),
and by assuming that the epipolar line is not vertical, the epipolar line equation can be
written:

    c1 + c2 xL + c3 yL + c4 xR + c5 yR + c6 xL xR + c7 xL yR + c8 yL xR + c9 yL yR = 0        (2)

where the coefficients ci are determined by the affine transformation, by the focal lengths,
and by the image coordinates of the two focal centers. In binocular stereo vision vertical
epipolar lines rarely occur. To make sure that the simple situation yL = yR can be
modeled well we choose to fix the coefficient c5 to −1. Assuming that a sufficient number
of true correspondences have been established, we may then solve the set of linear
equations.
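A minimal sketch (ours, Python/NumPy; the function name and array layout are not from the paper) of assembling and solving this system by ordinary least squares, with c5 fixed to −1 so that each correspondence contributes one linear equation in the remaining eight coefficients:

    import numpy as np

    def estimate_epipolar_coefficients(left, right):
        """left, right: (n, 2) arrays of corresponding points (xL, yL) and (xR, yR).
        Returns c1..c9 of equation (2) with c5 fixed to -1."""
        xL, yL = left[:, 0], left[:, 1]
        xR, yR = right[:, 0], right[:, 1]
        # Unknowns c1, c2, c3, c4, c6, c7, c8, c9; the fixed term c5*yR = -yR moves right.
        A = np.column_stack([np.ones_like(xL), xL, yL, xR,
                             xL * xR, xL * yR, yL * xR, yL * yR])
        c = np.linalg.lstsq(A, yR, rcond=None)[0]
        return np.concatenate([c[:4], [-1.0], c[4:]])

The robust, iteratively reweighted variant described in Sect. 4 would replace the plain least-squares call.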

2 FEATURE POINT DETECTION

To establish the set of linear equations (2) a (relatively small) number of corresponding
feature points have to be localized in the images. In the present approach we will benefit
from the vast amount of knowledge gained from studying the behavior of extrema in the
so-called scale-space [9,3,4]. The local extrema in the scale-space mark centers of light
and dark image blobs. By the present approach, the feature points are detected as the
local extrema of the convolution of the images with the Laplacian of a Gaussian [5,1].
For computational practice the sampling in the scale-space must be sparse. By us-
ing a logarithmic sampling [3] a set of five sampling levels, corresponding to preferable
blob-diameters B^j of about 11, 16, 23, 32, and 45 pixels, were chosen. After all points
of local extremum have been marked for each of the sampling levels in the scale-space
the redundancy of the representation is pruned by checking for each local extremum if
an extremum of the same sign is localized within a small search area at the neighbor-
ing sampling levels. If so, the extremum with the smaller value is disregarded. Next, weak
extrema caused by the side-lobes of the Laplacian-Gaussian, or by smooth non-blob
intensity surface changes, are removed.
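A sketch of this detection step (ours, using NumPy and SciPy's ndimage module; the window size and the rough conversion from blob diameter to Gaussian sigma, diameter ≈ 2·sqrt(2)·sigma, are our assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_laplace, maximum_filter, minimum_filter

    def log_feature_points(image, diameters=(11, 16, 23, 32, 45), window=5):
        """Per sampling level, return (row, col) positions of local extrema of the
        Laplacian-of-Gaussian response, i.e. centres of dark and light blobs."""
        levels = []
        for d in diameters:
            sigma = d / (2.0 * np.sqrt(2.0))      # assumed diameter-to-sigma conversion
            resp = gaussian_laplace(image.astype(float), sigma)
            peaks = (resp == maximum_filter(resp, size=window)) & (resp > 0)
            pits = (resp == minimum_filter(resp, size=window)) & (resp < 0)
            levels.append(np.argwhere(peaks | pits))
        return levels

The pruning across neighbouring levels and the removal of weak extrema would then follow as described above.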

3 THE CORRESPONDENCE ANALYSIS

To establish the set of correspondences a coarse-to-fine approach is adopted. At each level


in the scale-space a PMF-matching algorithm [7] is used to establish the set of correspon-
dences. Based on the new matches as well as on previously established correspondences
the coefficients ci in the epipolar line equation are estimated. Initially, the epipolar lines
are assumed to coincide with the scan-lines. The search diameter D_e along the epipolar
line is computed as a fraction B^j / B^max of the image side length. The search diameter D
perpendicular to the epipolar line is computed as half the value of D_e. The center of the
search area is determined by locating the nearest neighboring feature point (previously
matched) and by projecting the coordinates of its corresponding point on the estimated
epipolar line. All right (left) image feature points within the search area and with an
extremum value of the same sign as the left (right) image feature point will be accepted
as candidate matches.
Then each left-to-right (right-to-left) candidate correspondence will be attributed
with a support score and the PMF algorithm [7] used to select the best candidate matches.
Because the features are sparsely located the support area is chosen to cover the image.

As in previous methods [8,7] the support function S is chosen to be increasing with the
similarity of disparity, and decreasing in the distance r between the supporting image
points. In contrast to [8,7] we find it useful to make S depend on the a priori belief β of
the candidate matches. This is defined as the ratio of the minimal and maximal extremal
values of the feature points. By this definition candidate matches between feature points
which mark blobs of significantly different intensities are penalized. Defining Δd as the
length of the disparity difference vector, i.e. Δd = |d_P − d_Q|, the support function S
was defined by:

    S(P_L → P_R, Q_L → Q_R) = β_P β_Q · (1/r) · (1/(1 + Δd))        (3)
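In code, the support score of equation (3) is a direct product of the two beliefs, the inverse distance and the inverse disparity-difference term (our transcription; the input names are hypothetical):

    import numpy as np

    def support(beta_P, beta_Q, disp_P, disp_Q, r):
        """Support given by candidate match Q to candidate match P, equation (3)."""
        delta_d = np.linalg.norm(np.asarray(disp_P) - np.asarray(disp_Q))
        return beta_P * beta_Q * (1.0 / r) * (1.0 / (1.0 + delta_d))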

4 SOLVING THE SYSTEM OF LINEAR EQUATIONS

In general, the number of established correspondences is much larger than 8, suggesting


that an approximate solution may be found by using a least squares approach. The error
in localization usually is small for fine levels of scale, but may be several pixels for the
coarser levels. The localization error is assumed to be Ganssian distributed. To make
the estimation robust against outliers an M-estimator [2,6] is applied. First, a guess on
the residual vector ri is obtained by applying a least squares method. From r~ the noise
standard deviation fi is then computed by the MAD-estimate 1.4826 reed Iri - med ril,
where reed denotes the median value. Then each equation is scaled with a weight w(r)
computed from ri and ~ by:

    w(r) = 1               if |r| ≤ c1 σ̂
           c1 σ̂ / |r|      if c1 σ̂ < |r| ≤ c2 σ̂        (4)
           0               if c2 σ̂ < |r|
where the values of the constants c1 and c2 are chosen as 1.0 and 3.0. After the weighting
of the equations a least squares method will be applied and the procedure of reweighting
and residual estimation will be iterated. The stopping criterion for the iterations was
that a measure of the convergence (defined as in [6]) becomes small or that a maximal
number of 20 iterations was reached. To solve the set of weighted equations the method
of singular value decomposition was chosen. A threshold value of one percent of the sum
of the singular values was used to zero out small singular values.
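The whole robust estimation loop can be sketched as follows (ours, Python/NumPy). The weight function follows the piecewise form of equation (4) as reconstructed above, the convergence test is a generic one rather than the measure of [6], and the SVD solve zeroes singular values below one percent of their sum:

    import numpy as np

    def svd_solve(A, b, frac=0.01):
        # Least squares via SVD, zeroing singular values below frac of their sum.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s_inv = np.where(s > frac * s.sum(), 1.0 / s, 0.0)
        return Vt.T @ (s_inv * (U.T @ b))

    def robust_epipolar_fit(A, b, c1=1.0, c2=3.0, max_iter=20, tol=1e-8):
        x = svd_solve(A, b)                                   # initial least-squares guess
        for _ in range(max_iter):
            r = A @ x - b
            sigma = 1.4826 * np.median(np.abs(r - np.median(r)))   # MAD scale estimate
            a = np.abs(r)
            w = np.where(a <= c1 * sigma, 1.0,
                         np.where(a <= c2 * sigma,
                                  c1 * sigma / np.maximum(a, 1e-12), 0.0))
            x_new = svd_solve(w[:, None] * A, w * b)          # reweighted least squares
            if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
                return x_new
            x = x_new
        return x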

5 EXPERIMENTS

In Fig. 1 the results of applying the method on two synthetic ray-traced stereo image
pairs are shown. Both images are overlaid with the (approximately conjugated) estimated
epipolar lines. In the upper row images the transformation from the left to the right
camera is described by a rotation around the vertical axis and a translation along the
horizontal axis and the z-axis, i.e. vergence stereo. The vergence angle is 30 degrees.
Both of the focal lengths were equal. The accepted matches are shown by small black
squares. In the second row of Fig. 1 the two camera coordinate systems are related by a
translation along all three axes. Rotations are made around the vertical axis and around
the optical axis. The vergence angle is 16 degrees. The rotation around the z-axis is 10
degrees. The values of the two focal lengths differ by about 8%. For clarity, only a subset
of epipolar lines are shown in this example.
Table 1 shows the number of detected feature points, of candidate matches, of ac-
cepted matches (i.e. matches for which w(rl) ~ 0), and the number of false matches.

Table 1. Results on Synthetic images

Measure Synthetic image 1 Synthetic image 2


Left feature points 2, 5, 12, 13, 78 19, 26, 38, 75, 91
Right feature points 2, 5, 7, 24, 86 20, 29, 43, 75, 123
Matches 2, 5, 6, 11, 55 19, 23, 31, 54, 75
Accepted Matches 1, 5, 5, 8, 45 12, 17, 22, 39, 62
False Matches 1, 1, 0, 0, 3 2, 3, 2, 7, 11
Average error 0.28 pixels 0.02 pixels
Std.dev. of error 0.37 pixels 0.75 pixels

Most false matches are located close to the true epipolar line. For all pixels in the left
image having a visible corresponding point in the right image the distance was measured
between the truly corresponding right image point and the estimated epipolar line. Ta-
ble 1 shows that both the average and the standard deviation of this error are small
numbers. The method has also been tested on real laboratory and outdoor stereo im-
ages. For these images the estimation accuracy was measured manually. Typically, the
largest error found was on the order of 2-3 pixels. Due to space limits these results are
not reported further here.

6 CONCLUSION

Based on the limited number of experiments made, the method proposed for estimating
the epipolar lines in stereo images indeed seems promising. The experiments indicate that
if the slope of the epipolar lines is not too steep then the epipolar lines may be estimated
within a few pixels. The method is not directly applicable for trinocular stereo. More
experiments are needed to find the range of values (e.g. for the affine transformation)
for which a reliable solution can be found. A weak point is the selection of prominent
extremes. By the present approach there is a risk that some truly corresponding feature
points will be detected at different levels in the scale-space. This may happen if fL and
fR are not approximately equal, or if the amount of foreshortening in the two images
differs significantly. To increase the robustness, alternative approaches might be to link
the feature points across scale, or to accept matches between feature points located
at neighboring levels of scale. However, preliminary experiments indicate that by such
approaches the number of false matches often increases to an unacceptable level.

References

1. Blostein D. and Ahuja N. Shape from Texture: Integrating Texture Element Extraction and
Surface Estimation. IEEE PAMI vol. 11 no. 12, 1989, pp. 1233-1251
2. Huber P. J. Robust Statistics. Wiley, New York, 1981
3. Koenderink J. J. and van Doorn A. J. The Structure of Images. Biol. Cyb. vol. 50, 1984,
pp. 363-370
4. Lindeberg T. On the Behavior in Scale Space of Local Extrema and Blobs. Proc. 7'th SCIA,
1991, pp. 8-17
5. Marr D. and Hildreth E. C. Theory of Edge Detection. Proc. R. Soc. London B, vol. 207
1980, pp. 187-217

Fig. 1. Two synthetic stereo images overlaid with the estimated epipolar lines. The matched
feature points are marked by black squares.

6. Meer P., Mintz D. and Rosenfeld A. Robust Regression Methods for Computer Vision: A
Review. International Journal of Computer Vision 6:1, 1991, pp. 59-70
7. Pollard S. B., Mayhew J. E. W. and Frisby J. P. PMF: A Stereo Correspondence Algorithm
Using a Disparity Gradient Limit. Perception vol. 14; 1985. pp. 449-470
8. Prazdny K. Detection of Binocular Disparities. Biological Cybernetics vol. 52; 1985, pp.
93-99
9. Witkin A. P. Scale Space Filtering. Proc. 7'th IJCAI 1983, p. 1019-1022

This article was processed using the LaTeX macro package with ECCV92 style
Camera Calibration Using Multiple Images
Paul Beardsley, David Murray, Andrew Zisserman *
Robotics Group, Dept. of Engineering Science, University of Oxford, Oxford OX1 3PJ, UK.
tel: +44-865 273154 fax: +44-865 273908 email: [pab,dwm,az]@uk.ac.ox.robots

Abstract. This paper describes a method for camera calibration. The


system consists of a static camera which takes a sequence of images of a
calibration plane rotating around a fixed axis. There is no requirement for
any exact positioning of the camera or calibration plane.
From each image of the sequence, the vanishing points and hence the
vanishing line of the calibration plane are determined. As the calibration
plane rotates, each vanishing point moves along a locus which is a conic
section, and the vanishing line generates an envelope which is also a conic
section. We describe how such conics can be used to determine the camera's
focal length, the principal point (the intersection of the optic axis with the
image plane), and the aspect ratio.

1 Introduction

The formation of images in a camera is described by the pinhole model. This model can
be specified by the camera's focal length, principal point, and the aspect ratio. These
parameters are called intrinsic because they refer to properties inherent to the camera.
In contrast, extrinsic parameters describe the translation and rotation of a camera with
respect to an external coordinate frame and are thus solely dependent on the camera's po-
sition, not on its inherent properties. The calibration method in this document measures
intrinsic parameters only.
Camera calibration is a central topic in the field of photogrammetry [Ph80]. In com-
puter vision, the best-known calibration method is due to Tsai [Ts86] [LT88]. Other im-
portant references include [Ga84], [FT87], [Pu90]. Recently, calibration methods based on
the use of vanishing points have been proposed [CT90], [Ka92]. The calibration method
here uses vanishing points, the novel aspect of the work being that vanishing points are
shown to move in a predictable way for constrained scene motions. This offers a way of
integrating data over many images prior to calibrating the camera. The motivation for
utilising more data is to obtain resilience to noise.

2 Intrinsic Parameters

Figure 1(a) shows the standard pinhole model of a camera. O is the point through which
all rays project and is referred to as the optic centre. The perpendicular from O to the
image plane intersects the image plane at the principal point, P. The line OP is the optic
axis and the distance OP is the focal length. Figure l(b) shows the customary way in
which the physical system is actually represented, with a right-handed coordinate frame

* This work was supported by SERC Grant No GR/G30003. PB is in receipt of a SERC stu-
dentship. AZ is supported by SERC.

x̂_c ŷ_c ẑ_c ² having origin O and the z-axis aligned with OP. The ratio of the scaling along the
y-axis to the scaling along the x-axis is called the aspect ratio.


Fig. 1. The pinhole model. (a) shows the standard model for image formation - rays of light
from the scene pass through the pinhole and strike the image plane creating an inverted image;
(b) the usual diagrammatic representation of the system with a right-handed coordinate frame
set up at O.

A real camera lens does not act like a perfect pinhole, so there are deviations from
the model above. The most significant of these is known as radial distortion. The form
of radial distortion and methods for its correction are described in [Ph80], [Be91].

3 Vanishing Point and Vanishing Line Information

3.1 Vanishing Points and Vanishing Lines


Semple [SK52] is a comprehensive reference for the properties of vanishing points and
vanishing lines - this section outlines some of the elementary results. Figure 2 shows a
camera viewing a plane. Parallel lines on the plane appear in the image as converging
lines, which are concurrent at a vanishing point. The figure shows two sets of parallel
lines on the plane with different orientations - the line through their vanishing points is
called the vanishing line (the horizon) of the plane. Any set of parallel lines on the plane
will have a vanishing point on this same vanishing line.


Fig. 2. The parallel lines in the scene appear as converging lines in the image. Each set of
parallel lines has its own vanishing point in the image. The line through the vanishing points is
the vanishing line (the horizon) of the plane in the scene.

Vanishing points and vanishing lines are important sources of information about the
geometry of the scene. A vector at the optic centre pointing towards a vanishing point
² v is used to denote a vector of arbitrary magnitude, and v̂ to denote a unit vector.

on the image plane is parallel to the physical lines in the scene which have given rise to
the vanishing point. Also, a plane which passes through the optic centre and a vanishing
line on the image plane has the same normal as the physical plane which has given rise
to the vanishing line [Ka92].

3.2 Generating Conics from Vanishing Points and Vanishing Lines

The main idea in the proposed calibration method is that, given a static camera observing
constrained motion of a calibration plane, vanishing points and vanishing lines move in
a predictable way. Figure 3 illustrates the physical setup of the system. From an image
of the calibration plane, it is possible to obtain a number of vanishing points and hence
the vanishing line of the plane. This section shows that rotation of the calibration plane
causes a vanishing point of the plane to move along a conic, and causes the vanishing
line of the plane to generate an envelope (Fig. 7) which is a conic.


Fig. 3. The physical system. A camera is pointing at a calibration plane which is slanted away
from the fronto-parallel position. The calibration plane is mounted on an axis ẑ_r about which it
can be rotated, and ẑ_r is skewed away from the optic axis ẑ_c. There is no requirement for any
exact positioning of the camera or calibration plane.

Referring to Fig. 3, consider a set of parallel lines on the calibration plane with
direction given by the unit vector l̂. As the plane rotates, the direction of l̂ changes in the
way described by the following parametric representation (in spherical polar coordinates)

    l̂(θ) = sin φ cos θ x̂_r + sin φ sin θ ŷ_r + cos φ ẑ_r        (1)

where x̂_r ŷ_r ẑ_r is an orthonormal set of vectors with ẑ_r parallel to the axis of rotation, φ
is a constant given by cos φ = l̂ · ẑ_r, and θ is the angle of rotation with range 0 ≤ θ < 2π.
It was pointed out in Sect. 3.1 that a vector passing through the optic centre and a
vanishing point is parallel to the physical lines in the scene which have given rise to the
vanishing point, i.e. given a set of parallel lines l̂(θ0) on the calibration plane, there is a
parallel vector l̂_c(θ0) located at the optic centre and passing through the vanishing point
for the lines. Now, the motion of the physical lines is described by (1), so the motion
of the associated vector passing through the optic centre and the vanishing point is also
described by (1). A vector moving in this way sweeps out a circular cone. It follows that
the locus of the vanishing point is given by the intersection of a circular cone and the
image plane, and therefore the locus of the vanishing point is a conic section. This is
illustrated in Fig. 4. An analogous argument can be used to show that, as the calibration
plane rotates, the vanishing lines generate an envelope which is a conic section.
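The locus is easy to trace numerically. In the sketch below (ours, Python/NumPy), x_r, y_r, z_r are the rotation-frame unit vectors of equation (1) expressed in camera coordinates (optic axis along +z), phi is the cone semi-angle and f the focal length; all of these are hypothetical inputs.

    import numpy as np

    def vanishing_point_locus(x_r, y_r, z_r, phi, f, n=360):
        """Trace the vanishing point of a rotating pencil of parallel lines.
        Each direction l(theta) from equation (1) is intersected, as a ray from the
        optic centre, with the image plane z = f; the traced points lie on a conic."""
        pts = []
        for theta in np.linspace(0.0, 2.0 * np.pi, n, endpoint=False):
            l = (np.sin(phi) * np.cos(theta) * np.asarray(x_r)
                 + np.sin(phi) * np.sin(theta) * np.asarray(y_r)
                 + np.cos(phi) * np.asarray(z_r))
            if abs(l[2]) > 1e-9:     # directions parallel to the image plane have no vanishing point
                pts.append(f * l[:2] / l[2])
        return np.array(pts)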
A mathematical analysis is omitted, but the main results are summarised.

1. Although the physical setup is actually capable of generating any type of conic, the
only conic sections considered in the remainder of the paper are ellipses. It can be
shown that the equation of the ellipse on the image plane is

    (cos²φ − sin²ψ) [ x − f sinψ cosψ / (cos²φ − sin²ψ) ]² + cos²φ y² = f² sin²φ cos²φ / (cos²φ − sin²ψ)
where the origin of the coordinate frame is at the principal point, and the x-axis is
aligned with the major axis of the ellipse; φ is the semi-angle of the cone and ψ is the
angle between the optic axis of the camera and the axis of rotation of the calibration
plane (see Figs. 3 and 4); f is the focal length.


Fig. 4. This figure illustrates the locus of the vanishing point. The optic axis of the camera is
ẑ_c. The axis of the circular cone is parallel to the axis of rotation of the calibration plane, ẑ_r.
The circular cone is swept out by a vector parallel to the direction of the parallel lines on the
calibration plane, l̂(θ). The vanishing point moves along the intersection of the cone and the
image plane. The semi-angle of the cone is φ, and the skew angle between the optic axis and
the axis of the cone is ψ. Note that the axis of the cone does not pass through the centre of the
ellipse.

2. For a particular configuration of the system, the vanishing points and vanishing lines
generate ellipses with collinear major axes, although with different minor axes and
eccentricities (depending on their associated φ value). The common major axis passes
through the principal point.
3. As shown in Fig. 4, the direction of the major axis is determined by the optic axis
and the axis of the cone (which is parallel to the axis of rotation of the calibration
plane). This direction is fixed for a particular configuration of the system, but can
be adjusted by moving the camera or the axis of rotation of the calibration plane.
4. The ellipses used in the actual experiments were generated from vanishing line en-
velopes rather than from vanishing points - a vanishing line is a best-fit line through
a number of vanishing points and should therefore be more resilient to noise.

4 Circular Cones and the Optic Centre

This section considers what information an ellipse on the image plane provides about the
position of the optic centre.
Figure 5 shows an ellipse on the image plane, which is either the locus of a vanishing
point, or the envelope of a set of vanishing lines. The discussion in Sect. 3 assumed a

known camera geometry and showed how vanishing point and vanishing line information
can give rise to an ellipse. Camera calibration is the inverse process - given an ellipse
generated from vanishing points or vanishing lines, determine the camera geometry.
Now the ellipse can be regarded as the intersection of the image plane with a circular
cone whose vertex is at the optic centre. Thus, the problem is to use the ellipse to
determine the set of all circular cones which could have given rise to it; more specifically,
it is the vertices of this set of cones which are of interest since the optic centre must
coincide with one of these vertices. It is shown below that the vertices lie on a hyperbola
in a plane perpendicular to the image plane, as illustrated in Fig. 5. Thus, a single ellipse
leaves the position of the optic centre underdetermined, since it can lie anywhere along
the hyperbola. Given two distinct ellipses, the optic centre lies at the intersection of the
two associated hyperbolae, and hence its position is uniquely determined (actually there
are four points of intersection, pairwise symmetric on either side of the image plane - the
incorrect solutions on the wrong side of the image plane are automatically eliminated,
and the other incorrect solution is easily eliminated because it lies far from the centre of
the image).
As shown in Fig. 5, a coordinate frame x̂_e ŷ_e ẑ_e is set up with the x̂_e ŷ_e plane coin-
cident with the image plane, the x̂_e axis aligned with the major axis of the ellipse, and
the origin at the centre of the ellipse.


Fig. 5. Note that this is a 3D diagram, with the ellipse on the x̂_e ŷ_e plane (the image plane),
and the hyperbola on the x̂_e ẑ_e plane. The set of circular cones which could give rise to the
ellipse have their vertices on the hyperbola. As described in the text, possible positions of the
optic centre lie on this same hyperbola.

Points x on the surface of a cone with vertex v, axis â, and semi-angle φ are described
by
    |x − v|² cos²φ − (â · (x − v))² = 0        (2)
By expressing the intersection of the cone and the image plane in the canonical form
for an ellipse, which is

    x²/a² + y²/b² = 1        (3)

it is possible to derive the following constraint on the cone vertices

    v_x²/(a² − b²) − v_z²/b² = 1        (4)

Thus, the cone vertices v lie on a hyperbola in the x̂_e ẑ_e plane. When v_z = 0, v_x =
±√(a² − b²). This is the focus of the ellipse, so the hyperbola goes through the ellipse foci.
Equation (4) will be used in the computation of focal length.

5 Camera Calibration

5.1 Determination of Aspect Ratio

The determination of aspect ratio relies on the following observation - given a set of
ellipses, all the major axes are concurrent at one point (the principal point). If the aspect
ratio is not correct, however, the ellipses are distorted and the major axes will not be
concurrent. Thus, adjusting the aspect ratio until the major axes are concurrent serves
to identify the true aspect ratio. It is evident that this method requires a minimum of
three ellipses with different major axes in order to carry out the concurrency test.
The experimental procedure is as follows. The system is set up so that the vanishing
line envelope obtained from an image sequence will be an ellipse (the type of conic
produced depends on the setting of the φ and ψ angles defined in Sect. 3.2). An image
sequence is taken as the calibration plane is rotated from 0 to 2π. The envelope of the
vanishing lines measured during the sequence is used to determine an ellipse.
The system configuration is then adjusted so that the major axis of the vanishing line
envelope lies in a different direction (the direction of the major axis is determined by
the skew between the optic axis of the camera and the axis of rotation of the calibration
plane as described in Sect. 3.2). A new image sequence is taken, and again the envelope
of the vanishing lines is used to determine an ellipse.
This is repeated at least three times to produce the minimum of three ellipses required
for the aspect ratio determination. Starting from an initial estimate, the aspect ratio is
iteratively adjusted until the major axes of the ellipses are at their closest approach to
concurrency - at this point, the value of the aspect ratio is recorded as the true aspect
ratio. Concurrency is measured using a weighted least-squares test given in [Ka92].
We emphasise here that no accurate positioning is required when setting the camera
and calibration plane using the guidelines above, and rough human estimation of position
is always sufficient.
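One simple (unweighted) way to implement this search is sketched below in Python/NumPy; the paper itself uses the weighted least-squares concurrency test of [Ka92], so this is an illustration rather than the authors' procedure. Each major axis is given by two points in pixel coordinates, the trial aspect ratio rescales their y-coordinates, and the residual of the least-squares point of concurrency is minimised over a scan.

    import numpy as np

    def concurrency_residual(axes, aspect):
        """axes: list of (p0, p1) point pairs defining the major axes (pixels).
        Returns the least-squares concurrency residual and the intersection point."""
        normals, offsets = [], []
        for p0, p1 in axes:
            p0 = np.array([p0[0], aspect * p0[1]], float)
            p1 = np.array([p1[0], aspect * p1[1]], float)
            t = (p1 - p0) / np.linalg.norm(p1 - p0)
            n = np.array([-t[1], t[0]])              # unit normal of the rescaled line
            normals.append(n)
            offsets.append(n @ p0)
        A, b = np.array(normals), np.array(offsets)
        x = np.linalg.lstsq(A, b, rcond=None)[0]     # closest point to all axes
        return float(np.sum((A @ x - b) ** 2)), x

    def estimate_aspect_ratio(axes, lo=1.3, hi=1.7, steps=400):
        grid = np.linspace(lo, hi, steps)
        return min(grid, key=lambda s: concurrency_residual(axes, s)[0])

The intersection point returned at the best aspect ratio is then an estimate of the principal point of the kind used in Sect. 5.2.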

5.2 Determination of Principal Point

Once the aspect ratio has been corrected, all ellipses have major axes which are concurrent
at the principal point. Thus, given two or more ellipses, it is possible to compute the
principal point.

5.3 Determination of Focal Length

Once the principal point is known, the focal length can be found using a single ellipse.
Focal length is computed using (4) from Sect. 4

    v_x²/(a² − b²) − v_z²/b² = 1

where v_x is the offset of the centre of the ellipse from the principal point and is known; a and
b are obtained directly from the ellipse; v_z is the focal length and is the only unknown.
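Rearranging (4) for v_z gives a one-line computation (our sketch; a and b are the ellipse semi-axes and vx the offset of the ellipse centre from the principal point, all in pixels):

    import math

    def focal_length_from_ellipse(a, b, vx):
        # From equation (4): vz^2 = b^2 * (vx^2 / (a^2 - b^2) - 1).
        return b * math.sqrt(vx**2 / (a**2 - b**2) - 1.0)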

6 Implementation and Results

Six image sequences were taken, with varying system geometry as described in Sect. 5.1.
The camera-calibration plane distance was about 80cm. There were 36 frames per image

Fig. 6. Typical image of the calibration plane.

sequence and the calibration plane, which was mounted on a rotating table, was rotated
about 10° between each frame. A typical image is shown in Fig. 6.
Edges were found using a local implementation of the Canny edge detector which
works to sub-pixel accuracy. The aspect ratio of the edge map was updated using a value
which was close to the true aspect ratio, and the edgels were corrected for radial distortion
using a priori knowledge of the radial distortion parameters [Ph80], [Be91].³ Best-fit lines
were found for the 8 horizontal and 8 vertical lines available from the calibration grid, and
64 vertex positions were generated. Vanishing points were found in 16 different directions
available from the vertices (0°, 90°, ±45°, etc.), and the vanishing line was determined
from the vanishing points.
At the end of the sequence, an ellipse was fitted to the envelope of the vanishing lines
using the Bookstein algorithm [Bo79] - the vanishing lines were represented in homoge-
neous coordinates and a line conic was determined. A point conic was then determined
by inverting the matrix for the line conic [SK52].
Figure 7 shows the vanishing line envelopes for two example sequences. Figure 8 shows
the major axes of ellipses fitted to the vanishing line envelopes of six sequences, after
the aspect ratio has been corrected to bring them as close as possible to concurrency.
The values determined for aspect ratio, principal point, and focal length are given in
table 1. The focal lengths were converted from pixels into millimetres using the pixel size
provided in the camera specification.
The computed aspect ratio is in good agreement with the value obtainable from the
camera and digitisation equipment specifications. The worst-case error for the perpen-
dicular distances between the major axes and the computed principal point is about
12 pixels as shown in Fig. 8. The presence of this error is the subject of current in-
vestigation - Kanatani [Ka92] has pointed out that the Bookstein algorithm for conic
fitting is inappropriate when there is anisotropic error in the data points. It is the case
that there is anisotropic error in vanishing points and vanishing lines, and Kanatani's
own algorithm for conic fitting is to be investigated. The standard deviation of the focal
length indicates a 2% error - whether this is satisfactory or not depends on the particular
application making use of the calibration information.
We are currently working on an error analysis. The calibration method is a linear
process, so it should be possible to work through from the errors present in the Canny
edge detection to an estimate of the variance in the output parameters. We hope that
the analysis will show that the use of data from many frames causes cancelling of errors.

³ Accurate radial distortion correction can be applied based on only approximate estimates of
the aspect ratio and principal point.


Fig. 7. This figure shows the envelope of vanishing lines for two example sequences. Axes are
labelled with image plane coordinates in pixels. The aspect ratio has been corrected to an initial
estimate of 1.51.


Fig. 8. This figure shows the closest approach to concurrency of the major axes of the ellipses,
obtained by adjusting the aspect ratio. Axes are labelled with image plane coordinates in pixels.
The least-squares estimation of the point of concurrency gives the principal point. The outliers
are discussed in the text.

7 Conclusion

The most significant characteristic of the calibration method is that intrinsic parameters
are computed in a sequential manner - first, the aspect ratio is found using a concurrency
test which does not require knowledge of the other parameters; the principal point is then
computed as the intersection point of a set of lines, a graphical computation which does
not require the focal length; finally, the focal length is determined using the computed
principal point together with the ellipse parameters. This breakdown of the processing
gives rise to individual stages which are computationally simple and easy to implement,
but it could cause problems through errors in one stage being passed on to subsequent
stages.
The method makes use of image sequences, but the large amounts of data which this

Aspect ratio 1.514


Principal point (x, y) (-4, 1)
Pixels Millimetres
Focal length sequence 1 1180 20.3
Focal length sequence 2 1187 20.4
Focal length sequence 3 1176 20.3
Focal length sequence 4 1216 20.9
Focal length sequence 5 1229 21.2
Focal length sequence 6 1237 21.3
Focal length mean 1204 20.7
Focal length standard deviation 26 0.4

Table 1. Results for the measurement of aspect ratio, principal point and focal length.

involves are condensed to a single vanishing line from each image. The potential benefits
of using many images are currently the subject of an error analysis.

References

[Be91] P.A. Beardsley et al. The correction of radial distortion in images. Technical report
1896/91. Department of Engineering Science, University of Oxford.
[Bo79] F.L. Bookstein Fitting conic sections to scattered data. Computer Graphics and Image
Processing, pages 56-91, 1979.
[CT90] B. Caprile and V. Torre Using vanishing points for camera calibration. International
Journal of Computer Vision, pages 127-140, 1990.
[FT87] O.D. Faugeras and G. Toscani. The calibration problem for stereo. In Proc. of IEEE
Conf. Computer Vision and Pattern Recognition, Miami, 1987.
[Ga84] S. Ganapathy. Decomposition of transformation matrices for robot vision. In Proc. of
IEEE Conference on Robotics, pages 130-139, 1984.
[Ka92] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press,
Due for publication 1992/3.
[LT88] R.K. Lenz and R.Y. Tsai. Techniques for calibration of the scale factor and image center
for high accuracy 3-D machine vision metrology. In IEEE Transactions Pattern Analysis
and Machine Intelligence, pages 713-720, 1988.
[Ph80] Manual of Photogrammetry. American Society of Photogrammetry, 1980.
[Pu90] P. Puget and T. Skordas An optimal solution for mobile camera calibration. In Proc.
First European Conf. Computer Vision, pages 187-188, 1990.
[SK52] J.G. Semple and G.T. Kneebone. Algebraic Projective Geometry. Oxford University
Press, 1952.
[Ts86] R.Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision.
In Proc. of IEEE Conf. Computer Vision, pages 364-374, 1986.

Thanks to Professor Mike Brady for advice, and to Professor Ken-ichi Kanatani for
discussions on projective geometry. Thanks to Phil McLauchlan, Charlie Rothwell and
Bill Triggs for invaluable help.

This article was processed using the LaTeX macro package with ECCV92 style
Camera Self-Calibration: T h e o r y and E x p e r i m e n t s

O.D. Faugeras¹, Q.-T. Luong¹ and S.J. Maybank²


1 INRIA, 2004 Route des Lucioles, 06560 Valbonne, France
2 GEC Hirst Research Centre, East Lane, Wembley, Middlesex, HA9 7PP UK

Abstract. The problem of finding the internal orientation of a camera


(camera calibration) is extremely important for practical applications. In
this paper a complete method for calibrating a camera is presented. In
contrast with existing methods it does not require a calibration object with
a known 3D shape. The new method requires only point matches from image
sequences. It is shown, using experiments with noisy data, that it is possible
to calibrate a camera just by pointing it at the environment, selecting points
of interest and then tracking them in the image as the camera moves. It is
not necessary to know the camera motion.
The camera calibration is computed in two steps. In the first step the
epipolar transformation is found. Two methods for obtaining the epipoles
are discussed, one due to Sturm is based on projective invariants, the other
is based on a generalisation of the essential matrix. The second step of the
computation uses the so-called Kruppa equations which link the epipolar
transformation to the image of the absolute conic. After the camera has
made three or more movements the Kruppa equations can be solved for the
coefficients of the image of the absolute conic. The solution is found using
a continuation method which is briefly described. The intrinsic parameters
of the camera are obtained from the equation for the image of the absolute
conic.
The results of experiments with synthetic noisy data are reported and
possible enhancements to the method are suggested.

1 Introduction
Camera calibration is an important task in computer vision. The purpose of the camera
calibration is to establish the projection from the 3D world coordinates to the 2D image
coordinates. Once this projection is known, 3D information can be inferred from 2D
information, and vice versa. Thus camera calibration is a prerequisite for any application
where the relation between 2D images and the 3D world is needed. The camera model
considered is the one most widely used. It is the pinhole: the camera is assumed to
perform a perfect perspective transformation. Let [su, sv, s] be the image coordinates,
where s is a non-zero scale factor. The equation of the projection is

    [su]       [1 0 0 0]       [X]       [X]
    [sv]  =  A [0 1 0 0]  G  · [Y]  =  M [Y]        (1)
    [s ]       [0 0 1 0]       [Z]       [Z]
                                [1]       [1]

where X, Y, Z are world coordinates, A is a 3 x 3 transformation matrix accounting
for camera sampling and optical characteristics, and G is a 4 x 4 displacement matrix
accounting for camera position and orientation. Projective coordinates are used in (1)
for the image plane and for 3D space. The matrix M is the perspective transformation
matrix, which relates 3D world coordinates and 2D image coordinates. The matrix G
depends on six parameters, called extrinsic: three defining a rotation of the camera and
three defining a translation of the camera. The matrix A depends on a variable number of
parameters, according to the sophistication of the camera model. These are the intrinsic
parameters. There are five of them in the model used here. It is the intrinsic parameters
that are to be computed.
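As a concrete reading of (1), the sketch below (ours, Python/NumPy; A and G are hypothetical intrinsic and displacement matrices) forms the perspective transformation matrix M and projects a world point:

    import numpy as np

    def project(A, G, X):
        """Project the world point X = (X, Y, Z) using equation (1)."""
        P0 = np.hstack([np.eye(3), np.zeros((3, 1))])   # the 3 x 4 matrix [I | 0]
        M = A @ P0 @ G                                  # perspective transformation matrix
        su, sv, s = M @ np.append(np.asarray(X, float), 1.0)
        return np.array([su / s, sv / s])               # image coordinates (u, v)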
In the usual method of calibration [4] [12] a special object is put in the field of view
of the camera. The 3D shape of the calibration object is known, in other words the 3D
coordinates of some reference points on it are known in a coordinate system attached to
the object. Usually the calibration object is a flat plate with a regular pattern marked
on it. The pattern is chosen such that the image coordinates of the projected reference
points (for example, corners) can be measured with great accuracy. Using a great number
of points, each one yielding an equation of the form (1), the perspective transformation
matrix M can be estimated. This method is widely used. It yields a very accurate de-
termination of the camera parameters, provided the calibration pattern is carefully set.
The drawback of the method is that in many applications a calibration pattern is not
available. Another drawback is that it is not possible to calibrate on-line when the cam-
era is already involved in a visual task. However, even when the camera performs a task,
the intrinsic parameters can change intentionally (for example adjustment of the focal
length), or not (for example mechanical or thermal variations).
The problem of calibrating the extrinsic parameters on-line has already been ad-
dressed [11]. The goal of this paper is to present a calibration method that can be carried
out using the same images required for performing the visual task. The method applies
when the camera undergoes a series of displacements in a rigid scene. The only require-
ment is that the machine vision system is capable of establishing correspondences between points
in different images, in other words it can identify pairs of points, one from each image,
that are projections of the same point in the scene.
Many methods for obtaining matching pairs of points in two images are described
in the literature. For example, points of interest can be obtained by corner and vertex
detection [1] [5]. Matching is then done by correlation techniques or by a tracking method
such as the one described in [2].

2 Kruppa's Equations and Self-Calibration

A brief introduction is given to the theory underlying the calibration method. A longer
and more detailed account is given in [9].

2.1 Derivation of Kruppa's Equations

Kruppa's equations link the epipolar transformation to the image w of the absolute conic
Ω. The conic w determines the camera calibration, thus the equations provide a way of
deducing the camera calibration from the epipolar transformations associated with a
sequence of camera motions. Three epipolar transformations, arising from three different
camera motions, are enough to determine w and hence the camera calibration uniquely.
The absolute conic is a particular conic in the plane at infinity. It is closely associated
with the Euclidean properties of space. The conic Ω is invariant under rigid motions and

under uniform changes of scale. In a Cartesian coordinate system [x1, x2, x3, x4] for the
projective space P^3 the equations of Ω are

\[ x_4 = 0, \qquad x_1^2 + x_2^2 + x_3^2 = 0 \]

The invariance of Ω under rigid motions ensures that w is independent of the position
and orientation of the camera. The conic w = M(Ω) thus depends only on the matrix A
of intrinsic parameters. The converse is also true [9] in that w determines the intrinsic
parameters.
Let the camera undergo a finite displacement and let k be the line joining the optical
centre of the camera prior to the motion to the optical centre of the camera after the
motion. Let p and p' be the epipoles associated with the displacement. The epipole p
is the projection of k into the first image and p' is the projection of k into the second
image. Let Π be a plane containing k. Then Π projects to lines l and l' in the first and
second images respectively. The epipolar transformation defines a homography from the
lines through p to the lines through p' such that l and l' correspond. The symbol ↔ is
used for the homographic correspondence, l ↔ l'.
If Π is tangent to Ω then l is tangent to w and l' is tangent to the projection w' of Ω
into the second image. The conic w is independent of the camera position thus w = w'. It
follows that the two tangents to w from p correspond under the epipolar transformation
to the two tangents to w from p'. The condition that the epipolar lines tangent to w
correspond gives two constraints linking the epipolar transformation with w. Kruppa's
equations are an algebraic version of these constraints.
Projective coordinates [y1, y2, y3] are chosen in the first image. Two triples of coordi-
nates [y1, y2, y3] and [u1, u2, u3] specify the same image point if and only if there exists
a non-zero scale factor s such that yi = s ui for i = 1, 2, 3. The epipolar lines are param-
eterised by taking the intersection of each line with the fixed line y3 = 0. Let (p, y) be
the line through the two points p and y. A general point x is on (p, y) if and only if
(p × y) · x = 0. Let D be the matrix of the dual conic to w. It follows from the definition
of D that (p, y) is tangent to w if and only if it lies on the dual conic,

\[ (p \times y)^T D (p \times y) = 0 \qquad (2) \]

The entries of D are defined to agree in part with the notation of Kruppa,

\[ D = \begin{pmatrix} -\delta_{23} & \delta_3 & \delta_2 \\ \delta_3 & -\delta_{13} & \delta_1 \\ \delta_2 & \delta_1 & -\delta_{12} \end{pmatrix} \qquad (3) \]

There are six parameters δi, δij in (3), but D is determined by w only up to a scale factor.
After taking the scale factor into account D has five degrees of freedom. On setting y3 = 0
and on using (3) to substitute for D in (2) it follows that

\[ A_{11} y_1^2 + 2 A_{12} y_1 y_2 + A_{22} y_2^2 = 0 \qquad (4) \]

where the coefficients A11, A12, A22 are defined by

\[
\begin{aligned}
A_{11} &= -\delta_{13} p_3^2 - \delta_{12} p_2^2 - 2\delta_1 p_2 p_3 \\
A_{12} &= \delta_{12} p_1 p_2 - \delta_3 p_3^2 + \delta_2 p_2 p_3 + \delta_1 p_1 p_3 \\
A_{22} &= -\delta_{23} p_3^2 - \delta_{12} p_1^2 - 2\delta_2 p_1 p_3
\end{aligned}
\qquad (5)
\]

An equation similar to (4) is obtained from the condition that the epipolar line (p', y')
in the second image is tangent to w,

\[ A'_{11} y_1'^2 + 2 A'_{12} y_1' y_2' + A'_{22} y_2'^2 = 0 \qquad (6) \]

The coefficients A'11, A'12, A'22 are obtained from (5) on replacing the coordinates pi of
p with the coordinates p'i of p'. The coordinate y'3 is set equal to zero.
The epipolar transformation induces a bilinear transformation N from the line y3 = 0
to the line y'3 = 0. If y = [y1, y2, 0]^T and y' = [y'1, y'2, 0]^T then (p, y) ↔ (p', y') if and only
if y' = N y. Let τ = y2/y1, τ' = y'2/y'1. Then the transformation N is equivalent to

\[ \tau' = \frac{a\tau + b}{c\tau + d} \qquad (7) \]

The parameters a, b, c, d can be easily computed up to a scale factor from the two epipoles
p, p' and a set of point matches qi ↔ q'i, 1 ≤ i ≤ n. A linear least squares procedure
based on (7) is used. The ith image correspondence gives an equation (7) with

\[ \tau_i = \frac{p_3 q_{i2} - p_2 q_{i3}}{p_3 q_{i1} - p_1 q_{i3}}, \qquad \tau'_i = \frac{p'_3 q'_{i2} - p'_2 q'_{i3}}{p'_3 q'_{i1} - p'_1 q'_{i3}} \qquad (8) \]
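The linear least squares step just described can be sketched as follows in Python with NumPy (our illustration, not the authors' implementation); epipoles and image points are homogeneous 3-vectors, and the function names are ours.

import numpy as np

def pencil_parameter(p, q):
    """tau for the epipolar line (p, q), as in Eq. (8): intersection with y3 = 0."""
    return (p[2]*q[1] - p[1]*q[2]) / (p[2]*q[0] - p[0]*q[2])

def fit_homography_coefficients(p, p_prime, matches, matches_prime):
    """Least-squares fit of (a, b, c, d) in tau' = (a tau + b) / (c tau + d).

    Each correspondence gives a*tau + b - c*tau*tau' - d*tau' = 0, which is
    linear and homogeneous in (a, b, c, d); solve up to scale by SVD.
    """
    rows = []
    for q, qp in zip(matches, matches_prime):
        t = pencil_parameter(p, q)
        tp = pencil_parameter(p_prime, qp)
        rows.append([t, 1.0, -t * tp, -tp])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1]          # (a, b, c, d), unit norm, defined up to scale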
Once a, b, c, d have been found (4), (6) and (7) yield

\[
\begin{aligned}
A_{11} + 2 A_{12}\tau + A_{22}\tau^2 &= 0 \\
A'_{11}(b\tau + c)^2 + 2 A'_{12}(b\tau + c)(\tau + a) + A'_{22}(\tau + a)^2 &= 0
\end{aligned}
\qquad (9)
\]

Each equation (9) is quadratic in τ. The two equations have the same roots, namely
the values of τ for which (p, y) is tangent to w. It follows that one equation is a scalar
multiple of the other. Kruppa's equations are obtained by equating ratios of coefficients,

\[
\begin{aligned}
A_{12}(A'_{22}a^2 + A'_{11}c^2 + 2A'_{12}ac) - (A'_{12}c + A'_{22}a + A'_{11}bc + A'_{12}ab)A_{11} &= 0 \\
A_{22}(A'_{22}a^2 + A'_{11}c^2 + 2A'_{12}ac) - (2A'_{12}b + A'_{22} + A'_{11}b^2)A_{11} &= 0
\end{aligned}
\]
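For a candidate matrix D, the coefficients (5) and the two constraints above can be evaluated numerically, for instance as in the following sketch (ours; it assumes the coefficient pairing written above, with a, b, c the parameters of the epipolar homography).

import numpy as np

def tangency_coefficients(delta, p):
    """A11, A12, A22 of Eq. (5) for epipole p = (p1, p2, p3).

    delta = (d1, d2, d3, d12, d13, d23) collects the entries of D in (3).
    """
    d1, d2, d3, d12, d13, d23 = delta
    p1, p2, p3 = p
    a11 = -d13 * p3**2 - d12 * p2**2 - 2 * d1 * p2 * p3
    a12 = d12 * p1 * p2 - d3 * p3**2 + d2 * p2 * p3 + d1 * p1 * p3
    a22 = -d23 * p3**2 - d12 * p1**2 - 2 * d2 * p1 * p3
    return a11, a12, a22

def kruppa_residuals(delta, p, p_prime, a, b, c):
    """Residuals of the two Kruppa equations for one displacement."""
    A11, A12, A22 = tangency_coefficients(delta, p)
    B11, B12, B22 = tangency_coefficients(delta, p_prime)   # primed coefficients
    lead = B22 * a**2 + B11 * c**2 + 2 * B12 * a * c
    r1 = A12 * lead - (B12 * c + B22 * a + B11 * b * c + B12 * a * b) * A11
    r2 = A22 * lead - (2 * B12 * b + B22 + B11 * b**2) * A11
    return r1, r2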

2.2 Kruppa's Equations for Two Camera Motions

Two camera motions yield two epipolar transformations and hence four constraints on
the image w of the absolute conic. The conic w depends on five parameters, thus the
conics compatible with the four constraints form a one dimensional family c. The family
c is an algebraic curve which parameterises the camera calibrations compatible with the
two epipolar transformations.
An algebraic curve can be mapped from one projective space to another using trans-
formations defined by polynomials. A linear transformation is a special case in which the
defining polynomials have degree one. One approach to the theory of algebraic curves
is to regard each transformed curve as a different representation of the same underlying
algebraic object. For example a conic, a plane cubic with a node and a cubic space curve
can all be obtained by applying polynomial transformations to the projective line P^1.
Each curve is a different representation of P^1, even though the three curves appear to
be very different.
The properties of c are obtained in [9]. It is shown that c can be represented as an
algebraic curve of degree seven in P^3 or alternatively as an algebraic curve of degree six
in P^2. The representation of c as a curve in P^2 is obtained as follows.

Fig. 1. Construction of the dual curve g

Let p1, p'1 be the two epipoles for the first motion of the camera. The epipolar
transformation is defined by the Steiner conic s1 through p1 and p'1; two epipolar lines
(p1, y) and (p'1, y) correspond if and only if y is a point of s1. The two tangents from p1
to w cut s1 at points x1, x2 as illustrated in Fig. 1. The chord (x1, x2) of s1 corresponds
to the point x1 × x2 in the dual of the image plane. The point x1 × x2 lies on a curve
g which is an algebraic transformation of c. It is shown in [9] that g is of degree six and
genus four. The point p1 × p'1 of g corresponding to the line (p1, p'1) in the image plane
is a singular point of multiplicity three. The curve g has three additional singular points,
each of order two. An algorithm for obtaining these three singular points is described
in [9]. The algorithm produces a cubic polynomial equation in one variable, the roots of
which yield the three singular points.
Three camera displacements yield six conditions on the camera calibration. This is
enough to determine the camera calibration uniquely.

3 Computing the Epipoles


Two different methods for computing the epipoles are described.

3.1 Sturm's Method
The epipoles and the epipolar transformations can be computed by a method due to
Hesse [6] and nicely summarized by Sturm in [10]. Sturm's method yields the epipoles
compatible with seven image correspondences.
Let qi ↔ q'i, 1 ≤ i ≤ n, be a set of image correspondences. Then p, p' are possible
epipoles if and only if

\[ (p, q_i) \leftrightarrow (p', q'_i), \qquad 1 \le i \le n \qquad (10) \]


The pencil of lines through p is parameterised by the points of the line (q1, q2). Let
(p, q) be any line through p. Then (p, q) meets (q1, q2) at the point x defined by

\[ x = (p \times q) \times (q_1 \times q_2) = [(p \times q) \cdot q_2]\, q_1 - [(p \times q) \cdot q_1]\, q_2 \]

The line (p, qi) is assigned the inhomogeneous coordinate θi defined by

\[ \theta_i = \frac{(p \times q_i) \cdot q_2}{(p \times q_i) \cdot q_1} = \frac{(q_i \times q_2) \cdot p}{(q_i \times q_1) \cdot p} \]

It follows that θ1 = ∞ and θ2 = 0. Let θq be the inhomogeneous coordinate of a fifth line
(p, q). The cross ratio τ of the four lines through p with inhomogeneous coordinates θ1,
θ2, θ3, θq is

\[ \tau = \left(\frac{\theta_1 - \theta_3}{\theta_2 - \theta_3}\right) \Big/ \left(\frac{\theta_1 - \theta_q}{\theta_2 - \theta_q}\right)
     = \frac{[(p \times q) \cdot q_2]\,[(p \times q_3) \cdot q_1]}{[(p \times q) \cdot q_1]\,[(p \times q_3) \cdot q_2]} \qquad (11) \]

It follows from (10) that τ is equal to the cross ratio of the lines (p', q'i), 1 ≤ i ≤ 3 and
(p', q') in the second image. On equating the two cross ratios the following equation is
obtained,

\[ [(p \times q) \cdot q_1]\,[(p \times q_3) \cdot q_2]\,[(p' \times q') \cdot q'_2]\,[(p' \times q'_3) \cdot q'_1]
 = [(p \times q) \cdot q_2]\,[(p \times q_3) \cdot q_1]\,[(p' \times q') \cdot q'_1]\,[(p' \times q'_3) \cdot q'_2] \qquad (12) \]

If n = 6 then (12) yields three independent equations on replacing q ↔ q' by each of
q4 ↔ q'4, q5 ↔ q'5 and q6 ↔ q'6 in turn. Equation (12) has the general form aq · p = 0
where aq is a vector linear in p that depends on q ↔ q'. The three equations a4 · p = 0,
a5 · p = 0 and a6 · p = 0 constrain p to lie on the cubic plane curve (a4 × a5) · a6 = 0.
The image correspondence q6 ↔ q'6 is replaced by a new image correspondence
q7 ↔ q'7 and a second cubic constraint on p is generated. The two cubic plane curves
intersect in nine points but only three of these intersections yield epipoles p such that (10)
holds. The remaining six intersections do not yield possible epipoles. The six intersections
include the points qi for 1 ≤ i ≤ 5.
The advantage of Sturm's method is its elegant mathematical form: it gives closed
form solutions for the epipoles. It is also possible to find by an exact algorithm a least
squares solution if many cubic plane curves are available. These two approaches have
been implemented in MAPLE.
However, because of the algebraic manipulations that are involved, both approaches
turned out to be very sensitive to pixel noise. Two methods for reducing the noise sensi-
tivity have been tried. Firstly, the number of manipulations has been reduced by working
numerically using only cross ratios and the equations (12). Secondly, the uncertainties in
the positions of the image points have been taken into account. In a first implementation
without taking uncertainties into account the criterion was very sensitive to pixel noise:
in some examples, 0.1 pixel of noise drastically changed the positions of the epipoles.
Using N correspondences, partitioned into subsets of four correspondences, the idea is to
minimise the criterion

\[ C(p, p') = \sum_{i=1}^{N/4} \frac{(\tau_i - \tau'_i)^2}{\sigma_{\tau_i}^2 + \sigma_{\tau'_i}^2} \]

where τi is the cross ratio of the lines (p, qij), 1 ≤ j ≤ 4, given by the formula (11), τ'i is
the same with primes, and where σ²(τi) and σ²(τ'i) are the first order variances on τi and τ'i.
The notation qij for 1 ≤ j ≤ 4 indicates a subsequence of the qi. If the noise distribution
is the same for all image points qi, q'i then σ²(τi) = σ²(pixel) ||grad(τi)||², where grad(τi) is
an eight dimensional gradient computed with respect to the qij for 1 ≤ j ≤ 4. The effect
of using the uncertainty in the criterion is that pairs of cross-ratios with large variances
will contribute little, whereas others will contribute more.
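For reference, the cross ratio (11) of four epipolar lines can be computed directly from the image points, for example as in this short sketch (ours, assuming the reconstruction of (11) above; points are homogeneous 3-vectors).

import numpy as np

def cross_ratio(p, q1, q2, q3, q):
    """Cross ratio (11) of the four epipolar lines (p,q1), (p,q2), (p,q3), (p,q)."""
    def t(a, b, c):                      # triple product (a x b) . c
        return np.dot(np.cross(a, b), c)
    return (t(p, q, q2) * t(p, q3, q1)) / (t(p, q, q1) * t(p, q3, q2))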

The problem is that non-linear minimisation techniques are needed. The results of
non-linear minimisation are often very dependent on the starting point. Another difficulty
is that the position of the minimum is quite sensitive to noise, as will be seen in the
experimental section below.

3.2 The Fundamental Matrix Method

The fundamental matrix F is a generalization of the essential matrix described in [8]. For
a given point m in the first image, the corresponding epipolar line e_m in the second image
is linearly related to the projective representation of m. The 3 × 3 matrix F describes
this correspondence. The projective representation e_m of the epipolar line is given by

\[ e_m = F m \]

Since the point m' corresponding to m belongs to the line e_m by definition, it follows
that

\[ m'^T F m = 0 \qquad (13) \]
If the image is formed by projection onto the unit sphere then F is the product of an
orthogonal matrix and an antisymmetric matrix. It is then an essential matrix and (13)
is the so-called Longuet-Higgins equation in motion analysis [8]. If the image is formed by
a general projection, as described in (1), then F is of rank two. The matrix A of intrinsic
parameters (1) transforms the image to the image that would have been obtained by
projection onto the unit sphere. It follows that F = A^{-T} E A^{-1}, where E is an essential
matrix. Unlike the essential matrix, which is characterized by the two constraints found
by Huang and Faugeras [7] which are the nullity of the determinant and the equality of
the two non-zero singular values, the only property of the fundamental matrix is that it
is of rank two. As it is also defined only up to a scale factor, the number of independent
coefficients of F is 7. The essential matrix E is subject to two independent polynomial
constraints in addition to the constraint det(E) = 0. If F is known then it follows from
E = ATFA that the entries of A are subject to two independent polynomial constraints
inherited from E. These are precisely the Kruppa equations. It has also been shown, using
the fundamental matrix, that the Kruppa equations are equivalent to the constraint that
the two non-zero singular values of an essential matrix are equal.
The importance of the fundamental matrix has been neglected in the literature, as
almost all the work on motion has been done under the assumption that intrinsic pa-
rameters are known. But if one wants to proceed only from image measurements, the
fundamental matrix is the key concept, as it contains all the geometrical information
relating two different images. To illustrate this, it is shown that the fundamental matrix
determines and is determined by the epipolar transformation. The positions of the two
epipoles and any three of the correspondences l ↔ l' between epipolar lines together de-
termine the epipolar transformation. It follows that the epipolar transformation depends
on seven independent parameters. On identifying the equation (13) with the constraint
on epipolar lines obtained by making the substitutions (8) in (7), expressions are ob-
tained for the coefficients of F in terms of the parameters describing the epipoles and
the homography:

\[
\begin{aligned}
F_{11} &= -b\, p_3 p'_3 \\
F_{12} &= a\, p_3 p'_3 \\
F_{13} &= -a\, p_2 p'_3 - b\, p_1 p'_3 \\
F_{21} &= -d\, p'_3 p_3 \\
F_{22} &= -c\, p'_3 p_3 \\
F_{23} &= c\, p'_3 p_2 + d\, p'_3 p_1 \\
F_{31} &= d\, p'_2 p_3 - b\, p_3 p'_1 \\
F_{32} &= c\, p'_2 p_3 - a\, p_3 p'_1 \\
F_{33} &= -c\, p'_2 p_2 - d\, p'_2 p_1 + a\, p_2 p'_1 + b\, p_1 p'_1
\end{aligned}
\qquad (14)
\]

From these relations, it is easy to see that F is defined only up to a scale factor. Let
c1, c2, c3 be the columns of F. It follows from (14) that p1 c1 + p2 c2 + p3 c3 = 0. The
rank of F is thus at most two. The equations (14) yield the epipolar transformation as
a function of the fundamental matrix:

\[
\begin{aligned}
a &= F_{12}, \qquad b = F_{11}, \qquad c = -F_{22}, \qquad d = -F_{21} \\
p_1 &= \frac{F_{23}F_{12} - F_{22}F_{13}}{F_{22}F_{11} - F_{21}F_{12}}\, p_3, \qquad
p_2 = \frac{F_{11}F_{23} - F_{13}F_{21}}{F_{22}F_{11} - F_{21}F_{12}}\, p_3 \\
p'_1 &= \frac{F_{32}F_{21} - F_{22}F_{31}}{F_{22}F_{11} - F_{21}F_{12}}\, p'_3, \qquad
p'_2 = \frac{F_{11}F_{32} - F_{31}F_{12}}{F_{22}F_{11} - F_{21}F_{12}}\, p'_3
\end{aligned}
\qquad (15)
\]

The determinant of the homography is F22F11 - F21F12. In the case of finite epipoles, it
is not null.
A first method to estimate the fundamental matrix takes advantage of the fact that
equation (13) is linear and homogeneous in the nine unknown coefficients of F. Thus if
eight matches are given then in general F is determined up to a scale factor. In practice,
many more than eight matches are given. A linear least squares method is then used to
solve for F. As there is no guarantee, when noise is present, that the matrix F obtained is
exactly a fundamental one, the formulas (15) can not be used, and p has to be determined
by solving the following classical constrained minimization problem

\[ \min_p \|F p\|^2 \quad \text{subject to} \quad \|p\|^2 = 1 \]

This yields p as the unit norm eigenvector of the matrix F^T F with the smallest eigenvalue.
The same processing applies in reverse to the computation of the epipole p'. In contrast
with the Sturm method, this method requires only linear operations. It is therefore more
efficient and it has no initialization problem.
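A minimal NumPy sketch of this linear method is given below (ours, not the authors' code); it assumes the matches are supplied as homogeneous 3-vectors, and the function name is ours.

import numpy as np

def estimate_fundamental_linear(m, m_prime):
    """Linear least-squares estimate of F from n >= 8 matches with m'^T F m = 0.

    m, m_prime: (n, 3) arrays of homogeneous image points.
    Returns F (up to scale) and the epipole p minimising ||F p|| with ||p|| = 1.
    """
    # each match gives one row, linear in the nine entries of F
    rows = np.array([np.outer(mp, mi).ravel() for mi, mp in zip(m, m_prime)])
    _, _, vt = np.linalg.svd(rows)
    F = vt[-1].reshape(3, 3)
    # epipole: unit eigenvector of F^T F with the smallest eigenvalue
    w, v = np.linalg.eigh(F.T @ F)
    p = v[:, 0]
    return F, p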
However the minimum turns out to be sensitive to noise, particularly when the
epipoles are far from the centre of the image. Experiments show that this problem is
reduced by using the following criterion for minimization:

\[ \min \; \{\, d(m', F m)^2 + d(m, F^T m')^2 \,\} \qquad (16) \]

where d is a distance in the image plane. The criterion has a better physical significance
in terms of image quantities. It is necessary to minimize on F and on F^T simultaneously

to avoid discrepancies in the epipolar geometry. In doing this non-linear minimization


successfully two constraints are important:

- The solution must be of rank two, as all fundamental matrices have this prop-
erty. Rather than performing a constrained minimization with the cubic constraint
det(F) = 0, it is possible to use, almost without loss of generality, the following
representation for F proposed by Luc Robert:

\[ F = \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 x_1 + x_8 x_4 & x_7 x_2 + x_8 x_5 & x_7 x_3 + x_8 x_6 \end{pmatrix} \]

One unknown is eliminated directly, and then F is found by an unconstrained mini-


mization.
- The matrix F is defined only up to a scale factor. In order to avoid the trivial
solution F = 0 at which the minimization routines fail because the derivatives become
meaningless, one of the first six elements of F is normalised by giving it a fixed
finite value. However, as the minimization is non-linear convergence results can differ,
depending on the element chosen. This feature can be used to escape from bad local
minima during minimization.

This second method for computing the fundamental matrix is more complicated, as it
involves non-linear minimizations. However, it yields more precise results and allows the
direct use of the formulas (15) to obtain the epipolar transformation.
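The following sketch (ours) shows the rank-two parameterization and the symmetric criterion (16) written as a cost function that could be handed to any unconstrained non-linear minimiser; the parameter layout and function names are assumptions made for illustration.

import numpy as np

def point_line_distance(x, line):
    """Image-plane distance between a point x (third coordinate 1) and a line."""
    return abs(np.dot(line, x)) / np.hypot(line[0], line[1])

def rank2_F(params):
    """Rank-two parameterization: the third row of F is a combination of
    the first two (x7 * row1 + x8 * row2)."""
    x = np.asarray(params, dtype=float)
    r1, r2 = x[0:3], x[3:6]
    return np.vstack([r1, r2, x[6] * r1 + x[7] * r2])

def symmetric_epipolar_cost(params, m, m_prime):
    """Criterion (16): symmetric point-to-epipolar-line distances."""
    F = rank2_F(params)
    cost = 0.0
    for mi, mp in zip(m, m_prime):
        cost += point_line_distance(mp, F @ mi) ** 2
        cost += point_line_distance(mi, F.T @ mp) ** 2
    return cost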

4 Solving Kruppa's Equations: the Continuation Method

Symbolic methods for solving Kruppa's equations are described in [9]. These methods
are very sensitive to noise: even ordinary machine precision is not sufficient. Also they
require rational numbers rather than real numbers. In this section Kruppa's equations
are solved by an alternative method which is suitable for real world use. The current
implementation is as follows,

- Do 3 displacements. For each displacement:


1. Find point matches between the two images
2. Compute the epipoles
3. Compute the homography of epipolar lines
4. Compute the two Kruppa equations
- Solve the six Kruppa equations using the continuation method
- Compute the intrinsic parameters

Three displacements yield six equations in the entries of the matrix D defined in Sect.
2.1. The equations are homogeneous so the solution for D is determined only up to a scale
factor. In effect there are five unknowns. Trying to solve the over-determined problem
with numerical methods usually fails, so five equations are picked from the six and solved
first. As the equations are each of degree two, the number of solutions in the general case
is 32. The remaining equation is used to discard the spurious solutions. In addition to the
six equations, the entries of D satisfy certain inequalities that are discussed later. These
are also useful for ruling out spurious solutions. The problem is that solving a polynomial
system by providing an initial guess and using an iterative numerical method will not
generally give all the solutions: many of the start points will yield trajectories that do

not converge and many other trajectories will converge to the same solution. However it
is not acceptable to miss solutions, as there is only one good one amongst the 32.
Recently developed methods in numerical continuation can reliably compute all so-
lutions to polynomial systems. These methods have been improved over a decade to
provide reliable solutions to kinematics problems. The details of these improvements are
omitted. The interested reader is referred to [13] for a tutorial presentation. The solu-
tion of a system of nonlinear equations by numerical continuation is suggested by the
idea that small changes in the parameters of the system usually produce small changes
in the solutions. Suppose the solutions to problem A (the start system) are known and
solutions to problem B (the target system) are required. Solutions to the problem are
tracked as the parameters of the system are slowly changed from those of A to those
of B. Although for a general nonlinear system numerous difficulties can arise, such as
divergence or bifurcation of a solution path, for a polynomial system all such difficulties
can be avoided.

Start System. There are three criteria that guide the choice of a start system: all of
its solutions must be known, each solution must be non-singular, and the system must
have the same homogeneous structure as the target system. The use of m-homogeneous
systems reduces the computational load by eliminating some solutions at infinity, so it is
useful to homogenize, but only inhomogeneous systems are discussed here for the sake of
simplicity. Thus an acceptable start system is: x_j^{d_j} - 1 = 0 for 1 ≤ j ≤ n where n is the
number of equations and dj is the degree of the equation j of the target system. Each
equation yields dj distinct solutions for xj, and the entire set of \prod_{j=1}^{n} d_j solutions is
found by taking all possible combinations of these.

Homotopy. The requirement for the choice of the homotopy (the schedule for transform-
ing the start system into the target system) is that as the transformation proceeds there
should be a constant number of solutions which trace out smooth paths and which are
always nonsingular until the target system is reached. It has been shown by years of
practice that the following homotopy suffices:

\[ H(x, t) = (1 - t)\, e^{i\theta} G(x) + t\, F(x) \]

where G(x) is the start system, and F(x) is the target system.

Path tracking. Path tracking is the process of following the solutions of H(x, t) = 0 as t


is increased from 0 to 1. These solutions form d continuation paths, where d is the Bezout
number of the system, characterising the number of solutions. To track a path from a
known solution (x^0, t^0), the solution is predicted for t = t^0 + Δt, using a Taylor expansion
for H, to yield Δx = -J_x^{-1} J_t Δt, where J_x and J_t are the Jacobians of H with respect
to x and t. The prediction is then corrected using Newton's method with t fixed at the
new value to give correction steps of Δx = -J_x^{-1} H(x, t).
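A minimal predictor-corrector tracker for a single path of such a homotopy can be sketched as follows (ours; it uses finite-difference Jacobians, a fixed step length and a random complex factor in place of e^{iθ}, unlike the adaptive, m-homogeneous implementation referred to in the next paragraph).

import numpy as np

def track_path(G, F, x0, steps=200, newton_iters=3, h=1e-7):
    """Track one solution path of H(x, t) = (1 - t)*gamma*G(x) + t*F(x)
    from a known root x0 of G (at t = 0) to t = 1."""
    gamma = np.exp(1j * np.random.uniform(0, 2 * np.pi))
    H = lambda x, t: (1 - t) * gamma * G(x) + t * F(x)

    def jac_x(x, t):                              # finite-difference Jacobian in x
        n = len(x)
        J = np.zeros((n, n), dtype=complex)
        for k in range(n):
            e = np.zeros(n, dtype=complex); e[k] = h
            J[:, k] = (H(x + e, t) - H(x - e, t)) / (2 * h)
        return J

    x, dt = np.asarray(x0, dtype=complex), 1.0 / steps
    for i in range(steps):
        t = i * dt
        Jt = (H(x, t + h) - H(x, t - h)) / (2 * h)
        x = x - np.linalg.solve(jac_x(x, t), Jt) * dt        # predictor step
        t = t + dt
        for _ in range(newton_iters):                        # corrector (Newton) steps
            x = x - np.linalg.solve(jac_x(x, t), H(x, t))
    return x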
Using an implementation provided by Jean Ponce and colleagues fairly precise so-
lutions can be obtained. The major drawback of this method is that it is expensive in
terms of CPU work. The method is a naturally parallel algorithm, because each contin-
uation path can be tracked on a separate processor. Running it on a network of 7 Sun-4
workstations takes approximately one minute for our problem.

5 Computing the Intrinsic Parameters

In this section the relation between the image of the absolute conic and the intrinsic
parameters is given in detail. The most general matrix A occurring in (1) can be written:

\[ A = \begin{pmatrix} -f k_u & f k_u \cot\theta & u_0 \\ 0 & -f k_v \operatorname{cosec}\theta & v_0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (17) \]
- ku, kv, are the horizontal and vertical scale factors whose inverses characterize the
size of the pixel in world coordinates units.
- u0 and v0 are the image center coordinates, resulting from the intersection between
the optical axis and the image plane.
- f is the focal length
- θ is the angle between the directions of retinal axes. This parameter is introduced to
account for the fact that the pixel grid may not be exactly orthogonal. In practice θ
is very close to π/2.
As f cannot be separated from ku and kv it is convenient to define the products αu =
-f ku and αv = -f kv. This gives five intrinsic parameters. This is exactly the num-
ber of independent coefficients for the image w of the absolute conic, thus the intrinsic
parameters can be obtained from w. The equation of w is [3]:

\[ y^T A^{-T} A^{-1} y = 0 \]

It follows that D = A A^T. Up to a scale factor the entries δij and δi of D are related to
the intrinsic parameters by

\[
\begin{aligned}
\delta_1 &= v_0 \\
\delta_2 &= u_0 \\
\delta_3 &= u_0 v_0 - \alpha_u \alpha_v \cot\theta \operatorname{cosec}\theta \\
\delta_{12} &= -1 \\
\delta_{23} &= -u_0^2 - \alpha_u^2 \operatorname{cosec}^2\theta \\
\delta_{13} &= -v_0^2 - \alpha_v^2 \operatorname{cosec}^2\theta
\end{aligned}
\]
From these relations it is easy to see that the intrinsic parameters can be uniquely
determined from the Kruppa coefficients, provided the five following conditions hold:

\[
\begin{aligned}
&\delta_{13}\delta_{12} > 0 \\
&\delta_{23}\delta_{12} > 0 \\
&\delta_{13}\delta_{12} - \delta_1^2 > 0 \\
&\delta_{23}\delta_{12} - \delta_2^2 > 0 \\
&\frac{(\delta_3\delta_{12} + \delta_1\delta_2)^2}{(\delta_{13}\delta_{12} - \delta_1^2)(\delta_{23}\delta_{12} - \delta_2^2)} \le 1
\end{aligned}
\qquad (18)
\]

If one of the conditions (18) doesn't hold then there is no physically acceptable calibration
compatible with the Kruppa coefficients δij and δi. This is a strong condition which rules
out many spurious solutions obtained by solving five of the Kruppa equations. It is
interesting to note that if a four-parameter model is used with θ = π/2 then there is the
additional constraint δ3 = -δ1δ2/δ12 which replaces the last one of (18). It can also be
verified that the calibration parameters depend only on the ratios of the Kruppa coefficients,
so that scaling them does not modify their values, as expected.
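The back-substitution from D to the intrinsic parameters can be written in a few lines. The sketch below (ours, with our own variable names) assumes D has been recovered up to scale and satisfies the conditions (18); since D = A A^T, the signs of αu and αv cannot be recovered, and only their magnitudes are returned.

import numpy as np

def intrinsics_from_dual_conic(D):
    """Read off (|alpha_u|, |alpha_v|, u0, v0, theta) from D ~ A A^T,
    with A upper triangular as in (17)."""
    D = np.asarray(D, dtype=float)
    D = D / D[2, 2]                    # fix the scale so that D[2,2] = 1
    u0, v0 = D[0, 2], D[1, 2]
    a22 = np.sqrt(D[1, 1] - v0**2)     # = |alpha_v * cosec(theta)|
    a12 = (D[0, 1] - u0 * v0) / a22    # = -alpha_u * cot(theta) (up to sign)
    a11 = np.sqrt(D[0, 0] - u0**2 - a12**2)   # = |alpha_u|
    theta = np.arctan2(a11, -a12)      # since cot(theta) = -a12 / a11
    return a11, a22 * np.sin(theta), u0, v0, theta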

6 Experimental Results

The results of experiments with computer generated data are described. The coordinates
of the projections of 3D points are computed using a realistic field of view and realistic
values for the extrinsic and intrinsic parameters. For each displacement 20 point matches
are selected and noise is added.

6.1 Computation of the epipoles
The results for the determination of the epipoles in the first image are presented. The
values obtained by the two algorithms (the Sturm method based on weighted cross-ratios,
and the fundamental matrix method) are given, as well as the relative error with respect
to the exact solution. The results in the second image are always similar to those in the
first image.

pixel motion 1 motion 2 motion 3


noise [ .497578.01443363.49306 ] [ o .os o l [0.1 0 O]
1-335.50 985.39 325.14] SturmI0 0 4001 tso 20 201
Sturm Fund. Matrix F u n d . Matrix Sturm Fund. Matrix
0 -414.42 3 1 1 5 . 6 4 - 4 1 4 . 4 2 3 1 1 5 . 6 4 246.09 255.64 246.09 255.64 1846.40 1199.34 1846.40 1199.34
0.01 -413.96 3113.82 - 4 1 4 . 5 3 3 1 1 5 . 6 7 246.06 255.66 246.10 255.57 1847.89 1200.50 1841.39 1195,77
0.1 0.06 0.02 0.02 0.01 0.007 0.04 0.02 0.08 0.1 0.27 0.29
0.1 -410.05 3 0 9 8 . 1 7 - 4 1 5 . 4 7 3 1 1 4 . 4 9 245.69 255.7(~ 246.11 255.00 1864.07 1212.39 1792.52 1161.6~
1 0.5 0,2 0.03 0.16 0.02 0.008 0.25 I 1.1 2.9 3.1
0.2 -406.08 3 0 8 2 . 2 7 - 4 1 6 . 2 7 3 1 1 2 . 2 9 245.58 255.9C 246.05 254.47 1889.58 1229.63 1731.60 1120.51
2 1.1 0.4 0,1 0.2 0.I 0.016 0.45 2.3 2.5 6.2 6.2
0.5 -396.32 3043.11 - 4 1 7 . 1 3 3099.29 244.45 256.3( 245.40 253.64 2045.52 1325.10 1527.21 988.02
4.3 2.3 0.6 0.6 0.6 0.3 0.3 0.8 10.7 10.5 17 17
1.0 -386.10 3001.28 - 4 1 3 . 2 4 3055.80 239.82 256.11 242.62 254.90 762.82 554.48 1201.05 785.79
6.8 3.6 0.2 1.9 2.5 0.2 1.4 0.3 58 53 35 34
2.0 -333.19 2772.54 - ~ 5 . 4 1 2 8 8 9 . 2 9 226.05 284.41 230,44 267.72 801.64 592.33 785.28 536.30
19 11 7 7.2 8 11 6.3 4.7 56 50 57 55

From these results, it can be seen that the fundamental matrix method is more
robust. It is also computationally very efficient since it involves only a linear least squares
minimisation and a 3 x 3 eigenvector computation. A second point worth noting is that
the stability of the position of the epipole depends strongly on the displacement that is
chosen.
Other experiments not reported here due to lack of space show that if more matches
are available then the precision of the determination of the epipoles can be improved.

6.2 Intrinsic parameters
The intrinsic parameters that have been computed using two displacement sequences
are presented. The first sequence consists of motion 1, motion 4, motion 2. The second
sequence consists of motion 1, motion 4, motion 3.

αu        αv        u0        v0        θ - π/2
0 pixels 640.125 ~)43.695 246.096 255.648 0
0.01 pixels 597.355 940.403 248.922 259.196 0.02
6,68 0.34 1.14 1.38
0.1 pixel8 520.126 904.744 275,120 280.601 0.09
18.7 4.1 11.8 9.7
0.2 pixels 175.204 ~67.214 565,234 291.162 0.4
72.6 8.1 129.6 13.8

αu        αv        u0        v0        θ - π/2


0.01 pixels 699.815 948.1061174.723 245.112 0.018
9.3 0.46 29.0 4.12
0.1 pixels 687.814 989.538 149.055 257.030 0.004
7.4 4.85 39.4 0.54
0.2 pixels 552.601 993.837!269.278 283.458 0.013
13.6 5.31 9.41 10.87
0.5 pixels 433.894 957.018!358.904 308.191 0.13
32.2 1.4 45 20.5
1.0 pixel 477.043 724.909 137.052 258.763 0.3
25.4 23.2 44.3 1.2
More precise results are obtained if more than three camera displacements are available.
These results demonstrate the feasibility of the method in real environments, provided
image points can be located with a sufficient precision. This precision is already achievable
using special patterns.

7 Conclusion and Perspectives


A method for the on-line calibration of the intrinsic parameters of a camera has been
described. The method is based on the estimation of the epipolar transformations as-
sociated with camera displacement. Three epipolar transformations arising from three
different displacements are sufficient to determine the camera calibration uniquely. The
epipolar transformations can in principle be obtained by tracking a number of salient
image points while the camera is moving. It is therefore not necessary to interrupt the
action of the vision system in order to point the camera at a special test pattern.
The feasibility of the method is demonstrated by a complete implementation which is
capable of finding the intrinsic parameters provided a sufficient number of point matches
are available with a sufficient precision. However, the precision required to obtain accept-
able calibrations is at the limit of the state-of-art feature detectors.
The next step is thus to find efficient methods to combat noise. The key idea is
to compute the uncertainty explicitly. The results have shown that some displacements
yield epipolar transformations that are very sensitive to pixel noise, whereas some yield
transformations that are more robust. Methods for characterising "bad" displacements
are currently being investigated. In particular, it has been shown that pure translations
lead to degenerate cases, thus yielding results that are very sensitive to noise. However, it
is not sufficient to know a priori which displacements are best because, as the camera is not
yet calibrated, they cannot be applied. If the uncertainty in the epipolar transformation
obtained from a given displacement can be computed it can be the basis of a decision
whether to use the transformation for the computations, to discard it and use another one,
or to take it into account only weakly. The final aim is to obtain acceptable calibrations
using real images.

References
1. R. Deriche and G. Giraudon. Accurate corner detection: An analytical study. In Proceed-
ings ICCV, 1990.
2. Rachid Deriche and Olivier D. Faugeras. Tracking Line Segments. Image and vision com-
puting, 8(4):261-270, November 1990. A shorter version appeared in the Proceedings of
the 1st ECCV.

3. O.D. Faugeras. Three-dimensional computer vision. MIT Press, 1992. To appear.


4. O.D. Faugeras and G. Toscani. The calibration problem for stereo. In Proceedings of
CVPR'86, pages 15-20, 1986.
5. C. Harris and M. Stephens. A combined corner and edge detector. In Proc. 4th Alvey
Vision Con]., pages 189-192, 1988.
6. O. Hesse. Die cubische Gleichung, von welcher die Lösung des Problems der Homographie
von M. Chasles abhängt. J. reine angew. Math., 62:188-192, 1863.
7. T.S. Huang and O.D. Faugeras. Some properties of the E matrix in two view motion
estimation. IEEE Trans. Pattern Analysis and Machine Intelligence, 11:1310-1312, 1989.
8. H.C. Longuet-Higgins. A Computer Algorithm for Reconstructing a Scene from Two Pro-
jections. Nature, 293:133-135, 1981.
9. S.J. Maybank and O.D. Faugeras. A Theory of Self-Calibration of a Moving Camera. The
International Journal of Computer Vision, 1992. Submitted.
10. Rudolf Sturm. Das Problem der Projektivität und seine Anwendung auf die Flächen zweiten
Grades. Math. Ann., 1:533-574, 1869.
11. G. Toscani, R. Vaillant, R. Deriche, and O.D. Faugeras. Stereo camera calibration using
the environment. In Proceedings of the 6th Scandinavian conference on image analysis,
pages 953-960, 1989.
12. Roger Tsai. An Efficient and Accurate Camera Calibration Technique for 3D Machine
Vision. In Proceedings CVPR '86, Miami Beach, Florida, pages 364-374. IEEE, June 1986.
13. C.W. Wampler, A.P. Morgan, and A.J. Sommese. Numerical continuation methods for
solving polynomial systems arising in kinematics. Technical Report GMR-6372, General
Motors Research Labs, August 1988.

This article was processed using the LaTeX macro package with ECCV92 style
Model-Based Object Pose in 25 Lines of Code*

Daniel F. DeMenthon and Larry S. Davis


Computer Vision Laboratory
Center for Automation Research
University of Maryland, College Park, MD 20742-3411, USA

A b s t r a c t . We find the pose of an object from a single image when the rel-
ative geometry of four or more noncoplanar visible feature points is known.
We first describe an algorithm, POS (Pose from Orthography and Scaling),
that solves for the rotation matrix and the translation vector of the object
by a linear algebra technique under the scaled orthographic projection ap-
proximation. We then describe an iterative algorithm, POSIT (POS with
ITerations), that uses the pose found by POS to remove the "perspective
distortions" from the image, then applies POS to the corrected image in-
stead of the original image. POSIT generally converges to accurate pose
measurements in a few iterations. Mathematica code is provided in an Ap-
pendix.

1 Introduction
Computation of the position and orientation of an object (object pose) using images of
feature points when the geometric configuration of the features on the object is known
(a model) has important applications, such as calibration, cartography, tracking and
object recognition. Researchers have formulated closed form solutions when a few feature
points are considered in coplanar and noncoplanar configurations (see [5] for a review).
However, numerical pose computations can make use of larger numbers of feature points
and tend to be more robust; the pose information content becomes highly redundant;
the measurement errors and image noise average out between the feature points. Notable
among these computations are the methods proposed by Tsai [7] and by Yuan [9].
The method we describe here can also use many noncoplanar points and applies a
novel iterative approach. Each iteration comprises two steps.
1. In the first step we approximate the "true" perspective projection (TPP) with a
scaled orthographic projection approximation (SOP). Finding the rotation matrix
and translation vector from image feature points with this approximation is very
simple. We call this algorithm "POS" (Pose from Orthography and Scaling) (see [6]
for similar solutions without scaling, and [8] for similar equations applied to object
recognition without pose computation).
2. We use the approximate pose from the first step to displace the TPP image feature
points toward the positions they would have if they were SOP projections.
We stop the iteration when the image points are displaced by less than one pixel. Since
the POS algorithm in the first step requires an SOP image instead of a TPP image to
produce an accurate pose, using the displaced points of the second step instead of the TPP
* The support of the Defense Advanced Research Projects Agency (ARPA Order No. 6989)
and the U.S. Army Topographic Engineering Center under Contract DACA76-89-C-0019 is
gratefully acknowledged, as is the help of Sandy German in preparing this report.

points yields an improved pose at the second iteration, which in turn leads to displaced
image points closer to SOP points, etc. We call this iterative algorithm "POSIT" (POS
with ITerations). Four or five iterations are typically required to converge to an accurate
pose.

2 Notations

In Fig. 1, we show the classic pinhole camera model, with its center of projection O, its
image plane G at a distance f (the focal length) from O, its axes Ox and Oy pointing
along the rows and columns of the camera sensor, and its third axis Oz pointing along
the optical axis. The unit vectors for these three axes are called i, j and k.
An object with feature points M0, M1, ..., Mi, ..., Mn is positioned in the field of
view of the camera. The coordinate frame of reference for the object is centered at M0
and is (M0u, M0v, M0w). We call M0 the reference point for the object. Only the object
points M0 and Mi are shown in Fig. 1. The shape of the object is assumed to be known;
therefore the coordinates (Ui, Vi, Wi) of the point Mi in the object coordinate frame of
reference are known. The coordinates of the same point in the camera coordinate system
are called (Xi, Yi, Zi).


Fig. 1. Perspective projection and scaled orthographic projection for object point Mi and object
reference point M0.

3 Scaled Orthographic Projection and Perspective Projection

Consider a point Mi of the object (Fig. 1). In "true" perspective projection (TPP), its
image is a point rai of the image plane G which has coordinates

\[ x_i = f X_i / Z_i, \qquad y_i = f Y_i / Z_i \qquad (1) \]

Scaled orthographic projection (SOP) is an approximation to TPP. One assumes that


the depths Zi of different points Mi of the object are not very different from one another,
and can all be set to the depth Z0 of the reference point M0 of the object. In SOP, the
image of a point Mi is a point pi of the image plane G which has coordinates

\[ x_i = f X_i / Z_0 = s X_i, \qquad y_i = f Y_i / Z_0 = s Y_i \qquad (2) \]
The ratio s = f / Z o is the scaling factor of the SOP. The reference point M0 has the
same image m0 with coordinates x0 and y0 in SOP and TPP. The image coordinates of
the SOP projection p~ can also be written as

\[ x_i = x_0 + s(X_i - X_0), \qquad y_i = y_0 + s(Y_i - Y_0) \qquad (3) \]

The geometric construction for obtaining the TPP image point mi of Mi and the
SOP image point pi of Mi is shown in Fig. 1. Classically, the TPP image point mi is
the intersection of the line of sight of Mi with the image plane G. In SOP, we draw a
plane K through M0 parallel to the image plane G. This plane is at a distance Z0 from
the center of projection O. The point Mi is projected on K at Pi by an orthographic
projection. Then Pi is projected on the image plane G at pi by a perspective projection.
The vector m0pi is parallel to M0Pi and is scaled down from M0Pi by the scaling factor
s = f/Z0. Eq. (3) simply expresses the proportionality between these two vectors.

4 Approximate Pose from SOP (POS)

We find an approximate pose by assuming that the TPP image points mi can be ap-
proximated by the SOP image points pi (Fig. 1). Our goal is to recover the coordinates
of the three unit vectors i,j, k of the camera coordinate system in the object coordinate
system using the SOP approximation. Indeed these three vectors expressed in the object
coordinate system are the row vectors of the rotation matrix R . The translation vector
T for the object is the vector OM0. Once we find the scaling factor of the SOP, this
vector OM0 is simply a scaled up version of the image vector Omo. We call this pose
calculation method POS (Pose from Orthography and Scaling).
We modify the two expressions of Eq. (3). After expressing the coordinates Xi - X0
and Yi - Y0 of the vector M0Mi as dot products of M0Mi with the unit vectors i and j, we
obtain

\[ x_i - x_0 = s\, i \cdot M_0M_i, \qquad y_i - y_0 = s\, j \cdot M_0M_i \]

We define I and J as scaled down versions of the unit vectors i and j

\[ I = s\, i, \qquad J = s\, j \qquad (4) \]

which yields

\[ x_i - x_0 = I \cdot M_0M_i, \qquad y_i - y_0 = J \cdot M_0M_i \qquad (5) \]

These can be viewed as linear equations where the unknowns are the coordinates of vector
I and vector J in the object coordinate system. The other parameters are known.
Writing Eq. (5) for the object points M0, M1, M2, Mi, ..., Mn and their images, we
generate a linear system for the coordinates of the unknown vector I and a linear system
for the unknown vector J:

\[ A I = x, \qquad A J = y \qquad (6) \]

where A is the matrix of the coordinates of the object points Mi in the object coordinate
system and x and y are the vectors of the x and y coordinates of the image points mi
offset by the coordinates of the image point m0.
Generally, if we have at least three visible points other than M0, and all these points
are noncoplanar, matrix A has rank 3, and the solutions of the linear systems in the least
square sense are given by
\[ I = B x, \qquad J = B y \]

where B is the pseudoinverse of the matrix A. We call B the object matrix. Knowing
the geometric distribution of feature points Mi, we can precompute this pseudoinverse
matrix B.
Once we have obtained least square solutions for I and J, the unit vectors i and j are
simply obtained by normalizing I and J. As mentioned earlier, the three elements of the
first row of the rotation matrix of the object are then the three coordinates of vector i
obtained in this fashion. The three elements of the second row of the rotation matrix are
the three coordinates of vector j. The elements of the third row are the coordinates of
vector k of the z-axis of the camera coordinate system and are obtained by taking the
cross-product of vectors i and j.
Now the translation vector of the object can be obtained. It is vector OM0 between
the center of projection, O, and M0, the origin of the object coordinate system. This
vector, OM0, is aligned with vector Om0 and is equal to Z0 Om0/f, i.e. Om0/s. The
scaling factor s is obtained by taking the norm of vector I or vector J. The POS method
uses at least one more point than is strictly necessary to find the object pose. At least
four noncoplanar points including M0 are required for this method, whereas three points
are in principle enough if the constraints that i and j be of equal length and orthogonal
are applied (see [3] for a simple pose solution for three or more coplanar points). Since
we do not use these constraints in POS, we can verify a posteriori how close the vectors
i and j provided by POS are to being orthogonal and of equal length. Alternatively, we
can verify these properties with the vectors I and J which are proportional to i and j
with the same scaling factor s. We construct a goodness measure G, for example as

\[ G = |I \cdot J| + |I \cdot I - J \cdot J| \]

The goodness measure G becomes large when the results are poor and can be used for
quickly testing the quality of the computed pose and for detecting wrong correspondences
between image points and object points.
The POS algorithm provides a computationally inexpensive method for directly ob-
taining the translation and rotation of an object; the accuracy of POS may be sufficient
for tracking the motions of an object in space, finding initial estimates for iterative meth-
ods, or testing whether image and object points can be matched. Furthermore, when an
object is far from the camera, it is useless to try to improve on the pose found by POS.
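For completeness, here is a short NumPy transcription of the POS step described above (ours; the Mathematica program in the Appendix is the authors' reference implementation, and no goodness test is included here).

import numpy as np

def pos(object_points, image_points, focal_length):
    """POS: pose from orthography and scaling, following Eqs. (4)-(6).

    object_points: (n, 3) coordinates of M0..Mn-1 in the object frame (M0 first).
    image_points:  (n, 2) image coordinates of m0..mn-1, measured from the image center.
    Returns the rotation matrix R, translation vector T and SOP scale s.
    """
    A = object_points - object_points[0]          # vectors M0Mi
    x = image_points[:, 0] - image_points[0, 0]
    y = image_points[:, 1] - image_points[0, 1]
    B = np.linalg.pinv(A)                         # the "object matrix"
    I, J = B @ x, B @ y                           # least-squares solutions of (6)
    s1, s2 = np.linalg.norm(I), np.linalg.norm(J)
    i, j = I / s1, J / s2
    k = np.cross(i, j)                            # third row of the rotation matrix
    R = np.vstack([i, j, k])
    s = (s1 + s2) / 2.0                           # SOP scaling factor
    T = np.array([image_points[0, 0], image_points[0, 1], focal_length]) / s
    return R, T, s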

5 From Approximate Pose to Exact Pose: The POSIT Algorithm

5.1 Basic Idea

In this section, we present an iterative algorithm, POSIT (POS with Iterations) , which
uses POS at each iteration. Less than five iterations are typically sufficient. The basic
idea for iterating toward a more accurate pose is the following:

If we could build an SOP image of the object feature points from a TPP image,
we could apply the POS algorithm to this SOP image and we would obtain an
exact object pose.

Computing an exact SOP image requires knowing the exact pose of the object. However,
once we have applied POS to the actual image, we have an approximate depth for each
feature point, and we position the feature points at these depths on the lines of sight.
Then we can compute an SOP image. At the next iteration, we apply POS to the SOP
image to find an improved SOP image. The algorithm generally converges after a few
iterations and provides an accurate SOP image and an exact pose.

5.2 Finding an SOP image from a TPP image

Eq. (1) and Eq. (2) show that the SOP vector Cpi is aligned with the TPP vector Cmi
and the proportionality factor is Zi/Z0:

\[ Cp_i = \frac{Z_i}{Z_0}\, Cm_i \qquad (7) \]

The coordinates Zi can be computed by

\[ Z_i = Z_0 + k \cdot M_0M_i \qquad (8) \]

where k is the unit vector along the optical axis Oz. Expressed in the object coordinate
system, k is the third row of the rotation matrix of the object, and M0Mi is a known
vector. Eq. (7) and Eq. (8) yield for the SOP image points pi

\[ Cp_i = \left(1 + \frac{s}{f}\,(k \cdot M_0M_i)\right) Cm_i \qquad (9) \]

where we have replaced 1/Zo by s / f , the ratio of the scaling factor of the SOP by the
camera focal length.
Expression (9) provides at each iteration of the POSIT algorithm the approximated
positions of the SOP image points pi in relation to the image points mi if we use the
third row of the computed rotation matrix and the computed scaling factor.
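The correction (9) amounts to one line per point; a NumPy sketch (ours) of this single step is:

import numpy as np

def sop_correction(image_points, object_points, row3, Z0):
    """One POSIT correction step, Eq. (9): displace the TPP image points to the
    positions they would have under SOP, given the current pose estimate
    (row3 = third row of R in the object frame, Z0 = f / s)."""
    M0Mi = object_points - object_points[0]
    factors = 1.0 + (M0Mi @ row3) / Z0        # 1 + (k . M0Mi) / Z0, one factor per point
    return image_points * factors[:, None]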

6 Illustration of the Iteration Process in POSIT

To illustrate the iteration process of POSIT, we apply the method to synthetic data. The
object is a cube; the points of interest are the eight corners (one can easily experiment
with eight visible corners using light emitting diodes). The projection on the left of Fig. 2
is the given image for the cube (the shown projections of the cube edges are not used by
the algorithm). The distance-to-size ratio for the cube is small, thus some parallel cube
edges show strong convergence in the image. One can get an idea of the success of the
POS algorithm by computing the TPP image of the cube at the poses found at successive
iterations (Fig. 2, top row). Notice that from left to right these projections become more
similar to the given image. POSIT does not compute these images. Instead, POSIT
computes SOP images using Eq. (9) (Fig. 2, bottom row). Notice that from left to right
the edges of the cube become more parallel in these SOP images, since orthographic
projection preserves parallelism.

Fig. 2. TPP images (top) and SOP images (bottom) for cube poses computed at successive
steps by POSIT algorithm.

7 Performance Characterization

We try to follow the recommendations of Haralick for performance evaluation in com-


puter vision [4]. We compute the orientation and position errors of the POS and POSIT
algorithms for a cube (see [3] for more experiments). One corner of the cube is taken as
the reference point and is allowed to slide along the optical axis at 10 distances from the
center of projection, from four times to 40 times the size of the cube. These distance-to-
size ratios are used as the horizontal coordinates in the error plots. Around each of these
reference point positions, the cube can swivel at 40 random orientations.
We obtain synthetic images by perspective projection with a focal length of 760 pixels.
Note that only a wide-angle camera with a total angular field of more than 50° would
be able to see the whole cube when it is closest to the camera. We specify three levels
of random perturbation and noise in the image. At noise level 1, the computed image
coordinates are rounded to integer values. At noise level 2, perturbations of ± 1 pixel are
added to the image coordinates. At noise level 3, the amplitudes of the perturbations are
± 2 pixels.
For each of the images, the orientation and position of the object are computed by
the POS algorithm, then by the POSIT algorithm until it converges. We then compute
the axis of the rotation required to align the coordinate system of the object in its actual
orientation with the coordinate system of the object in its computed orientation. The
orientation error is defined as the rotation angle in degrees around this axis required
to achieve this alignment [3]. The relative position error is defined as the norm of the
translation vector required to align the computed reference point position with the actual
reference point, divided by the distance of the actual reference point position from the
camera. For both POS and POSIT, we show the average errors with their standard
deviation error bars as a function of the distance-to-size ratios (Fig. 3).

8 Results

At very low to medium range and low to medium noise, POSIT gives poses with less
than 2° rotation errors and less than 2% position errors. POSIT provides dramatic im-

(In each plot, the upper points are the POS errors and the lower points the POSIT errors; the three plots correspond to the three image noise levels.)
Fig. 3. Orientation and position errors for a cube at various distances at three image noise
levels.

provements over POS when the objects are very close to the camera, and almost no
improvements when the objects are far from the camera. When the objects are close to
the camera, the so-called perspective distortions are large, and the approximation that
the image is an SOP is poor; therefore the performance of POS is poor. When the ob-
jects are very far, there is almost no difference between SOP and TPP; thus POS gives
the best possible results, and iterating with POSIT cannot improve upon them. Also,
when the object is far, pose errors increase with the distance ratios, since at long range
perturbations of a few pixels are a large percentage of the image size.

9 Convergence Analysis

We now explore with simulations the effect of the distance of an object to the camera
on the convergence of the POSIT algorithm (Fig. 4). The convergence test consists of
quantizing (in pixels) the coordinates of the image points in the SOP images obtained
at successive steps, and terminating when two successive SOP images are identical (see
Appendix A). A cube is displaced along the camera optical axis. One face is kept parallel
to the image plane. The abscissa in the plots is the distance from the center of projection
to that face, in cube size units. Noise of ± 2 pixels is added to the perspective projection.
Four iterations are required for convergence until the cube is at three times its size from
the center of projection. The number gradually climbs to eight iterations for a distance
of 1, and 20 iterations for 0.5. Then the number increases sharply to 100 iterations for
a distance ratio of 0.28 from the center of projection. Up to this point the convergence
is monotonic. At still closer ranges the mode of convergence changes to a nonmonotonic
mode, in which SOP images are subjected to somewhat random variations from iteration
to iteration until they hit close to the final result and converge rapidly. The number of
iterations ranges from 20 to 60 in this mode, i.e. less than for the worse monotonic case,
with very different results for small variations of object distance. We label this mode
"chaotic convergence" in Fig. 4. Finally, when the distance ratio becomes less than 0.12,
the algorithm clearly diverges. Note, however, that in order to see the close corners of
the cube at this range, a camera would require a total field of more than 150°, i.e. a focal
length of less than 1.5 mm for a 10 mm CCD chip, an improbable configuration. In all our
experiments, the POSIT algorithm has been reliably converging in a few iterations in the
range of practical camera and object configurations. We are in the process of analyzing
the convergence process by analytical means, but so far have succeeded only for objects
and orientations chosen to yield simple expressions. Convergence seems to be guaranteed
if the image features are at a distance from the image center shorter than the focal length.


F i g . 4. Number of iterations as a function of distance to camera at very close ranges (left) and
for a wider range of distances (right).

Appendix A: A Mathematica program implementing POS and POSIT

Compute the pose of an object given a list of 2D image points, a list of corresponding
3D object points, and the object matrix (the pseudoinverse matrix for the list of object
points). The first point of the image point list is taken as a reference point. The outputs
are the pose computed by POS using the given image points and the pose computed by
POSIT.
GetPOSIT[imagePoints_, objectPoints_, objectMatrix_, focalLength_] := Module[
 {objectVectors, imageVectors, IVect, JVect, ISquare, JSquare, IJ,
  imageDifference, row1, row2, row3, scale1, scale2, scale, oldSOPImagePoints,
  SOPImagePoints, translation, rotation, firstPose, count = 0, converged = False},
 objectVectors = (# - objectPoints[[1]]) & /@ objectPoints;
 oldSOPImagePoints = imagePoints;
 (* loop until difference between 2 SOP images is less than one pixel *)
 While[!converged,
  If[count == 0,
   (* we get image vectors from image of reference point for POS: *)
   imageVectors = (# - imagePoints[[1]]) & /@ imagePoints,
   (* else count>0, we compute a SOP image first for POSIT: *)
   SOPImagePoints = imagePoints (1 + (objectVectors.row3)/translation[[3]]);
   imageDifference = Apply[Plus, Abs[Round[Flatten[SOPImagePoints]] -
      Round[Flatten[oldSOPImagePoints]]]];
   oldSOPImagePoints = SOPImagePoints;
   imageVectors = (# - SOPImagePoints[[1]]) & /@ SOPImagePoints
  ]; (* end else count>0 *)
  {IVect, JVect} = Transpose[objectMatrix . imageVectors];
  ISquare = IVect.IVect; JSquare = JVect.JVect; IJ = IVect.JVect;
  {scale1, scale2} = Sqrt[{ISquare, JSquare}];
  {row1, row2} = {IVect/scale1, JVect/scale2};
  row3 = RotateLeft[row1] RotateRight[row2] -
    RotateLeft[row2] RotateRight[row1]; (* cross-product *)
  rotation = {row1, row2, row3};
  scale = (scale1 + scale2)/2.0; (* scaling factor in SOP *)
  translation = Append[imagePoints[[1]], focalLength]/scale;
  If[count == 0, firstPose = {rotation, translation}];
  converged = (count > 0) && (imageDifference < 1);
  count++
 ]; (* End While *)
 Return[{firstPose, {rotation, translation}}]]

(* Example of input: *)

fLength = 760;

cube = {{0,0,0},{10,0,0},{10,10,0},{0,10,0},{0,0,10},
   {10,0,10},{10,10,10},{0,10,10}};
cubeMatrix = PseudoInverse[cube]//N;
cubeImage = {{0,0},{80,-93},{245,-77},{185,32},{32,135},
   {99,35},{247,62},{195,179}};

{{POSRot, POSTrans}, {POSITRot, POSITTrans}} =
  GetPOSIT[cubeImage, cube, cubeMatrix, fLength];

References
1. H.S. Baird, "Model-Based Image Matching Using Location", MIT Press, Cambridge, MA,
1985.
2. T.A. Cass, "Feature Matching for Object Localization in the Presence of Uncertainty",
MIT A.I. Memo 1113, May 1990.
3. D. DeMenthon and L.S. Davis, "Model-Based Object Pose in 25 Lines of C o d e ' , Center for
Automation Research Technical Report CAR-TR-599, December 1991.
4. R.M. Haralick, "Performance Characterization in Computer Vision", University of Wash-
ington C.S. Technical Report, July 1991.
5. R. Horand, B. Conio and O. Leboulleux, "An Analytical Solution for the Perspective-4-
Point Problem", Computer Vision, Graphics, and Image Processing, vol. 47, pp. 33-44,
1989.
6. C. Tomasi, "Shape and Motion from Image Streams: A Factorization Method", Technical
Report CMU-CS-91-172, Carnegie Mellon University, September 1991.
7. R.Y. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine
Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE J. Robotics and
Automation, vol. 3, pp. 323-344, 1987.
8. S. Ullman and R. Basri, "Recognition by Linear Combinations of Models", IEEE Trans.
on Pattern Analysis and Machine Intelligence, vol. 13, pp. 992-1006, 1991.
9. J.S.C. Yuan, "A General Photogrammetric Method for Determining Object Position and
Orientation", IEEE Trans. on Robotics and Automation, vol. 5, pp. 129-142, 1989.

Image Blurring Effects Due to Depth Discontinuities:
Blurring that Creates Emergent Image Details*

Thang C. Nguyen and Thomas S. Huang


Email: cthang@uirvld.csl.uiuc.edu, huang@uicsl.csl.uiuc.edu
Beckman Institute and Coordinated Science Laboratory
405 N. Mathews Street, Urbana, IL 61801, USA

Abstract: A new model (called multi-component blurring, or MCB) to account for image
blurring effects due to depth discontinuities is presented. We show that blurring processes operating in
the vicinity of large depth discontinuities can give rise to emergent image details, quite distinguishable
but nevertheless unexplained by previously available blurring models. In other words, the maximum
principle for scale space [Per90] does not hold. It is argued that blurring in high-relief 3-D scenes
should be more accurately modeled as a multi-component process. We present results from extensive
and carefully designed experiments, with many images of real scenes taken by a CCD camera with
typical parameters. These results have consistently supported our new blurring model. Due care was
taken to ensure that the image phenomena observed are mainly due to de-focussing and not due to
mutual illuminations [For89], specularity [Hea87], objects' "finer" structures, coherent diffraction, or
incidental image noise [Gla88]. We also hypothesize on the role of blurring in human depth-from-blur
perception, based on correlation with recent results from human blur perception [Hes89].
Keywords: Multi-component image blurring (MCB), depth-from-blur, point-spread functions
(kernels), incoherent imaging of 3-D scenes, human blur perception, active vision.

1 Introduction
The objectives of this paper are: to present a simplified image blurring model that
is sufficiently general to account for blurring effects due to depth discontinuities, and
to explore some implications for depth-from-blur techniques. See [Ens91], [Gar87],
[Gro87], [Pen87&89], [Sub88a], and many others. Previously, none of those known
depth-from-blur formulations discussed such important cases. We realized that an
accurate model must be composite (i.e. consisting of a possibly unknown number of
sub-processes). The composite nature of blurring due to depth discontinuities gives
rise to the net blurring effects, with new local extrema generated, very much in
discord with commonly employed blurring models.
The organization of this paper is as follows:
Section 2 discusses the radiometry of image formation in the presence of
sharp discontinuities. Only incoherent (or very weakly coherent), polychromatic
lighting was assumed to be present (this was enforced in the experiments), as
is often true in normal, everyday lighting; thus radiometric models approximate
the imaging process adequately. These simple radiometric considerations are then
seen to be capable of predicting blurring instances in which interesting resultant
image structures (emergent details like peaks and valleys) are created. The object-
and-camera configuration used in this analysis is also adapted to the real-scene
experiments presented in section 3.
Section 3 presents the results from extensive experiments with images of realistic
scenes taken with a Cohu-4815 CCD camera with 8-bit accuracy. Temporal averaging
over twenty frames per image was employed to subdue various image noises to
about one gray level in variance. Various settings of camera parameters (focal length
f, aperture D and back-focal distance v) are employed to test the blurring model.
Also, controlled experiments for checking ground truth were performed to ensure
valid interpretation.

* This work is supported by the National Science Foundation under the Creativity in Engineering
Award EID-8811553, and grant IRI-89-02728.
In section 4 we illustrate the implications for depth-from-blur, an active
vision algorithm suitable for close range. We simulated Pentland's "localized
power estimation" algorithm [Pen89] to estimate the blur widths for simple
(single-component) blurring profiles as well as multi-component blurring (MCB)
cases. Finally we discuss implications to the modeling of human monocular depth
perception. This last discussion could suggest further psychophysical investigation.

2 Modeling of Single-component and Multi-component Blurring

2.1 Image blur as a function of depth


Blurring (MCB or otherwise) herein always means de-focussing rather than
diffraction blurring (which does not contribute significantly in our problems). With
a simple blend between geometrical and radiometric optics, the width w_i of a
normalized, positive blurring kernel can be defined as (similar to [Sub88b]):

\langle w_i^2 \rangle = \int_{-\infty}^{+\infty} (x - \bar{x})^2\, K_i(x)\, dx; \qquad \bar{x} = \int_{-\infty}^{+\infty} x\, K_i(x)\, dx    (1)

and this width is linearly related to the blur circle diameter, D|v_i - v_0|/v_0, and
inversely to the depth (distance) u_i:

w_i = \kappa D \frac{|v_i - v_0|}{v_0} = \kappa D \frac{f\,|u_i - u_0|}{u_0\,(u_i - f)}; \qquad \text{from the lens equation:}\quad \frac{1}{f} = \frac{1}{u_i} + \frac{1}{v_i}    (2)

where \kappa is a small constant, f is the focal length, u_i is the distance from the point P_i to the first
principal plane of the lens system, and v_i is the distance from the second principal
plane to the plane of best focus for p_i, the image of point P_i. If u_0 is set at infinity
(farther objects in better focus), then the relation above is simplified to:

w_i = \frac{\kappa D f}{u_i - f}; \qquad \text{when } u_0 \to \infty    (3)

giving only one solution. So, focusing the camera at infinity is desirable to prevent
ambiguities in depth-from-blur. We will assume such a setting henceforth.
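As a quick numerical illustration of equation (3) (a sketch only; the values of kappa, D and f below are placeholders, not the parameters of the experiments reported later):

# Sketch of equation (3): blur width for a camera focused at infinity.
# kappa, D (aperture) and f (focal length) are placeholder values.
def blur_width(u, f=0.084, D=0.006, kappa=1.0):
    """Blur width w_i = kappa * D * f / (u_i - f), valid for u_i > f."""
    return kappa * D * f / (u - f)

for u in (0.5, 1.33, 5.69):            # object distances in meters
    print(u, blur_width(u))            # the width shrinks as the object recedes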

2.2 Image formation across depth discontinuities


Figure 1 represents a simplified model of a camera imaging a sharp edge that
"just cuts" onto the optical axis and stands in front of a uniform background. The
Lambertian assumption, though convenient, is not required; we only require that no specular
reflection and no significant mutual illumination effects (interreflections) are present
in the scene, and that image noise is negligible. Care must be exercised to prevent the
aforementioned effects since they sometimes create spurious image features that
can be confused with multi-component blurring effects [For89], [Hea87]. Note that
a knife-sharp edge is not necessary here (as would be required in monochromatic
coherent diffraction experiments) and the real "edge" used was just a carefully hand-
cut edge (by a sharp blade) out of a high-quality foam-filled cardboard. In fact,
it is our objective to show that MCB effects are detectable in scenes containing
realistic objects.
This is perhaps the most important point of this paper: image blurring near
a depth discontinuity is best analyzed separately for each surface patch at
a different depth. We will concentrate on the cases where one of the blurring processes is
dominant (i.e. having a much larger spread than others) in the image neighborhood.
(For example, in figure 1 the blurring due to E, the edge, is dominant.)
Toward modeling the imaging process of a 3-D scene, [Fri67] found that the
transfer function for a 3-D object cannot be cascaded: for example, when a
3-D object is imaged by a cascade of two lens systems, the transfer function of the
cascade is not the product of each system's transfer function. This is due to the
general 3-D nature of the resulting image (the image of a 3-D object is also a 3-D
distribution of intensity); blurring on the image plane is then the result of projecting
the (3-D) image distribution onto the image plane. However, we have chosen to
model blurring as a two-stage process, as is fairly conventional:
a. Ideal image registration (geometric and radiometric), giving I_0(x), the idealized
unblurred image.
b. Blurring with a blur width depending on u(x), the depth value of the point P
= (X, Y, u(x)) that has its image at x.
For the one-dimensional model in figure 1, we can see that, at each image
coordinate value x_v, the resulting intensity contains the sum of all blurring
(or diffusion) contributions from neighboring image regions (i.e. pixels, etc.), each of
which may have a different blurring kernel. Concisely, then:

I(x) = \sum_{j \in J} I_j(x)    (4)

where j indexes the different image components near x_v.
For our 1-dimensional, 2-component case in particular, let x_v = 0, and let the
ideal image intensity levels from the edge and the background be I_e0 and I_b0 respectively.
The blurred components are I_e(x), I_b(x) such that:
I(x) = I_e(x) + I_b(x) \quad \text{(2 components)}

I_e(x) = \int_{-\infty}^{+\infty} K_e(x, x')\, I_{e0}\, \Sigma(-x')\, dx'    (5)

I_b(x) = \int_{-\infty}^{+\infty} K_b(x, x')\, I_{b0}\, T_b(x')\, dx' \;\approx\; \int_{-\infty}^{+\infty} K_b^*(x, x')\, I_{b0}\, \Sigma(x')\, dx'

where T_b(x), describing the lens occlusion effect due to the edge, is similar to a
smeared step function. Typically the occlusion effect is small, and an ideal step
function \Sigma(x) can be used for T_b(x); alternatively, the background blurring kernel K_b^*(x, x') is
distorted from a simple K_b(x, x'). We will not go into the details of the lens occlusion
effect, which is secondary. See figures 1 and 10.
With the analogy between Gaussian blurring and heat diffusion [Hum85],
[Per90], multi-component Gaussian blurring is analogous to multi-component particle
diffusion, where each type of particle has a different diffusion constant and none
of them react chemically with each other. Note that an analogy with heat diffusion
cannot be made as easily, since temperature is a single entity, unless we distinguish
between different types of heat (due to different causes, and propagating at different
rates, for example).
2.3 Emergence of image details by multi-component blurring effects
Continuing from above, we now show that multi-component blurring can give
rise to new image features (or details), as opposed to the consistent suppression
of details by single-component blurring models. By new image details, we mean
specifically new local extrema, i.e. local peaks and valleys. For ease of blur
width estimation and verification later, we assume here that every component kernel is
a shift-invariant Gaussian, but other unimodal kernels can be used.
The 1-dimensional unblurred image is again taken to be approximately two
disjoint step functions with heights I_e0, I_b0. The blurred components are I_e(x), I_b(x)
respectively.

I(x) = I_e(x) + I_b(x); \qquad I_{\text{unblurred}}(x) = I_{e0}\,\Sigma(-x) + I_{b0}\,\Sigma(x);

I_e(x) = G_e(x) * I_{e0}\,\Sigma(-x); \qquad I_b(x) = G_b(x) * I_{b0}\,\Sigma(x);

\Sigma(u) = \begin{cases} 1, & u \ge 0 \\ 0, & u < 0 \end{cases} \quad \text{(unit step function)}    (6)

where * denotes convolution. Figures 2(a), 2(b) show the two components, and
figures 2(c) and 2(d) show a resulting MCB profile with emergent extrema. Note
that with ideal step functions as shown, a continuous, unimodal single blurring
kernel will not introduce new local extrema. (This is the maximum principle, a main
assumption in the Gaussian scale-space concept [Per90]; MCB, however, does not
obey such restrictions.) At the emergent extremum location x_z, the gradient vanishes:

\frac{\partial}{\partial x}\Big( G_e(x) * I_{e0}\,\Sigma(-x) + G_b(x) * I_{b0}\,\Sigma(x) \Big) = -I_{e0}\,G_e(x) + I_{b0}\,G_b(x) = 0;

giving \quad \frac{I_{b0}}{\sigma_b}\exp\!\left(\frac{-x^2}{2\sigma_b^2}\right) - \frac{I_{e0}}{\sigma_e}\exp\!\left(\frac{-x^2}{2\sigma_e^2}\right) = 0;    (7)

or \quad \ln\!\left(\frac{\sigma_e I_{b0}}{\sigma_b I_{e0}}\right) = \frac{x^2}{2\sigma_b^2} - \frac{x^2}{2\sigma_e^2}.

The resulting conditions and solution for x_z are:

if (\sigma_e > \sigma_b and \sigma_e I_{b0} > \sigma_b I_{e0}) or (\sigma_e < \sigma_b and \sigma_e I_{b0} < \sigma_b I_{e0}),
then \quad x_z^2 = \frac{2\,\sigma_e^2\,\sigma_b^2}{(\sigma_e + \sigma_b)(\sigma_e - \sigma_b)}\,\ln\!\left(\frac{\sigma_e I_{b0}}{\sigma_b I_{e0}}\right).    (8)

Note that the MCB gradient profiles can be quite different from those of SCB
(single-component blurring); see also figures 3(aa), 3(bb), 3(cc) and 3(dd). For some
range of I_e0, I_b0, the MCB gradient is actually a weighted difference of Gaussians, an
interesting fact.
Examples: Figure 3 shows a comparison of multi-component Gaussian blurring
effects (3(a), 3(b), 3(c), 3(d)) to the effects of comparable-width single-component
Gaussian blurring. Only I_b0 is varied, with I_e0 = 190, \sigma_e = 5, \sigma_b = 3. Values for
I_b0 are 210, 190, 150 and 105 respectively. The same set of I_e0, \sigma_e, and I_b0 was
used for single-kernel blurring (which has \sigma_e = \sigma_b = 5). It is quite evident that
multi-component blurring is capable of creating new interesting extrema. Note also
that, even though case 3(a) looks like the Mach-band effect due to human retino-optic
ganglion cells [Lev85] (or, in image processing, edge-enhancement schemes using
filters similar to Laplacian-of-Gaussian kernels), MCB effects are not the result of any
purposive image processing. We are talking of images as registered on the camera
imaging sensor plane.
And there could be no confusion with Mach-band or edge-enhancement
processing in cases like figures 3(b) and 3(c), because the "peak" can occur well
below the "brighter" level (into the "darker" side, as long as the "dark" side is not
too dark). Specifically, with the given image parameters, for new extrema to be created,
I_b0 must satisfy:

I_{b0} > \frac{\sigma_b}{\sigma_e}\, I_{e0} = 114    (9)

which is larger than 105 (the value of I_b0, the right image component, in figure 3(d)).
Hence in figure 3(d), no extrema emerged.
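The construction above is easy to check numerically. The sketch below (our Python illustration, not part of the original paper; the grid spacing is arbitrary) synthesizes the two-component profile of equation (6) with the values of figure 3(b) (I_e0 = I_b0 = 190, sigma_e = 5, sigma_b = 3) and compares the location of the emergent peak with the closed form of equation (8):

import numpy as np

# Two-component (MCB) profile: each half-step is blurred with its own Gaussian.
Ie0, Ib0, sig_e, sig_b = 190.0, 190.0, 5.0, 3.0
x = np.arange(-30.0, 30.0, 0.01)
dx = x[1] - x[0]

def gauss(u, s):
    return np.exp(-u**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

step_b = (x >= 0).astype(float)    # Sigma(x): background side
step_e = (x < 0).astype(float)     # Sigma(-x): edge side
Ie = np.convolve(Ie0 * step_e, gauss(x, sig_e), mode="same") * dx
Ib = np.convolve(Ib0 * step_b, gauss(x, sig_b), mode="same") * dx
I = Ie + Ib                         # the MCB profile; it overshoots above 190

# Emergent extremum found numerically vs. the closed form of equation (8).
x_num = x[np.argmax(I)]
x_z = np.sqrt(2 * sig_e**2 * sig_b**2 / ((sig_e + sig_b) * (sig_e - sig_b))
              * np.log(sig_e * Ib0 / (sig_b * Ie0)))
print(x_num, x_z)                   # both close to 3.8 pixels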
Physical limitations such as blooming and smear of the imaging sensor elements
(pixels) [TI86], by the mechanism of charge spilling between adjacent pixels, also help
to blur the intensity difference between neighboring pixels, thus softening MCB features
somewhat. The net effect is that the local extrema created by MCB are most detectable for
some range of I_e0/I_b0, with upper limits dictated by CCD sensor characteristics,
and lower limits at least as high as given by equation (8). This suggests that, unlike
usual blurring effects, MCB effects are more detectable at lower local contrast, a
rather surprising prediction that was actually observed in real images and has
possible implications for human perception. See figures 11, 12, 13, 14, and especially
figure 18.
Let us try to see what it takes for a single convolution kernel to describe well
the blurring effects shown. The resulting kernel K_composite(x, x') is given by:

K_{\text{composite}}(x, x') = \begin{cases} G_b(x - x'), & x' > 0 \\ G_e(x - x'), & x' < 0 \end{cases}    (10)

which looks innocuously simple, until we see some sample plots of it in figure 4.
As seen, with \sigma_e = 5, \sigma_b = 3, K_composite(x, x') is neither Gaussian (it is a patching
of two truncated Gaussian segments), nor shift-invariant, and not even continuous at
x' = 0 (the blurring interface). These characteristics are more pronounced for larger ratios
between the blur widths \sigma_e, \sigma_b, and for smaller absolute values of x_z. MCB blurring
can be very complex to estimate, because even for the simpler case of shift-invariant
single Gaussian blurring (directly analogous to heat diffusion) we cannot get an exact
inverse solution (i.e. for deblurring or estimation of the blur width) [Hum85].
Note that even with an anisotropic diffusion (blurring) model [Per90], new details
(new local extrema) cannot be created (by the maximum principle); only some
existing details can be preserved and possibly enhanced (i.e. sharpened).
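A tiny sketch of equation (10) (our Python illustration; sigma_e and sigma_b as in figure 3) makes the discontinuity at the blurring interface explicit:

import numpy as np

# Composite kernel of equation (10): a shift-variant patching of two truncated
# Gaussians, discontinuous across the blurring interface x' = 0.
def gauss(u, s):
    return np.exp(-u**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def K_composite(x, xp, sig_e=5.0, sig_b=3.0):
    return gauss(x - xp, sig_b) if xp > 0 else gauss(x - xp, sig_e)

# The jump at x' = 0 shows the kernel is not a single shift-invariant Gaussian.
print(K_composite(2.0, 1e-6), K_composite(2.0, -1e-6))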

3 Study on Real Camera Images


From the above model for MCB, we set out to experiment with real images
to test the hypothesis that MCB effects do exist and can be detected in images of
realistic scenes.

3.1 Experimental setup
The setup is quite similar to the imaging model in figure 1. Distances from the
camera are: 1.33 meters to the edge of board E, and 5.69 meters to the 3 cardboards
that served as background B on the wall. The three background boards have slightly
different reflectivities, thus enabling convenient investigation of the MCB effect due to local
contrast (see figure 3 and figures 13, 14). To make sure that phenomena
other than MCB blurring (de-focussing) were excluded from registering onto the
images, we have insisted that: [Ngu90a]

a. No specular reflections were present on or nearby the visible surfaces in the scene.
b. No shadowing of the background patch B by the edge E.
c. No interreflections (or mutual illuminations) between the two. Interreflections
(mutual illuminations) between edge E and background B can give spurious
details (local extrema) rather easy to confuse with MCB effects. See [For89].
d. Illumination had low partial coherence. See [Gla88].
e. Image noise was reduced to less than about 1 gray level in variance, by temporal
averaging of each image over 20 frames. This is also good for suppressing any
noise due to neon flicker (a minimal averaging sketch follows this list).
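As a minimal sketch of the temporal averaging in item (e) (our Python illustration; the frame source and noise level are made up), averaging N frames of independent zero-mean noise reduces the variance by roughly a factor of N:

import numpy as np

def average_frames(frames):
    """frames: iterable of equally sized 2-D arrays (one per video frame)."""
    frames = np.stack(list(frames)).astype(np.float64)
    return frames.mean(axis=0)

# Synthetic frames with per-pixel noise variance of about 20 gray levels.
noisy = [100 + np.random.randn(64, 64) * 4.5 for _ in range(20)]
print(np.var(average_frames(noisy)))   # drops to about 1 gray level in variance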
3.2 Image data
Since a work of this nature must be extensively tested with carefully controlled
experiments, we have performed extensive experiments (over 300 image frames taken
for tens of scene set-ups) with consistent results. Here we include three typical sets
of images and their video scan lines for further discussion. Note that all middle scan
lines go through the medium-sized background cardboard.
- Set {M} (figures 5 through 8) contains M0, an image of the overall scene; M1 and
M3, two images of the background (three patches) B (one close-up and one
distant); and M2, a close-up image of edge E. This set serves to check for
uniformity of B and E both separately and together. Note especially the "edge
sharpness" and surface smoothness of the edge E.
- Set {N} contains N1 (figure 9) and N2 (figure 10). The parameter sets for them are
back-focal distance, aperture diameter, and focal length, respectively (v, D, f):
  - N1 taken with (v, D, f) = (6375 mf, 7420 ma, 8760 mz), or (87 mm, 4 mm, 84 mm).
  - N2 taken with (v, D, f) = (6375 mf, 9450 ma, 8760 mz), or (87 mm, 6 mm, 84 mm).
All parameters are expressed in machine units corresponding to the zoom
lens digital controller readout: focus (mf), aperture (ma) and zoom (mz).
The corresponding physical values of (v, D, f) are believed to be accurate
only to within 5 percent, due to the lack of precise radiometric calibration for the aperture
(which is a complex entity for any zoom lens).
- Set {P} has P1 (figure 11) and P2 (figure 12), showing the MCB effects when
camera parameters are fixed but scene lighting is changed non-uniformly (so that
local contrast can be controlled). Both were taken with (v, D, f) = (6375
mf, 9450 ma, 4200 mz), or (48 mm, 6 mm, 46 mm), but P2 with a reduction
in foreground lighting (which illuminates the edge E); this did not affect
background lighting significantly, since the whole room was lit with 44 neon tubes
and only 2 small lamps (~100 watts each) were used for independent illumination
of E.

To estimate independently the blurring widths of the background and the front
edge (so that we can compare the MCB model with real image blurring effects due to
depth discontinuity), we followed the simple method of Subbarao [Sub88b]. The blur
widths (\sigma_e, \sigma_b) estimated in (horizontal) pixels were found as follows:

a. For N1, approximately (3.23, 2.62)
b. For N2, approximately (3.92, 3.45*) (a better fit is 3.0, due to lens occlusion)
c. For P1, P2, approximately (2.08, 1.66)

Accounting also for video digitizer resampling, the effective pixel size is
approximately 16.5 um (horizontal) by 13.5 um (vertical).

3.3 Interpretations

Refer to figures 9 through 14. All images are originally 512x512 pixels, but
only the central 500x420 image portion is shown; image coordinates (x, y) denote the
original column and row indices, left to right and top to bottom. Analyses are done
on horizontal slices at y = 270, called middle slices. The point x = 243 on all slices
is approximately at "the interface" (corresponding to x = 0 in figure 1) between the
image regions of the background {x > 243} and the edge {x <= 243}.
The middle slices for the "ground-truth" images M0, M1, M2, M3 (controlled
set), included with the images (figures 5 to 8), show negligible MCB effects. They
reveal nothing very interesting on the background surface, nor across the depth
discontinuity (figures M0 and M2). Even right at the edge in image M2, one can
only see a small dip in intensity, mainly due to the remaining small roughness of the
hand-cut (which absorbed and scattered light a little more). However, the thin-lined
curve in figure 5, which is the middle slice of image M0* (taken with the same focal length
as for M0, but with the back-focus set so that edge E is blurred), demonstrates significant
MCB blurring. However, M0 itself (dark dots) shows no such interesting feature.
Middle slices for images N1 and N2 (figures 9 and 10) reveal MCB effects with
rather broad spatial extents, again near x = 243. For this image pair N1 and N2, since
the intensity ratio I_e/I_b is approximately unity (very low local contrast), the MCB
effects are controlled by w_b and w_e. Note also the persistence of MCB effects even
with reduced aperture: the overall intensities in N1 are lower, but the "MCB details" are still
very pronounced. Compare these image slices to figures 3(a) through 3(d). Image
N2 shows the effects of aperture occlusion, that is, the best-fitting w_b value of 3.0 (for the
background, x > 243) is significantly smaller than the unoccluded background blur
width w_b (about 3.45 pixels, see section 3.2 above).
Middle slices of P1 and P2 (figures 11 and 12, whose close-ups are figures 13
and 14) illustrate the detectability of MCB effects as a function of local intensity
contrast I_e/I_b; see also section 2.2. That is, when I_e/I_b is closer to unity (lower local
contrast), MCB effects are more pronounced. This is also suggested by comparing
the slices (y = 86) as well as (y = 270) of P1, P2: reduced I_e reveals the "MCB
spike" unseen with brighter foreground (and hence higher local contrast)! This could
imply that human depth perception may be enhanced naturally by MCB effects in
low-contrast, large depth-range scenes. Section 4.2 discusses this point.

4 Some Implications from the MCB Blurring Effects


We would like to discuss the MCB effects on depth-from-focus, and also touch briefly on
some recent results on human blur perception, which seem to support our speculation
that human depth perception could be enhanced in low-contrast, large depth-variation
settings, due to MCB effects being detectable. [Hes89] [Ngu90a&b]
4.1 MCB blurring and depth-from-blur (or depth-from-focus)
Both Pentland [Pen87&89] and Subbarao [Sub88a], among others, have worked on
local blur estimation as an approach to 3-D perception, with considerable
success, especially the real-time implementation by Pentland, which was up to a
few hundred times as fast as correspondence-based stereo, making the approach
rather attractive in some cases [Pen89]. We particularly pay attention to Pentland's
simple "local power estimator" method, which is fast and reasonably accurate for
single-component blurring cases. The more careful matrix formulation in [Ens91]
improved depth-from-blur accuracy incrementally, possibly the best so far, but did
not account for MCB either. Also, even though the depth-from-best-focus approach,
such as Krotkov's [Kro89] and others, is different from the depth-from-blur approach,
the following analyses have important implications for both, while we discuss only
the latter.
We also present here, however, two sets of simulated MCB data that do not
follow these researchers' models of local blurring. The simulated data are in fact very
similar to the real image scan lines obtained and discussed in section 3. We show that
Pentland's "power measure" can in fact increase with increasing blur widths in many
cases of MCB blurring. In other words, Pentland's method (as well as the other methods
mentioned above) fails to measure MCB blur.
In a nutshell, Pentland's approach can be summarized as: given a sharp image
I_sharp and a blurred image I_blur (blurred by \sigma_blur, with \sigma_blur > \sigma_sharp), one can take
localized power estimates F_sharp(\lambda) and F_blur(\lambda) for two corresponding image patches.
Then, utilizing the relation [Pen89]

k_1\,\sigma_{\text{blur}}^2 + k_2\,\ln(\sigma_{\text{blur}}) + k_3 \;\approx\; \ln\!\big(F_{\text{sharp}}(\lambda)\big) - \ln\!\big(F_{\text{blur}}(\lambda)\big)    (11)

one can estimate \sigma_blur given \sigma_sharp.
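For concreteness, the sketch below is our Python illustration of a localized power estimate in the spirit of the method described above; the Laplacian-of-Gaussian mask, window size and scales are placeholder choices, not Pentland's actual parameters. For single-component blurring the measured power at the edge decreases with blur width, as in figure 15(b).

import numpy as np

def log_mask(sigma, half=8):
    # Unnormalized 1-D Laplacian-of-Gaussian (second derivative of a Gaussian).
    u = np.arange(-half, half + 1, dtype=float)
    g = np.exp(-u**2 / (2 * sigma**2))
    return (u**2 / sigma**4 - 1.0 / sigma**2) * g

def local_power(signal, sigma=2.0, half=8):
    # Band-pass the signal, square it, and sum it in a local Gaussian window.
    resp = np.convolve(signal, log_mask(sigma, half), mode="same") ** 2
    w = np.exp(-np.arange(-half, half + 1, dtype=float)**2 / (2 * sigma**2))
    return np.convolve(resp, w / w.sum(), mode="same")

def blurred_step(sigma, n=128):
    x = np.arange(n) - n / 2
    g = np.exp(-x**2 / (2 * sigma**2)); g /= g.sum()
    return np.convolve((x >= 0).astype(float), g, mode="same")

# Single-component blurring: the power at the edge dies off with blur width.
for s in (1, 3, 6):
    print(s, local_power(blurred_step(s))[64])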


We take as our typical scene a step edge blurred by a small Gaussian of width
\sigma_sharp = 1, with \sigma_blur varying from 1 to 6. For clarity, only the images with \sigma_blur = {1, 3, 6}
are shown in figures 15(a), 16(a) and 17(a). Figure 15 illustrates the case of single-
component blurring. Pentland's power estimator applied to such a step edge with
widths \sigma_blur = {1, 2, 3, 4, 5, 6} gives the power estimates of figure 15(b). We then take the
difference of the logarithms, or the logarithm of the ratio, of the power estimates. Points
giving power ratios smaller than 1 are discarded. For single Gaussian blurring, the
power responses die off monotonically with increasing blur (figure 15(b)), giving a
monotonically increasing power difference (figure 15(c)). For the Pentland estimator
shown in figure 15(c), median fitting was used for general robustness, though a more
refined approach can be used. Only a mask size of 8 is used for the Laplacian-of-
Gaussian and the Gaussian windows here, but similar results are obtained for larger
windows.
Figures 16(b) and 17(b) show the results of trying to measure the "local power"
of MCB edges that have \sigma = 1 fixed on one side (the left side in figure 16 and the right
side in figure 17), and the other side blurred with \sigma_blur = {1, ..., 6}. Note the two
different manifestations of MCB blurring. Power measures for the image set in figure
16 mostly increase with larger blur widths, except perhaps for a small range around
\sigma_blur/\sigma_sharp < 3. Consequently, Pentland's model cannot be applied for reliable
determination of \sigma_blur from these "power data".
Figure 17 gives not even a single case of a valid power difference measure. This is
because for all \sigma_blur = {1, ..., 6}, the "image power" consistently increases with blur
width, completely opposite to the SCB case in figure 15(b). That is, the more blurring
occurred, the higher the power measure. This last data set, as well as most of those
from figure 16, defies any "local power estimation" approach, due to emergent high
frequencies. We believe a gradient-based approach to be more viable.

4.2 MCB blurring effects and human blur perception
During the work done in 1989 and published in [Ngu90b], we had speculated that
MCB effects could play some important role in human visual perception, especially
depth perception at low local contrast. This is a hypothesis that arose naturally from
the observations in section 3.3 on the characteristics of the MCB effects (emergent
extrema). However, we had been unaware of any psychophysical data in favor of our
hypothesis until recently, when we found a paper by Hess [Hes89], who argued:
a. that human blur discrimination (between blurred edges slightly differing in blur
extent) may actually rely more on low-frequency information, rather than high-
frequency, in the vicinity of the blur edge transition;
b. that discrimination is consistently enhanced if one of the blurred edges is pre-
processed so as to give an effect similar to MCB effects (he called it phase-shifted
processing instead), that is, very similar to figures 3(d), 13(a), and 14(a). For
comparison, see figure 18, which contains our reproduction of his figures 10 and
11 in [Hes89].
The above conclusions came from Hess's study on blur discrimination without
any depth information: human subjects looked at computer-generated 2-D intensity
profiles on a screen [Wat83]. However, conclusion (b) above is very favorable in
support of our hypothesis, which also involves depth. We strongly believe that further
investigation into human perception of blurring effects due to depth discontinuities
could provide yet more clues into the workings of human visual functions.

5 Discussions and Conclusions

In this paper, we have analyzed mainly the forward problem of multi-component
blurring (MCB), discussed possible implications, and suggested that a gradient-based
approach to the inverse problem could be promising. To summarize, we have:
- presented a simple but sufficiently accurate multi-component blurring model to
describe blurring effects due to large depth discontinuities. Our model with aperture
occlusion (section 2.2) is more general than a computer graphics (ray-tracing)
model by Chen [Che88]. Due to space limitations we have restricted experimental
verification to 1-D profiles.
- illustrated that current depth-from-blur algorithms could fail when significant
MCB effects are present. Effects due to MCB blurring seem to have been ignored,
or treated mistakenly like noise, by previous depth-from-focus algorithms [Pen89,
Sub88, Ens91], which would give inaccurate depth estimates (averaging of estimates
mainly serves to redistribute errors) and unknowingly discard valuable depth
information in MCB features.
- raised an interesting speculation that MCB effects could play an important
role in human depth perception, especially if the scene has low texture, low local
contrast and large depth discontinuities. While we are not aware of any depth-
from-blur experiment with human perception, we can point out some important
recent results in human (2-D) blur perception [Hes89] that correlate well with the
MCB effects presented here. Finally, although MCB effects are definitely not due
to the Mach-band illusion, the similarity between Mach-band and MCB effects in some
cases could have led people to overlook the MCB effect in real images (thinking
Mach-band effects were at work). See [Lev85].

References
[Che88] Chen, Y. C., "Synthetic Image Generation for Highly Defocused Scenes",
Recent Advances in Computer Graphics, Springer-Verlag, 1988, pp. 117-125.
[Ens91] Ens, J., and Lawrence, P., "A Matrix Based Method for Determining
Depth from Focus", Proc. Computer Vision and Pattern Recognition 1991, pp.
600-606.
[For89] Forsyth, D., and Zisserman, A., "Mutual Illuminations", Proc. Computer
Vision and Pattern Recognition, 1989, California, USA, pp. 466-473.
[Fri67] Frieden, B., "Optical Transfer of Three Dimensional Object", Journal of
the Optical Society of America, Vol. 57, No. 1, 1967, pp. 56-66.
[Gar87] Garibotto, G. and Storace, P., "3-D Range Estimate from the Focus
Sharpness of Edges", Proc. of the 4th Intl. Conf. on Image Analysis and Processing
(1987), Palermo, Italy, Vol. 2, pp. 321-328.
[Gha78] Ghatak, A. and Thyagarajan, K., Contemporary Optics, Plenum Press,
New York, 1978.
[Gla88] Glasser, J., Vaillant, J., Chazallet, F., "An Accurate Method for
Measuring the Spatial Resolution of Integrated Image Sensor", Proc. SPIE Vol.
1027 Image Processing II, 1988, pp. 40-47.
[Gro87] Grossman, P., "Depth from Focus", Pattern Recognition Letters, 5,
1987, pp. 63-69.
[Hea87] Healey, G. and Binford, T., "Local Shape from Specularity", Proc.
of the 1st Intl. Conf. on Computer Vision (ICCV'87), London, UK, (1987), pp.
151-160.
[Hes89] Hess, R. F., Pointer, J. S., and R. J. Watt, "How are spatial filters used
in fovea and parafovea?", Journal of the Optical Society of America, A/Vol. 6, No.
2, Feb. 1989, pp. 329-339.
[Hum85] Hummel, R., Kimia, B. and Zucker, S., "Gaussian Blur and the Heat
Equation: Forward and Inverse Solution", Proc. Computer Vision and Pattern
Recognition, 1985, pp. 668-671.
[Kro89] Krotkov, E. P., Active Computer Vision by Cooperative Focus and Stereo,
Springer-Verlag, 1989, pp. 19-41.
[Lev85] Levine, M., Vision in Man and Machine, McGraw-Hill, 1985, pp.
220-224.

[Ngu90a] Nguyen, T. C., and Huang, T. S., "Image Blurring Effects Due to Depth
Discontinuities", Technical Note ISP-1080, University of Illinois, May 1990.
[Ngu90b] Nguyen, T. C., and Huang, T. S., "Image Blurring Effects Due to
Depth Discontinuities", Proc. Image Understanding Workshop, 1990, pp. 174-178.
[Per90] Perona, P. and Malik, J., "Scale-space and Edge Detection using
Anisotropic Diffusion", IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-12, No. 7, July 1990, pp. 629-639.
[Pen87] Pentland, A., "A New Sense for Depth of Field", IEEE Trans. on Pattern
Recognition and Machine Intelligence, Vol. PAMI-9, No. 4 (1987), pp. 523-531.
[Pen89] Pentland, A., Darrell, T., Turk, M., and Huang, W., "A Simple, Real-
time Range Camera", Proc. Computer Vision and Pattern Recognition, 1989, pp.
256-261.
[Sub88a] Subbarao, M., "Parallel Depth Recovery by Changing Camera
Parameters", Proc. of the 2nd Intl. Conf. on Computer Vision, 1988, pp. 149-155.
[Sub88b] Subbarao, M., "Parallel Depth Recovery from Blurred Edges", Proc.
Computer Vision and Pattern Recognition, Ann Arbor, June 1988, pp. 498-503.
[TI86] Texas Instruments Inc., Advanced Information Document for TI Imaging
Sensor TC241, Texas, August 1986.
[Wat83] Watt, R. J., and Morgan M. J., "The Recognition and Representation
of Edge Blur: Evidence for Spatial Primitives in Human Vision", Vision Research,
Vol. 23, No. 12, 1983, pp. 1465-1477.


Figure 1. Imaging a sharp discontinuity: model for the experimental set-up. The camera is
focused at infinity. The dark segments near points e, b1, b2 illustrate relative sizes
of blurring widths. T(x) describes the lens occlusion effects (see figure 10, image
N2) for some background points like point B2. If background B is very far, T(x)
can be replaced by a simple on-off model (hence the simple step image profile).


(a) Left signal (b) Right signal (c) MCB: (a) + (b) (d) close-up of (c)
Figure 2. Example of a 1-D step edge with a two-component MCB. \sigma_left = 5,
while \sigma_right = 3. The kernels are all Gaussian. Figure 2(d) compares the
resultant MCB profile with a single-component Gaussian blurring with \sigma = 5.


Figure 3. (a), (b), (c), (d) show Gaussian MCB and SCB (dashed)
profiles in comparison. For the MCB case, \sigma_left = 5, \sigma_right = 3, as in
figure 2. Figures 3(aa), (bb), (cc), (dd) show the corresponding MCB
and SCB gradient profiles in comparison. Discussion in section 2.2.

Figure 4. Two views of the composite blurring kernel K_composite(x, x')
for figures 2 and 3 above. See equation (10) in section 2.3.

Fig. 5 Image M0: setup (Section 3.3). Fig. 6 Image M1: background patches.
Fig. 7 Image M2: close-up of the edge. Fig. 8 Image M3: close-up of background.
Fig. 9 Image N1: low local contrast. Fig. 10 Image N2: lens occlusion effect.
Fig. 11 Image P1 with rows 86 and 270. Fig. 12 Image P2 with rows 86 and 270.

Fig. 13(a) Close-up of row 86 of P1. Fig. 14(a) Close-up of row 86 of P2.
Fig. 13(b) Close-up of row 270 of P1. Fig. 14(b) Close-up of row 270 of P2.
Fig. 15(a), 16(a), 17(a): blurred step-edge profiles. Fig. 15(b), 16(b), 17(b): power measures. Fig. 15(c): Pentland's estimator.
For the MCB cases in the middle column (figure 16), Pentland's estimator is of very limited
use (the bottom row in the middle column is the only valid "power image difference",
between \sigma_right = 1 and \sigma_right = 2). For the MCB cases in the last column (figure 17),
Pentland's method is inapplicable; no estimator is shown for 16(b), 17(b). See section 4.1. Note the different power scales.

[Reproduction of two figures from Hess et al., J. Opt. Soc. Am. A, Vol. 6, No. 2, February 1989:]
Fig. 10. Top row, two just-discriminable edges and their spatial difference. Bottom row, consequences of inverting the phase of this difference.
Fig. 11. Psychometric functions for blur discrimination between stimuli that have the same difference in their amplitude spectrum but with
different phase spectra. Open symbols represent discriminations between the same edge stimuli discussed. Filled symbols represent
discriminations between phase-shifted versions of the original stimuli (see the text). Results are given for a range of pedestal blurs and
eccentricities. Note that the abscissa is a logarithmic scale, so that the factor by which discrimination is affected by this maneuver can be
gauged. Solid curves are best fits to the data, obtained by using probit analysis. Discrimination between phase-shifted versions of the original
stimuli is better by a factor of 2. This argues against a frequency filter code and argues for a space filter code (see the text).

Figure 18. Our reproduction of Hess and Pointer's results on blur discrimination.
The phase-shifted processing, second row of his figure 10, consistently enhanced
human blur discrimination. Compare his figure 10 with our figures 3(d), 13(a), 14(a).
Ellipse based stereo vision 1
Johannes Buurman

Pattern Recognition Group, Faculty of Applied Physics, Delft University of Technology,


Delft, the Netherlands.

Abstract. We propose a new stereo vision algorithm for finding circles in a scene. In
both 2-D images, ellipses are found. The ellipses are matched in order to find circles
in 3-D space. The method does not require a special camera alignment, instead both
camera matrices must be known. Some results are presented, showing that the
method is sufficiently fast and accurate for object recognition. After edge detection, a
few seconds of CPU time are sufficient to find full circles with standard deviations of
the order of 1-2% of the radius of the circles.
1. Introduction
We propose to use ellipses as well as straight lines as primitives for stereo vision, resulting
in circles and straight lines in our 3D description. The set of 3-D circles is a significant
extension of the traditional, straight line based approach, while the number of parameters is
still kept small enough to allow meaningful estimation. In this paper, we will concentrate on
the ellipse-based stereo algorithm. Some results will be presented.
For an overview of the relevant literature, see [Bu2]. There is very little reference to stereo
vision based on parametrized curves of higher order than a straight line. We have only found
[PP1], where it is mentioned without detail, and [RG1] which is a predecessor to our work.
The reason for this may be that it is mainly attractive when the curve considered is really a
primitive of the scene. Such is the case in our work, where cylindrical objects must be recog-
nized.
2. Problem Description
When a circle is seen under perspective projection, the result is an ellipse. Projecting lines
from the focal point through each point on the ellipse back into 3-D space yields a "cone"
with an elliptical section. Stereo vision results in one ellipse in each image. However, com-
puting the section of the "cones" does not lead to a single circle. If the ellipses found were
exactly correct, we would find the two different interpretations that are possible: the circle
and a very eccentric ellipse, see Fig. 1.

Fig. 1. Two interpretations of two ellipses in stereo vision. Top: the circle, bottom: an eccentric ellipse.
Fig. 2. 100 points on a typical section of inexact "cones". Both the circle and the ellipse are still visible.

1. This work has been conducted as part of the Delft Intelligent Assembly Cell project, partially
sponsored by SPIN/FLAIR.

In general however, the two 2-D ellipses will not be exact, and the axes of the two "cones"
will not meet in one point. The section of the two cones will then look like the example in
Fig. 2., a single curve that is not planar, but instead switches between the circle and the
eccentric ellipse. Parts of the circle and the ellipse can still be distinguished. Of course, if
two entirely unrelated ellipses are matched, the section of the "cones" may well be empty.
So, the problem of finding 3-D circles through stereo vision can be split into three steps:
- Finding ellipses in 2-D images.
- Computing the section of the corresponding "cones".
- Identifying the circle in each of these sections.
In this section, we have written "cones" to denote that the set of all lines through one point
and an ellipse does not, in general, have a circular section. In the remainder of this paper, we
will just use the word cone assuming that the reader is aware of this.
3. Overview of the method
An overview of the method is shown in Fig. 3. Both images are treated identically by the
preprocessing steps: edge detection is applied and the detected edges are stored as chain-
codes representing one-pixel-thick lines (for the full procedure see [Bu1]). From these code
strings, candidate elliptical edges are selected, to which an ellipse fit algorithm is applied.
The output of this algorithm is a set of ellipse equations.

[Block diagram: for each image, edge detection and segmentation, then ellipse selection and ellipse fitting; the resulting ellipse and cone equations are passed to the stereo matching stage, which outputs circles.]
Fig. 3. Overview of the ellipse stereo algorithm.

One of the sets of ellipse equations is first converted to a set of cone equations using the
appropriate camera matrix. Then, this set of cones and the other set of ellipse equations with
its camera matrix are passed to the stereo algorithm, which outputs a set of 3-D circles.
4. Ellipse selection and fitting
In order to determine which ellipses are present in each image, we resort to fitting, but this
requires that the set of points belonging to one ellipse is identified first. We need to split up
strings of chaincodes until each string contains only one ellipse. Accuracy of the fit
requires that these strings are as large as possible. Splitting based on the derivative of the
curvature at each part of the string does not work very well, because of the eccentricity of
some ellipses. Instead, we apply a simple grammar to the strings and split where the gram-
mar no longer applies. This algorithm is simple and fast, and works as long as the smoothing
effect of the edge detector is sufficient to deal with most of the noise on each ellipse.
Let c_i be the Freeman code for direction i, and c_{i+1} (c_{i-1}) be the code for the next direction
(counter-)clockwise. If we use c* to denote zero or more, and c+ to denote one or more
occurrences of c, a string belonging to an ellipse can either be written as:

(c_i^+\, c_{i-1}^+)^*\; c_i^*\; (c_{i+1}^+\, c_i^+)^*\; c_{i+1}^*\; (c_{i+2}^+\, c_{i+1}^+)^* \cdots    (1)

or

(c_i^+\, c_{i+1}^+)^*\; c_i^*\; (c_{i-1}^+\, c_i^+)^*\; c_{i-1}^*\; (c_{i-2}^+\, c_{i-1}^+)^* \cdots    (2)


Our splitting algorithm tries both grammars (1) and (2) to find out how far from the start of
the string they apply, and splits the string after the longest of these substrings. This is done
for each string of chaincodes, repeatedly if necessary.
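A simplified sketch of the splitting idea (our Python illustration; it checks only that two adjacent Freeman directions appear at a time and that the direction advances monotonically clockwise, a loose stand-in for grammars (1)-(2), not the authors' exact parser):

def prefix_length_clockwise(codes):
    if not codes:
        return 0
    lo = codes[0]                      # current "base" direction c_i
    for k, c in enumerate(codes):
        if c == lo or c == (lo - 1) % 8:
            continue                   # still inside (c_i+ c_{i-1}+)* c_i*
        if c == (lo + 1) % 8:
            lo = c                     # direction advanced to c_{i+1}
            continue
        return k                       # grammar no longer applies: split here
    return len(codes)

print(prefix_length_clockwise([2, 2, 3, 2, 3, 3, 4, 3, 4, 4, 0]))  # -> 10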
To each of the remaining strings of chaincodes, the fitting algorithm is applied. Let \alpha_1, ..., \alpha_5 be
the parameters of the ellipse. Then, following [FS1], we minimize:

\min_{\alpha_1,\ldots,\alpha_5} \sum \left( x_1^2 + \alpha_1 x_1 x_2 + \alpha_2 x_2^2 + \alpha_3 x_1 + \alpha_4 x_2 + \alpha_5 \right)^2    (3)

which, although somewhat biased towards eccentric ellipses, is a good and efficient esti-
mator of the ellipse parameters. If the minimal error per pixel is below a certain threshold, and
the set of parameters that minimizes it corresponds to an ellipse (and not another conic sec-
tion), we accept the ellipse.
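Equation (3) is a linear least-squares problem once the coefficient of x_1^2 is fixed to 1. The sketch below (our Python illustration, not [FS1]'s code; the synthetic points and noise level are arbitrary) solves it directly:

import numpy as np

def fit_ellipse(x1, x2):
    # Solve A.alpha ~ -x1^2 in the least-squares sense (equation (3)).
    A = np.column_stack([x1 * x2, x2**2, x1, x2, np.ones_like(x1)])
    b = -x1**2
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = x1**2 + A @ alpha
    # The fit is an ellipse (not another conic) when alpha[0]**2 - 4*alpha[1] < 0.
    return alpha, np.mean(residual**2)

t = np.linspace(0, 2 * np.pi, 200)
x1, x2 = 40 + 30 * np.cos(t), 25 + 12 * np.sin(t)   # synthetic ellipse points
alpha, err = fit_ellipse(x1 + np.random.randn(200) * 0.2,
                         x2 + np.random.randn(200) * 0.2)
print(alpha, err)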
As a final step, we go through the set of ellipses found to see whether two ellipses can be
merged, keeping the average error over the union of the pixels below the threshold. This is
done repeatedly until no more mergers are possible.
5. Ellipse-based stereo
Our method is general in that it does not require the two cameras to be parallel. Rather, it
assumes that each camera can be described by its mapping of world coordinates y_i to camera
coordinates x_i (in homogeneous coordinates, multiplication with a matrix C):

[x_1 s \;\; x_2 s \;\; s]^T = C\, [y_1 \;\; y_2 \;\; y_3 \;\; 1]^T,    (4)

and that both camera matrices C_L and C_R are known. Using (4), the set of all world points
y that correspond to a given camera point (x_1, x_2) is the line described by the two equations

(C_1 - x_1 C_3)\cdot y = 0 \quad\text{and}\quad (C_2 - x_2 C_3)\cdot y = 0    (5)

where y is the column vector (y_1, y_2, y_3, 1)^T and C_1, C_2 and C_3 are the rows of C; see e.g.
[Ho1]. The line corresponding to (5) can also be written in parametric form, as is shown for
instance in [BB1]. In [Bu2], we derive how an ellipse cone equation can be obtained of the
form

(C_3 \cdot y)^2 = (t_1 \cdot y)^2 + (t_2 \cdot y)^2,    (6)

where y is the coordinate vector in homogeneous coordinates.
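A minimal sketch of equation (5) in Python (our illustration; the camera matrix below is a toy placeholder): the two plane equations are stacked and their common null space gives a point on the back-projected line and its direction.

import numpy as np

def backprojection_line(C, x1, x2):
    A = np.vstack([C[0] - x1 * C[2],       # (C1 - x1 C3) . y = 0
                   C[1] - x2 * C[2]])      # (C2 - x2 C3) . y = 0
    _, _, Vt = np.linalg.svd(A)
    n1, n2 = Vt[-1], Vt[-2]                # 2-D null space of A: the line
    point = n1 / n1[3] if abs(n1[3]) > 1e-9 else n2 / n2[3]
    direction = n1 * n2[3] - n2 * n1[3]    # null combination with 4th coord = 0
    return point[:3], direction[:3]

C = np.hstack([np.eye(3), np.array([[0.0], [0.0], [1.0]])])  # toy camera matrix
print(backprojection_line(C, 0.2, -0.1))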


Let E_L be the equation belonging to the left ellipse and camera matrix. We can obtain a
number of points on the right ellipse, and, by solving E_L with (5) and (6) for these 2-D
points, obtain a set of 3-D points P_i on the section. An example of such a set is shown in Fig.
2. Of course, it is possible that such a set cannot be found, in which case we decide that
the two ellipses do not match.
Subsequently, we try to identify planes in the point set. We do that by moving a window of
width n over the set of points, trying to fit a plane to P_i, P_{i+1}, ..., P_{i+n-1}. If one or two such
planes can be found, we try to fit a circle to all points in each plane. If two circles can be
found, we keep the best. If only one circle can be found, it is presumed to be the correct cir-
cle. Otherwise, we decide that the two ellipses do not match. Both the plane fit and the circle fit are
least squares fits. The circle fit returns a measure of fit: the average relative distance of each
point to the circle (which is zero for a perfect fit).
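The plane-identification step can be sketched as follows (our Python illustration; the total-least-squares plane fit and the max-distance thickness test are reasonable stand-ins for the fits described above, not the authors' exact implementation):

import numpy as np

def plane_fit(points):
    centroid = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - centroid)
    normal = Vt[-1]                               # direction of least variance
    dists = np.abs((points - centroid) @ normal)  # point-to-plane distances
    return centroid, normal, dists.max()

def find_planar_windows(P, n=5, d_max=4.0):
    hits = []
    for i in range(len(P) - n + 1):               # sliding window of n points
        c, nrm, thickness = plane_fit(P[i:i + n])
        if thickness < d_max:
            hits.append((i, c, nrm))
    return hits

# Noisy points on a planar arc: every window should pass the thickness test.
P = np.array([[np.cos(a) * 50, np.sin(a) * 50, 0.0] for a in np.linspace(0, np.pi, 40)])
print(len(find_planar_windows(P + np.random.randn(*P.shape) * 0.5)))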
This implies that there are a number of parameters to the procedure: the number of points
N used to describe one ellipse, the size n of the plane fitting window, the thickness d_max
allowed for such a plane (the distance beyond which points are no longer considered part of the
plane), and a maximum error e allowed for each circle found. Except for the plane thickness,
these are relative parameters and depend only on the noise.
6. Constraints
The process of stereo matching can be sped up considerably by the use of constraints. Our
current implementation uses the following constraints explicitly:
- The epipolar constraint. This well-known constraint indicates that points in one image
can only match points on a single line in the other image. This line is the section of the plane
of the other image with the plane through the point and the centres of projection, see [Ho1].
In our implementation it is applied to ellipse centres: if the centres of two ellipses are not
within a given distance of an epipolar line, they are not considered in the matching process
(a minimal sketch of this test follows this list).
- The similarity constraint. Most structures in one image look almost like the corresponding
structures in the other image, provided that image is obtained from a viewpoint which is not too distant
(which is usually the case in stereo vision). If the difference in length of the major axes of two
ellipses is not below a certain threshold, a match is not considered.
- Spatial constraints. Only ellipses within a specific volume in space are accepted. Because
of our application, we know very well where real ellipses can be.
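One standard way to implement the epipolar test from two known camera matrices is to build the fundamental matrix F = [e']_x P_R P_L^+ and measure the point-to-epipolar-line distance; the Python sketch below is our illustration with toy camera matrices, not necessarily the authors' implementation.

import numpy as np

def fundamental_from_cameras(PL, PR):
    _, _, Vt = np.linalg.svd(PL)
    centre = Vt[-1]                      # null vector of PL: left camera centre
    e2 = PR @ centre                     # epipole in the right image
    e2x = np.array([[0, -e2[2], e2[1]],
                    [e2[2], 0, -e2[0]],
                    [-e2[1], e2[0], 0]])
    return e2x @ PR @ np.linalg.pinv(PL)

def epipolar_distance(F, xL, xR):
    l = F @ np.append(xL, 1.0)           # epipolar line in the right image
    return abs(np.append(xR, 1.0) @ l) / np.hypot(l[0], l[1])

PL = np.hstack([np.eye(3), np.zeros((3, 1))])                    # toy cameras
PR = np.hstack([np.eye(3), np.array([[-120.0], [0.0], [0.0]])])
F = fundamental_from_cameras(PL, PR)
print(epipolar_distance(F, np.array([10.0, 5.0]), np.array([-30.0, 5.0])))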
7. Results
The algorithm has been implemented and applied to a number of tests, using the following
parameter values: N = 100, n = 5, d = 4 mm, e = 0.15. The principle of the method is illus-
trated in Fig. 4. Shown are: the stereo pair, the edges extracted from each image, the ellipses found
in each image, and the circles found, projected back into the scene. The object shown is one
of the test objects in our project, with four circles, of which three are partially visible. Note
that only the top side of the discs at each end is found, which is sufficient for recognition.
The object is about 10 cm wide and 5 cm high, and is one of the largest parts in the cell. Both
cameras were placed at 65 cm from the scene, 12 cm apart, without any special alignment.
Instead, the camera matrices were obtained using a robot-controlled calibration procedure.
Processing was done on a Sun 4/330 workstation using 512x512 images. For edge
detection about 25 s of image processing were required, chaincode handling and ellipse fit-
ting took 1 s and the ellipse stereo algorithm 0.2 s (CPU times). Note that the image process-
ing time is independent of scene complexity and can be improved upon using edge detection
hardware.
An analysis of the accuracy of the method can be found in [Bu2]. The figures show that the
method allows position estimates of circle centres with a standard deviation of the order of
1% of the circle's radius. The standard deviation of radii is about 1.7%. Orientation angles
show standard deviations of the order of 1 degree. Further research is needed for quantitative
analysis of the performance of the algorithm where ellipses are only partially visible. A sig-
nificant increase of errors is to be expected. Furthermore, the integration of ellipse based ste-
reo in the full stereo vision system needs further evaluation.
8. Conclusions
We have described a stereo vision system using ellipses in images, representing circles in
3D. It is based on an algorithm to find ellipses or partial ellipses in gray value images, and an
algorithm to find the circle in 3D corresponding to two ellipses. The algorithms prove to be
sufficiently fast and accurate for use in robot vision. Speed improvements may be obtained
using edge detection hardware. The performance of the stereo algorithm where ellipses are
only partially visible must still be investigated quantitatively.

Original Edges detected Ellipses found

Fig. 4. The stereo algorithm applied to a


cylindrical part. The first and second row
above show the steps as each image is pro-
cessed. Note that there is no special camera
alignment. On the right the circles found
are drawn in one of the images, including
normal vectors.

9. References
[BB1] D.H. Ballard and C. M. Brown, Computer Vision, Prentice-Hall, Englewood Cliffs, New Jer-
sey, (1982).
[Bul] J. Buurman, The Diac Object Recognition System, to be presented at SPIE conference on
Applications of Artificial Intelligence X: Machine Vision and Robotics, Orlando (1992).
[Bu2] J. Buurman, Ellipse based stereo vision, Internal report, Pattern Recognition group, faculty
of Applied Physics, Delft University of Technology, (1992).
[FS1] N.J. Foster and A.C. Sanderson, Determining Object Orientation Using Ellipse Fitting, SPIE
Intelligent Robots and Computer Vision, vol. 521, (1984) 34-43.
[Hol] B.K.P. Horn, Robot Vision, McGraw-Hill, New York, (1986).
[PP1] S.B. Pollard, J. Porrill, and J.E.W. Mayhew, Recovering partial 3D wire frames descriptions
from stereo data, Image and Vision Computing, vol. 9, no. 1, (1991) 58-65
[RG1] C.J. Rijnierse and F.C.A. Groen, Graph construction and matching for 3D object recogni-
tion, in: Pattern recognition and artificial intelligence - towards an integration, ed. L.N.
Kanal, North Holland, Amsterdam, (1988)
Applying Two-dimensional Delaunay Triangulation to Stereo Data Interpolation

E. Bruzzone 1, M. Cazzanti 2, L. De Floriani 2, F. Mangili 1

1 Elsag Bailey spa, Research & Development, I-16154 Genova (Italy)
2 Dipartimento di Matematica - Universita di Genova, I-16132 Genova (Italy)

Abstract. Interpolation of 3D segments obtained through a trinocular stereo
process is achieved by using a 2D Delaunay triangulation on the image plane of
one of the vision system cameras. The resulting two-dimensional triangulation
is backprojected into the 3D space, generating a surface description in terms
of triangular faces. The use of a constrained Delaunay triangulation in the
image plane guarantees the presence of the 3D segments as edges of the surface
representation.

1 Introduction
Delaunay triangulation has turned out to be a very powerful tool in many application fields,
including finite element analysis, motion planning, digital terrain modeling and surface
reconstruction in computer tomography [LR1, Ch1, Bo1, DP1]. Such a representation has
several important properties: it is invariant under rigid transformations, it adapts to the
data distribution, and it is easy to update because of the local effect of inserting new points
or segments.
In classical computer vision problems, like scene reconstruction and autonomous nav-
igation, Delaunay triangulation has often been adopted for both 2D and 3D data. In
particular, its discontinuity-preserving nature makes it especially suitable for interpolating
passive stereo data, which usually correspond to scene discontinuities. The use of 3D
Delaunay triangulation for interpolation of data obtained by a stereo process was first
proposed in [Bo1]. A coherent and comprehensive presentation of this approach can be
found in [FL1], where the authors suggest a modification to the standard Delaunay triangula-
tion to include stereo segments as part of the triangulation, based on the addition of extra
points.
A new approach to 3D surface reconstruction which starts from stereo data and makes
use of a two-dimensional Delaunay triangulation including the projections of the segments
as part of the triangulation, has been proposed in [BG1]. The basic idea is to interpolate
the image segments which form the input for the stereo reconstruction process. The
computed 2D mesh is then backprojected into the 3D space using the corresponding
reconstructed stereo data. The result of the whole process is a triangular-faced piecewise
linear surface, in which the stereo segments are somehow preserved. Interesting features
of this approach are its fairly low computational cost, due to the fact that most of the
processing is done in 2D, and its robustness toward calibration and stereo reconstruction
errors. A drawback of this approach is in the splitting of the segments in the image
plane, which requires the computation of the 3D coordinates of the introduced points and
produces many small triangles in special segment configurations.
In this paper, we present a further development of that work by proposing an approach
to 3D surface reconstruction from stereo data based on the computation of constrained
Delaunay triangulation in the image plane which avoids the segment splitting and therefore
the computation of the 3D position of the added points [Chl, LL1, DP1].

2 Three-Dimensional Surface Reconstruction Strategy

The surface reconstruction process consists of three phases: stereo segment reconstruction,
constrained Delaunay triangulation in the image plane, and backprojection of the two-
dimensional tessellation.
The edge segment-based stereo process developed under the Esprit Project P940 [AL1,
Mu1] has been adopted. Three images are acquired from slightly different points of view.
On each image a low-level processing made of edge detection, edge linking and polygonal
approximation is performed, resulting in a set of 2D segments corresponding to relevant
scene features. One of the three images is selected as reference image. For each segment of
the reference image, possible matches, i.e., segments corresponding to the same feature in
the other two images, are selected, making use of the epipolar constraint. Then, for each
triple of matched segments the 3D segment is reconstructed, on the basis of perspective
projection.
The triangulation is computed on the 2D segments of the reference image plane selected
by the stereo process. Note that such segments are perspective projections of real observed
features, and therefore they reflect the visibility properties of the world features from
which they have been originated. As the low-level phases of edge linking and polygonal
approximation guarantee that the segments are disjoint, each triangle is bounded by only
one stereo segment. Moreover, as the image segments are directly the output of the stereo
matching, the triangulation can be computed independently of the stereo reconstruction,
avoiding the errors which may occur in the reconstruction phase.
The 2D mesh is then backprojected into the 3D space using the corresponding 3D
segments endpoints evaluated during the stereo phase. The result of the whole process
is a triangular-faced piecewise linear surface, in which the stereo segments are somehow
preserved. For each triangular face of the surface, the normal unit vector is computed,
achieving a space-variant needle map representation of the observed scene. The geometric
structure resulting from the backprojection can be defined by a function ρ = ρ(θ, φ) in a
system of spherical coordinates centered in the pin-hole of the camera. Therefore, possible
intersections among the triangular faces of the 3D surface can be caused only by errors
that occurred in the stereo process.

3 Computing Constrained Delaunay Triangulation

The two-dimensional Delaunay triangulation of a set P = {P1, P2, ..., Pn} of points in the
plane is the straight-line dual of the Voronoi diagram [PS1]. The Voronoi diagram of P is
a collection V = {V1, V2, ..., Vn} of convex regions, called Voronoi regions, such that Vi is
the locus of the points of E^2 closer to Pi than to any other point in P. Given a set P of
points in the plane and a set S of non-intersecting straight-line segments whose endpoints
are contained in P, the pair G = (P, S) defines a planar straight-line graph, called the
constraint graph. A triangulation T of P whose edge set contains S is called a constrained
triangulation of P with respect to S. A Constrained Delaunay Triangulation (CDT) T of
a set of points P with respect to a set S of line segments is a constrained triangulation
of P in which the circumcircle of each triangle t of T does not contain (in its interior)
any other vertex Pi of P which can be joined to each vertex of t by a line segment not
intersecting any constraint segment (see Figure 1).
Static algorithms for computing a CDT appeared recently in the computational geometry
literature [LL1, Ch1]. The algorithm we use, proposed in [DP1], is instead based on
incremental refinements of a Delaunay triangulation. It starts from an initial Delaunay
triangulation of a specified subset of the input data, and then modifies the triangulation

Figure 1: An example of constrained Delaunay triangulation. Thick lines represent constraint segments.

by inserting the points of P and the segments of S one at a time. Thus, the two major
computational steps of the algorithm are (i) CDT modification when inserting a point P,
(ii) CDT modification when inserting a segment l.
Step 1 is performed by extending a standard method for adding a point to a Delaunay
triangulation to the constrained case [Wa1]. When a new point P is inserted, the triangles
whose circumcircle contains P are deleted and the resulting star-shaped polygon is
triangulated by connecting the vertices of such a polygon to P. The worst-case complexity
of this step is O(n). Thus, inserting all n data points leads to an O(n²) worst-case
complexity, which reduces to O(n log n) if randomized algorithms are used [GK1].
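Leaving the constrained case aside, this delete-and-retriangulate step can be sketched as follows (a simplified Watson/Bowyer-Watson style insertion in Python; the data layout and function names are our own assumptions, and degeneracies, the constrained visibility test and points falling outside the current triangulation are not handled):

```python
import numpy as np

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle (a, b, c).
    The triangle vertices are assumed to be in counter-clockwise order."""
    m = np.array([[a[0]-p[0], a[1]-p[1], (a[0]-p[0])**2 + (a[1]-p[1])**2],
                  [b[0]-p[0], b[1]-p[1], (b[0]-p[0])**2 + (b[1]-p[1])**2],
                  [c[0]-p[0], c[1]-p[1], (c[0]-p[0])**2 + (c[1]-p[1])**2]])
    return np.linalg.det(m) > 0.0

def insert_point(points, triangles, p_index):
    """One Watson-style insertion step: delete the triangles whose circumcircle
    contains the new point, then re-triangulate the star-shaped cavity by
    connecting its boundary edges to the new point.

    points:    list of (x, y) tuples; points[p_index] is the point to insert.
    triangles: list of (i, j, k) vertex-index triples (counter-clockwise).
    """
    p = points[p_index]
    bad = [t for t in triangles
           if in_circumcircle(points[t[0]], points[t[1]], points[t[2]], p)]
    # Boundary of the cavity: edges that belong to exactly one deleted triangle.
    edge_count = {}
    for (i, j, k) in bad:
        for e in ((i, j), (j, k), (k, i)):
            edge_count[frozenset(e)] = edge_count.get(frozenset(e), 0) + 1
    keep = [t for t in triangles if t not in bad]
    for (i, j, k) in bad:
        for e in ((i, j), (j, k), (k, i)):
            if edge_count[frozenset(e)] == 1:        # cavity boundary edge
                keep.append((e[0], e[1], p_index))   # connect it to the new point
    return keep
```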
Step 2 is performed by intersecting the new segment l with the existing triangulation
and retriangulating the region of the plane defined by the union of the triangles intersected
by l. The edges bounding the region of T intersected by l, called the influence region, form
a simple polygon Q_l, called the influence polygon, of which l is a diagonal. The segment l splits Q_l into two
simple polygons π1 and π2, which are triangulated by recursively splitting them into three
subpolygons. The resulting triangulation of π1 and π2 is then locally optimized by an
iterative application of the empty circle criterion for a CDT [DP1].
The time complexity of the influence region computation of a constraint segment l is
linear in the number of triangles intersected by l. Both rebuilding the constrained Delaunay
triangulation of a polygon and its optimization have a quadratic worst-case complexity in
the number of vertices of the influence polygon. The worst-case complexity of the segment
insertion algorithm is O(mn²), where m is the number of constraint segments (m = n/2 if
the points of P are exactly the endpoints of the segments of S). By using an asymptotically optimal
Delaunay triangulation algorithm for simple polygons [LL1], the worst-case complexity of
the algorithm could be reduced to O(mn log n), at the price of losing the implementation simplicity.
An alternative approach for including a set of segments in a Delaunay triangulation consists
of splitting the segments into subsegments (by adding extra vertices), so that
the constrained Delaunay triangulation of all subsegments is the same as the Delaunay
triangulation of the augmented vertex set. In [FL1] a preprocessing step is used to split
the segments according to their minimum distance.
The CDT algorithm has been compared with the segment-splitting algorithm described
in [FL1]. Experimental results show that, on average, the number of inserted points triples
the number of original points (segment endpoints). The number of points (and triangles)
increases dramatically when close parallel segments occur in the input data. Figure 2 shows
the results of the CDT and segment-splitting algorithms on a real indoor scene.

Figure 2: Reference image of a trinocular stereo system (a) and matched segments (b).
CDT (c) and segment-splitting triangulations (d).

4 Experimental Results on Scene Reconstruction
The complete scene reconstruction process has been tested on a set of real scenes.
Assuming as reference applications both scene surface characterization and free-space
detection for autonomous navigation tasks, indoor images (i.e., office and laboratory
images) have been acquired.
The DMA machine, developed under the Esprit Project P940, has been used to get
both the 3D reconstructed segments and the corresponding 2D segments of the reference
image. First an unconstrained Delaunay triangulation is built on the segment endpoints.
Then, the resulting triangulation is updated, by adding the input segments as Delaunay
edges.
The surface obtained by backprojecting the CDT into 3D is made of a minimum number
of triangles (for instance, two parallel segments define only two triangles). Moreover, very
elongated triangles, which may occur in the image-plane CDT, often correspond to more
equiangular triangles in 3D, due to the perspective projection under which the scene has
been seen.

Experimental results have shown that the running time of the whole surface reconstruction
process is reduced by about 50% using the CDT algorithm rather than the
segment-splitting one. Such a reduction is due to both the triangulation phase (without
the splitting of the constraint segments) and the backprojection phase.

5 Concluding Remarks
The proposed scene reconstruction process starting from stereo segments is based on a
two-dimensional Constrained Delaunay Triangulation done in the image plane and results
in a triangular-faced piecewise linear description of scene surfaces. With respect to what is
presented in [BG1], the main novelty is the use of a powerful algorithm which constrains
the triangulation to the input segments, avoiding the insertion of extra points. Some ex-
perimental tests on real data have confirmed the foreseen advantages of this new approach
in terms of both computational efficiency and improvement of the resulting surface de-
scription.
As the bottleneck of the whole strategy is in the computation of the CDT, a parallel
implementation of this phase on the Elsag Bailey multiprocessor machine EMMA2 has
been completed.

References
[AL1] Ayache N., Lustman F.: Fast and Reliable Passive Trinocular Stereovision, Proceedings 1st International Conference on Computer Vision, London (1987).
[Bo1] Boissonnat J.D.: Geometric Structures for Three-Dimensional Shape Representation, ACM Transactions on Graphics, 3, 4 (1984).
[BG1] Bruzzone E., Garibotto G., Mangili F.: Three-Dimensional Surface Reconstruction using Delaunay Triangulation in the Image Plane, Proceedings International Workshop on Visual Form, Capri (1991).
[Ch1] Chew L.P.: Constrained Delaunay Triangulation, Proceedings 3rd Symposium on Computational Geometry, Waterloo (1987).
[DP1] De Floriani L., Puppo E.: Constrained Delaunay Triangulation for Multiresolution Surface Description, Proceedings 9th International Conference on Pattern Recognition, Roma (1988).
[FL1] Faugeras O.D., Le Bras-Mehlman E., Boissonnat J.D.: Representing Stereo Data with the Delaunay Triangulation, Artificial Intelligence, 44 (1990).
[GK1] Guibas L.J., Knuth D.E., Sharir M.: Randomized Incremental Construction of Delaunay Triangulations and Voronoi Diagrams, Proceedings ICALP (1990).
[LL1] Lee D.T., Lin A.K.: Generalized Delaunay Triangulation for Planar Graphs, Discrete & Computational Geometry, 1 (1986).
[LR1] Lewis B.A., Robinson J.S.: Triangulation of Planar Regions with Applications, The Computer Journal, 21, 4 (1979).
[Mu1] Musso G.: Depth and Motion Analysis: the Esprit Project P940, Proceedings 6th Annual ESPRIT Conference, Brussels (1989).
[PS1] Preparata F.P., Shamos M.I.: Computational Geometry: An Introduction, Springer-Verlag, New York (1985).
[Wa1] Watson D.F.: Computing the n-dimensional Delaunay Tessellation with Application to Voronoi Polytopes, The Computer Journal, 24 (1981).
Local Stereoscopic Depth Estimation Using Ocular
Stripe Maps

Kai-Oliver Ludwig*, Heiko Neumann**, and Bernd Neumann


Universität Hamburg, FB Informatik, AB KOGS
Bodenstedtstr. 16, W-2000 Hamburg 50, Germany

Abstract. Visual information is represented in the primate visual cortex


(area 17, layer 4B) in a peculiar structure of alternating bands of left and
right eye dominance. Recently, a number of computational algorithms based
on this ocular stripe map architecture have been proposed, from which we
selected the cepstral filtering method of Y. Yeshurun & E.L. Schwartz [11]
for fast disparity computation due to its simplicity and robustness. The
algorithm has been implemented and analyzed. Some special deficiencies
have been identified. The robustness against noise and image degradations
such as rotation and scaling has been evaluated. We made several improve-
ments to the algorithm. For real image data the cepstral filter behaves like
a square autocorrelation of a bandpass filtered version of the original im-
age. The discussed framework is now a reliable single-step method for local
depth estimation.
Keywords: stereopsis, primary visual cortex, ocular stripe maps, cep-
strum, local depth estimation.

1 Introduction
Multiframe analysis of images, such as stereopsis and time-varying image sequences, has
been a primary focus of activities within the last decade of computational vision research.
In both areas, the key problem has been identified as finding the correct correspondences
of homologous image points. The so-called correspondence problem has not been solved
to date in a form that applies to general-purpose vision tasks. For a review of relevant techniques for
finding stereo correspondences, we refer to e.g. [3].
Finding stereo correspondence can be identified as a mathematically ill-posed prob-
lem, which has to be regularized utilizing constraints imposed on the possible solution
(see e.g. [10]) . The majority of computational approaches is therefore formulated as
finding a solution in a high dimensional search or optimization space by minimizing a
functional which usually takes into account a data similarity term as well as a model
term (e.g. for achieving smoothness) to regularize the solution (see e.g. [1, 10]). In order
to avoid the complexity of most of the existing computational techniques, we investigated
biological findings about architecture and mechanisms for seeing stereoscopic depth.
Due to the limited space of the current conference proceedings, this contribution does
not cover all the topics we had to present. Neither does it provide the necessary
context to fully understand the presented facts. If required, use [5, 6] and the
references therein to get full background information.

* Email: ludwig@kogs26.informatik.uni-hamburg.de
** Email: neumann_h@rz.informatik.uni-hamburg.de

2 Biology

Biological Data Structures Support Efficient Computation. An alternative to
the most commonly realized strategy in computational vision research ([8]) is to infer
information processing capabilities from the identification of structural principles in
the mammalian visual cortex (see e.g. [7]). These general principles include the discrete
mapping of different sensory features like orientation, color, ocularity, depth or motion
in 3D space to positions in a subspace of R² ([4]). The "computational maps" discovered
so far have been postulated to optimally support computational mechanisms of
different specificity. 3 Our computational model for local depth estimation is based on
the subdivision of the cortex into ocular dominance stripes. Starting with the idea that
a hypothetical disparity sensitive cell uses a local section of two neighboring stripes as
input to compute local disparity, we can subdivide the original left and right image into
local patches, whose size is chosen according to two major, in principle contradictory,
constraints: increasing stability of the estimate with increasing stripe width, and increasing
accuracy with decreasing stripe width. Given the disparity at all single locations we
obtain a disparity map from which a (relative) depth map can be easily inferred.
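A direct reading of this patch-wise scheme might look as follows (a hypothetical sketch of our own, not taken from the paper; `estimate_disparity` stands for a local estimator such as the cepstral one analysed in Sect. 3):

```python
import numpy as np

def disparity_map(left, right, patch, estimate_disparity):
    """Tile the fixating left/right images into patch x patch blocks and apply a
    local disparity estimator to each block pair; the result is a coarse
    disparity map from which relative depth can be inferred."""
    h, w = left.shape
    rows, cols = h // patch, w // patch
    d = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            ys = slice(r * patch, (r + 1) * patch)
            xs = slice(c * patch, (c + 1) * patch)
            d[r, c] = estimate_disparity(left[ys, xs], right[ys, xs])
    return d
```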

Usefulness of Stripe Maps for Depth Estimation. With reference to biological


vision systems, we assume in this work a geometry in which the optical axes of the left
and right image frames fixate a previously identified point in 3D space. 4 Then under these
conditions of imaging geometry a circle is uniquely defined by the two optical centers
of projection and the point of fixation. All points in space lying on this so-called Vieth-Müller
circle project onto the two retinae with zero disparity (Thales' theorem). Due to
the discrete width of the ocular stripes, not only projections of space locations with zero
disparity can be fused. All 3D points with moderate negative (far field) or positive (near
field) disparity within a psychophysically defined region also contribute to a fused image
of varying depth due to the retinal shift of projection to retinal coordinates. 5 3D spatial
locations outside this area produce the well known phenomenon of double images.

3 Analysis, Evaluation, and Extensions

3.1 Analysis and Evaluation

To determine the values of the parameters of the technical model we have evaluated the
relevant and sometimes diverging biological data from various sources to get a reasonable
and consistent parameter setting. For a detailed discussion see [5]. Two corresponding
local image patches extracted from the left and right image, respectively, can be arranged
3 However, from the set of maps and principles given above, only the principles of retinotopy
and ocular dominance stripe maps are fully established ([4]). The organization of alternating
bands of ipsi- and contralateral eye dominance has been modeled recently as being the result
of a structural transformation principle described via a non-linear mapping function ([7]).
These so-called ocular dominance columns subdivide the whole area of the cortex (area 17,
layer 4B) into alternating bands of ca. 0.5 mm width.
4 As a part of an active vision system, a fixating binocular head necessarily requires an atten-
tional control module for the selection of appropriate fixation points and a module for the
vergence movement to fixate the selected points. Proposals have been published how fixation
points could be selected and how such a point may be tracked in time (see [5] for references).
5 In case of idealized circular retinae the iso-disparity lines are circles of different radii with the
horopter circle as one element of the set. In case of flat projection planes these iso-lines are
conic sections (see [5] for a detailed discussion).

in a local neighborhood to form a single joint signal. This idea was originally utilized
in an algorithm proposed by Yeshurun & Schwartz [11] using rectangular patches butted
against each other. If such a combined signal is filtered with the cepstrum 6, the filtered
image contains a strong and sharp peak at a position which codes the disparity shift
between the two original subsignals. This can be derived mathematically for the ideal
case of a pure translational shift (see e.g. [5, 11]). Excluding some special cases, which
will be named later in this paper, the disparity between the two subsignals can be
obtained by simple maximum detection in the cepstral plane.
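As an illustration of this pipeline (a minimal numpy sketch of our own, not the authors' implementation; it ignores the windowing issues and special cases discussed below, and all names and parameters are assumptions), the joint left/right patch is cepstrum-filtered and the disparity is read off from the position of the echo peak:

```python
import numpy as np

def cepstrum2d(g, eps=1e-8):
    """2D power cepstrum: |F{ log(|F{g}|^2) }|^2 (eps avoids log of zero)."""
    power = np.abs(np.fft.fft2(g)) ** 2
    return np.abs(np.fft.fft2(np.log(power + eps))) ** 2

def cepstral_disparity(left_patch, right_patch, max_disp=8):
    """Estimate the horizontal disparity between two equally sized patches.

    The (h x w) patches are butted against each other; the echo peak of the
    cepstrum then sits at a horizontal offset of about w + disparity from the
    origin.  The joint signal is zero-padded so this offset is not aliased."""
    h, w = left_patch.shape
    joint = np.hstack([left_patch, right_patch]).astype(float)
    joint -= joint.mean()                      # remove the DC component
    padded = np.zeros((2 * h, 4 * w))
    padded[:h, :2 * w] = joint
    c = np.fft.fftshift(cepstrum2d(padded))
    cy, cx = np.array(c.shape) // 2
    # search a small window around the expected echo position (offset w, row 0)
    win = c[cy - 2:cy + 3, cx + w - max_disp:cx + w + max_disp + 1]
    dy, dx = np.unravel_index(np.argmax(win), win.shape)
    return dx - max_disp                       # estimated disparity in pixels
```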
Using this method for computing disparities has several advantages. First, it is fast,
because the disparities are computed in a single step without any iterations 7. Second,
due to the local and therefore independent computation of the disparities, parallelization
is easy. Third, it is well known from previous work that the cepstrum is extremely
insensitive to noise. We showed in a systematic evaluation ([5]) that the cepstral filter is
insensitive to moderate image degradations due to rotation or scaling (6 degrees, 6%).
Geometry. It is reasonable to assume that physical surfaces in the natural environment
are piecewise smooth and can hence be approximated locally by their Taylor
series expansion. We mathematically investigated the distortions in the disparity field
when fixating planes and second order surfaces. For a given point in 3D space, let the
left image coordinates be l = (x_L, y_L). Then the right image coordinates r = (x_R, y_R)
can be computed in the first case to be:

    x_R = (a1 x_L + a2 y_L) / (b1 x_L + b2 y_L + b3)   and   y_R = c y_L / (b1 x_L + b2 y_L + b3),     (1)

where a_i, b_i and c are constants with respect to a given stereo arrangement and local
surface orientation (see [5] for further details on formulae and (graphical) results).
Evaluation. The cepstral filter as used in the literature with rectangular windowing
functions suffers from some specific problems: if, for example, the double signal contains
a single straight edge segment, then up to five additional maxima may appear in the
cepstrum. In the case of varying illumination an additional peak at zero disparity may
appear. These and other deficiencies can be overcome with different support functions.

3.2 I m p r o v e m e n t s and extensions

Other support functions for windowing. If rectangular support functions are butted
against each other, information about the shape of the original subsignals is lost in
the joint signal. Use of other window functions can prevent this information loss and
improve signal properties. As a first example, we showed (see [5]) that the use of Gaussian
windows for the extraction of the local left and right image information produces a more
easily identifiable maximum. Furthermore, the use of different support functions, as an
approximation for a circular receptive field of a disparity cell, is also feasible and the
results are better than those with standard rectangular support.
6 The cepstrum (an anagram of spectrum) is a well-known non-linear filter first used by Bogert et
al. [2] for the detection of echo arrival times in 1D seismic signals. The cepstrum of a signal
g(x) is defined as Cepstrum{g(x)} := ‖ℱ{log(‖ℱ{g(x)}‖²)}‖². Due to its simplicity and noise
robustness it has been widely used since then in various application areas from 1D speech
processing to 2D image registration (see [6] for references).
7 It has been shown in [9] that this filtering step could be done in 51 ms when using special image
processing hardware. With such short computation times it is feasible to use the presented
cepstrum-based stereo segmentation approach in active vision systems for simple obstacle
avoidance or object recognition tasks.

Fig. 1. Cepstrum with gaussian support functions. Left: Double signal f(x) composed from
data of left and right image at same (retinal) locations multiplied by a gaussian window and
added with a fixed offset. Center: Amplitude spectrum. Right: Cepstrum{f(x, y)}. To enhance
the visual impression log(.) is displayed in the center and right image and a small region around
the origin has been removed (only right).

The cepstral filter and autocorrelation. Based on a formula due to Olson & Coombs ([9])
the cepstrum can be written as an autocorrelation preceded by an adaptive filtering step:
Cepstrum{f} = ‖h ∘ h‖² with h(x, y) = k_F(x, y) * f(x, y). For natural images we found
that the prefilter, whose Fourier transform is given by

    K_F(u, v) = √(log(‖F(u, v)‖²)) / ‖F(u, v)‖,   with F(u, v) := ℱ{f(x, y)},     (2)
computes mainly a bandpass filtered version of the original image with a narrow kernel,
which is in contrast to the example given in ([9, p. 28]). On the other hand the cepstral
filter is substantially better than autocorrelation applied to images appropriately filtered
by LoG. It seems that the peculiar image-dependence of the prefilter kernel contains the
main advantage of the cepstral filter.

4 Results, Conclusions and Prospects

We generated image pairs with a computer graphics visualization package to investigate


the precision of the cepstral disparity estimates under precisely known conditions. The
theoretical values given by (1) can be computed to ±1 pixel accuracy due to numerical
instabilities and sampling effects. Fig. 2 gives an impression of the results on natural
images.
Motivated by recent findings about the architecture of biological visual systems, we
have investigated a method for how a binocular observer can recover local depth infor-
mation with a single step computation avoiding the correspondence problem. In contrast
to standard formulations of the stereopsis problem, this method needs neither regulariza-
tion nor iterative computations to obtain the solution. It is a fast and reliable one-step
method to determine depth locally around the point of fixation.
Current research topics include the following. For the technical aspects of the method, a mathematical
analysis of support functions with good signal properties is necessary, since we
have currently investigated only box and Gaussian shapes. In relation to this analysis, the bias
introduced from the window shape as an error component in the estimation has to be
evaluated. We also plan to investigate in greater detail the properties of the prefilter. For
the incorporation of this local depth estimation technique in an active vision system, the
problem of combining multiple depth maps has to be analyzed.

[Figure annotations: "local patches lack sufficient structure"; "actual surface is outside of fusional area".]

Fig. 2. Local depth map computed by the improved algorithm using equal gaussian window
functions for left and right image with a previous LoG filtering step (The rectangles only outline
the subdivision of the image). The image pair has been taken at a distance of 2 meters with stereo
base length 7.00 cm using a precision adjusting device to produce the fixating arrangement. The
(foveal) angle of extent is 200 minutes of arc. As can be observed the algorithm fails if one of
the two indicated conditions hold (see arrows).

References
1. S.T. Barnard and M.A. Fischler: Computational and Biological Models of Stereo Vision.
In Proc. IU Workshop, Pittsburgh, PA, USA, September 11-13 (1990) 439-448
2. B.P. Bogert, M.J.R. Healy, and J.W. Tukey. The quefrency alanysis of time series for echoes:
cepstrum, cross-cepstrum, and saphe cracking. In Proceedings: Symposium on Time Series
Analysis (1963) 209-243
3. U.R. Dhond and J.K. Aggarwal. Structure from Stereo - A Review. IEEE Trans. on
Systems, Man, and Cybernetics, 19(6) (1989) 1489-1510
4. B.M. Dow. Nested maps in macaque monkey visual cortex. In K.N. Leibovic, editor, The
Science of Vision, Springer, New York (1990) 84-124
5. K.-O. Ludwig. Untersuchung der Cepstrumtechnik zur Querdisparitätsbestimmung für
die Tiefenschätzung bei fixierenden Stereokonfigurationen. Technical Report, Fachbereich
Informatik, Universität Hamburg (1991)
6. K.-O. Ludwig, B. Neumann, and H. Neumann. Robust Estimation of Local Stereoscopic
Depth. In International Workshop on Robust Computer Vision (IWRCV '92), Bonn, Germany, October 1992
7. H.A. Mallot, W. von Seelen, and F. Giannakopoulos. Neural Mapping and Space-Variant
Image Processing. Neural Networks, 3 (1990) 245-263
8. D. Marr. Vision. W.H. Freeman and Company, San Francisco (1982)
9. T.J. Olson and D.J. Coombs. Real-Time Vergence Control for Binocular Robots. Technical
Report 348, Department of Computer Science, University of Rochester (1990)
10. T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature,
317 (1985) 315-319
11. Y. Yeshurun and E.L. Schwartz. Neural Maps as Data Structures: Fast Segmentation of
Binocular Images. In E.L. Schwartz, editor, Computational Neuroscience, Chap. 20, The
MIT Press (1990) 256-266
This article was processed using the LaTeX macro package with ECCV92 style
Depth Computations from Polyhedral Images *

Gunnar Sparr
Dept. of Mathematics, Lund Institute of Technology,
Box 118, S-22100 Lund, Sweden

Abstract. A method is developed for the computation of depth maps,
modulo scale, from one single image of a polyhedral scene. Only affine shape
properties of the scene and image are used, hence no metrical information.
Results from simple experiments show good performance, both as concerns
exactness and robustness. It is also shown how the underlying theory
may be used to single out and characterise certain singular situations that
may occur in machine interpretation of line drawings.

1 Introduction

The topic of this paper is depth computation and scene reconstruction in the case when
the scene is built up by planar surface patches, bounded by polygons. Having only this
information, from one single image no quantitative information can be drawn. A common
situation is that the scene contains patches which are parallelograms, often rectangles.
It will be seen that under this rather weak assumption, without any knowledge about
the sizes of these parallelograms, it is possible to compute a depth-map over the image,
modulo a common scaling factor. The method may be used also for patches of other
shapes. No camera calibration is needed.
The approach is inspired by the subjective experience that depth information seems to
be contained in the shape of an image of e.g. a rectangle. In a series of papers, e.g. [8], [9],
[10], [11], this hypothesis has been verified in quantitative terms. In the present paper
emphasis will be laid on examples and experiments, rather than on the mathematical
theory. For a thorough treatment of the latter, see [9].
The organization of the paper is as follows. In Sect. 2, the concept of 'shape' is
described, with examples. In Sect. 3 the same will be done for 'depth', with some new
theorems. In Sect. 4 is described a simple experiment, illustrating the applicability of
the method for realistic data. Also the robustness properties are investigated. In Sect. 5,
some degeneracies that may occur are treated. In Sect. 6, finally, the results and their
relations to previous work are discussed.
Throughout the paper, it is assumed that the correspondence problem is solved be-
forehand, i.e. that a set of point matches between points in the image and in the scene
is established.

2 Shape

Instead of working with individual points, we work with m-point configurations, by which
is meant ordered sets of points, planar or non-planar,

    X = (X^1, ..., X^m).

* The work has been supported by the Swedish National Board for Industrial and Technical
Development (NUTEK).

It turns out to be fruitful to work with a kind of duality (for motivations and proofs,
see e.g. [9]), and consider the linear relations that exist between the points belonging
to a particular configuration. It can be proved that the set (1) below is independent of
coordinate representations. The following definition plays a crucial role in the sequel.

Definition 1. The (affine) shape of X = (X^1, ..., X^m) is the linear space

    s(X) = { ξ = (ξ1, ..., ξm) | Σ_{i=1}^m ξi X^i = 0,  Σ_{i=1}^m ξi = 0 },     (1)

where X^i, i = 1, ..., m, stands for coordinates in an arbitrary affine coordinate system.

We use the notation

    X′ ≅ X″  ⟺  X′ and X″ have equal shape.

It can be shown that X′ ≅ X″ if and only if X′ and X″ can be mapped onto each other
by an affine transformation.
To get a geometric feeling for the notion of shape, consider the case of a planar 4-point
configuration X = (X^1, ..., X^4). Suppose the configuration is non-degenerate, i.e. that
it contains three non-collinear points, say X^1, X^2, X^3. Let ξ2, ξ3 be the coordinates of
X^4 with respect to a coordinate system with origin in X^1 and basis vectors X^1X^2 and
X^1X^3, cf. Fig. 1. The equation X^1X^4 = ξ2 X^1X^2 + ξ3 X^1X^3 may then be written

    ξ1 X^1X^1 + ξ2 X^1X^2 + ξ3 X^1X^3 + (−1) X^1X^4 = 0,   ξ1 = 1 − ξ2 − ξ3,

where the coefficient ξ1 of the null-vector X^1X^1 is chosen so that the coefficient sum
vanishes. This construction determines the shape-vector (ξ1, ξ2, ξ3, −1) in the case of
4-point configurations. Any multiple of this vector belongs to s(X) too.

Fig. 1. Affine coordinates: X^1X^4 = ξ2 X^1X^2 + ξ3 X^1X^3, i.e. X^4 = ξ1 X^1 + ξ2 X^2 + ξ3 X^3.

Above, the points X^1, X^2, X^3 form what is called an affine basis for the plane. The
coordinates ξ1, ξ2, ξ3, with Σ ξi = 1, are called the affine coordinates of X^4. An analogous
construction can be done in space. Thus, if X^1, X^2, X^3, X^4 are vertices of a non-degenerate
tetrahedron, they form an affine basis, and X^5 can be described by its affine
coordinates ξ1, ..., ξ4, with Σ ξi = 1. Again s(X) is a one-dimensional linear space.
For two-dimensional configurations with more than 4 points, and three-dimensional
configurations with more than 5 points, s(X) is a linear space of higher dimension.
Generally, for the shape of an m-point configuration, the following can be said:

    dim s(X) = m − 2  if the points are collinear,
    dim s(X) = m − 3  if the points are coplanar, but not collinear,     (2)
    dim s(X) = m − 4  if the points are not coplanar.

Example 1. In Fig. 2 are shown two 2D configurations and one 3D configuration. The dotted lines have
no other meaning than to indicate relationships between the points.
In the left configuration, the point X^4 is the centroid of the triangle with vertices
in X^1, X^2, X^3. In the middle one, two sides are parallel. The right configuration is a
"joint" in 3D, consisting of two rectangular 4-point configurations. Bases for the shapes
of these configurations are shown. They may be computed e.g. by means of the affine
basis construction above. A natural way to select a basis for the joint is by means of the
planar subconfigurations, as is done in the figure, but other choices are possible too.

Fig. 2. Two 2D and one 3D configurations, and bases for their shapes: (1, 1, 1, −3); (1, −1, 2, −2); (1, −1, −1, 1, 0, 0) and (0, 0, 1, −1, −1, 1).

To be of practical use, one needs algorithms for the computation of shape. One such
algorithm suggests itself by the definition, namely the solving of a homogeneous system
of linear equations. By means of e.g. a row echelon algorithm, a basis for s(X) can be
computed.
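For instance (a minimal numpy sketch of our own, not taken from the paper; the SVD-based null space stands in for the row echelon computation mentioned above), a basis for s(X) can be obtained as the null space of the system that stacks the point coordinates on top of a row of ones:

```python
import numpy as np

def shape_basis(points, tol=1e-10):
    """Basis for the affine shape s(X) of an m-point configuration.

    points: (m, d) array of point coordinates (d = 2 or 3) in any affine frame.
    Returns an (m, k) array whose columns span
        { xi : sum_i xi_i X^i = 0  and  sum_i xi_i = 0 }.
    """
    X = np.asarray(points, dtype=float)
    m = X.shape[0]
    A = np.vstack([X.T, np.ones((1, m))])    # d+1 rows, m columns
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T                       # right singular vectors spanning the null space

# Example: the planar 4-point configuration with X^4 the centroid of X^1, X^2, X^3
pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (1.0, 1.0)]
print(shape_basis(pts))   # one column, proportional to (1, 1, 1, -3)
```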
In the definition of shape, the coordinate invariance is very important. It makes it
possible to compute the shape of an image configuration from intrinsic measurements
in the image plane, in terms of an arbitrary coordinate representation. The same holds
for the object configuration. These shapes can thus be computed independently, without
reference to the imaging process.
In the rest of this paper, point configurations defined by the vertices of polyhedral
objects will be considered. (Here the word 'polyhedral' is used in a wide sense for configurations,
not necessarily solid, built up by planar polygonal patches.) Besides being a
point configuration of its own,

    X = (X^1, ..., X^m),

such a configuration has a lot of additional structure. In fact, each of the f polygonal
faces of the object contributes with a sub-configuration, defined by the vertices of the
polygon,

    X_i = (X_i^1, ..., X_i^{m_i}),   i = 1, ..., f.

The whole configuration may be considered as an ordered set of these sub-configurations,

    OX = (X_1, ..., X_f).

We will work in parallel with both these representations X and OX, and by abuse of
notation write OX in both instances.

3 Depth

In this section it will be examined how the shape, as defined above, transforms under
projective transformations. The results are fundamental for the applications below.
By a perspectivity with center Z in 3-space is meant a mapping with the property
that every point on a line through Z is mapped onto the intersection of the line with
some plane π, the image plane, where Z ∉ π. For a perspectivity between two m-point
configurations, X → Y, there exist αi, i = 1, ..., m, such that

    ZX^i = αi ZY^i,   i = 1, ..., m.

Here αi is called the depth of X^i with respect to Y^i, i = 1, ..., m, and the vector
α = (α1, ..., αm) is called the depth of X with respect to Y.
By a projectivity is meant a composition of perspectivities. The product of the depths
of these perspectivities defines the depth of the projectivity. (It can be shown, cf. [9],
that this product is independent of the decomposition.)
The following theorem shows how the knowledge of the shapes of two configurations
X and Y makes it possible to characterise all projectivities P such that Y ≅ P(X). In
particular, pose information is attained about the location of X relative to Y.

Theorem 2. The following conditions are equivalent:
- There exists a projectivity P such that P(X) ≅ Y, where X has depth α with respect
  to P(X),
- diag(α) s(X) ⊆ s(Y), with equality if X is planar.

Note that here α is determined up to proportionality only. In accordance with this,
the depth will be considered as a homogeneous vector.
The meaning of Theorem 2 is illustrated in Fig. 3, drawn for 4-point configurations.
Let there be given two planes, containing the point configurations X and Y respectively.
Suppose it is possible to move these planes around in space. Take an arbitrary point
Z, and form a pencil of rays connecting Z with the points of Y. The theorem says
that whenever an X-configuration fits on this pencil, the depth values α are given by the
formula of the theorem, independently of the location of Z and Y. This is even true when
X and Y are replaced by configurations having the same respective shapes. Conversely,
if instead α and Y are given, and by means of them an X-configuration is constructed,
then the shape of the latter is given by the theorem.

Fig. 3. Perspective mapping of point configurations.



Example 2. The right-hand configuration of Fig. 2 was said to illustrate a three-dimensional
figure, a joint, with rectangular faces. Looking upon the figure as it is printed on the
paper, i.e. as a two-dimensional perspective image of the joint, measurements in the
image give that its two parts have the following shapes:

    s((Y^1, Y^2, Y^3, Y^4)) = (η1, η2, η3, η4) = (1.1, −1, −1.2, 1.1),
    s((Y^3, Y^4, Y^5, Y^6)) = (η3, η4, η5, η6) = (−1.2, 1.1, 1.1, −1).

The result of applying Theorem 2 to the two parts separately may be summarised in a
matrix equation

    diag(α) |  1   0 |   |  1.1   0   |
            | −1   0 |   | −1.0   0   |
            | −1   1 | = | −1.2  −1.2 |  diag(1, −1).     (3)
            |  1  −1 |   |  1.1   1.1 |
            |  0  −1 |   |  0     1.1 |
            |  0   1 |   |  0    −1.0 |

Here the diagonal matrix on the right-hand side is needed to adjust for the arbitrariness
in the choice of the columns of the 6 × 2-matrices. The system has the depth solution

    αᵀ = [α1 α2 α3 α4 α5 α6] = [1.1 1.0 1.2 1.1 1.1 1.0]

(together with all multiples of this vector).

In this example, a claim on depth-consistency is met, formulated in terms of the


matrix-equation (3). The general situation is covered by the following theorem.

Theorem 3. Let OX = (X_1, ..., X_f) be a polyhedral point configuration, and let OY =
(Y_1, ..., Y_f) be its image. Let

    S_X = [S_X1, ..., S_Xf]

be a matrix with sub-matrices S_Xi, having columns that form a basis for s(X_i), and let
S_Yi be defined in the corresponding way, i = 1, ..., f. Then, for some α and c,

    diag(α) S_X = S_Y diag(c).     (4)

For noisy data, equation (4) cannot be expected to be satisfied exactly. Moreover,
the system in α is in general overdetermined. In the next section it will be solved in the
least-squares sense.
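One possible numerical reading of this step (a sketch of our own, not the author's implementation; it assumes S_X and S_Y are assembled with matching column order as in Theorem 3) writes each entry of (4) as a linear equation α_i (S_X)_ij − c_j (S_Y)_ij = 0 in the unknowns (α, c) and solves the homogeneous system in the least-squares sense with an SVD:

```python
import numpy as np

def depths_from_shapes(S_X, S_Y):
    """Least-squares solution of diag(alpha) S_X = S_Y diag(c)  (eq. 4).

    S_X, S_Y: (m, k) shape-basis matrices with matching column order.
    Returns (alpha, c), defined up to a common scale factor.
    """
    m, k = S_X.shape
    rows = []
    for i in range(m):
        for j in range(k):
            row = np.zeros(m + k)
            row[i] = S_X[i, j]        # coefficient of alpha_i
            row[m + j] = -S_Y[i, j]   # coefficient of c_j
            rows.append(row)
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)
    sol = vt[-1]                      # singular vector of the smallest singular value
    return sol[:m], sol[m:]

# Example 2 of the paper: the "joint" configuration
S_X = np.array([[1, 0], [-1, 0], [-1, 1], [1, -1], [0, -1], [0, 1]], float)
S_Y = np.array([[1.1, 0], [-1.0, 0], [-1.2, -1.2], [1.1, 1.1], [0, 1.1], [0, -1.0]])
alpha, c = depths_from_shapes(S_X, S_Y)
print(alpha / alpha[1])               # proportional to (1.1, 1.0, 1.2, 1.1, 1.1, 1.0)
```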
For projective mappings from one plane to another, the following theorem gives an
analytic expression for the depth function.

Theorem 4. Let the planar configuration X = (X^1, X^2, X^3, X) be mapped onto Y =
(Y^1, Y^2, Y^3, Y) by a perspectivity, so that Y^1, Y^2, Y^3 have depths α1, α2, α3 with respect
to X^1, X^2, X^3. If s(Y) = (η1, η2, η3, θ), then the depth α of Y with respect to X is given
by

    η1/α1 + η2/α2 + η3/α3 + θ/α = 0.

If η ∈ s(Y) is so normed that θ = −1, i.e. η1 + η2 + η3 = 1, then η1, η2, η3 are the
affine coordinates of Y with respect to the affine basis Y^1, Y^2, Y^3. The formula says that
the depth α of Y is the weighted harmonic mean of α1, α2, α3, with the affine coordinates
as weights.
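Read as a formula (a tiny sketch of our own, assuming η is normed so that θ = −1), Theorem 4 gives the depth of Y directly as a weighted harmonic mean:

```python
def depth_from_affine_coords(eta, alpha_basis):
    """Depth of Y via Theorem 4, with eta = (eta1, eta2, eta3) the affine
    coordinates of Y in the basis Y1, Y2, Y3 (eta1 + eta2 + eta3 = 1) and
    alpha_basis = (alpha1, alpha2, alpha3) the depths of the basis points."""
    return 1.0 / sum(e / a for e, a in zip(eta, alpha_basis))
```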

4 An Experiment

A simple experiment will illustrate how the theory may be used. In Fig. 4 is shown an
image of a corridor scene, containing a number of rectangular objects: two doors, part of
a wall, a board and the faces of a box. The doors are at approximate distances 6 m and
12 m from the camera. The dimensions of the box are 35 × 35 × 20 cm. The size of the
image is about 300 × 400 pixels, where the box occupies about 50 × 50 pixels.

Wall+doors
1 2 3 4 5 6 7 8
comp 1.00 1.01 1.19 1.20 2.08 2.09 2.23 2.23
meas 1.00 1.02 1.20 1.22 2.07 2.08 2.21 2.22

Board
9 10 11 12
comp 1.35 1.35 1.92 1.92
meas 1.35 1.36 1.90 1.91

Box
A B C D E F G
comp 1.02 1.01 1.02 1.00 1.07 1.05 1.05
meas 1.03 1.01 1.03 1.00 1.07 1.05 1.05

Fig. 4. A corridor scene. Measured and computed depth values.

The purpose is to investigate the capacity of the method, both as concerns exactness
and robustness. The pixel coordinates of the points of interest were picked out by
hand from the image. The wall and the two doors form a configuration similar to the joint
of Example 1, and may be treated as in Example 2, by means of Theorem 3. In the same
way, the depth values for the box are computed. Theorem 4 then gives the depth values
for the board. For the wall, the doors and the board, the depth values were normalised
against Point 1, while for the box, the normalisation was done against Point D. The so
normalised computed depth values are shown in the first lines of the respective tables in
Fig. 4.
For comparison, the distances from the camera to the marked scene points were
measured by a measuring tape. The values were normalised as described above. Since
these distance ratios are not exactly the same as depth, they have to be corrected by
means of coefficients, which depend on the angle of sight relative to the focal line. The so
obtained measured depth values are printed in the second lines of the respective tables in
Fig. 4. Since the distance to Point 1 is about 6 m, a deviation of 0.01 in depth corresponds
to about 6 cm in distance. This is also the estimated uncertainty in the measurements.
As can be seen, the computed depth values show good agreement with the measured
ones.

To get a feeling for the robustness, rectangularly distributed noise of ±1 pixel was
added to each coordinate of the points of interest above. New depth values were then
computed. In order to compare the homogeneous depth vectors, the normalised differences
between α_ref and α_new were computed, where α_ref stands for the computed depth values of Fig. 4. Table 1
shows the outcome of 10 random simulations. For the wall+doors, the results indicate
good robustness properties, while for the box the results are not equally favourable. The
latter isn't surprising, since the object is small and so distant that the perspective effects
are small.

Table 1. Effects of noise.
Wall+doors Box
1 2 3 4 5 6 7 8 A B C D E F G
-0.01 0.00 0.00-0.01 0.02-0.01 0.00-0.01 0.00 0.00 0.00-0.06 0.01-0.03 0.08
0.00 0.00 0.02 0.01 0.03-0.02 0.00-0.02 -0.04 0.02-0.01 0.00-0.01 0.00 0.04
0.01 0.00-0.01 0.00-0.01 0.00 0.00 0.01 -0.02-0.06 0.02 0.02 0.02 0.05-0.03
-0.01 0.00 0.00-0.01 0.02-0.01 0.02-0.01 0.03 0.01 0.05 0.05-0.05-0.02-0.05
0.02-0.02 0.01-0.01-0.01 0.01 0.00 0.01 -0.02-0.04 0.03-0.01 0.06 0.00-0.03
0.01 0.00 0.00 0.01 0.00-0.02 0.02-0.02 0.00-0.01 0.00 0.01 0.04 0.01-0.04
0.00 0.00 0.00 0.01 0.00 0.02-0.01-0.01 0.01 0.01-0.04 0.01 0.09 0.02-0.09
-0.01 0.01-0.01 0.01 0.00 0.04-0.03-0.01 -0.02 0.01-0.01 0.04-0.08 0.02 0.05
-0.02 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.03-0.06-0.04 0.00 0.01 0.06
0.00-0.01 0.01 0.00 0.03-0.03 0.01-0.02 -0.03 0.01-0.01 0.02-0.02 0.02 0.02

To summarise, using only the fact that the scene contains parallelograms, nothing about their
sizes, it has been possible to compute depths from the image. No knowledge at all is used
about the camera, only that the imaging process is projective. In fact, the camera used
in the experiment turned out to give about 5 % deficiency in the width and height scales.
However, since the method only uses affine properties, it is robust also to such errors, as
long as they can be modeled by affine transformations in the image plane. They need not
even be compensated for. The results of the experiment are accurate and seem to have
good robustness properties.

5 Degeneration

When analyzing an image of a polyhedral scene, difficulties with the interpretation may
occur, even for human observers. Some possible explanations are the occlusions and
accidental alignments that may occur in the image, without having a counterpart in the
scene. A computational algorithm, based on linear programming, for the interpretation
of line drawings is given in [12].
In this section, it will be indicated how the concepts of shape and depth can be used
to give a simple criterion for correctness. First some definitions. Let OY = (Y_1, ..., Y_f)
be a planar configuration, built up by a number of polygonal configurations. Then OY is
called an impossible picture if

    P : OX → OY, P a projectivity  ⟹  OX is a planar configuration.


By construction, cf. Theorem 3, all columns of S_Y belong to the shape s(OY). The
same holds for S_X, s(OX). From the depth consistency (4) it follows that S_Y and S_X have
the same rank. Hence dim s(OX) ≥ rank S_X = rank S_Y. From (2) it is known that the
m-point configuration OX is non-planar if and only if dim s(OX) = m − 4. Combining
these facts and definitions, we have proved the sufficiency part of the following theorem.
The necessity is omitted here.

Theorem 5. OY is an impossible picture if and only if rank S_Y ≥ m − 3.

Having an image of a true three-dimensional polyhedral scene, this rank condition
must be fulfilled. If it is violated because of noise, it may be possible to "deform" OY to
fulfill the condition. In doing this, not every deformation can be allowed. Let us say that a
deformation is admissible if it doesn't change the topological and shape properties of the
configuration, where the latter claim may be formulated: Two configurations OY and OY′
are topologically shape-equivalent iff for every choice of matrix S_Y in Theorem 3, there
exists a corresponding matrix S_Y′ for OY′, such that their non-vanishing elements have the
same distributions of signs. This gives a constructive criterion, possible to use in testing.
As a final definition, we say that OY is a correctable impossible picture if there exists an
admissible deformation which makes rank S_Y ≤ m − 4. An example of a configuration
with this property is given in Fig. 5. For a method to find admissible deformations, see
[7].
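As a small illustration (a sketch of our own, not taken from the paper), the rank test of Theorem 5 is immediate to implement:

```python
import numpy as np

def is_impossible_picture(S_Y, m, tol=1e-8):
    """Theorem 5: the planar configuration OY (m points, shape-basis matrix S_Y)
    is an impossible picture iff rank(S_Y) >= m - 3."""
    rank = np.linalg.matrix_rank(S_Y, tol=tol)
    return rank >= m - 3
```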

[Shape matrix of the truncated tetrahedron: rows (0.65, −0.60, 0.00), (−0.65, 0.00, −0.61), (0.00, 0.60, 0.54), (−1.00, 1.00, 0.00), (1.00, 0.00, 1.06), (0.00, −1.00, −1.00).]

Fig. 5. A truncated tetrahedron and its shape. The Reutersvärd-Penrose tribar.

A more severe situation is met when it is impossible to correct the picture by means
of admissible deformations. We then talk about an absolutely impossible picture. When
dealing with such an image, one knows that the topology of the object isn't what it
seems to be in the image. Accidental alignments or occlusions have occurred, and must
be discovered and loosened.
A celebrated example of an "impossible picture" in the human sense is the tribar of
Fig. 5. It is alternately called the "Reutersvärd tribar" or the "Penrose tribar", after two
independent discoveries (1934 and 1958, respectively). For historical facts, see the article
of Ernst in [1]. For this configuration it can be proved that there exists no admissible
deformation which makes the tribar fulfill the rank condition of Theorem 5. For more
details, see [10], [11]. In terms of the concepts introduced above, the discussion of this
section may be summarised:
- The truncated tetrahedron is a correctable impossible picture.
- The Reutersvärd-Penrose tribar is an absolutely impossible picture.

6 Discussion

Above a method for the computation of depth, modulo scale, from one single image of a
polyhedral scene has been presented, under the assumption of known point corresponden-
ces between scene and image. Only affine information about the scene is used, e.g. that the
objects contain parallelogram patches, nothing about their sizes. Other affine shapes may
be used as well. In the image, no absolute measurements are needed, only relative (affine)
ones. The image formation is supposed to be projective, but the method is insensitive to
affine deformations in the image plane. No camera parameters are needed. The problem
considered may be called an "affine calibration problem", with a solution in terms of
relative depth values. The weak assumptions give them good robustness properties. All
computations are linear.
The relative depth values may be combined with metrical information to solve the full
(metrical) calibration problem (cf. [8], [9]). That problem is usually solved by methods
that make extensive use of distances and angles, cf. [3] for an overview.
Relative depth information is also of interest in its own right. For instance, in the case
of rectangular patches in the scene, the relative depth values may be interpreted as the
"motion" of the camera relative a location from which the patch looks like a rectangle.
Looked upon in this way, our approach belongs to the same family as [2], [4], [6].
Crucial for the approach is the use of affine invariants (the 'shape'). In this respect
the work is related to methods for recognition and correspondence, cf. [5].
In the last part of the paper is sketched an approach to the line drawing interpre-
tation problem. Its relations to other methods, notably the one of [12], need further
investigations.
References

1. Coxeter, H.S.M., Emmer, M., Penrose, R., Teuber, M.L.: M.C. Escher: Art and Science.
Elsevier, Amsterdam (1986)
2. Faugeras, O.D.: What can be seen in three dimensions with an uncalibrated stereo rig?
Proc. ECCV92 (1992) (to appear)
3. Horn, B.K.P.: Robot Vision. MIT Press, Cambridge, MA. (1986)
4. Koenderink, J.J., van Doorn, A.J.: Affine Structure from Motion. J. of the Opt. Soc. of
America (1992) (to appear)
5. Lamdan, Y., Schwartz, J.T., Wolfson, H.J.: Affine Invariant Model-Based Object Recogni-
tion. IEEE Trans. Robotics and Automation 6 (1990) 578-589
6. Mohr, R., Morin, L., Grosso, E.: Relative positioning with poorly calibrated cameras. In
Proc. DARPA-ESPRIT Workshop on Applications of Invariance in Computer Vision (1991)
7. Persson, A.: A method for correction of images of origami/polyhedral objects. Proc. Swedish
Society for Automated Image Analysis. Uppsala, Sweden. (1992) (to appear)
8. Sparr, G., Nielsen, L.: Shape and mutual cross-ratios with applications to exterior, interior
and relative orientation. Proc. Computer Vision - ECCV90. Springer Verlag, Lect. Notes
in Computer Science (1990) 607-609
9. Sparr, G.: Projective invariants for affine shapes of point configurations. In Proc. DARPA-
ESPRIT Workshop on Applications of Invariance in Computer Vision (1991)
10. Sparr, G.: Depth computations from polyhedral images, or: Why is my computer so dis-
tressed about the Penrose triangle. CODEN:LUFTD2(TFMA-91)/7004, Lund (1991)
11. Sparr, G.: On the "reconstruction" of impossible objects. Proc. Swedish Society for Auto-
mated Image Analysis. Uppsala, Sweden. (1992) (to appear)
12. Sugihara, K.: Mathematical Structures of Line Drawings of Polyhedrons - Toward Man-
Machine Communication by Means of Line Drawings. IEEE Trans. Pattern Anal. Machine
Intell. 4 (1982) 458-469
This article was processed using the LaTeX macro package with ECCV92 style
Parallel Algorithms for the Distance Transformation

Hugo Embrechts * and Dirk Roose

Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001


Leuven, Belgium

Abstract. The distance transformation (DT) is a basic operation in image
analysis where it is used for object recognition. A DT converts a binary
image consisting of foreground pixels and background pixels into an image
where all background pixels have a value equal to the distance to the nearest
foreground pixel.
We present several approaches for the parallel calculation of the distance
transform based on the "divide-and-conquer" principle. The algorithms and
their performance on an iPSC/2 are discussed for the city block (CB)
distance, which is an approximation of the Euclidean distance.

1 Introduction

A DT converts a binary image consisting of foreground and background pixels into an


image where all background pixels have a value equal to the distance to the nearest
foreground pixel.
Computing the Euclidean distance from a pixel to a set of foreground pixels is es-
sentially a global operation and therefore needs a complicated and time-consuming algo-
rithm. However, reasonable approximations to the Euclidean distance measure exist that
allow algorithms to consider only a small neighbourhood at a time. They are based on
the idea that the global distances are approximated by propagating local distances, i.e.
distances between neighbouring pixels. T w o of the distance measures proposed in [I, 2]
are the city block distance and the chamfer 3-4 distance. They are defined by the masks
of Fig. 1. The D T applied to an image with one foreground pixel centered at the middle
of the image is shown in Fig. 2. For the C B distance we present parallel algorithms.
The D T is a basic operation in image analysis where it is used for object recognition.
It can be used for computing skeletons in a non-iterative way. Further applications are
merging and segmentation, clustering and matching [1].

2 The Sequential Algorithm

The sequential algorithm is a known algorithm [1] consisting of two passes during which
the image is traversed, once from top to bottom and from left to right, and the second
time in reverse order. When a pixel is processed, its distance value (infinity if not yet
determined) is compared to the distance value of a number of neighbours augmented by
their relative distance and is replaced by the smallest resulting value. This causes the
distance values to propagate from the object boundaries in the direction of the scan and
yields, after the second pass, the correct DT-values.
* The following text presents research results of the Belgian Incentive Program "Information
Technology" - Computer Science of the future, initiated by the Belgian State - Prime Minister's
Service - Science Policy Office. The scientific responsibility is assumed by its authors.
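For concreteness, a minimal Python rendering of the two-pass city block scheme described above (a sketch of our own, not the authors' code; it assumes a numpy array with nonzero values marking foreground):

```python
import numpy as np

def city_block_dt(binary_image):
    """Two-pass city block distance transform.

    binary_image: 2D array, nonzero = foreground.
    Returns the distance of every pixel to the nearest foreground pixel.
    """
    h, w = binary_image.shape
    INF = h + w   # larger than any possible city block distance in the image
    d = np.where(binary_image != 0, 0, INF).astype(int)
    # Forward pass: propagate from the neighbours above and to the left.
    for y in range(h):
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 1)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    # Backward pass: propagate from the neighbours below and to the right.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 1)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d
```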

City Block distance            Chamfer 3-4 distance

         +1                        +4  +3  +4
     +1   0  +1                    +3   0  +3
         +1                        +4  +3  +4

Fig. 1. These masks show, for the indicated distance measures, the distance between the central
pixel and the neighbouring pixels. The distance between two image points a and b is defined
as the sum of the distances between neighbouring pixels in the path connecting a and b that
minimizes this sum.

Fig. 2. The DT of an image with one foreground pixel centered in the middle of the image,
for the City Block and Chamfer 3-4 distances. Growing distance is represented by a greytone
repeatedly varying from black to white (to accentuate the contours of the DT).

3 Introduction to the Parallel Approach

Parallelism is introduced by the 'divide-and-conquer' principle. This means that the image
is subdivided into as many subregions as there are processors available; the operation
to be parallelized, in our case the DT, is computed on each subregion separately and
these local DTs have to be used to compute the global DT on the image. Let LDT (local
DT) denote the DT applied to a subregion or, where indicated, a union of neighbouring
subregions, and let GDT (global DT) denote the DT applied to the whole image.
The algorithm consists of the next three steps:
I. On each subregion the LDT is computed for the boundary pixels of that subregion.
II. The GDT values for the boundary pixels are computed out of the LDT values.
III. On each subregion the GDT values for the internal pixels are determined out of the
GDT values for the boundary pixels and the local image information. We call this
part IDT (internal DT).
The first step could be done by executing the sequential DT algorithm on each subregion
and retaining the boundary values. However, in [3] we present a shorter one-pass
algorithm which traverses each pixel at most once.

For step II we consider two possible solutions. In the first solution (hierarchical
algorithm) we consider a sequence of gradually becoming coarser partitions pl (l =
1 , 2 , . . . , L = log2p ) of the image, with the finest partition Pl being the chosen parti-
tion of the image containing as many subregions as there are processors available. Each
of the other partitions p~ (l > 1) consists of subregions that are the union of two subre-
gions of P~-l. The coarsest partition PL contains as only subregion the image itself. The
LDT on partition Pz is defined as the result of the DT on each of the subregions of Pz
separately. In this approach we calculate from the LDT on Pz for the boundary pizel8 of
its subregions the corresponding values on Pl+l for l -- 1, 2 , . . . , L - 1. The values of the
LDT on partition PL are by definition the GDT values. Then the GDT values for the
boundary pixels of the subregions of Pz are computed for decreasing I. This approach is
similar to the hierarchical approach we used for component labelling [4].
These computations can be implemented in two ways. In the first approach (agglom-
erated cornputatio~t), on a particular recursion level l each subregion of pz is processed
by one processor. This means that processors become idle on higher recursion levels. In
an alternative implementation (distributed computation), pixel values of a subregion are
not agglomerated into one processor, but are distributed in a way that each processor
contains a part of the boundary of one subregion.
The second solution (directional algorithm) for step II consists of an inter-subregion
propagation in successive directions. The feasibility of this approach, however, and the
complexity of the resulting algorithm depend on the distance measure used.
Step III of the parallel algorithm is done by executing the sequential algorithm
on each subregion, starting from the original image and the GDT values obtained in
step II.
We refer to [3] for a full description and correctness proof of the algorithms.

4 Asymptotical Complexity

The calculation of the LDT-values of the boundary pixels of a subregion, as well as the
IDT, is local and can be performed in an amount of time asymptotically proportional to
the number of pixels of the image.
The calculation of GDT-values out of LDT-values for the border pixels of the subre-
gions is global and consists of computation and communication. The latter can be divided
into the initiation and the actual transfer of messages. A summary of the complexity fig-
ures for the global operations, derived in this section, is shown in table 1. We assume an
image of n × n pixels and p processors.

                        hierarchical algorithm              directional alg.
                        agglomerated      distributed
  t_startup             O(log p)          O(log^2 p)        O(log p)
  t_transfer            O(n)              O(n/√p)           O(n/√p)
  t_comp                O(n)              O(n/√p)           O(n/√p)
Table 1. A summary of the complexity analysis of the global computations of the presented
DT algorithms for the CB distance.

The Hierarchical Algorithm.

Agglomerated Computation. Since the number of messages sent on each recursion level
is constant and since initiating a message takes constant time, the total start up time is
proportional to the number of recursion levels L = log 2 p.
The transfer time is proportional to the amount of data sent. The amount of data
sent on recursion level l is proportional to the boundary size of a subregion of p_l, being

    S_l = O( (n/√p) 2^{l/2} ).    (1)

Therefore the total transfer time is t_transfer = O( Σ_{l=1}^{L} S_l ) = O(n). The computational
complexity is also O(n), as the data are processed in linear time.

Distributed Computation. On recursion level l processors cooperate in groups of 2^l pro-
cessors to compute the LDT on p_{l+1} on the borders of the subregions of p_{l+1}. If the
CB distance measure is used, the operations to be done on recursion level l can be done
in O(l) steps. In each of these steps an amount of data proportional to the boundary
length of the subregions of p_l divided by the number of processors 2^l is transferred and
processed:

    D_l = O( (n/√p) 2^{-l/2} ).    (2)

The total start-up time is therefore t_startup = O( Σ_{l=1}^{L} l ) = O(log^2 p) and the total
amount of execution and transfer time is t_transfer = t_comp = O( Σ_{l=1}^{L} D_l l ) = O(n/√p).

The Directional Algorithm. The directional algorithm consists of calculating a num-
ber of partial minima, which can be done in O(log p) communication steps requiring in
total O(n/√p) transfer and processing time. See [3].

5 Timing and Efficiency Results

We used as test images a number of realistic images and a few artificial images, among
which the one of Fig. 2. The execution time of the sequential DT algorithm on one node
of the iPSC/2 is proportional to the number of pixels of the image and is typically about
800 ms for a 256 x 256 image. For images of this size the LDT is typically 100 ms.
The parallel efficiency, as a function of the size of the image, is shown in Fig. 3 for
a sample image. From the asymptotical complexity figures of section 4 we learn that for
large image sizes the execution time of the global computations is negligible with respect
to the execution time of the IDT and the LDT parts of the algorithm. The ratio of
the latter two mainly determines the parallel efficiency. For smaller images the LDT part
gets more important with respect to the IDT part. The image size for which the two
parts take an equal amount of time is typically 32 pixels for both distance measures. For
smaller images also the global computations get more important.
A factor that influences the efficiency too, is the load imbalance of the algorithm. It
occurs, when a part of the algorithm takes more time on one processor than on the others
and the processors have to wait for one another. A measure for the load imbalance of a
part of the algorithm is

    I = t^max / t^avg    (3)

(Fig. 3 plots not reproduced here. Legend: processor configurations 2 x 1, 4 x 2 and 8 x 4, i.e. 2, 4 and 8 processors; left panel: hierarchical algorithm, right panel: directional algorithm; vertical axis: parallel efficiency in percent (0-100); horizontal axis: size of the image (64-1024).)
Fig. 3. The parallel efficiency, as a function of the image size, for the image of Fig. 2, when the
hierarchical algorithm with agglomerated calculation or the directional algorithm is used.

with t^max and t^avg the maximal and average execution times of the part of the algorithm
under investigation. We can distinguish two sources of load imbalance.
A first source of load imbalance is caused by the data dependence of the LDT part of
the algorithm. This is practically unavoidable, because for most images at least one sub-
region contains a considerable amount of background pixels and determines the execution
time of the LDT part of the algorithm.
A second source of load imbalance is the data dependence of the IDT algorithm. This
part of the load imbalance I grows with the number of subregions. However, we can find
a hard upper limit for the possible load imbalance similar to the analysis in [4].

Acknowledgements

We wish to thank Oak Ridge National Laboratory for letting us use their iPSC/2 machine.

References

1. Borgefors, G.: Distance transformations in arbitrary dimensions. Computer Vision, Graphics
   and Image Processing 27(3) (1984) 321-345
2. Borgefors, G.: Distance transformations in digital images. Computer Vision, Graphics and
   Image Processing 34(3) (1986) 344-371
3. Embrechts, H., Roose, D.: Parallel algorithms for the distance transformation. Technical
   Report TW151 (1991) Katholieke Universiteit Leuven
4. Embrechts, H., Roose, D., Wambacq, P.: Component labelling on a MIMD multiprocessor.
   Computer Vision, Graphics and Image Processing: Image Understanding, to appear
This article was processed using the LaTeX macro package with ECCV92 style
A Computational Framework for Determining Stereo
Correspondence from a Set of Linear Spatial Filters *

David G. Jones 1 and Jitendra Malik 2

1 McGill University, Dept. of Electrical Engineering, Montréal, PQ, Canada H3A 2A7
2 University of California, Berkeley, Computer Science Division, Berkeley, CA USA 94720

Abstract. We present a computational framework for stereopsis based


on the outputs of linear spatial filters tuned to a range of orientations and
scales. This approach goes beyond edge-based and area-based approaches
by using a richer image description and incorporating several stereo cues
that have previously been neglected in the computer vision literature.
A technique based on using the pseudo-inverse is presented for charac-
terizing the information present in a vector of filter responses. We show
how in our framework viewing geometry can be recovered to determine the
locations of epipolar lines. An assumption that visible surfaces in the scene
are piecewise smooth leads to differential treatment of image regions corre-
sponding to binocularly visible surfaces, surface boundaries, and occluded
regions that are only monocularly visible. The constraints imposed by view-
ing geometry and piecewise smoothness are incorporated into an iterative
algorithm that gives good results on random-dot stereograms, artificially
generated scenes, and natural grey-level images.

1 Introduction

Binocular stereopsis is based on the cue of disparity -- two eyes (or cameras) receive
slightly different views of the three-dimensional world. This disparity cue, which includes
differences in position, both horizontal and vertical, as well as differences in orientation
or spacing of corresponding features in the two images, can be used to extract the three-
dimensional structure in the scene. This depends, however, upon first obtaining a solution
to the correspondence problem. The principal constraints that make this feasible are:

1. Similarity of corresponding features in the two views.


2. Viewing geometry which constrains corresponding features to lie on epipolar lines.
3. Piecewise continuity of surfaces in the scene because of which nearby points in the
scene have nearby values of disparity. The disparity gradient constraint (Burt and
Julesz, 1980; Pollard et al., 1985) and the ordering constraint (Baker and Binford,
1982) are closely related.

Different approaches to the correspondence problem exploit these constraints in different


ways. The two best studied approaches are area correlation (Hannah, 1974; Gennery,
1977; Moravec, 1977; Barnard and Thompson, 1980) and edge matching (Marr and Pog-
gio, 1979; Grimson, 1981; Baker and Binford, 1982; Pollard et al., 1985; Medioni and
Nevatia, 1985; Ayache and Faverjon, 1987).

* This work has been supported by a grant to DJ from the Natural Sciences and Engineering
Research Council of Canada (OGP0105912) and by a National Science Foundation PYI award
(IRI-8957274) to JM.

The difficulties with approaches based on area correlation are well known. Because of
the difference in viewpoints, the effects of shading can give rise to differences in brightness
for non-lambertian surfaces. A more serious difficulty arises from the effects of differing
amounts of foreshortening in the two views whenever a surface is not strictly fronto-
parallel. Still another difficulty arises at surface boundaries, where a depth discontinuity
may run through the region of the image being used for correlation. It is not even guar-
anteed in this case that the computed disparity will lie within the range of disparities
present within the region.
In typical edge-based stereo algorithms, edges are deemed compatible if they are near
enough in orientation and have the same sign of contrast across the edge. To cope with
the enormous number of false matches, a coarse-to-fine strategy may be adopted (e.g.,
Mart and Poggio, 1979; Grimson, 1981). In some instances, additional limits can be im-
posed, such as a limit on the rate at which disparity is allowed to change across the
image (Mayhew, 1983; Pollard et al., 1985). Although not always true, assuming that
corresponding edges must obey a left-to-right ordering in both images can also be used
to restrict the number of possible matches and lends itself to efficient dynamic program-
ming methods (Baker and Binford, 1982). With any edge-based approach, however, the
resulting depth information is sparse, available only at edge locations. Thus a further
step is needed to interpolate depth across surfaces in the scene.
A third approach is based on the idea of first convolving the left and right images with
a bank of linear filters tuned to a number of different orientations and scales (e.g., Kass,
1983). The responses of these filters at a given point constitute a vector that characterizes
the local structure of the image patch. The correspondence problem can be solved by
seeking points in the other view where this vector is maximally similar.
Our contribution in this paper is to develop this filter-based framework. We present
techniques that exploit the constraints arising from viewing geometry and the assumption
that the scene is composed of piecewise smooth surfaces. A general viewing geometry is
assumed, with the optical axes converged at a fixation point, instead of the simpler
case of parallel optical axes frequently assumed in machine vision. Exploiting piecewise
smoothness raises a number of issues - - the correct treatment of depth discontinuities,
and associated occlusions, where unpaired points lie in regions seen only in one view.
We develop an iterative framework (Fig. 1) which exploits all these constraints to obtain
a dense disparity map. Our algorithm maintains a current best estimate of the viewing
parameters (to constrain vertical disparity to be consistent with epipolar geometry), a
visibility map (to record whether a point is binocularly visible or occluded), and a scale
map (to record the largest scale of filter not straddling a depth discontinuity).
(Fig. 1 block diagram, not reproduced here: a stereo pair of images feeds estimates of the viewing geometry (viewing parameters), occluded regions (visibility map) and depth boundaries (scale map), which are refined iteratively.)

Fig. 1. Iteratively refining estimates of stereo disparity.



This paper is organized as follows. Section 2 gives an introduction to the use of


filtering as a first stage of visual processing. A technique based on using the pseudo-inverse
is presented for characterizing the information present in a vector of filter responses.
Section 3 demonstrates the performance of a simple-minded matching strategy based
on just comparing filter responses. This helps to motivate the need for exploiting the
additional constraints imposed by the viewing geometry and piecewise smoothness. These
constraints are developed further in Section 4. In section 5 the complete algorithm is
presented. Section 6 concludes with experimental results.

2 Local Analysis of Image Patches by Filtering

In order to solve the correspondence problem, stereo algorithms attempt to match features
in one image with corresponding features in the other. Central to the design of these
algorithms are two choices: What are the image features to be matched? How are these
features compared to determine corresponding pairs?
It is important to recall that stereo is just one of many aspects of early visual pro-
cessing: stereo, motion, color, form, texture, etc. It would be impractical for each of
these to have its own specialized representation different from the others. The choice of
a "feature" to be used as the basis for stereopsis must thus be constrained as a choice
of the input representation for many early visual processing tasks, not just stereo. For
the human visual system, a simple feature such as a "pixel" is not even available in the
visual signals carried out of the eye. Already the pattern of light projected on the retina
has been sampled and spatially filtered. At the level of visual inputs to the cortex, vi-
sual receptive fields are well approximated as linear spatial filters, with impulse response
functions that are the Laplacian of a two-dimensional Gaussian, or simply a difference of
Gaussians. Very early in cortical visual processing, receptive fields become oriented and
are well approximated by linear spatial filters, with impulse response functions that are
similar to partial derivatives of a Gaussian (Young, 1985).
Since "edges" are derived from spatial filter outputs, the detection and localization of
edges may be regarded as an unnecessary step in solving the correspondence problem. A
representation based on edges actually discards information useful in finding unambigu-
ous matches between image features in a stereo pair. An alternative approach, explored
here, is to treat the spatial filter responses at each image location, collectively called
the filter response vector, as the feature to be used for computing stereo correspondence.
Although this approach is loosely inspired by the current understanding of processing
in the early stages of the primate visual system (for a recent survey, DeValois and DeVal-
ois, 1988), the use of spatial filters may also be viewed analytically. The filter response
vector characterizes a local image region by a set of values at a point. This is similar to
characterizing an analytic function by its derivatives at a point. From such a representa-
tion, one can use a Taylor series approximation to determine the values of the function
at neighboring points. Because of the commutativity of differentiation and convolution,
the spatial filters used are in fact computing "blurred derivatives" at each point. The
advantages of such a representation have been described in some detail (Koenderink and
van Doorn, 1987; Koenderink, 1988). Such a representation provides an efficient basis
for various aspects of early visual processing, making available at each location of the
computational lattice, information about a whole neighborhood around the point.
The primary goal in using a large number of spatial filters, at various orientations,
phases, and scales is to obtain rich and highly specific image features suitable for stereo
matching, with little chance of encountering false matches. At this point, one might be

tempted to formulate more precise, mathematical criteria and to attempt to determine


an optimal set of filters. The alternative viewpoint taken here is that a variety of filter
sets would each be adequate and any good stereo algorithm should not depend critically
upon the precise form of the spatial filters chosen.

2.1 The Filter Set


The implementation and testing of these ideas requires some particular set of filters to
be chosen, though at various times, alternative filters to those described below have been
used, always giving more or less similar results. The set of filters used consisted of rotated
copies of filters with impulse responses F(x, y) = G_n(x) G_0(y), where n = 1, 2, 3 and
G_n is the n-th derivative of a Gaussian. The scale, σ, was chosen to be the same in both
the x and y directions. Filters at seven scales were used, with the area of the filters
increasing by a factor of two at each scale. In terms of pixels, the filters are w × w, with
w ∈ {3, 5, 7, 10, 14, 20, 28}, and w = [8σ]. The filters at the largest scale are shown in
Fig. 2. Smaller versions of the same filters are used at finer scales. Nine filters at seven
scales would give 63 filters, except at the finest scale the higher derivatives are useless
because of quantization errors, and so were discarded.
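As an illustration of such a filter set, the following Python/numpy sketch builds rotated Gaussian-derivative filters at several scales. The σ values, the normalization and the omission of the finest-scale pruning mentioned above are assumptions made for the sketch, not values taken from the paper.

import numpy as np

def gaussian_derivative_filter(n, theta, sigma, size):
    """Oriented filter G_n(u) G_0(v): the n-th derivative of a Gaussian along u,
    a plain Gaussian along v, with (u, v) the image axes rotated by theta."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) - y * np.sin(theta)
    v = x * np.sin(theta) + y * np.cos(theta)
    g = lambda t: np.exp(-t ** 2 / (2.0 * sigma ** 2))
    if n == 1:
        du = -u / sigma ** 2 * g(u)
    elif n == 2:
        du = (u ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g(u)
    else:  # n == 3
        du = (3.0 * u / sigma ** 4 - u ** 3 / sigma ** 6) * g(u)
    f = du * g(v)
    return f / np.abs(f).sum()          # crude normalization (an assumption)

def filter_bank(sigmas=(0.375, 0.625, 0.875, 1.25, 1.75, 2.5, 3.5)):
    """Nine filters per scale: the n-th derivative gets n + 1 orientations
    (see the steerability remark in Sect. 2.2), so 2 + 3 + 4 = 9."""
    bank = []
    for sigma in sigmas:
        size = int(np.ceil(8 * sigma))   # support w = [8*sigma] pixels
        for n in (1, 2, 3):
            for k in range(n + 1):
                bank.append(gaussian_derivative_filter(n, k * np.pi / (n + 1), sigma, size))
    return bank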

2.2 Singular Value Decomposition


Regardless of why a particular set of filters may be chosen, it is useful to know that there
is an automatic procedure that can be used to evaluate the degree to which the chosen
filters are independent. Any filter that can be expressed as the weighted sum of others
in the set is redundant. Even filters for which this is not strictly true, but almost true
may be a poor choice, especially where this may lead to numerical instability in some
computations involving filter responses. The singular value decomposition provides just
this information.
Any m × n matrix A may be expressed as the product of an m × m matrix U, an
m × n diagonal matrix Σ, and an n × n matrix V^T, where the columns of U and V are
orthonormal, and the entries in Σ are positive or zero. This decomposition is known as
the singular value decomposition. The diagonal entries of the matrix Σ are called singular
values and satisfy σ_1 ≥ σ_2 ≥ ... ≥ σ_k ≥ 0. More details may be found in a standard
linear algebra or numerical analysis text (e.g., Golub and Van Loan, 1983).
A spatial filter with finite impulse response may be represented as an n × 1 column
vector, F_i, by writing out its entries row by row. Here n is the number of pixels in the
support of the filter. If an image patch (of the same size and shape as the support of
the filter) is also represented as an n × 1 column vector, then the result of convolving
the image patch by the filter is simply the inner product of these two vectors. Taken
together, a set of spatial filters forms a matrix F. This is a convenient representation
of the linear transformation that maps image patches to a vector of filter responses. For
an image patch represented as a vector I, the filter response vector is simply v = F^T I.
Applying the singular value decomposition yields F^T = U Σ V^T.
The number of non-zero entries in Σ is the rank, r, or the dimension of the vector
space spanned by the filters. The first r columns of V form an orthonormal basis set for
this vector space, ranked in order of the visual patterns to which this particular set of
filters is most sensitive. The corresponding singular values indicate how sensitive. The
remaining columns form an orthonormal basis for the null space of F -- those spatial
patterns to which F is entirely insensitive. The matrix U may be thought of as an
orthonormal basis set for the space of possible filter response vectors, or merely as a

change of basis matrix. As an example of this decomposition, the orthonormal basis for
the set of filters in Fig. 2A is shown in Fig. 2B.

Fig. 2. A. Linear spatial filter set. B. Orthonormal basis set for vector space spanned by filters
in A.

One telltale sign of a poorly chosen set of filters is the presence of singular values that
are zero, or very close to zero. Consider, for example, a filter set consisting of the first
derivative of a Gaussian at four different orientations, θ:

    G_{1,θ}(x, y) = G_1(u) G_0(v) ;  u = x cos θ - y sin θ,  v = x sin θ + y cos θ.

The vector space spanned by these four filters is only two dimensional. Only two filters
are needed, since the other two may be expressed as the weighted sum of these, and
thus carry no additional information. If one did not already know this analytically, this
procedure quickly makes it apparent. Such filters for which responses at a small number
of orientations allow the easy computation of filter responses for other orientations have
been termed steerable filters (Koenderink, 1988; Freeman and Adelson, 1991; Perona,
1991). For Gaussian derivatives in particular, it turns out that n + 1 different orientations
are required for the n-th Gaussian derivative.
As a further example, the reader who notes the absence of unoriented filters in Fig. 2A
and is tempted to enrich the filter set by adding a ∇²G, Laplacian of Gaussian, filter
should think twice. This filter is already contained in the filter set in the sense that it may
be expressed as the weighted sum of the oriented filters G_{2,θ}(x, y). Similar filters, such
as a difference of Gaussians, may not be entirely redundant, but they result in singular
values close to zero, indicating that they add little to the filter set.
At the coarsest scales, filter responses vary quite smoothly as one moves across an
image. For this reason, the filter response at one position in the image can quite accurately
be computed from filter responses at neighboring locations. This means it is not strictly
necessary to have an equal number of filters at the coarser scales, and any practical
implementation of this approach would take advantage of this by using progressively
lower resolution sampling for the larger filter scales. Regardless of such an implementation
decision, it may be assumed that the output of every filter in the set is available at every
location in the image, whether it is in fact available directly or may be easily computed
from the outputs of a lower resolution set of filters.
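The redundancy check described above can be carried out directly with a numerical SVD. The sketch below (the function name, zero-padding and zero threshold are illustrative assumptions) stacks the filters as columns of F, decomposes F^T and reports the rank and singular values; patch_size must cover the largest filter support.

import numpy as np

def analyze_filter_set(bank, patch_size):
    """Embed each filter, centred, in a patch_size x patch_size support, stack
    them as the columns of F, and take the SVD of F^T = U Sigma V^T."""
    cols = []
    for f in bank:
        padded = np.zeros((patch_size, patch_size))
        r0 = (patch_size - f.shape[0]) // 2
        c0 = (patch_size - f.shape[1]) // 2
        padded[r0:r0 + f.shape[0], c0:c0 + f.shape[1]] = f
        cols.append(padded.ravel())            # entries written out row by row
    F = np.stack(cols, axis=1)                 # n_pixels x n_filters
    U, s, Vt = np.linalg.svd(F.T, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0]))       # near-zero values flag redundant filters
    return F, U, s, Vt, rank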

2.3 Image Encoding and Reconstruction

What information is actually carried by the filter response vector at any given position
in an image? This important question is surprisingly easy to answer. The singular value
decomposition described earlier provides all that is necessary for the best least-squares
reconstruction of an image patch from its filter response vector. Since v = F^T I, and
F^T = U Σ V^T, the reconstructed image patch can be computed using the generalized
inverse (or the Moore-Penrose pseudo-inverse) of the matrix F^T:

    I' = V (1/Σ) U^T v

The matrix 1/Σ is a diagonal matrix obtained from Σ by replacing each non-zero diagonal
entry σ_i by its reciprocal, 1/σ_i.
An example of such a reconstruction is given in Fig. 3. The finest detail is preserved
in the center of the patch where the smallest filters are used. The reconstruction is pro-
gressively less accurate as one moves away from the center. Because there are fewer
filters than pixels in the image patch to be reconstructed, the reconstruction is necessar-
ily incomplete. The high quality of the reconstructed image, however, confirms the
fact that most of the visually salient features have been preserved. The reduction in the
number of values needed to represent an image patch means this is an efficient encoding
- - not just for stereo, but for other aspects of early visual processing in general. Since
this same encoding is used throughout the image, this notion of efficiency should be used
with caution. In terms of merely representing the input images, storing a number of filter
responses for each position in the image is clearly less efficient than simply storing the
individual pixels. In terms of carrying out computations on the image, however, there
is a considerable savings for even simple operations such as comparing image patches.
Encoded simply as pixels, comparing 30 × 30 image regions requires 900 comparisons.
Encoded as 60 filter responses, the same computation requires one-fifteenth as much
effort.
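A minimal sketch of this reconstruction, reusing the matrix F from the previous sketch (the singular-value threshold is an assumption):

import numpy as np

def reconstruct_patch(F, v, patch_size):
    """Least-squares reconstruction I' = V (1/Sigma) U^T v of an image patch
    from its filter response vector v = F^T I (pseudo-inverse of F^T)."""
    U, s, Vt = np.linalg.svd(F.T, full_matrices=False)
    s_inv = np.where(s > 1e-10 * s[0], 1.0 / s, 0.0)   # reciprocals of the non-zero entries
    I_rec = Vt.T @ (s_inv * (U.T @ v))
    return I_rec.reshape(patch_size, patch_size)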

Fig. 3. Image reconstruction. Two example image patches (left), were reconstructed (right)
from spatial filter responses at their center. Original image patches masked by a Gaussian
(middle) are shown for comparison.

3 Using Filter Outputs for Matching

How should filter response vectors be compared? Although corresponding filter response
vectors in the two views should be very similar, differences in foreshortening and shading
mean that they will rarely be identical. A variety of measures can be used to compare
two vectors, including the angle between them, or some norm of their vector difference.
These and similar measures are zero when the filter response vectors are identical and
otherwise their magnitude is proportional to some aspect of the difference between po-
tentially corresponding image patches. It turns out that any number of such measures
do indistinguishably well at identifying corresponding points in a pair of stereo images,
except at depth discontinuities. Near depth discontinuities, the larger spatial filters lie
across an image patch containing the projection of more than one surface. Because these
surfaces lie at different depths and thus have different horizontal disparities, the filter
responses can differ considerably in the two views, even when they are centered on points
that correspond. While the correct treatment of this situation requires the notion of an
adaptive scale map (developed in the next section), it is helpful to use a measure such
as the L1 norm, the sum of absolute differences of corresponding filter responses, which
is less sensitive to the effect of such outliers than the L2 norm.

    e_m = Σ_k | F_k * I_r(i, j) - F_k * I_l(i + h_r, j + v_r) |

This matching error e_m is computed for a set of candidate choices of (h_r, v_r) in a win-
dow determined by a priori estimates of the range of horizontal and vertical disparities.
The (h_r, v_r) value that minimizes this expression is taken as the best initial estimate
of positional disparity at pixel (i,j) in the right view. This procedure is repeated for
each pixel in both images, providing disparity maps for both the left and right views.
Though these initial disparity estimates can be quite accurate, they can be substantially
improved using several techniques described in the next section.
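A brute-force sketch of this initial matching step is given below. It assumes precomputed response arrays of shape (rows, cols, n_filters), a row/column indexing convention chosen for the sketch, and wrap-around at image borders for brevity; none of these details is prescribed by the paper. Running it once per view yields the left and right disparity maps used below.

import numpy as np

def initial_disparity(resp_other, resp_this, h_range, v_range):
    """For every pixel of 'this' view, find the offset (h, v) into the other
    view that minimizes the L1 difference e_m between filter response vectors."""
    rows, cols, _ = resp_this.shape
    best_err = np.full((rows, cols), np.inf)
    best_h = np.zeros((rows, cols), dtype=int)
    best_v = np.zeros((rows, cols), dtype=int)
    for h in h_range:
        for v in v_range:
            # shifted[i, j] = resp_other[i + v, j + h]  (wrap-around at the borders)
            shifted = np.roll(resp_other, shift=(-v, -h), axis=(0, 1))
            err = np.abs(resp_this - shifted).sum(axis=2)   # L1 over the filter index k
            better = err < best_err
            best_err[better] = err[better]
            best_h[better] = h
            best_v[better] = v
    return best_h, best_v, best_err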
An implementation of this approach using the outputs of a number of spatial filters
at a variety of orientations and scales as the basis for establishing correspondence has
proven to give quite good results, for random-dot stereograms, as well as natural and
artificial grey-level images. Some typical examples are presented here.
The recovered disparity map for a Julesz random-dot stereogram is presented in
Fig.4A. The central square standing out in depth is clearly detected. Disparity values
at each image location are presented as grey for zero horizontal disparity, and brighter
or darker shades for positive or negative disparities. Because these are offsets in terms
of image coordinates, the disparity values for corresponding points in the left and right
images should have equal magnitudes, but opposite signs. Whenever the support of the
filter set lies almost entirely on a single surface, the disparity estimates are correct.
Even close to depth discontinuities, the recovered disparity is quite accurate, despite the
responses from some of the larger filters being contaminated by lying across surfaces at
different depths.
In each view, there is a narrow region of the background just to one side of the near
central square that is visible only in one eye. In this region, there is no corresponding
point in the other view and the recovered disparity estimates appear as noise. Methods
for coping with these initial difficulties are discussed in later sections. In the lower panels
of the same figure, the measure of dissimilarity, e,n, between corresponding filter response
vectors is shown, with darker shades indicating larger differences. Larger differences are
clearly associated with depth discontinuities.

Fig. 4. Initial disparity estimates: random-dot stereogram and fruit. For the stereo pairs shown
(top), the recovered disparity map (middle) and dissimilarity or error map (bottom) are shown.
(fruit images courtesy Prof. N. Ahuja, Univ. Illinois)

When approached as a problem of determining which black dot in one view cor-
responds with which black dot in the other, the correspondence problem seems quite
difficult. In fact, Julesz random-dot stereograms are among the richest stimuli - - con-
taining information at all orientations and scales. When the present approach based on
spatial filters is used, the filter response vector at each point proves to be quite distinctive,
making stereo-matching quite straightforward and unambiguous.
As an example of a natural grey-level image, a stereo pair of fruit lying on a table
cloth is shown in Fig. 4B. The recovered disparity values clearly match the shapes of
the familiar fruit quite well. Once again, some inaccuracies are present right at object
boundaries. The measure of dissimilarity, or error shown at the bottom of the figure
provides a blurry outline of the fruit in the scene. A mark on the film, present in one
view and not the other (on the cantaloupe) is also clearly identified in this error image.
As a final example, a ray-traced image of various geometric shapes in a three-sided
room is depicted in Fig. 5. For this stereo pair, the optical axes are not parallel, but
converged to fall on a focal point in the scene. This introduces vertical disparities between
corresponding points. Estimated values for both the horizontal and vertical disparities
are shown. Within surfaces, recovered disparity values are quite accurate and there are
some inaccuracies right at object boundaries. Just to the right of the polyhedron in this
scene is a region of the background visible only in one view. The recovered disparity
values are nonsense, since even though there is no correct disparity, this method will
always choose one candidate as the "best". Another region in this scene where there

are some significant errors is along the room's steeply slanted left wall. In this case, the
large differences in foreshortening between the two views poses a problem, since the filter
responses at corresponding points on this wall will be considerably different. A method
for handling slanted surfaces such as this has been discussed in detail elsewhere (Jones,
1991; Jones and Malik, 1992).

Fig. 5. Initial disparity estimates: a simple raytraced room. For the stereo pair (top), the
recovered estimates of the horizontal (middle) and vertical (bottom) components of positional
disparity are shown.

4 Additional constraints for solving correspondence

4.1 Epipolar Geometry
By virtue of the basic geometry involved in a pair of eyes (or cameras) viewing a three-
dimensional scene, corresponding points must always lie along epipolar lines in the im-
ages. These lines correspond to the intersections of an epipolar plane (the plane through

a point in the scene and the nodal points of the two cameras) with the left and right
image planes. Exploiting this epipolar constraint reduces an initially two-dimensional
search to a one-dimensional one. Obviously determination of the epipolar lines requires
a knowledge of the viewing geometry.
The core ideas behind the algorithms to determine viewing geometry date back to
work in the photogrammetry community in the beginning of this century (for some histor-
ical references, Faugeras and Maybank, 1990) and have been rediscovered and developed
in the work on structure from motion in the computational vision community. Given a
sufficient number of corresponding pairs of points in two frames (at least five), one can
recover the rigid body transformation that relates the two camera positions except for
some degenerate configurations. In the context of stereopsis, Mayhew (1982) and Gillam
and Lawergren (1983) were the first to point out that the viewing geometry could be
recovered purely from information present in the two images obtained from binocular
viewing.
Details of our algorithm for estimating viewing parameters may be found in (Jones
and Malik, 1991). We derive an expression for vertical disparity, vr, in terms of image
coordinates, (it,jr), horizontal disparity, hr, and viewing parameters. This condition
must hold at all positions in the image, allowing a heavily over-constrained determination
of certain viewing parameters. With the viewing geometry known, the image coordinates
and horizontal disparity determine the vertical disparity, thus reducing an initially two-
dimensional search for corresponding points to a one-dimensional search.

4.2 Piecewise smoothness

Since the scene is assumed to consist of piecewise smooth surfaces, the disparity map is
piecewise smooth. Exploiting this constraint requires some subtlety. Some previous work
in this area has been done by Hoff and Ahuja (1989). In addition to making sure that we
do not smooth away the disparity discontinuities associated with surface boundaries in
the scene, we must also deal correctly with regions which are only monocularly visible.
Whenever there is a surface depth discontinuity which is not purely horizontal, distant
surfaces are occluded to different extents in the two eyes, leading to the existence of
unpaired image points which are seen in one eye only. The realization of this goes back
to Leonardo Da Vinci (translation in, Kemp, 1989). This situation is depicted in Fig. 6.
Recent psychophysical work has convincingly established that the human visual sys-
tem can exploit this cue for depth in a manner consistent with the geometry of the
situation (Nakayama and Shimojo, 1990).
Any computational scheme which blindly assigns a disparity value to each pixel is
bound to come up with nonsense estimates in these regions. Examples of this can be
found by inspecting the occluded regions in Fig. 5. At the very minimum, the matching
algorithm should permit the labeling of some features as 'unmatched'. This is possible
in some dynamic programming algorithms for stereo matching along epipolar lines (e.g.,
Arnold and Binford, 1980) where vertical and horizontal segments in the path through
the transition matrix correspond to skipping features in either the left or right view.
In an iterative framework, a natural strategy is to try and identify at each stage the
regions which are only monocularly visible. The hope is that while initially this classifi-
cation will not be perfect (some pixels which are binocularly visible will be mislabeled
as monocularly visible and vice versa), the combined operation of the different stereopsis
constraints would lead to progressively better classification in subsequent iterations. Our
empirical results bear this out.


Fig. 6. Occlusion. In this view from above, it is clear that at depth discontinuities there are
often regions visible to one eye, but not the other. To the right of each near surface is a region
r that is visible only to the right eye, R. Similarly, to the left of a near surface is a monocular
region, l, visible only to the left eye, L.

The problem of detecting and localizing occluded regions in a pair of stereo images is
made much easier when one recalls that there are indeed a pair of images. The occluded
regions in one image include exactly those points for which there is no corresponding
point in the other image. This suggests that the best cue for finding occluded regions in
one image lies in the disparity estimates for the other image!

Fig. 7. Visibility map. The white areas in the lower panels mark the regions determined to be
visible only from one of the two viewpoints.

Define a binocular visibility map, B(i, j), for one view as being 1 at each image
position that is visible in the other view, and 0 otherwise (i.e., an occluded region). The

horizontal and vertical disparity values for each point in, say, the left image are signed
offsets that give the coordinates of the corresponding point in the right image. If the
visibility map for the right image is initially all zero, it can be filled in systematically as
follows. For each position in the left image, set the corresponding position in the right
visibility map to 1. Those positions that remain zero had no corresponding point in the
other view and are quite likely occluded. An example of a visibility map computed in
this manner is shown in Fig. 7.
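The fill-in procedure just described is only a few lines of code; the following sketch assumes integer disparity maps for the left view and the row/column convention of the earlier matching sketch.

import numpy as np

def visibility_map(h_left, v_left, shape_right):
    """Binocular visibility map B for the right view: B = 1 where some left-image
    pixel maps to that position, 0 where no correspondence lands (likely occluded)."""
    B = np.zeros(shape_right, dtype=np.uint8)
    rows, cols = h_left.shape
    for i in range(rows):
        for j in range(cols):
            ri, rj = i + v_left[i, j], j + h_left[i, j]
            if 0 <= ri < shape_right[0] and 0 <= rj < shape_right[1]:
                B[ri, rj] = 1
    return B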
Having established a means for finding regions visible only from one viewpoint,
what has been achieved? If the disparity values are accurate, then the visibility map,
besides simply identifying binocularly visible points, also explicitly delimits occluding
contours. After the final iteration, occluded regions can be assigned the same disparity
as the more distant neighboring visible surface.

4.3 Depth Discontinuities and Adaptive Scale Selection
The output of a set of spatial filters at a range of orientations and scales provides a
rich description of an image patch. For corresponding image patches in a stereo pair of
images, it is expected that these filter outputs should be quite similar. This expectation
is reasonable when all of the spatial filters are applied to image patches which are the
projections of single surfaces. When larger spatial filters straddle depth discontinuities,
possibly including occluded regions, the response of filters centered on corresponding
image points may differ quite significantly. This situation is depicted in Fig. 8. Whenever
a substantial area of a filter is applied to a region of significant depth variation, this
difficulty occurs (e.g., in Fig. 5).


Fig. 8. Scale selection. Schematic diagram depicting a three-sided room similar to the one
in Fig. 5. When attempting to determine correspondence for a point on a near surface, larger
filters that cross depth boundaries can result in errors. If depth discontinuities could be detected,
such large scale filters could be selectively ignored in these situations.

From an initial disparity map, it is possible to estimate where such inappropriately


large scale filters are being used by applying the following procedure. At each position in
the image, the median disparity is determined over a neighborhood equal to the support
of the largest spatial filter used for stereo matching. Over this same neighborhood, the
difference between each disparity estimate and this median disparity is determined. These
differences are weighted by a Gaussian at the same scale as the filter, since the center of

the image patch has a greater effect on the filter response. The sum of these weighted
disparity differences provides a measure of the amount of depth variation across the image
patch affecting the response of this spatial filter. When this sum exceeds an appropriately
chosen threshold, it may be concluded that the filter is too large for its response to be
useful in computing correspondence. Otherwise, continuing to make use of the outputs
of large spatial filters provides stability in the presence of noise.
To record the results of applying the previous procedure, the notion of a scale map is
introduced (Fig. 9). At each position in an image, the scale map, S(i, j), records the scale
of the largest filter to be used in computing stereo correspondence. For the computation
of initial disparity estimates, all the scales of spatial filters are used. From initial disparity
estimates, the scale map is modified using the above criterion. At each position, if it is
determined that an inappropriately large scale filter was used, then the scale value at that
position is decremented. Otherwise, the test is redone at the next larger scale, if there is
one, to see if the scale can be incremented. It is important that this process of adjusting
the scale map is done in small steps, with the disparity values being recalculated between
each step. This prevents an initially noisy disparity map, which seems to have a great
deal of depth variation, from causing the largest scale filters to be incorrectly ignored.
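One conservative update pass of the scale map might look as follows; the border handling, the Gaussian weights and the threshold are assumptions of the sketch rather than the authors' exact choices.

import numpy as np

def update_scale_map(S, disparity, widths, sigmas, threshold):
    """Adjust S(i, j) by at most one step: measure the Gaussian-weighted deviation
    of the disparity from its local median over the support of the filter at scale s,
    step the scale down if the variation is too large, otherwise try stepping up."""
    rows, cols = disparity.shape
    S_new = S.copy()
    for i in range(rows):
        for j in range(cols):
            def variation(scale):
                half = widths[scale] // 2
                patch = disparity[max(0, i - half):i + half + 1,
                                  max(0, j - half):j + half + 1]
                med = np.median(patch)
                yy, xx = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
                cy, cx = (patch.shape[0] - 1) / 2.0, (patch.shape[1] - 1) / 2.0
                w = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigmas[scale] ** 2))
                return np.sum(w * np.abs(patch - med))
            s = S[i, j]
            if variation(s) > threshold and s > 0:
                S_new[i, j] = s - 1          # filter too large here: step down
            elif s + 1 < len(widths) and variation(s + 1) <= threshold:
                S_new[i, j] = s + 1          # safe to use the next larger scale
    return S_new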

Fig. 9. Scale map. The darker areas in the lower panels mark the regions where larger scale
filters are being discarded because they lie across depth discontinuities.

5 The Complete Algorithm

Once initial estimates of horizontal and vertical disparity have been made, additional
information becomes available which can be used to improve the quality of the disparity
estimates. This additional information includes estimates of the viewing parameters, the

location of occluded regions, and the appropriate scale of filters to be used for matching.
Our algorithm can be summarized as follows:
1. For each pixel P with coordinates (i, j) in the left image, and for each candidate
   disparity value (h, v) in the allowable disparity range, compute the error measure e_ij(h, v).
2. Declare h(i, j) and v(i, j) to be the values of (h, v) that minimize e_ij.
3. Use the refined values of h(i, j) and v(i, j) to compute the new visibility map B(i, j)
   and scale map S(i, j).
4. Perform steps 1-3 for disparity, visibility, and scale maps, but this time with respect
   to the right image.
5. Go to step 1, or else stop at convergence.
The error function e(h, v) is the sum of the following terms:

    e(h, v) = λ_m e_m(h, v) + λ_v e_v(h, v) + λ_c e_c(h, v) + λ_s e_s(h, v)

Each term enforces one of the constraints discussed: similarity, viewing geometry, consis-
tency, and smoothness. The λ parameters control the weight of each of these constraints,
and their specific values are not particularly critical. The terms are as follows (an illustrative
sketch combining them is given after this list):
• e_m(h, v) is the matching error due to dissimilarity of putative corresponding points.
  It is 0 if B(i, j) = 0 (i.e., the point is occluded in the other view), otherwise it is
  Σ_k |F_k * I_r(i, j) - F_k * I_l(i + h, j + v)| where k ranges from the smallest scale to
  the scale specified by S(i, j).
• e_v(h, v) is the vertical disparity error |v - v*| where v* is the vertical disparity consis-
  tent with the recovered viewing parameters. This term enforces the epipolar geometry
  constraint.
• e_c(h, v) is the consistency error between the disparity maps for the left and right
  images. Recall that in our algorithm the left and right disparity maps are computed
  independently. This term provides the coupling -- positional disparity values for
  corresponding points should have equal magnitudes, but opposite signs. If h', v' is
  the disparity assigned to the corresponding point P' = (i + h, j + v) in the other
  image, then h' = -h and v' = -v at binocularly visible points. If only one of P and
  P' is labelled as monocularly visible, then this is consistent only if the horizontal
  disparities place this point further than the binocularly visible point. In this case,
  e_c = 0; otherwise, e_c = |h + h'| + |v + v'|.
• e_s(h, v) = |h - h̄| + |v - v̄| is the smoothness error used to penalize candidate dispar-
  ity values that deviate significantly from h̄, v̄, the 'average' values of horizontal and
  vertical disparity in the neighborhood of P. These are computed either by a local
  median filter, within binocularly visible regions, or by a local smoothing operation
  within monocularly visible regions. These operations preserve boundaries of binocu-
  larly visible surfaces while providing stable depth estimates near occluded regions.
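The sketch below combines the four terms for one candidate (h, v) at one pixel. All argument names, the default weights, and the simplified handling of the monocular case in e_c are assumptions made for the illustration, not the authors' implementation.

def total_error(h, v, e_m, v_star, h_bar, v_bar, h_other, v_other,
                other_is_binocular, lam=(1.0, 1.0, 1.0, 1.0)):
    """e(h, v) = lam_m*e_m + lam_v*e_v + lam_c*e_c + lam_s*e_s for one candidate.
    e_m            : matching error, already 0 for points marked occluded
    v_star         : vertical disparity predicted by the viewing parameters
    h_bar, v_bar   : local 'average' disparities for the smoothness term
    h_other, v_other, other_is_binocular : disparity and visibility label of the
                     corresponding point (i + h, j + v) in the other image."""
    lam_m, lam_v, lam_c, lam_s = lam
    e_v = abs(v - v_star)
    # Consistency: corresponding disparities should be equal and opposite;
    # the monocular case described in the text is simplified to e_c = 0 here.
    e_c = abs(h + h_other) + abs(v + v_other) if other_is_binocular else 0.0
    e_s = abs(h - h_bar) + abs(v - v_bar)
    return lam_m * e_m + lam_v * e_v + lam_c * e_c + lam_s * e_s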
The computational complexity of this algorithm has two significant terms. The first
is the cost of the initial linear spatial filtering at multiple scales and orientations. Imple-
mentations can be made quite efficient by using separable kernels and pyramid strategies.
The second term corresponds to the cost of computing the disparity map. This cost is
proportional to the number of iterations (typically 10 or so in our examples). The cost in
each iteration is dominated by the search for the pixel in the other view with minimum e.
This is O(n^2 w_h w_v) for images of size n × n and horizontal and vertical disparity ranges
w_h and w_v. After the first iteration, when the viewing parameters have been estimated,
the approximate vertical disparity is known at each pixel. This enables w_v to be restricted
to 3 pixels, which is adequate to handle quantization errors of ±1 pixel.

6 Experimental Results

The algorithm described in the previous section has been implemented and tested on
a variety of natural and artificial images. In practice, this process converges (i.e., stops
producing significant changes) in under ten iterations. Disparity maps obtained using
this algorithm are shown in Fig. 10. The reader may wish to compare these with Figures
4 and 5 which show the disparity map after a single iteration when the correspondence is
based solely on the similarity of the filter responses. The additional constraints of epipolar
geometry and piecewise smoothness have clearly helped, particularly in the neighborhood
of depth discontinuities. Also note that the visibility map for the random dot stereogram
as well as the room image (bottom of Fig. 7) are as expected. From these representations,
the detection and localization of depth discontinuities is straightforward.

Fig. 10. Refined disparity estimates. For the stereo pairs (top), the recovered horizontal dis-
parities are shown in the middle panel. For the random dot stereogram, the lower panel shows
the visibility map. For the room image, the bottom panel shows the recovered vertical disparity.

We have demonstrated in this paper that convolution of the image with a bank of
linear spatial filters at multiple scales and orientations provides an excellent substrate on
which to base an algorithm for stereopsis, just as it has proved for texture and motion
analysis. Starting out with a much richer description than edges was extremely useful for
solving the correspondence problem. We have developed this framework further to enable
the utilization of the other constraints of epipolar geometry and piecewise smoothness as
well.

References
Arnold RD, Binford T O (1980) Geometric constraints on stereo vision. Proc SPIE
238:281-292
Ayache N, Faverjon B (1987) Efficient registration of stereo images by matching graph
descriptions of edge segments. Int J Computer Vision 1(2):107-131
Baker HH, Binford T O (1981) Depth from edge- and intensity-based stereo.
Proc 7th IJCAI 631-636
Barnard ST, Thompson WB (1980) Disparity analysis of images. IEEE Trans PAMI
2(4):333-340
Burt P, Julesz B (1980) A disparity gradient limit for binocular function. Science
208:651-657
DeValois R, DeValois K (1988) Spatial vision. Oxford Univ Press
Faugeras O, Maybank S (1990) Motion from point matches: multiplicity of solutions. Int J
Computer Vision 4:225-246
Freeman WT, Adelson EH (1991) The design and use of steerable filters. IEEE Trans
PAMI 13(9):891-906
Gennery DB (1977) A stereo vision system for autonomous vehicles. Proc 5th IJCAI
576-582
Gillam B, Lawergren B (1983) The induced effect, vertical disparity, and stereoscopic
theory. Perception and Psychophysics 36:559-64
Golub GH, Van Loan CF (1983) Matrix computations. The Johns Hopkins Univ Press,
Baltimore, MD
Grimson WEL (1981) From images to surfaces. MIT Press, Cambridge, Mass
Hannah MJ (1974) Computer matching of areas in images. Stanford AI Memo #239
Hoff W, Ahuja N (1989) Surfaces from stereo: integrating stereo matching, disparity
estimation and contour detection. IEEE Trans PAMI 11(2):121-136
Jones, DG (1991) Computational models of binocular vision. PhD Thesis, Stanford Univ
Jones DG, Malik J (1991) A computational framework for determining stereo
correspondence from a set of linear spatial filters. U.C. Berkeley Technical Report
UCB-CSD 91-655
Jones DG, Malik J (1992) Determining three-dimensional shape from orientation and
spatial frequency disparities. Proc ECCV, Genova
Kass M (1983) Computing visual correspondence. DARPA IU Workshop 54-60
Kemp M (Ed) (1989) Leonardo on painting. Yale Univ. Press: New Haven 65-66
Koenderink JJ, van Doorn AJ (1987) Representation of local geometry in the visual
system. Biol Cybern 55:367-375
Koenderink JJ (1988) Operational significance of receptive field assemblies. Biol Cybern
58:163-171
Marr D, Poggio T (1979) A theory for human stereo vision. Proc Royal Society London B
204:301-328
Mayhew JEW (1982) The interpretation of stereo disparity information: the computation
of surface orientation and depth. Perception 11:387-403
Mayhew JEW (1983) Stereopsis. in Physiological and Biological Processing of Images.
Braddick O J, Sleigh AC (Eds) Springer-Verlag, Berlin.
Medioni G, Nevatia R (1985) Segment-based stereo matching. CVGIP 31:2-18
Moravec HP (1977) Towards automatic visual obstacle avoidance. Proc 5th IJCAI
Nakayama K, Shimojo S (1990) DaVinci Stereopsis: Depth and subjective occluding
contours from unpaired image points. Vision Research 30(11):1811-1825
Perona P (1991) Deformable kernels for early vision. IEEE Proc CVPR 222-227
Pollard SB, Mayhew JEW, Frisby JP (1985) PMF: a stereo correspondence algorithm
using a disparity gradient limit. Perception 14:449-470
Young R (1985) The Gaussian derivative theory of spatial vision: analysis of cortical cell
receptive field line-weighting profiles. General Motors Research TR #GMR-4920
On Visual Ambiguities Due to Transparency in
Motion and Stereo *

Masahiko Shizawa
ATR Communication Systems Research Laboratories, Advanced Telecommunications Research
Institute International, Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto 619-02, Japan

Abstract. Transparency produces visual ambiguities in interpreting mo-


tion and stereo. Recent discovery of a general framework, principle of super-
position, for building constraint equations of transparency makes it possible
to analyze the mathematical properties of transparency perception. This
paper theoretically examines multiple ambiguous interpretations in trans-
parent optical flow and transparent stereo.

1 Introduction

Transparency perception arises when we see scenes with complex occlusions such as picket
fences or bushes, with shadows such as those cast by trees, and with physically transpar-
ent objects such as water or glass. Conventional techniques for segmentation problems
using relaxation type techniques such as coupled MRF (Markov Random Field) with a
line process which explicitly models discontinuities[5][13], statistical decision on veloc-
ity distributions using statistical voting[1][2][3] or the outlier rejection paradigm of robust
statistics[14] and weak continuity[15], cannot properly handle these complex situations,
since transparency is beyond the assumptions of these techniques. More recently, an iter-
ative estimation technique for two-fold motion from three frames has been proposed[16].
The principle of superposition(PoS), a simple and elegant mathematical technique,
has been introduced to build motion transparency constraints from conventional single
motion constraints[25]. PoS resolves the difficulties in analyzing motion transparency and
multiple motions at the level of basic constraints, i.e., of computational theory in contrast
to conventional algorithm level segmentation techniques[21]. Using PoS, we can analyze
the nature of transparent motion such as the minimum number of sets of measurements,
signal components or correspondences needed to determine motion parameters in finite
multiplicity and to determine them uniquely. Another advantage is its computational
simplicity in optimization algorithms such as convexity of the energy functionals.
In this paper, the constraints of the two-fold transparent optical flow are examined
and ambiguities in determining multiple velocities are discussed. It is shown that con-
ventional statistical voting type techniques and a previously described constraint-based
approach[23][24] behave differently for some particular moving patterns. This behavioral
difference will provide a scientific test for the biological plausibility of motion perception
models regarding transparency.
Then, I show that transparency in binocular stereo vision can be interpreted similarly
to transparent motion using PoS. The constraint equations for transparent stereo match-
ing are derived by PoS. Finally, recent results in studies on human perception of multiple
transparent surfaces in stereo vision[19] are explained by this computational theory.

* Part of this work was done while the author was at NTT Human Interface Laboratories,
Yokosuka, Japan.


2 Principle of Superposition
2.1 The Operator Formalism and Constraints of Transparency
Most of the constraint equations in vision can be written as

    a(p) f(x) = 0,    (1)

where f(x) is a data distribution on a data space G. f(x) may be the image intensity
data itself or the output of a previous visual process. p is a point in a parameter space
H which represents a set of parameters to be estimated, and a(p) is a linear operator
parametrized by p. The linearity of the operator is defined by a(p){f_1(x) + f_2(x)} =
a(p)f_1(x) + a(p)f_2(x) and a(p)0 = 0. We call the operator a(p) the amplitude operator.
The amplitude operator and the data distribution may take vector values.
Assume n data distributions f_i(x) (i = 1, 2, ..., n) on G, and suppose they are con-
strained by the operators a(p_i) (p_i ∈ H_i, i = 1, 2, ..., n) as a(p_i)f_i(x) = 0. The data
distribution f(x) having transparency is assumed to be an additive superposition of the
f_i(x), i.e., f(x) = Σ_{i=1}^{n} f_i(x). According to PoS, the transparency constraint for f(x) can be
represented simply by

    a(p_1) a(p_2) ... a(p_n) f(x) = 0.²    (2)
It should be noted that if the constraint of n-fold transparency holds, then the con-
straint of m-fold transparency holds for any m > n. However, parameter estimation
problems based on the constraint of m-fold transparency are ill-posed because extra
parameters can take arbitrary values, i.e. are indefinite. Therefore, appropriate multi-
plicity n may be determined by a certain measure of well-posedness or stability of the
optimization as in [24].

2.2 Superposition Under Occlusion and Transparency
An important property of the transparency constraint equation is its insensitivity to
occlusion, ff some region of data fi(x) is occluded by another pattern, we can assume
that fi(x) is zero in the occluded region. The transparency constraint equation still holds
because of its linearity. Therefore, in principle, occlusion does not violate the assumption
of additive superposition.
In the case of transparency, there are typically two types of superposition: additive
and multiplicative. Multiplicative superposition is highly non-linear and therefore sub-
stantially violates the additivity assumption. However, taking the logarithm of the data
distribution transforms the problem into a case of additive superposition.

3 Visual Ambiguities in Motion Transparency


3.1 The Constraint Equations of Transparent Optical Flow
In the case of optical flow, the amplitude operators in the spatial and frequency domains are
defined by [24]

    a(u, v) = u ∂/∂x + v ∂/∂y + ∂/∂t,    â(u, v) = 2πi (u ω_x + v ω_y + ω_t),    (3)

² For this constraint to be satisfied strictly, the operators a(p_i) must commute, i.e., a(p_i) a(p_j) =
a(p_j) a(p_i) for i ≠ j.

where (u, v) is a flow vector. Then, the fundamental constraints of optical flow can be writ-
ten as a(u, v)f(x, y, t) = 0 and 5(u, v)F(w~:, wy, wt) = 0 where .f(x, y, t) and F(w~:, wy, wt)
denote a space-time image and its Fourier transform[9][10][ll][12]. Using PoS, the con-
straints for the two-fold transparent optical flow are simply a(ul, v~)a(u2, v2).f(x, y, t) = 0
and 5(u~, v~)5(u2,v2)F(w=,w~,w~) = 0 where (Ul,V~) and (u2, v2) are two flow vectors
which coexist at the same image location.
These two constraints of two-fold motion transparency can be expanded into
dzzUlU2 + dyyvlv2 + dzu(u~v2 + VlU2) -~- dzt(Ul + u2) + d~t(Vl + v2) + dtt = 0, (4)
8~
where components of d -- ( d~z, d~, d=u, d~,t, dyt , dtt ) are for example d~t = o--~ f( x, y, t)
for the spatial domain representation and dut = (2~ri)2w~wtF(w~, w~, wt) for the frequency
domain representation. Therefore, we can simultaneously discuss brightness measuments
and frequency components.

3.2 T h e C o n s t r a i n t C u r v e o f Two-fold T r a n s p a r e n t Optical Flow


Equation (4) is quadratic in four unknowns Ul, vl, u2 and v2. Therefore, if we have four
independent measurements or signal components d(k)(k = 1,2,3,4), a system of four
quadratic constraint equations denoted by Ek will produce solutions of a finite ambiguity.
The solution can be obtained as intersections of two cubic curves in velocity space as
shown below. This is the two-fold transparent motion version of the well-known fact that
the intersection of two lines which represent the single optical flow constraint equations
in the velocity space (u, v) uniquely determines a flow vector.
From E1 and E2, we can derive rational expressions u2 = Gu(dO), d(2); ul, vl) and
v2 = G~(d (1), d(2);ul, vl) which transform the flow vector (Ul, vl) into (u2, v2) and vice
versa. The concrete forms of these rational expressions can be written as

G~(d(O,d(J);u,v) = q(1)q(j)
~u ~t - q(i)q(j)'
q~ qu G~(d(O'd(J);u'v) = q~i)q(j)
.(i).(j) q(1)q~j)
q(i)q(D' (5)
~g ,/y

where q(2 = ( d Li)u + d . , v (0


+ d . , ) , (0 q~') = ( A ' ) u ~Ti. ) v ~ d ,(1)~
, , and q~') - ( d ( ~ ) u + ~ ' 2 v + ~ ) ) .
If we have three measurements/components d (1), d(D and d (k), then the equation
Gu(d(0, d(J); u, v) = Gu(d (~), d(k); u, v) gives the constraint for the velocity (u, v) in the
case of two-fold transparent optical flow. This equation can be factored into the form of
q(0Gu~(d(0, d(J), d(k); u, v) = 0 where

Gu~(d (i), d (j), d(k); u, v) = ~="(1)"(J)"(k)~y


~t + ~y"(i)"(/)"(k)~t
~x + ~t"(i)"(J)"(k)~=
~y
.(0.(J).(k) .(~).(j).(~) .(0.(j).Ck) (6)

If q(x0 ---- 0 then we can substitute the i by another index i ~ which is not equivalent to
i, j and k. Then q(xr -- 0 cannot hold if we have transparency, because two equations
q(O = 0 and q(~') = 0 imply single optical flow. Thus, we can substitute i by r without
loss of generality. Therefore, the cubic equation Gu~ (d(0, d(D, d(k); u, v) ----0 with respect
to u and v gives the constraint curve on the velocity space (u, v) under the assumption
of two-fold transparency. Intersecting points of two curves in uv-space
C1 : G,,~(d(1),d(2),d(8);u,v) = O, C2 : G~,~(d(1),d(2),d(4);u,v) = O, (7)

provide the candidate flow estimates for (ul, Vl) and (u~, v2). By using (5), we can make
pairs of solutions for {(ul, vl), (u2, v2)} from these intersections.
414

T h e T h r e e - f o l d A m b i g u i t y o f F o u r - c o m p o n e n t M o t i o n . In the space-time fre-


quency domain, there exists a three-fold ambiguity in interpreting the transparent mo-
tion of four frequency components, since there are three possible ways to fit two planes
so that they pass through all four points (frequency components) and the origin. Fig-
ure 1 provides the predicted visual ambiguity due to this fact. If we have two image
patterns A and B each of which has frequency components along just two space direc-
tions ({G1, G2} for A and {G3, G4} for B), and they move with different velocities VA
and vB, their superposed motion pattern has three-fold multiple interpretations.

G4 V G1 G1

m U
0 =u

\
G2
True sol tk~ Two tahm

Fig. 1. The three-fold ambiguity of transparent motion

3.3 U n i q u e Solution f r o m Five M e a s u r e m e n t s or C o m p o n e n t s


If a system of five constraint equations Ek of five independent measurements or five
frequency components d(k)(k -~ 1,2, 3, 4, 5) are available, we can determine two velocities
uniquely. The system of equations can be solved with respect to a vector of five 'unknown
parameters',

c = =
( ulu2,vlv2, , (ulv2 +
, +
, +
) , iS)

as a linear system. Component flow parameters ul, u2, vl and v2 can be obtained by
solving two quadratic equations, u 2 - 2 c x t u + c,~ = 0 and v 2 - 2%~v + c~ = 0. We
f__._
denote their solutions as u = c=t =t: - c=x and v -- c~t -4- ~ / c ~ t - %y. There are
constraints c2~ > c~, and c~t _ c ~ for the existence of real solutions. We now have two
possible solutions for (ul, vl) and (u2, v2) as {(ul, vl), (u2, v2)} = {(u+, v+), (u_, v_)}
and {(Ul, vl), (u2, v2)} = {(u+, v_), (u_, v+)}. However, we can determine a true solution
by checking their consistency with the remaining relation cx~ = 8 9 .-b v~u2) of (8).
Therefore, we have a unique interpretation for the general case.

B e h a v i o r a l Difference Against C o n v e n t i o n a l Schemes. The significance of trans-


parent motion analysis described above is its capability of estimating multiple motion
simultaneously from the m i n i m u m amount of image information, i.e. minimum measure-
ments or signal components as shown above. In this subsection, I show that conventional
techniques by statistical voting of constraint lines on velocity space (e.g. [1][2][3][4]) can-
not correctly estimate multiple flow vectors from this minimum information.
415

Figure 2(a) shows an example of moving patterns which produces this behavioral
difference between the proposed approach and conventional statistical voting. The two
moving patterns A and B are superposed. Pattern A has two frequency components which
may be produced by two plaids G1 and G2; its velocity is VA. The other pattern B, which
has velocity VB, contains three frequency components produced by three plaids G3, G4
and Gs. If the superposed pattern is given to our algorithm based on the transparent
optical flow constraint, the two flow vectors VA and vB can be determined uniquely as
shown in the previous subsection. Figure 2(b) shows plots of conventional optical flow
constraint lines on the velocity space (u, v). There are generally seven intersection points
only one of which is an intersection of three constraint lines but other six points are of two
constraint lines. 3 The intersection of three lines is the velocity vB and can be detected
by a certain peak detection or clustering techniques on the velocity space. However, the
other velocity VA cannot be descriminated from among the six two-line intersections!

V
G G2 G=4 --G5 ~Vs
. G1 ~ ~ C'~
A V! G4

A S
(a) (b)

Fig. 2. Moving pattern from which statistical voting schemes cannot estimate the correct two
flow vectors

4 Visual Ambiguities in Stereo Transparency

In this section, the transparency in stereo is examined by PoS. Weinshall[19][20] has


demonstrated that the uniqueness constraint of matching and order preservation con-
straint are not satisfied in the multiple transparent surface perception in human stereo
vision. Conventional stereo matching algorithms cannot correctly explain perception of
transparency, i.e., multiple surfaces[20]. My intention is not to provide a stereo algorithm
for the transparent surface reconstruction, but to provide stereo matching constraints
which admit and explain the transparency perception.

4.1 T h e C o n s t r a i n t o f S t e r e o M a t c h i n g

The constraints on stereo matching can also be written by the operator formalism. We
denote the left image patterns by L(z) and the right image patterns by R(z) where z
denotes a coordinate along an epipolar line. Then, the constraint for single surface stereo

3 Figure 2(b) actually contains only five two-line intersections. However, in general, it will
contain six.
416

with disparity D can be written as

a(O)f(x)-- 0 where, a(D)_= [-9 0) 0) ' f(z) - r,-,,x>l


[R(x)J . (9)

9 ( 0 ) is a shift operator which transforms L(x) into L(z - 0 ) and R(x) into R(z - 0 ) . 4
It is easy to see that the vector amplitude operator a(D) is linear, i.e. both a(D)0 = 0
and a(O){fl(x) + f2(z)} = a(O)fl(x) + a(O)f~(z) hold.
Figure 3 is a schematic diagram showing the function of the vector amplitude operator
a(D). The operator a(D) eliminates signal components of disparity D from the pair of
stereo images, L(z) and R(z), by substitutions.

I _, I
I I oc0 ,- I _

Fig. 3. Function of the amplitude operator of stereo matching

4.2 T h e C o n s t r a i n t o f T r a n s p a r e n t S t e r e o
According to PoS, the constraint of the n-fold transparency in stereo can be hypothesized
as

a ( D n ) . . , a(D2)a(D1)f(x) = 0, (10)

where f(z) = ~ f;(z), and each fi(x) is constrained by a(O,)fi(x) = 0. It is easily proved
i=1
using the commutability of the shift operator 9 ( D ) that amplitude operators a(D~) and
a(Dj) commute, i.e. a(O,)a(Dj) : a(Dj)a(Oi) for i # j under the condition of constant
Di and Dj. Further, the additivity assumption on superposition is reasonable for random
dot stereograms of small dot density.

4.3 P e r c e p t i o n o f M u l t i p l e T r a n s p a r e n t P l a n e s
In this section, the human perception of transparent multiple planes in stereo vision
reported in [19] is explained by the hypothesis provided in the previous section.
We utilize a random dot image P(x). If L(x) = P(x - d) and R(z) -- P(z), then the
constraint of single surface stereo holds for disparity D = d, since
a(d)f(x) = r d) _ V(d)P(z) ] rP(x - d) - P(x - d)]
[P(x)-9(-d)P(z-d)] = LP(z)-P(z-d+d) --0. (11)

4 We can write this shift operator explicitly in a differential form as /~(D) = exp(-Da-~ ) =
2 2 3 3
1 - D -~-"
Ox -
D" ~ _ p__:
2! ~ x ~
~
3! a x ~ "" ""
However, only the shifting property of the operators is essential
in the following discussions.
417

In [19], a repeatedly superimposed random dot stereogram shown in Fig.4 is used to


produce the transparent plane perception. This situation can be represented by defining
L(x) and R(x) as
L(x) = P(x) + P(x - dL), R(x) -- P(x) + P(x + dR), (12)

where dL and dR are shift displacements for the pattern repetitions in left and right
image planes. According to [19], when dz ~ dR, we perceive four transparent planes
which correspond to disparities D = 0, dz, dR and dL + dR. The interesting phenomenon
occurs in the case of dL = dR = de. The stereogram produces a single plane perception
despite the fact that the correlation of two image patterns L(x) and R(x) has three strong
peaks at the disparities D = 0, dc and 2dc.

s P(x) s P(x)
,,,x.C/_/ .................... ;../

/ 9 ~:~"
dR

Fig. 4. The stereogram used in the analysis

From the viewpoint of the constraints of transparent stereo, these phenomena can be
explained as shown below.
First, it should be pointed out that the data distribution f(x) can be represented as
a weighted linear sum of four possible unique matching components.

f(~) = ~ f l ( x ) + . f ~ ( x ) + (1 - . ) f ~ ( ~ ) + (1 - . ) f ~ ( ~ ) , (13)

where

LP(x)J , f2(x)= Lp(= + d . ) j ' f3(x)= 9 -,.-. 9 , f.,.(x)= iF(x + dR)J'


(14)
and

a(0)fl(x) = 0, a(dL + dR)f2(x) = 0, a(dL)f3(x) = 0, a(dR)f4(x) = 0. (15)

Note that the weights have only one freedom as parameterized by or.
When assuming dL ys dR, the following observation can be obtained regarding the
constraints of the transparent stereo.

1. The constraint of single surface stereo a(D1)f(x) = 0 cannot hold for any values of
disparities Dr.
2. The constraint of two-fold transparent stereo a(D2)a(Di)f(x) -- 0 can hold only
for two sets of disparities {Dz, D2} -- {0, dL + dR} and {Dz, D2} = {dR, dL} which
correspond to cr = 1 and c~ = 0, respectively.
418

3. The constraint of three-fold transparency a(Dz)a(D2)a(D1)f(z) = 0 can hold only


when c~ = 1 or a = 0 as same as the two-fold transparency. However, one of the three
disparities can take arbitrary value.
4. The constraint of four-fold transparency a(D4)a(Dz)a(D2)a(D1)f(z) = 0 can hold
for arbitrary c~. The possible set of disparities is unique as {D1,D2, D3,D4} =
{0, dL, dR, dL + dR} except the cases of {O1,02} = {0, dL + dR} and {D1,02} =
{dL, dR} which correspond to c~ = 1 and c~ = 0, respectively.
5. The constraints of more than four-fold transparency can hold, but some of the dis-
parity parameters can take arbitrary values.

We can conclude that the stereo constraint of n-fold transparency is valid only for n = 2
and n = 4 by using the criterion of Occam's razor, i.e., the disparities should not take
continuous arbitrary values. Then, in both cases for n = 2 and n = 4, the theory predicts
coexistence of four disparities 0, dL, dR and dL + dR.
When dL = dn = dc, the constraint of single surface stereo a(D1)f(x) = 0 can hold
only for D1 = de, since

r {P(z) + P(x _ de)} - V(dc){P(z) + P(z + de)} ]


a(dc)f(z) = [ { P ( z ) + P ( x + de)} - V(-dc)lP(z) + P(x - de)}]
_ r{P(z) + P(z - dc)} - {p(x - de) + P(z + dc - dc)}] = o.
(16)
- [{P(z)+P(z+dc)} {P(z+dc)+P(x-dc+dc)}J

Therefore, the case of dL = dn must produce the single surface perception, if we claim
the criterion of Occam's razor on disparities.

5 Conclusion

I have analyzed visual ambiguities in transparent optical flow and transparent stereo
using the principle of superposition formulated by parametrized linear operators. Ambi-
guities in velocity estimates for particular transparent motion patterns were examined by
mathematical analyses of the transparent optical flow constraint equations. I also pointed
out that conventional statistical voting schemes on velocity space cannot estimate mul-
tiple velocity vectors correctly for a particular transparent motion pattern. Further, the
principle of superposition was applied to transparent stereo and human perception of
multiple ambiguous transparent planes was explained by the operator formalism of the
transparent stereo matching constraint and the criterion of Occam's razor on the number
of disparities.
Future work may include development of a stereo algorithm based on the constraints of
transparent stereo. The research reported in this paper will not only lead to modification
and extension of the computational theories of motion and stereo vision, but will also
help with modeling human motion and stereo vision by incorporating transparency.

Acknowledgments. The author would like thank Dr. Kenji Mase of NTT Human Interface
Laboratories and Dr. Shin'ya Nishida of ATR Auditory and Visual Perception Labora-
tories for helpful discussions.
He also thanks Drs. Jun Ohya, Ken-ichiro Ishii, Yukio Kobayashi and Takaya Endo of
NTT Human Interface Laboratories as well as Drs. Fumio Kishino, Nobuyoshi Terashima
and Kohei Habara of ATR Communication Systems Research Laboratories for their kind
support.
419

References

1. C.L.Fennema and W.Thompson: "Velocity Determination in Scenes Containing Several


Moving Objects," CGIP, Vol.9, pp.301-315(1979).
2. J.J.Little, H.Bfilthoff and T.Poggio: "Parallel Optical Flow Using Local Voting," Proc.
~nd ICCV, Tampa, FL, pp.454-459(1988).
3. R.Jasinschi, A.Rosenfeld and K.Sumi: "The Perception of Visual Motion Coherence and
Transparency: a Statistical Model," Tech. Rep. Univ. of Maryland, CAR-TR-512(1990).
4. D.J.Fleet and A.D.Jepson: "Computation of Component Image Velocity from Local
Phase Information," IJCV, Vol.5, No.l, pp.77-104(1990).
5. D.W.Murray and B.F.Buxton: "Scene Segmentation from Visual Motion Using Global
Optimization," IEEE Trans. PAMI, Vol.9, No.2, pp.220-228(1987).
6. B.G.Schunck: "Image Flow Segmentation and Estimation by Constraint Line Cluster-
ing," IEEE Trans. PAMI, Vo1.11, No.10, pp.1010-1027(1989).
7. E.H.Adelson and J.A.Movshon: "Phenomenal Coherence of Moving Visual Patterns,"
Nature, 300, pp.523-525(1982).
8. G.R.Stoner, T.D.Albright and V.S.Ramachandran: "Transparency and coherence in hu-
man motion perception," Nature, 344, pp.153-155(1990).
9. B.K.P.Horn and B.G.Schunck: "Determining Optical Flow," Artificial Intelligence,
Vol.17, pp.185-203(1981).
10. E.H.Adelson and J.R.Bergen: "Spatiotemporal Energy Models for the Perception of Mo-
tion," J.Opt.Soc.Ara.A, Vol.2, pp.284-299(1985).
11. J.G.Dangman: "Pattern and Motion Vision without Laplacian Zero Crossings,"
J.Opt.Soc.Ara.A, Vol.5, pp.1142-1148(1987).
12. D.J.Heeger: "Optical Flow Using Spatiotemporal Filters," IJCV, 1, pp.279-302(1988).
13. S.Geman and D.Geman: "Stochastic Relaxation, Gibbs Distribution, and the Bayesian
Restoration of Images," 1EEE Trans. PAMI, 6, pp.721-741(1984).
14. P.J.Besl, J.B.Birch and L.T.Watson: "Robust Window Operators," Proc. Pnd 1CCV,
Tampa, FL, pp.591-600(1988).
15. A.Blake and A.Zisserman: Visual Reconstruction, MIT Press, Cambridge, MA(1987).
16. J.R.Bergen, P.Burt, R.Hingorani and S.Peleg: "Transparent-Motion Analysis," Proc. 1st
ECCV, Antibes, France, pp.566-569(1990).
17. A.L.Yuille and N.M.Grzywacz: "A Computational Theory for the Perception of Coherent
Visual Motion," Nature, 333, pp.71-74(1988).
18. N.M.Grzywacz and A.L.Yuille: "A Model for the Estimate of Local Image Velocity by
Cells in the Visual Cortex," Proc. Royal Society of London, Vol.B239, pp.129-161(1990).
19. D.Weinshall: "Perception of multiple transparent planes in stereo vision," Nature, 341,
pp.737-739(1989).
20. D.Weinshall: "The computation of multiple matching in stereo," 14th European Confer-
ence on Visual Pereeption(ECVP), page A31(1990).
21. D.Marr: Vision, Freeman(1982).
22. R.Penrose: The Er~peror's New Mind: Concerning C o m p u t e r s , Minds, and
The Laws of Physics, Oxford University Press, Oxford(1989).
23. M.Shizawa and K.Mase: "Simultaneous Multiple Optical Flow Estimation," Proc. lOth
ICPR, Atlantic City, N3, Vol.I, pp.274-278(3une,1990).
24. --: "A Unified Computational Theory for Motion Transparency and Motion Bound-
aries Based on Eigenenergy Analysis," Proe. IEEE CVPR'91, Maul, HI, pp.289-
295(June,1991).
25. --: "Principle of Superposition: A Common Computational Framework for Analysis
of Multiple Motion," Proc. IEEE Workshop on Visual Motion, Princeton, N:], pp.164-
172(October,1991).
A Deterministic Approach for Stereo Disparity
Calculation

Chienchung Chang 1 and Shanl~r Chatterjee 2


1 Qualcomm Incorporated, 10555 Sorrento Valley Rd. San Diego, CA 92121, USA
2 Department of ECE, University of California, San Diego, La Jolla, CA 92093, USA

Abstract. In this work, we look at mean field annealing (MFA) from two
different perspectives: information theory and statistical mechanics. An iterative,
deterministic algorithm is developed to obtain the mean field solution for disparity
calculation in stereo images.

1 Introduction

Recently, a deterministic version of the simulated annealing (SA) algorithm, called mean field
approximation (MFA) [1], was utilized to approximate the SA algorithm efficiently and success-
fully in a variety of applications in early vision modules, such as image restoration [8], image
segmentation [3], stereo [12], motion [11] surface reconstruction [4] etc.
In this paper, we apply the approximation in the stereo matching problem. We show that
the optimal Bayes estimate of disparity is, in fact, equivalent to the mean field solution which
minimizes the relative entropy between an approximated distribution and the given posterior
distribution, if (i) the approximated distribution h a a Gibbs form and (ii) the mass of dis-
tribution is concentrated near the mean as the temperature goes to zero. The approximated
distribution can be appropriately tuned to behave as close to the posterior distribution as pos-
sible. Alternatively, from the angle of statistical mechanics, the system defined by the states
of disparity variables can be viewed as isomorphic to that in magnetic materials, where the
system energy is specified by the binary states of magnetic spins. According to the MRF model,
the distribution of a specific disparity variable is determined by two factors: one due to the
observed image data (external fidd) and the other due to its dependence (internal field) upon
the neighboring disparity variables. We follow the mean field theory in the usual Ising model of
magnetic spins [7] to modify Gibbs sampler [5] into an iterative, deterministic version.

2 A n I n f o r m a t i o n T h e o r e t i c A n a l y s i s of M F A
The optimal Bayes estimate of the disparity values at uniformly spaced grid points, given a pair
of images, is the maximum a posteriori(MAP) estimate when a uniform cost function is assumed.
To impose the prior constraints (e.g., surface smoothness etc.), we can add energy terms in the
objective energy (performance) functional and/or introduce an approximated distribution. The
posterior energy functional of disparity map d given the stereo images, fL and fr, can usually be
formulated in the form [2]:
M M

UP(d[fl'fr)=a(T) EJgl(xl)-gr[xi +(dx"O)]J2 + E E (dx'-dxj)2 (1)


i~l i=1 XjENXI
where gt and gr represent the vectors of matching primitives extracted from intensity images;
Nxi is the neighborhood of x~ in the image plane /2 and is defined through a neighborhood
structure with radius r, Rr = {Nx,x E /2}, Nx =~ {Y, lY - x J 2 < r}; M = 1121is the number
of pixels in the discretized image plane; and the disparity map is defined as d zx {dx, x E
12}. The first term represents the photometric constraint and the second term describes the
421

surface smoothness. If the disparity is modelled as an MRF given the image data, the posterior
distribution of disparity is given as

P(d,f,,f.) = l 1 exp
~ [ UP(d~"fr)
p (2)

where Zp and T axe the normalization and temperature constants respectively. The MAP
estimate of disparity map is the minimizer of the corresponding posterior energy functional
Up(d]f~,fr). It is desirable to describe the above equation by a simpler parametric form. If the
approximated distribution is Pa, which is dependent on adjustable parameters represented by
vector a ----{dx,x E D} and has the Gibbs form:

P.(dld) = ~-1 exp . , Uo(dld) = ~"~(dx, - dx,) 2 (31

where Z . is the partition function and U.(d[d) is the associated energy functional. For the
specific Uo, the approximated distribution is Ganssian. In information theory, relative entropy
is an effective measure of how well one distribution is approximated by another [9]. Alternative
names in common use for this quantity are discrimination, Kunback-Liebler number, direct
divergence and Cross entropy. The relative entropy of a measurement d with distribution P~
relative to the distribution P is defined as
P,(dld)
S , ( a ) -~ P,(d[d) log p ( d [ f , f,) dd (4)

where P(dlfl, fr) is referred to as reference distribution. Kullback's principle of minimum relative
entropy [9] states that, of the approximated distributions P~ with the given Gibbs form, one
should choose the one with the least relative entropy. If d is chosen as the mean of the disparity
field d, the optimal mean field solution is apparently the minimizer of relative entropy measure.
After some algebraic manipulations, we can get

S , ( a ) ---- T ( F . - Fp + E(Up) - E ( U , ) ) (5)

where the expectations, E(.), axe defined with respect to the approximated distribution P,. F ,
- T l o g Za, Fp ~ - T I o g Zp axe called free energy. In statistical mechanics [10], the difference
between the average energy and the free energy scaled by temperature is equal to entropy,
o r F = E - T S . From the divergence inequality in information theory, the relative entropy is
always non-negative [6] S t ( a ) _> 0, with the equality holding if and only if P~ - P. And since
temperature is positive,
F~ < Fo + E(V~) - E(Vo) (6)

which is known as Perieris's inequality [1]. The MFA solution, realized as the minimizer of
relative entropy, can be alternatively represented as the parameter a yielding the tightest bound
in (6). In other words, we have
min S~(d) = m~n[Fo + E(Up) - E ( V ' . ) ] (7)
d

since Fp in (5) is not a functional of the parameter a at all. The choice of U~ relies on a prior
knowledge of the distribution of the solution. Gibbs measure provides us with the flexibility in
defining the approximated distribution P~ as it depends solely on the energy function Us, which
in turn can be expressed as the sum of clique potentials [5]. Next we discuss an example of U,
which is both useful and interesting. For the energy function given in (3), the corresponding
approximated distribution is Ganssian and the adjustable parameters are, in fact, the mean
values of disparity field. As the temperature (variance) approaches zero, it will be conformed
t o the mean value with probability one. Since the disparity values at lattice points axe assumed
422

to be independent Gaussian random variables, both the free energy and expected approximate
energy can be obtained as:
M
F. -Tlog Z. -~log(rT), E(U,,)= EE(dx, -dx,) 2 =
MT
= =
2 (s)
i----1

The mean posterior energy can be written as:


M M
E[(dX,--dX,)2I (9)
i----1 i=l X j E N x i

The second term in the right hand side (RHS) can be rewritten as:
M M
E E E[(dx'-dxi)2l = E E [Y-I'(dx'-dxJ)2] (10)
iffil X j 6 N X i i=1 XjENxl
On the other hand, if the first term at RHS of (9) can be approximated by (the validity of
approximation will be discussed later)
M M
~,(r) ~ E (Ig,(x,)- g..[x, + (dx,, o)]1~) -- o~(r) ~ Ig,(x,)- gdx, + (,ix,, o)]l ~ (11)
i=1 i-~.1

then, by combining (10) and (11), the upper bound in Peierls's inequality becomes
M M
F~+ E(Up)- E(Uo)vr a(T)E [g,(x,) - g~[x, + (dx,,0)][ 2 + E E (dx, - dxj) 2 (12)
i-~l iffil X iENxi

It is interesting to note that the format of the above functional of mean disparity function,
d, is identical to that of the posterior energy functional, Up(difl,f~ ) up to a constant. Hence,
it is inferred that the MAP estimate of disparity function is, in fact, equivalent to the mean
field solution minimizing the relative entropy between the posterior and approximated Gaussian
distributions. Regarding the approximation in (11), as the temperature T ---, 0, all the mass of
P . ( d l d ) will be concentrated at mean vector d = d and (11) holds exactly. At least, in the low
temperature conditions, the MFA solution coincides with the MAP solution.

3 MFA Based on Statistical Mechanics

When it system possesses a large interaction degree of freedom, the equilibrium can be attained
through the mean field [10]. It serves as a general model to preview a complicated physical
system. In our case, each pixel is updated by the expected (mean) value given the mean values
of its neighbors [7].
With Gibbs sampler [2], we visit each site xi and update the associated site variable dx~
with a sample from the local characteristics

P(dx~idy,Vy~xi,f,,fr)= 1 exp [ - TUi(dxi)] (13)

where the marginalenergy function Ui(dx,) is derived from (1) and (2). If the system is fury
specified by the interactions of site (disparity) variables and the given data, the uncertainty of
each variable is, in fact, defined by the local characteristics. In a magnetic material, each of
the spins is influenced by the magnetic field at its location. This magnetic field consists of any
external field imposed by the experimenter, plus an internal field due to other spins. During the
annealing process, the mean contribution of each spin to the internal field is considered. The
first term in (1) can be interpreted as the external field due to the given image data and the
423

second term as internal field contributed by other disparity variables. SA with Gibbs sampler
simulate the system with the samples obtained from the embedded stochastic rules, while MFA
tries to depict the system with the mean of each system variable.
In summary, the MFA version of Gibbs sampler can then be stated as:
1. Start with any initial mean disparity do and a relative high initial temperature.
2. Visit a site xl and calculate the marginal energy function contributed by given image data
and the mean disparity in the neighborhood Nxi as

0/(dx,) = a IgL(x/) - g , [ x ~ + (dx,, 0)][2 + (dx, - dy) 2 (14)


YENX~
3. Calculate the mean disparity dxl as

exp [ - O i ( d x l ) / T ] (15)
dxi = E d x i P ( d x i ] d y , V y ~ xl,f,,f,) = E dx,
Z~
dx i EItD dx i ERz)
4. Update in accordance with steps 2 and 3 until a steady state is reached at the current
temperature, T.
5. Lower the temperature according to a schedule and repeat the steps 2, 3 and 4 until there
axe few changes.
Consequently, MFA consists of a sequence of iterative, deterministic relaxations in approximating
the SR. It converts a hard optimization problems into a sequence of easier ones.

4 Experimental Results

We have used a wide range of image examples to demonstrate that SR can be closely approxi-
mated by MFA. Due to the space limitation, we only provide azt image example: Pentagon (256 x
256). The matching primitives used in the experiments are intensity, directional intensity gra-
dients (along horizontal and vertical directions), i.e., gs(z, y) --- (f,(z, y),-~=, ~ ) , V(z, y) E
~ , s = l, r. We try to minimize the functional Up(c]]fz,f~) by deterministic relaxation at each
temperature and use the result at current temperature as the initial state for the relaxation at
the next lower temperature. The initial temperature is set as 5.0 and the annealing schedule used
is where the temperature is reduced 50% relative to the previous one. The neighborhood system
2 is used in describing surface smoothness. The computer simulation results are shown in Fig
1. One could compare the result with those obtained by SA algorithm using Gibbs sampler.
In MFA version of SA with Gibbs sampler, we follow the algorithm presented in Section 3.
The initial temperature and the annealing schedule are identical to those in above. The results
axe also shown in Fig 1. When they are compared with the previous results, we can see that the
MFA from both approaches yield roughly the same mean field solution and they approximate
the MAP solution closely.

5 Conclusion

In this paper, we have discussed, for stereo matching problem, two general approaches of MFA
which provide good approximation to the optimal disparity estimate. The underlying models can
be easily modified and applied to the other computer vision problems, such as image restoration,
surface reconstruction and optical flow computation etc. As the Gaussian distribution is the most
natural distribution of an unknown variable given both mean and variance [9], it is nice to see
that the meaa values of these independent variables that minimize the relative entropy between
the assumed Ganssian and the posterior distribution is equivalent to the optimal Bayes estimate
in MAP sense.
424

Fig. 1. Upper row (left to right): the left and right images of Pentagon stereo pa~r, the mean
field result based on information theoretic approach, and the result using SA. Bottom row
(left to right): the result using deterministic Gibbs sampler, the three dimensionaJ (3-D) surface
corresponding to information theoretic MFA, and the 3-D surface corresponding to deterministic
Gibbs sampler.

References
1. G.L. Bilbro, W.E. Snyder, and R.C. Mann. Mean-field approximation minimizes relative
entropy. Your. of Optical Soci. America, Vol-8(No.2):290-294, Feb. 1991.
2. C. Chang. Area-Based Methods ]or Stereo Vision: the Computational Aspects and Their
Applications. PhD thesis, University of Cedifornia, San Diego, 1991. Dept. of ECE.
3. C. Chang and S. Chatterjee. A hybrid approach toward model-based texture segmentation.
Pattern Recognition, 1990. Accepted for publication.
4. D. Geiger and F. Girosi. Mean field theory for surface reconstruction. In Proc. DARPA
Image Understanding Workshop, pages 617-630, Palo Alto, CA, May 1989.
5. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Part. Anal. ~J Mach. Intel., Nov. 1984.
6. R.M. Gray. Entropy and In]ormation Theory. Springer-Verlag, New York, NY, 1990.
7. J. Hertz, A. Krogh, and R.G. Palmer. Introduction to The Theory of Neural Computation.
Addison Wesley, Reading, MA, 1991.
8. H.P. Hiriyann&iah, G.L. Bilbro, W.E. Snyder, and R.C. Mann. Restoration of piecewise-
constant images by mean field annealing. Your. of Optical Soci. America, Vol-
6(No.12):1901-1912, Dec. 1989.
9. S. Kullback. Information Theory and Statistics. John Wiley & Sons, New York, NY, 1959.
10. G. Paxisi. Statistical Field Theory. Addison Wesley, Reading, MA, 1988.
11. A. Yuille. Generalized deformable models, statistical physics and matching problems. Neu-
ral Computation, Vol.2(No.1):l-24, 1990.
12. A. Yuille, D. Geiger, and H. Bulthoff. Stereo integration, mean field theory and psy-
chophysics. In Proc. Ist European Con]. on Comp. Vision, pages 73-88, Antibes, France,
April 1990.
This article was processed using the IbTEX macro packatge with ECCV92 style
Occlusions and Binocular Stereo

Davi Geiger 1, Bruce Ladendorf 1 and Alan Yuille 2


I Siemens Corporate Research, 755 College Rd. East, Princeton NJ 08540.
2 Division of Applied Sciences, Harvard University, Cambridge, MA, 02138.
USA

Abstract.
Binocular stereo is the process of obtaining depth information from a
pair of left and right cameras. In the past occlusions have been regions
where stereo algorithms have failed. We show that, on the contrary, they
can help stereo computation by providing cues for depth discontinuities.
We describe a theory for stereo based on the Bayesian approach. We
suggest that a disparity discontinuity in one eye's coordinate system always
corresponds to an occluded region in the other eye thus leading to an oc-
clusion co~s~rain~ or monotonicity constraint. The constraint restricts the
space of possible disparity values, simplifying the computations, and gives
a possible explanation for a variety of optical illusions. Using dynamic pro-
gramming we have been able to find the optimal solution to our system and
the experimental results support the model.

1 Introduction

Binocular stereo is the process of obtaining depth information from a pair of left and
right camera images. The fundamental issues of stereo are: (i) how are the geometry and
calibration of the stereo system determined, (ii) what primitives are matched between
the two images, (iii) what a priori assumptions are made about the scene to determine
the disparity and (iv) the estimation of depth from the disparity.
Here we assume that (i) is solved, and so the corresponding epipolar lines (see figure 1)
between the two images are known. We also consider (iv) to be given and then we
concentrate on the problems (ii) and (iii).
A number of researchers including Sperling[Sperling70], Julesz [:Julesz71]; Mart and
Poggio[MarPog76] [MarPog79]; Pollard, Mayhew and Frisby[PolMayFri87]; Grimson[Grimson81];
Ohta and Kanade[OhtKan85]; Yuille, Geiger and Bfilthof[YuiGeiBulg0] have provided a
basic understanding of the matching problem on binocular stereo. However, we argue that
more information exists in a stereo pair than that exploited by current algorithms. In
particular, occluded regions have always caused difficulties for stereo algorithms. These
are regions where points in one eye have no corresponding match in the other eye. Despite
the fact that they occur often and represent important information, there has not been a
consistent attempt of modeling these regions. Therefore most stereo algorithms give poor
results at occlusions. We address the problem of modeling occlusions by introducing a
constraint that relates discontinuities in one eye with occlusions in the other eye.
Our modeling starts by considering adaptive windows matching techniques [KanOku90],
and taking also into account changes of illumination between left and right images, which
provide robust dense input data to the algorithm. We then define an a prior/probability
for the disparity field, based on (1) a smoothness assumption preserving discontinuities,
and (2) an occlusion constraint. This constraint immensely restrict the possible solutions
of the problem, and provides a possible explanation to a variety of optical illusions that
426

so far could not be explained by previous theories of stereo. In particular , illusory dis-
continuities, perceived by humans as described in Nakayama and Shimojo [NakShi96],
may be explained by the model. We then apply dynamic programming to exactly solve
the model.
Some of the ideas developed here have been initiated in collaboration with A. Chain-
boll and S. Mallat and are partially presented in [ChaGeiMalgl]. We also briefly mention
that an alternative theory dealing with stereo and occlusions has been developed by
Belhumeur and Mumford[BelMumgl].
It is interesting to notice that, despite the fact that good modelling of discontinuities
has been done for the problem of segmentation (for example, [BlaZis87][GeiGir91]), it is
still poor the modeling of discontinuities for problems with multiple views, like stereopsis.
We argue that the main difficulty with multiple views is to model discontinuities with
occlusions. In a single view, there are no occlusions !

2 Matching intensity windows


We use adaptive correlation between windows. At each pixel, say l on the left, we consider
a window of pixels that include I. This window is rectangular so as to allow pixels from
above and below the epipolar line to contribute to the correlation (thereby discouraging
mismatching due to misallignment of epipolar lines). The correlation between the left and
right windows, II W ~ - W ~ II, is a measure of similarity. A major limitation of using large
windows is the possibility of getting "wrong" correlations near depth discontinuities. To
overcome this limitation we have considered two possible windows, one (window-l) to
the left of the pixel l and the other (window-2) to the right (see figure 1). Both windows
are correlated with the respective ones in the right image. The one that has better
correlations is kept and the other one discarded. Previous work on adaptive windows is
presented in [KanOkug0].

wmdow-I ~dndow-2
LEFt Row[ Iiii I I I III llf }I i l~l
IIIII=I',',', fill
~ Dhnensiona] line I
Col~n

T~-I+D 9 Colmna
~- eptpolar line I /

Fig. 1. (a) A pair of ~ a m e s (eyes) and an epipolar line in the lelt jCrame. (b) The two windows
in the left image and the respective ones in the right image. In the left image each window shares
the "center pixel" l. The window.1 goes one pixel over the right of l and window-~ goes one over
left to I.

2.1 P r o b a b i l i t y o f m a t c h i n g
If a feature vector in the left image, say W~, matches a feature vector in the right image,
say W ~ , If W~ - W ~ II should he small. As in [MarPog76][YuiGeiBu190], we use a
427

matching process Mlr that is 1 if, a feature at pixel I in the left eye matches a feature at
pixel r in the right eye, and it is 0 otherwise. Within the Bayes approach we define the
probability of generating a pair of inputs, W L and W R, given the matching process M
by

- ~'~,~{M,.[IIW~-W~II]+e(1-M,~)},,."
P~.p~t(W L, WRIM) = e /~1 (1)
where the second term pays a penalty for unmatched points ( Mzr = 0), with e being
a positive parameter to be estimated. C1 is a normalization constant. This distribution
favors lower correlation between the input pair of images.

2.2 Uniqueness and an occlusion process

In order to prohibit multiple matches to occur we impose that


/9"-1 N--I
Z M"" ----0' I and ~M L, = 0 , 1 .
l=0 r=O

Notice that these restrictions guarantee that there is at most one match per feature,
and permits unmatched features to exist. There are some psychophysical experiments
where one would think that multiple matches occur, like in the two bars experiments (see
figure 5). However, we argue that this is not the case, that indeed a disparity is assigned
to all the features, even without a match, giving the sensation of multiple matches. This
point will be clearer in the next two sections and we will asume that uctiquertessholds.
Than, it is natural to consider an occlusion process, O, for the left (O L) and for the right
(O R ) coordinate systems, such that

N-1 N-1
O~(M)= I - ~ Mi,r and O f ( M ) = 1 - ~ M,,,. (2)
r----0 I=0

The occlusion processes are 1 when no matches occur and 0 otherwise. In analogy, we
can define a disparity field for the left eye, D L, and another for the right eye, Da, as

N-I N-1
D~(M)(1-O~)= ZM',r(r-l) and D~(M)(1-O~)= ~M,,,(r-l). (3)
r=O 1=0

where D L and DR are defined only if a match occurs. This definition leads to integer
values for the disparity. Notice that D~ = DI+D~
R and D~ = D_D~.
r These two variables,
O(M) and D(M) (depending upon the matching process M), wiU be useful to establish
a relation between discontinuities and occlusions.

3 P i e c e w i s e s m o o t h functions

Since surface changes are usually small compared to the viewer distance, except at depth
discontinuities, we first impose that the disparity field, at each eye, should be a smooth
function but with discontinuities (for example, [BlaZis87]). An effective cost to describe
these functions, (see [GeiGir91]), is given by
428

v.jf (M) = - z.(1 + _ +

where/~ and 7 are parameters to be estimated. We have imposed the smoothness criteria
on the left disparity field and on the right one. Assigning a Gibbs probability distribution
to this cost and combining it with (1), within the Bayesian rule, we obtain

(4)

where Z is a normalization constant and

lr
R D R 2 i E .oL .

where we have discarded the constant 29' + E(N - 1)N. This cost, dependent just upon
the matching process (the disparity fields and the occlusion processes are functions of
Ml,), is our starting point to address the issue of occlusions.

4 Occlusions

Giving a stereoscopic image pair, occlusions are regions in space that cannot be seen by
both eyes and therefore a region in one eye does not have a match in the other image,
To model occlusions we consider the matching space, a two-dimensional space where the
axis are given by the epipolar lines of the left and right eyes and each element of the
space, Mz~, decides whether a left intensity window at pixel / matches a right intensity
window at pixel r. A solution for the stereo matching problem is represented as a path
in the matching space(see figure 2).

4.1 O c c l u s i o n c o n s t r a i n t

We notice that in order for a stereo model to admit disparity discontinuities it also has
to admit occlusion regions and vice-versa. Indeed most of the discontinuities in one eye's
coordinate system corresponds to an occluded region in the other eye's coordinate system.
This is best understood in the matching space. Let us assume that the left epipolar line is
the abscissa of the matching space. A path can be broken vertically when a discontinuity
is detected in the left eye and, can be broken horizontally when a region of occlusion
is found. Since we do not allow multiple matches to occur by imposing u~iq~e~ess then
, almost always, a vertical break (jump) in one eye corresponds to a horizontal break
0ump) in the other eye (see figure 2).

Occl~sio~ co~strai~: A d~co~i~ity i~ o~e aye correspond ~o a~ occl~io~ i~ ~ha oXher


aye a~d vice-versa.
Notice that this is not always the case, even if we do apply ~nique~ess. It can be
violated and induces the formation of illusions which we discuss on the section 7.
429

x~

A ntinvity

~X ~D

IE

cx xB xA Left
9
BL~ N o Match
(~) (b)

Fig. 2. (a) A ramp occluding a plane. (b) The matching space, where the leftand right epipolar
lines are for the image of (a). Notice the S~lmmetr~l between occlusions and discontinuities. Dark
lines indicates where match occurs, Mr. = 1.

4.2 M o n o t o n i c i t y c o n s t r a i n t
An alternative way of considering the occlusion constraint is by imposing the monotonic-
ity of the function F~ = l + Dr, for the left eye, or the monotonicity of F ~ = r + DrR.
This is called the monotonicity constrain~ (see also [ChaGeiMal91]). Notice that F ~ and
F ~ are not defined at occluded regions, i.e.the functions F~ and F ~ do not have support
at occlusions. The monotonicity of F~, for an occlusion of size o, is then given by

L
Fi+o+1 - i~t > 0, or Df+o - D f > -o, V~
l+o
where L 1
01+o =Of=O and ~ (1-of,)=O
P=I+I
(8)
and analogously to F~R. The monotonicity constraint propose an ordering type of con-
straint. It differs from the known orders constraint in that it explicitly assumes (i)
occlusions with discontinuities,horizontal and verticaljumps, (ii)uniqueness. W c point
out that the monotonicity of F L is equivalent to the monotonicity of F R. The mono-
to~icit~Iconstraint can be applied to simplify the optimization of the effectivecost (5) as
we discuss next.

5 Dynamic Programming

Since the interactions of the disparity field D f and D~ are restricted to a small neigbor-
hood we can apply dynamic programming to exactly solve the problem.
We first constrain the disparity to take on integral values in the range of (-0, 8)
(Panum's limit, see [MarPog79]). We impose the boundary condition, for now, that the
disparity at the end sides of the image must be 0.
The dynamic program works by solving many subproblems of the form: what is the
lowest cost path from the beginning to a particular (/, r) pair and what is its cost? These
430

subproblems are solved column by column from left to right finally resulting in a solu-
tion of the whole problem (see figure 5). At each column the subproblem is considered
requiring a set of subproblems previously solved. Because of the mo~o~o~ici~y co~sgrai~
the set of previously solved subproblems is reduced. More precisely, to solve the subprob-
lem (l, r), requires the information from the solutions of the previous subproblems (z, y),
where y < r and m < I (see shaded pixels in figure 5). Notice that the mono~onici~y
eonscrai~ was used to reduce the required set of previously solved subproblems, thus
helping the efficiency of the algorithm.

6 Implementation and Results


A standard image pair of the Pentagon building and environs as seen from the air are
used (see figure 3 (a) and (b)) to demonstrate the algorithm. Each image is 512 by 512
8-bit pixels. The dynamic programming algorithm described above was implemented in
C for a SPARCstation 1+; it takes about 1000 seconds, mostly for matching windows
(~ 75% of the time). The parameters used were : 7 = 10; # = 0.15; e = 0.15; 0 = 40;
w = 3; and the correlation II W ~ - W ~ n L , II has'been normalized to values between
0 and 1. The first step of the program computes the correlation between the left and
right windows of intensity. Finally the resulting disparity map is shown in figure 3. The
disparity values changed from - 9 to +5.
The basic surface shapes are correct including the primary building and two over-
passes. Most of the details of the courtyard structure of the Pentagon are correct and
some trees and rows of cars are discernible. As an observation we note that the disparity
is tilted indicating that the top of the image is further away from the viewer than the
bottom. Some pixels are labeled as occluded and these are about where they are expected
(see figure 3).

7 Illusions and disparity at occlusions

In some unusual situations the mogoto~icicy constrain~ can be broken, still preserving the
uniqueness. We show in figure 4 an example where a discontinuity does not correspond
to an occlusion. More psychophysical investigation is necessary to asserts an agreement
of the human perception for this experiment with our theory. This experiment is a gen-
eralization of the double-nail illusion [KroGri82], since the head of the nail is of finite
size (not a point), thus we call it the double-hammer illusion.

7.1 Disparity limit at occluding areas and Illusory discontinuities


At occluded regions there is no match and thus we would first think not to assign a
disparity value. Indeed, according to (3) and (5) a disparity is just defined where a
match exist, and not at occlusions. However, some experiments suggest that a disparity
is assigned to the occluded features, like in the two-bars experiment illustrated in figure 5.
The possible disparity values for the occluded features are the ones that would break
the mor~o:onicigy constrgi~L This is known as Panum's limit case. Nakayama and Shi-
mojo [NakShi90] have shown that indeed a sensation of depth is given at the occluded
features according to a possible limit of disparity. If indeed, a disparity is assigned to the
occluded regions than a disparity discontinuity will be formed between the occluded and
not occluded regions. We have produced a variation of the Nakayama and Shimojo exper-
iments where indeed a sensation of disparity at occluded features and illusory contours
are produced (see figure 5). We then make the following conjecture
431

Fig. 3. A pair of (a} left and (b) right images of the pentagon, with horizontal epipolar lines.
Each image is 8-bit and 51~ by 51~ pizels. (c} The final disparity map where the values changed
from - 9 to +5. The parameters used where: '7 = 10; /.L = 0.15; e = 0.15;8 = 40; oJ = 3. In a
SPARCstation 1-/-, the algorithm takes about 1000 seconds, mostly for matching windows (.~ 75
of the time}. (d} The occlusion regions in the right image. They are approzimately correct.

C o n j e c t u r e 1 ( o c c l u d e d - d i s p a r i t y ) The perceived disparity of occluded features ks the


limit of their possible disparity values (Panum's limit case), if no other source of infor-
mation is given.

This conjecture provides a method, t h a t we have used, to fill in the d i s p a r i t y for


occluded features without having to assing a match.

Acknowledgements: We would like to t h a n k A. C h a m b o l l and S. Mallat for the stimu-


lating conversations, and for their p a r t i c i p a t i o n on the initial ideas of this p a p e r and D.
Mumford for m a n y useful comments.
432

F i g . 4. The double-hammer illusion. This figure has a square in front of another larger square.
There is no region of occlusion and yet there is a depth discontinuity.

Left
12
b
,
I
I
rl
Right
i 2 + D2= ii+Dl=rl

F i g . 5. (a) An illustration of the dynamic programming. The subproblem being considered is the
(l, I + D~) one. To solve it we need the solutions from all the shaded pizels. (b) When fused,
a 3-dimensional sensation of two bars, one in front of the of the other one, is obtained. This
suggests that a disparity value is assigned to both bars in the left image. (c) A stereo pair of the
type of Nakayama and Shimojo experiments. When fused, a vivid sensation of depth and depth
discontinuity is obtained at the occluded regions (not matched features). We have displaced the
occluded features with respect to each other to give a sensation of different depth values for the
occlude.d featui'es, supporting the disparity limit conjecture. A cross fuser should fuse the left and
the center images to preceive the blocks behind the planes. An uncross fuser should use the center
and right images.
433

References
[BelMum91] P. Be]_humeur and D. Mumford, A Bayesian treatment of the stereo corre-
spondence using halfioccluded region, Harvard Robotics Lab, Tech. Repport:
December, 1991.

[BIaZis87] A. Blake and A. Zisserman, Visual Reconstruction, Cambridge, Mass: MIT


Press, 1987.
[ChaGeiMal91] A. Champolle, D. Geiger, and S. Mallat, "Un algorlthme multi-dchelle de mise
encorrespondance stdrdo basd sur les champs markoviens," in 13th GRETSI
Conference on Signal and Image Processing, Juan-les-Pins,France, Sept. 1991.

[GeiGir91] D. Geiger and F. Girosi, "Parallel and deterministic algorithms for mrfs: surface
reconstruction," IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. PAMI-13, no. 5, pp. 401-412, May 1991.

[Grimson81] W. E. L. Grimson, From Images to Surfaces, Cambridge, Mass.: MIT Press,


1981.

[JuleszT1] B. Julesz, Foundations of Cyclopean Perception, Chicago: The University of


Chicago Press, 1971.

[KanOku90] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive
window: theory and experiments," in Prec. Image Understanding Workshop
DARPA, PA, September 1990.

[KroGri82] J.D. Krol and W.A. Van der Grind, "The double-nail illusion: experiments on
binocular vision with nails, needles and pins.," Perception, vol. 11, pp. 615-619,
1982.

[MarPog79] D. Mart and T. Poggio, "A computational theory of human stereo vision,"
Proceedings of the Royal Society of London B, vol. 204, pp. 301-328, 1979.

[MarPog76] D. Mart and T. Poggio, "Cooperative computation of stereo disparity," Science,


vol. 194, pp. 283-287, 1976.

[N~kS~90] K. Nal~yama and S. Shlmojo, "Da Vinci stereopsis: depth and subjective
occluding contours from unpaired image points," Vision Research, vol. 30,
pp. 1811-1825, 1990.

[OhtKan85] Y. Ohta and T. Kanade, "Stereo by intra- and inter-scanllne search Using
dynamic programming," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. PAMI-7, no. 2, pp. 139-154, 1985.

[PolNIayFri87] S. B. Pollard, J. E. W. Mayhew, and J. P. Frisby, "Disparity gradients and


stereo correspondences," Perception, 1987.

[SperlingT0] G. Sperling, "Binocular vision: a physical and a neural theory.," American


Journal of Psychology, vol. 83, pp. 461-534, 19670.

[YniGeiBul90] A. Ynille, D. Geiger, and H. Bulthoff, "Stereo, mean field theory and psy-
chophysics," in 1st. ECCV, pp~ 73-82, Springer-Verlag, Antibes, France, April
1990.

This article was processed using the I ~ macro package with ECCV92 style
Model-Based Object Tracking in Traffic Scenes

D. Koller 1, K. Daniilidis 1, T. ThdrhaUson I and H.-H. Nagel 1,2

1 Institut ffir Algorithmen und Kognitive Systeme


Fakult~t ffir Informatik, Universit~t Karlsruhe (TH),
Postfach 6980, D-7500 Karlsruhe 1, Germany; E-marl: koller@ira.uka.de
2 Fraunhofer-Institut ffir Informations- und Datenverarbeitung (IITB), Karlsruhe

A b s t r a c t . This contribution addresses the problem of detection and track-


ing of moving vehicles in image sequences from traffic scenes recorded by
a stationary camera. In order to exploit the a priori knowledge about the
shape and the physical motion of vehicles in traffic scenes, a parameterized
vehicle model is used for an intraframe matching process and a recursive
estimator based on a motion model is used for motion estimation. The ini-
tial guess about the position and orientation for the models are computed
with the help of a clustering approach of moving image features. Shadow
edges of the models are taken into account in the matching process. This en-
ables tracking of vehicles under complex illumination conditions and within
a small effective field of view. Results on real world traffic scenes are pre-
sented and open problems are outlined.

1 Introduction

The higher the level of abstraction of descriptions in image sequence evaluation, the more
a priori knowledge is necessary to reduce the number of possible interpretations as, for
example, in the case of automatic association of trajectory segments of moving vehicles
to motion verbs as described in [Koller et al. 91]. In order to obtain more robust results,
we take more a priori knowledge into account about the physical inertia and dynamic
hehaviour of the vehicle motion.
For this purpose we establish a motion model which describes the dynamic vehicle
motion in the absence of knowledge about the intention of the driver. The result is a
simple circular motion with constant magnitude of velocity and constant angular velocity
around the normal of a plane on which the motion is assumed to take place. The unknown
intention of the driver in maneuvering the car is captured by the introduction of process
noise. The motion model is described in Section 2.2.
The motion parameters for this motion model are estimated using a recursive maxi-
m u m a posteriori estimator (MAP), which is described in Section 4.
Initial states for the first frames are provided by a step which consists of a motion
segmentation and clustering approach for moving image features as described in [Koller
et al. 91]. Such a group of coherently moving image features gives us a rough estimate
for moving regions in the image. The assumption of a planar motion yields then a rough
estimate for the position of the object hypothesis in the scene by backprojecting the
center of the group of the moving image features into the scene, based on a calibration
of the camera.
To update the state description, straight line segments extracted from the image
(we call them data segments) are matched to the 2D edge segments - - a view sketch - -
obtained by projecting a 3D model of the vehicle into the image plane using a hidden-line
algorithm to determine their visibility.
438

The 3D vehicle model for the objects is parameterized by 12 length parameters. This
enables the instantiation of different vehicles, e.g. limousine, hatchback, bus, or van from
the same generic vehicle model. The estimation of model shape parameters is possible by
including them into the state estimation process. Modeling of the objects is described in
Section 2.1.
The matching of data and model segments is based on the Mahalanobis distance of
attributes of the line Segments as described in [Deriche & Faugeras 90]. The midpoint
representation of line segments is suitable for using different uncertainties parallel and
perpendicular to the line segments.
In order to track moving objects in long image sequences recorded by a stationary camera, we are forced to use a wide field of view. As a consequence, an individual moving object covers only a small image area. In bad cases, only few and/or poor line segments are associated with the image of a moving object. In order to track even objects mapped onto very small areas in the image, we decided to include shadow edges in the matching process whenever possible. In a very first implementation of the matching process it was necessary to take the shadow edges into account in order to track some small objects. In the current implementation the shadow edges appear not to be strictly necessary for tracking these objects, but they yield more robust results. The improvement from the very first implementation to the current one was only possible by testing the algorithms on various real world traffic scenes. The results of the latest experiments are illustrated in Section 5.

2 Models for the Vehicles and their Motion

2.1 The Parameterized Vehicle Model

We use a parameterized 3D generic model to represent the various types of vehicles moving in traffic scenes. Different types of vehicles are generated from this representation by varying 12 length parameters of our model. Figure 1 shows an example of five different specific vehicle models derived from the same generic model.
In the current implementation we use a fixed set of shape parameters for each vehicle
in the scene. These fixed sets of shape parameters are provided interactively.
In initial experiments on video sequences from real world traffic scenes, the algorithm
had problems in robustly tracking small vehicular objects in the images. These objects
span only a region of about 20 x 40 pixels in the image (see for example Figure 2). In
bad cases, we did not have enough and/or only poor edge segments for the matching process associated with the image of the moving vehicle, which caused the matching process to
match lines of road markings to some model edge segments. Such wrong matches resulted
in wrong motion parameters and, therefore, in bad predictions for the vehicle position in
subsequent frames.
Since vehicle images in these sequences exhibit salient shadow edges, we decided to
include the shadow edges of the vehicle into the matching process. These shadow edges
are generated from the visible contour of the object on the road, as seen by the sun. The
inclusion of shadow edges is only possible in image sequences with a well defined illumi-
nation direction, i.e. on days with a clear sky (see Figures 7 and 8). The illumination
direction can be either set interactively off-line or it can be incorporated as an unknown
parameter in the matching process.

Fig. 1. Example of five different vehicle models (limousine, hatchback, station wagon, small bus, pick-up) derived from the same generic model.

2.2 The Motion Model

We use a motion model which describes the dynamic behaviour of a road vehicle without
knowledge about the intention of the driver. This assumption leads to a simple vehicle
motion on a circle with a constant magnitude of the velocity v = |v| and a constant angular velocity ω. The deviation of this idealized motion from the real motion is captured by process noise on v and ω. In order to recognize pure translational motion in the noisy data, we evaluate the angle difference ωτ (τ = t_{k+1} - t_k is the time interval). In case ωτ is less than a threshold, we use a simple translation with the estimated (constant) angle φ and ω = 0.
Since we assume the motion to take place on a plane, we have only one angle φ and one angular velocity ω (= dφ/dt). The angle φ describes the orientation of the model around the normal (the z-axis) of the plane on which the motion takes place. This motion model is described by the following differential equation:

\dot{t}_x = v \cos\phi, \qquad \dot{t}_y = v \sin\phi, \qquad \dot{\phi} = \omega, \qquad \dot{v} = 0, \qquad \dot{\omega} = 0.    (1)

3 The Matching Process

The matching between the predicted model data and the image data is performed on edge
segments. The model edge segments are the edges of the model, which are backprojected
from the 3D scene into the 2D image. The invisible model edge segments are removed by
a hidden-line algorithm. The position t and orientation φ of the model are given by the output of the recursive motion estimation described in Section 4. This recursive motion
estimation also yields values for the determination of a window in the image in which
edge segments are extracted. The straight line segments are extracted and approximated
using the method of [Korn 88].

:. "i:i.i:i :! :::~: :':':'!:i.:.:


""" :::}:::~.
"::...

:- ...::..:::::

..::...:: "!~;:::: .~i

\
\

Fig. 2. To illustrate the complexity of the task to detect and track small moving objects,
the following four images are given: the upper left image shows a small enlarged image
section, the upper right figure shows the grey-coded maxima of the gradient magnitude in the direction of the gradient of the image function, the lower left figure shows the straight
line segments extracted from these data, and the lower right figure shows the matched
model.

The Matching Algorithm

Like the method of [Lowe 85; Lowe 87] we use an iterative approach to find the set
with the best correspondence between 3D model edge segments and 2D image edge
segments. The iteration is necessary to take into account the visibility of edge segments
depending on the viewing direction and the estimated state of position and orientation,
respectively. At the end of each iteration a new correspondence is determined according
to the estimated state of position and orientation. The iteration is terminated if a certain
number of iterations has been achieved or the new correspondence found has already been
investigated previously. Out of the set of correspondences investigated in the iteration,
the correspondence which leads to the smallest residual is then used as a state update.
The algorithm is sketched in Figure 3. We use the average residual per matched edge
segment, multiplied by a factor which accounts for long edge segments, as a criterion for
the selection of the smallest residual.

i ← 0
C_i ← get_correspondences( x⁻ )
DO
    x⁺ ← update_state( C_i )
    r_i ← residual( C_i )
    C_{i+1} ← get_correspondences( x⁺ )
    i ← i + 1
WHILE ( (C_i ≠ C_j , j = 0, 1, ..., i−1) ∧ i < I_MAX )
i_min ← { i | r_i = min(r_j) , j = 0, 1, ..., I_MAX }
x⁺ ← x⁺_{i_min}

Fig. 3. Algorithm for the iterative matching process. C_i is the set of correspondences between the p data segments D = {D_j}_{j=1,...,p} and the n model segments M = {M_j}_{j=1,...,n} for the model interpretation i: C_i = {(M_j, D_{i_j})}_{j=1,...}.
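For concreteness, here is a minimal Python sketch of the iteration in Figure 3; the helpers get_correspondences, update_state and residual stand for the steps described above and are assumptions of ours, not part of the original system.

```python
def iterative_matching(x_pred, get_correspondences, update_state, residual, i_max=5):
    """Iterate correspondence search and state update; keep the lowest-residual update."""
    history = []                          # list of (correspondence set, updated state, residual)
    corr = get_correspondences(x_pred)
    for _ in range(i_max):
        x_upd = update_state(corr)
        history.append((corr, x_upd, residual(corr)))
        corr = get_correspondences(x_upd)
        if any(corr == prev_corr for prev_corr, _, _ in history):
            break                         # this correspondence set was already investigated
    # state update from the correspondence set with the smallest residual
    return min(history, key=lambda entry: entry[2])[1]
```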

Finding Correspondences
Correspondences between model and data segments are established using the Maha-
lanobis distance between attributes of the line segments as described in [Deriche &
Faugeras 90]. We use the representation X = (x_m, y_m, \theta, l) of a line segment, defined as:

x_m = \frac{x_1 + x_2}{2}, \quad y_m = \frac{y_1 + y_2}{2}, \quad \theta = \arctan\frac{y_2 - y_1}{x_2 - x_1}, \quad l = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2},    (2)

where (x_1, y_1)^T and (x_2, y_2)^T are the endpoints of the line segment.
Denoting by σ_∥ the uncertainty in the position of the endpoints along an edge chain and by σ_⊥ the positional uncertainty perpendicular to the linear edge chain approximation, a covariance matrix Λ is computed, depending on σ_∥, σ_⊥ and l. Given the attribute vector X_m of a model segment and the attribute vector X_d of a data segment, the Mahalanobis distance between X_m and X_d is defined as

d = (X_m - X_d)^T (\Lambda_m + \Lambda_d)^{-1} (X_m - X_d).    (3)
The data segment with the smallest Mahalanobis distance to the model segment is
used for correspondence, provided the Mahalanobis distance is less than a given threshold.
Due to the structure of vehicles this is not always the best match: the known vehicles and their models consist of two essential sets of parallel line segments, one set along the orientation of the modeled vehicle and one set perpendicular to this direction. But
evidence from our experiments so far supports our hypothesis that in most cases the
initialisation for the model instantiation is good enough to obviate the necessity for a
combinatorial search, such as, e.g., in [Grimson 90b].
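The following sketch (with our own, hypothetical names and thresholds) illustrates the midpoint representation (2) and the Mahalanobis test (3); the covariance construction is only a simplified stand-in for the full expression of [Deriche & Faugeras 90], not the authors' exact formula.

```python
import numpy as np

def segment_attributes(p1, p2):
    """Midpoint representation (x_m, y_m, theta, l) of a line segment, Eq. (2)."""
    (x1, y1), (x2, y2) = p1, p2
    l = np.hypot(x2 - x1, y2 - y1)
    theta = np.arctan2(y2 - y1, x2 - x1)
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0, theta, l])

def segment_covariance(theta, l, sigma_par=2.0, sigma_perp=0.5):
    """Simplified 4x4 covariance: large uncertainty along the segment, small across it."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    C = np.zeros((4, 4))
    C[:2, :2] = R @ np.diag([sigma_par**2, sigma_perp**2]) @ R.T
    C[2, 2] = (sigma_perp / max(l, 1e-6))**2     # orientation uncertainty shrinks with length
    C[3, 3] = 2.0 * sigma_par**2                 # length uncertainty from both endpoints
    return C

def best_match(model_seg, data_segs, threshold=12.0):
    """Return the data segment with the smallest Mahalanobis distance (3), if below threshold."""
    Xm = segment_attributes(*model_seg)
    Cm = segment_covariance(Xm[2], Xm[3])
    best, best_d = None, threshold
    for seg in data_segs:
        Xd = segment_attributes(*seg)
        Cd = segment_covariance(Xd[2], Xd[3])
        diff = Xm - Xd
        diff[2] = (diff[2] + np.pi) % (2 * np.pi) - np.pi   # wrap the angle difference
        d = diff @ np.linalg.inv(Cm + Cd) @ diff
        if d < best_d:
            best, best_d = seg, d
    return best, best_d
```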
The search window for corresponding line segments in the image is a rectangle around
the projected model segments. The dimensions of this rectangle are intentionally set to larger values than those obtained from the estimated uncertainties in order
to overcome the optimism of the IEKF as explained in Section 4.

4 Recursive Motion Estimation

In this section we elaborate the recursive estimation of the vehicle motion parameters.
As we have already described in Section 2.2, the assumed model is the uniform motion
of a known vehicle model along a circular arc.

The state vector x_k at time point t_k is a five-dimensional vector consisting of the position (t_{x,k}, t_{y,k}) and orientation φ_k of the model as well as the magnitudes v_k and ω_k of the translational and angular velocities, respectively:

x_k = (t_{x,k}, t_{y,k}, \phi_k, v_k, \omega_k)^T.    (4)

By integrating the differential equations (1) we obtain the following discrete plant model describing the state transition from time point t_k to time point t_{k+1}:

t_{x,k+1} = t_{x,k} + v_k\tau \, \frac{\sin(\phi_k + \omega_k\tau) - \sin\phi_k}{\omega_k\tau}, \qquad \phi_{k+1} = \phi_k + \omega_k\tau,
t_{y,k+1} = t_{y,k} - v_k\tau \, \frac{\cos(\phi_k + \omega_k\tau) - \cos\phi_k}{\omega_k\tau}, \qquad v_{k+1} = v_k, \qquad \omega_{k+1} = \omega_k.    (5)

We introduce the usual dynamical systems notation (see, e.g., [Gelb 74]). The symbols (x̂_k⁻, P_k⁻) and (x̂_k⁺, P_k⁺) are used, respectively, for the estimated states and their covariances before and after updating based on the measurements at time t_k.
By denoting the transition function of (5) by f(·) and assuming white Gaussian process noise w_k ~ N(0, Q_k), the prediction equations read as follows:

\hat{x}_{k+1}^- = f(\hat{x}_k^+), \qquad P_{k+1}^- = F_k P_k^+ F_k^T + Q_k,    (6)

where F_k is the Jacobian \partial f / \partial x at x = \hat{x}_k^+.
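To make the prediction step concrete, here is a small Python sketch of the state transition (5) and the covariance prediction (6); the function names are ours, and a finite-difference Jacobian is used in place of the analytic F_k.

```python
import numpy as np

def f_transition(x, tau):
    """Discrete circular-arc motion model (5); state x = (tx, ty, phi, v, w)."""
    tx, ty, phi, v, w = x
    if abs(w * tau) < 1e-9:            # near-zero turn rate: straight-line limit
        dx, dy = v * tau * np.cos(phi), v * tau * np.sin(phi)
    else:
        dx = v * tau * (np.sin(phi + w * tau) - np.sin(phi)) / (w * tau)
        dy = -v * tau * (np.cos(phi + w * tau) - np.cos(phi)) / (w * tau)
    return np.array([tx + dx, ty + dy, phi + w * tau, v, w])

def predict(x_plus, P_plus, Q, tau, eps=1e-6):
    """Prediction equations (6), with F_k approximated by finite differences."""
    x_minus = f_transition(x_plus, tau)
    F = np.zeros((5, 5))
    for j in range(5):
        dx = np.zeros(5)
        dx[j] = eps
        F[:, j] = (f_transition(x_plus + dx, tau) - x_minus) / eps
    P_minus = F @ P_plus @ F.T + Q
    return x_minus, P_minus
```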


The four-dimensional parameter vectors {X_i}_{i=1..m} from m matched line segments in the image plane build a (4m)-dimensional measurement vector z_k, assumed to be equal to the measurement function h_k(x_k) plus white Gaussian measurement noise v_k ~ N(0, R_k). The measurement noise covariance matrix R_k is block-diagonal. Its blocks are 4×4 covariance matrices as defined in equation 12 of [Deriche & Faugeras 90].
As already formulated in Section 3, the line segment parameters are functions of the
endpoints of a line segment. We will briefly explain how these endpoints are related to
the state (4). A point (x_i, y_i) in the image plane at time instant t_k is the projection of a point x_{w_i,k} described in the world coordinate system (see Figure 4). The parameters of this transformation have been obtained off-line based on the calibration procedure of [Tsai 87], using dimensional data extracted from a construction map of the depicted roads. In this way we constrain the motion problem even more, because we not only know that the vehicle is moving on the road plane, but the normal of this plane is known as well. The point x_{w_i,k} is obtained by the following rigid transformation from the model coordinate system:

x_{w_i,k} = \begin{pmatrix} \cos\phi_k & -\sin\phi_k & 0 \\ \sin\phi_k & \cos\phi_k & 0 \\ 0 & 0 & 1 \end{pmatrix} x_{m,i} + \begin{pmatrix} t_{x,k} \\ t_{y,k} \\ 0 \end{pmatrix},    (7)

where (t_{x,k}, t_{y,k}, φ_k) are the state parameters and x_{m,i} are the known positions of the vehicle vertices in the model coordinate system.
As already mentioned, we have included the projection of the shadow contour in the measurements in order to obtain more predicted edge segments for matching and to avoid false matches to data edge segments arising from shadows that lie in the neighborhood of predicted model edges. The measurement function of projected shadow edge segments differs from the measurement function of the projections of model vertices in one step. Instead of only one point in the world coordinate system, we get two: one point x_s as the vertex of the shadow on the street and a second point x_w = (x_w, y_w, z_w) as the vertex on the object which is projected onto the shadow point x_s. We assume a parallel projection in


Fig. 4. Description of coordinate systems (c.s.)

shadow generation. Let the light source direction be (cos α sin β, sin α sin β, cos β)^T, where α and β (set interactively off-line) are the azimuth and polar angle, respectively, described in the world coordinate system. The following expression for the shadow point in the xy-plane (the road plane) of the world coordinate system can be easily derived:

x_s = \begin{pmatrix} x_w - z_w \cos\alpha \tan\beta \\ y_w - z_w \sin\alpha \tan\beta \end{pmatrix}.    (8)
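A short illustrative sketch (our naming) that chains the rigid transformation (7) with the parallel shadow projection (8) to obtain the shadow vertex on the road plane for a given model vertex and pose:

```python
import numpy as np

def shadow_point(x_model, tx, ty, phi, alpha, beta):
    """Model vertex -> world point via Eq. (7), then parallel projection onto z = 0 via Eq. (8)."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    xw = R @ np.asarray(x_model, float) + np.array([tx, ty, 0.0])   # Eq. (7)
    xs = xw[0] - xw[2] * np.cos(alpha) * np.tan(beta)               # Eq. (8)
    ys = xw[1] - xw[2] * np.sin(alpha) * np.tan(beta)
    return np.array([xs, ys, 0.0])
```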

The point x_w can then be expressed as a function of the state using (7). A problem arises with endpoints of line segments in the image which are not projections of model vertices but intersections of occluding line segments. Due to the small length of the possibly occluded edges (for example, the side edges of the hood and of the trunk of the vehicle) we cover this case by the already included uncertainty σ_∥ of the endpoints in the edge
direction. A formal solution uses a closed form for the endpoint position in the image as a
function of the coordinates of the model vertices belonging to the occluded and occluding
edge segments. Such a closed form solution has not yet been implemented in our system.
The measurement function h_k is nonlinear in the state x_k. Therefore, we have tested three possibilities for the updating step of our recursive estimation. In all three approaches we assume that the state after the measurement z_{k-1} is normally distributed around the estimate x̂_{k-1}⁺ with covariance P_{k-1}⁺, which is only an approximation to the actual a posteriori probability density function (PDF) after an update step based on a nonlinear measurement. An additional approximation is the assumption that the PDF after the nonlinear prediction step remains Gaussian. Thus we state the problem as the search for the maximum of the following a posteriori PDF after measurement z_k:

p(x_k \mid z_k) = c \, \exp\left\{ -\tfrac{1}{2} (z_k - h_k(x_k))^T R_k^{-1} (z_k - h_k(x_k)) \right\} \cdot \exp\left\{ -\tfrac{1}{2} (x_k - \hat{x}_k^-)^T (P_k^-)^{-1} (x_k - \hat{x}_k^-) \right\},    (9)

where c is a normalizing constant.


This is a MAP estimation and can be stated as the minimization of the objective function

\min_{x_k} \; (z_k - h_k(x_k))^T R_k^{-1} (z_k - h_k(x_k)) + (x_k - \hat{x}_k^-)^T (P_k^-)^{-1} (x_k - \hat{x}_k^-),    (10)

resulting in the updated estimate x̂_k⁺. In this context the well known Iterated Extended Kalman Filter (IEKF) [Jazwinski 70; Bar-Shalom & Fortmann 88] is actually the Gauss-Newton iterative method [Scales 85] applied to the above objective function, whereas the Extended Kalman Filter (EKF) is only one iteration step of this method. We have found such a clarification [Jazwinski 70] of the meaning of EKF and IEKF to be important for understanding the performance of each method.
A third possibility we have considered is the Levenberg-Marquardt iterative minimization method applied to (10), which we will call Modified IEKF. The Levenberg-Marquardt strategy is a standard method for least squares minimization, guaranteeing a steepest descent direction far from the minimum and a Gauss-Newton direction near the minimum, thus increasing the convergence rate. If the initial values are in the close vicinity of the minimum, then IEKF and Modified IEKF yield almost the same result.
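As an illustration, here is a minimal Gauss-Newton iteration on the objective (10); the measurement function h and its Jacobian are assumed to be supplied, and the sketch omits the Levenberg-Marquardt damping and the re-selection of matches discussed in the text.

```python
import numpy as np

def map_update(x_pred, P_pred, z, h, H_jac, R, n_iter=5):
    """Gauss-Newton minimization of (10); a single iteration corresponds to the EKF update."""
    P_inv = np.linalg.inv(P_pred)
    R_inv = np.linalg.inv(R)
    x = x_pred.copy()
    for _ in range(n_iter):
        r_meas = z - h(x)                     # measurement residual
        H = H_jac(x)                          # Jacobian of h at the current iterate
        grad = -H.T @ R_inv @ r_meas + P_inv @ (x - x_pred)
        hess = H.T @ R_inv @ H + P_inv        # Gauss-Newton approximation of the Hessian
        x = x - np.linalg.solve(hess, grad)
    P_upd = np.linalg.inv(hess)               # approximate posterior covariance
    return x, P_upd
```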
Due to the mentioned approximations, all three methods are suboptimal and the com-
puted covariances are optimistic [Jazwinski 70]. This fact practically affects the matching
process by narrowing the search region and making the matcher believe that the current
estimate is much more reliable than it actually is. Practical compensation methods in-
clude an addition of artificial process noise or a multiplication with an amplification
matrix. We did not apply such methods in our experiments in order to avoid a severe
violation of the smoothness of the trajectories. We have just added process noise to the velocity magnitudes v and ω (about 10% of the actual value) in order to compensate for the inadequacy of the motion model with respect to the real motion of a vehicle.
We have tested all three methods [Thórhallson 91] and it turned out that the IEKF and Modified IEKF are superior to the EKF regarding convergence as well as retention of a high number of matches. As [Maybank 90] suggested, these suboptimal filters come closer to the optimal filter in the Minimum Mean Square Error sense the nearer the initial value lies to the optimal estimate. This criterion is actually satisfied by the initial position and orientation values in our approach, obtained by backprojecting image features clustered into objects onto a plane parallel to the street. In addition to the starting values for position and orientation, we computed initial values for the velocity magnitudes v and ω during a bootstrap process. During the first n_boot (= 2, usually) time frames, position and orientation are statically computed. Then initial values for the velocities are taken from the discrete time derivatives of these positions and orientations.
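A trivial sketch (our naming) of this bootstrap: the statically estimated poses of the first frames yield finite-difference initial values for v and ω.

```python
import numpy as np

def bootstrap_velocities(poses, tau):
    """poses: list of statically estimated (tx, ty, phi); returns initial (v0, w0)."""
    (tx0, ty0, phi0), (tx1, ty1, phi1) = poses[0], poses[-1]
    dt = (len(poses) - 1) * tau
    v0 = np.hypot(tx1 - tx0, ty1 - ty0) / dt     # translational speed from position differences
    w0 = (phi1 - phi0) / dt                      # angular velocity from orientation differences
    return v0, w0
```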
Concluding the estimation section, we should mention that the above process requires
only a slight modification for the inclusion of the shape parameters of the model as un-
knowns in the state vector. Since shape parameters remain constant, the prediction step
is the same and the measurement function must be modified by substituting the model points x_{m,i} with the respective functions of the shape parameters instead of considering them to have constant coordinates in the model coordinate system.

5 Experiments and Results


Parking Area
As a first experiment we used an image sequence of about 80 frames in which one car is
moving from the left to the right leaving a parking area (see the three upper images of
Figure 5). The image of the moving car covers about 60 x 100 pixels of a frame. In this
example it was not necessary, and due to the illumination conditions not even possible,
to use shadow edges in the matching process. The matched models for the three upper
frames are illustrated in the middle row of Figure 5, with more details given in the lower
three figures. In the lower figures we see the extracted straight lines, the backprojected
model segments (dashed lines) and the matched data segments, emphasized by thick
lines.

....... ,

I +.....
+..___
....................
Fig. 5. The first row shows the 4th, 41st and 79th frame of an image sequence. The three
images in the middle row give an enlarged section of the model matched to the car moving
in the image sequence. The lower three figures exhibit the correspondences (thick lines)
between image line segments and model segments (dashed lines) in the same enlarged
section as in the middle row.

The resultant object trajectories will be used as inputs to a process of associating motion verbs to trajectory segments. Since such subsequent analysis steps are very sensitive to noise, we attempt to obtain smoother object trajectories. In order to obtain such smooth motion we use a small process noise for the magnitude of the velocity v and the angular velocity ω. In this and the subsequent experiments, we therefore use a
process noise of σ_v = 10⁻³ m/s and σ_ω = 10⁻⁴ rad/s. Given this σ_v and σ_ω, the majority of the translational and angular accelerations are assumed to be v̇ < σ_v/τ = .625 m/s² and ω̇ < σ_ω/τ = 2.5·10⁻³ rad/s², respectively, with τ = t_{k+1} − t_k = 40 ms.
The bootstrap phase is performed using the first two frames in order to obtain initial estimates for the magnitudes of the velocities v and ω. Since the initially detected moving region does not always correctly span the image of the moving object, we used values equal to approximately half of the average model length, i.e. σ_{t_x,0} = σ_{t_y,0} = 3 m. An initial value for the covariance in the orientation φ is roughly estimated by considering the differences in the orientation between the clustered displacement vectors, i.e. σ_{φ,0} = .35 rad.
The car has been tracked during the entire sequence of 80 frames with an average
number of about 16 line segment correspondences per frame. The computed trajectory
for this moving car is given in Figure 6.

[Figure 6 plots the estimated trajectory (x [m] versus y [m]) together with the translational velocity v [m/s] and the angular velocity ω [Grad/s] over the frame number.]

Fig. 6. The estimated position as well as the translational and angular velocity of the
moving car of Figure 5.

Multilane Street Intersection


The next group of experiments involved an image subsequence of about 50 frames of a
much frequented multilane street intersection. In this sequence there are several moving
vehicles with different shapes and dimensions, all vehicles turning to the left (Figure 7).
The size of the images of the moving vehicles varies in this sequence from 30 x 60 to
20 x 40 pixels in a frame. Figure 2 shows some intermediate steps in extracting the line


Fig. 7. The first row shows the 3rd, 25th and 49th frame of an image sequence recorded
at a much frequented multilane street intersection. The middle row shows an enlarged
section of the model matched to the taxi (object #6) moving in the center of the frame.
The lower three figures exhibit the correspondences between image line segments and
model segments in the same enlarged section as in the middle row.

segments. We explicitly present this Figure in order to give an idea of the complexity of
the task to detect and track a moving vehicle spanning such a small area in the image. We
used the same values for the process noise and the initial covariances as in the previous
experiment. As in the previous example we used the first two frames for initial estimation
of v and w. In this experiment we used the shadow edges as additional line segments in
the matching process as described in Section 4.
Five of the vehicles appearing in the first frame have been tracked throughout the
entire sequence. The reason for the failure in tracking the other vehicles has been the
inability of the initialization step to provide the system with appropriate initial values.
To handle this inability an interpretation search tree is under investigation.
In the upper part of Figure 7 we see three frames out of this image sequence. In the
middle part of Figure 7, the matched model of a taxi is given as an enlarged section

Fig. 8. The first row shows the 3rd, 25th and 49th frame of an image sequence recorded
at a much frequented multilane street intersection. The middle row shows an enlarged
section of the model matched to the small car (object #5) moving left of the center of the
frame. The lower three figures exhibit the correspondences between image line segments
and model segments in the same enlarged section as in the middle row.

for the three upper images. In the lower three figures the correspondences of image line
segments and the model line segments are given. Figure 9 shows the resultant object
trajectory. Figure 8 shows another car of the same image sequence with the resultant
trajectory also displayed in Figure 9.

6 Related Work

In this section we discuss related investigations about tracking and recognizing object
models from image sequences. The reader is referred to the excellent book by [Grimson
90a] for a complete description of research on object recognition from a single image.
[Gennery 82] has proposed the first approach for tracking 3D-objects of known struc-
ture. A constant velocity six degrees of freedom (DOF) model is used for prediction and

[Figure 9 plots the estimated trajectories (x [m] versus y [m]) of objects #5 and #6 together with their translational velocities v [m/s] and angular velocities ω [Grad/s] over the frame number.]

Fig. 9. The estimated positions as well as the translational and angular velocities of the
moving cars in Figure 8 (object # 5) and Figure 7 (object # 6).

an update step similar to the Kalman filter (without addressing the nonlinearity) is applied. Edge elements closest to the predicted model line segments are associated as
corresponding measurements.
[Thompson & Mundy 87] emphasize the object recognition aspect of tracking by ap-
plying a pose clustering technique. Candidate matches between image and model vertex
pairs define points in the space of all transformations. Dense clusters of such points
indicate a correct match. Object motion can be represented by a trajectory in the trans-
formation space. Temporal coherence then means that this trajectory should be smooth.
Predicted clusters from the last time instant establish hypotheses for the new time in-
stants which are verified as matches if they lie close to the newly obtained clusters. The
images we have been working on did not contain the necessary vertex pairs in order to
test this novel algorithm. Furthermore, we have not been able to show that the approach
of [Thompson & Mundy 87] is extensible to handling of parameterized objects.
450

[Verghese et al. 90] have implemented two approaches for tracking known 3D objects in real time. Their first method is similar to the approach of [Thompson & Mundy 87]
(see the preceding discussion). Their second method is based on the optical flow of line
segments. Using line segment correspondences, of which initial (correct) correspondences
are provided interactively at the beginning, a prediction of the model is validated and
spurious matches are rejected.
[Lowe 90, 91] has built the system that has been the main inspiration for our match-
ing strategy. He does not enforce temporal coherence, however, since he does not employ a motion model. Pose updating is carried out by minimization of a sum of weighted
least squares including a priori constraints for stabilization. Line segments are used for
matching but distances of selected edge points from infinitely extending model lines are
used in the minimization. [Lowe 90] uses a probabilistic criterion to guide the search for
correct correspondences and a match iteration cycle similar to ours.
A gradient-ascent algorithm is used by [Worrall et al. 91] in order to estimate the
pose of a known object in a car sequence. Initial values for this iteration are provided
interactively at the beginning. Since no motion model is used the previous estimate is
used at every time instant to initialize the iteration. [Marslin et al. 91] have enhanced
the approach by incorporating a motion model of constant translational acceleration and
angular velocity. Their filter optimality, however, is affected by the use of speed estimates as measurements instead of the image locations of features.
[Schick & Dickmanns 91] use a generic parameterized model for the object types. They
solve the more general problem of estimating both the motion and the shape parameters.
The motion model of a car moving on a clothoid trajectory is applied including trans-
lational as well as angular acceleration. The estimation machinery of the simple EKF is
used and, so far, the system is tested on synthetic line images only.
The following approaches do not consider the correspondence search problem but
concentrate only on the motion estimation. A constant velocity model with six DOF
is assumed by [Wu et al. 88] and [Harris & Stennet 90; Evans 90], whereas [Young &
Chellappa 90] use a precessional motion model.
A quite different paradigm is followed by [Murray et al. 89]. They first try to solve
the structure from motion problem from two monocular views. In order to accomplish
this, they establish temporal correspondence of image edge elements and use these cor-
respondences to solve for the infinitesimal motion between the two time instants and the
depths of the image points. Based on this reconstruction [Murray et al. 89] carry out a
3D-3D correspondence search. Their approach has been tested with camera motion in a
laboratory set-up.

7 Conclusion and Future Work

Our task has been to build a system that will be able to compute smooth trajectories of
vehicles in traffic scenes and will be extensible to incorporate a solution to the problem
of classifying the vehicles according to computed shape parameters. We have considered
the task to be difficult because of the complex illumination conditions and the cluttered
environment of real world traffic scenes and the small effective field of view that is
spanned by the projection of each vehicle given a stationary camera. In all experiments
mentioned in the cited approaches in the last section, the projected area of the objects
covers a quite high portion of the field of view. Furthermore, only one of them [Evans 90] is tested under outdoor illumination conditions (landing of an aircraft).

In order to accomplish the above mentioned tasks we have applied the following
constraints. We restricted the degrees of freedom of the transformation between model
and camera from six to three by assuming that a vehicle is moving on a plane known a
priori by calibration. We considered only a simple time coherent motion model because
of the high sampling rate (25 frames per second) and the knowledge that vehicles do not
maneuver abruptly.
The second critical point we have been concerned about is the establishment of good
initial matches and pose estimates. Most tracking approaches do not emphasize the sever-
ity of this problem of establishing a number of correct correspondences in the starting
phase and feeding the recursive estimator with quite reasonable initial values. Again we
have used the a priori knowledge of the street plane position and the results of clustering
picture domain descriptors into object hypotheses of a previous step. Thus we have been
able to start the tracking process with a simple matching scheme and feed the recursive
estimator with values of low error covariance.
The third essential point we have addressed is the additional consideration of shadows.
Data line segments arising from shadows are not treated any more as disturbing data like
markings on the road, but they contribute to the stabilization of the matching process.
Our work will be continued by the following steps. First, the matching process should
be enhanced by introducing a search tree. In spite of the good initial pose estimates,
we are still confronted occasionally with totally false matching combinations due to the
highly ambiguous structure of our current vehicle model. Second, the generic vehicle
model enables a simple adaptation to the image data by varying the shape parameters.
These shape parameters should be added as unknowns and estimated over time.

Acknowledgements
The financial support of the first author by the Deutsche Forschungsgemeinschaft (DFG,
German Research Foundation) and of the second as well as the third author by the
Deutscher Akademischer Austauschdienst (DAAD, German Academic Exchange Service)
are gratefully acknowledged.

References
[Bar-Shalom & Fortmann 88] Y. Bax-Shalom, T.E. Fortmann, Tracking and Data Association,
Academic Press, New York, NY, 1988.
[Deriche & Faugeras 90] R. Deriche, O. Faugeras, Tracking line segments, Image and Vision
Computing 8 (1990) 261-270.
[Evans 90] R. Evans, Kalman Filtering of pose estimates in applications of the RAPID video
rate tracker, in Proc. British Machine Vision Conference, Oxford, UK, Sept. 24-27, 1990,
pp. 79-84.
[Gelb 74] A. Gelb (ed.), Applied Optimal Estimation, The MIT Press, Cambridge, MA and
London, UK, 1974.
[Gennery 82] D.B. Gennery, Tracking known three-dimensional objects, in Proc. Conf. Ameri-
can Association of Artificial Intelligence, Pittsburgh, PA, Aug. 18-20, 1982, pp. 13-17.
[Grimson 90a] W.E.L. Grimson, Object recognition by computer: The role of geometric con-
straints, The MIT Press, Cambridge, MA, 1990.
[Grimson 90b] W. E. L. Grimson, The combinatorics of object recognition in cluttered environ-
ments using constrained search, Artificial Intelligence 44 (1990) 121-165.
[Harris & Stennet 90] C. Harris, C. Stennet, RAPID - A video rate object tracker, in Proc.
British Machine Vision Conference, Oxford, UK, Sept. 24-27, 1990, pp. 73-77.

[Jazwinski 70] A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, New
York, NY and London, UK, 1970.
[Koller et al. 91] D. Koller, N. Heinze, H.-H. Nagel, Algorithmic Characterization of Vehicle Trajectories from Image Sequences by Motion Verbs, in IEEE Conf. Computer Vision and Pattern Recognition, Lahaina, Maui, Hawaii, June 3-6, 1991, pp. 90-95.
[Korn 88] A. F. Korn, Towards a Symbolic Representation of Intensity Changes in Images, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-10 (1988) 610-625.
[Lowe 85] D. G. Lowe, Perceptual Organization and Visual Recognition, Kluwer Academic Publishers, Boston, MA, 1985.
[Lowe 87] D. G. Lowe, Three-Dimensional Object Recognition from Single Two-Dimensional
Images, Artificial Intelligence 31 (1987) 355-395.
[Lowe 90] D. G. Lowe, Integrated Treatment of Matching and Measurement Errors for Robust
Model-Based Motion Tracking, in Proe. Int. Conf. on Computer Vision, Osaka, Japan,
Dec. 4-7, 1990, pp. 436-440.
[Lowe 91] D.G. Lowe, Fitting parameterized three-dimensional models to images, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 13 (1991) 441-450.
[Marslin et al. 91] R.F. Marslin, G.D. Sullivan, K.D. Baker, Kalman Filters in Constrained
Model-Based Tracking, in Proc. British Machine Vision Conference, Glasgow, UK, Sept.
24-26, 1991, pp. 371-374.
[Maybank 90] S. Maybank, Filter based estimates of depth, in Proc. British Machine Vision
Conference, Oxford, UK, Sept. 24-27, 1990, pp. 349-354.
[Murray et al. 89] D.W. Murray, D.A. Castelow, B.F. Buxton, From image sequences to rec-
ognized moving polyhedral objects, International Journal of Computer Vision 3 (1989)
181-208.
[Scales 85] L. E. Scales, Introduction to Non-Linear Optimization, Macmillan, London, UK,
1985.
[Schick & Dickmanns 91] J. Schick, E. D. Dickmanns, Simultaneous estimation of 3D shape
and motion of objects by computer vision, in Proc. IEEE Workshop on Visual Motion,
Princeton, NJ, Oct. 7-9, 1991, pp. 256-261.
[Thompson & Mundy 87] D.W. Thompson, J.L. Mundy, Model-based motion analysis - motion
from motion, in The Fourth International Symposium on Robotics Research, R. Bolles and
B. Roth (ed.), MIT Press, Cambridge, MA, 1987, pp. 299-309.
[Thórhallson 91] T. Thórhallson, Untersuchung zur dynamischen Modellanpassung in monokularen Bildfolgen, Diplomarbeit, Fakultät für Elektrotechnik der Universität Karlsruhe (TH), durchgeführt am Institut für Algorithmen und Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), Karlsruhe, August 1991.
[Tsai 87] R. Tsai, A versatile camera calibration technique for high accuracy 3D machine vision
metrology using off-the-shelf TV cameras and lenses, IEEE Trans. Robotics and Automation
3 (1987) 323-344.
[Verghese et al. 90] G. Verghese, K.L. Gale, C.R. Dyer, Real-time, parallel motion tracking of
three dimensional objects from spatiotemporal images, in V. Kumar, P.S. Gopalakrishnan,
L.N. Kanal (ed.), Parallel Algorithms for Machine Intelligence and Vision, Springer-Verlag,
Berlin, Heidelberg, New York, 1990, pp. 340-359.
[Worrall et al. 91] A.D. Worrall, R.F. Marslin, G.D. Sullivan, K.D. Baker, Model-Based Track-
ing, in Proc. British Machine Vision Conference, Glasgow, UK, Sept. 24-26, 1991, pp. 310-
318.
[Wu et al. 88] J.J. Wu, R.E. Rink, T.M. Caelli, V.G. Gourishankar, Recovery of the 3-D location
and motion of a rigid object through camera image (an Extended Kalman Filter approach),
International Journal of Computer Vision 3 (1988) 373-394.
[Young & Chellappa 90] G. Young, R. Chellappa, 3-D Motion estimation using a sequence of
noisy stereo images: models, estimation and uniqueness results, IEEE Transactions on
Pattern Analysis and Machine Intelligence PAMI-12 (1990) 735-759.

This article was processed using the LaTeX macro package with ECCV92 style
Tracking Moving Contours Using
Energy-Minimizing Elastic Contour Models
Naonori UEDA 1 and Kenji MASE 2

1 NTT Communication Science Laboratories, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
2 NTT Human Interface Laboratories, Kanagawa 238-03, Japan

Abstract. This paper proposes a method for tracking an arbitrary object contour in a sequence of images. For the contour tracking, energy-minimizing elastic contour models, newly presented in this paper, are utilized. The proposed method makes it possible to track an object contour even when complex texture and occluding edges exist in or near the target object. We also present a new algorithm which efficiently solves the energy minimization problem within a dynamic programming framework. The algorithm enables us to obtain an optimal solution even when the variables to be optimized are not ordered.

1 Introduction

Detecting and tracking moving objects is one of the most fundamental and important
problems in motion analysis. When the actual shapes of moving objects are important,
higher level features like object contours, instead of points, should be used for the track-
ing. Furthermore, since these higher level features make it possible to reduce ambiguity
in feature correspondences, the correspondence problem is simplified.
However, in general, the higher the level of the features, the more difficult the ex-
traction of the features becomes. This results in a tradeoff, which is essentially unsolvable as long as a two-stage processing scheme is employed. Therefore, in order to establish high level tracking, object models which embody a priori knowledge about the object shapes are utilized [1][2].
On the other hand, Kass et al. [3] have recently proposed active contour models (snakes) for contour extraction. Once the snake is interactively initialized on an object contour in the first frame, it will automatically track the contour from frame to frame. That is, contour tracking by snakes can be achieved. It is a very elegant and attractive approach because it makes it possible to simultaneously solve both the extraction and tracking problems. That is, the above tradeoff is completely eliminated.
However, this approach is restricted to the case where the movement and deformation of an object are very small between frames. As also pointed out in Ref. [2], this is mainly due to the excessive flexibility of the spline composing the snake model.
In this paper, we propose a robust contour tracking method which can solve the
above problem while preserving the advantages of snakes. In the proposed method, since
the contour model itself is defined by elastic elements with a moderate "stiffness" which does not permit major local deformations, the influence of texture and occluding edges in or near
the target contour is minimal. Hence, the proposed method becomes more robust than
the original snake models in that it is applicable to more general tracking problems.
In this paper, we also present a new algorithm for solving energy minimization prob-
lems using a dynamic programming technique. Amini et al. [4] have already proposed a
dynamic programming (DP) algorithm which is superior to the variational approach with regard to optimality and numerical stability. In order to use DP, however, the original
decision process should be Markovian. From this point of view, with Amini's formula-
tion, optimality of the solution is ensured only in the case of open contours. That is,
for closed contours, reformulation is necessary. In this paper, we clarify the problem of
Amini's formulation, and furthermore, within the same DP framework, we present a new
formulation which guarantees global optimality even for closed contours.

2 Formulating the contour tracking problem

2.1 Elastic contour models

A model contour is defined as a polygon with n discrete vertices. That is, the polygonally approximated contour model is represented by an ordered list of its vertices: C = {v_i = (x_i, y_i)}, 1 ≤ i ≤ n. A contour model is constrained by two kinds of "springs" so that it has a moderate "stiffness" which preserves the form of the tracked object contour in the previous frame as much as possible. That is, each side of the polygon is composed of a spring with a restoring force proportional to its expansion and contraction, while the adjacent sides are constrained by another spring with a restoring force proportional to the change of the interior angle. Assume that these springs are at their natural length when the contour model is at the initial contour position {v_i^0}_{i=1}^n in the current frame; at that time, therefore, no spring force is at work. Clearly, the initial position in the current frame corresponds to the tracking result in the previous frame.

2.2 Energy minimization framework

Let {v_i^0}_{i=1}^n denote the tracked contour in the preceding frame. Then our goal is to move and deform the contour model from {v_i^0}_{i=1}^n to the best position {v_i^*}_{i=1}^n in the current frame such that the following total energy functional is minimized:

E_{total} = \sum_{i=1}^{n} \left( E_{elastic}(v_i) + E_{field}(v_i) \right).    (1)

Here, E_elastic is the elastic energy functional derived from the deformation of the contour model and can be defined as:

E_{elastic}(v_i) = \frac{1}{2} \left( p_1 (|v_{i+1} - v_i| - |v_{i+1}^0 - v_i^0|)^2 + p_2 (\mathrm{ang}(v_i, v_{i+1}, v_{i+2}) - \mathrm{ang}(v_i^0, v_{i+1}^0, v_{i+2}^0))^2 \right)    (2)

where ang(v_i, v_{i+1}, v_{i+2}) means the angle made by the sides v_{i+1}v_i and v_{i+1}v_{i+2} (see Fig. 1b), and p_1 and p_2 are non-negative constants. In Eq. (2), the first energy term corresponds to the deformation due to the expansion and contraction of each side of the polygonally approximated contour model, while the second energy term corresponds to the deformation due to the change of interior angle between the two adjacent sides.
E_field is the potential energy functional which gives rise to the edge potential field forces newly defined in this paper. The potential field is derived from the edges in the current frame, including the target contour. Since the potential field used here is obtained with a distance transformation [5], unlike that used in the original snakes it smoothly extends over a long distance. Therefore, it can influence the contour model even if the contour model is remote from the target contour.
Assuming that z(v_i) denotes the height or potential value at v_i on the potential field, the potential energy E_field can easily be defined by the classical gravitational potential energy equation. That is,

E_{field}(v_i) = m g z(v_i) = p_3 z(v_i),    (3)

where m is the constant mass of v_i and g is the magnitude of the gravitational acceleration; p_3 is a negative constant.
It can be intuitively interpreted that Eq. (1) becomes minimal when the contour model is localized on the contour whose shape most nearly resembles the contour tracked in the previous frame. Accordingly, even if the contour model is remote from the target contour, the model can move to the target contour while preserving its shape as much as possible. As a result, tracking of the desired contour can be achieved.
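To make the energy terms concrete, here is a small Python sketch (our naming and default parameter values) of E_total from Eqs. (1)-(3) for a closed polygonal contour, with the potential z sampled from a distance-transform image; it is only an illustrative discretization, not the authors' implementation.

```python
import numpy as np

def interior_angle(a, b, c):
    """Angle at vertex b made by the sides b->a and b->c."""
    u, w = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cosang = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def total_energy(V, V0, potential, p1=1.0, p2=1.0, p3=-1.0):
    """E_total of Eq. (1) for a closed contour V and its previous position V0 (lists of (x, y))."""
    n = len(V)
    E = 0.0
    for i in range(n):
        j, k = (i + 1) % n, (i + 2) % n          # closed contour: indices wrap around
        d  = np.linalg.norm(np.subtract(V[j],  V[i]))
        d0 = np.linalg.norm(np.subtract(V0[j], V0[i]))
        ang  = interior_angle(V[i],  V[j],  V[k])
        ang0 = interior_angle(V0[i], V0[j], V0[k])
        E += 0.5 * (p1 * (d - d0)**2 + p2 * (ang - ang0)**2)   # E_elastic, Eq. (2)
        row, col = int(round(V[i][1])), int(round(V[i][0]))    # image indexed [y, x]
        E += p3 * potential[row, col]                          # E_field, Eq. (3)
    return E
```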

3 Optimization algorithm
From Eqs. (2) and (3), the total energy functional shown in Eq. (1) can be formally brought to the general form:

E_{total}(v_1, v_2, \ldots, v_n) = \sum_{i=1}^{n} \left\{ f_i(v_i) + g_i(v_i, v_{i+1}) + h_i(v_i, v_{i+1}, v_{i+2}) \right\},    (4)
Note that the general form of Eq.(4) is the same as that of snakes.
The minimization of Eq. (4), like snakes, returns us to the problem of finding the optimum values {v_i^*}_{i=1}^n which give the local minimum, starting from the initial values {v_i^0}_{i=1}^n. One way to find the minimum is by employing exhaustive enumeration. However,
with this approach, combinatorial explosion is inevitable. Therefore, we must devise a
more efficient algorithm.
Recently, Amini et al. [4] proposed a dynamic programming approach to energy minimization of the snakes. In the dynamic programming approach, the minimization of Eq. (4) is viewed as a discrete multistage decision process, with v_i corresponding to the state variable in the i-th decision stage. However, this DP formulation is for open contours, which preserve the ordering of the variables {v_i}_{i=1}^n. In other words, v_1 and v_n are not connected and constrained. Consequently, a reformulation of the DP equation for closed contours is necessary.
Let V be the set of v_1, v_2, ..., v_n. Focusing on v_1 in Eq. (4), v_1 appears in f_1(v_1), g_1(v_1, v_2), h_1(v_1, v_2, v_3), h_{n-1}(v_{n-1}, v_n, v_1), g_n(v_n, v_1) and h_n(v_n, v_1, v_2). Thus, for convenience, we here use S for the sum of these functions. That is,

S = f_1(v_1) + g_1(v_1, v_2) + h_1(v_1, v_2, v_3) + h_{n-1}(v_{n-1}, v_n, v_1) + g_n(v_n, v_1) + h_n(v_n, v_1, v_2).    (5)
Then, the minimization of E_total can be written as:

\min_{V} E_{total} = \min_{V - \{v_1\}} \min_{v_1} E_{total} = \min_{V - \{v_1\}} \left\{ (E_{total} - S) + \min_{v_1} S \right\}.    (6)
Hence, the first step of the optimization procedure is to perform the minimization with respect to v_1 in Eq. (6). Clearly, from Eq. (5), one can see that this minimization is a function of v_2, v_3, v_{n-1}, and v_n. Therefore, this minimization is made and stored for all possible assignments of v_2, v_3, v_{n-1}, and v_n. Formally, the minimization can be written as:

\phi_1(v_2, v_3, v_{n-1}, v_n) = \min_{v_1} S.    (7)

Note that in the minimization in Eq. (7), exhaustive enumeration is employed.


Then, the problem remaining after the minimization with respect to v_1,

\min_{V} E_{total} = \min_{V - \{v_1\}} \left\{ (E_{total} - S) + \phi_1(v_2, v_3, v_{n-1}, v_n) \right\},    (8)

is of the same form as the original problem, and the function \phi_1(v_2, v_3, v_{n-1}, v_n) can be regarded as a component of the new objective function.
Applying the same minimization procedure to the rest of the variables, v_2, v_3, ..., in this order, we can derive the following DP equations. That is, for 2 ≤ i ≤ n − 4,

\phi_i(v_{i+1}, v_{i+2}, v_{n-1}, v_n) = \min_{v_i} \left\{ \phi_{i-1}(v_i, v_{i+1}, v_{n-1}, v_n) + f_i(v_i) + g_i(v_i, v_{i+1}) + h_i(v_i, v_{i+1}, v_{i+2}) \right\},    (9)

while for i = n − 3, n − 2, n − 1 the corresponding DP equations can be obtained analogously.
The time complexity of the proposed DP algorithm then becomes O(nm^5) because, in Eq. (9), the optimum decision is made over m^4 combinations. However, since, in general, each optimum decision stage in DP can be achieved independently, computation time can be drastically reduced with parallel processing.
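As an illustration of the first elimination step (7), here is a brute-force Python sketch (our naming, with assumed call signatures for f_i, g_i, h_i) that tabulates φ_1 over all assignments of v_2, v_3, v_{n-1}, v_n; the subsequent stages (9) reuse the same pattern with φ_{i-1} in place of S.

```python
import itertools

def phi_1(candidates, f, g, h, n):
    """Tabulate phi_1(v2, v3, v_{n-1}, v_n) = min over v1 of S, Eq. (7).
    candidates[i] lists the allowed positions for vertex v_{i+1} (0-based, hashable tuples);
    f(i, vi), g(i, vi, vi1), h(i, vi, vi1, vi2) are the terms of Eq. (4); n >= 6 assumed."""
    table = {}
    for v2, v3, vn1, vn in itertools.product(candidates[1], candidates[2],
                                             candidates[n - 2], candidates[n - 1]):
        best = None
        for v1 in candidates[0]:
            # S collects every term of Eq. (4) that contains v1, cf. Eq. (5)
            S = (f(0, v1) + g(0, v1, v2) + h(0, v1, v2, v3)
                 + h(n - 2, vn1, vn, v1) + g(n - 1, vn, v1) + h(n - 1, vn, v1, v2))
            if best is None or S < best:
                best = S
        table[(v2, v3, vn1, vn)] = best
    return table
```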

4 Experiments

The proposed contour tracking method has been tested experimentally on several synthetic and real scenes. Figure 1 compares the snake model (Fig. 1a) with our model (Fig. 1b) when occluding edges exist. The scene in Fig. 1 is an actual indoor scene and corresponds to one frame from a sequence of a moving bookend on a turntable over a static grid. Since the snake model is influenced by occluding edges, the model was not able to track the target contour. On the other hand, the proposed model successfully tracked it without being influenced by the occluding edges. We also obtained successful results for the tracking of a moving car, a deforming ball, and so on.
In this approach, since the contour model itself moves toward the target contour,
point correspondences are established between frames. That is, correspondence based
optical flows are also obtained. Therefore, feature point trajectories over several frames
can easily be obtained by the proposed method.

5 Conclusions

We have presented here an energy-minimizing elastic contour model as a new approach to moving contour tracking in a sequence of images. Compared to the original snake model, the proposed method is more robust and general because it is applicable even when the movements and deformation of the object between frames are large and when occluding edges exist. Moreover, we have newly devised an optimization algorithm within a dynamic programming framework, which is efficient and mathematically complete.

(a) Tracking by the snake model: I = 0, I = 4, I = 12 (result).
(b) Tracking by the proposed model: I = 0, I = 2, I = 7 (result).

Fig. 1. Comparison of the results of tracking a contour with occluding edges. I denotes the number of iterations.

References
1. Dreschler and Nagel H. H.: "Volumetric model and 3D trajectory of a moving car derived from monocular tv-frame sequence of a street scene", in Proc. IJCAI 81, 1981.
2. Yuille A. L., Cohen D. S. and Hallinan P. W.: "Feature extraction from faces using deformable templates", in Proc. CVPR 89, pp. 104-109, 1989.
3. Kass M., Witkin A. and Terzopoulos D.: "Snakes: Active contour models", Int. J. Comput. Vision, 1, 3, pp. 321-331, 1988.
4. Amini A. A., Weymouth T. E. and Jain R. C.: "Using dynamic programming for solving variational problems in vision", IEEE Trans. Pattern Anal. Machine Intell., PAMI-12, 9, pp. 855-867, 1990.
5. Rosenfeld A. and Pfaltz J. L.: "Distance functions on digital pictures", Pattern Recognition, 1, pp. 33-61, 1968.
This article was processed using the LaTeX macro package with ECCV92 style
Tracking Points on Deformable Objects Using Curvature Information*

Isaac COHEN, Nicholas AYACHE, Patrick SULGER


INRIA, Rocquencourt
B.P. 105, 78153 Le Chesnay CEDEX, France.
Email: isaac@bora.inria.fr, na@bora.inria.fr.

Abstract
The objective of this paper is to present a significant improvement to the approach of
Duncan et al. [1, 8] to analyze the deformations of curves in sequences of 2D images.
This approach is based on the paradigm that high curvature points usually possess an
anatomical meaning, and are therefore good landmarks to guide the matching process,
especially in the absence of a reliable physical or deformable geometric model of the
observed structures.
Like Duncan's team, we therefore propose a method based on the minimization of an energy which tends to preserve the matching of high curvature points, while ensuring a smooth field of displacement vectors everywhere.
The innovation of our work stems from the explicit description of the mapping between the curves to be matched, which ensures that the resulting displacement vectors actually map points belonging to the two curves, which was not the case in Duncan's approach.
We have implemented the method in 2-D and we present the results of the
tracking of a heart structure in a sequence of ultrasound images.

1 Introduction
Non-rigid motion of deformable shapes is becoming an increasingly important topic in
computer vision, especially for medical image analysis. Within this topic, we concentrate
on the problem of tracking deformable objects through a time sequence of images.
The objective of our work is to improve the approach of Duncan et al. [1, 8] to
analyze the deformations of curves in sequences of 2D images. This approach is based
on the paradigm that high curvature points usually possess an anatomical meaning, and
are therefore good landmarks to guide the matching process. This is the case for instance when studying deforming patient skulls (see for instance [7, 9]), when matching faces of the same patient taken at different ages, when matching faces of different patients, or when analyzing images of a beating heart. In these cases, many lines of extremal curvature (or ridges) are stable features which can be reliably tracked between the images (on a face they correspond for instance to the nose, chin and eyebrow ridges, on a skull to the orbital, sphenoid, falx, and temporal ridges, on a heart ventricle to the papillary muscle, etc.).
Like Duncan's team, we therefore propose a method based on the minimization of an energy which tends to preserve the matching of high curvature points, while ensuring a smooth field of displacement vectors everywhere.
The innovation of our work stems from the explicit description of the mapping between
the curves to be matched, which ensures that the resulting displacement vectors actually
* This work was partially supported by Digital Equipment Corporation.

map points belonging to the two curves, which was not the case in Duncan's approach.
Moreover, the energy minimization is obtained through the mathematical framework of
Finite Element analysis, which provides a rigorous and efficient numerical solution. This
formulation can be easily generalized to 3-D to analyze the deformations of surfaces.
Our approach is particularly attractive in the absence of a reliable physical or de-
formable geometric model of the observed structures, which is often the case when
studying medical images. When such a model is available, other approaches would in-
volve a parametrization of the observed shapes [14], a modal analysis of the displacement
field [12], or a parametrization of a subset of deformations [3, 15]. In fact we believe that
our approach can always be used when some sparse geometric features provide reliable
landmarks, either as a preprocessing to provide an initial solution to the other approaches,
or as a post-processing to provide a final smoothing which preserves the matching of re-
liable landmarks.

2 Modelling the Problem

Let C_P and C_Q be two boundaries of the image sequence; the contour C_Q is obtained by a non-rigid (or elastic) transformation of the contour C_P. The curves C_P and C_Q are parameterized by P(s) and Q(s') respectively.
The problem is to determine for each point P on C_P a corresponding point Q on C_Q. To do this, we must define a similarity measure which compares locally the neighborhoods of P and Q.
As explained in the introduction, we assume that points of high curvature correspond to stable salient regions, and are therefore good landmarks to guide the matching of the curves. Moreover, we can assume, as a first order approximation, that the curvature itself remains invariant in these regions. Therefore, we can introduce an energy measure in these regions of the form:

E_{curve} = \frac{1}{2} \int_{C_P} (K_Q(s') - K_P(s))^2 \, ds    (1)

where K_P and K_Q denote the curvatures and s, s' parameterise the curves C_P and C_Q respectively. In fact, as shown by [8, 13], this is proportional to the energy of deformation
of an isotropic elastic planar curve.
We also wish the displacement field to vary smoothly around the curve, in particular to ensure a correspondence for points lying between two salient regions. Consequently we consider the following functional (similar to the one used by Hildreth to smooth a vector flow field along a contour [11]):

E = E_{curve} + R \, E_{regular},    (2)

where

E_{regular} = \int_{C_P} \left\| \frac{\partial (Q(s') - P(s))}{\partial s} \right\|^2 ds

measures the variation of the displacement vector PQ along the curve C_P, and ||·|| denotes the norm associated with the euclidean scalar product (·,·) in the space R².
The regularization parameter R(s) depends on the shape of the curve C_P. Typically, R is inversely proportional to the curvature at P, to give a larger weight to E_curve in salient regions and conversely to E_regular at points in between. This is done continuously, without totally annihilating the weight of either of these two energies (see [4]).
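A small discrete sketch (names and sampling assumptions ours) of the energy (2): curvature differences at matched points plus the squared variation of the displacement vector, weighted by R; uniform arc-length sampling is assumed, so the ds factors are dropped.

```python
import numpy as np

def matching_energy(P, KP, Q_of, KQ_of, f, R=1.0):
    """Discrete version of E = E_curve + R * E_regular, Eq. (2).
    P: (n, 2) sampled points of C_P; KP: their curvatures;
    Q_of(s'), KQ_of(s'): point and curvature on C_Q at arc length s';
    f: array of arc lengths f(s_i) on C_Q matched to the samples of C_P."""
    n = len(P)
    disp = np.array([Q_of(f[i]) - P[i] for i in range(n)])          # displacement vectors P_i Q_i
    e_curve = 0.5 * sum((KQ_of(f[i]) - KP[i])**2 for i in range(n))
    e_reg = sum(np.sum((disp[i + 1] - disp[i])**2) for i in range(n - 1))
    return e_curve + R * e_reg
```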

3 Mathematical Formulation of the Problem

Given two curves C_P and C_Q parameterized by s ∈ [0, 1] and s' ∈ [0, α] (where α is the length of the curve C_Q), we have to determine a function f : [0, 1] → [0, α]; s → s' satisfying

f(0) = 0  and  f(1) = α    (3)

and

f = \mathrm{ArgMin}(E(f)),    (4)

where

E(f) = \int_{C_P} (K_Q(f(s)) - K_P(s))^2 \, ds + R \int_{C_P} \left\| \frac{\partial (Q(f(s)) - P(s))}{\partial s} \right\|^2 ds.    (5)

The condition (3) means that the displacement vector is known for one point of the curve. In the model defined above we assumed that:
- the boundaries have already been extracted,
- the curvatures K are known on the pair of contours (see [9]).
These necessary data are obtained by preprocessing the image sequence (see [4] for more details).
The characterization of a function f satisfying f = ArgMin(E(f)) and the condition (3) is performed by a variational method. This method characterizes a local minimum f of the functional E(f) as the solution of the Euler-Lagrange equation ∇E(f) = 0, leading to the solution of the partial differential equation:

f''(s) \, \|Q'(f)\|^2 + K_P \, \langle N_P, Q'(f) \rangle + \frac{1}{R} \left[ K_P - K_Q(f) \right] K'_Q(f) = 0
+ boundary conditions (i.e. condition (3)),    (6)

where Q is a parametrization of the curve C_Q, Q'(f) the tangent vector of C_Q, K'_Q the derivative of the curvature of the curve C_Q and N_P is the normal vector to the curve C_P.
The term \int_{C_P} (K_Q(f(s)) - K_P(s))^2 ds measures the difference between the curvature
of the two curves. This induces a non-convexity of the functional E. Consequently, solving
the partial differential equation (6) will give us a local minimum of E. To overcome this
problem we will assume that we have an initial estimation f0 which is a good approxi-
mation of the real solution (the definition of the initial estimation f0 will be explained
later). This initial estimation defines a starting point for the search for a local minimum of
the functional E. To take this initial estimation into account we consider the associated
evolution equation:

\frac{\partial f}{\partial t} + f''(s) \|Q'(f(s))\|^2 + K_P(s) \langle N_P(s), Q'(f(s)) \rangle + \frac{1}{R} [K_P(s) - K_Q(f(s))] K'_Q(f(s)) = 0
f(0, s) = f_0(s)   (initial estimation)        (7)
Equation (7) can also be seen as a gradient descent algorithm toward a minimum of
the energy E; it is solved by a finite element method and leads to the solution of a sparse
linear system (see [4] for more details).
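Purely as an illustration of the evolution equation (7), the sketch below iterates an explicit finite-difference scheme in place of the authors' finite element method; all names are ours, the sampled curvatures, tangents and normals are assumed to be precomputed, and the boundary condition f(0) = 0 is omitted for brevity.

import numpy as np

def evolve_f(f, s, sQ, KP, NP, Qp, KQ, dKQ, R, dt=1e-3, n_iter=500):
    """Explicit iteration of the evolution equation (7) (illustrative sketch).

    f       : current estimate of f(s_i), values in [0, a]              (length n)
    s       : uniform samples of the arc length of C_P in [0, 1]        (length n)
    sQ      : samples of the arc length of C_Q in [0, a]                (length m)
    KP, NP  : curvature and unit normals of C_P at s                    (n,), (n, 2)
    Qp      : tangent vectors Q'(s') of C_Q at sQ                       (m, 2)
    KQ, dKQ : curvature of C_Q and its derivative at sQ                 (m,), (m,)
    R       : regularization weight
    """
    h = s[1] - s[0]
    for _ in range(n_iter):
        # Interpolate the quantities attached to C_Q at the current f(s).
        Qp_f = np.stack([np.interp(f, sQ, Qp[:, 0]),
                         np.interp(f, sQ, Qp[:, 1])], axis=1)
        KQ_f, dKQ_f = np.interp(f, sQ, KQ), np.interp(f, sQ, dKQ)
        # Second derivative of f along the closed curve C_P.
        f_ss = (np.roll(f, -1) - 2.0 * f + np.roll(f, 1)) / h ** 2
        # Right-hand side of (7); the boundary condition f(0) = 0 is ignored here.
        rhs = (f_ss * np.sum(Qp_f ** 2, axis=1)
               + KP * np.einsum('ij,ij->i', NP, Qp_f)
               + (KP - KQ_f) * dKQ_f / R)
        f = f - dt * rhs
    return f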

3.1 Determining the Initial Estimation f0
The definition of the initial estimation f0 has an effect upon the convergence of the algo-
rithm. Consequently a good estimation of the solution f will lead to fast convergence.
The definition of f0 is based on the work of Duncan et al. [8]. The method is as follows:
Let s_i \in [0, 1], i = 1...n be a subdivision of the interval [0, 1]. For every point
P_i = (X(s_i), Y(s_i)) of the curve Cp we search for a point Q_i = (X(s'_i), Y(s'_i)) on the
curve CQ, and the function f0 is then defined by f0(s_i) = s'_i.

To do so we have to define a pair of points P0, Q0 which correspond to each other.
But, first of all, let us describe the search method. In the following, we identify a point
and its arc length (i.e. the point s_i denotes the point P_i of the curve Cp such that
P(s_i) = P_i, where P is the parametrisation of the curve Cp).
With each point s_i of Cp we associate a set of candidates S_i on the curve CQ. The
set S_i defines the search area. This set is defined by the point s'_i which is the point of
the curve CQ nearest to s_i, along with (N_search - 1)/2 points of the
curve CQ on each side of s'_i (where N_search is a given integer defining the length of the
search area).
Among these candidates, we choose the point which minimizes the deformation en-
ergy (1).
In some situations this method fails, and the obtained estimation f0 is meaningless,
leading to a bad solution. Figure 1 shows an example where the method described in [8]
fails. This is due to the bad computation of the search area S_i.

Fig. 1. This example shows the problem that can occur when the computation of the initial
estimate is based only on the search in a given area. Shown are the initial estimation of the
displacement field and the obtained solution.

To compute this set more accurately, we have added a criterion based on the arc
length. Consequently, the set defining the search area S_i is now defined by the point s'_i
which is the point of the curve CQ nearest to s_i whose normalized arc length s'_i / a is
close to s_i, along with (N_search - 1)/2 points of the curve CQ on each side of s'_i.
Figure 2 illustrates the use of this new definition of the set S_i for the same curves given
in Fig. 1. This example shows the ability to handle more general situations with this new
definition of the search area S_i.

Fig. 2. For the same case as the previous example, the computation of the initial estimate based
on the local search and the curvilinear abscissa gives a good estimation f0, which leads to an
accurate computation of the displacement function.

As noted above, the search area S_i can be defined only if we have already chosen a
point P0 and its corresponding point Q0. The most salient features in a temporal sequence
undergo small deformations at each time step; thus a good method for choosing the point
P0 is to take the most distinctive point, so that the search for the corresponding point
becomes a trivial task. Consequently the point P0 is chosen among the points of Cp with
maximal curvature. In many cases this method provides a unique point P0. Once we have
chosen the point P0, the point Q0 is found by the local search described above.
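The initial estimation described above can be sketched as follows; the names, the arc-length tolerance and the pointwise curvature comparison (a crude surrogate for the integrated energy (1)) are our own simplifications, not the authors' code.

import numpy as np

def initial_estimate(P, Q, kP, kQ, a, n_search=11, tol=0.2):
    """Discrete initial correspondence f0 (indices into Q), illustrative sketch.

    P, Q   : (n, 2), (m, 2) sampled contours C_P and C_Q.
    kP, kQ : curvatures at those samples.
    a      : length of C_Q (samples of Q assumed uniform in arc length).
    """
    n, m = len(P), len(Q)
    sP = np.linspace(0.0, 1.0, n, endpoint=False)
    sQ = np.linspace(0.0, a, m, endpoint=False)
    half = n_search // 2
    f0 = np.empty(n, dtype=int)
    for i in range(n):
        # Nearest point of C_Q, restricted to candidates whose normalized
        # arc length is compatible with s_i (the added criterion).
        ok = np.abs(sQ / a - sP[i]) < tol
        cand = np.where(ok)[0] if ok.any() else np.arange(m)
        j0 = cand[np.argmin(np.linalg.norm(Q[cand] - P[i], axis=1))]
        # Search window of n_search points centred on j0 (the set S_i).
        window = (j0 + np.arange(-half, half + 1)) % m
        # Keep the candidate whose curvature best matches that of P_i.
        f0[i] = window[np.argmin((kQ[window] - kP[i]) ** 2)]
    return f0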

4 Experimental Results
The method was tested on a set of synthetic and real image sequences. The results are
given by specifying at each discretization point P_i, i = 1...N, of the curve Cp the
displacement vector u_i = P_iQ_i. At each point P_i the arrow represents the displacement
vector u_i.
The first experiments were made on synthetic data. In Fig. 3, the curve CQ (a square)
is obtained by a similarity transformation (translation, rotation and scaling) of the curve
Cp (a rectangle). The obtained displacement field and a plot of the function f are
given. We can note that the algorithm accurately computes the displacements of the


Fig. 3. The rectangle (in grey) is deformed by a similarity (translation, rotation and scaling)
to obtain the black square. In this figure we represent the initial estimation of the displacement
vector of the curves, the obtained displacement field and the plot of the solution f.

four corners. This result was expected since the curves Cp and CQ have salient features
which help the algorithm to accurately compute the displacement vector u_i. Figure 4
gives an example of the tracking of each point on an ellipse deformed by a similarity.
In this case, the points of high curvature are matched together although the curvature
varies smoothly.
As described in Section 3.1 the computation of the initial estimation is crucial. In
the following experiment we have tried to determine the maximal error that can be made
on the estimation of f0 without disturbing the final result. In Fig. 5 we have added
Gaussian noise (\sigma = 0.05) to a solution f obtained by solving (7). This noisy function
was taken as an initial estimation for Eq. (7). After a few iterations the solution f is
recovered (Fig. 5).
It appears that if |f - f0| \le 4h (where h is the space discretization step), starting with
f0 the iterative scheme (7) will converge toward the solution f. The inequality |f - f0| \le
4h means that for each point P on the curve Cp the corresponding point Q can be
determined with an error of 4 points over the grid of the curve CQ.


Fig. 4. Another synthetic example; in this case the curvature along the curves Cp and CQ varies
smoothly. As a consequence, in the computation of the initial estimation f0 several points of the
curve Cp (in grey) often match the same point of CQ (in black). We remark that, for the optimal
solution obtained by the algorithm, each point of the black curve matches a single point of the
grey curve, and that maximum curvature points are matched together.


Fig. 5. In this example we have corrupted an obtained solution with Gaussian noise (\sigma = 0.05)
and considered this corrupted solution as an initial estimate f0. The initial displacement field,
the initial estimate f0 and the obtained solution are shown in this figure.

The tracking of the moving boundaries of the valve of the left ventricle in an ul-
trasound image helps to diagnose some heart diseases. The segmentation of the moving
boundaries over the whole sequence was done by the snake model [6, 2, 10]. In Fig. 6 a
global tracking of a part of the image sequence is shown 2. This set of curves is pro-
cessed (as described in [4]) to obtain the curvatures and the normal vectors of the curves.
Figure 7 shows a temporal tracking of some points of the valve in this image sequence.
The results are presented by pairs of successive contours. One can see that the
results perfectly meet the objectives of preserving the matching of high curvature points
while ensuring a smooth displacement field.

2 Courtesy of I. Herlin [10]



Fig. 6. Temporal tracking of the mitral valve, obtained by the snake model [10], for images 1
to 6.

5 3-D Generalization

In this section we give a 3-D generalization of the algorithm described in the previous
sections. In 3-D imaging we must track points located on surfaces, since the object
boundaries are surfaces (as in [1]). In [16] the authors have shown, on a set of experimen-
tal data, that the extrema of the larger principal curvature often correspond to significant
intrinsic (i.e. invariant under the group of rigid transformations) features which might char-
acterize the surface structure, even in the presence of small anatomic deformations.
Let Sp and SQ be two surfaces parameterized by P(s, r) and Q(s', r'), and let \kappa_P
denote the larger principal curvature of the surface Sp at the point P.
Thus the matching of the two surfaces leads to the following problem:
find a function
f : R^2 \to R^2; (s, r) \to (s', r')
which minimizes the functional:

E(f) = \int_{S_P} (\kappa_Q(f(s,r)) - \kappa_P(s,r))^2 \, ds \, dr
     + R_1 \int_{S_P} \left\| \frac{\partial (Q(f(s,r)) - P(s,r))}{\partial s} \right\|^2 ds \, dr
     + R_2 \int_{S_P} \left\| \frac{\partial (Q(f(s,r)) - P(s,r))}{\partial r} \right\|^2 ds \, dr

where ||.|| denotes the Euclidean norm in R^3. Its resolution by a finite element method
can be done as in [5], and the results should be compared to those obtained by [1]. This
generalization has not been implemented yet.

Fig. 7. Applying the point-tracking algorithm to the successive pairs of contours of Fig. 6
(from left to right and top to bottom).

6 Conclusion

We presented a significant improvement to the approach of Duncan's team for tracking the motion
of deformable 2D shapes, based on the tracking of high curvature points while preserving
the smoothness of the displacement field. This approach is an alternative to the other
approaches in the literature when no physical or geometric model is available, and can
also be used as a complementary approach otherwise.
The results on a real sequence of the time-varying anatomical structure of the beating
heart perfectly met the defined objectives [2]. Future work will include experimentation
with the 3-D generalization.

References

1. A. Amini, R. Owen, L. Staib, P. Anandan, and J. Duncan. Non-rigid motion models for
tracking the left ventricular wall. Lecture Notes in Computer Science: Information Processing
in Medical Imaging. Springer-Verlag, 1991.
2. Nicholas Ayache, Isaac Cohen, and Isabelle Herlin. Medical image tracking. In Active
Vision, Andrew Blake and Alan Yuille, editors, chapter 20. MIT Press, 1992. In press.
3. Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deforma-
tions. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(6):567-
585, June 1989.
4. Isaac Cohen, Nicholas Ayache, and Patrick Sulger. Tracking points on deformable objects
using curvature information. Technical Report 1595, INRIA, March 1992.
5. Isaac Cohen, Laurent D. Cohen, and Nicholas Ayache. Using deformable surfaces to seg-
ment 3-D images and infer differential structures. Computer Vision, Graphics, and Image
Processing: Image Understanding, 1992. In press.
6. Laurent D. Cohen and Isaac Cohen. A finite element method applied to new active contour
models and 3-D reconstruction from cross sections. In Proc. Third International Conference
on Computer Vision, pages 587-591. IEEE Computer Society Conference, December 1990.
Osaka, Japan.
7. Court B. Cutting. Applications of computer graphics to the evaluation and treatment of
major craniofacial malformation. In Jayaram K. Udupa and Gabor T. Herman, editors, 3-D
Imaging in Medicine. CRC Press, 1989.
8. J.S. Duncan, R.L. Owen, L.H. Staib, and P. Anandan. Measurement of non-rigid motion
using contour shape descriptors. In Proc. Computer Vision and Pattern Recognition, pages
318-324. IEEE Computer Society Conference, June 1991. Lahaina, Maui, Hawaii.
9. A. Guéziec and N. Ayache. Smoothing and matching of 3D-space curves. In Proceedings
of the Second European Conference on Computer Vision 1992, Santa Margherita Ligure,
Italy, May 1992.
10. I.L. Herlin and N. Ayache. Features extraction and analysis methods for sequences of
ultrasound images. In Proceedings of the Second European Conference on Computer Vision
1992, Santa Margherita Ligure, Italy, May 1992.
11. Ellen Catherine Hildreth. The Measurement of Visual Motion. The MIT Press, Cambridge,
Massachusetts, 1984.
12. Bradley Horowitz and Alex Pentland. Recovery of non-rigid motion and structures. In
Proc. Computer Vision and Pattern Recognition, pages 325-330. IEEE Computer Society
Conference, June 1991. Lahaina, Maui, Hawaii.
13. L.D. Landau and E.M. Lifshitz. Theory of Elasticity. Pergamon Press, Oxford, 1986.
14. Dimitri Metaxas and Demetri Terzopoulos. Constrained deformable superquadrics and
nonrigid motion tracking. In Proc. Computer Vision and Pattern Recognition, pages 337-
343. IEEE Computer Society Conference, June 1991. Lahaina, Maui, Hawaii.
15. Sanjoy K. Mishra, Dmitry B. Goldgof, and Thomas S. Huang. Motion analysis and epi-
cardial deformation estimation from angiography data. In Proc. Computer Vision and
Pattern Recognition, pages 331-336. IEEE Computer Society Conference, June 1991. La-
haina, Maui, Hawaii.
16. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In Proc. Computer Vision
and Pattern Recognition, pages 644-649. IEEE Computer Society Conference, June 1991.
Lahaina, Maui, Hawaii.
An Egomotion Algorithm Based on the Tracking of
Arbitrary Curves

Emmanuel Arbogast, Roger Mohr


LIFIA-IRIMAG
46 av. Félix Viallet
38031 Grenoble Cedex
France

Abstract. We are interested in the analysis of non polyhedral scenes.


We will present an original egomotion algorithm, based on the tracking of
arbitrary curves in a sequence of gray scale images. This differential method
analyses the spatiotemporal surface, and extracts a simple equation relating
the motion parameters and measures on the spatiotemporal surface. When
a curve contour line is tracked in an image sequence, this equation allows
the 3D motion parameters of the object attached to the contour to be extracted
when rigid motion is assumed. Experiments on synthetic as well as real
data show the validity of this method.

1 Introduction
We are interested in the analysis of non polyhedral scenes using a camera in motion. We
want in particular to determine the movement of the camera, relatively to the object
which is observed, by using visual cues only. This problem is known as the egomotion
problem.
The case of polyhedral objects is already well understood, and one knows how to
determine the motion parameters of the camera by tracking points [FLT87, WHA89,
TF87, LH81] or straight lines [LH86, FLT87] throughout a sequence of images. Dealing
with non polyhedral objects means that contours extracted from the images are no longer
necessarily straight. The problem is therefore expressed as the estimation of the motion
parameters of the camera using arbitrary curves.
O. Faugeras [Fau90] pioneered the field by working on a more general problem: de-
termining the movement and the deformation of a curve whose arc length is constant
(the curve is perfectly flexible but not extensible). He is able to conclude that this es-
timation is impossible, but that a solution exists in the restricted case of rigid curves,
when the movement is reduced to a rigid movement. The approach described here derives
a constraint on the motion that can be set at each point of the tracked curve and therefore
provides a redundant set of equations which allows motion and acceleration to be extracted.
The equations are of the 5th degree in the unknowns.
Faugeras has recently derived the same type of conclusion [Fau91]: for rigid curves
the full real motion field does not have to be computed, and this leads to 5th degree equations.
This paper is therefore just a simpler way to reach this point. However, we show here a
result hidden by the mathematics: the parameterization of the spatiotemporal surface
has to be close to the epipolar parameterization in order to get accurate results. Such an

* This work was partially supported by the Esprit project First and the French project Orasis
within the Gdr-Prc "Communication Homme-Machine".

observation was already made by Blake and Cipolla [BC90] for the problem of surface
reconstruction from motion.
This paper first discusses what contours are and introduces some notations and
concepts like the spatiotemporal surface. The next section then provides the basic equation,
which is then mathematically transformed into an equation in the motion parameters, and the
algorithm for computing the motion is derived. Section 5 discusses the results obtained
on both synthetic and real data. It highlights the parameterization issue and the quality
of the motion estimation.

2 Notations

We introduce a few notions and our notations.

2.1 Contour Classification

Non polyhedral scenes present multiple types of curves to the viewer. Contours on the
image plane are the projection of particular curves on the surface of the observed objects.
These curves can be classified into categories the intrinsic properties of which are different,
requiring a specific treatment.
Discontinuity curves: A discontinuity curve is a curve on the surface of an object
where the gradient of the surface is discontinuous. It is therefore a frontier between two
C^1 surfaces. The edges of a polyhedron belong to this category.
Extremal curves: An extremal curve is a curve on the surface of an object such
that the line of sight is tangent to the surface.
Spatiotemporal surface: When the camera moves relatively to the object, contours
that are perceived move on the image plane. These moving contours, stacked on top of
each other, describe a surface improperly called a spatiotemporal surface; spatiospatial
would be a more appropriate name since the displacement of the contours is due to the
displacement of the camera and not to time. The spatiotemporal surface represents the
integration of all observations of the object. We will prove that this surface is sufficient
to determine the motion parameters of the camera relatively to the object, under certain
conditions.

2.2 Notations of the Problem

Figure 1 summarizes our notations:


A vector X expressed in a reference frame {W} is written ^W X. A vector Q, function
of u_1, u_2, ..., u_n, has its partial derivatives written ^W Q_{u_1}, ^W Q_{u_2}, ..., ^W Q_{u_n}. A rigid
object (Y) has its own fixed reference frame {O}. The camera has its own reference
frame {C}. The position and orientation of the camera reference frame {C} are described
by a translation and a rotation matrix ^O R_C; ^O V_{C/O} and ^O \Omega_{C/O} denote the linear and rotational
velocity of the reference frame {C} w.r.t. the reference frame {O}, and the corresponding
linear and rotational accelerations are written similarly.
At a given time t, we observe on (Y) a critical curve \Gamma(s), where s is a parameter-
ization of the arc \Gamma. A particular point P on \Gamma is considered: we write r(s, t) for the
vector OP in {O}: r = \Gamma(s). P is referenced from {C} by S T, where T (= T(s, t)) is
a unit vector and S is the distance between P and C. The normal at P to the surface of
(Y) is denoted N.


Fig. 1. Reference frames and notations

3 Egomotion algorithm's principle


The problem is to calibrate the movement of the camera relatively to the object with the
data available in the image. Motion parameters have to be estimated from visual cues,
without any a priori knowledge about the scene. We will present our solution in the case
of discontinuity curves when the camera intrinsic parameters are known (focal length,
pixel size along the local camera axis, projection of the focal point onto the image plane).
Monocular observation is inherently ambiguous since there is no way to tell if the object
is far away and moving fast, or close and moving slowly. This ambiguity only affects
translation, and rotation can be completely recovered. Only the direction of translation
can be determined.

Case of an extremal curve:

Fig. 2. Case of an extremal curve

When the camera moves relatively to the surface of an object (Y), a set of contours
is observed in the camera reference frame, creating the spatiotemporal surface. A curve
T(s_0, t) at s_0 = constant that passes through a point p = T(s_0, t_0) of that spatiotemporal
surface corresponds to a curve r(s_0, t) at s_0 = constant passing through P = r(s_0, t_0) on the
surface of the object, as illustrated by Fig. 2.

Case of a discontinuity curve:

Fig. 3. s = constant curve for a discontinuity curve (Y)

In the case of a discontinuity curve, the general configuration is that of Fig. 3, where
the object (Y), a curve of discontinuity, is reduced to the arc corresponding to r(s, t_0),
and where the curve r(s_0, t) is a subset of this arc: for an arbitrary parameterization
of the spatiotemporal surface, the curve T(s_0, t) corresponds indeed to a curve r(s_0, t)
necessarily placed on that arc (and possibly locally degenerated into a single point). This
property enables us to state the fundamental property for egomotion:
The curves r(s, t_0) and r(s_0, t) both pass through the point P, and correspond locally to
the same 3D arc. Their differentials in s and t at the point r(s_0, t_0) are therefore parallel,
which is expressed by:

^O r_s \wedge ^O r_t = 0        (1)

This constraint expressed in the frame {C} becomes an equation that only relates mea-
sures and motion parameters, which permits the computation of a solution for the motion
parameters. Notice that there is a particular case where the parameterization r(s_0, t)
leads locally to a constant function with respect to time. From one image to the other
the point P is in correspondence with itself and again equation (1) holds: ^O r_t equals 0.
This case corresponds to the epipolar parameterization [BC90].

4 Mathematical analysis
The egomotion problem is related to kinematics and differential geometry. We introduce
a few results of kinematics before we actually solve the problem at hand. The constraint
(1) is indeed expressed in the frame {O}, whereas it should be expressed in the frame {C},
where measures are known.

4.1 Kinematics of the Solid

Derivation in a mobile reference frame: The only results of kinematics that we will
need concern the differentiation of vectors in a mobile reference frame. The notations we use
derive from [Cra86].

Given a vector U, function of t, mobile in {C}: ^C U_t is the derivative of U in {C}, and
^O U_t the derivative of U in {O}. ^C(^O U_t) is therefore the derivative of U in {O} expressed
in {C}.
The two key equations of kinematics concern the differentiation of U in {O} at first
and second order:
^O U_t = ^O R_C \, ^C U_t + ^O \Omega_{C/O} \wedge ^O U   (where \wedge denotes the cross product)
^O U_tt = ^O R_C \, ^C U_tt + 2 \, ^O \Omega_{C/O} \wedge (^O R_C \, ^C U_t) + ^O \Omega_{C/O} \wedge (^O \Omega_{C/O} \wedge ^O U) + ^O \dot{\Omega}_{C/O} \wedge ^O U
These equations expressed in {C} are simplified and the rotation matrix ^O R_C disappears:
^C(^O U_t) = ^C U_t + ^C \Omega_{C/O} \wedge ^C U        (2)
^C(^O U_tt) = ^C U_tt + 2 \, ^C \Omega_{C/O} \wedge ^C U_t + ^C \Omega_{C/O} \wedge (^C \Omega_{C/O} \wedge ^C U) + ^C \dot{\Omega}_{C/O} \wedge ^C U        (3)
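Equations (2) and (3) translate directly into code; the sketch below assumes all vectors are given as 3-component numpy arrays expressed in {C}, and the function names are ours.

import numpy as np

def derivative_in_O(U, U_t, omega):
    """Eq. (2): derivative of U taken in {O}, expressed in {C}."""
    return U_t + np.cross(omega, U)

def second_derivative_in_O(U, U_t, U_tt, omega, omega_dot):
    """Eq. (3): second derivative of U taken in {O}, expressed in {C}."""
    return (U_tt
            + 2.0 * np.cross(omega, U_t)
            + np.cross(omega, np.cross(omega, U))
            + np.cross(omega_dot, U))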

4.2 Egomotion Equations
The key equation for the egomotion problem is the constraint (1), which expresses the
fact that the observed curve is fixed in its reference frame {O}:
^O r_s \wedge ^O r_t = 0
Two independent scalar equations can be extracted for each point considered. It is pos-
sible to obtain a unique equivalent equation by stating that the norm of the vector is
zero.
Equation (1) is expressed in {O}; we have to express each term in {C} in order to
make the motion parameters explicit, using the results of kinematics presented earlier.
One degenerate case exists. It corresponds to the case where the camera motion lies
in the plane tangent to the surface. No information on the movement can then be
extracted for this particular point.

4.3 Algorithmic Solution
Equation (1) is a differential equation whose unknowns correspond to the four vectors
describing the linear and rotational velocities and accelerations of {C} with respect to {O},
expressed in {C}.
It is possible to solve this problem by using finite differences. At each point, constraint
(1) is evaluated. The correct motion is the one that minimizes the sum of all constraints at all
available points. Degenerate cases may arise where multiple solutions exist. Nevertheless,
by using a least squares approach instead of one with the minimal number of points, we
most likely get rid of this problem.
Solving the equations requires initial values of the rotational and linear velocities.
Equation (1) is in fact a ratio of polynomial functions and it is possible to transform
it into a 5th degree polynomial constraint in its 6 unknowns. This is why it is necessary
to know a good approximation of the solution in order to converge toward the right
values.
This result is to be compared with that of O. Faugeras [Fau91], where the constraint
is of similar complexity, but intuitively hard to comprehend.
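The least-squares scheme of this section can be organized as below; this is only a skeleton under our own naming, and the per-point residual (constraint (1) expressed in {C}, the 5th-degree expression mentioned above) is left to the caller rather than reproduced here.

import numpy as np
from scipy.optimize import least_squares

def estimate_motion(residual_fn, x0, points):
    """Minimize the sum of squared constraints over the motion parameters.

    residual_fn(x, p) : value of constraint (1), expressed in {C}, at the
                        spatiotemporal sample p for motion parameters x.
    x0                : initial guess (e.g. odometry or the previous time step).
    points            : samples of the spatiotemporal surface.
    """
    def residuals(x):
        return np.array([residual_fn(x, p) for p in points])
    # Redundancy over many points makes the estimate robust to noise.
    return least_squares(residuals, x0).x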

5 Experimental results

The first part of the experiments uses synthetic data in order to validate the theoretical
approach as well as the implementation. The second part deals with real data. The
algorithm's sensitivity to noise is also analyzed.

5.1 Synthetic Data
The contour used for testing is an arc of an ellipse moving relatively to the camera. A purely
translational movement is used first, then a rotation is added. The main axes
of the ellipse have lengths 2 and 2.6 meters, and the plane containing the ellipse lies at
a distance of 8 meters from the camera.
Finite differences require initial values of the motion parameters to be known at time
step t = 0. Exact values of the movement at t = 0 are used, and the motion parameters
for the translation and the rotation are then computed at time step 1.

Pure translation: The movement of the camera is a uniform translation whose steps are
equal in Y and Z, where Z is the direction of the optical axis and Y the
horizontal direction in the image plane. There is no rotation in this case. The sensitivity of
the computed translation parameters w.r.t. the linear velocity is studied. A number
of sequences are generated for increasing values of the velocity. For each velocity, a sequence
of contours is generated and the motion parameters estimated. A uniform linear velocity
of 0.4 meters per time unit is used.
Table 1 summarizes the results. The correspondence between the contours is the
epipolar correspondence, established with a priori estimates of the movement.

Linear velocity       Computed linear velocity              Computed rotational velocity
along y   along z     along x     along y   along z         along x      along y     along z
0.01      0.01        -4.38e-07   0.01      0.01            8.227e-11    3.635e-08   -2.966e-07
0.05      0.05        -2.09e-06   0.0500    0.0500          -1.202e-07   7.490e-07   8.719e-11
0.2       0.2         -0.00058    0.2003    0.2007          -2.377e-05   5.265e-05   0.00015
0.4       0.4         -0.0019     0.4023    0.409           -0.00026     0.00034     -0.00017
0.8       0.8         -0.011      0.812     0.902           -0.000369    0.00035     0.00018

Table 1. Computed linear velocities

Remark: For a linear velocity of 0.4 meter per time unit, the displacement of the
contour on the image plane is of the order of 100 pixels, which is very far from the original
hypothesis of infinitesimal displacement.
For the same movement but an arbitrary correspondence, for instance a direction orthogonal to
the contour, results degrade more quickly than with the epipolar correspondence.
If the movement of the camera is approximately known, it is suggested to use it. If the
planar movement of the contour in the image plane can be estimated, it is wise to use it
to establish the correspondence.

Arbitrary movement: We add a rotational movement to the purely translational move-
ment. The rotation is a uniform rotation around the axis Y.
The sensitivity of the computed motion parameters w.r.t. the rotational velocity is now
studied. A number of sequences are generated for increasing values of the rotational ve-
locity. The motion parameters are computed for each value. The linear velocity is kept
constant and equal to 0.2 m/s along Y and Z. A rotational velocity of 0.05 radian per
time unit is used.
Table 2 summarizes the results. The correspondence is the epipolar correspondence
with a priori estimates of the motion.

Rotational velocity   Computed linear velocity            Computed rotational velocity
along y               along x     along y   along z       along x      along y    along z
0.01                  -0.0002     0.2001    0.2007        -1.953e-06   0.0100     0.0001
0.05                  -0.0004     0.2004    0.197         -8.180e-06   0.05010    1.591e-05
0.1                   -0.004      0.2004    0.205         -3.39e-05    0.100      -2.33e-05
0.2                   -0.023      0.2029    0.215         -0.0001      0.20004    -7.871e-05
0.4                   -0.00156    0.1893    0.202         0.00045      0.368      -0.00193

Table 2. Computed motion parameters

For the same motion but using an arbitrary correspondence (orthogonal to the con-
tour), degraded results are obtained.

Sensitivity to the noise on the contours is now analyzed; a uniform constant
translation in Y and Z of 0.2 m/s and a uniform constant rotation around Y of \Omega =
0.05 rad/s is considered. Zero-mean Gaussian noise is added to the position of the
pixels of the contour. Table 3 shows the influence of the noise on the computed values of
the motion parameters. The variance of the noise is expressed in pixels.

Variance         Computed linear velocity            Computed rotational velocity
along x and y    along x    along y   along z        along x      along y    along z
0                -0.00147   0.20016   0.198          -5.699e-06   0.05006    1.19e-05
0.5              -0.00145   0.2002    0.1979         -2.416e-06   0.05007    -3.767e-07
                 -0.00133   0.20008   0.198          -2.216e-05   0.05006    -1.365e-05
                 -0.00136   0.20012   0.1979         -2.025e-05   0.05006    -2.829e-06
                 -0.00113   0.2003    0.198          8.127e-06    0.05002    -1.462e-06
                 -0.00151   0.199     0.198          -3.985e-05   0.05005    2.249e-05
16               -0.00115   0.2003    0.198          -4.282e-07   0.05003    -1.091e-06

Table 3. Influence of the noise on the motion parameters

Noise on the contour obviously has very little influence on the computed motion
parameters.

5.2 Real Data - Sequence of the Pear

Figure 4 presents 4 images of the sequence of the pear. The thicker contours are the ones
used to estimate the motion parameters.

Table 4 shows the egomotion results on the pear sequence. A priori values of the
motion parameters (obtained with the robot sensors) are plotted against the computed
values.


Fig. 4. Pear sequence

                    Linear velocity                       Rotational velocity
image               along x     along y    along z        along x     along y    along z
1     a priori      -0.0129     -0.0266    -0.00038       -0.04506    0.02007    0.02044
      computed      -0.0133     -0.0267    0.00199        -0.0444     0.0231     0.0193
2     a priori      -0.01112    -0.02727   0.000487       -0.04649    0.01687    0.020327
      computed      -0.0115     -0.0271    0.0054         -0.0437     0.0234     0.0197

Table 4. Computed motion parameters against calibrated values

6 Conclusions

A new egomotion technique has been presented which can be applied when no point or straight
line correspondences are available. It generalizes egomotion to the case of arbitrarily
shaped contours, which is especially valuable in the case of non polyhedral objects. The
computation uses a very simple finite differences scheme and quickly provides a good esti-
mation of the motion parameters. This technique is robust against noise on the contours
since it uses a least squares approach. The experiments we conducted on synthetic
as well as real data show the validity of this approach. Finite differences do not, however,
allow the computation to be performed over a long sequence, since errors accumulate.

It has also been observed experimentally that a parameterization close to the epipolar one
provides better accuracy. So a rough motion estimation should be used to compute such a parameterization.
It has to be noted that all discontinuity contours in the image can be used, by building
as many spatiotemporal surfaces as contours. Robustness is thus increased. If multiple
rigid objects are moving independently in the scene, it is important to compute each
movement separately. It is then necessary to segment the image into regions where the
associated 3-D movement is homogeneous. It would be interesting at this point to study
more sophisticated techniques, such as finite elements, to obtain more precise
results, now that we have proved feasibility with this simple finite differences scheme.

References

[BC90] A. Blake and R. Cipolla. Robust estimation of surface curvature from deformation of
apparent contours. In O. Faugeras, editor, Proceedings of the 1st European Conference
on Computer Vision, Antibes, France, pages 465-474. Springer Verlag, April 1990.
[Cra86] John J. Craig. Introduction to Robotics: Mechanics and Control. Addison-Wesley,
1986.
[Fau90] O. Faugeras. On the motion of 3D curves and its relationship to optical flow. Rapport
de Recherche 1183, INRIA Sophia-Antipolis, March 1990.
[Fau91] O. Faugeras. On the motion field of curves. In Proceedings of the Workshop on
Applications of Invariants in Computer Vision, Reykjavik, Iceland, March 1991.
[FLT87] O.D. Faugeras, F. Lustman, and G. Toscani. Motion and structure from point and
line matches. In Proceedings of the 1st International Conference on Computer Vision,
London, England, June 1987.
[LH81] H.C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two
projections. Nature, 293:133-135, September 1981.
[LH86] Y. Liu and T.S. Huang. Estimation of rigid body motion using straight line cor-
respondences, further results. Proceedings of the 8th International Conference on
Pattern Recognition, Paris, France, pages 306-307, October 1986.
[TF87] G. Toscani and O.D. Faugeras. Mouvement par reconstruction et reprojection. In
11ème Colloque sur le Traitement du Signal et des Images (GRETSI), Nice, France,
pages 535-538, 1987.
[WHA89] J. Weng, T.S. Huang, and N. Ahuja. Motion and structure from two perspective
views: algorithms, error analysis and error estimation. IEEE Transactions on PAMI,
11(5):451-476, May 1989.

This article was processed using the LaTeX macro package with ECCV92 style
Region-Based Tracking in an Image Sequence *

François Meyer and Patrick Bouthemy


IRISA/INRIA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

Abstract. This paper addresses the problem of object tracking in a se-
quence of monocular images. The use of regions as primitives for tracking
makes it possible to directly handle consistent object-level entities. A motion-based
segmentation process based on normal flows and first order motion models
provides instantaneous measurements. Shape, position and motion of each
region present in such segmented images are estimated with a recursive al-
gorithm along the sequence. Occlusion situations can be handled. We have
carried out experiments on sequences of real images depicting complex out-
door scenes.

1 Introduction

Digitized time-ordered image sequences provide a rich support to analyze and
interpret temporal events in a scene. Obviously the interpretation of dynamic scenes has
to rely somehow on the analysis of displacements perceived in the image plane. During
the 80's, most works focused on the two-frame problem, that is, recovering the
structure and motion of the objects present in the scene either from the optical flow field
derived between time t and time t + 1, or from the matching of distinguished features
(points, contour segments, ...) previously extracted from two successive images.
Both approaches usually suffer from different shortcomings, like intrinsic ambiguities,
and above all numerical instability in case of noisy data. It is obvious that performance
can be improved by considering a more distant time interval between the two considered
frames (by analogy with an appropriate stereo baseline). But matching problems then become
overwhelming. Therefore, an attractive solution is to take into account more than
two frames and to perform tracking over time using recursive temporal filtering [1].
Tracking thus represents one of the central issues in dynamic scene analysis.
First investigations were concerned with the tracking of points [2] and contour segments
[3, 4]. However the use of vertices or edges leads to a sparse set of trajectories and can
make the procedure sensitive to occlusion. The interpretation process requires these
features to be grouped into consistent entities. This task can be more easily achieved when work-
ing with a limited class of a priori known objects [5]. It appears that the ability to
directly track complete and coherent entities should make it possible to solve
occlusion problems more efficiently, and should also make the further scene interpretation step easier.
This paper addresses this issue. Solving it requires dealing with dense spatio-temporal
information. We have developed a new tracking method which takes regions
as features and relies on 2D motion models.

* This work is supported by MRT (French Ministry of Research and Technology) in


the context of the EUREKA European project PROMETHEUS, under PSA-contract
VY/85241753/14/Z10.

2 Region Modeling, Extraction and Measurement

We want to establish and maintain the successive positions of an object in a sequence of


images. Regions are used as primitives for the tracking algorithm. Throughout this paper
we will use the word "region" to refer to a connected component of points issued from
a motion-based segmentation step. A region can be interpreted as the silhouette of the
projection of an object in the scene, in relative motion with respect to the camera.
Previous approaches [6] to the "region-tracking" issue generally reduce to the track-
ing of the center of gravity of regions. The problem with these methods is their inability
to capture complex motion of objects in the image plane. Since the center of gravity of
a region in the image does not correspond to the same physical point throughout the
sequence, its motion does not accurately characterize the motion of the concerned region.
We proceed as follows. First the segmentation of each image is performed using a
motion-based segmentation algorithm previously developed in our lab. Second, the cor-
respondence between the predicted regions and the observations supplied by the seg-
mentation process is established. Finally, a recursive filter refines the prediction and its
uncertainty to obtain the estimates of the region location and shape in the image. A new
prediction is then generated for the next image.

2.1 The Motion-Based Segmentation Algorithm
The algorithm is fully described in [7]. The motion-based segmentation method ensures
stable motion-based partitions owing to a statistical regularization approach. This ap-
proach requires neither explicit 3D measurements nor the estimation of optic flow
fields. It mainly relies on the spatio-temporal variations of the intensity function while
making use of 2D first-order motion models. It also manages to link those partitions in
time, but of course only to a short-term extent.
When a moving object is occluded for a while by another object of the scene and
reappears, the motion-based segmentation process may not maintain the same label for
the corresponding region over time. The same problem arises when trajectories of objects
cross each other. Labels before occlusion may disappear and leave place to new labels
corresponding to reappearing regions after occlusion. Consequently, tracking regions over
long periods of time requires a steady filtering procedure. A true trajectory rep-
resentation and determination is required. The segmentation process provides only
instantaneous measurements. In order to work with regions, the concept of region must
be defined in some mathematical sense. We describe hereafter the region descriptor used
throughout this paper.

2.2 The Region Descriptor
The region representation
We need a model to represent regions. The representation of a region is not intended
to capture the exact boundary. It should give a description of the shape and location
that supports the task of tracking even in the presence of partial occlusion.
We choose to represent a region by some of its boundary points. The contour is
sampled in such a way that it preserves the shape information of the silhouette. We must
select points that best capture the global shape of the region. This is achieved through
a polygonal approximation of the region. A good approximation should be "close" to
the original shape and have the minimum number of vertices. We use the approach

developed by Wall and Danielsson in [8]. A criterion controls the closeness of the shape
and the polygon.
The region can be approximated accurately by this set of vertices. This representation
is flexible enough to follow the deformations of the tracked sil-
houette. Furthermore, it results in a compact description which decreases
the amount of data required to represent the boundary, and it yields easily tractable mod-
els to describe the dynamic evolution of the region.
Our region tracking algorithm requires the matching of the prediction and an ob-
servation. The matching is achieved more easily when dealing with convex hulls. Among
the boundary points approximating the silhouette of the region, we retain only those
which are also vertices of the convex hull of the considered set of points. It must
be pointed out that these polygonal approximations only play a role as "internal items"
in the tracking algorithm to ease the correspondence step between prediction and ob-
servation. They do not restrict the type of objects to be handled, as shown in the results
reported further.
The region descriptor
This descriptor is intended to represent the silhouette of the tracked region all along
the sequence. We represent the tracked region with the same number of points during
successive time intervals of variable size. At the beginning of the interval we determine
in the segmented image the number of points, n, necessary to represent the concerned
region. We keep this number fixed as long as the distance, defined in Sect. 2.3, between
the predicted region and the observation extracted from the segmentation is not too
large. The moment the distance becomes too large, the region descriptor is reset to
an initial value equal to the observation. This announces the beginning of a new interval.
We can represent the region descriptor by a vector of dimension 2n. This vector is
the juxtaposition of the coordinates (x_i, y_i) of the vertices of the polygonal approximation
of the region: [x_1, y_1, x_2, y_2, ..., x_n, y_n]^T.
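A possible sketch of this region descriptor is given below; it replaces the Wall-Danielsson polygonal approximation [8] by plain uniform subsampling, so the names and that simplification are ours, not the paper's.

import numpy as np
from scipy.spatial import ConvexHull

def region_descriptor(boundary, n_max=20):
    """Vertices of the convex hull of a subsampled boundary, as a 2n-vector.

    boundary : (k, 2) array of boundary points of the segmented region.
    n_max    : rough bound on the number of retained boundary samples
               (uniform subsampling stands in for the approximation of [8]).
    """
    step = max(1, len(boundary) // n_max)
    approx = boundary[::step]
    hull = ConvexHull(approx)
    vertices = approx[hull.vertices]      # hull vertices in counter-clockwise order
    return vertices.reshape(-1)           # [x1, y1, ..., xn, yn]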

2.3 The Measurement Vector
Measurement definition

We need a measurement of the tracked region in each image in order to update
the prediction generated by the filter. The measurement is derived from the segmented
image. For a given region we would like a measurement vector that depicts this region
with the same number of points as the region descriptor. This number remains constant
throughout an interval of frames. The shape of the tracked region may change. The
region may be occluded. Thus the convex hull of the segmented region does not provide
enough information. We therefore generate a more complete measurement vector related to the
segmented region. The idea is illustrated in Fig. 1. If the segmentation algorithm provides
us with only a partial view of the region, the "remaining part" can be inferred as follows.
Let us assume that the prediction is composed of n points, and that the boundary of the
region obtained by the segmentation is represented by m points (if the silhouette of the
observation is occluded we have m \le n). We move the polygon corresponding to the
prediction in order to globally match it with the convex hull of the observation composed
of m points. We finally select the n points of the correctly superimposed polygon onto
the observation as the measurement vector. The measurement indeed coincides with the
segmented region, and if the object is partially occluded, the measurement still gives an
equivalent complete view of the silhouette of the region.

Consequently this approach does not require the usual matching of specific features
which is often a difficult issue. Indeed the measurement algorithm works on the region
taken as a whole.


Fig. 1. The measurement algorithm : (1) Observation obtained by the segmentation (grey re-
gion), and prediction (solid line) ; (2) Convex hull of the observation ; (3) Matching of polygons ;
(4) Effective measurement : vertices of the grey region.

Measurement algorithm

If we represent the convex hull of the silhouette obtained by the segmentation and the
prediction vector as two polygons, the problem of superimposing the observation and the
prediction reduces to the problem of matching two convex polygons with possibly
different numbers of vertices.
Matching is achieved by moving a polygon and finding the best translation and rota-
tion to superimpose it on the other one. We did not include scaling in the transformation;
otherwise, in the case of occlusion, the minimization process would scale the prediction to
achieve a best match with the occluded observation. A distance is defined on the space
of shapes [9], and we seek the geometrical transformation that minimizes the distance
between the two polygons. If P_1 and P_2 are two polygons and T the transform applied to
the polygon P_2, we minimize f with respect to T:

f(T) = m(P_1, T(P_2)) = \sum_{M_1 \in P_1} d(M_1, T(P_2))^2 + \sum_{M_2 \in P_2} d(T(M_2), P_1)^2        (1)

The function f is continuous and differentiable. It is also convex with respect to the two
parameters of the translation. Thus conjugate-gradient methods can be used to solve the
optimization problem.
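A minimal sketch of this matching step is given below; it approximates the point-to-polygon distances of (1) by nearest-vertex distances and uses a generic conjugate-gradient routine, so both simplifications and all names are ours.

import numpy as np
from scipy.optimize import minimize

def match_polygons(P1, P2):
    """Translation + rotation minimizing a vertex-based version of (1).

    P1, P2 : (n1, 2) and (n2, 2) arrays of polygon vertices.
    Returns the optimal parameters (tx, ty, theta).
    """
    def transform(x, pts):
        tx, ty, th = x
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        return pts @ R.T + np.array([tx, ty])

    def cost(x):
        Q2 = transform(x, P2)
        d12 = np.min(np.linalg.norm(P1[:, None] - Q2[None, :], axis=2), axis=1)
        d21 = np.min(np.linalg.norm(Q2[:, None] - P1[None, :], axis=2), axis=1)
        return np.sum(d12 ** 2) + np.sum(d21 ** 2)

    # Initialize with the translation between centroids and no rotation.
    x0 = np.append(P1.mean(axis=0) - P2.mean(axis=0), 0.0)
    return minimize(cost, x0, method='CG').x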

3 The Region-Based Tracking Algorithm

A previous version of the region-tracking algorithm, in which each vertex of the region could
evolve independently of the others with constant acceleration, is proposed in [10]. The
measurement is generated by the algorithm described in Sect. 2.3. A Kalman filter gives
estimates of the position of each vertex. Though the model used to describe the evolution
of the region is not very accurate, we nevertheless obtain good results with the method.
We propose hereafter a more realistic model to describe the evolution of the region. More
details can be found in [10].

Our approach has some similarities with the one proposed in [11]. The authors con-
strain the target motion in the image plane to be a 2D affine transform. An overde-
termined system allows the motion parameters to be computed. However, the region repre-
sentation and the segmentation step are quite different and less efficient. Besides, their
approach does not take into account the problems of possible occlusion or junction of tra-
jectories. We propose an approach with a complete model for the prediction and update
of the object geometry and kinematics.
We make use of two models: a geometric model and a motion model (Fig. 2). The
geometric filter and the motion filter estimate shape, position and motion of the region
from the observations produced by the segmentation. The two filters interact: the esti-
mation of the motion parameters enables the prediction of the geometry of the region in
the next frame. The shape of the region obtained by the segmentation is compared with
the prediction. The parameters of the region geometry are updated. A new prediction of
the shape and location of the region in the next frame is then calculated.
When there is no occlusion the segmentation process assigns the same label over time
to a region; thus the correspondence between prediction labels and observation labels is
easy. If trajectories of regions cross each other, new labels corresponding to reappearing
regions after occlusion will be created while labels before occlusion will disappear. In this
case more complex methods must be derived to estimate the complete trajectories of the
objects.


Fig. 2. The complete region-based tracking filter

3.1 The Geometric Filter

We assume that each region R in the image at time t + 1 is the result of an affine
transformation of the region R in the image at time t. Hence every point (x(t), y(t)) \in R
at time t will be located at (x(t + 1), y(t + 1)) at time t + 1, with:

(x, y)^T(t + 1) = \Phi(t) (x, y)^T(t) + b(t)        (2)
The affine transform has already been used to model small transformations between two
images [11]. The matrix \Phi(t) and the vector b(t) can be derived from the parameters of
the affine model of the velocity field, calculated in the segmentation algorithm for each
region moving in the image. Let M(t) and u(t) be the parameters of the affine model of
the velocity within the region R. We have:

w(x, y, t) = M(t) (x, y)^T + u(t),

where w denotes the modelled velocity at (x, y). Even if 2nd order terms generally result from the projection in the image of a rigid
motion, they are sufficiently small to be neglected in such a context of tracking, which
does not involve accurate reconstruction of 3D motion from 2D motion. Affine models
of the velocity field have already been proposed in [12] and [13]. The following relations
apply:
\Phi(t) = I_2 + M(t)   and   b(t) = u(t)        (3)
For the n vertices (x_1, y_1), ..., (x_n, y_n) of the region descriptor we obtain the following
system model:

X(t + 1) = diag(\Phi(t), ..., \Phi(t)) X(t) + [b(t)^T, ..., b(t)^T]^T + \xi(t),
with X = [x_1, y_1, ..., x_n, y_n]^T,

where I_2 is the 2 x 2 identity matrix and \Phi(t) and b(t) have been defined above in (3). Each
\xi_i = [\xi_x, \xi_y]^T is a two-dimensional, zero-mean Gaussian noise vector. We choose a simplified
model of the noise covariance matrix: we assume that
\Gamma_\xi = \sigma_\xi^2 I_{2n},
where I_{2n} is the 2n x 2n identity matrix. This assumption enables us to break the filter
of dimension 2n into n filters of dimension 2.
The matrix \Phi(t) and the vector b(t) account for the displacements of all the points
within the region between t and t + 1. Therefore the equation captures the global de-
formation of the region. Even though each vertex is tracked independently, the system
model provides a "region-level" representation of the evolution of the points.
For each tracked vertex the measurement is given by the position of the vertex in the
segmented image. The measurement process generates the measurement as explained in
Sect. 2.3.
The following system describes the dynamic evolution of each vertex (x_i, y_i) of the
region descriptor of the tracked region. Let s(t) = [x_i, y_i]^T be the state vector, and m(t)
the measurement vector which contains the coordinates of the measured vertex:
s(t + 1) = \Phi(t) s(t) + b(t) + \xi(t)
m(t) = s(t) + \eta(t)        (4)
\xi(t) and \eta(t) are two sequences of zero-mean Gaussian white noise; b(t) is interpreted
as a deterministic input. \Phi(t) is the matrix of the affine transform. We assume that the

above linear dynamic system is sufficiently accurate to model the motion of the region
in the image. We want to estimate the vector s(t) from the measurement m(t). The
Kalman filter [14] provides the optimal linear estimate of the unknown state vector from
the measurements, in the sense that it minimizes the mean square estimation error and,
by choosing the optimal weight matrix, gives a minimum-variance unbiased estimate. We
use a standard Kalman filter to generate recursive estimates \hat{s}(t).
The first measurement is taken as the initial value of the estimate; hence we have
\hat{s}(0) = m(0). The covariance matrix of the initial estimate is set to a diagonal matrix
with very large coefficients. This expresses our lack of confidence in this first value.
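One predict/update cycle of this per-vertex filter can be sketched as follows, assuming scalar process and measurement noise variances; the naming and those assumptions are ours, not the paper's.

import numpy as np

def vertex_kalman_step(s_est, P_est, M, u, z, q=1.0, r=1.0):
    """One Kalman cycle of the geometric filter for a single vertex.

    s_est, P_est : current state estimate (2,) and covariance (2, 2).
    M, u         : affine motion parameters of the region (2x2 matrix, 2-vector).
    z            : measured vertex position from the segmented image (2,).
    q, r         : assumed scalar process / measurement noise variances.
    """
    Phi = np.eye(2) + M                          # Eq. (3): Phi = I_2 + M
    # Prediction: b(t) = u(t) acts as a deterministic input.
    s_pred = Phi @ s_est + u
    P_pred = Phi @ P_est @ Phi.T + q * np.eye(2)
    # Update with the identity measurement model of (4).
    K = P_pred @ np.linalg.inv(P_pred + r * np.eye(2))
    s_new = s_pred + K @ (z - s_pred)
    P_new = (np.eye(2) - K) @ P_pred
    return s_new, P_new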

3.2 The Kinematic Filter
The attributes of the kinematic model are the six parameters of the 1st order approxima-
tion of the velocity field. These variables are determined with a least-squares regression
method. Therefore these instantaneous measurements are corrupted by noise and we
need a recursive estimator to convert observation data into accurate estimates. We use
a Kalman filter to perform this task. We work with the equivalent decomposition:

M = \frac{1}{2} \begin{pmatrix} div + hyp_1 & hyp_2 - rot \\ hyp_2 + rot & div - hyp_1 \end{pmatrix}

This formulation has the advantage that the variables div, rot, hyp_1 and hyp_2 correspond
to four particular vector fields that can be easily interpreted [7].
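Inverting that decomposition is straightforward; the small helper below (ours) recovers the four kinematic quantities from the 2x2 matrix M of the affine velocity field.

def affine_to_kinematic(M):
    """Recover (div, rot, hyp1, hyp2) from the 2x2 affine velocity matrix M."""
    div = M[0][0] + M[1][1]
    hyp1 = M[0][0] - M[1][1]
    rot = M[1][0] - M[0][1]
    hyp2 = M[1][0] + M[0][1]
    return div, rot, hyp1, hyp2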
The measurement is given by the least-squares estimates of the six variables. We have
observed on many sequences that the correlation coefficients between the six estimates are
negligible. For this reason, we have decided to decouple the six variables. The advantage
is that we work with six separate filters.
In the absence, in the general case, of any explicit simple analytical function describing
the evolution of the variables, we use a Taylor-series expansion of each function about
t. After having experimented with different approximations, it appears that using the
first three terms offers a good tradeoff between the complexity of the filter and the
accuracy of the estimates. Let \theta(t) = [\varphi(t), \dot{\varphi}(t), \ddot{\varphi}(t)]^T be the state vector, where \varphi is
any of the six variables: a, b, div, rot, hyp_1 and hyp_2. z(t) is the measurement variable.
We derive the following linear dynamic system:

\theta(t + 1) = A \theta(t) + \zeta(t),   z(t) = C \theta(t) + \eta(t),   with C = [1 \; 0 \; 0]

and A the transition matrix of the corresponding three-term Taylor (constant-acceleration) model.
\zeta(t) and \eta(t) are two sequences of zero-mean Gaussian white noise, of covariance
matrix Q and variance \sigma_\eta^2 respectively.

3.3 Results
We present in Fig. 3 the results of an experiment carried out on a sequence of real images. The
polygons representing the tracked regions are superimposed onto the original pictures
at times t_1, t_9 and t_12. The corresponding segmented pictures at the same instants are
presented on the right. The scene takes place at a crossroads. A white van is coming
from the left of the picture and going to the right (Fig. 3a). A black car is driving behind
the van so closely that the segmentation is unable to split the two objects (Fig. 3d). A
white car is coming from the opposite side and going left. The algorithm accurately

tracks the white car, even at the end of the sequence where the car almost disappears
behind the van (Fig. 3e and f). Since the segmentation process delivers a single global
region for the van and the black car (Fig. 3d), the filter follows this global region. Thus
the tracked region does not correspond exactly to the boundary of the van. This example
illustrates the good performance of region-based tracking in the presence of occlusion.
An improved version of the method, where the kinematic parameters are estimated using
a multiresolution approach, is being tested. More experiments are presented in [10].

4 Conclusion

This paper has explored an original approach to the issue of tracking objects in a se-
quence of monocular images. We have presented a new region-based tracking method
which delivers dense trajectory maps. It makes it possible to directly handle entities at an
"object level". It exploits the output of a motion-based segmentation. The algorithm relies on
two interacting filters: a geometric filter which predicts and updates the region position
and shape, and a motion filter which gives a recursive estimation of the motion param-
eters of the region. Experiments have been carried out on real images to validate the
performance of the method. The promising results obtained indicate the strength of the
"region approach" to the problem of tracking objects in sequences of images.

References

1. T.J. Broida, R. Chellappa. Estimating the kinematics and structure of a rigid object from
a sequence of monocular images. IEEE Trans. PAMI, Vol. 13, No. 6: pp 497-513, June 1991.
2. I.K. Sethi and R. Jain. Finding Trajectories of Feature Points in a Monocular Image
Sequence. IEEE Trans. PAMI, Vol. PAMI-9, No. 1: pp 56-73, January 1987.
3. J.L. Crowley, P. Stelmaszyk, C. Discours. Measuring Image Flow by Tracking Edge-Lines.
Proc. 2nd Int. Conf. Computer Vision, Tarpon Springs, Florida, pp 658-664, Dec. 1988.
4. R. Deriche, O. Faugeras. Tracking Line Segments. Proc. 1st European Conf. on Computer
Vision, Antibes, pp 259-268, April 1990.
5. J. Schick, E.D. Dickmanns. Simultaneous estimation of 3D shape and motion of objects by
computer vision. Proceedings of the IEEE Workshop on Visual Motion, Princeton, New
Jersey, pp 256-261, October 1991.
6. G.L. Gordon. On the tracking of featureless objects with occlusion. Proc. Workshop on
Visual Motion, Irvine, California, pp 13-20, March 1989.
7. E. François, P. Bouthemy. Multiframe-based identification of mobile components of a scene
with a moving camera. Proc. CVPR, Hawaii, pp 166-172, June 1991.
8. Karin Wall and Per-Erik Danielsson. A fast sequential method for polygonal approximation
of digitized curves. Computer Vision, Graphics and Image Processing, 28: pp 220-227, 1984.
9. P. Cox, H. Maitre, M. Minoux, C. Ribeiro. Optimal Matching of Convex Polygons. Pattern
Recognition Letters, Vol 9, No 5: pp 327-334, June 1989.
10. F. Meyer, P. Bouthemy. Region-based tracking in an image sequence. Research Report in
preparation, IRISA/INRIA Rennes, 1992.
11. R.J. Schalkoff, E.S. McVey. A model and tracking algorithm for a class of video targets.
IEEE Trans. PAMI, Vol. PAMI-4, No. 1: pp 2-10, Jan. 1982.
12. P.J. Burt, J.R. Bergen, R. Hingorani, R. Kolczynski, W.A. Lee, A. Leung, J. Lubin, H.
Shvaytser. Object tracking with a moving camera. IEEE Workshop on Visual Motion, pp
2-12, March 1989.
13. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. PAMI, Vol 7: pp 384-401, July 1985.
14. Arthur Gelb. Applied Optimal Estimation. MIT Press, 1974.

Fig. 3. Left: original images at time t1, t9, t12 with tracked regions. Right: segmented images
at the same instants
Combining Intensity and Motion for Incremental
Segmentation and Tracking Over Long Image
Sequences*
Michael J. Black

Department of Computer Science, Yale University


P.O. Box 2158 Yale Station, New Haven, CT 06520-2158, USA

Abstract. This paper presents a method for incrementally segmenting


images over time using both intensity and motion information. This is done
by formulating a model of physically significant image regions using local
constraints on intensity and motion and then finding the optimal segmenta-
tion over time using an incremental stochastic minimization technique. The
result is a robust and dynamic segmentation of the scene over a sequence
of images. The approach has a number of benefits. First, discontinuities
are extracted and tracked simultaneously. Second, a segmentation is always
available and it improves over time. Finally, by combining motion and in-
tensity, the structural properties of discontinuities can be recovered; that is,
discontinuities can be classified as surface markings or actual surface bound-
aries.

1 Introduction

Our goal is to efficiently and dynamically build useful and perspicuous descriptions of
the visible world over a sequence of images. In the case of a moving observer or a dy-
namic environment this description must be computed from a constantly changing retinal
image. Recent work in Markov random field models [7], recovering discontinuities [2], seg-
mentation [6], motion estimation [1], motion segmentation [3, 5, 8, 10], and incremental
algorithms [1, 9] makes it possible to begin building such a structural description of the
scene over time by compensating for and exploiting motion information.
As an initial step towards the goal, this paper proposes a method for incrementally
segmenting images over time using both intensity and motion information. The result is
a robust and dynamic segmentation of the scene over a sequence of images. The approach
has a number of benefits. First, discontinuities are extracted and tracked simultaneously.
Second, a segmentation is always available and it improves over time. Finally, by com-
bining motion and intensity, the structural properties of discontinuities can be recovered;
that is, discontinuities can be classified as surface markings or actual surface boundaries.
By jointly modeling intensity and motion we extract those regions which correspond
to perceptually and physically significant properties of a scene. The approach we take
is to formulate a simple model of image regions using local constraints on intensity
and motion. These regions correspond to the location of possible surface patches in the
image plane. The formulation of the constraints accounts for surface patch boundaries as
discontinuities in intensity and motion. The segmentation problem is then modeled as a
Markov random field with line processes.
* This work was supported in part by grants from the National Aeronautics and Space Ad-
ministration (NGT-50749 and NASA RTOP 506-47), by ONR Grant N00014-91-J-1577, and
by a grant from the Whitaker Foundation.

Scene segmentation is performed dynamically over a sequence of images by exploiting


the technique of incremental stochastic minimization (ISM) [1] developed for motion
estimation. The result is a robust segmentation of the scene into physically meaningful
image regions, an estimate of the intensity and motion of each patch, and a classification
of the structural properties of the patch discontinuities.
Previous approaches to scene segmentation have typically focused on either static im-
age segmentation or motion segmentation. Static approaches which attempt to recover
surface segmentations from the 2D properties of a single image are usually not sufficient
for a structural description of the scene. These techniques include the recovery of per-
ceptually significant image properties; for example segmentation based on intensity [2, 4]
or texture [6], location of intensity discontinuities, and perceptual grouping of regions
or edges. Structural information about image features can be gained by analyzing their
behavior over time. Attempts to deal with image features in a dynamic environment have
focused on the tracking of features over time [11].
Motion segmentation, on the other hand, attempts to segment the scene into struc-
turally significant regions using image motion. Early approaches focused on the seg-
mentation and analysis of the computed flow field. Other approaches have attempted
to incorporate discontinuities into the flow field computation [1, 10], thus computing
flow and segmenting simultaneously. There has been recent emphasis on segmenting and
tracking image regions using motion, but without computing the flow field [3, 5].
In an attempt to improve motion segmentation, a number of researchers have attempted
to combine intensity and motion information. Thompson [12] describes a region merging
technique which uses similarity constraints on brightness and motion for segmentation.
Heitz and Bouthemy [8] combine gradient based and edge based motion estimation and
realize improved motion estimates and the localization of motion discontinuities.
The following section formalizes the notion of a surface patch in the image plane in
terms of constraints on image motion and intensity. Section 3 describes the incremental
minimization scheme used to estimate patch regions. Section 4 presents experimental
results with real image sequences. Finally, before concluding, section 5 discusses issues
regarding the approach.

2 Joint Modeling of Discontinuous Intensity and Motion

To model our assumptions about the intensity structure and motion in the scene we adopt
a Markov random field (MRF) approach [7]. We formalize the prior model in terms of
constraints, defined as energy functions over local neighborhoods in a grid. For an image
of size n x n pixels we define a grid of sites:

S = { s_1, s_2, ..., s_{n^2} | 0 <= i_s, j_s <= n - 1 for all s },

where (i_s, j_s) denotes the pixel coordinates of site s.
For the first-order constraints employed here we define a neighborhood system G =
{G_s, s in S} in terms of the nearest-neighbor relations (North, South, East, West) in the
grid. We define a clique to be a set of sites, C ⊆ S, such that if s, t in C and s ≠ t, then
t in G_s. Let C be a set of cliques.
We also define a "dual" lattice, l(s, t), of connections between sites s and their neigh-
boring sites t in G_s. This line process [7] defines the boundaries of the image patches.
If l(s, t) = 1 then the sites s and t are said to belong to the same image patch. In the
case where l(s, t) = 0, the neighboring sites are disconnected and hence a discontinuity
exists.

Associated with each site s is a random vector X(t) = [u, i, l] which represents the
horizontal and vertical image motion u = (u, v), the intensity i, and the discontinuity
estimates l at time t. A discrete state space Λ_s(t) defines the possible values that the
random vector can take on at time t.
To model surface patches we formulate three energy terms, E_M, E_I, and E_L, which
express our prior beliefs about the motion field, the intensity structure, and the organi-
zation of discontinuities respectively. The energy terms are combined into an objective
function which is to be minimized:

E(u, u^-, i, i^-, l, l^-) = E_M(u, u^-, l) + E_I(i, i^-, l) + E_L(l, l^-).   (1)

The terms u^-, i^-, and l^- are predicted values given the history of the sequence, and are
used to express temporal continuity.
We convert the energy function, E, into a probability measure Π by exploiting the
equivalence between Gibbs distributions [7, 10] and MRFs:

Π(X(t)) = Z^{-1} e^{-E(X(t))/T(t)},    Z = Σ_{X(t) ∈ Λ(t)} e^{-E(X(t))/T(t)},   (2)

where Z is the normalizing constant, and where T(t) is a temperature constant at time
t. Minimizing the objective function is equivalent to finding the maximum of Π.
The constraints are summarized in figure 1 and described briefly below:

The Intensity Model: We adopt a weak membrane model of intensity [2]. The data
consistency term D_I keeps the estimate close to the data while the term S_I enforces
spatial smoothness. The current formulation differs from previous approaches in that we
add a temporal continuity term T_I to express the expected change in the image over
time.

The Boundary Model: We want to constrain the use of discontinuities based on our
expectations of how they occur in images. Hence, we will penalize discontinuities which do
not conform to expectations. The boundary model is expressed as the sum of a temporal
coherence term and a penalty term defined as the sum of clique potentials V_C over a set
of cliques C.
One component of the penalty term expresses our expectation about the local config-
uration of discontinuities about a site. Figure 2 shows the possible local configurations
up to rotation. We also express expectations about the local organization of boundaries;
for example we express notions like "good continuation" and "closure" which correspond
to assumptions about surface boundaries (figure 3). The values for these clique potentials
were determined experimentally and are similar to those of previous approaches [4, 10].

The Motion Model: As with the intensity model, we express our prior assumptions
about the motion in terms of three constraints. The data consistency constraint D_M
states that the image measurements corresponding to an environmental surface patch
change slowly over time. The spatial coherence constraint S_M is derived from the ob-
servation that surfaces have spatial extent and hence neighboring points on a surface
will have similar motion. Finally, the temporal coherence constraint T_M is based on the
observation that the velocity of an image patch changes gradually over time.

Intensity Model

E_I(I, i, i^-, l, s) = w_{D_I} D_I(I, i, s) + w_{T_I} T_I(i, i^-, s) + w_{S_I} S_I(i, l, s)   (3)
D_I(I, i, s) = (I(s) - i(s))^2   (4)
T_I(i, i^-, s) = (i(s) - i^-(s))^2   (5)
S_I(i, l, s) = Σ_{n ∈ G_s} l(s, n) (i(s) - i(n))^2   (6)

Boundary Model

E_L(l, l^-, s) = w_{T_L} Σ_{n ∈ G_s} (l(s, n) - l^-(s, n))^2 + w_{P_L} Σ_{C ∈ C} V_C(l)   (7)

Motion Model

E_M(I_n, I_{n+1}, u, u^-, l, s) =
    w_{D_M} D_M(I_n, I_{n+1}, u, s) + w_{T_M} T_M(u, u^-, s) + w_{S_M} S_M(u, l, s)   (8)
D_M(I_n, I_{n+1}, u, s) = Σ_{t ∈ F_s} Ψ_D(I_n(i_t, j_t) - I_{n+1}(i_t + u, j_t + v))   (9)
S_M(u, l, s) = Σ_{t ∈ G_s} l(s, t) ||u(s) - u(t)||   (10)
T_M(u, u^-, s) = ||u(s) - (u^-(s) + Δu^-(s))||   (11)
Δu_n^-(s) = u_n^-(s) - u_{n-1}^-(s)   (12)

Miscellaneous

F_s = { t | (i_t, j_t) = (i_s + Δi, j_s + Δj), -c <= Δi, Δj <= c }   (13)
G_s = { t | (i_t, j_t) = (i_s + Δi, j_s + Δj), -1 <= Δi, Δj <= 1 }   (14)
Ψ_D(x) = -1 / (1 + (x/λ_D)^2)   (15)

Fig. 1. Robust constraints on image motion.
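
To illustrate how the constraints of Fig. 1 combine at a single site, the following Python sketch
(my notation and data layout, not the authors' implementation) evaluates the intensity energy of
Eqs. (3)-(6) and the motion energy of Eqs. (8)-(11) on a small grid, using the robust function
Ψ_D of Eq. (15) and the weights quoted in Section 4. The line process is passed in as a function
l(s, n) and set to 1 everywhere in the demo; the flow is rounded to integers for the window test.

import numpy as np

LAMBDA_D = 5.0                                       # lambda_D of Eq. (15), as in Section 4

def psi_D(x, lam=LAMBDA_D):
    # Robust error norm of Eq. (15)
    return -1.0 / (1.0 + (x / lam) ** 2)

def neighbors(s, shape):
    # 4-connected neighborhood G_s
    r, c = s
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < shape[0] and 0 <= j < shape[1]]

def intensity_energy(I, i, i_prev, l, s, wD=1/1600, wT=1/1600, wS=1/400):
    D = (I[s] - i[s]) ** 2                                                  # Eq. (4)
    T = (i[s] - i_prev[s]) ** 2                                             # Eq. (5)
    S = sum(l(s, n) * (i[s] - i[n]) ** 2 for n in neighbors(s, I.shape))    # Eq. (6)
    return wD * D + wT * T + wS * S                                         # Eq. (3)

def motion_energy(I_n, I_n1, u, u_prev, du_prev, l, s, wD=0.5, wT=0.1, wS=1.5, c=1):
    r0, c0 = s
    dr_u, dc_u = int(round(u[s][0])), int(round(u[s][1]))    # integer-rounded flow at s
    D = 0.0
    for dr in range(-c, c + 1):                              # Eq. (9): (2c+1) x (2c+1) window
        for dc in range(-c, c + 1):
            p, q = (r0 + dr, c0 + dc), (r0 + dr + dr_u, c0 + dc + dc_u)
            if all(0 <= a < b for a, b in zip(p, I_n.shape)) and \
               all(0 <= a < b for a, b in zip(q, I_n.shape)):
                D += psi_D(float(I_n[p]) - float(I_n1[q]))
    S = sum(l(s, n) * np.linalg.norm(np.subtract(u[s], u[n]))
            for n in neighbors(s, I_n.shape))                               # Eq. (10)
    T = np.linalg.norm(np.subtract(u[s], np.add(u_prev[s], du_prev[s])))    # Eq. (11)
    return wD * D + wT * T + wS * S                                         # Eq. (8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    I0 = rng.uniform(0, 255, (8, 8))
    I1 = np.roll(I0, 1, axis=1)                                  # frame n+1: 1-pixel shift right
    l = lambda s, n: 1.0                                         # line process: no discontinuities
    u = {(r, c): (0.0, 1.0) for r in range(8) for c in range(8)} # candidate flow (rows, cols)
    zero = {(r, c): (0.0, 0.0) for r in range(8) for c in range(8)}
    print("E_I at (4,4):", intensity_energy(I0, I0, I0, l, (4, 4)))
    print("E_M at (4,4):", motion_energy(I0, I1, u, u, zero, l, (4, 4)))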

Fig. 2. Examples of local surface patch discontinuities (sites: o, discontinuities: |, -), with clique
potentials V_0 = 0.0, V_1 = 1.0, V_2 = 2.0, V_3 = 5.0, V_4 = 6.0, V_5 = 7.0.

Fig. 3. Examples of local organization of discontinuities based on continuity with neighboring
patches, with clique potentials V_6 = 1.0, V_7 = -2.0, V_8 = -1.0, V_9 = 0.0.

3 The Computational Problem

The objective function defined in the previous section will typically have many local
minima. Simulated annealing (in this case a Gibbs sampler [7]) can be used to find the
minimum X(t) by sampling from the state space Λ according to the distribution Π with
logarithmically decreasing temperatures.
As mentioned earlier, each site contains a random vector X(t) = [u, i, l] which rep-
resents the motion, intensity, and discontinuity estimates at time t. The discontinuity
component of this state space is taken to be binary, so that l ∈ {0, 1}.
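
A minimal sketch of one Gibbs-sampler visit to a single binary line-process variable is given
below, assuming only the local energy difference between its two states matters (which follows
from the Markov property); the toy energies in the demo are hypothetical and not the model's
actual terms.

import math, random

def resample_line(local_energy, T):
    # Sample l in {0, 1} from the conditional Gibbs distribution exp(-E(l)/T),
    # with everything else in the field held fixed.
    e0, e1 = local_energy(0), local_energy(1)
    m = min(e0, e1)                                   # stabilize the exponentials
    w0, w1 = math.exp(-(e0 - m) / T), math.exp(-(e1 - m) / T)
    return 1 if random.random() < w1 / (w0 + w1) else 0

if __name__ == "__main__":
    # Hypothetical local energy: keeping the connection (l = 1) costs a smoothness
    # penalty growing with the intensity difference across the edge; breaking it
    # (l = 0) costs a fixed line-process penalty.
    random.seed(0)
    diff = 30.0
    energy = lambda l: l * 0.002 * diff ** 2 + (1 - l) * 2.0
    kept = sum(resample_line(energy, T=0.3) for _ in range(1000))
    print(kept, "of 1000 samples keep l = 1")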
The intensity component i can take on any intensity value in the range [0,255]. For
efficiency, we can restrict i to take on only integer values in that range. We make the
further approximation that the value of i at site s is taken from the union of intervals of
intensity values about i(s), the neighbors i(t) of s, and the current data value I_n(s). Small
intervals result in a smaller state space without any apparent degradation in performance.
The motion component u = (u, v) is defined over a continuous range of displacements
u and v. Continuous annealing techniques [1] allow accurate sub-pixel motion estimates
by making the state space for the flow component adapt to the local properties of the
function being minimized.

3.1 Incremental Minimization

Unfortunately, stochastic algorithms remain expensive, particularly without parallel hard-


ware, making them ill-suited to dynamic problems. Ideally a motion algorithm should
involve fast simple computations between a pair of frames, and exploit the fact that
tremendous amounts of data are available over time.
In the context of optical flow, Black and Anandan [1] describe an incremental stochas-
tic minimization (ISM) algorithm (figure 4) that has the benefits of simulated annealing

Fig. 4. Incremental Stochastic Minimization. (Block diagram: images n-1 and n feed the
intensity, motion, and surface/boundary models, whose energy E(u, v) drives the incremental
stochastic minimization module; its outputs are the predicted intensity, boundary, and flow
estimates.)

without many of the shortcomings. As opposed to minimizing the objective function for
a pair of frames, the ISM approach is designed to minimize an objective function which
is changing slowly over time. The assumption of a slowly changing objective function is
made possible by exploiting current motion estimates to compensate for the effects of
the motion on the objective function. With each new image, current estimates are propa-
gated by warping the grid of sites using the current optic flow estimate. The estimates are
then refined using traditional stochastic minimization techniques. Additionally, during
the warping process motion discontinuities are classified as occluding or disoccluding.
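
The outer loop of such an incremental scheme might be sketched as follows (this is a schematic
rendering, not the authors' code): per-pixel estimates are warped along the current flow to
compensate for motion, and a caller-supplied refine step stands in for a few sweeps of the
stochastic minimizer at a slowly decreasing temperature.

import numpy as np

def warp(field, flow):
    # Propagate per-pixel estimates to the next frame along the current flow
    # (nearest-neighbour forward warp; a crude stand-in for the real compensation).
    out = np.copy(field)
    H, W = field.shape[:2]
    for r in range(H):
        for c in range(W):
            rr = r + int(round(flow[r, c, 0]))
            cc = c + int(round(flow[r, c, 1]))
            if 0 <= rr < H and 0 <= cc < W:
                out[rr, cc] = field[r, c]
    return out

def ism(frames, refine, iters_per_frame=5, T0=0.3, dT=0.0025):
    H, W = frames[0].shape
    est = {"i": frames[0].astype(float), "u": np.zeros((H, W, 2))}
    T = T0
    for n in range(len(frames) - 1):
        est["i"] = warp(est["i"], est["u"])          # compensate estimates for the motion
        est["u"] = warp(est["u"], est["u"])
        for _ in range(iters_per_frame):             # a few relaxation sweeps per frame pair
            est = refine(est, frames[n], frames[n + 1], T)
            T = max(T - dT, 1e-3)                    # slowly decreasing temperature
    return est

if __name__ == "__main__":
    frames = [np.random.default_rng(k).uniform(0, 255, (16, 16)) for k in range(4)]
    out = ism(frames, refine=lambda est, Ia, Ib, T: est)   # no-op refinement for the demo
    print(out["i"].shape, out["u"].shape)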

4 Experimental Results

The system is implemented in *Lisp on an 8K-node Connection Machine (CM-2). A number
of experiments have been performed using real image sequences. For these experiments,
the parameters of the model were determined empirically. The intensity model parameters
were: w_{D_I} = w_{T_I} = 1/40^2 and w_{S_I} = 1/20^2. For the boundary model, the weights were:
w_{T_L} = 0.5 and w_{P_L} = 1.0. Finally, for the motion model, we have: w_{D_M} = 0.5, w_{T_M} = 0.1,
and w_{S_M} = 1.5, with a 3 x 3 correlation window. An initial temperature of T(0) = 0.3
was chosen with a cooling rate of T(t + 1) = T(t) - 0.0025, and λ_D was set to 5.0.

The Pepsi Sequence¹ The first sequence consists of ten 64 x 64 square images; the
first image in the sequence is shown in figure 5a. The Canny edge operator was applied
to the image and the edges are shown in figure 5b. For comparison, figure 5c shows an
intensity based segmentation using a piecewise constant intensity model with no motion
information. The figure shows the estimate for a single static image after 25 iterations
of the annealing algorithm. As with the Canny edges, the results correspond to intensity
markings.
Figure 5d shows the results for the same image when a joint intensity and motion
model is used. The results are from a two image sequence after 25 iterations. Compare
the boundaries corresponding to the right and left edges of the can. In figure 5c the
similarity of intensity between the can and the background results in smoothing across
the object boundary. When motion information is added in figure 5d the object boundary
is detected (figure 5e) and smoothing does not occur across it.
Figures 5f-5l show the results of incrementally processing the full ten image sequence.
Figure 5f shows the last image in the sequence. The horizontal and vertical motion is
shown in figures 5g and 5h respectively. Dark areas indicate leftward or upward motion
and similarly, bright areas indicate motion to the right and down. Figure 5i shows the
intensity estimates of the patches and figure 5j shows the discontinuities. Figure 5k shows
the detected motion boundaries, while figure 5l shows the classification of the boundaries
as occluding (bright areas) or disoccluding (dark areas). Figure 6 shows the evolution of
the features over the ten image sequence. The estimates start out noisy and are refined
over time. Only five iterations of the annealing algorithm were used between each pair
of frames. The processing time for each frame was approximately 30 seconds.

The Coke Sequence² The second image sequence contains 38 images of size 128 x 128
pixels. Figures 7a and b show the first and last images in the sequence respectively.
Figure 7c shows the image features at the end of the image sequence. Unlike standard
segmentation, these features have been tracked over the length of the sequence. Figure
7d shows only features which are likely to correspond to surface boundaries. The pencils
and metal bracket are correctly interpreted as physically significant while the sweater is
interpreted as purely surface marking. Notice that the Coke can boundary is incorrectly
interpreted as surface marking. This is a result of small interframe displacements; the
motion of the can boundary is not significant enough to classify it as structural with
the current scheme. Figure 8 shows the evolution of the image features over time. The
segmentation improves as the features are tracked over the image sequence. Five iter-
ations of the annealing algorithm were used between frames with a processing time of
approximately one minute per frame.
¹ This image sequence was provided by Joachim Heel.
² This sequence was provided by Dr. Banavar Sridhar at the NASA Ames Research Center.

Fig. 5. Feature Extraction: a) First image in the Pepsi can sequence, b) Edges in the image
extracted with the Canny edge operator, c) Intensity-based segmentation without motion, d)
Segmentation using the joint intensity and motion model, e) Structural features in the scene, f) Last
image in the sequence, g) Horizontal component of image motion, h) Vertical component of image
motion, i) Reconstructed intensity image, j) Final patch boundaries, k) Motion boundaries, l)
Occlusion and disocclusion boundaries.

Fig. 6. Incremental Feature Extraction. The images show the evolution (left to right, top
to bottom) of features over a ten image sequence.

Fig. 7. The Coke Sequence. a, b) first and last images in the sequence, c) image features at
the end of the sequence, d) those features which are likely to have a physical interpretation.


Fig. 8. Incremental Feature Extraction. The sequence shows the evolution (left to right,
top to bottom) of features at every third image in the 38 image sequence.

5 Issues and Future Work

There are a number of issues to be addressed regarding the approach described. First, the
current implementation employs only simple first order models of intensity and motion.
To cope with textured surfaces more complicated image segmentation models will be
required.
A second issue which must be addressed is one shared by many minimization ap-
proaches: the parameter estimation problem. The construction of an objective
function with weights controlling the importance of the various terms is often based on
intuition or empirical studies. The problem becomes more pronounced as the complexity

of the model increases. Experiments with the current model indicate that it is relatively
insensitive to changes in the parameters.

6 Conclusion

We have presented an incremental approach to extracting stable perceptual features over


time. The approach formulates a model of surface patches in terms of constraints on
intensity and motion while accounting for discontinuities. An incremental minimization
scheme is used to segment the scene over a sequence of images.
The approach has advantages over traditional segmentation and motion estimation
techniques. In particular, it is incremental and dynamic. This allows segmentation and
motion estimation to be performed over time, while reducing the amount of computation
between frames and increasing robustness.
Additionally, the approach provides information about the structural properties of the
scene. While intensity based segmentation alone provides information about the spatial
structure of the image, motion provides information about object boundaries. Combining
the two types of information provides a richer description of the scene.

References
1. Black, M.J., and Anandan, P., "Robust dynamic motion estimation over time,"
Proc. Comp. Vision and Pattern Recognition, CVPR-91, Maui, Hawaii, June 1991, pp. 296-
302.
2. Blake, A. and Zisserman, A., Visual Reconstruction, The MIT Press, Cambridge, Mas-
sachusetts, 1987.
3. Bouthemy, P. and Lalande, P., "Detection and tracking of moving objects based on a sta-
tistical regularization method in space and time," Proc. First European Conf. on Computer
Vision, ECCV-90, Antibes, France, April 1990, pp. 307-311.
4. Chou, P. B., and Brown, C. M., "The theory and practice of Bayesian image labeling," Int.
Journal of Computer Vision, Vol. 4, No. 3, 1990, pp. 185-210.
5. Francois, E. and Bouthemy, P., "Multiframe-based identification of mobile components of a
scene with a moving camera," Proc. Comp. Vision and Pattern Recognition, CVPR-91, Maui,
Hawaii, June 1991, pp. 166-172.
6. Geman, D., Geman, S., Graffigne, C., and Dong, P., "Boundary detection by constrained
optimization," 1EEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12,
No. 7, July 1990, pp. 609-628.
7. Geman, S. and Geman, D., "Stochastic relaxation, Gibbs distributions, and Bayesian
restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. PAMI-6, No. 6, November 1984.
8. Heitz, F. and Bouthemy, P., "Multimodal motion estimation and segmentation using Markov
random fields," Proc. IEEE Int. Conf. on Pattern Recognition, June, 1990, pp. 378-383.
9. Matthies, L., Szeliski, R., Kanade, T., "Kalman filter-based algorithms for estimating depth
from image sequences," Int. J. of Computer Vision, 3(3), Sept. 1989, pp. 209-236.
10. Murray, D. W. and Buxton, B. F., "Scene segmentation from visual motion using global op-
timization," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2,
March 1987, pp. 220-228.
11. Navab, N., Deriche, R., and Faugeras, O. D., "Recovering 3D motion and structure from
stereo and 2D token tracking cooperation," Proc. Int. Conf. on Comp. Vision, ICCV-90, Os-
aka, Japan, Dec. 1990, pp. 513-516.
12. Thompson, W. B., "Combining motion and contrast for segmentation," IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. PAMI-2 1980, pp. 543-549.

Active Egomotion Estimation: A Qualitative
Approach

Yiannis Aloimonos and Zoran Duric


Computer Vision Laboratory, Center for Automation Research, Department of Computer Sci-
ence and Institute for Advanced Computer Studies, University of Maryland, College Park, MD
20742-3411

Abstract.
Passive navigation refers to the ability of an organism or a robot that
moves in its environment to determine its own motion precisely on the ba-
sis of some perceptual input, for the purposes of kinetic stabilization. The
problem has been treated, for the most part, as a general recovery from
dynamic imagery problem, and it has been formulated as the general 3-D
motion estimation (or structure from motion) module. Consequently, if a
robust solution to the passive navigation problem--as it has been formu-
lated in the recovery paradigm--is achieved, we will immediately be able to
solve many other important problems, as simple applications of the general
principle. However, despite numerous theoretical results, no technique has
found applications in systems that can perform well in the real world. In
this paper, we outline some of the reasons behind this and we develop a
robust solution to the passive navigation problem which is
- purposive, in the sense that it does not claim any generality. It just
solves the kinetic stabilization problem and cannot be used as it is for
other problems related to 3-D motion.
- qualitative, in the sense that the solution comes as the answer to a
series of simple yes/no questions and not as the result of complicated
numerical processing.
- active, in the sense that the activity of the observer (in this case "sac-
cades") is essential for the solution of the problem.


The input to the perceptual process of kinetic stabilization that we have
developed is the normal flow, i.e. the projection of the optic flow along the
direction of the image gradient.
A contribution of this work is that translation can be estimated
reliably from a normal flow field that also contains rotation.

1 Introduction

The problem of passive navigation (kinetic stabilization) has attracted a lot of attention
in the past ten years (Bruss and Horn, 1983; Longuet-Higgins, 1981; Longuet-Higgins
and Prazdny, 1980; Ullman, 1979; Spetsakis and Aloimonos, 1988; Tsai and Huang, 1984)
because of the generality of a potential solution. The problem has been formulated as fol-
lows: Given a sequence of images taken by a monocular observer undergoing unrestricted
rigid motion in a stationary environment, to recover the 3-D motion of the observer. In

particular, if (U, V, W) and (ω_x, ω_y, ω_z) are the translation and rotation, respectively,
comprising the general rigid motion of the observer, the problem is to recover the follow-
ing five numbers: the direction of translation (U/W, V/W) and the rotation (ω_x, ω_y, ω_z). (See
Fig. 1 for a pictorial description of the geometric model of the observer; O is the nodal
point of the eye).

Fig. 1. Image plane perpendicular to the optical axis OZ.

The problem has thus been formulated as the general 3-D motion estimation problem
(kinetic depth or structure from motion) and its solution would solve a series of problems
(for example target pursuit, visual rendezvous, etc.) as simple applications. In this paper
we study the problem of passive navigation in the framework of purposive vision. Later
sections will clarify our point of view but our basic thesis is that we must seek a robust
solution for the problem under consideration only. If our proposed solution for the passive
navigation problem also solves the problem of determining the 3-D motion of an object
moving in the field of view of a static observer, then we have solved a more general
problem than the one we initially considered.

2 Previous Work

Previous research can be classified into two broad categories: methods based on optic
flow or correspondence and direct methods. 1

In the first category, under the assumption that optic flow or correspondence is known
with some uncertainty, finding the best solution results in a non-linear optimization
problem. One develops an error measure (usually a function of the input error) that is
minimized in some way. Treating the problem as one of statistical estimation has given
rise lately to very sophisticated approaches. Although such research on general recovery

1 One can also differentiate a category of methods that use correspondence of macrofeatures
(contours, lines, sets of points, etc.) (Aloimonos, 1990b; Spetsakis and Aloimonos, 1990), but
we don't discuss them here, due to the lack of literature on the stability of such techniques.

is making tremendous progress, the existing general recovery results cannot yet survive
in the real world, because small amounts of error in the input can produce catastrophic
results in the output (Spetsakis and Aloimonos, 1988; Horn, 1988; Young and Chellappa,
1990; Adiv, 1985a, 1985b; Weng et al., 1987). Although it is true that if a human operator
corresponded features in the successive image frames, 2 most of these algorithms would
give practical results, it is highly questionable that these algorithms could be used in a
real time navigational system, when an average of 1% input noise is enough to create an
error of 100% in the output, 3 and especially when the problem of computing optic flow
or displacements (correspondence) is ill-posed and any algorithm for computing them
must rely on assumptions about the world that might not always be valid. There is no
doubt that research on the topic will continue and will shed more light on the difficulties
associated with the general problem of 3-D motion computation.

In the second category, direct methods attempt to recover 3-D motion using as
input the spatiotemporal derivatives of the image intensity function, thus getting rid
of the correspondence problem. These techniques, pioneered in (Aloimonos and Brown,
1984) and developed much further by Horn and his associates (Horn and Weldon, 1984;
Negahdaripour, 1986), can be considered closer to our work, since the spatiotemporal
derivatives at a point define the normal flow at that point. The difference is, of course,
that thinking in terms of the normal flow provides much more geometric intuition about
the problem. In addition, existing direct methods attempt to solve the general problem,
while we are only interested in the kinetic stabilization problem, and we can also treat
the problem of extracting 3-D translation without knowing the rotation (see Section 6).

3 Purposive Visual Motion Analysis

Most of the research on visual motion has been concentrated along the lines of general re-
covery, i.e. recovering from a sequence of images relative 3-D motion (passive navigation)
and structure. If this general recovery problem is solved, then a series of questions such
as: is there a moving object in the field of view of the observer?, is it getting closer to the
observer?, is it going to hit the observer?, how can the observer intercept it or avoid it?,
etc., become simple applications of the general recovery module. This point of view is
consistent with the ideas on the modular design of vision systems put forth by D. Marr
(1982). Although it is clear that a modular, general recovery approach to the analysis of
visual motion will uncover the general principles behind this perceptual ability (or mod-
ule), it is not at all clear that a good, working system capable of understanding visual
motion (i.e., capable of accomplishing visual tasks involving motion) must be designed
in a modular fashion. That is, it is not clear that any system handling visual motion
and capable of displaying intelligent behaviors is best designed by first going through
the stage of 3-D motion and structure computation. True, that would be convenient, as
vision would be studied in isolation and its results would be given to other modules, such
as motion planning or reasoning, to accomplish tasks such as target pursuit, obstacle
avoidance, and the like. Purposive vision, on the other hand, with its goal being to close
the gap between theory (general recovery) and application (actual visual systems), is
2 As in photogrammetry, for example, for solving the problem of relative orientation (Horn,
1988).
3 Since measurements are in focal length units, 1% error in displacements amounts to about
4-8 pixels for commercially available cameras.

a more general theory of vision involving multiple functions and multiple relationships
between subsystems of the final intelligent system. With regard to visual motion, a pur-
posive design should allow any output that the visual system can be trained to associate
reliably with features derived from a changing image. In other words, we can attempt
different robust solutions to various motion related problems (such as passive navigation,
detecting obstacles, avoiding moving objects, etc.), even though all these problems are
simple applications of the general structure from motion module (Aloimonos, 1990a). 4
In this spirit, we study the perceptual task of kinetic stabilization.

4 Kinetic Stabilization

Consider a monocular observer as in Fig. 2. We assume that the observer moves only
forward (see Fig. 3). 5 It is assumed that the observer is equipped with inertial sensors
which provide the rotation (ω_x, ω_y, ω_z) of the observer at any time. As the observer moves
in its environment, normal flow fields are computed in real time. Since optic flow due
to rotation does not depend on depth but on image position (x, y), we know (and can
compute in real time) its value (u_R, v_R) at every image point along with the normal flow. 6
That means that we know the optic flow due to translation (see Fig. 3a). In other words,
since we can derotate, we assume that the normal flow is due to translation only. In later
sections we analyze the case where rotation is present. When the observer moves forward 7
in a static scene, it is approaching everything in the scene and the flow is expanding. From
Fig. 3b, it is clear that the focus of expansion (FOE) = (U/W, V/W) (when the gradient space
of directions is superimposed on the image space) lies in the half plane defined by line
(e); thus every point in that half plane receives one vote for being the FOE. Clearly, at
every point we obtain a constraint line which constrains the FOE to lie in a half plane.
If the FOE lies on the image plane (i.e. the direction of translation is anywhere in the
solid sector OABCD (Fig. 4)), then the FOE is constrained to lie in an area on the image
plane and thus it can be localized (see Fig. 5). When the FOE does not lie inside the
image, a closed area cannot be found, but the votes collected by the half planes indicate
its general direction. By making a "saccade", i.e. a rotation of the camera, the observer
can then bring the FOE inside the image and localize it (Fig. 6 explains the process).
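
A toy version of the half-plane voting can be written directly from this description: for forward
translation the translational flow at pixel q points away from the FOE, so a (translational)
normal-flow measurement nf at q votes for every candidate FOE position f with nf · (q - f) > 0.
The sketch below is my own; the grid size, threshold T_t and simulated data are hypothetical.

import numpy as np

def vote_foe(points, normal_flows, grid_shape, T_t=0.0):
    votes = np.zeros(grid_shape, int)
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    for q, nf in zip(points, normal_flows):
        nf = np.asarray(nf, float)
        if np.linalg.norm(nf) <= T_t:
            continue                            # below threshold: do not let it vote
        votes += (nf[0] * (q[0] - xs) + nf[1] * (q[1] - ys)) > 0   # half-plane constraint
    return votes

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    foe = np.array([40.0, 25.0])                # ground-truth FOE (x, y) on a 64 x 64 grid
    pts = rng.uniform(0, 64, (300, 2))          # measurement positions (x, y)
    dirs = rng.normal(size=(300, 2))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)        # gradient directions
    flows = (pts - foe) / 20.0                                 # expanding translational flow
    nfs = dirs * np.sum(flows * dirs, axis=1, keepdims=True)   # normal flow = projection
    v = vote_foe(pts, nfs, (64, 64))
    print("highest-vote cell (y, x):", np.unravel_index(np.argmax(v), v.shape))  # near (25, 40)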

5 The A l g o r i t h m

We assume that the computation of the normal flow, the voting and the localization
of the area containing the highest number of votes can be done in real time. In this
4 It is becoming quite clear that biological organisms are designed in a purposive manner, i.e.
their visual systems do not create a central database containing the recovery of the scene and
its properties which is then used as input to other cognitive processes. This doesn't mean that
a purposive design of a robot system (a design around behaviors) is better than a modular
one; biological systems might be suboptimal. Still, a purposive analysis of vision makes sense
because we have found that various behaviors can be robustly obtained (Aloimonos, 1990).
5 In the case of backward movement the situation is symmetric (maximum - minimum) and
handled similarly.
6 If computation of normal flow at some points is unreliable, we just don't compute normal flow
there.
7 In the sense of Fig. 2; we assume that the observer usually looks towards where it is moving.

Fig. 2. (Schematic of the observing robot: an eye connected to a brain and to motors; labels
include "cable from eye to brain", "cable from brain to motors", "possible direction of rotation",
"limit of possible motion (robot cannot travel backwards)", and "possible range of motion".)

Fig. 3. (Panels (a)-(c); labels include "u^n: normal flow" and "u_R: rotational flow".) Given the
normal flow u^n and the rotational flow u_R at a point O(x, y), and given that the projection of
the sum u_R + u_t on (e1) should equal u^n, we conclude that the translational flow is OD, where
D is anywhere on (e2). Clearly, in such a case, the focus of expansion lies on the half plane
defined by (e) that does not contain u_t. This statement is equivalent to the following algebraic
inequality (Horn and Weldon, 1987). If f(x, y, t) is the image intensity function, then we have
f_x u + f_y v + f_t = 0, where (u, v) is the flow. If we only have translation (or we know the
rotation), then we get f_x (-U + xW)/Z + f_y (-V + yW)/Z + f_t = 0, or
f_x (W/Z)(x - U/W) + f_y (W/Z)(y - V/W) + f_t = 0, and if W/Z > 0, then
(f_x (x - U/W) + f_y (y - V/W)) / f_t < 0. However, thinking in terms of normal flow gives more
geometric intuition. Indeed, if the normal flow due to translation is as in Fig. 3b, the FOE must
lie in the half plane (dotted line) of (e). But this assumes that the flow u can be arbitrarily
large, which is absurd. If there is a bound on the flow, then the FOE is constrained further
(Fig. 3c).

Fig. 4. Consider the camera coordinate system. If the translation vector (U, V, W) is anywhere
inside the solid OABCD defined by the nodal point of the eye and the boundaries of the image,
then the FOE is somewhere on the image.

paper we don't get involved with real time implementation issues as we wish to analyze
the theoretical aspects of the technique. However it is quite clear that computation of
normal flow can happen in real time (there already exist chips performing edge detection).
According to the literature on connectionist networks (Ballard, 1984), voting can also be
done in real time. Let S denote the area with the highest number of votes. Let L(S) be a
Boolean function that is true when the intersection of S with the image boundary is the
null set, and false otherwise. Then, the following algorithm finds the area S, i.e., solves
the passive navigation problem. We assume that the inertial sensors provide the rotation
and thus we know the normal flow due to translation.
1. begin {
2.   find area S
3.   repeat until L(S)
4.   {  rotate camera around x, y axes so that the
        optical axis passes through the center of S (saccade)
5.      find area S
     }
6.   output S
   }
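
The same loop in Python might look as follows; find_area, touches_image_boundary (the negation
of L(S)) and rotate_camera_towards are hypothetical stand-ins for the voting stage, the boundary
test and the camera control, and max_saccades is an added safeguard not present in the algorithm
above.

def kinetic_stabilization(find_area, touches_image_boundary, rotate_camera_towards,
                          max_saccades=10):
    S = find_area()                                                 # step 2: highest-vote area
    saccades = 0
    while touches_image_boundary(S) and saccades < max_saccades:    # repeat until L(S) holds
        rotate_camera_towards(center_of(S))                         # step 4: saccade (x, y axes)
        S = find_area()                                             # step 5
        saccades += 1
    return S                                                        # step 6

def center_of(area):
    # Centroid of a set of (row, col) pixels.
    rows = [p[0] for p in area]
    cols = [p[1] for p in area]
    return (sum(rows) / len(rows), sum(cols) / len(cols))

Since L(S) is true exactly when S does not touch the image boundary, the loop condition above is
its negation.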

If the camera has a wide angle lens, then image points can represent many orientations,
and only one saccade (or none) may be necessary. But if we have a small angle lens, then
we may have to make more than one saccade.

6 Analysis of the Method

We have assumed that the inertial sensors will provide the observer with accurate infor-
mation about rotation. Although expensive accelerometers can achieve very high accu-

Fig. 5. (a) From a measurement u of the normal flow due to translation at a point (x, y) of the
image, every point of the image belonging to the half plane defined by (e) that does not contain
u is a candidate for the position of the focus of expansion, and collects one vote. The voting is
done in parallel for every image measurement. (b) If the FOE lies within the image boundaries,
then the area containing the highest number of votes is the area containing the FOE. Using only
a few measurements can result in a large area. Using many measurements (all possible) results
in a small area (in our experiments a small area means a few pixels, usually at most three or
four).

Fig. 6. (a) If the area containing the highest number of votes has a piece of the image boundary
as part of its boundary, then the FOE is outside the image plane (see (b)). (b) The position
of the area containing the highest number of votes indicates the general direction in which the
translation vector lies. (c) The camera ("eye") rotates so that the area containing the highest
number of votes becomes centered. With a rotation around the x and y axes only, the optical axis
can be positioned anywhere in space. The process stops when the highest vote area is entirely
inside the image.

racy, the same is not true for inexpensive inertial sensors and so we are bound to have
some error. Thus we must assume that some unknown rotational part still exists and
contributes to the value of the normal flow. As a result, the method for finding the FOE
(previous section), which is based on translational normal flow information (since we have
"derotated"), might be affected by the presence of some rotational flow. Thus, we need to
study the effect of rotation (the error of the inertial sensor) on the technique for finding
the FOE. An extensive analysis of the distortion of the solution area in the presence of
rotation is given in (Duric and Aloimonos, 1991). For the purposes of this analysis it will
suffice to show that if we consider for voting only normal flows whose value is greater than
some threshold, then voting will always be correct. The analysis is done for a spherical
eye. Indeed, voting will clearly be correct only if the direction of the translational normal
flow is the same as the direction of the actual normal flow, that is when

(n · u_t)(n · u) > 0.   (1)

In addition, since we consider only normal flows greater than the threshold, we need

|n · u| > T_t.   (2)

Inequality (1) becomes

(n · u_t)(n · u) = (n · u_t)(n · u_t + n · u_R) = (n · u_t)^2 + (n · u_t)(n · u_R) > 0.   (3)

So, if we set |n · u_R| <= T_t, then there are two possibilities: either |n · u| is below the
threshold, in which case it is of no interest to voting, or the sign of n · u is the same as
the sign of n · u_t. In other words, if we can set the threshold equal to the maximum value
of the normal rotational flow, then our voting will always be correct. But at point r of
the sphere the rotational component of the normal flow satisfies

|n · u_R| <= ||n|| · ||u_R|| = ||u_R|| = ||ω × r|| = ||ω|| · ||r|| · |sin(∠(ω, r))| <= ||ω||.

Thus if we choose T_t = ||ω||, then the sign of n · u (actual normal flow) is equal to
the sign of u_t · n (translational normal flow) for any normal flow of magnitude greater
than T_t.

7 Experimental Results

We have performed several experiments with both synthetic and real image sequences in
order to demonstrate the stability of our method. From experiments on real images it was
found that in the case of pure translation the method computes the focus of expansion
very robustly. In the case of general motion it was found from experiments on synthetic
data that the behavior of the method is as predicted by our theoretical analysis.

7.1 Synthetic Data

We considered a set of features at random depths (uniformly distributed in a range R_min
to R_max). The scene was imaged using a spherical retina as in Fig. 7. Optic flow and
normal optic flow were computed on the sphere and then projected onto the tangent
plane (see Fig. 7). Normal flow was computed by considering features whose orientations
were produced using a uniform distribution. Figs. 8 to 12 show one set of experiments.
Fig. 8 shows the optic flow field for θ_ω = 0° (the angle between t and ω), viewing
angles (φ, ψ) = (0°, 0°), R_min = 10 and R_max = 20 in units of focal length, ||t|| = 1,
k = ||ω||/||t|| = 0.1 and FOV = 56°. Fig. 9 shows the corresponding normal flow.
Similarly Figs. 10 and 11 show optical and normal flow fields for the same conditions as
before with the exception that k = 0.75, which is obtained by growing ||ω||. Under the
above viewing conditions, the FOE is in the center of the image. Fig. 12 shows results
of voting for determining the FOE. In the first row thresholding precedes voting, with
T_t = ||ω||f, and in the second row there is no thresholding. In the first row, only the area
with the maximum number of votes is shown, while in the second row the whole voting
function is displayed (black is maximum). Clearly, the solution is a closed area (except
for the biggest k) whose size grows with k--the ratio of rotation to translation.

Fig. 7. Sphere OXYZ represents a spherical retina (frame OXYZ is the frame of the observer).
The translation vector t is along the z axis and the rotation axis lies on the plane OZY. Although
a spherical retina is used here, information is used only from a patch of the sphere defined by the
solid angle FOV containing the viewing direction v_d. The spherical image patch is projected
stereographically with center S' on the plane P tangent to the sphere at N', and having a
natural coordinate system (ξ, η). All results (solution areas, voting functions, actual and normal
flow fields) are projected and shown on the tangential plane.
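
As a small geometric aside, the stereographic mapping used for display in Fig. 7 can be sketched
as follows, assuming S' is the point of the sphere antipodal to N' and the sphere has unit radius;
the choice of in-plane basis for (ξ, η) is arbitrary and mine.

import numpy as np

def stereographic(p, n_prime):
    # Project unit vector p from S' = -n_prime onto the plane tangent to the sphere at n_prime,
    # returning coordinates (xi, eta) in an orthonormal basis of that plane.
    p, n = np.asarray(p, float), np.asarray(n_prime, float)
    t = 2.0 / (1.0 + n @ p)                  # line S' + t (p - S') meets the plane n . x = 1
    x = -n + t * (p + n)                     # 3-D point on the tangent plane
    e1 = np.cross(n, [0.0, 0.0, 1.0])        # in-plane basis; assumes n is not parallel to z
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(n, e1)
    return np.array([e1 @ (x - n), e2 @ (x - n)])

if __name__ == "__main__":
    n_prime = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)   # viewing direction
    p = np.array([0.0, 1.0, 0.0])                        # a point on the spherical retina
    print(stereographic(p, n_prime))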

7.2 Real Data

Fig. 13a shows one of the images from a dense sequence collected in our laboratory using
a Merlin American Robot arm that translated while acquiring images with the camera
it carried (a Sony miniature TV camera). Fig. 13b shows the last frame in the sequence


Fig. 8.


Fig. 9.

and Fig. 13c shows the first frame with the solution area (where the FOE lies), which
agrees with the ground truth. Figs. 14a, 14b and 14c show results similar to those above,
but for the image sequence provided by NASA Ames Research Center and made public
for the 1991 IEEE Workshop on Motion. One can observe that the solution area is not
very small, primarily due to the absence of features around the FOE area.


Fig. 10.


Fig. 11.

8 Conclusions

We have presented a technique for computing the direction of motion of a moving observer
using as input the normal flow field. In particular, for the actual computation only the
direction of the normal flow is used. We showed theoretically that the method works
very robustly even when some amount of rotation is present, and we quantified the
relationship between time-to-collision and magnitude of rotation that allows the method

Fig. 12. (Panels: Thresholding, No Thresholding; see Section 7.1.)

to work correctly. It has been shown that the position of the estimated FOE is displaced
in the presence of rotation and this displacement is explained in (Duric and Aloimonos,
1991). The practical significance of this research is that if we have at our disposal an
inertial sensor whose error bounds are known, we can use the method described in this
paper to obtain a machine vision system that can robustly compute the heading direction.
However, if rotation is not large, then the method can still reliably compute the direction
of motion, without using inertial sensor information.

Note: The theoretical analysis in this paper was done by Y. Aloimonos. All experiments
reported here have been carried out by Zoran Duric. The support of DARPA (ARPA
Order No. 6989, through Contract DACA 76-89-C-0019 with the U.S. Army Engineer
Topographic Laboratories), NSF (under a Presidential Young Investigator Award, Grant
IRI-90-57934), Alliant Techsystems, Inc. and Texas Instruments, Inc. is gratefully ac-
knowledged, as is the help of Barbara Burnett in preparing this paper.

Fig. 13.

Fig. 14.

References

Adiv, G.: Determining three-dimensional motion and structure from optical flow generated by
several moving objects. IEEE Trans. PAMI 7 (1985a) 384-401.
Adiv, G.: Inherent ambiguities in recovering 3D motion and structure from a noisy flow field.
Proc. IEEE Conference on Computer Vision and Pattern Recognition (1985b) 70-77.
Aloimonos, J.: Purposive and qualitative active vision. Proc. DARPA Image Understanding
Workshop (1990a) 816-828.
Aloimonos, J.: Perspective approximations. Image and Vision Computing 8 (1990b) 177-192.
Aloimonos, J., Brown, C.M.: The relationship between optical flow and surface orientation. Proc.
International Conference on Pattern Recognition, Montreal, Canada (1984).
Aloimonos, J., Weiss, I., Bandopadhay, A.: Active vision. Int'l. J. Comp. Vision 2 (1988) 333-
356.
Ballard, D.H.: Parameter networks. Artificial Intelligence 22 (1984) 235-267.
Bruss, A., Horn, B.K.P.: Passive navigation. Computer Vision, Graphics Image Processing 21
(1983) 3-20.
Duric, Z., Aloimonos, J.: Passive navigation: An active and purposive solution. Technical Report
CAR-TR-560, Computer Vision Laboratory, Center for Automation Research, University
of Maryland, College Park (1991).
Horn, B.K.P.: Relative Orientation. MIT AI Memo 994 (1988).
Horn, B.K.P., Weldon, E.J.: Computationally efficient methods of recovering translational mo-
tion. Proc. International Conference on Computer Vision (1987) 2-11.
Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections.
Nature 293 (1981) 133-135.
Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. Proc. Royal
Soc. London B 208 (1980) 385-397.
Marr, D.: Vision. W.H. Freeman, San Francisco (1982).
Negahdaripour, S.: Ph.D. Thesis, MIT Artificial Intelligence Laboratory (1986).
Nelson, R.C., Aloimonos, J.: Finding motion parameters from spherical flow fields (Or the ad-
vantages of having eyes in the back of your head). Biological Cybernetics 58 (1988) 261-273.
Nelson, R., Aloimonos, J.: Using flow field divergence for obstacle avoidance in visual navigation.
IEEE Trans. PAMI 11 (1989) 1102-1106.
Spetsakis, M.E., Aloimonos, J.: Optimal computing of structure from motion using point corre-
spondences in two frames. Proc. International Conference on Computer Vision (1988).
Spetsakis, M.E., Aloimonos, J.: Unification theory of structure from motion. Technical Report
CAR-TR-482, Computer Vision Laboratory, Center for Automation Research, University
of Maryland, College Park (1989).
Spetsakis, M.E., Aloimonos, J.: Structure from motion using line correspondences. Int'l. J. Com-
puter Vision 4 (1990) 171-183.
Tsai, R.Y., Huang, T.S.: Uniqueness and estimation of three dimensional motion parameters of
rigid objects with curved surfaces. IEEE Trans. PAMI 6 (1984) 13-27.
Ullman, S.: The Interpretation of Visual Motion. MIT Press, Cambridge, MA (1979).
Weng, J., Huang, T.S., Ahuja, N.: A two step approach to optimal motion and structure esti-
mation. Proc. IEEE Computer Society Workshop on Computer Vision (1987).
Young, G.S., Chellappa, R.: 3-D motion estimation using a sequence of noisy stereo images.
Proc. IEEE Conference on Computer Vision and Pattern Recognition (1988).

Active Perception Using DAM and Estimation Techniques
Wolfgang Pölzleitner¹ and Harry Wechsler²
¹ Joanneum Research, Wastiangasse 6, A-8010 Graz, Austria
² Dept. of Computer Science, George Mason University, Fairfax, VA 22030, USA

Abstract. The Distributed Associative Memory (DAM) has been described


previously as a powerful method for pattern recognition. We show that it also
can be used for preattentive and attentive vision. The basis for the preatten-
tive system is that both the visual input features as well as the memory are
arranged in a pyramid. This enables the system to provide fast preselection of
regions of visual interest. The selected areas of interest are used in an attentive
recognition stage, where the memory and the features work at full resolution.
The reason for application of DAM is based on a statistical theory of rejection.
The availability of a reject option in the DAM is the prerequisite for novelty
detection and preattentive selection. We demonstrate the performance of the
method on two diverse applications.

1 Parallel and Hierarchical Recognition

Machine vision research is currently recognizing the impact of connectionist and parallel
models of computation to approach the problem of huge amounts of data to be processed.
Complete parallelism, however, is not possible because it requires too large a number of
processors to be feasible (Sandon [12], Tsotsos [14]). A balance has to be made between
processor-intensive parallel implementation and time-intensive sequential implementation.
This motivates the need for multiresolution image representations, which is also justified by
the organization of the human retina, the availability of retina-like sensors [11], multichan-
nel pyramidal structures [15], and last but not least by the fact that sequential and hierar-
chical decision trees are basic tools employed in statistical pattern recognition to decrease
the computational load.
The approach we suggest in this paper is to distribute the stimulus vectors (i.e., the fea-
tural representation) to parallel processors. Preattentive selection is not in terms of special
features triggering selection, but in terms of the full (learned) knowledge the system has
obtained during training, and which is present in each processor. We use the Distributed
Associative Memory (DAM) as a generic recognition tool. It has several useful properties
to cope with incomplete input (e.g., in the presence of occlusion) and noisy patterns [7].
During recall weights flk are computed to indicate how well the unknown stimulus vector
matches with k-th memorized stimulus vector. We have enhanced the DAM scheme using
results from statistical regression theory, and have replaced the conventionally used weights
/~k by t-StatiStiCS, where the basic equations defining the DAM are

t_k = ...   (1)

Wolfgang Pölzleitner has been supported by Joanneum Research. Harry Wechsler has been partly
supported by DARPA under Contract #MDA972-91-C-004 and the C3I Center at George Mason
University.
A long version of this paper is available from the first author as Joanneum Research technical report
DIB-56.

R^2 = (var(x) - RSS) / var(x)   (2)

Here t_k is the t-statistic indicating how well a specific (new) input is associated with the k-th
stimulus vector stored in the memory, and R^2 is the coefficient of determination, a number
between 0 and 1, measuring the total goodness of the association. RSS = ||x - x̂||^2 and the
variance var(x) = ||x - x̄||^2, where x̄ is the mean over all n elements of x.
To use the DAM in a preattentive manner requires the following properties [9]. First, a
reject function based on R^2 is incorporated that enables the memory to decide whether an
input stimulus is known or is a novelty. Second, the memory is made selective by allowing
it to iteratively discard insignificant stimulus vectors from memory. This is called the attentive
mode of operation (or coefficient focusing [9]).
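
A minimal numerical sketch of recall with the R^2 reject test of Eq. (2) is given below; the memory
here is the standard pseudo-inverse (projection) associative memory, which may differ in detail
from the regression-based DAM of [7, 9], and the reject threshold is hypothetical.

import numpy as np

def build_dam(stimuli):
    # Auto-associative memory: orthogonal projector onto the span of the stored
    # stimulus vectors (rows of `stimuli`), built with the pseudo-inverse.
    A = np.asarray(stimuli, float).T           # columns are the stored stimuli
    return A @ np.linalg.pinv(A)

def r_squared(x, x_hat):
    rss = np.sum((x - x_hat) ** 2)             # RSS = ||x - x_hat||^2
    var = np.sum((x - np.mean(x)) ** 2)        # var(x) = ||x - mean(x)||^2
    return 1.0 - rss / var                     # Eq. (2)

def recall(M, x, reject_below=0.5):
    x_hat = M @ x
    r2 = r_squared(x, x_hat)
    return (x_hat if r2 >= reject_below else None), r2   # reject novelties

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stimuli = rng.normal(size=(6, 64))                    # six stored prototypes
    M = build_dam(stimuli)
    known = stimuli[2] + 0.05 * rng.normal(size=64)       # noisy version of a stored pattern
    novel = rng.normal(size=64)                           # a novelty
    print("known pattern: R^2 = %.2f" % recall(M, known)[1])
    print("novel pattern: R^2 = %.2f" % recall(M, novel)[1])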
Fixation Points, Saccade Generation, and Focused Recognition

Our DAM-based system achieves the balance between parallel and sequential processing
by segmenting the receptive field in preattentive and attentive processes: From the high
resolution visual input a pyramid is computed using Gaussian convolution, and one of its
low resolution layers is input to an array of preattentive DAMs working in parallel. Each
DAM has stored a low-resolution representation of the known stimuli and can output the
coefficient of determination R². Thus a two-dimensional array of R²-values is available at
the chosen pyramid level. These coefficients indicate to what extent the fraction of the input
covered by the particular DAM contains useful information.
Now a maximum selection mechanism takes this array of R²-values as input and selects
the top-ranked FOV. The top-ranked FOVs constitute possible fixation points and a saccade
sequence is generated to move among them, starting at the FOV for which the maximum R²
was obtained. Full resolution receptive fields are then centered at these fixation points and
attentive recognition follows.
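The following sketch illustrates this maximum-selection stage: given the 2-D array of R² values produced by the preattentive DAMs, it keeps local maxima above a reject threshold (a crude stand-in for a lateral-inhibition mechanism) and orders them by decreasing R² to obtain the saccade sequence. All names, the 8-neighbourhood rule and the toy data are our assumptions.

```python
import numpy as np

def fixation_sequence(r2_map, reject_threshold=0.5):
    """Order candidate fixation points from a 2-D array of R^2 values.

    A cell is kept if it equals the maximum of its 8-neighbourhood
    (a crude stand-in for lateral inhibition) and exceeds the reject
    threshold. Candidates are visited in order of decreasing R^2.
    """
    h, w = r2_map.shape
    padded = np.pad(r2_map, 1, mode="constant", constant_values=-np.inf)
    candidates = []
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]          # 3x3 neighbourhood of (i, j)
            if r2_map[i, j] >= reject_threshold and r2_map[i, j] == patch.max():
                candidates.append((r2_map[i, j], (i, j)))
    candidates.sort(reverse=True)                      # saccades start at the maximum R^2
    return [pos for _, pos in candidates]

# toy usage: a 15 x 15 grid of R^2 outputs from the preattentive DAMs
rng = np.random.default_rng(1)
r2 = rng.uniform(0.0, 0.4, size=(15, 15))
r2[4, 7], r2[11, 3] = 0.95, 0.8                        # two "interesting" fields of view
print(fixation_sequence(r2))                           # [(4, 7), (11, 3)]
```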

2 Experiments

We describe next two experiments where active vision implemented as saccades between
fixation points is essential for recognition and safe navigation purposes. Specifically, one
has to trade the quality of recognition and safety considerations for almost real-time per-
formance, and as we show below, parallel hierarchical processing is the means to achieve
such goals.
2.1 Experiment 1: Quality Control

The methods described in the previous sections were tested on textured images of wooden
boards. The goal was to recognize defects in the form of resin galls, knots and holes, where
the knots appear in three different subclasses: normal (i.e., bright knots), knots with partially
dark surrounding, and dark knots so that the resulting six classes can be discriminated.
The preattentive selection methods and the related generation of visual saccades are il-
lustrated in Fig. 1. Here the 2nd level (P = 2) of the gray-level pyramid is input to the array of
low-resolution DAMs. The output of each DAM is shown in Fig. 1 b), where the diameters
of the circles are proportional to the value R² computed in each DAM. The local maxima
of the R² values are shown in Fig. 1 c), where only circles that are local maxima are kept.
The maximum selection can be done efficiently by lateral inhibition. The sequence of fix-
ation points is also indicated in Fig. 1 d). At each such position the attentive mode was
initiated, but only on positions marked by a full circle was the value of R² large enough after
coefficient focusing [9] to indicate recognition. These positions indeed correctly indicated a
wooden board defect, which was classified as such.

2.2 Experiment 2: Spacecraft Landing


A second experiment was performed using remote sensing data for spacecraft navigation.
Here the task of the preattentive system was to locate potential hazards for the landing
maneuver of a spacecraft. A subproblem in autonomous navigation of a spacecraft is to
identify natural objects like craters, rocky peaks, ridges and steep slopes. Fig. 2 a) shows a
test image taken from a mountainous scene in Austria. It is the left image of a stereo pair
that was used to compute a disparity map of the scene [10] to describe a digital elevation
model (DEM). This disparity map was coded as a gray-level image as shown in Fig. 2 b). In
our experiment the task was to detect landmarks that would later on be used for tracking or
hazard avoidance. Prototype landmarks were stored as stimulus vectors in the DAM.
The image shown in Fig. 2 a) is a 736 x 736 image and the elevation model in Fig. 2 b)
has the same resolution. A pyramid was built on the DEM and its 4th level was used to train
and test the memory (i.e., a resolution reduction by 16 x 16). On this low-resolution DEM
the grid of preattentive memories was placed as shown in Fig. 2 c). Each memory had a
receptive field of 14 pixels radius. The output R² of these memories (computed in parallel)
is shown in Fig. 2 c). After selection of the local maxima of these R² values, the saccades in
Fig. 2 d) are generated. Their positions in the gray-level DEM are shown overlaid in Fig. 2
c), and represent areas of potentially steep slopes. The coefficient focusing procedure was
initiated at these locations, resulting in rejection of fixation points 5 and 6, based on a
threshold of R² < 0.5 defining rejection.

References
1. R. Bajcsy. Active Perception. IEEE Proceedings, 76(8):996-1005, August 1988.
2. P. J. Burt. Smart Sensing within a Pyramid Vision Machine. IEEE Proceedings, 76(8):1006-1015,
August 1988.
3. V. Cherkassky. Linear Algebra Approach to Neural Associative Memories. To appear.
4. R. W. Conners and C. T. Ng. Developing a Quantitative Model of Human Preattentive Vision.
IEEE Trans. Syst. Man Cybern., 19(6):1384-1407, November/December 1989.
5. T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 2nd edition, 1988.
6. U. Neisser. Direct Perception and Recognition as Distinct Perceptual Systems. Cognitive Science
Society Address, 1989.
7. P. Olivier. Optimal Noise Rejection in Linear Associative Memories. IEEE Trans. Syst. Man
Cybern., 18(5):814-815, 1988.
8. W. Pölzleitner and H. Wechsler. Invariant Pattern Recognition Using Associative Memory. Techni-
cal Report DIB-48, Joanneum Research, Institute for Image Processing and Computer Graphics,
March 1990.
9. W. Pölzleitner and H. Wechsler. Selective and Focused Invariant Recognition Using Distributed
Associative Memories (DAM). IEEE Trans. Pattern Anal. Machine Intell., 11(8):809-814, August
1990.
10. W. Pölzleitner, G. Paar, and G. Schwingshakl. Natural Feature Tracking for Space Craft Guid-
ance by Vision in the Descent and Landing Phases. In ESA Workshop on Computer Vision and
Image Processing for Spaceborne Applications, Noordwijk, June 10-12, 1991.
11. M. Sandini and M. Tistarelli. Vision and Space-Variant Sensing. Volume 1, Academic Press, 1991.
12. P. A. Sandon. Simulating Visual Attention. Journal of Cognitive Neuroscience, 2(3):213-231,
1990.
13. B. Telfer and D. Casasent. Ho-Kashyap Optical Associative Processors. Applied Optics, 29:1191-
1202, March 10, 1990.
14. J. K. Tsotsos. Analysing Vision at the Complexity Level. Technical Report RCBV-TR-78-20, Univ.
Toronto, March 10, 1987.
15. L. Uhr. Parallel Computer Vision. Academic Press, 1987.


Fig. 1. a) Level P = 0 of the gray-level image. The positions of the various DAMs are marked on the input
image. b) A prototype 'active' receptive field consisting of 15 x 15 preattentive DAMs working in par-
allel. The array is first centered at the location of maximum R². Coefficient focusing [9] is performed
only for this foveal DAM and the other top-ranked FOVs. Each circle is a DAM and the diameters of
the circles are scaled to the respective values of R². The stimuli stored in the DAMs are low-resolution
versions of those stored in the attentive DAM. They operate on the 2nd level of a gray-level pyramid
that has a 4 x 4 reduced resolution with respect to the input image. The DAMs are spaced 4 pixels
apart, and their receptive fields have a radius of 5 pixels on this level (which corresponds to a radius of
5 x 4 = 20 pixels at the high-resolution level). c) The saccades (changes of fixation points) generated
by sorting the local maxima of the preattentive R² values from b). The full circles represent locations
where the attentive recognition system has recognized an object. The numbers indicate the sequence
in which the fixation points were changed. d) Illustrates the scaling used in parts b) and c) and shows
the resulting diameter of the circle for R² = 1.


Fig. 2. a) The left image of an image pair used as a test image. Its size is 736 x 736 pixels. b) The 4th
pyramid level of the elevation model derived from this image pair is shown. Bright areas correspond
to high locations (mountain peaks, alpine glaciers), whereas dark areas are low-altitude locations
(e.g., river valleys). c) The output of each preattentive memory is coded by circles with radii scaled to
the respective values of R². The circular receptive fields of these memories are placed 4 pixels apart,
with a radius of 14 pixels. d) The double circles show the values of R² in the preattentive (parallel)
mode (outer circles) and the values of R² after the last step of coefficient focusing (inner circles). The
fixation points selected by the preattentive system are shown in b).
Active/Dynamic Stereo for Navigation
Enrico Grosso, Massimo Tistarelli and Giulio Sandini
University of Genoa
Department of Communication, Computer and Systems Science
Integrated Laboratory for Advanced Robotics (LIRA- Lab)
Via Opera Pia 11A - 16145 Genoa, Italy
Abstract. Stereo vision and motion analysis have been frequently used
to infer scene structure and to control the movement of a mobile vehicle
or a robot arm. Unfortunately, when considered separately, these methods
present intrinsic difficulties and a simple fusion of the respective results has
been proved to be insufficient in practice.
The paper presents a cooperative schema in which the binocular dis-
parity is computed for corresponding points in several stereo frames and
it is used, together with optical flow, to compute the time-to-impact. The
formulation of the problem takes into account translation of the stereo set-
up and rotation of the cameras while tracking an environmental point and
performing one degree of freedom active vergence control. Experiments on
a stereo sequence from a real scene are presented and discussed.

1 Introduction

Visual coordination of actions is essentially a real-time problem. It is more and more


clear that a lot of complex operations can rely on reflexes to visual stimuli [Bro86].
For example closed loop visual control has been implemented at about video rate for
obstacle detection and avoidance [FGMS90], target tracking [CGS91] and gross shape
understanding [TK91].
In this paper we face the problem of "visual navigation". The main goal is to perform
task-driven measurements of the scene, detecting corridors of free space along which the
robot can safely navigate.
The proposed cooperative schema uses binocular disparity, computed on several image
pairs and over time. In the past the problem of fusing motion and stereo in a mutually
useful way has been faced by different researchers. Nevertheless, there is a great difference
between the approaches where the results of the two modalities are considered separately
(for instance using depth from stereo to compute motion parameters [Mut86]) and the
rather different approach based upon more integrated relations (for instance the temporal
derivative of disparity [WD86, LD88]).
In the following we will explain how stereo disparity and image velocity are combined
to obtain a 2½-D representation of the scene, suitable for visual navigation, which is either
in terms of time-to-impact or relative-depth referred to the distance of the cameras from
the fixation point. Only image-derived quantities are used except for the vergence angles
of the cameras which could be actively controlled during the robot motion [OC90], and
can be measured directly on the motors (with optical encoders).
As a generalization of a previous work [TGS91] we consider also a rotational motion
of the cameras around the vertical axes and we derive, from temporal correspondence of
image points, the relative rotation of the stereo base-line. This rotation is then used to
correct optical flow or relative depth.
* This work has been partially funded by the Esprit projects P2502 VOILA and P3274 FIRST.

2 The experimental set-up

The experimental set-up is based on a computer-controlled mobile platform TRC Labmate
with two cameras connected to a VDS 7001 Eidobrain image processing workstation. The
cameras are arranged so as to verge toward a point in space. Sequences of stereo images are
captured at a variable frequency, up to video rate, during the motion of the vehicle (left
and right images are captured simultaneously, thanks to the Eidobrain image processor).
At present the cameras are not motorized, therefore we moved the vehicle step by step,
adjusting the orientation of the two cameras manually so as to always verge on the same
point in the scene. In this way we simulated a tracking motion of both cameras on a
moving vehicle.

Fig. 1. First stereo pair of the acquired sequence.

In figure 1 the first stereo pair, from a sequence of 15, is shown. The vehicle was moving
forward at about 100 mm per frame. The sequence has been taken inside the LIRA lab.
Many objects were in the scene at different depths. The vehicle was undergoing an almost
straight trajectory with a very small steering toward left, while the cameras were fixating
a stick on the desk in the foreground.

3 Stereo and motion analysis

3.1 Stereo analysis


The stereo vision algorithm is based on a regional multiresolution approach [GST89] and
produces, at each instant of time, a disparity map between the points in the left and in
the right image (see figure 3).
With reference to figure 2 we define the K function [TGS91] as:

K(\alpha,\beta,\gamma,\delta) = \frac{\tan(\alpha-\gamma)\,\tan(\beta+\delta)}{\tan(\alpha-\gamma)+\tan(\beta+\delta)} \qquad (1)


Fig. 2. Schematic representation of the stereo coordinate system.

where α and β are the vergence angles, γ = arctan(x_l/F) and δ = arctan(x_r/F) define
the position of two corresponding points on the image planes, and x_r = x_l + D where
D is the known disparity. The depth is computed as:

Z_s = d \cdot K(\alpha,\beta,\gamma,\delta) \qquad (2)
The knowledge of the focal length is required to compute the angular quantities.
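A minimal sketch of equations (1) and (2) follows, assuming that γ and δ are obtained from the horizontal image coordinates and the focal length F in pixels, as read above; the function and variable names, and the toy numbers, are ours.

```python
import numpy as np

def stereo_depth(alpha, beta, x_left, disparity, focal_px, baseline):
    """Depth from verging stereo, following eqs. (1)-(2).

    alpha, beta : vergence angles of the left/right camera [rad]
    x_left      : horizontal image coordinate in the left image [px]
    disparity   : D, so that x_right = x_left + D [px]
    focal_px    : focal length in pixels
    baseline    : inter-camera distance d (same unit as the returned depth)
    """
    gamma = np.arctan(x_left / focal_px)                 # angular offset, left image
    delta = np.arctan((x_left + disparity) / focal_px)   # angular offset, right image
    ta = np.tan(alpha - gamma)
    tb = np.tan(beta + delta)
    k = (ta * tb) / (ta + tb)                            # eq. (1)
    return baseline * k                                  # eq. (2)

# toy usage: symmetric vergence on the fixation point (x = 0, D = 0)
alpha = beta = np.deg2rad(80.0)
print(stereo_depth(alpha, beta, x_left=0.0, disparity=0.0,
                   focal_px=800.0, baseline=335.0))
# d * tan(a)tan(b)/(tan(a)+tan(b)) = 335/2 * tan(80 deg), roughly 950 mm
```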

3.2 M o t i o n analysis

The temporal evolution of image features (corresponding to objects in the scene) is de-
scribed as the instantaneous image velocity (optical flow). The optical flow V = (u, v)
is computed from a monocular image sequence by solving an over-determined system of
linear equations in the unknown terms (u, v) [HS81, UGVT88, TS90]:

\frac{dI}{dt} = 0 \;\Longrightarrow\; \frac{\partial I}{\partial t} + \mathbf{V} \cdot \nabla I = 0

where I represents the image intensity of the point (x, y) at time t. The least squares
solution of these equations can be computed for each point on the image plane [TS90].
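As an illustration only, the following sketch solves a generic over-determined brightness-constancy system over a small window in the least-squares sense. The actual systems used in [HS81, UGVT88, TS90] differ in the constraints they stack, so this is a stand-in, not the authors' algorithm; all names and the toy data are assumptions.

```python
import numpy as np

def flow_at(I0, I1, y, x, half=3):
    """Least-squares optical flow (u, v) at pixel (y, x).

    Stacks the brightness-constancy equation Ix*u + Iy*v + It = 0 over a
    (2*half+1)^2 window and solves it in the least-squares sense.
    """
    Iy, Ix = np.gradient(I0.astype(float))          # spatial derivatives
    It = I1.astype(float) - I0.astype(float)        # temporal derivative
    ys, xs = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
    A = np.stack([Ix[ys, xs].ravel(), Iy[ys, xs].ravel()], axis=1)
    b = -It[ys, xs].ravel()
    (u, v), _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# toy usage: a smooth pattern shifted by one pixel to the right
yy, xx = np.mgrid[0:64, 0:64]
I0 = np.sin(0.3 * xx) + np.cos(0.2 * yy)
I1 = np.sin(0.3 * (xx - 1)) + np.cos(0.2 * yy)      # apparent motion u ~ +1, v ~ 0
print(flow_at(I0, I1, 32, 32))
```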
In figure 4 the optical flow of the sixth image of the sequence is shown. The image
velocity can be described as a function of the camera parameters and split into two terms
depending on the rotational and translational components of camera velocity respectively.
If the rotational part of the flow field Vr can be computed (for instance from pro-
prioceptive data), Vt is determined by subtracting Vr from V. From the translational
optical flow, the time-to-impact can be computed as:
T = \frac{A}{|V_t|} \qquad (3)
where A is the distance of the considered point, on the image plane, from the FOE.
The estimation of the FOE position is still a critical step; we will show how it can be
avoided by using stereo disparity.


Fig. 3. Disparity computed for the 6th stereo pair of the sequence; negative values are depicted using darker gray levels.
Fig. 4. Optical flow relative to the 6th left image of the sequence.

4 Stereo - motion geometry

Even though the depth estimates from stereo and motion are expressed using the same
metric, they are not homogeneous because they are related to different reference frames.
In the case of stereo, depth is referred to an axis orthogonal to the baseline (it defines
the stereo camera geometry) while for motion it is measured along a direction parallel
to the optical axis of the (left or right) camera. We have to consider the two reference
frames and a relation between them:

Z_s(x,y) = Z_m(x,y)\,h(x), \qquad h(x) = \sin\alpha + \frac{x}{F}\cos\alpha \qquad (4)

where α is the vergence angle of the left camera, F is the focal length of the camera in
pixels and x is the horizontal coordinate of the considered point on the image plane (see
figure 2). We choose to adopt the stereo reference frame, because it is symmetric with
respect to the cameras; therefore all the measurements derived from motion are corrected
according to the factor h(x).
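A small sketch of the correction factor of equation (4); the names and the numbers in the example are ours.

```python
import numpy as np

def h(x, alpha, focal_px):
    """Reference-frame factor of eq. (4): h(x) = sin(alpha) + (x/F) cos(alpha)."""
    return np.sin(alpha) + (x / focal_px) * np.cos(alpha)

def motion_to_stereo_depth(z_motion, x, alpha, focal_px):
    """Express a motion-derived depth in the symmetric stereo frame: Z_s = Z_m * h(x)."""
    return z_motion * h(x, alpha, focal_px)

# on the optical axis (x = 0) the factor reduces to sin(alpha)
print(motion_to_stereo_depth(1000.0, x=0.0, alpha=np.deg2rad(80.0), focal_px=800.0))
```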

5 Rotational motion and vergence control

The translational case analyzed in [TGS91] can be generalized by considering a planar


motion of the vehicle with a rotational degree of freedom and with the two cameras
tracking a point in space.
As the cameras and the vehicle are moving independently, we are interested in com-
puting the global camera rotation resulting from both vehicle global motion and ver-
gence/tracking motion of the cameras.
Figure 5 helps to clarify the problem. Previous work [KP86] shows that, from a the-
oretical point of view, it is possible to compute the vergence of the two cameras from


Fig. 5. Rotation of the stereo system during the motion.
Fig. 6. Correction of the relative depth using rotation.

corresponding left and right image points. However, camera resolution and numerical
instability problems make a practical application of the theory difficult. Moreover, as it
appears in figure 5, the computation of the vergence angle is insufficient to completely
determine the position of the stereo pair in space. For this reason the angle φ₁ or, alter-
natively, φ₂ must be computed. We first assume the vergence angles of the cameras to
be known at a given time instant; for example they can be measured by optical encoders
mounted directly on the motors.
The basic idea for computing the rotational angle is to locate two invariant points
on the scene space and use their projection, along with the disparity and optical flow
measurements, to describe the temporal evolution of the stereo pair. The first point
considered is the fixation point, as it is "physically" tracked over time and kept in the
image center. Other points are obtained by computing the image velocity and tracking
them over successive frames.
In figure 7 the position of the two invariant points (F and P), projected on the ZX
plane, is shown together with the projection rays.
Considering now the stereo system at time t₁ we can compute, by applying basic
trigonometric relations, the oriented angle θ₁ between the 2D vectors FP and LR:

\tan(\theta_1) = \frac{\tan(\alpha_1-\gamma_1)\,\tan(\beta_1+\delta_1)\,[\tan\alpha_1+\tan\beta_1] - \tan\alpha_1\,\tan\beta_1\,[\tan(\alpha_1-\gamma_1)+\tan(\beta_1+\delta_1)]}{\tan(\beta_1+\delta_1)\,[\tan\alpha_1+\tan\beta_1] - \tan\beta_1\,[\tan(\alpha_1-\gamma_1)+\tan(\beta_1+\delta_1)]} \qquad (5)
It is worth noting that the angle θ₁ must be bounded within the range [0, 2π). In a
similar way the angle θ₂ at time t₂ can be computed. φ₁ is derived as:

\varphi_1 = \theta_1 - \theta_2 \qquad (6)

In this formulation φ₁ represents the rotation of the base-line at time t₂ with respect
to the position at time t₁; the measurements of φ₁ can be performed using a subset or


Fig. 7. Geometry of the rotation (projection on the stereo plane).

Fig. 8. Rough and smoothed histograms of the angles computed from frames 5-6 and frames 7-8,
respectively. The abscissa scale goes from -0.16 radians to 0.16 radians. The maxima computed
in the smoothed histograms correspond to 0.00625 and 0.0075 radians respectively.

all of the image points: in the noiseless case, all the image points will produce identical
estimates. In the case of objects moving within the field of view it should be easy to
separate the different peaks in the histogram of the computed values corresponding to
moving and still objects. Figure 8 shows two histograms related to frames 5-6 and 7-8,
respectively.
In order to compute the angle φ₁ the following parameters must be known or measured:
- α and β, the vergence angles of the left and right camera respectively, referred to the
stereo baseline.
- γ and δ, the angular coordinates of the considered environmental point computed
on the left and right camera respectively. The computed image disparity is used to
establish the correspondence between image points on the two image planes.
- The optical flow computed at time t₁.
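The sketch below implements our reading of equations (5) and (6), using the parameters listed above: θ is computed at two time instants from (α, β, γ, δ) and the base-line rotation follows as their difference. Since the printed formula is partly garbled, the exact expression inside baseline_angle should be treated as an assumption; in practice one estimate per tracked point would be accumulated into the histogram of Fig. 8.

```python
import numpy as np

def baseline_angle(alpha, beta, gamma, delta):
    """Oriented angle theta between the vector FP and the baseline LR (our reading of eq. (5))."""
    ta, tb = np.tan(alpha), np.tan(beta)
    tag, tbd = np.tan(alpha - gamma), np.tan(beta + delta)
    num = tag * tbd * (ta + tb) - ta * tb * (tag + tbd)
    den = tbd * (ta + tb) - tb * (tag + tbd)
    return np.arctan2(num, den) % (2.0 * np.pi)      # bounded in [0, 2*pi)

def baseline_rotation(angles_t1, angles_t2):
    """phi_1 = theta_1 - theta_2 (eq. (6)), one estimate per tracked point;
    the mode of a histogram of these values gives the robust estimate."""
    th1 = np.array([baseline_angle(*a) for a in angles_t1])
    th2 = np.array([baseline_angle(*a) for a in angles_t2])
    return th1 - th2

# toy usage: one tracked point observed at t1 and t2 (all angles in radians)
t1 = [(np.deg2rad(80), np.deg2rad(80), np.deg2rad(2.0), np.deg2rad(1.0))]
t2 = [(np.deg2rad(79), np.deg2rad(81), np.deg2rad(2.2), np.deg2rad(0.8))]
print(baseline_rotation(t1, t2))
```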

5.1 Using rotation to correct image-derived data


The rotation of the stereo system affects the optical flow, adding a rotational component to
the translational one. As a consequence, the rotation computed from image data allows,
by difference, the computation of the component V_t and, finally, the time-to-impact.
An alternative way is to correct the relative depth coming from stereo analysis, taking
into account the rotation of the stereo pair. In this case, with reference to figure 6,
Z_s(t₁) is the depth at time t₁ and Z_s(t₂) is the depth of the same point at
time t₂. The correction must eliminate only the rotational component, therefore we can
write:

\overline{PO} = Z_{s,tr}(t_2) = Z_s(t_2) \cdot \left[ \cos\varphi_1 + \frac{\sin\varphi_1}{\tan(\alpha_2 - \gamma_2)} \right] \qquad (7)

In the remainder of the paper we denote with Z the translational component Z_{s,tr}.

6 Using neighborhoods to compute time-to-impact and relative depth

From the results presented in the previous sections we can observe that both stereo-
and motion-derived relative-depth depend on some external parameters. More explicitly,
writing the equations related to a common reference frame (the one adopted by the stereo
algorithm):

K_i = \frac{Z_i}{d}, \qquad T_i' = \frac{h(x_i)}{h(0)}\,T_i, \qquad T_i' = \frac{Z_i}{W_z} \qquad (8)
where d is the interocular baseline, W_z is the velocity of the camera along the stereo ref-
erence frame, T_i represents the time-to-impact measured in the reference frame adopted
by the motion algorithm and Ti' is the time-to-impact referred to the symmetric, stereo
reference system.
We consider now two different expressions derived from equations (8).

T_i = \frac{d}{W_z}\,K_i\,\frac{h(0)}{h(x_i)}, \qquad \frac{Z_i}{Z_l} = \frac{T_i}{K_l}\cdot\frac{h(x_i)}{h(0)}\cdot\frac{W_z}{d} \qquad (9)
The first equation represents the time-to-impact with respect to the motion system, while
the second equation represents a generic relative measure of the point (x_i, y_i) with respect
to the point (x_l, y_l). Our goal is now to eliminate the ratio W_z/d. A first way to proceed is
to rewrite the first equation of (9) for a generic point (x_j, y_j) whose velocity with respect
to (x_i, y_i) is zero.

T_j = \frac{d}{W_z}\,K_j\,\frac{h(0)}{h(x_j)} \qquad (10)

Using the first equation of (9) and equation (10) we can compute a new expression
for W_z/d:

\frac{W_z}{d} = \frac{K_i - K_j}{T_i\,h(x_i) - T_j\,h(x_j)}\,h(0) \qquad (11)

where (x_i, y_i) and (x_j, y_j) are two points on the image plane of the left (or right) camera.
This formulation is possible if we are not measuring the distance of a flat surface, because
of the difference of K at the numerator and the difference of T at the denominator.
Substituting now in equations (9) we obtain:

\frac{T_i}{T_j} = \frac{K_i\,h(x_j)}{K_j\,h(x_i)}, \qquad \frac{Z_i}{Z_l} = \frac{(K_i - K_j)\,T_i\,h(x_i)}{K_l\,[T_i\,h(x_i) - T_j\,h(x_j)]} \qquad (12)

The two equations are the first important result. In particular the second equation
directly relates the relative-depth to the time-to-impact and stereo disparity (i.e. the
K function). The relative-depth or time-to-impact can be computed more robustly by
integrating several measurements over a small neighborhood of the considered point
(xi, Yi), for example with a simple average. The only critical factor in the second equation
of (12) is the time-to-impact which usually requires the estimation of the FOE position.
However, with a minimum effort it is possible to exploit further the motion equations to
directly relate time-to-impact and also relative-depth to stereo disparity (the K function)
and optical flow only.

7 Using optical flow to compute time-to-impact

We will exploit now the temporal evolution of disparity. If the optical flow and the
disparity map are computed at time t, the disparity relative to the same point in space
at the successive time instant can be obtained by searching for a matching around the
predicted disparity, which must be shifted by the velocity vector to take into account the
motion.
As W_z/d is a constant factor for a given stereo image, it is possible to compute a robust
estimate by taking the average over a neighborhood [TGS91]:

\frac{W_z}{d} = \frac{\Delta K}{\Delta t} = \frac{1}{N^2} \sum_i \left[ K_i(t) - K_i(t + \Delta t) \right] \qquad (13)

Given the optical flow V = (u, v) and the map of the values of the K function at time
t, the value of K_i(t + Δt) is obtained by considering the image point (x_i + u_i, y_i + v_i)
on the map at time t + Δt.
The value of ΔK for the 6th stereo pair has been computed by applying equation (13) at
each image point. By taking the average of the values of ΔK over the whole image, a value of
W_z/d equal to 0.23 has been obtained. This value must be compared to the velocity of the
vehicle, which was about 100 millimeters per frame along the Z axis and the interocular
baseline which was about 335 millimeters. Due to the motion drift of the vehicle and the
fact that the baseline has been measured by hand, it is most likely that the given values
of the velocity and baseline are slightly wrong.
By using equation (13) to substitute W_z/d in the first equation of (9), it is possible
to obtain a simple relation for the time-to-impact, which involves the optical flow to
estimate ΔK:

T_i' = \frac{h(x_i)}{h(0)}\,T_i = \frac{K_i}{\Delta K} \qquad (14)

Fig. 9. Time-to-impact computed using eq. (14) for the 6th pair of the sequence; darker regions
correspond to closer objects.

This estimate of the time-to-impact is very robust and does not require the compu-
tation of the FOE (see figure 9). From equation (13) and the second equation of (9), it
is possible to obtain a new expression for the relative-depth:

\frac{Z_i}{Z_l} = \frac{\Delta K}{K_l}\cdot\frac{h(x_i)}{h(0)}\,T_i \qquad (15)
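A minimal sketch of equations (13) and (14) follows: the K map at time t+Δt is read at the flow-displaced positions, the average decrease gives an estimate of W_z/d, and the time-to-impact in the stereo frame follows as K_i/ΔK. The nearest-neighbour warp and the interpretation of the 1/N² sum as a plain average over the image are our simplifications.

```python
import numpy as np

def delta_k(K_t, K_t1, flow_u, flow_v):
    """Average temporal decrease of the K function (eq. (13)).

    K_t, K_t1      : K maps at times t and t + dt
    flow_u, flow_v : optical flow at time t (pixels/frame)
    The K value of the same physical point at t + dt is read at the
    flow-displaced position (nearest-neighbour warp for simplicity).
    """
    h, w = K_t.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x1 = np.clip(np.rint(xx + flow_u).astype(int), 0, w - 1)
    y1 = np.clip(np.rint(yy + flow_v).astype(int), 0, h - 1)
    return np.mean(K_t - K_t1[y1, x1])          # estimate of W_z / d

def time_to_impact(K_t, dK):
    """Time-to-impact in the stereo frame, T_i' = K_i / dK (eq. (14))."""
    return K_t / dK

# toy usage: the whole K map shrinks by 2 percent per frame under pure approach
K_t = np.full((32, 32), 5.0)
K_t1 = K_t * 0.98
u = v = np.zeros_like(K_t)
dK = delta_k(K_t, K_t1, u, v)
print(dK, time_to_impact(K_t, dK)[0, 0])        # ~0.1 and ~50 frames
```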

There is, in principle, a different way to exploit the optical flow. For completeness we
will briefly outline this aspect.
From the knowledge of the translational component V_t of V it is possible to write
the translational flow at a generic point (x_h, y_h) as a function of F/Z and W_x/W_z.
Considering two points (x_i, y_i) and (x_j, y_j) and eliminating F/Z and W_x/W_z we express
a relation between T_i and T_j:

x_i - T_i\,V_{tx}(x_i, y_i) = x_j - T_j\,V_{tx}(x_j, y_j) \qquad (16)

Now, combining equation (16) with the first of (12) we can obtain a new equation
of order two for T_i. T_i is expressed in this case as a function of the coordinates and
the translational flow of the two considered points. As in the case of equations (12) the
time-to-impact can be computed more robustly by averaging the measurements over a
small neighborhood of the considered point (x_i, y_i).

8 Conclusions

Robot navigation requires simple computational schemes able to select and to exploit
relevant visual information. The paper addressed the problem of stereo-motion coop-
eration, proposing a new approach to merge binocular disparity and optical flow. The
formulation of the problem takes into account active vergence control but limits the ro-
tation of the cameras around the vertical axes. The rotation of the stereo base-line is first
extracted using stereo disparity and optical flow; after that it is used to correct stereo
relative depth and to compute dynamic quantities like time-to-impact directly from stereo
information, using optical flow to determine the temporal evolution of disparity in the
sequence. Future work will be addressed to include in the formulation the tilt angles of
the two cameras.

References
[Bro86] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Trans. on
Robotics and Automat., RA-2:14-23, April 1986.
[CGS91] G. Casalino, G. Germano, and G. Sandini. Tracking with a robot head. In Proc. of
ESA Workshop on Computer Vision and Image Processing for Spaceborne Applica-
tions, Noordwijk, June 10-12, 1991.
[FGMS90] F. Ferrari, E. Grosso, M. Magrassi, and G. Sandini. A stereo vision system for real
time obstacle avoidance in unknown environment. In Proc. of Intl. Workshop on
Intelligent Robots and Systems, Tokyo, Japan, July 1990. IEEE Computer Society.
[GST89] E. Grosso, G. Sandini, and M. Tistarelli. 3d object reconstruction using stereo
and motion. IEEE Trans. on Syst. Man and Cybern., SMC-19, No. 6, Novem-
ber/December 1989.
[HS81] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17, No. 1-3:185-204, 1981.
[KP86] B. Kamgar-Parsi. Practical computation of pan and tilt angles in stereo. Technical
Report CS-TR-1640, University of Maryland, College Park, MD, March 1986.
[LD88] L. Li and J. H. Duncan. Recovering three-dimensional translational velocity and
establishing stereo correspondence from binocular image flows. Technical Report
CS-TR-2041, University of Maryland, College Park, MD, May 1988.
[Mut86] K. M. Mutch. Determining object translation information using stereoscopic motion.
IEEE Trans. on PAMI-8, No. 6, 1986.
[OC90] T. J. Olson and D. J. Coombs. Real-time vergence control for binocular robots. Tech-
nical Report 348, University of Rochester - Dept. of Computer Science, 1990.
[TGS91] M. Tistarelli, E. Grosso, and G. Sandini. Dynamic stereo in visual navigation. In
Proc. of Int. Conf. on Computer Vision and Pattern Recognition, Lahaina, Maui,
Hawaii, June 1991.
[TK91] C. Tomasi and T. Kanade. Shape and motion from image streams: a factorization
method. Technical Report CS-91-105, Carnegie Mellon University, Pittsburgh, PA,
January 1991.
[TS90] M. Tistarelli and G. Sandini. Estimation of depth from motion using an anthropo-
morphic visual sensor. Image and Vision Computing, 8, No. 4:271-278, 1990.
[UGVT88] S. Uras, F. Girosi, A. Verri, and V. Torre. Computational approach to motion per-
ception. Biological Cybernetics, 1988.
[WD86] A. M. Waxman and J. H. Duncan. Binocular image flows: Steps toward stereo-motion
fusion. IEEE Trans. on PAMI-8, No. 6, 1986.

Integrating Primary Ocular Processes
Kourosh Pahlavan, Tomas Uhlin and Jan-Olof Eklundh
Computational Vision and Active Perception Laboratory (CVAP)
Royal Institute of Technology
S-100 44 Stockholm, Sweden
Email: kourosh@bion.kth.se, tomas@bion.kth.se, joe@bion.kth.se

Abstract. The study of active vision using binocular head-eye systems


requires answers to some fundamental questions in control of attention. This
paper presents a cooperative solution to resolve the ambiguities generated
by the processes engaged in fixation. We suggest an approach based on
integration of these processes, resulting in cooperatively extracted unique
solutions.
The discussion is started by a look at biological vision. Based on this
discussion, a model of integration for machine vision is suggested. The im-
plementation of the model on the K T H - h e a d - - a head-eye system simulating
the essential degrees of freedom in mammalians--is explained and in this
context, the primary processes in the head-eye system are briefly described.
The major stress is put on the idea that the rivalry processes in vision in
general, and the head's behavioral processes in particular, result in a reliable
outcome.
As an experiment, the ambiguities raised by fixation at repetitive pat-
terns is tested; the cooperative approach proves to handle the problem cor-
rectly and find a unique solution for the fixation point dynamically and in
real-time.

1 Introduction

In recent years, there has been an increasing interest in studying active vision using head-
eye systems, see e.g. [Brn88, ClF88, Kkv87]. Such an approach raises some fundamental
questions about control of attention.
Although one can point to work on cue integration, it is striking how computer vision
research generally treats vision problems in total isolation from each other. Solutions
are obtained as a result of imposed constraints or chosen parameters, rather than as an
outcome of several rivalling/cooperating processes, as occurs in biological systems. The
main reason why such systems are so fault-tolerant and well performing is that they
engage several processes doing almost the same task, while these processes are functional
and stable under different conditions.
The present paper presents an approach based on this principle, by integration of basic
behaviors of a head-eye system, here called primary ocular processes. These are low-level
processes that, under integration, guarantee a highly reliable fixation on both static and
dynamic objects. The primary ocular processes build the interface between what we'll
call the reactive and active processes in the visual system, that is, the processes engaged,
either if the observer voluntarily wants to look at a specific point in the world, or if he
involuntarily is attracted by an event detected somewhere in the world.
In summary, the presented work actually addresses two separate issues. The first and
main one is that of building a complex behavior by integrating a number of independent,

primary processes in a cooperative-competitive manner. Secondly, by selecting vergence


as our test example, we in fact also develop a computational mechanism for fault-tolerant
continuous gaze control; a problem that is attracting considerable interest by itself, see
e.g. [Brn88, Jen91].
The proposed approach--largely inspired by mammalian vision--requires a design
of the head-eye system and its control architecture that allows both independent visual
processes and independent eye and head motions. The KTH head-eye system satisfies
these requirements. We shall briefly outline its design and characteristic performance.
We shall also, as a guideline, briefly discuss the associated processes in human vision.

2 Dynamic fixation: pursuit and saccades

In experimental psychology, two kinds of gaze shifting eye movements are clearly sepa-
rated [Ybs67]. These are vergence and saccades 1 . Since we also are interested in vergence
on dynamic objects or vergence under ego-motion, we change this division into pursuit
and saccades.
The general distinction between these two movements is based on the speed, am-
plitude and the inter-ocular cooperation of the movements. While saccades in the two
eyes are identical with respect to speed and the amplitude of the motion 2, the pursuit is
identified by its slow image-driven motion. Saccades are sudden and are pre-computed
movements (jumps), while pursuit is a continuous smooth movement.
With this categorization in mind, the elements of the human oculomotor system
can be classified into saccades, consisting of a spectrum of large to micro saccades, and
pursuit, consisting of accommodation, vergence and stabilization/tracking.
Here, we are arguing that in a computational vision system the processes consti-
tuting pursuit should be hardwired together. This is because they are not only similar
processes with regard to their qualitative features, they also serve a similar purpose:
sequential and successive adjustment of the fixation point.
Although they belong to the same class of movements, there is a very important
difference between convergence/divergence on one hand and stabilization on the other.
Stabilization is very successful and reliable in keeping track of lateral movements, while
the accommodation process (focusing and convergence/divergence) is very successful and
good at doing the same in depth. The combination of the two yields a very reliable
cooperative process.
In human vision, stabilization is a result of matching on the temporal side of one eye
and the nasal side of the other one, convergence/divergence is a result of matching on
either the temporal or the nasal side of both eyes [Hub88, Jul71].
Let's start by defining each process in our computational model of an oculomotor
system.

Saccades are sudden rotations in both eyes with identical speed and amplitude. They
are not dependent on the visual information under movement, and the destination
point for them is pre-computed. The trace of a saccade is along the Vieth-Müller

1 We have by no means any intention to enter a discussion on the nomenclature of psychology.


Therefore we use the established expressions familiar to the field of computer vision.
2 This is Hering's law, stating that saccadic movements in the two eyes are identical in speed and
amplitude. Recent results [Col91] suggest that one can simulate situations, where a difference
up to 10~ is tolerable.

circle 3 for the point of fixation at the time. The motion is ramped up and down as a
function of its amplitude. Since the motion strategy is an open-loop one, the speed
is limited by the physical constraints of the motoric system.
Pursuit is a cooperative but normally not identical movement in both eyes. The motion
is image-driven, i.e. depending on the information from both images at the time, the
eyes are continuously rotated so that the point of interest falls upon each fovea cen-
tralis, in a closed-loop manner; the speed is limited by the computational constraints
of the visual system and its image flow.

In our categorization, elements of pursuit are vergence (accommodation and conver-


gence/divergence) and stabilization. By stabilization we simply mean the process of lock-
ing the feature on the fovea. The simple difference between stabilization and tracking is
that tracking is often referring to keeping track of a moving object. Stabilization, how-
ever, refers to keeping the image stable despite ego-motion or the movement of the
object.
The classifications done here are not to be considered as a lack of links between, or
overlap of, processes; they are simply a scientific attempt to model by isolation. In actual fact,
in human vision, pursuit of objects quite often is accompanied by small saccades to keep
track of the moving object at sudden accelerations or turns.

3 Fixation in vertebrates

The biological approach to the problem of fixation suggests a highly secure technique
[Ybs67]. Figure 1 (left) illustrates the model of human vergence combined with a saccade.
In this model, verging from one point to another starts by a convergence/divergence of
the optical axes of the eyes. This process continues for a certain short time. The initial
convergence/divergence is then followed by a saccade to the cyclopic axis through the new
fixation point. The remaining convergence/divergence is carried out along the cyclopic
axis until the fixation on the desired point is complete. Quite often, a correcting small
saccade is performed at the vicinity of the fixation point.
There are some points here, worth noting. To begin with, the model is based on a
cyclopic representation of the world. Although this does not have to be a representation
like in [Jul71], one needs at least some kind of direction information about the point of
interest. Changing the gaze direction to the cyclopic direction is preceded by an initial
convergence or divergence, which could be caused by the preparation time to find the
direction. The search along the cyclopic axis is very interesting. This searching strategy
could transform the matching problem into a zero disparity detection. Besides, having a
non-uniform retina, this is perhaps the only way to do a reliable vergence, simply because
the matching/zero disparity detection is carried out at the same level of resolution. In
the end, the last matching is performed at the finest resolution part of fovea centralis.
Even more interesting evidence here is the observation that even with one eye closed,
the vergence process forces the closed eye to accompany the motion of the open eye.
Sudden opening of the closed eye shows that the closed eye was actually aiming fairly well
at the point of interest, suggesting that a monocular cue (focus in our implementation)
plays an active role in vergence.

3 The circle (or sphere) which goes through the fixation point and the optical centers of the two
eyes. All points on the circle have the same horizontal disparity. The horopter doesn't follow
this circle exactly. See e.g. [BaF91] and further references there.


Fig. 1. Left: A model of human vergence suggested by A. Yarbus. In this model, con-
vergence/divergence movement is superimposed on the saccade movement. The conver-
gence/divergence shifts the fixation point along the cyclopic axis, while a saccade is a rotation
of the cyclopic axis towards the new fixation point. Right: The KTH-head.

Another interesting point to mention here is the non-uniform retina. It is evident


that in vertebrates lacking a non-uniform retina, fixation ability is also missing. Other
vertebrates (e.g. some birds) even engage two foveas for different classes of vergence. This
observation is especially interesting in the context of the general issue of why non-uniform
retinas are developed by evolution; suggesting that the technique is certainly helping
vision and not limiting it [A1K85]. The symmetric nature of convergence/divergence
matchings is also going on as a kind of coarse-scale to fine-scale matching.
So much about general definitions on eye movements and biological evidence necessary
to have in mind. Before we leave this section to return to a description of our own model,
inspired by the discussion above, there is one point to make here. Neck movements,
although participating in the process of gaze control as a very eminent member, are not
directly hard-wired to eye movements, in the sense that a neck movement performed while
the eyes are closed does not affect the relative position of the eyes in their orbits.

4 The KTH-head: design dependency of primary processes

The KTH-head is a head-eye system performing motions with 7 mechanical and 6 optical
DOFs. It utilizes a total of 15 motors and is capable of simulating most of the movements
in mammalians. Currently, it utilizes a network of 11 transputers, configured with a
symmetric layout for executing the primary behaviors of the system. Figure 1 (right)
illustrates the KTH-head. The eye modules in the head-eye system are mechanically
independent of each other. There is also a neck module which can be controlled separately.
All DOFs can be controlled in parallel and the task of the coordination between them is
carried out by the control scheme of the system. See [PaE90, PaE92].
The design philosophy of the KTH-head was to allow a very flexible modular com-
bination of different units, so that the control system would have few restrictions in
530

integrating the module movements. The message here is that the design has been ad-
justed to a fairly flexible model of a mammalian head. In particular, our system allows
exploration and exploitation of the principle of competing and cooperating independent
primary processes proposed above. A motivation for our design also derives from the
observation that in the mammalian visual system the mechanical structure supports
the visual processing.
Three major features distinguish our construction from earlier attempts 4. These are:

- Eye movements about the optical center of the lens.


- Separate neck unit
- The ability to change the size of the base-line dynamically

The first two items are essential for adapting the mammalian control strategy to the
system. There is a very delicate but important difference between eye movements and
other body movements. When eyes rotate in their orbits, the image is not distorted 5.
We are simply suggesting that eye movements are not means of seeing things in the
world from another angle of view or achieving more information about objects by means
of motion parallax. Instead they seem to change the gaze direction and bring the point
of interest to the fovea. Smaller saccadic movements are also assumed to be means of
correspondence detection 6.
Naturally, the control strategy is dependent on the DOFs and the construction scheme.
The KTH-head is designed to cope with the two different movements of the model dis-
cussed earlier. By isolating the individual motor processes and allowing them to com-
municate via c o m m o n processes, they can be synchronized and communicate with one
another.

4.1 Performance data

In order to give a better idea about what performance the KTH-head has, some data
about it, is briefly presented here:

- General data:
Total number of motors: 15
Total number of degrees of freedom: 13
Number of mechanical degrees of freedom in each eye: 2
Number of optical degrees of freedom in each eye: 3
Number of degrees of freedom in neck: 2
Number of degrees of freedom in the base-line: 1
Top speed on rotational axes (when all motors run in parallel): 180 deg/s
Resolution on eye and neck axes: 0.0072 deg
Resolution on the base-line: 20 µm
Repeatability on mechanical axes: virtually perfect
4 By earlier constructions, we mean those like [Brn88] which basically allow separate eye move-
ments and thereby asymmetric vergence, and not other constructions like [ClF88] and [Kkv87]
which despite their flexibilities follow a strategy based on symmetric vergence.
5 For rotations up to almost 20 degrees, i.e. normal movements, human eyes rotate about a
specific center. Deviations from this are observed for larger angles. For rotations smaller than
20 degrees, the image is not distorted [Ybs67, Jul71].
6 Or maybe these are only corrective oscillations generated by eye muscles. There is evidence
showing that micro saccades and tremors in eyes disappear when the eye balls are not in a slip-
stick friction contact with the orbit [Gal91].

Min/max length of the base-line: 150/400 mm
Min/max focal length: 12.5/75 mm
Weight including the neck module: about 15 kg
Weight excluding the neck module: about 7 kg

- Motors:
7 5-phase stepper motors on the mechanical axes
2 4-phase stepper motors for keeping the optical center in place
6 DC motors on optical axes, 3 on each lens

In these experiments only one single transputer for indexing and controlling the mo-
tors and one transputer-based frame-grabber for primary control processes were used.
For the purpose of extending the primary ocular processes, a network of 11 transputers
has recently been installed.
The details of the design and motivations can be found in [PaE90, PaE92].

5 Implementation issues

Presently, our implementation differs somewhat from the model. This is because we so
far lack a cyclopic representation and have to refer to left/right image depending on the
task in question. In addition, we have images with uniform resolution and no foveated
sensor.
Prior to integration, we implemented a set of primary ocular processes which run
continuously. These are:

- a vergence process based on correlation of the region of interest in the dominant eye
along the associated epipolar line (band) in the other image. By default this process
scans the whole epipolar line and outputs numeric values as a criterion for how good
the match was in each instance. This process, like all other disparity detecting or
matching algorithms, suffers from the problem of repetitive patterns in the scene.
Although we have difficulty making it fail, without the help of focus it does fail on
repetitive patterns like checkerboards and parallel bars. The process runs in real
time at 25 Hz on synchronized cameras.
- a focusing process. This process keeps the foveal image of the eye in question in
focus. That is, it focuses on the vicinity of the center of the image. This process is
implemented in several different ways. The gradient magnitude maximization algorithm
discussed later (the Tenengrad criterion [Kkv87]; see the sketch after this list) turns out to be the best one.
- a stabilizing process that keeps track of the foveal image of the eye in question and fits
the motor position so that the image is stable in the center of the image. The search is
done down the steepest descent. This process engages an α-β tracker (see [DeF90]) for
the motion prediction. The result is a very smooth real-time stabilization 7, managing
a speed of up to 20 deg/s.
The particular method for performing the optimization in vergence and focusing is
not our concern here. We have in fact also implemented other methods, e.g. using
the cepstral filter. The end results seem to be quite similar. The spatial approach
7 We try not to use the word tracking here. Later on, we will also talk about a tracker for very
fast moving objects. As a trade-off, the stabilizer is optimized for the pattern rather than the
speed of the pattern.

has some advantages in our case, when a precise localization is required. Methods
based on e.g. bandpass filter techniques are presently beyond what we can compute
at video rate.
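For illustration, a minimal sketch of the Tenengrad (gradient-magnitude) sharpness criterion mentioned in the focusing item above follows; the window, the optional threshold and the hand-rolled Sobel correlation are our choices, not the system's actual code.

```python
import numpy as np

def tenengrad(image, threshold=0.0):
    """Tenengrad sharpness: sum of squared Sobel gradient magnitudes.

    The focusing process would evaluate this on the foveal window for a
    range of focus-motor positions and keep the position that maximizes it.
    """
    img = image.astype(float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    # plain 'valid' 2-D correlation, written out to avoid extra dependencies
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            patch = img[dy:dy + h - 2, dx:dx + w - 2]
            gx += kx[dy, dx] * patch
            gy += ky[dy, dx] * patch
    g2 = gx ** 2 + gy ** 2
    return np.sum(g2[g2 > threshold])

# a sharp pattern scores higher than a blurred version of itself
yy, xx = np.mgrid[0:64, 0:64]
sharp = ((xx // 8 + yy // 8) % 2).astype(float)      # checkerboard
blurred = 0.25 * (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
                  + np.roll(np.roll(sharp, 1, 0), 1, 1))
print(tenengrad(sharp) > tenengrad(blurred))         # True
```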
The processes described do not mean too much if they are not put together as cooper-
ating processes compensating for each other's shortcomings. The three processes build a
circular loop, where each process interacts with the two others in the ring

focus <-> vergence <-> stabilization <-> focus

and in this way confirm or reject the incoming signals. At the "vergence" node, the left
ring and the right ring (see also Figure 2) are joined and this very node sends the action
command to the motor control process.
Figure 2 illustrates how these processes communicate with each other. The vergence
process, here, has a coordinative task and is the intersection of the two separate sets of
processes on the left and right sides respectively.


Fig. 2. Process configuration in the implementation. The meeting point for the processes dedi-
cated to the left and the right eye (the two rings) is the vergence process.

5.1 Vergence coordinating stabilization processes


The difficulties for the vergence process begin when the two stabilizing processes yield
two completely different directions for the cyclopic axis. This happens when one of the
stabilizers has lost track of the pattern it was stabilizing on, while the other one has not.
In actual fact, the vergence process consists of two concurrent subprocesses:
- A process searching for the right foveal image along the associated epipolar band in
the left image. If the best match corresponds to the position suggested by the right
stabilizer, then, the detected position from both stabilizers is correct. Otherwise, the
best match for the stabilizer pattern, between the one found on the epipolar band
and the one found by the stabilizer is taken to be correct.
533

- A similar process searching for the left foveal image along the associated epipolar
band in the right image. The confirmation procedure is also similar to the other
process.

If none of the processes succeed, then the track of the pattern is lost; the object has gone
outside the common stabilization area of the two eyes.
Although moving objects could seem troublesome, they have their own advantages.
Moving objects trigger the stabilization processes in both eyes, so that a new binocular
position for fixation will be suggested by these processes. The matching task of the ver-
gence process is then simplified. Contrary to vergence on static objects, the vergence
process has a binocular cue to judge on. This cooperation is however not implemented
yet, so it is too early to speculate about its applicability.
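A schematic sketch of the confirmation logic described above follows, using a sum-of-squared-differences match instead of the correlation measure actually used; the function names, the agreement tolerance and the toy data are assumptions.

```python
import numpy as np

def best_match_along_band(band, patch):
    """Position of the best (sum-of-squared-differences) match of `patch`
    along a horizontal epipolar band of the same height."""
    ph, pw = patch.shape
    scores = [np.sum((band[:, x:x + pw] - patch) ** 2)
              for x in range(band.shape[1] - pw + 1)]
    return int(np.argmin(scores)), float(min(scores))

def confirm_stabilizer(band_left, patch_right, x_from_stabilizer, tol=2):
    """Accept the stabilizer's position if it agrees with the epipolar search,
    otherwise fall back on whichever candidate matches the pattern better."""
    x_search, score_search = best_match_along_band(band_left, patch_right)
    if abs(x_search - x_from_stabilizer) <= tol:
        return x_from_stabilizer                      # both cues agree
    pw = patch_right.shape[1]
    window = band_left[:, x_from_stabilizer:x_from_stabilizer + pw]
    score_stab = np.sum((window - patch_right) ** 2)
    return x_from_stabilizer if score_stab < score_search else x_search

# toy usage: the right foveal patch is actually hidden at x = 40 of the left band
rng = np.random.default_rng(2)
band = rng.standard_normal((9, 128))
patch = band[:, 40:49].copy()
print(confirm_stabilizer(band, patch, x_from_stabilizer=80))   # -> 40
```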

5.2 Cooperative vergence-focusing

Vergence on retinally stabilized objects is probably the most complicated process among
all primary processes. In primates, as mentioned earlier, it is often accompanied by a
version 8.
Figure 3 illustrates the fixation model implemented. Having a point selected in the
dominant image, both cameras do a saccade along the Vieth-Müller circle, so that the
point is transferred to the center of its image. The focusing process is then activated on
the dominant image. Under the accommodation process the other camera fits its focusing
distance to that of the dominant camera at the time. At the focused point a search
for the best match is triggered in the area defined by the depth of focus.
For two reasons, this scheme is not optimal. First of all it is not quantitatively optimal,
because the dominance of the left or the right eye is not caused by the relative distance
of the selected point to the initial fixation point. That is, the dominance is decided
beforehand and the selected point in the dominant eye could be nearer to the center
of the dominant image, which in turn results in a smaller saccade. Since saccades stand
for the fastest rotation of the optical axes, the optimal movement would be doing the
largest possible saccade. Secondly, it is not qualitatively optimal, because it is always
accommodation on the dominant image which steers the accommodation on the other
image. This means that a potential cooperation between the two processes is lost.
The common root of all these shortcomings is the lack of a cyclopic representation.
Therefore our next stage of development will start with defining and implementing an
efficient common representation for both eyes, where the dominance areas are defined by
the rivalry of the two eyes.
Let's go back to our present work and consider the implemented system. We have the
following situation:

- We have chosen a large piece of wallpaper with a checkerboard pattern and placed
it so that it is viewed frontally.
- The real distance to the wallpaper is 2215 mm.

8 We did not mention pure saccades or versions among our implemented primary processes,
though we use them. The reason is simply the fact that we do not yet have a cyclopic repre-
sentation in our system. The saccades, under these circumstances, cannot be represented by a
process. The process would not have much to do, other than sending a command to the motor
control process to turn both eyes by the same amount that the retinal displacement of the
dominant eye requires. In the existing implementation, the focusing process alone decides if
the destination point is inside the zero-disparity circle, or outside it.


Fig. 3. The implemented model of vergence in cooperation with accommodation. The amount
of the saccadic motion is always decided by the dominant eye (the eye where the new fixation
point is chosen). That is, the amplitude of the fast saccadic motion is not always optimal.

Fig. 4. The repetitive pattern without the band limits of accommodation. The band (top). The
pattern square superimposed on the best match (bottom). The match here, represented by the
least value, is false.


Fig. 5. The evaluation function for matching. A good match here is represented by minimum
points. As shown in the figure, without the dynamic constraints from the focusing process, there
are multiple solutions to the problem. The output match here is false.

- The size of each square on the wallpaper is 400 mm².
- The focal length of each lens is 20 mm.
- The length of the base-line is 200 mm.

Figure 4 illustrates one sequence of vergence without focusing information. The image
is the repetitive pattern and the task is to find the pattern in the right eye, corresponding
to the pattern selected from the left eye (the square on the bottom stripe), by a search
along the epipolar band (the top stripe) in the left eye.
As expected and illustrated in Figure 5, the algorithm finds many candidates for
a good match (the minimum points). The problem here is that the best match is actually
not at all the real corresponding pattern on the left one, though from a comparative
point of view, they are identical.
Figure 6 illustrates the result of the cooperation with the focusing process. The fo-
cusing sharpness puts limit constraints on the epipolar band. The band is clearly shorter,
and as illustrated in Figure 7, there is, in this band, only one minimum point, yielding a
unique correspondence.
There is still a possibility of repetitive structures in the area defined by the depth of
focus. But this probability is very small and, for a given window size, dependent on the
period of the structure and the field of view of the objective.
Figure 8 (top) illustrates a case where the distance to the object is increased. The
frequency of the pattern is higher than the earlier example, while the depth of focus has
become larger. In this case, the vergence process is still matching erroneously (at the
point with the pixel disparity of -14). Figure 8 (bottom) illustrates the same case with
focus cooperation.
It can be observed that, although the cooperation provides limits for the matching
interval and the matching process actually finds the right minimum, it still contains two
minimum points, i.e. a potential source of error. The problem here is caused by the large
depth of focus when the image is at a longer distance 9. Note that it actually is not the
frequency which is important here; it is the distance to the pattern. An erroneous fixation
requires exactly similar patterns with small periods and a long distance.

Fig. 6. The repetitive pattern with the band limits provided by accommodation. The band
(left). The pattern square superimposed on the best match (right); here the correct match.

In order to demonstrate how the cooperative vergence algorithm performs, we illus-
trate a pair of stereo images in Figure 9 with their epipolar bands. Figure 10 demonstrates
9 There is a relevant point to note here. The human fovea has a very small field of view and a
very small depth of focus. Especially at near distances, where the whole field of vision can be
covered by a pattern (e.g. wall paper) and peripheral cues can offer no help, the focusing cue
becomes very important.

[Plot: evaluation value versus displacement in pixels over the accommodation-limited band.]

Fig. 7. The evaluation function for matching. A good match here is represented by the minimum
point. Compare with the curve obtained without the accommodation limits (Figure 5).

their evaluation functions.

5.3 Stabilization gives focusing a hand

In order for focusing to recognize the sharpness in the center of the image, it must be
capable of handling small movements caused by the object or the subject itself. The
stabilizer, here, gives focusing a hand to focus on the same pattern even if the object is
shaking. The focusing process always gets its image through the stabilizer, which already
has decided which part of the image is the part that the focusing process is interested
in. Stabilization has a great effect on focusing on dynamic objects. Figure 11 illustrates
the effect of stabilization on focusing on a real object.
At this stage the loop is finally closed. In practice, focusing is a very important cue
for active exploration of the world. Here, we suggest that this cue can also be used in eye
coordination and fixation.

6 Ongoing work

Besides the work on implementing a cyclopean representation of the binocular rivalry, we
are working on the figure-ground problem by combining foveal vision and cyclopic infor-
mation. Another interesting line of work is tracking moving objects on a moving background.
For this purpose we have already designed a tracker which can easily follow very fast
moving objects, like a falling lamp, in real time. Presently, this is implemented as
a control algorithm engaging simple real-time vision. The algorithm also involves neck
movements in coordination with the eye tracking. Currently our system is running an α-β
tracker at a video rate of 50 Hz and a top speed of 180 degrees/s on the eye and neck¹⁰ axes
on a single transputer. This algorithm is best demonstrated live or on video tape and we
refrain from illustrating it in the paper.

10 That is, the combined speed of the eyes and neck can amount to 360 degrees/s.

[Two plots: evaluation value versus displacement in pixels, for the full epipolar band (top) and for the accommodation-limited band (bottom).]

Fig. 8. The graph at the top illustrates the result of matching along the epipolar band. The one
at the bottom illustrates the area confirmed by accommodation. The one at the bottom, in spite
of the two minima, detects a correct fixation. However, a potential source of false matching (the
second minimum) exists. The focal parameters are by no means ideal here; the angle of view is
larger than 20 degrees. The displacement is defined to be 0 at the center of the image. Note,
however, that the cameras are at different positions in the two cases. Hence the displacements
do not directly relate.

The work on focusing is also being pursued in several directions. Focusing is basically a
well-posed problem and has a lot to offer other processes to overcome their ambiguities.
The major purpose of the work on focusing is therefore to make this cooperation and
well-posedness applicable.

7 Conclusion

In biological vision, fault-tolerance and high performance are obtained through indepen-
dent processes often doing the same task, which are functional and stable under different

Fig. 9. The left and right images of a stereo pair (top). The point marked with a cross (+) in the
left image, is the next fixation point. The epipolar band and the superimposed correlation square
on it are also shown. In this case, the matching process alone could manage the fixation problem
(the stripe at the bottom). The cooperation has, however, shrunk the search area drastically (the
stripe in the middle).

conditions. Hence, complex behaviors arise from the cooperation and competition of such
primary processes.
This principle has generally been overlooked in computer vision research, where visual
tasks are often considered in isolation, without regard to information obtainable from
other types of processing. Before the recent interest in "active" or "animate" visual
perception, the importance of having visual feedback from the environment was also
seldom appreciated. We contend that such feedback is essential, implying that processing
should occur in (real) time and that a look at the world offers more information than a
look at prerecorded images.

[Two plots: evaluation value versus displacement in pixels, for the full epipolar band (top) and for the accommodation-limited interval (bottom).]

Fig. 10. The graph at the top illustrates the result of matching along the epipolar band. The
one at the bottom illustrates the interval of search selected when accommodation is performed.
See also the note in the caption of Figure 8.

In this paper, we have presented work on controlling eye movements on the basis
of these principles. We have abstractly modeled the oculomotor behavior of what we
call the active-reactive visuomotor system as divided into two motions: open-loop and
closed-loop. We also argued that the processes engaged in the closed-loop action should be
hard-wired to one another, so that they can compensate for each other's shortcomings.
This cooperative model has been applied to the tasks of vergence and pursuit, and
in an implementation on our head-eye system, it has been shown to be very powerful.
As a side-effect, a simple and robust method for gaze control in a system allowing such
independent processes is obtained. The design of the KTH-head, motivated by the general
philosophy presented above, is also briefly described. This gives both a background to the
experiments and a demonstration of how the general principle of independent cooperative
and competitive processes can be realized.

[Plot: normalized focus evaluation value versus focus ring position.]

Fig. 11. The evaluation function for focusing without stabilization (dashed curve), and with
stabilization (solid curve). Stabilization results in a smooth curve free from local minima.

8 Appendix: algorithms

We have deliberately downplayed the role of the algorithms used. In fact there are many
possible approaches that could work.
The focusing algorithm used here is the tenengrad algorithm described by [Kkv87]:

max Σ_x Σ_y S(x, y)², for S(x, y) > T,

where

S(x, y) = √((i_x ∗ I(x, y))² + (i_y ∗ I(x, y))²),

i_x and i_y are the convolution kernels (e.g. for a Sobel operator) applied to the image I, and T is a threshold.
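As a concrete illustration, here is a minimal sketch of this criterion in Python with NumPy/SciPy (the illustration language used for all code sketches in this rewrite); the Sobel kernels and the array names are our choices, not a reproduction of [Kkv87]'s implementation.

    import numpy as np
    from scipy.ndimage import convolve

    def tenengrad(image, T=0.0):
        """Tenengrad sharpness: sum of squared gradient magnitudes S(x, y)
        over the pixels where S(x, y) exceeds the threshold T."""
        ix = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # i_x (Sobel)
        iy = ix.T                                                         # i_y
        gx = convolve(image.astype(float), ix, mode="nearest")
        gy = convolve(image.astype(float), iy, mode="nearest")
        s = np.sqrt(gx ** 2 + gy ** 2)                                    # S(x, y)
        return np.sum(s[s > T] ** 2)

The focusing process would evaluate such a measure at successive focus ring positions and keep the position that maximizes it.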
The matching process is based on minimizing the sum of squared differences of the
two image functions:

min Σ_i Σ_j (f(i, j) − g(i + Δi, j + Δj))²,  Δi, Δj ∈ A,

where A is the foveal window. The search is done for the best match of the foveal window f along the small region
g on the epipolar band, defined by the depth of focus. This band has a specific height
which defines the degree of tolerance for the vertical disparity.
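A brute-force version of this search, sketched in Python (the names f and band come from the text; the implementation details are our assumptions):

    import numpy as np

    def match_along_band(f, band):
        """Slide the foveal window f over the epipolar band and return the offset
        (row, col) of the minimum sum-of-squared-differences match."""
        fh, fw = f.shape
        bh, bw = band.shape
        best_ssd, best_off = np.inf, (0, 0)
        for di in range(bh - fh + 1):          # vertical slack = band height
            for dj in range(bw - fw + 1):      # horizontal search along the band
                patch = band[di:di + fh, dj:dj + fw]
                ssd = np.sum((f.astype(float) - patch.astype(float)) ** 2)
                if ssd < best_ssd:
                    best_ssd, best_off = ssd, (di, dj)
        return best_off, best_ssd

With the accommodation limits in place, band is simply a shorter strip, which is what removes the spurious minima of Figure 5.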
The same simple algorithm, in combination with the steepest descent method, is used
for stabilization. Stabilization is a search for the most recently detected pattern in a
square area around the center of the image. For efficiency reasons, however, the search is
only performed along the direction of steepest descent.
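A minimal sketch of that greedy search (reusing the SSD of the previous sketch; the 4-neighbourhood and the stopping rule are our simplifications, not the authors' exact procedure):

    import numpy as np

    def stabilize(pattern, image, start):
        """Follow the locally steepest descent of the SSD from the previous
        pattern position until no 4-neighbour improves the match."""
        ph, pw = pattern.shape

        def ssd_at(i, j):
            window = image[i:i + ph, j:j + pw]
            if window.shape != pattern.shape:
                return np.inf                       # outside the image
            return np.sum((pattern.astype(float) - window.astype(float)) ** 2)

        (i, j), best = start, ssd_at(*start)
        while True:
            candidates = [(ssd_at(i - 1, j), (i - 1, j)), (ssd_at(i + 1, j), (i + 1, j)),
                          (ssd_at(i, j - 1), (i, j - 1)), (ssd_at(i, j + 1), (i, j + 1))]
            c_best, c_pos = min(candidates)
            if c_best >= best:                      # local minimum: pattern re-found
                return (i, j)
            (i, j), best = c_pos, c_best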
It is rather surprising how well these algorithms work, in spite of their simplicity!
This can be explained by two reasons:

- real-time vision is much less sensitive to relative temporal changes in the scene. The
correlations are for example less sensitive to plastic or elastic changes of the object,
smooth changes in lighting conditions, etc.
- cooperation of several processes gives vision a chance to judge the situation by several
objective criteria, rather than rigid constraints.

In all cases the input images have been noisy, non-filtered ones. No edge extraction
or similar low-level image processing operations have preceded the algorithms.

References

[AlK85] M. A. Ali, M. A. Klyne. Vision in Vertebrates, Plenum Press, New York, 1985
[BaF91] S. T. Barnard, M. A. Fischler. Computational and Biological Models of Stereo Vision,
to appear in the Wiley Encyclopedia of Artificial Intelligence (2nd edition), 1991
[Brn88] C. Brown. The Rochester Robot, Tech. Rep., Univ. of Rochester, 1988
[ClF88] J. J. Clark, N. J. Ferrier. Modal Control of an Attentive Vision System, Proc. of the
2nd ICCV, Tarpon Springs, FL, 1988
[Col91] H. Collewijn. Binocular Coordination of Saccadic Gaze Shifts: Plasticity in Time and
Space, Sixth European Conference on Eye Movements, Leuven, Belgium, 1991
[DeF90] R. Deriche, O. Faugeras. Tracking Line Segments, Proc. of the 1st ECCV, Antibes,
France, 1990
[Gal91] V. R. Galoyan. Hydrobiomechanical Model of Eye Placing and Movements, Sixth Euro-
pean Conference on Eye Movements, Leuven, Belgium, 1991
[Hub88] D. H. Hubel. Eye, Brain, and Vision, Scientific American Library, 1988
[Jen91] M. R. M. Jenkin. Using Stereo Motion to Track Binocular Targets, Proc. of CVPR,
Lahaina, Hawaii, 1991
[Jul71] B. Julesz. Foundations of Cyclopean Perception, The University of Chicago Press, 1971
[Kkv87] E. P. Krotkov. Exploratory Visual Sensing for Determining Spatial Layout with an Agile
Stereo System, PhD thesis, 1987
[PaE90] K. Pahlavan, J. O. Eklundh. A Head-Eye System for Active, Purposive Computer Vi-
sion, TRITA-NA-P9031, KTH, Stockholm, Sweden, 1990
[PaE92] K. Pahlavan, J. O. Eklundh. Head, Eyes and Head-Eye Systems, SPIE Machine and
Robotics Conference, Florida, 1992 (to appear)
[Ybs67] A. Yarbus. Eye Movements and Vision, Plenum Press, New York, 1967

This article was processed using the LaTeX macro package with ECCV92 style
Where to Look Next Using a Bayes Net:
Incorporating Geometric Relations *

Raymond D. Rimey and Christopher M. Brown

The University of Rochester, Computer Science Department, Rochester, New York 14627, USA

Abstract.
A task-oriented system is one that performs the minimum effort neces-
sary to solve a specified task. Depending on the task, the system decides
which information to gather, which operators to use at which resolution,
and where to apply them. We have been developing the basic framework of
a task-oriented computer vision system, called TEA, that uses Bayes nets
and a maximum expected utility decision rule. In this paper we present a
method for incorporating geometric relations into a Bayes net, and then
show how relational knowledge and evidence enables a task-oriented system
to restrict visual processing to particular areas of a scene by making camera
movements and by only processing a portion of the data in an image.

1 Introduction

An important component in an active vision system is a spatially-varying sensor that


can be pointed in space (using a pan-tilt platform) to selectively view a scene. Thus the
system cannot view the entire scene at once. We assume the sensor provides a peripheral
image that is a low-resolution image of the entire field of view from one camera angle,
and a fovea that is a small high-resolution image that can be selectively moved within the
field of view. Spatially-varying sensors can be constructed in many ways: using special
sensor array chips, two cameras with different focal lengths, or programmed in software.
The main reason for using a pointable spatially-varying sensor is the computational
advantage it affords. Only a portion of the scene is imaged (and analyzed) at a time
and even then only a portion of the potential image data is used. However, in exchange
for this advantage a new problem is introduced: deciding where to point the camera (or
fovea) and also what visual operations to run.
Our approach to this problem uses Bayes nets and a maximum expected utility de-
cision rule. Bayes nets encode prior knowledge and incorporate visual evidence as it is
gathered. The decision rule chooses where to point the camera (or fovea) and what visual
operators to run.
This paper presents expected area nets, a method for incorporating geometric rela-
tions into a Bayes net, and shows how they can be used to restrict visual processing to
particular areas of a scene. Section 2 summarizes our overall system, called TEA-l, and
Section 3 presents the expected area net in detail. Section 4 explains how TEA-1 uses the
expected area net: 1) to move cameras, 2) to create and use masks that process only a
portion of an image, and 3) to make decisions while considering relational and location
information. Experimental results are presented in Section 5. Section 6 contains some
concluding remarks.
* This material is based upon work supported by the National Science Foundation under Grants
numbered IRI-8920771 and IRI-8903582. The Government has certain rights in this material.

2 TEA-1: A Framework for Studying Task-Oriented Vision

This section summarizes the TEA-1 system, our second implementation of TEA, a general
framework of a task-oriented computer vision system. The reader is referred to [10] for
a detailed description of TEA-1. Earlier work involving TEA-0 and TEA-1 appears in
[8, 9].
Main Control Loop. In TEA, a task is to answer a question about the scene: Where
is the butter? Is this breakfast, lunch, dinner, or dessert? We are particularly interested
in more qualitative tasks: Is this an informal or fancy meal? How far has the eating
progressed? (Our example domain is table settings.) The TEA system gathers evidence
visually and incorporates it into a Bayes net until the question can be answered to a
desired degree of confidence. TEA runs by iteratively selecting the evidence gathering
action that maximizes an expected utility criterion involving the cost of the action and
its benefits of increased certainties in the net: 1) List all the executable actions. 2) Select
the action with highest expected utility. 3) Execute that action. 4) Attach the resulting
evidence to the Bayes net and propagate its influence. 5) Repeat, until the task is solved.
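The five steps can be phrased as a short control loop. The sketch below is a hypothetical Python skeleton, not TEA-1's actual interface: precondition, expected_utility, execute, attach_and_propagate and task_solved are placeholder names.

    def tea_control_loop(net, actions, task_solved):
        """Skeleton of the TEA evidence-gathering loop (hypothetical interface)."""
        while not task_solved(net):
            executable = [a for a in actions if a.precondition(net)]       # 1) list actions
            best = max(executable, key=lambda a: a.expected_utility(net))  # 2) highest E[U]
            evidence = best.execute()                                      # 3) run the action
            net.attach_and_propagate(evidence)                             # 4) update beliefs
            # 5) the loop repeats until the task question is answered with
            #    the desired degree of confidence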
Bayes Nets. Nodes in a Bayes net represent random variables with (usually) a
discrete set of values (e.g. a utensil node could have values (knife, fork, spoon)). Links
in the net represent (via tables) conditional probabilities that a node has a particular
value given that an adjacent node has a particular value. Belief in the values for node
X is defined as BEL(x) = P(x | e), where e is the combination of all evidence present
in the net. Evidence, produced by running a visual action, directly supports the possible
values of a particular node (i.e. variable) in the net. There exist a number of evidence
propagation algorithms, which recompute belief values for all nodes given one new piece
of evidence. Several references provide good introductions to the Bayes net model and
associated algorithms, e.g. [2, 5, 7].
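As a minimal numerical illustration of BEL for a single node and a single piece of evidence (the numbers are invented, not taken from TEA-1):

    import numpy as np

    values = ("knife", "fork", "spoon")
    prior = np.array([0.2, 0.5, 0.3])        # P(utensil) before any visual evidence
    likelihood = np.array([0.7, 0.2, 0.1])   # P(e | utensil) reported by a visual action

    bel = prior * likelihood                 # Bayes rule, unnormalized
    bel /= bel.sum()                         # BEL(utensil) = P(utensil | e)
    print(dict(zip(values, bel.round(3))))   # knife 0.519, fork 0.370, spoon 0.111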
Composite Bayes Net. TEA-1 uses a composite net, a method for structuring
knowledge into several separate Bayes nets [10]. A PART-OF net models subpart relation-
ships between objects and whether an object is present in the scene or not. An expected
area net models geometric relations between objects and the location of each object.
Section 3 presents the expected area net in detail. Associated with each object is an IS-A
tree, a taxonomic hierarchy modeling one random variable that has many mutually ex-
clusive values [7]. Task specific knowledge is contained in a task net. There is one task
net for each task, for example "Is this a fancy meal?", that TEA-1 can solve. Each of
the separate nets in the composite net, except the task net, maintains its BEL values
independently of the other nets. Evidence in the other nets affects the task net through a
mechanism called packages, which updates values in evidence nodes in the task net using
copies of belief values in the other nets.
Actions. TEA-1 uses the following description of an action:

- Precondition. The precondition must be satisfied before the action can be executed.
There are four types of precondition: that a particular node in the expected area net
be instantiated, that it not be instantiated, that it be instantiated and within the
field of view for the current camera position, and the empty precondition.
- Function. A function is called to execute the action. All actions are constructed from
one or more low-level vision modules, process either foveal image or peripheral image
data, and may first move the camera or fovea.
- Adding evidence. An action may add evidence to several nets and may do so in several
ways (see [7]): 1) A chance node can be changed to a dummy node, representing
virtual or judgemental evidence bearing on its parent node. 2) A chance node can

be instantiated to a specific value. Object locations get instantiated in the expected


area net. 3) Evidence weight can be added to an IS-A type of net.

Each kind of object usually has several actions associated with it. TEA-1 currently has
20 actions related to 7 objects. For example, the actions related to plates are: The
per-detect-template-plate action moves the camera to a specified position and uses a
model grayscale template to detect the presence and location of a plate in the peripheral
image. Per-detect-hough-plate uses a Hough transform for plate-sized circles for the
same purpose. Per-classify-plate moves the camera to a specified position, centers a
window in the peripheral image there, and uses a color histogram to classify that area
as paper or ceramic. Fov-classify-plate moves the fovea (but not the camera) to a
specified location and uses a color histogram to classify the area as paper or ceramic.
Calculating an Action's Utility. The utility U(α) of an action α is fundamentally
modeled as U(α) = V(α)/C(α), a ratio of value V(α) and cost C(α). The value of an
action, how useful it is toward the task, is based on Shannon's measure of average
mutual information, V(α) = I(T, e_α), where T is the variable representing the goal of
the task and e_α is the combination of all the evidence added to the composite net by
action α. An action's cost is its execution time. The exact forms of the cost and utility
functions depend on the expected area net and will be given in Section 4.
An important feature of the TEA-1 design is that a different task net is plugged into
the composite net for each task the system is able to solve. The calculation of an action's
value depends on the task net. Thus the action utilities directly reflect the information
needs of the specific task, and produce a pattern of camera and fovea movements and
visual operations that is unique to the task.

3 An Expected Area (Object Location) Bayes Net

Geometric relations between objects are modeled by an expected area net. The expected
area net and PART-0F net have the same structure: A node in the PART-0F net identifies a
particular object within the sub-part structure of the scene, and the corresponding node
in the expected area net identifies the area in the scene in which that object is expected
to be located. Fig. 1 shows the structure of one example of an expected area net.
In TEA-1 we assume a fixed camera origin. The location of an object in the scene is
specified by the two camera angles, Θ = (θ_pan, θ_tilt), that would cause the object to be
centered in the visual field. The height and width of an object's image are also specified
using camera angles.
Thus a node in the expected area net represents a 2-D discrete random variable, Θ.
BEL(Θ) is a function on a discrete 2-D grid, with a high value corresponding to a scene
location at which the object is expected with high probability. Fig. 2(a)-(b) shows two
examples of expected areas. Note that these distributions are for the location of the
center of the object, and not areas of the scene that may contain any part of the object.
Each node also contains values for the height and width of the object. Initially these are
expected values, but once an object is located by a visual action the detected height and
width are stored instead. The height and width are not used in the belief calculation directly,
but will be used to calculate conditional probabilities on the links (see below).
A root node R of an expected area net has an a priori probability, P(Θ_R), which we
assume is given. A link from node A to node B has an associated conditional probability,
P(Θ_B | Θ_A). Given a reasonable discretization, say as a 32×32 grid, each conditional
probability table has just over a million entries. Such tables are unreasonable to specify

Fig. 1. The structure of an expected area net. The corresponding PART-OF net is similar.


Fig. 2. The expected area (a) for setting-area (a place setting area) before the location of any
other object is determined, and (b) for napkin after the location of the tabletop and plate have
been determined. The relation maps (c) for setting-area given tabletop, and (d) for napkin given
setting-area.

and cause the calculation of new belief values to be very slow. Next we present a way to
limit this problem.
Relation Maps Simplify Specification of Probabilities. We make the following
observations about the table of P(Θ_B | Θ_A) values: 1) The table is highly repetitious.
Specifically, ignoring edge effects, for every location of object A the distributions are
all the same if they are considered relative to the given location of object A. 2) Belief
calculations can be sped up by detecting terms that will have zero value. Therefore, rather
than calculate all values of the distribution, we should use a function to calculate selective
values. 3) The distribution depends on the size of object A. We assume the expected
height and width of object A's image are known, but whenever an action provides direct
observations of the object's dimensions those values should be used instead.
Our solution is to compute values of the conditional probabilities using a special
simplified distribution called a relation map. A relation map assumes that object A has
unity dimensions and is located at the origin. The relation map is scaled and shifted ap-
propriately to obtain values of the conditional probability. This calculation is performed
by a function that can be called to calculate select values of the conditional probability.
Note that the spatial resolution of the relation map grid can be less than that of the

expected area grid. Fig. 2(c)-(d) shows two examples of relation maps that were used in
the calculation of the two expected areas shown in Fig. 2(a)-(b).
Given an expected area grid that is 32×32 (N×N) and a relation map grid that is
16×16 (M×M), all the values for one link's conditional probability table can be obtained
by specifying only 256 (M²) values. The brute force approach would require that 1048576
(N⁴) values be specified.
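A sketch of that scale-and-shift evaluation in Python; the grid indexing, the centring convention and the nearest-cell splatting are our assumptions, since the paper does not spell out TEA-1's exact resampling.

    import numpy as np

    def conditional_prob(rel_map, theta_a, size_a, grid_shape):
        """Evaluate P(Theta_B | Theta_A = theta_a) over the expected-area grid by
        scaling the unit relation map by object A's dimensions and shifting it to
        A's location, then renormalizing."""
        out = np.zeros(grid_shape)
        ai, aj = theta_a                     # grid location of A's centre
        ah, aw = size_a                      # expected (or observed) height and width of A
        mh, mw = rel_map.shape
        for u in range(mh):
            for v in range(mw):
                # relation-map cell (u, v) is defined for a unit-sized A at the origin;
                # scale its offset by A's size and shift it to A's location
                i = ai + int(round((u - mh // 2) * ah))
                j = aj + int(round((v - mw // 2) * aw))
                if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
                    out[i, j] += rel_map[u, v]   # nearest-cell splatting
        total = out.sum()
        return out / total if total > 0 else out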
Speeding up Calculation of Belief Values. When the set of expected locations for
an object covers a relatively small area of the entire scene, the table of P(Θ_B | Θ_A) values
contains a large number of essentially zero values that can be used to speed up the belief
propagation computation. The equations for belief propagation (and our notation) can
be found in [7]. We do not give all the equations here for lack of space. The calculation
of new BEL(x) values for node X, with parent node U, contains two key equations:
π(x) = Σ_j P(x | u_j) π_X(u_j) and λ_X(u) = Σ_i P(x_i | u) λ(x_i). These summations involve
considerable time since x and u both denote a 2-D array (grid) of variables. Time can be
saved in the first equation by not summing a term (which is an array) when it is multiplied
by an essentially zero value. Specifically, for all j where π_X(u_j) is essentially zero, we do
not add the P(x | u_j) π_X(u_j) term (an array) into the summation. Similar savings can
be obtained in the second equation. For any given value of i, the P(x_i | u) λ(x_i) term
(an array) contains essentially zero values everywhere except for a few places (a small
window in the array). We locate that window and only perform the sum for values inside
the window.
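A sketch of the first of these prunings, on flattened grids; the essentially-zero threshold is our choice, and in TEA-1 the rows P(x | u_j) would be generated on the fly from the relation map rather than stored.

    import numpy as np

    def pi_message(P_x_given_u, pi_X_of_u, eps=1e-6):
        """pi(x) = sum_j P(x | u_j) * pi_X(u_j), skipping the array terms whose
        coefficient pi_X(u_j) is essentially zero."""
        pi_x = np.zeros(P_x_given_u.shape[1])
        for j, coeff in enumerate(pi_X_of_u):
            if coeff > eps:                  # prune essentially-zero terms
                pi_x += coeff * P_x_given_u[j]
        return pi_x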
Combining Location Information. The expected area for node B is actually cal-
culated not from a single node like node A, but by combining "messages" about expected
areas sent to it from its parent and all its children. This combination is performed within
the calculation of BEL(B). Generally, it is useful to characterize relations as "must-be",
"must-not-be" and "could-be". Combination of two "must-be" maps would then be by
intersection, and in general map combination would proceed by the obvious set-theoretic
operations corresponding to the inclusive or exclusive semantics of the relation. In TEA-1,
however, all the relations are "could-be", and the maps are essentially unioned by the
belief calculation.

4 Using expected areas

Moving cameras. Actions in TEA-1 that must move the camera to the expected loca-
tion of a specific (expected) object, say X, will move the camera to the center of mass of
the expected area for object X. (This happens even if the expected area, when thresh-
olded to a given confidence level, is larger than the camera's field of view. That case
could be handled by making several camera movements to cover the expected area.)
Processing Only a Portion of an Image. Every action related to a specific object
X processes only the portion of the image that is covered by the expected area of object
X, when thresholded to a given confidence level. Let l ∈ (0, 1) be the confidence level,
which usually will be chosen close to 1 (typically 0.9). Let G'_X be the smallest subset of
all the grid points G_X for node X (that corresponds with object X) in the expected area
net, such that their probabilities add up to l. G'_X is the portion of the scene that should
be analyzed by the action. Each action in TEA-1 creates a mask that corresponds to the
portion of the current image data (i.e. after a camera movement) that overlaps G'_X, and
processes only the image pixels that are covered by that mask.
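A sketch of how such a mask can be obtained from the BEL grid; the greedy cell selection is an obvious way to compute the smallest subset, but the paper does not spell out the procedure.

    import numpy as np

    def expected_area_mask(bel_grid, l=0.9):
        """Boolean mask over the expected-area grid: the smallest set of grid
        points whose probabilities sum to at least the confidence level l."""
        flat = bel_grid.ravel()
        order = np.argsort(flat)[::-1]                 # most probable cells first
        k = int(np.searchsorted(np.cumsum(flat[order]), l)) + 1
        mask = np.zeros(flat.shape, dtype=bool)
        mask[order[:k]] = True
        return mask.reshape(bel_grid.shape)

    # The ratio r'_X used in the cost model of this section is then
    # mask.sum() / mask.size.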
Deciding with Expected Areas. TEA-1's utility function for an action has the
following features: 1) Costs are proportional to the amount of image data processed.

2) It deals with peripheral actions that detect an object but don't otherwise generate
information for the task. 3) It considers that an action may have the impact of making
the expected areas of other objects smaller. Recall that the utility of an action α is
fundamentally modeled as a ratio of value V(α) (average mutual information) and cost
C(α) as explained near the end of Section 2.
An action α related to a specific object X has a cost proportional to the amount of
image data that it processes. Thus TEA-1 defines the cost as C(α) = r'_X C_0(α), where
C_0(α) is the execution time of action α if it processed a hypothetical image covering
the entire scene. r'_X is the ratio of the expected area for object X and the area of the
entire scene. So the value of r'_X is the size of the subset G'_X divided by the size of the
set G_X. r'_X = 1 means that object X could be located anywhere in the entire scene.
Over time, as other objects in the scene are located and as more and tighter relations
are established, the value of r'_X approaches zero. (Soon we will use a more accurate
cost function that has an additional term for the cost of moving the camera or fovea,
C(α) = C_move(α) + r'_X C_0(α).)
TEA-1 uses the following "lookahead" utility function U(α) for action α:

U(α) = (V(α) + V(β)) / (C(α) + C(β)) + H Σ_{X ∈ Net} ΔU(X)          (1)

where

β = argmax_{γ ∈ Pre(α)} V(γ)/C(γ),

ΔU(X) = max_{γ ∈ Actions(X)} [ V(γ)/(s'_X C_0(γ)) − V(γ)/(r'_X C_0(γ)) ].

The first term in equation (1) accounts for the future value of establishing the location of
an object. Pre(α) is the set of actions γ such that EITHER γ has a precondition satisfied
by executing action α OR γ is already executable and V(γ)/C(γ) < V(α)/C(α). The
second term in equation (1) accounts for the impact of making expected areas smaller so
that future actions will have lower costs. s'_X is like r'_X except it assumes that the location
of action α's associated object is known. H ∈ (0, 1) is a gain factor that specifies how
much to weigh the second term relative to the first term. See [7] and [10] for more details
about I and U, respectively.
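To make the bookkeeping concrete, the following Python sketch spells out our reading of the cost model and of equation (1) as reconstructed above. Everything about the interface is hypothetical: value, c0, obj, executable and precondition_satisfied_by are placeholder attributes, r_prime and s_prime map each object to its area ratios, and actions_of maps each object in the net to its related actions.

    def cost(action, r_prime):
        # C(alpha) = r'_X * C0(alpha): execution time scaled by the fraction of
        # the scene covered by the expected area of alpha's associated object X
        return r_prime[action.obj] * action.c0

    def lookahead_utility(alpha, actions, actions_of, r_prime, s_prime, H=0.5):
        """Equation (1) as we read it from the (partly garbled) original text."""
        def ratio(g):
            return g.value / cost(g, r_prime)
        # Pre(alpha): actions enabled by alpha, plus already-executable actions
        # whose value/cost ratio is below alpha's
        pre = [g for g in actions
               if g.precondition_satisfied_by(alpha)
               or (g.executable and ratio(g) < ratio(alpha))]
        beta = max(pre, key=ratio) if pre else None
        v_beta = beta.value if beta else 0.0
        c_beta = cost(beta, r_prime) if beta else 0.0
        first = (alpha.value + v_beta) / (cost(alpha, r_prime) + c_beta)
        # Delta U(X): cost saving for object X's best action once alpha's
        # associated object has been located
        second = sum(
            max(g.value / (s_prime[X] * g.c0) - g.value / (r_prime[X] * g.c0)
                for g in acts)
            for X, acts in actions_of.items())
        return first + H * second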

5 Experimental Results

A Basic Run of the System. The task of deciding whether a dinner table is set for a


fancy meal or for an informal meal was encoded in a task net, and TEA-1 was presented
the scene shown in Fig. 3, which shows a "fancy" meal. The sequence of actions executed
by TEA-1 is summarized by the table in Fig. 3. The a priori belief of the table setting
being fancy is 0.590, compared with 0.410 that it is informal. As the system executed
actions to gather specific information about the scene, the belief that the setting is
a fancy one approaches 0.974. The graphics on the right of the figure illustrate the
sequence of camera movements executed by the system. Fig. 4 illustrates the execution
of a few actions in the sequence, showing each action's results after any camera (or fovea)
movement has been made and the expected area mask has been applied.

time  U(α)   α, an action              BEL(i)
  0          a priori                  0.410
  1   10.0   table                     0.400
  2   10.5   per-detect-hough-cup      0.263
  3   42.8   per-classify-cup          0.343
  4   11.3   per-detect-hough-plate    0.340
  5   11.9   per-classify-plate        0.041
  6   29.9   per-detect-utensil        0.041
  7   58.8   per-classify-utensil      0.033
  8    4.3   per-detect-napkin         0.026
  9    3.3   fov-classify-cup          0.026
 10    2.4   fov-classify-plate        0.026
 11    1.7   per-detect-hough-bowl     0.026
 12    0.6   per-detect-butter         0.026
 13    0.4   fov-verify-butter         0.026

Fig. 3. The sequence of actions selected and executed by TEA-1 is shown in the table at left.
Each line corresponds to one cycle in the main control loop. The belief values listed are those
after incorporating the results from each action. The BEL(i) column shows BEL(informal),
and BEL(formal) = 1 − BEL(i). The path drawn on the wide-angle picture of the table scene
at the right illustrates the camera movements made in the action sequence.

Fig. 4. Processing performed by individual actions. Image pixels outside the expected area mask
are shown as gray values. (a) Results from the per-detect-hough-plate action executed at time
step 4. (b) Results from the per-detect-napkin action executed at time step 8. The mask prevents
the red napkin from being confused with the pink creamer container just above the plate. (c)
Results from the fov-classify-plate action executed at time step 10. A zoomed display of the fovea
centered on the plate is shown. Note: Fig. 5(b) shows results from the per-detect-hough-cup
action executed at time step 2.

Expected Areas Shrink Over Time. As more objects are located via actions,
the expected areas for the remaining objects (not yet located by actions) get nar-
rower. Assume that TEA-1 has located (in order) the tabletop, then the plate, and
finally the napkin. Fig. 5 shows how the cup's expected area gets narrower and how the
per-detect-hough-cup action would hypothetically perform after each additional object
is located. Parts (a) and (e) show the situation before any other objects have been lo-
cated. The expected area is rather large, much larger than the field of view. The camera
movement, made to the center of the cup's expected area, is much higher than the true

location of the cup, and the action mistakenly detects the creamer container as the cup.
The situation improves once the tabletop is located, as shown in parts (b) and (f). The
expected area is (almost) small enough to fit in the field of view and its center corre-
sponds better with the cup's actual location. A small portion of the image is masked out
by the expected area, and the cup is correctly detected, but this is just lucky since the
creamer and many other objects are still in the unmasked area. Parts (c) and (g) show
the situation after the plate has been located. The cup's expected area is much smaller.
Finally, in parts (d) and (h), once the napkin has been located, the cup's expected area
is small enough that the action is very likely to detect the cup correctly.

6 Concluding Remarks
Several people are investigating the use of Bayes nets and influence diagrams in sensing
problems. The most relevant work comes from two groups: Levitt's group was the first to
apply Bayes nets to computer vision [1, 6]. Dean's group is studying applications in sensor
based mobile robot control, using a special kind of influence diagram called a temporal
belief network (TBN) [3, 4]. More recently, they have used sensor data to maintain an
occupancy grid, which in turn affects link probabilities in the TBN.
The current TEA-1 system design, incorporating expected area nets, provides a frame-
work that enables the system to make decisions about moving a camera around and about
selectively gathering information. Thus we can begin using TEA-1 to study questions re-
garding task-oriented vision [8, 9, 10].
Deciding where to move a camera (or fovea) is an interesting problem. TEA-1 does
the simplest thing possible by moving to the center of the expected area of one object.
If several objects of interest should fall in the field of view, then it may for example be
better to move the camera to the center of that set of objects. In our experiments to
date, TEA-1 has relied mainly on camera movements to get the first piece of information
about an object, while fovea movements are mostly used for verification. This behavior
is determined by the costs and other parameters associated with actions. Another inter-
esting problem is to consider the tradeoffs between a camera and a fovea movement. A
camera movement is expensive and an action following one processes a completely new
area of the scene, which means there is risk of not finding anything, but if something
is found it will likely have large impact for the task. Alternatively, a fovea movement is
cheap but produces image data near an area already analyzed, so there is a good chance
of finding some new information, but it will tend to have a small impact on the task.

References
1. J. M. Agosta. The structure of Bayes networks for visual recognition. In Uncertainty in
AI, pages 397-405. North-Holland, 1990.
2. E. Charniak. Bayesian networks without tears. AI Magazine, 12(4):50-63, Winter 1991.
3. T. Dean, T. Camus, and J. Kirman. Sequential decision making for active perception. In
Proceedings: DARPA Image Understanding Workshop, pages 889-894, 1990.
4. T. L. Dean and M. P. Wellman. Planning and Control. Morgan Kaufmann, 1991.
5. M. Henrion, J. S. Breese, and E. J. Horvitz. Decision analysis and expert systems. AI
Magazine, 12(4):64-91, Winter 1991.
6. T. Levitt, T. Binford, G. Ettinger, and P. Gelband. Probability-based control for computer
vision. In Proceedings: DARPA Image Understanding Workshop, pages 355-369, 1989.
7. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, 1988.

Fig. 5. Performance of a cup detection action as the cup's expected area narrows over time:
(a) before any objects have been located, (b) after the tabletop has been located, (c) after the
tabletop and plate have been located, (d) after the tabletop, plate and napkin have been located.
The cup's expected areas in (a)-(d) are plotted separately in (e)-(h). These plots must be rotated
90 degrees clockwise to match the images in (a)-(d).

8. R. D. Rimey. Where to look next using a Bayes net: An overview. In Proceedings: DARPA
Image Understanding Workshop, 1992.
9. R. D. Rimey and C. M. Brown. Task-oriented vision with multiple Bayes nets. Technical
Report 398, Department of Computer Science, University of Rochester, November 1991.
10. R. D. Rimey and C. M. Brown. Task-oriented vision with multiple Bayes nets. In A. Blake
and A. Yuille, editors, Active Vision. MIT Press, 1992. Forthcoming.

This article was processed using the LaTeX macro package with ECCV92 style
An Attentional Prototype for Early Vision

Sean M. Culhane and John K. Tsotsos


Department of Computer Science, University of Toronto,
Toronto, Ontario, Canada M5S 1A4

Abstract.
Researchers have long argued that an attentional mechanism is required
to perform many vision tasks. This paper introduces an attentional pro-
totype for early visual processing. Our model is composed of a process-
ing hierarchy and an attention beam that traverses the hierarchy, passing
through the regions of greatest interest and inhibiting the regions that are
not relevant. The type of input to the prototype is not limited to visual
stimuli. Simulations using high-resolution digitized images were conducted,
with image intensity and edge information as inputs to the model. The re-
sults confirm that this prototype is both robust and fast, and promises to
be essential to any real-time vision system.

1 Introduction

Systems for computer vision are confronted with prodigious amounts of visual informa-
tion. They must locate and analyze only the information essential to the current task
and ignore the vast flow of irrelevant detail if any hope of real-time performance is to
be realized. Attention mechanisms support efficient, responsive analysis; they focus the
system's sensing and computing resources on selected areas of a scene and may rapidly
redirect these resources as the scene or task requirements evolve. Vision systems that have
no task guidance, and must provide a description of everything in the scene at a high
level of detail as opposed to searching and describing only a sub-image for a pre-specified
item, have been shown to be computationally intractable [16]. Thus, task guidance, or
attention, plays a critical role in a system that is hoped to function in real time. In short,
attention simplifies computation and reduces the amount of processing.
Computer vision models which incorporate parallel processing are prevalent in the
literature. This strategy appears appropriate for the vast amounts of input data that
must be processed at the low-level [4, 19]. However, complete parallelism is not possible
because it requires too many processors and connections [11, 17]. Instead, a balance
must be found between processor-intensive parallel techniques and time-intensive serial
techniques. One way to implement this compromise is to process all data in parallel at
the early stages of vision, and then to select only part of the available data for further
processing at later stages. Herein lies the role of attention: to tune the early visual input
by selecting a small portion of the visual stimuli to process.
This paper presents a prototype of an attentional mechanism for early visual process-
ing. The attention mechanism consists of a processing hierarchy and an attention beam
that guides selection. Most attention schemes previously proposed are fragile with respect
to the question of "scaling up" with the problem size. However, the model presented here
has been derived with a full regard of the amount of computation required. In addition,
this model provides all of the details necessary to construct a full implementation that

is fast and robust. Very few implemented models of attention exist. Of those, ours is
one of the first that performs well with general high-resolution images. Our implemented
attention beam may be used as an essential component in the building of a complete
real-time computer vision system.
Certain aspects of our model are not addressed in this investigation, such as the
implementation of task guidance in the attention scheme. Instead, emphasis is placed on
the bottom-up dimensions of the model that localize regions of interest in the input and
order these regions based on their importance.
The simulations presented in this paper reveal the potential of this attention scheme.
The speed and accuracy of our prototype are demonstrated by using actual 256 × 256
digitized images. The mechanism's input is not constrained to any particular form, and
can be any response from the visual stimuli. For the results presented, image intensity
and edge information are the only input used. For completeness, relationships to existing
computational models of visual attention are described.

2 Theoretical Framework

The structure of the attention model presented in this paper is determined in part by
several constraints derived from a computational complexity analysis of visual search
[17]. This complexity analysis quantitatively confirms that selective attention is a major
contributor to reducing the amount of computation in any vision system. Furthermore,
the proposed scheme is loosely modelled after the increasing neurophysiology literature
on single-cell recordings from the visual cortex of awake and active primates. Moreover,
the general architecture of this prototype is consistent with their neuroanatomy [17, 18].
At the most basic level, our prototype is comprised of a hierarchical representation
of the input stimuli and an attention mechanism that guides selection of portions of
the hierarchy from the highest, most abstract level through to the lowest level. Spatial
attentional influence is applied in a "spotlight" fashion at the top. The notion of a
spotlight appears in many other models such as that of Treisman [15]. However, if the
spotlight shines on a unit at the top of the hierarchy, there seems to be no mechanism
for the rest of the selection to actually proceed through to the desired items.
One way to solve this problem in a computer vision system is to simply address the
unit of interest. Such a solution works in the computer domain because computer memory
is random access. Unfortunately, there is no evidence for random access in the visual
cortex. Another possible solution is to simply connect all the units of interest directly.
This solution also fails to explain how the human visual cortex may function because the
number of such connections is prohibitive. For instance, to connect all possible receptive
fields to the units in a single 1000 x 1000 representation, 10¹⁵ connections are needed to
do so in a brute force manner¹. Given that the cortex contains 10¹⁰ neurons, with an
estimated total number of connections of 10¹³, this is clearly not how nature implements
access to high resolution representations.
The spotlight analogy is therefore insufficient, and instead we propose the idea of a
"beam" - something that illuminates and passes through the entire hierarchy. A beam is
required that "points" to a set of units at the top. That particular beam shines throughout
the processing hierarchy with an inhibit zone and a pass zone, such that the units in the
pass zone are the ones that are selected (see Fig. 1). The beam expands as it traverses
the hierarchy, covering all portions of the processing mechanism that directly contribute

1 See Tsotsos 1990 [17] for this derivation.



to the output at its point of entry at the top. At each level of the processing hierarchy,
a winner-take-all process (WTA) is used to reduce the competing set and to determine
the pass and inhibit zones [18].


Fig. 1. The inhibitory attentional beam concept. Several levels of the processing hierarchy are
shown. The pass zone of the beam encompasses all winning inputs at each level of the hierarchy.
The darkest beams represent the actual inhibit zones rooted at each level of the hierarchy. The
light-grey beam represents the effective inhibit zone rooted at the most abstract level.

3 The Attention Prototype

The proposed attention prototype consists of a set of hierarchical computations. The


mechanism does not rely on particular types of visual stimulus; the input only considers
the magnitude of the responses. Connectivity may vary between levels. Each unit com-
putes a weighted sum of the responses from its input at the level below. The weighted
response used in this paper is a simple average; but in general the distribution of weights
need not be uniform and may even be different at each level. Processing proceeds as
dictated by Algorithm 1. An inhibit zone and a pass zone are delineated for a beam that
"shines" through all levels of the hierarchy. The pass zone permeates the winners at each
level and the inhibit zone encompasses those elements at each level that competed in the
WTA process. This algorithm is similar to the basic idea proposed by Koch and Ullman
[5]. One important difference is that our scheme does not rely on a saliency map. Another
distinction is that we use a modified WTA update rule that allows for multiple winners
and does not attenuate the winning inputs². Also, the final stage of the algorithm is not
simply the routing of information as Koch and Ullman claim, but rather a recomputation
using only the stimuli that were found as "winners" at the input level of the hierarchy.
For illustrative purposes, the attention scheme is shown with a one-dimensional rep-
resentation and illustrated in Fig. 2; the extension to two dimensions is straightforward.
If a simple stimulus pattern is applied to the input layer, the remaining nodes of the
2 The WTA updating function and a proof of convergence are described in Tsotsos 1991 [18]

1. Receive stimulus at the input layer.
2. Do 3 through 8 forever.
3. Compute the remaining elements of the hierarchy based on the
   weighted sum of their inputs.
4. Do 5 through 6 for each level of the hierarchy, starting at the top.
5. Run WTA process at the current level.
6. Pass winner's beam to the next level.
7. Recompute based on winning input.
8. Inhibit winning input.

Algorithm 1

hierarchy will compute their responses based on a weighted summation of their inputs,
resulting in the configuration of Fig. 2(a). The first pass of the W T A scheme is shown in
Fig. 2(b). This is accomplished by applying steps 5 and 6 of Algorithm 1 for each level of
the hierarchy. Once an area of the input is attended to and all the desired information is
extracted, then the winning inputs are inhibited. The attention process continues "look-
ing" for the next area. The result is a very fast, automatic, independent, robust system.
Moreover, it is a continuous and reactive mechanism. In a time-varying image, it can
track an object that is moving if it is the item of highest response. In order to construct
such a tracking system, the input would be based on motion.
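A one-dimensional sketch of Algorithm 1 in Python, simplified to a single winner per level and to 2-to-1 averaging; the multi-winner WTA of [18] and the recomputation of step 7 are omitted, and the input length is assumed to be a power of two.

    import numpy as np

    def build_hierarchy(stimulus, levels):
        """Step 3: each level is the (uniformly weighted) average of pairs of units below."""
        hier = [np.asarray(stimulus, dtype=float)]
        for _ in range(levels - 1):
            prev = hier[-1]
            hier.append(0.5 * (prev[0::2] + prev[1::2]))
        return hier

    def attend_once(hier):
        """Steps 4-6: WTA at the top, then WTA restricted to the children of the
        winner above, down to the input layer; returns the winning input unit."""
        top = len(hier) - 1
        lo, hi = 0, hier[top].size
        for level in range(top, -1, -1):
            win = lo + int(np.argmax(hier[level][lo:hi]))
            lo, hi = 2 * win, 2 * win + 2        # pass zone at the level below
        return win

    def attend_and_inhibit(stimulus, levels=3, fixations=3):
        """Steps 2-8: repeatedly attend to the strongest region, then inhibit the
        winning input and recompute the hierarchy."""
        stim = np.asarray(stimulus, dtype=float).copy()
        fixated = []
        for _ in range(fixations):
            win = attend_once(build_hierarchy(stim, levels))
            fixated.append(win)
            stim[win] = 0.0                      # step 8: inhibit the winning input
        return fixated

Because the selection is driven by the coarse levels, such a beam fixates the strongest region rather than necessarily the single strongest input unit, which is in keeping with the region-based selection described above.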

Fig. 2. A one-dimensional processing hierarchy. (a) The initial configuration. (b) The most
"important" item is selected - the beam's pass (solid lines) and inhibit zone (dashed lines) are
shown.

A number of the prototype's characteristics may be varied, including the number of
levels in the processing hierarchy and the resolution of each level. The elements that
compete in the WTA process are termed "receptive fields" (RF) after the physiological
counterpart. In our implementation, a minimum RF (minRF) and a maximum RF
(maxRF) are specified in terms of basic image units such as pixels. All rectangular RFs
from minRF × minRF to maxRF × maxRF are computed and compete at each
position in the input. RF shapes other than rectangular are possible. In general, a set of
RFs is chosen that is appropriate for the current input computation.
There is an issue to consider when RFs of different sizes compete. If a small RF has
a response of k and a larger competing RF has a response (k − ε), then for a sufficiently
small ε, the larger RF should "win" over the smaller one. For example, consider a RF

R1 of size 2 × 2 that has a weighted average of 212, and a competing RF R2 of size 20
× 20 that has a weighted average of 210. Since R2 is 100 times the size of R1 and over
99% the intensity, it seems reasonable to favour R2 over R1. Formally, this is exactly one
of the constraints proposed in Tsotsos 1989 [16] for visual search: given more than one
match with approximately the same error, choose the largest one. In the implementation
of the attention model, this favouring of larger RFs of comparable value is accomplished
by multiplying the weighted averages of all RFs by a normalizing factor that is a function
of the size of the RF.
Marr suggests the following selection criterion for RF sizes: choose a given size if
receptive fields of slightly smaller size give an appreciably smaller response and receptive
fields that are larger do not give appreciably larger responses [7]. Marr notes, however,
that more than one receptive field size may satisfy this requirement. For instance, consider
a normalizing function that is linear. Also consider a 256 x 256 image-sized RF whose
weighted average is 128. In such an instance, the largest possible RF should be weighted
considerably less than two times the smallest possible RF. For this to hold, a linear
function would have a slope less than 0.000015. Therefore, for two small competing RFs
with similar sizes, the weighting is insignificantly small. Clearly a linear normalization
function is not acceptable.
In the experiments presented in this paper, a normalization function is established whose
rate of change is greatest for small RFs, without weighting very large RFs excessively.
Since ε depends on RF size, it is smaller for small RF sizes. Thus, small RF sizes
must be weighted more than larger RFs. This means that an acceptable function has
a steep slope for small RF sizes and shallow slopes for the larger RF sizes. A good fit
to this point distribution is the function 1/(1 + e^(−x)). In the experiments conducted, a
similar compensating function of a more general form is used:³

F(x) = (α + 1) / (α + β^(−√x)),

where x represents the number of basic elements in the receptive field. Varying α affects
the absolute value of the function's asymptote; varying β affects the steepness of the
first part of the function. It was found empirically that values of α = 10 and β = 1.03
generally give good results in most instances (see Figure 3).
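Assuming this reading of the compensating function (the equation in the scanned original is partly garbled), a short sketch of the weighting applied before the WTA comparison:

    import math

    def rf_weight(x, alpha=10.0, beta=1.03):
        """F(x) = (alpha + 1) / (alpha + beta ** (-sqrt(x))), where x is the number
        of basic elements (pixels) in the receptive field.  F(0) = 1, and F tends
        to (alpha + 1) / alpha = 1.1 for very large RFs."""
        return (alpha + 1.0) / (alpha + beta ** (-math.sqrt(x)))

    # Each RF's weighted average response is multiplied by rf_weight(area) before
    # the WTA comparison, so a slightly dimmer but much larger RF can win.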

[Plot: weighting factor (1.00-1.10) versus RF area × 10³ (0-20).]

Fig. 3. F(x) for α = 10, β = 1.03.

3 The number 1 in the numerator is a result of normalizing F(x) for x = 0. The √x is used to
account for the area of the RF.

4 Experimental Results

We have implemented this attention prototype in software on a Silicon Graphics 4D/340 VGX. Simulations have been conducted using a wide variety of digitized 256 × 256 8-bit
grey-scale images. In this paper, only brightness and edge information computed from
the images are used as input to the prototype. Further research is required to determine
on what other computations this attention beam should be applied.
This prototype lends itself to an implementation that is very fast, especially on hard-
ware that supports parallel processes, such as the SGI 4D/340 VGX. In particular, the
calculation of each element in a given level of the hierarchy is independent of all other
elements at the same level. Therefore, the calculation of the hierarchy may be completed
in parallel for each level. Furthermore, the WTA calculations at each time iteration are
independent and may be done in parallel. In addition, the WTA process converges very
quickly, typically taking less than ten iterations to determine the winner.
A simulation of the implementation for brightness is shown in Fig. 4. The lowest level
of the processing hierarchy is the digitized image, and each successive level is a simple
average of the previous level. This averaging computation has the effect of making each
level appear as a smaller "blurred" version of the previous level. The WTA process is
performed at the top of the hierarchy, and the pass zone is dictated by the RF that is
"brightest". At each successively lower level, the WTA only operates on the B.Fs that
fall within the beam from the previous level. Once the attention beam has located the
winning RF and the surrounding inhibit zone in the input level, and all the information
that is required is gathered from that focus of attention, the area is inhibited. In the
simulations presented here, the region inhibited at the input layer is defined by the
inhibit zone of the attention beam, contrary to the one-dimensional example in Sect. 3
where only elements in the pass zone are inhibited. In practice, once a region of the input
is processed, or "foveated", it need not be considered again. The prototype then looks
for the next "bright" area, starting by recalculating the processing hierarchy with the
newly-inhibited image as its input. In this particular instance, the time taken to attend
to each area in the input is approximately 0.35 seconds.
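The following sketch (ours, in Python with NumPy; the number of levels, the square competing elements and the size of the inhibited region are illustrative assumptions rather than the parameters of the prototype) mimics one cycle of the simulation just described: build the averaging hierarchy, select the brightest element at the top, restrict the competition to the beam at each lower level, and finally inhibit the attended area of the input before the next fixation.

```python
import numpy as np

def build_hierarchy(image, levels=4):
    """Each level of the processing hierarchy is a simple 2 x 2 average of the previous one."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        pyramid.append(prev[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

def attend_once(image, levels=4, rf=2, inhibit=16):
    """One fixation: route the beam down to the winning region, then inhibit it."""
    pyramid = build_hierarchy(image, levels)
    # WTA at the top of the hierarchy: the "brightest" element wins.
    r, c = np.unravel_index(np.argmax(pyramid[-1]), pyramid[-1].shape)
    # Pass zone: at each lower level only the elements inside the beam compete.
    for level in range(levels - 2, -1, -1):
        r0, c0 = 2 * r, 2 * c
        window = pyramid[level][r0:r0 + 2 * rf, c0:c0 + 2 * rf]
        dr, dc = np.unravel_index(np.argmax(window), window.shape)
        r, c = r0 + dr, c0 + dc
    # Inhibit the attended area of the input so that the next fixation moves on.
    image[max(r - inhibit, 0):r + inhibit, max(c - inhibit, 0):c + inhibit] = 0.0
    return int(r), int(c)

rng = np.random.default_rng(1)
img = rng.random((256, 256)) * 50.0
img[40:60, 200:220] = 255.0                   # a bright region the beam should visit first
print([attend_once(img) for _ in range(3)])   # a short scan path, brightest areas first
```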
Following the movement of the pass zone on the input layer for successive fixations
produces scan paths like the one shown in Fig. 5. The scan paths are interesting from a
computational perspective because they prioritize the order in which parts of the image
are assessed. The attention shifts discussed throughout this paper have been covert forms
of attention in which different regions of the visual input have been attended. It is
experimentally well established that these covert attention shifts occur in humans [12].
In a similar way, the human visual system has special fast mechanisms called saccades
for moving the fovea to different spatial targets (overt attention). The first systematic
study of saccadic eye movements in the context of behaviour was done by Yarbus [20].
A future area of research is to discover a possible correlation between the scan paths of
our attention beam and the scan paths of Yarbus.
A simulation using edge information was also conducted. At the bottom of the hi-
erarchy is the output of a simple difference operator and again, each successive level is
a simple average of the previous level. The WTA process successively extracts and then
inhibits the most conspicuous items. Corresponding scan paths are displayed in Fig. 6.
The results of this simulation using edges are interesting in several respects. The focus of
attention falls on the longest lines first.⁴ In effect, the strongest, or most salient, features

⁴ In this instance, maxRF was set to 100 pixels so that only a portion of the longest line was attended to at first.

Fig. 4. Processing hierarchy and attention beam at two time intervals. The input layer is a 256 × 256 8-bit image. The beam is rooted at the highest level and "shines" through the hierarchy to the input layer. The darker portion of the attention beam is the pass zone. Once a region of the input is attended to, it is inhibited and the next "bright" area is found. The black areas in the input layer indicate the regions that have been inhibited.

are attended to in order of the length of the line, much like Sha'ashua and Ullman's work
on saliency of curvature [14].

5 Discussion

The implementation of our attention prototype has a number of important properties that
make it preferable to other schemes. For example, Chapman [3] has recently implemented
a system based on the idea of a pyramid model of attention introduced by Koch and
Ullman [5]. Chapman's model places a log-depth tree above a saliency map. Similar to
our model, at each level nodes receive activation from nodes below. It differs, however,
in that Chapman's model only passes the maximum of these values to the next level.
There are several difficulties with this approach, the most serious being that the focus of
attention is not continuously variable. The restriction this places on Chapman's model
is that it cannot handle real pixel-based images but must assume a prior mechanism for
segmenting the objects and normalizing their sizes. Our scheme permits receptive fields
of all sizes at each level, with overlap. In addition, the time required for Chapman's
model is logarithmic in the maximum number of elements, making it impractical for
high-resolution images. Further, the time required to process any item in a sensory field

Fig. 5. Scan paths for a 256 × 256 8-bit digitized image (minRF = 5, maxRF = 40). The paths display a priority order in which regions of the image are assessed.

Fig. 6. Scan paths for a 256 × 256 8-bit digitized image consisting of horizontal and vertical lines (minRF = 10, maxRF = 100). The path displays a priority order in which regions of the image are assessed. The focus of attention falls on the longest lines first (only a portion of the longest line is attended to first in this example because maxRF = 100).

is dependent on its location, which is contrary to recent psychological evidence [6]. In our model, constant time is required irrespective of the locations of the sensory items.
Anderson and van Essen have proposed the idea of "shifter networks" to explain
attentional effects in vision [1]. There is some similarity between their model and the
inhibitory beam idea presented here. The Anderson and van Essen proposal requires a
two-phase process. First, a series of microshifts map the attention focus onto the nearest
cortical module, then a series of macroshifts switch dynamically between pairs of modules
at the next stage, continuing in this fashion until an attentional centre is reached. A major
drawback to this scheme is that there is no apparent method for control of the size and
shape of the attention focus. This is easily accomplished in our beam proposal because
the beam has internal structure that may be manipulated. Also, Anderson and van Essen
do not describe how the effects of nonattended regions of a receptive field are eliminated.
Finally, the shifting operation is quite complex and time consuming; whether this sort of
strategy can account for the extremely fast response times of human attention is unclear.
Califano, Kjeldsen and Bolle propose a multiresolution system in which the input
is processed simultaneously at a coarse resolution throughout the image and at a finer
resolution within a small "window" [2]. An attention control mechanism directs the high-
resolution spot. In many respects, our scheme may be considered a more general expan-
sion of the Califano model. Our model, however, allows for many resolutions whereas
Califano's is restricted to two. Moreover, our model allows for a variable size and shape
of the focus of attention, whereas both are fixed in Califano's model. The size and shape
of their coarse resolution representation are also fixed. These restrictions do not allow a "shrink wrapping" around an object, as it is attended to, from coarser to finer resolutions; our model, by contrast, performs this, as also observed in monkey visual cortex by Moran and Desimone [8].
Several attentional schemes have been proposed by the connectionist community.
Mozer describes a model of attention based on iterative relaxation [9]. Attentional se-
lection is performed by a network of simple computing units that constructs a variable-
diameter "spotlight" on the retinotopic representation. This spotlight allows sensory in-
formation within it to be preferentially processed. Sandon describes a model which also
uses an iterative rule but performs the computation at several spatial scales simultane-
ously [13]. There are several shortcomings of iterative models such as these. One problem
is that the settling time is quite sensitive to the size and nature of the image. The time
required may be quite long if there are similar regions of activity that are widely sepa-
rated. For example, Mozer reports that his scheme took up to 100 iterations to settle on
a 36 x 6 image [10]. These schemes are clearly not suited to real-world high-resolution
images.

Summary

We have argued that an attention mechanism is a necessary component of a computer vision system if it is to perform tasks in a complex, real world. A new model for visual
attention was introduced whose key component is an attentional beam that prunes the
processing hierarchy, drastically reducing the number of computations required. The
parallel nature of the hierarchy structure further increases the efficiency of this model.
This efficiency was shown empirically with simulations on high-resolution images. The
results confirm that our model is one that is highly suited for real-world vision problems.

Acknowledgements
Niels daVitoria Lobo provided helpful suggestions. This research was funded by the Information Technology Research Centre, one of the Province of Ontario Centres of Excellence, the Institute for Robotics and Intelligent Systems, a Network of Centres of Excellence of the Government of Canada, and the Natural Sciences and Engineering Research Council of Canada.

References
1. C.H. Anderson and D.C. Van Essen. Shifter circuits: A computational strategy for dynamic
aspects of visual processing. In Proceedings of the National Academy of Science, USA,
volume 84, pages 6297-6301, 1987.
2. A. Califano, R. Kjeldsen and R.M. Bolle. Data and model driven foveation. Technical
Report RC 15096 (#67343), IBM Research Division - T.J. Watson Lab, 1989.
3. D. Chapman. Vision, Instruction and Action. PhD thesis, MIT AI Lab, Cambridge, MA,
1990. TR1204.
4. J.A. Feldman. Four frames suffice: A provisional model of vision and space. The Behavioral
and Brain Sciences, 8:265-313, 1985.
5. C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural
circuitry. Human Neurobiology, 4:219-227, 1985.
6. B. Kröse and B. Julesz. The control and speed of shifts of attention. Vision Research,
29(11):1607-1619, 1989.
7. D. Marr. Early processing of visual information. Phil. Trans. R. Soc. Lond., B 275:483-524,
1976.
8. J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate
cortex. Science, 229:782-784, 1985.
9. M.C. Mozer. A connectionist model of selective visual attention in visual perception. In
Proceedings: 9th Conference of the Cognitive Science Society, pages 195-201, 1988.
10. M.C. Mozer. The Perception of Multiple Objects: A Connectionist Approach. MIT Press,
Cambridge, MA, 1991.
11. U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, New York, NY, 1967.
12. M.I. Posner, Y. Cohen and R.D. Rafal. Neural systems control of spatial orienting. Phil.
Trans. R. Soc. Lond., B 298:187-198, 1982.
13. P.A. Sandon. Simulating visual attention. Journal of Cognitive Neuroscience, 2(3):213-231,
1990.
14. A. Sha'ashua and S. Ullman. Structural saliency: The detection of globally salient structures
using a locally connected network. In Proceedings of the Second ICCV, pages 321-325,
Tampa, FL, 1988.
15. A. Treisman. Preattentive processing in vision. Computer Vision, Graphics, and Image
Processing, 31:156-177, 1988.
16. J.K. Tsotsos. The complexity of perceptual search tasks. In Proceedings, IJCAI, pages
1571-1577, Detroit, 1989.
17. J.K. Tsotsos. Analyzing vision at the complexity level. The Behavioral and Brain Sciences,
13:423-469, 1990.
18. J.K. Tsotsos. Localizing stimuli in a sensory field using an inhibitory attentional beam.
Technical Report RBCV-TR-91-37, University of Toronto, 1991.
19. L.M. Uhr. Psychological motivation and underlying concepts. In S.L. Tanimoto and
A. Klinger, editors, Structured Computer Vision. Academic Press, New York, NY, 1980.
20. A.L. Yarbus. Eye Movements and Vision. Plenum Press, 1967.
What can be seen in three dimensions with an uncalibrated stereo rig?

Olivier D. Faugeras

INRIA-Sophia, 2004 Route des Lucioles, 06560 Valbonne, France

Abstract. This paper addresses the problem of determining the kind of
three-dimensional reconstructions that can be obtained from a binocular
stereo rig for which no three-dimensional metric calibration data is avail-
able. The only information at our disposal is a set of pixel correspondences
between the two retinas which we assume are obtained by some correlation
technique or any other means. We show that even in this case some very rich
non-metric reconstructions of the environment can nonetheless be obtained.
Specifically we show that if we choose five arbitrary correspondences,
then a unique (up to an arbitrary projective transformation) projective rep-
resentation of the environment can be constructed which is relative to the
five points in three-dimensional space which gave rise to the correspon-
dences.
We then show that if we choose only four arbitrary correspondences,
then an affine representation of the environment can be constructed. This
reconstruction is defined up to an arbitrary affine transformation and is rel-
ative to the four points in three-dimensional space which gave rise to the
correspondences. The reconstructed scene also depends upon three arbitrary
parameters and two scenes reconstructed from the same set of correspon-
dences with two different sets of parameter values are related by a projective
transformation.
Our results indicate that computer vision may have been slightly overdoing it in trying at all costs to obtain metric information from images. Indeed, our past experience with the computation of such information has shown us that it is difficult to obtain, requiring awkward calibration procedures and special purpose patterns which are difficult if not impossible to use in natural environments with active vision systems. In fact it is not often the case that accurate metric information is necessary for robotics applications, for example, where relative information is usually all that is needed.

1 Introduction

The problem we address in this paper is that of a machine vision system with two cameras,
sometimes called a stereo rig, to which no three-dimensional metric information has been
made available. The only information at hand is contained in the two images. We assume
that this machine vision system is capable, by comparing these two images, of establishing
correspondences between them. These correspondences can be based on some measures
of similitude, perhaps through some correlation-like process. Anyway, we assume that
our system has obtained by some means a number of point correspondences. Each such
correspondence, noted (m, m') indicates that the two image points m and m' in the two
retinas are very likely to be the images of the same point out there. It is very doubtful

at first sight that such a system can reconstruct anything useful at all. In the machine
vision jargon, it knows neither its intrinsic parameters (one set for each camera) nor its extrinsic parameters (relative position and orientation of the cameras).
Surprisingly enough, it turns out that the machine vision system can nonetheless
reconstruct some very rich non-metric representations of its environment. These repre-
sentations are defined up to certain transformations of the environment which we assume
to be three-dimensional and euclidean (a realistic assumption which may be criticized by
some people). These transformations can be either affine or projective transformations
of the surrounding space. This depends essentially on the choice made by the user (i.e., the machine vision system).
This work has been inspired by the work of Jan Koenderink and Andrea van Doorn [4], the work of Gunnar Sparr [9,10], and the work of Roger Mohr and his associates [6,7].

We use the following notations. Vectors and matrixes will be represented in boldface, geometric entities such as points and lines in normal face. For example, m represents a point and m the vector of the coordinates of the point. The line defined by two points M and N will be denoted by (M, N). We will assume that the reader is familiar with elementary projective geometry such as what can be found in [8].

2 The projective case: basic idea

Throughout the paper we will assume the simple pinhole model for the cameras. In this model, the camera performs a perspective projection from the three-dimensional ambient space, considered as a subset of the projective space P³, to the two-dimensional retinal space, considered as a subset of the projective plane P². This perspective projection can be represented linearly in projective coordinates. If m is a retinal point represented by the three-dimensional vector m, image of the point M represented by the four-dimensional vector M, the perspective projection is represented by a 3 × 4 matrix, denoted P̃, such that:

m = P̃ M
Assume now that we are given 5 point matches in two images of a stereo pair. Let Ai, i = 1, ..., 5 be the corresponding 3D points. We denote their images in the two cameras by ai, a'i, i = 1, ..., 5. We make three choices of coordinate systems:

in 3D space: choose the five (unknown) 3D points as the standard projective basis, i.e. A1 = e1 = [1, 0, 0, 0]^T, ..., A5 = e5 = [1, 1, 1, 1]^T.
in the first image: choose the four points ai, i = 1, ..., 4 as the standard projective basis, i.e., for example, a1 = [1, 0, 0]^T.
in the second image: do a similar change of coordinates with the points a'i, i = 1, ..., 4.
With those three choices of coordinates, the expressions for the perspective matrixes P̃ and P̃' for the two cameras are quite simple. Let us compute it for the first one.

2.1 A simple expression for P̃

We write that

P̃ Ai = ρi ai,   i = 1, ..., 4

which implies, thanks to our choice of coordinate systems, that P̃ has the form:

P̃ = [ ρ1   0    0    ρ4
       0    ρ2   0    ρ4
       0    0    ρ3   ρ4 ]   (1)

Let a5 = [α, β, γ]^T; then the relation P̃ A5 = ρ5 a5 yields the three equations:

ρ1 + ρ4 = ρ5 α,   ρ2 + ρ4 = ρ5 β,   ρ3 + ρ4 = ρ5 γ

We now define μ = ρ5 and ν = ρ4; matrix P̃ can then be written as a very simple function of the two unknown parameters μ and ν:

P̃ = [ μα - ν    0         0         ν
       0         μβ - ν    0         ν
       0         0         μγ - ν    ν ]   (2)

A similar expression holds for P̃', which is a function of the two unknown parameters μ' and ν'.

2.2 Optical centers and epipoles

Equation (2) shows that each perspective matrix depends upon two projective parameters, i.e., upon one parameter. Through the choice of the five points Ai, i = 1, ..., 5 as the standard coordinate system, we have reduced our stereo system to be a function of only two arbitrary parameters. What have we lost? Well, suppose we have another match (m, m'); it means that we can compute the coordinates of the corresponding three-dimensional point M as a function of two arbitrary parameters in the projective coordinate system defined by the five points Ai, i = 1, ..., 5. Our three-dimensional reconstruction is thus defined up to the (unknown) projective transformation from the absolute coordinate system to the five points Ai and up to the two unknown parameters, which we can choose as the ratios x = μ/ν and x' = μ'/ν'. We will show in a moment how to eliminate the dependency upon x and x' by using a few more point matches.

Coordinates of the optical centers and epipoles. Let us now compute the coordinates of the optical centers C and C' of the two cameras. We know that the coordinates of C are defined by the equation:

P̃ C = 0

Combining this with the expression (2) for P̃, we obtain:

C = [ ν/(ν - αμ),  ν/(ν - βμ),  ν/(ν - γμ),  1 ]^T

a set of remarkably simple expressions. Note that the coordinates of C depend only upon the ratio x:

C = [ 1/(1 - αx),  1/(1 - βx),  1/(1 - γx),  1 ]^T

Identical expressions are obtained for the coordinates of C' by adding ':

C' = [ ν'/(ν' - α'μ'),  ν'/(ν' - β'μ'),  ν'/(ν' - γ'μ'),  1 ]^T = [ 1/(1 - α'x'),  1/(1 - β'x'),  1/(1 - γ'x'),  1 ]^T

If we now use the relation P̃' C = o' to define the epipole o' in the second image, we immediately obtain its coordinates (up to a scale factor):

o' = [ (x'α' - xα)/(1 - xα),  (x'β' - xβ)/(1 - xβ),  (x'γ' - xγ)/(1 - xγ) ]^T   (5)

We note that they depend only on the ratios x and x'. We have similar expressions for the epipole o defined by P̃ C' = o:

o = [ (xα - x'α')/(1 - x'α'),  (xβ - x'β')/(1 - x'β'),  (xγ - x'γ')/(1 - x'γ') ]^T   (6)

Constraints on the coordinates of the epipoles. The coordinates of the epipoles are not arbitrary because of the epipolar transformation. This transformation is well known in stereo and motion [3]. It says that the two pencils of epipolar lines are related by a collineation, i.e., a linear transformation between projective spaces (here two projective lines). It implies that we have the equalities of two cross-ratios, for example:

{(o, a1), (o, a2), (o, a3), (o, a4)} = {(o', a'1), (o', a'2), (o', a'3), (o', a'4)}
{(o, a1), (o, a2), (o, a3), (o, a5)} = {(o', a'1), (o', a'2), (o', a'3), (o', a'5)}

As shown in appendix A, we obtain the two relations (12) and (13) between the coordinates of o and o'.
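As a concrete check of these expressions, the short NumPy sketch below (our own illustration; the values of α, β, γ, μ and ν are arbitrary) builds P̃ and P̃' in the form of equation (2), recovers each optical center as the null vector of its 3 × 4 matrix, and obtains the epipoles as o' = P̃'C and o = P̃C'.

```python
import numpy as np

def perspective_matrix(alpha, beta, gamma, mu, nu):
    """P~ of equation (2): left block diag(mu*alpha-nu, mu*beta-nu, mu*gamma-nu), last column nu."""
    return np.array([[mu * alpha - nu, 0.0, 0.0, nu],
                     [0.0, mu * beta - nu, 0.0, nu],
                     [0.0, 0.0, mu * gamma - nu, nu]])

def optical_centre(P):
    """Null vector of the 3 x 4 projection matrix (last right singular vector)."""
    return np.linalg.svd(P)[2][-1]

# Arbitrary values: a5 = [alpha, beta, gamma]^T in image 1, a5' in image 2.
P1 = perspective_matrix(2.0, 3.0, 5.0, mu=1.0, nu=0.7)
P2 = perspective_matrix(1.5, 2.5, 4.0, mu=1.2, nu=0.9)

C, C_prime = optical_centre(P1), optical_centre(P2)
o_prime = P2 @ C          # epipole in the second image, equation (5) up to scale
o = P1 @ C_prime          # epipole in the first image, equation (6) up to scale

# C agrees (after normalizing its last coordinate) with [1/(1-alpha*x), 1/(1-beta*x), 1/(1-gamma*x), 1]^T.
x = 1.0 / 0.7
print(np.round(C / C[3], 4))
print(np.round([1 / (1 - 2.0 * x), 1 / (1 - 3.0 * x), 1 / (1 - 5.0 * x), 1.0], 4))
print(o_prime, o)
```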

3 Relative reconstruction of points

3.1 Complete determination of P̃ and P̃'

Assume for a moment that we know the epipoles o and o' in the two images (we show in section 3.3 how to estimate their coordinates). This allows us to determine the unknown parameters as follows. Let, for example, U', V' and W' be the projective coordinates of o'. According to equation (5), and after some simple algebraic manipulations, we have:

U'/W' = [(xα - x'α')(xγ - 1)] / [(xγ - x'γ')(xα - 1)],   V'/W' = [(xβ - x'β')(xγ - 1)] / [(xγ - x'γ')(xβ - 1)]

If we think of the pair (x, x') as defining the coordinates of a point in the plane, these equations show that the points which are solutions are at the intersection of two conics. In fact, it is easy to show, using Maple, that there are three points of intersection whose coordinates are very simple:

x = 0, x' = 0;
x = 1/γ, x' = 1/γ';
and a third pair (x, x') whose coordinates are rational functions of the coordinates of the epipole o' and of the points a5 and a'5.

One of these points has to be a double point where the two conics are tangent. Since it is only the last pair (x, x') which is a function of the epipolar geometry, it is in general the only solution.

Note that since the coordinates of the two epipoles are related by the two relations described in appendix A, they provide only two independent equations rather than four.

The perspective matrixes P̃ and P̃' are therefore uniquely defined. For each match (m, m') between two image points, we can then reconstruct the corresponding three-dimensional point M in the projective coordinate system defined by the five points Ai. Remember that those five points are unknown. Thus our reconstruction can be considered as relative to those five points and depending upon an arbitrary projective transformation of the projective space P³. All this is completely independent of the intrinsic and extrinsic parameters of the cameras.

We have obtained a remarkably simple result:

In the case where at least eight point correspondences have been obtained between two images of an uncalibrated stereo rig, if we arbitrarily choose five of those correspondences and consider that they are the images of five points in general position (i.e., no four of them are coplanar), then it is possible to reconstruct the other three points and any other point arising from a correspondence between the two images in the projective coordinate system defined by the five points. This reconstruction is uniquely defined up to an unknown projective transformation of the environment.

3.2 Reconstructing the points


Given a correspondence (m, m'), we show how to reconstruct the three-dimensional point M in the projective coordinate system defined by the points Ai, i = 1, ..., 5.
The computation is extremely simple. Let M∞ be the point of intersection of the optical ray (C, m) with the plane of equation T = 0. M∞ satisfies the equation P M∞ = m, where P is the 3 × 3 left submatrix of matrix P̃ (note that M∞ is a 3 × 1 vector, the projective representation of M∞ being [M∞^T, 0]^T). The reconstructed point M can then be written as

M = μ C + λ [ M∞^T, 0 ]^T

where the scalars λ and μ are determined by the equation P̃' M = m', which says that m' is the image of M. Applying P̃' to both sides of the previous equation, we obtain

m' = μ o' + λ P'P⁻¹ m

where P' is the 3 × 3 left submatrix of matrix P̃'. It is shown in appendix B that

P'P⁻¹ = [ (α'x'-1)/(αx-1)   0                  0
          0                 (β'x'-1)/(βx-1)    0
          0                 0                  (γ'x'-1)/(γx-1) ]

Let us note a = (α'x'-1)/(αx-1), b = (β'x'-1)/(βx-1), and c = (γ'x'-1)/(γx-1). λ and μ are then found by solving the system of three linear equations in two unknowns

A [ μ  λ ]^T = m',   where   A = [ U'  aX
                                   V'  bY
                                   W'  cZ ]   (7)

in which we have taken m = [X, Y, Z]^T; the matrix A is in general of rank 2. The coordinates of the reconstructed point M are then:

M = μ [ 1/(xα-1), 1/(xβ-1), 1/(xγ-1), 1 ]^T + λ [ X/(xα-1), Y/(xβ-1), Z/(xγ-1), 0 ]^T
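System (7) has three equations and two unknowns and is most simply solved in the least-squares sense; a minimal sketch (ours, in Python/NumPy, with made-up values for o', the ratios a, b, c and the matched pixels):

```python
import numpy as np

def solve_mu_lambda(o_prime, abc, m, m_prime):
    """Least-squares solution of m' ~ mu * o' + lambda * P'P^-1 m (equation (7))."""
    A = np.column_stack([o_prime, abc * m])           # the 3 x 2 matrix of equation (7)
    (mu, lam), *_ = np.linalg.lstsq(A, m_prime, rcond=None)
    return mu, lam

# Hypothetical values: epipole o' = [U', V', W']^T, the ratios a, b, c defined above,
# and a matched pair of pixels m, m' in homogeneous coordinates.
o_prime = np.array([0.8, -0.3, 1.0])
abc = np.array([1.1, 0.9, 1.3])
m = np.array([120.0, 85.0, 1.0])
m_prime = 2.0 * o_prime + 0.01 * abc * m              # a consistent right-hand side

print(solve_mu_lambda(o_prime, abc, m, m_prime))      # -> (2.0, 0.01)
```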

We now explain how the epipoles can be determined.

3.3 Determining the epipoles from point matches

The epipoles and the epipolar transformation between the two retinas can be easily determined from the point matches as follows. For a given point m in the first retina, its epipolar line o_m in the second retina is linearly related to its projective representation. If we denote by F the 3 × 3 matrix describing the correspondence, we have:

o_m = F m

where o_m is the projective representation of the epipolar line o_m. Since the corresponding point m' belongs to the line o_m by definition, we can write:

m'^T F m = 0   (8)

This equation is reminiscent of the so-called Longuet-Higgins equation in motion analysis [5]. This is not a coincidence.

Equation (8) is linear and homogeneous in the 9 unknown coefficients of matrix F. Thus we know that, in general, if we are given 8 matches we will be able to determine a unique solution for F, defined up to a scale factor. In practice, we are given much more than 8 matches and use a least-squares method. We have shown in [2] that the result is usually fairly insensitive to errors in the coordinates of the pixels m and m' (up to 0.5 pixel error).

Once we have obtained matrix F, the coordinates of the epipole o are obtained by solving

F o = 0   (9)

In the noiseless case, matrix F is of rank 2 (see Appendix B) and there is a unique vector o (up to a scale factor) which satisfies equation (9). When noise is present, which is the standard case, o is determined by solving the following classical constrained minimization problem

min ||F o||²   subject to   ||o||² = 1

which yields o as the unit norm eigenvector of matrix F^T F corresponding to the smallest eigenvalue. We have verified that in practice the estimation of the epipole is very insensitive to pixel noise.

The same processing applies in reverse to the computation of the epipole o'.
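A compact sketch of this linear estimation (our own, in Python/NumPy; the arrays of matched points are assumed to be given in homogeneous image coordinates): each match contributes one row of the homogeneous system (8), F is taken as the right singular vector associated with the smallest singular value, and each epipole is the corresponding null vector.

```python
import numpy as np

def estimate_F(m1, m2):
    """Linear estimate of F from n >= 8 matches m2[i]^T F m1[i] = 0 (equation (8)).

    m1, m2: (n, 3) arrays of homogeneous image coordinates.
    """
    A = np.stack([np.outer(x2, x1).ravel() for x1, x2 in zip(m1, m2)])
    return np.linalg.svd(A)[2][-1].reshape(3, 3)   # unit-norm minimiser of ||A f||

def epipole(F):
    """Unit-norm vector minimising ||F o||, i.e. the (approximate) solution of (9)."""
    return np.linalg.svd(F)[2][-1]

# Synthetic check: random scene points seen by two simple projection matrices.
rng = np.random.default_rng(0)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[0.2], [0.1], [0.0]])])   # second, displaced camera
X = np.hstack([rng.uniform(-1, 1, (20, 3)) + [0, 0, 5], np.ones((20, 1))])
m1, m2 = (P1 @ X.T).T, (P2 @ X.T).T

F = estimate_F(m1, m2)
o, o_prime = epipole(F), epipole(F.T)    # epipoles in image 1 and image 2
print(np.allclose(np.einsum('ij,jk,ik->i', m2, F, m1), 0.0, atol=1e-9))
print(o, o_prime)
```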

3.4 Choosing the five points Ai ¹

As mentioned before, in order for this scheme to work, the three-dimensional points that we choose to form the standard projective basis must be in general position. This means that no four of them can be coplanar. The question therefore arises of whether we can guarantee this only from their projections in the two retinas.

The answer is provided by the following observation. Assume that four of these points are coplanar, for example A1, A2, A3, and A4 as in figure 1. Therefore, the diagonals of the planar quadrilateral intersect at three points B1, B2, B3 in the same plane. Because the perspective projections on the two retinas map lines onto lines, the images of these diagonals are the diagonals of the quadrilaterals a1, a2, a3, a4 and a'1, a'2, a'3, a'4, which intersect at b1, b2, b3 and b'1, b'2, b'3, respectively. If the four points Ai are coplanar, then the points b'j, j = 1, 2, 3 lie on the epipolar lines of the points bj, simply because they are the images of the points Bj. Since we know the epipolar geometry of the stereo rig, this can be tested in the two images.

But this is only a necessary condition; what about the reverse? Suppose then that b'1 lies on the epipolar line of b1. By construction, the line (C, b1) is a transversal to the two lines (A1, A3) and (A2, A4): it intersects them in two points C1 and C2. Similarly, (C', b'1) intersects (A1, A3) and (A2, A4) in C'1 and C'2. Because b'1 lies on the epipolar line of b1, the two lines (C, b1) and (C', b'1) are coplanar (they lie in the same epipolar plane). The discussion is on the four coplanar points C1, C2, C'1, C'2. Four cases can occur:

1. C1 ≠ C'1 and C2 ≠ C'2 implies that (A1, A3) and (A2, A4) are in the epipolar plane and therefore that the points a1, a2, a3, a4 and a'1, a'2, a'3, a'4 are aligned on corresponding epipolar lines.
2. C1 = C'1 and C2 ≠ C'2 implies that (A1, A3) is in the epipolar plane and therefore that the lines (a1, a3) and (a'1, a'3) are corresponding epipolar lines.
3. The case C1 ≠ C'1 and C2 = C'2 is similar to the previous one.
4. C1 = C'1 and C2 = C'2 implies that the two lines (A1, A3) and (A2, A4) are coplanar and therefore also the four points A1, A2, A3, A4 (in that case we have C1 = C'1 = C2 = C'2 = B1).

In conclusion, except for the first three "degenerate cases", which can be easily detected, the condition that b'1 lies on the epipolar line of b1 is necessary and sufficient for the four points A1, A2, A3, A4 to be coplanar.

¹ This section was suggested to us by Roger Mohr.

4 Generalization to the affine case

The basic idea also works if, instead of choosing five arbitrary points in space, we choose only four, for example Ai, i = 1, ..., 4. The transformation of space can now be chosen in such a way that it preserves the plane at infinity: it is an affine transformation. Therefore, in the case in which we choose four points instead of five as reference points, the local reconstruction will be up to an affine transformation of the three-dimensional space.

Let us consider again equation (1) and change notations slightly to rewrite it as:

P̃ = [ p   0   0   s
       0   q   0   s
       0   0   r   s ]
Fig. 1. If the four points A1, A2, A3, A4 are coplanar, they form a planar quadrilateral whose diagonals intersect at three points B1, B2, B3 in the same plane.

with a similar expression with ' for P̃'. Each perspective matrix now depends upon 4 projective parameters, or 3 parameters, making a total of 6. If we assume, as previously, that we have been able to compute the coordinates of the two epipoles, then we can write four equations among these 6 unknowns, leaving two. Here is how it goes.

It is very easy to show that the coordinates of the two optical centers are:

C = [ 1/p,  1/q,  1/r,  -1/s ]^T,   C' = [ 1/p',  1/q',  1/r',  -1/s' ]^T

from which we obtain the coordinates of the two epipoles:

o' = [ p'/p - s'/s,  q'/q - s'/s,  r'/r - s'/s ]^T,   o = [ p/p' - s/s',  q/q' - s/s',  r/r' - s/s' ]^T

Let us note

x1 = p'/p,   x2 = q'/q,   x3 = r'/r,   x4 = s'/s

We thus have for the second epipole:

(x1 - x4)/(x3 - x4) = U'/W',   (x2 - x4)/(x3 - x4) = V'/W'   (10)

and for the first:

[(x1 - x4)/(x3 - x4)] (x3/x1) = U/W,   [(x2 - x4)/(x3 - x4)] (x3/x2) = V/W   (11)

The first two equations (10) determine x1 and x2 as functions of x3 by replacing in equations (11):

x1 = [U'W/(W'U)] x3,   x2 = [V'W/(W'V)] x3

Replacing these values for x1 and x2 in equations (10), we obtain a system of two linear equations in the two unknowns x3 and x4:

x3 U'(W - U) + x4 U(U' - W') = 0
x3 V'(W - V) + x4 V(V' - W') = 0

Because of equation (12) of appendix A, the determinant of this system is equal to 0 and the two equations reduce to one, which yields x4 as a function of x3:

x4 = [U'(W - U)/(U(W' - U'))] x3 = [V'(W - V)/(V(W' - V'))] x3

We can therefore express matrixes P̃ and P̃' as very simple functions of the four projective parameters p, q, r, s:

P̃ = [ p   0   0   s
       0   q   0   s
       0   0   r   s ]

P̃' = [ (U'W/(W'U)) p    0                 0    (U'(W - U)/(U(W' - U'))) s
        0                (V'W/(W'V)) q     0    (U'(W - U)/(U(W' - U'))) s
        0                0                 r    s ]

There is a detail that changes the form of the matrixes P̃ and P̃', which is the following. We considered the four points Ai, i = 1, ..., 4 as forming an affine basis of the space. Therefore, if we want to consider that the last coordinates of points determine the plane at infinity, we should take the coordinates of those points to have a 1 as the last coordinate instead of a 0. It can be seen that this is the same as multiplying matrixes P̃ and P̃' on the right by the matrix

Q = [  1   0   0   0
       0   1   0   0
       0   0   1   0
      -1  -1  -1   1 ]

Similarly, the vectors representing the points of P³ must be multiplied by Q⁻¹. For example

Q⁻¹ C = [ 1/p,  1/q,  1/r,  1/p + 1/q + 1/r - 1/s ]^T

We have thus obtained another remarkably simple result:

In the case where at least eight point correspondences have been obtained between two images of an uncalibrated stereo rig, if we arbitrarily choose four of these correspondences and consider that they are the images of four points in general position (i.e., not coplanar), then it is possible to reconstruct the other four points and any other point arising from a correspondence between the two images in the affine coordinate system defined by the four points. This reconstruction is uniquely defined up to an unknown affine transformation of the environment.

The main difference with the previous case is that instead of having a unique determination of the two perspective projection matrixes P̃ and P̃', we have a family of such matrixes parameterized by the point of P³ of projective coordinates p, q, r, s. Some simple parameter counting will explain why. The stereo rig depends upon 22 = 2 × 11 parameters, 11 for each perspective projection matrix. The reconstruction is defined up to an affine transformation, that is 12 = 9 + 3 parameters; the knowledge of the two epipoles and the epipolar transformation represents 7 = 2 + 2 + 3 parameters. Therefore we are left with 22 - 12 - 7 = 3 loose parameters, which are the p, q, r, s.

Similarly, in the previous projective case, the reconstruction is defined up to a projective transformation, that is 15 parameters. The knowledge of the epipolar geometry still provides 7 parameters, which makes a total of 22. Thus our result that the perspective projection matrixes are uniquely defined in that case.

4.1 Reconstructing the points

Given a pair (m, m') of matched pixels, we want to compute the coordinates of the reconstructed three-dimensional point M (in the affine coordinate system defined by the four points Ai, i = 1, ..., 4). Those coordinates will be functions of the parameters p, q, r, s.

The computation is extremely simple and analogous to the one performed in the previous projective case. We write again that the reconstructed point M is expressed as M = μ C + λ [M∞^T, 0]^T. Let u12 = U'W/(W'U) and v12 = V'W/(W'V). The scalars λ and μ are determined as in section 3.2: they are given by equation (7), in which matrix

A = [ U'   u12 X
      V'   v12 Y
      W'   Z ]

is in general of rank 2. The projective coordinates of the reconstructed point M are then:

M = Q⁻¹ ( μ [ 1/p, 1/q, 1/r, -1/s ]^T + λ [ X/p, Y/q, Z/r, 0 ]^T )

4.2 Choosing the parameters p, q, r, s

The parameters p, q, r, s can be chosen arbitrarily. Suppose we reconstruct the same scene with two different sets of parameters p1, q1, r1, s1 and p2, q2, r2, s2. Then the relationship between the coordinates of a point M1 and a point M2 reconstructed with those two sets from the same image correspondence (m, m') is very simple in projective coordinates:

M2 = Q⁻¹ [ p2/p1   0       0       0
           0       q2/q1   0       0
           0       0       r2/r1   0
           0       0       0       s2/s1 ] Q M1
   = [ p2/p1            0                0                0
       0                q2/q1            0                0
       0                0                r2/r1            0
       p2/p1 - s2/s1    q2/q1 - s2/s1    r2/r1 - s2/s1    s2/s1 ] M1

The two scenes are therefore related by a projective transformation. It may come as a surprise that they are not related by an affine transformation, but it is clearly the case that the above transformation preserves the plane at infinity if and only if

p2/p1 = q2/q1 = r2/r1 = s2/s1

If we have more information about the stereo rig, for example if we know that the two optical axes are coplanar, or parallel, then we can reduce the number of free parameters. We have not yet explored experimentally the influence of this choice of parameters on the reconstructed scene and plan to do it in the future.
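A quick numerical check of the relation above (our own sketch in Python/NumPy; the parameter values and the test point are arbitrary), using the matrix Q introduced in section 4:

```python
import numpy as np

# Q maps the affine coordinates of A1..A4 (last coordinate 1) back to the
# standard projective basis e1..e4 used in the derivation.
Q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [-1.0, -1.0, -1.0, 1.0]])

p1, q1, r1, s1 = 1.0, 2.0, 3.0, 4.0      # first, arbitrary choice of parameters
p2, q2, r2, s2 = 0.5, 1.5, 2.5, 3.5      # second choice

D = np.diag([p2 / p1, q2 / q1, r2 / r1, s2 / s1])
H = np.linalg.inv(Q) @ D @ Q             # the collineation relating M2 and M1

M1 = np.array([0.3, -0.2, 1.1, 1.0])     # some reconstructed point (homogeneous)
M2 = H @ M1
print(np.round(H, 3))                    # diagonal block plus the non-trivial last row
print(M2 / M2[-1])                       # the same physical point, other parameter set
```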

5 Putting together different viewpoints

An interesting question is whether this approach precludes the building of composite models of a scene by putting together different local models. We and others have been
doing this quite successfully over the years in the case of metric reconstructions of the
scene [1,11]. Does the loss of the metric information imply that this is not possible
anymore? Fortunately, the answer to this question is no: we can still do it, but in the weaker
frameworks we have been dealing with, namely projective and affine reconstructions.
To see this, let us take the case of a scene which has been reconstructed by the affine
or projective method from two different viewpoints with a stereo rig. We do not need to
assume that it is the same stereo rig in both cases, i.e we can have changed the intrinsic
and extrinsic parameters between the two views (for example changed the base line and
the focal lengths). Note that we do not require the knowledge of these changes.
Suppose then that we have reconstructed a scene S1 from the first viewpoint using the five points Ai, i = 1, ..., 5 as the standard projective basis. We know that our reconstruction can be obtained from the real scene by applying to it the (unknown) projective transformation that turns the five points Ai, which have perfectly well defined coordinates in a coordinate system attached to the environment, into the standard projective basis. We could determine these coordinates by going out there with a ruler and measuring distances, but this is precisely what we want to avoid doing. Let T1 denote this collineation of P³.

Similarly, from the second viewpoint, we have reconstructed a scene S2 using five other points Bi, i = 1, ..., 5 as the standard projective basis. Again, this reconstruction can be obtained from the real scene by applying to it the (unknown) projective transformation that turns the five points Bi into the standard projective basis. Let T2 denote the corresponding collineation of P³. Since the collineations of P³ form a group, S2 is related to S1 by the collineation T2 T1⁻¹. This means that the two reconstructions are related by an unknown projective transformation.
Similarly, in the case we have studied before, the scenes were related by an unknown
rigid displacement [1,11]. The method we have developed for this case worked in three
steps:
1. Look for potential matches between the two reconstructed scenes. These matches are
sets of reconstructed tokens (mostly points and lines in the cases we have studied)
which can be hypothesized as being reconstructions of the same physical tokens
because they have the same metric invariants (distances and angles). An example is
a set of two lines with the same shortest distance and forming the same angle.
2. Using these groups of tokens with the same metric invariants, look for a global rigid
displacement from the first scene to the second that maximizes the number of matched
tokens.
3. For those tokens which have found a match, fuse their geometric representations
using the estimated rigid displacement and measures of uncertainty.
The present situation is quite similar if we change the words metric invariants into
projective invariants and rigid displacement into projective transformation. There is a
difference which is due to the fact that the projective group is larger than the euclidean
group, the first one depends on 15 independent parameters whereas the second depends
upon only 6 (three for rotation and three for translation). This means that we will have to
consider larger sets of tokens in order to obtain invariants. For example two lines depend
upon 8 parameters in euclidean or projective space, therefore we obtain 8 - 6 = 2 metric
invariants (the shortest distance and the angle previously mentioned) but no projective
invariants. In order to obtain some projective invariants, we need to consider sets of four lines, for which there is at least one invariant (16 - 15 = 1).² Even though this is not theoretically significant, it has obvious consequences on the complexity of the algorithms for finding matches between the two scenes (we go from an O(n²) complexity to an O(n⁴), where n is the number of lines).
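For concreteness, the two metric invariants of a pair of 3D lines mentioned above can be computed as in the following sketch (ours, in Python/NumPy; representing each line by a point and a direction is our assumption about the representation):

```python
import numpy as np

def line_pair_invariants(p1, d1, p2, d2):
    """Shortest distance and angle between two 3D lines given as point + direction."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-12:                       # parallel directions
        distance = np.linalg.norm(np.cross(p2 - p1, d1))
    else:
        distance = abs(np.dot(p2 - p1, n)) / np.linalg.norm(n)
    angle = np.arccos(np.clip(abs(np.dot(d1, d2)), 0.0, 1.0))
    return distance, angle

# Two skew lines, one unit apart, with perpendicular directions.
print(line_pair_invariants(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                           np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])))
# -> (1.0, 1.5707963...)
```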
We can also consider points, or mixtures of points and lines, or for that matter any combination of geometric entities, but this is outside the scope of this paper and we will report on these subjects later.

The affine case can be treated similarly. Remember from section 4 that we choose four arbitrary noncoplanar points Ai, i = 1, ..., 4 as the standard affine basis and reconstruct the scene locally to these points. The reconstructed scene is related to the real one by a three-parameter family of affine transformations. When we have two reconstructions obtained from two different viewpoints, they are both obtained from the real scene by applying to it two unknown affine transformations. These two transformations depend each upon three arbitrary parameters, but they remain affine. This means that the relationship between the two reconstructed scenes is an unknown affine transformation³ and that everything we said about the projective case can also be said in this case, changing projective into affine. In particular, this means that we are working with a smaller group which depends only upon 12 parameters and that the complexity of the matching should be intermediate between the metric and projective cases.

6 Experimental results

This theory has been implemented in Maple and C code. We show the results on the calibration pattern of figure 2. We have been using this pattern over the years to calibrate our stereo rigs and it is fair enough to use it to demonstrate that we will not need it anymore in the forthcoming years.

The pattern is made of two perpendicular planes on which we have painted with great care black and white squares. The two planes define a natural euclidean coordinate frame in which we know quite accurately the coordinates of the vertexes of the squares. The images of these squares are processed to extract the images of these vertexes whose pixel coordinates are then also known accurately. The three sets of coordinates, one set in three dimensions and two sets in two dimensions, one for each image of the stereo rig, are then used to estimate the perspective matrixes P1 and P2 from which we can compute the intrinsic parameters of each camera as well as the relative displacement of each of them with respect to the euclidean coordinate system defined by the calibration pattern.

We have used as input to our program the pixel coordinates of the vertexes of the images of the squares as well as the pairs of corresponding points.⁴ From these we can estimate the epipolar geometry and perform the kind of local reconstruction which has

² In fact there are two, which are obtained as follows: given the family of all lines, if we impose that a line intersects a given line, this is one condition; therefore there is in general a finite number of lines which intersect four given lines. This number is in general two, and the two invariants are the cross-ratios of the two sets of four points of intersection.
³ This is true only, according to section 4.2, if the two reconstructions have been performed using the same parameters p, q, r and s.
⁴ In practice, these matches are obtained automatically by a program developed by Régis Vaillant which uses some a priori knowledge about the calibration pattern.

been described in this paper. Since it is hard to visualize things in a projective space, we have corrected our reconstruction before displaying it in the following manner.

We have chosen A1, A2, A3 in the first of the two planes, A4, A5 in the second, and checked that no four of them were coplanar. We then have reconstructed all the vertexes in the projective frame defined by the five points Ai, i = 1, ..., 5. We know that this reconstruction is related to the real calibration pattern by the projective transformation that transforms the five points (as defined by their known projective coordinates in the euclidean coordinate system defined by the pattern, just add a 1 as the last coordinate) into the standard projective basis. Since in this case this transformation is known to us by construction, we can use it to test the validity of our projective reconstruction and in particular its sensitivity to noise. In order to do this we simply apply the inverse transformation to all our reconstructed points, obtaining their "corrected" coordinates in euclidean space. We can then visualize them using standard display tools and in particular look at them from various viewpoints to check their geometry. This is shown in figure 3, where it can be seen that the quality of the reconstruction is quite good.
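The "correction" used for display amounts to computing the collineation that takes the five reference points to the standard projective basis and applying its inverse to the reconstructed points. A sketch of how such a collineation can be obtained (ours, in Python/NumPy; the reference coordinates are made up):

```python
import numpy as np

def basis_collineation(A):
    """4x4 collineation H with H @ A[i] proportional to e_i for i = 0..3
    and H @ A[4] proportional to (1, 1, 1, 1).

    A: (5, 4) array of homogeneous coordinates of five points in general position.
    """
    B = A[:4].T                                  # columns are A1..A4
    w = np.linalg.solve(B, A[4])                 # weights such that A5 = B @ w
    return np.linalg.inv(B * w)                  # (B @ diag(w))^-1

# Hypothetical reference points (last coordinate 1: ordinary euclidean points).
A = np.array([[0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])
H = basis_collineation(A)
print(np.round(H @ A.T, 3))   # columns proportional to e1..e4 and (1, 1, 1, 1)
# Applying inv(H) to reconstructed points "corrects" them back to the euclidean
# frame of the five reference points, as described in the text.
```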

Fig. 2. A grey scale image of the calibration pattern

7 Conclusion

This paper opens the door to quite exciting research. The results we have presented in-
dicate that computer vision may have been slightly overdoing it in trying at all costs to
obtain metric information from images. Indeed, our past experience with the computa-
tion of such information has shown us t h a t it is difficult to obtain, requiring awkward
calibration procedures and special purpose patterns which are difficult if not impossible
to use in natural environments with active vision systems. In fact it is not often the case
that accurate metric information is necessary for robotics applications, for example, where relative information is usually all that is needed.
In order to make this local reconstruction theory practical, we need to investigate in
more detail how the epipolar geometry can be automatically recovered from the environ-
ment and how sensitive the results are to errors in this estimation. We have started doing

Fig. 3. Several rotated views of the "corrected" reconstructed points (see text)

this and some results are reported in a companion paper [2]. We also need to investigate
the sensitivity to errors of the affine and projective invariants which are necessary in
order to establish correspondences between local reconstructions obtained from various
viewpoints.
Acknowledgements:
I want to thank Théo Papadopoulo and Luc Robert for their thoughtful comments on an early version of this paper as well as for trying to keep me up to date on their latest software packages, without which I would never have been able to finish this paper on time.

A Computing some cross-ratios

Let U, V, W be the projective coordinates of the epipole o. The projective representations of the lines (o, a1), (o, a2), (o, a3), (o, a4) are the cross-products o ∧ a1 = l1, o ∧ a2 = l2, o ∧ a3 = l3, o ∧ a4 = l4. A simple algebraic computation shows that

l1 = [0, W, -V]^T,   l2 = [-W, 0, U]^T,   l3 = [V, -U, 0]^T,   l4 = [V - W, W - U, U - V]^T

This shows that, projectively (if W ≠ 0):

l3 = U l1 + V l2,   l4 = (W - U) l1 + (W - V) l2

The cross-ratio of the four lines is equal to the cross-ratio of the four "points" l1, l2, l3, l4:

{(o, a1), (o, a2), (o, a3), (o, a4)} = {0, ∞, V/U, (W - V)/(W - U)} = V(W - U) / (U(W - V))

Therefore, the projective coordinates of the two epipoles satisfy the first relation:

V(W - U) / (U(W - V)) = V'(W' - U') / (U'(W' - V'))   (12)

In order to compute the second pair of cross-ratios, we have to introduce the fifth line (o, a5), compute its projective representation l5 = o ∧ a5, and express it as a linear combination of l1 and l2. It comes that:

l5 = (Uγ - Wα) l1 + (Vγ - Wβ) l2

From which it follows that the second cross-ratio is equal to:

{(o, a1), (o, a2), (o, a3), (o, a5)} = {0, ∞, V/U, (Vγ - Wβ)/(Uγ - Wα)} = V(Uγ - Wα) / (U(Vγ - Wβ))

Therefore, the projective coordinates of the two epipoles satisfy the second relation:

V(Uγ - Wα) / (U(Vγ - Wβ)) = V'(U'γ' - W'α') / (U'(V'γ' - W'β'))   (13)

B The essential matrix

We relate here the essential matrix F to the two perspective projection matrixes P̃ and P̃'. Denoting as in the main text by P and P' the 3 × 3 left submatrixes of P̃ and P̃', and by p and p' the last 3 × 1 columns of these matrixes, we write them as:

P̃ = [ P  p ],   P̃' = [ P'  p' ]

Knowing this, we can write:

C = [ -P⁻¹ p ; 1 ],   M∞ = P⁻¹ m

and we obtain the coordinates of the epipole o' and of the image m'∞ of M∞ in the second retina:

o' = P̃' C = p' - P'P⁻¹ p,   m'∞ = P'P⁻¹ m

The two points o' and m'∞ define the epipolar line o_m of m; therefore the projective representation of o_m is the cross-product of the projective representations of o' and m'∞:

o_m = o' ∧ m'∞ = ō' m'∞

where we use the notation ō' to denote the 3 × 3 antisymmetric matrix representing the cross-product with the vector o'.

From what we have seen before, we write:

P'P⁻¹ = [ (α'x'-1)/(αx-1)   0                  0
          0                 (β'x'-1)/(βx-1)    0
          0                 0                  (γ'x'-1)/(γx-1) ]

Thus, up to a scale factor:

P'P⁻¹ p - p' = [ (α'x'-αx)/(αx-1),  (β'x'-βx)/(βx-1),  (γ'x'-γx)/(γx-1) ]^T

We can therefore write:

ō' = [ 0                      -(γ'x'-γx)/(γx-1)     (β'x'-βx)/(βx-1)
       (γ'x'-γx)/(γx-1)        0                    -(α'x'-αx)/(αx-1)
      -(β'x'-βx)/(βx-1)        (α'x'-αx)/(αx-1)      0 ]

and finally:

F = ō' P'P⁻¹ =
[ 0                                          -[(γ'x'-γx)/(γx-1)][(β'x'-1)/(βx-1)]      [(β'x'-βx)/(βx-1)][(γ'x'-1)/(γx-1)]
  [(γ'x'-γx)/(γx-1)][(α'x'-1)/(αx-1)]         0                                        -[(α'x'-αx)/(αx-1)][(γ'x'-1)/(γx-1)]
 -[(β'x'-βx)/(βx-1)][(α'x'-1)/(αx-1)]         [(α'x'-αx)/(αx-1)][(β'x'-1)/(βx-1)]       0 ]

References

1. Nicholas Ayache and Olivier D. Faugeras. Maintaining Representations of the Environment of a Mobile Robot. IEEE Transactions on Robotics and Automation, 5(6):804-819, December 1989. Also INRIA report 789.
2. Olivier D. Faugeras, Tuan Luong, and Steven Maybank. Camera self-calibration: theory and experiments. In Proceedings of the 2nd European Conference on Computer Vision, 1992. Accepted.
3. Olivier D. Faugeras and Steven Maybank. Motion from point matches: multiplicity of solutions. The International Journal of Computer Vision, 4(3):225-246, June 1990. Also INRIA Tech. Report 1157.
4. Jan J. Koenderink and Andrea J. van Doorn. Affine Structure from Motion. Journal of the Optical Society of America, 1992. To appear.
5. H.C. Longuet-Higgins. A Computer Algorithm for Reconstructing a Scene from Two Pro-
jections. Nature, 293:133-135, 1981.
6. R. Mohr and E. Arbogast. It can be done without camera calibration. Pattern Recognition
Letters, 12:39-43, 1990.
7. Roger Mohr, Luce Morin, and Enrico Grosso. Relative positioning with poorly calibrated cameras. In J.L. Mundy and A. Zisserman, editors, Proceedings of DARPA-ESPRIT Workshop on Applications of Invariance in Computer Vision, pages 7-46, 1991.
8. J.G. Semple and G.T. Kneebone. Algebraic Projective Geometry. Oxford: Clarendon Press,
1952. Reprinted 1979.
9. Gunnar Sparr. An algebraic-analytic method for reconstruction from image correspon-
dences. In Proceedings 7th Scandinavian Conference on Image Analysis, pages 274-281,
1991.
10. Gunnar Sparr. Projective invariants for affine shapes of point configurations. In J.L.
Mundy and A. Zisserman, editors, Proceedings of DARPA-ESPRIT Workshop on Applica-
tions of Invariance in Computer Vision, pages 151-169, 1991.
11. Z. Zhang and O.D. Faugeras. A 3D world model builder with a mobile robot. International
Journal of Robotics Research, 1992. To appear.

This article was processed using the LaTeX macro package with ECCV92 style
Estimation of Relative Camera Positions for
Uncalibrated Cameras

Richard L Hartley
G.E. CRD, Schenectady, NY, 12301.

Abstract. This paper considers the determination of internal camera parameters from two views of a point set in three dimensions. A non-iterative algorithm is given for determining the focal lengths of the two cameras, as well as their relative placement, assuming all other internal camera parameters to be known. It is shown that this is all the information that may be deduced from a set of image correspondences.

1 Introduction

A non-iterative algorithm to solve the problem of relative camera placement was given
by Longuet-Higgins ([4]). However, Longuet-Higgins's solution made assumptions about
the camera that may not be justified in practice. In particular, it is assumed implicitly
in his paper that the focal length of each camera is known, as is the principal point (the
point where the focal axis of the camera intersects the image plane). Whereas it is often
a safe assumption that the principal point of an image is at the center pixel, the focal
length of the camera is not easily deduced, and will generally be unknown for images
of unknown origin. In this paper a non-iterative algorithm is given for finding the focal
lengths of the two cameras along with their relative placement, as long as other internal
parameters of the cameras are known. It follows from the derivation of the algorithm,
as well as from counting degrees of freedom that this is all the information that may be
deduced about camera parameters from a set of image correspondences.
In this paper, the term magnification will be used instead of focal length, since it
includes the equivalent effect of image enlargement.

2 The 8-Point Algorithm

First, I will derive the 8-point algorithm of Longuet-Higgins in order to fix notation and to gain some insight into its properties. Alternative derivations were given in [4] and [5]. Since we are dealing with homogeneous coordinates, we are interested only in values determined up to scale. Consequently we introduce the notation A ≈ B (where A and B are vectors or matrices) to indicate equality up to multiplication by a scale factor. Image space coordinates will usually be given in homogeneous coordinates as (u, v, w)^T.

2.1 Algorithm Derivation

We consider the case of two cameras, one of which is situated at the origin of object space coordinates, and one which is displaced from it. The two cameras may be represented by the transformations that they perform translating points from object space into image space coordinates. The two transformations are assumed to be

(u, v, w)^T = (x, y, z)^T   (1)

and

(u', v', w')^T = R ( (x, y, z)^T - (tx, ty, tz)^T )   (2)

where R is a rotation matrix, the vectors (u, v, w)^T and (u', v', w')^T are the homogeneous coordinates of the image points, and (x, y, z)^T and (tx, ty, tz)^T are non-homogeneous object space coordinates. Writing T = (tx, ty, tz)^T, and using homogeneous coordinates in both object and image space, the above relations may be written in matrix form as

(u, v, w)^T = (I | 0)(x, y, z, 1)^T = P1 (x, y, z, 1)^T   (3)

and

(u', v', w')^T = (R | -RT)(x, y, z, 1)^T = P2 (x, y, z, 1)^T   (4)

where (I | 0) and (R | -RT) are 3 × 4 matrices divided into a 3 × 3 block and a 3 × 1 column, and I is the identity matrix.
Now, I will define a transformation between the 2-dimensional projective plane of image coordinates in image 1 and the pencil of epipolar lines in the second image. As is well known, given a point (u, v, w)^T in image 1, the corresponding point in image 2 must lie on a certain epipolar line, which is the image under P2 of the set s of all points (x, y, z, 1)^T which map under P1 to (u, v, w)^T. To determine this line one may identify two points in s, namely the camera origin (0, 0, 0, 1)^T and the point at infinity, (u, v, w, 0)^T. The images of these two points under P2 are -RT and R(u, v, w)^T respectively, and the line that passes through these two points is given in homogeneous coordinates by the cross product,

(p, q, r)^T = RT × R(u, v, w)^T = R (T × (u, v, w)^T)   (5)

Here (p, q, r)^T represents the line pu' + qv' + rw' = 0. Representing by S the matrix

S = [ 0   -tz   ty ; tz   0   -tx ; -ty   tx   0 ]   (6)

equation (5) may be written as

(p, q, r)^T = R S (u, v, w)^T   (7)

Since the point (u', v', w')^T corresponding to (u, v, w)^T must lie on the epipolar line, we have the important relation

(u', v', w') Q (u, v, w)^T = 0   (8)

where Q = RS. This relationship is due to Longuet-Higgins ([4]).


As is well known, given 8 correspondences or more, the matrix Q may be computed
by solving a (possibly overdetermined) set of linear equations. In order to compute the
second camera transform, P2, it is necessary to factor Q into the product R S of a rotation
matrix and a skew-symmetric matrix. Longuet-Higgins ([4]) gives a rather involved, and
apparently numerically somewhat unstable method of doing this. I will give an alternative
method of factoring the Q matrix based on the Singular Value Decomposition ([1]). The
following result may be verified.

Theorem 1. A 3 × 3 real matrix Q can be factored as the product of a rotation matrix
and a non-zero skew-symmetric matrix if and only if Q has two equal non-zero singular
values and one singular value equal to 0.

A proof is contained in [2]. This theorem allows us to give an easy method of factoring
any matrix into a product RS, when possible.

Theorem 2. Suppose the matrix Q can be factored into a product RS where R is orthogonal
and S is skew-symmetric. Let the Singular Value Decomposition of Q be UDV^T where
D = diag(k, k, 0). Then up to a scale factor the factorization is one of the following:

S ≈ VZV^T ;  R ≈ UEV^T or UE^TV^T ;  Q ≈ RS .

where

E = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} , \qquad Z = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}   (9)

Proof. That the given factorization is valid may be verified by inspection. That these are the only
solutions is implicit in the paper of Longuet-Higgins ([4]). □

It may be verified that T (the translation vector) in Theorem 2 is equal to V(0, 0, 1)^T,
since this ensures that ST = 0 as required by (6). Furthermore ||T|| = 1, which is a con-
venient normalization suggested in [4]. As remarked by Longuet-Higgins, the correct
solution to the camera placement problem may be chosen based on the requirement
that the visible points be in front of both cameras ([4]). There are four possible rota-
tion/translation pairs that must be considered based on the two possible choices of R and
two possible signs of T. Therefore, since UEV^T V(0, 0, 1)^T = U(0, 0, 1)^T, the requisite
camera matrix P2 = (R | −RT) is equal to (UEV^T | −U(0, 0, 1)^T) or one of the obvious
alternatives.

2.2 Numerical Considerations

In any practical application, the matrix Q found will not factor exactly in the required
manner because of inaccuracies of measurement. In this case, the requirement will be to
find the matrix closest to Q that does factor into a product RS. Using the sum of squares
of matrix entries as a norm (Frobenius norm [1]), we wish to find the matrix Q' = RS
such that ||Q − Q'|| is minimized. The following theorem shows that the factorization
given in the previous theorem is numerically optimal.

Theorem 3. Let Q be any 3 × 3 matrix and Q = UDV^T be its Singular Value Decomposi-
tion in which D = diag(r, s, t) and r > s > t. Define the matrix Q' by Q' = UD'V^T where
D' = diag(k, k, 0) and k = (r + s)/2. Then Q' is the matrix closest to Q in Frobenius norm
which satisfies the condition Q' = RS, where R is a rotation and S is skew-symmetric.
Furthermore, the factorization is given up to sign and scale by R ≈ UEV^T or UE^TV^T
and S ≈ VZV^T.

This theorem is plausible given the norm-preserving property of orthogonal transfor-
mations. However, its proof is not entirely obvious and falls beyond the scope of this
paper.
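As an illustration of Theorem 3, a minimal numpy sketch of the projection (mine, not the paper's implementation) simply replaces the singular values (r, s, t) by (k, k, 0) with k = (r + s)/2:

    import numpy as np

    def closest_factorable(Q):
        # Frobenius-nearest matrix of the form RS (rotation times skew-symmetric)
        U, (r, s, t), Vt = np.linalg.svd(Q)
        k = (r + s) / 2.0
        return U @ np.diag([k, k, 0.0]) @ Vt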

2.3 Algorithm Outline

The algorithm for computing relative camera locations for calibrated cameras is as fol-
lows.

1. Find Q by solving a set of equations of the form (8).
2. Find the Singular Value Decomposition Q = UDV^T, where D = diag(a, b, c) and
   a > b > c.
3. The transformation matrices for the two cameras are P1 = (I | 0) and P2 equal to
   one of the four following matrices.

   (UEV^T |  U(0, 0, 1)^T)
   (UEV^T | −U(0, 0, 1)^T)
   (UE^TV^T |  U(0, 0, 1)^T)
   (UE^TV^T | −U(0, 0, 1)^T)
The choice between the four transformations for P2 is determined by the requirement
that the point locations (which may be computed once the cameras are known [4]) must
lie in front of both cameras. Geometrically, the camera rotations represented by UEV^T
and UE^TV^T differ from each other by a rotation through 180 degrees about the line
joining the two cameras. Given this fact, it may be verified geometrically that a single
pixel-to-pixel correspondence is enough to eliminate all but one of the four alternative
camera placements.
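For concreteness, here is a minimal numpy sketch of these three steps (my paraphrase, not the author's implementation; E is the matrix of (9), and the choice among the four candidates, based on point visibility, is left to the caller). The arrays x1 and x2 are assumed to hold the homogeneous image coordinates of the matched points, one row per match.

    import numpy as np

    E = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])                # as reconstructed in (9)

    def eight_point_Q(x1, x2):
        # Step 1: each match gives one linear equation (u',v',w') Q (u,v,w)^T = 0
        A = np.stack([np.outer(p2, p1).ravel() for p1, p2 in zip(x1, x2)])
        _, _, Vt = np.linalg.svd(A)                # least-squares null vector of A
        return Vt[-1].reshape(3, 3)

    def camera_candidates(Q):
        # Steps 2-3: SVD of Q and the four candidate matrices for P2 = (R | -RT)
        U, _, Vt = np.linalg.svd(Q)
        t = U @ np.array([0.0, 0.0, 1.0])          # the column U(0, 0, 1)^T
        rotations = (U @ E @ Vt, U @ E.T @ Vt)     # signs of U, V may need flipping
        return [np.hstack([R, sign * t.reshape(3, 1)])
                for R in rotations for sign in (1.0, -1.0)]

In practice Q would first be replaced by the nearest exactly factorable matrix Q', as in Theorem 3 above.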

3 Uncalibrated Cameras

If the internal camera calibration is not known, then the problem of finding the camera
parameters is more difficult. In general one would like to allow arbitrary non-singular
matrices K describing internal camera calibration and consider camera matrices of the
general form (KR | −KRT), that is, general 3 × 4 matrices. Because K is multiplied by a
rotation, R, it may be assumed that K is upper triangular. Allowing for an arbitrary scale
factor, there are 5 remaining independent entries in K representing camera parameters.
Other authors ([6]) have allowed four internal camera parameters, namely principal point
offsets in two directions and different scale factors in two directions. If however different
scaling is allowed in two directions not necessarily aligned with the direction of the
image-space axes, then one more parameter is needed, making up the 5.
It is too much to hope that from a set of image point correspondences one could
retrieve the full set of internal camera parameters for a pair of cameras as well as the
relative external positioning of the cameras. Indeed if {xi} are a set of points visible
in a pair of cameras with transform matrices P1 and P2, and G is an arbitrary non-
singular 4 × 4 matrix, then replacing each x_i by G^{-1}x_i and each camera P_j with P_j G
preserves the object-point to image-space correspondences. As may be seen, the internal
parameters of one of the cameras, P1 say, may be chosen arbitrarily. The situation is not
helped by adding more cameras. This is in contrast to the case of calibrated cameras
in which a finite number of solutions are possible ([2]). The question remains, therefore,
how much can be deduced about the internal camera parameters from a set of image
correspondences.
For uncalibrated cameras, a matrix Q can be defined, analogous to the matrix defined
for calibrated cameras, and this matrix may be computed given matched point pairs,
according to (8). It may be observed that however many pairs of matched points are
given, as far as determining camera models is concerned, the matrix Q encapsulates all
the information available, except as to which points lie behind or in front of the cameras.
As remarked above, the choice of the four possible relative camera placements may be
determined using just one matched point pair - the rest may be thrown away once Q
has been computed. To justify this observation it may be verified that a pair of matching

points (u, v, w)^T and (u', v', w')^T correspond to a possible placement of an object point
if and only if (u', v', w')Q(u, v, w) T = 0. This means that the addition of match points
beyond 8 does not add any further information except numerical stability. Now, Q has
only 7 degrees of freedom consisting of 9 matrix entries, less one for arbitrary scale
and one for the condition that det(Q) = 0. (Theorem 1 does not hold for uncalibrated
cameras.) Therefore, the total number of camera parameters that may be extracted from
a set of image-point correspondences does not exceed 7. As shown by Longuet-Higgins,
the relative camera placements account for 5 of these (not 6, since scale is indeterminate),
and this paper accounts for two more, the camera magnification factors. It is not possible
to extract any further information from Q, or hence from a set of matched points.

3.1 Form of the Q-matrix

Let K1 and K2 be two matrices representing the internal camera transformations of the
two cameras and let P1 = (K1 | 0) and P2 = (K2R | −K2RT) be the two camera trans-
forms. The task is to obtain R, T, K1 and K2 given a set of image-point correspondences.
For the present, the matrices K1 and K2 will be assumed arbitrary.
As before, it is possible to determine the epipolar line corresponding to a point
(u, v, w)^T in image 1. The two points that must lie on the epipolar line are the im-
ages under P2 of the camera centre (0, 0, 0, 1)^T of the first camera and the point at
infinity in the direction K1^{-1}(u, v, w)^T. Transform P2 takes these two points to the points −K2RT
and K2RK1^{-1}(u, v, w)^T. The line through these points is given by the cross product

K2RT × K2RK1^{-1}(u, v, w)^T   (10)

If K is a square matrix, we use the notation K* to represent the cofactor matrix of K,
that is, the matrix defined by K*_{ij} = (−1)^{i+j} det(K^{(ij)}), where K^{(ij)} is the matrix derived
from K by removing the i-th row and j-th column. If K is non-singular, then it is well
known that K* = det(K)·(K^T)^{−1}. In other words, K* ≈ (K^T)^{−1}. The cofactor matrix
is related to cross products in the following way.

Lemma 4. If a and b are 3-dimensional column vectors and K is a 3 × 3 matrix, then
Ka × Kb ≈ K*(a × b).
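Lemma 4 is also easy to verify numerically; the following small check (mine, not the paper's) confirms that Ka × Kb and K*(a × b) are parallel for random inputs.

    import numpy as np

    def cofactor(K):
        # K* computed as det(K) (K^T)^{-1}, valid for non-singular K
        return np.linalg.det(K) * np.linalg.inv(K.T)

    rng = np.random.default_rng(1)
    K = rng.normal(size=(3, 3))
    a, b = rng.normal(size=3), rng.normal(size=3)
    lhs = np.cross(K @ a, K @ b)
    rhs = cofactor(K) @ np.cross(a, b)
    print(np.cross(lhs, rhs))        # ~ (0, 0, 0): the two sides are parallel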
Using this fact it is easy to evaluate the cross product (10).

K2RT × K2RK1^{-1}(u, v, w)^T ≈ K2* R K1^{*−1}(K1T × (u, v, w)^T)   (11)

Now, writing S = S_{K1T} as defined in (6), we have a formula for the epipolar line corre-
sponding to the point (u, v, w)^T in image 1:

(p, q, r)^T ≈ K2* R K1^T S(u, v, w)^T   (12)

Furthermore, setting Q = K2* R K1^T S we have the formula

(u', v', w')Q(u, v, w)^T = 0 .   (13)

An alternative factorization for Q that may be derived from (10) and Lemma 4 is

Q ≈ (K2^{−1})^T RS K1^{−1}   (14)

where S = S_T as given by (6).



3.2 Factorization of Q
Our goal, given Q, is to find the factorization Q ≈ K2* R K1^T S. As before, we use the
Singular Value Decomposition, Q = UDW^T. By multiplying by −1 if necessary, U and V
may be chosen such that det(U) = det(V) = +1 so that U* = U and V* = V. Since Q is
singular, the diagonal matrix D equals diag(r, s, 0) where r and s are positive constants.
Since QW(0, 0, 1)^T = 0, it follows that SW(0, 0, 1)^T = 0 since K2* R K1^T is non-singular,
and so S ≈ WZW^T where Z is given in (9). The general solution to the problem of
factoring Q into a product R'S', where R' is non-singular and S' is skew-symmetric, is
therefore given by

Q = (U X_{α,β,γ} E^T W^T) · (W Z W^T)   (15)

where X_{α,β,γ} is given by

X_{α,β,γ} = \begin{pmatrix} r & 0 & α \\ 0 & s & β \\ 0 & 0 & γ \end{pmatrix}   (16)

and α, β and γ are arbitrary constants. The two bracketed expressions are R' and S'
respectively and the factorization is unique (except for the variables α, β and γ) up to
scale. In contrast to the situation in Section 2.1 we do not need to consider the alternate
solution in which E^T is replaced by E, since that is taken care of by the undetermined
values α, β and γ. Since both E and W are orthogonal matrices, we write V = WE, and
V is also orthogonal.
Now, we turn our attention to the matrix R' = U X_{α,β,γ} V^T. For some values of α, β
and γ, it must be true that R' ≈ K2* R K1^{*−1} where R is a rotation matrix. From this it
follows that R ≈ K2^{*−1} R' K1^*. We now apply the property that a rotation matrix is equal
to its cofactor matrix (inverse transpose). This means that K2^{*−1} R' K1^* ≈ (K2^{*−1} R' K1^*)*,
or

K2K2^T R' ≈ R'^* K1K1^T .   (17)

Since R' ≈ U X_{α,β,γ} V^T, it follows that R'^* ≈ U X*_{α,β,γ} V^T, where X*_{α,β,γ} is the matrix

X*_{α,β,γ} = \begin{pmatrix} sγ & 0 & 0 \\ 0 & rγ & 0 \\ −sα & −rβ & rs \end{pmatrix}   (18)

and so from (17)

(K2K2^T) U X_{α,β,γ} V^T ≈ U X*_{α,β,γ} V^T (K1K1^T) .   (19)

At this point, it is necessary to specialize to the case where K1 and K2 are of the
simple form K1 = diag(1, 1, k1) and K2 = diag(1, 1, k2). In this case, k1 and k2 are the
inverses of the magnification factors. If the entries of U X_{α,β,γ} V^T are (f_ij) and those of
U X*_{α,β,γ} V^T are (g_ij), then multiplying by (K2K2^T) and (K1K1^T) respectively gives an
equation

\begin{pmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ k_2^2 f_{31} & k_2^2 f_{32} & k_2^2 f_{33} \end{pmatrix} = x \begin{pmatrix} g_{11} & g_{12} & k_1^2 g_{13} \\ g_{21} & g_{22} & k_1^2 g_{23} \\ g_{31} & g_{32} & k_1^2 g_{33} \end{pmatrix}   (20)
where the f_ij and g_ij are linear expressions in α, β and γ, and x is an unknown scale factor.
The top left-hand block of (20) comprises a set of equations of the form

\begin{pmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{pmatrix} = x \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} .   (21)

If the scale factor were known, then this system could be solved for α, β and γ as a set
of linear equations. Unfortunately, x is not known, and it is necessary to find the value
of x before solving the set of linear equations. Since the entries of the matrices on both
sides of (21) are linear expressions in α, β and γ, it is possible to rewrite (21) in the form

M1(α, β, γ, 1)^T − x Mx(α, β, γ, 1)^T = 0 ,   (22)

where M1 and Mx are 4 × 4 matrices, each row of M1 or Mx corresponding to one of
the four entries in the matrices in (21). Such a set of equations has a solution only if
det(M1 − x Mx) = 0. This leads to a polynomial equation of degree 4 in x: p(x) =
det(M1 − x Mx) = 0. It will be seen later that this polynomial reduces to a quadratic.
The form of the matrix M1 may be written out explicitly. Let X_{α,β,γ} be written in
the form α·A13 + β·A23 + γ·A33 + (r·A11 + s·A22), where A_ij is the matrix having a one
in position i,j and zeros elsewhere. Then,
U X_{α,β,γ} V^T = α UA13V^T + β UA23V^T + γ UA33V^T + r UA11V^T + s UA22V^T .
It may be verified that the p,q-th entry of the matrix UA_ij V^T is equal to U_{pi}V_{qj}.
Now, suppose that the rows of M1 are ordered corresponding to the entries f_{11}, f_{12}, f_{21}
and f_{22} of U X_{α,β,γ} V^T. Then

\begin{pmatrix} f_{11} \\ f_{12} \\ f_{21} \\ f_{22} \end{pmatrix} =
\begin{pmatrix}
U_{11}V_{13} & U_{12}V_{13} & U_{13}V_{13} & r U_{11}V_{11} + s U_{12}V_{12} \\
U_{11}V_{23} & U_{12}V_{23} & U_{13}V_{23} & r U_{11}V_{21} + s U_{12}V_{22} \\
U_{21}V_{13} & U_{22}V_{13} & U_{23}V_{13} & r U_{21}V_{11} + s U_{22}V_{12} \\
U_{21}V_{23} & U_{22}V_{23} & U_{23}V_{23} & r U_{21}V_{21} + s U_{22}V_{22}
\end{pmatrix}
\begin{pmatrix} α \\ β \\ γ \\ 1 \end{pmatrix}   (23)

and M1 is the matrix in this expression. The exact form of the matrix Mx may be
computed in a similar manner.

Mx = \begin{pmatrix}
−s U_{13}V_{11} & −r U_{13}V_{12} & r U_{12}V_{12} + s U_{11}V_{11} & rs U_{13}V_{13} \\
−s U_{13}V_{21} & −r U_{13}V_{22} & r U_{12}V_{22} + s U_{11}V_{21} & rs U_{13}V_{23} \\
−s U_{23}V_{11} & −r U_{23}V_{12} & r U_{22}V_{12} + s U_{21}V_{11} & rs U_{23}V_{13} \\
−s U_{23}V_{21} & −r U_{23}V_{22} & r U_{22}V_{22} + s U_{21}V_{21} & rs U_{23}V_{23}
\end{pmatrix}   (24)
With the help of a symbolic algebraic manipulation program such as Mathematica ([7])
three identities may easily be established by direct computation:

det(Mx) = 0 ,  det(M1) = 0 ,  det(M1 + Mx) + det(M1 − Mx) = 0 .

From this it follows easily that p(x) = det(M1 − x Mx) = a_1 x + a_3 x^3. The root x = 0
of this polynomial may safely be ignored, since according to (21) it would imply that
f_ij = 0 for i, j ≤ 2, and hence that R is singular, which by assumption it is not. Thus p(x)
reduces to a quadratic as promised, and this quadratic has two roots of equal magnitude
and opposite sign. It is possible that p(x) has no real root, which indicates that no real
solution is possible given the assumed camera model. This may mean that the positions
of the principal points have been wrongly guessed. For a different value of each principal
point (that is, a translation of image space coordinates) a solution may be possible, but
the solution will be dependent on the particular translations chosen.
Supposing, however, that x is a real root of p(x), the values of α, β and γ may be
determined by solving the set of equations given in (21). Finally, the values of k1 and k2
may be read off from equation (20). In particular,

k_2^2 = x·g_{31}/f_{31} = x·g_{32}/f_{32}   (i)
k_1^2 = f_{13}/(x·g_{13}) = f_{23}/(x·g_{23})   (ii)   (25)
k_2^2 f_{33} = x·k_1^2 g_{33} .   (iii)

The apparent redundancy in the equations (25) is resolved by the following proposition.

Proposition 5.
1. If x is either of the roots of p(x), then the two expressions x·g_{31}/f_{31} and x·g_{32}/f_{32}
   for k_2^2 in (25.i) are equal. Similarly, the two expressions for k_1^2 in (25.ii) are equal
   and the relationship (25.iii) is always true.
2. The values k_1^2 and k_2^2 are either both positive or both negative.
3. The estimated values of k_2^2 corresponding to the two opposite roots of p(x) are the
   same. The same holds for the two values of k_1^2.
Proof of this proposition is beyond the scope of this paper. The case where k_1^2 and k_2^2 are
negative implies as before that no solution is possible. Once again, selecting a different
value for the principal points (origin of image-space coordinates) may lead to a solution.
At this point, it is possible to continue and compute the values of the rotation matrix
directly. However, it turns out to be more convenient, now that the values of the mag-
nification are known, to revert to the case of a calibrated camera. More particularly, we
observe that according to (14), Q may be written as Q = K2^{−T} Q' K1^{−1} where Q' = RS,
and R is a rotation matrix. The original method of Section 2.3 may now be used to solve
for the camera matrices derived from Q'. In this way, we find camera models P1 = (I | 0)
and P2 = (R | −RT) for the two cameras corresponding to Q'. Taking account of the
magnification matrices K1 and K2, the final estimates of the camera matrices are (K1 | 0)
and (K2R | −K2RT).
In practice it has been observed that greater numerical accuracy is obtained by re-
peating the computation of k1 and k2 after replacing Q by Q'. The values of k1 and k2
computed from Q' are very close to 1 and may be used to revise the computed magni-
fications very slightly. However, such a revision only compensates for numerical round-off
error in the algorithm and is not strictly necessary.

3.3 Algorithm Outline
Although the mathematical derivation of this algorithm is at times complex, the imple-
mentation is not particularly difficult. The steps of the algorithm are reiterated here.
1. Compute a matrix Q such that (u'_i, v'_i, 1) Q (u_i, v_i, 1)^T = 0 for each of several matched
   pairs (at least 8 in number) by a linear least-squares method.
2. Compute the Singular Value Decomposition Q ≈ UDW^T with det(U) = det(V) =
   +1 and set r and s equal to the two largest singular values. Set V = WE.
3. Form the matrices M1 and Mx given by (23) and (24) and compute the determinant
   p(x) = det(M1 − x Mx) = a_1 x + a_3 x^3.
4. If −a_1/a_3 < 0 no solution is possible, so stop. Otherwise, let x = √(−a_1/a_3), one of
   the roots of p(x).
5. Solve the equation (M1 − x Mx)(α, β, γ, 1)^T = 0 to find α, β and γ and use these
   values to form the matrices X_{α,β,γ} and X*_{α,β,γ} given by (16) and (18).
6. Form the products U X_{α,β,γ} V^T and U X*_{α,β,γ} V^T and observe that the four top left
   elements of these matrices are the same.
7. Compute k1 and k2 from the equations (25) where (f_ij) and (g_ij) are the entries of
   the matrices U X_{α,β,γ} V^T and U X*_{α,β,γ} V^T respectively. If k1 and k2 are imaginary,
   then no solution is possible, so stop.
8. Compute the matrix Q' = K2QK1 where K1 and K2 are the matrices diag(1, 1, k1)
   and diag(1, 1, k2) respectively.
9. Compute the Singular Value Decomposition of Q' = U'D'V'^T.
10. Set P1 = (K1 | 0) and set P2 to be one of the matrices
    (K2U'EV'^T |  K2U'(0, 0, 1)^T)
    (K2U'E^TV'^T |  K2U'(0, 0, 1)^T)
    (K2U'EV'^T | −K2U'(0, 0, 1)^T)
    (K2U'E^TV'^T | −K2U'(0, 0, 1)^T)
    according to the requirement that the matched points must lie in front of both cam-
    eras.
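Assuming the reconstructions of E, (16), (18) and (23)-(25) given above, steps 2-7 translate into a few dozen lines of numpy. The sketch below is my own reconstruction (the original implementation was in C and is not reproduced in the paper); it returns k1 and k2, the inverse magnifications, or None when no real solution exists.

    import numpy as np

    E = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])                    # as reconstructed in (9)

    def magnification_factors(Q):
        # Step 2: Q ~ U diag(r, s, 0) W^T, with det(U) = det(W) = +1 and V = WE.
        U, D, Wt = np.linalg.svd(Q)
        if np.linalg.det(U) < 0: U = -U
        if np.linalg.det(Wt) < 0: Wt = -Wt
        r, s = D[0], D[1]
        V = Wt.T @ E

        # Step 3: rows of M1 and Mx for the entries f11, f12, f21, f22 (eqs 23-24).
        M1, Mx = np.zeros((4, 4)), np.zeros((4, 4))
        for row, (p, q) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
            M1[row] = [U[p, 0] * V[q, 2], U[p, 1] * V[q, 2], U[p, 2] * V[q, 2],
                       r * U[p, 0] * V[q, 0] + s * U[p, 1] * V[q, 1]]
            Mx[row] = [-s * U[p, 2] * V[q, 0], -r * U[p, 2] * V[q, 1],
                       r * U[p, 1] * V[q, 1] + s * U[p, 0] * V[q, 0],
                       r * s * U[p, 2] * V[q, 2]]

        # p(x) = det(M1 - x Mx) reduces to a1 x + a3 x^3 (see text); recover a1, a3
        # from the two samples p(1) and p(2).
        p1, p2 = np.linalg.det(M1 - Mx), np.linalg.det(M1 - 2.0 * Mx)
        a3 = (p2 - 2.0 * p1) / 6.0
        a1 = p1 - a3
        if -a1 / a3 < 0:                               # step 4: no real root
            return None
        x = np.sqrt(-a1 / a3)

        # Step 5: null vector of (M1 - x Mx), normalized so its last entry is 1.
        _, _, Nt = np.linalg.svd(M1 - x * Mx)
        alpha, beta, gamma = (Nt[-1] / Nt[-1, 3])[:3]

        # Steps 6-7: form U X V^T and U X* V^T and read k1, k2 off (25).
        X = np.array([[r, 0.0, alpha], [0.0, s, beta], [0.0, 0.0, gamma]])   # (16)
        Xstar = np.array([[s * gamma, 0.0, 0.0],
                          [0.0, r * gamma, 0.0],
                          [-s * alpha, -r * beta, r * s]])                   # (18)
        F, G = U @ X @ V.T, U @ Xstar @ V.T
        k2sq = x * G[2, 0] / F[2, 0]                   # (25.i)
        k1sq = F[0, 2] / (x * G[0, 2])                 # (25.ii)
        if k1sq < 0 or k2sq < 0:
            return None                                # no real solution (Prop. 5)
        return np.sqrt(k1sq), np.sqrt(k2sq)

Steps 8-10 then reuse the calibrated procedure of Sect. 2.3 on Q' = K2 Q K1.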

4 Practical Results

This algorithm has been encoded in C and tested on a variety of examples. In the first test,
a set of 25 matched points was computed synthetically, corresponding to an oblique place-
ment of two cameras with equal magnification values of 1003. The principal point offset
was assumed known. The solution to the relative camera placement problem was com-
puted. The two cameras were computed to have magnifications of 1003.52 and 1003.71,
very close to the original. Camera placements and point positions were computed and
were found to match the input pixel position data within limits of accuracy. Similarly,
the positions in 3-space of the object points matched the known positions to within one
part in 10^4.
The algorithm was also tested out on a set of matched points derived from a stereo-
matching program, STEREOSYS ([3]). A set of 124 matched points were found by an
unconstrained hierarchical search. The two images used were 1024 x 1024 aerial overhead
images of the Malibu region with about 40% overlap. The algorithm described here was
applied to the set of 124 matched points and relative camera placements and object-point
positions were computed. The computed model was then evaluated against the original
data. Consequently, the computed camera models were applied to the computed 3-D
object points to give new pixel locations which were then compared with the original
reference pixel data. The RMS pixel error was found to be 0.11 pixels. In other words,
the derived model matches the actual data with a standard deviation of 0.11 pixels. This
shows the accuracy not only of the derived camera model, but also the accuracy of the
point-matching algorithms.

References
1. Atkinson, K. E., "An Introduction to Numerical Analysis," John Wiley and Sons, New
York, Second Edition 1989.
2. Faugeras, O. and Maybank, S., "Motion from Point Matches : Multiplicity of Solutions,"
International Journal of Computer Vision, 4, (1990), 225-246.
3. Hannah, M.J., "Bootstrap Stereo," Proc. Image Understanding Workshop, College Park,
MD, April 1980, 201-208.
4. Longuet-Higgins, H. C., "A computer algorithm for reconstructing a scene from two pro-
   jections," Nature, Vol. 293, 10 Sept. 1981.
5. Porrill, J and Pollard, S., "Curve Fitting and Stereo Calibration," Proceedings of the British
Machine Vision Conference, University of Oxford, Sept 1990, pp 37-42.
6. Strat, T. M., "Recovering the Camera Parameters from a Transformation Matrix," Readings
in Computer Vision, pp 93-100, Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1987.
7. Wolfram, S., "Mathematica: A System for Doing Mathematics by Computer," Addison-
Wesley, Redwood City, California, 1988.
This article was processed using the LaTeX macro package with ECCV92 style
Gaze Control for a Binocular Camera Head
James L. CROWLEY, Philippe BOBET and Mouafak MESRABI
LIFIA (IMAG), 46 Ave Felix Viallet, 38031 Grenoble, France

Abstract. This paper describes a layered control system for a binocular stereo head.
It begins with a discussion of the principles of layered control and then describes the
mechanical device for a binocular camera head. A device level controller is presented
which permits an active vision system to command the position of the gaze point. The
final section describes experiments with reflexive control of focus, iris and vergence.

1. I n t r o d u c t i o n
During the last few years, there has been a growing interest in the use of active control of
image formation to simplify and accelerate scene understanding. Basic ideas which were
suggested by [Bajcsy 88] and [Aloimonos et al. 87] have been extended by several groups.
Examples include [Ballard 91], and [Eklundh 92]. Brown [Brown 90] has demonstrated how
multiple simple behaviours may be used for control of saccadic, vergence, vestibulo-ocular
reflex and neck motion.

This trend has grown from several observations. For example, Aloimonos and others observed
that vision cannot be performed in isolation. Vision should serve a purpose [Aloimonos 87],
and in particular should permit an agent to perceive its environment. This leads to a view of a
vision system which operates continuously and which must furnish results within a fixed
delay. Rather than obtain a maximum of information from any one image, the camera is an
active sensor giving signals which provide only limited information about the scene. Bajcsy
[Bajcsy 88] observed that many traditional vision problems, such as stereo matching, could
be solved with low complexity algorithms by using controlled sensor motion. Examples of
such processes were presented by Krotkov [Krotkov 90]. Ballard [Ballard 88] and Brown [Brown
90] demonstrated this principle for the case of stereo matching by restricting matching to a
short range of disparities close to zero, and then varying the camera vergence angles.

The development of binocular camera heads and an integrated vision system has opened a line
of cooperation between the scientific communities of biological vision, machine vision and
robotics. This paper is concerned with a robotics problem posed by such devices: How to
organize the control architecture. We will argue for a layered control architecture in which a
"gaze point" may be commanded by an external process or driven by simple measurements of
information from the scene.

2. A Layered Control Architecture for a Binocular Head


The control system for a robotic device may be organised as layers of control loops, where
layers are defined by the abstraction of the control data and the cycle time of the control loop
[Crowley 87]. Robot vehicles and robot arms often have layers of control loops at four levels:
Motor, Device, Actions, and Tasks. We have found that such a layered control architecture
maps quite naturally onto the control of a binocular head, as shown in figure 2.1.

[Block diagram: the SAVA Mail Box System and the SAVA Camera Control Unit Interface sit above the intermediate controller layers, which in turn drive the Motor Controllers.]

Figure 2.1 A Layered Architecture for Control of a Binocular Head in the SAVA System.

The motor level is concerned with control parameters defined by the motor
shaft. Sensors (typically optical encoders) provide information in terms of position and angular
speed of the motor. Commands are generated in terms of motor position, speed and perhaps
acceleration. Typical control cycle times for robotics are on the order of 1 to 10 milliseconds.
The motor level is typically controlled by a form of PID Controller.

The device level is concerned with the geometric and dynamic state of the
entire device. Control cycles for robotics applications are typically on the order of 10 to 100
milliseconds. For device independence, it is useful to design a controller for an idealized "virtual
device". Our virtual head is based on controlling an "gaze-point" defined as the intersection of
the optical axes. The mapping from the virtual head to a particular mechanical head is
performed by a translation layer between the device controller and the motor controllers. The
device level also permits any of the axes of the virtual head to be directly controlled.

The action level concerns procedural control of the device state based on
measurements taken from sensor signals. An action will drive the device through a sequence of
states. The action level for a binocular head involves control of head motion and optical
parameters. The action level often involves control cycles of 0.1 to 1.0 seconds.

Task level: A task level controller has a goal expressed in terms of a symbolic state. The
task level controller chooses actions to bring the device and the environment to the desired
state. The selection of actions is based on a symbolic description of the preconditions and
results of actions, as well as a description of the current state of the device and environment.
This leads us to propose a control cycle composed of three phases: Evaluation (of state),
Selection (of an action), and Execution (of an action).

3. A Binocular Camera Head


This section describes the LIFIA/SAVA binocular head and its motor control system. Our head
has been designed to minimize weight and complexity while maximizing precision. In
particular, actuation is provided by small lightweight DC motors. These motors give high
precision and light weight, at the cost of speed. This head is shown in figure 3.1.

Figure 3.1 The LIFIA/SAVA Binocular Camera Head

The vergence mechanism is mounted on the sixth axis of a robot manipulator. The two
cameras are mounted on small platforms that pivot about a point underneath the camera lens. A
precision adjustment screw permits the camera to be moved forward or backward so as to
position the optical center under the rotation point. The gearing on the motors provides
approximately 3 encoder counts per arc second (12.5 encoder counts per pixel), over a range of
20°. The sprocket gears have been mounted on the focus and aperture rings of 25 mm f1.8 C-
mount lenses. The gearing on the ring and motors provides approximately 15 000 counts over
the full range of movements for the focus and approximately 12 000 counts for the aperture.

A six axis manipulator serves as a "neck" for this head. The head and neck are mounted on a
mobile robot, permitting experiments in vision guided motion. The neck is mounted at a
point which is midway between the power wheels of the vehicle. This point serves as the
origin for both the vehicle and arm coordinate systems. The neck permits us to command the
position and orientation of the camera in coordinates which are relative to the position of the
mobile robot. The mobile robot provides us with an estimate of its position and orientation in
an arbitrary world coordinate system.

Standard PID motor control software has been developed for the three head motor micro-
processors and burned onto ROM. The software protocol for each of the three motor
controllers is the same, except that the maximal values for each motor controller depends on
the axis. The protocol for the motors permits initialization, incremental and absolute
movement, immediate stop, and interrogation of the current motor position (in encoder counts).
The protocol is written in such a way that a command may be issued at any time, and that a
new movement command will replace a current command.

4. A Device Level Controller for a Binocular Head


One of our design goals is that the head controller provide a general control protocol which can
be easily transported to different mechanical configurations. For this reason, we have defined
our device controller to command a "virtual head".

The head controller has four components: Protocol Interpretation (1), State Estimation (2),
Command Generation (3) and Translation (4). These four components are illustrated in figure
4.1 and described in the following sections.

[Block diagram: Protocol Interpretation (1), State Estimation (2), Command Generation (3) and Translation (4) from the virtual head to the real head, linked by the commanded and estimated state parameters.]
Figure 4.1 Components of the Binocular Head Device Controller

4.1 A Virtual Head


The virtual head is an idealized mechanical structure which should be general enough to map
onto the kinematic structure of many physical heads. Any information which is specific to a
particular head should be included in the "translator". The motor commands for each of the axes
of the virtual head are made available at the protocol level, in an idealized form. That is,
commands are expressed in meters and degrees instead of encoder counts. This level also offers
the control of a head-centered "gaze-point", defined by the 3-D position at which the optical
axes of the two cameras intersect.

In order to accommodate any configuration of axes, a device table is defined. This device table
is built up from a dynamically allocated structure called an "Axis". Initialization of this
structure defines which axes are present, their units, the initial value for that axis, the
conversion factor from encoder counts, and their maximum and minimum values. Subsequent
access to that axis may either be based on the index of the entry in the table, or by association
with the axis name. The depth and angle to the gaze point are treated as axes.
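The paper does not list the structure itself; the sketch below is my guess at a minimal equivalent (in Python rather than the original implementation language), with field names and numerical values chosen for illustration only.

    from dataclasses import dataclass

    @dataclass
    class Axis:
        name: str                # e.g. "left_vergence", "pan", "gaze_depth"
        units: str               # idealized units: degrees or meters
        value: float             # current position in idealized units
        counts_per_unit: float   # conversion factor from encoder counts
        min_value: float
        max_value: float

    # The device table maps axis names to entries; the depth and angle to the
    # gaze point are treated as axes just like the physical ones.
    device_table = {
        "left_vergence": Axis("left_vergence", "degrees", 0.0,
                              3.0 * 3600.0,      # ~3 counts per arc second (Sect. 3)
                              -10.0, 10.0),      # 20 degree range (Sect. 3)
        "gaze_depth": Axis("gaze_depth", "meters", 1.0, 1.0, 0.2, 10.0),
    }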

Absolute and incremental moves change the reference for the axis control. At the end of each
cycle, the translator scans the list of axes and updates the current position. Whenever the
commanded position is different from the current position, the commanded value is transformed
to encoder counts and a move is issued to the motor controller. A command to an axis which is
not currently in the head table will trigger a negative acknowledgement.

4.2 Estimating the Gaze Point


The important state component for a binocular head is the gaze point, defined by the
intersection of the optical axes. Knowledge of the state of the neck permits us to transform the
gaze point to 3D coordinates whose origin is at the base of the head. A major role of the head
controller is estimating the current gaze point (process 2 in figure 4.1) and controlling the
current gaze point (process 3 in figure 4.1). We will start by defining the gaze point in polar
coordinates whose origin is the midpoint of the baseline between the two cameras. We will then
develop the transformation to a fixed reference frame centered at the base of the head.

Let us derive formulas for determining the gaze point within an "eye centered" 2D polar
coordinate system. This eye centered coordinate system has its origin mid-way between the
optical centers of a pair of stereo cameras. Let us define the X axis as coincident with the
baseline, and the Y axis as perpendicular to the baseline and in the plane defined by the optical
axes. Let the separation of the cameras be a distance 2B so that the locations of the optical
centers are defined as the points (B, 0) and (−B, 0). Furthermore, let the optical axes be
located in the (X, Y) plane with angles of αl and αr.

Figure 4.2 The gaze point P is the intersection of the optical axes.
The equation of the left optical axis in the plane defined by the baseline and the optical axis is:

X sin(αl) − Y cos(αl) + B sin(αl) = 0.

The right optical axis is described by: X sin(π − αr) − Y cos(π − αr) − B sin(π − αr) = 0.
Since sin(π − αr) = sin(αr) while cos(π − αr) = −cos(αr), the latter equation reduces to

X sin(αr) + Y cos(αr) − B sin(αr) = 0.

The position of the fixation point, defined by the intersection of the optical axes, can be
calculated as the sum and difference of these two equations. This gives

X = B [cos(αl) sin(αr) − sin(αl) cos(αr)] / [cos(αl) sin(αr) + sin(αl) cos(αr)] = −B sin(αl − αr) / sin(αl + αr)   (4.1)

Y = 2B sin(αl) sin(αr) / [cos(αl) sin(αr) + sin(αl) cos(αr)] = 2B sin(αl) sin(αr) / sin(αl + αr)   (4.2)

In the case of symmetric vergence angles, α = αl = αr, the two equations become:

X = 0,   Y = 2B sin(α) / (2 cos(α)) = B tan(α)

In polar coordinates (Dc, αc), the distance and vergence angle to the gaze point can be expressed as

Dc = √(X² + Y²),   αc = tan⁻¹(Y/X) = tan⁻¹(2 sin(αl) sin(αr) / sin(αr − αl))   (4.3)
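In code, equations (4.1)-(4.3) amount to a few lines; the following Python sketch is my transcription (angles in radians, measured from the baseline as in Figure 4.2), not the authors' implementation.

    from math import sin, atan2, hypot

    def gaze_point(alpha_l, alpha_r, B):
        # Intersection of the two optical axes in the eye-centered frame.
        denom = sin(alpha_l + alpha_r)
        X = -B * sin(alpha_l - alpha_r) / denom             # (4.1)
        Y = 2.0 * B * sin(alpha_l) * sin(alpha_r) / denom   # (4.2)
        Dc, alpha_c = hypot(X, Y), atan2(Y, X)              # polar form (4.3)
        return X, Y, Dc, alpha_c

For symmetric vergence, alpha_l = alpha_r = α, this reduces to X = 0 and Y = B tan(α) as above; with B = 0.1 m and α = 85° from the baseline, for instance, the gaze point lies about 1.14 m in front of the baseline.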



Equation 4.3 describes a gaze point as if at the end of a telescopic stick which we can extend
and point within a plane. We can solve for the position of the end of this stick in the scene
using the position and orientation of the head and the state of the binocular head. The head
"state" parameters on which the gaze point depends are the distance (Dc), the azimuth gaze angle
or pan (αg) and the elevation gaze angle, or tilt (βg). The gaze azimuth angle, αg, is defined
as the sum of the head pan αh and a common vergence angle αc:

αg = αh + αc

The elevation angle depends on the state of the manipulator "neck". These values constitute a
polar expression of the gaze point with respect to a head centered coordinate frame.
Transformation to Cartesian form is quite simple. We consider that the pan and tilt axes of the
head are located at Cartesian coordinates (Xh, Yh, Zh). The position of the gaze point is
determined by:

Xg = Xh + Dc cos(αg)
Yg = Yh + Dc sin(αg)
Zg = Zh + Dc sin(βg)

Both the polar and Cartesian forms of the gaze point are stored in a data structure that defines
the head "state". The Cartesian values are computed from the polar values. Commanded values
may be set by messages from other processes. The difference between a commanded value and a
current position triggers the translator to call a device specific procedure to move the necessary
real axes.

Command of the gaze point involves controlling an under-constrained system of motors. Our
solution is to simultaneously drive each of the axes with a common error term. In order to assure
stability, we must assure that the sum of the gain terms for the redundant axes is less than one.
Thus each motor moves with its characteristic speed, and the system converges to the specified
gaze point in an over-damped manner. The motor gains are tuned to assure stable convergence
over the range of motions.

5. Reflexive Control of Ocular Parameters


This section concerns action level control of the ocular parameters of aperture, focus, and
convergence. The measures which are described below are based on smoothed versions of the
image produced by a binomial pyramid. We have found that a multiple resolution image
description provides a computational support which renders our image measures both more
stable and more efficient. Our multiple resolution representation is computed using a fast
"optimal S/N" binomial pyramid [Chehikian-Crowley 91]. That is, the image is convolved
with a cascade of binomial filters based on the kernel [ 1 2 1], to produce a set of 14
resampled images, numbered 1 to 14. The standard deviation for the level k of this pyramid is
given by an exact formula:

Figure 5.1 Example of the measure for aperture.

5.1 Control of Aperture.


We have found that maximizing the variance of pixels in the region of interest provides a
robust estimator for aperture. Figure 5.1 shows an example of the measure over the range of
aperture values.

5.2 Focus
It is well known that focus can be controlled by the "sharpness" of contrast. The problem is
how to measure such "sharpness". In [Krotkov 87] we can find a description of several methods
for measuring image sharpness. Horn [Horn 65] proposes to maximize the high-frequency
energy in the power spectrum. Jarvis proposes to sum the magnitude of the first derivative of
neighboring pixels along a scan line [Jarvis 83]. Schlag [Schlag et al. 82] and Krotkov
[Krotkov 87] propose to sum the squared gradient magnitude. Tenenbaum [Tenenbaum 82] and
Schlag compare gradient magnitude to a threshold and sum only those pixels which are
above a threshold. The problem is then the choice of such a threshold. We have found that such
a measure performs poorly. After experiments with several measures, we have found our best
results with the sum of gradient magnitude, without the use of the threshold.

We measure image gradient at level five of our low-pass pyramid, providing a binomial
smoothing window with a fixed standard deviation. Gradient is calculated using compositions
of the filter [1 0 -1] in the row and column directions. By default, the "region of interest" is
at the center of the image, but this region may be placed anywhere in the image by a message
from another software module. Local extrema in the gradient magnitude are summed within the
region of interest. An initialize command causes focus to look for a global maximum in this
sum. Subsequently, the reflex action seeks to keep the focus at a local maximum. Note that
this measure exhibits a plateau around the proper focal value. This region corresponds to the
"depth of field". Reducing the aperture will enlarge the depth of field and thus enlarge this
plateau.
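The two measures are simple enough to sketch directly; the following numpy version is my own paraphrase of the variance measure of Sect. 5.1 and of the gradient-extrema measure just described (pyramid level and region-of-interest selection are left to the caller).

    import numpy as np

    def aperture_measure(roi):
        # Variance of the pixels in the region of interest (to be maximized).
        return float(np.var(roi))

    def focus_measure(roi):
        # Sum of the local maxima of the gradient magnitude inside the ROI,
        # with gradients computed by the filter [1 0 -1] in rows and columns.
        f = roi.astype(float)
        gx, gy = np.zeros_like(f), np.zeros_like(f)
        gx[:, 1:-1] = f[:, 2:] - f[:, :-2]
        gy[1:-1, :] = f[2:, :] - f[:-2, :]
        g = np.hypot(gx, gy)
        interior = g[1:-1, 1:-1]
        is_max = ((interior >= g[:-2, 1:-1]) & (interior >= g[2:, 1:-1]) &
                  (interior >= g[1:-1, :-2]) & (interior >= g[1:-1, 2:]))
        return float(interior[is_max].sum())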

Figure 5.2 shows the values obtained by this measure. The camera was pointed at the
boundary between a dark and a gray face of a calibration cube, at a distance of approximately 1
meter. The sum of the gradient extrema was computed within a 20 by 20 pixel region centered over
this boundary at level 1 of our pyramid, and the focus was scanned over the range of values. At
each focus setting, the sum of extrema was calculated.


Figure 5.2 Example of measure for focus.

5.3 Control of Convergence.


Our collaborators at the University of Linkoping have found that a very robust measure for
convergence is provided by the difference in phase of the correlation of even and odd filters
with the image [Westelius et al 91]. They have demonstrated vergence control of a simulated
head using even and odd Gabor-like filters. In collaboration with them, we have determined that
a reasonable approximation of the phase may be obtained using first and second derivative
filters [1 0 -1] and [1 0 -2 0 1]. While the phase measured in this way is not linear with
position, it does seem to be monotonic.

We exploit the multiple resolution pyramid to converge on an object in a coarse to fine


manner. An image row and initial column positions in the two cameras are selected for
convergence. We measure the phase at this row in the two cameras at level 9 of our pyramid.
The phase provides a shift in each image. This shift is then used to compute the column for the
next higher resolution level. The process is repeated at each level. The final shift in each image
is converted from pixels to encoder counts for the vergence motors and pan motors. The sum of
the shifts is used to compute a pan motion for axis 6. The difference is used to compute a
vergence angle for the two vergence motors. The process repeats for each pair of images which
are taken by the image acquisition module.
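The loop itself is simple; the sketch below is a structural paraphrase of the procedure just described (not the authors' code). Here local_phase is a hypothetical helper standing in for the even/odd filter phase estimate, the even split of the disparity between the two images is my own assumption, and the conversion from pixels to encoder counts (roughly 12.5 counts per pixel for the vergence motors, Sect. 3) is left to the caller.

    def vergence_shifts(left_pyr, right_pyr, row, col_l, col_r,
                        local_phase, coarsest=9, finest=1):
        # Coarse-to-fine estimate of the pixel shifts in the two images.
        shift_l = shift_r = 0.0
        for level in range(coarsest, finest - 1, -1):
            d = (local_phase(left_pyr[level], row, col_l) -
                 local_phase(right_pyr[level], row, col_r))
            shift_l, shift_r = 0.5 * d, -0.5 * d     # split the disparity evenly
            # propagate the columns to the next, finer level (sampling doubles)
            col_l, col_r = 2 * (col_l + shift_l), 2 * (col_r + shift_r)
        # the sum drives the pan axis, the difference the two vergence motors
        return shift_l + shift_r, shift_l - shift_r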

6. Conclusions
In this paper we have presented a layered control architecture for a binocular head. We began by
discussing the principles of layered control. We then presented the mechanical and motor
control architecture for the LIFIA/SAVA binocular head.

Section 4 of this paper was concerned with a device level controller. In particular we developed
the control system for estimating and controlling the device state in terms of a "gaze point"
which can be used to explore the scene. By defining a virtual head, we are able to provide a
general head protocol. This head controller should allow algorithms to be easily ported between
different heads, even when the axes are not configured the same.

In section 5 we described some preliminary work on measures for controlling focus, aperture,
and vergence. The measures which we presented are all simple, stable, and of low
computational complexity. The development of such control techniques is a necessary
component in the construction of real time active vision systems.

Bibliography

[Aloimonos 87] J.Y. Aloimonos, I. Weiss and A. Bandopadhay, "Active Vision",


International Journal of Computer Vision, pp. 333-356, 1987.

[Bajcsy 88] R. Bajcsy, "Active Perception", IEEE Proceedings, Vol 76, No 8, pp. 996-
1006, August 1988.

[Ballard 88] Ballard, D.H. and Ozcandarli, A., "Eye Fixation and Early Vision: Kinematic
Depth", IEEE 2nd Intl. Conf. on Comp. Vision, Tarpon Springs, Fla., pp. 524-531, Dec.
1988.

[Ballard 91] D. Ballard, "Animate Vision", Artificial Intelligence, Vol 48, No. 1, pp. 1-27,
February 1991.

[Brown 90] C. Brown, "Prediction and Cooperation in Gaze Control", Biological


Cybernetics 63, 1990.

[Clark and Ferrier 88] Clark, J. and Ferrier, N., "Modal Control of an Attentive Vision
System", IEEE 2nd Intl. Conf. on Comp. Vision, Tarpon Springs, Fla., pp. 514-523, Dec.
1988.

[Crowley 87] Crowley, J. L., "Coordination of Action and Perception in a Surveillance


Robot", ~ , Vol 2(4), pp 32-43 Winter 1987, (Also appeared in IJCAI-87).

[Crowley 89] Crowley, J. L., "Asynchronous Control of Orientation and Displacement in a


Robot Vehicle", IEEE Conference on Robotics and Automation, Scottsdale AZ, May 1989.

[Crowley 91] Crowley, J. L. "Towards Continuously Operating Integrated Vision Systems


for Robotics Applications", SCIA-91, Seventh Scandinavian Conference on Image Analysis,
Aalborg, August 91.

[Eklundh 92] J.O. Eklundh and K. Pahlavan, "Head, Eye and Head-Eye System", SPIE
Applications of AI X: Machine Vision and Robotics, Orlando, Fla. April 92 (to appear).

[Horn 68] Horn, B. K. P., "Focussing", MIT Artificial Intelligence Lab Memo No. 160,
May 1968.

[Jarvis 83] Jarvis, R. A., "A Perspective on Range Finding Techniques for Computer
Vision", IEEE Trans. on PAMI 3(2), pp 122-139, March 1983.

[Krotkov 87] Krotkov, E., "Focusing", International Journal of Computer Vision, 1, p223-
237(1987).

[Krotkov 90] Krotkov, E., Henriksen, K. and Kories, R., "Stereo Ranging from Verging
Cameras", IEEE Trans on PAMI, Vol 12, No. 12, pp. 1200-1205, December 1990.

[Schlag et al. 83] Schlag, J., A. C. Sanderson, C. P. Neumann, and F. C. Wimberly,
"Implementation of Automatic Focussing Algorithms for a Computer Vision System with
Camera Control", CMU-RI-TR-83-14, August, 1983.

[Westelius et al. 91] Westelius, C. J., H. Knutsson, and G. H. Granlund, "Focus of
Attention Control", SCIA-91, Seventh Scandinavian Conference on Image Analysis, Aalborg,
August 91.
Computing Exact Aspect Graphs of Curved Objects:
Algebraic Surfaces*
Jean Ponce¹, Sylvain Petitjean¹, and David J. Kriegman²
¹ Dept. of Computer Science, University of Illinois, Urbana, IL 61801, USA
² Dept. of Electrical Engineering, Yale University, New Haven, CT 06520, USA

Abstract. This paper presents an algorithm for computing the exact aspect graph of an opaque solid bounded by a smooth algebraic surface and
observed under orthographic projection. The algorithm uses curve tracing,
cell decomposition, and ray tracing to construct the regions of the view
sphere delineated by visual events. It has been fully implemented, and ex-
amples are presented.

1 Introduction

The aspect graph [25] is a qualitative, viewer-centered representation that enumerates all
possible appearances of an object: The range of all possible viewpoints is partitioned into
maximal regions such that the structure of the image contours, also called the aspect, is
the same from every viewpoint in a region. The change in the aspect at the boundary
between regions is named a visual event. The maximal regions and their boundaries are
organized into a graph, whose nodes represent the regions with their associated aspects
and whose arcs correspond to the visual event boundaries between adjacent regions.
Since their introduction by Koenderink and Van Doorn [25] more than ten years ago,
aspect graphs have been the object of very active research. The main focus has been on
polyhedra, whose contour generators are viewpoint-independent. Indeed, approximate
aspect graphs of polyhedra have been successfully used in recognition tasks [7, 18, 20],
and several algorithms have been proposed for computing the exact aspect graph of these
objects [6, 15, 16, 31, 36, 38, 39, 41, 42].
Recently, algorithms for constructing the exact aspect graph of simple curved objects
such as solids bounded by quadric surfaces [8] and solids of revolution [11, 12, 26] have
also been introduced. For more complex objects, it was recognized from the start that
the necessary theoretical tools could be found in catastrophe theory [1, 5, 25]. However,
algorithms based on these tools have, until very recently, remained elusive: Koenderink
[24] and Kergosien [23] show the view sphere curves corresponding to the visual events of
some surfaces, but, unfortunately, neither author details the algorithm used to compute
these curves. Rieger [35] uses cylindrical algebraic decomposition to compute the aspect
graph of a quartic surface of the form z = f(x, y).
This paper is the third in a series on the construction of exact aspect graphs of smooth
objects, based on the catalogue of possible visual events established by Kergosien [22]
(see [33, 40] for the case of piecewise-smooth objects). Previously, we presented a fully
implemented algorithm for solids of revolution whose generating curve is polynomial [26]
(see [11, 12] for a different approach to the same problem), and reported preliminary
results for polynomial parametric surfaces [32]. Here, we present a fully implemented
algorithm for computing the aspect graph of an opaque solid bounded by parametric or

* This work was supported by the National Science Foundation under Grant IRI-9015749.

implicit smooth algebraic surfaces, observed under orthographic projection (see [23, 24,
35, 37] for related approaches).
This algorithm is described in Sect. 3. It relies on a combination of symbolic and
numerical techniques, including curve tracing and cell decomposition [29], homotopy
continuation [30], and "symbolic" ray tracing [21, 28]. An implementation is described
in Sect. 4, and examples are presented (Figs. 4,5). Finally, future research directions are
briefly discussed in Sect. 5. While the main ideas of our approach are presented in the
body of the paper, detailed equations and algorithms are relegated to four appendices.

2 Visual Events

Let us start by reviewing some results from catastrophe theory [1]: From most view-
points, the image contours of smooth surfaces are piecewise-smooth curves whose only
singularities are cusps and t-junctions. The contour structure is in general stable with
respect to viewpoint, i.e., it does not change when the camera position is submitted to
a small perturbation. From some viewpoints, however, almost any perturbation of the
viewpoint will alter the contour topology. A catalogue of these "visual events" has been
established by Kergosien [22] for transparent generic smooth surfaces observed under
orthographic projection (Fig. 1).


Fig. 1. Visual events, a. Local events. From top to bottom: swallowtail, beak-to-beak, lip. b.
Multilocal events. From top to bottom: triple point, tangent crossing, and cusp crossing.

Each visual event in this catalogue occurs when the viewing direction has high order
contact with the observed surface along certain characteristic curves [1, 24]. When contact
occurs at a single point on the surface, the event is said to be local; when it occurs at
multiple points, it is said to be multilocal. A catalogue of visual events is also available for
piecewise-smooth surfaces [33, 40], but we will restrict our discussion to smooth surfaces
in the rest of this paper.

2.1 Local Events

As shown in [22], smooth surfaces may exhibit three types of local events: swallowtail,
beak-to-beak, and lip transitions (Fig. 1.a). During a swallowtail transition, a smooth
image contour forms a singularity and then breaks off into two cusps and a t-junction. In
a beak-to-beak transition, two distinct portions of the occluding contour meet at a point

in the image. After meeting, the contour splits and forms two cusps; the connectivity
of the contour changes. Finally, a lip transition occurs when, out of nowhere, a closed
contour is formed with the introduction of two cusps.
Swallowtails occur on flecnodal curves, and both beak-to-beak and lip transitions
occur on parabolic curves [1, 24]. Flecnodal points are inflections of asymptotic curves,
while parabolic points are zeros of the Gaussian curvature. Equations for the parabolic
and flecnodal curves of parametric and implicit surfaces are given in Appendices A.1 and
B.1 respectively. The corresponding viewing directions are asymptotic directions along
these curves.

2.2 Multilocal Events
These events occur when two or more surface points project onto the same contour
point. As shown in [22], there are three types of multilocal events: triple points, tangent
crossings, and cusp crossings (Fig. 1.b). A triple point is formed by the intersection of
three contour segments. For an opaque object, only two branches are visible on one side
of the transition while three branches are visible on the other side. A tangent crossing
occurs when two contours meet at a point and share a common tangent. Finally, a cusp
crossing occurs when the projection of a contour cusp meets another contour.
A multilocal event is characterized by a curve defined in a high dimension space, or
equivalently by a family of surface curves. For example, a triple point is formed when
three surface points are aligned and, in addition, the surface normals at the three points
are all orthogonal to the common line supporting these points. By sweeping this line
while maintaining three-point contact, a family of three curves is drawn on the surface.
Equations for the families of surface curves corresponding to multilocal events are given
in Appendices A.2 and B.2 for parametric and implicit surfaces respectively. The corre-
sponding viewing directions are parallel to the lines supporting the points forming the
events.

3 The Algorithm

We propose the following algorithm for constructing the aspect graph of an opaque solid
bounded by an algebraic surface:
1. Trace the visual event curves.
2. Eliminate the occluded events.
3. Construct the regions delineated on the view sphere by the remaining events.
4. Construct the corresponding aspects.
We now detail each step of the algorithm. Note that the aspect graph of a transparent
solid can be constructed by using the same procedure but omitting step 2.

3.1 Step 1: Tracing Visual Events
As shown in Sect. 2, a visual event corresponds in fact to two curves: a curve (or family
of curves) F drawn on the object surface and a curve ,A drawn on the view sphere.
For algebraic surfaces, the curve Γ is defined implicitly in ℝ^{n+1} by a system of n
polynomial equations in n + 1 unknowns, with 1 < n < 8 (see Appendices A and B):

P1(X0, X1, ..., Xn) = 0,
  ...                                                        (1)
Pn(X0, X1, ..., Xn) = 0.

To trace a visual event, we first trace Γ in ℝ^{n+1}. We then trace A by mapping
points of Γ onto points of A: given a point on Γ, the corresponding point on A is an
asymptotic direction for local events, or the direction of the line joining two surface points
for multilocal events.
The curve tracing algorithm is decomposed into the following steps (Fig. 2): 1.1.
Compute all extremal points of Γ in some direction, say X0 (this includes all singular
points). 1.2. Compute all intersections of Γ with the hyperplanes orthogonal to the
X0 axis at the extremal points. 1.3. For each interval of the X0 axis delimited by these
hyperplanes, intersect Γ and the hyperplane passing through the mid-point of the interval
to obtain one sample for each real branch. 1.4. March numerically from the sample points
found in step 1.3 to the intersection points found in step 1.2 by predicting new points
through Taylor expansion and correcting them through Newton iterations.


Fig. 2. An example of curve tracing in ℝ². This curve has two extremal points E1, E2, and four
regular branches with sample points S1 to S4; note that E2 is singular.

This algorithm overcomes the main difficulties of curve tracing, namely finding a
sample point on every real branch and marching through singularities. Its output is a
graph whose nodes are extremal or singular points on Γ and whose arcs are discrete
approximations of the smooth curve branches between these points. This graph is similar
to the s-graph representation of plane curves constructed through cylindrical algebraic
decomposition [2]. Using the mapping from Γ onto A, a discrete approximation of the
curve A is readily constructed.
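Step 1.4 is a standard predictor/corrector continuation; the following numpy sketch (my own, not the authors' implementation) marches along an implicit curve P(X) = 0 of n equations in n + 1 unknowns between two abscissae, using a first-order Taylor predictor and Newton corrections, in the spirit described above.

    import numpy as np

    def march(P, jacobian, X, x0_target, step=1e-2, newton_iters=5):
        # P(X) returns the n residuals; jacobian(X) their n x (n+1) derivative.
        X = np.asarray(X, dtype=float).copy()
        direction = np.sign(x0_target - X[0])
        while direction * (x0_target - X[0]) > 0:
            h = direction * min(step, abs(x0_target - X[0]))
            J = jacobian(X)
            # predictor: dX_{1..n}/dX0 = -J_{1..n}^{-1} dP/dX0 (Taylor expansion)
            X[1:] += h * np.linalg.solve(J[:, 1:], -J[:, 0])
            X[0] += h
            for _ in range(newton_iters):
                # corrector: Newton iterations on X1..Xn (the n x n block of J
                # is nonsingular on extrema-free intervals)
                X[1:] -= np.linalg.solve(jacobian(X)[:, 1:], P(X))
        return X

For instance, with P(X) = [X0² + X1² − 1] and Jacobian [[2X0, 2X1]], this traces the unit circle between two extrema-free values of X0.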

Technical Details. We now detail the computations involved in the curve tracing al-


gorithm. The casual reader may want to skip the rest of this section, at least on first
reading, and jump ahead to Sect. 3.2.
Step 1.1 requires the computation of the extrema of Γ in the X0 direction. As shown
in Appendix C, these points are the solutions of the system of n + 1 polynomial equations
in n + 1 unknowns obtained by adding the equation |J| = 0 to system (1). Here, J is the
Jacobian matrix (∂Pi/∂Xj), with i, j = 1, ..., n. Steps 1.2 and 1.3 require computing the
intersections of a curve with a hyperplane, and these points are once again the solutions
of a system of polynomial equations. We use the homotopy continuation method, as
described in Appendix D, to solve these equations.
The curve is actually traced in step 1.4, using a classical prediction/correction ap-
proach based on a Taylor expansion of the Pi's [4, 13]. As shown in Appendix C, this

involves inverting the matrix J which is guaranteed to be nonsingular on extrema-free


intervals. Note that all real branches can be traced in parallel. As shown in Appendix D,
finding the extrema of the curve and its intersections with a family of hyperplanes is a
parallel process too.
There is no conceptual difficulty in applying this algorithm to aspect graph construc-
tion, but there is a very practical problem: the visual events are defined by very high
degree algebraic curves, and tracing them requires solving systems of equations that may
have millions of roots. We will come back to this problem in Sect. 4.

3.2 Step 2: Eliminating Occluded Events

All visual events of the transparent object are found in step 1 of the algorithm. For an
opaque object, some of these events will be occluded, and they should be eliminated. The
visibility of an event curve Γ can be determined through ray tracing at its sample point
found in step 1.3 [21, 44].

3.3 Step 3: Constructing the Regions

To construct the aspect graph regions delineated by the curves A on the view sphere, we
refine the curve tracing algorithm into a cell decomposition algorithm whose output is a
description of the regions, their boundary curves, and their adjacency relationships. Note
that this refinement is only possible for curves drawn in two-dimensional spaces such as
the sphere.
The algorithm is divided into the following steps (Fig. 3):
3.1. Compute all extremal points of the curves in the X0 direction.
3.2. Compute all the intersection points between the curves.
3.3. Compute all intersections of the curves with the "vertical" lines orthogonal to the X0 axis at the extremal and intersection points.
3.4. For each interval of the X0 axis delimited by these lines, do the following:
  3.4.1. Intersect the curves and the line passing through the mid-point of the interval to obtain a sample point on each real branch of each curve.
  3.4.2. Sort the sample points in increasing X1 order.
  3.4.3. March from the sample points to the intersection points found in step 3.3.
  3.4.4. Two consecutive branches within an interval of X0 and the vertical segments joining their extremities bound a region.
A sample point can be found for each region as the mid-point of the sample points
of the bounding curve branches. This point is used to construct a representative aspect
in Sect. 3.4. Maximal regions are found by merging all regions adjacent along a vertical
line segment (two regions are adjacent if they share a common boundary, i.e., a vertical
line segment or a curve branch).
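As an illustration of steps 3.4.2-3.4.4, the schematic sketch below (Python; the data structures are hypothetical simplifications of what a real implementation must maintain) takes, for one X0 interval, the X1 values of the curve branches sampled at the mid-line, sorts them, pairs consecutive branches into regions, and assigns each region the mid-point of its bounding samples as its sample point.

    # Schematic region construction for one interval of the X0 axis (steps 3.4.2-3.4.4).
    # 'branches' holds, for each real curve branch crossing the interval, its X1 value
    # at the mid-line; the slab's X1 extent is [x1_min, x1_max].
    def regions_in_slab(branches, x1_min, x1_max):
        walls = [x1_min] + sorted(branches) + [x1_max]   # consecutive branches bound regions
        regions = []
        for low, high in zip(walls[:-1], walls[1:]):
            regions.append({"bounds": (low, high),
                            "sample": 0.5 * (low + high)})   # region sample point (cf. Sect. 3.4)
        return regions

    # Example: three branches cross the slab; four regions result.
    print(regions_in_slab([0.2, 0.5, 0.8], 0.0, 1.0))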

Technical Details. We now detail the computations involved in the cell decomposition
algorithm. Again, the casual reader may want to skip the rest of this section, and jump
ahead to Sect. 3.4.
In our application, the coordinates X0, X1 define a parameterization of the view
sphere, such as spherical angles. Also, the curve A corresponding to a visual event is
not explicitly defined by polynomial equations. As shown in Appendices A and B, it is
actually possible to augment the polynomial equations defining F to construct a new
algebraic curve Omega defined in R^{m+1}, with m > n, such that A is the projection of Omega onto
R^2.

Fig. 3. An example of cell decomposition. Two curves are shown, with their extremal points Ei
and their intersection points Ii; the shaded rectangle delimited by I1 and I2 is divided into five
regions with sample points S1 to S5; the region corresponding to S3 is shown in a darker shade.

The extrema of A in the X0 direction are the projections of the extrema of Omega in
this direction, and they can be found by solving a system of m + 1 equations in m + 1
unknowns through continuation. Similarly, the intersections of two curves A1 and A2
can be found by writing that Omega1 and Omega2 must project onto the same point in R^2 and by
solving the corresponding system of 2m equations in 2m unknowns. Marching along A
is achieved by marching along Omega and projecting onto R^2.
An alternative is to first localize candidate regions of the view sphere where extrema
and intersections may occur (using, for example, an adaptive subdivision of the sphere)
and to converge to the actual points through local numerical optimization using the
equations defining the visual events. This method does not involve the costly resolution
of large sets of polynomial equations. A simpler version of this method is to directly work
with the discrete approximations of the curves A obtained in step 1. This is the method
we have actually used in the implementation described in Sect. 4.

3.4 Step 4: Constructing the Aspects
This step involves determining the contour structure of a single view for each region,
first for the transparent object, then for the opaque object. This can be done through
"symbolic" ray tracing of the object contour [28] as seen from the sample point of the
region. Briefly, the contour structure is found using the curve tracing algorithm described
earlier. Since contour visibility only changes at the contour singularities found by the
algorithm, it is determined through ray tracing [21, 44] at one sample point per regular
branch.

4 Implementation and Results

The algorithm described in Sect. 3 has been fully implemented. Tracing the visual event
curves (step 1) is by far the most expensive part of the algorithm. Curve tracing and
continuation are parallel processes that can be mapped onto medium-grained MIMD
architectures. We have implemented continuation on networks of Sun SPARC Stations
communicating via Ethernet, networks of INMOS Transputers, and Intel Hypercubes. In
practice, this allows us to routinely solve systems with a few thousand roots, a task
that requires a few hours using a dozen SPARC stations. The elimination of occluded
events in step 2 of the algorithm only requires ray tracing a small number of points on
the surface and takes a negligible amount of time. In our current implementation, the
cell decomposition algorithm of step 3 works directly with discrete approximations of
the visual event curves and only takes a few seconds. Finally it takes a few minutes to
generate the aspects in step 4.
In the following examples, an object is represented graphically by its silhouette,
parabolic and flecnodal curves, and the corresponding aspect graph is shown (more pre-
cisely, the visual event curves and their intersections are drawn on the view sphere).
Figure 4.a shows an object bounded by a complex parametric surface, and its (partial)
aspect graph. All visual events except certain cusp crossings and triple points have been
traced. Note that the aspect graph is extremely complicated, even though some events
are still missing. Also, note that this object is in fact only piecewise-smooth. As remarked
earlier, the catalogue of visual events used in this paper can be extended to piecewise-
smooth surfaces [33, 40], and corresponding equations can also be derived [32].
The objects considered in the next three examples are described by smooth compact
implicit surfaces of degree 4. The full aspect graph has been computed. Note that it has
the structure predicted in [5, 24] for similarly shaped surfaces.
Figure 4.b shows the silhouette of a bean-shaped object and the corresponding aspect
graph, with vertices drawn as small circles. This object has a hyperbolic patch within a
larger convex region.
Figure 4.c shows a squash-shaped object, its parabolic and flecnodal curves and its
aspect graph. This object has two convex parts separated by a hyperbolic region. Note
the two concentric parabolic curves surrounding the flecnodal curves.
Figure 4.d shows a "dimpled" object and its aspect graph. This object has a concave
island within a hyperbolic annulus, itself surrounded by a convex region. The flecnodal
curve almost coincides with the outer parabolic curve (compare to [24, p. 467]). There
is no tangent crossing in this case. Figure 5.a shows the corresponding decomposition
of the parameter space of the view sphere into 16 maximal regions, with sample points
indicated as black spots. The horizontal axis represents longitude, measured between -pi
and pi, and the vertical axis represents latitude, measured between -pi/2 and pi/2. Figure
5.b shows the corresponding 16 aspects.
What do these results indicate? First, it seems that computing exact aspect graphs
of surfaces of high degree is impractical. It can be shown that triple points occur only for
surfaces of degree 6 or more, and that computing the extremal points of the corresponding
curves requires solving a polynomial system of degree 4,315,680 - a very high degree
indeed! Even if this extraordinary computation were feasible (or another method than
ours proved simpler), it is not clear how useful a data structure as complicated as the
aspect graph of Fig. 4.a would be for vision applications.
On the other hand, aspect graphs of low-degree surfaces do not require tracing triple
points, and the necessary amount of computation remains reasonable (for example, a
mere few thousands of roots had to be computed for the tangent crossings of the bean-
shaped object). In addition, as demonstrated by Fig. 4.b-d, the aspect graphs of these
objects are quite simple and should prove useful in recognition tasks.

5 Discussion and Future Research

We have presented a new algorithm for computing the exact aspect graph of curved ob-
jects and described its implementation. This algorithm is quite general: as noted in [27],

Fig. 4. A few objects and their aspect graphs: a. A parametric surface, b. A bean-shaped implicit
surface, c. A squash-shaped implicit surface, d. A "dimpled" implicit surface.

Fig. 5. a. Aspect graph regions of the "dimpled" object in parameter space, b. The corresponding
aspects.

algebraic surfaces subsume most representations used in computer aided design and com-
puter vision. Unlike alternative approaches based on cylindrical algebraic decomposition
[3, 9], our algorithm is also practical, as demonstrated by our implementation.
We are investigating the case of perspective projection: the (families of) surface curves
that delineate the visual events under orthographic projection also delineate perspective
projection visual events by defining ruled surfaces that partition the three-dimensional
view space into volumetric cells.
Future research will be dedicated to actually using the aspect graph representation
in recognition tasks. In [27], we have demonstrated the recovery of the position and
orientation of curved three-dimensional objects from monocular contours by using a
purely quantitative process that fits an object-centered representation to image contours.
What is missing is a control structure for guiding this process. We believe that the
qualitative, viewer-centered aspect graph representation can be used to guide the search
for matching image and model features and yield efficient control structures analogous
to the interpretation trees used in the polyhedral world [14, 17, 19].

Acknowledgments: We thank Seth Hutchinson, Alison Noble and Brigitte Ponce for
useful discussions and comments.

Appendix A: The Visual Events of Parametric Surfaces

A parametric algebraic surface is represented by:

X(u, v) = (X(u, v), Y(u, v), Z(u, v))^T,  (u, v) in I x J, a subset of R^2,          (2)


where X, Y, Z are (rational) polynomials in u, v. These surfaces include Bézier patches
and non-uniform rational B-splines (NURBS) for example.
In this appendix and the following one, we assume that the viewing direction V is
parameterized in R^2, by spherical angles for example. Note that all equations involving
V can be made polynomial by using the rational parameterization of the trigonometric
functions.
608

A.1 Local Events

We recall the equations defining the surface curves (parabolic and flecnodal curves) as-
sociated to the visual events (beaks, lips, swallowtails) of parametric surfaces. There
is nothing new here, but equations for flecnodal curves are not so easy to find in the
literature.
Note: in this appendix, a u (resp. v) subscript is used to denote a partial derivative
with respect to u (resp. v).
Consider a parametric surface X(u, v) and define:

N = Xu x Xv,  e = (Xuu . N)/|N|,  f = (Xuv . N)/|N|,  g = (Xvv . N)/|N|,          (3)

i.e., N is the surface normal, and e, f, g, are the coefficients of the second fundamental
form in the coordinate system (Xu, Xv) [10].

A.1.1 Asymptotic Directions. The asymptotic curves are the surface curves A(t) =
X(u(t), v(t)) defined by the differential equation:

e u'^2 + 2f u'v' + g v'^2 = 0.          (4)

The asymptotic directions are given by u'Xu + v'Xv, where u' and v' are solutions of
the above equation. A contour cusp occurs when the viewing direction is an asymptotic
direction.
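(When e is non-zero, dividing (4) by v'^2 gives a quadratic in the ratio u'/v', whose roots u'/v' = (-f +/- sqrt(f^2 - eg))/e yield the asymptotic directions: two real directions exist where f^2 - eg > 0, they coincide where eg - f^2 = 0, which is precisely the parabolic condition (5) below, and none exist at elliptic points.)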

A.1.2 Parabolic Curves. The parabolic curves of a parametric surface X(u, v) are
given by:

eg - f^2 = 0.          (5)

For each point X(u, v) along a parabolic curve, there is only one asymptotic direction,
which is given by (4). In the language of Sect. 3, (4) defines the mapping from F (the
parabolic curve) onto A (the view sphere curve corresponding to beak-to-beak and lip
events). Equivalently, A is the projection of the curve Omega obtained by adding to (5) the
equations V x (u'Xu + v'Xv) = 0 and (4).

A.1.3 Flecnodal Curves. As shown in [43, p. 85], the flecnodal points are inflections
of the asymptotic curves A, given by:

(A' x A'') . N = 0,          (6)

which can be seen as an equation in u, v, t, or, equivalently, as an equation in u, v, u', v',
u'', v''.

An equation for the flecnodal curves is obtained by eliminating u', v', u", v" among
eqs. (4), (6), and the equation obtained by differentiating (4) with respect to t. Note
that since all three equations are homogeneous in u', v' and in u'', v'', this can be done by
arbitrarily setting u' = 1, u'' = 1, say, and eliminating v', v'' among these three equations.
The resulting equation in u, v characterizes the flecnodal curves.
Note that although it is possible to construct the general equation of flecnodal curves
for arbitrary parametric surfaces, this equation is very complicated, and it is better to
derive it for each particular surface using a computer algebra system. As before, explicit
equations for Omega can be constructed.
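For a particular patch, the elimination just described is easy to script in a computer algebra system. The sketch below (Python/SymPy) follows the recipe of the text, setting u' = 1, u'' = 1 and eliminating v'' and then v' with resultants; the example patch, the dropped 1/|N| normalization (harmless away from surface singularities), and all names are illustrative assumptions, not the authors' implementation.

    # Sketch: deriving a flecnodal-curve equation for one parametric patch (SymPy).
    import sympy as sp

    u, v = sp.symbols("u v")
    up, vp, upp, vpp = sp.symbols("up vp upp vpp")      # u', v', u'', v''

    X = sp.Matrix([u, v, u**3 + v**3 + u*v])            # hypothetical example patch
    Xu, Xv = X.diff(u), X.diff(v)
    Xuu, Xuv, Xvv = X.diff(u, 2), X.diff(u).diff(v), X.diff(v, 2)
    N = Xu.cross(Xv)
    e, f, g = Xuu.dot(N), Xuv.dot(N), Xvv.dot(N)        # 1/|N| factors dropped

    eq4 = e*up**2 + 2*f*up*vp + g*vp**2                 # asymptotic equation (4)
    # d/dt of (4), with u, v, u', v' all functions of t:
    eq4dt = eq4.diff(u)*up + eq4.diff(v)*vp + eq4.diff(up)*upp + eq4.diff(vp)*vpp
    # Inflection condition (6): (A' x A'') . N = 0, with A' and A'' expanded by the chain rule.
    Ap  = up*Xu + vp*Xv
    App = upp*Xu + vpp*Xv + up**2*Xuu + 2*up*vp*Xuv + vp**2*Xvv
    eq6 = Ap.cross(App).dot(N)

    # Set u' = 1, u'' = 1 (allowed by homogeneity) and eliminate v'' then v'.
    subs = {up: 1, upp: 1}
    p1, p2, p3 = [sp.expand(E.subs(subs)) for E in (eq4, eq4dt, eq6)]
    r1 = sp.resultant(p2, p3, vpp)                      # eliminate v''
    flecnodal = sp.expand(sp.resultant(p1, r1, vp))     # eliminate v' -> equation in u, v
    print(flecnodal)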

A.2 Multilocal Events

Multilocal events occur when the viewing direction V has high order contact with the
surface in at least two distinct points X1 and X2. In that case, V = X1 - X2.

A.2.1 Triple Points. The triple point is conceptually the simplest of the multilocal
events. It occurs when three contour fragments intersect at a single point. Let Xi =
X(ui, vi), for i = 1, 2, 3, be the three corresponding surface points, and let Ni be the
corresponding surface normals; we obtain the following equations:

(X1 - X2) x (X2 - X3) = 0,
(X1 - X2) . N3 = 0,
(X2 - X3) . N1 = 0,          (7)
(X3 - X1) . N2 = 0.

The first equation is a vector equation (or equivalently a set of two independent scalar
equations) that expresses the fact that the three points are aligned with the viewing
direction. The next three equations simply express the fact that the three points belong
to the occluding contour. It follows that triple points are characterized by five equations
in the six variables ui, vi, i = 1,2, 3.
An explicit equation for the curve Omega corresponding to a triple point can be obtained
by replacing (X2 - X3) by V in (7). Similar comments apply to the other multilocal
events.

A.2.2 Tangent Crossings. A tangent crossing occurs when two occluding contour
points X1 = X(u1, v1) and X2 = X(u2, v2) project to the same image point and have
collinear surface normals N1 and N2. This can be rewritten as:

N1 x N2 = 0,
(X1 - X2) . N1 = 0.          (8)

Again, remark that the first equation is a vector equation (or equivalently a set of
two independent scalar equations). It follows that tangent crossings are characterized by
three equations in the four variables ui, vi, i = 1, 2.

A.2.3 Cusp Crossings. A cusp crossing occurs when two occluding contour points X1
and X2 project to the same image point and one of the points, say X1, is a cusp. This
can be rewritten as:

(X1 - X2) . N1 = 0,
(X1 - X2) . N2 = 0,          (9)
e1 a^2 + 2 f1 ab + g1 b^2 = 0,

where e1, f1, g1 are the values of the coefficients of the second fundamental form at X1,
and (a, b) are the coordinates of the viewing direction X1 - X2 in the basis Xu(u1, v1),
Xv(u1, v1) of the tangent plane. Note that a, b can be computed from the dot products of
X1 - X2 with Xu(u1, v1) and Xv(u1, v1). It follows that cusp crossings are characterized
by three equations in the four variables ui, vi, i = 1, 2.

Appendix B: The Visual Events of Implicit Surfaces

An implicit algebraic surface is represented by:

F(X, Y, Z) = F(X) = 0,          (10)
where F is a polynomial in X, Y, Z.

B.1 Local Events

For implicit surfaces, even the equations of parabolic curves are buried in the literature.
Equations for both parabolic and flecnodal curves are derived in this appendix.
Note: in this appendix, X, Y, Z subscripts denote partial derivatives with respect to
these variables.

B.1.1 Asymptotic Directions. An asymptotic direction V at a point X lies in the
tangent plane and has second order contact with the surface. It is characterized by:

grad F(X) . V = 0,
V^T H(X) V = 0,          (11)

where H(X) is the Hessian of F at X. Asymptotic directions are determined by solving
this homogeneous system in V.

B.1.2 Parabolic Curves. The parabolic curves of an implicit surface F(X) = 0 are
given by:

F_X^2 (F_YY F_ZZ - F_YZ^2) + F_Y^2 (F_XX F_ZZ - F_XZ^2) + F_Z^2 (F_XX F_YY - F_XY^2)
+ 2 F_X F_Y (F_XZ F_YZ - F_ZZ F_XY) + 2 F_Y F_Z (F_XY F_XZ - F_XX F_YZ)          (12)
+ 2 F_X F_Z (F_XY F_YZ - F_YY F_XZ) = 0,

plus the equation F(X) = 0 itself. For each point X along a parabolic curve, there is only
one asymptotic direction, which is given by (11). It should be noted that one can directly
characterize the beak-to-beak and lip events by adding to (12) the equations F(X) = 0
and (11), and tracing the resulting curve Omega in R^5; the projection of this curve onto R^2
defines the beak-to-beak and lip curves on the view sphere.
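For a given surface, (12) can be generated automatically; it equals grad F . adj(H) . grad F, where adj(H) is the adjugate of the Hessian (the same expression that appears, up to a positive denominator, in the Gaussian curvature of an implicit surface). The sketch below (Python/SymPy) builds it for a hypothetical quartic (a torus, used purely as an example and not one of the objects of Sect. 4).

    # Sketch: generating the parabolic-curve polynomial (12) for one implicit surface.
    import sympy as sp

    X, Y, Z = sp.symbols("X Y Z")
    F = (X**2 + Y**2 + Z**2 + 3)**2 - 16*(X**2 + Y**2)   # example quartic: a torus

    grad = sp.Matrix([F.diff(X), F.diff(Y), F.diff(Z)])
    H = sp.hessian(F, (X, Y, Z))

    parabolic = sp.expand((grad.T * H.adjugate() * grad)[0, 0])   # left-hand side of (12)
    # The parabolic curves are the common solutions of F = 0 and parabolic = 0.
    print(sp.degree(parabolic, gen=X))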

B.1.3 Flecnodal Curves. A surface point X = (X, Y, Z)^T on a flecnodal curve has
third order contact with a line along an asymptotic direction V = (V1, V2, V3)^T [1]. This
is characterized by:

grad F(X) . V = 0,
V^T H(X) V = 0,          (13)
V^T (H_X(X) V1 + H_Y(X) V2 + H_Z(X) V3) V = 0.

Since these three equations are homogeneous in the coordinates of V, these coordinates
can easily be eliminated to obtain a single equation in X. Along with F(X) = 0,
this system defines the flecnodal curves. As before, explicit equations for Omega can be
constructed.

B.2 Multilocal Events

B.2.1 Triple Points. Let Xi, i = 1, 2, 3, be three points forming a triple point event.
The corresponding equations are similar to the equations defining triple points of parametric
surfaces:

F(Xi) = 0,  i = 1, 2, 3,
(X1 - X2) x (X2 - X3) = 0,
(X1 - X2) . N3 = 0,          (14)
(X2 - X3) . N1 = 0,
(X3 - X1) . N2 = 0,

where Ni = grad F(Xi). It follows that triple points are characterized by eight equations in
the nine variables Xi, Yi, Zi, i = 1, 2, 3.

B.2.2 Tangent Crossings. Again, the equations defining tangent crossings of implicit
surfaces are similar to the corresponding equations for parametric surfaces:

F(Xi) = 0,  i = 1, 2,
N1 x N2 = 0,          (15)
(X1 - X2) . N1 = 0.

This is a system of five equations in the six variables Xi, Yi, Zi, i = 1, 2.

B.2.3 Cusp Crossings. Cusp crossings are characterized by:

F(Xi) = 0,  i = 1, 2,
(X1 - X2) . N1 = 0,          (16)
(X1 - X2) . N2 = 0,
(X1 - X2)^T H(X1) (X1 - X2) = 0,

where the last equation simply expresses the fact that the viewing direction is an asymptotic
direction of the surface at X1. This is again a system of five equations in the six
variables Xi, Yi, Zi, i = 1, 2.

Appendix C: Details of the Curve Tracing Algorithm

C.1 Step 1.1: Finding the Extremal Points

The extrema of F in the X0 direction are given by differentiating (1) and setting dX0 = 0:

(dP1/dX1) dX1 + ... + (dP1/dXn) dXn = 0,
   ...                    i.e.,  J (dX1, ..., dXn)^T = 0,          (17)
(dPn/dX1) dX1 + ... + (dPn/dXn) dXn = 0,

where J = (dPi/dXj), with i, j = 1, ..., n, is the Jacobian matrix. This system has non-trivial
solutions if and only if the determinant D(X0, X1, ..., Xn) = |J| of the Jacobian
matrix vanishes. The extrema of F are therefore the solutions of:

P1(X0, X1, ..., Xn) = 0,
   ...
Pn(X0, X1, ..., Xn) = 0,          (18)
D(X0, X1, ..., Xn) = 0.

C.2 Steps 1.2 and 1.3: Intersecting the Curve with Hyperplanes

These steps correspond to finding all the intersections of F with some hyperplane X0 = X0*.
These intersections are given by:

P1(X0*, X1, ..., Xn) = 0,
   ...                                              (19)
Pn(X0*, X1, ..., Xn) = 0.

C.3 Step 1.4: Marching on Extrema-Free Intervals

To trace a curve on an extrema-free interval, we use a classical prediction/correction
approach based on a first order Taylor expansion of the Pi's (higher order expansions
could also be used [4, 13]). By differentiating (1), we obtain:

J (dX1, ..., dXn)^T = -dX0 (dP1/dX0, ..., dPn/dX0)^T.          (20)

Given a step dX0 in the X0 direction, one can predict the remaining dXi's by solving
this system of linear equations. This is only possible when the determinant of the Jacobian
matrix J is non-zero, which is exactly equivalent to saying that the point (X0, ..., Xn)^T
is not an extremum in the X0 direction.
The correction step uses Newton iterations to converge back to the curve from the
predicted point. We write once more a first order Taylor approximation of the Pi's to
compute the necessary correction (dX1, ..., dXn)^T for a fixed value of X0:

P1 + (dP1/dX1) dX1 + ... + (dP1/dXn) dXn = 0,
   ...                    i.e.,  (dX1, ..., dXn)^T = -J^{-1} (P1, ..., Pn)^T.          (21)
Pn + (dPn/dX1) dX1 + ... + (dPn/dXn) dXn = 0,
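A minimal numerical sketch of the prediction/correction step (Python/NumPy), assuming the system (1) is available as a callable returning the n residuals at a point (X0, ..., Xn) and taking the Jacobian by finite differences; the step counts and the circle example are illustrative, not the paper's implementation.

    # Minimal predictor-corrector marcher for one branch of P(x) = 0, x = (X0, ..., Xn),
    # following (20)-(21).  All names are illustrative.
    import numpy as np

    def jac(P, x, h=1e-7):
        """Finite-difference Jacobian of the n residuals w.r.t. all n+1 variables."""
        n = len(P(x))
        J = np.empty((n, len(x)))
        for j in range(len(x)):
            dx = np.zeros_like(x); dx[j] = h
            J[:, j] = (P(x + dx) - P(x - dx)) / (2 * h)
        return J

    def march(P, x, x0_target, steps=200, newton_iters=5):
        """March from sample point x to the hyperplane X0 = x0_target."""
        dX0 = (x0_target - x[0]) / steps
        for _ in range(steps):
            J = jac(P, x)
            # Prediction (20): solve J[:,1:] d = -dX0 * dP/dX0.
            d = np.linalg.solve(J[:, 1:], -dX0 * J[:, 0])
            x = x + np.concatenate(([dX0], d))
            # Correction (21): Newton iterations with X0 held fixed.
            for _ in range(newton_iters):
                J = jac(P, x)
                x[1:] -= np.linalg.solve(J[:, 1:], P(x))
        return x

    # Example: the circle X0^2 + X1^2 - 1 = 0 (n = 1), traced from (0, 1) to X0 = 0.9.
    P = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])
    print(march(P, np.array([0.0, 1.0]), 0.9))   # approx. [0.9, sqrt(0.19)]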

Appendix D: Homotopy Continuation


Consider a system of n polynomial equations Pi in n unknowns Xj, denoted by P(X) = 0,
with P = (P1, ..., Pn)^T and X = (X1, ..., Xn)^T. To solve this system, we use the
homotopy continuation method [30], itself a simple form of curve tracing. The principle
of the method is as follows. Let Q(X) = 0 be another system of polynomial equations with
the same total degree as P(X) = 0, but known solutions. A homotopy, parameterized by
t in [0, 1], can be defined between the two systems by:

(1 - t) Q(X) + t P(X) = 0.          (22)

The solutions of the target system are found by tracing the curve defined in R^{n+1} by
these equations from t = 0 to t = 1 according to step 1.4 of our curve tracing algorithm.
In this case, however, the sample points are the known solutions of Q(X) = 0 at t = 0,
which allows us to bypass step 1.3 of the algorithm. It can also be shown [30] that with
an appropriate choice of Q, the curve has no extrema or singularities, which allows us to
also bypass steps 1.1-1.2.
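The principle of (22) can be illustrated on a single polynomial equation. In the toy sketch below (Python/NumPy, names ours), the start system Q(x) = x^d - 1 has the d-th roots of unity as known solutions, each of which is continued to a root of the target P by an Euler prediction in t followed by a few Newton corrections; the random complex factor multiplying Q is a common safeguard against singular intermediate systems and is an assumption, not something taken from the paper.

    # Toy homotopy continuation for one polynomial p(x) = 0 over the complex numbers.
    import numpy as np

    def homotopy_roots(p_coeffs, steps=100, newton_iters=4):
        p, d = np.poly1d(p_coeffs), len(p_coeffs) - 1
        dp = p.deriv()
        q = np.poly1d([1.0] + [0.0] * (d - 1) + [-1.0])   # start system x^d - 1
        dq = q.deriv()
        gamma = np.exp(2j * np.pi * np.random.rand())     # random complex factor
        roots = [np.exp(2j * np.pi * k / d) for k in range(d)]

        h  = lambda x, t: (1 - t) * gamma * q(x) + t * p(x)
        hx = lambda x, t: (1 - t) * gamma * dq(x) + t * dp(x)
        ht = lambda x, t: p(x) - gamma * q(x)

        dt, out = 1.0 / steps, []
        for x in roots:
            t = 0.0
            for _ in range(steps):
                x -= dt * ht(x, t) / hx(x, t)             # Euler prediction along the path
                t += dt
                for _ in range(newton_iters):             # Newton correction at fixed t
                    x -= h(x, t) / hx(x, t)
            out.append(x)
        return np.array(out)

    # Example: roots of x^3 - 2x + 1 (exact roots 1 and (-1 +/- sqrt(5))/2).
    print(np.sort_complex(homotopy_roots([1.0, 0.0, -2.0, 1.0])))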

References

1. V.I. Arnol'd. Singularities of systems of rays. Russian Math. Surveys, 38(2):87-176, 1983.
2. D.S. Arnon. Topologically reliable display of algebraic curves. Computer Graphics,
17(3):219-227, July 1983.
3. D.S. Arnon, G. Collins, and S. McCallum. Cylindrical algebraic decomposition I and II.
SIAM J. Comput., 13(4):865-889, November 1984.
4. C.L. Bajaj, C.M. Hoffmann, R.E. Lynch, and J.E.H. Hopcroft. Tracing surface intersec-
tions. Computer Aided Geometric Design, 5:285-307, 1988.
5. J. Callahan and R. Weiss. A model for describing surface shape. In Proc. IEEE Conf.
Comp. Vision Patt. Recog., pages 240-245, San Francisco, CA, June 1985.
6. G. Castore. Solid modeling, aspect graphs, and robot vision. In Pickett and Boyse, editors,
Solid modeling by computer, pages 277-292. Plenum Press, NY, 1984.
7. I. Chakravarty. The use of characteristic views as a basis for recognition of three-
dimensional objects. Image Processing Laboratory IPL-TR-034, Rensselaer Polytechnic
Institute, October 1982.
8. S. Chen and H. Freeman. On the characteristic views of quadric-surfaced solids. In 1EEE
Workshop on Directions in Automated CAD-Based Vision, pages 34-43, June 1991.
9. G.E. Collins. Quantifier Elimination for Real Closed Fields by Cylindrical Algebraic De-
composition, volume 33 of Lecture Notes in Computer Science. Springer-Verlag, New York,
1975.
10. M.P. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Ha/l, Englewood
Cliffs, N J, 1976.
11. D. Eggert and K. Bowyer. Computing the orthographic projection aspect graph of solids
of revolution. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 102-108,
Austin, TX, November 1989.
12. D. Eggert and K. Bowyer. Perspective projection aspect graphs of solids of revolution: An
implementation. In IEEE Workshop on Directions in Automated CAD-Based Vision, pages
44-53, June 1991.
13. R.T. Farouki. The characterization of parametric surface sections. Comp. Vis. Graph. Im.
Proc., 33:209-236, 1986.
14. O.D. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects.
International Journal of Robotics Research, 5(3):27-52, Fall 1986.
15. Z. Gigus, J. Canny, and R. Seidel. Efficiently computing and representing aspect graphs of
polyhedral objects. IEEE Trans. Patt. Anal. Mach. Intell., 13(6), June 1991.
16. Z. Gigus and J. Malik. Computing the aspect graph for line drawings of polyhedral objects.
IEEE Trans. Patt. Anal. Mach. Intell., 12(2):113-122, February 1990.
17. W.E.L. Grimson and T. Lozano-Pérez. Localizing overlapping parts by searching the in-
terpretation tree. IEEE Trans. Patt. Anal. Mach. Intell., 9(4):469-482, 1987.
18. M. Hebert and T. Kanade. The 3D profile method for object recognition. In Proc. IEEE
Conf. Comp. Vision Patt. Recog., pages 458-463, San Francisco, CA, June 1985.
19. D.P. Huttenlocher and S. Ullman. Object recognition using alignment. In Proc. Int. Conf.
Comp. Vision, pages 102-111, London, U.K., June 1987.
20. K. Ikeuchi and T. Kanade. Automatic generation of object recognition programs. Proceed-
ings of the IEEE, 76(8):1016-35, August 1988.
21. J.T. Kajiya. Ray tracing parametric patches. Computer Graphics, 16:245-254, July 1982.
22. Y.L. Kergosien. La famille des projections orthogonales d'une surface et ses singularités.
C.R. Acad. Sc. Paris, 292:929-932, 1981.
23. Y.L. Kergosien. Generic sign systems in medical imaging. IEEE Computer Graphics and
Applications, 11(5):46-65, 1991.
24. J.J. Koenderink. Solid Shape. MIT Press, Cambridge, MA, 1990.
25. J.J. Koenderink and A.J. Van Doorn. The internal representation of solid shape with
respect to vision. Biological Cybernetics, 32:211-216, 1979.

26. D.J. Kriegman and J. Ponce. Computing exact aspect graphs of curved objects: solids of
revolution. Int. J. of Comp. Vision, 5(2):119-135, 1990.
27. D.J. Kriegman and J. Ponce. On recognizing and positioning curved 3D objects from image
contours. IEEE Trans. Patt. Anal. Mach. Intell., 12(12):1127-1137, December 1990.
28. D.J. Kriegman and J. Ponce. Geometric modelling for computer vision. In SPIE Confer-
ence on Curves and Surfaces in Computer Vision and Graphics 1I, Boston, MA, November
1991.
29. D.J. Kriegman and J. Ponce. A new curve tracing algorithm and some applications. In P.J.
Laurent, A. Le Méhauté, and L.L. Schumaker, editors, Curves and Surfaces, pages 267-270.
Academic Press, New York, 1991.
30. A.P. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scien-
tific Problems. Prentice Hall, Englewood Cliffs, N J, 1987.
31. H. Plantinga and C. Dyer. Visibility, occlusion, and the aspect graph. Int. J. of Comp.
Vision., 5(2):137-160, 1990.
32. J. Ponce and D.J. Kriegman. Computing exact aspect graphs of curved objects: parametric
patches. In Proc. AAAI Nat. Conf. Artif. Intell., pages 1074-1079, Boston, MA, July 1990.
33. J.H. Rieger. On the classification of views of piecewise-smooth objects. Image and Vision
Computing, 5:91-97, 1987.
34. J.H. Rieger. The geometry of view space of opaque objects bounded by smooth surfaces.
Artificial Intelligence, 44(1-2):1-40, July 1990.
35. J.H. Rieger. Global bifurcations sets and stable projections of non-singular algebraic sur-
faces. Int. J. of Comp. Vision., 1991. To appear.
36. W.B. Seales and C.R. Dyer. Constrained viewpoint from occluding contour. In IEEE
Workshop on Directions in Automated "CAD-Based" Vision, pages 54-63, Maui, Hawaii,
June 1991.
37. T. Sripradisvarakul and R. Jain. Generating aspect graphs of curved objects. In Proc.
IEEE Workshop on Interpretation of 3D Scenes, pages 109-115, Austin, TX, December
1989.
38. J. Stewman and K.W. Bowyer. Aspect graphs for planar-face convex objects. In Proc.
IEEE Workshop on Computer Vision, pages 123-130, Miami, FL, 1987.
39. J. Stewman and K.W. Bowyer. Creating the perspective projection aspect graph of poly-
hedral objects. In Proc. Int. Conf. Comp. Vision, pages 495-500, Tampa, FL, 1988.
40. C.T.C. Wall. Geometric properties of generic differentiable manifolds. In A. Dold
and B. Eckmann, editors, Geometry and Topology, pages 707-774, Rio de Janeiro, 1976.
Springer-Verlag.
41. R. Wang and H. Freeman. Object recognition based on characteristic views. In Interna-
tional Conference on Pattern Recognition, pages 8-12, Atlantic City, N J, June 1990.
42. N. Watts. Calculating the principal views of a polyhedron. CS Tech. Report 234, Rochester
University, 1987.
43. C.E. Weatherburn. Differential Geometry. Cambridge University Press, 1927.
44. T. Whitted. An improved illumination model for shaded display. Comm. of the ACM,
23(6):343-349, June 1980.
SURFACE INTERPOLATION USING WAVELETS

Alex P. Pentland

Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Abstract. Extremely efficient surface interpolation can be obtained by
use of a wavelet transform. This can be accomplished using biologically-
plausible filters, requires only O(n) computer operations, and often only a
single iteration is required.

1 Introduction
Surface interpolation is a common problem in both human and computer vision. Perhaps
the most well-known interpolation theory is regularization [7, 9]. However this theory
has the drawback that the interpolation network requires hundreds or even thousands of
iterations to produce a smoothly interpolated surface. Thus in computer vision applica-
tions surface interpolation is often the single most expensive processing step. In biological
vision, timing data from neurophysiology makes it unlikely that many iterations of cell
firing are involved in the interpolation process, so that interpolation theories have been
forced to assume some sort of analog processing. Unfortunately, there is little experimen-
tal evidence supporting such processing outside of the retina. In this paper I will show
how efficient solutions to these problems can be obtained by using orthogonal wavelet
filters or receptive fields.

1.1 Background
In computer vision the surface interpolation problem typically involves constructing a
smooth surface, sometimes allowing a small number of discontinuities, given a sparse set
of noisy range or orientation measurements. Mathematically, the problem may be defined
as finding a function U within a linear space H that minimizes an energy functional E:

E(U) = inf_{V in H} E(V) = inf_{V in H} (K(V) + R(V)),          (1)

where K(V) is an energy functional that is typically proportional to the curvature of
the surface, and R(V) is an energy functional that is proportional to the residual difference
between V and the sensor measurements. When the solution exists, the variational
derivative of the energy functional vanishes,

d_U E(U) = d_U K(U) + d_U R(U) = 0.          (2)
The linear operators d_U E, d_U K, and d_U R are infinite dimensional and normally dense.
To solve Equation 2, therefore, it must first be projected onto a discretization S of H
containing n nodes. The resulting matrix equation is written lambda*K U + R = 0, where
lambda is a scalar constant, U, R are n x 1 vectors and K is an n x n matrix; these are the
discretizations of U, d_U R(U), and d_U K(U), respectively. To make explicit the dependence
of R on U, I will write the regularization equation as follows:

lambda*K U + S U - D = 0,          (3)

i.e., R = S U - D, where D is an n x 1 vector whose entries are the measured coordinates
di where sensor measurements exist and zero elsewhere, and S is a diagonal "selection
matrix" with ones for nodes with sensor measurements and zeros elsewhere.

Fig. 1. Wavelet filter family "closest" to Wilson-Gelb filters (arbitrarily scaled for display).

1.2 Choice of Basis

When K is chosen to be the stress within a bending thin plate (as is standard), then K
is the stiffness matrix familiar from physical simulation. Unfortunately, several thousand
iterations are often required to compute the interpolated surface. Although sophisticated multires-
olution techniques can improve performance, the best reported algorithms still require
several hundred iterations.
The cost of surface interpolation is proportional to both the bandwidth and condition
number of K. Both of these quantities can be greatly reduced by choosing the correct
basis (a set of n orthogonal vectors) and associated coordinate system in which to solve
the problem. In neural systems, transformation to a new basis or coordinate system can
be accomplished by passing a data vector through a set of receptive fields; the shapes of
the receptive fields are the new basis vectors, and the resulting neural activities are the
coordinates of the data vector in the coordinate system defined by these basis vectors. If
the receptive fields are orthonormal, then we can convert back to the original coordinate
system by adding up the same receptive fields in amounts proportional to the associated
neurons activity.
For the class of physically-motivated smoothness functionals, the ideal basis would
be both spatially and spectrally localized, and (important for computer applications)
very fast to compute. The desire for spectral localization stems from the fact that, in the
absence of boundary conditions, discontinuities, etc., these sort of physical equilibrium
problems can usually be solved in closed form in the frequency domain. In similar fash-
ion, a spectrally-localized basis will tend to produce a banded stiffness matrix K. The
requirement for spatial localization stems from the need to account for local variations
in K's band structure due to, for instance, boundary conditions, discontinuities, or other
inhomogeneities.

1.3 Orthogonal Wavelet Bases

A class of bases that provide the desired properties are generated by functions known
as orthogonal wavelets [5, 2, 8]. Orthogonal wavelet functions and receptive fields are
different from the wavelets previously used in biological and computational modeling
because all of the functions or receptive fields within a family, rather than only the
functions or receptive fields of one size, are orthogonal to one another. A family of
orthogonal wavelets h_{a,b} is constructed from a single function h by dilation of a and
translation of b:

h_{a,b}(x) = |a|^{-1/2} h((x - b)/a),  a != 0.          (4)

Typically a = 2^i and b = 1, ..., n = 2^j for j = 1, 2, 3, .... The critical properties of wavelet
families that make them well suited to this application are that:
families that make them well suited to this application are that:
- For appropriate choice of h they can provide an orthonormal basis of L^2(R), i.e., all
members of the family are orthogonal to one another.
- They can be simultaneously localized in both space and frequency.
- Digital transformations using wavelet bases can be recursively computed, and so
require only O(n) operations.
Such families of wavelets may be used to define a set of multiscale orthonormal basis
vectors. I will call such a basis Phi_w, where the columns of the n x n matrix Phi_w are the basis
vectors. Because Phi_w forms an orthonormal basis, Phi_w^T Phi_w = Phi_w Phi_w^T = I. That is, like the
Fourier transform, the wavelet transform is self-inverting. Figure 1 shows a subset of Phi_w;
from left to right are the basis vectors corresponding to a = 1, 2, 4, 8, 16 and b = n/2. All
of the examples presented in this paper will be based on the wavelet basis illustrated
in this figure.
The basis vector shapes shown in Figure 1 may be regarded as the neural receptive
fields that transform an input signal into, or out of, the wavelet coordinate system. I
developed this particular set of wavelets to match as closely as possible the human psy-
chophysical receptive field model of Wilson and Gelb [10]; there is only a 7.5% MSE
difference between this set of wavelet receptive fields and the Wilson-Gelb model 1 [6].
This set of wavelets, therefore, provides a good model of human spatial frequency sensi-
tivity, and of human sensitivity to changes in spatial frequency.
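As a concrete (if simplistic) illustration of such a basis matrix, the following sketch (Python/NumPy) assembles an orthonormal Haar wavelet basis for n = 8 and checks the self-inverting property Phi_w Phi_w^T = I; the Haar family is used only because it is the simplest orthogonal wavelet, and is not meant to resemble the Wilson-Gelb-matched filters of Fig. 1.

    # Build an orthonormal Haar wavelet basis matrix for n = 2^J and verify that
    # it is self-inverting.  Haar is a stand-in for the filters described in the text.
    import numpy as np

    def haar_basis(n):
        cols = [np.full(n, 1.0 / np.sqrt(n))]             # scaling (DC) vector
        a = n
        while a > 1:                                      # dilations a = n, n/2, ..., 2
            for b in range(n // a):                       # translations at this scale
                v = np.zeros(n)
                v[b * a : b * a + a // 2] = 1.0
                v[b * a + a // 2 : (b + 1) * a] = -1.0
                cols.append(v / np.linalg.norm(v))
            a //= 2
        return np.column_stack(cols)                      # n x n orthonormal matrix

    Phi = haar_basis(8)
    print(np.allclose(Phi @ Phi.T, np.eye(8)))            # True: orthonormal basis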

2 Surface Interpolation using Wavelet Bases

It has been proven that by using wavelet bases linear operators such as K can be
represented extremely compactly [1]. This suggests that Phi_w is an effective preconditioning
transform, and thus may be used to obtain very fast approximate solutions. The simplest
method is to transform a previously-defined K to the wavelet basis,

K~ = Phi_w^T K Phi_w,          (5)

then to discard off-diagonal elements,

Omega^2 = diag(Phi_w^T K Phi_w),          (6)

and then to solve. Note that for each choice of K the diagonal matrix Omega^2 is calculated
only once and then stored; further, its calculation requires only O(n) operations. In
numerical experiments I have found that for a typical K the summed magnitude of the
off-diagonals of K~ is approximately 5% of the diagonal's magnitude, so that we expect
to incur only small errors by discarding off-diagonals.
This set of wavelets was developed by applying the gradient-descent QMF design procedure of
Simoncelli and Adelson [8] using the Wilson-Gelb filters as the initial "guess" at an orthogonal
basis. Wavelet receptive fields from only five octaves are shown, although the Wilson-Gelb
model has six channels. Wilson, in a personal communication, has advised us that the Wilson-
Gelb "b" and "e" channels are sufficiently similar that it is reasonable to group them into a
single channel.

Case 1. The simplest case of surface interpolation is when sensor measurements exist for
every node so that the sampling matrix S = I. Substituting Phi_w U~ = U and premultiplying
by Phi_w^T converts Equation 3 to

lambda Phi_w^T K Phi_w U~ + Phi_w^T Phi_w U~ = Phi_w^T D.          (7)

By employing Equation 6, we then obtain (lambda Omega^2 + I) U~ = Phi_w^T D, so that the approximate
interpolation solution U is

U = Phi_w (lambda Omega^2 + I)^{-1} Phi_w^T D.          (8)

Note that this computation is accomplished by simply transforming D to the wavelet
basis, scaling the convolution filters (receptive fields) appropriately at each level of recursion,
and then transforming back to the original coordinate system. To obtain an
approximate regularized solution for a sqrt(n) x sqrt(n) image using a wavelet of width w
therefore requires approximately 8wn + n add and multiply operations.
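To make Case 1 concrete, here is a minimal numerical sketch of (6) and (8) on a 1-D signal (Python/NumPy). The Haar basis and the squared second-difference matrix standing in for K are illustrative assumptions, not the filters or stiffness matrix of the text; the final line simply reports how far the diagonal approximation is from the exact dense solve.

    # Case 1 sketch: dense measurements (S = I), approximate solve of (lambda*K + I) U = D
    # via the diagonalized wavelet-domain system (8).
    import numpy as np

    def haar_basis(n):                                   # repeated so this snippet runs on its own
        cols, a = [np.full(n, n ** -0.5)], n
        while a > 1:
            for b in range(n // a):
                v = np.zeros(n); v[b*a:b*a + a//2] = 1.0; v[b*a + a//2:(b+1)*a] = -1.0
                cols.append(v / np.linalg.norm(v))
            a //= 2
        return np.column_stack(cols)

    n, lam = 64, 10.0
    # Stand-in smoothness matrix: squared second-difference operator.
    D2 = np.diag(np.full(n, -2.0)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    K = D2.T @ D2

    Phi = haar_basis(n)
    Omega2 = np.diag(Phi.T @ K @ Phi).copy()             # eq. (6): keep the diagonal only

    D = np.sin(np.linspace(0, 2*np.pi, n)) + 0.3*np.random.randn(n)   # noisy dense data
    U = Phi @ ((Phi.T @ D) / (lam * Omega2 + 1.0))       # eq. (8): diagonal solve
    exact = np.linalg.solve(lam*K + np.eye(n), D)        # reference dense solution
    print(np.linalg.norm(U - exact) / np.linalg.norm(exact))   # error of the diagonal approximation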

Case 2. In the more usual case where not all nodes have sensor measurements, the interpolation
solution may require iteration. In this case the sampling matrix S is diagonal
with ones for nodes that have sensor measurements, and zeros elsewhere. Again substituting
Phi_w U~ = U and premultiplying by Phi_w^T converts Equation 3 to

lambda Phi_w^T K Phi_w U~ + Phi_w^T S Phi_w U~ = Phi_w^T D.          (9)

The matrix Phi_w^T S Phi_w is diagonally dominant so that the interpolation solution U may be
obtained by iterating

U^{t+1} = Phi_w (lambda Omega^2 + S~)^{-1} Phi_w^T D^t + U^t,          (10)

where S~ = diag(Phi_w^T S Phi_w) and D^t = D - (lambda K + S) U^t is the residual at iteration t. I have
found that normally no more than three to five iterations of Equation 10 are required
to obtain an accurate estimate of the interpolated surface; often a single iteration will
suffice.
Note that for this procedure to be successful, the largest gaps in the data sampling
must be significantly smaller than the largest filters in the wavelet transform. Further,
when lambda is small and the data sampling is sparse and irregular, it can happen that the
off-diagonal terms of Phi_w^T S Phi_w introduce significant error. When using small lambda I have found
that it is best to perform one initial iteration with a large lambda, and then reduce lambda to the
desired value in further iterations.
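A corresponding sketch for Case 2 (Python/NumPy, self-contained, with the same illustrative Haar basis and stand-in K as in the Case 1 sketch; the 25% sampling pattern and five iterations are arbitrary choices):

    # Case 2 sketch: sparse measurements (S a 0/1 selection matrix), iterating (10).
    import numpy as np

    def haar_basis(n):                                   # as in the earlier sketches
        cols, a = [np.full(n, n ** -0.5)], n
        while a > 1:
            for b in range(n // a):
                v = np.zeros(n); v[b*a:b*a + a//2] = 1.0; v[b*a + a//2:(b+1)*a] = -1.0
                cols.append(v / np.linalg.norm(v))
            a //= 2
        return np.column_stack(cols)

    n, lam = 64, 10.0
    D2 = np.diag(np.full(n, -2.0)) + np.diag(np.ones(n-1), 1) + np.diag(np.ones(n-1), -1)
    K = D2.T @ D2                                        # stand-in smoothness matrix
    Phi = haar_basis(n)
    Omega2 = np.diag(Phi.T @ K @ Phi).copy()             # eq. (6)

    mask = np.zeros(n, bool); mask[::4] = True           # 25% of nodes measured
    S = np.diag(mask.astype(float))
    D = np.where(mask, np.sin(np.linspace(0, 2*np.pi, n)), 0.0)
    S_tilde = np.diag(Phi.T @ S @ Phi).copy()            # diag(Phi^T S Phi)

    U = np.zeros(n)
    for t in range(5):                                   # "three to five iterations"
        resid = D - (lam * K + S) @ U                    # residual D^t
        U = U + Phi @ ((Phi.T @ resid) / (lam * Omega2 + S_tilde))   # eq. (10)

    exact = np.linalg.solve(lam * K + S, D)
    print(np.linalg.norm(U - exact) / np.linalg.norm(exact))   # compare with the dense solution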

Discontinuities. The matrix K describes the connectivity between adjacent points on a


continuous surface; thus whenever a discontinuity occurs K must be altered. Following
Terzopoulos [9], we can accomplish this by disabling receptive fields that cross discontinu-
ities. In a computer implementation, the simplest method is to locally halt the recursive
construction of the wavelet transform whenever one of the resulting bases would cross a
discontinuity.

An Example. Figure 2(a) shows the height measurements input to a 64 x 64 node inter-
polation problem (zero-valued nodes have no data); the vertical axis is height. These data
were generated using a sparse (10%) random sampling of the function z = 100[sin(kx)+
sin(ky)]. Figure 2(b) shows the resulting interpolated surface. In this example Equation
10 converged to within 1% of its true equilibrium state with a single iteration. Execution
time was approximately 1 second on a Sun 4/330.

Fig. 2. A surface interpolation problem; solution after one iteration (1 second on a Sun 4/330).

2.1 Summary

I have described a method for surface interpolation that uses orthogonal wavelets to
obtain good interpolations with only a very few iterations. The method has a simple bio-
logical implementation, and its performance was illustrated with wavelets that accurately
model human spatial frequency sensitivity.

References
1. Albert, B., Beylkin, G., Coifman, R., Rokhlin, V. (1990) Wavelets for the Fast Solution of
Second-Kind Integral Equations. Yale Research Report DCS.RR.837, December 1990.
2. Daubechies, I. (1988) Orthonormal Bases of Compactly Supported Wavelets. Communica-
tions on Pure and Applied Mathematics, XLI:909-996, 1988.
3. Kohonen, T., (1982) Self-organized formation of topologically correct feature maps, Biol.
Cyber., 43, pp. 59-69.
4. Linsker, R. (1986) From basic network principles to neural architecture, Proc. Nat. Acad.
Sci, U.S.A., 83, pp. 7508-7512, 8390-8394, 8779-8783.
5. Mallat, S. G., (1989) A theory for multiresolution signal decomposition: the wavelet repre-
sentation, IEEE Trans. PAMI, 11(7):674-693, 1989
6. Pentland, A., (1991) Cue integration and surface completion, Invest. Opthal. and Visual
Science 32(4):1197, March 1991.
7. Poggio, T., Torre, V., and Koch, C., (1985) Computational vision and regularization theory,
Nature, 317:314-319, Sept. 26, 1985.
8. Simoncelli, E., and Adelson, E., (1990) Non-Separable Extensions of Quadrature Mirror
Filters to Multiple Dimensions, Proceedings of the IEEE, 78(4):652-664, April 1990
9. Terzopoulos, D., (1988) The Computation of visible surface representations, IEEE Trans.
PAMI, 10(4):417-439, 1988.
10. Wilson, H., and Gelb, G., (1984) Modified line-element theory for spatial-frequency and
width discrimination, J. Opt. Soc. Am. A 1(1):124-131, Jan. 1984.

This article was processed using the LaTeX macro package with ECCV92 style
Smoothing and Matching of 3-D Space Curves *

André Guéziec and Nicholas Ayache


INRIA, BP 105,
78153 Le Chesnay Cédex, FRANCE,
e-mail: gueziec and ayache@bora.inria.fr

Abstract. We present a new approach to the problem of matching 3D
curves. The approach has an algorithmic complexity sublinear with the
number of models, and can operate in the presence of noise and partial
occlusions.
Our method builds upon the seminal work of [9], where curves are first
smoothed using B-splines, with matching based on hashing using curvature
and torsion measures. However, we introduce two enhancements:
- We make use of non-uniform B-spline approximations, which permits
us to better retain information at high curvature locations. The spline
approximations are controlled (i.e., regularized) by making use of normal
vectors to the surface in 3-D on which the curves lie, and by an explicit
minimization of a bending energy. These measures allow a more accurate
estimation of position, curvature, torsion and Frénet frames along the
curve;
- The computational complexity of the recognition process is considerably
decreased with explicit use of the Frénet frame for hypotheses generation.
As opposed to previous approaches, the method better copes with
partial occlusion. Moreover, following a statistical study of the curvature
and torsion covariances, we optimize the hash table discretisation
and discover improved invariants for recognition, different from the torsion
measure. Finally, knowledge of invariant uncertainties is used to
compute an optimal global transformation using an extended Kalman
filter.
We present experimental results using synthetic data and also using
characteristic curves extracted from 3D medical images.

1 Introduction

Physicians are frequently confronted with the very practical problem of registering 3D
medical images. For example, when two images provided by complementary imaging
modalities must be compared, (such as X-ray Scanner, Magnetic resonance Imaging,
Nuclear Medicine, Ultrasound Images), or when two images of the same type but acquired
at different times and/or in different positions must be superimposed.
A methodology exploited by researchers in the Epidaure Project at Inria, Paris, con-
sists of extracting first highly structured descriptions from 3D images, and then using
those descriptions for matching [1]. Characteristic curves describe either topological sin-
gularities such as surface borders, hole borders, and simple or multiple junctions, etc.,
(see [10]), or differential structures, such as ridges, parabolic lines, and umbilic points [11].

* This work was financed in part by a grant from Digital Equipment Corporation. General
Electric-CGR partially supported the research that provided ridge extraction software.

The characteristic curves are stable with respect to rigid tranformations, and can tolerate
partial occlusion due to their local nature. They are typically extracted as a connected
set of discrete voxels, which provides a much more compact description than the original
3D images (involving a few hundreds of points compared to several million). Fig. 1 shows
an example of ridges extracted from the surface of a skull [12]. These curves can be used
to serve as a reference identifying positions and features of the skull and to establish
landmarks to match skulls between different individuals, yielding a standard approach
for complex skull modeling [4].

Fig. 1. Extraction of characteristic curves (crest lines) from the surface of a skull (using two
different X-ray Scanner images)
The problem we address in this paper is the use of these curves to identify and
accurately locate 3D objects. Our approach consists in introducing a new algorithm
to approximate a discrete curve by a sufficiently smooth continuous one (a spline) in
order to compute intrinsic differential features of second and third order (curvature and
torsion). Given two curves, we then wish to find, through a matching algorithm, the
longest common portion, up to a rigid transformation. From three possible approaches,
specifically: prediction-verification, accumulation and geometric hashing, we retained the
third one whose complexity is sublinear in the number of models. We call it an indexation
method, and introduce logical extensions of the work of [9, 15, 3]. Our work is also closely
related to the work of [2, 6, 16] on the identification and positionning of 3D objects.
In Section 2, we discuss approaches to fitting curves to collections of voxels (points) in
3D imagery. In Section 3, we implement a matching system based on the indexation (geo-
metric hashing), whose complexity is sublinear in the number of models in the database.
Certain modifications are required for use with the differentiable spline curve represen-
tation, and other enhancements are suggested, in order to make the method robust to
partial occlusion of the curves (potentially in multiple sections). We finally introduce
alternative invariants for hashing. In sum, we considerably extend previous indexation-
based curve-matching methods. In Section 4, we provide experimental results obtained
using real data.

2 Approximation of Noisy Curves with B-Splines


We constrain the approximation to fit the data to within a maximum deviation dis-
tance, which is a parameter that depends on knowledge of expected errors due to image
acquisition, discretisation and boundary detection (see [11]).
B-spline curves, which include the class of polygonal curves, can readily provide dif-
ferential information at any point along the spline curve, and satisfy certain optimality
properties, viz., they minimise a certain measure of the bending energy [8]. There is an
extensive literature on B-splines. We provide a very brief introduction, using the notation
of [3, 15]. Given a sequence of n + 1 points Pi(xi, yi, zi), i = 0..n in 3-space, a C^{K-2}
approximating B-spline consists of the following components:
1. A control polygon of m + 1 points is given, such that Vj(Xj, Yj, Zj), j = 0..m are
known points;
2. We are given m + 1 real-valued piecewise polynomial functions, Bj,K(u*), representing
the basis splines, which are functions of the real variable u* and consist of polynomials
of degree K - 1, and are globally of class C^{K-2}. The location in 3-space of the approx-
imating curve for a given parameter value u* is given by: Q(u*) = sum_{j=0}^{m} Vj Bj,K(u*).
3. The knots must also be specified, and consist of m + K real values {u*j}, with u*1 = 0
and u*_{m+K} = L, partitioning the interval [0, L] into m + K - 1 intervals. Here, L is the
length of the polygon joining the Pi's. If the intervals are uniform, then we say that
the approximation is a uniform B-spline.
We use the global parameter u* along the interval [0, L], and denote by u the relative
distances between knots, defined by u = (u* - u*j)/(u*_{j+1} - u*j). The basis spline functions
are defined recursively. The basis splines of order 1 are simply the characteristic functions
of the intervals:

Bj,1(u*) = 1 if u*j <= u* < u*_{j+1}, and 0 otherwise.

Successively higher-order splines are formed by blending lower-order splines:

Bj,K+1(u*) = ((u* - u*j)/(u*_{j+K} - u*j)) Bj,K(u*) + ((u*_{j+K+1} - u*)/(u*_{j+K+1} - u*_{j+1})) Bj+1,K(u*).

It is not hard to show that:

dBj,K+1(u*)/du* = K [ Bj,K(u*)/(u*_{j+K} - u*j) - Bj+1,K(u*)/(u*_{j+K+1} - u*_{j+1}) ].

Thus quadratic splines, the (Bj,3), are C^1, cubic splines, the (Bj,4), are C^2, etc. Because of
this simple formula, we may incorporate constraints on the derivatives in our measure of
the quality of an approximation, for the process of finding the best control points and
knots, and we will also be able to easily make use of differential measures of the curve
for matching purposes.
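The recursion translates directly into code. The following minimal sketch (Python, names ours) evaluates Bj,K(u*) by the blending formula above and sums the control vertices to get a curve point; it is the textbook Cox-de Boor recursion rather than anything specific to this paper.

    # Cox-de Boor evaluation of the basis spline B_{j,K} at parameter ubar,
    # given a non-decreasing knot vector 'knots'.  Order K = degree + 1.
    def bspline_basis(j, K, ubar, knots):
        if K == 1:                                        # order 1: characteristic function
            return 1.0 if knots[j] <= ubar < knots[j + 1] else 0.0
        out = 0.0
        if knots[j + K - 1] > knots[j]:                   # skip terms over zero-length spans
            out += (ubar - knots[j]) / (knots[j + K - 1] - knots[j]) \
                   * bspline_basis(j, K - 1, ubar, knots)
        if knots[j + K] > knots[j + 1]:
            out += (knots[j + K] - ubar) / (knots[j + K] - knots[j + 1]) \
                   * bspline_basis(j + 1, K - 1, ubar, knots)
        return out

    # Curve point Q(ubar) = sum_j V[j] * B_{j,K}(ubar), for 3-D control vertices V.
    def spline_point(V, K, ubar, knots):
        return [sum(V[j][c] * bspline_basis(j, K, ubar, knots) for j in range(len(V)))
                for c in range(3)]

    # Example: a uniform cubic (K = 4) B-spline with 4 control vertices.
    V = [(0, 0, 0), (1, 2, 0), (2, 2, 1), (3, 0, 0)]
    knots = [0, 1, 2, 3, 4, 5, 6, 7]
    print(spline_point(V, 4, 3.5, knots))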

2.1 A Previous Approximation Scheme
We next recall a classic approximation scheme due to Barsky [3]. This scheme has been
used by St-Marc and Médioni [15] for curve matching. Our emphasis is on the shortcom-
ings of the approach for our objectives and on proposed modifications.
Given n + 1 data points Pi(xi, yi, zi), i = 0..n, we seek m + 1 control vertices Vj, j =
0..m and m + K corresponding knots u*j, j = 0..m + K minimizing the sum of square
distances between the B-spline Q(u*) of degree K - 1 and the data Pi. The notion of
distance between a spline Q(u*) and a data point Pi is based on the parameter value u*i
where the curve Q(u*) comes closest to Pi. Thus, the criterion to minimize is:

A1 = sum_{i=0}^{n} ||Q(u*i) - Pi||^2.

The calculation of the u*i values is critical, since ||Q(u*i) - Pi|| is supposed to represent the
Euclidean distance of the point Pi to the curve. On the other hand, an exact calculation
of the values u*i is difficult, since they depend implicitly on the solution curve Q(u*). As
an expedient, Barsky suggests using for u*i the current total length of the polygonal curve
from P0 to Pi. Thus as an estimate, we can use u*i = sum_{k=0}^{i-1} ||P_{k+1} - P_k||. If B is the m + 1
by n + 1 matrix of the Bj,K(u*i), X the m + 1 by 3 control vertices matrix and z the n + 1 by
3 matrix of data point coordinates, A1 can be written as ||B^t X - z||^2. Differentiating
with respect to X leads to:

B B^t X - B z = 0,  or  A X = B z.

Because X^t B B^t X = ||B^t X||^2, we know that A and B have the same rank. Thus if m <= n,
A is positive definite up to numerical error. If the approximating curve is not a closed
curve then A is a band matrix of band size K and X can be determined in linear time
with respect to n with a Choleski decomposition.
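A compact numerical sketch of this least-squares step (Python/NumPy): chord-length estimates of the u*i, a design matrix of basis values, and a linear solve for the control vertices. The clamped uniform knot vector and the helper names are illustrative assumptions; the solve is equivalent to the normal equations A X = B z above.

    # Least-squares B-spline fit with chord-length parameter estimates.
    import numpy as np

    def basis(j, K, t, knots):                            # Cox-de Boor, order K
        if K == 1:
            return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
        out = 0.0
        if knots[j + K - 1] > knots[j]:
            out += (t - knots[j]) / (knots[j + K - 1] - knots[j]) * basis(j, K - 1, t, knots)
        if knots[j + K] > knots[j + 1]:
            out += (knots[j + K] - t) / (knots[j + K] - knots[j + 1]) * basis(j + 1, K - 1, t, knots)
        return out

    def fit_bspline(P, m, K=4):
        P = np.asarray(P, float)                          # (n+1) x 3 data points
        ubar = np.concatenate(([0.0], np.cumsum(np.linalg.norm(np.diff(P, axis=0), axis=1))))
        L = ubar[-1] * 1.0001                             # extend slightly so the endpoint falls inside
        knots = np.concatenate((np.zeros(K - 1), np.linspace(0, L, m - K + 3), np.full(K - 1, L)))
        Bt = np.array([[basis(j, K, t, knots) for j in range(m + 1)] for t in ubar])  # (n+1) x (m+1)
        X, *_ = np.linalg.lstsq(Bt, P, rcond=None)        # control vertices, (m+1) x 3
        return X, knots, ubar

    # Example: fit 8 control vertices to 50 noisy samples of a helix.
    t = np.linspace(0, 4 * np.pi, 50)
    P = np.c_[np.cos(t), np.sin(t), 0.2 * t] + 0.01 * np.random.randn(50, 3)
    X, knots, ubar = fit_bspline(P, m=7)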
In working with this method, we have observed that m + 1, the number of control
points, must be quite large in order to obtain a good visual fit to the data points. Worse,
small amplitude oscillations often appear, corrupting the derivative information, and
making derivative-based matching methods unworkable. For example, using the synthetic
data of a noisy helix (Fig. 2a), we reconstruct Fig. 2b using the Barsky method for
spline approximation. It can be seen that curvature and torsion measurements along the
approximation curve will be unstable. In the next section, we explain how the results
shown in Figs. 2c and 2d are obtained.

2.2 Improvements

Better Knot Distribution. The vertices of an approximating polygonal path will con-
centrate around locations of high curvature [13]. We make use of this property to dis-
tribute B-spline knots non-uniformly with respect to segment lengths, so that the knots
are denser around high curvature points. In this way, the B-spline, having a well defined
number of knots m + K (and consequently of vertices m + 1), will more closely approxi-
mate these portions of the curve. In order to cope with noise, the tolerance level of the
polygonal fit must exceed the standard deviation on the position of the points.
However, we utilize the following approach to locate the initial placements of the
points representing the locations of closest approach to the data points, u*i: rather than
following Barsky's suggestion (which makes use of the interpolating polygonal path, as
opposed to the approximating polygonal path), we simply project each point Pi onto the
approximating polygonal path and consider the relative position of the projected points
in terms of total chordlength of the path.

Improved Distance Estimates [14]. We next study the distance between a point
and a polynomial curve of arbitrary degree. The true u*i corresponds to the minimum of
||Q(u*) - Pi||. Let us thus consider the following equation, where u* is unknown:

Fi(u*) = d||Q(u*) - Pi|| / du* = 0.

We update u*i by a Newton-Raphson iteration, using the quantity du*i = -Fi(u*i)/Fi'(u*i). For
a detailed calculation, the reader may refer to [7]. Despite their apparent complexity,
these computations are not very expensive, since B'j,K and B''j,K were necessarily calcu-
lated before Bj,K (by the recursive definition). Moreover, once all u*i are updated by the
amounts du*i, we must once again solve the linear system for new control vertices {Vj}.
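A sketch of this foot-point refinement (Python/NumPy, names ours): it performs Newton steps on the derivative of the squared distance g(u*) = ||Q(u*) - Pi||^2, which has the same minimizer as the distance itself, with Q treated as a black-box callable and derivatives taken by finite differences.

    # Newton-Raphson refinement of the foot-point parameter for one data point:
    # minimize g(u) = ||Q(u) - P||^2, i.e. solve g'(u) = 0.
    import numpy as np

    def refine_footpoint(Q, P, u0, iters=5, h=1e-5):
        """Q: callable u -> 3-vector; P: data point; u0: chord-length initial guess."""
        u = u0
        g = lambda t: np.sum((Q(t) - P) ** 2)
        for _ in range(iters):
            dg  = (g(u + h) - g(u - h)) / (2 * h)          # g'(u)
            d2g = (g(u + h) - 2 * g(u) + g(u - h)) / h**2  # g''(u)
            if d2g <= 0:                                   # guard: stop if not locally convex
                break
            u -= dg / d2g                                  # Newton step: delta = -g'/g''
        return u

    # Example with a toy "spline": a circular arc Q(u) = (cos u, sin u, 0).
    Q = lambda u: np.array([np.cos(u), np.sin(u), 0.0])
    print(refine_footpoint(Q, np.array([2.0, 2.0, 0.0]), u0=0.5))   # approx. pi/4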

Minimization of Curvature. Cubic B-splines minimize, among all interpolants, the
norm of the second derivative [8]. Alternative criteria can be posed for smoothing; for
example, we might choose to minimize a weighted sum of the squared second derivatives
of the approximating curve (evaluated at the projection points) together with the distance
error from the data points:

A2 = (1/sigma1^2) sum_{i=0}^{n} ||Q(u*i) - Pi||^2 + (1/sigma2^2) sum_{i=0}^{n} ||Q''(u*i)||^2,  with sigma1^2 = Var(||Q(u*i) - Pi||),  sigma2^2 = Var(||Q''(u*i)||),

and Var designates the observed variance of the argument values over the index i. The
second term is related to the bending energy of the spline. Since the second derivative
values are linear in terms of control vertices, A2 is again quadratic, the construction
and complexity are as before, and the result is a spline. Fig. 2d illustrates results of
minimizing A2.

Incorporation of Surface Normals. Finally, we assume that the curve is supposed to
lie in a surface whose normals are known. Thus, at every point along the approximating
curve, the tangent direction should be orthogonal to the surface normal ni. Accordingly, we
penalize our optimization criterion by a measure of the violations of this condition:

A3 = A2 + (1/sigma3^2) sum_{i=0}^{n} (Q'(u*i) . ni)^2,  with sigma3^2 = Var(Q'(u*i) . ni).
Note that the surface normals are a function of position, and must be provided in all
of three-space (or in any case, near the surface), even though the normal vector field is
only truly defined on the surface. This is the case when dealing with 3D medical images
including (possibly noisy) iso-intensity surfaces. The gradient of the intensity function is
identified with the surface normal direction, and is available at any 3D point. In [11], a
study on the stability of such measurements is provided. Finally, A3 is still quadratic, but
due to the scalar product, the variables cannot be separated and the system size is multiplied
by three; the regularization parameters tau = (sigma2/sigma1)^2 and nu = (sigma3/sigma1)^2 are arbitrarily
chosen so that A2 and A3 have no unit. We will describe and compare in a forthcoming
report automatic methods to optimize tau and nu.

3 Indexing for Curve Model Matching


Formally, our problem is stated as follows: we are given a set of model curves {Mi) and
an extracted (unknown) curve S. We wish to: (i) identify a curve Mi which has the
largest subset of points in common with S after a rigid transformation; and (ii) specify
that rigid transformation that best associates the two curves.
In a preprocessing phase, we construct an indeza~ion ~able, where entries are associ-
ated with pairs of values (c, r). For each pair, a list of entries of the form rni,j is formed,
denoting the fact that point number j on model Mi has a curvature and torsion value
that is close to (c, ~-). Note that the models curves have been sampled according to the
original sampling in the image.
During the recognition phase, we walk along the list of points of S, and for each
point sz we examine the ilst of entries associated with the index c(sz), r(sz). For each
entry rr~ d in the list, we compute a six-parameter rigid transformation Di,jj (see [7])
that would bring the point on S at sz into correspondence with the point rni,j of model
Mi. We register a vote for the pair (Mi, Dijj). This is N O T Hough transform. In Hough

Fig. 2. a. Top left: Noise is added to a helix, and points are sampled with a limitation on the
distance between successive points. The curvature and torsion are plotted in the top and the
right panels of the cube, as a function of arclength. In a perfect reconstruction, the curvature
and torsion would be constant.
b. Top right: In the reconstruction method as suggested by Barsky, curvature and (especially)
torsion values are extremely noisy, despite the quality of the reconstruction (in terms of position)
of the original curve.
c. Bottom left: A more precise estimate of model-data distances improves the estimation of
curvature and torsion.
d. Bottom right: The constraint on the second derivative also improves the estimation.

transform, a hypothesis votes for a hyperplane in parameter space. Thus cluster detection
is inefficient (see [17]). We instead vote for a single point.
After processing all of the n points along S, we locate the pairs of the form (model,
displacement) that have received a lot of votes (relative to some error measure in dis-
placements), and verify the indicated matches. The complexity of the recognition phase,
disregarding the preprocessing, is essentially independent of the number of models. The
apparent complexity lies somewhere between O(n) and O(n^2), depending on the level of
quantization of the index space according to curvature and torsion.
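A minimal sketch of this table-and-vote scheme is given below; the bin sizes and the displacement callback (which would be computed from the Frénet frames as in [7]) are placeholders, and quantising transformations by simple rounding stands in for the covariance-based cell sizes discussed in Sect. 3.1.

import numpy as np
from collections import defaultdict

def quantise(x, step):
    return tuple(np.round(np.asarray(x, dtype=float) / step).astype(int))

def build_table(models, step=0.05):
    # models: {name: [(curvature, torsion), ...] for each sampled model point}
    table = defaultdict(list)
    for name, feats in models.items():
        for j, (c, tau) in enumerate(feats):
            table[quantise((c, tau), step)].append((name, j))
    return table

def recognise(scene, table, displacement, step=0.05, d_step=0.1):
    # scene: [(curvature, torsion), ...] along the unknown curve S.
    # displacement(l, name, j): 6-parameter rigid transform bringing scene
    # point l onto point j of the named model.
    votes = defaultdict(int)
    for l, (c, tau) in enumerate(scene):
        for name, j in table.get(quantise((c, tau), step), []):
            votes[(name, quantise(displacement(l, name, j), d_step))] += 1
    return max(votes, key=votes.get) if votes else None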
This description of the method of indexation is essentially the "geometric hashing"
method of Kishon and Wolfson [9], updated in one important aspect. They use a

polygonal representation of the curves, and thus vote for a model and a displacement length,
representing a difference between the arclength locations of the point s_l and the candi-
date matching point m_{i,j}, measured relative to some reference point along each curve's
representation. Since our representation of the curves includes a differentiable structure
and thus Frénet frames, we may include the explicit calculation of the entire rigid trans-
formation as part of the recognition process. The advantage of our method is that, whereas
the arclength parametrization can suffer from inaccuracies and accumulative errors,
the six-parameter rigid transformation suffers only from local representation error. An-
other advantage of voting for rigid transformations is that we may use a statistical method
to compute a distance between two such transformations, and incorporate this into the
voting process and the indexation table [7].

3.1 Enhancements to the Indexation Method

Indexation Table Quantization. Guided by [5], we collect statistics based on exper-
iments with simulated and real data, described in [7]. These statistics provide expected
variances for the curvature and torsion values of typical noisy curves, and also covariance
values for pairs of values taken from intra- and inter-curve pairs of points. In order to es-
tablish an "optimal" discretisation cell size in the (c, tau) space, we study these covariance
values.

A Metric for Rigid Transformations. At the same time, we compute covariance val-
ues for the six-parameter rigid transformations that are obtained by matching points
along a scene curve with model curves. The resulting covariance matrix is used in the
definition of the Mahalanobis distance metric which we subsequently use to determine
the proximity of two distinct rigid transformations.
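A sketch of such a metric, assuming the six parameters are stacked into a vector and that a 6 x 6 covariance matrix has been estimated beforehand; it ignores subtleties such as angle wrap-around in the rotation parameters.

import numpy as np

def mahalanobis2(d1, d2, cov):
    # Squared Mahalanobis distance between two six-parameter rigid
    # transformations d1, d2, given their 6x6 covariance matrix.
    diff = np.asarray(d1, dtype=float) - np.asarray(d2, dtype=float)
    return float(diff @ np.linalg.solve(cov, diff))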

Recursive Transformation Estimation. Throughout the recognition phase, as soon
as a pair of points is matched such that the transformation defined by the associated
Frénet frames is sufficiently close to some previously recognized matching, the estimation
of the prototype transformation to be used as the matching criterion may be refined
through the use of a recursive filter, such as the Kalman filter. The experiments show
that this procedure can significantly improve the robustness of the method.
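The simplest recursive scheme of this kind is a static-state Kalman update, sketched below; each newly matched transformation is treated as a noisy measurement of the prototype transformation. This is only an illustrative choice and not necessarily the exact filter used in [7].

import numpy as np

def kalman_update(mean, cov, z, z_cov):
    # mean, cov: current prototype transform (6-vector) and its covariance.
    # z, z_cov: newly matched transform and its measurement covariance.
    mean = np.asarray(mean, dtype=float)
    z = np.asarray(z, dtype=float)
    K = cov @ np.linalg.inv(cov + z_cov)               # Kalman gain
    new_mean = mean + K @ (z - mean)
    new_cov = (np.eye(len(mean)) - K) @ cov
    return new_mean, new_cov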

Alternative Geometric Invariants for Matching. Suppose that we are given a ref-
erence point B on a model curve, and consider the points P on the same curve. For each
point P, we can define the rigid transformation D = (R, u) that maps the Frénet frame
at B onto the Frénet frame at P, and associate the six parameters with the point P. For
a fixed basis point, these parameters are invariant with respect to rigid transformations,
and consist of the three rotation coordinates (r_t, r_n, r_b) with respect to the basis frame,
and the translation coordinates (u_t, u_n, u_b), again measured in the basis frame. If the
curve lies in a plane, then r_t will always be zero, in which case it is preferable to use the
representation (theta_t, theta_n, theta_b), the angles between the vectors of the frame at B and of the frame
at P. We investigate in [7] the utility of these various invariants, and observe that theta_t and
u_t are more stable than torsion, and have a greater discrimination power than ||u||.
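The invariance of these parameters is easy to check in code: the sketch below expresses the frame at P in the coordinates of the frame at B, and the result is unchanged if the whole curve undergoes a rigid motion. The frame representation (3 x 3 matrices with columns tangent, normal, binormal) and the return values are illustrative choices, not the exact parameterization of [7].

import numpy as np

def frame_invariants(R_B, o_B, R_P, o_P):
    # R_B, R_P: Frenet frames at the basis point B and at P (columns t, n, b);
    # o_B, o_P: the corresponding curve points.
    R_rel = R_B.T @ R_P                 # rotation from frame B to frame P
    u = R_B.T @ (o_P - o_B)             # translation (u_t, u_n, u_b) in frame B
    return R_rel, u                     # unchanged under any rigid motion of the curve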

New Indexation Methods. In the preprocessing phase of the model curves, a basis
point B is selected for each such curve, and the (c, theta_t, u_t) parameters are calculated for

every point P on the (sampled) curve. This computation is repeated for every model
curve, and for extremal curvature basis points B along the curve. In this way, the in-
formation about the model curves is stored in a three-dimensional table, indexed by
(c, theta_t, u_t). In each bin of this table, entries consist of a model curve, a point on that curve
(together with the corresponding Frénet frames), and a basis point B also on the curve.
For the recognition algorithm, a basis point is selected on an unknown curve, and
transformations are computed from that basis point to other points along the curve. For
each such computation, the parameters (c, theta_t, u_t) map to a bin in the three-dimensional
table, which gives rise to votes for model/basis pairs, as before. This procedure
applies also to curves in multiple sections (the features are exclusively local), and, finally, to
scattered points associated with curvature information and a local reference frame.
Experimental results are reported in the next section.

4 Results

Using two views (A and B) of a skull from real X-ray scanner data, we used existing
software (see [12]) to find ridge points, and then fed these points into the curve smoothing
algorithm of Section 2. For each view, the sub-mandibular rim, the sub-orbital ridges,
the nose contour and other curves were identified.
Using the indexation algorithm of Section 3.1, we preprocessed all curves from A,
also in the reverse orientation if necessary (to cope with orientation problems), and built
the indexation table, based on measurements of (c, theta_t, u_t) along the curves. Applying the
indexation-based recognition algorithm, curves from B were successfully identified. The
resulting transformations were applied to all curves from B and superimposed matches
(model, scene) appear in Figs. 3b to 3d. We next ran our algorithm on the chin and
right orbit curves considered as one single curve (Fig. 3e) and finally on all curves simul-
taneously (Fig. 3f). CPU times on a DEC workstation (in seconds) for recognition and
positioning are summarized in the following table. It confirms the linear time hypothesis.
    scene curve | nose contour | right orbit | left orbit | chin  | chin + orbit | all curves from B
    CPU time    |    1.085     |    0.964    |   1.183    | 2.577 |    3.562     |      9.515
Note that incorporating more curves increases the likelihood of the match. We thus
start from a local curve match and end up with one global rigid transformation.
We then experimented with matching using scattered points (several hundred) on the
surface of the object, selected for their high curvature value on the surface and associated
with a surface frame [12] (Fig. 3g). Last, we registered the entire skull by just applying
the transformation that superimposed the two sub-mandibular curves. Incorporating the
match of the orbital ridge curves, we improved the overall rigid transformation estimate,
resulting in a more precise correspondence (Fig. 4).

References
1. N. Ayache, J.D. Boissonnat, L. Cohen, B. Geiger, J. Levy-Vehel, O. Monga, and P. Sander.
Steps toward the automatic interpretation of 3-D images. In H. Fuchs, K. Höhne, and
S. Pizer, editors, 3D Imaging in Medicine, pages 107-120. NATO ASI Series, Springer-
Verlag, 1990.
2. N. Ayache and O.D. Faugeras. Hyper: A new approach for the recognition and positioning
of two-dimensional objects. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 8(1):44-54, January 1986.


Fig. 3. a. Top left: The successful matching of the two sub-mandibular curves, superimposed.
(Note that the occlusion and translation of the second view are handled automatically.)
b. Top middle: Nose contours matched. c. Top right: Right orbits matched.
d. Middle left: Left orbits matched. e. Middle: Chin and orbit matched simultaneously.
f. Middle right: All curves matched simultaneously.
g. Bottom: The matching algorithm is successfully applied (bottom right) to scattered points
associated to A (bottom left) and B (bottom middle), represented here together with their
reference frame. There is a scale factor on the x and y axes, due to the evaluation in image
coordinates (as compared to real coordinates in the previous maps).

3. R. Bartels, J. Beatty, and B. Barsky. An introduction to splines for use in computer graph-
ics and geometric modelling. Morgan Kaufmann Publishers, 1987.
4. Court B. Cutting. Applications of computer graphics to the evaluation and treatment of
major craniofacial malformation. In Jayaram K. Udupa and Gabriel T. Herman, editors, 3D
Imaging in Medicine. CRC Press, 1989.
5. W. Eric L. Grimson and Daniel P. Huttenlocher. On the verification of hypothesized
matches in model-based recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 13(12):1201-1213, December 1991.
6. W.E.L. Grimson and T. Lozano-Pérez. Model-based recognition and localization from
sparse range or tactile data. International Journal of Robotics Research, 3(3):3-35, 1984.

Fig. 4. Registering the ridges: the top row shows the ridges extracted from a skull scanned in
position A (top left) and position B (top right). The figure in the bottom left shows the superposition
of the ridge points, obtained after transforming the points of the second view according to the
transformation discovered by matching the sub-mandibular curves. The best correspondences are
along the chin points. The figure in the bottom right shows the improved transformation obtained
with the addition of the left sub-orbital curves.

7. A. Guéziec and N. Ayache. Smoothing and matching of 3D-space curves. Technical Report
1544, INRIA, 1991.
8. J.C. Holladay. Smoothest curve approximation. Math. Tables Aids Computation, 11:233-
243, 1957.
9. E. Kishon, T. Hastie, and H. Wolfson. 3-D curve matching using splines. Technical report,
AT&T, November 1989.
10. G. Malandain, G. Bertrand, and N. Ayache. Topological segmentation of discrete
surface structures. In Proc. International Conference on Computer Vision and Pattern
Recognition, Hawaii, USA, June 1991.
11. O. Monga, N. Ayache, and P. Sander. From voxels to curvature. In Proc. International
Conference on Computer Vision and Pattern Recognition, Hawaii, USA, June 1991.
12. O. Monga, S. Benayoun, and O.D. Faugeras. Using third order derivatives to
extract ridge lines in 3D images. Submitted to IEEE Conference on Computer Vision and Pattern
Recognition, Urbana-Champaign, June 1992.
13. T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, 1977.
14. M. Plass and M. Stone. Curve fitting with piecewise parametric cubics. In SIGGRAPH, pages
229-239, July 1983.
15. P. Saint-Marc and G. Medioni. B-spline contour representation and symmetry detection.
In First European Conference on Computer Vision (ECCV), Antibes, April 1990.
16. F. Stein. Structural hashing: Efficient 3-D object recognition. In Proc. International Con-
ference on Computer Vision and Pattern Recognition, Hawaii, USA, June 1991.
17. D.W. Thompson and J.L. Mundy. 3-D model matching from an unconstrained viewpoint.
In Proc. International Conference on Robotics and Automation, pages 208-220, 1987.
Shape from Texture for Smooth Curved Surfaces

Jonas Gårding
Computational Vision and Active Perception Laboratory (CVAP)
Department of Numerical Analysis and Computing Science
Royal Institute of Technology, S-100 44 Stockholm, Sweden
Email: jonasg@bion.kth.se

Abstract. Projective distortion of surface texture observed in a perspec-


tive image can provide direct information about the shape of the underlying
surface. Previous theories have generally concerned planar surfaces; in this
paper we present a systematic analysis of first- and second-order texture
distortion cues for the case of a smooth curved surface. In particular, we
analyze several kinds of texture gradients and relate them to surface orien-
tation and surface curvature. The local estimates obtained from these cues
can be integrated to obtain a global surface shape, and we show that the two
surfaces resulting from the well-known tilt ambiguity in the local foreshort-
ening cue typically have qualitatively different shapes. As an example of a
practical application of the analysis, a shape from texture algorithm based
on local orientation-selective filtering is described, and some experimental
results are shown.

1 Introduction

Although direct information about depth and three-dimensional structure is available


from binocular and dynamical visual cues, static monocular images can also provide im-
portant constraints on the structure of the scene. For example, the simple line drawing
shown in Fig. 1 gives a fairly convincing impression of a receding plane covered with cir-
cles. Nevertheless, it is far from trivial to determine precisely from which image qualities
this interpretation is derived.

Fig. 1. This image of a slanting plane covered with circles illustrates several forms of projective
distortion that can be used to estimate surface shape and orientation.

The fact that projective texture distortion can be a cue to three-dimensional surface
shape was first pointed out by Gibson [5]. His observations were mostly of a qualitative
nature, but during the four decades which have passed since the appearance of Gibson's

seminal work, many interesting and useful methods for the quantitative recovery of sur-
face orientation from projective distortion have been proposed; see e.g. [3] for a review.
Furthermore, psychophysical studies (e.g. [1, 2]) have verified that texture distortion
does indeed play an important role in human perception of three-dimensional surfaces.
However, in our view there are two important issues which have not received enough
attention in previous work.
Firstly, most of the proposed mechanisms are based on the assumption that the
surface is planar. As pointed out by many authors, real-world physical surfaces are rarely
perfectly planar, so the planarity assumption can at best be justified locally. However,
limiting the size of the analyzed region is generally not enough. We show in this paper
that even for infinitesimally small surface patches, there is only a very restricted class
of texture distortion measures which are invariant with respect to surface curvature.
Gibson's gradient of texture density, for example, does not belong to this class.
Secondly, the possibility of using projective texture distortion as a direct cue to sur-
face properties has not been fully exploited in the past. For example, a local estimate
of a suitably chosen texture gradient can be directly used to estimate the surface ori-
entation. Although this view of texture distortion as a direct cue predominates in the
psychophysical literature, most previous work in computational vision has proposed in-
direct approaches (e.g. backprojection) where a more or less complete representation of
the image pattern is used in a search procedure.
The main purpose of the present work is to analyze the use of projective distortion as a
direct and local cue to three-dimensional surface shape and orientation. We concentrate
on those aspects that depend on the surface and imaging geometry, and not on the
properties of the surface texture. Whereas most previous work has assumed that the
scene is planar and sometimes also that the projection is orthographic, we study the
more general case of a smooth curved surface viewed in perspective projection. Early
work in the same spirit was done by Stevens [7], who discussed the general feasibility
of computing shape from texture, and derived several formulas for the case of a planar
surface.
A more detailed account of the work presented here can be found in [4].

2 Local Geometry of the Perspective Mapping


Figure 2a illustrates the basic viewing and surface geometry. A smooth surface S is
mapped by central projection onto a unit viewsphere Σ centered at the focal point. This
spherical projection model has the advantage that it treats all parts of the field of view
equally, and it is equivalent to the ordinary perspective projection onto a flat image plane
in the sense that if one of these projections is known, the other can be computed.
In the following we will make use of several concepts from standard differential geom-
etry; see e.g. O'Neill [6] for background. Consider a small patch around the point p in the
image. Assuming that this patch is the image of a corresponding patch on the smooth
surface S, we have a local differentiable mapping F from the image to the surface. The
linear part F_* of F is a 2 x 2 matrix called the derivative map, which can be seen both
as a local linear approximation to F and as an exact mapping from the tangent plane of
the image (or retina) to the tangent plane of the surface. To first order, we can consider
the tangent plane of the viewsphere to be the local image of the surface.
A convenient orthonormal basis (t, b) for the tangent plane of the viewsphere at the
point p is obtained by defining t to be a unit vector in the direction of the gradient of
distance from the focal point to the surface, and then setting b = p x t. The tangent

Fig. 2. a) Local surface geometry and imaging model. The tangent planes to the viewsphere Σ
at p and to the surface S at F(p) are seen edge-on but are indicated by the tangent vectors
t and T. The tangent vectors b and B are not shown but are perpendicular to the plane of
the drawing, pointing into the drawing. b) The derivative map F_* can be visualized by an image ellipse
which corresponds to a unit circle in the surface.

direction t is usually called the tilt direction. The angle sigma between the viewing direction
p and the surface normal N is called the slant of the surface. Together, slant and tilt
specify the surface orientation uniquely.
We also define an orthogonal basis (T, B) for the tangent plane to the surface S at
F(p) as the normalized images under F_* of t and b respectively.

2.1 First-Order Distortion: Foreshortening

Starting with Gibson [5], much of the literature on shape from texture has been concerned
with texture gradients, i.e., the spatial variation of the distortion of the projected pattern.
However, an important fact which is sometimes overlooked is that texture gradients are
not necessary for slant perception; there is often sufficient information in the local first-
order projective distortion (F_*) alone.
F_* specifies to first order how the image pattern should be "deformed" to fit the
corresponding surface pattern. For a frontoparallel surface, F_* is simply a scaling by the
distance, but for a slanted and tilted surface it will contain a shear as well. It can be
shown that in the bases (t, b) and (T, B), we have the very simple expression

    F_* = r diag(1/cos sigma, 1) = diag(1/m, 1/M)                                   (1)

where r = ||F(p)|| is the distance along the visual ray from the center of projection to
the surface. The characteristic lengths (m, M) have been introduced to simplify later
expressions and because of their geometric significance: F_* can be visualized by an image
ellipse corresponding to a unit circle in the surface (Fig. 2b). The minor axis of the ellipse
is aligned with t and has the length 2m, and the major axis has the length 2M.
The ratio m/M is called the foreshortening of the pattern. We see that the magnitude
and direction of foreshortening determine the slant sigma uniquely, and the tilt t up to sign.
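In code, this local estimate is a one-liner; the sketch below assumes that the minor and major characteristic lengths m, M and the minor-axis direction of the image ellipse have already been measured (for instance with the filtering scheme of Sect. 5), and returns the tilt only modulo the sign ambiguity discussed above.

import numpy as np

def orientation_from_foreshortening(m, M, minor_axis):
    # m/M = cos(slant); the minor axis of the ellipse gives the tilt direction,
    # but only up to sign.
    slant = np.arccos(np.clip(m / M, 0.0, 1.0))
    tilt = np.arctan2(minor_axis[1], minor_axis[0])    # defined modulo pi
    return slant, tilt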

3 Second-Order Distortion: Texture Gradients

We are now prepared to take a closer look at the information content of texture gradients,
i.e., various measures of the rate of change of projective texture distortion. Gibson [5]
suggested the gradient of texture density as a main cue. Many other texture gradients
have subsequently been considered in the literature, see e.g. Stevens [7] or Cutting and
Millard [2]. These authors have restricted the analysis to the case of a planar surface.
In this section we reexamine the concept of texture gradients for the more general
case of a smooth curved surface. The analysis of texture gradients can be divided into
two relatively independent subproblems; firstly, gradient measurement, and secondly,
gradient interpretation. Here we concentrate on the interpretation task, but one specific
measurement technique is described in Sect. 5.

3.1 Distortion Gradients

The most obvious way of defining the rate of change of the projective distortion is by
the derivatives of the characteristic lengths M and m defined by (1). This definition
encompasses most of the texture gradients that have been considered in the literature,
e.g. the compression gradient lambda_1 grad(m), the perspective gradient lambda_2 grad(M), the foreshortening
gradient grad(e) = (lambda_1/lambda_2) grad(m/M), the area gradient grad(A) = lambda_1 lambda_2 grad(mM), and the density
gradient grad(rho) = rho_s grad(1/(mM)), where lambda_1, lambda_2 and rho_s are unknown scale factors.
We will refer collectively to such gradients as distortion gradients. They do not all
provide independent information, since by the chain rule the gradient of any function
f(M, m) is simply a linear combination of the basis gradients grad(M) and grad(m).
In practice it makes more sense to consider the normalized gradients (grad M)/M and
(grad m)/m, since these expressions are free of scale factors depending on the distance to
the surface and the absolute size of the surface markings. Explicit expressions for these
gradients are given by the following proposition:

Proposition 1 Basis texture gradients. In the basis (t, b), where t is the tilt direc-
tion, the basis texture gradients are given by

    (grad m)/m = -(1/cos sigma) ( 2 sin sigma + r kappa_t tan sigma ,  r tau sin sigma )        (2)

    (grad M)/M = ( -tan sigma ,  0 )                                                            (3)

where r is the distance from the viewer, sigma is the slant of the surface, kappa_t is the normal
curvature of the surface in the tilt direction, and tau is the geodesic torsion, or "twist", of
the surface in the tilt direction.
From Proposition 1 it is straightforward to derive explicit expressions for gradients of
any function of m and M. For example, we obtain the normalized foreshortening gradient

    (grad e)/e = -(1/cos sigma) ( sin sigma + r kappa_t tan sigma ,  r tau sin sigma )          (4)

and the normalized density gradient

    (grad rho)/rho = (1/cos sigma) ( 3 sin sigma + r kappa_t tan sigma ,  r tau sin sigma )     (5)

Proposition 1 and equations (4-5) reveal several interesting facts about texture gradi-
ents. Firstly, the minor gradient grad(m) depends on the curvature parameters kappa_t and tau,
whereas the major gradient grad(M) is independent of surface curvature. This is important
because it means that (grad M)/M can be used to estimate the local surface orientation,
and hence to corroborate the estimate obtained from foreshortening. Furthermore, unlike
foreshortening, (grad M)/M yields an estimate which has no tilt ambiguity.
Secondly, the direction of any texture gradient which depends on m, such as the
foreshortening gradient or the density gradient, is aligned with the tilt direction if and
only if the twist tau vanishes, i.e., if the tilt direction happens to be a principal direction in
the surface. This is of course always true for a planar surface, but for a general curved
surface the only distortion gradient guaranteed to be aligned with tilt is (grad M)/M.
Thirdly, the complete local second-order shape (i.e. curvature) of S cannot be estimated
by distortion gradients. The reason is that it takes three parameters to specify the surface
curvature, e.g. the normal curvatures kappa_t, kappa_b in the T and B directions and the twist tau.
The Gaussian curvature, for example, is given by K = kappa_t kappa_b - tau^2. However, the basis
gradients (2) and (3) are independent of the normal curvature kappa_b.
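A small sketch of the corresponding estimator, using relation (3) as reconstructed above: the norm of the normalised major gradient gives tan(slant) and its (negated) direction gives the tilt, with no sign ambiguity and no dependence on curvature. The inputs are assumed to be the measured 2-vector grad(M) and the local value of M.

import numpy as np

def orientation_from_major_gradient(grad_M, M):
    g = np.asarray(grad_M, dtype=float) / float(M)     # (grad M)/M in the image
    slant = np.arctan(np.linalg.norm(g))
    tilt_dir = -g / (np.linalg.norm(g) + 1e-12)        # unit vector along the tilt
    return slant, tilt_dir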

3.2 Length Gradients
The concept of distortion gradients can be generalized. Distortion gradients are defined
as the rate of change of some function of the characteristic lengths m and M, everywhere
measured relative to the tilt direction which may vary in the image. An alternative
procedure could be to measure the rate of change in a fixed direction in the image of
projected length in some direction w. It is a non-trivial fact that when w coincides with
the tilt or the perpendicular direction, this measure is equivalent to the corresponding
distortion gradient. However, a gradient can be computed for projected lengths in any
direction, not just t and b. This way we can obtain information about the surface shape
which cannot be provided by any distortion gradient.
A particularly useful example is the derivative in some direction of the projected
length measured in the same direction. This derivative could e.g. be estimated in a given
direction by measuring the rate of change of the distances between the intersections of
a reference line with projected surface contours. We have shown that the normalized
directional derivative computed this way in the direction w = alpha t + beta b is given by

    -alpha tan sigma ( 2 + r (alpha^2 kappa_t + 2 alpha beta tau + beta^2 kappa_b cos^2 sigma) / (cos sigma (alpha^2 + beta^2 cos^2 sigma)) )        (6)

which at a planar point simplifies to -2 alpha tan sigma.


Note that alpha = 0 in the direction perpendicular to the tilt, so that this derivative
vanishes. This observation is in keeping with Stevens' [7] suggestion that the tilt of a
planar surface can be computed as the direction perpendicular to the direction of least
variability in the image, and we now see that this suggestion is in principle valid for
curved surfaces as well. However, this direction is not necessarily unique; for a non-
convex surface with sufficiently large negative curvatures the derivative may vanish in
other directions as well.
It is also worth noting that the normalized directional derivative (6) can be measured
even for textures that only exhibit variation in a single direction (such as wood grain),
whereas there is no obvious way to measure either first-order distortion (foreshortening)
or distortion gradients for such textures. For a planar surface, it suffices to measure
this derivative in two perpendicular directions in order to determine the surface normal
uniquely.

4 Global Analysis

Obtaining local estimates of surface shape is only the first step in the estimation of shape
from texture. The local estimates must then be combined to obtain a global surface
description, and in this process ambiguities existing at the local level can sometimes be
resolved. In the next subsection we will examine one such possibility in more detail.

4.1 The Phantom Surface

Consider the common situation that local foreshortening is used to estimate surface ori-
entation. We pointed out in Sect. 2 that foreshortening only determines tilt up to sign,
leading to two mutually exclusive estimates of the local surface normal. By integrating
these two sets of surface normals we can in principle obtain two global surface descrip-
tions. We can then use any a priori knowledge we might have about the surface shape
(e.g. that it is approximately planar) to decide which of the two surfaces is more likely
to be correct.
The latter possibility has generally been overlooked in the literature, most likely
because the relation between the two surfaces is trivial if orthographic projection is
assumed. In this case the two surface normals are related by a reflection in the optical
axis, which corresponds to a reflection of the two surfaces in a plane perpendicular to
the optical axis. Hence, both surfaces will have the same qualitative shape.
In perspective projection, however, the relation is much more interesting and useful.
The sign ambiguity in the tilt direction now corresponds to a reflection of the surface
normal in the line of sight, which varies with position in the visual field. For example, if
the true surface has a constant surface normal (i.e., it is planar), then the other set of
surface normals will not be constant, i.e., it will indicate a curved surface. We will call
this surface the "phantom surface" corresponding to the actual surface. Strictly speaking,
we must first show that the surface normals obtained this way are integrable, so that the
phantom surface actually exists. It turns out that this is indeed the case, and that there
is a very simple relation between the true surface and the phantom surface:

Proposition 2 Phantom surface. Let S be a surface parameterized by the distance r
along the visual rays p, i.e., S = {r : r = r(p) p} for points p in some region of the
viewsphere Σ. Then S has everywhere the same magnitude and direction of foreshortening
as the corresponding phantom surface S~, obtained by inversion of S in the sphere Σ
followed by an arbitrary scaling, i.e.,

    S~ = { r~ : r~ = (K / r(p)) p }                                                  (7)

where K is an arbitrary positive constant. The phantom surface S~ has everywhere the
same slant but reversed tilt direction with respect to the true surface S.

An interesting observation is that the phantom surface and the true surface are equivalent
if and only if r(p) is constant, i.e., if the eye is looking at the inside of a sphere from the
center of that sphere.
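Numerically, the phantom surface of Proposition 2 is obtained simply by replacing each depth r(p) with K / r(p) along the same viewing direction, as in the sketch below (a direct transcription of (7), with K an arbitrary positive constant). Applying it to samples of a plane produces points on a sphere through the focal point, in agreement with the corollary below.

import numpy as np

def phantom_surface(directions, depths, K=1.0):
    # directions: (n, 3) unit viewing directions p; depths: (n,) values r(p).
    directions = np.asarray(directions, dtype=float)
    depths = np.asarray(depths, dtype=float)
    return (K / depths)[:, None] * directions          # inversion in the viewsphere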

The Phantom Surface of a Plane. For a planar surface, Proposition 2 takes a very
simple form. It is well known, and easy to show, that inversion in the sphere maps planes
to spheres (see Fig. 3). More precisely, we have the following corollary to Proposition 2:

Corollary 3. The phantom surface of a plane with surface normal N is a sphere passing


through the focal point and with its center anywhere along a line from the focal point in
the direction of N.


Fig. 3. This drawing shows the intersection of the XZ-plane with the plane Z cos sigma - X sin sigma = delta
and the corresponding phantom surface (a sphere) (X + K sin sigma)^2 + Y^2 + (Z - K cos sigma)^2 = K^2.
K is an arbitrary scaling constant which in this example has been set to delta/2.

5 An Application of the Theory


So far we have not discussed how the length measurements needed for the computation
of first- and second-order projective distortion can be obtained in practice. The prob-
lem is easy if the surface texture consists of well-defined texture elements (texels), but
unfortunately the problem of texel identification is very hard for most natural textures.
In this section we will briefly describe a method for computing the projective distortion
without first identifying texture elements. This will also serve to illustrate how some of
the general theoretical principles outlined in the previous sections can be used in practice.
A detailed description of the method can be found in [3].
The approach is based on a proposition (proven in [3]) which states that the second
moment matrix mu_I of the spectrogram of the image intensities is related to the corresponding
matrix mu_R for the spectrogram of the reflectance of the surface pattern by the simple relation

    mu_I = F_*^T mu_R F_*                                                            (8)

under the simplifying assumption that the image intensity at a point is directly propor-
tional to the reflectance at the corresponding point in the surface.
If mu_R is known and mu_I can be measured, we can recover the eigenvalues (m, M) and
corresponding eigenvectors of F_* by factoring (8), and then use (1) to compute slant and
tilt up to sign. Under the assumption that the surface reflectance pattern is isotropic, i.e.,
that mu_R is a multiple of the identity matrix, the eigenvectors of mu_I and F_* will be the same,
and the eigenvalues of F_* will be the square root of the eigenvalues of mu_I. Of course, m and M
(and hence F_*) can only be recovered up to the unknown scale factor in mu_R. Assuming that
this factor does not vary systematically in the surface, the normalized major gradient
(3) can be computed, providing an independent estimate of surface orientation. This
estimate has no tilt reversal ambiguity.
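Under the isotropy assumption the recovery step reduces to an eigen-analysis of the measured matrix, as sketched below; the correspondence between eigenvalues and (m, M) follows from relation (8) with mu_R proportional to the identity, and the tilt is again recovered only up to sign.

import numpy as np

def orientation_from_second_moments(mu_I):
    w, V = np.linalg.eigh(mu_I)                        # ascending eigenvalues
    foreshortening = np.sqrt(w[0] / w[1])              # m/M, up to the unknown scale
    slant = np.arccos(np.clip(foreshortening, 0.0, 1.0))
    tilt_dir = V[:, 1]                                 # eigenvector of the larger eigenvalue
    return slant, tilt_dir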

In our implementation the image spectrogram is sampled by convolving the image


with a set of complex 2-D Gabor filters, tuned to a range of spatial frequencies and
orientations.
The results obtained with a synthetic image are shown in Fig. 4. The synthetic image
shows a planar surface covered by a random isotropic reflectance pattern, generated
to have an approximately Gaussian power spectrum. The surface is slanted 60° in the
vertical direction with respect to the optical axis of the camera. The surface is viewed in
perspective with a visual angle of approximately 59° across the diagonal of the image.

(Figure panels, top row: Synthetic image | Estimated local distortion | Rescaled local distortion.
Bottom row: Using foreshortening (1) | Using foreshortening (2) | Using the major gradient.)
Fig. 4. Estimation of surface orientation from first- and second-order projective distortion.

The top row shows the original image (left), the estimated local distortion mu_I repre-
sented as ellipses on a superimposed 8 x 8 grid (middle), and a rescaled representation
of mu_I where all the ellipses have been given the same size to allow a better assessment of
their shapes.
The bottom row shows the estimated local shape, obtained by first transforming each
local mu_I to a plane perpendicular to the line of sight and then applying (8). The left image
in this row shows the first estimate of surface orientation obtained from foreshortening,
the middle image shows the second estimate obtained by reflecting the first surface nor-
mal in the line of sight, and the third image shows an independent estimate of surface
orientation, obtained from the normalized major gradient.
Note that the second estimate from foreshortening indicates a curved surface, as a
result of the varying angle between the line of sight and the surface normal. As was
shown in Sect. 4, this "phantom surface" is obtained from the true surface by inversion

in the sphere. The estimate from the major gradient is significantly more noisy than the
other estimates, which was to be expected since it is based on the spatial derivatives
of an estimated quantity. Nevertheless, it is obviously stable enough to resolve the tilt
ambiguity in the estimate from foreshortening.

6 Conclusion

Quantitative information about three-dimensional surface shape is directly available in


the local projective texture distortion and its first spatial derivatives. Local surface orien-
tation can be estimated independently of surface curvature from foreshortening if isotropy
is assumed, and from the major gradient if spatial invariance of the surface pattern is
assumed. All other texture distortion gradients depend on surface curvature and surface
orientation in combination, but they are independent of the normal curvature in the
direction perpendicular to the tilt. Hence, Gaussian curvature cannot even in theory be
recovered from distortion gradients.
Length gradients can be considered as a generalization of the distortion gradients. By
measuring directional derivatives of projected lengths in a number of different directions it
is in theory possible to estimate both surface orientation and surface curvature. However,
this approach remains to be tested empirically.
The global consistency of local estimates can often be exploited to resolve local am-
biguities. We have shown that the well-known tilt ambiguity in the local foreshortening
cue can lead to qualitatively different interpretations at the global level.
We have briefly described one method for measuring first- and second-order projec-
tive distortion. However, much further research is required to test the usefulness and
robustness of this approach with various natural textures in natural viewing conditions,
and there are also many possible variations of the basic scheme to be explored.

Acknowledgments. I would like to thank Jan-Olof Eklundh for guidance and support,


and John P. Frisby and his collaborators for many stimulating discussions. Part of this
work has been funded by the INSIGHT project within the ESPRIT BRA (Basic Research
Action). The support from the Swedish National Board for Industrial and Technical
Development (NUTEK) is gratefully acknowledged.

References
1. D. Buckley, J.P. Frisby, and E. Spivey, "Stereo and texture cue combination in ground planes:
an investigation using the table stereometer", Perception, vol. 20, p. 91, 1991.
2. J.E. Cutting and R.T. Millard, "Three gradients and the perception of flat and curved
surfaces", J. of Experimental Psychology: General, vol. 113(2), pp. 198-216, 1984.
3. J. Gårding, Shape from surface markings. PhD thesis, Dept. of Numerical Analysis and
Computing Science, Royal Institute of Technology, Stockholm, May 1991.
4. J. Gårding, "Shape from texture for smooth curved surfaces in perspective projection", Tech.
Rep. TRITA-NA-P9203, Dept. of Numerical Analysis and Computing Science, Royal Insti-
tute of Technology, Stockholm, Jan. 1992.
5. J. Gibson, The Perception of the Visual World. Houghton Mifflin, Boston, 1950.
6. B. O'Neill, Elementary Differential Geometry. Academic Press, Orlando, Florida, 1966.
7. K.A. Stevens, "The information content of texture gradients", Biological Cybernetics, vol. 42,
pp. 95-105, 1981.
This article was processed using the LaTeX macro package with ECCV92 style
Recognising rotationally symmetric surfaces from their outlines*

David A. Forsyth 1, Joseph L. Mundy 2, Andrew Zisserman 3 and Charles A. Rothwell 3

1 Department of Computer Science, University of Iowa, Iowa City, Iowa, USA.
2 The General Electric Corporate Research and Development Laboratory, Schenectady, NY,
USA.
3 Robotics Research Group, Department of Engineering Science, Oxford University, England.

Abstract. Recognising a curved surface from its outline in a single view is


a major open problem in computer vision. This paper shows techniques for
recognising a significant class of surfaces from a single perspective view. The
approach uses geometrical facts about bitangencies, creases, and inflections
to compute descriptions of the surface's shape from its image outline. These
descriptions are unaffected by the viewpoint or the camera parameters. We
show, using images of real scenes, that these representations identify surfaces
from their outline alone. This leads to fast and effective recognition of curved
surfaces.
The techniques we describe work for surfaces that have a rotational sym-
metry, or are projectively equivalent to a surface with a rotational symmetry,
and can be extended to an even larger class of surfaces. All the results in
this paper are for the case of full perspective. The results additionally yield
techniques for identifying the line in the image plane corresponding to the
axis of a rotationally symmetric surface, and for telling whether a surface
is rotationally symmetric or not from its outline alone.

1 Introduction

There has been a history of interest in recognising curved surfaces from their outlines.
Freeman and Shapira [7], and later Malik [9] investigated extending line labelling to
curved outlines. Brooks [2] studied using constraint-based modelling techniques to recog-
nise generalised cylinders. Koenderink [8] has pioneered work on the ways in which the
topology of a surface's outline changes as it is viewed from different directions, and has
studied the way in which the curvature of a surface affects the curvature of its out-
line. Ponce [12] studied the relationships between sections of contour in the image of
a straight homogeneous generalised cylinder. Dhome [4] studied recognising rotationally
invariant objects by computing pose from the image of their ending contours.
Terzopoulos et al. [17] compute three-dimensional surface approximations from im-
age data, based around a symmetry seeking model which implicitly assumes that "the
axis of the object is not severely inclined away from the image plane" (p. 119). These
approximations can not, as a result, be used for recognition when perspective effects are

* DAF acknowledges the support of Magdalen College, Oxford, of the University of Iowa, and
of GE. JLM acknowledges the support of the GE Coolidge Fellowship. AZ acknowledges
the support of SERC. CAR acknowledges the support of GE. The GE CRD laboratory is
supported in part by the following: DARPA contract DACA-76-86-C-007, AFOSR contract
F49620-89-C-003.

significant. Despite this work, it has been hard to produce robust, working model based
vision systems for curved surfaces.
Ponce and Kriegman [13] show that elimination theory can be used to predict, in sym-
bolic form, the outline of an algebraic surface viewed from an arbitrary viewing position.
For a given surface a viewing position is then chosen using an iterative technique, to give
a curve most like that observed. The object is then recognized by searching a database,
and selecting the member that gives the best fit to the observed outline. This work shows
that outline curves strongly constrain the viewed surface, but has the disadvantage that
it cannot recover surface parameters without solving a large optimization problem, so
that for a big model base, each model may have to be tested in turn against the image
outline.
A number of recent papers have shown how indexing functions can be used to avoid
searching a model base (e.g. [5, 18, 14]). Indexing functions are descriptions of an object
that are unaffected by the position and intrinsic parameters of the camera, and are usually
constructed using the techniques of invariant theory. As a result, these functions have
the same value for any view of a given object, and so can be used to index into a model
base without search. Indexing functions and systems that use indexing, are extensively
described in [10, 11], and [16] displays the general architecture used in such systems.
To date, indexing functions have been demonstrated only for plane and polyhedral
objects. Constructing indexing functions for curved surfaces is more challenging, because
the indexing function must compute a description of the surface's shape from a single
outline. It is clear that it is impossible to recover global measures of surface shape from
a single outline if the surfaces involved are unrestricted. For example, we can disturb any
such measure by adding a bump to the side of the surface that is hidden from the viewer.
An important and unresolved question is how little structure is required for points to
yield indexing functions.
In this paper, we emphasize the structure of image points by demonstrating useful
indexing functions for surfaces which have a rotational symmetry, or are within a 3D
projectivity of a surface with a rotational symmetry. This is a large and useful class of
surfaces.

2 Recognising rotationally symmetric surfaces

In this section, we show that lines bitangent to an image contour yield a set of indexing
functions for the surface, when the surface is either rotationally symmetric, or projectively
equivalent to a rotationally symmetric surface. This follows from a study of the properties
of the outline in a perspective image.

2.1 Geometrical properties of the outline

The outline of a surface in an image is given by a system of rays through the camera
focal point that are tangent to the surface. The points of tangency of these rays with the
surface form a space curve, called the contour generator. The geometry is illustrated in
figure 1.
Points on the contour generator are distinguished, because the plane tangent to the
surface at such points passes through the focal point (this is an alternative definition of
the contour generator). As a result, we have:

Fig. 1. The cone of rays, through the focal point and tangent to the object surface, that forms
the image outline, shown for a simple object.

Lemma: Except where the image outline cusps 4, a plane tangent to the surface
at a point on the contour generator (by definition, such a plane passes through
the focal point), projects to a line tangent to the surface outline, and conversely,
a line tangent to the outline is the image of a plane tangent to the surface at the
corresponding point on the contour generator.

As a corollary, we have:

Corollary 1: A line tangent to the outline at two distinct points is the image of


a plane through the focal point and tangent to the surface at two distinct points,
both on the contour generator.

This yields useful relationships between outline properties and surface properties. For
example:

Corollary 2: The intersection of two lines bitangent to the outline is a point,


which is the image of the intersection of the two bitangent planes represented by
the lines 5.

The lemma and both corollaries follow immediately from considering figure 2. Generic
surfaces admit one-parameter systems of bitangent planes, so we can expect to observe
and exploit intersections between these planes.
One case in which the intersections are directly informative occurs when the surface
is rotationally symmetric. The envelope of the bitangent planes must be either a right
circular cone, or a cylinder with circular cross-section (this is a right circular cone whose
vertex happens to be at infinity). We shall draw no distinction between vertices at infinity
and more accessible vertices, and refer to these envelopes as bitangent cones. These
comments lead to the following

4 We ignore cusps in the image outline in what follows.

5 Proof: Each of the bitangent planes passes through the focal point, so their intersection must
pass through the focal point, and in particular is the line from the focal point through the
intersection of the bitangent lines.

Key result: The vertices of these bitangent cones must lie on the axis (by
symmetry), and so are collinear. Assuming the focal point lies outside the surface,
as figure 2 shows, the vertices of the bitangent cones can be observed in an image.
The vertices appear as the intersection of a pair of lines bitangent to the outline.

Fig. 2. A rotationally symmetric object, and the planes bitangent to the object and passing
through the focal point, are shown. It is clear from the figure that the intersection of these
planes is a line, also passing through the focal point. Each plane appears as a line in the image:
the intersection of the planes appears as a point, which is the image of the vertex of the bitangent
cone. Note in particular that the image outline has no symmetry. This is the generic case.

As a result, if the surface has four or more bitangent cones, the vertices yield a system
of four or more collinear points, lying on the axis of the surface. These points project to
points that are collinear, and lie on the image of the axis of symmetry. These points can
be measured in the image. This fact yields two important applications:

- Cross-ratios of the image points, defined below, yield indexing functions for the sur-
face, which can be determined from the outline alone.
- The image points can be used to construct the image of the axis of a rotationally
symmetric surface from its outline.

The second point can be used to extend work such as that of Brady and Asada [1] on
symmetries of frontally viewed plane curves to considering surface outlines.
We concentrate on the first point in this paper. The map taking these points to their
corresponding image points, is a projection of the line, and so the projective invariants of
a system of points yield indexing functions. A set of four collinear points A, B, C, D has
a projective invariant known as its cross ratio, given by:

(AC)(BD)
(AD)(BC)

where A B denotes the linear distance from A to B. The cross ratio is well known to be
invariant to projection, and is discussed in greater detail in [10]. The cross ratio depends
on the order in which the points are labeled. If the labels of the four points are permuted,
a different value of the cross ratio results. Of the 24 different labeling possibilities, only
6 yield distinct values. A symmetric function of cross ratios, known as a j-invariant, is
invariant to the permutation of its arguments as well as to projections [11]. Since a
change in camera parameters simply changes the details of the projection of the points,
but does not change the fact that the map is a projection, these cross-ratios are invariant
to changes in the camera parameters.
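For concreteness, the cross ratio of four collinear image points can be computed directly from their coordinates, as in the following sketch (one particular labelling; permuting the labels gives the other five values mentioned above).

import numpy as np

def cross_ratio(A, B, C, D):
    # (AC)(BD) / ((AD)(BC)) for four collinear points given as 2-vectors.
    d = lambda P, Q: np.linalg.norm(np.asarray(P, float) - np.asarray(Q, float))
    return (d(A, C) * d(B, D)) / (d(A, D) * d(B, C))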
A further result follows from the symmetries in figure 2. It is possible to show that,
although the outline does not, in general, have a symmetry, it splits into two components,
which are within a plane projectivity of one another. This means that the techniques for
computing projective invariants of general plane curves described in [15], can be used to
group corresponding outline segments within an image.

2.2 Indexing functions from images of real objects
We demonstrate that the values of the indexing functions we have described are stable
under a change in viewing position, and are different for different objects. In figure 3, we
show two images each of two lampstands, taken from different viewpoints. These images
demonstrate that the outline of a rotationally symmetric object can be substantially
affected by a change in viewpoint. Three images of each lampstand were taken in total,
including these images. For each series of images of each lampstand, bitangents were
constructed by hand, and the bitangents are shown overlaid on the images. The graph
in figure 4 shows the cross-ratios computed from the vertices in each of three images of
each of two lampstands. The values of the cross ratio are computed for only one ordering
of the points, to prevent confusion. As predicted in [5], the variance of the larger cross-
ratio is larger; this effect is discussed in [5], and is caused by the way the measurements
are combined in the cross-ratio. The results are easily good enough to distinguish between
the lampstands, from their outlines alone.

3 Generalizing the approach

This approach can be generalized in two ways. Firstly, there are other sources of vertices
than bitangent lines. Secondly, the geometrical construction described works for a wider
range of surfaces than the rotationally symmetric surfaces. We will demonstrate a range
of other sources of vertices assuming that the surface is rotationally symmetric, and then
generalize all the constructions to a wider range of surfaces in one step.

Fig. 3. This figure shows two views each of two different lampstands. Bitangents, computed by
hand from the outlines, are overlaid.

3.1 Other sources of vertices

Other sources of vertices, illustrated in figure 5, are:

- The tangents at a crease or an ending in the outline: We assume that we can
distinguish between a crease in the outline, which arises from a crease in the surface,
and a double point of the outline, which is a generic event that may look like a crease.
In this case, these tangents are the projections of planes tangent to the surface, at a
crease in the surface.
- A tangent that passes through an ending in the outline: These are projections
of planes that are tangent to the surface, and pass through an ending in the surface.
- Inflections of the outline: These are projections of planes which have three-point
contact with the surface.

In each case, there is a clear relationship between the tangent to the outline and a plane
tangent to the surface, and the envelope of the system of planes tangent to the surface
and having the required property, is a cone with a vertex along the axis. These sources
of information are demonstrated in figure 5. These results can be established by a simple
modification of the argument used in the bitangent case.

3.2 Generalizing to a wider class of surfaces

We have constructed families of planes tangent to a rotationally symmetric surface and
distinguished by some property. The envelope of each family is a cone, whose vertex lies
on the axis of the surface. The projections of these vertices can be measured in the image,
by looking at lines tangent to the outline. To generalize the class of surfaces to which
these constructions apply, we need to consider the properties we are using, and how they
behave under transformation. The properties used to identify vertices are preserved under
projective mappings of space. By this we mean that, using bitangency as an example, if
we take a surface and a bitangent plane, and apply a projectivity of space to each, the
new plane is still bitangent to the new surface, and the old points of tangency will map


Fig. 4. A graph showing one value of the cross-ratio of the vertex points for three different
images, taken from differing viewpoints, each of two different vases. This figure clearly shows
that the values of the indexing functions computed are stable under change of view, and change
for different objects, and so are useful descriptors of shape. As expected (from the discussion
in [5]), the variance in a measurement of the cross ratio increases as its absolute value increases.

to the new points of tangency. The other properties are preserved because projectivities
preserve incidence and multiplicity of contact.
These results mean that if we take a rotationally symmetric surface, and apply a
projective mapping, the cones and vertices we obtain from the new surface are just the
images of the cones and vertices constructed from the old surface. Since a projective
mapping takes a set of collinear points to another set of collinear points, we can still
construct indexing functions from these points. This means that, for our constructions
to work, the surface need only be projectively equivalent to a rotationally symmetric
surface. One example of such a surface would be obtained by squashing a rotationally
symmetric surface so that its cross section was an ellipse, rather than a circle. This result
substantially increases the class of surfaces for which we have indexing functions that
can be determined from image information.
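A minimal sketch of the underlying invariant, the cross-ratio of four collinear image points (such as the projected vertices), assuming NumPy; the ordering convention used here is an arbitrary choice, not the paper's:

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four collinear 2-D points.

    Each point is reduced to a scalar position t_i along the common line;
    the ratio (t1 - t3)(t2 - t4) / ((t1 - t4)(t2 - t3)) is then invariant
    under any projectivity of that line, and hence under imaging.
    """
    pts = np.asarray([p1, p2, p3, p4], dtype=float)
    d = pts[1] - pts[0]
    d /= np.linalg.norm(d)            # unit direction of the common line
    t = pts @ d                       # signed positions along that direction
    return ((t[0] - t[2]) * (t[1] - t[3])) / ((t[0] - t[3]) * (t[1] - t[2]))
```

Different orderings of the four points yield functionally related values, so any fixed convention can serve as an indexing function.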
A further generalisation is possible. The cross-ratio is a remarkable invariant that
applies to sets of points lying on a wide range of algebraic curves. If a curve supports a
one-to-one parametrisation using rational functions, a cross-ratio can be computed for a
set of four points on that curve. This follows because one can use the parametrisation
to compute those points on the line that map to the points distinguished on the curve,
and then take the cross-ratio of the points on the line⁶. Curves that can be parametrised
are also known as curves with genus zero. There is a wide range of such curves; some
examples in the plane include a plane conic, a cubic with one double point and a quartic
with either one triple point or three double points. In space, examples include the twisted

⁶ The parametrisation is birational. Any change of parametrisation is a birational mapping
between lines, hence a projectivity, and so the cross-ratio is well defined.

Fig. 5. The known cases that produce usable coaxial vertices. Note that although this figure
appears to have a reflectional symmetry, this is not a generic property of the outline of a
rotationally symmetric object.

cubic, and curves of the form (t, p(t), q(t), 1), where p and q are polynomials (the points
at infinity are easily supplied).
Remarkably, the resulting cross-ratio is invariant to projectivities and to projection
from space onto the plane (in fact, to any birational mapping). This means that, for ex-
ample, if we were to construct a surface for which the bitangent vertices or other similarly
identifiable points lie on such a curve, that surface could easily be recognised from its out-
line in a single image, because the cross-ratio is defined, and is preserved by projection.
Recognition would proceed by identifying the image of the bitangent vertices, identifying
the image of the projected curve that passes through these points, and computing the
cross-ratio of these points on that curve, which would be an invariant.
Since there is a rich range of curves with genus zero, this offers real promise as
a modelling technique which has the specific intent of producing surface models that
are both convincing models of an interesting range of objects, and intrinsically easy to
recognise.

4 Conclusions

We have constructed indexing functions, which rely only on image information, for a
useful class of curved surfaces. We have shown these functions to be useful in identifying
curved objects in perspective images of real scenes.
This work has further ramifications. It is possible to use these techniques to determine
whether an outline is the outline of a rotationally symmetric object, and to determine
the image of the axis of the object. As a result, it is possible to take existing investiga-
tions of the symmetry properties of image outlines, and extend them to consider surface
properties, measured in a single image, in a principled way.
As we have shown, image information has deep geometric structure that can be ex-
ploited for recognition. In fact, image outlines are so rich that recent work at Iowa[6] has
shown that a generic algebraic surface of any degree can be recovered up to a projective
mapping, from its outline in a single image.

References

1. Brady, J.M. and Asada, H., "Smoothed Local Symmetries and their implementation,"
IJRR-3, 3, 1984.
2. Brooks, R. A., "Model-Based Three-Dimensional Interpretations of Two Dimensional Im-
ages," IEEE PAMI, Vol. 5, No. 2, p. 140, 1983.
3. Canny J.F. "Finding Edges and Lines in Images," TR 720, MIT AI Lab, 1983.
4. Dhome, M., LaPreste, J.T, Rives, G., and Richetin, M. "Spatial localisation of modelled
objects in monocular perspective vision," Proc. First European Conference on Computer
Vision, 1990.
5. D.A. Forsyth, J.L. Mundy, A.P. Zisserman, A. Heller, C. Coelho and C.A. Rothwell (1991),
"Invariant Descriptors for 3D Recognition and Pose," IEEE Trans. Patt. Anal. and Mach.
Intelligence, 13, 10.
6. Forsyth, D.A., "Recognising an algebraic surface by its outline," Technical report, Univer-
sity of Iowa Department of Computer Science, 1992.
7. H. Freeman and R. Shapira, "Computer Recognition of Bodies Bounded by Quadric Sur-
faces from a set of Imperfect Projections," IEEE Trans. Computers, C27, 9, 819-854, 1978.
8. Koenderink, J.J. Solid Shape, MIT Press, 1990.
9. Malik, J., "Interpreting line drawings of curved objects," IJCV, 1, 1987.
10. J.L. Mundy and A.P. Zisserman, "Introduction," in J.L. Mundy and A.P. Zisserman (ed.s)
Geometric Invariance in Computer Vision, MIT Press, 1992.
11. J.L. Mundy and A.P. Zisserman, "Appendix" in J.L. Mundy and A.P. Zisserman (ed.s)
Geometric Invariance in Computer Vision, MIT Press, 1992.
12. Ponce, J. "Invariant properties of straight homogeneous generalized cylinders," IEEE Trans.
Patt. Anal. Mach. Intelligence, 11, 9, 951-965, 1989.
13. J. Ponce and D.J. Kriegman (1989), "On recognising and positioning curved 3 dimensional
objects from image contours," Proc. DARPA IU Workshop, pp. 461-470.
14. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Using Projective In-
variants for constant time library indexing in model based vision," Proc. British Machine
Vision Conference, 1991.
15. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Canonical frames for
planar object recognition," Proc. 2nd European Conference on Computer Vision, Springer
Lecture Notes in Computer Science, 1992.
16. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Fast Recognition using
Algebraic Invariants," in J.L. Mundy and A.P. Zisserman (ed.s) Geometric Invariance in
Computer Vision, MIT Press, 1992.
17. Terzopoulos, D., Witkin, A. and Kass, M. "Constraints on Deformable Models: Recovering
3D Shape and Nonrigid Motion," Artificial Intelligence, 36, 91-123, 1988.
18. Wayner, P.C. "Efficiently Using Invariant Theory for Model-based Matching," Proceedings
CVPR, p.473-478, 1991.

This article was processed using the LaTeX macro package with ECCV92 style
Using Deformable Surfaces to Segment 3-D Images
and Infer Differential Structures*

Isaac COHEN 1, Laurent D. COHEN 2, Nicholas AYACHE 1


1 INRIA, Rocquencourt
B.P. 105, 78153 Le Chesnay CEDEX, France.
2 CEREMADE, U.R.A. CNRS 749, Université Paris IX - Dauphine
75775 Paris CEDEX, France.
Email: isaac, cohen, na @bora.inria.fr

Abstract

In this paper, we generalize the deformable model [4, 7] to a 3-D model, which evolves in
3-D images, under the action of internal forces (describing some elasticity properties of
the surface), and external forces attracting the surface toward some detected edgels. Our
formalism leads to the minimization of an energy which is expressed as a functional. We
use a variational approach and a finite element method to actually express the surface in a
discrete basis of continuous functions. This leads to a reduced computational complexity
and a better numerical stability.
The power of the present approach to segment 3-D images is demonstrated by a set
of experimental results on various complex medical 3-D images.
Another contribution of this approach is the possibility of easily inferring the differential
structure of the segmented surface. As we end up with an analytical description of the
surface, this allows us to compute, for instance, its first and second fundamental forms. From
this, one can extract a curvature primal sketch of the surface, including some intrinsic
features which can be used as landmarks for 3-D image interpretation.

1 Energy Minimizing Surfaces


We consider a parameterized surface v(s, r) = (x(s, r), y(s, r), z(s, r)) (where (s, r) ∈ Ω =
[0, 1] × [0, 1]). The location and the shape of this surface are characterized by minimizing
the functional E(v) = E_location(v) + E_smooth(v).
The functional E_location(v) = \int_\Omega P(v(s, r)) \, ds \, dr
allows the surface to be located accurately at the image features. If these features are
image contours, the function P depends on the 3-D gradient images and can be set to
P = -|\nabla I|^2 (where I is the 3-D image convolved with a Gaussian function); a complete
discussion on the choice of the potential P is given in [2].
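A minimal sketch of this potential for a voxel image, assuming NumPy/SciPy and a single Gaussian smoothing scale (an illustrative choice, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_potential(image_3d, sigma=1.0):
    """P = -|grad(G_sigma * I)|^2: strongly negative at 3-D edges, so that
    minimizing E_location attracts the surface toward detected edgels."""
    smoothed = gaussian_filter(np.asarray(image_3d, dtype=float), sigma)
    gz, gy, gx = np.gradient(smoothed)   # image assumed indexed as I[z, y, x]
    return -(gx**2 + gy**2 + gz**2)
```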
The functional

E_smooth(v) = \int_\Omega \left( w_{10} \left|\frac{\partial v}{\partial s}\right|^2 + w_{01} \left|\frac{\partial v}{\partial r}\right|^2 + 2 w_{11} \left|\frac{\partial^2 v}{\partial s \partial r}\right|^2 + w_{20} \left|\frac{\partial^2 v}{\partial s^2}\right|^2 + w_{02} \left|\frac{\partial^2 v}{\partial r^2}\right|^2 \right) ds \, dr

measures the regularity or the smoothness of the surface v. Minimizing E_smooth(v) con-
strains the surface v to be smooth. The parameters w_ij represent the mechanical proper-
ties of the surface. They determine its elasticity (w_10, w_01), rigidity (w_20, w_02) and twist
(w_11). These parameters act on the shape of the function v, and are also called the
regularization parameters.
* This work was partially supported by Digital Equipment Corporation.

A characterization of a function v minimizing the energy E is given by the Euler-
Lagrange equation, which represents the necessary condition for v to be a minimum of E.
The energy function E_location is not convex, so there may be many local minima of E_location.
The Euler-Lagrange equation may characterize any such local minimum. But as we are
interested in finding a 3-D contour in a given area, we assume in fact that we have a
rough prior estimation of the surface. This estimation is used to solve the associated
evolution equation:

\frac{\partial v}{\partial t} - \frac{\partial}{\partial s}\left(w_{10}\frac{\partial v}{\partial s}\right) - \frac{\partial}{\partial r}\left(w_{01}\frac{\partial v}{\partial r}\right) + 2\frac{\partial^2}{\partial s \partial r}\left(w_{11}\frac{\partial^2 v}{\partial s \partial r}\right) + \frac{\partial^2}{\partial s^2}\left(w_{20}\frac{\partial^2 v}{\partial s^2}\right) + \frac{\partial^2}{\partial r^2}\left(w_{02}\frac{\partial^2 v}{\partial r^2}\right) = -\nabla P(v)

v(0, s, r) = v_0(s, r)   (initial estimation)
Boundary Conditions
(1)
The boundary conditions make it possible to constrain the topology of the surface (see [5] for varying
topology models). We solve Eq. 1 with a Finite Element Method (FEM). This leads to
the solution of a positive definite linear system solved by a Conjugate Gradient method
(for more details see [1]).
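A much-simplified sketch of such an evolution, assuming NumPy/SciPy, an explicit finite-difference update in place of the FEM/conjugate-gradient scheme of the paper, membrane (elasticity) terms only, periodic boundary conditions, a potential P indexed as P[z, y, x], and a surface array v of shape (Ns, Nr, 3) storing (x, y, z) in voxel units:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def evolve_surface(v, P, w10=0.1, w01=0.1, dt=0.1, n_iter=200):
    """Explicit gradient-descent evolution of a parameterized surface under
    the membrane terms of E_smooth and the image force -grad(P).  The
    rigidity/twist terms and the FEM discretization are omitted here."""
    v = np.asarray(v, dtype=float)
    Fz, Fy, Fx = [-g for g in np.gradient(P)]   # external force, (z, y, x) axis order
    force = (Fx, Fy, Fz)                        # reordered to match (x, y, z) components of v
    for _ in range(n_iter):
        # second differences approximate v_ss and v_rr (np.roll => periodic boundaries)
        v_ss = np.roll(v, -1, axis=0) - 2 * v + np.roll(v, 1, axis=0)
        v_rr = np.roll(v, -1, axis=1) - 2 * v + np.roll(v, 1, axis=1)
        internal = w10 * v_ss + w01 * v_rr
        # sample the external force field at the current surface points
        coords = [v[..., 2].ravel(), v[..., 1].ravel(), v[..., 0].ravel()]
        external = np.stack(
            [map_coordinates(F, coords, order=1, mode='nearest').reshape(v.shape[:2])
             for F in force], axis=-1)
        v = v + dt * (internal + external)
    return v
```

The FEM discretization used in the paper instead leads to a sparse positive definite system, which is what gives the reduced complexity and better numerical stability mentioned above.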

2 Inferring the Differential Structure from 3-D Images


In the following we assume that the surface has accurately localized the 3-D image edges,
which means that we have reached a minimum of E. We now use this surface to compute
the differential characteristics of the 3-D image surface boundary. This computation can
be done analytically at each point of the surface since the use of FEM gives us an analytic
representation of the surface v(s, r). In Fig. 5, one can visualize the value of the larger
principal curvature. The results appear to be qualitatively correct, and can be compared
to those obtained in [6] by another method. Our results are a little noisier, but the
advantage is that the segmentation and curvature are computed simultaneously. Also, our
approach appears to be computationally much less expensive. The computation of the
principal curvatures can be enhanced by taking into account the normals of the surface
(see [1] for details).
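For illustration, the principal curvatures follow from the first (E, F, G) and second (L, M, N) fundamental forms; the sketch below uses finite differences on a sampled surface where the paper uses the analytic FEM representation (NumPy assumed):

```python
import numpy as np

def principal_curvatures(v):
    """Principal curvatures of a sampled parametric surface v[s, r] = (x, y, z)
    (shape (Ns, Nr, 3)), from the fundamental forms with finite differences."""
    v = np.asarray(v, dtype=float)
    vs, vr = np.gradient(v, axis=0), np.gradient(v, axis=1)
    vss = np.gradient(vs, axis=0)
    vrr = np.gradient(vr, axis=1)
    vsr = np.gradient(vs, axis=1)
    n = np.cross(vs, vr)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    E = np.sum(vs * vs, -1); F = np.sum(vs * vr, -1); G = np.sum(vr * vr, -1)
    L = np.sum(vss * n, -1); M = np.sum(vsr * n, -1); N = np.sum(vrr * n, -1)
    # Gaussian and mean curvature, then principal curvatures k1 >= k2
    K = (L * N - M**2) / (E * G - F**2 + 1e-12)
    H = (E * N - 2 * F * M + G * L) / (2 * (E * G - F**2) + 1e-12)
    disc = np.sqrt(np.maximum(H**2 - K, 0.0))
    return H + disc, H - disc
```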

3 Conclusion and Future Research


We have shown how a deformable surface can be used to segment 3-D images by mini-
mizing an appropriate energy. The minimization process is done by a variational method
with finite elements. Our formalism leads to a reduced algorithmic complexity and pro-
vides an analytical representation of the surface. This last feature is the most important
one for inferring differential structures of the surface. These characteristics provide a
helpful tool for recognizing 3-D objects [3].

References

1. Isaac Cohen, Laurent D. Cohen, and Nicholas Ayache. Using deformable surfaces to seg-
ment 3-D images and infer differential structures. Computer Vision, Graphics, and Image
Processing: Image Understanding, 1992. In press.
2. Laurent D. Cohen and Isaac Cohen. Finite element methods for active contour models
and balloons from 2-D to 3-D. Technical Report 9124, CEREMADE, U.R.A. CNRS 749,
Université Paris IX - Dauphine, November 1991. Cahiers de Mathematiques de la Decision.
3. A. Guéziec and N. Ayache. Smoothing and matching of 3D-space curves. In Proceedings of
the Second European Conference on Computer Vision 1992, Santa Margherita Ligure, Italy,
May 1992.

4. Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models.
In Proceedings of the First International Conference on Computer Vision, pages 259-268,
London, June 1987.
5. F. Leitner and P. Cinquin. Dynamic segmentation: Detecting complex topology 3D-object.
In Proceedings of International Conference of the IEEE Engineering in Medicine and Biology
Society, pages 295-296, Orlando, Florida, November 1991.
6. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In Proc. Computer Vision
and Pattern Recognition, pages 644-649. IEEE Computer Society Conference, June 1991.
Lahaina, Maui, Hawaii.
7. Demetri Terzopoulos, Andrew Witkin, and Michael Kass. Constraints on deformable models:
recovering 3-D shape and nonrigid motion. AI Journal, 36:91-123, 1988.
4 Experimental Results

Fig. 1. In this example we use a deformable surface constrained by boundary conditions (cylin-
der type) to segment the inside cavity of the left ventricle. Overlays of some cross sections (in
grey) of the initial estimation (top) and the obtained surface (bottom), and a 3-D representation
of the inside cavity of the left ventricle.

Fig. 2. We have applied the 3-D deformable model to a magnetic Resonance image of the head,
to segment the face. This figure represents some cross sections of the initial estimate given by
the user.

Fig. 3. Here we represent the surface, once we have reached a minimum of the energy E. Some
vertical and horizontal cross sections of the surface are given. They show an accurate localization
of the surface at the edge points.

Fig. 4. A 3-D representation of the surface using AVS.

Fig. 5. A representation of the extrema of the principal curvatures. The high values of the
extrema are in black and the low values are in light grey. These values characterize some
structures of the human face such as the eyebrows and the nose.
Finding Parametric Curves in an Image*

Ales Leonardis 1 and Ruzena Bajcsy 2

1 University of Ljubljana, Dept. of Electrical Engineering and Computer Science,


Tržaška c. 25, 61000 Ljubljana, Slovenia
E-mail: Ales.Leonardis@ninurta.fer.yu
2 University of Pennsylvania, GRASP Laboratory, Philadelphia, PA 19104, USA
Abstract. We present a reliable and efficient method for extracting simple
geometric structures, i.e., straight lines, parabolas, and ellipses, from edge
images. The reliability of the recovery procedure which builds the paramet-
ric models is ensured by an iterative procedure through simultaneous data
classification and parameter estimation. The overall relative insensitivity to
noise and minor changes in input data is achieved by considering many com-
petitive solutions and selecting those that produce the simplest description,
i.e., the one that accounts for the largest number of data points with the
smallest number of parameters while keeping the deviations between data
points and models low. The presented method is efficient for two reasons:
firstly, it is designed as a search which utilizes intermediate results as a
guidance toward the final result, and secondly, it combines model recovery
and model selection in a computationally efficient procedure.

1 Introduction

We advocate the view that the purpose of machine vision is not to reconstruct the scene
in its entirety, but rather to search for specific features that enter, via data aggregation,
a symbolic description of the scene necessary to achieve the specific task. Unfortunately,
the high degree of variability and unpredictability that is inherent in a visual signal makes
it impossible to design precise methods for detecting low-level features. Thus, almost any
output of early processing has to be treated only as a hypothesis for further processing.
In this paper we investigate a method for extracting simple geometric structures from
edge images in terms of parametric models, namely straight lines, parabolas, and ellipses.
These models satisfy the criteria for the selection of geometric representations (invari-
ance, stability, accessibility, 3-D interpretation, perceptual significance). We would like
to emphasize that our main objective is to develop a novel control strategy, by combin-
ing several existing techniques, that achieves a reliable and efficient recovery of geometric
parametric models and can serve as a powerful early vision tool for signal-to-symbol trans-
formation. The method consists of two intertwined procedures, namely model-recovery
and model-selection. The first procedure systematically recovers the models in an edge
image, creating a redundant set of possible descriptions, while the model-selection proce-
dure searches among them to produce the simplest description in terms of the criterion
function.
Due to space limitations we present only a summary of our algorithm. For a more
complete view on the procedure, experimental results, and a discussion of the related
work, the reader is referred to [3].
* The research described in this paper was supported in part by: The Ministry for Science
and Technology of The Republic of Slovenia, Project P2-1122; Navy Grant N0014-88-K-
0630, AFOSR Grants 88-0244, AFOSR 88-0296; Army/DAAL 03-89-C-0031PRI; NSF Grants
CISE/CDA 88-22719, IRI 89-06770, and ASC 91 0813; and Du Pont Corporation.

2 Model Recovery

Recovery of parametric models is a difficult problem because we have to find image
elements that belong to a single parametric model and we have to determine the values
of the parameters of the model. For image elements that have already been classified we
can determine the parameters of a model using standard statistical estimation techniques.
Conversely, knowing the parameters of the model, a search for compatible image points
can be accomplished by pattern classification methods. We propose to solve this problem
by an iterative method, conceptually similar to the one described by Besl [1] and Chen [2],
which combines data classification and model fitting.
One of the crucial problems is where to find the initial curve segments (seeds) in an
image since their selection has a major effect on the success or failure of the overall proce-
dure. We propose that a search for the edge points that could belong to a single curve is
performed in a grid-like pattern of windows overlaid on the image. Thus, the requirement
of classifying all edge points of a certain curve is relaxed to finding only a small subset.
However, there is no guarantee that every seed will lead to a good description since some
initial curve segments can be constructed over low-strength C⁰ and C¹ discontinuities
without being statistically inconsistent. As a remedy we propose to independently build
all possible curves using all statistically consistent seeds and to use them as hypotheses
that could compose the final description.
Having an initial set of points (a seed) we estimate the parameters of the model. We
always start with the simplest model, i.e., a straight line³, and determine the goodness-
of-fit between the model and the corresponding data. If sufficient similarity is established,
ultimately depending on the task at hand, we proceed with a search for more compatible
points. An efficient search which is performed in the vicinity of the present end-points
of the model is achieved by extrapolating the current model. New image elements are
included in the data set and the parameters of the model are updated. The new goodness-
of-fit is computed and compared to the old value. This is followed by a decision whether
to perform another iteration, or replace the currently used model with a more complex
one, or terminate the procedure. A schematic diagram outlining the procedure is shown in
Fig. 1. The algorithm's main feature is its insensitivity to outliers since the performance of
the fitting is constantly monitored. The final outcome of the model-recovery procedure for

Fig. 1. A schematic diagram outlining the model-recovery procedure

a model m_i consists of three terms which are subsequently passed to the model-selection
³ This is due to the limited amount of reliable information that can be gathered in an initially
local area. Only a small number of parameters can be estimated in order to avoid numerically
ill-conditioned cases.

procedure:
1. the set of edge elements that belong to the model,
2. the type of the parametric model and the corresponding set of parameters of the
model, and
3. the goodness-of-fit value which describes the conformity between the data and the
model.
While this description is general, specific procedures designed to operate on individual
types of models differ significantly. This is primarily due to the increased complexity of
the nonlinear fitting process for parabolas and ellipses, which is directly related to the
choice of the Euclidean distance as the error metric. The use of the Euclidean distance
results in a goodness-of-fit measure which has a straightforward interpretation, and the
recovered models extrapolate accurately to the vicinity of the end-points. The reader is
referred to [3] where we outline the model-recovery procedures for the three types of the
models and show how to switch between them.
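A schematic sketch of this grow-and-fit loop for the simplest model, a straight line fitted to 2-D edge points; the thresholds and the helper name are illustrative, and the switch to parabolas and ellipses is not shown:

```python
import numpy as np

def grow_line(seed_pts, edge_pts, tol=1.5, search_radius=3.0):
    """Iteratively fit a line to a seed, extrapolate near its end-points,
    classify compatible edge points, refit, and stop when no new points are
    found or the goodness-of-fit degrades (2-D points assumed)."""
    pts = np.asarray(seed_pts, dtype=float)
    edge_pts = np.asarray(edge_pts, dtype=float)
    while True:
        # total-least-squares line fit: centroid plus dominant direction
        c = pts.mean(axis=0)
        d = np.linalg.svd(pts - c)[2][0]
        # goodness-of-fit: RMS perpendicular distance of member points
        resid = (pts - c) - np.outer((pts - c) @ d, d)
        if np.sqrt((resid**2).sum(axis=1).mean()) > tol:
            break                              # model no longer consistent
        # look for compatible edge points beyond the current end-points
        t = (pts - c) @ d
        ends = [c + t.min() * d, c + t.max() * d]
        perp = np.abs((edge_pts - c) @ np.array([-d[1], d[0]]))
        near_end = np.minimum(np.linalg.norm(edge_pts - ends[0], axis=1),
                              np.linalg.norm(edge_pts - ends[1], axis=1))
        cand = edge_pts[(perp < tol) & (near_end < search_radius)]
        new = [p for p in cand if not any(np.allclose(p, q) for q in pts)]
        if not new:
            break
        pts = np.vstack([pts, new])
    return pts, (c, d)
```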

3 Model Selection

The redundant representation obtained by the model-recovery procedure is a direct con-
sequence of the decision that a search for parametric models is initiated everywhere in
an image. Several of the models are completely or partially overlapped. The task of
combining different models is reduced to a selection procedure where the recovered in-
dividual descriptions compete as hypotheses to be accepted in the final interpretation
of the scene. Thus, the model-selection procedure is performed on the level of geomet-
ric structures rather than on the level of their constituent elements. The procedure is
designed to select the smallest possible number of models that would describe the data
with the smallest possible error. Here we follow the approach already exploited in our
previous work [4] on range images.
The objective function F, which is to be maximized in order to produce the "best"
description in terms of models, has the following form:

F(\hat{m}) = \hat{m}^T Q \hat{m}, \quad \text{where} \quad \hat{m}^T = [m_1, m_2, \ldots, m_M] \quad \text{and} \quad Q = \begin{bmatrix} c_{11} & \ldots & c_{1M} \\ \vdots & & \vdots \\ c_{M1} & \ldots & c_{MM} \end{bmatrix} .    (1)

m_i is a presence variable having the value 1 for the
presence and 0 for the absence of the model m_i in the final description. The diagonal
terms of the matrix Q express the cost-benefit value of a particular model m_i. This value
is a function of the number of data points that belong to the model, the complexity of
the model (number of parameters), and the goodness-of-fit measure between the model
and the data. The off-diagonal terms handle the interactions between the overlapping
models, taking into account the mutual error and the number of data points covered by
both models.
Maximizing the objective function F ( ~ ) belongs to a class of problems known as
combinatorial optimization (Quadratic Boolean problem). Since the number of possible
solutions increases exponentially with the size of the problem, it is usually not tractable
to explore them exhaustively. We solve the problem with the WTA (winner-takes-all)
technique which, in our case, turns out to be a good compromise between the speed and
the accuracy of the solution. We briefly outline the algorithm:

1. Initialization: ∀i, m̂_i = 0, F(m̂) = 0.
2. Find a model that has not been chosen so far (m̂_i = 0) and whose selection (m̂_i = 1)
would contribute the most to the value of the objective function. Once the model is
selected it cannot be rejected.
3. Repeat Step 2 until the value of the objective function cannot be increased any
further.
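A sketch of this greedy selection, assuming the interaction matrix Q has already been filled in (NumPy assumed):

```python
import numpy as np

def select_models(Q):
    """Greedy winner-takes-all maximization of F(m) = m^T Q m over binary
    presence variables m; once selected, a model is never rejected."""
    M = Q.shape[0]
    m = np.zeros(M)
    best = 0.0
    while True:
        gains = []
        for i in np.flatnonzero(m == 0):
            trial = m.copy()
            trial[i] = 1
            gains.append((trial @ Q @ trial - best, i))   # marginal gain of adding model i
        if not gains or max(gains)[0] <= 0:
            return m                    # no remaining selection increases F
        gain, i = max(gains)
        m[i] = 1
        best += gain
```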

4 Model Recovery and Model Selection


In order to achieve a computationally efficient procedure we combine the model-recovery
and model-selection procedures in an iterative fashion. The recovery of currently active
models is interrupted by the model-selection procedure which selects a set of currently
optimal models which are then passed back to the model-recovery procedure. This process
is repeated until the remaining models are completely recovered. The trade-offs which are
involved in the dynamic combination of these two procedures are discussed elsewhere [4].
Intuitively, we should not invoke the selection procedure unless there is a significant
overlap between the models so that some of them can be eliminated. It follows that at
the beginning of the process, when the models are smaller, we invoke the model-selection
procedure more often than at the later stages when the models are larger.

5 Experimental Results

We tested our method on a variety of synthetic data as well as on real images. The two
images presented here (Figs. 2 (a) and 3 (a)), together with their respective edge images
(Figs. 2 (b) and 3 (b)), obtained with the Canny edge detector, were kindly supplied by
Dr. Etemadi from the University of Surrey. Fig. 2 (c) shows the initial curve segments

Fig. 2. (a) Original image, (b) Edge-image, (c) Seed image, and (d) Reconstructed image

(seeds). Note that they are not placed on or near the intersections or junctions of the
edges. Besides, they do not appear in the areas with a high density of edge elements
(twisted cord). The size of the initial windows determines the scale (and the resolution)
on which the elements are anticipated to appear. If two lines fall into the same window, a
consistent initial estimate will not be found. One of the solutions would be to decrease the
size of the windows or to resort to orientation dependent windows. However, a missing
seed seldom poses a serious problem since usually only a few seeds are sufficient to
properly recover the complete curve. Of course, curves which are not initiated by any

seed at all will not appear in the final description. The final result is shown in Fig. 2 (d).
We observe that the procedure is robust with respect to noise (minor edge elements
scattered in the image). A standard approach utilizing a blind linking phase to classify
data points without support from models would encounter numerous problems. Besides,
the procedure determines its domain of applicability since it does not describe heavily
textured areas (densely distributed curves) in the image. Due to the redundancy present
in the scheme, the method degrades gracefully if the assumptions made by the choice
of the primitives are not met. A similar situation arises when the estimation about the
anticipated scale or resolution is not correct. Numerous small segments signal that a
different kind of models should be invoked or that the scale should be changed (the dial
in Fig. 2).


Fig. 3. (a) Original image, (b) Edge-image, (c) Seed image, and (d) Reconstructed image

In Fig. 3 (c) we show the seeds. Some of the seeds along the parallel lines are missing
due to the grid placement. Nevertheless, the lines are properly recovered, as shown in
Fig. 3 (d).

6 Conclusions
The method for extracting parametric geometric structures is a tool that has already
proven useful to other tasks in computer vision [4]. It offers several possible extensions by
using other types of models. Moreover, the same principle can be extended to operate on a
hierarchy of different models which would lead to the recovery of more and more abstract
structures. Besides, the scheme is inherently parallel and can easily be implemented on
a massively parallel machine.

References
1. Besl, P. J.: Surfaces in Range Image Understanding. Springer-Verlag, (1988)
2. Chen, D. S.: A data-driven intermediate level feature extraction algorithm. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence. 11 (1989) 749-758
3. Leonardis, A.: A search for parametric curves in an image. Technical Report LRV-91-7.
Computer Vision Laboratory, University of Ljubljana, (1991)
4. Leonardis, A., Gupta, A., and Bajcsy, R.: Segmentation as the search for the best description
of the image in terms of primitives. In The Third International Conference on Computer
Vision. Osaka, Japan, (1990) 121-125
This article was processed using the LaTeX macro package with ECCV92 style
Determining Three-Dimensional Shape from
Orientation and Spatial Frequency Disparities *

David G. Jones 1 and Jitendra Malik 2

1 McGill University, Dept. of Electrical Engineering, Montréal, PQ, Canada H3A 2A7
2 University of California, Berkeley, Computer Science Division, Berkeley, CA USA 94720

Abstract. Binocular differences in orientation and foreshortening are sys-
tematically related to surface slant and tilt and could potentially be ex-
ploited by biological and machine vision systems. Indeed, human stereopsis
may possess a mechanism that specifically makes use of these orientation
and spatial frequency disparities, in addition to the usual cue of horizontal
disparity. In machine vision algorithms, orientation and spatial frequency
disparities are a source of error in finding stereo correspondence because
one seeks to find features or areas which are similar in the two views when,
in fact, they are systematically different. In other words, it is common to
treat as noise what is useful signal.
We have been developing a new stereo algorithm based on the outputs
of linear spatial filters at a range of orientations and scales. We present a
method in this framework, making use of orientation and spatial frequency
disparities, to directly recover local surface slant. An implementation of this
method has been tested on curved surfaces and quantitative experiments
show that accurate surface orientation can be recovered efficiently. This
method does not require the explicit identification of oriented line elements
and also provides an explanation of the intriguing perception of surface
slant in the presence of orientation or spatial frequency disparities, but in
the absence of systematic positional correspondence.

1 Introduction
Stereopsis has traditionally been viewed as a source of depth information. In two views
of a three-dimensional scene, small positional disparities between corresponding points in
the two images give information about the relative distances to those points in the scene.
Viewing geometry, when it is known, provides the calibration function relating disparity
to absolute depth. To describe three-dimensional shape, the surface normal, n(x, y), can
then be computed by differentiating the interpolated surface z(x, y). In practice, any
inaccuracies present in disparity estimates will be compounded by taking derivatives.
However, there are other cues available under binocular viewing that can provide di-
rect information about surface orientation. When a surface is not fronto-parallel, surface
markings or textures will be imaged with slightly different orientations and degrees of
foreshortening in the two views (Fig. 1). These orientation and spatial frequency dispar-
ities are systematically related to the local three-dimensional surface orientation. It has
been demonstrated that humans are able to exploit these cues, when present, to more

* This work has been supported by a grant to DJ from the Natural Sciences and Engineering
Research Council of Canada (OGP0105912) and by a National Science Foundation PYI award
(IRI-8957274) to JM.

accurately determine surface orientation (Rogers and Cagenello, 1989). In stimuli consist-
ing of uncorrelated dynamic visual noise, filtered to contain a certain spatial frequency
band, the introduction of a spatial frequency disparity or orientation disparity leads to
the perception of slant, despite the absence of any systematic positional disparity cue
(Tyler and Sutter, 1979; yon der Heydt et al., 1981). In much the same way that random
dot stereograms confirmed the existence of mechanisms that makes use of horizontal dis-
parities (:lulesz, 1960), these experiments provide strong evidence that the human visual
system possesses a mechanism that can and does make use of orientation and spatial
frequency disparities in the two retinal images to aid in the perception of surface shape.

Fig. 1. Stereo pair of a planar surface tilted in depth. Careful comparison of the two views
reveals slightly different orientation and spacing of corresponding grid lines.

There has been very little work investigating the use of these cues in computational
vision. In fact, it is quite common in computational stereo vision to simply ignore the
orientation and spatial frequency differences, or image distortions, that occur when view-
ing surfaces tilted in depth. These differences are then a source of error in computational
schemes which try to find matches on the assumption that corresponding patches (or
edges) must be identical or very nearly so. Some approaches acknowledge the existence
of these image distortions, but still treat them as noise to be tolerated, as opposed to an
additional signal that may exploited (Arnold and Binford, 1980; Kass, 1983; Kass, 1987).
A few approaches seek to cope using an iterative framework, starting from an initial
assumption that disparity is locally constant, and then guessing at the parameters of the
image distortion to locally transform and compensate so that image regions can again be
compared under the assumption that corresponding regions are merely translated copies
of one another (Mori et al., 1973; Quam, 1984; Witkin et al., 1987). The reliance of this
procedure on convergence from inappropriate initial assumptions and the costly repeated
"warping" of the input images make this an unsatisfactory computational approach and
an unlikely mechanism for human stereopsis.
This paper describes a novel computational method for directly recovering surface
orientation by exploiting these orientation and spatial disparity cues. Our work is in
the framework of a filter-based model for computational stereopsis (Jones, 1991; Jones
and Malik, 1992) where the outputs of a set of linear filters at a point are used for
matching. The key idea is to model the transformation from one image to the other
locally as an affine transformation with two significant parameters, Hx, Hy, the gradient
of horizontal disparity. Previous work has sought to recover the deformation component
instead (Koenderink and van Doorn, 1976).
For the special case of orientation disparity, Wildes (1991) has an alternative approach
based on determining surface orientation from measurements on three nearby pairs of
corresponding line elements (Canny edges). Our approach has the advantage that it
treats both orientation and spatial frequency disparities. Another benefit is that, like least-
squares fitting, it makes use of all the data. While measurements on three pairs may be

adequate in principle, using minimal information leads to much greater susceptibility to
noise.

2 Geometry of Orientation and Spatial Frequency Disparities

Consider the appearance of a small planar surface patch, ruled with a series of evenly
spaced parallel lines (Fig. 2A). The results obtained will apply when considering orienta-
tion and spatial frequencies of general texture patterns. To describe the parameters of an
arbitrarily oriented plane, start with a unit vector pointing along the x-axis. A rotation
φ_z around the z-axis describes the orientation of the surface texture. Rotations of φ_x
around the x-axis, followed by φ_y around the y-axis, combine to allow any orientation of
the surface itself. The three-dimensional vector v resulting from these transformations
indicates the orientation of the lines ruled on the surface and can be written concisely:

v = \begin{bmatrix} \cos\phi_y \cos\phi_z + \sin\phi_x \sin\phi_y \sin\phi_z \\ \cos\phi_x \sin\phi_z \\ \sin\phi_x \cos\phi_y \sin\phi_z - \sin\phi_y \cos\phi_z \end{bmatrix}

In order to consider orientation and spatial frequency disparities, this vector must be
projected onto the left and right image planes. In what follows, orthographic projection
will be used, since it provides a very close approximation to perspective projection,
especially for the small surface patches under consideration and when line spacing is
small relative to the viewing distance. The projection of v onto the left image plane is
achieved by replacing φ_y with φ_y + Δφ_y (where Δφ_y = tan⁻¹(b/2d)), and then discarding
the z component to give the two-dimensional image vector v_l. Similarly, replacing φ_y with
φ_y − Δφ_y gives v_r, the projection of v on the right image plane.


Fig. 2. Differences in two views of a tilted surface. A. A planar surface is viewed at a distance
d, from two vantage points separated by a distance b. Three-dimensional vectors lie parallel (v)
and perpendicular (w) to a generic surface texture (parallel lines). Arbitrary configurations are
achieved by rotations φ_z, φ_x and φ_y, in that order. Different viewpoints are handled by adding
an additional rotation ±Δφ_y, where Δφ_y = tan⁻¹(b/2d). B. Resulting two-dimensional image
textures are described by orientation, θ, and spacing, λ. Orientation disparity, θ_r − θ_l, and spatial
frequency disparity, λ_l/λ_r, are systematically related to surface orientation, φ_x, φ_y.

Let θ_l and θ_r be the angles the image vectors v_l and v_r make with the x-axis (Fig. 2B).
These can be easily expressed in terms of the components of the image vectors.

\tan\theta_l = \frac{\cos\phi_x \tan\phi_z}{\sin\phi_x \sin(\phi_y + \Delta\phi_y)\tan\phi_z + \cos(\phi_y + \Delta\phi_y)}

This enables us to determine the orientation disparity, θ_r − θ_l, given the pattern orien-
tation φ_z, 3-D surface orientation characterized by φ_x, φ_y, and view angle Δφ_y.
Let λ_l, λ_r be the spacing, and f_l = 1/λ_l, f_r = 1/λ_r be the spatial frequency, of
the lines in the left and right images (Fig. 2B). Since spatial frequency is measured
perpendicular to the lines in the image, a new unit vector w, perpendicular to v, is
introduced to indicate the spacing between the lines. An expression for w can be obtained
from the expression for v by replacing φ_z with φ_z + 90°. When these three-dimensional
vectors, v and w, are projected onto an image plane, they generally do not remain
perpendicular (e.g., v_l and w_l in Fig. 2B). If we let v_l^⊥ = (−v_ly, v_lx), then u_l = v_l^⊥/||v_l||
is a unit vector perpendicular to v_l. The length of the component of w_l parallel
to u_l is equal to λ_l, the line spacing in the left image.

\lambda_l = \frac{w_l \cdot v_l^{\perp}}{\|v_l\|}
Substituting expressions for v_l and w_l gives an expression for the numerator, and a simple
expression for the denominator can be found in terms of θ_l.

w_l \cdot v_l^{\perp} = \cos\phi_x \cos(\phi_y + \Delta\phi_y) \; ; \qquad \|v_l\| = \frac{|\cos\phi_x \sin\phi_z|}{\sin\theta_l}

Combining these with similar expressions for λ_r gives a concise expression for spatial
frequency disparity.

\frac{f_r}{f_l} = \frac{\lambda_l}{\lambda_r} = \frac{w_l \cdot v_l^{\perp}}{\|v_l\|} \, \frac{\|v_r\|}{w_r \cdot v_r^{\perp}} = \left| \frac{\cos(\phi_y + \Delta\phi_y)\,\sin\theta_l}{\cos(\phi_y - \Delta\phi_y)\,\sin\theta_r} \right|
To determine spatial frequency disparity from a given pattern orientation φ_z, surface
orientation φ_x, φ_y, and view angle Δφ_y, this equation and the previous ones to determine
θ_l, θ_r are all that are needed.
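A sketch of this forward computation, building v and w by explicit rotations and projecting orthographically as described above; the rotation-matrix conventions are an assumption of this sketch (NumPy assumed):

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def projected_texture(phi_x, phi_y, phi_z, dphi_y):
    """Orientation theta and line spacing lambda of the ruled texture in the
    left (phi_y + dphi_y) and right (phi_y - dphi_y) orthographic views;
    v lies along the ruled lines, w perpendicular to them on the surface."""
    views = {}
    for name, sign in (("left", +1.0), ("right", -1.0)):
        R = Ry(phi_y + sign * dphi_y) @ Rx(phi_x)
        v = (R @ Rz(phi_z) @ np.array([1.0, 0.0, 0.0]))[:2]            # drop z
        w = (R @ Rz(phi_z + np.pi / 2) @ np.array([1.0, 0.0, 0.0]))[:2]
        u = np.array([-v[1], v[0]]) / np.linalg.norm(v)                # unit normal to v in the image
        views[name] = {"theta": np.arctan2(v[1], v[0]), "lambda": abs(w @ u)}
    return views
```

The orientation disparity is then θ_r − θ_l and the spatial frequency disparity is λ_l/λ_r = f_r/f_l.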
For solving the inverse problem (i.e., determining surface orientation), it has been
shown that from the orientations θ_l, θ_r and θ_l', θ_r' of two corresponding line elements (or
θ_l, θ_r and λ_l, λ_r for parallel lines), the three-dimensional surface normal can be recovered
(Jones, 1991). If more observations are available, they can be exploited using a least
squares algorithm. This is based on the following expression:

\tan\phi_x = \frac{\cos\phi_y \cos(\Delta\phi_y)(\tan\theta_{r_i} - \tan\theta_{l_i}) + \sin\phi_y \sin(\Delta\phi_y)(\tan\theta_{r_i} + \tan\theta_{l_i})}{\sin(2\Delta\phi_y)\,\tan\theta_{r_i}\,\tan\theta_{l_i}} = a_i \cos\phi_y + b_i \sin\phi_y

This has the convenient interpretation that for a given surface orientation, all the ob-
servations (a_i, b_i) should lie along a straight line whose orientation gives φ_y and whose
perpendicular distance from the origin is tan φ_x. Details of the derivation and experi-
mental results may be found in (Jones and Malik, 1991).
In Section 4 we present an alternative solution which does not depend on the identifi-
cation of corresponding line elements, but simply on the output of a set of linear spatial
filters. To develop a solution in a filter-based framework, the next section first re-casts
the information present in orientation and spatial frequency disparities in terms of the
disparity gradient.

3 Formulation using Gradient of Horizontal Disparity

Consider a region of a surface visible from two viewpoints. Let P = (x, y) be the coor-
dinates of a point within the projection of this region in one image, and P' = (x', y') be
the corresponding point in the other image. If this surface is fronto-parallel, then P and
P' differ only by horizontal and vertical offsets H, V throughout this region -- the image
patch in one view is merely a translated version of its corresponding patch in the other
view. If the surface is tilted or curved in depth then the corresponding image patches will
not only be translated, but will also be distorted. For this discussion, it will be assumed
that this distortion is well-approximated by an affine transformation.

Hx, Hy, Vx, Vy specify the linear approximation to the distortion and are zero when the
surface is fronto-parallel. For planar surfaces under orthogonal projection, the transfor-
mation between corresponding image patches is correctly described by this affine trans-
formation. For curved surfaces under perspective projection, this provides the best linear
approximation. The image patch over which this needs to be a good approximation is
the spatial extent of the filters used.
The vertical disparity V is relatively small under most circumstances and the vertical
components of the image distortion are even smaller in practice. For this reason, it will be
assumed that Vx, Vy = 0, leaving Hx, which corresponds to a horizontal compression or
expansion, and Hy, which corresponds to a vertical skew. In both cases, texture elements
oriented near vertical are most affected. It should also be noted that the use of Hx, Hy
differs from the familiar Burt-Julesz (1980) definition of disparity gradient, which is with
respect to a cyclopean coordinate system.
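A sketch of this distortion applied to an image patch, assuming the affine model x' = x + H + Hx·x + Hy·y, y' = y + V (consistent with the description above, with the vertical terms dropped) and SciPy's bilinear resampling:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_patch(patch, Hx, Hy, H=0.0, V=0.0):
    """Resample an image patch under the assumed affine distortion by
    inverting the map for each output pixel and sampling bilinearly."""
    patch = np.asarray(patch, dtype=float)
    ny, nx = patch.shape
    yy, xx = np.mgrid[0:ny, 0:nx].astype(float)
    xs = (xx - H - Hy * yy) / (1.0 + Hx)   # invert x' = (1 + Hx) x + Hy y + H
    ys = yy - V
    return map_coordinates(patch, [ys, xs], order=1, mode='nearest')
```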
Setting aside positional correspondence for the moment, since it has to do with relative
distance to the surface and not its orientation, this leaves the distortion parameterized
by Hx and Hy alone.
If we are interested in how a surface, or how the tangent plane to the surface, is tilted
in depth, then the critical parameters are Hx and Hy. If they could be measured, then
the surface orientation could be estimated, up to some factor related to the angular
separation of the eyes. For a planar surface, with orientation φ_x, φ_y, the image distortion
is given by:

H_x = \frac{\cos(\phi_y - \Delta\phi_y)}{\cos(\phi_y + \Delta\phi_y)} - 1 \; ; \qquad H_y = \frac{\tan\phi_x \, \sin(2\Delta\phi_y)}{\cos(\phi_y + \Delta\phi_y)}

These are the parameters for moving from the left view to the right view. To go in
the other direction requires the inverse transformation. This can be computed either by
changing the sign of Δφ_y in the above equations to interchange the roles of the two
viewpoints, or equivalently, the inverse of the transformation matrix can be computed
directly. Compression and skew depend on the angular separation 2Δφ_y of the viewpoints
and are reduced as this angle decreases, since this is the angle subtended by the view-
points, relative to a point on the surface. More distant surfaces lead to a smaller angle,
making it more difficult to judge their inclination.

4 Surface Shape from Differences in Spatial Filter Outputs

We have been developing a new stereo algorithm based on the outputs of linear spatial
filters at a range of orientations and scales. The collection of filter responses at a position
in the image (the filter response vector, v = F^T I), provides a very rich description of
the local image patch and can be used as the basis for establishing stereo correspondence
(Jones and Malik, 1992). For slanted surfaces, however, even corresponding filter response
vectors will differ, but in a way related to surface orientation. Such differences would
normally be treated as noise in other stereo models.
From filter responses in the right image, we could, in principle, reconstruct the image
patch using a linear transformation, namely the pseudo-inverse (for details and exam-
ples, Jones and Malik, 1992). For a particular surface slant, specified by Hx, Hy, we could
predict what the image should look like in the other view, using another linear trans-
formation -- the affine transformation discussed earlier. A third linear transformation
would predict the filter responses in the other view (Fig. 3).
v_L' = F^T \, T_{H_x,H_y} \, (F^T)^{-1} \, v_R = M_{H_x,H_y} \, v_R

Here the notation (F^T)^{-1} denotes a pseudo-inverse. This sequence of transformations
can, of course, be collapsed into a single one, M_{Hx,Hy}, that maps filter responses from
one view directly to a prediction for filter responses in the other view. These M matrices
depend on Hx and Hy but not on the input images, and can be pre-computed once, ahead
of time. A biologically plausible implementation of this model would be based on units
coarsely tuned in positional disparity, as well as the two parameters of surface slant.

Fig. 3. Comparing spatial filter outputs to recover 3-D surface orientation.

This provides a simple procedure for estimating the disparity gradient (surface orien-
tation) directly from vR and vL, the output of linear spatial filters. For a variety of choices
of Hx, Hy, compare v'_L = M_{Hx,Hy} vR, the filter responses predicted for the left view, with
vL, the filter responses actually measured for the left view. The choice of Hx, Hy which
minimizes the difference between v'_L and vL is the best estimate of the disparity gradient.
The sum of the absolute differences between corresponding filter responses serves as an
efficient and robust method for computing the difference between these two vectors, or
an error-measure for each candidate Hx, Hy.
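A sketch of this search over candidate disparity gradients, assuming the matrices M_{Hx,Hy} have been precomputed from some filter bank and stored in a dictionary keyed by (Hx, Hy); the names here are illustrative:

```python
import numpy as np

def estimate_disparity_gradient(vL, vR, M_bank):
    """Pick the (Hx, Hy) whose matrix M best predicts the left-view filter
    responses vL from the right-view responses vR, using the sum of absolute
    differences as the error measure.  M_bank is a dict {(Hx, Hy): M};
    building M = F^T T_{Hx,Hy} (F^T)^+ from an actual filter bank F is not
    shown here."""
    best, best_err = None, np.inf
    for (Hx, Hy), M in M_bank.items():
        err = np.abs(M @ vR - vL).sum()
        if err < best_err:
            best, best_err = (Hx, Hy), err
    # parabolic interpolation over neighbouring candidates could refine the
    # coarse estimate, as mentioned in Section 5
    return best, best_err
```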

5 The Accuracy of Recovered Surface Orientations

This approach was tested quantitatively using randomly generated stereo pairs with
known surface orientations (Fig. 4). For each of 49 test surface orientations (Hx, Hy ∈
{0.0, ±0.1, ±0.2, ±0.4}), 50 stereo pairs were created and the values of Hx, Hy were re-
covered using the method described in this paper. The recovered surface orientations
(Fig. 5) are quite accurate, especially for small slants. For larger slants, the spread in
the recovered surface orientation increases, similar to some psychophysical results. Small
systematic errors, such as those for large rotations around the vertical axis, are likely not
an inherent feature of the model, but an artifact of this particular implementation where
surface orientation was computed from coarse estimates using the parabolic interpolation.

Fig. 4. Stereo pair of surfaces tilted in depth. A white square marked on each makes the hor-
izontal compression/expansion, when Hx ≠ 0, and vertical skew, when Hy ≠ 0, quite apparent.


Fig. 5. Disparity gradient estimates. For various test surface orientations (open circles), the
mean (black dot) and standard deviation (ellipse) of the recovered disparity gradient are shown.

Because the test surfaces are marked with random textures, the orientation and spatial
frequency disparities at a single position encode surface orientation to varying degrees,
and on some trials would provide only very limited cues. Horizontal stripes, for example,
provide no information about a rotation around the vertical axis. For large planar sur-
faces, or smooth surfaces in general, estimates could be substantially refined by pooling
over a local neighborhood, trading off spatial resolution for increased accuracy.

6 Orientation and Spatial Frequency Disparities Alone

The approach for recovering three-dimensional surface orientation developed here makes
use of the fact that it is the identical textured surface patch that is seen in the two
views. It is this assumption of correspondence that allows an accurate recovery of the
parameters of the deformation between the two retinal images. However, orientation and
spatial frequency disparities lead to the perception of a tilted surface, even in the absence
of any systematic correspondence (Tyler and Sutter, 1979; von der Heydt et al., 1981).
One interpretation of those results might suppose the existence of stereo mechanisms
which make use of orientation or spatial frequency disparities independent of positional
disparities or correspondence. Such mechanisms would seem to be quite different from
the approach suggested here. On the other hand, it is not immediately apparent how the
present approach would perform in the absence of correspondence.
Given a pair of images, the implementation used in the previous experiment deter-
mines the best estimate of surface orientation - - even if it is nonsense. This allow us to
examine how it performs when the assumption of correspondence is false. Stereo pairs
were created by filtering random, uncorrelated one-dimensional noise to have a band-
width of 1.2 octaves and either an orientation disparity (Fig. 6A) or spatial frequency
disparity. Since a different random seed is used for each image, there is no consistent
correspondence or phase relationship. A sequence of 100 such pairs was created and for
each, using the same implementation of the model used in the previous experiment, the
parameters of surface orientation, or the disparity gradient, Hx, Hy, were estimated.


Fig. 6. A. Orientation disparity without correspondence. B. Disparity gradient estimates.

There is a fair bit of scatter in these estimates (Fig. 6B), but if the image pairs were
presented rapidly, one after the other, one might expect the perceived surface slant to
be near the centroid. In this case, Hx = 0 and Hy is positive, which corresponds to a
surface rotated around the horizontal axis - - in agreement with psychophysical results
(von der Heydt et al., 1981). In fact, the centroid lies close to where it should be based
on the 10° orientation disparity (Hx = 0.0, Hy = 0.175), despite the absence of corre-
spondence. The same procedure was repeated for several different orientation disparities

and for a considerable range, the recovered slant (Hy) increases with orientation dispar-
ity. Similar results were found for stereo pairs with spatial frequency disparities, but no
systematic correspondence.

7 Conclusion
In this paper, a simple stereopsis mechanism, based on using the outputs of a set of
linear spatial filters at a range of orientations and scales, has been proposed for the direct
recovery of local surface orientation. Tests have shown it is applicable even for curved
surfaces, and that interpolation between coarsely sampled candidate surface orientations
can provide quite accurate results. Estimates of surface orientation are more accurate for
surfaces near fronto-parallel, and less accurate for increasing surface slants.
There is also good agreement with human performance on artificial stereo pairs in
which systematic positional correspondence has been eliminated. This suggests that the
psychophysical results involving the perception of slant in the absence of correspondence
may be viewed, not as an oddity, but as a simple consequence of a reasonable mechanism
for making use of positional, orientation, and spatial frequency disparities to perceive
three-dimensional shape.

References
Arnold RD, Binford TO (1980) Geometric constraints on stereo vision. Proc SPIE
238:281-292
Burt P, Julesz B (1980) A disparity gradient limit for binocular function. Science
208:651-657
Jones DG (1991) Computational models of binocular vision. PhD Thesis, Stanford Univ
Jones DG, Malik J (1991) Determining three-dimensional shape from orientation and
spatial frequency disparities I - - using corresponding line elements. Technical Report
UCB-CSD 91-656, University of California, Berkeley
Jones DG, Malik J (1992) A computational framework for determining stereo
correspondence from a set of linear spatial filters. Proc ECCV Genova
Julesz B (1960) Binocular depth perception of computer generated patterns. Bell
Syst Tech J 39:1125-1162
Julesz B (1971) Foundations of cyclopean perception. University of Chicago Press:Chicago
Kass M (1983) Computing visual correspondence. DARPA IU Workshop 54-60
Kass M (1988) Linear image features in stereopsis. Int J Computer Vision 357-368
Koenderink JJ, van Doorn AJ (1976) Geometry of binocular vision and a model for
stereopsis. Biol Cybern 21:29-35
Mori K, Kidode M, Asada H (1973) An iterative prediction and correction method for
automatic stereo comparison. Computer Graphics and Image Processing 2:393-401
Quam LH (1984) Hierarchical warp stereo. Proc Image Understanding Workshop.
Rogers BJ, Cagenello RB (1989) Orientation and curvature disparities in the perception of
3-D surfaces. Invest Ophth and Vis Science (suppl) 30:262
Tyler CW, Sutter EE (1979) Depth from spatial frequency difference: an old kind of
stereopsis? Vision Research 19:859-865
von der Heydt R, Hänny P, Dürsteler MR (1981) The role of orientation disparity in
stereoscopic perception and the development of binocular correspondence, in Advances
in Physiological Science: 16:461-470 Grastyán E, Molnár P (eds) Oxford: Pergamon
Wildes RP (1991) Direct recovery of three-dimensional scene geometry from binocular
stereo disparity. IEEE Trans PAMI 13(8):761-774
Witkin AP, Terzopoulos D, Kass M (1987) Signal matching through scale space. Int J
Computer Vision 1(2):133-144
Using Force Fields Derived from 3D Distance Maps
for Inferring the Attitude of a 3D Rigid Object

Lionel Brunie 1, Stéphane Lavallée 1, and Richard Szeliski 2

1 TIMC - IMAG, Faculté de Médecine de Grenoble


38 700 La Tronche, France, lionel@timb.imag.fr
2 Digital Equipment Corporation, Cambridge Research Lab
One Kendall Square, Bldg. 700, Cambridge, MA 02139, szeliski@crl.dec.com
Abstract. This paper presents a new method for evaluating the spatial
attitude (position-orientation) of a 3D object by matching a 3D static model
of this object with sensorial data describing the scene (2D projections or
3D sparse coordinates). This method is based on the pre-computation of a
force field derived from 3D distance maps designed to attract any 3D point
toward the surface of the model. The attitude of the object is inferred by
minimizing the energy necessary to bring all of the 3D points (or projection
lines) in contact with the surface (geometric configuration of the scene).
To quickly and accurately compute the 3D distance maps, a precomputed
distance map is represented using an octree spline whose resolution increases
near the surface.

1 Introduction

One of the most basic abilities of any human or artificial intelligence is the inference
of knowledge by matching various pieces of information [1]. When only a few data are
available, one can introduce a priori knowledge to compensate for the lack of information
and match it with the data. In this framework, one of the most classical problems is
the inference of the attitude of a 3D object from sensorial data (2D projections or sparse
3D coordinates).
This problem can be formulated as follows: assume that we know a 3D description
(model) or some features of an object in a first 3D attitude (location and orientation).
We acquire various sensorial data describing this object in another (unknown) attitude,
and we then attempt to estimate, from the model of the object and this new data,
this unknown attitude. This generally implies the determination of 6 parameters: three
components of translation (location) and three components of rotation (orientation).
In this paper, we will suppose the segmentation of the sensorial data achieved and
focus on the interpretation of the scene described by the segmented images.
In spite of a considerable amount of literature (see [2] for a review of related works),
no general algorithm has been published yet. This paper presents a new complex object-
oriented geometric method based on the pre-computation of a force field derived from
3D distance maps. Experimental results, in the field of computer-assisted surgery, are
proposed.

2 Problem formulation : an energetic paradigm

To be independent of any 3D object representation, and in order to have as wide an
application field as possible, the starting point of our matching process will therefore be a
* The research described in this paper is supported by DEC and Safir-Groupe Sem companies
671

set of 3D points disLributed on the surface of the object and defining our model of the
object. Such a model can be extracted from any 3D initial representation.
The problem is to estimate the transformation T between Refsensor (the reference
system of the sensorial data) and Ref3D (the reference system in which the 3D model of
the object is defined). After sensor calibration (N-planes spline method [3]), in
3D/2D matching every pixel of each projection is associated with a 3-D line Li,
called the matching line, whose representation is known in Refsensor.
In 3D/2D matching, when the 3D object is in its final attitude T, every line Li is
tangent to the surface S. In the same way, when matching the 3D model with a set of
sparse 3D control points, the latter are in contact with S. For sufficiently complex ob-
jects (i.e. without strong symmetries), T is the only attitude leading to such a geometric
configuration. Our algorithm is based on this observation:
1. We first define the 3-D unsigned distance between a point r and the surface S,
dE(r, S), as the minimum Euclidean distance between r and all the points of S.
We use this distance function to define a force field at any point of 3D space.
Every point r is associated with a force vector F(r) = w - r, where w is the point of
S closest to r. We therefore have:

|F(r)| = dE(r, S)   (1)
2. In 3D/2D matching, an attraction force FL(Li) is associated with any matching line
Li by:
(a) if Li does not cross S, FL(Li) = F(Mi), where Mi is the point of Li closest
to S;
(b) else FL(Li) = F(Ni), where Ni is the point of Li inside the surface S farthest
from S (see fig. 1).
A simple way to compute FL is to consider a signed distance d, of the same modulus as
dE but negative inside S, and to choose the point of Li of minimum signed distance.
Fig. 1. Force vector associated to a matching line

3. Lemma (not proved here): The potential energy of the force field F at a point r with
respect to the surface S, i.e. the energy necessary to bring r into contact with S, is

PE(r) = (1/2) |F(r)|² + o(|F(r)|²)   (2)

For a set of Nq 3D control points ri, to take into account the reliability of the data,
we introduce the variance σi² of the noise of the measurement dE(ri, S) (see section
4) to weight the energy of a control point, and consider the energy E:

E(p) = Σ_{i=1}^{Nq} (1/σi²) [dE(ri, S)]²   (3)

4. In the same way, the potential energy of a matching line Li, i.e. the work necessary
to bring the line into contact with S, is equal to the potential energy of the point
where the attraction force is applied (Mi or Ni). As previously, to take into account
the reliability of the data on the matching lines, we weight the potential energy of
each matching line by the variance σi² of the noise of the measurement d(li(p), S),
and consider the energy E:

E(p) = Σ_i (1/σi²) [d(li(p), S)]²   (4)

where the sum runs over all matching lines.

5. As shown above, when the object is in its final attitude, every line (every control
point) is in contact with S and the energy of the attitude is therefore zero, the lowest
possible energy. If the object is sufficiently complex, the minimum of the energy
function is reached only once, at the final attitude, and the energy function is convex
in a large neighborhood of this attitude. A minimization procedure for convex functions
can therefore be applied (see section 4).
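As a purely illustrative sketch (not the authors' implementation), the energy of eq. (3) can be evaluated in a few lines of Python once a regularly sampled distance map is available; the grid layout, the parametrization of the attitude p by a rotation vector and a translation, and all names below are assumptions.

import numpy as np
from scipy.ndimage import map_coordinates
from scipy.spatial.transform import Rotation

def attitude_energy(p, points, dist_map, voxel_size, sigma):
    # Eq. (3): weighted sum of squared distances of the transformed control
    # points to the surface, the distances being read from a distance map.
    R = Rotation.from_rotvec(p[:3]).as_matrix()   # assumed layout: (rx, ry, rz, tx, ty, tz)
    t = p[3:]
    transformed = points @ R.T + t                # control points expressed in Ref3D
    grid = transformed / voxel_size               # continuous voxel coordinates
    d = map_coordinates(dist_map, grid.T, order=1, mode='nearest')  # trilinear lookup
    return np.sum((d / sigma) ** 2)

A minimizer then searches the six-dimensional parameter space for the attitude p with the lowest energy, as described in section 4.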

3 Fast force field c o m p u t a t i o n and octree splines distance maps

The method described in the previous section relies on the fast computation of the
distances dE and d. If the surface S is discretized into n² points, the computation of
the distance dE is an O(n²) process. Similarly, if a line li(p) is discretized into m points,
the computation of the distance d is an O(mn²) process. To speed up this process, we
precompute a 3-D distance map, which is a function that gives the signed minimum
distance to S from any point q inside a bounding volume V that encloses S.
More precisely, let G be a regular grid of N³ points bounding V. We first compute and
store the distance d for each point q of G. Then d(q, S) can be computed for any point
q using a trilinear interpolation of the 8 corner values dijk of the cube that contains the
point q. If (u, v, w) ∈ [0, 1] × [0, 1] × [0, 1] are the normalized coordinates of q in the
cube,

d(q, S) = Σ_{i=0}^{1} Σ_{j=0}^{1} Σ_{k=0}^{1} bi(u) bj(v) bk(w) dijk   with   bl(t) = l·t + (1 - l)(1 - t).   (5)

We can compute the gradient ∇d(q, S) of the signed distance function by simply
differentiating (5) with respect to u, v, and w. Because d is only C⁰, ∇d(q, S) is discon-
tinuous on cube faces. However, these gradient discontinuities are relatively small and do
not seem to affect the convergence of our iterative minimization algorithm.
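The interpolation of eq. (5) and its differentiation transcribe directly; the following Python sketch is only an illustration of the formula (array layout and function names are assumptions, not the authors' code).

import numpy as np

def b(l, t):
    # Linear basis of eq. (5): b_l(t) = l*t + (1 - l)*(1 - t), for l in {0, 1}.
    return l * t + (1 - l) * (1 - t)

def db(l, t):
    # Derivative of b_l with respect to t.
    return 2 * l - 1

def trilinear_distance(d_corner, u, v, w):
    # d(q, S) and its gradient (in normalized cube coordinates) from the
    # 8 corner values d_corner[i, j, k], with (u, v, w) in [0, 1]^3.
    val, grad = 0.0, np.zeros(3)
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                val += b(i, u) * b(j, v) * b(k, w) * d_corner[i, j, k]
                grad += d_corner[i, j, k] * np.array([
                    db(i, u) * b(j, v) * b(k, w),
                    b(i, u) * db(j, v) * b(k, w),
                    b(i, u) * b(j, v) * db(k, w)])
    return val, grad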
In looking for an improved trade-off between memory space, accuracy, speed of com-
putation, and speed of construction, we have developed a new kind of distance map
which we call the octree spline. The intuitive idea behind this geometrical representation
is to have more detailed information (i.e., more accuracy) near the surface than far away
from it. We start with the classical octree representation associated with the surface S
and then extend it to represent a continuous 3-D function that approximates the signed
Euclidean distance to the surface. This representation combines advantages of adaptive
spline functions and hierarchical data structures. For more details on the concept of
octree splines, see [2].

4 Least Squares Minimization

This section describes the nonlinear least squares minimization of the energy or error
function E(p) defined in eq. (4) and eq. (3).
Least squares techniques work well when we have many uncorrelated noisy measure-
ments with a normal (Gaussian) distribution³. To begin with, we will make this assump-
tion, even though noise actually comes from calibration errors, 2-D and 3-D segmentation
errors, the approximation of the Euclidean distance by octree spline distance maps, and
non-rigid displacement of the surface between Ref3D and Refsensor.
To perform the nonlinear least squares minimization, we use the Levenberg-Marquardt
algorithm because of its good convergence properties [4]. An important point of this
method is that in both equations (4) and (3), E(p) can be easily differentiated, which
yields simple analytical forms for the gradient and Hessian of E(p), used in the
minimization algorithm.
At the end of the iterative minimization process, we compute a robust estimate of
the parameter vector p by throwing out the measurements where ei(p) ≫ 0.2 and performing
some more iterations [5]. This process removes the influence of outliers which are likely
to occur in the automatic 2-D and 3-D segmentation processes (for instance, a partially
superimposed object on X-ray projections can lead to false contours).
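A hedged sketch of this two-pass robust estimation, using the Levenberg-Marquardt mode of scipy.optimize.least_squares in place of the authors' own implementation (the residual function, its name, and the rejection threshold are illustrative assumptions):

import numpy as np
from scipy.optimize import least_squares

def robust_pose_estimate(residual_fn, p0, threshold=0.2):
    # residual_fn(p) returns the vector of residuals e_i(p) for all measurements.
    fit = least_squares(residual_fn, p0, method='lm')         # first Levenberg-Marquardt pass
    keep = np.abs(residual_fn(fit.x)) < threshold             # discard suspected outliers
    refit = least_squares(lambda p: residual_fn(p)[keep], fit.x, method='lm')
    return refit.x

The covariance of the estimate, used below to check the error distribution, can then be approximated from the Jacobian returned by the solver.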
Using a gradient descent technique such as Levenberg-Marquardt we might expect
that the minimization would fail because of local minima in the 6-dimensional parameter
space. However, for the experiments we have conducted, false local minima were few and
always far away from the solution. So, with a correct initial estimate of the parameters,
these other minima are unlikely to be reached.
Finally, at the end of the iterative minimization procedure, we estimate the uncer-
tainty in the parameters (covariance matrix) to compute the distribution of errors after
minimization in order to check that it is Gaussian.

5 Experimental results

We have performed tests on both real anatomical surfaces and on simulated surfaces. In
3D/2D matching, the projection curves of these surfaces were obtained by simulation in
order to know the parameters p* for which the correct pose is reached. Figures 2 and 3
show an example of convergence for an anatomical surface (VIM of the brain; surface
S1) in 3D/2D matching. The state of the iterative minimization algorithm is displayed
after 0, 2, and 6 iterations. Figure 2 shows the relative positions of the projections lines
and the surface seen from a general viewpoint. Figure 3 shows the same state seen from
the viewpoints of the two cameras (computation times expressed below are given for a
DECstation 5000/200). Experiments have also been conducted to test this method for
3D/3D matching by simulating a complex transformation on a vertebra (surface S2) (see
fig. 4 for the convergence).

6 Discussion

In comparison with existing methods, the experiments we ran showed that the method pre-
sented in this paper has five main advantages.
³ Under these assumptions, the least squares criterion is equivalent to maximum likelihood
estimation.

Fig. 2. Convergence of algorithm observed from a general viewpoint (surface S1 is represented
by a set of points). Two sets of projection lines evolve in the 3D potential field associated with
the surface until each line is tangent to S1: (a) initial configuration, (b) after 2 iterations, (c)
after 6 iterations. For this case, the matching is performed in 1.8 s using 77 projection lines, in
0.9 s using 40 projection lines.

First, the matching process works for any free-form smooth surface. Second, we
achieve the best accuracy possible for the estimation of the 6 parameters in p, because
the octree spline representation we use approximates the true 3-D Euclidean distance
with an error smaller than the segmentation errors in the input data. Third, we provide
an estimate of the uncertainties of the 6 parameters. Fourth, we perform the matching
process very rapidly. Fifth, in our method, only a few pixels on the contours are needed.
This allows the attitude of the object to be estimated even if it is partially occluded. More-
over, reliability factors can be introduced to weight the contribution of uncertain data
(for instance, the variance of the segmentation can be taken into account).
This method could also be used for recognition problems, where the purpose is to
match some contour projections with a finite set of 3-D objects {Oi}.
Research is presently underway to adapt this algorithm to non-segmented gray-
level images by selecting potential matching lines, assigning credibility factors to them,
and maximizing a matching energy.

References

1. A. Wackenheim. Perception, commentaire et interprétation de l'image par les intelligences
naturelle et artificielle. Springer Verlag, 1987.
2. S. Lavallée, R. Szeliski, and L. Brunie. Matching 3D smooth surfaces with their 2D projections
using 3D distance maps. In SPIE Geometric Methods in Computer Vision, San Diego, CA, July 1991.
3. G. Champleboux. Utilisation des fonctions splines à la mise au point d'un capteur tridi-
mensionnel sans contact : application à la ponction assistée par ordinateur. PhD thesis,
Grenoble University, July 1991.
4. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The
Art of Scientific Computing. Cambridge University Press, Cambridge, England, 1986.
5. P. J. Huber. Robust Statistics. John Wiley & Sons, New York, New York, 1981.
This article was processed using the LaTeX macro package with the ECCV92 style

Fig. 3. Convergence of algorithm for surface S1 observed from the 2 projection viewpoints. The
external contours of the projected surface end up fitting the real contours: (a) initial configura-
tion, (b) after 2 iterations, (c) after 6 iterations.


Fig. 4. Convergence of the 3-D/3-D matching algorithm for surface S2 (vertebra) segmented from
a 3D CT image. For this case, the matching is performed in 2 s using 130 data points.
(a) initial configuration, E(p(0)) = 113.47, ||Δt(0)|| = 125.23 mm, |Δα(0)| = 48.25°;
(b) after 2 iterations, E(p(2))/Mp = 38.58, ||Δt(2)|| = 28.97 mm, |Δα(2)| = 20.53°;
(c) after 6 iterations, E(p(6))/Mp = 4.20, ||Δt(6)|| = 0.75 mm, |Δα(6)| = 0.32°.
Segmenting Unstructured 3D Points into Surfaces*

P. Fua 1,2 and P. Sander 1

1 INRIA Sophia-Antipolis, 2004 Route des Lucioles, 06565 Valbonne Cedex, France
2 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA

Abstract.
We propose an approach for building surfaces from an unsegmented set
of 3D points. Local surface patches are estimated and their differential prop-
erties are used iteratively to smooth the points while eliminating spurious
data, and to group them into more global surfaces. We present results on
complex natural scenes using stereo data as our source of 3D information.

1 Introduction

Deriving object surfaces from a set of 3D points produced, for example, by laser rangefind-
ers, stereo, or 3D scanners is a difficult task because the points form potentially noisy
"clouds" of data instead of the surfaces one expects. In particular, several surfaces can
overlap - - the 2 1/2 D hypothesis required by simple interpolation schemes is not neces-
sarily valid. Furthermore, the raw 3D points are unsegmented and it is well known that
segmentation is hard when the data is noisy and originates from multiple objects in a
scene. Most existing approaches to the problem of determining surfaces from a set of
points in space assume that all data points belong to a single object to which a model
can be fit [11,7,2].
To overcome these problems, we propose fitting a local quadric surface patch in the
neighborhood of each 3D point and using the estimated surfaces to iteratively smooth the
raw data. We then use these local surfaces to define binary relationships between points:
points whose local surfaces are "consistent" are considered as sampled from the same
underlying surface. Given this relation, we can impose a graph structure upon our data
and define the surfaces we are looking for as sets of points forming connected components
of the graph. The surfaces can then be interpolated using simple techniques such as
Delaunay triangulation. In effect, we are both segmenting the data set and reconstructing
the 3D surfaces. Note that closely related methods have also been applied to magnetic
resonance imagery [10] and laser rangefinder images [3].
Regrettably, space limitations force us to omit many details; the interested reader is
referred to [6].

2 Local Surfaces

We iteratively fit local surfaces by first fitting a quadric patch around each data point,
and then moving the point by projecting it back onto the surface patch. For our algorithm
to be effective with real data, it must be both orientation independent and insensitive to
outliers.
* Support for this research was partially provided by ESPRIT P2502 (VOILA) and ESPRIT
BRA 3001 (INSIGHT) and a Defense Advanced Research Projects Agency contract.

Orientation. To achieve orientation independence around a point P0 = (x0, y0, z0), we
use an estimate of the tangent plane to define a reference frame whose origin is P0 itself
and whose z axis is perpendicular to the plane, and we fit a quadric of the form

z = q(x, y) = ax² + bxy + cy² + dx + ey + f

by minimizing a least squares criterion E²,

E² = Σ_i wi (zi - q(xi, yi))²   (1)

where the (xi, yi, zi), 1 ≤ i ≤ n, are the n neighbors in a spherical neighborhood of P0 and the
wi are associated weights. We then transform P0 as follows:

P0 → (0, 0, f) = (0, 0, q(0, 0)),

expressed in the local reference frame, and we repeat the fitting process with the trans-
formed points.
In effect, we are approximating the 2nd order Taylor expansion of the surface around
each P. Since we iterate the estimates, this procedure is a form of relaxation [6]. For
typical stereo data sets (~ 100,000 points), the algorithm converges within five to ten
iterations.
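A minimal Python sketch of one such local fit, assuming the neighbours have already been expressed in the tangent frame of P0 (the function name and the array layout are assumptions, not the authors' code):

import numpy as np

def fit_local_quadric(pts, weights=None):
    # Weighted least-squares fit of z = a x^2 + b x y + c y^2 + d x + e y + f
    # to the neighbours of P0, as in eq. (1). Returns (a, b, c, d, e, f).
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    A = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    if weights is not None:
        sw = np.sqrt(weights)
        A, z = A * sw[:, None], z * sw
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

P0 is then moved to (0, 0, f) in the local frame, i.e. projected onto the fitted patch, and the fit is repeated.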

Fig. 1. (a) Two spatially proximate and noisy hemispheres. (b) Smoothed spheres after sev-
eral iterations. (c) Resampled points. (d) The two largest connected components of
the corresponding graph for a given value of the parameter mq of Eq.(4), one shown
as a wireframe and the other as a shaded surface.

Outliers. To deal with outliers (to which least-squares techniques are notoriously sensi-
tive), we define a metric dq that measures whether or not two points appear to belong
to the same surface. We take dq to be

dq(P1, P2, q1, q2) = max(d1, d2)   (2)

where

d1 = |z1 - q2(x1, y1)|, expressed in the reference frame of q2,

d2 = |z2 - q1(x2, y2)|, expressed in the reference frame of q1.

dq is zero when the two points belong to the same local surface qi and increases when
their respective local surfaces become inconsistent. It can therefore be used to discount
outliers by computing the weighting factor wi of Eq. (1) at iteration t as

wi = 1 / (1 + (di/σ)²),   with   di = dq(P0, Pi, q0^(t-1), qi^(t-1)),

where the (qi^(t-1)), 0 ≤ i ≤ n, are the quadrics that had been computed at the previous iteration
and σ is an estimate of the variance of the process generating the data points. Note
that for processes such as stereo computation or laser range finding, σ can actually be
estimated.
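Spelled out in Python, the consistency metric and the weight reduce to a few lines (a sketch; the helper names are assumptions and each point is assumed to have already been transformed into the other quadric's local frame):

def quadric_eval(c, x, y):
    a, b, cc, d, e, f = c
    return a * x * x + b * x * y + cc * y * y + d * x + e * y + f

def d_q(p1_in_frame2, q2, p2_in_frame1, q1):
    # Eq. (2): each point is compared with the other point's local quadric.
    x1, y1, z1 = p1_in_frame2
    x2, y2, z2 = p2_in_frame1
    return max(abs(z1 - quadric_eval(q2, x1, y1)),
               abs(z2 - quadric_eval(q1, x2, y2)))

def robust_weight(d_i, sigma):
    # w_i = 1 / (1 + (d_i / sigma)^2): inconsistent neighbours are down-weighted.
    return 1.0 / (1.0 + (d_i / sigma) ** 2)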
In this manner, as the algorithm progresses, the points that are on the same surface as
P0 gain influence while the others are increasingly discounted. We illustrate this behaviour
in Fig.l(b), where the two noisy hemispheres are smoothed without being merged and
the points between them are left as outliers; such a result would be difficult to achieve
with a simple 2 1/2D interpolation scheme.

3 Resampling

When the smoothing is done, as shown in Figure l(b), the data points still form an
irregular sampling of the underlying surfaces that is ill-suited for the generation of a
map. In order to produce meaningful triangulations, we need a more regularly spaced set
of vertices, and we use the local surfaces to compute them [6]. In effect we are replacing a
large number of irregularly spaced 3D points by a smaller set of regularly spaced ones and
their local surfaces: we are achieving both data organization and compression. This turns
out to be an effective way to merge data-points coming from several views or sensors.

4 Clustering

To cluster the isolated 3D points into more global entities, we use again the metric of
Eq.(2) to define a "same surface" relationship R between points P1, P2 as follows:

P1 R P2  ⟺  dq(P1, P2, q1, q2) < mq  and  de(P1, P2) < md,

where de is the Euclidean distance and md and mq are two thresholds. In other words,
two points are assumed to belong to the same surface if their local fits are consistent
with one another. The data set equipped with the relationship R can now be viewed as
a graph whose connected components are the surfaces we are looking for.
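The clustering itself can be sketched as a straightforward connected-components computation (illustrative only; the k-d tree pruning and the callable d_q_fn are assumptions, not the authors' implementation):

import numpy as np
from scipy.spatial import cKDTree

def cluster_points(points, d_q_fn, m_d, m_q):
    # Connected components of the relation P1 R P2 <=> d_q < m_q and ||P1-P2|| < m_d.
    # d_q_fn(i, j) is assumed to return the metric of eq. (2) for points i and j.
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    tree = cKDTree(points)
    for i, j in tree.query_pairs(m_d):     # only pairs closer than m_d need testing
        if d_q_fn(i, j) < m_q:
            union(i, j)
    return np.array([find(i) for i in range(len(points))])   # equal labels = same surface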

5 Triangulation

The clusters we have generated so far are collections of points and their associated local
surfaces. For many applications, such as robotics or graphics, it is important to be able
to unambiguously interpolate the surfaces. Delaunay triangulation [1] is an excellent way
of doing this [6] and, furthermore, lets us compute shaded models of our data sets.

Fig. 2. (a) (b) A stereo pair showing some rocks. (c) Another image taken from a completely
different viewpoint. (d) The stereo map derived by matching (a) and (b). The black
areas correspond to textureless areas for which no depth was computed, and lighter
colors correspond to increasing depth. (e) Wireframe representation. (f) Shaded rep-
resentation. Note that only the parts of the rocks visible in both (a) and (b) are
correctly reconstructed.

Fig. 3. (a) Triangulated ground surface for the rocks of Figure 2 (b). (c) Shaded views. Note
that the backs of the foreground rocks of Figure 2(a) are now clearly visible.

6 Reconstructing 3D Surfaces from Stereo Data

In this section, we show how our technique can be used to reconstruct 3D surfaces by
fusing stereo depth-maps computed from several viewpoints.
Our stereo algorithm [4] produces semi-dense maps that are 2 1/2D representations
of the world. They are necessarily incomplete and, in particular, cannot account for
occluded parts of the scene, such as the back of the rocks in Figure 2(a).
To reconstruct the scene more completely, we have used 5 sets of stereo-pairs corre-
sponding to camera positions between that of figures 2(a) and 2(c). After having reg-

istered the stereo results with one another [12], we can use our technique to generate
clusters of triangulated 3D points. The largest one, depicted in Figure 3, accounts for all
the large rocks. Erroneous data points have been discarded as outliers.

7 Conclusion

The experiments shown in this paper used depth data acquired from stereopsis by a mo-
bile robot, although we have successfully tested much of the same code on 3D biomedical
scanner images as well. It must be noted that when the quality of the data degrades, the
critical step becomes that of grouping, and it will be necessary to develop more sophisti-
cated methods. In fact, generating the triangulations of Section 5 could be recast as the problem
of finding the best description of the data in terms of a set of triangles with normals and
curvatures known at every vertex. This problem can be handled within the framework
provided by minimum description length encoding [9,8,5], and such a framework should
provide us with a sound theoretical basis for future work.

References

1. Jean-Daniel Boissonnat. Geometric structures for three-dimensional shape representation.


ACM Transactions on Graphics, 3:266-286, October 1984.
2. Isaac Cohen, Laurent Cohen, and Nicholas Ayache. Introducing deformable surfaces to
segment 3D images and infer differential structure. Technical report, INRIA, 1991.
3. Frank P. Ferrie, Jean Lagarde, and Peter Whaite. Recovery of volumetric object descrip-
tions from laser rangefinder images. In First European Conference on Computer Vision
(ECCV), pages 387-396, Antibes, April 1990.
4. P. Fua. A parallel stereo algorithm that produces dense depth maps and preserves image
features. Machine Vision and Applications, 1991. Accepted for publication, available as
INRIA research report 1369.
5. P. Fua and A.J. Hanson. Objective functions for feature discrimination. In IJCAI-89
Conference, Detroit, August 1989.
6. P. Fua and P. Sander. Reconstructing surfaces from unstructured 3d points. In Image
Understanding Workshop, San Diego, California, January 1992.
7. Bradley Horowitz and Alex Pentland. Recovery of non-rigid motion and structure. In
Proceedings of CVPR'91, pages 325-330, Hawaii, June 1991.
8. Y. G. Leclerc. Constructing simple stable descriptions for image partitioning. International
Journal of Computer Vision, 3(1):73-102, 1989.
9. J. Rissanen. Minimum-description-length principle. Encyclopedia of Statistical Sciences,
5:523-527, 1987.
10. Peter T. Sander and Steven W. Zucker. Inferring surface trace and differential structure
from 3-D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-
12(9):833-854, September 1990.
11. Demetri Terzopoulos and Dimitri Metaxas. Dynamic 3d models with local and global de-
formations: Deformable superquadrics. In Proceedings of ICCV'90, pages 606-615, Osaka,
December 1990.
12. Z. Zhang and O.D. Faugeras. Tracking and motion estimation in a sequence of stereo
frames. In L.C. Aiello, editor, Proc. 9th European Conf. Artif. Intell., pages 747-752,
Stockholm, Sweden, August 1990.

This article was processed using the LaTeX macro package with the ECCV92 style
Finding the Pose of an Object of Revolution
R. GLACHET, M. DHOME, J.T. LAPRESTE.
Electronics Laboratory, URA 830 of the CNRS
Blaise Pascal University of Clermont-Ferrand 63177 AUBIERE CEDEX (FRANCE)
tel: 73.40.72.28 ; fax: 73.40.72.62, Email: dhome@le-eva.univ-bpclermont.fr

Abstract: An algorithm able to locate an object of revolution from its CAD model and a single per-
spective image is proposed. Geometric properties of objects of revolution are used in order to simplify
the localization problem. The axis projection is first computed by a prediction-verification scheme. This
enables the computation of a virtual image in which the contours are symmetric. A rough localization is done
in this virtual image and then improved by an iterative process. Experiments with real images prove
its robustness and its capability to deal with partially occluded contours.

1 INTRODUCTION
Among the visual features that can be extracted from an image, contours are a major source of information
about the shape and the attitude of an object. Retrieving the object location from the contours detected in
a single perspective image is not usually an easy task. The main difficulty consists in finding perspective
invariants that enable model elements to be matched to image primitives.
In the case of polyhedral scenes, matching is greatly simplified, as straight lines are projected onto straight
lines in the image. Thus linear ridges can be successfully used as invariants ([LOW-85], [DHO-89]).
The problem is more difficult when dealing with a world including curved objects, mainly because the
primitives used to describe the model and the image curves are not necessarily of the same nature
([MAR-82]). Alignment of points ([HUT-87]) allows the location of an object to be deduced as soon as three
pairs of corresponding model and image points are found. But finding such pairs is not trivial.
In the case of Straight Homogeneous Generalized Cylinders (S.H.G.C.) ([MAR-82]), invariants like zeros
of curvature can provide matching pairs ([PON-89], [ULU-90], [RIC-91]), but the resulting methods are
very sensitive to occlusion.
When no model is available, the location problem is under-constrained; meridians can then be used to
avoid ambiguity ([ULU-90]), but the extraction of meridians is not straightforward.
In this paper, we have chosen to deal with objects of revolution whose model is described by a generating
curve. The matching difficulty is avoided by taking advantage of local symmetries (cf also [PON-89]).
Our localization method consists of three stages that make full use of the symmetry properties:
Algorithm (A1) is devoted to the localization of the axis projection and to the creation of a virtual image
in which the contours are symmetrical [GLA-91]. This step contributes to the simplification of the
following computations.
Algorithm (A2) deals with the determination, by any available means, of a rough starting localization
[DHO-90] (see section 3).
Algorithm (A3) uses the contour equations (section 2) in order to iteratively improve the location found
with (A2) (see section 4).

This work can be seen as an enhancement of a method presented at the last ECCV ([DHO-90]). In ([DHO-90])
the concept of the virtual image was already used to simplify (A2), but its determination was not very accurate
because it was only based on the knowledge of two image points resulting from the same object section; these
chosen points were double points of the limb projection, and their detection is often biased. In case of noisy or
partially occluded contours, such a method fails. On the contrary, the process (A1) is quite robust to noise
(due to the fact that it avoids brute accumulation schemes) and thus provides a virtual image known with
good accuracy.
Algorithm (A2) has recently been improved; the reader can find in [LAV-90] the details of its new imple-
mentation. An iterative improvement scheme (A3) has been added to the previous algorithm to reach accuracy.
Its goal is similar to Kriegman & Ponce's work [KRI-90]. Respective advantages of the two methods are
detailed in section 4.2.
Implementation details and experimental results are given in section 5 (see also [GLA-92]).

2 CONTOURS EQUATION
In this paragraph, we derive the equations of the limb projections of a solid of revolution from the attitude
parameters. To obtain simple equations, we only consider the attitudes for which the limb projections are
symmetrical with respect to the vertical axis through the image center. Using algorithm (A1) it is always
possible, from any brightness image of a solid of revolution, to compute a virtual image having this property
(see [GLA-91]).

Let ℜ be the frame (O, i, j, k) where:
- O is the optical center.
- i is parallel to the image lines.
- k is orthogonal to the image plane.

Figure 1: Perspective projection of an object of revolution.

A limb point of a curved object is a surface point for which the tangent plane passes through the optical
center of the camera.
Let R be the model of an object of revolution whose axis initially lies along (O, k), described by its generating
curve r(z) in ℜ.
Let Q0 be a point of the model surface and N0 the normal to the surface at Q0.
We have:

Q0 = (r cosθ, r sinθ, z0)ᵀ,   N0 = (cosθ, sinθ, -r')ᵀ,   with r = r(z0) and r' = dr/dz(z0).
To bring the model, from its initial pose, into an attitude producing a virtual image in which the contours
are symmetrical, as discussed previously, it must undergo a transformation W composed of a rotation
around i of angle α, followed by a translation (b j + c k).
Let us call p the attitude parameter vector: p = (p1, p2, p3) = (α, b, c).
Under W, Q0 is transformed into the point Q and N0 into the normal N = (nx, ny, nz)ᵀ.
If f is the focal length and (u, v) are the coordinates of the perspective projection q of Q, we deduce from the limb
equation (N · OQ = 0):

u = f r cosθ / (r sinα sinθ + z0 cosα + c)
v = f (r cosα sinθ - z0 sinα + b) / (r sinα sinθ + z0 cosα + c)                       (1)
where sinθ = [(c cosα - b sinα + z0) r' - r] / (b cosα + c sinα) for limb points.

Then, for a given section z0 of the model, the perspective projection of the limb points can easily be computed.
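For illustration, eq. (1) transcribes directly into a short Python function (the signature, the handling of sections without a visible limb point, and the returned symmetric pair are assumptions of this sketch, not part of the paper):

import numpy as np

def limb_projection(z0, r, rprime, alpha, b, c, f):
    # Perspective projection (u, v) of the limb points of section z0 of a
    # surface of revolution, following eq. (1). r and rprime stand for r(z0)
    # and r'(z0); alpha, b, c are the attitude parameters p.
    sa, ca = np.sin(alpha), np.cos(alpha)
    sin_t = ((c * ca - b * sa + z0) * rprime - r) / (b * ca + c * sa)
    if abs(sin_t) > 1.0:
        return None                      # no limb point for this section
    cos_t = np.sqrt(1.0 - sin_t ** 2)
    denom = r * sa * sin_t + z0 * ca + c
    u = f * r * cos_t / denom
    v = f * (r * ca * sin_t - z0 * sa + b) / denom
    return (u, v), (-u, v)               # symmetric pair about the image axis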

3 ROUGH LOCALISATION IN THE VIRTUAL IMAGE (A2)
Many processes are available to find an approximate value of the attitude parameter vector p, especially
when the contour detected in the image presents a reflectional symmetry. The weak perspective view
assumption can be sufficient to treat the problem.
Moments, smoothness and compactness may help to reach an initial value of p, but such features are not
reliable when a part of the contour is occluded.
When an object of revolution has a flat base, a part of an ellipse is seen in the image and can be used to
find an approximate location of the object ([DHO-90]); extraction of the ellipse among contour points is
often difficult, but reliable algorithms exist ([RIC-85]).
A prediction-verification method using zero-curvature points ([RIC-91]), recently generalized to any
points of a symmetrical contour ([DHO-90], [LAV-90]), can also lead to an estimation of the localization,
useful to initialize the iterative improvement described in the next section.

4 IMPROVEMENT OF THE LOCALISATION IN THE VIRTUAL IMAGE (A3)
With a rough localization at hand, we can implement an iterative process that will improve it.
4.1 Description of the method
At this stage, we know an approximate location p0 of the solid R, in a frame such that the perspective
projection of the object limbs is symmetrical with respect to the vertical axis through the image center.
Let C0 be the calculated contour for this initial value p0 (see section 2) and I be the contour observed
in the virtual image.
The problem is now to decide on an association rule between the points belonging respectively to I
and C0, under the constraint that if p0 is the correct attitude, the rule must give the exact pairing.
Once this rule of association is chosen, the distance between paired points can be computed (as well as
partial derivatives if the association is not too complicated) and a minimization process can be run.
The perfect rule is to associate each point qC0(z0) of C0 with the point qI(z0) of I corresponding to the
same section z0 of the model; unfortunately matching points that way is not trivial (except for zero-
curvature points).
Another possible association is to connect each point qC0 to the point qI whose normal passes through
qC0 (see [PON-89b]), but here, the preliminary computation of a virtual symmetrical image allows us to
choose a very simple matching rule:
For each section z0 of the model, we compute u(z0) and v(z0) using equation (1) of section 2 and
we find the nearest contour point on the image line v = v(z0), i.e. the association is made horizontally.
Figure 2: Notations. a) our distance, b) following an image point between two iterations.
The image contour I is supposed to be given by a function U(v). Ck is the calculated contour at iteration k.
We must find the vector p minimizing D = Σ_{z0} F²(z0) = Σ_{z0} |U(v(z0)) - u(z0)|²   (see figure 2b).

The section of the model which is projected onto a given image line at iteration k can change at
iteration k + 1. This variation of section can be calculated and taken into account in the equations
involved in a Newton-Raphson minimization.
Let us consider two successive iterations k, k + 1, providing an unknown variation Δp of the parameter
vector, such that the perspective projection Ak(z0) of a limb point of section z0 becomes Ak+1(z0) at the
next iteration.
It is possible to find the section z0 + Δz0, and consequently the point Bk(z0 + Δz0), which will be
transformed at iteration k + 1 into a point Bk+1(z0 + Δz0) having the same ordinate v as point Ak(z0). In
fact we can write:
Δu = Σ_{j=1}^{3} (∂u/∂pj) Δpj + (∂u/∂z0) Δz0,   Δv = Σ_{j=1}^{3} (∂v/∂pj) Δpj + (∂v/∂z0) Δz0   (2)

and, as we imposed Δv = 0 for the measure of F(z0) between two iterations,

we can deduce   Δz0 = - [ Σ_{j=1}^{3} (∂v/∂pj) Δpj ] / (∂v/∂z0).

Then (2) becomes:   Δu = Σ_{j=1}^{3} [ ∂u/∂pj - (∂v/∂pj) (∂u/∂z0) / (∂v/∂z0) ] Δpj.

Now we solve for Fk+1 = 0 (the distance between Ck and Ck+1 tends to zero),

and we obtain   U(v(z0)) - u(z0) = Σ_{j=1}^{3} [ ∂u/∂pj - (∂v/∂pj) (∂u/∂z0) / (∂v/∂z0) ] Δpj.   (3)
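One iteration of this scheme can be sketched as follows, with the partial derivatives approximated by finite differences; the callables project(z0, p) and U(v), the step eps and all names are assumptions of this illustration, not the authors' implementation:

import numpy as np

def pose_update(project, U, p, sections, eps=1e-4):
    # One Gauss-Newton step of eq. (3). project(z0, p) returns the limb point
    # (u, v) of section z0 under p = (alpha, b, c); U(v) returns the abscissa of
    # the observed contour on image line v (e.g. a B-spline, cf. section 5).
    p = np.asarray(p, dtype=float)
    rows, rhs = [], []
    for z0 in sections:
        u, v = project(z0, p)
        du_dz, dv_dz = (np.array(project(z0 + eps, p)) - (u, v)) / eps
        row = np.zeros(3)
        for j in range(3):
            dp = np.zeros(3); dp[j] = eps
            du_dp, dv_dp = (np.array(project(z0, p + dp)) - (u, v)) / eps
            # eliminate the section variation so that the image line stays fixed
            row[j] = du_dp - dv_dp * du_dz / dv_dz
        rows.append(row)
        rhs.append(U(v) - u)
    delta_p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return p + delta_p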

4.2 Comparison with Kriegman & Ponce's work


Our method can be compared with Kriegman & Ponce's ([KRI-90]), which produces similar results
under the same assumptions.
Kriegman & Ponce work with S.H.G.C. and do not compute any projection axis. They use elimination
theory to precompute the implicit equation of the limb projection, parametrized by the attitude parameter
vector p, and then run an iterative minimization scheme with another choice of the matching rule. This
choice also involves a one-dimensional minimization for each point.
The fact that our method only applies to solids of revolution and not to general S.H.G.C. seems to
be the main weakness of our work when compared to Kriegman & Ponce's (however, all the objects
presented in their experiments are also solids of revolution [KRI-90]). The main advantage of our method
is its ability to deal with objects whose generating curve is very complex (this does not seem possible in
Kriegman & Ponce's work, due to the use of elimination theory). Also, it is clear that in our approach,
at each step of the iterative process, the limb projection must be recomputed, which is not necessary
for Kriegman & Ponce's, where the parametrized equation is at their disposal (but very simplified). This fact
is counter-balanced by the triviality of the rule of association, which makes the iterations easier.

5 IMPLEMENTATION
5.1 Process
The contours issued from an image of an object of revolution are first extracted. The axis projection
is computed by procedure A1; it enables the computation of a rotation R0 that must be applied to the camera
frame to obtain a virtual image in which the contours are symmetrical.
A rough estimation of the parameter vector p is done by procedure A2 in this virtual image; this
vector is used to compute the initial contour C0 of our iterative scheme.
We think that least squares B-spline curve fitting provides a compact and faithful representation
for smooth contours (see also [LAU-87]). Thus the right part of the image contour I (relative to the axis
projection) is approximated by a B-spline curve (the left part could have been used to reconstruct
occluded parts or to enhance accuracy).
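With SciPy, such a least squares B-spline approximation of the right-hand contour, providing the function U(v) used by the matching rule, could look as follows (a sketch only; the sample arrays, the smoothing factor and the function name are assumptions):

from scipy.interpolate import splrep, splev

def fit_contour_spline(v_pts, u_pts, smoothing=1.0):
    # Least squares smoothing B-spline u = U(v); v_pts must be strictly increasing.
    tck = splrep(v_pts, u_pts, s=smoothing)
    return lambda v: splev(v, tck)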

The calculated contour Ck at iteration k is computed using the generating curve equation and equation
(1) of section 2; internal loops of this contour are pruned by keeping, for a given image line, the farthest
point from the axis projection. This operation avoids ambiguous associations in computing the distance
F(z0).
Ck is then sampled to use only 20% of the contour points; for each of these points the derivatives
involved in equation (3) and the corresponding F(z0) are computed (F(z0) is obtained by B-spline
interpolation). The resulting system is then solved by classic Gauss elimination, which gives the
correction vector Δp; p + Δp becomes the initial parameter vector for the next iteration.
The average D/(number of points found) is used as the stopping error criterion.
Once convergence is reached, R0ᵗ is applied to obtain the localization in the original frame.

5.2 Experimental results


Results for a vase are presented below. They show the robustness of the algorithm. The initial
localization has been chosen farther away than the one given by (A2).

Figure 3: Visualization of the different steps of the process:
a: the brightness image, b: the found axis projection (A1),
c: the initial location in the virtual image (A2), d: successive iterations C0..C10,
e: the found location in the virtual image with (A3), f: localization in the original frame.

6 CONCLUSION
We have provided a complete method able to find the pose of objects of revolution with great accuracy.
The symmetry properties that result both from the geometric characteristics of such an object and from the
perspective view assumption have given useful cues for the localisation problem.
Positioning an object of revolution requires finding five localization parameters. Computing the axis projection
is an elegant way to get rid of two of them. A rough estimation of the three remaining ones can easily be
done, using one among the many algorithms dedicated to this task. An iterative process can then be applied
to improve these parameters.
The iterative process presented in section 4 is both simple and robust; it is able to deal with B-Spline
curves representation. Elimination theory and implicit contour equations (comparison with Kriegman &
Ponce's work) are avoided.
A contour point observed at a given iteration results from the projection of a given section. For a fixed
image line, this section changes between two iterations. We have taken this variation of section into account
in our way of pairing points between two successive iterations; such an operation avoids the main stability
problems met in similar processes.
This precise localisation can be used as a preliminary step in modelling objects of revolution, as soon as
a partial model is known (see [LAV-90]).
An extension of our algorithm to S.H.G.C. is forecast and will be soon available.

References:
[DHO-89] M. Dhome, M. Richetin, J.T. Lapresté & G. Rives, "Determination of the attitude of 3-D objects
from a single perspective view", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, no. 12,
pp. 1265-1278, December 1989.
[DHO-90] M. Dhome, J.T. Lapresté, G. Rives & M. Richetin, "Spatial localization of modelled objects of
revolution in monocular perspective vision", in Proc. European Conf. Comput. Vision, pp. 475-485,
Antibes, France, April 1990.
[GLA-91] R. Glachet, M. Dhome, J.T. Lapresté, "Finding the perspective projection of an axis of revolution",
Pattern Recognition Letters, vol. 12, pp. 693-700, October 1991.
[GLA-92] R. Glachet, M. Dhome, J.T. Lapresté, "Finding the Pose of an Object of Revolution", technical
report R.92, January 1992.
[KRI-90] D.J. Kriegman & J. Ponce, "On recognizing and positioning curved 3D objects from image contours",
IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-12, no. 12, pp. 1127-1137, December 1990.
[HUT-87] D.P. Huttenlocher & S. Ullman, "Object recognition using alignment", in Proc. Int. Conf. Comput.
Vision, pp. 102-111, London, England, June 1987.
[LAU-87] P.J. Laurent, "Courbes ouvertes ou fermées par B-splines régularisées", rapport de recherche RR
652-M, Informatique et Mathématiques Appliquées de Grenoble, France, March 1987. (In French)
[LAV-90] J.M. Lavest, R. Glachet, M. Dhome, J.T. Lapresté, "Objects of revolution: Reconstruction using
monocular vision", submitted to IEEE Trans. Pattern Anal. Machine Intell., 1991.
[LOW-85] D.G. Lowe, Perceptual Organization and Visual Recognition, MA: Kluwer, ch. 7, 1985.
[MAR-82] D. Marr, Vision, ed. Freeman, pp. 215-233, 1982.
[PON-89] J. Ponce, D. Chelberg & W. Mann, "Invariant properties of straight homogeneous generalized
cylinders and their contours", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, no. 9, pp.
951-966, September 1989.
[PON-89b] J. Ponce & D.J. Kriegman, "On recognizing and positioning curved 3D objects from image contours",
in Proc. IEEE Workshop on Interpretation of 3D Scenes, Austin, Texas, November 1989, pp. 61-67.
[RIC-85] M. Richetin, J.T. Lapresté & M. Dhome, "Recognition of conics in contours using their geometrical
properties", in Proc. Conf. Comput. Vision Pattern Recognition, pp. 464-469, San Francisco,
California, June 1985.
[RIC-91] M. Richetin, M. Dhome, J.T. Lapresté & G. Rives, "Inverse perspective transform using zero-
curvature contour points: Application to the localization of some generalized cylinders from a single
view", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-13, no. 2, pp. 185-192, February 1991.
[ULU-90] F. Ulupinar & R. Nevatia, "Shape from contour: Straight homogeneous generalized cones", in Proc.
Int. Conf. Comput. Vision, pp. 582-586, Osaka, Japan, December 1990.
Extraction of Line Drawings from Gray Value Images
by Non-Local Analysis of Edge Element Structures

M. Otte 1 and H.-H. Nagel 1,2

1 Institut für Algorithmen und Kognitive Systeme,
Fakultät für Informatik der Universität Karlsruhe (TH),
Postfach 6980, W-7500 Karlsruhe 1, Germany, E-mail: otte@ira.uka.de
2 Fraunhofer-Institut für Informations- und Datenverarbeitung (IITB), Karlsruhe

Abstract. Edge elements defined as maxima of the gradient magnitude in


gradient direction of a Gaussian-smoothed image are usually thresholded to
suppress edge elements due to noise. In low contrast image regions, thresholding
may also suppress edge elements which are part of a significant image structure
and may thus result in the fragmentation or total loss of such structures.
Based on an exhaustive categorization of edge element configurations in 5x5
pixel environments (about 34 million cases), straight and curved line structures
are enhanced in edge element pictures without applying a uniform gradient
magnitude thresholding. In contrast to the traditional pixel oriented gradient
magnitude thresholding approach, we check for chaining of edge elements based
mainly on the gradient direction. Using second moments of the change rates
of gradient properties, we either reject edge element chains as noisy or accept
them as a structure underlying the original image.
The approach reported here has been designed deliberately for real-time exe-
cution. Experience with this approach applied to real world images demonstrates
a significant improvement compared to a uniform thresholding approach.

1 Introduction
Edge element detection is a basic task in computer vision. The results of a subsequent
chaining of neighbouring edge elements are used for a symbolic description. Since the
performance of high level vision tasks based on edge pictures depends on the quality of
those edge pictures, it is important to obtain good edge pictures. This implies minimizing
the number of edge elements due to noise and detecting edge elements along grey-level
transitions even in low contrast image regions.
To obtain good edge pictures one can enhance the original picture and choose an
appropriate edge detector or enhance the detected edge picture.
If the edge detection and enhancement process is part of a real-time application, it
must be guaranteed that the output of the edge detection and enhancement process has
a constant delay with respect to the input data. The approach developed in this article
satisfies this real-time constraint.
The input of our algorithm is a picture of edge elements represented by their location,
gradient magnitude, and orientation. A brief review of edge detection approaches is fol-
lowed by a discussion of the thresholding problem and other approaches concerned with
the enhancement of detected edge element pictures based on locally observed properties.
Many authors have been concerned with developing a robust edge detector. [John-
son 90] expects a better response of the edge detector in image regions of low contrast
from contrast-dependent edge detection. [Canny 86] suggests smoothing of the original
image with Gaussians whose scale parameters depend on the signal-to-noise ratio. [Korn
88] uses normalized first directional derivatives of a Gaussian to compute the gradient
magnitude and orientation. Edge elements are selected if they correspond to a magnitude
maximum in gradient direction.
Good noise reduction and high positional accuracy can be achieved by coarse-to-fine
tracking of edge elements in scale-space approaches (e.g. [Witkin 83], [Bergholm

87] and [Williams & Shah 90]). The scale-space technique presented in [Perona & Malik
90] should prevent edge shifting by using different coefficients to encourage intraregion
smoothing in preference to interregion smoothing.
To reduce the number of noisy edge elements many approaches employ gradient mag-
nitude thresholding. This often has the consequence that edge elements in regions of low
contrast can no longer be detected.
[Abdou & Pratt 79] suggest computation of a threshold by applying the Bayes-rule.
[Zuniga & Haralick 88] and [Haddon 88] assume Gaussian noise with known parameters
and determine a suitable threshold by evaluating mean and standard deviation of the
noise distribution. The hysteresis thresholding of [Canny 86] uses two thresholds. Edge
elements with gradient magnitude greater than the higher threshold value are considered
to be true edge elements. All connected edge elements which contain at least one true
edge element and whose edge elements lie above the lower threshold are also taken to
represent true edge elements. [Kundu & Pal 86] select different thresholds in image regions
of different brightness.
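The hysteresis rule summarized above can be written generically as a region growing from strong edge elements (a sketch of the idea, not the implementation of [Canny 86]; names and the 8-connectivity choice are assumptions):

import numpy as np
from collections import deque

def hysteresis(mag, is_edge, t_low, t_high):
    # Keep edge elements above t_high, plus all edge elements above t_low that
    # are 8-connected to one of them.
    strong = is_edge & (mag > t_high)
    weak = is_edge & (mag > t_low)
    out = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy, xx = y + dy, x + dx
                if (0 <= yy < mag.shape[0] and 0 <= xx < mag.shape[1]
                        and weak[yy, xx] and not out[yy, xx]):
                    out[yy, xx] = True
                    queue.append((yy, xx))
    return out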
Figure 5 shows two parallelepipeds and the edge picture containing all maxima of
the gradient magnitude in gradient direction. One can recognize the edge elements of
the objects, but the entire edge picture is swamped by edge elements due to noise.
Thresholding the gradient magnitude at a value of 4 (third image of Figure 5) results in
an almost noise-free edge picture but also in breaking up contours.
The goal of having simultaneously a minimum number of edge elements due to noise
and unbroken contours cannot be achieved with one uniform threshold value.
The publications discussed in the following deal with the enhancement of detected
edge pictures. [Zucker et al. 77] and [Hancock & Kittler 90] are using relaxation algorithms
which are of iterative nature and therefore not suitable for real-time applications. An
overview of relaxation algorithms is given in [Kittler & Illingworth 85].
In the article of [Sakai et al. 69], the enhancement of edge pictures is effected by
consideration of orientational information. Their first step consists in thresholding the
gradient magnitude and evaluating the direction of steepest descent in order to preserve
edge elements in local environments around open edge segments: the gradient magnitudes
of the neighbouring pixels are added up depending on their gradient direction and the
expected direction of the new line segment. If this sum exceeds another threshold, the
current pixel is considered to represent the location of an edge element. Since the local
environment examined here consists of only 3x3 pixels and all pixels passing the second
threshold test are marked, new edge segments may have widths of more than one pixel.
A heuristic approach is proposed by [McKee & Aggarwal 75]. An open edge segment
is extended up to eight edge elements in the direction of the end of the open segment. If
this does not close a gap, the extended edge segment is shortened by twelve edge elements
and a new attempt at an extension is started. The extension/shortening process will be
finished if a gap is closed or the entire segment vanishes.
[Chen & Siy 87] increase the contrast of local image regions around open edge seg-
ments in an iterative manner to make the edge detection process more sensitive. [Haralick
& Lee 90] compute the probability of an edge element for each pixel. An edge element
evaluation function with feedback is used to estimate the parameters for an adaptive
edge detection process.
[Hayden et al. 87] try to close gaps by observation of image sequences. Gaps can
only be closed by their approach if a corresponding closed segment can be found in the
previous and in the following image.
In the article of [Canning et al. 88] each pixel of the edge picture is observed within
a 3x3 environment with different gradient magnitude thresholds. This leads to sets of
edge element configurations for each pixel. Neighbouring pixels must have overlapping
masks of the resulting edge element configurations to close gaps. The problems with such
an approach are that the assignment between neighbouring edge elements is not unique,
that the thinning does not guarantee a width of one pixel, and the computational cost of
the resulting combinatoric explosion.

[Deriche et al. 88] suggest to threshold with a high gradient magnitude to suppress all
noisy edge elements and to close the gaps afterwards. The edge picture is scanned until
an open edge segment is found. If an open segment is recognized the filling algorithm
will start. To build a closing path they consider the three topological neighbours of' the
open end. The new candidates for edge elements are successively extended by their three
topological neighbours. This approach results in a tree with the open end edge element
of the original segment as a root and whose branches represent all possible closing paths.
This recursive algorithm stops if the length of the path exceeds a maximum length or the
path encounters an edge element. If closing paths exist then the best path is computed
based on an evaluation of length and gradient magnitudes along it. This approach has
the disadvantage that the time to compute all closing paths grows exponentially in the
gap length. Moreover, closing paths may be circular and cross themselves.
The described enhancement approaches for detected edge pictures require almost
always high computational costs. Their results are not always convincing, moreover,
unwanted new edge segments may arise.
A real-time enhancement of edge pictures is described in [Otte 90]. A window of size
5x5 is moved line by line over the entire edge picture. The edge or non-edge element in the
center of the window will be changed or left unchanged depending on the edge element
configuration of this local environment. Each point of the window represents an edge or
non-edge element, which implies 2^25 ≈ 34 million different combinatoric possibilities. To
be able to describe all different configurations, [Otte 90] employs the predicate calculus
for low level vision applications. By means of this tool it is possible to give a formal model
which divides all combinations uniquely and completely into four equivalence classes. One
class leaves an edge element in the center unchanged, the second class eliminates an edge
element in the center, the third class adds an edge element in the center and the last
class leaves a non-edge element unchanged. This approach makes it possible to fill gaps
up to two pixels long in fragmented lines and vertices and to thin edge segments whose
width exceeds one pixel.
Many authors have worked on thinning algorithms, e.g., [Wang & Zhang 89], [Zhang
& Suen 84] or [Topa & Schalkoff 89]. Their thinning algorithms transform binary patterns of
several pixel width into skeletons. In edge pictures with edge elements defined as gradient
magnitude maxima in gradient direction, the width of edge segments lies in the range of
one to three. Therefore, the thinning of edge pictures can be done on the basis of edge
element configurations within a 5x5 environment.
The thinning process can be improved significantly if the gradient magnitude is taken
into account additionally as done by [Otte & Nagel 91a] who, moreover, describe a real-
time process for filling gaps of straight and curved lines which considers the gradient
magnitude and orientation within a 7x7-environment. An open edge segment is extended
by comparing the gradient orientation and the estimated normal direction of an expected
structure.
The algorithm described in Section 3 considers all edge elements without thresholding
the gradient magnitude. Edge elements thinned by the approach reported in [Otte &
Nagel 91a] are chained to obtain edge element chains. The properties of these edge
element chains allow us to distinguish between edge element chains due to noise and those due
to structures of the original image. The combination of observing edge element
chains instead of single edge elements with strictly local decision rules allows much more
global aspects to be considered without losing the real-time execution capability.
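The chain test itself is only sketched in the abstract (second moments of the change rates of gradient properties); purely as an illustration of what such a decision rule could look like, with the statistic and threshold being assumptions rather than the authors' actual criterion:

import numpy as np

def accept_chain(gradient_dirs, threshold):
    # Accept an edge element chain as image structure if the second moment of
    # the change rate of the gradient direction along the chain stays small;
    # chains due to noise show erratic direction changes.
    d = np.unwrap(np.asarray(gradient_dirs, dtype=float))
    change_rate = np.diff(d)
    return np.mean(change_rate ** 2) < threshold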

2 Edge enhancement based on local structures

This section gives a brief overview of the work in [Otte 90] and [Otte & Nagel 91a].
To describe an edge enhancement process based on local structures, we have to explain
what we call a local structure. A local structure is a straight or curved line, a corner,
or a vertex. Structures in the form of lines are, e.g., straight line, ellipse, or circle segments.

Straight lines as in Figure 1(a) and (b) are called major straight lines, lines like (c) are
named minor straight lines. Structures in the form of (d) and (e) are called general lines
or simply lines. Each point of a major or minor straight line and of (general) lines, with
the exception of their endpoint(s), has exactly two neighbouring points.
Figure 1 also demonstrates some examples of corners, where we distinguish type (f)
as acute corners, (g) represents an orthogonal and (h) an obtuse corner, and (i) shows a
rounded corner, which simultaneously obeys the line criterion. Vertices can be simple
vertices (j), crossings (k) and (l), or multiple vertices (m). Crossings may be considered
as a combination of two simple vertices, whereas multiple vertices are characterised as
two or more vertices appearing in a close neighbourhood of each other. The structure
"endpoint" is an open end of a line.

Fig. 1. Local structures divided into major straight lines (a) and (b), minor straight lines
like (c) and (general) lines (d) and (e), acute corner (f), orthogonal corner (g), obtuse
corner (h) and rounded corner (i), vertices in form of T-vertices (j), crossings built from
two T-vertices (k) or from Y- and W-vertices (l) and multiple vertices (m).
The structures presented above - including analogies obtained by rotation and reflec-
tion - are the basis of our edge picture enhancement approach. The goal of the enhancement
is the preservation and restoration of disturbed structures and the elimination of the
influence of noise.

2.1 Equivalence classes
The enhancement of edge pictures in [Otte 90] is based on an edge element configuration
within a window of 5x5 pixels which is shifted over the edge picture. It yields 2^25 ≈ 34
million different edge element configurations. In order to obtain a unique, complete and
consistent description it is necessary to divide all configurations into four equivalence
classes. One class leaves an edge element in the center unchanged, the second class
eliminates edge elements, the third class adds an edge element in the center and the last
class leaves a non-edge element unchanged. [Otte 90] introduced predicate calculus for
low level vision applications in order to provide a tool whereby each of the 34 million
masks of edge element configurations is assigned to exactly one equivalence class.

Fig. 2. Layers of the model.

The construction of the presented structures is subdivided into seven layers - see
Figure 2. First of all, each point of a 5x5-mask needs to be labeled. Within a mask, a
point represents either an edge or a non-edge
element: we call this fact the state of a point. Single mask points together with their states
are called objects. On layer 0, the mask points do not have any relationships between
them, but they are the basis for the construction of composite objects by introducing
relationships, for example neighbourhoods, in the form of rules. Objects of the lowest
layer are also called basic objects.
Objects of layer n > 1 are composed of objects from lower layers, containing at least
one object of layer n - 1. Therefore, objects of layer n > 1 consist of at least two basic
objects. Objects with similar properties are combined into sets, for example the set of
orthogonal edges.
The model of layers in Figure 2 shows the sets which are used to build structures up to
the four equivalence classes, where objects at the different layers represent substructures
and disturbances. A detailed explanation is given in [Otte 90]. Here we can give only a
coarse overview.
Layer 0 contains the basic object set G of all edge elements and non-edge elements
within a 5x5 mask. This set is subdivided into regional sets Qw, Qn, Qo and Qs which
contain edge elements in the western, northern, eastern 3 and southern part of the mask.
The set Mp contains all edge elements and NMp all non-edge elements.
Layer 1 contains connections from the center to the boundary of the mask in form
of straight connection set Rs, curved Ra, orthogonal Ro and long connections in the set
R1. The sets Nh, Nv, Ns and Nf contain horizontal, vertical, diagonally increasing and
diagonally decreasing neighbours, respectively.
With layer 2, the first structures are described. The sets Hgh, Hgv, Hgf, and Hgs
represent major straight lines and the sets Nghf, Nghs, Ngvf, and Ngvs minor straight
lines, respectively, based on their orientation. The set E combines orthogonal corners and
the set Np all neighbours of a given mask point p.
On layer 3, the first disturbed edge element structures are explicitly modeled. For
example, the set Iso contains objects of isolated edge elements which represent an edge
element with no neighbouring edge element within the 5x5 mask. The sets Ph and Pn
consist of potential major and minor straight lines which can be related to straight lines
by filling gaps. Due to lack of space, we cannot explain all sets.
To show the efficacy of the predicate calculus, the set E of orthogonal corners is explained
in more detail:

E := {x ∈ (Rs ⊕ Rs) | e(x)}
e(x) :⇔ (∃y ∈ (Rs ∩ Nh) ∃z ∈ (Rs ∩ Nv) : x = y ⊕ z) ∨ (∃y ∈ (Rs ∩ Nf) ∃z ∈ (Rs ∩ Ns) : x = y ⊕ z)

An object x is called an orthogonal corner if it satisfies the predicate e(x). e(x) is true
if x consists (expressed as x = y ⊕ z) of two straight connections y, z ∈ Rs. The objects y
and z must either lie in the set of horizontal and vertical or in the set of increasing and
decreasing neighbours. The exact definitions of the symbols, sets, and predicates used
here are given in [Otte 90].
With the four equivalence classes of layer 6 it is possible to fill gaps of up to two
pixels as well as to thin lines of more than one pixel width and to reduce the number of
noisy edge elements.
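The predicate-calculus assignment of all 2^25 configurations to the four classes cannot be reproduced here, but the overall mechanism of the enhancement step can be sketched as follows. This is a hypothetical illustration only; equivalence_class stands in for the precomputed classification of [Otte 90] and is not defined here.

import numpy as np

KEEP_EDGE, ERASE_EDGE, ADD_EDGE, KEEP_NON_EDGE = range(4)

def enhance(edge_picture, equivalence_class):
    """edge_picture: 2D binary array; equivalence_class: callable mapping a
    25-bit configuration code to one of the four classes (assumed, not given)."""
    out = edge_picture.copy()
    h, w = edge_picture.shape
    weights = 1 << np.arange(25)                    # bit weights of the 5x5 window
    for r in range(2, h - 2):
        for c in range(2, w - 2):
            window = edge_picture[r - 2:r + 3, c - 2:c + 3].ravel()
            code = int(np.dot(window, weights))     # encode the configuration
            cls = equivalence_class(code)
            if cls == ERASE_EDGE:
                out[r, c] = 0
            elif cls == ADD_EDGE:
                out[r, c] = 1
            # KEEP_EDGE / KEEP_NON_EDGE leave the center pixel unchanged
    return out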
One disadvantage of the thinning process applied to binary edge pictures is that it
cannot be decided which edge element of a line more than one pixel wide should be
chosen in order to obtain edge elements located as closely as possible to the expected
edge line. This problem is solved in [Otte & Nagel 91a] by taking the gradient magnitude
into account. If an edge element configuration of a 5x5 window corresponds to an object of
the equivalence class "erase edge element", a subtask is activated which decides whether
the edge element should be preserved or not. The edge element in the center of the
window is preserved if the following conditions are satisfied:
1. There exists a north-south connection and an edge element to the right of the center
   or there exists a west-east connection and an edge element below the center.
2. There exists a north-south connection and an edge element to the right of the center
   and the gradient magnitude in the center is greater than or equal to the gradient
   magnitude of the edge element right of the center.
3. There exists a west-east connection and an edge element below the center and the
   gradient magnitude in the center is greater than or equal to the gradient magnitude
   of the edge element below the center.

3 Symbols refer to abbreviations of German terms, e.g. the index "o" in Qo stands for "Ost".

Fig. 3. Magnification of a clip of the porterhouse image of Figure 4 including the bonnet
of the car, the resulting edge picture, and the edge picture after thinning.

Figure 3 shows a part of the edge picture of Figure 4 at the location of the bonnet of
the car and the result after the thinning algorithm from [Otte & Nagel 91a].
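One possible reading of these conditions (Conditions 2 and 3 each already contain the corresponding alternative of Condition 1, so the center is preserved when at least one of them holds) can be sketched as follows; the two connection predicates belong to the predicate calculus of [Otte & Nagel 91a] and are only assumed here.

def preserve_center(window_edges, window_grad, has_ns_connection, has_we_connection):
    """window_edges: 5x5 binary edge map with the center at (2, 2);
    window_grad: 5x5 gradient magnitudes; the two predicates are assumptions."""
    g_center = window_grad[2][2]
    ns_right = has_ns_connection(window_edges) and window_edges[2][3] == 1
    we_below = has_we_connection(window_edges) and window_edges[3][2] == 1
    # Condition 2: north-south connection, edge element to the right, and the
    # center gradient dominates it; Condition 3: the west-east analogue below.
    keep_ns = ns_right and g_center >= window_grad[2][3]
    keep_we = we_below and g_center >= window_grad[3][2]
    return keep_ns or keep_we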

3 Extraction of edge element chains


The enhancement of edge pictures described in [Otte & Nagel 91a] is based on the
consideration of the gradient magnitude and orientation. The results encouraged us to
investigate how the gradient orientation may be used to eliminate noisy edge elements
without thresholding the gradient magnitude.
Edge elements thinned by the algorithm of [Otte & Nagel 91a] are linked to edge
element chains. Observation of edge element chains instead of single edge elements in
combination with strictly local decision rules allows much more global aspects to be considered.
A real-time algorithm for the chaining step is described in [Otte & Nagel 91b]. The
chaining process distinguishes between edge elements as part of edge element chains and
edge elements as centers of vertices. To distinguish between edge element chains due to
noise and chains due to structures of the original image, the following information about
edge element chains needs to be determined:
- Length of a chain: The length lk of an edge element chain k is equal to the number
  of pairwise different edge elements belonging to the same edge element chain.
- Valid: Valid is a predicate which indicates whether the observed chain is valid or not
  with respect to the following Definition 3.1 (see below).
- Sum of gradient magnitudes: The sum of gradient magnitudes Sk is the sum of the
  gradient magnitudes of all edge elements of the same edge element chain. Let bk(i)
  be the gradient magnitude of the i-th edge element of an edge element chain k of
  length lk:
      Sk = Σ_{i=1..lk} bk(i)
- Orientational mean: The orientational mean Ek(dk) of an edge element chain k is
  the mean of the absolute differences dk of neighbouring gradient orientations. The
  absolute difference is taken because we do not want to observe only elliptical or
  circular segments but also edge element chains with points of inflection. Let the
  length of a chain k be lk and let θk(i) with 1 ≤ i ≤ lk be the gradient orientation of
  the i-th edge element:
      dk(i) = |θk(i) − θk(i+1)|,    Ek(dk) = (1/(lk − 1)) Σ_{i=1..lk−1} dk(i)
- Orientational standard deviation: The orientational standard deviation sk of the
  absolute differences of neighbouring gradient orientations of an edge element chain
  k of length lk ≥ 2 is:
      sk = sqrt( (1/(lk − 1)) Σ_{i=1..lk−1} (dk(i) − Ek(dk))² )
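Written out directly (following the reconstruction above, in particular the 1/(lk − 1) normalization, and ignoring any wrap-around of the orientation differences), the chain statistics can be computed as in this sketch:

import math

def chain_statistics(magnitudes, orientations):
    """magnitudes: gradient magnitudes bk(i); orientations: gradient
    orientations thetak(i) along one edge element chain."""
    l_k = len(magnitudes)                       # chain length lk
    S_k = sum(magnitudes)                       # sum of gradient magnitudes Sk
    if l_k < 2:
        return l_k, S_k, None, None             # mean and deviation undefined
    d_k = [abs(orientations[i] - orientations[i + 1]) for i in range(l_k - 1)]
    E_k = sum(d_k) / (l_k - 1)                  # orientational mean Ek(dk)
    s_k = math.sqrt(sum((d - E_k) ** 2 for d in d_k) / (l_k - 1))
    return l_k, S_k, E_k, s_k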
The data type "vertex" contains references to all adjacent edge element chains and it
possesses a predicate named valid, which indicates whether a vertex should be removed
or not.
If one examines the calculated information of noisy edge element chains, the following
observation can be made:
- Noisy edge element chains are either short or possess a large orientational standard
deviation in comparison with edge element chains due to real structures.
To distinguish between noise and structural edge element chains, the length, the sum of
gradient magnitudes and the orientational standard deviation have to be compared with
the desired values. This leads to the following definition:
Definition 3.1 (Validity of edge element chains) Let the observed edge element
chain be k with length lk. Sk is the sum of gradient magnitudes, Ek(dk) is the orientational
mean and sk is the orientational standard deviation. τ1 is the desired minimum value of
the average edge element chain gradient magnitude, τ2 is the desired minimum length
of edge element chains and τ3 is the desired maximum orientational standard deviation.
The observed edge element chain is valid if the following conditions are fulfilled:

    lk = 1:        Sk/lk ≥ τ1                                                    (1)

    2 ≤ lk ≤ 4:    Sk/lk ≥ τ1  and  sk ≤ τ3                                      (2)

    lk ≥ 5:        (Sk/(lk·τ1))² · Sk/(τ1·τ2) = Sk³/(τ1³·τ2·lk²) ≥ sk/τ3          (3)
Condition 1 is needed for edge element chains containing exactly one edge element. They
have an adjacent vertex, because isolated edge elements are removed by the thinning
process of [Otte & Nagel 91a]. For those candidates, the traditional gradient magnitude
threshold must be used.
It is necessary to distinguish between edge element chains with up to four edge elements
and chains with more than four edge elements, because otherwise the computation
of the orientational mean and standard deviation of short edge element chains may take
too few values into account. Condition 2 preserves those edge element chains with an
average gradient magnitude greater than or equal to the desired value and with an
orientational standard deviation below a given threshold.
In Condition 3 the relation between average gradient magnitude and desired minimum
value has more influence due to the quadratic term and the quotient of the sum of
gradient magnitudes divided by the product of the desired minimum values for length and
average gradient magnitude. The product τ1·τ2 gives a desired minimum value for the sum
of gradient magnitudes. The left part of Inequality 3 is a measure for the difference
between a desired and an observed chain. If the left part is greater than or equal to one,
then a greater standard deviation will be allowed, and vice versa. The quadratic term prefers
shorter edge element chains of high contrast compared to longer edge element chains
of lower contrast. Condition 3 considers more than one parameter with different
weights. Therefore it is better to speak of desired values in a control theoretical sense
instead of thresholds.
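Assuming the reconstruction of Condition 3 given above, the validity test of Definition 3.1 can be sketched as follows.

def chain_is_valid(l_k, S_k, s_k, tau1, tau2, tau3):
    """Validity test of Definition 3.1 (sketch); tau1, tau2, tau3 are the
    desired values for average magnitude, length and standard deviation."""
    avg = S_k / l_k                             # average gradient magnitude
    if l_k == 1:
        return avg >= tau1                      # Condition (1)
    if l_k <= 4:
        return avg >= tau1 and s_k <= tau3      # Condition (2)
    # Condition (3): (avg/tau1)^2 * Sk/(tau1*tau2) >= sk/tau3
    return (avg / tau1) ** 2 * (S_k / (tau1 * tau2)) >= s_k / tau3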
Definition 3.2 (Validity of vertices) A vertex v is valid if the vertex is connected
to at least one valid adjacent edge element chain or if it is linked to a neighbouring vertex
marked as valid.

4 Results

This section illustrates results of the validation process of edge element chains as defined
in the last section. The next four pictures are part of an image sequence taken from
[Koller et al. 91]. For the current version, the desired value of the average gradient magnitude
is τ1 = 6, the desired length is τ2 = 23 and the desired orientational standard deviation
is set to τ3 = π. The results are compared with the corresponding edge pictures with a
gradient magnitude threshold equal to τ1.
Comparing the two edge images of Figure 4, one can see that parts of the curb below
the entrance building and the right border of the road in front of the barrier are better
preserved by applying the new approach. Furthermore, we observe fewer edge elements due
to noise with the chain based edge enhancement process than by thresholding (e.g. the
roof of the entrance building or the road surface).

Fig. 4. Comparison of edge pictures of the porterhouse (left) thresholded at 6 (middle)
and our results (right).
Figure 5 shows two parallelepipeds, the edge image without thresholding
and with gradient magnitude threshold 4, and on the right side the result of the new
approach with average gradient magnitude τ1 = 4.

Fig. 5. Two parallelepipeds and the corresponding edge pictures without and with gra-
dient magnitude thresholded at 4 and the edge picture of the new approach.
The contours of the two parallelepipeds are much better preserved by the new al-
gorithm with a simultaneous weakening of double edge lines due to signal overshooting
of the video camera. But it has not yet been possible to preserve the entire top right
horizontal edge segment of the left parallelepiped.
We have deliberately shown this last example in order to demonstrate both the pos-
sibilities and the limits of the current version of our approach. Based on the experience
accumulated throughout the investigations which yielded the results presented here, we
are confident that we will be able to improve this approach further.

Acknowledgement
This work was supported in part by the Basic Research Action INSIGHT of the European
Community. We thank D. Koller and V. Gengenbach for providing us with the grey-level
images appearing in Figures 4 and 5, respectively. We also thank K. Daniilidis for his
comments on a draft version of this contribution.

References
[Abdou & Pratt 79] I.E. Abdou, W.K. Pratt, Qualitative design and evaluation of enhance-
ment/thresholding edge detectors, Proceedings of the IEEE 67 (1979) 753-763.
[Bergholm 87] F. Bergholm, Edge focusing, IEEE Trans. Pattern Analysis and Machine Intel-
ligence PAMI-9 (1987) 726-741.
[Canning et al. 88] J. Canning, J.J. Kim, N. Netanyahu, A. Rosenfeld, Symbolic pixel labeling
for curvilinear feature detection, Pattern Recognition Letters 8 (1988) 299-310.
[Canny 86] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal-
ysis and Machine Intelligence PAMI-8 (1986) 679-698.
[Chen & Siy 87] B.D. Chen, P. Siy, Forward/backward contour tracing with feedback, 1EEE
Trans. Pattern Analysis and Machine Intelligence PAMI-9 (1987) 438-446.
[Deriche et al. 88] R. Deriche, J.P. Cocquerez, G. Almouzny, An efficient method to build early
image description, Proc. Int. Conf. on Pattern Recognition, Rome, Italy, Nov. 14-17, 1988,
pp. 588-590.
[Haddon 88] J. Haddon, Generalized threshold selection for edge detection, Pattern Recognition
21 (1988) 195-203.
[Hancock & Kittler 90] E.R. Hancock, J. Kittler, Edge labeling using dictionary-based relax-
ation, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-12 (1990) 165-181.
[Haralick & Lee 90] R.M. Haralick, J. Lee, Context dependent edge detection and evaluation,
Pattern Recognition 23 (1990) 1-19.
[Hayden et al. 87] C.H. Hayden, R.C. Gonzales, A. Ploysongsang, A temporal edge-based image
segmentor, Pattern Recognition 20 (1987) 281-290.
[Johnson 90] R.P. Johnson, Contrast based edge detection, Pattern Recognition23 (1990) 311-
318.
[Kittler & Illingworth 85] J. Kittler, J. Illingworth, Relaxation labelling algorithms - a review,
Image and Vision Computing 3 (1985) 206-216.
[Koller et al. 91] D. Koller, N. Heinze, H.-H. Nagel, Algorithmic characterization of vehicle tra-
jectories from image sequences by motion verbs, Proc. IEEE Conf. Computer Vision and
Pattern Recognition, Lahaina, Maui, Hawaii, June 3-6, 1991, pp. 90-95.
[Korn 88] A.F. Korn, Toward a Symbolic Representation of Intensity Changes in Images, IEEE
Trans. Pattern Analysis and Machine Intelligence PAMI-10 (1988) 610-625.
[Kundu & Pal 86] M.K. Kundu, S.K. Pal, Thresholding for Edge Detection Using Human Psy-
chovisual Phenomena, Pattern Recognition Letters 4 (1986) 433-441.
[McKee & Aggarwal 75] J.W. McKee, J.K. Aggarwal, Finding edges of the surface of 3-D curved
objects by computer, Pattern Recognition 7 (1975) 25-52.
[Otte 90] M. Otte, Entwicklung eines Verfahrens zur schnellen Verarbeitung von Kanten-
element-Ketten, Diplomarbeit, Institut für Algorithmen und Kognitive Systeme, Fakultät
für Informatik der Universität Karlsruhe (TH), Karlsruhe, Deutschland, Mai 1990.
[Otte & Nagel 91a] M. Otte, H.-H. Nagel, Prädikatenlogik als Grundlage für eine videoschnelle
Kantenverbesserung, Interner Bericht, Institut für Algorithmen und Kognitive Systeme,
Fakultät für Informatik der Universität Karlsruhe (TH), Karlsruhe, Deutschland, August
1991.
[Otte & Nagel 91b] M. Otte, H.-H. Nagel, Extraktion von Strukturen aus Kantenelementbildern
durch Auswertung von Kantenelementketten, Interner Bericht, Institut für Algorithmen und
Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), Karlsruhe,
Deutschland, September 1991.
[Perona & Malik 90] P. Perona, J. Malik, Scale-space and edge detection using anisotropic dif-
fusion, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-12 (1990) 629-639.
[Sakai et al. 69] T. Sakai, M. Nagao, S. Fujibayashi, Line extraction and pattern detection in a
photograph, Pattern Recognition 1 (1969) 233-248.
[Topa & Schalkoff 89] L.C. Topa, R.J. Schalkoff, Edge Detection and Thinning in Time-Varying
Image Sequences Using Spatio-Temporal Templates, Pattern Recognition 22 (1989) 143-
154.
[Wang & Zhang 89] P.S.P. Wang, Y.Y. Zhang, A Fast and Flexible Thinning Algorithm, IEEE
Transactions on Computers 38 (1989) 741-745.
[Williams & Shah 90] D.J. Williams, M. Shah, Edge contours using multiple scales, Computer
Vision, Graphics, and Image Processing 51 (1990) 256-274.
[Witkin 83] A.P. Witkin, Scale-space filtering, International Joint Conf. Artificial Intelligence,
Karlsruhe, Germany, Aug. 8-12, 1983, pp. 1019-1021.
[Zhang & Suen 84] T.Y. Zhang, C.Y. Suen, A Fast Parallel Algorithm for Thinning Digital
Patterns, Communications of the ACM 27 (1984) 236-239.
[Zucker et al. 77] S.W. Zucker, R.A. Hummel, A. Rosenfeld, An application of relaxation label-
ing to line and curve enhancement, IEEE Trans. on Computers C-26 (1977) 394-403.
[Zuniga & Haralick 88] O. Zuniga, R. Haralick, Gradient threshold selection using the facet
model, Pattern Recognition 21 (1988) 493-503.
A method for the 3D reconstruction of indoor scenes
from monocular images

Paolo Olivieri 1, Maurizio Gatti 1, Marco Straforini 2 and Vincent Torre 2


1 Dipartimento di Informatica e Scienza dell'Informazione
2 Dipartimento di Fisica
Università di Genova, Italy

Abstract. The recovery of the 3D structure of indoor scenes from a


single image is an important goal of machine vision. Therefore, a simple and
reliable solution to this problem will have a great influence on many tasks
in robotics, such as the autonomous navigation of a mobile vehicle in indoor
environments.
This communication describes the recovery, in a reliable and robust way,
of the 3D structure of a corridor and of obstacles from a sequence of images
obtained by a T.V. camera moving through the corridor. The obtained 3D
information can be used to extract the free space in the viewed scene in
order to plan the trajectory of a mobile vehicle. This application is being
worked on at the moment and the results will be illustrated in a future
communication.

1 The recovery of a line-drawing

Fig. 1A illustrates an image of polyhedral objects on a table. Using standard routines


for edge detection it is possible to obtain the edge map illustrated in Fig. 1B. It is useful
to extract straight segments from this map and identify junctions. These features allow
the construction of a line-drawing from which it is simple to obtain 3D information (see
[H1] and [B2]).
The procedure for the recovery of the line-drawing is fully described in previous works
(see [C1]). Fig. 1C shows the results of the first step of the algorithm for the extraction
of the line-drawing. In this elaboration segments are fused together and the junctions
are identified by the symbols L, T, Y and X. In order to obtain a fully-connected line-
drawing, i.e. one whose segments have both ends belonging to identified junctions, it is
possible to delete unconnected segments recursively. When this procedure is applied to
the line-drawing of Fig. 1C, the fully connected line-drawing of Fig. 1D is obtained.
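The recursive deletion of unconnected segments can be sketched as follows. This is a hypothetical illustration, not the procedure of [C1], which works with the labelled junctions; here a segment endpoint simply counts as connected when it is shared with at least one other remaining segment.

from collections import Counter

def prune_to_fully_connected(segments):
    """segments: list of (p, q) endpoint pairs (hashable image points).
    A segment is kept only while both endpoints are shared with another segment."""
    segments = list(segments)
    while True:
        degree = Counter(pt for seg in segments for pt in seg)
        kept = [s for s in segments if degree[s[0]] > 1 and degree[s[1]] > 1]
        if len(kept) == len(segments):          # nothing more to delete
            return kept
        segments = kept                         # deleting may disconnect others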
The algorithm used for the extraction of a line-drawing illustrated in Fig. 1 does
not fully exploit the geometrical constraint present in the scene. The algorithm is rather
general and can also be used for images of rounded objects or with complex surfaces. Many
images of indoor scenes, such as that of Fig. 2A, can be usefully analysed by exploiting
geometrical properties of the scene. By assuming that the viewed scene belongs to a
Legoland world, where objects have planar surfaces, with either parallel or perpendicular
edges, it is possible to make the algorithm for the recovery of line-drawing efficient and
robust. Fig. 2B reproduces the polygonal approximation of the edge map obtained from
Fig. 2A. Figs. 2C and 2D reproduce, respectively, the line-drawing obtained by using
the algorithm previously described and the algorithm making use of the assumption of
a Legoland world. It is evident that the line-drawing of Fig. 2D is more accurate and its
segments and junctions are more correct.

Fig. 1. The recovery of a line-drawing. A: an image of 512x512 pixels acquired with a Panasonic
camera and digitalised with a FG100 Imaging Technology board. B: the segments obtained
with a polygonal approximation of the edges extracted with a Canny filter. C: the line-drawing
with labelled junctions (L, T, Y and X). The thresholds used to merge the segments are: ±5°
(collinearity), 8 pixels (adjacent), 50 pixels (distance). D: the final line-drawing after the recursive
deletion of unconnected segments. The L junctions are detected if two vertices are closer than
7 pixels.

2 Extraction of polygons
Using the line-drawing of Fig. 2D, it is possible to extract maximal simple polygons (see
[S1]), which are the perspective projection on the image of planar surfaces with similar
attitude in space. Each simple polygon may be labelled with a different orientation; this
depends on the attitude in space of the projected planar surfaces. Simple polygons in
images of scenes belonging to Legoland can have, at most, three different orientations.
Fig. 3B shows the polygons extracted from the line-drawing obtained from the image of
Fig. 3A; the three different textures correspond to horizontal, vertical and planar surfaces,
while white regions correspond to complex polygons.

3 Detection of the dimensions of the corridor
The 3D structure of viewed scenes is described by simply using 3D boxes, the largest
box corresponding to the empty corridor and other boxes representing different objects
Fig. 2. The recovery of a line-drawing from an image of Legoland. A: an image of a corridor at


the Dipartimento di Fisica. The viewed scene can be described as belonging to a Legoland world,
where objects' boundaries are straight lines mutually parallel or orthogonal. B: the segments
map. C: the line-drawing obtained with the procedure explained in the text and illustrated in
Fig. 1. D: the line-drawing obtained making use of the assumption of a Legoland world. The
parameters used are the same as the ones in Fig. 1.

or obstacles. The algorithm able to extract this information can be divided into three
main steps:

1. identification, on the image, of the bottom end of the corridor (see Fig. 3C).
2. identification, on the image, of the lines separating the floor and walls, and those separating
the ceiling and walls (see Fig. 3D).
3. validation of the consistency of the first two steps.

By assuming that the distance of the optical center of the viewing camera from the floor
is known, it is possible to make an absolute estimate of the sides of the box in Fig.
3E and F. The image of Fig. 3A was acquired with an objective having a focal length
of 8 mm and the T.V. camera placed at 115 cm from the floor. The estimate of 195 cm
for the width of the corridor (the true value is 200 cm) can be obtained by using simple
trigonometry.
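The paper does not spell out this computation. Under the usual pinhole-camera assumptions (horizontal optical axis, known height h of the optical center above the floor, focal length expressed in pixels), one possible version of the estimate is sketched below; all names and parameters are illustrative, not the authors' code.

def floor_point_from_pixel(u, v, f, h):
    """(u, v): pixel offsets of a floor point from the principal point (v > 0
    below it); f: focal length in pixels; h: camera height above the floor (cm)."""
    Z = f * h / v          # similar triangles: v / f = h / Z
    X = u * Z / f          # lateral offset from the horizontal pixel offset
    return X, Z

def corridor_width(u_left, u_right, v, f, h):
    """Width at one depth: lateral distance between the left and right
    floor-wall lines sampled at the same image row v."""
    X_l, _ = floor_point_from_pixel(u_left, v, f, h)
    X_r, _ = floor_point_from_pixel(u_right, v, f, h)
    return abs(X_r - X_l)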

Fig. 3. The recovery of the 3D structure. A: an image of a corridor at the Dipartimento di
Fisica. B: maximal simple polygons. Polygons a, b, c and d are candidates to be the front panel
of obstacles. Polygons c and d are rejected because they are too high (polygon c) or outside the
frame of the corridor (polygon d). C: the detection of the boundaries of the bottom end. D: the
detection of the lines separating floor, walls and ceiling. E, F: the 3D structure of the corridor,
represented with two different perspective projections. The broken line in D is a collision-free
trajectory for a mobile vehicle.
4 Detection of obstacles

When the largest 3D box corresponding to the empty corridor has been detected it is
useful to detect and localize other objects or obstacles, such as filing cabinets and drawers.
The algorithm for the detection of these boxes is divided into four steps:

1. detection of polygons which are good candidates for the frontal panel of an obstacle,
for example polygons a, b, c and d in Fig. 3B.
2. validation of the candidates.
3. a 3D box is associated with each validated polygon using a procedure which is very
similar to that used in constructing the 3D box associated with the empty corridor.
4. the consistency of the global 3D structure of the scene is checked, that is to say all
obstacles must be inside the corridor.

Figs. 3E and F reproduce two views of the 3D structure of the scene of image 3A. It
is evident that the global 3D structure of viewed corridors is well described by the boxes
illustrated there.

Conclusion
The algorithm described in this paper seems to be efficient for the recovery of the 3D
structure of indoor scenes from one image or a sequence of images. The proposed algorithm
produced good results for different corridors under a variety of lighting conditions and scene complexities.
Similar procedures can be used in order to determine the presence of, and locate, other
rectangular objects, such as cabinets, boxes, tables, ... Therefore, when a sequence of
many images is available it is possible to obtain an accurate and robust 3D description of
the scene by exploiting geometrical properties of Legoland and by using a simple Kalman
filter.

Acknowledgements
We wish to thank Dr. M. Campani, Dr. E. De Micheli and Dr. A. Verri for helpful
suggestions on the manuscript. Cristina Rosati typed the manuscript and Clive Prestt
checked the English. This work was partially supported by grants from the EEC (ESPRIT
II VOILA), E.B.R.A. Insight Project 3001, EEC BRAIN Project No. 88300446/JU1,
Progetto Finalizzato Trasporti PROMETHEUS, Progetto Finalizzato Robotica, Agenzia
Spaziale Italiana (ASI).

References
[B2] Barrow, H.G, Tenenbaum, J.M.: Interpreting line-drawings as three-dimensional surfaces.
Artif. Intell. 17 (1981) 75-116
[C1] Coelho C., Straforini M., Campani M.: A fast and precise method to extract vanishing
points, SPIE's International Symposia on Applications in Optical Science and Engineer-
ing, Boston 1990.
[H1] Haralick, R.M.: Using perspective transformation in scene analysis. Comput. Graphics
Image Process 13 (1980) 191-221
[S1] Straforini, M., Coelho, C., Campani, M., Torre, V.: The recovery and understanding of
a line drawing from indoor scenes. PAMI, in press (1991)
Active Detection and Classification of Junctions
by Foveation with a Head-Eye System
Guided by the Scale-Space Primal Sketch *

Kjell Brunnström, Tony Lindeberg and Jan-Olof Eklundh

Computational Vision and Active Perception Laboratory (CVAP)


Department of Numerical Analysis and Computing Science
Royal Institute of Technology, S-100 44 Stockholm, Sweden

Abstract. We consider how junction detection and classification can be


performed in an active visual system. This is to exemplify that feature detection
and classification in general can be done by both simple and robust
methods, if the vision system is allowed to look at the world rather than at
prerecorded images. We address issues on how to attract the attention to
salient local image structures, as well as on how to characterize those.

A prevalent view of low-level visual processing is that it should provide a rich but sparse
representation of the image data. Typical features in such representations are edges, lines,
bars, endpoints, blobs and junctions. There is a wealth of techniques for deriving such
features, some based on firm theoretical grounds, others heuristically motivated. Never-
theless, one may infer from the never-ending interest in e.g. edge detection and junction
and corner detection, that current methods still do not supply the representations needed
for further processing. The argument we present in this paper is that in an active system,
which can focus its attention, these problems become rather simplified and do therefore
allow for robust solutions. In particular, simulated foveation 1 can be used for avoiding
the difficulties that arise from multiple responses in processing standard pictures, which
are fairly wide-angled and usually of an overview nature.
We shall demonstrate this principle in the case of detection and classification of
junctions. Junctions and corners provide important cues to object and scene structure
(occlusions), but in general cannot be handled by edge detectors, since there will be
no unique gradient direction where two or more edges/lines meet. Of course, a number
of dedicated junction detectors have been proposed, see e.g. Moravec [15], Dreschler,
Nagel [4], Kitchen, Rosenfeld [9], Förstner, Gülch [6], Koenderink, Richards [10], Deriche,
Giraudon [3] and ter Haar et al [7]. The approach reported here should not be contrasted
to that work. What we suggest is that an active approach using focus-of-attention and
foveation allows for both simple and stable detection, localization and classification, and
in fact algorithms like those cited above can be used selectively in this process.
In earlier work [1] we have demonstrated that a reliable classification of junctions can
be performed by analysing the modalities of local intensity and directional histograms
during an active focusing process. Here we extend that work in the following ways:
- The candidate junction points are detected in regions and at scale levels determined
by the local image structure. This forms the bottom-up attentional mechanism.
* This work was partially performed under the ESPRIT-BRA project INSIGHT. The support
from the Swedish National Board for Industrial and Technical Development, NUTEK, is
gratefully acknowledged. We would also like to thank Kourosh Pahlavan, Akihiro Horii and
Thomas Uhlin for valuable help when using the robot head.
1 By foveation we mean active acquisition of image data with a locally highly increased resolu-
tion. Lacking a foveated sensor, we simulate this process on our camera head.
- The analysis is integrated with a head-eye system allowing the algorithm to actually
take a closer look by zooming in to interesting structures.
- The loop is further closed, including an automatic classification. In fact, by using the
active visual capabilities of our head we can acquire additional cues to decide about
the physical nature of the junction.
In this way we obtain a three-step procedure consisting of (i) selection of areas of interest,
(ii) foveation and (iii) determination of the local image structure.

1 Background: Classifying Junctions by Active Focusing


The basic principle of the junction classification method [1] is to accumulate local his-
tograms over the grey-level values and the directional information around candidate
junction points, which are assumed to be given, e.g. by an interest point operator. Then,
the numbers of peaks in the histograms can be related to the type of junction according
to the following table:

Intensity    Edge direction    Classification hypothesis
unimodal     any               noise spike
bimodal      unimodal          edge
bimodal      bimodal           L-junction
trimodal     bimodal           T-junction
trimodal     trimodal          3-junction

The motivation for this scheme is that for example, in the neighbourhood of a point
where three edges join, there will generically be three dominant intensity peaks corre-
sponding to the three surfaces. If that point is a 3-junction (an arrow-junction or a Y-
junction) then the edge direction histogram will (generically) contain three main peaks,
while for a T-junction the number of directional peaks will be two etc. Of course, the
result from this type of histogram analysis cannot be regarded as a final classification
(since the spatial information is lost in the histogram accumulation), but must be treated
as a hypothesis to be verified in some way, e.g. by backprojection into the original data.
Therefore, this algorithm is embedded in a classification cycle. More information about
the procedure is given in [1].
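The table above can be read as a small decision rule. The sketch below covers only the peak-count mapping; the verification step of the classification cycle is omitted, and the function name is illustrative.

TABLE = {
    (2, 1): "edge",
    (2, 2): "L-junction",
    (3, 2): "T-junction",
    (3, 3): "3-junction",
}

def junction_hypothesis(intensity_peaks, direction_peaks):
    """Map the numbers of stable histogram peaks to a classification hypothesis."""
    if intensity_peaks <= 1:
        return "noise spike"                  # unimodal intensity, any direction
    return TABLE.get((intensity_peaks, direction_peaks),
                     "no hypothesis")         # non-generic case, left to verification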

1.1 Context Information Required for the Focusing Procedure
Taking such local histogram properties as the basis for a classification scheme leads to
two obvious questions: Where should the window be located and how large should it be2?
We believe that the output from a representation called the scale-space primal sketch
[11, 12] can provide valuable clues for both these tasks. Here we will use it for two main
purposes. The first is to coarsely determine regions of interest constituting hypotheses
about the existence of objects or parts of objects in the scene and to select scale levels
for further analysis. The second is for detecting candidate junction points in curvature
data and to provide information about window sizes for the focusing procedure.
In order to estimate the number of peaks in the histogram, some minimum number
of samples will be required. With a precise model for the imaging process as well as the
2 This is a special case of the more general problem concerning how a visual system should be
able to determine where to start the analysis and at what scales the analysis should be carried
out, see also [13].
noise characteristics, one could conceive deriving bounds on the resolution, at least in
some simple cases. Of course, direct setting of a single window size immediately valid
for correct classification seems to be a very difficult or even an impossible task, since if
the window is too large, then other structures than the actual corner region around the
point of interest might be included in the window, and the histogram modalities would
be affected. Conversely, if it is too small then the histograms, in particular the directional
histogram, could be severely biased and deviate far from the ideal appearance in case the
physical corner is slightly rounded - - a scale phenomenon that seems to be commonly
occurring in realistic scenes 3.
Therefore, what we make use of instead is the process of focusing. Focusing means
that the resolution is increased locally in a continuous manner (even though we still have
to sample at discrete resolutions). The method is based on the assumption that stable
responses will occur for the models that best fit the data. This relates closely to the
systematic parameter variation principle described in [11], comprising three steps:

- vary the parameters systematically
- detect locally stable states (intervals) in which the type of situation is qualitatively
the same
- select a representative as an abstraction of each stable interval

2 Detecting Candidate Junctions

Several different types of corner detectors have been proposed in the literature. A prob-
lem that, however, has not been treated very much, is that of the scale(s) at which the
junctions should be detected. Corners are usually treated as pointwise properties and are
thereby regarded as very fine scale features.
In this treatment we will take a somewhat unusual approach and detect corners at
a coarse scale using blob detection on curvature data as described in [11, 13]. Realistic
corners from man-made environments are usually rounded. This means that small size
operators will have problems in detecting those from the original image.
Another motivation to this approach is that we would like to detect the interest points
at a coarser scale in order to simplify the detection and matching problems.

2.1 Curvature of Level Curves

Since we are to detect corners at a coarse scale, it is desirable to have an interest point
operator with a good behaviour in scale-space. A quantity with reasonable such properties
is the rescaled level curve curvature given by

    κ̃ = |Lxx·Ly² + Lyy·Lx² − 2·Lxy·Lx·Ly|                    (1)

This expression is basically equal to the curvature of a level curve multiplied by the
gradient magnitude 4 so as to give a stronger response where the gradient is high. The
motivation behind this approach is that corners basically can be characterized by two
properties: (i) high curvature in the grey-level landscape and (ii) high intensity gradient.
Different versions of this operator have been used by several authors, see e.g. Kitchen,
Rosenfeld [9], Koenderink, Richards [10], Noble [16], Deriche, Giraudon [3] and Florack,
ter Haar et al [5, 7].
3 This effect does not occur for an ideal (sharp) corner, for which the inner scale is zero.
4 Raised to the power of 3 (to avoid the division operation).
Figure l(c) shows an example of applying this operation to a toy block image at a
scale given by a significant blob from the scale-space primal sketch. We observe that the
operator gives strong response in the neighbourhood of corner points.

2.2 Regions of Interest - Curvature Blobs
The curvature information is, however, still implicit in the data. Simple thresholding on
magnitude will in general not be sufficient for detecting candidate junctions. Therefore,
in order to extract interest points from this output we perform blob detection on the
curvature information using the scale-space primal sketch. Figure 1(d) shows the result

Fig. 1. Illustration of the result of applying the (rescaled) level curve curvature operator at
a coarse scale. (a) Original grey-level image. (b) A significant dark scale-space blob extracted
from the scale-space primal sketch (marked with black). (c) The absolute value of the rescaled
level curve curvature computed at a scale given by the previous scale-space blob (this curvature
data is intended to be valid only in a region around the scale-space blob invoking the analysis).
(d) Boundaries of the 50 most significant curvature blobs (detected by applying the scale-space
primal sketch to the curvature data). (From Lindeberg [11, 13]).

of applying this operation to the data in Figure 1(c). Note that a set of regions is extracted
corresponding to the major corners of the toy block. Note also that the support regions
of the blobs serve as natural descriptors for a characteristic size of a region around the
candidate junction. This information is used for setting (coarse) upper and lower bounds
on the range of window sizes for the focusing procedure.
A trade-off with this approach is that the estimate of the location of the corner will
in general be affected by the smoothing operation. Let us therefore point out that we
are here mainly interested in detecting candidate junctions at the possible cost of poor
localization. A coarse estimate of the position of the candidate corner can be obtained
from the (unique) local maximum associated with the blob. Then, if improved localization
is needed, it can be obtained from a separate process using, for example, information from
the focusing procedure combined with finer scale curvature and edge information.
The discrete implementation of the level curve curvature is based on the scale-space for
discrete signals and the discrete N-jet representation developed in [11, 14]. The smoothing
is implemented by convolution with the discrete analogue of the Gaussian kernel. From
this data low order difference operators are applied directly to the smoothed grey-level
data implying that only nearest neighbour processing is necessary when computing the
derivative approximations. Finally, the (rescaled) level curve curvature is computed as a
polynomial expression in these derivative approximations.
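A minimal sketch of such a computation is given below. Note that the authors use the discrete scale-space and discrete N-jet of [11, 14], whereas this illustration simply approximates the Gaussian derivatives with SciPy's sampled-Gaussian filters; the function name is an assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def rescaled_level_curve_curvature(image, sigma):
    """Return |Lxx*Ly^2 + Lyy*Lx^2 - 2*Lxy*Lx*Ly| of Eq. (1) at scale sigma."""
    L = image.astype(float)
    # order=(a, b): derivative of order a along rows (y) and b along columns (x)
    Lx  = gaussian_filter(L, sigma, order=(0, 1))
    Ly  = gaussian_filter(L, sigma, order=(1, 0))
    Lxx = gaussian_filter(L, sigma, order=(0, 2))
    Lyy = gaussian_filter(L, sigma, order=(2, 0))
    Lxy = gaussian_filter(L, sigma, order=(1, 1))
    return np.abs(Lxx * Ly**2 + Lyy * Lx**2 - 2.0 * Lxy * Lx * Ly)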

3 Focusing and Verification


The algorithm behind the focusing procedure has been described in [1] and will not
be considered further, except that we point out the major difference that the classification
procedure has been integrated with a head-eye system (see Figure 2 and Pahlavan, Eklundh [17])
allowing for algorithmic control of the image acquisition.

Fig. 2. The KTH Head used for acquiring the image data for the experiments. The head-eye
system consists of two cameras mounted on a neck and has a total of 13 degrees of freedom. It
allows for computer-controlled positioning, zoom and focus of both the cameras independently
of each other.

The method we currently use for verifying the classification hypothesis (generated
from the generic cases in the table in Section 1, given that a certain number of peaks,
stable to variations in window size, have been found in the grey-level and directional
histogram respectively) is by partitioning a window (chosen as representative for the
focusing procedure [1, 2]) around the interest point in two different ways: (i) by back-
projecting the peaks from the grey-level histogram into the original image (as displayed
in the middle left column of Figure 5) and (ii) by using the directional information
from the most prominent peaks in the edge directional histograms for forming a simple
idealized model of the junction, which is then fitted to the data (see the right column
of Figure 5). From these two partitionings first and second order statistics of the image
data are estimated. Then, a statistical hypothesis test is used for determining whether
the data from the two partitionings are consistent (see [2] for further details).

4 Experiments: Fixation and Foveation

We will now describe some experimental results of applying the suggested methodology
to a scene with a set of toy blocks. An overview of the setup is shown in Figure 3(a). The
toy blocks are made out of wood with textured surfaces and rounded corners.

Fig. 3. (a) Overview image of the scene under study. (b) Boundaries of the 20 most significant
dark blobs extracted by the scale-space primal sketch. (c) The 20 most significant bright blobs.

Figures 3(b)-(c) illustrate the result of extracting dark and bright blobs from the
overview image using the scale-space primal sketch. The boundaries of the 20 most signif-
icant blobs have been displayed. This generates a set of regions of interest corresponding
to objects in the scene, faces of objects and illumination phenomena.
Fig. 4. Zooming in to a region of interest obtained from a dark blob extracted by the scale-space
primal sketch. (a) A window around the region of interest, set from the location and the size of
the blob. (b) The rescaled level curve curvature computed at the scale given by the scale-space
blob (inverted). (c) The boundaries of the 20 most significant curvature blobs obtained by
extracting dark blobs from the previous curvature data.

Fig. 5, Classification results for different junction candidates corresponding to the upper left,
the central and the lower left corner of the toy block in Figure 4 as well as a point along the
left edge. The left column shows the maximum window size for the focusing procedure, the
middle left column displays back projected peaks from the grey-level histogram for the window
size selected as representative for the focusing process, the middle right column presents line
segments computed from the directional histograms and the right column gives a schematic
illustration of the classification result, the abstraction, in which a simple (ideal) corner model
has been adjusted to data. (The grey-level images have been stretched to increase the contrast).

In Figure 4 we have zoomed in to one of the dark blobs from the scale-space primal
sketch corresponding to the central dark toy block. Figure 4(a) displays a window around
t h a t blob indicating the current region of interest. The size of this window has been set
from the size of the blob. Figure 4(b) shows the rescaled level curve curvature computed at
the scale given by the blob and and Figure 4(c) the boundaries of the 20 most significant
curvature blobs extracted from the curvature data.
In Figure 5(a) we have zoomed in further to one of the curvature blobs (corresponding
to the upper left corner of the dark toy block in Figure 4(c)) and initiated a classification
procedure. Figures 5(b)-(d) illustrate a few o u t p u t results from t h a t procedure, which
707

classified the point as being a 3-junction. Figures 5(e)-(1) show similar examples for two
other j u n c t i o n candidates (the central and the lower left corners) from the s a m e toy
block. The interest point in Figure 5(e) was classified as a 3-junction, while the p o i n t in
Figure 5(i) was classified as an L-junction. Note the weak contrast between the two front
faces of the central corner in the original image. Finally, Figures 5(m)-(p) in the b o t t o m
row indicate the ability to suppress "false alarms" by showing the results of applying the
classification procedure to a point along the left edge.

5 Additional Cues: Accommodation Distance and Vergence


The ability to control gaze and focus also facilitates further feature classification, since
the camera parameters, such as the focal distance and the zoom rate, can be controlled
by the algorithm. This can for instance be applied to the task of investigating whether a
grey-level T-junction in the image is due to a depth discontinuity or a surface marking.
We will demonstrate how such a classification task can be solved monocularly, using
focus, and binocularly, using disparity or vergence angles.

Fig. 6. Illustration of the effect of varying the focal distance at two T-junctions corresponding to
a depth discontinuity and a surface marking respectively. In the upper left image the camera was
focused on the left part of the approximately horizontal edge, while in the upper middle image
the camera was focused on the lower part of the vertical edge. In both cases the accommodation
distance was determined from an auto-focusing procedure, developed by Horii [8], maximizing
a simple measure of image sharpness. The graphs on the upper right display how this mea-
sure varies as a function of the focal distance. The lower row shows corresponding results for a
T-junction due to a surface marking. We observe that in the first case the two curves attain
their maxima at clearly distinct positions (indicating the presence of a depth discontinuity),
while in the second case the two curves attain their maxima at approximately the same position
(indicating that the T-junction is due to a surface marking).

In Figure 6(a)-(b) we have zoomed in to a curvature blob associated with a scale-


space blob corresponding to the bright toy block. We demonstrate the effect of varying
the focal distance by showing how a simple measure of image sharpness (the sum of the
squares of the gradient magnitudes in a small window, see Horii [8]) varies with the focal
distance. Two curves are displayed in Figure 6(c); one with the window positioned at
the left part of the approximately horizontal edge and one with the window positioned
at the lower part of the vertical edge. Clearly, the two curves attain their maxima for
different accommodation distances. The distance between the peaks gives a measure of the
relative depth between the two edges, which in turn can be related to absolute depth
values by a calibration of the camera system. For completeness, we give corresponding
results for a T-junction due to surface markings, see Figure 6(d)-(e). In this case the two
graphs attain their maxima at approximately the same position, indicating that there is
no depth discontinuity at this point. (Note that this depth discrimination effect is more
distinct at a small depth-of-focus, as obtained at high zoom rates).
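The sharpness measure itself is easy to state. The sketch below is a hypothetical illustration, not Horii's auto-focusing procedure (which also drives the lens): it evaluates the sum of squared gradient magnitudes in a window for a set of images taken at different focal distances and returns the maximizing setting.

import numpy as np

def sharpness(window):
    """Sum of squared gradient magnitudes inside a grey-level window."""
    gy, gx = np.gradient(window.astype(float))
    return float(np.sum(gx**2 + gy**2))

def best_focus(images_by_focal_distance, window_slice):
    """images_by_focal_distance: {focal_distance: image}; window_slice: a tuple
    of slices selecting the window. Returns the sharpness-maximizing setting."""
    return max(images_by_focal_distance,
               key=lambda d: sharpness(images_by_focal_distance[d][window_slice]))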
In Figure 7 we demonstrate how the vergence capabilities of the head-eye system can
provide similar clues for depth discrimination. As could be expected, the discrimination
task can be simplified by letting the cameras verge towards the point of interest. The
vergence algorithm, described in Pahlavan et al [18], matches the central window of one
camera with an epipolar band of the other camera by minimizing the sum of the squares
of the differences between the grey-level data from two (central) windows.
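A sketch of such a window-matching criterion is given below; the epipolar geometry, rectification and camera control of [18] are not shown, and the function and its parameters are illustrative only.

import numpy as np

def ssd_profile(left, right, center, half, search):
    """left, right: grey-level images; center = (row, col) of the left window;
    half: half window size; search: horizontal search range in pixels."""
    r, c = center
    ref = left[r - half:r + half + 1, c - half:c + half + 1].astype(float)
    profile = {}
    for d in range(-search, search + 1):
        cand = right[r - half:r + half + 1,
                     c + d - half:c + d + half + 1].astype(float)
        profile[d] = float(np.sum((ref - cand) ** 2))
    return profile                  # the minimizing offset relates to the disparity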

Fig. 7. (a)-(b) Stereo pair for a T-junction corresponding to a depth discontinuity. (c) Graph
showing the matching error as function of the baseline coordinate for two different epipolar
planes; one along the approximately horizontal line of the T-junction and one perpendicular to
the vertical line. (d)-(e) Stereo pair for a T-junction corresponding to a surface marking. (f)
Similar graph showing the matching error for the stereo pair in (d)-(e). Note that in the first
case the curves attain their minima at different positions indicating the presence of a depth
discontinuity (the distance between these points is related to the disparity), while in the second
case the curves attain their minima at approximately the same positions indicating that there
is no depth discontinuity at this point.

Let us finally emphasize that a necessary prerequisite for these classification methods
is the ability of the visual system to foveate. The system must have a mechanism for
focusing the attention, including means of taking a closer look if needed, that is acquiring
new images.

6 S u m m a r y and D i s c u s s i o n

The main theme in this paper has been to demonstrate that feature detection and classi-
fication can be performed robustly and by simple algorithms in an active vision system.
Traditional methods based on prerecorded overview pictures may provide theoretical
foundations for the limits of what can be detected, but applied to real imagery they
will generally give far too many responses to be useful for further processing. We argue
that it is more natural to include attention mechanisms for finding regions of interest
and follow up by a step taking "a closer look" similar to foveation. Moreover, by looking
at the world rather than at prerecorded images we avoid a loss of information, which is
rather artificial if the aim is to develop "seeing systems".
The particular visual task we have considered to demonstrate these principles on is
junction detection and junction classification. Concerning this specific problem some of
the technical contributions are:

- Candidate junction points are detected at adaptively determined scales.


- Corners are detected based on blobs instead of points.
- The classification procedure is integrated with a head-eye system allowing the algo-
rithm to take a closer look at interesting structures.
- We have demonstrated how algorithmic control of camera parameters can provide
additional cues for deciding about the physical nature of junctions.

In addition, the classification procedure automatically verifies the hypotheses it generates.

References
1. Brunnstr6m K., Eklundh J.-O., Lindeberg T.P. (1990) "Scale and Resolution in Active
Analysis of Local Image Structure", Image & Vision Comp., 8:4, 289-296.
2. Brunnstr6m K., Eklundh J.-O., Lindeberg T.P. (1991) "Active Detection and Classification
of Junctions by Foveation with a Head-Eye System Guided by the Scale-Space Primal
Sketch", Teeh. Rep., ISRN KTH/NA/P-91/31-SE, Royal Inst. Tech., S-100 44 Stockholm.
3. Deriche R., Giraudon G. (1990) "Accurate Corner Detection: An Analytical Study", 3rd
ICCV, Osaka, 66-70.
4. Dreschler L., Nagel H.-H. (1982) "Volumetric Model and 3D-Trajectory of a Moving Car
Derived from Monocular TV-Frame Sequences of a Street Scene", CVGIP, 20:3, 199-228.
5. Florack L.M.J., ter Haar Romeny B.M., Koenderink J.J., Viergever M.A. (1991) "General
Intensity Transformations and Second Order Invariants", 7th SCIA, Aalborg, 338-345.
6. Förstner M.A., Gülch (1987) "A Fast Operator for Detection and Precise Location of Dis-
tinct Points, Corners and Centers of Circular Features", ISPRS Intercommission Workshop.
7. ter Haar Romeny B.M., Florack L.M.J., Koenderink J.J., Viergever M.A. (1991) "Invariant
Third Order Detection of Isophotes: T-junction Detection", 7th SCIA, Aalborg, 346-353.
8. Horii A. (1992) "Focusing Mechanism in the KTH Head-Eye System", In preparation.
9. Kitchen, L., Rosenfeld, A. (1982), "Gray-Level Corner Detection", PRL, 1:2, 95-102.
10. Koenderink J.J., Richards W. (1988) "Two-Dimensional Curvature Operators", J. Opt.
Soc. Am., 5:7, 1136-1141.
11. Lindeberg T.P. (1991) Discrete Scale-Space Theory and the Scale-Space Primal Sketch,
Ph.D. thesis, ISRN KTH/NA/P-91/8-SE, Royal Inst. Tech., S-100 44 Stockholm.
12. Lindeberg T.P., Eklundh J.-O. (1991) "On the Computation of a Scale-Space Primal
Sketch", J. Visual Comm. Image Repr., 2:1, 55-78.
13. Lindeberg T.P. (1991) "Guiding Early Visual Processing with Qualitative Scale and Region
Information", Submitted.
14. Lindeberg T.P. (1992) "Discrete Derivative Approximations with Scale-Space Properties",
In preparation.
15. Moravec, H.P. (1977) "Obstacle Avoidance and Navigation in the Real World by a Seeing
Robot Rover", Stanford AIM-3$O.
16. Noble J.A. (1988) "Finding Corners", Image & Vision Computing, 6:2, 121-128.
17. Pahlavan K., Eklundh J.-O. (1992) "A Head-Eye System for Active, Purposive Computer
Vision", To appear in CVGIP-IU.
18. Pahlavan K., Eklundh J.-O., Uhlin T. (1992) "Integrating Primary Ocular Processes", 2nd
ECCV, Santa Margherita Ligure.
19. Witkin A.P. (1983) "Scale-Space Filtering", 8th IJCAI, Karlsruhe, 1019-1022.
A New Topological Classification of Points in
3D Images

Gilles Bertrand 1 and Grégoire Malandain 2

1 ESIEE, Labo IAAI, Cité Descartes, 2 bd Blaise Pascal, 93162 Noisy-le-Grand Cedex, France
2 INRIA, project Epidaure, Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex,
France, e-mail: malandain@bora.inria.fr

Abstract.
We propose, in this paper, a new topological classification of points in
3D images. This classification is based on two connected components num-
bers computed on the neighborhood of the points. These numbers allow a
point to be classified as an interior, isolated, border, curve or surface point,
or as different kinds of junctions.
The main result is that the new border point type corresponds exactly
to a simple point. This allows the detection of simple points in a 3D image
by counting only connected components in a neighborhood. Furthermore
other types of points are better characterized.
This classification allows features to be extracted from a 3D image. For exam-
ple, the different kinds of junction points may be used for characterizing
a 3D object. An example of such an approach for the analysis of medical
images is presented.

1 Introduction
Image analysis deals more and more with three-dimensional (3D) images. They may come
from several fields, the most popular one being medical imagery. 3D images need specific
tools for their processing and their interpretation. This interpretation task often involves
a matching stage, between two 3D images or between a 3D image and a model.
Before this matching stage, it is necessary to extract useful information of the image
and to organize it into a high-level structure. It can be done by extracting the 3D edges
of the image (see [6]) and then by searching for some particular qualitative features on these
edges. These features are geometrical (see [5]) or topological (see [4]). In both cases, they
are: intrinsic to the 3D object, stable to rigid transformations and locally defined.
In this paper, we propose a new topological classification which improves the one
proposed in [4]. After recalling some basic definitions of 3D digital topology (section 2),
we give the principle of the topological classification (section 3.1) and we present its
advantages (section 3.3). It is defined by computing two connected components numbers.
The main result is that we can characterize simple points with these numbers without
any Euler number (genus) computation. An example of application in medical imagery
is given (section 5).

2 Basic Definitions

We recall some basic definitions of digital topology (see [1] and [2]).
A 3D digital image is a subset of ℤ³. A point x ∈ ℤ³ is defined by (x1, x2, x3)
with xi ∈ ℤ. We can use the following distances defined in ℝⁿ with their associated
neighborhoods:
- D1(x, y) = Σ_{i=1..n} |yi − xi| with V_r^1(x) = {y | D1(x, y) ≤ r}
- D∞(x, y) = max_{i=1..n} |yi − xi| with V_r^∞(x) = {y | D∞(x, y) ≤ r}

We commonly use the following neighborhoods:

6-neighborhood: we note N6(x) = V_1^1(x) and N6*(x) = N6(x) \ {x}
26-neighborhood: we note N26(x) = V_1^∞(x) and N26*(x) = N26(x) \ {x}
18-neighborhood: we note N18(x) = V_2^1(x) ∩ V_1^∞(x) and N18*(x) = N18(x) \ {x}

A binary image consists of one object X and its complementary set X̄, called the background.
In order to avoid any connectivity paradox, we commonly use the 26-connectivity
for the object X and the 6-connectivity for the background X̄. These are the connectivities
used in this paper.
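
To make the three neighborhoods concrete, the following minimal Python sketch (not part of the original paper) enumerates the offsets of N_6*(x), N_18*(x) and N_26*(x) directly from the D_1 and D_∞ balls defined above.

    from itertools import product

    # Offsets of the punctured neighborhoods around a voxel x, derived from
    # the balls V_1^r(x) and V_inf^1(x) given above (x itself is excluded).
    OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

    def d1(o):                         # city-block distance to the centre voxel
        return sum(abs(c) for c in o)

    N26_STAR = OFFSETS                             # V_inf^1(x) \ {x}
    N18_STAR = [o for o in OFFSETS if d1(o) <= 2]  # V_1^2(x) ∩ V_inf^1(x), minus x
    N6_STAR  = [o for o in OFFSETS if d1(o) <= 1]  # V_1^1(x) \ {x}

    assert (len(N6_STAR), len(N18_STAR), len(N26_STAR)) == (6, 18, 26)
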

3 The Topological Classification

3.1 Principle

Let us consider an object X in the real space R^3, let x ∈ X, and let V(x) be an
arbitrarily small neighborhood of x. Let us consider the numbers C_x and C̄_x, which
are respectively the numbers of connected components of X ∩ (V(x) \ {x}) and of
X̄ ∩ (V(x) \ {x}) adjacent to x. These numbers may be used as topological descriptors
of x. For example, a point of a surface is such that we can choose a small neighborhood
V(x) such that C_x = 1 and C̄_x = 2.
Such numbers are commonly used for thinning algorithms and for characterizing
simple points in 3D. The delicate point of their adaptation to digital topology is the
choice of the small neighborhood V(x).
The distance associated with the 26-connectivity is D_∞; it is then natural to choose
V_∞^1(x) = N_26(x), which is the smallest neighborhood associated with D_∞. Usually, the same
neighborhood is chosen when using other connectivities. But the distance associated with
the 6-connectivity is D_1, and V_1^1(x) = N_6(x) is the smallest neighborhood associated
with D_1. The trouble is that N_6*(x) is not 6-connected. Then V_1^2(x) seems to be the good
choice. In this neighborhood, some points have only one neighbor and have no topological
interest; by removing them we obtain the 18-neighborhood N_18(x).

3.2 Application to Digital Topology


We propose the same methodology of classification as in [4]:

1. Each point is labeled with a topological type using the computation of two connected
components numbers in a small neighborhood.
2. Because some points (junction points) are not detected with the two numbers, a less
local approach is used for extracting them.

We use the two following numbers of connected components:

- C = NC_26[X ∩ N_26*(x)], which is the number of 26-connected components of X ∩ N_26*(x)
26-adjacent to x. Since all points in the 26-neighborhood are 26-adjacent to x,
C is simply the number of 26-connected components of X ∩ N_26*(x); it is not necessary to
check the adjacency to x.

Type A - interior point :           C̄ = 0
Type B - isolated point :           C = 0
Type C - border point :             C̄ = 1, C = 1
Type D - curve point :              C̄ = 1, C = 2
Type E - curves junction :          C̄ = 1, C > 2
Type F - surface point :            C̄ = 2, C = 1
Type G - surface-curve junction :   C̄ = 2, C ≥ 2
Type H - surfaces junction :        C̄ > 2, C = 1
Type I - surfaces-curve junction :  C̄ > 2, C ≥ 2

Table 1. Topological classification of 3D points according to the values of C and C̄

- C̄ = NC_6[X̄ ∩ N_18*(x)], which is the number of 6-connected components of X̄ ∩ N_18*(x)
6-adjacent to x.

We then obtain a first local topological classification of each point of the object using
these two numbers (see Table 1).
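
As an illustration only (not code from the paper), a small Python helper can encode the decision rule of Table 1 once C and C̄ have been computed; the type names are those of the table.

    def classify(C, Cbar):
        """Map the pair (C, Cbar) of Table 1 to a point type.  C is the number
        of 26-components of the object in N26*(x), Cbar the number of
        6-components of the background in N18*(x), both adjacent to x."""
        if Cbar == 0:              return 'A: interior point'
        if C == 0:                 return 'B: isolated point'
        if Cbar == 1 and C == 1:   return 'C: border (simple) point'
        if Cbar == 1 and C == 2:   return 'D: curve point'
        if Cbar == 1 and C > 2:    return 'E: curves junction'
        if Cbar == 2 and C == 1:   return 'F: surface point'
        if Cbar == 2:              return 'G: surface-curve junction'   # C >= 2
        if C == 1:                 return 'H: surfaces junction'        # Cbar > 2
        return 'I: surfaces-curve junction'                             # Cbar > 2, C >= 2
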
However, this classification depends only on the 26-neighborhood of each point, and
some junction points belonging to a set of junction points which is not of unit width are
not detected. We propose the following procedures for extracting such points:

For curves : we only need to count the number of neighbors of each curve point (type
D); if this number is greater than two, the point is a missed curves junction point
(type E).
For surfaces : we use the notion of simple surface introduced in [4]. If a point of type
F or G is adjacent to more than one simple surface in a 5x5x5 neighborhood, it is
considered as a missed point of type H or I.

3.3 Advantages

The main difference of our new classification is that we count the connected components
of the background X̄ in the 18-neighborhood N_18*(x) instead of in the 26-neighborhood N_26*(x)
as in [4]. By using a smaller neighborhood, we are able to see finer details of the object.
The main result due to this difference is that the border point type corresponds
exactly to the characterization of simple points (see [7] and [1]).

Proposition 1. A point x ∈ X is simple if and only if it verifies:

    C = NC_26[X ∩ N_26*(x)] = 1    (1)
    C̄ = NC_6[X̄ ∩ N_18*(x)] = 1    (2)

Proof. The complete proof of this proposition cannot be given here for lack of space
(see [3] for details).

This new characterization of simple points needs only two conditions (instead of
three as usual, see [1]), and these two conditions only require the computation of numbers
of connected components. The computation of the genus, which requires quite a lot of
computational effort, is no longer necessary.

4 Counting the Connected Components

There exist optimal algorithms for searching and labeling the k-connected components
in a binary image (see [8]). These algorithms need only one scan of the picture
by one half of the k-neighborhood, and use a table of labels for managing the conflicts
arising when a point belongs to several connected components already labeled.
We could use the same algorithm in our small neighborhoods, but it has a high
computational cost. In these neighborhoods, we have an a priori knowledge of the possible
adjacencies. We can store this knowledge in a table and use it in a propagation algorithm.
For that, we scan the neighborhood; if we find an object point which is not yet labeled,
we assign a new label to it and we propagate this new label to the whole connected
component which contains the point. Using this knowledge, the propagation algorithm
is faster than the classical one.
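
A hedged Python sketch of this propagation strategy is given below; the adjacency tables and the 6-adjacency-to-x restriction used for C̄ are assumptions made explicit in the comments, not code taken from the paper.

    from collections import deque

    def count_components(values, adjacency, seeds=None):
        """Count connected components of the 'True' offsets of a small
        neighborhood.  `values` maps each offset to a boolean, `adjacency`
        is the precomputed table of neighboring offsets, and `seeds`, when
        given, restricts the count to components containing at least one
        seed (used for the 6-adjacency-to-x condition on Cbar)."""
        seen, count = set(), 0
        for start, inside in values.items():
            if not inside or start in seen:
                continue
            component, queue = {start}, deque([start])
            seen.add(start)
            while queue:                       # propagate the new label
                p = queue.popleft()
                for q in adjacency[p]:
                    if values.get(q) and q not in seen:
                        seen.add(q)
                        component.add(q)
                        queue.append(q)
            if seeds is None or component & seeds:
                count += 1
        return count

    # C    = count_components(object mask on N26*(x), 26-adjacency table)
    # Cbar = count_components(background mask on N18*(x), 6-adjacency table,
    #                         seeds=set of the 6 face neighbors of x)
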

5 Results

We consider two NMR 3D images of a skull scanned in two different positions (see Figure 1).
We apply a thinning algorithm (derived from our characterization of simple points)
to the 3D image containing the skull. The 3D image contains 256x256x151 quasi-isotropic
voxels of 0.8x0.8x1 mm^3.
We then obtain the skeleton of the skull. We apply our classification algorithm to
label each point. Projections of the labeled skeleton are shown in Figure 2. It is easy
to check the striking likeness between the two results, in spite of the noise due to the
scan and the skeletonization. This will be used with profit in a forthcoming 3D matching
algorithm.

6 Conclusion

A new topological classification of points in a 3D image has been proposed. This classification
allows the characterization of a point as an interior, isolated, border, curve or surface
point, or as one of several kinds of junctions. It also allows the detection of
simple points (with applications such as thinning or shrinking). This is done by computing
two connected-component numbers. The Euler number, which requires a lot of computational
effort, does not need to be evaluated. Furthermore, the method for computing
connected components in a small neighborhood enables fast computation of the two
numbers.

References

1. T.Y. Kong and A. Rosenfeld. Digital topology: introduction and survey. Computer Vision,
Graphics, and Image Processing, 48:357-393, 1989.
2. V.A. Kovalevsky. Finite topology as applied to image analysis. Computer Vision, Graphics,
and Image Processing, 46:141-161, 1989.
3. G. Malandain and G. Bertrand. A new topological segmentation of discrete surfaces. Technical
report, I.N.R.I.A., Rocquencourt, 78153 Le Chesnay Cedex, France, 1992.
4. G. Malandain, G. Bertrand, and N. Ayache. Topological segmentation of discrete surfaces.
In IEEE Computer Vision and Pattern Recognition, June 3-6 1991. Hawaii.

5. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In IEEE Computer Vision
and Pattern Recognition, June 3-6 1991. Hawaii.
6. O. Monga, R. Deriche, G. Malandain, and J.P. Cocquerez. Recursive filtering and edge closing:
two primary tools for 3D edge detection. In First European Conference on Computer
Vision (ECCV), April 1990, Nice, France, 1990. Also Research Report INRIA 1103.
7. D.G. Morgenthaler. Three-dimensional digital topology: the genus. TR-980, Computer Science
Center, University of Maryland, College Park, MD 20742, U.S.A., November 1980.
8. C.M. Park and A. Rosenfeld. Connectivity and genus in three dimensions. TR-156, Computer
Science Center, University of Maryland, College Park, MD 20742, U.S.A., May 1971.

Fig. 1. 3D representations of a skull scanned in two positions

Fig. 2. Projection of the topological characterization of the skeleton of the skull: border points
are in black, surface points in light grey and surface junctions in grey
A Theory of 3D Reconstruction of Heterogeneous Edge Primitives from Two Perspective Views*

Ming XIE and Monique THONNAT

INRIA Sophia Antipolis, 2004 Route des Lucioles, 06561 Valbonne, France.

* This work has been supported by the European project PROMETHEUS.

Abstract. We address the problem of 3D reconstruction of a set of heterogeneous
edge primitives from two perspective views. The edge primitives
that are taken into account are contour points, line segments, quadratic
curves and closed curves. We illustrate the existence of analytic solutions
for the 3D reconstruction of the above edge primitives, knowing the relative
geometry between the two perspective views.

1 Introduction

3D computer vision is concerned with recovering the 3D structure of the observed scene
from 2D projective image data. One major problem of 3D reconstruction is the precision
of the obtained 3D data (see [1] and [2]). A promising direction of research is to combine or
fuse 3D data obtained from different observations or by different sensors. However, simply
adopting the fusion approach is not enough: an additional effort is needed at the stage
of 3D reconstruction itself, by adopting a new strategy. For example, we think
that a strategy of 3D reconstruction of heterogeneous primitives would be an interesting
direction of research. The main reason behind this idea is that a real scene composed of
natural or man-made objects would be characterized efficiently by a set of heterogeneous
primitives, instead of uniquely using a set of 3D points or a set of 3D line segments.
Therefore, the design of a 3D vision system must incorporate the processing of a set of
heterogeneous primitives as a central element. In order to implement the strategy above,
we must know at the stage of 3D reconstruction what kind of primitives will be recovered
and how to perform such a 3D reconstruction of the primitives selected beforehand. In
fact, we are interested in the 3D reconstruction of primitives relative to the boundaries
of objects, i.e., the edge primitives. For the purpose of simplicity, we can roughly classify
such primitives into four types which are contour points, line segments, quadratic curves
and closed curves. Suppose now that a moving camera or moving stereo cameras observe
a natural scene to furnish some perspective views. Then, a relevant question will be:

Given two perspective views with the relative geometry being known, how can we
recover the 3D information from matched 2D primitives such as contour points,
line segments, quadratic curves and closed curves?

2 Camera Modelling

We suppose the projection model of a camera to be a perspective one. Consider a coordinate
system OXYZ located at the center of the lens of the camera, with OXY being
parallel to the image plane and the OZ axis being normal to the image plane (pointing
outside the camera). Similarly, we associate a coordinate system oxy with the image plane,
the origin being at the intersection point between the OZ axis and the image plane, and
the ox and oy axes being respectively parallel to OX and OY. If we denote by P = (X, Y, Z)
a point in OXYZ and by p = (x, y) the corresponding image point in oxy, then, using a
perspective projection model of the camera, we have the following relationship:

    x = f X / Z,    y = f Y / Z    (1)

where f is the focal length of the camera. Without loss of generality, we can set f = 1.

3 Relative Geometry between Two Perspective Views

The two perspective views in question may be furnished either by a moving camera
at two consecutive instants or by a pair of stereo cameras. Thus, it seems natural to
represent the relative geometry between two perspective views by a rotation matrix R
and a translation vector T. In the following, we shall denote by (R_v1v2, T_v1v2) the relative
geometry between the perspective view v1 and the perspective view v2. Now, if we denote by
P_v1 = (X_v1, Y_v1, Z_v1) a 3D point in the camera coordinate system of the perspective view
v1 and by P_v2 = (X_v2, Y_v2, Z_v2) the same 3D point in the camera coordinate system of the
perspective view v2, then the following relation holds:

    (X_v2, Y_v2, Z_v2)^t = R_v1v2 (X_v1, Y_v1, Z_v1)^t + T_v1v2.    (2)

Hereafter, we shall represent the relative geometry between two perspective views (view
v1 and view v2) as follows:

    R_v1v2 = (R1, R2, R3)^t = ( r11  r12  r13 )
                              ( r21  r22  r23 )
                              ( r31  r32  r33 )_3x3
    T_v1v2 = (t_x, t_y, t_z)^t_3x1                                   (3)

where t denotes the transpose of a vector or a matrix.

4 Solutions of 3D Reconstruction

4.1 3D Reconstruction of Contour Points

In an edge map, the contour points are the basic primitives. A contour chain that cannot
be described analytically can be considered as a set of linked contour points. So,
the 3D reconstruction of non-describable contour chains is equivalent to that of
contour points. Given a pair of matched contour points (p_v1, p_v2), we can first determine
the projecting line which passes through the point p_v2 and the origin of the camera
coordinate system of the perspective view v2. Then, we transform this projecting line
into the camera coordinate system of the perspective view v1. Finally, the coordinates of
the corresponding 3D point can be determined by inversely projecting the contour point
p_v1 onto the transformed line. Therefore, our solution for recovering 3D contour points
can be formulated by the following theorem:

Theorem 1. A 3D point P is observed from two perspective views: the perspective view v1
and the perspective view v2. In the first perspective view, P_v1 = (X_v1, Y_v1, Z_v1) represents
the 3D coordinates (in the camera coordinate system) of the point P and p_v1 = (x_v1, y_v1)
the 2D image point of P_v1. In the second perspective view, P_v2 = (X_v2, Y_v2, Z_v2) represents
the 3D coordinates (in the camera coordinate system) of the same point P and
p_v2 = (x_v2, y_v2) the 2D image point of P_v2. If the relative geometry between the two
perspective views is known and is represented by (3), then the 3D coordinates P_v1 are
determined by:

    X_v1 = x_v1 Z_v1
    Y_v1 = y_v1 Z_v1                                                 (4)
    Z_v1 = λ_x Z_x + λ_y Z_y

where Z_x and Z_y denote the depth estimates obtained from the x- and y-components of the
projecting line of p_v2 transformed into the first camera coordinate system, and (λ_x, λ_y)
are two weighting coefficients.
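
Since the weighted expression (4) is only partially reproduced here, the following Python sketch uses a standard surrogate for the same construction: under the model of Sections 2 and 3 (f = 1, P_v2 = R P_v1 + T), it transforms the projecting line of p_v2 into the first camera frame and returns the midpoint of the shortest segment between the two projecting lines. It is an illustration of the idea, not the authors' exact formula.

    import numpy as np

    def triangulate_contour_point(p1, p2, R, T):
        """Recover P_v1 = (X_v1, Y_v1, Z_v1) from matched image points
        p1 = (x_v1, y_v1) and p2 = (x_v2, y_v2), given (R, T) of Eq. (3)."""
        m1 = np.array([p1[0], p1[1], 1.0])        # ray of view 1: s * m1
        d2 = R.T @ np.array([p2[0], p2[1], 1.0])  # view-2 projecting line, in frame 1
        c2 = -R.T @ np.asarray(T, dtype=float)    # view-2 camera centre, in frame 1
        # Minimize || s*m1 - (c2 + t*d2) ||^2 over (s, t): 2x2 normal equations.
        A = np.array([[m1 @ m1, -(m1 @ d2)],
                      [m1 @ d2, -(d2 @ d2)]])
        b = np.array([m1 @ c2, d2 @ c2])
        s, t = np.linalg.solve(A, b)
        return 0.5 * (s * m1 + c2 + t * d2)       # midpoint of the two closest points
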

4.2 3D Reconstruction of Line Segments

The problem of 3D reconstruction of line segments has been addressed by several researchers
(see [3] and [4]). In this paper, we develop a simpler solution with
respect to the camera-centered coordinate system, knowing two perspective views. The
basic idea is first to determine the projecting plane of a line segment in the second perspective
view, then to transform this projecting plane into the first perspective view, and
finally to determine the 3D endpoints of the line segment by inversely projecting the
corresponding 2D endpoints (in the image plane) onto the transformed projecting plane in
the first perspective view. In this way, we can derive a solution for the 3D reconstruction
of line segments. This solution can be stated as follows:

Theorem 2. A 3D line segment is observed from two perspective views: the perspective
view v1 and the perspective view v2. In the second view v2, we know the supporting line of
the corresponding projected line segment (in the image plane), which is described by the
equation a_v2 x_v2 + b_v2 y_v2 + c_v2 = 0. If the relative geometry between the two perspective
views is known and is represented by (3), then the coordinates (X_v1, Y_v1, Z_v1) of a point
(e.g. an endpoint) of the 3D line segment in the first perspective view are determined by
the following equations:

    X_v1 = -(L_v2 · T_v1v2) x_v1 / [ (L_v2 · R1) x_v1 + (L_v2 · R2) y_v1 + (L_v2 · R3) ]
    Y_v1 = -(L_v2 · T_v1v2) y_v1 / [ (L_v2 · R1) x_v1 + (L_v2 · R2) y_v1 + (L_v2 · R3) ]    (5)
    Z_v1 = -(L_v2 · T_v1v2)      / [ (L_v2 · R1) x_v1 + (L_v2 · R2) y_v1 + (L_v2 · R3) ]

where L_v2 = (a_v2, b_v2, c_v2) and (x_v1, y_v1) is the known projection of (X_v1, Y_v1, Z_v1) in
the image plane of the first perspective view.
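
A minimal Python sketch of this back-projection is shown below; it works directly with the full matrix R_v1v2 (so the row/column convention for R1, R2, R3 in (3) does not matter) and is intended as an illustration rather than a reference implementation.

    import numpy as np

    def endpoint_from_projecting_plane(p1, line_v2, R, T):
        """Theorem 2, sketched: intersect the view-1 ray through p1 = (x_v1, y_v1)
        with the projecting plane of the view-2 line a*x + b*y + c = 0."""
        m = np.array([p1[0], p1[1], 1.0])
        L = np.asarray(line_v2, dtype=float)                    # L_v2 = (a_v2, b_v2, c_v2)
        Z = -(L @ np.asarray(T, dtype=float)) / (L @ (R @ m))   # depth Z_v1
        return Z * m                                            # (X_v1, Y_v1, Z_v1)

Both 2D endpoints of the segment are back-projected with the same plane, which yields the 3D segment.
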

4.3 3D Reconstruction of Quadratic Curves

In this section, we show that an analytic solution exists for the 3D reconstruction
of quadratic curves from two perspective views. By quadratic curve, we mean a curve
whose projection onto an image plane can be described by an equation of quadratic form.
To determine the 3D points belonging to a 3D curve, the basic idea is first to determine
the projecting surface of the 3D curve observed in the second perspective view, then to
transform this projecting surface to the first perspective view and finally to determine
the 3D points belonging to the 3D curve by inversely projecting the corresponding 2D
points (in the image plane) onto the transformed projecting surface in the first perspective
view. If we denote by p_v = (x_v, y_v, 1) the homogeneous coordinates of an image point in the
perspective view v, we can formulate our solution for the 3D reconstruction of quadratic
curves by the following theorem:
Theorem 3. A 3D curve is observed from two perspective views: the perspective view v1
and the perspective view v2. In these two views, the corresponding projected 2D curves
(in the image planes) can be described by equations of quadratic form. The description of
the 2D curve in the second perspective view is given by: a_v2 x_v2^2 + b_v2 y_v2^2 + c_v2 x_v2 y_v2 +
e_v2 x_v2 + f_v2 y_v2 + g_v2 = 0. If the relative geometry between the two perspective views is
known and is represented by (3), then, given a point p_v1 = (x_v1, y_v1, 1) on the 2D curve
in the first perspective view, the corresponding 3D point (X_v1, Y_v1, Z_v1) on the 3D curve
is determined by the following equations:

    X_v1 = [(-B ± sqrt(B^2 - 4AC)) / (2A)] x_v1
    Y_v1 = [(-B ± sqrt(B^2 - 4AC)) / (2A)] y_v1                      (6)
    Z_v1 =  (-B ± sqrt(B^2 - 4AC)) / (2A)

where:

    A = p_v1^t · R_v1v2^t · Q_v2 · R_v1v2 · p_v1

and:

    Q_v2 = ( a_v2  c_v2  e_v2 )
           (  0    b_v2  f_v2 )
           (  0     0    g_v2 )
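
The coefficient A above is stated in the theorem; the corresponding B and C are not spelled out in this excerpt, but under the same model they follow from substituting P_v2 = R (Z p_v1) + T into the cone P_v2^t Q_v2 P_v2 = 0. The Python sketch below makes this derivation explicit; it is an illustration under these assumptions, not the paper's own code.

    import numpy as np

    def reconstruct_conic_point(p1, Q2, R, T):
        """Theorem 3, sketched: both candidate 3D points on the curve for the
        image point p1 = (x_v1, y_v1).  Q2 may be taken upper triangular,
        [[a, c, e], [0, b, f], [0, 0, g]], for the conic of the theorem."""
        m = np.array([p1[0], p1[1], 1.0])
        T = np.asarray(T, dtype=float)
        A = m @ R.T @ Q2 @ R @ m                  # as in the theorem
        B = m @ R.T @ (Q2 + Q2.T) @ T             # derived coefficient (assumption)
        C = T @ Q2 @ T                            # derived coefficient (assumption)
        disc = np.sqrt(B * B - 4.0 * A * C)
        return [((-B + sign * disc) / (2.0 * A)) * m for sign in (1.0, -1.0)]
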

4.4 3D Reconstruction of Closed Planar Curves

A solution for the 3D reconstruction of closed curves can be derived by using a planarity
constraint, i.e. the closed curves to be recovered in 3D space are assumed to be planar
(which means that a closed curve can be supported by a plane). Therefore, given two
perspective views of a closed curve, our strategy consists of first estimating the supporting
plane of the closed curve in the first perspective view and then determining the 3D points of
the closed curve by inversely projecting the points of the corresponding 2D curve (in the
image plane) onto the estimated supporting plane. At the first step, we make use of
Theorem 1. Below is the development of our solution for the 3D reconstruction of closed
planar curves:
Let O_v1 = {(X_v1^i, Y_v1^i, Z_v1^i), i = 1, 2, 3, ..., n} be a set of n 3D points belonging to
a closed curve C in the first perspective view and I_v1 = {(x_v1^i, y_v1^i), i = 1, 2, 3, ..., n}
be the set of n corresponding image points of O_v1. Due to the visibility of a closed curve
detected in an image plane, its supporting plane cannot pass through the origin of the
camera coordinate system. Thus, we can describe a supporting plane by an equation of
the form: aX + bY + cZ = 1.
Based on the assumption that the observed closed curve C is planar, a 3D point
(X_v1^i, Y_v1^i, Z_v1^i) on C must satisfy the equation of its supporting plane, that is:

    a X_v1^i + b Y_v1^i + c Z_v1^i = 1.    (7)

By applying (1) to the above equation, we obtain:

    a x_v1^i + b y_v1^i + c = 1 / Z_v1^i    (8)

where Z_v1^i is calculated by (4) of Theorem 1.

(8) is a linear equation in the unknown variables (a, b, c). To solve it, we need at least
three non-collinear points in order to obtain a unique solution. In practice, there will
be more than three points on a closed curve. As for the closed curve C, if we define:

    A_nx3 = [ x_v1^i  y_v1^i  1 ]_{i=1..n},   B_nx1 = [ 1/Z_v1^i ]_{i=1..n},   W_3x1 = (a, b, c)^t,    (9)

then a linear system is established as follows:

    A · W = B.    (10)

To estimate the unknown vector W, we use a least-squares technique. So, the solution
for (a, b, c) can be obtained by the following calculation:

    W = (A^t · A)^{-1} · (A^t · B).    (11)

Knowing the supporting plane determined by (a, b, c), the 3D points of the closed
curve C can be calculated as follows (by combining (1) and (8)):

    X_v1^i = x_v1^i / (a x_v1^i + b y_v1^i + c)
    Y_v1^i = y_v1^i / (a x_v1^i + b y_v1^i + c)        i = 1, 2, ..., n.    (12)
    Z_v1^i =      1 / (a x_v1^i + b y_v1^i + c)
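
A compact Python sketch of steps (9)-(12) follows, assuming the depths Z_v1^i of (4) are already available; it is illustrative only.

    import numpy as np

    def reconstruct_closed_curve(image_points_v1, depths_v1):
        """Fit the supporting plane aX + bY + cZ = 1 by least squares (Eqs. 9-11)
        and back-project every image point onto it (Eq. 12)."""
        pts = np.asarray(image_points_v1, dtype=float)    # n x 2 points (x_v1^i, y_v1^i)
        A = np.column_stack([pts, np.ones(len(pts))])     # rows (x_v1^i, y_v1^i, 1)
        B = 1.0 / np.asarray(depths_v1, dtype=float)      # rows 1 / Z_v1^i
        a, b, c = np.linalg.lstsq(A, B, rcond=None)[0]    # W = (A^t A)^-1 A^t B
        Z = 1.0 / (a * pts[:, 0] + b * pts[:, 1] + c)     # Eq. (12)
        return np.column_stack([pts[:, 0] * Z, pts[:, 1] * Z, Z])
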

5 Conclusions

We have addressed the problem of 3D reconstruction of heterogeneous edge primitives
using two perspective views. For edge primitives such as contour points,
line segments, quadratic curves and closed curves, the existence of (analytic) solutions
has been illustrated. An advantage of our work is that the proposed solutions are derived
by reasoning directly in discrete time. Consequently, they are directly applicable to
the situation where a set of discrete perspective views (or a sequence of discrete digital
images) is available.

References
[1] BLOSTEIN, S. D. and HUANG, T. S.: Error Analysis in Stereo Determination of 3D
Point Positions. IEEE PAMI, Vol. 9, No. 6, (1987).
[2] RODRIGUEZ, J. J. and AGGARWAL, J. K.: Stochastic Analysis of Stereo Quantization
Error. IEEE PAMI, Vol. 12, No. 5, (1990).
[3] KROTKOV, E., HENRIKSEN, K. and KORIES, R.: Stereo Ranging with Verging Cameras.
IEEE PAMI, Vol. 12, No. 12, (1990).
[4] AYACHE, N. and LUSTMAN, F.: Trinocular Stereo Vision for Robotics. IEEE PAMI,
Vol. 13, No. 1, (1991).
This article was processed using the LaTeX macro package with ECCV92 style
Detecting 3-D Parallel Lines for Perceptual Organization*

Xavier Lebègue and J. K. Aggarwal

Computer and Vision Research Center, Dept. of Electrical and Computer Engr., ENS 520,
The University of Texas at Austin, Austin, Texas 78712-1084, U.S.A.

* This research was supported in part by the DoD Joint Services Electronics Program through
the Air Force Office of Scientific Research (AFSC) Contract F49620-89-C-0044, and in part
by the Army Research Office under contract DAAL03-91-G-0050.

Abstract.
This paper describes a new algorithm to simultaneously detect and clas-
sify straight lines according to their orientation in 3-D. The fundamental
assumption is that the most "interesting" lines in a 3-D scene have orien-
tations which fall into a few precisely defined categories. The algorithm we
propose uses this assumption to extract the projection of straight edges from
the image and to determine the most likely corresponding orientation in the
3-D scene. The extracted 2-D line segments are therefore "perceptually"
grouped according to their orientation in 3-D. Instead of extracting all the
line segments from the image before grouping them by orientation, we use
the orientation data at the lowest image processing level, and detect seg-
ments separately for each predefined 3-D orientation. A strong emphasis is
placed on real-world applications and very fast processing with conventional
hardware.

1 Introduction

This paper presents a new algorithm for the detection and organization of line segments
in images of complex scenes. The algorithm extracts line segments of particular 3-D
orientations from intensity images. The knowledge of the orientation of edges in the
3-D scene allows the detection of important relations between the segments, such as
parallelism or perpendicularity.
The role of perceptual organization [5] is to highlight non-accidental relations between
features. In this paper, we extend the results of perceptual organization for 2-D scenes
to the interpretation of images of 3-D scenes with any perspective distortion. For this,
we assume a priori knowledge of prominent orientations in the 3-D scene. Unlike other
approaches to space inference using vanishing points [1], we use the information about
3-D orientations at the lowest image-processing level for maximum efficiency.
The problem of line detection without first computing a free-form edge map was
addressed by Burns et al. [2]. Their algorithm first computes the intensity gradient orientation
for all pixels in the image. Next, the neighboring pixels with similar gradient
orientation are grouped into "line-support regions" by a process involving coarse orientation
"buckets." Finally, a line segment is fit to the large line-support regions by a
least-squares procedure. An optimized version of this algorithm was presented in [3].
The algorithm described in this paper is designed not only to extract 2-D line segments
from an intensity image, but also to indicate the most probable orientations
for the corresponding 3-D segments in the scene. Section 2 explains the geometry of
projecting segments of known 3-D orientation. Section 3 describes a very fast algorithm
to extract the line segments from a single image and to simultaneously estimate their
3-D orientation. Finally, Sect. 4 provides experimental results obtained with images of
indoor scenes acquired by a mobile robot.

2 Motivation and Assumptions

We chose to concentrate on objects which have parallel lines with known 3-D orientations
in a world coordinate system. For example, in indoor scenes, rooms and hallways usually
have a rectangular structure, and there are three prominent orientations for 3-D line
segments: one vertical and two horizontal orientations perpendicular to each other. In
this paper, any 3-D orientation is permitted, as long as it is given to the algorithm.
Therefore, more complex environments, such as polygonal buildings with angles other
than 90 degrees, are handled as well if these angles are known. It is important to note that
human vision also relies on prominent 3-D orientations. Humans feel strongly disoriented
when placed in a tilted environment.
Vertical lines constitute an interesting special case for two reasons: they are especially
common in man-made scenes, and their 3-D orientation can easily be known in the 3-D
camera coordinate system by measuring the direction of gravity. If a 2-axis inclinometer
is mounted on the camera and properly calibrated, a 3-D vertical vector can be expressed
in the 3-D coordinate system aligned with the 2-D image coordinate system. Inexpensive
commercial inclinometers have a precision better than 0.01 degree. Humans also sense
the direction of gravity by organs in their inner ears. In our experiments, we estimate the
third angular degree of freedom of the camera relative to the scene from the odometer
readings of our mobile robot. Provided that the odometer is constantly corrected by
vision [4], the odometer does not drift without bounds.
We can infer the likely 3-D orientation of the line segments from their 2-D projections
in the image plane. With a pinhole perspective projection model, lines parallel to each
other in the 3-D scene will converge to a vanishing point in the 2-D projection. In partic-
ular, if the orientation of the camera relative to the scene is known, a vanishing point can
be computed for each given 3-D orientation before the image is processed. All the lines
that have a given orientation in 3-D must pass through the associated vanishing point
when projected. Conversely, if a line does not pass through a vanishing point, it cannot
have the 3-D orientation associated with that vanishing point. In practice, if a line does
pass through a vanishing point when projected, it is likely to have the associated 3-D
orientation.
To summarize, the line detection algorithm of Sect. 3 knows in each point of the
image plane the orientation that a projected line segment would have if it had one of
the predefined 3-D orientations. Therefore, the basic idea is to detect the 2-D segments
with one of the possible orientations, and mark them with the associated 3-D orientation
hypothesis.

3 Detecting Segments and Estimating their 3-D Orientation

3.1 Coordinate Systems and Transformations

The coordinate systems are W (the World coordinate system, with a vertical z-axis),
R (the Robot coordinate system, in which we obtain the inclinometer and odometer
readings), C (the Camera coordinate system), and P (the coordinate system used for
the perspective projection on the retina). The homogeneous coordinate transformation
matrix from W to R is T_WR = T_roll T_pitch T_heading T_translations. T_roll and T_pitch are known
with good precision through the inclinometer. T_heading is estimated by the odometer
and T_translations is not used here. T_RC, the coordinate transformation matrix from R to C,
needs to be completely determined through eye/wheel calibration. Finally, T_CP is known
through camera calibration.

3.2 Overview of the Algorithm


The processing can be outlined as follows:
1. Line support region extraction: compute the angle between the intensity gradient
at each pixel and the expected direction of the projection of each 3-D orientation
(see Sect. 3.3 for details). Use a loose threshold to allow for noise in the gradient
orientation. Reject improper pixels and 3-D orientations.
2. Non-maxima suppression: keep only the local gradient maxima along the estimated
perpendicular to the line.
3. Pixel linking: create chains of pixels using a partial neighborhood search in the di-
rection of the estimated vanishing points. This creates noisy linear chains.
4. Line fitting: perform a least-squares fit of line segments to the pixel chains. Re-
cursively break the pixel chains which cannot be closely approximated with a line
segment into smaller chains.
5. Global orientation check: compute the match between each line and each 3-D orientation,
as in the line support extraction step but with a much tighter threshold.
If the a priori heading is very uncertain, the lines will be extracted with loose thresholds,
the true heading will be estimated, and the algorithm can then be run again with tight
thresholds for the correct categorization.

3.3 Extracting Line Support Regions


For each pixel in the input intensity image and for each category of possible 3-D orienta-
tions, we compute the angle between the intensity gradient and the expected direction of
the line in 2-D. The expected line is given by the current pixel and the vanishing point
associated with the 3-D orientation. It is not necessary to compute the location of the
vanishing point (which may lie at infinity).
The homogeneous transformation matrix changing world coordinates into projective
coordinates is T_WP = T_CP T_RC T_WR. Let [p_x, p_y, p_z, 0]^T_W be a non-null vector in the 3-D
direction under consideration. If [su, sv, s, 1]^T = T_WP [x, y, z, 1]^T defines the relation
between a 2-D point [u, v]^T and its antecedent by the perspective projection, then

    [s'u', s'v', s', 1]^T = T_WP ([x, y, z, 1]^T + [p_x, p_y, p_z, 0]^T_W)

defines another point of the estimated 2-D line. A 2-D vector d in the image plane pointing
to the vanishing point from the current point is then collinear to [u' - u, v' - v]^T.
Algebraic manipulations lead to [d_u, d_v]^T = [a_x - a_z u, a_y - a_z v]^T where

    [a_x, a_y, a_z, 0]^T = T_WP [p_x, p_y, p_z, 0]^T_W.

Note that a_x, a_y, and a_z need to be computed only once for each 3-D orientation.

The current pixel is retained for the 3-D direction under consideration if the angle
between d and the local gradient g is 90 degrees plus or minus an angular threshold γ.
This can be expressed by

    ||d × g|| / (||d|| ||g||) > cos γ

or equivalently:

    (d_u g_y - d_v g_x)^2 > Γ (d_u^2 + d_v^2)(g_x^2 + g_y^2)

with Γ = (cos γ)^2 computed once for all. Using this formulation, the entire line support
extraction is reduced to 8 additions and 11 multiplications per pixel and per 3-D
orientation. If an even greater speedup is desired, (g_x^2 + g_y^2) may be computed first and
thresholded. Pixels with a very low gradient magnitude may then be rejected before
having to compute d.
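
The per-pixel test can be vectorized over a whole image; the Python sketch below (an illustration, not the authors' implementation) assumes the gradient images gx, gy and pixel coordinate grids u, v are already available, together with the 4x4 matrix T_WP.

    import numpy as np

    def line_support_mask(gx, gy, u, v, p_dir, T_WP, gamma_deg):
        """Keep the pixels whose gradient is perpendicular (within gamma) to the
        expected 2-D direction d of lines with 3-D orientation p_dir = (px, py, pz)."""
        a = T_WP @ np.array([p_dir[0], p_dir[1], p_dir[2], 0.0])
        du = a[0] - a[2] * u                         # d = (ax - az*u, ay - az*v)
        dv = a[1] - a[2] * v
        cross = du * gy - dv * gx                    # proportional to |d x g|
        Gamma = np.cos(np.radians(gamma_deg)) ** 2   # threshold, computed once
        return cross * cross > Gamma * (du * du + dv * dv) * (gx * gx + gy * gy)

In a typical use, u and v come from numpy.meshgrid over the image, and one mask is computed per predefined 3-D orientation.
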

4 Results

The algorithm was implemented in C on an IBM RS 6000 Model 530 workstation, and
tested on hundreds of indoor images obtained by our mobile robot. The predefined 3-D
orientations are the vertical and the two horizontal orientations perpendicular to each
other and aligned with the axes of our building. Figures 1 and 2 show the results of
line extraction for one image in a sequence. The processing time is only 2.2 seconds for
each 512 by 480 image. Preliminary timing results on an HP 730 desktop workstation
approach one second of processing, from the intensity image to the list of categorized
segments. The high speed can be explained partly by the absence of multi-cycle floating-point
instructions in the line orientation equations, when properly expressed.
The lines are not broken up easily by a noisy gradient orientation, because the ori-
entation "buckets" are wide and centered on the noiseless gradient orientation for each
3-D orientation category. The output quality does not degrade abruptly with high im-
age noise, provided that the thresholds for local gradient orientations are loosened. The
sensitivity to different thresholds is similar to that of the Burns algorithm: a single set
of parameters can be used for most images. A few misclassifications occur in some parts
of the images, but are marked as ambiguities.
We have compared the real and computed 3-D orientation of 1439 detected segments
from eight images in three different environments. The presence of people in some scenes,
as well as noise in the radio transmission of images, did not seem to generate many mis-
classifications. The most frequent ambiguities occurred with horizontal segments parallel
to the optical axis: 1.1% of them were classified as possibly vertical in 3-D.

5 Conclusion

We have presented a new algorithm for detecting line segments in an image of a 3-D
scene with known prominent orientations. The output of the algorithm is particularly
well suited for further processing using perceptual organization techniques. In partic-
ular, angular relationships between segments in the 3-D scene, such as parallelism or
perpendicularity, are easily verified. Knowledge of the 3-D orientation of segments is a
considerable advantage over the traditional 2-D perceptual organization approach. The
orientation thresholds of the 2-D perceptual organization systems cannot handle a sig-
nificant perspective distortion (such as the third orientation category in Fig. 2). The
independence from the perspective distortion brings more formal angular thresholds to

Fig. 1. (a) The input intensity image, and (b) the 2-D segments

Fig. 2. The line segments associated with each 3-D orientation

the perceptual organization process. By using the 3-D orientation at the lowest image
processing level, both the quality and speed of the algorithm were improved. The ultimate
benefits of this approach were demonstrated on real images in real situations.

References

1. S.T. Barnard. Interpreting perspective images. Artificial Intelligence, 21(4):435-462,
November 1983.
2. J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 8(4):425-455, July 1986.
3. P. Kahn, L. Kitchen, and E. M. Riseman. A fast line finder for vision-guided robot navigation.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(11):1098-1102, November
1990.
4. X. Lebègue and J. K. Aggarwal. Extraction and interpretation of semantically significant
line segments for a mobile robot. To appear in Proc. IEEE Int. Conf. Robotics and Automation,
Nice, France, May 1992.
5. D. G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers,
1985.

This article was processed using the LaTeX macro package with ECCV92 style
Integrated Skeleton and Boundary Shape Representation for Medical Image Interpretation*

Glynn P. Robinson¹, Alan C.F. Colchester¹, Lewis D. Griffin¹ & David J. Hawkes²

¹ Department of Neurology, Guy's Hospital, London SE1 9RT, England.
² Department of Radiological Sciences, Guy's Hospital, London SE1 9RT, England.

* The research described in this paper has been supported by the SERC grant ISIS.

Abstract. We propose a method of extracting and describing the shape of
features from medical images which provides both a skeleton and a boundary
representation. This method requires neither complete closed boundaries nor
regularly sampled edge points. Lines between edge points are connected into
boundary sections using a measure of proximity. Alternatively, or in addition,
known connectivity between points (such as that available from traditional edge
detectors) can be incorporated. The resultant descriptions are object-centred
and hierarchical in nature, with an unambiguous mapping between
skeleton and boundary sections.

1 Introduction

We are currently developing an improved shape representation for use in the Guy's
computer vision system for medical image interpretation [RC1]. The requirement is for
an efficient method of shape representation which can be used to store information about
the expected anatomical structure in the model, and also to represent information about the
shape of features present in the image. In this paper we present an integrated approach
to shape representation which also addresses the problem of grouping dot patterns and
disconnected edge sections to form perceptual objects. The method of shape
representation is based on the duality between the Voronoi diagram and the Delaunay
triangulation of a set of points. For each object, the boundary and the skeleton are
represented hierarchically. This reduces sensitivity to small changes along the object
boundary and also facilitates coarse-to-fine matching of image features to model entities.

2 Previous work

Many approaches to shape representation have been proposed, and more extensive
reviews can be found in [Ma1]. Boundary representations of the shape of objects,
such as those described in [Fr1] and [Ay1], tend to be sensitive to small changes
along the object boundaries; hierarchical representation is often difficult, as is the sub-division
of objects into their sub-parts. The hierarchical approach of the curvature primal
sketch [AB1] is an improvement on boundary representations, but the use of multiple
Gaussian scales may cause problems, especially when primitives are close together.

Skeleton representations such as that proposed in [Bl1] allow the shape of objects to be
represented in terms of the relationships between their sub-parts. Spurious skeleton
branches can, however, be generated by small protrusions on the object boundary. Nackman & Pizer
[NP1] propose an approach to overcoming the problem of these spurious branches by
generating the skeleton of an object at multiple Gaussian scales, and Arcelli [Ar1]
proposes a hierarchy of skeletons in terms of the object's boundary curvature.

Grouping dot patterns to form perceptual objects has been attempted by a number of
authors. We are concerned with grouping together dots which are considered to form the
boundary of objects (in a similar manner to a child's dot-to-dot game). Ahuja et al. [Ah1, AT1]
propose the use of the Voronoi diagram and the properties of the individual
Voronoi cells to classify points as boundary points, interior points, isolated points, or
points on a curve.

Fairfield [Fa1], like ourselves, is concerned both with the detection of the boundary of
objects from dots, and with the segmenting of these objects into their sub-parts. He uses
the Voronoi diagram to detect areas of internal concavity and replaces Voronoi diagram
sides with the corresponding Delaunay triangulation sides to produce both the object
boundary and the sub-parts. This work depends on a user-defined threshold and does not
differentiate between object boundaries and sub-part boundaries.

Ogniewicz et al [OI1] use the Voronoi diagram of a set of points to produce a medial axis
description of objects. This method requires that the points making up the boundary have
known connectivity, and a threshold is used to prune the skeleton description.

The method we propose does not require connected boundaries as input, merely a set of
points (dots) which are believed to be edge points of objects. Our method produces
distinct objects from these potential edge points, and concurrently generates both a
skeleton and a boundary representation of the shape of these objects.

3 Defining boundary/skeleton and objects


The Delaunay triangulation of the candidate edge points is calculated, and from this the
Voronoi diagram. We must then select from the Delaunay triangulation those sides that
make up the perceived object boundaries, and from the Voronoi diagram those sides that
make up the perceived object skeleton. This selection is refined to the exclusive decision
of whether to keep a Delaunay side as a boundary section or the corresponding Voronoi
side as a skeleton section. The decision is based purely on proximity, and the shorter of
the Delaunay triangle side and corresponding Voronoi side is kept as a boundary section
or skeleton section respectively. If connectivity between any two specific points is known
this can be easily incorporated by overriding the selection criteria for the Delaunay
triangle side connecting the two points. If no such triangle side exists then a new
connection is formed, while still preserving the Delaunay triangulation, using the method
of Boissonnat [Bo1].
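
A hedged Python sketch of this proximity-based selection, built on scipy's Voronoi routine (every Voronoi ridge is dual to the Delaunay edge joining its two generating points), is given below; the handling of unbounded ridges is an assumption of the sketch.

    import numpy as np
    from scipy.spatial import Voronoi

    def select_boundary_and_skeleton(points):
        """Return the Delaunay edges kept as boundary sections and the Voronoi
        edges kept as skeleton sections, choosing the shorter of each dual pair."""
        points = np.asarray(points, dtype=float)
        vor = Voronoi(points)
        boundary, skeleton = [], []
        for (i, j), ridge in zip(vor.ridge_points, vor.ridge_vertices):
            delaunay_len = np.linalg.norm(points[i] - points[j])
            if -1 in ridge:                       # unbounded Voronoi edge:
                boundary.append((i, j))           # keep the finite Delaunay side
                continue
            voronoi_len = np.linalg.norm(vor.vertices[ridge[0]] - vor.vertices[ridge[1]])
            if delaunay_len <= voronoi_len:
                boundary.append((i, j))           # Delaunay side -> boundary section
            else:
                skeleton.append(tuple(ridge))     # Voronoi side -> skeleton section
        return boundary, skeleton
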

Objects can now be defined by stretches of unbroken and possibly branching skeletons.
Each branch in a skeleton has two properties associated with it: firstly, the mean
direction of the skeleton branch, and secondly the area of the object corresponding to
that branch. Fig. 1a shows an example set of points corresponding to the bodies of the
lateral cerebral ventricles, extracted via a DOG from a transverse MR image. These points
are shown as a series of crosses which are unfortunately drawn so close together that they
partially overlap. Figs. 1b-c show the Delaunay triangulation and the Voronoi diagram of
these points respectively. Fig. 1d shows the result of the proximity-based selection criterion.

4 Defining the intra-object hierarchy


Objects are decomposed into sub-parts by examining the area and direction associated
with each skeleton branch. These two measures are combined to locate small branches
meeting a more significant part of the object. Where these less significant branches occur,
"virtual boundaries" are constructed from the Delaunay triangle side corresponding to the
Voronoi side emanating from the branch. Each sub-object has an associated area which
is the total area of the object within the real or virtual boundaries surrounding that sub-
part.

An intra-object hierarchy is then generated by ordering the sub-objects in decreasing size


and starting a new level in the hierarchy where there is a significant change in area
between sub-parts which are adjacent in the list. Fig. 2a shows the object corresponding
to the lateral ventricles with the boundaries of these sub-parts shown with dotted lines.
Fig. 2b shows a low-level in the intra-object hierarchy, and fig. 2c shows the remaining
fine detail in the intra-object hierarchy. Figures 3a-c show the same information as figures
2a-c for the small portion of the object indicated in figure 2a.

5 Using the skeleton/boundary unification

The unified nature of the representation that comes from the duality of the Delaunay
triangulation and the Voronoi diagram allows simple changes in the data structure to
change the perceived number of objects. For example, considering the lateral ventricles
in figs 1-2, we may wish to further divide the object into left and right ventricles. This can
be easily achieved by simply forcing the connection between the two bodies. This requires
only a local change in the data structure, but generates two new objects. Fig. 4b shows the
effect of this simple change.

The converse of this can be just as easily achieved (merging two objects into one) by
forcing a boundary section to become a skeleton section.

6 Concluding Remarks
We have defined a hierarchical, object-centred shape description. The algorithm for
computing this description works on both connected and disconnected edge points. The
technique is based on a scale-invariant proximity measure and so requires no user-defined
thresholds.

We are extending our technique to make use of criteria other than proximity, for example
gradient magnitude at edge points, and directional continuity of edge sections.

Fig. 1. a) Discrete edge points, shown partially overlapping; b) Delaunay triangulation
(solid lines); c) Voronoi diagram (dashed lines); d) Result of the selection criterion.


Fig. 2. a) Boundary and sub-parts of the lateral bodies; b) low level in the object hierarchy; c) fine
detail of the object hierarchy. Dotted lines are Delaunay sides forming the virtual boundaries.

Fig. 3. a)-c) same features as fig. 2 for the small area indicated in fig. 2a.

Fig. 4. a) Lateral bodies of figs. 2-3; b) result of splitting the object in two.

References

[RCll Robinson, G.P., Colchester, A.C.F., Griffin,L.D.: A hierarchical shape


representation for use in anatomical object recognition. Proc. SPIE Biomedical
Image Processing & 3D microscopy (1992)
[Mal] Marshall, S.: Review of Shape Coding Techniques. Image and Vision Computing.
7 (1989) 281-294
[Frl] Freeman, H.: On the encoding of arbitrary geometric configurations. IRE Trans.
Electronic Computers. June (1961) 260-268
[Ayl] Ayache, N. J.: A model-based vision system to identify and locate partially visible
industrial parts. Proc. IEEE Computer Society Conference on Computer Vision
and Pattern Recognition. IEEE, New York: (1983) 492-494
[ABI] Asada, H., Brady, M. : The Curvature Primal Sketch. IEEE Trans. Pat. Anal.
Machine Intel. PAMI-8 NO. 1 (1986) 2-14
[Bill Blum, H.: Biological Shape and Visual Science. Int. Jour. Theory Biol. (1973)
205-287
[NPI] Nackman, L. R., Pizer, S. M. : Three dimensional shape description using the
symmetric axis transform. IEEE Trans. Pat. Anal. Machine Intel. PAMI-9 (1985)
505-511
[Arl] Arcelli, C.: Pattern thinning by contour tracing. Comp. Vis. Image Proces. 17
(1981) 130-144
[Ahl] Ahuja, N.: Dot Pattern Processing Using Voronoi Neighbourhoods. IEEE Trans.
Pat. Anal. Machine lntei. PAMI-4 (1982) 336-343
[AT1] Ahuja, N., Tuceryan, M. : Extraction of Early Perceptual Structure in Dot
Patterns: Integrating region, boundary & component Gestalt. Comp. Graph. Vis.
Image Proc. 48 (1989) 304-346
[Fall Fairfield, J. R. C.: Segmenting dot patterns by Voronoi diagram concavity. IEEE
Trans. Pat. Anal. Machine Intel. PAMI-5 (1983) 104-110
[oi1] Ogniewicz, R., Ilg, M. : Skeletons with Euclidean metric and correct topology and
their application in object recognition and document analysis. Proceedings 4th
International Symposium on Spatial Data Handling. Zurich, Switzerland: (1990)
15-24
[Boll Boissonnat, J. D.: Shape Reconstruction from planar cross section. Comp. Graph.
Vis. Image Proc. 44 (1988) 1-29
Critical Sets for 3D Reconstruction Using Lines

Thomas Buchanan
Eberstadt, Troyesstr. 64, D-6100 Darmstadt, Germany

Abstract. This paper describes the geometrical limitations of algorithms
for 3D reconstruction which use corresponding line tokens. In addition to
announcing a description of the general critical set, we analyse the configurations
defeating the Liu-Huang algorithm and study the relations between
these sets.

1 Introduction

The problem of 3D reconstruction is to determine the geometry of a three-dimensional


scene on the basis of two-dimensional images. In computer vision it is of utmost im-
portance to develop robust algorithms for solving this problem. It is also of importance
to understand the limitations of the algorithms, which are presently available, because
knowledge of such limitations guides their improvement or demonstrates their optimality.
From a theoretical point of view there are two types of limitations. The first type
involves sets of images for which there exists more than one essentially distinct 3D scene,
each giving rise to the images. The superfluous reconstructions in this case can be thought
of as "optical illusions". This type of limitation describes the absolute "bottom line" of
the problem, because it involves scenes where even the most optimal algorithm breaks down.
The second type of limitation is specific to a given not necessarily optimal algorithm.
It describes those scenes which "defeat" that particular algorithm.
Currently, algorithms for 3D reconstruction are of two types. One type assumes a
correspondence between sets of points in the images. For algorithms of this type the
critical set has been studied extensively. (See [6] for a vivid graphical description of this
locus and the references in [8] for a detailed bibliography.) In recent years another type
of algorithm has been introduced, which assumes a correspondence between sets of lines
in the images.
The purpose of this paper is to describe limitations for the algorithms which use lines
as tokens in the images.
We use projective geometry throughout the paper. Configurations in 3-space are con-
sidered to be distinct if they cannot be transformed into one another by a projective linear
transformation. The use of the projective standpoint can be thought of as preliminary
to studying the situation in euclidean space. But the projective situation is of interest in
its own right, because some algorithms operate essentially within the projective setting.
Generally, algorithms using a projective setting are easier to analyse and implement than
algorithms which fully exploit the euclidean situation.
This paper is organized as follows. In section 2 we collect some standard definitions
from line geometry, which will allow us to describe the line sets in Sections 3 and 4. In
Section 3 we describe line sets gr in 3-space and images of ~ which give rise to ambiguous
reconstructions. In Section 4 we describe line sets F in 3-space which defeat the algorithm
introduced in [7]. Essential properties of F were first noted in [10, p. 106] in the context
of constructive geometry. In Section 5 we discuss the relationship between ~ and F.

A proof of Theorem 3.1 will appear in [1].

2 Definitions from line geometry

The set of all lines in 3-space will be denoted by L. It is well known that L is a 4-dimensional
algebraic variety. (See [11, pp. 244-247 and Chap. XV] for an introduction
to L and [15] for an encyclopedic exposition.) To see that dim L = 4 is plausible, consider
the set of pairs of points in 3-space. The dimension of this set is 2 x 3 = 6. Each pair of
distinct points determines a line l joining the two points. However, l is overdetermined,
for we can move each of the two points along l. This reduces the degrees of freedom for
L by 2. Thus we have dim L = 6 - 2 = 4.
Elements of L can be coordinatized by 6-tuples (p01, p02, p03, p12, p13, p23) which are
subject to the following conditions.

(a) At least one p_ij (0 ≤ i < j ≤ 3) is nonzero.

(b) Scalar multiples (λp01, λp02, λp03, λp12, λp13, λp23) denote the same line for all λ ≠ 0.
(c) The line coordinates p_ij satisfy the equation

    p01 p23 - p02 p13 + p03 p12 = 0.    (1)

Given a line l containing distinct points with homogeneous coordinates (x_0, ..., x_3)
and (y_0, ..., y_3), the p_ij are defined by

    p_ij = det ( x_i  x_j
                 y_i  y_j ) = x_i y_j - x_j y_i.

That the p_ij do indeed have properties (a), (b) and (c) is shown in [11], for example.
An algebraic set is defined to be a set which is defined by a set of polynomial equa-
tions. In line geometry these equations involve the line coordinates Pij as unknowns. An
algebraic set is called reducible if it can be written as the union of two nonempty proper
algebraic subsets. For example, in the cartesian plane the set of points satisfying xy = 0,
which consists of the coordinate axes, is a reducible algebraic set, because the set is the
union of the y-axis (x = 0) and the x-axis (y = 0). On the other hand, the x-axis de-
scribed by y = 0 is irreducible, because the only proper algebraic subsets of the x-axis
are finite sets of points of the form (x, 0). An irreducible algebraic set is called a variety.
It can be shown that any algebraic set can be described as the finite union of varieties.
A variety V has a well-defined dimension, which is the number of parameters required
to parametrize smooth open subsets of V. We can think of the dimension of V as the
number of degrees of freedom in V. For example, the plane has dimension 2, 3-space has
dimension 3, etc.
Line varieties Λ are subvarieties of L. Since dim L = 4, there are four possibilities for
dim Λ when Λ is not all of L. If dim Λ = 0, then Λ is a single element of L, i.e., a line.
Line varieties of dimension 1, 2 and 3 are called a ruled surface, a (line) congruence and a
(line) complex respectively. The unfortunate choice of terminology for line varieties goes
back to the 19th century. The terms have so thoroughly established themselves in the
literature, however, that it would be futile to try to introduce new names for the line
varieties.

Note that a ruled surface as defined above is a 1-parameter family of lines, not a set
of points. For example, the hyperboloid of one sheet contains a 1-parameter family of
lines, i.e., a ruled surface.

A different ruled surface lying on the hyperboloid is shown in the figure below.

A particularly simple ruled surface is a pencil, defined to be the set of lines passing
through a given point P and lying in a given plane π.
An important descriptor for a line complex Γ is its order. The order of Γ is defined to
be the number of lines Γ has in common with a general pencil. It is important to count
not only lines in real space but to properly count lines in the space over the complex numbers.
For any point P in 3-space we may consider all lines of Γ which pass through P. This
subset of Γ is called the complex cone at P. The order of Γ could equivalently be defined
as the order of a general complex cone of Γ. If a general complex cone has as its base a
plane curve of degree d, then d is the order of the cone and the order of Γ.
A theorem of Felix Klein states that in the space over the complex numbers a line
complex can be described by a single homogeneous polynomial equation

    f(p01, p02, p03, p12, p13, p23) = 0

(see [4, p. 147, Exercise 6.5d]). Of course, it is always tacitly assumed that (1) holds. If
Γ is described by a homogeneous polynomial f, the order of Γ coincides with the degree
of f.
A very simple line complex consists of all lines which meet a given line l. If the
coordinates of l are a = (a01, a02, a03, a12, a13, a23), then the equation for this complex
can be shown to be

    a01 p23 - a02 p13 + a03 p12 - a13 p02 + a12 p03 + a23 p01 = 0.    (2)



This polynomial has degree 1, so the order of the complex is 1. The polynomial in a
and p = (p01, p02, p03, p12, p13, p23) is denoted by Ω_ap. Equation (1), which we are always
tacitly assuming, can be expressed by the equation Ω_pp = 0.
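
As a small illustration (not from the paper), the Python sketch below computes the Plücker coordinates of a line from two homogeneous points and evaluates the bilinear form Ω_ap of (2); Ω_pp = 0 is exactly relation (1), and Ω_ap = 0 tests whether two lines meet.

    from itertools import combinations

    def pluecker(x, y):
        """p_ij = x_i*y_j - x_j*y_i for 0 <= i < j <= 3 (the 2x2 minors)."""
        return [x[i] * y[j] - x[j] * y[i] for i, j in combinations(range(4), 2)]

    def omega(a, p):
        """Omega_ap of equation (2); two lines a, p meet exactly when it vanishes."""
        a01, a02, a03, a12, a13, a23 = a
        p01, p02, p03, p12, p13, p23 = p
        return (a01 * p23 - a02 * p13 + a03 * p12
                - a13 * p02 + a12 * p03 + a23 * p01)

    # Example: p = pluecker((1, 0, 0, 0), (0, 1, 0, 0)) satisfies omega(p, p) == 0.
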
For a given complex Γ it may happen that Γ contains all lines through some special
point P. In this case P is called a total point of Γ.
Given a line congruence Ψ, only a finite number of lines pass through a given point
in general. Again we count not only lines in real space but lines in the space over the
complex numbers. The number of such lines is constant for almost all points of 3-space;
this number is defined to be the order of Ψ. Analogously, a general plane π in 3-space
contains only a finite number of lines of Ψ. This number is defined to be the class of Ψ.
Points lying on an infinite number of lines of Ψ and planes containing an infinite number
of lines of Ψ are called singular.
Given a congruence Ψ and a line l in 3-space not in Ψ, we may consider the subset
of Ψ consisting of the elements of Ψ which meet l. This set can then be described by the
equations which define Ψ together with an additional linear equation of the form of (2).
If this set is irreducible, it is a ruled surface.
In general, there exist a finite number of points P on l with the property that l
together with two elements of Ψ through P lie in a plane. This number is the same for
almost all l and is defined to be the rank of Ψ. A congruence of order n, class m and rank
r is referred to as an (n, m, r)-congruence.
Given a point P, all lines through P form a (1, 0, 0)-congruence called the star at P.
A ruled surface ρ can be considered to be an algebraic space curve in 5-dimensional
projective space, which is the space coordinatized by the six homogeneous line coordinates
p_ij (0 ≤ i < j ≤ 3). The curve lies on the variety defined by (1). The order of ρ is
defined to be the number of lines of ρ which meet a general given line, where again lines are
counted properly in the space over the complex numbers. For example, ruled surfaces lying on
a hyperboloid have order 2.
If a (space) curve in complex projective space is smooth, it is topologically equivalent
(homeomorphic) to a surface (a so-called Riemann surface), which is either a sphere, a
torus or a surface having a finite number of handles.

(The surface in the figure above has 5 handles.)


Surfaces with handles can be topologically built up from tori by cutting small disks
out of the tori and pasting them together along the disk boundaries. The number of tori
required to build up a given surface is the number of handles of the surface; this number
is defined to be the genus of the curve. The definition of genus can be extended to curves
with singularities. We refer the reader to a textbook on algebraic curves or algebraic
geometry (for example, [12] or [14]) for equivalent definitions of "genus". The concept of

genus is applicable to ruled surfaces, since these can be regarded as space curves.
Given a congruence Ψ, the sectional genus of Ψ is defined to be the genus of a general
ruled surface ρ consisting of the elements of Ψ which meet a given line l not lying in Ψ.

3 The critical line set

In this section we assume three cameras are set up in general position with centers
O_1, O_2, O_3. The image planes are denoted by I_1, I_2, I_3. The imaging process defines
collineations γ_i : star(O_i) → I_i (i = 1, 2, 3), which we assume to be entirely general.
To consider the critical set, we consider another three centers Ō_1, Ō_2, Ō_3, which are in
general position with respect to each other and the first set of centers O_1, O_2, O_3. The
symbols with bars denote an alternative reconstruction of the scene and the camera
positions. The stars at the Ō_i's project to the same image planes, defining collineations
γ̄_i : star(Ō_i) → I_i, also of general type. The compositions α_i = γ_i ∘ γ̄_i^{-1} define collineations
between the lines and the planes through O_i and Ō_i.
We shall describe what we mean by "general position" after stating our main result.

Theorem 3.1 With respect to images from three cameras the general critical set Ψ for
the reconstruction problem using lines is a (3,6,5)-congruence. The sectional genus of Ψ
is 5. Ψ contains 10 singular points, 3 of which are located at the camera centers. The
singular cones have order 3 and genus 1. Ψ has no singular planes.

The proof of this theorem is given in [1]. Essentially, the proof determines Ψ's order
and class and Ψ's singular points and planes. These invariants suffice to identify Ψ in
the classification of congruences of order 3 given in [3]. In this classification the other
properties of Ψ can be found.
Just as a ruled surface can be considered to be a curve in 5-space, a congruence can
be considered to be a surface in 5-space.
According to [3, p. 72] Ψ is a surface of order 9 in 5-space. This surface has a plane
representation: the hyperplane sections of Ψ, i.e., the intersection of Ψ with complexes of
order 1, correspond to the system of curves of order 7 which have nodes at 10 given base
points. The plane cubic curves which pass through 9 of the 10 base points correspond to
the singular cones of Ψ.
Let us now describe what is meant by "general position".
First, we assume the centers of projection O_1, O_2, O_3 and Ō_1, Ō_2, Ō_3 are not collinear.
Let π denote the plane spanned by O_1, O_2, O_3 and π̄ denote the plane spanned by
Ō_1, Ō_2, Ō_3.
Next, we assume that the images of π under the various α_i intersect in a single point
P̄ = π^{α_1} ∩ π^{α_2} ∩ π^{α_3}. Analogously, we assume the images of π̄ under α_1^{-1}, α_2^{-1}, α_3^{-1}
intersect in a single point P = π̄^{α_1^{-1}} ∩ π̄^{α_2^{-1}} ∩ π̄^{α_3^{-1}}.
Each pair of centers O_i, O_j and collineations α_i, α_j (i ≠ j = 1, 2, 3) determines a point
locus Q_{ij}, which is critical for 3D reconstruction using points. In the general projective
setting Q_{ij} is a quadric surface passing through O_i and O_j. We assume each Q_{ij} is
a proper quadric and that each pair of quadrics Q_{ij}, Q_{ik} ({i, j, k} = {1, 2, 3}) intersects in an
irreducible curve of order 4. Moreover, we assume that all three quadrics intersect in
8 distinct points. The analogous assumptions are assumed to hold for the quadrics Q̄_{ij}
determined by the centers Ō_i, Ō_j.
Finally, we assume that for each fixed i = 1, 2, 3 the two lines (O_iO_j)^{α_j} (j = 1, 2, 3, j ≠ i)
are skew. Here O_iO_j denotes the line joining O_i and O_j.

4 Line sets defeating the Liu-Huang algorithm

The algorithm proposed in [7] sets out to determine the rotational components of the
camera orientations with respect to one another in its first step. We shall only concern
ourselves with this step in what follows.
If three cameras are oriented in such a manner that they differ only by a translation, we
can define a collineation between the lines and planes through each center of projection
O_i (i = 1, 2, 3) by simply translating the line or the plane from one center to the other.
This collineation coincides with the collineation at the O_i induced by the images, namely
where the points P_i and P_j in the i-th and j-th image correspond when they have the
same image coordinates (i, j = 1, 2, 3).
Regardless of camera orientation, introducing coordinates in the images preemptively
determines collineations between the images and as a result between the corresponding
lines and planes through the centers of projection. We call such lines and planes homol-
ogous, i.e., the images of homologous elements have the same coordinates in the various
images.
In the case where the cameras are simply translated, homologous elements in the stars
at O_i are parallel. Projectively speaking, this means that homologous rays intersect in
the plane at infinity and homologous planes are coaxial with the plane at infinity.
A generalization of the translational situation arises when the collineations between
the lines and planes through the centers are induced by perspectivities with a common
axial plane π, i.e., a ray r_i through O_i corresponds to r_j through O_j when r_j = (r_i ∩ π)O_j
(i, j = 1, 2, 3). Here (r_i ∩ π)O_j denotes the line joining the points r_i ∩ π and O_j.
Note that the projections of points X on π give rise to homologous rays O_iX, which
by definition have the same coordinates in the images. Let l be a line in 3-space and
l_i (i = 1, 2, 3) denote the images of l. If l meets π in X, the points P_i corresponding to
the projection of X in the images have the same coordinates. (In the translational case X
corresponds to the vanishing point of l.) Thus if the l_i are drawn in a single plane using
the common coordinate system of the images, they are concurrent, because the P_i ∈ l_i
all have the same coordinates. In the translational case, this point is the vanishing point
of the parallel class of l.
The idea behind the first step in the algorithm of [7] is to find the rotational components
of the camera orientation by collinearly rearranging two of the images so that
all corresponding lines in all three images are simultaneously concurrent with respect to
a given coordinate system. If

Σ_{i=0}^{2} u_i x_i = 0 ,   Σ_{i=0}^{2} v_i x_i = 0 ,   Σ_{i=0}^{2} w_i x_i = 0

are the equations of the projections of a line l, we look for rotations, i.e., 3 × 3 orthogonal
matrices, or more generally simply 3 × 3 invertible matrices M_1, M_2 such that u =
(u_0, u_1, u_2), M_1v = M_1(v_0, v_1, v_2) and M_2w = M_2(w_0, w_1, w_2) are linearly dependent,
the linear dependency being equivalent to concurrency. This means we look for M_1, M_2
such that
det(u, M_1v, M_2w) = 0
for all triples of corresponding lines in the images. The algorithm would like to infer that,
after applying M_1 and M_2, the cameras are now oriented so that they are translates
of each other, or in the projective case that the images are perspectively related by a
common axial plane.
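As an illustration of this determinant condition, the following Python sketch (our own code, not from [7]; the function and variable names are hypothetical) accumulates the squared values of det(u, M_1v, M_2w) over a set of corresponding line triples. Driving this residual to zero over all triples is the goal of the algorithm's first step.

```python
import numpy as np

def concurrency_residual(M1, M2, line_triples):
    """Sum of squared determinants det(u, M1 v, M2 w) over corresponding
    line triples (u, v, w), each line being the 3-vector of coefficients of
    u0*x0 + u1*x1 + u2*x2 = 0 in its image.  A residual of zero means every
    transformed triple is concurrent (illustrative sketch only)."""
    total = 0.0
    for u, v, w in line_triples:
        d = np.linalg.det(np.column_stack([u, M1 @ v, M2 @ w]))
        total += d * d
    return total
```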
Consider the cameras with general orientations, where again homologous rays through
the centers correspond to points in the image having the same coordinates. If a line l in

space meets 3 homologous rays r_1, r_2, r_3, then the projections of l are concurrent, the
point of concurrency being the point corresponding to r_1, r_2 and r_3.
The set of all lines which meet the rays r_1, r_2 and r_3 when the homologous rays are
skew is a ruled surface of order 2 denoted by [r_1, r_2, r_3]. Let Γ = ⋃_{r_1,r_2,r_3} [r_1, r_2, r_3] be
the set of all lines of 3-space meeting triples of homologous rays. If all the lines in the
scene lie in Γ, then their projections have the property that they are concurrent. But
since the cameras were in general position, they are not translates of each other. Thus Γ
defeats the algorithm.
To find the equation for Γ let q_1, q_2, q_3 denote the line coordinates of 3 rays through
O_1, not all in a plane. Then q_1, q_2, q_3 form a frame of reference for rays through O_1; the
coordinates of any ray through O_1 can be written as a nonzero linear combination

λ_1 q_1 + λ_2 q_2 + λ_3 q_3   (3)

of the three coordinate 6-tuples q_1, q_2, q_3.


If s_1, s_2, s_3 denote the rays through O_2, and t_1, t_2, t_3 the rays through O_3 which are
homologous to q_1, q_2, q_3, the line coordinates of the rays through O_2 and O_3 which are
homologous to the ray defined by (3) are given by

λ_1 s_1 + λ_2 s_2 + λ_3 s_3   and   λ_1 t_1 + λ_2 t_2 + λ_3 t_3 .

Thus a line l with coordinates p intersects this homologous triple if and only if

0 = Ω_{p, λ_1 q_1 + λ_2 q_2 + λ_3 q_3} = Σ_{i=1}^{3} λ_i Ω_{pq_i} ,
0 = Ω_{p, λ_1 s_1 + λ_2 s_2 + λ_3 s_3} = Σ_{i=1}^{3} λ_i Ω_{ps_i} ,
0 = Ω_{p, λ_1 t_1 + λ_2 t_2 + λ_3 t_3} = Σ_{i=1}^{3} λ_i Ω_{pt_i} .

In general l with line coordinates p lies in Γ if there exist (λ_1, λ_2, λ_3) not all zero such
that p satisfies the equations above. This will be the case when

det ( Ω_{pq_1}  Ω_{pq_2}  Ω_{pq_3} )
    ( Ω_{ps_1}  Ω_{ps_2}  Ω_{ps_3} ) = 0 .   (4)
    ( Ω_{pt_1}  Ω_{pt_2}  Ω_{pt_3} )

Thus (4) is the equation for Γ; the left-hand side of (4) is a homogeneous polynomial
in p = (p_{01}, p_{02}, p_{03}, p_{12}, p_{13}, p_{23}) of degree 3. We have the following theorem.

Theorem 4.1 The set Γ which defeats the Liu-Huang algorithm is in general a line complex
of order 3 given by (4), where q_1, q_2, q_3; s_1, s_2, s_3 and t_1, t_2, t_3 denote line coordinates
of rays through O_1, O_2 and O_3 respectively. The centers are total points of Γ.

To prove the assertion about the total points note that if, say, O_1 ∈ l then Ω_{pq_i} = 0
for i = 1, 2, 3; hence p satisfies (4). □
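The following Python sketch (our own illustration with hypothetical names, not code from [1] or [7]) evaluates the bilinear form Ω_{ap} of equation (2) and the determinant test (4), giving a numerical check of whether a line with Plücker coordinates p belongs to Γ.

```python
import numpy as np

def omega(a, p):
    """Bilinear form Omega_{ap} of equation (2) for two lines given by their
    Pluecker coordinates (a01, a02, a03, a12, a13, a23); it vanishes exactly
    when the two lines meet."""
    a01, a02, a03, a12, a13, a23 = a
    p01, p02, p03, p12, p13, p23 = p
    return a01*p23 - a02*p13 + a03*p12 - a13*p02 + a12*p03 + a23*p01

def lies_in_complex(p, q, s, t, tol=1e-9):
    """Test of equation (4): p lies in the complex Gamma of common
    transversals when the 3x3 determinant of Omega-values against the frame
    rays q = (q1,q2,q3), s = (s1,s2,s3), t = (t1,t2,t3) vanishes (sketch)."""
    M = np.array([[omega(p, q[i]) for i in range(3)],
                  [omega(p, s[i]) for i in range(3)],
                  [omega(p, t[i]) for i in range(3)]])
    return abs(np.linalg.det(M)) < tol
```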
In the Euclidean case (4) takes on the special form in which the triples q_1, q_2, q_3; s_1, s_2, s_3
and t_1, t_2, t_3 are line coordinates for an orthogonal triple of lines through O_1, O_2 and O_3
respectively.
Definition: Γ is called the complex of common transversals of homologous rays. The
essential properties of Γ were first noted in [10, p. 106] in the context of constructive
geometry. The projective geometry of Γ has also been studied in [5] and [13, IV, pp. 134 ff.].

5 The relation between the critical congruence Ψ and the complex of common transversals Γ

Before going into the relation between Γ and Ψ we state some properties of still another
congruence.
The Roccella congruence A is a (3,3,2)-congruence of sectional genus 2 which consists
of all common transversals of 3 homographically related plane pencils in general position.
If we restrict the collineations of three stars to a plane pencil, we obtain A as a subset
of the complex of common transversals determined by collinear stars. (Cf. [9], [2, pp. 152-157].)
Let us return to the situation used in defining Ψ. Here O_1, O_2, O_3 and Ō_1, Ō_2, Ō_3
denote the locations of the cameras for two essentially different 3D reconstructions and
α_i denote the collineations between the stars at O_i and Ō_i which are induced by the images.
Any plane μ in 3-space not meeting O_1, O_2, O_3 determines perspectivities between the
stars at O_i via r_i ↦ (r_i ∩ μ)O_j (i, j = 1, 2, 3). Since star(O_i) and star(Ō_i) are collinear
via α_i, these perspectivities also induce collineations between the stars at Ō_i. Hence μ
also gives rise to a complex Γ_μ of common transversals of homologous rays, as explained
in the previous section.
Proposition 5.1 If μ_1, μ_2 are two distinct planes in 3-space in general position, then

Γ_{μ_1} ∩ Γ_{μ_2} = Ψ ∪ A ∪ ⋃_{i=1,2,3} star(O_i) ,

where A denotes the Roccella congruence induced by the pencils at O_i in the planes
(Ō_i(μ_1 ∩ μ_2))^{α_i^{-1}} (i = 1, 2, 3).

PROOF. "⊇". We need only show that

Ψ \ ⋃ star(O_i) ⊆ Γ_{μ_1} ,

since Γ_{μ_1} contains ⋃ star(O_i).
Given l ∈ Ψ not meeting any O_i, l corresponds to an l̄ in the second interpretation
of Ψ from the centers Ō_i. Let P̄ ∈ l̄ ∩ μ_1. Then l meets (Ō_iP̄)^{α_i^{-1}} (i = 1, 2, 3), hence l ∈ Γ_{μ_1}.
"⊆". Let l ∈ Γ_{μ_1} ∩ Γ_{μ_2} and let l̄ = (O_1l)^{α_1} ∩ (O_2l)^{α_2}.
First observe that homologous rays r_1, r_2, r_3 which meet l must be of the form r_i =
(Ō_iP̄)^{α_i^{-1}} for some P̄ ∈ l̄ ∩ μ_j (j = 1, 2), because r_i must lie in O_il, hence r_i^{α_i} must lie in
(O_il)^{α_i}, and r_i^{α_i} meets l̄ (i = 1, 2). In particular, if l̄ ∩ μ_1 ∩ μ_2 is a point, r_1, r_2, r_3 are
unique.
Case 1. l̄ meets μ_1 ∩ μ_2. Then either l̄ = μ_1 ∩ μ_2, whence l ∈ A by the observation
above, or l̄ does not lie in one of the planes, say l̄ ⊄ μ_1. Then the intersection l̄ ∩ μ_1 ∩ μ_2
is a point P̄. Again by the observation above, l must meet (Ō_iP̄)^{α_i^{-1}}. Thus l ∈ A.
Case 2. l̄ does not meet μ_1 ∩ μ_2. Let μ_j ∩ l̄ = P̄_j. By the observation above, (Ō_iP̄_j)^{α_i^{-1}}
meets l for i = 1, 2, 3 and j = 1, 2. In particular (Ō_3P̄_j)^{α_3^{-1}} meets l. But these two rays
span O_3l. This means the planes Ō_il̄ are coaxial (with axis l̄), and the planes O_il = (Ō_il̄)^{α_i^{-1}} are coaxial with
axis l. Thus l ∈ Ψ. □
Corollary 5.1 If μ_1, μ_2, μ_3, μ_4 are planes in general position, then

Ψ ∪ ⋃_{j=1,2,3} star(O_j) = ⋂_{i=1,...,4} Γ_{μ_i} .

References
1. Buchanan, T.: On the critical set for photogrammetric reconstruction using line tokens in P3(C). To appear.
2. Fano, G.: Studio di alcuni sistemi di rette considerati come superficie dello spazio a cinque dimensioni. Ann. Mat., Ser. 2, 21, 141-192 (1893).
3. Fano, G.: Nuove ricerche sulle congruenze di rette del 3° ordine prive di linea singolare. Mem. R. Accad. Sci. Torino, Ser. 2, 51, 1-79 (1902).
4. Hartshorne, R.: Algebraic Geometry. Berlin-Heidelberg-New York: Springer 1977.
5. Kliem, F.: Über Örter von Treffgeraden entsprechender Strahlen in eindeutig und linear verwandten Strahlengebilden erster bis vierter Stufe. Dissertation. Borna-Leipzig: Buchdruckerei Robert Noske 1909.
6. Krames, J.: Über die bei der Hauptaufgabe der Luftphotogrammetrie auftretenden „gefährlichen" Flächen. Bildmessung und Luftbildwesen (Beilage zur Allg. Vermessungs-Nachr.) 17, Heft 1/2, 1-18 (1942).
7. Liu, Y., Huang, T.S.: Estimation of rigid body motion using straight line correspondences: further results. In: Proc. 8th Internat. Conf. Pattern Recognition (Paris 1986). Vol. I, pp. 306-309. Los Angeles, CA: IEEE Computer Society 1986.
8. Rinner, K., Burkhardt, R.: Photogrammetrie. In: Handbuch der Vermessungskunde. (Hrsg. Jordan, Eggert, Kneissel) Band III a/3. Stuttgart: J.B. Metzlersche Verlagsbuchhandlung 1972.
9. Roccella, D.: Sugli enti geometrici dello spazio di rette generati dalle intersezioni de' complessi corrispondenti in due o più fasci proiettivi di complessi lineari. Piazza Armerina: Stabilimento Tipografico Pansini 1882.
10. Schmid, T.: Über trilinear verwandte Felder als Raumbilder. Monatsh. Math. Phys. 6, 99-106 (1895).
11. Semple, J.G., Kneebone, G.T.: Algebraic Projective Geometry. Oxford: Clarendon Press 1952. Reprinted 1979.
12. Severi, F.: Vorlesungen über Algebraische Geometrie. Geometrie auf einer Kurve, Riemannsche Flächen, Abelsche Integrale. Deutsche Übersetzung von E. Löffler. Leipzig-Berlin: Teubner 1921.
13. Sturm, R.: Die Lehre von den geometrischen Verwandtschaften. Leipzig-Berlin: B.G. Teubner 1909.
14. Walker, R.J.: Algebraic Curves. Princeton: University Press 1950. Reprint: New York: Dover 1962.
15. Zindler, K.: Algebraische Liniengeometrie. In: Encyklopädie der Mathematischen Wissenschaften. Leipzig: B.G. Teubner 1928. Band II, Teil 2, 2. Hälfte, Teilband A, pp. 973-1228.

This article was processed using the LaTeX macro package with ECCV92 style
Intrinsic Surface Properties from Surface
Triangulation

Xin CHEN and Francis SCHMITT

Ecole Nationale Supérieure des Télécommunications
46 Rue Barrault, 75013 Paris, France

1 Introduction

Intrinsic surface properties are those properties which are not affected by the choice of the
coordinate system, the position of the viewer relative to the surface, and the particular
parameterization of the surface. In [2], Besl and Jain have argued the importance of
the surface curvatures as such intrinsic properties for describing the surface. But such
intrinsic properties may be useful only when they can be stably computed. Most of the
techniques proposed so far for computing surface curvatures can only be applied to range
data represented in image form (see [5] and references therein). But in practice, it is not
always possible to represent the sampled data in this form, as in the case of closed
surfaces. So other representations must be used.
Surface triangulation refers to a computational structure imposed on the set of 3D
points sampled from a surface to make explicit the proximity relationships between these
points [1]. Such a structure has been used to solve many problems [1]. One question concerning
such a structure is what properties of the underlying surface can be computed from
it. It is obvious that some geometric properties, such as area, volume, axes of inertia,
surface normals at the vertices, can be easily estimated [1]. But it is less clear how to
compute some other intrinsic surface properties. In [8], a method for computing the min-
imal (geodesic) distance on a triangulated surface has been proposed. Lin and Perry [6]
have discussed the use of surface triangulation to compute the Gaussian curvature and
the genus of a surface. In this paper, we propose a scheme for computing the principal
curvatures at the vertices of a triangulated surface.

2 Principal Curvatures from Surface Triangulation

The basic recipe of the computation is based on the Meusnier and Euler theorems.
We first describe how to use them to compute the principal curvatures. Then the
concrete application of the idea to surface triangulation is presented. Throughout we take
‖·‖ as the L² norm, and ⟨·, ·⟩ and ∧ as the inner and cross product, respectively.
2.1 Computing Principal Curvatures by the Meusnier and Euler Theorems
Let N be the unit normal to a surface S at a point P. Given a unit vector T in the tangent
plane to S at P, we can pass through P a curve C ⊂ S which has T as its tangent vector
at P. Now let κ be the curvature of C at P, and cos θ = ⟨n, N⟩, where n is the normal
vector to C at P (see Fig. 1). The number

κ_T = κ cos θ   (1)

is called the normal curvature of C at P. Note that the sign of the normal curvature of
C changes with the choice of the orientation of the surface normal N. The Meusnier theorem
states that all curves lying on S and having the same tangent vector T at P have at

this point the same normal curvature [4]. Among all these curves, a particular one is
the normal section of S at P along T, which is obtained by intersecting S with a plane
containing T and N (see Fig. 1). For this curve, its normal n is aligned with N, with
the same or an opposite orientation. Thus from equation (1), its curvature satisfies the
expression κ = |κ_T|.


Fig. 1. Local surface geometry around point P. Fig. 2. Choice of vertex triples.

If we let the unit vector T rotate around N, we can define an infinite number of
normal sections, each of which is associated with a normal curvature κ_T. Among them,
there are two sections, which occur in orthogonal directions, whose normal curvature
attains its maximum and minimum, respectively [4]. These two normal curvatures are the
principal curvatures κ_1 and κ_2; their associated directions are the principal directions T_1
and T_2. The Euler theorem gives the relation between the normal curvature κ_T of an
arbitrary normal section T and κ_1, κ_2 as follows [4]:

κ_T = κ_1 cos²Φ + κ_2 sin²Φ ,   (2)

where Φ is the angle between T and T_1. Let

ξ = cos Φ / |κ_T|^{1/2} ,   η = sin Φ / |κ_T|^{1/2} .   (3)

Then relation (2) becomes

κ_1 ξ² + κ_2 η² = ±1 ,   (4)

where the sign of the right-hand side depends on the choice of orientation of the normal
N at P. Equation (4) defines what is known as the Dupin indicatrix of the surface S
at P. We see that the Dupin indicatrix is a conic defined by κ_1 and κ_2 in the tangent
plane to S at P. If P is an elliptic point, the Dupin indicatrix is an ellipse (κ_1 and κ_2
have the same sign). If P is a hyperbolic point, κ_1 and κ_2 have opposite signs, thus the
Dupin indicatrix is made up of two hyperbolas. If axes other than those in the directions
of principal curvature are used, the Dupin indicatrix takes the following general form:

A ξ² + 2B ξη + C η² = ±1 .   (5)
Given these two theorems, a possible scheme to calculate the principal curvatures is
as follows. Let n plane curves (not necessarily normal sections) passing through the point
P be given. For each of them, we can compute its curvature and tangent vector (and thus
its normal vector) at P. From the n computed tangent vectors, the surface normal at P
can be determined by applying a vector product to two of these tangent vectors. Using
equation (1), the normal curvatures along the n tangent directions are computed. Having
chosen two orthogonal axes on the tangent plane to S at P, we use equation (3) to
compute a pair of coordinates (ξ, η) for each direction (note that this time Φ is the angle
between the tangent vector and one of the chosen axes), thus we obtain an equation (5).
With n (n ≥ 3) such equations, the three unknowns A, B, and C can be solved. Finally,
the principal curvatures κ_1 and κ_2 are

κ_1, κ_2 = ( A + C ± √((A − C)² + 4B²) ) / 2 .   (6)

The principal directions are determined by performing a rotation of the two orthogonal
axes, in the tangent plane, by an angle φ where tan 2φ = 2B/(A − C).
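The scheme just described can be written down in a few lines; the sketch below (our own Python, with names of our choosing) fits A, B, C of (5) by least squares from normal curvatures measured in several tangent directions and then applies (6). The per-direction sign used on the right-hand side follows the Dupin construction and is a simplification of ours, not a detail taken from the paper.

```python
import numpy as np

def principal_curvatures(normal_curvatures, angles):
    """Fit the Dupin indicatrix coefficients A, B, C (equations (3)-(5)) by
    least squares from normal curvatures kappa_T measured in tangent
    directions at angles phi from a chosen tangent axis, then return the
    principal curvatures via the eigenvalue formula (6)."""
    rows, rhs = [], []
    for k, phi in zip(normal_curvatures, angles):
        if abs(k) < 1e-12:          # a flat direction gives no indicatrix point
            continue
        r = 1.0 / np.sqrt(abs(k))   # indicatrix point lies at distance 1/sqrt|kappa_T|
        xi, eta = r * np.cos(phi), r * np.sin(phi)
        rows.append([xi * xi, 2.0 * xi * eta, eta * eta])
        rhs.append(np.sign(k))      # +1 or -1 depending on the sign of kappa_T
    A, B, C = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    disc = np.sqrt((A - C) ** 2 + 4.0 * B * B)
    return (A + C + disc) / 2.0, (A + C - disc) / 2.0   # kappa_1 >= kappa_2
```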

2.2 Principal Curvatures from Surface Triangulation
Suppose that a surface triangulation has been obtained. This triangulation connects
each vertex P to a set Np of vertices, which are the surface neighbors of P in different
directions. It is to be noted that such a neighborhood relationship is not only a topological
one, but also a geometric one. Our goal is to calculate the principal curvatures of the
underlying surface at P using this neighborhood relationship.
We see from the above section that if a set of surface curves passing through P can be
defined, the calculation of the principal curvatures and directions will be accomplished
by simply invoking the Meusnier and Euler theorems. So the problem is reduced to
defining such a set of surface curves. As the neighborhood of P defined by the triangulation
reflects the local surface geometry around P, it is natural to define the surface curves
from the vertices in this neighborhood. A simple and direct way is to form n vertex
triples {T_l = (P, P_i, P_j) | P_i, P_j ∈ N_P, 1 ≤ l ≤ n}, and to consider curves interpolating
each triple of vertices as the surface curves. Two issues arise here: one is how to choose
two neighbor vertices P_i and P_j to form with P a vertex triple; the other is which kind of
curve will be used to interpolate each triple of vertices.
To choose the neighbor vertices P_i and P_j, we have to take into account the fact that
the vertices in the triangulation are sampling points of the real surface which are corrupted
by noise. Since in equation (1) the cosine function is nonlinear, it is better to use
surface curves which are as close to the normal section as possible. In this way, the angle θ
between the normal vector n of the surface curve and the surface normal N at P is close to
0 or π depending on the orientation of the surface normal, which falls in the low variation
range of the cosine function, thus limiting the effects of the computation error for the angle
θ. On the other hand, the plane defined by two geometrically opposite vertices (with
respect to P) and P is usually closer to the normal section than that defined by other
combinations of vertices. These considerations lead to a simple strategy: we first define a
quantity M to measure the geometric oppositeness of two neighbor vertices P_i and P_j as
(see Fig. 2): M = ⟨P − P_i, P_j − P⟩. We then calculate this quantity for all combinations
of neighbor vertices, and sort them in nonincreasing order. The first n combinations
of the vertices are used to form n vertex triples with P. This strategy guarantees that n
can always be greater than or equal to 3, which is the necessary condition for computing
the principal curvatures by the Meusnier and Euler theorems.
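A direct rendering of this pair-selection strategy is given below (our own Python sketch; the inputs are assumed to be NumPy arrays and the names are ours).

```python
import numpy as np

def opposite_pairs(P, neighbors, n=None):
    """Rank pairs of neighbor vertices of P by the geometric oppositeness
    M = <P - Pi, Pj - P> and return the top n pairs, which are then used to
    form the vertex triples (P, Pi, Pj)."""
    pairs = []
    for a in range(len(neighbors)):
        for b in range(a + 1, len(neighbors)):
            Pi, Pj = neighbors[a], neighbors[b]
            M = float(np.dot(P - Pi, Pj - P))
            pairs.append((M, Pi, Pj))
    pairs.sort(key=lambda t: -t[0])        # nonincreasing order of M
    return pairs[:n] if n is not None else pairs
```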
Having chosen the vertex triples, the next question is which kind of curve is to be
used for their local interpolation. In our case, the only available information being P and
its two neighbor vertices, their circumcircle appears to be the most stable curve which
can be computed from these three vertices. Such a computation is also very efficient.

So we use the circles passing through each triple of vertices as an approximation of the
surface curves.
Now suppose we have chosen a set of vertex triples {T_l | 1 ≤ l ≤ n}. Each vertex triple
T_l = (P, P_i, P_j) defines a plane intersecting the underlying surface and passing through
P. The center C_l of the circumcircle of these three vertices can be easily computed [4].
Thus, the curvature and the unit normal vector of this circle at P are κ_l = 1/‖C_l − P‖
and n_l = (C_l − P)/‖C_l − P‖, respectively. The unit tangent vector t_l at P is then

t_l = n_l ∧ (u ∧ v) / ‖n_l ∧ (u ∧ v)‖ ,   (7)

where u = P_i − P and v = P_j − P. Hence for each triple T_l, we obtain for its circumcircle
the curvature value κ_l, the tangent vector t_l, and the normal vector n_l at P. We can
then compute the surface normal N at P as

N = Σ_{m,n} N_mn / ‖Σ_{m,n} N_mn‖ ,   where   N_mn = (t_m ∧ t_n) / ‖t_m ∧ t_n‖ .   (8)

Note that the orientation of each N_mn must be chosen to coincide with the choice of
exterior (or interior) of the object (which can be decided from the surface triangulation).
Now, the normal curvature κ_{t_l} of the surface along the direction t_l can be obtained
by equation (1) as:

κ_{t_l} = κ_l cos θ ,   (9)

where θ is the angle between N and n_l. After choosing a coordinate system on the tangent
plane passing through P, we can use equation (3) to compute a pair of coordinates (ξ_l, η_l)
for each direction t_l and obtain an equation (5). Normally, n is greater than 3, so the three
unknowns A, B, and C are overdetermined. We can therefore use the least-squares
technique to calculate them by minimizing the function:

G = Σ_{l=1}^{n} ( A ξ_l² + 2B ξ_l η_l + C η_l² − δ )² ,   (10)

where δ = ±1 according to the orientation of the surface normal. The principal curvatures
and directions can then be obtained as mentioned in 2.1.
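The per-triple quantities of equations (7)-(9) reduce to a circumcircle computation; a small sketch follows (our own Python with hypothetical helper names, assuming NumPy arrays and a non-degenerate triple).

```python
import numpy as np

def circle_at_vertex(P, Pi, Pj):
    """Circumcircle of the vertex triple (P, Pi, Pj): returns the curvature
    kappa_l = 1/||C_l - P||, the unit normal n_l = (C_l - P)/||C_l - P||, and
    the unit tangent t_l of equation (7) at P."""
    u, v = Pi - P, Pj - P
    w = np.cross(u, v)                     # normal of the plane of the triple (u ^ v)
    # circumcenter C_l of the triangle (P, Pi, Pj)
    C = P + np.cross(np.dot(u, u) * v - np.dot(v, v) * u, w) / (2.0 * np.dot(w, w))
    k = 1.0 / np.linalg.norm(C - P)        # kappa_l
    n = (C - P) * k                        # n_l, already unit length
    t = np.cross(n, w)                     # eq. (7): n_l ^ (u ^ v), then normalise
    return k, n, t / np.linalg.norm(t)
```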

3 Experimental Results

In order to characterize the performance of the method proposed above (TRI), we compare
it with the method proposed by Besl and Jain (OP) [2], in which second order
orthogonal polynomials are used to approximate range data. This comparison is realized
on a number of synthetic data sets. It is well known that even with only the truncation
error inherent in data quantization, the curvature computation deteriorates gravely [2,5].
A common method to improve this is to first smooth the data with an
appropriate Gaussian filter and then retain the results in floating-point form [5]. But the
problem with such smoothing is that it is often performed in the image coordinate
system, which is not intrinsic to the surface in the image. The result is that the surface is
slightly modified, which leads to an incorrect curvature computation. So we will not
use the smoothed version of the image in our comparison. We generate three synthetic
range images of a planar, a spherical, and a cylindrical surface with the following parameters:

Plane:    0.0125x - 0.0125y + 0.1 f(x, y) = 1,   -30 ≤ x, y ≤ 30.
Sphere:   x² + y² + f(x, y)² = 100,   -10 ≤ x, y ≤ 10,   f(x, y) ≥ 0.
Cylinder: x² + f(x, y)² = 400,   -20 ≤ x, y ≤ 20,   f(x, y) ≥ 0.


For each kind of synthetic surface, we produce two data sets of different precision:
one is obtained by sampling the surface at the grid points and retaining the results in
floating-point form. This is referred to as the 32-bit image. It is then truncated to 8-bit
precision, which results in another data set called the 8-bit image. To obtain the surface
triangulation, we apply a Delaunay triangulation-based surface approximation [7] to each
image. Then both TRI and OP are applied to those vertices of the triangulation that are
inside the synthetic surface. The window size of the OP method is 7 × 7.
We calculate two assessment measures for these results: one is the mean μ of the
computed k_max and k_min, the other is their standard deviation σ. They are listed
in the following tables, with the true curvature values given in parentheses.
Plane                       k_max (0.0)                               k_min (0.0)
Image     μ-TRI      μ-OP       σ-TRI      σ-OP       μ-TRI      μ-OP       σ-TRI      σ-OP
32-bit    0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
8-bit     0.000379   0.002125   0.000594   0.002162  -0.000489  -0.002348   0.000726   0.002239

Sphere                      k_max (0.1)                               k_min (0.1)
Image     μ-TRI      μ-OP       σ-TRI      σ-OP       μ-TRI      μ-OP       σ-TRI      σ-OP
32-bit    0.100000   0.100242   0.000000   0.000654   0.100000   0.100163   0.000000   0.000225
8-bit     0.101514   0.108257   0.001898   0.005355   0.098985   0.097595   0.001645   0.022245

Cylinder                    k_max (0.05)                              k_min (0.0)
Image     μ-TRI      μ-OP       σ-TRI      σ-OP       μ-TRI      μ-OP       σ-TRI      σ-OP
32-bit    0.044019   0.050096   0.007106   0.000263   0.005353   0.000000   0.005911   0.000000
8-bit     0.044352   0.052607   0.007384   0.009240   0.005011   0.000272   0.005845   0.001517
From these preliminary results, we see that TRI performs in general better than
OP. It can therefore be very useful for applications where a surface triangulation is used.
Other results can be found in [3], where we have also given an explanation of how to
compute the Gaussian curvature without embedding the surface in a coordinate system.

References
[1] Boissonnat, J.D.: Geometric structures for three-dimensional shape representation. ACM Trans. on Graphics, vol. 3, no. 4, pp. 266-286, 1984.
[2] Besl, P.J., Jain, R.C.: Invariant surface characteristics for 3D object recognition in range image. Comput. Vision, Graphics, Image Processing, vol. 33, pp. 33-80, 1986.
[3] Chen, X., Schmitt, F.: Intrinsic surface properties from surface triangulation. Internal report, Télécom Paris, Dec. 1991.
[4] Faux, I.D., Pratt, M.J.: Computational geometry for design and manufacture. Ellis Horwood Publishers, 1979.
[5] Flynn, P.J., Jain, A.K.: On reliable curvature estimation. In Proc. Conf. Computer Vision and Pattern Recognition, June 1989, pp. 110-116.
[6] Lin, C., Perry, M.J.: Shape description using surface triangularization. In Proc. IEEE Workshop on Computer Vision: Repres. and Control, 1982, pp. 38-43.
[7] Schmitt, F., Chen, X.: Fast segmentation of range images into planar regions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 1991, pp. 710-711.
[8] Wolfson, E., Schwartz, E.L.: Computing minimal distances on polyhedral surfaces. IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 9, pp. 1001-1005, 1989.

This article was processed using the LaTeX macro package with ECCV92 style
Edge Classification and Depth Reconstruction by
Fusion of Range and Intensity Edge Data *

Guanghua Zhang and Andrew Wallace


Heriot-Watt University, 79 Grassmarket, Edinburgh EH1 2HJ, UK

Abstract. We present an approach to the semantic labelling of edges and
reconstruction of range data by the fusion of registered range and intensity
data. This is achieved by using Bayesian estimation within coupled Markov
Random Fields (MRF), employing the constraints of surface smoothness
and edge continuity.

1 Introduction

Fusion of intensity and range data can provide a fuller, more accurate scene description,
improving segmentation of range data and allowing semantic classification of edge labels.
In a previous paper [3], we presented an approach for the classification of edge labels at a
single site by combining the range and intensity edge data with a simplified Lambertian
shading model. This paper extends the approach to improve the results by incorporating
the constraints of edge continuity and surface smoothness in a relaxation process. The
whole fusion process is summarized and shown in Figure 1(a).

2 Edge classification from range and intensity data

Representing a pixel image as a set of sites on a square lattice, the edges are located
halfway between vertical and horizontal pixel pairs, as shown in Figure 1(b). Fusion
of range and intensity data can be simplified by the assumption of spatial registration
between the two sources, either by acquisition from the same viewpoint, or by geometric
transformation. A study of an edge shading model [3] showed the different appearances in
the two sources of intensity and range data with various types of edge labels. A complete
classification of edge labels is: {blade, extremal, fold, mark, shadow, specular, no_edge}.
An informal basis for the classification of edge labels is derived in [3]. In order to
obtain a quantitative estimate of the edge labels, a maximum likelihood estimation is
employed on a reduced set of edge labels, {blade, fold, mark, no_edge}. Extremal edges
are not distinguished from blade edges, nor are specular and shadow edges distinguished
from mark edges. This may be accomplished by separate analysis of surface curvature
adjacent to the edge and by variation of the lighting parameters.

3 Bayesian estimation and MRF image model

The initial edge labelling is based solely on the filtered range and intensity data at
single sites without consideration of the neighbourhood context of either adjacent edge
* The work has been supported by a TC scholarship from the Chinese Education Commis-
sion and the British Council, and by the LAIRD project which is led by BAe and funded
by the SERC/IED (GR/F38327:1551). The LAIRD project is a collaboration between BAe,
BAeSema Ltd, NEL, Heriot Watt University and the Universities of Edinburgh and Surrey.

[Figure 1 panels: edge sampling; coupled neighbourhood; depth neighbourhood; edge neighbourhood; basic edge cliques C22, C12, C14. (a) (b)]

Fig. 1. (a) The diagram of the fusion process. (b) The dual lattice representation, edge and
depth neighbourhoods, and three basic edge cliques.

or depth sites. In order to improve this initial estimate we apply the well established
process of relaxation, incorporating general constraints based on edge continuity and
surface smoothness.
The conditional probability of obtaining the estimate (R, L) of depth and edge labels
from (r, Δi) is expressed by:

P(R, L | r, Δi) = p(r, Δi | R, L) P(R, L) / Σ_{(R,L)} p(r, Δi | R, L) P(R, L)   (1)

where (r, Δi) are the depth observation and the filtered intensity discontinuity respectively.
To obtain the estimate (R̂, L̂) which maximizes the a-posteriori probability, we
make three assumptions similar to [1].
(a): the filtered intensity discontinuity Δi and the range observation r are conditionally
independent of each other given the object geometry R and the edge label L, i.e. p(r, Δi |
R, L) = p(Δi | R, L) p(r | R, L).
(b): given the edge label L_u at edge site u, the filtered intensity discontinuity Δi_u is
independent of the range data R, i.e. p(Δi_u | R, L_u) = p(Δi_u | L_u).
(c): given the range value R_m at the depth site m, the measurement r_m is independent
of the edge labels L, i.e. p(r_m | R_m, L) = p(r_m | R_m), which is the observation model
of the range sensor and assumed to be a Gaussian function.
Simplifying (1) with these assumptions, we have

P(R, L | r, Δi) = t_1 p(r, Δi | L, R) P(R, L) = t_1 p(Δi | L) p(r | R) P(R, L)   (2)

where t_1 is a constant, and P(R, L) describes the prior knowledge about the interaction
between depth and edge sites. At edge sites, P(R, L) is expressed by P(R, L) = p(R | L) P(L).
As edge labels are only associated with the change of range data, p(R | L) is
expressed by p(ΔR | L) p(ΔN | L) (ΔR and ΔN are conditionally independent given the
underlying edge label L). At depth sites P(R, L) = P(L | R) p(R); the first term shows
the depth estimates consistent with the current edge configuration, and the second the
probability density function of the range value R, which is assumed to be a uniform function.
Expressing (2) at depth and edge sites separately with the above expressions, we have

P(R, L | r, Δi) = { p(r | R) P(L | R) p(R)                  at depth sites
                  { p(Δi | L) p(ΔR | L) p(ΔN | L) P(L)      at edge sites      (3)
It is neither feasible nor desirable to consider the whole image context with reference to
a single depth or edge site within the dual lattice. Consequently, we reduce the scope
of the analysis from the whole image to a local neighbourhood by application of the
MRF model. Using the Markovian property, the constraints of global edge labelling L
and surface smoothness are expressed locally by

P(L) = Π_u [P(N_e(u) | L_u) P(L_u)]   and   P(R, L) = Π_m [P(L, N_d(m) | R_m) p(R_m)]   (4)

where P(L_u) is the prior probability of an edge label, and N_e(u), N_d(m) are the neighbourhoods
of edge and depth sites respectively.
The interaction between depth and edge sites under the surface smoothness and edge
continuity constraints can also be expressed by a Gibbs energy function via the Clifford-Hammersley
theorem, with a temperature parameter T [1]. The use of temperature introduces
an extra degree of freedom by analogy with temperature control of a material
lattice. In the context of refinement of the edge labels and reconstruction of the depth
data, T intuitively reflects whether the source data or the prior knowledge of context is more
important. Putting (4) into (3) and taking negative logarithms to convert into an energy
form, we have

E(R, L | r, Δi) = { Σ_m [T γ_m (R_m − r_m)²/(2σ_r²) + E(L, N_d(m) | R_m) + t_2]   at depth sites
                  { Σ_u [T E(Δ_u | L_u) + E(P(L_u)) + E(N_e(u) | L_u)]            at edge sites    (5)

where t_2 is a constant, σ_r is the standard deviation of the Gaussian noise in the range data, and
γ_m is 0 if there is no depth observation at site m and 1 otherwise. Δ_u is a discontinuity
vector at edge site u representing the changes of range, surface orientation and reflectance [3].

4 Modeling edge continuity

The first assumption we make in modeling an edge neighbourhood is continuity of edge
direction. This common assumption implies that corners and junctions need special treatment.
Furthermore, we consider a second order neighbourhood, as shown in Figure 1(b),
and assume that there is a single edge within each neighbourhood.
In (5), each edge segment in a neighbourhood has one of four edge labels, {blade,
fold, mark, no_edge}. The total number of possible edge configurations is 4^9 without
considering symmetry. We assume that the edge continuity and label consistency can be
treated independently. Thus a neighbourhood is decomposed into two parts, one binary
clique to deal with the edge connectivity, and the other dealing with the compatibility between
mixed labels when the label of the central segment is blade, fold or mark.

P(N_e(u) | L_u) = { P(N_e^b(u) | L_u^b)                        if L_u is no_edge
                  { P(N_e^b(u) | L_u^b) P_c(N_e(u) | L_u)      otherwise           (6)

where L_u^b = 0 if L_u is no_edge, and 1 otherwise. N_e^b(u) is the binary edge neighbourhood.
P_c(N_e(u) | L_u) is the compatibility between the central label L_u and the other edge
labels in the neighbourhood. The number of states is reduced from 4^9 to 2^9 · 2^3.
We assign an 8-bit code (in octal form) to each binary edge configuration, in which
each bit represents the status of one of the 8 neighbours as shown in Figure 1(b). In
applications of the MRF model, e.g. [1], the neighbourhood or clique parameters have
been derived by hand; we derived them separately from simulations [4].

5 Modeling surface smoothness

A widely used smoothness model is the thin membrane model, which is a small deflection
approximation of the surface area,

E = (1/2) ∫∫ [ (∂f/∂x)² + (∂f/∂y)² ] dx dy .   (7)

The derivative is represented simply by the difference of range values at adjacent sites.
If the edge label between depth sites m and n is mark or no_edge, the depth sites are on
the same surface and a penalty term is given as β(R_m − R_n)² for any violation of surface
smoothness; otherwise a penalty is given from the edge process for the creation of an edge.
The energy E(L, N_d(m) | R_m) is the summation of penalties over the neighbourhood.
Using the thin membrane model, we obtain a quadratic energy function at depth site
m and an analytic solution

R̂_m = [ T γ_m r_m/(2σ_r²) + β Σ_{n ∈ N_d(m)} L_mn R_n ] / [ T γ_m/(2σ_r²) + β Σ_{n ∈ N_d(m)} L_mn ] ,   (8)

where L_mn is 0 if the edge label between depth sites m and n is blade or fold, and 1
otherwise. This shows that the best depth estimate R̂_m is a weighted sum of the neighbours
and the raw observation at this site.
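A minimal sketch of this analytic update is given below (our own Python; the data structures, parameter names and values are assumptions based on Section 7, not the authors' implementation).

```python
def update_depth(m, R, r, gamma, neighbours, edge_label, T=1.0, beta=2.0, sigma_r=2.0):
    """Depth update of equation (8) under the thin-membrane prior: a weighted
    average of the raw observation r[m] and those neighbours not separated
    from site m by a blade or fold edge."""
    w_obs = T * gamma[m] / (2.0 * sigma_r ** 2)      # data-term weight (0 if no observation)
    num, den = w_obs * r[m], w_obs
    for n in neighbours[m]:
        L_mn = 0.0 if edge_label(m, n) in ("blade", "fold") else 1.0
        num += beta * L_mn * R[n]
        den += beta * L_mn
    return num / den if den > 0 else R[m]            # keep the old estimate if unconstrained
```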

6 Iterative energy minimization

Local HCF [2], developed from HCF [1], updates the site within an edge neighbourhood
that gives the largest energy decrease when replaced by a new estimate. The interaction between
edge and depth processes illustrates the importance of the updating order, as early updated
sites have influence over the other sites. The updating is switched between the depth
and edge processes depending on which one reduces the energy the most.
Once a site is updated, its energy decrease is zero, i.e. there is no better estimate, and therefore
less unstable sites can be updated. The changes of edge labels and range values are
propagated to other sites.
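Schematically, the update loop can be pictured as the greedy procedure below (our own sketch; the real Local HCF of [2] maintains a priority structure over neighbourhoods rather than rescanning every site, and the callbacks here are hypothetical).

```python
def greedy_energy_minimisation(sites, energy_decrease, commit_best, max_sweeps=10):
    """Commit, at each step, the single edge or depth update with the largest
    energy decrease, until no update lowers the energy or a sweep budget is
    exhausted."""
    for _ in range(max_sweeps * len(sites)):
        best_site, best_gain = None, 0.0
        for s in sites:
            gain = energy_decrease(s)        # decrease from the best new label/value at s
            if gain > best_gain:
                best_site, best_gain = s, gain
        if best_site is None:                # converged: no update lowers the energy
            break
        commit_best(best_site)               # apply it; neighbours' gains change as a result
```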

7 Experiments and discussion

So far we have experimented with the fusion algorithm on both synthetic and real data,
but only results on real data are shown in Figure 2. We interpolate the gap between
stripes by a linear function along each row, as the laser stripes are projected on the scene
vertically. The standard deviations of the noise are assumed to be σ_r = σ_i = 2.0. The
parameters used are T = 1.0, β = 2.0 and the number of iterations is 10.

Dense depth data are obtained by reconstruction from sparse range data. Even
though fold edges are sensitive to noise due to the application of a second order derivative
filter, the results are greatly improved by the fusion algorithm. The homogeneity criteria
defined locally have limitations for extracting reliable edge labels. Some edges are near to
each other, but outside the scope of the edge neighbourhood. Although more iterations
may improve the results, it may be more productive to further process the data using
some global grouping. For example, the Hough transformation can be used to extract reliable
space lines and arcs. Surface patches may also be extracted from the labeled edges and
reconstructed depth to generate an intermediate level object description.

References

1. P. B. Chou and C. M. Brown. The theory and practice of Bayesian image labeling. Int. J.
of Comput. Vision, 4:185-210, 1990.
2. M. J. Swain, L. E. Wixson, and P. B. Chou. Efficient parallel estimation for Markov random
fields. In L. N. Kanal et al. (editors), Uncertainty in Artificial Intelligence 5, pages 407-419,
1990.
3. G. Zhang and A. M. Wallace. Edge labelling by fusion of intensity and range data. In Proc.
British Machine Vision Conf., pages 412-415, 1991.
4. G. Zhang and A. M. Wallace. Semantic boundary description from range and intensity data.
to appear, IEE Int. Conf. on Image Processing and its Applications, 1992.

Fig. 2. Results of real data widget. From left to right, top row: original intensity, range data
and depth reconstruction; bottom row: classified blade, fold and mark edges.

This article was processed using the LaTeX macro package with ECCV92 style
Image Compression and Reconstruction Using a 1-D Feature
Catalogue
Brian Y.K. Aw 1, Robyn A. Owens 1 and John Ross 2
1 Department of Computer Science, University of Western Australia.
2 Department of Psychology, University of Western Australia.

Abstract. This paper presents a method of compressing and reconstructing
a real image using its feature map and a feature catalogue that comprises
feature templates representing the local forms of features found in a number
of natural images. Unlike most context-texture based techniques that assume
all feature profiles at feature points to be some form of graded step, this
method is able to restore the shading in the neighbourhood of a feature point
close to its original values, whilst maintaining high compression ratios of
around 20:1.

1. Introduction
The image compression ratio achieved by early image coding techniques based on information
theory operating on natural images saturated at a value of 10:1 in the early eighties [KIK1].
Later techniques that code an image in terms of its feature map managed to obtain higher
compression ratios but at a sacrifice of image quality, namely the loss of the original local
luminance form (i.e. feature profiles) at feature points. The technique described in this paper is
able to correct these defects yet maintains compression ratios around 20:1.
An image can be decomposed into two parts: a feature map and a featureless portion.
In other existing techniques (see the review article [KIK1]), the feature map is thresholded and
only the location of feature points is coded. Consequently, all information about the original
luminance profiles that give rise to those feature points is lost in the reconstruction phase,
where artificial graded step profiles are used instead. To recover this lost information, our
technique makes use of a common feature catalogue that consists of a number of 1-dimensional
feature templates. Each template describes a feature profile in terms of
normalised mean luminance values and standard deviations at various pixel locations of the
profile. In [AOR1], it has been shown that a catalogue whose templates approximate closely
most feature luminance profiles in many natural images can be derived from some appropriate
sample images. In the coding phase of our technique, both the locations of feature points and
pointers indicating feature types are also retained. Each pointer points to the feature template
in the catalogue that best approximates the original luminance profile in the neighbourhood of
the indexing feature point. Subsequently, the information encoded in the pointers is used to
recover the luminance profile of features at the various locations in the image. The same
feature catalogue is used for all images. In our technique, we also encode the featureless
portion of an image in terms of a small number of Fourier coefficients which are used at a later
stage to recover the background shading in the reconstructed image.

2. The feature catalogue


The feature catalogue mentioned above is shown in Fig. 1. This feature catalogue is formed by
a 2-layer recurrent neural network guided by the local energy operator. Please refer to [AOR1]
and [MO1] for further details on the network and the local energy model. The horizontal axis
is the spatial dimension in pixel units (1 to 5), and the vertical axis represents luminance values
from 0 (lowest point) to 255 (highest point). The feature is located at pixel 3. The mean
luminance values of each template are marked by horizontal white bars. A Gaussian function is
plotted as shading along the vertical axis for each pixel location of a feature profile. The wider
the spread of the shading, the larger the standard deviation value.

Figure 1. Feature Catalogue.

3. The coding process


An image to be compressed is coded in portions: the feature regions and the featureless
regions. The feature regions consist of feature points (i.e. pixel locations) and the luminance
profiles in the neighbourhood of the points (i.e. feature profiles). The feature points are
defined at the peaks of the local energy according to the local energy model. The original
feature profile is rescaled (into x_i) such that it has least-square error with the respective feature
template (μ_i and σ_i) in the common catalogue. The similarity index (z) for the comparison
between the scaled feature profile (x_i) and a feature template is defined as follows:

z = (1/N) Σ_{i=1}^{N} exp[ -(x_i - μ_i)² / σ_i² ]   (1)

where N (= 5) is the number of pixels in the 1-D templates.

The template that produces the highest z value is used to represent the feature profile
at the feature point. A pointer from this feature point is then set up and coded with the best-
matched template number and the necessary rescaling parameters (a d.c. shift and a multiplier).
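A compact version of this matching step is sketched below (our own Python, assuming NumPy arrays; the least-squares fit of the d.c. shift and multiplier is our reading of the rescaling described above).

```python
import numpy as np

def best_template(profile, templates):
    """Match a 5-pixel feature profile against the catalogue using the
    similarity index z of equation (1).  `templates` is a list of (mu, sigma)
    pairs; returns the highest z, the template index and the rescaling
    parameters (multiplier a, d.c. shift b)."""
    profile = np.asarray(profile, dtype=float)
    best = (-1.0, None, None)
    for k, (mu, sigma) in enumerate(templates):
        mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
        A = np.column_stack([profile, np.ones_like(profile)])
        (a, b), *_ = np.linalg.lstsq(A, mu, rcond=None)       # a*profile + b ~ mu
        x = a * profile + b
        z = np.mean(np.exp(-((x - mu) ** 2) / sigma ** 2))    # equation (1)
        if z > best[0]:
            best = (z, k, (a, b))
    return best
```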
For example, the feature point map of an original image "Baby" (Fig. 2(a)) is shown
in Fig. 2(b). This map combines the feature points found in the horizontal and vertical
directions. Centred at each (black dot) location, the scaled 1-D profile of the best matched
feature type from the catalogue is shown in Fig. 2(c). Some features are in the horizontal
direction and some are in the vertical direction. Visually, it is evident that the information
retained in Fig. 2(c), i.e. location plus local form, is richer than just the locational information
itself represented in Fig. 2(b).
Besides coding the feature portion of an image, we also code a low-pass version of its
featureless portion in terms of the coefficients of its low frequency harmonics. For a 256×256-pixel
image, we retain the lowest 10×10 2-dimensional complex coefficients of its FFT (Fig. 2(d)).

We can attain a compression ratio of 20:1 if we (a) assume around 2% of the original
image pixels are feature points in the horizontal and/or vertical direction, and (b) use the
following code words for the various messages of an image (a rough arithmetic check follows the list):

- an average of 4.5 bits per feature point to code the positional information in Huffman code;
- a 5-bit code word to code the d.c. parameter;
- a 4-bit code word for the multiplying (scaling) parameter;
- a 3-bit code word to code the feature template number;
- a 1-bit code word to indicate the 1-D feature direction (horizontal or vertical);
- a 16-bit code word for the complex FFT coefficients.
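As a rough check of the arithmetic (our own back-of-the-envelope computation using the figures listed above, not taken from the paper):

```python
# 256x256 image at 8 bits/pixel, 2% of pixels assumed to be feature points.
pixels = 256 * 256
original_bits = pixels * 8
feature_points = 0.02 * pixels
bits_per_point = 4.5 + 5 + 4 + 3 + 1      # position, d.c., scale, template, direction
fft_bits = 10 * 10 * 16                   # lowest 10x10 complex FFT coefficients
compressed_bits = feature_points * bits_per_point + fft_bits
print(original_bits / compressed_bits)    # roughly 21:1, consistent with the quoted 20:1
```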

Figure 2. (a) Original image "Baby". (b) The locations of feature points (shown as black
dots). (c) The template profile of the best matched feature template from the catalogue for
each feature point in the image is superimposed at the location of that point. (d) The low-passed
version of "Baby". Only the lowest 10×10 2-dimensional complex coefficients of the
FFT of "Baby" are retained.

4. Results of reconstruction process


There are two stages involved in the reconstruction process. First, luminance profiles at
feature points are retrieved from the catalogue of feature templates by means of the coded
pointers. Second, an iterative process toggles between local averaging and Fourier coefficient
foldback. Local averaging smooths unwanted artifacts in featureless regions but alters the
original low frequency harmonics of an image. The coded Fourier coefficients are used to
reinforce these harmonics at each iteration.

Figure 3. (a) Initial stage of reconstruction of the image "Baby". Feature profiles and low-pass
FFT coefficients are retrieved from the compressed data. (b) Result after 10 iterations.
(c) Result after 50 iterations.

The 2-dimensional local averaging process takes place in local neighbourhoods of
sizes that depend on the location of the current point in the featureless regions. The nearer the
current point is to a feature location, the smaller the size of the averaging neighbourhood.
Only smoothness up to the first order is enforced by the averaging action. An illustration of the
reconstruction process is shown in Fig. 3. The first image (a) is the initial result when the
coded feature local forms and the low-pass data are retrieved. This is followed by the local
averaging operation interlaced with the reinforcement of the coded low-pass Fourier coefficients.
The results after 10 and 50 rounds of iterations are shown in (b) and (c) respectively.
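A heavily simplified version of this toggle is sketched below (our own Python; the fixed 3×3 averaging window and the corner layout assumed for the stored FFT block are simplifications of ours, whereas the paper uses averaging neighbourhoods whose size shrinks near feature locations).

```python
import numpy as np

def reconstruct(init, low_fft, feature_mask, iterations=50):
    """Alternate local averaging on featureless pixels with foldback of the
    stored low-frequency FFT coefficients; pixels marked in `feature_mask`
    (those whose profiles came from the catalogue) are kept fixed."""
    img = init.astype(float).copy()
    n = img.shape[0]
    k = low_fft.shape[0]
    for _ in range(iterations):
        padded = np.pad(img, 1, mode="edge")                       # 3x3 box average
        avg = sum(padded[dy:dy + n, dx:dx + n]
                  for dy in range(3) for dx in range(3)) / 9.0
        img = np.where(feature_mask, img, avg)
        F = np.fft.fft2(img)                                       # Fourier foldback
        F[:k, :k] = low_fft                                        # assumed storage layout
        img = np.real(np.fft.ifft2(F))
    return img
```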

5. More experimental results


The coding and reconstruction scheme proposed in this paper is used to test three other natural
images: "Machine", "Animal" and "X-Ray", all at three different scales (Fig. 4). If L_n denotes
the scale of an image of size n×n×8-bit, then the three scales shown in Fig. 4 are L_256, L_128 and
L_64. The upper rows are the original images and the lower rows the reconstructed ones.

Figure 4. Test images. The original images at different scales are in the upper rows and the
reconstructed ones are in the lower rows. The images are named "Baby" (top-left), "Animal"
(top-right), "Machine" (bottom-left) and "X-Ray" (bottom-right). The three different image
sizes are 256×256, 128×128 and 64×64 pixels.

An error percentage per pixel on the basis of a 256-level grey scale is computed
(Table 1) as an indication of the difference between the original image and its reconstructed
version. The error is calculated as the average of the absolute difference in luminance values
between corresponding pixel locations in the two images.

Image size   "Baby"   "Animal"   "X-Ray"   "Machine"
256×256       1.6       1.1        1.5        1.4
128×128       1.6       1.0        1.4        1.8
64×64         1.5       1.1        1.7        2.1

Table 1. Average percentage difference per pixel between the original and reconstructed
images.

6. Conclusions
It was noted that recent image compression and reconstruction techniques achieved high
compression ratios at the expense of losing information at feature points. This is due to the
fact that a more fundamental question was yet to find its solution, that is: What is a feature? In
a previous paper [AOR1], it is shown that features can assume a wide variety of local
luminance forms. In our analysis, a catalogue of 16 templates was sufficient to accommodate
most feature profiles encountered in a number of natural images. By setting up pointers to this
catalogue, we have shown that the local luminance forms of features can be preserved in a
compressed format. Subsequently, local feature forms are reproduced faithfully in the
reconstruction process.
It is also demonstrated that a 2-dimensional averaging algorithm, bolstered by the
incorporation of the FFT coefficients of the lower harmonics of the original image, is able to
reconstruct a good quality image from its compressed feature map.
In terms of efficiency, this scheme achieves a compression ratio of around 20:1 with
little sacrifice in image quality. A higher compression ratio is within reach by incorporating
more efficient techniques in the encoding of feature maps. One possible way is to first obtain a
closed contour by using standard region-growing methods in the local energy maps.
It might also be possible to make further gains by taking advantage of the fact that
similar features recur at different scales (see also [AOR1]). Reconstruction might be possible
in stages, beginning at the coarsest scale and then moving up the scale pyramid, introducing at
finer scales only features that first occur at those scales. The scheme of transmitting image
data in stages according to scales of interest makes good sense in terms of the use of the
transmitting media.

References
[AOR1] Aw, B.Y.K., Owens, R.A., Ross, J.: A catalogue of 1-d features in natural images.
Manuscript submitted to Neural Computation. (1991)
[KIK1] Kunt, M., Ikonomopoulos, A., Kocher, M.: Second-generation image-coding
techniques. Proc. of the IEEE. 73 (1985) 549-574
[MO1] Morrone, M.C., Owens, R.A.: Feature detection from local energy. Pat. Recog.
Letters. 6 (1987) 303-313.
Canonical Frames for Planar Object Recognition*

Charles A. Rothwell 1, Andrew Zisserman 1, David A. Forsyth 2 and Joseph L. Mundy 3
1 Robotics Research Group, Department of Engineering Science, Oxford University, England.
2 Department of Computer Science, University of Iowa, Iowa, USA.
3 The General Electric Corporate Research and Development Laboratory, Schenectady, NY, USA.

Abstract. We present a canonical frame construction for determining projectively
invariant indexing functions for non-algebraic smooth plane curves.
These invariants are semi-local rather than global, which promotes tolerance
to occlusion.
Two applications are demonstrated. Firstly, we report preliminary work
on building a model based recognition system for planar objects. We demonstrate
that the invariant measures, derived from the canonical frame, provide
sufficient discrimination between objects to be useful for recognition.
Recognition is of partially occluded objects in cluttered scenes. Secondly,
jigsaw puzzles are assembled and rendered from a single strongly perspective
view of the separate pieces. Both applications require no camera calibration
or pose information, and models are generated and verified directly from
images.

1 Introduction

There has been considerable recent success in using projective invariants of plane alge-
braic curves as index functions for recognition in model based vision [10, 15, 23, 28]. Less
attention has been given to invariants for smooth non-algebraic curves. In this paper we
present a novel and simple method of constructing a family of invariants for non-convex
smooth curves.
Lamdan et al. [18] proposed and implemented a canonical frame construction. We
improve on this in two important ways:
1. The transformation here is projective not affine. Central projection between two
planes is a type of projective transformation, not subject to the limitations on viewing
distance required for the affine approximation to hold. The affine transformation is only
valid if the object depth variation is small compared to the camera viewing distance.
Of course, projection includes the case that the transformation might actually be
affine, because the affine group is a sub-group of the projective group.
2. Recognition is entirely via index functions based on projective invariants. In [18]
recognition was a mixture of indexing and Hough style voting.
As has been argued elsewhere [10, 23], there is considerable benefit in using invariants
to imaging transformations as indexing functions for generating recognition hypotheses.
In particular, such functions only involve image measurements and avoid comparison
* CAR acknowledges the support of GE. AZ acknowledges the support of the SERC. DAF
acknowledges the support of Magdalen College, Oxford and of GE. JLM acknowledges the
support of the GE Coolidge Fellowship. The GE CRD laboratory is supported in part by the
following: DARPA contract DACA-76-86-C-007, AFOSR contract F49620-89-C-003.

against each object in the model library. Complexity is then O(i^k) rather than O(A i^k m^k)
for the pose based comparison, where i is the number of image features, A the number
of models, m the number of features per model, and k the number of features needed
to compute invariants, or determine transformations where invariants are not used 4.
Recognition hypotheses are verified in both cases by back-projection from models to
images, and determining overlap of projected model with image curves.
This paper largely follows the path suggested by Lamdan et al. [18], where a very
good discussion is given of reasonable requirements for curve representation in order to
facilitate recognition tolerant to occlusion and clutter. Briefly, indices should be local
and have some redundancy (i.e. several per outline), so if one index is occluded there
is a good chance recognition can proceed on other visible parts; they should be stable,
so small perturbations in the curve (due to image noise) do not cause large fluctuations
in index value; and they should have sufficient discriminatory power over models in the
library (so all models do not have similar index values). All of these requirements are
satisfied by the bitangent construction described here. Provided the object outline is
sufficiently rich in structure there will be several such constructions for each object, and
thus redundancy in the representation giving partial immunity to occlusion.
We briefly review previous methods for curve recognition under distorting imaging
transformations in section 2. The canonical frame construction and invariant measures
are described in section 3. We apply these techniques to model based recognition for a
library of planar objects of arbitrary (but non-convex) shape. They are recognised from
single perspective views (no affine approximation is assumed) in scenes in which there
may be partial occlusion by other known objects, or unknown clutter. The process does
not require camera calibration. This is described in section 4. Finally, in section 5 we
show how these measures can be used to reassemble a jigsaw.

2 Background

The recognition of silhouettes of planar objects under 2D similarity transformation (plane
rotation, translation and isotropic scaling) and affine transformation has been extensively
studied. The curve differential invariant curvature κ and the (integral) invariant s (arc
length) have played a significant role because they are clearly unaffected by the action of
the plane Euclidean group. Matching of these invariant curve "signatures" κ(s), or their
integral θ(s) (where dθ/ds = κ), is routine in the vision literature [1]. Unfortunately, such
differential invariants for projection (called Wilczynski's invariants [19, 29]) require 7th
order derivatives. This is clearly numerically infeasible. Even affine projection requires
5th order derivatives. Such high order derivatives are required to give invariance to both
projection and reparameterisation 5.
Given the impracticality of using differential invariants directly, a number of methods
have been derived for matching smooth curves despite affine or projective distortion:

1. Semi-differential invariants:
An ingenious method, proposed and implemented independently by Van Gool et
4 This complexity analysis is not for the asymptotic case as we assume that the library, imple-
mented as a hash table, is sparse. Should this not be the case we increase the dimension of
the library by using further invariants.
5 Recall that the familiar Euclidean curvature has this invariance:
κ = (x'(t) y''(t) − x''(t) y'(t)) / (x'(t)² + y'(t)²)^(3/2), irrespective of the parameterisation t.
That is, t can be replaced by f(τ) without affecting the value of κ.

al. [27] and Barrett et al. [3], is to trade derivatives at a point for more points. They
demonstrate that at a combinatorial cost (some "reference" points must be matched)
projective differential invariants can be derived requiring only first or second deriva-
tives.
2. Representation by algebraic curves:
Since invariants for algebraic curves are so well established it is natural to try and
exploit them by "attaching" algebraic curves to smooth curves. The algebraic invari-
ants of these attached curves are then used to characterise the non-algebraic curve.
This is the approach taken in [12] for affine invariance and [9, 17] for projective in-
variance. The problem here is that such methods tend to be global. Consequently,
the associated algebraic curves and their invariants change if part of the curve is
occluded.
3. Distinguished points:
A common method is to determine distinguished points on the curve, such as in-
flections and corners, which can be located before and after projection. Such points
then effectively represent the curve - either to determine the transformation (e.g.
alignment[14]) or to form algebraic invariants. The disadvantage is curve information
between these points is effectively wasted.
4. Distinguished frame:
The goal is to get to some distinguished frame from any starting point; usually the
frame corresponding to the plane of the object. A typical method is to maximise
a function over all possible transformations - the transformed frame producing the
function maximum determines the distinguished frame. Brady and Yuille considered a
function measuring compactness over orthography [7, 11]; Witkin and others texture
isotropy (over orthography) [5, 16, 30]; Marinos and Blake texture homogeneity (over
perspectivities) [20]; and more recently Blake and Sinclair [6] with compactness over
projectivities. Once in the distinguished frame any measurements act as invariants
(because the measurements are independent of the original frame and transforma-
tion). Again this is a global approach and degrades with occlusion. There are also
problems of uniqueness if the cost function is not convex, i.e. there are many local
maxima.
5. Canonical frame:
Distinguished points are used to transform a portion of the object curve to a canonical
frame [18]. As for the distinguished frame, any measurement made in this frame is
an invariant. However, the canonical frame does not carry over the disadvantages:
i) it is semi-local (depends on more than a single point) but is not global; ii) the
transformation to the canonical frame is unique.

3 Canonical Frame Construction

3.1 Projective Transformations

A projective transformation between two planes is represented as a 3 x 3 matrix acting
on homogeneous coordinates of the plane. The homogeneous representation means
that only ratios of matrix elements are significant, and consequently the transformation
has 8 degrees of freedom. This transformation models the composed effects of 3D rigid
rotation and translation of the world plane (camera extrinsic parameters), perspective
projection to the image plane, and an affine transformation of the final image (which
covers the effects of changing camera intrinsic parameters). Clearly, all of these separate

transformations cannot be uniquely recovered from the single 3 x 3 matrix, since there
are 6 unknown pose parameters, and 4 unknown camera parameters (camera centre, focal
length and aspect ratio). We therefore have 10 unknowns with 8 constraints.
The mapping of four points between the planes is sufficient to determine the trans-
formation matrix T (each point provides two constraints, therefore 4 independent points
provide 4 x 2 = 8 constraints). Corresponding points (x_i, y_i) and (X_i, Y_i) are represented
by homogeneous 3-vectors (x_i, y_i, 1)^T and (X_i, Y_i, 1)^T. The projective transformation
x = TX is:

    k (x_i, y_i, 1)^T = T (X_i, Y_i, 1)^T

where k is an arbitrary non-zero scalar. Eliminating k gives eight simultaneous equations
linear in the matrix elements:

    x_i (T_g X_i + T_h Y_i + 1) = T_a X_i + T_b Y_i + T_c
    y_i (T_g X_i + T_h Y_i + 1) = T_d X_i + T_e Y_i + T_f

with i ∈ {1, ..., 4}. These are straightforward to solve, for example by Gaussian elimination.
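As a concrete illustration only (an assumed sketch, not code from the paper; the element names T_a, ..., T_h and the normalisation T_33 = 1 follow the equations above), the eight linear equations can be solved numerically as follows:

```python
import numpy as np

def projectivity_from_4pts(X, x):
    """Solve the eight linear equations above for T (with T[2,2] = 1, assuming
    the true element is non-zero, as it is for generic configurations).
    X holds the four source points (X_i, Y_i), x the four targets (x_i, y_i);
    both are arrays of shape (4, 2)."""
    A = np.zeros((8, 8))
    b = np.zeros(8)
    for i, ((Xi, Yi), (xi, yi)) in enumerate(zip(X, x)):
        # x_i (T_g X_i + T_h Y_i + 1) = T_a X_i + T_b Y_i + T_c
        A[2 * i] = [Xi, Yi, 1, 0, 0, 0, -xi * Xi, -xi * Yi]
        b[2 * i] = xi
        # y_i (T_g X_i + T_h Y_i + 1) = T_d X_i + T_e Y_i + T_f
        A[2 * i + 1] = [0, 0, 0, Xi, Yi, 1, -yi * Xi, -yi * Yi]
        b[2 * i + 1] = yi
    t = np.linalg.solve(A, b)            # (T_a, T_b, T_c, T_d, T_e, T_f, T_g, T_h)
    return np.append(t, 1.0).reshape(3, 3)
```

In the canonical frame construction of section 3.2 the target points (x_i, y_i) would simply be the corners of the unit square.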
Projectivities form a group, so every action has an inverse, and the composition of two
projectivities is also a projectivity. Consequently two images, from different viewpoints,
of the same object are related by a projectivity. This result is used in the verification
stage of matching.

3.2 Obtaining Four Distinguished Points

The aim here is to exploit a construction that is preserved under projection. Certain
properties, such as tangency and point of tangency are preserved by projection [26]. We
use tangency to select 4 distinguished points on the curve (see figure 1) and then de-
termine the projection that maps these to the corners of a unit square in the canonical
frame. This projectivity is then used to map the curve into this frame. Figure 2 demon-
strates this process for one concavity of a spanner. The object curve, and any projective
view of it, are mapped into the same curve. Consequently, any (metric) measurements
made in this frame are invariant descriptors and hence may be used as index functions to
recognise the object. For example the location of any point in the frame is an invariant;
it is not necessary to use Euclidean invariants such as curvature.
Lamdan et al. [18] used bitangents to obtain two of three points to define a canonical
frame under affine transformations. The third point was obtained by introducing a line
parallel to the bitangent line in contact with the apex of the concavity. Since parallelism is
not preserved under projective transformations, we use tangency conditions to define our
third and fourth points. The selection of the corners of a unit square as the corresponding
points in the canonical frame is arbitrary - any four points, no three of which are collinear,
will do.
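To illustrate the mapping itself, the following hedged sketch (not the authors' code) applies a given 3 x 3 projectivity T, for instance one sending the four distinguished points to the unit-square corners, to every point of a concavity curve:

```python
import numpy as np

def map_to_canonical(curve_xy, T):
    """Map an (N, 2) array of image curve points through the projectivity T,
    working in homogeneous coordinates and dehomogenising afterwards."""
    pts = np.asarray(curve_xy, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y, 1)
    mapped = (T @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]              # divide out the scale
```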
Alternative constructions are possible using other projectively preserved properties.
For example, inflections can be used in two ways: i) to define a distinguished point on
the curve; and ii) to define a line which is tangent at the inflection (3 point contact with
the curve). If a concavity contains an inflection (and therefore it will necessarily have at
least two inflections), then the bitangent contact points and inflections can be used as
the four correspondence points. We believe a construction based on inflections will not be

Fig. 1. (a) Construction of the four points necessary to define the canonical frame for a concavity.
The first two points (A, D) are points of bitangency that mark the entrance to the concavity.
Two further distinguished points, (B, C), are obtained from rays cast from the bitangent contact
points and tangent to the curve segment within the concavity. These four points are used to
map the curve to the canonical frame. (b) Curve in canonical frame. A projection is constructed
that transforms the four points in (a) to the corners of the unit square. The same projection
transforms the curve into this frame.

as stable (i.e. immune to small curve perturbations) as one based on tangencies, though
we have not confirmed this 6.
For this construction to be useful in model based vision it must satisfy two sensible
and useful criteria:

1. Curves in the canonical frame for differing views of the same object should be very
"similar".
2. Curves in the canonical frame from different objects should "differ" from each other.

The measures used to distinguish canonical curves are discussed in section 3.4.

3.3 Stability Over Views

The stability of the canonical frame representation is illustrated by figures 3a-d, which
show three different views of the same spanner with extracted concavity and four reference
points. The marked edge data is then mapped into the canonical frame. The curves in the
canonical frame are almost identical. Representative images of other objects, a second

6 One interesting case of a further application of the bitangent construction is in forming in-
variants for curves with double points. This construction uses the dual space representation of
a curve where the curve tangents (which are lines) are represented as homogeneous points in
the plane. Then, a double point maps to a bitangent in the dual space of the curve (tangent
space), and so invariants can be formed in the dual space.

Fig. 2. Canonical frame transformation for a spanner concavity. (a) Original image. (b) Bitangent
and tangents (see figure 1). These are determined via a bitangent detector acting on image
edge data. (c) Four distinguished points and concavity curve. (d) Projected curve in canonical
frame. The curve passes through the corners of the unit square which are the projections of the
four distinguished points. Note, the spanner has four external bitangents, four internal bitangents,
and also bitangents which cross the boundary. Each of these can generate a curve in the
canonical frame. Consequently, considerable redundancy is possible in the representation.

spanner and a pair of scissors, are shown in figures 4 and 5 with their corresponding
canonical curves (again from three views). Note that the jagged portion (A), of the curve
in figure 5, varies over viewing position, but that the smooth portions (B), are consistent
for all views. This is because (B) is produced by the plastic handles of the scissors, which
are coplanar with the four reference points. The metal hinge is not coplanar with the
reference points and so (A) is not positioned in a projectively invariant manner. This part
of the curve must be excised. The variation emphasises the fact that the canonical frame
construction is defined only on planar structures.

3.4 Index Functions and Discrimination

Since any measurements made in the canonical frame are invariant signatures for the
curve, the question is what is the optimum set for discrimination over objects in the
library? Clear criteria are that the number of measures should be reasonably small, but
that there should be enough to discriminate objects from clutter, and that each one
should be useful.
It appears that the most naive measurements, area moments, are stable and efficient
discriminators. We use the area bounded by the x-axis and the curve. The moments

Fig. 3. (a) - (c) Three views of a spanner with extracted concavity curves and distinguished
points marked. Note the very different appearance due to perspective effects. (d) Canonical
frame curves for the three different views of the spanner. The curves are almost identical demonstrating
the stability of the method. Of course the same curve would result from a projective
transformation between the object and canonical frame.

Fig. 4. (a) A second spanner with extracted concavity curves and distinguished points marked.
(b) Canonical frame curves for this image and the same spanner from two other viewpoints.
Again, the curves are very similar.

Fig. 5. (a) A pair of scissors with extracted concavity curves and distinguished points marked.
(b) Canonical frame concavity curves from three views of the pair of scissors. The smooth end
portions of the curve (B) correspond to regions of the concavity coplanar with the four reference
points. These match well between images. The jagged portions (A) do not match as well because
these are formed by edges non-coplanar with the reference points.

computed for the three views of the first spanner are given in table 1, and for all the
objects in table 2. It is clear that in practice this construction gives very good results. For
example, the area enclosed by the curve is constant to 5% over views (viewpoint invariance),
whilst differing by more than 30% between the spanner and scissors (discrimination).
However, area alone could not reliably distinguish the two spanners (their areas differ
by only 7%).

view   Area   Mx     Mx^2   My     My^2
1      1.35   0.516  0.341  0.720  0.686
2      1.40   0.518  0.343  0.743  0.732
3      1.42   0.516  0.341  0.756  0.756

Table 1. Moments computed for the spanner concavity when in the canonical frame (figure 3).
The moments are about the x and y axes. Both the first and second moments are computed.
The values are constant over change in the viewing position and so can be used as invariant
measures to index into a library.

Integral measures should be chosen as they promote stability by smoothing noise
(immunity to small curve perturbations). For example, measurement of curve arc length,
which is the integral of a constant function along the curve, degrades systematically with
image noise. However, the area enclosed by the curve (which may also be determined via
an integral along the curve) provides an effective smoothing function, since local curve
fluctuations have little effect on the total area enclosed by the curve.
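As an illustration of how such integral measures might be computed (the paper does not give its exact normalisation, so the formulas and names below are assumptions), the area between the canonical-frame curve and the x-axis and a set of normalised moments can be approximated with line integrals:

```python
import numpy as np

def curve_moments(curve):
    """Approximate the area between the curve and the x-axis, plus normalised
    first and second moments, by trapezoidal integration of line integrals:
    A = integral of y dx, int x dA = integral of x*y dx,
    int y dA = integral of y^2/2 dx, and so on.
    'curve' is an (N, 2) array of ordered canonical-frame points."""
    x, y = curve[:, 0], curve[:, 1]
    dx = np.diff(x)
    xm = 0.5 * (x[:-1] + x[1:])               # segment midpoints
    ym = 0.5 * (y[:-1] + y[1:])
    area = np.sum(ym * dx)                    # signed; the sign cancels below
    Mx = np.sum(xm * ym * dx) / area          # normalised first moments
    My = np.sum(0.5 * ym ** 2 * dx) / area
    Mx2 = np.sum(xm ** 2 * ym * dx) / area    # normalised second moments
    My2 = np.sum((ym ** 3 / 3.0) * dx) / area
    return abs(area), Mx, My, Mx2, My2
```

Because numerator and denominator change sign together when the curve orientation is reversed, the normalised moments do not depend on the direction in which the curve is traversed.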
To date, we have not investigated a set of 'optimal' measures that will provide maximum
discrimination between objects with good stability. Future work will investigate two
areas:

1. Given the model base we may perform a principal axis analysis to determine the
dominant features (i.e. the image shapes corresponding to the largest eigenvalues of

view        Area   Mx     Mx^2   My     My^2
spanner 1   1.35   0.516  0.341  0.720  0.686
scissors    1.99   0.506  0.318  1.107  1.694
spanner 2   1.26   0.507  0.334  0.665  0.584

Table 2. Moments for the three different objects are sufficiently different that they can be
used for model discrimination. Note for example, that the measures for Mx^2 appear to be very
similar, but because the values of Mx^2 are very stable the small differences provide sufficient
discrimination.

the library covariance matrix) that provide the best discrimination between different
models. One obvious problem with this approach is that the eigenvectors will be of
high dimension, and so may not be realistically computable.
2. We may transform the data in the canonical frame to a set of orthogonal functions,
for example using a Walsh or cosine transform, and use the transform coefficients as
indexing values.

We are currently investigating different choices of index function, but in the demon-
stration of object recognition given in the next section we simply use area based mea-
surements.

4 Model Based Recognition

This closely follows the system described in [23] where more details are given. There are
two stages:

Pre-processing:
1. Model Acquisition:
Models are extracted directly from a single image of the unoccluded object. The
edgel list is stored for later use in the verification process. Segmentation is carried
out as described below to delimit concavities. No measurements are needed on
the actual object, nor are pose or camera intrinsic parameters required.
2. Add to model library:
Invariant vectors of measures are calculated as described in section 3.4. Each
component of this vector is an invariant measure that may be used as an index
to the object. These vectors are entered into a library which will be accessed as
a hash table.
Recognition:
1. Extract concavities:
Feature extraction and segmentation is carried out as below to delimit concavities.
2. Compute indices for each concavity:
As described in section 3.4.
3. Index into library:
If the index key corresponds to a table entry this is used to generate a recognition
hypothesis (a toy sketch of such a hash-table index is given after this list).
4. Hypothesis Verification:
Verification proceeds in two phases (both based on the verification procedure
of [23]):

- Check that the measured and expected model curves in the canonical frame
are similar (that is lie close to each other).
- Project the edgel data from an acquisition image onto the current image.
If sufficient projected edges overlap the target image edgels, the match is
accepted. Note that the projective transformation between acquisition image
and target image is computed directly from the correspondence of the four
points used in the canonical frame construction.
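The following toy sketch illustrates the hash-table indexing step referred to above. The class name, bin width and quantisation scheme are illustrative assumptions, not the system's actual implementation:

```python
from collections import defaultdict

class InvariantIndex:
    """Toy model library keyed on quantised invariant vectors."""

    def __init__(self, bin_width=0.05):
        self.bin_width = bin_width
        self.table = defaultdict(list)

    def _key(self, invariants):
        return tuple(int(round(v / self.bin_width)) for v in invariants)

    def add_model(self, model_id, concavity_id, invariants):
        self.table[self._key(invariants)].append((model_id, concavity_id))

    def lookup(self, invariants):
        """Return (model, concavity) hypotheses stored in the same bin as the
        measurement; a real system would also probe neighbouring bins."""
        return self.table.get(self._key(invariants), [])

# Index the spanner concavity of Table 1 and query with a noisy measurement:
library = InvariantIndex()
library.add_model("spanner 1", "end slot", [1.35, 0.516, 0.341, 0.720, 0.686])
print(library.lookup([1.36, 0.514, 0.343, 0.722, 0.690]))  # [('spanner 1', 'end slot')]
```

Each hypothesis returned by the lookup would then be passed to the two-phase verification described above.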

4.1 Segmentation

A local implementation of Canny's edge detector [8] is used to find edges to sub-pixel
accuracy. These edge chains are linked, extrapolating over any small gaps. Concavities
are detected by finding bitangent lines.
Bitangent lines are found by computing approximations to the tangents of the curve,
and representing these as points in the space of lines on the plane. Pairs of points that
are close together in this space are found by a coarse search. These pairs represent
approximate bitangents, which are refined through a convex hull construction.
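A rough sketch of that coarse search is given below; the binning parameters and the central-difference tangent estimate are illustrative choices (not the paper's implementation), and the convex-hull refinement step is omitted:

```python
import numpy as np
from collections import defaultdict

def approximate_bitangents(curve, angle_bins=180, dist_bins=100, min_sep=10):
    """Represent the tangent line at each curve point by its orientation theta
    (in [0, pi)) and signed offset d along the line normal, bin (theta, d)
    coarsely, and report pairs of well-separated curve points whose tangent
    lines fall in the same cell: these are the approximate bitangents."""
    curve = np.asarray(curve, dtype=float)
    span = float(np.ptp(curve, axis=0).max()) or 1.0   # rough scale for d bins
    cells = defaultdict(list)
    for i in range(1, len(curve) - 1):
        t = curve[i + 1] - curve[i - 1]                # central-difference tangent
        theta = np.arctan2(t[1], t[0]) % np.pi         # undirected orientation
        normal = np.array([np.sin(theta), -np.cos(theta)])
        d = float(curve[i] @ normal)                   # signed offset of the line
        key = (int(theta / np.pi * angle_bins), int(d / span * dist_bins))
        cells[key].append(i)
    pairs = []
    for idxs in cells.values():
        pairs.extend((i, j) for i in idxs for j in idxs if j - i >= min_sep)
    return pairs    # candidate bitangent contact-point pairs, to be refined
```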

4.2 Experimental Results

Some initial results of the system are demonstrated in figures 6 and 7. At present the
model base consists of 5 objects: 2 spanners, scissors, pliers and a hacksaw. The figures
demonstrate recognition under perspective of two models from this library despite the
presence of partial occlusion and other objects (clutter) not in the library. Note there is
a two fold ambiguity in the matching of curve tangent points to canonical frame. The
matching depends solely on the ordering around the curve. To overcome this problem
indexes and curves for both orderings are stored. Any problems with local symmetry
in the concavity giving rise to ambiguous matches will be detected by back projection
during the verification process.

5 Solving Jigsaw Puzzles

We have selected the problem of assembling a jigsaw puzzle to illustrate the shape dis-
criminating power of the canonical frame approach. Jigsaw assembly has also long been
considered a challenging vision task [13, 31]. The idea is that matching pieces will have
the same invariant signatures for the tab and slot curves. We image a jumble of (unoc-
cluded) puzzle pieces under significant perspective distortion and then "assemble" the
puzzle by matching the pieces using canonical frame matching. The pieces are assumed to
be planar but not necessarily coplanar with each other (in practice the pieces are not even
taken from a single image). The assembly process is carried out by mapping the pieces to
a common canonical frame and then aligning the matching curves. The texture patterns
on the pieces are not used in the matching process but we warp the image of each piece
to portray the assembly in a single plane. This experiment illustrates that the invariants
calculated from the canonical frame can be used to compare unknown objects in a single
image as well as classify objects from a library of model curves. Such comparison tasks
are not readily tackled with conventional model-based recognition systems.

Fig. 6. (a) Spanner almost entirely occluded by keys. The keys are not in the library, and are clutter
in this scene. (b) Detected concavities, highlighted in white, which are used to compute indexes.
(c) The spanner, which is the only model in the scene contained in the library, is recognised from
the end slot concavity. The projected outline used for verification is highlighted in white.

5.1 Matching details

Edges are extracted from unoccluded views of the pieces using a Canny [8] edge
detector. Each piece is assumed to have four sides and to only connect to at most four
other pieces 7. Each of the four sides of a piece therefore either represents an individual shape
descriptor that matches other pieces within the jigsaw or is an edge piece. Each of the side
curves is then classified as either a straight side piece, a tab, or a slot, depending on the
general shape of the curve in relation to the rest of the piece of which it is part. Each
curve classed as a tab or a slot contains at least one significant concavity that corresponds
to the tab or slot. We map each of the concavities into the canonical frame and search
for the unique matching side. Once this is found the pieces can be joined together.

5.2 Reconstruction

The first corner piece found is used as the bottom left hand piece of the completed puzzle
(a corner piece has two straight sides). This piece is used as the base unit square in the
canonical frame, on which the puzzle is built. The piece immediately to its right can be
mapped into the corner piece's image frame using the image to canonical frame mappings
of the interlocking tabs and slots. Once this transformation is known the grey level values
7 This restriction is for implementation purposes only, and does not reduce the value of the
demonstration.

Fig. 7. (a) Image of various planar objects. (b) Concavities, highlighted in white, which are
used to compute indexes. (c) The pliers, which are the only model in the scene contained in
the library, are recognised and verified by projecting the edgels from an acquisition image, and
checking overlap with edgels in this image.

of pixels within the bounds of the new piece can be projected into the frame of the corner
piece, and so effect the joining of the two pieces. A similar process can be performed to
map the piece directly above the corner piece into the corner image frame.
The piece diagonally above and to the right of the corner interlocks both the pieces
above and to the right of the corner. We therefore do a least squares fit to determine the
projectivity from the eight correspondences, and again render by projecting image grey
values. A similar process is applied to the rest of the pieces.
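A standard way to realise such a least-squares fit is a direct-linear-transform estimate: stack two constraint rows per correspondence and take the right singular vector with the smallest singular value. The sketch below is an assumption about how this might be implemented, not the authors' code:

```python
import numpy as np

def fit_projectivity(src, dst):
    """Least-squares 3x3 projectivity mapping src -> dst, estimated from
    n >= 4 point correspondences (both given as (n, 2) arrays)."""
    rows = []
    for (X, Y), (x, y) in zip(src, dst):
        rows.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y, -x])
        rows.append([0, 0, 0, X, Y, 1, -y * X, -y * Y, -y])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)   # right singular vector of smallest singular value
```

In practice the correspondences would usually be centred and scaled before the SVD to improve numerical conditioning.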
Two examples of this are shown in figures 8 and 9. Both show the original pieces, and
the final assembled and rendered puzzle.
This is an O(n^2) algorithm (with n the number of pieces). Extracting indexes in the
canonical frame, building a hash table, and using these indices for matching as in the
recognition system, would reduce the complexity to O(n), but here n is small and the
time taken in computing matches negligible.
Some extensions are obvious:
1. The final reconstruction in the canonical frame should be mapped to a rectangle with
the correct aspect ratio for the assembled jigsaw to remove any projective effects, or
at least to a frame in which corners have right angles.
2. There are gaps in the assembled pattern arising because jigsaw pieces are mapped
by a transformation determined by only a small part of the outline. This can be
improved by determining the projectivity from all distinguished points around each
piece using least squares.

Fig. 8. (a) Two pieces of a jigsaw, with (b) the assembled and rendered solution. The puzzle
is solved and rendered using only information from this image. No camera intrinsic parameters
or pose information is needed. Note the large perspective distortion of the pieces in the original
image (a) which are not in the same plane (the right hand piece lies in a plane at about 45° to
the plane of the other piece).

Fig. 9. (a) A six piece jigsaw, with (b) the assembled and rendered solution. The puzzle is solved
and rendered using only information from this image. No camera intrinsic parameters or pose
information is needed.

3. Other constraints - such as collinearity of the outer boundaries - could be incorporated
by iteratively minimising a cost function.

6 Conclusions and Future Work

We have demonstrated recognition of non-algebraic planar objects from perspective images.
The work is currently being extended in a number of ways:

1. At present there are five objects in the library. We are currently including more
objects. Efficient development will require attention to the measures used for index
functions.
2. We have assumed that the uncalibrated imaging process may be modeled by a pro-
jectivity. This is exact for a pin-hole camera, but corrections must be applied if
radial-distortion is present. We are currently evaluating this correction [4, 25].
3. We can observe the reliability of an invariant measure by perturbing the distinguished
points and recomputing the invariant values. This will be of benefit both during
library construction and during the recognition process: i) We only use invariant
indexes that are affected little by the bitangent locations, and ii) during verification
confidence in a match is weighted by the stability of the invariant measure.
4. At present we use only concavities (exterior bitangents). This does not exploit the
full structure of the curve. We wish to limit the bitangents used to those that do not
cross the curve, but this does not prevent the use of internal bitangents. Using these
will further improve immunity to occlusion.
5. There are obvious extensions for computing canonical frames for non-smooth curves.
If a tangency discontinuity is observed we can use the two tangents immediately
either side of the discontinuity as reference lines. We then find two more points or
lines and uniquely determine the map to the canonical frame.

Acknowledgements

We are very grateful to both Professor Christopher Longuet-Higgins and Margaret Fleck
for suggesting the jigsaw puzzle problem. We are also grateful to Mike Brady and Andrew
Blake for useful discussions.

References
1. Asada, H. and Brady, M. "The Curvature Primal Sketch," PAMI-8, No. 1, p.2-14, January
1986.
2. Ayache, N. and Faugeras, O.D. "HYPER: A New Approach for the Recognition and Posi-
tioning of Two-Dimensional Objects," PAMI-8, No. 1, p.44-54, January 1986.
3. Barrett, E.B., Payton, P.M. and Brill, M.H. "Contributions to the Theory of Projective
Invariants for Curves in Two and Three Dimensions," Proceedings First DARPA-ESPRIT
Workshop on Invariance, p.387-425, March 1991.
4. Beardsley, P.A. "Correction of radial distortion," OUEL internal report 1896/91, 1991.
5. Blake, A. and Marinos, C. "Shape from Texture: Estimation, Isotropy and Moments,"
Artificial Intelligence, Vol. 45, p.232-380, 1990.
6. Blake, A. and Sinclair, D. On the projective normalisation of planar shape, TR OUEL, in
preparation, 1992.
7. Brady, M. and Yuille, A. "An Extremum Principle for Shape from Contour," PAMI-6, No.
3, p.288-301, May 1984.

8. Canny J.F. "A Computational Approach to Edge Detection," PAMI-8, No. 6, p.679-698,
1986.
9. Carlsson, S. "Projectively Invariant Decomposition of Planar Shapes," Geometric Invari-
ance in Computer Vision, Mundy, J.L. and Zisserman, A., editors, MIT Press, 1992.
10. Clemens, D.T. and Jacobs, D.W. "Model Group Indexing for Recognition," Proceedings
CVPR, p.4-9, 1991, and PAMI-13, No. 10, p.1007-1017, October 1991.
11. Duda, R.O. and Hart P.E. Pattern Classification and Scene Analysis, Wiley, 1973.
12. Forsyth, D.A., Mundy, J.L., Zisserman, A.P. and Brown, C.M. "Projectively Invariant
Representations Using Implicit Algebraic Curves," Proceedings ECCV1, Springer-Verlag,
p.427-436, 1990.
13. Freeman, H. and Gardner, I. "Apictorial Jigsaw Puzzles: The Computer Solution of a
Problem in Pattern Recognition," IEEE Transaction in Electronic Computers, Vol. 13, No.
2, p.118-127, April 1964.
14. Huttenlocher, D.P. and Ullman, S. "Object Recognition Using Alignment," Proceedings
ICCV1, p.102-111, 1987.
15. Huttenlocher D.P. "Fast Affine Point Matching: An Output-Sensitive Method," Proceedings
CVPR, p.263-268, 1991.
16. Kanatani, K. and Chou, T-C. "Shape From Texture: General Principle," Artificial Intelli-
gence, Vol. 38, No. 1, p.1-48, February 1989.
17. Kapur, D. and Mundy, J.L. "Fitting Affine Invariant Conics to Curves," Geometric Invariance
in Computer Vision, Mundy, J.L. and Zisserman, A., editors, MIT Press, to appear
1992.
18. Lamdan, Y., Schwartz, J.T. and Wolfson, H.J. "Object Recognition by Affine Invariant
Matching," Proceedings CVPR, p.335-344, 1988.
19. Lane, E.P. A Treatise on Projective Differential Geometry, University of Chicago press,
1941.
20. Marinos, C. and Blake, A. "Shape from Texture: the Homogeneity Hypothesis," Proceedings
ICCV3, p.350-354, 1990.
21. Mohr, R. and Morin, L. "Relative Positioning from Geometric Invariants," Proceedings
CVPR, p.139-144, 1991.
22. Mundy, J.L. and Heller, A.J. "The Evolution and Testing of a Model-Based Object Recognition
System," Proceedings ICCV3, p.268-282, 1990.
23. Rothwell, C.A., Zisserman, A., Forsyth, D.A. and Mundy, J.L. "Fast Recognition using Algebraic
Invariants," Geometric Invariance in Computer Vision, Mundy, J.L. and Zisserman,
A., editors, MIT Press, 1992.
24. Semple, J.G. and Kneebone, G.T. Algebraic Projective Geometry, Oxford University Press,
1952.
25. Slama, C.C., editor, Manual of Photogrammetry, 4th edition, American Society of Photogrammetry,
Falls Church VA, 1980.
26. Springer, C.E. Geometry and Analysis of Projective Spaces, Freeman, 1964.
27. Van Gool, L. Kempenaers, P. and Oosterlinck, A. "Recognition and Semi-Differential In-
variants," Proceedings CVPR, p.454-460, 1991.
28. Wayner, P.C. "Efficiently Using Invariant Theory for Model-based Matching," Proceedings
CVPR, p.473-478, 1991.
29. Weiss, I. "Projective Invariants of Shapes," Proceedings DARPA Image Understanding
Workshop, p.1125-1134, April 1988.
30. Witkin, A.P., "Recovering Surface Shape and Orientation from Texture" Artificial Intelli-
gence, 17, p.17-45, 1981.
31. Wolfson, H., Schonberg, E., Kalvin, A. and Lamdan, Y. "Solving Jigsaw Puzzles by Com-
puter," Annals o] Operational Research, Vol. 12, p.51-64, 1988.

This article was processed using the LaTeX macro package with ECCV92 style
Measuring the Quality of Hypotheses in Model-Based
Recognition*

Daniel P. Huttenlocher 1 and Todd A. Cass 2

1 Department of Computer Science, Cornell University, Ithaca NY 14853, USA
2 AI Lab, Massachusetts Institute of Technology, Cambridge MA 02139, USA

Abstract. Model-based recognition methods generally search for geometrically
consistent pairs of model and image features. The quality of an
hypothesis is then measured using some function of the number of model
features that are paired with image features. The most common approach
is to simply count the number of pairs of consistent model and image fea-
tures. However, this may yield a large number of feature pairs, due to a
single model feature being consistent with several image features and vice
versa. A better quality measure is provided by the size of a maximal bi-
partite matching, which eliminates the multiple counting of a given feature.
Computing such a matching is computationally expensive, but under cer-
tain conditions it is well approximated by the number of distinct features
consistent with a given hypothesis.

1 Introduction
A number of different model-based techniques have been used to hypothesize instances
of a given model in an image, including searching for possible correspondences of model
and image features (e.g., [4]), the generalized Hough transform (e.g., [1]), alignment of a
model with an image (e.g., [5]), and analysis of the space of model transformation parameters
[2]. While these methods differ substantially, they all measure the quality of a given
hypothesis as a function of the number of geometric features of the model that are
consistent with geometric features extracted from the image. The larger the consistent
set of model and image features, the better an hypothesis is judged to be.
In this paper we analyze some of the most common methods for assessing the quality
of an hypothesis in model-based recognition. We focus on the case in which objects
are modeled as a collection of 'atomic' features (such as points). In other words, each
individual model feature is either paired with an image feature or not (there is no partial
matching of individual features). The three quality measures that we investigate are: (i)
the number of pairs of model and image features that are consistent with an hypothesis,
(ii) the maximum bipartite matching of such features, and (iii) the number of distinct such
features. We find that the first measure, although widely used, often greatly overestimates
the quality of a match. The second method is more expensive to compute, but is also
much more conservative. In practice we find that the third scoring method is the best.
This final method also turns out to be closely related to some recent theoretical work
* This report describes research done in part at the Artificial Intelligence Laboratory of the
Massachusetts Institute of Technology. Support for the laboratory's research is provided in
part by an ONR URI grant under contract N00014-86-K-0685, and in part by DARPA under
Army contract number DACA76-85-C-0010 and under ONR contract N00014-85-K-0124. DPH
is supported at Cornell University in part by NSF grant IRI-9057928 and matching funds from
General Electric and Kodak, and in part by AFOSR under contract AFOSR-91-0328.

on using Hausdorff distances for recognition [6]. Finally, we show that if restrictions are
placed on the spacing of the model features, then the third measure becomes equivalent
to the size of the maximal matching. This provides a more precise meaning to the notion
that matching is more difficult when the features of a model are too close together.
Our results have two important implications for model-based recognition systems:

1. For atomic features (such as points), the quality of an hypothesis should be measured
based on the number of distinct model and image features that are geometrically
consistent with a given hypothesis, not on the number of consistent model and image
feature pairs.
2. Object models should be constructed such that no two features are close together,
where closeness is a function of the degree of sensory uncertainty.

Existing model-based recognition systems, where appropriate, can be modified with min-
imal effort to take advantage of these results. Using the number of distinct features as a
quality measure simply requires keeping track of which features are accounted for by a
given set of feature pairs. Constructing the object models requires that the sensor error
estimates used by the recognition method be available when the models are made.

2 Geometrically Consistent Sets of Features

We assume that a model consists of a set of features M = {m_i}, i = 1, ..., m, measured in some
coordinate system ℳ, and the image data consist of a set of features S = {s_j}, j = 1, ..., n,
measured in some coordinate system ℐ. A pair of model and image features, (m_i, s_j),
defines a set of possible transformations from ℳ to ℐ. This set of transformations can
be viewed as a volume in a d-dimensional transformation parameter space, where each
dimension of the space corresponds to one of the parameters of the transformation,
T : ℳ → ℐ. We denote the set, or volume, of transformations consistent with a pair of
features (m_i, s_j) as V(m_i, s_j) ⊂ T. This is the set of all transformations on m_i that map
it into ℐ, such that m_i is within some uncertainty region around s_j. We limit the present
discussion to the case of point features, where the uncertainty region is modeled using a
disk of radius e. These results can be generalized to more complex models of uncertainty,
including non-circular positional uncertainty and angular uncertainty (cf. [2, 3]).
As an example of a transformation space volume, consider a model and an image
that both consist of a set of points in the plane, and a three-dimensional transformation
space, T, with axes u, v and θ (corresponding respectively to the two translations and
one rotation in the plane). If there is no sensory uncertainty in the image measurements
(e = 0), then the set V(m_i, s_j) is the helical arc ℓ in T defined as a function of θ by
s_j − R_θ m_i, where R_θ is a rotation matrix, and s_j and m_i are the positions of image and
model feature points, respectively. Thus the translation is constrained exactly by a given
rotation, but any rotation is possible. If the location of the image point is uncertain up
to a circle of radius e, then V(m_i, s_j) is a helical tube of circular cross section of radius
e about the curve ℓ.
We define a consistent set of feature pairs, C, to be a set of model and image feature
pairs that specify mutually overlapping sets of transformations. Thus formally C ⊆ M × S
is a consistent set when the intersection of all the volumes specified by the pairs in C is
nonempty,

    ∩_{(m_i, s_j) ∈ C} V(m_i, s_j) ≠ ∅,     (1)
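For the planar point example above, equation (1) can be read operationally: a set of pairs is consistent exactly when some transformation lies in every volume. The following sketch (function names and the explicit parameterisation are assumptions, not the paper's code) checks whether one candidate transformation witnesses this:

```python
import numpy as np

def in_volume(theta, t, m, s, eps):
    """True if the planar rigid transformation (rotation theta, translation t)
    lies in V(m, s): it maps model point m to within eps of image point s."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return np.linalg.norm(R @ np.asarray(m) + np.asarray(t) - np.asarray(s)) <= eps

def witnesses_consistency(theta, t, pairs, eps):
    """Check that one candidate transformation lies in every volume V(m, s),
    i.e. that it lies in the intersection required by equation (1)."""
    return all(in_volume(theta, t, m, s, eps) for m, s in pairs)
```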

The sets of 'geometrically consistent features' computed by most recognition systems
are a slight overestimate of the volumes specified in equation (1). For example,
the interpretation tree approach (e.g., [4]) only ensures pairwise consistency between
model and image feature pairs. That is, a path through the interpretation tree specifies
a set of feature pairs J ⊆ M × S, such that for each pair (m_i, s_j) ∈ J and (m_k, s_l) ∈ J,
V(m_i, s_j) ∩ V(m_k, s_l) ≠ ∅. The tree search does not guarantee that the set J (corresponding
to a path in the tree) is a consistent set by the definition in equation (1), because
only pairs of volumes are required to intersect. Similarly, generalized Hough transform
methods (e.g., [1]) overestimate the consistent feature sets, because the transformation
space is tessellated into discrete buckets. These buckets are in general larger than the
true transformation space volumes (if they were smaller then a match could be missed
altogether).
We further note that the size of C, from equation (1), is itself generally an overestimate
of the quality of an hypothesis. The key problem is that a given set of consistent feature
pairs may contain many pairs that involve a given model feature m_i or a given image
feature s_j. Each of these pairs is counted separately, even though it involves the same
model or image feature. For atomic features, where there is no partial matching of a
single feature, it does not make much sense to count the same feature multiple times. A
set of sensor features T ⊆ S will all be paired with the same model feature, m_i ∈ M,
in the same consistent set if ∩_{s_j ∈ T} V(m_i, s_j) ≠ ∅. This happens for any set of sensor
features T in which all the features are 'close' together, where close means that the
corresponding transformation space volumes intersect (i.e., closeness is a function of the
sensor uncertainty). Similarly, a set of model features can all be paired with a given
image feature.

Fig. 1. The number of consistent feature pairs overestimates the quality of a match: a) a super-
imposed set of model and image features, b) the corresponding bipartite graph.

Thus whenever a group of model features or a group of image features are close
together, the size of a consistent set will count the same model or image feature multiple
times. For example, Figure 1a shows an alignment of a five point model with an image
for which the corresponding consistent set contains eight feature pairs. The model points
are shown as crosses and the image points are shown as dots. Any model point m_i that
lies within e of an image point s_j defines a pair of features (m_i, s_j) that are consistent
with this position of the model in the image. On the other hand, only three of the five
model features are paired with distinct image features. In this case the difference in the
size of the consistent set (eight) and the number of model features accounted for (three)
occurs because a single image feature is paired with more than one model feature and
vice versa. Such situations are not merely of theoretical interest, as they occur frequently
for reasonable ranges of sensory uncertainty and images containing just tens of features.

3 Bipartite Matching

Maximal bipartite matching can be used to rule out the 'multiple counting' that occurred
in the above example, by requiring that each model feature only be paired with a single
image feature and vice versa. A consistent set of feature pairs, C, defines a bipartite graph
G = (U, V, E) where for each feature pair (m_i, s_j) ∈ C there is a vertex u_i ∈ U corresponding
to the model feature m_i, a vertex v_j ∈ V corresponding to the image feature s_j,
and an edge e_ij ∈ E connecting u_i to v_j. Each edge is incident on one vertex of U and one
vertex of V, so the graph is bipartite. For example, the set of consistent feature pairs corresponding
to Figure 1a is {(A, 1), (B, 1), (C, 2), (C, 3), (C, 4), (D, 5), (D, 6), (D, 7)} (where
model features are denoted by letters and image features by numbers). This set defines
the bipartite graph shown in Figure 1b.
A matching in a bipartite graph, G, is a subset of the edges, F ⊆ E, such that each
vertex of G has at most one edge incident on it. The size of F is the number of edges
that it contains, |F|. For instance, a trivial matching is a single edge of a bipartite graph.
A maximal matching is one such that there are no larger matchings in the graph. For
our problem, a maximal matching corresponds to the largest set of consistent model
and image feature pairs that can be formed without using any model or image feature
more than once. For the graph in Figure 1b the set {(A, 1), (C, 2), (D, 5)} is a maximal
matching. Note that in general there can be more than one maximal matching.
Methods for finding a maximal matching in a bipartite graph require O(|V|^(1/2) · |E|)
time, or O(n^2.5) where n = |V|. Methods that are straightforward to implement require
O(min(|V|, |U|) · |E|) time [7]. In contrast, simply counting the number of pairs in
a consistent set (i.e., the number of edges in E) only requires time O(|E|). As model
based recognition methods already require substantial amounts of running time, we are
concerned with how to estimate the size of the maximal matching, |F|, in O(|E|) time.
From the bipartite graph representation of a geometrically consistent set, we can
identify several possible measures of the quality of the corresponding hypothesis:
1. The number of feature pairs in the consistent set (i.e., |C| = |E|).
2. The number of distinct features accounted for by the consistent set (e.g., |U|, |V|, or
their minimum, maximum, or sum).
3. The size of a maximal matching in the bipartite graph defined by C (i.e., |F| where
F ⊆ E is a maximal matching).
Given the way we have constructed a bipartite graph from C, the first of these three
quantities is the largest, and the last is the smallest, that is, |E| ≥ min(|U|, |V|) ≥ |F|.
Clearly the first inequality holds because the number of edges in the graph must be at
least as large as the number of vertices of each type. The second inequality holds because
in a matching each vertex has at most one edge incident on it.
Whereas most recognition systems use the first of these measures, the last is the most
conservative measure. Some recognition systems do use bipartite matchings, but these
are quite expensive to compute compared with counting the number of consistent feature
pairs. Thus we propose using the number of distinct pairs of model and image features,
as measured by the quantity min(|U|, |V|). This is cheap to compute, and measures the
minimum number of distinct model or image features accounted for.
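The gap between the three measures is easy to see on the consistent set of Figure 1. The sketch below (an illustration, not the authors' code) computes |E|, min(|U|, |V|) and the maximum matching size |F| with a simple augmenting-path routine:

```python
from collections import defaultdict

def matching_size(pairs):
    """Maximum bipartite matching size for (model, image) feature pairs,
    found with augmenting paths (Kuhn's algorithm)."""
    adj = defaultdict(list)
    for m, s in pairs:
        adj[m].append(s)
    match = {}                                 # image feature -> model feature

    def augment(m, seen):
        for s in adj[m]:
            if s not in seen:
                seen.add(s)
                if s not in match or augment(match[s], seen):
                    match[s] = m
                    return True
        return False

    return sum(augment(m, set()) for m in list(adj))

pairs = [("A", 1), ("B", 1), ("C", 2), ("C", 3), ("C", 4),
         ("D", 5), ("D", 6), ("D", 7)]
U = {m for m, _ in pairs}
V = {s for _, s in pairs}
print(len(pairs), min(len(U), len(V)), matching_size(pairs))   # 8 4 3
```

Here min(|U|, |V|) = 4 still overestimates |F| = 3, because this example has branching on both sides of the graph.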
The measure min(|U|, |V|) only overestimates the size of a maximal matching, |F|,
when there is branching on both sides of the bipartite graph. This corresponds to a
situation in which there are several neighboring model features that match a single image
feature, and vice versa. If a bipartite graph is guaranteed to only have branching at the
vertices on one side of the graph, then situations such as this cannot occur. In that case,

it is trivial to compute the size of a maximal matching - simply count the number of
vertices on the side of the graph where the branching occurs. For example if the branching
occurs only for vertices in U, then each edge e ∈ E is incident on a unique vertex v ∈ V
(otherwise there would be branching for some vertex in V). Thus the number of vertices
in U determines the size of a maximal matching.
While we cannot in general control the spacing of features in an image, we can do so
for the features in a model. More formally, in order for no two model features to match
a given image feature, it must be that for each pair of model features, m_i, m_k ∈ M,
the volumes produced by intersection with the same sensor feature s_j are disjoint, that
is, ∀ m_i, m_k ∈ M, V(m_i, s_j) ∩ V(m_k, s_j) = ∅. Another way to view this is that no two model
features can be close enough together that when mapped into the image coordinate
frame they overlap the uncertainty region around the same image feature. For a rigid-
body transformation this can be accomplished by surrounding each model feature with
a 'buffer area' based on the positional uncertainty value e. As long as no two buffer areas
overlap, no pair of model features can match the same image feature.
As an example, consider the case of points in the plane, where the sensory uncertainty
region is a circle of radius e. If each model point is surrounded by a circle of radius e, and
the model points are placed such that no two circles intersect, then no two model points
can match the same image point. Having done this, the number of distinct features is
equal to the size of the maximal bipartite matching, but is much cheaper to compute. In
practice, we have found this method to be much better than simply counting the number
of feature pairs.
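This model-construction rule is easy to check mechanically. A minimal sketch, assuming point features and a rigid-body transformation (so that inter-point distances are preserved):

```python
import numpy as np

def well_spaced(model_points, eps):
    """True if the eps-disks around the model points are pairwise disjoint,
    i.e. every pair of model points is more than 2*eps apart, so no two model
    points can lie within eps of the same image point."""
    pts = np.asarray(model_points, dtype=float)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if np.linalg.norm(pts[i] - pts[j]) <= 2.0 * eps:
                return False
    return True
```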
In summary, a good estimate of the size of the maximal bipartite matching
is provided by the number of distinct model and image features that are consistent with
a given hypothesis. Moreover, if models are constructed such that no two model features
are close together (as a function of the degree of sensory uncertainty) then the number of
distinct features is the same as the size of the maximal bipartite matching. This provides
a formal meaning to the intuition that matching is harder when the model features are
'too close together to be resolved by the sensor'.

References
1. D. H. Ballard, 1981. Generalizing the Hough transform to detect arbitrary shapes. Pattern
Recognition 13:111.
2. Cass, T.A., 1990, "Feature Matching for Object Localization in the Presence of Uncertainty",
MIT Artificial Intelligence Laboratory Memo no. 1133.
3. Grimson, W.E.L. and D.P. Huttenlocher, 1990, "On the Verification of Hypothesized Matches
in Model-Based Recognition", Proceedings of the First European Conference on Computer
Vision, Lecture Notes in Computer Science No. 427, pp. 489-498, Springer-Verlag.
4. Grimson, W.E.L. & T. Lozano-Pérez, 1987, "Localizing overlapping parts by searching the
interpretation tree," IEEE Trans. PAMI 9(4), pp. 469-482.
5. Huttenlocher, D.P. and S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an
Image," Intl. Journal of Computer Vision, vol. 5, no. 2, pp. 195-212.
6. Huttenlocher, D.P. and Kedem, K., 1990, "Efficiently computing the Hausdorff distance for
point sets under translation", Proceedings of the Sixth ACM Symposium on Computational Geometry,
pp. 340-349.
7. Papadimitriou, C.H. and K. Steiglitz, 1982, Combinatorial Optimization: Algorithms and
Complexity, Prentice-Hall.

This article was processed using the LaTeX macro package with ECCV92 style
Using Automatically Constructed View-Independent
Relational Model in 3D Object Recognition
S. Zhang, G. D. Sullivan and K. D. Baker
Intelligent Systems Group
University of Reading, RG6 2AY, UK

Abstract. This paper describes and demonstrates a view-independent relational model
(VIRM) in a vision system designed for recognising a known 3D object from single
monochromatic images. The aim is to derive a model of an object able to effect
recognition without invoking pose information. The system inspects a CAD model of the
object from a number of different viewpoints to identify relatively view-independent
relationships among component parts of the object. These relations are represented in the
form of a hypergraph. The VIRM can be searched using a best-first technique to obtain
hypotheses of vehicle poses which match image features.
1. Introduction
The recognition of 3D objects from single 2D monochromatic images of unknown scenes is a
major problem for computer vision. In this paper we discuss the construction of a view-
independent relational model (VIRM), derived automatically from a CAD wireframe model, and
its use in recognising 3D objects. The VIRM of the object is a weighted hypergraph, associated
with procedural constraints. Weights attached to the hyperedges, representing the probability of
co-visibility of component parts of the object, are used to control the search for object hypotheses
in recognition, and procedural constraints associated with the hyperedges prune the interpretation
tree during the search. An early report of this model scheme has been given in [12]. We report
here several major improvements brought about by adding an automatic model feature selection
process, and by using the hypergraph to represent relations among clusters of more than two
model features. We also report the use of the VIRM for recognition of multiple objects.
In this paper we are only concerned to establish plausible initial hypotheses of objects and
their poses. The system is used in conjunction with a hypothesis refinement process to recover
accurate pose and object classes (see Sullivan [9], Worrall et al [10]). The VIRM is used to
identify extended groups of 2D image features compatible with a hypothesis of the class and pose
of the object. The VIRM encodes the object by means of five relations: co-visibility, parallelism,
colinearity, side relation, and relative size. The model scheme is viewer-centred, so that the
hypothesis generation process does not depend on an initial estimate of pose. However, unlike
other viewer-centred models, such as those based on aspect graphs ([3,6,7,8]), the present model
is comprised entirely of view-independent attributes and relations. No search over alternative
viewpoints is needed during the object recognition process, and the storage required for the
model is small. The model uses a hypergraph representation associated with procedural
constraints, expressing relations among two or more features, which can be used to control the
search in the labelling (or hypothesis generation) process.
The model is generated automatically from a CAD wireframe model of the object. We
illustrate the system by using a model of a car, but the approach can be applied to any
geometrically defined object. The output of the model building phase is a triple M = {V, G, C},
which consists of:
1. A set of extended model features, V, including their shapes and the types of extended
2D image features that they might match.
2. Extended adjacency matrices, G, representing the co-visibility of the model features.

3. A set of procedural constraints, C, each associated with a pair of co-visible model
features, representing selected view-independent relations based on: parallelism,
colinearity, side relation and relative size.
The first two elements V and G form the hypergraph representation of the VIRM. A
hypergraph [1] is defined as an ordered pair H = {X, E} where X = {x_1, x_2, ..., x_n} is a set
of vertices and E = {e_1, e_2, ..., e_m} is a set of hyperedges such that e_i ⊆ X, e_i ≠ ∅,
i = 1, 2, ..., m, and ∪_i e_i = X. If |e_i| = k, then e_i is called an order-k hyperedge of the hypergraph.
In particular, e_i is called an edge if |e_i| = 2. In this work we consider only edges and order-3 hyperedges.

2. Construction of the VIRM from a CAD Model


2.1. The Wireframe Model and Its Projection
Wirefrarne models of three different types of vehicles are shown in Fig. 1, with the features of a
hatchback car labelled symbolically (the hatchback car is used in this paper to illustrate the
approach). The primitives in the wireframe model are line segments. Each is labelled uniquely,
e.g. nfw_4 denotes the bottom line of the nearside front window, rws_l denotes the bottom line
of the rear windscreen, etc. (Note that the vehicles are British, the right side is the offside.)
Throughout this research we assume that both the object and the image are approximately upright
and that the angle between the axis of the camera and the ground is between 0 and θ_max (= 60°).
These assumptions cover all likely views of a car in normal conditions.
We start the construction of the VIRM of an object by projecting the CAD wireframe model
from a number of different viewpoints. Let
    O = {o_1, o_2, ..., o_m}
be the set of line segments in the wireframe model of the object. The model of the object is placed
at the centre of the Gaussian viewsphere. The area (0 ≤ φ < 2π, 0 ≤ θ < θ_max) on the surface of the
Gaussian viewsphere is sampled randomly, giving n (= 500) viewpoints equally distributed over
the area. Each sampled image gives rise to a set of projected model line segments:
    S_i = {s_i1, s_i2, ..., s_im}

in which s_ik represents the projection of model feature o_k in sample i. Each s_ik is either a line
segment in the 2D image plane, represented by its endpoint coordinates [[x_1, y_1], [x_2, y_2]], or an
empty set if o_k is occluded from the given viewpoint.
2.2. Building Nodes and Node Attributes of the VIRM
The model primitives are 1-D line segments which provide only very poor constraints for object
recognition. These 1-D line segments are therefore grouped into 2-D feature complexes which
form invariant patterns, for example, a planar quadrilateral in the 3D world reliably projects to a
2-D quadrilateral in the image. The 2-D complexes can therefore be used as "focus features"
(Bolles & Horaud [2]) providing a starting point for searching for consistent cliques.
A subset of model features is grouped into a 2D complex if they satisfy the following:
1. For a given viewpoint, if any one feature is visible, then all are visible,
2. The features form connected sets on the 3D model and hence the image,
3. They conform to a known class of shapes (quadrilateral or U-shape curve).
For example, the windows of the car satisfy the above conditions, but the bonnet of the car does
not because the line at the bottom of the windscreen may be occluded in views from the rear. For
the hatchback car, this process groups all 6 windows and the 4 wheel arches into 2D complexes,
represented as quadrilaterals and U-shape curves respectively. Complexes of model features are
used in the same way as single model features, and in the following discussion, we use O_2d to
express the set of 2-D complexes and O_1d to represent the remaining (1-D) model features.

Fig.1. Wireframe models of vehicles


2.3. Building the Co-visibility Hypergraph of VIRM
The VIRM consists of nodes corresponding to model features, and hyperedges indicating co-
visibility of features. Co-visibility of model features is generally view-dependent. However, we
accept features to be co-visible if the probability of their co-occurring in images is high. This
relaxation of the co-visibility constraints introduces errors, but these errors can be eliminated by
finding combinations of mutually consistent matches. Currently we only use the co-visibility of
pairs and triples of features.
2.3.1. Pairwise Co-visibility - Edges of the hypergraph
The pairwise co-visibility of two features is quantified by the conditional probability of observing
one feature given the presence of another, represented by
    A_2(i, j) = p{o_i | o_j}
Fig. 2 shows a part of the co-visibility hypergraph of the VIRM of the hatchback car (a sub-
hypergraph of G), where only the windows of the car are shown. (NB: p{o_i | o_j} is shown to the
right side of the arc from i to j.)

Fig.2. Part of the pairwise co-visibility hypergraph of the VIRM of a car

2.3.2. Co-visibility of Feature Clusters - Order-3 Hyperedges of the Hypergraph


Pairwise co-visibility provides only weak constraints on component parts of the object. It can be
extended to include triples of features containing at least two 2-D complexes. For a cluster of
model features {o_i, o_j, o_k}, we consider all the n projections of the model features and estimate
the probability of their co-occurrence in the sampled images. The quantified co-visibilities of
feature clusters are represented by means of the adjacency matrix of the hypergraph
    A_3(i, j, k) = p{o_k | (o_i, o_j)}
in which o_i, o_j ∈ O_2d. This gives an order-3 hyperedge connecting nodes {o_i, o_j, o_k} with weight
p{o_k | (o_i, o_j)} to represent their co-occurrence.
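A possible way to estimate these conditional probabilities from the n sampled projections is simple counting, as in the sketch below (a feature is taken to be present in a sample when its projection is non-empty; the min_prob threshold and the data layout are assumptions, not the authors' implementation).

    # Estimate pairwise and order-3 co-visibility from the sampled projections.
    # samples: list of dicts mapping feature name -> projected segment or None (occluded).
    from itertools import combinations

    def covisibility(samples, features, complexes_2d, min_prob=0.5):
        visible = {f: [s.get(f) is not None for s in samples] for f in features}

        A2, A3 = {}, {}
        for i, j in combinations(features, 2):
            n_j = sum(visible[j])
            n_ij = sum(a and b for a, b in zip(visible[i], visible[j]))
            if n_j:
                A2[(i, j)] = n_ij / n_j                      # p(o_i | o_j)

        for i, j in combinations(complexes_2d, 2):           # at least two 2-D complexes
            for k in features:
                if k in (i, j):
                    continue
                n_ij = sum(a and b for a, b in zip(visible[i], visible[j]))
                n_ijk = sum(a and b and c for a, b, c in
                            zip(visible[i], visible[j], visible[k]))
                if n_ij:
                    A3[(i, j, k)] = n_ijk / n_ij             # p(o_k | (o_i, o_j))

        # keep only relations whose probability of co-occurrence is high
        return ({e: p for e, p in A2.items() if p >= min_prob},
                {e: p for e, p in A3.items() if p >= min_prob})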

2.4. Geometrical Constraints - Procedures Associated with Edges of the VIRM


Although simple geometrical relations are inherently view-dependent, some are at least partially
insensitive to view and provide weak constraints on pose. We consider, in particular, four
pairwise relationships: parallelism, colinearity, side relation and relative size. Other view-
independent relations, such as connectivity, symmetry, etc., are possible but have proved to be
more difficult to compute and contribute little to car recognition.
The constraints on feature pairs are examined by quantifying the above relations as scalars.
There are no generally accepted ways to quantify parallelism, colinearity, side relation, and we
have adopted measures which we call parallel ratio, co-line ratio, and side ratio (see below). Each
is based on overlaps, lengths, and angles among the line segments concerned. We use Monte
Carlo methods to obtain statistical evidence of the four relations between each pair of model
features. If a specified relation between a pair of model features is reasonably stable over all
views, then it defines a constraint associated with the edge connecting the corresponding nodes
in the co-visibility hypergraph of the VIRM.
2.4.1. Parallel Ratio
Parallel model features appear approximately parallel in the image except when viewed from
extreme angles. The parallel ratio between a pair of line segments is defined as follows. If the
two line segments do not overlap (as shown in Fig.3(a)), or if the crossing point of the two lines

lb

) C"
lc
td
(b) c" a (c)
Fig.3. Definition of parallel ratio

is within both segments (Fig.3(b)), the parallel ratio is defined as zero. Otherwise, (e.g. Fig.3(c)),
the parallel ratio is defined as
rain {1 a , I b, I c, ld}
p (ab, cd) = cosO
m a x { l a , l b, ! c, la]

in which la is the distance between a and its orthogonal projection onto cd, etc., and 0 is the angle
between the two line segments. The parallel ratio between two line segments lies in [0,1], with 1
indicates absolutely parallel. If the mean value of p is high (>0.75) and the standard deviation is
small (<0.25) then the parallel relation is accepted.
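For concreteness, the following sketch computes the parallel ratio of two 2-D segments with numpy; the degenerate zero cases (non-overlapping segments, crossing point inside both) are left out, so it is only an approximation of the measure defined above.

    import numpy as np

    def _perp_dist(p, q1, q2):
        # Distance from point p to the infinite line through q1 and q2.
        d, r = q2 - q1, p - q1
        return abs(d[0] * r[1] - d[1] * r[0]) / (np.linalg.norm(d) + 1e-12)

    def parallel_ratio(a, b, c, d):
        # p(ab, cd) = (min{l_a, l_b, l_c, l_d} / max{l_a, l_b, l_c, l_d}) * cos(theta)
        la, lb = _perp_dist(a, c, d), _perp_dist(b, c, d)
        lc, ld = _perp_dist(c, a, b), _perp_dist(d, a, b)
        v1, v2 = b - a, d - c
        cos_t = abs(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        dists = [la, lb, lc, ld]
        return (min(dists) / (max(dists) + 1e-12)) * cos_t

Running such a measure over the Monte Carlo samples gives the mean and standard deviation of p that are tested against the acceptance thresholds above.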
Fig.4. Parallel ratio between feature pairs
If one of the features is two dimensional (e.g. a quadrilateral or U-shape curve) and the other
is a line segment, we record the parallel ratio between the line segment and each of the lines of
the 2-D feature. If both are two dimensional, we record the parallel ratio of all pairs.
The pdfs of the parallel ratios δ(p) for 4 pairs of features are shown in Fig. 4: (a) nearside

roof and nearside sill, (b) offside roof and offside sill, (c) nearside sill and the bottom line of the
nearside front window, (d) offside upright (windscreen pillar) and nearside sill. The solid curves
show the distribution function δ(p) of the variable. The dotted curves show the cumulative
distributions. The first three show good parallel ratio of the pair of model features, with most
values in the region [0.8, 1]. Therefore the constraints that these feature pairs are parallel are
accepted and are coded into the model. In Fig. 4(d), the parallel ratio is evenly distributed, and
no parallel constraint exists between the pair.
2.4.2. Colinear Ratio
Colinearity between object features is preserved by the perspective transformation. Given two
line segments ab and cd (assuming that ab is longer), as shown in Fig.5, we construct a minimal
rectangle whose long axis is parallel to ab and encloses ab and cd. Let w be the length of the side
of the rectangle parallel to ab, h be the length of the perpendicular side of the rectangle, and 0 be
the angle between the two line segments. Coline ratio is defined as in Fig.5.

o h>w
c (ab, cd) =
(I - h ) cos0 h<:w

Fig.5. Definition of coline ratio


The expected value for true colinearity and the acceptance criterion are similar to those for
parallelism.
2.4.3. Side Relation
Side relation, meaning one feature is to the left (or right) of another in the image, is view-
independent if the object is solid and if the roll of the camera is limited (i.e. if the object and the
image are both "upright"), though this breaks down if the object has significant concavities. The
side relation of a point P to a directed line segment ab is used in the following way: let V be the
vector from a to b. P is said to be to the left (right) of line segment ab if P is to the left (right) of V. To
define the side relation between two features, all the line segments involved are labelled with
directions so that they can be used as vectors. Given a line segment ab and a feature f, and assuming
that there are n_1 points in f which are on the left side of ab and n_2 points which are on the right
side, the side ratio of feature f with respect to line segment ab is defined as:

    s(ab, f) = (n_1 - n_2) / (n_1 + n_2)
The definition of side relation involving compound features is similar to that of parallelism.
If one feature is always at the right (left) side of another the side relation constraint is accepted.
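The side ratio is easy to state in code; the sketch below uses the sign of the 2-D cross product, with the caveat that which sign means "left" depends on whether the image y axis points up or down (an assumption fixed by the chosen coordinate frame).

    import numpy as np

    def side_ratio(a, b, feature_points):
        # s(ab, f) = (n1 - n2) / (n1 + n2); n1/n2 = points of f left/right of the vector a->b
        v = b - a
        n_left = n_right = 0
        for p in feature_points:
            r = p - a
            z = v[0] * r[1] - v[1] * r[0]      # 2-D cross product
            if z > 0:
                n_left += 1
            elif z < 0:
                n_right += 1
        total = n_left + n_right
        return 0.0 if total == 0 else (n_left - n_right) / total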
2.4.4. Relative Size
Relative size depends on the position and orientation of the camera, the focal length and the view
angle. But provided that the focal length to distance ratio is not very small, and the object is not
very large, the size changes of the different component parts of the object remain similar. The
main effect on relative size is then due to the view angle. In the case of a two dimensional feature
(e.g. a quadrilateral), at least one of its two dimensions is relatively stable and thus can provide
bounds on the relative size of features. As an example, Fig. 6 shows a quadrilateral (Q) together
with several projected shapes (Q_1, Q_2, Q_3) when viewed from different angles. At least one of its
two dimensions remains at least 70% (≈ 1/√2) of its original scale, and can be used to provide

bounds for the relative sizes of component parts of the object.

Fig.6. Variation of a quadrilateral viewed from different directions
Relative size is only used in feature pairs including at least one 2-D complex. Under
perspective transformations its longest observed edge (the pseudo-height of the feature) in the
image is expected to remain stable to some degree. The relative size of two features (r) is defined
as the ratio of the distance between the features in the image to the pseudo-height of the 2D
feature (if both features are 2D, the larger pseudo-height is used). The distance between two
features is defined as the shortest distance from the vertices in one feature to the vertices in the
other. If both features are 2D, their pseudo-heights are also compared and stored as a separate
constraint. These ratios vary considerably, and therefore only provide weak bounds for the
distributions. The acceptance criterion is similar to that of parallelism.
2.4.5. Procedural Constraints
As a result of the statistical analysis, each edge is associated with a set of constraints
    C_ij = {p_ij, c_ij, s_ij, r_ij}
representing relations between the corresponding pair of features {o_i, o_j}. If a relation is not
applicable to this pair, the entry is open. We combine the hypergraph and the constraints to obtain
the VIRM, with the selected constraints being compiled as procedures associated with edges of
the hypergraph. The compiled constraints are Boolean-valued procedures with two inputs (the
polyline representations of the two image features concerned). All of the constraints are defined
in terms of the individual line segments constituting the corresponding features. Vertices in both
2D model features and 2D image features are labelled in a counter-clockwise order beginning
with the bottom left vertex.
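One plausible way to "compile" an accepted relation into a procedure attached to an edge is a closure over the learned bounds, as sketched below (the bounds, the polyline convention and the reuse of the parallel_ratio sketch above are assumptions, not the authors' actual code).

    def compile_parallel_constraint(lo=0.75, hi=1.0):
        """Return a Boolean procedure testing the parallel relation between two
        image features given as polylines (lists of 2-D points)."""
        def constraint(feature_a, feature_b):
            # only the first segment of each polyline is tested in this sketch
            p = parallel_ratio(feature_a[0], feature_a[1], feature_b[0], feature_b[1])
            return lo <= p <= hi
        return constraint

    # e.g. attached to an edge of the VIRM sketched earlier:
    # model.constraints[("nsill", "nfw_4")] = [compile_parallel_constraint()]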
3. Application of the VIRM in the Generation of Pose Hypotheses
The VIRM is used to generate hypotheses of the class and pose of the object of interest. Fig.7
illustrates the different stages of the hypothesis generation process. The Canny [5] edge detector
is first applied to the original image to get edgelets (Fig. 7(b)). These edgelets are grouped into
2D features (Fig. 7(c)). Each image feature is associated with all permissible model features, for
example, quadrilateral image features are associated with any of the windows of the car. The
image features are then matched with the VIRM by a depth-first search, restricted to only those
cases with high co-visibility as recorded in the hypergraph. Typically around 10 consistent
hypotheses are generated which reflect the inherent symmetries of the car (examples are shown
in Fig. 7(d), (g) & (j)).
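The depth-first labelling search might be organised as below; this is a schematic version only (it re-generates permutations of the same labelling and assumes the VIRM dictionaries of the earlier sketches), but it shows how co-visibility and the compiled edge constraints prune the interpretation tree.

    def grow_hypotheses(assignments, features, compatible, model, min_covis=0.5):
        """assignments: list of (index into features, model feature) pairs accepted so far;
        compatible[i]: permissible model features for image feature i;
        features[i]: the image feature geometry passed to the edge constraints."""
        out = []
        used_i = {i for i, _ in assignments}
        used_m = {m for _, m in assignments}
        for i in range(len(features)):
            if i in used_i:
                continue
            for m in compatible[i]:
                if m in used_m:
                    continue
                ok = True
                for j, mj in assignments:
                    covis = max(model.edges.get((m, mj), 0.0),
                                model.edges.get((mj, m), 0.0))
                    checks = (model.constraints.get((m, mj), []) +
                              model.constraints.get((mj, m), []))
                    if covis < min_covis or not all(c(features[i], features[j]) for c in checks):
                        ok = False
                        break
                if ok:
                    ext = assignments + [(i, m)]
                    out.append(ext)
                    out.extend(grow_hypotheses(ext, features, compatible, model, min_covis))
        return out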
The hypotheses are used to estimate the pose using a quantitative method described by Du
[12] in which two labelled non-parallel, non-colinear, co-planar lines are used to estimate the
position and orientation of the camera, by means of pre-compiled look-up tables. Each of the 10
extended feature groups identified by the VIRM gives rise to a number of pose hypotheses,
based on the labelled 2-D features in the extended group. Fig. 7 shows three labelled feature
groups ((d), (g) & (j)), each containing two 2-D features, so that each gives two pose hypotheses.
Where these pose hypotheses are not consistent with each other, the labelling is rejected (Fig. 7

[Fig. 7 panels: (a) original image; (b) edgelets; (c) image features; (d) correct labellings; (e) pose based on ofw in (d); (f) pose based on orw in (d); (g) incorrect but consistent labellings; (h) pose based on nfw in (g); (i) pose based on nrw in (g); (j) inconsistent labellings; (k) pose based on ws in (j); (l) pose based on ofw in (j)]

Fig.7. Hypothesis generation process

(k), (l)). In the case here only two of the 10 hypotheses are retained by this requirement (Fig. 7
(d) & (g)), giving two pairs of very similar possible poses. These accepted hypotheses must be
subjected to further evaluation using view specific methods which are not discussed here. Details
of the pose verification process can be found in Brisdon [4], and Worrall [10].
Fig.8 shows further examples of hypotheses superimposed on the images, which in these
examples have been selected manually from the few candidates. It should be noted that, as
expected, the pose recovered is only approximate. The model can be used in the recognition of
occluded objects (Fig.8(f)), as well as scenes containing multiple objects (Fig.8(c)). Table 1
shows the number of hypotheses generated against the size of the combinatorial search space. To
make the comparison realistic, the search space quoted represents the number of possible triples
of feature labellings, containing at least one 2-D feature - these could form a comparable basis
for viewpoint inversion and subsequent view-dependent reasoning. It can be seen that the VIRM
is very effective in identifying the very few labellings in the interpretation tree which are
mutually consistent with a single view of the vehicle model.

(a) Selected from 12 poses (b) Selected from 10 poses (c) Selected from 19 poses

(d) Selected from 14 poses (e) Selected from 7 poses (f) Selected from 7 poses
Fig.8. Correct instances superimposed on a representative set of original images
Table 1: Number of hypotheses against the search space
Image       Number of    Number of        Number of        Number of       Size of
            Hypotheses   Quadrilaterals   U-shape Curves   Line Segments   Search Space
Fig. 8(a)       12            9                7               112         8.3 x 10^7
Fig. 8(b)       10            8                4               128         9.4 x 10^7
Fig. 8(c)       19           17                6                98         1.3 x 10^8
Fig. 8(d)       14            7                6                57         1.7 x 10^7
Fig. 8(e)        7            -                -               112         1.7 x 10^7
Fig. 8(f)        7            8                -               121         6.2 x 10^7

4. Results and Discussion
We have built VIRMs for the three vehicles shown in Fig. 1. Table 2 summarises the result of the
model building process. The results are obtained from 500 samples of the view with the camera
upright and its position limited to within 0° and 60° of the model's ground plane. In representing
the fastback car, an extra shape feature, a triangle, was introduced. The data change slightly each
time the model is built because viewpoints are selected randomly, but we have found that this
change is small and has no appreciable influence in the later object recognition process. The
number of constraints generated in the model building process depends on the thresholds selected
for the acceptance criteria. Such thresholds are inherent in any recognition problem, and must be
determined by experience. However, in our experiments the effects of the thresholds on final
recognition performance appear not to be dramatic, mainly affecting the time used in recognition.
The time used to construct the relational model is high, since all the relations among the
component parts of the object need to be assessed. At the present state of development the code

runs in pop11 and we have made no attempt to make the code efficient. Model generation takes
about one and a half hours on a Sun Sparc 2 with 24 MB memory. However, storage of the
eventual VIRM is very efficient. We need an m by m matrix to represent the pairwise co-visibility
of the object, and a similar m_2d by m_2d by m matrix (m_2d is the number of 2-D complexes) to store
co-visibility of feature clusters, and a set of procedures (typically 100 because a procedure may
include more than one geometrical constraint) to represent geometrical relations among the
component parts of the object.
Table 2: VIRMs for different types of vehicles

Number of                    Hatchback   Fastback   Estate
1-D Features                     22         26         28
2-D Features                     10         12         12
Pairwise Co-visibility          107        134        152
Triple Co-visibility            300        424        468
Parallel Constraints             43         58         67
Colinear Constraints             21         25         31
Side Relation Constraints        72         85         85
Relative Size Constraints        35         42         49

5. Summary
A method has been described for creating a view-independent relational model of an object used
in object recognition to aggregate features related to a pose hypothesis. A match is accepted as a
hypothesis, and therefore will be further evaluated, only when its relational support passes a
certain threshold. The model is created off-line and its use in object recognition requires no non-
linear calculations.

References
1. Berge, C., Graphs and Hypergraphs, New York: North-Holland, 1973.
2. Bolles, R. C., Horaud, P., "3DPO: A Three Dimensional Part Orientation System", The International Journal of
Robotics Research, Vol. 5, No. 3, Fall 1986.
3. Bray, A. J., "Recognising and Tracking Polyhedral Objects", Ph.D. Dissertation, University of Sussex, Oct. 1990.
4. Brisdon, K., Sullivan, G. D., Baker, K. D., "Feature Aggregation in Iconic Model Evaluation", Proc. of AVC-88,
Manchester, Sept. 1988.
5. Canny, J. F., "A Computational Approach to Edge Detection", IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI-8, No. 6, pp. 679-698, Nov. 1986.
6. Gigus, Z., Malik, J., "Computing the Aspect Graph for Line Drawings of Polyhedral Objects", IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-12, No. 2, pp. 113-122, Feb. 1990.
7. Goad, C., "Special Purpose Automatic Programming for 3D Model-Based Vision", Proceedings Image
Understanding Workshop, Virginia, USA, pp. 94-104, 1983.
8. Koenderink, J. J., Van Doorn, A. J., "The Internal Representation of Solid Shape with Respect to Vision", Biological
Cybernetics, Vol. 32, pp. 211-216, 1979.
9. Sullivan, G. D., "Alvey MMI-007 Vehicle Exemplar: Performance & Limitations", Proc. AVC-87, Cambridge,
England, Sept. 1987.
10. Worrall, A. D., Baker, K. D. and Sullivan, G. D., "Model Based Perspective Inversion", Proc. AVC-88, Manchester,
Aug. 1988.
11. Zhang, S., Du, L., Sullivan, G. D. and Baker, K. D., "Model-Based 3D Grouping by Using 2D Cues", Proc.
BMVC90, Oxford, Sept. 1990.
12. Zhang, S., Sullivan, G. D., and Baker, K. D., "Relational Model Construction and 3D Object Recognition", Proc.
BMVC91, Glasgow, Sept. 1991.
Learning to Recognize Faces from Examples
Shimon Edelman,1 Daniel Reisfeld,2 Yechezkel Yeshurun2

1 Dept. of Applied Mathematics and Computer Science, The Weizmann Institute of Science,
Rehovot 76100, Israel (edelman@wisdom.weizmann.ac.il)
2 Dept. of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel (reisfeld@math.tau.ac.il)

Abstract. We describe an implemented system that learns to recognize


human faces under varying pose and illumination conditions. The system
relies on symmetry operations to detect the eyes and the mouth in a face
image, uses the locations of these features to normalize the appearance of the
face, performs simple but effective dimensionality reduction by a convolution
with a set of Gaussian receptive fields, and subjects the vector of activities
of the receptive fields to a Radial Basis Function interpolating classifier.
The performance of the system compares favorably with the state of the art
in machine recognition of faces.

1 Learning from Examples as Function Interpolation

Classifying the image of a face as a picture of a given individual is probably the most
difficult recognition task that humans carry out on a routine basis with nearly perfect
success rate. It is not too surprising, therefore, that advances in face recognition by
computer fail to match recent progress in the recognition of general 3D objects. The
major problem in face recognition appears to be the design of a representation that,
on one hand, would be sufficiently informative to allow discrimination among inputs
that are all basically similar to each other, and, on the other hand, would be efficiently
computable. One way around this problem is to learn the required representations, e.g.,
by examining and remembering several instances of the input.
How can such a simple scheme generalize recognition to novel instances? In a standard
formulation of pattern recognition, a characteristic function is defined over a multidimen-
sional space, so that its value is close to 1 over the region corresponding to instances of
the pattern to be recognized, and is close to 0 elsewhere [2]. If the characteristic function
is smooth, recognition may be generalized to novel patterns of the same class by interpo-
lating the characteristic function, e.g., using splines. An efficient scheme for interpolating
(or approximating) smooth functions was proposed recently under the name of HyperBF
networks [9,6]. Within the HyperBF scheme, a multivariate function is expanded in terms
of basis functions, with parameter values that are learned from the data. For a scalar-
valued function, the expansion has the form f(x) = Σ_{α=1}^{n} c_α G(||x - t_α||²), where the
parameters t_α, which correspond to the centers of the basis functions, and the coefficients
c_α are unknown, and are in general much fewer than the data points (n < N). The pa-
rameters c, t are searched for during learning by minimizing the error functional defined
as E[f] = H_{c,t} = Σ_{i=1}^{N} (Δ_i)², where Δ_i = y_i - f(x_i) = y_i - Σ_{α=1}^{n} c_α G(||x_i - t_α||²). If
the centers t_α are fixed (e.g., are a subset of the training examples), the coefficients c_α
can be found by pseudo-inverting a matrix composed of center responses to the training
vectors [9] (other, iterative, methods such as gradient descent or stochastic search can
be used for the minimization of H). HyperBF interpolation has been previously applied
with success to 3D object recognition [7,3,1].
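As an illustration of the fixed-centers case, the numpy sketch below fits the coefficients c_α by pseudo-inverting the matrix of center responses; the Gaussian basis G and the width sigma are assumptions made for the example, not necessarily the choices used in the system described here.

    import numpy as np

    def rbf_fit(X, y, centers, sigma=1.0):
        # X: (N, d) training vectors, y: (N,) targets, centers: (n, d) fixed centers t_a.
        G = np.exp(-np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
                   / (2.0 * sigma ** 2))          # N x n matrix of center responses
        return np.linalg.pinv(G) @ y              # least-squares coefficients c_a

    def rbf_eval(x, centers, coeffs, sigma=1.0):
        g = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
        return float(g @ coeffs)                  # f(x) = sum_a c_a G(||x - t_a||^2)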

2 Learning Face Recognition

2.1 Preprocessing
Three-dimensional objects change their appearance when viewed from different direc-
tions and when the illumination conditions vary. We used alignment [13] to remove the
variability in the input images due to changing viewpoint. Our program starts with the
identification of anchor points: image features that are both relatively viewpoint-invariant
and well-localized. Good candidates for such features in face images are the eyes and the
mouth. The input image is then subjected to a 2D affine transformation that normalizes
its shape and size, so that the two eyes and the mouth are situated at fixed locations. The
parameters of the transformation are computed from the desired and the actual locations
of the anchor points in the image. We remark that the central assumption behind the
choice of 2D affine transform as the normalizing operation is that faces are, to a first
approximation, two-dimensional.
Our method of detecting the eyes and the mouth in face images is based on the
observation that the prominent facial features are highly symmetrical, compared to the
rest of the face [10]. We proposed in [11] a low-level operator that captures the intuitive
notion of such symmetries and produces a "symmetry map" of the image. This map is
then subjected to clustering. Geometrical relationships among the clusters, together with
the location of the midline (as defined by a cross-correlation between two halves of that
portion of the image that presumably contains a face), allow us to infer the position of
the face, and of the eyes and the mouth in it. These positions are then used as anchor
points for affine normalization.
After normalization, the input is a standard-size array of (8-bit) pixels, in which the
value of each pixel is determined both by the geometry of the face and by the direction
of the illumination. We next reduce the influence of illumination, by the usual method
of taking a directional derivative of the intensity distribution at each pixel. The input
is then subjected to dimensionality reduction, to increase both the efficiency and the
effectiveness of the HyperBF classifier.
A well-known statistical method for dimensionality reduction, principal component
analysis, has been applied recently to face recognition with some success [5,12]. In the
present work we chose to explore a considerably simpler method, based on the neuro-
biological notion of receptive field (RF), defined as that portion of the retinal visual
field whose stimulation affects the response of the neuron. Assuming that the neuron
performs spatial integration over its RF, its output is a (possibly nonlinear) function
of ∫∫_RF K(x, y) I(x, y) dx dy, where I(x, y) is the input, and K(x, y) is a weighting kernel
that we took to be Gaussian (cf. [8]). As noted in [4], pattern classification requires that
dimensionality reduction facilitate discrimination between classes, rather than faithful
representation of the data. Indeed, the vector of RF activities proved to be adequate for
representing face images for recognition, although it would be impossible to recover from
it the original structure of the image.
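The receptive-field reduction amounts to a handful of Gaussian-weighted sums over the normalized image; a sketch follows (the grid of centers and the width sigma are assumptions, not the system's actual parameters).

    import numpy as np

    def rf_activities(image, centers, sigma):
        # image: 2-D array (indexed image[y, x]); centers: list of (cx, cy) positions.
        h, w = image.shape
        ys, xs = np.mgrid[0:h, 0:w]
        acts = []
        for cx, cy in centers:
            K = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
            acts.append(float(np.sum(K * image)))   # integral of K(x, y) I(x, y) over the RF
        return np.array(acts)

    # e.g. a coarse grid of receptive fields over a 64 x 64 normalized face:
    # centers = [(x, y) for y in range(8, 64, 16) for x in range(8, 64, 16)]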

2.2 First Stage: Recognizing Individual Faces


We tested our recognition program on the subset of the MIT Media Lab database of
face images made available by Turk and Pentland [12], which contained 27 face images
of each of 16 different persons. The images were taken under varying illumination and
camera location. Of the 27 images available for each person, 17 randomly chosen ones
served for training the HyperBF recognizer, and the remaining 10 were used for testing.

A different recognizer was created for each person, and was trained to output 1 for the
images in the training set.
The performance of the individual recognizers was assessed by computing a 16 x 16
confusion table, in which the entries along the diagonal signified mean miss rates and
the off-diagonal entries signified mean false alarm rates. The table (see Figure 1, bottom) was
computed row by row, as follows. First, the recognizer for the person whose name appears at
the head of the row was trained. Second, the recognition threshold was set to the mean
output of the recognizer over the training set less two standard deviations. Third, the
performance of the recognizer on the test images of the same person was computed and
the miss rate entered on the diagonal of the table. The above choice of threshold resulted
in a mean miss rate of about 10%. Finally, the false alarm rates for the recognizer on
the images of the other 15 persons were computed and entered under the appropriate
columns of the table.
Our second experiment used no thresholds. Instead, recognition was declared for that
person whose recognizer was the most active among the sixteen. The performance of this
winner-take-all scheme is shown in Figure 2 (left).
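The two decision rules just described reduce to a few lines; in the sketch below the recognizer outputs are assumed to come from one trained RBF per person (as in the earlier sketch), and the factor of two standard deviations follows the text.

    import numpy as np

    def recognition_threshold(train_outputs, k=2.0):
        # threshold = mean output over the training set less k standard deviations
        return float(np.mean(train_outputs) - k * np.std(train_outputs))

    def winner_take_all(outputs):
        # declare recognition for the person whose recognizer is most active
        return int(np.argmax(outputs))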

2.3 Second Stage: Incorporating Ensemble Knowledge

An examination of the confusion table reveals that some of the individuals tended to
be confused with almost any other person in the database. To take advantage of this
"ensemble phenomenon", we trained another HyperBF module to accept vectors of in-
dividual recognizer activities and to produce vectors of the same length in which the
value corresponding to the activity of the correct recognizer was 1, and all other values
were 0 (see Figure 1, right top). The training set for the second-stage HyperBF module
was obtained by pooling the training sets of all 16 first-stage recognizers. The outcome
of the recognition of a test image was determined by finding the coordinate in the output
vector whose value was the closest to 1. The performance of the two-stage scheme was
considerably better than that of the individual recognizer stage alone (9% error rate,
compared to 22%), demonstrating the importance of ensemble knowledge for recognition
(Figure 2, right).

3 Summary

The approach to face recognition described in this paper was made possible by recent
advances in model-based object recognition [13], in automatic detection of spatial features
[10,11], and in applications of learning and of function approximation to recognition and
other visual functions [7,3,8]. The architecture of our system (in particular, its reliance on
receptive fields for dimensionality reduction and for classification) has been inspired by
the realization that receptive fields are the basic computational mechanism in biological
vision. The system's performance, which at present stands at about 5 - 9 % generalization
error rate under changes of orientation, size and lighting, compares favorably with the
state of the art in face recognition [12]. These results have the potential of contributing to
the evaluation of a recently proposed theory of brain function [6], and of making practical
impact in machine vision.

Fig. 1. Left top: a face image from the database we used, courtesy of Turk and Pentland [12],
before preprocessing. Left middle: a HyperBF network. Basis function centers t_α (points in the
multidimensional input space) are prototypes for which the desired response is known. The
output of the network is a linear superposition of the activities of all the basis function units.
In the limit case, when the bases are delta functions, the network becomes equivalent to a
look-up table holding the examples. Right top: The entire two-stage recognition scheme (see
text for explanation). Bottom: A confusion table representation of the performance of the first
stage. Entries along the diagonal correspond to "miss" error rates; off-diagonal entries signify
"false-alarm" error rates (zeros omitted for clarity).

bll -> .00 bll -) .OB


bra -) .20 bra -) .2B
day -> .3B day -) .OB
foo -> .40 foo -> .20
irf -> .30 irf -> .20
Joe -> .20 Joe -> .10
Mik -> .10 nik -> .8~
nln -> .@0 ~in -> .80
paa -> .10 paa -> .20
rob -> .BO rob -) .E~
sta -> .80 sta -> .lg
ste -> .60 ate -> .29
tha -> .20 tha -> .88
tra -> .68 tre -> .1E
vMb -> .30 vnb -> .1E
w~v -> .28 uav -> .18

Mean e r r o r rata: .22 Mean e r r o r rate: .89

Fig. 2. Left: performance of the one-stage recognition scheme. Right: performance of the two-
stage scheme that uses ensemble knowledge.

References
1. R. Brunelli and T. Poggio. HyperBF networks for real object recognition. In Proceedings
IJCAI, pages 1278-1284, Sydney, Australia, 1991.
2. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. Wiley, New York,
1973.
3. S. Edelman and T. Poggio. Bringing the Grandmother back into the picture: a memory-
based view of object recognition. A.I. Memo No. 1181, AI Lab, MIT, 1990. To appear in
Int. J. Pattern Recog. Artif. Intell.
4. N. Intrator, J. I. Gold, H. H. Bülthoff, and S. Edelman. Three-dimensional object recog-
nition using an unsupervised neural network: understanding the distinguishing features.
In D. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kauf-
mann, San Mateo, CA, 1992. To appear.
5. M. Kirby and L. Sirovich. Application of the Karhunen-Loève procedure for characteriza-
tion of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12(1):103-108, 1990.
6. T. Poggio. A theory of how the brain might work. Cold Spring Harbor Symposia on
Quantitative Biology, LV:899-910, 1990.
7. T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects.
Nature, 343:263-266, 1990.
8. T. Poggio, M. Fahle, and S. Edelman. Synthesis of visual modules from examples: learning
hyperacuity. A.I. Memo No. 1271, AI Lab, MIT, 1991. To appear in CVGIP B, 1992.
9. T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to
multilayer networks. Science, 247:978-982, 1990.
10. D. Reisfeld, H. Wolfson, and Y. Yeshurun. Detection of interest points using symmetry. In
Proceedings of the 3rd International Conference on Computer Vision, pages 62-65, Tokyo,
1990. IEEE, Washington, DC.
11. D. Reisfeld and Y. Yeshurun. Robust detection of facial features by generalized symmetry,
1991. In preparation.
12. M. Turk and A. Pentland. Eigenfaces for recognition. J. of Cognitive Neuroscience, 3:71-
86, 1991.
13. S. Ullman. Aligning pictorial descriptions: an approach to object recognition. Cognition,
32:193-254, 1989.
Face R e c o g n i t i o n through G e o m e t r i c a l Features

R. Brunelli 1 and T. Poggio 2

1 Istituto per la Ricerca Scientifica e Tecnologica


1-38050 Povo, Trento, ITALY
2 Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139, USA

Abstract. Several different techniques have been proposed for computer


recognition of human faces. This paper presents the first results of an ongo-
ing project to compare several recognition strategies on a common database.
A set of algorithms has been developed to assess the feasibility of recog-
nition using a vector of geometrical features, such as nose width and length,
mouth position and chin shape. The performance of a Nearest Neighbor
classifier, with a suitably defined metric, is reported as a function of the
number of classes to be discriminated (people to be recognized) and of the
number of examples per class. Finally, performance of classification with
rejection is investigated.

1 Introduction
The problem of face recognition, one of the most remarkable abilities of human vision,
was considered in the early stages of computer vision and is now undergoing a revival.
Different specific techniques were proposed or reproposed recently. Among those, one may
cite neural nets [9], elastic template matching [5, 23], Karhunen-Loewe expansion [20], al-
gebraic moments [11] and isodensity lines [16]. Typically, the relation of these techniques
with standard approaches and their relative performance has not been characterized well
or at all. Even absolute performance has been rarely measured with statistical signifi-
cance on meaningful databases. Psychological studies of human face recognition suggest
that virtually every type of available information is used [22]. Broadly speaking we can
distinguish two ways [19] to get a one-to-one correspondence between the stimulus (face
to be recognized) and the stored representation (face in the database):

Geometric, feature-based matching. A face can be recognized even when the details


of the individual features (such as eyes, nose and mouth) are no longer resolved. The
remaining information is, in a sense, purely geometrical and represents what is left at
a very coarse resolution. The idea is to extract relative position and other parameters
of distinctive features such as eyes, mouth, nose and chin [10, 14, 8, 2, 13]. This was
the first approach towards an automated recognition of faces [13].
Template matching. In the simplest version of template matching, visual patterns,
represented as bidimensional arrays of intensity values, are compared using a suitable
metric (typically the euclidean distance) and a single template, representing the whole
face, is used 3.
3 There are of course several, more sophisticated ways of performing template matching. For
instance, the array of grey levels may be suitably preprocessed before matching. Several full
templates per each face may be used to account for the recognition from different viewpoints.
Still another important variation is to use, even for a single viewpoint, multiple templates. A

In order to investigate the first of the above mentioned approaches we have developed a
set of algorithms and tested it on a data base of 47 different people.

2 Experimental setup

The database we used for the comparison of the different strategies is composed of 188
images, four for each of 47 people. Of the four pictures available, the first two were
taken in the same session (a time interval of a few minutes) while the other pictures
were taken at intervals of some weeks (2 to 4). The pictures were acquired with a CCD
camera at a resolution of 512 x 512 pixels as frontal views. The subjects were asked
to look into the camera but no particular efforts were made to ensure perfectly frontal
images. The illumination was partially controlled: the same powerful light was used but
the environment where the pictures were acquired was exposed to sun light through
windows. The pictures were taken randomly during the day time. The distance of the
subject from the camera was fixed only approximately, so that scale variations of as much
as 30 percent were possible.

3 Geometric, feature-based matching

As we have mentioned already, the very fact that face recognition is possible even at coarse
resolution, when the single facial features are hardly resolved in detail, implies that the
overall geometrical configuration of the face features is sufficient for discrimination. The
overall configuration can be described by a vector of numerical data representing the
position and size of the main facial features: eyes and eyebrows, nose and mouth. This
information can be supplemented by the shape of the face outline. As put forward by
Kaya and Kobayashi [14] the set of features should satisfy the following requisites:

- estimation must be as easy as possible;


- dependency on light conditions must be as small as possible;
- dependency on small changes of face expression must be small;
- information contents must be as high as possible.

The first three requirements are satisfied by the set of features we have adopted, while
their information contents is characterized by the experiments described later.
The first attempts at automatic recognition of faces by using a vector of geometri-
cal features are probably due to Kanade [13] in 1973. Using a robust feature detector
(built from simple modules used within a backtracking strategy) a set of 16 features was
computed. Analysis of the inter and intra class variances revealed some of the parame-
ters to be ineffective, yielding a vector of reduced dimensionality (13). Kanade's system
achieved a peak performance of 75% correct identification on a database of 20 different
people using two images per person, one for reference and one for testing.
The computer procedures we implemented are loosely based on Kanade's work and
will be detailed in the next sections. The database used is however more meaningful (in
the sense of being greater) both in the number of classes to be recognized, and in the
number of instances of the same person to be recognized.

face is stored then as a set of distinct(ive) smaller templates [1]. A rather different approach
is based on the technique of elastic templates [6, 5, 23]

3.1 Normalization

One of the most critical points when using a vector of geometrical features is that of proper
scale normalization. The extracted features must be somehow normalized in order to be
independent of position, scale and rotation of the face in the image plane. Translation
dependency can be eliminated once the origin of coordinates is set to a point which can
be detected with good accuracy in each image. The approach we have followed achieves
scale and rotation invariance by setting the interocular distance and the direction of the
eye-to-eye axis. We will describe the steps of the normalization procedure in some detail
since they are themselves of some interest.
The first step in our technique resembles that of Baron [1] and is based on template
matching by means of a normalized cross-correlation coefficient, defined by :

    C_N = ( <I_T T> - <I_T><T> ) / ( σ(I_T) σ(T) )          (1)

where I_T is the patch of image I which must be matched to T, < > is the average operator,
I_T T represents the pixel-by-pixel product, and σ is the standard deviation over the area
being matched. This normalization rescales the template and image energy distribution
so that their average and variances match.
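A direct (and deliberately naive) implementation of eq. (1) is given below; a practical system would use the hierarchical scheme described below rather than this brute-force scan.

    import numpy as np

    def normalized_correlation(patch, template, eps=1e-8):
        # C_N for one image patch, as in eq. (1)
        p = patch.astype(float).ravel()
        t = template.astype(float).ravel()
        num = np.mean(p * t) - np.mean(p) * np.mean(t)
        return num / (np.std(p) * np.std(t) + eps)

    def correlation_map(image, template):
        # brute-force map of C_N over all template positions
        H, W = image.shape
        h, w = template.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x] = normalized_correlation(image[y:y + h, x:x + w], template)
        return out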
The eyes of one of the authors (without eyebrows) were used as a template to locate
eyes on the image to be normalized. To cope with scale variations, a set of 5 eyes templates
was used, obtained by scaling the original one (the set of scales used is 0.7, 0.85, 1, 1.15,
1.3 to account for the expected scale variation). Eyes position was then determined
looking for the maximum absolute value of the normalized correlation values (one for
each of the templates). To make correlation more robust against illumination gradients,
each image was preprocessed by dividing each pixel by the average intensity on a suitably
large neighborhood.
It is well known that correlation is computationally expensive. Additionally, eyes of
different people can be markedly different. These difficulties can be significantly reduced
by using hierarchical correlation (as proposed by Burt in [7]). Gaussian pyramids of the
preprocessed image and templates are built. Correlation is done starting from the lowest
resolution level, progressively reducing the area of computation from level to level by
keeping only a progressively smaller area.
Once the eyes have been detected, scale is pre-adjusted using the ratio of the scale
of the best responding template to the reference template. The position of the left and
right eye is then refined using the same technique (with a left and a right eye template).
The resulting normalization proved to be good. The procedure is also able to absorb a
limited rotation in the image plane (up to 15 degrees). Once the eyes have been indepen-
dently located, rotation can be fixed by imposing the direction of the eye-to-eye axis,
which we assumed to be horizontal in the natural reference frame. The resolution of the
normalized pictures used for the computation of the geometrical features was of 55 pixels
of interocular distance.

3.2 Feature Extraction

Face recognition, while difficult, presents interesting constraints which can be exploited
in the recovery of facial features. An important set of constraints derives from the fact
that almost every face has two eyes, one nose, one mouth with a very similar layout.
While this may make the task of face classification more difficult, it can ease the task of
feature extraction: average anthropometric measures can be used to focus the search of a

particular facial feature and to validate results obtained through simple image processing
techniques [3, 4].
A very useful technique for the extraction of facial features is that of integral pro-
jections. Let Z(x, y) be our image. The vertical integral projection of Z(x, y) in the
[x1, x2] x [y1, y2] domain is defined as:

    V(x) = Σ_{y=y1}^{y2} Z(x, y)          (2)

The horizontal integral projection is similarly defined as:

    H(y) = Σ_{x=x1}^{x2} Z(x, y)          (3)
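In code the two projections are one-line sums; the sketch below assumes the image array is indexed as Z[y, x].

    import numpy as np

    def vertical_projection(Z, x1, x2, y1, y2):
        # V(x) = sum over y in [y1, y2] of Z(x, y), evaluated for x in [x1, x2]
        return Z[y1:y2 + 1, x1:x2 + 1].sum(axis=0)

    def horizontal_projection(Z, x1, x2, y1, y2):
        # H(y) = sum over x in [x1, x2] of Z(x, y), evaluated for y in [y1, y2]
        return Z[y1:y2 + 1, x1:x2 + 1].sum(axis=1)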

This technique was successfully used by Takeo Kanade in his pioneering work [13] on
recognition of human faces. Projections can be extremely effective in determining the
position of features provided the window on which they act is suitably located to avoid
misleading interferences. In the original work of Kanade the projection analysis was
performed on a binary picture obtained by applying a laplacian operator (a discretization
of ∂_xx I + ∂_yy I) on the grey-level picture and by thresholding the result at a proper level.
The use of a laplacian operator, however, does not provide information on edge (that
is gradient) directions. We have chosen therefore to perform edge projection analysis by
partitioning the edge map in terms of edge directions. There are two main directions in
our constrained face pictures: horizontal and vertical4.

Fig. 1. Horizontal and vertical edge dominance maps

Horizontal gradients are useful to detect the left and right boundaries of face and nose,
while vertical gradients are useful to detect the head top, eyes, nose base and mouth.
Once eyes have been located using template matching, the search for the other features
can take advantage of the knowledge of their average layout.
Mouth and nose are located using similar strategies. The vertical position is guessed
using anthropometric standards. A first, refined estimate of their real position is obtained
4 A pixel is considered to be in the vertical edge map if the magnitude of the vertical component
of the gradient at that pixel is greater than the horizontal one. The gradient is computed using
a gaussian regularization of the image. Only points where the gradient intensity is above an
automatically selected threshold are considered [21, 3].

Fig. 2. LEFT: Horizontal and vertical nose restriction. RIGHT: Horizontal mouth restriction

looking for peaks of the horizontal projection of the vertical gradient for the nose, and
for valleys of the horizontal projection of the intensity for the mouth (the line between
the lips is the darkest structure in the area, due to its configuration). The peaks (and
valleys) are then rated using their prominence and distance from the expected location
(height and depth are weighted by a gaussian factor). The ones with the highest rating
are taken to be the vertical position of nose and mouth. Having established the vertical
position, search is limited to smaller windows.
The nose is delimited horizontally searching for peaks (in the vertical projection of
horizontal edge map) whose height is above the average value in the searched window.
The nose boundaries are estimated from the leftmost and rightmost peaks. Mouth height
is computed using the same technique but applied to the vertical gradient component.
The use of directional information is quite effective at this stage, cleaning much of the
noise which would otherwise impair the feature extraction process. Mouth width is finally
computed thresholding the vertical projection of the horizontal edge map at the average
value (see Fig. 2).
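The rating of projection peaks against an anthropometric prior can be sketched as follows (the Gaussian weighting and the simple local-maximum test are assumptions; valleys can be rated by negating the projection first).

    import numpy as np

    def best_rated_peak(projection, expected_pos, sigma):
        # local maxima of a 1-D projection, weighted by distance from the expected position
        p = np.asarray(projection, dtype=float)
        peaks = [i for i in range(1, len(p) - 1) if p[i] >= p[i - 1] and p[i] >= p[i + 1]]
        rated = [(p[i] * np.exp(-((i - expected_pos) ** 2) / (2.0 * sigma ** 2)), i)
                 for i in peaks]
        return max(rated)[1] if rated else None     # index of the best-rated peak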
Eyebrows position and thickness can be found through a similar analysis. The search
is once again limited to a focussed window, just above the eyes, and the eyebrows are
found using the vertical gradient map. Our eyebrows detector looks for pairs of peaks
of gradient intensity with opposite direction. Pairs from one eye are compared to those
of the other one: the most similar pair (in term of the distance from the eye center and
thickness) is selected as the correct one.
We used a different approach for the detection of the face outline. Again we have
attempted to exploit the natural constraints of faces. As the face outline is essentially el-
liptical, dynamic programming has been used to follow the outline on a gradient intensity
map of an elliptical projection of the face image. The reason for using an elliptical coor-
dinate system is that a typical face outline is approximately represented by a line. The
computation of the cost function to be minimized (deviation from the assumed shape,
an ellipse represented as a line) is simplified, resulting in a serial dynamic problem which
can be efficiently solved [4].
In summary, the resulting set of 22 geometrical features that are extracted automatically
in our system and that are used for recognition (see Fig. 3), is the following:

- eyebrows thickness and vertical position at the eye center position;



Fig. 3. Geometrical features (black) used in the face recognition experiments

- nose vertical position and width;


- mouth vertical position, width and height;
- eleven radii describing the chin shape;
- bigonial breadth;
- zygomatic breadth.

3.3 Recognition Performance
Detection of the features listed above associates to each face a twentytwo-dimensional
numerical vector. Recognition is then performed with a Nearest Neighbor classifier, with a
suitably defined metric. Our main experiment aims to characterize the performance of the
feature-based technique as a function of the number of classes to be discriminated. Other
experiments try to assess performance when the possibility of rejection is introduced. In
all of the recognition experiments the learning set had an empty intersection with the
testing set.
The first observation is that the vectors of geometrical features extracted by our
system have low stability, i.e. the intra-class variance of the different features is of the
same order of magnitude of the inter-class variance (from three to two times smaller).
This is reflected by the superior performance we have been able to achieve using the
centroid of the available examples (either 1 or 2 or 3) to model the frontal view of each
individual (see Fig. 4).
An important step in the use of metric classification using a Nearest Neighbor classifier
is the choice of the metric which must take into account both the interclass variance and
the reliability of the extracted data. Knowledge of the feature detectors and of the face
configuration allows us to establish, heuristically, different weights (reliabilities) for the
single features. Let {x_i} be the feature vector, {σ_i} be the inter-class dispersion vector
and {w_i} the weight (reliability) vector. The distance of two feature vectors {x_i}, {x'_i} is
then expressed as:

    D_α(x, x') = Σ_{i=1}^{22} w_i ( |x_i - x'_i| / σ_i )^α          (4)
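Assuming the weighted form of eq. (4) reconstructed above, a Nearest Neighbor classifier with an optional rejection threshold can be sketched as follows (the centroid models, the weights and the exponent are illustrative inputs; this is not the authors' actual code).

    import numpy as np

    def feature_distance(x, x_prime, sigma, w, alpha=1.2):
        # weighted distance of eq. (4); alpha = 2 gives a weighted Euclidean metric
        x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
        return float(np.sum(w * (np.abs(x - x_prime) / sigma) ** alpha))

    def nearest_neighbor(x, models, sigma, w, reject_threshold=None, alpha=1.2):
        # models: dict person -> centroid feature vector; returns None on rejection
        dists = {p: feature_distance(x, m, sigma, w, alpha) for p, m in models.items()}
        best = min(dists, key=dists.get)
        if reject_threshold is not None and dists[best] > reject_threshold:
            return None
        return best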

A useful indication of the robustness of the classification is given by an estimate of the class
separation. This can be done using the so-called MIN/MAX ratio [17, 18], hereafter R_mM,
which is defined as the minimum distance on a wrong correspondence over the distance
from the correct correspondence. The performance of the classifier at different values of
α has also been investigated. The value of α giving the best performance is α = 1.2, while
the robustness of the classification decreases with increasing α. This result, if generally
true, may be extremely interesting for hardware implementations, since absolute values
are much easier to compute in silicon than squares. The underlying reason for the good
performance of α values close to 1 is probably related to properties of robust statistics
[12] 5. Once the metric has been set, the dependency of the performance on the number
of classes can be investigated. To obtain these data, a number of recognition experiments
have been conducted on randomly chosen subsets of classes at the different required
cardinalities. The average values on round-robin rotation experiments on the available
sets are reported. The plots in Fig. 4 report both recognition performance and the R_mM
ratio. As expected, both exhibit a monotonically decreasing trend for increasing
cardinality.
A possible way to enhance the robustness of classification is the introduction of a
rejection threshold. The classifier can then suspend classification if the input is not suf-
ficiently similar to any of the available models. Rejection could trigger the action of
a different classifier or the use of a different recognition strategy (such as voice iden-
tification). Rejection can be introduced, in a metric classifier, by means of a rejection
threshold: if the distance of a given input vector from all of the stored models exceeds
the rejection threshold the vector is rejected. A possible figure of merit of a classifier
with rejection is given by the recognition performance with no errors (vectors are either
correctly recognized or rejected). The average performance of our classifier as a function
of the rejection threshold is given in Fig. 5. 6

4 Conclusion

A set of algorithms has been developed to assess the feasibility of recognition using a
vector of geometrical features, such as nose width and length, mouth position and chin
shape. The advantages of this strategy over techniques based on template matching are
essentially:

- compact representation (as low as 22 bytes in the reported experiments);


- high matching speed.

The dependency of recognition performance using a Nearest Neighbor classifier has been
reported for several parameters such as:

- number of classes to be discriminated (i.e. people to be recognized);


- number of examples per class;
- rejection threshold.

5 We could have chosen other classifiers instead of Nearest Neighbor. The HyperBF classifier,
used in previous experiments of 3D object recognition, allows the automatic choice of the
appropriate metric, which is still, however, a weighted euclidean metric.
6 Experiments by Lee on a OCR problem [15] suggest that a HyperBF classifier would be
significantly better than a NN classifier in the presence of rejection thresholds.


Fig. 4. LEFT: Performance as a function of the number of examples. RIGHT: Recognition
performance and MIN/MAX ratio as a function of the number of classes to be discriminated

Fig. 5. Analysis of the classifier as a function of the rejection threshold

The attained performance suggests that recognition by means of a vector of geomet-


rical features can be useful for small databases or as a screening step for more complex
recognition strategies.
These data are the first results of a project which will compare several techniques for
automated recognition on a common database, thereby providing quantitative informa-
tion on the performance of different recognition strategies.

Acknowledgements
The authors thank Dr. L. Stringa for helpful suggestions and stimulating discussions.
One of the authors (R.B) thanks Dr. M. Dallaserra for providing the image data base.
Thanks are also due to Dr. C. Furlanello for comments on an earlier draft of this paper.

References

1. R. J. Baron. Mechanisms of human facial recognition. International Journal of Man Ma-


chine Studies, 15:137-178, 1981.
2. W. W. Bledsoe. Man-machine facial recognition. Technical Report Rep. PRI:22, Panoramic
Reseaxch Inc, Patio Alto, Cal., 1966.
3. R. Brunelli. Edge projections for facial feature extraction. Technical Report 9009-12,
I.R.S.T, 1990.
4. R. Brunelli. Face recognition: Dynamic programming for the detection of face outline.
Technical Report 9104-06, I.R.S.T, 1991.
5. J. Buhmann, J. Lange, and C. von der Malsburg. Distortion invariant object recognition
by matching hierarchically labeled graphs. In Proceedings of IJCNN'89, pages 151-159,
1989.
6. D. J. Burr. Elastic matching of line drawings. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 3(6):708-713, 1981.
7. P. J. Burt. Smart sensing within a pyramid vision machine. Proceedings of the IEEE,
76(8):1006-1015, 1988.
8. H. Chan and W. W. Bledsoe. A man-machine facial recognition system: some preliminary
results. Technical report, Panoramic Research Inc., Cal, 1965.
9. G. Cottrell and M. Fleming. Face recognition using unsupervised feature extraction. In
Proceedings of the International Neural Network Conference, 1990.
10. A. J. Goldstein, L. D. Harmon, and A. B. Lesk. Identification of human faces. In Proc.
IEEE, Vol. 59, page 748, 1971.
11. Zi-Quan Hong. Algebraic feature extraction of image for recognition. Pattern Recognition,
24(3):211-219, 1991.
12. P. J. Huber. Robust Statistics. Wiley, 1981.
13. T. Kanade. Picture processing by computer complex and recognition of human faces. Tech-
nical report, Kyoto University, Dept. of Information Science, 1973.
14. Y. Kaya and K. Kobayashi. A basic study on human face recognition. In S. Watanabe,
editor, Frontiers of Pattern Recognition, page 265. 1972.
15. Y. Lee. Handwritten digit recognition using k nearest-neighbor, radial basis functions and
backpropagation neural networks. Neural Computation, 3(3), 1991.
16. O. Nakamura, S. Mathur, and T. Minami. Identification of human faces based on isodensity
maps. Pattern Recognition, 24(3):263-272, 1991.
17. T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects.
Nature, 343(6225):1-3, 1990.
18. T. Poggio and F. Girosi. A theory of networks for approximation and learning. Technical
Report A.I. Memo No. 1140, Massachusetts Institute of Technology, 1989.
19. J. Sergent. Structural processing of faces. In A.W. Young and H.D. Ellis, editors, Handbook
of Research on Face Processing. North-Holland, Amsterdam, 1989.
20. M. Turk and A. Pentland. Eigenfaces for recognition. Technical Report 154, MIT Media
Lab Vision and Modeling Group, 1990.
21. H. Voorhees. Finding texture boundaries in images. Technical Report AI-TR 968, M.I.T.
Artificial Intelligence Laboratory, 1987.
22. A. W. Young and H. D. Ellis, editors. Handbook of Research on Face Processing. NORTH-
HOLLAND, 1989.
23. Alan L. Yuille. Deformable templates for face recognition. Journal of Cognitive Neuro-
science, 3(1):59-70, 1991.

This article was processed using the LaTeX macro package with ECCV92 style
Fusion through Interpretation

Mark J.L. Orr 1,2, John Hallam 3, Robert B. Fisher 3


1 Advanced Robotics Research Ltd., University Road, Salford M5 4PP, England
2 SD-Scicon Ltd., Abney Park, Cheadle, Cheshire SK8 2PD, England
3 Department of Artificial Intelligence, Edinburgh University, Forrest Hill, Edinburgh EH1 2QL,
Scotland

Abstract. We discuss two problems in the context of building environ-


ment models from multiple range images. The first problem is how to find
the correspondences between surfaces viewed in images and surfaces stored
in the environment model. The second problem is how to fuse descriptions
of different parts of the same surface patch. One conclusion quickly reached
is that in order to solve the image-model correspondence problem in a rea-
sonable time the environment model must be divided into parts.

1 Introduction

In many applications of mobile robots there is a need to construct environment models


from data gathered as the environment is explored. Environment models are useful for
tasks such as recognising objects, navigating routes and planning the acquisition of new
data. It is common to use laser scanning devices which produce range images as a pri-
mary source of information. This paper is concerned with some of the problems arising
when multiple range images from different viewpoints are fused into a single environment
model.
The first stage in processing a range image is to segment it into distinct surface
patches. Although real environments possess a proportion of curved surfaces, we assume
that the segmented patches are planar with poly-line boundaries. When the environment
is really curved, the segmented image will contain a group of smaller planar patches
approximating the curved surface. We note that it is difficult to construct fast and reliable
segmentation systems for non-planar surfaces. However, while the details are different,
the principles of the methods we use also apply to curved surfaces.
The surface descriptions extracted from each image and the descriptions contained
in the environment model relate to different coordinate frames. As the robot and the
sensor attached to it move about, the relation between the model frame and the image
frame changes. Before any data from the image can be added to the model, the image
surfaces must be transformed into the model coordinate frame. Surfaces common to
both image and model, if there are any, can be used to estimate this transform as long as
corresponding pairs can be identified correctly. However, finding these correspondences
is complicated by the effects of occlusion which can result in different parts of a surface
being visible in different images. Consequently, it is necessary to rely for comparisons on
properties, such as relative distance and relative orientation, which are independent of
occlusion and frame of reference. The method we use, based on constrained search and
hypothesis testing, is discussed in Sect. 3.
Section 4 discusses the problem of updating the description of an environment model
802

surface with information from a new description of the same patch. If the existing de-
scription is incomplete because of occlusion, the new description may supply information
about the 'missing' parts. There is thus a requirement to be able to combine the infor-
mation from two incomplete descriptions.
Underlying everything is the problem of uncertainty: how to make estimates and take
decisions in the presence of sensor noise. Much attention has been given to this subject
in recent years and stochastic methods have become the most popular way of handling
uncertainty. With these methods, when large numbers of estimates and/or decisions are
required, the computational burden can be quite substantial and there may be a need to
find ways of improving efficiency. Section 2 below touches on this issue.
A more detailed version of this paper can be found in [8].

2 Uncertainty

The type of uncertainty we are talking about is primarily due to noise in the numerical data delivered by sensors. Recently, it has become standard practice in robotics and
computer vision [1, 2, 9, 7] to represent uncertainty explicitly by treating parameters
as random variables and specifying the first two moments (mean and variance) of their
probability distributions (generally assumed to be Gaussian). This permits the use of
techniques such as the Extended Kalman Filter for estimation problems, and the Maha-
lanobis Distance test for making decisions.
The Mahalanobis Test is used to decide whether two estimates are likely to refer to
the same underlying quantity. For example, suppose two surface descriptions give area
estimates of $(a, A)$ and $(\hat{a}, B)$ (the first member of each pair is the mean, the second is the variance). These estimates can be compared by computing the quantity

$$D_a = \frac{(a - \hat{a})^2}{A + B},$$

which has a $\chi^2$ distribution. Thus one can choose an appropriate threshold on $D_a$ to test the hypothesis that the surface being described is the same in each case.
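
For concreteness, a minimal sketch of this scalar test in Python (the 95% significance level and the use of scipy are assumptions, not taken from the paper):

    from scipy.stats import chi2

    def same_area(a, A, a_hat, B, significance=0.95):
        # Mahalanobis test on two scalar area estimates (a, A) and (a_hat, B),
        # where the second member of each pair is the variance.
        D_a = (a - a_hat) ** 2 / (A + B)
        return D_a <= chi2.ppf(significance, df=1)   # chi-square, 1 d.o.f.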
The same test is applicable in more complicated situations involving binary relations
(between pairs of surfaces) and vector valued parameters. In general, some relation like
$$g(x_1, y_1, x_2, y_2) = 0 \qquad (1)$$

will hold between the true values, $x_1$ and $x_2$, of parameters describing some aspect of two image surfaces and the true values, $y_1$ and $y_2$, of parameters describing the same aspect of two model surfaces - though only if the two pairs correspond. If the parameter estimates are $(\hat{x}_i, X_i)$ and $(\hat{y}_i, Y_i)$, $i = 1, 2$, then, to first order, the mean and variance of $g$ are

$$\hat{g} = g(\hat{x}_1, \hat{y}_1, \hat{x}_2, \hat{y}_2),$$

$$G = \sum_{i=1}^{2} \left( \frac{\partial g}{\partial x_i} X_i \frac{\partial g}{\partial x_i}^T + \frac{\partial g}{\partial y_i} Y_i \frac{\partial g}{\partial y_i}^T \right)$$

(where the Jacobians are evaluated at $\hat{x}_i$ and $\hat{y}_i$). To test the hypothesis that the two pairs correspond the Mahalanobis Distance $D = \hat{g}^2 / G$ is computed and compared with the appropriate $\chi^2$ threshold.

If such measures have to be computed frequently but are usually expected to result in
hypothesis rejections (as in interpretation trees - see Sect. 3), there is an efficient method
for their calculation. We illustrate this for the case of binary relations for the relative distance of two points ($p_i$ and $q_i$) and the relative orientations of two vectors ($u_i$ and $v_i$). The appropriate functions are, respectively,

$$g_d = (p_1 - p_2)^T(p_1 - p_2) - (q_1 - q_2)^T(q_1 - q_2),$$

$$g_o = u_1^T u_2 - v_1^T v_2.$$

Additive terms of the form $x^T A x$, where $A$ is a variance matrix and $x$ is a vector, occur in the expressions for the scalar variances $G_d$ and $G_o$. We can use the Rayleigh-Ritz Theorem [6] and the fact that variance matrices are positive definite to bound such expressions from above by

$$x^T A x \le \lambda_{\max}(A)\, x^T x \le \mathrm{trace}(A)\, x^T x.$$

This leads to cheaply calculated upper bounds on $G_d$ and $G_o$ and corresponding lower bounds on $D_d$ and $D_o$. Since these will usually exceed the thresholds, only in a minority of cases will it be necessary to resort to the full, and more expensive, calculations of the variances.
When the relation holding between the parameters (the function in (1)) is vector
valued (as for direct comparisons of infinite plane parameters - Sect. 3) a similar proce-
dure can be used. This avoids the necessity of performing a matrix inverse for every test
through the inequality

$$D = \hat{g}^T G^{-1} \hat{g} \ge \frac{\hat{g}^T \hat{g}}{\mathrm{trace}(G)}.$$
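
A Python sketch of the two-stage test for the relative-distance relation $g_d$ is given below; the explicit first-order variance $G_d = 4\,[\,dp^T(P_1{+}P_2)\,dp + dq^T(Q_1{+}Q_2)\,dq\,]$ is our own expansion of the Jacobian formula above, and the threshold choice is an assumption.

    import numpy as np
    from scipy.stats import chi2

    def distance_test(p1, P1, p2, P2, q1, Q1, q2, Q2, significance=0.95):
        # Lower-case arguments are 3-D mean vectors, upper-case their 3x3 variances.
        dp, dq = p1 - p2, q1 - q2
        g_d = dp @ dp - dq @ dq
        threshold = chi2.ppf(significance, df=1)

        # Stage 1: x'Ax <= trace(A) x'x gives an upper bound on the variance,
        # hence a cheap lower bound on the Mahalanobis distance.
        G_upper = 4.0 * (np.trace(P1 + P2) * (dp @ dp) + np.trace(Q1 + Q2) * (dq @ dq))
        if g_d ** 2 > threshold * G_upper:
            return False                 # the cheap bound already rejects the hypothesis

        # Stage 2: full first-order variance, computed only for surviving pairs.
        G = 4.0 * (dp @ (P1 + P2) @ dp + dq @ (Q1 + Q2) @ dq)
        return g_d ** 2 <= threshold * G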

3 Finding the Correspondences

Popular methods for solving correspondence problems include constrained search [5] and
generate-and-test [4]. We have adopted a hybrid approach similar to [3] where an inter-
pretation tree searches for consistent correspondences between groups of three surfaces,
the correspondences are used to estimate the image-to-model transform, the transform
is used to predict the location of all image surfaces in the model, and the prediction is
used to test the plausibility of the original three correspondences.
To constrain the search we use a unary relation on surface area, a binary relation
on relative orientation of surface normals and a binary relation on relative distance of
mid-points (see Sect. 2). For surface patches with occluded boundaries it is necessary to
make the variances on area and mid-point position appropriately large. In the case of
mid-point position, efficiency is maximised by increasing the uncertainty only in the plane
of the surface (so that one of the eigenvalues of the variance matrix, with an eigenvector
parallel to the surface normal, is small compared to the other two).
Transforms are estimated from the infinite plane parameters n and d (for any point
$\mathbf{x}$ in the plane $\mathbf{n}^T\mathbf{x} = d$, where $\mathbf{n}$ is the surface normal). The measurement equation used
in the Extended Kalman Filter is

$$\begin{bmatrix} \mathbf{n}_m \\ d_m \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{0} \\ \mathbf{t}^T\mathbf{R} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{n}_i \\ d_i \end{bmatrix}$$

where $[\mathbf{n}_m^T\ d_m]^T$ and $[\mathbf{n}_i^T\ d_i]^T$ are the parameter vectors for corresponding model and image planes, $\mathbf{t}$ is the translation and $\mathbf{R}$ is the rotation matrix, parameterised by a three
component vector equal to the product of the rotation angle and the rotation axis [10].
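
The plane-transformation relation encoded by this measurement equation (as reconstructed above) can be checked numerically; the following Python fragment only illustrates that geometry, it is not the Kalman filter itself.

    import numpy as np

    def transform_plane(n_i, d_i, R, t):
        # Map image-frame plane parameters (n_i, d_i), with n_i . x = d_i,
        # into the model frame under x_model = R x_image + t.
        n_m = R @ n_i
        d_m = d_i + n_m @ t
        return n_m, d_m

    # A point on the image plane must satisfy the transformed plane equation.
    theta = 0.3
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([0.5, -1.0, 2.0])
    n_i, d_i = np.array([0.0, 0.0, 1.0]), 2.0
    x_i = np.array([1.0, -4.0, 2.0])               # satisfies n_i . x_i = d_i
    n_m, d_m = transform_plane(n_i, d_i, R, t)
    assert np.isclose(n_m @ (R @ x_i + t), d_m)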
The transform estimated for each group of three correspondences is used to transform
all the image surfaces into model coordinates allowing a direct comparison of positions
and orientations. Assuming there is at least one group which leads to a sufficiently large
number of further correspondences (hits) to add to the original three, the group with
the most is chosen as the correct interpretation. If there is no overlap between image
and model, none of the groups will develop more hits than the number expected on the
basis of random coincidence (which can be calculated and depends on the noise levels).
Moving objects in the scene result in multiple consistent groups with distinct transform
estimates.
The time required to find all the consistent triples and calculate the number of hits
for each is proportional to a fourth order polynomial in M and N - the number of, re-
spectively, model and image surfaces [8]. The number of consistent triples is proportional
to a third order polynomial in M and N, all but one of them (in a static scene) coming
about by random coincidence. Both also depend on noise levels: as uncertainty increases
the search constraints become less efficient, more time is spent searching the interpreta-
tion tree and more consistent groups are generated by coincidence. In practice, for noise
levels of a few percent and sizes of $M > 10^3$ and $N > 10$, the process is intractable.
A dramatic change can be made by partitioning the environment model into parts and
searching for correspondences between the image and each part separately (instead of
between the image and the whole model). If there are P parts, the search time is reduced
by a factor of $P^3$ and the number of spurious solutions by $P^2$. Such a partition is sensible
because it is unlikely that the robot will be able to simultaneously view surfaces of two
different rooms in the same building. The perceptual organization can be carried out as
a background recognition process with a generic model of what constitutes a part (e.g. a
room model).

4 Updating the Environment Model

Updating the infinite plane parameters of a model surface after a correspondence has
been found between it and an image surface is relatively straightforward using an Ex-
tended Kalman Filter. However, updating the boundary or shape information cannot be
achieved in the same manner because it is impossible to describe the boundary with a
single random variable. Moreover, because of the possibility of occlusion, the shapes of
corresponding surfaces may not be similar at all. The problem has some similarity with
the problem of matching strings which have common substrings.
The method we have adopted is again based on finding correspondences but only
between small data sets with efficient search constraints, so there is not a combinatorial
explosion problem. The features to be matched are the vertices and edges making up
the poly-line boundaries of the two surface patches, there being typically about 10 edge
features in each. If the two boundary descriptions relate to the same coordinate frame
the matching criteria may include position and orientation information as well as vertex
angles, edge lengths and edge labels (occluded or unoccluded). In practice, because of
the possibility of residual errors in the model-image transform estimate (see Sect. 3),

we exclude position and orientation information from the matching, calculate a new
transform estimate from the matched boundary features and use the old estimate to
check for consistency.
The search procedure is seeded by choosing a pair of compatible vertices (similar ver-
tex angles) with unoccluded joining edges, so it relies on there being at least one common
visible vertex. A new boundary is then traced out by following both boundaries around.
When neither edge is occluded both edges are followed; when one edge is occluded the
other is followed; when both edges are occluded the outermost is followed. Uncertainty
and over- or under-segmentation of the boundaries may give rise to different possible
feature matches (handled by an interpretation tree) but the ordering of features around
each boundary greatly constrains the combinatorics. If two unoccluded edges don't over-
lap, if an occluded edge lies outside an unoccluded one or if the transform estimate is
incompatible with the previously estimated image-model transform then the seed vertex
match is abandoned and a new one tried. If a vertex match is found which allows the
boundary to be followed round right back to the initial vertices, the followed boundary
becomes the new boundary of the updated surface. Otherwise, the two boundaries must
represent disjoint parts of the same surface and the updated surface acquires both.

5 Conclusions

Constrained search (interpretation trees) with stochastic techniques for handling un-
certainty can be used to solve both the image-model correspondence problem and the
boundary-boundary correspondence problem in order to fuse together multiple range
images into a surface-based environment model. The combinatorics of the image-model
problem are such that environment models must be divided into small parts if the solution
method is to be tractable while the combinatorics of the boundary-boundary problem
are inherently well behaved.

References
1. N. Ayache and O.D. Faugeras. Maintaining representations of the environment of a mobile
robot. In Robotics Research 4, pages 337-350. MIT Press, USA, 1988.
2. Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press, UK,
1988.
3. T.J. Fan, G. Medioni, and R. Nevatia. Recognizing 3-D objects using surface descriptions.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1140-1157, 1989.
4. O.D. Faugeras and M. Hebert. The representation, recognition, and locating of 3d shapes
from range data. International Journal of Robotics Research, 5(3):27-52, 1986.
5. W.E.L. Grimson. Object Recognition by Computer: the Role of Geometric Constraints.
MIT Press, USA, 1990.
6. R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, USA, 1985.
7. M.J.L. Orr, R.B. Fisher, and J. Hallam. Uncertain reasoning: Intervals versus probabilities.
In British Machine Vision Conference, pages 351-354. Springer-Verlag, 1991.
8. M.J.L. Orr, J. Hallam, and R.B. Fisher. Fusion through interpretation. Research Paper
572, Dept. of Artificial Intelligence, Edinburgh University, 1992.
9. J. Porrill. Fitting ellipses and predicting confidence using a bias corrected Kalman Filter.
Image and Vision Computing, 8(1):37-41, 1990.
10. Z. Zhang and O.D. Faugeras. Determining motion from 3d line segment matches: a com-
parative study. Image and Vision Computing, 9(1):10-19, 1991.
3-D Object Recognition using Passively Sensed
Range Data. *

Kenneth M. Dawson and David Vernon


University of Dublin, Trinity College, Dept. of Computer Science, Dublin 2, Ireland

Abstract. Model-based object recognition is typically addressed by first


deriving structure from images, and then matching that structure with
stored objects. While recognition should be facilitated through the deriva-
tion of as much structure as possible, most researchers have found that a
compromise is necessary, as the processes for deriving that structure are
not sufficiently robust. We present a technique for the extraction, and sub-
sequent recognition, of 3-D object models from passively sensed images.
Model extraction is performed using a depth from camera motion tech-
nique, followed by simple interpolation between the determined depth val-
ues. The resultant models are recognised using a new technique, implicit
model matching, which was originally developed for use with models derived
from actively sensed range data [1]. The technique performs object recogni-
tion using secondary representations of the 3-D models, hence overcoming
the problems frequently associated with deriving stable model primitives.
This paper, then, describes a technique for deriving 3-D structure from
passively sensed images, introduces a new approach to object recognition,
tests the robustness of the approach, and hence demonstrates the
potential for object recognition using 3-D structure derived from passively
sensed data.

1 3-D Model Extraction from passively sensed images

The extraction of a 3-D model can be performed in a series of steps: (1) computing
depth using camera motion, for the significant intensity discontinuities, (2) interpolating
range data between the significant intensity discontinuities, (3) smoothing of the resultant
depth map and (4) deriving a 3-D model from the depth map.

1.1 Depth from Camera Motion


In order to compute the 3-D location of a point we must obtain two vectors to that point.
For passive approaches to vision those two vectors must be obtained from two separate
observations of the point. However the accuracy of the resulting 3-D data is limited by
the sensor resolution and the disparity between the viewpoints. Also, it is important to
note that the identification of points which correspond in the two images is difficult, and
that the complexity of the correspondence problem increases with the disparity between
the viewpoints. Hence, although an increase in disparity increases the potential accuracy
of the 3-D data, it also increases the complexity of (and hence the likelihood of error
within) the correspondence problem.
* The research described in this paper has been supported by ESPRIT P419, EOLAS APT-
VISION and ESPRIT P5363.

One solution to this dilemma is provided by the computation of depth using camera
motion. The technique employed in this paper (see [2,3,4] for details) uses nine images
taken from different positions on an arc around a fixation point. The instantaneous
optic flow, representing the apparent motion of zero-crossings in each successive image,
is computed from the time derivative of the Laplacian of Gaussian of each image. The
global optic flow is computed by exploiting the instantaneous optic flow to track the
motion of each zero-crossing point throughout the complete sequence. This provides a vector field representing the correspondence between zero-crossing points in the initial
and final image in the sequence, i.e. over an extended base-line of camera displacement.
The depth of each zero-crossing is then computed by triangulation, using the start point
and the end point of each global optic flow vector.
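
As an illustration of this final triangulation step (a generic midpoint construction, not necessarily the exact formulation of [2, 3, 4]), given the two camera centres and the viewing rays through the start and end point of a flow vector:

    import numpy as np

    def triangulate(c1, r1, c2, r2):
        # c1, c2: camera centres of the first and last view in the sequence
        # r1, r2: unit ray directions through the start and end point of a
        #         global optic-flow vector
        # Returns the midpoint of the shortest segment joining the two rays.
        A = np.array([[r1 @ r1, -(r1 @ r2)],
                      [r1 @ r2, -(r2 @ r2)]])
        b = np.array([(c2 - c1) @ r1, (c2 - c1) @ r2])
        s, u = np.linalg.solve(A, b)        # fails only for parallel rays
        return 0.5 * ((c1 + s * r1) + (c2 + u * r2))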
An example is shown in Figure 1 in which the third image in a sequence of nine
images of a book is shown, along with its significant intensity discontinuities, the optical
flow determined between the third and eighth images in the sequence, and finally the
depth map which results after interpolation and smoothing. The nine images were taken
with an angular disparity of approximately 2° between successive camera positions and
a fixation distance of 600mm.

Fig. 1. Depth from camera motion. See text for details.

1.2 Interpolation of sparse range data

The result of the previous algorithm is a sparse depth map, where depth values are known
only at locations corresponding to the significant zero-crossings which were successfully
tracked throughout the sequence of images. However, the purpose of this research is to
investigate the potential for recognising 3-D structure derived from passively sensed data,
and hence we must interpolate between the available depth information.
The majority of interpolation techniques attempt to fit a continuous surface to the
available depth information (e.g. [6]). This requires that range data be segmented into
likely surfaces prior to the application of the interpolation technique, or alternatively

that only a single surface be presented. We employ a simpler technique involving planar
interpolation to ensure that the surfaces are correct for polyhedral objects.
The interpolation method defines a depth value for each undefined point in the depth
map by probing in five directions (to both the East and West, where East and West are
parallel to the direction of motion) from that point in order to find defined depth values.
A measure of those defined depth values (based also on the orientations of the features
with which the depth values are associated, and the distances from the undefined point)
is then employed to define the unknown depth value; e.g. for point $(x, y)$:

$$Depth(x, y) = \frac{Range_{east} \cdot Distance_{west} + Range_{west} \cdot Distance_{east}}{Distance_{west} + Distance_{east}} \qquad (1)$$

where $Range_{east}$ and $Range_{west}$ are the weighted averages of the range values located to the east and to the west of point $(x, y)$ respectively, and $Distance_{east}$ and $Distance_{west}$ are the average distances to those range values.
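
A minimal Python sketch of equation (1); the weighting of the probed values by feature orientation described above is omitted here for brevity.

    def interpolate_depth(range_east, dist_east, range_west, dist_west):
        # Planar east/west interpolation of an undefined depth value:
        # the nearer neighbour receives the larger weight.
        return (range_east * dist_west + range_west * dist_east) / \
               (dist_west + dist_east)

    # Example: 600 mm found 20 pixels to the east, 630 mm found 10 pixels
    # to the west -> the interpolated value, 620 mm, lies nearer the west.
    print(interpolate_depth(600.0, 20.0, 630.0, 10.0))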

1.3 Filtering range data
The depth map which results from the interpolation can be quite noisy. That is to say,
that there can be local regions of depth values which vary significantly to those in a
larger area around them. In order both to overcome this and to define values for isolated
undefined points, a smoothing filter was applied to the data. A smoothing filter which
simply averages all depth values within a mask was not appropriate, as resultant values
would be affected by local regions of noise. This restricted the potential choice of filter considerably, and only two types of filter, median and modal, were considered. It was found, experimentally, that a reasonably large (e.g. 11×11) modal filter produced the best
results, in terms of the resultant depth map, and hence 3-D structure.
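
A straightforward (unoptimised) modal filter over such a window might look as follows in Python; the use of NaN for undefined depths and the rounding used to form the mode bins are assumptions of this sketch.

    import numpy as np
    from collections import Counter

    def modal_filter(depth, size=11):
        # depth: 2-D array with undefined points marked as NaN.
        h, w = depth.shape
        r = size // 2
        out = depth.copy()
        for i in range(h):
            for j in range(w):
                window = depth[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
                defined = window[~np.isnan(window)]
                if defined.size:
                    # most common (rounded) value in the neighbourhood
                    out[i, j] = Counter(np.round(defined).tolist()).most_common(1)[0][0]
        return out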

1.4 Building 3-D models
Finally, having obtained a reasonably smooth depth map, it is still necessary to convert
from a viewer-centered description to an object-centered surface model. This can be
done by first employing the relevant camera model, to convert from image coordinates
(i, j, depth) to Cartesian coordinates (x, y, z), and then deriving '3-point seed' surfaces
[7] (i.e. surfaces are instantiated between any three points which are within the sampling
distance of each other).

2 Object Recognition - Implicit Model Matching

The basic problem of rigid object recognition is to establish the correspondence between
an object model, which is viewed, and a particular known object model, and the com-
putation of the associated pose. The majority of object recognition techniques match
known object models with viewed instances of objects through the comparison of model
primitives (e.g. edges). However, it is extremely unlikely that the model primitives ex-
tracted will be identical to those of a model which is known a priori. In order to overcome
that problem it is possible to employ secondary representations, such as the Extended
Gaussian Image (or EGI) [8], although using the EGI has proved difficult [9].
The technique of implicit model matching introduces a new, more powerful, but sim-
ilar idea. It employs several secondary representations which allow the problem to be
considered in terms of sub-problems. Initially, orientations of known objects which may

potentially match the viewed object are identified through the comparison of visible sur-
face normals, for each possible orientation of each known object (where 'each possible
orientation' is defined by the surface normals of a tesselated sphere). Potential orienta-
tions are then fine tuned by correlating 1-D histograms of specific components of surface
orientations (known as directional histograms). The object position is estimated using ap-
proximate knowledge of the physical configuration of the camera system, and fine tuned
using a template matching technique between needle diagrams derived from the known
and viewed models. Finally, using normalised correlation, each hypothesis is evaluated
through the comparison of the needle diagrams.
At each stage in the generation, tuning and verification of hypotheses only compar-
isons of the various secondary representations are employed. The central concept behind
implicit model matching is, then, that 3-D object models may be reliably compared
through the use of secondary representations, rather than (or, more properly, as well as)
by comparison of their component primitives. Additionally it is important to note that
object pose may be determined to an arbitrarily high degree of accuracy (through the
fine-tuning stages), although initially only a limited number of views are considered.

2.1 Approximating Object Orientation

The first stage in this technique is the computation of approximate orientations of a


known object model which may, potentially, correspond to the viewed object model. This
is achieved by considering the known object from every possible viewpoint, as defined by
a tesselated sphere, and comparing directional histograms of tilt (which will be explained
presently) for every viewpoint with a directional histogram of tilt derived from the viewed
model. The orientations which generate locally maximum correlations between these
histograms (i.e. as compared to the correlations associated with neighbouring tesselations
on the sphere) may be regarded as the potentially matching orientations.

Directional Histograms. The concept of the Directional Histogram was developed as


part of the technique of implicit model matching and embodies the notion of mapping
a single component of the 3-D orientations of a model visible from a given viewpoint to
a 1-D histogram, where the component of orientation is defined about the axes of the
viewing device. Four different components are employed: roll, pitch, yaw and tilt; where
roll, pitch and yaw are defined as rotations about the Z, X and Y axes of the viewing
device respectively, and tilt is defined as π radians less the angle between the orientation
vector and the focal axis (i.e. the Z axis of the viewing device). See Figures 2 and 3.
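
The following Python sketch shows one way such a tilt histogram could be accumulated from the visible surface normals and their areas; the bin count and the normalisation are assumptions, not details taken from the paper.

    import numpy as np

    def tilt_histogram(normals, areas, bins=36):
        # normals: (N, 3) unit surface normals visible from the viewpoint
        # areas:   (N,)   visible area of each surface
        # Tilt = pi minus the angle between the normal and the focal (Z) axis.
        z = np.array([0.0, 0.0, 1.0])
        angles = np.arccos(np.clip(normals @ z, -1.0, 1.0))
        tilt = np.pi - angles
        hist, _ = np.histogram(tilt, bins=bins, range=(0.0, np.pi), weights=areas)
        return hist / hist.sum()

Histograms computed in this way for the known and viewed models can then be compared by normalised cross-correlation.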

Resulting Orientations. The result of these comparisons of tilt directional histograms


is the identification of potentially matching orientations. However, only two degrees of
freedom have been constrained, as only the tilt of the object is approximated. In order to
complete the approximation of orientation, we must also compute potentially matching
values of roll around the focal axis of the viewing device (tilt and roll are independent).
This identification may again be performed using directional histograms, but using
roll rather than tilt. Hence, for every determined value of tilt, a directional histogram
of roll is derived and compared (using normalised cross correlation) with that from the
viewed model in every possible value of roll. Each 'possible value' of roll is defined by the
resolution of the directional histogram, and the directional histogram is simply shifted
in a circular fashion in order to consider the various possible values of roll.

Fig. 2. Example yaw directional histograms. These two yaw histograms of two views of a
garage-like object are a simple example of how directional histograms work. The visible sur-
face areas of the views of the object are mapped to the histograms at their respective yaw
angles (defined with respect to the focal axes of the camera). Notice the shift in the histograms,
which is due to the slightly different values of yaw of the two viewpoints.


Fig. 3. Example tilt directional histograms. These two tilt histograms are derived from the
two views of the garage-like object shown in Figure 2. Notice how, for the first view, the two
orientations result in the same value of tilt, and in the second view how the values change.

2.2 Fine-tuning Object Orientation
The potentially matching orientations computed can only be guaranteed to be as accurate
as the quantisation of the sampled sphere. Increasing the resolution of the sphere to
an arbitrarily high level, however, would obviously cause a significant increase in the
computational overhead required in determining potential orientations. Alternatively, it
is possible to fine-tune the orientations using directional histograms (of roll, pitch and
yaw) in a similar fashion to the method used for the approximate determination of object
roll. Pitch, yaw and roll directional histograms are derived from the view of the known
object and compared with histograms derived from the viewed model. The differences
between the directional histograms indicate the amount by which the orientation may
best be tuned (e.g. see Figure 2). The various directional histograms are derived and
compared sequentially and iteratively until the tuning required falls below the required
accuracy of orientation or until the total tuning on any component of orientation exceeds
the range allowed (which is defined by the quantisation of the tesselated sphere).
This stage allows the accuracy of potentially matching orientations to be determined
to an arbitrarily high level (limited only by the resolution of the directional histograms).
Hence, although only a limited number of possible viewpoints of any known object are
considered, the orientation of the object may be determined to a high level of accuracy.

2.3 Approximating Object Position
Turning now to the approximate determination of object position, it is relatively straight-
forward to employ the position of the viewed model with respect to its viewing camera.
The imaged centroid of the viewed model, and an approximate measure of the distance
of the viewed model from the viewing camera are both easily computed. The position
of the camera which views the known model may then be approximated by placing the
camera in a position relative to the known model's 3-D centroid, such that the centroid
is at the correct approximate distance from the camera and is viewed by the camera in
the same position as the viewed model's imaged centroid.

2.4 Fine-tuning Object Position
Fine tuning object position may be considered in terms of two operations; tuning position
in a direction orthogonal to the focal axis of the viewing device, and tuning of the
distance of the object from the same viewing device (i.e. the depth). This separates the
3 degrees of freedom inherent in the determination of object position.

Tuning Viewed Object Position (i.e. orthogonal to viewing device). This oper-


ation is performed using a template matching technique in which a needle diagram (i.e.
an iconic representation of the visible local surface orientations) of the known model is
compared using a normalised correlation mechanism, with a needle diagram of the viewed
model. The position of the template which returns the highest correlation is taken to be
the optimal position for the known model.
The standard method of comparing iconic representations is normalised cross correlation, but this form of correlation is defined only for scalars. For the comparison of needle diagrams 3-D vectors must be compared, and that correlation ($NV$) for each possible position of the template $(m, n)$ is defined as follows:

$$NV(m, n) = \frac{\sum_i \sum_j f(viewed(i, j)) \cdot (\pi - angle(viewed(i, j), known(i - m, j - n)))}{\sum_i \sum_j f(viewed(i, j)) \cdot \pi} \qquad (2)$$

where $viewed(i, j)$ and $known(i, j)$ are the 3-D orientation vectors from the viewed and known needle diagrams respectively, $f(vector)$ is 1 if the vector is defined and 0 otherwise, and $angle(vector_1, vector_2)$ is the angle between the two vectors (or 0 if they are undefined).
In order to make this template matching operation more efficient, the needle diagrams
are first compared at lower resolutions, using a somewhat simpler measure-of-fit.
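
A direct (unoptimised) transcription of equation (2) into Python is sketched below; representing undefined needle-diagram entries as all-zero vectors is an assumption of this sketch.

    import numpy as np

    def needle_correlation(viewed, known, m, n):
        # viewed, known: (H, W, 3) needle diagrams of unit orientation vectors;
        # undefined entries are all-zero.  (m, n) is the template offset.
        H, W, _ = viewed.shape
        num = den = 0.0
        for i in range(H):
            for j in range(W):
                v = viewed[i, j]
                if not v.any():                      # f(viewed(i, j)) = 0
                    continue
                ang = 0.0                            # 0 if the known vector is undefined
                if 0 <= i - m < H and 0 <= j - n < W:
                    k = known[i - m, j - n]
                    if k.any():
                        ang = np.arccos(np.clip(v @ k, -1.0, 1.0))
                num += np.pi - ang
                den += np.pi
        return num / den if den else 0.0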

Tuning Object Depth. Fine tuning the distance between the known model and its
viewing camera is done through direct comparison of the depth maps generated from
both the viewed model and the known model (in its determined pose). The Depth Change
Required (or DCR) is defined as follows:
$$DCR = \frac{\sum_i \sum_j f(viewed(i, j)) \cdot f(known(i, j)) \cdot (viewed(i, j) - known(i, j))}{\sum_i \sum_j f(viewed(i, j)) \cdot f(known(i, j))} \qquad (3)$$

where $viewed(i, j)$ and $known(i, j)$ are the depths from the viewed and known depth maps respectively, and $f(depth) = 1$ if the depth is defined and 0 otherwise. The DCR is directly applied as a translation to the pose of the camera which views the known

model, in a direction defined by the focal axis of the camera. Due to perspective effects
this operation will have effects on the depth map rendered, and so is applied iteratively
until the DCR falls below an acceptable level.
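
Equation (3) reduces to the mean depth difference over the jointly defined pixels, as in the following Python sketch (NaN marks undefined depths here, an assumption of the sketch):

    import numpy as np

    def depth_change_required(viewed, known):
        # viewed, known: depth maps with undefined depths marked as NaN.
        defined = ~np.isnan(viewed) & ~np.isnan(known)
        if not defined.any():
            return 0.0
        return float(np.mean(viewed[defined] - known[defined]))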

Final tuning of Object Position. Tuning of object position is limited in accuracy


primarily by the resolution used in the needle diagrams (for tuning position orthogonal
to the focal axis of the viewing camera). As a final stage in tuning we attempt to over-
come this by determining the position to sub-pixel accuracy. This is accomplished using
a combination of template matching, normalised correlation and quadratic modelling
techniques. The best position may be determined to pixel accuracy using the technique
described in Section 2.4. In order to determine the position to sub-pixel accuracy the
normalised correlations, from the comparison of needle diagrams, around the best posi-
tion (as determined to pixel accuracy) are used and are modelled as quadratics in two
orthogonal directions (i.e. parallel to the two image axes).
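
One common way to realise this quadratic modelling along a single image axis is parabolic interpolation of the three correlations around the pixel-accurate best position; the Python sketch below is such a generic construction, not necessarily the authors' exact formulation.

    def subpixel_offset(c_minus, c_best, c_plus):
        # Fit a parabola to the correlations at the best position and its two
        # neighbours along one image axis; return the offset of its maximum
        # (in pixels, lying in [-0.5, 0.5]).
        denom = c_minus - 2.0 * c_best + c_plus
        if denom == 0.0:
            return 0.0
        return 0.5 * (c_minus - c_plus) / denom

Applied independently along the two image axes, this refines the template position to sub-pixel accuracy.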

2.5 Verifying Hypotheses
Having hypothesised and fine tuned poses of known objects it is necessary to determine
some measure of fit for the hypothesis so that it may be accepted (subject to no better
hypothesis being determined), or rejected. The normalised correlation of local surface
orientations (i.e. needle diagrams) between the viewed model and the known model in a
determined pose, as used when fine tuning object position (see section 2.4) gives a degree-
of-fit which represents all aspects of object position and orientation. This degree-of-fit,
then, provides a powerful hypothesis verification measure.
The known model, in the computed position and orientation, which gives the best
degree of fit with the viewed model (i.e. maximal correlation between derived needle
diagrams), and which exceeds some predefined threshold, is taken to be the best match;
the viewed model is assumed to be the corresponding object, in the pose of the known
model.

2.6 An Exception

There is, however, one situation in which this technique, implicit model matching, will
fail and that is when only a single surface/orientation is visible. Computation of object
roll in this instance is impossible using directional histograms (as there is an inherent
ambiguity with respect to roll around the orientation vector of the surface).
This situation can be detected by considering the standard deviation of the visible
orientations as mapped to an EGI (as if the standard deviation is less than a small angle
then it may be taken that only one surface is visible). The problem must then be regarded
as one of shape recognition, although it should be noted that it is possible to adapt the
technique of implicit model matching to cope with this situation (see [10]).

3 Experimental Results and Conclusions

The intention of the testing detailed herein is to investigate the robustness of the recog-
nition technique, and to demonstrate the potential for recognising 3-D models derived
from passively sensed data. The objects employed were all of simple rigid geometric
structure, and scenes contained only one object. The rationale for these choices is that

the segmentation and identification of occlusion problems, etc., still require much further
research.
As an example of recognition, consider the book shown in Figure 1. The model deter-
mined is quite accurate, with the main exception being the title. Regardless of these
errors, however, sufficient of the model is computed correctly to allow reliable identifica-
tion of the book (See Figure 5) from the database of objects (See Figure 4).

Fig. 4. The database of known object models.

Fig. 5. Recognition of the model derived from the depth map shown in Figure 1.

The discrimination between the various recognition hypotheses is not that significant
however, and as more complex objects are considered (See Table 1) the discriminatory
ability gets progressively worse resulting, eventually, in mistaken recognition. The testing
allows a number of conclusions to be drawn, which follow, and satisfies both of the stated
intentions of this research.

1. It is demonstrated that there is potential for the recognition of objects in passively


sensed images on the basis of derived 3-D structural information.

2. Implicit model matching was found to degrade reasonably gracefully in the presence
of noisy and incorrect data. However, its performance with models derived from
actively sensed range data [1] was significantly more reliable.
3. Finally, the limitations on the techniques presented for the development of three
dimensional models are emphasised. The visual processing performed in this research
quite obviously deals in a trivial way with many important issues, and these issues
remain for future research.

Further details of all aspects of the system described in this paper are given in [4].

Scene      Figure  Cube    Cone    Book    Mug     Can     Globe   Tape    Result
Cube               0.8229  0.6504  0.7220  0.6810  0.5270  0.6344  0.3656  Cube
Cone               0.2508  0.7311  0.5130  0.2324  0.1962  0.3777  0.1332  Cone
Book       5       0.2763  0.5432  0.7534  0.2958  0.2189  0.3412  0.1883  Book
Mug                0.6872  0.6061  0.5770  0.6975  0.5892  0.6580  0.3276  Mug
Pepsi Can          0.6643  0.5829  0.7099  0.6705  0.6852  0.5150  0.3658  Book *
Globe              0.4492  0.5620  0.5789  0.4036  0.3305  0.5725  0.2897  Book *
Sellotape          0.5706  0.6270  0.6689  0.4979  0.5039  0.4295  0.3735  Book *

Table 1. The complete table of the degrees-of-fit determined between viewed instances of objects
and the known models.

References

1. Dawson, K. Vernon, D.: Model-Based 3-D Object Recognition Using Scalar Transform
Descriptors. Proceedings of the conference on Model-Based Vision Development and Tools,
Vol. 1609, SPIE - The International Society for Optical Engineering (November 1991)
2. Sandini, G., Tistarelli, M.: Active Tracking Strategy for Monocular Depth Inference Over
Multiple Frames. IEEE PAMI, Vol.12, No.1 (January 1990) 13-27
3. Vernon, D., Tistarelli, M.: Using Camera Motion to Estimate Range for Robotic Part
Manipulation. IEEE Robotics and Automation, Vol.6, No.5 (October 1990) 509-521
4. Vernon, D., Sandini, G. (editors): Parallel computer vision - The VIS a VIS System. Ellis
Horwood (to appear)
5. Horn, B., Schunck, B.: Determining Optical Flow. Artificial Intelligence, Vol.17, No.1 (1981)
185-204
6. Grimson, W.: From Image to Surfaces: A Computational Study of the Human Early Visual
System. MIT Press, Cambridge, Massachusetts (1981)
7. Faugeras, O., Hebert, M.: The representation, recognition and locating of 3-D objects.
International Journal of Robotics Research, Vol. 5, No. 3 (Fall 1986) 27-52
8. Horn, B.: Extended Gaussian Images. Proceedings of the IEEE, Vol.72, No.12 (December
1984) 1671-1686
9. Brou, P.: Using the Gaussian Image to Find Orientation of Objects. The International
Journal of Robotics Research, Vol.3, No.4 (Winter 1984) 89-125
10. Dawson, K.: Three-Dimensional Object Recognition through Implicit Model Matching.
Ph.D. thesis, Dept. of Computer Science, Trinity College, Dublin 2, Ireland (1991)

This article was processed using the LaTeX macro package with ECCV92 style
Interpretation of R e m o t e l y Sensed Images
in a C o n t e x t of Multisensor Fusion*

Véronique CLÉMENT, Gérard GIRAUDON and Stéphane HOUZELLE


INRIA, Sophia Antipolis, BP109, F-06561 Valbonne - Tel.: (33) 93657857
Email: vclement@sophia.inria.fr giraudon@sophia.inria.fr houzelle@sophia.inria.fr

Abstract. This paper presents a scene interpretation system in a con-


text of multi-sensor fusion. We present how the real world and the inter-
preted scene are modeled; knowledge about sensors and multiple views no-
tion (shot) are taken into account. Some results are shown from an applica-
tion to S A R / S P O T images interpretation.

1 Introduction
An extensive literature has grown since the beginning of the decade on the problem
of scene interpretation, especially for aerial and satellite images [NM80, Mat90] [RH89]
[RIHR84] [MWAW89] [HN88] [Fua88] [GG90]. One of the main difficulties of these appli-
cations is the knowledge representation of objects, of scene, and of interpretation strategy.
Previously mentioned systems use various knowledge such as: object geometry, mapping,
sensor specifications, spatial relations, etc...
On the other hand, there is a growing interest in the use of multiple sensors to increase
both the availability and capabilities of intelligent systems [MWAW89,Mat90] [LK89]
[RH89]. However, if the multi-sensor fusion is a way to increase the number of measures
on the world by complementary or redundancy sensors, problems of control of the data
flow, strategies of object detection, and modeling of objects and sensors are also increased.
This paper presents a scene interpretation system in a context of multi-sensor fusion.
We propose to perform fusion at the intermediate level because it is the most adaptive
and the most general for different applications of scene analysis. First, we present how the
real world and the interpreted scene are modeled; knowledge about sensors and multiple
views notion (shot) are taken into account. Then we give an overview of the architecture
of the system. Finally, some results are shown from an application to S A R / S P O T images
interpretation.

2 Modeling
Consistency of information is one of the relevant problems of multi-sensor fusion systems;
in fact, various models must be used to express the a priori knowledge. This knowledge
can be divided into knowledge about the real world and knowledge about the interpre-
tation.

2.1 Real World Modeling
For an interpretation system, a priori knowledge of the scene to be observed is necessary:
for example, the description of objects which might be present in the scene. Moreover,
* This work is in part supported by AEROSPATIALE, Department E/ETRI, F-78114 Magny-
les-hameaux, and by ORASIS contract, PRC/Communication Homme-Machine.

in order to perform multi-sensor fusion at different levels of representation, and to use


the various data in an optimal way, characteristics of available sensors have to be taken
into account: this allows the selection of the best ones for a given task. In the following,
we first develop object modeling, then sensor modeling.

Objects to Detect: Usually, in single-sensor systems, two main descriptions are used:


the geometric description, and the radiometric one. These two criteria can be used to
detect an object (by allowing the choice of the best-adapted algorithm, for instance), or
to validate the presence of an object (by matching the computed sizes with the model
sizes, for example).
In a multi-sensor system, the distinction must be made between knowledge which
is intrinsic to an object, and knowledge which depends on the observation. Geometric
properties can be modeled on the real world, however geometric aspects have to be com-
puted depending on the sensor. Concerning radiometric properties, there is no intrinsic
description; radiometric descriptions are sensor-dependent. In fact, only the observation
of an object can be pale or dark, textured or not. Thus, the notion of material has been
introduced in our system to describe an object intrinsically. Materials describe the com-
position of an object: for example, a bridge is built of metal, cement and/or asphalt.
So, radiometric properties of an object can be deduced from its composition: an object
mainly made of cement, and another one mainly made of water would not have the
same radiometry in an image taken by an infra-red sensor. These criteria (geometry,
composition) which are only descriptions of objects can be used in a deterministic way.
Another sensor-independent knowledge very important in human interpretation of
images is spatial knowledge, which corresponds to the spatial relationships between ob-
jects. Spatial knowledge can link objects of the same kind, as well as objects of different
kinds. This heuristic knowledge can be used to facilitate detection, validation, and solv-
ing of the conflicts among various hypotheses. For example, as we know that a bridge
will be over a road, a river, or a railway, it is not necessary to look for a bridge in the
whole image; the search area can be limited to the roads, rivers, and railways previously
detected in the scene. In multi-sensor interpretation, we can even detect the river on one
image, and look for the bridges on another one.

Sensors: Some sensors are sensitive to object reflectance, others to their position, or to their shape. Radiometric features mainly come from the materials the objects are composed of, and more precisely from features of these materials such as cold, homogeneous, rough, textured, smooth, etc. The response to each aspect is quite different depending on
the sensor.
Therefore sensors are modeled in our system using the sensitivity to aspects of various
materials, the sensitivity to geometry of objects, the sensitivity to orientation of objects,
the band width described by minimum and maximum wave length, and the type (active
or passive). Note that the quality of the detection (good, medium, or bad) has been
dissociated from the aspect in the image (light, grey, dark).
Due to their properties, some objects will be well detected by one sensor, and not by
another one; other objects will be well detected by various sensors. To be able to detect
easily and correctly an object, we have to choose the image(s), i.e. the sensor(s), in which
it is best represented. For that, our system uses the sensitivities of the sensors, and the
material composition of the objects.
Knowing the position of the sensor, and its resolution is also important to be able to
determine whether an object could be well detected. We call shot the whole information:

the description of the sensor, the conditions of acquisition including the point of view,
and the corresponding image.
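
For illustration only, a frame-like object and sensor description of the kind discussed above might be sketched as follows in Python; the field names and the bridge entry are hypothetical, not the actual SMECI frames used by the system.

    from dataclasses import dataclass, field

    @dataclass
    class Sensor:
        name: str
        material_sensitivity: dict     # material -> (detection quality, image aspect)
        band_nm: tuple                 # (minimum, maximum) wave length
        active: bool                   # active or passive sensor

    @dataclass
    class SemanticObject:
        name: str
        materials: list                # intrinsic composition
        geometry: dict                 # intrinsic geometric description
        spatial_relations: list = field(default_factory=list)

    # A bridge is built of metal, cement and/or asphalt and lies over a road,
    # a river or a railway; the spatial relation restricts where it is searched for.
    bridge = SemanticObject(
        name="bridge",
        materials=["metal", "cement", "asphalt"],
        geometry={"shape": "elongated"},
        spatial_relations=[("over", ["road", "river", "railway"])],
    )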

2.2 Interpretation

The main problem is how to represent the scene being interpreted. First of all, we are going to make precise what we call an interpreted scene, and what information must be present
in an interpretation. Our goal is not to classify each pixel of the image; it is to build a
semantic model of the real observed scene. This model must include: the precise location
of each detected object, its characteristics (such as shape, color, function...), and its
relations with other objects present in the scene. To capture such information, it is
necessary to have a spatial representation of the scene; in the 2D-case, this can be done
using a location matrix. This representation allows to focus attention on precise areas
using location operators such as surrounded by, near..., and to detect location conflicts.
Location conflicts occur when areas of different objects overlap. Three different kinds
of conflicts can be cited: conflicts among superposed objects (in fact, they are not real
conflicts: a bridge over a road); conflicts among adjacent objects (some common pixels;
such a conflict is due to low level algorithms, digitalization...); conflicts arising because
of ambiguous interpretation between different sorts of objects (this kind of conflict can
be elucidated only by using relational knowledge).

3 Implementation

Our goal was to develop a general framework to interpret various kinds of images such as
aerial images, or satellite images. It has been designed as a shell to develop interpretation
systems. Two main knowledge representations are used: frames and production rules. The
system has been implemented using the SMECI expert system generator [II91], and the
NMS multi-specialist shell [CAN90]; it is based on the blackboard and specialist concepts
[HR83]. This approach has been widely used in computer vision [HR87,Mat90], and in
multi-sensor fusion [SST86]. We have simplified the blackboard structure presented by
Hayes-Roth, and we have built a centralized architecture with three types of specialists: the generic specialists (application-independent), the semantic object specialists (application-dependent), and the low level specialists (dependent on image processing,
and feature description). They work at different levels of representation, are independent,
and work only on a strategy level request; so the system is generic and incremental. The
detection strategy is based on the fundamental notion of spatial context linking the
objects in the scene, and the notion of salient object.
To demonstrate the reliability of our approach, we have implemented an application
for the interpretation of SAR images registered with SPOT images, a set of sensors
which are complementary. Five sensors (the SIR-B Synthetic Aperture Radar, and the
panchromatic, XS1 [blue], XS2 [green], XS3 [near infra-red] SPOT sensors), ten materials
(water, metal, asphalt, cement, vegetation, soil, sand, rock, snow, and marsh), and five
kinds of semantic objects (rivers, lakes, roads, urban areas, and bridges) are modeled in
this application. We present Figure 1 an example of result (fig 1.c) we obtained using
three images: SAR (fig 1.a), SPOT XS1, and SPOT XS3 (fig 1.b). Closed contours point
out urban areas. Filled regions indicate lakes. Thin lines represent bridges, roads, and
the river. More details about this application can be found in [CGH92], while low-level
algorithms are described in [HG91].

[Figure 1 appears here: panels (a), (b) and (c).]

Fig. 1. Top: Sensor images used for scene interpretation: (a) SIR-B Radar image; (b) near infra-red SPOT XS3 image. Bottom: (c) Objects detected in the scene after interpretation. Closed contours point out urban areas. Filled regions indicate lakes. Thin lines represent bridges, roads, and the river.

4 Conclusion

We have proposed a way to model the real world and the interpreted scene in the context of multi-
sensor fusion. A priori knowledge description includes the characteristics of the sensors,
and a semantic object description independent of the sensor characteristics. This archi-
tecture meets our requirements of highly modular structure allowing easy incorporation
of new knowledge, and new specialists. A remote sensing application with S A R / S P O T
sensors aiming at detecting bridges, roads, lakes, rivers, and urban areas demonstrates
the efficiency of our approach.
A c k n o w l e d g m e n t s : The authors would like to thank O.Corby for providing useful
suggestions during this study, and J.M.Pelissou and F.Sandakly for their contribution in
the implementation of the system.

References
[CAN90] O. Corby, F. Allez, and B. Neveu. A multi-expert system for pavement diagnosis
and rehabilitation. Transportation Research Journal, 24A(1), 1990.
[CGH92] V. Cldment, G. Giraudon, and S. Houzelle. A knowledge-based interpretation sys-
tem for fusion of sar and spot images. In Proc. o/IGARSS, Houston, Texas, May
1992.
[Fua88] P. Fua. Extracting features from aerial imagery using model-based objective func-
tions. PAMI, 1988.
[GG90] P. Garnesson and G. Giraudon. An image analysis system, application for aerial
imagery interpretation. In Proc. of ICPR, Atlantic City, June 1990.
[HG91] S. Houzelle and G. Giraudon. Automatic feature extraction using data fusion in re-
mote sensing. In SPIE Proceedings, Vo11611, Sensor Fusion IV: Control Paradigms
and data structures, Boston, November 1991.
[HN88] A. Huertas and R. Nevatia. Detecting buildings in aerial images. CVGIP, 41(2):131-
152, February 1988.
[HR83] B. Hayes-Roth. The blackboard architecture : A general framework for problem
solving? Stanford University, Report HPP.83.30, 1983.
[HR87] A. Hanson and E. Riseman. The visions image-understanding system. In C. M.
Brown, editor, Advances in Computer Vision, pages 1-114. Erlbaum Assoc, 1987.
[II91] Ilog and INRIA. Smeci 1.54 : Le manuel de référence. Gentilly, 1991.
[LK89] R. Luo and M. Kay. Multisensor integration and fusion in intelligent systems.
IEEE Trans on Sys. Man and Cyber., 19(5):901-931, October 1989.
[Mat90] T. Matsuyama. SIGMA, a Knowledge.Based Aerial Image Understanding System.
Advances in Computer Vision and Machine Intelligence. Plenum, New York, 1990.
[MWAW89] D.M. McKeown, Jr., Wilson A. Harvey, and L. E. Wixson. Automating knowledge acquisition for aerial image interpretation. Comp. Vision Graphics and Image Proc, 46:37-81, 1989.
[NM80] M. Nagao and T. Matsuyama. A Structural Analysis of Complex Aerial Pho-
tographs. Plenum, New York, 1980.
[RH89] E. M. Riseman and A. R. Hanson. Computer vision research at the university of
massachusetts, themes and progress. Special Issue of Int. Journal of Computer
Vision, 2:199-207, 1989.
[RIHR84] G. Reynolds, N. Irwin, A. Hanson, and E. Riseman. Hierarchical knowledge-
directed object extraction using a combined region and line representation. In
Proc. of Work. on Comp. Vision Repres. and Cont., pages 238-247. Silver Spring,
1984.
[SST86] S. Shafer, A. Stentz, and C. Thorpe. An architecture for sensor fusion in a mobile
robot. In Int. Conf. on Robotics and Automation, pages 2202-2011, San Francisco,
June 1986.
Limitations of Non Model-Based Recognition
Schemes

Yael Moses and Shimon Ullman

Dept. of Applied Mathematics and Computer Science,


The Weizmann Institute of Science, Rehovot 76100,
Israel

Abstract. Approaches to visual object recognition can be divided into
model-based and non model-based schemes. In this paper we establish some
limitations on non model-based recognition schemes. We show that a con-
sistent non model-based recognition scheme for general objects cannot dis-
criminate between objects. The same result holds even if the recognition
function is imperfect, and is allowed to mis-identify each object from a
substantial fraction of the viewing directions. We then consider recognition
schemes restricted to classes of objects. We define the notion of the discrim-
ination power of a consistent recognition function for a class of objects. The
function's discrimination power determines the set of objects that can be
discriminated by the recognition function. We show how the properties of a
class of objects determine an upper bound on the discrimination power of
any consistent recognition function for that class.

1 Introduction

An object recognition system must recognize an object despite dissimilarities of images


of the same object due to viewing position, illumination conditions, other objects in
the scene, and noise. Several approaches have been proposed to deal with this problem.
In general, it is possible to classify these approaches into model-based vs. non model-
based schemes. In this paper we examine the limitations of non model-based recognition
schemes.
A number of definitions are necessary for the following discussion. A recognition func-
tion is a function from 2-D images to a space with an equivalence relation. Without loss
of generality we can assume that the range of the function is the real numbers, R. We
define a consistent recognition function for a set of objects to be a recognition function
that has identical value on all images of the same object from the set. That is, let s be
the set of objects that f has to recognize. If v1 and v2 are two images of the same object
from the set s then f(v1) = f(v2).
A recognition scheme is a general scheme for constructing recognition functions for
particular sets of objects. It can be regarded as a function from sets of 3-D objects, to the
space of recognition functions. That is, given a set of objects, s, the recognition scheme,
g, produces a recognition function, g(s) = f . The scope of the recognition scheme is the
set of all the objects that the scheme may be required to recognize. In general, it may be
the set of all possible 3-D objects. In other cases, the scope may be limited, e.g., to 2-D
objects, or to faces, or to the set of symmetric objects. A set s of objects is then selected
from the scope and presented to the recognition scheme. The scheme g then returns
a recognition function f for the set s. A recognition scheme is considered consistent if
g(s) = f is consistent on s as defined above, for every set s from the scheme's scope.
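These definitions can be made concrete with a small numerical sketch (ours, not the authors'; the point-object model, the candidate function and all names are illustrative assumptions): it renders orthographic views of a point object under random rotations and tests empirically whether a simple image statistic is a consistent recognition function.

```python
import numpy as np

def orthographic_views(points, n_views=200, rng=np.random.default_rng(0)):
    """Render 2-D point 'images' of a 3-D point object under random rotations,
    by rotating and dropping the depth coordinate (orthographic projection)."""
    views = []
    for _ in range(n_views):
        q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthonormal frame
        if np.linalg.det(q) < 0:                       # make it a proper rotation
            q[:, 0] *= -1
        views.append((q @ points.T)[:2].T)
    return views

def candidate_f(image):
    """A hypothetical non model-based recognition function: the ratio of the
    largest to the smallest pairwise distance in the image."""
    d = np.linalg.norm(image[:, None] - image[None, :], axis=-1)
    d = d[np.triu_indices(len(image), k=1)]
    return d.max() / d.min()

obj = np.random.default_rng(1).normal(size=(6, 3))     # a 6-point object
values = [candidate_f(v) for v in orthographic_views(obj)]
print("spread of f over views of one object:", np.ptp(values))
# A consistent recognition function would give spread 0; the nonzero spread shows
# that this particular f is not consistent with respect to viewing position.
```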

A model-based scheme produces a recognition function g(s) = f that depends on


the set of models. That is, there exist two sets s1 and s2 such that g(s1) ≠ g(s2), where
the inequality is a function inequality. Note that the definition of a model-based scheme
in our discussion is quite broad; it does not specify the type of models or how they
are used. The schemes developed by Brooks (1981), Bolles & Cain (1982), Grimson &
Lozano-Pérez (1984,1987), Lowe (1985), Huttenlocher & Ullman (1987), Ullman (1989)
and Poggio & Edelman (1990) are examples of model-based recognition schemes.
A non model-based recognition scheme produces a recognition function g(s) = f that
does not depend on the set of models. That is, if g is a non model-based recognition
scheme, then for every two sets s1 and s2, g(s1) = g(s2), where the equality is a function
equality.
Non model-based approaches have been used, for example, for face recognition. In this
case the scope of the recognition scheme is limited to faces. These schemes use certain
relations between facial features to uniquely determine the identity of a face (Kanade
1977, Cannon et al. 1986, Wong et al. 1989). In these schemes, the relations between the
facial features used for the recognition do not change when a new face is learned by the
system. Other examples are schemes for recognizing planar curves (see review Forsyth et
al. 1991).
In this paper we consider the limitations of non model-based recognition schemes. A
consistent non model-based recognition scheme produces the same function for every set
of models. Therefore, the recognition function must be consistent on every possible set
of objects within the scheme's scope. Such a function is universally consistent, that is,
consistent for all objects in its scope.
A consistent recognition function of the set s should be invariant to at least two
types of manipulations: changes in viewing position, and changes in the illumination
conditions. We first examine the limitation of non model-based schemes with respect to
viewing position, and then to illumination conditions.
In examining the effects of viewing position, we will consider objects consisting of a
discrete set of 3-D points. The domain of the recognition function consists of all binary
images resulting from scaling of orthographic projection of such discrete objects on the
plane. We show (Section 2) that every consistent universal recognition function with
respect to viewing position must be trivial, i.e., a constant function¹. Such a function
does not make any distinctions between objects, and therefore cannot be used for object
recognition. On the other hand it can be shown that in a model-based scheme it is possible
to define a nontrivial consistent recognition function that is as discriminating as possible
for every given set of objects (see Moses & Ullman 1991).
The human visual system, in some cases, misidentifies an object from certain viewing
positions. We therefore consider recognition functions that are not perfectly consistent.
Such a recognition function can be inconsistent for some images of objects taken from
specific viewing positions. In Section 3.1 we show that such a function must still be
constant, even if it is inconsistent for a large number of images (we define later what we
consider "large"). We also consider (Section 3.2) imperfect recognition functions where
the values of the function on images of a given object may vary, but must lie within a
certain interval.
Many recognition schemes deal with a limited scope of objects such as cars, faces
or industrial parts. In this case, the scheme must recognize only objects from a specific
class (possibly infinite) of objects. For such schemes, the question arises of whether there

1 A similar result has been independently proved by Burns et al. 1990 and Clemens & Jacobs
1990.

exists a non-trivial consistent function for objects from the scheme's scope. The function
can have in this case arbitrary values for images of objects that do not belong to the
class. The existence of a nontrivial consistent function for a specific class of objects
depends on the particular class in question. In Section (4) we discuss the existence of
consistent recognition function with respect to viewing position for specific classes of
objects. In Section (4.1) we give an example of a class of objects for which every consistent
function is still a constant function. In Section (4.2) we define the notion of the function
discrimination power. The function discrimination power determines the set of objects
that can be discriminated by a recognition scheme. We show that, given a class of objects,
it is possible to determine an upper bound for the discrimination power of any consistent
function for that class. We use as an example the class of symmetric objects (Section 4.3).
Finally, we consider grey level images of objects that consist of n small surface patches
in space (this can be thought of as sampling an object at n different points). We show that
every consistent function with respect to illumination conditions and viewing position
defined on points of the grey level image is also a constant function.
We conclude that every consistent recognition scheme for 3-D objects must depend
strongly on the set of objects learned by the system. That is, a general consistent recog-
nition scheme (a scheme that is not limited to a specific class of objects) must be model-
based. In particular, the invariant approach cannot be applied to arbitrary 3-D objects
viewed from arbitrary viewing positions. However, a consistent recognition function can
be defined for non model-based schemes restricted to a specific class of objects. An up-
per bound for the discrimination power of any consistent recognition function can be
determined for every class of objects.
It is worth noting here that the existence of features invariant to viewing position
(such as parallel lines) and of invariant recognition functions for 2-D objects (see the review
by Forsyth et al. 1991) is not at odds with our results, since the invariant features can
be regarded as a model-based recognition function, and the recognition of 2-D objects is a
recognition scheme for a specific class of objects (see Section 4).

2 Consistent function with respect to viewing position

We begin with the general case of a universally consistent recognition function with
respect to viewing position, i.e. a function invariant to viewing position of all possible
objects. The function is assumed to be defined on the orthographic projection of objects
that consist of points in space.

Claim 1: Every function that is invariant to viewing position of all possible objects
is a constant function.

Proof. A function that is invariant to viewing position by definition yields the same
value for all images of a given object. Clearly, if two objects have a common orthographic
projection, then the function must have the same value for all images of these two objects.
We define a reachable sequence to be a sequence of objects such that each two succes-
sive objects in the sequence have a common orthographic projection. The function must
have the same value for all images of objects in a reachable sequence. A reachable object
from a given object is defined to be an object such that there exists a reachable sequence
starting at the given object and ending at the reachable object. Clearly, the value of the
function is identical for all images of objects that are reachable from a single object.

Every image is an orthographic projection of some 3-D object. In order to prove that
the function is constant on all possible images, all that is left to show is that every two
objects are reachable from one another. This is shown in Appendix 1. □

We have shown that any universal and consistent recognition function is a constant
function. Any non model-based recognition scheme with a universal scope is subject to
the same limitation, since such a scheme is required to be consistent on all the objects in
its scope. Hence, any non model-based recognition scheme with a universal scope cannot
discriminate between any two objects.

3 Imperfect recognition functions

Up to now, we have assumed that the recognition function must be entirely consistent.
That is, it must have exactly the same value for all possible images of the same object.
However, a recognition scheme may be allowed to make errors. We turn next to examine
recognition functions that are less than perfect. In Section 3.1 we consider consistent
functions with respect to viewing position that can have errors on a significant subset
of images. In Section 3.2 we discuss functions that are almost consistent with respect to
viewing position, in the sense that the function values for images of the same object are
not necessarily identical, but only lie within a certain range of values.

3.1 Errors of the recognition function

The human visual system may fail in some cases to identify correctly a given object
when viewed from certain viewing positions. For example, it might identify a cube from
a certain viewing angle as a 2-D hexagon. The recognition function used by the human
visual system is inconsistent for some images of the cube. The question is whether there
exists a nontrivial universally consistent function, when the requirements are relaxed:
for each object the recognition function is allowed to make errors (some arbitrary values
that are different from the unique value common to all the other views) on a subset of
views. This set should not be too large; otherwise the recognition process will fail too often.
Given a function f, for every object x let E_f(x) denote the set of viewing directions
for which f is incorrect (E_f(x) is defined on the unit sphere). The object x is taken to
be a point in R^n. We also assume that objects that are very similar to each other have
similar sets of "bad" viewing directions. More specifically, let us define for each object
x the value Φ(x, ε) to be the measure (on the unit sphere) of all the viewing directions
for which f is incorrect on at least one object in the neighborhood of radius ε around x.
That is, Φ(x0, ε) is the measure of the set ∪_{x ∈ B(x0, ε)} E_f(x). We can now show that even
if Φ(x, ε) is rather substantial (i.e. f makes errors on a significant number of views), f
is still the trivial (constant) function. Specifically, assuming that for every x there exists
an ε such that Φ(x, ε) < D (where D is about 14% of the possible viewing directions),
then f is a constant function. The proof of this claim can be found in Moses & Ullman
(1991).

3.2 "Almost consistent" recognition functions


In practice, a recognition function may also not be entirely consistent in the sense that
the function values for different images of the same object may not be identical, but only
close to one another in some metric space (e.g., within an interval in R). In this case,

a threshold function is usually used to determine whether the value indicates a given
object.
Let an object neighborhood be the range to which a given object is mapped by such an
"almost consistent" function. Clearly, if the neighborhood of an object does not intersect
the neighborhoods of other objects, then the function can be extended to be a consistent
function by a simple composition of the threshold function with the almost consistent
function. In this case, the result of the general case (Claim 1) still holds, and the function
must be the trivial function.
If the neighborhoods of two objects, a and b, intersect, then the scheme cannot dis-
criminate between these two objects on the basis of images that are mapped to the
intersection. In this case the images mapped to the intersection constitute a set of im-
ages for which f is inconsistent. If the assumption from the previous section holds, then
f must be again the trivial function.
We have shown that an imperfect universal recognition function is still a constant
function. It follows that any non model-based recognition scheme with a universal scope
cannot discriminate between objects, even if it is allowed to make errors on a significant
number of images.

4 Consistent recognition functions for a class of objects

So far we have assumed that the scope of the recognition scheme was universal. That
is, the recognition scheme could get as its input any set of (pointwise) 3-D objects.
The recognition functions under consideration were therefore universally consistent with
respect to viewing position. Clearly, this is a strong requirement. In the following sections
we consider recognition schemes that are specific to classes of objects. The recognition
function, in this case, must still be consistent with respect to viewing position, but only
for objects that belong to the class in question. That is, the function must be invariant
to viewing position for images of objects that belong to a given class of objects, but can
have arbitrary values for images of objects that do not belong to this class.
The possible existence of a nontrivial consistent recognition function for an object
class depends on the particular class in question. In Section (4.1) we consider a simple
class for which a nontrivial consistent function (with respect to viewing position) still
does not exist. In Section (4.2) we discuss the existence of consistent functions for certain
infinite classes of objects. We show that when a nontrivial consistent function exists, an
upper bound on the discrimination power of any such function can be determined. Finally,
we use the class of symmetric objects (Section 4.3) in order to demonstrate the existence
of a consistent function for an infinite class of objects and its discrimination power.

4.1 The class of a prototypical object

In this section, we consider the class of objects that are defined by a generic object.
The class is defined to consist of all the objects that are sufficiently close to a given
prototypical object. For example, it is reasonable to assume that all faces are within a
certain distance from some prototypical face. The class of a prototypical object composed
of n points in space can be thought of as a sphere in R^3n around the prototypical object.
The results established for the unrestricted case hold for such classes of objects. That
is, every consistent recognition function with respect to viewing position of all the objects
that belong to a class of a given prototypical object is a constant function. The proof for
this case is similar to the proof of the general case in Claim 1.

4.2 Discrimination power

Clearly, some class invariants exist. A simple example is the class of eight-point objects
with the points lying on the corners of some rectangular prism, together with the class
of all three-point objects (since at least four points of the eight-point object will always
be visible). In this example the function is consistent for the class: all the views of
a given object will be mapped to the same value. However, the function has a limited
discrimination power: it can only distinguish between two subclasses of objects. In this
section we examine further the discrimination power of a recognition function.
Given a class of objects, we first define a reachability partition of equivalence sub-
classes. Two objects are within the same equivalence subclass if and only if they are
reachable from each other. Reachability is clearly an equivalence relation and therefore
it divides the class into equivalence subclasses. Every function f induces a partition into
equivalence subclasses of its domain. That is, two objects, a and b, belong to the same
equivalence subclass if and only if f(a) = f(b). Every consistent recognition function must
have identical value for all objects in the same equivalence subclass defined by the reach-
ability partition (the proof is the same as in Claim 1). However, the function can have
different values for images of objects from different subclasses. Therefore, the reachability
partition is a refinement of any partition induced by a consistent recognition function.
That is, no consistent recognition function can discriminate between objects within
the same reachability partition subclass.
The reachability subclasses in a given class of objects determine the upper bound
on the discrimination power of any consistent recognition function for that class. If the
number of reachability subclasses in a given class is finite, then it is the upper bound for
the number of values in the range of any consistent recognition function for this class.
In particular, it is the upper bound for the number of objects that can be discriminated
by any consistent recognition function for this class. Note that the notion of reachability
and, consequently, the number of equivalence classes, is independent of the particular
recognition function. If the function discrimination power is low, the function is not very
helpful for recognition but can be used for classification, the classification being into the
equivalence subclasses.
In a non model-based recognition scheme, a consistent function must assign the same
value to every two objects that are reachable within the scope of the scheme. In contrast,
a recognition function in a model-based scheme is required to assign the same value to
every two objects that are reachable within the set of objects that the function must
in fact recognize. Two objects can be unreachable within a given set of objects but be
reachable within the scope of objects. A recognition function can therefore discriminate
between two such objects in a model-based scheme, but not in a non model-based scheme.
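As a small illustration of how the reachability partition bounds discrimination power, the sketch below (our own, not from the paper) takes a finite sample of objects together with a user-supplied predicate `common_projection(a, b)` and computes the induced equivalence subclasses by transitive closure with a union-find structure; the number of resulting subclasses is the upper bound on how many objects any consistent recognition function could tell apart within that sample.

```python
def reachability_classes(objects, common_projection):
    """Partition `objects` into reachability subclasses: the transitive closure of
    'the two objects share a common orthographic projection'.  The boolean predicate
    `common_projection(a, b)` is assumed to be supplied by the caller."""
    parent = list(range(len(objects)))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if common_projection(objects[i], objects[j]):
                union(i, j)

    classes = {}
    for i in range(len(objects)):
        classes.setdefault(find(i), []).append(i)
    return list(classes.values())

# Toy demonstration with abstract object labels and a hand-specified predicate.
objs = ["A", "B", "C", "D"]
shares = {("A", "B"), ("B", "C")}
pred = lambda a, b: (a, b) in shares or (b, a) in shares
print(reachability_classes(objs, pred))   # indices [[0, 1, 2], [3]]: at most 2 discriminable groups
```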

4.3 The class of symmetric objects

The class of symmetric objects is a natural class to examine. For example, schemes for
identifying faces, cars, tables, etc., all deal with symmetric (or approximately symmetric)
objects. Every recognition scheme for identifying objects belonging to one of these classes
should be consistent only for symmetric objects.
In the section below we examine the class of bilaterally symmetric objects. We will
determine the reachability subclasses of this class, and derive explicitly a recognition
function with the optimal discrimination power. We consider images such that for every
point in the image, its symmetric point appears in the image.

Without loss of generality, let a symmetric object be (0, p_1, p_2, ..., p_{2n}), where p_i =
(x_i, y_i, z_i) and p_{n+i} = (−x_i, y_i, z_i) for 1 ≤ i ≤ n. That is, p_i and p_{n+i} are a pair of
symmetric points about the y–z plane for 1 ≤ i ≤ n. Let p_i' = (x_i', y_i', z_i') be the
new coordinates of a point p_i following a rotation by a rotation matrix R and scaling
by a scaling factor s. The new x-coordinates are x_i' = s(x_i r_11 + y_i r_12 + z_i r_13) and
x_{n+i}' = s(−x_i r_11 + y_i r_12 + z_i r_13). In particular, for every pair of symmetric points p_i and
p_{n+i}, (x_i' − x_{n+i}')/(x_1' − x_{n+1}') = x_i/x_1 holds.
In the same manner it can be shown that the ratios between the distances of two
pairs of symmetric points do not change when the object is rotated in space and scaled.
We claim that these ratios define a nontrivial partition of the class of symmetric objects
into equivalence subclasses of unreachable objects. Let d_i be the distance between a pair
of symmetric points p_i and p_{n+i}. Define the function h by
h(0, p_1, ..., p_{2n}) = (d_2/d_1, d_3/d_1, ..., d_n/d_1).
Claim 2: Every two symmetric objects a and b are reachable if and only if h(a) = h(b).
(The proof of this claim can be found in Moses & Ullman (1991).)
It follows from this claim that a consistent recognition function with respect to view-
ing position defined for all symmetric objects can only discriminate between objects that
differ in the relative distances of their symmetric point pairs.
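The invariance used above is easy to check numerically. The following sketch (ours; the object is random and bilaterally symmetric by construction) verifies that the vector of distance ratios h is unchanged when the object is rotated in space and scaled.

```python
import numpy as np

def h(points):
    """Distance-ratio invariant of a bilaterally symmetric object given as 2n points
    ordered so that points[i] and points[n+i] are mirror pairs."""
    n = len(points) // 2
    d = np.linalg.norm(points[:n] - points[n:], axis=1)   # d_i = |p_i - p_{n+i}|
    return d[1:] / d[0]                                    # (d_2/d_1, ..., d_n/d_1)

rng = np.random.default_rng(0)
half = rng.normal(size=(4, 3)) + [1.0, 0.0, 0.0]          # keep x_i away from 0
obj = np.vstack([half, half * [-1.0, 1.0, 1.0]])          # mirror about the y-z plane

q, _ = np.linalg.qr(rng.normal(size=(3, 3)))              # random orthonormal frame
if np.linalg.det(q) < 0:                                  # make it a proper rotation
    q[:, 0] *= -1
transformed = 2.7 * obj @ q.T                             # rotate and scale the object

print(np.allclose(h(obj), h(transformed)))                # True: h is invariant
```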

5 Consistent recognition function for grey level images


So far, we have considered only binary images. In this section we consider grey level
images of Lambertian objects that consist of n small surface patches in space (this can
be thought of as sampling an object at n different points). Each point p has a surface
normal N_p and a reflectance value ρ_p associated with it. The image of a given object
now depends on the points' location, the points' normals and reflectance, and also on the
illumination condition, that is, the level of illumination, and the position and distribution
of the light sources.
An image now contains more information than before: in addition to the location
of the n points, we now have the grey level of the points. The question we consider is
whether under these conditions objects may become more discriminable than before by a
consistent recognition function. We now have to consider consistent recognition functions
with respect to both illumination condition and viewing position. We show that a non-
trivial universally consistent recognition function with respect to illumination condition
and viewing position still does not exist.

Claim 3: Any universally consistent function with respect to illumination condition and
viewing position, that is defined on grey level images of objects consisting of n surface
patches, is the trivial function.
In order to prove this claim, we will show that every two objects are reachable. That
is, there exists a sequence of objects starting with the first and ending with the second
object, and every successive pair in the sequence has a common image. A pair of objects
has a common image if there is an illumination condition and viewing position such that
the two images (the points' location as well as their grey level) are identical. The proof
of this claim can be found in Moses & Ullman (1991).
We conclude that the limitations on consistent recognition functions with respect to
viewing position do not change when the grey level values are also given at the image
points. In particular, it follows that a consistent recognition scheme that must recognize

objects regardless of the illumination condition and viewing position must be model-
based.

6 Conclusion

In this paper we have established some limitations on non model-based recognition


schemes. In particular, we have established the following claims:
(a) Every function that is invariant to viewing position of all possible point objects is
a constant function. It follows that every consistent recognition scheme must be model-
based.
(b) If the recognition function is allowed to make mistakes and mis-identify each ob-
ject from a substantial fraction of viewing directions (about 14%) it is still a constant
function.
We have considered recognition schemes restricted to classes of objects and showed
the following: For some classes (such as classes defined by a prototypical object) the only
consistent recognition function is the trivial function. For other classes (such as the class
of symmetric objects), a nontrivial recognition scheme exists. We have defined the notion
of the discrimination power of a consistent recognition function for a class of objects. We
have shown that it is possible to determine an upper bound on the discrimination power
of every consistent recognition function for a given class of objects. The bound
is determined by the number of equivalence subclasses (determined by the reachability
relation). For the class of symmetric objects, these subclasses were derived explicitly.
For grey level images, we have established that the only consistent recognition function
with respect to viewing position and illumination conditions is the trivial function.
In this study we considered only objects that consist of points on surface patches in
space. Real objects are more complex. However, many recognition schemes proceed by
first finding special contours or points in the image, and then applying the recognition
process to them. The points found by the first stage are usually projections of stable
object points. When this is the case, our results apply to these schemes directly. For
consistent recognition functions that are defined on contours or surfaces, our results do
not apply directly, unless the function is applied to contours or surfaces as sets of points.
In the future we plan to extend the result to contours and surfaces.

Appendix 1

In this Appendix we prove that in the general case every two objects are reachable from
one another.
First note that the projection of two points, when viewed from the direction of the
vector that connects the two points, is a single point. It follows that for every object with
n - 1 points there is an object with n points such that the two objects have a common
orthographic projection. Hence, it is sufficient to prove the following claim:

Claim 4: Any two objects that consist of the same number of points in space are
reachable from one another.

Proof. Consider two arbitrary rigid objects, a and b, with n points. We have to show
that b is reachable from a. That is, there exists a sequence of objects such that every two
successive objects have a common orthographic projection.

Let the first object in the sequence be a_1 = a = (p_1^a, p_2^a, ..., p_n^a) and the last object
be b = (p_1^b, p_2^b, ..., p_n^b). We take the rest of the sequence, a_2, ..., a_n, to be the objects
a_i = (p_1^b, p_2^b, ..., p_{i−1}^b, p_i^a, ..., p_n^a). All that is left to show is that for every two successive
objects in the sequence there exists a direction such that the two objects project to the
same image. By the sequence construction, every two successive objects differ by only
one point. The two non-identical points project to the same image point on the plane
perpendicular to the vector that connects them. Clearly, all the identical points project to
the same image independent of the projection direction. Therefore, the direction in which
the two objects project to the same image is the vector defined by the two non-identical
points of the successive objects. □
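The construction in this proof can be checked numerically. The sketch below (illustrative code of ours, not from the paper) builds the intermediate sequence for two random n-point objects and verifies that each successive pair shares a common orthographic projection when viewed along the vector joining their two differing points.

```python
import numpy as np

def project(points, v):
    """Orthographic projection of a point set onto the plane perpendicular to v."""
    v = v / np.linalg.norm(v)
    basis = np.linalg.svd(np.outer(v, v))[0][:, 1:]   # two vectors spanning v's orthogonal plane
    return points @ basis

rng = np.random.default_rng(0)
n = 5
a, b = rng.normal(size=(n, 3)), rng.normal(size=(n, 3))

# Sequence a_1 = a, a_2, ..., a_n as in the proof: a_i takes its first i-1 points from b.
sequence = [np.vstack([b[:i], a[i:]]) for i in range(n)] + [b]

for prev, curr in zip(sequence, sequence[1:]):
    diff = np.nonzero(np.any(prev != curr, axis=1))[0]
    assert len(diff) == 1                      # successive objects differ in one point
    k = diff[0]
    direction = prev[k] - curr[k]              # view along the vector joining the two points
    assert np.allclose(project(prev, direction), project(curr, direction))
print("every successive pair has a common orthographic projection")
```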

References
1. Bolles, R.C. and Cain, R.A. 1982. Recognizing and locating partially visible objects: The
local-features-focus method. Int. J. Robotics Research, 1(3), 57-82.
2. Brooks, R.A. 1981. Symbolic reasoning around 3-D models and 2-D images, Artificial
Intelligence J., 17, 285-348.
3. Burns, J. B., Weiss, R. and Riseman, E.M. 1990. View variation of point set and line
segment features. Proc. Image Understanding Workshop, Sep., 650-659.
4. Cannon, S.R., Jones, G.W., Campbell, R. and Morgan, N.W. 1986. A computer vision
system for identification of individuals. Proc. IECON 86, 1, 347-351.
5. Clemens, D.J. and Jacobs, D.W. 1990. Model-group indexing for recognition. Proc. Image
Understanding Workshop, Sep., 604-613.
6. Forsyth, D., Mundy, L., Zisserman, A., Coelho, C., Heller A. and Rothwell, C. 1991.
Invariant Descriptors for 3-D object Recognition and pose. IEEE Trans. on PAMI. 13(10),
971-991.
7. Grimson, W.E.L. and Lozano-Pérez, T. 1984. Model-based recognition and localization
from sparse data. Int. J. Robotics Research, 8(3), 3-35.
8. Grimson, W.E.L. and Lozano-Pérez, T. 1987. Localizing overlapping parts by searching
the interpretation tree. IEEE Trans. on PAMI. 9(4), 469-482.
9. Horn, B.K.P. 1977. Understanding image intensities. Artificial Intelligence J., 8(2), 201-
231.
10. Huttenlocher, D.P. and Ullman, S. 1987. Object recognition using alignment. Proceedings
of ICCV Conf., London, 102-111.
11. Kanade, T. 1977. Computer recognition of human faces. Birkhauser Verlag. Basel and
Stuttgart.
12. Lowe, D.G. 1985. Three dimensional object recognition from single two-dimensional im-
ages. Robotics Research Technical Report 202, Courant Inst. of Math. Sciences, N.Y.
University.
13. Moses, Y. and Ullman, S. 1991. Limitations of non model-based recognition schemes. AI
Memo No. 1301, The Artificial Intelligence Lab., M.I.T.
14. Phong, B.T. 1975. Illumination for computer generated pictures. Communications of the
ACM, 18(6), 311-317.
15. Poggio T., and Edelman S. 1990. A network that learns to recognize three dimensional
objects. Nature, 343, 263-266.
16. Ullman S. 1977. Transformability and object identity. Perception and Psychophysics,
22(4), 414-415.
17. Ullman S. 1989. Alignment pictorial description: an approach to object recognition. Cog-
nition, 32(3), 193-254.
18. Wong, K.H., Law, H.H.M. and Tsang, P.W.M. 1989. A system for recognizing human faces,
Proc. ICASSP, 1638-1642.
This article was processed using the LaTeX macro package with ECCV92 style
Constraints for R e c o g n i z i n g and Locating C u r v e d
3D Objects from Monocular Image Features *

David J. Kriegman¹, B. Vijayakumar¹, Jean Ponce²

1 Center for Systems Science, Dept. of Electrical Engineering, Yale University, New Haven, CT
06520-1968, USA
2 Beckman Institute, Dept. of Computer Science, University of Illinois, Urbana, IL 61801, USA

Abstract. This paper presents viewpoint-dependent constraints that re-


late image features such as t-junctions and inflections to the pose of curved
3D objects. These constraints can be used to recognize and locate object
instances in the imperfect line-drawing obtained by edge detection from a
single image. For objects modelled by implicit algebraic equations, the con-
straint equations are polynomial, and methods for solving these systems of
constraints are briefly discussed. An example of pose recovery is presented.

1 Introduction

While in the "classical approach" to object recognition from images, an intermediate


2D or 3D representation is constructed and matched to object models, the approach
presented in this paper bypasses this intermediate representation and instead directly
matches point image features to three dimensional vertices, edges or surfaces. Similar
approaches to recognition and positioning of polyhedra from monocular images have been
demonstrated by several implemented algorithms [3, 4, 8] and are based on the use of
the so-called "rigidity constraints" [1, 2] or "viewpoint consistency constraints" [8]. This
feature-matching approach is only possible, however, because mose observable image
features are the projections of object features (edges and vertices). In contrast, most
visible features in the image of a curved object depend on viewpoint and cannot be
traced back to particular object features. More specifically, the image contours of a
smooth object are the projections of limb points (occluding contours, silhouette) which
are regular surface points where the viewing direction is tangent to the surface; they join
at t-junctions and may also terminate at cusp points which have the additional property
that the viewing direction is an asymptotic direction of the surface.
In this paper, we show how matching a small number of point image features to a
model leads to a system of polynomial equations which can be solved to determine an
object's pose. Hypothesized matches between image features and modelled surfaces, edges
and vertices can be organized into an interpretation tree [2], and the mutual existence
of these features can be verified from a previously computed aspect graph [5, 11]. The
image features emphasized in this paper and shown in figure 1.a are generic viewpoint
dependent point features and include vertices, t-junctions, cusps, three-tangent junctions,
curvature-L junctions, limb inflections, and edge inflections [9]. This is an exhaustive
list of the possible contour singularities and inflections which are stable with respect
to viewpoint; for almost any viewpoint, perturbing the camera position in a small ball
around the original viewpoint will neither create nor destroy these features. More details
of the presented approach can be found in [7].

* This work was supported by the National Science Foundation under Grant IRI-9015749.

Fig. 1. a. Some viewpoint dependent image features for piecewise smooth objects (labelled in the
figure: three-tangent junction, t-junction, curvature-L junction, inflection). b. A t-junction
and the associated geometry.

2 Object Representation and Image Formation


In this paper, objects are modelled by algebraic surfaces and their intersection curves.
We consider implicit surfaces given by the zero set of a polynomial
f(x) = f(x, y, z) = 0.   (1)
The surface will be considered nonsingular, so the unnormalized surface normal is
given by n(x) = ∇f(x). Note that rational parametric surface representations, such as
Bezier patches, non-uniform rational B-splines, and some generalized cylinders, can be
represented implicitly by applying elimination theory, and so the presented constraints
readily extend to these representations [6, 12]. The intersection curve between two sur-
faces f and g is simply given by the common zeros of the two defining equations:
f(x) = 0
g(x) = 0 (2)
In this paper, we assume scaled orthographic projection, though the approach can be
extended to perspective; the projection of a point x = [x, y, z]^T onto the image plane,
x̂ = [x̂, ŷ]^T, can be written as:
x̂ = x_0 + s [w u]^T x   (3)
where w, u form an orthonormal basis for the image plane, and v = w × u is the viewing
direction; x_0 = [x_0, y_0]^T and the scale factor s respectively parameterize image translation and scaling.
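As a concrete illustration of this representation (a sketch under our own choice of surface and symbols, not code from the paper), the snippet below defines an implicit quadric f, uses ∇f as the unnormalized surface normal, and applies the scaled orthographic projection of (3) to a few surface points.

```python
import numpy as np

# Implicit surface (our own example): an ellipsoid f(x, y, z) = x^2/4 + y^2 + z^2 - 1 = 0.
def f(p):
    x, y, z = p
    return x * x / 4.0 + y * y + z * z - 1.0

def n(p):
    x, y, z = p
    return np.array([x / 2.0, 2.0 * y, 2.0 * z])   # unnormalized normal n(x) = grad f(x)

def scaled_orthographic(points, w, u, s=1.0, x0=np.zeros(2)):
    """Eq. (3): x_hat = x0 + s [w u]^T x, with w, u an orthonormal basis of the
    image plane and v = w x u the viewing direction."""
    A = np.stack([w, u])                            # 2 x 3 projection matrix
    return x0 + s * points @ A.T

# Sample points on the ellipsoid and their surface normals.
theta = np.linspace(0.1, np.pi - 0.1, 5)
pts = np.stack([2.0 * np.sin(theta), np.cos(theta), np.zeros_like(theta)], axis=1)
assert all(abs(f(p)) < 1e-12 for p in pts)
normals = np.array([n(p) for p in pts])

w = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 0.0, 1.0])                       # viewing direction v = w x u = (0, -1, 0)
print(scaled_orthographic(pts, w, u, s=0.5))
print(normals)
```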

3 Viewpoint-Dependent Feature Constraints


We now consider the constraints that relate a pair of points on an object model to
measured image features in terms of a system of n equations in n unknowns where the
unknowns are the coordinates of the model points. While these constraints hold in gen-
eral, they can be manipulated into systems of polynomial equations for algebraic surfaces.
To solve these systems, we have used the global method of homotopy continuation to find
all roots [10] as well as a combination of table lookup and Newton's method to only find
the real roots. For each pair of points, the parameters of the viewing transformation
can be easily calculated. Below, constraints are presented for all of the features found
in Malik's junction catalogue [9]. Additionally, constraints are presented for inflections
of image contours which are easily detected in images. Pose estimation from three ver-
tices (viewpoint independent features) has been discussed elsewhere [4]. Note that other
pairings of the same features are possible and lead to similar constraints.

3.1 T-junctions
First, consider the hypothesis that an observed t-junction is the projection of two limb
points x_1, x_2, as shown in figure 1.b, which provides the following geometric constraints:
f_i(x_i) = 0
(x_1 − x_2) · N_i = 0   (4)
N_1 · N_2 = cos θ_12,
where i = 1, 2, N_i denotes the unit surface normals, and cos θ_12 is the observed angle
between the image normals. In other words, we have five equations, one observable cos θ_12,
and six unknowns (x_1, x_2). In addition, the viewing direction is given by v = x_1 − x_2.
When another t-junction is found, we obtain another set of five equations in six unknowns
x_3, x_4, plus an additional vector equation: (x_1 − x_2) × (x_3 − x_4) = 0, where only two
of the scalar equations are independent. This simply expresses the fact that the viewing
direction should be the same for both t-junctions. Two observed t-junctions and the
corresponding hypotheses (i.e., "t-junction one corresponds to patch one and patch two",
and "t-junction two corresponds to patch three and patch four") provide us with 12
equations in 12 unknowns. Such a system admits a finite number of solutions in general.
For each solution, the viewing direction can be computed, and the other parameters of
the viewing transformation are easily found by applying eq. (3). Similar constraints are
obtained for t-junctions that arise from the projection of edge points by noting that the
3D curve tangent, given by t = ∇f × ∇g, projects to the tangent of the image contour.
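To make the structure of these systems concrete, the sketch below (our own illustration, not the authors' implementation, which uses homotopy continuation and table lookup with Newton's method) assembles the residuals of (4) for two t-junctions, for simplicity both hypothesized on the same pair of spheres, adds the common-viewing-direction constraint, and hands the stacked system to a generic least-squares root finder (SciPy is assumed available); the "measured" angles are synthesized from a ground-truth configuration.

```python
import numpy as np
from scipy.optimize import least_squares

# Two toy unit spheres (our own choice of surfaces): f_i(x) = |x - c_i|^2 - 1.
c1, c2 = np.array([0.0, 0.0, 0.0]), np.array([1.5, 0.0, 0.5])
f1 = lambda x: np.dot(x - c1, x - c1) - 1.0
f2 = lambda x: np.dot(x - c2, x - c2) - 1.0
N1 = lambda x: x - c1            # unnormalized normals; unit length on the spheres
N2 = lambda x: x - c2

# Ground-truth t-junction points for viewing direction (0, 0, 1), used only to
# generate consistent "measured" angles cos(theta_12) and cos(theta_34).
s = np.sqrt(1.0 - 0.75 ** 2)
gt = [np.array([0.75,  s, 0.0]), np.array([0.75,  s, 0.5]),
      np.array([0.75, -s, 0.0]), np.array([0.75, -s, 0.5])]
cos12 = np.dot(N1(gt[0]), N2(gt[1]))
cos34 = np.dot(N1(gt[2]), N2(gt[3]))

def residuals(q):
    x1, x2, x3, x4 = q[0:3], q[3:6], q[6:9], q[9:12]
    r = [f1(x1), f2(x2),                                    # surface equations
         np.dot(x1 - x2, N1(x1)), np.dot(x1 - x2, N2(x2)),  # limb (tangency) conditions
         np.dot(N1(x1), N2(x2)) - cos12,                    # measured angle, eq. (4)
         f1(x3), f2(x4),
         np.dot(x3 - x4, N1(x3)), np.dot(x3 - x4, N2(x4)),
         np.dot(N1(x3), N2(x4)) - cos34]
    r += list(np.cross(x1 - x2, x3 - x4))                   # common viewing direction
    return np.array(r)

q0 = np.concatenate(gt) + 0.1 * np.random.default_rng(0).normal(size=12)
sol = least_squares(residuals, q0)
print(sol.cost)                          # ~0: all constraints satisfied at the solution
print(sol.x[0:3] - sol.x[3:6])           # recovered viewing direction (up to scale)
```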
3.2 Curvature-L and Three-tangent Junctions
For a piecewise smooth object, curvature-L or three-tangent junctions are observed when
a limb terminates at an edge, and they meet with a common tangent; observe the top
and bottom of a coffee cup, or consider figure 1. Both feature types have the same local
geometry; however, one of the edge branches is occluded at a curvature-L junction.
Fig. 2. The image plane geometry for pose estimation from three-tangent and curvature-L junc-
tions: the curved branch represents the edge while the straight branch represents a limb.
Consider the two edge points x_i, i = 1, 2, formed by the surfaces f_i, g_i that project to
these junctions; x_i is also an occluding contour point for one of the surfaces, say f_i. Note
the image measurements (angles α, β_1 and β_2) shown in figure 2. Since x_i is a limb point
of f_i, the surface normal is aligned with the measured image normal n̂_i. Thus, the angle
α between n̂_1 and n̂_2 equals the angle between n_1 and n_2, or cos α = n_1 · n_2 / (|n_1||n_2|).
Now, define the two vectors Δ = x_1 − x_2 and Δ̂ = x̂_1 − x̂_2. Clearly the angle between
n̂_i and Δ̂ must equal the angle between n_i and the projection of Δ onto the image plane,
Δ', which is given by Δ' = Δ − (Δ · v̂)v̂, where v̂ = n_1 × n_2 / |n_1 × n_2| is the normalized
viewing direction. Noting that n_i · v̂ = 0, we have |n_i||Δ'| cos β_i = n_i · Δ. However, Δ' is of
relatively high degree, and a lower degree equation is obtained by taking the ratio of cos β_1
and cos β_2 and using the equation for cos α. After squaring and rearrangement, these equations
(n_1 · n_1)(n_2 · n_2) cos²α − (n_1 · n_2)² = 0,
(cos β_1 / cos β_2)(n_2 · Δ)(n_1 · n_2) − cos α (n_2 · n_2)(n_1 · Δ) = 0.   (5)

along with the edge equations (2) form a system of six polynomial equations in six
unknowns whose roots can be found; the pose is then determined from (3).
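The two constraints in (5) can be sanity-checked numerically. The sketch below (ours) picks an arbitrary pair of 3-D points whose normals are orthogonal to a common viewing direction, synthesizes the "measured" quantities cos α, cos β_1, cos β_2 exactly as in the derivation above, and confirms that both expressions vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.normal(size=3); v /= np.linalg.norm(v)          # viewing direction

def perp(u):                                            # component of u orthogonal to v
    return u - np.dot(u, v) * v

# Two points with (non-unit) normals orthogonal to the viewing direction (limb condition).
x1, x2 = rng.normal(size=3), rng.normal(size=3)
n1, n2 = perp(rng.normal(size=3)), perp(rng.normal(size=3))

delta = x1 - x2
delta_p = perp(delta)                                   # projection of delta onto the image plane

cos_a = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
cos_b1 = np.dot(n1, delta) / (np.linalg.norm(n1) * np.linalg.norm(delta_p))
cos_b2 = np.dot(n2, delta) / (np.linalg.norm(n2) * np.linalg.norm(delta_p))

eq1 = np.dot(n1, n1) * np.dot(n2, n2) * cos_a**2 - np.dot(n1, n2)**2
eq2 = (cos_b1 / cos_b2) * np.dot(n2, delta) * np.dot(n1, n2) \
      - cos_a * np.dot(n2, n2) * np.dot(n1, delta)
print(eq1, eq2)   # both ~0 for any configuration consistent with the measurements
```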

3.3 Inflections
Inflections (zeros of curvature) of an image contour can arise from either limbs or edges.
In both cases, observing two such points is sufficient for determining object pose.
As Koenderink has shown, a limb inflection is the projection of a point on a parabolic
line (zero Gaussian curvature) [5], and for a surface defined implicitly, this is

f_x²(f_yy f_zz − f_yz²) + f_y²(f_xx f_zz − f_xz²) + f_z²(f_xx f_yy − f_xy²)
+ 2 f_x f_y (f_xz f_yz − f_xy f_zz) + 2 f_y f_z (f_xy f_xz − f_xx f_yz)   (6)
+ 2 f_x f_z (f_xy f_yz − f_yy f_xz) = 0,

where the subscripts indicate partial derivatives. Since both points x_1, x_2 are limbs,
equation (6) and the surface equation for each point can be added to (5) for measured
values of α, β_1 and β_2 as depicted in figure 2. This system of six equations in x_1, x_2 can
be solved to yield a set of points, and consequently the viewing parameters.
In the case of edges, an image contour inflection corresponds to the projection of an
inflection of the space curve itself or a point where the viewing direction is orthogonal to
the binormal. Space curve inflections typically occur when the curve is actually planar,
and can be treated like viewpoint independent features (vertices). When inflections arise
from the binormal b_i being orthogonal to the viewing direction, as in figure 1, two mea-
sured inflections are sufficient for determining pose. It can be shown that the projection
of b_i is the image contour normal, and for surfaces defined implicitly, the binormal is
given by b = [t^T H(g) t] ∇f − [t^T H(f) t] ∇g, where H(f) is the Hessian of f. By including
the curve equations (2) with (5) after replacing n_i by b_i, a system of six equations in
x_1, x_2 is obtained. After solving this system, the pose can be readily determined.
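Expression (6) is the zero-Gaussian-curvature condition for an implicit surface. As a check (our own sketch, assuming SymPy), the snippet below builds (6) symbolically for an implicit torus and verifies that it vanishes on the torus' parabolic circles (its top and bottom circles) but not at a generic surface point.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
R, r = 2, 1                                        # torus radii
f = (x**2 + y**2 + z**2 + R**2 - r**2)**2 - 4 * R**2 * (x**2 + y**2)

fx, fy, fz = (sp.diff(f, s) for s in (x, y, z))
fxx, fyy, fzz = sp.diff(f, x, 2), sp.diff(f, y, 2), sp.diff(f, z, 2)
fxy, fxz, fyz = sp.diff(f, x, y), sp.diff(f, x, z), sp.diff(f, y, z)

# Equation (6): the parabolic (zero Gaussian curvature) condition on f(x, y, z) = 0.
parabolic = (fx**2 * (fyy * fzz - fyz**2) + fy**2 * (fxx * fzz - fxz**2)
             + fz**2 * (fxx * fyy - fxy**2)
             + 2 * fx * fy * (fxz * fyz - fxy * fzz)
             + 2 * fy * fz * (fxy * fxz - fxx * fyz)
             + 2 * fx * fz * (fxy * fyz - fyy * fxz))

# A point on the torus' top circle, e.g. (R, 0, r): on the surface and parabolic.
p = {x: R, y: 0, z: r}
print(f.subs(p), parabolic.subs(p))                # both 0
# A point on the outer equator, (R + r, 0, 0): on the surface but not parabolic.
print(parabolic.subs({x: R + r, y: 0, z: 0}))      # nonzero
```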
3.4 Cusps
Like the other features, observing two cusps in an image is sufficient for determining
object pose. It is well known that cusps occur when the viewing direction is an asymptotic
direction at a limb point, which can be expressed as v^T H(x_i) v = 0, where the viewing
direction is v = ∇f_1(x_1) × ∇f_2(x_2). While the image contour tangent is not strictly
defined at a cusp (which is after all a singular point), the left and right limits of the
tangent as the cusp is approached will be in opposite directions and are orthogonal to
the surface normal. Thus, the cusp and surface equations can be added to the system (5)
which is readily solved for x_1 and x_2, followed by pose calculation.
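For completeness, a tiny numerical sketch (ours; the quadrics and candidate points are assumptions) of the cusp condition: given candidate limb points on two implicit surfaces, it evaluates the viewing direction v = ∇f_1(x_1) × ∇f_2(x_2) and the asymptotic-direction residuals v^T H(x_i) v that a root finder would drive to zero together with the surface equations.

```python
import numpy as np

# Two toy implicit quadrics (assumed surfaces, not from the paper).
def f1(p):
    x, y, z = p
    return x * x + y * y + z * z - 1.0                   # unit sphere

def f2(p):
    x, y, z = p
    return (x - 1.0) ** 2 / 4.0 + y * y + z * z - 1.0    # ellipsoid

def grad(f, p, h=1e-6):
    e = np.eye(3)
    return np.array([(f(p + h * e[i]) - f(p - h * e[i])) / (2 * h) for i in range(3)])

def hessian(f, p, h=1e-4):
    e = np.eye(3)
    return np.array([[(f(p + h * (e[i] + e[j])) - f(p + h * (e[i] - e[j]))
                       - f(p - h * (e[i] - e[j])) + f(p - h * (e[i] + e[j]))) / (4 * h * h)
                      for j in range(3)] for i in range(3)])

def cusp_residuals(x1, x2):
    """Residuals that vanish when x1, x2 are cusp points seen along a common direction."""
    v = np.cross(grad(f1, x1), grad(f2, x2))     # viewing direction tangent to both surfaces
    return np.array([f1(x1), f2(x2),
                     v @ hessian(f1, x1) @ v,    # v is an asymptotic direction at x1
                     v @ hessian(f2, x2) @ v])   # ... and at x2

# Candidate points on the two surfaces; nonzero entries mean they are not cusps yet,
# and a root finder would drive these residuals to zero.
print(cusp_residuals(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.8, 0.6])))
```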

4 Implementation and Results

Fig. 3.a shows an image of a cylinder with a cylindrical notch and two inflection points
found by applying the Canny edge detector and fitting cubic splines. The edge constraints
of section 3.3 lead to a system of six polynomial equations with 1920 roots. However, only
two roots are unique, and figs. 3.b and 3.c show the corresponding poses. Clearly the pose
in fig. 3.c could be easily discounted with additional image information. As in [6], elim-
ination theory can be used to construct an implicit equation of the image contours of the
intersection curve parameterized by the pose. By fitting this equation to all detected
edgels on the intersection curve using the previously estimated pose as initial conditions
for nonlinear minimization, the pose is further refined as shown in fig. 3.d. Using contin-
uation to solve the system of equations required nearly 20 hours on a SPARC Station
1, though a recently developed parallel implementation running on a network of SPARC
stations or transputers should be significantly faster. However, since there are only a few

real roots, another effective method is to construct a table offline of α, β_i as a function
of the two edge points. Using table entries as initial conditions for Newton's method, the
same poses are found in only two minutes. Additional examples are presented in [7].

Fig. 3. Pose estimation from two inflection points. Note the scale difference in c.
Acknowledgments:
Many thanks to Darrell Stam for his distributed implementation of continuation.
References
1. O. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects.
Int. J. Robot. Res., 5(3):27-52, Fall 1986.
2. W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric Constraints.
MIT Press, 1990.
3. R. Horaud. New methods for matching 3-D objects with single perspective views. IEEE
Trans. Pattern Anal. Mach. Intelligence, 9(3):401-412, 1987.
4. D. Huttenlocher and S. Ullman. Object recognition using alignment. In International
Conference on Computer Vision, pages 102-111, London, U.K., June 1987.
5. J. Koenderink. Solid Shape. MIT Press, Cambridge, MA, 1990.
6. D. Kriegman and J. Ponce. On recognizing and positioning curved 3D objects from image
contours. IEEE Trans. Pattern Anal. Mach. Intelligence, 12(12):1127-1137, 1990.
7. D. Kriegman, B. Vijayakumar, and J. Ponce. Strategies and constraints for recognizing
and locating curved 3D objects from monocular image features. Technical Report 9201,
Yale Center for Systems Science, 1992.
8. D. G. Lowe. The viewpoint consistency constraint. Int. J. Computer Vision, 1(1), 1987.
9. J. Malik. Interpreting line drawings of curved objects. Int. J. Computer Vision, 1(1), 1987.
10. A. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scientific
Problems. Prentice Hall, Englewood Cliffs, 1987.
11. J. Ponce, S. Petitjean, and D. Kriegman. Computing exact aspect graphs of curved objects:
Algebraic surfaces. In European Conference on Computer Vision, 1991.
12. T. W. Sederberg, D. Anderson, and R. N. Goldman. Implicit representation of parametric
curves and surfaces. Comp. Vision, Graphics, and Image Proces., 28:72-84, 1984.
Polynomial-Time Object Recognition in the
Presence of Clutter, Occlusion, and Uncertainty*

Todd A. Cass

Artificial Intelligence Laboratory


Massachusetts Institute of Technology
Cambridge, Massachusetts, USA

Abstract. We consider the problem of object recognition via local geo-


metric feature matching in the presence of sensor uncertainty, occlusion, and
clutter. We present a general formulation of the problem and a polynomial-
time algorithm which guarantees finding all geometrically feasible interpre-
tations of the data, modulo uncertainty, in terms of the model. This formu-
lation applies naturally to problems involving both 2D and 3D objects.
The primary contributions of this work are the presentation of a robust,
provably correct, polynomial-time approach to this class of recognition prob-
lems and a demonstration of its practical application; and the development
of a general framework for understanding the fundamental nature of the
geometric feature matching problem. This framework provides insights for
analyzing and improving previously proposed recognition approaches, and
enables the development of new algorithms.

1 Introduction

The task considered here is model-based recognition using local geometric features, e.g.
points and lines, to represent object models and sensory data. The problem is formulated
as matching model features and data features to determine the position and orientation
of an instance of the model. This problem is hard because there are spurious and missing
features, as well as sensor uncertainty. This paper presents improvements and exten-
sions to earlier work[7] describing robust, complete, and provably correct methods for
polynomial-time object recognition in the presence of clutter, occlusion, and sensor un-
certainty.
We assume the uncertainty in the sensor measurements of the data features is bounded.
A model pose¹ is considered feasible for a given model and data feature match if at that
pose the two matched features are aligned modulo uncertainty, that is, if the image of
the transformed model feature falls within the uncertainty bounds of the data feature.
We show that, given a set of model and data features and assuming bounded sensor
uncertainty, there are only a polynomial number of qualitatively distinct poses matching
the model to the data. Two different poses are qualitatively distinct if the sets of feature
matches aligned (modulo uncertainty) by them are different.
The idea is that uncertainty constraints impose constraints on feasible model trans-
formations. Using Baird's formulation for uncertainty constraints[2] we show the feature
* This report describes research done at the Artificial Intelligence Laboratory of the Mas-
sachusetts Institute of Technology, and was funded in part by an ONR URI grant under
contract N00014-86-K-0685, and in part by DARPA under Army contract DACA76-85-C-
0010, and under ONR contract N00014-85-K-0124.
¹ The pose of the model is its position and orientation, which is equivalent to the transformation
producing it. In this paper pose and transformation will be used interchangeably.

matching problem can be formulated as the geometric problem of analyzing the arrange-
ment of linear constraints in transformation space. We call this approach pose equivalence
analysis. A previous paper [7] introduced the idea of pose equivalence analysis; this pa-
per contributes a simpler explanation of the approach based on linear constraints and
transformations, outlines the general approach for the case of 3D and 2D models with
2D data, and discusses the particular case of 2D models and planar transformations to
illustrate how the structure of the matching problem can be exploited to develop efficient
matching algorithms. This work provides a simple and clean mathematical framework
within which to analyze the feature matching problem in the presence of bounded geo-
metric uncertainty, providing insight into the fundamental nature of this type of feature
matching problem.

1.1 Robust Geometric Feature Matching

Model-based object recognition is popularly defined as the problem of determining the


geometric correspondence between an object model and some a priori unknown subset of
the data. We're given a geometrical model of the spatial structure of an object and the
problem is to select data subsets corresponding to instances of the model, and determine
the pose of the object in the environment by matching the geometric model with instances
of the object represented in the sensory data. A common and effective paradigm for this
is based on geometric feature matching.
In this paper the model and the sensory data will be represented in terms of local
geometric features consisting of points or lines and possibly curve normals. Data feature
selection and object pose determination is achieved via geometrically matching model and
data feature subsets. There are four features of this task domain which are important
to consider in the work: The feature matching problem is difficult because the correct
model and data feature correspondences are unknown; there are spurious data features
due to unknown objects and scene clutter; even in the presence of a model instance, model
features are missing from the data; finally, and most importantly, the sensory data are
subject to geometrical uncertainty that greatly affects geometric feature matching. It
is important to consider geometric uncertainty in the data features in order to be able
to guarantee that an object will be detected if present. Uncertainty is the main factor
which makes the localization problem difficult. If there is no uncertainty then simple
polynomiai-time algorithms are possible for localization, guaranteeing success[15, 17].
However, if the measured position of features are used without, accounting for possible
deviation from the correct positions then these approaches cannot guarantee the correct
matching of the model to the data.
Localization is reasonably divided into a pose hypothesis stage and a pose verification
stage. This paper considers pose hypothesis construction via feature matching.²

1.2 Robustness, Completeness, and Tractability


There are three important criteria by which to analyze methods for object localization
via geometric feature matching: robustness, completeness, and tractability. The robust-
ness requirement means that careful attention is paid to the geometric uncertainty in the
features, so no correct feature correspondences are missed due to error. The complete-
ness requirement means that all sets of geometrically consistent feature correspondences

2 Pose verification may use a richer representation of the model and data to evaluate and verify
an hypothesis [3, 10, 15].

are found, including the correct ones. The tractability requirement simply means that
a polynomial-time, and hopefully efficient algorithm exist for the matching procedure.
Among existing methods that accurately account for uncertainty, none except our previous
work [6, 7] and recent work by Breuel [4] can both guarantee that all feasible
object poses will be found and do so in polynomial time. Those that do account for error
and guarantee completeness have expected-case exponential complexity [12].

1.3 Correspondence Space vs. Pose Space


For given sets of model and data features, geometric feature matching is defined as
both determining which subset of data features correspond to the model features, and
how they correspond geometrically. Feature matching can be accomplished by either
searching for geometrically consistent feature correspondences or searching for the model
transformation or pose geometrically aligning model and data features. Techniques based
on these approaches can be called correspondence space methods and pose space methods,
respectively.
Correspondence space is the power set of the set of all model and data feature pairs,
2^({m_i}×{d_j}), where {m_i} and {d_j} represent the sets of model and data features, respec-
tively. We define a match set M ∈ 2^({m_i}×{d_j}) as an arbitrary set of model and data
feature matches. Pose hypotheses can be constructed by finding geometrically consistent
match sets, that is, match sets for which there exists some pose of the model align-
ing modulo uncertainty the matched features. One way to structure this is as a search
through the correspondence space, often structured as a tree search[13, 2]. Correspon-
dence space is an exponential-sized set, and although enforcing geometric consistency in
the search prunes away large portions of the search space, it has been shown that the
expected search time is still exponential[12].
Pose or transformation space is the space of possible transformations on the model.
Pose hypotheses can be constructed by searching pose space for model poses aligning
modulo uncertainty model features to data features. Examples of techniques searching
pose space are pose clustering[18, 19], transformation sampling[5], and the method de-
scribed in this paper, pose equivalence analysis[7]. The pose space is a high dimensional,
continuous space, and the effects of data uncertainty and missing and spurious features
make effectively searching it to find consistent match sets difficult.
The approach described in this paper provides a framework for unifying the corre-
spondence space approach and the pose space approach.

2 Pose Equivalence Analysis


We will represent the model and the image data in terms of local geometric features
such as points and line segments derived from the object's boundary, to which we may
also associate an orientation. Denote a model feature m = (p_m, θ_m) by an ordered pair
of vectors representing the feature's position and orientation, respectively. Similarly, the
measured geometry of a data feature is given by d = (p_d, θ_d). Define U_i^p and U_i^θ to
be the uncertainty regions for the position and orientation, respectively, of data feature
d_i. We assume that the uncertainty in position and orientation are independent. The true
position of d_i falls in the set U_i^p, and its true orientation falls in the set U_i^θ.
For the moment assume features consist simply of points without an associated
orientation. Correctly hypothesizing the pose of the object is equivalent to finding a
transformation on the model into the scene aligning the model with its instance in the

data, by aligning individual model and data features. Aligning a model feature and a
data feature consists of transforming the model feature such that the transformed model
feature falls within the geometric uncertainty region for the data feature. We can think
of the data as a set of points and uncertainty regions $\{(\mathbf{p}_{d_j}, U_j^p)\}$ in the plane, where
each measured data position is surrounded by some positional uncertainty region $U_j^p$. A
model feature with position $\mathbf{p}_{m_i}$ and a data feature with position $\mathbf{p}_{d_j}$ are aligned via a
transformation $T$ if $T[\mathbf{p}_{m_i}] \in U_j^p$. Intuitively, the whole problem is then to find single
transformations simultaneously aligning, in this sense, a large number of pairs of model
and image features.
One of the main contributions of this work, and the key insight of this approach, is
the idea that under the bounded uncertainty model there are only a polynomial num-
ber of qualitatively different transformations or poses aligning subsets of a given model
feature set with subsets of a given data feature set. Finding these equivalence classes of
transformations is equivalent to finding all qualitatively different sets of feature corre-
spondences. Thus we need not search through an exponential number of sets of possible
feature correspondences as previous systems have, nor consider an infinite set of possible
transformations.
In the 2D case the transformations will consist of a planar rotation, scaling, and trans-
lation. We have said that a model feature $m_i$ and data feature $d_j$ are aligned by a transfor-
mation $T$ iff $T[m_i] \in U_j$. Two transformations are qualitatively similar if and only if they
align in this sense exactly the same set of feature matches. All transformations which
align the same set of feature matches are equivalent; thus there are equivalence classes
of transformations. More formally, let $\Omega$ be the transformation parameter space, and let
$T \in \Omega$ be a transform. Define $\varphi(T) = \{(m_i, d_j) \mid T[m_i] \in U_j\}$ to be the set of matches
aligned by the transformation $T$. The function $\varphi(T)$ partitions $\Omega$, forming equivalence
classes of transformations $E_k$, where $\Omega = \bigcup_k E_k$ and $T \equiv T' \Leftrightarrow \varphi(T) = \varphi(T')$. The
entire recognition approach developed in this paper is based upon computing these equiv-
alence classes of transformations, and the set of feature matches associated with each of
them.
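To make $\varphi(T)$ and the pose equivalence classes concrete, here is a minimal Python sketch (not from the paper): for point features with square uncertainty regions of half-width eps, it computes the match set aligned by a transform and groups a sampled set of transforms by that match set; the function names and the sampling strategy are illustrative assumptions.

```python
# Illustrative sketch: phi(T) collects the matches aligned by a transform T;
# transforms with identical phi(T) lie in the same pose equivalence class.
import numpy as np

def phi(T, model_pts, data_pts, eps):
    """Return the frozen set of (i, j) matches aligned by T.

    T is a 2D scaled rotation plus translation given as (s1, s2, t1, t2),
    so T[x] = [[s1, -s2], [s2, s1]] @ x + [t1, t2].
    """
    s1, s2, t1, t2 = T
    S = np.array([[s1, -s2], [s2, s1]])
    t = np.array([t1, t2])
    aligned = set()
    for i, m in enumerate(model_pts):
        p = S @ m + t
        for j, d in enumerate(data_pts):
            # box uncertainty region U_j of half-width eps around data point d
            if np.all(np.abs(p - d) <= eps):
                aligned.add((i, j))
    return frozenset(aligned)

def sample_equivalence_classes(model_pts, data_pts, eps, transforms):
    """Group sampled transforms by phi(T); each group approximates one class."""
    classes = {}
    for T in transforms:
        classes.setdefault(phi(T, model_pts, data_pts, eps), []).append(T)
    return classes
```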

2.1 Relating Pose Space and Correspondence Space

If a model feature $m_i$ and a data feature $d_j$ are to correspond to one another, the set
of transformations on the model feature which are feasible can be defined as the set of
transformations $\mathcal{F}_{m_i,d_j} = \{T \in \Omega \mid T[m_i] \in U_j\}$. Let $\mathcal{M} = \{(m_i, d_j)\}$ be some match set.³
A match set is called geometrically consistent iff $\bigcap_{(m_i,d_j)\in\mathcal{M}} \mathcal{F}_{m_i,d_j} \neq \emptyset$, that is, iff there
exists some transformation which is feasible for all $(m_i, d_j) \in \mathcal{M}$.
The match set given by $\varphi(T)$ for some transformation $T$ is called a maximal geo-
metrically consistent match set. A match set $\mathcal{M}$ is a maximal geometrically consistent
match set (or, a maximal match set) if it is the largest geometrically consistent match
set at some transformation $T$. Thus by definition the match set given by $\varphi(T)$ is a
maximal match set. The function $\varphi(T)$ is a mapping from transformation space to cor-
respondence space, $\varphi(T): \Omega \to 2^{\{m_i\}\times\{d_j\}}$, and there is a one-to-one correspondence
between the pose equivalence classes and the maximal match sets given by $\varphi(T) \leftrightarrow E_k$

3 This is also sometimes called a correspondence, or a matching. To clarify terms, we will define
a match as a pair of a model feature and a data feature, and a match set as a set of matches.
The term matching implies a match set in which the model and data features are in one-to-one
correspondence.

iff $T \in E_k$. The function $\varphi(T)$ partitions the infinite set of possible object poses into a
polynomial-sized set of pose equivalence classes, and identifies a polynomial-sized subset
of the exponential-sized set of possible match sets.
The important point is that the pose equivalence classes and their associated maxi-
mal match sets are the only objects of interest: all poses within a pose equivalence class
are qualitatively the same; and the maximal geometrically consistent match sets are es-
sentially the only sets of feature correspondences that need be considered because they
correspond to the pose equivalence classes. Note that this implies we do not need to
consider all consistent match sets, or search for one-to-one feature matchings, because
they are simply subsets of some maximal match set, and provide no new pose equiva-
lence classes. However, given a match set we can easily construct a maximal, one-to-one
matching between data and model features[14].
One distinction between this approach, which works in transformation space, and
robust and complete correspondence space tree searches[13, 2] is that for each maximal
geometrically consistent match set (or equivalently for each equivalence class of transfor-
mations) there is an exponential-sized set (in terms of the cardinality of the match set)
of different subsets of feature correspondences which all specify the same set of feasible
transformations. Thus the straightforward pruned tree search does too much (exponen-
tially more) work. This is part of the reason why these correspondence space search
techniques have exponential expected-case performance, yet our approach is polynomial.

2.2 Feature Matching Requires Only Polynomial Time

Formalizing the localization problem in terms of bounded uncertainty regions and trans-
formation equivalence classes allows us to show that it can be solved in time polynomial
in the size of the feature sets. Cass[6] originally demonstrated this using quadratic uncer-
tainty constraints. This idea can be easily illustrated using the linear vector space of 2D
scaled rotations and translations, and the linear constraint formulation used by Baird[2]
and recently by Breuel⁴[4]. In the 2D case the transformations will consist of a planar
rotation, scaling, and translation. Any vector $s = [s_1, s_2]^T = [\sigma\cos\theta, \sigma\sin\theta]^T$ is equiva-
lent to a linear operator $S$ performing a rigid rotation by an orthogonal matrix $R \in SO(2)$
and a scaling by a positive factor $\sigma$, where
$S = \sigma R = \sigma \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} s_1 & -s_2 \\ s_2 & s_1 \end{bmatrix}$. We
denote the group of all transformations by $\Omega$, with translations given by $t = [t_1, t_2]^T$ and
scaled rotations given by $s = [s_1, s_2]^T$, so a point $x$ is transformed by $T[x] = Sx + t$, and
a transformation $T$ can be represented by a vector $T \mapsto [s_1, s_2, t_1, t_2]^T \in \mathbb{R}^4$.
By assuming $k$-sided polygonal uncertainty regions $U_j^p$ and following the formulation
of Baird, the uncertainty regions $U_j^p$ can be described by the set of points $x$ satisfying
inequalities (1): $(x - \mathbf{p}_{d_j})^T \hat{n}_l \le \epsilon_l$ for $l = 1, \dots, k$, and thus by substitution the set of
feasible transformations for a feature match $(m_i, d_j)$ are constrained by inequalities (2):
$(S\mathbf{p}_{m_i} + t - \mathbf{p}_{d_j})^T \hat{n}_l \le \epsilon_l$ for $l = 1, \dots, k$, which can be rewritten as constraints on the
transformation vector $[s_1, s_2, t_1, t_2]^T$ as $s_1\alpha_1 + s_2\alpha_2 + t_1 n_l^x + t_2 n_l^y \le \mathbf{p}_{d_j}^T\hat{n}_l + \epsilon_l$ for
$l = 1, \dots, k$, with $\alpha_1 = (x_{m_i} n_l^x + y_{m_i} n_l^y)$ and $\alpha_2 = (x_{m_i} n_l^y - y_{m_i} n_l^x)$, and where $\hat{n}_l$ is
the unit normal vector and $\epsilon_l$ the scalar distance describing each linear constraint for
$l = 1, \dots, k$. The first set of linear inequalities, (1), delineates the polygonal uncertainty
regions $U_j^p$ by the intersection of $k$ halfplanes, and the second set of inequalities, (2),

⁴ Thomas Breuel pointed out the value of Baird's linear formulation of the 2D transformation
and his use of linear uncertainty constraints.

provides $k$ hyperplane constraints in the linear transformation space $\Omega$. The intersection
of the $k$ halfspaces forms the convex polytope $\mathcal{F}_{m_i,d_j}$ of feasible transformations for match
$(m_i, d_j)$. Say there are $m$ model features and $n$ data features. The arrangement⁵ of these
$mn$ convex polytopes forms the partition of transformation space into equivalence classes.
To see this another way, consider the set of $k$ hyperplanes for each of the $mn$ feature
matches $(m_i, d_j)$. The arrangement constructed by these $kmn$ hyperplanes partitions the
set of transformations into cells. It is well known from computational geometry that the
complexity of the arrangement of $kmn$ hyperplanes in $\mathbb{R}^4$ is $O(k^4m^4n^4)$ in terms of the
number of elements of the arrangement. These elements are called $k$-faces, where a 0-face
is a vertex, a 1-face is an edge, a 3-face is a facet, and a 4-face is a cell, and these elements
can be constructed and enumerated in $O(k^4m^4n^4)$ time[9].
The transformation equivalence classes are the cells and faces formed by the arrange-
ment of the $mn$ convex polytopes $\mathcal{F}_{m_i,d_j}$ in the transformation space[6]. It is easy to see
that the arrangement formed by the feasible regions $\mathcal{F}_{m_i,d_j}$ is a subset of the arrange-
ment formed by the $kmn$ hyperplanes. Thus the number of qualitatively different poses
is bounded by $O(k^4m^4n^4)$.⁶
To construct a provably correct and complete polynomial-time algorithm for pose
hypothesis generation via feature matching, we enumerate the set of pose equivalence classes by
deriving them from this arrangement induced by the two feature sets and the uncertainty
constraints. Each equivalence class is associated with a geometrically consistent set of
feature matches, and so we simply select those consistent match sets of significant size.⁷
This is a simple illustration that the problem of determining geometrically consistent
feature correspondences in the 2D case can be solved in polynomial time, and the solution
is correct and complete in spite of uncertainty, occlusion and clutter.
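As an illustration of how these linear constraints can be used, the following hedged Python sketch builds the $k$ halfplane constraints in $(s_1, s_2, t_1, t_2)$ for each match and tests geometric consistency of a match set by an LP feasibility check; the use of scipy and the helper names are assumptions for illustration, not the algorithm of [6] or [8].

```python
# Sketch: a match set is geometrically consistent iff the intersection of the
# per-match constraint halfspaces in (s1, s2, t1, t2) is non-empty.
import numpy as np
from scipy.optimize import linprog

def match_constraints(p_m, p_d, normals, eps):
    """Rows of A v <= b on v = (s1, s2, t1, t2) for one match (m_i, d_j).

    Each row l encodes (S p_m + t - p_d) . n_l <= eps_l for a k-sided
    polygonal uncertainty region with unit normals n_l and offsets eps_l.
    """
    xm, ym = p_m
    A, b = [], []
    for (nx, ny), e in zip(normals, eps):
        alpha1 = xm * nx + ym * ny        # coefficient of s1
        alpha2 = xm * ny - ym * nx        # coefficient of s2
        A.append([alpha1, alpha2, nx, ny])
        b.append(np.dot([nx, ny], p_d) + e)
    return np.array(A), np.array(b)

def is_consistent(match_set, normals, eps):
    """True iff some single transform aligns every (p_m, p_d) pair given."""
    blocks = [match_constraints(pm, pd, normals, eps) for pm, pd in match_set]
    A = np.vstack([blk[0] for blk in blocks])
    b = np.hstack([blk[1] for blk in blocks])
    res = linprog(c=np.zeros(4), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 4, method="highs")
    return res.status == 0   # status 0: a feasible point was found
```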
If we associate orientations with features, the true orientation for data feature $d_j$ falls
within an uncertainty region $U_j^\theta = [(\theta_{d_j} - \delta), (\theta_{d_j} + \delta)]$, where $\delta$ is the bound on orien-
tation uncertainty. This yields linear orientation constraints on feasible transformations:
$(S\boldsymbol{\theta}_{m_i})^T\hat{n}_j^+ \le 0$ and $(S\boldsymbol{\theta}_{m_i})^T\hat{n}_j^- \le 0$, where $\hat{n}_j^+$ and $\hat{n}_j^-$ are determined by $\theta_{d_j}$ and $\delta$.
Baird[2] utilized linear uncertainty constraints which together with a linear transfor-
mation space resulted in linear transformation constraints. Although, as we have shown in
this paper, the constraints in transformation space have polynomial complexity, Baird's
algorithm did not exploit this. Breuel[4] has developed a very elegant correspondence-
space search technique based on the problem formulation and tree search method used
by Baird[2], and the notion of transformation equivalence classes and maximal match
sets described in [6] and here. See also [16, 1, 11].

2.3 The Case of 3D Models and 2D Data

Of particular interest is the localization of 3D objects from 2D image data. We consider
the case where the transformation consists of rigid 3D motion, scaling and ortho-
graphic projection, where a model point $\mathbf{p}_{m_i}$ is transformed by $T[\mathbf{p}_{m_i}] = S\mathbf{p}_{m_i} + t$ with
⁵ The computational geometric term is arrangement for the topological configuration of geo-
metric objects like linear surfaces and polytopes.
⁶ As was shown in Cass[6], these same ideas apply to cases using non-linear uncertainty con-
straints, such as circles. The basic idea is the same; however, in these cases we must analyze
an arrangement of quadratic surfaces, which is computationally more difficult.
⁷ To measure the quality of a match set we approximate the size of the largest one-to-one
matching contained in a match set by the minimum of the number of distinct image features
and distinct model features[14].

$S = \begin{bmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \end{bmatrix} = \sigma P R$ and $P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$, $R \in SO(3)$, and $\sigma > 0 \in \mathbb{R}$. In the case of pla-
nar 3D objects this corresponds to the transformation $S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, $s_{ij} \in \mathbb{R}$, describing
the projection of all rotations and scalings of a planar 3D object. The linear constraint
formulation applies to any affine transformation[2, 4]. To exploit linear uncertainty con-
straints on the feasible transformations as before, we must have the property that the
transformations form a linear vector space. This holds for $S = \begin{bmatrix} s_1 & -s_2 \\ s_2 & s_1 \end{bmatrix}$, $s_1, s_2 \in \mathbb{R}$, in
the case of 2D models and 2D transformations, and for $S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, $s_{ij} \in \mathbb{R}$,
for planar models and 3D transformations; but it is not the case for $S = \sigma P R$ because
the components of $S$ must satisfy $SS^T = \sigma^2 I$, where $I$ is the $2 \times 2$ identity matrix.
For the case of 3D planar objects under the transformation $T[\mathbf{p}_{m_i}] = S\mathbf{p}_{m_i} + t$ with
$S = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$, the transformation space is a 6-dimensional linear space and there are
$O(k^6m^6n^6)$ elements in the arrangement, and so analogous to the 2D case there are
$O(k^6m^6n^6)$ transformation equivalence classes. Note that the special case of planar ro-
tation, translation, and scaling discussed in the previous section is a restriction of this
transformation to those $2 \times 2$ matrices $S$ satisfying $SS^T = \sigma^2 I$.
To handle the case of 3D non-planar objects we follow the following strategy. We
compute equivalence classes in an extended transformation space $\hat{\Omega}$ which is a vector
space containing the space $\Omega$. After computing transformation equivalence classes we
then restrict them back to the non-linear transformation space we are interested in. So
consider the vector space of $2 \times 3$ matrices $\hat{S} = \begin{bmatrix} \hat{s}_{11} & \hat{s}_{12} & \hat{s}_{13} \\ \hat{s}_{21} & \hat{s}_{22} & \hat{s}_{23} \end{bmatrix}$, where $\hat{s}_{ij} \in \mathbb{R}$,
and define $\hat{\Omega}$ to be the set of transformations $(\hat{S}, t)$ where $T[\mathbf{p}_{m_i}] = \hat{S}\mathbf{p}_{m_i} + t$
as before. The set $\hat{\Omega}$ is isomorphic to $\mathbb{R}^8$. Again expressing the uncertainty regions in
the form of linear constraints we have $(\hat{S}\mathbf{p}_{m_i} + t - \mathbf{p}_{d_j})^T\hat{n}_l \le \epsilon_l$ for $l = 1, \dots, k$. These
describe $k$ constraint hyperplanes in the linear, 8-dimensional transformation space $\hat{\Omega}$.
The $kmn$ hyperplanes due to all feature matches again form an arrangement in $\hat{\Omega}$. Anal-
ogous to the 2D case there are $O(k^8m^8n^8)$ elements in this arrangement, and $O(k^8m^8n^8)$
transformation equivalence classes $E_k$ for this extended transformation.
To consider the case where the general $2 \times 3$ linear transformation $\hat{S}$ is restricted
to the case $S$ of true 3D motion, scaling, and projection to the image plane, we make
the following observations. Each element of this arrangement in $\mathbb{R}^8$ is associated with
a maximal match set, so there are $O(k^8m^8n^8)$ maximal match sets. To restrict to rigid
3D motion and orthographic projection we intersect the hyperplanar arrangement with
the quadratic surface described by the constraints $\hat{S}\hat{S}^T = \sigma^2 I$. We still have $O(k^8m^8n^8)$
maximal match sets with the restricted transformation, although the equivalence classes
are more complicated.
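As a quick check of what this quadratic restriction looks like, the condition $\hat{S}\hat{S}^T = \sigma^2 I$ (with the scale $\sigma$ left free) can be written out componentwise; this is a direct expansion, not an additional assumption:

```latex
% Rows of \hat{S} must have equal norm and be mutually orthogonal:
\begin{align}
\hat{s}_{11}^2 + \hat{s}_{12}^2 + \hat{s}_{13}^2 &= \hat{s}_{21}^2 + \hat{s}_{22}^2 + \hat{s}_{23}^2 \;(=\sigma^2), \\
\hat{s}_{11}\hat{s}_{21} + \hat{s}_{12}\hat{s}_{22} + \hat{s}_{13}\hat{s}_{23} &= 0 .
\end{align}
```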

3 Efficiently Exploring Pose Equivalence Classes

We see from the previous analysis that there are only a polynomial number of qual-
itatively different model poses aligning the model features with the data features. A
simple algorithm for constructing pose hypotheses consists of constructing the transfor-
mation equivalence classes, or representative points of them (such as vertices[6]), from

the arrangement of constraint hyperplanes. Unfortunately this straightforward approach


is impractical because of the complexity of constructing the entire arrangement.
The focus of the approach then becomes developing algorithms that explore the ar-
rangement in an efficient way in order to find those regions of transformation space as-
sociated with large maximal match sets without explicitly constructing or exploring the
entire arrangement if possible. This is an interesting problem in computational geometry.
We know the complexity of exploring the arrangement is high in places where there
are many constraints satisfied, e.g. large sets of constraints due to correctly matched
features. We conjecture that for practical problem instances there are not too many
other places where the complexity is high, e.g. large consistent sets of constraints due
to incorrect feature matches. Empirically for practical problem instances we found this
is true[8]. This means that we expect to spend less computational effort searching for
places of interest in the transformation space than is expressed by the upper bound.
We implemented and tested an algorithm for planar objects under rigid 2D rotation
and translation with known scale. The model and data consisted of point features with an
associated orientation. The position uncertainty regions were isothetic squares centered
on each data feature with side $2\epsilon$. Experiments on real images were performed to demon-
strate the idea (figure 1) and on synthetic models and data to test the computational
complexity for practical problem instances.


Fig. 1. A real image, the edges, and the correct hypothesis. The dots on the contours are the
feature points used in feature matching. Images of this level of clutter and occlusion are typical.

Due to space limitations we can only outline the algorithm. The interested reader is
referred to Cass[8] for a more complete description. The approach taken is to decompose
the transformation into the rotational component represented by the $s_1$-$s_2$ plane and the
translational component represented by the $t_1$-$t_2$ plane. Equivalence classes of rotation
are constructed in the $s_1$-$s_2$ plane, in each of which the same set of match sets are feasible,
and these equivalence classes are explored looking for large maximal match sets. For any
given rotational equivalence class the match sets can be derived by analyzing translational
equivalence classes in the $t_1$-$t_2$ plane.
We explore the transformation space locally by sequentially choosing a base feature
match and analyzing the set of other matches consistent with the base match by partially
exploring their mutual constraint arrangement. We used angle constraints to eliminate
impossible match combinations, but only used position constraints to construct equiva-

lence classes. Empirically, for $m$ model features and $n$ data features we found that ana-
lyzing all maximal match sets took $\sim m^2 n^2$ time. We can get an expected-time speedup
by randomly choosing the base matches[10]. This leads to a very good approximate al-
gorithm in which we expect to do $\sim m^2 n$ work until a correct base match is found along
with an associated large consistent match set. Experiments⁸ show empirically that for
practical problem instances the computational complexity in practice is quite reasonable,
and much lower than the theoretical upper bound.
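A hedged sketch of this randomized base-match strategy (in the spirit of [10]) is given below; `consistent_with_base` is a hypothetical stand-in for the paper's equivalence-class analysis around a single base match, and the control flow is illustrative rather than the implementation reported above.

```python
# Sketch: pick random base matches until one anchors a large consistent set.
import random

def randomized_exploration(matches, consistent_with_base, accept_size, max_trials=1000):
    """matches:              candidate (model_feature, data_feature) pairs
    consistent_with_base: callable(base, matches) -> largest match set
                          geometrically consistent with the base match
    accept_size:          stop once a match set of at least this size is found
    """
    best = set()
    for _ in range(max_trials):
        base = random.choice(matches)
        match_set = consistent_with_base(base, matches)
        if len(match_set) > len(best):
            best = match_set
        if len(best) >= accept_size:
            break  # a correct base match was (probably) found
    return best
```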

References
1. Alt, H. & K. Mehlhorn & H. Wagener & E. Welzl, 1988, "Congruence, Similarity, and Sym-
metries of Geometric Objects", In Discrete and Computational Geometry, Springer-Verlag,
New York, 3:237-256.
2. Baird, H.S., 1985, Model-Based Image Matching Using Location, MIT Press, Cambridge,
MA.
3. Bolles, R.C. & R.A. Cain, 1982, "Recognizing and Locating Partially Visible Objects: The
Local-Feature-Focus Method", International Journal of Robotics Research, 1(3):57-82.
4. Breuel, T. M., 1990, "An Efficient Correspondence Based Algorithm for 2D and 3D Model
Based Recognition", MIT AI Lab Memo 1259.
5. Cass, Todd A., 1988, "A Robust Implementation of 2D Model-Based Recognition", Pro-
ceedings IEEE Conf. on Computer Vision and Pattern Recognition, Ann Arbor, Michigan.
6. Cass, Todd A., 1990, "Feature Matching for Object Localization in the Presence of Uncer-
tainty", MIT AI Lab Memo 1133.
7. Cass, Todd A., 1990, "Feature Matching for Object Localization in the Presence of Uncer-
tainty", Proceedings of the International Conference on Computer Vision, Osaka, Japan.
8. Cass, Todd A., 1991, "Polynomial-Time Object Recognition in the Presence of Clutter,
Occlusion, and Uncertainty", MIT AI Lab Memo No. 1302.
9. Edelsbrunner, H., 1987, Algorithms in Combinatorial Geometry, Springer-Verlag.
10. Fischler,M.A. & R.C. Bolles, 1981, "Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography", Communica-
tions of the ACM 24(6):381-395.
11. Ellis, R.E., 1989, "Uncertainty Estimates for Polyhedral Object Recognition", IEEE
Int. Conf. Rob. Aut., pp. 348-353.
12. Grimson, W.E.L., 1990, "The combinatorics of object recognition in cluttered environments
using constrained search", Artificial Intelligence,44:121-166.
13. Grimson, W.E.L. & T. Lozano-Perez, 1987, "Localizing Overlapping Parts by Searching
the Interpretation Tree", IEEE Trans. on Pat. Anal. & Mach. Intel., 9(4):469-482.
14. Huttenlocher, D.P. & T. Cass, 1992, "Measuring the Quality of Hypotheses in Model-Based
Recognition", Proceedings of the European Conference on Computer Vision, Genova, Italy.
15. Huttenlocher, D.P. & S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an
Image," Intern. Journ. Comp. Vision 5(2):195-212.
16. Jacobs, D., 1991, "Optimal Matching of Planar Models in 3D Scenes," IEEE Conf. Comp.
Vis. and Patt. Recog. pp. 269-274.
17. Lowe, D.G., 1986, Perceptual Organization and Visual Recognition, Kluwer Academic Pub-
lishers, Boston, MA.
18. Stockman, G. & S. Kopstein & S. Bennet, 1982, "Matching Images to Models for Registra-
tion and Object Detection via Clustering", IEEE Trans. on Pat. Anal. & Mach. Intel. 4(3).
19. Thompson, D. & J.L. Mundy, 1987, "Three-Dimensional Model Matching From an Uncon-
strained Viewpoint", Proc. IEEE Conf. Rob. Aut. pp. 280.
This article was processed using the LaTeX macro package with ECCV92 style

⁸ Experiments were run with $m \in [10, 100]$ and $n \in [0, 500]$. The uncertainty assumed was $\epsilon = 8$
pixels and $\delta =$ ~.
Hierarchical Shape Recognition Based on 3-D
Multiresolution Analysis

Satoru MORITA, Toshio KAWASHIMA and Yoshinao AOKI

Faculty of Engineering, Hokkaido University
West 8, North 13, Kita-Ku, Sapporo 060, Japan

Abstract. This paper introduces a method to create a hierarchical de-
scription of smooth curved surfaces based on scale-space analysis. We ex-
tend the scale-space method used in 1-D signal analysis to 3-D objects. The
3-D scale-space images are segmented by zero-crossings of surface curva-
tures at each scale and then linked between consecutive scales based on
topological changes (KH-description). The KH-description is then parsed
and translated into the PS-tree, which contains the number and distribution
of subregions required for shape matching. The KH-description contains
coarse-to-fine shape information of the object and the PS-tree is suitable for
shape matching. A hierarchical matching algorithm using the descriptions
is proposed and examples show that the symbolic description is suitable for
efficient coarse-to-fine 3-D shape matching.

1 Introduction
Recent progress in range finding technology has made it possible to perform direct mea-
surement of 3-D coordinates from object surfaces. Such a measurement provides 3-D
surface data as a set of enormous discrete points. Since the raw data are unsuited to
direct interpretation, they must be translated into an appropriate representation. In the
recognition of curved objects from depth data, the surface data are usually divided into
patches grouped by their normal vector direction. The segmentation, however, often re-
sults in failure if the density of the measurement is insufficient to represent a complex
object shape. Other segmentation algorithms based on differential geometry [RA1] [PR1]
are noise-sensitive, because calculation of curvature is intrinsically a local computation.
One approach to these problems is to analyze an object as a hierarchy of shape primi-
tives from coarse level to fine level. D. Marr stated the importance of the multiresolution
analysis [Ma1]. A. P. Witkin introduced the scale-space technique which generates multires-
olution signals by convolving the original signal with Gaussian kernels [Wi1]. Lifshitz
applied the multiresolution analysis to image processing [LP1].
Another approach using the structure of geometrical regions and characteristic lines
of a curved surface [TL1] seems attractive. However, the relationships between this at-
tractive approach and multiresolution representation have not been well understood.
Our approach to the issue is essentially based on the multiresolution analysis. We
have introduced a hierarchical symbolic description of contours using scale-space filtering
[MK1]. We extend the scale-space approach to 3-D discrete surface data. After a hierarchy
of surface regions is computed by scale-space filtering, the representation is matched with
3-D object models.
In section 2, the scale-space signal analysis method is extended to 3-D surface analysis
using difference equation computation. The extended scale-space, unfortunately, does not
show the monotonicity required for generating a hierarchy, since zero-crossing contours of 3-D


surface often vanish or merge when the scale increases. The resulting scale-space cannot
be described by a simple tree as reported in the 2-D case [YP1]. In this paper, we regard this
3-D scale-space filtering as a continuous deformation process which leads surface curvature
to a constant. From this viewpoint, we can generate a hierarchical description by marking
a non-monotonic deformation as an exceptional case.
The algorithm of the hierarchical recognition is twofold: KH-description generation
and hierarchical pattern matching. In section 3, topological analysis of region deformation
is described and an algorithm to create the KH-description is illustrated. We use the
Gaussian curvature and the mean curvature as viewer-invariant features to segment
the surface into primitives. First, we extract the local topological region deformation of
zero-crossing lines of the features, and create the KH-description from the deformation.
Since the information in the KH-description is limited to local changes, further global
interpretation is required. In section 4, we add auxiliary information by analyzing the
global change of regions to translate the KH-description into a tree. The tree generated
contains a symbolic description of the shape from the coarser level to the finer level and
the pattern matching is performed efficiently with the tree.
In section 5, examples are shown. We apply the algorithm to several sample data sets
from a range finder.

2 Surface Geometry and Filtering

Scale-space filtering is a useful method to analyze a signal qualitatively while managing
the ambiguity of scale in an organized and natural way. In this section, we extend the
scale-space filtering for 2-D contours to 3-D surface analysis.

2.1 Definitions of Curvature
The curvatures used in this article are defined as follows.
Suppose a parametric form of a surface $X(u, v) = (x(u, v), y(u, v), z(u, v))$. A tangen-
tial line at $X(u, v)$ is denoted by $t(u, v) = du\,X_u(u, v) + dv\,X_v(u, v)$.
The curvature at $X$ along $(du, dv)$ is defined as $\lambda(du, dv) = \frac{\Phi_2(du, dv)}{\Phi_1(du, dv)}$,

where $\Phi_1(du, dv) = (du\ dv)\begin{pmatrix} X_u \cdot X_u & X_u \cdot X_v \\ X_v \cdot X_u & X_v \cdot X_v \end{pmatrix}\begin{pmatrix} du \\ dv \end{pmatrix}$ and
$\Phi_2(du, dv) = (du\ dv)\begin{pmatrix} X_{uu} \cdot N & X_{uv} \cdot N \\ X_{vu} \cdot N & X_{vv} \cdot N \end{pmatrix}\begin{pmatrix} du \\ dv \end{pmatrix}$, with $N$ the unit surface normal.
With the directional vectors which maximize and minimize the curvature at the
point $p$ denoted $(\xi_1, \eta_1)$ and $(\xi_2, \eta_2)$, the maximum curvature $\kappa_1$, the minimum curvature $\kappa_2$,
the mean curvature $H$, and the Gaussian curvature $K$ are defined as:
$\kappa_1 = \lambda(\xi_1, \eta_1)$, $\kappa_2 = \lambda(\xi_2, \eta_2)$, $H = \frac{\kappa_1 + \kappa_2}{2}$, and $K = \kappa_1\kappa_2$, respectively.
Characteristic contours which satisfy $H = \frac{\kappa_1 + \kappa_2}{2} = 0$ and $K = \kappa_1\kappa_2 = 0$ are called the
H0 contour and the K0 contour, respectively.

2.2 Filtering a 3-D Curved Surface

Since a closed 3-D surface cannot be filtered with ordinary convolution, we extend the
idea reported in [Li1] to 3-D. In our method, the Gaussian convolution of the surface $\phi$
is formulated by a diffusion equation:

$\frac{\partial^2\phi}{\partial u^2} + \frac{\partial^2\phi}{\partial v^2} = \frac{\partial\phi}{\partial t}$ ...(1)

where $\phi(u, v) = (x(u, v), y(u, v), z(u, v))$ is a parametric representation of a surface. This
equation is approximated by a difference equation:

$\phi(u, v, t + \Delta t) = \phi(u, v, t) + \Delta t\,\frac{\phi(u - \Delta u, v, t) - 2\phi(u, v, t) + \phi(u + \Delta u, v, t)}{\Delta u^2}$
$\qquad\qquad\qquad + \Delta t\,\frac{\phi(u, v - \Delta v, t) - 2\phi(u, v, t) + \phi(u, v + \Delta v, t)}{\Delta v^2}$ ...(2)

The extended form of the equation for an arbitrary number of neighbouring points is:

$\phi_{i0}(t + \Delta t) = \phi_{i0}(t) + \Delta t \sum_{k}\frac{\phi_{ik}(t) - \phi_{i0}(t)}{l_{ik}^2}$ ...(3)

where $\phi_{i0}$, $\{\phi_{ik}\}$, and $l_{ik}$ are a sample point, its neighbour samples, and the distance
between $\phi_{i0}$ and $\phi_{ik}$, respectively.
Iterating (3), the curvature at each sample point converges to a constant.
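A minimal Python sketch of the update (3) follows, assuming each sample stores its 3-D position together with a list of neighbour indices and distances; the data layout is an assumption for illustration, not the authors' implementation.

```python
# Discrete diffusion on a sampled surface: phi_i <- phi_i + dt * sum_k (phi_k - phi_i) / l_ik^2
import numpy as np

def diffuse(points, neighbours, dt, n_iter):
    """points:     (N, 3) array of surface sample positions
    neighbours: list where neighbours[i] = [(k, l_ik), ...] for sample i
    """
    pts = np.array(points, dtype=float)
    for _ in range(n_iter):
        new_pts = pts.copy()                      # synchronous update, as in (2)-(3)
        for i, nbrs in enumerate(neighbours):
            for k, l_ik in nbrs:
                new_pts[i] += dt * (pts[k] - pts[i]) / (l_ik ** 2)
        pts = new_pts
    return pts
```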

2.3 Choosing Geometric Features

The fact that the curvature of the surface converges to a constant value under the filtering
indicates that the concave or convex regions ultimately vanish or merge into a single
region as the scale parameter $t$ increases. This behaviour gives us a hint as to which contours
must be chosen to segment the curved surface hierarchically.
A convex and a concave region are enclosed by K0 contours, and a valley and a ridge by H0
contours. Since each contour alone is insufficient to characterize a complex surface, we choose
both K0 and H0 to segment the surface. With this segmentation, continuous deformation
of the contours by the Gaussian filtering forms a scale-space-like image (KH-image) of a
3-D shape as shown in figure 1. Actually, the image does not form a tree because of the
non-monotonicity of contour deformation.

3 Topological Changes of Contours in Scale-Space Image

In this section, we make a general discussion on the topological change of a contour, then
propose a method to treat the topology as a hierarchy, and apply it to filtered contours.

3.1 Contour Topology and Hierarchy

Consider the topology of curved contours which are deforming continuously; their state
can be characterized by the number of contours and connections among them. The basic
topological changes are generation, connection, inner contact, outer contact, intersection,
and disappearance as indicated in figure 2. The inner contact and outer contact are further
divided into contact with other contours and self-contact.
The requirement for creating a hierarchical structure is that the number of regions
must increase monotonically as the scale decreases¹. Since the requirements are not
satisfied, we must impose the following two assumptions on the interpretation of the
disappearance of an intersection or contact (figure 2[a2]) which breaks the hierarchy.
¹ The parameter scale specifies the direction of deformation in this section. Scale decreases
toward leaf nodes.
Assumption 1 If a contour disappears at the smaller scale, the traces of the contour

must be preserved in the description.


Assumption 2 If two contours separate after contact, the history of connection must
be preserved in the description.
With these assumptions in the analysis, any contour deformation can be traced hier-
archically.

3.2 K0 and H0 as Contours

By recording traces and histories of disappearances, the hierarchy of a shape can be
extracted from a Gauss-filtered scale-space image. In this paper, as described in 2.3,
both the contours, H0 and K0, are used to segment regions.
Since the exceptional changes such as the disappearance or separation of contours seldom
occur in an actual scale-space image, the exceptional interpretation does not slow down
the processing.

3.3 Topological Analysis of K0 and H0

In this section, we summarize the topological changes of K0 and H0 with scale decre-
ment. In this case, the numbers of contours and connection points must increase in our
interpretation. With the fact that K0 does not contact H0 except at a singular point,
the actual types of contour deformation topology are limited.
Qualitative changes of K0 and H0 contours are generation, inner contact, outer contact,
inner self-contact, and outer self-contact as shown in 3.1. Figures 3 and 4 indicate the
topological changes of H0's and K0's, respectively. Actually, some of these changes do
not occur because of the smoothness of a surface. For example, two K0's do not contact
except when there exist two H0's between them (figure 3(c)).

3.4 Generating a KH-description

K0's and H0's divide the surface of an object into four types of regions according to the
signs of the Gaussian curvature and the mean curvature. Each region is called a KH-
element. Figure 6 shows their typical shapes. A KH-element is specified by a triplet of the
signs of the Gaussian curvature, the maximum curvature, and the minimum curvature:
$(+, +, -)$ is such an example.
The topological changes of K0's and H0's with scale increment, such as the generation
or contact of regions, correspond to the inclusion or connection of regions. Our coarse-to-
fine representation of an object is a sequence of images described by the local topology
of KH-elements which are linked between scales. The description derived is called the
KH-description. The basic topological changes of the KH-description shown in figure 3(a)
and figure 4(a) are symbolized as shown in figures 3(b) and 4(b).
A basic flow to generate a KH-description is the following (a minimal sketch follows the list):
1. Filter the discrete data at each scale.
2. Segment the surface by H0's and K0's and label peaks and pits.
3. Analyze topological changes between the current scale and the previous one.
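The sketch below mirrors the three steps above; `filter_surface`, `segment_by_K0_H0`, and `link_regions` are hypothetical helpers standing in for the diffusion filtering, zero-crossing segmentation and scale-to-scale topological linking described in the text.

```python
# Coarse sketch of the KH-description generation loop (illustrative only).
def build_kh_description(samples, neighbours, scales, dt,
                         filter_surface, segment_by_K0_H0, link_regions):
    description = []          # one entry per scale: (regions, links to previous scale)
    prev_regions = None
    surface = samples
    for t in scales:
        surface = filter_surface(surface, neighbours, dt, t)   # step 1: filter
        regions = segment_by_K0_H0(surface)                    # step 2: label by signs of K, H
        links = link_regions(prev_regions, regions) if prev_regions else []
        description.append((regions, links))                   # step 3: topological changes
        prev_regions = regions
    return description
```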

Fig. 1: Scale-space image for a 3-D surface.
Fig. 2: Topological changes of contour (generation, disappearance, dividing, inner contact, outer contact, overlap, dividing by region; H0 and K0 contours, peaks and pits).
Fig. 3: Topological changes of H0 contour.
Fig. 4: Topological changes of K0 contour.
Fig. 5: Qualitative changes of region.
Fig. 6: Surface classification (surface sign: sign of K, sign of κ1, sign of κ2; K = κ1κ2, H = (κ1 + κ2)/2).
Fig. 7: Primitive surface elements (dent, bump, dent_d, bump_d).

4 Shape Matching

In this section, an efficient shape matching algorithm using the KH-description is pro-
posed. The KH-description describes the local topological changes of a contour caused
by scale change. When the description is parsed, regions surrounded by connected re-
gions (figure 2b(E)(F)) cannot be recognized. To recognize these regions correctly, the
contour-based KH-description is translated into a region-based network description. The
new description is parsed along the tree structure extracted from it to improve the match-
ing efficiency.

4.1 Detection of Region Loops
The topological changes of a contour are described with the KH-description, but the
qualitative changes of a region are not described in it, because it does not contain global
topological changes. If a group of regions forms a loop as shown in figure 5, a single region
will be divided into two separate regions. Since it is difficult to detect these closed loops
from a single-scale KH-image, the loops should be searched for by coarse-to-fine tracing in
the KH-description. Region loops found by the search are added to the description. These
qualitative changes are shown in figure 5: (E) shows the region created by inner contacts
and (F) shows that by outer contacts. Thus the global topological information can be
added to the KH-description.

4.2 Surface Segmentation for Shape Recognition

Since the KH-description is a rooted directed graph and the graph contains a tree as a
subset, the parsing cost can be reduced by tracing the tree. By removing the connection
relations of elements from the KH-description, the graph can be treated as a tree.
We choose the explicit region segmentation as a node of the tree. The explicit region
segmentation (figure 5(E)) divides a region into two independent regions. In the tree, this
segmentation corresponds to a branch. A single region generation inside a region, which
is called an implicit region generation (figure 3(A) and 4(A)), is not treated as a branch.
However, a group of implicit regions may form a region loop. The region enclosed by the
loop forms a new branch. A single implicit segmentation must, therefore, be marked as a
label change which has the possibility to branch the tree. Once a loop is detected when the
scale decreases, the label is rewritten as a new branch of the tree. The connection of
regions and the order of implicit region generation are also recorded in each node.
The above segmentation method based on H0's and K0's creates a hierarchy of the shape.
However, the segmentation based on both contours creates more regions than we expect
from the shape complexity. To reduce meaningless segmentations, our final version of
the segmentation is based on H0's and each region is classified by the signs of the elements
enclosed by K0 contours.
A region surrounded by an H0 contour is classified into four types of elements by the
sign of the mean curvature and the existence of H0 contours in it (figure 7): the mean
curvature of the enclosed region is negative in (A) and (C), and positive in (B) and (D), and
H0 contours appear in (A) and (B), while they do not in (C) and (D). These four types of
elements are called PS-elements in distinction from the KH-notation.
Since the PS-elements are basically enclosed by H0 contours, they can be extracted
from the KH-description regarding new branches created by region loops. PS-elements form
a PS-tree suitable for hierarchical shape matching.

Fig. 8: Filtering and surface classification for a face data set: (a) a face image derived from a range finder (left) and its filtered image (right); (b) surface classification at different scales.
Fig. 9: Hierarchical structure derived from a face data set (KH-description): (a) labeled image; (b) hierarchical description derived from the face.
Fig. 10: Hierarchical structure for similar objects (KH-description): (a) HAND-1 (left) and HAND-2 (right) measured with a range finder; (b) hierarchical description derived from HAND-1; (c) hierarchical description derived from HAND-2.
Fig. 11: Common hierarchical structure for similar objects: (a) labeled image of HAND-1 or 2; (b) the common tree of HAND-1 and HAND-2.

4.3 Matching

The matching algorithm is essentially based on the PS-tree. Since the PS-tree loses the
details of shape information at the leaf level, the KH-description can be compared if needed.
The outline of shape matching is the following (a sketch follows the list).
step 1. Add the number and distribution of regions to the KH-description.
step 2. Extract PS-elements from the KH-description to generate a PS-tree.
step 3. Compare the PS-trees from the root node.
step 4. Match the KH-elements corresponding to the PS-elements, if detailed examina-
tion is required.
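The following Python sketch illustrates the coarse-to-fine comparison of step 3 on a simplified PS-tree structure; the node layout and the similarity rule are illustrative assumptions, not the authors' implementation.

```python
# Top-down comparison of two PS-trees, coarse levels first.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PSNode:
    kind: str                              # e.g. 'dent', 'bump', 'dent_d', 'bump_d'
    children: List["PSNode"] = field(default_factory=list)

def match_ps_trees(a: PSNode, b: PSNode) -> float:
    """Return a coarse-to-fine similarity score in [0, 1]."""
    if a.kind != b.kind:
        return 0.0
    if not a.children or not b.children:
        return 1.0                         # leaves (or exhausted detail) agree
    # greedy pairing of children; a real matcher would search correspondences
    used, scores = set(), []
    for ca in a.children:
        best, best_j = 0.0, None
        for j, cb in enumerate(b.children):
            if j in used:
                continue
            s = match_ps_trees(ca, cb)
            if s > best:
                best, best_j = s, j
        if best_j is not None:
            used.add(best_j)
        scores.append(best)
    return 0.5 + 0.5 * sum(scores) / max(len(a.children), len(b.children))
```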

5 Experimental Results
5.1 Example 1: A Face

A set of range data from a laser range finder was used to validate our algorithm. Fig-
ure 8 (a) shows the filtering results: the original face data (left) and its filtered image.
Figure 8 (b) shows labeled KH-images at different scales. From these KH-images, the
KH-description is created by adding the number and relation of regions as shown in fig-
ure 9 (b). The symbols [A]-[L] in the KH-description correspond to the labels of regions
in (a). Circles and filled circles in the KH-description indicate peaks and pits, respectively.
Frames enclosing peaks and pits represent H0 contours.
In the example, moustaches are found at the surface of an egg-like shape at the
coarser level, and then eyes, nose, and cheeks are found by tracing the KH-description
downward to the finer level. The KH-description reflects the shape at all scales. By
extracting PS-elements from the KH-description and by linking them, the description
can be translated into a PS-tree.

5.2 Example 2: Hands

In this section, similar objects are analyzed to evaluate the hierarchical similarity of the
descriptions. Subjects were requested to imitate the shape of a prototype hand; the hands
were measured by the range finder. Figure 10 (a) shows a pair of data sets, HAND-1 (left)
and HAND-2 (right). Both data sets are filtered and analyzed to create labeled KH-
descriptions. Figure 10 (b) shows KH-images at different scales. Figures 11 (a) and (b) are
the KH-descriptions derived from HAND-1 and HAND-2, respectively. The symbols A-H
in figures 10 and 11 are corresponding regions.
Comparing the KH-description and the PS-tree, the KH-element distribution
of subparts r1 and r2 in figure 10 (a) evidently differs from that in (b) at the coarser
level, although the PS-elements A-I are common at the finer level. This fact indicates
that the PS-description is more useful than the KH-description for matching objects. Figure 11
shows the common PS-subtrees extracted from figure 10. The nodes dent, bump, dent_d, and
bump_d indicate the PS-elements introduced in figure 6.

6 Conclusions

In this paper, we have extended the scale-space approach used in 1-D signal analysis
to 3-D shape analysis. Scale-space filtered surface images of an object are divided into
segments based on curvature analysis and linked between consecutive scales.

In this research, we regarded scale-space filtering as a process of continuous defor-
mation resulting in constant curvature, and therefore we can extract the hierarchy by
excluding non-monotonic deformation as exceptions.
Since the matching algorithm is basically a top-down hierarchical analysis, the shape
recognition can be done efficiently. Furthermore, the matching resolution can be varied
according to the depth of the hierarchy. Experimental results showed the method is also
useful to categorize objects by their similarities.

References

[RA1] R. Hoffman and A. Jain: Segmentation and Classification of Range Images. IEEE Trans. Pattern Anal. & Mach. Intell. 9(5), pp. 608-620 (1987)
[PR1] P. J. Besl and R. C. Jain: Segmentation Through Variable-Order Surface Fitting. IEEE Trans. Pattern Anal. & Mach. Intell. 10(2), pp. 167-192 (1988)
[Ma1] D. Marr: Vision. W. H. Freeman, San Francisco (1982)
[Wi1] A. P. Witkin: Scale-Space Filtering. Proc. International Joint Conference on Artificial Intelligence, Karlsruhe, West Germany, pp. 1019-1022 (1983)
[YP1] A. L. Yuille and T. Poggio: Scaling Theorems for Zero Crossings. IEEE Trans. Pattern Anal. & Mach. Intell. 8(1), pp. 15-25 (1986)
[LP1] L. M. Lifshitz and S. M. Pizer: A Multiresolution Hierarchical Approach to Image Segmentation Based on Intensity Extrema. IEEE Trans. Pattern Anal. & Mach. Intell. 12(3), pp. 529-538 (1990)
[TL1] H. T. Tanaka and D. T. L. Lee: View-Invariant Surface Structure Descriptors - Toward a Smooth Surface Sketch. The Transactions of the IEICE E-73(3), pp. 418-427 (1990)
[Li1] T. Lindeberg: Scale-Space for Discrete Signals. IEEE Trans. Pattern Anal. & Mach. Intell. 12(3), pp. 234-254 (1990)
[MM1] F. Mokhtarian and A. Mackworth: Scale-Based Description and Recognition of Planar Curves and Two-Dimensional Shapes. IEEE Trans. Pattern Anal. & Mach. Intell. 8, pp. 34-43 (1986)
[MK1] S. Morita, T. Kawashima and Y. Aoki: Pattern Matching of 2-D Shape Using Hierarchical Description. Trans. IEICE Japan (D) J73-D2(5), pp. 717-727 (1990)

This article was processed using the LaTeX macro package with ECCV92 style
Object Recognition by Flexible Template Matching
using Genetic Algorithms *
A. Hill, C. J. Taylor and T. Cootes.
Department of Medical Biophysics, University of Manchester,
Oxford Road, Manchester M13 9PT, England.

Abstract. We demonstrate the use of a Genetic Algorithm (GA) to match


a flexible template model to image evidence. The advantage of the GA is that
plausible interpretations can be found in a relatively small number of trials;
it is also possible to generate multiple distinct interpretation hypotheses. The
method has been applied to the interpretation of ultrasound images of the
heart and its performance has been assessed in quantitative terms.
1 Introduction
Flexible template matching has become an established technique for image interpretation
[4,5,6]. It is particularly relevant to problems such as medical image interpretation,
where there can be considerable variation between examples of the same object. The
basis of the approach is to project instances of the template into the image until one
which is well supported by the observed data is found. There are several requirements:
• A template with a (small) number of controlling parameters; for any particular set
of parameter values it must be possible to reconstruct a feasible instance of the
object the model represents.
• For any model instance which is projected back into the image, an objective function
which can assess the support provided by the image evidence; in general this objective
function is likely to be non-linear with respect to the model parameters,
multi-modal, possibly noisy and/or discontinuous.
• A method of optimisation which can search the parameter space of the model in
order to identify the particular set of parameter values for which the objective
function is maximised, i.e. for which the image evidence is greatest.
In this paper we have employed the approach described by Cootes et al [1] for
constructing flexible template models which satisfy the first requirement listed above.
For the echocardiogram exemplar we consider in this paper, the variation in shape can
be described approximately by 6 parameters. In general, it is also necessary to translate,
scale and rotate the template onto the image, giving 4 additional parameters for 2D
models. Altogether, then, we have 10 parameters which instantiate and project the
model onto the image. If we employ $n$ bits to encode each parameter, a search space
of $2^{10n} \approx 10^{3n}$ results. Clearly, even for modest values of $n$, the search space is large.
In the literature this problem has been overcome by assuming that a reasonably good
approximation to the solution is available so that the search space can be drastically
reduced. We argue that this is often an unrealistic assumption and have attempted to
develop methods of finding good solutions without restricting the search space.
The problem, as stated, is one of optimisation of a multi-modal, non-linear function
of many variables, the function being possibly discontinuous and/or noisy. A class of
methods which have been proposed as a solution to problems of this nature are Genetic
Algorithms (GAs) [2,3]. It is claimed that GAs can robustly find good solutions
(although not guaranteed to be optimal) in large search spaces using very few trials. Given
an objective function, f, which measures the evidential support for any particular
projection into the image of the model, a GA search can find a set of parameters which
* This research was funded by the UK Science & Engineering Research Council and Department of
Trade & Industry.

provide a good explanation (or interpretation) of the image. Our results demonstrate
the feasibility of this approach.

2 Genetic Algorithms
GAs employ mechanisms analogous to those involved in natural selection to conduct
a search through a given parameter space for the global optimum of some objective
function. The main features of the approach are as follows :
• A point in the search space is encoded as a chromosome.
• A population of N chromosomes/search points is maintained.
• New points are generated by probabilistically combining existing solutions.
• Optimal solutions are evolved by iteratively producing new generations of
chromosomes using a selective breeding strategy based on the relative values of
the objective function for the different members of the population.
A solution $\underline{x} = (x_1, x_2, \dots, x_n)$, where the $x_i$ are in our case the model parameters, is
encoded as a string of genes to form a chromosome representing an individual. In many
applications the gene values are {0, 1} and the chromosomes are simply bit strings. An
objective function, $f$, is supplied which can decode the chromosome and assign a fitness
value to the individual the chromosome represents.
Given a population of chromosomes the genetic operators crossover and mutation
can be applied in order to propagate variation within the population. Crossover takes
two parent chromosomes, cuts them at some random gene/bit position and recombines
the opposing sections to create two children e.g. crossing the chromosomes 010-11010
and 100-00101 at position 3-4 gives 010-00101 and 100-11010. Mutation is a
background operator which selects a gene at random on a given individual and mutates
the value for that gene (for bit strings the bit is complemented).
The search for an optimal solution starts with a randomly generated population of
chromosomes; an iterative procedure is used to conduct the search. For each iteration
a process of selection from the current generation of chromosomes is followed by
application of the genetic operators and re-evaluation of the resulting chromosomes.
Selection allocates a number of trials to each individual according to its relative fitness
value $f_i/\bar{f}$, $\bar{f} = \frac{1}{N}(f_1 + f_2 + \dots + f_N)$. The fitter an individual the more trials it will
be allocated and vice versa. Average individuals are allocated only one trial.
Trials are conducted by applying the genetic operators (in particular crossover) to
selected individuals, thus producing a new generation of chromosomes. The algorithm
progresses by allocating, at each iteration, ever more trials to the high performance areas
of the search space under the assumption that these areas are associated with short
sub-sections of chromosomes which can be recombined using the random cut-and-mix
of crossover to generate even better solutions.
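A minimal, generic GA loop of the kind just described might look as follows (fitness-proportional selection, one-point crossover, background bit mutation); it assumes a non-negative fitness function and an even population size, and is a sketch rather than the authors' implementation.

```python
# Generic GA sketch: roulette-wheel selection, one-point crossover, bit mutation.
import random

def ga_search(fitness, n_bits, pop_size=50, p_cross=0.6, p_mut=0.005, n_gen=100):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(n_gen):
        scores = [fitness(c) for c in pop]                    # assumed non-negative
        parents = random.choices(pop, weights=[s + 1e-12 for s in scores], k=pop_size)
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < p_cross:
                cut = random.randint(1, n_bits - 1)           # one-point crossover
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a[:], b[:]]
        for c in children:                                    # background mutation
            for i in range(n_bits):
                if random.random() < p_mut:
                    c[i] ^= 1
        pop = children[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```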

3 Flexible Template Construction


We have employed a modified version of the method described by Cootes
et al [1] for constructing flexible templates. The templates are generated from a set
of examples of the object to be modelled. The technique is to represent each example
by a set of labelled points (xi,yi) so that each (xi, yi) is at an equivalent position for each
example. The mean position of the points gives the average shape of the template and
a principle component analysis of the deviations from the mean gives a set of modes
of variation. These modes represent the main ways in which the shape deforms from
the mean. By adding weighted sums of the first m modes to the average shape, new
examples of the object can be generated. The m weights /7 -- (bl, b2, ..,bin) give a set
of parameters for the model.
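A compact sketch of this construction, assuming the example shapes are already aligned and consistently labelled (n_points, 2) arrays and using an SVD in place of an explicit principal component analysis; the function names are illustrative.

```python
# Mean shape plus modes of variation; new shapes are mean + sum_k b_k * mode_k.
import numpy as np

def build_template(shapes, n_modes):
    """Return the mean shape vector and the first n_modes modes of variation."""
    X = np.array([s.reshape(-1) for s in shapes])     # one flattened example per row
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                              # principal directions of deviation
    return mean, modes

def instantiate(mean, modes, b):
    """Generate a new shape instance from the weight vector b."""
    return (mean + b @ modes).reshape(-1, 2)
```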

4 The Echocardiogram Exemplar

We have evaluated the method using ultrasound images of the heart (apical 4-chamber
echocardiograms). A typical example is shown in figure 1.a. The problem we address

Figure 1.a: Example echocardiogram. Figure 1.b: Associated LV boundary.


is that of locating the boundary of the left ventricle (LV) as shown in figure 1.b. The
aim is to provide quantitative information concerning LV function.
In order to construct a flexible template model of the left ventricle we employed a
training set of 66 apical 4-chamber echocardiograms with LV boundaries labelled by
an expert. On each of these boundaries 4 fiducial points were placed by the expert at
the apex and mitral valve (see figure 1.b). A number of intermediate points were
generated automatically by positioning equidistant points along the boundary to give 18
points in all. The principal component analysis identified 6 major modes of variation
in LV shape for this training set.
5 The Objective Function
To perform the experiments described below we used a simple and rather ad-hoc
objective function. A number of points, $P$, on the boundary of the candidate object
are selected for processing. For each of these points a grey level profile perpendicular
to the boundary at that point is extracted. For each profile the position
($p_i$, $p_{min} \le p_i < p_{max}$) and strength ($g_i$) of the largest intensity step along the profile are
recorded. The objective function is then given by:

$f = \sum_{i=1}^{P}\left(|p_i| + \left|\frac{g_i}{\bar{g}} - 1\right|\right)$, where $\bar{g} = \frac{1}{P}\sum_{i=1}^{P} g_i$

When we use the objective function to evaluate instances of the model we seek to
minimise $f$, thus favouring solutions with strong edges ($g$ large) of equal magnitude
($|g_i/\bar{g} - 1| \to 0$) located close to the boundary position predicted by the model ($p_i \to 0$).
6 Results
In order to assess the method we employed 2 frames from each of ten echocardiogram
time sequences. The frames showed the LV in its most extended and contracted state
for each sequence and the LV boundary was delineated on each frame by an expert.
The images were interpreted using the model-based approach described above and the

resulting LV boundaries were compared with the expert-generated boundaries using the
mean pixel distance between boundaries $\Delta = \frac{1}{P}\sum_{i=1}^{P}\sqrt{(a_{i,x} - b_{j,x})^2 + (a_{i,y} - b_{j,y})^2}$, where
$b_j = (b_{j,x}, b_{j,y})$ is the point on the expert boundary closest to the point $a_i$ on the GA-
generated boundary. The control parameters for the GA were: population size (N) =
50, crossover rate (C) = 0.6 and mutation rate (M) of 0.005. A limit of 5000 objective
function evaluations was employed. For 19 of the 20 images $\Delta < 6$ and for 14 of the
20 images $\Delta \le 4$. A typical result is shown in figure 2.

Figure 2.a : Expert boundary. Figure 2.b : GA boundary.

7 Niches and Species
During the course of the above experiments it became obvious that one of the major
problems for the GA search was premature convergence to sub-optimal solutions in the
presence of multiple plausible interpretations. Rather than extract a single solution, we
would like the GA search to extract a handful of strong, ranked candidates for the object
we wish to locate. The problem of locating multiple optima when using GAs is discussed
by Goldberg [2]. The approach adopted is to reduce the number of individuals in
over-crowded areas of the search space by modifying their function values.
In this case the fitness of an individual is weighted by the number of neighbours the
individual has. The more neighbours an individual has the worse its fitness value. The
number of individuals allowed to crowd into the various areas of the search space is
thus proportional to the relative fitness value associated with the different areas of the
search space. In order to maintain stable sub-populations two further modifications are
necessary. The first is simply to increase the size of the population in order to prevent
extinction due to sampling errors. The second is to implement a restricted mating strategy
in order to promote speciation i.e. prefer neighbours to distant individuals for crossover.
The key to the entire procedure is the ability to decide Who is a neighbour? i.e. how
close two points are in the search space. We define two individuals $\underline{x}_1, \underline{x}_2$ to be neighbours
if they lie within a sub-space defined by the global transformation parameters, translation
($t_x$, $t_y$), scaling ($s$) and rotation ($\phi$):
$|t_{x,1} - t_{x,2}| \le \delta t_x$, $|t_{y,1} - t_{y,2}| \le \delta t_y$, $|s_1 - s_2| \le \delta s$, $|\phi_1 - \phi_2| \le \delta\phi$
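A small Python sketch of this neighbourhood test and the fitness de-rating it drives follows; the simple divide-by-neighbour-count rule is an assumed simplification of the sharing scheme, not necessarily the exact weighting used here.

```python
# Neighbourhood defined on the global pose parameters (tx, ty, s, phi) only.
def are_neighbours(x1, x2, d_tx, d_ty, d_s, d_phi):
    """x = (tx, ty, s, phi, ...model parameters); compare global pose only."""
    return (abs(x1[0] - x2[0]) <= d_tx and abs(x1[1] - x2[1]) <= d_ty and
            abs(x1[2] - x2[2]) <= d_s and abs(x1[3] - x2[3]) <= d_phi)

def shared_fitness(pop, fitness, thresholds):
    """De-rate each individual's fitness by the size of its neighbourhood."""
    out = []
    for xi in pop:
        n = sum(are_neighbours(xi, xj, *thresholds) for xj in pop)
        out.append(fitness(xi) / max(n, 1))
    return out
```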

We have employed this mechanism to extract multiple solutions from
echocardiogram images. An example of applying the method is shown in figure 3. The
groups are listed in order of size, group 1 containing the largest number of individuals.
The species these groups represent were generated using N = 100, C = 0.5, M = 0.005
and 100 generations.

Figure 3: Species adapted to different chambers (legend: group 1, group 2, group 3, group 4).


8 Conclusions
One of the major attributes of the model-based approach is the ability to correctly
interpret incomplete and/or noisy image data by constraining all possible interpretations
using knowledge represented by a model. In order for this process to be successful it
may be necessary to search a high-dimensional, non-linear, multi-modal and noisy
search space. The combination of a flexible template model and a powerful search
technique has been shown to give very promising results. The major feature of GAs which
is useful for object location is a population of solutions. This allows alternative
interpretations to compete with one another, the strongest solution having the greatest
probability of success. This property of a GA search also allows multiple plausible
interpretations to be extracted.
References
[1] Comes T. E, Cooper D. H., Taylor C. J . , Graham J. : A Trainable Method of Parametric Shape
Description. Proa British Machine Vision Conference, Glasgow (1991) Springer Verlag 54-61.
[2] Goldberg D. E. : Genetic Algorithms in Search, Optimisation and Machine Learning.
Addison-Wesley (1989).
[3] Holland J. H. : Adaptation in Natural and Artificial Systems. University of Michegan Press, Ann
Arbor (1975).
[4] Lipson P., Yuille A.L, O'Keeffe D., Cavanaugh J., Taaffe J., Rosenthal D. : Deformable Templates
for Feature Extraction from Medical Images. Proc. 1st European Conference on Computer Vision,
Lecture Notes in Computer Science, Springer-Verlag, (1990) 413-417.
[5] Pentland A., Sclaroff S. : Closed-Form Solutions for Physically Based Modeling and Recognition.
IEEE Trans. on Pattern Analysis and Machine Intelligence 13(7) (1991) 715-729.
[6] Yuille A. L., Cohen D. S., Hallinan P.: Feature Extraction from Faces using Deformable Templates.
Proc. Computer Vision and Pattern Recognition, San Diego (1989) 104-109.
Matching and Recognition of Road Networks from Aerial Images *

Stan Z. Li, Josef Kittler and Maria Petrou

Department of Electronic and Electrical Engineering,
University of Surrey, Guildford, Surrey GU2 5XH, U.K.

Abstract. In this paper, we develop a method of matching and recognizing


aerial road network images based on road network models. We use attributed
relational graphs to describe images and models. The correspondences are
found using a relaxation labelling algorithm, which optimises a criterion of
similarity.

1 Introduction

Automatic map registration of remotely sensed images is an important computer vision


problem with applications to autonomous airborne vehicle guidance and analysis of satel-
lite and aerial images. For the specific application we are discussing in this paper, we
chose to match the road network extracted from an image to the road network of a much
larger map which presumably contains the area shown in the image. Our choice is based
on the idea that linear features can be extracted in a more robust way than other features
like road junctions or roundabouts.
We abstract the problem as one of sub-graph matching. Both model road networks
and images (scenes) are represented by line segments. These lines, their properties, and
the relations between them can be organized into a graph. Such a graph is an attributed
relational graph (ARG), in which each node represents a line, node attributes represent the
unary properties of the feature, and weights between nodes represent binary relations
between features. The task is then to match the scene ARG, which contains a usually
distorted subset of the model ARG, to the model ARG. We develop a parallel distributed
approach for solving the matching problem, which maximises a gain functional. The gain
functional is formulated to measure the goodness of matching between the scene and the
models [3]. Constraints on node properties and relations are encoded into the interaction
function in the gain. The best matches lie at the optimum of the functional. We use
relaxation labelling methods [1, 2, 6] to compute the optimum.

2 Attributed Relational Graphs ARGs

An ARG is an attributed and weighted graph which we denote by a triple g = (d, r_1, r_2).
In this notation, d = {1, ..., m} represents a set of m nodes; r_1 = {r_1(i) | i \in d} is a set of
unary relations defined over d, in which r_1(i) = [r_1^{(1)}(i), ..., r_1^{(K_1)}(i)] is a vector consisting
of K_1 different types of unary relations; r_2 = {r_2(i, j) | (i, j) \in d^2} is a set of binary
(bilateral) relations defined over d^2 = d \times d, in which r_2(i, j) = [r_2^{(1)}(i, j), ..., r_2^{(K_2)}(i, j)]
is a vector consisting of K_2 different types of binary relations.
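For concreteness, an ARG as defined above can be held in a small container such as the following (a sketch only; the class and field names are ours, not the authors'):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ARG:
    """Attributed relational graph g = (d, r1, r2) built from line segments."""
    nodes: List[int]                                                # d = {1, ..., m}
    unary: Dict[int, List[float]] = field(default_factory=dict)    # r1(i): K1 unary attributes per node
    binary: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)  # r2(i, j): K2 binary relations

    def r1(self, i: int) -> List[float]:
        return self.unary[i]

    def r2(self, i: int, j: int) -> List[float]:
        return self.binary[(i, j)]
```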

* This work was supported by IED and SERC, project number IED-1936. The images were
kindly provided by the Defence Research Agency, RSRE, UK.

As binary features we use the relative orientation between lines, which is invariant to
rotation, translation and scale changes, and the minimum distance between the end points
of the lines and the distance between their midpoints, which are invariant to rotation
and translation but not to scale changes. Therefore, our graph matching is independent
of translation and rotation, but not of scale changes.
In the following discussion of inexact matching from one ARG to another, lower-case
notation refers to instances of ARGs built from scenes while upper-case refers to instances
of ARGs derived from models.

3 Matching as a Constrained Optimisation Problem

Given the scene ARG and the model ARG, the sought matching is a mapping from
scene nodes to model nodes which optimises a certain criterion. We include a NULL node
(node 0) in the model to allow for unmatched lines from the scene. Further, we allow many-to-one
matches. We formulate the problem as one of continuous relaxation labelling [1, 2, 6].
The set L of possible matches consists of those matchings for which the difference in
orientation between the matched lines is within a certain range. (The orientation of each
line is approximately known, either from the plane route or the map.) Associated with
match (i, I) is a real number f(i, I) \in [0, 1] called the state of the match (i, I). It represents
the certainty with which i \in d is mapped to I \in D. The set of these numbers is called
the state of matching or mapping. A constraint on f is

\sum_{\{I \mid (i,I) \in L\}} f(i, I) = 1 \qquad \forall i \in d     (1)
The optimal mapping f* maximises the global gain functional defined by:

E(f) = \sum_{(i,I) \in L} \; \sum_{(j,J) \in L,\, j \neq i} \pi(i, I; j, J)\, f(i, I)\, f(j, J)     (2)

In the above, \pi(i, I; j, J) is the interaction between matches (i, I) and (j, J) and it measures
the similarity between the binary relations r_2(i, j) of the scene ARG and R_2(I, J)
of the model ARG. We choose it to be of the form

\pi(i, I; j, J) = \exp\Big( -\sum_{k=1}^{K_2} |r_2^{(k)}(i, j) - R_2^{(k)}(I, J)| / \alpha^{(k)} \Big) - H     (3)

where H \in [0, 1) is a bias controlling the excitatory (> 0) / inhibitory (< 0) behaviour
of \pi and \alpha^{(k)} > 0 are some parameters determined experimentally.
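A direct transcription of equation (3) might look as follows (a sketch; the argument names are ours, and the default H = 0.3 is simply the value used in the experiments reported later):

```python
import math

def interaction(r2_scene, R2_model, alphas, H=0.3):
    """pi(i, I; j, J) of equation (3): similarity between the binary relation vector
    r2(i, j) of the scene and R2(I, J) of the model, offset by the bias H."""
    total = sum(abs(r - R) / a for r, R, a in zip(r2_scene, R2_model, alphas))
    return math.exp(-total) - H
```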

4 The Algorithm

The solution f* is required to be unambiguous, i.e. f(i, I) \in {0, 1}. Finding the solution is a
combinatorial optimisation problem. To perform this, we decided to use the continuous
relaxation labelling method [1, 2, 6]. In this method, a global combinatorial solution is
achieved through propagation of local information. The algorithm is inherently parallel
and distributed, and thus could be efficiently implemented on SIMD architectures.
For continuous relaxation labelling, a final solution is reached by allowing f \in [0, 1].
We introduce a time parameter into f such that f = f^{(t)}. We are interested in
constructing a dynamic system that can locate as the final solution a maximum of E(f).

We require that the energy E should not decrease along its trajectory f^{(t)} as the system
evolves, i.e. dE/dt \ge 0. The process starts with an initial state f^{(0)} which we set at random.
At time t, the gradient function q^{(t)} is computed using:

q(i, I) = 2 \sum_{(j,J) \in L,\, j \neq i} \pi(i, I; j, J)\, f(j, J)     (4)

The component q(i, I) is the support for the match (i, I) from all other matches.
Based on the gradient, the state f^{(t)} is updated, so as not to decrease the energy, using an updating
rule. It is meant to update f^{(t)} by an appropriate amount in the appropriate direction.
In the adopted algorithm, the computation of the next state f^{(t+1)} from f^{(t)} and q^{(t)} is
based on the gradient projection (GP) operation described in [1, 4].
The iteration terminates when the algorithm converges, i.e. when f^{(t+1)} = f^{(t)}. The
final labelling is in general unambiguous, i.e. f*(i, I) = 1 or 0. This means that each node
has a single interpretation. For details of the algorithm and its convergence properties,
readers are referred to [1].
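The following sketch shows one iteration of this scheme in dense array form. It is not the authors' implementation: the gradient of equation (4) is followed here by a simple clip-and-renormalise step standing in for the gradient projection operator of [1, 4], and the array names are our own.

```python
import numpy as np

def relaxation_step(f, pi, step=0.05):
    """One (simplified) iteration of continuous relaxation labelling.

    f  : (n_scene, n_model) array of match states f(i, I) in [0, 1]
    pi : (n_scene, n_model, n_scene, n_model) interaction values pi(i, I; j, J)
    """
    # Gradient of equation (4): q(i, I) = 2 * sum_{(j, J), j != i} pi(i, I; j, J) f(j, J)
    q = 2.0 * np.einsum('iIjJ,jJ->iI', pi, f)
    q -= 2.0 * np.einsum('iIiJ,iJ->iI', pi, f)          # remove the j == i contribution
    # Crude projection back onto the constraint sum_I f(i, I) = 1, f >= 0
    f_new = np.clip(f + step * q, 0.0, None)
    f_new /= f_new.sum(axis=1, keepdims=True) + 1e-12
    return f_new
```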

5 Experimental results
Figure 1a shows one of the images we used and Figure 1b part of the corresponding map.
A set of line segments, shown in Figure 1c, is extracted from the map. These are used to
build a model ARG. This model ARG is used for all experiments presented in this paper.
Shown in Figure 1d are the line segments of the image extracted by a preprocessing
procedure [5]. The scale of the scene to the model is about 50:26. We have deliberately
broken long lines in the model into short ones. They are broken in such a way that their
length is about the same as the average length of those in the scene. This is necessary
when distance relations are used as constraints for matching. A scene ARG is extracted
from these scene line segments.

t=0      Random correspondences
t=1-25   All scene nodes are maximally associated to node 0 (NULL) of the model
t=26     scene node:  1  2  5  7 23 24 26 28 29 31 32 46 48 63 66 67 69 70
         model node: 18 20 21 23 13 14 19 22 27 28 29 37 32 40 41 42 45 48
t=27     scene node:  1  2  7 23 24 26 27 28 29 30 31 32 35 37 42 46 48 58 63 66 67 69 70
         model node: 18 20 23 13 14 19 22 22 27 25 28 29 31 29 35 37 32 36 40 41 42 45 48
t=28     scene node:  1  2  7 23 24 26 27 28 29 30 31 32 34 35 46 48 58 63 66 67 69 70
         model node: 18 20 23 13 14 19 22 22 27 25 28 29 29 31 37 32 36 40 41 42 45 48
t=29     scene node:  1  2 23 24 26 27 28 29 30 31 32 34 35 46 48 58 63 66 67 69 70
         model node: 18 20 13 14 19 22 22 27 25 28 29 29 31 37 32 36 40 41 42 45 48
t=30     scene node:  1  2 23 24 26 27 28 29 30 31 32 34 35 46 48 58 63 66 67 69 70
         model node: 18 20 13 14 19 22 22 27 25 28 29 29 31 37 32 36 40 41 42 45 48

Table 1. Labelling process for road network matching with H = 0.30.

Table 1 shows the iterative process of matching the image lines in Figure 1d to the
model lines in Figure 1c, with the standard value H = 0.3. At time t = 0, the initial

correspondences are all given nearly equal strength. From t = 1 to 25, all scene nodes
(lines) are maximally associated to node 0 (NULL) of the model. This is because during
the first few iterations, there is not enough supporting evidence for non-NULL matches.
Meaningful matches emerge from t = 26 onwards as evidence is gathered and propagated.
The algorithm converges at t = 30. There are basically no wrong matches in the presented
result. The computation on a SUN-4 workstation takes about 210 seconds user time and
6 seconds system time for the matching results shown in the above table.
To test the algorithm, we performed several experiments by adding extra lines to the
scene and subtracting others. There were basically no wrong matches in the results. Since
precise knowledge of the scale of scenes may not be guaranteed, we also ran experiments
with incorrect scale ratios. Results are meaningful and reasonable for ratios ranging
from 30:26 to 60:26. However, the results deteriorate rapidly and become disastrous
beyond this range.

6 Conclusion

We have developed an approach for road network matching and recognition from line segments
detected in aerial images. The ARG representation is used to describe scenes and
models. The optimal matching optimises a gain functional which measures the goodness
of fit between the model ARG and the scene ARG. Issues we have dealt with include inexact
matching in the presence of distortions, NULL matches and many-to-one matches.
The computational algorithm for the matching works in a parallel and distributed way,
and is suited for implementation on SIMD architectures like the Connection Machine.
We have presented experiments to demonstrate the performance of the approach.
Our matching is invariant to translation and rotation but not to scale changes. The
results are apparently not affected if both models and scenes are changed by the same
scale factor. However, they will deteriorate drastically when scene scales are changed
relative to models by a relative factor larger than, say, 20%. Our work is aimed at
performing inexact matching under distortions caused by noise, but not at dealing with
rubber-like shape changes. This is because the constraints used are geometric rather than
topological.

References

1. R. A. Hummel and S. W. Zucker. "On the foundations of relaxation labeling processes".


IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(3):267-286,
May 1983.
2. J. Kittler and J. Illingworth. "Relaxation labeling algorithms - a review". Image and
Vision Computing, 3(4):206-216, November 1985.
3. S. Z. Li. "Towards 3D vision from range images: An optimization framework and parallel
distributed networks". CVGIP: Image Understanding, to appear.
4. J. Mohammed, R. Hummel, and S. Zucker. "A feasible direction operator for relaxation
method". IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-
5(3):330-332, May 1983.
5. M. Petrou. "Optimal convolution filters and an algorithm for the detection of linear
features". VSSP-TR-21/90, Electronic and Electrical Engineering Department, Surrey
University, 1990.
6. A. Rosenfeld, R. Hummel, and S. Zucker. "Scene labeling by relaxation operations".
IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, June 1976.

", ,,~ ,'G ree ,~

t
~t
J

it

Fig. 1. The image on top left is the originai image. The one on top right is the map that contains
the image area. At the bottom left is shown part of the line model extracted from the map. At
the bottom right the lines extracted from the image are shown.

This article was processed using the LaTeX macro package with ECCV92 style
Intensity and Edge-Based Symmetry Detection
Applied to Car-Following *

Thomas Zielke, Michael Brauckmann, and Werner von Seelen


Institut für Neuroinformatik, Ruhr-Universität, 4630 Bochum, Germany.

Abstract. We present two methods for detecting symmetry in images,
one based directly on the intensity values and another one based on a discrete
representation of local orientation. A symmetry finder has been developed
which uses the intensity-based method to search an image for compact
regions which display some degree of mirror symmetry due to intensity similarities
across a straight axis. In a different approach, we look at symmetry
as a bilateral relationship between local orientations. A symmetry-enhancing
edge detector is presented which indicates edges dependent on the orientations
at two different image positions. SEED, as we call it, is a detector
element implemented by a feedforward network that holds the symmetry
conditions. We use SEED to find the contours of symmetric objects of which
we know the axis of symmetry from the intensity-based symmetry finder.
The methods presented have been applied to the problem of visually guided
car-following. Real-time experiments with a system for automatic headway
control on motorways have been successful.

1 Introduction

Our interest in symmetry detection originates from the problem of car-following by Computer
Vision, i.e. the problem of how an automobile equipped with a camera and control
computers can be programmed to automatically keep a safe driving distance to a car in
front. There are three major visual tasks the system has to cope with:

1. Detecting leading cars. This means repeated visual scanning of the road in front of
the car until an object appears which can be identified as another vehicle.
2. Visual tracking of a car's rear while its image position and size may vary greatly.
3. Accurate measuring of the car's dynamic image size needed for the speed control.

The methods presented here exploit the symmetry property of the rear view of most
vehicles on normal roads. Mirror symmetry with respect to a vertical axis is one of the
most striking generic shape features available for object recognition in a car-following
situation. Initially, we use an intensity-based symmetry finder to detect image regions that
are candidates for a leading car. The vertical axis of symmetry obtained from this step
is also an excellent feature for measuring the leading car's relative lateral displacement
in consecutive images because it is invariant under (vertical) nodding movements of the
camera and under changes of object size. To exactly measure the image size of the car
in front, a novel edge detector has been developed which enhances edges that have a
symmetric counterpart with respect to a given axis.

* This work has been supported by the German Federal Ministry of Research and Technology
(BMFT) and by the Volkswagen AG (VW). PROMETHEUS PRO-ART Project: ITM8900/2

Fig. 1. A typical image in a car-following situation. The overlay plot in the lower half shows the
intensity distribution along the scan line marked in the upper half of the image.

In the following, we briefly review previous work on symmetry detection and finding
symmetry axes. There exist optimal solutions for this problem if the data can be mapped
onto a set of points in a plane and only accurately symmetric point configurations are
searched for. However, for real image data such methods cannot be used because they
fail to detect imperfect symmetry. Other algorithms assume that the figure, i.e. the
image region for which symmetry axes are sought, can be readily separated from the
background. Friedberg [1], for example, shows how to derive axes of skewed symmetry of
a figure from its matrix of moments. Marola [2] proposes a method for object recognition
using the symmetry axis and a symmetry coefficient of a figure. The method is based
on central moments too, but, in contrast to [1], it directly uses intensity values and
hence takes into account the internal structure of the object as well. However, these
methods either assume that the segmentation problem has been solved or that there is
no segmentation problem (e.g. uniform background intensity). Saint-Marc and Medioni
[3] propose a B-spline contour representation which facilitates symmetry detection.
For our application we need methods that do not require computationally expensive
image preprocessing or segmentation. More importantly, we need algorithms which can be
applied locally, since this is the only way real-time performance can be achieved without
dedicated hardware. To our knowledge no "low-level" methods for symmetry detection
in images have been reported upon in the literature to date. That is, all the methods we
found require some kind of object-related preprocessing or a transformation of the whole
object region into a higher-level representation.

2 Finding Axes of Intensity Symmetry

Two-dimensional symmetry is formed by a systematic coincidence of one-dimensional
symmetries. Elementary mirror symmetries are searched for along straight scan lines in
an image. In this section we look at the intensity distribution along a scan line as a
continuous function.

2.1 A Measure for Local Symmetry within a 1D Intensity Function

Any function G(x) can be written as the sum of its even component G_e(x) and its odd
component G_o(x), provided the origin is at the center of the definition interval. We
use d to denote the width of the interval on which G(x) is defined. In an image, d is
given by the distance between the intersection points of the scan line with the image
borders. Intuitively, the degree of symmetry of a function with respect to the origin
should be reflected by the relative significance of its even component compared with
its odd component. This is essentially the idea underlying our definition of a symmetry
measure.
Usually the primary objective is not to measure the degree of global symmetry of a
scan line with respect to its center point. Rather, in the context of object recognition, we
are interested in local symmetry intervals. Both the size and the center position of such
intervals have to be found. For this, two additional parameters are introduced. One (x_s)
permits shifting the origin of the 1D function, and another (w) varies the
size of the interval being evaluated. The values x_s are restricted by d and by the actual
choice of w:

-\frac{d - w}{2} \le x_s \le \frac{d - w}{2}, \qquad 0 \le w \le d     (1)

x_s may also be thought of as denoting the location of a potential symmetry axis, with
w being the width of the symmetric interval about x_s. We define the even function of
G(x - x_s) and its odd function for a given interval of width w such that

E(x, x_s, w) := \begin{cases} \frac{1}{2}\,(G(x - x_s) + G(x_s - x)) & \text{if } |x - x_s| \le w/2 \\ 0 & \text{otherwise} \end{cases}     (2)

O(x, x_s, w) := \begin{cases} \frac{1}{2}\,(G(x - x_s) - G(x_s - x)) & \text{if } |x - x_s| \le w/2 \\ 0 & \text{otherwise} \end{cases}     (3)

For any fixed pair of values x_s and w, the significance of either E(x, x_s, w) or O(x, x_s, w)
may appropriately be expressed by its respective energy content, the integral over its
squared values (Energy[f(x)] := \int f(x)^2 dx). However, the problem arises that the mean
value of the odd function is always zero whereas the even function in general has some
positive mean value. This bias has to be subtracted from the even function in order to
render the two energy quantities comparable in the sense that their dissimilarity indicates
symmetry or antisymmetry respectively. We introduce a normalized even function which
has a zero mean value:

E_n(x, x_s, w) := E(x, x_s, w) - \frac{1}{w} \int E(x, x_s, w)\, dx     (4)

With E_n and O we construct a normalized measure for the degree of symmetry by means
of the contrast function C(a, b) = (a - b)/(a + b). The symmetry measure S(x_s, w)
is a function of two variables, i.e. it is a number that can be computed for any potential
symmetry axis x_s with observation interval w.

S(x_s, w) = \frac{\int E_n(x, x_s, w)^2\, dx - \int O(x, x_s, w)^2\, dx}{\int E_n(x, x_s, w)^2\, dx + \int O(x, x_s, w)^2\, dx}, \qquad -1 \le S(x_s, w) \le 1     (5)

We get S = 1 for ideal symmetry, S = 0 for asymmetry, and S = -1 for the ideally
antisymmetric case.
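A discrete version of the measure defined by equations (2)-(5) can be written down in a few lines (a sketch only; sampling on a half interval about x_s and the handling of the image border are our own simplifications):

```python
import numpy as np

def symmetry_measure(G, xs, w):
    """Discrete S(xs, w): +1 for ideal symmetry, 0 for asymmetry, -1 for antisymmetry.

    G  : 1D array of intensities along a scan line
    xs : candidate axis position (sample index)
    w  : width of the interval, in samples
    """
    half = w // 2
    left = np.asarray(G[max(xs - half, 0):xs], dtype=float)[::-1]   # samples mirrored about xs
    right = np.asarray(G[xs + 1:xs + half + 1], dtype=float)
    n = min(len(left), len(right))
    left, right = left[:n], right[:n]

    even = 0.5 * (right + left)              # E of equation (2)
    odd = 0.5 * (right - left)               # O of equation (3)
    if n:
        even -= even.mean()                  # E_n of equation (4): remove the even-part bias

    e_even, e_odd = float(np.sum(even ** 2)), float(np.sum(odd ** 2))
    if e_even + e_odd == 0.0:
        return 0.0
    return (e_even - e_odd) / (e_even + e_odd)   # contrast function C(a, b), equation (5)
```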

2.2 The Symmetry Finder

The function S(x_s, w) gives a normalized measure for symmetry, independent of the
width of the interval being evaluated. For detecting axes of symmetry, however, we need
a measure for the significance of an axis. This is a question of scale and depends both
on the degree of symmetry within a given interval and on the interval's relative size with
respect to the overall extent of the signal. We define a confidence measure [0...1] for a
symmetry axis originating from an interval of width w about the position x_s as

S_A(x_s, w) = S(x_s, w)\, \frac{w}{w_{max}}, \qquad w \le w_{max}     (6)

w_{max} is the maximal size of the symmetric interval. Considering a global symmetry
comprising the entire scan line to be the most significant case would imply w_{max} = d.
However, w_{max} is better thought of as a limitation of the search interval, meaning that
any perfectly symmetric interval of width w_{max} and wider corresponds to an indisputable
symmetry axis.

Fig. 2. A plot of S(x_s, w) [a] and S_A(x_s, w) [b] for the image row marked in Fig. 1. S_A(x_s, w)
summed over 54 image rows [c]. Two-dimensional symmetry clearly emerges as a stable feature.

Two-dimensional symmetry detection requires the combination of many symmetry
histograms (accumulated confidence for symmetry axis vs. position of axis). This is easily
done by summation of the confidence values for each axis position, provided the symmetry
axis (axes) is (are) straight. Figure 2 illustrates the process of intensity-based symmetry
detection. The image for which the simulation results are presented here is shown in Fig. 1.
The symmetry position x_s is varied in a horizontal range of about half the image width.
Figure 2a is a plot of the function S(x_s, w) for one image row across the three cars. Many
local symmetry peaks exist for small w's. Figure 2b is a plot of the confidence function for
symmetry axes. Large symmetry intervals give rise to steep peaks. After summation of
S_A(x_s, w) over a number of image rows, clearly distinguishable peaks emerge, indicating
the symmetry axes of the three cars, Fig. 2c.
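The accumulation step can then be sketched as follows. We assume here that the confidence of equation (6) is the symmetry value scaled by the relative interval width w/w_max and clipped to [0, 1], and that the best interval width is kept for each axis position; the symmetry measure is passed in as a parameter so the sketch from the previous section can be reused.

```python
import numpy as np

def symmetry_histogram(image_rows, widths, w_max, measure):
    """Sum axis confidences S_A(xs, w) over several image rows; stable 2D symmetry
    axes emerge as peaks of the resulting histogram (cf. Fig. 2c)."""
    n_cols = len(image_rows[0])
    margin = max(widths) // 2
    hist = np.zeros(n_cols)
    for row in image_rows:
        for xs in range(margin, n_cols - margin):
            best = 0.0
            for w in widths:
                sa = measure(row, xs, w) * w / w_max      # our reading of equation (6)
                best = max(best, min(max(sa, 0.0), 1.0))
            hist[xs] += best
    return hist
```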

3 Detecting Symmetry of Local Orientations

3.1 Analysis of Orientation Symmetry

When curves (e.g. contour segments) are used as basic features for symmetry detection
[3] the relationship between mutually symmetric points along two curves can be defined
as a function of the directions of the two tangents. However, we may want to do without
a segmentation process which has to produce differentiable curve segments, or the image
may be such that recognizing symmetric objects becomes much easier when symmetry
is detected first. Starting from a lower feature level we use local orientation instead of
tangent direction and define a symmetry relationship for it.
Figure 3 depicts two banks of directional filters and all possible pairs of directions
between them. Having a vector of eight direction-specific orientation values at each image
position is actually common in practice (e.g. when Sobel filter masks are being used for
preprocessing). The problem now is to define a symmetry relationship between the two
local orientations, each represented by a vector of n directional values. There are n^2/2
different pairs of directions for which it has to be decided whether this combination agrees
with our notion of symmetry. It is not always obvious which pairs of directions are to be
regarded as symmetric. In Fig. 3 the direction pairs are grouped into six categories.
Ideal symmetry means that the two directions can be mapped onto each other by
a reflection about the symmetry axis and a subsequent reversal of direction (the edge
polarities at the two points have opposite sign). In the case of inverse symmetry, the two
directions are the reflection of each other. With antisymmetry we denote cases which
are the opposite of what we consider ideal symmetry. Non-symmetry covers all cases
where the pair of directions is considered neither symmetric nor antisymmetric, i.e. it
is the neutral relation. The category near (inverse) symmetry expresses approximate
symmetry.


Fig. 3. Symmetry as a relationship between local orientations. The pairs of directions are categorized
according to their degree and kind of symmetry respectively.

3.2 The Symmetry-Enhancing Edge Detector (SEED)

If orientation is measured by a small number of directional filters which may have overlapping
angular sensitivity, all the above types of symmetry relations can be present concurrently
between two image loci, varying in degree of significance. We have developed
a method which combines evidence for the different categories of orientation symmetry.
It serves as a mechanism for detecting edges that are related to other edges by a certain
degree of symmetry. In the following we describe a detector element which receives input
from two image loci through a discrete representation of local orientation. The two
outputs of the detector element indicate edge points dependent on the strength of the
corresponding local orientation features and on the degree of compatibility of the two
orientation inputs according to a criterion for discrete orientation symmetry.
Figure 4 shows the architecture of the detector element. For clearness of the illustration,
only four directional inputs are drawn on either side. The basic idea is to view
the table of symmetry relations in Fig. 3 as a matrix of weight factors (w_ij) and compute
weighted sums of the directional inputs (d_Li, d_Rj), which then serve as inhibitory
signals for the corresponding directional inputs (d_Rj, d_Li) on the respective other side of
the detector. SEED is a detector element implemented by a feedforward network whose
connection weights represent the symmetry conditions.
The weighted sums represent the degrees of accumulated evidence for the compatibility
of a given edge direction on either side of the detector element with the local orientation
present at the respective other side's input. The degree of evidence is thresholded by
a sigmoid function (\Phi). The threshold function prevents a strong directional signal from
causing an edge signal when there is only weak symmetry. By taking the products of
the directional inputs and the normalized degrees of symmetry, a second layer of discrete
orientation representation is constructed, now with an enhanced sensitivity to orientation
patterns that have a symmetric counterpart across a given axis.
Let d_Li and d_Rj (i, j = 1...n) denote the outputs of the orientation-tuned filters for
n discrete directions and let w_ij denote the factors specifying a direction's degree of
compatibility with a direction on the opposite side of the detector element. Then the left
and right outputs, L_i and R_j, of the detector element are defined as

L_i = d_{Li}\, \Phi\Big(\sum_j w_{ij}\, d_{Rj},\, T\Big), \qquad R_j = d_{Rj}\, \Phi\Big(\sum_i w_{ij}\, d_{Li},\, T\Big)     (7)

where \Phi(u, T) is a sigmoid threshold function with upper limit 1 and lower limit 0:

\Phi(u, T) = \frac{1}{1 + \exp(T - u/k)}     (8)

Fig. 4. Architecture of the Symmetry-Enhancing Edge Detector element (SEED).
Equation (8) contains two parameters that need to be chosen appropriately. k is a scaling
factor for normalizing u, such that T can be chosen independently of the weighting factors
w_ij. The factor k has to be determined once for a given weight matrix W. We compute k
from the product of W and an angular influence matrix A, with a_ij = |cos(2\pi(i - j)/n)|,
(i, j = 1...n). The maximum of the elements of W A is a good approximation for the
"gain" caused by the weighted summation of the directional inputs. The parameter T has
to be determined experimentally. For best results, T should be adapted to the average
edge contrast in the image. We usually set T to half the value of the average response of
the directional filter for the strongest direction at edges.
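In vector form, one way to realise the element described above is sketched below; it follows our reading of Fig. 4 and equations (7)-(8), and the function names are ours.

```python
import numpy as np

def scale_factor(W):
    """Approximate k as the maximal element of W A, with a_ij = |cos(2*pi*(i - j)/n)|."""
    n = W.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    A = np.abs(np.cos(2.0 * np.pi * (i - j) / n))
    return float((W @ A).max())

def seed_outputs(dL, dR, W, T, k):
    """Gate each directional input with the thresholded, weighted evidence from the
    opposite image locus (SEED element).

    dL, dR : length-n vectors of directional filter responses at the two loci
    W      : n x n matrix of symmetry compatibility weights w_ij
    T, k   : threshold and scaling constant of the sigmoid, equation (8)
    """
    phi = lambda u: 1.0 / (1.0 + np.exp(T - u / k))   # sigmoid threshold, eq. (8)
    left_out = dL * phi(W @ dR)      # evidence for each left direction comes from the right side
    right_out = dR * phi(W.T @ dL)   # and vice versa
    return left_out, right_out
```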
Figures 5 and 6 show results. With the intensity-based symmetry finder we determine
the position of the symmetry axis. At the axis position found, SEED is applied. The
weight vector [Z,X,M,O,Y,N] (see Fig. 3) used to process the images in this section is
[-2,2,2,0,1,0]. The edges which do not confirm the assumption of mirror symmetry about
the given axis are suppressed. Edges arising from the symmetric structures of an object
clearly stand out.

Fig. 5. A picture of a household cable drum. Viewed from the front, it is a mirror-symmetric
object with some asymmetric internal structures. Applying a Sobel edge filter to this image
results in the upper right image. l.l.: The result of applying SEED at the position of the symmetry
axis. l.r.: The symmetric edges (binarized) superimposed onto the original image.

Fig. 6. Multiple object detection. In addition to the processing steps demonstrated in Fig. 5,
SEED is applied twice, picking out the edges of the road sign as well.

4 The CARTRACK System

Our methods for symmetry detection and symmetry filtering have been used to build
a vision system (dubbed CARTRACK) for detecting and tracking cars and other vehicles
having an approximately mirror-symmetric rear view. Symmetry is used for three
different purposes:

1. Initial candidates for vehicle objects on the road are detected by means of the as-
sumption that compact image regions with a high degree of (horizontal) intensity
symmetry are likely to be nonincidental.
2. Visual tracking of a car's rear is greatly facilitated by the invariance of the symmetry
axis under changes of size and vertical position of the object in an image sequence.
3. When using the symmetry constraint, separating the lateral vehicle edges from the
image background becomes feasible, even on a low processing level. From the lateral
contours of a vehicle, its image width can be determined accurately.

CARTRACK is a real-time implementation, in the sense that we can use it for conducting
experiments in test cars on normal roads. This performance is achieved by means

of an adaptable processing window whose position and size are controlled by a predictive
filter and a number of rules. While the size of the processing window varies depending
on the object size, the amount of image data processed by the symmetry finder is kept
approximately constant by means of a scan line oriented image compression technique.
The result of the symmetry finder is a symmetry histogram. Its highest peak is tenta-
tively taken as the horizontal object position. Then SEED is used to verify and correct
the object position by trying to find symmetric edges in addition to the intensity symmetry.
Given that the object's symmetry axis has been found with high accuracy, it is
relatively easy to correct the vertical object position. Finally, a boundary point is located
on either side of the object, providing the start points for a boundary tracking algorithm
which uses SEED to "see" only symmetric edge pairs. On both sides of a leading car,
a sufficiently long contour segment in the image is extracted and the maximal lateral
distance between the two contours is taken as its image width.
The current implementation had to be trimmed for fast computation. The weight
factors representing the symmetry criterion within SEED, for example, had to be chosen
as powers of two or zero respectively. The directional filter kernels are simple Sobel
masks. The threshold function \Phi is implemented as a "hard" threshold. Nonetheless,
the CARTRACK system has performed well during tests for an intelligent cruise control
system in an experimental van of Volkswagen (VW). With the current computer system,
based on a single MC68040 microprocessor (25 MHz), we achieve a cycle time of about
300 milliseconds for the output data to the van's control system.

5 Conclusion

Symmetry in an image may exist on various levels of abstraction. We deal with the


two lowest levels of image features, intensity values and local orientation. A symmetry
finder is presented based on a normalized measure for the degree of intensity symmetry
within scan line intervals. It can detect axes of intensity symmetry without any prior
image segmentation step. For detecting symmetry which is formed by a relationship
between local orientations, we present a symmetry-enhancing edge detector. The edge
detector extracts pairs of edge points if the local orientations at these points are mutually
symmetric with respect to a certain axis. Other edge points are suppressed.
The application for which the symmetry methods have been developed is vision-based
car-following. We found that compact image regions with a high degree of (horizontal)
intensity symmetry are likely candidates for the initial detection of vehicles on the road,
even when the viewing conditions are unfavorable. Visual tracking of automobiles from
behind can be done fast and reliably using the symmetry axis as the guiding object
feature. Finally, when the edge detector only "sees" symmetry edges, extracting the car's
lateral boundaries becomes possible even in situations where background and object are
hard to distinguish.

References

1. Friedberg, S. A.: Finding Axes of Skewed Symmetry. Computer Vision, Graphics, and Image
Processing 34 (1986) 138-155
2. Marola, G. : Using Symmetry for Detecting and Locating Objects in a Picture. Computer
Vision, Graphics, and Image Processing 46 (1989) 179-195
3. Saint-Marc, P., Medioni, G. : B-Spline Contour Representation and Symmetry Detection,
Proceedings, First European Conf. on Computer Vision, Antibes, France (1990) 604-606
Indexicality and Dynamic Attention Control
in Qualitative Recognition of Assembly Actions
Yasuo Kuniyoshi 1 and Hirochika Inoue 2

1 Autonomous Systems Section, Intelligent Systems Division, Electrotechnical Laboratory,


1-1-4 Umezono, Tsukuba-shi, Ibaraki 305, JAPAN
2 Department of Mechano-Informatics, The University of Tokyo,
7-3-1 Hongo, Bunkyo-ku, Tokyo 113, JAPAN
Abstract. Visual recognition of physical actions requires temporal seg-
mentation and identification of action types. Action concepts are analyzed
into attention, context, and change. Temporal segmentation is defined as
a context switch detected by a switching of attention. Actions are identi-
fied by detecting "indexical" features which can be quickly calculated from
visual features and directly point to action concepts. Validity of the index-
icality depends on the attention and the context. These are maintained by
three types of attention control: spatial, temporal and hierarchical. They
are combined by a mechanism called "attention stack", which extends at
important points and winds up elsewhere. An action recognizer built upon
the framework successfully recognized human assembly action sequences in
real time and output qualitative descriptions of the tasks.

1 Introduction
Intelligent systems in a multi-agent environment must react to other agents' actions. In
such cases, visual recognition of actions is indispensable. The recognition process must
operate in real-time, generating semantic information suitable for reasoning processes or
matching against stored knowledge.
Research on motion understanding has offered various methods to estimate the 3D motion
parameters of various objects including human bodies [1, 2]. But it is not clear how to
extract semantics from these parameters. Moreover, bodily motion parameters are
neither sufficient nor of primary importance for action recognition. Early attempts to extract
semantic information from motion pictures dealt with simple animations [3, 4]. Real-time
recognition of general actions in the real world is still an open problem.
Recently, many ill-posed vision problems have been reformulated and solved by as-
suming an active observer [5] and considering behavior context [6, 7]. This approach
should fit well to qualitative action recognition.
In this paper, we propose and test a new method for real-time visual recognition
of simple assembly tasks performed by human workers. The core of the method is twofold:
(1) Define a set of simple "indexical" features which can be quickly calculated
from visual features. They directly point to action concepts under certain conditions. (2)
Employ spatial/temporal/hierarchical attention control and context memory to maintain
the indexicality of extracted features.

2 Qualitative Recognition of Human Action Sequences

The objective of qualitative action recognition is to generate symbolic descriptions of


every action and every related change in the environmental state by watching continuous
performance of purposeful human actions.

The example problem treated in this paper is stated as follows:


Input: A human worker constructs various structures with blocks. One structure is
assembled in one task. Each task is carried out only once from start to finish without
a pause. The system observes it by stereo vision.
Output: Symbolic assembly plans. Each plan is a sequence of assembly operators with
intermediate state descriptions. It should be useful for reproducing the task by a
robot [8].
Assumptions: Explicit cues for temporal segmentations and a priori knowledge about a
specific task are not given. Classification of possible assembly operations is given. The
worker uses only one hand and grasps only one block at a time. Spatial segmentation
is tractable. Complete occlusion of blocks should not occur. "Messed up" situations
such as collapses are not allowed.
The system must automatically determine the start and the end time point of every
action (temporal segmentation), identify each temporal interval as one of the known action
classes (action identification), and extract and remember qualitative state descriptions at the
segmentation points (state recognition). All of this must be processed in real time. This
constraint gives rise to the use of contextual information for active attention control while
freeing the system from storing whole image sequences.

3 Indexicality in Classifying Actions


In order to define indexical features for assembly actions, we first analyze the structure of
action concepts and then construct mappings from observable features to action concepts.

Conceptual Organization of Assembly Actions. Let W be the set of all possible episodes in
the time-space of the physical world. Qualitative recognition of an action requires cutting
a spatio-temporal region from W and identifying it as one of the known action concepts.
According to Hobbs [9], a set of concepts C is modelled as a quotient set of W by a
certain equivalence relation ~c: C = W/~c. A set of observable qualitative features F
is also defined as a quotient set of W by an equivalence relation ~o: F = W/~o.
We state that F has indexicality to C if and only if the observation process ~o
articulates the world W in the very same way as the concept formation process ~c does.
Let us divide the concept formation process ~c into three steps (note that all three
are equivalence relations):
Attention: Specify "focused entities" (objects, geometrical features or locations) which
serve as "supports" of motion/relation descriptions. Their identity must be maintained
within each action interval.
Context: Invariant portion of motion: relative position/velocity and contact state of
focused entities.
Change: Qualitative change of relations (held, affixed) or relative position/orientation
(on, coplanar) among focused entities; an equivalence relation among possible temporal
changes in W. Meaningful only under a specified context.
Physical actions in assembly tasks are roughly grouped by the "context" into "assembly
motions": (1) "Transfer", with large movement of the hand in free space, and (2) "LocalMotion",
with the hand movement near specific target objects. "LocalMotion" is further
divided into three types of assembly motions: (1) "Approach", in which the fingers and
the held object move toward the target objects, (2) "Depart", the other way round, and
(3) "FineMotion", in which the held object moves in contact with the target objects.

Indexical Features of Assembly Actions. Provided that the "attention" and the "context"
are maintained, assembly actions belonging to each assembly motion are discriminated by
the "change". Observable features which directly point to the "changes" can be defined
in terms of temporal changes of simple visual features.
The action concept of LocalMotion is stated as follows:

Attention: The hand, the target object and the held object, if any.
Context: The hand is moving slowly near (toward/away from) the target object.
Change: \Delta holding(Hand, X) classifies PICK/PLACE/NO.

Let us define a 3D convex region called "target region" which covers the block to be picked
up or the space in which the held one will be placed. Then, the indexical feature is defined
as a change of the intersecting volume of the target region and any object. This feature
can be detected by the following procedure: (1) Predict the location and size of the target
region. (2) Project the 3D model of the target region onto each field of view of binocular
stereo to define 2D attention regions. (3) For each view, take two gray level (possibly,
color) snapshots of the attention region at temporal segmentation points before and after
the current action. (4) Differentiate the snapshots and threshold the change of area into
three cases, decreased/increased/no-change, which point to PICK/PLACE/NO 3. Then
verify the 3D location of the change by triangulating the center of mass of the changed
areas.
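Step (4) of this procedure amounts to a thresholded comparison of object area inside the target region, which can be sketched as follows (the binarisation level and the area threshold are our own illustrative choices; blocks are white on a black background in the experiment):

```python
import numpy as np

def classify_local_motion(before, after, area_threshold=200, binarise_at=128):
    """Map the change of object area inside the target region onto PICK / PLACE / NO.

    before, after : grey-level snapshots of the 2D attention region taken at the
                    temporal segmentation points bracketing the current action
    """
    area_before = int(np.count_nonzero(np.asarray(before) > binarise_at))
    area_after = int(np.count_nonzero(np.asarray(after) > binarise_at))
    delta = area_after - area_before
    if delta < -area_threshold:
        return "PICK"    # object area in the target region decreased
    if delta > area_threshold:
        return "PLACE"   # object area in the target region increased
    return "NO"          # no significant change
```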
Conceptual organization of an ALIGN operation (see Fig. 1) is as follows:

Attention: F1 of B1 (moved block), F2 of B2 (target block).
Context: (1) B1 and B2 are identified, (2) the bottom face of B1 and the top face of B2 are
"against" each other, (3) B1 is moving in direction m so that F1 approaches the common plane
with F2.
Change: \neg coplanar(F1, F2) \rightarrow coplanar(F1, F2)

Fig. 1. ALIGN action (leftmost) and indexical features (in round boxes) of coplanar/no-coplanar
relations.

In the context specified above, simple 2D visual features, namely the contour junction types
shown in Fig. 1, directly point to coplanar/no-coplanar states. By tracking only the edges
E1, E2, E3 and checking for the junctions, coplanar states are immediately detected. The
"change" feature can be quickly calculated by differentiating the coplanar/no-coplanar
state at the start/end points of the current action.

3 To simplify the implementation, blocks are painted white and the background is black in our
experiment.

4 Dynamic Attention Control for Action Recognition

The indexicality of observable features is based on correctness of the "attention" and the
"context".
Spatial attention control maintains the "attention" part. It detects the moving hand
by temporal differentiation of images (pre-attentive), tracks the hand and the moving
objects/edges to solve the correspondence problem, and predicts a target region/edge by
a visual search procedure. The visual search starts by setting a stereo-pair of 2D regions
on the "base" of attention maintained by tracking. The regions are progressively moved
in the direction in which the base is moving until they hit an object or the worktable. By
looking up the estimated 3D position in the environment model, a correct target region
is determined. Contextual information is important here, e.g. the environment model is
updated by recognizing previous actions, and whether to set the target region on (in
PICK) or above (in PLACE) the found object depends on which action is expected now.
Temporal attention control maintains the "context" part. When a shift of attention is
detected by a visual search or a pre-attentive routine, the current context is no longer valid,
which means that the indexicality of the currently selected features breaks. This context
switch directly signals a temporal segmentation of an action. Different visual features are
selected for monitoring, and tracking or visual search is invoked/stopped depending on
the context. This is done by selecting nodes of the action model.
Hierarchical attention control stabilizes the overall recognition process. It extracts dif-
ferent types of visual features at different timings in parallel from superimposed regions.
The regions and the features are organized in a hierarchy which corresponds to the levels
of assembly motions. Very coarse features such as temporal differentiation of the whole
view are always monitored, while fine features such as edge junctions are checked only
in FineMotions. The hierarchical parallelism contributes to the robustness: even when
the hand suddenly moves off during a FineMotion, the system readily catches up with the
gross motion. This control scheme is implemented as "attention stack" operations.
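A minimal rendering of such an attention stack is given below; the class, its methods and the level names are illustrative assumptions, since the paper does not spell out the data structure.

```python
class AttentionStack:
    """Coarser attention levels stay on the stack; finer ones are pushed at important
    points and popped ("wound up") when their context ends."""

    def __init__(self):
        self._levels = []    # e.g. [("Transfer", ctx), ("LocalMotion", ctx), ("FineMotion", ctx)]

    def push(self, level, context):
        """Extend the stack when attention shifts to a finer assembly-motion level."""
        self._levels.append((level, context))

    def pop_to(self, level):
        """Wind the stack up to a coarser level, e.g. when the hand suddenly moves off
        during a FineMotion and only the gross motion remains to be tracked."""
        while self._levels and self._levels[-1][0] != level:
            self._levels.pop()

    def current(self):
        return self._levels[-1] if self._levels else None
```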

5 Experiments and Results

An experimental system [10] was built upon the presented theory. It succeeded in rec-
ognizing human assembly actions in real-time. Figure 2 shows a monitor display during
the recognition of an arch construction task. The worker's hand is tracked by multiple
visual windows and an indexical feature is extracted from the target region. Result of
recognition is displayed at the bottom lines. The elapsed time for the whole task was 2
min. Figure 3 shows the coplanar detector at work. In this experiment, the detector was
run continuously to repeatedly check for coplanar relations. The held block was moved
continuously and the detector reported coplanar/no-coplanar discrimination at a speed
of 1 Hz. Other tasks recognized successfully by the system include; (1) pick and place
with ALIGN operation, (2) tower building, (3) inverted arch balanced on a center pillar,
(4) table with four legs.

Acknowledgement

The authors wish to express their thanks to Dr. Masayuki Inaba for his suggestions
and contribution to the base system, and Mr. Tomohiro Shibata for implementing the
coplanar detector.

Fig. 2. Recognizing the task "Build Arch": (1) Target region defined. (2) Differentiation. (3)
Identified action type "place-on-block" displayed.

Fig. 3. Recognizing an ALIGN action. Indexical features displayed as white lines.

References

1. J. O'Rourke and N. J. Badler. Model-based image analysis of human motion using con-
straint propagation. IEEE Trans., PAMI-2(6):522-536, 1980.
2. M. Yamamoto and K. Koshikawa. Human motion analysis based on a robot arm model.
In Proc. of IEEE Conf. CVPR, 664-665, 1991.
3. S. Tsuji, A. Morizono, and S. Kuroda. Understanding a simple cartoon film by a computer
vision system. In Proc. IJCAI-5, 609-610, 1977.
4. R. Thibadeau. Artificial perception of actions. Cognitive Science, 10(2):117-149, 1986.
5. J. Aloimonos and I. Weiss. Active vision. Int. J. of Computer Vision, 333-356, 1988.
6. D. H. Ballard. Reference frames for animate vision. In Proc. IJCAI, 1635-1641, 1989.
7. S. D. Whitehead and D. H. Ballard. Learning to perceive and act. Technical Report 331,
Computer Science Dept., Univ. of Rochester, Rochester, NY 14627, USA, June 1990.
8. Y. Kuniyoshi, M. Inaba, and H. Inoue. Teaching by showing: Generating robot programs
by visual observation of human performance. In Proc. ISIR, 119-126, 1989.
9. J. R. Hobbs. Granularity. In Proc. IJCAI, 432-435, 1985.
10. Y. Kuniyoshi, M. Inaba, and H. Inoue. Seeing, understanding and doing human task. In
Proc. IEEE Int. Conf. Robotics and Automation, 1992.

This article was processed using the LaTeX macro package with ECCV92 style
Real-time Visual Tracking for Surveillance and Path
Planning

R. Curwen, A. Blake, and A. Zisserman

Department of Engineering Science
University of Oxford
Oxford OX1 3PJ, U.K.
Tel. +44-865-273154, Fax +44-865-273908, email rupert@uk.ac.ox.robots.

Abstract. In this paper we report progress towards a flexible, visually


driven, object manipulation system. The aim is that a robot arm with a
camera and gripper mounted on its tip should be able to transport objects
across an obstacle-strewn environment. Our system is based on the anal-
ysis of moving image contours, which can provide direct estimates of the
shape of curved surfaces. Recently we have elaborated on this basis in two
respects. First we have developed real-time visual tracking methods using
"dynamic contours" with Lagrangian Dynamics allowing direct generation
of approximations to geodesic paths around obstacles. Secondly we have
built a 2½D system for incremental, active exploration of free-space.

1 Introduction

Over the last few years, significant advances have been made in estimation of surface
shape from visual motion, that is from image sequences obtained from a moving camera.
By combining differential geometry [6] with spatio-temporal analysis of visual motion [9,
10, 7] it has been shown that local surface curvature can be computed from moving images
[8, 3, 4, 11, 1]. The computation is robust with respect to surface shape, configuration
of the surface relative to the camera and the nature of the camera motion. For example
the ability to discriminate qualitatively between rigid features and silhouettes on smooth
surfaces has been demonstrated [3]. Particularly important for collision-free motion, the
"sidedness" of silhouettes is computed, that is, which side is solid surface and which is
free space.
This paper reports on progress in building a robot with active vision that can ma-
nipulate objects in the presence of obstacles. Our Adept robot has a camera and gripper
on board and is able to make exploratory "dithering" movements around its workspace.
As it moves, it monitors the image motion and deformation of contours in real-time us-
ing parallel dynamic contours. Real-time performance depends on appropriate internal
dynamical modelling with adaptive control of scale. As in earlier versions of our system
[3, 4], contour motion is used to interpret occlusion and surface curvature. More recently
we have built on further features: incremental building of a free-space model, incremental
planning of robot motion and search strategies for navigation.

2 Visual Tracking
The curvature analysis described above relies on the ability to track moving curved
image-contours. We have implemented a series of deformable cubic B-spline "dynamic
contours" that run at video rates using a Transputer network. Points along the dynamic

contour are programmed to have an affinity for image features such as brightness or high
intensity-gradient. The contour can now be used to make curvature estimates in a few
seconds. Substantial improvements in tracking performance are realised when Lagrangian
dynamics formalisms are used to model mass distributed along a dynamic contour which
moves in a viscous medium. We have found that use of large Gaussian blur is unnecessary,
a crucial factor in achieving real-time performance.

2.1 Coupled B-spline model

Defining a dynamic contour with inertia [5] implies a model in which the image velocity
of features is assumed uniform. However, stronger assumptions about the feature can
be incorporated when appropriate, to considerable effect. This is done by making the
further assumption that the feature shape will change only slowly. As an illustration,
with no shape assumption, the contour flows around corners (Fig. la). When the shape
assumption is incorporated it moves almost rigidly with the corner feature (Fig. lb).

Fig. 1. Dynamic contours tracking a corner. In (a) the dynamic contour is flexible, and slides
round the corner as the object moves to the left, whereas the "coupled" dynamic contour in (b)
follows the true motion of the corner.

The shape assumption is imposed by using a pair of coupled B-splines. The first can
be trained by taking the original (uncoupled) dynamic contour and allowing it to relax
onto a feature. Its shape is then frozen and becomes the "template" shape (see also Yuille
et. al. [12]) in the new model. A second B-spline curve, initially an exact copy of the first,
is then spawned and coupled to the template B-spline. The coupling is defined, naturally,

by a new elastic energy term. In the case of first derivative coupling energy, the dynamic
contour behaves like a one-dimensional membrane or string. Second derivative coupling
produces a contour which acts like a one-dimensional thin plate or rod. Suppose the
template contour has control point vector Q_s; then the equations of motion, derived by
an analysis similar to the uncoupled model [5], are:

\ddot{Q} = w_0 (U - Q) - 2\beta_0 \dot{Q} + H_0^{-1} H_1 \left[ w_1 (Q_s - Q) - 2\beta_1 \dot{Q} \right]     (1)

where H_0 and H_1 are constant matrices, simply compositions of B-spline coefficients, and
U is the least squares B-spline approximation to the feature vector. The constants w_0 and
\beta_0 govern elastic attraction towards the feature and velocity damping respectively. The
constants w_1 and \beta_1 govern elastic restoring forces and internal damping respectively.
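For reference, a single integration step of equation (1) can be sketched as below. This is a semi-implicit Euler update, not the implicit scheme run on the transputer network, and the matrix product H_0^{-1} H_1 is assumed to be precomputed.

```python
import numpy as np

def contour_step(Q, Qdot, U, Qs, H0inv_H1, w0, b0, w1, b1, dt):
    """One semi-implicit Euler step of the coupled dynamic-contour equation (1).

    Q, Qdot  : (n, 2) control point positions and velocities
    U        : (n, 2) least-squares B-spline approximation to the image feature
    Qs       : (n, 2) template control points
    H0inv_H1 : (n, n) precomputed product of the B-spline coefficient matrices
    """
    accel = (w0 * (U - Q) - 2.0 * b0 * Qdot
             + H0inv_H1 @ (w1 * (Qs - Q) - 2.0 * b1 * Qdot))
    Qdot_new = Qdot + dt * accel        # update velocities first...
    Q_new = Q + dt * Qdot_new           # ...then positions with the new velocities
    return Q_new, Qdot_new
```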

2.2 Parallel implementation

The equations of motion are integrated using an implicit Euler scheme for speed on a
network of transputers. The dynamic contours are allocated to worker transputers a span
at a time. The individual spans of a contour communicate their contribution to V in (1)
to the rest of the contour at each Euler step. With six worker transputers, Euler steps
can be performed at frame rate (25 Hz). To overcome this problem, three separate frame
grabbers are used, sampling the same video input. Figure 2 shows a sample of frames
from a multiple contour tracking sequence using coupled dynamic contours.
The dynamic contour can successfully track features whose velocity is such that the
lag caused by viscous drag does not exceed the radius of the tracking window. With a
tracking window radius of approximately 35 mrad (in a field of view of 0.3 rad) maximum
tracking velocity is about 1.4 rad/sec, for our system. Note that varying the tracking
window radius is our mechanism for control of scale. For example, a large window is used
during feature capture, getting smaller as the contour locks on.

3 Active Exploration of Free Space

Following the first version of our manipulation system [2], we have investigated a more
complex version of the manipulation problem in which the workspace is cluttered with
several obstacles. In addition to visual and spatial geometry, we now need to add Arti-
ficial Intelligence search techniques (A* search). Path-planning works in an incremental
fashion, repeating a cycle of exploratory motion, clearing a triangular chunk of freespace
(figure 3), and viewpoint prediction. After several of these cycles, over which the robot
"jostles" for the best view of the goal (box lid), it may find itself jammed between ob-
stacles and the edge of its workspace. In that case it backtracks to an earlier point in
its search of freespace and investigates a new path. So far we have demonstrated these
techniques with up to three different, unmodelled obstacles in the workspace.

4 Conclusions

We have discussed the principles and practice of visual manipulation of objects in clutter
from several points of view: tracking, dynamics, spatial geometry and geometry of grasp.
We are currently working to extend this in a variety of ways, including:

Fig. 2. Dynamic contours tracking objects in a scene (raster order). All dynamic contours are
"coupled". Note the top left hand contour falling off its feature as the object moves over the
image boundary.

- Developing more powerful internal models for tracking to improve ability to ignore
clutter. This would enable the robot to perform efficiently with obstacles in the
background as well as in the foreground.
- Employing more sophisticated geometric modelling to allow fine-motion planning that
takes account of the shape of the gripped object, and the fact that the workspace is
3D not 2D.

Acknowledgments
We acknowledge the financial support of the SERC and of the EEC Esprit programme.
We have benefitted greatly from the help and advice of members of the Robotics Research
Group, University of Oxford, especially Professor Michael Brady, Roberto Cipolla and
Zhiyan Xie.

References
1. E. Arbogast and R. Mohr. 3D structure inference from image sequences. In H.S. Baird,
editor, Proceedings of the Syntactical and Structural Pattern Recognition Workshop, pages
21-37, Murray Hill, NJ, 1990.
2. A. Blake, J.M. Brady, R. Cipolla, Z. Xie, and A. Zisserman. Visual navigation around
curved obstacles. In Proc. IEEE Int. Conf. Robotics and Automation, volume 3, pages
2490-2499, 1991.

Fig. 3. Our visual object manipulation system clears triangular chunks of freespace visually, in
an incremental fashion (freespace shown in grey). Accumulated freespace is represented in a 2D
plane as shown, and is actually a projection of 3D obstacles onto a horizontal plane. The robot
plans paths (black lines), restricted to space currently known to be free. Horizontal paths in the
figure are exploratory "dithering" motions performed deliberately to facilitate structure from
motion computations. The state of freespace is shown after navigation around three obstacles,
towards the goal object at the left of the figure. The robot tried the path on the bottom, reached
a fixed search depth-bound, backtracked, and successfully navigated on the top.

3. A. Blake and R.C. Cipolla. Robust estimation of surface curvature from deformation of
apparent contours. In O. Faugeras, editor, Proc. 1st European Conference on Computer
Vision, pages 465-474. Springer-Verlag, 1990.
4. R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Proc. 3rd Int.
Conf. on Computer Vision, pages 616-625, 1990.
5. R.M. Curwen, A. Blake, and R. Cipolla. Parallel implementation of Lagrangian dynamics
for real-time snakes. Submitted to 2nd British Machine Vision Conference, 1991.
6. M.P. DoCarmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976.
7. O.D. Faugeras. On the motion of 3-d curves and its relationship to optical flow. In Pro-
ceedings of 1st European Conference on Computer Vision, 1990.
8. P. Giblin and R. Weiss. Reconstruction of surfaces from profiles. In Proc. 1st Int. Conf.
on Computer Vision, pages 136-144, London, 1987.
9. H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proc.
R. Soc. Lond., B208:385-397, 1980.
10. S.J. Maybank. The angular velocity associated with the optical flow field arising from
motion through a rigid environment. Proc. Royal Society, London, A401:317-326, 1985.
11. R. Vaillant. Using occluding contours for 3d object modelling. In O. Faugeras, editor, Proc.
1st European Conference on Computer Vision, pages 454-464. Springer-Verlag, 1990.
12. A.L. Yuille, D.S. Cohen, and P.W. Hallinan. Feature extraction from faces using deformable
templates. Proc. CVPR, pages 104-109, 1989.

This article was processed using the LaTeX macro package with ECCV92 style
Spatio-temporal Reasoning within a Traffic
Surveillance System

A.F. Toal and H. Buxton


Queen Mary and Westfield College,
London, E1 4NS, England.
paddy@uk.ac.qmw.dcs
hilary@uk.ac.qmw.dcs

Abstract. The majority of potential vision applications such as robotic
guidance and visual surveillance involve the real-time analysis and descrip-
tion of object behaviour from image sequences. In the VIEWS project we are
developing advanced visual surveillance capabilities for situations where the
scene structure, objects and much of the expected behaviour is known. This
combines competences from image understanding, knowledge-based process-
ing and real-time technology. In this paper we discuss the spatio-temporal
reasoning which is of central importance to the system allowing behavioral
feedback. In particular, we will elaborate the analysis of occlusion behaviour
where we need knowledge of the camera geometry to invoke the occlusion
region monitoring of vehicles plus knowledge of the scene geometry to main-
tain high-level models of possible trajectories for the occluded vehicles and
to recognise the re-emerging vehicle(s).

1 Introduction and Background

A new emphasis in vision research is to produce conceptual descriptions of the behaviour
of objects for dynamic scenes, for example [Thibadeau '86, Nagel '88, Buxton and Walker '88,
Mohnhaupt and Neumann '90, Howarth and Toal '90]. To al-
low such understanding typically involves a knowledge based approach to computer vi-
sion where the knowledge specifies the models of the events and behaviours as well as
the tasks that the system is to perform. The ESPRIT II VIEWS (Visual Inspection and
Evaluation of Wide-area Scenes) project has the aim of demonstrating the feasibility of
knowledge-based computer vision for real-time surveillance of well-structured, outdoor
dynamic scenes.
The VIEWS project is supporting the development of both generic techniques and se-
lected application systems which meet the specific requirements for airport stand surveil-
lance, ground traffic control and road traffic incident detection. We have developed two
major run-time components, the Perception Component (PC) and the Situation Assess-
ment Component (SAC), to fulfill the two main competences required for traffic under-
standing: first, the ability to detect and recognise the vehicle types and trajectories in
the PC; and, second, the ability to understand the situation as it develops over time in
terms of vehicle behaviour and interactions in the SAC. In addition, we need a control
component for real-time control of asynchronous data streams and an application sup-
port component to assist in the acquisition of scene, object and behavioral knowledge
and to precompile appropriate visual strategies. The project addresses the integration of
* This research was funded as part of ESPRIT-2 P2152 'VIEWS' Project.

these components. Figure 1 illustrates a simplified functional architecture for these two
main modules of our computer vision system.

Fig. 1. Functional Architecture        Fig. 2. Bremer Stern: Tracking updates

The main exemplar running through this paper will be based around tracking of a
sequence which includes the 3 frames appended to the paper. These 3 frames are from
an image sequence taken at the Bremer Stern roundabout in Germany that we have
used as an exemplar for the VIEWS project. Consider the vehicles where a) the lorry
starts to occlude the saloon car, b) the car is fully occluded by the lorry, and c) the car
starts to re-emerge from behind the lorry. The car then moves off. Processing at both the
perceptual and conceptual levels needs spatio-temporal calculations. At the perceptual
level, the detection, tracking and classification of the visible moving objects is performed.
Some of the updates for the model based tracking of the occlusion frames are displayed
in figure 2. This figure displays vehicles as points projected onto the groundplane of the
roundabout which is a complex environment segmented into road, bicycle and tram lanes
(cutting across the middle of the roundabout). Figure 2 also shows the relationship of
the camera field of view to the groundplane; this is important for occlusion reasoning, as
will be discussed later.
The information extracted by the PC is given as updates at different levels of analysis
and speed of processing (but always as estimates with respect to the ground plane) to the
SAC. The basic form is as < label, position, time > updates. The SAC then processes the
information first, at the level of events such as stopping/starting and entering/exiting re-
gions with an additional local check on the spatio-temporal continuity. Second, the global
consistency checking constructs and maintains the space-time histories using behavioural
constraints in space and time from the context of other vehicles in the locality. Finally,
the SAC can also perform selected behavioral prediction and feedback which, for this full
occlusion, must maintain the occlusion relationships for recognition of the re-emerging
vehicle and give feedback both within the SAC and to the PC so that the relabelling of
the vehicle and constructing a consistent history can be performed.
Any vision based tracking system must incorporate some means of overcoming occlu-
sion problems. Even a simple task like counting the number of vehicles passing through
the scene requires occlusion reasoning so that re-emerging vehicles are not counted twice.

In this paper, we will present results from a case study where the role of the SAC
is primarily to act as a high level, long term memory keeping explicit representations of
total occlusions occurring in dynamic scenes and utilizing behavioural knowledge to aid
relabelling.

2 Analogical Representation
The analogical spatial representation underlying the behavioural models described later
is a flexible, multi-purpose representation which maintains the structure of the world in
a usable manner. The key requirements for the static knowledge in this representation
are: (1) the conversion of the ground plane geometry into meaningful regions; and (2)
the description of the connectivity of these regions. The objective of the surveillance
is to reason about the vehicle in the scene, so we also require a means of representing
dynamically: (3) spatial extent and edges of vehicle ground plane bounding boxes to
determine occupied regions; and (4) velocity and orientation to derive behaviour. When
we also include vehicle interactions, we require (5) inter-vehicle orientation and inter-
vehicle distance.
The analogical representation provides a uniform framework in the SAC, incorporat-
ing the static scene knowledge and dynamic vehicle histories. The static knowledge is
attached to regions. Regions, on the groundplane, are constructed from cell sets. For ex-
ample, these can express: (1) the types of vehicle that use a region (eg roads, cycle-lane,
tram-line); (2) regions of behavioural significance (eg giveway, turning); (3) direction in-
formation (eg lane leading onto roundabout); and (4) the basic connectivity of the leaf
regions.
Cells are also used to represent dynamic vehicle histories. The analogical spatio-
temporal representation is shown in figure 3. Calculation based on relating vehicle histo-
ries can fully capture the semantics of behaviours like 'following', 'overtaking', 'crossing',
'queueing' etc. An example of the dynamic relation 'following' is given in figure 5 (overlap in
spatial history within some time delay). Given the dynamic and static knowledge bases,
we can usually determine what a vehicle is doing.
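As a hypothetical sketch of the 'following' test just described (the names, the data layout and the 50% overlap threshold are illustrative assumptions, not the VIEWS implementation):

    def follows(history_a, history_b, max_delay):
        """'Following' test: vehicle b revisits cells that vehicle a occupied a short
        time earlier.  Histories are assumed to be lists of (time, cell) pairs."""
        visited_a = {cell: t for t, cell in history_a}
        delays = [t_b - visited_a[cell]
                  for t_b, cell in history_b
                  if cell in visited_a and 0 < t_b - visited_a[cell] <= max_delay]
        return bool(history_b) and len(delays) >= 0.5 * len(history_b)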

Fig. 3. Spatio-Temporal Representation        Fig. 4. Completed S-T Histories



Fig. 5. Spatial & Temporal Histories for 'follow'

3 Occlusion

The Perception Component incorporates a range of vision tracking competencies. Its pri-
mary task is motion focused tracking and classification of vehicles. The specific tracking
competencies are detailed elsewhere ([Sullivan et al. 90]). The VIEWS configuration we
are considering is as follows:

- 'Gated' initialization of tracking: Static areas of the image where vehicles are known
to appear are highlighted for track initialization.
- PC < 1 second vehicle memory: The PC is dealing with video rate input, and once a
vehicle track is lost attempts to relocate the vehicle are swiftly curtailed.
- Motion cued tracking: pixel grouped motion in the image is the primary cue for
processing. Vehicles which halt and remain stationary for extended periods may be
'forgotten'. This problem may be handled as a virtual occlusion by the techniques
presented here.

In the case of tracking being lost by total occlusion, some recovery mechanism is
needed so any new tracks initialized by the PC are correctly labelled as continuations.
In the short term, tracking algorithms attempt recovery by focusing processing, directed
by calculations extrapolating the "lost" vehicle's velocity. Occluded vehicles manoeuvre
unpredictably, which makes simple velocity-based calculation a poor approximation in the
long term.
The SAC maintains occluded-by relationships. The boundary of a lead vehicle in the
image defines the area of potential emergence. The SAC occlusion reasoning aids cor-
rect relabelling after emergence and indicates to the PC which vehicles to monitor for
boundary deformations caused by emerging vehicles. Occlusions are so common that these
techniques are designed to be of low cost.

3.1 The example

Examine the example occlusion sequence, frames are shown at the end of the paper and
tracks plotted in figure 2. A saloon travels behind a lorry, is occluded for some time and

finally emerges. When the saloon emerges, it is labelled as a new track by the PC. The
minimum competences for the SAC are to spot the occlusion occurring, maintain the
occlusion relationships, relabel the saloon correctly upon emergence and complete its
spatio-temporal history.
While the reconstruction of the specific < label, position, time > track is impossible,
the form of the spatio-temporal history can be bounded in a useful manner (see figure 4).

3.2 Flagging Occlusion


A cheap robust method of candidate generation is needed; several are possible (eg. from
the merging of coherent motion 'envelopes' generated by the primary tracker). Early
experiments operated well and cheaply by casting potential shadows as shown in fig-
ure 6. Currently more accurate partial occlusion relations developed in the 3D model
matcher are used. The model matcher is described in Marslin, Sullivan & Baker '91
[Marslin et al. '91] (see also: ECCV-92).

Fig. 6. <theta, depth>        Fig. 7. SAC Occlusion Output

3.3 During Occlusion
In the example, partial occlusion of the saloon by the lorry (see first frame), develops
into full occlusion (see frame-2). After the update for frame-96 the SAC generates the
message shown in figure 7 and asserts the relationship occluded-by(096, 5, <8>). The
lorry is 'object5', the saloon 'object8'. The first field is either the time from which the
occlusion is active, or a range indicating the period of time the occlusion is considered to
have lasted. Once an occluded-by relationship is created the SAC must be able to reason
about its maintenance and development.
During a typical occlusion development:

- Vehicles may emerge from occlusion.


- Vehicles may join the occlusion.
- The lead vehicle may itself become occluded.
889

- A vehicle may leave shot never having emerged from occlusion. Once the occluding vehicle
leaves shot, all occluded vehicles must do so.
- When not themselves becoming totally occluded, other vehicles may enter into partial
occlusion relationships with the lead vehicle and then separate. In this case, it is
possible that occluded vehicles move from one vehicle 'shadow' to another.
In the next stage, 'relabelling', it will be argued that behavioral knowledge is required
to help disambiguate labelling problems as vehicles emerge. This is because a total oc-
clusion may be of arbitrary length and over a long time almost any reordering of vehicles
could occur. No processing is expended on reasoning about the development of the rel-
ative positions of vehicles until they emerge. Behavioral relationships established prior
to the occlusion eg. following(cara, carb), are stored for consideration during relabelling.
But the consideration of the development of relationships is not useful during the actual
time of occlusion.
A grammar is used to manage the development of occlusions. Here ta : tb is the time range
from ta to tb, and vj ::<list> represents the union of vj and <list>; <list> ⊖ vj represents
<list> with vj removed. ⊕ is an exclusive-or, such that occluded-by(tj,vm,cara) ∧
occluded-by(tj,vl,cara) cannot both be true. Unless otherwise stated ti is before tj and, where used,
ti < tk < tj.

Existence... Does occlusion exist?
    occluded-by(ti,vl,∅) → no occlusion
Emergence... What may emerge next?
    occluded-by(ti,vl,<list>) → next_emerge(tj,vl) ∈ <list>
    If occluded-by(ti,vl,<list>) ∧ emerge-from(tj,vl,vold)
        → occluded-by(tj,vl,<list> ⊖ vold) ∧ occluded-by(ti : tj,vl,vold)
Joining... Has a vehicle joined an existing occlusion?
    occluded-by(ti,vl,<list>) ∧ occluded-by(tj,vl,vnew)
        → occluded-by(ti,vl,vnew ::<list>)
Lead Occluded... Has an occluding vehicle been occluded?
    occluded-by(ti,vl,<list>l) ∧ occluded-by(tk,vnew,<list>new) ∧ occluded-by(tj,vnew,vl)
        → occluded-by(tj,vnew,vl ::<list>new ::<list>l)
    ti & tk are not ordered; tj follows both.
Leave Scene... Anything occluded leaves shot with the lead!
    occluded-by(ti,vl,<list>) ∧ left-shot(ti,vl)
        → left-shot(ti,<list>)
Visible Interaction... If shadows meet, the occluder may change!
    occluded-by(ti,vl,<list>) ∧ (partially-occludes(tk,vl,vm) ∨ partially-occludes(tk,vm,vl))
        → occluded-by(tj,vl,<list>l) ⊕ occluded-by(tj,vm,<list>m)
    where <list>l ::<list>m = <list>
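As a purely illustrative sketch (the function and variable names are assumptions, not part of the VIEWS system), the joining and emergence rules above might be maintained as follows:

    # Illustrative bookkeeping for the rules above (not the VIEWS implementation).
    occlusions = {}   # occluder label -> (start_time, list of hidden vehicle labels)
    history = []      # completed occluded-by(t_i : t_j, occluder, vehicle) facts

    def flag_occlusion(t, occluder, hidden_vehicle):
        start, hidden = occlusions.get(occluder, (t, []))
        if hidden_vehicle not in hidden:
            hidden.append(hidden_vehicle)            # the "joining" rule
        occlusions[occluder] = (start, hidden)

    def emerge(t, occluder):
        """A new track appears beside the occluder: relabel it as a hidden vehicle."""
        start, hidden = occlusions[occluder]
        old_label = hidden.pop(0)                    # default ordering; behavioural
                                                     # knowledge may reorder candidates
        history.append((start, t, occluder, old_label))
        occlusions[occluder] = (t, hidden)
        return old_label                             # relabelling fed back to the PC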

The SAC maintains the consistency of the options with constraint reasoning. Updates
from the PC are compared to these possibilities. In our example, the initial relationship
generates expectations:

occluded-by(096, 5, <8>) → next_emerge(tj, 5) ∈ <8>

Later the creation of a new track for 'vehicle-10' matches the suggested emergence of
the occluded saloon (see figure 7). This matches the following rule:

occluded-by(096, 5, <8>) ∧ emerge-from(108, 5, 8)
    → occluded-by(108, 5, ∅) ∧ occluded-by(096 : 108, 5, <8>)

In the example the relabelling is unique (10 = 8) and no additional occlusion relationships
exist (→ ∅). The relabelling and the fact the lorry no longer occludes anything
are passed to the PC. This relabelling case is the simplest possible. More complex cases
and behavioural evaluation, will be considered in the following section.

3.4 Emergence
The remaining occlusion reasoning in the SAC can be summarised as two capabilities:

(1) Relabel While the actual moment of emergence is not possible to predict, this
does not mean no useful inferences can be made. When a new track is initialized,
indicated by a new vehicle label in the VIEWS system, it is compared with currently
maintained occlusion relationships. If it matches, the new label is replaced by the
matching previous label for both the PC and SAC.
(2) History Completion As complete histories are preferable for behavioral evalua-
tion, the SAC needs access to the last and the newest position so the complete
history is 'extruded' as shown in figure 4 to fill in the trajectory of the occluded
vehicle.

Two more advanced considerations:


1. If a vehicle is known to be occluding another, currently only the identities are passed
to the PC. With both enough geometric knowledge (eg. the ground to the right of the
lorry is grass not road) and behavioral knowledge (eg. that a vehicle is 'turning off'),
some consideration of where on the lead vehicle's boundary a vehicle may emerge is
possible. Currently only preliminary results of such a study are available and they
will not be presented here.
2. Complex cases of history completion occur. Eg. where a vehicle turns a corner and a
simple direct link of updates crosses pavement or grass. Analogical representations are
well suited to this form of path completion problem. Cells store localised knowledge
about what type of space they represent (eg. roadway or grass). Deforming a suggested
completion to conform to expectation is possible with techniques reported elsewhere
(eg. [Steels '88, Shu and Buxton '90]).

4 More complex relabelling

Consider the case where two vehicles (cara and carb) are occluded by the same lorry.
When the PC indicates a new vehicle track the following cases, generated by the grammar
given earlier, are all initially plausible, since the 'emergence' may be caused by the vehicles'
motion relative to the camera or each other:

occluded-by(lorry, <cara, carb>) → visible(lorry) ∧
1. next_emerge(lorry, cara) ∧ occluded-by(lorry, <carb>)
2. next_emerge(lorry, cara) ∧ occluded-by(cara, <carb>)
3. next_emerge(lorry, carb) ∧ occluded-by(lorry, <cara>)
4. next_emerge(lorry, carb) ∧ occluded-by(carb, <cara>)

The temporal fields are omitted here for simplicity.


Although all cases must be considered, the SAC should provide some ordering on the op-
tions. This is where behavioral knowledge may be utilized. Eg. where a following(cara, carb)
relationship was established prior to the occlusions, the default for the SAC is to main-
tain behavioral relationships unless given specific contradictory evidence, and interpret
the evolving scene accordingly.
The job of the SAC is twofold:

1. To produce consistent histories and alert the PC to any inconsistency in data pro-
vided.
2. To produce ongoing behavioural evaluations; both for the end user and to supply a
source of knowledge for defaults in the production of consistent histories.

In the introduction an example of 'following' behaviour was demonstrated, which
could be deduced by comparing the analogical spatial and temporal histories (ie. the car
'following' must overlap the trail of the lead car in the same temporal sequence). Such
definitions have proved a very natural expression of behavioural concepts. Others which
have been investigated include: turning (off one road-instance onto another), crossing (of
two vehicles), giveway (at a junction), potential intersection (for risk evaluation) and
overtaking.
We have demonstrated techniques which allow the production of consistent spatio-
temporal histories under conditions of total long term occlusion. The final completed
histories in figure 4 are only a limited example, but even they can allow for the final
deduction that the saloon car, having been occluded by the lorry, finally overtakes it.
The techniques used to produce these final behavioural comparisons will be reported in
[Toal and Buxton '92].

5 Conclusion

In summary we have discussed some of the competences developed for the specific applica-
tion of road traffic surveillance. The main competence elaborated here is to deal
with total occlusions, in order to develop consistent spatio-temporal histories for use in
behavioral evaluation, eg. following, queue formation, crossing and overtaking. Specifi-
cally we have proposed a grammar for the handling of occlusion relationships that allows
us to infer correct, consistent labels and histories for the vehicles. This was illustrated
with a simple example, although the information made available is sufficient to de-
duce that the saloon car, having been occluded by the lorry, overtakes it. In addition we
also described the analogical representation that supports the contextual indexing and
behavioural evaluation in the scene.
Work is continuing on more complex behavioural descriptions and their interaction
with more complex occlusions as well as generating more constrained expectations of
vehicles on emergence. In particular, it is important to look at the trade-offs between
behavioural and occlusion reasoning, decision speed and the accuracy of predicted re-
emergence.

6 Acknowledgements
We would especially like to thank Simon King & Jerome Thomere at Framentec for
much discussion of ideas about the SAC and behavioural reasoning, and Geoff Sullivan
and Anthony Worrall at Reading University for discussion of the interaction with the
PC, and we would like to thank Rabin Ezra for access to his boundless knowledge about
PostScript. In addition, we would like to thank all the VIEWS team for their work in putting
together this project and the ESPRIT II programme for funding.

References
[Buxton and Walker '88] Hilary Buxton and Nick Walker, Query based visual analysis: spatio-temporal
        reasoning in computer vision, pages 247-254, Image and Vision Computing 6(4), November 1988.
[Fleck '88a] Margaret M. Fleck, Representing space for practical reasoning, pages 75-86, Image
        and Vision Computing, volume 6, number 2, May 1988.
[Howarth and Toal '90] Andrew F. Toal, Richard Howarth, Qualitative Space and Time for Vision,
        Qualitative Vision Workshop, AAAI-90, Boston, 1990.
[Mohnhaupt and Neumann '90] Michael Mohnhaupt and Bernd Neumann, Understanding object
        motion: recognition, learning and spatiotemporal reasoning, FBI-HH-B-145/90,
        University of Hamburg, March 1990.
[Nagel '88] H.-H. Nagel, From image sequences towards conceptual descriptions, pages 59-74,
        Image and Vision Computing, volume 6, number 2, May 1988.
[Steels '88] Luc Steels, Step towards common sense, VUB AI lab. memo 88-3, Brussels, 1988.
[Thibadeau '86] Robert Thibadeau, Artificial perception of actions, pages 117-149, Cognitive
        Science, volume 10, 1986.
[Sullivan et al. 90] Technical Report DIO$, "Knowledge Based Image Processing", G. Sullivan,
        Z. Hussain, R. Godden, R. Marslin and A. Worrall. Esprit-II P2152 'VIEWS', 1990.
[Marslin et al. '91] R. Marslin, G.D. Sullivan, K. Baker, Kalman Filters in Constrained Model
        Based Tracking, pp 371-374, BMVC-91.
[Shu and Buxton '90] C. Shu, H. Buxton, "A parallel path planning algorithm for mobile robots",
        Proceedings: International Conference on Automation, Robotics and Computer Vision,
        Singapore, 1990.
[Toal and Buxton '92] A.F. Toal, H. Buxton, "Behavioural Evaluation for Traffic Surveillance
        using Analogical Reasoning and Prediction", in preparation.

This article was processed using the LaTeX macro package with ECCV92 style
Template Guided Visual Inspection
A. Noble, V.D. Nguyen, C. Marinos,
A.T. Tran, J. Farley, K. Hedengren, J.L. Mundy
GE Corporate Research and Development Center
P.O. Box 8, 1 River Road Schenectady, NY. 12301. USA.

Abstract. In this paper we describe progress toward the development of
an X-ray image analysis system for industrial inspection. Here the goal is
to check part dimensions and identify geometric flaws against known toler-
ance specifications. From an image analysis standpoint this poses challenges
to devise robust methods to extract low level features; develop deformable
parameterized templates; and perform statistical tolerancing tests for ge-
ometry verification. We illustrate aspects of our current system and how
knowledge of expected object geometry is used to guide the interpretation
of geometry from images.

1 Introduction
Automatic visual inspection is a major application of machine vision technology. How-
ever, it is very difficult to generalize vision system designs across different inspection
applications because of the special approaches to illumination, part presentation, and
image analysis required to achieve robust performance. As a consequence it is necessary
to develop such systems almost from the beginning for each application. The resulting
development cost prohibits the application of machine vision to inspection tasks which
provide a high economic payback in labor savings, material efficiency or in the detection
of critical flaws involving human safety.
The use of Computer Aided Design (CAD) models has been proposed to derive the
necessary information to automatically program visual inspection [16, 2]. The advantage
of this approach is that the geometry of the object to be inspected and the tolerances
of the geometry can be specified by the CAD model. The model can be used to derive
optimum lighting and viewing configurations as well as provide context for the application
of image analysis processes.
On the other hand, the CAD approach has not yet been broadly successful because
images result from complex physical phenomena, such as specular reflection and mu-
tual illumination. A more significant problem limiting the use of CAD models is that
the actual manufactured parts may differ significantly from the idealized model. During
product development a part design can change rapidly to accommodate the realities of
manufacturing processes and the original CAD representation can quickly become obso-
lete. Finally, for curved objects, the derivation of tolerance offset surfaces is quite complex
and requires the solution of high degree polynomial equations [3].
An alternative to CAD models is to use an actual copy of the part itself as a reference.
The immediate objection is that the specific part may not represent the ideal dimensions
or other properties and without any structure it is impossible to know what attributes of
the part are significant. Although the part reference approach has proven highly successful
in the case of VLSI photolithographic mask inspection [5, 14] it is difficult to see how
to extend this simple approach to the inspection of more complex, three dimensional,
manufactured parts without introducing some structure defining various regions and
boundaries of the part geometry. The major problem is the interpretation of differences

between the reference part and the part to be inspected. These differences can arise from
irrelevant variations in intensity caused by illumination, uniformity or shadows. Even if
the image acquisition process can be controlled, there will be unavoidable part-to-part
variations which naturally arise from the manufacturing process itself, but are irrelevant
to the quality of the part.
In the system to be described here we combine the best features of the CAD model
and part reference approaches by introducing a deformable template which is used to
automatically acquire the significant attributes of the nominal part by adapting to a large
number of parts (e.g. 100). The template provides a number of important functions:

- Part feature reference coordinates
- Feature tolerances for defining flaw conditions
- Domains for the application of specialized image feature extraction algorithms.

In Section 2 we consider the theoretical concepts which determine the general struc-
ture of a constraint template. Section 3 describes the general design and principal al-
gorithms used in the current prototype inspection system. Experimental results demon-
strating constraint templates applied to X-ray images of industrial parts are given in
Section 4. We conclude in Section 5.
2 Constraint Templates
We have based the design of our inspection system on the definition of a template which
consists of a set of geometric relationships which are expected to be maintained by any
correctly manufactured instance of a specific part. It is important to emphasize that
the template is a generic specification of the entire class of correct instances which can
span a wide range of specific geometric configurations. We accommodate these variations
by solving each time for an instance of the template which satisfies all of the specified
constraints, while at the same time accommodating for the observed image features which
define the actual part geometry. Currently, the system is focused on single 2D views of
a part, such as X-ray projections. However, there is no limitation of the general concept
to a single image, so that multiple 2D views or 3D volume data could be interpreted by
a similar approach.
More specifically, the template is defined in terms of the following primitive geometric
entities: point, conic (ellipse, circle, hyperbola, line), and Bezier curve. These primitive curve
types can be topologically bounded by either one or two endpoints to define a ray or
curve segment. The geometric primitives are placed in the template in the context of a
set of geometric relationships. The set of geometric constraints available are as follows:

Incident Two geometric entities have at least one point in common.


Coincident Two entities have exactly the same descriptive parameters. For example, two points
are at the same location.
Location Two points are constrained to be a fixed distance apart. Or more generally, the
position of two entities is constrained by a distance relation.
Angle The relative orientation of two entities is fixed.
Parallel A specific case of angle, i.e. 0°.
Perpendicular A specific case of angle, i.e. 90°.
Symmetry Symmetry can be defined with respect to a point or a line. That is, a reflection
across the point or line leaves the geometric figure unchanged.
Tangent Continuity Two curve primitives are constrained to have equal tangents at a point.
Equal Size The size of a primitive is defined more formally below but is essentially the scale
of the entity. This constraint maintains two primitives to have equal scale factors.
895

Size in Ratio Two primitives have some fixed ratio in scale factor.
Linear Size A set of entities are related by a linearly varying scale factor. This relationship is
often observed in machine parts.
Linear Spacing The distance between a set of entities varies linearly over the set. Again, this
constraint is motivated by typical part geometries.
2.1 The Configuration Concept
We have developed the concept of the configuration which provides a systematic approach
to the symbolic definition of geometric entities and many of the geometric relationships
just defined [12]. The geometric constraints are ultimately represented by a system of
polynomials in the primitive shape variables and the constraint parameters.
Except for scalar measures such as length and cosine, all geometric entities are repre-
sented by configurations, which have parameters for the location, orientation, and size of
the primitive shapes. Symbolically, these slots are represented by 2D vectors of variables.
The location of a shape is described by l^T = (l_x, l_y). This location is usually the center
or the origin of the local frame of the primitive shape. The orientation of a shape in the
plane is described by an angle θ, or by a unit vector o^T = (o_x, o_y) = (cos θ, sin θ). The
latter is used to avoid trigonometric functions and to use only polynomial functions of
integer powers. The size of a shape, like for an ellipse, is represented by a vector having 2
scale factors, (k_x, k_y), along the major and minor axes. To avoid division of polynomials,
the inverse of the size is represented: for example, k^T = (k_x, k_y) = (a^-1, b^-1) for an
ellipse.
The configuration is an affine transformation matrix representing the translation,
rotation, and scaling from the local coordinate frame (X, Y) of the shape to the image
frame (x, y):

    ( X )   ( k_x  0  ) (  cos θ   sin θ ) [ ( x )       ]
    ( Y ) = ( 0   k_y ) ( -sin θ   cos θ ) [ ( y )  - l  ]
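A small sketch of how such a configuration might be evaluated numerically, under the reading of the equation given above (the function name and the image-to-local direction are assumptions):

    import numpy as np

    def configuration(lx, ly, theta, kx, ky):
        """Configuration from location (lx, ly), orientation theta and inverse size
        factors (kx, ky); returns a function mapping image (x, y) to the local frame."""
        scale = np.diag([kx, ky])
        rot = np.array([[np.cos(theta),  np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])
        return lambda x, y: scale @ rot @ (np.array([x, y]) - np.array([lx, ly]))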

3 System Design
3.1 Philosophy
The inspection system operates in one of two functional modes: inspection template
acquisition mode or part inspection mode (Figure 1). Inspection template acquisition
involves the derivation of a constraint template which encapsulates the expected geometry
of a "good" part. Initially a template is created manually by a user through a graphical
interface with the aid of blueprint specifications or inspection plans. Once a template
is created, the system is run on a suite of images of "good" parts to refine the nominal
template parameters and provide statistical bounds on parameter values. The end result is
a template description which includes correction for inaccurate placement of primitives in
the initial template creation process and which accurately reflects the true part geometry.
Part inspection involves making decisions about whether parts contain defects. For
example, parts must not contain flaws produced by poor drilling and part dimensions
must satisfy geometric tolerance specifications. In terms of image analysis tasks this pro-
cess involves first extracting empirical geometric features from the image data via image
segmentation and local feature parameterization. Global context for decision-making is
provided via the inspection template which is deformed to the empirical primitives by
first registering the template to image features and then applying nonlinear optimization
techniques to produce the "best-fit" of the template to the empirical features. Finally, the
deformed template description is used for verification of part feature dimensions and to
provide the context for the application of specialized algorithms for characterizing local
flaws.

Fig. 1. Flowchart of critical components of the inspection system.

Fig. 2. (a) A simplified example of requirements for an inspection template, and (b) a snapshot
in the process of template construction illustrating the introduction of a constraint after line
primitive creation.

3.2 System Components

Essentially the inspection system can be divided into four functional modules: template
creation; image feature extraction; template refinement and flaw decision-making.
Template Creation. A simplified example to illustrate the requirements for an inspec-
tion template is shown in Figure 2a. The general template creation process involves first
specifying a set of geometric primitives and then establishing the relationships between
them. In our system this is achieved using a graphical template editing tool which allows
the user to build a template composed of a selection of the 4 types of geometric primitive
specified in section 2, which are related by any of 12 possible constraint types. Figure 2b
illustrates a "snap-shot" view in creating a template.
Image Segmentation. The extraction of geometric primitives is achieved using a mor-
phology-based region boundary segmentation technique. Details of this algorithm can
be found elsewhere [13]. This algorithm locates, to pixel accuracy, boundary points on
either side of an edge as half boundaries which are 4-connected pixel chains. A typical

output from the algorithm is shown in Figure 3b where both edges of the regions are
highlighted.
To detect subtle changes in image geometry and to achieve accurate feature param-
eterization we have implemented a subpixel residual crossing localization algorithm. A
morphological residual crossing is defined as the zero-crossing of the signed max dilation-
erosion residue, fmaxder(f) [13]:

|fmaxder(f)| = max[ |fer(f)|, |fdr(f)| ]

where fer(f) = f − f ⊖ B, fdr(f) = f ⊕ B − f, f is an image, B a structuring set, f ⊕ B
and f ⊖ B are dilation and erosion respectively, and the sign of the residual satisfying the
magnitude condition is attached to fmaxder. Subpixel residual zero-crossings are found
using the following algorithm:
1. First, the residual values are interpolated by a factor 2 using a 7x7 pixel separable
cubic spline interpolation filter [10]. This is done in the neighborhood of each point
which belongs to a pixel accurate region boundary contour.
2. Then, the residual crossing locations are found from the interpolated max dilation-
erosion residue responses. This is achieved using a modified version of the predicate-
based algorithm proposed by Medioni and Huertas for locating zero-crossings of the
Laplacian of the Gaussian operator[8]. The result is an 8-connected single-pixel wide
edge map with boundaries located to 0.5 pixel accuracy, Figure 3c.
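A minimal sketch of the signed max dilation-erosion residue defined above, assuming SciPy's grey-scale morphology and a square structuring set; the sign convention (dilation side positive, erosion side negative) is an assumption:

    import numpy as np
    from scipy.ndimage import grey_dilation, grey_erosion

    def signed_maxder(f, size=(3, 3)):
        """Signed max dilation-erosion residue over a square structuring set.
        Magnitude is max(|erosion residue|, |dilation residue|); the winning
        residue's side supplies the sign."""
        f = np.asarray(f, dtype=float)
        er = f - grey_erosion(f, size=size)     # f - (f eroded by B)
        dr = grey_dilation(f, size=size) - f    # (f dilated by B) - f
        return np.where(np.abs(dr) > np.abs(er), dr, -er)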
In the application, drilled hole features appear as dark elongated intensity regions
in an image, of width approximately 4 pixels. They are extracted using the following
algorithm. First, the image is enhanced by applying a morphological closing residue
operator [9], using a disk of radius 5 pixels. Then, region boundary segmentation is applied
to the filtered image. Figure 3b shows the closed contours detected to pixel accuracy
where both sides of the edge between the two regions are highlighted. Subpixel precision
edge locations are shown in Figure 3c.
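For the drilled hole enhancement step, a closing residue ('black top-hat') with a disk footprint might look as follows; the function name and the way the disk is built are illustrative:

    import numpy as np
    from scipy.ndimage import grey_closing

    def enhance_dark_holes(image, radius=5):
        """Morphological closing residue with a disk footprint.  Dark, thin structures
        such as drilled holes respond strongly; the disk radius (5 pixels here) must
        exceed the hole width (about 4 pixels)."""
        image = np.asarray(image, dtype=float)
        y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
        disk = (x * x + y * y) <= radius * radius
        return grey_closing(image, footprint=disk) - image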
Empirical Feature Construction and Feature Correspondence. The objective
here is to derive a geometric representation from the image data which can be associated
with the constraint template primitives. We view the features which are extracted by
image segmentation to be empirical versions of the ideal template primitives.
In the current implementation, correspondence is carried out as a search for the closest
image feature to each template primitive. The distance measure in use is the Euclidean
distance from the center of gravity of the image feature to the origin of the primitive's
local reference frame (i.e. the location of the primitive's configuration). Although this
correspondence method admittedly lacks the robustness required in general inspection
applications, and depends upon fairly good image registration, it provides sufficiently
accurate results in the case of the inspection task, and has the additional benefit of low
computational overhead. In future work the correspondence problem will be considered
in more detail. We expect to employ correspondence techniques which are specialized to
each geometric primitive type.
Once correspondence has been established, a set of empirical primitives is produced
by fitting to the image feature pixel locations. The fitting procedure used is in general
determined by the template primitive type. In the current version of the system we
use eigenvectors of the feature point scatter matrix to derive the empirical geometric
parameters [4] and first threshold out small features. The philosophy in use here is that
it is preferable to let a missing feature signal a flaw of omission rather than attempting
to interpret an inappropriate geometry.
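A sketch of deriving empirical line parameters from the eigenvectors of the feature point scatter matrix; the returned quantities and names are assumptions:

    import numpy as np

    def fit_line_primitive(points):
        """Empirical line parameters from feature pixel locations.
        points: (N, 2) array of (x, y).  Returns the centroid, the line direction
        (principal eigenvector of the scatter matrix) and an RMS width estimate."""
        points = np.asarray(points, dtype=float)
        c = points.mean(axis=0)
        d = points - c
        scatter = d.T @ d / len(points)
        evals, evecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
        return c, evecs[:, -1], np.sqrt(max(evals[0], 0.0))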

Fig. 3. Drilled hole segmentation: (a) original; (b) region boundary segmentation; (c) subpixel
localization of boundaries. In (b) both region edges have been marked which explains the ap-
pearance of the boundaries as thick edges.

Constraint Solver. The goal of the constraint solver is to solve the problem of finding
an instance of the inspection template which satisfies all of the geometric constraints
defined by the template and, at the same time, minimizes the mean-square error between
the template primitives and the image features. The mean-square error can be expressed
as a convex function of the template parameters and a geometric description of the image
features.
Theoretical details of the approach can be found in [12]. Briefly, the two goals of
finding the global minimum of a convex function, ∇f(x) = 0, and satisfying the con-
straints, h(x) = 0, are combined to give a constrained minimization problem. A linear
approximation to this optimization problem is:

∇²f(x) dx = −∇f(x)
∇h(x) dx = −h(x)     (2)

Since the two goals cannot in general be simultaneously satisfied, a least-square-error
satisfaction of ∇f(x) = 0 is sought. The constraint equations are multiplied by a factor
√c, which determines the weight given to satisfying the constraints versus minimizing the
cost function. Each iteration of (2) has a line search that minimizes the least-square-error:

m(x) = |∇f(x)|² + c |h(x)|²     (3)

which is a merit function similar to the objective of the standard penalty method.
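One linearized iteration of (2), with the constraint rows weighted by √c as in (3), might be sketched as follows (the callable names are assumptions; the line search on the merit function is omitted):

    import numpy as np

    def constrained_step(x, grad_f, hess_f, h, jac_h, c):
        """One linearized step of (2) with constraint rows weighted by sqrt(c) as in (3).
        grad_f, hess_f : callables giving the gradient and Hessian of the cost f
        h, jac_h       : callables giving constraint residuals and their Jacobian
        c              : weight on constraint satisfaction."""
        w = np.sqrt(c)
        A = np.vstack([hess_f(x), w * jac_h(x)])
        b = -np.concatenate([grad_f(x), w * h(x)])
        dx, *_ = np.linalg.lstsq(A, b, rcond=None)
        return x + dx        # a line search on the merit function m(x) would follow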
Verification. The output from the constraint solver is a set of deformed primitives which
can be used for one of two purposes: either to further refine the parameter values and tol-
erances of the inspection template, or for flaw decision-making. For example, the derived
parameters from the deformed primitives can be compared to the template parameters to
detect geometric flaws such as inaccurate drilled hole diameters. The deformed inspection
template primitives can also provide the context for applying specialized algorithms for
characterizing shape and intensity-based properties of subtle flaws. Although the detec-
tion of flaws is not the focus of this paper, preliminary results of flaw analysis will be
illustrated in the experiments described in the next section.
4 Experiments
In this section, we present results from our current working system in action. This sys-
tem has been designed using object-oriented methodology and implemented on SPARC
workstations using the C++ language and the X-based graphics toolkit InterViews.
Template Creation. First, in Figure 2b, we illustrate the process of template creation.
The set of configurations in the template contains a number of lines and a number of
points. These geometric entities (or rather their counterparts in the image) are subjected
to a number of constraints. The original template and the template after deformation
are shown in figure 4.

Fig. 4. Template creation and solving for the best fit. (a) Template specification ; (b) Template
after best fit.

Feature Tolerance Measurement. Next, we consider using inspection templates to
acquire statistical tolerance information. Average values and variations of length (ie size)
and location parameters were collected for the 16 horizontal drill holes of a sample set
of 10 "good" parts using the same inspection template. Table 1 shows the results for
a selection of holes: numbers 4, 6, 8 and 14 from the top. Here the lengths have been
normalized by the lengths output from the constraint solver. A histogram plot of the

Table 1. Normalized nominal values and tolerances for geometric measurements collected over
a sample set of 10 images using an inspection template.

measurement                     average   s.d.    max     min
length of hole 4                0.997     0.011   1.020   0.974
length of hole 6                0.995     0.062   1.010   0.971
length of hole 8                0.995     0.021   1.018   0.971
length of hole 14               0.998     0.011   1.018   0.984
hole length for sample set      0.998     0.010   1.023   0.971
outer boundary orientation      1.017     0.1     1.220   0.807
sum of hole spacing             0.998     0.002   1.002   0.996
hole separation                 1.000     0.018   1.132   0.956

normalized lengths for the 10 parts is shown in Figure 5a. Table 1 also shows statistics
for the sum of hole spacings, hole separation and the orientation of the outer boundary.
These global measurements were specified in the template by linear spacing and linear
size constraints. As the table shows, the agreement between the template model and
image data is very good. This indicates the template accurately represents both critical
local and global geometric parameters.

Geometric Flaw Detection. Finally, we illustrate how a template can be used to
detect geometric flaws. An image of an industrial part containing a flaw was analyzed by
the system using the same inspection template as above. The normalized lengths of the
drill holes were recorded as before. Figure 5b shows the updated histogram plot. As seen
from the histogram, one sample (unshaded) is more than 2σ from the sample average.
This sample corresponds to the defective drill hole (known as underdrill).
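A sketch of the 2σ test implied here (the names and the exact decision rule are assumptions):

    def flag_underdrill(lengths, nominal, sigma, k=2.0):
        """Flag normalized hole lengths more than k standard deviations from the
        nominal value learned from the 'good' sample set (k = 2 as used here)."""
        return [i for i, length in enumerate(lengths) if abs(length - nominal) > k * sigma]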

Fig. 5. Histogram plots of (a) the lengths of holes for a sample set of 10 good parts. Lengths
have been normalized by the template values; and (b) the updated histogram where the samples
from a part containing a defect have been added to the sample set. The unshaded sample is more
than 2σ from the sample average.

5 Discussion
To summarize, this paper has described progress toward the development of a geometry-
based image analysis system based on the concept of a deformable inspection template.
We have described aspects of our approach, some of the key components of our inte-
grated system and presented results from processing experimental data using our current
implementation.
Our approach differs from elastic 'snake' based techniques [1, 6] and intensity-based
deformable parameterized contours [15] and templates [7] in a number of respects. First,
we use geometric primitives rather than intensity based features as subcomponents to
build the template, although the constraint solving machinery could be modified to handle
this case. Second, a key idea our work addresses is how to use a deformable template for
quantitative interpretation as opposed to feature extraction. Finally, our scheme allows
for the derivation of generic deformable parameterized templates, which is clearly a major
benefit for fast prototyping of new inspection algorithms.
References
1. Burr, D.J.: A Dynamic Model for Image Registration, Computer Vision, Graphics, and
Image Processing, 1981, 15,102-112.
2. Chen, C., Mulgaonkar, P.: CAD-Based Feature-Utility Measures For Automatic Vision
Programming, Proc. IEEE Workshop Auto. CAD-Based Vision, Lahaina HI, June 1991,106.
3. Farouki, R.: The approximation of non-degenerate offset surfaces, Computer Aided Geom-
etry Design, 1986, 3:1, 15-44.
4. Horn, B.K.P.: Robot Vision, McGraw-Hill, New York, 1986.
5. Huang, G.: A robotic alignment and inspection system for semiconductor processing, Int.
Conf. on Robot Vision and Sensory Control, Cambridge MA, 1983, 644-652.
6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models, Int. J. of Computer
Vision, 1988, 1:4, 321-331.
7. Lipson, P., et al.: Deformable Templates for Feature Extraction from Medical Images, Proc.
Europ. Conf. on Computer Vision, Antibes France, April 1990, 413-417.
8. Medioni, G., Huertas, A.: Detection of Intensity Changes with Subpixel Accuracy using
Laplacian-Gaussian Masks, IEEE PAMI, September 1986, 8:5,651-664.
9. Maragos, P., Schafer, R.W.: Morphological Systems for Multidimensional Signal Processing,
Proc. of the IEEE, April 1990, 78:4, 690-710.
10. Nalwa, V.S.: Edge Detector Resolution Improvement by Image Interpolation, IEEE PAMI,
May 1987, 9:3, 446-451.
11. Nelson, G.: Juno, a Constraint-Based Graphics System, ACM Computer Graphics, SIG-
GRAPH '85, San Francisco CA, 1985, 19:3, 235-243.
12. Nguyen, V., Mundy, J.L., Kapur, D.: Modeling Generic Polyhedral Objects with Con-
straints, Proc. IEEE Conf. Comput. Vis. & Patt. Recog., Lahaina HI, June 1991, 479-485.
13. Noble, J.A.: Finding Half Boundaries and Junctions in Images, Accepted Image and Vision
Computing, (in press 1992).
14. Okamoto, K., et al.: An automatic visual inspection system for LSI photomasks, Proc. Int.
Conf. on Pattern Recognition, Montreal Canada, 1984, 1361-1364.
15. Staib, L.H., Duncan, J.S.: Parametrically Deformable Contour Models, Proc. IEEE Conf.
Comput. Vis. & Patt. Recog., San Diego CA, June 1989, 98-103.
16. West, A., Fernando, T., Dew, P.: CAD-Based Inspection: Using a Vision Cell Demonstrator.
Proc. IEEE Workshop on Auto. CAD-Based Vision, Lahaina HI, June 1991, pp 155.
Hardware Support for Fast Edge-based Stereo

Patrick Courtney, Neil A. Thacker and Chris R. Brown


Artificial Intelligence Vision Research Unit, University of Sheffield, Sheffield S10 2TN, England

Abstract. This paper concerns hardware support for a fast vision engine
running an edge-based stereo vision system. Low level processing tasks are
discussed and candidates for hardware acceleration are identified.

1 Introduction and abstract

AIVRU (Artificial Intelligence Vision Research Unit) has an evolving vision system,
TINA, which uses edge-based stereo to obtain 3D descriptions of the world. This has been
parallelised to operate on a transputer-based fast vision engine called MARVIN (Mul-
tiprocessor ARchitecture for VisioN) [9] which can currently deliver full frame stereo
geometry from simple scenes in about 10 seconds. Such vision systems cannot yet offer
the thoughput required for industrial applications and the challenge is to provide real-
time performance and be able to handle more complex scenes with increased robustness.
To meet this challenge, AIVRU is constructing a new generation vision engine which will
achieve higher performance through the use of T9000 transputers and by committing
certain computationally intensive algorithms to hardware.
This hardware is required to operate at framerate in the front-end digitised video
pathways of the machine with output routed (under software control) to image memories
within the transputer array. The current generation of general purpose DSP (Digital
Signal Processing) devices offers processing throughputs up to 100Mips but many low
level tasks require 100 or more operations per pixel, and at a framerate of 10MHz this
is equivalent to over 1000Mops. Four low level processing tasks have been identified
as candidates for framerate implementation: image rectification, ground plane obstacle
detection (GPOD), convolution and edge detection (see [3] for more detail).

2 Image Rectification and Ground Plane Obstacle Detection

Image rectification is a common procedure in stereo vision systems [1] [8]. Stereo match-
ing of image features is performed along epipolar lines. These are not usually raster lines
but the search may be performed rapidly if the images are transformed so that a raster
in one image is aligned with a raster in the other. This also permits increased spatial par-
allelism since horizontal image slices distributed to different processors need less overlap.
The current method is to rectify image feature positions rather than each pixel, since
there are generally far fewer image features than pixels. However, as image features are
added and more complex scenes analysed, this becomes less practical.
Obstacle detection is required as a major component of the zero order safety compe-
tence for AGVs (Autonomously Guided Vehicles), when it is necessary to detect, but not
necessarily recognise, putative obstacles. This would have to be performed very rapidly
to permit appropriate action to be taken. Mallot [7] pointed out that if an image were
projected down into the ground plane (at an angle to the image plane), it should be

the same as another image of the same ground plane taken from a different viewpoint,
but that if there were some object outside the ground plane, such as an obstacle, the
projected images would be different and subtracting one from the other would reveal
that obstacle. The image plane warping offered by an image rectification module would
support such an operation by using different transform parameters.
At each point in the transformed image, it is necessary to compute the corresponding
position in the input image. It has been shown that this process is equivalent to effecting
a 3 by 3 homogeneous coordinate transform [5]:

λ (x1  y1  1) = (x2  y2  1) H

from pixel (x2, y2) in the output image to pixel (x1, y1) in the original image, where H is
the 3 by 3 matrix of transform parameters and λ a homogeneous scale factor. This
computation, solved to yield x1 and y1, is nonlinear and the few commercially available
image manipulation systems are unable to cope with this nonlinearity. A mapping lookup
table could be used but in a system with variable camera geometry, this would be time
consuming to alter, so the transform must be computed on the fly.
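A sketch of the per-pixel transform, assuming a 3x3 matrix T as above; the division by the homogeneous component is the nonlinearity referred to:

    import numpy as np

    def rectify_coords(x2, y2, T):
        """Map an output (rectified) pixel (x2, y2) back to source image coordinates.
        T is a 3x3 homogeneous transform; dividing by the third component is the
        nonlinearity that rules out simple linear warping hardware."""
        u, v, w = T @ np.array([x2, y2, 1.0])
        return u / w, v / w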
The accuracy required of the image rectification process is determined by the accuracy
with which features may be detected in images. We work with just two features at present:
Canny edgels [2]; and Harris corners [6]; both of which offer subpixel location: edgels down
to 0.02 pixels repeatability, falling to about 0.1 pixels on natural images; and corners to
about 1/3 pixel. Propagating this through the transform equation suggests that as many
as 22-24 bits of arithmetic are required for the normal range of operations.
High resolution image processing using subpixel acuity places severe demands on the
sensor. Whereas CCD imaging arrays may be accurately manufactured, lenses of similar
precision are very costly. Reasonable quality lenses give a positional error of 1.25 to
2.5 pixels, though this can be corrected to first order by adjusting the focal length. To
improve upon this, a simple radial distortion model has been used to good effect. It is
necessary to apply distortion correction to the transformed coordinates before they are
used to select pixels in the source image. The most general way to implement this is to
fetch offsets from large lookup tables addressed by the current coordinates.
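A sketch of precomputing such offset tables for a simple radial model (the model form x_u = x_d(1 + k1 r^2) and the parameter names are assumptions, not the calibration used in the system):

    import numpy as np

    def build_distortion_offsets(width, height, cx, cy, k1):
        """Per-pixel correction offsets for a simple radial model
        x_u = x_d * (1 + k1 * r^2), with (cx, cy) the distortion centre."""
        ys, xs = np.mgrid[0:height, 0:width].astype(float)
        dx, dy = xs - cx, ys - cy
        r2 = dx * dx + dy * dy
        return k1 * r2 * dx, k1 * r2 * dy   # offsets added to the transformed coordinates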
As exact integer target pixel coordinates are transformed into source pixel coordinates,
the resulting address is likely to include a fractional part, and the output pixel will
have to incorporate contributions from a neighbourhood. The simplest approach, interpolation
over a 2x2 window, is inaccurate in the case of nonlinear transforms. Our
approach is to try to preserve the 'feature spaces'. Edgels are extracted from gradient
space and corners from second difference space, and if these spaces are preserved, the
features should be invariant under warping. An attempt is made to preserve the gradients
by computing the first and second differences around the point and interpolating using
these. This requires at least a 3x3 neighbourhood to compute second differences. Six
masks are needed in all [5]. Computing each for a given subpixel offset at each point
would be very costly in terms of hardware, but the masks may be combined to form a
single mask (by addition) for any given x and y subpixel offsets. The coefficients of the
desired mask are selected using the fractional part of the transformed pixel address and
the set of 9 pixel data values is selected using the integer part. This method was evaluated
by comparing explicitly rectified line intersections with those extracted from the rectified
and anti-aliased image. Tests indicate that subpixel shifts are tracked with a bias of less
than 0.05 pixels and a standard deviation of 0.103 pixels when using a limited
number (64) of mask sets. Since grey level images are being subtracted during GPOD,
signal-to-noise ratio is important and this is affected by the number of masks used. Tests
indicate that of a maximum of 58dB for an 8 bit image, 57dB can be obtained using 6
mask selection bits (64 masks) on test images.
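The six masks of [5] are not reproduced here. Purely as an illustration, a quadratic (Taylor-series) interpolator built from central first- and second-difference masks has the same structure: six 3x3 masks combined by weighted addition for a given subpixel offset, precomputable into a 64-entry table indexed by 3 fractional bits per axis. The specific difference masks below are assumptions:

```python
import numpy as np

# Illustrative value/first/second-difference masks (the actual six are in [5]).
VAL = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], float)
DX  = np.array([[0, 0, 0], [-0.5, 0, 0.5], [0, 0, 0]], float)
DY  = DX.T
DXX = np.array([[0, 0, 0], [1, -2, 1], [0, 0, 0]], float)
DYY = DXX.T
DXY = 0.25 * np.array([[1, 0, -1], [0, 0, 0], [-1, 0, 1]], float)

def combined_mask(fx, fy):
    """Add the six masks, weighted by the subpixel offset (fx, fy), into a
    single 3x3 mask (a quadratic/Taylor interpolator)."""
    return (VAL + fx * DX + fy * DY
            + 0.5 * fx * fx * DXX + 0.5 * fy * fy * DYY + fx * fy * DXY)

# A hardware-style table of 64 masks, indexed by 3 fractional bits per axis.
MASK_TABLE = {(i, j): combined_mask(i / 8.0, j / 8.0)
              for i in range(8) for j in range(8)}
```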
The overall block diagram of the image rectification module, including the lens distortion
correction table and the mask table, is shown in figure 1.

[Figure: blocks comprise a TRANSFORM PROCESSOR driven by the transform parameters, a LENS DISTORTION CORRECTION TABLE supplying offsets, an integer/fraction split of the transformed address (x2, y2), the INPUT IMAGE BUFFER, and the MASK TABLE producing the result.]
Fig. 1. Block diagram of Image Rectification Module.

3 Edge Detection

On the MARVIN system, approximately 5 seconds of the 10-second full-frame stereo processing
time are spent performing edge extraction. Object tracking at 5 Hz has been demonstrated,
but for this the software has to resort to cunning but fragile algorithmic short-cuts [9]. If
the edge extraction process could be moved into hardware, full-frame stereo throughput
and tracking robustness would be significantly increased.
The current TINA implementation of Canny's edge detection algorithm [2] extracts
edge element strength, orientation and subpixel position by: convolution with a Gaussian
kernel; computation of gradients in x and y; extraction of the edgel orientation from the
x and y gradients using the arctangent; extraction of the edgel gradient from the x and y
gradients by Pythagoras; application of nonmaximal suppression (NMS) based on local
edgel gradient and orientation in a 3x3 neighbourhood, suppressing the edgel if it is not a
maximum across the edgel slope; computation of edgel position to subpixel acuity using
a quadratic fit; and linking of edgels thresholded with hysteresis against a low and a high
threshold. These linked edgels are then combined into straight line segments
and passed on to a matching stage prior to 3D reconstruction. Despite several attempts
by other researchers to build Canny edge detectors, such hardware is still unavailable [3].
Gaussian smoothing may be performed using a convolution module. Convolution is
well covered in standard texts and may be simply performed using commercially available
convolution chips and board level solutions from a number of sources. A prototype dual
channel 8x8 convolution board has been constructed using a pair of Plessey PDSP16488
chips. Since edge detection most commonly uses a σ of 1.0, an 8x8 neighbourhood would be
adequate. Tests with artificial images indicate that at least 12 bits are necessary from the
output of the convolution to ensure good edgel location. Real images gave worse results,
and 15-16 bits would be preferred. Note that not all commercial convolution boards are
able to deliver this output range. The x and y gradients may easily be obtained by
buffering a 3x3 neighbourhood in a delay line and subtracting north from south to get
the y gradient dy, and east from west to get the x gradient dx.
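A software sketch of these two stages, assuming a 9-tap separable Gaussian of σ = 1.0 and the north/south, east/west differencing just described (SciPy's convolve1d stands in for the convolution board; kernel length and normalisation are assumptions):

```python
import numpy as np
from scipy.ndimage import convolve1d

def smooth_and_gradients(img, sigma=1.0, radius=4):
    """Separable Gaussian smoothing followed by 3x3 central differences:
    dy = south - north, dx = east - west (as in the delay-line scheme)."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()
    s = convolve1d(convolve1d(img.astype(float), g, axis=0), g, axis=1)
    dy = np.zeros_like(s)
    dx = np.zeros_like(s)
    dy[1:-1, :] = s[2:, :] - s[:-2, :]   # south minus north (rows increase downwards)
    dx[:, 1:-1] = s[:, 2:] - s[:, :-2]   # east minus west
    return s, dx, dy
```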
The edgel orientation φ is normally computed from the x and y gradients using the
arctangent. This is quite poorly behaved with respect to the discrete parameters dx and dy,
and these have to be carefully manipulated to minimise numerical inaccuracies. This is
achieved by computing the angle over the range 0 to 45 degrees, swapping dx and
dy to ensure that dy is the largest, and expanding over the whole range later on. Using
a lookup table for the angle with full-range dx and dy would require a 30 bit address,
which is beyond the capability of current technology. The maximum reasonable table
size is 128k, or 17 address bits. However, the address range may be reduced by a form of
cheap division using single-cycle barrel shifting. Since shifting guarantees that the most
significant bit of dy is 1, this bit is redundant, permitting an extra dx bit to be used in
address generation, thus extending the effective lookup table size by a factor of 2. This
has been shown to result in an error of less than 0.45 degrees, which is smaller than the
quantisation error due to forcing 360 degrees into 8 bits of output [4].
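A sketch of the octant folding that makes the lookup feasible; math.atan stands in for the barrel-shift-addressed table of [4], and the bit widths and output quantisation are omitted:

```python
import math

def edgel_orientation(dx, dy):
    """Fold (dx, dy) into the 0-45 degree octant before the arctan lookup,
    then unfold to the full 0-360 degree range."""
    adx, ady = abs(dx), abs(dy)
    if adx == 0 and ady == 0:
        return 0.0
    # Octant fold: the lookup only ever sees a ratio in [0, 1].
    if ady <= adx:
        a = math.degrees(math.atan(ady / adx))          # stands in for the LUT
    else:
        a = 90.0 - math.degrees(math.atan(adx / ady))
    # Quadrant unfold from the signs of dx and dy.
    if dx >= 0 and dy >= 0:
        return a
    if dx < 0 and dy >= 0:
        return 180.0 - a
    if dx < 0 and dy < 0:
        return 180.0 + a
    return 360.0 - a
```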
For the gradient strength, the Pythagorean root of the sum of squares is quite difficult to
perform in hardware. The traditional hardware approach is to compute the Manhattan
magnitude (i.e. the sum of moduli) instead, but this results in a large error of about 41%
when dx = dy. A somewhat simpler technique is to apply a correction factor directly by
dividing the maximum of dx and dy by cos φ to give:

G = max(|dx|, |dy|) · (1 / cos φ)

This factor 1/cos φ may be looked up from the barrel-shifted dx and dy as before, and it has
been shown that this would result in an error in the gradient G of less than 0.8% [4].
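The same folded angle supplies the correction factor. In floating point the result below is exact; the hardware instead tabulates 1/cos φ from the barrel-shifted operands, giving the quoted error of under 0.8%:

```python
import math

def gradient_magnitude(dx, dy):
    """max(|dx|, |dy|) divided by the cosine of the folded (0-45 degree) angle;
    equal to sqrt(dx^2 + dy^2) in exact arithmetic."""
    big, small = max(abs(dx), abs(dy)), min(abs(dx), abs(dy))
    if big == 0:
        return 0.0
    phi = math.atan(small / big)     # folded angle in radians
    return big / math.cos(phi)       # the hardware looks up 1/cos(phi) instead
```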
Nonmaximal suppression is a fairly simple stage at which the central gradient magnitude
is compared with those of a number of its neighbours, selected according to the
edgel orientation. Quadratic fitting is performed to obtain an offset del which may be
computed from the central gradient c and the two neighbours a and b by:

del = (a - b) / (2((a + b) - 2c))
Due to the division, the offset is gradient-scale invariant over a wide range. The three
gradients may be barrel shifted as before. The terms a - b, a + b and thence a + b - 2c
may be computed using simple adders, and the subpixel offset looked up from a single
128k lookup table. The x and y subpixel offsets may then be computed from this value
to give an x offset if the edgel is roughly vertical and a y offset if the edgel is roughly
horizontal. This again may be accomplished with a small lookup table.
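A sketch of the suppression test and quadratic fit for one edgel, with the two neighbours a and b assumed already selected according to the orientation (neighbour selection and the x/y split are as described in the text, not shown here):

```python
def nms_with_subpixel(a, c, b):
    """Suppress the centre gradient c unless it is a maximum across the edgel
    slope (neighbours a and b along the gradient direction); if kept, return
    the quadratic-fit offset del = (a - b) / (2((a + b) - 2c))."""
    if c < a or c < b:
        return None                          # not a maximum: suppressed
    denom = 2.0 * ((a + b) - 2.0 * c)
    return (a - b) / denom if denom != 0 else 0.0
```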
Other implementations of hysteresis edge linking have used an iterative algorithm requiring
5-8 iterations [4]. A simpler solution is to perform thresholding against the low threshold
in hardware but leave the linking and hysteresis to the software. The computational
overhead for hysteresis should be small, as it is simply a case of comparing the fetched
gradient with zero (to check that it is valid, which has to be performed anyway) or
with the high threshold, depending on the branch taken in the linking algorithm. The
hysteresis problem therefore disappears.
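A sketch of the software side under this split, assuming the hardware has already zeroed gradients below the low threshold; the breadth-first growth below is one possible linking strategy, not the TINA implementation, and the list-of-lists gradient map is an assumption for brevity:

```python
from collections import deque

def hysteresis_link(grad, high):
    """Grow edgels from seeds at or above the high threshold through any
    non-zero (i.e. low-threshold-surviving) 8-connected neighbours."""
    h, w = len(grad), len(grad[0])
    keep = [[False] * w for _ in range(h)]
    stack = deque((y, x) for y in range(h) for x in range(w) if grad[y][x] >= high)
    for y, x in stack:
        keep[y][x] = True
    while stack:
        y, x = stack.pop()
        for ny in range(max(0, y - 1), min(h, y + 2)):
            for nx in range(max(0, x - 1), min(w, x + 2)):
                if grad[ny][nx] and not keep[ny][nx]:   # non-zero passed the low threshold
                    keep[ny][nx] = True
                    stack.append((ny, nx))
    return keep
```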
Tests of the Canny hardware design, carried out with synthetic circle images providing
all possible edgel orientations, gave a pixel accuracy of better than 0.01 pixels worst case
and a standard deviation of 0.0025 pixels, even with a contrast as low as 16 grey levels.
This is far superior to the current repeatability of 0.02 pixels.

4 Conclusions

Image rectification is a common, well-understood but costly task in computer vision, and it is
well suited to hardware implementation. A scheme for correcting arbitrary lens distortions
was proposed. Anti-aliasing may be performed by convolving the 3x3 neighbourhood
with a set of precomputed masks. Using 64 masks in each axis should be large enough to
ensure good line fitting and signal-to-noise ratio for GPOD. The surface fitting scheme
proved to preserve line intersections with a variance of better than 0.1 pixels and a
bias of less than 0.05 pixels on artificial images. This is rather poorer than the Canny
edge detector is capable of, but adequate for later stages, and may be improved using
alternative anti-aliasing methods. Experiments with the simulated hardware rectification
on sample images of a simple object produce good 3D geometry. Such a module would
be useful in other domains, notably realtime obstacle detection, where speed is of the
utmost importance and the use of realtime hardware is mandatory. Convolution can usually
be handled using available products. Canny edge detection is a stable algorithm and requires
relatively little additional hardware to obtain framerate performance.
AIVRU's intention is to build the next generation of fast vision engine using T9000
transputers as the network processor and to integrate this with a series of framerate
boards such as the ones just described to obtain full frame stereo in under a second.

5 Acknowledgements
Thanks to Phil McLauchlan and Pete Furness for insightful comments and kind support.

References

1. Ayache, N.: Artificial Vision for Mobile Robots. MIT Press (1991)
2. Canny, J.F.: A Computational Approach to Edge Detection. IEEE PAMI-8 (1986) 679-698
3. Courtney, P.: Evaluation of Opportunities for Framerate Hardware within AIVRU. AIVRU Research Memo 51 (November 1990)
4. Courtney, P.: Canny Post-Processing. AIVRU Internal Report (June 1991)
5. Courtney, P., Thacker, N.A. and Brown, C.R.: Hardware Design for Realtime Image Rectification. AIVRU Research Memo 60 (September 1991)
6. Harris, C. and Stephens, M.: A Combined Corner and Edge Detector. Proc. 4th Alvey Vision Conference, Manchester, England (1988) 147-151
7. Mallot, H.A., Schulze, E. and Storjohann, K.: Neural Network Strategies for Robot Navigation. Proc. nEuro '88, Paris. G. Dreyfus and L. Personnaz (Eds.) (1988)
8. Mayhew, J.E.W. and Frisby, J.P.: 3D Model Recognition from Stereoscopic Cues. MIT Press (1990)
9. Rygol, M., Pollard, S.B. and Brown, C.R.: MARVIN and TINA: a Multiprocessor 3-D Vision System. Concurrency 3(4) (1991) 333-356

This article was processed using the LaTeX macro package with ECCV92 style
