
Table of Contents

About this Page
Machine Vision in Brief
What is Machine Vision?
General-Purpose Vision
Sensors Used
Spot Sources
Cameras
Laser Scanners
Echoic Triangulation
Primitive Visual Features
Edge Detection
Regions and Flood-Fill
Texture Analysis
Two Dimensional Perception
Pixel Pattern Matching
Big Blobs
Point Orientation
2D Feature Networks
Three Dimensional Perception
Lasers and Direct 3D Perception
Binocular Vision
Final Thoughts


About this Page


I'm a lay artificial intelligence researcher. I occasionally get
interested in machine vision. At various times, I've searched online
for information about the subject, but it can be difficult. There are
quite a few sites that have links to other sites. And there are
conferences, commercial products, and other such things that can be
found. But it's hard to find much to bring it all together.

Figure: Head of a robot endowed with vision.

This page sets out to bring some of these resources together. I don't
have the time to create a truly exhaustive resource. My hope, then,
is to make this a decent starting point for others looking for
background information. To that end, I'm organizing information by
category and trying to provide summaries, speculations, and other
opinions.

One other goal I have for this page is to demystify machine vision.
The popular press has a habit of making the products of machine
vision research look far more impressive than they often actually
are. Even someone like me with a little knowledge of how a lot of
the techniques work can easily be fooled by a new trick into thinking
the field is much farther along than it really is. And for various
reasons, the web sites of research projects or commercial products
don't often reveal much about the techniques used. Admittedly,
some of my explanations are speculations based on what I find
online and sometimes just reckoned by asking myself, "how would I
do that?"

I invite you to let me know of your own work. I especially welcome
information about current and historically significant research
projects, but I also welcome information from the private sector
about new or significant products under development or in use
today. And feel free to let me know if you find any of my
explanations inaccurate or incomplete. Send me email about your
projects, products, and thoughts.

Machine Vision in Brief


What is Machine Vision?
General-Purpose Vision

What is Machine Vision?


"Machine vision" is a field of study and
technology whose goal is to endow
machines with the ability to perceive
selective aspects of the world using
visual means.

I apologize if this sounds like a circular definition. One can easily
get lost in a particular concept or technology when trying to define
machine vision. Perhaps it's best to start with a more ostensive
definition, then.

Figure: Facial measures used in a biometrics vision system.

Those of us fortunate enough to have functional eyes have an
incredible ability to perceive and understand the world through
them. Engineers have long sought to endow machines with this same
capability. It's easy to assume that this just means duplicating in
machines the mechanisms people use, but that's not all there is to
it. Some techniques involve projecting and reflecting laser beams off
distant targets, for example, which is very different from how you
and I work. Some systems can read and understand information in
bar codes or other special constructs that are difficult for humans to
deal with.

Most importantly, few techniques being researched or in use today
really resemble the awesome complexity and flexibility available to
humans. We MV researchers have our own bag of tricks. It may be
that some day we bring all those tricks together and find we can
make machines "see" as well as or even better than humans do, but
we're nowhere near there yet.

All practical machine vision systems in use today exist for their own
specific purposes. Some are used to ensure that parts coming off
assembly lines are manufactured correctly. Some are used to detect
the lines in a road for the benefit of cars that drive themselves.
Though some interested parties claim otherwise, there are no
general purpose vision systems, either in laboratories or on the
market.

If it sounds like it's difficult to define machine vision, don't fret. The
point is that the field of machine vision is not simply interested in
duplicating human vision. What is essential is the basic goal of
visual perception: the ability to "understand" the world visually, well
enough to move about in and interact with a complex, ever-changing
world and to discern the information in the environment that matters
to whatever is doing the seeing.

General-Purpose Vision
As mentioned above, all practical machine vision end products
available now are for specific purposes. I contrasted that with
general purpose machine vision. Let me define what that means,
then.

I'll start, again, with an ostensive model. Human vision is general
purpose. In our everyday experiences, we see a rich panoply of
things in all sorts of lighting conditions. We are able to operate well
in almost any circumstance in which there's even a modest amount
of light entering our eyes and which isn't damaging them.

Merely being able to see light is nearly useless, though. The best
video cameras today are still just recording or transmission devices;
they don't do anything else practical with the light they capture. By contrast, we are
poor recording and transmission devices. It's our faculties for visual
perception that distinguish us. So let's talk about what we do with
the visual information we can see.

We can recognize the boundaries between objects. We can
recognize objects. We can recognize the repetitions that compose
both simple and rich textures. We can intuit the nature and location
of light sources without seeing them directly. We can recognize the
three dimensional nature of the things we see. We can see how
things are connected together and how larger objects are
subdivided into smaller ones. We can recognize that the two halves
of a car on either side of a telephone pole are actually parts of a
single car that is behind the pole. We can tell how far away things
are. We can detect the motion of objects we see. We can recognize
complex mechanisms with lots of moving parts as components of
single larger objects and distinguish them from the backdrop of the
rest of the world. We can even recognize a silver pitcher amidst a
noisy background as a thing unto itself, even though we only see
the reflections of that background.

Perhaps the most interesting feature of human vision that
distinguishes it from most machine vision techniques crafted to date
is that we can deal very well with novel situations. A new car you've
never seen before is still obviously a car because it looks like a car.
You instantly catalog novel objects and register essential
differences.

How does one distill all this down into a clear definition, then? What
is general purpose machine vision? I think it's best to define it in
terms of a set of core goals. A machine can be said to have general
purpose machine vision if it can:

1. Construct a 3D model of the open space within its visual field
sufficient for movement within that space and interaction with
the objects within it
2. Distinguish most any whole object, especially a complex
moving one, from the rest of a visual field
3. Recognize arbitrarily complex textures as continuous surfaces
and objects
4. Have a hierarchic way of characterizing all the objects within a
scene and their relative positional and connectivity
relationships to one another
5. Characterize a novel object using a three dimensional
animated model composed of simpler primitives and be able to
recognize that object in most any orientation
6. Be able to recognize and separate objects in a wide variety of
lighting conditions, including complex arrangements of
shadows
7. Be able to separate and recognize objects that are transparent,
translucent, or reflective, given sufficient visual cues

There are probably other milestones one could add to this, but it
seems a pretty lofty set for now.

Sensors Used
It's a given that if a machine is to see, it must have sensors that
enable it to see. Here are some examples of the kinds of devices
being used today.

Spot Sources
Cameras
Laser Scanners
Echoic Triangulation
Spot Sources
The most basic kind of sensor that can be used for vision is one that
sees only a single "pixel". A photoelectric cell -- in this case, a
photoresistor -- like the one in the figure at right is an example. Note
that while, when you think of pixels, you probably think of very
small parts of a larger picture, I don't necessarily mean it in that
sense here. An entire picture can be composed of a single pixel.
What matters is the field of view of the imaging sensor. In the case
of many photoelectric cells, for instance, the field of view might
include up to half the full sphere of the view around the sensor.
Narrowing the field of view of a photocell like this is a simple
matter. One could, for example, put a small box over the cell and
drill a small hole in it so only light coming from a source in the
direction of that hole can get to the sensor.

Figure: A photoresistor for detecting light.

In keeping with the idea that an electronic eye need not be limited
to working like our eyes, let's consider some other kinds of spot-
source sensors. One technique involves a speaker outputting a very
high frequency tone and using a microphone to pick it up. The closer
or larger a nearby object is, the more it will reflect that sound and
hence the stronger will be the signal to the microphone. A laser
beam and a photoelectric cell can serve a similar purpose. In
addition to sensing differences in intensity, they can also be used to
determine how long it takes for the signal to get from emitter to
detector and thus determine the distance to one or more objects.
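
To make the time-of-flight idea concrete, here's a tiny Python sketch
of my own (not any particular product's code) that turns a measured
round-trip time into a distance. The propagation speeds are the only
physics involved; everything else is arithmetic.

```python
# Convert a round-trip echo time into a distance. Works for any
# emitter/detector pair (ultrasonic, laser, radar) given the speed at
# which the signal travels.

SPEED_OF_SOUND_M_S = 343.0          # in air, roughly room temperature
SPEED_OF_LIGHT_M_S = 299_792_458.0

def distance_from_round_trip(seconds, speed_m_s):
    """The signal travels out and back, so halve the total path length."""
    return (speed_m_s * seconds) / 2.0

# An ultrasonic ping that returns after 5.8 milliseconds:
print(distance_from_round_trip(0.0058, SPEED_OF_SOUND_M_S))   # ~0.99 m
# A laser pulse that returns after 200 nanoseconds:
print(distance_from_round_trip(200e-9, SPEED_OF_LIGHT_M_S))   # ~30 m
```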

More to Explore
Using photoresistor arrays in robotic applications
Ultrasonic Acoustic Sensing

Cameras
Most digital cameras use the same basic approach to imaging. At
their heart is a device called a charge-coupled device, or "CCD",
which serves the same purpose as a piece of film. A set of one or
more lenses focuses light onto the CCD, which is made up of a
rectangular grid of individual light sensors similar to the
photoelectric cell figured in the previous section. The figure at right
shows an example of a CCD that has a grid of 1,024 sensors across
by 1,280 sensors down and is used in medical X-ray imaging
devices. A digital camera outputs information in a form that can
easily be interpreted by a computer as a grid of levels of light in one
or more discrete electromagnetic wave bands (e.g., red, green, and
blue or X-ray and infrared frequencies).

Figure: A charge-coupled device (CCD) used in digital cameras.

Technically, a computer can use an analog camera as its input, but
in any case, a digital computer ultimately must use digital
information. The continuous stream of signals from an analog
camera, then, must be converted into a stream of digital information
that can be interpreted in the same way as a digital camera's
output.

Laser Scanners
Some machine vision systems use
lasers to directly sense the three
dimensional shapes of their
immediate surroundings.
Figure: A laser imager scanning a statue.
The basic idea behind this is to
exploit the fact that light travels at a known and thus predictable
velocity. A laser pulse is sent out in some direction and may be
detected by a sensor like the photoelectric cell described earlier if it
hits some object relatively nearby or which is highly prone to reflect
light back in the direction it came from. Using a very fast clock, the
electronics that coordinate the laser and light sensor measure how
long it took for the light pulse to be detected and hence calculate
how far away the reflective surface is. Because a laser beam can be
made to be very fine-pointed, it is generally reasonable to assume
that it will only hit a single surface and so only a single response will
come back to the sensor.

By gradually pointing the laser at different places -- usually within a
rectangular grid pattern -- in the system's field of view, sending
pulses of light at each, and taking measurements of the time each
pulse takes to be reflected, one can gradually build an image. The
image formed is not like the one you are used to. Instead of
representing levels of light, each pixel in such an image represents
a distance to the surface that the laser hit when it was aimed in that
direction.
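
Here's a rough sketch of what such a range image might look like to
software (a toy example of my own, not any particular scanner's
format): a grid of distances plus the pan and tilt angles at which
each pulse was fired, converted into 3D points.

```python
import math

# Hypothetical range image: rows are tilt (elevation) steps, columns
# are pan (azimuth) steps, and each cell is a measured distance in meters.
range_image = [
    [2.10, 2.12, 2.15],
    [2.05, 2.07, 2.09],
]
PAN_START, PAN_STEP = -0.05, 0.05     # radians
TILT_START, TILT_STEP = 0.00, -0.05   # radians

def range_image_to_points(ranges):
    """Convert each (tilt, pan, distance) sample into x, y, z coordinates."""
    points = []
    for row, scanline in enumerate(ranges):
        tilt = TILT_START + row * TILT_STEP
        for col, dist in enumerate(scanline):
            pan = PAN_START + col * PAN_STEP
            x = dist * math.cos(tilt) * math.sin(pan)
            y = dist * math.sin(tilt)
            z = dist * math.cos(tilt) * math.cos(pan)
            points.append((x, y, z))
    return points

for p in range_image_to_points(range_image):
    print("%.3f %.3f %.3f" % p)
```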

Echoic Triangulation
One particularly interesting idea
that has found many expressions
is the idea of using echoes to
detect objects that are not
otherwise visible.
Figure: Output from a ground-penetrating radar.
The laser scanners described above use echoes, but rely on the
object being detected to be fairly solid and the space between the
emitter and the subject being imaged to be fairly empty, relative to
the much higher density of the subject.

Imaging objects underground is a great example of a case where the
goal is to "see" objects amid surroundings that are not nearly as
varied in their densities as one would find with air and rock, for
example. One key technique is to project some wave of energy --
perhaps sound waves or microwave energy -- down into the ground
and detect the energy that is reflected off of layers and objects in
the ground. Because those layers and objects have different
densities or other properties that affect the energy projected, they
will reflect to varying degrees. As each pulse of energy is sent out,
the detector continually measures the degree of energy coming
back over time. The intensity of the returning signal and the amount
of time that has passed since the original pulse was sent are
typically used to create a linear gradient. By moving the device
along a linear path on the ground and sending pulses at each point,
a two dimensional image can be composed by taking each linear
gradient as a vertical column of pixels and each position along the
ground as a horizontal starting point for that vertical column of
pixels.
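
Here's a minimal sketch of that assembly step, assuming each pulse
yields a list of echo intensities sampled over time (often called a
trace); the numbers are purely illustrative. Stacking the traces side
by side gives the 2D cross-section.

```python
# Each trace records the echo intensity over time after one pulse.
# Moving the device along the ground and stacking traces side by side
# yields a 2D image: x = position along the ground, y = time (depth).

traces = [
    [0.0, 0.1, 0.6, 0.2, 0.0],   # pulse fired at position 0
    [0.0, 0.2, 0.7, 0.1, 0.0],   # pulse fired at position 1
    [0.0, 0.1, 0.3, 0.5, 0.1],   # pulse fired at position 2
]

def traces_to_image(traces):
    """Transpose the list of traces so each trace becomes a vertical column."""
    depth_samples = len(traces[0])
    return [[trace[d] for trace in traces] for d in range(depth_samples)]

for row in traces_to_image(traces):
    print(" ".join("%.1f" % v for v in row))
```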

The output of most ground-penetrating radar systems, like the one in
the figure at right, could not look more alien to our own sense of
how vision works, even though it often appears as a 2D image like
the figure. But it's important to recognize that there is information in
a visual system like this. With training, anyone can learn to recognize
the significance of what's in such images. And so can a machine.
And while we don't have the capacity to deal easily with it, one
could also take many slices in a grid drawn on the ground and put
them together like pages in a book to form a three dimensional
picture. To be useful to a human, it would probably be necessary to
delete some of the resulting three-dimensional pixels -- also known
as "voxels" -- so one can see the other parts as "solid" objects.

One fascinating extension of this same concept is to generate an
image of what's below the ground using sound. In one arrangement,
two or more microphones are placed on the ground around an area
to be imaged. A person with a sledgehammer moves from point to
point on a grid and strikes the ground. A computer records the
echoes and times when sounds arrive. Again, it may be that more
than one pulse is heard by a given microphone, because different
objects underground may reflect sound in different ways. Sound
waves may even separate and take different pathways to a given
microphone. The end result, again, is either an image representing a
sort of 2D slice through the ground or a 3D image representing a
volume of ground.

This same general concept of reconstructing internal structure from
reflected or transmitted energy also appears in familiar medical
imaging, most directly in the ubiquitous ultrasound equipment and
in more elaborate forms in technologies like MRI and PET scanners.

Primitive Visual Features


It's natural to want to dive right into the high level techniques and
goals of machine vision, but it's important to understand some of
the lower level features that we use to characterize images. Most
higher level vision approaches involve particular solutions to the
problems of how to recognize them or build larger structures based
on them.

Edge Detection
Regions and Flood-Fill
Texture Analysis

Edge Detection
One of the oldest concepts in machine vision, edge detection is also
one of the most enduring. The essence of the technique is to scan
an image, pixel by pixel, in search of strong contrasts. For each pixel,
the pixels around it are also examined, and the more variation there
is, the more strongly that pixel is considered to be part of an "edge",
presumably of some surface or object. Typically, the contrast sought
is of brightness. This is a function of what can be thought of as the
black and white representation of an image, but sometimes hue,
multiple color channels, or other pixel-level features are considered.
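
Here's a minimal sketch of pixel-level, contrast-based edge detection
of the kind described above, assuming the image is just a 2D list of
brightness values. It marks a pixel as an edge when the brightness
difference to its right or lower neighbor exceeds a threshold; real
systems typically use slightly fancier neighborhood operators (Sobel,
for example), but the idea is the same.

```python
def detect_edges(gray, threshold=40):
    """Return a same-sized grid of 0/1 edge flags from a grayscale image.

    gray: 2D list of brightness values (0-255).
    threshold: minimum neighbor contrast to call a pixel an edge.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Compare this pixel to its right and lower neighbors.
            right = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < width else 0
            down = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < height else 0
            if max(right, down) >= threshold:
                edges[y][x] = 1
    return edges

# A tiny image with a bright square on a dark background:
image = [
    [10, 10, 10, 10],
    [10, 200, 200, 10],
    [10, 200, 200, 10],
    [10, 10, 10, 10],
]
for row in detect_edges(image):
    print(row)
```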

This pixel-level edge detection operation is so simple and common
that it can be found in many ordinary paint programs. The figure
below illustrates one use of edge enhancement in the popular
Photoshop program:

Figure: Using Photoshop to "detect" and enhance edges.

The idea of doing contrast-based edge detection had a lot of
momentum in the early days when scientists studying human vision
determined that our own visual systems use this technique. Once
replicated in machines, it seemed like we were just a short way off
from having general purpose vision. But early successes in the
ability to find edges at the pixel level did not quickly translate into
successes in higher level vision goals. We'll explore this more in
coming sections.

One of the challenges in translating edges based on contrast into
edges of objects is that contrasts can be caused by factors other
than the obvious. For instance, a "specular" reflection of light as off
a shiny surface can cause the appearance of a sharp edge around
the reflection. Similarly, a shadow cast upon a surface can create a
strong contrast at the boundary between the shadowed and lighted
portions of that surface. These artifacts tend to lead edge detection
algorithms to get "false positive" results. Following is an illustration
of how shadows can create false positives:
Figure: The effects of partial shadows on edge detection with an image of leaves.

On the other hand, "false negative" results can be caused by
something as simple as a blurry edge. Consider the figure below:

Figure: Unexpected results can come from edge detection with blurry edges.

Note how the woman's nose is completely invisible to this edge
detection approach because the edges we perceive are actually
very soft and subtle in terms of contrasts. Other edges we infer, like
the one at the top of her hair or on her left shoulder, are also
missing because of weak contrasts. The shine off her forehead and
chin also clearly creates strong enough contrasts to result in false
positive matches of edges.

The above figures also illustrate how easily edges we perceive as
continuous get broken up in pixel-level edge detection algorithms.
The messiness of having lots of neighboring and intersecting edges
packed into small spaces also really complicates things.

Consider one of the central issues with simple edge-finding
algorithms. We'll call it the "threshold problem". It can be expressed
simply as "how strong of a contrast is strong enough to consider a
place in an image to represent an edge?" If one chooses a threshold
that's too low, there will be too many edges to be useful. If the
threshold is too high, too few edges will be found. The following
illustrates the problem:

Figure: Edge detection using different thresholds: a.) source image; b.) high threshold; c.) medium threshold; d.) low
threshold.

The sad truth is that there is no "right" answer when it comes to
choosing a threshold value. What most researchers don't want to
admit is that they do not rely on automation to decide what
threshold value to use. They choose a value based on the particular
application, lighting conditions, and other finer details. This raises
the classic AI problem of the "brain inside the brain". That is, it
frequently takes an intelligent agent -- the researcher -- to determine
a key factor in proper edge detection so it can be "automated". In
some situations, like in a factory, the conditions can be controlled.
General-purpose vision cannot assume such controlled conditions,
though. Your eyes certainly don't.

Despite the shortcomings, edge detection has found much
expression in very practical industrial and research systems. The
figure below illustrates a sample use of a simple sort of edge
detection algorithm in inspection of a manufactured part:
Figure: An inspection system detects that one of four expected cables is missing.

In this case, a linear slice one pixel wide is taken where cables are
expected to lie in the image. The number of sharp edges (six here)
is counted up and divided by two edges per cable, revealing that
there are only three of four expected cables. Linear slices like this
can be used to spot-check object widths, rotational alignments, and
other useful metrics that are helpful in inspection systems. And full-
image scans are also used to detect edges in roads and other
systems for use in more sophisticated applications.
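
Here's a sketch of that spot-check, assuming we're handed the
one-pixel-wide slice as a list of brightness values (the numbers are
made up for illustration): count the sharp transitions along the slice
and divide by the two edges each cable produces.

```python
def count_edges_along_slice(slice_pixels, threshold=60):
    """Count sharp brightness transitions along a 1-pixel-wide slice."""
    edges = 0
    for a, b in zip(slice_pixels, slice_pixels[1:]):
        if abs(a - b) >= threshold:
            edges += 1
    return edges

# Bright background (200) with dark cables (30) crossing the slice.
# Only three cables appear here, so six edges are found.
slice_pixels = [200, 200, 30, 30, 200, 200, 30, 30, 200, 30, 30, 200]
edges = count_edges_along_slice(slice_pixels)
cables_seen = edges // 2
print(edges, cables_seen)   # 6 3
if cables_seen < 4:
    print("Inspection failed: expected 4 cables, found", cables_seen)
```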

Regions and Flood-Fill


Finding the edges of objects may seem like the basis of finding
objects in an image, but it's only the beginning. The edges found on
a picture of a human ear, for example, will be far more complicated
than the overall shape of the ear. Edges provide a means to finding
objects, but finding regions within an image can be thought of as
one step higher in abstraction.

One of the most basic means of finding regions in an image is to use
a "flood-fill" algorithm. This term comes from the similarity of the
algorithm to the basic flood-fill operation most paint programs have.
To the program, it's as though a region is a flat plain into which color
can be poured, but which has "sharp" edges beyond which the color
won't spread. Those edges are usually defined in exactly the same
way as we considered above with regards to edge detection.

It's helpful to use the paint program's flood-fill analogy because of
its intuitive nature. The following figure shows a picture with some
areas sectioned off using a flood-fill algorithm. Each distinct region
found gets its own unique color.

Figure: Using flood-fill to isolate major regions of an image.

The limits of flood fill start becoming pretty obvious with the above
image. For example, notice how the ocean is divided into two parts
by the large rock? The left side of the ocean (blue) and the right
(green) are obviously part of the same object, to you and me, but
not to an algorithm simply seeking out unique regions using a basic
flood-fill algorithm.

Another issue is that a flood-fill operation can "spill out" of one
region to another. See how the white region includes the nearby
rock, part of the cliff on the left side, most of the farther-off rock, the
white foam where the ocean meets the beach, and so on? Few of us
would assume that all of these separate objects are really part of
the same object, yet the flood-fill algorithm doesn't see these
distinctions.

One way in which a typical paint program's flood-fill algorithm
differs from one used for machine vision is in how they deal with
gradients. The following figure illustrates the distinction.
Figure: Two ways to interpret a smooth gradient using a flood-fill algorithm.

A typical paint program will take note of the color of the pixel you
first clicked on and seek all contiguous pixels that are similar to that
one. Hence the separate bands above in the middle image. A typical
edge detection algorithm as described above would not find any
edges within the gradient; only around the circle and square. A more
appropriate flood-fill algorithm for machine vision, then, will fill
smooth regions, even where the color subtly changes from pixel to
pixel. The filling stops wherever there are harder edges.
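
Here's a minimal sketch of such a machine-vision-flavored flood fill,
assuming a grayscale image stored as a 2D list. Instead of comparing
every pixel to the seed pixel (as a paint program does), it compares
each pixel only to its immediate neighbor, so smooth gradients get
filled and only hard edges stop the spread.

```python
from collections import deque

def flood_fill_region(gray, seed, edge_threshold=25):
    """Return the set of (x, y) pixels reachable from the seed without
    crossing a hard edge. Neighboring pixels count as connected when
    their brightness difference is below edge_threshold, so smooth
    gradients are treated as one region."""
    height, width = len(gray), len(gray[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in region:
                if abs(gray[ny][nx] - gray[y][x]) < edge_threshold:
                    region.add((nx, ny))
                    queue.append((nx, ny))
    return region

# A smooth left-to-right gradient with a contrasting square dropped in:
image = [
    [10, 20, 30, 40, 50],
    [10, 20, 200, 40, 50],
    [10, 20, 200, 40, 50],
    [10, 20, 30, 40, 50],
]
print(len(flood_fill_region(image, seed=(0, 0))))   # 18: fills around the square
```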

Texture Analysis
One of the more interesting primitive features that can be dealt with
in machine vision is repeating and quasi-repeating patterns. Strong
textures can confound simple object detection because they can
invoke edge detection and stop flood-fill operations. Following are
some images that include strong textures that can easily foil such
simple operations.

Figure: Samples of textures, such as grass, bricks, leopard spots, marble, and water.

Texture recognition is such a challenge to deal with in large part
because it's difficult even to define the concept of textures formally.
Even dictionaries don't seem to do it much justice. Here are some
examples:

• The characteristic appearance of a surface having a tactile
quality
• The tactile quality of a surface or the representation or
invention of the appearance of such a surface
• In a photographic image, the frequency of change and
arrangement of tones

What the above sample images illustrate, though, is how obvious
the notion of texture seems to our visual systems, even if it's
difficult to formally define.

One characteristic that seems somewhat consistent about textures
is what can be called a "color scheme". In the first image, the grass
is heavy in the greens and blacks. The bricks are heavy in reds and
blacks. The water is heavy in blues.

How can we use this in automation? Here's a simple illustration.
Imagine taking a sampling of many or all of the colors in a patch of
some texture of interest. We'll call that collection of colors the color
scheme. Now, for each color, we find all pixels in the source image
that have that same color and add them to a total selection.
Following is an illustration of this using the above images as
sources:

Figure: The same images above with some textures selected based purely on color schemes.

It should be fairly apparent that in most of the cases above, the
color scheme-based selections seem very strongly biased towards
highlighting just the textures of interest. The leopard one seems a
poor example, to be sure. That seems to be because the black spots
themselves are very similar in color to the black in the tree
branches, leaves, and so forth. Whatever its power, this neat trick is
surely not sufficient for recognizing textures.
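
Here's a minimal sketch of the color-scheme selection step itself,
assuming the image is a 2D list of (r, g, b) tuples and the "color
scheme" is simply the set of colors sampled from a small patch; the
patch location and the tolerance are arbitrary choices for
illustration.

```python
def sample_color_scheme(image, x0, y0, x1, y1):
    """Collect the set of colors found in a rectangular patch."""
    return {image[y][x] for y in range(y0, y1) for x in range(x0, x1)}

def color_distance_sq(c1, c2):
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def select_by_scheme(image, scheme, tolerance=30):
    """Mark every pixel whose color is close to any color in the scheme."""
    tol_sq = tolerance ** 2
    return [
        [any(color_distance_sq(pixel, c) <= tol_sq for c in scheme)
         for pixel in row]
        for row in image
    ]

# Tiny example: "grass" greens on the left, "sky" blues on the right.
image = [
    [(20, 120, 30), (25, 130, 35), (90, 140, 220), (95, 150, 230)],
    [(22, 125, 28), (28, 135, 40), (92, 145, 225), (98, 155, 235)],
]
scheme = sample_color_scheme(image, 0, 0, 2, 2)   # sample the grass patch
for row in select_by_scheme(image, scheme):
    print(["X" if v else "." for v in row])
```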

What we could do with the selections made, then, is to start by
removing the "noise" pixels. That is, we can find places in an image
-- like with the grass -- where there are small, stray islands of pixels
not in the selection and add them to the selection. Likewise, we can
find stray islands of selected pixels among non-selected ones and
remove them from the selection. Next, we could segment an entire
image up into large blocks -- perhaps squares -- and, for each, see if
a large percentage of the pixels in it are among the selection. The
resulting "block map" can be used to pick out the rough shape or
shapes of items with the given texture. And so on.

The above thought experiment assumes that we have "intelligently"
picked out some patch of an image as a candidate for a texture.
What would stop us, alternatively, from picking a patch that
contains both some of the water and some of the hills on the shore
in the right-hand image, for example?

One other issue this dodges is changes in illumination, as from
shadows or the like.

Besides the notion of color schemes applying to textures, there does
tend to be genuine structure. The grass texture, for example, has
edges that favor up and down orientations. The bricks are
definitively ordered from top to bottom in a zig-zag pattern. The
leopard's spots are definitively spots with semi-regular spacing, if no
obvious ordering. This facet seems to require some more
sophisticated processing to deal with.

One interesting approach to texture analysis involves taking a large
number of samples of pairs of nearby pixels. For each pixel in the
source image, we look around at those pixels within a fixed radius of
it. For each pair of pixels, we note the brightness of each pixel. Let's
say instead of recognizing 256 shades of gray (brightness), we
recognize only 8. We then create a matrix (grid) that's 8 columns
wide and 8 rows tall, where columns represent the first pixel's
brightness and the rows represent the second's. For each pair we
find, we look in the matrix for the place that represents that pair's
combination of brightness levels. Each place in the matrix starts out
as zero, so each time we find a match for a combination, we add
one to it.

With a little extra math, we can boil the resulting matrix down to a
set of simplified characteristics called "energy", "inertia",
"correlation", and "entropy". These can further simplify the task of
recognizing a texture using a neural network or classifier system.
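
Below is a rough sketch of that idea -- a simplified gray-level
co-occurrence matrix -- along with two of the summary statistics just
mentioned. The 8-level quantization, the single right-hand neighbor
offset, and the tiny test images are all simplifications for
illustration.

```python
import math

LEVELS = 8   # quantize 0-255 brightness down to 8 levels

def co_occurrence_matrix(gray, dx=1, dy=0):
    """Count how often brightness level a sits next to brightness level b,
    using a single neighbor offset (dx, dy) for simplicity."""
    matrix = [[0] * LEVELS for _ in range(LEVELS)]
    height, width = len(gray), len(gray[0])
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                a = gray[y][x] * LEVELS // 256
                b = gray[ny][nx] * LEVELS // 256
                matrix[a][b] += 1
    return matrix

def energy_and_entropy(matrix):
    """Two of the classic summary statistics of the normalized matrix."""
    total = sum(sum(row) for row in matrix)
    probs = [v / total for row in matrix for v in row if v > 0]
    energy = sum(p * p for p in probs)
    entropy = -sum(p * math.log(p) for p in probs)
    return energy, entropy

stripes = [[0, 255] * 4 for _ in range(8)]   # strongly ordered texture
flat = [[128] * 8 for _ in range(8)]         # no variation at all
print(energy_and_entropy(co_occurrence_matrix(stripes)))
print(energy_and_entropy(co_occurrence_matrix(flat)))   # energy 1.0, entropy 0.0
```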

When we're done, we have a matrix that could be used by, say, a
neural network to recognize textures. With a little more math, we
can improve the ability to deal with some different orthogonal (90°)
rotations of a given texture. One downside to this concept, however,
is that it doesn't directly address finding the edges of textures. Much
of the literature seems to focus on cases where the entire image is of
a homogeneous texture and nothing else. And one limitation seems
to be that if one zooms in or out of a given texture, the resulting
matrices will probably differ, even for images of the same texture.

There are other variants of this sort of concept that involve different
mathematical complexities. They generally seem to suffer some of
the same limitations, though. If anything, they seem more exercises
in fascinating mathematics than in practical vision systems. It seems
so much easier to pick out textures in color images than in black
and white, yet these techniques focus naively on black and white for
mathematical elegance. Despite these sorts of shortcomings,
though, their conceptual basis seems to have merit.

As a side note, there is a related but separate field of study into
what is called "texture synthesis", which is about using a sample
image texture to generate extensions of that texture or new
textures altogether based on multiple source images. Following is an
illustration of some examples of this concept. Each real image is
paired with a new texture programmatically generated based on it.
Figure: Above are source images and below are new textures synthesized based on them.

Although synthesis is not the same thing as analysis, there does
seem to be a useful symmetry here. The ability to recall a texture
from memory is essentially an ability to synthesize it using some set
of rules. These rules should be simpler than the original image, in a
sense, and be more generic than what one would expect from just
tiling an image to create a repeating texture.

Two Dimensional Perception


Taking a step above the primitive features of images discussed
above, we can start to talk more about the substantive content in
images. We'll focus in this section on two-dimensional features. That
is, we'll limit ourselves to images that don't have intrinsic depth, as
though we were considering a bulletin board with flat things pinned
to it.

Following are some examples of images that we can process in a
two-dimensional context.
Figure: Some images that are good candidates for 2D perceptual processing.

Not surprisingly, there are many ways to approach analyzing such
images. Since there's still no such thing as a general purpose
machine vision system yet, deciding which one to use is often a
matter of what one is trying to accomplish.

Pixel Pattern Matching

As stated above, the goal of a machine vision system often
determines the method chosen. Let's say our goal was to identify the
contents of whole images against known images.

Figure: Images of bathroom tiles in our illustration.

Let's say we have a set of images of bathroom tiles that we
manufacture. In our application, we will be fed images of whole,
single tiles. The images are always of the same width and height.
We also have a finite set of images of the tiles we manufacture. As
we're fed new images to identify, then, we want to identify which
known tile the new image is most like.

Since we have whole images, we decide to do best-fit matching of
the whole images. Looking at the figure at right, it seems the main
distinguishing feature among the four sample tiles is their overall
color. That suggests one simple approach might be to find the
average color of each tile. Our database of known tile models would
simply have the same average color calculated on one or perhaps
several samples of each tile model. So as a new tile image comes
past our analyzer, it takes the average color and finds the one in the
database that has the shortest "distance" from the sampled color to
each archetype's average color. To avoid problems that might arise
from the white space surrounding each tile, we might ignore the
outer 10% margins of each image when finding the average color.
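
Here's a minimal sketch of that matching step, with a made-up
database of tile models and a simple Euclidean "distance" between
average colors; all the names and numbers are illustrative.

```python
def average_color(image, margin=0.1):
    """Average the (r, g, b) pixels, ignoring the outer margins."""
    height, width = len(image), len(image[0])
    y0, y1 = int(height * margin), int(height * (1 - margin))
    x0, x1 = int(width * margin), int(width * (1 - margin))
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    n = len(pixels)
    return tuple(sum(channel) / n for channel in zip(*pixels))

def closest_model(avg, known_models):
    """Pick the known tile model whose average color is nearest."""
    def dist_sq(c1, c2):
        return sum((a - b) ** 2 for a, b in zip(c1, c2))
    return min(known_models, key=lambda name: dist_sq(avg, known_models[name]))

# Hypothetical database of tile models and their average colors:
known_models = {
    "sky-blue": (140, 180, 230),
    "sand": (220, 200, 160),
    "forest": (60, 120, 70),
}
# A 10x10 test image that is mostly sandy-colored:
test_tile = [[(218, 198, 158)] * 10 for _ in range(10)]
print(closest_model(average_color(test_tile), known_models))   # "sand"
```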

Let's say we found that there are tiles that have the same average
color but which differ in shape. Some are square, some hexagonal,
and some triangular. The above color-based algorithm would
probably not be sufficient. We might modify our algorithm to include
shapes. To deal with shape, we'll opt to create simple masks for each
known shape. Each mask is just a two-color image, as illustrated by
the following figure.

Figure: Sample masks for recognizing squares, triangles, and hexagons.

The shapes don't have to be perfectly clean or straight. Each tile
model, then, is associated with one of the known shapes. So when
we see a new image, we compare the shape of the tile within it
against the known shapes. To do this, we might first use a flood-fill
starting from one corner of the image to select the white margin
around the tile. From this selection we create a new image that has
the same two colors as our shape masks. Next, we compare the two
images, pixel by pixel. For each pixel that doesn't match, we add
one to a count of mismatched pixels. In the end, whichever mask
has the lowest mismatch count is the one we choose as best
representing the shape of the tile in our test image.
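
And here's a sketch of the pixel-by-pixel mask comparison, assuming
the candidate mask and the known masks are binary grids of the
same size; the tiny 4x4 masks are just for illustration.

```python
def mismatch_count(mask_a, mask_b):
    """Count pixels where the two binary masks disagree."""
    return sum(
        1
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
        if a != b
    )

def best_shape(candidate, known_masks):
    """Return the name of the known mask with the fewest mismatches."""
    return min(known_masks,
               key=lambda name: mismatch_count(candidate, known_masks[name]))

# Hypothetical masks: 1 = tile, 0 = background margin.
known_masks = {
    "square": [[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]],
    "triangle": [[0, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]],
}
candidate = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
print(best_shape(candidate, known_masks))   # "square"
```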

To make our algorithm a little better, we also use the mask we
created using flood fill to find the average color by only looking
within the area that is not in the outer-margin selection. Armed with
the known shape and average color of the tile, we again find the
best match for these two properties in our database and thus
identify our image.

This thought experiment illustrates how straightforward some vision
applications can be when they are defined carefully to reduce their
potential complexity. What if we increased the complexity of our
present problem? Let's say one series of tiles is white and square,
but each has a different large letter (e.g., "A", "B", "C") on it. The
above algorithm is no longer sufficient.

To solve this problem, we decide to first identify the model or model
series of tile and, if a tile is identified to be in the "letter" series,
we'll use a new algorithm to identify which letter it is. We could use
the masking approach described above, but let's be creative and
say we want to use a neural network. We buy an off-the-shelf neural
network software package and train it to recognize each of the
letters that we might find on the tiles in the letter series. Training
done, we switch the neural net into its regular behavior mode and
go from there. With each tile put before it, the neural net will output
which model (letter) it thinks the tile represents.

In each case in the above example, we've considered what might
loosely be called pixel patterns. We considered the average color of
a textured object, the overall shape in terms of a mask, and the
shape of some bitmapped feature (letters) within such a shape. We
never resorted to trying to find lines or corners or other more
abstract features. We didn't even need to deal with images being at
different scales or rotations, let alone in varying lighting conditions.

Big Blobs
One practical technique available for
use in 2D perception applications is
the isolation of objects of interest
into "blobs" that can be counted,
characterized, or have their
positional relationships considered.

The figure at right illustrates a typical example. The technique used
to isolate the insects in the image from the background is trivial. The
brightness of each pixel is measured and, if it is above a certain
threshold value, it is painted white and otherwise black. Each insect,
in the thresholded image, can be thought of as a "blob" in the
image. We'll call them "blob objects", or "blobjects".

Figure: Insects "thresholded" to isolate them from a fairly plain background.

It's easy for us to perceive the individual blobjects, but it can be
quite a challenge to get a piece of software to do as well. The
simplest approach would be to consider every black pixel in the
image and, for each, perform a flood-fill. The flood-filling would
continue until one of the following conditions is met:

1. The width of a bounding box gets larger than some constant W.
2. The height of a bounding box gets larger than some constant
H.
3. The area (number of pixels) of the region gets larger than
some constant Amax.
4. The region gets fully filled and the area of the region is larger
than some constant Amin.

Only in this last condition would we conclude that we've found an
insect blob. To help speed up execution a bit, we would keep track
of all the pixels we've already tested so that, as we're continuing the
scan, we don't consider the same blobject twice, for example.
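
Here's a minimal sketch of that blob scan over a thresholded binary
image (1 = insect pixel, 0 = background). For simplicity it fills each
region completely and then applies the size tests, rather than
aborting mid-fill; the constants W, H, Amax, and Amin are arbitrary.

```python
from collections import deque

W_MAX, H_MAX, AREA_MAX, AREA_MIN = 10, 10, 60, 3   # illustrative constants

def find_blobjects(binary):
    """Return bounding boxes of regions that pass the size tests."""
    height, width = len(binary), len(binary[0])
    visited = set()
    blobs = []
    for sy in range(height):
        for sx in range(width):
            if binary[sy][sx] != 1 or (sx, sy) in visited:
                continue
            # Flood-fill the whole region starting from this pixel.
            region = {(sx, sy)}
            queue = deque([(sx, sy)])
            while queue:
                x, y = queue.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and (nx, ny) not in region):
                        region.add((nx, ny))
                        queue.append((nx, ny))
            visited |= region   # never re-test these pixels
            xs = [x for x, _ in region]
            ys = [y for _, y in region]
            w, h = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
            if w <= W_MAX and h <= H_MAX and AREA_MIN <= len(region) <= AREA_MAX:
                blobs.append((min(xs), min(ys), max(xs), max(ys)))
    return blobs

binary = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(find_blobjects(binary))   # two small blobs, one bounding box each
```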

It should be apparent from this example, however, that blob
detection is not going to be a clean process using the above
algorithm. Wherever insects touch or overlap one another, it's likely
we will meet one of the above failure conditions. The bounding box
might get too big or the total area filled by a region might exceed
the maximum. It becomes necessary to introduce more sophisticated
techniques to get more accurate information.

Admittedly, this technique, while it can be practical in controlled
circumstances and quite useful for certain classes of tasks, is
actually very limited. The need to manually set a usable threshold
value for separating blob from background means it's usually
necessary to ensure that the background against which blobjects
are placed is in high contrast to the blobjects. And what makes this
fundamentally a two dimensional perception problem is the fact that
the blobjects really need to be guaranteed to be generally
non-touching and non-overlapping. This is usually much harder to
come by in a three dimensional environment.

Point Orientation
One somewhat simplified version of blob detection that has found
practical application is navigation based on known, fixed points that
can be perceived. "Astral navigation" is a term that's common fare
in popular science fiction to describe how a space vessel can get its
bearings by observing the positions of the stars around it. And now
we actually do have such vessels that are able to do this, including
NASA's Deep Space One.

The concept of orienting based on the positions of points is fairly
straightforward. First, one takes an image of the stars in the current
field of view. In deep space, most of the possible field of view is
black or very nearly so. Most visible objects appear as small dots
perhaps one or a few pixels in size. It's easy to isolate these blobs
from the black of space. Their positions in the image are recorded as
a list of points. The goal is to be able to identify which known star
each of the given points represents. Once one knows with certainty
which stars any two of the points in the image are, it's then easy to
figure out which way the spacecraft is facing.

Figure: Using the relative distances between stars as a way of identifying stars for use in self-orientation.

Although there are plenty of ways to use point position information
in the source image to figure out which stars one is seeing, let me
describe one very simplistic way to illustrate how easy it can be.
First, assume that our camera cannot change its zoom level. We
know that our spacecraft will stay within our own solar system,
which means that no matter where we are within the solar system,
the positions of luminous objects (stars, galaxies, etc.) outside our
solar system will not appear significantly different than if we were
somewhere else in the solar system. So in any picture our
spacecraft takes of any two known luminous bodies outside this
system, the distance measured between them will be the same. Our
solution, then, begins with a database with two basic kinds of
information. The first is a list of known luminous bodies. The second
is a list of distances between any two known luminous bodies. So we
take a picture of the sky and separate all the bright blobs into their
own point positions. For each pair of points, we measure the
distance and go to our database of body-to-body distances to find
candidates. As we do this, we'll have different possible alternatives,
but the more distances we measure and correlations we make, the
more we will be able to narrow the interpretations down to exactly
one and to increase our certainty of the interpretation.
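
Here's a toy sketch of that matching idea, with an invented catalog
of star-to-star image distances and a tolerance for measurement
error. For each pair of detected points it simply lists which catalog
pairs have a similar separation; narrowing those candidates down to
one consistent interpretation would follow from there. None of this
is how Deep Space One actually does it; it's just an illustration.

```python
import math
from itertools import combinations

# Hypothetical catalog: expected image-plane distance between star pairs
# (in pixels), valid anywhere in the solar system at a fixed zoom level.
catalog = {
    frozenset(("Sirius", "Canopus")): 120.0,
    frozenset(("Sirius", "Vega")): 200.0,
    frozenset(("Canopus", "Vega")): 150.0,
}
TOLERANCE = 3.0   # pixels of allowed measurement error

def pixel_distance(p1, p2):
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def candidate_pairs(points):
    """For each pair of detected points, list the catalog pairs whose
    known separation matches the measured separation."""
    matches = {}
    for (i, p1), (j, p2) in combinations(enumerate(points), 2):
        d = pixel_distance(p1, p2)
        matches[(i, j)] = [
            sorted(pair) for pair, known in catalog.items()
            if abs(known - d) <= TOLERANCE
        ]
    return matches

# Three bright blobs detected in the image:
points = [(0.0, 0.0), (120.0, 0.0), (132.9, 149.4)]
for pair, names in candidate_pairs(points).items():
    print(pair, names)
```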

It's important to note that this technique works great in the context
of astral navigation because we can count on the field of view to
vary minimally within the distances that we care to work with. This
is what makes it a two dimensional perceptual problem. If we were
talking about a spacecraft that traversed many light years' distance,
the field of view would change enough that we would have to
change our approach because it would now be a three dimensional
perceptual problem.

2D Feature Networks
Given a complex two dimensional scene and a goal of being able to
identify all the objects within it, one general approach is to identify a
variety of easily isolated primitive features and to attempt to match
the combinations of such features against a database of known
objects. There are many ways to go about this, and no way seems to
fit all needs. Still, we'll consider a few here.

The astral navigation technique described earlier can be a good
starting point for identifying objects in a 2D scene. The first thing to
do is identify important points. This can be done by identifying
exceptionally bright or dark points, blobs of a significant color, and
so forth and calculating distances among the points to see if there
are known configurations in the scene. Another primitive feature
that some researchers have had success in isolating is sharp corners
and junctions where three or more lines meet. In an image of a "pac
man", for example, there are three sharp corners that form the pie
wedge of a mouth in the circular body. A picture of a stick man
would have lots of corners and junctions. Once the raw image is
processed to find such corners and junctions, they too become
points whose relative positions can be measured and compared to
known proportions. When a significant percentage of the
components that define some object are matched, we can isolate
that portion of the image as a single instance of that kind of object.

Another interesting technique that has been tried is to study the
outline of a shape. A shape can be isolated using edge detection or
flood filling, for example. A secondary image that only includes the
outline of a single shape can be isolated. It's not hard, then, to find
the smallest possible circle that can fit around that shape and
identify its center point. Then the code traverses from one point to
the next in the outline. For each point, the distance from the center
and the angle are measured. Because the object can be rotated at
any angle, one goal is to "rotate" the image until it fits a "standard"
orientation. One way to do so would be to find the three or more
points that are farthest from the center; i.e., those that touch the
outer circumference. Using one of a variety of techniques, one of
these points can be identified as the "first" one. The whole image
would effectively be rotated around the center point so that that
first point is straight above the center point. The image wouldn't
literally be rotated, of course. What would actually happen is that an
angular offset would be added to each angle-plus-distance
measured point so that the "first" point would actually be the first
one in the list of such points. Then the distances-from-center would
be normalized so that the farthest-out ones would be exactly one
distance unit from the center.

The result of all this processing, then, can be graphed as a linear
profile, with the X axis going from zero to the full 360 degrees and
the Y axis measuring the normalized distance from zero to one. This
graph can be further analyzed to find known patterns. One simple
way to do this is to reduce the resolution of the graph so that the X
and Y values each take one of, say, sixteen discrete levels and to
create a 16 x 16 matrix of true and false values. There would be a
true value at any point in the matrix where at least one outline point
falls at that combination of X and Y values. That matrix can then be
compared against a database full of such matrices for known
shapes.
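
Here's a rough sketch of turning an outline into that kind of
rotation- and scale-tolerant signature. It assumes the outline is
already available as a list of (x, y) points, and it uses the centroid
as the center rather than the center of the smallest enclosing
circle, which is a simplification of the approach described above.

```python
import math

BINS = 16

def outline_signature(outline):
    """Convert an outline (a list of (x, y) points) into a 16 x 16 boolean
    matrix of (normalized-distance bin, angle bin) occupancy."""
    # Centroid stands in for the enclosing circle's center here.
    cx = sum(x for x, _ in outline) / len(outline)
    cy = sum(y for _, y in outline) / len(outline)

    polar = [(math.atan2(y - cy, x - cx), math.hypot(x - cx, y - cy))
             for x, y in outline]

    # Normalize distances and rotate so the farthest point sits at angle 0.
    max_dist = max(d for _, d in polar)
    offset = max(polar, key=lambda p: p[1])[0]

    matrix = [[False] * BINS for _ in range(BINS)]
    for angle, dist in polar:
        a = (angle - offset) % (2 * math.pi)
        col = min(int(a / (2 * math.pi) * BINS), BINS - 1)
        row = min(int(dist / max_dist * BINS), BINS - 1)
        matrix[row][col] = True
    return matrix

# A crude square outline as a list of points:
square = ([(x, 0) for x in range(10)] + [(9, y) for y in range(10)] +
          [(x, 9) for x in range(10)] + [(0, y) for y in range(10)])
signature = outline_signature(square)
print(sum(sum(row) for row in signature), "cells occupied out of", BINS * BINS)
# Recognition would compare this matrix against a database of stored
# matrices for known shapes, e.g. by counting disagreeing cells.
```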

Three Dimensional Perception


In contrast to two dimensional perception, three dimensional
perception is all about processing information in all three spatial
dimensions, not just in flat or virtually flat worlds. Usually, it's a
matter of detecting where some or all objects within a visual field
are in space. Although there are many interesting experiments and
products that deal in 3D perception, this area is much less well
developed. Let me introduce some broad areas of interest.

Lasers and Direct 3D Perception
Binocular Vision
Geometric Perception
Generic Views
Intuiting Shape from Light and Shadow

Lasers and Direct 3D Perception


No discussion of 3D perception could be complete without
consideration of the most obvious technique of perceiving objects in
space: directly. Human eyes are presented with flat images that we
have to work with to guess at how far away things are in space.
Certain kinds of devices, though, actually "see" how far away things
are.

It all started with radar, a British invention dating back to some time
between World Wars One and Two. Scientists found that radio waves
would reflect off of some kinds of objects and sometimes back
toward their original sources. Since we know how fast a radio wave
travels -- the same speed as light waves -- it became possible to
measure how far away the reflector was based on how long it takes
for a radio wave to be received after it was transmitted. It is true of
an ordinary transmitter, like a radio broadcasting aerial tower, that
its signals get reflected back toward it. And you could probably
measure the time differences, but you'd have two problems. First,
you'd have to send out very short pulses instead of a continuous
broadcast. You need a "beginning" for your signal so you have a
beginning for its return and can calculate the time difference.
Second, you wouldn't know where in space the reflector -- a car, for
example -- is. To figure that out, you need to focus the radio beam
so it mainly travels in a single direction. Then you would sweep your
transmitter / receiver combination back and forth or around in a full
circle so you cover a wide field of view. While radio waves were
where radar technology began, technically speaking, most radar
systems don't use radio waves, but the narrower microwaves. They
penetrate weather and certain materials better, and they can also be
used to create finer images.

Figure: Emitted and reflected microwave pulse.

What is true of radio waves and microwaves in this regard is also
true of light waves. You could, technically, have a friend stand miles
away with a mirror and shine a flashlight in his direction and
measure how long it takes before you see the reflection in order to
calculate how far away he is, but the time delay would be so small
that you probably wouldn't notice it. It's been estimated that an
object traveling as fast as light could circle Earth about seven times
in a single second. Still, we have long had electronics that operate
fast enough to detect such small time delays.

The gold standard today in direct perception of 3D spaces is to use
lasers. Using the same concepts described above for radar, a
scanner makes a laser beam scan left to right, top to bottom in the
same sort of way you typically read a page of text in a book. In each
direction the scanner is aimed, a laser pulse is fired and a light
detector determines how long it takes for a reflection to be
measured. Since we have a direction and a distance, we can plot a
point in 3D space where the reflection occurred and hence where
some part of a physical object is. And since a laser beam can be
made to stay very sharp over large distances, it's possible to get
very precise 3D coordinates using a laser scanner. Following is an
example of a machine for surveying using a laser scanner.
Figure: Laser scanner used in high definition surveying.

3D points do not a 3D picture make, though. Usually, the next step
is to connect the points together into 3D surfaces. One could simply
do this by assuming every reflection point measured is connected to
the ones to the left and right and above and below. The problem
with this is that one ends up seeing the entire world as one single,
solid object. We know, of course, that the 3D world is composed of
many separate objects and we know that some things are in front of
others that we can't see.

If I were trying to endow a robot with the ability to see using laser
scanning, I would probably want it to understand this idea of one
object occluding another and the idea that there may be space
between them that can't be seen yet. One very simple way to do
this is to modify the above mesh-building algorithm slightly. For
each pair of neighboring points, I would calculate how far apart they
are in depth. Above a certain depth difference, I would declare the
two points part of separate surfaces, and below it, I would assume
the two points are part of the same surface. The following figure
illustrates this:
Figure: One way to determine when surfaces are connected or disconnected.

How would I set the threshold? The broken record comes around
again here to sing the refrain that there is no universal answer.
Perhaps our goal would be to make it so our robot can move around
in space and so we might arbitrarily choose a threshold of, say, 3
feet if our robot can move within a 3 foot wide space. We could also
use the "tears" in the 3D mesh to segment distinct objects out using
familiar techniques like our basic flood-fill algorithm.
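
Here's a minimal sketch of that neighbor test over a range image
(the same kind of grid of distances as in the laser scanner sketch
earlier), using roughly the 3-foot threshold chosen above; everything
here is illustrative.

```python
DEPTH_THRESHOLD_M = 0.9   # roughly the 3-foot figure chosen above

def surface_connections(range_image, threshold=DEPTH_THRESHOLD_M):
    """Mark which neighboring range samples belong to the same surface.
    Returns two grids of booleans: connection to the right neighbor and
    connection to the neighbor below."""
    height, width = len(range_image), len(range_image[0])
    right = [[False] * width for _ in range(height)]
    down = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                right[y][x] = abs(range_image[y][x] - range_image[y][x + 1]) < threshold
            if y + 1 < height:
                down[y][x] = abs(range_image[y][x] - range_image[y + 1][x]) < threshold
    return right, down

# A wall about 2 m away with a much closer object in the middle columns:
range_image = [
    [2.0, 2.0, 0.8, 0.8, 2.0],
    [2.0, 2.0, 0.8, 0.8, 2.0],
]
right, down = surface_connections(range_image)
print(right[0])   # [True, False, True, False, False]; last cell has no right neighbor
```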

Before you decide that in laser scanners we finally have found the
ultimate solution to the 3D perception problem, let me throw water
on that fire. If the goal is to get a 3D image of the world, laser
scanning is an excellent solution. If the goal is to get a machine to
understand the world, laser scanners do nothing more than measure
distances to points. They don't "understand" the world any better
than digital cameras do. And they tend not to see light levels or
colors like a camera does; only points in space. As you'll see later,
though, laser scanning can be used in conjunction with other
techniques as "cheats" to work around solving certain problems that
are easily solved by our own visual systems.
Binocular Vision

We have two eyes. And while it's true they give us a sharper image
than a single eye would, the most interesting benefit of having two
eyes is that we can use them to help us judge distances. We do so
using "binocular" vision techniques.

Figure: Two cameras in a stereo (binocular) arrangement.

To understand what this means, try a simple experiment. Look at a
corner or other vertical edge on a distant wall. Close your left eye.
Stick your finger up at arm's length so the tip is just to the left of
that vertical edge. Now open your left eye and close your right. You
should see that your finger is now to the right of the edge. Try
alternating between having just your left and right eyes open and
you'll see that your finger appears to move between being to the
left and to the right of that vertical edge. The reason for this is fairly
obvious: your eyes are in two different places in the world and so
see different views of it. Your brain makes use of these differences
to tell you useful things, like the fact that your finger is closer to you
than that vertical edge is.

In theory, binocular vision makes perfect sense and is pretty easy to
imagine. In practice, though, making software to line up objects
seen by two cameras in a binocular arrangement is not so easy to
do. One way that's been explored by some researchers is to find
"interesting" points in a stereo pair of images and to measure how
different those points are from one another in the horizontal
direction. Provided one can tell when two points of interest
represent the same point in 3D space, one can build up what Hans
Moravec calls an "evidence grid". The horizontal offset of each point
pair provides "evidence" of a real point in space where that point
exists. The following illustrates this idea:
Figure: Representation of 3D points discovered as "evidence of occupancy" in three dimensions of certain points in
a stereo pair of images.

While it just looks like a cartoonish version of the original image on
the left, the one on the right is actually just a 2D projection of a 3D
evidence grid built up from a stereo pair of images like the original
one seen here. One could take that 3D image and rotate it to project
a view from any place within or "outside" the room imaged. And with
a bit more processing, one can make some guesses about how
points in the evidence grid are related to form meshes of the same
sort described earlier for use in 3D laser scanning.

Another technique involves trying to match pieces of the left image
with pieces of the right image using literal bitmap matching.
Because we shouldn't have vertical variation, only horizontal
variation, we start by breaking up the original bitmaps into separate
horizontal slices from top to bottom in each image and comparing
corresponding slices. Each slice is a 1D bitmap, which is much easier
to process. The next part is to break this 1D image up into segments
using edge detection. Each span of pixels between two bounding
edges is considered as a single, solid "object". The average color of
each object can easily be calculated and that color can be searched
for in the opposite image's corresponding slice. Using some brute
computation, we should be able to come up pretty readily with a
pretty good interpretation of which objects in the left image slice
line up with the ones in the right image slice. There will often be
miscellaneous bits that don't match, of course. They make
computation a little more complicated. But we can even improve on
the quality of our guesses by comparing the results of one horizontal
slice pair against the ones directly above and below it to see if there
are correlations. The end result, though, is yet another set of points
in space that are defined by the edges found earlier and including
bitmaps that nicely fill the spaces between those points.

Figure: Depth discontinuity segmentation.

One somewhat low-complexity technique for determining distance
to objects in a scene involves taking a relatively small portion of
what one eye sees -- roughly around the center of its field of view --
and finding out where the best match for it is in the right eye. This
requires moving a frame of the same size as the one for the left eye
from left to right in the right eye's field of view. At each point, the
differences of each left/right pixel pair are summed up. Once this
survey is done, the place that had the lowest sum of differences is
considered the best match. The horizontal pixel offset of that frame's
position in the right versus the left camera's corresponding frame is
then used to calculate how far away the subject matter is. This
works fairly well when what the frames contain is pretty
homogeneous, in terms of distance, or when parts of the background
-- perhaps a wall behind a person -- that do creep into the frame are
relatively flat in texture. This technique is analogous to how your
own eyes work, but it only gives distance for whatever is in the
frame, rather than building a complete 3D scene.

Figure: Calculating distance using stereo disparity.
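
Here's a minimal sketch of that search, using one scanline from each
camera for simplicity: slide the left camera's central patch across
the right scanline, pick the offset with the smallest sum of absolute
differences, and convert that disparity into a distance using the
usual pinhole-camera relation. The focal length and camera baseline
here are invented numbers.

```python
FOCAL_LENGTH_PX = 700.0   # assumed focal length, in pixels
BASELINE_M = 0.12         # assumed distance between the two cameras

def best_disparity(left_row, right_row, patch_start, patch_size):
    """Slide the left patch across the right scanline and return the
    horizontal offset with the lowest sum of absolute differences."""
    patch = left_row[patch_start:patch_start + patch_size]
    best_offset, best_sad = 0, float("inf")
    for start in range(len(right_row) - patch_size + 1):
        window = right_row[start:start + patch_size]
        sad = sum(abs(a - b) for a, b in zip(patch, window))
        if sad < best_sad:
            best_sad, best_offset = sad, patch_start - start
    return best_offset

def distance_from_disparity(disparity_px):
    """Pinhole-camera relation: depth = focal length * baseline / disparity."""
    return FOCAL_LENGTH_PX * BASELINE_M / disparity_px

# One scanline from each camera; the subject sits 8 pixels further left
# in the right camera's view.
left_row = [10] * 20 + [90, 95, 92, 94, 91] + [10] * 20
right_row = [10] * 12 + [90, 95, 92, 94, 91] + [10] * 28
disparity = best_disparity(left_row, right_row, patch_start=20, patch_size=5)
print(disparity, "px,", round(distance_from_disparity(disparity), 2), "m")
```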

Final Thoughts
I hope this brief introduction to machine vision has been helpful to
you. As I stated in the beginning, it is by no means complete, but it's
not a bad intro if you are just getting started or are just curious. I
also hope it has successfully given you the sense that a lot of the
stuff being done today is not as complicated -- or competent -- as it
is often portrayed in the popular media and technical literature.
There's a lot one can do with just some simple tricks.

Moreover, we're clearly nowhere near achieving the ultimate goal of
general purpose vision in machines. There's plenty of room for
aspiring AI researchers to get in the game, even today. The road
ahead is long and the prospects are great.

Incidentally, I have made a point of not making reference to my own
machine vision research projects because I didn't want this to be
primarily about my work. I invite you, however, to check out
my machine vision site for more about what I'm working on.
