This page sets out to bring some of these resources together. I don't
have the time to create a truly exhaustive resource. My hope, then,
is to make this a decent starting point for others looking for
background information. To that end, I'm organizing information by
category and trying to provide summaries, speculations, and other
opinions.
One other goal I have for this page is to demystify machine vision.
The popular press has a habit of making the products of machine
vision research look far more impressive than they often actually
are. Even someone like me with a little knowledge of how a lot of
the techniques work can easily be fooled by a new trick into thinking
the field is much farther along than it really is. And for various
reasons, the web sites of research projects or commercial products
don't often reveal much about the techniques used. Admittedly,
some of my explanations are speculations based on what I find
online and sometimes just reckoned by asking myself, "how would I
do that?"
All practical machine vision systems in use today exist for their own
specific purposes. Some are used to ensure that parts coming off
assembly lines are manufactured correctly. Some are used to detect
the lines in a road for the benefit of cars that drive themselves.
Though some interested parties claim otherwise, there are no
general purpose vision systems, either in laboratories or on the
market.
If it sounds like it's difficult to define machine vision, don't fret. The
point is that the field of machine vision is not simply interested in
duplicating human vision. What is essential is the basic goal of
visual perception: the ability to "understand" the world visually
well enough to move about in and interact with a complex,
ever-changing environment and to pick out the information in it
that matters to whatever is doing the seeing.
General-Purpose Vision
As mentioned above, all practical machine vision end products
available now are for specific purposes. I contrasted that with
general purpose machine vision. Let me define what that means,
then.
Merely being able to see light is nearly useless, though. The best
video cameras today are still just recording or transmission devices;
they don't do anything else practical with the light they capture.
By contrast, we are
poor recording and transmission devices. It's our faculties for visual
perception that distinguish us. So let's talk about what we do with
the visual information we can see.
How does one distill all this down into a clear definition, then? What
is general purpose machine vision? I think it's best to define it in
terms of a set of core goals. A machine can be said to have general
purpose machine vision if it can:
There are probably other milestones one could add to this, but it
seems a pretty lofty set for now.
Sensors Used
It's a given that if a machine is to see, it must have sensors that
enable it to see. Here are some examples of the kinds of devices
being used today.
Spot Sources
Cameras
Laser Scanners
Echoic Triangulation
Spot Sources
The most basic kind of sensor that can be used for vision is one that
sees only a single "pixel". A photoelectric cell -- in this case, a
photoresistor -- like the one in the figure at right is an example. Note
that while, when you think of pixels, you probably
think of very small parts of a larger picture, I don't
necessarily mean it in this sense. An entire picture
can be composed of a single pixel. What matters in
this sense is the field of view of the imaging sensor. In
the case of many photoelectric cells, for instance, the
field of view might include up to half the full sphere of
the view around it.
Figure: A photoresistor for detecting light.
To narrow the field of view of a photocell like this
is a simple matter. One could, for example, put a small box over the
cell and drill a small hole in it so only light coming from a source in
the direction of that hole can get to the sensor.
In keeping with the idea that an electronic eye need not be limited
to working like our eyes, let's consider some other kinds of spot-
source sensors. One technique involves a speaker outputting a very
high frequency tone and using a microphone to pick it up. The closer
or larger a nearby object is, the more it will reflect that sound and
hence the stronger will be the signal to the microphone. A laser
beam and a photoelectric cell can serve a similar purpose. In
addition to sensing differences in intensity, they can also be used to
determine how long it takes for the signal to get from emitter to
detector and thus determine the distance to one or more objects.
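As a rough illustration of the timing half of that idea, here is a
minimal sketch, assuming we already have the round-trip time of an
ultrasonic ping in seconds; the sensor itself and the example number
are hypothetical.

    # Estimate distance from an ultrasonic echo's round-trip time.
    # Assumes the speed of sound in air is roughly 343 m/s at room temperature.
    SPEED_OF_SOUND = 343.0  # meters per second

    def distance_from_echo(round_trip_seconds):
        # The ping travels out to the object and back, so halve the trip.
        return SPEED_OF_SOUND * round_trip_seconds / 2.0

    # Example: an echo heard 0.01 s after the ping implies an object ~1.7 m away.
    print(distance_from_echo(0.01))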
More to Explore
Using photoresistor arrays in robotic applications
Ultrasonic Acoustic Sensing
Cameras
Most digital cameras use the same basic
approach to imaging. At their heart is a
device called a charge-coupled device, or
"CCD", which serves the same purpose as a
piece of film. A set of one or more lenses
focuses light onto the CCD, which is made up
of a rectangular grid of individual light
sensors similar to the photoelectric cell
figured in the previous section.
Figure: A charge-coupled device (CCD) used in digital cameras.
The figure at right shows an example
of a CCD that has a grid of 1,024 sensors across by 1,280 sensors
down and is used in medical X-ray imaging devices. A digital camera
outputs information in a form that can easily be interpreted by a
computer as a grid of levels of light in one or more discrete
electromagnetic wave bands (e.g., red, green, and blue or X-rays
and infrared frequencies).
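To make that grid concrete, here is a minimal sketch using the Pillow
imaging library (my choice, not something named above; the filename is
hypothetical) to read an image in as rows and columns of red, green,
and blue light levels.

    from PIL import Image  # Pillow, assumed to be installed

    # Load an image (hypothetical filename) and force it into RGB form.
    img = Image.open("snapshot.png").convert("RGB")
    width, height = img.size
    pixels = img.load()

    # Each pixel is just three light levels, 0-255, one per color band.
    r, g, b = pixels[0, 0]  # the top-left pixel
    print(width, height, r, g, b)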
Laser Scanners
Some machine vision systems use
lasers to directly sense the three
dimensional shapes of their
immediate surroundings.
Figure: A laser imager scanning a statue.
The basic idea behind this is to
exploit the fact that light travels at a known and thus predictable
velocity. A laser pulse is sent out in some direction and may be
detected by a sensor like the photoelectric cell described earlier if it
hits some object that is relatively nearby or highly prone to reflect
light back in the direction it came from. Using a very fast clock, the
electronics that coordinate the laser and light sensor measure how
long it took for the light pulse to be detected and hence calculate
how far away the reflective surface is. Because a laser beam can be
made to be very fine-pointed, it is generally reasonable to assume
that it will only hit a single surface and so only a single response will
come back to the sensor.
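Here is a minimal sketch of that calculation, assuming we can aim the
beam at a chosen horizontal angle and get back the pulse's round-trip
time; the fire_pulse function named in the comment is hypothetical.

    import math

    SPEED_OF_LIGHT = 299_792_458.0  # meters per second

    def point_from_pulse(angle_radians, round_trip_seconds):
        # Half of the round trip gives the one-way distance to the surface.
        distance = SPEED_OF_LIGHT * round_trip_seconds / 2.0
        # Turn the beam angle and distance into a 2D point relative to the scanner.
        return (distance * math.cos(angle_radians),
                distance * math.sin(angle_radians))

    # Sweeping the beam across many angles yields a set of points outlining
    # nearby surfaces, e.g.:
    #   scan = [point_from_pulse(a, fire_pulse(a)) for a in sweep_angles]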
Echoic Triangulation
One particularly interesting idea
that has found many expressions
is the idea of using echoes to
detect objects that are not
otherwise visible.
Figure: Output from a ground-penetrating radar.
The laser scanners described above use echoes, but rely on the
object being detected to be fairly solid and the space between the
emitter and the subject being imaged to be fairly empty, relative to
the much higher density of the subject.
Edge Detection
Regions and Flood-Fill
Texture Analysis
Edge Detection
One of the oldest concepts in machine vision, edge detection is also
one of the most enduring. The essence of the technique is to scan
an image, pixel by pixel, in search of strong contrasts. For each
pixel, the surrounding pixels are examined as well, and the more
variation there is among them, the more strongly that pixel is
judged to be part of an "edge", presumably of some surface or object.
Typically, the contrast sought is of brightness. This is a function of
what can be thought of as the black and white representation of an
image, but sometimes hue, multiple color channels, or other pixel-
level features are considered.
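Here is a minimal sketch of that pixel-by-pixel scan, assuming the
image is already a grid of brightness values (a list of rows of
numbers); the threshold value is an arbitrary choice.

    def edge_strength(image, x, y):
        # Compare a pixel's brightness to its four immediate neighbors;
        # bigger differences mean the pixel is more likely part of an edge.
        center = image[y][x]
        neighbors = [image[y-1][x], image[y+1][x], image[y][x-1], image[y][x+1]]
        return max(abs(center - n) for n in neighbors)

    def detect_edges(image, threshold=40):
        # Mark each interior pixel as edge (1) or not (0).
        h, w = len(image), len(image[0])
        return [[1 if edge_strength(image, x, y) > threshold else 0
                 for x in range(1, w - 1)]
                for y in range(1, h - 1)]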
Figure: Unexpected results can come from edge detection with blurry edges.
Figure: Edge detection using different thresholds: a.) source image; b.) high threshold; c.) medium threshold; d.) low
threshold.
In this case, a linear slice one pixel wide is taken where cables are
expected to lie in the image. The number of sharp edges (six here)
is counted up and divided by two edges per cable, revealing that
there are only three of four expected cables. Linear slices like this
can be used to spot-check object widths, rotational alignments, and
other useful metrics that are helpful in inspection systems. Full-
image edge scans are also used in more sophisticated applications,
such as detecting the edges of roads.
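Here is a minimal sketch of that linear-slice check, assuming we
already have the row of brightness values where the cables should
cross it and that each cable contributes one rising and one falling
edge; the contrast threshold and the sample numbers are arbitrary.

    def count_cables(slice_pixels, threshold=50):
        # Count sharp brightness changes along the one-pixel-wide slice.
        edges = sum(1 for a, b in zip(slice_pixels, slice_pixels[1:])
                    if abs(b - a) > threshold)
        # Two edges (one per side) per cable.
        return edges // 2

    # Example: a bright background crossed by three dark cables.
    row = [200]*5 + [20]*3 + [200]*5 + [20]*3 + [200]*5 + [20]*3 + [200]*5
    print(count_cables(row))   # -> 3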
The limits of flood fill start becoming pretty obvious with the above
image. For example, notice how the ocean is divided into two parts
by the large rock? The left side of the ocean (blue) and the right
(green) are obviously part of the same object, to you and me, but
not to an algorithm simply seeking out unique regions using a basic
flood-fill algorithm.
A typical paint program will take note of the color of the pixel you
first clicked on and seek all contiguous pixels that are similar to that
one. Hence the separate bands above in the middle image. A typical
edge detection algorithm as described above would not find any
edges within the gradient; only around the circle and square. A more
appropriate flood-fill algorithm for machine vision, then, will fill
smooth regions, even where the color subtly changes from pixel to
pixel. The filling stops wherever there are harder edges.
Texture Analysis
One of the more interesting primitive features that can be dealt with
in machine vision is repeating and quasi-repeating patterns. Strong
textures can confound simple object detection because they can
invoke edge detection and stop flood-fill operations. Following are
some images that include strong textures that can easily foil such
simple operations.
Figure: Samples of textures, such as grass, bricks, leopard spots, marble, and water.
Figure: The same images above with some textures selected based purely on color schemes.
With a little extra math, we can boil the resulting matrix down to a
set of simplified characteristics called "energy", "inertia",
"correlation", and "entropy". These can further simplify the task of
recognizing a texture using a neural network or classifier system.
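As an illustration, here is a minimal sketch that assumes the matrix
in question is a gray-level co-occurrence matrix, built by tallying
how often pairs of brightness levels sit next to each other; the
choices of eight levels and a one-pixel offset are mine, and only two
of the four characteristics are computed for brevity.

    import math

    def cooccurrence(image, levels=8, dx=1, dy=0):
        # Count how often brightness level i sits next to level j,
        # one pixel to the right (dx, dy is the chosen offset).
        m = [[0.0] * levels for _ in range(levels)]
        h, w = len(image), len(image[0])
        for y in range(h - dy):
            for x in range(w - dx):
                i = image[y][x] * levels // 256
                j = image[y + dy][x + dx] * levels // 256
                m[i][j] += 1
        total = sum(sum(row) for row in m) or 1.0
        return [[v / total for v in row] for row in m]

    def energy_and_entropy(m):
        # Two of the summary characteristics mentioned above.
        energy = sum(v * v for row in m for v in row)
        entropy = -sum(v * math.log(v) for row in m for v in row if v > 0)
        return energy, entropy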
When we're done, we have a matrix that could be used by, say, a
neural network to recognize textures. With a little more math, we
can improve the ability to deal with some different orthogonal (90°)
rotations of a given texture. One downside to this concept, however,
is that it doesn't directly address finding edges of textures. Much of
the literature seems to focus on cases where the entire image is of
a homogeneous texture and nothing else. Another limitation
seems to be that if one zooms in or out of a given texture, the
resulting matrices will probably differ, even for images of the
same texture.
There are other variants of this sort of concept that involve different
mathematical complexities. They generally seem to suffer some of
the same limitations, though. If anything, they seem more exercises
in fascinating mathematics than in practical vision systems. It seems
so much easier to pick out textures in color images than in black
and white, yet these techniques focus naively on black and white for
mathematical elegance. Despite these sorts of shortcomings,
though, their conceptual basis seems to have merit.
Let's say we found that there are tiles with the same
average color but different shapes. Some are square,
some hexagonal, and some triangular. The above color-based
algorithm would probably not be sufficient. We might modify our
algorithm to include shapes. To deal with shape, we'll opt to create
simple masks for each known shape. Each mask is just a two-color
image, as illustrated by the following figure.
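Here is a minimal sketch of that mask-matching idea, assuming each
mask and each candidate tile region come as same-sized grids of zeros
and ones: score each known mask by how many pixels agree with the
region and pick the best match. The shape names are just examples.

    def mask_overlap(region, mask):
        # Fraction of pixels on which the candidate region and the mask agree.
        matches = sum(1 for rr, mr in zip(region, mask)
                      for r, m in zip(rr, mr) if r == m)
        total = len(region) * len(region[0])
        return matches / total

    def best_shape(region, masks):
        # masks: dict mapping a shape name ("square", "hexagon", ...) to a mask.
        return max(masks, key=lambda name: mask_overlap(region, masks[name]))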
Big Blobs
One practical technique available for
use in 2D perception applications is
the isolation of objects of interest
into "blobs" that can be counted,
characterized, or have their
positional relationships considered.
It's easy for us to perceive the individual blobjects, but can be quite
a challenge to get a piece of software to do as well. The simplest
approach would be to consider every black pixel in the image and,
for each, perform a flood-fill. The flood-filling would continue until
one of the following conditions is met:
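However those conditions are spelled out, a minimal sketch of the
flood-fill labeling idea might look like the following, assuming for
illustration that the fill simply stops at non-black pixels and at the
image border, and that pixels already assigned to a blob are skipped.

    from collections import deque

    def label_blobs(image):
        # image: grid of 0 (black) and 1 (white); returns a list of blobs,
        # each blob being the set of black pixel coordinates it contains.
        h, w = len(image), len(image[0])
        seen, blobs = set(), []
        for y in range(h):
            for x in range(w):
                if image[y][x] == 0 and (x, y) not in seen:
                    blob, queue = set(), deque([(x, y)])
                    seen.add((x, y))
                    while queue:
                        cx, cy = queue.popleft()
                        blob.add((cx, cy))
                        for nx, ny in ((cx+1, cy), (cx-1, cy), (cx, cy+1), (cx, cy-1)):
                            if 0 <= nx < w and 0 <= ny < h and \
                               image[ny][nx] == 0 and (nx, ny) not in seen:
                                seen.add((nx, ny))
                                queue.append((nx, ny))
                    blobs.append(blob)
        return blobs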
Point Orientation
One somewhat simplified version of blob detection that has found
practical application is navigation based on known, fixed points that
can be perceived. "Astral navigation" is common fare in popular
science fiction as the way a space vessel gets its bearings by
observing the positions of the stars around it. We now actually
have vessels able to do this, including NASA's Deep Space 1.
It's important to note that this technique works great in the context
of astral navigation because we can count on the field of view to
vary minimally within the distances that we care to work with. This
is what makes it a two dimensional perceptual problem. If we were
talking about a spacecraft that traversed many light years' distance,
the field of view would change enough that we would have to
change our approach because it would now be a three dimensional
perceptual problem.
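As a loose illustration of getting bearings from known, fixed points,
here is a minimal sketch that assumes we have a catalog of known 2D
point positions and an observed copy of the same points that has only
been rotated and shifted. Pairwise distances survive rotation and
shift, so they can be used to pair observed points with catalog
points; the brute-force matching here is my own simplification.

    import math

    def signature(points, index):
        # Sorted distances from one point to all the others; unchanged by
        # rotation and translation, so it can help identify the point.
        px, py = points[index]
        return sorted(math.hypot(px - x, py - y)
                      for i, (x, y) in enumerate(points) if i != index)

    def match_points(observed, catalog):
        # Pair each observed point with the catalog point whose distance
        # signature is most similar (brute force, for illustration only).
        pairs = []
        for i in range(len(observed)):
            sig = signature(observed, i)
            best = min(range(len(catalog)),
                       key=lambda j: sum(abs(a - b) for a, b in
                                         zip(sig, signature(catalog, j))))
            pairs.append((i, best))
        return pairs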
2D Feature Networks
Given a complex two dimensional scene and a goal of being able to
identify all the objects within it, one general approach is to identify a
variety of easily isolated primitive features and to attempt to match
the combinations of such features against a database of known
objects. There are many ways to go about this, and no way seems to
fit all needs. Still, we'll consider a few here.
It all started with radar, a British invention dating back to some time
between World Wars One and Two. Scientists
found that radio waves would reflect off of
some kinds of objects and sometimes back
toward their original sources. Since we know
how fast a radio wave travels -- the same
speed as light waves -- it became possible to
measure how far away the reflector was based
on how long it takes for a radio wave to be
received after it was transmitted.
Figure: Emitted and reflected microwave pulse.
It is true of
an ordinary transmitter, like a radio broadcasting aerial tower, that
its signals get reflected back toward it. And you could probably
measure the time differences, but you'd have two problems. First,
you'd have to send out very short pulses instead of a continuous
broadcast. You need a "beginning" for your signal so you can mark
when the reflection returns and calculate the time difference.
Second, you wouldn't know where in space the reflector
-- a car, for example -- is. To figure that out, you need to focus the
radio beam so it mainly travels in a single direction. Then you would
sweep your transmitter / receiver combination back and forth or
around in a full circle so you cover a wide field of view. While radio
waves were where radar technology began, technically speaking,
most radar systems don't use radio waves but the narrower
microwaves, which penetrate weather and certain materials better
and can also be used to create finer images.
What is true of radio and microwave waves in this regard is also true
of light waves. You could, technically, have a friend stand miles
away with a mirror and shine a flashlight in his direction and
measure how long it takes before you see the reflection in order to
calculate how far away he is, but the time delay would be so small
that you probably wouldn't notice it. It's been estimated that an
object traveling as fast as light could circle Earth about seven times
in a single second. Still, we have long had electronics that operate
fast enough to detect such small time delays.
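To put a number on just how small that delay is, here is a quick
back-of-the-envelope calculation; the one-mile distance is an
arbitrary example.

    SPEED_OF_LIGHT = 299_792_458.0  # meters per second

    def round_trip_seconds(distance_meters):
        # Out to the reflector and back again.
        return 2.0 * distance_meters / SPEED_OF_LIGHT

    # A mirror about one mile (~1,609 m) away returns the flash in roughly
    # eleven millionths of a second.
    print(round_trip_seconds(1609))  # -> about 1.07e-05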
If I were trying to endow a robot with the ability to see using laser
scanning, I would probably want it to understand this idea of one
object occluding another and the idea that there may be space
between them that can't be seen yet. One very simple way to do
this is to modify the above mesh-building algorithm slightly. For
each pair of neighboring points, I would calculate how far apart they
are in depth. Above a certain difference, I would declare the two
points part of separate surfaces; below it, I would assume they are
part of the same surface. The following figure illustrates this:
Figure: One way to determine when surfaces are connected or disconnected.
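Here is a minimal sketch of that neighbor-by-neighbor test, assuming a
single scan line comes back as a row of depth readings in meters; the
gap threshold and the sample readings are arbitrary.

    def split_surfaces(depths, max_gap=1.0):
        # depths: consecutive distance readings (meters) from one scan line.
        # A jump bigger than max_gap between neighbors starts a new surface.
        surfaces, current = [], [depths[0]]
        for prev, cur in zip(depths, depths[1:]):
            if abs(cur - prev) > max_gap:
                surfaces.append(current)
                current = []
            current.append(cur)
        surfaces.append(current)
        return surfaces

    # Example: a surface at ~2 m, a gap, then another surface at ~5 m.
    print(split_surfaces([2.0, 2.1, 2.05, 5.2, 5.1, 5.0]))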
How would I set the threshold? The broken record comes around
again here to sing the refrain that there is no universal answer.
Perhaps our goal would be to make it so our robot can move around
in space and so we might arbitrarily choose a threshold of, say, 3
feet if our robot can move within a 3 foot wide space. We could also
use the "tears" in the 3D mesh to segment distinct objects out using
familiar techniques like our basic flood-fill algorithm.
Before you decide that in laser scanners we finally have found the
ultimate solution to the 3D perception problem, let me throw water
on that fire. If the goal is to get a 3D image of the world, laser
scanning is an excellent solution. If the goal is to get a machine to
understand the world, laser scanners do nothing more than measure
distances to points. They don't "understand" the world any better
than digital cameras do. And they tend to not see light levels or
colors like a camera does; only points in space. As you'll see later,
though, laser scanning can be used in conjunction with other
techniques as "cheats" to work around solving certain problems that
are easily solved by our own visual systems.
Binocular Vision
We have two eyes. And while it's
true they give us a sharper image
than a single eye would, the most
interesting benefit of having two
eyes is that we can use them to
help us judge distances. We do so
using "binocular" vision Figure: Two cameras in a stereo (binocular)
techniques. arrangement.
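The geometric trick at the heart of this is that the same point in the
world lands at slightly different horizontal positions in the two
images, and that difference (the "disparity") shrinks as the point
gets farther away. Here is a minimal sketch, assuming we already know
the cameras' focal length in pixels, the distance between them, and a
matched pixel position in each image; all the numbers are hypothetical.

    def depth_from_disparity(x_left, x_right, focal_length_px, baseline_m):
        # Classic pinhole stereo relation: depth = f * B / disparity.
        disparity = x_left - x_right          # in pixels; larger when closer
        if disparity <= 0:
            return float("inf")               # too far away to resolve
        return focal_length_px * baseline_m / disparity

    # Example: a feature at pixel column 420 in the left image and 400 in
    # the right, with a 700-pixel focal length and cameras 6 cm apart.
    print(depth_from_disparity(420, 400, 700, 0.06))   # -> 2.1 meters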
Final Thoughts
I hope this brief introduction to machine vision has been helpful to
you. As I stated in the beginning, it is by no means complete, but it's
not a bad intro if you are just getting started or are just curious. I
also hope it has successfully given you the sense that a lot of the
stuff being done today is not as complicated -- or competent -- as it
is often portrayed in the popular media and technical literature.
There's a lot one can do with just some simple tricks.