Published in IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1997, vol. 19, no. 7, pp. 780-785.
who use dynamic models. However, in contrast to Pfinder, these other
systems all require accurate initialization and use local image features.
Consequently, they have difficulty with occlusion and require massive
computational resources.

Functionally, our systems are perhaps most closely related to the work of
Bichsel [3] and Baumberg and Hogg [2]. These systems segment the person
from the background in real time using only a standard workstation. Their
limitation is that they do not analyze the person's shape or internal
features, but only the silhouette of the person. Consequently, they cannot
track head and hands, determine body pose, or recognize any but the
simplest gestures.

3 Steady State Tracking

We will first describe Pfinder's representations and operation in the
"steady-state" case, where it has already found and built representations
of the person and scene. In the following sections we will then describe
the model-building and error recovery processes.

3.1 Modeling The Person

We can represent these 2-D regions by their low-order statistics. Clusters
of 2-D points have 2-D spatial means and covariance matrices, which we
shall denote \mu and K. The blob spatial statistics are described in terms
of their second-order properties; for computational convenience we will
interpret this as a Gaussian model:

    \Pr(O) = \frac{\exp[-\frac{1}{2}(O - \mu)^T K^{-1} (O - \mu)]}
                  {(2\pi)^{m/2} |K|^{1/2}}                              (1)

The Gaussian interpretation is not terribly significant, because we also
keep a pixel-by-pixel support map showing the actual occupancy. We define
s_k(x, y), the support map for blob k, to be

    s_k(x, y) = \begin{cases} 1 & (x, y) \in \text{blob } k \\
                              0 & \text{otherwise} \end{cases}          (2)

The aggregate support map s(x, y) over all the blob models represents the
segmentation of the image into spatio-color classes.

Like other representations used in computer vision and signal analysis,
including superquadrics, modal analysis, and eigenvector representations,
blobs represent the global aspects of the shape and can be augmented with
higher-order statistics to attain more detail if the data supports it. The
reduction of degrees of freedom from individual pixels to blob parameters
is a form of regularization which allows the ill-conditioned problem to be
solved in a principled and stable way.

Each blob has a spatial (x, y) and color (Y, U, V) component. Color is
expressed in the YUV color space. We could additionally use motion and
texture measurements as part of the blob descriptions, but current hardware
has restricted us to position and color only. Because of their different
semantics, the spatial and color distributions are assumed to be
independent. That is, K_k is block-diagonal, with uncoupled spatial and
spectral components. Each blob can also have a detailed representation of
its shape and appearance, modeled as differences from the underlying blob
statistics. The ability to efficiently compute compact representations of
people's appearance is useful for low-bandwidth applications [5].

The statistics of each blob are recursively updated to combine information
contained in the most recent image with knowledge contained in the current
class statistics and the priors.

3.2 Modeling The Scene

We assume that the majority of the time Pfinder will be processing a scene
that consists of a relatively static situation, such as an office, and a
single moving person. Consequently, it is appropriate to use different
types of model for the scene and for the person.

We model the scene surrounding the human as a texture surface; each point
on the texture surface is associated with a mean color value and a
distribution about that mean. The color distribution of each pixel is
modeled with
the Gaussian described by a full covariance matrix. Thus, for instance, a
fluttering white curtain in front of a black wall will have a color
covariance that is very elongated in the luminance direction, but narrow in
the chrominance directions.

We define \mu_0 to be the mean (Y, U, V) of a point on the texture surface,
and K_0 to be the covariance of that point's distribution. The spatial
position of the point is treated implicitly because, given a particular
image pixel at location (x, y), we need only consider the color mean and
covariance of the corresponding texture location. The scene texture map is
considered to be class zero.

One of the key outputs of Pfinder is an indication of which scene pixels
are occluded by the human, and which are visible. This information is
critical in low-bandwidth coding, and in the video/graphics compositing
required for "augmented reality" applications.

In each frame, visible pixels have their statistics recursively updated
using a simple adaptive filter:

    \mu_t = \alpha y + (1 - \alpha)\mu_{t-1}                            (3)

This allows us to compensate for changes in lighting and even for object
movement. For instance, if a person moves a book it causes the texture map
to change in both the locations where the book was, and where it now is.
By tracking the person we can know that these areas, although changed, are
still part of the texture model and thus update their statistics to the new
value. The updating process is done recursively, and even large changes in
illumination can be substantially compensated within two or three seconds.

3.3 The Analysis Loop

Given a person model and a scene model, we can now acquire a new image,
interpret it, and update the scene and person models. To accomplish this
there are several steps: predict the appearance of the user in the new
image using the current state of our model; for each image pixel and for
each blob model, calculate the likelihood that the pixel is a member of the
blob; resolve these pixel-by-pixel likelihoods into a support map; and
update the statistical models of all the blob models. Each of these steps
will now be described in more detail.

The first step is to update the spatial model associated with each blob
using the blob's dynamic model, to yield the blob's predicted spatial
distribution for the current image:

    \hat{X}_{[n|n-1]} = \Phi_{[n-1]} \hat{X}_{[n-1|n-1]}                (4)

where the estimated state vector \hat{X} includes the blob's position and
velocity, the observations \hat{Y} are the mean spatial coordinates of the
blob in the current image, and the filter \hat{G} is the Kalman gain matrix
assuming simple Newtonian dynamics.

For each image pixel we must measure the likelihood that it is a member of
each of the blob models and the scene model.

For each pixel in the new image, we define y to be the vector
(x, y, Y, U, V). For each class k (e.g., for each blob and for the
corresponding point on the scene texture model) we then measure the log
likelihood

    d_k = -\frac{1}{2}(y - \mu_k)^T K_k^{-1} (y - \mu_k)
          - \frac{1}{2}\ln|K_k| - \frac{m}{2}\ln(2\pi)                  (5)

Self-shadowing and cast shadows are a particular difficulty in measuring
the membership likelihoods; however, we have found the following approach
sufficient to compensate for shadowing. First, we observe that if a pixel
is significantly brighter (has a larger Y component) than predicted by the
class statistics, then we do not need to consider the possibility of
shadowing. It is only in the case that the pixel is darker that there is a
potential shadow.

When the pixel is darker than the class statistics indicate, we therefore
normalize the chrominance information by the brightness: U* = U/Y and
V* = V/Y. This normalization removes the effect of changes in the overall
amount of illumination. For the common illuminants found in an office
environment this step has been found to produce a stable chrominance
measure despite shadowing. The log likelihood computation then becomes

    d_k = -\frac{1}{2}(y^* - \mu_k^*)^T (K_k^*)^{-1} (y^* - \mu_k^*)
          - \frac{1}{2}\ln|K_k^*| - \frac{m}{2}\ln(2\pi)                (6)

where y* is (x, y, U*, V*) for the image pixel at location (x, y), \mu_k^*
is the mean (x, y, U*, V*) of class k, and K_k^* is the corresponding
covariance.

The next step is to resolve the class membership likelihoods at each pixel
into support maps, indicating for each pixel whether it is part of one of
the blobs or of the scene. Spatial priors and connectivity constraints are
used to accomplish this resolution.

Individual pixels are then assigned to particular classes: either to the
scene texture class or a foreground blob. A classification decision is
made for each pixel by comparing the computed class membership likelihoods
and choosing the best one (in the MAP sense), e.g.,

    s(x, y) = \arg\max_k (d_k(x, y))                                    (7)

Connectivity constraints are enforced by iterative morphological "growing"
from a single central point, to produce a single region that is guaranteed
to be connected (see Figure 2). The first step is to morphologically grow
out a "foreground" region using a mixture density comprised of all of the
blob classes. This defines a single connected region corresponding to all
the parts of the user. Each of the individual blobs is then
morphologically grown, with the constraint that they remain confined to the
foreground region.
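As an illustrative sketch of this per-pixel scoring and MAP assignment
(Equations 5 and 7), the following NumPy fragment evaluates the Gaussian
log likelihood of each pixel's feature vector under every class and assigns
each pixel to the best-scoring class. The function names, the dense Python
loop over pixels, and the toy class statistics are our own illustration,
not Pfinder's implementation:

```python
import numpy as np

def log_likelihood(y, mean, cov):
    """Gaussian log likelihood d_k of a feature vector y = (x, y, Y, U, V)
    under class k (Equation 5)."""
    m = len(mean)
    diff = y - mean
    # Solve instead of forming an explicit inverse, for numerical stability.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * maha - 0.5 * logdet - 0.5 * m * np.log(2 * np.pi)

def support_map(pixels, means, covs):
    """MAP classification (Equation 7): assign each pixel to the class
    (scene texture = 0, blobs = 1..K) with the highest d_k."""
    scores = np.stack([
        np.array([log_likelihood(p, mu, K) for p in pixels])
        for mu, K in zip(means, covs)
    ])                          # shape: (num_classes, num_pixels)
    return np.argmax(scores, axis=0)
```

In practice the per-pixel loop would be vectorized, and the block-diagonal
structure of K_k (independent spatial and color components) lets the
spatial and color terms of d_k be computed separately and summed.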
Figure 2: The morphological grow operation

This results in a set of connected blobs that fill out the foreground
region. However, the boundaries between blobs can still be quite ragged
due to misclassification of individual pixels in the interior of the
figure. We therefore use simple 2-D Markov priors to "smooth" the scene
class likelihoods.

Given the resolved support map s(x, y), we can now update the statistical
models for each blob and for the scene texture model. By comparing the new
model parameters to the previous model parameters, we can also update the
dynamic models of the blobs.

For each class k, the pixels marked as members of the class are used to
estimate the new model mean \mu_k:

    \hat{\mu}_k = E[y]                                                  (8)

and the second-order statistics become the estimate of the model's
covariance matrix K_k,

    \hat{K}_k = E[(y - \mu_k)(y - \mu_k)^T]                             (9)

This process can be simplified by re-writing it in another form more
conducive to iterative calculation. The first term can be built up as
examples are found, and the mean can be subtracted when it is finally
known:

    E[(y - \mu_k)(y - \mu_k)^T] = E[y y^T] - \mu_k \mu_k^T              (10)

For computational efficiency, color models are built in two different color
spaces: the standard (Y, U, V) space, and the brightness-normalized
(U*, V*) color space.

Errors in classification and feature tracking can lead to instability in
the model. One way to ensure that the model remains valid is to reconcile
the individual blob models with domain-specific prior knowledge. For
instance, some parameters (e.g., color of a person's hand) are expected to
be stable and to stay fairly close to the prior distribution, some are
expected to be stable but have weak priors (e.g., shirt color), and others
are both expected to change quickly and have weak priors (e.g., hand
position).

Intelligently chosen prior knowledge can turn a class into a very solid
feature tracker. For instance, classes intended to follow flesh are good
candidates for assertive prior knowledge, because people's normalized skin
color is surprisingly constant across different skin pigmentation levels
and radiation damage (tanning).

4 Initialization

Pfinder's initialization process consists primarily of building
representations of the person and the surrounding scene. It first builds
the scene model by observing the scene without people in it, and then when
a human enters the scene it begins to build up a model of that person.

The person model is built by first detecting a large change in the scene,
and then building up a multi-blob model of the user over time. The model
building process is driven by the distribution of color on the person's
body, with blobs added to account for each differently-colored region.
Typically separate blobs are required for the person's hands, head, feet,
shirt, and pants.

The process of building a blob model is guided by a 2-D contour shape
analysis that recognizes silhouettes in which the body parts can be
reliably labeled. For instance, when the user faces the camera and extends
both arms (what we refer to as the "starfish" configuration) then we can
reliably determine the image location of the head, hands, and feet. When
the user points at something, then we can reliably determine the location
of the head, one hand, and the feet.

These locations are then integrated into the blob-model building process by
using them as prior probabilities for blob creation and tracking. For
instance, when the face and hand image positions are identified we can set
up a strong prior probability for skin-colored blobs.

The following subsections describe the blob-model building process in
greater detail.

4.1 Learning The Scene

Before the system attempts to locate people in a scene, it must learn the
scene. To accomplish this, Pfinder begins by acquiring a sequence of video
frames that do not contain a person. Typically this sequence is relatively
long, a second or more, in order to obtain a good estimate of the color
covariance associated with each image pixel. For computational efficiency,
color models are built in both the standard (Y, U, V) and
brightness-normalized (U*, V*) color spaces.

4.2 Detect Person

After the scene has been modeled, Pfinder watches for large deviations from
this model. New pixel values are compared to the known scene by measuring
their Mahalanobis distance in color space from the class at the appropriate
location in the scene model, as per Equation 5. If a changed region of the
image is found that is of sufficient size to rule out unusual camera noise,
then Pfinder proceeds to analyze the region in more detail, and begins to
build up a blob model of the person.
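The change-detection test of Section 4.2, which compares each new pixel to
the per-pixel scene model by Mahalanobis distance, can be sketched as
follows. The array layout and the threshold value are hypothetical choices
for illustration and are not taken from Pfinder:

```python
import numpy as np

def detect_changed_pixels(frame, scene_mean, scene_cov_inv, threshold=3.0):
    """Flag pixels whose color deviates strongly from the scene model.

    frame:         (H, W, 3) array of (Y, U, V) pixel values
    scene_mean:    (H, W, 3) per-pixel color means
    scene_cov_inv: (H, W, 3, 3) per-pixel inverse color covariances
    threshold:     Mahalanobis-distance cutoff (hypothetical value)
    """
    diff = frame - scene_mean
    # Per-pixel squared Mahalanobis distance: d^2 = diff^T K^{-1} diff.
    d2 = np.einsum('hwi,hwij,hwj->hw', diff, scene_cov_inv, diff)
    return d2 > threshold ** 2
```

A flagged region would then be analyzed further only if it is large enough
to rule out camera noise, as the text describes.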
4.3 Building the Person Model

To initialize blob models, Pfinder uses a 2D contour shape analysis that
attempts to identify the head, hands, and feet locations. When this
contour analysis does identify one of these locations, then a new blob is
created and placed at that location. For hand and face locations, the
blobs have strong flesh-colored priors. Other blobs are initialized to
cover clothing regions. The blobs introduced by the contour analysis
compete with all the other blobs to describe the data.

When a blob can find no data to describe (as when a hand or foot is
occluded), it is deleted from the person model. When the hand or foot
later reappears, a new blob will be created by either the contour process
(the normal case) or the color splitting process. This deletion/addition
process makes Pfinder very robust to occlusions and dark shadows. When a
hand reappears after being occluded or shadowed, normally only a few frames
of video will go by before the person model is again accurate and complete.

The blob models and the contour analyzer produce many of the same features
(head, hands, feet), but with very different failure modes. The contour
analysis can find the features in a single frame if they exist, but the
results tend to be noisy. The class analysis produces accurate results,
and can track the features where the contour cannot, but it depends on the
stability of the underlying models and the continuity of the underlying
features (i.e., no occlusion).

The last stage of model building involves the reconciliation of these two
modes. For each feature, Pfinder heuristically rates the validity of the
signal from each mode. The signals are then blended with prior
probabilities derived from these ratings. This allows the color trackers
to track the hands in front of the body, where the hands produce no
evidence in the contour. If the class models become lost due to occlusion
or rapid motion, the contour tracker will dominate and will set the feature
positions once they are re-acquired in the contour.

5 Limitations

Pfinder explicitly employs several domain-specific assumptions to make the
vision task tractable. When these assumptions break, the system degrades
in specific ways. Due to the nature of Pfinder's structure, and since the
model of the user is fairly weak, the system degrades gracefully and
recovers in two or three frames once the assumption again holds.

Pfinder expects the scene to be significantly less dynamic than the user.
Although Pfinder has the ability to compensate for small or gradual changes
in the scene or the lighting, it cannot compensate for large, sudden
changes in the scene. If such changes occur, they are likely to be
mistakenly considered part of the foreground region, and an attempt will be
made to explain them in the user model.

Another limitation, related to the dynamic scene problem, is that the
system expects only one user to be in the space. Multiple users don't
cause problems in the low-level segmentation or blob tracking algorithms,
but do cause significant difficulties with the gesture recognition system
that attempts to explain the blob model as a single human figure.

    test                  hand           arm
    translation (X, Y)    0.7 pixels     2.1 pixels
                          (0.2% rel)     (0.8% rel)
    rotation (theta)      4.8 degrees    3.0 degrees
                          (5.2% rel)     (3.1% rel)

    Table 1: Pfinder Estimation Performance

6 Performance

We find RMS errors in Pfinder's tracking on the order of a few pixels, as
shown in Table 1. Here, the term "hand" refers to the region from
approximately the wrist to the fingers. An "arm" extends from the elbow to
the fingers. For the translation tests, the user moves through the
environment while holding onto a straight guide. Relative error is the
ratio of the RMS error to the total path length.

For the rotation error test, the user moves an appendage through several
cycles of approximately 90 degree rotation. There is no guide in this
test, so neither the path of the rotation, nor even its absolute extent,
can be used to directly measure error. We settle for measuring the noise
in the data. The RMS distance to a low-pass filtered version of the data
provides this measure.

7 Applications

Although interesting by itself, the full implications of real-time human
tracking only become concrete when the information is used to create an
interactive application. Pfinder has been used to explore several
different human interface applications.

7.1 A Modular Interface

Pfinder provides a modular interface that allows client applications to
request subsets of the information that Pfinder provides. A wide range of
data is exported through the Pfinder interface: blob model statistics;
polygon representation of the support map; video texture bounding box with
alpha map; semantically labeled features (e.g. head, right hand, etc.); and
static gestures
(e.g. standing, sitting, pointing, etc.).

7.2 Gesture Control for ALIVE, SURVIVE

In many applications it is desirable to have an interface that is
controlled by gesture rather than by a keyboard or mouse. One such
application is the Artificial Life IVE (ALIVE) system [9]. ALIVE utilizes
Pfinder's support map polygon to define alpha values for video compositing
(placing the user in a scene with some artificial life forms in real time).
Pfinder's gesture tags and feature positions are used by the artificial
life forms to make decisions about how to interact with the user, as
illustrated in Fig. 3(a).

Pfinder's output can also be used in a much simpler and more direct manner.
The position of the user and the configuration of the user's appendages can
be mapped into a control space, and sounds made by the user are used to
change the operating mode. This allows the user to control an application
with their body directly. This interface has been used to navigate a 3-D
virtual game environment as in SURVIVE (Simulated Urban Recreational
Violence IVE) [18] (illustrated in Fig. 3(b)).

7.3 Recognition of American Sign Language

One interesting application attends only to the spatial statistics of the
blobs associated with the user's hands. Starner and Pentland [17] used
this blob representation together with hidden Markov modeling to interpret
a forty-word subset of American Sign Language (ASL). Using this approach
they were able to produce a real-time ASL interpreter with a 99% sign
recognition accuracy. Thad Starner is shown using this system in
Fig. 3(c).

7.4 Avatars and Telepresence

Using Pfinder's estimates of the user's head, hands, and feet position it
is possible to create convincing shared virtual spaces. The ALIVE system,
for instance, places the user at a particular place in the virtual room
populated by virtual occupants by compositing real-time 3-D computer
graphics with live video. To make a convincing 3-D world, the video must
be placed correctly in the 3-D environment; that is, video of the person
must be able to occlude, or be occluded by, the graphics [5].

The high-level description of the user is also suitable for
very-low-bandwidth telepresence applications. On the remote end,
information about the user's head, hand, and feet position is used to drive
a video avatar that represents the user in the scene [5]. One such avatar
is illustrated in Fig. 3(d). It is important to note that the avatar need
not be an accurate representation of the user, or be human at all.

8 Hardware

Pfinder is implemented on the SGI architecture using the VL (Video Library)
interface. A typical frame rate on sixteenth-resolution (160x120 pixel)
frames is 10 Hz using a 200 MHz R4400 processor Indy with Vino Video. For
input we use a JVC-1280C single-CCD color camera. It provides an S-video
signal to the SGI digitizers.

9 Conclusion

Pfinder demonstrates the utility of stochastic, region-based features for
real-time image understanding. This approach allows meaningful,
interactive-rate interpretation of the human form without custom hardware.
The technique is stable enough to support real applications as well as
higher-order vision techniques.

10 References

[1] A. Azarbayejani and A. P. Pentland. Recursive estimation of motion,
structure, and focal length. IEEE Trans. Pattern Analysis and Machine
Intelligence, 17(6):562-575, June 1995.

[2] A. Baumberg and D. Hogg. An efficient method for contour tracking
using active shape models. In Proceedings of the Workshop on Motion of
Nonrigid and Articulated Objects. IEEE Computer Society, 1994.

[3] Martin Bichsel. Segmenting simply connected moving objects in a static
scene. IEEE Trans. Pattern Analysis and Machine Intelligence,
16(11):1138-1142, Nov 1994.

[4] T. Darrell and A. P. Pentland. Cooperative robust estimation using
layers of support. IEEE Trans. Pattern Analysis and Machine Intelligence,
17(5):474-487, May 1995.

[5] Trevor Darrell, Bruce Blumberg, Sharon Daniel, Brad Rhodes, Pattie
Maes, and Alex Pentland. ALIVE: Dreams and illusions. In ACM SIGGRAPH,
Computer Graphics Visual Proceedings, July 1995.

[6] Willis Davis Ellis. A Source Book of Gestalt Psychology. Harcourt,
Brace, New York, 1938.

[7] D. M. Gavrila and L. S. Davis. Towards 3-D model-based tracking and
recognition of human movement: a multi-view approach. In International
Workshop on Automatic Face- and Gesture-Recognition, Zurich. IEEE Computer
Society, 1995.

[8] R. J. Kauth, A. P. Pentland, and G. S. Thomas. BLOB: An unsupervised
clustering approach to spatial preprocessing of MSS imagery. In 11th Int'l
Symposium on Remote Sensing of the Environment, Ann Arbor, MI, April 1977.

[9] Pattie Maes, Bruce Blumberg, Trevor Darrell, and Alex Pentland. The
ALIVE system: Full-body interaction with animated autonomous agents. ACM
Multimedia Systems, 5:105-112, 1997.
Figure 3: (a) Chris Wren playing with Bruce Blumberg's virtual dog in the
ALIVE space, and (b) playing SURVIVE. (c) Real-time reading of American
Sign Language (with Thad Starner doing the signing). (d) Trevor Darrell
demonstrating vision-driven avatars.

[10] S. Mann and R. W. Picard. Video orbits: characterizing the coordinate
transformation between two images using the projective group. IEEE Trans.
Image Processing, 1997. To appear.

[11] D. Metaxas and D. Terzopoulos. Shape and non-rigid motion estimation
through physics-based synthesis. IEEE Trans. Pattern Analysis and Machine
Intelligence, 15:580-591, 1993.

[12] A. Pentland and B. Horowitz. Recovery of nonrigid motion and
structure. IEEE Trans. Pattern Analysis and Machine Intelligence,
13(7):730-742, July 1991.

[13] Alex Pentland. Classification by clustering. In Proceedings of the
Symposium on Machine Processing of Remotely Sensed Data. IEEE Computer
Society Press, June 1976.

[14] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated
structures: An application to human hand tracking. In European Conference
on Computer Vision, pages B:35-46, 1994.

[15] K. Rohr. Towards model-based recognition of human movements in image
sequences. CVGIP: Image Understanding, 59(1):94-115, Jan 1994.

[16] H. S. Sawhney and S. Ayer. Compact representations of videos through
dominant and multiple motion estimation. IEEE Trans. Pattern Analysis and
Machine Intelligence, 18(8):814-831, 1996.

[17] Thad Starner and Alex Pentland. Real-time American Sign Language
recognition from video using hidden Markov models. In Proceedings of the
International Symposium on Computer Vision, Coral Gables, FL, USA, 1995.
IEEE Computer Society Press.

[18] C. Wren, F. Sparacino, A. Azarbayejani, T. Darrell, T. Starner,
A. Kotani, C. Chao, M. Hlavac, K. Russell, and A. Pentland. Perceptive
spaces for performance and entertainment: Untethered interaction using
computer vision and audition. Applied Artificial Intelligence,
11(4):267-284, June 1997.