Shiloh L. Dockstader
and A. Murat Tekalp
where

{ μ_mi } = { z : ∇P_m(z) = 0, H(P_m(z)) ≺ 0 } ,  (2)
and H indicates the Hessian of a real-valued function. The
above set of 3-D positions records the location of all
modal values for the m-th feature's prior distribution. It is
hypothesized that this information correlates with typical
configurations of the human model, as indicated by
previously observed examples. Since the actual domain of
z is finite, computing the critical points is easily per-
formed using a hierarchical grid search.
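Such a coarse-to-fine search can be sketched as below; the single-Gaussian density p_m, the grid sizes, and the shrink factor are illustrative stand-ins, not the paper's actual prior or settings.

```python
import numpy as np

def grid_modes(density, lo, hi, coarse=8, refine=3, shrink=0.5):
    """Hierarchical grid search for a mode of a density over the finite
    box [lo, hi] in 3-D: evaluate on a coarse grid, keep the best cell,
    then re-grid a shrinking box around it."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    center = (lo + hi) / 2.0
    for _ in range(refine):
        axes = [np.linspace(l, h, coarse) for l, h in zip(lo, hi)]
        pts = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        vals = np.array([density(p) for p in pts])
        center = pts[np.argmax(vals)]            # best grid point so far
        half = (hi - lo) * shrink / 2.0          # shrink the search box
        lo, hi = center - half, center + half
    return center

# Hypothetical stand-in prior: a single Gaussian bump at (0.3, -0.2, 0.5).
mode = np.array([0.3, -0.2, 0.5])
p_m = lambda z: np.exp(-8.0 * np.sum((z - mode) ** 2))
est = grid_modes(p_m, lo=[-1, -1, -1], hi=[1, 1, 1])
```

A multimodal prior would require keeping several top cells per level rather than one; this sketch shows only the refinement idea.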
2.3. System Dynamics
The utility of the proposed structural model and
kinematic constraints is best demonstrated by using them
to define a more accurate dynamic model for tracking
human motion. In particular, we describe a standard
second-order autoregressive process in which the sampled
position of an arbitrary point or feature in 3-D can be
described by
z[k] = z[k−1] + ż[k−1] Δt + (1/2) z̈[k−1] (Δt)² ,  (3)

where k is a discrete instant in time, Δt is the change in
time between temporal samples, and ż[k−1] and z̈[k−1] are
the first and second derivatives of position.
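Equation (3) is the familiar constant-acceleration update; a minimal sketch, where the example point, velocity, acceleration, and Δt = 1/30 s are all illustrative:

```python
import numpy as np

def ar2_step(z, z_dot, z_ddot, dt):
    """One prediction of Eq. (3): new position from the previous
    position, velocity, and acceleration over a time step dt."""
    return z + z_dot * dt + 0.5 * z_ddot * dt ** 2

# A point moving at 1 m/s along x while decelerating at 0.5 m/s^2.
z = np.array([0.0, 0.0, 0.0])
v = np.array([1.0, 0.0, 0.0])
a = np.array([-0.5, 0.0, 0.0])
z_next = ar2_step(z, v, a, dt=1.0 / 30.0)
```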
A considerable amount of research has attempted to
better describe the dynamics of human motion by more
carefully modeling the velocity, ż[k], of moving features.
Less attention, however, has been paid to the potential
acceleration, z̈[k], that drives the motion of the human
body. It is our submission that the body assumes num-
erous, but predictable and repetitive, configurations while
performing certain tasks. Because of the connected nature
of the body and the physics of motion, there will be a
natural tendency for body parts to return to their comfort-
able, or typical, positions after performing some activity.
The pendular motion of the arms, for instance, demon-
strates the effects of acceleration as the arm deviates only
so far from its steady state position before naturally
returning. This force can be thought of as an acceleration
of particles towards their characteristic arrangements.
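This restoring behavior can be illustrated with a toy damped-spring simulation; it is not the paper's model, and the spring and damping constants are arbitrary choices:

```python
# Toy illustration: a parameter pulled back toward a "comfortable" rest
# position by a spring-like restoring acceleration, with damping so the
# oscillation dies out, much as an arm swing settles.
rest, z, v = 0.0, 0.6, 0.0           # rest position, initial offset, velocity
k_spring, damping, dt = 9.0, 1.0, 1.0 / 30.0
for _ in range(3000):
    a = -k_spring * (z - rest) - damping * v   # restoring + damping terms
    v += a * dt                                # semi-implicit Euler update
    z += v * dt
# After enough steps the parameter has settled back near its rest position.
```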
The most significant novelty of this research is the
application of a priori statistical models of kinematic
configuration to traditional system and object dynamics.
The actual tracking of the proposed kinematic model
could be performed using any number of techniques. We
choose Kalman filtering since it provides a simple means
of incorporating a priori object motion via the state
transition matrix, Φ[k]. With a sufficiently accurate
motion model, the Gaussian noise assumption taken by the
Kalman filter places little restriction on the overall system
dynamics. This is particularly true if the noise modeling is
allowed to be non-stationary. An alternative, but more
computationally complex, approach might employ Con-
densation tracking [17] to more accurately model a non-
Gaussian noise process. However, since the method of
generating image observations is based on motion
estimation, and not just on the visual tracking of specific
features, the observation and state densities are consider-
ably less likely to exhibit non-Gaussian behavior. Thus,
we introduce a time-varying state vector, within the
context of a Kalman filter, as
x[k] = [ ψ_1[k]  ψ_2[k]  ⋯  ψ_m[k]  ⋯  ψ_2N[k] ]^T .  (4)
Here, ψ_m[k], m ≤ N, denotes the 3-D position of the m-th
parameter in our body-centered coordinate system, while
ψ_{m+N}[k] = ( ψ_m[k] − ψ_m[k−1] ) / Δt  (5)
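The stacking of positions with their finite-difference velocities, in the spirit of (4) and (5), can be sketched as follows; the two example parameters are hypothetical:

```python
import numpy as np

def build_state(pos_now, pos_prev, dt):
    """Stack the N 3-D parameter positions with their finite-difference
    velocities into a single 2N-entry state; each entry here is kept as
    a 3-vector, mirroring the per-parameter layout."""
    pos_now = np.asarray(pos_now, float)
    vel = (pos_now - np.asarray(pos_prev, float)) / dt  # Eq. (5) estimate
    return np.concatenate([pos_now, vel], axis=0)

# Two hypothetical parameters observed at consecutive frames.
prev = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
curr = [[0.1, 0.0, 0.0], [1.0, 1.1, 1.0]]
x = build_state(curr, prev, dt=1.0 / 30.0)  # positions first, then velocities
```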
indicates an approximation of the true velocity of the m-th
parameter. Following the form of (3), we define an
estimate of the m-th parameter as
ψ_m[k] = ψ_m[k−1] + c_v ψ_{m+N}[k−1] Δt
         + c_a (1 − l_m) ( μ_mi − ψ_m[k−1] − c_v ψ_{m+N}[k−1] Δt ) ,  (6)
where, dropping the dependence of i_m[k] on m and k for
simplicity,

i = argmin_i { ‖ μ_mi − ψ_m[k−1] ‖ } ,  (7)
l_m = P_m( ψ_m[k−1] ) / P_m( μ_mi ) ∈ [0, 1] ,  (8)
and c_v and c_a are our familiar, adaptive constants for
velocity and acceleration. Relating (6) to (3), we can
clearly see that ψ_m[k−1] = z[k−1],
ż[k−1] = c_v ψ_{m+N}[k−1] ,  ‖ż[k−1]‖ < v_MAX ,  (9)
and, with the constraint that ‖z̈[k−1]‖ < a_MAX,
z̈[k−1] = ( 2 c_a (1 − l_m) / (Δt)² ) ( μ_mi − z[k−1] − ż[k−1] Δt ) .  (10)
An inspection of (10) indicates that the actual acceleration
of a parameter is a function of both a variance-like
contribution in l_m and the location of the modal values,
μ_m, in its prior distribution. Also notice that both the
velocity (9) and acceleration (10) terms for a particular
parameter's motion model are forced to never exceed the
hard limits of v_MAX and a_MAX, respectively.
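One way to realize this per-parameter prediction, with both hard limits enforced, is sketched below. The constants c_v, c_a, v_max, a_max, the clamping style, and the single-mode prior are illustrative assumptions, not values from this work.

```python
import numpy as np

def predict_param(psi, psi_vel, modes, prior, dt,
                  c_v=0.9, c_a=0.5, v_max=2.0, a_max=10.0):
    """Sketch of a per-parameter prediction: a scaled velocity estimate,
    hard-limited in magnitude, plus an acceleration that pulls the
    parameter toward the nearest prior mode, weighted by (1 - l_m)."""
    # Nearest mode of the prior (in the spirit of the argmin selection).
    mu = modes[np.argmin([np.linalg.norm(m - psi) for m in modes])]
    # Confidence-like ratio l_m in [0, 1]: prior at psi vs. at the mode.
    l_m = min(prior(psi) / prior(mu), 1.0)
    # Velocity term, hard-limited to v_max.
    v = c_v * psi_vel
    speed = np.linalg.norm(v)
    if speed > v_max:
        v *= v_max / speed
    # Restoring acceleration toward the mode, hard-limited to a_max.
    a = 2.0 * c_a * (1.0 - l_m) / dt ** 2 * (mu - psi - v * dt)
    mag = np.linalg.norm(a)
    if mag > a_max:
        a *= a_max / mag
    return psi + v * dt + 0.5 * a * dt ** 2

# Hypothetical single-mode prior and one tracked parameter.
modes = [np.array([0.5, 0.0, 0.0])]
prior = lambda z: np.exp(-np.sum((z - modes[0]) ** 2))
nxt = predict_param(np.zeros(3), np.array([0.3, 0.0, 0.0]), modes, prior, dt=1.0 / 30.0)
```

Note that when the parameter sits exactly on a mode, l_m = 1 and the acceleration term vanishes, which matches the intuition of a body part at its typical configuration.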
Since Kalman filtering is a well-known processing tool,
we omit the processing details here, but
refer the reader to [18] for more information. It is
sufficient to remark that the above equations define only
the motion model that guides the dynamics of the Kalman
filter. The corresponding prediction using these equations
would be given, for the m-th feature, by
[ ψ_m[k]  ψ_{m+N}[k] ]^T = Φ_m[k] [ ψ_m[k−1]  ψ_{m+N}[k−1] ]^T ,  (11)
where
Φ_m[k] = [ 1 + c_a (1 − l_m) ( μ_mi[k−1] / ψ_m[k−1] − 1 )   c_v Δt ( 1 − c_a (1 − l_m) ) ]
         [ 0                                                 1                            ] .  (12)
The required observations for the Kalman filter are based
on image measurements extracted from the video
sequences. For this task, we refer to our previous work on
the tracking and extraction of features from multiple views
[19]. In essence, this method performs the simultaneous
[constrained] tracking of image features from multiple
views and then integrates the tracked observations using
an occlusion-adaptive algorithm. Observations are defined
using estimated motion vectors in the vicinity of tracked
features. These observations are then incorporated into a
Kalman filter that leverages the kinematic-based dynamic
motion model described in this contribution.
The algorithm enforces hard constraints during the
Kalman filtering process when searching for image
observations. We start with the most reliable parameter
(based on its Kalman minimum mean square error matrix)
and construct an observation that is subject to the hard
constraints, K. Then, in a highest confidence first
(HCF) fashion, we estimate an observation for each
parameter subject to the hard constraints and to the
locations of previously positioned, presumably more
reliable, parameters. This is implemented while tracking
2-D image measurements by simply restricting the search,
in accordance with the hard constraints, for the best
motion correspondence between neighboring frames.
Additional details regarding the standard application of
hard and structural constraints to tracking are readily
available in the literature [7][20].
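The HCF ordering and the constrained observation search can be sketched as follows; the covariance-trace ranking and the scalar min/max spanning-distance check are simplified stand-ins for the full algorithm:

```python
import numpy as np

def hcf_order(error_covs):
    """Highest-confidence-first ordering: process parameters from the
    smallest to the largest trace of their Kalman error covariance."""
    return sorted(error_covs, key=lambda m: np.trace(error_covs[m]))

def constrained_search(candidates, placed, min_d, max_d):
    """Keep only candidate observations whose distance to every already
    placed parameter respects hypothetical min/max spanning limits."""
    ok = []
    for c in candidates:
        if all(min_d <= np.linalg.norm(c - p) <= max_d for p in placed.values()):
            ok.append(c)
    return ok

# Toy example: parameter "b" is more reliable (smaller covariance trace),
# so it is positioned first, and it then constrains candidates for "a".
covs = {"a": np.eye(3) * 4.0, "b": np.eye(3) * 1.0}
order = hcf_order(covs)
placed = {"b": np.zeros(3)}
cands = [np.array([0.2, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])]
valid = constrained_search(cands, placed, min_d=0.1, max_d=1.0)
```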
3. Experimental Results
To test the proposed contribution, we measure hard
kinematic constraints for two test subjects and soft
constraints on the same subjects while walking in a home
environment. The training is performed in a semi-
automatic fashion and uses a total of 2500 video frames,
with a sampling rate of Δt = 1/30 sec, for building the a
priori stochastic model distributions. For experimental
testing, the same subjects carry out a number of
characteristic activities involving gait-based locomotion in
a home environment. The results are based on a total of
927 frames, captured from three distinct views, extracted
at various times during the course of a day.
Select results of the tracking are shown in Figures 3
and 4; in each case, we show the output, as seen from all
views in the system, for both the bounding volume and
stick component parameters. In the upper half of the
figures, the first row indicates the projection of the
bounding volume parameters on the imaging plane, as
seen from different views, while the second row shows the
positions of the stick model parameters. The bottom half
of the figure shows additional results taken later in the
same sequence. Dashed lines denote bounding volume
parameters for the lower body. We are able to auto-
matically track the bounding volume in near real-time for
an average length of 367 frames. The stick model
component is able to track, with a manual initialization,
for an average of 112 frames at a slower rate of approx-
imately one to two frames per second. For a full
appreciation of the tracking results, we invite the reader to
inspect the complete video clips at our website.
Figure 3. Tracking results for one person.
To quantify the accuracy of the tracking, we report on
the average error between an estimated parameter location
in 3-D and its corresponding, manually determined,
ground truth position. The results of this analysis are
shown in Figure 5. For each parameter we show the
average calculated error (shown as the number above each
error bar) as well as the statistical confidence in that
estimate. No statistics are provided for p_0, as its position is
a linear combination of those for p_2 and p_3. A careful
inspection of Figure 5 reveals that the estimation error for
the bounding volume is typically much greater than that
for the stick model component. This is particularly true for
those parameters that attempt to differentiate between the
movement of the upper and lower body. The accuracy of
the stick model is noticeably better, and more than
sufficient for the extraction of gait variables, but still
burdened by the lack of a fully automatic initialization.
Figure 4. Tracking results for multiple persons.
In general, for both aspects of the proposed model,
the estimation error is highest for the parameters
associated with the hands and arms. This is due primarily
to the significant self-occlusion and periodic irregularity
of the arm swing complications that, collectively, will be
difficult to overcome. For a potential application in gait
analysis, it is encouraging to see the accuracy obtained in
tracking the lower body, as such parameters can be used to
directly measure gait velocity, stance times, stance width,
stride length, and similar variables.
[Figure 5: two bar charts of average error (cm) versus parameter number, with confidence intervals. Stick Model, parameters 1-15, average errors 5.1, 5.6, 6.1, 9.7, 9.1, 6.6, 6.9, 14.2, 13.0, 17.8, 15.7, 9.7, 10.9, 12.4, and 10.4 cm. Bounding Box Model, parameters 16-25, average errors 5.3, 3.8, 36.3, 30.0, 19.6, 16.0, 40.9, 38.6, 26.2, and 20.8 cm.]
Figure 5. Error statistics.
The confidence of the error estimates was found to be
a consequence of both the multi-camera configuration as
well as the total amount of scene occlusion. Observations
for parameters near the top of the body were often
unavailable due to the positioning of the multiple cameras
relative to the location of the walking area in the home.
Other parameters, related to the position of the arms,
simply lacked reliable observations due to self-occlusion.
As expected, the error trends also tend to favor the more
rigid parts of the body over the articulated components.
4. Conclusions and Future Work
This contribution introduces an improved model for
tracking complex human motion using 3-D observations
derived from a multiple camera configuration. We suggest
a structural model of the human body that leverages the
simplicity and robustness of a 3-D bounding volume and
the elegance and accuracy of a more highly parameterized
stick model. This hierarchical structural model is
accompanied by hard and soft kinematic constraints. The
hard constraints, derived from actual body measurements,
include limits on velocity and acceleration, an individ-
ualized scaling factor for the body model, and a set of
minimum and maximum spanning distances for all pairs of
model parameters. The soft constraints take the form of a
priori, probabilistic distributions for each model
parameter and are based on learned examples of human
motion. We use these soft kinematic constraints to derive
an acceleration term for each tracked feature. This factor
is used to augment the potentially time-varying velocity of
a classical dynamic motion model. The result is an
increase in tracking accuracy in the presence of significant
occlusion and articulated movement.
Acknowledgments
This research was supported in part by grants and
financial support from Eastman Kodak Company and the
Center for Future Health. We would also like to recognize
Stephanie C. Dockstader and Kelly A. Bergkessel for their
assistance in building training samples for the a priori
kinematic model and defining 3-D ground truth for quant-
ifying the accuracy of the proposed algorithm.
References
[1] Y. Song, L. Goncalves, E. Di Bernardo, and P. Perona,
Monocular Perception of Biological Motion in Johansson
Displays, Computer Vision and Image Understanding,
vol. 81, no. 3, pp. 303-327, 2001.
[2] A. F. Bobick and J. W. Davis, The Recognition of Human
Movement Using Temporal Templates, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 23, no. 3,
pp. 257-267, 2001.
[3] Y. Yacoob and M. J. Black, Parameterized Modeling and
Recognition of Activities, Computer Vision and Image
Understanding, vol. 73, no. 2, pp. 232-247, 1999.
[4] D. Cunado, J. M. Nash, M. S. Nixon, and J. N. Carter,
Gait extraction and description by evidence-gathering,
Proc. of the Int. Conf. on Audio and Video-Based Bio-
metric Person Authentication, Washington, DC, 22-23
March 1999, pp. 43-48.
[5] A. Baumberg and D. Hogg, Learning flexible models from
image sequences, Proc. of the Eur. Conf. on Computer
Vision, Stockholm, Sweden, 2-6 May 1994, vol. 1, pp. 299-
308.
[6] G. Monheit and N. I. Badler, A Kinematic Model of the
Human Spine, IEEE Computer Graphics and Appli-
cations, vol. 11, no. 2, pp. 29-38, 1991.
[7] C. Rasmussen, Joint likelihood methods for mitigating
visual tracking differences, Proc. of the Int. Workshop on
Multi-Object Tracking, Vancouver, Canada, 8 July 2001,
pp. 69-76.
[8] H. Sidenbladh, M. J. Black, and D. J. Fleet, Stochastic
tracking of 3D human figures using 2D image motion,
Proc. of the European Conf. on Computer Vision, Dublin,
Ireland, 26 June - 1 July 2000, pp. 702-718.
[9] Y. Yacoob and L. S. Davis, Learned Models for
Estimation of Rigid and Articulated Human Motion from
Stationary or Moving Camera, Int. J. of Computer Vision,
vol. 12, no. 1, pp. 5-30, 2000.
[10] H. Sidenbladh and M. Black, Learning image statistics for
Bayesian tracking, Proc. of the Int. Conf. on Computer
Vision, Vancouver, Canada, 9-12 July 2001, vol. 2, pp.
709-716.
[11] C. R. Wren, B. P. Clarkson, and A. P. Pentland, Under-
standing purposeful human motion, Proc. of the Int. Conf.
on Automatic Face and Gesture Recognition, Grenoble,
France, 28-30 March 2000, pp. 378-383.
[12] D. M. Gavrila, The Visual Analysis of Human Movement:
A Survey, Computer Vision and Image Understanding,
vol. 73, no. 1, pp. 82-98, 1999.
[13] J. K. Aggarwal and Q. Cai, Human Motion Analysis: A
Review, Computer Vision and Image Understanding, vol.
73, no. 3, pp. 428-440, 1999.
[14] Z. Chen and H. I. Lee, Knowledge-Guided Visual
Perception of 3D Human Gait from a Single Image
Sequence, IEEE Trans. on Systems, Man, and Cyber-
netics, vol. 22, no. 2, pp. 336-342, 1992.
[15] S. L. Dockstader and A. M. Tekalp, On the Tracking of
Articulated and Occluded Video Object Motion, Real-
Time Imaging, vol. 7, no. 5, pp. 415-432, 2001.
[16] G. Shakhnarovich, L. Lee, and T. Darrell, Integrated face
and gait recognition from multiple views, Proc. of the
Conf. on Computer Vision and Pattern Recognition, Lihue,
Kauai, HI, 11-13 December 2001, vol. 1, pp. 439-446.
[17] M. Isard and A. Blake, Condensation - Conditional
Density Propagation for Visual Tracking, Int. J. of
Computer Vision, vol. 29, no. 1, pp. 5-28, 1998.
[18] A. Azarbayejani and A. P. Pentland, Recursive Estimation
of Motion, Structure, and Focal Length, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 17, no. 6,
pp. 562-575, 1995.
[19] S. L. Dockstader and A. M. Tekalp, Multiple Camera
Tracking of Interacting and Occluded Human Motion,
Proc. of the IEEE, vol. 89, no. 10, pp. 1441-1455, 2001.
[20] J. Deutscher, A. Davison, and I. Reid, Automatic
partitioning of high dimensional search spaces with
articulated body motion capture, Proc. of the Conf. on
Computer Vision and Pattern Recognition, Lihue, Kauai,
HI, 8-14 December 2001, vol. 2, pp. 669-676.