
UMR 8022 CNRS - USTL

State of the Art in Body Tracking

Chabane Djeraba

Publication interne n° 06 - 2005

Université des Sciences et Technologies de Lille


LIFL - UMR CNRS 8022 - UFR IEEA - Bât. M3 - 59655 Villeneuve d'Ascq cedex
Tél. : 33 (0)3 28 77 85 41 - Fax : 33 (0)3 28 77 85 37
Technical report of LIFL – 6-2005

State of the Art in Body Tracking


Chabane Djeraba
LIFL, UMR CNRS-USTL 8022

Introduction
Definition: body tracking is the process of capturing the large-scale body movements of a subject (person)
at some resolution. Not covered by this definition are small-scale body movements such as
facial expressions and hand gestures. Body tracking is used both when the subject is viewed as a single
object and when it is viewed as the articulated motion of a high-degree-of-freedom skeleton structure with a
number of joints. The term generally used in the literature is human motion capture.
The state of the art in the field of body tracking is particularly rich. Interest in this problem has
been driven by numerous promising applications, such as video surveillance (e.g. detection of
incidents in the subway [Cup 2004] or person recognition by gait [Lee 2002]). Furthermore,
recent technological advances in the real-time capture, transfer and processing of
images on widely available low-cost hardware platforms (e.g. PCs) have made current systems far
more practical to deploy.
There are two categories of body tracking techniques: intrusive sensing and non-intrusive sensing.
Intrusive sensing techniques operate by placing devices on the subject and in his environment which
transmit or receive generated information. They allow for simpler processing
and are widely used when the applications are situated in well-controlled environments (e.g. helping
athletes understand and improve their performance, diagnostics of orthopedic patients). Non-intrusive
sensing techniques are based on “natural” information sources (e.g. a fixed camera) and need no
wearable devices (e.g. tracking customers in shops, where the system tracks people to anticipate their
needs, or surveillance of a parking lot, where the system tracks people to evaluate whether they may be
about to commit a crime such as stealing a car). Non-intrusive sensing is much more complex than
intrusive sensing. We will limit our investigation to some representative works on non-intrusive
sensing techniques for body tracking.
The general framework of body tracking (which also holds for gesture tracking) is based on four main tasks
[Oro 1980]: prediction, synthesis, image analysis and state estimation. Many tracking approaches
follow this general model. The prediction task considers previous states up to time t to make a
prediction for time t+1; it allows the integration of knowledge into the tracking process. The synthesis
component translates the prediction from the state level to the measurement (image) level. This
allows the image analysis task to selectively focus on a subset of regions and look for a
subset of features. Finally, the state estimation task computes the new state using the segmented
image. This framework can be applied to any model-based tracking problem, whether it involves a 2D
or 3D tracking space.
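As an illustration, the four tasks can be sketched as a minimal tracking loop on a synthetic one-dimensional signal. This is not from [Oro 1980]: the constant-velocity predictor, the fixed search window and the blending gain are illustrative assumptions.

```python
# Minimal predict / synthesize / analyze / estimate loop on a synthetic
# 1-D "video": a bright pixel moving one position per frame.
# The constant-velocity model, search window and gain are assumptions.

def predict(state):
    """Prediction: extrapolate the position with a constant-velocity model."""
    pos, vel = state
    return pos + vel

def synthesize(predicted_pos, window=3):
    """Synthesis: map the predicted state to an image-level search region."""
    return range(int(predicted_pos) - window, int(predicted_pos) + window + 1)

def analyze(frame, region):
    """Image analysis: look for the brightest pixel inside the region only."""
    candidates = [i for i in region if 0 <= i < len(frame)]
    return max(candidates, key=lambda i: frame[i])

def estimate(state, measured_pos, gain=0.8):
    """State estimation: blend the prediction with the measurement."""
    pos, vel = state
    predicted = pos + vel
    new_pos = (1.0 - gain) * predicted + gain * measured_pos
    return (new_pos, new_pos - pos)

# Frame t has its bright spot at index t + 2.
frames = [[1 if i == t + 2 else 0 for i in range(12)] for t in range(6)]
state = (2.0, 1.0)  # known start pose: position 2, velocity 1
for frame in frames[1:]:
    region = synthesize(predict(state))
    measurement = analyze(frame, region)
    state = estimate(state, measurement)
```

Note how the synthesis step restricts the image analysis to a small search region around the prediction, which is exactly what makes the framework efficient and, conversely, fragile when the prediction is wrong.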

Categories of body tracking techniques


Many parameters may be used to characterize the state-of-the-art approaches, for example the type of
model (e.g. stick-figure-based, volumetric, statistical), the dimensionality of the tracking space (2D,
3D), sensor multiplicity (e.g. monocular, stereo), sensor placement (centralized, distributed), sensor
mobility (stationary, moving), sensor light, applications, motion type (rigid, non-rigid, elastic),
recognition, subjects (mono-person, multi-person tracking), etc. This state of the art is organized
around the first two parameters. It highlights:
-2D approaches without explicit shape models.
-2D approaches with explicit shape models.
-3D approaches.

2D approaches without explicit shape models describe human movement in terms of simple low-level
2D features from a region of interest. Models for human action are then described in statistical terms
derived from these low-level features, or by simple heuristics. This category of approaches has been
especially popular for applications of hand pose estimation in sign language recognition and gesture-
based dialogue management. In these cases, the region of interest is typically obtained by background
image subtraction or skin color detection. This is followed by morphological operations to remove
noise. The extracted features are based on hand shape, movement or location of the interest region.
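A toy sketch of this pipeline follows, assuming a perfectly static background and invented intensity values; real systems would use calibrated thresholds and a full morphological opening (erosion followed by dilation) rather than the bare erosion shown here.

```python
# Toy background subtraction followed by a 4-neighbour erosion to remove
# isolated noise pixels; the threshold and intensities are invented.

def subtract_background(frame, background, thresh=50):
    """Foreground mask: pixels differing notably from the background model."""
    return [[1 if abs(f - b) > thresh else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]

def erode(mask):
    """Keep a foreground pixel only if all 4 neighbours are foreground too."""
    h, w = len(mask), len(mask[0])
    def fg(y, x):
        return 0 <= y < h and 0 <= x < w and mask[y][x] == 1
    return [[1 if mask[y][x] == 1 and fg(y - 1, x) and fg(y + 1, x)
                  and fg(y, x - 1) and fg(y, x + 1) else 0
             for x in range(w)] for y in range(h)]

background = [[10] * 5 for _ in range(5)]
frame = [row[:] for row in background]
for y in range(1, 4):           # a 3x3 "person" region of bright pixels
    for x in range(1, 4):
        frame[y][x] = 200
frame[0][4] = 200               # one isolated noise pixel
mask = erode(subtract_background(frame, background))
```

The isolated noise pixel survives the plain subtraction but is removed by the morphological step, while the core of the connected region remains.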
2D approaches with explicit shape models use a priori knowledge of human body 2D representation.
They take essentially a model-based (e.g. stick figures, wrapped around with ribbons or “blobs”) and
view-based approach to segment, track and label body parts. Since self-occlusion makes the problem
quite hard for arbitrary movements, many systems assume a priori knowledge of the type of movement
or the viewpoint under which it is observed. The human figure is typically segmented by background
subtraction, assuming a slowly changing or stationary background and a fixed camera.
2D approaches [Sie 2004] represent and identify homogeneous regions by colour or texture, and
assemble these regions to constitute the person. 2D modelling tries to represent the image constraints of
the picture of a person, either as the internal zone of an outline, or as a group of articulated regions.
Numerous techniques are used to implement this representation: motion detection is often
used as an initial stage to separate the background from moving objects, after which the background
can be subtracted from subsequent frames; colours, particularly the characteristic colours of faces
and hands, provide anchor points for fitting a model to a frame; Kalman
filters [Rig 2000] are often used to estimate the temporal evolution of the model parameters
along the sequence; and various probabilistic modelling techniques, such as Pseudo-2D Hidden
Markov Models [Rig 2000], particle filtering [Chen 2004] and B-splines for representing outlines, are also
employed.
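As an illustration of the Kalman filtering step, a minimal one-dimensional constant-velocity filter can be sketched as follows. The scalar uncertainty and the noise magnitudes q and r are simplifying assumptions for the example; a real tracker would maintain a full state covariance matrix.

```python
# Minimal 1-D constant-velocity Kalman filter with a scalar uncertainty
# proxy; q (process noise) and r (measurement noise) are invented values.

def kalman_step(x, v, p, z, q=0.01, r=1.0, dt=1.0):
    """One predict/update cycle: state (position x, velocity v),
    uncertainty p, and scalar position measurement z."""
    # Predict: extrapolate the state, inflate the uncertainty
    x_pred = x + v * dt
    p_pred = p + q
    # Update: blend prediction and measurement via the Kalman gain
    k = p_pred / (p_pred + r)
    innovation = z - x_pred
    x_new = x_pred + k * innovation
    v_new = v + k * innovation / dt
    p_new = (1.0 - k) * p_pred
    return x_new, v_new, p_new

# Noisy position measurements of an object moving ~1 unit per frame.
measurements = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.1]
x, v, p = 0.0, 1.0, 1.0
for z in measurements:
    x, v, p = kalman_step(x, v, p, z)
```

The estimated position and velocity converge towards the true motion while the uncertainty p settles to a steady-state value determined by q and r.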
These tracking techniques can be used in more or less complex situations, such as: tracking one or several
persons [Zha 2002]; handling occlusions by objects, or meetings between persons; using one or
several cameras, which raises the problem of identifying the zones visible from several cameras at
once [Khan 2003]; using mobile, often uncalibrated cameras; detecting and analysing crowd
movements [Bey 2004]; and real-time situations with simultaneous processing of several video streams [Rui
2003].
3D approaches use the volume of the person to track their movements in the video. They capture 3D
articulated pose over time, which is very complex to do from 2D images. That is why these approaches
take advantage of the large amount of a priori knowledge available about the kinematic and shape
properties of the human body to make the problem tractable. Tracking is also well supported by the
use of a 3D shape model, which can predict events such as (self-)occlusion and (self-)collision. Using
3D joint angles as features to represent body pose has the advantage of being viewpoint-independent.
Joint angles are also less sensitive to variations in the size of
humans, compared to 3D joint coordinates.
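The viewpoint independence of joint angles can be checked numerically: the angle at a joint, computed from 3D joint positions, is unchanged when the whole configuration is rotated (i.e. viewed from another direction), while the 3D coordinates themselves change. The joint positions below are invented for the example.

```python
# Joint angles are viewpoint-independent pose features: the angle at a
# joint is unchanged by a rotation of the whole body (a viewpoint change),
# while the 3-D joint coordinates themselves are not.
import math

def angle_at(a, b, c):
    """Angle (degrees) at joint b formed by the segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    w = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * wi for ui, wi in zip(u, w))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return math.degrees(math.acos(dot / (norm_u * norm_w)))

def rotate_y(p, theta):
    """Rotate a point about the vertical axis: a change of viewpoint."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] + s * p[2], p[1], -s * p[0] + c * p[2]]

shoulder = [0.0, 1.4, 0.0]
elbow = [0.3, 1.1, 0.0]
wrist = [0.5, 1.3, 0.2]
angle_front = angle_at(shoulder, elbow, wrist)
rotated = [rotate_y(p, 0.7) for p in (shoulder, elbow, wrist)]
angle_side = angle_at(*rotated)
```

The two computed elbow angles agree to numerical precision, whereas the wrist coordinates differ between the two viewpoints.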

Constraints
The techniques developed so far are all based on a number of constraints that make the problem
tractable, and we are far from a general solution to the body tracking problem. Which constraints a
particular system uses depends on its goals. Generally, the complexity of a system is reflected in the
number of constraints introduced: the fewer the assumptions, the higher the complexity. These
constraints have been listed in ranked order according to their frequency, based on an analysis of 180
papers [Moeslund 2000]. The same reference provides a good survey of human motion capture up to
2000.
We can distinguish three categories of constraints:
-Ten (10) constraints related to movements.
The first three constraints related to movements (the subject remains inside the workspace, no or
constant camera motion, only one person in the workspace at a time) are used in nearly all current
systems. The next constraint (the subject faces the camera at all times) is mainly used in human-computer
interfaces, and simplifies the calculation of the overall body pose. The next constraint
(movements parallel to the camera plane) reduces the dimensionality of the problem from 3D to 2D
and is often used in applications such as gait analysis. The sixth constraint (no occlusion) simplifies
the task of tracking the subject and limbs, since the entire posture of the subject is visible in every
frame. The next constraint (slow and continuous movements) allows a simple and continuous
trajectory calculation; it simplifies the calculation of the velocity of the subject and of the camera. The
eighth constraint (only one or a few limbs move) allows the system to focus on only one or a few body parts. The
next constraint (the motion pattern of the subject is known) simplifies the tracking and pose
estimation problems by reducing the solution space. The final constraint (the subject moves on a flat
ground plane) allows calculation of the distance between the camera and the subject using the camera
geometry and the size of the subject.
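The last constraint can be illustrated with a pinhole-camera calculation: assuming a known focal length and an assumed real-world subject height, the camera-subject distance follows directly from the silhouette height in the image. All numbers below are invented for the example.

```python
# Pinhole-camera sketch of the flat-ground-plane constraint: with a known
# focal length and an assumed real-world subject height, the distance to
# the subject follows from its height in the image.

def distance_from_height(focal_px, real_height_m, image_height_px):
    """Invert the pinhole relation image_height = focal * real_height / Z."""
    return focal_px * real_height_m / image_height_px

focal = 800.0        # focal length in pixels (assumed known calibration)
height = 1.75        # assumed subject height in metres
pixels = 140.0       # measured silhouette height in the image, in pixels
distance = distance_from_height(focal, height, pixels)
```

Halving the distance doubles the silhouette height, so a subject measured at 280 pixels would be at 5 m under the same assumptions.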
-Five (5) constraints related to environment appearance.
The first environmental assumption (constant lighting) in practice constrains the scene to be indoors.
The next assumption (static background) makes it possible to
segment the subject based on motion information. The third assumption (uniform background)
constrains the background further to have a uniform colour, so that simple thresholding may be used to
segment the subject. The first two assumptions (constant lighting, static background) are used in
many systems, while the third (uniform background) is used in approximately half of
them. The fourth assumption (known camera parameters) is necessary in
order to obtain absolute measures during registration. The last environmental assumption concerns
the use of special hardware, such as multiple cameras or an IR camera.
-Five (5) constraints related to person appearance.
The first subject assumption (known start pose) is introduced in many
systems to simplify the initialisation problem. The next assumption (known subject) concerns prior
knowledge of the subject, e.g. in terms of specific model parameters such as the subject’s height and
the length and width of limbs. The last three subject assumptions (markers placed on the subject,
specially coloured clothes, tight-fitting clothes) reduce the segmentation problems by making the
subject’s “structure” easier to detect.

Discussion
1) The occlusion issue remains a difficult problem. Most approaches do not deal with significant (self-)
occlusion of the body, and do not specify criteria for when to stop and restart tracking.
2) The performance of the proposed approaches, both 2D and 3D, depends on the application.
For example, 2D approaches perform well for applications where accurate pose capture is not
needed: e.g. tracking pedestrians in a surveillance setting does not need high image resolution, so
precise pose recovery is not required. 2D approaches are also suitable for mono-person
applications involving constrained movement and a single viewpoint (e.g. hand pose estimation in sign
language recognition facing the camera, or recognizing gait lateral to the camera). More generally, 2D
approaches have had some success in body tracking when only a few, well-separable
motion classes are involved.
3D approaches seem more suitable for applications in environments where we would like to
track varied, unconstrained and multiple human movements (e.g. making different gestures while
walking, turning and running, or human-human interactions such as shaking hands, dancing or fighting).
3D approaches offer a more accurate and compact representation of physical space, which allows
better prediction and handling of occlusion and collision. Although 3D approaches seem suitable
for robust body tracking, a number of challenges need to be resolved before 3D-based
approaches can be effectively operational:
Unconstrained movement. One of the challenges for 3D pose capture systems is to
prove that they scale up to unconstrained movement. The majority of 3D tracking approaches introduce
simplifications (e.g. constrained movement, segmentation) or limitations (e.g. processing speed). They
also use multiple cameras to achieve accurate 3D capture of body poses and movements [Reh 1994].
Body poses and movements that are too ambiguous from one view (due to occlusion or depth) can be
disambiguated from another view. Furthermore, a calibration effort is then required, which does not
simplify the tracking process.
The model acquisition issue. Few approaches [Kak 95] capture both shape and pose parameters from
uncontrolled movement, e.g. the case of a person running into a stadium and moving freely around. A 3D
model is characterized by various shape features that must be estimated from the images, and
this task is not simple. Some works separate model acquisition from pose capture, i.e. they require a
separate initialization stage in which either known poses or known movements simplify the acquisition
of the shape parameters.
The modeling issue. Few human models have incorporated parameters such as joint angle limits and
collision, even fewer have taken into account dynamical properties such as balance, and few
approaches model the objects the human interacts with.
3) The initialization problem, for both 2D and 3D approaches, remains open. The majority of
approaches deal with incremental pose estimation and do not provide ways of bootstrapping, either
initially or when tracking gets lost. But it is the availability of an easy initialization procedure, one that
can be started from a wide range of situations, that makes a system robust enough to be deployed
in real-world settings (e.g. [Wre 97]).
4) The ability to detect and track multiple humans in the scene is an open problem. Stronger models
may be necessary to handle occlusion and the correspondence problem between features and body
parts. Naive techniques that rely on background subtraction to obtain a segmented human figure will
no longer be feasible here.
5) Another open problem is the recognition of human generic actions and whether these generic
human actions could be identified and applied to a variety of applications. If indeed such useful
generic actions could be identified, would it be possible to identify corresponding features and
matching methods which are, to a large degree, application-independent? In these generic actions, a
distinction is made between stand-alone actions (e.g. walking, running, jumping, turning around,
bending over, looking around, squatting, pointing, falling, climbing, waving, clapping) and
interactions with objects (e.g. grasping, examining, transferring, throwing, dropping, pushing, hitting,
shaking, drinking, eating, writing, typing) or other people (e.g. shaking hands, embracing, kissing,
pushing, hitting).
6) Some of the general key issues that need to be addressed are initialisation, recovery from failure, and
robustness.
Too many systems are based on knowing the initial state of their system and/or a well-defined model
fitted (off-line) to the current subject. In a real life scenario, we may expect a system to run on its own,
i.e. adapt to the current situation.
A related problem is how to recover from failure. A number of systems are based on incremental
updates or on searching around a predicted value. Many of these fail due to occlusion, bad predictions,
or a change in the frame rate, camera focus or image resolution, and are not able to recover. This is an
important problem, since real-life applications are likely to challenge a system with new situations not
included in its design or training, thereby making it fail from time to time.
Robustness relates to the number of assumptions applied in systems, but also to the fact that most
systems are tested on fewer than 1000 frames. How can one justify evaluating the robustness of a
system over such a short test sequence? Long, publicly available test sets need to be generated
(as in the face recognition community) to evaluate the robustness of individual systems and to compare
systems against each other.

7) It is evident from the number of constraints used by current techniques that the research
domain is still in a phase of development. The direct use of a human model seems to be the preferred
trend. From state-of-the-art systems it can be seen that the choice of model type differs, while silhouettes
seem to be the preferred abstraction level. The use of silhouettes is motivated by the existence of
simple algorithms for their estimation.
8) The methods are based on incremental updates which rely on a number of constraints, such as “no
occlusion” and “the subject is the only moving object in the image”. Due to the incremental
update, the initial pose is required, and the systems have no way to recover after a total loss of track,
lacking a mechanism for globally searching the entire image. Another problem is the risk of
accumulating errors due to the incremental procedure.
Besides the problems related to incremental updates, another issue also has to be considered. Many
movements become ambiguous when projected onto the image plane; e.g. a rotation about an axis
parallel to the image plane produces the same optical flow field as a translation in a certain direction.
To solve these problems, multiple cameras or multiple data types are required.
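This ambiguity can be demonstrated numerically: for points on a distant plane at constant depth, a small rotation about the vertical axis and a suitably matched sideways translation produce nearly identical optical flow. The focal length, depth and motion magnitudes below are invented for the example.

```python
# Rotation/translation ambiguity in the image plane: for points at a large
# constant depth, a small rotation about the vertical axis and a matched
# sideways translation yield almost identical flow.
import math

def project(p, f=500.0):
    """Pinhole projection of a 3-D point onto the image plane."""
    x, y, z = p
    return (f * x / z, f * y / z)

def rotate_y(p, theta):
    """Rotate a 3-D point about the camera's vertical axis."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

points = [(x, y, 50.0) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 1.0)]
theta = 0.002                  # small rotation, in radians
dx = theta * 50.0              # sideways translation matched to the depth

max_diff = 0.0                 # largest flow discrepancy over all points
flow_mag = 0.0                 # typical flow magnitude, for scale
for p in points:
    u0, v0 = project(p)
    ur, vr = project(rotate_y(p, theta))           # flow under rotation
    ut, vt = project((p[0] + dx, p[1], p[2]))      # flow under translation
    max_diff = max(max_diff, abs(ur - ut), abs(vr - vt))
    flow_mag = max(flow_mag, abs(ur - u0))
```

The discrepancy between the two flow fields is orders of magnitude smaller than the flow itself, so a monocular observer cannot tell the two motions apart from flow alone.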

Some current projects


Many projects confirm the interest in this topic:
VSAM - Video Surveillance and Monitoring [Col 2000]: This project, developed by Carnegie
Mellon University and the Sarnoff Institute and funded by DARPA, studied a group of techniques for
video analysis in urban-environment (and battlefield) surveillance applications, allowing incidents
detected by a multitude of sensors to be signalled to an operator;
ADVISOR - Annotated Digital Video for Surveillance and Optimised Retrieval [Sie 2004]: This
European project (IST-1999-11287) studied the detection of incidents and the annotation of video
recordings for surveillance problems, with an application to subway surveillance;
CSAIL [Rah 2004]. As part of the VSAM project, the vision team at MIT developed
techniques to manage, in an automated manner, a set of fixed cameras covering a broad zone
of the environment;
CAVIAR - Context Aware Vision using Image-based Active Recognition [Jor 2004]. This European
project (IST 2001 37540) studies techniques for fine-grained image analysis to improve the
performance of surveillance systems, with applications in urban environments and commercial
activities;
GMF4iTV - Generic Media Framework for Interactive TV [Because 2005]. This European project
developed techniques for tracking objects in video sequences in order to create hyper-video links in
interactive television programmes, these links attaching metadata to objects identified in the
images.

Conclusion
For future systems to be more successful and less dependent on various constraints, new methods and
combinations of current methods should be developed, i.e. the combination of various image cues,
such as motion and silhouettes, and a more extensive and adaptive use of human models. Furthermore,
new sensors or combinations of sensors might also be an interesting path into the future. Work on
different sensor modalities (range, infrared, sound) would lead to systems with combined strengths.
By addressing the above issues, approaches would improve their capability to deal successfully with
complex human tracking. Furthermore, lightly intrusive markers may be a good compromise
between intrusive and non-intrusive sensing techniques. Finally, it seems both useful and necessary
to combine various data types, to broaden invariance and robustness to all possible situations.

References
[Bey 2004] Beymer, D., Person counting using stereo, Workshop on Human Motion, Dec. 2000.
[Cup 2004] Cupillard, F.; Avanzi, A.; Bremond, F.; Thonnat, M., Video understanding for metro
surveillance, IEEE International Conference on Networking, Sensing and Control, March 2004.
[Col 2000] Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa, "A
System for Video Surveillance and Monitoring: VSAM Final Report," Technical report CMU-RI-TR-
00-12, Robotics Institute, Carnegie Mellon University, May, 2000.
[Jor 2004] Pedro M. Jorge, Arnaldo J. Abrantes, Jorge S. Marques, Estimation of the Bayesian
Network Architecture for Object Tracking in Video Sequences, ICPR, Cambridge, August 2004
[Kak 95] I. Kakadiaris and D. Metaxas. 3-D human body model acquisition from multiple views. In
Proc. of International Conference on Computer Vision, pages 618-623, Cambridge, 1995.
[Khan 2003] Khan, S.; Shah, M., Consistent labeling of tracked objects in multiple cameras with
overlapping fields of view, IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(10):1355-1360, Oct. 2003.
[Lee 2002] Lee, L.; Grimson, W.E.L., Gait analysis for recognition and classification, Fifth IEEE
International Conference on Automatic Face and Gesture Recognition, May 2002.
[Moeslund 2000] T. B. Moeslund and E. Granum, “A Survey of Computer Vision-Based Human
Motion Capture”, Technical report, Laboratory of Computer Vision and Media Technology, Aalborg
University, Denmark.
[Niu 2003] Niu, W.; Jiao, L.; Han, D.; Wang, Y.-F., Real-time multiperson tracking in video
surveillance, Fourth International Conference on Information, Communications and Signal Processing,
Dec. 2003.
[Oro 1980] J. O'Rourke and N. Badler. Model-based image analysis of human motion using constraint
propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(6):522-536, 1980.
[Rah 2004] Ali Rahimi, Brian Dunagan, Trevor Darrell, Tracking People with a Sparse Network of
Bearing Sensors, European Conference on Computer Vision (ECCV), 2004.
[Reh 1994] J. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an application
to human hand tracking. In Proc. of European Conference on Computer Vision, pages 35-46,
Stockholm, 1994.
[Rig 2000] Rigoll, G.; Eickeler, S., Real-time tracking of moving persons by exploiting spatio-
temporal image slices, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):797-808,
Aug. 2000.
[Rui 2003] Ruiz-del-Solar, J.; Shats, A.; Verschae, R., Real-time tracking of multiple persons, 12th
International Conference on Image Analysis and Processing, pp. 109-114, Sept. 2003.
[Sie 2004] Siebel, N.; Maybank, S., The ADVISOR Visual Surveillance System, in Markus Clabian,
Vladimir Smutny and Gerd Stanke, editors, Proceedings of the ECCV 2004 workshop "Applications
of Computer Vision" (ACV'04), Prague, Czech Republic, pp. 103-111, May 2004, ISBN
80-01-02977-8.
[Wre 97] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the
human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.
[Zha 2002] Hung-Xin Zhao; Yea-Shuan Huang, Real-time multiple-person tracking system,
International Conference on Pattern Recognition, Aug. 2002.
