You are on page 1of 8

Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

Appearance-based 3D Face Tracker: An Evaluation Study *


Fadi Dornaika and Angel D. Sappa
Computer Vision Center, Edi ci 0, Campus UAB
08193 Bellaterra, Barcelona, SPAIN
{dornaika, sappa} @cvc.uab.es

Abstract cepts of Online Appearance Models (OAMs) and image-


based registration. This approach does not suffer from drift-
The ability to detect and track human heads and faces in ing and seems to be robust in the presence of large head
video sequences is useful in a great number ofapplications. motions and facial animations. In this paper, we summarize
In this paper; we present our recent 3D face tracker that the developed approach and propose an accuracy evaluation
combines Online Appearance Models with an image regis- based on dense depth data obtained from a stereo head. The
tration technique. This monocular tracker runs in real-time main innovation of this paper is the introduction of an eval-
and is drift insensitive. We introduce a scheme that takes uation of the appearance-based tracker using dense range
into account the orientation of localfacial regions into the data.
registration technique. Moreover; we introduce a general The rest of the paper is organized as follows. Section
frameworkfor evaluating the developed appearance-based 2 describes the deformable 3D face model that we use to
tracker Precision and usability of the tracker are assessed create shape-free facial patches from input images. Sec-
using stereo-based range facial data from which ground tion 3 describes the problem we are focusing on, and the
truth 3D motions are inferred. This evaluation quantifies online reconstruction of the facial appearance model. Sec-
the monocular tracker accuracy, and identifies its working tion 4 describes the proposed approach, that is, the recovery
range in 3D space. of the 3D head pose and facial actions using a registration
technique. Section 5 gives some experimental results using
1. Introduction monocular video sequences. Section 6 introduces an accu-
racy evaluation using range face data and gives quantitative
The ability to detect and track human heads and faces in accuracy evaluation obtained with a real video sequence.
video sequences is useful in a great number of applications, Section 7 concludes the paper.
such as human-computer interaction and gesture recogni-
tion. There are several commercial products capable of
accurate and reliable 3D head position and orientation es-
2. Modeling faces
timation. These are either based on magnetic sensors or 2.1. A deformable 3D model
on special markers placed on the face; both practices are In our study, we use the 3D face model Candide. This 3D
encumbering, causing discomfort and limiting natural mo- deformable wireframe model was first developed for the
tion. Vision-based 3D head tracking provides an attractive purpose of model-based image coding and computer ani-
alternative since vision sensors are not invasive and hence mation. The 3D shape of this wireframe model is directly
natural motions can be achieved. However, detecting and recorded in coordinate form. It is given by the coordinates
tracking faces in video sequences is a challenging task be- of the 3D vertices Pi, i = 1, . . ., n where n is the number
cause faces are non-rigid and their images have a high de- of vertices. Thus, the shape up to a global scale can be fully
gree of variability. A huge research effort has already been described by the 3n-vector g; the concatenation of the 3D
devoted to vision-based head and facial feature tracking in coordinates of all vertices Pi. The vector g is written as:
2D and 3D (e.g., [3, 4, 7, 10, 11, 13]).
Tracking the 3D head pose from a monocular image se- g = g+STs +ATa (1)
quence is a difficult problem. Proposed techniques may be where g is the standard shape of the model, Ts and Ta are
roughly classified to those based on optical flow and those shape and animation control vectors, respectively, and the
based on tracking some salient features. Recently, we devel- columns of S and A are the Shape and Animation Units.
oped an appearance-based 3D face tracker adopting the con- A Shape Unit provides a way to deform the 3D wireframe
* This work was supported in part by the Government of Spain under the such as to adapt the eye width, the head width, the eye sep-
CICYT project TRA2004-06702/AUT and The Ramon y Cajal Program. aration distance etc. Thus, the term S Ts accounts for shape

0-7803-9424-01051$20.00 ©2005 IEEE. 121


Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

variability (inter-person variability) while the term A Ta ac- chosen for the shape-free textures. The reported results are
counts for the facial animation (intra-person variability). obtained with a shape-free patch of 5392 pixels. Regard-
The shape and animation variabilities can be approximated ing photometric transformations, a zero-mean unit-variance
well enough for practical purposes by this linear relation. normalization is used to partially compensate for contrast
Also, we assume that the two kinds of variability are inde- variations. The complete image transformation is imple-
pendent. With this model, the ideal neutral face configura- mented as follows: (i) transfer the texture y using the piece-
tion is represented by Ta = 0. wise affine transform associated with the vector b, and (ii)
In this study, we use twelve modes for the facial Shape perform the grey-level normalization of the obtained patch.
Units matrix S and six modes for the facial animation units
Animation Units (AUs) matrix A. Without loss of gener-
ality, we have chosen the six following AUs: lower lip de-
pressor, lip stretcher, lip corner depressor, upper lip raiser,
eyebrow lowerer and outer eyebrow raiser. These AUs are
enough to cover most common facial animations (mouth
and eyebrow movements). Moreover, they are essential for
conveying emotions.
In equation (1), the 3D shape is expressed in a local co- (a) (b)
ordinate system. However, one should relate the 3D coor-
Figure 1: (a) an input image with correct adaptation. (b) the
dinates to the image coordinate system. To this end, we
corresponding shape-free facial image.
adopt the weak perspective projection model. We neglect
the perspective effects since the depth variation of the face
can be considered as small compared to its absolute depth.
Therefore, the mapping between the 3D face model and the
3. Problem formulation and adaptive
image is given by a 2 x 4 matrix, M, encapsulating both the observation model
3D head pose and the camera parameters.
Thus, a 3D vertex Pi (Xi, Yi, Z,)T C g will be pro- Given a video sequence depicting a moving head/face, we
jected onto the image point pi = (ut, V,)T given by: would like to recover, for each frame, the 3D head pose and
the facial actions encoded by the control vector Ta. In other
(ui, vi)T Yi, Zi, 1)T
M (Xi, (2) words, we would like to estimate the vector bt (equation 3)
at time t given all the observed data until time t, denoted
For a given person, Ts is constant. Estimating Ts can Yi:t yl,.. ,Yt}. In a tracking context, the model pa-
be carried out using either feature-based or featureless ap- rameters associated with the current frame will be handed
proaches. Thus, the state of the 3D wireframe model is over to the next frame.
given by the 3D head pose parameters (three rotations and For each input frame Yt, the observation is simply the
three translations) and the internal face animation control warped texture patch (the shape-free patch) associated with
vector Ta. This is given by the 12-dimensional vector b: the geometric parameters bt. We use the HAT symbol for
the tracked parameters and textures. For a given frame t, bt
b = [ox, v,
o OZ tx ty, tz, TaT ]TT3 represents the computed geometric parameters and Xt the
corresponding shape-free patch, that is,
2.2. Shape-free facial patches
it
I I

= x (bt) W (yt bt)


= (5)
A face texture is represented as a shape-free texture (geo-
metrically normalized image). The geometry of this image The estimation of bt from the sequence of images will
is obtained by projecting the standard shape g using a cen- be presented in the next Section.
tered frontal 3D pose onto an image with a given resolu- The appearance model associated to the shape-free facial
tion. The texture of this geometrically normalized image is patch at time t, At, is time-varying on that it models the ap-
obtained by texture mapping from the triangular 2D mesh pearances present in all observations x up to time (t -1).
in the input image (see figure 1) using a piece-wise affine The appearance model At obeys a Gaussian with a center
transform, W. The warping process applied to an input im- ,u and a variance (X. Notice that ,u and u- are vectors com-
age y is denoted by: posed of d components/pixels (d is the size of x) that are
assumed to be independent of each other. In summary, the
x(b) = W(y, b) (4) observation likelihood at time t is written as
d
where x denotes the shape-free texture patch and b denotes
the geometrical parameters. Several resolution levels can be P(Yt |bt) = p(xt |bt) =
fl N(xi; Hi, vij)
i=l
(6)

122
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

where N(x; ,ui, o-i) is the normal density: 4 Tracking with a registration tech-
nique
N(x;pi, oi) (27wo2)-1/2
= exp [-1 (x ¾)1 (7)
Consider the state vector b = [ox 0y tx0 Hz: ty, tz, TaT]T
encapsulating the 3D head pose and the facial animations.
We assume that At summarizes the past observations under In this section, we will show how this state can be recovered
an exponential envelop, that is, the past observations are for time t from the previous known state bft 1.
exponentially forgotten with respect to the current texture. The sought geometrical parameters bt at time t are re-
When the appearance is tracked for the current input image, latedI to the previous parameters by the following equation
i.e. the texture Xt is available, we can compute the updated (bt- , is known):
appearance and use it to track in the next frame.
It can be shown that the appearance model parameters, bt = bt-, +Abt (10)
i.e., ,u and u- can be updated using the following equations
(see [8] for more details on Online Appearance Models): where Abt is the unknown shift in the geometric parame-
ters. This shift is estimated using a region-based registration
ut+±i (1 -a) fit + aXt (8) technique that does not need any image feature extraction.
In other words, Abt is estimated such that the warped tex-
ture will be as close as possible to the facial appearance
(Jt2 1o) (Jt2 + at (it pt)2
a
(9) _
At. For this purpose, we minimize the Mahalanobis dis-
tance between the warped texture and the current appear-
In the above equations, all ,u's and (X2's are vectorized ance mean,
and the operation is element-wise. This technique, also
called recursive filtering, is simple, time-efficient and there- d 2
fore, suitable for real-time applications. The appearance
parameters reflect the most recent observations within a
mine(bt) = DinD(x(bt), Ut) v: XSi - iA
(1 1)
bt bt
roughly L = 1/a window with exponential decay. Fig-
ure 2 shows an envelop having a equal to 0.01 where the The above criterion can be minimized using iterative
current frame is 500. first-order linear approximation which is equivalent to a gra-
Note that ,u is initialized with the first patch x. However, dient descent method. It is worthwhile noting that mini-
equation (9) is not used until the number of frames reaches mizing the above criterion is equivalent to maximizing the
a given value (e.g., the first 40 frames). For these frames, likelihood measure given by (6).
the classical variance is used, that is, equation (9) is used
with a being set to . Gradient-descent registration: We assume that there ex-
Here we used a single Gaussian to model the appearance ists bt = 1t_j + Abt such that the warped shape-free tex-
of each pixel in the shape-free template. However, mod- ture will be very close to the appearance mean, i.e,
eling the appearance with Gaussian mixtures can also be
used on the expense of additional computational load (e.g., W (yt, bt) -_ t
see [14, 9]).
Approximating W(yt, bt) via a first-order Taylor series ex-
pansion around bt-, yields
009
008

007
-
W (yt, bt ) -_W(yt, bt- 1 ) + Gt (bt-b~t- 1 )
006 - where Gt is the gradient matrix. By combining the previous
two equations we have:

O0 -
Ttwt(yt
= i bt-he ) + Gt (bts-e ti g)
004 10-
50 0 20 20 0 5 0 5 0

003 - ~ ~ ~ Frme Therefore, the shift in the parameter space is given by:
Abt = bt- bit -Gt (W(yt, fit-1) Pt) (12)
Figure 2: A sliding exponential envelop having an a equal In practice, the solution bt (or equivalently the shift Abt)
to 0.01. The current frame/time is 500. is estimated by running several iterations until the error can-
not be improved. We proceed as follows.

123
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

Starting from b = bt-1, we compute the error vector It is worthwhile noting that the gradient matrix is com-
(W(yt, bt_ ) -Ut) and the corresponding Mahalanobis dis- puted for each time step. The advantage is twofold. First,
tance e(b) (given by (11)). We find a shift Ab by multi- a varying gradient matrix is able to accommodate appear-
plying the error vector with the negative pseudo-inverse of ance changes. Second, it will be closer to the exact gradient
the gradient matrix using (12). The vector /Ab gives a dis- matrix since it is computed for the current geometric con-
placement in the search space for which the error, e, can be figuration (3D head pose and facial animations) whereas a
minimized. We compute a new parameter vector and a new fixed gradient matrix can be a source of errors.
error:
b'= b + pAb (13) Improving the minimized criterion. When significant
e' = e(b') out-of-plane rotations of the face occur, local self occlu-
where p is a positive real. sions and distortions may appear in the shape-free texture
If e' < e, we update b according to (13) and the process x. In order to downweight their influence on the registration
is iterated until convergence. If e' > e, we try smaller up- technique we incorporate the orientation of individual tri-
date steps in the same direction (i.e., a smaller p is used). angles of the 3D mesh in the minimized criterion such that
Convergence is declared when the error cannot be improved the contribution of any triangle becomes less significant as
anymore. In practice, we found that convergence is reached it shies away from the frontal view. Recall that the orien-
with less than ten iterations. tation of any 3D triangle with respect to the camera can be
recovered since the 3D rotation between the 3D head model
and the camera frame is tracked. For a given triangle, m,
Computation of the gradient matrix. The gradient ma- the angle 7'y, is given by the angle between the optical axis
trix is given by: k = (0, 0, 1)T and the normal to the triangle expressed in
axt the camera frame
G= &W(yt,
Ob
bt)
Ob For any given frame, the minimized criterion (11) be-
comes:
It is approximated by numerical differences. Once the
solution bt becomes available for a given frame, it is possi- d 2
ble to compute the gradient matrix from the associated input mine(bt) = ZwQy() (Xi lPi) (14)
image. We use the following: bt i=l 0-i~~
The jth column of G(j 1, . . ., dim(b)):
where 'Yi is the angle associated to the triangle containing
the pixel i and w(7i) is a monotonic decreasing function.
Gj _ W(yt, bt) For example, we use w (Yj) +a
ca es uin
d nb
Figure 3 displays three real 3D face poses: a frontal view
can be estimated using differences and two non-frontal views. The left column shows the ori-
entation of all individual triangles of the 3D mesh. The right
Gj-W(yt
Gj
, bt) -W(yt, bt + 6 qj )
X~~~~ column shows the corresponding shape-free texture.
where d is a suitable step size and qj is a vector with all
elements zero except the jth element that equals one. To
5. Tracking results
gain more accuracy, the jth column of G is estimated using Figure 4 displays the tracking results associated with a
several steps around the current value bj, and then averaging 300-frame-long sequence. The sequence is of resolution
over all these, we get the final Gj as 640 x 480 pixels. Only frames 38, 167, 247, and 283 are
shown. The upper left corner of each image shows the cur-
1 K/2 W(yt, bt) -W(yt, bt + k 6j qj) rent appearance (it) and the current shape-free texture (Xt).
K The plots of this figure display the estimated values of the
k=-KI2,k#O 3D head pose parameters (the three rotations and the three
where 5j is the smallest perturbation associated with the pa- translations) as a function of the sequence frames. Since the
rameter bj and K is the number of steps (in our experiments, used camera is calibrated the absolute translation is recov-
K is set to 8). Note that other averaging windows can also ered.
be used, e.g. triangular or Gaussian windows. On a 3.2 GHz PC, a non-optimized C code of the ap-
Notice that the computation of the gradient matrix Gt proach computes the 12 degrees of freedom (the six 3D head
at time t is carried out using the estimated geometric para- pose parameters and the six facial actions) in less than 50 ms
meters bit- and the associated input image Yt-1 since the if the patch resolution is 1310 pixels. About half that time
adaptation at time t has not been computed. is required to compute the 3D head pose parameters.

124
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

Frontal view

Large yaw angle

Large yaw angle


Pitch Yaw
Figure 3: Three different 3D face orientations (from top to
bottom): frontal view, a vertical rotation to the left, and a
vertical rotation to the right. The first column depicts the
angle of all 3D triangles with respect to the camera. Dark cm cm
grey level corresponds to small angles (the 3D triangle is in a)
0
a)
0

fronto-parallel plane) while bright grey level corresponds to


large angles. The right column depicts the corresponding
shape-free texture.
Frames Frames
Roll X Translation
Figure 5 displays the head and facial action tracking re-
sults associated with two video sequences. These sequences
are of resolution 640 x 480 pixels.
6)
a) E
0

6. Accuracy evaluation
The evaluation of the above appearance-based tracker has
not been more formal than observing that it works quite Frames Frames
well and that the features of the 3D model projects onto Y Translation Z Translation
their corresponding 2D features in the image sequence. The
problem with an objective evaluation is that the absolute
truth is not known. This is particularly true for the 3D
head pose/motion which is given by six degrees of freedom. E E

However, it is less problematic for the facial feature mo-


tion since their estimated motion can be assessed by check-
ing the alignment between the projected 3D model (feature
points and line segments) and the actual location of the fa-
Frames Frames
cial features. In our case, the facial features are given by the
lips and the eyebrows so evaluating their motion is straight- Figure 4: Tracking the 3D head pose with appearance-based
forward. We point out that their corresponding motions es- tracker. The sequence length is 300 frames. Only frames
sentially belong to the frontal plane of the face. There are 38, 167, 247, and 283 are shown. The upper left corner of
other techniques for measuring face motion, such as motion each image shows the current appearance Ut and the current
capture systems based on acoustic trackers [12] or magnetic texture. The six plots display the six degrees of freedom of
sensors. However, such systems are expensive and encum-
the 3D head pose as a function of time.
bering, and may not succeed to capture small motion ac-
curately. Since we are using a deformable 3D mesh we
can adopt an inexpensive solution that employs synthetic

125
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

(a)

(b)
Figure 5: Tracking the 3D head pose and the facial actions
applied to two video sequences.

test sequences with known ground truth similarly to [5, 1]. (c)
In [1], a 3D face tracker based on a statistical facial tex-
ture was evaluated. A synthetic video sequence is created Figure 6: Dense 3D facial data provided by a stereo head.
using a 3D mesh mapping a texture onto it, and then ani- (a) a stereo pair. (b) the corresponding computed 3D fa-
mating it according to some captured or semi-random mo- cial data with mapped texture displayed from three differ-
tion. The tracker then tracks the face in the synthetic video ent points of view. (c) 3D facial data associated to another
sequence and the discrepancy between the used synthetic stereo pair illustrating a non-frontal face.
motion (ground-truth) and the estimated motion yields the
accuracy of the tracker.
Although this scheme can give an idea on the tracker ac-
curacy, it has several shortcomings. First, one can note the
self-referential nature of the test, since the same 3D mesh
is used in the synthesis phase and in the test phase. Sec-
ond, synthetic videos may not look very life-like. Third,
since the synthetic motion should be realistic to some ex-
tent, one has to use the output of another tracker, and if (a) (b) (c)
the same tracker is used the evaluation test becomes self- Figure 7: 3D registration of two facial clouds provided at
referential regarding the used 3D motions in the sense that frames 1 and 39, which are separated by a large yaw an-
the tracker is tested with motion parameters that are easy to gle (about 40 degrees). (a) the range facial data associated
estimate. Therefore, our idea is to use stereo-based 3D fa- to frames 1 and 39, expressed in the same coordinate sys-
cial surfaces (from which an accurate rigid 3D head motion tem. (b) alignment results using the relative 3D face motion
can be retrieved), and at the same time run our appearance- provided by our monocular tracker. (c) refinement of the
based on the associated monocular sequence. Then, the ac- registration using the Iterative Closet Point algorithm.
curacy is evaluated by comparing the 3D head motions pro-
vided by the stereo data and the developed monocular 3D
face tracker.

6.1. 3D facial data and 3D face motion


A commercial stereo vision camera system (Bumblebee
from Point Grey [6]) was used. It consists of two Sony (a) (b) (c)
ICX084 color CCDs with 6mm focal length lenses. Bum-
blebee is a precalibrated system that does not require in- Figure 8: 3D registration of two facial clouds provided at
field calibration. The baseline of the stereo head is 12cm frames 1 and 85.

126
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

and it is connected to the computer by a IEEE-1394 con- the corresponding absolute 3D head poses. Figure 7 illus-
nector. Right and left color images were captured at a res- trates the estimated 3D head pose/motion associated with
olution of 640 x 480 pixels and a frame rate near to 30 fps. frame 39. Figure 7.(a) illustrates the range facial data asso-
After capturing these right and left images, 3D data were ciated to frames 1 and 39, expressed in the same coordinate
computed by using the provided 3D reconstruction soft- system. One can easily see that the face has performed a
ware. During the experiments the stereo head was placed vertical rotation of about 40 degrees. Figure 7.(b) displays
at a distance of 80 centimeters from the person. Figure 6.a the 3D alignment obtained using the relative 3D face mo-
shows a stereo pair used in our evaluation. Figure 6.b shows tion provided by the monocular tracker. Figure 7.(c) shows
the corresponding 3D facial data visualized from three dif- the refinement of the registration using the ICP algorithm
ferent points of view. Figure 6.c depicts the 3D data asso- whose returned displacement gives the 3D error associated
ciated with another stereo pair depicting a non-frontal face. with the monocular tracker. One can notice that the monoc-
In our case, the 3D face model (a cloud of 3D points) is ular registration technique brings the two 3D clouds into
manually selected in the first stereo frame which is cap- a good alignment even though there is an offset in the in-
tured in a frontal view ( Figure 6.b). This 3D face model depth translation. This is due to the monocular vision effect
contains about 20500 3D points. More elaborated statisti- and to the use of a frontal texture model.
cal techniques could be used for segmenting the 3D head in Figure 8 illustrates the estimated 3D head pose/motion
the range images (e.g., [10]). For subsequent frames, we associated with frame 85 using the monocular tracker and
adopt a moving head's bounding box to bound the head re- the ICP algorithm.
gion in the range images (Figure 6.b). Note that for these Figure 9 depicts the monocular tracker errors associated
frames accurate head segmentation is not required since we with the sequence depicted in Figure 4. These errors are
are using a global 3D registration technique (see below). computed by the ICP algorithm. As can be seen, the errors
The proposed evaluation, as mentioned before, consists increase as the face shies away from the frontal view. How-
of computing the 3D face motions using two different meth- ever, the tracker never loose the track. As can be seen, due
ods: (i) the proposed appearance-based approach (Section to the effect of monocular vision and very large out-of-plane
4) using the monocular sequence provided by the right cam- rotations (more than 60 degrees) the estimated depth may
era, and (ii) the 3D face motions computed by 3D registra- suffer from 6cm error. However, within a useful working
tion of 3D facial data in different frames. Recall that the 3D range about the frontal view, this error is about 3 cm which
rigid displacement that align two facial clouds obtained at corresponds to one pixel error given our camera parameters
two different frames is equivalent to the performed 3D head and the actual depth of the face.
motion between these two frames.
The 3D registration is computed by means of the well
known Iterative Closest Point, ICP, algorithm. ICP, also ref- 7. Conclusion
erenced in the literature as a fine registration technique, as-
In this paper, we have described our appearance-based
sumes that the clouds to be registered are very close. ICP
3D face tracker. We have introduced a general evalua-
has been originally presented by [2]. In our evaluation,
tion framework that is based on stereo-based range facial
since we use the 3D facial data/cloud in the first reference
data. We have found that the out-of-plane motions can be
frame as a face model, the 3D registration may fail in sub-
off the track whenever the absolute orientation of the face
sequent frames containing large rotations, thus our idea is
is so far from the frontal view, e.g, a vertical rotation of
to use the monocular tracker solution as a starting solution
60 degrees. However, even in these extreme cases, the
for the ICP algorithm (see Figures 7 and 8). Therefore, the
ICP returns a 3D rigid displacement that directly quantifies
appearance-based tracker is still usable and do not suffer
from drifting due to these out-of-plane motion inaccuracies
the monocular tracker accuracy.
that can be explained not only by the monocular effect but
also by the fact that the texture/appearance of the 3D wire-
6.2. Tracker accuracy frame is modelled in a frontal view. Adopting multi-view
The 3D head pose was tracked in the previous 300-frame shape-free texture models that partition the 3D rotations of
long sequence using two approaches: (i) the monocular the face is expected to considerably decrease such inaccu-
tracker, and (ii) the joint use of the stereo-based facial data racies.
and the ICP algorithm. The 3D head pose parameters pro- Although the joint use of 3D facial data and the ICP al-
vided by the monocular tracker gives the position and ori- gorithm can be used as a 3D head tracker, the significant
entation of the 3D wireframe model with respect to the right computational overhead of ICP algorithm prohibits real-
camera frame (the one used by the monocular tracker). The time performance. In the light of the evaluation, one is able
3D head motion is set to the 3D motion between the first to adjust the experimental set-up such that the monocular
frame and the current frame, which is easily recovered from tracker provides optimal results.

127
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005

References
[1] J. Ahlberg. Model-based coding: Extraction, coding, and
evaluation offace model parameters. PhD thesis, No. 761,
Linkoping University, Sweden, September 2002.
[2] P. Besl and N. McKay. A method for registration of 3-D
shapes. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 14(2):239-256, 1992.
[3] M.L. Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable
Pitch Yaw
head tracking under varying illumination: An approach
based on registration of texture-mapped 3D models. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
22(4):322-336, 2000.
a

O 4( O 4
[4] S.B. Gokturk, J.Y Bouguet, and R. Grzeszczuk. A data-
LU LU
1- driven model for monocular face tracking. In IEEE Interna-
tional Conference on Computer Vision, 2001.
[5] M. Harville, A. Rahimi, T. Darell, G. Gordon, and J. Wood-
Frames Frames fill. 3D pose tracking with linear depth and brightness con-
Roll X Translation straints. In IEEE International Conference on Computer Vi-
sion, 1999.
[6] http://www.ptgrey.com.
[7] T.S. Jebara and A. Pentland. Parameterized structure from
motion for 3D adaptative feedback tracking of faces. In Proc.
E

O 4(
LU LL IEEE Conference on Computer Vision and Pattern Recogni-
tion, 1997.
[8] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust online
Frames Frames
appearance models for visual tracking. IEEE Transactions
Y Translation Z Translation
on Pattern Analysis and Machine Intelligence, 25(10): 1296-
11-
1311, 2003.
[9] D. Lee. Effective Gaussian mixture learning for video back-
ground subtraction. IEEE Transactions on Pattern Analysis
E~ and Machine Intelligence, 27(5):827-832, 2005.
E
Li [10] S. Malassiotis and M. G. Strintzis. Robust real-time 3D
head pose estimation from range data. Pattern Recognition,
38(2005):1153-1165, 8.
[11] F. Moreno, A. Tarrida, J. Andrade-Cetto, and A. Sanfeliu. 3D
Frames Frames
real-time tracking fusing color histograms and stereovision.
In IEEE International Conference on Pattern Recognition,
Figure 9: 3D head pose errors computed by the ICP algo- 2002.
rithm associated with 300-frame long sequence (see Fig-
[12] L.D. Mouse. Acoustic tracking system.
ure 4). For each degree of freedom, the absolute value of http://www.vrdepot.com/vrteclg.htm.
the error is plotted. For each frame, the ICP algorithm was
initialized by the output of the monocular tracker, thus the [13] M.H. Yang, D.J. Kriegman, and N. Ahuja. Detecting faces
in images: A survey. IEEE Transactions on Pattern Analysis
refined 3D registration can be considered as the monocular
and Machine Intelligence, 24(1):34-58, 2002.
tracker error.
[14] S. Zhou, R. Chellappa, and B. Mogghaddam. Visual track-
ing and recognition using appearance-adaptive models in
particle filters. IEEE Transactions on Image Processing,
13(11):1473-1490, 2004.

128

You might also like