variability (inter-person variability) while the term τa accounts for the facial animation (intra-person variability). The shape and animation variabilities can be approximated well enough for practical purposes by this linear relation. Also, we assume that the two kinds of variability are independent. With this model, the ideal neutral face configuration is represented by τa = 0.

In this study, we use twelve modes for the facial Shape Units matrix S and six modes for the facial Animation Units (AUs) matrix A. Without loss of generality, we have chosen the six following AUs: lower lip depressor, lip stretcher, lip corner depressor, upper lip raiser, eyebrow lowerer and outer eyebrow raiser. These AUs are enough to cover most common facial animations (mouth and eyebrow movements). Moreover, they are essential for conveying emotions.

In equation (1), the 3D shape is expressed in a local coordinate system. However, one should relate the 3D coordinates to the image coordinate system. To this end, we adopt the weak perspective projection model. We neglect the perspective effects since the depth variation of the face can be considered as small compared to its absolute depth. Therefore, the mapping between the 3D face model and the image is given by a 2×4 matrix, M, encapsulating both the 3D head pose and the camera parameters. Thus, a 3D vertex Pi = (Xi, Yi, Zi)^T of the mesh g will be projected onto the image point pi = (ui, vi)^T given by:

(ui, vi)^T = M (Xi, Yi, Zi, 1)^T    (2)

For a given person, τs is constant. Estimating τs can be carried out using either feature-based or featureless approaches. Thus, the state of the 3D wireframe model is given by the 3D head pose parameters (three rotations and three translations) and the internal face animation control vector τa. This is given by the 12-dimensional vector b:

b = [θx, θy, θz, tx, ty, tz, τa^T]^T    (3)

2.2. Shape-free facial patches

chosen for the shape-free textures. The reported results are obtained with a shape-free patch of 5392 pixels. Regarding photometric transformations, a zero-mean unit-variance normalization is used to partially compensate for contrast variations. The complete image transformation is implemented as follows: (i) transfer the texture y using the piecewise affine transform associated with the vector b, and (ii) perform the grey-level normalization of the obtained patch.

Figure 1: (a) an input image with correct adaptation. (b) the corresponding shape-free facial image.

3. Problem formulation and adaptive observation model

Given a video sequence depicting a moving head/face, we would like to recover, for each frame, the 3D head pose and the facial actions encoded by the control vector τa. In other words, we would like to estimate the vector bt (equation 3) at time t given all the observed data until time t, denoted y1:t = {y1, ..., yt}. In a tracking context, the model parameters associated with the current frame will be handed over to the next frame.

For each input frame yt, the observation is simply the warped texture patch (the shape-free patch) associated with the geometric parameters bt. We use the HAT symbol for the tracked parameters and textures. For a given frame t, b̂t represents the computed geometric parameters and x̂t the corresponding shape-free patch, that is, x̂t = W(yt, b̂t).
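The weak-perspective mapping of equation (2) is a single matrix product per vertex. The following is an illustrative sketch, not the authors' code; the matrix M and the vertex values are made up:

```python
import numpy as np

def project_vertices(M, vertices):
    """Project Nx3 model vertices to Nx2 image points with the
    2x4 weak-perspective matrix M of equation (2)."""
    n = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((n, 1))])  # Nx4 homogeneous coords
    return (M @ homo.T).T

# Made-up example: scale 100, image center (320, 240), image y axis flipped.
M = np.array([[100.0, 0.0, 0.0, 320.0],
              [0.0, -100.0, 0.0, 240.0]])
P = np.array([[0.1, 0.2, 0.05]])
print(project_vertices(M, P))  # [[330. 220.]]
```

Because the projection is weak perspective, the Z coordinate only enters through whatever depth-dependent scale is baked into M; here its column is zero for simplicity.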
Proceedings 2nd Joint IEEE International Workshop on VS-PETS, Beijing, October 15-16, 2005
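The zero-mean unit-variance normalization of the shape-free patch amounts to two element-wise operations. An illustrative sketch; the patch values are made up, and the small `eps` guard against a constant patch is an addition of this sketch:

```python
import numpy as np

def normalize_patch(patch, eps=1e-8):
    """Zero-mean, unit-variance grey-level normalization of a
    shape-free patch, partially compensating for contrast changes.
    eps (an assumption of this sketch) avoids division by zero."""
    return (patch - patch.mean()) / (patch.std() + eps)

x = np.array([10.0, 12.0, 14.0, 16.0])  # made-up 4-pixel "patch"
z = normalize_patch(x)
# z has (numerically) zero mean and unit standard deviation
```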
where N(x; μi, σi) is the normal density:

N(x; μi, σi) = (2π σi²)^(−1/2) exp[ −(1/2) ((x − μi)/σi)² ]    (7)

We assume that Ât summarizes the past observations under an exponential envelope, that is, the past observations are exponentially forgotten with respect to the current texture. When the appearance is tracked for the current input image, i.e. the texture x̂t is available, we can compute the updated appearance and use it to track in the next frame.

It can be shown that the appearance model parameters, i.e., μ and σ, can be updated using the following equations (see [8] for more details on Online Appearance Models):

μt+1 = (1 − α) μt + α x̂t    (8)

σ²t+1 = (1 − α) σ²t + α (x̂t − μt)²    (9)

In the above equations, all μ's and σ²'s are vectorized and the operation is element-wise. This technique, also called recursive filtering, is simple, time-efficient and therefore suitable for real-time applications. The appearance parameters reflect the most recent observations within a roughly L = 1/α window with exponential decay. Figure 2 shows an envelope having α equal to 0.01 where the current frame is 500.

Note that μ is initialized with the first patch x̂. However, equation (9) is not used until the number of frames reaches a given value (e.g., the first 40 frames). For these frames, the classical variance is used, that is, equation (9) is used with α being set to 1/t.

Here we used a single Gaussian to model the appearance of each pixel in the shape-free template. However, modeling the appearance with Gaussian mixtures can also be used at the expense of additional computational load (e.g., see [14, 9]).

Figure 2: A sliding exponential envelope having an α equal to 0.01. The current frame/time is 500.

4. Tracking with a registration technique

Consider the state vector b = [θx, θy, θz, tx, ty, tz, τa^T]^T encapsulating the 3D head pose and the facial animations. In this section, we will show how this state can be recovered for time t from the previous known state b̂t−1.

The sought geometrical parameters bt at time t are related to the previous parameters by the following equation (b̂t−1 is known):

bt = b̂t−1 + Δbt    (10)

where Δbt is the unknown shift in the geometric parameters. This shift is estimated using a region-based registration technique that does not need any image feature extraction. In other words, Δbt is estimated such that the warped texture will be as close as possible to the facial appearance Ât. For this purpose, we minimize the Mahalanobis distance between the warped texture and the current appearance mean,

min_{bt} e(bt) = min_{bt} D(x(bt), μt) = Σ_{i=1}^{d} ((xi − μi) / σi)²    (11)

The above criterion can be minimized using an iterative first-order linear approximation which is equivalent to a gradient descent method. It is worthwhile noting that minimizing the above criterion is equivalent to maximizing the likelihood measure given by (6).

Gradient-descent registration: We assume that there exists a bt = b̂t−1 + Δbt such that the warped shape-free texture will be very close to the appearance mean, i.e.,

W(yt, bt) ≈ μt

Approximating W(yt, bt) via a first-order Taylor series expansion around b̂t−1 yields

W(yt, bt) ≈ W(yt, b̂t−1) + Gt (bt − b̂t−1)

where Gt is the gradient matrix. By combining the previous two equations we have:

μt ≈ W(yt, b̂t−1) + Gt (bt − b̂t−1)

Therefore, the shift in the parameter space is given by:

Δbt = bt − b̂t−1 = −Gt^† (W(yt, b̂t−1) − μt)    (12)

In practice, the solution bt (or equivalently the shift Δbt) is estimated by running several iterations until the error cannot be improved. We proceed as follows.
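As an aside, the appearance update of equations (8) and (9) amounts to element-wise recursive filtering. A minimal illustrative sketch, not the authors' code; the patch values and the α used for the check are made up:

```python
import numpy as np

def update_appearance(mu, sigma2, x_hat, alpha=0.01):
    """One recursive-filtering step of the online appearance model,
    equations (8) and (9); all operations are element-wise over the
    vectorized shape-free patch."""
    mu_new = (1.0 - alpha) * mu + alpha * x_hat
    sigma2_new = (1.0 - alpha) * sigma2 + alpha * (x_hat - mu) ** 2
    return mu_new, sigma2_new

# Made-up 4-pixel patch; alpha = 0.5 for an easy-to-check step
# (the paper's alpha = 0.01 corresponds to a roughly 100-frame window).
mu, sigma2 = update_appearance(np.zeros(4), np.ones(4),
                               np.full(4, 2.0), alpha=0.5)
# mu becomes 1.0 everywhere; sigma2 becomes 0.5*1 + 0.5*(2-0)^2 = 2.5
```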
Starting from b = b̂t−1, we compute the error vector (W(yt, b̂t−1) − μt) and the corresponding Mahalanobis distance e(b) (given by (11)). We find a shift Δb by multiplying the error vector with the negative pseudo-inverse of the gradient matrix using (12). The vector Δb gives a displacement in the search space for which the error, e, can be minimized. We compute a new parameter vector and a new error:

b' = b + ρ Δb    (13)
e' = e(b')

where ρ is a positive real. If e' < e, we update b according to (13) and the process is iterated until convergence. If e' > e, we try smaller update steps in the same direction (i.e., a smaller ρ is used). Convergence is declared when the error cannot be improved anymore. In practice, we found that convergence is reached with less than ten iterations.

Computation of the gradient matrix. The gradient matrix is given by:

G = ∂W(yt, bt)/∂b = ∂xt/∂b

It is approximated by numerical differences. Once the solution b̂t becomes available for a given frame, it is possible to compute the gradient matrix from the associated input image. We use the following. The jth column of G (j = 1, ..., dim(b)),

Gj = ∂W(yt, bt)/∂bj

can be estimated using differences:

Gj ≈ [W(yt, bt) − W(yt, bt + δj qj)] / δj

where δj is a suitable step size and qj is a vector with all elements zero except the jth element that equals one. To gain more accuracy, the jth column of G is estimated using several steps around the current value bj, and then averaging over all these, we get the final Gj as

Gj = (1/K) Σ_{k=−K/2, k≠0}^{K/2} [W(yt, bt) − W(yt, bt + k δj qj)] / (k δj)

where δj is the smallest perturbation associated with the parameter bj and K is the number of steps (in our experiments, K is set to 8). Note that other averaging windows can also be used, e.g. triangular or Gaussian windows.

Notice that the computation of the gradient matrix Gt at time t is carried out using the estimated geometric parameters b̂t−1 and the associated input image yt−1, since the adaptation at time t has not been computed.

It is worthwhile noting that the gradient matrix is computed for each time step. The advantage is twofold. First, a varying gradient matrix is able to accommodate appearance changes. Second, it will be closer to the exact gradient matrix since it is computed for the current geometric configuration (3D head pose and facial animations), whereas a fixed gradient matrix can be a source of errors.

Improving the minimized criterion. When significant out-of-plane rotations of the face occur, local self occlusions and distortions may appear in the shape-free texture x. In order to downweight their influence on the registration technique we incorporate the orientation of individual triangles of the 3D mesh in the minimized criterion, such that the contribution of any triangle becomes less significant as it shies away from the frontal view. Recall that the orientation of any 3D triangle with respect to the camera can be recovered since the 3D rotation between the 3D head model and the camera frame is tracked. For a given triangle, m, the angle γm is given by the angle between the optical axis k = (0, 0, 1)^T and the normal to the triangle expressed in the camera frame.

For any given frame, the minimized criterion (11) becomes:

min_{bt} e(bt) = Σ_{i=1}^{d} w(γi) ((xi − μi) / σi)²    (14)

where γi is the angle associated to the triangle containing the pixel i and w(γi) is a monotonically decreasing function of this angle.

Figure 3 displays three real 3D face poses: a frontal view and two non-frontal views. The left column shows the orientation of all individual triangles of the 3D mesh. The right column shows the corresponding shape-free texture.

5. Tracking results

Figure 4 displays the tracking results associated with a 300-frame-long sequence. The sequence is of resolution 640 × 480 pixels. Only frames 38, 167, 247, and 283 are shown. The upper left corner of each image shows the current appearance (μt) and the current shape-free texture (x̂t). The plots of this figure display the estimated values of the 3D head pose parameters (the three rotations and the three translations) as a function of the sequence frames. Since the used camera is calibrated, the absolute translation is recovered.

On a 3.2 GHz PC, a non-optimized C code of the approach computes the 12 degrees of freedom (the six 3D head pose parameters and the six facial actions) in less than 50 ms if the patch resolution is 1310 pixels. About half that time is required to compute the 3D head pose parameters.
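The registration loop of equations (10) to (13), together with the averaged finite-difference gradient, can be sketched as follows. This is an illustrative Python sketch, not the authors' C implementation: `warp` stands for the shape-free warping W(y, b), the gradient is taken as the Jacobian appearing in the Taylor expansion (which fixes the sign convention of the finite differences), and the toy linear warp and step sizes `deltas` used for the check at the end are made up.

```python
import numpy as np

def error(x, mu, sigma):
    """Mahalanobis distance of equation (11)."""
    return np.sum(((x - mu) / sigma) ** 2)

def gradient_matrix(warp, y, b, deltas, K=8):
    """Estimate G = dW/db column by column, averaging K finite
    differences around the current parameter value (K = 8 in the
    paper's experiments)."""
    x0 = warp(y, b)
    G = np.zeros((x0.size, b.size))
    for j in range(b.size):
        acc = np.zeros(x0.size)
        for k in range(-K // 2, K // 2 + 1):
            if k == 0:
                continue
            bp = b.copy()
            bp[j] += k * deltas[j]
            acc += (warp(y, bp) - x0) / (k * deltas[j])
        G[:, j] = acc / K
    return G

def register(warp, y, b_prev, mu, sigma, deltas, max_iter=10):
    """Gradient-descent registration: iterate equations (12)-(13),
    shrinking the step rho whenever the error does not decrease."""
    b = b_prev.copy()
    e = error(warp(y, b), mu, sigma)
    G_pinv = np.linalg.pinv(gradient_matrix(warp, y, b, deltas))
    for _ in range(max_iter):
        db = -G_pinv @ (warp(y, b) - mu)   # equation (12)
        rho = 1.0
        while rho > 1e-3:
            b_new = b + rho * db           # equation (13)
            e_new = error(warp(y, b_new), mu, sigma)
            if e_new < e:
                b, e = b_new, e_new
                break
            rho *= 0.5
        else:
            return b                       # no improvement: converged
    return b

# Toy check with a linear warp W(y, b) = A @ b (y unused):
A = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
mu = A @ np.array([1.0, 2.0])              # "appearance mean" at b* = (1, 2)
b_hat = register(lambda y, b: A @ b, None, np.zeros(2),
                 mu, np.ones(3), np.array([0.1, 0.1]))
# b_hat is close to (1, 2)
```

For a linear warp one pseudo-inverse step recovers the parameters exactly; for the real piecewise affine warp the linearization only holds locally, which is why the paper iterates with a shrinking step.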
6. Accuracy evaluation

The evaluation of the above appearance-based tracker has not been more formal than observing that it works quite well and that the features of the 3D model project onto their corresponding 2D features in the image sequence. The problem with an objective evaluation is that the absolute truth is not known. This is particularly true for the 3D head pose/motion, which is given by six degrees of freedom.
test sequences with known ground truth similarly to [5, 1]. In [1], a 3D face tracker based on a statistical facial texture was evaluated. A synthetic video sequence is created using a 3D mesh mapping a texture onto it, and then animating it according to some captured or semi-random motion. The tracker then tracks the face in the synthetic video sequence, and the discrepancy between the used synthetic motion (ground truth) and the estimated motion yields the accuracy of the tracker.

Although this scheme can give an idea of the tracker accuracy, it has several shortcomings. First, one can note the self-referential nature of the test, since the same 3D mesh is used in the synthesis phase and in the test phase. Second, synthetic videos may not look very life-like. Third, since the synthetic motion should be realistic to some extent, one has to use the output of another tracker, and if the same tracker is used the evaluation test becomes self-referential regarding the used 3D motions, in the sense that the tracker is tested with motion parameters that are easy to estimate. Therefore, our idea is to use stereo-based 3D facial surfaces (from which an accurate rigid 3D head motion can be retrieved), and at the same time run our appearance-based tracker on the associated monocular sequence. Then, the accuracy is evaluated by comparing the 3D head motions provided by the stereo data and the developed monocular 3D face tracker.

Figure 5: Tracking the 3D head pose and the facial actions applied to two video sequences.

Figure 6: Dense 3D facial data provided by a stereo head. (a) a stereo pair. (b) the corresponding computed 3D facial data with mapped texture displayed from three different points of view. (c) 3D facial data associated to another stereo pair illustrating a non-frontal face.

Figure 7: 3D registration of two facial clouds provided at frames 1 and 39, which are separated by a large yaw angle (about 40 degrees). (a) the range facial data associated to frames 1 and 39, expressed in the same coordinate system. (b) alignment results using the relative 3D face motion provided by our monocular tracker. (c) refinement of the registration using the Iterative Closest Point algorithm.
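The rigid displacement that aligns two facial clouds is the core quantity compared here. For clouds with known point correspondences it has a closed-form least-squares solution (the SVD-based Kabsch/Horn step), which is exactly the inner step that ICP iterates after re-estimating correspondences. A minimal illustrative sketch, not the evaluation code; the point cloud and the transform are made up:

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rigid (R, t) with Q ~ R @ P + t for corresponding
    Nx3 clouds P and Q (SVD-based Kabsch/Horn solution). ICP alternates
    this step with nearest-neighbour correspondence search."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

# Toy check: recover a 40-degree yaw rotation, as in the frame-1/39 pair.
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
a = np.deg2rad(40.0)
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.1, -0.2, 0.8])
Q = P @ R_true.T + t_true
R, t = best_rigid_transform(P, Q)
# R and t match R_true and t_true up to numerical precision
```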
and it is connected to the computer by an IEEE-1394 connector. Right and left color images were captured at a resolution of 640 × 480 pixels and a frame rate near to 30 fps. After capturing these right and left images, 3D data were computed by using the provided 3D reconstruction software. During the experiments the stereo head was placed at a distance of 80 centimeters from the person. Figure 6.a shows a stereo pair used in our evaluation. Figure 6.b shows the corresponding 3D facial data visualized from three different points of view. Figure 6.c depicts the 3D data associated with another stereo pair depicting a non-frontal face.

In our case, the 3D face model (a cloud of 3D points) is manually selected in the first stereo frame, which is captured in a frontal view (Figure 6.b). This 3D face model contains about 20500 3D points. More elaborate statistical techniques could be used for segmenting the 3D head in the range images (e.g., [10]). For subsequent frames, we adopt a moving head's bounding box to bound the head region in the range images (Figure 6.b). Note that for these frames accurate head segmentation is not required, since we are using a global 3D registration technique (see below).

The proposed evaluation, as mentioned before, consists of computing the 3D face motions using two different methods: (i) the proposed appearance-based approach (Section 4) using the monocular sequence provided by the right camera, and (ii) the 3D face motions computed by 3D registration of 3D facial data in different frames. Recall that the 3D rigid displacement that aligns two facial clouds obtained at two different frames is equivalent to the performed 3D head motion between these two frames.

The 3D registration is computed by means of the well-known Iterative Closest Point (ICP) algorithm. ICP, also referenced in the literature as a fine registration technique, assumes that the clouds to be registered are very close. ICP was originally presented by [2]. In our evaluation, since we use the 3D facial data/cloud in the first reference frame as a face model, the 3D registration may fail in subsequent frames containing large rotations; thus our idea is to use the monocular tracker solution as a starting solution for the ICP algorithm (see Figures 7 and 8). Therefore, the ICP returns a 3D rigid displacement that directly quantifies the monocular tracker accuracy.

6.2. Tracker accuracy

The 3D head pose was tracked in the previous 300-frame long sequence using two approaches: (i) the monocular tracker, and (ii) the joint use of the stereo-based facial data and the ICP algorithm. The 3D head pose parameters provided by the monocular tracker give the position and orientation of the 3D wireframe model with respect to the right camera frame (the one used by the monocular tracker). The 3D head motion is set to the 3D motion between the first frame and the current frame, which is easily recovered from the corresponding absolute 3D head poses.

Figure 7 illustrates the estimated 3D head pose/motion associated with frame 39. Figure 7.(a) illustrates the range facial data associated to frames 1 and 39, expressed in the same coordinate system. One can easily see that the face has performed a vertical rotation of about 40 degrees. Figure 7.(b) displays the 3D alignment obtained using the relative 3D face motion provided by the monocular tracker. Figure 7.(c) shows the refinement of the registration using the ICP algorithm, whose returned displacement gives the 3D error associated with the monocular tracker. One can notice that the monocular registration technique brings the two 3D clouds into a good alignment even though there is an offset in the in-depth translation. This is due to the monocular vision effect and to the use of a frontal texture model.

Figure 8 illustrates the estimated 3D head pose/motion associated with frame 85 using the monocular tracker and the ICP algorithm.

Figure 9 depicts the monocular tracker errors associated with the sequence depicted in Figure 4. These errors are computed by the ICP algorithm. As can be seen, the errors increase as the face shies away from the frontal view. However, the tracker never loses track. Due to the effect of monocular vision and very large out-of-plane rotations (more than 60 degrees), the estimated depth may suffer from a 6 cm error. However, within a useful working range about the frontal view, this error is about 3 cm, which corresponds to a one-pixel error given our camera parameters and the actual depth of the face.

7. Conclusion

In this paper, we have described our appearance-based 3D face tracker. We have introduced a general evaluation framework that is based on stereo-based range facial data. We have found that the out-of-plane motions can be off the track whenever the absolute orientation of the face is far from the frontal view, e.g., a vertical rotation of 60 degrees. However, even in these extreme cases, the appearance-based tracker is still usable and does not suffer from drifting due to these out-of-plane motion inaccuracies, which can be explained not only by the monocular effect but also by the fact that the texture/appearance of the 3D wireframe is modelled in a frontal view. Adopting multi-view shape-free texture models that partition the 3D rotations of the face is expected to considerably decrease such inaccuracies.

Although the joint use of 3D facial data and the ICP algorithm can be used as a 3D head tracker, the significant computational overhead of the ICP algorithm prohibits real-time performance. In the light of the evaluation, one is able to adjust the experimental set-up such that the monocular tracker provides optimal results.
Figure 9: 3D head pose errors computed by the ICP algorithm associated with the 300-frame long sequence (see Figure 4). For each degree of freedom (pitch, yaw, roll, and the X, Y, and Z translations), the absolute value of the error is plotted. For each frame, the ICP algorithm was initialized by the output of the monocular tracker, thus the refined 3D registration can be considered as the monocular tracker error.

References

[1] J. Ahlberg. Model-based coding: Extraction, coding, and evaluation of face model parameters. PhD thesis, No. 761, Linköping University, Sweden, September 2002.

[2] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.

[3] M.L. Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):322-336, 2000.

[4] S.B. Gokturk, J.Y. Bouguet, and R. Grzeszczuk. A data-driven model for monocular face tracking. In IEEE International Conference on Computer Vision, 2001.

[5] M. Harville, A. Rahimi, T. Darell, G. Gordon, and J. Woodfill. 3D pose tracking with linear depth and brightness constraints. In IEEE International Conference on Computer Vision, 1999.

[6] http://www.ptgrey.com.

[7] T.S. Jebara and A. Pentland. Parameterized structure from motion for 3D adaptive feedback tracking of faces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[8] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296-1311, 2003.

[9] D. Lee. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):827-832, 2005.

[10] S. Malassiotis and M.G. Strintzis. Robust real-time 3D head pose estimation from range data. Pattern Recognition, 38(8):1153-1165, 2005.

[11] F. Moreno, A. Tarrida, J. Andrade-Cetto, and A. Sanfeliu. 3D real-time tracking fusing color histograms and stereovision. In IEEE International Conference on Pattern Recognition, 2002.

[12] L.D. Mouse. Acoustic tracking system. http://www.vrdepot.com/vrteclg.htm.

[13] M.H. Yang, D.J. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002.

[14] S. Zhou, R. Chellappa, and B. Moghaddam. Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing, 13(11):1473-1490, 2004.