
Pattern Recognition Letters 50 (2014) 82–90


Automatic alignment and reconstruction of facial depth images

Giancarlo Taveira, Leandro A.F. Fernandes


Instituto de Computação, Universidade Federal Fluminense (UFF), CEP 24210-240 Niterói, RJ, Brazil

Article history:
Available online 12 December 2013

Keywords:
Depth image
Alignment
Interpolation
Resampling
Face image

Abstract

Face, gender, ethnic and age group classification systems often work through an alignment, feature extraction, and identification pipeline. The quality of the alignment process is thus central to the performance of the identification process. Furthermore, missing portions of depth information can greatly affect results. Appropriate image reconstruction is therefore crucial for the correct operation of those systems. This paper presents a simple and effective approach for the automatic alignment and reconstruction of damaged facial depth images. By using only four facial landmarks and the raw depth data, our approach converts a given damaged depth image into a smooth depth function, performs the 3D alignment of the underlying face with the face of an average person, and produces an aligned depth image having arbitrary resolution. Our experiments show that the proposed approach outperforms commonly used methods. For instance, we show that it improves the quality of a state-of-the-art gender classification technique.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The ability to retrieve information from facial depth images has many practical applications, including face recognition, age group estimation, and gender and ethnic group classification. Unfortunately, depth data is often damaged due to limitations intrinsic to off-the-shelf depth-image capturing systems. Examples include, but are not limited to, depth shadowing and the influence of reflective, refractive and infrared-absorbing materials in the scene (Zhu et al., 2008). Also, the number of pixels covering the imaged face and the face's orientation often vary from image to image, making it difficult or even impossible to use captured images without the proper alignment and reconstruction of depth data (Szeliski, 2010).
Virtually every computer vision researcher who needs to perform alignment and reconstruction of facial depth data usually presents his or her own solution to the problem. A well-known technique is to identify some facial features by curvature, and compute the alignment based on them (Moreno et al., 2005). Solutions based on principal component analysis (PCA) have also been proposed (Stormer and Rigoll, 2008). However, most of the attempts do not make proper use of depth information while performing the alignment, restricting the solution to the 2D image plane. Also, linear interpolation is commonly used to fill the holes (Wu et al., 2010), leading to unnatural flat artifacts on the facial surface.
It is remarkable that the solutions commonly applied in the literature contradict the common wisdom that appropriate alignment and resampling techniques must be employed in order to produce corrected depth images from the original ones. For instance, it is important to make use of the depth information intrinsic to this kind of data in order to align the structures of interest (i.e., the faces) in the 3D space, not just on the image plane. Furthermore, by considering the nature of the surface of interest, it is clear that it is necessary to apply smooth non-linear interpolation techniques capable of reconstructing the damaged portions of the original image and also of producing depth values with sub-pixel precision. With such care, the expectation is that the performance of depth-based classification techniques may be improved.

This paper has been recommended for acceptance by Dmitry Goldgof.
Corresponding author. Tel.: +55 21 2629 5665; fax: +55 21 2629 5669.
E-mail addresses: gtaveira@ic.uff.br (G. Taveira), laffernandes@ic.uff.br (L.A.F. Fernandes).
This paper presents a simple and effective method for aligning and reconstructing facial depth images from damaged depth data in a completely automatic way (Section 3). The approach uses information extracted from valid pixels to adjust a smooth thin-plate spline (TPS) interpolating function that naturally reconstructs the depth information of missing pixels (see Fig. 1) and computes smooth transitions among existing ones. The approach also explores facial landmarks in order to determine the actual position and orientation of the imaged face in the 3D space. The relation between the set of landmarks in the actual face and a set of canonical landmarks is used to map the shape of the imaged face to a standard space where the resulting aligned image is generated by ray casting the reconstructed surface. The developed ray casting scheme is derived from the relief mapping (RM) technique proposed by Policarpo et al. (2005) for real-time rendering of surface details mapped on coarse triangular meshes. Our approach easily fits into popular processing pipelines, and can be extended to produce correct color and normal map images to be used with the resulting depth images.

0167-8655/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2013.12.007


Fig. 1. For the same subject: (a) the original color image, (b) the original damaged depth image, and (c) the image with reconstructed depth information produced by our technique. Six aligned and reconstructed depth images of different subjects are presented in (d). Images (b), (c) and (d) are presented in false-color, where dark red pixels denote the surface closest to the camera. Notice in (c) the smooth transition of depth values in the originally corrupted portions (navy blue pixels in (b)). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Our experiments (Section 4) show that the approximation errors produced by our method are smaller than those obtained using linear interpolation for reconstruction with iterative closest point (ICP) for 3D alignment of depth data, and with 2D alignment of the depth images (see Fig. 2). We also present a comparative study among four distinct interpolation methods (nearest-neighbor, linear, natural-neighbor and thin-plate spline) using the proposed alignment method in the 3D domain. Each interpolation method was applied as part of the state-of-the-art gender classification processes proposed by Wu et al. (2010, 2011). Since the classification techniques receive surface normals computed from reconstructed facial depth images as input, the performance of these classification models as a function of the input images indicates the quality and the influence of each interpolation method on the result.
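To make the difference between these interpolation families concrete, the sketch below fills a synthetic damaged depth block with SciPy's general-purpose scattered-data interpolators (nearest-neighbor, linear, and a thin-plate-spline radial basis function standing in for the per-block TPS of Section 3). This is only an illustration under invented data, not the authors' C++/MATLAB implementation; natural-neighbor interpolation is omitted because SciPy's griddata does not provide it.

```python
import numpy as np
from scipy.interpolate import Rbf, griddata

# Synthetic 32 x 32 depth block: a smooth bump with a rectangular hole of
# missing pixels standing in for damaged depth data (values are invented).
u, v = np.meshgrid(np.arange(32.0), np.arange(32.0), indexing="ij")
depth = 100.0 + 5.0 * np.exp(-((u - 16) ** 2 + (v - 16) ** 2) / 80.0)
valid = np.ones(depth.shape, dtype=bool)
valid[10:18, 12:20] = False                      # damaged region

pts = np.column_stack([u[valid], v[valid]])      # control points (valid pixels)
query = np.column_stack([u.ravel(), v.ravel()])

# Piecewise-constant and piecewise-linear (C0) hole filling ...
near = griddata(pts, depth[valid], query, method="nearest").reshape(32, 32)
lin = griddata(pts, depth[valid], query, method="linear").reshape(32, 32)

# ... versus a smooth thin-plate spline reconstruction.
tps = Rbf(pts[:, 0], pts[:, 1], depth[valid], function="thin_plate")
smooth = tps(u.ravel(), v.ravel()).reshape(32, 32)

for name, img in (("nearest", near), ("linear", lin), ("thin-plate", smooth)):
    print(name, "max error in hole:", np.abs(img - depth)[~valid].max())
```

All three interpolators reproduce the valid pixels exactly; they differ in how they continue the surface across the hole, with the spline producing the smooth transitions discussed above.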

2. Related work

This section discusses the use of TPS in the interpolation of facial color images, the use of interpolation schemes to resample facial depth data, and alignment schemes for aligning human body surfaces.

Rosen (1996) developed the Java applet entitled AlexWarp. Since its creation, the applet has gained popularity among internet users worldwide for its simple and fast method of facial image warping. When the user provides one pair of landmark points, the applet determines the region to warp, warps it, and then outputs the warped picture. One major drawback of the AlexWarp applet is that transformations can only be applied one at a time. The AlexWarp applet works on colored images in the 2D domain with a limited number of control points.


Whitbeck and Guo (2006) implemented an applet as an improvement over the AlexWarp program. In their implementation, TPS was used to allow more control points to be added instead of just one pair, as in the AlexWarp. The TPS was applied in a 2D domain in order to interpolate warped colored images. In our work, we use TPSs on depth images. We use all points from the original depth image and do not intend to apply warping to the data.

Guo et al. (2004) created an average morphable shape represented by a TPS to be used in face recognition applications. The average face was created using a database of 60 individuals (33 males and 27 females) containing only records of Asian people, which resulted in a model restricted to a particular ethnic group. The facial landmarks, a total of 7, were manually set. In order to reconstruct the face of a subject, they projected the colored 2D image over the average 3D model. A reduced number of control points and the usage of a single TPS are some of the limitations of their work. The authors reported that the results, although not great, showed an interesting potential. In our work, we use several TPS functions to build a different 3D model for each subject. Also, we developed an adaptive block scheme in order to allow the use of all depth values of the image pixels while performing the reconstruction of damaged depth information.

Moreno et al. (2005) developed a 3D face modeling system and used two face recognition methods to test their model, one based on PCA and another one based on a support vector machine (SVM). Their system aims to work on face images with varying poses, in situations where there is no control over the depth data acquisition. They reported that median and Gaussian filters were applied in the pre-processing stage in order to remove noise and to smooth the curvature of the resulting surfaces. Instead, we propose the use


Fig. 2. Histograms showing the distribution of (a) minimum, (b) mean and (c) maximum squared error values computed from the depth values of a reference image and images produced using the proposed alignment and reconstruction approach (blue), a common 3D ICP-based alignment method with linear reconstruction (green), and a common 2D alignment technique with linear reconstruction (red). Notice that the error values of the proposed approach are smaller than those of the common approaches. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

of a TPS-based interpolation scheme combined with a RM-based ray casting technique to reconstruct missing portions of data as smooth surface patches.

Stormer and Rigoll (2008) proposed a procedure consisting of facial feature hypotheses extraction by invariant curvature features, PCA-based classification, and iterative closest point alignment to create aligned and normalized patches of faces in range images. They find the facial features by pre-processing the spatial discrete data and apply linear interpolation and a low-pass filter to get a closed and smooth surface. Their final results are patches that contain a minor portion of the eyes and nose to serve as input to a classification system. Their alignment and normalization procedure is not very accurate, since they use an iterative approach that approximates the facial features by a simple squared-distance nearest neighbor. The use of linear interpolation is also adopted by other face classification systems, including state-of-the-art gender classification techniques (Wu et al., 2010, 2011). It is important to notice that the interpolating surface of the linear method is C0 continuous. We propose the use of a TPS-based interpolation scheme that produces C1 continuous surfaces. Also, in contrast with Stormer and Rigoll's approach, we perform non-iterative 3D alignment using only four facial landmarks.

Segundo et al. (2012) presented a method for pre-aligning surfaces and for better finding a correspondence of keypoints between two objects. They also developed a system that automatically detects incorrect results, removing the need for manual human inspection. The authors used speeded up robust features (SURF) to find the correspondence between a single object's data obtained from different views. They also used SURF to find the transformation matrix to project the geometric data into the color information.

Azouz et al. (2004) proposed a 3D human model that is based on signed distance for shape analysis. They used PCA to obtain generic human characteristics. Their representation model lacks a good pre-processing stage, as they only apply a Taubin filter to remove noise in the data, and no alignment procedure is performed. PCA and ICP were used in the work of Yan and Bowyer (2005). The authors compared different approaches to ear recognition, both 2D and 3D. Ear recognition is particularly relevant in the field of biometrics. They report that an ICP-based approach outperformed every other result. Kakadiaris et al. (2007) presented a unified software and hardware solution for 3D face recognition. They used a variation of ICP for aligning the images but proposed the use of deformable model fitting (DMF) for subsampling depth data. The original ICP method is described by Besl and McKay (1992). In Section 4 we compare our face alignment approach against an ICP-based scheme.

Xu et al. (2009) presented a promising way to build a robust recognition system integrating depth and intensity information. Although their face recognition classifier is robust, their pre-processing stage is very poor. Therefore, their results could be improved by applying a more sophisticated alignment and reconstruction method. Our experiments show that our face alignment and reconstruction scheme may be used to improve gender classification processes.

Tekumalla and Cohen (2004) used a method based on the moving least squares (MLS) projection to fill holes in triangular meshes. Wang and Oliveira (2003) also used MLS as a hole-filling approach. This approach, although efficient to implement, can only handle holes of simple geometry that resemble a plane, since it relies on parameterizing the vicinity of the hole by orthographic projection onto a plane. Our hole-filling procedure, on the other hand, is independent of the structure of a mesh. It uses the point cloud as input and it fits a smooth surface to the input data.

3. The proposed alignment and reconstruction approach

The computation of the aligned facial depth image consists of four main steps. First, we take the input raw depth image and crop the rectangular region that contains the face of the imaged person (Section 3.1). Then, we adaptively subdivide the cropped region into smaller regions where TPSs are adjusted to the depth data in each of them (Section 3.2). The TPSs not only guarantee smooth interpolation of depth data while producing the final image but also provide the reconstruction of damaged and missing portions of depth information. In the third step, we use affine transformations to perform the 3D alignment of the landmarks of the given face with the landmarks of the face of an average person (Section 3.3). Lastly, by applying the same transformations to the TPSs, we map the input face to a standard space in which we perform ray casting in order to produce an aligned facial image having arbitrary resolution (Section 3.4).

Section 3.5 describes how the parameters of the average person can be obtained from the images in the dataset. Section 3.6 discusses how to transform the raw color images and how to produce normal maps that are consistent with the resulting depth images.

3.1. Image cropping

We define the axis-aligned cropping rectangle that contains the face by using the image coordinates (u_no, v_no)^T of the nose (no) as pivot point, and the horizontal distance d_le,re between the left (le) and right (re) eyes' outer corners and the vertical distance d_no,ch between the imaged nose and chin (ch) as parameters for computing, respectively, the width and height of the resulting sub-image (Fig. 3). The lower and upper corners of the cropping rectangle are expressed as:

C^- = (max(u_no - u_D, 1), max(v_no - v_D, 1))^T and C^+ = (min(u_no + u_D, w_ac), min(v_no + v_D, h_ac))^T, (1)

where w_ac and h_ac are, respectively, the width and height of the input image having pixel coordinates u in [1, w_ac] and v in [1, h_ac]. u_D and v_D are computed, respectively, as:

u_D = min(⌈1.5 d_le,re / 2⌉, w_ac - u_no) (2)

and

v_D = min(⌈1.5 d_no,ch⌉, h_ac - v_no), (3)

where ⌈·⌉ denotes the ceiling function. The proportion value 1.5 was empirically chosen for our experiments.

It is important to emphasize that defining a sub-image using a cropping rectangle is an optional step of our approach. By limiting data to the sub-image that contains the region of interest (i.e., the subject's face) we alleviate the computational cost of subsequent steps. Furthermore, the cropped image does not have to be perfectly symmetrical to the imaged face, because the actual alignment procedure will be performed by the final steps of our algorithm (see Sections 3.3 and 3.4). Also, the produced cropping rectangles may have different resolutions, since their dimensions are proportional to distances in a given input image. The only requirement for a sub-image is to contain the face of the subject. Therefore, the empirical scaling factor of 1.5 used in (2) and (3) may be changed in order to fit the images of different databases. However, we emphasize that the scale factor of 1.5 met our requirements well, even for faces covering different portions of the images.

The location of the eyes, nose and chin is usually provided by the image database (e.g., the UND Biometric Database, Collection D (Chang et al., 2003)). They can also be retrieved by automatic techniques (Romero-Huertas and Pears, 2008; Perakis et al., 2009) or manually identified.

Fig. 3. Image cropping. (left) A grayscale visualization of an original depth image (640 × 480) before any cropping is done. The distances d_le,re and d_no,ch are proportional to the cropping rectangle. (right) A scaled example of a cropped image (195 × 293). See (1) for details. Lighter shades of gray correspond to points closer to the camera.
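In code, the cropping step of Section 3.1 reduces to a few clamped expressions around the nose pivot. The sketch below assumes the ceiling-and-clamping form of Eqs. (1)-(3) as read from the text; the landmark positions in the usage example are invented.

```python
import math

def crop_rectangle(u_no, v_no, d_le_re, d_no_ch, w_ac, h_ac, k=1.5):
    """Axis-aligned cropping rectangle around the nose pivot (Section 3.1).

    u_no, v_no : nose position in pixels (1-based image coordinates)
    d_le_re    : horizontal distance between the eyes' outer corners
    d_no_ch    : vertical distance between the imaged nose and chin
    w_ac, h_ac : width and height of the input image
    k          : empirical proportion value (1.5 in the paper)
    """
    u_delta = min(math.ceil(k * d_le_re / 2.0), w_ac - u_no)        # Eq. (2)
    v_delta = min(math.ceil(k * d_no_ch), h_ac - v_no)              # Eq. (3)
    lower = (max(u_no - u_delta, 1), max(v_no - v_delta, 1))        # C-, Eq. (1)
    upper = (min(u_no + u_delta, w_ac), min(v_no + v_delta, h_ac))  # C+
    return lower, upper

# Invented landmark distances on a 640 x 480 image (the size used in Fig. 3):
lo, hi = crop_rectangle(u_no=320, v_no=240, d_le_re=90, d_no_ch=70,
                        w_ac=640, h_ac=480)
# lo == (252, 135), hi == (388, 345)
```

As the paper notes, the resulting rectangle only needs to contain the face; the clamping keeps it inside the image bounds.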

3.2. Depth data interpolation and reconstruction

Depth data interpolation and reconstruction is performed using TPSs adjusted to raw depth data. The TPS is the 2D analog of the cubic spline in one dimension (Bookstein, 1989). It encodes a height function that can be evaluated at a given (u, v)^T coordinate in order to retrieve the respective scalar value z that best describes the height surface passing through N non-overlapping control points having coordinates (u, v, z)^T. In this paper, u and v are coordinates of valid pixels (i.e., pixels storing valid depth information), and z is the associated depth value. It is important to notice that a TPS is adjusted to an unstructured set of control points, and also that it can be evaluated at any real-valued (u, v)^T position, returning a smoothly interpolated z value. Thus, it is clear that sub-pixel sampling and damaged facial depth reconstruction are naturally handled by the TPS-based interpolation scheme adopted in our work.
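Fitting such a height function amounts to solving a small dense linear system with a closed-form solution. Below is a minimal NumPy sketch of a single global spline with kernel U(r) = r^2 log r, as in Bookstein (1989); the paper's implementation instead fits one spline per adaptive block, in C++.

```python
import numpy as np

def _u(r):
    """TPS radial kernel U(r) = r^2 log(r), with U(0) = 0 by continuity."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def tps_fit(points, z):
    """Fit a TPS height function through N control points (closed form).

    points : (N, 2) array of (u, v) coordinates of valid pixels
    z      : (N,) array of depth values
    Returns N radial coefficients followed by 3 affine parameters.
    """
    n = len(points)
    K = _u(np.linalg.norm(points[:, None] - points[None, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), points])      # affine part: 1, u, v
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    return np.linalg.solve(A, np.concatenate([z, np.zeros(3)]))

def tps_eval(points, coeffs, queries):
    """Evaluate the spline at arbitrary real-valued (u, v) positions."""
    n = len(points)
    D = _u(np.linalg.norm(queries[:, None] - points[None, :], axis=-1))
    return D @ coeffs[:n] + coeffs[n] + queries @ coeffs[n + 1:]

# Control points on a 5 x 5 grid sampling the plane z = 2u + 3v + 1;
# the spline reproduces it everywhere, including at sub-pixel queries.
uv = np.array([[u, v] for u in range(5) for v in range(5)], dtype=float)
z = 2 * uv[:, 0] + 3 * uv[:, 1] + 1
c = tps_fit(uv, z)
print(tps_eval(uv, c, np.array([[1.5, 2.5]])))  # approximately [11.5]
```

The interpolant passes exactly through the control points, which is why missing pixels inside a damaged region are filled with a smooth continuation of the surrounding surface.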
A TPS is described by 2(N + 3) parameters, which include six global affine motion parameters and 2N coefficients for correspondences of the control points. These parameters are computed by solving a linear system having a closed-form solution. Due to the large number of parameters, the computation of a single TPS to all valid pixels in a cropped image may be unfeasible. We avoid such an issue by dividing the cropped image into adaptive blocks having a small number of control points, and fit a different TPS to each one of the blocks. Such an approach has two advantages: (i) it allows our technique to handle images having arbitrary size; and (ii) the procedure is less prone to numerical instability.

The adaptive blocks are initially distributed uniformly over the cropped image as a regular grid comprised by square entries having fixed size (most blocks in Fig. 4). However, since depth data may be damaged, some of the blocks may not contain enough control points to define a TPS. In such a case, we incrementally change the size of an ill-defined block by including a ring of surrounding pixels in it. An ill-defined block grows until it has enough valid pixels to solve the linear system of equations that computes the coefficients of the TPS (see the blocks in the lower right corner of Fig. 4). The TPS assigned to a block fits the points covered by the original block size and the points inside the overlapping region. However, after the TPS has been fitted, the evaluation of the smooth surface related to a block is performed only inside the original coverage of that block.

Fig. 4. An example of how the blocks in the last row and in the last column may be smaller in size than the rest of the blocks in the image. In the proposed algorithm, each block may independently grow in size until enough control points are within its boundaries (see the blocks in the lower right corner of the image).

3.3. Three-dimensional face alignment

The alignment stage computes the affine transformation that maps a face defined in the actual space (i.e., the 3D space where the imaged face resides) to a standard position and orientation defined in the standard space (i.e., the 3D space where all faces will be aligned). The transformation for a given imaged face is computed from the location of four landmarks in the actual coordinate frame (namely, left eye (le), right eye (re), nose (no) and chin (ch)) and the equivalent locations in the standard coordinate frame. In the following equations, each location is represented by a point P_S,F, where S is ac or st for, respectively, the actual or standard frames, and F is one of the labels in {le, re, no, ch}. The coordinates of P_ac,F are computed from the input depth image as:

P_ac,F = (x_ac,F, y_ac,F, z_ac,F)^T = z_ac,F K (u_F, v_F, 1)^T, (4)

where (u_F, v_F, 1)^T is the location (in pixels) of the given labeled point in image space, z_ac,F is the depth retrieved from the (u_F, v_F)^T pixel, and K is the inverse of the matrix that models the intrinsic camera parameters:

K = ( f m_u   c     o_u
      0       f m_v o_v
      0       0     1 )^(-1). (5)

In (5), f is the focal length, m_u and m_v are the scale factors relating pixels to distance, and c, o_u and o_v represent the skew and the coordinates of the principal point, respectively. The intrinsic parameters are usually provided by the depth camera, but they can also be retrieved from calibration procedures (Hartley and Zisserman, 2000).
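Concretely, the back-projection of a labeled pixel to a 3D point in the actual frame (Section 3.3) is the standard inverse-intrinsics mapping P_ac = z · K (u, v, 1)^T. A sketch with invented intrinsic parameter values, not those of the actual capture setup:

```python
import numpy as np

# Intrinsic parameter matrix: focal length f, pixel scale factors m_u, m_v,
# skew c and principal point (o_u, o_v). All values below are illustrative.
f, m_u, m_v = 8.0, 100.0, 100.0
c, o_u, o_v = 0.0, 320.0, 240.0
A = np.array([[f * m_u, c,       o_u],
              [0.0,     f * m_v, o_v],
              [0.0,     0.0,     1.0]])
K = np.linalg.inv(A)          # K in the back-projection is the inverse of A

def backproject(u, v, z):
    """P_ac = z * K (u, v, 1)^T: pixel (u, v) with depth z to a 3D point."""
    return z * (K @ np.array([u, v, 1.0]))

p = backproject(u=320.0, v=240.0, z=1500.0)
# A pixel at the principal point back-projects onto the optic axis:
# p == (0, 0, 1500)
```

Applying this mapping to the four landmark pixels yields the points P_ac,F used by the alignment stage.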

The formulas for computing the 3 × 3 matrix M and the 3 × 1 offset vector O modeling the intended affine transformation are given by:

M = Q P^(-1) and O = P_st,le - M P_ac,le, (6)

where P and Q are 3 × 3 matrices computed as:

P = (P_ac,re - P_ac,le   P_ac,no - P_ac,le   P_ac,ch - P_ac,le), (7)
Q = (P_st,re - P_st,le   P_st,no - P_st,le   P_st,ch - P_st,le). (8)

The procedure for computing P_st,F is presented in Section 3.5. Once M and O are known, the affine mapping of a general point P_ac in the actual space to the standard space is given by:

P_st = M P_ac + O. (9)
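This alignment maps the three landmark difference vectors of the actual frame onto their standard-frame counterparts, so it can be written in a few lines: M = Q P^(-1) and O = P_st,le - M P_ac,le. The sketch below recovers a known affine transformation from invented landmark coordinates.

```python
import numpy as np

def alignment_transform(P_ac, P_st):
    """M and O such that P_st = M P_ac + O for the four landmarks.

    P_ac, P_st : dicts mapping 'le', 're', 'no', 'ch' to 3D points in the
                 actual and standard coordinate frames, respectively.
    """
    def basis(L):
        # Difference vectors relative to the left eye, as matrix columns.
        return np.column_stack([L[f] - L["le"] for f in ("re", "no", "ch")])
    M = basis(P_st) @ np.linalg.inv(basis(P_ac))
    O = P_st["le"] - M @ P_ac["le"]
    return M, O

# Invented landmarks: a known affine transform is applied to a standard
# configuration, and the alignment recovers it exactly.
P_st = {"le": np.array([-50.0, 0.0, 0.0]), "re": np.array([50.0, 0.0, 0.0]),
        "no": np.array([0.0, 50.0, 40.0]), "ch": np.array([0.0, 100.0, 0.0])}
M_true = 1.2 * np.array([[0.8, -0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]])
O_true = np.array([10.0, -20.0, 300.0])
P_ac = {f: np.linalg.solve(M_true, p - O_true) for f, p in P_st.items()}
M, O = alignment_transform(P_ac, P_st)
```

Note that the four landmarks must not be coplanar (the nose apex guarantees this), otherwise the matrix of difference vectors is singular.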

3.4. Producing the final depth image

The final depth image of a given face is computed by casting rays (one ray per resulting image pixel) from a pinhole camera defined in the standard space to the surface of the subject's face mapped from the actual coordinate frame to the standard coordinate frame (Section 3.3). Due to space restrictions, this paper does not present the proposed ray casting procedure in detail. However, it can be derived from a well-known rendering technique: relief mapping (RM) (Policarpo et al., 2005).

The central idea of our ray casting procedure is to use RM to quickly find the first intersection of each casted ray and the surface encoded by the set of TPSs (Section 3.2). Once the first intersection is found for a given ray, the z coordinate of the intersection point in the camera's coordinate system is stored in its respective image pixel. As in RM, we start the process with a linear search. Beginning at the center of projection O_st, we step along the ray passing through the current pixel mapped to the image plane in the 3D space at increments of d, looking for the first point inside the surface. Once the first point under the TPS surface has been identified, the binary search starts using the last point outside the surface and the current one. The role of the linear search is to quickly approximate the first intersection between the casted ray and the TPS surface. The role of the binary search, on the other hand, is to find the exact location of such an intersection.

Recall from Section 3.2 that the original surface of the face was encoded into a set of TPSs that define a height field in the actual coordinate frame. Rendering such an analytical 3D representation from an arbitrary point of view may be tricky, since the depth information cannot be transformed from the actual coordinate frame to the standard space with the guarantee that it will be an unambiguous height map after such mapping. The use of RM as the ray casting approach for solving the problem is an elegant solution because it turns the harder problem of finding the intersection of the ray with the analytical surface in 3D into the simpler problem of walking in the 2D domain of the height field function while looking for the intersection in its codomain. To do that, one has to: (i) map the whole situation of the ray casting procedure (i.e., the camera's center of projection O_st and the pixels' points Q_st in the image plane) from the standard space to the actual space by inverting (9):

P_ac = M^(-1) (P_st - O), (10)

(ii) find the first intersection point using our RM-based ray casting procedure, (iii) map the intersection point back to the standard space using (9), and (iv) compute the final depth value as the signed distance between O_st and the intersection point.

3.5. Computing the average person

The average person is defined in the standard space. It is used as the target face during the face-alignment stage of our procedure. In order to set up an average person one needs to specify the locations P_st,F of the four face landmarks. In our experiments we computed the coordinates of P_st,F from average values retrieved from the input dataset. However, one can place the landmarks in the way that is most convenient for a particular application.

We build a mean tetrahedron from the tetrahedra defined by the landmarks of each input face. The base of such tetrahedra was defined by the location of both eyes and chin. The apex was set to be the nose. The vertices were computed as:

P_st,le = 1/2 (w_st - l_eye, h_st - l_chin, 0)^T,   P_st,re = 1/2 (w_st + l_eye, h_st - l_chin, 0)^T,
P_st,ch = 1/2 (w_st, h_st + l_chin, 0)^T,   P_st,no = 1/2 (w_st, h_st - l_chin + 2 l_nose, 2 l_tip)^T, (11)

where w_st and h_st are, respectively, the width and height of the (standard) resulting image, l_eye is the mean distance between the eyes of the faces in the dataset, l_chin denotes the mean distance from the chin to the middle of the eyes, l_nose is the mean distance between the orthogonal projection of the nose onto the base plane and the middle of the eyes, and l_tip denotes the mean distance from the nose tip to the base plane. These distances were measured in the 3D coordinate frame where each input face resides.

3.6. Producing correct color and normal map images

The computation of correct color and normal map images to be used with the aligned depth images is straightforward. The color information related to the first intersection of the casted ray and the surface encoded by the TPS can be retrieved from another TPS encoding the color of the input image pixels. By doing so, one guarantees a smooth interpolation of color values as well as the reconstruction of missing portions of color information. The normal vectors for the normal map can be retrieved directly from the TPS encoding depth data by computing the normal of the surface at the intersection point.
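The linear-plus-binary search at the heart of the ray casting step (Section 3.4) can be sketched for a generic height field as follows; the step size, iteration count and toy surface are invented for the example, whereas the actual procedure marches rays, mapped back to the actual coordinate frame with the inverse affine transformation, across the TPS height field.

```python
import numpy as np

def first_hit(height, origin, direction, t_max, delta=0.05, iters=24):
    """First intersection of a ray with the surface z = height(u, v).

    Linear search: march from the origin in steps of `delta` until the ray
    point falls below the height field; binary search then refines the
    interval between the last point above and the first point below.
    """
    def below(t):
        p = origin + t * direction
        return p[2] < height(p[0], p[1])     # point is "inside" the surface

    t_out, t = 0.0, delta
    while t <= t_max and not below(t):       # coarse linear search
        t_out, t = t, t + delta
    if t > t_max:
        return None                          # ray misses the surface
    t_in = t
    for _ in range(iters):                   # binary search refinement
        mid = 0.5 * (t_out + t_in)
        if below(mid):
            t_in = mid
        else:
            t_out = mid
    return origin + 0.5 * (t_out + t_in) * direction

# Toy example: a flat "surface" at z = 1 and a ray descending from z = 5.
hit = first_hit(lambda u, v: 1.0, origin=np.array([0.0, 0.0, 5.0]),
                direction=np.array([0.0, 0.0, -1.0]), t_max=10.0)
# hit is approximately (0, 0, 1)
```

As in relief mapping, the linear search only brackets the first crossing, so `delta` must be small enough not to skip thin surface features; the binary search then converges quickly inside the bracket.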

4. Experiments and discussion

We have implemented our technique using C++ and MATLAB. The C++ code was compiled using Microsoft Visual Studio as dynamic link libraries (DLLs) so that they could be called from MATLAB. OpenMP was used to explore parallel computing in the TPS computation. The system was tested on several real depth images. We have applied our method to the UND Biometric Database (Collection D) (Chang et al., 2003). This dataset comprises 953 images of 277 individuals, recorded using the Minolta Vivid series 3D scanner. The UND database has the advantage that it contains the 2D color images, the corresponding range images, and the location of the facial landmarks in image space.

In our experiments, we set the initial size of the adaptive blocks where TPSs are adjusted (Section 3.2) to 32 × 32. The number of blocks depends on the size of the cropped image (Section 3.1).

The resolution of the final images was set to w_st = h_st = 118. However, it is important to notice that our approach can produce smooth images having any resolution. The location of the facial landmarks in the input images was retrieved from the database. We have found the following mean values (expressed in millimeters) while defining the average person (Section 3.5): l_eye = 101.3238, l_nose = 40.0234, l_chin = 104.3238, and l_tip = 38.2279. Notice that those values are consistent, since they are proportional to the average subject's face. Hence, they could be used without change by any application where our technique could be applied, even by those that process a different dataset. The synthetic pinhole camera (Section 3.4) was placed 1750 mm apart from the base plane of the mean tetrahedron, with its optic axis coinciding with the displacement vector of the nose and having its x and y axes aligned, respectively, to the x and y axes of the standard coordinate frame.
We compared our results to a widely used approach based on 2D alignment with linear interpolation. In this approach, we first filled the missing portions of depth information using the linear interpolation method provided by MATLAB. In turn, we applied the transformation matrix to align the triangle defined by the points Q_{ac,le}, Q_{ac,re} and Q_{ac,no} to the same standard coordinates presented in Section 3.5. Both operations can be performed by the imtransform function using the bilinear interpolation scheme (see the MATLAB documentation for details). The sum of the squared differences in the z values was calculated for both methods. We compared different recordings of faces belonging to the same subject, since their alignment should match better than with other subjects' faces. Every recording of a subject was compared against the other recordings of the same subject, and we took the minimum, mean and maximum squared error values produced. Our analysis covered 180 subjects. A histogram showing the minimum, mean and maximum values for each subject can be seen in Fig. 2. The figure also presents the errors produced by aligning the depth data in 3D using an ICP-based solution, followed by linear interpolation of the surface for reconstruction. We used the ICP implementation provided by the Point Cloud Library in our experiments. The guess transformation matrix was computed from the centroids of the source and target point clouds and from their eigenvectors. The same alignments were achieved by using the facial landmarks to compute the guess transformation matrix. It is important to comment that, in our approach, the larger errors are gathered mostly in the neck region (outside the face). With the common 2D, and sometimes with the common 3D approaches, the errors are scattered all over the image. By analyzing just the nose region (Fig. 5), the maximum error produced by our method in this database becomes two orders of magnitude smaller than the error produced by the common approaches.
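For concreteness, the baseline 2D pipeline can be sketched in Python/NumPy as follows. This is only a stand-in for the MATLAB routines used in the actual experiments (imtransform with bilinear interpolation); the function names and the zero-as-missing convention are illustrative assumptions, not part of the original implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def fill_missing_depth(depth, invalid_value=0.0):
    """Fill holes in a depth image by linear interpolation of valid pixels."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = depth != invalid_value
    filled = griddata(
        (ys[valid], xs[valid]), depth[valid],  # known depth samples
        (ys, xs), method="linear")             # query every pixel
    # Pixels outside the convex hull of valid samples stay as they were.
    return np.where(np.isnan(filled), depth, filled)

def affine_from_triangle(src_pts, dst_pts):
    """2x3 affine matrix mapping three source landmarks onto three targets,
    e.g. the landmark triangle onto the standard coordinates."""
    src = np.hstack([src_pts, np.ones((3, 1))])   # homogeneous 3x3 system
    sol, *_ = np.linalg.lstsq(src, dst_pts, rcond=None)
    return sol.T                                  # maps [x, y, 1] -> [x', y']
```

With the affine matrix in hand, any image-warping routine (MATLAB's imtransform, OpenCV's warpAffine, etc.) applies the alignment.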
Our alignment procedure depends on the identification of facial landmarks in color images. We verified the robustness of the proposed approach against errors in the detection of the eyes' outer corners, nose tip and chin by adding noise to the image location of the fiducial marks provided by the database. In turn, we compared every noisy recording of a subject against the other noisy recordings of the same subject. Fig. 6 presents the histograms of minimum, mean and maximum squared error values produced for each of the 180 subjects regarding the original location of the landmarks, and after adding Gaussian noise with mean 0 and standard deviation (σ) ranging from 1 to 5. The histogram in Fig. 6a shows that the Gaussian perturbations did not affect the minimum error produced for each subject. Notice that the squared error values are virtually zero even for σ = 5. The comparison between Figs. 6b and 2b shows that the distribution of mean squared errors is more favorable for the proposed approach with imprecise landmark locations than for the ICP-based alignment with linear interpolation. The distribution of maximum errors produced for σ = 5 (Fig. 6c) is equivalent to the one produced by the ICP-based approach (Fig. 2c). Given that the mean size of the cropped regions in the database is 280 × 200 pixels and σ = 5 leads to errors of up to ±15 pixels, we conclude that using detected facial landmarks for aligning faces is a feasible solution even in the presence of noise.

We also performed an experiment that compared four distinct interpolation methods (i.e., nearest-neighbor, linear, natural-neighbor and TPS) applied as part of the gender classification models presented by Wu et al. (2010, 2011). We chose to use two of the three gender classification models provided by Wu et al.: Principal Geodesic Analysis (PGA) and Supervised Weighted PGA (SWPGA). Further details on PGA can be found in Wu et al. (2010); the SWPGA is described in Wu et al. (2011).

The SWPGA method relies on the proper setting of the parameter d, which indicates the number of features (i.e., relevant dimensions of the facial feature space) to be used during the iterative construction of the weight map, and on the number of iterations. In our experiments we followed Wu et al. (2011), choosing d = 5 and setting the number of iterations to 6000.

The performance of the interpolation methods used to provide input data for the gender classification techniques was measured by comparing the confusion matrix (Kohavi and Provost, 1998) and the Matthews Correlation Coefficient (MCC) (Matthews, 1975) of each technique under the k-fold cross-validation framework (Kohavi, 1995). The image set used for both training and testing comprised 180 images (90 females and 90 males) from different individuals. Although each subject may have several different images in the dataset, only one image per subject was used during the tests. In this way we avoid bias caused by duplicated individuals.

The cross-validation was run with k = 5 folds. Therefore, each fold contained 36 subjects (18 females and 18 males). At each round, one fold was reserved for testing and the others were used for training. After all rounds have finished, the measured statistics are averaged over the number of rounds.

Gender classification can be interpreted as a binary classification. For that matter, it is necessary to fit the two genders within two classes: Female and Male. The following nomenclature is used to refer to the cells of the resulting confusion matrix: true females (TF) are the females identified as such, true males (TM) are the males identified as such, false females (FF) are the males incorrectly classified as being females, and false males (FM) are the females incorrectly classified as being males. The confusion matrix provides information about the number of correct classifications in comparison to the predicted classifications for each class. Using the aforementioned naming convention, we calculated four distinct measurements: accuracy, true females rate (TFR), true males rate (TMR) and the MCC.

The accuracy is the proportion of true results (both TF and TM) in the population. The accuracy can be calculated as follows:

    accuracy = (TF + TM) / (TF + FF + FM + TM).    (10)

TFR measures the proportion of actual females which are correctly identified as such. Similarly, TMR measures the proportion of males which are correctly identified. A perfect predictor would be described as 100% TFR and 100% TMR. The TFR and TMR are described as:

    TFR = TF / (TF + FM)  and  TMR = TM / (TM + FF).    (11)

By looking at (11), TFR can be interpreted as a bias towards female classification, or the capacity of correctly identifying the female gender. Similarly, TMR can be interpreted as the bias towards male classification, or the capacity of correctly identifying the male gender.
The MCC (Matthews, 1975) is in essence a correlation coefficient between the observed and predicted binary classifications. A coefficient of +1 represents a perfect prediction, 0 means no better than random prediction, and −1 indicates total disagreement between prediction and observation.
Fig. 5. Color visualization of the squared error on the nose region. (top) Using our method. (center) Using ICP for 3D alignment and linear interpolation. (bottom) Using 2D alignment and linear interpolation. Notice the difference of maximum error values. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Histograms showing the distribution of (a) minimum, (b) mean and (c) maximum squared error values computed from the comparison of images aligned using the original location of facial landmarks provided by the database, and locations corrupted by Gaussian noise with mean 0 and standard deviation (σ) ranging from 1 to 5.

We calculated the average of each metric over the k folds; the results are presented in Fig. 7, respectively for PGA and SWPGA. As expected, the nearest-neighbor interpolation had the poorest accuracy (0.7 for PGA and 0.677 for SWPGA), while the linear and natural-neighbor interpolations are considered tied (0.711 for PGA and 0.683 for SWPGA, both). The TPS had the highest accuracy (0.717 for PGA and 0.706 for SWPGA), suggesting that smoother interpolation can increase gender classification performance. It is important to emphasize that we used the full set of PGA features during the classification step of the PGA technique, but only the d = 5 leading PGA features during the training step of the SWPGA technique. In that case, one must be careful while reading the graphs in Fig. 7 in order to compare the performance of PGA and SWPGA: it would not be a fair comparison. As pointed out in Wu et al. (2011), SWPGA outperforms PGA when the same number of PGA features is used during the training and classification steps. The results in Fig. 7 should only be analyzed to evaluate the performance of the interpolation procedures.
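As a sketch, the fold construction and metric averaging described above could look like the Python fragment below. The function names are illustrative assumptions; the paper's experiments plug in the PGA/SWPGA models of Wu et al., not the scorer callback shown here.

```python
import random

def stratified_folds(females, males, k=5, seed=0):
    """Split subjects into k gender-balanced folds (18 females and
    18 males per fold for the 90/90 set used in the experiments)."""
    rng = random.Random(seed)
    f, m = list(females), list(males)
    rng.shuffle(f)
    rng.shuffle(m)
    return [f[i::k] + m[i::k] for i in range(k)]

def cross_validate(folds, train_and_score):
    """Hold each fold out for testing, train on the remaining folds,
    and average the per-round metric over the k rounds (Kohavi, 1995)."""
    scores = []
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

With 90 female and 90 male subject IDs, `stratified_folds` yields five folds of 36 subjects each, matching the protocol above.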
By comparing the results of changing the interpolation technique used in both the PGA and SWPGA methods (Fig. 7), it is important to notice that the accuracy differences in the PGA model are smaller than those found in the SWPGA model. The TPS in the SWPGA showed an increased accuracy of up to 3%. We believe that, since the SWPGA iteratively creates a weight map to describe relevant discriminating regions, iterative methods have the tendency to amplify the errors introduced by the simpler interpolating functions; as a consequence, the TPS resulted in higher accuracy when applied to the gender discriminating model.
Fig. 7. The results of (a) PGA and (b) SWPGA. Notice how the TPS provides a higher accuracy value than the nearest-neighbor, linear and natural-neighbor interpolations.

The purpose of the weight map computed by the SWPGA is to improve the gender discriminating capacity of the leading features extracted from a training set (Wu et al., 2011). The leading features are estimated from pixel-by-pixel coherence between subjects of the same gender and pixel-by-pixel incoherence between genders. We believe that the lack of continuity (artifacts) introduced by the nearest-neighbor and linear interpolation schemes affects the computation of proper weights because the artifacts may mask small non-soft features expected in male faces. By comparing the TMR and TFR coefficients computed for the simpler interpolation schemes and for the proposed approach, it is possible to conclude that male classification benefits from the use of the TPS-based scheme. Notice in Fig. 7b that the TMR increased from 0.685 (nearest) to 0.720 (TPS), while the TFR increased from 0.681 (nearest) to 0.709 (TPS). Such an improvement may be explained by the TPS's ability to estimate the depth value of a point on the surface from all data points in the same block, and not just from the closest data points as in nearest or linear interpolation, leading to a continuous and coherent surface.
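To illustrate this block-wise behavior, the sketch below fits a radial basis function with SciPy's thin-plate kernel to the valid samples of one block and evaluates it at every pixel, so a hole is filled using all data points in the block. This is only an approximation of the TPS actually used (SciPy's Rbf omits the affine term of Bookstein's (1989) formulation), and the zero-as-missing convention is an assumption.

```python
import numpy as np
from scipy.interpolate import Rbf

def tps_fill_block(depth_block, invalid_value=0.0):
    """Fit a thin-plate RBF to the valid samples of one block and
    evaluate it at every pixel, filling holes from all block samples."""
    h, w = depth_block.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = depth_block != invalid_value
    # Thin-plate kernel r^2 * log(r); interpolates valid samples exactly.
    tps = Rbf(xs[valid], ys[valid], depth_block[valid],
              function="thin_plate")
    return tps(xs, ys)
```

In the paper's pipeline, one such spline is fit per block, which keeps the linear system small and the fitting time low.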
Similarly to the aforementioned metrics, the computed MCCs show that the nearest-neighbor, linear and natural-neighbor interpolations are equivalent to each other when applied with both non-iterative and iterative classification approaches. The TPS, on the other hand, may improve the result of iterative techniques. The MCC coefficients computed for the interpolation schemes used in combination with the PGA were 0.402 (nearest), 0.424 (linear and natural), and 0.435 (TPS). In contrast, the coefficients computed for the simpler interpolation schemes with SWPGA were 0.361 (nearest) and 0.375 (linear and natural), while the MCC coefficient for TPS with SWPGA was 0.419. In practice, this means that for non-iterative techniques the simpler interpolation methods, especially the linear and natural-neighbor interpolations, can be applied with no major impact on the final classification result. However, it is recommended to use our TPS-based approach in techniques that may amplify interpolation errors (e.g., iterative methods).
5. Conclusions
We have presented a completely automatic approach for aligning and reconstructing damaged facial depth images. The approach uses TPS to smoothly interpolate existing data, facial landmarks to ensure data alignment, and RM-based ray casting to render the final aligned depth image at arbitrary resolution. In order to reduce the high computational cost of the TPS, a block-division approach was introduced in which separate TPSs are adjusted to each block, considerably reducing the time needed to fit an interpolation. We demonstrated the effectiveness of the proposed techniques by implementing them and using them to align faces from several real depth images available in a well-known biometric database.

The proposed alignment and interpolation methods were compared against two common approaches: one based on 2D alignment and linear interpolation that operates on intensity images, and another based on 3D alignment using ICP and linear interpolation for depth data sampling. The errors of the proposed method were up to two orders of magnitude smaller than those of the common approaches. This result suggests that the proposed techniques should be adopted to achieve better results in procedures that are currently based on naive alignment and reconstruction of depth data.
When using the proposed alignment method (in the 3D domain) while varying only the interpolation scheme, the differences in the final results favored the proposed technique. More specifically, the experiments suggest that the TPS yields better results when used within iterative methods, since the feedback mechanisms of this kind of procedure tend to amplify errors (e.g., interpolation errors). Also, the TPS has proven able to increase the gender classification accuracy of the SWPGA model by 3%. With that in mind, we conclude that while the research is in a prototyping phase, the researcher may use a simple interpolation method (i.e., linear) to validate their implementation, and later use a more sophisticated interpolation method (i.e., our approach) to improve the final results.

We believe that these ideas may lead to better results in procedures that are currently based on naive alignment of depth data. We are currently exploring ways of analyzing the error propagation through the stages of our algorithm. A reference implementation of the approach described here will be made available to other research groups on the authors' home page.

Acknowledgments
This work was sponsored by FAPERJ (E26/111.468/2011). Giancarlo was sponsored by a CAPES fellowship. We thank the Computer Vision Research Laboratory of the University of Notre Dame for the database used in this research. We thank Wu, Smith and Hancock for kindly providing the implementation of their gender classification technique, and the anonymous reviewers for their comments and insightful suggestions.

References
Azouz, Z.B., Rioux, M., Shu, C., Lepage, R., 2004. Analysis of human shape variation using volumetric techniques. In: Proc. of CASA, pp. 197–206.
Besl, P.J., McKay, N.D., 1992. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14, 239–256.
Bookstein, F.L., 1989. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11, 567–585.
Chang, K., Bowyer, K., Flynn, P., 2003. Face recognition using 2D and 3D facial data. In: ACM Workshop on Multimodal User Authentication, pp. 25–32.
Guo, H., Jiang, J., Zhang, L., 2004. Building a 3D morphable face model by using thin plate splines for face reconstruction. In: Proc. of SINOBIOMETRICS, pp. 258–267.
Hartley, R.I., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.
Kakadiaris, I.A., Passalis, G., Toderici, G., Murtuza, M.N., Lu, Y., Karampatziakis, N., Theoharis, T., 2007. Three-dimensional face recognition in the presence of facial expressions: an annotated deformable model approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 640–649.
Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of IJCAI, pp. 1137–1143.
Kohavi, R., Provost, F., 1998. Glossary of terms. Mach. Learn. 30, 271–274.
Matthews, B., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451.
Moreno, A.B., Sanchez, A., Velez, J.F., Diaz, F.J., 2005. Face recognition using 3D local geometrical features: PCA vs. SVM. In: Proc. of ISPA, pp. 185–190.
Perakis, P., Theoharis, T., Passalis, G., Kakadiaris, I.A., 2009. Automatic 3D facial region retrieval from multi-pose facial datasets. In: Proc. of Eurographics Workshop on 3D Object Retrieval, pp. 37–44.
Policarpo, F., Oliveira, M.M., Comba, J., 2005. Real-time relief mapping on arbitrary polygonal surfaces. In: Proc. of ACM SIGGRAPH I3D, pp. 155–162.
Romero-Huertas, M., Pears, N., 2008. 3D facial landmark localisation by matching simple descriptors. In: Proc. of IEEE Intern. Conf. on BTAS, pp. 1–6.
Rosen, A., 1996. AlexWarp applet. <http://java.seite.net/bibliothek/alexwarp/alexwarp.html>
Segundo, M.P., Gomes, L., Bellon, O.R.P., Silva, L., 2012. Automating 3D reconstruction pipeline by SURF-based alignment. In: Proc. of IEEE ICIP, pp. 1761–1764.
Stormer, A., Rigoll, G., 2008. A multi-step alignment scheme for face recognition in range images. In: Proc. of IEEE ICIP, pp. 2748–2751.
Szeliski, R., 2010. Computer Vision: Algorithms and Applications. Springer.
Tekumalla, L.S., Cohen, E., 2004. A hole-filling algorithm for triangular meshes.
Wang, J., Oliveira, M.M., 2003. A hole-filling strategy for reconstruction of smooth surfaces in range images. In: Proc. of SIBGRAPI. IEEE, pp. 11–18.
Whitbeck, M., Guo, H., 2006. Multiple landmark warping using thin-plate splines. In: Proc. of IPCV, pp. 256–263.
Wu, J., Smith, W., Hancock, E., 2011. Gender discriminating models from facial surface normals. Pattern Recognit. 44, 2871–2886.
Wu, J., Smith, W.A.P., Hancock, E.R., 2010. Facial gender classification using shape-from-shading. Image Vis. Comput. 28, 1039–1048.
Xu, C., Li, S., Tan, T., Quan, L., 2009. Automatic 3D face recognition from depth and intensity Gabor features. Pattern Recognit. 42, 1895–1905.
Yan, P., Bowyer, K.W., 2005. Ear biometrics using 2D and 3D images. In: Proc. of CVPR Workshops. IEEE, p. 121.
Zhu, J., Wang, L., Yang, R., Davis, J., 2008. Fusion of time-of-flight depth and stereo for high accuracy depth maps. In: Proc. of CVPR, pp. 1–8.
