

3, JULY 2009


Eye Gaze Correction to Guarantee Eye Contact

in Videoconferencing
J. Civit and T. Montserrat

Abstract: In a typical desktop videoconferencing setup, the camera and the display screen cannot be physically aligned. This problem produces a lack of eye contact and substantially degrades the user experience. Expensive hardware systems using semi-reflective materials are available on the market to solve the eye gaze problem; however, these specialized systems are far from the mass market. This paper presents an alternative approach that uses stereo rigs to capture a three-dimensional model of the scene. This information is then used to generate the view from a virtual camera aligned with the conference image the user looks at.

Keywords: Computer vision, eye contact, trifocal tensor, videoconferencing.


Videoconferencing allows face-to-face communication between geographically distant persons through the bidirectional transmission of audio and video. However, the expansion of this technology is still far below the expectations initially generated. With problems such as cost or bandwidth overcome, it seems that the main barrier to the widespread adoption of videoconferencing is the lack of eye contact [1]. In a typical home environment, the camera and the screen cannot be physically aligned, as shown in Fig. 1. The user looks at the image of the remote peer displayed on the monitor, but not directly at the camera from which he or she is observed, so the impression of looking into the eyes of the interlocutor is lost. It has been demonstrated [3] that if the divergence angle between the camera and the display exceeds five degrees, the loss of eye contact is appreciable. In a common scenario, with a user sitting at the computer, this angle lies between fifteen and twenty degrees. This has negative psychological effects, since the lack of eye contact, or avoiding the gaze of the interlocutor, is usually associated with deception [2]. Above the divergence threshold, the video information thus loses its communicative value and may even bother the users.

This work has been partially developed within the projects VISION, sponsored by the Centre for Industrial Technological Development (CDTI) (CENIT-VISION 2007-1007), and 3DPresence, sponsored by the FP7 programme of the European Union (ICT, 215269).
J. Civit and T. Montserrat work in the Video Technology department of Telefónica Investigación y Desarrollo.

Different hardware systems have been proposed to correct the squint effect in videoconferencing. Most of them use semi-reflective materials to align the camera and the remote image. The Telepresence product line offered by Digital Video Enterprises is a good example [5]. The system suggested by Okada et al. [6] projects the image on a semitransparent screen and captures the user's face with a camera located behind the screen. Despite their effectiveness, the high cost of these systems, together with a cumbersome setup, has kept them away from the mass market. The system presented here is based on obtaining a three-dimensional description of the scene in order to generate the image corresponding to a virtual camera located at the desired viewpoint. In this way, the need to physically place the camera at eye level to ensure eye contact is avoided. The 3D information can be obtained by capturing the scene with two or more cameras.

Figure 1. Conventional videoconferencing system

Given the camera calibration parameters and point correspondences between views, it is possible to triangulate the three-dimensional position of the point from which each set of correspondences originated and, therefore, to generate any arbitrary view. Two aspects are fundamental for the results of this process to be satisfactory. Firstly, the correspondence search algorithm should provide accurate results. Secondly, the virtual camera should be located between the two real cameras, so that any point displayed in the new image is seen by at least one of the cameras. If only two cameras are used, the latter



[Diagram blocks: local correspondence search, consistency check, synthesis of the new view, computation of the new camera parameters]

Figure 2. Partial diagram of the system architecture

restriction forces placing them on opposite sides of the monitor. However, whether the orientation is vertical or horizontal, the monitor size forces a high separation between the cameras, which greatly hinders the correspondence search process. To solve this problem, we propose adding two additional cameras, following the diagram of Fig. 3.
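The triangulation step mentioned above can be sketched with the standard linear (DLT) method. The following fragment is an illustrative sketch with toy camera matrices and an error-free correspondence, not the authors' implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from a pair of
    pixel correspondences x1 <-> x2 and 3x4 projection matrices."""
    # Each view contributes two rows of the homogeneous system A X = 0.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy setup: two cameras with identity intrinsics, the second one
# translated along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
Xh = np.append(X_true, 1.0)
x1 = P1 @ Xh; x1 = x1[:2] / x1[2]
x2 = P2 @ Xh; x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))  # recovers [0.5, 0.2, 4.0]
```

With exact correspondences the recovery is exact; with noisy matches the SVD gives the least-squares solution, which is why the accuracy of the correspondence search matters so much.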


The images generated by the GazeMaster system have a synthetic, avatar-like appearance, probably due to the generic face model employed. The substitution of the eyes can also cause changes in the facial expression. Yang and Zhang [8] use a custom face model, adjusted in video by tracking characteristic 3D points, with two cameras mounted on the upper and lower sides of the monitor. In a second stage, the objects that fall outside the face model are treated by finding correspondences between contours and feature points. The result is used to generate a 3D mesh that synthesizes the new view. Unlike the system proposed here, the high camera separation only enables a reliable correspondence search based on the detection of characteristic points. The resulting low-density depth map requires relying on prior knowledge of the scene and on predesigned three-dimensional models.

Figure 3. Proposed videoconferencing system

The small separation, or baseline, between the optical centers of the two cameras on the same side allows the correspondence search between those pairs to operate smoothly, while the cameras located on the opposite side help to correctly synthesize the virtual view. Fig. 2 shows the four basic steps underlying the system proposed here: calibration, rectification, correspondence search, and synthesis of the new view. These steps are described in the following paragraphs.
In the literature one can also find other proposals that solve the eye contact problem using computer vision algorithms. Ott et al. [4] proposed a system similar to the one presented here, but with a configuration of two cameras on opposite sides. The limitations of the correspondence search caused the appearance of artifacts in the generated view. Despite using a highly parallelizable dynamic programming algorithm, the processing time was greater than 50 s per frame with the hardware of the time. The GazeMaster project of Microsoft Research [7] used a single camera to track the head orientation and the eyes. The view synthesis is performed by replacing the user's eyes with synthetic eyes looking in the desired direction. The face texture with the corrected eyes is then applied to a rigid 3D face model that can be rotated.


The calibration process retrieves the relative position of the cameras, obtaining the overall geometry of the system. This procedure recovers the intrinsic parameters (focal length, optical center, etc.) of each lens, the extrinsic parameters that relate the reference system of each camera, with origin at its optical center, to a global reference system (in 3D space) common to all cameras, and, in addition, the lens distortion parameters. The algorithm used for this stage is the one proposed in [9]. Corresponding points are extracted between the cameras by capturing a calibration pattern (Fig. 4). These points serve to extract the geometric relationships between all the views.
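The parameters recovered by calibration can be summarized in the pinhole projection model. The sketch below uses made-up intrinsic and extrinsic values (the focal lengths, tilt angle, and translation are illustrative assumptions, and lens distortion is omitted):

```python
import numpy as np

# Intrinsic parameters: focal lengths (fx, fy) and optical center (cx, cy).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsic parameters: rotation R and translation t map world
# coordinates into this camera's frame (origin at its optical center).
theta = np.deg2rad(10.0)                  # camera tilted 10 degrees
R = np.array([[1.0, 0.0, 0.0],
              [0.0,  np.cos(theta), -np.sin(theta)],
              [0.0,  np.sin(theta),  np.cos(theta)]])
t = np.array([[0.0], [0.0], [2.0]])

# The 3x4 projection matrix combines both parameter sets.
P = K @ np.hstack([R, t])

def project(P, X):
    """Project a 3D world point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

print(project(P, np.array([0.1, 0.0, 3.0])))
```

A quick sanity check: any world point lying on the camera's optical axis must project exactly onto the optical center (cx, cy).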
Once the calibration parameters are known, the geometric constraints between the points of view can be exploited to facilitate the later stages. The parameters are handled in the form of a projection, or camera, matrix P. Each pair of cameras enters a stage called epipolar rectification. Once the calibration parameters are retrieved, the constraints established by epipolar geometry (the geometry of two views) [11] may be applied. Rectification consists of generating two new camera matrices, P1' and P2', such that, given a pixel in one of the images, the corresponding pixel lies on the same line of the other image. For more information see [14]. This process reduces the correspondence search step to one dimension. Rectifying the images also requires performing a correction of the radial and tangential distortion characteristic of the lens of each camera.
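The epipolar constraint that rectification exploits can be checked numerically. The sketch below assumes identity intrinsics, so the fundamental matrix reduces to the essential matrix [t]_x R; the camera poses are made up for illustration:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Second camera: rotated 5 degrees about y and translated along x.
a = np.deg2rad(5.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.0, 0.0])

# With identity intrinsics the fundamental matrix is F = [t]_x R.
F = skew(t) @ R

# Project a 3D point into both views (homogeneous coordinates).
X = np.array([0.3, -0.2, 5.0])
x1 = X / X[2]                 # first camera: P = [I|0]
Xc2 = R @ X + t
x2 = Xc2 / Xc2[2]

# The correspondence satisfies the epipolar constraint x2^T F x1 = 0,
# i.e. x2 lies on the epipolar line l2 = F @ x1. Rectification warps
# both images so that these lines become aligned scanlines.
print(x2 @ F @ x1)            # ~0
```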

Figure 4. Capture of the calibration pattern. The characteristic points are defined as the corners of the squares.


The goal of the correspondence search phase is to match those points in the images that are projections of the same three-dimensional point. The disparity is the relative displacement of the position of a correspondence with respect to its point in the reference image; this value is inversely proportional to the depth of the 3D point. The set of all the disparities associated with each pixel of an image is called a disparity map.
Thanks to the rectification process, the correspondence search problem is simplified. However, it is still a computationally expensive task, since it requires dealing with the ambiguities caused by homogeneous regions, occlusions, etc. The disparities may be found by different methods; it is common to distinguish between local and global methods, depending on the constraints used. Local methods only make use of the information provided by a small number of pixels neighboring the pixel of interest. Although they can be very efficient, they are highly sensitive to locally ambiguous zones of the images. At the other extreme, global methods impose constraints that affect the whole image, which makes them more robust than local methods, in exchange for a greater computational cost. A detailed classification of the different types of algorithms can be found in [13].

Figure 5. Reference image with its depth map


The nature of this system requires calculating two independent disparity maps at VGA resolution and at a frequency of 30 images per second. These real-time requirements are prohibitive for global algorithms, whose parallelization is also more complex. For this reason, we have chosen a local method with non-adaptive window aggregation [13]. For reasons of efficiency, the cost function used is the sum of absolute differences (SAD). In each group of stereo cameras, an upper/lower consistency check of the disparity maps is performed, allowing disparities misassigned as a result of occlusions or areas of low texture to be discarded. This process results in a disparity map with gaps caused by the areas of unreliable disparity.
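A minimal version of such a local SAD method can be sketched as follows. The window size, disparity range, and synthetic test pair are illustrative assumptions; the real system searches along the rectified direction of each vertical camera pair and runs on an FPGA, not in Python:

```python
import numpy as np

def sad_disparity(ref, tgt, max_d=8, win=2):
    """Winner-take-all SAD block matching on a rectified pair: for each
    pixel, the disparity minimising the sum of absolute differences
    over a fixed (non-adaptive) (2*win+1)^2 window."""
    h, w = ref.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(win, h - win):
        for x in range(win, w - win):
            block = ref[y-win:y+win+1, x-win:x+win+1].astype(float)
            best, best_d = np.inf, 0
            for d in range(min(max_d, x - win) + 1):
                cand = tgt[y-win:y+win+1, x-d-win:x-d+win+1].astype(float)
                cost = np.abs(block - cand).sum()
                if cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp

def consistency_mask(d_rt, d_tr, tol=1):
    """Cross-check between the two maps of a pair: keep a disparity only
    if matching back from the other image lands on the starting pixel."""
    h, w = d_rt.shape
    ok = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = d_rt[y, x]
            if x - d >= 0 and abs(d_tr[y, x - d] - d) <= tol:
                ok[y, x] = True
    return ok

# Synthetic rectified pair: tgt is ref shifted by 3 pixels.
rng = np.random.default_rng(0)
ref = rng.integers(0, 255, (16, 32))
tgt = np.empty_like(ref)
tgt[:, :-3] = ref[:, 3:]
tgt[:, -3:] = ref[:, -3:]
disp = sad_disparity(ref, tgt, max_d=6, win=2)
print((disp[4:12, 8:24] == 3).all())  # True in the well-covered region
```

Pixels that fail the cross-check would become the gaps that the diffusion-based post-processing described below has to fill.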
To carry out the synthesis of the new view, it is necessary to have a dense disparity map for each side of the monitor. For this reason, the consistent maps undergo a post-processing step based on the anisotropic diffusion proposed by Perona and Malik [15]. The main difference between the filter used and the one described in [15] is the use of the reference image, instead of the map being filtered, to calculate the diffusion coefficient. This strategy is based on the constraint that regions of homogeneous intensity present no sudden changes in depth. In the first iterations of the filter, only the values of the points marked as discarded are modified; these are initialized with the value found for the gap in the previous frame, reducing the total number of iterations per image. The last iterations do modify the whole disparity map, helping to smooth the final result. The post-processing method employed preserves the contours of the original image, which contributes to photo-consistent results in the reprojection of the new view.
The correspondence search system described here operates in real time (VGA @ 30 fps), implemented on a Xilinx Virtex-4 SX-35 FPGA with an operating frequency of 200 MHz. The parallelism of the algorithms used makes it possible to port them to new graphics cards.
As future improvements, the incorporation of color segmentation techniques and global methods into the computation of the disparity maps is proposed.


The technique used to generate the image perceived by the new virtual camera belongs to the group of Image-Based Rendering algorithms [12]. As the name suggests, in the ideal case these algorithms use only the information contained in the real images, that is, those captured by the physical cameras. In the particular case presented in this paper, a method based on point transfer is used: the pixels of the original images are transferred to their corresponding positions in the synthetic image, which yields completely realistic virtual images (photo-realism). For this purpose, each pair of cameras located on one side of the screen is modeled, together with the virtual camera, as a trinocular capture system comprising the two real cameras and the virtual camera, in which the geometric constraints of three views are exploited.
The algebraic entity that relates the positions of the pixels in the three images is called the trifocal tensor. It is a 3x3x3 array containing all the geometric relationships between the three views, equivalent to the fundamental matrix in the binocular case. Given three camera matrices P = [I|0], P1 = [A|a_4] and P2 = [B|b_4], obtained during the calibration process, the tensor can be computed as follows, expressed in tensor notation using Einstein's summation convention [11]:

T_i^{jk} = a_i^j b_4^k - a_4^j b_i^k     (1)

Figure 6. Point-line-point incidence.

From the trifocal tensor, incidence relations between points and lines of the three images are obtained. The most useful one in the design presented in this article is:

x''^k = x^i l'_j T_i^{jk}     (2)

Fig. 7 shows an original image, captured from the point of view of the camera, together with two synthesized images: the first for a virtual camera located at the center of the screen, and the second for a virtual camera located at an intermediate position between the original camera and the center of the monitor.


This relation is known as the point-line-point incidence [11], whose diagram is shown in Fig. 6. According to the equation above, knowing a point in the first image and a line passing through the corresponding point in the second image, it is possible to calculate the position of the corresponding pixel in the third image, in this case the synthetic image.
To establish the correspondence between the first two views and calculate the input parameters of equation (2), the stereoscopic techniques introduced in Section III are used, whose result is a disparity map containing the information of the pixel correspondences between the stereo pair. The line l' is thereby taken as the line perpendicular to the epipolar line that passes through the point x' in the second image. For details on this algorithm see [10].
This procedure is applied to all the pixels of image 1 to generate the synthetic image at the position of the virtual camera, image 3. Obviously, there will be parts of the scene that are not visible from both the virtual camera and the real cameras; therefore, the generated image presents voids, coincident with the occlusions between the views. To compensate for this fact, the above procedure is applied to both pairs of cameras, according to the configuration proposed in Section I, generating two synthetic images for the same virtual camera viewpoint. In this way, the gaps in each of the images are different and non-coincident, owing to the architecture of the capture system. Using both synthetic images, a complete photo-realistic virtual view can be generated. The algorithm is implemented in the form of a 3D lookup table with three entries: the pixel location in the reference image, coordinates (u, v), and the pixel disparity. Thus, the operations are performed only once, achieving real-time rates. In practice, two lookup tables are produced, one for each stereo pair.
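The tensor construction of equation (1) and the point-line-point transfer of equation (2) can be checked numerically. The camera matrices and 3D point below are toy values, and the line through the point in the second image is an arbitrary non-epipolar line rather than the perpendicular used by the system:

```python
import numpy as np

def trifocal_tensor(Pb, Pc):
    """Trifocal tensor for P = [I|0], Pb = [A|a4], Pc = [B|b4]:
    T_i = a_i b_4^T - a_4 b_i^T, with a_i, b_i the matrix columns
    (Hartley & Zisserman)."""
    T = np.zeros((3, 3, 3))
    for i in range(3):
        T[i] = np.outer(Pb[:, i], Pc[:, 3]) - np.outer(Pb[:, 3], Pc[:, i])
    return T

def transfer(T, xa, lb):
    """Point-line-point transfer x''^k = x^i l'_j T_i^{jk}."""
    return np.einsum('i,j,ijk->k', xa, lb, T)

# Toy cameras: reference at the origin, the other two translated.
Pb = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])
Pc = np.hstack([np.eye(3), np.array([[0.0], [1.0], [0.0]])])
T = trifocal_tensor(Pb, Pc)

X = np.array([1.0, 2.0, 3.0, 1.0])        # homogeneous 3D point
xa = np.eye(3, 4) @ X                      # image 1 projection: (1, 2, 3)
xb = Pb @ X                                # image 2 projection: (2, 2, 3)
xc = Pc @ X                                # image 3 projection: (1, 3, 3)

# Any line through xb (except the epipolar line of xa) will do.
lb = np.cross(xb, np.array([0.0, 1.0, 0.0]))
xc_t = transfer(T, xa, lb)

print(xc_t[:2] / xc_t[2])  # (0.333..., 1.0)
print(xc[:2] / xc[2])      # same point: the transfer is exact
```

In the system, evaluating this transfer once per (pixel, disparity) pair is what allows the result to be precomputed into the 3D lookup tables.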

Figure 7. Synthesis from different viewpoints.

As seen in the previous image sequence, as the virtual camera moves away from the original position, more artifacts appear. This is because the new viewpoint displays information that is not available in the original image. Future work will require an improved synthesis algorithm.
This process is repeated, through the 3D lookup table, for the whole image sequence that forms the video. However, since temporal consistency is not considered in the step that obtains the depth map, its errors can be observed in the reprojection, producing a considerable and unpleasant flickering effect. This problem will be fixed in a future update of the algorithm.


In this paper we have presented a proposal for improved videoconferencing. By applying computer vision techniques, the point of view of the user is corrected, creating a more realistic communication.

[1] L. Mühlbach, B. Kellner, A. Prussog, and G. Romahn, "The Importance of Eye Contact in a Videotelephone Service," Proc. 11th Int'l Symp. Human Factors in Telecommunications, 1985.
[2] E. Bekkering and J. P. Shim, "Trust in Videoconferencing," Communications of the ACM, vol. 49, no. 7, July 2006.
[3] R. R. Stokes, "Human Factors and Appearance Design Considerations of the Mod II Picturephone Station Set," IEEE Trans. Communication Technology, vol. 17, no. 2, Apr. 1969.
[4] M. Ott, J. Lewis, and I. Cox, "Teleconferencing Eye Contact Using a Virtual Camera," Proc. Conf. Human Factors in Computing Systems, pp. 109-110, 1993.
[5] Digital Video Enterprises, Telepresence systems, 2008.
[6] K. I. Okada, F. Maeda, Y. Ichikawa, and Y. Matsushita, "Multiparty videoconferencing at virtual social distance: MAJIC design," Proc. CSCW '94, pp. 385-395, 1994.
[7] J. Gemmell, C. L. Zitnick, T. Kang, K. Toyama, and S. Seitz, "Gaze-Awareness for Videoconferencing: A Software Approach," IEEE Multimedia, vol. 7, no. 4, pp. 26-35, Oct. 2000.
[8] R. Yang and Z. Zhang, "Eye gaze correction with stereovision for video-teleconferencing," Microsoft Research Technical Report MSR-TR-2001-119, 2001.
[9] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.
[10] S. Avidan and A. Shashua, "Novel View Synthesis by Cascading Trilinear Tensors," IEEE Transactions on Visualization and Computer Graphics, 4(4), Oct.-Dec. 1998.
[11] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, 2000.
[12] H.-Y. Shum, S.-C. Chan, and S. B. Kang, Image-Based Rendering, Springer, 2007.
[13] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, 47(1-3):7-42, April-June 2002.
[14] A. Fusiello, E. Trucco, and A. Verri, "A compact algorithm for rectification of stereo pairs," Machine Vision and Applications, 12(1):16-22, 2000.
[15] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629-639, July 1990.






Jaume Rovira Civit graduated in Telecommunications Engineering in 2005 from the Polytechnic University of Catalonia. In 2008 he obtained a Master's degree in telecommunications management. He works professionally at Telefónica Research and Development, at its Barcelona center (Via Augusta 177, 08021), in the video technologies division. His areas of interest focus on computer vision, 3D technologies, and image and video processing.


Tomás Montserrat Mora is a Senior Engineer in Telecommunications (2007) and a Senior Engineer in Electronics (2008), both degrees from the Polytechnic University of Catalonia, Barcelona. He works professionally at Telefónica Research and Development, at its Barcelona center (Via Augusta 177, 08021), within the video technologies division. His areas of interest focus on computer vision, 3D technologies, digital design, and hardware acceleration.