
Seminar SS 2003 Digital Signal Processing in Multimedia-Devices

Immersive Multimedia (Video)


Gkalelis Nikolaos

Coach: Markus Barkowsky



Abstract
Immersive Multimedia devices constitute the ultimate Virtual Reality technology. In order to operate in real time, they combine the best digital signal processing, computer graphics, machine vision and multimedia communication techniques. Their goal is to provide presence, described as the feeling of being there. As this technology evolves, the role of video is becoming increasingly important, and techniques such as 3D reconstruction and disparity estimation are becoming crucial for the immersive use of video in telepresence applications. With standards like MPEG-4, video objects can be extracted and efficiently transmitted to the receiving end of the communication. An immersive multimedia device which deploys these concepts is VIRTUE. The aim of this report is to familiarize the reader with the fundamentals of Immersive Multimedia devices in the video domain.

Table of Contents

CHAPTER 1  Introduction
    Structure of the report
CHAPTER 2  Theoretical Concepts
    Fundamental Definitions
    Virtual Reality Experience
    Visual Perception
    Summary
CHAPTER 3  3D User Interface Devices
    Input Devices
    Output Devices
    Summary
CHAPTER 4  3D Reconstruction
    The Pinhole Model
    Camera Calibration
    Epipolar Geometry
    Rectification
    3D Reconstruction from Disparity Estimation
CHAPTER 5  Application: VIRTUE
    Requirements
    Three-way Immersive Telepresence Videoconferencing Mock-up
    Software Architecture
Summary
References

Figure 1 Immersive Multimedia applications: (A) flight simulator (1929); (B) CAVE Automatic Virtual Environment (today); (C) treatment of fear of spiders; (D) visualization of neural damage.

CHAPTER 1: Introduction

Immersive Multimedia devices are the ultimate Virtual Reality (VR) systems: they completely immerse the viewpoint of the user. Technologically, they integrate several disciplines such as Digital Signal Processing, Multimedia Communications, Machine Vision, Computer Graphics, Simulation, and others. They must process a large amount of data in real time, and are therefore very challenging systems which demand the best from their underlying technologies. DeFanti (1999) [7] even claimed that Immersive Multimedia would not be possible for the next millennium. Fortunately, the technology evolves very fast and important applications have already emerged [5, 7, 8, 15].

In figure 1 we can see four such applications. Figure 1a illustrates a flight simulator for the training of military officers, developed in 1929¹. Flight simulators were the first immersive devices. Figure 1b shows the CAVE [33], which represents the state of today's Immersive Multimedia devices. Such devices allow remote participants to communicate with each other as if they shared the same virtual world (seamless communication). This key new technology, called tele-immersion², will be discussed in the core of the report. Figures 1c and 1d show two other important applications of immersive systems, in the domains of medicine and scientific visualization respectively. In the former an immersive device is used for the treatment of spider phobia, and in the latter a VR medium is utilized for the visualization of nerve damage in the brain.

While Immersive Multimedia systems advance, the role of video is becoming increasingly important [17]. Indeed, the feeling of presence³ can be further intensified by integrating natural video objects into the artificial world. For instance, the MPEG-4 standard offers powerful means for combining real and virtual worlds and dedicated tools to stream live video into a virtual scene [27]. These aspects are especially interesting for communication services, like tele-immersion.
A fundamental issue for integrating video in Immersive Multimedia devices is 3D reconstruction. This technique allows realistic scenes to appear at the receiving end of each participant and thus induces presence. For instance, VIRTUE is an immersive teleconferencing system which deploys 3D reconstruction techniques to convince the conferees that they are participating in a real meeting.

Structure of the report


Before we proceed, it is valuable to give a brief outline of the report. Apart from the introduction there are four other chapters. In Chapter 2 the theoretical issues of Immersive Multimedia and Virtual Reality are analyzed; furthermore, we look at the human factors responsible for visual presence, and thus form a conceptual basis for the subsequent chapters. Chapter 3 gives a short overview of visual 3D user interface devices, their requirements and limitations. In Chapter 4 the essential components of 3D reconstruction are discussed, and at the end a disparity-based 3D reconstruction algorithm is given. This chapter gives insight into the video
¹ For a short historical review the reader is referred to [17], and for a complete one to [11].
² For a brief overview of the terminology used in the field the reader is referred to [17]. In [13] there are citations for formal taxonomies of the terms used in VR.
³ This term will be explained in chapter two. In plain words it means "the feeling of being there".

part of Immersive Multimedia and prepares the reader for the comprehension of the application in the next chapter. Finally, in Chapter 5 we look at a representative tele-immersive application, the so-called VIRTUE, where a 3D reconstruction algorithm is used to integrate real video scenes into the system.

CHAPTER 2: Theoretical Concepts

In this chapter we discuss the basic concepts of Immersive Multimedia.

Fundamental Definitions

There are several contradictory and confusing terminologies of Immersive Multimedia and VR. This fact has already been addressed by many researchers:

Machover and Tice (1994) [4]: "The field has produced an enormous amount of hype. This can damage credibility and obscure the real industry achievements and the extraordinarily important work being done."

MacIntyre and Feiner (1996) [17]: "The term virtual reality promises far more than our technology can currently deliver and has been so abused by the media that it means something different to almost everyone."

A very nice definition of immersive VR is given by Andries van Dam et al. (2000) [8]: "By immersive⁴ virtual reality we mean technology that gives the user the psychological experience of being surrounded by a virtual, that is, computer-generated environment. This experience is elicited with a combination of hardware, software, and interaction devices."

Based on the above definition and on the one given in [13] we can finally define Immersive Multimedia: Immersive Multimedia is the technology which provides VR that completely immerses the user's viewpoint⁵ inside a virtual world. We should note that:

- VR is a theoretical concept, defined as a real or simulated environment in which a perceiver experiences telepresence⁶ [1].
- An Immersive Multimedia system is a collection of devices and software which provide VR.

⁴ Immersion is the state of being immersed (plunged) into something that surrounds or covers [Merriam-Webster Dictionary].
⁵ Here "viewpoint" refers to all the human senses.
⁶ The concepts of presence and telepresence refer to the sense of being in an environment, generated by natural or mediated means, respectively [1].

Virtual Reality Experience


It is apparent that VR is a fundamental concept of Immersive Multimedia (Immersive Multimedia provides VR). In this section we discuss the advantages of defining VR as a particular type of experience rather than as a collection of hardware, and we look at its basic components and variables. This will allow us to distinguish between immersive and non-immersive multimedia and to compare different immersive systems. The material of this section follows [1, 3].

Defining VR as a particular type of experience, rather than a collection of hardware devices, provides a powerful tool for the analysis of VR itself and the underlying technology. The advantages are the following: a concrete unit of analysis of VR can be given, as illustrated in figure 2a. A VR consists of an individual (A or B) who experiences telepresence, and a mediated environment. The mediated environment is created and then experienced by the individual.

Figure 2 a) Unit of VR; b) Dimensions of VR: telepresence spans human experience and technology, vividness varies in breadth and depth, and interactivity in speed, range and mapping.

A set of dimensions can be defined over which the VR experience varies across different technologies. The basic dimensions⁷ are vividness and interactivity, as shown in figure 2b:

- Vividness means the representational richness of a mediated environment as defined by its formal features⁸. Two important variables of vividness are breadth and depth. Breadth refers to the number of sensory dimensions simultaneously presented (e.g. visual, auditory, haptic or more). Sensory depth refers to the resolution within each of these perceptual channels.
- Interactivity is defined as the extent to which the user can participate in modifying the form and content of the mediated environment in real time. Speed, range and mapping are its three main variables. Speed refers to the rate at which input can be assimilated into the mediated environment. Range refers to the number of possibilities for action at any given time. Mapping refers to the ability of the system to map its controls to changes in the mediated environment in a natural and predictable manner.

Therefore, the experience-based VR definition gives the means for examining and comparing different VR technologies. For instance, a telephone device or watching a movie on television provides a poor feeling of telepresence and therefore cannot be characterized as

⁷ Zeltzer's Conceptual Cube provides a similar set of dimensions for characterizing Immersive Multimedia devices [9, 3, 2]: immersion, interaction and autonomy.
⁸ The way an environment presents information to the senses.

Immersive Multimedia. On the other hand, a VIRTUE⁹ or a CAVE device provides more vivid and interactive environments and thus a stronger feeling of telepresence. Consequently, these devices can well be characterized as immersive. Of course the sense of telepresence varies across different individuals, the other part of a VR (figure 2a). Therefore it is also important to examine the human factors and cues that induce presence in an individual. This is done in the next section.

Visual Perception
As discussed above, Immersive Multimedia creates the necessary stimuli for the sensory channels of the human and thereby causes telepresence. The visual sense is the most important channel for inducing presence. In this section we look at the physiology of human visual perception [3, 17]. Then we discuss some important visual cues responsible for inducing presence [3, 12]. Finally, some common visual problems caused by the use of immersive visual devices are discussed [3, 17]. Knowledge of these issues is important for the design of Immersive Multimedia systems.

Physiology of the eye


The human eye is a very complex organ. Here we discuss its characteristics relevant to visual perception.

Field of view. We distinguish between the monocular and the binocular field of view (figure 3). When both eyes are used, the total field of view is approximately 120° vertically and 200° horizontally. The monocular field of view is approximately 120° vertically and 150° horizontally.

Figure 3 Physiology of the eye: field of view (approximately 200° binocular and 150° monocular horizontally, 120° vertically) and visual acuity.

Visual acuity. Human visual resolution is not constant over the retina. The finest details are resolved within a field of view of about 2° around the fovea (figure 3).

⁹ The VIRTUE system will be discussed in detail in Chapter 5.

Perception of 3D objects [21, 12]. The human visual system is not designed to image true 3D objects. Instead we observe only 2D surfaces, and we resolve depth in order to perceive the 3D world. Several cues provide depth information, the most important being lateral retinal disparity.

Visual Cues
It is apparent that the perception of 3D scenes is fundamental to inducing the feeling of presence. In the following, some important visual cues for depth perception are presented.

- Lateral retinal disparity is the difference in the relative position of the image of an object on the observer's retinas. It is the strongest depth cue.
- Motion parallax is a depth cue that results from motion. It refers to the relationship between objects in an observer's field of view as the observer moves in relation to the objects.
- Accommodation is the process by which the eye alters its lens to bring an object into focus. It depends on the lighting conditions and on the object's depth.

Additional effective visual cues employed in Immersive Multimedia to convince users that they are not in a virtual environment are eye-to-eye contact, gaze awareness, seeing one's own body, and others.

Although the human eye is very robust to visual artifacts, if the above cues are used inconsistently or with poor resolution, uncomfortable conditions may arise. Some of them are summarized here:

- Motion sickness is a symptom caused when different sensory cues conflict with each other; for instance, when somebody seems to move due to visual scene motion, but in the absence of corresponding physical motion.
- Visual fatigue: Immersive Multimedia displays need to operate at high frame rates in order to give the necessary visual cues for perception. The eyes then need to continually refocus on virtual objects at virtual depths, which causes eye strain.
- Diplopia: as discussed, the brain fuses the two different images on the retinas and infers the 3D scene. Sometimes this mechanism breaks down and we see two different images instead of one. Diplopia may be caused when the synthetic disparity presented by the multimedia device is not consistent.
- Muscle fatigue is caused by heavy immersive devices that the user has to carry, such as Head Mounted Displays.

Summary
In this chapter we discussed several theoretical aspects of Immersive Multimedia. All these issues are essential for the design of Immersive Multimedia devices. In the following chapter we give a brief overview of 3D user interface devices, an important component of VR systems.

CHAPTER 3: 3D User Interface Devices


3D user interface devices are an essential component of Immersive Multimedia [17, 18, 19]. Here we concentrate on Visual 3D user interfaces.

Input Devices
The basic visual immersive input devices are cameras and trackers [17].

Cameras are usually used in stereo configurations to infer the 3D scene. The captured video streams are processed in parallel and finally used for 3D reconstruction at the output displays. We will discuss the basic camera issues for 3D reconstruction in Chapter 4.

Spatial trackers are often used in combination with cameras to report information about the position, orientation or acceleration of the user. Thus, they allow for user mobility. The information captured by the trackers is important for the consistency of the rendered virtual scenes on the output devices; for instance, the 3D scene can be reconstructed in correct perspective and depth with respect to the immersed participant. Trackers may be electromagnetic, ultrasonic, optical, or camera based. Camera-based trackers have the advantage that the user does not need to carry any special device in order to be tracked. Some new techniques employ the global positioning system (GPS) for worldwide tracking. Eye-tracking devices are also novel tools, which will allow for even more precise and consistent 3D reconstruction of the virtual scene. The aforementioned technologies trade off among precision, latency, robustness, and tracking distance.

Output Devices
The most common immersive displays¹⁰ are visual, auditory, haptic, tactile, and olfactory. Visual displays can be roughly categorized into fully immersive and semi-immersive [18]. Fully immersive displays occlude the real world¹¹. Such displays are Head Mounted Displays (HMDs), arm mounted displays, boom mounted displays and virtual retinal displays (figure 4).

Figure 4 Fully Immersive visual displays: Head Mounted Displays HMDs, Boom Mounted Displays.

¹⁰ The term "display" is used to describe output.
¹¹ Augmented reality is an exception to this rule. In this case see-through (transparent) displays are used in order to allow virtual objects to overlap with the real world.

Semi-immersive displays, such as stereoscopic monitors, workbenches, and surround-screen immersive systems, allow the user to see both the physical and the mediated world.

Stereoscopic Display techniques


In stereoscopic displays the scene is rendered twice, at double frame rate, once from the point of view of each eye. The user then perceives the two images, coming from different perspectives, as one, and resolves their disparity¹² in order to infer the 3D scene. Hence, by producing artificial disparity we are able to trick the brain and generate a 3D effect. There are several ways to separate the two images:

- The oldest is called anaglyph and relies on red/green or red/blue glasses. Two stereo images are shot from slightly different positions; one image is then made all red and the other all green. The glasses separate them for each eye and allow the user to infer a 3D scene.
- A similar method uses polarized glasses. Two projectors with polarizing lenses over them generate the images. The lenses correspond to the polarization of the glasses, which thereby separate the images for the two eyes. Hence, the user can see stereo.
- Shutter glasses are another 3D imagery technique. Two images of each video frame are displayed, and the shutters are synchronized with the video such that each eye sees one of the images. LCD shutters work similarly, with the difference that instead of mechanically opening and closing an aperture, they simply switch from transparent to opaque.

The main requirement of stereoscopic techniques is a high frame rate in order to avoid flicker. Therefore these techniques depend strongly on the display technologies, which we discuss in the next section.

Figure 5 3D polarization technique for the separation of images and perception of a 3D scene.
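The anaglyph composition described above can be sketched in a few lines of NumPy. This is a simplified illustration, not part of the original report; the grayscale stereo pair is made up for the example:

```python
import numpy as np

def make_anaglyph(left, right):
    """Compose a red/green anaglyph from a grayscale stereo pair: the
    left view feeds the red channel and the right view the green one,
    so red/green glasses route one image to each eye."""
    assert left.shape == right.shape
    rgb = np.zeros(left.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = left    # red channel  -> left eye
    rgb[..., 1] = right   # green channel -> right eye
    return rgb

# Toy stereo pair: the right view is the left view shifted by a
# constant 4-pixel horizontal disparity.
left = np.tile(np.arange(64, dtype=np.uint8), (48, 1))
right = np.roll(left, -4, axis=1)
anaglyph = make_anaglyph(left, right)
print(anaglyph.shape)   # (48, 64, 3)
```

The same channel-routing idea underlies commercial anaglyph images; only the color filters differ between the red/green and red/blue variants.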

Immersive Display Technologies

¹² As we have discussed, disparity is the difference between corresponding points in the two images perceived by the eyes. This concept is fundamental for inferring a 3D view from stereo images, as we will see later.

In this section we briefly discuss the most important display technologies for immersive applications. For further information on the subject the reader is referred to [19].

3-tube projectors are the systems currently in use for immersive applications. They have several artifacts and produce images of too low resolution for immersive multimedia applications. Several emerging technologies attempt to overcome these shortcomings. To name some of them:

- Liquid Crystal Display (LCD) projectors and panels. These displays achieve a good resolution, but have too high a lag to be used for stereo. A solution may be to use two projectors with shutters.
- Digital Micro-mirror Displays (DMDs). These provide a good resolution and are theoretically fast enough to produce stereo.
- Plasma panel displays. They provide low-to-medium resolution, but they are probably fast enough to produce stereo if combined with the necessary driver electronics. Their main advantage is that they offer borderless configurations.
- Light Emitting Diode (LED) displays. These have low resolution at present, but they offer the advantage of borderless configurations as well.

It is apparent that display technologies are not mature yet. For VR purposes, future displays should be faster, lighter, more cost effective, with less power consumption and higher resolution. Fortunately the technology evolves fast and these requirements may be satisfied soon.

Summary
In this chapter we discussed the interface technologies for 3D scene acquisition and reproduction. We saw that an important cue used to infer depth, and thus enable 3D reproduction, is disparity. In stereoscopic displays disparity is induced by presenting the left and right eyes with images from different perspectives. Another way to exploit disparity for 3D scene inference is to estimate it explicitly and encode it, e.g. as a gray-value disparity map. This is fundamental for techniques employing video for 3D scene reconstruction and will be discussed in the next chapter.


CHAPTER 4: 3D Reconstruction

When video is employed in Immersive Multimedia applications, the following process takes place:

1. The object of interest, usually a human head and shoulders, is segmented from the background scene.
2. The disparity and the other quantities needed for 3D reconstruction are estimated.
3. The human object and the disparity measures are efficiently encoded (e.g. with MPEG-4) and transmitted to the receiving end of the communication medium.
4. There, based on the disparity measures, the 3D video object is reconstructed.
5. Finally, it is composed with a synthetic background and rendered to give a realistic virtual scene.

It is apparent that the heart of the process is the 3D reconstruction. In this chapter we present the basic issues of 3D reconstruction. At the end, a disparity-based signal processing technique for 3D reconstruction is explored.

The Pinhole Model


The pinhole camera is a simple and basic camera model. It is important for understanding the concepts of rectification and 3D reconstruction. The following figure illustrates the pinhole camera model.

Figure 6 Illustration of the pinhole camera model in (a) 2D and (b) 3D space.

The model consists of the optical centre (or center of projection) C and the retinal plane (or image plane, in practice a CCD chip) R. The z axis of the world coordinate system (the world reference frame) corresponds to the optical axis of the camera reference frame, as in the figure. A 3D point w, given in arbitrary world coordinates, is projected to the 2D image-plane point m, given in pixel coordinates:


w = (x_w, y_w, z_w)^T    (world point)
p_c = (x_c, y_c, z_c)^T  (point in camera coordinates)
m = (u, v)^T             (image point)

The world coordinates of a point and its pinhole camera coordinates are related by the fundamental equation

p_c = R3x3 w + t3x1

where R is called the rotation matrix and t the translation vector. The main drawbacks of a single pinhole camera are that the depth information is lost and that occluded objects cannot be seen. To resolve these problems a stereo camera rig is usually deployed.

Camera Calibration
Camera calibration is the estimation of the intrinsic and extrinsic parameters of the camera model.

The intrinsic parameters are:
- the image center coordinates (u0, v0),
- the focal lengths (fu, fv),
- the horizontal and vertical pixel size units,
- the radial distortion coefficient.

The extrinsic parameters are:
- the rotation matrix R3x3,
- the translation vector t3x1.

The calibration problem can then be stated as: given image points (ui, vi), i = 1,...,N, which are the projections of N known world points (xi, yi, zi) in the world reference frame, estimate the intrinsic and extrinsic parameters of the camera. In homogeneous coordinates¹³ the projection of a point is given by the equation:

m = P3x4 w

with

            | fu  0   u0 |                     | R3x3  t3x1 |
P3x4  =  K  | 0   fv  v0 |  (I3x3 | 03x1)  x   | 01x3   1   |
            | 0   0   1  |

where K denotes the 3x3 calibration matrix of intrinsic parameters. P3x4 is called the perspective projection matrix (PPM). It contains the intrinsic and extrinsic parameters of the camera and therefore characterizes the camera entirely.

¹³ Homogeneous coordinates offer several advantages: the equations become linear, and points at infinity can be handled and unambiguously represented.
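As a concrete illustration, the projection of a world point through the pinhole model can be composed in a few lines of NumPy. This is a hedged sketch; all numerical parameter values are invented for the example:

```python
import numpy as np

def project(K, R, t, w):
    """Pinhole projection m ~ K (R w + t): the extrinsics map the world
    point to camera coordinates, K maps those to homogeneous pixel
    coordinates, and dividing by the third component dehomogenizes."""
    p_c = R @ w + t          # world -> camera coordinates
    m_h = K @ p_c            # camera -> homogeneous pixel coordinates
    return m_h[:2] / m_h[2]  # pixel coordinates (u, v)

# Invented parameters: focal lengths f_u = f_v = 500 px, image centre
# (320, 240), camera aligned with the world frame.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.zeros(3)

w = np.array([0.5, -0.25, 2.0])   # a point 2 m in front of the camera
u, v = project(K, R, t, w)
print(u, v)   # 445.0 177.5
```

Doubling the depth of w halves the offset of (u, v) from the image centre, which is exactly the depth ambiguity noted above: scaled copies of a scene along the optical ray project to the same pixel.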

Epipolar Geometry
Epipolar geometry simplifies the solution of the correspondence problem, i.e. finding the corresponding points between two images. We need this information in order to calculate disparity. In the following we introduce epipolar geometry and its terminology.

Figure 7 Epipolar geometry

As shown in the figure, epipolar geometry refers to stereo camera configurations. Even if we do not know the exact position of a 3D world point w, it is bound to lie on the line of sight of the corresponding image point m1. The two conjugate lines defined by the projections of a point w are called the epipolar lines. It can be proven that, given an image point m1 of an unknown 3D world point w, the corresponding point m2 in the other image is constrained to lie on the conjugate epipolar line of m1. The line connecting the two projection centers C1, C2 is called the baseline and plays an important role in epipolar geometry: it intersects the image plane of each camera at the image points e1, e2, called the epipoles. The plane defined by the baseline and the point w (or its image m) is called the epipolar plane. In fact all points on an epipolar plane have their images on conjugate epipolar lines, and all epipolar lines pass through the epipoles e1, e2.
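The epipolar constraint can be checked numerically. The sketch below builds the fundamental matrix from the calibrated quantities via the standard formula F = K2^-T [t]x R K1^-1 (a textbook result, not stated in the report) for a hypothetical rig with identical intrinsics, and verifies that corresponding homogeneous image points satisfy m2^T F m1 = 0:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[  0.0, -t[2],  t[1]],
                     [ t[2],   0.0, -t[0]],
                     [-t[1],  t[0],   0.0]])

# Hypothetical rig: camera 1 at the world origin, camera 2 translated
# along the baseline; both cameras share the same invented K.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])   # 10 cm baseline along x

# Fundamental matrix from the calibrated quantities (K1 = K2 = K here).
F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)

# Project one world point into both cameras: P1 = K [I | 0], P2 = K [R | t].
w = np.array([0.3, -0.2, 2.5])
m1 = K @ w
m1 = m1 / m1[2]
m2 = K @ (R @ w + t)
m2 = m2 / m2[2]
print(abs(m2 @ F @ m1) < 1e-9)   # epipolar constraint holds: True
```

In practice the constraint is used the other way around: given m1, the conjugate epipolar line F m1 restricts where m2 may be searched for.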

Rectification

Rectification is a process that transforms the image planes such that the epipolar lines become horizontally aligned; moreover, they can be made to coincide with the image scan lines. Hence, stereo matching algorithms can easily take advantage of the epipolar constraint and reduce the search space for corresponding points from two dimensions to one. The process of rectification can be summarised as follows. Given a stereo pair of images I1, I2 and PPMs Po1, Po2 (obtained by calibration):

1. Compute [T1, T2, Pn1, Pn2] = rectify(Po1, Po2).
2. Rectify the images by applying the transformations T1 and T2.

Hence, the process of rectification reduces to the estimation of two new projection matrices; this is usually done at the camera calibration stage. 3D reconstruction can then be performed directly from the rectified images using Pn1^-1, Pn2^-1. Schematically, rectification is shown in the following figure.

Figure 8 Rectification: the original retinal planes are re-projected onto common rectified retinal planes parallel to the baseline connecting C1 and C2.

In the next figure we see two raw images and their rectified versions.

Figure 9 Example of images after rectification (bottom pair).
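One possible realization of the rectify() step is to re-project both cameras onto a common orientation whose x axis lies along the baseline. The sketch below is an assumption about how such a routine might look (the report does not give its internals, and passing K explicitly is also an assumption); the demo checks that a projected point lands on the same row in both rectified views:

```python
import numpy as np

def rectify(Po1, Po2, K):
    """Sketch of baseline-aligned rectification: return new PPMs Pn1,
    Pn2 sharing one orientation, plus the 3x3 transformations T1, T2
    that warp the original images to the rectified ones."""
    def centre(Po):
        # Optical centre c solves Po[:, :3] c + Po[:, 3] = 0.
        return -np.linalg.solve(Po[:, :3], Po[:, 3])

    c1, c2 = centre(Po1), centre(Po2)

    # New common frame: x along the baseline, y orthogonal to x and to
    # the old z axis of camera 1, z completing a right-handed triad.
    x = (c2 - c1) / np.linalg.norm(c2 - c1)
    old_z = Po1[2, :3] / np.linalg.norm(Po1[2, :3])
    y = np.cross(old_z, x); y /= np.linalg.norm(y)
    z = np.cross(x, y)
    Rn = np.vstack([x, y, z])

    Pn1 = K @ np.hstack([Rn, (-Rn @ c1)[:, None]])
    Pn2 = K @ np.hstack([Rn, (-Rn @ c2)[:, None]])
    T1 = Pn1[:, :3] @ np.linalg.inv(Po1[:, :3])
    T2 = Pn2[:, :3] @ np.linalg.inv(Po2[:, :3])
    return T1, T2, Pn1, Pn2

# Toy rig (made-up numbers): camera 1 at the origin, camera 2 slightly
# rotated about y and displaced along x and y.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
Po1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
a = 0.05
Ry = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
c2 = np.array([0.1, 0.02, 0.0])
Po2 = K @ np.hstack([Ry, (-Ry @ c2)[:, None]])

T1, T2, Pn1, Pn2 = rectify(Po1, Po2, K)
w = np.array([0.3, -0.2, 2.5, 1.0])   # homogeneous world point
m1 = Pn1 @ w; m1 = m1 / m1[2]
m2 = Pn2 @ w; m2 = m2 / m2[2]
print(abs(m1[1] - m2[1]) < 1e-9)      # same row in both views: True
```

Because the new cameras share one rotation and the baseline is mapped onto the x axis, corresponding points differ only in their u coordinate, which is exactly the property the stereo matcher exploits.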

3D Reconstruction from Disparity Estimation

We can now consider a digital signal processing algorithm for 3D reconstruction based on disparity estimation. The algorithm is summarized as follows:

1. Before the run:
   - Decide how many disparity levels to use.

   - Decide which similarity measure to use, for instance the cross correlation over a (2w+1)x(2w+1) block:

     CORR(u, v, d) = sum_{i=-w..w} sum_{j=-w..w} I_l(u+i, v+j) I_r(u-d+i, v+j)

   - Calibrate the cameras.
   - Perform rectification and construct the corresponding look-up tables.

2. During operation, for each pair of corresponding rows of the left and right image, construct one layer of the Similarity Accumulator:
   - Compute the block matching similarity for each level of disparity, for each pixel in the corresponding rows.
   - Construct the right and left image disparity maps by finding the corresponding pixels between the left and right image rows.
   - Perform a consistency check against the stereo constraints (e.g. uniqueness) and mark inconsistent pixels as undefined in the disparity maps.

3. By iterating step 2 over all pairs of corresponding rows, construct all the layers of the Similarity Accumulator and the related disparity maps.

4. Fill the undefined regions:
   - For small gaps use a median filter with a small mask size.
   - For larger gaps use the morphological closing operator (i.e. dilation followed by erosion).
   - For gaps along the scan lines that are still unfilled, determine the smaller of the disparity values at the left and right ends of the unfilled area, and use this value to fill the area.
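The core of steps 1 and 2 can be sketched as follows. For brevity the sketch uses a sum-of-absolute-differences (SAD) cost instead of the cross correlation, and omits the consistency check and the gap filling; the toy images and the constant-disparity setup are invented for illustration:

```python
import numpy as np

def disparity_map(I_l, I_r, d_max, w=1):
    """Winner-take-all block matching along rectified scan lines.

    For each left-image pixel, a (2w+1)x(2w+1) block is compared with
    right-image blocks shifted by d = 0..d_max along the same row (the
    epipolar line); the disparity with the lowest SAD cost wins.
    Pixels whose window leaves the image keep the value -1."""
    H, W = I_l.shape
    disp = -np.ones((H, W), dtype=int)
    for v in range(w, H - w):
        for u in range(w, W - w):
            block_l = I_l[v - w:v + w + 1, u - w:u + w + 1].astype(float)
            best_cost, best_d = None, -1
            for d in range(0, min(d_max, u - w) + 1):
                block_r = I_r[v - w:v + w + 1,
                              u - d - w:u - d + w + 1].astype(float)
                cost = np.abs(block_l - block_r).sum()   # SAD similarity
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[v, u] = best_d
    return disp

# Toy rectified pair: the right image equals the left image shifted
# left by a constant true disparity of 3 pixels.
rng = np.random.default_rng(0)
I_l = rng.integers(0, 255, size=(10, 20))
I_r = np.roll(I_l, -3, axis=1)
disp = disparity_map(I_l, I_r, d_max=5)
print(np.all(disp[1:-1, 4:19] == 3))   # constant disparity recovered: True
```

A real implementation would accumulate the per-row costs into the Similarity Accumulator layers, run the left-right consistency check of step 2, and apply the gap-filling rules of step 4.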

For the visualization of the algorithm the following figures have been constructed:

Figure 10 Left: at the beginning of the algorithm the number of disparity levels to be used is decided. Right: illustration of how the block matching algorithm works.


Figure 11 Left: construction of one layer of the Similarity Accumulator. Right: construction of the right and left image disparity maps.


CHAPTER 5: Application: VIRTUE

VIRTUE (Virtual Team User Environment) is an IST (EU Information Societies Technology) project started at the beginning of 2000. The consortium consists of several academic and industrial partners. The goal of the VIRTUE project is the development of the innovative technology needed for a semi-immersive teleconferencing system. This system aims to replace current video conferencing systems. Here we give an overview of VIRTUE based on the knowledge established before. The chapter is divided into three sections. In the first section the basic requirements that VIRTUE aims to fulfil are introduced. In the next section the mock-up of a three-way immersive telepresence videoconferencing scenario with VIRTUE is presented and discussed. In the last section the software architecture of the system is given.

Requirements
VIRTUE should have the following features in order to provide an immersive experience:

1. A semi-immersive display capable of providing life-size head and torso images. This will enable:
   - accurate reproduction of facial expressions and effective body language;
   - balance of the relative size of people's images.
2. Camera views for multiple participants (e.g. a cocktail party setting). This will enable:
   - eye-to-eye contact and gaze awareness;
   - spatial awareness (consistent spatial positioning of objects in the environment);
   - directed body language (i.e. to whom am I pointing?);
   - a movable viewpoint (motion parallax or look-around effect).
3. An integrated visual environment for multiple participants. This will enable:
   - participants to feel as if they are sitting around the same virtual table;
   - correct perspective (participants around the virtual table appear in correct proportions relative to their positions around the real table);
   - harmonisation of the visual parameters (contrast, illumination and others) of video objects from different locations when combining them in a common virtual scene.

Three-way Immersive Telepresence Videoconferencing Mock-up

The VIRTUE mock-up for communication among three conferees, one local and two remote, is shown in the figure:


Figure 12 VIRTUE mock up.

The system is composed of:
- a half real table;
- a large plasma display (diagonal 50 inches or more);
- four cameras mounted around the display (one left and one right pair);
- a head tracker for the registration of the head at the receiving end;
- software to handle the video coming from the four cameras and create the video streams for the virtual cameras behind the display;
- a fast network connection linking the different VIRTUE devices, allowing for real-time performance.

Figure 13 Immersive videoconferencing with VIRTUE.


A key element in meeting the requirements above is the shared-table principle. The idea is to position the participants in predefined, symmetrically distributed positions around a shared table environment; their 3D images are then placed virtually around a shared table. Ideally this is done isotropically, so that a natural social setting is obtained. Hence, in a three-party conference the participants form an equilateral triangle, as in the figures above (in a four-party conference they would form a square). At the transmitting end, the video streams captured by the left and the right pair of cameras are processed separately. The acquired stereo information is sufficient for the 3D reconstruction (or rendering) of the local participant in the left or right counterpart of the remote display, respectively. At the receiving end, the head position is continuously registered by the head tracker, so that a virtual camera can be moved along with the conferee's head. The entire composed 3D scene is then rendered onto the 2D display of the terminal as if it were produced by this virtual camera. The conferees see the scene under the correct perspective view if the geometrical parameters of the multiview capture devices, the virtual scene and the virtual camera are well fitted to each other.
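The shared-table geometry described above can be illustrated with a short calculation. The sketch below (with hypothetical helper names, not part of VIRTUE itself) places N conferees isotropically on a circle around the shared virtual table; for N = 3 the seats form an equilateral triangle, for N = 4 a square.

```python
import math

def seat_positions(n_parties, table_radius=1.0):
    """Place n_parties conferees isotropically on a circle of the
    given radius around the shared virtual table (at the origin)."""
    positions = []
    for i in range(n_parties):
        angle = 2.0 * math.pi * i / n_parties  # isotropic spacing
        positions.append((table_radius * math.cos(angle),
                          table_radius * math.sin(angle)))
    return positions

# Three-party conference: seats 120 degrees apart, i.e. the corners
# of an equilateral triangle with side 2*sin(60deg) = sqrt(3).
seats = seat_positions(3)
side = math.dist(seats[0], seats[1])
print(side)  # prints 1.732..., i.e. sqrt(3)
```

All pairwise distances come out equal, which is exactly the symmetry the shared-table principle relies on to keep perspective consistent for every participant.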

Software Architecture
The software architecture of the system is shown in the following figure.

Figure 14 The software architecture of VIRTUE.

It consists mainly of two processing parts: the analysis block and the synthesis block.

The analysis block is responsible for extracting the 3D information of the foreground figure from the pair of stereo video streams. The captured stereo images are first segmented by a foreground/background segmentation module, which extracts the human object. Then radial distortion correction is applied. The distortion-free images are subsequently rectified, so that only horizontal disparity remains. Finally, the disparity estimation module is applied to output coherent disparity maps.

The synthesis block reconstructs the dynamic scene information from a virtual viewpoint in line with the head position. The inputs to the block are the 3D information delivered by the analysis block and the head position from the head tracker. An image-based rendering approach is then used to synthesise the new virtual-view image.
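As a rough illustration of two of the analysis steps above, the sketch below (a simplified stand-in, not the actual VIRTUE implementation) applies the standard polynomial model for radial distortion correction to a single point, and then estimates horizontal-only disparity on an already rectified image pair by 1D block matching with a sum-of-absolute-differences (SAD) cost. Function names and parameters are illustrative assumptions.

```python
import numpy as np

def undistort_point(xd, yd, k1, k2, cx=0.0, cy=0.0):
    """Polynomial radial correction: move a distorted point (xd, yd)
    according to x_u = x_d * (1 + k1*r^2 + k2*r^4), with r measured
    from the distortion centre (cx, cy)."""
    x, y = xd - cx, yd - cy
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return cx + x * scale, cy + y * scale

def disparity_1d(left, right, block=5, max_disp=16):
    """Block matching on a rectified pair: for each pixel, pick the
    horizontal shift d that minimises the SAD between a block x block
    window in the left image and the window at (x - d) in the right."""
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            win_l = left[y - half:y + half + 1,
                         x - half:x + half + 1].astype(np.int32)
            costs = [np.abs(win_l -
                            right[y - half:y + half + 1,
                                  x - d - half:x - d + half + 1].astype(np.int32)).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic rectified pair: the right view is the left view shifted
# horizontally by 4 pixels, so the true disparity is 4 everywhere.
rng = np.random.default_rng(0)
left = rng.integers(0, 255, size=(20, 40), dtype=np.uint8)
right = np.roll(left, -4, axis=1)
d = disparity_1d(left, right)
print(d[10, 20])  # prints 4, the true shift
```

A real system would of course use calibrated distortion coefficients, epipolar rectification of full images and a far more robust disparity estimator producing the coherent disparity maps mentioned above; this sketch only shows the principle of each step.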


Summary
In this report we aimed to present the fundamental theoretical concepts and technologies of Immersive Multimedia devices, with emphasis on their video part, and thereby to familiarize the reader with the field. To achieve this goal, in the second chapter we analyzed the theoretical issues of Immersive Multimedia and the visual perceptual factors of humans. In the following chapters we then examined, step by step, the basic 3D user interface devices, as well as the basic components and a disparity-based algorithm of 3D reconstruction, an essential technique for the use of video in Immersive Multimedia. Finally, we presented VIRTUE, a tele-immersive videoconferencing device that utilizes most of the aforementioned concepts and techniques.


References

Overview:
1. J. Steuer, Autumn 1992, Defining Virtual Reality: Dimensions Determining Telepresence, Journal of Communication, 42(4), pages 73-93.
2. S. R. Kalawsky, 1993, The Science of Virtual Reality and Virtual Environments, Addison-Wesley.
3. F. Biocca, M. R. Levy, 1995, Communication in the Age of Virtual Reality, Lawrence Erlbaum Associates, New Jersey.
4. C. Machover, S. E. Tice, January 1994, Virtual Reality, IEEE Computer Graphics & Applications, 14(1), pages 15-16.
5. S. R. Ellis, January 1994, What Are Virtual Environments?, IEEE Computer Graphics & Applications, 14(1), pages 17-22.
6. J. N. Latta, D. J. Oberg, January 1994, A Conceptual Virtual Reality Model, IEEE Computer Graphics & Applications, 14(1), pages 23-29.
7. F. P. Brooks, November/December 1999, What's Real about Virtual Reality?, IEEE Computer Graphics & Applications, 19(6), pages 16-27.
8. A. van Dam et al., November/December 2000, Immersive VR for Scientific Visualization: A Progress Report, IEEE Computer Graphics & Applications, 20(6), pages 26-52.
9. D. Zeltzer, April 2002, Virtual Reality: Immersion, Decisions, Empathy, Third International Conference on Virtual Reality and its Application in Industry, China.
10. A. Steed, October 1993, A Survey of Virtual Reality Literature, Department of Computer Science, Queen Mary and Westfield College, Tech. report 623.
11. U.S. Congress, Office of Technology Assessment, September 1994, Virtual Reality and Technologies for Combat Simulation: Background Paper, OTA-BP-ISS-136, Washington, DC: US Government Printing Office.
12. M. T. Bolas et al., January/February 1996, Perceptual Issues in Augmented Reality, SPIE: Stereoscopic Displays and Virtual Reality Systems III, 2653.
13. J. Isdale, September 1998, What is Virtual Reality? A Web-Based Introduction, http://www.isx.com/~jisdale/WhatIsVr.html (6 May 2003).
14. L. F. Hodges et al., November/December 2001, Treating Psychological and Physical Disorders with VR, IEEE Computer Graphics & Applications, 21(6), pages 25-32.
15. R. Pausch et al., 1997, Quantifying Immersion in Virtual Reality, ACM SIGGRAPH 1997.
16. H. Regenbrecht, T. Schubert, 2002, Measuring Presence in Augmented Reality Environments: Design and a First Test of a Questionnaire, Proceedings of the Fifth Annual International Workshop Presence 2002, Porto, Portugal.

3D User Interface Devices:
17. B. MacIntyre, S. Feiner, 1996, Future Multimedia User Interfaces, Multimedia Systems, Springer-Verlag, 4, pages 250-268.
18. D. A. Bowman et al., 2001, An Introduction to 3D User Interface Design, Presence, MIT, 10, pages 96-108.
19. T. DeFanti et al., June 1999, Technologies for Virtual Reality/Tele-Immersion Applications: Issues of Research in Image Display and Global Networking, EC/NSF Workshop on Research Frontiers in Virtual Environments and Human-Centered Computing, Chateau de Bonas, France.

3D Reconstruction:
20. H. Niemann, J. Denzler, February 1998, Lecture Script of Image Processing, University of Erlangen-Nuernberg, Germany.
21. B. Jahne, 1997, Digital Image Processing, Springer, Berlin Heidelberg, Germany.
22. D. H. Ballard, C. M. Brown, 1982, Computer Vision, Prentice Hall, New Jersey, United States.
23. A. Fusiello, Epipolar Rectification, http://profs.sci.univr.it/~fusiello/rectigfxyncxvf_cvol/
24. A. Fusiello, Tutorial on Rectification of Stereo Images, http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/FUSIELLO/tutorial.html (7 July 2003).
25. M. Pollefeys, June 2000, Tutorial on 3D Modeling from Images, http://www.esat.kuleuven.ac.be/~pollefey/tutorial/tutorialECCV.html (7 July 2003).

VIRTUE:
26. L. Q. Xu et al., January 2002, Computer Vision for a 3D Visualization and Telepresence Collaborative Working Environment, BT Technology Journal, 20(1).
27. P. Kauff, R. Schaefer, O. Schreer, September 2000, Tele-Immersion in Shared Presence Conference Systems, International Broadcasting Convention 2000, Amsterdam, Netherlands.
28. O. Schreer, N. Brandenburg, P. Kauff, September 2000, A Comparative Study on Disparity Analysis Based on Convergent and Rectified Views, British Machine Vision Conference, Bristol, United Kingdom.
29. B. J. Lei, E. A. Hendriks, November 2001, Multi-step View Synthesis with Occlusion Handling, Proc. of VMV 2001, Stuttgart, Germany.
30. M. Hirose, T. Ogi, T. Yamada, 1999, Integrating Live Video for Immersive Environments, IEEE Multimedia, pages 14-22.
