Exploration and Implementation of a Next Generation Telepresence System
Ramachandra Budihal, Navaneeth Mohanan, Sahil A. Anand and Saish Satish Kamat

Abstract: Human communication includes not only spoken language but also non-verbal cues such as hand and body gestures, facial expressions, etc., used to communicate our thoughts and feelings and to gather feedback. Today's telepresence systems use two-way audio and video transmission to carry this non-verbal information. In this paper, we introduce a novel Experiential Telepresence System, which possesses cognitive intelligence and is also context-aware, i.e., it is aware of the multiple components of communication, both verbal and non-verbal, and of the ambience in which it communicates, making the telepresence experience far more immersive than its peers. This is achieved using a 3-tier architecture comprising a Humanoid Robot, a Cognitive Collective Intelligence Platform on the Cloud and an Experience Centre. Towards the end, a performance analysis coupled with a qualitative analysis of user perception, in other words a measurement of the Quality of Experience of the system, shows that the acceptability and user experience of our system are far higher than with traditional telepresence and video conferencing.

Index Terms: Experiential Telepresence, cognition, augmented reality, context-awareness, humanoid robot, Affective Interfaces, Tele-operation, Collective Intelligence on Cloud, SLAM, Cloud Robotics, Quality of Experience (QoE, QoX), Quality of Service (QoS), User Experience (UX)

I. INTRODUCTION

INTRINSICALLY, human communication can be broken down into verbal and non-verbal components. Face-to-face communication (Fig. 1) is considered one of the most effective forms of communication, as it propagates both components without restrictions[1]. When it comes to long distance communication, traditional channels like letters and telephones lack the latter component.
This gave birth to telepresence systems. These systems ensure that non-verbal communication between individuals does not get hindered by the limitations of the channel between them (Fig. 1). Different implementations of telepresence systems have approached this problem in multiple ways.
Companies like Cisco Systems have tackled this problem by launching products such as Cisco TelePresence in 2006[2]. Many more companies, such as Anybots[3], VGo Communications[4] and Gostai[5], have ventured down the path of tele-operated robots in order to add an element of user interactivity to telepresence.

Ramachandra Budihal is with Wipro Technologies, Bangalore, India, e-mail: rama.budihal@wipro.com.
Navaneeth Mohanan is with India Innovation Labs, Bangalore, India, e-mail: navaneeth.mohan@indiainnovationlabs.in.
Sahil A. Anand is with India Innovation Labs, Bangalore, India, e-mail: sahil.yousif@indiainnovationlabs.in.
Saish Satish Kamat is with India Innovation Labs, Bangalore, India, e-mail: saish.kamat@indiainnovationlabs.in.

Figure 1. Three tools for communication

However, in most current implementations, the channel of communication is a means of transmitting mostly audio and visual data to a recipient, which the recipient then interprets himself. Our research focuses on making the channel intelligent, so that it is aware of the multiple components of communication it is transmitting and receiving.
This brings us to the concept of Experiential Telepresence.
In an Experiential Telepresence System, extra knowledge
gathered from diverse sensing systems (sensors + smart apps)
is available to the intelligent channel. This extra knowledge
is augmented on top of a standard video and audio feed
to convey more information than the previously mentioned
telepresence systems (Fig. 1). Currently, our channel is able to
interpret emotions, detect faces, recognize speakers and gather
environmental information.
The idea behind this Experiential Telepresence System originated from a talk presented by Budihal and a team consisting of other authors of this paper at a TED conference in Mysore in 2009. The talk introduced a new model for heritage tourism called E3iT - Engage, Entertain, Educate, immerse and Transform[6]. The model stresses the need for an immersive experience in order to convey the story and history behind a heritage site.
In this paper, we discuss the overall architecture and implementation of our Experiential Telepresence System, along with a comparison against a few of its commercial counterparts. Towards the end of the paper, we briefly mention the application areas of our telepresence system.

II. THE EXPERIENTIAL TELEPRESENCE SYSTEM


A. Overview
Our Experiential Telepresence System is a 3-tier architecture
consisting of PRATHAM (a humanoid robot), a Collective
Intelligence Platform on Cloud and an Experience Centre. All
three components are connected via the Internet (Fig. 2). The
information and knowledge gathered by multiple intelligent
agents/systems, which also include a humanoid robot, are the
primary knowledge generating sources. This shared knowledge
is made available as crowd intelligence by the Collective
Intelligence Platform.
The Collective Intelligence Platform is a knowledge portal responsible for assimilating and disseminating knowledge from multiple robots on a real-time basis. The knowledge generated by the robot is transmitted across to the Experience Centre and is responsible for creating context-awareness in the information delivered. This forms the basis of Cloud Robotics.
Cognition and context-awareness are among the key differentiating features of our Experiential Telepresence System, and they are built in at various levels. At the lowest level, the system is aware of the network bandwidth available and is thus able to scale the level of immersiveness up or down in order to maintain optimal performance. The system is also able to recognize people in its environment using facial recognition, gather specific information such as age and profession through social networking sites, and deliver content in a view most suitable to that person. This gives the user, who has created his/her avatar in the humanoid robot and is connected to the Experience Centre, a more immersive experience.
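As a simple illustration of this bandwidth-aware scaling, the sketch below chooses which overlays to enable from a measured bandwidth figure. The thresholds and feature names are illustrative assumptions, not the actual decision logic running on our system.

# Illustrative sketch (not the actual PRATHAM code): choose which knowledge
# primitives to overlay based on the bandwidth currently measured on the link.
# The thresholds and feature names below are assumptions for illustration only.
def select_immersiveness(bandwidth_kbps: float) -> dict:
    """Return which overlays to enable for a given available bandwidth."""
    profile = {
        "video_resolution": (320, 240),
        "face_labels": False,
        "emotion_overlay": False,
        "environment_info": False,
    }
    if bandwidth_kbps >= 300:           # enough headroom for a ~250 kbps stream
        profile["video_resolution"] = (640, 480)
        profile["face_labels"] = True
    if bandwidth_kbps >= 500:
        profile["emotion_overlay"] = True
        profile["environment_info"] = True
    return profile

print(select_immersiveness(bandwidth_kbps=450))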
The following sections talk about PRATHAM and the
Experience Centre.

Figure 2. The Experiential Telepresence System

B. PRATHAM - a Humanoid Robot:


PRATHAM stands for Personal Robot And Telepresence Humanoid with Autonomous Mobility. Taking a cue from the popular Uncanny Valley hypothesis[7], [8], [9] (Fig. 4), we decided to make PRATHAM a humanoid robot, thus maintaining a social and emotional connection with the people it interacts with.

Figure 3. Anatomy of PRATHAM

Figure 4. Hypothesized emotional response of human subjects, following Mori's statements

The humanoid robot itself consists of a three-layered architecture (Fig. 5) that generates all the necessary knowledge primitives before transmitting them to the Experience Centre.
1) Hardware Layer: At the lowest level, the robot consists of a system of sensors and actuators. Sensors are broadly divided into four types: position, navigation, visual and auditory. Position sensors include a GPS receiver and a compass. Navigation sensors comprise a laser SLAM (Simultaneous Localization and Mapping) module and ultrasound sensors. Visual sensors include a combination of a high resolution camera and a depth sensing camera. Finally, the auditory sensors are a six-channel microphone system used for sound analysis.
The robot also has two actuators: a mobility platform and a 6DOF head motor system. The mobility platform is a three-wheeled system comprising two feedback-enabled DC motors that provide a differential drive, plus a caster. The 6DOF head motor system is a combination of three servo motors connected orthogonally to each other. Together, the two actuators allow a remote user to move the base and the head of the robot.
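With a differential drive, a commanded body velocity has to be converted into individual wheel speeds. The short sketch below shows the standard kinematics for this conversion; the wheel radius and track width used are placeholder values, not PRATHAM's actual dimensions.

# Differential-drive kinematics sketch. The wheel radius and track width below
# are placeholders, not PRATHAM's actual dimensions.
WHEEL_RADIUS_M = 0.10   # assumed
TRACK_WIDTH_M = 0.40    # assumed distance between the two driven wheels

def wheel_speeds(linear_mps, angular_rps):
    """Convert a body velocity command into left/right wheel angular speeds (rad/s)."""
    v_left = linear_mps - angular_rps * TRACK_WIDTH_M / 2.0
    v_right = linear_mps + angular_rps * TRACK_WIDTH_M / 2.0
    return v_left / WHEEL_RADIUS_M, v_right / WHEEL_RADIUS_M

# Example: drive straight at the robot's top speed of 3.6 km/h (1.0 m/s).
left, right = wheel_speeds(linear_mps=3.6 / 3.6, angular_rps=0.0)
print(f"left: {left:.2f} rad/s, right: {right:.2f} rad/s")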
2) Middleware Layer: The middleware forms the basic software platform, consisting of ROS (Robot Operating System), a hardware abstraction layer, and Ubuntu Linux as our OS. ROS Diamondback is the subsystem used by all our higher-level software modules and by our hardware abstraction layer. ROS allows development of modules in a graph architecture, where each module forms a node of the graph and communication between nodes takes place through a publish-subscribe or a service (request-reply) methodology. The hardware abstraction layer is a set of drivers written for each hardware module. Each driver, written in ROS, is the entry point for the hardware into the ROS subsystem and performs the necessary semantic conversion of data to and from the hardware, depending on the type and make of the hardware. Ubuntu Linux 10.10 was chosen as the OS keeping in mind compatibility with ROS Diamondback.
Figure 5. PRATHAM's Architecture

3) Application Layer: The application layer implements the high-level logic of Experiential Telepresence on the robot. It consists of three subsystems: a video encoder, the Experiential Telepresence Stack and the navigation stack.
a) Video Encoder: Video streaming on the robot is a point-to-point transmission. We have used an open source H.264 encoder called x264 for streaming video at a resolution of 640x480, which ensures high quality video streaming over the Internet. The high resolution camera captures the scene the robot is able to see. This camera is placed exactly in the centre between the robot's two emotive eyes, giving what is perhaps the first eye-to-eye contact between the user who has created an avatar in the robot and the person interacting with it; this is a critical part of the QoE measure of the communication and interaction.
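The exact streaming pipeline is not published here; as one plausible sketch of such a point-to-point stream, the following pushes 640x480 camera frames into an x264 (libx264) encoder via ffmpeg at a constant 250 kbps. The endpoint address is a placeholder.

import subprocess
import cv2

CAMERA_INDEX = 0
TARGET = "udp://experience-centre.example:5004"  # placeholder endpoint, not a real host

cap = cv2.VideoCapture(CAMERA_INDEX)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

# ffmpeg (built with libx264) consumes raw BGR frames on stdin and emits an
# H.264 stream over UDP at a constant bitrate, mirroring the 250 kbps
# constant-bitrate configuration mentioned later in this paper.
ffmpeg = subprocess.Popen([
    "ffmpeg", "-f", "rawvideo", "-pix_fmt", "bgr24", "-s", "640x480",
    "-r", "25", "-i", "-",
    "-c:v", "libx264", "-b:v", "250k", "-preset", "ultrafast",
    "-tune", "zerolatency", "-f", "mpegts", TARGET,
], stdin=subprocess.PIPE)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    ffmpeg.stdin.write(frame.tobytes())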
b) Experiential Telepresence Stack: The Experiential Telepresence Stack is the source of several knowledge primitives which get fused together at the Experience Centre. As part of this stack we have implemented facial recognition, emotion recognition and synthesis, sound localization and gesture recognition in this version. The facial recognition primitive recognizes multiple faces through the robot's camera. Emotion recognition can recognize the six basic emotions[10], while emotion synthesis uses expression LEDs on the face to show emotions. Sound localization uses the six-microphone array to localize the source of a speaker, and gesture recognition uses the depth camera to interpret basic human gestures.
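As an illustration of a facial recognition primitive of this kind (the specific libraries used on PRATHAM are not named in this paper), the sketch below detects faces with an OpenCV Haar cascade and assigns identities with an LBPH recognizer from opencv-contrib; the model file name is hypothetical.

import cv2

# Illustrative face recognition sketch using OpenCV (not necessarily the library
# used on PRATHAM). A Haar cascade finds faces and an LBPH model trained offline
# assigns identities; 'lbph_model.yml' is a hypothetical file.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.read("lbph_model.yml")

def recognize_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    results = []
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5):
        label, confidence = recognizer.predict(gray[y:y + h, x:x + w])
        results.append({"box": (x, y, w, h), "label": label, "confidence": confidence})
    return results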
c) Navigation Stack: The navigation stack handles the control aspects of the humanoid robot. There are two modes of Experiential Telepresence, manual and autonomous, and both are explained in detail in the following section.

C. Experience Centre:
As mentioned earlier, a user of this system logs on to the Experience Centre in order to experience a remote location. Fig. 6 shows a user at our Experience Centre. The Experience Centre fuses the data from the various perception primitives of the robot and, currently, displays it using augmented reality[11]. It consists of specific external aids and a neat visual user interface.

Figure 6. A user at our Experience Centre

1) External Aids: To build this fully immersive experience, we found the use of just a desktop monitor and a mouse to be insufficient. In order to make users oblivious to their current surroundings and immerse them in the remote location, we used a head gear (Fig. 6) that displays the perception of the robot. The head tracking sensors on the head gear detect the user's head orientation, which is then mimicked using the robot's 6DOF head motor system.
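A minimal sketch of this head-mimicking step is given below: the tracked yaw, pitch and roll are clamped to the head tilt limits listed in Table I and published to the robot. The topic name and message type are assumptions for illustration, not PRATHAM's actual interfaces.

# Sketch of mapping head-tracker orientation onto the robot's head servos.
# The clamping ranges come from the head tilt limits in Table I; the topic name
# and message type are assumptions for illustration.
import rospy
from geometry_msgs.msg import Vector3

LIMITS_DEG = {"yaw": (-30, 30), "pitch": (-20, 40), "roll": (-35, 35)}

def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def send_head_pose(pub, yaw_deg, pitch_deg, roll_deg):
    cmd = Vector3(
        x=clamp(yaw_deg, *LIMITS_DEG["yaw"]),
        y=clamp(pitch_deg, *LIMITS_DEG["pitch"]),
        z=clamp(roll_deg, *LIMITS_DEG["roll"]),
    )
    pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("head_mimic")
    pub = rospy.Publisher("/pratham/head_orientation", Vector3, queue_size=1)
    send_head_pose(pub, yaw_deg=12.0, pitch_deg=-5.0, roll_deg=0.0)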
In manual navigation mode, a joystick (Fig. 7) is used to navigate the humanoid robot from the Experience Centre. In addition, the robot is fitted with obstacle detection sensors that provide navigation assistance by overriding the user's control in case of an emergency.
The PRATHAM system also has a feature that provides a guided tour to its user[12]. By clicking a location on the provided map, the user can have the robot navigate autonomously to that location using either the laser range-finder or the depth camera.
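One plausible way to realize this click-to-navigate behaviour on top of ROS is to send the clicked map coordinate as a goal to the standard navigation stack, as sketched below; the paper does not state which planner PRATHAM actually uses.

#!/usr/bin/env python
# One plausible realization of the click-to-navigate feature using the standard
# ROS navigation stack; PRATHAM's actual planner is not specified in this paper.
import rospy
import actionlib
from actionlib_msgs.msg import GoalStatus
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def navigate_to(x_m, y_m):
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x_m
    goal.target_pose.pose.position.y = y_m
    goal.target_pose.pose.orientation.w = 1.0  # face along the map x-axis

    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state() == GoalStatus.SUCCEEDED

if __name__ == "__main__":
    rospy.init_node("guided_tour_client")
    navigate_to(3.5, -1.0)  # coordinates clicked on the map (illustrative values)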
2) Visual User Interface: The visual user interface (Fig. 8) fuses the additional knowledge originating from the robot. It augments this new knowledge on top of the video feed from the robot's camera to help the user perceive the environment better. The lower right corner of the UI shows GPS information, which gives the current position and bearing of the robot on a map. The lower left corner holds the navigation assistance controls: as the user nears an obstacle, the navigation assistance warns the user of the direction of the obstacle so that necessary action may be taken to avoid it. The user interface also shows the temperature, wind speed and wind direction at PRATHAM's location. In addition, PRATHAM's facial recognition system augments information about the people it sees through the camera, and PRATHAM identifies buildings and structures based on the GPS location and augments information about them.
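The sketch below illustrates this fusion step on the UI side: knowledge primitives received from the robot are drawn on top of a video frame with OpenCV. The structure of the knowledge dictionary is an assumption for illustration; the actual message format exchanged with the robot is not published here.

import cv2

# Sketch of the UI-side fusion step: draw the robot's knowledge primitives on top
# of a video frame. The field names in 'knowledge' are illustrative assumptions.
def overlay_knowledge(frame, knowledge):
    annotated = frame.copy()
    # Face labels from the facial recognition primitive.
    for face in knowledge.get("faces", []):
        x, y, w, h = face["box"]
        cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(annotated, face["name"], (x, y - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    # GPS position and bearing in the lower right corner.
    gps = knowledge.get("gps")
    if gps:
        text = f"{gps['lat']:.5f}, {gps['lon']:.5f}  bearing {gps['bearing']:.0f} deg"
        cv2.putText(annotated, text, (330, 470),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
    return annotated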

Figure 7. Joystick (left), Vuzix iWear VR920 (right)

Figure 8. User Interface Design

III. RESULTS

After 8 months of development, PRATHAM was successfully demonstrated at several locations (Fig. 9).

Figure 9. PRATHAM in an outdoor environment

A. Methodology of Measurement:

Conventionally, most system benchmarks and measurements were done by subject matter experts from engineering, and they were invariably concerned with network performance and Quality of Service (system uptime, MTBF, jitter, packet loss, bit error rate, etc. were some of the key measurements). Business executives then began talking about average revenue per user and customer addition and attrition parameters, implemented through service mechanisms such as SLAs in the communication and information management systems they ran. Today, more analysis is sought from the user's perspective. The well-known saying by D. R. Scoggin, "The only way to know how customers see your business is to look at it through their eyes," prepares the ground for involving psychologists and human behaviour experts in a measure called Quality of Experience (QoE, QoX).
Our evaluation therefore has two parts. The first is the normal engineering perspective of measuring system performance and benchmarking, which serves mostly as an objective measure, but this alone does not suffice for customer satisfaction. Customer satisfaction depends largely on the user's perception, which derives from the overall experience they perceive after being exposed to the system. This is a purely subjective measure, expressed in terms of their feelings; it captures the overall value people perceive in a product or concept (a classical example is the success of Apple's iPod against similar products that existed before it entered the market, where UX and design elements provided major subjective gains over and above other system innovations) and forms the second part of the evaluation.
Quality of Experience has been defined in several ways: Wikipedia describes it as "a subjective measure of a customer's experiences with a vendor" [13], while K. Kilkki defines it as the "basic character or nature of direct personal participation or observation" [14]. Kilkki further breaks it into multiple measures from different user perspectives and relates it to Quality of Service; Fig. 10, taken from [14], defines these as components of a communication ecosystem.

Figure 10. Key components of a measurement in a communication ecosystem

B. Performance Analysis:
1) PRATHAM's Benchmark Specifications: Table I shows the benchmark specifications that have emerged after approximately 200 hours of testing.
2) Comparison Against Peers: In order to help position our system with respect to similar systems, Table II shows a comparison of PRATHAM against three popular commercial telepresence systems: QB by Anybots[3], VGo by VGo Communications[4] and Jazz by Gostai[5].


Table I
PRATHAM'S BENCHMARK SPECIFICATIONS

Battery Life
  WiFi + high motor usage: ~3 hrs
  WiFi + moderate motor usage: ~4.5 hrs
  WiFi + minimal motor usage: 6+ hrs
Braking Distance
  At speed 3.6 km/h: 38 cm
  At speed 2.0 km/h: 24 cm
Face Recognition
  Persons in dataset: 17
  Images in dataset: 170
  Recognition performance: 99.2%
  Recognition time: <1 s
Head Tilt
  Yaw: -30 deg to +30 deg
  Pitch: -20 deg to +40 deg
  Roll: -35 deg to +35 deg
Emotion Recognition: recognition accuracy (for 25 persons, each face not more than 2 m from the camera, no occlusions, and within the camera's FoV)
  Smile: 100%
  Surprise: 100%
  Sadness: 87%
  Anger: 62%
  Disgust: 55%
  Fear: 53%
Gesture Recognition (for 25 persons)
  Recognition accuracy: 82%

Table II
COMPARISON OF PRATHAM AGAINST QB, VGO AND JAZZ ROBOTS

                     PRATHAM     QB                   VGo                  Jazz
OS / Web Interface   Ubuntu      MacOS with Firefox   Windows 7/Vista/XP   Any with Flash
Battery Life         4 hours     4-6 hours            6 or 12 hour option  5 hours
Top Speed            3.6 km/h    5.6 km/h             3.3 km/h             4 km/h
Height               5' 9"       6' 3"                4'                   3' 4"

The table also compares the four systems on depth cameras, sound (speakers and microphones), video camera tilt, navigation control, autonomous path planning and navigation, augmented reality fusion, emotion synthesis, affective interfaces, cognitive intelligence and context awareness; PRATHAM carries one speaker and a six-microphone array and is controlled with a keyboard or joystick.

C. Qualitative Analysis of User Perception:

A qualitative survey was performed during a workshop and presentation of the Experiential Telepresence System. A total of 40 people participated in the study, rating their experience of using the Experiential Telepresence System in indoor as well as outdoor environments. Table III summarizes the study participants.

Table III
SURVEY SAMPLE DISTRIBUTION

Groups: college students, lecturers, professors, developers, field professionals and business executives (40 participants in total)

The questions were answered on a seven-point Likert scale (from 1 to 7, where a low score denotes a lower level of engagement, quality or whatever the measure is) and were analyzed using Analysis of Variance (ANOVA)[15], which highlights statistically significant differences in the means between samples from the groups (Table IV). One outcome of this analysis is the p value, which tells us the likelihood that the observed difference in group means would arise by chance: if p is close to 1, the difference would very likely show up at random, while p < 0.05 indicates less than a 5% likelihood that the difference is due to chance alone.
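For readers unfamiliar with the procedure, the following worked example runs a one-way ANOVA with SciPy on made-up Likert scores for three illustrative groups; the numbers are not our survey data.

from scipy.stats import f_oneway

# Worked example of the one-way ANOVA used on the Likert responses. The scores
# below are made-up illustrations, not the actual survey data behind Table IV.
students   = [6, 7, 5, 6, 7]
lecturers  = [5, 6, 6, 5, 7]
developers = [4, 5, 6, 5, 5]

f_stat, p_value = f_oneway(students, lecturers, developers)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A p-value below 0.05 suggests the group means differ beyond what chance alone
# would readily explain; a p-value near 1 suggests no meaningful difference.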

Table IV
SURVEY RESULTS

Question                                               Overall Mean   Std. Deviation   p
Video resolution and audio quality                     6.14           1.84             0.042
Conversation as compared to conversing with a person   4.43           1.67             0.083
Attention paid to driving task                         4.87           1.12             0.062
Surrounding awareness                                  4.20           0.82             0.073
Distraction                                            3.92           1.28             0.038
Control sensitivity                                    5.62           1.44             0.044
Response latency                                       2.48           1.18             0.052
Time taken to adjust to controls                       2.12           0.43             0.036
Clarity of information on UI                           5.73           1.76             0.024
Overall engagement level                               5.88           1.66             0.028
Overall experience                                     6.28           1.58             0.008

Users were also asked to rate the Experiential Telepresence System along with existing telepresence systems on a 10-point scale (a higher score indicates a better quality of experience). This data was taken only from users who currently use, or have used, such systems in the past (Table V).

Table V
USER RATINGS OF TELEPRESENCE SYSTEMS ON A 10-POINT SCALE

Experiential Telepresence: 8.5; video conferencing, telepresence robots and Cisco's TelePresence all received lower ratings.


The above data indicates that user perception of the experience and acceptance of an Experiential Telepresence System is far higher than for conventional telepresence systems. A few users felt that greater focus was required on the driving task initially, but they were able to adjust to the controls in a short amount of time. The UI provided information on the obstacles present in the scene, and the navigation assistance meant for obstacle avoidance proved to be of great help when steering in indoor environments. Users were happy with the overall experience and found the activity quite engaging.
Please note that, since the number of people surveyed is small, the results are only indicative and do not necessarily prove that our system is better than the other systems mentioned. As we receive feedback from more users, the statistics will become more accurate.
IV. CONCLUSION
In this paper, we have described the overall architecture and implementation of our Experiential Telepresence System. Our research was aimed at improving the user's experience of a remote location through the addition of context-aware visual data over a standard telepresence system.
Our immediate focus now is on the implementation of variable bit rate video transmission. The current system uses H.264 compression at a constant bitrate (250 kbps). Seamless streaming of video requires high available bandwidth and low traffic on the network; under scenarios where network quality has been poor, frame loss and temporary freezing of the video feed have been observed. Such issues are not desirable in a good telepresence system. They can be mitigated by changing the compression from constant to variable bitrate, which alters the compression ratio of the video stream subject to a maximum bit rate limited by the available bandwidth, so that the video plays smoothly over the network without frame loss or freezes.
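A minimal sketch of this planned adaptation is shown below: the encoder's target bitrate is chosen from the currently estimated available bandwidth, with some headroom, instead of the fixed 250 kbps. The thresholds are illustrative assumptions, not a finalized design.

# Sketch of the planned constant-to-variable bitrate change: pick the encoder's
# target bitrate from the bandwidth currently estimated on the link, instead of
# always using the fixed 250 kbps. The bounds below are illustrative assumptions.
def target_bitrate_kbps(available_kbps,
                        floor_kbps=100.0,
                        ceiling_kbps=1000.0,
                        headroom=0.8):
    """Leave some headroom so bursts of traffic do not cause frame loss."""
    return max(floor_kbps, min(ceiling_kbps, headroom * available_kbps))

for bw in (150, 300, 600, 2000):
    print(bw, "kbps available ->", target_bitrate_kbps(bw), "kbps target")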
In our current implementation of the Experiential Telepresence System, we have only touched upon the auditory
and visual senses of a user. In order to immerse the user
further, we need to tap into other senses like olfactory, touch
and gustatory as well. Thus, concepts like haptics[16] and mixed reality are among the planned additions to our Experiential Telepresence Stack. This shall allow the user to feel the real
physical environment at the remote location through touch and
at the same time interact with virtual elements perceived by
the robot.
Experiential Telepresence Systems have a wide variety of
application areas. At India Innovation Labs, our primary
study is in the area of digital tourism. The concept allows
tourists to remotely experience a tourism site through our
experiential system. In addition to digital tourism, Experiential
Telepresence may be used for distance education, hospitality
at large office campuses[17] and at retail outlets.
We thus believe that our architecture will serve as a platform for the next generation of telepresence systems and will continue to improve the user's experience.

ACKNOWLEDGMENT
We would like to thank the Board of Trustees of India Innovation Labs for their support, especially Mr. NAPS Rao and Prof. Prahladacharya. We would like to acknowledge our core team, including Mr. Viswanath Buravalla, I. Vijay Kumar, V. R. Venkatesh and B. D. Vijaya, for their unfailing encouragement. A special acknowledgement also goes to our well-wishers at Wipro Technologies, especially Mr. Anant C. D. and Dr. Anurag Srivastava, CTO. Finally, we would like to acknowledge our colleagues Ms. Aarushi Khanna, Mr. Maruthi R. and the students of R.V. College of Engineering and the National Institute of Technology, Karnataka, who have been associated with the development of PRATHAM over the course of the last year.
REFERENCES
[1] A. Chapanis, "Interactive human communication," Scientific American, vol. 232, no. 2, March 1975.
[2] H. S. Lichtman, "A brief history of telepresence," February 2007. [Online]. Available: http://www.telepresenceoptions.com/
[3] Anybots, "Introducing anybots, qb, telepresence robot!!" [Online]. Available: https://www.anybots.com/
[4] VGo Communications, "Introducing vgo secure, simple, affordable," 2010. [Online]. Available: http://www.vgocom.com/
[5] Gostai, "Robotic telepresence." [Online]. Available: http://www.gostai.com/
[6] "The buzz: Ramachandra budihal augments reality," November 2009. [Online]. Available: http://blog.ted.com/2009/11/06/the_buzz_ramach
[7] "The truth about robotic's uncanny valley - human-like robots and the uncanny valley," Popular Mechanics, January 2010. [Online]. Available: http://www.popularmechanics.com/technology/engineering/robots/4343054
[8] A. P. Saygin, T. Chaminade, and H. Ishiguro, "The perception of humans and robots: Uncanny hills in parietal cortex," CogSci, 2010.
[9] M. Mori, "Bukimi no tani / the uncanny valley," in Energy, 1970, pp. 33-35.
[10] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," in Proceedings of the 6th International Conference on Multimodal Interfaces, ser. ICMI '04. New York, NY, USA: ACM, 2004, pp. 205-211. [Online]. Available: http://doi.acm.org/10.1145/1027933.1027968
[11] R. T. Azuma, "The challenge of making augmented reality work outdoors," in Mixed Reality: Merging Real and Virtual. Springer-Verlag, 1999, pp. 379-390.
[12] K. M. Tsui, M. Desai, H. A. Yanco, and C. Uhlik, "Telepresence robots roam the halls of my office building," HRI Workshop, 2011.
[13] "Quality of experience." [Online]. Available: http://en.wikipedia.org/wiki/Quality_of_experience
[14] K. Kilkki, "Quality of experience in communication systems," Journal of Universal Computer Science, vol. 14, pp. 615-624, 2008.
[15] D. J. Weiss, Analysis of Variance and Functional Measurement. Oxford University Press, October 2005.
[16] A. Ansar, D. Rodrigues, J. P. Desai, K. Daniilidis, V. Kumar, and M. F. M. Campos, "Visual and haptic collaborative telepresence," Computers & Graphics, vol. 25, no. 5, pp. 789-798, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0097849301001212
[17] K. M. Tsui, M. Desai, H. A. Yanco, and C. Uhlik, "Exploring use cases for telepresence robots," in Proceedings of the 6th International Conference on Human-Robot Interaction, ser. HRI '11. New York, NY, USA: ACM, 2011, pp. 11-18. [Online]. Available: http://doi.acm.org/10.1145/1957656.1957664
