
PROCEEDINGS OF NCCN-10, 12-13 MARCH 2010

Multimodal Human Computer Interaction


Seema Rani (1), Rajeev Kumar Ranjan (2)
Department of ECE, Sant Longowal Institute of Engineering and Technology, Longowal, India
(1) seemagoyal.2009@gmail.com
(2) rkranjan2k@yahoo.co.in

Abstract: While human-to-human communication takes advantage of an abundance of information and cues, human-computer interaction is limited to only a few input modalities (usually just keyboard and mouse). In this paper, we present an overview of approaches to overcoming some of these human-computer communication barriers. Multimodal interfaces include not only typing, but also speech, lip reading, eye tracking, face recognition and tracking, and gesture and handwriting recognition.

Keywords: Multiple modalities, speech recognition, gesture recognition, lip reading.

I. INTRODUCTION

The process of using multiple modes of interaction for communication between user and computer is called multimodal interaction. Speech and mouse, speech and pen input, mouse and pen input, and mouse and keyboard are examples of multimodal interaction. Today's focus is on the combination of speech and gesture recognition as a form of multimodal interaction. There are many reasons to study multimodal interaction: it enhances mobility, speed, and usability, provides greater expressive power, and increases flexibility. In human-human communication, interpreting the mix of audio-visual signals is essential to understanding. Researchers in many fields recognize this, and with advances in the development of unimodal techniques and in hardware technologies there has been significant growth in multimodal human-computer interaction (MMHCI) research.

We place input modalities in two major groups: those based on human senses (vision, audio, and touch) and others (mouse, keyboard, etc.). The visual modality includes any form of interaction that can be interpreted visually, and the audio modality any form that is audible. Multimodal techniques can be used to construct a variety of interfaces. Of particular interest for our goals are perceptual and attentive interfaces. Perceptual interfaces are highly interactive, multimodal interfaces that enable rich, natural, and efficient interaction with computers. Attentive interfaces, on the other hand, are context-aware interfaces that rely on a person's attention as the primary input; their goal is to use the gathered information to estimate the best time and approach for communicating with the user.

II. DIFFERENT INPUT MODALITIES

A. Gesture Recognition

Humans use a very wide variety of gestures, ranging from simple actions of using the hand to point at objects to more complex actions that express feelings and allow communication with others. Gestures should therefore play an essential role in MMHCI. A major motivation for these research efforts is the potential of using hand gestures in various applications aiming at natural interaction between the human and the computer-controlled interface. There are several important issues that should be considered when designing a gesture recognition system. The first phase of a recognition task is choosing a mathematical model that considers both the spatial and the temporal characteristics of the hand and of hand gestures. The approach used for modeling plays a crucial role in the nature and performance of gesture interpretation. Once the model is chosen, an analysis stage is required for computing the model parameters from the features that are extracted from single or multiple input streams. These parameters represent some description of the hand pose or trajectory and depend on the modeling approach used. After the parameters are computed, the gestures they represent need to be classified and interpreted based on the accepted model and on some grammar rules that reflect the internal syntax of gestural commands. The grammar may also encode the interaction of gestures with other communication modes such as speech, gaze, or facial expressions.

A number of systems have been designed to use gestural input devices to control computer memory and display. These systems perceive gestures through a variety of methods and devices. While all of the systems presented identify gestures, only some transform gestures into appropriate system-specific commands. The representative architecture for these systems is shown in Figure 1. A basic gesture input device is the word-processing tablet.

Fig. 1: Architecture of gesture recognition
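The modeling, analysis, and classification stages outlined above can be illustrated with a short sketch. The Python fragment below is a minimal, hypothetical pipeline, not the implementation of any system surveyed here: it normalizes and resamples a hand trajectory (the analysis stage), matches it against stored spatio-temporal templates (a deliberately simple stand-in for the modeling stage), and maps the recognized gesture to a command through a small grammar table. All names, templates, and the grammar entries are illustrative assumptions.

```python
# Minimal, hypothetical sketch of the three stages described above:
# (1) a spatio-temporal model of the hand trajectory (stored templates),
# (2) an analysis stage computing model parameters from the input stream,
# (3) classification plus a small "grammar" mapping gestures to commands.
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class GestureModel:
    name: str
    template: List[Point]        # normalized reference trajectory

def extract_parameters(samples: List[Point], n: int = 16) -> List[Point]:
    """Analysis stage: normalize the trajectory into a unit box and
    resample it to a fixed number of points."""
    xs, ys = zip(*samples)
    w = (max(xs) - min(xs)) or 1.0
    h = (max(ys) - min(ys)) or 1.0
    norm = [((x - min(xs)) / w, (y - min(ys)) / h) for x, y in samples]
    step = (len(norm) - 1) / (n - 1)
    return [norm[round(i * step)] for i in range(n)]

def classify(params: List[Point], models: List[GestureModel]) -> str:
    """Classification stage: nearest template by point-wise distance."""
    def dist(a: List[Point], b: List[Point]) -> float:
        return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b))
    return min(models, key=lambda m: dist(params, m.template)).name

# Toy "grammar" turning recognized gestures into system-specific commands.
GRAMMAR = {"circle": "select-region", "stroke": "delete-region"}

if __name__ == "__main__":
    circle = GestureModel("circle", extract_parameters(
        [(0.5 + 0.5 * math.cos(t / 8 * 2 * math.pi),
          0.5 + 0.5 * math.sin(t / 8 * 2 * math.pi)) for t in range(9)]))
    stroke = GestureModel("stroke", extract_parameters(
        [(t / 8.0, t / 8.0) for t in range(9)]))
    observed = [(0.1 * t, 0.1 * t + 0.02) for t in range(9)]   # noisy diagonal stroke
    name = classify(extract_parameters(observed), [circle, stroke])
    print(name, "->", GRAMMAR[name])
```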

B. Speech Recognition

Among human communication modalities, speech and language undoubtedly carry a significant part of the information exchanged in human communication. At Carnegie Mellon, several approaches toward robust, high-performance speech recognition are under way. They include Hidden Markov Models (HMMs) and several hybrid connectionist and statistical techniques.

Fig. 2: Gesture recognition using HMM
Fig. 3: Hidden Markov Models (HMM)

1) Word Spotting

Word spotting systems for continuous, speaker-independent speech recognition are becoming more and more popular because of the many advantages they afford over more conventional large-scale speech recognition systems. Because of their small vocabulary and size, they offer a practical and efficient solution for many speech recognition problems that depend on the accurate recognition of a few important keywords.

Fig. 4: Word spotting (HMM based)
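To make the HMM-based word spotting idea concrete, the following self-contained Python sketch decodes a toy sequence of discrete acoustic symbols with a composite model consisting of a filler state and a two-state keyword model; the keyword is reported when its states appear in order on the Viterbi path. All probabilities, state names, and the symbol alphabet are illustrative assumptions, not parameters from any system described here.

```python
# Hedged sketch of a tiny discrete-HMM word spotter: a "filler" state
# absorbs non-keyword speech, two states model a two-phone keyword, and
# Viterbi decoding yields the most likely state path.
import math

STATES = ["filler", "kw_a", "kw_b"]          # filler + 2-state keyword model
SYMBOLS = {"sil": 0, "a": 1, "b": 2}         # toy acoustic symbols

LOG_TRANS = [[math.log(p) for p in row] for row in
             [[0.80, 0.20, 0.00001],         # filler -> filler | enter keyword
              [0.10, 0.60, 0.30],            # kw_a
              [0.70, 0.00001, 0.2999]]]      # kw_b -> back to filler
LOG_EMIT = [[math.log(p) for p in row] for row in
            [[0.80, 0.10, 0.10],             # filler emits mostly silence
             [0.05, 0.90, 0.05],             # kw_a emits "a"
             [0.05, 0.05, 0.90]]]            # kw_b emits "b"

def viterbi(obs):
    """Standard Viterbi over the composite keyword/filler model
    (uniform start probabilities are assumed and omitted)."""
    n = len(STATES)
    delta = [LOG_EMIT[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + LOG_TRANS[r][s])
            delta.append(prev[best] + LOG_TRANS[best][s] + LOG_EMIT[s][o])
            ptr.append(best)
        back.append(ptr)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return [STATES[s] for s in reversed(path)]

if __name__ == "__main__":
    frames = [SYMBOLS[x] for x in ["sil", "sil", "a", "a", "b", "sil"]]
    path = viterbi(frames)
    spotted = any(a == "kw_a" and b == "kw_b" for a, b in zip(path, path[1:]))
    print(path, "keyword spotted:", spotted)
```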

C. Lip Reading

Most approaches to automated speech perception are very sensitive to background noise or fail entirely when more than one speaker talks simultaneously, as often happens in offices, conference rooms, and other real-world environments. Humans deal with these distortions by considering additional sources such as context information and visual cues, in particular lip movements. This latter source is involved in the recognition process and is even more important for hearing-impaired people, but it also contributes significantly to recognition by normal-hearing listeners. In order to exploit lip reading as a source of information complementary to speech, a lip-reading system was developed based on the MS-TDNN and tested on a letter-spelling task for the German alphabet. The recognition performance is understandably poor (31% using lip reading only) because some phonemes cannot be distinguished using purely visual information; however, the thrust of this work is to show how a state-of-the-art speech recognition system can be significantly improved by considering additional visual information in the recognition process. This section presents only the lip-reading component; its combination with speech recognition is discussed below.
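To give a flavor of the time-delay idea behind an MS-TDNN-style lip reader, the sketch below applies a shared-weight transformation to a sliding window of consecutive lip-feature frames and produces per-frame letter posteriors. The layer sizes, window length, feature dimensions, and random weights are illustrative assumptions; the actual MS-TDNN is a trained, multi-state network, not this simplified forward pass.

```python
# Hedged sketch of the time-delay idea: a 1-D "convolution" over a short
# window of lip-feature frames followed by a softmax over letter classes.
import numpy as np

rng = np.random.default_rng(0)

def tdnn_forward(frames: np.ndarray, w_hid: np.ndarray, w_out: np.ndarray,
                 delay: int = 3) -> np.ndarray:
    """frames: (T, F) lip features; returns (T - delay + 1, C) posteriors."""
    T, F = frames.shape
    hidden = []
    for t in range(T - delay + 1):
        window = frames[t:t + delay].reshape(-1)    # time-delay window
        hidden.append(np.tanh(w_hid @ window))      # weights shared over time
    h = np.stack(hidden)                            # (T', H)
    logits = h @ w_out                              # (T', C)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # per-frame letter posteriors

if __name__ == "__main__":
    T, F, H, C, delay = 20, 8, 16, 26, 3            # 26 letters of the alphabet
    frames = rng.standard_normal((T, F))            # stand-in for lip-contour features
    w_hid = rng.standard_normal((H, F * delay)) * 0.1
    w_out = rng.standard_normal((H, C)) * 0.1
    posteriors = tdnn_forward(frames, w_hid, w_out, delay)
    print(posteriors.shape, posteriors[0].argmax())  # most likely letter per frame
```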


D. Combination of Speech and Lip Movement

Early fusion applies to combinations such as speech + lip movement. It is difficult because:
- multimodal training data are needed,
- the data streams must be closely synchronized (see the sketch below), and
- computational and training costs are high.
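As a rough illustration of the synchronization requirement, the following Python sketch aligns a faster acoustic feature stream with a slower visual (lip) feature stream and concatenates them into joint observation vectors before recognition. The frame rates, feature dimensions, and the function name fuse_early are illustrative assumptions rather than details from the paper.

```python
# Hedged sketch of "early fusion" for speech + lip movement: both feature
# streams are aligned to a common frame rate and concatenated into one
# observation vector per frame.
from bisect import bisect_right
from typing import List

def fuse_early(audio: List[List[float]], audio_rate: float,
               video: List[List[float]], video_rate: float) -> List[List[float]]:
    """Concatenate each audio frame with the most recent video frame.

    This is the synchronization step that makes early fusion costly: both
    streams must carry (implicit) timestamps and be resampled onto the
    faster clock."""
    video_times = [i / video_rate for i in range(len(video))]
    fused = []
    for i, a in enumerate(audio):
        t = i / audio_rate
        # index of the last video frame captured at or before time t
        j = max(bisect_right(video_times, t) - 1, 0)
        fused.append(a + video[j])
    return fused

if __name__ == "__main__":
    audio = [[0.1 * i, 0.2] for i in range(10)]   # 10 frames at 100 Hz (toy MFCC-like)
    video = [[1.0 * i] for i in range(3)]         # 3 frames at 25 Hz (toy lip features)
    for frame in fuse_early(audio, 100.0, video, 25.0):
        print(frame)
```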

E. Combination of Gesture and Speech

We base the interpretation of multimodal inputs on frames. A frame consists of slots representing parts of an interpretation. In our case, there are three slots named action, source-scope, and destination-scope (the destination is used only for the move command). Within each scope slot are subslots named type and unit. The possible scope types are: point (specified by coordinates), box (specified by the coordinates of opposite corners), and selection (i.e., the currently highlighted text). The unit subslot specifies the unit of text to be operated on, e.g., character or word. Consider an example in which a user draws a circle and says "Please delete this word". The gesture-processing subsystem recognizes the circle and fills in the coordinates of the box scope specified by the circle in the gesture frame. The word spotter produces "delete word", which causes the parser to fill the action slot with delete and the unit subslot of source-scope with word. The frame merger then produces a unified frame in which action=delete and source-scope has unit=word and type=box, with coordinates as specified by the drawn circle. From this, the command interpreter constructs an editing command to delete the word circled by the user.

Figure 2: Multi-modal interpretation
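The frame representation and merging step can be made concrete with a small sketch. The Python fragment below mirrors the slots described above (action, source-scope, and destination-scope with type and unit subslots) and merges a partial gesture frame with a partial speech frame; the class and function names and the example coordinates are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of frame-based multimodal interpretation: two partial
# frames (from gesture and speech) are merged slot by slot.
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Scope:
    type: Optional[str] = None        # "point" | "box" | "selection"
    unit: Optional[str] = None        # e.g. "character" | "word"
    coords: Optional[Tuple[int, int, int, int]] = None

@dataclass
class Frame:
    action: Optional[str] = None      # e.g. "delete" | "move"
    source_scope: Scope = field(default_factory=Scope)
    destination_scope: Scope = field(default_factory=Scope)

def merge(a: Frame, b: Frame) -> Frame:
    """Frame merger: fill every empty slot of one frame from the other."""
    def merge_scope(x: Scope, y: Scope) -> Scope:
        return Scope(type=x.type or y.type,
                     unit=x.unit or y.unit,
                     coords=x.coords or y.coords)
    return Frame(action=a.action or b.action,
                 source_scope=merge_scope(a.source_scope, b.source_scope),
                 destination_scope=merge_scope(a.destination_scope, b.destination_scope))

if __name__ == "__main__":
    # Gesture subsystem: the drawn circle yields a box scope with coordinates.
    gesture = Frame(source_scope=Scope(type="box", coords=(40, 10, 120, 28)))
    # Word spotter / parser: "delete word" yields the action and the unit.
    speech = Frame(action="delete", source_scope=Scope(unit="word"))
    unified = merge(gesture, speech)
    print(unified)   # action=delete, source-scope: type=box, unit=word, coords=...
```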

III. CONCLUSION

We have highlighted major approaches to multimodal human-computer interaction and discussed techniques for gesture recognition, speech recognition, and lip reading. The information presented via several modalities is merged and refers to various aspects of the same process. Combining modalities can:
- improve recognition performance significantly by exploiting redundancy,
- provide greater expressiveness and flexibility by exploiting complementary information in the different modalities, and
- improve understanding by allowing complementary modalities to take effect.

IV. REFERENCES

[1] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," CVIU, 73(3):428-440, 1999.
[2] "Application of Affective Computing in Human-Computer Interaction," Int. J. of Human-Computer Studies, 59(1-2), 2003.
[3] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Trans. on PAMI, 24(8):1091-1104, 2002.
[4] A. F. Bobick and J. Davis, "The recognition of human movement using temporal templates," IEEE Trans. on PAMI, 23(3):257-267, 2001.
[5] I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard, "Design of a Neural Network Character Recognizer for a Touch Terminal," Pattern Recognition, 1990.
[6] P. Haffner, M. Franzini, and A. Waibel, "Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition," in Proc. ICASSP '91.
[7] R. Baecker et al., "A Historical and Intellectual Perspective," in Readings in Human-Computer Interaction: Toward the Year 2000, 2nd ed., R. Baecker et al., Eds. San Francisco: Morgan Kaufmann, 1995, pp. 35-47.

