
CHAPTER 1 INTRODUCTION

A general problem in computer vision is building robust systems, where higher-level modules can tolerate inaccuracies and ambiguous results from lower-level modules. This problem arises naturally in the context of gesture recognition: existing recognition methods typically require as input (presumably obtained from faultless lower-level modules) either the location of the gesturing hands or the start and end frame of each gesture. Requiring such input to be available is often unrealistic, thus making it difficult to deploy gesture recognition systems in many real-world scenarios. In natural settings, hand locations can be ambiguous in the presence of clutter, background motion, and the presence of other people or skin-colored objects. Furthermore, gestures typically appear within a continuous stream of motion, and automatically detecting where a gesture starts and ends is a challenging problem. To recognize manual gestures in video, an end-to-end computer vision system must perform both spatial and temporal gesture segmentation. Spatial gesture segmentation is the problem of determining where the gesture occurs, i.e., where the gesturing hand(s) are located in each video frame. Temporal gesture segmentation is the problem of determining when the gesture starts and ends. This paper proposes a unified framework that accomplishes all three tasks of spatial segmentation, temporal segmentation, and gesture recognition, by integrating information from lower-level vision modules and higher-level models of individual gesture classes and the relations between such classes. Instead of assuming unambiguous and correct hand detection at each frame, the proposed algorithm makes the much milder assumption that a set of several candidate hand locations has been identified at each frame. Temporal segmentation is completely handled by the algorithm, requiring no additional input from lower-level modules. In the proposed framework, information flows both bottom-up and top-down. In the bottom-up direction, multiple candidate hand locations are detected and their features are fed into the higher-level video-to-model matching algorithm. In the top-down
direction, information from the model is used in the matching algorithm to select, among the exponentially many possible sequences of hand locations, a single optimal sequence. This sequence specifies the hand location at each frame, thus completing the low-level task of hand detection. Within the context of the proposed framework, this paper makes two additional contributions, by addressing two important issues that arise from not requiring spatiotemporal segmentation to be given as input:

Matching feature vectors to gesture models using dynamic programming requires more computational time when we allow multiple candidate hand locations at each frame. To improve efficiency, we introduce a discriminative method that speeds up dynamic programming by eliminating a large number of hypotheses from consideration.

Not knowing when a gesture starts and ends makes it necessary to deal with the subgesture problem, i.e., the fact that some gestures may be very similar to subgestures of other gestures. In our approach, subgesture relations are automatically learned from training data, and then those relations are used for reasoning about how to choose among competing gesture models that match well with the current video data.

The proposed framework is demonstrated in two gesture recognition systems. The first is a real-time system for continuous digit recognition, where gestures are performed by users wearing short sleeves, and (at least in one of the test sets) one to three people are continuously moving in the background. The second demonstration system enables users to find gestures of interest within a large video sequence or a database of such sequences. This tool is used to identify occurrences of American Sign Language (ASL) signs in video of sign language narratives that have been signed naturally by native ASL signers.


CHAPTER 2 LITERATURE SURVEY

The key differentiating feature of the proposed method from existing work in gesture recognition is that our method requires neither spatial nor temporal segmentation to be performed as preprocessing. In many dynamic gesture recognition systems, lower-level modules perform spatial and temporal segmentation and extract shape and motion features. Those features are passed into the recognition module, which classifies the gesture. In such bottom-up methods, recognition will fail when the results of spatial or temporal segmentation are incorrect. Despite advances in hand detection and hand tracking, hand detection remains a challenging task in many real-life settings. Problems can be caused by a variety of factors, such as changing illumination, low-quality video and motion blur, low resolution, temporary occlusion, and background clutter. Commonly used visual cues for hand detection, such as skin color, edges, motion, and background subtraction, may also fail to unambiguously locate the hands when the face or other hand-like objects are moving in the background. Instead of requiring perfect hand detection, we make the milder assumption that existing hand detection methods can produce, for every frame of the input sequence, a relatively short list of candidate hand locations, and that one of those locations corresponds to the gesturing hand. The key advantages of the method proposed here over existing methods are that it 1) does not require temporal segmentation as preprocessing and 2) achieves significant speed-ups by eliminating from consideration large numbers of implausible hypotheses. An alternative to detecting hands before recognizing gestures is to use global image features. For example, Bobick et al. propose using motion energy images. Dreuw et al. use a diverse set of global features, such as thresholded intensity images, difference images, motion history, and skin color images. Gorelick et al. use 3D shapes extracted by identifying areas of
motion in each video frame. Nayak et al. propose a histogram of pairwise distances of edge pixels. A key limitation of those approaches is that they are not formulated to tolerate the presence of distractors in the background, such as humans moving around. In scenes where such distractors have a large presence, the content of the global features can be dominated by the distractors. In contrast, our method is formulated to tolerate such distractors, by efficiently identifying, among the exponentially many possible sequences of hand locations, the sequence that best matches the gesture model. Ke et al. propose modeling actions as rigid 3D patterns and extracting action features using 3D extensions of the well-known 2D rectangle filters. Two key differences between our work and that of Ke et al. are that, in their approach, 1) actions are treated as rigid 3D patterns, and 2) distractors present in the 3D hyper-rectangle of the action can severely affect the features extracted from that action (their method was not demonstrated in the presence of such distractors). Our method treats gestures as nonrigid along the time axis (thus naturally tolerating intra-gesture variations in gesturing speed) and, as already mentioned, is demonstrated to work well in the presence of distractors. Several methods have been proposed in the literature for gesture spotting, i.e., for recognizing gestures in continuous streams in the absence of known temporal segmentation. One key limitation of all those methods is that they assume that the gesturing hand(s) have been unambiguously localized in each video frame. Taking a closer look at existing gesture spotting methods, such methods can be grouped into two general approaches: the direct approach, where temporal segmentation precedes recognition of the gesture class, and the indirect approach, where temporal segmentation is intertwined with recognition. Methods that belong to the direct approach first compute low-level motion parameters such as velocity, acceleration, and trajectory curvature (Kang et al.) or mid-level motion parameters such as human body activity, and then look for abrupt changes (e.g., zero-crossings) in those parameters to identify candidate gesture boundaries. A major limitation of such methods is that they require each gesture to be preceded and followed by non-gesturing intervals, a requirement that is not satisfied in continuous gesturing.

CHAPTER 3 EXISTING SYSTEM

Existing recognition methods typically require as input (presumably obtained from faultless lower-level modules) either the location of the gesturing hands or the start and end frame of each gesture. Requiring such input to be available is often unrealistic, thus making it difficult to deploy gesture recognition systems in many real-world scenarios. In natural settings, hand locations can be ambiguous in the presence of clutter, background motion, and the presence of other people or skin-colored objects. Furthermore, gestures typically appear within a continuous stream of motion, and automatically detecting where a gesture starts and ends is a challenging problem. In many dynamic gesture recognition systems, lower-level modules perform spatial and temporal segmentation and extract shape and motion features. Those features are passed into the recognition module, which classifies the gesture. In such bottom-up methods, recognition will fail when the results of spatial or temporal segmentation are incorrect. Despite advances in hand detection and hand tracking, hand detection remains a challenging task in many real-life settings. Problems can be caused by a variety of factors, such as changing illumination, low-quality video and motion blur, low resolution, temporary occlusion, and background clutter. Commonly used visual cues for hand detection, such as skin color, edges, motion, and background subtraction, may also fail to unambiguously locate the hands when the face or other hand-like objects are moving in the background. Existing hand detection methods can, however, produce for every frame of the input sequence a relatively short list of candidate hand locations, such that one of those locations corresponds to the gesturing hand.


CHAPTER 4 SYSTEM OVERVIEW

Throughout the paper, we use double-struck symbols for sequences and models (e.g., M, Q, V, W) and calligraphic symbols for sets (e.g., A, S, T). Let M^1, . . . , M^G be models for G gesture classes. Each model M^g = (M^g_0, . . . , M^g_m) is a sequence of states. Intuitively, state M^g_i describes the i-th temporal segment of the gesture modeled by M^g. Let V = (V_1, V_2, . . .) be a sequence of video frames showing a human subject. We denote by V_{s:n} the frame sequence (V_s, . . . , V_n). The human subject occasionally (but not necessarily all the time) performs gestures that the system should recognize. Our goal is to design a method that recognizes gestures as they happen. More formally, when the system observes frame V_n, it needs to decide whether a gesture has just been performed and, if so, recognize that gesture. We use the term gesture spotting for the task of recognizing gestures in the absence of known temporal segmentation. Our method assumes neither that we know the location of the gesturing hands in each frame, nor that we know the start and end frame of every gesture. With respect to hand location, we make a much milder assumption: that a hand detection module has extracted from each frame V_j a set Q_j of K candidate hand locations. We use the notation Q_{j,k} for the feature vector extracted from the k-th candidate hand location of frame V_j. The online gesture recognition algorithm consists of three major components: hand detection and feature extraction, spatiotemporal matching, and inter-model competition. The offline learning phase, marked by the dashed arrows in Fig. 1, consists of learning the gesture models, the pruning classifiers, and the subgesture relations between gestures.
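To make the per-frame data flow of Fig. 1 concrete, the following is a minimal sketch (Python) of how the three online components could be chained. The callables passed in (detect_candidates, update_matching, spotting_decision) are illustrative placeholders for the modules described in the rest of this chapter, not the actual API of the system.

def recognize_online(frames, models, detect_candidates, update_matching,
                     spotting_decision, K=5):
    """Sketch of online gesture spotting over a continuous frame stream."""
    dp_tables = {g: None for g in models}            # one DP table per gesture model
    for n, frame in enumerate(frames):
        # Bottom-up: K candidate hand locations, one feature vector each.
        candidates = detect_candidates(frame, K)
        # Spatiotemporal matching: update every model's table with this frame.
        for g in models:
            dp_tables[g] = update_matching(dp_tables[g], models[g], candidates)
        costs = {g: dp_tables[g][-1].min() for g in models}   # cost of ending now
        # Top-down: inter-model competition decides whether a gesture just ended.
        detected = spotting_decision(costs, n)
        if detected is not None:
            yield n, detected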


Fig. 1. System flowchart. Dashed lines indicate flow during offline learning. Solid lines indicate flow during online recognition.


HAND DETECTION AND FEATURE EXTRACTION

The spatiotemporal matching algorithm is designed to accommodate multiple hypotheses for the hand location in each frame. Therefore, we can afford to use a relatively simple and efficient preprocessing step for hand detection that combines skin color and motion cues. The skin detector first computes for every image pixel a skin likelihood term. For the first frames of the sequence, before a face has been detected, we use a generic skin color histogram to compute the skin likelihood image. Once a face has been detected (using the detector of Rowley et al.), we use the mean and covariance of the face skin pixels in normalized rg space to compute the skin likelihood image.
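As a concrete illustration of the skin cue, the following is a minimal sketch (NumPy, assuming a BGR input frame) of computing the per-pixel skin likelihood, first from a generic histogram and then from a face-adapted Gaussian in normalized rg space. Function and argument names are illustrative assumptions, not taken from the system's code.

import numpy as np

def skin_likelihood(frame_bgr, face_rg_pixels=None, generic_hist=None):
    """Per-pixel skin likelihood in normalized rg space (sketch only)."""
    img = frame_bgr.astype(np.float64) + 1e-6
    s = img.sum(axis=2)
    r, g = img[..., 2] / s, img[..., 1] / s            # normalized rg chromaticity
    if face_rg_pixels is None:
        # Before a face is detected: look up a generic 2D (r, g) skin histogram.
        bins_r, bins_g = generic_hist.shape
        ri = np.clip((r * bins_r).astype(int), 0, bins_r - 1)
        gi = np.clip((g * bins_g).astype(int), 0, bins_g - 1)
        return generic_hist[ri, gi]
    # After face detection: Gaussian fitted to the face's skin pixels (N x 2 rg samples).
    mu = face_rg_pixels.mean(axis=0)
    cov = np.cov(face_rg_pixels.T) + 1e-6 * np.eye(2)
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    d = np.stack([r - mu[0], g - mu[1]], axis=-1)
    maha = np.einsum('...i,ij,...j->...', d, inv, d)   # per-pixel Mahalanobis distance
    return np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(det))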

The motion detector computes a motion mask by thresholding the result of frame differencing. The motion mask is applied to the skin likelihood image to obtain the hand likelihood image. Using the integral image of the hand likelihood image, we efficiently compute, for every subwindow of some predetermined size, the sum of pixel likelihoods in that subwindow. We then extract the K subwindows with the highest sums, such that no selected subwindow contains the center of another selected subwindow. Each of the K subwindows is constrained to be of size 40 rows x 30 columns. Alternatively, for scale invariance, the subwindow size can be defined according to the size of the detected face.

SPATIOTEMPORAL MATCHING

In spatiotemporal matching we match feature vectors, such as the 4D position-flow vectors obtained as described in the hand detection and feature extraction section, to model states, according to observation density functions that have been learned for those states. The first step is to learn those density functions for each gesture class, using training examples for which the spatiotemporal segmentation is known. Colored gloves are worn to facilitate automatic labeling of hand locations in the training examples only; they are not used in the test sequences.
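As a minimal sketch of what such learned observation densities might look like, assume each state is modeled by a single Gaussian over the 4D feature vectors that fall in its temporal segment of the labeled training examples, and that matching costs are negative log-likelihoods. The single-Gaussian choice and the class below are illustrative assumptions.

import numpy as np

class StateDensity:
    """Gaussian observation density for one model state (sketch)."""
    def __init__(self, feats):                 # feats: N x 4 training features for this state
        self.mu = feats.mean(axis=0)
        self.cov = np.cov(feats.T) + 1e-6 * np.eye(feats.shape[1])
        self.inv = np.linalg.inv(self.cov)
        self.logdet = np.log(np.linalg.det(self.cov))

    def cost(self, q):                         # negative log-likelihood of feature vector q
        d = q - self.mu
        return 0.5 * (d @ self.inv @ d + self.logdet + len(q) * np.log(2 * np.pi))

def learn_model(segmented_examples):
    """segmented_examples: per state, the stacked training features of that segment."""
    return [StateDensity(feats) for feats in segmented_examples]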


PRUNING

A straightforward implementation of the matching algorithm described above may be too slow for real-time applications, especially with larger vocabularies. The time complexity is linear in the number of models, the number of states per model, the number of video frames, and the number of candidate hand locations per frame. We therefore propose a pruning method that eliminates unlikely hypotheses from the DP search, and thereby makes spatiotemporal matching more efficient. In the theoretically best case, the proposed pruning method leads to time complexity that is independent of the number of model states (versus complexity linear in the number of states for the algorithm without pruning). The key novelty in our pruning method is that we view pruning as a binary classification problem, where the output of a classifier determines whether to prune a particular search hypothesis or not.
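The following is a minimal sketch of pruning viewed as binary classification over DP hypotheses. Here each classifier is simply a per-state threshold on the local observation cost, calibrated from the costs observed on correctly aligned training examples; this thresholding scheme is an illustrative assumption, not necessarily the exact classifier used.

import numpy as np

def learn_pruning_thresholds(per_state_training_costs, slack=1.0):
    """per_state_training_costs: for each state, the observation costs seen
    along correct alignments in the training data."""
    return [np.max(costs) + slack for costs in per_state_training_costs]

def prune(state_index, observation_cost, thresholds):
    """Return True if this DP hypothesis should be dropped from the search."""
    return observation_cost > thresholds[state_index]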

SUBGESTURE RELATIONS AND INTER-MODEL COMPETITION

The inter-model competition stage of our method needs to decide, when multiple models match well with the input video, which of those models (if any) indeed corresponds to a gesture performed by the user. This stage requires knowledge about the subgesture relations between gestures. A gesture g1 is a subgesture of another gesture g2 if g1 is similar to a part of g2 under the similarity model of the matching algorithm. For example, as seen in Fig. 2, in the digit recognition task the digit 5 can be considered a subgesture of the digit 8. In this section we describe how to learn subgesture relations from training data, and how to design an inter-model competition algorithm that utilizes knowledge about subgestures.
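A minimal sketch of both ideas follows: subgesture relations are learned by matching each model against training examples of the other classes, and at spotting time a firing candidate is discarded if it is a subgesture of another firing candidate. The match_cost callable and the per-class thresholds are illustrative stand-ins for the matching module and its learned acceptance thresholds.

def learn_subgesture_relations(models, training_videos, match_cost, thresholds):
    """Record g1 as a subgesture of g2 if model g1 matches well with (part of)
    training examples of class g2."""
    sub = {g: set() for g in models}
    for g2, videos in training_videos.items():
        for g1 in models:
            if g1 == g2:
                continue
            if any(match_cost(models[g1], v) < thresholds[g1] for v in videos):
                sub[g2].add(g1)                 # g1 is a subgesture of g2
    return sub

def compete(firing, costs, sub):
    """firing: classes whose matching cost passed their threshold at this frame.
    A candidate is discarded if it is a subgesture of another firing candidate;
    among the survivors the lowest-cost class wins."""
    survivors = [g for g in firing
                 if not any(g in sub[h] for h in firing if h != g)]
    return min(survivors, key=lambda g: costs[g]) if survivors else None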


The hand detection and feature extraction modules used in our experiments detect multiple candidate hand regions using simple skin color and motion cues. Each candidate hand region is a rectangular image subwindow, and the feature vector extracted from that subwindow is a 4D vector containing the centroid location and the optical flow of the subwindow. We should note that the focus of this paper is not on hand detection and feature extraction; other detection methods and features can easily be substituted for the ones used here.
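Below is a minimal sketch of this candidate-extraction step: summing the hand-likelihood image over fixed-size subwindows with an integral image, greedily keeping K windows whose centers do not fall inside an already selected window, and forming the 4D centroid-plus-flow feature. The dense optical flow (H x W x 2 array) is assumed to be supplied by the caller, and the function names are illustrative.

import numpy as np

def top_k_subwindows(likelihood, K=5, h=40, w=30):
    """Return the top-left corners of the K best h x w candidate subwindows."""
    ii = np.pad(likelihood, ((1, 0), (1, 0))).cumsum(0).cumsum(1)    # integral image
    sums = ii[h:, w:] - ii[:-h, w:] - ii[h:, :-w] + ii[:-h, :-w]     # per-window sums
    picked = []
    order = np.dstack(np.unravel_index(np.argsort(sums, axis=None)[::-1], sums.shape))[0]
    for (y, x) in order:
        cy, cx = y + h // 2, x + w // 2
        # Keep the window only if its center is not inside an already picked window.
        if all(not (py <= cy < py + h and px <= cx < px + w) for (py, px) in picked):
            picked.append((y, x))
        if len(picked) == K:
            break
    return picked

def features(picked, flow, h=40, w=30):
    """4D feature per candidate: (centroid x, centroid y, mean flow x, mean flow y)."""
    out = []
    for (y, x) in picked:
        fx = flow[y:y+h, x:x+w, 0].mean()
        fy = flow[y:y+h, x:x+w, 1].mean()
        out.append(np.array([x + w / 2.0, y + h / 2.0, fx, fy]))
    return out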

Gesture models are learned by applying the Baum-Welch algorithm to training examples. Given such a model M^g and a video sequence V_{1:n}, spatiotemporal matching consists of finding the optimal alignment between M^g and V_{1:n}, under the constraint that the last state of M^g is matched with frame n (i.e., under the hypothesis that n is the last frame of an occurrence of the gesture). The optimal alignment is a sequence of triplets (i_t, j_t, k_t) specifying that model state M^g_{i_t} is matched with feature vector Q_{j_t,k_t}. This optimal alignment is computed using dynamic programming. Observing a new frame V_n does not affect any of the previously computed optimal alignments between M^g and V_{1:n'}, for n' < n; at the same time, computing the optimal alignment between M^g and V_{1:n} reuses many of the computations that were performed for the previous optimal alignments. The spatiotemporal matching module thus describes how to learn gesture models and how to match such models with the features extracted from the video sequence. Allowing multiple candidate hand locations increases the computational time of the dynamic programming algorithm. To address this, we use a pruning method that can significantly speed up model matching by rejecting a large number of hypotheses using quick-to-evaluate local criteria. Deciding whether to reject a hypothesis is a binary classification problem, and pruning classifiers are learned from training data to make such decisions. These pruning classifiers are applicable at every cell of the dynamic programming table and significantly decrease the number of cells that need to be considered during dynamic programming.
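The following sketch reconstructs one dynamic-programming update for a single model at a new frame, under the simplifying assumptions that transition costs are ignored and that each state exposes a cost(q) observation cost (as in the StateDensity sketch above). The table cur[i, k] stores the best cost of matching states 0..i with the video so far, with state i currently matched to candidate k of the current frame.

import numpy as np

def dp_update(prev, states, candidates):
    """prev: (num_states x K_prev) cost table from the previous frame, or None.
    states: objects with a cost(q) method; candidates: feature vectors."""
    K = len(candidates)
    cur = np.full((len(states), K), np.inf)
    for i, state in enumerate(states):
        for k, q in enumerate(candidates):
            if i == 0:
                # State 0 may start a new gesture occurrence at any frame.
                best_prev = 0.0 if prev is None else min(prev[0].min(), 0.0)
            elif prev is None:
                best_prev = np.inf
            else:
                # Stay in state i, or advance from state i-1 (DTW/HMM-style).
                best_prev = min(prev[i].min(), prev[i - 1].min())
            cur[i, k] = state.cost(q) + best_prev
    return cur          # cur[-1].min() is the cost of the gesture ending at this frame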
The inter-model competition consists of deciding, at the current frame V_n, whether a gesture has just been performed and, if so, determining the class of the gesture. By "just been performed" we mean that the gesture has either ended at frame V_n or ended at a recent frame V_{n'}. This is the stage where we resolve issues such as two or more models giving low matching costs at the same time, and whether what we have observed is a complete gesture or simply a subgesture of another gesture, for example, whether the user has just performed the gesture for 5 or is in the process of performing the gesture for 8. Subgesture relations are learned from training data.

In the input sequence used for the online recognition phase, the user gestures naturally in front of the camera without any aiding device; therefore, the hand detection algorithm can no longer localize the gesturing hand with certainty, but instead generates a short list of candidate hand windows in every frame. The feature extraction module extracts multiple feature vectors (one feature vector per candidate window). The resulting time series of multiple feature vectors is then matched with the learned models in both space and time. During the matching process, the learned pruning classifiers reject unlikely matching hypotheses. Finally, the matching costs of the different models are fed into the gesture spotting and recognition module, which applies a set of spotting rules and subgesture reasoning to spot and recognize the gestures performed by the user. We use the proposed method to implement a digit recognition system that can recognize the user's gestures in real time. Compared to many existing interfaces that have been successfully applied to control robots and computer games, our method can function reliably even in the presence of clutter or a moving background, without requiring the user to wear special clothes or cumbersome devices such as colored markers or gloves. In addition, we implement a new type of computer vision application: a gesture-finding tool that helps users locate gestures of interest in a large video sequence or in a database of such sequences. In the experiments we use this tool to find occurrences of American Sign Language (ASL) signs in video streams of sign language narratives signed by native ASL signers.


ADVANTAGES


- The framework accomplishes all three tasks of spatial segmentation, temporal segmentation, and gesture recognition by integrating information from lower-level vision modules with higher-level models of individual gesture classes and the relations between such classes.

- The algorithm makes the much milder assumption that a set of several candidate hand locations has been identified at each frame; temporal segmentation is handled entirely by the algorithm, requiring no additional input from lower-level modules.

- Multiple candidate hand locations are allowed at each frame.

- The subgesture problem is handled explicitly.

- Neither spatial nor temporal segmentation is required as preprocessing.

- Gestures are recognized using models trained with Baum-Welch, as opposed to using exemplars.

- Subgesture relations are learned by an algorithm, as opposed to being hard-coded.

- The pruning method is improved; the new method is theoretically shown to be significantly less prone to eliminating correct hypotheses.

APPLICATIONS

- Handwriting recognition
- Human-computer interaction
- Sign language recognition
- Digit recognition


Fig. 2. Illustration of two key problems addressed in this paper. Left: low-level hand detection can often fail to provide unambiguous results. In this example, skin color and motion were used to identify candidate hand locations. Right: an example of the need for subgesture reasoning. Digit 5 is similar to a subgesture of the digit 8. Without modeling this relation, and by forgoing the assumption that temporal segmentation is known, the system is prone to make mistakes in such cases, falsely matching the model for 5 with a subgesture of 8.


CHAPTER 5 CONCLUSION

This paper presented a novel gesture spotting algorithm that is accurate and efficient, is purely vision-based, and can robustly recognize gestures even when the user gestures without any aiding devices in front of a complex background. The proposed algorithm can recognize gestures using a fairly simple hand detection module that yields multiple candidates. The system does not break down in the presence of a cluttered background, multiple moving objects, multiple skin-colored image regions, and users wearing short-sleeved shirts. It is indicative that, in our experiments on the hard digit dataset, the proposed algorithm increases the correct detection rate tenfold, from 8.5% to 85%. The proposed algorithm simultaneously localizes the gesture in every frame of the video (spatial segmentation) and detects the start and end frames of the input gestures (temporal segmentation). In hand tracking and recognition applications, hand detection may be unreliable for a number of different reasons, including a cluttered background, multiple moving objects in the scene, hand-face occlusions, and certain clothing. The proposed framework is designed to function reliably under such circumstances by accommodating multiple feature vectors at every time instant during the matching process. An interesting topic for future work is formulating a more general inter-model competition algorithm.


BIBLIOGRAPHY

[1] J. Alon, "Spatiotemporal Gesture Segmentation," Ph.D. dissertation, Dept. of Computer Science, Boston University, Technical Report BU-CS-2006-024, 2006.

[2] A. Corradini, "Dynamic time warping for off-line recognition of a small gesture vocabulary," in Proc. IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 2001, pp. 82-89.

[3] F. Chen, C. Fu, and C. Huang, "Hand gesture recognition using a real-time tracking method and Hidden Markov Models," Image and Vision Computing, vol. 21, no. 8, pp. 745-758, August 2003.

[4] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, 2007.

[5] C. Wang, W. Gao, and S. Shan, "An approach based on phonemes to large vocabulary Chinese sign language recognition," in Proc. Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002, pp. 393-398.
