
Sign Language Recognition System

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology
In

Electronics and Communication Engineering


By

ROSHAN P. SHAJAN 11BEC0034


RAJESH THOMAS 11BEC0097
ABHIJITH MANOHAR J. 11BEC0218
Under the guidance of
Prof. Vidhyapathi C. M.
School of Electronics Engineering,
VIT University, Vellore.

May, 2015

DECLARATION

We hereby declare that the thesis entitled Sign Language Recognition System
submitted by us, for the award of the degree of Bachelor of Technology in Electronics and
Communication Engineering to VIT University, is a record of bonafide work carried out by
us under the supervision of Prof. Vidhyapathi C.M.
We further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or diploma
in this institute or any other institute or university.

Place : Vellore
Date  :

Signature of the Candidates

CERTIFICATE

This is to certify that the thesis entitled Sign Language Recognition System
submitted by Roshan P. Shajan, Rajesh Thomas, Abhijith Manohar J., School of
Electronics Engineering, VIT University, for the award of the degree of Bachelor of
Technology in Electronics and Communication Engineering, is a record of bonafide work
carried out by them under my supervision, as per the VIT code of academic and research
ethics.
The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The thesis fulfills the requirements and regulations of the University
and in my opinion meets the necessary standards for submission.

Place : Vellore
Date  :

Signature of the Guide

The thesis is satisfactory / unsatisfactory

Internal Examiner

External Examiner

Approved by

Program Chair [B.Tech ECE]


School of Electronics Engineering

ACKNOWLEDGEMENT
It is our great pleasure to express our sincere thanks to our project guide and mentor Prof.
Vidhyapathi C.M. for his guidance and continued support in the completion of this project.
His timely advice, meticulous scrutiny, scholarly counsel and technical supervision were
crucial in accomplishing this task on time.
We also owe a deep sense of gratitude to the other members of the review panel, Dr. Alex
Noel Joseph Raj and Prof. Subha Bharathi S., for their invaluable suggestions that helped to
take the project to a different level. It was their prompt inspiration, timely suggestions and
dynamism that enabled us to complete our thesis.
We also thank Dr. Ramachandra Reddy and Dr. Arulmozhivarman P., the Dean and Asst.
Dean of SENSE, VIT University, respectively, for giving us this opportunity to apply the
theoretical knowledge we have gained during the past four years and for the comments that
greatly improved this manuscript. We also place on record our sincere thanks to the
Management of VIT for their support and provision of the necessary resources and facilities
for the research.
We also take this occasion to express our thankfulness to one and all who were a part of
this endeavour, directly or indirectly. Our parents for their unceasing support and attention,
the lab assistant for his cooperation and wisdom, the members of the Kinect Translation
Tool discussion group at Codeplex.com for their advice: we are indebted to you all.
Above all, we are most obliged to the Almighty for the good health and well-being that was
showered upon us to complete this venture.

Students Name

EXECUTIVE SUMMARY
It is often difficult for hearing-impaired people to communicate with normal people because
sign language is properly understood only by a few. This project aims to help such people
communicate effectively at a public service information counter. Any computer that operates
within the FCC Class B standard and is equipped with a USB 2.0 port can be interfaced with
the Microsoft Kinect to provide a low-cost and effective process that translates sign language
into written and spoken text.
The ability of the Kinect to take the motion of the human body as a command input to the
controller is the key to this system. This type of human-machine interaction, also called a
Natural User Interface (NUI), facilitates capturing the gesture from the user by means of the
Kinect peripheral. The depth image acquired by the Kinect is converted into skeletal data and
processed by a machine learning mechanism (here, Dynamic Time Warping) to train and
record a number of movements as a sign language library. The logic is implemented in C#
(ECMA-334, ISO/IEC 23270:2006), while XAML is used to design the UI. American Sign
Language gestures are incorporated into the database and mapped to spoken language,
consisting of some vocabulary words, with the help of the Microsoft Text-to-Speech Engine.
The database is stored as a text file containing an 18×33 feature profile for each gesture
(33 frames, each with the (x, y, z) coordinates of the six joints of the upper torso).

TABLE OF CONTENTS

Acknowledgement
Executive Summary
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Symbols and Notations

1. INTRODUCTION
   1.1 Objective
   1.2 Motivation
   1.3 Background
       1.3.1 American Sign Language
       1.3.2 Phonetics
       1.3.3 Grammar (Finger Spelling)
       1.3.4 Syntax Structure
2. PROJECT DESCRIPTION AND GOALS
   2.1 The Transformation of Skeletal Coordinates
3. TECHNICAL SPECIFICATION
   3.1 Software Architecture
   3.2 Hardware and Software Requirements
       3.2.1 Hardware System Component
       3.2.2 Software System Component
   3.3 The Depth Sensor
   3.4 Framework Evaluation
   3.5 The Microsoft SDK
   3.6 The DTW Algorithm
       3.6.1 Theoretical Background
4. DESIGN APPROACH AND DETAILS
   4.1 Design Approach
       4.1.1 MainWindow.xaml.cs
       4.1.2 Skeleton3DdataCoordEventArgs.cs
       4.1.3 Skeleton3DdataExtract.cs
       4.1.4 DtwGestureRecognizer.cs
   4.2 Codes and Standards
   4.3 Constraints, Alternatives and Trade-offs
5. SCHEDULE, TASKS AND MILESTONES
6. PROJECT DEMONSTRATION
   6.1 Project Walkthrough
7. COST ANALYSIS
   7.1 Marketing Analysis
   7.2 Cost Analysis
8. SUMMARY
9. REFERENCES
APPENDIX A (Key parts of the Code)
APPENDIX B (Gesture File)
APPENDIX C (Gantt Chart)

List of Figures

Figure No.   Title
1.1          Hand shape distinctive features. Reprinted from American Sign Language The Easy Way (p. 24) by D. A. Stewart, 1998, New York: Barron's.
1.2          Sample ASL Gestures
2.1          Block Diagram of the Project
3.1          Overview of Kinect Sign Language Software Layer
3.2          Architecture of Microsoft Kinect
3.3          DTW Algorithm Least Cost Path
6.1          User Interface
6.2          Loading Gestures
6.3          Recognizing Gestures
6.4          Show Gesture Text

List of Tables

Table No.   Title
3.1         Comparison of Software Frameworks
3.2         Kinect Algorithm Evaluation

List of Abbreviations

ASL      American Sign Language
RGB      Red Green Blue
HCI      Human Computer Interaction
Auslan   Australian Sign Language
FANN     Fast Artificial Neural Network
NN       Neural Network
LSF      French Sign Language
DTW      Dynamic Time Warping
SDK      Software Development Kit
NUI      Natural User Interface
API      Application Programming Interface

Symbols and Notations

∈        Is an element of
[1:N]    An interval including all numbers between and inclusive of 1 and N
{ }      Set of

1. INTRODUCTION
Sign language is the language used among the deaf community to communicate with
one another. It uses a combination of hand gestures, movement of the arms and the facial
expression of the speaker to effectively convey the message. It is not international and varies
from country to country. Gesture recognition is the technique of interpreting human
gestures using mathematical algorithms. This enables the user to interact directly with the
computer without any intermediate devices. For recognizing the gestures we use computer
vision techniques, in which the computer produces numerical or symbolic information by
acquiring, processing and analyzing images. Microsoft Kinect can be used to implement
computer vision.
1.1. OBJECTIVE

The aim of the thesis is to obtain a detailed understanding of computer vision and gesture
recognition algorithms, and to apply them to narrow the communication gap between typical
people and the hearing-impaired community at a public service information counter. It
involves developing an effective algorithm to recognize gestures and learning how the Kinect
camera obtains the skeletal image. In the end, a database for the ASL vocabulary is trained
and maintained.

1.2. MOTIVATION

At customer service counters in airports, banks, post offices and other public areas,
hearing-impaired people often find themselves at a disadvantage in obtaining information.
Thus, it is necessary to build a system that can translate sign language gestures into spoken
and written form.

1.3. BACKGROUND

In the past, research on gestures was limited to regular webcams that provided only RGB
images. The tools required to produce depth images were not available, so researchers had to
rely on processing the hand colour to obtain the hand shape [1]. Due to differences in the
physique and skin tone of a person, this method was found unsatisfactory.

A solution to the problem of skin colour in gesture recognition, as suggested by researchers,
was to use coloured gloves or markers [2] [3] [4] [5]. Unfortunately, this method had some
shortcomings from the HCI perspective. It was also limited to tracking only the palm of the
person, whereas in real-life scenarios a gesture often involves both hand and head movements.
With the evolution of camera technology and the use of recognition and pattern matching
algorithms for the Xbox and PlayStation, developers and researchers could overcome the
previous constraints in translating sign language. The potential of the Microsoft Kinect to
generate depth images more naturally is the foundation of Kinect Auslan [6], a set of
software modules for developing applications that recognize Auslan.
The FANN library was used by Professor Hubert Wassner to create a gesture recognition
system with a NN for LSF. He captured a series of gestures and taught the program to
recognize the signs [7].
Researchers at the College of Computing, Georgia Institute of Technology developed an
ASL recognition system using the Hidden Markov Model algorithm for educational games
for deaf children that contained 1000 American Sign Language (ASL) phrases [8].


The DTW algorithm was also used by Zico Pratama Putra of Hochschule Rhein-Waal
University of Applied Sciences for his sign language translation tool, but he could not use it
for 3D gestures [9].
1.3.1. American Sign Language
American Sign Language (ASL) is selected as the sign language to be used in this thesis. It
is the most popular sign language in the world, especially in Canada and the United States.
In addition, the various dialects of ASL are mostly adopted in the West African region and
Southeast Asia. ASL has a very close relationship with French Sign Language (FSL).
Although sign language is expressed differently from spoken language, i.e., using visual
space without sound, the two share similarities in how the fundamental character of the
language is organized. As in other languages, sentences are structured in a complex but
orderly manner. They consist of basic units of meaning, which in turn consist of units that
stand alone without meaning. Although the different units are not expressed through sound,
they have a relation to the types of units traditionally studied in phonology, so the same term
is generally applied to the analogous units in ASL. [10]
1.3.2. Phonetics
Bahan (1996) explained that each sign gesture in ASL is constructed from various distinctive
components. A sign may use two hands or one hand depending on need. The hand may be
in a particular orientation (e.g., a closed fist with one index finger extended), in a particular
location on the body or in the "signing space", and may involve movement. When one of
these elements is changed, it may result in a completely different meaning of the sign.


Figure 1.1: Hand shape distinctive features [11]

1.3.3. Grammar (Fingerspelling)


ASL possesses a set of 26 signs known as the American manual alphabet, which can be used
to spell out words from the English language. These signs make use of 19 of the hand shapes
of ASL. For example, the signs for 'p' and 'k' use the same hand shape but different
orientations. A common misconception is that ASL consists only of fingerspelling; such a
method (the Rochester Method) has been used, but it is not ASL.
Fingerspelling is a form of borrowing, a linguistic process wherein words from one
language are incorporated into another. In ASL, fingerspelling is used for proper nouns and
for technical terms with no native ASL equivalent. There are also some other loan words
which are finger spelled, either very short English words or abbreviations of longer English
words, e.g. O-N from English 'on', and A-P-T from English 'apartment'.
Fingerspelling may also be used to emphasize a word that would normally be signed
otherwise.

Figure 1.2: Sample ASL Gestures


1.3.4. Syntax Structure


Word order in ASL is generally subject-verb-object (SVO), with various possibilities
affecting this basic word order. Elementary SVO sentences do not include pauses.
MOTHER HUG SON
"The mother hugs the son."
However, topicalization, a phenomenon in ASL, allows the topic of a sentence to be moved
to the initial position of the sentence. Below is a sample dialogue snapshot. [11]

1. Tina: EXCUSE-me, ME NAME T-I-N-A. NAME YOU?


Excuse me, I'm Tina, what is your name?
2. Judy: ME J-U-D-Y. NICE me-MEET-you.
I'm Judy. It is nice to meet you.
3. Tina: NICE me-MEET-you. YOU TAKE-UP ASL?
It is nice to meet you. Are you taking ASL?
4. Judy: YES, ME TAKE-UP ASL. YOU?
Yes, I am taking ASL. How about you?
5. Tina: SAME-HERE. ME TAKE-UP ASL.
Same as you, I am taking ASL too.


2. PROJECT DESCRIPTION AND GOALS

Figure 2.1 - Block Diagram of the Project

The goals of this project consist of:
- Conducting an experiment with Microsoft Kinect to produce skeletal tracking and vectors.
- Comparing the recorded gestures with the database by employing Dynamic Time Warping (DTW) gesture recognition.
- Developing the Sign Language Vocabulary Library from the sign language gesture recognition.
- Introducing the gestures and translating them into written texts or sentences and spoken language.

A sign language gesture is made with the joints of the upper torso of the user. The Kinect
camera generates the skeletal image of the user performing the gesture from the depth
image, which is input to the system. The system recognizes the input using the DTW
algorithm and displays the output in the form of speech and text. To achieve this targeted
functionality, the project can be divided into two broad areas or modes. They are:


Capture Mode:
In this mode, the Kinect camera detects the skeletal joints of the person performing the
gesture. The various joints detected by Kinect are:
Hip Center, Spine, Shoulder Center, Head, Shoulder Left, Elbow Left, Wrist Left,
Hand Left, Shoulder Right, Elbow Right, Wrist Right, Hand Right, Hip Left, Knee
Left, Ankle Left, Foot Left, Hip Right, Knee Right, Ankle Right, Foot Right
Of these joints, six are taken into consideration. They are:
Hand Left, Wrist Left, Elbow Left, Elbow Right, Wrist Right, Hand Right
Once a gesture is performed by the user, the x, y, z coordinates of these six joints are
recorded across 33 frames, forming an 18×33 feature matrix. This is then compared with the
gestures already present in the database using the DTW algorithm, and if there is no match,
it is stored in the database.

Read Mode
In this mode, when a gesture is performed, the x, y, z coordinates are compared with the ones
already in the database using the DTW algorithm. When a suitable match is found, the system
displays the output gesture name as both text and speech. If a match is not found, the message
"UNKNOWN" is displayed.
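To make the data layout concrete, the sketch below shows one way such a gesture sample could be represented. The class and field names are illustrative, not the project's actual types.

class GestureSample
{
    public const int Frames = 33;        // frames recorded per gesture
    public const int TrackedJoints = 6;  // Hand Left, Wrist Left, Elbow Left, Elbow Right, Wrist Right, Hand Right
    public const int Dimensions = 3;     // x, y, z

    // 6 joints x 3 coordinates = 18 features per frame, over 33 frames: an 18x33 feature matrix
    public readonly double[,] Features = new double[TrackedJoints * Dimensions, Frames];

    // Gesture label stored alongside the matrix, e.g. "Stop"; "UNKNOWN" is reported when no match is found
    public string Name;
}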

2.1. The Transformation of Skeletal Coordinates

The skeletal coordinates returned by the Kinect camera have to be normalized for each user.
This brings a universality to the gesture data by eliminating variations due to physique.
The coordinates of the centre of the body are calculated by finding the mid-point of the
Shoulder Left and Shoulder Right joints. The origin is shifted from the Kinect axis to this
point by subtracting this coordinate from each joint coordinate. Each coordinate value is then
normalized by dividing it by the distance between the Shoulder Left and Shoulder Right
joints. This enables the user to stand anywhere within the field of view of the Kinect
and obtain coordinates with respect to his own body.
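A minimal C# sketch of this normalization, assuming the XNA Vector3 type used elsewhere in the project (the class and method names are illustrative; the project's full implementation appears in Appendix A):

using Microsoft.Xna.Framework;

static class SkeletonNormalizer
{
    // Shift the joint coordinates to a body-centred origin and scale by shoulder width
    public static Vector3[] Normalize(Vector3[] joints, Vector3 shoulderLeft, Vector3 shoulderRight)
    {
        Vector3 center = (shoulderLeft + shoulderRight) / 2f;               // mid-point of the shoulders
        float shoulderDist = Vector3.Distance(shoulderLeft, shoulderRight); // shoulder-to-shoulder distance
        var normalized = new Vector3[joints.Length];
        for (int i = 0; i < joints.Length; i++)
        {
            normalized[i] = (joints[i] - center) / shoulderDist;            // same gesture regardless of physique
        }
        return normalized;
    }
}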


3. TECHNICAL SPECIFICATION
3.1. Software Architecture

The software used in this study is designed to run on the Kinect SDK 1.8 framework. It is
important to understand the interaction of each layer of the Kinect SDK. At the lowest level,
the Kinect SDK provides the required drivers to generate the visual data and audio from the
hardware devices. The layered architecture of the Kinect Sign Language software is shown in Figure 3.1.
The hardware components, including the Kinect sensor, are connected to a computer.

Figure 3.1: Overview of Kinect Sign Language Software Layer

The Kinect drivers installed as part of the SDK control the streaming of audio and video
(colour, depth, and skeleton) from the Kinect sensors. The Kinect NUI processes the audio
and video components for skeleton tracking, audio, and colour and depth imaging [12].
The software developed for this project runs on top of the Kinect SDK framework to
extract the information of the user's gesture from the NUI (Microsoft.Kinect.dll) and to
provide a method of comparison to recognize the gestures. The predefined gesture data saved
in the database can be accessed directly. Gesture translation accesses the audio, speech and
media application programming interface (API) libraries from Windows 7
(Microsoft.speech.dll) to process the information taken from the gesture data. These data are
processed into information that needs to be translated into text and spoken language.
The speech recognition component also captures the user's voice by accessing the Windows
Audio API, and the output text is transformed into speech.
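As an illustration of the text-to-speech step, here is a minimal sketch against the standard System.Speech.Synthesis API shipped with Windows 7/.NET; it is not the project's exact code, and the gesture label is a hypothetical value:

using System.Speech.Synthesis;

class SpeakGesture
{
    static void Main()
    {
        string recognizedGesture = "hello";          // hypothetical label returned by the recognizer
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();   // route the synthesized speech to the speakers
            synth.Speak(recognizedGesture);          // blocking call; SpeakAsync is also available
        }
    }
}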
3.2. Hardware and Software Requirements

3.2.1. Hardware System Component

The following hardware is necessary for this project:

Kinect for Windows, including the Kinect sensor and the USB hub, through which
the sensor is connected to the computer.

32-bit (x86) or 64-bit (x64) processors

Dual-core, 2-GHz or faster processor

USB 2.0 bus dedicated to the Kinect

2 GB of RAM (4 GB recommended)

Graphics card that supports DirectX 9.0c

3.2.2. Software System Component

Kinect SDK v1.8 for the Kinect sensor.

Windows 7 standard APIs- The audio, speech, and media APIs in Windows 7, as
described in the Windows 7 SDK and the Microsoft Speech SDK.
Visual Studio 2010 or later

.NET Framework 4 (extension for Visual Studio) and XNA Framework 4.0

KinectDTW Library

3.3. The Depth Sensor

The Microsoft Kinect was initially used as a peripheral device for the Xbox 360 gaming
console that used image-processing methods to provide hands-free control of the console. It
consists of an RGB camera, an infrared (IR) projector and a camera to provide depth
perception, an array of microphones for voice commands and a motor to adjust the position
of the sensor.
The affordability of the Microsoft Kinect has provided a tool to the wider public to access
the gesture recognition technology that was previously complex and expensive.
Today there are two different versions of the Microsoft Kinect available: the standard Xbox
360 Kinect and the Kinect for Windows, which has been designed and supported for
application development purposes. The main difference between the two products is that the
Kinect for Windows has a near mode that extends the sensor's usable range to closer
distances. This project has been carried out using a Kinect for Windows v1 sensor.

Figure 3.2: Architecture of Microsoft Kinect

3.4. Framework Evaluation

The Kinect sensor requires a driver to translate the incoming raw signals, and then convert
these signals into data useable in the application. These drivers are at a higher layer of
application architecture which is essential for the software development.
The earlier use of the Kinect sensor for the Xbox 360, with its ability to read and track body
movements, seemed too good to be used just for recreational activities. The open source
community reverse-engineered the device to create the OpenKinect framework with the
libfreenect driver just a few days after the Kinect was marketed. The driver uses raw data
from the various sensors of the Kinect; however, it does not provide a powerful framework
for creating applications based on a natural interface environment. OpenNI and the Microsoft
SDK are preferable when the developer needs a feature like skeleton tracking. Thus, the
OpenKinect framework is not used to build this application.
The technology behind the Kinect sensor was originally designed by PrimeSense, which
released its own version of an SDK to be used with the Kinect, named OpenNI. A natural
interface application could be developed using the OpenNI framework, as it works with a
broader range of depth sensors. OpenNI has some tools that could speed up the development
of this project and is able to overcome the shortcomings of OpenKinect/libfreenect. This
software comes with NITE, a middleware that can track the skeleton by translating raw data
and measuring the coordinates of the body parts, similar to the technology used to create
Xbox Kinect games [13].
Seven months after the Kinect device was released, Microsoft launched the Windows SDK
following the high interest of developers in utilizing this device. One advantage of this
official SDK was that it was made directly by Microsoft, which designed the device. The
official SDK released in the beta state was only used for testing purposes. This SDK has
been developed specifically to encourage broad exploration and experimentation by the
academic community and research enthusiasts. It provides a better API than other
frameworks for accessing the Kinect's hardware capabilities, including its four-element
microphone array. The SDK is also equipped with the Microsoft Kinect runtime, which
provides more efficient algorithms for implementing user segmentation, skeletal tracking,
and voice control. A test was carried out earlier to compare OpenNI and the Windows SDK
by implementing the same functions.
Table 3.1: Comparison of Software Frameworks

Framework     | Supported Languages                              | Operating System            | Performance                                | Depth Image (m) | License
Official SDK  | C++, C# or Visual Basic using Visual Studio      | Windows 7/8/8.1             | Faster response than OpenNI and OpenKinect | 0.8 to 4        | Limited
OpenNI        | Python, C, C++, C#; Visual Studio not required   | Linux, Mac OS X and Windows |                                            | 0.5 to 9        | Open
OpenKinect    | Python, C, C++, C#; not requiring Visual Studio  | Linux, Mac OS X and Windows |                                            | 0.5 to 9        | Open

Based on these comparisons, the Microsoft SDK has better performance than the other
open-source alternatives in some applications, especially for skeleton tracking and its fast
response. After evaluating all the available frameworks, the Microsoft SDK was found to
be the best choice for this project.
3.5. The Microsoft SDK

Application developers use the Kinect for Windows SDK from Microsoft Research as a
starter kit to develop a wide range of applications that use the Kinect sensor. It is expected
that with this SDK, the Kinect could be used in fields such as education and robotics, beyond
the Xbox.
The Kinect for Windows SDK comes with drivers for the sensor streams and for tracking
human motion. It was released by Microsoft for technology development with C++, C# or
Visual Basic using Microsoft Visual Studio 2010.
Kinect SDK features used in this project are:

Raw Sensor Streams

This feature has a function to gain access to the raw data streams from the camera sensor,
depth sensor, and four-element microphone array.

Skeletal Tracking

This feature has a function to track the skeleton image of at most two moving people within
the field of view of Kinect sensor to create a movement-based program easily.
The more accurate Kinect v2 sensor has been launched very recently and works with SDK 2.0;
SDK support for the Kinect v1 ended with SDK 1.8.


3.6. The DTW Algorithm

There are many gesture recognition tools and algorithms, such as FAAST (Flexible Action
and Articulated Skeleton Toolkit), Gesture Studio, SignmaNIL, Candescent, 3DHandTracking,
TipTepSkeletonizer, Kinect Auslan and Kinect Dynamic Time Warping. A recent algorithm
based on the Hidden Markov Model is also gaining some popularity. The algorithm that best
meets the criteria laid out earlier is to be used for this project.
Based on many tests, the KinectDTW algorithm was found to be the best choice to recognize
gestures accurately. It is fast, light, reliable and highly customizable, has excellent
performance, and does not require excessive memory. The Dynamic Time Warping (DTW)
algorithm was first introduced in the 1960s by Bellman and Kalaba, and was extensively
explored in the 1970s for voice recognition applications by C. Myers. One disadvantage of
this implementation is that it was developed using Microsoft SDK version 1, and the
application developer never updated its libraries for the latest SDK. Moreover, since SDK 1.5,
Microsoft has made major changes to the SDK library, its classes, methods and the way they
are accessed, which is very different from version 1. However, the significance of KinectDTW
cannot be sidelined, and hence this study makes the effort to re-engineer the code to run on
SDK version 1.8.
Table 3.2: Kinect Algorithm Evaluation

Algorithm          | Supported Framework     | Programming Language | Point Detection
FAAST              | OpenNI, Kinect SDK      | None                 | Body
Gesture Studio     | OpenNI, Kinect SDK      | None                 | Body
SignmaNIL          | OpenNI, Kinect SDK      | C++                  | Hand Palm
Candescent         | OpenNI, Kinect SDK      | C#                   | Hand Palm
3D Hand Tracking   | OpenNI, for Windows 64  | C++                  | Hand Palm
Tiptep             | OpenNI                  | C#                   | Hand Palm
Kinect Auslan      | OpenNI                  | C++                  | Body minus finger tip
KinectDTW          | Kinect SDK v.1          | C#                   | Body

Until now, this algorithm has supported skeletal tracking and 2D vector gesture recognition
using all the joints of the upper torso (skeleton frame). The user may select the desired
gesture name to record and click the capture button prior to recording the gesture.
KinectDTW then starts a countdown before recording, allowing the user to prepare. By
default, the system records the gesture for up to 33 frames, with the training finishing on the
33rd frame. The DTW algorithm does not care how quickly the gesture is performed. This
project aims to extend the capability of DTW to recording and recognizing 3D gestures.
Currently, DTW is applied in many fields, including handwriting and online signature
matching, computer vision and computer animation, protein sequence alignment and
chemical engineering. The basic DTW logic is given below: [14]
int DTWDistance(s: array [1..n], t: array [1..m], w: int) {
    DTW := array [0..n, 0..m]
    w := max(w, abs(n-m))                          // adapt window size (*)
    for i := 0 to n
        for j := 0 to m
            DTW[i, j] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := max(1, i-w) to min(m, i+w)
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],    // insertion
                                        DTW[i  , j-1],    // deletion
                                        DTW[i-1, j-1])    // match
    return DTW[n, m]
}
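For reference, a direct C# transcription of this pseudocode, assuming one-dimensional sequences and absolute difference as the local distance d (the project's recognizer in Appendix A applies the same idea to sequences of joint snapshots):

using System;

static class SimpleDtw
{
    // Windowed DTW distance between sequences s and t, with window size w
    public static double Distance(double[] s, double[] t, int w)
    {
        int n = s.Length, m = t.Length;
        w = Math.Max(w, Math.Abs(n - m));                        // adapt window size
        var dtw = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                dtw[i, j] = double.PositiveInfinity;
        dtw[0, 0] = 0;
        for (int i = 1; i <= n; i++)
        {
            for (int j = Math.Max(1, i - w); j <= Math.Min(m, i + w); j++)
            {
                double cost = Math.Abs(s[i - 1] - t[j - 1]);     // local distance d(s[i], t[j])
                dtw[i, j] = cost + Math.Min(dtw[i - 1, j],       // insertion
                                   Math.Min(dtw[i, j - 1],       // deletion
                                            dtw[i - 1, j - 1])); // match
            }
        }
        return dtw[n, m];
    }
}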


3.6.1. Theoretical Background

Figure 3.3: DTW Algorithm Least Cost Path

There are several different types of learning algorithms. The main two types are supervised
learning and unsupervised learning. In supervised learning, the idea is that users are going to
teach the computer to do something, whereas in unsupervised learning, the user is going to
let it learn by itself. The term supervised learning refers to the fact that the algorithm
contains a data set in which the correct output is given.
This DTW algorithm is an example of a classification problem in supervised learning. The
term classification refers to predicting a discrete-valued output. Classification problems can
have more than two possible output values. As a concrete example, with three types of sign
data the system should predict a discrete value of zero, one, two or three, where zero means
no sign, one the "happy" sign, two the "hello" sign and three the "good" sign. This is still a
classification problem, because we have a discrete set of output values corresponding to no
sign, "happy", "hello" or "good".
According to Sakoe and Chiba, dynamic time warping is a method for calculating the
similarity between two time series that may vary in time and speed, as in the case of
detecting similarities in running motion, where the first recording shows a person walking
slowly and the other shows a person running faster. DTW is widely used in speech
recognition to recognize whether two waveforms represent the same spoken phrase or not.
The DTW here compares two feature sequences sampled at equidistant points in time,
X = (x1, x2, ..., xN) of length N and Y = (y1, y2, ..., yM) of length M, in the feature space F,
such that xn, ym ∈ F for n ∈ [1:N] and m ∈ [1:M].
This comparison is done by generating a cost matrix from the pairwise Euclidean distances
of each x and y. The local cost measure c: F × F → R≥0 gives the matrix
C(i, j) = |xi - yj|, for i ∈ [1:N], j ∈ [1:M].
When this matrix is created, the algorithm finds the alignment path that runs through the
low-cost areas of the matrix. This path is the warping path.
The warping path built by DTW is a sequence of points p = (p1, ..., pL) with pl = (nl, ml) ∈
[1:N] × [1:M] for l ∈ [1:L], satisfying the following criteria for successful matching.

Boundary condition: The initial and final points of the aligned sequences must be the
starting and ending points of the warping path: p1 = (1, 1) and pL = (N, M).

Monotonicity condition: The path must preserve the time ordering of points:
n1 <= n2 <= ... <= nL and m1 <= m2 <= ... <= mL.

Step size condition: While aligning the sequences, the warping path is limited from shifts
in time (long jumps): pl+1 - pl ∈ {(1,0), (0,1), (1,1)} for l ∈ [1:L-1].
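For instance (a small illustrative example, not taken from the project's data): for X = (1, 3, 4) and Y = (1, 2, 2, 4), the path p = ((1,1), (2,2), (2,3), (3,4)) satisfies the boundary, monotonicity and step size conditions, and its total cost |1-1| + |3-2| + |3-2| + |4-4| = 2 is the minimum over all admissible warping paths, so the DTW distance between X and Y is 2.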


4. DESIGN APPROACH AND DETAILS


4.1. DESIGN APPROACH

To achieve the gesture recognition, the KinectDTW project is modified. It encompasses four
main classes, as given below:
1. MainWindow.xaml.cs - Interface logic for MainWindow.xaml
2. DtwGestureRecognizer.cs - Class that uses Dynamic Time Warping to compare with the nearest neighbour
3. Skeleton3DdataExtract.cs - Class to transform and normalise the skeletal data
4. Skeleton3DdataCoordEventArgs.cs - Class that retrieves the skeletal frame coordinates
A DtwGestures object takes the Kinect runtime feed and saves it to the .txt file. A
checkGesture() event on the object from Skeleton3DdataCoordEventArgs is called on
clicking Show Gesture Text, and it returns the string representing the detected gesture.
To introduce the three-dimensional coordinate, the XNA Framework reference was added,
which provides the required vector class. This allowed the range of gestures that could be
given by the user to be increased, distinguished within a threshold of 10 cm along each axis.
To make the system more interactive, the built-in microphone array in the Kinect v1 sensor
was used for speech recognition. This could be expanded and improved to make the entire
system completely hands free.
4.1.1. MainWindow.xaml.cs

This C# file implements the interaction logic for the MainWindow.xaml file, which takes
care of the design of the interface. It controls how depth data is converted into false-colour
data for more intuitive visualization; 32-bit colour frame buffer versions of the frames are
kept, and these get updated whenever a 16-bit frame is received and processed.


Separate dictionaries are maintained of all the joints which the Kinect SDK is capable of
tracking and of the ones we focus on. The Ignore variable stores which frames are to be
captured and which neglected (1 = capture every frame, 2 = capture every second frame,
etc.). BufferSize holds the number of frames to be stored in the video buffer (here 32) and
MinimumFrames stores the minimum number of frames to be kept in the video buffer before
we attempt to start matching gestures (here 10). The countdown time before attempting to
start matching gestures is set, and the gesture saving location and filename are specified. A
flag records whether or not the gesture recognizer is capturing a new pose. The
_vectordimension variable stores the dimension of the position vector of the joints that are
required (2 = 2D, 3 = 3D, 4 = 3D plus W coordinate). The number of joints to be tracked is
specified, and so is the minimum length of a gesture. Measures of the final position threshold
and the sequence similarity threshold are also specified.
A LoadGesturesFromFile function opens the text file containing the gesture information
and creates a DTW recorded gesture sequence. It defines how to differentially store the
coordinate values, the gesture names and the joint names. The SkeletonExtractFrameReady
function interacts with the Skeleton3DdataExtract.cs file in order to retrieve the skeletal
coordinates. The ConvertDepthFrame function converts the
16-bit grayscale depth frame into a 32-bit frame that displays multiple users in different
colours. It also transforms the 13-bit depth information into an 8-bit intensity appropriate for
display (disregarding information in most significant bit). The NuiDepthFrameReady and
GetDisplayPosition functions are called when each depth frame is ready and the position
of a joint is to be displayed. The DtwCaptureClick, CaptureCountdown and
StartCapture functions are triggered when the user prompts to record a gesture. Upon
prompting to save the gesture information, the DtwSaveToFile function is called that saves
the text file to the specified location. The DtwLoadFile function implements the logic
behind the loading of the gesture text file. DtwShowGestureText displays the gesture
information of the recently recorded gestures. In order to make the entire system voice
based, a set of Grammar is defined and loaded. Using the inbuilt SpeechSynthesizer,
PromptBuilder keywords and the SpeechRecognized function, the user may command the
system functions by a set of voice commands and voice based inputs.


4.1.2. Skeleton3DdataCoordEventArgs.cs
This class takes the skeletal frame coordinates and converts them into a form useful to the
DTW algorithm. The positions of the elbows, the wrists and the hands of both arms are
recorded and stored. SkeletonSnapshot returns the map of the joints versus their positions in
a particular frame.
4.1.3. Skeleton3DdataExtract.cs
This class is used to transform the skeleton data thus obtained. The coordinates of the eight
extracted points are stored in the array p. To make the system recognize gestures from users
of any physique, the origin is shifted from the camera to the centre of the user by finding the
midpoint of the ShoulderLeft and ShoulderRight points. The shoulder distance, denoted by
ShoulderDist, is given by the Euclidean distance between the two points. The point which is
roughly the centre of the body is stored in the variable center. Each coordinate is made
relative to this centre by subtracting the center value from it. Further, the coordinates are
normalized by dividing each of the six points taken by the shoulder distance.
4.1.4. DtwGestureRecognizer.cs
The DtwGestureRecognizer class is a Dynamic Time Warping nearest-neighbour sequence
comparison class. Information on the size of the observation vectors, the number of data
points required, the position and recognition thresholds, the minimum length of a gesture
before it is recognized, the maximum distance between an example and the sequence being
classified, and so on, is stored here. The DTW constructor is overloaded to initialize these
values. The AddOrUpdate function adds a sequence with a label to the known sequences
library, provided the gesture starts on the first observation of the sequence and ends on the
last one. The Recognize function recognizes the gesture in a given sequence. It works on the
assumption that the last observation of the sequence marks the end of the gesture. No gesture
will be recognized if the overall DTW distance between the two sequences is too high. A
gesture is recognized as long as it hits the threshold.
Though this means all gestures that hit the threshold are recognised, only the gesture that is
most similar should be returned. The RetrieveText function retrieves a text representation of
the _label and its associated _sequence. It helps as it displays the debug information and the
user may save it to the file. It returns a string containing all recorded gestures and their
names. The RetrieveGest function is called only when the Gesture file is loaded to display
the gesture names in the database. The Dtw function computes the min DTW distance
between the inputSequence and all possible endings of recorded gestures. To compute the
length between a frame of the inputSequence and a frame of a recorded gesture,
CalculateSnapshotPositionDistance function is used.
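The following sketch shows how such a recognizer is typically driven. The constructor parameters, values and sequence variables (recordedSequence, liveSequence) are illustrative assumptions based on the description above, not the verified KinectDTW API.

// Illustrative only: parameter meanings follow the description above, values are hypothetical.
var recognizer = new DtwGestureRecognizer(
    18,     // observation vector size: 6 joints x (x, y, z)
    0.6,    // sequence similarity threshold
    1.0,    // final position threshold
    10);    // minimum gesture length in frames

// Store a labelled example; the gesture must start on the first and end on the last observation.
recognizer.AddOrUpdate(recordedSequence, "Stop");

// Classify a live sequence; the label is returned, or an unknown marker if no stored gesture hits the threshold.
string result = recognizer.Recognize(liveSequence);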
4.2. CODES AND STANDARDS


Kinect SDK v1.8 only supports C++, C# and VB. Due to the ease of programming with C#,
the logic is implemented in C# using Visual Studio 2012. C# conforms to the ECMA-334,
ISO/IEC 23270:2006 standard. The user interface design is met by using XAML, which can
be easily interfaced with the programming logic. The Kinect camera requires to be interfaced
to a PC governed by the FCC Class B regulations, with support for a USB 2.0 hub.

4.3. CONSTRAINTS, ALTERNATIVES AND TRADE-OFFS


Time consumption: Conversion of a 16-bit grayscale depth frame into a 32-bit frame,
accompanied by the real-time working of the DTW processor, can be slightly time
consuming. Hence a system with at least 4 GB of RAM, a graphics card that supports
DirectX 9.0c and an Intel dual-core or faster processor would provide a seamless interface
to the user.
Occlusion of the Kinect: Though the depth image takes care of insufficient lighting of the
background, the Kinect v1 sensor does not handle occlusions efficiently. For efficient gesture
tracking it has to be ensured that the user faces the camera within the Kinect's range of
operation, in a reasonably hassle-free environment, with none of his skeletal joints occluded.
The program is developed for a single-user interface.


Limitation of the Kinect v1 sensor: In American Sign Language, alphabets are represented
using the fingers. The Kinect v1 sensor, however, cannot distinguish the finger joints
separately, but considers the entire hand as a single joint.
Accuracy: Gestures were successfully distinguished with a minimum deviation of 10 cm
along each axis.

The trade-offs are:

Use of software running on a PC versus an embedded system: Though for practical use an
embedded system at a public service information counter would be the ideal choice, it
reduces scalability, given that the system is still in the developmental phase. The Kinect
camera's easier interaction with the Software Development Kit and Visual Studio running on
a PC provides an easier way to test the various parameters and work on improvements, such
as introducing the z-coordinate to the skeletal data. Also, since it is just a piece of software,
downloadable upgrades can easily be made available, and it is easier and more interactive
for the end user.
One-person input versus better performance: While gesture input involving more than one
person at a time could enhance the range of the different sign languages, it would require
more complex supervised learning by means of Dynamic Time Warping, often resulting in
mismatches and/or a delay in the performance of the system.
Today, there are more advanced and efficient algorithms than DTW, like the Hidden Markov
Model, on which research is still ongoing. Also, instead of a voice input, a specific gesture
could take care of each command.


5. SCHEDULE, TASKS AND MILESTONES


There are three major milestones as well as several smaller tasks that must be achieved in
order to reach the milestones. The three milestones are:

Implementing DTW for recognizing gestures

Extending the interface to 3D

Interfacing the system through Voice.

The work was divided among the team after laying out a rough schematic of the entire
project. Most of the tasks required the entire team's effort. Though each task had a leader
who was responsible for its completion, sometimes a group member worked on another
member's task at a different point of time so as to provide a varied perspective or a fresh set
of eyes. This allowed the blending of ideas so that the most effective outcome was obtained
in the least possible time frame. Even in certain impasse situations, this strategy allowed the
work to go forward.
Different online forums and discussions were referred to, and the information collected was
used to solve the hundreds of bugs in the code. Usability testing was performed at each stage
of the project to look for possible enhancements. For instance, the need to develop the
system for three-dimensional gestures was felt after extensive work on 2D gestures. The
subsequent expansion of the entire UI to speech-based input was also arrived at in a similar
manner.
The expert opinions from the Project Panel often provided ways to improve the project. A
wide variety of information could be obtained based on these guidelines, and thus gave a
new dimension to the work being done.
A Gantt chart outlining important tasks and goals can be seen in Appendix C.


6. PROJECT DEMONSTRATION
One of the group members records two gestures that appear to be only slightly differentiable
because of the three-dimensional plane involved, and two gestures that are exactly the same
but performed at slightly different coordinate levels. Another group member uses the system
in Read mode and tries to perform these gestures. The latter two gestures are performed in a
way that demonstrates the threshold within which the gesture recognition works (roughly
10 cm along each axis). The demonstration thus shows the system's ability to differentiate
between 2D and 3D gestures, and its capacity to recognize a gesture irrespective of the
physique of the person.

Figure 6.1: User Interface

6.1. Project Walkthrough


1. Open the solution file of the project and check that all the prerequisite software/DLLs
(references) are added. Put Visual Studio into Debug mode, then build and run the project.
2. The MainWindow XAML window appears as above. A menu with the options Start
Capture, Load Gesture, Save To File and Show Gesture on the left side, an RGB camera
feed, a depth feed and a skeletal feed derived from the depth feed, Start and Stop buttons and
two text boxes are displayed in the output window.
3. The user is free to use the mouse or his voice to navigate through the entire UI. The user
should press the Start button at once to start the session.
4. Step into the field of view of the Kinect sensor and the skeleton will be tracked
spontaneously. The program is developed for single-player gesture recognition, so it should
be ensured that the Kinect feed is not subjected to more than one person.
5. Load the sample gestures by clicking Load Gesture and navigating to the supplied
DefaultGestures.txt file (Figure 6.2). If using voice, say "Load gesture".

Figure 6.2: Loading Gestures


6. Start performing some gestures. The names of the gestures already in the database are
shown at the left. Upon a successful match with a previously recorded gesture, the match
appears in the text panel at the top of the screen (Figure 6.3).
7. Stop debugging the app and restart it.
8. Record your own gestures. Once the skeletal feed comes up on the screen, enter the
gesture name and then click the Start Capture button to record. If using voice, say the
appropriate gesture name as input, and say "Start Capture". The user has three seconds to get
into place and start recording the gesture. The system is currently programmed to look at 32
frames (i.e. every alternate frame over 64 consecutive frames).
9. The user has to make sure that the whole gesture is completed within the span of 32
frames. This might mean varying the speed of the gesture to fit this time frame. The rate at
which the gesture is performed is immaterial to the DTW algorithm.

Figure 6.3: Recognizing Gestures


10. When the recording of a gesture is finished, the system automatically switches back into
Read mode, so the new gesture can be tested a few times before confirming it. If you are not
happy with it, it can be re-recorded and tried again.
11. The details of the gestures just recorded, or of the ones loaded, can be found by clicking
Show Gesture or by speaking the same command.
12. When one is happy with the results, the gestures can be saved to a file.

Figure 6.4: Show Gesture Text


7. COST ANALYSIS
7.1. Marketing Analysis

Though there are numerous sign language systems which work with DTW, none of them has
the ability to work with 3D gestures. Adverse operating conditions often make the working
of these systems difficult and unpredictable.
This system has a scalable interface that can be further modified with ease according to the
preference of the user. The system can also be used for communication in the opposite
direction, involving the use of avatars and animated videos, but this requires a lot more work,
since animated videos have to be made for each new gesture.
7.2. Cost Analysis

Beyond the requirement of a Windows-based PC of sufficient performance and a Microsoft
Kinect camera, the project does not incur any additional costs.

Manpower Cost Analysis

Sr. No | Hours per Week | Total Number of Weeks | Total Hours
1      | 40             | 17                    | 680
2      | 40             | 17                    | 680
3      | 40             | 17                    | 680
Total  |                |                       | 2040

8. SUMMARY
The group is currently working on improving the efficiency of processor utilization by
limiting unnecessary execution of the DTW. At present, the DTW works perpetually for the
entire session while the code is running, trying to get a match. Since in most real-world
gestures the hand movement is above the hip, this may be laid out as a limiting condition for
the DTW to run. Getting the Kinect v2 sensor to work with this code is expected to improve
the accuracy by detecting the finger joints.


9. REFERENCES
[1] Cerezo, F. T. (2012). 3D Hand and Finger Recognition using Kinect. Available: http://frantrancerkinectft.codeplex.com/
[2] Kyatanavar, M. R., & Futane, P. (2012). Comparative Study of Sign Language Recognition Systems. International Journal of Scientific and Research Publications.
[3] V. Buchmann, S. V. (2004). FingARtips: Gesture Based Direct Manipulation in Augmented Reality. GRAPHITE '04: Proceedings of the 2nd international conference on Computer graphics and interactive techniques in Australasia and South East Asia (pp. 212-221). New York: ACM Press.
[4] Akmeliawati, R., Melanie, P.-L. O., & Ye, C. K. (2007). Real-time Malaysian Sign Language Translation Using Colour Segmentation and Neural Network. Proceedings of the Instrumentation and Measurement Technology Conference, ISBN 1-4244-0588-2. Warsaw.
[5] Uebersax, D., Gall, J., Van den Bergh, M., & Van Gool, L. (2011). Real-time Sign Language Letter and Word Recognition from Depth Data. Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on (pp. 383-390). Barcelona: IEEE.
[6] Kinect-Auslan, Auslan Recognition Software for Kinect. (2012, October 24). Available: https://code.google.com/p/kinect-auslan/
[7] Kinect Neural Network Gesture Recognition. Available: http://professeurs.esiea.fr/wassner/?2011/05/06/325-kinect-reseau-de-neuronereconnaissance-de-gestes
[8] Zafrulla, Z., Brashear, H., Starner, T., Hamilton, H., & Presti, P. (2011). American Sign Language Recognition with the Kinect. ICMI '11: Proceedings of the 13th international conference on multimodal interfaces (pp. 279-286). New York: ACM.
[9] Zico Pratama Putra. (2014). A Natural User Interface Translation Tool: From Sign Language to Spoken Text and Vice Versa. Available: http://www.researchgate.net/profile/Zico_Putra/publication/262675277_A_Natural_User_Interface_Translation_Tool_From_Sign_Language_to_Spoken_Text_and_Vice_Versa/links/0a85e5386286d952b0000000.pdf
[10] Bahan, B. J. (1996). Non-Manual Realization of Agreement in American Sign Language. Boston: Boston University.
[11] Stewart, D. A. (1998). American Sign Language The Easy Way. New York: Barron's Educational Series, Inc.
[12] Kinect for Windows Architecture. (2013, December 23). Available at Microsoft Developer Network: http://msdn.microsoft.com/en-us/library/jj131023.aspx
[13] Keane, S., Hall, J., & Perry, P. (2011). Meet the Kinect: An Introduction to Programming Natural User Interfaces. Apress.
[14] Dynamic time warping, Wikipedia. Available at: en.wikipedia.org/wiki/Dynamic_time_warping


APPENDIX A
(Key parts of the code)

1. (Extraction of Skeletal Data)


public static void ProcessData(Skeleton data, int jointsTracked)
{
// Extract the coordinates of the points.
var p = new Vector3[jointsTracked];
Vector3 shoulderRight = new Vector3(), shoulderLeft = new Vector3();
foreach (Joint j in data.Joints)
{
switch (j.JointType)
{
case JointType.HandLeft:
p[0] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
break;
case JointType.WristLeft:
p[1] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
break;
case JointType.ElbowLeft:
p[2] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);

break;
case JointType.ElbowRight:
p[3] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
break;
case JointType.WristRight:
p[4] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
break;
case JointType.HandRight:
p[5] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
break;
case JointType.ShoulderLeft:
shoulderLeft = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
p[6] = shoulderLeft;
break;
case JointType.ShoulderRight:
shoulderRight = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
p[7] = shoulderRight;
break;
}
}

// Centre the data on the mid-point of the shoulders
var center = new Vector3((shoulderLeft.X + shoulderRight.X) / 2, (shoulderLeft.Y +
shoulderRight.Y) / 2, (shoulderLeft.Z + shoulderRight.Z) / 2);
for (int i = 0; i < jointsTracked - 2; i++)
{
p[i].X -= center.X;
p[i].Y -= center.Y;
p[i].Z -= center.Z;
}
// Normalization of the coordinates
double shoulderDist =
Math.Sqrt(Math.Pow((shoulderLeft.X - shoulderRight.X), 2) +
Math.Pow((shoulderLeft.Y - shoulderRight.Y), 2) +
Math.Pow((shoulderLeft.Z - shoulderRight.Z), 2));
for (int i = 0; i < jointsTracked -2; i++)
{
p[i] = Vector3.Divide(p[i], (float)shoulderDist);
}
// Now put everything into the dictionary, and send it to the event.
Dictionary<JointType, Vector3> _skeletonSnapshot = new Dictionary<JointType, Vector3>
{
{JointType.HandLeft, p[0]},
{JointType.WristLeft, p[1]},
{JointType.ElbowLeft, p[2]},
{JointType.ElbowRight, p[3]},
{JointType.WristRight, p[4]},
{JointType.HandRight, p[5]},
{JointType.ShoulderLeft, p[6]},
{JointType.ShoulderRight, p[7]},
};
// Launch the event
Skeleton3DdataCoordReady(null, new Skeleton3DdataCoordEventArgs(_skeletonSnapshot));
}


2. DtwGestureRecognizer.cs (Algorithm for DTW)


/// <summary>
/// Compute the min DTW distance between the inputSequence and all possible endings
of recorded gestures.
/// </summary>
public double Dtw(Dictionary<JointType, List<Vector3>> inputSequence,
Dictionary<JointType, List<Vector3>> recordedGesture)
{
//Make assumption that all lists are same length!
var inputSeqIterator = inputSequence.GetEnumerator();
inputSeqIterator.MoveNext();
int inputLength = inputSeqIterator.Current.Value.Count;
//Make assumption that all lists are same length!
var recordedGestureSeqIterator = recordedGesture.GetEnumerator();
recordedGestureSeqIterator.MoveNext();
int recordLength = recordedGestureSeqIterator.Current.Value.Count;
//Book keeping, setting up and initialization.
var tab = new double[inputLength + 1, recordLength + 1];
var horizStepsMoved = new int[inputLength + 1, recordLength + 1];
var vertStepsMoved = new int[inputLength + 1, recordLength + 1];
for (int i = 0; i < inputLength + 1; ++i)
{
for (int j = 0; j < recordLength + 1; ++j)
{
tab[i, j] = double.PositiveInfinity;
horizStepsMoved[i, j] = 0;
vertStepsMoved[i, j] = 0;
}
}
tab[inputLength, recordLength] = 0;
for (int i = inputLength - 1; i > -1; --i)
{
for (int j = recordLength - 1; j > -1; --j)
{
if (tab[i, j + 1] < tab[i + 1, j + 1] && tab[i, j + 1] < tab[i + 1, j] &&
horizStepsMoved[i, j + 1] < _maxSlope)
{
//Move right, move left on reverse
tab[i, j] = CalculateSnapshotPositionDistance(inputSequence, i,
recordedGesture, j) + tab[i, j + 1];
horizStepsMoved[i, j] = horizStepsMoved[i, j + 1] + 1;
vertStepsMoved[i, j] = vertStepsMoved[i, j + 1];
}
else if (tab[i + 1, j] < tab[i + 1, j + 1] && tab[i + 1, j] < tab[i, j + 1] &&
vertStepsMoved[i + 1, j] < _maxSlope)
{
//Move down, move up on reverse
tab[i, j] = CalculateSnapshotPositionDistance(inputSequence, i,
recordedGesture, j) + tab[i + 1, j];
horizStepsMoved[i, j] = horizStepsMoved[i + 1, j];
vertStepsMoved[i, j] = vertStepsMoved[i + 1, j] + 1;
}

else
{
//Move diagonally down-right
if (tab[i + 1, j + 1] == double.PositiveInfinity)
{
tab[i, j] = double.PositiveInfinity;
}
else
{
tab[i, j] = CalculateSnapshotPositionDistance(inputSequence, i,
recordedGesture, j) + tab[i + 1, j + 1];
}
horizStepsMoved[i, j] = 0;
vertStepsMoved[i, j] = 0;
}

}
}
// Return the smallest cumulative alignment cost found
double bestMatch = double.PositiveInfinity;
for (int i = 0; i < inputLength; ++i)
{
if (tab[i, 0] < bestMatch)
{
bestMatch = tab[i, 0];
}
}
return bestMatch;
}

3. Voice Recognition Code Snippet


/// <summary>
/// Speech recognition
/// </summary>
SpeechSynthesizer sSynth = new SpeechSynthesizer();
PromptBuilder pBuilder = new PromptBuilder();
SpeechRecognitionEngine sRecognize = new SpeechRecognitionEngine();
private void button3_Click(object sender, RoutedEventArgs e)
{
button3.IsEnabled = false;
button2.IsEnabled = true;
Choices sList = new Choices(new String[] { "hello", "how are you", "go there", "good bye",
"day night", "good morning", "where are you going", "where do you live", "start capture",
"load gesture", "save to file", "close window", "show gesture", "stop recognition" });
Grammar gr = new Grammar(new GrammarBuilder(sList));
try
{
sRecognize.RequestRecognizerUpdate();
sRecognize.LoadGrammar(gr);
sRecognize.SpeechRecognized += sRecognize_SpeechRecognized;
sRecognize.SetInputToDefaultAudioDevice();
sRecognize.RecognizeAsync(RecognizeMode.Multiple);
}

catch
{
return;
}
}

private void sRecognize_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)


{
switch (e.Result.Text.ToString())
{
case "start capture":
DtwCaptureClick(null, null);
break;
case "load gesture":
DtwLoadFile(null, null);
break;
case "save to file":
DtwSaveToFile(null, null);
break;
case "close window":
sRecognize.RecognizeAsyncStop();
WindowClosed(null, null);
break;
case "show gesture":
DtwShowGestureText(null, null);
break;
case "stop recognition":
button2_Click(null, null);
break;
default:
gestureList.Text = e.Result.Text.ToString();
break;
}
}

APPENDIX B
(Gesture File: for a sample gesture "Stop")

(The complete listing comprises several hundred recorded joint coordinate values for the
tracked joints HandLeft, WristLeft, ElbowLeft, HandRight, WristRight and ElbowRight,
together with the gesture label Stop and the record separator ---. The opening entries of
the file are reproduced below as an excerpt.)

-0.4261816   -1.788881    -0.5578205   Stop         -0.5498732   -0.4352334
-1.696717    -1.770446    -0.5596022   -0.3935509   HandLeft     -0.4227935
-1.675511    -0.5467879   -0.5465779   -0.3958366   -1.806109    -0.5584522
-1.768444    -0.5515297   -0.437495    -1.69523     -0.4206775   -1.766214
...
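
For orientation, the listing is line oriented: numeric lines carry coordinate values, while the remaining lines carry joint names, the gesture label or the separator. A minimal loader sketch under that assumption (the class and method names are illustrative only):

using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class GestureFileSketch
{
    // Splits a gesture file into its numeric coordinate values and its text labels.
    public static (List<double> Values, List<string> Labels) Load(string path)
    {
        var values = new List<double>();
        var labels = new List<string>();

        foreach (string raw in File.ReadLines(path))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;

            if (double.TryParse(line, NumberStyles.Float, CultureInfo.InvariantCulture, out double v))
                values.Add(v);      // coordinate value
            else
                labels.Add(line);   // joint name, gesture label or "---" separator
        }

        return (values, labels);
    }
}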

APPENDIX C:
Gantt Chart


CURRICULUM VITAE

Name              : Roshan P. Shajan
Father's name     : P.U. Shajan
Date of Birth     : 21.12.1992
Nationality       : Indian
Sex               : Male
Company placed    : Tricon Infotech Pvt. Ltd.
Permanent Address : Pulikkottil House, Pengamuck (P.O),
                    Thrissur - 680544, Kerala
Phone Number      : +91 4885 274234
Mobile            : +91 7200157374
Email Id          : roshanpshajan@gmail.com

CGPA: 8.00

Examinations taken:
GRE: 314

Placement Details:
Position: Associate Software Engineer
Location: Bangalore

CURRICULUM VITAE

Name              : Rajesh Thomas
Father's name     : J. A. Thomas
Date of Birth     : 19.05.1993
Nationality       : Indian
Sex               : Male
Company placed    : Accenture
Permanent Address : Privy Garden, TC 11/691, Nanthencode, Kawdiar PO,
                    Trivandrum 695 003, Kerala.
Phone Number      : +91 471 2310587
Mobile            : +91 9486225587
Email Id          : rajeshtheeinstein@rocketmail.com

CGPA: 8.44

Examinations taken:
GRE: 309
TOEFL: 103
CAT: 77.8 percentile

Placement Details:
Position: Software Engineer
Location: Bangalore

CURRICULUM VITAE

Name              : Abhijith Manohar J.
Father's name     : Jeevaraj M.N.
Date of Birth     : 28.11.1993
Nationality       : Indian
Sex               : Male
Company placed    : Wellsfargo India Solutions
Permanent Address : Raja Nivas, TC 5/923-1, Peroorkada,
                    Trivandrum 695 005, Kerala.
Phone Number      : +91 471 2439918
Mobile            : +91 9566817603
Email Id          : abhijith2393@gmail.com

CGPA: 8.19

Placement Details:
Position: Analyst
Location: Bangalore
