Bachelor of Technology
In
May, 2015
DECLARATION
We hereby declare that the thesis entitled Sign Language Recognition System
submitted by us, for the award of the degree of Bachelor of Technology in Electronics and
Communication Engineering to VIT University, is a record of bonafide work carried out by
us under the supervision of Prof. Vidhyapathi C.M.
We further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or diploma
in this institute or any other institute or university.
Place : Vellore
Date :
Signature of the Candidates
CERTIFICATE
This is to certify that the thesis entitled Sign Language Recognition System
submitted by Roshan P. Shajan, Rajesh Thomas, Abhijith Manohar J., School of
Electronics Engineering, VIT University, for the award of the degree of Bachelor of
Technology in Electronics and Communication Engineering, is a record of bonafide work
carried out by them under my supervision, as per the VIT code of academic and research
ethics.
The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The thesis fulfills the requirements and regulations of the University
and in my opinion meets the necessary standards for submission.
Place : Vellore
Date
Internal Examiner
External Examiner
Approved by
ACKNOWLEDGEMENT
It is our great pleasure to express our sincere thanks to our project guide and mentor, Prof.
Vidhyapathi C.M., for his guidance and continued support in the completion of this project.
His timely advice, meticulous scrutiny, scholarly counsel and technical supervision were
crucial in accomplishing this task on time.
We also owe a deep sense of gratitude to the other members of the review panel, Dr. Alex
Noel Joseph Raj and Prof. Subha Bharathi S., for their invaluable suggestions that helped to
upscale the project to a different level. It was their prompt inspirations, timely suggestions
and dynamism that enabled us to complete our thesis.
We also thank Dr. Ramachandra Reddy and Dr. Arulmozhivarman P., the Dean and Asst.
Dean of SENSE, VIT University, respectively, for giving us this opportunity to apply the
theoretical knowledge we have gained during our past four years, and for the comments that
greatly improved this manuscript. We also place on record our sincere thanks to the
Management of VIT for their support and provision of the necessary resources and facilities
for the research.
We also take this occasion to express our thankfulness to one and all who were a part of
this endeavour, directly or indirectly. To our parents for their unceasing support and
attention, the lab assistant for his cooperation and wisdom, and the members of the Kinect
Translation Tool discussion group at Codeplex.com for their advice, we are indebted to
you all.
Above all, we are most obliged to the Almighty for the good health and well-being
showered upon us to complete this venture.
Students' Names
EXECUTIVE SUMMARY
It is often difficult for hearing-impaired people to communicate with hearing people because
sign language is properly understood only by a few. This project aims to help such
people communicate effectively with hearing people at a public service information
counter. Any computer meeting the FCC Class B standard and equipped with a USB
2.0 port can be interfaced with a Microsoft Kinect to provide a low-cost and effective
means of translating sign language into written and spoken text.
The ability of the Kinect to take the motion of the human body as a command input to the
controller is the key to this system. This type of human-machine interaction, also called a
Natural User Interface (NUI), is an approach that facilitates the capture of gestures
from the user by means of the Kinect peripheral. The depth image acquired by the Kinect
is converted into skeletal data, which is processed by a machine learning mechanism (here,
Dynamic Time Warping) to train and record a number of movements as a Sign Language Library.
C# (the ECMA-334, ISO/IEC 23270:2006 standard) is used to implement the logic, while XAML is
used to design the UI. Here, we incorporate American Sign Language into the database
and map it into spoken language, consisting of some vocabulary words, with the help of the
Microsoft Text-to-Speech Engine. The database is put into use as a text file that incorporates
an 18×33 feature profile for each gesture (33 frames, each with the (x, y, z) coordinates of
the 6 joints of the upper torso).
TABLE OF CONTENTS

Title                                            Page No.

Acknowledgement
Executive Summary                                ii
Table of Contents                                iii
List of Figures                                  vi
List of Tables                                   vii
List of Abbreviations                            viii

INTRODUCTION
1.1 Objective
1.2 Motivation
1.3 Background
1.3.1 American Sign Language
1.3.2 Phonetics
TECHNICAL SPECIFICATION
10
12
13
15
17
4.1.1 MainWindow.xaml.cs                         17
4.1.2 Skeleton3DdataCoordEventArgs.cs            19
4.1.3 Skeleton3DdataExtract.cs                   19
4.1.4 DtwGestureRecognizer.cs                    19
20
20
22
PROJECT DEMONSTRATION                            23
23
COST ANALYSIS                                    27
27
27
SUMMARY                                          28
REFERENCES                                       29
31
36
42
List of Figures

Figure No.   Title                                      Page No.
1.1                                                     6
2.1                                                     8
3.1
3.2          Layer Architecture of Microsoft Kinect     10
3.3                                                     15
6.1          User Interface                             23
6.2          Loading Gestures                           24
6.3          Recognizing Gestures                       25
6.4          Show Gesture Text                          26
List of Tables

Table No.   Title                                  Page No.
3.1         Comparison of Software Frameworks      11
3.2         Kinect Algorithm Evaluation            13
List of Abbreviations

ASL      American Sign Language
RGB      Red, Green, Blue
HCI      Human-Computer Interaction
Auslan   Australian Sign Language
FANN     Fast Artificial Neural Network
NN       Neural Network
LSF      Langue des Signes Française (French Sign Language)
DTW      Dynamic Time Warping
SDK      Software Development Kit
NUI      Natural User Interface
API      Application Programming Interface

∈        Is an element of
[1:N]    An interval including all numbers between 1 and N
{ }
1. INTRODUCTION
Sign language is the language used among the deaf community to communicate with
one another. It uses a combination of hand gestures, movement of the arm and also facial
expression of the speaker to effectively convey the message. It is not international and varies
from country to country.
Gesture recognition is the interpretation of human gestures using mathematical algorithms.
This enables the user to interact directly with the computer without any intermediate
device. For recognizing the gestures, we use computer vision techniques, in which the
computer produces numerical or symbolic information by acquiring, processing and
analyzing images. Microsoft Kinect can be used to implement computer vision.
1.1. OBJECTIVE
The aim of the thesis is to obtain a detailed understanding of computer vision and gesture
recognition algorithms, and to apply them to narrow the communication gap between
hearing people and the hearing-impaired community at a public service information counter.
It involves developing an effective algorithm to recognize gestures and learning how the
Kinect camera obtains the skeletal image. In the end, a database for the ASL
vocabulary is trained and maintained.
1.2. MOTIVATION
In customer service counters like airports, banks, post offices and other public areas, the
hearing impaired people often find themselves at a disadvantage in obtaining information.
Thus, it is necessary to build a system that can translate the sign language gestures into
spoken and written form.
1.3. BACKGROUND
In the past, research on gestures was limited to regular webcams, which provided only
RGB images. The tools required to produce depth images were thus not available, and
researchers instead had to rely on processing the hand colour to obtain the hand shape [1].
Due to differences in physique and tone between people, this method was found
unsatisfactory.
The DTW algorithm was also used by Zico Pratama Putra of Hochschule Rhein-Waal
University of Applied Sciences for his sign language translation tool, but he couldn't use it
for 3D gestures [9].
1.3.1. American Sign Language
American Sign Language (ASL) is selected as the sign language to be used in this thesis. It
is the most popular sign language in the world, especially in Canada and the United States.
In addition, the various dialects of ASL are mostly adopted in the West African region and
Southeast Asia. ASL has a very close relationship with the French Sign Language (FSL).
Although sign language is manifested in a different way from spoken language, i.e., using
visual space without sound, the two share similarities in their fundamental organization.
As in other languages, sentences are structured in a complex but orderly manner. They
consist of basic units of meaning, which in turn consist of units that stand alone without
meaning. Although the different units are not expressed through sound, they relate to the
types of units traditionally studied in phonology, so the same term is generally applied to
the corresponding units in ASL. [10]
1.3.2. Phonetics
Bahan (1996) explained that each sign gesture in ASL is constructed of various distinctive
components. A sign may use two hands or one hand depending on needs. The hand may be
in a particular orientation (e.g., closed fist with one index finger extended) in a particular
location on the body or in the "signing space", and may involve movement. When one of
these elements is changed, it may result in a completely different meaning of the sign.
which are finger spelled, either very short English words or abbreviations of longer English
words, e.g. O-N from English 'on', and A-P-T from English 'apartment'.
Fingerspelling may also be used to emphasize a word that would normally be signed
otherwise.
vectors.
Comparing the recorded gestures with the database by employing Dynamic Time Warping
for recognition.
Introducing the gestures and translating them into written text or sentences and
spoken language.
A sign language gesture is made by the joints of the upper torso of the user. The Kinect
camera generates the skeletal image of the user performing the gesture from the depth
image, which is input to the system. The system recognizes the input using the DTW
algorithm and displays the output in the form of speech and text. The project can be divided
into two broad areas or modes to achieve this targeted functionality. They are:
Capture Mode:
In this mode, the Kinect camera detects the skeletal joints of the person performing the
gesture. The various joints detected by Kinect are:
Hip Center, Spine, Shoulder Center, Head, Shoulder Left, Elbow Left, Wrist Left,
Hand Left, Shoulder Right, Elbow Right, Wrist Right, Hand Right, Hip Left, Knee
Left, Ankle Left, Foot Left, Hip Right, Knee Right, Ankle Right, Foot Right
Of these, six joints are taken into consideration:
Hand Left, Wrist Left, Elbow Left, Elbow Right, Wrist Right, Hand Right
Once a gesture is performed by the user, the x, y, z coordinates of these six joints are
recorded across 33 frames, forming an 18×33 feature matrix. This is then compared with the
gestures already present in the database using the DTW algorithm and, if there is no match,
it is stored in the database.
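As an illustration, the capture step described above can be sketched in Python. This is a sketch only: the actual tool is written in C#, and the function name and the joint-dictionary layout here are our own assumptions, not the KinectDTW source.

```python
# Illustrative sketch: assemble the 18x33 feature matrix described above
# (33 frames, each contributing x, y, z for the 6 upper-torso joints).
# Names below are hypothetical, not taken from the project source.

JOINTS = ["HandLeft", "WristLeft", "ElbowLeft",
          "ElbowRight", "WristRight", "HandRight"]

def build_feature_matrix(frames):
    """frames: list of 33 dicts mapping joint name -> (x, y, z) tuple."""
    matrix = []
    for frame in frames:
        row = []
        for joint in JOINTS:
            row.extend(frame[joint])  # append x, y, z for this joint
        matrix.append(row)  # one 18-value row per frame
    return matrix
```

Each gesture is thus a 33-row, 18-column matrix, matching the feature profile stored per gesture in the database text file.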
Read Mode:
In this mode, when a gesture is performed, the x, y, z coordinates are compared with those
already in the database using the DTW algorithm. When a suitable match is found, the
system displays the output gesture name as both text and speech. If a match is not found,
the message "UNKNOWN" is displayed.
2.1. Normalization
The skeletal coordinates returned by the Kinect camera have to be normalized for each user.
This is necessary as it brings a universality to the gesture by eliminating the
problem of variations in physique.
The coordinates of the centre of the body are calculated by finding the mid-point of the
Shoulder Left and Shoulder Right joints. The origin is shifted from the Kinect axis to this
point by subtracting this point from each coordinate. Each coordinate value is then
normalized by dividing it by the distance between the Shoulder Left and Shoulder Right
joints. This enables the user to stand anywhere within the field of view of the Kinect
and obtain the coordinates with respect to his body.
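The normalization described above can be sketched as follows. This is an illustrative Python sketch under our own naming; the project itself performs this step in C# inside Skeleton3DdataExtract.cs.

```python
# Illustrative sketch of the normalization step: shift the origin to the
# shoulder midpoint, then scale by the shoulder width.
# Function and key names are our own assumptions, not the project source.

def normalize(joints):
    """joints: dict of joint name -> (x, y, z); must include both shoulders."""
    sl, sr = joints["ShoulderLeft"], joints["ShoulderRight"]
    # Body centre: midpoint of the two shoulder joints.
    center = tuple((a + b) / 2 for a, b in zip(sl, sr))
    # Shoulder distance: Euclidean distance between the shoulders.
    shoulder_dist = sum((a - b) ** 2 for a, b in zip(sl, sr)) ** 0.5
    # Shift origin to the centre, then scale by the shoulder distance.
    return {
        name: tuple((v - c) / shoulder_dist for v, c in zip(pos, center))
        for name, pos in joints.items()
    }
```

Because every coordinate is expressed relative to the user's own shoulder width, two users of different sizes performing the same gesture produce comparable feature vectors.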
3. TECHNICAL SPECIFICATION
3.1.
Software Architecture
The software used in this study is designed to run on the Kinect SDK 1.8 framework. It is
important to understand the interaction of each layer of the Kinect SDK. At the lowest level,
the Kinect SDK provides the required drivers to generate the visual and audio data from the
hardware devices. The abstract layer architecture of the Kinect Sign Language system is
shown in Figure 3.1.
The hardware components, including the Kinect sensor, are connected to a computer.
The Kinect drivers installed as part of the SDK control the streaming of audio and
video (colour, depth, and skeleton) from the Kinect sensors. The Kinect NUI processes the
audio and video components for skeleton tracking, audio, and colour and depth imaging [12].
The software developed for this project runs on top of the Kinect SDK framework to
extract the information of the user's gesture from the NUI (Microsoft.Kinect.dll) and to
provide a method of comparison to recognize the gestures. The predefined gesture data
saved in the database can be directly accessed. Gesture translation accesses the audio,
speech and media application programming interface (API) libraries from Windows 7
(Microsoft.speech.dll) to process the information taken from the gesture data. These data
are processed as information that needs to be translated into text and spoken language.
The speech recognition component simultaneously handles the user's voice by gaining
access to the Audio API from Windows and transforms the output text to speech.
3.2. Requirements
3.2.1. Hardware Requirements
The following hardware is necessary for this project:
Kinect for Windows, including the Kinect sensor and the USB hub, through which
the sensor is connected to the computer.
2 GB of RAM (4 GB recommended)
3.2.2. Software Requirements
Windows 7 standard APIs- The audio, speech, and media APIs in Windows 7, as
described in the Windows 7 SDK and the Microsoft Speech SDK.
Visual Studio 2010 or later
.NET Framework 4 (extension for Visual Studio) and XNA Framework 4.0
KinectDTW Library
3.3. Microsoft Kinect
The Microsoft Kinect was initially used as a peripheral device for the Xbox 360 gaming
console that used image-processing methods to provide hands-free control of the console. It
consists of an RGB camera, an infrared (IR) projector and a camera to provide depth
perception, an array of microphones for voice commands and a motor to adjust the position
of the sensor.
The affordability of the Microsoft Kinect has provided a tool to the wider public to access
the gesture recognition technology that was previously complex and expensive.
Today, there are two different versions of the Microsoft Kinect available: the standard Xbox
360 Kinect and the Kinect for Windows, which has been designed and supported for
application development purposes. The main difference between the two products is that
the Kinect for Windows has a near mode that allows the sensor to track users at closer
range. This project has been carried out using a Kinect for Windows v1 sensor.
3.4. Framework Evaluation
The Kinect sensor requires a driver to translate the incoming raw signals and convert
them into data usable in the application. These drivers form a layer of the application
architecture that is essential for software development.
The earlier use of the Kinect sensor for the Xbox 360, with its ability to read and track body
movements, seemed too good to be used just for recreational activities. The open source
community reverse engineered the device to create the OpenKinect framework with the
libfreenect driver, just a few days after the Kinect was marketed. The driver uses raw data
from the various sensors of the Kinect; however, it does not provide a powerful framework
for creating applications based on a natural interface environment. OpenNI and the
Microsoft SDK are preferable when the developer needs a feature like skeleton tracking.
Thus, the OpenKinect framework is not used to build this application.
The technology behind the Kinect sensor was originally designed by PrimeSense, which
released its own version of an SDK to be used with the Kinect, named OpenNI. A natural
interface application can be developed using the OpenNI framework, as it works on a
broader range of depth sensors. Since OpenNI has some tools that could speed up the
development of this project, it is able to overcome the shortcomings of OpenKinect/
libfreenect. This software comes with NITE, a middleware that can track the skeleton
by translating raw data and measuring the coordinates of the body parts, similar to the
technology used to create Xbox Kinect games [13].
Seven months after the Kinect device was released, Microsoft launched the Windows SDK,
following the high interest of developers in utilizing this device. One of the advantages of
this official SDK was that it was made directly by Microsoft, which also designed the
device. The official SDK released in the beta state was only used for testing purposes. This
SDK has been developed specifically to encourage broad exploration and experimentation
by the academic community and research enthusiasts. It provides a better API than other
frameworks for accessing the Kinect's hardware capabilities, including its four-element
microphone array. The SDK is also equipped with the Microsoft Kinect runtime, which
provides more efficient algorithms for implementing user segmentation, skeletal tracking,
and voice control. A test had been carried out earlier to compare OpenNI and the Windows
SDK by implementing the same functions.
Table 3.1: Comparison of Software Frameworks

Framework      Supported Language         Operating System      Performance              Depth Image (m)   License
Official SDK   C++, C# or Visual Basic    Windows               Faster response than     0.8 to 4          Limited
               using Visual Studio                              OpenNI and OpenKinect
OpenNI         Python, C                  Linux, Mac, Windows                            0.5 to 9          Open
OpenKinect     Python, C                  Linux, Mac, Windows   Visual Studio not        0.5 to 9          Open
                                                                required
Based on these comparisons, Microsoft SDK has a better performance than the other open
source alternatives in some applications, especially for the skeleton tracking and its fast
response. After evaluating all these available frameworks, the Microsoft SDK was found to
be the best choice for this project.
3.5. Kinect for Windows SDK
Application developers use the Kinect for Windows SDK from Microsoft Research as a
starter kit to develop a wide range of applications that use the Kinect sensor. It is expected
that, with this SDK, Kinect could be used in fields such as education and robotics, beyond
the Xbox.
The Kinect for Windows SDK comes with drivers for the sensor streams and for tracking
human motion. It was released by Microsoft for technology development with C++, C# or
Visual Basic using Microsoft Visual Studio 2010.
The Kinect SDK features used in this project are:
Raw Sensor Streams
This feature provides access to the raw data streams from the camera sensor, the depth
sensor, and the four-element microphone array.
Skeletal Tracking
This feature can track the skeleton image of at most two moving people within the field of
view of the Kinect sensor, making it easy to create movement-based programs.
At present, the more accurate Kinect v2 sensor, launched very recently, works on SDK 2.0.
SDK support for the Kinect v1 ended with SDK 1.8.
3.6. Algorithm Evaluation
There are many machine learning tools and algorithms for gesture recognition, such as
FAAST (Flexible Action and Articulated Skeleton Toolkit), Gesture Studio, SigmaNIL,
Candescent, 3DHandTracking, TipTepSkeletonizer, Kinect Auslan and Kinect Dynamic
Time Warping. A recent algorithm based on the Hidden Markov Model is also gaining
popularity. The algorithm that best meets the criteria laid out earlier is to be used for this
project.
Based on many tests, the KinectDTW algorithm was found to be the best choice to
recognize gestures with accuracy. It is fast, light, reliable and highly customizable, has
excellent performance, and does not require excessive memory. The Dynamic Time
Warping (DTW) algorithm was first introduced in the 1960s by Bellman & Kalaba, and
extensively explored in the 1970s for voice recognition applications by C. Myers. One
disadvantage of this algorithm is that it was developed using Microsoft SDK version 1, and
its developer never updated the libraries to correspond to the latest SDK. Moreover, since
SDK 1.5, Microsoft has made major changes to the SDK regarding its libraries, classes,
methods and the way to access them, which differ greatly from version 1. However, the
significance of KinectDTW cannot be sidelined, and hence this study makes an effort to
reengineer the code to run on SDK version 1.8.
Table 3.2: Kinect Algorithm Evaluation

Algorithm          Framework   Supported Programming Languages   Point Detection
FAAST              None                                          Body
Gesture Studio     None                                          Body
SigmaNIL                       C++                               Hand Palm
Candescent                     C#                                Hand Palm
3D Hand Tracking               C++                               Hand Palm
Tiptep             OpenNI      C#                                Hand Palm
Kinect Auslan      OpenNI      C++
KinectDTW                      C#                                Body
Until now, this algorithm has supported skeletal tracking and 2D vector gesture recognition
using all the joints of the upper torso (skeleton frame). The user may select the desired
gesture name to record and click the capture button prior to recording the gesture.
KinectDTW then starts a countdown before recording the gesture, allowing the user to
prepare. By default, the system records the gesture over 33 frames, with the user finishing
the training on the 33rd frame. The DTW algorithm does not care about how quickly the
gesture is performed. This project aims to extend the capability of DTW to recording and
recognizing 3D gestures.
Currently, DTW has been applied in many fields, including handwriting and online
signature matching, computer vision and computer animation, protein sequence alignment,
and chemical engineering. The basic DTW logic is given below: [14]

int DTWDistance(s: array [1..n], t: array [1..m], w: int) {
    DTW := array [0..n, 0..m]
    w := max(w, abs(n-m)) // adapt window size (*)
    for i := 0 to n
        for j := 0 to m
            DTW[i, j] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := max(1, i-w) to min(m, i+w)
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],   // insertion
                                        DTW[i  , j-1],   // deletion
                                        DTW[i-1, j-1])   // match
    return DTW[n, m]
}
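The pseudocode can be turned into a runnable sketch. The following Python port is ours, not part of the project source; the local distance d is taken here as the absolute difference, an assumption suitable for one-dimensional sequences.

```python
# Illustrative Python port of the windowed DTW pseudocode above.
# Names and the choice of local distance are our own assumptions.

def dtw_distance(s, t, w):
    """Windowed Dynamic Time Warping distance between sequences s and t."""
    n, m = len(s), len(t)
    w = max(w, abs(n - m))  # adapt window size so a path can exist
    INF = float("inf")
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(s[i - 1] - t[j - 1])  # local distance d(s[i], t[j])
            dtw[i][j] = cost + min(dtw[i - 1][j],      # insertion
                                   dtw[i][j - 1],      # deletion
                                   dtw[i - 1][j - 1])  # match
    return dtw[n][m]
```

Two identical sequences give a distance of zero, and the window parameter w bounds how far the warping path may stray from the diagonal.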
3.6.1. Theoretical Background
There are several different types of learning algorithms. The main two types are supervised
learning and unsupervised learning. In supervised learning, the idea is that users are going to
teach the computer to do something, whereas in unsupervised learning, the user is going to
let it learn by itself. The term supervised learning refers to the fact that the algorithm
contains a data set in which the correct output is given.
This DTW algorithm is an example of a classification problem in supervised learning. The
term classification refers to the prediction of a discrete value output. It turns out that
classification problems can have more than two possible output values. As a concrete
example, suppose there are three types of sign data and the system should predict a discrete
value of zero, one, two, or three, with zero being no sign, one the "Happy" sign, two the
"Hello" sign, and three the "Good" sign. This is still a classification problem, because we
have a discrete set of outputs corresponding to no sign, "Happy", "Hello", or "Good".
According to Sakoe and Chiba, dynamic time warping is a method for calculating the
similarity between two time series that may vary in time and speed, as in the case of
detecting similarities in running motion, where the first data set shows a person walking
slowly and the other shows a person running faster. DTW is widely used in speech
recognition to decide whether two waveforms represent the same spoken phrase.
The DTW here compares two feature sequences sampled at equidistant points in time,
X = (x1, x2, …, xN) of length N and Y = (y1, y2, …, yM) of length M, in the feature space F,
such that xn, ym ∈ F for n ∈ [1:N] and m ∈ [1:M].
This comparison is done by generating a cost matrix c(x, y), by finding the pairwise
Euclidean distances of each x and y. The local cost matrix for the alignment of the two
sequences is given by
c: F × F → R≥0,  C(i, j) = |xi − yj|,  i ∈ [1:N], j ∈ [1:M]
When this matrix is created, the algorithm finds the alignment path that runs through the
low-cost areas of the matrix. This path is the warping path.
The warping path built by DTW is a sequence of points p = (p1, …, pL) with pl = (nl, ml)
∈ [1:N] × [1:M] for l ∈ [1:L], satisfying the following criteria for successful matching.
Boundary Condition: the initial and final points of the aligned sequences must be
the starting and ending points of the warping path, i.e., p1 = (1, 1) and pL = (N, M).
Step Size Condition: while aligning sequences, the warping path is restricted from
shifts in time (long jumps): pl+1 − pl ∈ {(1,0), (0,1), (1,1)} for l ∈ [1:L−1].
DESIGN APPROACH
To achieve the gesture recognition, the KinectDTW is modified. It encompasses four main
classes, as given below:
1. MainWindow.xaml.cs - Interface logic for the MainWindow.xaml
2. DtwGestureRecognizer.cs - Class that uses Dynamic Time Warping to compare with
the nearest neighbour
3. Skeleton3DdataExtract.cs - Class to transform and normalise the skeletal data
4. Skeleton3DdataCoordEventArgs.cs - Class that retrieves the skeletal frame
coordinates
A DtwGestures object takes the Kinect runtime feed and saves it to the .txt file.
A checkGesture() event on the object from Skeleton3DdataCoordEventArgs is called on
clicking "Show Gesture Text", which returns the string representing the detected gesture.
For introducing the three-dimensional coordinate, the XNA Framework reference was
added, which provided the required vector class. This allowed us to increase the range of
gestures that could be given by the user, distinguished within a threshold of 10 cm along
each axis.
To make the system more interactive, the inbuilt Microphone array in the Kinect v1 Sensor
was used for speech recognition. This could be sufficiently expanded and improved to make
the entire system completely hands free.
4.1.1. MainWindow.xaml.cs
This C# file implements the interaction logic for the MainWindow.xaml file that takes care
of the design of the interface. It controls how depth data gets converted into false-color data
for more intuitive visualization; thus the 32-bit color frame buffer versions of these are kept,
which gets updated whenever a 16-bit frame is received and processed.
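The false-colour step can be illustrated with a small sketch. This is Python with our own names and scaling; the actual conversion in MainWindow.xaml.cs is written in C#, and its exact colour mapping may differ.

```python
# Hypothetical sketch of the depth-to-false-colour step described above:
# map each 16-bit depth value to a 32-bit BGRA pixel.
# The grayscale mapping and the 4-metre range cap are our own assumptions.

def depth_to_false_color(depth_frame, max_depth=4000):
    """depth_frame: list of 16-bit depth values in millimetres."""
    pixels = []
    for d in depth_frame:
        # Nearer objects are rendered brighter; clamp at max_depth.
        intensity = 255 - min(255, (d * 255) // max_depth)
        pixels.extend([intensity, intensity, intensity, 255])  # B, G, R, A
    return pixels
```

Each 16-bit depth sample thus expands to four bytes in the 32-bit colour frame buffer, which is why the buffer is refreshed whenever a new 16-bit frame arrives.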
Separate dictionaries are maintained of all the joints the Kinect SDK is capable of tracking
and of the ones we focus on. The Ignore variable stores which frames are to be captured or
neglected (1 = capture every frame, 2 = capture every second frame, etc.). BufferSize holds
the number of frames to be stored in the video buffer (here 32), and MinimumFrames stores
the minimum number of frames to be kept in the video buffer before we attempt to start
matching gestures (here 10). The countdown time before attempting to start matching
gestures is set, and the gesture saving location and filename are specified. A flag indicates
whether or not the gesture recognizer is capturing a new pose. The _vectordimension
variable stores the dimension of the position vector of the joints (2 = 2D, 3 = 3D, 4 = 3D
plus W coordinate). The number of joints to be tracked is specified, as is the minimum
length of a gesture. Measures of the Final Position Threshold and Sequence Similarity
Threshold are also specified.
A LoadGesturesFromFile function opens the text file containing the gesture information
and creates a DTW recorded gesture sequence. It defines how to differentially store the
coordinate values, the gesture names and the joint names.
4.1.2. Skeleton3DdataCoordEventArgs.cs
This class takes the skeletal frame coordinates and converts them into a form useful to the
DTW algorithm. The positions of the elbows, the wrists and the hands of both arms are
recorded and stored. SkeletonSnapshot returns a map of the joints to their positions in
a particular frame.
4.1.3. Skeleton3DdataExtract.cs
This class is used to transform the skeletal data thus obtained. The coordinates of the eight
points extracted are stored in the array p. To make the system recognize gestures from
users of any physique, the center of the camera is shifted to the center of the user by
finding the midpoint of the ShoulderLeft and ShoulderRight points. The shoulder distance,
denoted by ShoulderDist, is given by the Euclidean distance between the two points. The
midpoint, which is roughly the center of the body, is stored in the variable center. Each
coordinate is made relative to this center by subtracting the center value from it. Further,
the coordinates are normalized by dividing each of the six points taken by the shoulder
distance.
4.1.4. DtwGestureRecognizer.cs
The DtwGestureRecognizer class is a Dynamic Time Warping nearest-neighbour sequence
comparison class. Information such as the size of the observation vectors, the number of
data points required, the position and recognition thresholds, the minimum length of a
gesture before being recognized, and the maximum distance between an example and the
sequence being classified is stored here. The DTW constructor is overloaded to initialize
these values. The AddOrUpdate function adds a sequence with a label to the known
sequences library, provided the gesture starts on the first observation of the sequence and
ends on the last one. The Recognize function recognizes the gesture in the given sequence.
It works on the assumption that the last observation of the sequence marks the end of the
gesture. No gesture will be recognized if the overall DTW distance between the two
sequences is too high. A gesture is recognized as long as it hits the threshold; though this
means all gestures that hit the threshold are recognised, only the gesture that is most similar
should be returned. The RetrieveText function retrieves a text representation of the _label
and its associated _sequence; it displays the debug information, which the user may save to
a file, and returns a string containing all recorded gestures and their names. The
RetrieveGest function is called only when the gesture file is loaded, to display the gesture
names in the database. The Dtw function computes the minimum DTW distance between
the inputSequence and all possible endings of recorded gestures. To compute the distance
between a frame of the inputSequence and a frame of a recorded gesture, the
CalculateSnapshotPositionDistance function is used.
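The nearest-neighbour matching described above can be sketched as follows. This is illustrative Python with our own names and a plain, unwindowed DTW; the real DtwGestureRecognizer.cs differs in details such as its thresholds and its matching against all possible endings of recorded gestures.

```python
# Illustrative sketch (our own names, not the project source) of
# nearest-neighbour DTW matching: compare an input sequence against every
# stored gesture and return the closest label, or "UNKNOWN" over a threshold.

def frame_distance(a, b):
    # Euclidean distance between two frames (flat coordinate lists),
    # in the spirit of CalculateSnapshotPositionDistance.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw(seq_a, seq_b):
    """Plain DTW distance between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(input_seq, library, threshold):
    """Return the label of the closest stored gesture below the threshold."""
    best_label, best_dist = "UNKNOWN", threshold
    for label, stored_seq in library.items():
        dist = dtw(input_seq, stored_seq)
        if dist < best_dist:  # keep only the most similar match
            best_label, best_dist = label, dist
    return best_label
```

Keeping only the minimum distance implements the point made above: even if several gestures fall under the threshold, only the most similar one is returned.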
4.2.
4.3.
Use of software running on a PC versus an embedded system: though for practical use an
embedded system at a public service information counter would be the ideal choice, it
reduces scalability, given that the system is still in the developmental phase. The Kinect
camera's easier interaction with the Software Development Kit and Visual Studio running
on a PC provides an easier way to test the various parameters and work on improvements,
such as introducing the z-coordinate to the skeletal data. Also, since it is just a piece of
software, downloadable upgrades can easily be made available, and it is easier and more
interactive for the end user.
One person Input versus Better Performance: While gesture input involving more than one
person at a time could enhance the range of the different sign languages, it would require
more complex supervised learning by means of Dynamic Time Warping, often resulting in
mismatches and/or a delay in the performance of the system.
Today, there are algorithms more advanced and efficient than DTW, like the Hidden
Markov Model, on which research is still ongoing. Also, instead of a voice input, a specific
gesture could take care of each command.
The work was divided amongst the team after laying out a rough schematic of the entire
project. Most of the tasks required the entire team's effort. Though each task had a leader
who was responsible for its completion, sometimes a group member worked on another
member's task at a different point in time, so as to provide a varied perspective or a fresh
set of eyes. This allowed the blending of ideas so that the most effective outcome was
obtained in the least possible time frame. Even in certain impasse situations, this strategy
allowed the work to go forward.
Various online forums and discussions were consulted, and the information collected was
used to solve the hundreds of bugs in the code. Usability testing was performed at each
stage of the project to look for possible enhancements. For instance, the need to extend
the system to three-dimensional gestures was felt only after considerable work on 2D
gestures; the subsequent expansion of the entire UI to speech-based input was arrived at
in the same modus operandi.
Expert opinions from the project panel often suggested ways to improve the project. A
wide variety of information could be gathered from these suggestions, giving a new
dimension to the work being done.
A Gantt chart outlining important tasks and goals can be seen in Appendix C.
6. PROJECT DEMONSTRATION
One of the group members records two gestures that appear only slightly distinguishable
because of the three-dimensional plane involved, and two gestures that are exactly the same
but performed at slightly different coordinate levels. Another group member then uses the
system in Read mode and tries to perform these gestures. The latter two gestures are
performed in a way that demonstrates the threshold levels with which the gesture
recognition works (roughly 10 cm along each axis). The demonstration thus shows
the system's ability to differentiate between 2D and 3D gestures, and its capacity to
recognize a gesture irrespective of the physique of the person.
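The per-axis tolerance described above can be illustrated with a small sketch. The function name and the exact comparison are assumptions; the thesis only states a threshold of roughly 10 cm along each axis:

```python
THRESHOLD_M = 0.10  # roughly 10 cm per axis; Kinect reports positions in metres

def within_threshold(joint_a, joint_b, threshold=THRESHOLD_M):
    # Two joint positions (x, y, z) are treated as the same pose only
    # when every axis differs by less than the threshold
    return all(abs(a - b) < threshold for a, b in zip(joint_a, joint_b))
```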
6.1. Project Walkthrough
1. Open the Solution file of the project and check that all the prerequisite software/DLLs
(references) are added. Put Visual Studio into Debug mode, then build and run the project.
2. The Main Window XAML window appears as shown above. A menu with the options Start
Capture, Load Gesture, Save To File and Show Gesture on the left side, an RGB
camera feed, a depth feed and a skeletal feed derived from the depth feed, Start and Stop
buttons, and two text boxes are displayed on the output window.
3. The user is free to use the mouse or his voice to navigate the entire UI. The user
should then press the Start button to begin the session.
4. Step into the field of view of the Kinect sensor and the skeleton will be tracked
automatically. The program is developed for single-person gesture recognition, so it
should be made sure that the Kinect feed is not subjected to more than one person.
5. Load the sample gestures by clicking Load Gesture and navigating to the supplied
DefaultGestures.txt file (Figure 6.2). If using voice, say "Load Gesture".
6. Start performing some gestures. The names of the gestures already in the database are
shown on the left. Upon a successful match with a displayed, previously recorded
gesture, the match appears in the text panel at the top of the screen (Figure 6.3).
7. Stop debugging the app and restart it.
8. Record your own gestures. Once the skeletal feed comes up on the screen, enter the gesture
name and then click the Start Capture button to record; if using voice, say the appropriate
gesture name and then "Start Capture". The user has three seconds to get into place
and start performing the gesture. The system is currently programmed to look at 32 frames
(i.e. every alternate frame over 64 consecutive frames).
9. The user has to make sure that the whole gesture is completed within the span of 32
frames. This might mean varying the speed of the gesture to fit this time frame; otherwise,
the rate at which the gesture is performed is immaterial to the DTW algorithm.
10. When the recording of a gesture is finished, the system automatically switches back into
Read mode, so the new gesture can be tested a few times before committing to it. If the
result is unsatisfactory, the gesture can be re-recorded and tried again.
11. The details of the gestures just recorded or the ones loaded can be found by clicking Show
Gesture or by speaking the same.
12. Once the user is happy with the results, the gestures can be saved to a file.
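The 32-frame capture window described in steps 8 and 9 (every alternate frame over 64 consecutive frames) amounts to a simple decimation. A minimal sketch, with assumed names:

```python
CAPTURED_FRAMES = 64  # consecutive sensor frames observed during a recording
STORED_FRAMES = 32    # frames actually kept as the gesture template

def downsample(frames):
    # Keep every alternate frame, so 64 captured frames yield 32 stored frames
    return frames[::2]
```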
7. COST ANALYSIS
7.1. Marketing Analysis
Though there are numerous sign language systems that work with DTW, none of them
has the ability to work with 3D gestures, and adverse operating conditions often make
the working of such systems difficult and unpredictable.
This system has a scalable interface that can easily be modified further
according to the preference of the user. The system could also support communication
in the reverse direction, using avatars and animated videos, but that requires a lot
more work, since an animated video has to be made for each new gesture.
7.2. Cost Analysis
Member    Hours Per Week (hours)    Total Number of Weeks    Total Hours
1         40                        17                       680
2         40                        17                       680
3         40                        17                       680
Total                                                        2040
8. SUMMARY
The group is currently working on improving the processor utilization of the
system by limiting unnecessary execution of DTW. At present, DTW runs
perpetually for the entire session while the code is running, trying to find a match. Since in
most real-world gestures the hand movement is above the hip, this can be laid out as a
limiting condition for running DTW. Getting the Kinect v2 sensor to work with this code
is also expected to improve the accuracy, by detecting the finger joints.
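The proposed limiting condition could be as simple as a gate checked before each DTW pass. A hypothetical sketch (the joint names follow the Kinect v1 JointType convention, but the skeleton structure used here is an assumption):

```python
def should_run_dtw(skeleton):
    # skeleton: dict mapping joint name -> (x, y, z); run the expensive
    # DTW matching only while the hand is above hip level
    hand_y = skeleton["HandRight"][1]
    hip_y = skeleton["HipCenter"][1]
    return hand_y > hip_y
```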
APPENDIX A
(Key parts of the code)
// Excerpt: copying the tracked joint positions into the point array p[]
// that forms one DTW frame (continuation of a switch on JointType).
        break;
    case JointType.ElbowRight:
        p[3] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
        break;
    case JointType.WristRight:
        p[4] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
        break;
    case JointType.HandRight:
        p[5] = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
        break;
    case JointType.ShoulderLeft:
        shoulderLeft = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
        p[6] = shoulderLeft;
        break;
    case JointType.ShoulderRight:
        shoulderRight = new Vector3(j.Position.X, j.Position.Y, j.Position.Z);
        p[7] = shoulderRight;
        break;
}
// Excerpt: filling the DTW cost table backwards so that every possible
// ending of the recorded gesture is considered.
        else
        {
            // Move diagonally down-right
            if (tab[i + 1, j + 1] == double.PositiveInfinity)
            {
                tab[i, j] = double.PositiveInfinity;
            }
            else
            {
                tab[i, j] = CalculateSnapshotPositionDistance(inputSequence, i,
                                recordedGesture, j) + tab[i + 1, j + 1];
            }
            horizStepsMoved[i, j] = 0;
            vertStepsMoved[i, j] = 0;
        }
    }

    // The best match is the smallest accumulated cost over all input frames.
    double bestMatch = double.PositiveInfinity;
    for (int i = 0; i < inputLength; ++i)
    {
        if (tab[i, 0] < bestMatch)
        {
            bestMatch = tab[i, 0];
        }
    }
    return bestMatch;
// Excerpt: any failure while parsing the gesture file simply aborts the load.
catch
{
    return;
}
APPENDIX B:
Gesture File
(For a sample gesture "Stop")

[Excerpt omitted: the saved gesture file consists of the gesture name ("Stop") and,
grouped under joint labels (HandLeft, WristLeft, ElbowLeft, ElbowRight, WristRight,
HandRight), one coordinate value per line for every recorded frame; individual gestures
in the file are separated by a "---" line. The several hundred lines of raw coordinate
values are not reproduced here.]
APPENDIX C:
Gantt Chart
CURRICULUM VITAE

Name              : Roshan P. Shajan
Father's name     : P.U. Shajan
Date of Birth     : 21.12.1992
Nationality       : Indian
Sex               : Male
Company placed    :
Permanent Address :
Phone Number      :
Mobile            : +91 7200157374
Email Id          : roshanpshajan@gmail.com

CGPA: 8.00
Examinations taken:
GRE: 314
Placement Details:
Position: Associate Software Engineer
Location: Bangalore
CURRICULUM VITAE

Name              : Rajesh Thomas
Father's name     : J. A. Thomas
Date of Birth     : 19.05.1993
Nationality       : Indian
Sex               : Male
Company placed    : Accenture
Permanent Address : Privy Garden, TC 11/691, Nanthencode, Kawdiar PO,
                    Trivandrum 695 003, Kerala.
Phone Number      : +91 471 2310587
Mobile            : +91 9486225587
Email Id          : rajeshtheeinstein@rocketmail.com

CGPA: 8.44
Examinations taken:
GRE: 309
TOEFL: 103
CAT: 77.8 percentile
Placement Details:
CURRICULUM VITAE

Name              : Abhijith Manohar J.
Father's name     : Jeevaraj M.N.
Date of Birth     : 28.11.1993
Nationality       : Indian
Sex               : Male
Company placed    : Wellsfargo India Solutions
Permanent Address : Raja Nivas, TC 5/923-1, Peroorkada,
                    Trivandrum 695 005, Kerala.
Phone Number      : +91 471 2439918
Mobile            : +91 9566817603
Email Id          : abhijith2393@gmail.com

CGPA: 8.19
Placement Details:
Position: Analyst
Location: Bangalore