Abstract—This paper outlines the ongoing development of a pattern recognition technique for recognizing the finger-spelled alphabet of the American Sign Language (ASL) vocabulary and converting it to text. This work is phase two of a broader ongoing project. The recognition methodology applies a group of classification techniques, namely Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and both linear and kernel Support Vector Machines (SVM). We describe the approach used to obtain individual frames from a real-time webcam video, crop the part of each image that will be used for classification, and then run each of the classifiers for recognition. Finally, the intermediate results of each classifier are discussed along with future work. We also use a spell checker to correct and ensure the accuracy of the output text on screen.

Index Terms—Image segmentation, Clustering, American Sign Language, Finger Spelling, PCA, LDA

I. INTRODUCTION

The most common mode of communication amongst the deaf community is sign language. It comprises gestures and visual cues which may or may not be accompanied by motion. In this work we use the finger-spelling alphabet of the American Sign Language [1] vocabulary. American Sign Language (ASL) is the fourth most commonly used language in the United States and Canada [2]. Finger spelling uses signs for each letter of the alphabet to spell out a complete English word. In ASL the finger-spelling gestures are made with a single hand, and most of them do not require motion (except for the letters ‘j’ and ‘z’). Each finger-spelled letter is distinguished by the positioning of the signer's fingers; each configuration that makes a letter of the alphabet is called a handshape.

In this project our approach is to use Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM) as image classification techniques, to recognize the finger-spelled alphabet in front of a webcam, using MATLAB. This process initially requires us to train our classifiers using a training set of images of all 27 signs (26 letters and space) of the ASL alphabet. During testing, only the appropriate image frames from the webcam are extracted and manipulated so that they can be fed to the classifier. Finally, the spelled letter is recognized and can be displayed as text in real time. Frames from a webcam are taken at specific intervals, and the images of individual finger-spelled letters are segmented out of them. The method of segmenting the individual letters will be explained in the course of the paper. The images are then cropped to reduce the effect of the background and other insignificant details that may impair classification. Once the images are cropped they are fed to the classifiers. The results obtained from each of the classifiers are discussed. After the letters have been classified they are immediately displayed on the screen, and the webcam continues to obtain new images so that the process runs in real time. In this phase we have introduced a new feature which verifies the spelling of the text displayed on the screen.

II. PREVIOUS WORK

ASL being an inherently gestural language, many previous works have also taken into account the background scene, facial expressions and eyebrow movements to recognize the signs [3], [4]. In this project we only consider the finger-spelled alphabet of ASL. In many previous works the signer wears input devices, such as gloves, while signing. The position of, and the transition between, each letter are recorded and these inputs are used to extract and recognize the signed letters. The gloves worn by the user help in obtaining hand position and orientation data [5], which is sent to a computer for further analysis. Though this approach has proved to yield acceptable results, it is an invasive and expensive solution. Some approaches that involved image processing used the Mean Square Error (MSE) and recognized the letters using the lowest MSE, while other approaches used images collected from more than one camera, with algorithms varying from Hidden Markov Models to modeling the position and velocity of the hand [6] to Neural Networks. The research that most closely correlates with what we propose to accomplish uses the SecDia Fisher Linear Discriminant (FLD) [7], in which the training images are rearranged based on the secondary diagonal; after this rearrangement, Fisher Linear Discriminant analysis is applied to the images to obtain fingerspelling recognition.

III. APPROACH

In this project we consider the alphabet of ASL and try to convert it to digital text. The approach to obtain and recognize the finger-spelled letters can be formulated in 3 steps, as shown:

1. Obtaining required image data – Data Acquisition. The data acquisition comprises obtaining the required
images from the webcam during run-time.
2. Manipulating the data for classification – Image Preprocessing.
3. Extracting salient features of the data, classifying the data based on the above-mentioned algorithms, and recognizing the letters – Feature Extraction and Classification.

Fig. 1. Block diagram of the different stages of the finger-spelling recognition system.

IV. IMPLEMENTATION

Training Data
1. Images of the 26 letters and a sign for the space between words, giving a total of 27 signs with 9 samples each, were obtained. Three subjects (each repeating 3 instances of a letter) performed the signing. These static images are used as our training set.
2. The training images were cropped to obtain the region of interest and discard insignificant detail, such as the background, that might reduce the accuracy of classification.
3. The images are resized, since PCA and LDA need all training and testing images to have the same dimensions.
4. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and two variations of the Support Vector Machine (SVM), linear SVM and kernel SVM, are run on these images, and the performance and clustering ability of all these algorithms are obtained.

Data Acquisition
The data acquisition can be looked at as two parts.
First, for the training data set, images were captured using a webcam. Each letter was imaged 9 times, with 3 images per signer. The captured images were of size 352x288. For ease of computation, and to eliminate artifacts arising from the background, the images were captured on a plain black background. Having a plain black background also makes it easy to crop the images using thresholding.

Second, for processing the input video, a webcam was interfaced with MATLAB and 12 frames were captured with a pause of 0.01 s between frames. Out of these 12 frames only the ones showing finger-spelled letters were saved, while the rest, which captured the transition between two letters, were discarded. To accomplish this, each frame was compared to the previous one and the difference between the two images was calculated. If there was little motion between the images, the resulting difference image is dark (black), while if the difference is large (i.e. large motion between images) the difference image looks white where the change took place, as shown in Fig. 2. This information was used to decide which image had less relative motion or change from the previous image; a still image indicates that a letter has been spelled. Hence, for the 12 frames we have 11 difference images. The closer these difference images are to 0, the less motion or change they contain; the further they are from 0, the more change was observed between frames. So the Euclidean distance of each of these difference images from 0 was calculated, and finally, based on a heuristic threshold value (e = 7 or 8, depending on the background and illumination), the images with smaller values of e (i.e. small differences) were preserved.

Fig. 2. Two images out of the 12 images that were captured using the webcam are shown. Their difference is shown below, which determines whether the frame should be cast away or kept for classification.

Finally, among the preserved images, consecutive ones were also cast off, since they represent the same letter, which would otherwise be considered more than once. We look at the 8 past images, and if any of them is the same letter, the current one is not considered. Hence, the user or signer has to hold a sign for at most 8 frames, or 800 ms, otherwise that letter would be considered again.

Image Preprocessing
The images obtained are color (RGB) images. Since we do not extract any features from the color information, all the images are converted to gray scale. The gray-scale images are then cropped to obtain the region of interest and remove the background. Binary thresholding is performed to obtain a black-and-white image, which helps in finding the coordinates used to crop the image. This eases the cropping of the images. The black-and-white images are cropped by detecting three edges, the left, the right and the top edge, as shown in Fig. 3. The edges are detected by counting the number of white pixels encountered in each line, since the white pixels represent the hand. The number of white pixels in each line is counted
starting from top to bottom to find the top edge, from left to right to find the left edge, and from right to left to find the right edge. When more than 10 white pixels are encountered in a line, that line is detected as the beginning of the hand, and its coordinates are used to crop the gray-scale image. A cushioning parameter is also provided, which leaves a gap of that many pixels between the coordinate found by detection and the actual cropping line.

Fig. 3. Thresholding the image to find the coordinates of the boundary and then cropping the gray-scale image.

The cropped images will be of varying sizes due to the distance between the hand and the camera, subject variability and the change in handshape for each letter. To overcome this problem the images are resized to 120x200. This resizing has to be done since the PCA and LDA algorithms run only when the training and testing data have the same dimensions. This process is also applied to the training images while modeling the classifier.

Feature Extraction and Classification
For feature extraction and classification we have used PCA and LDA in this phase of our work. We have used two types of PCA:
1. Global PCA – Principal component analysis is done on the training images of all the classes (i.e. letters) together to find a universal linear feature subspace, and then each training image is projected onto this subspace to obtain the feature vector for each class. During testing, we project the testing image onto the universal linear subspace to obtain the feature vector for the test image, find the distance of this vector to each of the trained class vectors using the Euclidean distance metric, and classify based on the nearest-neighbor rule.
2. Individual PCA – Principal component analysis is done on the training images of each class separately to obtain a linear feature subspace for each of the classes. During testing we project the testing image onto each of the subspaces and reconstruct the image using the projections from each subspace, then we find the reconstruction error of each reconstructed image with respect to the original test image and classify based on which class gave the smallest reconstruction error.
3. Linear Discriminant Analysis – The methodology we have adopted for LDA is based on the work of Belhumeur et al. [8]. During training, we perform principal component analysis to reduce the dimensionality of the training images and then perform linear discriminant analysis to find the universal linear feature subspace; each training image of reduced dimensionality is then projected onto this subspace to obtain the feature vector for each class. During testing, we project the testing image onto the universal linear subspace to obtain the feature vector for the test image, find the distance of this vector to each of the trained class vectors using the Euclidean distance metric, and classify based on the nearest-neighbor rule.
4. Support Vector Machine (SVM) – A linear SVM has been used to classify each of the classes. To reduce dimensionality, the PCA coefficients were fed into the SVM classifier. For a large number (N_Tr = 8) of training samples the classifier returned 100% accuracy. The SVM implementation is based on the method illustrated in [9].

Spell checker
A new feature has been incorporated in this phase of the project: we have added a spell-checker module built on the Levenshtein distance metric. It is introduced as a post-processing step to improve the accuracy of the text returned as the output of the recognition phase. The current vocabulary library consists of 58112 English words. The number of words in the vocabulary can be varied to increase or decrease the size of the library; we have limited it to the present collection of words to keep the computational time low.

The purpose of adding this feature is to rectify errors that can occur due to the insertion of an extra character, the deletion of a required character, or a wrong character substitution, which can be viewed as errors propagated from the image capture stage. It can also correct errors due to misclassification at the recognition phase. The spell checker compares the recognition output to each of the words contained in the library; if a word is found to be a perfect match, it is assigned a score of zero and the search stops. If there is no exact match, the checker assigns the word the dictionary element with the minimum Levenshtein distance. We have concluded that, since it is a post-processing technique, it can improve the results without adversely affecting the outcome in most cases.

V. EXPERIMENTAL RESULTS

For our experiments, we generated a database of nine images for each of the twenty-six letters and a sign for the space between words. The nine images per letter were captured using three signers (three images per signer). The images were captured with consistent illumination and without any lateral or rotational changes. We tested all four of our classifiers (Global PCA, Individual PCA, LDA and SVM) by partitioning the database into training and testing sets of different sizes. The recognition accuracy for the various cases is shown in Table I.
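As a minimal illustration of the still-frame selection heuristic described in the Implementation section, the following Python sketch keeps a frame only when its difference image from the previous frame has a small Euclidean norm. This is not the authors' MATLAB code: the function name, array shapes and the default threshold are assumptions for the sketch (the paper tunes its own threshold, e = 7 or 8, to its background and illumination).

```python
import numpy as np

def select_still_frames(frames, threshold=7.0):
    """Keep only frames that changed little from the previous frame.

    frames: list of 2-D grayscale arrays.
    threshold: heuristic cutoff on the Euclidean norm of the difference
    image (illustrative value; its scale depends on pixel range and size).
    """
    kept = []
    for prev, cur in zip(frames, frames[1:]):
        # Difference image: near zero (dark) when there is little motion.
        diff = cur.astype(float) - prev.astype(float)
        # Euclidean distance of the difference image from the zero image.
        dist = np.linalg.norm(diff)
        if dist < threshold:
            kept.append(cur)  # little motion: a letter is being held
    return kept
```

With 12 captured frames this yields up to 11 difference images, matching the scheme in the paper; the first frame can never be kept because it has no predecessor.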
4
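The spell-checking step described above — exact match scores zero and stops the search, otherwise the vocabulary word with the minimum Levenshtein distance is chosen — can be sketched as follows. The function names and the tiny vocabulary in the usage note are hypothetical; the paper's module runs in MATLAB over a 58112-word library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: insertions,
    deletions and substitutions each cost 1."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n]

def correct(word, vocabulary):
    """Return the word itself on an exact match (score 0, search stops),
    otherwise the vocabulary entry with minimum Levenshtein distance."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda w: levenshtein(word, w))
```

For example, with the toy vocabulary ["hello", "world"], the misrecognized output "helo" (one deleted character) would be corrected to "hello", while an exact match is returned unchanged without scanning the rest of the library.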