
Shape Recognition for Irish Sign Language Understanding

LIVIU VLADUTU
Dublin City University
School of Computing
Glasnevin, Dublin 9
IRELAND
lvladutu@computing.dcu.ie

Abstract: The recognition of humans and their activities from video sequences is currently a very active area of
research because of its many applications in video surveillance, multimedia communications, medical diagnosis,
forensic research and sign language recognition. Our system is designed to precisely identify human
gestures for Irish Sign Language (ISL) recognition. The system is to be developed and implemented on a standard
personal computer (PC) connected to a colour video camera. The present paper tackles the problem of shape
recognition for deformable objects like human hands using modern classification techniques derived from artificial
intelligence.

Key–Words: Statistical Learning, Shape recognition, Sign-Language

1 Introduction

The purpose of the project is to develop a system for Irish Sign Language (ISL) understanding. Sign languages are the native languages by which communities of the Deaf communicate throughout the world. Despite the great deal of effort spent on sign language recognition so far, most existing systems can achieve good performance only with small vocabularies or gesture datasets. Increasing the vocabulary inevitably incurs many difficulties for training and recognition, such as the large size of the required training set and variations due to signers and recording conditions. Up to now, Deaf people have usually had to communicate through an interpreter or through written forms of spoken languages, which are not the native languages of the Deaf community. The aim of the project is to develop this system using vision-based techniques, independent of sensor-based technologies (using gloves), which can prove to be expensive, uncomfortable to wear and intrusive, and which limit the natural motion of the hand. The images are extracted from simple 'one-shot' video-streams, where only an individual gesture is executed, not linked to consecutive signs (as in a normal sign-language conversation).

1.1 Main steps

The classical steps of this type of human-computer interaction (HCI) system that were implemented by members of the team (see also [2, 3]) are:
- hands and face detection;
- tracking of the above-mentioned human body parts using Hidden Markov Models;
- shape coding and classification using Machine Learning techniques;
- elimination of small-area occlusion problems (hand-hand or hand-face occlusion) using motion estimation and compensation (Figure 1 below);
- construction of a faster implementation aiming at a real-time ISL understanding system, using faster programming environments (mex programs based on a C++ implementation).

1.2 Short description

The work presented in the current paper investigates the detection of subunits (that compose a sign) from the viewpoint of human motion characteristics. In the model, a subunit is seen as a continuous hand action in time and space; therefore, a clear understanding of the hand shapes at certain moments in time, the Representative Frames (RFrames), is essential for human action understanding. One of the problems we faced is that the hand is a highly deformable articulated object with up to 28 degrees of freedom. Also, the skin detection used for segmentation (see [2, 7]) is based on the assumption that skin color is quite different from the colors of other objects and that its distribution might form a cluster in some specific color-spaces. Even in the case of specific difficult conditions, i.e. the fast segmentation imposed by the online requirement of the design, workarounds were found. The previous coding approaches were dictated by classical implementations in the field using principal component analysis (PCA), like [11], influenced by new insights from machine learning and statistical learning theory [13].

We detect the skin by combining 3 useful features: colour, motion and position. Together, these features identify the skin-colour pixels that are more likely to be foreground pixels and that lie within a predicted position range. Machine Learning is a rich field of knowledge discovery in data, but the severe restriction of ultimately having a real-time ISL recognition system running on a PC (not a server), with a regular digital camera providing the input data, forced us to limit our tests to an acceptable classification method. The material was a database of video streams created by our group using a PC and a handycam, but also a proprietary database of synthetic images (using virtual signers) created with Poser and the Python programming language.

Figure 1: Example of an image from a real signer (above) and the equivalent one from a Poser-generated video (below).
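As an illustration of the skin-detection step described above, the three per-pixel cues (colour, motion and predicted position) can be combined by a simple mask intersection. The sketch below is a minimal Python illustration under assumed representations (a colour-probability map and two boolean masks); the function name and the threshold form of the colour cue are hypothetical, not the project's actual implementation:

```python
import numpy as np

def skin_mask(colour_prob, motion_mask, position_mask, threshold=0.5):
    """Keep only the pixels that are skin-coloured, moving (foreground)
    and inside the position range predicted by the tracker.

    colour_prob   : HxW array of per-pixel skin-colour probabilities
    motion_mask   : HxW boolean array, True where motion was detected
    position_mask : HxW boolean array, True inside the predicted region
    """
    return (colour_prob > threshold) & motion_mask & position_mask
```

In the actual pipeline, the colour probabilities would come from the skin model of Section 2.1 and the position mask from the HMM-based tracker.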
2 The proposed approach

2.1 The skin model

The current work was based on a large experience in skin-model detection ([2, 7] and related work), under the presumption of the human bodies being "under uniform lighting". The proposed algorithm is fully automatic and adaptive to different signers (real persons or virtual signers, implemented in Poser). The skin detector is responsible for segmenting skin objects like the face and hands from video frames, and it works 'in tandem' with the tracker, which keeps track of the hand location and detects any occlusions that might happen between skin objects. This feature (the skin detector closely interacting with the HMM-based tracker) has also solved the problem of small occlusions (hand-hand or hand-face), see Figure 1. The skin detector + tracker scheme can be seen at work in [12], but the tracking is not the object of the current paper. The present work tackled mainly one-handed gestures of ISL; see also "The Standard Dictionary of Irish Sign Language" [14].

2.2 The features space

To evaluate the retrieval performance of the proposed method, a large number of experiments were carried out on a set of images recorded by a member of our group and on a database of images created using Python and Poser, see [1]. I have used descriptors from MPEG-7, formally known as the Multimedia Content Description Interface, which includes standardized tools (descriptors, description schemes, and a description language) enabling structural, detailed descriptions of audiovisual information. Two MPEG-7 visual descriptors are used in the experiments. The Color Layout descriptor (CLD) provides information about the spatial color distribution within images. After an image is divided into 64 blocks, this descriptor is extracted from each of the blocks based on the Discrete Cosine Transform. We can evaluate the distance between two CLD vectors, using the luminance and the 2 chrominance channels, with the formula:

S_CLD(Q, I) = sqrt( Σ_i w_Yi (Y_Qi − Y_Ii)² ) + sqrt( Σ_i w_Cbi (Cb_Qi − Cb_Ii)² ) + sqrt( Σ_i w_Cri (Cr_Qi − Cr_Ii)² )

where w_i represents the weight associated with coefficient i. There are 12 coefficients extracted for the color layout descriptor (6 for Y, and 3 each for Cb and Cr). A more detailed description can be found in the MPEG-7 ISO schema files [8]. The region-based shape descriptor belongs to the broad class of shape analysis techniques based on moments. It uses a complex 2D Angular Radial Transformation (ART), defined on a unit disk in polar coordinates. The ART coefficients were recorded from each segmented image after selecting (cropping) the body part of interest (face or hands). From each shape, a set of ART coefficients F_nm is extracted using the following formula:

F_nm = ⟨V_nm(ρ, θ), f(ρ, θ)⟩ = ∫_0^2π ∫_0^1 V*_nm(ρ, θ) f(ρ, θ) ρ dρ dθ

where f(ρ, θ) is an image intensity function in polar coordinates and V_nm(ρ, θ) is the ART basis
function of order n and m. The basis functions are separable along the angular and radial directions, and are defined as follows:

V_nm(ρ, θ) = (1/2π) exp(jmθ) R_n(ρ),   (1)

R_n(ρ) = 1 if n = 0,  2cos(πnρ) if n ≠ 0   (2)

The default region-based shape descriptor has 140 bits. It uses 35 coefficients (n = 10, m = 10) quantized to 4 bits per coefficient. I have used all the 35 resulting coefficients in the object description. The region-based shape descriptor expresses the pixel distribution within a 2D object region; it can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes (Figure 2). Some important features of this descriptor are:
◦ it gives a compact and efficient way of describing the properties of multiple disjoint regions simultaneously;
◦ it can cope with errors in segmentation where an object is split into disconnected subregions, provided that the information on which subregions contribute to the object is available and used during the descriptor extraction;
◦ it is robust to segmentation noise.

The information from the CLD and RS (region shape) descriptors is stored in XML Metadata Interchange format, which is further processed by the classifier. The feature space for shape retrieval and classification consisted of up to 47 coefficients (12 from the CLD and 35 from the ART descriptors), which were passed on to the classifier after the selection scheme based on Fuzzy C-Means.

Figure 2: Example of shapes where the region-based shape descriptor is applicable.

2.3 Video stream segmentation and RFrames selection

This initial phase is supposed to select, from our simple video recordings, the images that are to be used for shape understanding. A simple algorithm was chosen (see also [9]): for example, if a logical video segment is v10−100, and the RFrame set from the whole video is {v1, v40, v75, v120, ...}, then {v40, v75} can be used to visually represent the segment.

In order to have a clear understanding of how a gesture's RFrames are selected, a simple figure (Figure 3) shows a relatively smooth transition from the neutral phase (where the signer keeps the hands down) to the active phase (the region in green in the middle) and back to the neutral position. In this selection process, the images are represented in the Principal Components (PC) space [4]. The representation in PC-space has previously revealed very interesting aspects of the motion's dynamics [11], [18]. Since we are not dealing with crisp transitions (i.e. an image may belong to 2 subsets), I considered it mandatory to use Fuzzy Clustering. The basics of the algorithm, called Fuzzy C-Means (FCM), were introduced by Dunn [5] and improved by Bezdek [6]; it is a classic fuzzy clustering algorithm, whose objective function is given by equation (3) below.

Figure 3: The figure shows how the images in a video-stream can be clustered in 3 simple regions; therefore, Fuzzy C-Means is applicable.
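The segment-representation rule described above (a segment v10−100 with RFrame set {v1, v40, v75, v120, ...} is represented by {v40, v75}) amounts to a simple filter; a minimal Python sketch, with a hypothetical function name:

```python
def rframes_for_segment(start, end, rframes):
    """Return the representative frames (RFrames, given as frame
    indices) that fall inside the logical video segment [start, end]."""
    return [f for f in rframes if start <= f <= end]
```

For the example above, rframes_for_segment(10, 100, [1, 40, 75, 120]) returns [40, 75].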
J = Σ_{i=1}^{C} Σ_{j=1}^{N} (u_ij)^m d²(x_j, c_i)   (3)

where u_ij ∈ [0, 1] for all i and obeys the constraints defined in the equations below:

Σ_{i=1}^{C} u_ik = 1,   1 ≤ k ≤ N

0 < Σ_{k=1}^{N} u_ik < N,   1 ≤ i ≤ C

where C is the number of clusters (3, as explained above) and d is the distance norm (Euclidean, Manhattan and so on). The minimization of the objective function with respect to the membership values leads to the following:

u_ij = 1 / Σ_{k=1}^{C} ( d²(x_j, c_i) / d²(x_j, c_k) )^{1/(m−1)}   (4)

And the minimization of the objective function with respect to the center of each cluster gives us:

c_i = Σ_{j=1}^{N} (u_ij)^m x_j / Σ_{j=1}^{N} (u_ij)^m   (5)

In the equations above, m ∈ (1, ∞) is the fuzzifier. Therefore, in the end, the algorithm looks as in Table 1.

Table 1: The FCM algorithm.
  Fix the number of clusters C;
  Fix the fuzzifier m;
  Do {
      Update the memberships using equation (4)
      Update the centers using equation (5)
  } Until (the centers stabilize)

2.4 Short introduction to SVM

In the current framework of Machine Learning and data understanding, we considered the approach derived from the statistical learning theory of SVM (Support Vector Machines). Suppose we are given a set of examples (x_1, y_1), ..., (x_l, y_l), x_i ∈ X, y_i ∈ {±1}, and assume that the two classes of the classification problem are linearly separable.

Theorem 1 Let the l training set vectors x_1, x_2, ..., x_l ∈ X (X is the dot product space) belong to a sphere S_R(a) of diameter D and center a, i.e. S_R(a) = {x ∈ X : ||x − a|| < D/2}, a ∈ X. Also, let f_{w,b} = sgn((w · x) + b) be the canonical hyperplane decision functions defined on these points. Then the set of ∆-margin optimal separating hyperplanes has the VC-dimension h bounded by the inequality

h ≤ min([D²/∆²], n) + 1   (6)

where [x] denotes the integer part of x.

In this case, we can find an optimal weight vector w_0 such that ||w_0||² is minimum (in order to maximize the margin ∆ = 2/||w_0|| of Theorem 1) and y_i · (w_0 · x_i + b) ≥ 1, i = 1, ..., l.

The support vectors are those training examples that satisfy the equality, i.e. y_i · (w_0 · x_i + b) = 1. They define two hyperplanes: one goes through the support vectors of one class, and the other through the support vectors of the other class. The distance between the two hyperplanes is maximized when the norm of the weight vector ||w_0|| is minimum. This minimization can proceed by maximizing the following function with respect to the variables α_i (the Lagrange multipliers) [20]:

W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i · α_j · (x_i · x_j) · y_i · y_j   (7)

subject to the constraint 0 ≤ α_i. If α_i > 0, then x_i corresponds to a support vector. The classification of an unknown vector x is obtained by computing

F(x) = sgn{w_0 · x + b},  where  w_0 = Σ_{i=1}^{l} α_i · y_i · x_i   (8)

and the sum accounts for only the N_s ≤ l nonzero support vectors (i.e. the training set vectors x_i whose α_i are nonzero). Clearly, after the training, the classification can be accomplished efficiently by taking the dot product of the optimum weight vector w_0 with the input vector x.

minimize  (1/2) w · w + C · Σ_{i=1}^{l} ξ_i   (9)

The case where the data are not linearly separable is handled by introducing slack variables (ξ_1, ξ_2, ..., ξ_l) with ξ_i ≥ 0 [21], such that y_i(w · x_i + b) ≥ 1 − ξ_i, i = 1, ..., l. The introduction of the variables ξ_i allows
misclassified points, which have their corresponding ξ_i > 1. Thus, Σ_{i=1}^{l} ξ_i is an upper bound on the number of training errors. The corresponding generalization of the concept of the optimal separating hyperplane is obtained by the solution of the optimization problem given by equation (9) above, subject to:

y_i · (w · x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0,  i = 1, ..., l   (10)

The control of the learning capacity is achieved by the minimization of the first term of (9), while the purpose of the second term is to penalize misclassification errors. The parameter C is a kind of regularization parameter that controls the tradeoff between learning capacity and training-set errors. Clearly, a large C corresponds to assigning a higher penalty to errors.

Finally, the case of nonlinear Support Vector Machines should be considered. The input data in this case are mapped into a high-dimensional feature space through some nonlinear mapping Φ chosen a priori [20]. The optimal separating hyperplane is then constructed in this space. As shown in Section 5.5 of [13], the corresponding optimization problem is obtained from (7) by substituting x by its mapping z = Φ(x) in the feature space, i.e. it is the maximization of

W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i · α_j · (Φ(x_i) · Φ(x_j)) · y_i · y_j

subject to:

Σ_{i=1}^{l} y_i · α_i = 0

∀i : 0 ≤ α_i ≤ C

2.5 The proposed classifier

The number of training examples is denoted by l. In our case, l was always 280 (10 frames for each of the 28 one-handed gestures of ISL). α is a vector of l variables, where each component α_i corresponds to a training example (x_i, y_i). x_i represents the features vector, which is formed by either 35 coefficients (only the Region-shape coefficients) or 47 coefficients, corresponding to both the Region-shape (RS) and the CLD descriptors. A fast Windows implementation (.dll's) of the extraction of the video descriptors was chosen ([17]) that can be included in our final real-time Sign-Language understanding system. Although there are many available implementations in several programming languages (like Matlab, C++, Java, Lisp and so on), I have used a Java version ([15]) of an implementation of the Support Vector Machine called mySVM, developed by Stefan Rüping. It is based on the optimization algorithm of SVMlight, as described in [16]. mySVM can be used for pattern recognition, regression and distribution estimation. In order to cope with the relatively small number of examples, cross-validation (see [19]) with a factor of 25 was chosen. Several types of kernels were tested (neural, polynomial, Anova, Epanechnikov, Gaussian-combination, multiquadric, and kernels based on radial-basis functions), but our experience [10] was once more confirmed: (most probably) the SVM classifiers based on polynomial kernels are the best for classification problems. Therefore, all the results reported in Table 2 correspond to the polynomial-kernel supervised learning.

Table 2: Generalization performance of the mySVM classifier with polynomial kernels.

  Experiment   | Current performance vector | Description of the experiment
  RS1          | 6.03%                      | Region-shape only
  RS1 & CLD1   | 6.39%                      | Region-shape and CLD coeffs, real images
  RS2          | 7.64%                      | Region-shape coeffs for real and virtual signers

3 Experimental results

The experience acquired in the group has shown that there are many factors that can influence the quality of image understanding, like the differences between the signers' clothing, between the lighting sources, between the skin of the humans, or those due to motion blur. Therefore, the first step was to compare the classification performance of our algorithm for two classes of input data: only 35 coefficients (of the RS descriptor), or 47 coefficients (of the RS and CLD descriptors). The performance vector (overall) resulting from the confusion matrix of the results is presented in Table 2, and it shows that by adding the 12 extra coefficients corresponding to the CLD, the classification error is only slightly increased (by 0.35%). That will allow us to gather more information in our training and testing database (more real and virtual signers) and to quantify the differences enumerated above in only a few coefficients at virtually no expense. The second step of our experiment used the same number of images, but for each letter/sign expressed, ten static images and ten images extracted from the Poser-generated video-stream corresponding to the same sign were used, as in Figure 1. The performance is given in the third line of the table.
4 Conclusions

The results explained in the previous sections show that a limited number of gestures (executed by a human or a robot) can be learned and understood by combining the shape recognition detailed in the current work (hand shapes playing the role of letters in an alphabet) with an understanding of the gesture dynamics represented in the feature space (like PCA). In this latter approach, the images are represented in the PCA-space and the gestures are represented on a nonlinear manifold, see [18]. The fast procedure presented here, which takes approximately 10 msec (average classification time) and approximately 1 second for the VDE feature extraction on a 2.4 GHz PC, is considered to be a good choice for other researchers in the field.

Acknowledgements:
The research was supported by the Science Foundation of Ireland (SFI), and the author is also thankful to Dr. Sara Morrisey for the database of synthetic video streams and to Lecturer Alistair Sutherland (both from Dublin City University) for being a good group leader.

References:

[1] Poser official site by e-Frontier America Inc: http://www.e-frontier.com/.

[2] Junwei Han, George Awad, Alistair Sutherland, and Hai Wu, Automatic Skin Segmentation for Gesture Recognition Combining Region and Support Vector Machine Active Learning, Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition, 2006, pp. 237–242.

[3] George M. Awad, A Framework for Sign Language Recognition using Support Vector Machines and Active Learning for Skin Segmentation and Boosted Temporal Sub-units, PhD Thesis, Dublin City University, 2007.

[4] I. T. Jolliffe, Principal Component Analysis, Springer Verlag, 2002.

[5] J. C. Dunn, A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters, Cybernetics and Systems: An International Journal, Volume 3, Issue 3, 1973, pp. 32–57.

[6] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

[7] J. Kovac, P. Peer and F. Solina, Human Skin Colour Clustering for Face Detection, Proc. of EUROCON 2003, Finland, pp. 144–148.

[8] http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-7 schema files/

[9] Anupam Joshi, Sansanee Auephanwiriyakul, and Raghu Krishnapuram, On Fuzzy Clustering and Content Based Access to Networked Video Databases, Proceedings of the Workshop on Research Issues in Database Engineering, pp. 42–47, 1998.

[10] S. Papadimitriou, L. Vladutu, S. Mavroudi and A. Bezerianos, Ischemia Detection with a Self-Organizing Map Supplemented by Supervised Learning, IEEE Transactions on Neural Networks 12 (2001), 503–515.

[11] Wu Hai and Alistair Sutherland, Irish Sign Language Recognition Using Hierarchical PCA, Irish Machine Vision and Image Processing Conference (IMVIP 2001), National University of Ireland, Maynooth, 5–7 September 2001.

[12] G. Awad, J. Han and A. Sutherland, Hand and face tracking demo: http://www.youtube.com/watch?v=exbGdHpFiW0

[13] Vladimir Vapnik, The Nature of Statistical Learning Theory, 2nd Edition, Springer-Verlag, Berlin–Heidelberg–New York–Tokyo, 2000.

[14] "The Standard Dictionary of Irish Sign Language", microBooks Ltd. on DVD, 2006, by NAD, www.deafhear.ie/documents/pdf/1951.pdf.

[15] Open-Source Data Mining with the Java Software RapidMiner, http://rapid-i.com/content/blogcategory/38/69/

[16] Thorsten Joachims, Making Large-Scale SVM Learning Practical, in Advances in Kernel Methods – Support Vector Learning, chapter 11, MIT Press, 1999.

[17] Giorgos Tolias, Visual Descriptor Applications, Semantic Multimedia Analysis Group, NTUA, Greece: http://image.ntua.gr/smag/tools/vde.

[18] L. Vladutu, A. Sutherland, Gesture Analysis of Deaf People Language Using Nonlinear Analysis of Manifolds, SFI Conference, Dublin, Ireland, July 2007.

[19] Ron Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Proceedings of the 14th International Joint Conference on Artificial Intelligence 2 (12): 1137–1143, Morgan Kaufmann, San Mateo, 1995.

[20] V. N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.

[21] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning 20:3, 1995, pp. 273–297.
