1 Introduction
Human gesture recognition has been a widely researched area over the last few years due to potential applications in the field of security and surveillance. Early research on gesture recognition used the concept of space-time shapes, which are concatenated silhouettes over a set of frames, to extract features corresponding to the variation within the spatio-temporal space. Gorelick et al. [7] modelled the variation within the space-time shape using Poisson's equation and extracted space-time structures which provide discriminatory features. Wang et al. recognized human activities using a derived form of the Radon transform known as the R-Transform [17,16]. A combination of a 3D distance transform and the R-Transform has been used to represent a space-time shape at multiple levels, which serve as the corresponding action features [11].
networks to model the mapping. The proposed technique in this paper uses the histogram of spatial gradients in a region of interest, finds an underlying function which captures the temporal variance of these 2D shape descriptors with respect to each action, and classifies a set of contiguous frames irrespective of the speed of the action or the time instant of the body posture.
2 Proposed Methodology
In this paper, we focus on three main aspects of the action recognition framework. The first is feature extraction, where a shape descriptor is computed for the region of interest in each frame. The second is the computation of an appropriate reduced space which spans the shape-change variations across time. The third is the suitable modelling of the mapping from the shape descriptor space to the reduced space. A block diagram illustrating the framework is shown in Figure 1.
Fig. 2. HOG descriptor extracted from a binary human silhouette from the Weizmann Database [7]
Fig. 3. HOG descriptor extracted from a gray scale hand image from the Cambridge Hand Gesture Database [8]: (a) hand, (b) gradient
corresponding location of that particular shape on the action manifold than the approach which uses nearest neighbours to determine the corresponding reduced posture point. In this paper, we use a separate model for each action class, and the modelling is done using generalized regression neural networks, which are multiple-input multiple-output networks.
2.1 Histogram of Gradients Shape Descriptor
The histogram of gradients is computed by first taking the gradient of the image in the x and y directions and calculating the gradient magnitude and orientation at each pixel. The image is then divided into K overlapping blocks and the orientation range is divided into n bins. Within each block, the gradient magnitudes of those pixels corresponding to the same range of orientation (belonging to the same bin) are added up to form a histogram. The histograms from the various blocks are normalized and concatenated to form the HOG shape descriptor. An illustration of the HOG descriptor extracted from a masked human silhouette image is shown in Figure 2. It can be seen that, since the binary silhouette produces a gradient where all of its points correspond to the silhouette, the HOG descriptor produces a discriminative shape representation. Moreover, due to the block operation during the computation of the HOG, this descriptor provides a more local representation of the particular posture or shape. An illustration of the HOG descriptor (first 50 elements) applied to a gray scale hand image is shown in Figure 3. Unlike the binary image, there is some noise in the gradient image, which is reflected in the HOG descriptor. Since the HOG descriptor is largely invariant to illumination, the feature descriptors do not vary much under varying illumination conditions.
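To make the computation concrete, the following is a minimal NumPy sketch of the descriptor described above; the function name, the 50% block overlap, and the per-block L2 normalization are our illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def hog_descriptor(image, grid=(7, 7), n_bins=9):
    """Minimal HOG sketch: per-pixel gradient magnitude and orientation,
    orientation histograms over overlapping blocks, L2-normalized and
    concatenated. `image` is a 2D array (silhouette or gray scale)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned, in [0, pi)
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)

    h, w = image.shape
    # Overlapping blocks: each block spans twice the stride in each direction.
    sy, sx = h // (grid[0] + 1), w // (grid[1] + 1)
    histograms = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            m = magnitude[by*sy:(by+2)*sy, bx*sx:(bx+2)*sx]
            b = bins[by*sy:(by+2)*sy, bx*sx:(bx+2)*sx]
            hist = np.bincount(b.ravel(), weights=m.ravel(), minlength=n_bins)
            hist /= (np.linalg.norm(hist) + 1e-9)     # L2 normalization
            histograms.append(hist)
    return np.concatenate(histograms)                 # length = blocks * n_bins
```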
2.2 Reduced Posture Space and Action Modelling
The next step in the framework is to determine an appropriate space which represents the inter-frame variation of the HOG descriptors. An illustration of the reduced posture or shape space obtained using PCA is shown for the Weizmann dataset and the Cambridge Hand dataset using three Eigenvectors in Figure 4. Each action class of the reduced posture points shown in Figure 4 is color-coded to illustrate how close the action manifolds are and the separability existing between them. We can see that there is considerable overlap between different action manifolds in the reduced space, and our aim is to use a functional mapping for each manifold to distinguish between them. We first collect the HOG descriptors from all the possible postures of the body, irrespective of the action class, and form what we call the action space, denoted by S_D. We can express the action space mathematically as
\[
S_D = \{\, h_{k,m} \;:\; 1 \le k \le K(m) \text{ and } 1 \le m \le M \,\} \tag{1}
\]
where K(m) is the number of frames taken over all the training video sequences of action m out of M action classes, and h_{k,m} is the corresponding HOG descriptor of dimension D × 1. The reduced action or posture space is obtained by extracting the principal components of the matrix HH^T, where H = [h_{1,1} h_{2,1} ... h_{K(M),M}], using PCA. This is done by finding the Eigenvectors or Eigenpostures v_1, v_2, ..., v_d corresponding to the largest variances between the HOG descriptors. In this reduced space, the inter-frame variation between the extracted HOG descriptors due to the changes of shape of the body (due to the motion or the action) is maximized by selecting the appropriate number of Eigenpostures while, at the same time, reducing the effect of noise due to illumination.
Fig. 4. Reduced Posture Space for the HOG descriptors extracted from video sequences
The mapping from the HOG descriptor space (D × 1) to the reduced posture or shape space (d × 1) can be represented as S_D → S_d, where S_d = {p_{k,m} : m = 1, ..., M} and p is a vector representing a point in the reduced posture space.
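A minimal sketch of extracting the Eigenpostures and performing this mapping might look as follows; we use the SVD of H to obtain the eigenvectors of HH^T without forming the D × D matrix explicitly, a standard equivalence rather than a detail stated in the paper.

```python
import numpy as np

def eigenpostures(H, d):
    """H: D x N matrix whose columns are the HOG descriptors h_{k,m}
    collected over all K(m) frames of all M action classes.
    Returns the d Eigenpostures v_1 ... v_d as columns of a D x d matrix."""
    # The left singular vectors of H are the eigenvectors of H H^T.
    U, s, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :d]

def project(h, V):
    """Map a D x 1 HOG descriptor to its d x 1 reduced posture point p."""
    return V.T @ h
```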
In this framework, we aim to model the mapping from the HOG space to the posture space for each action m separately using the Generalized Regression Neural Network (GRNN) [14,3]. This network is a one-pass learning algorithm which provides fast convergence to the optimal regression surface. It is memory intensive, as it requires the storage of the training input and output vectors, where each node in the first layer is associated with one training point. The network models an equation of the form
\[
\hat{y} \;=\; \frac{\sum_{i=1}^{N} y_i\, \mathrm{radbasis}(x - x_i)}{\sum_{i=1}^{N} \mathrm{radbasis}(x - x_i)}
\]
where (y_i, x_i) are the training input/output pairs and \hat{y} is the estimated point for the test input x. In our algorithm, since a large number of training points are present, a large number of nodes would have to be implemented for each class, which is not memory efficient. To obtain suitable training points that mark the transitions in the posture space for a particular action class m, k-means clustering is performed to get L(m) clusters. The mapping of the HOG descriptor space to its reduced space for a particular action class m can then be modelled by a general regression equation given as
\[
\hat{p} \;=\; \frac{\displaystyle\sum_{i=1}^{L(m)} \bar{p}_{i,m}\, \exp\!\left(-\frac{D_{i,m}^{2}}{2\sigma^{2}}\right)}
               {\displaystyle\sum_{i=1}^{L(m)} \exp\!\left(-\frac{D_{i,m}^{2}}{2\sigma^{2}}\right)},
\qquad
D_{i,m}^{2} = (h - \bar{h}_{i,m})^{T}(h - \bar{h}_{i,m}) \tag{2}
\]
where \bar{p}_{i,m} and \bar{h}_{i,m} are the i-th cluster centres in the posture space and the HOG descriptor space, respectively. The standard deviation \sigma for each action class is taken as the median Euclidean distance between the corresponding action's cluster centres. The action class is determined by first projecting a consecutive set of R frames onto the Eigenpostures. These projections of the frames, given by p_r : 1 \le r \le R, are compared with the estimated projections \hat{p}_r^{(m)} of the corresponding frames produced by each of the GRNN action models, using the Mahalanobis distance. The action model which gives the closest estimates of the projections is selected as the action class.
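The per-class estimator of Eq. (2) and the classification rule described above can be sketched as follows; the dictionary-based model container, the use of a shared inverse covariance matrix for the Mahalanobis distance, and the averaging of per-frame distances into a single score are our assumptions, not details from the paper.

```python
import numpy as np

def grnn_estimate(h, H_bar, P_bar, sigma):
    """Eq. (2): estimate the reduced posture point p_hat for descriptor h.
    H_bar: L x D cluster centres in HOG space; P_bar: L x d centres in the
    posture space; sigma: median distance between the class's centres."""
    D2 = np.sum((H_bar - h) ** 2, axis=1)        # squared distances D_{i,m}^2
    w = np.exp(-D2 / (2.0 * sigma ** 2))
    return (w @ P_bar) / (np.sum(w) + 1e-12)

def classify(frames_hog, V, models, cov_inv):
    """frames_hog: R x D HOG descriptors of R consecutive frames; V: D x d
    Eigenpostures; models: {m: (H_bar, P_bar, sigma)}; cov_inv: d x d inverse
    covariance of the posture space for the Mahalanobis distance."""
    P = frames_hog @ V                           # true projections p_r
    scores = {}
    for m, (H_bar, P_bar, sigma) in models.items():
        P_hat = np.array([grnn_estimate(h, H_bar, P_bar, sigma)
                          for h in frames_hog])  # estimated projections
        diff = P - P_hat
        # Mean squared Mahalanobis distance between true and estimated points.
        scores[m] = np.mean(np.einsum('rd,de,re->r', diff, cov_inv, diff))
    return min(scores, key=scores.get)           # closest model wins
```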
3 Experimental Results
The algorithm presented in this paper has been evaluated on two datasets, the Weizmann Human Action dataset [7] and the Cambridge Hand Gesture dataset [8]. The histogram of gradients feature descriptor has been extracted by dividing the detection region into 7 × 7 overlapping cells. From each cell, a histogram of gradients is computed with 9 orientation bins; the histograms are normalized by taking the L2-norm and concatenated to form the feature vector of size 441 × 1.
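As a quick consistency check, the stated feature size follows directly from this configuration (7 × 7 cells × 9 bins = 441); using the hypothetical hog_descriptor sketch from Section 2.1:

```python
import numpy as np

# Hypothetical check: 7 x 7 overlapping cells, each with a 9-bin histogram,
# gives 7 * 7 * 9 = 441 elements, matching the stated descriptor size.
frame = np.random.rand(128, 64)   # stand-in for a real detection region
descriptor = hog_descriptor(frame, grid=(7, 7), n_bins=9)
assert descriptor.shape == (441,)
```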
3.1 Weizmann Human Action Dataset
Fig. 6. Average accuracy computed for the action classes for window sizes 10, 12, 15, 18, 20, 23, 25, 28, and 30
3.2 Cambridge Hand Gesture Dataset
The dataset contains three main action classes showing different postures of the hand: flat, spread out, and V-shape. Each of the main classes has three subclasses which differ in the direction of movement. In total, we have nine different action classes which differ in the posture of the hand as well as its direction of motion. The main challenge is to differentiate between different motions and shapes under different illumination conditions. The dataset is shown in Figure 5(b). There are 5 sets, each containing different illuminations of all the action classes, with each class having 20 sequences. From each of the video sequences, we applied skin segmentation to get a rough region of interest and extracted the HOG-based shape descriptor from the gray scale detection region. Unlike the descriptors extracted from silhouettes in the Weizmann dataset, these descriptors contain noise variations due to different illumination conditions. The testing strategy we used is the same as that for the Weizmann dataset, with leave-9-out video sequences where each test sequence corresponds to an action class.
Table 2. Confusion Matrix and Overall Accuracy for the Cambridge Hand Gesture Dataset
(a) Confusion matrix (per-class accuracy on the diagonal, in %): a1 = 94.0, a2 = 91.0, a3 = 95.0, a4 = 91.0, a5 = 85.0, a6 = 99.0, a7 = 83.0, a8 = 86.0, a9 = 77.0
The confusion matrix for the action classes obtained from the framework with 4 clusters is given in Table 2(a). We can see that if all the illumination conditions are trained into the system, the overall accuracy obtained with the framework is high. Using the same testing strategy, we tested the system for overall accuracy on each set; the results are given in Table 2(b). For set 1, the overall accuracy is high, as the non-uniform lighting does not affect the feature vectors and noise is diminished by the partial illumination invariance of the HOG descriptor. Sets 4 and 5 show moderate accuracies, while sets 2 and 3 give average overall accuracy.
4 Conclusion
In this paper, we presented a framework for recognizing actions from partial video sequences which is invariant to the speed of the action being performed. We illustrated this approach using the Histogram of Gradients shape descriptor and computed the reduced-dimensional posture space from the HOG space using Principal Component Analysis. The mapping from the HOG space to the reduced posture space for each action class is learned separately using a Generalized Regression Neural Network. Classification is done by projecting the HOG descriptors of the partial sequence onto the posture space and comparing the reduced-dimensional representation with the posture estimated by each GRNN action model using the Mahalanobis distance. The results show the accuracy of the framework as illustrated on the Weizmann database. However, when using gray scale images to compute the HOG, severe illumination conditions can affect the framework, as illustrated by the Hand Gesture database results. In future work, we plan to extract a shape descriptor which represents a shape by a set of corner points whose relationships are determined at the spatial and temporal scales. Other regression and classification schemes will also be investigated in this framework.
References
1. Ali, S., Basharat, A., Shah, M.: Chaotic invariants for human action recognition. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1–8 (October 2007)
2. Batra, D., Chen, T., Sukthankar, R.: Space-time shapelets for action recognition. In: IEEE Workshop on Motion and Video Computing, WMVC 2008, pp. 1–6 (January 2008)
3. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer (2006); corr. 2nd printing edn. (October 2007)
4. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: IEEE International Conference on Image Processing, ICIP 2007, vol. 1, pp. 381–384 (October 2007)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893 (June 2005)
6. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
7. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12), 2247–2253 (2007)
8. Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis for action classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (June 2007)
9. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the British Machine Vision Conference (BMVC 2008), pp. 995–1004 (September 2008)
10. Lui, Y.M., Beveridge, J., Kirby, M.: Action classification on product manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 833–839 (June 2010)
11. Nair, B., Asari, V.: Action recognition based on multi-level representation of 3D shape. In: Proceedings of the International Conference on Computer Vision Theory and Applications, pp. 378–386 (March 2010)
12. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: British Machine Vision Conference, BMVC 2006 (2006)
13. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings of the International Conference on Multimedia (MultiMedia 2007), pp. 357–360 (September 2007)
14. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568–576 (1991)
15. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors and holistic features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 58–65 (June 2009)
16. Tabbone, S., Wendling, L., Salmon, J.: A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding 102, 42–51 (2006)
17. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on R transform. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (June 2007)