
Computer Vision and Image Understanding 117 (2013) 587–602


View invariant action recognition using weighted fundamental ratios


Nazim Ashraf a,*, Yuping Shen a,b, Xiaochun Cao a,c, Hassan Foroosh a,d

a College of Engineering and Computer Science, Computational Imaging Lab, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA
b Advanced Micro Devices, Quadrangle Blvd., Orlando, FL 32817, USA
c State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
d Department of EECS, Computational Imaging Laboratory, University of Central Florida, Orlando, FL 32816, USA

* Corresponding author. E-mail address: nazim@cs.ucf.edu (N. Ashraf).

Article info

Article history:
Received 10 February 2012
Accepted 19 January 2013
Available online 5 February 2013
Keywords:
View invariance
Pose transition
Action recognition
Action alignment
Fundamental ratios

Abstract
In this paper, we fully investigate the concept of fundamental ratios, demonstrate their application and significance in view-invariant action recognition, and explore the importance of different body parts in action recognition. A moving plane observed by a fixed camera induces a fundamental matrix F between two frames, where the ratios among the elements in the upper-left 2 × 2 submatrix are herein referred to as the fundamental ratios. We show that fundamental ratios are invariant to camera internal parameters and orientation, and hence can be used to identify similar motions of line segments from varying viewpoints. By representing the human body as a set of points, we decompose a body posture into a set of line segments. The similarity between two actions is therefore measured by the motion of line segments and hence by their associated fundamental ratios. We further investigate to what extent a body part plays a role in the recognition of different actions and propose a generic method of assigning weights to different body points. Experiments are performed on three categories of data: the controlled CMU MoCap dataset, the partially controlled IXMAS data, and the more challenging uncontrolled UCF-CIL dataset collected on the internet. Extensive experiments are reported on testing (i) view-invariance, (ii) robustness to noisy localization of body points, (iii) the effect of assigning different weights to different body points, (iv) the effect of partial occlusion on recognition accuracy, and (v) how soon our method recognizes an action correctly from the starting point of the query video.
© 2013 Elsevier Inc. All rights reserved.

1. Introduction
The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, human–computer interaction (HCI), ergonomics, etc. In this paper, we focus on the recognition of actions under varying viewpoints and different, unknown camera intrinsic parameters. The challenges to be addressed in action recognition include perspective distortions, differences in viewpoints, anthropometric variations, and the large number of degrees of freedom of articulated bodies [1]. The literature on human action recognition has been extremely active in the past two decades and significant progress has been made in this area [2–5]. An action can be regarded as a collection of 4D space–time data observed by a perspective video camera. Due to image projection, the 3D Euclidean information is lost and projectively distorted, which makes action recognition rather challenging, especially for varying viewpoints and different camera parameters.

Another source of challenge is the irregularity of human actions due to a variety of factors such as age, gender, circumstances, etc. The timeline of an action is another important issue in action recognition. The execution rates of the same action in different videos may vary for different actors or due to different camera frame rates. Therefore, the mapping between the same action in different videos is usually highly non-linear.
To tackle these issues, researchers often make simplifying assumptions on one or more of the following aspects: (1) the camera model, such as scaled orthographic [6] or calibrated perspective cameras [7]; (2) the camera pose, i.e. little or no viewpoint variation; (3) anatomy, such as isometry [8] or coplanarity of a subset of body points [8]. Human action recognition methods start by assuming a model of the human body, e.g. silhouettes, body points, stick models, etc., and build algorithms that use the adopted model to recognize body pose and its motion over time. Space–time features are essentially the primitives used for recognizing actions, e.g. photometric features such as optical flow [9–11] and local space–time features [12,13]. These photometric features can be affected by luminance variations due to, for instance, camera zoom or pose changes, and often work better when the motion is small or incremental. On the other hand, salient geometric features such as silhouettes [14–18] and point sets [8,19] are less sensitive to photometric variations, but require reliable tracking. Silhouettes


are usually stacked in time as 2D [16] or 3D objects [14,18], while point sets are tracked in time to form space–time curves. Some existing approaches are more holistic and rely on machine learning techniques, e.g. HMMs [20] and SVMs [12]. As in most exemplar-based methods, they rely on the completeness of the training data and, to achieve view-invariance, are usually expensive, since a model must be learned from a large dataset.

1.1. Previous work on view-invariance

Most action recognition methods adopt simplified camera models and assume a fixed viewpoint, or simply ignore the effect of viewpoint changes. However, in practical applications such as HCI and surveillance, actions may be viewed from different angles by different perspective cameras. Therefore, a reliable action recognition system has to be invariant to camera parameters and viewpoint changes. View-invariance is thus of great importance in action recognition, and has started receiving more attention in recent literature.

One approach to view-invariant action recognition has been based on using multiple cameras [20–22,7]. Campbell et al. [23] use stereo images to recover a 3D Euclidean model of the human subject, and extract view invariance for 3D gesture recognition. Weinland et al. [7] use multiple calibrated and background-subtracted cameras; they obtain a visual hull for each pose from multi-view silhouettes and stack them as a motion history volume, based on which Fourier descriptors are computed to represent actions. Ahmad et al. [20] build HMMs on optical flow and human body shape features from multiple views, and feed a test video sequence to all learned HMMs. These methods require the setup of multiple cameras, which is expensive and restrictive in many situations such as online video broadcast, HCI, or monocular surveillance.

A second line of research is based on a single camera and is motivated by the idea of exploiting the invariants associated with a given camera model, e.g. affine or projective. For instance, Rao et al. [24] assume an affine camera model, and use dynamic instants, i.e. the maxima of the space–time curvature of the hand trajectory, to characterize hand actions. The limitation of this representation is that dynamic instants may not always exist, or may not always be preserved from 3D to 2D due to perspective effects. Moreover, the affine camera model is restrictive in most practical scenarios. A more recent work reported by Parameswaran et al. [8] relaxes the restrictions on the camera model. They propose a quasi-view-invariant 2D approach for human action representation and recognition, which relies on the number of invariants in a given configuration of body points. A set of projective invariants is thus extracted from the frames and used as the action representation. However, in order to make the problem tractable under the variable dynamics of actions, they introduce heuristics and make simplifying assumptions such as isometry of human body parts. Moreover, they require that at least five body points form a 3D plane, or that the limbs trace planar areas during the course of an action. [25] described a method to improve discrimination by inferring and then using latent discriminative aspect parameters. Another interesting approach to tackling unknown views has been suggested by [26], who use virtual views to connect the action descriptors extracted from the source view to those extracted from the target view. Another interesting approach is [27], who use a bag of visual words to model an action and present promising results.
Another promising approach is based on exploiting multi-view geometry. Two subjects in the same exact body posture viewed by two different cameras at different viewing angles can be regarded as related by the epipolar geometry. Therefore, corresponding poses in two videos of actions are constrained by the associated fundamental matrices, thus providing a way to match poses and actions in different views. The use of the fundamental matrix in view-invariant action recognition was first reported by Syeda-Mahmood et al. [28] and later by Yilmaz et al. [18,19]. They stack the silhouettes of input videos into space–time objects and extract features in different ways, which are then used to compute a matching score based on the fundamental matrices. A similar work is presented in [29], which is based on body points instead of silhouettes. A recent method [30] uses a probabilistic 3D exemplar model that can generate 2D view observations for recognition.

1.2. Our approach

This work is an extension of [31], which introduced the concept of fundamental ratios that are invariant to rigid transformations of the camera, and applied them to action recognition. We make the following main extensions: (i) Instead of looking at fundamental ratios induced by triplets of points, we look at fundamental ratios induced by line segments. This, as we will see later, introduces more redundancy and results in better accuracy. (ii) It has long been argued in the applied perception community [32] that humans focus only on the most significant aspects of an event or action for recognition, and do not give equal importance to every observed data point. We propose a new generic method for learning how to assign different weights to different body points in order to improve recognition accuracy, using a focusing strategy similar to that of humans. (iii) We study how this focusing strategy can be used in practice when there is partial but significant occlusion. (iv) We investigate how soon after the query video starts our method is capable of recognizing the action, an important issue not previously investigated in the literature. (v) Our experiments in this paper are more extensive than [31] and include a larger set of data with various levels of difficulty.
The rest of the paper is organized as follows. In Section 2, we introduce the concept of fundamental ratios, which are invariant to rigid transformations of the camera, and in Section 3 we describe how they may be used for action recognition. In Section 4, we focus on how different body parts can be weighted for better recognition. We present our extensive experimental evaluation in Section 5, followed by discussions and conclusion in Section 6.
2. Fundamental ratios
In this section, we establish specific relations for the epipolar geometry induced by moving line segments. We derive a set of feature ratios that are invariant to the camera intrinsic parameters for a natural perspective camera model with zero skew and unit aspect ratio. We then show that these feature ratios are projectively invariant to similarity transformations of the line segment in 3D space, or, equivalently, invariant to rigid transformations of the camera.
Proposition 1. Given two cameras P_i ≃ K_i[R_i | t_i] and P_j ≃ K_j[R_j | t_j] with zero skew and unit aspect ratio, denote the relative translation and rotation from P_i to P_j as t and R, respectively. Then the upper-left 2 × 2 submatrix of the fundamental matrix between the two views is of the form

F^{2×2} ≅ [ ε_{1st} t^s r_1^t   ε_{1st} t^s r_2^t
            ε_{2st} t^s r_1^t   ε_{2st} t^s r_2^t ],   (1)

where r_k is the kth column of R, the superscripts s, t = 1, ..., 3 index the elements of the vectors, and ε_{rst}, r = 1, 2, is the permutation tensor.^1
Remark 1. The ratios among the elements of F^{2×2} are invariant to the camera calibration matrices K_i and K_j.

1 The use of tensor notation is explained in detail in [33, p. 563].


The upper-left 2 × 2 submatrices F^{2×2} of two moving cameras can be used to measure the similarity of the camera motions. That is, if two cameras perform the same motion (the same relative translation and rotation during the motion), and F_1 and F_2 are the fundamental matrices between any pair of corresponding frames, then F_1^{2×2} ≅ F_2^{2×2}. This also holds for the dual problem in which the two cameras are fixed, but the scene objects in both cameras perform the same motion. A special case of this problem is when the scene objects are planar surfaces, which is discussed below.
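As a quick numerical check of Proposition 1 and Remark 1, the sketch below (our own, not from the paper; the intrinsic values are arbitrary) builds F = K_j^{-T}[t]_× R K_i^{-1} for two different pairs of natural cameras that share the same relative motion (R, t), and verifies that the normalized upper-left 2 × 2 blocks, i.e. the fundamental ratios, coincide:

import numpy as np

def skew(t):
    # cross-product matrix [t]_x
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def fundamental(K1, K2, R, t):
    # F = K2^{-T} [t]_x R K1^{-1} for cameras P1 = K1[I | 0], P2 = K2[R | t]
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def ratios(F):
    # upper-left 2x2 block with the scale (and sign) removed
    B = np.abs(F[:2, :2])
    return B / np.linalg.norm(B)

def K(f, u0=0.0, v0=0.0):
    # natural camera: zero skew, unit aspect ratio
    return np.array([[f, 0, u0], [0, f, v0], [0, 0, 1.0]])

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random rotation
R *= np.sign(np.linalg.det(R))
t = rng.standard_normal(3)

F_a = fundamental(K(1000, 320, 240), K(1200, 310, 230), R, t)
F_b = fundamental(K(750, 400, 300), K(900, 355, 288), R, t)
print(np.allclose(ratios(F_a), ratios(F_b)))       # True: the ratios ignore K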
Proposition 2. Suppose two fixed cameras are looking at two moving planar surfaces, respectively. Let F_1 and F_2 be the two fundamental matrices induced by the two moving planar surfaces. If the motions of the two planar surfaces are similar (differ at most by a similarity transformation), then

F_1^{2×2} ≅ F_2^{2×2},   (2)

where the projective equality, denoted by ≅, is invariant to camera orientation.

Here, similar motion means that the plane normals undergo the same motion up to a similarity transformation. The projective nature of the view-invariant equation in (2) implies that the elements of the submatrices on both sides of (2) are equal up to an arbitrary non-zero scale factor, and hence only the ratios among them matter. We call these ratios the fundamental ratios; as Propositions 1 and 2 state, the fundamental ratios are invariant to camera intrinsic parameters and viewpoints. To eliminate the scale factor, we can normalize both sides using F̂_i = |F_i^{2×2}| / ‖F_i^{2×2}‖_F, i = 1, 2, where |·| refers to the element-wise absolute value and ‖·‖_F stands for the Frobenius norm. We then have

F̂_1 = F̂_2.   (3)

In practice, F̂_1 and F̂_2 may not be exactly equal due to noise, computational errors, or subjects' different ways of performing the same action. We therefore define the following function to measure the residual error:

E(F̂_1, F̂_2) = ‖F̂_1 − F̂_2‖_F.   (4)
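In code, the normalization of Eq. (3) and the residual of Eq. (4) amount to a few lines (a sketch; the function names are ours and are reused in the later sketches):

import numpy as np

def fundamental_ratios(F):
    # normalized upper-left 2x2 block of a fundamental matrix (Eq. (3))
    B = np.abs(np.asarray(F, dtype=float)[:2, :2])
    return B / np.linalg.norm(B)              # division by the Frobenius norm

def residual(F1, F2):
    # similarity error E between two induced fundamental matrices (Eq. (4))
    return np.linalg.norm(fundamental_ratios(F1) - fundamental_ratios(F2))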

3. Action recognition using fundamental ratios


3.1. Representation of pose
Using a set of body points to represent human pose has been common in action recognition, primarily because a human body can be modeled as an articulated object, and secondly because body points capture sufficient information to achieve the task of action recognition [29,34,8,19]. Other representations of pose include subject silhouettes [14,16,28], optical flow [9,11,10], and local space–time features [13,12].

We use the body-point representation. An action therefore consists of a sequence of point sets. These points can be obtained using articulated object tracking techniques such as [35–39]. Further discussion of articulated object tracking can be found in [40,2,3] and is beyond the scope of this paper. We shall henceforth assume that tracking has already been performed on the data, and that we are given a set of labeled points for each image.

3.2. Pose transitions

We are given a video sequence {I_t} and a database of reference sequences corresponding to K different known actions, DB = {{J^1_t}, {J^2_t}, ..., {J^K_t}}, where I_t and J^k_t are the labeled body points in frame t. Our goal is to identify the sequence {J^k_t} from DB such that the subject in {I_t} performs the action closest to that observed in {J^k_t}.

Existing methods for action recognition such as [16,18] consider an action as a whole, which usually requires known start and end frames and is limited when the action execution rate varies. Other approaches such as [29] regard an action as a sequence of individual poses, and rely on pose-to-pose similarity measures. Since an action consists of spatio-temporal data, the temporal information plays a crucial role in recognizing an action, but it is ignored in a pose-to-pose approach. We therefore propose to use pose transitions: one can compare actions by comparing their pose transitions.

3.3. Matching pose transitions

The structure of the human body can be decomposed into line segments, each defined by two body points. The problem of comparing articulated motions of the human body thus transforms into comparing rigid motions of body line segments. According to Proposition 2, the motion of a plane induces a fundamental matrix, which can be identified by its associated fundamental ratios. If two pose transitions are identical, their corresponding body line segments induce the same fundamental ratios, which provides a measure for matching two pose transitions.

3.3.1. Fundamental matrix induced by a moving line segment
Assume that we are given an observed pose transition I_i → I_j from the sequence {I_t}, and J^k_m → J^k_n from a sequence {J^k_t} in an action dataset containing K actions, with N examples of each action. When I_i → I_j corresponds to J^1_m → J^1_n and J^2_m → J^2_n, one can regard them as observations of the same 3D pose transition by three different cameras P_1, P_2, and P_3, respectively. There are two instances of epipolar geometry associated with this scenario:

1. The mapping between the image pair ⟨I_i, I_j⟩ and the image pairs ⟨J^1_m, J^1_n⟩, ⟨J^2_m, J^2_n⟩ is determined by the fundamental matrices F_12 and F_13 [33] relating P_1, P_2, and P_3. Also, the mapping between the image pairs ⟨J^1_m, J^1_n⟩ and ⟨J^2_m, J^2_n⟩ is determined by the fundamental matrix F_23. The projection of the camera center of P_2 in I_i or I_j is given by the epipole e_21, which is found as the right null vector of F_12. The image of the camera center of P_1 in J^1_m or J^1_n is the epipole e_12, given by the right null vector of F_12^T. The projection of the camera center of P_3 in I_i or I_j is given by the epipole e_31, which is found as the right null vector of F_13. Similarly, the image of the camera center of P_1 in J^1_m or J^1_n is the epipole e_13, given by the right null vector of F_13^T. The image of the camera center of P_3 in J^1_m or J^1_n is the epipole e_32, given by the right null vector of F_23^T. Note that e_31 and e_32 are corresponding points in I_i or I_j and in J^1_m or J^1_n, respectively. This fact will be used later on.
2. The other instance of epipolar geometry is between the transitioned poses of a line segment of body points in two frames of the same camera, i.e. the fundamental matrix induced by a moving body line segment, which we denote as F. We call this fundamental matrix the inter-pose fundamental matrix, as it is induced by the transition of body poses viewed by a stationary camera.

Let ℓ be a line of 3D points whose motion leads to different image projections on I_i, I_j, J^1_m, J^1_n, J^2_m and J^2_n, denoted ℓ_i, ℓ_j, ℓ^1_m, ℓ^1_n, ℓ^2_m and ℓ^2_n, respectively:

ℓ_i = ⟨x_1, x_2⟩,  ℓ_j = ⟨x'_1, x'_2⟩,
ℓ^1_m = ⟨y_1, y_2⟩,  ℓ^1_n = ⟨y'_1, y'_2⟩,
ℓ^2_m = ⟨z_1, z_2⟩,  ℓ^2_n = ⟨z'_1, z'_2⟩.

ℓ_i and ℓ_j can be regarded as projections of a stationary 3D line ⟨X_1, X_2⟩ onto two virtual cameras P'_i and P'_j. Assume that the epipoles in P'_i and P'_j are known, and denote them as e'_i = (α_1, β_1, 1)^T, e'_j = (α'_1, β'_1, 1)^T, e'_m = (α_2, β_2, 1)^T, and e'_n = (α'_2, β'_2, 1)^T.

We can use the epipoles as parameters for the fundamental matrices induced by ⟨ℓ_i, ℓ_j⟩ and ⟨ℓ^1_m, ℓ^1_n⟩ [41]:

F_1 = [ a_1                        b_1                        −(a_1 α_1 + b_1 β_1)
        c_1                        d_1                        −(c_1 α_1 + d_1 β_1)
        −(a_1 α'_1 + c_1 β'_1)     −(b_1 α'_1 + d_1 β'_1)     (a_1 α_1 + b_1 β_1)α'_1 + (c_1 α_1 + d_1 β_1)β'_1 ],   (5)

F_2 = [ a_2                        b_2                        −(a_2 α_2 + b_2 β_2)
        c_2                        d_2                        −(c_2 α_2 + d_2 β_2)
        −(a_2 α'_2 + c_2 β'_2)     −(b_2 α'_2 + d_2 β'_2)     (a_2 α_2 + b_2 β_2)α'_2 + (c_2 α_2 + d_2 β_2)β'_2 ].   (6)

To solve for the four parameters, we have the following equations:

x'_1^T F_1 x_1 = 0,   (7)
x'_2^T F_1 x_2 = 0.   (8)

Similarly, F_2 induced by ℓ^1_m and ℓ^1_n can be computed from:

y'_1^T F_2 y_1 = 0,   (9)
y'_2^T F_2 y_2 = 0.   (10)

Given N − 1 other examples of the same action in the dataset, we have:

e_11^T F_1 e_11 = 0,   (11)
e_12^T F_2 e_12 = 0,   (12)
e_21^T F_1 e_21 = 0,   (13)
e_22^T F_2 e_22 = 0,   (14)
...
e_(N−1)1^T F_1 e_(N−1)1 = 0,   (15)
e_(N−1)2^T F_2 e_(N−1)2 = 0,   (16)

where e_i1 and e_i2 are the projections of the ith sequence's camera center in I_i or I_j and in J^i_m or J^i_n, respectively.

With N > 2, we have an overdetermined system, which can easily be solved by rearranging the above equations in the form Ax = 0 and solving for the right null space of A to obtain the ratios. In fact, given N examples in the dataset, we can have as many as N·(N − 1)·C(11, 2) ratios per frame, where C(11, 2) = 55 is the total number of different line-segment combinations given 11 body points. Compared to using triplets, we would have N·C(11, 3) ratios per frame. Given N > 4, this is a huge advantage over using triplets, as we have more redundancy, leading to higher accuracy.

The difficulty with Eqs. (5) and (6) is that the epipoles e'_i, e'_j, e'_m and e'_n are unknown. Fortunately, however, the epipoles can be closely approximated as described below.

Proposition 3. If the exterior orientation of P_1 is related to that of P_2 by a translation, or by a rotation around an axis that lies on the axis planes of P_1, then, under the assumption

e'_i = e'_j = e_1,  e'_m = e'_n = e_2,   (17)

we have:

E(F̂_1, F̂_2) = 0.   (18)

Under more general motion, the equalities in (17) become only approximate. However, we shall see in Section 5.1.1 that this approximation is inconsequential in action recognition for a wide range of practical rotation angles. As described shortly, using Eq. (4) and the fundamental matrices F_1 and F_2 computed for every non-degenerate line segment, we can define a similarity measure for matching the pose transitions I_i → I_j and J^k_m → J^k_n.

Degenerate configurations: If the other example's camera projection is collinear with the two points of the line segment, the problem becomes ill-conditioned. We can either ignore this camera center in favor of other camera centers, or simply ignore the line segment altogether. This does not cause any difficulty in practice, since with the 11-body-point representation used in this paper we obtain 55 possible line segments, the vast majority of which are in practice non-degenerate.

A special case occurs when the epipole is close to or at infinity, for which all line segments would degenerate. We solve this problem by transforming the image points in projective space in a manner similar to Zhang et al. [42]. The idea is to find a pair of projective transformations Q and Q' such that, after the transformation, the epipoles and the transformed image points are not at infinity. Note that these transformations do not affect the projective equality in Proposition 2.

3.3.2. Algorithm for matching pose transitions
The algorithm for matching two pose transitions I_i → I_j and J^k_m → J^k_n is as follows:

1. Compute F, e_1, e_2 between the image pair ⟨I_i, I_j⟩ and ⟨J^k_m, J^k_n⟩ using the method proposed in [43].
2. For each non-degenerate 3D line segment that projects onto ℓ_i, ℓ_j, ℓ^k_m and ℓ^k_n in I_i, I_j, J^k_m and J^k_n, respectively, compute F̂_1 and F̂_2 as described above, and compute e_ℓ = E(F̂_1, F̂_2) from Eq. (4).
3. Compute the average error over all non-degenerate line segments using

E(I_i → I_j, J^k_m → J^k_n) = (1/L) Σ_{ℓ=1...L} e_ℓ,   (19)

where L is the total number of non-degenerate line segments.
4. If E(I_i → I_j, J^k_m → J^k_n) < E_0, where E_0 is some threshold, then the two pose transitions are matched. Otherwise, the two pose transitions are classified as mismatched.
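The linear system of Eqs. (7)–(16) can be assembled and solved as below. This is a sketch under our own conventions: points and epipoles are given as (x, y) pairs or homogeneous triples with unit last coordinate, the constraint rows come from expanding x'^T F x = 0 in the four parameters of Eq. (5), and, following the approximation of Eq. (17), the same epipole is passed for both frames of a camera.

import numpy as np

def F_from_params(p, e, ep):
    # Inter-pose fundamental matrix parameterized by its two epipoles
    # e = (alpha, beta, ...) and ep = (alpha', beta', ...), as in Eqs. (5)-(6):
    # by construction F @ (alpha, beta, 1) = 0 and F.T @ (alpha', beta', 1) = 0.
    a, b, c, d = p
    al, be = e[0], e[1]
    alp, bep = ep[0], ep[1]
    return np.array([
        [a, b, -(a * al + b * be)],
        [c, d, -(c * al + d * be)],
        [-(a * alp + c * bep), -(b * alp + d * bep),
         (a * al + b * be) * alp + (c * al + d * be) * bep]])

def constraint_row(x, xp, e, ep):
    # Coefficients of (a, b, c, d) in the expansion of xp^T F x = 0
    # (Eqs. (7)-(16)); x is a point in the first frame, xp its match in the second.
    u, v = x[0] - e[0], x[1] - e[1]
    up, vp = xp[0] - ep[0], xp[1] - ep[1]
    return np.array([u * up, v * up, u * vp, v * vp])

def estimate_F(correspondences, e, ep):
    # Stack the homogeneous constraints into A p = 0 and take the right null
    # space of A (smallest singular vector), as described in the text.
    A = np.array([constraint_row(x, xp, e, ep) for x, xp in correspondences])
    _, _, Vt = np.linalg.svd(A)
    return F_from_params(Vt[-1], e, ep)

For one line segment, e_ℓ = residual(estimate_F(obs_1, e_1, e_1), estimate_F(obs_2, e_2, e_2)), where obs_1 holds the segment's endpoint correspondences plus the epipole self-correspondences of Eqs. (11)–(16) between I_i and I_j, obs_2 holds the same between J^k_m and J^k_n, and residual() is the sketch given after Eq. (4); averaging e_ℓ over the non-degenerate segments gives Eq. (19).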
3.4. Sequence alignment

Before presenting our solution to action recognition, we first describe the algorithm for matching two action sequences. We represent an action A = {I_{1,...,n}} as a sequence of pose transitions, P(A, r) = {I_{1→r}, ..., I_{r−1→r}, I_{r→r+1}, ..., I_{r→n}},² where I_r is an arbitrarily selected reference pose. If two sequences A = {I_{1...n}} and B = {J_{1...m}} contain the same action, then there exists an alignment between P(A, r_1) and P(B, r_2), where I_{r_1} and J_{r_2} are two corresponding poses. To align the two sequences of pose transitions, we use dynamic programming. Our method for matching two action sequences A and B can therefore be described as follows:

1. Initialization: select a pose transition I_{i_0} → I_{i_1} from A such that the two poses are distinguishable. Then find its best-matched pose transition J_{j_0} → J_{j_1} in B by checking all pose transitions in the sequence, as described in Section 3.3.

² For brevity of notation, we denote the pose transition I_i → I_j as I_{i→j}.


2. For all i = 1 ... n, j = 1 ... m, compute

S_{i,j} = { τ − E(I_{i_0} → I_i, J_{j_0} → J_j)        if i ≠ i_0 and j ≠ j_0,
            τ − E(I_{i_0} → I_{i_1}, J_{j_0} → J_{j_1})   if i = i_0 and j = j_0,
            0                                            otherwise,

where τ is a threshold, e.g. τ = 0.3. S is the matching score matrix of {I_{1,...,n}} and {J_{1,...,m}}.
3. Initialize the n × m accumulated score matrix M as

M_{i,j} = { S_{i,j}   if i = 1 or j = 1,
            0         otherwise.

4. Update the matrix M from top to bottom, left to right (i, j ≥ 2), using

M_{i,j} = S_{i,j} + max{M_{i,j−1}, M_{i−1,j}, M_{i−1,j−1}}.

5. Find (i*, j*) such that

(i*, j*) = argmax_{i,j} M_{i,j}.

Then back-trace M from (i*, j*), and record the path P until it reaches a non-positive element.

The matching score of sequences A and B is then defined as S(A, B) = M_{i*,j*}. The back-traced path P provides an alignment between the two video sequences. Note that this may not be a one-to-one mapping, since there may exist horizontal or vertical runs in the path, which means that a frame may have multiple candidate matches in the other video. In addition, due to noise and computational error, different selections of I_{i_0} → I_{i_1} may lead to different valid alignment results.
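The accumulation and back-tracing of steps 3–5 can be sketched as follows (our code; the score matrix S is assumed to have been filled as in step 2):

import numpy as np

def align(S):
    # Dynamic-programming alignment (steps 3-5 of Section 3.4): S is the
    # matching score matrix of step 2; returns the matching score S(A, B)
    # and the back-traced path giving the frame-to-frame alignment.
    n, m = S.shape
    M = np.zeros((n, m))
    M[0, :], M[:, 0] = S[0, :], S[:, 0]                # step 3: first row/column
    for i in range(1, n):                              # step 4: accumulate scores
        for j in range(1, m):
            M[i, j] = S[i, j] + max(M[i, j - 1], M[i - 1, j], M[i - 1, j - 1])
    bi, bj = np.unravel_index(np.argmax(M), M.shape)   # step 5: best cell
    best, path, i, j = M[bi, bj], [(bi, bj)], bi, bj
    while i > 0 and j > 0:                             # back-trace until non-positive
        i, j = max([(i, j - 1), (i - 1, j), (i - 1, j - 1)], key=lambda p: M[p])
        if M[i, j] <= 0:
            break
        path.append((i, j))
    return best, path[::-1]

A horizontal or vertical run in the returned path corresponds to a frame with multiple candidate matches in the other video, as noted above.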
3.5. Action recognition
To solve the action recognition problem, we need a reference sequence (a sequence of 2D poses) for each known action, and maintain an action database of K actions, DB = {{J^1_t}, {J^2_t}, ..., {J^K_t}}. To classify a given test sequence {I_t}, we match {I_t} against each reference sequence in DB, and classify {I_t} as the action of the best match, say {J^k_t}, if S({I_t}, {J^k_t}) is above a threshold T. Due to the use of the view-invariant fundamental ratios, our solution is invariant to camera intrinsic parameters and viewpoint changes whenever the approximation of the epipoles is valid. As discussed in Section 5.1.1, this can be achieved by using reference sequences from more viewpoints for each action. One major feature of the proposed method is that no training is involved if line segments are used without the weighting (discussed in Section 4), and we can recognize an action from a single example. This is experimentally verified in Section 5.
4. Weighting-based human action recognition
In the previous section, we saw how fundamental ratios can be used for action recognition. But we implicitly made the assumption that all body joints contribute equally to matching pose transitions and to action recognition, which goes against the evidence in the applied perception literature [32]. For instance, in a sport such as boxing, the motion of the upper body parts of the boxer is more important than the motion of the lower body.

In his classic experiments, Johansson [34] demonstrated that humans can identify motion when presented with only a small set of moving dots attached to various body parts. This seems to suggest that people are quite naturally adept at ignoring trivial variations of some body part motions, and at paying attention to those that capture the features essential for action recognition.

With the line-segment representation of human body pose, a similar assertion can be made about body line segments: some body-point line segments have a greater contribution to pose and action recognition than others. Therefore, it is reasonable to assume that, by assigning appropriate weights to the similarity errors of body-point line segments, the performance of pose and action recognition can be improved. To study the significance of different body-point line segments in action recognition, we selected two different sequences of a walking action, W_A = {I_{1...l}} and W_B = {J_{1...m}}, and a sequence of a running action, R = {K_{1...n}}. We then aligned sequences W_B and R to W_A using the alignment method described in Section 3.4, and obtained the corresponding alignments/mappings w: W_A → W_B and w': W_A → R. As discussed in Section 3.3, the similarity of two poses is computed based on the error scores of the motions of all body-point line segments. For each pair of matched poses ⟨I_i, J_{w(i)}⟩, we stacked the error scores of all line segments into a vector V_e(i):

V_e(i) = [E_1, E_2, ..., E_T]^T.   (20)

We then built an error score matrix M_e for the alignment w: W_A → W_B:

M_e = [V_e(1), V_e(2), ..., V_e(l)].   (21)

Each row i of M_e indicates the dissimilarity scores of line segment i across the sequence, and the expected value of each column j of M_e is the dissimilarity score of poses I_j and J_{w(j)}. Similarly, we built an error score matrix M'_e for the alignment w': W_A → R.

To study the role of a line segment i in distinguishing walking from running, we can compare the ith rows of M_e and M'_e, as plotted in Fig. 1a–f. We found that some line segments, such as line segments 1, 2 and 11, have similar error scores in both cases, which means that the motion of these line segments is similar in walking and running. On the other hand, line segments 19, 46 and 49 have high error scores in M'_e and low error scores in M_e; that is, the motion of these line segments in a running sequence is different from their motion in a walking sequence. Line segments 55, 94 and 116 reflect the variation between the walking and running actions, and are thus more informative than line segments 1, 21 and 90 for the task of distinguishing walking and running actions.

We also compared sequences of different individuals performing the same action, in order to analyze the importance of line segments in recognizing them as the same action. For instance, we selected four sequences G0, G1, G2, and G3 of the golf-swing action, aligned G1, G2, and G3 to G0 using the alignment method described in Section 3.4, and then built the error score matrices M^1_e, M^2_e, M^3_e as described above. From the illustrations of M^1_e, M^2_e, M^3_e in Fig. 2a–c, the dissimilarity scores of some line segments, such as line segment 53 (see Fig. 2f), are very consistent across individuals. Some other line segments, such as line segments 6 (Fig. 2d) and 50 (Fig. 2e), have varying error score patterns across individuals; that is, these line segments represent the variations among individuals performing the same action.

Definition 1. If a line segment reflects the essential differences between an action A and other actions, we call it a significant line segment of action A. All other line segments are referred to as trivial line segments of action A.

A typical significant line segment should (i) convey the variations between actions and/or (ii) tolerate the variations of the same action performed by different individuals. For example, line segments 19, 46 and 49 are significant line segments for the walking action, and line segment 53 is a significant line segment for the golf-swing action.

Fig. 1. Roles of line segments in action recognition: (a)–(f) plot the dissimilarity scores of selected line segments across frames in the walk–walk and walk–run alignments. Panels (a)–(c) (line segments 1, 2 and 11) are examples of insignificant line segments, which behave similarly in both walking and running; panels (d)–(f) (line segments 19, 46 and 49) are examples of significant line segments for distinguishing between walking and running. As can be observed, line segments 1, 21 and 90 have similar error scores in both cases, which essentially means that the motion of these line segments is similar in walking and running. But line segments 55, 94 and 116 have high error scores in M'_e and low error scores in M_e, which means that the motion of these line segments in a running sequence is different from their motion in a walking sequence. Therefore, these line segments reflect the variation between walking and running and are much more useful for distinguishing between the two actions.
Fig. 2. Roles of different line segments in action recognition. We selected four sequences G0, G1, G2, and G3 of the golf-swing action, aligned G1, G2, and G3 to G0 using the alignment method described in Section 3.4, and then built the error score matrices M^1_e, M^2_e, M^3_e correspondingly, as in the above experiments. The panels plot the dissimilarity scores of line segments 6, 50 and 53 for the three alignments (legend: Golf-swing 01/02/03). As can be observed, the dissimilarity scores of some line segments, such as line segment 53, are very consistent across individuals. Some other line segments, such as line segments 6 and 50, have varying error score patterns across individuals; that is, these line segments represent the variations of individuals performing the same action.

Intuitively, in action recognition we should place more emphasis on the significant line segments while reducing the negative impact of trivial line segments, that is, assign appropriate influence factors to the body-point line segments. In our approach to action recognition, this can be achieved by assigning appropriate weights to the similarity errors of the body-point line segments in Eq. (19). That is, Eq. (19) can be rewritten as:

E(I_i → I_j, J^k_m → J^k_n) = Σ_{ℓ=1...L} ω_ℓ e_ℓ,   (22)

where L is the total number of non-degenerate line segments and ω_1 + ω_2 + ... + ω_L = 1.

The next question is how to determine the optimal set of weights ω_ℓ for different actions. Manual assignment of weights could be biased and difficult for a large database of actions, and is inefficient when new actions are added. Therefore, automatic assignment of weight values is desired for a robust and efficient action recognition system. To achieve this goal, we propose to use a fixed-size dataset of training sequences to learn the weight values.

Suppose we are given a training dataset T which consists of K × J action sequences for J different actions, performed by K different individuals. Let ω_ℓ be the weight value of the body joint with label ℓ (ℓ = 1 ... L) for a given action. Our goal is to find the optimal weights ω_ℓ that maximize the similarity error between sequences

of different actions and minimize those of the same actions. Since the size of the dataset and the alignments of the sequences are fixed, this becomes an optimization problem over ω. Our task is to define a good objective function f(ω_1, ..., ω_L) for this purpose, and to apply optimization to solve the problem.


4.1. Weights on line segments versus weights on body points


Given a human body model of n points, we could obtain at most C(n, 2) line segments, and would need to solve a C(n, 2)-dimensional optimization problem for weight assignment. Even with a simplified human body model of 11 points, this yields an extremely high-dimensional (C(11, 2) = 55 dimensions) problem. On the other hand, the body-point line segments are not independent of each other. In fact, adjacent line segments are correlated through their common body point, and the importance of a line segment is also determined by the importance of its two body points. Therefore, instead of using C(n, 2) variables for the weights of the C(n, 2) line segments, we assign n weights ω_{1...n} to the body points P_{1...n}, where:

ω_1 + ω_2 + ... + ω_n = 1.   (23)

The weight of a line segment ℓ = ⟨P_i, P_j⟩ is then computed as:

λ_ℓ = (ω_i + ω_j) / n.   (24)

Note that the definition of λ_ℓ in (24) ensures that λ_1 + λ_2 + ... + λ_T = 1. Using (24), Eq. (22) is rewritten as:

E(I_1 → I_2, J_i → J_j) = (1/n) Median_{1≤i<j≤n} [(ω_i + ω_j) · E_{i,j}].

By introducing the weights {ω_{1...n}} on body points, we reduce the high-dimensional optimization problem to a lower-dimensional, more tractable problem.
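A small sketch of this reduction (our code; segment errors are assumed to be listed in the itertools.combinations order of the point pairs, which is our own convention):

import itertools
import numpy as np

def segment_weights(point_weights):
    # Line-segment weights from body-point weights (Eqs. (23)-(24)).
    n = len(point_weights)
    pairs = list(itertools.combinations(range(n), 2))
    lam = np.array([(point_weights[i] + point_weights[j]) / n for i, j in pairs])
    return pairs, lam

def weighted_error(segment_errors, point_weights):
    # Weighted pose-transition error of Eq. (22) using the weights of Eq. (24).
    _, lam = segment_weights(point_weights)
    return float(lam @ np.asarray(segment_errors, dtype=float))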
4.2. Automatic adjustment of weights
Before moving on to the automatic adjustment of weights, we first discuss the similarity score of two pre-aligned sequences. Given two sequences A = {I_{1...N}}, B = {J_{1...M}}, and the known alignment w: A → B, the similarity of A and B is:

S(A, B) = Σ_{l=1}^{N} S(l, w(l)) = Nτ − Σ_{l=1}^{N} E(I_{l→r_1}, J_{w(l)→r_2}),   (25)

where r_1 and r_2 are the computed reference poses, and τ is a threshold, which we set as suggested in [44,45]. Therefore, the approximate similarity score of A and B is:

S(A, B) ≈ Nτ − (1/N) Σ_{l=1}^{N} Σ_{1≤i<j≤n} (ω_i + ω_j) · E_{l,w(l)}(i, j).   (26)

Considering that N, τ, n and E_{l,w(l)}(i, j) are constants given the alignment w, Eq. (26) can be further rewritten in a simpler form:

S(A, B) = a_0 − Σ_{i=1}^{n−1} a_i · ω_i,

where {a_i} are constants computed from (26).


Now let us return to the problem of automatic weight assignment for action recognition. As discussed earlier, a good objective function should reflect the intuition that significant line segments should be assigned higher weights, while trivial line segments should be assigned lower weights. Suppose we have a training dataset T which consists of K × J action sequences for J different actions, each with K pre-aligned sequences performed by various individuals. T^j_k is the kth sequence in the group of action j, and R^j is the reference sequence of action j. To find the optimal weight assignment for action j, we define the objective function as:

f_j(ω_1, ω_2, ..., ω_{n−1}) = Q_1 − aQ_2 − bQ_3,   (27)

where a and b are non-negative constants and

Q_1 = (1/K) Σ_{k=1}^{K} S(R^j, T^j_k),   (28)

Q_2 = (1/K) Σ_{k=1}^{K} S(R^j, T^j_k)^2 − Q_1^2,   (29)

Q_3 = (1/(K(J−1))) Σ_{1≤i≤J, i≠j} Σ_{k=1}^{K} S(R^j, T^i_k).   (30)

The optimal weights for action j are then computed using:

⟨ω_1, ..., ω_{n−1}⟩ = argmax_{ω_1, ω_2, ..., ω_{n−1}} f_j(ω_1, ..., ω_{n−1}; a, b).   (31)

In this objective function, we use T^j_1 as the reference sequence for action j, and the terms Q_1 and Q_2 are the mean and variance of the similarity scores between T^j_1 and the other sequences of the same action. Q_3 is the mean of the similarity scores between T^j_1 and all sequences of the other actions. Hence f_j(ω_1, ω_2, ..., ω_{n−1}) achieves high similarity scores for all sequences of the same action j, and low similarity scores for sequences of different actions. The second term Q_2 may be interpreted as a regularization term that ensures the consistency of sequences in the same group.

As Q_1 and Q_3 are linear functions and Q_2 is a quadratic polynomial, our objective function f_j(ω_1, ω_2, ..., ω_{n−1}) is a quadratic polynomial, and the optimization problem becomes a quadratic programming (QP) problem. There are a variety of methods for solving the QP problem, including interior point, active set, conjugate gradient, etc. In our problem, we adopted the conjugate gradient method, with the initial weight values set to (1/n, 1/n, ..., 1/n).

Degenerate line segments: As before, degenerate line segments are ignored. As explained earlier, with 11 body points, we obtain a total of 55 possible line segments, the vast majority of which are in practice non-degenerate.
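The weight learning of Eqs. (27)–(32) can be prototyped as below. This is a sketch under our own interface: S_same(w) and S_other(w) are assumed callables returning the similarity scores of Eq. (26) against same-action and other-action training sequences, and we use SciPy's SLSQP solver (rather than the conjugate gradient method used in the paper) so that the bound and sum constraints of Eq. (32) are handled directly.

import numpy as np
from scipy.optimize import minimize

def fit_action_weights(S_same, S_other, a=1.0, b=1.0, n_weights=10):
    # S_same(w): scores S(R^j, T^j_k) for the K same-action training sequences
    # S_other(w): scores against the sequences of all other actions
    # Both are assumed to implement the (approximately linear) score of Eq. (26).
    def neg_fj(w):
        same, other = np.asarray(S_same(w)), np.asarray(S_other(w))
        q1 = same.mean()                      # Eq. (28): mean within-action score
        q2 = (same ** 2).mean() - q1 ** 2     # Eq. (29): within-action variance
        q3 = other.mean()                     # Eq. (30): mean cross-action score
        return -(q1 - a * q2 - b * q3)        # maximize f_j of Eq. (27)
    w0 = np.full(n_weights, 1.0 / (n_weights + 1))       # (1/11, ..., 1/11) start
    res = minimize(neg_fj, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_weights,                  # Eq. (32)
                   constraints=[{"type": "ineq",
                                 "fun": lambda w: 1.0 - w.sum()}])   # sum <= 1
    return res.x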
5. Experimental results and discussion
We first examine our method on semi-synthetic data. In particular, we first demonstrate that our method is resilient to viewpoint changes and noise. We then present our results for action recognition and demonstrate that weighting considerably improves our results. We then present our results on two sets of real video data: the IXMAS multiple-view dataset [7], and the UCF-CIL dataset consisting of video sequences of 8 actions (available at http://cil.cs.ucf.edu/actionrecognition.html).

For weighting, we used a different protocol for each dataset, depending on its size. For the MoCap dataset, we used a training dataset of 2 × 17 × 5 = 170 sequences for five actions (walk, jump, golf swing, run, and climb); each action is performed by two subjects, and each instance of an action is observed by 17 cameras at different random locations. The remaining sequences, belonging to a different actor, are used to test the action. Since there are three actors, we train the weights three times and test on the remaining person: we train using persons 1 and 2 and test on person 3; train on persons 1 and 3 and test on person 2; and finally train on persons 2 and 3 and test on person 1. The UCF-CIL dataset has only 48 sequences, so we use leave-one-out cross-validation: we took out the reference sequence and used the rest of the sequences for training. Hence, we had 48 − 1 = 47 sequences for training the weights, and we had to train the weights 48 times to test on the 48 sequences. IXMAS is a large dataset, and we tested each sequence by randomly generating a reference dataset of 2 × 5 × 10 = 100 sequences for 10 actions performed by two people observed from five different viewpoints, and tested on the remaining sequences.

Fig. 3. Left: our body model. Right: experiment on view-invariance. Two different pose transitions P1 → P2 and P3 → P4 from a golf swing action are used.

5.1. Analysis based on motion capture data


We generated our data based on the CMU Motion Capture Database, which consists of 3D motion data for a large number of human actions. We generated the semi-synthetic data by projecting the 3D points onto images through synthesized cameras. In other words, our test data consist of video sequences of real persons, but the cameras are synthetic, resulting in semi-synthetic data to which various levels of noise were added. Instead of using all the body points provided in CMU's database, we employed a body model that consists of only eleven points, including the head, shoulders, elbows, hands, knees and feet (see Fig. 3).

5.1.1. Testing view invariance

We selected four different poses P1, P2, P3, P4 from a golf swing sequence (see Fig. 3). We then generated two cameras as shown in Fig. 4a: camera 1 was placed at an arbitrary viewpoint (marked in red), with focal length f1 = 1000; camera 2 was obtained by rotating camera 1 around an axis in the x–z plane of camera 1 (colored green), and around a second axis in the y–z plane of camera 1 (colored blue), and by changing the focal length to f2 = 1200. Let I1 and I2 be the images of poses P1 and P2 in camera 1, and I3, I4, I5 and I6 the images of poses P1, P2, P3 and P4 in camera 2, respectively. Two sets of pose similarity errors were computed at all camera positions shown in Fig. 4a: E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6). The results are plotted in Fig. 4b and c, which show that, when the two cameras observe the same pose transitions, the error is zero regardless of their different viewpoints, confirming Proposition 3.

Similarly, we fixed camera 1 and moved camera 2 on a sphere, as shown in Fig. 4d. The errors E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6) are shown in Fig. 4e and f. Under this more general camera motion, the pose similarity score of corresponding poses is not always zero, since the epipoles in Eqs. (5) and (6) are approximated. However, this approximation is inconsequential in most situations, because the error surface of different pose transitions is in general above that of corresponding pose transitions. Fig. 4g shows the regions (colored black) where the approximation is invalid. These regions correspond to angles between the camera orientations of around 90 degrees, which usually implies severe self-occlusion and lack of corresponding points in practice. The experiments on real data in Section 5.2 also show the validity of this approximation under practical camera viewing angles.

Fig. 4. Analysis of view invariance: (a) Camera 1 is marked in red, and all positions of camera 2 are marked in blue and green. (b) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored green in (a). (c) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored blue in (a). (d) General camera motion: camera 1 is marked in red, and camera 2 is distributed on a sphere. (e) Error surface of same pose transitions for all positions of camera 2 in (d). (f) Error surface of different pose transitions for all positions of camera 2 in (d). (g) The regions of confusion for (d), marked in black (see text). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.1.2. Testing robustness to noise

Without loss of generality, we used the four poses in Fig. 3 to analyze the robustness of our method to noise. Two cameras with different focal lengths and viewpoints were examined. As shown in Fig. 5, I1 and I2 are the images of poses P1 and P2 in camera 1, and I3, I4, I5 and I6 are the images of P1, P2, P3 and P4 in camera 2. We then added Gaussian noise to the image points, with σ increasing from 0 to 8 pixels. The errors E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6) were computed. For each noise level, the experiment was repeated for 100 independent trials, and the mean and standard deviation of both errors were calculated (see Fig. 5). As shown in the results, the two cases are distinguished unambiguously until σ increases to 4.0, i.e., up to possibly 12 pixels. Note that the image size of the subject was about 200 × 300 pixels, which implies that our method performs remarkably well under high noise.

Fig. 5. Robustness to noise: I1 and I2 are the images in camera 1, and I3, I4, I5 and I6 are the images in camera 2. Same and different actions are distinguished unambiguously for σ < 4.

5.1.3. Performance in action recognition

We selected five classes of actions from CMU's MoCap dataset: walk, jump, golf swing, run, and climb. Each action class is performed by 3 actors, and each instance of a 3D action is observed by 17 cameras, as shown in Fig. 6. The focal lengths were changed randomly in the range 1000 ± 300. Fig. 7 shows an example of a 3D pose observed from 17 viewpoints.

Fig. 6. The distribution of cameras used to evaluate view-invariance and camera parameter changes.

Our dataset consists of 255 video sequences in total, from which we generated a reference action database (DB) of 5 video sequences, i.e. one video sequence for each action class. The rest of the dataset was used as test data, and each sequence was matched against all actions in the DB and classified as the one with the highest score. For each sequence matching, 10 random initializations were tested and the best score was used.

Fig. 7. A pose observed from 17 viewpoints. Note that only the 11 body points in red are used. The stick figures are shown here for better illustration of the pose configuration and the extreme variability handled by our method. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Classification results without weighting are summarized in Table 1. The overall recognition rate is 89.2%.

Table 1
Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 89.20%. (Rows: ground truth; columns: recognized as walk, jump, golf swing, run, climb. Diagonal entries: walk 43, jump 46, golf swing 46, run 44, climb 44.)

For weighting, we build a MoCap training dataset which consists of a total of 2 × 17 × 5 = 170 sequences for 5 actions (walk, jump, golf swing, run, and climb): each action is performed by 2 subjects, and each instance of an action is observed by 17 cameras at different random locations. We use the same set of reference sequences for the 5 actions as in the unweighted case, and align the sequences in the training set against the reference sequences. To obtain the optimal weighting for each action j, we first aligned all sequences against the reference sequence R^j, and stored the similarity scores of the line segments for each pair of matched poses. The objective function f_j(ω_1, ω_2, ..., ω_10) is then built based on Eq. (27) and the computed similarity scores of the line segments in the alignments. f_j(·) is a 10-dimensional function, and the weights ω_i are constrained by

0 ≤ ω_i ≤ 1,  Σ_{i=1}^{10} ω_i ≤ 1,  i = 1, ..., 10.   (32)

The optimal weights ⟨ω_1, ω_2, ..., ω_10⟩ are then searched to maximize f_j(·), with the initialization at (1/11, 1/11, ..., 1/11). The conjugate gradient method is then applied to solve this optimization problem. After performing the above steps for all the actions, we obtained a set of weights W^j for each action j in our database. Classification results are summarized in Table 2. The overall recognition rate is 92.4%, which is an improvement of 3.2% compared to the unweighted case (see Tables 3 and 4).

Table 2
Confusion matrix after applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 92.40%, which is an improvement of 3.2% compared to the non-weighted case. (Diagonal entries: walk 45, jump 47, golf swing 47, run 46, climb 46.)

Table 3
Confusion matrix for [45]. The overall recognition rate is 91.6%. (Diagonal entries: walk 45, jump 47, golf swing 48, run 47, climb 42.)

Table 4
Confusion matrix for [31]. The overall recognition rate is 81.6%. (Diagonal entries: walk 39, jump 44, golf swing 45, run 41, climb 35.)

5.2. Results on real data

5.2.1. UCF-CIL dataset
The UCF-CIL dataset consists of video sequences of eight classes of actions collected on the internet (see Fig. 9): ballet fouette, ballet spin, push-up exercise, golf swing, one-handed tennis backhand stroke, two-handed tennis backhand stroke, tennis forehand stroke, and tennis serve. Each action is performed by different subjects, and the videos are taken by different, unknown cameras from various viewpoints. In addition, videos in the same class of action

may have different starting and ending points, and thus may only partially overlap. The execution speeds also vary across the sequences of each action. Self-occlusion also exists in many of the sequences, e.g. golf, tennis, etc.

Fig. 8a shows an example of matching action sequences. The frame rates and viewpoints of the two sequences are different, and the two players perform the golf-swing action at different speeds. The accumulated score matrix and back-tracked path of the dynamic programming are shown in Fig. 8c. Another result, on tennis-serve sequences, is shown in Fig. 8b and d (see Fig. 10).

We built an action database DB by selecting one sequence for each action; the rest were used as test data and were matched against all actions in the DB. Each sequence was recognized as the action with the highest matching score. The confusion matrix is shown in Table 5, which indicates an overall 95.83% classification accuracy for real data. As shown by these results, our method successfully recognizes various actions performed by different subjects, regardless of camera intrinsic parameters and viewpoints.

We test each sequence using the take-one-out strategy. With weighting, the classification results are summarized in Table 6. The overall recognition rate is 100%, which is an improvement of 4.17% compared to the non-weighted case (see Tables 7 and 8).

Table 5
Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 97.92%. The actions are denoted by numbers: 1 ballet fouette, 2 ballet spin, 3 pushup, 4 golf swing, 5 one-handed tennis backhand, 6 two-handed tennis backhand, 7 tennis forehand, 8 tennis serve. (Diagonal entries: #1: 3, #2: 10, #3: 5, #4: 7, #5: 3, #6: 7, #7: 3, #8: 9; the single error is one sequence of action #2 recognized as action #1.)

Table 6
Confusion matrix after applying weighting: the overall recognition rate is 100%, which is an improvement of 2% compared to the non-weighted case. (Diagonal entries: #1: 3, #2: 11, #3: 5, #4: 7, #5: 3, #6: 7, #7: 3, #8: 9; all off-diagonal entries are zero.)
5.2.2. IXMAS dataset
We also evaluated our method on the IXMAS dataset [7], which has 5 different views of 13 different actions, each performed three times by 11 different actors. We tested on actions {1, 2, 3, 4, 5, 8, 9, 10, 11, 12}. Similar to [7], we applied our method to all actors except Pao and Srikumar, and used "andreas 1" under "cam1" as the reference for all actions, similar to [45]. The rest of the sequences were used to test our method. The recognition results for the non-weighted case are shown in Table 10; the average recognition rate is 90.5%. For weighting, we tested each sequence by randomly generating a reference dataset of 2 × 5 × 10 = 100 sequences for 10 actions performed by two people observed from five different viewpoints. The results are shown in Table 11. The average recognition rate is 92.6%, a 2.1% improvement over the non-weighted case. In addition, we compare our method to others in Table 9. As can be seen, our method improves on each camera view (see Tables 12 and 13).

Fig. 10. Examples from the IXMAS dataset.

Fig. 8. Examples of matching action sequences: (a) matching two golf-swing sequences; (b) matching two tennis-serve sequences; (c) and (d) show the accumulated score matrices and backtracked paths (frame number of the lower sequence versus frame number of the upper sequence), resulting in the alignments shown in (a) and (b), respectively.

5.2.3. Testing occlusion

As discussed earlier, we handle occlusions by ignoring the line segments involving the occluded points. Since there are a total of 11 points in our body model, there are a total of 55 line segments. If, for example, three points are occluded, 28 line segments still remain. While the non-weighted method would be expected to degrade when fewer line segments are used, weighting the line segments still makes it possible to differentiate between actions that depend on the non-occluded points. While our previous experiments implicitly involve self-occlusion, in this section we rigorously test our method when occlusion is explicitly present. In particular, we test the following scenarios: (i) the upper body is occluded, including the head and shoulder points; (ii) the right side of the body is occluded, including the shoulder, arm, hand, and knee points; (iii) the left side of the body is occluded, including the shoulder, arm, hand, and knee points; (iv) the lower body is occluded, including the knee and feet points. Thus (i) has 3 occluded points and the remaining test cases each have 4 occluded points. The results are shown in Tables 14-17.
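The segment counts above (55 in total, 28 when three points are occluded) follow from choosing pairs among the visible points. A minimal bookkeeping sketch of our own, assuming the 11 body points are indexed 0-10 and the occluded ones are given as a set:

from itertools import combinations

def visible_segments(occluded, n_points=11):
    # Return the line segments (point-index pairs) that involve no occluded
    # point; with 11 points there are C(11, 2) = 55 segments in total.
    visible = [p for p in range(n_points) if p not in occluded]
    return list(combinations(visible, 2))

# Example: occluding 3 points leaves C(8, 2) = 28 usable segments.
print(len(visible_segments(occluded={0, 1, 2})))  # -> 28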


Fig. 9. Examples from the UCF-CIL dataset consisting of 8 categories (actions) used to test the proposed method. Ballet fouettés: (1)-(4); ballet spin: (5)-(16); push-up: (17)-(22); golf swing: (23)-(30); one-handed tennis backhand stroke: (31)-(34); two-handed tennis backhand stroke: (35)-(42); tennis forehand stroke: (43)-(46); tennis serve: (47)-(56).

As can be seen from these results, our method is able to recognize actions even when such drastic occlusions are present. The few low percentages in the tables correspond to actions that depend largely on the occluded part. For instance, the kick action has a recognition rate of only 5.5% when the lower body is occluded; since this action is based almost entirely on the lower part of the body, the low rate is not surprising. In general, the recognition rates drop because we are using fewer line segments and, more importantly, fewer points to compute the fundamental matrix (when 4 points are occluded, we are forced to use the 7-point algorithm [41]).
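As an illustration of this fallback (an implementation sketch on our part using OpenCV, not code prescribed by the paper), the fundamental matrix can be estimated from whatever correspondences remain visible, switching to the 7-point algorithm when only seven points are available:

import cv2
import numpy as np

def estimate_F(pts1, pts2):
    # pts1, pts2: (N, 2) arrays of matched visible body points, N >= 7.
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    # Use the normalized 8-point algorithm when enough points are visible,
    # and the 7-point algorithm otherwise.
    method = cv2.FM_8POINT if len(pts1) >= 8 else cv2.FM_7POINT
    F, _ = cv2.findFundamentalMat(pts1, pts2, method)
    # The 7-point algorithm may return up to three stacked 3x3 solutions;
    # keep the first one here for simplicity.
    return None if F is None else F[:3, :]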

5.3. How soon can we recognize the action?

We also experimented with how soon our method is able to distinguish between different actions. This helps gauge whether our method could operate in real time, a question that has received attention from researchers such as [52,53]. To do this, we looked at all the correctly classified sequences; the results are summarized in Table 18. For instance, for action 1, on average we can detect the action after seeing 60% of the sequence. The best and worst cases are also provided.
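One way to obtain the entries of Table 18 (a bookkeeping sketch of our own, assuming a hypothetical helper classify_prefix that classifies a partial sequence) is to find, for each correctly classified sequence, the earliest prefix whose prediction matches the true label and stays correct for all longer prefixes:

def earliest_recognition_fraction(frames, true_label, classify_prefix):
    # Return the smallest fraction of the sequence after which the predicted
    # label equals true_label and remains correct through the final frame.
    n = len(frames)
    earliest = n
    for k in range(n, 0, -1):              # walk back from the full sequence
        if classify_prefix(frames[:k]) != true_label:
            break                           # correctness is lost before frame k
        earliest = k
    return earliest / n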

6. Discussions and conclusion

Table 19 gives a summary of the existing methods for view-invariant action recognition. In terms of the number of required cameras, the existing methods fall into two categories: multiple-view methods ([8,20], etc.) and monocular methods ([29,8,24,6,28,18] and ours). Multiple-view methods are more expensive and less practical in real-life problems where only one camera is available, e.g. monocular surveillance. Many of these methods also make additional assumptions, such as an affine camera model ([24,6,18]), which can be readily violated in many practical situations, or impose anthropometric constraints such as isometry. Others, e.g. Parameswaran et al. [8], make the additional assumption that


Fig. 10. Examples from the IXMAS dataset.

Table 5
Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 97.92%. The actions are denoted by numbers: 1 ballet fouetté, 2 ballet spin, 3 push-up, 4 golf swing, 5 one-handed tennis backhand, 6 two-handed tennis backhand, 7 tennis forehand, 8 tennis serve. Rows are the ground-truth actions #1-#8 and columns are the recognized actions #1-#8.

Table 6
Confusion matrix after applying weighting: the overall recognition rate is 100%, which is an improvement of 2% compared to the non-weighted case. Rows are the ground-truth actions #1-#8 and columns are the recognized actions #1-#8; with 100% recognition, all entries lie on the diagonal.

canonical poses are predefined, or that certain limbs trace planar areas during actions; Sheikh et al. [6] assume that each action is spanned by a set of action bases, estimated directly from training sequences. This implicitly requires that the start and the end of a test sequence be restricted to those used during training. Moreover, the training set needs to be large enough to accommodate the inter-subject irregularities of human actions.
In summary, the major contributions of this paper are: (i) We generalize the concept of fundamental ratios and demonstrate its

Table 7
Confusion matrix for [45]. Rows are the ground-truth actions #1-#8 and columns are the recognized actions #1-#8.


Table 8
Confusion matrix for [31]. Rows are the ground-truth actions #1-#8 and columns are the recognized actions #1-#8.

Table 13
Per-action recognition rates for [31] on the IXMAS dataset. Average recognition rate is 85.6%.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        85.2                   8        90.4
2        89.6                   9        89.6
3        82.1                   10       82.1
4        78.4                   11       91.1
5        89.6                   12       82.1

Table 9
Recognition rates in % on the IXMAS dataset. Shen [45] and Shen [31] use the same set of body points as our method.
Method                                 All     Cam1    Cam2    Cam3    Cam4    Cam5
Fundamental ratios without weighting   90.5    92.0    89.6    86.6    82.0    78.0
Fundamental ratios with weighting      92.6    94.2    93.5    94.4    92.6    82.2
Shen [31]                              85.6
Shen [45]                              90.2
Weinland [46]                          83.5
Weinland [30]                          57.9
Reddy [47]                             72.6
Tran [48]                              80.2
Junejo [49]                            72.7
Liu [50]
Farhadi [51]                           58.1
Liu [27]                               82.8

Table 10
Per-action recognition rates on the IXMAS dataset before applying weighting. Average recognition rate is 90.5%. The actions are denoted by numbers: 1 = check watch, 2 = cross arms, 3 = scratch head, 4 = sit down, 5 = get up, 8 = wave, 9 = punch, 10 = kick, 11 = point, and 12 = pick up.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        92.6                   8        92.6
2        91.1                   9        92.6
3        85.2                   10       88.1
4        91.1                   11       91.1
5        89.6                   12       87.3

Table 11
Per-action recognition rates on the IXMAS dataset after applying weighting: the overall recognition rate is 92.6%, which is an improvement of 2.1% compared to the non-weighted case.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        94.8                   8        92.6
2        91.1                   9        92.6
3        87.2                   10       91.1
4        92.6                   11       92.6
5        92.6                   12       89.6

Table 12
Per-action recognition rates for [45] on the IXMAS dataset. Average recognition rate is 90.23%.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        89.6                   8        85.2
2        94.8                   9        92.6
3        85.2                   10       91.1
4        91.1                   11       90.4
5        91.1                   12       89.6

important role in action recognition. The advantage of using line segments as opposed to triplets is that it introduces more redundancy and leads to better results. (ii) We compare transitions of two poses, which encodes temporal information of human motion while keeping the problem at its atomic level. (iii) We decompose a

Table 14
Per-action recognition rates when the head and two shoulder points are occluded. The actions are the same as in Table 10.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        85.5                   8        92.3
2        91.1                   9        90.3
3        83.3                   10       83.3
4        81.1                   11       90.4
5        91.1                   12       83.3

Table 15
Per-action recognition rates when the right side of the body is occluded, including the right shoulder, arm, hand, and knee points.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        83.3                   8        3.3
2        54.5                   9        10.3
3        5.5                    10       79.1
4        58.8                   11       5.6
5        61.3                   12       16.1

Table 16
Per-action recognition rates when the left side of the body is occluded, including the left shoulder, arm, hand, and knee points.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        3.3                    8        83.3
2        47.5                   9        73.3
3        75.5                   10       76.7
4        57.7                   11       77.1
5        66.7                   12       66.7

Table 17
Per-action recognition rates when the lower body is occluded, including the two knee and two feet points.
Action   Recognition rate (%)   Action   Recognition rate (%)
1        86.6                   8        81.1
2        83.3                   9        79.3
3        78.1                   10       5.5
4        45.2                   11       78.1
5        54.8                   12       36.6

human pose into a set of line segments and represent a human action by the motion of the 3D lines defined by those segments. This converts the study of non-rigid human motion into that of multiple rigid planar motions, making it possible to apply well-studied rigid-motion concepts and providing a novel direction for studying articulated motion. Our results confirm that using line segments considerably improves the accuracy over [31]. Of course, this does not preclude that our ideas of line segments and weighting could be applied to other methods such as [45], which may also gain accuracy in the same manner as [31] (as studied in this paper). (iv) We propose a generic method for weighting body-point line segments, in an attempt to emulate the human foveated approach to pattern recognition. Results after applying this scheme indicate significant improvement. This idea can be applied to [45] as well, and probably to a host of other methods, whose performance may improve in the same manner as [31], as shown in this paper. (v) We study how this weighting strategy can be useful when there is partial but significant occlusion. (vi) We also investigate how soon our method is able to recognize

Table 18
This table shows how soon we can recognize an action for the IXMAS dataset. Each entry is the percentage of the sequence used: best case-worst case-average case.
Action   % of sequence used   Action   % of sequence used
1        30-88-60             8        56-88-69
2        33-77-50             9        48-81-63
3        56-91-77             10       45-89-77
4        35-67-56             11       60-92-78
5        40-77-66             12       37-79-55

Table 19
Comparison of different methods.
Method   # of views   Camera model        Input                                                Other assumptions
Ours     1            Persp. projection   Body points
[45]     1            Persp. projection   Body points
[46]     >=1          Persp. projection   3D HoG
[30]     >=1          Persp. projection   Silhouettes
[47]     All          Persp. projection   Interest points
[51]     >1           Persp. projection   Histogram of the silhouette and of the optic flow
[48]     All          Persp. projection   Histogram of the silhouette and of the optic flow
[50]     1            Persp. projection   3D interest points
[8]      >1           Persp. projection   Body points                                          Five pre-selected coplanar points or limbs trace planar area
[7]      All          Persp. projection   Visual hulls
[20]     >1           Persp. projection   Optical flow silhouettes
[24]     1            Affine              Body points
[18]     1            Affine              Silhouettes
[6]      1            Affine              Body points                                          Same start and end of sequences
[29]     1            Persp. projection   Body points
[49]     >=1          Persp. projection   Body points/optical flow/HoG

the action. (vii) We provide extensive experiments to rigorously test our method on three different datasets.

References
[1] V. Zatsiorsky, Kinematics of Human Motion, Human Kinetics, 2002.
[2] D. Gavrila, Visual analysis of human movement: a survey, CVIU 73 (1) (1999) 82-98.
[3] T. Moeslund, E. Granum, A survey of computer vision-based human motion capture, CVIU 81 (3) (2001) 231-268.
[4] T. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, CVIU 104 (2-3) (2006) 90-126.
[5] L. Wang, W. Hu, T. Tan, Recent developments in human motion analysis, Pattern Recognition 36 (3) (2003) 585-601.
[6] Y. Sheikh, M. Shah, Exploring the space of a human action, ICCV 1 (2005) 144-149.
[7] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, CVIU 104 (2-3) (2006) 249-257.
[8] V. Parameswaran, R. Chellappa, View invariants for human action recognition, CVPR 2 (2003) 613-619.
[9] A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, ICCV (2003) 726-733.
[10] G. Zhu, C. Xu, W. Gao, Q. Huang, Action recognition in broadcast tennis video using optical flow and support vector machine, LNCS 3979 (2006) 89-98.
[11] L. Wang, Abnormal walking gait analysis using silhouette-masked flow histograms, ICPR 3 (2006) 473-476.
[12] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, ICPR 3 (2004) 32-36.
[13] I. Laptev, S. Belongie, P. Perez, J. Wills, Periodic motion detection and segmentation via approximate sequence alignment, ICCV 1 (2005) 816-823.
[14] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proc. ICCV, vol. 2, 2005, pp. 1395-1402.
[15] L. Wang, D. Suter, Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model, in: CVPR, 2007, pp. 1-8.
[16] A. Bobick, J. Davis, The recognition of human movement using temporal templates, IEEE Transactions on PAMI 23 (3) (2001) 257-267.
[17] L. Wang, T. Tan, H. Ning, W. Hu, Silhouette analysis-based gait recognition for human identification, IEEE Transactions on PAMI 25 (12) (2003) 1505-1518.
[18] A. Yilmaz, M. Shah, Actions sketch: a novel action representation, CVPR 1 (2005) 984-989.
[19] A. Yilmaz, M. Shah, Matching actions in presence of camera motion, CVIU 104 (2-3) (2006) 221-231.
[20] M. Ahmad, S. Lee, HMM-based human action recognition using multiview image sequences, ICPR 1 (2006) 263-266.
[21] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, Proceedings of CVPR (2006) 1701-1708.


[22] F. Lv, R. Nevatia, Single view human action recognition using key pose matching and viterbi path searching, Proceedings of CVPR (2007) 1-8.
[23] L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, A. Pentland, Invariant features for 3-d gesture recognition, FG (1996) 157-162.
[24] C. Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, IJCV 50 (2) (2002) 203-226.
[25] A. Farhadi, M.K. Tabrizi, I. Endres, D.A. Forsyth, A latent model of discriminative aspect, in: ICCV, 2009, pp. 948-955.
[26] R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, in: CVPR, 2012.
[27] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, in: CVPR, 2011, pp. 3209-3216.
[28] T. Syeda-Mahmood, A. Vasilescu, S. Sethi, Recognizing action events from multiple viewpoints, in: Proceedings of IEEE Workshop on Detection and Recognition of Events in Video, 2001, pp. 64-72.
[29] A. Gritai, Y. Sheikh, M. Shah, On the use of anthropometry in the invariant analysis of human actions, ICPR 2 (2004) 923-926.
[30] D. Weinland, E. Boyer, R. Ronfard, Action recognition from arbitrary views using 3d exemplars, in: ICCV, 2007, pp. 1-7.
[31] Y. Shen, H. Foroosh, View-invariant action recognition using fundamental ratios, in: Proc. of CVPR, 2008, pp. 1-6.
[32] A. Schütz, D. Braun, K. Gegenfurtner, Object recognition during foveating eye movements, Vision Research 49 (18) (2009) 2241-2253.
[33] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, 2004. ISBN: 0521540518.
[34] G. Johansson, Visual perception of biological motion and a model for its analysis, Perception and Psychophysics 14 (1973) 201-211.
[35] V. Pavlovic, J. Rehg, T. Cham, K. Murphy, A dynamic bayesian network approach to figure tracking using learned dynamic models, ICCV (1) (1999) 94-101.
[36] D. Ramanan, D.A. Forsyth, A. Zisserman, Strike a pose: tracking people by finding stylized poses, Proceedings of CVPR 1 (2005) 271-278.
[37] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people and recognizing their activities, Proceedings of CVPR 2 (2005) 1194.
[38] J. Rehg, T. Kanade, Model-based tracking of self-occluding articulated objects, ICCV (1995) 612-617.
[39] J. Sullivan, S. Carlsson, Recognizing and tracking human action, in: ECCV, Springer-Verlag, London, UK, 2002, pp. 629-644.
[40] J. Aggarwal, Q. Cai, Human motion analysis: a review, CVIU 73 (3) (1999) 428-440.
[41] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[42] Z. Zhang, C. Loop, Estimating the fundamental matrix by transforming image points in projective space, CVIU 82 (2) (2001) 174-180.
[43] R.I. Hartley, In defense of the eight-point algorithm, IEEE Transactions on PAMI 19 (6) (1997) 580-593.
[44] Y. Shen, H. Foroosh, View-invariant recognition of body pose from space-time templates, in: Proc. of CVPR, 2008, pp. 1-6.
[45] Y. Shen, H. Foroosh, View-invariant action recognition from point triplets, IEEE Transactions on PAMI 31 (10) (2009) 1898-1905.


[46] D. Weinland, M. Özuysal, P. Fua, Making action recognition robust to occlusions and viewpoint changes, in: ECCV, 2010, pp. 635-648.
[47] K.K. Reddy, J. Liu, M. Shah, Incremental action recognition using feature-tree, in: ICCV, 2009, pp. 1010-1017.
[48] D. Tran, A. Sorokin, Human activity recognition with metric learning, in: D. Forsyth, P. Torr, A. Zisserman (Eds.), ECCV, Lecture Notes in Computer Science, vol. 5302, Springer Berlin/Heidelberg, 2008, pp. 548-561.
[49] I. Junejo, E. Dexter, I. Laptev, P. Perez, View-independent action recognition from temporal self-similarities, IEEE Transactions on PAMI, preprint.
[50] J. Liu, M. Shah, Learning human actions via information maximization, in: CVPR, 2008, pp. 1-8.
[51] A. Farhadi, M.K. Tabrizi, Learning to recognize activities from the wrong view point, in: ECCV, 2008, pp. 154-166.
[52] S. Masood, C. Ellis, A. Nagaraja, M. Tappen, J. LaViola Jr., R. Sukthankar, Measuring and reducing observational latency when recognizing actions, in: The 6th IEEE Workshop on Human Computer Interaction: Real-Time Vision Aspects of Natural User Interfaces (HCI2011), ICCV Workshops, 2011.
[53] M. Hoai, F. De la Torre, Max-margin early event detectors, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
