a College of Engineering and Computer Science, Computational Imaging Lab, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA
b Advanced Micro Devices, Quadrangle Blvd., Orlando, FL 32817, USA
c State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
d Department of EECS, Computational Imaging Laboratory, University of Central Florida, Orlando, FL 32816, USA
Article info
Article history:
Received 10 February 2012
Accepted 19 January 2013
Available online 5 February 2013
Keywords:
View invariance
Pose transition
Action recognition
Action alignment
Fundamental ratios
Abstract
In this paper, we fully investigate the concept of fundamental ratios, demonstrate their application and significance in view-invariant action recognition, and explore the importance of different body parts in action recognition. A moving plane observed by a fixed camera induces a fundamental matrix F between two frames, where the ratios among the elements in the upper left 2 × 2 submatrix are herein referred to as the fundamental ratios. We show that fundamental ratios are invariant to camera internal parameters and orientation, and hence can be used to identify similar motions of line segments from varying viewpoints. By representing the human body as a set of points, we decompose a body posture into a set of line segments. The similarity between two actions is therefore measured by the motion of line segments and hence by their associated fundamental ratios. We further investigate to what extent a body part plays a role in recognition of different actions and propose a generic method of assigning weights to different body points. Experiments are performed on three categories of data: the controlled CMU MoCap dataset, the partially controlled IXMAS data, and the more challenging uncontrolled UCF-CIL dataset collected from the Internet. Extensive experiments are reported on testing (i) view-invariance, (ii) robustness to noisy localization of body points, (iii) effect of assigning different weights to different body points, (iv) effect of partial occlusion on recognition accuracy, and (v) determining how soon our method recognizes an action correctly from the starting point of the query video.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, human–computer interaction (HCI), ergonomics, etc. In this paper, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed in action recognition include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies [1]. The literature on human action recognition has been extremely active in the past two decades and significant progress has been made in this area [2–5]. An action can be regarded as a collection of 4D space–time data observed by a perspective video camera. Due to image projection, the 3D Euclidean information is lost and projectively distorted, which makes action recognition rather challenging, especially for varying viewpoints and different camera
parameters. Another source of challenge is the irregularity of human actions due to a variety of factors such as age, gender, circumstances, etc. The timeline of an action is another important issue in action recognition. The execution rates of the same action in different videos may vary for different actors or due to different camera frame rates. Therefore, the mapping between the same action in different videos is usually highly non-linear.
To tackle these issues, researchers often make simplifying assumptions on one or more of the following aspects: (1) camera model, such as scaled orthographic [6] or calibrated perspective camera [7]; (2) camera pose, i.e. little or no viewpoint variations; (3) anatomy, such as isometry [8], coplanarity of a subset of body points [8], etc. Human action recognition methods start by assuming a model of the human body, e.g. silhouette, body points, stick model, etc., and build algorithms that use the adopted model to recognize body pose and its motion over time. Space–time features are essentially the primitives that are used for recognizing actions, e.g. photometric features such as optical flow [9–11] and local space–time features [12,13]. These photometric features can be affected by luminance variations due to, for instance, camera zoom or pose changes, and often work better when the motion is small or incremental. On the other hand, salient geometric features such as silhouettes [14–18] and point sets [8,19] are less sensitive to photometric variations, but require reliable tracking.
Silhouettes have been used in this context by Syeda-Mahmood et al. [28] and later by Yilmaz et al. [18,19]. They stack the silhouettes of input videos into space–time objects and extract features in different ways, which are then used to compute a matching score based on the fundamental matrices. A similar work is also presented in [29], which is based on body points instead of silhouettes. A recent method [30] uses a probabilistic 3D exemplar model that can generate 2D view observations for recognition.
in frame t. Our goal is to identify the sequence {J_t^k} from DB such that the subject in {I_t} performs the closest action to that observed in {J_t^k}.

Let F_1 and F_2 denote the fundamental matrices induced by the two pose transitions, and let

$\hat{F}_1 = \frac{1}{F_{22}^{1}}\begin{bmatrix} F_{11}^{1} & F_{12}^{1} \\ F_{21}^{1} & F_{22}^{1} \end{bmatrix}, \qquad \hat{F}_2 = \frac{1}{F_{22}^{2}}\begin{bmatrix} F_{11}^{2} & F_{12}^{2} \\ F_{21}^{2} & F_{22}^{2} \end{bmatrix}$

denote their upper-left 2 × 2 submatrices normalized by the entries $F_{22}^{1}$ and $F_{22}^{2}$, i.e. the fundamental ratios. For two matching pose transitions,

$\hat{F}_1 \cong \hat{F}_2.$
In practice, $\hat{F}_1$ and $\hat{F}_2$ may not be exactly equal due to noise, computational errors, or subjects' different ways of performing the same action. We therefore define the following function to measure the residual error:

$E(\hat{F}_1, \hat{F}_2) = \|\hat{F}_1 - \hat{F}_2\|_F.$   (4)
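To make the computation concrete, the following Python sketch extracts the fundamental-ratio submatrix from a 3 × 3 fundamental matrix and evaluates the residual of Eq. (4). It is a minimal illustration assuming normalization by the (2,2) entry as above; the function names are ours, not the paper's.

```python
import numpy as np

def fundamental_ratios(F):
    """Upper-left 2x2 submatrix of F, scaled so that its (2,2) entry is 1.

    This fixes the arbitrary overall scale of F and keeps only the ratios
    among the four elements -- the fundamental ratios."""
    S = np.asarray(F, dtype=float)[:2, :2]
    # assumes the (2,2) entry is not vanishingly small; a more robust
    # variant could normalize by the Frobenius norm of S instead
    return S / S[1, 1]

def residual_error(F1, F2):
    """Residual E(F1_hat, F2_hat) = ||F1_hat - F2_hat||_F, as in Eq. (4)."""
    return np.linalg.norm(fundamental_ratios(F1) - fundamental_ratios(F2), "fro")
```

Two views of the same pose transition should yield a small residual, while different pose transitions should yield a noticeably larger one.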
we have:

$E(\hat{F}_1, \hat{F}_2) = 0.$

In Eqs. (5) and (6), the two fundamental matrices $F_1$ and $F_2$ are parametrized in terms of the four parameters $(a_1, b_1, c_1, d_1)$ and $(a_2, b_2, c_2, d_2)$, respectively.
To solve for the four parameters, we have the following equations:

$\mathbf{x}_1^T F_1 \mathbf{x}_1 = 0,$   (7)
$\mathbf{x}_2^T F_1 \mathbf{x}_2 = 0,$   (8)
$\mathbf{y}_1^T F_2 \mathbf{y}_1 = 0,$   (9)
$\mathbf{y}_2^T F_2 \mathbf{y}_2 = 0,$   (10)
$\mathbf{e}_{11}^T F_1 \mathbf{e}_{11} = 0,$   (11)
$\mathbf{e}_{12}^T F_2 \mathbf{e}_{12} = 0,$   (12)
$\mathbf{e}_{21}^T F_1 \mathbf{e}_{21} = 0,$   (13)
$\mathbf{e}_{22}^T F_2 \mathbf{e}_{22} = 0,$   (14)
$\vdots$
$\mathbf{e}_{(N-1)1}^T F_1 \mathbf{e}_{(N-1)1} = 0,$   (15)
$\mathbf{e}_{(N-1)2}^T F_2 \mathbf{e}_{(N-1)2} = 0,$   (16)

where $\mathbf{e}_{i1}$ and $\mathbf{e}_{i2}$ are the projections of the $i$th sequence's camera center in $I_i$ or $I_j$ and in $J_m^i$ or $J_n^i$, respectively.

With N > 2, we have an overdetermined system, which can be easily solved by rearranging the above equations in the form Ax = 0 and solving for the right null space of A to obtain the ratios. In fact, given N examples in the dataset, we can have as many as $N(N-1)\binom{11}{2}$ ratios per frame, where $\binom{11}{2}$ is the total number of different combinations given 11 body points. Compared to using triplets, we would have $N\binom{11}{3}$ ratios per frame. Given N > 4, this is a huge advantage over using triplets, as we have more redundancy, leading to higher accuracy.

The difficulty with Eqs. (5) and (6) is that the epipoles $e'_i$, $e'_j$, $e'_m$ and $e'_n$ are unknown. Fortunately, however, the epipoles can be closely approximated as described below.

Proposition 3. If the exterior orientation of P1 is related to that of P2 by a translation, or by a rotation around an axis that lies on the axis planes of P1, then under the assumption:

$e'_i \simeq e'_j \simeq e_1, \qquad e'_m \simeq e'_n \simeq e_2.$   (17)

Degenerate configurations: If the projection of the other example's camera center is collinear with the two points of a line segment, the problem becomes ill-conditioned. We can either ignore this camera center in favor of other camera centers, or simply ignore the line segment altogether. This does not produce any difficulty in practice, since with the 11-body-point representation used in this paper we obtain 55 possible line segments, the vast majority of which are non-degenerate in practice.

A special case is when the epipole is close to or at infinity, for which all line segments would degenerate. We solve this problem by transforming the image points in projective space in a manner similar to Zhang et al. [42]. The idea is to find a pair of projective transformations Q and Q′ such that, after transformation, the epipoles and the transformed image points are not at infinity. Note that these transformations do not affect the projective equality in Proposition 2.

The similarity of two pose transitions $I_i \to I_j$ and $J_m^k \to J_n^k$ is then measured as follows:

1. Compute F, $e_1$, $e_2$ between the image pairs $\langle I_i, I_j \rangle$ and $\langle J_m^k, J_n^k \rangle$ using the method proposed in [43].
2. For each non-degenerate 3D line segment that projects onto $\ell_i$, $\ell_j$, $\ell_{k_m}$ and $\ell_{k_n}$ in $I_i$, $I_j$, $J_m^k$ and $J_n^k$, respectively, compute $\hat{F}_1$ and $\hat{F}_2$ as described above, and compute $\epsilon_\ell = E(\hat{F}_1, \hat{F}_2)$ from Eq. (4).
3. Compute the average error over all non-degenerate line segments using

$E(I_i \to I_j,\ J_m^k \to J_n^k) = \frac{1}{L}\sum_{\ell = 1,\ldots,L} \epsilon_\ell.$   (19)
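For illustration, the following Python sketch mirrors steps 2–3: given the normalized ratio submatrices computed for each line segment in the two pose transitions, it averages the residuals of Eq. (4) over the non-degenerate segments, as in Eq. (19). The function name and the use of None to mark degenerate segments are conventions of this sketch, not the paper's.

```python
import numpy as np

def transition_error(ratios_1, ratios_2):
    """Average residual of Eq. (19) over non-degenerate line segments.

    ratios_1, ratios_2 : lists of 2x2 normalized ratio submatrices (one per
        line segment), with None marking a segment that is degenerate for
        that pose transition (see the discussion of degenerate configurations).
    """
    residuals = []
    for S1, S2 in zip(ratios_1, ratios_2):
        if S1 is None or S2 is None:          # skip degenerate segments
            continue
        residuals.append(np.linalg.norm(np.asarray(S1) - np.asarray(S2), "fro"))
    return float(np.mean(residuals)) if residuals else np.inf
```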
The matching score between frame pairs of the two sequences is defined as

$S_{i,j} = \begin{cases} \tau - E(I_{i_0} \to I_i,\ J_{j_0} \to J_j) & i \ne i_0,\ j \ne j_0, \\ \tau - E(I_{i_0} \to I_{i_1},\ J_{j_0} \to J_{j_1}) & i = i_0,\ j = j_0, \\ 0 & \text{otherwise}, \end{cases}$

where τ is a threshold, e.g., τ = 0.3. S is the matching score matrix of $\{I_{1,\ldots,n}\}$ and $\{J_{1,\ldots,m}\}$.
3. Initialize the n × m accumulated score matrix M as

$M_{i,j} = \begin{cases} S_{i,j} & i = 1 \text{ or } j = 1, \\ 0 & \text{otherwise}. \end{cases}$

$(i^*, j^*) = \arg\max_{i,j} M_{i,j}.$

Then back-trace M from $(i^*, j^*)$, and record the path P until it reaches a non-positive element.
The matching score of sequences A and B is then defined as $S(A, B) = M_{i^*, j^*}$. The back-traced path P provides an alignment between the two video sequences. Note that this may not be a one-to-one mapping, since there may exist horizontal or vertical lines in the path, which means that a frame may have multiple candidate matches in the other video. In addition, due to noise and computational error, different selections of $I_{i_0} \to I_{i_1}$ may lead to different valid alignment results.
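The accumulation and back-tracing can be sketched in Python as follows. Since the exact recurrence used to fill M is not spelled out in this excerpt, the sketch assumes a standard local-alignment-style update (non-negative accumulation of S along diagonal, horizontal, and vertical moves); that choice and the function name are assumptions, not the paper's prescription.

```python
import numpy as np

def align_sequences(S):
    """Accumulate the matching-score matrix S and back-trace the best path.

    S : (n, m) matrix of pose-transition matching scores between two videos
        (positive for good matches, non-positive otherwise).
    Returns the matching score S(A, B) = M[i*, j*] and the aligned path P.
    """
    n, m = S.shape
    M = np.zeros((n, m))
    M[0, :], M[:, 0] = S[0, :], S[:, 0]                 # initialization (step 3)
    for i in range(1, n):
        for j in range(1, m):
            # assumed update: extend the best neighboring partial alignment
            M[i, j] = max(0.0, M[i - 1, j - 1], M[i - 1, j], M[i, j - 1]) + S[i, j]

    i, j = np.unravel_index(np.argmax(M), M.shape)      # (i*, j*)
    score, path = M[i, j], [(i, j)]
    while i > 0 and j > 0:
        i2, j2 = max([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda p: M[p])
        if M[i2, j2] <= 0:                              # stop at a non-positive element
            break
        i, j = i2, j2
        path.append((i, j))
    return score, path[::-1]
```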
3.5. Action recognition
To solve the action recognition problem, we need a reference sequence (a sequence of 2D poses) for each known action, and we maintain an action database of K actions, DB = {{J_t^1}, {J_t^2}, ..., {J_t^K}}. To classify a given test sequence {I_t}, we match {I_t} against each reference sequence in DB, and classify {I_t} as the action of the best match, say {J_t^k}, if S({I_t}, {J_t^k}) is above a threshold T. Due to the use of the view-invariant fundamental ratios, our solution is invariant to camera intrinsic parameters and viewpoint changes when the approximation of the epipoles is valid. As discussed in Section 5.1.1, this can be achieved by using reference sequences from more viewpoints for each action. One major feature of the proposed method is that there is no training involved if line segments are used without the weighting (discussed in Section 4), and we can recognize an action from a single example. This is experimentally verified in Section 5.
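As a usage illustration, classification against the database DB reduces to a best-match search with a rejection threshold; the sketch below (with hypothetical function and variable names) shows the idea.

```python
def classify(test_seq, database, match_score, T=0.0):
    """Classify a test pose sequence against reference sequences.

    database    : dict mapping action label -> reference pose sequence {J_t^k}
    match_score : callable returning the alignment score S(A, B) for two sequences
    T           : acceptance threshold on the best matching score
    """
    best_label, best_score = None, float("-inf")
    for label, ref_seq in database.items():
        score = match_score(test_seq, ref_seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score > T else None
```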
4. Weighting-based human action recognition
In the previous section, we saw how fundamental ratios can be used for action recognition. However, we implicitly made the assumption that all body joints contribute equally to matching pose transitions and recognizing actions, which goes against the evidence in the applied perception literature [32]. For instance, in a sport such as boxing, the motion of the upper body parts of the boxer is more important than the motion of the lower body.
In his classic experiments, Johansson [34] demonstrated that humans can identify motion when presented with only a small set of moving dots attached to various body parts. This seems to suggest that people are quite naturally adept at ignoring trivial variations.
$V_e(i) = \begin{bmatrix} E_1 \\ E_2 \\ \vdots \\ E_T \end{bmatrix}.$   (20)

We then built an error score matrix $M_e$ for the alignment $w\colon W_A \to W_B$:

$M_e = \left[\, V_e(1)\ \ V_e(2)\ \ \ldots\ \ V_e(l) \,\right].$   (21)
[Fig. 2 plots: dissimilarity score versus frame number for walk–walk and walk–run transition pairs, and for golf-swing sequences 01–03; panel (b) shows line segment 50.]
Fig. 2. Roles of different line segments in action recognition. We selected four sequences G0, G1, G2, and G3 of the golf-swing action, aligned G1, G2, and G3 to G0 using the alignment method described in Section 2, and then built the error score matrices $M_e^1$, $M_e^2$, $M_e^3$ correspondingly, as in the above experiments. As can be observed, the dissimilarity scores of some line segments, such as line segment 53, are very consistent across individuals. Some other line segments, such as line segments 6 and 50, have varying error score patterns across individuals; that is, these line segments represent the variations of individuals performing the same action.
Intuitively, in action recognition we should place more emphasis on the significant line segments while reducing the negative impact of trivial line segments; that is, we should assign an appropriate influence factor to the body-point line segments. In our approach to action recognition, this can be achieved by assigning appropriate weights to the similarity errors of body-point line segments in Eq. (19). That is, Eq. (19) can be rewritten as:

$E(I_i \to I_j,\ J_m^k \to J_n^k) = \sum_{\ell = 1,\ldots,L} \omega_\ell\, \epsilon_\ell,$   (22)
$\omega_1 + \omega_2 + \cdots + \omega_n = 1.$   (23)

$E(I_1 \to I_2,\ J_i \to J_j) = \operatorname{Median}_{1 \le i < j \le n}\big(\omega_i\,\omega_j\, E_{i,j}\big),$   (25)

$S(A, B) = N\tau - \frac{1}{N}\sum_{l=1}^{N}\ \sum_{1 \le i < j \le n} \omega_i\,\omega_j\, E_{l,\,w(l)}(i, j).$   (26)

Considering that N, τ, n, and $E_{l,w(l)}(i,j)$ are constants given the alignment w, Eq. (26) can be further rewritten in a simpler form:

$S(A, B) = a_0 + \sum_{i=1}^{n-1} a_i\,\omega_i.$   (27)
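The following Python sketch illustrates how body-point weights can be turned into per-line-segment influence factors and applied to the transition error, in the spirit of Eqs. (22)–(26). The product weighting ω_i·ω_j for the segment joining points i and j mirrors the products above, but this exact mapping (and any normalization) is an assumption of the sketch.

```python
import numpy as np

def segment_weights(point_weights, segments):
    """Influence factor for each line segment from body-point weights.

    `segments` is a list of (i, j) body-point index pairs; the segment joining
    points i and j is weighted by w_i * w_j (an assumed mapping, echoing the
    products in Eqs. (25)-(26))."""
    w = np.asarray(point_weights, dtype=float)
    return np.array([w[i] * w[j] for i, j in segments])

def weighted_transition_error(segment_errors, weights):
    """Weighted transition error, as in Eq. (22): a weighted combination of
    the per-line-segment residuals eps_l."""
    return float(np.dot(np.asarray(weights, dtype=float),
                        np.asarray(segment_errors, dtype=float)))
```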
$Q_1 = \frac{1}{K}\sum_{k=1}^{K} S\big(R_j, T_{jk}\big),$   (28)

$Q_2 = \frac{1}{K}\sum_{k=1}^{K} S\big(R_j, T_{jk}\big)^2 - Q_1^2,$   (29)

$Q_3 = \frac{1}{K(J-1)}\sum_{1 \le i \le J,\ i \ne j}\ \sum_{k=1}^{K} S\big(R_j, T_{ik}\big).$   (30)
Fig. 3. Left: Our body model. Right: Experiment on view-invariance. Two different pose transitions P1 → P2 and P3 → P4 from a golf swing action are used.
[Fig. 4 panels (a)–(g): errors of same and different pose transitions plotted against the rotation angles of camera 2 around the x and y axes.]
Fig. 4. Analysis of view invariance: (a) Camera 1 is marked in red, and all positions of camera 2 are marked in blue and green. (b) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored green in (a). (c) Errors for same and different pose transitions when camera 2 is located at the viewpoints colored blue in (a). (d) General camera motion: camera 1 is marked in red, and camera 2 is distributed on a sphere. (e) Error surface of same pose transitions for all distributions of camera 2 in (d). (f) Error surface of different pose transitions for all distributions of camera 2 in (d). (g) The regions of confusion for (d) marked in black (see text). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
occlusion and lack of corresponding points in practice. The experiments on real data in Section 5.2 also show the validity of this approximation under practical camera viewing angles.

5.1.2. Testing robustness to noise
Without loss of generality, we used the four poses in Fig. 3 to analyze the robustness of our method to noise. Two cameras with different focal lengths and viewpoints were examined. As shown in Fig. 5, I1 and I2 are the images of poses P1 and P2 in camera 1, and I3, I4, I5 and I6 are the images of P1, P2, P3 and P4 in camera 2. We then added Gaussian noise to the image points, with σ increasing from 0 to 8 pixels. The errors E(I1 → I2, I3 → I4) and E(I1 → I2, I5 → I6) were computed. For each noise level, the experiment was repeated for 100 independent trials, and the mean and standard deviation of both errors were calculated (see Fig. 5). As shown in the results, the two cases are distinguished unambiguously until σ increases to 4.0, i.e., up to possibly 12 pixels. Note that the image size of the subject was about 200 × 300, which implies that our method performs remarkably well under high noise.
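A minimal sketch of the perturbation used in this experiment, assuming body points are stored as an (n, 2) array of pixel coordinates:

```python
import numpy as np

def add_localization_noise(points, sigma, rng=None):
    """Add isotropic Gaussian noise with standard deviation `sigma` (pixels)
    to 2D body-point locations, as in the robustness test above."""
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=float)
    return points + rng.normal(scale=sigma, size=points.shape)
```

Repeating the trial 100 times per noise level and recording the mean and standard deviation of the two errors reproduces the protocol described above.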
Fig. 7. A pose observed from 17 viewpoints. Note that only the 11 body points shown in red are used. The stick figures are shown here for better illustration of the pose configuration and of the extreme variability handled by our method. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 1
Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 89.20%.

Ground-truth    Recognized as
                Walk   Jump   Golf swing   Run   Climb
Walk            43     2      1            2     2
Jump            2      46     –            1     1
Golf swing      1      1      46           1     1
Run             2      2      –            44    2
Climb           1      1      1            2     44
Table 2
Confusion matrix after applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 92.40%, which is an improvement of 3.2% compared to the non-weighted case.

Ground-truth    Recognized as
                Walk   Jump   Golf swing   Run   Climb
Walk            45     2      –            2     1
Jump            1      47     –            1     1
Golf swing      1      1      47           1     –
Run             1      2      –            46    1
Climb           1      –      1            2     46
without weighting are summarized in Table 1. The overall recognition rate is 89.2%.
For weighting, we build a MoCap training dataset which consists of a total of 2 × 17 × 5 = 170 sequences for 5 actions (walk, jump, golf swing, run, and climb): each action is performed by 2 subjects, and each instance of an action is observed by 17 cameras at different random locations. We use the same set of reference sequences for the 5 actions as in the unweighted case, and align the sequences in the training set against the reference sequences. To obtain the optimal weighting for each action j, we first aligned all sequences against the reference sequence R_j, and stored the similarity scores of line segments for each pair of matched poses. The objective function f_j(ω_1, ω_2, ..., ω_10) is then built based on Eq. (27) and the computed similarity scores of line segments in the alignments. f_j(·) is a 10-dimensional function, and the weights ω_i are constrained by
$0 \le \omega_i \le 1, \quad i = 1, \ldots, 10; \qquad \sum_{i=1}^{10} \omega_i \le 1.$   (32)
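A constrained optimization of this kind can be carried out with an off-the-shelf solver. The sketch below assumes the action-specific objective f_j is available as a Python callable (built, e.g., from Eq. (27) and the stored alignment scores) and maximizes it under the constraints of Eq. (32); the function name and the choice of SLSQP are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights(objective, n_weights=10, seed=0):
    """Maximize an action-specific objective f_j(w_1, ..., w_10) subject to
    the constraints of Eq. (32): 0 <= w_i <= 1 and sum_i w_i <= 1."""
    rng = np.random.default_rng(seed)
    w0 = rng.uniform(0.0, 1.0 / n_weights, size=n_weights)   # feasible starting point
    result = minimize(
        lambda w: -objective(w),                              # maximize via the negative
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_weights,
        constraints=[{"type": "ineq", "fun": lambda w: 1.0 - w.sum()}],
    )
    return result.x
```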
Table 3
Confusion matrix for [45]. The overall recognition rate is 91.6% (diagonal entries: Walk 45, Jump 47, Golf swing 48, Run 47, Climb 42).
Table 4
Confusion matrix for [31]. The overall recognition rate is 81.6%.

Ground-truth    Recognized as
                Walk   Jump   Golf swing   Run   Climb
Walk            39     3      1            5     2
Jump            4      44     –            1     1
Golf swing      1      1      45           2     1
Run             4      3      –            41    2
Climb           8      3      1            3     35
may have different starting and ending points, and thus may only partially overlap. The execution speeds also vary among the sequences of each action. Self-occlusion also exists in many of the sequences, e.g., golf, tennis, etc.
Fig. 8a shows an example of matching action sequences. The frame rates and viewpoints of the two sequences are different, and the two players perform the golf-swing action at different speeds. The accumulated score matrix and back-traced path in dynamic programming are shown in Fig. 8c. Another result on tennis-serve sequences is shown in Fig. 8b and d (see Fig. 10).
We built an action database DB by selecting one sequence for each action; the rest were used as test data and were matched against all actions in the DB. Each sequence was recognized as the action with the highest matching score. The confusion matrix is shown in Table 5, which indicates an overall 95.83% classification accuracy for real data. As shown by these results, our method provides successful recognition of various actions by different subjects, regardless of camera intrinsic parameters and viewpoints.
We test each sequence using the leave-one-out strategy. With weighting, the classification results are summarized in Table 6. The overall recognition rate is 100%, which is an improvement of 4.17% compared to the non-weighted case (see Tables 7 and 8).
5.2.2. IXMAS data set
We also evaluated our method on the IXMAS data set [7], which has 5 different views of 13 different actions, each performed three times by 11 different actors. We tested on actions {1, 2, 3, 4, 5, 8, 9, 10, 11, 12}. Similar to [7], we applied our method on all actors except for Pao and Srikumar, and used 'andreas 1' under 'cam1' as the reference for all actions, similar to [45]. The rest of the sequences were used to test our method. The recognition results for the non-weighted case are shown in Table 10. The average recognition rate is 90.5%. For weighting, we tested each sequence by randomly generating a reference dataset of 2 × 5 × 10 = 100 sequences for 10 actions performed by two people observed from five different viewpoints. The results are shown in Table 11. The average recognition rate is 92.6%, which is an improvement of 2.1% over the non-weighted case. In addition, we compare our method to others in Table 9. As can be seen, our method improves on each camera view (see Tables 12 and 13).
Fig. 8. Examples of matching action sequences: (a) and (b) are two examples of golf-swing and tennis-serve actions. (c) and (d) show the accumulated score matrices and back-traced paths, resulting in the alignments shown in (a) and (b), respectively.
Fig. 9. Examples from the UCF-CIL dataset consisting of 8 categories (actions) used to test the proposed method. Ballet fouettés: (1)–(4); ballet spin: (5)–(16); push-up: (17)–(22); golf swing: (23)–(30); one-handed tennis backhand stroke: (31)–(34); two-handed tennis backhand stroke: (35)–(42); tennis forehand stroke: (43)–(46); tennis serve: (47)–(56).
As can be seen from these results, our method is able to recognize actions even when such drastic occlusions are present. The few low percentages in the tables correspond to actions that are more or less dependent on the occluded part. For instance, the kick action has a recognition rate of only 5.5% when the lower body is occluded; but this action is based solely on the lower part of the body, so it is not surprising that the recognition rate is low. In general, the recognition rates are lower since we are using fewer line segments and, more importantly, fewer points to compute the fundamental matrix (when 4 points are occluded, we are forced to use the 7-point algorithm [41]).
Table 5
Confusion matrix before applying weighting: large values on the diagonal entries indicate accuracy. The overall recognition rate is 97.92%. The actions are denoted by numbers: 1 = ballet fouetté, 2 = ballet spin, 3 = push-up, 4 = golf swing, 5 = one-handed tennis backhand, 6 = two-handed tennis backhand, 7 = tennis forehand, 8 = tennis serve.

Diagonal entries: #1 = 3, #2 = 10, #3 = 5, #4 = 7, #5 = 3, #6 = 7, #7 = 3, #8 = 9; the single error is one #2 sequence recognized as #1.
Table 6
Confusion matrix after applying weighting: the overall recognition rate is 100%, which is an improvement of 2% compared to the non-weighted case.

All test sequences are correctly recognized (diagonal entries: #1 = 3, #2 = 11, #3 = 5, #4 = 7, #5 = 3, #6 = 7, #7 = 3, #8 = 9).
ical poses are predefined, or certain limbs trace planar areas during actions; Sheikh et al. [6] assume that each action is spanned by a set of action bases, estimated directly from training sequences. This implicitly requires that the start and the end of a test sequence be restricted to those used during training. Moreover, the training set needs to be large enough to accommodate inter-subject irregularities of human actions.
In summary, the major contributions of this paper are: (i) We generalize the concept of fundamental ratios and demonstrate its
Table 7
Confusion matrix for [45].

Diagonal entries: #1 = 3, #2 = 11, #3 = 5, #4 = 7, #5 = 3, #6 = 7, #7 = 3, #8 = 9.
Table 8
Confusion matrix for [31].

Table 13
Confusion matrix for [31]. Average recognition rate is 85.6%.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        85.2                   8        90.4
2        89.6                   9        89.6
3        82.1                   10       82.1
4        78.4                   11       91.1
5        89.6                   12       82.1
Table 9
Recognition rates in % on the IXMAS dataset. Shen [45] and Shen [31] use the same set of body points as our method.

Method                  All    Cam1   Cam2   Cam3   Cam4   Cam5
Ours (non-weighted)     90.5   92.0   89.6   86.6   82.0   78.0
Ours (weighted)         92.6   94.2   93.5   94.4   92.6   82.2
Table 10
Confusion matrix for the IXMAS dataset before applying weighting. Average recognition rate is 90.5%. The actions are denoted by numbers: 1 = check watch, 2 = cross arms, 3 = scratch head, 4 = sit down, 5 = get up, 8 = wave, 9 = punch, 10 = kick, 11 = point, and 12 = pick up.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        92.6                   8        92.6
2        91.1                   9        92.6
3        85.2                   10       88.1
4        91.1                   11       91.1
5        89.6                   12       87.3
Table 11
Confusion matrix for the IXMAS dataset after applying weighting: the overall recognition rate is 92.6%, which is an improvement of 2.1% compared to the non-weighted case.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        94.8                   8        92.6
2        91.1                   9        92.6
3        87.2                   10       91.1
4        92.6                   11       92.6
5        92.6                   12       89.6
Table 12
Confusion matrix for [45]. Average recognition rate is 90.23%.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        89.6                   8        85.2
2        94.8                   9        92.6
3        85.2                   10       91.1
4        91.1                   11       90.4
5        91.1                   12       89.6
Table 14
Confusion matrix when the head and two shoulder points are occluded. The actions are the same as in Table 10.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        85.5                   8        92.3
2        91.1                   9        90.3
3        83.3                   10       83.3
4        81.1                   11       90.4
5        91.1                   12       83.3
Table 15
Confusion matrix when the right side of the body is occluded, including the right shoulder, arm, hand, and knee points.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        83.3                   8        3.3
2        54.5                   9        10.3
3        5.5                    10       79.1
4        58.8                   11       5.6
5        61.3                   12       16.1
Table 16
Confusion matrix when the left side of the body is occluded, including the left shoulder, arm, hand, and knee points.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        3.3                    8        83.3
2        47.5                   9        73.3
3        75.5                   10       76.7
4        57.7                   11       77.1
5        66.7                   12       66.7
Table 17
Confusion matrix when the lower body is occluded, including the two knee and two feet points.

Action   Recognition rate (%)   Action   Recognition rate (%)
1        86.6                   8        81.1
2        83.3                   9        79.3
3        78.1                   10       5.5
4        45.2                   11       78.1
5        54.8                   12       36.6
human pose into a set of line segments and represent a human action by the motion of the 3D lines defined by these line segments. This converts the study of non-rigid human motion into that of multiple rigid planar motions, thus making it possible to apply well-studied rigid-motion concepts and providing a novel direction for studying articulated motion. Our results definitely confirm that using line segments considerably improves the accuracy of [31]. Of course, this does not preclude that our ideas of line segments and weighting could be applied to other methods such as [45], and they may also result in improved accuracy in the same manner as for [31] (as studied in this paper). (iv) We propose a generic method for weighting body-point line segments, in an attempt to emulate humans' foveated approach to pattern recognition. Results after applying this scheme indicate significant improvement. This idea can be applied to [45] as well, and probably to a host of other methods, whose performance may improve in the same manner as [31], as shown in this paper. (v) We study how this weighting strategy can be useful when there is partial but significant occlusion. (vi) We also investigate how soon our method is able to recognize
Table 18
This table shows how soon we can recognize an action for the IXMAS dataset.

Action              Action
1        308860     8        568869
2        337750     9        488163
3        569177     10       458977
4        356756     11       609278
5        407766     12       377955
Table 19
Comparison of different methods.

Method   # of views   Camera model        Input
Ours     1            Persp. projection   Body points
[45]     1            Persp. projection   Body points
[46]     ≥1           Persp. projection   3D HoG
[30]     ≥1           Persp. projection   Silhouettes
[47]     All          Persp. projection   Interest points
[51]     >1           Persp. projection   Histogram of the silhouette and of the optic flow
[48]     All          Persp. projection   Histogram of the silhouette and of the optic flow
[50]     1            Persp. projection   3D interest points
[8]      >1           Persp. projection   Body points
[7]      All          Persp. projection   Visual hulls
[20]     >1           Persp. projection   Optical flow silhouettes
[24]     1            Affine              Body points
[18]     1            Affine              Silhouettes
[6]      1            Affine              Body points
[29]     1            Persp. projection   Body points
[49]     ≥1           Persp. projection   Body points/optical flow/HoG
References
[1] V. Zatsiorsky, Kinematics of Human Motion, Human Kinetics, 2002.
[2] D. Gavrila, Visual analysis of human movement: a survey, CVIU 73 (1) (1999) 82–98.
[3] T. Moeslund, E. Granum, A survey of computer vision-based human motion capture, CVIU 81 (3) (2001) 231–268.
[4] T. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, CVIU 104 (2–3) (2006) 90–126.
[5] L. Wang, W. Hu, T. Tan, Recent developments in human motion analysis, Pattern Recognition 36 (3) (2003) 585–601.
[6] Y. Sheikh, M. Shah, Exploring the space of a human action, ICCV 1 (2005) 144–149.
[7] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, CVIU 104 (2–3) (2006) 249–257.
[8] V. Parameswaran, R. Chellappa, View invariants for human action recognition, CVPR 2 (2003) 613–619.
[9] A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, ICCV (2003) 726–733.
[10] G. Zhu, C. Xu, W. Gao, Q. Huang, Action recognition in broadcast tennis video using optical flow and support vector machine, LNCS 3979 (2006) 89–98.
[11] L. Wang, Abnormal walking gait analysis using silhouette-masked flow histograms, ICPR 3 (2006) 473–476.
[12] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, ICPR 3 (2004) 32–36.
[13] I. Laptev, S. Belongie, P. Perez, J. Wills, Periodic motion detection and segmentation via approximate sequence alignment, ICCV 1 (2005) 816–823.
[14] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proc. ICCV, vol. 2, 2005, pp. 1395–1402.
[15] L. Wang, D. Suter, Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model, in: CVPR, 2007, pp. 1–8.
[16] A. Bobick, J. Davis, The recognition of human movement using temporal templates, IEEE Transactions on PAMI 23 (3) (2001) 257–267.
[17] L. Wang, T. Tan, H. Ning, W. Hu, Silhouette analysis-based gait recognition for human identification, IEEE Transactions on PAMI 25 (12) (2003) 1505–1518.
[18] A. Yilmaz, M. Shah, Actions sketch: a novel action representation, CVPR 1 (2005) 984–989.
[19] A. Yilmaz, M. Shah, Matching actions in presence of camera motion, CVIU 104 (2–3) (2006) 221–231.
[20] M. Ahmad, S. Lee, HMM-based human action recognition using multiview image sequences, ICPR 1 (2006) 263–266.
[21] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, Proceedings of CVPR (2006) 1701–1708.
[22] F. Lv, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, Proceedings of CVPR (2007) 1–8.
[23] L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, A. Pentland, Invariant features for 3-d gesture recognition, FG (1996) 157–162.
[24] C. Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, IJCV 50 (2) (2002) 203–226.
[25] A. Farhadi, M.K. Tabrizi, I. Endres, D.A. Forsyth, A latent model of discriminative aspect, in: ICCV, 2009, pp. 948–955.
[26] R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, in: CVPR, 2012.
[27] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, in: CVPR, 2011, pp. 3209–3216.
[28] T. Syeda-Mahmood, A. Vasilescu, S. Sethi, Recognizing action events from multiple viewpoints, in: Proceedings of IEEE Workshop on Detection and Recognition of Events in Video, 2001, pp. 64–72.
[29] A. Gritai, Y. Sheikh, M. Shah, On the use of anthropometry in the invariant analysis of human actions, ICPR 2 (2004) 923–926.
[30] D. Weinland, E. Boyer, R. Ronfard, Action recognition from arbitrary views using 3d exemplars, in: ICCV, 2007, pp. 1–7.
[31] Y. Shen, H. Foroosh, View-invariant action recognition using fundamental ratios, in: Proc. of CVPR, 2008, pp. 1–6.
[32] A. Schütz, D. Braun, K. Gegenfurtner, Object recognition during foveating eye movements, Vision Research 49 (18) (2009) 2241–2253.
[33] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, 2004. ISBN: 0521540518.
[34] G. Johansson, Visual perception of biological motion and a model for its analysis, Perception and Psychophysics 14 (1973) 201–211.
[35] V. Pavlovic, J. Rehg, T. Cham, K. Murphy, A dynamic Bayesian network approach to figure tracking using learned dynamic models, ICCV (1) (1999) 94–101.
[36] D. Ramanan, D.A. Forsyth, A. Zisserman, Strike a pose: tracking people by finding stylized poses, Proceedings of CVPR 1 (2005) 271–278.
[37] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people and recognizing their activities, Proceedings of CVPR 2 (2005) 1194.
[38] J. Rehg, T. Kanade, Model-based tracking of self-occluding articulated objects, ICCV (1995) 612–617.
[39] J. Sullivan, S. Carlsson, Recognizing and tracking human action, in: ECCV, Springer-Verlag, London, UK, 2002, pp. 629–644.
[40] J. Aggarwal, Q. Cai, Human motion analysis: a review, CVIU 73 (3) (1999) 428–440.
[41] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[42] Z. Zhang, C. Loop, Estimating the fundamental matrix by transforming image points in projective space, CVIU 82 (2) (2001) 174–180.
[43] R.I. Hartley, In defense of the eight-point algorithm, IEEE Transactions on PAMI 19 (6) (1997) 580–593.
[44] Y. Shen, H. Foroosh, View-invariant recognition of body pose from space-time templates, in: Proc. of CVPR, 2008, pp. 1–6.
[45] Y. Shen, H. Foroosh, View-invariant action recognition from point triplets, IEEE Transactions on PAMI 31 (10) (2009) 1898–1905.
[50] J. Liu, M. Shah, Learning human actions via information maximization, in: CVPR, 2008, pp. 1–8.
[51] A. Farhadi, M.K. Tabrizi, Learning to recognize activities from the wrong view point, in: ECCV (1), 2008, pp. 154–166.
[52] S. Masood, C. Ellis, A. Nagaraja, M. Tappen, J. LaViola Jr., R. Sukthankar, Measuring and reducing observational latency when recognizing actions, in: The 6th IEEE Workshop on Human Computer Interaction: Real-Time Vision Aspects of Natural User Interfaces (HCI2011), ICCV Workshops, 2011.
[53] M. Hoai, F. De la Torre, Max-margin early event detectors, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.