
HMM-based Human Action Recognition Using Multiview Image Sequences

Mohiuddin Ahmad and Seong-Whan Lee


Department of Computer Science and Engineering, Korea University,
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
{mohi, swlee}@image.korea.ac.kr

Abstract

In this paper, we present a novel method for human action recognition from an arbitrary-view image sequence that uses the Cartesian components of optical flow velocity together with human body silhouette feature vectors. We use principal component analysis (PCA) to reduce the high-dimensional silhouette feature space to a lower-dimensional feature space. The action region in an image frame is represented by a Q-dimensional optical flow feature vector and an R-dimensional silhouette feature vector. We represent each action with a set of hidden Markov models, modeling the action for every viewing direction by using the combined (Q + R)-dimensional features at each instant of time. We evaluate the proposed method on the KU gesture database and on manually captured data. Actions from arbitrary viewing directions are correctly classified by our method, which indicates the robustness of our view-independent approach.

1 Introduction

This contribution addresses human action recognition from an arbitrary view of an image sequence, under the assumption that each image sequence contains only one person performing a single activity. Recognition of human actions from image sequences is an active topic in the computer vision community, with applications in video surveillance and monitoring, human-computer interaction, and related areas. Since there is no rigid syntax or well-defined structure for human actions, their recognition is a challenging and sophisticated task. Several human action recognition methods have been proposed over the past decades; a detailed survey can be found in [1]. For recognizing actions, researchers use either explicit human body shape or motion. For example, in [7] and [8], the authors use 2D and 3D shape features for recognizing actions. Motion-based action recognition has also been studied by several researchers; for example, in [9], [10], and [13], the authors use motion features for action recognition. Most of the above techniques depend on the viewing direction, and testing an action using multi-view motion learning remains an open problem. In [11] and [12], the authors presented view-invariant recognition of actions.

In this paper, we recognize human actions from an arbitrary view of an image sequence using optical flow and human body shape features. Human action models are created for each viewing direction for a set of specific actions using hidden Markov models (HMMs). We extract the optical flow feature and a new body silhouette feature, convert the feature vectors into symbols, and then learn the features and build HMMs for the different viewing directions. Classification is finally achieved by feeding a given (test) sequence from any viewing direction to all the learned HMMs and using a likelihood measure to declare the action performed in the image sequence, which makes our system view invariant. For training and testing actions, we use the KU gesture database [6] and manually captured data.

This paper is organized as follows: Section 2 describes feature extraction from optical flow and the human body silhouette. Section 3 briefly describes the HMMs for action modeling and recognition. Experimental results and analysis of the selected approaches are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 Feature Extraction

In this section, we first extract the foreground and detect the action region in the image sequence; we then extract the human body shape feature and the optical flow feature in different viewing directions.

Figure 1. Image preprocessing steps of an image in an action video. (a) Sample frame (320 × 240 pixels), (b) filtered foreground (320 × 240 pixels), (c) action region (80 × 150 pixels), and (d) normalized image (60 × 80 pixels).

Figure 2. Shape images and features. (a) Normalized silhouette images of run. (b) Feature images of the specified actions using the new silhouette features.

2.1 Foreground extraction

For extracting the foreground, we use a simplification of the background subtraction algorithm of [2]: the color value of each background pixel is modeled by a single Gaussian distribution. The probability that a background pixel has color value x_t at time t is estimated from the mean value μ_t and the covariance matrix. We assume independence of the color channels, each with variance σ_i², so the estimated probability of x_t is given by (i indexes the color channels):

P(x_t) = \prod_{i=1}^{3} \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\!\left( -\frac{(x_{t,i} - \mu_{t,i})^2}{2\sigma_i^2} \right)    (1)

Using this probability estimate, we decide with a threshold whether a pixel value belongs to the background or to the foreground. We update the background parameters μ_t and σ_t² continuously to adapt to the scene, as in [2]. Figure 1(a) shows a sample frame from the image sequence of a walking person, and Figure 1(b) shows the extracted foreground image after background subtraction and median filtering.
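As an illustration of Eq. (1) and the thresholding step, the following NumPy sketch maintains a per-pixel Gaussian background model; it is not the authors' implementation, and the threshold tau, the learning rate alpha, and the function names are assumptions.

import numpy as np

def update_background(mean, var, frame, alpha=0.05):
    # Running update of the per-pixel Gaussian parameters (alpha is an assumed learning rate).
    mean = (1.0 - alpha) * mean + alpha * frame
    var = (1.0 - alpha) * var + alpha * (frame - mean) ** 2
    return mean, var

def foreground_mask(frame, mean, var, tau=1e-4):
    # A pixel is labeled foreground when its probability under Eq. (1) falls below tau.
    var = np.maximum(var, 1e-6)                              # guard against zero variance
    prob = np.prod(np.exp(-(frame - mean) ** 2 / (2.0 * var))
                   / np.sqrt(2.0 * np.pi * var), axis=-1)    # product over the 3 color channels
    return prob < tau

# Example usage: frame, mean, var are float arrays of shape (240, 320, 3);
# the resulting mask would then be median-filtered, as described above.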

2.2 Action region extraction

We define the action region as the rectangular area of the image frame in which the action occurs. The action region depends on the body shape, the type of action, and the distance between the camera and the person performing the action. Since this distance is almost constant, we select the action region inside the image frame that contains the average human body shape over the full image sequence. Figure 1(c) shows the action region.

2.3 Shape features extraction

The action region silhouette image is normalized to 60 × 80 pixels using bi-cubic interpolation, as shown in Figure 1(d). For the PCA analysis, let the training set of normalized images be Γ_1, Γ_2, Γ_3, ..., Γ_M. The average of the set is defined by Ψ = (1/M) Σ_{n=1}^{M} Γ_n, and each body silhouette image differs from the average by the vector Φ_i = Γ_i − Ψ. An example training set of the sequence is shown in Figure 2(a). This very large set of vectors is then subjected to PCA, which seeks a set of M orthonormal vectors u_k that best describe the distribution of the data. The k-th vector u_k is chosen [4] such that

\lambda_k = \frac{1}{M} \sum_{n=1}^{M} \left( u_k^T \Phi_n \right)^2    (2)

The vectors u_k and scalars λ_k are the eigenvectors and eigenvalues, respectively, of the covariance matrix [4]

C = \frac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T    (3)

The eigenvectors u_k, for k = 1, ..., M, are ranked according to their associated eigenvalues λ_k. Given a shape vector Γ, the projection onto the new feature space is

s_k = u_k^T (\Gamma - \Psi)    (4)

for k = 1, ..., R. We therefore use the new feature vector s = [s_1, s_2, ..., s_R]^T for every frame of the sequence of any action in any viewing direction. Figure 2(b) shows the new shape feature images of the specified actions, where the x-axis represents time and the y-axis represents the multiple features.
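A minimal NumPy sketch of the eigen-shape computation of Eqs. (2)-(4) follows; it assumes normalized 60 × 80 silhouettes and an assumed feature dimension R = 10 (the paper does not state R), and the function names are illustrative only.

import numpy as np

def fit_shape_pca(silhouettes, R=10):
    # silhouettes: (M, 80, 60) array of normalized silhouettes Gamma_1 ... Gamma_M.
    M = silhouettes.shape[0]
    gammas = silhouettes.reshape(M, -1).astype(float)
    psi = gammas.mean(axis=0)                        # average shape Psi
    phis = gammas - psi                              # Phi_n = Gamma_n - Psi
    C = phis.T @ phis / M                            # covariance matrix, Eq. (3)
    eigvals, eigvecs = np.linalg.eigh(C)             # eigenvectors/eigenvalues of C, Eq. (2)
    order = np.argsort(eigvals)[::-1][:R]            # keep the R largest eigenvalues
    # Note: C is 4800 x 4800 here; in practice the smaller M x M trick of [4] is used.
    return psi, eigvecs[:, order]

def shape_feature(silhouette, psi, U):
    # Project one normalized silhouette onto the eigen-shapes: s_k = u_k^T (Gamma - Psi), Eq. (4).
    gamma = silhouette.reshape(-1).astype(float)
    return U.T @ (gamma - psi)                       # R-dimensional shape feature s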
2.4 Motion features extraction

In this work, we use optical flow to estimate motion, because it allows us to determine the non-rigid motion of the human action at any pixel of the image sequence. We calculate the optical flow v(v_x, v_y) at any pixel x(x, y) at time t inside the action region by using the Lucas-Kanade optical flow technique [3]. Figure 3(a) shows the optical flow of five selected frames for walking, raising the hand, bowing, sitting on the floor, and running, respectively, with the flow vectors overlaid on the body parts. For consistency of the analysis, we normalize the optical flow to v_n(v_{nx}, v_{ny}).
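For illustration only, the sketch below computes a dense flow field inside the action region with OpenCV's Farneback method as a stand-in for the Lucas-Kanade technique [3] used in the paper, and normalizes it by the maximum magnitude (the normalization scheme and the function name are assumptions).

import cv2
import numpy as np

def action_region_flow(prev_frame, curr_frame, box):
    # box = (x, y, w, h) of the action region; frames are BGR images.
    x, y, w, h = box
    prev_gray = cv2.cvtColor(prev_frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (h, w, 2): v_x and v_y
    mag = np.linalg.norm(flow, axis=2).max()
    return flow / mag if mag > 0 else flow           # normalized flow v_n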

Figure 3. The optical flow vectors and corresponding flow feature vectors. (a) Optical flow in action regions. (b) Flow features.

Finally, for feature extraction, we partition the action region into K blocks B(k) of equal size and extract the flow features v_{kx} and v_{ky} as the average optical flow over each block containing n pixels:

[v_{kx,t}, v_{ky,t}] = \frac{1}{n} \sum_{\mathbf{x} \in B(k)} [v_{nx}(\mathbf{x}, t), v_{ny}(\mathbf{x}, t)]    (5)

We then arrange the vectors [v_{1x,t}, ..., v_{(Q/2)x,t}]^T and [v_{1y,t}, ..., v_{(Q/2)y,t}]^T for each frame t of the sequence, where Q = 72. The line plots of Figure 3(b) show the flow features of the specified actions in the Cartesian coordinate system (x-flow and y-flow), computed with the equation above. For each image frame of any action, the silhouette features and optical flow features are concatenated into the combined feature I_t = [s_1, ..., s_R, v_{1x,t}, ..., v_{(Q/2)x,t}, v_{1y,t}, ..., v_{(Q/2)y,t}].
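A hedged sketch of the block-averaged flow features of Eq. (5) and of the combined feature I_t, assuming the normalized flow field from the previous snippet and an assumed 6 × 6 block grid (K = 36 blocks, hence Q = 72 flow values):

import numpy as np

def flow_features(flow, grid=(6, 6)):
    # Average v_x and v_y over equal-sized blocks B(k), Eq. (5); returns two K-vectors.
    H, W, _ = flow.shape
    gh, gw = grid
    bh, bw = H // gh, W // gw
    vx, vy = [], []
    for i in range(gh):
        for j in range(gw):
            block = flow[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            vx.append(block[..., 0].mean())
            vy.append(block[..., 1].mean())
    return np.array(vx), np.array(vy)

def combined_feature(s, vx, vy):
    # I_t = [s_1..s_R, v_1x..v_(Q/2)x, v_1y..v_(Q/2)y] for one frame.
    return np.concatenate([s, vx, vy])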

3 Action Modeling using HMMs

Hidden Markov models have been used successfully for speech recognition. We employ HMMs for action recognition because they can model time series with spatio-temporal variations. In this paper, we use HMMs for both modeling and testing actions. We build an HMM λ_ad = {A, B, π} for each action a in each viewing direction d to characterize the variation caused by the change of viewing direction, using the symbols of the individual and combined features. The set of hidden Markov models over the multi-view directions is expressed as:

\lambda_a = \{\lambda_{a1}, \lambda_{a2}, \cdots, \lambda_{ad}\}    (6)

We use the same topology for every HMM, namely a 6-state ergodic model; the number of states is selected heuristically. Each observation is modeled by mixtures of Gaussian densities, with two mixtures per feature in our experiments. The model parameters are adjusted to maximize the likelihood P(O|λ_a) on the given training data; we use the Baum-Welch algorithm [5] to iteratively re-estimate the model parameters until a local maximum is reached. We use a maximum likelihood approach for classifying each action:

\lambda = \arg\max_{\lambda_a \in \mathrm{allActions}} P(O \mid \lambda_a)    (7)

where P(O|λ_a), the conditional probability for an action a, is computed as P(O|λ_a) = max_d P(O|λ_ad), and O is the multi-observation feature vector sequence of an unknown action.

Figure 4. Multiple views for training sequences (with sample images).

4 Experimental Results and Analysis

Experiments are performed on image sequences with a resolution of 320 × 240 pixels at 30 frames per second. We use 2D video data from the KU gesture database [6] and manually captured data from all directions for modeling and classifying actions. The training data set includes eight views, which are shown in Figure 4. We use five actions for training and testing: walking, raising the hand, bowing, running, and sitting on the floor. The duration of each video is approximately 2-8 seconds, depending on the action performed.

Table 1 shows the confusion matrix of action recognition with HMMs using the human body shape feature, the flow feature, and the combined features. Each column of the table gives the best match for each test sequence from an arbitrary view direction. The first, second, and third values in each cell correspond to the silhouette, optical flow, and combined features; the overall recognition rates are 81.25%, 77.5%, and 87.5%, respectively, so the recognition rate of the combined features is higher than that of either individual feature. Some sequences are misclassified, such as walking versus running, which may be due to the high degree of similarity between these actions in the front and rear views.

In Table 2, we compare our test results with some previous results. It is important to note that we recognize human actions from an arbitrary view rather than from a specific view. Among all views, recognition in the front view is the most problematic.
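To make the modeling and classification scheme of Sections 3 and 4 concrete, the sketch below trains one 6-state HMM with two Gaussian mixtures per state for each (action, view) pair and classifies a test sequence by the maximum log-likelihood, as in Eq. (7). It uses the hmmlearn package as an assumed stand-in (the paper names no library), replaces the discrete-symbol quantization with continuous Gaussian-mixture observations, and the data layout and function names are assumptions.

import numpy as np
from hmmlearn import hmm

def train_action_models(train_data):
    # train_data: {(action, view): [sequence, ...]}, each sequence a (T_i, Q + R) feature array.
    models = {}
    for (action, view), seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        model = hmm.GMMHMM(n_components=6, n_mix=2,        # 6-state model, 2 mixtures per state
                           covariance_type="diag", n_iter=50)
        model.fit(X, lengths)                              # Baum-Welch re-estimation [5]
        models[(action, view)] = model
    return models

def classify(models, test_seq):
    # Eq. (7): the action whose best-view model gives the highest log-likelihood of the sequence.
    best_action, best_ll = None, -np.inf
    for (action, view), model in models.items():
        ll = model.score(test_seq)                         # log P(O | lambda_ad)
        if ll > best_ll:
            best_action, best_ll = action, ll
    return best_action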

Table 1. Confusion matrix. Action recognition using (1) the body silhouette feature, (2) the optical flow feature, and (3) the combined features; each cell lists the counts in that order.

         Walk        Raise       Bow         Run         Sit
Walk     13,12,14    1,1,0       0,0,0       2,3,2       0,0,0
Raise    2,2,2       12,11,13    0,0,0       2,3,1       0,0,0
Bow      0,0,0       0,0,0       15,15,16    0,0,0       1,1,0
Run      3,3,3       1,2,0       0,0,0       12,11,13    0,0,0
Sit      1,2,1       0,0,0       1,0,1       1,1,0       13,13,14
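As a quick consistency check, the overall recognition rates quoted in Section 4 follow from the diagonals of Table 1 (16 test sequences per action, 80 in total):

import numpy as np

# Table 1 cell order: (silhouette, optical flow, combined).
conf = np.array([
    [[13, 12, 14], [1, 1, 0],    [0, 0, 0],    [2, 3, 2],    [0, 0, 0]],
    [[2, 2, 2],    [12, 11, 13], [0, 0, 0],    [2, 3, 1],    [0, 0, 0]],
    [[0, 0, 0],    [0, 0, 0],    [15, 15, 16], [0, 0, 0],    [1, 1, 0]],
    [[3, 3, 3],    [1, 2, 0],    [0, 0, 0],    [12, 11, 13], [0, 0, 0]],
    [[1, 2, 1],    [0, 0, 0],    [1, 0, 1],    [1, 1, 0],    [13, 13, 14]],
])

for i, name in enumerate(["silhouette", "optical flow", "combined"]):
    correct = conf[:, :, i].trace()        # correctly classified sequences (diagonal)
    total = conf[:, :, i].sum()            # 80 test sequences in total
    print(f"{name}: {100.0 * correct / total:.2f}%")   # 81.25%, 77.50%, 87.50%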
Table 2. Comparison of action recognition results with some previous research.

Researches      Features            View           Recog. (%)
Ali [13]        Angle               profile        78.8
Masoud [9]      Motion              parallel       92.8
Yacoob [10]     Parametric motion   diagonal       82.0
Our approach    Shape & flow        independent    87.5

5 Conclusions and Future Research

In this paper, we proposed a novel method for HMM-based, view-independent human action recognition using the body silhouette feature, the optical flow feature, and their combination. Based on these features, a set of HMMs was built for each action to represent it from different views, enabling recognition from arbitrary views. The recognition rate of the combined feature was 87.5%, higher than the rates obtained with the individual features, and the results show that our algorithm is robust to variations in view and duration. Although this rate is lower than some previously reported results, it should be noted that we recognize actions from arbitrary views rather than from a specific view. Our future work includes integrating multi-view learning into a single framework, handling more complex actions, and recognizing human actions more precisely.

Acknowledgment

This research was supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.

References

[1] D. M. Gavrila, "The visual analysis of human movement: A survey," CVIU, Vol. 73, No. 1, January 1999, pp. 82-98.

[2] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," Proc. of IEEE Conf. on CVPR, Vol. 2, 1999, pp. 246-252.

[3] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proc. of IJCAI, 1981, pp. 674-679.

[4] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991, pp. 71-86.

[5] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, Vol. 77, No. 2, 1989, pp. 257-286.

[6] B. W. Hwang, S. Kim, and S. W. Lee, "A full-body gesture database for automatic gesture recognition," Proc. of IEEE Conf. on FGR, April 2006, pp. 243-248. KU Gesture Database, http://gesturedb.korea.ac.kr/.

[7] I. Cohen and H. Li, "Inference of human postures by classification of 3D human body shape," Proc. of IEEE WS on AMFG, 2003, pp. 74-81.

[8] S. Carlsson and J. Sullivan, "Action recognition by shape matching to key frames," Proc. of IEEE WS on Models versus Exemplars in CV, Florida, USA, 2002, pp. 263-270.

[9] O. Masoud and N. Papanikolopoulos, "A method for human action recognition," IVC, Vol. 21, 2003, pp. 729-743.

[10] Y. Yacoob and M. J. Black, "Parameterized modeling and recognition of activities," CVIU, Vol. 73, No. 2, February 1999, pp. 232-247.

[11] C. Rao and M. Shah, "View-invariance in action recognition," Proc. of IEEE Conf. on CVPR, Hawaii, December 2001, pp. 316-321.

[12] V. Parameswaran and R. Chellappa, "View invariants for human action recognition," Proc. of IEEE Conf. on CVPR, Vol. 2, 2003, pp. 613-619.

[13] A. Ali and J. K. Aggarwal, "Segmentation and recognition of continuous human activity," Proc. of IEEE Workshop on Detection and Recognition of Events in Video, Canada, July 2001, pp. 28-35.
