Contents of Presentation
Introduction
Proposed Methodology
Shape Representation Using HOG
Computation of Reduced Posture Space using PCA
INTRODUCTION
Introduction
Action Recognition
Widely Researched Area
Potential applications in Security and Surveillance
Literature Survey
Some of the research work focused on space-time shapes with different kinds of representation:
Gorelick et al. represent the space-time shape using Poisson's equation and extract stick/plate-like structures.
Nair et al. represent the space-time shape using a 3D distance transform along with the R-Transform at multiple levels.
Niebles et al. represent the space-time shape as a collection of spatio-temporal words in a bag-of-words model, using a probabilistic Latent Semantic Analysis model.
Batra et al. characterize a space-time shape as a histogram over a dictionary of space-time shapelets, which are local motion patterns.
Scovanner et al. represent a spatio-temporal word by a 3D SIFT region descriptor in a bag-of-words model.
Introduction
Literature Survey
Some of the research work treated action recognition as a tracking problem, i.e., tracking suitable points on the human body.
Ali et al. used concepts from chaos theory to reconstruct the phase space from each trajectory and compute its dynamic and metric invariants.
Some of the latest work characterizes human action sequences as multidimensional arrays called tensors and uses these as the basis for feature extraction.
Kim et al. present a framework called Tensor Canonical Correlation Analysis, where descriptive similarity features between two video volumes (tensors) are used.
Lui et al. studied the underlying geometry of the tensor space, factorized it into product manifolds, and compared actions using a geodesic distance measure.
Proposed Work
Models the variation of the features extracted from each frame with respect to time.
In other words, it finds an underlying manifold in the feature space which captures the temporal variance needed for discriminating between action sequences.
Classifies a set of contiguous frames irrespective of the speed of the action or
the time instant of the body posture.
PROPOSED METHODOLOGY
Proposed Methodology
1. Feature Extraction: a shape descriptor is computed for the region of interest in a frame.
2. Reduced Space Computation: an appropriate reduced-dimensional space is computed which spans the change of shape occurring across time.
3. Modeling: the mapping from the feature space to the reduced space is suitably modeled.
[Flow diagram: Feature Extraction → Reduced Space Computation → Modeling]
The inter-frame variation of the shape descriptors across the sequence is maximized in the Eigen space.
These variations indirectly correspond to the variations occurring in the silhouette (body posture changes), which differ across action sequences.
To model the mapping from the shape descriptor (feature) space to the Eigen space for each action class, we use a regression-based network, namely the Generalized Regression Neural Network (GRNN).
Back-propagation neural networks take a long time to train and may not converge, while the GRNN is based on radial basis functions, uses a one-pass training algorithm, and converges to a stable state.
Proposed Methodology
A set of N frames (a complete sequence with N = 60 or a partial sequence with N = 15) consisting of segmented body regions (silhouettes).
From each frame, the Histogram of Oriented Gradients (HOG) is computed and accumulated over the frames of the sequences from all the action classes.
The Eigen space is obtained by performing PCA on the accumulated features, and suitable reduced representations of the features are obtained.
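The PCA step above can be sketched in a few lines of NumPy. The array sizes here (600 accumulated descriptors of dimension 441, reduced to d = 10 dimensions) are illustrative stand-ins, not values taken from the framework:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for accumulated HOG descriptors: one 441-dim row per frame,
# pooled over all training sequences (the numbers here are illustrative).
H = rng.normal(size=(600, 441))

mean = H.mean(axis=0)
Hc = H - mean                            # center the descriptors
# Eigen space via SVD of the centered data (equivalent to PCA on the
# covariance matrix, but numerically better behaved).
U, S, Vt = np.linalg.svd(Hc, full_matrices=False)
d = 10                                   # reduced dimensionality (assumed)
E = Vt[:d].T                             # (441, d) top eigenvectors
Y = Hc @ E                               # reduced representations, (600, d)
```

New descriptors are reduced the same way: subtract the stored training mean, then project onto `E`.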
Each model from 1 to M, where M is the number of action classes, is represented by one GRNN network.
It is trained using the HOG descriptors of a frame as input and the corresponding representation in the Eigen space as output.
In short, each GRNN models the mapping from the HOG space to the Eigen space.
Testing Phase
From an input set of frames of a test sequence, the corresponding HOG descriptors are computed.
The representation of the HOG descriptors in the Eigen space is computed by projecting them into the Eigen space obtained in training; this is taken as the reference.
The reduced representation estimated by each GRNN is then compared with the reference representation, and the model which gives the closest match is taken as the class of the test action sequence.
Binary Silhouette
Noise is present in the gradient image due to illumination variations in the image.
Some of this noise is reflected in the HOG descriptor, but since HOG is partially illumination invariant due to its normalization, the feature descriptors do not vary much.
The Eigen space is obtained by performing PCA on the matrix H = [h_{1,1} h_{1,2} h_{1,3} ... h_{M,N}] of accumulated HOG descriptors (h_{m,n} being the descriptor of frame n of sequence m) to get the eigenvectors e_1, e_2, ... corresponding to the largest variances between the HOG descriptors.
The eigenvectors with the highest eigenvalues correspond to the directions along which the temporal variance between the HOG descriptors is maximum.
GRNN is a one-pass learning algorithm which provides fast convergence to the optimal regression surface.
It is memory intensive, so we train the GRNN with the cluster centers obtained from K-means clustering.
Each GRNN is represented by the equation
  y_hat(x) = sum_i y_i exp(-||x - c_i||^2 / (2 sigma^2)) / sum_i exp(-||x - c_i||^2 / (2 sigma^2))
where (c_i, y_i) are the cluster centers in the HOG descriptor space and the Eigen space, respectively.
The standard deviation sigma of the radial basis function for each action class is taken as the median Euclidean distance between that action's cluster centers.
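The training recipe above (K-means cluster centers, median-distance sigma, radial-basis regression) can be sketched as follows. The data, cluster count, and dimensions are toy stand-ins, not values from the framework:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Tiny K-means sketch: returns cluster centers and point labels."""
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return C, lab

# Toy stand-ins: X = HOG descriptors, Y = their Eigen-space representations.
X = rng.normal(size=(200, 20))
Y = X @ rng.normal(size=(20, 3))

k = 8
Cx, lab = kmeans(X, k)
# Paired centers in the Eigen space: mean target of each cluster.
Cy = np.array([Y[lab == j].mean(0) if np.any(lab == j) else np.zeros(3)
               for j in range(k)])

# sigma = median Euclidean distance between this class's cluster centers.
D = np.linalg.norm(Cx[:, None] - Cx[None], axis=-1)
sigma = np.median(D[np.triu_indices(k, k=1)])

def grnn(x):
    """y_hat(x) = sum_i w_i y_i / sum_i w_i, w_i = exp(-||x - c_i||^2 / (2 sigma^2))."""
    w = np.exp(-np.sum((x - Cx) ** 2, axis=1) / (2 * sigma ** 2))
    return (w @ Cy) / w.sum()

est = grnn(X[0])
```

Because training only stores cluster centers, this is the one-pass behavior described above: no iterative weight updates are needed.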
Classification
The set of HOG descriptors from consecutive frames of a test sequence is projected onto the Eigen space to get the corresponding reference projections.
These are compared with the projections of the corresponding frames estimated by each GRNN action model, using the Mahalanobis distance measure.
The action model which gives the closest estimate to the reference projections determines the class.
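A minimal sketch of this comparison step, assuming the Mahalanobis metric is computed from the covariance of the reference projections (the slides do not specify which covariance is used); the sizes and the hypothetical GRNN estimates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_frames, n_models = 5, 15, 3

ref = rng.normal(size=(n_frames, d))          # reference projections
# Estimates from each hypothetical GRNN action model; model 1 is
# constructed to be close to the reference, so it should win.
est = [ref + rng.normal(scale=2.0, size=ref.shape) for _ in range(n_models)]
est[1] = ref + rng.normal(scale=0.1, size=ref.shape)

# Inverse covariance of the reference projections (ridge term for stability).
Sinv = np.linalg.inv(np.cov(ref, rowvar=False) + 1e-6 * np.eye(d))

def mahalanobis_score(a, b):
    """Sum of per-frame squared Mahalanobis distances between a and b."""
    diff = a - b
    return float(np.sum(np.einsum('ij,jk,ik->i', diff, Sinv, diff)))

scores = [mahalanobis_score(ref, e) for e in est]
label = int(np.argmin(scores))                # closest model wins
```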
7 × 7 overlapping cells
9 orientation bins
Normalized by taking the L2-norm
The histograms from each block are combined to form a feature vector of size 441 × 1
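A minimal HOG sketch matching the sizes above (a 7 × 7 cell grid, 9 orientation bins, L2 normalization, a 441-dimensional output). Unlike the descriptor in the slides, it normalizes each cell individually rather than over overlapping blocks, so it is only an approximation:

```python
import numpy as np

def hog_descriptor(img, grid=(7, 7), n_bins=9):
    """Sketch of HOG: 9-bin orientation histograms over a 7x7 cell grid,
    each histogram L2-normalized, concatenated into a 441-dim vector."""
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    h, w = img.shape
    ch, cw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            m = mag[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            a = ang[i*ch:(i+1)*ch, j*cw:(j+1)*cw].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 180), weights=m)
            hist /= (np.linalg.norm(hist) + 1e-6)   # L2 normalization
            feats.append(hist)
    return np.concatenate(feats)                    # shape (441,)

silhouette = np.zeros((70, 70))
silhouette[15:55, 25:45] = 1.0                      # toy binary silhouette
f = hog_descriptor(silhouette)
```

The L2 normalization is what gives the partial illumination invariance mentioned earlier: uniformly scaling the image intensities scales every gradient magnitude by the same factor, which the per-cell norm then cancels.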
Action Dataset consists of 10 action classes and each action class has 9-10 video sequences.
Each video sequence of an action is performed by a different individual.
There is variation in the size of the person and the speed of motion.
It has 3 main action classes corresponding to different shapes of the hand. Each of these classes is further divided by the motion of the hand. In short, there are 9 different action classes.
a1 - bend; a2 - jplace; a3 - jack; a4 - jforward; a5 - run; a6 - side; a7 - wave1; a8 - skip; a9 - wave2; a10 - walk
The test sequence is divided into overlapping windows of a given size, with a shift of 1 frame between consecutive windows.
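The windowing above can be sketched as follows, assuming "overlap of 1" means consecutive windows are shifted by one frame:

```python
def sliding_windows(n_frames, win, step=1):
    """Frame-index lists for overlapping windows over a sequence;
    consecutive windows are shifted by `step` frames."""
    return [list(range(s, s + win))
            for s in range(0, n_frames - win + 1, step)]

# A 60-frame sequence with 15-frame windows yields 46 partial sequences.
wins = sliding_windows(60, 15)
```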
Testing is done using the leave-10-sequences-out strategy. In short, all the partial sequences corresponding to the test sequences are left out of training.
The confusion matrix is shown on the left, and the average accuracy obtained with the framework for window sizes of 10, 12, 15, 18, 20, 23, 25, 28 and 30 is shown on the right.
This database has 5 different sets with each set corresponding to a different kind of illumination.
Each action class has 20 sequences.
For each sequence, skin segmentation was done in order to get the region of interest and to center the region in the image.
The HOG descriptor extracted contains noise variation due to different illumination conditions.
The testing strategy used was leave-9-test-sequences-out, where each test sequence corresponds to an action class.
The confusion matrix shown on the left is obtained by considering 4 clusters during training.
If all the illumination conditions are trained into the system, the overall accuracy is higher.
Testing was done on each set individually and the overall accuracy computed for each set as
shown on the right.
For set 1, the overall accuracy is high as the lighting is fairly uniform, but sets 2, 3, 4 and 5 give moderate overall accuracy due to extreme non-linear lighting conditions.
Thank You
Questions?
Please contact
Binu M Nair : nairb1@udayton.edu