Introduction
Most of the research on tracking from surveillance videos has been restricted to detecting and tracking large objects in the scene, such as people in shopping malls, players on a soccer or basketball court, or cars. Tracking individual body joints in a scene, in contrast, borders on human pose estimation in images and videos. In the current research community, the human body pose estimation problem is tackled in two different scenarios: one which uses depth information and one which uses only images. The former obtains depth information from the Kinect (Shotton et al. [14]) and is mainly suited for indoor applications such as gaming consoles and human-interactive systems. The latter is used in surveillance applications that rely on video feeds from multiple CCTV cameras monitoring a parking lot or a shopping mall.

Nair et al.

Fig. 1: (a) Manual annotation provided by point-light software [10]. (b) Human pose estimation using articulated models [15].

Some early research on tracking motion and pose in surveillance videos detects interest points on the human body and tracks them; the resulting trajectories are then modeled to differentiate between human actions [8]. Recently, in an approach proposed by Huang et al. [7], the human body pose is estimated and tracked across the scene using information acquired by a multi-camera system. The pose estimates obtained from such algorithms give continuous, smooth, sinusoidal-like trajectories and are therefore useful for gait analysis. One limitation, however, is the requirement of high-resolution imagery for accurate estimation of joint trajectories. The use of such algorithms on low-resolution videos therefore does not guarantee joint location estimates suitable for gait analysis, and a pre-processing mechanism must be applied to these noisy, discrete estimates. An illustration of the pose estimates obtained by a proprietary point-light software and by articulated part-based models is shown in Figures 1a and 1b.
Related Work
One of the earlier and popular works that tracks human motion using only a single video camera, without depth information, is that of Markus Kohler [9]. There, a Kalman filter is designed to track non-linear human motion by treating it as motion with constant velocity, with the changing acceleration modeled as white noise. In our proposed algorithm, we use a modification of this Kalman filter and of the design of its process noise covariance to track the body joints across the video sequence. Kaaniche et al. [8] used the extended Kalman filter to track specific points or corners detected at every frame of the video sequence for the purpose of gesture recognition.
In recent years, the problem of human body pose estimation has not been limited to tracking points or corners or to using depth information. One of the state-of-the-art methods for human pose estimation on static images is the flexible mixture-of-parts model proposed by Yang and Ramanan [15]. Instead of explicitly using a variety of oriented body part templates (parameterized by pixel location and orientation) in a search-based template matching scheme, a family of affine-warped templates is modeled, each with a mixture of non-oriented pictorial structures.
Theory
In this section, we explain the modules used in the proposed framework: the region-based feature matching and the tracking scheme based on the Kalman filter.
3.1 Region-based Feature Matching
where (P, R) denote the number of sampling points in the local neighborhood and its radius. The textural representation of the joint region is then the histogram of these LBP-coded values. For our purpose, we use P = 8 with R = 1, which reduces to a local region of size 3 × 3. The matching between two joint regions represented by either HOG or LBP features is done using the Chi-squared metric [11] in Equation 2, where f1, f2 are the feature vectors corresponding to a certain joint in successive frames:

\[ \chi^2(f_1, f_2) = \sum_b \frac{(f_1(b) - f_2(b))^2}{f_1(b) + f_2(b)} \tag{2} \]
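As a concrete illustration, the LBP texture descriptor and the χ² matching of Equation 2 can be sketched as follows. This is a minimal NumPy sketch under our own choices: the function names are ours, and a raw 256-bin histogram is used rather than the uniform-pattern binning of [11].

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch (P = 8, R = 1)."""
    center = patch[1, 1]
    # Clockwise 8-neighborhood, starting at the top-left pixel.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for i, n in enumerate(neighbors):
        if n >= center:
            code |= 1 << i
    return code

def lbp_histogram(region):
    """Histogram of LBP codes over all interior pixels of a grayscale region."""
    h, w = region.shape
    codes = [lbp_code(region[r - 1:r + 2, c - 1:c + 2])
             for r in range(1, h - 1) for c in range(1, w - 1)]
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist.astype(float)

def chi_squared(f1, f2, eps=1e-10):
    """Chi-squared distance between two feature histograms (Equation 2)."""
    return float(np.sum((f1 - f2) ** 2 / (f1 + f2 + eps)))
```

The best-matching pixel in a search region would then be the one whose neighborhood histogram minimizes `chi_squared` against the joint descriptor from the previous frame.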
3.2 Kalman Filter
The recursive Kalman filter can also be used for tracking purposes, and in the literature it has been widely applied to tracking points in video sequences. In the proposed algorithm, we use the Kalman filter to track a specific body joint across the scene. This is done by setting the state of the process (in this case, the human body movement) as the (x, y) coordinates of the joint along with its velocity (v_x, v_y), giving a state vector x_k ∈ R^4. The measurement vector z_k = [x_o, y_o] ∈ R^2 is provided either by the coarse joint location estimates or by the region-based estimate. By approximating the motion of a joint over a small time interval as linear, we can design the transition matrix A so that the next state is a linear function of the previous state. As done by Kohler [9], to account for the non-constant velocity often associated with accelerating image structures, we use the process noise covariance matrix Q defined in Equation 3, where a is the acceleration and Δt is the time step determined by the frame rate of the camera:
\[ Q = \frac{a^2 \Delta t}{6} \begin{bmatrix} 2(\Delta t)^2 & 0 & 3\Delta t & 0 \\ 0 & 2(\Delta t)^2 & 0 & 3\Delta t \\ 3\Delta t & 0 & 6 & 0 \\ 0 & 3\Delta t & 0 & 6 \end{bmatrix} \tag{3} \]
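The per-joint tracker described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the constant-velocity transition matrix A and the process noise Q of Equation 3 are built from the text, while the measurement noise R and the predict/correct helpers are standard Kalman filter equations of our choosing.

```python
import numpy as np

def make_kalman(dt, a):
    """State x = [x, y, vx, vy]: constant-velocity transition A and the
    process noise Q of Equation 3 (acceleration a modeled as white noise)."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = (a ** 2 * dt / 6.0) * np.array(
        [[2 * dt ** 2, 0,           3 * dt, 0],
         [0,           2 * dt ** 2, 0,      3 * dt],
         [3 * dt,      0,           6,      0],
         [0,           3 * dt,      0,      6]])
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)  # we measure (x, y) only
    return A, Q, H

def predict(x, P, A, Q):
    """Time update: propagate state and error covariance."""
    return A @ x, A @ P @ A.T + Q

def correct(x, P, z, H, R):
    """Measurement update with joint location measurement z = [zx, zy]."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

In use, each body joint gets its own filter; the measurement z comes from either the coarse pose estimate or the region-based match, as described in the framework below.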
Proposed Framework

3. … estimate, from the error covariance P_t, the elliptical region S_reg1(t) where the joint location is likely to fall.
4. Extract the next frame. Find the region-based matching estimate of each joint between instants t and t − 1, formulated as argmin_{p ∈ S_reg1(t)} χ²(f_j, f_p), where f_j is the joint descriptor updated at the previous time instant and f_p is the region descriptor computed at pixel p within the elliptical search region S_reg1(t). Also compute the dense optical flow and the global velocity of the foreground region.
5. Using this estimate and the coarse joint location estimate, predict the new elliptical search region S_reg2(t). A constraint S_reg2(t) ⊆ S_reg1(t) is enforced to prevent the growth of S_reg2(t). If the constraint is satisfied, go to Step 6; else go to Step 8.
6. Compute the region-based estimate given by argmin_{p ∈ S_reg2(t)} χ²(f_j, f_p). Use this finer estimate of the joint location as the measurement vector z = [z_x, z_y] to correct the Kalman tracker associated with that particular joint.
8. Update t ← t + 1. Set the joint velocity to the global velocity and predict …
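One way to realize the elliptical search region used in the steps above is as a chi-square gate on the Mahalanobis distance under the tracker's predicted covariance. This reading, and the 99% gate threshold for two degrees of freedom, are our assumptions rather than details stated in the text.

```python
import numpy as np

def elliptical_region(x_pred, S, grid_shape, gate=9.21):
    """Boolean mask of the pixels inside the elliptical search region around
    the predicted joint location x_pred = (x, y), given a 2x2 covariance S.
    gate = 9.21 is the 99% chi-square threshold for 2 dof (our choice)."""
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    d = np.stack([xs - x_pred[0], ys - x_pred[1]], axis=-1)
    Sinv = np.linalg.inv(S)
    # Squared Mahalanobis distance of every pixel to the predicted location.
    m = np.einsum('...i,ij,...j->...', d, Sinv, d)
    return m <= gate
```

A larger predicted covariance S yields a larger region, which matches the intuition behind constraining S_reg2(t) to lie inside S_reg1(t): a good region-based match shrinks the uncertainty and hence the next search region.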
The proposed tracking scheme has been tested on a private dataset provided by the Air Force Institute of Technology, Dayton, OH. It consists of 12 subjects, each walking twice along an outdoor track across the face of a building, once wearing a loaded vest and once without, giving a total of 24 video sequences. The area of focus is where the subject walks clockwise around the track and climbs a ramp. We set equal neighborhood sizes of 17 × 17 for each joint region and a constant acceleration a = 0.1 pixels/frame² for the corresponding Kalman filter. Figure 4 shows sample illustrations of the proposed scheme in certain frames of the sequence. Sample illustrations of the joint trajectories are also shown in Figure 5, where a comparison is made with four different schemes. The joint trajectories estimated by the different schemes for each joint are smoothed using a regression-based neural network. We see that the smooth trajectories obtained by the proposed scheme using LBP or HOG give the closest approximation to the sinusoidal trajectory, with subtle variations.
5.1 Covariance Distance Metric
This is a statistical measure of how close the tracked joint locations are to the coarse estimates of the joint location for each sequence associated with a particular subject. The metric [6] is given by

\[ d^2(K, K_m) = \sum_{i=1}^{n} \left( \log \lambda_i(K, K_m) \right)^2 \]

where K ∈ R^{3×3} is the covariance matrix of the tracked points, K_m ∈ R^{3×3} is the covariance matrix of the coarse joint locations, λ_i is the i-th generalized eigenvalue solving |λK − K_m| = 0, and n is the number of eigenvalues. The lower the value, the closer the tracked points are to the coarse joint locations. This measure does not give the precision of the tracking scheme, but it indicates whether the tracked joint trajectories lie within the spatio-temporal neighborhood of the coarse joint trajectory. We see that most of the joint trajectories obtained from the proposed scheme have very low values, which shows that the proposed scheme obtains tracked estimates close to the pose estimates produced by a pose detector.
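The metric above can be computed directly from the generalized eigenvalues. A small sketch of our own, assuming K and K_m are symmetric positive definite (the eigenvalues of |λK − K_m| = 0 are those of K⁻¹K_m):

```python
import numpy as np

def covariance_distance(K, Km, eps=1e-12):
    """Forstner-Moonen metric [6] between covariance matrices:
    d^2 = sum_i (log lambda_i)^2, with lambda_i the generalized
    eigenvalues solving |lambda * K - Km| = 0."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(K, Km)))
    return float(np.sum(np.log(np.maximum(lam, eps)) ** 2))
```

Identical covariances give a distance of zero, and scaling one matrix by e² adds (log e²)² = 4 per eigenvalue, which is a quick sanity check on an implementation.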
5.2 MOTP/MOTA
The MOTP/MOTA metrics [2] are widely used efficiency measures for multiple-object tracking, giving the precision and accuracy of a tracker over all detected and tracked objects. We use an implementation of CLEAR-MOT [1] to obtain statistics such as the false positive rate, the false negative rate, and the MOTA and MOTP scores. Multiple Object Tracking Precision (MOTP) measures the closeness of a tracked point location to its true location (given as ground truth). Here, we measure closeness as the overlap between the neighborhood region occupied by the tracked point location and that of the ground truth: the higher the overlap, the more precise the estimated location of the point. Multiple Object Tracking Accuracy (MOTA) gives the accumulated accuracy as the fraction of tracked joints matched correctly, without misses or mismatches. We computed the MOTP, MOTA, false positive rate, and false negative rate for each sequence, setting the threshold T = 0.5, the same acceleration parameter a = 0.1, and a neighborhood size of 17 × 17 for each body joint. We use the coarse joint location estimates as the ground truth, since no appropriate ground truth is provided with this dataset. In Figure 3b, we see that all of the sequences have moderately high precision of around 75% and high accuracy of around 90%. This shows that the proposed tracking scheme is largely noise-free, and that the reduction in precision is due to the slight variation of the estimated joint locations with respect to the coarse joint location estimates.
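Under the simplifying assumption of a fixed one-to-one correspondence between tracked joints and ground-truth joints (the full CLEAR-MOT data association step is omitted), the two scores as described above can be sketched as:

```python
import numpy as np

def box_iou(p, q, half=8):
    """IoU of the 17x17 neighborhoods (half-width 8) around points p and q."""
    x1 = max(p[0] - half, q[0] - half); x2 = min(p[0] + half, q[0] + half)
    y1 = max(p[1] - half, q[1] - half); y2 = min(p[1] + half, q[1] + half)
    inter = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)
    area = (2 * half + 1) ** 2
    return inter / (2 * area - inter)

def motp_mota(tracked, truth, n_misses, n_fp, n_mismatch, T=0.5):
    """MOTP: mean neighborhood overlap over pairs whose IoU is above T.
    MOTA: 1 - (misses + false positives + mismatches) / ground-truth count."""
    ious = [box_iou(p, q) for p, q in zip(tracked, truth)]
    matched = [v for v in ious if v >= T]
    motp = float(np.mean(matched)) if matched else 0.0
    mota = 1.0 - (n_misses + n_fp + n_mismatch) / len(truth)
    return motp, mota
```

Function names and the square-neighborhood overlap are our choices for illustration; a CLEAR-MOT implementation additionally solves the per-frame assignment between hypotheses and ground truth before accumulating these terms.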
Conclusion
Fig. 4: Illustration of elliptical search regions before tracking and joint location estimates after tracking. (a) Elliptical search region in frame 1 for frame 2. (b) Fine estimates of joint location in frame 2 obtained from the tracking scheme (LBP). (c) Elliptical search region computed in frame 3 for frame 4. Here, the shoulder and ankle joint trackers are corrected with the coarse location, while the other joint trackers are corrected with the region-based estimate. (d) Finer estimates of joint locations in frame 4 obtained from the tracking scheme (LBP). (e) Elliptical search region computed at frame 7 for frame 8. (f) Tracked joint locations at frame 9 based on the elliptical search regions. The coarse pose estimates are shown in purple in each frame. The search regions and the finer joint estimates are given as shoulder (blue), elbow (green), wrist (red), waist (cyan), knee (yellow), and ankle (pink).
Fig. 5: Estimated fine joint trajectories by different schemes for subject 11 wearing a vest in phase A. Color key: blue, coarse joint locations from human pose estimation; purple, coarse joint locations filtered by the Kalman filter; green, HOG region-based tracking; red, LBP region-based tracking.
References
1. Bagdanov, A., Del Bimbo, A., Dini, F., Lisanti, G., Masi, I.: Posterity logging of face imagery for video surveillance. IEEE MultiMedia 19(4), 48–59 (Oct 2012)
2. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. J. Image Video Process. 2008, 1:1–1:10 (Jan 2008)
3. Burgos-Artizzu, X., Hall, D., Perona, P., Dollar, P.: Merging pose estimates across space and time. In: Proceedings of the British Machine Vision Conference. BMVA Press (2013)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005). vol. 1, pp. 886–893 (2005)
5. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008). pp. 1–8 (June 2008)
6. Förstner, W., Moonen, B.: A metric for covariance matrices (1999)
7. Huang, C.H., Boyer, E., Ilic, S.: Robust human body shape and pose tracking. In: International Conference on 3D Vision (3DV 2013). pp. 287–294 (2013)
8. Kaaniche, M., Bremond, F.: Tracking HoG descriptors for gesture recognition. In: Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '09). pp. 140–145 (2009)
9. Kohler, M.: Using the Kalman Filter to Track Human Interactive Motion: Modelling and Initialization of the Kalman Filter for Translational Motion. Forschungsberichte des Fachbereichs Informatik der Universität Dortmund, Dekanat Informatik, Univ. (1997)
10. Nair, B.M., Kendricks, K.D., Asari, V.K., Tuttle, R.F.: Optical flow based Kalman filter for body joint prediction and tracking using HOG-LBP matching. In: Proc. SPIE. vol. 9026, p. 90260H (2014)
11. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
12. Ramakrishna, V., Kanade, T., Sheikh, Y.: Tracking human pose by tracking symmetric parts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013). pp. 3728–3735 (2013)
13. Ramanan, D.: Learning to parse images of articulated bodies. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1129–1136. MIT Press (2007)
14. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). pp. 1297–1304 (2011)
15. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). pp. 1385–1392 (June 2011)