
Object Tracking by Exploiting Adaptive Region-wise Linear Subspace Representations and Adaptive Templates in an Iterative Particle Filter

Ming-Che Ho, Cheng-Chin Chiang, Ying-Yu Su


Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Hualien, Taiwan, 974

Abstract

Aiming at tracking visual objects under harsh conditions, such as partial occlusions, illumination changes, and appearance variations, this paper proposes an iterative particle filter incorporated with an adaptive region-wise linear subspace (RWLS) representation of objects. The iterative particle filter employs a coarse-to-fine scheme to decisively generate particles that convey better hypothetic estimates of the tracking parameters. As a result, higher tracking accuracy can be achieved by aggregating the good hypothetic estimates from the particles. Accompanying the iterative particle filter, the RWLS representation is specially designed to tackle the partial occlusion problem, which often causes tracking failure. Moreover, the RWLS representation is made adaptive by exploiting an efficient incremental updating mechanism, which adapts the RWLS to gradual changes in object appearance and illumination conditions. Additionally, we also propose an adaptive mechanism that continuously adjusts the object templates so that the varying appearances of tracked objects can be well handled. Experimental results demonstrate that the proposed approach achieves better performance than other related prior arts.

Keywords: object tracking, region-wise linear subspace (RWLS), iterative particle filter, incremental PCA

Corresponding author: Cheng-Chin Chiang. Tel.: +886-3-8634027; fax: +886-3-8634010. Email address: ccchiang@mail.ndhu.edu.tw (Cheng-Chin Chiang) July 8, 2011

Preprint submitted to Pattern Recognition Letters

1. Introduction

Visual object tracking is a core task in most computer vision applications and has been intensively researched over the past decade. Typical objects to be tracked include faces, hands, cars, and human bodies. A wide spectrum of applications, such as tele-conferencing, video surveillance, human-machine interaction, and intelligent transportation systems, has been developed for daily life.

Visual object tracking is challenging due to the intrinsic and extrinsic variations in tracking conditions. Intrinsic variations, the variations that appear on the tracked objects themselves, include dynamic changes of object poses, geometries, colors, and textures. In contrast, extrinsic variations are induced by environmental conditions, including illumination changes, cluttered backgrounds, and partial occlusions of the tracked objects. Whatever kind of variation appears, the difficulty of object tracking increases if the tracking is based on matching object appearances.

Tracking by matching object appearances, sometimes termed template matching, is a common approach to object tracking (?). Under different poses and illumination, the object appearance varies continuously during the tracking; hence, tracking objects with one or several fixed templates of object appearances is not feasible in practical applications. An adaptive template representation becomes essential for handling the varying object appearances during the tracking. Besides varying poses and illumination, partial occlusion of a tracked object is another major cause of tracking failure. A partial occlusion causes a discrepancy between the appearance of the occluded object and the template, which often fails template-based tracking. Therefore, handling partial occlusions demands a flexible representation of object appearances with high tolerance to the missing of some local parts.

Due to the dynamic pose changes of an object during its motion, an object tracking method is usually required to estimate the pose of the tracked object. The pose of an object is usually characterized by parameters relating to the position, scale/dimension,

and rotation of the object. Hence, the tracking problem is sometimes referred to as the pose estimation problem. Devising a method that can accurately estimate the pose parameters of the tracked object under the challenges of various appearance variations is the core task and the ultimate goal of visual object tracking research.

Motivated by the technical demands and research goals mentioned above, the work presented in this paper aims at developing a robust and effective solution to the object tracking problem. This solution encompasses a method to estimate the pose parameters of visual objects and an adaptive, flexible representation to tackle the problems of varying illumination and partial occlusions. The proposed pose estimation method is an iterative particle filter, which offers improved performance over other particle filter methods. The proposed object representation is an RWLS representation, which enables the tracking of partially occluded objects. To adapt the RWLS representation to intrinsic and extrinsic variations during the tracking, an incremental updating scheme is designed to update the bases of the subspace efficiently with each up-to-date input. Besides the adaptive linear subspace representation, we also devise an incremental updating mechanism to adapt the object template to the varying object appearances. With the iterative particle filter, the adaptive RWLS representation, and the adaptive object template, pose estimation under the challenges of partial occlusions, illumination changes, and varying object appearances can be handled well.

The rest of this paper is organized as follows. Section 2 presents a brief survey of related work. Section 3 introduces the adaptive RWLS representation of objects and the incremental updating scheme of the linear subspace for dynamic appearance variations. Section 4 reviews the conventional particle filter and then proposes the iterative particle filter method for pose estimation; the mechanism for handling partial occlusions and the adaptive templates for varying object appearances are also described in this section. Section 5 shows the experimental results and compares the performance with those of other related prior arts. Section 6 concludes this paper.

2. Related Work

Several studies (????) have shown the high efficacy of the linear subspace in many applications of object tracking and recognition. ? proposed a pre-trained view-based eigenspace representation for object tracking, called Eigentracking, in which the appearance variations of objects with different poses are captured, to a limited extent, by samples of many different views. ? developed an efficient affine tracking scheme that deals with changing illumination by training on exemplars acquired under a variety of lighting conditions. However, these approaches may still encounter exceptional cases of untrained poses or illumination conditions. Moreover, they require the storage of, and the effort of collecting, a large set of samples for building the linear subspace.

To reduce the cost of storage and the effort of off-line subspace training, online subspace learning (???) offers an alternative solution. The key merit of online subspace learning methods is to incrementally update the subspace whenever a new sample becomes available. The updating requires no access to past samples, and thus no storage for accumulating past data is necessary. Following this principle, ? presented an incremental principal component analysis (PCA) to update the linear subspace, and ? employed the incremental PCA to update the appearance model for face recognition and tracking. However, incremental PCA can build a biased subspace if the new samples contain outliers or out-of-date noise. ? developed an efficient incremental updating algorithm which incorporates a forgetting factor to wear down the influence of older samples. Their empirical results show that the incremental method can tolerate larger pose variations and illumination changes; however, the problem of partial occlusions still cannot be well solved.

As to pose estimation, previous work can be divided into two categories: the deterministic approach and the stochastic approach. The deterministic approach, including the template-based algorithms (??) and the mean-shift algorithm (??), estimates the object poses without introducing any random process into the estimation. One typical example of the template-based algorithms is the algorithm proposed by ?, which revises the Lucas-Kanade

optical flow algorithm (?) into an efficient inverse compositional (IC) algorithm. This algorithm estimates the poses of objects undergoing different motions by a process of gradient-descent error minimization. The major weakness of gradient-descent error minimization is that it can become trapped in local minima, which may lead to undesirable tracking results; the method is also prone to errors when tracking objects with large appearance changes. ? presented the mean-shift tracking algorithm, which uses the Bhattacharyya distance to calculate the similarity between the color density distributions of the template and the tracked object. Since the color density distribution is a global visual feature insensitive to local distortions of object appearance, the mean-shift algorithm can tolerate partial occlusions to some extent. Nonetheless, it is difficult to attain accurate estimation of object poses with such a coarse-level visual feature. Some extensions of the mean-shift algorithm (??) capture spatial information by calculating the means and covariances corresponding to the color bins, making the pose estimation more accurate and robust.

In contrast to the deterministic approach, the stochastic approach typically estimates the object pose parameters, which are usually modeled as random variables, through a random process. The instance values of all modeled pose parameters at a certain moment are collectively called the state of the tracked object. The Kalman filter is a well-known method for state estimation based on a linear stochastic model of system dynamics. It produces estimates of the true values of measurements by predicting a value, estimating the uncertainty of the predicted value, and computing a weighted average of the predicted and measured values. The particle filter is a generalized extension of the Kalman filter: it assumes no linearity of the system dynamics, and the random noise in the stochastic process can be non-Gaussian. The pioneering work on the particle filter for visual tracking is the CONDENSATION algorithm proposed by ?. Some extensions of the particle filter have been developed to enhance the efficiency and effectiveness of visual object tracking. ? further proposed the ICONDENSATION algorithm, which incorporates an auxiliary blob tracker into the CONDENSATION algorithm. The tracking of the auxiliary blob tracker is

based on well-segmented regions of interest with homogeneous colors. Unfortunately, another challenging problem encountered in this method is how to robustly segment the object into desirable regions.

The estimation results of the particle filter depend significantly on a set of randomly generated particles, each carrying a hypothetic instance of the estimated state. The final state estimate is aggregated from the state instances of all particles. To achieve better aggregation, the number of particles is usually kept large so that the bad influence of outlier particles can be reduced. However, a larger number of particles does not necessarily guarantee better hypothetic instances on the particles, and the computational burden of the state estimation also increases with the number of generated particles. Therefore, a good design is to devise a mechanism that decisively improves the goodness of the state instances conveyed by particles, so that the number of particles required for accurate state estimation can be reduced.

3. The Adaptive Region-wise Linear Subspace (RWLS) Representation

3.1. RWLS Representation

The principal component analysis (PCA) is a well-known technique of linear subspace representation. By PCA, a transformation U can be derived to transform a data vector x in a higher-dimensional space into another data vector c in a lower-dimensional space, i.e., x \approx \bar{x} + Uc, where \bar{x} is the mean vector of the collected data set. Inspired by the region-based face recognition approach proposed by ?, we adopt a region-wise linear subspace representation of object appearance. By partitioning an object appearance into several regions, each represented by its own linear subspace, a partially occluded object can still be tracked as long as one or more regions are not occluded. Thus, the robustness of tracking a partially occluded object is effectively enhanced by the region-wise representation. Concerning the way to partition the object appearance into regions, a simple and regular partitioning scheme is preferred for easier subimage cropping and region tracking. Hence, the proposed region-wise representation uniformly partitions each object appearance into k × k rectangular regions, where k is a design choice. For ease of reference, we use the notation Rk to denote the scheme of k × k region partitioning. For example, the partitioning scheme R1 treats the whole object appearance as a single region, while R2 partitions the object appearance into 2 × 2 equal-size regions. A sketch of this partitioning is given below.
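The following is a minimal sketch of the Rk partitioning; it is our illustration rather than the authors' code, and the function name and the 24×24 template size (taken from Section 5) are assumptions:

```python
import numpy as np

def partition_regions(appearance, k):
    """Split an object appearance (h x w array) into k x k equal-size
    regions and return each region flattened into a column vector, so
    each region can maintain its own linear subspace."""
    h, w = appearance.shape
    rh, rw = h // k, w // k            # region size (assumes divisibility)
    regions = []
    for i in range(k):
        for j in range(k):
            patch = appearance[i*rh:(i+1)*rh, j*rw:(j+1)*rw]
            regions.append(patch.reshape(-1, 1))  # one vector per RWLS region
    return regions

# Example: the R2 scheme on a 24x24 template yields four 12x12 regions.
template = np.zeros((24, 24))
assert len(partition_regions(template, 2)) == 4
```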

3.2. Incremental Subspace Updating

In PCA, a linear subspace can be built by solving the eigenproblem of the covariance matrix computed from a collection of samples. However, such batch processing on a fixed set of samples cannot well model the appearance variations of a moving object over time. Provided that the frame rate of the video camera is fast enough, e.g., more than 30 frames per second, we can assume that all appearance changes of objects occur gently. To adapt the RWLS to up-to-date conditions, we need to recompute the eigenvectors of the new regional covariance matrices updated with newly arriving samples. Nonetheless, recomputing the covariance matrices and eigenvectors from scratch would require storing all past samples and would also demand a high cost in time. ? proposed an efficient method to incrementally update the eigenvectors without storing the past samples; this method updates the current eigenvectors using only the newest sample. One problem of this method is that the subspace may be improperly updated by incoming outliers or noisy samples. To avoid this problem, we propose a revised incremental subspace updating scheme, called the weighted incremental PCA (WI-PCA). The merit of the WI-PCA is to update the subspace with a newly arriving sample only to the extent that the sample is reliable. To this end, each incoming sample is associated with a weight value, which decreases with the residual of approximating the sample by its lower-dimensional subspace representation. If the residual is large, meaning that the sample is very likely an outlier or noise with respect to the current subspace, then the associated weight is small. In what follows, we present the formal derivation of the WI-PCA.

Let C_N be the covariance matrix computed from \{x_i\}_{1 \le i \le N}, and C_{N+1} the new covariance matrix obtained after adding a new sample x_{N+1}. Similarly, the mean vectors before and after adding the new sample are denoted by \bar{x}_N and \bar{x}_{N+1}, respectively.

Both C_{N+1} and \bar{x}_{N+1} can be derived recursively as follows:

\bar{x}_{N+1} = \frac{\sum_{i=1}^{N}\alpha_i\,\bar{x}_N + \alpha_{N+1}\,x_{N+1}}{\sum_{i=1}^{N+1}\alpha_i},   (1)

C_{N+1} = \frac{\sum_{i=1}^{N}\alpha_i}{\sum_{i=1}^{N+1}\alpha_i}\,C_N + \frac{\left(\sum_{i=1}^{N}\alpha_i\right)\alpha_{N+1}}{\left(\sum_{i=1}^{N+1}\alpha_i\right)^2}\,\Delta x\,\Delta x^T,   (2)

where \Delta x = x_{N+1} - \bar{x}_N and \alpha_i is the weight associated with x_i. Note that the above computations involve no past samples in \{x_i\}_{1 \le i \le N}. The value of the term \sum_{i=1}^{N}\alpha_i can also be incrementally updated and stored in a variable, say \beta_N, on the arrival of each new sample; in effect, this term sums up the weights of the past samples. Here, we introduce a forgetting factor f, with 0 < f \le 1, into the adaptive update of \beta_N, i.e., \beta_{N+1} = f\beta_N + \alpha_{N+1}. The forgetting factor lowers the importance of the past samples. Accordingly, the adaptive computation of \bar{x}_{N+1} and C_{N+1} in Eqs. (1) and (2) can be rewritten as

\bar{x}_{N+1} = \frac{1}{\beta_{N+1}}\left(f\beta_N\,\bar{x}_N + \alpha_{N+1}\,x_{N+1}\right),   (3)

C_{N+1} = \frac{f\beta_N}{\beta_{N+1}}\,C_N + \frac{f\beta_N\,\alpha_{N+1}}{(\beta_{N+1})^2}\,\Delta x\,\Delta x^T.   (4)

In physical meaning, the weight \alpha_i in Eqs. (1) and (2) differentiates the influence of the sample x_i on the PCA process. Noisy or outlier samples are assigned smaller weights to reduce their influence. As mentioned previously, we relate the weight of a sample x_i to its approximation residual under the current linear subspace. According to PCA, the approximation residual can be computed by r_i = x_i - \bar{x} - Uc_i. Accordingly, the weight \alpha_i of x_i is set to \alpha_i = \exp(-k\|r_i\|), where k is a constant controlling how fast the weight decays with the magnitude of the residual; we set it to 1 in our implementation.
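A minimal sketch of the statistics update of Eqs. (3) and (4) is given below; it is our illustration rather than the authors' code, and the function name and the default forgetting factor are assumptions:

```python
import numpy as np

def wipca_update_stats(x, x_mean, C, beta, U, f=0.95, kappa=1.0):
    """One WI-PCA statistics update for a new sample x (Eqs. (3)-(4)).
    x_mean: current mean; C: current covariance; beta: accumulated weight
    sum; U: current eigenvector matrix (columns are basis vectors)."""
    # Residual of approximating x in the current subspace.
    c = U.T @ (x - x_mean)
    r = x - x_mean - U @ c
    alpha = np.exp(-kappa * np.linalg.norm(r))   # small weight for outliers
    beta_new = f * beta + alpha                  # forgetting factor wears down the past
    dx = x - x_mean
    x_mean_new = (f * beta * x_mean + alpha * x) / beta_new
    C_new = (f * beta / beta_new) * C \
            + (f * beta * alpha / beta_new**2) * np.outer(dx, dx)
    return x_mean_new, C_new, beta_new
```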

Let U be the matrix whose columns comprise the current set of eigenvectors obtained from the PCA on the N samples \{x_i\}_{1 \le i \le N}. When the new sample x_{N+1} arrives, a new set of eigenvectors must be computed and stored in the matrix U'. The underlying eigenproblem of the PCA can then be formulated as

C_{N+1} U' = U' \Lambda',   (5)

where \Lambda' is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to the eigenvectors in U'. Due to the orthogonality of both U and U', the new eigenvectors after adding the new sample can be considered a rotated version of the old eigenvectors, i.e., U' = R'U, where R' is an orthonormal rotation matrix. Equivalently, this can be written as

U' = UR,   (6)

where R = U^T R' U for another rotation matrix R. From Eqs. (5) and (6), we have

U^T C_{N+1} U R = R \Lambda'.   (7)

Consequently, Eq. (7) is itself an eigenproblem: the rotation matrix R comprises exactly the eigenvectors of the composite matrix U^T C_{N+1} U. Since this composite matrix has a much lower dimension than C_{N+1}, the computational complexity of its eigendecomposition is also much lower. After finding the rotation matrix R from the composite matrix, the new eigenvectors are obtained from Eq. (6).
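The low-dimensional eigenproblem of Eq. (7) can be sketched as follows, continuing the illustrative statistics update above (our naming, not the authors' code):

```python
import numpy as np

def wipca_update_basis(U, C_new, n_keep=None):
    """Rotate the current basis U into the new basis U' (Eqs. (5)-(7)).
    Only the small composite matrix U^T C_{N+1} U is eigendecomposed."""
    composite = U.T @ C_new @ U           # d x d, much smaller than C_new
    evals, R = np.linalg.eigh(composite)  # R solves Eq. (7)
    order = np.argsort(evals)[::-1]       # sort by decreasing eigenvalue
    R = R[:, order]
    U_new = U @ R                         # Eq. (6): U' = U R
    return (U_new[:, :n_keep] if n_keep else U_new), evals[order]
```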

4. Visual Object Tracking by Iterative Particle Filter

4.1. Particle Filter

A particle filter formulates the tracking problem by a state prediction equation,

x_k = f(x_{k-1}, u_k),   (8)

and a measurement (or observation) function,

z_k = h(x_k, n_k),   (9)

where x_k \in \mathbb{R}^n and z_k \in \mathbb{R}^d are respectively the vector of state parameters and the measurement (or observation) at time k, u_k and n_k are independent and identically distributed (i.i.d.) random vectors of process noise and measurement noise, and the functions f(\cdot) and h(\cdot) respectively define the prediction model of the state parameter vector and the measurement function for a given state parameter vector. The measurement is generally modeled by a likelihood function p(z_k | x_k).

The Sequential Importance Resampling (SIR) method (??), a well-known state estimation method for the particle filter, approximates the posterior expectation E[x_k | z_{1:k}] by aggregating a set of weighted particles S_k = \{(x_k^i, w_k^i)\}_{1 \le i \le N_s}, where the weight w_k^i approximates the relative posterior probability of the stochastically generated particle and satisfies \sum_{i=1}^{N_s} w_k^i = 1. The aggregated state estimate is

\hat{x}_k = E[x_k | z_{1:k}] \approx \sum_{i=1}^{N_s} w_k^i x_k^i.   (10)

With respect to the measurement z_k^i induced by the hypothetic estimate x_k^i on particle i, the particle weight w_k^i for the current frame turns out to be

w_k^i \propto w_{k-1}^i \, p(z_k^i | x_k^i).   (11)

One common problem with the particle filter is the degeneracy problem, which occurs after several iterations of re-weighting (?): all but one particle end up with negligible weights, implying that most computational effort is wasted on updating particles that contribute nothing to the approximation of p(x_k | z_{1:k}). A remedial operation is to perform a resampling process on the particles if the number of effective particles is too small. To determine the appropriate timing for the resampling process, a degeneracy criterion is defined as

N_{eff} = 1 / \sum_{i=1}^{N_s} (w_k^i)^2,

which logically indicates the number of particles with effective weights. The resampling process is initiated if N_{eff} \le N_T, where N_T is a predefined threshold. In the resampling process, each ineffective particle is replaced with a new particle carrying a state instance stochastically perturbed from an existing particle with a higher weight. All new particles after the resampling process have equal weight (i.e., w_{k-1}^i = 1/N_s), implying that w_k^i relates only to p(z_k^i | x_k^i) according to Eq. (11).
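As a quick illustration of the degeneracy test and the resampling trigger (our sketch; the threshold N_T and the helper names are assumptions, and the weights are assumed normalized):

```python
import numpy as np

def effective_particle_count(weights):
    """N_eff = 1 / sum(w_i^2) for normalized weights."""
    w = np.asarray(weights)
    return 1.0 / np.sum(w ** 2)

def maybe_resample(states, weights, n_threshold, rng=np.random.default_rng()):
    """Resample states in proportion to their weights if N_eff <= N_T;
    the resampled particles all receive equal weight 1/N_s."""
    if effective_particle_count(weights) > n_threshold:
        return states, weights                 # still healthy, keep as-is
    n = len(states)
    idx = rng.choice(n, size=n, p=weights)     # multinomial resampling
    return [states[i] for i in idx], np.full(n, 1.0 / n)
```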

4.2. The Iterative Particle Filter

In the SIR particle filter, the particle weights are related to the likelihood value p(z_k^i | x_k^i). In our method, we design the weight of each particle to be proportional to a criterion function G(z_k^i | x_k^i), which quantitatively defines the goodness of the observation z_k^i with respect to the hypothetic state parameter x_k^i. Instead of employing the conditional resampling strategy of the SIR particle filter, we perform the resampling process unconditionally on every frame k. Furthermore, when tracking on each frame, a filtering process and the resampling process are performed iteratively for a fixed number of iterations to enhance the goodness of the surviving particles. This is why we call the proposed particle filter an iterative particle filter.

4.2.1. Models of State Transition and Measurement

For tracking an object with a particle filter, we represent the state parameters by the location and the dimension of the tracked object on each video frame. Hence, the state parameters are encoded as a vector x = (x, y, w, h)^T, defining the object's bounding box, which has its upper-left corner situated at (x, y) and a dimension of w × h pixels on the frame. As formulated in Eqs. (8) and (9), a particle filter requires a state transition model and a measurement model. To characterize the object motion, a discrete equal-velocity equation is adopted for modeling the position parameters of the object, i.e., v_k = p_{k-1} - p_{k-2}, where p_k = (x_k, y_k, 0, 0)^T denotes the position of the object's bounding box at frame k. For simplicity, the perturbation u_k in Eq. (8) is defined as a random vector u_k = (\Delta x_k, \Delta y_k, \Delta w_k, \Delta h_k)^T, which adds small random deviations to the estimated position and dimension of the object. Combining the velocity and the random perturbations, the state transition model is defined as

x_k = x_{k-1} + v_k + u_k.   (12)

A sketch of this propagation step follows.
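The sketch below illustrates how a particle's state could be propagated by Eq. (12); it is our illustration, with the argument names as assumptions, and it reuses the perturbation ranges reported in Section 5 (±12 pixels in position, ±0.035 in dimension) as illustrative defaults:

```python
import numpy as np

def propagate_particle(x_prev, p_prev1, p_prev2, rng=np.random.default_rng()):
    """x_k = x_{k-1} + v_k + u_k with an equal-velocity model (Eq. (12)).
    States are (x, y, w, h); the velocity affects the position only."""
    v = np.array([p_prev1[0] - p_prev2[0],     # v_k = p_{k-1} - p_{k-2}
                  p_prev1[1] - p_prev2[1], 0.0, 0.0])
    u = np.array([rng.uniform(-12, 12),        # delta x (pixels)
                  rng.uniform(-12, 12),        # delta y (pixels)
                  rng.uniform(-0.035, 0.035),  # delta w (per Section 5)
                  rng.uniform(-0.035, 0.035)]) # delta h (per Section 5)
    return np.asarray(x_prev, dtype=float) + v + u
```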

Given the state parameter vector x_k, the measurement on the input frame I_k is defined as the appearance of the tracked object, i.e., z_k = I_k(x_k), where I_k(x_k) denotes the subimage enclosed by the bounding box specified by x_k. Here, no random noise (the n_k in Eq. (9)) is assumed for the measurement model.

4.2.2. Iterative Filtering and Resampling

The particle weight in our design is computed from a quantitative function G(z_k^i | x_k^i) which evaluates the goodness of the hypothetic observation on each particle. Suppose that t_k is the adaptive template used for tracking the target object. The function G(z_k^i | x_k^i) can be designed in terms of the matching error between the observation z_k^i = I_k(x_k^i) and the template t_k. Based on the linear subspace representation, the particle weight is designed as

w_k^i = G(z_k^i | x_k^i) = \exp\left(-\lambda \left\| U_k^t z_k^i - U_k^t t_k \right\|^2_{\omega_k}\right),   (13)

where the columns of the matrix U_k contain the eigenvectors of the current linear subspace. The matching error in Eq. (13) involves a weighted norm, defined as

\left\| a = (a_1, a_2, \ldots, a_m)^t \right\|^2_{\omega = (\omega_1, \omega_2, \ldots, \omega_m)^t} = \sum_{i=1}^{m} \omega_i a_i^2.   (14)

The parameter \lambda in Eq. (13) is a small positive value that controls the sensitivity of the goodness value to changes of the weighted norm. The elements of the weight vector \omega_k reflect the importance of the elements of the compared vectors in calculating the norm. If an element of the compressed adaptive template U_k^t t_k presents only a small variance over a period of time, meaning that this element is stable and reliable under the current appearance variations, then this element should gain a higher weight \omega_i in computing the norm. The detailed procedure for obtaining the weight vector \omega_k and the adaptive template t_k is presented later in Section 4.4.
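The goodness function of Eqs. (13) and (14) can be sketched as below (our illustration; the default value of lam is only an assumption):

```python
import numpy as np

def goodness(z, t, U, omega, lam=0.2):
    """w = exp(-lam * ||U^t z - U^t t||^2_omega), Eqs. (13)-(14)."""
    d = U.T @ z - U.T @ t                      # difference in the subspace
    weighted_sq_norm = np.sum(omega * d ** 2)  # weighted norm of Eq. (14)
    return np.exp(-lam * weighted_sq_norm)
```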

When tracking an object on a certain video frame, we use the final state estimate of the previous frame as the seed to stochastically generate hypothetic state instances on the particles. The generation of state instances follows the stochastic state transition model given in Eq. (12). The goodness of each particle is then evaluated according to Eq. (13). With the particles and their goodness values, S_k = \{(x_k^i, w_k^i = G(z_k^i | x_k^i))\}_{i=1}^{N_s}, a filtering operation removes the particles with lower goodness values; the remaining particles are

\tilde{S}_k = filter(S_k, \theta) = \{(x_k^i, w_k^i) \mid ((x_k^i, w_k^i) \in S_k) \wedge (w_k^i \ge \theta)\},   (15)

where \theta = 1.2 \cdot \min\{w_k^i\}_{i=1}^{N_s} is set as a 20% increment of the minimum value of the weights. With the particles in the new set \tilde{S}_k, the resampling process of the particle filter is then performed, in which the particles are resampled with probability proportional to their weights. For each sampled particle, a small random perturbation u_k, as defined in Eq. (12), is applied to the carried state instance to increase the chance of escaping from a locally optimal estimate. The random perturbation is designed to decrease with the iterations, i.e., u_k = \gamma^{iter} u_k for \gamma \in (0, 1), to ensure the final convergence of the state estimate after several iterations. On each frame, the filtering operation and the resampling process are performed alternately for several runs to enhance the goodness of the remaining particles. Let S_k^* be the particle set obtained from the final run of the resampling process. The weight of each particle in S_k^* is re-evaluated according to Eq. (13) and then normalized by the sum of all weights. Finally, the aggregation scheme infers the final estimate from the normalized weights, i.e.,

\hat{x}_k = \sum_{(x_k^i, w_k^i) \in filter(S_k^*, \theta)} w_k^i x_k^i.   (16)
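Putting the pieces together, one possible iteration loop is sketched below; it is our illustration, where `observe` is a hypothetical callback that crops the frame at a state and returns the goodness of Eq. (13), and the perturbation is simplified to a unit-range uniform vector scaled by the decay:

```python
import numpy as np

def iterative_filter(seeds, observe, n_iters=3, gamma=0.7,
                     rng=np.random.default_rng()):
    """Alternate the filtering of Eq. (15) with weighted resampling and a
    decaying perturbation, then aggregate the survivors as in Eq. (16)."""
    states = [np.asarray(s, dtype=float) for s in seeds]
    for it in range(1, n_iters + 1):
        w = np.array([observe(x) for x in states])  # goodness, Eq. (13)
        theta = 1.2 * w.min()                       # threshold of Eq. (15)
        keep = w >= theta
        if not keep.any():                          # safeguard: keep the best
            keep = w == w.max()
        states = [s for s, k in zip(states, keep) if k]
        w = w[keep]
        p = w / w.sum()                             # resampling probabilities
        idx = rng.choice(len(states), size=len(seeds), p=p)
        scale = gamma ** it                         # perturbation decays per iteration
        states = [states[i] + scale * rng.uniform(-1, 1, size=4) for i in idx]
    w = np.array([observe(x) for x in states])
    w /= w.sum()                                    # normalized final weights
    return sum(wi * xi for wi, xi in zip(w, states))  # aggregation, Eq. (16)
```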

To illustrate the gradual improvement of the tracking as the iterations proceed, Figure 1 demonstrates an example of tracking a fast-moving hand. The white box in Figure 1(a) shows

Figure 1: The gradually improved particle estimates of the proposed iterative particle filter for 3 iterations of particle resampling. (a) Tracking result on a frame at time t−1, (b) the particle estimates of the three iterations of particle resampling, illustrated respectively with white boxes, gray boxes, and black boxes, on the frame at time t, and (c) the final aggregated tracking result on the frame at time t.

the tracking result on a certain frame. For tracking the hand in the next frame, shown in Figure 1(b), the white boxes illustrate the estimates from 100 particles generated in the first iteration. After the second iteration, the estimates from these particles are illustrated with gray boxes; apparently, the estimates get closer to the true position of the hand compared to those of the first iteration. When the third iteration is completed, the generated estimates, illustrated with black boxes, are even better than the estimates of the second iteration. The final aggregated estimate is shown with the white box in Figure 1(c). For the conventional particle filter, this kind of fast-moving object would demand a large number of particles (e.g., >600) and require large perturbations to attain good tracking results. Additionally, the tracking results on different video frames may drift unstably because of the introduced large perturbations.

4.3. Handling of Partial Occlusions

Owing to the RWLS representation, the matching between each observation and the object template is also performed in a region-wise manner. Intuitively, the matching errors of occluded regions would be larger than those of un-occluded regions. Let z_k(r) be the image observation corresponding to region r, and t_k(r) the corresponding regional subimage on the object template. The regional matching error on this region is computed by

Err(z_k(r)) = \left\| U_k^t z_k(r) - U_k^t t_k(r) \right\|_{\omega_k},   (17)

where the matrix U_k contains the eigenvectors of the current linear subspace, and the weighted norm \|\cdot\|_{\omega_k} is defined in Eq. (14). A region r is claimed to be occluded if its regional matching error satisfies the condition

Err(z_k(r)) > mean(\{Err(z_k(r))\}_{r=1}^{R}) + \sigma \cdot stdv(\{Err(z_k(r))\}_{r=1}^{R}),   (18)

where R is the total number of partitioned regions and the functions mean(S) and stdv(S) compute respectively the mean and the standard deviation of the data in a given set S. The constant \sigma controls the allowed deviation from the averaged regional matching errors. For each particle, if more than half of the partitioned regions on its hypothetic observation are identified as occluded, then the particle is discarded. For each video frame, if no particle remains after the identification of occluded regions, then the particle filter skips the tracking on the current frame and proceeds to the next frame. For handling partial occlusions, the matching errors of occluded regions should not be included in the final matching error between the observation and the template; otherwise, the tracking may be failed by these occluded regions. Hence, we refine the matching error between the object observation and the object template as

Err(z_k) = \frac{1}{R - |S_{occ}|} \sum_{r \notin S_{occ}} Err(z_k(r)),   (19)

where S_{occ} is the set containing all occluded regions and |S_{occ}| denotes the number of occluded regions.
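The occlusion test of Eq. (18) and the refined error of Eq. (19) can be sketched as below (our illustration; the default value of sigma is an assumption):

```python
import numpy as np

def occlusion_mask(region_errors, sigma=1.5):
    """Flag regions whose matching error exceeds mean + sigma*std (Eq. (18))."""
    e = np.asarray(region_errors)
    return e > e.mean() + sigma * e.std()

def refined_matching_error(region_errors):
    """Average the errors of un-occluded regions only (Eq. (19)).
    Returns None when more than half of the regions look occluded,
    signalling that the particle should be discarded."""
    occluded = occlusion_mask(region_errors)
    if occluded.sum() > len(region_errors) / 2:
        return None
    return float(np.asarray(region_errors)[~occluded].mean())
```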

4.4. Adaptive Template Updating

The incremental updating of the linear subspace adapts the subspace representation to the up-to-date inputs. However, even with an up-to-date subspace representation, the stored object template is still likely to be out-of-date. Therefore, adaptively updating the template is another important mechanism for handling the appearance variations of objects. We call the adaptively updated template an adaptive template.

The adaptive template is updated with the appearance of the object tracked on the most recent frame. Nonetheless, the template updating should be made conditional to avoid improper influence from disturbances. If no region of the object tracked on the current frame is identified as occluded, meaning that the tracked observation has no fatal disturbance in appearance, then the template can be updated with the tracked observation. Let t_k be the adaptive template at frame k, and \hat{z}_k the corresponding tracked observation. The update is performed according to

t_{k+1} = (1 - \alpha)\, t_k + \alpha\, \hat{z}_k,   (20)

where \alpha \in (0, 1) controls the rate of the updating. A larger value of \alpha means a faster adaptation of the template toward the new object appearance. Empirically, the value of this parameter depends highly on the rate of appearance changes. It is normally set below 0.05 for objects with normal moving speeds and slowly changing illumination, and around 0.05 to 0.5 for faster variations in object appearance and illumination.

Since the template is adaptive, some of its elements may change frequently as the appearance of the tracked object changes. Such varying elements may unstably affect the calculation of the weighted norm between the observed appearance and the object template in Eq. (13) and Eq. (17) and consequently lead to wrong tracking results. Hence, we introduce the weight vector \omega in Eq. (13) and Eq. (17) to reduce this negative effect. Recall that the elements of the weight vector assign different importance factors to the elements of the compared vectors when calculating the weighted norm. Highly varying elements of the template should be given low importance. Therefore, the importance of each element of the template vector can be quantitatively evaluated in terms of the variance computed from the element values collected from the tracked objects on past frames. Similar to the adaptive updating in Eq. (1) and Eq. (2), the mean \mu_k, the covariance \Sigma_k, and the weight vector \omega_k of the template elements represented in the linear subspace can be incrementally updated by

\mu_{k+1} = \frac{k}{k+1}\,\mu_k + \frac{1}{k+1}\, U_{k+1}^t t_{k+1},   (21)

\Sigma_{k+1} = \frac{k}{k+1}\,\Sigma_k + \frac{k}{(k+1)^2}\,(U_{k+1}^t t_{k+1} - \mu_{k+1})(U_{k+1}^t t_{k+1} - \mu_{k+1})^t,   (22)

\omega_{k+1} = expv(-diag(\Sigma_{k+1})),   (23)

where diag(M) denotes a vector composed of the diagonal elements of a matrix M and expv(a = (a_1, a_2, \ldots, a_d)^t) = (\exp(a_1), \exp(a_2), \ldots, \exp(a_d))^t.

In summary, the proposed method adapts both the template and the linear subspace. Eq. (20) defines the way to adapt the template in the original image space, while the eigenspace is incrementally updated by the method presented in Section 3.2. Eqs. (21)-(23) define the way to adaptively compute the weight vector \omega_k required for calculating the matching errors defined in Eq. (13) and Eq. (17). Algorithm 1 lists the detailed steps of the proposed iterative particle filter.
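A minimal sketch of the template and weight-vector updates is given below; it is our illustration of Eqs. (20)-(23) as reconstructed above (the exact form of Eq. (23) in the original is partially garbled, so the weight rule here is an assumption), and the default alpha is only illustrative:

```python
import numpy as np

def update_template(t, z, alpha=0.05):
    """Blend the tracked observation into the template (Eq. (20))."""
    return (1 - alpha) * t + alpha * z

def update_weight_vector(mu, Sigma, U, t_new, k):
    """Running mean/covariance of the compressed template and the derived
    weight vector (Eqs. (21)-(23), as reconstructed in the text)."""
    c = U.T @ t_new                          # compressed template U^t t_{k+1}
    mu_new = (k / (k + 1)) * mu + c / (k + 1)
    d = c - mu_new
    Sigma_new = (k / (k + 1)) * Sigma + (k / (k + 1) ** 2) * np.outer(d, d)
    omega = np.exp(-np.diag(Sigma_new))      # low weight for unstable elements
    return mu_new, Sigma_new, omega
```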

5. Experimental Results and Performance Comparison

To evaluate the performance of the proposed algorithm, experiments are conducted to track objects in six testing video sequences acquired in real-world environments. Two sequences, available at http://www.cs.toronto.edu/~dross/ivt/, present appearance variations, including illumination changes, pose changes, and facial expressions, on the tracked objects. The other four sequences are captured with our camcorder and present cases of partial occlusions, size variations, and fast motions. The tracking algorithm is implemented in C++ on Microsoft Visual Studio and runs on an Intel Pentium 4 2.8 GHz CPU. The assessed processing speed is about 14.7 frames per second for 100 particles. When initializing the tracking of a sequence, we manually specify the rectangular bounding box of the target object on the first frame. The boxed target object appearance is then resized to a 24×24 object template. Then, we randomly translate and scale the bounding box by small random perturbations 100 times to acquire 100 samples for building the initial linear subspaces of the partitioned regions. Two region partitioning schemes, R1 and R2, are

Algorithm 1 The Proposed Tracking Algorithm with Occlusion Handling
1: Given a particle set S_{k-1} = \{x_{k-1}^i, 1/N_s\}_{i=1}^{N_s}, the target template t_{k-1}, and the subspace model M_{k-1} = \{\bar{x}_{k-1}, U_{k-1}, C_{k-1}\} at frame k-1.
2: Set occ_flag = 0 to indicate no occlusion.
3: Set iter = 1.
4: for i = 1 : N_s do
5:   Propagate the particles for the initial iteration by x_{k,iter}^i = x_{k-1}^i + v_k + u_k.
6:   Evaluate the measurement z_{k,iter}^i corresponding to state x_{k,iter}^i.
7:   Update the weight w_{k,iter}^i by Eq. (13).
8: end for
9: Normalize the weights w_{k,iter}^i = w_{k,iter}^i / \sum_{i=1}^{N_s} w_{k,iter}^i, for 1 \le i \le N_s.
10: for iter = 2 : Iter do
11:   Generate the seed sample set by the filtering function \tilde{S}_{k,iter} = \{\tilde{x}_{k,iter}^j, \tilde{w}_{k,iter}^j\} = filter(S_{k,iter-1}, \theta) of Eq. (15).
12:   Set c_k^0 = 0.
13:   for j = 1 : J do
14:     c_k^j = c_k^{j-1} + \tilde{w}_{k,iter}^j
15:   end for
16:   Normalize the cumulative probabilities c_k^j = c_k^j / c_k^J for 1 \le j \le J.
17:   for i = 1 : N_s do
18:     Generate a uniformly distributed random number r \in [0, 1].
19:     Find the smallest j for which c_k^j \ge r.
20:     Propagate the particle by x_{k,iter}^i = \tilde{x}_{k,iter}^j + \gamma^{iter} u_k.
21:     Evaluate the measurement z_{k,iter}^i corresponding to state x_{k,iter}^i.
22:     Update the weight w_{k,iter}^i by Eq. (13).
23:   end for
24:   Normalize the weights w_{k,iter}^i = w_{k,iter}^i / \sum_{i=1}^{N_s} w_{k,iter}^i.
25: end for
26: Perform the filtering function S_k^* = filter(S_{k,Iter}, \theta).
27: Normalize the weights of the particles in S_k^*.
28: Estimate the state \hat{x}_k by Eq. (16).
29: Set occ_flag according to the matching errors by Eq. (18).
30: if occ_flag = 0 then
31:   Update the template t_k by Eq. (20) and the subspace model M_k as presented in Section 3.2.
32: end if

18

Table 1: Parameter settings for the proposed iterative particle filter

  forgetting factor f    λ      γ      σ      α
  [0.05, 1]              0.2    0.7    1.5    [0.03, 0.1]

used for comparison. The numbers of eigenvectors used to represent the linear subspaces of R1 and R2 are 50 and 15, respectively. Only 100 particles are generated in our iterative particle filter when tracking each testing sequence. The settings of the other parameters, including the forgetting factor f, λ, γ, σ, and α, are listed in Table 1. The crucial parameters are the forgetting factor f and the template learning rate α, which should be set according to the rate of appearance changes of the tracked object. The ranges of the uniformly distributed random vector u_k = (\Delta x_k, \Delta y_k, \Delta w_k, \Delta h_k)^T are \Delta x_k \in U[-12, 12], \Delta y_k \in U[-12, 12], \Delta w_k \in U[-0.035, 0.035], and \Delta h_k \in U[-0.035, 0.035].

5.1. Experimental Results

The experiment conducted on the first testing video sequence tracks a human face (dudek) under different head poses and facial expressions. Figure 2 demonstrates some snapshots of the tracking results. At the bottom of each illustrated snapshot, the thumbnail images from left to right show the tracked target, the template, the subspace mean, the approximation error (residual) image, and the approximated image, respectively. For comparison, Figure 2 simultaneously illustrates the tracking results using the following four combinations of adaptive mechanisms: 1. a fixed template and a fixed subspace representation, 2. an adaptive template and a fixed subspace representation, 3. a fixed template and an adaptive subspace representation, and 4. an adaptive template and an adaptive subspace representation. The results show evidently that an adaptive template combined with an adaptive subspace representation attains the best performance. The combinations without adaptive templates

fail to track the face from Frame #796 onward, while the one with both the adaptive template and the adaptive subspace correctly tracks the face on all frames. Another video sequence, tracking the face of another person (ming-hsuan), is also tested. Figure 3 shows snapshots of the tracking results comparing the use of the fixed subspace and the adaptive subspace; both compared methods use the adaptive template. Note that this testing sequence contains large illumination variations on some frames (#400 and #1200). The top row of Figure 3, which illustrates the tracking results of the fixed subspace, shows that the face cannot be well tracked on Frame #1420. However, this frame can still be correctly tracked with the adaptive subspace, as shown in the bottom row of Figure 3.

Figure 2: Face tracking on a sequence (dudek) with variations in poses and facial expressions. Rows show Frames 362, 684, and 934. Column 1: the tracking results with a fixed template and a fixed subspace model; Column 2: the tracking results with a fixed template and an adaptive subspace model; Column 3: the tracking results with an adaptive template and a fixed subspace model; Column 4: the tracking results with an adaptive template and an adaptive subspace model.

Two other sequences are tested to examine the efficacy of the RWLS in handling partial occlusions. The first sequence is the video of a moving toy tank which is gradually occluded

Figure 3: Face tracking on another sequence (ming-hsuan) with large variations in illumination and poses. The top row (Frames 1, 400, 1200, and 1420) shows the tracking results with an adaptive template and a fixed subspace representation. The bottom row (Frames 840, 1120, 1200, and 1420) shows the tracking results with an adaptive template and an adaptive subspace representation.

by another scene object during its motion. The maximal occluded area during the motion is about 50%. Figure 4 shows the tracking results on some frames for the region partitioning schemes R1 and R2. Note that the character N shown at the left side of each snapshot of the scheme R2 indicates that the corresponding partitioned region is automatically identified as an un-occluded region by the proposed tracking algorithm. On the contrary, a region identified as occluded is labeled with its region identification number. As shown in Figure 4, the scheme R2 successfully tracks the occluded tank, while the scheme R1 fails. This result verifies the good capability of the region-wise tracking in handling partial occlusions. The other testing sequence tracks a Chinese character printed on an aluminium foil package. As shown in Figure 5, the maximal occluded area during the motion is about 40% of the size of this Chinese character. The results again demonstrate the superiority of the proposed region-wise tracking of objects in handling partial occlusions.

Figure 4: Tracking a moving toy tank with severe occlusions during its motion (Frames 1, 295, 308, and 500). Top row: the results for the R1 representation. Bottom row: the results for the R2 representation.

Figure 5: Tracking a Chinese character on a moving aluminium foil package with partial occlusions (Frames 1, 57, 77, and 97). Top row: the results for the R1 representation. Bottom row: the results for the R4 representation. The objects tracked by the region partitioning R4 are more accurate in object size on Frames 77 and 97.

5.2. Performance Comparisons with Other Particle Filters

The performance of the proposed iterative particle filter is compared with those of the SIR particle filter and a generic version of the particle filter, called the GPF in this paper. The GPF performs only one iteration of particle generation. For a more balanced comparison, both the adaptive subspace and the adaptive template are exploited in the SIR and the GPF. Two video sequences are tested for the performance comparison: one shows a car moving away from the camera at a normal speed, and the other shows a doll moved by a fast-moving hand. On each testing sequence, both the SIR and the GPF use 300 particles, while our iterative particle filter uses 75 particles in each of four iterations. As shown in Figure 6, both the SIR and the GPF fail to track the car accurately when the car moves further away, indicating that they cannot well handle the size variations of the car. In contrast, the proposed iterative particle filter tracks objects with size variations better, according to the results shown in Figure 6. The performance comparison for the doll sequence is shown in Figure 7. Note that the high moving speed of the hand causes motion blur on some frames; consequently, both the SIR and the GPF fail to track the doll from Frame #130 onward. Again, the proposed iterative particle filter demonstrates better performance in tracking fast-moving objects, successfully tracking the doll on every frame of the sequence.

5.3. Performance Comparison with Hall's Incremental PCA

The proposed WI-PCA is an improved variant of the incremental PCA proposed by ?. Hence, we use three sequences, the car sequence, the doll sequence, and the jal sequence, as benchmarks for comparing the performance of Hall's method and our method. Note that the adaptive template is also exploited in both methods for this test. To quantitatively evaluate the tracking accuracy, we compute the position error of the tracking result on each video frame. The position error is defined as

Err_{pos} = \sum_{i=1}^{4} \left( X(i) - \hat{X}(i) \right)^2,

where X(i) and \hat{X}(i), for 1 \le i \le 4, are respectively the corresponding corners of the bounding boxes of the tracked object and the ground truth.
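For concreteness, this corner-based position error can be computed as in the sketch below (our reading of the formula; the corners are assumed to be (x, y) pairs):

```python
import numpy as np

def position_error(corners, corners_gt):
    """Sum of squared differences between the four corresponding
    bounding-box corners (tracked vs. ground truth), shape (4, 2)."""
    c, g = np.asarray(corners), np.asarray(corners_gt)
    return float(np.sum((c - g) ** 2))
```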

Figure 6: Tracking a car moving away from the camera (Frames 1, 62, 77, and 115) using the SIR, the GPF (using 300 particles), and the proposed particle filter (using 4 iterations with 75 particles per iteration) on rows 1, 2, and 3, respectively.


Figure 7: Tracking a doll moved rapidly by a hand (Frames 109, 127, 130, and 137). The top two rows are the tracking results for the SIR and the GPF, respectively. The bottom row shows the tracking results of the proposed particle filter.


After five rounds of tracking on each sequence, Table 2 lists the position errors averaged over all frames for the three sequences. The results show that the accuracy of the proposed WI-PCA is slightly better than that of Hall's approach. In the test, we find that Hall's approach may improperly update the linear subspace with samples of bad object appearances drawn from incorrect tracking results. On the contrary, the proposed WI-PCA can ignore the bad samples through the designed weighting mechanism. Furthermore, the forgetting factor introduced in our WI-PCA gives higher influence to the up-to-date good samples, while Hall's method treats the out-of-date samples equally.
Table 2: The statistics of the tracking position errors for three video sequences.

                  jal          car          doll
  WI-PCA method   5.35±0.02    6.50±1.16    7.42±0.31
  Hall's method   6.45±0.23    7.25±2.14    7.60±0.40

To further demonstrate the effectiveness of the proposed WI-PCA in modeling object appearance, we also compare the reconstruction errors of the WI-PCA method and Hall's method. Fifty eigenvectors are selected to reconstruct the object appearance for both compared methods. In addition to the previous three benchmark sequences, we also include the two face sequences (dudek and ming-hsuan) used in Section 5.1 for evaluation. The reconstruction error is defined as

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( I(i) - \hat{I}(i) \right)^2},

where I(i) and \hat{I}(i) are respectively the i-th pixel of the template image and of the reconstructed image, and N is the total number of pixels of the template. Table 3 presents the reconstruction errors averaged over all frames for five runs of tracking on each sequence. The results in Table 3 show that the proposed WI-PCA better represents the varying appearances of objects.

5.4. Performance Comparison with Mean-Shift Algorithm

We further compare the proposed tracker with the mean-shift algorithm (??) and the CAMShift algorithm (?) on the doll sequence and the jal sequence. As described in Section 2,

Table 3: Statistics of the reconstruction errors (RMSE per pixel).

                  dudek         ming-hsuan    jal           car           doll
  WI-PCA method   0.768±0.005   0.996±0.002   0.955±0.003   1.352±0.074   1.422±0.001
  Hall's method   0.739±0.117   1.306±0.018   0.997±0.011   1.517±0.017   2.132±0.116

since the mean-shift algorithm uses the global color histogram as the visual feature for object tracking, accurate estimation of object poses is difficult for it. Furthermore, the mean-shift tracker is error-prone under varying illumination because the color histogram is sensitive to illumination. Figure 8 demonstrates some snapshots of the tracking results on the two testing sequences; the results of the proposed tracker and the mean-shift tracker are drawn with solid white boxes and dashed yellow boxes, respectively. Figure 8 shows that the mean-shift tracker loses the doll when the doll moves rapidly (#130, #200), while the proposed tracker tracks well until the end of the sequence, and that the mean-shift tracker fails to track the face under illumination changes (#40, #170). Figure 9(a) and (b) plot the position error of the tracked object on each frame of the doll sequence and the jal sequence, respectively, for the compared trackers.

Figure 8: The snapshots of the tracked object on the doll sequence (Frames 110, 130, 182, and 209) and the jal sequence (Frames 40, 80, 160, and 197) for comparing the proposed tracker, the mean-shift tracker, and the CAMShift tracker. The dashed yellow boxes are the results of the mean-shift tracker; the solid red boxes are the results of the CAMShift tracker; the solid white boxes are the results of the proposed tracker.


6. Concluding Remarks

This paper designs an improved particle filter and an RWLS representation for visual object tracking. With the RWLS representation, the proposed tracking method partitions the object into k × k independent regions. The independent tracking of these regions enables the proposed method to ignore the occluded regions and continue tracking the un-occluded regions; thus, partial occlusions can be effectively handled during the tracking. To enhance the adaptability of the linear subspace, an adaptive subspace learning model, the WI-PCA, which can efficiently and incrementally update the built subspace, is proposed. This adaptive learning model adapts well to the variations in object appearance and illumination. In addition, the WI-PCA modulates the bad influence of outliers, noisy inputs, and out-of-date data by introducing a weighting mechanism and a forgetting factor into the adaptation. Besides the adaptive subspace, the object template for tracking is also made adaptive so that the dynamic appearance variations of the tracked object can be handled even better. Within the particle filter framework, we propose an iterative particle filter that improves over the traditional particle filters. It features an iterative particle generation strategy designed to gradually improve the quality of the generated particles; the improved particle quality leads to improved tracking accuracy after the particle aggregation. The experimental results demonstrate the effectiveness and the superiority of the proposed algorithm in tracking objects undergoing various pose changes, partial occlusions, and illumination variations.


Figure 9: The comparison of position errors of the tracked objects on (a) the doll sequence and (b) the jal sequence for the mean-shift tracker, the CAMShift tracker, and the proposed tracker.
