
An Introduction to the Kalman Filter

Greg Welch¹ and Gary Bishop²
TR 95-041
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3175
Updated: Monday, July 24, 2006

Abstract
In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation. The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown. The purpose of this paper is to provide a practical introduction to the discrete Kalman filter. This introduction includes a description and some discussion of the basic discrete Kalman filter, a derivation, description and some discussion of the extended Kalman filter, and a relatively simple (tangible) example with real numbers & results.

1. welch@cs.unc.edu, http://www.cs.unc.edu/~welch
2. gb@cs.unc.edu, http://www.cs.unc.edu/~gb


1 The Discrete Kalman Filter

In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem [Kalman60]. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation. A very friendly introduction to the general idea of the Kalman filter can be found in Chapter 1 of [Maybeck79], while a more complete introductory discussion can be found in [Sorenson70], which also contains some interesting historical narrative. More extensive references include [Gelb74; Grewal93; Maybeck79; Lewis86; Brown92; Jacobs93].

The Process to be Estimated

The Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by the linear stochastic difference equation

$$x_k = A x_{k-1} + B u_{k-1} + w_{k-1},$$ (1.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = H x_k + v_k.$$ (1.2)

The random variables $w_k$ and $v_k$ represent the process and measurement noise (respectively). They are assumed to be independent (of each other), white, and with normal probability distributions

$$p(w) \sim N(0, Q),$$ (1.3)
$$p(v) \sim N(0, R).$$ (1.4)

In practice, the process noise covariance $Q$ and measurement noise covariance $R$ matrices might change with each time step or measurement, however here we assume they are constant. The $n \times n$ matrix $A$ in the difference equation (1.1) relates the state at the previous time step $k-1$ to the state at the current step $k$, in the absence of either a driving function or process noise. Note that in practice $A$ might change with each time step, but here we assume it is constant. The $n \times l$ matrix $B$ relates the optional control input $u \in \mathbb{R}^l$ to the state $x$. The $m \times n$ matrix $H$ in the measurement equation (1.2) relates the state to the measurement $z_k$. In practice $H$ might change with each time step or measurement, but here we assume it is constant.

The Computational Origins of the Filter

We define $\hat{x}_k^- \in \mathbb{R}^n$ (note the "super minus") to be our a priori state estimate at step $k$ given knowledge of the process prior to step $k$, and $\hat{x}_k \in \mathbb{R}^n$ to be our a posteriori state estimate at step $k$ given measurement $z_k$. We can then define a priori and a posteriori estimate errors as

$$e_k^- \equiv x_k - \hat{x}_k^-, \quad \text{and} \quad e_k \equiv x_k - \hat{x}_k.$$


The a priori estimate error covariance is then

$$P_k^- = E[e_k^- e_k^{-T}],$$ (1.5)

and the a posteriori estimate error covariance is

$$P_k = E[e_k e_k^T].$$ (1.6)

In deriving the equations for the Kalman filter, we begin with the goal of finding an equation that computes an a posteriori state estimate $\hat{x}_k$ as a linear combination of an a priori estimate $\hat{x}_k^-$ and a weighted difference between an actual measurement $z_k$ and a measurement prediction $H\hat{x}_k^-$, as shown below in (1.7). Some justification for (1.7) is given in "The Probabilistic Origins of the Filter" found below.

$$\hat{x}_k = \hat{x}_k^- + K(z_k - H\hat{x}_k^-)$$ (1.7)

The difference $(z_k - H\hat{x}_k^-)$ in (1.7) is called the measurement innovation, or the residual. The residual reflects the discrepancy between the predicted measurement $H\hat{x}_k^-$ and the actual measurement $z_k$. A residual of zero means that the two are in complete agreement. The $n \times m$ matrix $K$ in (1.7) is chosen to be the gain or blending factor that minimizes the a posteriori error covariance (1.6). This minimization can be accomplished by first substituting (1.7) into the above definition for $e_k$, substituting that into (1.6), performing the indicated expectations, taking the derivative of the trace of the result with respect to $K$, setting that result equal to zero, and then solving for $K$. For more details see [Maybeck79; Brown92; Jacobs93]. One form of the resulting $K$ that minimizes (1.6) is given by¹

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1} = \frac{P_k^- H^T}{H P_k^- H^T + R}$$ (1.8)
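For the scalar case with $H = 1$, the minimization just described can be checked symbolically. The following SymPy sketch (a sanity check we add for illustration, not part of the original report) minimizes the a posteriori variance with respect to $K$ and recovers the scalar form of (1.8):

```python
import sympy as sp

P, R, K = sp.symbols('P R K', positive=True)

# Scalar a posteriori error variance for the update x = x- + K(z - x-):
# P_k = (1 - K)^2 P- + K^2 R
P_post = (1 - K)**2 * P + K**2 * R

# Setting dP_k/dK = 0 and solving recovers K = P- / (P- + R),
# the scalar form of equation (1.8) with H = 1.
K_opt = sp.solve(sp.Eq(sp.diff(P_post, K), 0), K)[0]
print(sp.simplify(K_opt))  # -> P/(P + R)
```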

Looking at (1.8) we see that as the measurement error covariance $R$ approaches zero, the gain $K$ weights the residual more heavily. Specifically,

$$\lim_{R_k \to 0} K_k = H^{-1}.$$

On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero, the gain $K$ weights the residual less heavily. Specifically,

$$\lim_{P_k^- \to 0} K_k = 0.$$

1. All of the Kalman filter equations can be algebraically manipulated into several forms. Equation (1.8) represents the Kalman gain in one popular form.

Another way of thinking about the weighting by $K$ is that as the measurement error covariance $R$ approaches zero, the actual measurement $z_k$ is trusted more and more, while the predicted measurement $H\hat{x}_k^-$ is trusted less and less. On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero the actual measurement $z_k$ is trusted less and less, while the predicted measurement $H\hat{x}_k^-$ is trusted more and more.

The Probabilistic Origins of the Filter

The justification for (1.7) is rooted in the probability of the a priori estimate $\hat{x}_k^-$ conditioned on all prior measurements $z_k$ (Bayes' rule). For now let it suffice to point out that the Kalman filter maintains the first two moments of the state distribution,

$$E[x_k] = \hat{x}_k$$
$$E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T] = P_k.$$

The a posteriori state estimate (1.7) reflects the mean (the first moment) of the state distribution; it is normally distributed if the conditions of (1.3) and (1.4) are met. The a posteriori estimate error covariance (1.6) reflects the variance of the state distribution (the second non-central moment). In other words,

$$p(x_k \mid z_k) \sim N(E[x_k], E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]) = N(\hat{x}_k, P_k).$$

For more details on the probabilistic origins of the Kalman filter, see [Maybeck79; Brown92; Jacobs93].

The Discrete Kalman Filter Algorithm

We will begin this section with a broad overview, covering the high-level operation of one form of the discrete Kalman filter (see the previous footnote). After presenting this high-level view, we will narrow the focus to the specific equations and their use in this version of the filter. The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. As such, the equations for the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward (in time) the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback, i.e. for incorporating a new measurement into the a priori estimate to obtain an improved a posteriori estimate. The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems, as shown below in Figure 1-1.


Figure 1-1. The ongoing discrete Kalman filter cycle: the time update (predict) projects the current state estimate ahead in time, and the measurement update (correct) adjusts the projected estimate by an actual measurement at that time.

The specific equations for the time and measurement updates are presented below in Table 1-1 and Table 1-2.

Table 1-1: Discrete Kalman filter time update equations.

$$\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$$ (1.9)
$$P_k^- = AP_{k-1}A^T + Q$$ (1.10)

Again notice how the time update equations in Table 1-1 project the state and covariance estimates forward from time step $k-1$ to step $k$. $A$ and $B$ are from (1.1), while $Q$ is from (1.3). Initial conditions for the filter are discussed in the earlier references.

Table 1-2: Discrete Kalman filter measurement update equations.

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$$ (1.11)
$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$$ (1.12)
$$P_k = (I - K_k H)P_k^-$$ (1.13)

The first task during the measurement update is to compute the Kalman gain, $K_k$. Notice that the equation given here as (1.11) is the same as (1.8). The next step is to actually measure the process to obtain $z_k$, and then to generate an a posteriori state estimate by incorporating the measurement as in (1.12). Again (1.12) is simply (1.7) repeated here for completeness. The final step is to obtain an a posteriori error covariance estimate via (1.13). After each time and measurement update pair, the process is repeated with the previous a posteriori estimates used to project or predict the new a priori estimates. This recursive nature is one of the very appealing features of the Kalman filter: it makes practical implementations much more feasible than (for example) an implementation of a Wiener filter [Brown92], which is designed to operate on all of the data directly for each estimate. The Kalman filter instead recursively conditions the current estimate on all of the past measurements. Figure 1-2 below offers a complete picture of the operation of the filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.
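Since the five equations in Tables 1-1 and 1-2 are the entire algorithm, they translate almost line for line into code. The following Python/NumPy sketch (our own illustration, not part of the original report; the function name and calling convention are ours) performs one complete predict/correct cycle:

```python
import numpy as np

def kalman_step(x, P, z, u, A, B, H, Q, R):
    """One predict/correct cycle of the discrete Kalman filter.

    x, P : a posteriori state estimate and error covariance from step k-1
    z, u : current measurement and control input
    Returns the a posteriori estimate and covariance for step k.
    """
    # Time update (predict), equations (1.9) and (1.10)
    x_prior = A @ x + B @ u                      # project the state ahead
    P_prior = A @ P @ A.T + Q                    # project the error covariance

    # Measurement update (correct), equations (1.11)-(1.13)
    S = H @ P_prior @ H.T + R                    # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)         # Kalman gain (1.11)
    x_post = x_prior + K @ (z - H @ x_prior)     # state update (1.12)
    P_post = (np.eye(len(x)) - K @ H) @ P_prior  # covariance update (1.13)
    return x_post, P_post
```

Calling kalman_step in a loop, feeding each a posteriori result back in as the next step's prior, reproduces the recursion of Figure 1-1.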

Filter Parameters and Tuning

In the actual implementation of the filter, the measurement noise covariance $R$ is usually measured prior to operation of the filter. Measuring the measurement error covariance $R$ is generally practical (possible) because we need to be able to measure the process anyway (while operating the filter), so we should generally be able to take some off-line sample measurements in order to determine the variance of the measurement noise. The determination of the process noise covariance $Q$ is generally more difficult as we typically do not have the ability to directly observe the process we are estimating. Sometimes a relatively simple (poor) process model can produce acceptable results if one "injects" enough uncertainty into the process via the selection of $Q$. Certainly in this case one would hope that the process measurements are reliable. In either case, whether or not we have a rational basis for choosing the parameters, oftentimes superior filter performance (statistically speaking) can be obtained by tuning the filter parameters $Q$ and $R$. The tuning is usually performed off-line, frequently with the help of another (distinct) Kalman filter, in a process generally referred to as system identification.

Time Update (Predict):
(1) Project the state ahead: $\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$
(2) Project the error covariance ahead: $P_k^- = AP_{k-1}A^T + Q$

Measurement Update (Correct):
(1) Compute the Kalman gain: $K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$
(3) Update the error covariance: $P_k = (I - K_k H)P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ seed the loop.

Figure 1-2. A complete picture of the operation of the Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.

In closing we note that under conditions where $Q$ and $R$ are in fact constant, both the estimation error covariance $P_k$ and the Kalman gain $K_k$ will stabilize quickly and then remain constant (see the filter update equations in Figure 1-2). If this is the case, these parameters can be pre-computed by either running the filter off-line, or for example by determining the steady-state value of $P_k$ as described in [Grewal93].


It is frequently the case however that the measurement error (in particular) does not remain constant. For example, when sighting beacons in our optoelectronic tracker ceiling panels, the noise in measurements of nearby beacons will be smaller than that in far-away beacons. Also, the process noise $Q$ is sometimes changed dynamically during filter operation, becoming $Q_k$, in order to adjust to different dynamics. For example, in the case of tracking the head of a user of a 3D virtual environment we might reduce the magnitude of $Q_k$ if the user seems to be moving slowly, and increase the magnitude if the dynamics start changing rapidly. In such cases $Q_k$ might be chosen to account for both uncertainty about the user's intentions and uncertainty in the model.

2 The Extended Kalman Filter (EKF)

The Process to be Estimated

As described above in section 1, the Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by a linear stochastic difference equation. But what happens if the process to be estimated and (or) the measurement relationship to the process is non-linear? Some of the most interesting and successful applications of Kalman filtering have been such situations. A Kalman filter that linearizes about the current mean and covariance is referred to as an extended Kalman filter or EKF.

In something akin to a Taylor series, we can linearize the estimation around the current estimate using the partial derivatives of the process and measurement functions to compute estimates even in the face of non-linear relationships. To do so, we must begin by modifying some of the material presented in section 1. Let us assume that our process again has a state vector $x \in \mathbb{R}^n$, but that the process is now governed by the non-linear stochastic difference equation

$$x_k = f(x_{k-1}, u_{k-1}, w_{k-1}),$$ (2.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = h(x_k, v_k),$$ (2.2)

where the random variables $w_k$ and $v_k$ again represent the process and measurement noise as in (1.3) and (1.4). In this case the non-linear function $f$ in the difference equation (2.1) relates the state at the previous time step $k-1$ to the state at the current time step $k$. It includes as parameters any driving function $u_{k-1}$ and the zero-mean process noise $w_k$. The non-linear function $h$ in the measurement equation (2.2) relates the state $x_k$ to the measurement $z_k$.

In practice of course one does not know the individual values of the noise $w_k$ and $v_k$ at each time step. However, one can approximate the state and measurement vector without them as

$$\tilde{x}_k = f(\hat{x}_{k-1}, u_{k-1}, 0)$$ (2.3)

and

$$\tilde{z}_k = h(\tilde{x}_k, 0),$$ (2.4)

where $\hat{x}_k$ is some a posteriori estimate of the state (from a previous time step $k$).


It is important to note that a fundamental flaw of the EKF is that the distributions (or densities in the continuous case) of the various random variables are no longer normal after undergoing their respective nonlinear transformations. The EKF is simply an ad hoc state estimator that only approximates the optimality of Bayes' rule by linearization. Some interesting work has been done by Julier et al. in developing a variation to the EKF, using methods that preserve the normal distributions throughout the non-linear transformations [Julier96].

The Computational Origins of the Filter

To estimate a process with non-linear difference and measurement relationships, we begin by writing new governing equations that linearize an estimate about (2.3) and (2.4),

$$x_k \approx \tilde{x}_k + A(x_{k-1} - \hat{x}_{k-1}) + Ww_{k-1},$$ (2.5)
$$z_k \approx \tilde{z}_k + H(x_k - \tilde{x}_k) + Vv_k,$$ (2.6)

where $x_k$ and $z_k$ are the actual state and measurement vectors, $\tilde{x}_k$ and $\tilde{z}_k$ are the approximate state and measurement vectors from (2.3) and (2.4), $\hat{x}_k$ is an a posteriori estimate of the state at step $k$, and the random variables $w_k$ and $v_k$ represent the process and measurement noise as in (1.3) and (1.4).

$A$ is the Jacobian matrix of partial derivatives of $f$ with respect to $x$, that is

$$A_{[i,j]} = \frac{\partial f_{[i]}}{\partial x_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

$W$ is the Jacobian matrix of partial derivatives of $f$ with respect to $w$,

$$W_{[i,j]} = \frac{\partial f_{[i]}}{\partial w_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

$H$ is the Jacobian matrix of partial derivatives of $h$ with respect to $x$,

$$H_{[i,j]} = \frac{\partial h_{[i]}}{\partial x_{[j]}}(\tilde{x}_k, 0),$$

$V$ is the Jacobian matrix of partial derivatives of $h$ with respect to $v$,

$$V_{[i,j]} = \frac{\partial h_{[i]}}{\partial v_{[j]}}(\tilde{x}_k, 0).$$

Note that for simplicity in the notation we do not use the time step subscript $k$ with the Jacobians $A$, $W$, $H$, and $V$, even though they are in fact different at each time step.


Now we define a new notation for the prediction error,

$$\tilde{e}_{x_k} \equiv x_k - \tilde{x}_k,$$ (2.7)

and the measurement residual,

$$\tilde{e}_{z_k} \equiv z_k - \tilde{z}_k.$$ (2.8)

Remember that in practice one does not have access to $x_k$ in (2.7); it is the actual state vector, i.e. the quantity one is trying to estimate. On the other hand, one does have access to $z_k$ in (2.8); it is the actual measurement that one is using to estimate $x_k$. Using (2.7) and (2.8) we can write governing equations for an error process as

$$\tilde{e}_{x_k} \approx A(x_{k-1} - \hat{x}_{k-1}) + \epsilon_k,$$ (2.9)
$$\tilde{e}_{z_k} \approx H\tilde{e}_{x_k} + \eta_k,$$ (2.10)

where $\epsilon_k$ and $\eta_k$ represent new independent random variables having zero mean and covariance matrices $WQW^T$ and $VRV^T$, with $Q$ and $R$ as in (1.3) and (1.4) respectively.

Notice that the equations (2.9) and (2.10) are linear, and that they closely resemble the difference and measurement equations (1.1) and (1.2) from the discrete Kalman filter. This motivates us to use the actual measurement residual $\tilde{e}_{z_k}$ in (2.8) and a second (hypothetical) Kalman filter to estimate the prediction error $\tilde{e}_{x_k}$ given by (2.9). This estimate, call it $\hat{e}_k$, could then be used along with (2.7) to obtain the a posteriori state estimates for the original non-linear process as

$$\hat{x}_k = \tilde{x}_k + \hat{e}_k.$$ (2.11)

The random variables of (2.9) and (2.10) have approximately the following probability distributions (see the previous footnote):

$$p(\tilde{e}_{x_k}) \sim N(0, E[\tilde{e}_{x_k}\tilde{e}_{x_k}^T])$$
$$p(\epsilon_k) \sim N(0, WQ_kW^T)$$
$$p(\eta_k) \sim N(0, VR_kV^T)$$

Given these approximations and letting the predicted value of $\hat{e}_k$ be zero, the Kalman filter equation used to estimate $\hat{e}_k$ is

$$\hat{e}_k = K_k \tilde{e}_{z_k}.$$ (2.12)

By substituting (2.12) back into (2.11) and making use of (2.8) we see that we do not actually need the second (hypothetical) Kalman filter:

$$\hat{x}_k = \tilde{x}_k + K_k\tilde{e}_{z_k} = \tilde{x}_k + K_k(z_k - \tilde{z}_k)$$ (2.13)

Equation (2.13) can now be used for the measurement update in the extended Kalman filter, with $\tilde{x}_k$ and $\tilde{z}_k$ coming from (2.3) and (2.4), and the Kalman gain $K_k$ coming from (1.11) with the appropriate substitution for the measurement error covariance.

The complete set of EKF equations is shown below in Table 2-1 and Table 2-2. Note that we have substituted $\hat{x}_k^-$ for $\tilde{x}_k$ to remain consistent with the earlier "super minus" a priori notation, and that we now attach the subscript $k$ to the Jacobians $A$, $W$, $H$, and $V$, to reinforce the notion that they are different at (and therefore must be recomputed at) each time step.

Table 2-1: EKF time update equations.

$$\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$$ (2.14)
$$P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$$ (2.15)

As with the basic discrete Kalman filter, the time update equations in Table 2-1 project the state and covariance estimates from the previous time step $k-1$ to the current time step $k$. Again $f$ in (2.14) comes from (2.3), $A_k$ and $W_k$ are the process Jacobians at step $k$, and $Q_k$ is the process noise covariance (1.3) at step $k$.

Table 2-2: EKF measurement update equations.

$$K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$$ (2.16)
$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$$ (2.17)
$$P_k = (I - K_k H_k) P_k^-$$ (2.18)

As with the basic discrete Kalman filter, the measurement update equations in Table 2-2 correct the state and covariance estimates with the measurement $z_k$. Again $h$ in (2.17) comes from (2.4), $H_k$ and $V_k$ are the measurement Jacobians at step $k$, and $R_k$ is the measurement noise covariance (1.4) at step $k$. (Note we now subscript $R$, allowing it to change with each measurement.)

The basic operation of the EKF is the same as the linear discrete Kalman filter, as shown in Figure 1-1. Figure 2-1 below offers a complete picture of the operation of the EKF, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.
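As a concrete illustration of Tables 2-1 and 2-2, the Python/NumPy sketch below implements one EKF cycle. Two simplifications are our own, not the report's: the Jacobians $A_k$ and $H_k$ are approximated by finite differences rather than derived analytically, and the noise Jacobians $W_k$ and $V_k$ are taken to be identity, so the term $V_k R_k V_k^T$ reduces to $R$. Treat it as a sketch under those assumptions:

```python
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x (an approximation;
    the report assumes analytically derived Jacobians)."""
    fx = func(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - fx) / eps
    return J

def ekf_step(x, P, z, u, f, h, Q, R):
    """One EKF predict/correct cycle, equations (2.14)-(2.18),
    with W_k = V_k = I for simplicity. Here f(x, u) plays the role
    of f(x, u, 0) and h(x) the role of h(x, 0)."""
    # Time update, equations (2.14)-(2.15)
    x_prior = f(x, u)
    A = jacobian(lambda s: f(s, u), x)           # process Jacobian A_k
    P_prior = A @ P @ A.T + Q

    # Measurement update, equations (2.16)-(2.18)
    H = jacobian(h, x_prior)                     # measurement Jacobian H_k
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (z - h(x_prior))
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```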


Time Update (Predict):
(1) Project the state ahead: $\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$
(2) Project the error covariance ahead: $P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$

Measurement Update (Correct):
(1) Compute the Kalman gain: $K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$
(3) Update the error covariance: $P_k = (I - K_k H_k) P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ seed the loop.

Figure 2-1. A complete picture of the operation of the extended Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.

An important feature of the EKF is that the Jacobian $H_k$ in the equation for the Kalman gain $K_k$ serves to correctly propagate or "magnify" only the relevant component of the measurement information. For example, if there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, the Jacobian $H_k$ affects the Kalman gain so that it only magnifies the portion of the residual $z_k - h(\hat{x}_k^-, 0)$ that does affect the state. Of course if over all measurements there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, then as you might expect the filter will quickly diverge. In this case the process is unobservable.

3 A Kalman Filter in Action: Estimating a Random Constant

In the previous two sections we presented the basic form for the discrete Kalman filter, and the extended Kalman filter. To help in developing a better feel for the operation and capability of the filter, we present a very simple example here. Andrew Straw has made available a Python/SciPy implementation of this example at http://www.scipy.org/Cookbook/KalmanFiltering (valid link as of July 24, 2006).

The Process Model

In this simple example let us attempt to estimate a scalar random constant, a voltage for example. Let's assume that we have the ability to take measurements of the constant, but that the measurements are corrupted by a 0.1 volt RMS white measurement noise (e.g. our analog to digital converter is not very accurate). In this example, our process is governed by the linear difference equation

$$x_k = Ax_{k-1} + Bu_{k-1} + w_k = x_{k-1} + w_k,$$

with a measurement $z \in \mathbb{R}^1$ that is

$$z_k = Hx_k + v_k = x_k + v_k.$$

The state does not change from step to step, so $A = 1$. There is no control input, so $u = 0$. Our noisy measurement is of the state directly, so $H = 1$. (Notice that we dropped the subscript $k$ in several places because the respective parameters remain constant in our simple model.)

The Filter Equations and Parameters

Our time update equations are

$$\hat{x}_k^- = \hat{x}_{k-1},$$
$$P_k^- = P_{k-1} + Q,$$

and our measurement update equations are

$$K_k = P_k^-(P_k^- + R)^{-1} = \frac{P_k^-}{P_k^- + R},$$ (3.1)
$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - \hat{x}_k^-),$$
$$P_k = (1 - K_k)P_k^-.$$

Presuming a very small process variance, we let $Q = 1\mathrm{e}{-5}$. (We could certainly let $Q = 0$, but assuming a small but non-zero value gives us more flexibility in "tuning" the filter as we will demonstrate below.) Let's assume that from experience we know that the true value of the random constant has a standard normal probability distribution, so we will "seed" our filter with the guess that the constant is 0. In other words, before starting we let $\hat{x}_{k-1} = 0$.

Similarly we need to choose an initial value for $P_{k-1}$, call it $P_0$. If we were absolutely certain that our initial state estimate $\hat{x}_0 = 0$ was correct, we would let $P_0 = 0$. However given the uncertainty in our initial estimate $\hat{x}_0$, choosing $P_0 = 0$ would cause the filter to initially and always believe $\hat{x}_k = 0$. As it turns out, the alternative choice is not critical. We could choose almost any $P_0 \neq 0$ and the filter would eventually converge. We'll start our filter with $P_0 = 1$.
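The whole experiment fits in a few lines of Python/NumPy. This sketch is in the spirit of the SciPy Cookbook implementation cited above, but the variable names and the random seed are our own:

```python
import numpy as np

np.random.seed(0)                 # arbitrary seed, for repeatability
n = 50                            # number of iterations
x_true = -0.37727                 # the true random constant
Q, R = 1e-5, 0.01                 # process and measurement variances

z = x_true + np.sqrt(R) * np.random.randn(n)   # pre-generated measurements

xhat, P = 0.0, 1.0                # initial estimates: x0 = 0, P0 = 1
estimates = []
for k in range(n):
    # Time update: x_k^- = x_{k-1}, P_k^- = P_{k-1} + Q
    x_prior, P_prior = xhat, P + Q
    # Measurement update, equation (3.1) and the two updates after it
    K = P_prior / (P_prior + R)
    xhat = x_prior + K * (z[k] - x_prior)
    P = (1 - K) * P_prior
    estimates.append(xhat)

print(xhat, P)   # estimate near -0.37727; P settles to a few 1e-4 (Volts^2)
```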


The Simulations

To begin with, we randomly chose a scalar constant $x = -0.37727$ (there is no "hat" on the $x$ because it represents the "truth"). We then simulated 50 distinct measurements $z_k$ that had error normally distributed around zero with a standard deviation of 0.1 (remember we presumed that the measurements are corrupted by a 0.1 volt RMS white measurement noise). We could have generated the individual measurements within the filter loop, but pre-generating the set of 50 measurements allowed us to run several simulations with the same exact measurements (i.e. same measurement noise) so that comparisons between simulations with different parameters would be more meaningful.

In the first simulation we fixed the measurement variance at $R = (0.1)^2 = 0.01$. Because this is the "true" measurement error variance, we would expect the "best" performance in terms of balancing responsiveness and estimate variance. This will become more evident in the second and third simulation. Figure 3-1 depicts the results of this first simulation. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve.

Figure 3-1. The first simulation: $R = (0.1)^2 = 0.01$. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve.

When considering the choice for $P_0$ above, we mentioned that the choice was not critical as long as $P_0 \neq 0$ because the filter would eventually converge. Below in Figure 3-2 we have plotted the value of $P_k^-$ versus the iteration. By the 50th iteration, it has settled from the initial (rough) choice of 1 to approximately 0.0002 (Volts²).


Figure 3-2. After 50 iterations, our initial (rough) error covariance choice of $P_k^- = 1$ has settled to about 0.0002 (Volts²).

In section 1 under the topic "Filter Parameters and Tuning" we briefly discussed changing or "tuning" the parameters $Q$ and $R$ to obtain different filter performance. In Figure 3-3 and Figure 3-4 below we can see what happens when $R$ is increased or decreased by a factor of 100 respectively. In Figure 3-3 the filter was told that the measurement variance was 100 times greater (i.e. $R = 1$) so it was "slower" to believe the measurements.

Figure 3-3. Second simulation: $R = 1$. The filter is slower to respond to the measurements, resulting in reduced estimate variance.

In Figure 3-4 the filter was told that the measurement variance was 100 times smaller (i.e. $R = 0.0001$) so it was very "quick" to believe the noisy measurements.


Figure 3-4. Third simulation: $R = 0.0001$. The filter responds to measurements quickly, increasing the estimate variance.

While the estimation of a constant is relatively straight-forward, it clearly demonstrates the workings of the Kalman filter. In Figure 3-3 in particular the Kalman "filtering" is evident as the estimate appears considerably smoother than the noisy measurements.


References

Brown92: Brown, R. G. and P. Y. C. Hwang. 1992. Introduction to Random Signals and Applied Kalman Filtering, Second Edition. John Wiley & Sons, Inc.

Gelb74: Gelb, A. 1974. Applied Optimal Estimation. MIT Press, Cambridge, MA.

Grewal93: Grewal, Mohinder S., and Angus P. Andrews. 1993. Kalman Filtering: Theory and Practice. Upper Saddle River, NJ: Prentice Hall.

Jacobs93: Jacobs, O. L. R. 1993. Introduction to Control Theory, 2nd Edition. Oxford University Press.

Julier96: Julier, Simon and Jeffrey Uhlmann. "A General Method of Approximating Nonlinear Transformations of Probability Distributions," Robotics Research Group, Department of Engineering Science, University of Oxford [cited 14 November 1995]. Available from http://www.robots.ox.ac.uk/~siju/work/publications/Unscented.zip. Also see: "A New Approach for Filtering Nonlinear Systems" by S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte, Proceedings of the 1995 American Control Conference, Seattle, Washington, pages 1628-1632, available from http://www.robots.ox.ac.uk/~siju/work/publications/ACC95_pr.zip, and Simon Julier's home page at http://www.robots.ox.ac.uk/~siju/.

Kalman60: Kalman, R. E. 1960. "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME, Journal of Basic Engineering, pp. 35-45 (March 1960).

Lewis86: Lewis, Richard. 1986. Optimal Estimation with an Introduction to Stochastic Control Theory. John Wiley & Sons, Inc.

Maybeck79: Maybeck, Peter S. 1979. Stochastic Models, Estimation, and Control, Volume 1. Academic Press, Inc.

Sorenson70: Sorenson, H. W. 1970. "Least-Squares Estimation: from Gauss to Kalman," IEEE Spectrum, vol. 7, pp. 63-68, July 1970.


Robust Visual Tracking from Dynamic Control of Processing

Alban Caporossi, Daniela Hall, Patrick Reignier and James L. Crowley
PRIMA-GRAVIR, INRIA Rhône-Alpes, Montbonnot, France

Abstract
This paper presents a robust tracking system that employs a supervisory controller to dynamically control the selection of processing modules and the parameters used for processing. This system employs multiple pixel level detection operations to detect and track blobs at video rate. Groups of blobs can be interpreted as related components of objects during an interpretation phase. A central supervisor is used to adapt processing parameters so as to maintain reliable real time tracking. System performance is demonstrated on the PETS 04 data set.


1. Introduction
This paper presents an architecture for robust on-line tracking and interpretation of video streams. The system is based on a real time process managed by a supervisory controller. During each cycle, target blobs are observed and updated using simple pixel level detection processes. Detection procedures are then specified in a number of detection regions to detect new blobs. An evaluation phase is used to assess system performance and to adapt processing so as to maintain both reliability and real time (video rate) processing. An interpretation phase is then run to interpret groups of blobs as more abstract objects. Performance for this system is illustrated using the PETS 04 data set.

The paper starts with an overview of the system architecture. Section 3 describes the underlying principle of the core modules followed by technical details of the implementation. Section 4 describes a method for automatic adaption of the parameters necessary for the tracking system. The flexibility of the architecture is demonstrated in section 5. Section 6 evaluates the performance of this system on the PETS 04 data sets.

Figure 1. Visual tracking using a central supervisor architecture with core modules enables the flexible plug-in of higher level modules.

2. Architecture

Figure 1 shows the system architecture. The core of the tracking system is composed of a supervisor, a target initialisation module (Detection Region) and a tracking module (Target Observation). These modules are detailed in section 3. The supervisor acts as a process scheduler, sequentially executing modules in a cyclic process. Each cycle begins by acquiring the current image from an image buffering system (video demon). For each image, targets are tracked and new targets are detected. The supervisor enables a flexible integration of several modules. During each cycle, for each target, the supervisor can call additional modules for analysis and interpretation as needed. During each cycle, the currently listed image processing operation for each target is applied to the target's region of interest. In this way, the appropriate image processing procedure can be changed and new image processing procedures can be added without changing the existing system architecture. Section 5 shows examples of this flexible architecture by adding modules for head and hand tracking, for eye detection and tracking, and for general target identification.

(This research is supported by IST-CAVIAR 2001 37540.)

Figure 2. Target tracking by background differencing. The central person is tracked using all pixels whereas the two other persons are tracked using every second pixel.

Three targets can be clearly identified. Notice that the center target appears as solid white, while the adjacent targets appear to be "hashed". This is the result of an optimization that allows the processing to be applied to every Nth pixel. In this example, the two adjacent regions were processed with N = 2, while the center target was processed with N = 1. N is determined dynamically during each cycle by the process supervisor.

The position and extent of a target are determined by the moments of the detected pixels in the difference image $I_d$ within the ROI. The center of gravity (or first moment) gives the position of a target. The covariance (or second moment) determines the spatial extent, and can be used to determine width, height, and slant of a target. These parameters also provide the target's search region in the next image.

Chrominance information can be used to provide probabilistic detection of targets. The intensity for each RGB color pixel within a ROI is normalized to separate chrominance from luminance:

$$r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}$$ (2)

3. The tracking system

In this section, we describe the theoretical aspects and the details of the actual implementation of the core tracking system.

3.1 Energy detection

Currently, targets can be detected by energy measurements based on background subtraction or intensity normalized color histograms. The background subtraction module computes a difference image $I_d$ from the current frame $I = (I_{red}, I_{green}, I_{blue})$ and the background image $B = (B_{red}, B_{green}, B_{blue})$:

$$I_d = \frac{1}{3}\left(|I_{red} - B_{red}| + |I_{green} - B_{green}| + |I_{blue} - B_{blue}|\right)$$

These color components have the property to be robust to intensity variations [6]. The probability that a pixel takes on a particular color can be represented as a histogram of (r, g) values. The histogram $h_T$ of chrominance values for a target $T$ provides an estimate of the probability of a chrominance vector (r, g) given the target, $p(r, g|T)$. The histogram of chrominance for all pixels, $h_{total}$, gives the global probability $p(r, g)$ of encountering a chrominance among the pixels. The probability of a target is the number of pixels of the target divided by the total number of pixels. Putting these values into Bayes' rule shows that an estimate of the probability of the target for each pixel can be obtained by evaluating the ratio of the target histogram divided by the global histogram:

$$p(T|r, g) = \frac{p(r, g|T)\,p(T)}{p(r, g)} \approx \frac{h_T(r, g)}{h_{total}(r, g)}$$ (3)
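The detection of equations (2) and (3) is straightforward to sketch in Python/NumPy. The bin count and the small epsilon guarding against empty histogram bins are our choices, not the paper's:

```python
import numpy as np

def chrominance(img):
    """Normalized (r, g) chrominance of an RGB image, equation (2)."""
    s = img.sum(axis=2, keepdims=True).clip(min=1)   # avoid division by zero
    return img[..., 0] / s[..., 0], img[..., 1] / s[..., 0]

def probability_map(img, h_target, h_total, bins=32):
    """Per-pixel target probability via the histogram ratio, equation (3).

    h_target, h_total: (bins, bins) histograms of raw (r, g) counts for the
    target and for all pixels; their ratio approximates p(T | r, g)."""
    r, g = chrominance(img.astype(float))
    ri = np.minimum((r * bins).astype(int), bins - 1)
    gi = np.minimum((g * bins).astype(int), bins - 1)
    return h_target[ri, gi] / (h_total[ri, gi] + 1e-9)
```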

The background image B is updated with each frame using a weighted averaging technique, with a strong weight applied to the previous background, and a small weight applied to the current image:

$$B_t(i,j) = \begin{cases} \alpha I_t(i,j) + (1-\alpha)B_{t-1}(i,j), & (i,j) \in bg \\ B_{t-1}(i,j), & \text{else} \end{cases}$$ (1)

This procedure constitutes a simple first order recursive filter along the time axis for each pixel. The background image is only updated for those pixels that do not belong to one of the target ROIs. Figure 2 shows an example of target tracking by background subtraction. The right image represents the background difference image $I_d$ after processing of three ROIs.

For each image, a probability map, $I_p$, can be created by evaluating the ratio of histograms for each pixel in the image. Figure 3 shows an example of face detection using a ratio of chrominance histograms. The bottom image displays the probability map $I_p$. The probability map is only evaluated within the search region provided by the Kalman filter in order to increase processing speed.

Figure 3. Target detection by normalized color histogram.

A common problem in both background subtraction and histogram detection is spatial outliers. In order to increase the stability of target localization, we suppress the contribution of outliers using a method proposed by Schwerdt in [5]. With this method, the probability image $I_p$ is multiplied by a Gaussian weighting function centered at the predicted target position. This corresponds to a filtering by a strong positional prior. The effect is that spatial outliers lose their influence on position and extent as a function of distance from the predicted Gaussian. In order to save computation time, this operation is performed only within the region of interest R of each target:

$$I_p'(i,j) = \begin{cases} I_p(i,j)\,G(\mu, \Sigma), & (i,j) \in R \\ 0, & \text{else} \end{cases}$$ (4)

where

$$G(x; \mu, \Sigma) = e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$ (5)

The center of gravity $\mu = [x_t^-, y_t^-]^T$ is the Kalman prediction of the target location. The spatial covariance $\Sigma$ reflects the size of the target as well as the growing uncertainty about the current target size and location. Even for small regions of interest this operation stabilizes the estimated position and extent of targets. The same principle can be applied to the background difference $I_d$.

3.2 Tracking process

The tracking system is a form of Kalman filter [7]. The state vector for each target is composed of position and velocity. The current target state vector $x_{t-1}$ is used to make a new prediction according to:

$$x_t^- = \Phi_t x_{t-1}, \quad \text{with} \quad \Phi_t = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix}$$ (6)

and $\Delta t$ the time difference between two iterations. From the new position measurement $z_t$, the estimation update is carried out:

$$x_t = x_t^- + K_t(z_t - H_t x_t^-)$$ (7)

This relation is important for balancing the estimation between measurement and prediction with the Kalman gain $K_t$. The estimated precision is a diagonal covariance matrix

$$P_t = \begin{pmatrix} \sigma_{xx}^2 & 0 & 0 & 0 \\ 0 & \sigma_{yy}^2 & 0 & 0 \\ 0 & 0 & \sigma_{v_x v_x}^2 & 0 \\ 0 & 0 & 0 & \sigma_{v_y v_y}^2 \end{pmatrix}$$ (8)

and is predicted by:

$$P_t^- = \Phi_{t-1} P_{t-1} \Phi_{t-1}^T + Q_{t-1}$$ (9)

where $Q_{t-1}$ is the covariance matrix of the prediction error, which represents the growth of the uncertainty in the current target parameters.

3.3 The core modules

The tracking process has been implemented in the ImaLab environment [4]. This environment allows real-time processing of frames extracted from the video stream. The basic tracking system is composed of two modules: TargetObservation predicts for each target the position in the current frame by a Kalman filter and then computes its real position by background subtraction or color histogram detection. DetectionRegion detects new targets by analysing the energy (background differencing or color histogram) within several manually defined detection regions. Figure 1 shows the system architecture. Both core modules can be instantiated to use either background differencing or color histogram. For the PETS 04 experiments, we use tracking based on background subtraction.
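To make the predict/update cycle used by TargetObservation concrete, here is a minimal Python/NumPy sketch for a single target with state $[x, y, v_x, v_y]$, following the structure of equations (6)-(9) with a position-only measurement. The numeric noise values are placeholders for illustration, not the paper's tuned parameters:

```python
import numpy as np

def predict(x, P, dt, Q):
    """Prediction step, equations (6) and (9): constant-velocity model."""
    Phi = np.array([[1, 0, dt, 0],
                    [0, 1, 0, dt],
                    [0, 0, 1,  0],
                    [0, 0, 0,  1]], dtype=float)
    return Phi @ x, Phi @ P @ Phi.T + Q

def update(x_prior, P_prior, z, R):
    """Estimation update, equation (7), measuring position only."""
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)      # Kalman gain K_t
    x = x_prior + K @ (z - H @ x_prior)
    P = (np.eye(4) - K @ H) @ P_prior
    return x, P

# Placeholder noise settings (illustrative, not the paper's values)
Q = np.eye(4) * 0.1
R = np.eye(2) * 4.0
x, P = np.array([10., 20., 0., 0.]), np.eye(4) * 100.0
x_prior, P_prior = predict(x, P, dt=1.0, Q=Q)
x, P = update(x_prior, P_prior, z=np.array([12., 21.]), R=R)
```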

3.4 Target initialization module

Detection regions are image regions where new targets can appear. Restricting detection of new targets to such regions allows the system to reduce the overall computing time. As a side effect, the use of detection regions also provides a reduction in the number of spurious false detections by avoiding detection in unlikely regions, but targets might be missed when the detection regions are not chosen appropriately. For each scenario a different set of detection regions is determined. Currently, these regions are selected by hand. An automatic algorithm appears to be relatively easy to imagine.

New targets are initialized automatically by analysing the detection regions in each tracking cycle. This analysis is done in two steps. In the first step, the subregion which is occupied by the new target is determined by creating a 1-dimensional histogram along the long axis of the detection region. The limits of the target subregion are characterized by an interval, $[R_{min}, R_{max}]$, whose values of the one dimensional histogram are above a noise threshold (see Figure 4). In the second phase, the energy density within the so specified subregion R is computed as

$$e_R = \frac{1}{|R|} \sum_{(i,j) \in R} I_d(i,j)$$ (10)

with $|R|$ the number of pixels of R. A new target with mean $\mu_R$ and covariance $\Sigma_R$ is initialised when the measured energy density $e_R$ exceeds a threshold. This approach has the advantage that targets can be detected independently of the size of the detection region.

Figure 4. Initialisation of a new target from the background difference of a detection region and its 1-dimensional energy histogram.

3.5 Tracking module

The module TargetObservation implements the target tracking. The supervisor maintains a list of current targets. Targets of this list are sequentially updated by the supervisor depending on the feedback of the modules. For each target, a new position is predicted by a first order Kalman filter. This prediction determines a search region within which the target is expected to be found. A target is found by applying the specified detection operation to the search region. If the average target detection energy is above a threshold, the target observation vector is updated. This module depends on the following parameters:

Detection energy threshold: this represents the average energy threshold validating the target existence.

Sensitivity threshold: this parameter thresholds the energy image ($I_d$ in case of background differencing or $I_p$ in case of chrominance detection). If the value is 0, the raw data of the energy image is used.

Target area threshold: a step size parameter N enables faster processing for large targets by processing only 1 out of N pixels. When the target surface is larger than a threshold, N is increased. This temporary measure will be replaced by a more sophisticated control logic based on computing time. Figure 2 illustrates the use of this parameter.

3.6 Split and merge of targets

In real world video sequences, especially in the domain of video surveillance, it often happens that targets come together, move in the same direction for a while, and then separate. It can also occur that close targets occlude each other. In that case only one target is visible at a time, but both targets are still present in the scene. To solve such problems, we use a method that allows merging and splitting of targets. This method enables us to keep track of occluded targets and also to model common behavior of a target group. The PETS 04 sequences contain many examples of such group behavior.

A straightforward approach is applied for the detection of target split and merge. Merging of two targets that are within a certain distance from each other is detected by evaluating the following inequality (a sketch of this test appears below):

$$c/(a + b) < \text{threshold}$$ (11)

where c is the distance between the gravity centers of both targets, and a and b are the distances between the center of gravity and the boundary of the ellipse defined by the covariance of the respective target (see Figure 5 (left)). In our implementation we use a threshold of 0.8.
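The merge test of equation (11) can be sketched in a few lines of Python/NumPy. Since the paper does not state the scale of the covariance ellipse it uses, we assume the 1-sigma ellipse, and the distances a and b are measured along the line joining the two centers of gravity:

```python
import numpy as np

def boundary_distance(cov, u, k=1.0):
    """Distance from the center to the k-sigma covariance ellipse
    along unit direction u (we assume k = 1; the paper does not
    state the ellipse scale it uses)."""
    return k / np.sqrt(u @ np.linalg.inv(cov) @ u)

def should_merge(mu1, cov1, mu2, cov2, threshold=0.8):
    """Merge test of equation (11): c / (a + b) < threshold."""
    d = mu2 - mu1
    c = np.linalg.norm(d)              # distance between gravity centers
    u = d / c                          # unit vector joining the centers
    a = boundary_distance(cov1, u)     # ellipse extent of target 1
    b = boundary_distance(cov2, u)     # ellipse extent of target 2
    return c / (a + b) < threshold

mu1, cov1 = np.array([10., 10.]), np.diag([9., 4.])
mu2, cov2 = np.array([13., 10.]), np.diag([4., 4.])
print(should_merge(mu1, cov1, mu2, cov2))   # True: 3 / (3 + 2) = 0.6 < 0.8
```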

Splitting of targets is implemented by placing detection regions around the target as shown in Figure 5 (right). The size and location of the split detection regions are proportional to the target size. Within each split detection region, the average energy is evaluated in the same way as in the target initialisation module. A new target is created if this average energy is greater than the threshold u = energy density × split coefficient. The parameter split coefficient controls the constraints for target splitting.

Figure 5. (left) Merging of targets as a function of the target relative position and size. (right) Splitting detectors are defined proportionally to the target size.

4. Automatic parameter adaption

Target initialization and tracking by background differencing or histogram detection requires a certain number of parameters, as mentioned in the previous sections (detection energy threshold, sensitivity, density energy threshold, α, split coefficient, area threshold). In order to preserve the re-usability of the tracking module and guarantee good performance in a wide range of different tracking scenarios, it is crucial to have a good parameter setting at hand. Up to now, parameter adaption is done manually. This is a very tedious job which might need frequent repetition when the scene setup has changed. In this section we propose a first attempt at a module that automatically finds a good parameter setting. As a first step, we consider the tracker as a classical system with control parameters and noise perturbations (see Figure 6). The system produces an output y(t) that depends on the input r(t), some noise d(t), and a set of parameters that affect the control module K [1].

Figure 6. A controlled system.

4.1 Algorithm

First we need to explore the effect of particular parameters on the system. The goal of this step is to identify the important parameters and their relations, and eventually discard parameters with little effect. For a sequence for which the ground truth r(t) is available, we vary the parameters systematically and measure the output of the system, $y_{P_k}(t)$, for a particular parameter setting $P_k$ in the parameter space P. $y_{P_k}(t)$ and r(t) are split into m sections according to m intervals $s_i = [t_{i-1}, t_i]$, $i = 1, \ldots, m$. For each parameter setting $P_k$ and each interval, $r(s_i)$ and $y_{P_k}(s_i)$ are known. From these input/output correspondences we can compute the transfer function $f(y_{P_k}(s_i)) = r(s_i)$ by a least squares approximation. The overall error of the transfer function on the sequence is computed as follows:

$$\epsilon = \|r(t) - f(y_{P_k}(t))\| = \sum_{s_i} \|r(s_i) - f(y_{P_k}(s_i))\|$$ (12)

For each $P_k$, we determine the transfer function that minimizes this error. The average error ($\bar{\epsilon} = \epsilon/n$, with n the number of frames) is used to characterize the performance of the system with the current parameter setting. This is a very coarse approximation, but as we will see, the average error evolves smoothly over the parameter space. We consider polynomial transfer functions of first and second order (linear and quadratic) of the following form:

$$r(t_k) = A_0 y(t_k) + b$$ (13)
$$r(t_k) = A_2 (y(t_k))^2 + A_1 y(t_k) + b$$ (14)

with transfer matrices $A_i$ and offset b. The measurements have either two or four dimensions. In the two dimensional case, the measurements contain the coordinates of the center of gravity of the target. The four dimensional case also contains the height and width of the target bounding box. We could have considered an additional dimension for the target slant, but we discarded this possibility due to the discontinuity of the slant measurement at 180°.

The linear transfer function estimated from the data of the sequences Walk1.mpeg and Walk3.mpeg produces good results. We observe a transfer matrix $A_0$ that is close to identity. The quadratic transfer function has a smaller $\bar{\epsilon}$, but the transfer matrix $A_2$ has very low values and is therefore not significant. This means that the linear transfer function is a good model for our system.

4.2 Exploration of the parameter space

The average error of the best transfer function evaluated on the entire test sequence is used to characterize the performance of the controlled system. The parameter space can be very high dimensional. Therefore, exploring the entire space can be time consuming. To cope with this problem we assume that some parameters evolve independently from each other. This allows us to restrict the search for an optimal parameter value to a low dimensional hyperspace. In the experiments we use the following default values for the constant parameters of the hyperspace: detection energy = 10, density = 15, sensitivity = 20, split coefficient = 2.0, α = 0.001, area threshold = 1500. We experiment on sequence Walk1.mpeg except for Figure 7.

Figure 7 shows the surface produced by varying the detection energy threshold and the sensitivity threshold simultaneously. Figure 8 shows the error evolution by varying the split coefficient and the sensitivity. The optimal parameter value is different for each sequence. This means that the parameters are sequence dependent. In all cases the error evolves smoothly. This means that we are dealing with a controlled system and not with a system following chaotic or arbitrary rules. Figure 9 (left) provides evidence to set α = 0.1. Figure 9 (right) shows that the density threshold has no effect on the average error. This parameter is therefore a candidate that need not be considered for further exploration of the parameter space. Figure 10 shows the effect of the parameter area threshold. This parameter treats one pixel out of two for targets that are larger than area threshold pixels. This explains the increase of the error for small thresholds and the speed up in processing time. It is interesting to see that the error increase is very small: less than 4% error increase for a 25% gain in processing time. Our method allows us to identify this kind of relation between parameters.

4.3 Summary

We have shown a method to evaluate the performance of a system controlled by a set of parameters. The average error is used to understand the effect of single parameters and parameter pairs. This method allows us to verify that our tracking system has a controlled behavior. We identified that the density parameter has no effect on the error performance and it can be removed from the parameter space. The area threshold parameter influences the overall processing time and the average error. With our method, we found that the increase in error is small with respect to the gain in processing time. This is an interesting result which a dynamic control system should take into account. The experiments show that the optimal parameter setting estimated from one sequence scenario need not be optimal for another sequence. This needs to be explored by evaluating more data sequences. Another important point is that the approach requires ground truth labelling. This means that our method can not find the optimal parameters when the ground truth is unknown. Likelihood may be appropriate in some cases to replace the ground truth, but the results will be inferior since the likelihood increases the noise perturbations.
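A minimal sketch of the linear transfer-function fit of equation (13), together with a per-section average error in the spirit of equation (12), is shown below. The data here is synthetic and purely illustrative, since the real inputs would be tracker outputs and PETS 04 ground truth:

```python
import numpy as np

def fit_linear_transfer(Y, Rt):
    """Least-squares fit of r = A0 y + b, equation (13).
    Y, Rt: (m, d) arrays of tracker outputs and ground truth per section."""
    m = Y.shape[0]
    X = np.hstack([Y, np.ones((m, 1))])      # augment with an offset column
    # Solve X @ C ~= Rt in the least-squares sense; C stacks A0^T over b
    C, *_ = np.linalg.lstsq(X, Rt, rcond=None)
    A0, b = C[:-1].T, C[-1]
    return A0, b

def average_error(Y, Rt, A0, b):
    """Average per-section error of the fitted transfer function (cf. (12))."""
    residual = Rt - (Y @ A0.T + b)
    return np.linalg.norm(residual, axis=1).sum() / len(Y)

# Synthetic 2-D example (center-of-gravity coordinates only)
rng = np.random.default_rng(1)
Rt = rng.uniform(0, 100, size=(40, 2))            # ground truth sections
Y = Rt * 1.02 + 3.0 + rng.normal(0, 1, (40, 2))   # near-identity response
A0, b = fit_linear_transfer(Y, Rt)
print(average_error(Y, Rt, A0, b))                # small: A0 close to identity
```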

5. Tracking: optional higher level modules

In this section we demonstrate the flexibility of our tracking system. The proposed architecture enables easy plug-in of higher level modules, which enables the system to solve quite different tasks.

5.1. Face and hand tracking for human computer interaction

Modules for face and hand tracking use color histogram detection. Face and hands are initialised automatically with respect to a body detected by background differencing. This means that the same tracking principle is applied to faces and hands at a higher level. An example is shown in Figure 11.

Figure 11. Modules for face and hand observation are plugged into the tracking system.

5.2. Eye detection for head pose estimation

This module detects facial features by evaluating the response to receptive field clusters [2]. The method detects facial features robust to scale, lighting variation, person and head pose. The tracking system provides the precise face location, which allows the combined system to run in real time. Figure 12 shows an example of the eye tracking module.

Figure 7. Evolution of the average error over detection energy threshold and sensitivity threshold (sequences Walk1.mpeg (left) and Walk3.mpeg (right), with default values for the free parameters).

Figure 8. Evolution of the average error over split coefficient and sensitivity threshold.

Figure 9. Evolution with varying alpha (left) and varying density (right). We can identify an optimal value for alpha (α = 0.1), but the error is constant for all density values.

Figure 10. Evolution with varying area threshold (left). The error increases slightly with decreasing area threshold. The area threshold has a significant impact on the processing time (right).

Figure 12. Real-time head pose estimation.

5.3. Agent identification
The agent identification module provides an association between individual features and tracked targets by background subtraction. Identification of each tracked blob is carried out by elastic matching of labelled graphs where the labels are receptive field responses [2]. The degree of correspondence between the model and the observations extracted from the ROI provided by the tracking system is computed by evaluating a cost function. The cost function is a weighted sum of the spatial similarity and the appearance similarity [3, 8]. Figure 13 shows a successful identity recovery after a target occlusion. The system currently processes 10 frames/s.

Figure 13. Example of a split and merge event with successful identity recovery. The matching costs for pers1 and pers2 are shown before the merge (165, 186), at the merge (337, 492), during the occlusion (488, 1470), and at the split (2073, 735).

Table 1. Evaluation of the tracking results with respect to measurement precision.

Average error in | average value   | maximum value
Position         | 6-7 pixels      | 13-15 pixels
Size             | -160% to -240%  | -240%
Orientation      | 0.5%            | 30%
Entry time lag   | 50 to 80 frames | 100 to 160 frames
Exit time lag    | 1 frame         | 1 frame

Figure 14. True versus False detections for individuals

6. Tracking performance of the core modules


In order to evaluate the performance of our tracking system, we have tested the core modules on 16 of the PETS 04 sequences (17182 frames containing 50404 targets marked by bounding boxes)1. In this section we give a brief summary of the tracking results. Figure 14 shows the receiver operator curve for all 16 sequences. Our system has a low false detection probability of 9.8% and a true detection probability of 53.6%. This translates to a recall of 53.6% (27030 correct positives out of 50404 total positives) and a precision of 90.2% (27030 correct positives out of 29974 detections). The reason for the relatively low recall is the fact that the ground truth labeling takes into account targets that are already present in the scene and targets that pass on the gallery at the rst oor. Our tracking system relies on the method of detection region for target initialization. Both type of targets are not detected by our tracking system, because they are not initialized. The tracking results are evaluated with respect to other parameters such as errors in detected position, size, and orientation, the time lag of entry and exit. The performance of our system with respect to these parameters is summarized in Table 1. Our system performs very well in position detection, orientation estimation and exit time lag. The bounding box produced by the tracking system is signicantly smaller than the bounding box of the ground truth. This is due to the fact that the tracking system estimates the bounding box from the covariance of the pixels with high energy whereas
1 The sequences as well as the statistics are available at the CAVIAR home page http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm
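As a quick arithmetic check of the figures quoted above, recall and precision can be recomputed directly from the raw counts reported in the text (a sketch; the counts themselves are taken from the paragraph above):

true_positives = 27030
total_targets = 50404   # ground truth targets in the 16 sequences
detections = 29974      # total detections produced by the system

recall = true_positives / total_targets    # -> 0.536 (53.6%)
precision = true_positives / detections    # -> 0.902 (90.2%)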

a human draws a bounding box that includes all pixels that belong to the target. The tracking system can produce a similar output by computing the connected components of the energy image, but this is a costly operation, and when the connected-components bounding box is used for position computation, the position becomes more unstable. For this reason we decided to use the first and second moments of the energy pixels for target specification. The entry time lag is a problem related to the detection region: a human observer marks a new target as soon as it appears, whereas the detection region requires that the observed energy is above the energy density threshold.
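A minimal sketch of the moment-based target specification described above, assuming the energy image is available as a 2-D array; the function name and the scale factor k (box half-width in standard deviations) are illustrative assumptions:

import numpy as np

def bbox_from_moments(energy, threshold, k=2.0):
    """Estimate a bounding box from the first and second moments of
    high-energy pixels, as an alternative to connected components."""
    ys, xs = np.nonzero(energy > threshold)
    mx, my = xs.mean(), ys.mean()   # first moments (centroid)
    sx, sy = xs.std(), ys.std()     # second moments (spread)
    return (mx - k * sx, my - k * sy, mx + k * sx, my + k * sy)

This illustrates why the resulting box is tighter than a hand-drawn one: pixels in the tails of the distribution fall outside the k-sigma window.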

7. Conclusion
We have presented an architecture for a tracking system that consists of a central supervisor, a tracking module based on background subtraction or color histogram detection combined with Kalman filtering, and an automatic target initialization module restricted to detection regions. These three modules form the core system. The central supervisor architecture has the advantage that additional modules can be plugged in very easily; new tracking systems can be created in this way to solve different tasks. The tracking system depends on a number of parameters that influence the performance of the system, so finding a good parameter setting for a particular scenario is essential. We have proposed to consider the tracking system as a classical controlled system and identified a method to evaluate the quality of a particular parameter setting. The preliminary experiments show that small variations of the parameters produce smooth changes of the average error function. Using this behavior, we can improve the performance of our tracking system by finding a good parameter setting using gradient descent in the parameter space. Unfortunately, the experiments on automatic parameter adaptation are preliminary and could not yet be integrated in the performance evaluation of the system.

References
[1] P. de Larminat. Automatique: commande des systèmes linéaires. Hermes Science Publications, 2nd edition, 1996.
[2] D. Hall and J.L. Crowley. Détection du visage par caractéristiques génériques calculées à partir des images de luminance. In Congrès Francophone de Reconnaissance des Formes et Intelligence Artificielle, pages 1365-1373, Toulouse, France, 2004.
[3] M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300-311, March 1993.
[4] A. Lux. The Imalab method for vision systems. In International Conference on Vision Systems, pages 319-327, Graz, Austria, April 2003.
[5] K. Schwerdt and J.L. Crowley. Robust face tracking using color. In International Conference on Automatic Face and Gesture Recognition, pages 90-95, Grenoble, France, March 2000.
[6] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.
[7] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill, 2004.
[8] L. Wiskott, J.M. Fellous, N. Krüger, and C. von der Malsburg. Face Recognition by Elastic Bunch Graph Matching, chapter 11, pages 355-396. Intelligent Biometric Techniques in Fingerprint and Face Recognition. CRC Press, 1999.

Automatic parameter regulation for a tracking system with an auto-critical function


Daniela Hall
INRIA Rhône-Alpes, St. Ismier, France. Email: Daniela.Hall@inrialpes.fr

Abstract: In this article we propose an architecture for a tracking system that can judge its own performance by an auto-critical function. Performance drops can be detected and trigger an automatic parameter regulation module. This regulation module is an expert system that searches for a parameter setting with better performance and returns it to the tracking system. With such an architecture, a robust tracking system can be implemented which automatically adapts its parameters in case of changes in the environmental conditions. This article opens a way to self-adaptive systems in detection and recognition.

[Figure 1 block labels: Supervisor; Time; Robust tracking (target list, prediction, estimation, background detector); Target initialisation (detection region list, detection, background detector).]

I. INTRODUCTION

Parameter tuning of complex systems is often performed manually. A tracking system requires different parameter settings as a function of the environmental conditions and of the type of the tracked targets. Each change in conditions requires a parameter update, so there is a great need for an expert system that performs the parameter regulation automatically. This article proposes such an approach and applies it to a real-time tracking system. The proposed architecture for auto-regulation is valid for any complex system whose performance depends on a set of parameters. Automatic regulation of parameters can significantly enhance the performance of systems for detection and recognition. Surprisingly little previous work has been published in this domain [5].

A first step towards performance optimization is the ability of the system to be auto-critical: the system must be able to judge its own performance. A performance drop, detected with this kind of auto-critical function, can trigger an independent module for auto-regulation. The task of the regulation module is to propose a set of parameters that improves system performance. The auto-critical function detects a performance drop when the measurements diverge with respect to a scene reference model. In this case the automatic regulation module is triggered to provide a parameter setting with better performance.

Section II explains the architecture of the tracking system and the architecture of the regulation cycle. Section III explains the details of the auto-critical function, the generation of the scene reference model and the measure used for performance evaluation. In section IV we explain the use of the regulation module. We then show experiments that demonstrate the utility of our approach, and we finish with conclusions and a critical evaluation.

Fig. 1. Architecture of the tracking and detection system controlled by a supervisor.

II. SYSTEM ARCHITECTURE

In order to demonstrate the utility of our approach for auto-regulation of parameters we choose a detection and tracking system as previously described in [2]. Figure 1 shows the architecture of the system. The tracking system is composed of a central supervisor, a target initialisation module and a tracking module. This modular architecture is flexible, such that competing algorithms for detection can be integrated. For our experiments we use a detection module based on adaptive background differencing using manually defined detection regions. Robust tracking is achieved by a first-order Kalman filter that propagates the target positions in time and updates them with measurements from the detection module. The tracking system depends on a number of parameters, such as the detection energy threshold, the sensitivity for detection, the energy density threshold to avoid false detections due to noise, a temporal parameter for background adaptation, and a split coefficient to enable merging and splitting of targets (i.e., when two people meet they merge into a single group target; a split event is observed when a person separates from the group).

Figure 2 shows the integration of the parameter regulation module and the auto-critical function. The auto-critical function evaluates the current system performance and decides if parameter regulation is necessary. If this is the case, the tracker supervisor sends a request to the regulation module, providing its current parameter setting and current performance as well as other data needed by the regulation module. When the regulation module has found a better parameter setting (or after a maximum number of iterations) it stops processing and sends the result to the system supervisor, which updates the parameters and reinitialises the modules.
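A minimal per-target sketch of such a first-order (constant-velocity) Kalman filter follows; the frame interval dt and the noise covariances Q and R are assumed values for illustration, not those of the actual system:

import numpy as np

dt = 1.0  # frame interval (assumed)

# State x = [px, py, vx, vy]: position and velocity of one target.
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # constant-velocity model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # only position is measured
Q = np.eye(4) * 0.01                          # process noise (assumed)
R = np.eye(2) * 4.0                           # measurement noise (assumed)

def predict(x, P):
    x = A @ x
    P = A @ P @ A.T + Q
    return x, P

def update(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

A track would call predict once per frame and update whenever the detection module supplies a measurement z for that target.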

[Figure 2 block diagram: an input control K feeds the System, which produces output y(t); the auto-critical function evaluates y(t) and decides "Regulation?" (yes/no); on yes, the regulation module returns optimized parameters.]

Fig. 2. Integration of the regulation module in a complex system.

It is difficult to predict the performance gain of the auto-regulation. Since the module can test only a discrete number of parameter settings, there is no guarantee that the globally optimal parameter setting is found. For this reason, the goal of the regulation system is to find a parameter setting that increases system performance; subsequent calls of the regulation module then yield a steadily increasing system performance. The modular architecture enables the use of different methods and allows the regulation to be applied to different kinds of systems.

III. THE AUTO-CRITICAL FUNCTION

The task of the auto-critical function is to provide a fast estimation of the current tracking performance. A performance evaluation function requires a reliable measure to estimate the current system performance. The measure used here (described in Section III-B) is based on a probabilistic model of the scene which allows the likelihood of measurements to be estimated. The probabilistic scene model is generated by a learning approach. Section III-C explains how the quality of a model can be measured. Section III-D discusses different clustering schemes.

A. Learning a probabilistic model of a scene

A model of a scene describes what usually happens in the scene. It describes a set of target positions and sizes, but also a set of paths of the targets within the scene. The model is computed from previously observed data. A valid model allows description of everything that is going to be observed. For this reason we require that the training data is representative of what usually happens in the scene. The ideal model of a scene allows us to decide in a probabilistic manner which measurements are typical and which measurements are unusual. With such a model we can compute the probability of single measurements and of temporal trajectories. Furthermore, we can detect outliers that occur due to measurement errors. The model represents the typical behaviour of the scene, and it enables the system to alert a user when unusual behavior takes place, a feature which is useful for the task of a video surveillance operator.

In this section we describe the generation of a scene reference model which gives rise to a goodness measure that

can compute the likelihood of measurements y(t_i) with respect to the scene reference model. We know that a single mode is insufficient to provide a valid scene description; we need a model with several modes that associate spatially close measurements and provide a locally valid model. The model is computed from data using a static camera.

An important question is which training data should be used to create an initial model. The CAVIAR test case scenarios [4] contain 26 image sequences and hand labelled ground truth. We can use the ground truth to generate an initial model. If the initial model is not sufficient, the model can be refined by adding tracking observations from which the measurements with low probability, which are likely to contain errors, are removed. For the computation of the scene reference model we use the hand labelled data of the CAVIAR data set (42000 bounding boxes), divided into a training and a test set of equal size. The observations consist of spatial measurements y_{spatial}(t_i) = (\mu_x, \mu_y, \sigma_x^2, \sigma_y^2), the first and second moments of the target observation in frame I(t_i). We can extend these to eight-dimensional spatio-temporal measurements y_{spatiotemp}(t_i), formed by concatenating the spatial measurements at subsequent time instants t_i and t_{i-1}. Such measurements have the advantage that we take into account the local motion direction and speed. A trajectory y(t) is a sequence of spatial or spatio-temporal measurements y(t_i). Single measurements are noted as vectors y(t_i), whereas trajectories y(t) are coded as vector lists. The following approach is valid for both types of observed trajectories y(t).

To obtain a multi-modal model we have experimented with two types of clustering methods: k-means and k-means with pruning. K-means requires a fixed number of clusters that must be specified by the user a priori. K-means converges to a local minimum that depends on the initial clusters. These are determined randomly, which means that the algorithm produces different sub-optimal solutions in different runs. To overcome this problem, k-means is run several times with the same parameters. In section III-C we propose a measure to judge the quality of a clustering result; with this measure we select an optimal clustering solution as our scene reference model. The method k-means with pruning is a variation of traditional k-means that produces more stable results due to the subsequent fusion of close clusters. In this variation, k-means is called with a large number of clusters, k in [500, 2000]. Clusters that are close within this solution are subsequently merged, and clusters with few elements are considered as noise and removed. This method is less sensitive to outliers, has the characteristics of a hierarchical clustering scheme, and at the same time can be computed quickly due to the initial fast k-means clustering. Figure 3 illustrates this algorithm.
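A sketch of k-means with pruning under the parameters shown in Figure 3 (merge distance 1.0 and minimum cluster size 4); it assumes scikit-learn's KMeans, enough data points for the initial over-clustering, and the greedy merge is an illustrative simplification:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_with_pruning(X, k=500, merge_dist=1.0, min_elems=4):
    """K-means with pruning: over-cluster, then merge close centers
    and drop tiny clusters (treated as noise)."""
    km = KMeans(n_clusters=k, n_init=1).fit(X)
    centers, labels = km.cluster_centers_, km.labels_
    # Greedily merge clusters whose centers are closer than merge_dist.
    merged = list(range(k))
    for a in range(k):
        for b in range(a + 1, k):
            if np.linalg.norm(centers[a] - centers[b]) < merge_dist:
                merged[b] = merged[a]
    labels = np.array([merged[l] for l in labels])
    # Remove clusters with fewer than min_elems members (noise).
    ids, counts = np.unique(labels, return_counts=True)
    keep = ids[counts >= min_elems]
    # Return each surviving mode as a (mean, covariance) pair.
    return [(X[labels == i].mean(axis=0), np.cov(X[labels == i].T))
            for i in keep]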

B. Evaluating the goodness of a trajectory

A set of Gaussian clusters modelled by mean and covariance is an appropriate representation for the statistical evaluation of measurements; the probability P(y(t_i)|C) can be computed according to equation 2 below. The auto-regulation and auto-critical modules need a measure to judge the goodness of a particular trajectory. A simple goodness score consists of the average probability of the most likely cluster for the single measurements. The goodness G(y(t)) of the trajectory y(t) = (y(t_n), ..., y(t_0)) with length n+1 is computed as follows:

G(y(t)) = \frac{1}{n+1} \sum_{i=0}^{n} \max_k P(y(t_i)|C_k)    (1)

with

P(y(t_i)|C) = P(y(t_i)|\mu; U) = \frac{1}{(2\pi)^{dim/2} |U|^{1/2}} \exp(-0.5 (y(t_i)-\mu)^T U^{-1} (y(t_i)-\mu))    (2)

with mean \mu and covariance U of cluster C. Trajectories have variable length and may consist of several hundred measurements. The proposed goodness score is high for trajectories composed of likely measurements and small for trajectories that contain many unlikely measurements (errors). This measure allows good and bad trajectories to be classified reliably, independently of their particular length. On the other hand, the goodness score does not take into account the sequential structure of the measurements. The sequential structure is an important indicator for the detection of local measurement errors and of errors due to badly adapted parameters. To study the potential of a goodness score that is sensitive to the sequential structure, we propose the following measure:

G_{seq(v)}(y(t)) = \frac{1}{m} \sum_{i=0}^{m-1} \log(P^*(y(s_i)))    (3)

which is the average log likelihood of the dominant term P^*(y(s)) of the probability of a sub-trajectory y(s) of length v. We use the log likelihood because P^*(y(s)) is typically very small. A trajectory y(t) = (y(s_0), y(s_1), ..., y(s_{m-1})) is composed of m sub-trajectories y(s_i) of length v. We develop the measure for v = 3; the measure for any other value of v is developed accordingly. The probability of a sub-trajectory is defined as:

P(y(s_i)) = P(y(t_2), y(t_1), y(t_0)) = P^*(y(s_i)) + r,
P^*(y(s_i)) = P(C_{k_2}|y(t_2)) P(C_{k_1}|y(t_1)) P(C_{k_0}|y(t_0)) P(C_{k_2} \wedge C_{k_1} \wedge C_{k_0})    (4)

P(y(s_i)) is composed of the probability of the most likely path through the modes of the model, P^*(y(s_i)), plus a term r which contains the probability of all other path permutations. Naturally P(y(s_i)) is dominated by P^*(y(s_i)), and r tends to be very small. This is the reason why the final goodness score uses only the dominant term P^*(y(s_i)). P(C_{k_i}|y(t_i)) is computed using Bayes' rule. The prior P(C_k) is set to the ratio |C_k| / \sum_u |C_u|. The normalisation factor P(y(t_i)) is constant. Since we are interested in the maximum likelihood, we compute:

P(C_{k_i}|y(t_i)) = \frac{P(y(t_i)|C_{k_i}) P(C_{k_i})}{P(y(t_i))} \propto \frac{|C_{k_i}|}{\sum_u |C_u|} P(y(t_i)|C_{k_i})    (5)

where |C_{k_i}| denotes the number of elements in C_{k_i} and P(y(t_i)|C_{k_i}) is computed according to equation 2. The joint probability P(C_{k_2} \wedge C_{k_1} \wedge C_{k_0}) is developed according to

P(C_{k_2} \wedge C_{k_1} \wedge C_{k_0}) = P(C_{k_2}|C_{k_1} \wedge C_{k_0}) P(C_{k_1}|C_{k_0}) P(C_{k_0})    (6)

We simplify this equation by assuming a Markov constraint of first order:

P(C_{k_2} \wedge C_{k_1} \wedge C_{k_0}) = P(C_{k_2}|C_{k_1}) P(C_{k_1}|C_{k_0}) P(C_{k_0})    (7)

To compute the conditional probabilities P(C_i|C_j), we construct a transfer matrix from the training set by counting, for each cluster C_i, the number of state changes, and then normalising such that each line of the state matrix sums to 1. The probabilistically inspired sequential goodness score of equation 3 is computed using equations 4 to 7.

[Figure 3 stages: initial k-means clustering; merge clusters whose centers are closer than 1.0; delete clusters with < 4 elements; result: 3 clusters and noise.]

Fig. 3. K-means with pruning. After initial k-means clustering, close clusters are merged and clusters with few elements are assigned to noise.
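A minimal sketch of the simple goodness score of equations 1 and 2, assuming clusters are available as (mean, covariance) pairs, e.g., from the pruning sketch above; the function names are illustrative:

import numpy as np

def gaussian_pdf(y, mu, U):
    """Equation 2: Gaussian density of measurement y under cluster (mu, U)."""
    d = y - mu
    dim = mu.shape[0]
    norm = (2 * np.pi) ** (dim / 2) * np.sqrt(np.linalg.det(U))
    return np.exp(-0.5 * d @ np.linalg.solve(U, d)) / norm

def goodness(traj, clusters):
    """Equation 1: average over the trajectory of the probability of
    the most likely cluster. traj: list of measurement vectors;
    clusters: list of (mu, U) pairs."""
    return np.mean([max(gaussian_pdf(y, mu, U) for mu, U in clusters)
                    for y in traj])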

C. Measuring the quality of the model

K-means clustering is a popular tool for learning and model generation because the user needs to provide only the number of desired clusters [3], [7], [8]. K-means converges quickly to a (locally) optimal solution. Since k-means clustering starts from a number of randomly initialised cluster centers, each run produces a different sub-optimal solution. In cases where the number of clusters is unknown, k-means can be run several times with a varying number of clusters. A difficult problem is to rank the different k-means solutions and select the one that is most appropriate for the task. This section provides a solution to this problem, which is often neglected.

For a particular model (clustering solution) we can compute the probability of a measurement belonging to the model. To ensure that the computed probability is meaningful, the model must be representative. A good model assigns a high probability to a typical trajectory and a low probability to an unusual trajectory. Based on these notions we define an evaluation criterion for measuring the quality of the model. We need a model that is neither too simple nor too complex. The complexity is related to the number of clusters [1]: a high number of clusters tends towards over-fitting, and a low number of clusters provides an imprecise description.

Model quality evaluation requires a positive and a negative example set. Typical target trajectories (positive examples) are provided within the training data. It is more difficult to create a negative example. A negative example trajectory is constructed as follows. First we measure the mean and variance of all training data; this represents the distribution of the data. We can now generate random measurements by drawing from this distribution with a random number generator. The result is a set of random measurements. From the training set, we generate a k-means clustering with a large number of clusters (K = 100). For each random measurement we compute p(y(t_i)|model_{100}). From the original 5000 random measurements we keep the 1200 measurements with the lowest probability. This gives the set of negative examples. Figure 4 shows an example of the positive and negative trajectories as well as the hand labelled ground truth and a multi-modal model obtained by k-means with pruning.

For any positive and negative measurement we compute the probability P(y(t_i)). Classification of the measurements into positive and negative can be obtained by thresholding this value. For a threshold d the classification error can be computed according to equation 8; the optimal threshold d separates positive from negative measurements with a minimum classification error [1]:

P_d(error) = P(x \in R_{bad}, C_{good}) + P(x \in R_{good}, C_{bad}) = \int_0^d p(x|C_{good}) P(C_{good}) dx + \int_d^1 p(x|C_{bad}) P(C_{bad}) dx    (8)

with R_{bad} = [0, d] and R_{good} = [d, 1]. We search for the optimal threshold d such that P_d(error) is minimised. We operate on a histogram with logarithmic scale, which has the advantage that the distribution of lower values is sampled more densely; the optimal threshold d with minimum classification error can then be estimated precisely. This classification error P(error) is a measurement of the quality of the cluster model. Furthermore, less complex models should be preferred. For this reason we formulate the quality constraint for clustering solutions as follows: the best clustering has the lowest number of clusters and an error probability P(error) < q, with q = 1%. The value of q is chosen depending on the task requirements. This measure is a fair evaluation criterion which enables choosing the best model among a set of k-means solutions.

D. Clustering results

We test two clustering methods: k-means and k-means with pruning. The positive trajectory is a person walking across the hall; the negative trajectory consists of 1200 measurements

[Figure 5 diagram blocks: a subsequence of images and an initial parameter setting feed the regulation process, which outputs an optimized parameter setting; the process is supported by the scene reference model with metric and the parameter space exploration tool.]

Fig. 5. A process for automatic parameter regulation.

constructed as described above. The training set consists of 21000 hand labelled bounding boxes from 15 CAVIAR sequences (see Figure 4). Table I shows the characteristics of the winning models with highest quality, defined by minimum classification error and minimum number of clusters. The superiority of k-means with pruning is demonstrated by the results: for the constraint P(error) < 1%, k-means with pruning requires only 20 or 19 clusters respectively, whereas classical k-means needs a model of 35 clusters to obtain the same error rate. The best overall model is obtained for spatio-temporal measurements using k-means with pruning.

IV. THE MODULE FOR AUTOMATIC PARAMETER REGULATION

The task of the module for automatic regulation is to determine a parameter setting that improves the performance of the system. In the case of a detection and recognition system, this corresponds to increasing the number of true positives and reducing the number of false positives. For this task, the module requires an evaluation function for the current output, a strategy to choose a new parameter setting, and a subsequence which can be replayed to optimize the performance.

A. Integration

When the parameter regulation module is switched on, the system tries to find a parameter setting that performs better than the current parameter setting on a subsequence provided by the tracking system. The system uses one of the goodness scores of section III-B. In the experiments we use a subsequence of 200 frames for auto-regulation. The tracker is run several times with changing parameter settings on this subsequence, and the goodness score of the trajectory is measured for each parameter setting. The parameter setting that produces the highest goodness score is kept. Parameter settings are obtained from a parameter space exploration tool whose strategies are explained in sections IV-B and IV-C.

The automatic regulation can only operate on sequences that produce a trajectory (something observable must happen in the scene). To allow a fair comparison, the regulation module must process the same subsequence several times. For this reason the regulation process requires a significant amount of computing power. As a consequence, the regulation module should be run on a different host, so that the regulation does not slow down the real-time tracking.

[Figure 4 panels: hand labelled bounding boxes (21000); example of a typical trajectory; example of an unusual trajectory (random); clustering result.]

Fig. 4. Ground truth labelling for the entrance hall scenario, examples of typical and unusual trajectories, and the clustering result using k-means with pruning.

Measurement type | Clustering method | # clusters | Optimal threshold d | P(error)
Spatial | k-means | 35 | 0.0067380 | 0.0007
Spatial | k-means with pruning | 20 | 0.0067380 | 0.0061
Spatio-temporal | k-means | 35 | 0.00012341 | 0.0013
Spatio-temporal | k-means with pruning | 19 | 0.00012341 | 0.0034

TABLE I. Best model representations and their characteristics (final number of clusters, optimal threshold, and classification error).

B. Parameter space exploration tool

To solve the problem of parameter space exploration we propose a tool that provides the next parameter setting to the regulation module. The dimensions of the parameter space as well as reasonable ranges of the parameter values are given by the user. In our tracking example the parameter space is spanned by detection energy, density, sensitivity, split coefficient, the background adaptation parameter, and the area threshold. In the experiments we tested two strategies for parameter setting selection. The first is an enumerative method that defines a small number of discrete values for each parameter; at each call the parameter space exploration tool provides the next parameter setting in the list. The disadvantage of this method is that only a small number of settings can be tested, and the best setting may not be in the predefined list. The second strategy for parameter space exploration is based on a genetic algorithm. We found genetic algorithms well adapted to our problem: they enable feedback from the performance of previous settings, and we have a high dimensional feature space which makes hill climbing methods costly, whereas genetic algorithms explore the space without needing a high dimensional surface analysis.

C. Genetic algorithm for parameter space exploration

Among the different optimization schemes that exist we are looking for a particular method that fulfills several constraints. We do not require reaching a global maximum of our function, but we would like to reach a good level of performance quickly. Furthermore we are not particularly interested in the shape of the surface in parameter space; we are only interested in obtaining a good payoff with a small number of tests. According to Goldberg [6], these are exactly the constraints of an application for which genetic algorithms are appropriate. Hill climbing methods are not feasible because the estimation of the gradient of a single point in a 6-dimensional space requires 26 tests; testing several points would therefore require a higher number of tests than we would like.

Genetic algorithms are inspired by the mechanics of natural selection. They require an objective function to evaluate the performance of an individual and a coding of the input variables; typically the coding is a binary string. In our example each parameter is represented by 5 bits, which gives an input string of length 30. Genetic algorithms have three major operators: reproduction, crossover and mutation. Reproduction is a process in which individuals are copied according to their objective function values: individuals with high performance are copied more often than those with low performance. After reproduction, crossover is performed as follows. First, pairs of individuals are selected at random. Then a position k within the string of length l is selected at random, and two new individuals are created by swapping all characters from position k+1 to l. The mutation operator selects a position within the string at random and flips its value. The power of genetic algorithms comes from the fact that individuals with good performance are selected for reproduction, and crossing high performance individuals speculates on generating new ideas from high performance elements of past trials.

For the initialisation of the genetic algorithm, the user needs to specify the boundaries of the input variable space, the coding of the input variables, the size of the initial population and the probabilities of crossover and mutation. Goldberg [6] proposes to use a moderate population size, a high crossover probability and a low mutation probability. The coding of the input variables should use the smallest alphabet that allows the problem to be expressed. In the experiment we use a population of size 16, we estimate 7 generations, the crossover probability is set to 0.6 and the mutation probability to 0.03.
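A sketch of the binary-coded genetic loop described above (5 bits per parameter, fitness-proportional reproduction, single-point crossover, bit-flip mutation); the objective function, i.e., running the tracker on the subsequence and computing the goodness score, is omitted, and all names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
BITS, N_PARAMS = 5, 6          # 5 bits per parameter, 6 parameters
L = BITS * N_PARAMS            # string length 30

def decode(bits, lo, hi):
    """Map each 5-bit gene to a parameter value in [lo, hi]."""
    genes = bits.reshape(N_PARAMS, BITS)
    ints = genes @ (2 ** np.arange(BITS)[::-1])
    return lo + (hi - lo) * ints / (2 ** BITS - 1)

def step(pop, fitness, p_cross=0.6, p_mut=0.03):
    """One generation. fitness must be positive (e.g., shifted
    goodness scores) for the proportional selection to work."""
    probs = fitness / fitness.sum()
    idx = rng.choice(len(pop), size=len(pop), p=probs)   # reproduction
    pop = pop[idx].copy()
    for i in range(0, len(pop) - 1, 2):                  # crossover
        if rng.random() < p_cross:
            k = rng.integers(1, L)
            pop[i, k:], pop[i + 1, k:] = pop[i + 1, k:].copy(), pop[i, k:].copy()
    flips = rng.random(pop.shape) < p_mut                # mutation
    return np.where(flips, 1 - pop, pop)

With a population of 16 bit-strings, running step for 7 generations matches the experimental budget quoted above.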

V. EXPERIMENTAL EVALUATION

In this section we evaluate the proposed approach on the CAVIAR entry hall sequences1. The system is evaluated by the recall and precision of the targets compared to the hand-labelled ground truth:

recall = true positives / total # targets    (9)

precision = true positives / (true positives + false positives)    (10)

We use the results of manual adaptation as an upper benchmark. These results were obtained by a human expert who processed the sequences several times and hand-tuned the parameters. The expert quickly gained experience of which kinds of tracking errors depend on which parameters; the automatic regulation module does not use this kind of knowledge. For this reason, the recall and precision of the manual adaptation are the best we can hope to reach with an automatic method. We do not have manually adapted parameters for all sequences, due to the repetitive and time consuming manual task. A lower benchmark is provided by tracking results that use no adaptation, i.e., all 5 sequences are evaluated using the same parameter setting. Choosing parameters with high values2 produces low recall and bad precision. Choosing parameters with low values3 increases the recall, but the very large number of false positives is not acceptable.

Table II shows the tracking results using a spatial and a spatio-temporal model and two parameter space exploration schemes. The first uses a brute force search (enumerative method) of the discrete parameter space composed of the discrete values detection energy [20, 30, 40], density [5, 15, 25], sensitivity [0, 20, 30, 40], split coefficient = 2.0, background adaptation parameter = 0.001 and area threshold = 1500; the method tests 36 parameter settings. The second exploration scheme uses a genetic algorithm as described in section IV-C. The enumerative method has several disadvantages, which are reflected by the rather low performance measurements of the experiments: the sampling of the parameter space is coarse, and therefore it happens frequently that none of the parameter settings provides an acceptable improvement. The same arguments hold for random sampling of the parameter space.

The spatial model using the brute force method and the simple score has a small recall, but a better precision than the lower benchmark. The spatio-temporal measurements using the same

1 Available at http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
2 Detection energy = 30, density = 15, sensitivity = 30, split coefficient = 1.0, background adaptation parameter = 0.01, and area threshold = 1500.
3 Detection energy = 10, density = 15, sensitivity = 20, split coefficient = 2.0, background adaptation parameter = 0.01, and area threshold = 1500.

parameter selection and evaluation measure produce superior results (higher recall and higher precision). This seems to be related to the spatio-temporal model. The precision can be further improved using the genetic approach and the more complex evaluation function (recall 39.7% and precision 78.8%).

VI. CONCLUSIONS AND OUTLOOK

We presented an architecture for a tracking system that uses an auto-critical function to judge its own performance and an automatic parameter regulation module for parameter adaptation. This system opens the way to self-adaptive systems which can operate under difficult lighting conditions. We applied our approach to tracking systems, but the same approach can be used to increase the performance of other systems that depend on a set of parameters.

An auto-critical function and a parameter regulation module require a reliable performance evaluation measure. In our case this measure is computed as a divergence of the observed measurements with respect to a scene reference model. We proposed an approach for the generation of such a scene reference model and developed a measure that is based on the measurement likelihood. With this measure we can compute a best parameter setting for pre-stored sequences. The experiments show that the auto-regulation greatly enhances the performance of the tracking output compared to tracking without auto-regulation. The system cannot quite reach the performance of a human expert, who uses knowledge about the types of tracking errors for parameter tuning; this kind of knowledge is not available to our system.

The implementation of the auto-critical function can trigger the automatic parameter regulation. First successful tests have been made to host the system on a distributed architecture, whose advantage is that the tracking system can continue the real-time tracking. There remains the problem of re-initialisation of the tracker: currently, existing targets are destroyed when the tracker is reinitialised.

The current model relies entirely on ground truth labelling, and the success of the method strongly depends on the quality of the model. In many cases a small number of hand labelled trajectories can be gathered, but often their number is not sufficient for the creation of a valid model. For such cases we envision an incremental modeling approach: an initial model is generated from a few hand-labelled sequences and is then used to filter the tracking results so that they are error free. These error-free trajectories are then used to refine the model. This corresponds to a feedback loop in model generation; after a small number of iterations a valid model should be obtained. The option of such an incremental model is essential for non-static scenes.

ACKNOWLEDGMENT

This research is funded by the European commission's IST project CAVIAR (IST 2001 37540). Thanks to Thor List for providing the recognition evaluation tool.

Auto-regulation method | Recall | Precision | Total # targets | True positives | False positives
Manual adaptation (benchmark) | 49.7 | 91.0 | 23180 | 11520 | 1136
Spatio-temporal model (genetic approach, Gseq(10)) | 39.7 | 78.8 | 21564 | 8556 | 2304
Spatio-temporal model (genetic approach, simple score G) | 39.4 | 73.2 | 21564 | 8492 | 3108
Spatio-temporal model (brute force, simple score G) | 38.1 | 72.2 | 21564 | 8224 | 3160
Spatial model (brute force, simple score G) | 29.2 | 68.7 | 21564 | 6302 | 2872
No adaptation (low thresholds) | 68.0 | 24.5 | 21564 | 14672 | 45131
No adaptation (high thresholds) | 28.3 | 47.5 | 21564 | 6109 | 6746

TABLE II. Precision and recall of the different methods evaluated for 5 CAVIAR sequences (overlap requirement 50%).

REFERENCES

[1] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] A. Caporossi, D. Hall, P. Reignier, and J.L. Crowley. Robust visual tracking from dynamic control of processing. In International Workshop on Performance Evaluation of Tracking and Surveillance, pages 23-31, Prague, Czech Republic, May 2004.
[3] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In European Conference on Computer Vision, Prague, Czech Republic, May 2004.
[4] R.B. Fisher. The PETS04 surveillance ground-truth data sets. In International Workshop on Performance Evaluation of Tracking and Surveillance, Prague, Czech Republic, May 2004.
[5] B. Georis, F. Brémond, M. Thonnat, and B. Macq. Use of an evaluation and diagnosis method to improve tracking performances. In International Conference on Visualization, Imaging and Image Processing, September 2003.
[6] D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
[7] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In International Conference on Computer Vision, Corfu, Greece, September 1999.
[8] C. Schmid. Constructing models for content-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 39-45, Kauai, USA, December 2001.

Performance evaluation of object detection algorithms for video surveillance


Jacinto Nascimento, Member, IEEE (jan@isr.ist.utl.pt) and Jorge Marques (jsm@isr.ist.utl.pt), IST/ISR, Torre Norte, Av. Rovisco Pais, 1049-001, Lisboa, Portugal

EDICS: 4-SEGM
Abstract: In this paper we propose novel methods to evaluate the performance of object detection algorithms in video sequences. This procedure allows us to highlight characteristics (e.g., region splitting or merging) which are specific to the method being used. The proposed framework compares the output of the algorithm with the ground truth and measures the differences according to objective metrics. In this way it is possible to perform a fair comparison among different methods, evaluating their strengths and weaknesses and allowing a reliable choice of the best method for a specific application. We apply this methodology to recently proposed segmentation algorithms and describe their performance. These methods were evaluated in order to assess how well they can detect moving regions in an outdoor scene in fixed-camera situations. Index Terms: Surveillance Systems, Performance Evaluation, Metrics, Ground Truth, Segmentation, Multiple Interpretations.

I. INTRODUCTION

VIDEO surveillance systems rely on the ability to detect moving objects in the video stream, which is a relevant information extraction step in a wide range of computer vision applications. Each image is segmented by automatic image analysis techniques. This should be done in a reliable and effective way in order to cope with unconstrained environments, non-stationary backgrounds and different object motion patterns. Furthermore, different types of objects have to be considered, e.g., persons, vehicles or groups of people.

Many algorithms have been proposed for object detection in video surveillance applications. They rely on different assumptions, e.g., statistical models of the background [1]-[3], minimization of Gaussian differences [4], minimum and maximum values [5], adaptivity [6,7] or a combination of frame differences and statistical background models [8]. However, little information is available on the performance of these algorithms under different operating conditions. Three approaches have recently been considered to characterize the performance of video segmentation algorithms: pixel-based methods, template-based methods and object-based methods. Pixel-based methods assume that we wish to detect all the active pixels in a given image. Object detection is therefore formulated as a set of independent pixel detection problems. This is a classic binary detection problem, provided that we know the ground truth (ideal segmented image). The algorithms can therefore be evaluated by standard measures used in communication theory, e.g., misdetection rate, false alarm rate and the receiver operating characteristic (ROC) [9].
This work was supported by FCT under project LTT and by the EU project CAVIAR (IST-2001-37540). Corresponding author: Jacinto Nascimento (email: jan@isr.ist.utl.pt). Complete address: Instituto Superior Técnico - Instituto de Sistemas e Robótica (IST/ISR), Av. Rovisco Pais, Torre Norte, 6º piso, 1049-001, Lisboa, PORTUGAL. Phone: +351-21-8418270, Fax: +351-21-8418291.

Several proposals have been made to improve the computation of the ROC in video segmentation problems, e.g., using a perturbation detection rate analysis [10] or an equilibrium analysis [11]. The usefulness of pixel-based methods for surveillance applications is questionable, since we are not interested in the detection of point targets but of object regions instead. The computation of the ROC can also be performed using rectangular regions selected by the user, with and without moving objects [12]. This improves the evaluation strategy since the statistics are based on templates instead of isolated pixels. A third class of methods is based on object evaluation. Most of these works aim to characterize color, shape and path fidelity by proposing figures of merit for each of these issues [13]-[15], or use area-based performance evaluation as in [16]. This approach is instrumental for measuring the performance of image segmentation methods for video coding and synthesis, but it is not usually used in surveillance applications. These approaches have three major drawbacks. First, object detection is not a classic binary detection problem: several types of errors should be considered (not just misdetections and false alarms). For example, what should we do if a moving object is split into several active regions? Or if two objects are merged into a single region? Second, some methods are based on the selection of isolated pixels or rectangular regions with and without persons. This is an unrealistic assumption, since practical algorithms have to segment the image into background and foreground and do not have to classify rectangular regions selected by the user. Third, it is not possible to define a unique ground truth: many images admit several valid segmentations, and if the image analysis algorithm produces a valid segmentation, its output should be considered correct. In this paper we propose objective metrics to evaluate the performance of object detection methods by comparing the output of the video detector with the ground truth obtained by manual edition. Several types of errors are considered: splits of foreground regions; merges of foreground regions; simultaneous split and merge of foreground regions; false alarms; and detection failures. False alarms occur when false objects are detected; detection failures are caused by missing regions which have not been detected. In this paper five segmentation algorithms are considered as examples and evaluated. We also consider multiple interpretations in the case of ambiguous situations, e.g., when it is not clear if two objects overlap and should be considered as a group or if they are separate. The first algorithm is denoted as the basic background subtraction (BBS) algorithm. It computes the absolute difference between the current image and a static background image and compares each pixel to a threshold. All the connected components are computed, and they are considered as active regions if their area exceeds a given threshold. This is perhaps the simplest object detection algorithm one can imagine. The second method is the detection algorithm used in the W4 system [17]. Three features are used to characterize each pixel of the background image: minimum intensity, maximum intensity and maximum absolute difference in consecutive frames. The third method assumes that each pixel of the background is a realization of a random variable with Gaussian distribution (SGM - Single Gaussian Model) [1].
The mean and covariance of the Gaussian distribution are independently estimated for each pixel. The fourth algorithm represents the distribution of the background pixels with a mixture of Gaussians [2]. Some modes correspond to the background and some are associated with active regions (MGM - Multiple Gaussian Model). The last method is the one proposed in [18], denoted as the Lehigh Omnidirectional

Tracking System (LOTS). It is tailored to detect small non-cooperative targets such as snipers. Some of these algorithms are described in a special issue of IEEE Transactions on PAMI (August 2001), which describes state-of-the-art methods for automatic surveillance systems. In this work we provide segmentation results of these algorithms on the PETS2001 sequences, using the proposed framework. The main features of the proposed method are the following. Given the correct segmentation of the video sequence, we detect several types of errors: i) splits of foreground regions, ii) merges of foreground regions, iii) simultaneous splits and merges of foreground regions, iv) false alarms (detection of false objects) and v) detection failures (missing active regions). We then compute statistics for each type of error.

The structure of the paper is as follows. Section 2 briefly reviews previous work. Section 3 describes the segmentation algorithms used in this paper. Section 4 describes the proposed framework. Experimental tests are discussed in Section 5 and Section 6 presents the conclusions.

II. RELATED WORK

Surveillance and monitoring systems often require on-line segmentation of all moving objects in a video sequence. Segmentation is a key step since it influences the performance of the other modules, e.g., object tracking, classification or recognition. For instance, if object classification is required, an accurate detection is needed to obtain a correct classification of the object.

Background subtraction is a simple approach to detect moving objects in video sequences. The basic idea is to subtract the current frame from a background image and to classify each pixel as foreground or background by comparing the difference with a threshold [19]. Morphological operations followed by a connected component analysis are used to compute all active regions in the image. In practice several difficulties arise: the background image is corrupted by noise due to camera movements and fluttering objects (e.g., trees waving), illumination changes, clouds and shadows. To deal with these difficulties several methods have been proposed (see [20]). Some works use a deterministic background model, e.g., by characterizing the admissible interval for each pixel of the background image as well as the maximum rate of change in consecutive images, or the median of the largest inter-frame absolute differences [5,17]. Most works however rely on statistical models of the background, assuming that each pixel is a random variable with a probability distribution estimated from the video stream. For example, the Pfinder system (Person Finder) uses a Gaussian model to describe each pixel of the background image [1]. A more general approach consists of using a mixture of Gaussians to represent each pixel. This allows the representation of multi-modal distributions, which occur in natural scenes (e.g., in the case of fluttering trees) [2].

Another set of algorithms is based on spatio-temporal segmentation of the video signal. These methods try to detect moving regions taking into account not only the temporal evolution of the pixel intensities and color but also their spatial properties. Segmentation is performed in a 3D region of image-time space, considering the temporal evolution of neighboring pixels. This can be done in several ways, e.g., by using spatio-temporal entropy combined with morphological operations [21]. This approach leads to an improvement of the system's performance compared with traditional frame difference methods. Other approaches are based on the 3D structure tensor defined from the pixels' spatial and temporal derivatives in a given time interval [22]. In this case detection is based on the Mahalanobis distance, assuming a Gaussian distribution for the derivatives. This approach has been implemented

in real time and tested with the PETS 2005 data set. Other alternatives have also been considered, e.g., the use of a region growing method in 3D space-time [23].

A significant research effort has been made to cope with shadows and with non-stationary backgrounds. Two types of changes have to be considered: slow changes (e.g., due to the sun's motion) and rapid changes (e.g., due to clouds, rain or abrupt changes in static objects). Adaptive models and thresholds have been used to deal with slow background changes [18]. These techniques recursively update the background parameters and thresholds in order to track the evolution of the parameters in non-stationary operating conditions. To cope with abrupt changes, multiple model techniques have been proposed [18], as well as predictive stochastic models (e.g., AR, ARMA [24,25]). Another difficulty is the presence of ghosts [26], i.e., false active regions due to static objects belonging to the background image (e.g., cars) which suddenly start to move. This problem has been addressed by combining background subtraction with frame differencing, or by high level operations [27],[28].

III. SEGMENTATION ALGORITHMS

This section describes the object detection algorithms used in this work: BBS, W4, SGM, MGM and LOTS. The BBS, SGM and MGM algorithms use color, while W4 and LOTS use grayscale images. In the BBS algorithm,

the moving objects are detected by computing the difference between the current frame and the background image. A thresholding operation is performed to classify each pixel as a foreground pixel if
|I^t(x, y) - \mu^t(x, y)| > T    (1)

where I^t(x, y) is a 3 x 1 vector with the intensity of the pixel in the current frame, \mu^t(x, y) is the mean intensity (background) of the pixel, and T is a constant. Ideally, pixels associated with the same object should have the same label. This can be accomplished by performing a connected component analysis (e.g., using an 8-connectivity criterion). This step is usually performed after a morphological filtering (dilation and erosion) to eliminate isolated pixels and small regions. The second algorithm is denoted here as W4 since it is used in the W4 system to compute moving objects [17]. This algorithm is designed for grayscale images. The background model is built using a training sequence without persons or vehicles. Three values are estimated for each pixel using the training sequence: minimum intensity (Min), maximum intensity (Max), and the maximum intensity difference between consecutive frames (D). Foreground objects are computed in four steps: i) thresholding, ii) noise cleaning by erosion, iii) fast binary component analysis and iv) elimination of small regions. We have modified the thresholding step of this algorithm, since it often leads to a significant level of misclassifications. We classify a pixel I(x, y) as a foreground pixel iff
(I^t(x, y) < Min(x, y)) \vee (I^t(x, y) > Max(x, y)) \vee (|I^t(x, y) - I^{t-1}(x, y)| > D(x, y))    (2)

Figs. 1, 2 show an example comparing both approaches. Fig. 1 shows the original image with two active regions. Figs. 2(a),(b) display the output of the thresholding step performed as in [17] and using (2).
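A minimal sketch of the two pixel tests of equations (1) and (2); the L1 color distance used in the BBS test is an assumption, since the text does not specify which norm is applied to the 3 x 1 difference vector:

import numpy as np

def bbs_foreground(frame, background, T):
    """Equation 1: per-pixel color distance to the background mean."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff.sum(axis=2) > T        # L1 distance (an assumption)

def w4_foreground(frame, prev_frame, Min, Max, D):
    """Equation 2: a pixel is foreground if it leaves the [Min, Max]
    envelope or changes faster than D between consecutive frames."""
    f = frame.astype(float)
    return ((f < Min) | (f > Max) |
            (np.abs(f - prev_frame.astype(float)) > D))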

Fig. 1. Two regions (in bounding boxes) of an image.

Fig. 2. Thresholding results: (a) using the approach as in [17] and (b) using (2).

The third algorithm considered in this study is the SGM (Single Gaussian Model) algorithm. In this method the information is collected in a vector [Y, U, V]^T, which defines the intensity and color of each pixel. We assume that the scene changes slowly. The mean \mu(x, y) and covariance \Sigma(x, y) of each pixel can be recursively updated as follows:

\mu^t(x, y) = (1 - \alpha) \mu^{t-1}(x, y) + \alpha I^t(x, y)    (3)

\Sigma^t(x, y) = (1 - \alpha) \Sigma^{t-1}(x, y) + \alpha (I^t(x, y) - \mu^t(x, y))(I^t(x, y) - \mu^t(x, y))^T    (4)

where I^t(x, y) is the pixel of the current frame in YUV color space and \alpha is a constant. After updating the background, the SGM performs a binary classification of the pixels into foreground or background and tries to cluster foreground pixels into blobs. Pixels in the current frame are compared with the background by measuring the log likelihood in color space; individual pixels are assigned either to the background region or to a foreground region:

l(x, y) = -\frac{1}{2} (I^t(x, y) - \mu^t(x, y))^T (\Sigma^t)^{-1} (I^t(x, y) - \mu^t(x, y)) - \frac{1}{2} \ln |\Sigma^t| - \frac{m}{2} \ln(2\pi)    (5)

where I^t(x, y) is a vector (Y, U, V)^T defined for each pixel in the current image and \mu^t(x, y) is the pixel vector in the background image B.
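A sketch of the SGM recursion (3)-(4) and log likelihood (5), simplified to a per-channel (diagonal) covariance for brevity; the paper's \Sigma is a full matrix per pixel, so this is an illustrative assumption:

import numpy as np

def sgm_update(mu, sigma, frame, alpha=0.01):
    """Equations 3-4 with diagonal covariance: sigma holds per-channel
    variances for each pixel; frame, mu, sigma are (H, W, 3) floats."""
    mu = (1 - alpha) * mu + alpha * frame
    d = frame - mu
    sigma = (1 - alpha) * sigma + alpha * d * d
    return mu, sigma

def sgm_loglik(frame, mu, sigma):
    """Equation 5 with diagonal covariance: per-pixel log likelihood."""
    d = frame - mu
    return -0.5 * ((d * d / sigma).sum(axis=2)
                   + np.log(sigma).sum(axis=2)
                   + 3 * np.log(2 * np.pi))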

If a small likelihood is computed using (5), the pixel is classified as active; otherwise it is classified as background.

The fourth algorithm (MGM) models each pixel I(x) = I(x, y) as a mixture of N (N = 3) Gaussian distributions, i.e.,

p(I(x)) = \sum_{k=1}^{N} \omega_k N(I(x); \mu_k(x), \Sigma_k(x))    (6)

where N(I(x); \mu_k(x), \Sigma_k(x)) is a multivariate normal distribution and \omega_k is the weight of the k-th normal,

N(I(x); \mu_k(x), \Sigma_k(x)) = c \exp( -\frac{1}{2} (I(x) - \mu_k(x))^T \Sigma_k^{-1}(x) (I(x) - \mu_k(x)) )    (7)

with c = 1 / ((2\pi)^{n/2} |\Sigma_k|^{1/2}). Note that each pixel I(x) is a 3 x 1 vector with three component colors (red, green and blue), i.e., I(x) = [I(x)^R, I(x)^G, I(x)^B]^T. To avoid an excessive computational cost, the covariance matrix is assumed to be diagonal [2].

The mixture model is dynamically updated. Each pixel is updated as follows. i) The algorithm checks if each incoming pixel value can be ascribed to a given mode of the mixture; this is the match operation. ii) If the pixel value occurs inside the confidence interval of +/- 2.5 standard deviations, a match event is verified. The parameters of the corresponding (matched) distributions for that pixel are updated according to

\mu_k^t(x) = (1 - \lambda_k^t) \mu_k^{t-1}(x) + \lambda_k^t I^t(x)    (8)

\Sigma_k^t(x) = (1 - \lambda_k^t) \Sigma_k^{t-1}(x) + \lambda_k^t (I^t(x) - \mu_k^t(x))(I^t(x) - \mu_k^t(x))^T    (9)

where

\lambda_k^t = \alpha N(I^t(x); \mu_k^{t-1}(x), \Sigma_k^{t-1}(x))    (10)

The weights are updated by

\omega_k^t = (1 - \alpha) \omega_k^{t-1} + \alpha M_k^t,  with  M_k^t = 1 for matched models and M_k^t = 0 for the remaining models    (11)

where \alpha is the learning rate. The components of the mixture that do not match are not modified. If none of the existing components match the pixel value, the least probable distribution is replaced by a normal distribution with mean equal to the current value, a large covariance and a small weight. iii) The next step is to order the distributions in descending order of \omega / \sigma. This criterion favours distributions which have more weight (most supporting evidence) and less variance (less uncertainty). iv) Finally the algorithm models each pixel as the sum of the corresponding updated distributions. The first B Gaussian modes are used to represent the background, while the remaining modes are considered foreground distributions. B is chosen as the smallest integer such that

\sum_{k=1}^{B} \omega_k > T    (12)

where T is a threshold that accounts for the quantity of data that should belong to the background.
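A small sketch of steps iii) and iv) for one pixel: ordering the modes by \omega / \sigma and selecting the background modes via equation 12; the value T = 0.7 is an assumed example, not a value prescribed by the paper:

import numpy as np

def background_modes(weights, sigmas, T=0.7):
    """Order modes by weight/sigma (step iii) and keep the first B
    modes whose cumulative weight exceeds T (equation 12)."""
    order = np.argsort(-(weights / sigmas))
    w = weights[order]
    B = int(np.searchsorted(np.cumsum(w), T)) + 1
    return order[:B]   # indices of the background modes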

The fifth algorithm [18] is tailored for the detection of non-cooperative targets (e.g., snipers) under non-stationary environments. The algorithm uses two gray level background images B_1, B_2. This allows the algorithm to cope with intensity variations due to noise or fluttering objects moving in the scene. The background images are initialized using a set of T consecutive frames without active objects:

B_1(x, y) = \min \{ I^t(x, y), t = 1, ..., T \}    (13)

B_2(x, y) = \max \{ I^t(x, y), t = 1, ..., T \}    (14)

where t \in \{1, 2, ..., T\} denotes the time instant. In this method, targets are detected by using two thresholds (T_L, T_H) followed by a quasi-connected components (QCC) analysis. These thresholds are initialized using the difference between the background images:

T_L(x, y) = |B_1(x, y) - B_2(x, y)| + c_U    (15)

T_H(x, y) = T_L(x, y) + c_S    (16)

where c_U and c_S \in [0, 255] are constants specified by the user. We compute the difference between each pixel and the closest background image. If the difference exceeds a low threshold T_L, i.e.,

\min_i |I^t(x, y) - B_i^t(x, y)| > T_L(x, y)    (17)

the pixel is considered as active. A target is a set of connected active pixels such that a subset of them verifies

\min_i |I^t(x, y) - B_i^t(x, y)| > T_H(x, y)    (18)

where T_H(x, y) is a high threshold. The low and high thresholds T_L^t(x, y), T_H^t(x, y), as well as the background images B_i^t(x, y), i = 1, 2, are recursively updated in a fully automatic way (see [18] for details).
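A one-function sketch of the low-threshold test of equation (17) against the two background images; the quasi-connected component analysis and the recursive threshold updates are omitted:

import numpy as np

def lots_active(frame, B1, B2, TL):
    """Equation 17: a pixel is active if its distance to the CLOSEST
    of the two background images exceeds the low threshold T_L."""
    d = np.minimum(np.abs(frame - B1), np.abs(frame - B2))
    return d > TL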

IV. PROPOSED FRAMEWORK

In order to evaluate the performance of object detection algorithms we propose a framework based on the following principles:

A set of sequences is selected for testing; all the moving objects are detected using an automatic procedure and manually corrected if necessary to obtain the ground truth. This is performed at one frame per second.

The output of the automatic detector is compared with the ground truth. The errors are detected and classified into one of the following classes: correct detections, detection failures, splits, merges, split/merges and false alarms.

A set of statistics (mean, standard deviation) is computed for each type of error.

To perform the first step we built a user-friendly interface which allows the user to define the foreground regions of the test sequence in a semi-automatic way. Fig. 3 shows the interface used to generate the ground truth. A set of frames is extracted from the test sequence (one per second). An automatic object detection algorithm is then used to provide a tentative segmentation of the test images. Finally, the automatic segmentation is corrected by the user by merging, splitting, removing or creating active regions. Typically the boundary of the object is detected with two-pixel accuracy. Multiple segmentations of the video data are generated every time there is an ambiguous situation, i.e., two close regions which are almost overlapping. This problem is discussed in section IV-D. In the case depicted in Fig. 3 there are four active regions: a car, a lorry and two groups of persons. The segmentation algorithm also detects regions due to lighting changes, leading to a number of false alarms (four). The user can easily edit the image by adding, removing and checking the operations, thus providing a correct segmentation. Fig. 3 shows an example where the user progressively removes the regions which do not belong to the objects of interest; the final segmentation is shown in the bottom images.

Fig. 3.

User interface used to create the ground truth from the automatic segmentation of the video images.

The test images are used to evaluate the performance of object detection algorithms. In order to compare the output of an algorithm with the ground truth segmentation, a region matching procedure is adopted which establishes a correspondence between the detected objects and the ground truth. Several cases are considered:

1) Correct Detection (CD) or 1-1 match: the detected region matches one and only one ground truth region.
2) False Alarm (FA): the detected region has no correspondence.
3) Detection Failure (DF): the ground truth region has no correspondence.
4) Merge Region (M): the detected region is associated with several ground truth regions.
5) Split Region (S): the ground truth region is associated with several detected regions.
6) Split-Merge Region (SM): the conditions in 4) and 5) are simultaneously satisfied.

A. Region Matching

Object matching is performed by computing a binary correspondence matrix $C^t$ which defines the correspondence between the active regions in a pair of images. Let us assume that there are N ground truth regions $R_i$ and M detected regions $\tilde R_j$. Under these conditions $C^t$ is an N × M matrix, defined as follows

$C^t(i,j) = \begin{cases} 1 & \text{if } \dfrac{|R_i \cap \tilde R_j|}{|R_i \cup \tilde R_j|} > T \\ 0 & \text{if } \dfrac{|R_i \cap \tilde R_j|}{|R_i \cup \tilde R_j|} < T \end{cases} \qquad i \in \{1,\dots,N\},\; j \in \{1,\dots,M\}$  (19)

where T is the threshold which accounts for the overlap requirement. It is also useful to add up the number of ones in each line or column, defining two auxiliary vectors

$L(i) = \sum_{j=1}^{M} C^t(i,j), \qquad i \in \{1,\dots,N\}$  (20)

$C(j) = \sum_{i=1}^{N} C^t(i,j), \qquad j \in \{1,\dots,M\}$  (21)

When we associate ground truth regions with detected regions, six cases can occur: zero-to-one, one-to-zero, one-to-one, many-to-one, one-to-many and many-to-many associations. These correspond to false alarm, detection failure, correct detection, merge, split and split-merge, respectively. Detected regions $\tilde R_j$ are classified according to the following rules:

CD: L(i) = C(j) = 1 ∧ C^t(i,j) = 1
M:  C(j) > 1 ∧ C^t(i,j) = 1
S:  L(i) > 1 ∧ C^t(i,j) = 1
SM: L(i) > 1 ∧ C(j) > 1 ∧ C^t(i,j) = 1
FA: C(j) = 0                                  (22)

A detection failure (DF) associated with the ground truth region $R_i$ occurs if L(i) = 0.
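As a concrete reading of (19)-(22), the sketch below classifies the associations between ground truth and detected regions given as boolean NumPy masks. The function and variable names are ours, and the pixel-mask representation is an assumption; the same rules apply to any region representation with an overlap measure.

```python
import numpy as np

def classify_matches(gt_masks, det_masks, t_overlap=0.1):
    """Build the correspondence matrix C of (19) from boolean region
    masks and classify each association according to (22)."""
    n, m = len(gt_masks), len(det_masks)
    c = np.zeros((n, m), dtype=int)
    for i, g in enumerate(gt_masks):
        for j, d in enumerate(det_masks):
            inter = np.logical_and(g, d).sum()
            union = np.logical_or(g, d).sum()
            if union and inter / union > t_overlap:
                c[i, j] = 1
    row = c.sum(axis=1)  # L(i): detections matched to ground truth i
    col = c.sum(axis=0)  # C(j): ground truth regions matched to detection j
    labels = {}
    for i in range(n):
        for j in range(m):
            if c[i, j]:
                if row[i] == 1 and col[j] == 1:
                    labels[(i, j)] = "CD"   # correct detection
                elif row[i] > 1 and col[j] > 1:
                    labels[(i, j)] = "SM"   # split-merge
                elif col[j] > 1:
                    labels[(i, j)] = "M"    # merge
                elif row[i] > 1:
                    labels[(i, j)] = "S"    # split
    fa = [j for j in range(m) if col[j] == 0]  # false alarms: C(j) = 0
    df = [i for i in range(n) if row[i] == 0]  # detection failures: L(i) = 0
    return c, labels, fa, df
```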

The two last situations (FA, DF) in (22) occur whenever empty columns or lines are observed in the matrix C. Fig. 4 illustrates the six situations considered in this analysis using synthetic examples. Two images are shown for each case, corresponding to the ground truth (left) and the detected regions (right), together with the corresponding correspondence matrix C. For each case, the left image (I) contains the regions defined by the user (ground truth) and the right image (Ĩ) contains the regions detected by the segmentation algorithm. Each region is represented by a white area containing a visual label. Fig. 4 (a) shows an ideal situation, in which each ground truth region matches one and only one detected region (correct detection). In Fig. 4 (b) the square region has no correspondence with the detected regions; it therefore corresponds to a detection failure. In Fig. 4 (c) the algorithm detects regions which have no correspondence in the I image, indicating false alarms. Fig. 4 (d) shows a merge of two regions, since two different regions (square and dot regions in I) have the same correspondence to the square region in Ĩ. The remaining examples in this figure are self-explanatory, illustrating the split (e) and split-merge (f) situations.

B. Region Overlap

The region-based measures described herein depend on an overlap requirement T (see (19)) between a region of the ground truth and the detected region. Without this requirement, a single-pixel overlap would be enough to establish a match between a detected region and a region in the ground truth segmentation, which does not make sense.


A match is determined to occur if the overlap is at least as large as the overlap requirement T. The larger the overlap requirement, the more the regions are required to overlap, hence performance usually declines as the requirement approaches 100%. In this work we use an overlap requirement of T = 10%. Fig. 5 illustrates the association matrices in two different cases, considering an overlap requirement of T = 20%. In Fig. 5 (a) the circle region in the ground truth is not represented by any detected region, since the overlap is below the overlap requirement, leading to a detection failure. If we increase the overlap between these two regions (see Fig. 5 (b)) we obtain a correct detection (second line, second column of C). Finally, a situation is illustrated where two detection failures (Fig. 5 (c)) become a split (Fig. 5 (d)) when the overlap among these regions increases.

C. Area Matching

The match between pairs of regions (ground truth / automatically detected) is also used to measure the performance of the algorithms: the higher the percentage of matched area, the better the active regions produced by the algorithm. This is done for all the correctly detected regions. The match metric is defined by
$\mathcal{M}(i) = \dfrac{|R_i \cap \tilde R_j|}{|R_i \cup \tilde R_j|}$,

where j is the index of the corresponding detected region. The metric $\mathcal{M}$ is the area of the overlap normalized by the total area of the object. The average of $\mathcal{M}(i)$ over a video sequence will be used to characterize the performance of the detector.
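A one-function sketch of this metric, under the same boolean-mask assumption as the previous snippet:

```python
import numpy as np

def match_area(gt_mask, det_mask):
    """Match metric M(i): overlap area normalized by the area of the
    union of the ground truth region and its corresponding detection."""
    inter = np.logical_and(gt_mask, det_mask).sum()
    union = np.logical_or(gt_mask, det_mask).sum()
    return inter / union if union else 0.0
```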


Fig. 4. Different matching cases, each showing the ground truth (left), the detector output (right) and the matrix C: (a) correct detection; (b) detection failure; (c) false alarm; (d) merge; (e) split; (f) split-merge.


Fig. 5. Matching cases with an overlap requirement of T = 20%: detection failure (overlap < T) (a); correct detection (overlap > T) (b); two detection failures (overlap < T) (c); split (overlap > T) (d).


D. Multiple Interpretations

Sometimes the segmentation procedure is subjective, since each active region may contain several objects and it is not always easy to determine whether it is a single connected region or several disjoint regions. For instance, Fig. 6 (a) shows an input image and a manual segmentation; three active regions were considered: a person, a lorry and a group of people. Fig. 6 (b) shows the segmentation results provided by the SGM algorithm. This algorithm splits the group into three individuals, which can also be considered a valid solution since there is very little overlap between them. This segmentation should be considered as an alternative ground truth, and such situations should not penalize the performance of the algorithm. On the contrary, situations such as the one depicted in Fig. 7 should be considered errors: Fig. 7 (a) shows the ground truth and Fig. 7 (b) the segmentation provided by the W4 algorithm, which wrongly splits the vehicle.

Fig. 6. Correct split example: (a) supervised segmentation, (b) SGM segmentation.

Fig. 7. Wrong split example: (a) supervised segmentation, (b) W4 segmentation.

Since we do not know in advance how the algorithm behaves in terms of merging or splitting, every possible combination of the elements belonging to a group must be taken into account. For instance, another ambiguous situation is depicted in Fig. 8, which shows the segmentation results of the SGM method: the same algorithm provides different segmentations (both of which can be considered correct) of the same group at different time instants. This suggests the use of multiple interpretations for the segmentation. To accomplish this, the evaluation setup takes into account all possible merges of single regions belonging to the same group whenever multiple interpretations should be considered, i.e., when there is a small overlap among the group members. The number of merges depends on the relative positions of the single regions. Fig. 9 shows two examples of different merged-region groups with three objects A, B, C (each one representing a person in the group). In the first example (Fig. 9 (a)) four interpretations are considered: all the objects are separated; all are merged into a single active region; or AB (respectively BC) are linked and the remaining object is isolated. In the second example an additional interpretation is added, since A can also be linked with C. Instead of asking the user to identify all the possible merges in an ambiguous situation, an algorithm is used to generate all the valid interpretations in two steps. First, we assign all possible label sequences to the group regions; if the same label is assigned to two different regions, these regions are considered merged. Equation (23)(a) shows the labelling matrix M for the example of Fig. 9 (a). Each row corresponds to a different labelling assignment, and the element M_ij denotes the label of the j-th region in the i-th labelling configuration. The second step checks whether the merged regions are close to each other and whether there is another region in between; the invalid labelling configurations are removed from the matrix M. The output of this step for the example of Fig. 9 (a) is shown in equation (23)(b): the labelling sequence 121 is discarded, since region 2 lies between regions 1 and 3, so regions 1 and 3 cannot be merged. In the case of Fig. 9 (b) all the configurations are possible (M = M_FINAL). A detailed description of the labelling method is included in Appendix VII-A. Figs. 10 and 11 illustrate the generation of the valid interpretations. Fig. 10 (a) shows the input frame, Fig. 10 (b) shows the hand-segmented image, where the user specifies all the objects (the three objects in the group of persons must be provided separately), and Fig. 10 (c) illustrates the output of the SGM. Fig. 11 shows all possible merges of individual regions; all of them are considered correct. It remains to decide which segmentation should be selected to appraise the performance. In this paper we choose the best segmentation, i.e., the one that provides the highest number of correct detections; in the present example the segmentation illustrated in Fig. 11 (g) is selected. In this way we overcome the segmentation ambiguities that may appear, without penalizing the algorithm. This is the most complex situation which occurs in the video sequences used in this paper.

Fig. 8. Two different segmentations provided by the SGM method on the same group, taken at different time instants.

Fig. 9. Region linking procedure with three objects A, B, C (from left to right). The same number of foreground regions may have different interpretations: three possible configurations (a), or four configurations (b). Each color represents a different region.

$M = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \;\;(a) \qquad M_{FINAL} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \;\;(b)$  (23)

Fig. 10. Input frame (a), image segmented by the user (b), output of SGM (c).

V. TESTS ON THE PETS2001 DATASET

This section presents the evaluation of several object detection algorithms using the PETS2001 dataset. The training and test sequences of PETS2001 were used for this study: the training sequence has 3064 frames and the test sequence has 2688 frames. In both sequences, the first 100 images were used to build the background model for each algorithm. The resolution is half-resolution PAL standard (288 × 384 pixels, 25 frames per second). The algorithms were evaluated using one frame per second. The ground truth was generated by an automatic segmentation of the video signal followed by a manual correction, using the graphical editor described in Section IV. The outputs of the algorithms were then compared with the ground truth. Most algorithms require the specification of the smallest area of an object; an area of 25 pixels was chosen, since it allows all objects of interest in the sequences to be detected.


Fig. 11. Multiple interpretations given by the application. The segmentation illustrated in (g) is selected for the current frame.

A. Choice of the Model Parameters

The segmentation algorithms described herein depend on a set of parameters, mainly the thresholds and the learning rate α. We must therefore determine the best values of the most significant parameters of each algorithm. This was done using ROC curves, which display the performance of each algorithm as a function of its parameters. The Receiver Operating Characteristic (ROC) has been extensively used in communications [9]. All the parameters are assumed constant but one: we kept the learning rate α constant and varied the threshold in an attempt to obtain the best threshold value T, and we repeated this procedure for several values of α. This requires a considerable number of tests, but in this way it is possible to achieve a proper configuration of the algorithm parameters. These tests were made on a training sequence of the PETS2001 dataset; once the parameters are set, their values are used on a different sequence. The ROC curves describe the evolution of the false alarms (FA) and detection failures (DF) as T varies. An ideal curve would be close to the origin, with an area under the curve close to zero. To obtain these two values, we compute the two measures (for each value of T) by applying the region matching through the sequence; the final values are computed as the mean values of FA and DF.

Fig. 12 shows the receiver operating curves for all the algorithms. It is observed that the performance of the BBS algorithm is independent of α. We can also see that this algorithm is sensitive with respect to the threshold, since there is a large variation of FA and DF for small changes of T; this can be viewed as a lack of smoothness of the ROC curve (T = 0.2 is the best value). There is a large number of false alarms in the training sequence, due to the presence of a static object (a car) which suddenly starts to move. The background image should be modified when the car starts to move; however, the image analysis algorithms are not able to cope with this situation, since they only allow slow adaptations of the background. A ghost region is therefore detected in the place where the car was (a false alarm).


The second row of Fig. 12 shows the ROC curves of the SGM method for three values of α (0.01, 0.05, 0.15). This method is more robust than the BBS algorithm with respect to the threshold. We see that for 150 < T < 400 and α = 0.01 or α = 0.05 we obtain similar FA rates and a small variation of DF. We chose α = 0.05 and T = 400.

The third row shows the results of the MGM method. The best performances are obtained for α < 0.05 (first and second columns); the best value of the parameter is α = 0.008. In fact, we observe the best performances for α ≤ 0.01. We notice that the algorithm strongly depends on the value of T, since for small variations of T there are significant changes of FA and DF. The ROC curves suggest that it is acceptable to choose T > 0.9.

The fourth row shows the results of the LOTS algorithm for a variation of the sensitivity from 10% to 110%. As discussed in [29] we use a small α parameter. To limit the computational burden, LOTS does not update the background image in every single frame; instead, the background update takes place in periods of N frames. For instance, an effective integration factor α = 0.0003 is achieved by adding approximately 1/13 of the current frame to the background in every 256th frame, or 1/6.5 of it in every 512th frame, using $B^t = B^{t-1} + \alpha D^t$ with $D^t = I^t - B^{t-1}$. In our case we have used intervals of 1024 frames (Fig. 12 (j)), 256 frames (Fig. 12 (k)) and 128 frames (Fig. 12 (l)), the best results being achieved in the first case. The latter two cases, Fig. 12 (k) and (l), present a right shift relative to (j), meaning that one obtains a larger number of false alarms. From this study we conclude that the best ROC curves are those associated with LOTS and SGM, since they have the smallest area under the curve.
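The threshold sweep just described can be summarized in a few lines. The sketch below assumes a hypothetical detector(frame, threshold, alpha) callable returning region masks, and reuses classify_matches from the region-matching sketch in Section IV-A; it is an outline of the procedure, not the code used to produce Fig. 12.

```python
import numpy as np

def roc_points(frames, gt, detector, thresholds, alpha=0.05):
    """For a fixed learning rate alpha, sweep the threshold T and
    average false alarms (FA) and detection failures (DF) over the
    sequence, yielding one ROC point per threshold value."""
    points = []
    for t in thresholds:
        fa_counts, df_counts = [], []
        for frame, gt_masks in zip(frames, gt):
            det_masks = detector(frame, threshold=t, alpha=alpha)
            # classify_matches: see the region-matching sketch above
            _, _, fa, df = classify_matches(gt_masks, det_masks)
            fa_counts.append(len(fa))
            df_counts.append(len(df))
        points.append((t, np.mean(fa_counts), np.mean(df_counts)))
    return points  # (T, mean FA, mean DF) triples tracing the ROC curve
```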



Fig. 12. Receiver Operating Characteristic curves for different values of α: BBS (first row: (a) α = 0.05, (b) α = 0.1, (c) α = 0.15); SGM (second row: (d) α = 0.01, (e) α = 0.05, (f) α = 0.15); MGM (third row: (g) α = 0.008, (h) α = 0.01, (i) α = 0.05); LOTS (fourth row, with background update at every: (j) 1024th frame, (k) 256th frame, (l) 128th frame).


B. Performance Evaluation

Table I (a), (b) shows the results obtained on the test sequence using the parameters selected in the previous study. The percentages of correct detections, detection failures, splits, merges and split-merges were obtained by normalizing the number of each type of event by the total number of moving objects in the image; their sum is 100%. The percentage of false alarms is defined by normalizing the number of false alarms by the total number of detected objects, and is therefore a number in the range 0-100%. Each algorithm is characterized in terms of correct detections, detection failures, splits, merges, split-merges and false alarms, as well as matching area. Two types of ground truth were used, corresponding to different interpretations of static objects: if a moving object stops and remains still, it is considered an active region in the first case (Table I (a)), while it is integrated into the background after one minute in the second case (Table I (b)). For example, if a car stops in front of the camera, it will always be an active region in the first case; in the second case it will be ignored after one minute.

Let us consider the first case, with results shown in Table I (a). In terms of correct detections, the best results are achieved by LOTS (91.2%), followed by SGM (86.8%). Concerning detection failures, LOTS (8.5%) followed by W4 (9.6%) outperform all the others; the worst results are obtained by MGM (13.1%). This is somewhat surprising, since the MGM method, based on the use of multiple Gaussians per pixel, performs worse than the SGM method based on a single Gaussian. We discuss this issue below. W4 has the highest percentage of splits, and the BBS and MGM methods tend to split the regions as well. The performance of the methods in terms of region merging is excellent: very few merges are observed in the segmented data. However, some methods tend to produce split-merge errors (e.g., W4, SGM and BBS); LOTS and MGM have the best scores in terms of split-merge errors.

Let us now consider the false alarms (false positives). LOTS (0.6%) is the best, and MGM and BBS are the worst; the LOTS, W4 and SGM methods are much better than the others in terms of false alarms, and LOTS has the best tradeoff between CD and FA. W4 produces many splits, but splits can often be overcome in tracking applications, since region matching algorithms are able to track the active regions even when they are split. The LOTS algorithm has the best performance if all the errors are considered equally important, and it also exhibits the best matching area in both situations.

In this study, the performance of the MGM method, based on mixtures of Gaussians, is unexpectedly low. During the experiments we observed the following: i) when an object undergoes a slow motion and stops, the algorithm ceases to detect it after a short period of time; ii) when an object enters the scene it is not well detected during the first few frames, since the Gaussian modes have to adapt to the new situation. This explains the percentage of splits in both tables: when a moving object stops, MGM starts to split the region until it disappears, becoming part of the background, and objects entering the scene cause some detection failures (during the first frames) and splits, when MGM starts to separate the foreground region from the background. Comparing the results in Table I (a) and (b), we can see that the performance of MGM improves: the detection failures are reduced, meaning that the stopped car is correctly integrated into the background. This produces an increase of correct detections by the same amount. However, we stress that the percentage of false alarms also increases, which means that the removal of the false positives is not stable: some frames contain, as small active regions, the object which stopped in the scene. Regarding the other methods, an increase in the false alarm percentage is expected, since these algorithms retain false positives throughout the sequence. The computational complexity of all methods was also studied to judge the performance of the five algorithms; details about the number of operations in each method are provided in Appendix VII-B.
TABLE I
PERFORMANCE OF FIVE OBJECT DETECTION ALGORITHMS.

(a)
%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   84.3   81.6   86.8   85.0   91.2
Detection Failures   12.2   9.6    11.5   13.1   8.5
Splits               2.9    5.4    0.2    1.9    0.3
Merges               0      1.0    0      0      0
Split/Merges         0.6    1.8    1.5    0      0
False Alarms         22.5   8.5    11.3   24.3   0.6
Matching Area        64.7   50.4   61.9   61.3   78.8

(b)
%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   83.5   84.0   86.4   85.4   91.0
Detection Failures   12.4   8.5    11.7   12.0   8.8
Splits               3.3    4.3    0.2    2.6    0.3
Merges               0      0.8    0      0      0
Split/Merges         0.8    1.8    1.7    0      0
False Alarms         27.0   15.2   17.0   28.2   7.2
Matching Area        61.3   53.6   61.8   65.6   78.1

VI. CONCLUSIONS

This paper proposes a framework for the evaluation of object detection algorithms in surveillance applications. The proposed method is based on the comparison of the detector output with a ground truth segmented sequence sampled at one frame per second. The difference between the two segmentations is evaluated and the segmentation errors are classified into detection failures, false alarms, splits, merges and split-merges. To cope with ambiguous situations, in which we do not know whether two or more objects belong to a single active region or to several regions, we consider multiple interpretations of the ambiguous frames. These interpretations are controlled by the user through a graphical interface. The proposed method provides a statistical characterization of the object detection algorithm by measuring the percentage of each type of error. The user can thus select the best algorithm for a specific application, taking into account the influence of each type of error on the performance of the overall system. For example, in object tracking, detection failures are worse than splits; we should therefore select a method with fewer detection failures, even if it has more splits than another method. Five algorithms were considered in this paper to illustrate the proposed evaluation method: Basic Background Subtraction (BBS), W4, Single Gaussian Model (SGM), Multiple Gaussian Model (MGM) and the Lehigh Omnidirectional Tracking System (LOTS). The best results were achieved by the LOTS and SGM algorithms.


Acknowledgement: We are very grateful to the three anonymous reviewers for their useful comments and suggestions. We also thank R. Oliveira and P. Ribeiro for kindly providing the code of the LOTS detector.


VII. APPENDIX

A. Merge Regions Algorithm

The pseudo-code of the region labelling algorithm is given in Algorithms 1 and 2. Algorithm 1 describes the first step, i.e., the generation of the label configurations; when the same label is assigned to two different regions, the regions are considered merged. Algorithm 2 describes the second step, which checks and eliminates label sequences containing invalid merges: every time the same label is assigned to a pair of regions, we define a strip connecting the mass centers of the two regions and check whether the strip is intersected by any other region. If so, the labelling sequence is considered invalid. In these algorithms, N denotes the number of objects, label is a labelling sequence, M is the matrix of all label configurations, and M_FINAL is a matrix which contains the information (final label configurations) needed to create the merges.

Algorithm 1 Main
1: N ← Num;
2: M(1) ← 1;
3: for t = 2 to N do
4:   AUX ← [ ];
5:   for i = 1 to size(M, 1) do
6:     label ← max(M(i, :)) + 1;
7:     AUX ← [AUX; [repmat(M(i, :), label, 1), (1 : label)^T]];
8:   end for
9:   M ← AUX;
10: end for
11: M_FINAL ← FinalConfiguration(M);

Fig. 13. Generation of the label sequences for the example in Fig. 14.

To illustrate the purpose of Algorithms 1 and 2 we consider the example illustrated in Fig. 14, where each rectangle in the image represents an active region. Algorithm 1 computes the leaves of the graph shown in Fig. 13, containing all label sequences.


Algorithm 2 M_FINAL = FinalConfiguration(M)
1: M_FINAL ← [ ];
2: for i = 1 to length(M) do
3:   Compute the centroids of the objects to be linked in M(i, :);
4:   Link the centroids with strip lines;
5:   if the strip lines do not intersect another object region then
6:     M_FINAL ← [M_FINAL^T, M(i, :)^T]^T;
7:   end if
8: end for
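A Python sketch of the two steps follows. Step 1 mirrors Algorithm 1 (restricted-growth label sequences); step 2 approximates Algorithm 2 by sampling points along the centroid-joining strip and testing them against bounding boxes, whereas the paper tests the strip against the actual region supports, so the geometric test here is an assumption.

```python
from itertools import combinations

def label_configurations(n):
    """Step 1 (Algorithm 1): enumerate all label sequences for n regions;
    equal labels mean merged regions."""
    configs = [[1]]
    for _ in range(n - 1):
        configs = [c + [l] for c in configs for l in range(1, max(c) + 2)]
    return configs

def final_configurations(configs, centroids, boxes):
    """Step 2 (Algorithm 2): keep only configurations in which the strip
    joining the centroids of two merged regions does not cross a region
    outside the merge. `boxes` are (xmin, ymin, xmax, ymax) stand-ins
    for the region supports."""
    def strip_blocked(i, j, others):
        (x1, y1), (x2, y2) = centroids[i], centroids[j]
        for k in others:
            xmin, ymin, xmax, ymax = boxes[k]
            # sample points along the segment and test box membership
            for s in [p / 20.0 for p in range(21)]:
                x, y = x1 + s * (x2 - x1), y1 + s * (y2 - y1)
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    return True
        return False

    valid = []
    for cfg in configs:
        pairs = [(i, j) for i, j in combinations(range(len(cfg)), 2)
                 if cfg[i] == cfg[j]]
        if all(not strip_blocked(i, j,
                                 [k for k in range(len(cfg)) if cfg[k] != cfg[i]])
               for i, j in pairs):
            valid.append(cfg)
    return valid
```

For four collinear rectangles as in Fig. 14, label_configurations(4) yields the 15 rows of (24)(a), and the pruning step should retain the 8 configurations of (24)(b).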

Fig. 14. Four rectangles A, B, C, D representing active regions in the image.

Algorithm 2 checks each sequence taking into account the relative positions of the objects in the image. For example, configurations 1212 and 1213 are considered invalid, since object A cannot be merged with C (see Fig. 14). Equations (24)(a) and (b) show the output of the first and the second step, respectively. The labelling sequences considered valid (the contents of the matrix M_FINAL) give rise to the images shown in Fig. 15.
$M = \begin{bmatrix} 1&1&1&1 \\ 1&1&1&2 \\ 1&1&2&1 \\ 1&1&2&2 \\ 1&1&2&3 \\ 1&2&1&1 \\ 1&2&1&2 \\ 1&2&1&3 \\ 1&2&2&1 \\ 1&2&2&2 \\ 1&2&2&3 \\ 1&2&3&1 \\ 1&2&3&2 \\ 1&2&3&3 \\ 1&2&3&4 \end{bmatrix} \;\;(a) \qquad M_{FINAL} = \begin{bmatrix} 1&1&1&1 \\ 1&1&1&2 \\ 1&1&2&2 \\ 1&1&2&3 \\ 1&2&2&2 \\ 1&2&2&3 \\ 1&2&3&3 \\ 1&2&3&4 \end{bmatrix} \;\;(b)$  (24)

B. Computational Complexity

The computational complexity was also studied to judge the performance of the five algorithms. Next, we provide comparative data on computational complexity using Big-O analysis. Let us define the following variables:


Fig. 15. Valid merges generated from the example in Fig. 14.

N, number of images in the sequence;
L, C, number of lines and columns of the image;
R, number of regions detected in the image;
N_g, number of Gaussians.

The BBS, W4, SGM, MGM and LOTS methods share several common operations, namely: i) morphological operations for noise cleaning, ii) computation of the areas of the regions, and iii) label assignment. The complexity of these three operations is

$K = \underbrace{(2(\ell\,c) - 1)\,(L\,C)}_{\text{morphological op.}} + \underbrace{(L\,C) + R}_{\text{region areas op.}} + \underbrace{R\,(L\,C)}_{\text{labels op.}}$  (25)

where ℓ, c are the kernel dimensions (ℓ·c = 9; 8-connectivity is used), L, C are the image dimensions and R is the number of detected regions. The first term, 2(ℓ·c) − 1, is the number of products and summations required for the convolution at each pixel of the image. The second term, (L·C) + R, is the number of differences taken to compute the areas of the regions in the image. Finally, the term R(L·C) is the number of operations needed to label all the regions in the image.

BBS Algorithm. The complexity of the BBS is
$O\Big(\big[\underbrace{11\,(L\,C)}_{\text{threshold op.}} + K\big]\,N\Big)$  (26)

where 11(L·C) is the number of operations required to perform the thresholding step (see (1)), which involves 3(L·C) differences and 8(L·C) logical operations.

W4 Algorithm. The complexity of this method is

$O\Big(\big[\underbrace{2\,[2p^3 + (L\,C)(p + (p-1))]}_{\text{rgb2gray op.}} + \underbrace{9\,(L\,C)}_{\text{threshold op.}} + K + K_{W4}\big]\,N\Big)$  (27)


where the first term is related to the conversion of the images to grayscale (p = 3 for the RGB space). The second term corresponds to the threshold operation (see (2)), which requires 9(L·C) operations (8 logical operations and 1 difference per pixel). The term K_W4 corresponds to the background subtraction and to the morphological operations inside the bounding boxes of the foreground regions,
$K_{W4} = R\,\big[\underbrace{9\,(L_r\,C_r)}_{\text{threshold op.}} + \underbrace{(2(\ell\,c) - 1)\,(L_r\,C_r)}_{\text{morphological op.}}\big] + \underbrace{(L\,C) + R}_{\text{region areas op.}} + \underbrace{R\,(L\,C)}_{\text{labels op.}}$  (28)

where L_r, C_r are the dimensions of the bounding boxes, assuming that the bounding boxes of the active regions all have the same length and width.

SGM Algorithm. The complexity of the SGM method is
$O\Big(\big[\underbrace{p\,[2p\,(L\,C)]}_{\text{rgb2yuv op.}} + \underbrace{28\,(L\,C)}_{\text{likelihood op.}} + \underbrace{(L\,C)}_{\text{threshold op.}} + K\big]\,N\Big)$  (29)

The first term is related to the conversion of the images to the YUV color space (in (29), p = 3). The second term is the number of operations required to compute the likelihood measure (see (5)). The third term is related to the threshold operation, which classifies a pixel as foreground if the likelihood is greater than a threshold and as background otherwise.

MGM Algorithm. The number of operations of the MGM method is
$O\Big(\big[\underbrace{N_g\,(136\,(L\,C))}_{\text{mixture modelling}} + \underbrace{2\,(2N_g - 1)\,(L\,C)}_{\text{norm. and mixture op.}} + K\big]\,N\Big)$  (30)

The first term depends on the number of Gaussians N_g and is related to the following operations: i) matching operation, 70(L·C); ii) weight update, 3(L·C) (see (11)); iii) background update, 3 × 8(L·C) (see (8)); iv) covariance update for all color components, 3 × 13(L·C) (see (9)). The second term accounts for: i) weight normalization, (2N_g − 1)(L·C), and ii) computation of the Gaussian mixture for all pixels, (2N_g − 1)(L·C).

LOTS Algorithm. The complexity of the LOTS method is
$O\Big(\big[\underbrace{2p^3 + (L\,C)(p + (p-1))}_{\text{rgb2gray op.}} + \underbrace{11\,(L\,C) + (2(L_b\,C_b) - 1)\,n_b + (2(\ell\,c) - 1)\,(L_{rsize}\,C_{rsize}) + (L_{rsize}\,C_{rsize})}_{\text{QCC op.}} + K\big]\,N\Big)$  (31)

The first term is related to the conversion of the images and is similar to the first term in (27). The second term is related to the QCC algorithm; 11(L·C) operations are needed to compute (17) and (18).


TABLE II
THE SECOND COLUMN GIVES THE SIMPLIFIED EXPRESSION FOR EQUATIONS (26), (27), (29), (30), (31); THE THIRD COLUMN GIVES THE TOTAL NUMBER OF OPERATIONS.

Method   Operations per frame          Total
BBS      1 + 30 (L·C)                  3.3 × 10^6
LOTS     55 + (35 + 145/64) (L·C)      4.1 × 10^6
W4       760 + 40 (L·C)                4.4 × 10^6
SGM      1 + 66 (L·C)                  7.2 × 10^6
MGM      1 + 437 (L·C)                 48.3 × 10^6

The QCC analysis is computed on low-resolution images P_H, P_L. This is accomplished by converting each block of L_b × C_b pixels (in the high-resolution images) into one element of the new matrices P_H, P_L; each element of P_H, P_L corresponds to the active pixels of one block in the respective image. This task requires (2(L_b·C_b) − 1)·n_b operations (second term of the QCC cost in (31)), where (L_b·C_b) is the size of each block and n_b is the number of blocks in the image. A morphological operation (4-connectivity is used) is performed over P_H, taking (2(ℓ·c) − 1)(L_rsize·C_rsize) operations, where (L_rsize·C_rsize) is the dimension of the resized images. The target candidates are obtained by comparing P_H and P_L; this task takes (L_rsize·C_rsize) operations (fourth term of the QCC cost).

As an example, Table II shows the complexity of the five algorithms assuming the following conditions for each frame:
- the kernel dimensions ℓ·c = 9;
- the block dimensions L_b × C_b = 8 × 8, i.e., (L_rsize·C_rsize) = (L·C)/64 (for the LOTS method);
- the number of Gaussians N_g = 3 (for the MGM method);
- a single region is detected, with an area of 25 pixels (R = 1, L_r·C_r = 25);
- the image dimension is (L·C) = 288 × 384.

From the table, it can be concluded that four of the algorithms (BBS, LOTS, W4, SGM) have similar computational complexity, whilst MGM is more complex, requiring a higher computational cost.
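The totals in Table II follow from the simplified expressions by plugging in L·C = 288 × 384; a quick arithmetic check:

```python
# Sanity check of Table II: total operation counts per frame for an
# image of L*C = 288*384 pixels, using the simplified expressions.
LC = 288 * 384
counts = {
    "BBS":  1 + 30 * LC,
    "LOTS": 55 + (35 + 145 / 64) * LC,
    "W4":   760 + 40 * LC,
    "SGM":  1 + 66 * LC,
    "MGM":  1 + 437 * LC,
}
for name, ops in counts.items():
    print(f"{name}: {ops / 1e6:.1f} x 10^6 operations")
# -> about 3.3, 4.1, 4.4, 7.3 and 48.3 million, in line with Table II.
```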


REFERENCES

[1] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 780-785, July 1997.
[2] C. Stauffer, W. Eric, and L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 747-757, August 2000.
[3] S. J. McKenna and S. Gong, "Tracking colour objects using adaptive mixture models," Image Vision Computing, vol. 17, pp. 225-231, 1999.
[4] N. Ohta, "A statistical approach to background suppression for surveillance systems," in Proceedings of IEEE Int. Conference on Computer Vision, 2001, pp. 481-486.
[5] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Who? when? where? what? a real time system for detecting and tracking people," in IEEE International Conference on Automatic Face and Gesture Recognition, April 1998, pp. 222-227.
[6] M. Seki, H. Fujiwara, and K. Sumi, "A robust background subtraction method for changing background," in Proceedings of IEEE Workshop on Applications of Computer Vision, 2000, pp. 207-213.
[7] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel, "Towards robust automatic traffic scene analysis in real-time," in Proceedings of Int. Conference on Pattern Recognition, 1994, pp. 126-131.
[8] R. Collins, A. Lipton, and T. Kanade, "A system for video surveillance and monitoring," in Proc. American Nuclear Society (ANS) Eighth Int. Topical Meeting on Robotic and Remote Systems, Pittsburgh, PA, April 1999, pp. 25-29.
[9] H. V. Trees, Detection, Estimation, and Modulation Theory. John Wiley and Sons, 2001.
[10] T. H. Chalidabhongse, K. Kim, D. Harwood, and L. Davis, "A perturbation method for evaluating background subtraction algorithms," in Proc. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS 2003), Nice, France, October 2003.
[11] X. Gao, T. E. Boult, F. Coetzee, and V. Ramesh, "Error analysis of background adaption," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2000, pp. 503-510.
[12] F. Oberti, A. Teschioni, and C. S. Regazzoni, "ROC curves for performance evaluation of video sequences processing systems for surveillance applications," in IEEE Int. Conf. on Image Processing, vol. 2, 1999, pp. 949-953.
[13] J. Black, T. Ellis, and P. Rosin, "A novel method for video tracking performance evaluation," in Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, 2003, pp. 125-132.
[14] P. Correia and F. Pereira, "Objective evaluation of relative segmentation quality," in Int. Conference on Image Processing, 2000, pp. 308-311.
[15] C. E. Erdem, B. Sankur, and A. M. Tekalp, "Performance measures for video object segmentation and tracking," IEEE Trans. Image Processing, vol. 13, no. 7, pp. 937-951, 2004.
[16] V. Y. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, H. Li, D. Doermann, and T. Drayer, "Performance evaluation of object detection algorithms," in Proceedings of 16th Int. Conf. on Pattern Recognition (ICPR'02), vol. 3, 2002, pp. 965-969.
[17] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 809-830, August 2000.
[18] T. Boult, R. Micheals, X. Gao, and M. Eckmann, "Into the woods: Visual surveillance of non-cooperative camouflaged targets in complex outdoor settings," Proceedings of the IEEE, October 2001, pp. 1382-1402.
[19] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2002.
[20] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts and shadows in video streams," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 10, pp. 1337-1342, 2003.
[21] Y.-F. Ma and H.-J. Zhang, "Detecting motion object by spatio-temporal entropy," in IEEE Int. Conf. on Multimedia and Expo, Tokyo, Japan, August 2001.
[22] R. Souvenir, J. Wright, and R. Pless, "Spatio-temporal detection and isolation: Results on the PETS2005 datasets," in Proceedings of the IEEE Workshop on Performance Evaluation in Tracking and Surveillance, 2005.
[23] H. Sun, T. Feng, and T. Tan, "Spatio-temporal segmentation for video surveillance," in IEEE Int. Conf. on Pattern Recognition, vol. 1, Barcelona, Spain, September, pp. 843-846.
[24] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh, "Background modeling and subtraction of dynamic scenes," in Proceedings of the ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 1305-1312.
[25] J. Zhong and S. Sclaroff, "Segmenting foreground objects from a dynamic, textured background via a robust Kalman filter," in Proceedings of the ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 44-50.
[26] N. T. Siebel and S. J. Maybank, "Real-time tracking of pedestrians and vehicles," in Proc. of IEEE Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
[27] R. Cucchiara, C. Grana, and A. Prati, "Detecting moving objects and their shadows: an evaluation with the PETS2002 dataset," in Proceedings of Third IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2002) in conj. with ECCV 2002, Pittsburgh, PA, May 2002, pp. 18-25.
[28] Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa, "A system for video surveillance and monitoring: VSAM final report," Robotics Institute, Carnegie Mellon University, Tech. Rep. CMU-RI-TR-00-12, May 2000.
[29] T. Boult, R. Micheals, X. Gao, W. Y. P. Lewis, C. Power, and A. Erkan, "Frame-rate omnidirectional surveillance and tracking of camouflaged and occluded targets," in Second IEEE International Workshop on Visual Surveillance, 1999, pp. 48-55.

HAREM 2005 - International Workshop on Human Activity Recognition and Modelling, Oxford, UK, September 2005

Segmentation and Classification of Human Activities

J. C. Nascimento¹ (jan@isr.ist.utl.pt), M. A. T. Figueiredo² (mtf@lx.it.pt), J. S. Marques³ (jsm@isr.ist.utl.pt)
¹,³ Instituto de Sistemas e Robótica, ² Instituto de Telecomunicações
Instituto Superior Técnico, 1049-001 Lisboa, PORTUGAL

Abstract

This paper describes an algorithm for segmenting and classifying human activities from video sequences of a shopping center. These activities comprise entering or exiting a shop, passing, or browsing in front of shop windows. The proposed approach recognizes these activities by using a priori knowledge of the layout of the shopping view. Human actions are represented by a bank of switched dynamical models, each tailored to describe a specific motion regime. Experimental tests illustrate the effectiveness of the proposed approach with synthetic and real data.

Keywords: Surveillance, Segmentation, Classification, Human Activities, Minimum Description Length.

1 Introduction

The analysis of human activities is an important computer vision research topic with applications in surveillance, e.g. in developing automated security applications. In this paper, we focus on recognizing human activities in a shopping center. In commercial spaces it is common to have many surveillance cameras, and the monitor room is usually equipped with a large set of monitors used by a human operator to watch over the areas observed by the cameras. This requires a considerable effort from the human operator, who has to somehow multiplex his/her attention. In recent years a considerable effort has been devoted to developing automatic surveillance systems that provide information about which activities take place in a given space. With such a system it would be possible to monitor the actions of individuals, determining their nature and discerning common activities from inappropriate behavior (for example, standing for a long period of time at the entrance of a shop, or fighting). In this paper, we aim at labelling common activities taking place in the shopping space.¹ Activities are recognized from motion patterns associated with each person tracked by the system. Motion is described by a sequence of displacements of the 2D centroid (mean position) of each person's blob. The trajectory is modelled using multiple dynamical models with a switching mechanism. Since the trajectory is described by its appearance, we compute the statistics needed for the identification of the dynamical models involved in a trajectory.

The rest of the paper is organized as follows. Section 2 deals with related work. Section 3 describes the statistical activity model. Section 4 derives the segmentation algorithm. Section 5 reports experimental results with synthetic data and real video sequences. Section 6 concludes the paper.

2 Related Work

The analysis of human activities has been extensively addressed in several ways, using different types of features and inference methods. Typically, a set of motion features is extracted from the video signal and an inference model is used to classify it into one of c possible classes. For example, in [16] the human body is approximated by a set of segments, and atomic activities are then defined as vectors of temporal measurements which capture the evolution of the five body parts. In other works the human body is simply represented by the mass center of its active region (blob) in the image plane [12], or by the body blob as in [4]. The activity is then represented by the trajectory obtained from the blob center, or from the correspondence of body blob regions, respectively. Other works try to characterize the human activity directly from the video signal, without segmenting the active regions. In [2] human activities are characterized by temporal templates, which try to convey information about where and how motion is performed. Two templates are created: a binary motion-energy image, which represents where the motion has occurred in the whole sequence, and a scalar motion-history image, which represents how motion occurs for each activity. Motion patterns have also been used in [9], based on the concept of recency: this work integrates several frames into a single image, assigning higher weights to the most recent frames. In [10] the human motion is characterized by the optical flow.

Several inference techniques, both static and dynamic, have been used for the recognition of human activities. In [12] single-person and person-to-person interactions are modelled by Hidden Markov Models (HMMs) and Coupled Hidden Markov Models (CHMMs); both techniques are used to characterize the evolution of the person's mass center along the video sequence. In [4] Bayesian networks are used to make inferences about the events. In [11] activities are modelled using banks of switched dynamical models, each of which is tailored to a specific motion regime. Geometric constraints have also been used, e.g., the layout of the surveillance region [13, 3]. In [1, 3] Finite State Machines (FSM) are used for gesture and activity recognition; the latter uses prior knowledge about the scene, where regions of interest are defined (e.g., entrances and exits). When the human motion is characterized by global features, static pattern recognition methods can be used to classify the human activities; in [15] neural networks are used for this purpose. The previous methods have been used to deal with single pedestrians or a very limited number of pedestrians [12]. To deal with the interaction among multiple pedestrians, Bayesian networks have been proposed [8], since they are able to represent the dependencies among several random variables.

¹This work was partially supported by FCT under project CAVIAR (IST-2001-37540). This work is integrated in project CAVIAR, which has the general goal of representing and recognizing contexts and situations. An introduction and the main goals of the project can be found at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm

3 Statistical Model

We represent a human activity by the trajectory of the person's centroid. The time evolution of this feature is modelled by a dynamical model. Since a single model may not suffice to describe an entire trajectory, we use multiple dynamical models and a switching mechanism. In this paper, a trajectory is represented by a sequence of 2D locations, x = (x_1, ..., x_n), with x_t ∈ IR². We assume that the trajectory is the output of a bank of switched dynamical systems of the form

$\Delta x_t = x_t - x_{t-1} = \mu_{k_t} + w_t,$  (1)

where k_t ∈ {1, ..., c} is the label of the active model at time instant t, μ_{k_t} is a (model-dependent) displacement vector, and the w_t ∼ N(0, Q_{k_t}) are independent Gaussian random variables with covariances Q_{k_t}. Since the observations are {x_t; t ∈ N}, x_t ∈ R^d (d is the dimension of the observation vector), and (1) involves Δx_t instead of x_t, equation (1) describes an independent increment process, given k_t, as shown in Fig. 1.
Figure 1: Architecture of the proposed approach.

Finally, we assume that the sequence of model labels is composed of T constant segments: {k_1, ..., k_1, k_2, ..., k_2, ..., k_T, ..., k_T}.
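For concreteness, a small Python sketch of the generative model (1); the model labels, means and covariances below are invented for illustration and are not the values estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(segments, mu, q, x0=(0.0, 0.0)):
    """Generate a trajectory from the switched model (1): within each
    segment, displacements are drawn from N(mu[k], q[k]). `segments`
    is a list of (model_label, length) pairs; mu and q map labels to
    per-model mean displacements and covariances."""
    x = [np.asarray(x0, dtype=float)]
    for k, length in segments:
        steps = rng.multivariate_normal(mu[k], q[k], size=length)
        for w in steps:
            x.append(x[-1] + w)
    return np.array(x)

# Example: a "browsing"-like trajectory = moving right, stopping,
# then moving right again (three constant segments).
mu = {0: np.array([3.0, 0.0]), 1: np.array([0.0, 0.0])}
q = {0: 0.5 * np.eye(2), 1: 0.1 * np.eye(2)}
traj = simulate_trajectory([(0, 40), (1, 20), (0, 40)], mu, q)
```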

4 Segmentation and Classification Algorithm

In order to segment and classify the different activities, we first observe that all trajectories concerning a common activity follow a typical route. Fig. 2 shows trajectories corresponding to a person entering a shop (left), leaving a shop (middle), or just passing in front of a shop (right). This work assumes that elementary actions such as moving upwards, stopped, moving downwards, moving left and moving right (i.e., M = 5) are representative of the trajectories. The underlying idea is: given a test trajectory x = (x_1, ..., x_n), segment it into its elementary actions and classify the activity. The number of segments will depend on the activity being considered, as described later.

4.1 Model Parameter Estimation

To segment and classify a given trajectory we must first obtain the parameters of each dynamical model. To accomplish this, we collect tens of trajectory samples for each model.

Figure 2: Examples of three different activities (entering, exiting, passing).

From x we obtain the displacements Δx, and from them the sub-sequences Δx^i known to have been generated by the i-th model. Defining X^i as the set containing all the displacements assigned to the i-th model in the training set, we have, for the i-th model:

$\mu_i = \frac{1}{\sharp X^i} \sum_t \Delta x^i_t, \qquad Q_i = \frac{1}{\sharp X^i} \sum_t (\Delta x^i_t - \mu_i)(\Delta x^i_t - \mu_i)^T,$  (2)

where μ_i and Q_i are the standard estimates of the mean and of the covariance matrix, respectively.
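A minimal sketch of (2), assuming the per-model training displacements have already been collected:

```python
import numpy as np

def estimate_model(displacements):
    """Standard mean/covariance estimates of equation (2) for one
    elementary-motion model, given an array of shape (num_samples, 2)
    of displacements known to come from that model."""
    d = np.asarray(displacements, dtype=float)
    mu = d.mean(axis=0)
    centered = d - mu
    q = centered.T @ centered / len(d)
    return mu, q
```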

4.2 Segmentation and Classification

Having defined the set of models and the corresponding parameters, one can now classify a test trajectory x. One way to attain this goal is to compute the likelihood of x under the model space. In this paper, the activity depends on the number of model switchings. In Fig. 2, we see that passing can be described using just one model, while the activities entering and exiting can be described using two dynamical models. The fourth activity considered, browsing, requires three models to be described: we define browsing as the situation in which the person is walking, stops to look at the shop window, and restarts walking. This behavior was observed in all the other samples of the activities occurring in this context. This means that we have to estimate the time instants at which the model switching happens. Assuming that the sequence x has n samples and is described by T segments (with T known), the log-likelihood is

$L(m_1,\dots,m_T, t_1,\dots,t_{T-1}) = \log p(x_1,\dots,x_n \mid m_1,\dots,m_T, t_1,\dots,t_{T-1})$  (3)

where m_1, ..., m_T is the sequence of model labels describing the trajectory and t_i, for i = 1, ..., T−1, is the time instant at which the switch from model m_i to m_{i+1} occurs (if T = 1, there is no switching). Due to the conditional independence assumption underlying (1), the log-likelihood can be written as

$L(x_1,\dots,x_n \mid m_1,\dots,m_T, t_1,\dots,t_{T-1}) = \sum_{j=1}^{T} \sum_{i=t_{j-1}}^{t_j} \log p(\Delta x_i \mid m_j) = \sum_{j=1}^{T} \sum_{i=t_{j-1}}^{t_j} \log \mathcal{N}(\Delta x_i \mid \mu_{m_j}, Q_{m_j})$  (4)

where we define t_0 = 1, T is the number of segments and the t_j are the switching times. Assuming that T is known, we can segment the sequence (i.e., estimate m_1, ..., m_T and t_1, ..., t_{T−1}) using the maximum-likelihood approach:

$\hat m_1,\dots,\hat m_T, \hat t_1,\dots,\hat t_{T-1} = \arg\max L(x_1,\dots,x_n \mid m_1,\dots,m_T, t_1,\dots,t_{T-1})$  (5)

This maximization can be performed in a nested way,

$\hat t_1,\dots,\hat t_{T-1} = \arg\max_{t_1,\dots,t_{T-1}}\; \max_{m_1,\dots,m_T} L(x_1,\dots,x_n \mid m_1,\dots,m_T, t_1,\dots,t_{T-1})$  (6)

In fact, the inner maximization can be decoupled as

$\max_{m_1,\dots,m_T} L(x_1,\dots,x_n \mid m_1,\dots,m_T, t_1,\dots,t_{T-1}) = \sum_{j=1}^{T} \max_{m_j} \sum_{i=t_{j-1}}^{t_j} \log p(\Delta x_i \mid m_j)$  (7)

where the maximization with respect to each m_j is a simple maximum-likelihood classification of the sub-set of samples (Δx_{t_{j-1}}, ..., Δx_{t_j}) into one of a set of Gaussian classes. Finally, the maximization with respect to t_1, ..., t_{T−1} is done by exhaustive search (this is never too expensive, since we consider a maximum of three segments).
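The nested maximization (6)-(7) can be implemented by exhaustive search over switching times with cumulative per-model log-likelihoods, as in this sketch; SciPy's multivariate normal supplies the Gaussian densities, and with at most three segments the search remains cheap.

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def segment_ml(dx, models, t_segments):
    """Maximum-likelihood segmentation of a displacement sequence dx
    (shape (n, 2)) into t_segments constant-label segments, via
    exhaustive search over switching times (6), with the inner model
    choice decoupled per segment (7). `models` is a list of (mu, Q)."""
    n = len(dx)
    logp = np.array([multivariate_normal(mu, q).logpdf(dx)
                     for mu, q in models])          # (c, n) per-sample log-lik
    cum = np.concatenate([np.zeros((len(models), 1)),
                          np.cumsum(logp, axis=1)], axis=1)

    best = (-np.inf, None, None)
    for cuts in combinations(range(1, n), t_segments - 1):
        bounds = (0,) + cuts + (n,)
        ll, labels = 0.0, []
        for a, b in zip(bounds[:-1], bounds[1:]):
            seg = cum[:, b] - cum[:, a]             # log-lik of each model on [a, b)
            ll += seg.max()
            labels.append(int(seg.argmax()))
        if ll > best[0]:
            best = (ll, labels, cuts)
    return best  # (log-likelihood, model labels, switching times)
```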

4.3 Estimating the number of models of the activity

4.3.1 MDL Criterion

In the previous section we derived the segmentation criterion assuming that the number of segments T is known. As is well known, the same criterion cannot be used to select T, as this would always return the largest possible number of segments. We are thus in the presence of a model selection problem, which we address using the minimum description length (MDL) criterion [14]. The MDL criterion for selecting T is

$\hat T = \arg\min_T \Big\{ -\log p(x_1,\dots,x_n \mid \hat m_1,\dots,\hat m_T, \hat t_1,\dots,\hat t_{T-1}) + M(\hat m_1,\dots,\hat m_T, \hat t_1,\dots,\hat t_{T-1}) \Big\}$  (8)

where M(m̂_1, ..., m̂_T, t̂_1, ..., t̂_{T−1}) is the number of bits required to encode the selected model indices and the estimated switching times. Notice that we do not have the usual (1/2) log n term, because the real-valued model parameters (means and covariances) are assumed fixed (previously estimated). Finally, it is easy to conclude that

$M(\hat m_1,\dots,\hat m_T, \hat t_1,\dots,\hat t_{T-1}) = T \log c + (T-1) \log n$  (9)

where T log c is the code length for the model indices m̂_1, ..., m̂_T, since each belongs to {1, ..., c}, and (T−1) log n is the code length for t̂_1, ..., t̂_{T−1}, since each belongs to {1, ..., n}; we have ignored the fact that two switchings cannot occur at the same time, because T ≪ n.
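On top of the previous sketch, the MDL rule (8)-(9) reduces to a penalized comparison over T:

```python
import numpy as np

def select_t(dx, models, t_max=3, c=5):
    """MDL selection of the number of segments (8)-(9): penalize the
    best log-likelihood of each T by T*log(c) + (T-1)*log(n).
    Uses segment_ml from the previous sketch."""
    n = len(dx)
    best_t, best_cost, best_seg = None, np.inf, None
    for t in range(1, t_max + 1):
        ll, labels, cuts = segment_ml(dx, models, t)
        cost = -ll + t * np.log(c) + (t - 1) * np.log(n)
        if cost < best_cost:
            best_t, best_cost, best_seg = t, cost, (labels, cuts)
    return best_t, best_seg
```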

5 Experimental results

This section presents results with synthetic and real data. In the synthetic case, we performed Monte Carlo tests. We considered five models (c = 5), shown in Fig. 3. The synthetic models shown in Fig. 3 (a) were obtained by simulating four activities of a person, using the generative model in (1). Fig. 4 shows examples of the activities (the trajectory shape of leaving is the same as that of entering, but with opposite direction). Here, the thin (green) rectangles correspond to areas where the trajectory begins; the first sample of x in these areas is random, because the agent may appear at random places in the scene. The wide (yellow) rectangle is the area in which a model switching occurs. In this figure the trajectories are generated with two segments (entering, leaving, passing) and with three segments (browsing). For each activity we generate 100 test samples using (1) and classify each of them into one of the four classes. Fig. 5 shows the displacements Δx_t (black dots) of the test sequences (entering and passing) overlapped with the five models; we can see that the displacements lie on the right-up clusters (entering) and the right cluster (passing). In this experiment, all the test sequences were correctly classified (100% accuracy).

Figure 3: Five models are considered to describe the trajectories; each color corresponds to a different model. Synthetic case (a), real case (b).

We also generated different test trajectories, because exiting and entering may occur in directions different from the ones in Fig. 4. These examples are illustrated in Fig. 6. In this new experiment the same 100% accuracy was obtained.

Figure 4: Examples of synthetic activities (performed in the left-right direction): (a) entering, (b) passing, (c) browsing.
Figure 5: Five models with the displacements (black dots) of the test activities: (a) entering, (b) passing.

The proposed algorithm was also tested with real data. The video sequences were acquired in the context of the EC-funded project CAVIAR. All the video sequences comprise human activities in an indoor plaza and shopping center, with observations of individuals and small groups of people. Ground truth was hand-labelled for all sequences.² Fig. 7 shows the bounding boxes as well as the centroids, which are the information used for the segmentation. As in the synthetic case, we also generate the statistics of the considered models; the procedure is the same as in the previous case, using training sequences. Fig. 3 (b) shows the clusters of the models. Fig. 8 shows several activities performed at the shopping center, with the time instants of the model switching marked with small red circles. From this experiment, it can be seen that the proposed approach correctly determines the switching times between models. We have tested the proposed approach on more than 40 trajectories from 25 movies of about 5 minutes each; we present the results of some of those activities in Tables 1 and 2, which show the penalized log-likelihood values (8) of each test sequence. The first table refers to activities performed in the left-right direction, whilst the second table reports activities performed in the opposite direction. In the first table the classes referring to entering, exiting, passing and browsing are right-upwards, downwards-right, right, and right-stop-right, respectively, whereas in the second table the classes are left-upwards, downwards-left, left, and left-stop-left. It can be observed that the classifier correctly assigns the activities to the corresponding classes, exhibiting good results as in the previous synthetic examples.

²The ground truth labelled video sequences are provided at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.

6 Conclusions

In this paper we have proposed and tested an algorithm for the modelling, segmentation, and classification of human activities in a constrained environment. The proposed approach uses switched dynamical models to represent the human trajectories. It was illustrated that the switching time instants are effectively well determined, despite the significant random perturbations that the trajectory may contain. It was also demonstrated that the proposed approach provides good results with synthetic and real data obtained in a shopping center, and that the proposed method is able to effectively recognize instances of the learned activities. The activities studied herein can be interpreted as atomic, in the sense that they are simple events; compound actions or complex events can be represented as concatenations of the activities studied in this paper. This is one of the issues to be addressed in the future.

Acknowledgement: We would like to thank Prof. José Santos-Victor of ISR and the members of the CAVIAR project for providing video data of human activities with the ground truth information.

Figure 6: Synthetic activities with different dynamic models (entering, exiting, passing).

Figure 7: Bounding boxes and centroids of the pedestrians performing activities.

Figure 8: Samples of different activities. The large circles are the computed time instants where the model switches: entering (first column); exiting (second column); browsing (third column).

Test trajectory   Entering   Exiting   Passing   Browsing
E1                 187.2      401.0     359.7     299.1
E2                 157.3      340.0     311.0     265.6
Ex1                212.7      116.1     232.5     196.5
Ex2                217.0      102.4     183.3     180.0
P1                 100.3      104.6      88.8     160.7
P2                 107.4       93.8      90.2     156.0
B                  169.1      178.7     147.7      98.1

Table 1: Penalized log-likelihood of several real activities performed in the left-right direction: E = entering, Ex = exiting, P = passing, B = browsing.

Test trajectory   Entering   Exiting   Passing   Browsing
E1                 116.2      277.6     210.0     207.4
E2                 115.0      284.6     224.4     197.3
Ex1                337.7      151.0     350.1     343.2
Ex2                358.2      127.4     362.0     286.7
P1                  89.3       98.6      63.4     188.9
P2                  90.9       96.6      64.7     179.0
B                  211.7      297.4     358.4     170.1

Table 2: Penalized log-likelihood of several real activities performed in the right-left direction: E = entering, Ex = exiting, P = passing, B = browsing.

References
[1] D. Ayers and M. Shah, "Monitoring Human Behavior from Video Taken in an Office Environment", Image and Vision Computing, vol. 19, no. 12, pp. 833-846, Oct. 2001.
[2] A. Bobick and J. Davis, "The Recognition of Human Movement Using Temporal Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, March 2001.
[3] J. Davis and M. Shah, "Visual Gesture Recognition", IEE Proc. Vision, Image and Signal Processing, vol. 141, no. 2, pp. 101-106, April 1994.
[4] S. Hongeng and R. Nevatia, "Multi-Agent Event Recognition", in Proc. of the 8th IEEE Int. Conf. on Computer Vision (ICCV'01), vol. 2, pp. 84-91, 2001.
[5] M. Isard and A. Blake, "A Mixed-state Condensation Tracker with Automatic Model-switching", in Proc. of the Int. Conf. on Computer Vision, pp. 107-112, 1998.
[6] J. S. Marques and J. M. Lemos, "Optimal and Suboptimal Shape Tracking Based on Switched Dynamic Models", Image and Vision Computing, pp. 539-550, June 2001.
[7] N. Johnson and D. Hogg, "Representation and Synthesis of Behaviour using Gaussian Mixtures", Image and Vision Computing, vol. 20, no. 12, pp. 889-894, 2002.
[8] A. J. Abrantes, J. S. Marques, and J. M. Lemos, "Long Term Tracking Using Bayesian Networks", in Proc. of IEEE Int. Conf. on Image Processing, Rochester, vol. III, pp. 609-612, Sept. 2002.
[9] O. Masoud and N. P. Papanikolopoulos, "A Method for Human Action Recognition", Image and Vision Computing, vol. 21, no. 8, pp. 729-743, August 2003.
[10] A. Nagai, Y. Kuno, and Y. Shirai, "Surveillance Systems based on Spatio-temporal Information", in Proc. IEEE Int. Conf. on Image Processing, pp. 593-596, 1996.
[11] J. C. Nascimento, M. A. T. Figueiredo, and J. S. Marques, "Recognition of Human Activities with Space Dependent Switched Dynamical Models", in Proc. IEEE Int. Conf. on Image Processing, September 2005.
[12] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions", IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, no. 8, pp. 831-843, August 2000.
[13] T. J. Olson and F. Z. Brill, "Moving Object Detection and Event Recognition for Smart Cameras", in Proc. Image Understanding Workshop, pp. 159-175, 1997.
[14] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.
[15] M. Rosenblum, Y. Yacoob, and L. S. Davis, "Human Expression Recognition from Motion using a Radial Basis Function Network Architecture", IEEE Trans. Neural Networks, no. 7, pp. 1121-1138, 1996.
[16] Y. Yacoob and M. J. Black, "Parameterized Modeling and Recognition of Activities", Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.

Chapter 4

The Kalman Filter Approach


Imagine you are sitting in a car, waiting at a crossroads to cross it. The visibility is poor due to parked cars at the roadside. But there are some gaps between them, so you have the possibility to observe these openings to decide whether you can cross the street without causing an accident. You have to guess the number, position and velocity of potential vehicles moving on the road from just a few pieces of information derived by watching these gaps over time.

Let us integrate the mentioned attributes of the street into the concept of a state of the street. The observations can also be seen as measurements, and they are noisy because of the poor visibility. An estimation of the state of the street is only possible if you know how vehicles move on a road and how the measurements are related to this motion. Due to the noise in the measurements and to aspects that are not directly observable, like acceleration, there will not be absolute certainty in your estimation.

This task is one instance of the problem known as the observer design problem. In general, you have to estimate the unknown internal state of a dynamical system given its output in the presence of uncertainty. The output depends somehow on the system's state. To be able to infer this state from the output, you need to know the corresponding relation and the system's behaviour. In such situations, we have to construct a model. In practice it is not possible to represent the considered system with absolute precision. Instead, the corresponding model will stop at some level of detail. The gap between it and reality is filled with a probabilistic assumption referred to as noise. The noise model introduced in this chapter will be applied throughout this work.

An optimal solution for this sort of problem in the case of linear models can be derived by using the Kalman Filter, which is explained in the first section of this chapter based on [12]. Most of the interesting instances of the observer design problem, e.g. the SLAM problem, do not fulfil the condition of linearity. To be able to apply the Kalman Filter approach to this nonlinear sort of task, we have to linearise the models. The corresponding algorithm is referred to as the Extended Kalman Filter. We will introduce it in the second section.


4.1 The Discrete Kalman Filter

In this section we introduce the Kalman Filter chiefly based on its original formulation in [17], where the state is estimated at discrete points in time. The algorithm is slightly simplified by ignoring the so-called control input, which is not used in this specific application of purely vision-based SLAM. Nevertheless, in a robotic application it might be useful to involve e.g. odometry data as control input. A complete description of the Kalman Filter can be found in [17] and [12]. In the following, we will firstly introduce the model for the system's state and the process model, which describes the already mentioned system behaviour. Here, the noise model is also presented. After that, we introduce the model for the relation between the state and its output. The section closes with a description of the whole Kalman Filter algorithm.

4.1.1 Model for the Dynamical System to Be Estimated

The Kalman filter is based on the assumption that the dynamical system to be estimated can be modelled as a normally distributed random process X(k) with mean $\hat{x}_k$ and covariance matrix $P_k$, where the index k represents time. The mean $\hat{x}_k$ is referred to as the estimate of the unknown real state $x_k$ of the system at the point k in time. This state is modelled by an n-dimensional vector:

$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_i \\ \vdots \\ x_n \end{pmatrix}$$

For simplicity of notation we did not use the subscript k here. Throughout this work, we will continue omitting k when the components of a vector or matrix are presented, even if they are different at each point in time. Our main objective is to derive a preferably accurate estimate $\hat{x}_k$ for the state of the observed system at time k. The covariance matrix $P_k$ describes the possible error between the state estimate $\hat{x}_k$ and the unknown real state $x_k$, in other words, the uncertainty in the state estimation after time step k. It can be modelled as an n × n matrix:

$$P = \begin{pmatrix} \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_i} & \cdots & \sigma_{x_1 x_n} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{x_i x_1} & \cdots & \sigma_{x_i x_i} & \cdots & \sigma_{x_i x_n} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{x_n x_1} & \cdots & \sigma_{x_n x_i} & \cdots & \sigma_{x_n x_n} \end{pmatrix}$$

where the main diagonal contains the variances of each variable in the state vector and the other entries contain the covariances of pairs of these variables. Covariance matrices are always symmetric due to the symmetric property of covariances.¹ If we want to derive an accurate estimate of the system's state, the corresponding uncertainty should obviously be small. The Kalman filter is optimal in the sense that it minimises the error covariance matrix $P_k$.

4.1.2 Process Model

Examined over time, the dynamical system is subject to a transformation. Some aspects of this transformation are known and can be modelled. Others, e.g. acceleration as in the example above (which also influences the state of the system), are unknown, not measurable, or too complex to be modelled. Then, this transformation has to be approximated by a process model A involving the known factors. The classic Kalman filter expects the model to be linear. Under this condition, the normal distribution of the state model is maintained after it has undergone the linear transformation A. The new mean $\hat{x}_k$ and covariance matrix $P_k$ for the next point in time are derived by

$$\hat{x}_k = A\,\hat{x}_{k-1}, \qquad (4.1)$$
$$P_k = A P_{k-1} A^\top. \qquad (4.2)$$

Due to the approximative character of A, the state estimate $\hat{x}_k$ is also just an approximation of the real state $x_k$. The difference is represented by a random variable w:

$$x_k = A x_{k-1} + w_{k-1}. \qquad (4.3)$$

The individual values for w are not known for each point k in time, but they need to be involved to improve the estimation. We assume these values to be realisations of a normally distributed white noise vector with zero mean. In the following, this vector w is referred to as process noise. It is denoted by

$$p(w) \sim N(0, Q) \qquad (4.4)$$

where zero is the mean and Q the process noise covariance. The individual values of w at each point in time can now be assumed to be equal to the mean, i.e. to zero. Thus, we stick to Equation (4.1) to estimate $x_k$ as $\hat{x}_k$. The process noise does not influence the current state estimate, but the uncertainty about it. Intuitively we can say: the higher the discrepancy between the real process and the corresponding model, the higher the uncertainty about the quality of the state estimate. This can be expressed by extending the computation of the error covariance $P_k$ in Equation (4.2) with the process noise covariance matrix Q:

$$P_k = A P_{k-1} A^\top + Q \qquad (4.5)$$

The choice of the values for the process noise covariance matrix reflects the quality we expect from the process model. If we set them to small values, we are quite sure that our assumptions about the considered system are mostly right. The uncertainty regarding our estimates will be low. But then we will be unable, or hardly able, to cope with large variations between the model and the system. Setting the variances to large values instead means accepting that there might be large differences between the state estimate and the real state of the system. We will be able to cope with large variations, but the uncertainty about the state estimate will grow more strongly than with a small process noise. A lot of good measurements are needed to constrain the estimate.

¹ The covariance value $\sigma_{x_1 x_n}$ is the same as $\sigma_{x_n x_1}$. In practice this means that $x_1$ is correlated to $x_n$ like $x_n$ to $x_1$.
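As a minimal illustration of the prediction equations (4.1) and (4.5), and not part of the original text, the following Python/NumPy sketch propagates a state estimate and its covariance; all names are our own:

```python
import numpy as np

def predict(x_est, P, A, Q):
    """Kalman Filter prediction: Equations (4.1) and (4.5).

    x_est : (n,)   current state estimate
    P     : (n,n)  current error covariance
    A     : (n,n)  linear process model
    Q     : (n,n)  process noise covariance
    """
    x_pred = A @ x_est          # Equation (4.1)
    P_pred = A @ P @ A.T + Q    # Equation (4.5)
    return x_pred, P_pred
```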

4.1.3 Output of the System

As already mentioned earlier, the output of the system is related to the state of the system. If we know this relation and the estimated state after the current time step, we are able to predict the corresponding measurement of the system's output. In this section, we will introduce the model for the measurement of the output. In the next section, the relation between state and output is examined. Like the state of the considered dynamical system, its output is also modelled as a normally distributed random process Z(k) with mean $\hat{z}_k$ and covariance matrix $S_k$, where the index k indicates time. The mean $\hat{z}_k$ represents the estimated (predicted) measurement of the output, depending on the state estimate $\hat{x}_k$ at the point k in time. The real measurement $z_k$ of the output is obtained by explicitly measuring the system's output. It is modelled as an m-dimensional vector:

$$z = \begin{pmatrix} z_1 \\ \vdots \\ z_i \\ \vdots \\ z_m \end{pmatrix}$$

The so-called innovation covariance matrix $S_k$ describes the possible error between the estimate $\hat{z}_k$ and the real measurement $z_k$, in other words, the uncertainty in the measurement estimation after time step k. It can be modelled as an m × m matrix:

$$S = \begin{pmatrix} \sigma_{z_1 z_1} & \cdots & \sigma_{z_1 z_i} & \cdots & \sigma_{z_1 z_m} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{z_i z_1} & \cdots & \sigma_{z_i z_i} & \cdots & \sigma_{z_i z_m} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \sigma_{z_m z_1} & \cdots & \sigma_{z_m z_i} & \cdots & \sigma_{z_m z_m} \end{pmatrix}$$

where the main diagonal contains the variances of each variable in the measurement vector and the other entries contain the covariances of pairs of these variables. Note that, in contrast to the system's real state, the real measurement can be obtained, and we are therefore able to compare predicted and real measurements. The precisely known difference between estimation and reality constitutes the basis for correcting the state estimate used to predict the measurement. This will be explained in detail in Section 4.1.5.

4.1.4 Measurement Model

In the previous sections we mentioned that the system's output is somehow related to the system's state. In this section, this relation is modelled.


We have the same situation as for the process model. The connection between the output and the state can only be modelled up to a certain degree. Known factors are summarised in the measurement model H. After we have obtained a new state estimate for the current point in time, we can apply H to predict the corresponding measurement $\hat{z}_k$ and covariance matrix $S_k$. If this measurement model is linear, the normal distribution of the state model is maintained after applying this linear transformation:

$$\hat{z}_k = H \hat{x}_k, \qquad (4.6)$$
$$S_k = H P_k H^\top. \qquad (4.7)$$

Because measurements of the system's output are mostly noisy due to inaccurate sensors, the difference between the estimate $\hat{z}_k$ and the real measurement $z_k$ is not just caused by the dependency on the state estimate but also by a random variable v:

$$z_k = H x_k + v_k. \qquad (4.8)$$

As already mentioned for the process noise, the individual values of v are not known for each point k in time. We apply the same noise model and approximate these unknown values as realisations of a normally distributed white noise vector with zero mean. In the following, v is referred to as measurement noise. It is denoted by

$$p(v) \sim N(0, R). \qquad (4.9)$$

As v is now assumed to be equal to the mean of its distribution at each point in time, it does not influence the measurement estimate but the uncertainty about it. This is modelled by extending the computation of the innovation covariance matrix $S_k$ in Equation (4.7) with the measurement noise covariance matrix R:

$$S_k = H P_k H^\top + R \qquad (4.10)$$

Again, the values chosen for the measurement noise covariance matrix indicate how sure we are about the assumptions made in our measurement model. More information about the influence of the measurement noise is given below in connection with the Kalman gain.
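Continuing the sketch from Section 4.1.2 (again our own illustration, with assumed names), the measurement prediction of Equations (4.6) and (4.10) is equally compact:

```python
import numpy as np

def predict_measurement(x_pred, P_pred, H, R):
    """Predict the measurement and the innovation covariance.

    Implements Equations (4.6) and (4.10) for a linear measurement
    model H and a measurement noise covariance R.
    """
    z_pred = H @ x_pred            # Equation (4.6)
    S = H @ P_pred @ H.T + R       # Equation (4.10)
    return z_pred, S
```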

4.1.5 Predict and Correct Steps

In the last sections we introduced the model for the process the system is subject to and the model for the relation between the system's state and its output. These models are used in the Kalman Filter algorithm to determine an optimal estimate of the unknown state of the system. As already mentioned in Section 4.1.3, we use the known difference between the predicted measurement $\hat{z}_k$ and the real measurement $z_k$ as the basis to correct the state estimate derived by the application of the process model A. The filter can be divided into two parts. In the predict step, the process model and the current state and error covariance matrix estimates are used to derive an a priori state estimate for the next time step. Next, in the correct step, a (noisy) measurement is obtained to enhance the a priori state estimate and derive an improved a posteriori estimate.


Figure 4.1: The Predict-Correct Cycle of the Kalman Filter Algorithm.

Before this predict-correct cycle, as depicted in Figure 4.1, can be started, the state and its error covariance matrix have to be initialised. In the following we will assume that this is already the case.

Predict Step

We are situated at the point k in time, and the state and error covariance matrix estimates at time k-1 are given. By using Equations (4.1) and (4.5) we predict the state and error covariance matrix for k:

$$\hat{x}_k^- = A \hat{x}_{k-1}$$
$$P_k^- = A P_{k-1} A^\top + Q.$$

The minus superscript labels the predicted state and error covariance matrix as a priori, in contrast to a posteriori estimates.

Correct Step

Assume that we have already obtained an actual measurement $z_k$ of the system's output. With the help of this, we first want to calculate the a posteriori state estimate $\hat{x}_k$. This is a linear combination of the a priori estimate $\hat{x}_k^-$ and a weighted difference between $z_k$ and the predicted measurement $\hat{z}_k$. According to Equation (4.6), $\hat{z}_k$ is calculated by $H \hat{x}_k^-$. Summarised, we have:

$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - \hat{z}_k) = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-).$$

The difference $z_k - H\hat{x}_k^-$ is called measurement innovation or residual. If this value is zero, the prediction and the actual measurement are in complete agreement, and the a priori state estimate will not be corrected. If it is unequal to zero, $\hat{x}_k$ will be unequal to $\hat{x}_k^-$. The weight $K_k$, the so-called Kalman gain, is represented by an n × m matrix and minimises the a posteriori error covariance estimate $P_k$. It can be calculated by

$$K_k = P_k^- H^\top (H P_k^- H^\top + R)^{-1}. \qquad (4.11)$$

Note that the denominator equals Equation (4.10), representing the uncertainty in the predicted measurement. If we look closely at Equation (4.11), we can see that, if the measurement error covariance R approaches zero, the measurement innovation is weighted more heavily:

$$\lim_{R \to 0} K_k = H^{-1}.$$

In other words, the smaller the measurement error, the more reliable the actual measurement $z_k$. On the other hand, if the predicted error covariance matrix $P_k^-$ approaches zero, the residual is weighted less:

$$\lim_{P_k^- \to 0} K_k = 0.$$

This means: the smaller the uncertainty in the a priori state estimate $\hat{x}_k^-$, the more reliable the predicted measurement $\hat{z}_k$. Secondly, we have to correct the a priori error covariance matrix estimate to derive the a posteriori estimate:

$$P_k = (I - K_k H) P_k^-$$

For details of the derivation of the filter algorithm see [26]. In Figure 4.2 the whole algorithm is given again step by step.

4.1.6 A Simple Example

To clarify the effectiveness of the Kalman Filter we will examine a simple example. To stick to the central theme of this work right from the beginning, this example will be an instance of the SLAM problem. The section is structured as follows: firstly, we give a short description of the problem. After that, the process and measurement model are formulated. The section closes with some experiments on simulated data.

Problem Description

In Chapter 5, we will analyse how to apply the Kalman Filter approach to the problem of SLAM using a vision sensor mounted on a robot. This firstly means tracking the position and orientation of the camera within the 3D environment (localisation) and secondly estimating the positions of some landmarks situated in the world (mapping). In the following we will simplify this task to SLAM in one dimension. The camera is represented by a point moving randomly in 1D. There is also a static landmark with a position known up to a certain degree. The process model of this example should describe the motion of the camera. We will assume that it moves smoothly, so that fast changes in its velocity are unlikely. We are able to measure the distance between the landmark and the moving point at discrete points in time. The measurement model should relate this distance to the state of the considered system. The situation is depicted in Figure 4.3.


1. Predict Step

(a) Predict the state:
$$\hat{x}_k^- = A \hat{x}_{k-1}$$

(b) Predict the error covariance matrix:
$$P_k^- = A P_{k-1} A^\top + Q$$

2. Correct Step

(a) Calculate the Kalman gain:
$$K_k = P_k^- H^\top (H P_k^- H^\top + R)^{-1}$$

(b) Correct the a priori state estimate:
$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-)$$

(c) Correct the a priori error covariance matrix estimate:
$$P_k = (I - K_k H) P_k^-$$

Figure 4.2: Equations of one Kalman Filter Cycle. We assume that the state, its covariance and the noise values are already initialised.
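The cycle of Figure 4.2 translates almost line by line into code. The following Python sketch is our own illustration, under the assumption that A, Q, H and R are constant:

```python
import numpy as np

def kalman_step(x_est, P, z, A, Q, H, R):
    """One predict-correct cycle of the discrete Kalman Filter
    as listed in Figure 4.2."""
    # 1. Predict step
    x_pred = A @ x_est                              # (a) state
    P_pred = A @ P @ A.T + Q                        # (b) covariance

    # 2. Correct step
    S = H @ P_pred @ H.T + R                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)             # (a) Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)           # (b) state correction
    P_new = (np.eye(len(x_new)) - K @ H) @ P_pred   # (c) covariance correction
    return x_new, P_new
```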
Figure 4.3: An Example for a Point Moving Randomly in 1D. A static landmark is situated at x = 3. The distance between the current point position and this landmark is measurable at each time step.

Process and Measurement Model

At first we have to model the state x which has to be estimated. Three important entities have to be taken into account. Firstly, there is the position of the point at a point in time. It is fully described by a one-dimensional coordinate in x-direction. Secondly, we choose a constant velocity to describe the motion of the point.² This does not mean that we assume the point moves constantly over all time, but that this value is the average velocity between two points in time, and that changes occur with a Gaussian profile. These changes are modelled below as process noise. Finally, the position of the landmark has to be augmented into the state:

$$x = \begin{pmatrix} x_p \\ v_p \\ x_f \end{pmatrix} = \begin{pmatrix} \text{Position of the point} \\ \text{Velocity of the point} \\ \text{Position of the landmark} \end{pmatrix}$$

The error covariance matrix is then a 3 × 3 matrix of the following form:

$$P = \begin{pmatrix} \sigma_{x_p x_p} & \sigma_{x_p v_p} & \sigma_{x_p x_f} \\ \sigma_{v_p x_p} & \sigma_{v_p v_p} & \sigma_{v_p x_f} \\ \sigma_{x_f x_p} & \sigma_{x_f v_p} & \sigma_{x_f x_f} \end{pmatrix}.$$

The task of the process model A is to approximate the transformation of the considered system over time. Here, this is the motion of the point between times k-1 and k. This constant time period is denoted as $\Delta k$. A is used to predict the state of the system for the current point k in time from the old state estimate at time k-1 by calculating $\hat{x}(k) = A\,\hat{x}(k-1)$:

$$\begin{aligned}
\hat{x}_p(k) &= \text{old point position} + \text{old velocity} \cdot \Delta k = \hat{x}_p(k-1) + \hat{v}_p(k-1)\,\Delta k \\
\hat{v}_p(k) &= \text{constant velocity due to assumed smooth motion} = \hat{v}_p(k-1) \\
\hat{x}_f(k) &= \text{static landmark} = \hat{x}_f(k-1)
\end{aligned} \qquad (4.12)$$

As already mentioned, the constant velocity value just describes the average velocity in the time period $\Delta k$. Therefore, it is just an approximation. Variations are caused by random, unmeasurable accelerations a.³ We involve these in the process noise vector w. If we knew the individual values of w at each k, we could derive the real state:

$$x(k) = A x(k-1) + w(k-1)$$

Because the process noise is an additive constant, w is modelled as a three-dimensional vector $w = (w_0, w_1, w_2)^\top$. Noise is only added to the velocity component of the state. Thus, the first and third components, $w_0$ and $w_2$, referring to the position of the moving point and to the position of the landmark, are set to zero. Only the second value carries a different random value after each time step: $w = (0, a\Delta k, 0)^\top$. Adding the noise term to the process model, we have:

$$\begin{aligned}
x_p(k) &= x_p(k-1) + (v_p(k-1) + a(k-1)\,\Delta k)\,\Delta k \\
v_p(k) &= v_p(k-1) + a(k-1)\,\Delta k \\
x_f(k) &= x_f(k-1)
\end{aligned}$$

² A velocity $v_p$ describes the distance $\Delta x$ covered in a certain time interval $\Delta k$.
³ An acceleration a is a change in velocity $\Delta v_p$ in a certain time interval $\Delta k$; thus $w_1 = a\Delta k = \Delta v_p$, the change in velocity.

We do not know the individual values for a at each point in time. Therefore, we model the process noise as a realisation of a normally distributed white noise random vector with zero mean and a covariance matrix Q:

$$p(w) \sim N(0, Q)$$

Now we can assume w to be equal to the mean of its distribution, which is zero. We derive the process model already formulated in Equation (4.12). Expressed as a linear transformation, with $\Delta k$ assumed to be 1, this is

$$A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Q is of the following form:

$$Q = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \sigma_p^2 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

The constant value of $\sigma_p$, the standard deviation of the noise in the velocity value, indicates the amount of smoothness we expect in the motion. If we choose it to be small, we expect the point to move with a nearly constant velocity; then we will not be able to cope with sudden accelerations. If we choose large values instead, we will be able to track the point well even if it behaves differently than expected by the process model. On the other hand, the uncertainty about a state estimate will then be higher than with small values for $\sigma_p$. The measurement model approximates the relation between the actual measurement $z_k$ and the current state $x_k$. In our example the measurement consists of just one value representing the distance $d_k$ between the moving point and the static landmark at the current point k in time. Expressed as a linear equation, we have

$$\hat{z}(k) = \hat{d}_k = \hat{x}_p(k) - \hat{x}_f(k) \qquad (4.13)$$

The sensor used to measure the distance is assumed to provide only noisy measurements. If we knew the value of this measurement noise exactly, we could determine the real measurement and not just an estimate. If we denote the measurement noise by the random variable v, the real measurement can be computed by:

$$z(k) = d_k = x_p(k) - x_f(k) + v(k).$$

But we do not know the individual values of the random variable v. Therefore, we apply our noise model such that the values of v are a realisation of a normally distributed white noise with zero mean and variance $\sigma_m^2$:

$$p(v) \sim N(0, \sigma_m^2).$$


The measurement noise has the same dimension as the measurement, and its distribution is therefore modelled by specifying a variance instead of a covariance matrix. We can now assume the value v to be equal to the mean of its distribution, i.e. zero. Then we derive the measurement model already formulated in Equation (4.13). Note that the difference between the estimate $\hat{z}_k$ of the measurement and the real measurement is not just caused by the unknown noise, but also by the fact that in reality we only have an estimate of the state with which to predict the measurement. The final measurement model for this problem is:

$$\hat{z}(k) = \hat{d}_k = \hat{x}_p(k) - \hat{x}_f(k).$$

Expressed as a linear transformation, we have

$$H = \begin{pmatrix} 1 & 0 & -1 \end{pmatrix}.$$

The constant value $\sigma_m$, the standard deviation of the measurement noise distribution, indicates how sure we are about the correctness of the real measurements. Large values show that we do not trust them that much, and we will weight the measurement innovation less. Small values indicate that the measured values are accurate; the residual will then be weighted more heavily.

Experiments on Simulated Data

In the previous section, we derived the basis for the application of the Kalman Filter to our problem: the appropriate process and measurement model. In this section, we will test these models on simulated data. The simulation was initialized with the state:

$$x_0 = \begin{pmatrix} 0 \\ 1 \\ 3 \end{pmatrix}$$

The subsequent real positions of the point moving in 1D were generated by applying the process exactly as described in the corresponding model and adding some random values. The standard deviation of the random values is set to 0.2. The real measurements were also generated as described in the measurement model. Measurement noise is simulated by adding random values with a standard deviation of 0.2. To start the predict-correct cycle of the Kalman Filter, we have to initialize the state and its error covariance matrix as well as the process and measurement noise values. Let us set the state to the real initial values. We assume an uncertainty about the initial position of the moving point as well as about the position of the landmark and the velocity at time 0. Let the error covariance be

$$P_0 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

The real noise in the measurements can usually be determined prior to the application of the filter. Determining the process noise covariance is more complicated, because we generally do not have the ability to measure the process we want to estimate directly. Anyway, we set the standard deviations of the noise in the velocity, $\sigma_v$, and in the measurement, $\sigma_m$, to the real value used in the simulation: 0.2. We will run the filter on ten simulated measurements. The results are depicted in Figure 4.4. In Figure 4.5, the behaviour of the error covariance P during the ten filter cycles is visualized.

Figure 4.4: The Simulation of the Problem of Estimating a Moving Point's Position by Orientating at a Single Landmark. The deviation between the estimated and real position of the point is very small, as is that between the estimated and real position of the landmark.
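Purely as an illustration (not from the original text), the described simulation can be reproduced with the kalman_step sketch given after Figure 4.2; the noise values and matrices below mirror the ones stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 1.0, 0.0],     # process model with Delta-k = 1
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Q = np.diag([0.0, 0.2**2, 0.0])    # noise acts on the velocity only
H = np.array([[1.0, 0.0, -1.0]])   # measured distance d = x_p - x_f
R = np.array([[0.2**2]])

x_true = np.array([0.0, 1.0, 3.0])   # point at 0, velocity 1, landmark at 3
x_est, P = x_true.copy(), np.eye(3)  # start from the real initial values

for k in range(10):
    a = rng.normal(0.0, 0.2)                          # random velocity change
    x_true = A @ (x_true + np.array([0.0, a, 0.0]))   # simulate the real process
    z = H @ x_true + rng.normal(0.0, 0.2, size=1)     # noisy distance measurement
    x_est, P = kalman_step(x_est, P, z, A, Q, H, R)
    print(k, x_est.round(3), np.diag(P).round(3))
```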

4.2 The Extended Kalman Filter

As we saw in Section 4.1.6, the Kalman Filter algorithm works quite well for the estimation of a linear system with linearly related measurements, depending on the quality of the appropriate models for the process and the measurement of the output. Moreover, the Kalman Filter is optimal in the sense that it minimizes the error covariance representing the uncertainty in the estimate of the state. To come back to the main theme of this work, estimating the position of a moving robot and of static landmarks using a camera sensor, we need to be able to cope with nonlinear motion and a nonlinear relationship between measurements and the system's state. The nonlinear motion is caused by possible rotational movements the robot is able to perform. Measurements of landmarks in the surroundings of the robot are projections of them onto the image plane of the camera sensor. The process of projection is nonlinear. In Section 4.1.2, it was stated that a Gaussian distribution is maintained by a linear transformation. This is not the case if we use a nonlinear transformation instead. Thus, we cannot apply the Kalman Filter equations in their original formulation to estimate a nonlinear system. A solution to this problem is to linearize the transformation via Taylor expansion. A Kalman Filter that uses Taylor expansion to linearize the process and measurement models is called Extended Kalman Filter, in the following abbreviated as EKF.

Figure 4.5: The Error Covariance Matrix P. After two iterations, the initial value of 1 for the variances has settled at approximately 0.5 for the estimation of the point's position and of the landmark's position, and at approximately 0.04 for the estimation of the velocity.

As in Section 4.1.1, we assume that the considered system can be modelled as a normally distributed random process X(k) with mean $\hat{x}_k$, the estimate of the real system state $x_k$, and covariance matrix $P_k$. Its output can likewise be modelled as a normally distributed random process Z(k) with mean $\hat{z}_k$, the prediction of the real measurement $z_k$, and covariance matrix $S_k$. In the following sections the EKF is derived for nonlinear process and measurement models. Right from the beginning, we will stick to the super-minus notation labelling a priori estimates.

4.2.1 Process Model

Let us assume that our system to be estimated, represented by a state vector $x_k$ at time k, is now governed by the nonlinear function

$$x_k = f(x_{k-1}, w_{k-1}) \qquad (4.14)$$

relating the previous state $x_{k-1}$ at the point k-1 in time to the next state $x_k$ at the current point k in time. The random value $w_{k-1}$ represents the process noise as in Equation (4.4):

$$p(w) \sim N(0, Q)$$

We assume w to be equal to the mean of its distribution, which is zero. The result of the function f will then be an approximation $\hat{x}_k^-$ of the real state $x_k$:

$$\hat{x}_k^- = f(\hat{x}_{k-1}, 0) \qquad (4.15)$$

Let the difference between the real state and its estimate, namely the error in the prediction, be a random variable $e_{x_k}$:

$$e_{x_k} = x_k - \hat{x}_k^-.$$


To be able to estimate the result of the process represented by the nonlinear Equation (4.14) via the Kalman Filter algorithm, we linearize it about the current state estimate given in Equation (4.15) by setting up a first-order Taylor polynomial ([16], p. 411):

$$x_k \approx \hat{x}_k^- + A(x_{k-1} - \hat{x}_{k-1}) + W w_{k-1} = \tilde{x}_k \qquad (4.16)$$

The matrix A is the Jacobian matrix containing the partial derivatives of f with respect to x, whereas the Jacobian matrix W is filled with the partial derivatives of f with respect to w. Note that we omitted the time subscript k for the Jacobians to simplify the notation; nevertheless, they may be different at each point in time. In the following, we will stick to omitting k for the Jacobian matrices. The a priori estimate $\hat{x}_k^-$ in Equation (4.16) can be calculated by $f(\hat{x}_{k-1}, 0)$. The remainder term approximates $e_{x_k}$ as $\tilde{e}_{x_k}$:

$$e_{x_k} \approx A(x_{k-1} - \hat{x}_{k-1}) + W w_{k-1} = \tilde{e}_{x_k} \qquad (4.17)$$

With this definition of $\tilde{e}_{x_k}$, we can rewrite Equation (4.16) as

$$\tilde{x}_k = \hat{x}_k^- + \tilde{e}_{x_k} \qquad (4.18)$$

According to Equation (4.18), we need to estimate the random value $e_{x_k}$ as $\hat{e}_{x_k}$ at each point in time to achieve our actual goal: estimating $x_k$ as $\hat{x}_k$. Note that (4.17) is a linear equation. Thus, we can apply a second, hypothetical classic Kalman Filter to estimate $e_{x_k}$. We will model this dynamical linear error system as a normally distributed random process with mean $\hat{e}_{x_k}$ and covariance matrix $P_k$ representing the uncertainty about the estimated $e_{x_k}$. Since $e_{x_k}$ denotes the error in the state estimate, it is clear that it should always be approximately zero. Therefore, the mean $\hat{e}_{x_k}$ of the distribution is chosen to be zero. Let us consider Equation (4.17) again. The second term $W w_{k-1}$ denotes the noise in the estimation of $e_{x_k}$. It is the product of the process noise w and the Jacobian matrix W containing the partial derivatives of f with respect to w. Remember that the process noise is assumed to be always equal to zero; thus, the term $W w_{k-1}$ is also assumed to be equal to zero. If w is transformed by applying W, the corresponding covariance matrix Q of the process noise is transformed into $W Q W^\top$. The noise in the estimation of $e_{x_k}$ is then modelled as

$$p(W w_{k-1}) \sim N(0, W Q W^\top).$$

To involve this noise in the prediction of the error $e_{x_k}$ between real and estimated state, the corresponding error covariance $W Q W^\top$ is added to the prediction $A P_{k-1} A^\top$ of its error covariance P. To summarize the last statements, we have:

$$\hat{e}_{x_k}^- = A(\hat{x}_{k-1} - \hat{x}_{k-1}) = 0 \qquad (4.19)$$
$$P_k^- = A P_{k-1} A^\top + W Q W^\top. \qquad (4.20)$$

Equations (4.19) and (4.20) represent the process model for the linear error system.


If we substitute Equation (4.19) for $\hat{e}_{x_k}$ in Equation (4.18), the process model for the nonlinear system to predict a state estimate $\hat{x}_k^-$ is

$$\hat{x}_k^- = f(\hat{x}_{k-1}, 0) \qquad (4.21)$$
$$P_k^- = A P_{k-1} A^\top + W Q W^\top. \qquad (4.22)$$

The process noise covariance matrix $W Q W^\top$ plays the same role in the nonlinear process model as the covariance matrix Q in the linear process model: it represents the amount of trust in the process model. High values indicate that large variations between the state estimate and the real state are expected. Low values show a lot of confidence in the process model.
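As a sketch of how Equations (4.21) and (4.22) can be realised in code, the following Python snippet (our own; the original derivation assumes analytic Jacobians, as detailed in Appendix A) approximates A and W by finite differences:

```python
import numpy as np

def numerical_jacobian(func, arg, eps=1e-6):
    """Finite-difference Jacobian of func with respect to arg."""
    base = func(arg)
    J = np.zeros((base.size, arg.size))
    for i in range(arg.size):
        step = np.zeros_like(arg)
        step[i] = eps
        J[:, i] = (func(arg + step) - base) / eps
    return J

def ekf_predict(f, x_est, P, Q, n_w):
    """EKF prediction, Equations (4.21) and (4.22), for x_k = f(x, w)."""
    zero_w = np.zeros(n_w)
    A = numerical_jacobian(lambda x: f(x, zero_w), x_est)  # df/dx at the estimate
    W = numerical_jacobian(lambda w: f(x_est, w), zero_w)  # df/dw at w = 0
    x_pred = f(x_est, zero_w)                              # Equation (4.21)
    P_pred = A @ P @ A.T + W @ Q @ W.T                     # Equation (4.22)
    return x_pred, P_pred
```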

4.2.2 Measurement Model

Let us assume that the relation between the system and its output is described by the nonlinear function

$$z_k = h(x_k, v_k) \qquad (4.23)$$

where $v_k$ represents the measurement noise as in (4.9):

$$p(v) \sim N(0, R)$$

As usual, we assume $v_k$ to be zero, which is the mean of its distribution:

$$\hat{z}_k = h(\hat{x}_k^-, 0). \qquad (4.24)$$

The result $\hat{z}_k$ is just an approximation of the real measurement. Let the difference between the actual and the predicted measurement be the random value

$$e_{z_k} = z_k - \hat{z}_k.$$

In contrast to the error $e_{x_k}$ between the real state and its estimate, $e_{z_k}$ is accessible. To estimate the measurement of the system's output we linearize Equation (4.23) about the current state estimate given in Equation (4.24) by setting up a first-order Taylor polynomial:

$$z_k \approx \hat{z}_k + H(x_k - \hat{x}_k^-) + V v_k \qquad (4.25)$$

The matrix H is the Jacobian matrix containing the partial derivatives of h with respect to x, whereas the Jacobian matrix V contains the derivatives of the same function with respect to the measurement noise v. The predicted measurement $\hat{z}_k$ in Equation (4.25) can be calculated by Equation (4.24). The error $e_{z_k}$ is approximated as $\tilde{e}_{z_k}$ by the remainder term:

$$e_{z_k} \approx H(x_k - \hat{x}_k^-) + V v_k = \tilde{e}_{z_k}. \qquad (4.26)$$

With this definition of $\tilde{e}_{z_k}$ we can rewrite Equation (4.25):

$$z_k \approx \hat{z}_k + \tilde{e}_{z_k} \qquad (4.27)$$

Note that Equation (4.26) is a linear equation. Therefore, we also model the error in the estimation of the output as a normally distributed random process with mean $\hat{e}_{z_k}$ and innovation covariance matrix $S_k$, which approximates the error between the predicted and the actual measurement. From the notion that $e_{z_k}$ reflects the error in the estimation of the state $x_k$ of the system, it is clear that it should preferably be approximately equal to zero. Thus, the mean $\hat{e}_{z_k}$ of its distribution is assumed to be always equal to zero. If we reconsider Equation (4.26), we can state that $V v_k$ is the noise term in the prediction of $e_{z_k}$. Remember that the measurement noise v is assumed to be zero at every point in time. Thus, the product of v and the Jacobian matrix V containing the partial derivatives of h with respect to the noise is zero. If v is transformed by applying V, the corresponding covariance matrix R is transformed into $V R V^\top$. Then, the noise involved in the estimation of the error $e_{z_k}$ is modelled as follows:

$$p(V v_k) \sim N(0, V R V^\top)$$

The covariance matrix of the noise $V v_k$ is added to the prediction $H P_k^- H^\top$ of the innovation covariance matrix. Summarized, we have:

$$\hat{e}_{z_k} = H(\hat{x}_k^- - \hat{x}_k^-) = 0 \qquad (4.28)$$
$$S_k = H P_k^- H^\top + V R V^\top. \qquad (4.29)$$

Equations (4.28) and (4.29) represent the measurement model for the linear error system and are used to correct the a priori error estimate $\hat{e}_{x_k}^-$ between the state and its approximation. If we substitute Equation (4.28) for $\hat{e}_{z_k}$ in Equation (4.27), the measurement model for the nonlinear system is:

$$\hat{z}_k = h(\hat{x}_k^-, 0) \qquad (4.30)$$
$$S_k = H P_k^- H^\top + V R V^\top. \qquad (4.31)$$

4.2.3 Predict and Correct Steps

Using the Kalman Filter for the estimation of the state of a linear system means that we know exactly how uncertain we are about this estimate. Using the EKF for the estimation of the state of a nonlinear system, in contrast, means additionally estimating the uncertainty in this state estimate. This can be done by the second, hypothetical Kalman Filter presented in the previous sections, which estimates the error between the real state and its estimate. Let us assume that we have already used the process model for the nonlinear system given in Equations (4.21) and (4.22) to derive an a priori estimate $\hat{x}_k^-$ for the state and $P_k^-$ for its error covariance. Then, we can predict the measurement $\hat{z}_k$ by using Equation (4.30). After we have obtained the real measurement $z_k$, we can calculate the error $e_{z_k}$ between $z_k$ and the predicted measurement $\hat{z}_k$. According to Equation (4.19), the predicted error estimate $\hat{e}_{x_k}^-$ between the real state and its estimate is assumed to be zero in every time step. The Kalman Filter equation to correct the a priori error estimate $\hat{e}_{x_k}^-$ and derive an a posteriori $\hat{e}_{x_k}$ is then

$$\hat{e}_{x_k} = \hat{e}_{x_k}^- + K_k e_{z_k} = K_k e_{z_k}.$$


1. Predict Step

(a) Predict the state:
$$\hat{x}_k^- = f(\hat{x}_{k-1}, 0)$$

(b) Predict the error covariance matrix:
$$P_k^- = A P_{k-1} A^\top + W Q W^\top$$

2. Correct Step

(a) Calculate the Kalman gain:
$$K_k = P_k^- H^\top (H P_k^- H^\top + V R V^\top)^{-1}$$

(b) Correct the a priori state estimate:
$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - h(\hat{x}_k^-, 0))$$

(c) Correct the a priori error covariance matrix estimate:
$$P_k = (I - K_k H) P_k^-$$

Figure 4.6: Equations of one Extended Kalman Filter Cycle. We assume that the state, its covariance and the noise values are already initialized. Note that, for simplicity, the subscript k is not used here for the Jacobians, although they have to be re-calculated in each predict-correct cycle.

If we substitute this into Equation (4.18), we get

$$\hat{x}_k = \hat{x}_k^- + K_k e_{z_k}.$$

Because $e_{z_k}$ is the measurement residual, we can also write

$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - \hat{z}_k) \qquad (4.32)$$
$$= \hat{x}_k^- + K_k (z_k - h(\hat{x}_k^-, 0)). \qquad (4.33)$$

Equation (4.33) can be used in the correct step of the Extended Kalman Filter algorithm to derive the a posteriori estimate for the state of the nonlinear system. The Kalman gain $K_k$ itself is calculated as in Equation (4.11), with the appropriate substitution for the measurement error covariance matrix given in (4.31):

$$K_k = P_k^- H^\top (H P_k^- H^\top + V R V^\top)^{-1}$$

In Figure 4.6, the Extended Kalman Filter algorithm is given step by step.
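Combining the pieces, one EKF cycle as in Figure 4.6 can be sketched as follows; this is again only our own illustration and reuses the numerical_jacobian and ekf_predict helpers from the earlier sketch:

```python
import numpy as np

def ekf_step(f, h, x_est, P, z, Q, R, n_w, n_v):
    """One predict-correct cycle of the EKF (Figure 4.6).
    f(x, w) is the process model, h(x, v) the measurement model."""
    x_pred, P_pred = ekf_predict(f, x_est, P, Q, n_w)

    zero_v = np.zeros(n_v)
    H = numerical_jacobian(lambda x: h(x, zero_v), x_pred)  # dh/dx
    V = numerical_jacobian(lambda v: h(x_pred, v), zero_v)  # dh/dv

    S = H @ P_pred @ H.T + V @ R @ V.T              # innovation covariance (4.31)
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred, zero_v))    # Equation (4.33)
    P_new = (np.eye(x_new.size) - K @ H) @ P_pred
    return x_new, P_new
```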


Figure 4.7: A ship is sailing on the straight line perpendicular to the axis between $x_0$, the initial position of the ship, and the position of the lighthouse. $x_i$ and $x_j$ are sample positions of the ship, which need to be estimated from the corresponding observable angles $\theta_i$ and $\theta_j$.

4.2.4 A Simple Example

The derivation of the Extended Kalman Filter presented in the previous section is a bit more complicated than the explanation of the classic filter. In this section a simple example is examined to provide a better understanding of the EKF algorithm. Again, we will consider an instance of the general SLAM problem. The section is structured as follows. Firstly, we will describe the specific problem in general. After that, the models for the system's state and process and for the relation between the state and the measurement are presented. The section closes with some experiments on simulated data.

Problem Description

Imagine you are the skipper of a ship and your task is to sail a straight route of a certain length on an ocean. As you might infer from this sentence, the example deals more or less with the routing aspect of navigation. But we will focus on the localization and mapping problem. To be more concrete, as a skipper you need to localize your ship on that straight route. We assume that there is a lighthouse, with an uncertainly known position, to orientate by. Your initial position is located at some distance from that lighthouse. You will sail in a direction perpendicular to the axis between the lighthouse and the initial ship position. It is obvious that the motion of a ship is smooth, so that changes in the velocity are unlikely. You will be able to measure the angle between the current position of your ship and the lighthouse. Of course, these values will be more or less guesses rather than precise measurements. We assume that you are not able to measure your velocity, which is normally the case. This situation is depicted in Figure 4.7.

Process and Measurement Model

In this example we have two tasks. Firstly, we need to localize the position x of the moving ship on the straight route at every time step. Secondly, we have to refine our knowledge about the position y of the lighthouse.


Thus, the state x of the considered system contains three entities. The position x and the velocity $v_x$ of the ship are the first ones. Again, we choose a constant value for the velocity, which represents an average value during the constant time period $\Delta k$. The third component of the state denotes the distance between the lighthouse and the initial position $x_0$ of the ship:

$$\mathbf{x} = \begin{pmatrix} x \\ v_x \\ y \end{pmatrix} = \begin{pmatrix} \text{Position of the ship} \\ \text{Velocity of the ship} \\ \text{Distance of the lighthouse from } x_0 \end{pmatrix}$$

With this definition of the state, we have the following error covariance matrix P representing the uncertainty in the estimation of the state:

$$P = \begin{pmatrix} \sigma_{xx} & \sigma_{x v_x} & \sigma_{xy} \\ \sigma_{v_x x} & \sigma_{v_x v_x} & \sigma_{v_x y} \\ \sigma_{yx} & \sigma_{y v_x} & \sigma_{yy} \end{pmatrix}$$

The process the system is subject to is just the motion of the ship on that route. The process model f we set up here relates the state at time k-1 to time k by calculating:

$$\begin{aligned}
\hat{x}(k) &= \text{old position} + \text{old velocity} \cdot \Delta k = \hat{x}(k-1) + \hat{v}_x(k-1)\,\Delta k \\
\hat{v}_x(k) &= \text{constant velocity due to assumed smooth motion} = \hat{v}_x(k-1) \\
\hat{y}(k) &= \text{static landmark} = \hat{y}(k-1)
\end{aligned} \qquad (4.34)$$

These equations are linear. Nevertheless, we will treat them as nonlinear and apply the EKF approach. We will see that the EKF equations then reduce to the equations of the classic Kalman Filter. As already mentioned, $v_x$ just describes the average velocity between two time steps. Thus, it is just an approximation of the real velocity. The random difference between estimated and real velocity is modelled as process noise $w = (w_0, w_1, w_2)^\top = (0, a\Delta k, 0)^\top$. Like the state, w is a three-dimensional vector. Only the velocity is corrupted by noise; therefore, only $w_1$ carries a value unequal to zero, involving the unmeasurable acceleration a:

$$p(w) \sim N(0, Q)$$

Q is of the following form:

$$Q = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \sigma_v^2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

The variable $\sigma_v$ denotes the standard deviation of the noise in the velocity. If we knew the individual values for w, we could derive the real state of the considered system by calculating $f(x_{k-1}, w_{k-1})$:

$$\begin{aligned}
x(k) &= x(k-1) + (v_x(k-1) + w_1)\,\Delta k + w_0 = x(k-1) + (v_x(k-1) + a(k-1)\,\Delta k)\,\Delta k \\
v_x(k) &= v_x(k-1) + w_1 = v_x(k-1) + a(k-1)\,\Delta k \\
y(k) &= y(k-1) + w_2 = y(k-1)
\end{aligned} \qquad (4.35)$$


Again, we assume w to be always equal to the mean of its distribution, which is zero. Then we derive the process model $f(\hat{x}_{k-1}, 0)$ as it is already formulated in Equation (4.34). To be able to predict the error covariance matrix P at each point in time, we need to derive the Jacobian matrix A containing the partial derivatives of Equation (4.34) with respect to the state x, and the Jacobian matrix W containing the partial derivatives of Equation (4.34) with respect to the noise w. Assuming that $\Delta k$ is equal to 1, for A we have:

$$A = \begin{pmatrix} \frac{\partial x}{\partial x} & \frac{\partial x}{\partial v_x} & \frac{\partial x}{\partial y} \\ \frac{\partial v_x}{\partial x} & \frac{\partial v_x}{\partial v_x} & \frac{\partial v_x}{\partial y} \\ \frac{\partial y}{\partial x} & \frac{\partial y}{\partial v_x} & \frac{\partial y}{\partial y} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Note that this is the same as Equation (4.34) expressed as a linear transformation. For W, we have:

$$W = \begin{pmatrix} \frac{\partial x}{\partial w_0} & \frac{\partial x}{\partial w_1} & \frac{\partial x}{\partial w_2} \\ \frac{\partial v_x}{\partial w_0} & \frac{\partial v_x}{\partial w_1} & \frac{\partial v_x}{\partial w_2} \\ \frac{\partial y}{\partial w_0} & \frac{\partial y}{\partial w_1} & \frac{\partial y}{\partial w_2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Hence, $W Q W^\top = Q$. Then, the equation to predict the error covariance equals the one for the standard Kalman Filter: $P^-(k) = A P(k-1) A^\top + Q$. Now, let us consider the measurement model for our system. It provides the relation between the state x of the system and the measurement z of its output. Remember, as measurement we obtain the value for the angle $\theta$ at each time step. If we look again at Figure 4.7, we can state that the situation can be represented by a right triangle. Then, two definitions hold:

$$a^2 + b^2 = c^2, \qquad a = c \sin\theta$$

We define the axis between the lighthouse and $x_0$ as a, the distance the ship has covered up to a certain point in time as b, and the connection between the lighthouse and the current position of the ship as the hypotenuse c. Then b is equal to x in the state, and a is the same as y. Thus, the measurement model to obtain the measurement $\hat{z}$ is

$$\hat{z}(k) = \hat{\theta} = \arcsin\frac{\hat{y}(k)}{\sqrt{(\hat{x}(k))^2 + (\hat{y}(k))^2}}. \qquad (4.36)$$

Thus, we have a nonlinear measurement model h. The value provided for $\theta$ might be more of a guess than a precise measurement. Therefore, we have to introduce measurement noise v to model the difference between the real measurement and the predicted one. If we knew the noise value for each time step, we would obtain z instead of $\hat{z}$ by calculating $h(x_k, v_k)$:

$$z(k) = \theta = \arcsin\frac{y(k)}{\sqrt{(x(k))^2 + (y(k))^2}} + v(k).$$


But this is not the case. Therefore, we model v as normally distributed measurement noise with zero mean and standard deviation $\sigma_r$:

$$p(v) \sim N(0, \sigma_r^2)$$

Now we can assume v to be zero at each point in time, which is the mean of its distribution. Then we obtain h(x, 0) as already formulated in Equation (4.36). The variance is added in the calculation of the innovation covariance matrix $S(k) = H P^-(k) H^\top$, which is also one-dimensional. Because we have a nonlinear model, the value of the variance is first transformed by $V \sigma_r^2 V^\top$ and then added. As usual, the value we choose for $\sigma_r$ indicates how we rate the quality of the measurement model. Because we have a nonlinear measurement model, we need to derive the Jacobian matrices H and V at each point in time. H contains the partial derivatives of the measurement model $h(x_k, 0)$ with respect to the state. It is of the following form:

$$H = \begin{pmatrix} \frac{\partial h}{\partial x} & \frac{\partial h}{\partial v_x} & \frac{\partial h}{\partial y} \end{pmatrix}$$

For $\frac{\partial h}{\partial x}$ we have

$$\frac{\partial h}{\partial x} = \frac{1}{\sqrt{1 - \frac{y^2}{x^2+y^2}}} \cdot \left( -\frac{xy}{\sqrt{(x^2+y^2)^3}} \right)$$

$\frac{\partial h}{\partial v_x}$ is equal to zero, because the velocity of the ship is irrelevant in the measurement model. For $\frac{\partial h}{\partial y}$ we have

$$\frac{\partial h}{\partial y} = \frac{\frac{1}{\sqrt{x^2+y^2}} - \frac{y^2}{\sqrt{(x^2+y^2)^3}}}{\sqrt{1 - \frac{y^2}{x^2+y^2}}}$$

The matrix V contains the partial derivative of $h(x_k, 0)$ with respect to the noise v. Thus, it is of the following form:

$$V = \begin{pmatrix} \frac{\partial h}{\partial v} \end{pmatrix}$$

Because the measurement noise v is an additive constant, $\frac{\partial h}{\partial v}$ is equal to 1.
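For concreteness, here is a small Python sketch (our own names) of the measurement model (4.36) and the analytic 1 × 3 Jacobian H derived above; note that the expressions are singular at x = 0, where the angle is exactly 90 degrees:

```python
import numpy as np

def h_bearing(state):
    """Predicted angle to the lighthouse, Equation (4.36).
    state = (x, v_x, y); the velocity does not enter the measurement."""
    x, _, y = state
    return np.array([np.arcsin(y / np.hypot(x, y))])

def H_bearing(state):
    """Analytic Jacobian of h_bearing with respect to the state."""
    x, _, y = state
    r2 = x**2 + y**2
    s = np.sqrt(1.0 - y**2 / r2)                     # arcsin chain-rule factor
    dh_dx = (-x * y / np.sqrt(r2**3)) / s
    dh_dy = (1.0 / np.sqrt(r2) - y**2 / np.sqrt(r2**3)) / s
    return np.array([[dh_dx, 0.0, dh_dy]])           # dh/dv_x is zero
```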

Experiments on Simulated Data

In the previous section we derived the basis for applying the EKF approach to our problem: the process and measurement model. In this section, we will test these models on simulated data. We repeat the procedure from the simple example for the standard Kalman Filter. The initial values for the filter reflect reality but are only known approximately. This is represented by an error covariance matrix P where the values on the main diagonal are unequal to zero. The values for the process and measurement noise are also chosen to represent the real values.


Figure 4.8: The Simulation of the Problem of Estimating the Position of a Ship by Orientating at a Lighthouse.

To start the predict-correct cycle we initialize the state x and the error covariance matrix P. For x we choose:

$$x = \begin{pmatrix} 0 \\ 1 \\ 20 \end{pmatrix}$$

These initial values are only uncertainly known. For P we choose:

$$P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

In reality, the standard deviations of the process and measurement noise need to be determined prior to the application of the filter. Here, the values $\sigma_v$ and $\sigma_m$ reflect the real noise values:

$$\sigma_v = 0.02, \qquad \sigma_m = 0.02$$

We will run the filter on 10 simulated measurements. The results for the estimation of the ship's position are depicted in Figure 4.8. In Figure 4.9, the estimated lighthouse position is compared to the real one. In Figure 4.10, the behaviour of the error covariance P during the ten filter cycles is depicted. We can note that the uncertainty about the position of the ship decreases first and then starts to increase slowly. This is due to the increasingly influential measurement noise. The farther the ship gets away from its starting point, the less the measured angle changes its value. The measurement noise stays at a constant level and will therefore increase its influence on the uncertainty about the correctness of the inferred position of the ship. Small changes in the value of the angle will cause larger deviations in the estimation of the ship's position and therefore a large uncertainty about the state estimate.


Figure 4.9: The Results for the Mapping of the Lighthouse.

Figure 4.10: The Error Covariance Matrix P. After just one iteration, the uncertainty about the ship's position has decreased massively. Then it increases slightly. In contrast, the uncertainty about the velocity has nearly fallen to zero.
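The experiment can be reproduced, as a rough sketch only, by combining the earlier pieces (ekf_step and h_bearing); the process model f_ship and the chosen random seed are our own:

```python
import numpy as np

def f_ship(state, w):
    """Process model (4.34)/(4.35) with Delta-k = 1: constant-velocity
    ship and static lighthouse; w = (w0, w1, w2) is the process noise."""
    x, vx, y = state + w
    return np.array([x + vx, vx, y])

rng = np.random.default_rng(1)
Q = np.diag([0.0, 0.02**2, 0.0])     # sigma_v = 0.02
R = np.array([[0.02**2]])            # sigma_m = 0.02

x_true = np.array([0.0, 1.0, 20.0])  # ship at 0, velocity 1, lighthouse at 20
x_est, P = x_true.copy(), np.eye(3)

for k in range(10):
    w = np.array([0.0, rng.normal(0.0, 0.02), 0.0])
    x_true = f_ship(x_true, w)                        # simulate the real motion
    z = h_bearing(x_true) + rng.normal(0.0, 0.02, 1)  # noisy angle measurement
    x_est, P = ekf_step(f_ship, lambda x, v: h_bearing(x) + v,
                        x_est, P, z, Q, R, n_w=3, n_v=1)
```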

Chapter 6

An Observation Strategy
In the previous chapter we applied the Extended Kalman Filter approach to the SLAM problem. A problem of using the EKF is that it does not scale very well: the complexity is cubic in the number of features in the map. In this chapter, we will examine strategies to reduce the complexity to O(n²), where n is the number of features. One of these strategies is that just a single feature, instead of all visible ones, is measured. In [35] it is shown that this is sufficient for tracking. If we do so, we need to select the best feature based on a heuristic. In the following, we will refer to this heuristic as an observation strategy. It is adapted from Davison in [9] and [8]. In this chapter, we will firstly concentrate on ways to reduce the time complexity of one EKF cycle. This examination is chiefly based on [23]. Secondly, an appropriate heuristic is introduced to realise the selection of the best landmark. The two SLAM scenarios, the first with a single camera, the second with a stereo camera, are handled separately.

6.1 Complexity of the Kalman Filter

We will first examine the general time complexity of the Extended Kalman Filter algorithm in detail. Considering each step during one EKF cycle, we will introduce methods to reduce the cubic time complexity to O(n²). As a reminder, the appropriate equations are listed in Figure 6.1. If we have a look at these equations, we can state that there are two major time-consuming operations: matrix multiplication and matrix inversion. If the matrix multiplication is carried out in a straightforward manner, its time complexity is O(n³) if we multiply n × n matrices. Matrix inversion also grows cubically with the number of visible and measured features. In the case of the EKF, the maximal size of a matrix, here P, is (13 + 3n) × (13 + 3n), where n is the number of features. The matrix which will be inverted is the innovation covariance. It is of dimension 2l × 2l or 3l × 3l.¹ Here l denotes the number of visible and measurable features. Because the number of measurable features cannot be larger than the number of known features, the overall complexity of the EKF is O((13 + 3n)³) = O(n³).

¹ The dimension of the measurements using a monocular camera is 2. If a stereo camera is used as the vision sensor, the measurement is three-dimensional.


1. Predict Step

(a) Predict the state ahead:
$$\hat{x}_k^- = f(\hat{x}_{k-1}, 0)$$

(b) Predict the error covariance matrix ahead:
$$P_k^- = A P_{k-1} A^\top + W Q W^\top$$

2. Correct Step

(a) Calculate the Kalman gain:
$$K_k = P_k^- H^\top (H P_k^- H^\top + V R V^\top)^{-1}$$

(b) Correct the a priori state estimate:
$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - h(\hat{x}_k^-, 0))$$

(c) Correct the a priori error covariance matrix estimate:
$$P_k = (I - K_k H) P_k^-$$

Figure 6.1: Equations of one Extended Kalman Filter Cycle.


We can reduce this complexity to O(n²) by considering aspects related to the SLAM problem. First of all, the process model affects just the state of the camera and the velocities, summarised in $x_v$; the known features, and thus the whole state of the system, are not involved. Secondly, usually just a small subset of the feature points can be measured at each point in time, due to the constraints of the viewing direction. In the following, we will explain this in detail, first for the predict step and after that for the correct step.

6.1.1 Complexity of the Predict Step

In the predict step of the Kalman Filter, we predict the state x of the system as $\hat{x}^-$ and the related error covariance P as $P^-$. The process model f relates the state at one point in time to the next. But, as already mentioned above, just the state of the camera and its velocities are affected. Thus, the Jacobian matrix A, containing the partial derivatives of the process model with respect to the state, is of the following form:

$$A = \begin{pmatrix} \frac{\partial f_v}{\partial x_v} & 0 \\ 0 & I \end{pmatrix}$$

where $f_v$ is the part of the process model that acts on the camera state $x_v$:

$$x_{v,\text{new}} = f_v(x_v, w = 0) = \begin{pmatrix} r^W_{\text{new}} \\ q^{CW}_{\text{new}} \\ v^W_{\text{new}} \\ \omega^{CW}_{\text{new}} \end{pmatrix} = \begin{pmatrix} r^W + v^W \Delta k \\ q^{CW} \times q(\omega^{CW} \Delta k) \\ v^W \\ \omega^{CW} \end{pmatrix}$$

The detailed Jacobian matrix A can be found in Appendix A. The overall dimension of the state is m = 13 + 3n, where n is the number of 3D landmarks and 13 is the dimension of $x_v$; thus A is an m × m Jacobian matrix, as is the error covariance matrix P. The block $\frac{\partial f_v}{\partial x_v}$ is of dimension 13 × 13. Let us consider the first summand $A P_{k-1} A^\top$ of the prediction of the error covariance matrix $P_k^-$, and let the old $P_{k-1}$ be denoted by

$$P_{k-1} = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$$

$P_{11}$ is a covariance matrix, also of dimension 13 × 13, related to $x_v$. $P_{12}$ and $P_{21}$ are of dimension 13 × 3n and 3n × 13, respectively.² $P_{22}$ is then a 3n × 3n covariance matrix. If we perform the matrix operation for $A P_{k-1} A^\top$ explicitly, we obtain:

$$A P_{k-1} A^\top = \begin{pmatrix} \frac{\partial f_v}{\partial x_v} & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} \begin{pmatrix} (\frac{\partial f_v}{\partial x_v})^\top & 0 \\ 0 & I \end{pmatrix} = \begin{pmatrix} \frac{\partial f_v}{\partial x_v} P_{11} (\frac{\partial f_v}{\partial x_v})^\top & \frac{\partial f_v}{\partial x_v} P_{12} \\ (\frac{\partial f_v}{\partial x_v} P_{12})^\top & P_{22} \end{pmatrix}$$

² Note that $P_{12}$ is the transpose of $P_{21}$ because of the symmetry of covariances.


f f Regarding to the dimensions of the matrices, the term xv P11 ( xv ) can be v v f evaluated by 2(13 13 13) multiplications. To solve xv P12 we need 13 13 3n v f multiplications. ( xv P12 ) is just the transpose of the previous term and do v not need to be evaluated again. Altogether, the whole amount of multiplications to evaluate APk1 A lies at 2(13 13 13) + (13 13 3n). The second summand WQW of the prediction function can be considered equivalently. The Jacobian matrix W contains the partial derivatives of the process model with respect to the process noise. It is of the following form:

W=

fv VW

fv CW

For the detailed matrix, have a look at Appendix A. Since the process noise vector w is of dimension 6, W is a m 6 matrix. f fv The blocks Vv as well as CW carry 13 3 elements. The process noise does W not aect the coordinates of the known features. Thus, the according elements of W are equal to zero. The process noise covariance Q can be denoted by: Q= Q11 0 0 Q22

It is a 6 6 matrix and the blocks Q11 and Q22 are each of dimension 3 3. If we perform the matrix multiplication WQW explicitly, we derive: WQW = =
fv VW fv CW

Q11 0 + 0

0 Q22

f ( Vv ) W fv ( CW )

0 0 0 0

fv Q ( fv ) VW 11 VW

fv Q ( fv ) W 22 W

Because no block of a size related to the n known features is involved in the one block unequal to the zero matrix, the number of multiplications is independent of n. We exactly need 4(13 3 3) multiplications. Thus, the cost of the predict step in all is linear in m.
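The corresponding code only ever touches the 13 × 13 camera block of the predicted covariance. A sketch, again with illustrative names, where F_V and F_Omega are the two 13 × 3 noise Jacobian blocks:

```python
import numpy as np

def predict_cov_WQWT(m, F_V, F_Omega, Q11, Q22):
    """Block-wise W Q W^T for W = [[F_V, F_Omega], [0, 0]]; cost independent of n."""
    out = np.zeros((m, m))
    # Only the upper-left 13x13 block is non-zero.
    out[:13, :13] = F_V @ Q11 @ F_V.T + F_Omega @ Q22 @ F_Omega.T
    return out

# The full predicted covariance is then
# P_pred = predict_cov_APAT(P, F) + predict_cov_WQWT(m, F_V, F_Omega, Q11, Q22)
```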

6.1.2 Complexity of the Correct Step

Since only a few of all known features are visible to the camera sensor at each point in time, the Jacobian matrix H, containing all partial derivatives of the measurement model h with respect to the state, carries a large number of zeros. Let us assume that we measure just one feature y_i^W after each time step. Then H has the following form:

H = \begin{pmatrix} \frac{\partial h}{\partial x_v} & 0 & \cdots & 0 & \frac{\partial h}{\partial y_i^W} & 0 & \cdots & 0 \end{pmatrix}

The detailed Jacobian matrix can be found in Appendix A. We know that the dimension of the state vector x is m = 13 + 3n. The dimension p of the measurement vector is either 2 or 3, depending on whether we use a single or a stereo camera. Thus, the whole matrix H is of dimension p × m. The block ∂h/∂x_v carries p × 13 elements, whereas ∂h/∂y_i^W is of dimension p × 3.

To evaluate the Kalman gain K, we need to perform the multiplication P_k^- H^T. For this purpose, P_k^- is represented by its column blocks

P_k^- = \begin{pmatrix} P_1 & P_{01} & P_2 & P_{02} \end{pmatrix}   (6.1)

where the block P_1 contains m × 13 elements and the block P_2 m × 3 elements; P_{01} and P_{02} cover the columns of the remaining features. If we perform this multiplication explicitly, we obtain

P_k^- H^T = \begin{pmatrix} P_1 & P_{01} & P_2 & P_{02} \end{pmatrix} \begin{pmatrix} (\frac{\partial h}{\partial x_v})^T \\ 0 \\ (\frac{\partial h}{\partial y_i^W})^T \\ 0 \end{pmatrix} = P_1 \Big(\frac{\partial h}{\partial x_v}\Big)^T + P_2 \Big(\frac{\partial h}{\partial y_i^W}\Big)^T

The number of multiplications adds up to 16pm. After evaluating P_k^- H^T, we need to derive the innovation covariance S, obtained as H P_k^- H^T + V R V^T. We first consider the first summand. The result for P_k^- H^T is an m × p matrix and is represented by its row blocks

P_k^- H^T = \begin{pmatrix} P'_1 \\ P'_{01} \\ P'_2 \\ P'_{02} \end{pmatrix}

where the block P'_1 is a 13 × p and P'_2 a 3 × p matrix. As a result for the product H P_k^- H^T, we obtain

H P_k^- H^T = \begin{pmatrix} \frac{\partial h}{\partial x_v} & 0 & \frac{\partial h}{\partial y_i^W} & 0 \end{pmatrix} \begin{pmatrix} P'_1 \\ P'_{01} \\ P'_2 \\ P'_{02} \end{pmatrix} = \frac{\partial h}{\partial x_v} P'_1 + \frac{\partial h}{\partial y_i^W} P'_2

The number of multiplications lies at 16p^2, where p is either 2 or 3. The second summand V R V^T in the equation for the innovation covariance can be simplified equivalently. R is the measurement error covariance of dimension p × p. The Jacobian matrix V contains the partial derivatives of the measurement model with respect to the measurement noise. Because the measurement noise is an additive constant in both SLAM scenarios, whether with a single or a stereo camera, V is an identity matrix regardless of the value of p, and we have V R V^T = R. The overall number of multiplications to calculate the innovation covariance is thus 16p^2. To evaluate the Kalman gain K, we finally need to invert S. As already mentioned above, the complexity of matrix inversion grows cubically with the number of rows or columns of the considered square matrix. Here, we have a p × p matrix to invert and thus need p^3 multiplications. The whole number of multiplications to calculate the Kalman gain is therefore 16pm + 16p^2 + p^3, which is linear in m.

Until now, the complexity of every equation, whether in the predict or the correct step, has been linear in m. The second equation of the correct step, updating the error covariance P, is responsible for the quadratic complexity. We have to evaluate the summand K_k H P_k^-. We first consider the product H P_k^-. H is, as already stated above, represented by

H = \begin{pmatrix} \frac{\partial h}{\partial x_v} & 0 & \frac{\partial h}{\partial y_i^W} & 0 \end{pmatrix}

and the predicted error covariance matrix P_k^- is now denoted by its row blocks

P_k^- = \begin{pmatrix} P_1 \\ P_{01} \\ P_2 \\ P_{02} \end{pmatrix}

Note that these blocks are not the same as in Equation (6.1), although they split up the same matrix P_k^-. Here, P_1 is of dimension 13 × m and P_2 carries 3 × m elements. If we evaluate the product, we obtain

H P_k^- = \begin{pmatrix} \frac{\partial h}{\partial x_v} & 0 & \frac{\partial h}{\partial y_i^W} & 0 \end{pmatrix} \begin{pmatrix} P_1 \\ P_{01} \\ P_2 \\ P_{02} \end{pmatrix} = \frac{\partial h}{\partial x_v} P_1 + \frac{\partial h}{\partial y_i^W} P_2

where the result is a p × m matrix; 16pm multiplications are needed. The last step is to multiply the Kalman gain K with this result. Neither K, which is an m × p matrix, nor H P_k^- carries a zero or identity block. Therefore, we derive an m × m matrix by performing pm^2 multiplications. Thus, the time complexity of the correct step is O(m^2), or, if we just consider the number of known features, O((13 + 3n)^2) = O(n^2). At the same time, this is the time complexity of one EKF cycle. The results presented in Section 6.1 are summarised in Table 6.1.

Predict Step
    P_k^- = A P_{k-1} A^T + W Q W^T                    O(m) = O(13 + 3n) = O(n)
Correct Step
    K_k = P_k^- H^T (H P_k^- H^T + V R V^T)^{-1}       O(m) = O(13 + 3n) = O(n)
    P_k = (I - K_k H) P_k^-                            O(m^2) = O((13 + 3n)^2) = O(n^2)

Table 6.1: Complexities for the Equations of one Extended Kalman Filter Cycle.
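The sparse structure of H also translates directly into code: P_k^- H^T and H P_k^- only need the columns (respectively rows) of P_k^- belonging to the camera and to the measured feature. A sketch with illustrative names, assuming the 0-based feature index i occupies state entries 13+3i to 13+3i+2:

```python
import numpy as np

def sparse_gain(P_pred, Hx, Hy, R, i):
    """Kalman gain for a single measured feature i, exploiting the zeros in H.

    Hx: p x 13 block dh/dx_v,  Hy: p x 3 block dh/dy_i^W,  R: p x p noise covariance.
    """
    s = 13 + 3 * i                                          # first entry of feature i
    PHT = P_pred[:, :13] @ Hx.T + P_pred[:, s:s+3] @ Hy.T   # m x p, 16pm mults
    S = Hx @ PHT[:13] + Hy @ PHT[s:s+3] + R                 # p x p, 16p^2 mults (V R V^T = R)
    K = PHT @ np.linalg.inv(S)                              # p^3 mults for the inversion
    return K, S
```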

6.2 A Heuristic to Decide which Feature to Track

In the last section we presented methods to reduce the complexity of one EKF cycle by taking the particular structure of the SLAM problem into account.


For one of these methods it is assumed that we just measure one of the visible feature points per point in time. But if we do so, two questions may arise:

- Is it sufficient for the estimation of the state to measure just one feature?
- Which of the several visible features is best to be measured?

Considering the first question, Welch and Bishop [35] presented the SCAAT method, where it is shown that measuring a single landmark after each time step is sufficient to observe the 3D structure and motion of a scene over time.³ In the case of 3D-SLAM, a single measurement of a 2D projection of a 3D landmark provides only partial or incomplete information about the whole state of the system, e.g., nothing about the (linear or angular) velocity of the camera and nothing about the depth of the 3D feature position. Systems operating solely on such incomplete measurements are referred to as unobservable, because the whole system state cannot be inferred from them. Such systems must incorporate a sufficient set of these measurements to obtain observability, which can be achieved over space or over time. The latter is adopted by the SCAAT technique: it is based on the Extended Kalman Filter, where individual measurements providing incomplete information about the system's state are blended into a complete state estimate, and the filter itself provides the means for this blending. In several experiments, SCAAT was shown to be accurate, stable, fast and flexible.

To answer the second question, we first need a criterion to rate the features. An intuitive idea is stated by Davison in [9]: the more uncertain we are about the 3D position of a feature, the more profitable it is to measure it. In other words, measurements of features that are difficult to predict provide more information about the position of this feature and of the camera than measurements of features which can be reliably predicted. The innovation covariance S describes the uncertainty about each predicted measurement. Thus, it contains the basic information to decide which visible feature should be measured at each point in time. It is calculated as follows:

S = H P H^T + V R V^T   (6.2)

where H and V are the Jacobian matrices of the measurement model h(x, 0) with respect to the state x and the measurement noise v, respectively. P is the error covariance matrix linked to the state, and R is the measurement noise covariance.

S is the covariance of a multivariate Gaussian. Therefore, covariance matrices S_i for each predicted measurement ẑ_i corresponding to a visible feature point y_i^W can be extracted from it. These smaller covariances refer to a Gaussian with the predicted measurement ẑ_i as its mean. According to Whaite and Ferrie [36], depending on the measurement space, each S_i can be represented either by an ellipse or an ellipsoid centred around the mean of the distribution. These are also referred to as ellipses or ellipsoids of confidence and represent the amount of uncertainty about the predicted measurement. In other words, we can be confident that the real measurement is situated within the ellipse or ellipsoid. By calculating the surface area or volume of these objects, we can decide which predicted measurement is most uncertain.

Besides its role as a measure of the information content expected of a measurement, S_i also defines a search region in which the corresponding measurement ẑ_i should be located with high probability. Thus, once we have decided to measure a specific feature, we can send the parameters of the search region to the feature tracker. The advantages of this method are obvious: the feature tracker just needs to search a small region of interest instead of the whole picture, and the chances of a mismatch are reduced.

In the previous chapter, we considered two SLAM cases: SLAM with a single camera and SLAM with a stereo camera. In the following sections, the heuristic is discussed in detail with respect to the different vision sensors.

³ Single Constraint At A Time
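The per-feature covariances S_i sit on the diagonal of S in p × p blocks, so extracting them is a matter of slicing. A small illustrative helper (names not from the thesis), with p = 2 for a monocular and p = 3 for a stereo camera:

```python
import numpy as np

def per_feature_blocks(S, p):
    """Split the (l*p) x (l*p) innovation covariance into l diagonal p x p blocks S_i."""
    l = S.shape[0] // p
    return [S[i*p:(i+1)*p, i*p:(i+1)*p] for i in range(l)]
```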

6.2.1 Deriving the Innovation Covariance Matrix for SLAM with a Single Camera

In case we use a single camera, we predict a two-dimensional measurement ŷ_i^I for each visible three-dimensional feature y_i^W, referring to its 2D projection onto the image plane. Thus, if l features are visible, S is a 2l × 2l matrix, and l covariance matrices S_i of size 2 × 2 regarding the visible features can be extracted from it. These covariance matrices represent a two-dimensional normal distribution over image coordinates whose mean is the predicted measurement ŷ_i^I. The distribution can be visualised by an ellipse of confidence in the picture: its centre refers to the mean, the directions of its axes are given by the eigenvectors of the covariance matrix, and the square root of the according eigenvalue specifies the deviation of the distribution along each axis. According to [36], the surface area of the ellipse can be used as a measure of uncertainty. If a and b denote the lengths of the principal axes of the ellipse, the surface area A is calculated by A = πab.

The standard deviation of a distribution describes the average deviation of the related Gaussian; the values of the whole distribution spread much further. Possible realisations of the predicted measurement situated beyond the average deviation are merely less probable, but they should also be involved in the calculation of the amount of uncertainty and in the size of the search region. Thus, we introduce a factor n and multiply the lengths of the principal axes of the ellipse by it. Consider the estimated measurement ŷ_i^I with eigenvalues e_{1,i} and e_{2,i} of the according covariance matrix S_i. To derive the surface area of the demanded ellipse, we have to compute

A_i = \pi n \sqrt{e_{1,i}} \cdot n \sqrt{e_{2,i}} = \pi n^2 \sqrt{e_{1,i} e_{2,i}}   (6.3)

The value for n should extend the standard deviation such that the probability for the measurement to be found within the considered region is close to 100%. In [9], Davison chose n = 3: the probability that a realisation of a normally distributed random variable lies within the 3σ region around the mean of the distribution is approximately 99% ([16], p. 1119).
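A minimal sketch of this ranking for the monocular case, reusing the per_feature_blocks helper above; the constant factor πn² is the same for every feature and does not change the ordering, but is kept to match Equation (6.3):

```python
import numpy as np

def ellipse_area(S_i, n_sigma=3.0):
    """Uncertainty measure A_i = pi * n^2 * sqrt(e1 * e2) for a 2x2 block S_i."""
    e1, e2 = np.linalg.eigvalsh(S_i)          # eigenvalues of the symmetric block
    return np.pi * n_sigma**2 * np.sqrt(e1 * e2)

def most_uncertain_feature(blocks, n_sigma=3.0):
    """Index of the visible feature whose predicted measurement is least certain."""
    return max(range(len(blocks)), key=lambda i: ellipse_area(blocks[i], n_sigma))
```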


After calculating the amount of uncertainty about the predicted measurement of each visible 3D feature, we can rank them and send the parameters (predicted measurement and corresponding covariance matrix) of the landmark whose measurement is most difficult to predict to the feature tracker. The corresponding covariance matrix specifies the search region for the demanded feature measurement within the image, centred around the estimated measurement.

6.2.2 Deriving the Innovation Covariance Matrix for SLAM with a Stereo Camera

In the second SLAM scenario, we use a stereo camera to measure the visible features and derive a three-dimensional measurement vector for each of them. Thus, if l features are visible, l smaller covariance matrices S_i of size 3 × 3, each referring to one of the predicted measurements of the visible features, can be extracted from the innovation covariance matrix S. As already mentioned for the two-dimensional case, these covariances are related to a normal distribution whose means are the predicted measurements.

Considering one visible feature point y_i^W, the measurement vector for the SLAM scenario with a stereo camera consists of the image coordinates of the projection of this feature on the left image plane, y_i^I = (x_l^I, y_l^I), and the disparity d^I. The according innovation covariance matrix S_i is therefore not defined over one of the image coordinate frames, as was the case when using a monocular vision sensor. It can be represented as an ellipsoid in the space spanned by x_l^I, y_l^I and d^I. Analogous to the surface area of the ellipses, the volume of the ellipsoids can be seen as a measure of uncertainty. The equation to calculate the volume of an ellipsoid is

V = \frac{4}{3} \pi a b c

where a, b and c are the lengths of its principal axes. If we substitute the square roots of the eigenvalues for a, b and c and introduce the factor n again, we derive the equation to calculate the volume of each S_i:

V_i = \frac{4}{3} \pi n^3 \sqrt{e_{1,i} e_{2,i} e_{3,i}}
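The stereo case changes only the uncertainty measure. A sketch of the three-dimensional analogue of the monocular helper above (illustrative names again):

```python
import numpy as np

def ellipsoid_volume(S_i, n_sigma=3.0):
    """Uncertainty measure V_i = (4/3) * pi * n^3 * sqrt(e1 * e2 * e3) for a 3x3 block."""
    e = np.linalg.eigvalsh(S_i)                     # three eigenvalues of S_i
    return (4.0 / 3.0) * np.pi * n_sigma**3 * np.sqrt(np.prod(e))
```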

After calculating this volume for each ellipsoid, we are able to rank the visible 3D feature points. The predicted measurement and innovation covariance of the landmark whose measurement is most difficult to predict are sent to the feature tracker. Centred around this prediction, the covariance matrix defines the search region where the real measurement is likely to be found. Note that in this case the search region is not defined in image coordinates, as it is when using a monocular camera; the feature tracker therefore needs to project the ellipsoid onto both image planes to define the search regions in the pictures.
