
8th IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2011

A Real-Time Image-to-Panorama Registration Approach for Background Subtraction Using Pan-Tilt-Cameras


Eduardo Monari and Thomas Pollok
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB)
Fraunhoferstr. 1, 76131 Karlsruhe, Germany
eduardo.monari@iosb.fraunhofer.de

Abstract
While several background subtraction approaches have been developed for static cameras in the past, efficient and robust motion detection for non-static pan/tilt cameras is still a challenging task. Known approaches use image-to-image registration methods to generate a panorama background model of the scene, which spans a joint pixel coordinate system for later background estimation and subtraction. However, a real-time panorama-based background subtraction requires a highly efficient image-to-panorama registration. For this purpose, this paper proposes a key-frame representation of the panorama image and presents a strategy for fast global homography estimation in large panorama images.

1. Introduction
A classical approach for video-based motion or change detection is background subtraction. Here, the pixels of an image are analyzed over time, whereby the parameters of a so-called background model are estimated [1, 2]. In the best case, the resulting background model is a scene which contains no foreground objects (e.g. people, vehicles). The subsequent comparison of the current video frame with the background model allows each pixel of the image to be classified as background or foreground. The crucial part of these approaches is an appropriate spatio-temporal pixel-wise background modeling and parameter adaptation. For this, it is essential to establish pixel correspondences of one and the same point in the scene over time, that is, from frame to frame. Only then is a temporal analysis of background pixels possible. With static cameras, pixel correspondence is given implicitly by the fixed camera orientation. For moving cameras, however, the pixel correspondence has to be recovered by computationally intensive methods before background estimation and subtraction can be performed. Several methods on this issue have been proposed [3, 4].

Given established frame-to-frame pixel correspondences, first approaches for background subtraction using pan/tilt cameras have been proposed in the last few years [5, 6, 7, 8]. In these approaches (as in this paper), the basic idea behind panorama-based background subtraction is to generate a panorama image of the observed environment by frame stitching and homography estimation. The panorama represents a joint static reference pixel coordinate system, which allows a direct temporal pixel analysis. For background estimation and subtraction, subsequent video frames are first registered into the panorama coordinate system. In a second step, background estimation and subtraction is performed for the subset of the panorama background model in which the current frame is located. Consequently, the image-to-panorama registration can be regarded as a modular pre-processing step for a large number of already known background subtraction algorithms. Furthermore, several methods (including the approach proposed in this paper) are based on image information only, which means that no metadata such as azimuth and elevation angles of the pan/tilt unit is needed for robust image registration.

However, a still challenging part of such panorama-based background subtraction approaches is a computationally efficient image-to-panorama registration. If this step can be performed with high accuracy and in a computationally efficient way, the application of background estimation and subtraction is straightforward. The main contribution of this paper is a new approach for representing the background panorama by a so-called key-frame map. The approach allows for a highly efficient and highly precise frame-to-panorama registration, which is a crucial part of panorama-based background estimation. We will show that our approach allows robust and computationally efficient frame-to-panorama registration, independent of the overall panorama size.
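The two-step idea described above (register the frame into the panorama, then estimate and subtract the background only in the covered subset) can be illustrated with a minimal sketch. It is not the method used in this paper: the registration is reduced to an integer translation `offset`, the background model is a plain running average, and the names `subtract_in_panorama`, `alpha` and `threshold` are hypothetical choices made for illustration only.

```python
import numpy as np

def subtract_in_panorama(bg, frame, offset, alpha=0.05, threshold=25.0):
    """Classify the pixels of `frame` and update the panorama background `bg` in place.

    bg        : 2-D float array, background model in panorama coordinates
    frame     : 2-D float array, current gray-value frame
    offset    : (row, col) of the frame's top-left corner in the panorama
                (ASSUMPTION: integer translation instead of a full homography)
    alpha     : hypothetical learning rate of the running average
    threshold : hypothetical gray-value difference that marks foreground
    """
    r, c = offset
    h, w = frame.shape
    region = bg[r:r + h, c:c + w]            # view on the covered subset only
    diff = frame - region
    foreground = np.abs(diff) > threshold
    background = ~foreground
    # Adapt the model only where the pixel was classified as background.
    region[background] += alpha * diff[background]
    return foreground

# Toy data: a flat panorama, one frame with a bright 3x3 object,
# and a second, slightly brighter frame at a different pan position.
bg = np.full((20, 40), 100.0)
frame = np.full((10, 10), 100.0)
frame[2:5, 2:5] = 200.0
mask1 = subtract_in_panorama(bg, frame, offset=(5, 10))
mask2 = subtract_in_panorama(bg, np.full((10, 10), 110.0), offset=(5, 12))
```

Because only the slice of `bg` covered by the current frame is read and written, the cost of one step depends on the frame size and not on the panorama size, provided that the registration itself is cheap.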

978-1-4577-0845-9/11/$26.00 © 2011 IEEE


This paper is structured as follows: Section 2 describes the overall approach for efficient image-to-panorama registration. It includes a short introduction to image-to-image and image-to-panorama homography estimation in Section 2.1 and a detailed description of the key-frame-based approach in Section 2.2. Section 2.3 then describes a strategy for fast image-to-panorama registration. Section 3 presents a performance evaluation and shows qualitative results for background subtraction. Finally, Section 4 gives a short summary and an outlook on future work.

2. Description of the Approach

2.1. Real-Time Image-to-Image Registration
The core component of our framework is a real-time image-to-image registration with sub-pixel accuracy called m3 motion [9, 10]. This algorithm allows for efficient and robust estimation of the homography (perspective projection) between two subsequent video frames. Let $H$ be the homography (projection matrix) given by

$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix}.$$

The parameters of the projection matrix can be determined using several existing approaches, as described in [3, 4]. Given the homography $H = H_{k,k-1}$ of two subsequent images, for each normalized pixel coordinate $p = (x, y, 1)^T$ in the current video frame $b_k$ the corresponding coordinate $p' = (w x', w y', w)^T$ in $b_{k-1}$ is determined by

$$p' = H p. \qquad (1)$$

In this case, the video frame $b_k$ is transformed into the coordinate system of frame $b_{k-1}$.

However, for the generation of large panorama images, the goal is to transform all video frames into a common pixel coordinate system. Unfortunately, in most cases it is not possible to define a single reference image with sufficient overlap to all subsequent video frames for image-based homography estimation. So, for real-time processing in video surveillance, the homography to a reference pixel coordinate system has to be estimated using several frame-to-frame registration steps. Given a set of $k$ estimated homography matrices between subsequent video frames

$$\{H_{k,k-1}, H_{k-1,k-2}, \ldots, H_{1,0}\}, \qquad (2)$$

the homography between the last image $b_k$ and the very first image $b_0$ (the reference pixel coordinate system) can simply be established by the product of the frame-to-frame matrices:

$$H_{k,0} = \prod_{i=1}^{k} H_{i,i-1}, \qquad (3)$$

where the factors are ordered as $H_{1,0} H_{2,1} \cdots H_{k,k-1}$, so that, following Eq. 1, $H_{k,k-1}$ is applied to a pixel of $b_k$ first.

In doing so, starting with a reference image, a panorama image is generated by adding new frames into the pixel coordinate system of the very first image of the sequence, multiplying the last estimated frame-to-frame homography with the homography product at time $k-1$. This leads to a real-time panorama image, that is, one with minimal time delay. However, using this approach in a naive way has a significant drawback for background subtraction: each frame-to-frame homography estimation includes errors. Even if the error is at sub-pixel scale, the multiplication of many homography matrices (in real-time video, hundreds within a few seconds) makes the error significant over time. While the absolute error of the homography estimation is of minor interest for motion detection, the challenge for a long-term stable panorama background estimation and subtraction is to achieve reproducibility of the homography estimates with respect to the reference pixel coordinate system. Reproducibility is needed to establish an accurate pixel-wise correspondence between the current video frame and the background image, which has been estimated from previous video frames. To overcome this drawback of continuous error accumulation, a two-step process is proposed in the following sections.

Figure 1. Sequential frame-to-frame registration leads to error accumulation. This error is crucial for the long-term robustness (repeat accuracy) of the image-to-panorama registration.

2.2. Generation of the Reference Panorama and Key-Frame Extraction

In a first step (initialization mode), a so-called key-frame map is generated (Fig. 2). The key-frame map is a representation of the scene to observe by a limited number of $G$ representative video frames $B^{key} = \{b^{key}_0, b^{key}_1, b^{key}_2, \ldots, b^{key}_G\}$ which have been transformed into the joint coordinate system. Each key-frame is assigned to the corresponding global homography matrix from $H^{key} = \{I, H^{key}_1, H^{key}_2, \ldots, H^{key}_G\}$, which have been calculated by the frame-to-frame registration and matrix multiplication described by Eq. 3. The key-frame map of the scene is updated during the initialization phase only. After the initialization phase, in the following surveillance mode, the key-frames can be considered as static sets of local feature points of the environment. They are used for efficient image-to-panorama registration only and are not included in background estimation and subtraction.

Figure 2. Overview of the key-frame-based panorama representation. During an initialization phase the scene is scanned and a key-frame map is generated. The key-frames are used during the surveillance phase for efficient frame-to-panorama registration.

In practice we observed that the key-frame map can be generated in a short time, that is, from a limited number of video frames (e.g. a few hundred). In this case, on the one hand, the resulting absolute error in homography estimation is assumed to be small (in practice 5 pixels). On the other hand, due to the limited duration of the initialization phase, the homography error (panorama drifting) is stopped for later processing.

The key-frames are extracted as follows: During the initialization step the camera performs an overview scan of the area to observe. The very first frame $b_0$ of the video stream is instantly defined as the reference pixel coordinate system and therefore as the first key-frame $b^{key}_0$, with the identity matrix $I$ as global homography. For subsequent frames $b_k$, the overlapping area (after transformation into reference coordinates) with each previously extracted key-frame is determined. If a sufficiently high overlap with one of the already existing key-frames is detected, an additional key-frame is considered redundant. In this case the current video frame is discarded. Otherwise, the frame is stored as a new key-frame together with its global homography. The procedure is repeated during the whole scene scan. The resulting set $B^{key}$ of detected key-frames and its associated homography matrices $H^{key}$ are stored in a database as a so-called key-frame map. As an example, Fig. 3 shows panorama images with highlighted borders of the extracted key-frames.

For later fast key-frame search, additional information on the geometric relationships (neighborhood) of the key-frames is determined. Two key-frames are considered neighbors if their overlapping area is larger than a given threshold. In our tests we used a minimum overlap of 50%. The key-frame neighborhood is stored in the database as an adjacency matrix $A$ (see Fig. 4a).
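The initialization mode described in Sections 2.1 and 2.2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the frame-to-frame homographies are taken as given, the overlap is approximated by sampling a grid of points instead of computing an exact area, and the redundancy threshold `redundant=0.7` is a hypothetical value (only the 50% neighborhood threshold is taken from the text). The helper names `shift`, `apply_h`, `overlap` and `build_keyframe_map` are likewise introduced here for illustration.

```python
import numpy as np

def shift(tx):
    """Homography of a pure horizontal translation (toy stand-in for a camera pan)."""
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

def apply_h(H, pts):
    """Map (N, 2) pixel coordinates through a 3x3 homography, p' = H p (Eq. 1)."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def overlap(H_a, H_b, size, n=32):
    """Approximate fraction of frame a that lies inside frame b.

    H_a and H_b map each frame into the reference coordinate system.
    ASSUMPTION: frame a is sampled on an n x n grid instead of intersecting polygons.
    """
    w, h = size
    xs, ys = np.meshgrid(np.linspace(0, w - 1, n), np.linspace(0, h - 1, n))
    grid = np.column_stack([xs.ravel(), ys.ravel()])
    q = apply_h(np.linalg.inv(H_b) @ H_a, grid)       # a -> reference -> b
    eps = 1e-6
    inside = ((q[:, 0] >= -eps) & (q[:, 0] <= w - 1 + eps) &
              (q[:, 1] >= -eps) & (q[:, 1] <= h - 1 + eps))
    return float(inside.mean())

def build_keyframe_map(steps, size, redundant=0.7, neighbor=0.5):
    """Build a key-frame map from frame-to-frame homographies.

    steps[k-1] is H_{k,k-1}. Returns the key-frame indices, their global
    homographies and the adjacency matrix A.
    """
    H_global = np.eye(3)
    key_ids, key_H = [0], [np.eye(3)]                 # b_0 is the reference key-frame
    for k, H_step in enumerate(steps, start=1):
        H_global = H_global @ H_step                  # Eq. 3: H_{k,0} = H_{k-1,0} H_{k,k-1}
        best = max(overlap(H_global, H_key, size) for H_key in key_H)
        if best < redundant:                          # not redundant: store as key-frame
            key_ids.append(k)
            key_H.append(H_global.copy())
    n_keys = len(key_H)
    A = np.zeros((n_keys, n_keys), dtype=int)
    for i in range(n_keys):
        for j in range(i + 1, n_keys):
            if overlap(key_H[i], key_H[j], size) > neighbor:
                A[i, j] = A[j, i] = 1
    return key_ids, key_H, A

# Toy scan: the camera pans to the right by 40 pixels per frame at CIF size.
size = (352, 288)
key_ids, key_H, A = build_keyframe_map([shift(40)] * 9, size)
```

In this toy scan every third frame becomes a key-frame, and the adjacency matrix forms a chain in which each key-frame is a neighbor only of the key-frames directly before and after it.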

2.3. Image-to-Panorama Registration


After generation of the key-frame map, the system switches to a real-time processing mode for image-to-panorama registration and for later background estimation and subtraction (Fig. 2). The background panorama is generated by registering each following video frame into the reference coordinate system. This is achieved by a frame-to-frame homography estimation between the current video frame $b_k$ and a key-frame $b^{key}_i \in B^{key}$ with sufficient overlap to the current frame. Then, the global homography (to the reference key-frame $b^{key}_0$) can be calculated by multiplying the homography $H_{k,i}$ between the current video frame $b_k$ and key-frame $b^{key}_i$ with the global homography $H^{key}_i$ of the key-frame. Following the column-vector convention of Eq. 1, $H_{k,i}$ is applied first:

$$H^{key}_{k,0} = H^{key}_i \, H_{k,i} \qquad (4)$$

As a reliability metric for the estimated homography $H_{k,i}$, the resulting frame overlap is used, wherein a minimum number of corresponding feature points is assumed. The key-frame with the largest overlap to the current frame is assumed to produce the most accurate homography estimation:

$$\hat{b}^{key} = b^{key}_{j_k} \quad \text{with} \quad j_k = \arg\max_{i \in B^{key}} \mathrm{Overlap}(b_k, b^{key}_i) \qquad (5)$$

Unfortunately, this procedure means that for each video frame an exhaustive search for the best key-frame in $B^{key}$ is needed, which leads to a high computational load, especially in the case of large background panoramas with many key-frames. To overcome this drawback, an efficient best key-frame search has been developed.

Figure 3. The extracted key-frames in the reference pixel coordinate system are highlighted. The red dots show the centroids of the frames for calculation of the geometric neighborhood.

2.3.1 Fast Best Key-Frame Search

To speed up the best key-frame search, the adjacency matrix as a representation of the geometric neighborhood of the key-frames has been calculated during the initialization phase. This allows the frame-to-frame registration to be performed only between the current video frame $b_k$ and those key-frames $\{ b^{key}_i \in B^{key} \mid A_{i j_{k-1}} = 1 \}$ which are identified as neighbors of the best key-frame at time $k-1$. In other words, if we consider the adjacency matrix as representing a graph, our approach ensures that, if a pan/tilt camera is moving smoothly through the panorama, a reliable key-frame (with sufficient overlap to the current camera view) is always determined by single node transitions. In practice the number of neighboring key-frames is small, even in very large panorama images (typically 5-10). Consequently, the computational load for optimal homography estimation is independent of the overall panorama size, and the estimation can be achieved with limited resources.

Figure 4. In a) the neighbourhood of the key-frames is visualized by a graph, which corresponds to the determined adjacency matrix. Starting from the key-frame used at time $k-1$ (b), in the subsequent processing step (c) only this key-frame and its neighbours are needed for evaluation (key-frame hand-over).

2.3.2 Heuristic Neighbourhood Ranking

However, the homography estimation can be accelerated significantly by a simple heuristic neighborhood ranking. The idea is to introduce a systematic priority ranking for the neighboring key-frames. As a metric for priority, we use the overlapping area of the last video frame $b_{k-1}$ with each neighboring key-frame (Fig. 5). Starting with the neighbor key-frame with the largest overlap and proceeding in descending order, the frame-to-key-frame registration is performed. After each homography estimation, the resulting overlap with the corresponding key-frame is used as a confidence metric. If the homography is classified as reliable, the key-frame search is stopped. The resulting behavior of this approach is similar to a kind of short-time hysteresis for key-frame hand-over. Due to the slightly out-of-date neighborhood ranking, which has been calculated at time $k-1$, the previously used key-frame is preferred for at least one processing step, even if it is not the best one. However, due to the assumed smooth camera movement and the near real-time processing achieved by this method, this effect is negligible and the registration remains highly robust. Furthermore, using this heuristic neighborhood ranking, in most cases only a single frame-to-key-frame registration per processing step is needed, which leads to (near) real-time performance even in very large panorama images.

Figure 5. The frame-to-key-frame registration can be sped up by a heuristic ranking of the key-frame candidates. Depending on the frame-to-key-frame overlaps at time $k-1$, a rank order for key-frame selection at time $k$ is defined.

Figure 6. The figure shows the result of an image-to-panorama registration for indoor surveillance. The red box shows the current video frame transformed into the panorama pixel coordinate system.
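One surveillance-mode step, combining the neighbor-restricted search of Section 2.3.1 with the ranking and early stop of Section 2.3.2, might look like the sketch below. Several parts are assumptions made for illustration: `estimate_h` is a hypothetical callback standing in for the real frame-to-key-frame registration, the overlap is approximated by grid sampling, the reliability threshold `reliable=0.5` is an invented value, and the toy data uses pure translations. The global homography is composed with the frame-to-key-frame homography applied first, as required by the convention of Eq. 1.

```python
import numpy as np

def shift(tx):
    """Homography of a pure horizontal translation (toy stand-in for a camera pan)."""
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

def overlap(H_a, H_b, size, n=32):
    """Approximate fraction of frame a inside frame b (ASSUMPTION: grid sampling)."""
    w, h = size
    xs, ys = np.meshgrid(np.linspace(0, w - 1, n), np.linspace(0, h - 1, n))
    grid = np.column_stack([xs.ravel(), ys.ravel(), np.ones(n * n)])
    q = grid @ (np.linalg.inv(H_b) @ H_a).T           # a -> reference -> b
    q = q[:, :2] / q[:, 2:3]
    eps = 1e-6
    inside = ((q[:, 0] >= -eps) & (q[:, 0] <= w - 1 + eps) &
              (q[:, 1] >= -eps) & (q[:, 1] <= h - 1 + eps))
    return float(inside.mean())

def register_frame(estimate_h, j_prev, H_prev, key_H, A, size, reliable=0.5):
    """Register the current frame against the key-frame map.

    estimate_h : hypothetical callback, estimate_h(i) returns H_{k,i}
    j_prev     : key-frame index used at time k-1
    H_prev     : global homography of the frame at time k-1
    Returns (key-frame index, global homography, number of registrations tried).
    """
    # Candidates: the previously used key-frame and its neighbors in A.
    candidates = [j_prev] + [int(i) for i in np.flatnonzero(A[j_prev]) if i != j_prev]
    # Rank by the overlap of the previous frame with each candidate, descending.
    candidates.sort(key=lambda i: overlap(H_prev, key_H[i], size), reverse=True)
    best = None
    for tried, i in enumerate(candidates, start=1):
        H_k0 = key_H[i] @ estimate_h(i)               # global homography (Eq. 4)
        score = overlap(H_k0, key_H[i], size)         # confidence metric
        if score >= reliable:
            return i, H_k0, tried                     # reliable: stop the search
        if best is None or score > best[0]:
            best = (score, i, H_k0)
    return best[1], best[2], len(candidates)          # fall back to best candidate

# Toy key-frame map: four key-frames in a row, each a neighbor of the next.
size = (352, 288)
key_H = [shift(0), shift(120), shift(240), shift(360)]
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])

def perfect_estimator(H_true):
    """Simulated registration that returns the exact H_{k,i} for a known pose."""
    return lambda i: np.linalg.inv(key_H[i]) @ H_true

step1 = register_frame(perfect_estimator(shift(200)), 1, shift(160), key_H, A, size)
step2 = register_frame(perfect_estimator(shift(280)), step1[0], shift(200), key_H, A, size)
```

In this toy run the first step keeps the previously used key-frame 1 although key-frame 2 already overlaps the new frame more, and the hand-over to key-frame 2 happens one step later. Both steps need a single registration, which mirrors the hysteresis-like behavior described above.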

3. Experimental Results

3.1. Computational Time
The proposed key-frame-based image-to-panorama registration approach has been evaluated regarding the computation time for global homography estimation of a single video frame. For the measurement of computational time, a standard Pentium Dual-Core 3.2 GHz PC has been used. Table 1 shows the obtained processing times for image-to-panorama registration.

Panorama resolution   Video resolution   No. of key-frames   Proc. time in ms
1408 x 576            CIF                3-30*               17-22
2816 x 1152           4CIF               3-30*               26-32

Table 1. Processing time (in ms) for different image resolutions and numbers of key-frames. (*) 30 is the maximum number of key-frames tested in our evaluation. However, the approach allows usage of an unlimited number of key-frames without performance loss.

The results show near real-time performance even for large panoramas with 2816x1152 resolution, represented by 3-30 key-frames. However, it is important to mention that a reliable key-frame extraction during the initialization phase is of high importance for the robustness and performance of our approach. Even though this phase can be assumed to be done in advance (off-line), the additional computational load for key-frame extraction should not be neglected.

3.2. Application to Background Subtraction

As a proof of concept for panorama-based background subtraction, qualitative results for moving object segmentation are presented. As mentioned before, an arbitrary approach can be used for background estimation and subtraction. For background estimation we use a slightly modified version of the approach described in [11], with additional post-processing for segmentation improvement (shadow detection, disturbance data removal). The variation of the gray value of each pixel is described as a single Gaussian, using a recursive estimation algorithm. The foreground objects are extracted using a background subtraction approach with a segmentation enhancement based on a pyramidal spatial Markov model for fast dynamic noise reduction. The object segments are represented by a blob image determined by a connected component analysis. The background subtraction shows promising results, but there is still room for improvement. In particular, the significantly changing illumination conditions, which depend on camera orientation and camera AGC compensation, are still challenging. The histogram specification integrated in our background subtraction approach reduces the resulting clutter, but not completely.

Figure 7. The proposed approach has been tested for application to background subtraction. The figure shows exemplary results. Regarding this application, future work will focus on object segmentation and camera AGC compensation.

4. Conclusion and Future Work

In this paper we proposed an approach for efficient image-to-panorama registration by a so-called key-frame representation. First, during an initialization phase, a scan of the scene to observe is performed using the pan/tilt camera. Here, the key-frames and the associated homographies to a reference pixel coordinate system are determined from the video stream, using a highly accurate, real-time capable image-to-image registration approach. The generation of an adjacency matrix, which describes the neighborhood of the key-frames in the joint panorama coordinates, completes the initialization phase. In surveillance mode, our framework performs a highly efficient frame-to-panorama registration by, first, finding the best key-frame for homography estimation and, second, frame-to-key-frame registration and global homography calculation. We have shown that our approach allows for real-time processing of CIF resolution images, even in large panorama images. For higher resolution video frames (4CIF), near real-time processing is achieved.

In future work we will improve our approach regarding two main aspects: First, if the scene changes significantly over time (e.g. new objects as in a parking area, day/night illumination, etc.), the key-frames have to be updated or supplemented over time. In our current framework such an update is done by a system re-initialization. In the future we will work on new methods for continuous key-frame update, taking into consideration the background shifting (reproducibility) problem. Second, the structure of our approach shows promising aspects for further speed-up. In particular, the key-frame-based subdivision of the panorama allows for efficient parallelization of the best key-frame search and the image-to-key-frame registration. A GPU (Graphics Processing Unit) implementation of these steps is currently under examination.

5. Acknowledgements
This work was supported in part by the German Ministry of Education and Research (BMBF), in conjunction with the CamInSens project (www.caminsens.org).

References
[1] M. Piccardi, "Background subtraction techniques: a review," in Proc. IEEE International Conference on Systems, Man and Cybernetics, vol. 4, 10-13 Oct. 2004, pp. 3099-3104.
[2] S. Herrero and J. Bescós, "Background subtraction techniques: Systematic evaluation and comparative analysis," in LNCS: Advanced Concepts for Intelligent Vision Systems. Springer Berlin / Heidelberg, 2009, vol. 5807, pp. 33-42.
[3] M. Brown and D. Lowe, "Automatic panoramic image stitching using invariant features," International Journal of Computer Vision, vol. 74, pp. 59-73, 2007.
[4] M. Müller, W. Krüger, and G. Saur, "Robust image registration for fusion," Inf. Fusion, vol. 8, pp. 347-353, October 2007.
[5] P. Azzari, L. Di Stefano, and A. Bevilacqua, "An effective real-time mosaicing algorithm apt to detect motion through background subtraction using a PTZ camera," in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), Sept. 2005, pp. 511-516.
[6] A. Bevilacqua and P. Azzari, "High-quality real time motion detection using PTZ cameras," in IEEE International Conference on Video and Signal Based Surveillance (AVSS), Nov. 2006, pp. 23-23.
[7] J. Zhang, Y. Wang, J. Chen, and K. Xue, "A framework of surveillance system using a PTZ camera," in 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), vol. 1, 2010, pp. 658-662.
[8] K. Xue, Y. Liu, J. Chen, and Q. Li, "Panoramic background model for PTZ camera," in Image and Signal Processing (CISP), 2010 3rd International Congress on, vol. 1, 2010, pp. 409-413.
[9] T. Müller, T. Honke, and M. Müller, "CART III: improved camouflage assessment using moving target indication," G. C. Holst, Ed., vol. 7300, no. 1. SPIE, 2009, p. 73000N.
[10] W. Krüger, "Robust and efficient map-to-image registration with line segments," Mach. Vision Appl., vol. 13, no. 1, pp. 38-50, 2001.
[11] E. Monari and C. Pasqual, "Fusion of background estimation approaches for motion detection in non-static backgrounds," in Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 5-7 Sept. 2007, pp. 347-352.
