
Multi View Image Surveillance and Tracking

James Black

A thesis submitted for the degree of Doctor of Philosophy

City University, School of Engineering, London EC1V 0HB


April 2004


Abstract
The work presented in this thesis provides a framework for object tracking using multiple camera views. The application uses several widely separated, overlapping and non-overlapping camera views for object tracking in an outdoor environment. The system employs a centralised control strategy, where each intelligent camera unit transmits tracking data to a multi view tracking server. The tracking data generated by each intelligent camera unit is stored in a central surveillance database during live operation. Each camera in the surveillance network is calibrated using known 3D landmark points. The system applies 3D Kalman filtering for object tracking and trajectory prediction. The 3D Kalman filter is effective for robustly tracking objects through occlusion. For overlapping camera views the homography constraint is used to match moving objects in each camera view. The homography is automatically learned by applying a robust search to a set of object trajectories in each overlapping camera view. The system uses symbolic scene information to reason about object handover between non-overlapping viewpoints that are separated by a small temporal distance, of the order of seconds. The major entry and exit regions between each non-overlapping view are used to improve the robustness of predicting where objects should re-appear having left the field of view of an adjacent camera. This thesis also presents a novel framework for performance evaluation of video tracking algorithms. The data stored in a surveillance database is used to generate pseudo synthetic video sequences, which can be used for performance evaluation. A comprehensive set of metrics is defined to measure the quality of ground truth tracks and characterise the performance of video tracking. This framework is a novel contribution to performance evaluation, since it is possible to automatically generate large volumes of testing data without the need to perform exhaustive manual ground truth generation. The framework allows any video tracking algorithm to be evaluated over a variety of datasets, which vary in perceptual complexity and represent a number of different tracking scenarios.


Declaration
The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others.

Parts of the research presented in this thesis have appeared in the following publications:

J Black, T Ellis, D Makris. A Hierarchical Database for Visual Surveillance Applications. The 2004 IEEE International Conference on Multimedia and Expo (ICME2004), Taipei, Taiwan, June 2004.

J Black, T Ellis, D Makris. Wide Area Surveillance With a Multi Camera Network. IEE Intelligent Distributed Surveillance Systems, London, February 2004.

J Black, T Ellis, P Rosin. A Novel Method for Video Tracking Performance Evaluation. The Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, October 2003.

T Ellis, J Black. A Multi-view surveillance system. IEE Intelligent Distributed Surveillance Systems, London, February 2003.

J Black, T Ellis, P Rosin, Multi view image surveillance and monitoring. IEEE Workshop on Motion and Video Computing, Orlando, December 2002, pp 169-174.

J Black, T Ellis. Intelligent image surveillance and monitoring. The Institute of Measurement and Control. Volume 35, No. 8, September 2002, pp 204-208.

J Black, T Ellis, Multi camera image measurement and correspondence. The Journal of the International Measurement Confederation (IMEKO) Volume 32, No. 1, July 2002, pp 61-71

J Black, T Ellis, Multi camera image tracking. The Proceedings of the Second International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2001), Kauai, Hawaii, December 2001.

Acknowledgements
It would not have been possible to complete the research in this thesis without the help and support of my friends and colleagues at City University. I would like to thank Professor Tim Ellis for initially encouraging me to pursue a PhD in machine vision, and for being my supervisor during my time at City University. He has given me direction, motivation and guidance, and ensured that I kept a clear set of research goals and objectives, while at the same time encouraging me to work independently. The experience I have gained from my research has been of great benefit in both my academic and commercial careers. I would also like to thank my colleagues at City University, Dimitrios Makris, Ming Xu and Paul Walcott, for the numerous discussions we have had. I would also like to thank Dr Paul Rosin at Cardiff University, who has provided valuable input to the research presented in this thesis. I would also like to thank Jim Hooker in the Department of Civil Engineering at City University, who performed numerous surveys of the University campus in order to allow us to calibrate the cameras in our surveillance network. I would lastly like to thank the members of my family, who have consistently given me support and encouragement while completing my research.


Contents
1 INTRODUCTION ..... 13
   1.1 SURVEILLANCE AND MONITORING ..... 13
   1.2 RESEARCH AIMS AND OBJECTIVES ..... 17
   1.3 ORGANISATION AND PRESENTATION ..... 20
   1.4 CONTRIBUTIONS ..... 20

2 PREVIOUS WORK ..... 22
   2.1 BACKGROUND ..... 22
   2.2 SINGLE VIEW TRACKING ..... 22
   2.3 MULTI VIEW TRACKING SYSTEMS ..... 23
   2.4 PERFORMANCE EVALUATION OF VIDEO TRACKING ALGORITHMS ..... 27
   2.5 SUMMARY ..... 29

3 FEATURE MATCHING AND 3D MEASUREMENTS ..... 32
   3.1 BACKGROUND ..... 32
   3.2 CAMERA CALIBRATION ..... 32
   3.3 TWO VIEW RELATIONS ..... 36
      3.3.1 Homography Transformation ..... 36
      3.3.2 Epipole Geometry ..... 37
   3.4 ROBUST HOMOGRAPHY ESTIMATION ..... 38
      3.4.1 Feature Detection ..... 38
      3.4.2 Least Quantile of Squares ..... 39
   3.5 3D MEASUREMENT AND UNCERTAINTY ..... 43
      3.5.1 3D Measurements ..... 44
      3.5.2 3D Measurement Uncertainty ..... 47
   3.6 EXPERIMENTS AND ANALYSIS ..... 49
      3.6.1 Homography Estimation Experiment By Simulation ..... 50
      3.6.2 Homography Estimation PETS2001 Datasets ..... 53
      3.6.3 Homography Estimation Northampton Square Dataset ..... 54
      3.6.4 Temporal Calibration ..... 55
      3.6.5 3D Measurement and Uncertainty Experiment By Application ..... 57
   3.7 SUMMARY ..... 60

4 OBJECT TRACKING AND TRAJECTORY PREDICTION ..... 62
   4.1 BACKGROUND ..... 62
   4.2 FEATURE DETECTION AND 2D TRACKING ..... 63
   4.3 FEATURE MATCHING BETWEEN OVERLAPPING VIEWS ..... 63
      4.3.1 Viewpoint Correspondence (Two Views) ..... 63
      4.3.2 Viewpoint Correspondence (Three Views) ..... 64
   4.4 TRACKING IN 3D ..... 66
      4.4.1 3D Data Association ..... 69
      4.4.2 Outline of 3D Tracking Algorithm ..... 70
   4.5 NON-OVERLAPPING VIEWS ..... 70
      4.5.1 Entry and Exit Regions ..... 71
      4.5.2 Object Handover Regions ..... 71
      4.5.3 Object Handover Agents ..... 73
   4.6 EXPERIMENTS AND EVALUATION ..... 75
      4.6.1 Object Tracking Using Overlapping Cameras ..... 75
      4.6.2 Object Tracking Between Widely Separated Views ..... 81
   4.7 SUMMARY ..... 83

5 SYSTEM ARCHITECTURE ..... 85
   5.1 BACKGROUND ..... 85
   5.2 INTELLIGENT CAMERA NETWORK ..... 87
   5.3 MULTI VIEW TRACKING SERVER (MTS) ..... 87
      5.3.1 Temporal Alignment ..... 88
      5.3.2 Viewpoint Integration ..... 89
      5.3.3 3D Tracking ..... 90
   5.4 OFFLINE CALIBRATION/LEARNING ..... 90
   5.5 SURVEILLANCE DATABASE DESIGN ..... 91
      5.5.1 Image Framelet Layer ..... 92
      5.5.2 Object Motion Layer ..... 93
      5.5.3 Semantic Description Layer ..... 96
   5.6 METADATA GENERATION ..... 99
   5.7 APPLICATIONS ..... 100
      5.7.1 Performance Evaluation ..... 100
      5.7.2 Visual Queries ..... 100
   5.8 SUMMARY ..... 104

6 VIDEO TRACKING EVALUATION FRAMEWORK ..... 105
   6.1 BACKGROUND ..... 105
   6.2 PERFORMANCE EVALUATION ..... 106
   6.3 PSEUDO SYNTHETIC VIDEO ..... 108
      6.3.1 Ground Truth Track Selection ..... 108
      6.3.2 Pseudo Synthetic Video Generation ..... 113
   6.4 PERCEPTUAL COMPLEXITY ..... 116
   6.5 SURVEILLANCE METRICS ..... 121
   6.6 EXPERIMENTS AND EVALUATION ..... 123
      6.6.1 Ground Truth Track Selection Surveillance Database A (Cloudy Day) ..... 123
      6.6.2 Ground Truth Track Selection Surveillance Database B (Sunny Day) ..... 127
      6.6.3 Single View Tracking Evaluation (Qualitative) ..... 131
      6.6.4 Single View Tracking Evaluation (Quantitative) ..... 134
   6.7 SUMMARY ..... 138

7 CONCLUSION ..... 139
   7.1 RESEARCH SUMMARY ..... 139
   7.2 LIMITATIONS ..... 141
   7.3 FUTURE WORK ..... 142
   7.4 EPILOGUE ..... 143

BIBLIOGRAPHY ..... 145
APPENDIX A CAMERA MODELS ..... 156
APPENDIX B JACOBIAN MATRIX OF 2D TO 3D TRANSLATION ..... 157
   B.1 IMAGE COORDINATES TO IDEAL UNDISTORTED COORDINATES ..... 157
   B.2 IDEAL UNDISTORTED COORDINATES TO WORLD COORDINATES ..... 158
APPENDIX C SURVEILLANCE DATABASE TABLES ..... 162
   C.1 REGION ..... 163
   C.2 CAMERA ..... 163
   C.3 VIDEOSEQ ..... 164
   C.4 MULTIVIDEOSEQ ..... 165
   C.5 TRACKS3D ..... 165
   C.6 TRACKS2D ..... 166
   C.7 MULTITRACKS2D ..... 167
   C.8 FRAMELETS ..... 168
   C.9 TIMESTAMPS ..... 168


List of Figures
Figure 2.1 Visibility criteria of camera network ..... 23
Figure 3.1 Example of landmark points gathered by a survey of the surveillance region ..... 34
Figure 3.2 The epipole geometry between a pair of camera views ..... 38
Figure 3.3 Features used to estimate the homography transformation between two camera views using LQS method ..... 40
Figure 3.4 Feature matching using: epipole line analysis and homography alignment. The red circles represent the tracked centroid of each object, the white circles represent the centroids projected by homography transformation. The white lines represent the epipole lines (derived from the calibration information) projected through each centroid ..... 43
Figure 3.5 Geometric view of the minimum discrepancy ..... 45
Figure 3.6 Covariance fusion by accumulation (a), Covariance fusion by intersection (b) ..... 49
Figure 3.7 Synthetic trajectories created to evaluate LQS method by simulation ..... 50
Figure 3.8 Histograms of re-projection errors of correspondence points ..... 51
Figure 3.9 LQS Evaluation by Simulation ..... 52
Figure 3.10 Correspondence points found for dataset 1 (top row), and dataset 2 (bottom row) by using LQS search ..... 53
Figure 3.11 Correspondence points homography estimation method Northampton dataset 1 ..... 54
Figure 3.12 Correspondence points homography estimation method Northampton dataset 2 ..... 54
Figure 3.13 Correspondence points homography estimation method Northampton dataset 3 ..... 55
Figure 3.14 LQS Plots between two cameras for different time offsets ..... 56
Figure 3.15 Least squares estimate measurements for toy car video sequence ..... 58
Figure 3.16 Uncertainty of the 3D measurements for toy car sequence ..... 59
Figure 4.1 Example of feature matching in PETS2001 dataset one ..... 65
Figure 4.2 Example of viewpoint correspondence between three overlapping camera views ..... 66
Figure 4.3 Block diagram of the Kalman filter ..... 66
Figure 4.4 Handover regions for six cameras in the surveillance system ..... 73
Figure 4.5 Examples of handling dynamic occlusion, frames 601 and 891 in data sequence one ..... 76
Figure 4.6 Tracker output for frame 1366 in data sequence two. The tree splits the tracked object but the 3D tracker still correctly assigns a single object label ..... 77
Figure 4.7 Objects during static occlusion (top image), and dynamic occlusion (bottom image) ..... 77
Figure 4.8 An example of resolving both dynamic and static occlusions ..... 78
Figure 4.9 Tracking error during dynamic and static object occlusions ..... 78
Figure 4.10 Example of object tracking using three overlapping camera views. The 3D trajectories are visualised of a ground plane map ..... 79
Figure 4.11 Plot of 3D tracking error for tracks 1 and 3 in the three-camera video sequence ..... 80
Figure 4.12 Example of object handover failure due to the size of an object ..... 82
Figure 4.13 Example of object tracking between adjacent non-overlapping views ..... 82
Figure 5.1 System Architecture of the Image Surveillance Network of Cameras ..... 86
Figure 5.2 Graphical illustration of the temporal alignment process ..... 89
Figure 5.3 Example of objects stored in the image framelet layer ..... 93
Figure 5.4 Camera network on University campus showing 6 cameras distributed around the building, numbered 1-6 from top left to bottom right, raster-scan fashion ..... 95
Figure 5.5 Re-projection of the camera views from Figure 5.4 onto a common ground plane, showing tracked objects trajectories plotted into the views (white, red, blue and green trails) ..... 96
Figure 5.6 Re-projection of routes onto ground plane ..... 97
Figure 5.7 Example of routes, entry and exit zones stored in semantic description layer ..... 98
Figure 5.8 Conceptual Layout of High Level Surveillance Database ..... 102
Figure 5.9 Visualisation of results returned by spatial temporal activity queries ..... 103
Figure 5.10 Example of online route classification ..... 103
Figure 6.1 Generic framework for quantitative evaluation of a set of video tracking algorithms ..... 107
Figure 6.2 Distribution of the ground truth metrics ..... 111
Figure 6.3 Examples of how phantom objects can be used to form dynamic occlusions in synthetic video sequences ..... 115
Figure 6.4 Using ground truth tracks to simulate dynamic occlusions ..... 116
Figure 6.5 System diagram for main input and outputs of PViGEN ..... 117
Figure 6.6 Perceptual complexity: left framelets plotted for p(new)=0.01, middle framelets plotted for p(new)=0.10, right framelets plotted for p(new)=0.20 ..... 117
Figure 6.7 Perceptual Complexity average objects per frame, and average number of dynamic occlusions ..... 119
Figure 6.8 Plots of number of objects in each frame of a sample of synthetic video sequences for different values of p(new) ..... 120
Figure 6.9 Illustration of surveillance metrics: (a) Image to illustrate true positives, false negative and false positive, (b) Image to illustrate a fragmented tracked object trajectory ..... 123
Figure 6.10 Example of outlier tracks identified during ground truth track selection ..... 125
Figure 6.11 Top four ranked ground truth tracks ..... 126
Figure 6.12 Bottom four ranked ground truth tracks ..... 127
Figure 6.13 Distribution of the average path coherence (a), average colour coherence (b), and average shape coherence of each track selected from the surveillance database (sunny day) ..... 129
Figure 6.14 Top four ranked ground truth tracks ..... 130
Figure 6.15 Bottom four ranked ground truth tracks ..... 131
Figure 6.16 An example of how poor track initialisation results in low object track detection rate of the pedestrian leaving the vehicle ..... 133
Figure 6.17 Example of dynamic occlusion reasoning for PETS2001 dataset 2 camera 2 ..... 133
Figure 6.18 Plot of Object Tracking Error (OTE) ..... 135
Figure 6.19 Plot of Track Detection Rate (TDR) ..... 136
Figure 6.20 Plot of Tracking Success Rate (TSR) ..... 136
Figure 6.21 Plot of Occlusion success rate (OSR) ..... 137


List of Tables
Table 3.1 Summary of extrinsic parameters of six cameras shown in figure 3.1 ..... 35
Table 3.2 Summary of intrinsic parameters of six cameras shown in figure 3.1 ..... 35
Table 3.3 Summary of the calibration errors for each camera shown in figure 3.1 ..... 35
Table 3.4 Summary of error statistics for homography estimation ..... 55
Table 5.1 Attributes stored in image framelet layer ..... 93
Table 5.2 Attributes stored in object motion layer (2D Tracker) ..... 94
Table 5.3 Attributes stored in object motion layer (Multi View Tracker) ..... 95
Table 5.4 Attributes stored in semantic description layer (entry/exit zones) ..... 97
Table 5.5 Attributes stored in semantic description layer (routes) ..... 97
Table 5.6 Attributes stored in semantic description layer (route nodes) ..... 98
Table 5.7 Attributes metadata generated (object_summary) ..... 99
Table 5.8 Attributes metadata generated (object_history) ..... 99
Table 6.1 Summary of surveillance metrics for PETS2001 dataset 2 camera 2 ..... 132
Table 6.2 Summary of object tracking metrics ..... 132
Table 6.3 Summary of perceptual complexity of PETS datasets ..... 133
Table 6.4 Summary of the perceptual complexity of the synthetic video sequences ..... 137
Table 6.5 Summary of metrics generated using each synthetic video sequence ..... 137



1 Introduction
1.1 Surveillance and Monitoring
Image surveillance and monitoring is an area being actively investigated by the machine vision research community. With several government agencies investing significant funds in closed circuit television (CCTV) technology, methods are required to simplify the management of the enormous volume of information generated by these systems. CCTV technology has become commonplace in society to combat anti-social behaviour and reduce other crime. With the increase in processor speeds and reduced hardware costs it has become feasible to deploy large networks of CCTV cameras to monitor surveillance regions. However, even with these technological advances there is still the problem of how information in such a surveillance network can be effectively managed.

CCTV networks are normally monitored by a number of human operators located in a control room containing a bank of screens streaming live video from each camera. Human operators are presented with a number of issues when monitoring a bank of video terminals. One problem is how to reliably navigate through the environment using each camera in the CCTV network. Each camera has a limited field of view of the region, and hence it is necessary to switch between camera views appropriately to track suspicious individuals as they walk through the scene. In manually operated environments it has been observed that human operators have difficulty in performing this task if they are not completely familiar with the scene and the placement of each camera. Another issue is that human operators are normally only interested in identifying certain events that can occur in the scene, for example crowd congestion on a railway platform, loitering in a restricted area, or other atypical behaviour. Ideally, a system that could automate this task would be of great benefit, particularly in a CCTV network comprising a large number of cameras. Reducing the information load on the operators would also reduce the likelihood of missing important events. In addition, it may sometimes be necessary to recall an event that occurred during a specific date and time interval. Hence, the operator will be required to review archived video data, which could be a laborious task if the video data is not stored in a format that is suitably indexed for fast retrieval.

Machine vision based surveillance systems can be applied to a number of application domains, for example retail outlets, traffic monitoring, banks, city centres, airports and building security, with each domain having its own specific requirements.


This broad spectrum of application domains has resulted in a variety of approaches being developed by the research community. This fact has been recognised by a number of leading international technical committees. A special issue on visual surveillance appeared in the IEEE Transactions on Pattern Analysis and Machine Intelligence in August 2000. The IEEE has also sponsored several workshops on visual surveillance, including the International Workshop on Visual Surveillance, and Performance Evaluation of Tracking and Surveillance Systems (PETS).

A machine vision based solution for a visual surveillance application would comprise many components, addressing operations ranging from low-level video acquisition and pre-processing to high-level object tracking and visual interpretation. Live video data can be acquired by using frame-grabbing hardware, which is available at a relatively low cost. The reduced hardware costs have made it economically feasible to deploy networks of cameras to perform visual surveillance tasks. The frame-grabbers also have an application programming interface (API) to allow software to be developed for integration with the hardware. Once live video data can be captured and stored, the next step of the surveillance application would be to identify any object activity within the camera field of view. This task is normally referred to as motion segmentation and requires that the surveillance application utilise vision algorithms to automatically detect possible moving objects of interest. This presents many challenges, since the motion detection must be robust with respect to illumination changes and irrelevant motion. Illumination changes typically occur in outdoor environments due to varying weather conditions. For example, the appearance and disappearance of the sun on a partially cloudy day causes significant changes in lighting conditions and cast shadows. Irrelevant motion generally occurs due to properties of the scene, which include: cast shadows, reflections from windows or puddles on the ground, or vegetation blowing in the wind. Each of these conditions has the potential to cause an error during the motion segmentation process. In order to reduce the effect of each of these sources of error it has become common to employ adaptive background modelling techniques to provide a robust solution. The adaptive background modelling process maintains a reference image, which represents the camera's field of view without any moving objects. A background subtraction process is then applied to identify possible moving objects of interest. Once moving objects have been identified in the camera view the next task of the visual surveillance system is to perform feature extraction and tracking. Features extracted from detected objects can comprise location, shape and colour cues.
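
The adaptive background modelling and subtraction process described above can be illustrated with a minimal sketch in Python. The blending rate, difference threshold and frame size used here are illustrative assumptions rather than values taken from this thesis, and a practical system would use a more sophisticated model (such as the mixture-of-Gaussians approach discussed in Chapter 2).

    import numpy as np

    def update_reference(reference, frame, alpha=0.05):
        """Blend the current frame into the reference image so that slow
        illumination changes are gradually absorbed into the background."""
        return (1.0 - alpha) * reference + alpha * frame

    def foreground_mask(reference, frame, threshold=25.0):
        """Flag pixels whose difference from the reference exceeds a threshold."""
        return np.abs(frame - reference) > threshold

    # Illustrative usage on synthetic greyscale frames (384x288, 8-bit range).
    rng = np.random.default_rng(0)
    reference = np.full((288, 384), 128.0)
    for _ in range(10):
        frame = reference + rng.normal(0.0, 3.0, reference.shape)  # sensor noise
        frame[100:140, 200:220] = 250.0                            # a bright moving object
        mask = foreground_mask(reference, frame)                   # candidate foreground pixels
        reference = update_reference(reference, frame)             # adapt the background model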


Feature extraction is important, since it provides a mechanism to represent each moving object using a compact model. An important requirement for a machine vision based surveillance system is to be able to preserve the identity of an object as it moves through the field of view of the camera. This presents an additional challenge to the motion segmentation problem, since it is necessary to establish inter frame object correspondence between each captured image frame. This task can be resolved by employing an object tracking algorithm, which takes as input a set of detected object features and attempts to maintain the correct tracked state of each object. The inter frame matching between tracked objects and detected object features is usually referred to as the data association problem in the machine vision community. Data association relies on the inter frame consistency of the tracked object state and the features of each detected foreground object. Data association can be performed by using a combination of shape, geometric, position and colour cues. This task can be particularly difficult when two objects interact and form a dynamic occlusion that can make the data association process ambiguous. Once a tracked object has been matched to a detected foreground object the measurement is used to update the state of the tracked object. New tracked objects can be created for any foreground objects that have not been assigned to an existing tracked object during the data association process.
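
The data association step described above can be made concrete with a minimal sketch: each tracked object is matched to the cheapest detection under a combined position and size cost, subject to a gate that rejects implausible matches. The cost weights and gate value are illustrative assumptions rather than values used in this thesis, and a practical tracker would typically also exploit colour cues and a global assignment strategy.

    import numpy as np

    def association_cost(track, detection, w_pos=1.0, w_size=0.5):
        """Combined cost from positional and size (bounding box area) cues."""
        pos_err = np.linalg.norm(np.subtract(track["centroid"], detection["centroid"]))
        size_err = abs(track["area"] - detection["area"]) / max(track["area"], 1.0)
        return w_pos * pos_err + w_size * size_err

    def associate(tracks, detections, gate=50.0):
        """Greedy nearest-neighbour assignment of detections to tracked objects.
        Unmatched detections would seed new tracks; unmatched tracks coast on
        their predicted state until matched again or deleted."""
        matches, used = {}, set()
        for track_id, track in tracks.items():
            costs = [(association_cost(track, d), i)
                     for i, d in enumerate(detections) if i not in used]
            if costs:
                cost, best = min(costs)
                if cost < gate:                 # reject implausible matches
                    matches[track_id] = best
                    used.add(best)
        return matches

    tracks = {1: {"centroid": (100.0, 200.0), "area": 400.0}}
    detections = [{"centroid": (104.0, 203.0), "area": 380.0},
                  {"centroid": (300.0, 50.0), "area": 900.0}]
    print(associate(tracks, detections))        # {1: 0}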


We have outlined the generic functional requirements for single view object tracking using a machine vision based solution. For real world applications a visual surveillance system would comprise many cameras, so a method is required to integrate all the tracking information from the multiple camera sources. This presents a number of additional issues beyond those faced in single view tracking. Firstly, it is necessary to assign a unique identity to moving objects even if they are visible in more than one camera view simultaneously. Secondly, the identity of an object should be preserved when it moves between non-overlapping camera views. Using multiple camera views for object tracking offers some advantages over single view tracking. With sensible camera placement the system would have an increased field of coverage, since the fields of view of all the cameras can be combined by the system. One common cause of single view tracking failure is dynamic and static object occlusions. A dynamic occlusion occurs when two objects interact or cross each other's paths within the camera view. A static occlusion typically occurs when an object temporarily disappears from the camera field of view due to an occlusion plane, for example a tree located near a pedestrian path. Since multiple camera views provide a larger field of coverage, it is expected that a multi view camera surveillance system should be capable of resolving dynamic and static occlusions better than single view tracking, since the start and end of an occlusion should occur at different times in each camera view, increasing the possibility that the system will be able to correctly track the occluded objects.

Matching object features between overlapping camera views is simplified if the scene conforms to the ground plane constraint. This constraint assumes that there is a dominant ground plane present within the region under surveillance, along which moving objects are constrained to move. This assumption is valid for the majority of scenes; for example, road junctions and pedestrian pathways tend to lie on the same ground plane. The ground plane constraint allows the feature correspondence of moving objects between overlapping camera views to be simplified to a planar transformation. Once an object's features in each camera view have been corresponded it is possible to infer its location in 3D, assuming that calibration information is available. The camera calibration information defines a geometric model that relates 2D image features to a world coordinate system. Once a set of 3D features has been extracted from the scene a 3D tracker can be employed to track each object. The 3D tracker follows the same principle as the 2D tracker, with the main difference being that the feature being tracked is the object's 3D location as opposed to its 2D location in image coordinates. In a typical image surveillance application it is likely that there will be several non-overlapping and spatially adjacent cameras. As a consequence the system is required to preserve the identity of a tracked object once it disappears from the field of view of one camera and then reappears in the adjacent camera view after a short temporal delay of a few seconds. If each camera is calibrated in the same world coordinate system then it should be possible to preserve the identity of tracked objects as long as the system has some understanding of the handover regions that exist between each of the non-overlapping camera views.

The visual surveillance system would be required to run continuously over a period of several hours or days. The system should provide some functionality that would allow object activity playback for specific time intervals. Storing uncompressed video would not be feasible for this operation. To illustrate this point, if a network of six cameras were running at 25 frames per second they would generate nearly four terabytes of video data over a twenty-four hour period. This would require a very large storage capacity to accumulate surveillance data over a period of weeks or months. Hence, in order to reduce the cost of storing surveillance data the system has to employ a video data encoding strategy, which results in considerable compression of the raw video data.
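
The storage figure quoted above can be checked with a back-of-the-envelope calculation. The frame resolution and colour depth below are assumptions chosen only to make the arithmetic concrete; with CIF-sized colour frames the raw data volume comes to roughly four terabytes per day.

    # Back-of-the-envelope raw video storage for a small camera network.
    cameras = 6
    fps = 25
    seconds_per_day = 24 * 60 * 60
    width, height, bytes_per_pixel = 352, 288, 3   # assumed CIF-sized RGB frames

    bytes_per_frame = width * height * bytes_per_pixel                 # ~304 kB per frame
    bytes_per_day = cameras * fps * seconds_per_day * bytes_per_frame
    print(bytes_per_day / 1e12)                                        # ~3.9 terabytes per 24 hours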


In addition to storage requirements, the surveillance application would also have to provide an appropriate set of functions to support the continuous operation of the system. It should be possible to seamlessly add cameras to or remove cameras from the surveillance network, and to capture data from different combinations of cameras. This is important, since if the surveillance network contains a large number of cameras it could be difficult to maintain system operation without an appropriate set of tools.

1.2 Research Aims and Objectives

The main focus of this research was to create a framework for tracking objects using multiple camera views. Object tracking using multiple views has recently received much attention [1,3,11,12,13,14,16,17,19,20,25,45,46,47,52,54,56,57,58,59,60,62,63,64,74,75,76,80,90,91,99,101]. The obvious benefits of tracking using multiple cameras are increased coverage of the scene, since the combined field of view of all the cameras should be greater than that of any individual camera. Using multiple camera views for object tracking increases the possibility of preserving object identity across the region. In addition, if the camera views are widely separated then the multi camera tracker should be able to resolve static and dynamic occlusions.

Multi view object tracking comprises many sub-tasks. Initially, moving objects of interest must be identified in each camera. This represents a challenging problem, particularly in outdoor environments where lighting conditions cannot be controlled and image intensities are subject to large illumination variations. Each camera in the surveillance network of this research has an intelligent sensor, which employs a robust motion segmentation and object tracking strategy [96,97,98]. It is assumed that each camera view is fixed and calibrated in a world coordinate system. The multi view object tracking framework should be able to integrate tracking information from each camera and reliably track objects between views. In addition, the multi view tracker should be able to resolve both dynamic occlusions that occur due to object interaction, and static occlusions that occur due to scene constraints, for example trees that form occlusion regions. The framework uses a training phase to learn information about the scene, which can facilitate the integration and object tracking process. This can include learning the relations between camera views to allow feature correspondence, in order to assign a unique label to an object even if it is visible in several camera views simultaneously. In a typical surveillance environment the cameras are placed to maximise the field of coverage of the scene.


As a consequence some cameras will have limited overlap, which increases the difficulty of tracking objects without loss of identity. Hence, the framework should be able to track objects between non-overlapping and spatially adjacent camera views. The system should be able to exploit spatial cues to maintain the identity of tracked objects. Since the object disappears from the field of view temporarily, it will be necessary to record attributes of the object at the time of its exit in order to increase the likelihood of matching the object on reappearance. Object tracking using multiple views has received much attention for the overlapping view case, but there has been only limited investigation of the non-overlapping view case.

Performance evaluation of video tracking systems for surveillance has recently become a popular topic [26,27,29,30,31,77,82,88], since it provides an effective approach for comparing several algorithms for solving a specific surveillance task. Much work has been reported on the evaluation of object tracking algorithms for surveillance, but it is normally restricted to a few minutes of video. The Police Scientific Development Branch (PSDB) is in the process of creating a Video Test Image Library (VITAL) [2], which represents a broad range of object tracking and surveillance scenarios encompassing: parked vehicle detection, intruder detection, abandoned baggage detection, doorway surveillance, and abandoned vehicles. Performance evaluation of object tracking algorithms presents many issues that include: generating a variety of test datasets for evaluation, acquiring ground truth for each dataset, measuring the complexity of each dataset, and determining how the tracking performance relates to the complexity of each video sequence. Another key objective of this research is to define a methodology that can be used to evaluate tracking algorithms over a comprehensive set of testing datasets that represent a diverse range of object tracking scenarios. One requirement of this methodology is that the data acquisition and ground truthing should be fully automated, or at least semi-automatic with limited supervision. The approach adopted involves using pseudo synthetic video sequences that are automatically generated by extracting tracking data from an online surveillance database. It should be feasible to select ground truth tracks from the surveillance database using an appropriate set of metrics to assess the quality of the tracks. Normally the most common failure of video tracking algorithms is due to dynamic occlusions between several interacting objects. Hence, it is not unreasonable to use the ability to reason about occlusions as a performance measure for a tracking algorithm. Given the query properties of the surveillance database it is envisaged that it should be possible to generate test datasets that span several hundred thousand image frames. The automatic generation of a large number of datasets enables the tracking algorithm performance to be assessed over a wide range of perceptual complexity, which would not be feasible even using semi-automatic tools.
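
To make the idea of tracking performance metrics concrete, the sketch below shows one plausible way of computing an object tracking error and a track detection rate from ground truth and tracker output. These formulations are illustrative only; the metric definitions actually used in this work are given in Chapter 6 and may differ in detail.

    import numpy as np

    def object_tracking_error(gt_centroids, trk_centroids):
        """Mean Euclidean distance between ground truth and tracked centroids,
        computed over the frames where both are available."""
        gt = np.asarray(gt_centroids, dtype=float)
        trk = np.asarray(trk_centroids, dtype=float)
        return float(np.mean(np.linalg.norm(gt - trk, axis=1)))

    def track_detection_rate(frames_tracked, frames_in_ground_truth):
        """Fraction of ground truth frames in which the object was tracked."""
        return frames_tracked / float(frames_in_ground_truth)

    gt = [(10, 10), (12, 11), (14, 13)]
    trk = [(11, 10), (12, 12), (15, 13)]
    print(object_tracking_error(gt, trk))    # 1.0 pixel on this toy example
    print(track_detection_rate(3, 4))        # 0.75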


Some of the major benefits of the proposed framework are that a number of metrics will be defined in order to assess the quality of ground truth tracks and characterise the video tracking performance, and that ground truth data can be automatically acquired for each generated video sequence. The perceptual complexity, with respect to the number of dynamic object occlusions and the number of objects, can be controlled, allowing a large number and variety of different tracking scenarios to be created. It is envisaged that the performance evaluation framework presented in this thesis will result in an alternative strategy for generating test data and ground truth, which can be used for evaluating the performance of any video tracking algorithm.

In order to meet the requirements of both online tracking and the video tracking performance evaluation framework, a system architecture will be designed and implemented which will support the real time capture and storage of object tracking data from multiple cameras. The captured data will comprise object track information such as location, appearance features, bounding box dimensions, and pixel image data of each detected object. Central to the operation of the system will be a multi view tracking server (MTS), which will integrate all the tracking data observed by each camera in the surveillance network. Another design consideration is how the video data will be stored and managed for retrieval, particularly if the system runs continuously for many days, which would result in large quantities of tracking data. One requirement of any surveillance system is that it should be possible to access video data for specific times and dates. This functionality has been implemented in a surveillance database, which is appropriately indexed to support fast retrieval of data. One approach would be to use an existing movie compression technology (e.g. MPEG2 or MPEG4 [15]), but these do not support access to specific events, nor take full advantage of some system constraints (i.e. static camera views). We employ a variant of the MPEG4 approach for video compression, which results in considerable savings in terms of the space required to store the video data. The surveillance database stores information about each sensor connected to the network of cameras, and also acts as a repository for all the video tracking information generated by each intelligent camera unit (ICU) and the MTS. The MTS and ICUs upload tracking information into the surveillance database. One advantage of this approach is that the surveillance database is able to support offline learning processes to provide information about each region under surveillance to improve the performance of the online tracking. Examples of learning include: path learning [70], homography alignment [8,9], object handover reasoning and behaviour analysis of object trajectories [68,69]. In addition to offline learning, the surveillance database forms a critical component of the performance evaluation framework presented in this thesis.
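
As an illustration of the kind of per-detection record an intelligent camera unit might upload to the surveillance database, a minimal sketch is given below. The field names and types are hypothetical and do not necessarily correspond to the database schema described in Chapter 5 and Appendix C.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Tuple

    @dataclass
    class FrameletRecord:
        """One detected object in one frame, as uploaded by an intelligent camera unit."""
        camera_id: int
        track_id: int
        timestamp: datetime
        centroid: Tuple[float, float]                    # image coordinates (x, y)
        bounding_box: Tuple[int, int, int, int]          # (x, y, width, height)
        pixels: bytes = field(repr=False, default=b"")   # cropped object image data

    record = FrameletRecord(camera_id=3, track_id=17,
                            timestamp=datetime(2004, 4, 1, 12, 30, 5),
                            centroid=(182.5, 240.0),
                            bounding_box=(160, 200, 45, 80))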


1.3 Organisation and Presentation

The remainder of this thesis is organised as follows. Chapter 2 is a discussion of related work that includes a review of previously published research. The next three chapters primarily focus on the operation of the multi view tracking server. Chapter 3 describes the techniques used to calibrate each camera in the surveillance network. The robust methods used to automatically recover the spatial relationships between overlapping views, assuming the ground plane constraint, are discussed. Chapter 3 also describes the methods used by the system to extract measurements of each detected object in 3D world coordinates. Chapter 4 describes the framework for tracking objects between widely separated views, which may be overlapping or non-overlapping. Chapter 5 describes the overall system architecture of the surveillance system that was designed and implemented to support the research presented in this thesis. This architecture was required to allow the real time capture and storage of video data over continuous extended periods of at least twenty-four hours. The key components of the system are discussed, as well as how information is exchanged by each sub-process. Chapter 6 describes the approach employed to evaluate the performance of the video tracking algorithms. Pseudo synthetic video sequences are automatically generated from the tracking data stored in a surveillance database. One key novelty of this application is that large volumes of test data can be automatically generated with associated ground truth, allowing video tracking algorithms to be evaluated with comprehensive test datasets. Chapter 6 also includes a description of all the metrics that describe the quality of the ground truth tracks, the object tracking performance, and the perceptual complexity of each dataset. Chapter 7 gives a summary of the main contributions of this thesis and what problems still need to be addressed. Possible extensions to the current system are discussed that could provide a solution to these open issues.

1.4 Contributions

The work presented in this thesis contributes to the field of visual surveillance through the development of a system that can coordinate object tracking between multiple camera views. In addition, this thesis also presents a quantitative performance evaluation framework for video tracking systems. In particular, the main contributions are:


- Application of 3D Kalman filtering for robustly tracking objects through static and dynamic occlusions.
- Learning homography relations between overlapping camera views, which are robustly estimated using a Least Quantile of Squares (LQS) technique.
- Using a spatially variant 3D measurement uncertainty process to set the observation noise of the Kalman filter tracker. This results in an improvement of the Kalman filter tracking compared to using constant observation noise.
- Development of a multi view tracker that uses a camera topology model to facilitate object tracking between non-overlapping views.
- Development of a framework for unsupervised performance evaluation of video tracking systems. Ground truth tracks are automatically selected from a surveillance database and used to construct pseudo-synthetic video sequences for performance evaluation.


2 Previous Work
2.1 Background
The purpose of this chapter is to provide a survey of the research that has already been published in relation to multi view object tracking and video tracking performance evaluation. Some of the work discussed, including adaptive background modelling, motion segmentation and single view tracking, is outside the scope of this thesis but is included for completeness. In the previous chapter we discussed some of the general issues that would be encountered when developing a surveillance application. Here we discuss solutions to some of the surveillance tasks identified. This survey of existing multiple view tracking systems and methods of performance evaluation enabled us to identify the key requirements that needed to be considered by this research.

2.2 Single View Tracking

There are a number of techniques available for single view tracking. The tracking problem is primarily decomposed into a number of stages, which include: motion detection, object segmentation, and object tracking.

The KidsRoom system developed at the MIT Media Laboratory [10,42,43] used a real-time tracking algorithm that uses contextual information. The system could track and analyse the actions and interactions of people and objects. The contextual information included knowledge about the objects being tracked and their current relationships with one another. The contextual information was used to weight the image features used for inter frame data association. Each object was detected using background subtraction [43], allowing the blob's dimensions, location, and colour appearance attributes to be computed.

The W4 system [35,36,37,38,39,40] employed a set of techniques for implementing a real-time surveillance system using low cost hardware. The key components of W4 were: adaptive background modelling to statistically detect foreground regions, object classification to distinguish between different object classes using shape and motion cues, and tracking multiple objects simultaneously in groups. A blob representation of objects has particular problems when objects interact and form a dynamic occlusion, since it is not possible to distinguish between the foreground regions of each object. The W4 system uses an alternative appearance model for each tracked object that takes the form of a silhouette description that includes the location of the head, hands, feet and torso.


This representation allows more robust dynamic occlusion reasoning than was possible in the KidsRoom system.

Pfinder (Person-finder) was a real-time system for tracking and interpretation of human motion developed at the Massachusetts Institute of Technology (MIT) [95]. Motion detection is performed using background subtraction, where the statistics of background pixels are recursively updated using a simple adaptive filter. The human body is modelled as a connected set of blob regions using a combination of spatial and colour cues. Features of the human body are found by analysis of the foreground object's contour. The system can only track one human object; in future work the authors planned to extend Pfinder to use multiple cameras. Pfinder has been applied to a variety of applications including: video games, distributed virtual reality, providing interfaces to information spaces, and recognising sign language.

2.3 Multi View Tracking Systems

In order to integrate the track data from multiple cameras, it is useful to consider the visibility of targets within the entire environment, and not just each camera view separately. Four region visibility criteria can be identified to define the different fields of view (FOV) available from the network of cameras, as listed below:

Figure 2.1 Visibility criteria of camera network (legend: camera location, building, viewfield, overlapped viewfield)

- visible FOV - defines the regions that an individual camera will image. In cases where the camera view extends to the horizon, a practical limit on the view range is imposed by the finite spatial resolution of the camera or a practical limit on the minimum size of reliably detectable objects.
- camera FOV - encompasses all the regions within the camera view, including occluded regions.
- network FOV - encompasses the visible FOVs of all the cameras in the network. Where a region is occluded in one camera's visible FOV, it may be observable within another FOV (i.e. overlap).
- virtual FOV - covers the network FOV and all the spaces in between the camera FOVs within which the target must exist. The boundaries of the system represent locations from which previously unseen targets can enter the network.

Figure 2.1 illustrates the camera network visibility regions for a simple environment projected onto the ground plane. Occluded regions are shown in white (if within the expected viewfield of a camera). The main requirement of the multi view tracking system is that a unique identity should be assigned to objects tracked within regions of overlap, and the identity should be preserved when objects move between adjacent non-overlapping views.

Cai and Aggarwal [12,13,14] demonstrated a comprehensive framework for tracking coarse human models in an indoor environment using multiple synchronised monocular cameras. Each camera performed a set of pre-processing tasks to detect motion, segment moving human subjects, and extract features from each subject detected. The coarse 2D human model defined the upper body of the tracked subject. Moment invariants were used to describe the shape of the feature, enabling a detected object to be classified as human or non-human. Feature matching between successive image frames was achieved by evaluating the Mahalanobis distances for each feature. They employed an automatic camera-switching scheme to coordinate tracking between overlapping camera views, which was driven by optimal camera selection for each tracked object. This system was demonstrated to operate in real-time using three monocular cameras. One weakness of this approach is that tracking failure occurs during dynamic occlusions that cannot be resolved by the single view tracker.

The Video Surveillance and Monitoring (VSAM) project [19,20,65,66] at Carnegie Mellon University (CMU) developed a system for multi view surveillance using a distributed network of active sensors. Their system operated in an outdoor environment, which presents more challenges than indoor environments, where lighting conditions can more easily be controlled, making motion segmentation an easier task. In outdoor environments lighting conditions can vary due to changing weather conditions or cast shadows. In addition, wind can result in irrelevant motion (such as branches or vegetation swaying) causing false alarms.

24

Page 25 of 168

irrelevant motion (such as branches or vegetation swaying) causing false alarms. In order to overcome these problems they chose to use an adaptive background model to reflect slow changes in illumination. Each pixel in the background is modelled as a mixture of Gaussians, allowing slow varying changes in illumination, and bi-modal backgrounds (for instance leaves blowing in the wind) to be correctly represented. Foreground objects can be detected by applying a background subtraction technique. One problem with adaptive background modelling is that transient objects, for example a car stopping for a few seconds, can be absorbed into the background model after a period of time. They addressed this problem by employing a layered approach to adaptive background subtraction. By considering pixel layer analysis and region layer analysis it is possible to hypothesise if a pixel is stationery or transient. Their framework for single view tracking employed a combination of positional and template matching. They determine the best object match between successive image frames by employing correlation matching between the objects intensity template and candidate regions in the new image frame. Neural networks were used to classify each detected object as one of: person, group, vehicle or clutter. Linear Discriminant Analysis (LDA) was used to provide a finer distinction between different types of vehicle. VSAM used GPS to provide a global world coordinate system, allowing tracking to be coordinated between overlapping views. GPS was used in this application, since some of the regions under surveillance were located in rough terrain where it may not be practical to perform a landmark-based survey to calibrate each camera. Each sensor in the network employs an active camera, which allows the pan, tilt and zoom to be automatically adjusted to bring an object within the visible field of view. They used a 3D model of the surveillance region to visualise the object activity. Another VSAM project system was developed at the Massachusetts Institute of Technology (MIT) for tracking objects in an urban outdoor environment [34,62,89,91]. This project was one of the earlier adopters of the concept of adaptive background modelling using a mixture of Gaussians. This approach has proved to be extremely effective for motion segmentation in a variety of illumination and weather conditions. The basic idea behind adaptive background modelling is to model each pixel in the reference image as a mixture of Gaussians. This model is effective for robustly handling slow varying changes in illumination, which typically occur in outdoor environments. An additional benefit of using a mixture of Gaussians is that it is possible to handle bimodal backgrounds, for example leaves of a tree swaying in the wind. This would cause failure if only a single Gaussian was used to model the reference image. Given the reference image of the image background it is possible to perform a background subtraction process to identify foreground regions in the current image frame. The
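To make the per-pixel modelling idea concrete, the sketch below maintains a single adaptive Gaussian per pixel and flags pixels that deviate from it as foreground; the systems cited above extend this to a mixture of Gaussians per pixel. This is a minimal illustration written for this review, not code from any of the cited systems; the array names, learning rate and threshold are illustrative assumptions.

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.01, k=2.5):
    """Single-Gaussian-per-pixel background subtraction sketch.

    frame, mean and var are float arrays of identical shape (H x W for a
    greyscale image). alpha is the adaptation rate; k scales the deviation
    threshold. Returns (foreground_mask, updated_mean, updated_var).
    """
    diff = frame - mean
    # A pixel is foreground if it deviates from the background model by
    # more than k standard deviations.
    foreground = (diff * diff) > (k * k) * var
    # Recursively update the background statistics for background pixels
    # only, so transient objects are not absorbed immediately.
    update = ~foreground
    mean = np.where(update, (1 - alpha) * mean + alpha * frame, mean)
    var = np.where(update, (1 - alpha) * var + alpha * diff * diff, var)
    return foreground, mean, np.maximum(var, 1e-6)
```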


The use of this technique has become common among the research community and has been implemented in various forms [34,43,66,89,97]. The VSAM project at MIT also introduced a novel method for recovering ground plane models between overlapping camera views. Initially a sparse set of 2D object trajectories in a pair of overlapping camera views is used to recover a homography mapping between the two views. A homography is a planar mapping which projects points lying on a plane from one camera view to another. The homography mapping is derived from the centroid locations of tracked objects and provides a rough alignment between the two views. A robust image alignment algorithm is then applied to register the ground plane of the two images [44]. For pre-recorded video sequences they were also able to perform time calibration in order to determine the temporal offset between the internal clocks of the two cameras. This approach to time calibration could not be applied in this research, since we expect the system to run continuously. In addition, since the cameras can be located in different buildings it was not feasible to enforce synchronisation signals between each camera. Hence it is possible the internal clocks of each camera will become skewed over periods of time during continuous operation. This problem was an important design consideration when creating the system architecture of the online multi view tracking system.

Chang and Gong [16,17] developed a system for tracking people using two camera views in an indoor office area. The system combines geometric and appearance modalities within a Bayesian framework for cooperative tracking. The system performs colour calibration by acquiring a number of training examples between the two images and then using Support Vector Machine regression to learn the non-linear mapping between the two colour spaces of the cameras. The ability to predict appearance allows the dynamic occlusion reasoning to use both colour and motion cues, rather than relying on linear prediction alone. The prediction of appearance is robust indoors, since the lighting conditions are fairly stable, but this approach could only be applied outdoors during times when the lighting conditions do not vary considerably, otherwise the appearance prediction model would need to be re-calibrated.

Kogut and Trivedi [60] have developed a system for traffic surveillance. Their camera network comprised two pan, tilt, and zoom cameras and one omni-directional camera. Each camera was connected to a gigabit Ethernet network to facilitate the transfer of full size video streams for remote access. The system could track platoons of vehicles between each camera site. The implemented system demonstrated the importance of a high bandwidth network infrastructure to support a real-time surveillance system.


Khan, Javed and Shah [45,46,47,58,59] have recently presented a system for multi view surveillance that can be applied to both indoor and outdoor environments using a set of uncalibrated cameras. Their method can automatically identify the field of view boundaries between overlapping cameras. Once this information is available it is possible for the multi view tracking algorithm to consistently assign the correct identity to objects, even when they appear in more than one camera view. In order to handle the scenario of tracking objects between non-overlapping cameras they use a combination of spatio-temporal information and colour cues. They assume that training data is available for objects moving between the fields of view of each non-overlapping camera.

2.4 Performance Evaluation of Video Tracking Algorithms

Recent interest has been shown in the performance evaluation of video tracking with the introduction of the Performance Evaluation of Tracking and Surveillance (PETS) workshops, which are sponsored by the IEEE Computer Society technical committee. One problem that the PETS workshops have resolved is making a common set of data available to the research community, so that it is possible to compare tracking systems on an equal footing. The common approach to performance evaluation is to generate ground truth from pre-recorded video. An operator is required to step through a video sequence and annotate each object that moves through the field of view. The ground truth normally takes the form of the tracked object trajectory and the object's bounding box. The video tracking algorithm can then be applied to the pre-recorded video sequence, and the ground truth and tracking results compared in order to get an indication of the tracking performance. Generating ground truth for pre-recorded video can be a time-consuming process, particularly for video sequences that contain a large number of objects.

A number of semi-automatic tools are available to speed up the process of ground truth generation. Doermann and Mihalcik [26] created the Video Performance Evaluation Resource (ViPER) to provide a software interface that could be used to visualise video analysis results and metrics for evaluation. The interface was developed in Java and is publicly available for download. Jaynes, Webb, Steel and Xiong developed the Open Development Environment for Evaluation of Video Surveillance Systems (ODViS) [49] at the University of Kentucky. The system differs from ViPER in that it offers an application programmer interface (API), which supports the integration of new surveillance modules into the system. Once integrated, ODViS provides a number of software functions and tools to visualise the behaviour of the tracking systems.


The integration of several surveillance modules into the ODViS framework allows several different tracking algorithms to be compared to each other, or to pre-defined ground truth. The development of the ODViS framework is an ongoing research effort, and plans are underway to support a variety of video formats and different types of tracking algorithms. The ViPER and ODViS frameworks provide a set of software tools to capture ground truth and visualise tracking results from pre-recorded video. Once the ground truth is available there are a number of metrics that can be applied in order to measure tracking performance [27,29,77,82,88].

Ellis [27] discusses approaches to performance evaluation and how tracking performance is related to weather conditions, illumination changes, and irrelevant motion. Consideration is also given to dataset complexity, which is related to the number of objects present in the video sequence, along with the number of dynamic occlusions and the distinctiveness of each object. It is suggested that evaluation should cover a diverse range of testing datasets in order to provide an adequate test of a tracking algorithm.

Needham and Boyle [77] have defined a set of methods for positional tracker evaluation. The metrics defined can be applied for object trajectory comparison. The object trajectories represent a sequence of positions of a tracked object, or a sequence of points used to define the ground truth of an object. Manually hand-marked ground truth can have some variability if recorded by different operators. The set of metrics allows two handcrafted trajectories to be compared subject to temporal and spatial shifts, enabling quantitative evaluation of positional tracking algorithms.

Pingali and Segen [82] defined a set of performance evaluation metrics for tracking systems. In the absence of fully automated quantitative evaluation techniques, results from different versions of a tracking algorithm are normally compared visually, which is subjective and can be an unreliable approach to evaluation. They propose three categories of evaluation metrics: track cardinality measures, durational measures, and positional tracking measures. The track cardinality measures are based upon the number of tracks in the ground truth and the number reported by the system; they allow the false alarm rate, average track fragmentation, and miss rate (a measure of tracks in the ground truth not detected by the tracking system) to be computed. The durational accuracy measures the duration for which tracked objects are correctly reported by the system; this cannot be measured accurately by the cardinality measures, which would report ideal accuracy even if an object were only tracked for a fraction of its true duration. The positional tracking measures indicate how closely the tracks reported by the system correspond to the ground truth. There is difficulty in using the three categories of metrics if ground truth is not readily available. As an alternative they proposed additional metrics based on ground truth that is easier to acquire than complete object trajectories and bounding boxes.


An event sequence based error measure is used to define trajectories as ordered sequences of events. An event is defined as a crossing between the object trajectory and a line segment that defines a point of interest within the field of view. A benefit of adopting this approach is that ground truth based on a set of event sequences can be more easily acquired than complete object trajectories; however, the definition of points of interest in the camera field of view is still subjective, and they must be defined manually.

Senior, Hampapur, et al. [88] have defined metrics for trajectory comparison between ground truth and tracked object trajectories. Additional metrics are defined for object detection lag, track completeness factor, and object area error. Each of these metrics can measure the effectiveness of the tracking of an object with respect to available ground truth. The metrics defined have been applied to a tracking system that employs a similar approach to [40].

The performance evaluation metrics previously discussed assume that some form of ground truth is available. This assumption is not valid for an online tracking system where the original video is not available for ground truth generation. In this situation, a number of metrics are required to measure tracking performance online. Erdem, Tekalp, and Sankur have employed a set of metrics for evaluation where ground truth is not available [29,30,31]. A combination of colour and motion metrics is used to assess the consistency of a tracked object between successive image frames. The colour metrics are based on the intra-frame colour difference along the estimated object boundary, and the intra-frame colour histogram differences. The motion metrics check for consistency between the estimated object boundaries and the actual motion boundaries. By combining all three metrics it is possible to identify poorly tracked objects. It is also suggested that for future work these online metrics could be integrated into an online tracking algorithm in order to improve performance.
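As a simple illustration of the positional and durational measures discussed above, the sketch below computes a mean positional error and a track completeness ratio for a single object. These are generic examples written for this review, with illustrative function names, and are not the exact definitions used by the cited authors.

```python
import numpy as np

def positional_error(ground_truth, tracked):
    """Mean Euclidean distance between temporally aligned positions.

    ground_truth and tracked are (N, 2) arrays of (x, y) positions for the
    frames in which both the ground truth and the tracker report the object.
    """
    ground_truth = np.asarray(ground_truth, dtype=float)
    tracked = np.asarray(tracked, dtype=float)
    return float(np.mean(np.linalg.norm(ground_truth - tracked, axis=1)))

def track_completeness(n_frames_tracked, n_frames_ground_truth):
    """Fraction of the ground truth duration for which the object was tracked."""
    return n_frames_tracked / float(n_frames_ground_truth)
```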

2.5 Summary

The systems and methods discussed in this chapter illustrate the considerable progress that has been made in video surveillance and performance evaluation. The research presented in this thesis is specifically concerned with multi view object tracking using widely separated views, and with developing an automated framework that can be applied for quantitative video tracking performance evaluation. Implementing a system that can be applied for continuous twenty-four hour tracking requires detailed planning and must account for a number of design considerations.


There are many practical aspects of the network infrastructure and system implementation that need to be addressed in order to satisfy all the requirements of the research aims and objectives outlined in chapter 1.

In this research we choose to cooperatively track objects between overlapping views to increase the likelihood of resolving both static and dynamic occlusions. Assuming the cameras are widely separated, it is likely that dynamic occlusions will start and end at different times in each view, increasing the likelihood of success. This issue was not considered in [12,13,14], where failure of single view tracking would cause failure of the multi view tracking. One weakness of the approach adopted by Shah et al [45,46,47,58,59] is that if the training data contains no object moving between the fields of view of two overlapping cameras then the model will not have a handover policy in this region. One solution to this problem is to initialise the FOV boundaries by automatically recovering the homography relations between each pair of overlapping views. The homography relations can be recovered from a set of sparse 2D object trajectories as discussed in [8,9,90,91].

From the review of previously published work a set of general requirements was derived for the system that will be designed and implemented to support the research presented in the remainder of this thesis:

- Robust motion segmentation, which must be adaptable to abrupt lighting changes, for example when the sun emerges from behind a cloud.
- The system must support robust object tracking for single camera views.
- The system must support robust object tracking between multiple camera views.
- The system must be able to resolve both static and dynamic object occlusions, which are a common reason for tracking failure.
- Tracking between overlapping camera views must be coordinated appropriately to preserve object identity when objects move between the different fields of view of each camera.
- The system must support a temporal synchronisation strategy between multiple views that are located in different locations.
- The system must store the surveillance data in a compact format (preferably a database) that can be easily accessed to support the playback of video captured during a specific time interval.
- The system architecture must support the insertion or removal of an intelligent camera from the surveillance network. In a typical camera network it is likely that devices can fail, or the camera network can be increased in size; this maintenance of the camera network should be performed in a seamless manner.
- It should be possible to calibrate cameras in the surveillance network with limited supervision. This is particularly important for matching objects between overlapping views and for tracking objects between non-overlapping views.

The requirements for single view motion detection and object tracking were implemented using the methods of Xu and Ellis [96,97,98] and do not form part of the contribution of this thesis.

What follows are the key requirements that were identified for the video tracking performance evaluation framework:

- The framework should provide a set of tools that allows pre-recorded video to be reviewed and supports the capture of ground truth data.
- The framework should include a comprehensive set of online metrics that can be used to measure the quality of the tracking data stored in the surveillance database.
- A comprehensive set of metrics should also be defined for characterising the tracking performance by comparing the ground truth and tracking results.
- The framework should support the generation of a variety of testing datasets that can be employed for quantitative performance evaluation. Ideally, the ground truth generation for each dataset should be fully automatic, or at least semi-automatic.
- There should be some degree of control over the perceptual complexity of each video sequence generated within the evaluation framework.


3 Feature Matching and 3D Measurements


3.1 Background
The objective of this chapter is to examine the methods and techniques available to perform camera calibration, match features between overlapping views, and extract 3D measurements from the scene. Camera calibration is important, since it provides a mechanism to translate 2D image features to a 3D world coordinate system, which can facilitate the integration of tracking information from multiple camera views. The camera calibration also provides a means of making accurate 3D measurements in terms of the world coordinate system, particularly if an object has been matched between several camera views.

This chapter is organised as follows: we first describe the method used to calibrate each camera. We then discuss the approach employed to extract 3D landmark points from the scene being monitored. A homography transformation is employed to correspond 2D object features between overlapping camera views. We describe how it is possible to robustly estimate the homography transformation using a sparse set of tracked object trajectories. The homography transformation is utilised by the multi-view tracking framework in order to augment the tracking process. We then describe the techniques used to extract 3D measurements from the scene. A least squares estimation is used to perform 3D line intersection to estimate an object's 3D location using overlapping camera views. A 3D measurement is not of much practical use unless we have some idea of its accuracy, since this has an impact on the reliability of an object's location and consequently on how well the object will be tracked. The measurement uncertainty can be determined by propagating the 2D measurement uncertainty to the world coordinate system using the calibrated camera parameters. We then discuss the results of homography calibration, along with 3D measurement and uncertainty, for various video sequences to test the validity of each approach.

3.2 Camera Calibration

In order to extract 3D measurements from the scene it is necessary to calibrate each camera within the surveillance system. The calibration model provides a mechanism to translate 2D image coordinates to 3D world coordinates. In general it is most common to derive the calibration information by using a set of known 3D landmark points that are visible within the camera field of view [32,41,94].


The calibration model is defined in terms of intrinsic and extrinsic parameters. The intrinsic parameters characterise the internal parameters of the camera such as the principal point, pixel dimensions, focal length, and radial lens distortion, while the extrinsic parameters describe the camera's position and orientation with respect to the world coordinate system. A description of the camera model used in this thesis is given in Appendix A. With the aid of theodolite surveying equipment it is possible to extract a number of 3D landmark points for a surveillance region. In general, most surveillance regions have a dominant ground plane. If at least five coplanar survey points (seven points are required for non-coplanar calibration) are visible in each camera view then Tsai's algorithm [94] can be used to perform the calibration. The accuracy of the calibration is sufficient for extracting 3D measurements and tracking objects as long as the survey points are sensibly distributed on the ground plane. A survey of a typical surveillance region can be performed in a few hours. An example of some of the survey points used in the calibration of cameras connected to the surveillance network is shown in figure 3.1.

Figure 3.1 Example of landmark points gathered by a survey of the surveillance region

          Tx (m)   Ty (m)   Tz (m)   Rx (deg)  Ry (deg)  Rz (deg)
Camera 1  -501.5   -13.7    162.6    -152.1     57.6      114.6
Camera 2  -484.9   -15.54   221.5    -140.6     55.3      124.7
Camera 3  -130.8   -139.8   521.6    -106.8     17.7      176.6
Camera 4   266.8   -123.3   449.9    -110.3    -15.9      168.0
Camera 5  -450.3    108.3   222.7    -139.0     66.8      139.5
Camera 6   524.1    -32.0   111.1    -144.6    -51.6     -129.5

Table 3.1 Summary of extrinsic parameters of the six cameras shown in figure 3.1

          K (mm^-2)  F (mm)  Cx (pixels)  Cy (pixels)
Camera 1  0.01       5.6     323.7        247.6
Camera 2  0.02       5.2     331.9        233.8
Camera 3  0.01       25.2    337.1        246.0
Camera 4  0.01       10.6    308.4        304.5
Camera 5  0.01       10.2    318.7        272.5
Camera 6  0.01       5.0     320.0        240

Table 3.2 Summary of intrinsic parameters of the six cameras shown in figure 3.1

          Number of          Image plane error               Object space error
          landmark points    Mean (pixels)  Std dev (pixels)  Mean (mm)  Std dev (mm)
Camera 1  7                  1.9            1.30              55.6       40.07
Camera 2  9                  2.1            1.82              68.6       59.88
Camera 3  8                  3.6            2.60              75.0       54.18
Camera 4  10                 2.6            1.74              80.7       73.13
Camera 5  10                 5.4            2.78              125.8      65.24
Camera 6  6                  2.2            1.48              42.6       30.28

Table 3.3 Summary of the calibration errors for each camera shown in figure 3.1

A summary of the calibration of the extrinsic and intrinsic parameters of each camera is given in tables 3.1 and 3.2 respectively. The parameters (Tx, Ty, Tz) define the translation vector between the world and camera coordinate space, and the parameters (Rx, Ry, Rz) define the rotation angles for the transformation between the world and camera coordinate space. K defines the first order radial lens distortion coefficient, F is the focal length of the camera, and (Cx, Cy) defines the centre of the radial lens distortion on the image plane. A summary of the calibration errors for each of the cameras is shown in table 3.3. The calibration errors are dependent on a number of factors including: the number of landmark points, the distribution of the features along the ground plane, the distance of the features from the camera, and the accuracy of the features selected on the image and on the scene ground plane. The mean image space error was between 1.9 and 5.4 pixels, and the mean object space error varied between 42.6 and 125.8 mm. The largest error values correspond to the bottom left camera view in figure 3.1. One of the landmark features is located at the end of the road junction, a distance of several metres from the other landmark points. If this point is excluded, the mean errors for the image and object space are 4.5 pixels and 94.2 mm, with standard deviations of 2.80 pixels and 62.35 mm respectively. The results illustrate that the accuracy of the camera calibration is to within a few centimetres in the world coordinate system. The accuracy should be sufficient to reliably extract measurements from the scene and track objects in 3D.
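For illustration, the sketch below projects a 3D world point into pixel coordinates using a simple pinhole model with the parameter names of tables 3.1-3.3 (rotation R, translation T, focal length F and image centre (Cx, Cy)); the full Tsai model used in this thesis, including the radial distortion coefficient K, is given in Appendix A. The pixel-size values dx and dy and the scale factor sx are illustrative assumptions, as they are not listed in the tables.

```python
import numpy as np

def project_point(Xw, R, T, F, sx=1.0, dx=0.01, dy=0.01, Cx=320.0, Cy=240.0):
    """Project a 3D world point into (undistorted) pixel coordinates.

    Xw : (3,) world point in the same units as T.
    R  : (3, 3) rotation matrix, T : (3,) translation vector (extrinsics).
    F  : focal length in mm; dx, dy : assumed pixel size in mm;
    sx : horizontal scale factor; (Cx, Cy) : image centre in pixels.
    First order radial distortion (coefficient K) is omitted for brevity.
    """
    Xc = R @ np.asarray(Xw, dtype=float) + np.asarray(T, dtype=float)
    xu = F * Xc[0] / Xc[2]          # image plane coordinates in mm
    yu = F * Xc[1] / Xc[2]
    u = sx * xu / dx + Cx           # convert to pixel coordinates
    v = yu / dy + Cy
    return np.array([u, v])
```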

3.3 Two View Relations

3.3.1 Homography Transformation

A homography mapping defines a planar mapping between two camera views that have a degree of overlap [24,41,90,91]:

$x' = \dfrac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}$   (3.1)

$y' = \dfrac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}$   (3.2)

where $(x, y)$ and $(x', y')$ are image coordinates in the first and second camera views respectively. Hence, each correspondence point between two camera views results in two equations in terms of the coefficients of the homography. Given at least four correspondence points the homography can be evaluated. It is most common to use Singular Value Decomposition (SVD) to compute the homography [24,41]. The homography matrix can be written in vector form:

$H = [h_{11}\; h_{12}\; h_{13}\; h_{21}\; h_{22}\; h_{23}\; h_{31}\; h_{32}\; h_{33}]^T$   (3.3)

Each pair of correspondence points $((x_i, y_i), (x_i', y_i'))$ results in two equations in terms of the coefficients of the homography matrix. The following equations can be determined by rearranging equations 3.1 and 3.2:

$[\,x_i\;\; y_i\;\; 1\;\; 0\;\; 0\;\; 0\;\; -x_i'x_i\;\; -x_i'y_i\;\; -x_i'\,]\,H = 0$   (3.4)

$[\,0\;\; 0\;\; 0\;\; x_i\;\; y_i\;\; 1\;\; -y_i'x_i\;\; -y_i'y_i\;\; -y_i'\,]\,H = 0$   (3.5)

Given N correspondence points, a (2N by 9) matrix M can be constructed and used to minimise $\|MH\|$ subject to the constraint $\|H\| = 1$. The value of the homography matrix can then be estimated using Singular Value Decomposition (SVD). Methods such as Gaussian elimination or the pseudo inverse could not be applied for the homography estimation, since these techniques cannot handle sets of equations that are singular or numerically very close to singular [85]. The result of the homography estimation depends on the coordinate system of the correspondence points and their distribution across the image plane. In order to compensate for these differences we also apply an isotropic scaling function to the set of correspondence points, in order to normalise the data [41]. The normalisation is performed prior to estimating the homography transform, and reduces the effect of the coordinate system and scale on the estimation. The normalisation function defines a translation and scaling that maps each correspondence point such that the centroid of the points is the coordinate origin (0,0) and the average distance of each point from the origin is $\sqrt{2}$. These additional steps are added to the homography estimation to incorporate the isotropic scaling.
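A minimal sketch of the normalised DLT estimation just described (equations 3.1-3.5 together with the isotropic scaling): four or more correspondences are stacked into the 2N by 9 matrix, the right singular vector associated with the smallest singular value gives the homography, and the normalising transforms are then undone. The function names are illustrative assumptions and no handling of degenerate point configurations is included.

```python
import numpy as np

def normalise(points):
    """Isotropic scaling: centroid at the origin, mean distance sqrt(2)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    mean_dist = np.mean(np.linalg.norm(points - centroid, axis=1))
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1.0]])
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (T @ homogeneous.T).T, T

def estimate_homography(pts1, pts2):
    """Normalised DLT estimate of H mapping pts1 (view 1) to pts2 (view 2).

    pts1, pts2 : (N, 2) arrays of corresponding image points, N >= 4.
    """
    p1, T1 = normalise(pts1)
    p2, T2 = normalise(pts2)
    rows = []
    for (x, y, _), (xp, yp, _) in zip(p1, p2):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    M = np.array(rows)
    # Minimise ||M H|| subject to ||H|| = 1: the solution is the right
    # singular vector associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(M)
    Hn = Vt[-1].reshape(3, 3)
    # Undo the normalisation: H = T2^-1 Hn T1.
    H = np.linalg.inv(T2) @ Hn @ T1
    return H / H[2, 2]
```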

3.3.2 Epipole Geometry


The epipole geometry is another method that can be employed for feature correspondence between two overlapping camera views [16,17,32,41,74,75,76]. The key distinction from the homography method is that feature matching involves a 2D line search, whilst the homography is a point based transform. This has benefits where the ground plane constraint is not valid, but the method suffers increased ambiguity if several objects lie on the epipole line. The epipole geometry is graphically depicted in figure 3.2.

Figure 3.2 The epipole geometry between a pair of camera views.

The 2D feature m in camera view 1 is constrained to lie on the epipole line (epl1) in camera view 2. C1 and C2 are the camera centres of views 1 and 2 respectively.

3.4 Robust Homography Estimation

3.4.1 Feature Detection


Each single view tracker employs a background subtraction algorithm for motion detection [97], and a partial observation method for 2D tracking [98]. As discussed in chapter 2, the background subtraction algorithm can handle small changes in illumination, which is particularly important in outdoor environments where lighting conditions can vary considerably. The object features required to perform the homography estimation take the form of centroid measurements of objects detected by each single view tracker. The sequence of centroids for the same object represents the tracked path as it moves through the camera field of view.


3.4.2 Least Quantile of Squares


Given a set of correspondence points the homography can be estimated using the method described in section 3.3.1. The next step is to define a process that allows a set of correspondence points to be determined automatically from a set of input data. A set of sparse object trajectories can be used to provide training data for estimating the homography transformation between two overlapping camera views. The object trajectories are taken during periods of low activity, in order to reduce the likelihood of finding false correspondence points. The trajectories take the form of a set of tracked object centroids that are found using the feature detection method described in section 3.4.1. A Least Quantile of Squares (LQS) approach is used to automatically recover a set of correspondence points between each pair of overlapping camera views, which can then be used to compute the homography mapping. The LQS method performs an iterative search of the solution space by randomly selecting minimal sets of correspondence points to compute candidate homography mappings [92]. The solution found to be most consistent with the set of object trajectories is taken as the final solution. Stein used this method for registering ground planes between overlapping camera views [91]. The LQS method was used since it is a robust alignment algorithm that can cope with data containing a large number of outliers. The following steps are used to estimate the homography transformation using this approach:

1. Synchronise the tracking data using the timestamp information associated with each object. The internal clocks of each camera are synchronised via a LAN using the method that will be discussed in chapter five.

2. Create a list of the M possible object correspondence pairs between the two cameras for each synchronised image frame.

3. From the M possible pairs select four unique correspondence pairs randomly and use these to compute a homography from camera one to camera two.

4. Compute the transfer error for each correspondence pair in the list created in step 2. The LQS score for this test is chosen as the worst of the top 20%. A value of 20% is chosen since it is expected that the list of correspondence pairs will contain more than 50% of outliers, particularly if there are several objects moving simultaneously between the cameras.

5. Repeat steps 3-4 N times, saving the random choice that gives the smallest LQS score.

6. After N tests we assume that the choice giving the smallest LQS score corresponds to object pairings that contain the smallest number of outliers. The top 20% of these object pairings are used to compute the final homography.

The value of N was chosen to be 3000 for all the experiments performed in this chapter.

The steps used to perform the LQS search are illustrated with a simple example used to estimate a homography between two cameras, where a set of six features appear in each of the camera views as shown in Figure 3.3.
Figure 3.3 Features used to estimate the homography transformation between two camera views using the LQS method (features labelled A-F in camera one and 1-6 in camera two).

Step 2 The list of all possible combinations of correspondence points is created: (A,1), (A,2), (A,3), (A,4), (A,5),(A,6) (B,1), (B,2), (B,3), (B, 4), (B,5),(B,6) .. (E,1), (E,2),(E,3), (E,4), (E,5),(E,6)

40

Page 41 of 168

Step 3 Select four correspondence points at random, for example: ((A,1), (B,2), (C,3),(D,4)), and use the points to estimate a homography transformation. In this instance the four points selected are a set of inlier correspondence points.

Step 4 Let $r_i^2$, $i = 1, \ldots, M$ be the list of transfer errors associated with the correspondence points defined in step 2 and the homography transformation estimated in step 3. The list of transfer errors is then sorted in ascending order, resulting in the list $r^2_{k,M}$, where the subscript k refers to the k-th smallest transfer error in the sorted list. The LQS score for this test is taken as the worst transfer error within the top 20% of the list $r^2_{k,M}$. We choose to take the top 20%, since we expect the list of correspondence points defined in step 2 to contain a large percentage of outliers. The correspondence points consistent with the estimated homography will appear at the top of the list $r^2_{k,M}$, while outlier correspondence points will appear at the bottom of the list of transfer errors.

Step 5 Steps 3-4 are repeated N times. The value of N is chosen such that there is a 99% chance that at least one of the random samples of correspondence points selected in Step 3 is free from outliers: $(1 - w^s)^N = 1 - p$, where w is the probability that a correspondence point is an inlier, N is the number of selections (each of s correspondence points), and p is the probability that at least one of the N selections will be free from outliers. Hence

$N = \dfrac{\log(1 - p)}{\log(1 - w^s)}$

where p = 0.99, s = 4 and w = 0.2, resulting in N = 2875.

The LQS test that has the lowest score is taken as the result with the best set of inlier correspondence points. We use the top 20% of correspondence points defined by the list $r^2_{k,M}$ to estimate the final homography transformation.
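The following sketch outlines the LQS search described in steps 1-6 above. It assumes the estimate_homography() routine from the sketch in section 3.3.1 and uses the symmetric transfer error defined formally in section 3.4.2.1 (equation 3.6). The function and parameter names are illustrative, and the construction of the list of candidate pairings per synchronised frame (step 2) is assumed to have been done already.

```python
import numpy as np

def symmetric_transfer_errors(H, pts1, pts2):
    """Batch symmetric transfer error (see equation 3.6, section 3.4.2.1)."""
    def project(M, pts):
        q = (M @ np.column_stack([pts, np.ones(len(pts))]).T).T
        return q[:, :2] / q[:, 2:3]
    return (np.sum((pts1 - project(np.linalg.inv(H), pts2)) ** 2, axis=1)
            + np.sum((pts2 - project(H, pts1)) ** 2, axis=1))

def lqs_homography(pairs1, pairs2, n_tests=3000, quantile=0.2, rng=None):
    """LQS search over the list of M candidate correspondence pairings.

    pairs1, pairs2 : (M, 2) arrays; row i holds the centroid of candidate
    pairing i in camera one and camera two respectively (the list of step 2).
    Assumes estimate_homography() from the sketch in section 3.3.1.
    """
    pairs1 = np.asarray(pairs1, dtype=float)
    pairs2 = np.asarray(pairs2, dtype=float)
    rng = rng or np.random.default_rng()
    M = len(pairs1)
    k = max(4, int(quantile * M)) - 1            # index of the worst kept error
    best_score, best_H = np.inf, None
    for _ in range(n_tests):
        idx = rng.choice(M, size=4, replace=False)    # step 3: minimal sample
        try:
            H = estimate_homography(pairs1[idx], pairs2[idx])
            errs = np.sort(symmetric_transfer_errors(H, pairs1, pairs2))
        except np.linalg.LinAlgError:
            continue                              # degenerate sample, skip it
        score = errs[k]                           # step 4: worst of the top 20%
        if score < best_score:                    # step 5: keep the best sample
            best_score, best_H = score, H
    # Step 6: refit using the top quantile of pairings under the best sample.
    errs = symmetric_transfer_errors(best_H, pairs1, pairs2)
    inliers = np.argsort(errs)[:k + 1]
    return estimate_homography(pairs1[inliers], pairs2[inliers]), best_score
```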

3.4.2.1 Transfer Error

The homography relations between each pair of overlapping cameras are used to match detected moving objects in each overlapping camera view. The transfer error is the summation of the projection error in each camera view for a pair of correspondence points [41]. It indicates the size of the error between corresponded features and their expected projections according to some translating function, the object centroid homography in our case. In Mikic et al. [74] 3D epipolar constraints were used as a basis for matching. Each method has its own advantages. The epipole line based approach can still function even if the two views do not share a common dominant ground plane, but it requires that the camera geometry between the two views is known in advance and is fairly accurate. The homography-based method assumes that each camera view shares a dominant ground plane. The biggest advantage of the homographic method over the epipole based method is that the homography maps points to points, while the epipole approach maps points to lines, so a one dimensional search still needs to be performed to establish an object correspondence. The homographic method could be applied to all regions of the image assuming that we had the 3D camera geometry along with terrain information of the scene, for example an elevation map.

A graphical depiction of the feature matching is given in Figure 3.4. The bounding box of each object is displayed. The white lines represent the epipole lines for each object centroid, terminating at the ground plane. The red points represent the tracked centroid of each detected object. The white points in each bounding box represent the projection of the object centroid using the homography as a transformation. The top row of images in Figure 3.4 shows an example of matching two vehicles; the bottom row demonstrates feature matching between three tracked objects.

The transfer error is used by the homography alignment and viewpoint correspondence methods for assessing the quality of a corresponded pair of centroids in two different camera views. The transfer error associated with a correspondence pair is defined as:

$(x' - H^{-1}x'')^2 + (x'' - Hx')^2$   (3.6)

where $x'$ and $x''$ are projective coordinates of the corresponded feature in view 1 and view 2 respectively, and H is the homography transformation from view 1 to view 2.
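As a concrete illustration of equation (3.6), the sketch below evaluates the symmetric transfer error for a single pair of matched centroids; the function name and argument conventions are assumptions made for this example.

```python
import numpy as np

def transfer_error(H, x1, x2):
    """Symmetric transfer error of equation (3.6) for one correspondence.

    x1, x2 : (x, y) image coordinates of the matched centroid in view 1 and
    view 2 respectively; H is the homography from view 1 to view 2.
    """
    def apply(M, p):
        q = M @ np.array([p[0], p[1], 1.0])
        return q[:2] / q[2]
    forward = np.sum((np.asarray(x2, dtype=float) - apply(H, x1)) ** 2)
    backward = np.sum((np.asarray(x1, dtype=float) - apply(np.linalg.inv(H), x2)) ** 2)
    return forward + backward
```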


Figure 3.4 Feature matching using: epipole line analysis and homography alignment. The red circles represent the tracked centroid of each object, the white circles represent the centroids projected by homography transformation. The white lines represent the epipole lines (derived from the calibration information) projected through each centroid

3.5 3D Measurement and Uncertainty

Given a set of corresponded object features and camera calibration information it is possible to extract 3D measurements from the scene [32,41]. Using multiple viewpoints improves the estimation of the 3D measurement. The surveillance system is able to make 3D measurements once detected objects have been corresponded in each camera view, as will be discussed in chapter 4. A 3D line intersection algorithm is employed to estimate each object's location in world coordinates. Using the calibrated camera parameters it is possible to estimate the uncertainty of the image measurement by propagating the covariance from the 2D image plane to 3D world coordinates.


3.5.1 3D Measurements
Given a set of corresponded objects in each camera view a 3D ray is projected through the centroid of the object in order to estimate its location. Using the camera calibration model it is possible to map the 2D object centroid to a 3D line in world coordinates.

3.5.1.1 Least Squares Estimation

Given a set of N 3D lines

$r_i = a_i + \lambda_i b_i$

a point $p = (x, y, z)^T$ must be evaluated which minimises the error measure:

$\epsilon^2 = \sum_{i=1}^{N} d_i^2$   (3.7)

where $d_i$ is the perpendicular distance from the point $p$ to the line $r_i$. Assuming that the direction vector $b_i$ is a unit vector, we have:

$d_i^2 = \|p - a_i\|^2 - ((p - a_i) \cdot b_i)^2$   (3.8)

Figure 3.5 provides an explanation of the error measure from a geometric viewpoint. The point $a_i$ is a general point located on the line, and $b_i$ is the unit direction vector of the line. The distance $d_i$ is the perpendicular distance between an arbitrary point $p$ and the line $r_i$. The origin of the world coordinate system is defined by $O$. Evaluating the partial derivatives of the summation of all $d_i^2$ with respect to x, y and z results in the equations for computing the least squares estimate of $p$:

Figure 3.5 Geometric view of the minimum discrepancy

$\epsilon^2 = \sum_{i=1}^{N} d_i^2 = \sum_{i=1}^{N} \left[ \|p - a_i\|^2 - ((p - a_i) \cdot b_i)^2 \right]$   (3.9)

Rearrangement of (3.9) leads to:

$\dfrac{\partial \epsilon^2}{\partial x} = \sum_{i=1}^{N} \left\{ 2(x - a_{ix}) - 2((p - a_i) \cdot b_i)\, b_{ix} \right\}$   (3.10)

$\dfrac{\partial \epsilon^2}{\partial y} = \sum_{i=1}^{N} \left\{ 2(y - a_{iy}) - 2((p - a_i) \cdot b_i)\, b_{iy} \right\}$   (3.11)

$\dfrac{\partial \epsilon^2}{\partial z} = \sum_{i=1}^{N} \left\{ 2(z - a_{iz}) - 2((p - a_i) \cdot b_i)\, b_{iz} \right\}$   (3.12)

$\dfrac{\partial \epsilon^2}{\partial x} = \dfrac{\partial \epsilon^2}{\partial y} = \dfrac{\partial \epsilon^2}{\partial z} = 0$   (3.13)

Using matrix notation an equation can be derived that minimises the error function (3.13) for all N lines.

$\begin{bmatrix} \sum_{i=1}^{N}(1 - b_{ix}^2) & -\sum_{i=1}^{N} b_{ix}b_{iy} & -\sum_{i=1}^{N} b_{ix}b_{iz} \\ -\sum_{i=1}^{N} b_{ix}b_{iy} & \sum_{i=1}^{N}(1 - b_{iy}^2) & -\sum_{i=1}^{N} b_{iy}b_{iz} \\ -\sum_{i=1}^{N} b_{ix}b_{iz} & -\sum_{i=1}^{N} b_{iy}b_{iz} & \sum_{i=1}^{N}(1 - b_{iz}^2) \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N}\left(a_{ix} - b_{ix}(a_i \cdot b_i)\right) \\ \sum_{i=1}^{N}\left(a_{iy} - b_{iy}(a_i \cdot b_i)\right) \\ \sum_{i=1}^{N}\left(a_{iz} - b_{iz}(a_i \cdot b_i)\right) \end{bmatrix}$   (3.14)

$KP = C, \qquad P = K^{-1}C$

The point $P$ can now be calculated by solving this system, formed by summing the partial derivative terms over the N 3D lines. This 3D line intersection algorithm was used to find the optimal centroid point in the least squares sense.
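The closed form of equation (3.14) translates directly into a few lines of code. The sketch below is an illustrative implementation, with assumed function and variable names and no input validation, that accumulates K and C and solves for the intersection point.

```python
import numpy as np

def intersect_lines(a, b):
    """Least squares intersection of N 3D lines r_i = a_i + lambda_i * b_i.

    a : (N, 3) points on each line; b : (N, 3) direction vectors.
    Solves K p = C (equation 3.14) for the point p minimising the sum of
    squared perpendicular distances to the lines.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)   # ensure unit vectors
    I = np.eye(3)
    # K = sum_i (I - b_i b_i^T), C = sum_i (I - b_i b_i^T) a_i
    K = np.zeros((3, 3))
    C = np.zeros(3)
    for ai, bi in zip(a, b):
        P = I - np.outer(bi, bi)
        K += P
        C += P @ ai
    return np.linalg.solve(K, C)
```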

3.5.1.2 Singular Value Decomposition

An alternative strategy for making 3D measurements is to use a Singular Value Decomposition (SVD) based approach. This is a full least squares approach and is numerically stable for N camera views. Although the matrix dimensions increase with the number of views, the matrices are sparse, reducing the computational complexity. Using the Cartesian representation of a line:

$\dfrac{x - a_{ix}}{b_{ix}} = \dfrac{y - a_{iy}}{b_{iy}} = \dfrac{z - a_{iz}}{b_{iz}} = \lambda_i$   (3.15)

after rearranging (3.15):

$x - a_{ix} = \lambda_i b_{ix}$   (3.16)

$y - a_{iy} = \lambda_i b_{iy}$   (3.17)

$z - a_{iz} = \lambda_i b_{iz}$   (3.18)

The constraints described by equations (3.16-3.18) can be transformed into matrix notation:

$B\mathbf{x} = A$

$\begin{bmatrix} 1 & 0 & 0 & -b_{1x} & 0 & \cdots & 0 \\ 0 & 1 & 0 & -b_{1y} & 0 & \cdots & 0 \\ 0 & 0 & 1 & -b_{1z} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \ddots & & \vdots \\ 1 & 0 & 0 & 0 & \cdots & 0 & -b_{Nx} \\ 0 & 1 & 0 & 0 & \cdots & 0 & -b_{Ny} \\ 0 & 0 & 1 & 0 & \cdots & 0 & -b_{Nz} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ \lambda_1 \\ \vdots \\ \lambda_N \end{bmatrix} = \begin{bmatrix} a_{1x} \\ a_{1y} \\ a_{1z} \\ \vdots \\ a_{Nx} \\ a_{Ny} \\ a_{Nz} \end{bmatrix}$   (3.19)

Using SVD the least squares estimate of the 3D intersection of the N lines can be determined. This approach represents a complete least squares solution. Using SVD simplifies the solution for the vector $\mathbf{x}$ in equation (3.19), where B is (3N) by (3+N) and N is the number of cameras used to make the measurement. However, this approach is computationally more expensive than the least squares estimate approach discussed in section 3.5.1.1. In addition, our surveillance network typically does not contain more than three overlapping camera views, so the matrices involved remain small and the additional computational cost is not significant in practice.
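A sketch of the SVD-based alternative is given below: the stacked system of equation (3.19) is assembled and solved as a full least squares problem (numpy's lstsq routine uses an SVD internally). The helper name is an assumption for this example.

```python
import numpy as np

def intersect_lines_svd(a, b):
    """Least squares 3D line intersection via the stacked system B x = A.

    a : (N, 3) line points, b : (N, 3) direction vectors, as in equation
    (3.19). The unknown vector is [x, y, z, lambda_1, ..., lambda_N]; only
    the 3D point is returned.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    N = len(a)
    B = np.zeros((3 * N, 3 + N))
    A = a.reshape(-1)                     # [a1x, a1y, a1z, a2x, ...]
    for i in range(N):
        B[3 * i:3 * i + 3, 0:3] = np.eye(3)
        B[3 * i:3 * i + 3, 3 + i] = -b[i]
    # lstsq solves the full least squares problem using an SVD internally.
    x, *_ = np.linalg.lstsq(B, A, rcond=None)
    return x[:3]
```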

3.5.2 3D Measurement Uncertainty


It is important to have a mechanism for assessing the accuracy of the 3D measurements. This makes it possible to decide the degree of confidence that can be assumed for a given measurement. The uncertainty of a pixel in the image plane can be propagated to a plane in the global coordinate system [7,80]. The measurement uncertainty can allow us to identify when an object is moving out of the field of view for a specific camera, since the uncertainty increases with the distance between the camera and detected object. Once the uncertainty of the 3D measurement has been determined it is used to set the observation noise of the 3D Kalman filter employed for object tracking, which will be discussed further in chapter 4.

3.5.2.1 Jacobian Transfer

To derive the uncertainty of the 3D measurement it was necessary to propagate the covariance from the image space to the object space. This is achieved by formulating two functions which define how an image point (x, y) is mapped to a 3D object space point (X, Y, Zh), for a given height. Zh is the estimated centroid height of the object computed by the method described in the previous section. Taking the partial derivatives of these two functions we can then derive the Jacobian matrix. The uncertainty of the 3D measurement for the given camera viewpoint is:

$\Lambda = J \Sigma J^T$   (3.20)

where $\Sigma$ is the estimate of the image covariance for the given camera viewpoint and $\Lambda$ is the resulting object space covariance. The derivation of the values in the Jacobian matrix is shown in Appendix B.
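The propagation of equation (3.20) is a one-line operation once the Jacobian has been evaluated. The sketch below assumes a 2 by 2 Jacobian of the two image-to-ground-plane functions and a nominal isotropic image covariance; the names are chosen for illustration.

```python
import numpy as np

def propagate_covariance(J, image_cov):
    """Propagate a 2D image covariance to the object space plane Z = Zh
    (equation 3.20).

    J : (2, 2) Jacobian of the mapping (x, y) -> (X, Y) evaluated at the
    measured image point (derivation in Appendix B).
    image_cov : (2, 2) covariance of the image measurement.
    """
    return J @ image_cov @ J.T

# Example: a nominal isotropic image uncertainty of 2 pixels standard deviation.
nominal_image_cov = (2.0 ** 2) * np.eye(2)
```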

3.5.2.2 Covariance Fusion

The image space covariance is propagated for each camera that was used to make the 3D measurement of the moving object. This results in a set of object space covariance matrices. These covariance matrices must now be combined, or fused, to arrive at an optimal estimate of the uncertainty of the 3D measurement point. The equations for covariance accumulation are [53]:

$\Sigma_{CA}^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} + \cdots + \Sigma_N^{-1}$   (3.21)

where $\Sigma_i$ is the result of propagating the image covariance to object space for each camera viewpoint which was used to make the 3D measurement, and $\Sigma_{CA}$ is the single distribution of uncertainty resulting from fusing the covariances of each camera viewpoint.

The equations for covariance intersection are:

$\Sigma_{CI}^{-1} = w_1\Sigma_1^{-1} + w_2\Sigma_2^{-1} + \cdots + w_N\Sigma_N^{-1}$   (3.22)

where $w_i = \dfrac{w_i'}{\sum_{i=1}^{N} w_i'}$ and $w_i' = \dfrac{1}{\mathrm{trace}(\Sigma_i)}$.

Figure 3.6 Covariance fusion by accumulation (a), covariance fusion by intersection (b)

Covariance intersection has been successfully applied to decentralised estimation applications, where sensors may partially observe the state of a tracked object [53]. The covariance intersection equations are similar to the covariance accumulation equations except a weighting term is introduced. Non-uniform weighting is applied, as described above, so that preference is given to those estimates that have smaller trace value. The difference between the two approaches can be seen in Figure 3.6, which shows a plot of two general covariance ellipses fused by each method for illustration purposes only. It can be observed that fusion by accumulation gives a more optimistic result than fusion by intersection.
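Both fusion rules reduce to a few matrix operations; the sketch below implements equations (3.21) and (3.22) for a list of propagated object space covariances, with illustrative function names.

```python
import numpy as np

def fuse_accumulation(covs):
    """Covariance accumulation (equation 3.21): sum of the inverses."""
    info = sum(np.linalg.inv(S) for S in covs)
    return np.linalg.inv(info)

def fuse_intersection(covs):
    """Covariance intersection (equation 3.22) with trace-based weights."""
    w = np.array([1.0 / np.trace(S) for S in covs])
    w = w / w.sum()
    info = sum(wi * np.linalg.inv(S) for wi, S in zip(w, covs))
    return np.linalg.inv(info)
```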

3.6 Experiments and Analysis

The following experiments were performed in order to assess the performance of the methods discussed in this chapter. The first set of experiments focuses on the homography estimation, which is important to the system performance since it is used by the multi view tracking framework to correspond 2D object features in overlapping views, as will be discussed in chapter 4. The homography estimation is first evaluated by simulation in order to test how well the method performs in the presence of noise. The purpose of this experiment was to determine the quality of tracking data required to ensure the homography estimation is robust, which could not be determined by evaluating the method over only a few minutes of video. The method is then applied to two different sets of video sequences, in order to evaluate the performance on real data. The first set of video sequences is a standard test dataset that has been provided by the PETS workshop that was discussed in chapter 2.

The second test dataset was taken from our own system located at Northampton Square on the City University campus.

3.6.1 Homography Estimation Experiment By Simulation


The objective of the first experiment was to assess how 2D image noise affects the accuracy of the homography estimation. A set of 2D trajectories was extracted for tracked objects from a single camera view over a 30-minute interval, as shown in Figure 3.7. The object trajectories were then translated to the second camera using a known homography transformation. The total number of observations was 3534 and 2249 for cameras one and two respectively. The set of observations was then randomly perturbed using 2D Gaussian noise of known standard deviation. Figure 3.8 plots histograms of the re-projection errors of the correspondence points used to compute a homography estimated with the LQS method for another dataset; it can be observed that each plot can be approximated by a Gaussian distribution. The mean and standard deviation of the re-projection errors of the correspondence points were (3.3, 1.61) and (2.67, 1.30) respectively. The actual distribution of the re-projection errors is dependent on a number of factors, which include the accuracy of motion detection and 2D tracking, along with the position and depth of the object in the scene.

Figure 3.7 Synthetic trajectories created to evaluate LQS method by simulation.

Figure 3.8 Histograms of re-projection errors of the correspondence points for camera one (left) and camera two (right), plotting count against re-projection error (pixels).

The standard deviation was varied between 1 and 9 pixels to identify what level of noise results in a severe breakdown of the accuracy of the homography estimation. The mean re-projection error of each camera, and the mean transfer error of each correspondence point, are shown in Figure 3.9. The top and middle plots show the mean transfer errors for camera one and two respectively. The bottom plot shows the mean transfer error for both cameras. The transfer error and its error bars increase rapidly when the standard deviation is larger than 6 pixels, which indicates where the method of estimation becomes unreliable.

Figure 3.9 LQS evaluation by simulation: mean transfer error (pixels squared) against noise (pixel standard deviation) for camera one (top), camera two (middle), and both cameras combined (bottom).


3.6.2 Homography Estimation PETS2001 Datasets


In order to test the effectiveness of the homography estimation on real data we used the PETS2001 datasets. The PETS2001 datasets are a set of video sequences that have been made available to the machine vision community to provide standard testing datasets that may be used for performance evaluation of tracking systems. Each dataset consists of two camera views overlooking a university campus area. Each video sequence contains a combination of pedestrians and vehicles. The video sequence data provided by PETS was captured using digital camcorders running at 25 fps. The LQS score was found to be 327.63 and 93.35 pixels for datasets one and two respectively. The LQS score is larger for dataset one, since the pair of cameras has a wider baseline than in dataset two. The correspondence points found by the LQS search algorithm are shown in Figure 3.10. It should be noted that each correspondence point lies about a metre above the ground plane, which approximates the average centroid height above the ground plane.

Figure 3.10 Correspondence points found for dataset 1 (top row) and dataset 2 (bottom row) using the LQS search.


3.6.3 Homography Estimation Northampton Square Dataset


The next set of experiments was performed using data captured by two cameras overlooking the Northampton Square entrance of City University. This data provided a more challenging test than the PETS datasets, since the video contained a higher frequency of objects. In addition, the PETS datasets have better image quality than the video from the analogue cameras used at Northampton Square. The objective of this set of experiments was to determine if the homography estimation method could be applied within an online surveillance system framework. Training data for the homography estimation was gathered during periods of low activity. Results are presented for three 30 minute video sequences captured by the surveillance system. The homography estimation method was applied to each set of training data, and the resulting homography transformations were then checked for consistency by computing a set of error statistics between the sets of corresponded feature points.

Figure 3.11 Correspondence points found by the homography estimation method for Northampton dataset 1

Figure 3.12 Correspondence points found by the homography estimation method for Northampton dataset 2


Figure 3.13 Correspondence points found by the homography estimation method for Northampton dataset 3

The correspondence points found by the LQS search for each of the datasets are shown in Figures 3.11-13. The LQS scores for the three datasets were: 16.8, 90.7, and 16.5. The dataset with the minimum LQS score also has the best error statistics, which are summarised in table 3.4.

                   No. observations   Mean re-projection error (pixels)   Standard deviation (pixels)
Dataset 1 (cam1)   1457               1.9                                 0.85
Dataset 1 (cam2)   2817               1.54                                0.71
Dataset 2 (cam1)   7318               4.1                                 1.92
Dataset 2 (cam2)   9241               3.3                                 1.68
Dataset 3 (cam1)   2886               1.8                                 0.80
Dataset 3 (cam2)   3534               1.63                                0.73

Table 3.4 Summary of error statistics for homography estimation

3.6.4 Temporal Calibration


Temporal calibration can be performed on pre-recorded video sequences to determine the time offset between each camera. A video sequence was captured using a frame grabbing system. The average frame processing rates for each camera were 5.402 and 5.397 fps. The frame-processing rate is dependent upon the frame grabber hardware and software as well as the processor speed. Since each camera has a different frame-processing rate there is a temporal drift between the cameras, so it is necessary to synchronise them prior to matching features between each camera. Each image frame has an associated timestamp, so once the time offset between each camera has been determined it is possible to perform the synchronisation step. The image frames are synchronised to the camera with the slowest processing rate. The frame offset between each camera can be determined by creating an LQS plot as shown in Figure 3.14. The plot shows the LQS scores for time offsets varying between -2 and 2 seconds in increments of 0.02 seconds. The frame offset between each camera can be taken as the minimum of the LQS plot. The time offset between camera one and camera two was manually determined to be -0.36 s. The minimum of the LQS plot is located between -0.2 and -0.4 seconds, which is consistent with the actual time offset. This approach can only be applied when timestamp information is available for each camera and the camera views overlap. This method cannot be used to determine the time offset between non-overlapping cameras, and is not suitable for continuous monitoring over a period of several hours or days. During long time intervals it is likely the internal clocks of each camera will become skewed, making the calculated temporal offset invalid. An alternative approach to solve this problem will be discussed in chapter 5.
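The offset search itself can be written as a simple scan over candidate offsets, as sketched below; score_fn is an assumed callback that re-synchronises the tracking data for a given offset and returns the corresponding LQS score from the estimation procedure of section 3.4.2.

```python
import numpy as np

def best_time_offset(score_fn, offsets=np.arange(-2.0, 2.0 + 1e-9, 0.02)):
    """Scan candidate time offsets and return the one with the lowest LQS score.

    score_fn(offset) should shift the second camera's timestamps by `offset`
    seconds, re-associate the synchronised observations and return the LQS
    score of the resulting homography estimate.
    """
    scores = np.array([score_fn(t) for t in offsets])
    return offsets[np.argmin(scores)], scores
```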

Figure 3.14 LQS plots between two cameras for different time offsets, showing the LQS score against time offsets from -2 to 2 seconds; the minimum lies near T = -0.36 s.


3.6.5 3D Measurement and Uncertainty Experiment By Application


The methods adopted for 3D measurement and uncertainty were evaluated using a configuration of four overlapping cameras. The video sequence consisted of 101 frames of a toy car moving through a laboratory, captured using 768x576 colour cameras. The inter-frame sampling time was approximately 2 seconds for the captured video sequence. The video contained instances of the object entering and leaving the field of view of each camera viewpoint. Each camera was calibrated using Tsai's [94] method for coplanar calibration. Landmark features were added to the environment and then surveyed. Manual measurements of the image coordinates were then matched against the survey measurements. For the toy car video sequence the mean square object space errors were: 33.78, 81.62, 83.64, and 185.83 mm for camera views one to four respectively.

Figure 3.15(a) shows a plot of the least squares estimate of the height of the centroid of the toy car as it moved between the four cameras, and Figure 3.15(b) shows the least squares estimate of the ground plane location of the toy car. Figure 3.16(a) shows the measurement uncertainty (represented by the trace of the covariance matrix) for all the measurements shown in Figure 3.15. The variation in the height of the car is a combination of several factors: the distance of the car from the cameras used to estimate the height, the number of cameras used to make the measurement, and the orientation of the car with respect to each camera. It can be observed in Figure 3.16(a) that there were several measurements with uncertainties that were larger by several orders of magnitude compared to the other measurements. These cases were related to the second and fourth camera viewpoints. The toy car was observed at a large distance from both cameras, compared to the other observations, which accounts for the significant increase in the uncertainty of these measurements. Figure 3.16(b) is the plot of measurement uncertainties with the outliers removed.

The uncertainty is dependent on the number of cameras used to make the 3D measurement. The mean trace (covariance accumulation) for sequence one was 2.49 x 10^4, 419.18, and 410.62 for two, three, and four cameras being used to make the 3D measurements, respectively. In image sequence two the mean trace (covariance accumulation) was 38.08, 22.81, and 9.67 for two, three and four cameras being used to make the 3D measurement, respectively. For both image sequences the uncertainty exhibited a downwards trend with an increase in the number of cameras used to make the 3D measurement, which implies that 3D measurements become more accurate as the number of cameras increases.


[Figure 3.15 comprises two panels: (a) the least squares estimate of the centroid height of the toy car (Z-axis value, mm) plotted against frame number; (b) the car motion on the ground plane (Y-axis value against X-axis value, mm).]

(b) Figure 3.15 Least squares estimate measurements for toy car video sequence


[Figure 3.16 comprises two panels plotting the trace of the covariance matrix of the 3D measurements against frame number, for both covariance accumulation and covariance intersection: (a) all measurements; (b) with outliers removed.]

(b) Figure 3.16 Uncertainty of the 3D measurements for toy car sequence


3.7 Summary

This chapter has discussed a set of methods that can be used for camera calibration, robust homography estimation, extracting 3D measurements, and estimating 3D measurement uncertainty. Each of these techniques facilitates the operation of the multi view tracking framework, which will be discussed in chapter 4. Each of the cameras in our surveillance network was calibrated using Tsai's method [94] with visible landmark features that were manually surveyed. We found this approach was practical for our surveillance application. A LQS approach was used to derive the estimate of the homography transformation between pairs of overlapping camera views. The LQS algorithm uses the tracked object centroids as input and performs an iterative search of the solution space to find a homography alignment that is most consistent with the training data. The LQS method was selected since a robust method was required that could work when outliers are present in the input data. Once the homography has been estimated it can be used to match features between overlapping camera views, as will be discussed in chapter 4. The LQS algorithm shares some similarities with RANSAC, with the most notable difference being that it is not necessary to know the properties of the noise present in the training data [41]; only an estimate of the percentage of outliers is required. The homography is a popular transformation that has been used for aligning trajectories between overlapping views [34,90,91]. The reason for using the homography transformation instead of epipole line analysis is that the surveillance region conforms to the ground plane constraint. Hence, the point based transform of the homography approach is more appropriate than the 2D line search required for epipole line analysis, which could produce more ambiguity.

This chapter has also described the method that the system uses to estimate the 3D location and measurement uncertainty of each detected object. A least squares estimation technique is employed to intersect 3D lines to make measurements involving two or more camera views. The measurement uncertainty is estimated by using the Jacobian transformation to propagate a nominal image covariance from image space to object space. When measurements are made using two or more cameras the uncertainties of each view are integrated using weighted covariance intersection. The justification for using covariance intersection is that it assigns an appropriate weight to each covariance matrix during the fusion. Hence, covariance matrices with small uncertainties are weighted higher than those with larger uncertainties. Covariance intersection also has the added benefit that it does not make an optimistic estimate of measurement uncertainty.


This is an important property, since the measurement uncertainty is used to indicate the level of observation noise in a 3D Kalman filter, as will be described in chapter 4. If the estimates of the measurement uncertainty were too optimistic the 3D Kalman filter could become overconfident in its state estimate of the tracked object, which could result in object tracking failure. To conclude, this chapter has illustrated the effectiveness of calibration for making 3D measurements using multiple camera views. The 3D measurement uncertainty is estimated using the Jacobian transformation in order to propagate the image covariance to object space.
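The weighted covariance intersection step can be illustrated with a short sketch. The fragment below fuses two per-view estimates by mixing their inverse covariances and picking the weight that minimises the trace of the fused covariance; the trace criterion, the simple weight scan and the example values are assumptions made for illustration rather than the exact scheme used by the system.

```python
import numpy as np

def covariance_intersection(mean_a, cov_a, mean_b, cov_b, n_steps=100):
    """Fuse two (mean, covariance) estimates with covariance intersection:
    C_fused^-1 = w * Ca^-1 + (1 - w) * Cb^-1, choosing w to minimise the trace
    of the fused covariance.  Unlike a naive product of Gaussians this never
    produces an estimate more confident than the inputs justify."""
    inv_a, inv_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
    best = None
    for w in np.linspace(1e-3, 1.0 - 1e-3, n_steps):
        info = w * inv_a + (1.0 - w) * inv_b
        cov = np.linalg.inv(info)
        mean = cov @ (w * inv_a @ mean_a + (1.0 - w) * inv_b @ mean_b)
        if best is None or np.trace(cov) < best[2]:
            best = (mean, cov, np.trace(cov))
    return best[0], best[1]

if __name__ == "__main__":
    # Two 3D measurements of the same object from different camera views,
    # the second being noticeably less certain than the first.
    mean_a, cov_a = np.array([1.0, 2.0, 0.5]), np.diag([0.2, 0.2, 0.4])
    mean_b, cov_b = np.array([1.2, 1.9, 0.6]), np.diag([1.5, 1.5, 2.0])
    mean, cov = covariance_intersection(mean_a, cov_a, mean_b, cov_b)
    print(mean, np.trace(cov))
```

In this example the tighter covariance receives almost all of the weight, which is the behaviour described above: views with small uncertainties dominate the fused estimate without the result ever becoming over-confident.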


4 Object Tracking and Trajectory Prediction


4.1 Background
In the previous chapter we discussed a set of techniques that could be used to calibrate each camera in the surveillance network, match 2D object features between overlapping camera views, and extract a set of 3D measurements given a set of matched 2D features. In this chapter we will describe the framework employed for robust tracking in 3D using multiple camera views. We first review the method employed by each camera in the surveillance network to perform motion segmentation, feature detection, and 2D tracking. The actual implementation of the motion segmentation and 2D object tracking is outside the scope of this thesis and it is assumed that we have a number of intelligent cameras available with the single view tracking software pre-installed. We then describe the methods used by the system to match 2D object features between overlapping camera views. The homography relations are automatically recovered between each pair of overlapping camera views as discussed in chapter 3. The homography relations are then applied to match 2D object features between each overlapping camera view. The 3D object-tracking framework is then discussed in detail. The 3D object tracking framework allows the system to integrate the 2D object tracking information from each camera. In addition, since each camera in the surveillance network is calibrated in the same world coordinate system it is possible to visualise the object activity on a ground plane map. A 3D Kalman filter is employed in order to facilitate robust object tracking. One of the key requirements of the system identified in chapters 1 and 2 is that the system should be able to preserve the identity of objects that move between non-overlapping camera views. We describe how geometric scene constraints can be used to model object handover regions between nonoverlapping camera views. We also will demonstrate how these models can be employed to robustly track objects that move between non-overlapping cameras separated by a short temporal period of a few seconds. The remainder of the chapter discusses the results from a set of experiments to demonstrate the effectiveness of the multi view tracking framework.


4.2 Feature Detection and 2D Tracking


Each intelligent camera employs an adaptive background model [97] for motion detection, as discussed earlier in chapter 2, which provides a robust framework for motion detection in outdoor environments that are subject to varying changes in illumination. Object tracking is performed using a partial observation tracking algorithm [96,98] for robust occlusion reasoning in 2D. The features tracked for each object include: bounding box dimensions, centroid location, and the mean colour components of the foreground object in the (R,G,B) colour space. The features of the tracked objects detected by each intelligent camera are used as input to the multi view tracking algorithm, which will be discussed in the remainder of this chapter.
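For reference, the per-object features listed above can be thought of as a simple per-frame record. The sketch below is only an illustrative container; the field names are hypothetical and do not reflect the actual ICU data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Track2D:
    """Illustrative container for the per-frame features reported by an
    intelligent camera unit for one tracked object (field names are assumptions)."""
    camera_id: int
    track_id: int
    frame: int
    bbox: Tuple[int, int, int, int]        # x, y, width, height in pixels
    centroid: Tuple[float, float]          # image coordinates of the object centroid
    mean_rgb: Tuple[float, float, float]   # mean foreground colour components

observation = Track2D(camera_id=1, track_id=42, frame=601,
                      bbox=(310, 215, 24, 58), centroid=(322.0, 244.0),
                      mean_rgb=(0.41, 0.38, 0.33))
print(observation)
```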

4.3 Feature Matching Between Overlapping Views

4.3.1 Viewpoint Correspondence (Two Views)


The LQS algorithm described in chapter 3 was used to determine a set of correspondence points, which were then used to compute the object centroid homography. This homography can be used to correspond object tracks in the testing video sequences. Once the calibration data and homography alignment model are available we can use the relationship between both camera views to correspond detected objects. From observing the results of motion detection it is apparent that the object centroid is a more stable feature to track in 3D, since it is more reliably detected than the top or bottom of the object, particularly in outdoor scenes where the object may be a far distance from the camera. To summarise, the following steps are used for matching 2D object tracks, taken from different views, for a given image frame (a code sketch of these steps follows the list):

1. Create a list of all possible correspondence pairs of objects for each camera view.
2. Compute the transfer error for each object pair.
3. Sort the correspondence points list by increasing transfer error.
4. Select the most likely correspondence pairs according to the transfer error. Apply a threshold so that correspondence pairs where the transfer error exceeds $\epsilon_{max}$ are not considered as potential matches.
5. Create a correspondence points list for each matching object.


6. Map each entry in the correspondence points list using 3D line intersection of the bundle of image rays to locate the object in 3D.
7. For each object centroid which does not have a match in the correspondence pair list, use the calibration information to estimate the location of the object in 3D.
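A compact sketch of steps 1-4 for two views is given below. The identity homography, the pixel coordinates and the threshold value max_err are illustrative assumptions, and the 3D line intersection of steps 6-7 is omitted.

```python
import numpy as np

def transfer_error(H, p, q):
    """Squared distance between the centroid p mapped from view one into view
    two through the homography H and the observed centroid q in view two."""
    ph = H @ np.array([p[0], p[1], 1.0])
    return float(np.sum((ph[:2] / ph[2] - np.asarray(q)) ** 2))

def match_centroids(H, centroids_1, centroids_2, max_err=25.0):
    """Greedy one-to-one matching by increasing transfer error (steps 1-4)."""
    pairs = [(transfer_error(H, p, q), i, j)
             for i, p in enumerate(centroids_1)
             for j, q in enumerate(centroids_2)]
    pairs.sort()
    used_1, used_2, matches = set(), set(), []
    for err, i, j in pairs:
        if err > max_err:
            break
        if i in used_1 or j in used_2:       # enforce one-to-one mappings
            continue
        used_1.add(i)
        used_2.add(j)
        matches.append((i, j, err))
    return matches

if __name__ == "__main__":
    H = np.eye(3)                            # stand-in homography
    view1 = [(100.0, 50.0), (220.0, 80.0), (400.0, 300.0)]
    view2 = [(221.0, 79.0), (102.0, 52.0)]   # only two of the three objects visible
    print(match_centroids(H, view1, view2))
```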

An additional constraint applied during viewpoint correspondence is that one-to-many mappings between observations in each camera view are not allowed. This has the effect of reducing the number of phantom objects which can appear at the end of a dynamic occlusion. An example of viewpoint correspondence is shown in figure 4.1: the top image shows the original objects detected by the 2D object tracker, while the bottom row shows the observations remaining once viewpoint correspondence has been applied. It can be observed that the phantom objects near the lamppost in the left camera view have been eliminated by the viewpoint correspondence process. In addition, the three pedestrians are classified as a group because one-to-many matches are not allowed between the objects in each camera view. The viewpoint correspondence process has the effect of reducing the number of false objects that have been detected by the 2D object tracker.

4.3.2 Viewpoint Correspondence (Three Views)


The viewpoint correspondence algorithm can be extended to match objects between three camera views with a few modifications. It is assumed that the homography mappings between each pair of camera views have been determined by performing an LQS search as described in section 3.3. We first consider the triplets $(i, j, k)$, where i, j, and k are observations of moving objects in camera views one, two and three, respectively. We then evaluate the transfer errors between each pair of observations formed from the triplet $(i, j, k)$, which results in three transfer errors: $TE_{ij}$, $TE_{ik}$, and $TE_{jk}$, representing the transfer errors between each camera pair. When the transfer error is below the threshold $\epsilon_{max}$ we can conclude that the observation pair forms a match between the corresponding pair of camera views. We can use the following transitive relationship to identify when a triplet of observations forms a match across all three camera views:

$$(TE_{ij} < \epsilon_{max} \wedge TE_{ik} < \epsilon_{max}) \vee (TE_{ij} < \epsilon_{max} \wedge TE_{jk} < \epsilon_{max}) \vee (TE_{ik} < \epsilon_{max} \wedge TE_{jk} < \epsilon_{max})$$


The remaining correspondence pair relationships are summarised below:

$TE_{ij} < \epsilon_{max}$: the condition to form the correspondence pair $(i, j)$ between camera views one and two.

$TE_{ik} < \epsilon_{max}$: the condition to form the correspondence pair $(i, k)$ between camera views one and three.

$TE_{jk} < \epsilon_{max}$: the condition to form the correspondence pair $(j, k)$ between camera views two and three.

The algorithm then proceeds in the same manner as described in Section 4.3.1 for matching between two camera views. An example of homography transfer between three views is shown in figure 4.2. The black arrows indicate how the homography is used to match the three objects visible in each camera view.
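The transitive test itself reduces to a few boolean comparisons, as the sketch below shows; the numeric transfer errors and the threshold value are invented for illustration.

```python
def triplet_match(te_ij, te_ik, te_jk, eps_max):
    """True when at least two of the three pairwise transfer errors fall
    below the threshold, i.e. the transitive condition given above."""
    a, b, c = te_ij < eps_max, te_ik < eps_max, te_jk < eps_max
    return (a and b) or (a and c) or (b and c)

# One camera pair may disagree (e.g. an occlusion inflates TE_jk), but the
# triplet is still accepted because the other two pairs agree.
print(triplet_match(te_ij=4.0, te_ik=6.5, te_jk=120.0, eps_max=25.0))   # True
print(triplet_match(te_ij=80.0, te_ik=95.0, te_jk=120.0, eps_max=25.0)) # False
```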

Figure 4.1 Example of feature matching in PETS2001 dataset one.


Figure 4.2 Example of viewpoint correspondence between three overlapping camera views

4.4 Tracking in 3D

Using the approach described in chapter 3 we are able to merge the object tracks from separate camera views into a global world coordinate view. This 3D track data provides the set of observations, which are used for tracking using the Kalman filter [33]. The Kalman filter is shown in block diagram form in figure 4.3. The block diagram shows how the Kalman filter can be used to track the state of a process given a set of discrete observations.

[Diagram: a discrete system driven by process noise $w_{t-1}$ produces the true state $X_t$; the measurement stage with noise $v_t$ yields the observation $Z_t$; the discrete Kalman filter applies the gain $K_t$ to the innovation and, via a delay element, feeds the updated estimate back for the next prediction.]

Figure 4.3 Block diagram of the Kalman filter


System Dynamic Model

$$X_t = \Phi X_{t-1} + w_{t-1} \qquad (4.1)$$
$$w_t \sim N(0, Q_t) \qquad (4.2)$$

Measurement Model

$$Z_t = H X_t + v_t \qquad (4.3)$$
$$v_t \sim N(0, R_t) \qquad (4.4)$$

State Estimate Extrapolation

$$\hat{X}_t^{(-)} = \Phi \hat{X}_{t-1}^{(+)} \qquad (4.5)$$

Error Covariance Extrapolation

$$P_t^{(-)} = \Phi P_{t-1}^{(+)} \Phi^T + Q_{t-1} \qquad (4.6)$$

State Estimate Observational Update

$$\hat{X}_t^{(+)} = \hat{X}_t^{(-)} + K_t \left( Z_t - H \hat{X}_t^{(-)} \right) \qquad (4.7)$$

Error Covariance Update

$$P_t^{(+)} = \left( I - K_t H \right) P_t^{(-)} \qquad (4.8)$$

Kalman Gain Matrix

$$K_t = P_t^{(-)} H^T \left( H P_t^{(-)} H^T + R_t \right)^{-1} \qquad (4.9)$$

The Kalman filter provides an efficient recursive solution for tracking the state of a discrete time controlled process. The filter has been applied in numerous tracking applications for visual surveillance. The filter assumes the process noise $w_t$ and the measurement noise $v_t$ are independent of each other and have a Gaussian distribution. The system dynamic model is represented by the $N \times N$ matrix $\Phi$, where N is the dimension of the state space. The state transition matrix propagates the state of the process through a single discrete time step, from time (t-1) to time (t).


The observation matrix H is an $M \times N$ matrix that relates the state space to the observation space, where M is the dimension of the measurement features. The equations of the discrete Kalman filter can be viewed as time update equations and measurement update equations. The time update equations (4.5-4.6) project the current state and the error covariance estimate forward in time to obtain a priori estimates for the next time step. The measurement update equations (4.7-4.9) incorporate the observation into the a priori estimate to obtain an improved a posteriori estimate of the tracked state.

The 3D Kalman filter tracker assumes a constant velocity model. The state model used for tracking is summarised in equations 4.10-4.12. At each state update step the observation covariance $R_t$ is set according to the measurement uncertainty determined by projecting a nominal image covariance from the image to the 3D object space using the method discussed in section 3.5.2.

State Model

$$X_t = \begin{bmatrix} x & y & z & \dot{x} & \dot{y} & \dot{z} \end{bmatrix}^T \qquad (4.10)$$

where $[x \; y \; z]$ is the spatial location in world coordinates and $[\dot{x} \; \dot{y} \; \dot{z}]$ is the spatial velocity in world coordinates.

State Transition Model

$$\Phi = \begin{bmatrix}
1 & 0 & 0 & \Delta T & 0 & 0 \\
0 & 1 & 0 & 0 & \Delta T & 0 \\
0 & 0 & 1 & 0 & 0 & \Delta T \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \qquad (4.11)$$

where $\Delta T$ is the difference between the capture times of the current and previous image frames.


Observation Model

$$H = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix} \qquad (4.12)$$
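Equations 4.5-4.9, together with the constant velocity model of equations 4.10-4.12, translate directly into a short predict/update loop. The sketch below follows those equations; the process noise Q, the initial covariance and the synthetic observations are assumptions made purely for illustration.

```python
import numpy as np

def phi(dt):
    """State transition of equation (4.11) for the state [x y z xd yd zd]."""
    F = np.eye(6)
    F[0, 3] = F[1, 4] = F[2, 5] = dt
    return F

H = np.hstack([np.eye(3), np.zeros((3, 3))])   # observation model (4.12)

def predict(x, P, dt, Q):
    F = phi(dt)
    return F @ x, F @ P @ F.T + Q              # equations (4.5) and (4.6)

def update(x, P, z, R):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain (4.9)
    x = x + K @ (z - H @ x)                    # state update (4.7)
    P = (np.eye(6) - K @ H) @ P                # covariance update (4.8)
    return x, P

if __name__ == "__main__":
    # Track a synthetic object moving at constant velocity on the ground plane.
    x = np.array([0.0, 0.0, 900.0, 0.0, 0.0, 0.0])    # mm and mm/s
    P = np.eye(6) * 1e4
    Q = np.eye(6) * 1.0                               # assumed process noise
    dt = 0.2
    for k in range(1, 11):
        x, P = predict(x, P, dt, Q)
        z = np.array([1400.0 * k * dt, 350.0 * k * dt, 900.0])   # 3D observation
        R = np.eye(3) * 50.0    # would come from the projected image covariance
        x, P = update(x, P, z, R)
    print(np.round(x, 1))
```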

The Kalman filter allows the system to reliably track the object state even if an object disappears from the camera view for up to ten frames. This value was set based on empirical evidence.

4.4.1 3D Data Association


The 2D object tracks detected by the background subtraction are converted to a set of 3D observations using the 3D line intersection algorithm as discussed in chapter 3. Since the system is tracking multiple objects in 3D it is necessary to ensure that each observation is assigned to the correct object being tracked. This problem is generally referred to as the data association problem. The Mahalanobis distance provides a probabilistic solution to find the best matches between the predicted states of the tracked objects and the observations made by the system:

$$M_D = \left( H \hat{X}_t^{(-)} - Z_t \right)^T \left( H P_t^{(-)} H^T + R_t \right)^{-1} \left( H \hat{X}_t^{(-)} - Z_t \right) \qquad (4.13)$$

An appropriate threshold can be chosen for the Mahalanobis distance by selecting a value that gives a 95% confidence for a match, assuming a chi-square distribution. The dimension of the observations for the 3D tracker is 3, hence the threshold selected for $M_D$ was 7.81. A Mahalanobis distance table is created between each tracked object and observation. The system then assigns each observation to a tracked object based upon the size of the Mahalanobis distance.
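The gating and assignment step can be sketched as follows. The greedy nearest-neighbour assignment shown here is one simple way of consuming the sorted distance table, and the track states, noise values and observation coordinates are illustrative.

```python
import numpy as np

CHI2_GATE_3DOF = 7.81   # 95% gate for 3 degrees of freedom, as in the text

def mahalanobis_sq(x_pred, P_pred, z, H, R):
    """Squared Mahalanobis distance of equation (4.13)."""
    innovation = H @ x_pred - z
    S = H @ P_pred @ H.T + R
    return float(innovation @ np.linalg.inv(S) @ innovation)

def associate(tracks, observations, H, R, gate=CHI2_GATE_3DOF):
    """Greedy nearest-neighbour association: build the distance table,
    sort it, and assign each observation to at most one track."""
    table = [(mahalanobis_sq(x, P, z, H, R), ti, oi)
             for ti, (x, P) in enumerate(tracks)
             for oi, z in enumerate(observations)]
    table.sort()
    used_t, used_o, assignment = set(), set(), {}
    for d, ti, oi in table:
        if d > gate:
            break
        if ti in used_t or oi in used_o:
            continue
        used_t.add(ti)
        used_o.add(oi)
        assignment[ti] = oi
    return assignment

if __name__ == "__main__":
    H = np.hstack([np.eye(3), np.zeros((3, 3))])
    R = np.eye(3) * 50.0
    tracks = [(np.array([0., 0., 900., 100., 0., 0.]), np.eye(6) * 100.0),
              (np.array([5000., 2000., 900., 0., -80., 0.]), np.eye(6) * 100.0)]
    observations = [np.array([5010., 1985., 905.]), np.array([12., -4., 898.])]
    print(associate(tracks, observations, H, R))   # {0: 1, 1: 0}
```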


4.4.2 Outline of 3D Tracking Algorithm


The following is a summary of the steps used to update each tracked 3D object for a given image frame:

1. For each 3D observation in frame T, create a Mahalanobis distance table and sort by the distance measure. Threshold the values so that each Mahalanobis distance is below the gating threshold described in section 4.4.1.
2. For each existing tracked object:
   a. Select the observation which has the highest likelihood of being a match.
   b. Update the tracked object using this observation.
3. For each existing tracked object not matched to an observation:
   a. If the tracked object has not been matched in K frames then mark it as deleted, else
   b. Update the tracked object using the predicted state estimate.
4. For each unmatched observation:
   a. Create a new tracked object, using the observation to initialise the object state.
   b. Set the initial covariance of the object state.

The following two constraints are applied during the object state update process:
a) Each observation can be used to update only one existing tracked object.
b) A new tracked object can only be created when its initial observation does not match an existing tracked object.

The first constraint prevents an object being updated during a dynamic occlusion when the observation is not consistent with its predicted trajectory. The second constraint prevents new tracks being created at the end of a dynamic occlusion when the objects separate.
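A minimal sketch of the track maintenance loop (steps 2-4) is shown below. The Track fields, the limit of ten missed frames and the plain copy of the observation standing in for the Kalman update are all simplifications made for illustration.

```python
K_MAX_MISSED = 10   # assumed limit before a coasting track is deleted

class Track:
    def __init__(self, track_id, observation):
        self.track_id = track_id
        self.state = list(observation)        # state initialised from the observation
        self.missed = 0
        self.deleted = False

def maintain_tracks(tracks, observations, assignment, next_id):
    """assignment maps track index -> observation index (from data association);
    unmatched observations spawn new tracks."""
    matched_obs = set(assignment.values())
    for i, track in enumerate(tracks):
        if track.deleted:
            continue
        if i in assignment:                   # step 2: update with its observation
            track.state = list(observations[assignment[i]])
            track.missed = 0
        else:                                 # step 3: coast on the prediction
            track.missed += 1
            if track.missed > K_MAX_MISSED:
                track.deleted = True          # not matched for K frames: delete
    for j, obs in enumerate(observations):    # step 4: spawn new tracks
        if j not in matched_obs:
            tracks.append(Track(next_id, obs))
            next_id += 1
    return next_id

tracks = [Track(0, (0.0, 0.0, 900.0))]
next_id = maintain_tracks(tracks, [(120.0, 30.0, 900.0), (5000.0, 2000.0, 900.0)],
                          {0: 0}, next_id=1)
print([(t.track_id, t.state, t.missed) for t in tracks])
```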

4.5 Non-Overlapping Views

In a typical image surveillance network the cameras are usually organised so as to maximise the total field of coverage. As a consequence there can be several cameras in the surveillance network that are separated by a short temporal and spatial distance, or have minimal overlap. In these situations the system needs to track an object when it leaves the field of view of one camera and then re-enters the field of view of another after a short temporal delay. For short time durations of less than two seconds the trajectory prediction of the Kalman filter can be used to predict where the object should become visible again to the system [9]. However, if the object changes direction significantly or disappears for a longer time period this approach is unreliable. In order to handle these cases the system uses an object handover policy between each pair of non-overlapping cameras. The object handover policy attempts to resolve the handover of objects that move between non-overlapping camera views. The system waits for a new object to be created in the adjacent camera view. A data association method is applied to check the temporal constraints of the object's exit from and re-entry into the network field of view.

4.5.1 Entry and Exit Regions


In order to facilitate the object handover reasoning process a model of the major exit and entry regions is constructed for each pair of adjacent non-overlapping camera views. These models can be hand crafted or automatically learned by analysis of trajectory data stored in the surveillance database. The surveillance data can be accessed to retrieve the start and end of object trajectories. An algorithm can then be applied to construct a list of clustered regions, each modelled using a Gaussian distribution, to represent the major entry and exit regions of each camera view [69]. Since each object trajectory has an associated timestamp it is possible to identify the spatial links between an exit region in one camera view and an entry region in the adjacent camera view. The spatial links can be found by identifying a model that is most consistent with the spatial and temporal constraints of the object trajectory data [68]. These models of the entry and exit regions are used to improve the performance of the object handover reasoning. When an object is terminated within an exit region the system uses the exit and entry region models to determine the regions where the object is most likely to re-appear. There are two main benefits of using the model to facilitate the handover reasoning: it reduces the computational complexity, since attention is focused only on the major entry and exit regions where object handover is most likely to occur; and, even if the two cameras are calibrated in different world coordinates, the system can still track objects, since the model uses temporal properties to perform data association.
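The clustering step can be illustrated with a generic Gaussian mixture fit over trajectory endpoints. The sketch below uses scikit-learn's GaussianMixture as a stand-in for the learning method of [69]; the endpoint data and the number of components are synthetic assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Trajectory start/end points (image coordinates) retrieved from the database;
# here they are synthetic, clustered around two hypothetical doorway regions.
rng = np.random.default_rng(1)
endpoints = np.vstack([
    rng.normal(loc=(80, 420), scale=12, size=(200, 2)),    # region near a left gate
    rng.normal(loc=(610, 110), scale=15, size=(150, 2)),   # region near a far path
])

# Fit a mixture of Gaussians; each component is one candidate entry/exit zone,
# described by its centre mu(x, y) and spatial covariance Sigma (cf. eq. 4.14).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(endpoints)
for mu, sigma in zip(gmm.means_, gmm.covariances_):
    print("zone centre:", np.round(mu, 1), "covariance diag:", np.round(np.diag(sigma), 1))
```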

4.5.2 Object Handover Regions


The object handover region models consist of a linked entry and exit region along with the temporal delay between each region. The temporal delay can be determined manually by observation, or by generating statistics from the data stored in the database.


The temporal delay gives an indication of the transit time for the handover region for a specific object class, so the temporal delay for a pedestrian object class and a vehicle object class would be different, based on the set of observations used to generate the statistics. Each entry or exit region is modelled as a Gaussian:

$$\left( \mu(x, y), \Sigma \right) \qquad (4.14)$$

where $\mu(x, y)$ is the centre of the distribution in 2D image coordinates, and $\Sigma$ is the spatial covariance of the distribution. The following convention is used to describe the major entry and exit regions in each camera view:

$X_i^k$ is the kth exit region in the ith camera view.

$E_j^l$ is the lth entry region in the jth camera view.

Given the set of major exit and entry regions in each camera, the following convention is used to define the handover regions between the non-overlapping camera views:

$H_{ij}^p = \left( X_i^k, E_j^l, \Delta t, \sigma_t \right)$ is the pth handover region between the ith and jth camera views.

As previously discussed, each handover region $H_{ij}^p$ consists of a spatially connected exit and entry region pair $X_i^k, E_j^l$, along with the temporal delay and the variance of the temporal delay $(\Delta t, \sigma_t)$ between the exit and entry region. An example of object handover regions is visually depicted in figure 4.4. The black and white ellipses in each camera view represent the major entry and exit regions in each camera. The links represent the handover regions between each camera.


Figure 4.4 Handover regions for six cameras in the surveillance system.

4.5.3 Object Handover Agents


The object handover mechanism only needs to be activated when an object is terminated within an exit region that is linked to an entry region in the adjacent camera view. Once the object leaves the network field of view and is in transit between the non-overlapping views the system cannot reliably track the object.

4.5.3.1 Handover Initiation

The handover agent is activated when an object is terminated within an exit region $X_i^k$ that is included in the handover region list. The handover agent records the geometric location and the time when the object left the field of view of the ith camera. Allowing the object handover agent to be activated only when an object is terminated in a handover region eliminates the case where an object is prematurely terminated within the field of view due to tracking failure caused by a complex dynamic occlusion.


In addition, once the handover agent has been activated the handover region model can be used to determine the most likely regions where the object is expected to re-appear, hence reducing the computational cost of completing the handover process.

4.5.3.2 Handover Completion

The handover agent achieves completion when an object is created within the entry region $E_j^l$ that forms a handover region with the exit region $X_i^k$ where the object was terminated in the ith camera view. The handover agent's task is only complete if the new object satisfies the temporal constraints of the handover region: the new object must be consistent with the temporal delay of the handover region and the transit time between when the object left and re-appeared in the ith and jth camera views, respectively.

4.5.3.3 Handover Termination

The handover agent is terminated once an object has not been matched after a maximal temporal delay, which can be determined from the statistical properties of the handover regions related to the exit region where the object left the field of view. The maximal temporal delay in a handover region is an important characteristic, since the surveillance regions are not constrained in such a way that an object must re-appear in the field of view once it enters a handover region. It is possible that once the object has been terminated within an exit region it will not re-appear within the network field of view. When this case occurs it is not possible for the system to locate the object, since it will not be visible to any of the cameras in the surveillance network.

The framework used for tracking objects between non-overlapping views makes several assumptions. It is assumed that the temporal delay between the camera views is of the order of seconds for each object class. If the handover regions are located on the same ground plane and calibrated in the same world coordinate system then 3D trajectory prediction can be used to add another constraint to the data association between the handover object and candidate objects which appear in entry regions in the adjacent camera view. The 3D trajectory prediction is only valid if the object maintains the same velocity and does not significantly change direction once it has entered the handover region.
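The temporal side of the handover reasoning can be sketched as a simple acceptance test on the transit time. The field names, the 3-sigma window and the example delays below are assumptions for illustration, not the system's actual parameters.

```python
from dataclasses import dataclass

@dataclass
class HandoverRegion:
    """One linked exit/entry pair H_ij^p with its expected transit statistics
    (field names and the acceptance window are illustrative)."""
    exit_camera: int
    exit_zone: int
    entry_camera: int
    entry_zone: int
    mean_delay: float      # seconds
    std_delay: float       # seconds

def handover_matches(region, t_exit, t_entry, n_sigma=3.0):
    """Completion test: the new object must appear in the linked entry zone
    within mean_delay +/- n_sigma * std_delay of the exit time."""
    transit = t_entry - t_exit
    return abs(transit - region.mean_delay) <= n_sigma * region.std_delay

region = HandoverRegion(exit_camera=2, exit_zone=1, entry_camera=3, entry_zone=0,
                        mean_delay=4.5, std_delay=1.0)
print(handover_matches(region, t_exit=100.0, t_entry=105.2))   # True, plausible transit
print(handover_matches(region, t_exit=100.0, t_entry=130.0))   # False, agent times out
```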


4.6 Experiments and Evaluation

4.6.1 Object Tracking Using Overlapping Cameras


The system has been tested using a variety of video sequences. The PETS2001 datasets were used to evaluate the performance of the tracking algorithm using two widely separated overlapping views. The PETS2001 datasets are a standard set of video sequences that have been made available to the machine vision community for evaluating tracking algorithms, as was described in chapter 3. Initially, the training datasets were used to recover the homography mapping between the two overlapping camera views using the method discussed in section 3.4. Each PETS2001 dataset contains five co-planar landmark points whose positions are known in 3D, allowing Tsai's camera calibration to be employed [94], as was discussed in section 3.2. Given the homography calibration and the 3D camera calibration it is possible to apply the tracking algorithm described in section 4.4.

4.6.1.1 Resolving Object Occlusions

The PETS2001 datasets were used to perform a number of experiments to determine how effectively the tracking algorithm could resolve dynamic and static occlusions. Figure 4.5 shows how dynamic occlusions can be resolved in 3D when tracking objects between two camera views. The left image shows a pedestrian being occluded by a vehicle, while the right image shows a group of pedestrians being occluded by a white van. The 3D ground plane map shows that the objects can still be tracked in 3D during the dynamic occlusions that occur in both camera views. In figure 4.6 the car is partially occluded by the tree. The tree causes the object to be split into two separate parts, but the 3D tracker still correctly assigns a single label to both segments. Figure 4.7 shows a more complex tracking scenario where three objects form a static occlusion with the tree in the left image. The cyclist then forms two dynamic occlusions by overtaking the two pedestrians. The top row and bottom row of images show the tracked objects before and after the static and dynamic object occlusions. The correct tracking object labels are still assigned after the object interactions. The ground plane map in figure 4.8 shows the tracked 3D trajectory of each of the objects during the dynamic and static occlusions. The ground plane map was constructed by projecting the pixels in each camera view onto the ground plane.


In the tracking error plots of figure 4.9 the events S1, S2, and S3 indicate when the tree occludes the first pedestrian, second pedestrian, and cyclist, respectively, in the left hand camera view. The two events D1 and D2 indicate when the cyclist overtakes the pedestrians resulting in two dynamic occlusions. It can be observed that the 3D tracking error does not degrade significantly during each type of occlusion. The ability of the 3D tracker to resolve these types of object interactions demonstrates how effective multi view tracking can be compared to single view tracking.

Figure 4.5 Examples of handling dynamic occlusion, frames 601 and 891 in data sequence one.


Figure 4.6 Tracker output for frame 1366 in data sequence two. The tree splits the tracked object but the 3D tracker still correctly assigns a single object label.

Figure 4.7 Objects during static occlusion (top image), and dynamic occlusion (bottom image).



Figure 4.8 An example of resolving both dynamic and static occlusions

[Plot: 3D tracking error (mm) between the tracked state and observation for tracks 1, 3 and 5 against time (seconds); the static occlusion events S1, S2, S3 and dynamic occlusion events D1, D2 are marked.]

Figure 4.9 Tracking error during dynamic and static object occlusions.


The 3D tracker has also been tested using a video sequence containing three camera views. The first camera was a 640x480 monochrome security surveillance camera, situated outdoors. The second and third cameras were 640x480 colour cameras, both located indoors viewing the surveillance region through a window. The average frame processing rates were 5.6, 5.4, and 5.4 fps for cameras one, two, and three respectively. The frame processing rate is dependent upon the frame grabber hardware and software as well as the processor speed. Since each camera has a different frame processing rate there is a temporal drift between each camera, so it is necessary to synchronise the cameras before attempting to track the moving objects in 3D. Each image frame has an associated timestamp, so it was possible to apply the temporal alignment method described in section 3.6.4.

Figure 4.10 Example of object tracking using three overlapping camera views. The 3D trajectories are visualised on a ground plane map.


[Plot: 3D tracking error (mm) against frame number for two tracks in the three-camera sequence; the events A and B mark the start and end of the dynamic occlusion.]

Figure 4.11 Plot of 3D tracking error for tracks 1 and 3 in the three-camera video sequence.

The algorithm is also effective in handling dynamic occlusions using three camera views, as is illustrated in figure 4.10. Three objects undergo dynamic occlusion between frames 3664 and 3679, as indicated by the events A and B in the plot of Figure 4.11. At the end of the dynamic occlusion the correct object labels are still assigned. In image frame 3669 the objects are occluded in camera views two and three but can be distinguished in camera view one. The wide baseline of the camera views is what allows the algorithm to resolve dynamic occlusions. The 3D tracking errors of two selected tracks are shown in figure 4.11. A dynamic occlusion occurs between the two objects during image frames 3666 to 3677. It can be observed that the tracking error does not significantly increase during the period of occlusion, as was the case with the PETS2001 datasets. The tracking error tends to increase when the tracked object initially appears in the field of view of one of the cameras. The correct labels are assigned to each tracked object during their movement through the surveillance region. Errors in label assignment tend to occur when the tracked object is initially detected at a large distance from the cameras, since the moving object may not be reliably detected for several frames. Due to the adaptive model used by the background subtraction method, objects are absorbed into the background model when stationary for more than 5 seconds. A new object label is assigned if the object begins to move again. Hence, the object is tracked while moving within the surveillance region, but it is possible that a new label will be assigned if it stops moving for several seconds.


4.6.2 Object Tracking Between Widely Separated Views


The purpose of this experiment was to determine the reliability of the multi view tracker for coordinating tracking between widely separated camera views, which are non-overlapping or have limited overlap. Ground truth was manually generated for a 30 minute video sequence that comprised 10,000 image frames for each camera view. There were six cameras in the captured video sequence. The ground truth defines the 2D track identification for the same object moving between different camera views. The multi view tracking results and the ground truth were compared to determine if the correct track identity was preserved when objects move between each of the camera views. In total 134 ground-truth objects were manually selected from the multi view video sequence. Using 3D tracking alone the multi view tracker had an 80% success rate in preserving an object's identity when it moves between adjacent non-overlapping views. The object handover failures were due to poor track initialisation by the 2D tracker, and to the size of objects such as large vehicles whose position on the ground plane could not be reliably estimated, resulting in tracking failure. In addition, when the transition times are greater than a few seconds or the object changes direction during the transit period between the camera views, the 3D trajectory prediction used for handover reasoning is less reliable. The results of this experiment illustrate that the 3D trajectory prediction of the Kalman filter is not always sufficient for tracking objects between non-overlapping camera views. In these instances it is possible to improve the tracking performance by using the object handover policy discussed in section 4.5. The same multi view video sequence was run again using the handover region policy. As a consequence the object handover success rate of the multi view tracker increased to 87%. Figure 4.12 shows an example where object handover fails when a large bus moves between two camera views. Due to the size of the object it is visible in both camera views and is incorrectly identified as two separate objects, which resulted in object handover failure. In Figure 4.13 examples are shown of object tracking between adjacent camera views. In both of the figures the object tracks are plotted every five frames, so it is possible to visualise the motion of the object in each camera view.


Figure 4.12 Example of object handover failure due to the size of an object

Figure 4.13 Example of object tracking between adjacent non-overlapping views


4.7 Summary

In this chapter a method has been presented for tracking objects between multiple camera views, which have been calibrated using a set of known 3D landmark features. For overlapping camera views the homography is used to match features visible in several cameras. The homography computed using the LQS approach described in chapter 3 is preferred over feature correspondence using the 3D calibration information. The justification for this is that the homography can be automatically recovered from a set of training data, which depends upon the accuracy of the 2D tracker and not on the camera calibration information, which may not be accurate in all regions of the camera view. Once image features are matched between overlapping views it is possible to generate 3D observations using the least squares estimation technique presented in chapter 3, along with the associated measurement uncertainty. These measurements are used in a Kalman filtering framework for tracking each object's position in 3D. The algorithm was shown to be robust in resolving both dynamic and static object occlusions for a variety of video sequences, and in coordinating the tracking of objects between a network of six cameras in an outdoor environment.

One problem with using the homography transformation for feature matching is that the system may classify individual objects in close proximity (less than one metre) as a group. As a consequence it would not be possible to track the activity of individuals within the group, or scenes containing a large density of objects under these conditions. In addition, for complex dynamic occlusions where objects interact for long periods, or significantly change speed and direction during the occlusion, the Kalman filter tracking becomes less reliable. This is due to a limitation of the Kalman filter in that it cannot handle multiple hypotheses. It would be possible to resolve this deficiency by employing a multiple hypothesis tracker framework, which could robustly handle multiple object states during the occlusion [78], or by using colour cues to match objects based upon appearance once the occlusion has ended [16,17,71,79,81]. However, for the environment where the system has been applied the Kalman filter was fit for purpose for robust real-time tracking.

For non-overlapping camera views separated by a short temporal distance (less than two seconds) it was possible to use 3D trajectory prediction to retain an object's identity when it temporarily leaves and re-enters the network field of view [9,28]. For longer temporal delays greater than two seconds the Kalman filter prediction is less reliable, since information about the object's location is more uncertain, and the object can change its direction or speed during the transition period.


For these cases the system uses an object handover policy between spatially linked exit and entry zones between pairs of adjacent cameras [4]. Spatio-temporal cues are used to retain the object's identity when it moves between a pair of non-overlapping camera views. The experiments demonstrated that this approach improved the system performance compared to using 3D trajectory prediction alone.


5 System Architecture
5.1 Background
The objective of this chapter is to describe the system architecture of the surveillance system implemented to support the research presented in this thesis. This was a necessary requirement, since the system needed to run continuously over extended periods of time in order to capture surveillance data. It was not feasible to store raw video, because the surveillance system contained six cameras and this would generate terabytes of storage for each twenty-four hour period. Each key component of the surveillance architecture is discussed along with the methods used by each sub-process to exchange information.

The surveillance system comprises a set of intelligent camera units (ICU) that utilise vision algorithms for detecting and robustly tracking moving objects in 2D image coordinates. It is assumed that the viewpoint of each ICU is fixed and has been calibrated using a set of known 3D landmark points. Each ICU communicates with a central multi view tracking server (MTS), which integrates all the information received in order to generate global 3D tracking information. Each individual object, along with its associated tracking details, is stored in a central surveillance database. The surveillance database also enables offline learning and subsequent data analysis to be performed. In addition, given the query and retrieval properties of the surveillance database it is possible to generate phantom based video sequences that can be used for performance evaluation of object tracking algorithms.

The surveillance system employs a centralised control strategy as shown in figure 5.1. The multi view tracking server (MTS) creates separate receiver threads (RT) to process the data transmitted by each intelligent camera unit (ICU) connected to the surveillance network. Each ICU transmits tracking data to each RT in the form of symbolic packets. The system uses TCP/IP sockets to exchange data between each ICU and RT. Once the object tracking information has been received it is loaded into the surveillance database, where it can be accessed for subsequent online or offline processing. Each ICU and RT exchanges data using the following four message types:


Background Image Message

This message format is used to transmit the background image from the ICU to the MTS during the initial start-up of the camera. The MTS uses the background image to visualise the tracking activity in 2D. The background image can be periodically refreshed to reflect changes in the camera viewpoint.

Timestamp Message

This message is used to transmit the timestamp of the current image frame processed by the ICU to the MTS. The MTS uses the recorded timestamps to perform temporal alignment.

[Diagram: the intelligent camera network (ICU1 ... ICUN) is connected over a LAN, with an NTP daemon keeping the camera clocks synchronised, to the multi view tracking server, which runs one receiver thread (RT1 ... RTN) per camera and performs temporal alignment, viewpoint integration and 3D tracking. Tracking data is stored in the PostgreSQL surveillance database, which also supports offline calibration and learning (homography alignment, handover reasoning, path learning, 3D world calibration), performance evaluation (PViGEN, online evaluation, surveillance metric reports), and video playback and review.]

Figure 5.1 System Architecture of the Image Surveillance Network of Cameras.


2D Object Track Message

This message is generated for each detected object in the current image frame. The message includes details of the object's location, bounding box dimensions, and normalised colour components.

2D Object Framelet Message

Each detected object in the current frame is extracted from the image and transmitted to the MTS. The framelet is then stored in the surveillance database and can be plotted on the background image to visualise the activity within the field of view of the ICU.
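As an illustration only, a 2D object track message could be serialised as a length-prefixed packet along the following lines. The JSON wire format, field names and port are hypothetical and do not describe the actual symbolic packet layout used by the ICUs.

```python
import json
import socket
import struct

def encode_track_message(camera_id, frame, track_id, bbox, centroid, rgb):
    """Pack one '2D Object Track' message as a length-prefixed JSON payload
    (an illustrative wire format, not the system's real protocol)."""
    payload = json.dumps({
        "type": "track2d", "camera": camera_id, "frame": frame,
        "track": track_id, "bbox": bbox, "centroid": centroid, "rgb": rgb,
    }).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload

def send_message(host, port, message):
    # A receiver thread (RT) on the multi view tracking server would accept
    # this connection and unpack the length prefix before parsing the JSON.
    with socket.create_connection((host, port), timeout=1.0) as sock:
        sock.sendall(message)

msg = encode_track_message(camera_id=1, frame=601, track_id=42,
                           bbox=[310, 215, 24, 58], centroid=[322.0, 244.0],
                           rgb=[0.41, 0.38, 0.33])
print(len(msg), "bytes")
# send_message("mts.example.local", 5000, msg)   # requires a running server
```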

5.2 Intelligent Camera Network

The surveillance system uses a network of ICUs that are located at various positions within the surveillance region. Each ICU uses an adaptive background modelling algorithm to detect possible moving objects. A partial observation algorithm is used to track each detected object's location in 2D image coordinates. After each image frame is processed the ICU transmits the object information to the MTS using the four message formats described in section 5.1. During live operation the ICUs can robustly track objects in scenes that undergo changes in illumination. The operating speed of each camera is typically between 5-10 Hz, depending on the level of activity within the field of view of the camera.

5.3 Multi View Tracking Server (MTS)

The MTS forms one of the core components of the work discussed in the remainder of this thesis. The MTS receives object tracking data from each ICU and integrates the information for tracking in 3D. Temporal alignment is performed online to synchronise the object tracks received from each ICU. For overlapping camera views the homography constraint is used to correspond objects visible in each viewpoint. A 3D Kalman filter is then employed for object tracking and trajectory prediction.


5.3.1 Temporal Alignment


In chapter 3 we discussed how temporal calibration could be performed using robust homography estimation for various time offsets between the sets of input data. This approach is valid for pre-recorded video but cannot reliably be applied in a live application, since the temporal calibration would have to be performed continuously to account for the slight variations of each camera's internal clock during continuous operation. In addition, some of the cameras are non-overlapping, so it would not be possible to use the homography estimation method. In a typical image surveillance network each ICU acts as an independent process. Due to the wide separation of the camera views it is not feasible to use synchronisation signals, since the slowest camera connected to the network would solely determine the overall processing speed of the surveillance system. Each ICU uses an Ethernet connection to a local area network (LAN), hence the synchronisation signals could be delayed due to the network load on the LAN, as shown in Figure 5.1. Each ICU is allowed to free run at its normal operating speed, which is typically between 5-10 Hz as discussed in section 5.2. The frames captured by each ICU are unsynchronised but have an associated timestamp. The network time protocol (NTP) daemon is used to ensure that the internal clocks of each ICU are synchronised to a trusted time source. The NTP daemon can function across a LAN even when there are delays in transmitting synchronisation information. The internal clocks only have to be periodically updated, so even under these conditions the synchronisation is accurate to within a few milliseconds, which is sufficient for the surveillance system. This allows the MTS to perform temporal alignment once the tracking information has been received from each camera connected to the surveillance network. The following relation is used to perform temporal alignment during live data capture by the surveillance system:

$$\left| T_i^A - T_j^B \right| \rightarrow \min, \qquad \left| T_i^A - T_k^C \right| \rightarrow \min$$

where $T_p^S$ is the timestamp associated with the pth captured image frame of view S. An image frame is skipped if it is found that the timestamp associated with a camera has already been used. The temporal alignment process is graphically depicted in Figure 5.2: the timestamps of three cameras are plotted (blue lines) and the links (red lines) indicate the frames that are synchronised during temporal alignment.
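A minimal sketch of this nearest-timestamp pairing is given below; the timestamps are invented and the skipping rule is simplified to discarding a frame whose closest partner has already been consumed.

```python
import numpy as np

def align_frames(ts_ref, ts_other):
    """For each reference timestamp pick the closest frame from the other
    camera, skipping a pairing whose partner frame has already been used."""
    pairs, used = [], set()
    for i, t in enumerate(ts_ref):
        j = int(np.argmin(np.abs(np.asarray(ts_other) - t)))
        if j in used:
            continue            # that frame was already consumed: skip this one
        used.add(j)
        pairs.append((i, j, abs(ts_other[j] - t)))
    return pairs

# Camera A at roughly 5.5 Hz, camera C slightly faster, so some of C's frames
# are never selected (they are skipped, as in Figure 5.2).
ts_a = [2.80, 2.98, 3.16, 3.34, 3.52, 3.70]
ts_c = [2.78, 2.93, 3.08, 3.23, 3.38, 3.53, 3.68, 3.83]
print(align_frames(ts_a, ts_c))
```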


Camera A and Camera B are two colour cameras overlooking Northampton Square, which is located near to the City University entrance. Camera C is a monochrome security camera that is adjacent to Northampton Square. It can be observed that frames are occasionally skipped in camera C, but this is expected, since its average processing speed is faster than that of the other two cameras.

[Plot: the frame timestamps of Camera A, Camera B and Camera C between approximately 2.8 and 3.8 seconds, with links indicating the frames paired during temporal alignment.]

Figure 5.2 Graphical illustration of the temporal alignment process.

5.3.2 Viewpoint Integration


For overlapping camera views the homography mapping is used to match 2D objects between each viewpoint. The transfer error, which represents the re-projection error of the homography transformation, is used to derive a list of corresponded 2D object tracks. The transfer error is evaluated for each combination of pairs of objects in each camera view. An appropriate threshold is applied to identify matched objects between each viewpoint. A transitive relationship is used to derive object matches between three or more overlapping camera views as was discussed in chapter 4. The homography transformation has the benefit over epipole based matching in that a 2D line search is not required to determine the correspondence between objects. However, the homography matching assumes all the objects are constrained to move on the same ground plane, which may not be the case in some scenes.


5.3.3 3D Tracking
Once the object tracking information is received from each ICU the MTS performs temporal alignment in order to synchronise the image frames, as discussed in section 5.3.1. For overlapping views, objects in each camera are then corresponded using the homography constraint. The calibrated camera parameters for each view are then used to estimate the 3D location of each detected object. A 3D line intersection algorithm is used to estimate the location of the object in world coordinates using a least squares estimation approach, as was demonstrated in chapter 3. The 3D line intersection provides a more accurate measurement of an object's location than can be achieved with a single camera. A constant velocity 3D Kalman filter is then employed to track each object in world coordinates. The 3D Kalman filter is an effective tool for object tracking and trajectory prediction. The main benefit of tracking in 3D is that for widely separated overlapping camera views the system can robustly track objects through dynamic and static object occlusions. In addition, the system assigns only a single object identifier, even when the object is visible in several camera views simultaneously. The 3D trajectories can also be plotted on a 3D ground plane map to visualise the object activity in real-time. When tracked objects move between non-overlapping camera viewpoints the system uses short-term trajectory prediction to estimate when the object should re-appear in the adjacent camera view once it has left the field of view of the other camera. This approach assumes the temporal separation between the cameras is less than two seconds in duration, the cameras are calibrated in the same world coordinate system, and the object does not significantly change direction during the transit period between the two views. The surveillance system also uses a scene model of the major entry and exit regions in each camera view to facilitate the reasoning for object handover between non-overlapping views for cameras separated by longer temporal durations. This model also describes the spatial links between the exit and entry regions of each pair of non-overlapping camera viewpoints.

5.4 Offline Calibration/Learning

Each ICU is calibrated using a set of known landmark points. The camera calibration allows the 2D observations to be mapped to 3D world coordinates that can be used by the MTS for object tracking.


The system also performs offline processing to recover the homography relations between each pair of overlapping camera views, in order to facilitate the multi view tracking. As previously discussed, the homography assumes the ground plane constraint and that the cameras have a certain degree of overlap. Using a Least Quantile of Squares (LQS) search it is possible to recover a set of correspondence points that can be used to compute the homography mapping between each pair of overlapping views. The LQS method performs an iterative search of a solution space by selecting a minimal set of correspondence points to compute the homography mapping. The solution found to be the most consistent with the selected object tracks is taken as the final solution.

The data stored in the surveillance database has also been used to learn spatial models of the scene. This part of the surveillance system is outside the scope of this thesis, but what follows is a summary of the key functions performed by this component. Based on the analysis of object trajectory data it is possible to automatically learn probabilistic spatial models, which can be used to characterise object activity and behaviour. A geometric model and low-level Hidden Markov Model (HMM) can then be combined to form a statistical framework for the analysis of pedestrian behaviour. This allows atypical behaviour to be detected within the scene. Given the number of object trajectories stored in the database it is also possible to identify the major entry and exit regions in each camera view. The entry and exit regions provide an additional aid for tracking objects between non-overlapping views. Given the model of the entry and exit regions, along with the spatial links between each exit and entry region between non-overlapping views, the reliability of the object handover is improved. The work presented by Makris and Ellis [68,69,70] can be reviewed for a more in-depth discussion of the spatial model learning process. The integration of the surveillance database and probabilistic spatial models allows the system to perform visual queries to identify certain types of object behaviour. Some examples of visual queries will be demonstrated in section 5.6 of this chapter.

5.5 Surveillance Database Design

The key design consideration for the surveillance database was that it should be possible to support a range of low-level and high-level queries. At the lowest level it is necessary to access the raw video data in order to observe some object activity recorded by the system. At the highest level a user would execute database queries to identify various types of object activity observed by the system. In order to address each of these requirements we decided to use a multi-layered database design, where each layer represents a different abstraction of the original video data. The surveillance database comprises three layers of abstraction:


Image framelet layer
Object motion layer
Semantic description layer

This three-layer hierarchy supports requirements ranging from real-time capture and storage of detected moving objects at the lowest level to the online query of activity analysis at the highest level. Computer vision algorithms are employed to automatically acquire the information at each level of abstraction. The physical database is implemented using PostgreSQL running on a Linux server, which provides an efficient mechanism for real-time storage of each object detected by the surveillance system. In addition to providing fast indexing and retrieval of data, the surveillance database can be customised to offer remote access via a graphical user interface and also to log each query submitted by each user.

5.5.1 Image Framelet Layer


The image framelet layer is the lowest level of representation of the raw pixels identified as a moving object by each camera in the surveillance network. Each camera view is fixed and background subtraction is employed to detect moving objects of interest [97]. The raw image pixels identified as foreground objects are transmitted via a TCP/IP socket connection to the surveillance database for storage. This MPEG-4 [15] like coding strategy enables considerable savings in disk space, and allows efficient management of the video data. Typically, twenty-four hours of video data from six cameras can be condensed into only a few gigabytes of data. This compares to an uncompressed volume of approximately 4 terabytes for one day of video data in the current format we are using, representing a compression ratio of more than 1000:1. In Figure 5.3 an example is shown of some objects stored in the image framelet layer. The images show the motion of two pedestrians as they move through the field of view of the camera. Information stored in the image framelet layer can be used to reconstruct the video sequence by plotting the framelets onto a background image.


Figure 5.3 Example of objects stored in the image framelet layer.

Field Name    Description
Camera        The camera view
Videoseq      The identification of the captured video sequence
Frame         The frame where the object was detected
Trackid       The track number of the detected object
Bounding_box  The bounding box describing the region where the object was detected
Data          A reference to the raw image pixels of the detected object

Table 5.1 Attributes stored in image framelet layer.

The main attributes stored in the image framelet layer are described in Table 5.1. An entry in the image framelet layer is created for each object detected by the system. It should be noted that additional information, such as the time when the object was detected, is stored in other underlying database tables, which are described in Appendix C. The raw image pixels associated with each detected object are stored internally in the database. The PostgreSQL database compresses the framelet data, which has the benefit of conserving disk space.
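A cut-down sketch of the framelet table is shown below. It uses the fields of Table 5.1 but is written against SQLite purely so that it runs standalone; the column types and example values are assumptions, and the real PostgreSQL schema described in Appendix C differs.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL surveillance database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE image_framelet (
        camera       INTEGER,     -- the camera view
        videoseq     INTEGER,     -- identification of the captured video sequence
        frame        INTEGER,     -- frame where the object was detected
        trackid      INTEGER,     -- track number of the detected object
        bounding_box TEXT,        -- region where the object was detected
        data         BLOB         -- raw image pixels of the detected object
    )
""")
conn.execute(
    "INSERT INTO image_framelet VALUES (?, ?, ?, ?, ?, ?)",
    (1, 12, 601, 42, "310,215,24,58", sqlite3.Binary(b"\x00" * 16)),
)
rows = conn.execute(
    "SELECT camera, frame, trackid FROM image_framelet WHERE trackid = 42"
).fetchall()
print(rows)
```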

5.5.2 Object Motion Layer


The object motion layer is the second level in the hierarchy of abstraction. Each intelligent camera in the surveillance network employs a robust 2D tracking algorithm to record an object's movement within the field of view of each camera [98]. Features are extracted from each object including: bounding box, normalized colour components, object centroid, and the object pixel velocity.


Information is integrated between cameras in the surveillance network by employing a 3D multi view object tracker [4,8,9], which tracks objects between partially overlapping and non-overlapping camera views separated by a short spatial distance. Objects in overlapping views are matched using the ground plane constraint. A first order 3D Kalman filter is used to track the location and dynamic properties of each moving object, as was discussed in chapter 4. The 2D and 3D object tracking results are stored in the object motion layer of the surveillance database. The object motion layer can be accessed to execute offline learning processes that can augment the object tracking process. For example, a set of 2D object trajectories can be used to automatically recover the homography relations between each pair of overlapping cameras, as was discussed in chapter 3. The multi view object tracker robustly matches objects between overlapping views by using these homography relations. The object motion and image framelet layers can also be combined in order to review the quality of the object tracking in both 2D and 3D. The key attributes stored in the object motion layer are described in Table 5.2 and Table 5.3. In Figure 5.4 results from both the 2D tracker and the multi-view object tracker are illustrated. The six images represent the viewpoints of each camera in the surveillance network. Cameras 1 and 2, 3 and 4, and 5 and 6 have partially overlapping fields of view. It can be observed that the multi-view tracker has assigned the same identity to each object. Figure 5.5 shows the field of view of each camera plotted onto a common ground plane generated from a landmark-based camera calibration. 3D motion trajectories are also plotted on this map in order to allow the object activity to be visualized over the entire surveillance region.

Field Name     Description
Camera         The camera view
Videoseq       The identification of the captured video sequence
Frame          The frame where the object was detected
Trackid        The track number of the detected object
Bounding_box   The bounding box describing the tracked region of the object
Position       The 2D location of the object in the image
Appearance     The normalized colour components of the tracked object

Table 5.2 Attributes stored in the object motion layer (2D Tracker).


Field Name      Description
Multivideoseq   The identification of the captured multi view video sequence
Frame           The frame where the object was detected
Trackid         The track number of the detected object
Position        The 3D location of the tracked object in ground plane coordinates
Velocity        The velocity of the object

Table 5.3 Attributes stored in the object motion layer (Multi View Tracker).

Figure 5.4. Camera network on University campus showing 6 cameras distributed around the building, numbered 1-6 from top left to bottom right, raster-scan fashion.


Figure 5.5. Re-projection of the camera views from Figure 5.4 onto a common ground plane, showing tracked object trajectories plotted into the views (white, red, blue and green trails).

5.5.3 Semantic Description Layer


The object motion layer provides input to a machine-learning algorithm that automatically learns a semantic scene model, which contains both spatial and probabilistic information [69,70]. Regions of activity can be labelled in each camera view, for example entry/exit zones, paths, routes and junctions. These models can also be projected onto the ground plane, as illustrated in Figure 5.6. The paths were constructed using 3D object trajectories stored in the object motion layer. The green lines represent the hidden paths between cameras; these are automatically defined by linking entry and exit regions between adjacent non-overlapping camera views [68]. These semantic models enable high-level queries to be submitted to the database in order to detect various types of object activity. For example, spatial queries can be generated to identify any objects that have followed a specific path between an entry and exit zone in the scene model. This allows any object trajectory to be compactly expressed in terms of the routes and paths stored in the semantic description layer.


Figure 5.6. Re-projection of routes onto ground plane

Field Name   Description
Camera       The camera view of the entry or exit zone
Zoneid       The identification of the entry or exit zone
Position     The 2D centroid of the entry or exit zone
Cov          The covariance of the entry or exit zone
Poly_zone    A polygonal approximation of the entry or exit zone

Table 5.4 Attributes stored in the semantic description layer (entry/exit zones).

Field Name   Description
Camera       The camera view of the route
Routeid      The identification of the route
Nodes        The number of nodes in the route
Poly_zone    A polygonal approximation of the envelope of the route

Table 5.5 Attributes stored in the semantic description layer (routes).


Field Name       Description
Camera           The camera view of the route node
Routeid          The identification of the route
Nodeid           The identification of the route node
Position         The central 2D position of the route node
Position_left    The left 2D position of the route node
Position_right   The right 2D position of the route node
Stddev           The standard deviation of the Gaussian distribution of object trajectories observed at the route node
Poly_zone        A polygonal representation of the region between this route node and its successor

Table 5.6 Attributes stored in the semantic description layer (route nodes).

The main properties stored in the semantic description layer are described in Table 5.4, Table 5.5 and Table 5.6. Each entry and exit zone is approximated by a polygon that represents the covariance of the region. Using this internal representation in the database simplifies the spatial queries that determine when an object enters an entry or exit zone. The polygonal representation is also used to approximate the envelope of each route and route node, which reduces the complexity of the queries required for online route classification, as will be demonstrated in the next section. An example of the routes, route nodes, and entry and exit regions is shown in Figure 5.7. The black and white ellipses indicate entry and exit zones, respectively. Each route is represented by a sequence of nodes, where the blue points represent the main axis of each route, and the red points define the envelope of each route.


Figure 5.7. Example of routes, entry and exit zones stored in semantic description layer


5.6 Metadata Generation

Metadata is data that describes data. The multi-layered database allows the video content to be annotated using an abstract representation. The key benefit of the metadata is that it can be queried far more efficiently for high-level activity queries than the low-level data. It is possible to generate metadata online as detected objects are stored in the image framelet and object motion layers. In Figure 5.8 the data flow is shown from the input video data to the metadata generated online. Initially, the video data and object trajectory are stored in the image framelet and object motion layers. The object motion history is then expressed in terms of the model stored in the semantic description layer to produce a high-level compact summary of the object's history. The metadata contains information for each detected object including: entry point, exit point, time of activity, appearance features, and the route taken through the FOV. This information is tagged to each object detected by the system. The key properties of the generated metadata are summarised in Table 5.7 and Table 5.8. Each tracked object trajectory is represented internally in the database as a path geometric primitive, which facilitates online route classification.
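A minimal sketch of how an object_summary record (Table 5.7) could be derived from one tracked trajectory is given below. The trajectory format, the helper name and the use of a Python dictionary are illustrative assumptions rather than the system's actual implementation.

def make_object_summary(videoseq, trackid, trajectory):
    # trajectory is assumed to be a list of (time, x, y, (r, g)) observations
    # for one tracked object, ordered by frame time.
    times = [t for (t, x, y, rg) in trajectory]
    points = [(x, y) for (t, x, y, rg) in trajectory]
    mean_r = sum(rg[0] for (t, x, y, rg) in trajectory) / len(trajectory)
    mean_g = sum(rg[1] for (t, x, y, rg) in trajectory) / len(trajectory)
    return {
        "videoseq": videoseq,
        "trackid": trackid,
        "entrytime": min(times),
        "exittime": max(times),
        "entryposition": points[0],
        "exitposition": points[-1],
        "path": points,                  # would be stored as a PostgreSQL path primitive
        "appearance": (mean_r, mean_g),  # average normalised colour components
    }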

Field Name      Description
Videoseq        The identification of the captured video sequence in the image framelet layer
Trackid         The trackid of the object
EntryTime       The time when the object was first detected
ExitTime        The time when the object was last seen
EntryPosition   The 2D entry position of the object
ExitPosition    The 2D exit position of the object
Path            A sequence of points used to represent the object's 2D trajectory
Appearance      The average normalized colour components of the tracked object

Table 5.7 Metadata attributes generated (object_summary).

Field Name   Description
Videoseq     The identification of the captured video sequence in the image framelet layer
Trackid      The trackid of the object
Routeid      The identification of the route
EntryTime    The time the object entered the route
Entrynode    The first node the object entered along the route
EndTime      The time the object left the route
ExitNode     The last node the object entered along the route

Table 5.8 Metadata attributes generated (object_history).


5.7 Applications

5.7.1 Performance Evaluation


The system has been running daily for several months, capturing a high volume of surveillance data. During periods of low activity the system captures isolated object tracks along with individual framelets and stores them in the surveillance database. Using a set of online metrics it is possible to automatically select a set of ground truth tracks from the surveillance database. Once a list of isolated ground truth tracks has been selected, a Pseudo Synthetic Video Generator (PViGEN) is used to generate synthetic video sequences, which can be used for performance evaluation. It is assumed the camera view remains fixed for the duration of data capture and storage into the surveillance database. PViGEN can be applied to single fixed camera views, providing a means for evaluating the 2D tracking algorithms employed by the surveillance system. The performance evaluation framework is described in detail in chapter 6.

5.7.2 Visual Queries


Another application of the surveillance database is to support object activity related queries. The data stored in the surveillance database provides training data for machine learning processes that learn spatial probabilistic activity models in each camera view [69,70]. By integrating this information with the tracking data in the surveillance database it is possible to automatically annotate object trajectories. A high-level conceptual layout of the database that supports these types of queries is illustrated in Figure 5.8. The bottom, image framelet layer of the database contains the low-level pixel data of each moving object detected by a single camera view. This layer is used to support video playback of object activity at various time intervals. The second layer comprises the object tracking data that is captured by the single view tracking performed by each intelligent camera. The data consists of the tracked features of each object detected by the system. The tracked features stored as a result of single view tracking include: bounding box dimensions, object centroid, and the normalised colour components. Data is extracted from the object motion layer in order to learn spatial probabilistic models that can be used to analyse object activity in the scene. The semantic description of the scene allows the information in the object motion layer to be expressed in


terms of high-level metadata that can support various types of activity based queries. The query response times are reduced from several minutes to only a few seconds. An example of the results returned by an activity query is shown in Figure 5.9. The semantic description of the scene includes all the major entry and exit regions identified by the learning process, which are labelled on each image. The first example in Figure 5.9(a) shows a sample of the pedestrians moving between entry region B and exit region A. The second example in Figure 5.9(b) shows a sample of the pedestrians moving from entry region C to exit region B. The activity based queries are run using the metadata layer of the database, resulting in considerable savings in execution time compared to using the object motion or image framelet layers. Figure 5.10 illustrates how the database is used to perform route classification for two of the tracked object trajectories. The four routes stored in the semantic description layer of the database are shown in Figure 5.10(a). In this instance the first object trajectory is assigned to route 4, since this is the route with the largest number of intersecting nodes. The second object trajectory is assigned to route 1. The corresponding SQL query used to classify routes is shown in Figure 5.10(b). Each node along the route is modelled as a polygon primitive provided by the PostgreSQL database engine. The query counts the number of route nodes the object's trajectory intersects, which allows a level of discrimination between ambiguous choices for route classification. The ?# operator in the SQL statement is a logical operator that returns true if the object trajectory intersects with the polygon region of a route node. Additional processing of the query results allows the system to generate the history of the tracked object in terms of the route models stored in the semantic description layer. A summary of the information generated for the two displayed trajectories is given in Figure 5.10(c). It should be noted that if a tracked object traversed multiple routes during its lifetime then a separate entry would be created for each route visited.


Figure 5.8 Conceptual layout of the high-level surveillance database, showing the image framelet, object motion, semantic description and metadata layers, linked by the metadata generation process.


(a)

(b)
Figure 5.9 Visualisation of results returned by spatial temporal activity queries.

(a)

select routeid, count(nodeid)
from routenodes r, objects o
where camera = 2
  and o.trajectory ?# r.polyzone
  and o.videoseq = 87
  and o.trackid = 1
group by routeid

(b)

Videoseq   Trackid   Start Time   End Time   Route
87         1         08:16:16     08:16:27   4
87         3         08:16:31     08:16:53   1

(c)

Figure 5.10 Example of online route classification: (a) route models stored in the semantic description layer, (b) SQL query used for route classification, (c) summary of the route history generated for the two displayed trajectories.


5.8 Summary

This chapter has described the system architecture used by the work presented in this thesis for multi view image surveillance. The image surveillance network comprises a number of intelligent cameras that can robustly detect and track moving objects by using an adaptive background subtraction algorithm. Each intelligent camera unit transmits object-tracking data to a multi view tracking server, which is responsible for integrating all the information received from the network of cameras. All of the object-tracking data is stored in a central surveillance database that can be accessed for offline processing and analysis. The use of databases for surveillance applications is not completely new. The Spot prototype is an information access system that can answer interesting questions about video surveillance footage [55]. The system supports various activity queries by integrating a motion tracking algorithm and a natural language system. The generalized framework supports: event recognition, querying using a natural language, event summarization, and event monitoring. In [93] a collection of distributed databases was used for networked incident management of highway traffic. A semantic event/activity database was used to recognize various types of vehicle traffic events. The key distinction between these systems and the architecture presented in this chapter is that an MPEG-4-like strategy is used to encode the underlying video, and the semantic scene information is automatically learned using a set of offline processes [69,70]. The scope of this thesis is restricted to: the multi view tracking server, surveillance database, offline learning/calibration, and performance evaluation components shown in the system diagram in Figure 5.1. The intelligent camera units and path learning components are only shown for completeness and do not form part of the work presented in this thesis. The surveillance database is accessed by the PViGEN application to generate pseudo synthetic video sequences that can be used for quantitative performance evaluation. The surveillance data is also accessed to determine the homography relations and handover regions between each overlapping camera view.


6 Video Tracking Evaluation Framework


6.1 Background
One of the key objectives of this research identified in chapter one was to define a methodology that could be employed for quantitative performance evaluation of video tracking algorithms. The evaluation of video tracking algorithms presents a number of issues:

How can we acquire a large number of datasets with a quantifiable range of complexity?

How can we define ground truth for large datasets in a seamless manner?

What measures can be used to determine the complexity of a dataset, along with the quality of its associated ground truth?

What measures are appropriate for characterising tracking performance?

How can we measure the relationship between tracking performance and the complexity of the test datasets?

The first issue can be addressed by capturing a set of pre-recorded videos that represent a diverse range of object activity. However, we are still presented with the problem of how to measure the complexity of each captured video sequence. This could be determined by visual inspection, but such an approach is subjective and error prone. A partial solution exists for the second issue, since ground truth capture frameworks such as ViPER [26] and ODViS [49], discussed in chapter two, are publicly available. However, adopting this approach for several thousand frames of video data would still be time consuming. Some metrics are already available to address the third issue [26,27,29], although little work has been reported on measuring perceptual complexity, which is relatively straightforward once reliable ground truth data is available. Some metrics are available to resolve the fourth and fifth issues [26,77,88], but little work has been reported over very large datasets. For the remainder of this chapter we describe the generic framework that can be applied for quantitative video tracking performance evaluation, and discuss some of the issues that are faced when attempting to apply it in practical applications. The remainder of the chapter then focuses on the video tracking performance framework employed by this research.


The key novelty of this framework is that we use pseudo synthetic video data to construct test video sequences, which can be employed for performance evaluation. We choose to use pseudo synthetic video since it is possible to construct a diverse range of test datasets, exercising some degree of control over the perceptual complexity of each generated video sequence, allowing comprehensive sets of test data to be generated. Compiling ground truth with a semi-automatic tool over datasets spanning large volumes of video data (for example several hundred thousand image frames) would be very time consuming. An additional benefit of adopting this approach is that no manual or semi-automatic ground truth generation is required, which is one of the key obstacles to performing evaluation over very large datasets.

6.2 Performance Evaluation

The most common approach to evaluating the performance of a detection and tracking system uses ground truth to provide independent and objective data (e.g. classification, location, size) that can be related to the observations extracted from the video sequence. Manual ground truth is conventionally gathered by a human operator who uses a point and click user interface to step through a video sequence and select well-defined points for each moving object. The manual ground truth consists of a set of points that define the trajectory of each object in the video sequence (e.g. the object centroid). The human operator decides if objects should be tracked as individuals or classified as a group. The motion detection and tracking algorithm is then run on the pre-recorded video sequence, and the ground truth and tracking results are compared to assess tracking performance. The reliability of the video tracking algorithm can be associated with a number of criteria: the frequency and complexity of dynamic occlusions, the duration of targets behind static occlusions, the distinctiveness of the targets (e.g. if they are all different colours), and changes in illumination or weather conditions. We choose to base a measure of the perceptual complexity of a sequence on the occurrence and duration of dynamic occlusions, since these are the most likely cause of tracking failure. Such information can be estimated from the ground truth data by computing the ratio of the number of target occlusion frames to the total length of each target track (i.e. the number of frames over which it is observed), averaged over the sequence. A general framework for quantitative evaluation of a set of video tracking algorithms is shown in figure 6.1. Initially a set of video sequences must be captured in order to evaluate the tracking algorithms. Ideally, the video sequences should represent a diverse range of object tracking scenarios, which vary in perceptual complexity to provide an adequate test for the


tracking algorithms. Once the video data has been acquired, ground truth must then be generated to define the expected tracking results for each video sequence. Ground truth, as previously discussed, can consist of the derived trajectory of the centroid of each object along with other information such as the bounding box dimensions. Given the complete set of ground truth for the testing datasets, the tracking algorithms are then applied to each video sequence, resulting in a set of tracking results. The tracking results and the ground truth are then compared in order to measure the tracking performance using an appropriate set of surveillance metrics.

Figure 6.1 Generic framework for quantitative evaluation of a set of video tracking algorithms.

One of the main issues with the quantitative evaluation of video tracking algorithms is that it can be time consuming to acquire an appropriate set of video sequences for performance evaluation. This problem is being partially addressed by the Police Scientific Development Branch (PSDB), who are gathering data for a Video Test Image Library (VITAL) [2], which will represent a broad range of object tracking and surveillance scenarios encompassing: parked vehicle detection, intruder detection, abandoned baggage detection, doorway surveillance, and abandoned vehicles. However, at this point in time there is no automatic method available to measure the perceptual complexity of each video sequence, nor to


automatically capture ground truth for each dataset. An ideal solution for the generic quantitative evaluation framework depicted in figure 6.1 would fully automate the video data and ground truth acquisition steps. If these steps were fully automated it would be practical to evaluate tracking algorithms over very large volumes of test data, which would not be feasible with either manual or semi-automatic methods.

6.3 Pseudo Synthetic Video

As an alternative to the existing methods of video tracking performance evaluation we propose to use pseudo synthetic video sequences. One issue for quantitative tracking evaluation is that it is not trivial to collect adequate sets of test data and ground truth that vary with respect to a quantifiable measure of complexity. Ideally, we want to be able to run a large number of experiments that provide a comprehensive test of a tracking algorithm. The generation of pseudo synthetic video is dependent on ground truth tracks selected from a surveillance database. The structure of the surveillance database was described in detail in chapter 5. The surveillance database stores the object and tracking data observed by each camera in the surveillance network. The system can accumulate a large number of object tracks over a period of days or weeks. By selecting a list of high quality object tracks from the surveillance database, they can be inserted into pseudo synthetic video sequences.

6.3.1 Ground Truth Track Selection


Before we can generate pseudo synthetic video sequences it is necessary to select an appropriate set of ground truth tracks from the surveillance database. We prefer to use an approach that requires no supervision, since we expect the surveillance database to collect several hundred or even thousands of object tracks over a period of several hours or days, so it would not be feasible to manually review all the stored data. Ground truth tracks are selected during periods of low activity (e.g. over weekends), since there is a smaller likelihood of object interactions that can result in tracking errors. Since ground truth is not available for the tracking data captured online, we employ a set of online metrics to quantify the quality of the object tracking. Erdem, Tekalp, and Sankur [29,30,31] employed a similar strategy, but they focused on motion segmentation rather than object tracking. We use additional metrics that define both the smoothness of the derived tracked object trajectory and the consistency between the localisation of the tracked object and the measured foreground object regions. Each potential


ground truth track is checked for consistency with respect to path coherence, colour coherence, and shape coherence in order to automatically identify and remove tracks of poor quality.

6.3.1.1 Path Coherence

The path coherence metric [83,96,98] makes the assumption that the derived object trajectory should be smooth subject to direction and motion constraints. Measurements are penalised for low consistency of direction and speed, and rewarded in the converse situation. An alternative strategy would be to approximate the object trajectory by a spline and estimate the cumulative error of each trajectory point from the smoothed version. This approach also presents some problems, since it is necessary to select an appropriate set of control points to derive the smoothed object trajectory. In addition, a spline approximation may not provide a suitable reconstruction for objects that change direction significantly on several occasions.

pc = \frac{1}{N-2} \sum_{k=2}^{N-1} \left[ w_{1}\left(1 - \frac{\overrightarrow{X_{k-1}X_{k}} \cdot \overrightarrow{X_{k}X_{k+1}}}{\left\|\overrightarrow{X_{k-1}X_{k}}\right\| \left\|\overrightarrow{X_{k}X_{k+1}}\right\|}\right) + w_{2}\left(1 - \frac{2\sqrt{\left\|\overrightarrow{X_{k-1}X_{k}}\right\| \left\|\overrightarrow{X_{k}X_{k+1}}\right\|}}{\left\|\overrightarrow{X_{k-1}X_{k}}\right\| + \left\|\overrightarrow{X_{k}X_{k+1}}\right\|}\right) \right]    (6.1)

where \overrightarrow{X_{k-1}X_{k}} is the vector representing the positional shift of the tracked object between frames k-1 and k, and N is the number of frames in the track. The weighting factors can be appropriately assigned to define the contribution of the direction and speed components of the measure. The value of both weights was set to 0.5.
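A minimal Python sketch of equation 6.1 is given below, assuming the trajectory is supplied as a list of (x, y) centroids, one per frame; the function name and the handling of stationary points are illustrative choices.

import math

def path_coherence(points, w1=0.5, w2=0.5):
    # points is a list of (x, y) object centroids, one per tracked frame
    if len(points) < 3:
        return 0.0
    total = 0.0
    for k in range(1, len(points) - 1):
        # displacement vectors X_{k-1}X_k and X_k X_{k+1}
        d1 = (points[k][0] - points[k - 1][0], points[k][1] - points[k - 1][1])
        d2 = (points[k + 1][0] - points[k][0], points[k + 1][1] - points[k][1])
        n1, n2 = math.hypot(*d1), math.hypot(*d2)
        if n1 == 0 or n2 == 0:
            continue  # object stationary at this point; skip to avoid division by zero
        direction = 1.0 - (d1[0] * d2[0] + d1[1] * d2[1]) / (n1 * n2)
        speed = 1.0 - 2.0 * math.sqrt(n1 * n2) / (n1 + n2)
        total += w1 * direction + w2 * speed
    return total / (len(points) - 2)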

6.3.1.2 Colour Coherence

The colour coherence metric measures the average inter-frame histogram distance of a tracked object. It is assumed that the object histogram should remain constant between image frames. The normalised histogram is generated using the (r,g) colour space in order to account for small lighting variations. This metric has low values if the segmented object has similar colour attributes, and higher values when colour attributes are different. Each histogram contains 8x8 bins for the normalised colour components.

cc = \frac{1}{N-1} \sum_{k=2}^{N} \left( 1 - \sum_{u=1}^{M} \sqrt{p_{k-1}(u)\, p_{k}(u)} \right)    (6.2)


where p_{k}(u) is the normalised colour histogram of the tracked object at frame k, which has M bins, and N is the number of frames over which the object was tracked. This colour similarity measure is employed by several robust tracking algorithms [21,22,23,71,79].
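A sketch of equation 6.2 is shown below, assuming each histogram is a flat list of M = 64 normalised (r, g) bin values that sum to one; the names are illustrative.

import math

def colour_coherence(hists):
    # hists holds one normalised colour histogram (list of M bin values) per frame
    if len(hists) < 2:
        return 0.0
    total = 0.0
    for k in range(1, len(hists)):
        # one minus the Bhattacharyya coefficient between consecutive histograms
        bc = sum(math.sqrt(p * q) for p, q in zip(hists[k - 1], hists[k]))
        total += 1.0 - bc
    return total / (len(hists) - 1)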

6.3.1.3 Shape Coherence

The shape coherence metric gives an indication of the level of agreement between the tracked object position and the object foreground region. This metric will have a high value when the localisation of the tracked object is incorrect due to poor initialisation or an error in tracking. The value of the metric is computed by evaluating the symmetric shape difference between the bounding box of the foreground object and tracked object state.

sc = \frac{1}{N} \sum_{k=1}^{N} \frac{\left| R_{f}(k) \setminus R_{t}(k) \right| + \left| R_{t}(k) \setminus R_{f}(k) \right|}{\left| R_{t}(k) \cup R_{f}(k) \right|}    (6.3)

where R_{t}(k) \setminus R_{f}(k) represents the area of the bounding box of the tracked object (state) that does not overlap the foreground object (measurement), R_{f}(k) \setminus R_{t}(k) is the corresponding area of the foreground object not covered by the tracked object, and the normalisation factor R_{t}(k) \cup R_{f}(k) represents the area of the union of both bounding boxes.
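The measure of equation 6.3 can be sketched as follows for axis-aligned bounding boxes given as (x1, y1, x2, y2); the helper functions are illustrative and not part of the system.

def _area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def _intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def shape_coherence(tracked, detected):
    # tracked and detected hold the tracked-state and foreground-measurement
    # bounding boxes for the same sequence of frames
    if not tracked:
        return 0.0
    total = 0.0
    for rt, rf in zip(tracked, detected):
        inter = _area(_intersection(rt, rf))
        union = _area(rt) + _area(rf) - inter
        if union == 0:
            continue
        # symmetric area difference normalised by the area of the union
        total += ((_area(rf) - inter) + (_area(rt) - inter)) / union
    return total / len(tracked)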


Figure 6.2 Distribution of the ground truth metrics: (a) path coherence histogram, (b) colour coherence histogram, (c) shape coherence histogram.


6.3.1.4 Outlier Ground Truth Tracks

The path coherence, colour coherence, and shape coherence provide a set of metrics that can be used to rank the potential ground truth tracks stored in the surveillance database. It is possible to remove outlier ground truth tracks by applying an appropriate threshold to the values of pc, cc, and sc. Examples of the distributions of the path coherence, colour coherence, and shape coherence are shown in figures 6.2(a), (b) and (c) respectively. Each histogram was generated by evaluating each metric for a selection of tracks stored in the database over a period of twelve hours from a single camera view. It can be observed that each metric can be adequately approximated by a Gaussian distribution. We apply an upper limit to the value of each metric by setting a threshold equal to the mean value plus two standard deviations of the samples. This automatic setting of the thresholds allows the ground truth track selection process to be seamlessly applied to other surveillance data, which may have been captured with different cameras or under different weather conditions.

\hat{pc} = \overline{pc} + 2\sigma_{pc}    (6.4)
\hat{cc} = \overline{cc} + 2\sigma_{cc}    (6.5)
\hat{sc} = \overline{sc} + 2\sigma_{sc}    (6.6)

Given this set of thresholds, an object track in the surveillance database can be classified as an inlier or outlier using the following relations:

Inlier track:  (pc \le \hat{pc}) \wedge (cc \le \hat{cc}) \wedge (sc \le \hat{sc})
Outlier track: (pc > \hat{pc}) \vee (cc > \hat{cc}) \vee (sc > \hat{sc})
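A sketch of the threshold setting and inlier test is given below, assuming each candidate track is represented by a dictionary holding its pc, cc and sc values; the function and field names are assumptions.

import statistics

def select_inliers(tracks):
    # compute the mean-plus-two-standard-deviations threshold for each metric
    thresholds = {}
    for m in ("pc", "cc", "sc"):
        values = [t[m] for t in tracks]
        thresholds[m] = statistics.mean(values) + 2 * statistics.pstdev(values)
    # a track is an inlier only if all three metrics fall below their thresholds
    inliers = [t for t in tracks
               if all(t[m] <= thresholds[m] for m in ("pc", "cc", "sc"))]
    return inliers, thresholds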

We also exclude object tracks that are short in duration, or that have formed a dynamic occlusion with another track in the surveillance database. Tracks that are short in duration would not be suitable for inclusion in pseudo synthetic video, since they would normally represent tracks of poor quality. Tracks that have formed dynamic occlusions with another track are also not suitable for inclusion in pseudo synthetic video because it is not possible to distinguish


between each object during the dynamic occlusion. Hence, this would result in the object track containing ground truth that is not reliable during the grouping formed by the dynamic occlusion.

6.3.2 Pseudo Synthetic Video Generation


Once the ground truth tracks have been selected from the surveillance database they can be used to generate pseudo synthetic video sequences. Each pseudo synthetic video is constructed by replaying the ground truth tracks randomly in the generated video sequence. The one disadvantage of this approach is that the generated video sequences will be biased towards the motion detection algorithm used to capture the original ground truth tracks. In addition, few ground truth tracks will be observed in regions where tracking or detection performance is poor. However, the pseudo synthetic data is still effective for characterising tracking performance with respect to tracking correctness and dynamic occlusion reasoning, which is the main focus of the evaluation framework. A fundamental point of the method is that the framelets stored in the surveillance database consist of the foreground regions identified by the motion detection algorithm (i.e. within the bounding box). When the framelet is replayed in the pseudo synthetic video sequence this improves the realism of the dynamic occlusions. A number of steps are taken to construct each pseudo synthetic video sequence, since the simple insertion of ground truth tracks would not be sufficient to create a realistic video sequence.

Initially, a dynamic background video is captured for the same camera view from which the ground truth tracks have been selected. This allows the pseudo synthetic video to simulate small changes in illumination that typically occur in outdoor environments.

All the ground truth tracks are selected from a fixed camera view. This ensures that the object motion in the synthetic video sequence is consistent with the background video. In addition, since the ground truth tracks are constrained to move along the same paths in the camera view, this increases the likelihood of forming dynamic occlusions in the video sequences.

3D calibration information is used to ensure that framelets are plotted in the correct order during dynamic occlusions, according to their estimated depth from the camera. This gives the effect of an object occluding or being occluded by other objects based on its distance from the camera.


Some of the above concepts are illustrated in figure 6.3, where a dynamic occlusion is simulated in a synthetic video sequence. The top and bottom rows of images in figure 6.3 show the original and synthetic images respectively. In figure 6.3a the cyclist occludes a phantom pedestrian. In figure 6.3b a phantom vehicle is occluded by the cyclist and then occludes the two pedestrians. The calibration information was used in each case to plot the framelets in the correct order to simulate the dynamic occlusions. Pre-recorded video from the PETS2001 dataset was used to generate the example shown in figure 6.3. In this instance two ground truth tracks were overlaid onto the pre-recorded video sequence. Two ground truth tracks identified in the surveillance database are shown in the left and middle images of figure 6.4. The framelets of each ground truth track are plotted every five frames so that it is possible to visualise its motion history in the scene. When the two ground truth tracks are inserted into a pseudo synthetic video sequence it is possible to construct a dynamic occlusion, as shown in the right hand image of figure 6.4. Since we know the ground truth for each of the tracks it is possible to determine the exact time and duration of the dynamic occlusion. By adding more ground truth tracks to the generated video sequence it is possible to construct more complex object interactions.


(a)

(b)
Figure 6.3 Examples of how phantom objects can be used to form dynamic occlusions in synthetic video sequences.


Figure 6.4 Using ground truth tracks to simulate dynamic occlusions

6.4 Perceptual Complexity

Another key requirement of the performance evaluation framework identified in chapter two is that it should be possible to control and quantify the complexity of each generated pseudo synthetic video sequence. A system diagram of the Pseudo Synthetic Video Generator (PViGEN) is shown in figure 6.5. PViGEN, as previously discussed, takes several inputs: a background video sequence, the list of ground truth tracks, and 3D camera calibration data. The ground truth tracks are replayed on a pre-recorded background video sequence. The 3D camera calibration information is used to plot each ground truth track according to its depth with respect to the camera view, which allows realistic simulation of dynamic object occlusions. The perceptual complexity of each synthetic video sequence can be controlled by varying the values of two input parameters, Max Objects and p(new), defined as:

Max Objects (Max): The maximum number of objects that can be present in a single frame of the synthetic video sequence.

New object probability p(new): The probability of creating a new object in the video sequence, while the maximum number of objects has not been exceeded.

P(new) controls the frequency of the creation of new objects in the generated video sequence. New objects are created by randomly selecting a track from the input list of ground truth tracks. Only one new object can be created per image frame with probability p(new). New objects are not created if the number of active objects in the current frame equals the maximum


number of objects. These two parameters are used to vary the object activity in each generated video sequence. Increasing the values of p(new) and Max Objects results in an increase in object activity. This model provides a realistic simulation of real video sequences. In figure 6.6 the images demonstrate how the value of p(new) can be used to control the density of objects in the synthetic video sequence.
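The scheduling behaviour controlled by p(new) and max objects can be sketched as follows. This is not the actual PViGEN implementation: the track representation (a list of framelets), the function name and the omission of the compositing step are all simplifying assumptions.

import random

def schedule_tracks(ground_truth_tracks, n_frames, p_new=0.1, max_objects=20):
    active = []    # (track, start_frame) pairs currently being replayed
    schedule = []  # all scheduled insertions of ground truth tracks
    for frame in range(n_frames):
        # retire tracks whose framelets have all been replayed
        active = [(t, s) for (t, s) in active if frame - s < len(t)]
        # at most one new object per frame, created with probability p(new),
        # and only while the maximum number of objects has not been reached
        if len(active) < max_objects and random.random() < p_new:
            track = random.choice(ground_truth_tracks)
            active.append((track, frame))
            schedule.append((frame, track))
    return schedule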

Figure 6.5 System diagram showing the main inputs of PViGEN (p(new), max objects, background video, ground truth tracks, and 3D calibration) and its output (the pseudo synthetic video).

Figure 6.6 Perceptual complexity: left framelets plotted for p(new)=0.01, middle framelets plotted for p(new)=0.10, right framelets plotted for p(new)=0.20.


The images show examples for p(new) values of 0.01, 0.10, and 0.20 respectively (max objects = 20). In order to create a dynamic background, a video sequence of the camera view is recorded over several thousand frames, which contains no moving objects but exhibits some changes in illumination. The selected ground truth tracks are then overlaid on the pre-recorded background video to simulate a realistic video sequence. The perceptual complexity of the synthetic video sequence is controlled by varying the value of the p(new) parameter. Once the pseudo synthetic video sequence has been generated, the ground truth can be analysed to measure its complexity. The number of dynamic occlusions is determined by counting the number of occurrences where the bounding box of a ground truth object overlaps with another object in the same image frame. The following attributes are used to provide a measure of the perceptual complexity of each synthetic video sequence.

Number of Dynamic Occlusions (NDO): A count of the number of dynamic occlusions in the synthetic video sequence.

Number of Occluding Objects (NOO): The average number of interacting objects involved in each dynamic occlusion.

Duration of Dynamic Occlusion (DDO): The average duration of each dynamic occlusion (in frames) in the synthetic video sequence.

Objects per Frame (OPF): The average number of objects visible in each frame of the synthetic video sequence.

Object Track Length (OTL): The average duration (in frames) of each object trajectory that appears in the synthetic video sequence.

The p(new) parameter is effective for controlling the total number of objects appearing in each synthetic video. The values of NDO, NOO, DDO, and OTL are dependent on the ground truth tracks chosen by PViGEN during the random selection process.
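These complexity measures could be computed from the generated ground truth roughly as follows; the sketch counts each pair of overlapping tracks once over the sequence, which is a simplification of the frame-by-frame occlusion counting described above, and the data layout is an assumption.

def _overlaps(a, b):
    # axis-aligned bounding boxes (x1, y1, x2, y2)
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def occlusion_statistics(frames):
    # frames maps a frame number to a list of (trackid, bounding_box) entries
    events = {}  # (trackid_a, trackid_b) -> number of overlapping frames
    for objects in frames.values():
        for i in range(len(objects)):
            for j in range(i + 1, len(objects)):
                (ida, boxa), (idb, boxb) = objects[i], objects[j]
                if _overlaps(boxa, boxb):
                    key = tuple(sorted((ida, idb)))
                    events[key] = events.get(key, 0) + 1
    ndo = len(events)                                         # dynamic occlusions (distinct pairs)
    ddo = sum(events.values()) / ndo if ndo else 0.0          # average occlusion duration (frames)
    opf = sum(len(o) for o in frames.values()) / len(frames)  # average objects per frame
    return ndo, ddo, opf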


Figure 6.7 Perceptual complexity: (a) average number of objects per frame, and (b) average number of dynamic occlusions, plotted against p(new).


Figure 6.8 Plots of the number of objects in each frame for a sample of synthetic video sequences generated with different values of p(new): (a) 0.01, (b) 0.10, (c) 0.20, (d) 0.30, (e) 0.40, (f) 0.50.

The plots in figure 6.7 illustrate how the average number of objects per frame, and total number of dynamic occlusions can vary with the value of p(new). For each plot the number of


image frames is 1500, and the maximum number of objects is 20. The error bars on each plot indicate the standard deviation over the five simulations performed for each value of p(new). Both plots become asymptotic when the average number of objects per frame approaches the maximum number of objects allowed in each image frame. In figure 6.8 the number of objects present in each generated image frame is shown for a sample of the generated video sequences. The horizontal bar in each plot indicates the average number of objects per frame for that video sequence. As the value of p(new) increases, the average number of objects per frame approaches, as expected, the maximum value of 20. However, once the maximum number of objects has been reached, p(new) has a reduced influence on the creation of new objects. As a consequence, the plots of the average number of objects per frame and the number of dynamic object occlusions become asymptotic as the maximum number of objects appears in the synthetic video sequences.

6.5 Surveillance Metrics

Once the tracking algorithm has been used to process each generated pseudo synthetic video sequence, the ground truth and tracking results are compared to generate a surveillance metrics report. The surveillance metrics have been derived from a number of sources [26,27,77,88]. Minimising the following trajectory distance measure allows us to align the objects in the ground truth and the tracking results:

D_{T}(g, r) = \frac{1}{N_{rg}} \sum_{i \,:\, g(t_{i}) \wedge r(t_{i})} \sqrt{(xg_{i} - xr_{i})^{2} + (yg_{i} - yr_{i})^{2}}    (6.7)

where N_{rg} is the number of frames that the ground truth track and the result track have in common, and (xg_{i}, yg_{i}) and (xr_{i}, yr_{i}) are the locations of the ground truth and result object at frame i, respectively. Once the ground truth and result trajectories have been matched, the following metrics are used to characterise the tracking performance:

Tracker Detection Rate (TRDR) = \frac{\text{Total True Positives}}{\text{Total Number of Ground Truth Points}}

False Alarm Rate (FAR) = \frac{\text{Total False Positives}}{\text{Total True Positives} + \text{Total False Positives}}

Track Detection Rate (TDR) = \frac{\text{Number of true positives for tracked object}}{\text{Total number of ground truth points for object}}

Object Tracking Error (OTE) = \frac{1}{N_{rg}} \sum_{i \,:\, g(t_{i}) \wedge r(t_{i})} \sqrt{(xg_{i} - xr_{i})^{2} + (yg_{i} - yr_{i})^{2}}

Track Fragmentation (TF) = Number of result tracks matched to the ground truth track

Occlusion Success Rate (OSR) = \frac{\text{Number of successful dynamic occlusions}}{\text{Total number of dynamic occlusions}}

Tracking Success Rate (TSR) = \frac{\text{Number of non-fragmented tracked objects}}{\text{Total number of ground truth objects}}

A true positive is defined as a ground truth point that is located within the bounding box of an object detected and tracked by the tracking algorithm. A false positive is an object that is tracked by the system but does not have a matching ground truth point. A false negative is a ground truth point that is not located within the bounding box of any object tracked by the tracking algorithm. In figure 6.9a the vehicle in the top image has not been tracked correctly, hence the ground truth point is classified as a false negative, while the bounding box of the incorrectly tracked object is counted as a false positive. The three objects in the bottom image are counted as true positives, since the ground truth points are located within the tracked bounding boxes. The tracker detection rate (TRDR) and false alarm rate (FAR) characterise the overall performance of the object-tracking algorithm. The track detection rate (TDR) indicates the completeness of a specific ground truth object. The object tracking error (OTE) indicates the mean distance between the ground truth and tracked object trajectories. The track fragmentation (TF) indicates how often a tracked object label changes. Ideally, the TF value should be one, with larger values reflecting poor tracking and trajectory maintenance. The tracking success rate (TSR) summarises the performance of the tracking algorithm with respect to track fragmentation. The occlusion success rate (OSR) summarises the performance of the tracking algorithm with respect to dynamic occlusion reasoning.
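For a single matched ground truth/result track pair, the counting rules above could be implemented roughly as follows. The per-frame data structures, the use of the result bounding box centre for the distance term, and the per-frame false positive count are illustrative simplifications rather than the exact evaluation code used by the framework.

import math

def track_metrics(gt_centroids, result_boxes):
    # gt_centroids: frame -> (x, y) ground truth point
    # result_boxes: frame -> (x1, y1, x2, y2) tracked bounding box
    tp = fn = 0
    distances = []
    for frame, (gx, gy) in gt_centroids.items():
        box = result_boxes.get(frame)
        if box and box[0] <= gx <= box[2] and box[1] <= gy <= box[3]:
            tp += 1  # ground truth point falls inside the tracked bounding box
            rx, ry = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
            distances.append(math.hypot(gx - rx, gy - ry))
        else:
            fn += 1  # ground truth point not covered by any tracked box
    fp = sum(1 for f in result_boxes if f not in gt_centroids)
    trdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (tp + fp) if (tp + fp) else 0.0
    ote = sum(distances) / len(distances) if distances else 0.0
    return trdr, far, ote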


(a)

(b)

Figure 6.9 Illustration of surveillance metrics: (a) Image to illustrate true positives, false negative and false positive, (b) Image to illustrate a fragmented tracked object trajectory.

Figure 6.9b shows a tracked object trajectory for the pedestrian who is about to leave the camera field of view. The track is fragmented into two parts shown as black and white trajectories. The two track segments are used to determine the track detection rate, which indicates the completeness of the tracked object. As a result this particular ground truth object had a TDR, OTE, and TF of 0.99, 6.43 pixels, and 2 respectively.

6.6 Experiments and Evaluation

In order to test the effectiveness of the ground truth track selection process described in section 6.3.1, the method was applied to two different surveillance databases. The first surveillance database contained data captured during a cloudy day. The second surveillance database contained data captured during a sunny day, where the outdoor illumination varied considerably. Two different cameras were used to capture the surveillance data, so the results should provide a good test of the approach using different hardware.

6.6.1 Ground Truth Track Selection Surveillance Database A (Cloudy Day)


The first surveillance database contained object tracks observed over an eight hour period on a cloudy day. It is expected that the weather conditions should have some impact on the quality of the tracks. On a cloudy day we would expect better tracking results, since there will be smaller illumination changes and the effects of cast shadows are minimal. The total number of unique object tracks stored in the surveillance database was 1098. Tracks were then filtered based on the track duration and object interaction constraints defined in section 6.3.1.


As a result, the number of tracks removed due to a short track duration and object interactions was 728 and 117 respectively. These tracks do not necessarily indicate failure of the object tracking, but they would not be good candidates to use as ground truth tracks within the evaluation framework. For the remaining 253 tracks we then generated the values of the path coherence, colour coherence, and shape coherence metrics. The mean and standard deviations of pc, cc, and sc were (0.092, 0.086, 0.157) and (0.034, 0.020, 0.054) respectively. Outlier ground truth tracks can be removed by applying a threshold to the values of pc, cc, and sc. The distributions of the path coherence, colour coherence, and shape coherence were shown earlier in figure 6.2. A total of 30 outlier tracks were identified, leaving 223 ground truth tracks, which were used to generate pseudo synthetic video sequences for the results presented in section 6.6.4. In figure 6.10 some example outlier tracks are shown. The top left track was rejected due to poor path coherence, since the derived object trajectory is not smooth. The top right track was rejected due to poor colour coherence, which is a consequence of the poor object segmentation. The bottom left track was rejected due to poor shape coherence, where an extra pedestrian is included in the track and the tracked bounding boxes are not consistent with the detected foreground object. The bottom right track was rejected because it formed a dynamic occlusion with another track; it can be observed that in this instance the tracking failed and the objects switched identities near the bottom of the image. These examples illustrate that the path coherence, colour coherence, and shape coherence metrics are effective for rejecting outlier ground truth tracks of poor quality.


(a)

(b)

(c)

(d)

Figure 6.10 Example of outlier tracks identified during ground truth track selection.

Once the outlier tracks have been removed the values of each metric are combined using a uniform weight scheme in order to rank each ground truth track according to its perceived quality derived from the set of ground truth metrics:

\text{track rank} = w_{1}\, pc + w_{2}\, cc + w_{3}\, sc
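With equal weights this ranking reduces to a simple weighted sum, as the following sketch (with assumed field names) shows; lower scores correspond to better perceived track quality.

def rank_tracks(tracks, w1=1.0/3, w2=1.0/3, w3=1.0/3):
    # tracks is a list of dicts holding the pc, cc and sc values of each inlier track
    return sorted(tracks, key=lambda t: w1 * t["pc"] + w2 * t["cc"] + w3 * t["sc"])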


Figure 6.11 Top four ranked ground truth tracks


Figure 6.12 Bottom four ranked ground truth tracks

The top and bottom four ranked tracks are shown in figures 6.11 and 6.12 respectively. It can be observed that the uniform weighting of each ground truth metric is effective in ordering the tracks based on their quality.

6.6.2 Ground Truth Track Selection Surveillance Database B (Sunny Day)


The purpose of the next experiment was to determine if the set of ground truth metrics could identify any differences in the quality of tracks under different weather conditions. The same process was repeated as in section 6.6.1, except the ground truth tracks were selected from a surveillance database captured during a sunny day. There were a total of 4250 uniquely tracked objects in the surveillance database. Due to short track duration and object interactions a total of 1595 and 2080 tracks were rejected respectively. For the remaining 575 tracks we then generated the values of the path coherence, colour coherence, and shape coherence


metrics. The mean and standard deviations of pc, cc, and sc were (0.079, 0.0948, 0.165) and (0.032, 0.027, 0.0712) respectively. Outlier ground truth tracks were removed by applying a threshold to the values of pc, cc, and sc. The distributions of the path coherence, colour coherence, and shape coherence are shown in figure 6.13. Again it can be observed that the distribution of each metric can be approximated by a Gaussian distribution. A total of 115 outlier tracks were identified, leaving 460 ground truth tracks. One of the problems associated with the selected tracks is that the motion detection algorithm does not correctly segment the boundary of each foreground object. This reduction in tracking performance is indicated by the increase in the mean colour coherence and shape coherence metrics when compared to the results in section 6.6.1. Due to the changes in illumination, which are a consequence of the disappearance and re-appearance of the sun, the background model changes more frequently than on the cloudy day. Using framelets captured under these conditions results in the tracks exhibiting a ghosting effect, where the object's motion is more apparent due to differences between the pre-recorded background and the background of the image when the object was tracked. From this observation we conclude that framelets captured under these conditions would not be suitable for generating pseudo synthetic video sequences without further pre-processing to remove pixels in each framelet incorrectly identified as part of the foreground. Once the outlier tracks were removed, the same ranking scheme was applied to sort the tracks by a weighted sum of the ground truth metrics. The top and bottom four ranked tracks are shown in figures 6.14 and 6.15 respectively. Again, it can be observed that the weighted combination of the ground truth metrics is effective in sorting the tracks based upon their quality.


Figure 6.13 Distribution of (a) the average path coherence, (b) the average colour coherence, and (c) the average shape coherence of each track selected from the surveillance database (sunny day).


Figure 6.14 Top four ranked ground truth tracks


Figure 6.15 Bottom four ranked ground truth tracks

6.6.3 Single View Tracking Evaluation (Qualitative)


A number of experiments were run to test the performance of the tracking algorithm used by the online surveillance system. The tracking algorithm employs a partial observation tracking model [98] for occlusion reasoning. Manual ground truth was generated for the PETS2001 datasets using the point and click graphical user interface described in section 6.2. The PETS2001 dataset is a set of video sequences that have been made publicly available for performance evaluation. Each dataset contains static camera views with pedestrians, vehicles, cyclists, and groups of pedestrians. The datasets were processed at a rate of 5 fps, since this approximately reflects the operating speed of our online surveillance system. Table 6.1 provides a summary of the surveillance metrics reports. The results demonstrate robust tracking performance, since the track completeness is nearly perfect for all the objects. A couple of the tracks are fragmented due to poor initialisation or termination. Figure 6.16 demonstrates what can happen when a tracked object is not initialised correctly. The left, middle, and right images show the pedestrian exiting the parked vehicle and walking away. The


pedestrian is partially occluded by other objects, so it is not detected by the tracking algorithm until it has moved away from the vehicle. The pedestrian corresponds to ground truth object 9 in table 6.1. An example of dynamic occlusion reasoning is shown in figure 6.17. The cyclist overtakes the two pedestrians, forming two dynamic occlusions, and it can be noted that the correct trajectory is maintained for all three objects. The object labels in figure 6.17 have been assigned by the tracking algorithm and are different from the ground truth object labels. Table 6.2 gives a summary of the tracking performance on the PETS2001 datasets. These results validate our assumption that our object tracker can be used to generate ground truth for video with low activity. The PETS dataset was used to construct a pseudo synthetic video by adding four additional ground truth tracks to the original sequence. Table 6.3 summarises the differences in perceptual complexity between the original and the synthetic video sequence. The number of dynamic object occlusions increases from 4 to 12, having the desired effect of increasing the complexity of the original video sequence.

Track   TP    FN   TDR    TF   OTE
0       25    0    1.00   1    11.09
1       116   2    0.98   1    7.23
2       26    0    1.00   1    8.37
3       104   5    0.95   1    4.70
4       36    0    1.00   1    10.82
5       369   5    0.99   1    11.63
6       78    1    0.99   1    9.05
7       133   1    0.99   2    6.43
8       43    1    0.98   1    8.11
9       88    2    0.98   2    11.87

TP: Number of true positives    FN: Number of false negatives    TDR: Track Detection Rate
TF: Track Fragmentation    OTE: Object Tracking Error (pixels)

Table 6.1 Summary of surveillance metrics for PETS2001 dataset 2 camera 2.

                                TRDR   TSR    FAR    AOTE (Mean, Stdev)   ATDR (Mean, Stdev)
Dataset 2 (Cam 2)               0.99   8/10   0.01   8.93, 2.4            0.99, 0.010
Pseudo Synthetic PETS Dataset   1.00   9/13   0.01   1.36, 2.09           1.00, 0.002

Table 6.2 Summary of object tracking metrics.


                                TNO   NDO   DDO    NOO
Original PETS Dataset           10    4     8.5    1
Pseudo Synthetic PETS Dataset   14    12    8.58   1.08

TNO: Total Number of Objects    NDO: Number of Dynamic Occlusions
DDO: Duration of Dynamic Occlusion (frames)    NOO: Number of Occluding Objects

Table 6.3 Summary of the perceptual complexity of the PETS datasets.

Figure 6.16 An example of how poor track initialisation results in low object track detection rate of the pedestrian leaving the vehicle.

Figure 6.17 Example of dynamic occlusion reasoning for PETS2001 dataset 2 camera 2.


6.6.4 Single View Tracking Evaluation (Quantitative)


To test the effectiveness of the tracking algorithm with respect to tracking success and dynamic occlusion reasoning, a number of synthetic video sequences were generated. Initially, ground truth tracks were automatically selected from a surveillance database using the method described in section 6.3.1. The tracks in the surveillance database were observed over a period of eight hours by a camera overlooking a building entrance. Five pseudo synthetic video sequences were then generated for each level of perceptual complexity. The value of p(new) was varied between 0.01 and 0.4 in increments of 0.01. Each synthetic video sequence was 1500 frames in length, which is equivalent to approximately 4 minutes of video captured live by our online system running at 7Hz. Hence, in total the system was evaluated with 200 different video sequences, totalling approximately 800 minutes of video of varying perceptual complexity. The synthetic video sequences were used as input to the tracking algorithm. The tracking results and ground truth were then compared and used to generate a surveillance metrics report as described in section 6.5. Table 6.4 gives a summary of the complexity of a selection of the synthetic video sequences. These results confirm that p(new) controls the perceptual complexity, since the number of objects, average number of dynamic occlusions and average number of occluding objects increase from (19.4, 6.2, 2.1) to (326.6, 761.4, 3.1) respectively between the smallest and largest values of p(new). Table 6.5 summarises the tracking performance for various values of p(new). The plot in figure 6.18 demonstrates how the object tracking error (OTE) varies with the value of p(new). Large values of p(new) result in an increase in the density and grouping of objects, which causes the tracking error of each object to increase. The plot in figure 6.19 illustrates how the track detection rate (TDR) varies with the value of p(new). The value of the TDR decreases from 90% to 72% with increasing p(new), indicating that the tracking algorithm maintains a high detection rate for each object even as the perceptual complexity of the video sequence increases. This result is expected, since the generated video sequences have a degree of bias towards the motion segmentation algorithm used to capture and store the ground truth tracks in the surveillance database. The TDR indicates how well an object has been detected and does not account for track fragmentation and identity switching, which can occur during a dynamic occlusion that is not correctly resolved by the tracker. This is more accurately reflected by the occlusion success rate (OSR) and the tracking success rate (TSR) metrics, which are shown in figures 6.20 and


The plot in figure 6.18 demonstrates how the object tracking error (OTE) varies with the value of p(new). Large values of p(new) result in an increase in the density and grouping of objects, which causes the tracking error of each object to increase. The plot in figure 6.19 illustrates how the tracker detection rate (TDR) varies with the value of p(new). The TDR decreases from 90% to 72% as p(new) increases, indicating that the tracking algorithm maintains a high detection rate for each object even as the perceptual complexity of the video sequence increases. This result is expected, since the generated video sequences have a degree of bias towards the motion segmentation algorithm used to capture and store the ground truth tracks in the surveillance database. The TDR indicates how well an object has been detected, but it does not account for track fragmentation and identity switching, which can occur during a dynamic occlusion that is not correctly resolved by the tracker. These effects are more accurately reflected by the occlusion success rate (OSR) and tracking success rate (TSR) metrics, which are shown in figures 6.20 and 6.21.

The track fragmentation increases with the value of p(new), which represents a degradation of tracking performance with respect to occlusion reasoning. The OSR and TSR decrease from (77%, 81%) to (57%, 19%) with increasing p(new). When the number of objects per frame approaches the maximum, the number of dynamic occlusions that can be created is limited, so increasing values of p(new) have a diminished effect on the perceptual complexity. As a consequence the OTE, TDR, TSR and OSR become asymptotic once the number of objects per frame approaches the maximum of 20, as illustrated in the plots of figures 6.18-6.21. Larger values of p(new), combined with a higher maximum number of objects per frame, would produce more complex video sequences. Hence, even with the bias present in the generated video sequences, we can still evaluate object tracking performance with respect to tracking success and occlusion reasoning without exhaustive manual truthing, fulfilling the main objective of our framework for performance evaluation.
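For reference, the sketch below shows one plausible way of computing a per-object track detection rate and object tracking error from ground truth and tracked bounding boxes. The box representation, the 50% overlap threshold and the helper names are illustrative assumptions; they are not the exact metric definitions given in section 6.5.

```python
# Minimal metric sketch, assuming each track is a dict mapping
# frame -> (x1, y1, x2, y2) bounding box in image coordinates.

def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def centre(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def track_detection_rate(gt_track, sys_track, iou_thresh=0.5):
    """Fraction of ground truth frames in which the tracked box overlaps the
    ground truth box by at least iou_thresh (assumed threshold)."""
    hits = sum(1 for f, box in gt_track.items()
               if f in sys_track and overlap(box, sys_track[f]) >= iou_thresh)
    return hits / len(gt_track) if gt_track else 0.0

def object_tracking_error(gt_track, sys_track):
    """Mean centroid distance (in pixels) over frames where both tracks exist."""
    common = sorted(set(gt_track) & set(sys_track))
    if not common:
        return float('inf')
    dists = []
    for f in common:
        (gx, gy), (sx, sy) = centre(gt_track[f]), centre(sys_track[f])
        dists.append(((gx - sx) ** 2 + (gy - sy) ** 2) ** 0.5)
    return sum(dists) / len(dists)
```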

Figure 6.18 Plot of Object Tracking Error (OTE) against P(new)


Figure 6.19 Plot of Track Detection Rate (TDR) against P(new)

Figure 6.20 Plot of Tracking Success Rate (TSR) against P(new)


Figure 6.21 Plot of Occlusion Success Rate (OSR) against P(new)

P(new)   TNO mean (stdev)    NDO mean (stdev)    DDO mean (stdev)   NOO mean (stdev)
0.01     19.40  (1.673)      6.20   (1.304)      10.59 (4.373)      2.08 (0.114)
0.20     276.40 (7.668)      595.20 (41.889)     11.51 (1.314)      2.95 (0.115)
0.40     326.60 (13.240)     761.40 (49.958)     11.53 (0.847)      3.06 (0.169)

Table 6.4 Summary of the perceptual complexity of the synthetic video sequences

P(new)   TRDR   FAR     OSR mean (stdev)   AOTE mean (stdev)   ATDR mean (stdev)   ATSR mean (stdev)
0.01     0.91   0.052   0.77 (0.132)       3.43  (0.582)       0.89 (0.070)        0.81 (0.080)
0.20     0.86   0.009   0.56 (0.021)       12.49 (0.552)       0.74 (0.011)        0.21 (0.023)
0.40     0.85   0.006   0.57 (0.021)       13.26 (0.508)       0.73 (0.007)        0.19 (0.020)

Table 6.5 Summary of metrics generated using each synthetic video sequence


6.7 Summary

In this chapter a novel framework for evaluating the performance of a video tracking algorithm has been presented. The framework uses a comprehensive set of metrics to measure the quality of the ground truth tracks, as well as to characterise tracking performance. The framework first automatically selects ground truth tracks from a surveillance database; outlier tracks of poor quality are removed using the path coherence, shape coherence, and colour coherence metrics. The remaining ground truth tracks are then used to construct pseudo synthetic video sequences. It is acknowledged that the pseudo synthetic video will have a degree of bias towards the motion detection algorithm used to capture the original data. However, the generated video sequences are effective for evaluating the performance of occlusion reasoning, and can be used to evaluate other tracking algorithms. The main strength of the evaluation framework is that it can automatically generate a variety of different testing datasets with some degree of control over the perceptual complexity. In this chapter a tracking algorithm was quantitatively evaluated over three hundred thousand frames of video, without any human intervention or semi-automatic ground truth generation.


7 Conclusion
7.1 Research Summary
In chapter two a set of requirements was identified and used to define the scope and research goals of this thesis. The work has primarily focused on the following problems associated with visual surveillance: multi view object tracking, surveillance database management, and performance evaluation of video tracking systems.

In chapter three we described a technique that can be employed for automatically recovering the homography transformations between pairs of overlapping camera views. The approach was shown to provide robust estimation for real and synthetic video sequences in the experiments performed in chapter three. The methods used to extract 3D measurements from the scene were also discussed. It is assumed that each camera in the surveillance system is calibrated with respect to the same ground plane. The uncertainty of each 3D measurement is defined by projecting the 2D image uncertainty to the 3D object space using a Jacobian transformation. The coefficients of the Jacobian matrix are derived in terms of the calibrated camera parameters, which allows a spatially varying 3D uncertainty to be evaluated for each measurement. The measurement uncertainty allows a degree of confidence to be assigned to each 3D measurement. This is an important property, since the measurements are used to track each object within a Kalman filter framework, enabling the observation noise to be set according to the number of cameras used to make a measurement and the distance of the object from each camera.

In chapter four we discussed how the system uses the estimated homography transformations to correspond features between overlapping camera views, and to track objects in 3D using a first order Kalman filter. One of the key benefits of tracking objects in 3D is that it is possible to resolve both dynamic and static object occlusions, which was demonstrated on the PETS2001 datasets and on outdoor video sequences captured at the City University campus. In addition, each camera is calibrated in the same world coordinate system. This enables the system to preserve an object's identity when it moves between non-overlapping cameras that are separated by a short temporal distance of less than two seconds. The Kalman filter prediction is effective when an object maintains a linear trajectory and constant speed during the transition time between the non-overlapping views. This is not the case when objects change direction or when the transition time is much longer.
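As an illustration of the overlapping-view correspondence step described above, the sketch below transfers the ground-plane point of a detection from one view into another through the recovered homography and matches it to the nearest detection in the second view. The 3x3 homography H, the use of the bounding-box bottom-centre as the ground-plane point, and the distance threshold are assumptions made for the sketch, not the exact matching rule used by the system.

```python
# Minimal sketch of homography-based correspondence between two overlapping views.
import numpy as np

def transfer(H, point):
    """Map an image point (x, y) from view A into view B using the 3x3 homography H."""
    x, y = point
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

def match_detections(H, boxes_a, boxes_b, max_dist=25.0):
    """Match each detection in view A to the closest detection in view B.

    Each box is (x1, y1, x2, y2); the bottom-centre of the box is taken as the
    point where the object touches the ground plane (an assumption)."""
    foot = lambda b: ((b[0] + b[2]) / 2.0, b[3])
    matches = {}
    for i, a in enumerate(boxes_a):
        pa = transfer(H, foot(a))
        dists = [np.linalg.norm(pa - np.array(foot(b))) for b in boxes_b]
        if dists and min(dists) < max_dist:
            matches[i] = int(np.argmin(dists))
    return matches
```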


To handle these types of tracking scenarios an object handover policy is defined between each pair of non-overlapping camera views. An object handover region is represented by a linked exit and entry region between adjacent camera views, and the statistical properties of the transition time are used as a temporal cue to match an object on its reappearance into the scene. We demonstrated how the approach could be used to coordinate the tracking of vehicles between non-overlapping camera views in outdoor environments.

Another key requirement identified in chapter two was the design and implementation of a surveillance system that could run continuously over extended periods of monitoring, and that would provide a suite of tools to allow fast indexing and retrieval of the surveillance data. In chapter five we discussed this functionality in terms of a hierarchical database. The database stores several different representations of the surveillance data, supporting spatio-temporal queries at the highest level down to playback of video data at the lowest level. Each intelligent camera in the surveillance system streams 2D tracking data to the multi view-tracking server, where the tracking data is integrated and then stored in the surveillance database. An MPEG-4-like encoding strategy is used to encode the video data, which enables the system to operate in real time (between 5-10 frames per second) over a 100 Mbit/s Ethernet connection. The surveillance database has been accessed for offline learning of several properties of the surveillance region, including routes, entry regions, exit regions, and the camera topology. This semantic scene information is stored in the semantic description layer of the database and is used to perform online route classification. Database technology is not new, but its potential application to surveillance systems has recently been recognized as a means of reducing the information load on security operators. In chapter five we demonstrated that this framework could be used to generate metadata online and execute high-level activity queries. The advantage of using a hierarchical database is that the metadata can be utilised to give much faster response times to various object activity queries than would be possible when querying the original tracking data.

In chapter six a framework was defined for evaluating the performance of video tracking systems. One of the problems associated with quantitative performance evaluation is acquiring ground truth, which defines the expected tracking results for a specific video sequence. Even when using semi-automatic tools this can be a time consuming process. We have presented an alternative approach that is fully automatic and can be used to complement existing methods for performance evaluation of video tracking algorithms. Initially, ground truth tracks are extracted from the surveillance database, and tracks of poor quality are removed using a number of metrics: path coherence, colour coherence and shape coherence.


The ground truth tracks are then used to construct pseudo-synthetic video sequences, which can be used to evaluate video tracking algorithms. One key aspect of the pseudo synthetic video generation process is that we can control the perceptual complexity of each dataset. Hence it becomes practical for a variety of testing datasets, varying in perceptual complexity (with respect to object density and dynamic occlusions), to be generated automatically without any supervision. In chapter six results were presented for the quantitative evaluation of a video tracking algorithm over three hundred thousand frames of video without any supervision.

7.2 Limitations

The multi view tracking algorithm uses a homography relation to correspond features between overlapping camera views. This presents a problem when objects are in close proximity (within less than a metre), since the feature correspondence algorithm classifies the objects as a group. As a consequence it is not possible for the system to track individual members within the group, or to correctly track objects in scenes that contain a large density of objects, for example crowds of pedestrians in a shopping centre.

When tracking objects between non-overlapping camera views the system relies on the 3D trajectory prediction of the Kalman filter, and on object handover policies between linked exit and entry zones, as was discussed in chapter 4. This approach was demonstrated to work in scenes for monitoring vehicles and pedestrians in an outdoor environment. Currently the method does not make any provision to handle cases of ambiguity when several objects move concurrently between non-overlapping cameras, for example a vehicle overtaking another between non-overlapping views. Under these circumstances it is possible that an object's identity will be incorrectly assigned if the ordering of the objects changes between the non-overlapping camera views.

A Kalman filter is used to track objects in 3D assuming a constant velocity model, as was discussed in chapter four. Trajectory prediction is used to maintain an object's identity during a dynamic or static occlusion until the object interaction has completed. This approach was demonstrated to work in chapter 4, but there are some instances where the tracking would fail. If the objects do not maintain the same trajectory during the occlusion, or there are more than three interacting objects, it is possible that the tracking will fail and the identities of the objects will be assigned incorrectly. The density of objects in the scene also affects the performance of single view tracking, since motion detection and tracking become more difficult.


It is assumed that the surveillance region conforms to the ground plane constraint and that 3D camera calibration is available for each camera. This could present a problem in applying the methods discussed in chapter three to scenes where the ground plane assumption is invalid, for example when tracking people between several floors of a building. In addition, it may not always be practical to perform a survey of the surveillance region, for example in rough terrain where there are no visible landmark points.

The surveillance system presented in chapter five demonstrated a robust framework for continuous monitoring of a region over extended periods of several hours or days. The current system has been operating for several months using a network of six cameras connected to a standard 100 Mbit/s Ethernet local area network. One problem with the architecture is that a centralised control strategy is employed to integrate all the tracking data received from the network of intelligent cameras. This would represent a bottleneck if the system had to be scaled to cope with several hundred cameras, which is common in many surveillance environments.

The video tracking performance evaluation framework presented in chapter six does not address some of the issues associated with motion detection, such as shadow suppression, detecting objects of low contrast, and coping with abrupt changes in illumination. The main focus of the evaluation framework is to evaluate tracking performance with respect to dynamic occlusion reasoning. The synthetic video sequences generated within the framework also have an inherent bias towards the motion detection and tracking algorithm used to capture the original data.

7.3 Future Work

Spatio-temporal cues are used to coordinate object tracking between multiple camera views. The system relies on 3D trajectory prediction to resolve dynamic occlusions, as stated in the list of limitations. The Kalman filter is not effective in instances where objects change direction significantly, or where the scene contains clutter. This limitation also presents problems when tracking objects between non-overlapping views, where the ambiguity in matching is increased if several objects move between cameras concurrently. An enhancement can be made to the system to use appearance information to improve the robustness of occlusion reasoning and of object handover between cameras.


Assuming colour calibration information between the cameras is available, it would be possible to use appearance cues as an additional method for object matching.

When a surveillance application is installed in a new environment it should be possible to automatically configure the system with limited operator intervention. The current system can calibrate the homographies between overlapping views without supervision; however, 3D camera calibration data is still required to coordinate tracking between multiple camera views. In future work it should be possible to perform self-calibration of the ground plane without the need for a landmark survey [41,48,50,67,86]. Many real surveillance environments also use active cameras whose pan, tilt, and zoom characteristics can be controlled remotely. In order to use these devices an alternative method would need to be employed for motion detection and for coordinating object tracking when the cameras move.

The testing datasets generated within the performance evaluation framework will be used to evaluate several tracking algorithms. Performance evaluation is an important topic that gives end users the ability to identify the reliability of the system under different operating conditions. The current framework will also be extended to generate pseudo synthetic video sequences between multiple camera views.

The current system has been running continuously over a period of several months, allowing a large volume of object tracking data to be accumulated. All the information is stored in a surveillance database, which allows various types of activity queries to be executed to recognize single object behaviours. In future work new methods will be explored to recognize different types of interactions between multiple objects. This concept of data mining is an emerging research area for surveillance systems, since it would allow human operators to perform forensic analysis of data streams without having to manually review large volumes of video. In future work we also plan to make the metadata MPEG-7 compliant [72]. MPEG-7 is a standard for video content description that is expressed in XML. MPEG-7 only describes content and is not concerned with the methods used to extract features and properties from the originating video. By adopting these standards the system can be made compatible with other content providers.

7.4 Epilogue

Image surveillance and monitoring is a topic that is being actively investigated by the machine vision research community. This thesis has presented several contributions to this area. Firstly, we have demonstrated how it is possible to use multiple cameras to coordinate the tracking of objects over a set of widely separated camera views.


The system architecture described in chapter 5 has been implemented in an online system, which has been running continuously over a period of several months. The database design provides an effective mechanism for managing the surveillance data. The novel contribution of the surveillance database is that it is possible to store semantic scene models that can be used to generate compact video summaries of tracking data. The metadata can be accessed to execute various spatio-temporal activity queries, with response times of only a few seconds.

Before video surveillance systems can be deployed for real world applications it is necessary to measure tracking performance in order to determine the reliability of the system. This thesis has presented a novel framework for video tracking performance evaluation. The framework is fully automatic and allows the generation of pseudo-synthetic video sequences that can be employed for performance evaluation. A key distinction from existing methods is that there is some degree of control over the perceptual complexity of each generated video sequence. This makes it practical to evaluate tracking algorithms over a variety of tracking scenarios, without the need for exhaustive manual ground-truthing. The framework can be adopted to complement conventional methods of video tracking performance evaluation. In the quantitative experiments presented in chapter 6 a tracking algorithm was evaluated using 300,000 frames of video without any supervision.


Bibliography
1 Aggarwal J.K., Cai Q. Human Motion Analysis: A Review. Computer Vision and Image Understanding (CVIU), 1999, Vol. 73, No. 3, pp 428-440
2 Baker D. Specification of a Video Test Imagery Library (VITAL). IEE Intelligent Distributed Surveillance Systems, London, February 2003.
3 Beymer D. Person counting using stereo. IEEE Workshop on Human Motion (HUMO'00), Austin, Texas, December 2000, pp 127-136
4 Black J., Ellis T., Makris D. Wide Area Surveillance With a Multi Camera Network. IEE Intelligent Distributed Surveillance Systems, London, February 2004.
5 Black J., Ellis T.J. Intelligent image surveillance and monitoring. The Institute of Measurement and Control, Vol. 35, No. 8, September 2002, pp 204-208
6 Black J., Ellis T., Rosin P. A Novel Method for Video Tracking Performance Evaluation. The Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, October 2003.
7 Black J., Ellis T.J. Multi Camera Image Measurement and Correspondence. The Journal of the International Measurement Confederation (IMEKO), Vol. 32, No. 1, July 2002, pp 61-71
8 Black J., Ellis T.J. Multi Camera Image Tracking. The Second International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2001), Kauai, Hawaii, December 2001.
9 Black J., Ellis T.J., Rosin P. Multi View Image Surveillance and Tracking. IEEE Workshop on Motion and Video Computing, Orlando, December 2002, pp 169-174.
10 Bobick A., Intille S., Davis J., Baird F., Pinhanez C., Campbell L., Ivanov Y., Schutte A., Wilson A. The KidsRoom: A Perceptually-based Interactive and Immersive Story Environment. PRESENCE: Teleoperators and Virtual Environments, 8(4), August 1999, pp 367-391.


11 Boyd J.E., Hunter E., Kelly P.H., Tai L.C., Phillips C.B., Jain R.C. MPI-Video Infrastructure for Dynamic Environments. IEEE International Conference on Multimedia Computing and Systems (ICMCS'98), Austin, Texas, July 1998, pp 249-254
12 Cai Q., Aggarwal J.K. Automatic Tracking of Human Motion in Indoor Scenes Across Multiple Synchronized Video Streams. International Conference on Computer Vision (ICCV'98), Bombay, India, January 1998, pp 356-362
13 Cai Q., Aggarwal J.K. Tracking Human Motion Using Multiple Cameras. International Conference on Pattern Recognition (ICPR '96), Vienna, Austria, August 1996, Vol. 3, pp 68-73
14 Cai Q., Aggarwal J.K. Tracking Human Motion in Structured Environments Using a Distributed-Camera System. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), November 1999, Vol. 21, No. 11, pp 1241-1247
15 Cai X., Ali F., Stipidis E. MPEG4 Over Local Area Mobile Surveillance System. IEE Intelligent Distributed Surveillance Systems, London, UK, February 2003.
16 Chang T.H., Gong S. Tracking Multiple People with a Multi-Camera System. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 19-28
17 Chang T.H., Gong S., Ong E.J. Tracking Multiple People Under Occlusion Using Multiple Cameras. British Machine Vision Conference (BMVC 2000), Bristol, September 2000.
18 Cohen I., Medioni G. Detecting and Tracking Moving Objects for Video Surveillance. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '99), Fort Collins, Colorado, June 1999, pp 2319-2325


19 Collins R.T., Lipton A.J., Fujiyoshi H., Kanade T. Algorithms for Cooperative Multisensor Surveillance. Proceedings of the IEEE, October 2001, Vol. 89, No. 10, pp 1456-1477
20 Collins R.T., Lipton A.J., Kanade T. A System for Video Surveillance and Monitoring. Proceedings of the American Nuclear Society (ANS) Eighth International Topical Meeting on Robotics and Remote Systems, April 1999
21 Comaniciu D., Ramesh V. Robust Detection and Tracking of Human Faces with an Active Camera. The IEEE International Workshop on Visual Surveillance (VS'2000), Dublin, Ireland, July 2000, pp 11-19.
22 Comaniciu D., Ramesh V., Meer P. Kernel-Based Object Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), August 2000, Vol. 22, No. 8, pp 564-575
23 Comaniciu D., Ramesh V., Meer P. Real-Time Tracking of Non-Rigid Objects Using Mean Shift. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'00), Hilton Head, South Carolina, June 2000, pp 2142-2151
24 Criminisi A., Reid I., Zisserman A. A Plane Measuring Device. Image and Vision Computing (IVC), Vol. 17, No. 8, August 1999, pp 625-634
25 Dockstader S.L., Murat Tekalp A. Multiple Camera Fusion for Multi-Object Tracking. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 95-103
26 Doermann D., Mihalcik D. Tools and Techniques for Video Performance Evaluation. International Conference on Pattern Recognition (ICPR'00), Barcelona, Spain, September 2000, Vol. 4, pp 4167-4170
27 Ellis T.J. Performance Metrics and Methods for Tracking in Surveillance. The Third International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2002), Copenhagen, June 2002, pp 26-31.
28 Ellis T.J., Black J. A Multi-view surveillance system. IEE Intelligent Distributed Surveillance Systems, London, February 2003.


29 Erdem C., Sankur B. Performance Evaluation Metrics for Object-Based Video Segmentation. 10th European Signal Processing Conference (EUSIPCO'2000), Tampere, Finland, September 2000, pp 917-920
30 Erdem C., Sankur B., Tekalp A.M. Metrics for Performance Evaluation of Video Object Segmentation and Tracking Without Ground-Truth. IEEE International Conference on Image Processing (ICIP'01), Thessaloniki, Greece, October 2001
31 Erdem C., Sankur B., Tekalp A.M. Non-Rigid Object Tracking using Performance Evaluation Measures as Feedback. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01), Kauai, Hawaii, December 2001, pp 323-330
32 Faugeras O. Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, 1993
33 Grewal M., Andrews A. Kalman Filtering: Theory and Practice. Prentice Hall Information and System Sciences Series, 1993
34 Grimson W.E.L., Stauffer C., Romano R., Lee L. Using Adaptive Tracking to Classify and Monitor Activities in a Site. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '98), Santa Barbara, California, June 1998, pp 22-31
35 Haritaoglu I., Flickner M. Detection and Tracking of Shopping Groups in Stores. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01), Kauai, Hawaii, December 2001, pp 431-437
36 Haritaoglu I., Harwood D., Davis L.S. Active Outdoor Surveillance. International Conference on Image Analysis and Processing (ICIAP'99), Venice, Italy, September 1999, pp 1096-1100
37 Haritaoglu I., Harwood D., Davis L.S. An Appearance-Based Body Model for Multiple People Tracking. International Conference on Pattern Recognition (ICPR'00), Barcelona, Spain, September 2000, Vol. 4, pp 4184-4187
38 Haritaoglu I., Harwood D., Davis L.S. Hydra: Multiple People Detection and Tracking Using Silhouettes. IEEE Workshop on Visual Surveillance (VS'1999), Fort Collins, Colorado, June 1999, pp 6-13


39 Haritaoglu I., Harwood D., Davis L.S. Hydra: Multiple People Detection and Tracking Using Silhouettes. International Conference on Image Analysis and Processing (ICIAP'99), Venice, Italy, September 1999, pp 280-285
40 Haritaoglu I., Harwood D., Davis L.S. W4: Real-Time Surveillance of People and Their Activities. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), August 2000, Vol. 22, No. 8, pp 809-830
41 Hartley R., Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000
42 Intille S.S., Bobick A.F. Closed-world tracking. International Conference on Computer Vision (ICCV'95), Cambridge, Massachusetts, June 1995, pp 672-678
43 Intille S.S., Davis J.W., Bobick A.F. Real-time closed-world tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '97), Puerto Rico, June 1997, pp 697-703
44 Irani M., Anandan P. Robust Multi-Sensor Image Alignment. International Conference on Computer Vision (ICCV'98), Bombay, India, January 1998, pp 959-967
45 Javed O., Khan S., Rasheed Z., Shah M. Camera handoff: tracking in multiple uncalibrated stationary cameras. IEEE Workshop on Human Motion (HUMO'00), Austin, Texas, December 2000, pp 113-120
46 Javed O., Rasheed Z., Alatas A., Shah M. KNIGHT: A Real Time Surveillance System for Multiple Overlapping and Non-Overlapping Cameras. International Conference on Multimedia and Expo (ICME 2003), Baltimore, Maryland, 2003.
47 Javed O., Rasheed Z., Shafique K., Shah M. Tracking Across Multiple Cameras With Disjoint Views. IEEE International Conference on Computer Vision, Nice, France, 2003, pp 952-957.
48 Jaynes C. Multi-View Calibration from Planar Motion for Video Surveillance. IEEE Workshop on Visual Surveillance (VS'1999), Fort Collins, Colorado, June 1999, pp 59-67


49 Jaynes C., Webb S., Matt Steele R., Xiong Q. An Open Development Environment for Evaluation of Video Surveillance Systems. The Third International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2002), Copenhagen, June 2002, pp 32-39.
50 Jones G.A., Renno J., Remagnino P. Auto-Calibration in Multiple-Camera Surveillance Environments. The Third International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2002), Copenhagen, June 2002, pp 40-47.
51 Jones M.J., Rehg J.M. Statistical Color Models with Application to Skin Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '99), Fort Collins, Colorado, June 1999, pp 1274-1280
52 Junior B., Anido R. Objects Detection with Multiple Cameras. IEEE Workshop on Motion and Video Computing, Orlando, December 2002, pp 187-196
53 Julier S., Uhlmann J.K. A Non-divergent estimation algorithm in the presence of unknown correlations. American Control Conference, 1997
54 Kang J., Cohen I., Medioni G. Continuous Tracking Within and Across Camera Streams. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Madison, Wisconsin, June 2003, Vol. 1, pp 267-272
55 Katz B., Lin J., Stauffer C., Grimson E. Answering Questions about Moving Objects in Surveillance Videos. Proceedings of the 2003 AAAI Spring Symposium on New Directions in Question Answering, March 2003.
56 Kettnaker V., Zabih R. Bayesian Multi-Camera Surveillance. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '99), Fort Collins, Colorado, June 1999, pp 2253-2259
57 Kettnaker V., Zabih R. Counting People from Multiple Cameras. IEEE International Conference on Multimedia Computing and Systems (ICMCS'99), Florence, Italy, June 1999, pp 267-272


58 Khan S., Javed O., Rasheed Z., Shah M. Human Tracking in Multiple Cameras. IEEE International Conference on Computer Vision (ICCV'01), Vancouver, Canada, July 2001, pp 331-337
59 Khan S., Javed O., Shah M. Tracking in Uncalibrated Cameras with Overlapping Field of View. The Second International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2001), Kauai, Hawaii, December 2001.
60 Kogut G.T., Trivedi M. Maintaining the Identity of Multiple Vehicles as They Travel Through a Video Network. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 29-34.
61 Krumm J., Harris S., Meyers B., Brumitt B., Hale M., Shafer S. Multi-Camera Multi-Person Tracking for EasyLiving. The IEEE International Workshop on Visual Surveillance (VS'2000), Dublin, Ireland, July 2000, pp 3-10.
62 Lee L., Romano R., Stein G. Monitoring Activities from Multiple Video Streams: Establishing a Common Coordinate Frame. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), August 2000, Vol. 22, No. 8, pp 758-767
63 Li Y., Hilton A., Illingworth J. A relaxation algorithm for real-time multiview 3d-tracking. Image and Vision Computing, Vol. 20, No. 12, October 2002, pp 841-859
64 Li Y., Hilton A., Illingworth J. Towards Reliable Real-Time Multiview Tracking. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 43-52
65 Lipton A.J., Fujiyoshi H. Real-time Human Motion Analysis by Image Skeletonization. The IEEE Workshop on Applications of Computer Vision (WACV'98), New Jersey, October 1998, pp 15-22.
66 Lipton A.J., Fujiyoshi H., Patil R.S. Moving Target Classification and Tracking from Real-time Video. The IEEE Workshop on Applications of Computer Vision (WACV'98), New Jersey, October 1998, pp 8-14.


67 Lv F., Zhao T., Nevatia R. Self-Calibration of a Camera from Video of a Walking Human. International Conference on Pattern Recognition (ICPR'02), Quebec City, Canada, August 2002, Vol. 1, pp 10562-10567
68 Makris D., Ellis T., Black J. Bridging the Gaps Between Cameras. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'04), Washington DC, June 2004.
69 Makris D., Ellis T. Automatic Learning of an Activity-Based Semantic Model. IEEE International Conference on Advanced Video and Signal Based Surveillance, Miami, USA, July 2003, pp 183-188.
70 Makris D., Ellis T. Spatial and Probabilistic Modelling of Pedestrian Behaviour. British Machine Vision Conference 2002, Cardiff, September 2002, pp 557-566.
71 Marcenaro L., Oberti F., Regazzoni C. Multiple Objects Colour-Based Tracking using Multiple Cameras in Complex Time-Varying Outdoor Scenes. The Second International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2001), Kauai, Hawaii, December 2001.
72 Martinez J.M., Koenen R., Pereira F. MPEG-7: The Generic Multimedia Content Description Standard, Part 1. IEEE Multimedia, Vol. 9, No. 2, April-June 2002, pp 78-87
73 Mikic I., Huang K., Trivedi M. Activity monitoring and summarization for an intelligent meeting room. IEEE Workshop on Human Motion (HUMO'00), Austin, Texas, December 2000, pp 107-112
74 Mikic I., Santini S., Jain R. Video Integration from Multiple Cameras. DARPA Image Understanding Workshop, Monterey, CA, November 1998.
75 Mittal A., Davis L. Unified Multi-Camera Detection and Tracking Using Region-Matching. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 3-10
76 Mittal A., Davis L.S. M2Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo. European Conference on Computer Vision (ECCV'02), Copenhagen, Denmark, May 2002, pp 18-33


77 Needham C.J., Boyle R.D. Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation. International Conference on Computer Vision Systems (ICVS'03), Graz, Austria, April 2003, pp 278-289
78 Needham C.J., Boyle R.D. Tracking multiple sports players through occlusion, congestion and scale. British Machine Vision Conference (BMVC 2001), Manchester, September 2001, pp 93-102
79 Nummiaro K., Koller-Meier E., Van Gool L. Color Features for Tracking Non-Rigid Objects. Special Issue on Visual Surveillance, Chinese Journal of Automation, May 2003, Vol. 29, No. 3, pp 345-355
80 Olsen B.D. Robot navigation using a sensor network. Masters thesis, Laboratory of Image Analysis, Aalborg University, Denmark, 1998.
81 Orwell J., Remagnino P., Jones G.A. Multi-camera colour tracking. IEEE Workshop on Visual Surveillance (VS'1999), Fort Collins, Colorado, June 1999, pp 14-24
82 Pingali S., Segen J. Performance Evaluation of People Tracking Systems. The IEEE Workshop on Applications of Computer Vision (WACV '96), Sarasota, FL, December 1996, pp 33-39.
83 Polat E., Yeasin M., Sharma R. Tracking Body Parts of Multiple People: A New Approach. The IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, British Columbia, July 2001, pp 35-42
84 Prati A., Mikic I., Trivedi M.M., Cucchiara R. Detecting Moving Shadows: Algorithms and Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), July 2003, Vol. 25, No. 7, pp 918-923
85 Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
86 Remagnino P., Jones G.A. Automated Registration of Surveillance Data for Multi-Camera Fusion. IEEE ISIF International Conference on Information Fusion, Invited Session on Information Fusion Techniques for Surveillance and Security Applications, Annapolis, USA, July 2002, pp 1190-1197


87 Renno J., Orwell J., Jones G.A. Learning Surveillance Tracking Models for the Self-Calibrated Ground Plane. British Machine Vision Conference (BMVC 2001), Manchester, September 2001, pp 607-616
88 Senior A., Hampapur A., Tian Y., Brown L., Pankanti S., Bolle R. Appearance Models for Occlusion Handling. The Second International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2001), Kauai, Hawaii, December 2001.
89 Stauffer C., Grimson W.E.L. Adaptive Background Mixture Models for Real-Time Tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '99), Fort Collins, Colorado, June 1999, pp 2246-2252
90 Stauffer C., Tieu K. Automated multi-camera planar tracking correspondence modeling. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Madison, Wisconsin, June 2003, Vol. 1, pp 259-266
91 Stein G. Tracking from Multiple View Points: Self Calibration of Space and Time. DARPA Image Understanding Workshop, 1998, pp 1037-1042
92 Stewart C.V. Robust Parameter Estimation in Computer Vision. SIAM Review, 1999, Vol. 41, No. 3, pp 513-537.
93 Trivedi M., Bhonsle S., Gupta A. Database Architecture for Autonomous Transportation Agents for On-scene Networked Incident Management (ATON). International Conference on Pattern Recognition (ICPR2000), Barcelona, Spain, 2000, pp 4664-4667.
94 Tsai R.Y. A Versatile Camera Calibration Technique for High Accuracy 3D Machine Vision Metrology using off the shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation, August 1987, 3(4), pp 323-344
95 Wren C.R., Azarbayejani A., Darrell T., Pentland A.P. Pfinder: Real-Time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), July 1997, Vol. 19, No. 7, pp 780-785


96 Xu M., Ellis T.J. Tracking occluded objects using partial observation. Acta Automatica Sinica, Special Issue on Visual Surveillance of Dynamic Scenes, May 2003, 29(3): 370-380
97 Xu M., Ellis T.J. Illumination-Invariant Motion Detection Using Color Mixture Models. British Machine Vision Conference (BMVC 2001), Manchester, September 2001, pp 163-172.
98 Xu M., Ellis T.J. Partial Observation vs Blind Tracking through Occlusion. British Machine Vision Conference (BMVC 2002), Cardiff, September 2002, pp 777-786.
99 Yamamoto M., Sato A., Kawada S., Kondo T., Osaki Y. Incremental Tracking of Human Actions from Multiple Views. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '98), Santa Barbara, California, June 1998, pp 2-7
100 Yuan X., Sun Z., Varol Y., Bebis G. A Distributed Visual Surveillance System. IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2003), Miami, Florida, July 2003
101 Zitnick C.L., Kanade T. A Cooperative Algorithm for Stereo Matching and Occlusion Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), July 2000, Vol. 22, No. 7, pp 675-684


Appendix A Camera Models


Camera calibration information defines a mapping between 2D image coordinates and 3D world coordinates. This appendix gives a summary of the camera model used in this thesis. A thorough introduction to this topic can be found in [32,41]. The basic pinhole model assumes that points are projected from the camera centre onto a plane. The plane z=f is referred to as the image plane or focal plane, where f is the focal length of the camera.
Figure A.1 The transformation between world coordinates and image coordinates: a world point is related to the camera-centred coordinate system (origin C, image plane at z = f) by the rotation R and translation T.

A 3x3 rotation matrix (R) and a 3x1 translation vector (T) relate the image coordinate space and the world coordinate space, as shown in Figure A.1. Using Tsai's method of calibration [94] the camera model can be estimated using five coplanar landmark points.
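The sketch below illustrates the world-to-image mapping implied by this model: a world point is rotated and translated into the camera frame and then projected by perspective division onto the plane z = f. It is a minimal illustration of the pinhole geometry only, and omits the radial lens distortion and sensor/frame-grabber scaling terms of the Tsai model; the numerical values in the example are purely illustrative.

```python
# Minimal pinhole projection sketch: world point -> camera frame -> image plane.
import numpy as np

def project(point_w, R, T, f):
    """Project a 3D world point onto the image plane z = f.

    R: 3x3 world-to-camera rotation, T: translation, f: focal length (mm).
    Returns 2D image-plane coordinates in the same metric units as f."""
    p_cam = R @ np.asarray(point_w, dtype=float) + np.asarray(T, dtype=float).ravel()
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return f * p_cam[:2] / p_cam[2]   # perspective division

# Example: identity rotation, camera 10 m in front of the world origin along z.
R = np.eye(3)
T = np.array([0.0, 0.0, 10.0])
print(project([1.0, 2.0, 0.0], R, T, f=8.0))   # -> [0.8 1.6]
```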


Appendix B Jacobian Matrix of 2D to 3D Translation


In chapter 3 a method was presented for estimating the 3D measurement uncertainty using the calibrated camera parameters. This was achieved by propagating the 2D pixel uncertainty from the image plane to the world coordinate space. This appendix is a summary of how the coefficients of the Jacobian matrix were computed.

B.1 Image Coordinates to Ideal Undistorted Coordinates

First the image coordinates (U_f, V_f) are transformed to distorted image coordinates (U_d, V_d):

U_d = (U_f - C_x) dx    (A.1)
V_d = (V_f - C_y) dy    (A.2)

where dx = d_x N_cx / (s_x N_fx) and dy = d_y, and:

d_x is the width of each pixel in mm,
d_y is the height of each pixel in mm,
N_cx is the number of sensor elements in the camera's x direction,
N_fx is the number of pixels in the frame grabber's x direction,
s_x is a scale factor that accounts for any uncertainty in the frame grabber's resampling of the horizontal scanline,
(C_x, C_y) defines the centre point of the radial lens distortion on the image plane.

Then the distorted image coordinates are transformed to ideal undistorted image coordinates:

U_u = U_d (1 + k r^2),  V_u = V_d (1 + k r^2),  where r^2 = U_d^2 + V_d^2

and k is the first order radial lens distortion coefficient. By rearrangement and substitution the undistorted image coordinates can be expressed in terms of the image coordinates.


Expanding in terms of U_f:

U_u = u_0 U_f^3 + u_1 U_f^2 + u_2 U_f + u_3
V_u = u_4 U_f^2 + u_5 U_f + u_6

where

u_0 = k dx^3
u_1 = -3 C_x k dx^3
u_2 = dx (1 + k (3 C_x^2 dx^2 + (V_f - C_y)^2 dy^2))
u_3 = -C_x dx (1 + k (C_x^2 dx^2 + (V_f - C_y)^2 dy^2))
u_4 = (V_f - C_y) dy k dx^2
u_5 = -2 (V_f - C_y) dy k C_x dx^2
u_6 = (V_f - C_y) dy (1 + k (C_x^2 dx^2 + (V_f - C_y)^2 dy^2))

Similarly, expanding in terms of V_f:

V_u = v_0 V_f^3 + v_1 V_f^2 + v_2 V_f + v_3
U_u = v_4 V_f^2 + v_5 V_f + v_6

where

v_0 = k dy^3
v_1 = -3 C_y k dy^3
v_2 = dy (1 + k ((U_f - C_x)^2 dx^2 + 3 C_y^2 dy^2))
v_3 = -C_y dy (1 + k (C_y^2 dy^2 + (U_f - C_x)^2 dx^2))
v_4 = (U_f - C_x) dx k dy^2
v_5 = -2 (U_f - C_x) dx k C_y dy^2
v_6 = (U_f - C_x) dx (1 + k ((U_f - C_x)^2 dx^2 + C_y^2 dy^2))
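A direct numerical evaluation of this mapping is straightforward and can be used to sanity-check the polynomial coefficients above. The sketch below is a minimal version of the image-to-undistorted-coordinate step, assuming dx, dy and the distortion coefficient k have already been obtained from the calibration parameters defined earlier.

```python
# Sketch: frame-grabber image coordinates -> ideal undistorted image-plane
# coordinates, following the first order radial lens distortion model above.

def undistort(Uf, Vf, Cx, Cy, dx, dy, k):
    Ud = (Uf - Cx) * dx          # distorted image-plane coordinates
    Vd = (Vf - Cy) * dy
    r2 = Ud * Ud + Vd * Vd       # squared radial distance from the distortion centre
    return Ud * (1.0 + k * r2), Vd * (1.0 + k * r2)
```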

B.2 Ideal Undistorted Coordinates to World Coordinates

Using the calibration rotation and translation parameters it is possible to construct a 3D line of sight through the undistorted image coordinates:

r = P_0 + h (P_u - P_0),  where  P_0 = -R^T T


P_u = R^T ([U_u  V_u  f]^T - T)

so that

r = -R^T T + h R^T [U_u  V_u  f]^T

where:

R is the 3x3 rotation matrix from the world to the camera coordinate space,
T is the 3x1 translation vector from the world to the camera coordinate space,
f is the focal length of the camera in mm.

Using this representation it is possible to derive a functional form of the world coordinates X_w and Y_w for a given height Z_w above the ground plane:

h = (Z_w + r_13 T_11 + r_23 T_21 + r_33 T_31) / (r_13 U_u + r_23 V_u + r_33 f)

X_w = h (r_11 U_u + r_21 V_u + r_31 f) - (r_11 T_11 + r_21 T_21 + r_31 T_31)
Y_w = h (r_12 U_u + r_22 V_u + r_32 f) - (r_12 T_11 + r_22 T_21 + r_32 T_31)

After performing algebraic manipulation and substitution, the world coordinates X_w and Y_w can be expressed in terms of the image coordinates U_f and V_f. In terms of U_f:

X_w = C_0 + C_1 (u_7 U_f^3 + u_8 U_f^2 + u_9 U_f + u_10) / (u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14)

Y_w = C_2 + C_1 (u_15 U_f^3 + u_16 U_f^2 + u_17 U_f + u_18) / (u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14)

where

C_0 = -(r_11 T_11 + r_21 T_21 + r_31 T_31)
C_1 = Z_w + (r_13 T_11 + r_23 T_21 + r_33 T_31)
C_2 = -(r_12 T_11 + r_22 T_21 + r_32 T_31)

u_7 = r_11 u_0
u_8 = r_11 u_1 + r_21 u_4
u_9 = r_11 u_2 + r_21 u_5
u_10 = r_11 u_3 + r_21 u_6 + r_31 f
u_11 = r_13 u_0
u_12 = r_13 u_1 + r_23 u_4


u_13 = r_13 u_2 + r_23 u_5
u_14 = r_13 u_3 + r_23 u_6 + r_33 f
u_15 = r_12 u_0
u_16 = r_12 u_1 + r_22 u_4
u_17 = r_12 u_2 + r_22 u_5
u_18 = r_12 u_3 + r_22 u_6 + r_32 f

Similarly, for V_f we have:

X_w = C_0 + C_1 (v_7 V_f^3 + v_8 V_f^2 + v_9 V_f + v_10) / (v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14)

Y_w = C_2 + C_1 (v_15 V_f^3 + v_16 V_f^2 + v_17 V_f + v_18) / (v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14)

where

v_7 = r_21 v_0
v_8 = r_21 v_1 + r_11 v_4
v_9 = r_21 v_2 + r_11 v_5
v_10 = r_21 v_3 + r_11 v_6 + r_31 f
v_11 = r_23 v_0
v_12 = r_23 v_1 + r_13 v_4
v_13 = r_23 v_2 + r_13 v_5
v_14 = r_23 v_3 + r_13 v_6 + r_33 f
v_15 = r_22 v_0
v_16 = r_22 v_1 + r_12 v_4
v_17 = r_22 v_2 + r_12 v_5
v_18 = r_22 v_3 + r_12 v_6 + r_32 f


The Jacobian coefficients can now be computed directly:

∂X_w/∂U_f = C_1 [(3 u_7 U_f^2 + 2 u_8 U_f + u_9)(u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14) - (3 u_11 U_f^2 + 2 u_12 U_f + u_13)(u_7 U_f^3 + u_8 U_f^2 + u_9 U_f + u_10)] / (u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14)^2

∂Y_w/∂U_f = C_1 [(3 u_15 U_f^2 + 2 u_16 U_f + u_17)(u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14) - (3 u_11 U_f^2 + 2 u_12 U_f + u_13)(u_15 U_f^3 + u_16 U_f^2 + u_17 U_f + u_18)] / (u_11 U_f^3 + u_12 U_f^2 + u_13 U_f + u_14)^2

∂X_w/∂V_f = C_1 [(3 v_7 V_f^2 + 2 v_8 V_f + v_9)(v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14) - (3 v_11 V_f^2 + 2 v_12 V_f + v_13)(v_7 V_f^3 + v_8 V_f^2 + v_9 V_f + v_10)] / (v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14)^2

∂Y_w/∂V_f = C_1 [(3 v_15 V_f^2 + 2 v_16 V_f + v_17)(v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14) - (3 v_11 V_f^2 + 2 v_12 V_f + v_13)(v_15 V_f^3 + v_16 V_f^2 + v_17 V_f + v_18)] / (v_11 V_f^3 + v_12 V_f^2 + v_13 V_f + v_14)^2

J = [ ∂X_w/∂U_f   ∂X_w/∂V_f ]
    [ ∂Y_w/∂U_f   ∂Y_w/∂V_f ]
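Numerically, this Jacobian propagates a 2D pixel covariance to a covariance on the ground plane as Sigma_w = J Sigma_px J^T. The sketch below illustrates the propagation using a finite-difference estimate of J for a generic back-projection function, rather than the closed-form coefficients derived above; the function name and the isotropic one-pixel image uncertainty are assumptions made for the example.

```python
# Sketch: propagate 2D image uncertainty to 3D ground-plane uncertainty using a
# numerically estimated Jacobian of the image-to-world mapping.
import numpy as np

def numeric_jacobian(image_to_world, u, v, eps=0.5):
    """2x2 Jacobian d(Xw, Yw)/d(Uf, Vf) by central differences.

    image_to_world(u, v) -> (Xw, Yw) is any back-projection onto the ground
    plane, e.g. one built from the closed-form expressions in this appendix."""
    dX_du = (np.array(image_to_world(u + eps, v)) -
             np.array(image_to_world(u - eps, v))) / (2.0 * eps)
    dX_dv = (np.array(image_to_world(u, v + eps)) -
             np.array(image_to_world(u, v - eps))) / (2.0 * eps)
    return np.column_stack([dX_du, dX_dv])

def world_covariance(image_to_world, u, v, pixel_sigma=1.0):
    """First order propagation of an isotropic pixel uncertainty: J Sigma_px J^T."""
    J = numeric_jacobian(image_to_world, u, v)
    sigma_px = (pixel_sigma ** 2) * np.eye(2)
    return J @ sigma_px @ J.T
```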


Appendix C Surveillance Database Tables


This appendix provides a detailed description of the database used to store the surveillance data in the system developed at City University. An entity relationship diagram is shown in Figure C.1. The diagram describes the relationships between the physical tables in the image framelet and object motion layers of the hierarchical database presented in chapter 5.

Figure C.1 Physical conceptual database schema (tables: REGION, CAMERA, VIDEOSEQ, MULTIVIDEOSEQ, TIMESTAMPS, TRACKS2D, TRACKS3D, MULTITRACKS2D, FRAMELETS)

162

Page 163 of 168

C.1 Region

Description: Describes each surveillance region within the network of cameras. Allows a group of cameras to be allocated to a common area.

Field Name     Type        Comment
id             int         Primary key
description    char(100)   Text description of the surveillance region

C.2 Camera

Description: Stores the information relating to each camera connected to the surveillance network. This table is accessed during the start-up process to determine the IP address and shell command to be used to invoke each camera.

Field Name     Type        Comment
id             int         Primary key
region         int         Region the camera is located in
description    char(100)   Text description of the camera location
monochrome     boolean     True if the camera is monochrome, otherwise false
ipaddress      char(20)    The IP address of the camera
port           smallint    The port to use for the socket connection on the camera
command        char(200)   The command used to invoke the camera; typically an rsh command to remotely invoke the 2D tracker on the camera server


C.3 Videoseq

Description: Stores information relating to each video sequence captured from each camera in the surveillance network. In the current system set-up each video sequence contains around 30 minutes of video. Video is captured by the system daily between 8:15 and 16:15; the exact times can vary, depending on the number of hours of daylight during the year.

Field Name        Type                      Comment
id                serial                    Primary key
multivideoseq     int                       Foreign key
camera            int                       Foreign key
region            int                       Foreign key
description       char(100)
backgroundimage   oid                       Pointer to the background image of this video sequence; the large object API uses this field to access the image data
path              char(100)                 The path of the background image when saved from disk
command           char(200)                 The command used to invoke the camera
scale             smallint                  The scale of the 2D tracking data (not used)
model             char(20)
xsize             smallint                  The width of the background image
ysize             smallint                  The height of the background image
startframe        timestamp                 The start time of the video sequence
endframe          timestamp                 The end time of the video sequence
creationdate      timestamp default now()   Time the video sequence was created in the database


C.4 Multivideoseq

Description: This table is used to document each multi video sequence captured by the surveillance system. In the current system set-up each multi video sequence consists of between two and five video streams.

Field Name     Type                      Comment
id             serial                    Primary key
region         int                       Foreign key
description    char(100)                 Text description of the multi video sequence
gmap           oid                       Reference to the ground plane map of the multi video sequence
gmapath        char(100)                 The original path of the ground plane map
creationdate   timestamp default now()   The date and time the multi video sequence was created in the database

C.5 Tracks3d

Description: This table stores the 3D trajectory data of each tracked object.

Field Name      Type     Comment
id              serial   Primary key
region          int      Foreign key
multivideoseq   int      Foreign key
videoseq        int      Foreign key; indicates the video sequence used as a reference for each timestamp
frame           int      The frame number
trackid         int      The track id of the object
status          int      The status of the tracked object
xloc            real     The location of the object along the x-axis
yloc            real     The location of the object along the y-axis
zloc            real     The location of the object along the z-axis
xvel            real     The velocity of the object along the x-axis
yvel            real     The velocity of the object along the y-axis
zvel            real     The velocity of the object along the z-axis

C.6 Tracks2d

Description: This table stores the original 2D trajectory data of each tracked object, received from each intelligent camera in the surveillance network.

Field Name     Type     Comment
id             serial   Primary key
camera         int      Foreign key
videoseq       int      Foreign key
frame          int      The frame number
trackid        int      The track id of the object
bounding_box   box      The bounding box of the tracked object
status         int      The status of the tracked object
xcog           real     The position of the centroid along the x-axis
ycog           real     The position of the centroid along the y-axis
red            real     The mean value of the red channel of the tracked object
green          real     The mean value of the green channel of the tracked object
blue           real     The mean value of the blue channel of the tracked object


C.7 Multitracks2d

Description: This table stores the 2D trajectory data of each tracked object. It is similar to the tracks2d table, except that the track ids of each tracked object have been assigned by the 3D tracker rather than the 2D tracker.

Field Name      Type     Comment
id              serial   Primary key
camera          int      Foreign key
multivideoseq   int      Foreign key
videoseq        int      Foreign key; indicates the video sequence used as a reference for each timestamp
frame           int      The frame number
trackid         int      The track id of the object
bounding_box    box      The bounding box of the tracked object
status          int      The status of the tracked object
xcog            real     The position of the centroid along the x-axis
ycog            real     The position of the centroid along the y-axis
red             real     The mean value of the red channel of the tracked object
green           real     The mean value of the green channel of the tracked object
blue            real     The mean value of the blue channel of the tracked object


C.8 Framelets

Description: This table stores each object detected by each camera in the surveillance network. The stored images are used to play back the captured video sequences, and to generate synthetic video for performance evaluation.

Field Name     Type        Comment
id             serial      Primary key
camera         int         Foreign key
videoseq       int         Foreign key
frame          int         The frame number
trackid        int         The track id of the object
bounding_box   box         The bounding box of the tracked object
path           char(100)   The original path of the framelet
data           oid         Reference to the framelet image stored in the database

C.9 Timestamps

Description: This table stores the timestamp of each processed image frame for all the cameras in the surveillance network.

Field Name   Type        Comment
id           serial      Primary key
videoseq     int         Foreign key
frame        int         The frame number
timestamp    timestamp   The time the frame was captured by the camera
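As an illustration of how these tables can be queried together, the sketch below retrieves the timestamped 3D trajectory of a single tracked object by joining tracks3d to timestamps on the video sequence and frame number. The connection string and the choice of track id are placeholder assumptions; the table and column names follow the definitions above.

```python
# Sketch: fetch one object's timestamped 3D trajectory from the surveillance database.
import psycopg2

QUERY = """
    SELECT t.frame, ts.timestamp, t.xloc, t.yloc, t.zloc
    FROM tracks3d AS t
    JOIN timestamps AS ts
      ON ts.videoseq = t.videoseq AND ts.frame = t.frame
    WHERE t.trackid = %s
    ORDER BY t.frame;
"""

def fetch_trajectory(trackid, dsn="dbname=surveillance"):
    # dsn is a placeholder connection string for the surveillance database.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY, (trackid,))
            return cur.fetchall()   # list of (frame, timestamp, xloc, yloc, zloc)
```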

