
Multimodal Verification of Identity for a Realistic Access Control Application

by

Nele Denys
Thesis submitted in partial fulfilment of the requirements for the degree Doctor Ingeneriae in Mechanical Engineering at the Rand Afrikaans University

Supervisor: Professor A.L. Nel

May 2004

Abstract

This thesis describes a real-world application in the field of pattern recognition. License plate recognition and face recognition algorithms are combined to implement automated access control at the gates of RAU campus. One image of the license plate and three images of the driver's face are enough to check whether the person driving a particular car into campus is the same as the person driving this car out. The license plate recognition module is based on learning vector quantization and performs well enough to be used in a realistic environment. The face recognition module is based on the Bayes rule and, while it performs satisfactorily, extensive research is still necessary before this system can be implemented in real life. The main reasons for failure of the system were identified as the variable lighting and insufficient landmarks for effective warping.

Opsomming

This dissertation describes a real-world application in the field of pattern recognition. Number plate and face recognition algorithms are combined to implement automated access control at the entrances of RAU. One image of the number plate and three images of the driver's face are enough to establish whether the driver bringing a particular car onto campus is the same person taking the car out. The number plate recognition module is based on learning vector quantization and performs well enough to be used in a realistic environment. The face recognition module is based on the Bayes rule and, although it performs well, considerable further research is needed before it can be implemented as a real security system. The main reasons for the shortcomings of the system were identified as the varying illumination and insufficient reference points for effective face warping.


Acknowledgements

I would like to take this opportunity to thank a number of people for their help and support during the course of my study. Professor A.L. Nel for introducing me to the field of Image Processing and for his help and support over the past nineteen months. My family, especially my parents, for their patience and encouragement. Gustaf for solving every computer crisis. My friends, Michele, Jaco, Mario and Denis, for supporting me during tough times and showing so much interest.


Table of contents
List of figures ..... vii
List of tables ..... ix
Chapter 1 Introduction ..... 1
  1.1 Biometrics ..... 1
  1.2 Proposed application ..... 3
  1.3 Research hypothesis ..... 5
  1.4 Hard- and software ..... 5
  1.5 Thesis overview ..... 6
Chapter 2 Literature survey ..... 7
  2.1 Introduction to pattern recognition ..... 7
  2.2 Pattern recognition pertaining to this study ..... 8
    2.2.1 Learning vector quantization ..... 9
    2.2.2 Bayes decision theory ..... 10
  2.3 License plate recognition ..... 11
    2.3.1 University of Stellenbosch, SA ..... 12
    2.3.2 University of Bristol, UK ..... 13
    2.3.3 Optasia Systems ..... 15
    2.3.4 Hi-Tech Solutions ..... 15
  2.4 Face recognition ..... 15
    2.4.1 Face localization ..... 16
      2.4.1.1 Knowledge-based methods ..... 17
      2.4.1.2 Feature-based methods ..... 17
      2.4.1.3 Template matching methods ..... 20
      2.4.1.4 Appearance-based methods ..... 20
    2.4.2 Face recognition ..... 24
      2.4.2.1 Template matching ..... 25
      2.4.2.2 Feature-based methods ..... 25
      2.4.2.3 Subspace methods ..... 26
        2.4.2.3.1 Eigenfaces and fisherfaces ..... 26
        2.4.2.3.2 Probabilistic subspaces ..... 30
        2.4.2.3.3 Independent component analysis ..... 31
        2.4.2.3.4 Kernel principal component analysis ..... 31
        2.4.2.3.5 Kernel Fisher linear discriminant ..... 33
      2.4.2.4 Neural networks ..... 33
      2.4.2.5 Graph matching and dynamic space warping ..... 35
    2.4.3 Face verification ..... 36
  2.5 Conclusion ..... 37


Chapter 3 License plate recognition ..... 39
  3.1 Introduction ..... 39
  3.2 Process description ..... 39
  3.3 Image acquisition ..... 41
  3.4 Pre-processing 1 ..... 42
  3.5 Localization ..... 42
    3.5.1 Introduction ..... 42
    3.5.2 Current system ..... 44
  3.6 Pre-processing 2 ..... 49
  3.7 Segmentation ..... 52
    3.7.1 Introduction ..... 52
    3.7.2 Current system ..... 54
  3.8 Feature extraction ..... 55
  3.9 Character recognition ..... 56
    3.9.1 Introduction ..... 56
    3.9.2 Current system ..... 57
  3.10 Post-processing ..... 59
  3.11 Database ..... 60
  3.12 Results ..... 60
    3.12.1 Performance of the localization module ..... 61
    3.12.2 Performance of the segmentation module ..... 61
    3.12.3 Performance of the character recognition and post-processing modules ..... 61
    3.12.4 Total performance ..... 63
    3.12.5 Execution times ..... 63
  3.13 Conclusions ..... 63
  3.14 Summary ..... 64
Chapter 4 Face recognition ..... 65
  4.1 Introduction ..... 65
  4.2 Process description ..... 65
  4.3 Image acquisition ..... 67
  4.4 Face localization ..... 68
    4.4.1 Colour detection ..... 68
    4.4.2 Face equalization ..... 70
  4.5 Facial feature detection ..... 72
    4.5.1 Bar filter ..... 73
    4.5.2 Grouping of feature candidates ..... 76
    4.5.3 Iris detection ..... 78
    4.5.4 Boundary detection ..... 80
    4.5.5 Rotation invariance ..... 84
  4.6 Face representation ..... 85
    4.6.1 Introduction ..... 85
    4.6.2 Pose estimation ..... 88
    4.6.3 Warping ..... 88
    4.6.4 Illumination normalization ..... 89
  4.7 Face verification ..... 90
    4.7.1 Theory ..... 90
    4.7.2 Current application ..... 93
  4.8 Results ..... 94
    4.8.1 Performance of the face localization module ..... 94
    4.8.2 Performance of the facial feature detection module ..... 95
    4.8.3 Performance of the face verification module ..... 100
    4.8.4 Total performance ..... 105
    4.8.5 Execution times ..... 106
  4.9 Conclusions ..... 106
  4.10 Summary ..... 106
Chapter 5 Combined application ..... 108
  5.1 Campus entry ..... 108
  5.2 Campus exit ..... 111
    5.2.1 Campus exit A ..... 111
    5.2.2 Campus exit B ..... 112
  5.3 Results ..... 114
    5.3.1 Performance of the license plate recognition ..... 114
    5.3.2 Performance of the driver verification ..... 114
  5.4 Conclusions ..... 116
  5.5 Summary ..... 117
Chapter 6 Conclusions and future research ..... 118
  6.1 Conclusions ..... 118
  6.2 Future research ..... 120
References ..... 121


List of figures
Figure 1-1: Example of Cantata visual program ..... 6
Figure 2-1: The basic stages involved in the design of a classification system ..... 8
Figure 3-1: A few examples of South African number plates ..... 40
Figure 3-2: The different steps in a number plate recognition system ..... 40
Figure 3-3: Typical data capture station ..... 41
Figure 3-4: Process flow for localization algorithm ..... 44
Figure 3-5: Image of a number plate together with a cross-section through the plate ..... 45
Figure 3-6: Image of a headlight together with a cross-section through the light ..... 45
Figure 3-7: Estimating top of plate by counting the number of vertical edges under the sliding bar ..... 47
Figure 3-8: Simple model of a number plate ..... 47
Figure 3-9: Least Squares Error: poor fit of the model due to outliers ..... 48
Figure 3-10: Examples of localized number plates ..... 50
Figure 3-11: Typical number plate and its histogram ..... 51
Figure 3-12: Transfer function used to normalise grey-levels of number plates ..... 51
Figure 3-13: Typical number plate after normalisation and its histogram ..... 52
Figure 3-14: Examples of number plates before and after pre-processing ..... 52
Figure 3-15: Example of a number plate and its segmented characters ..... 54
Figure 3-16: Example of a wrongly segmented number plate ..... 55
Figure 3-17: Example features: A & B are two-rectangle features, C & D are three-rectangle features and E is a four-rectangle feature ..... 56
Figure 3-18: Example of rectangle features applied ..... 56
Figure 3-19: A selection of the 2375 number plate images ..... 62
Figure 4-1: The different steps in a face recognition system ..... 66
Figure 4-2: Typical data capture station ..... 67
Figure 4-3: Typical image captured by the image acquisition ..... 68


Figure 4-4: Two training images and their colour distributions in the RGB colour space, together with the adopted linear discriminants ..... 70
Figure 4-5: A few examples of original images, images with pixels classified as non-face pixels set to white and resulting images of the face localization module ..... 71
Figure 4-6: Face equalization: (a) - (b) input, (c) output ..... 73
Figure 4-7: (a) A Gaussian derivative filter. (b) The surface plot of (a) ..... 74
Figure 4-8: Bar filter module: (a) face equalized input, (b) convolution output, (c) thresholding output, (d) non-maximal suppression output, (e) detected features, (f) feature centres ..... 75
Figure 4-9: The face model and the component face groups ..... 76
Figure 4-10: Flow diagram of grouping process ..... 77
Figure 4-11: Example of distance limits for a Hpair ..... 77
Figure 4-12: Grouping module: (a) input, (b) output ..... 79
Figure 4-13: Illustrating the procedure for locating the iris: (a) red channel eye windows, (b) thresholded eye-windows, (c) eye-windows after closing, (d) disk template ..... 80
Figure 4-14: Illustrating the procedure for iris detection: (a) input, (b) output ..... 80
Figure 4-15: Resulting image from iris detection module ..... 80
Figure 4-16: Estimating the face boundary with two partial ellipses ..... 81
Figure 4-17: (a) Ellipse parameters for a face (b) Ellipse parameters for a face in profile view ..... 82
Figure 4-18: Different steps of Canny edge detector ..... 84
Figure 4-19: Face boundary detection: (a) Edge map input (b) Face boundary output ..... 85
Figure 4-20: Gaussian bar filter applied over three different rotations: (a) -20°, (b) 0°, (c) 20° ..... 86
Figure 4-21: Facial feature detection applied on rotated face ..... 86
Figure 4-22: Pose estimation ..... 88
Figure 4-23: Warped and masked image ..... 90
Figure 4-24: (a) Decomposition of R^N into the principal subspace F and its orthogonal complement F̄ for a Gaussian density. (b) A typical eigenvalue spectrum and its division into the two orthogonal subspaces ..... 91
Figure 4-25: Examples where the face could not be located by the colour detector ..... 96
Figure 4-26: Example of falsely detected face ..... 96
Figure 4-27: Examples of faces with correctly detected features and boundary ..... 97
Figure 4-28: Examples of incorrectly detected and/or grouped features ..... 98
Figure 4-29: Examples of images with too low contrast ..... 98
Figure 4-30: Examples of incorrectly detected irises ..... 99
Figure 4-31: Examples of incorrectly located face boundaries ..... 99
Figure 4-32: Examples of partially occluded faces ..... 100
Figure 4-33: Difference due to illumination variation ..... 104


Figure 5-1: Static license plate image at entry ..... 109
Figure 5-2: Video sequence of driver's head at entry (every 5th frame shown) ..... 109
Figure 5-3: Static license plate image at exit (A) ..... 111
Figure 5-4: Video sequence of driver's head at exit (A) (every 5th frame shown) ..... 111
Figure 5-5: Static license plate image at exit (B) ..... 113
Figure 5-6: Video sequence of driver's head at exit (B) (every 5th frame shown) ..... 113


List of tables
Table 1-1: Strengths and weaknesses of biometric technologies ..... 2
Table 3-1: Possible syntax for South African number plates (L = letter, D = digit, * = any character) ..... 60
Table 3-2: Performance of character recognition and post-processing modules ..... 61
Table 4-1: Performance of colour detection module ..... 95
Table 4-2: Comparison of the performance rates when using different colour components ..... 102
Table 4-3: Comparison of the performance rates when using different features ..... 103
Table 4-4: Comparison of the performance rates when using different feature combinations ..... 103
Table 4-5: Comparison of the performance rates when testing frames maximum half an hour apart ..... 105
Table 5-1: Performance of the video-based face verification ..... 114
Table 5-2: Euclidean distance between warped images ..... 115

Chapter 1

Introduction
1.1 Biometrics
There is a strong need in our society for user-friendly systems that can secure our assets and protect our privacy without losing our identity in a sea of numbers [148]. At present, a PIN code is needed to get cash from an ATM, a password for a computer, a dozen others to access the internet, and so on. Biometric technologies are automated methods for verifying or recognizing the identity of a living person based on a physiological or behavioural characteristic [136]. Finger-scan, facial-scan, iris-scan, hand-scan and retina-scan are considered physiological biometrics, based on direct measurements of a part of the human body. Voice-scan, signature-scan and keystroke-scan are considered behavioural biometrics; they are based on measurements and data derived from an action and therefore indirectly measure characteristics of the human body [98]. Esoteric biometrics are biometrics that are still in the early experimental and developmental stages. A few examples are biometric technologies based on vein patterns, facial thermography, DNA, sweat pores, hand grip, fingernail bed, body odour, ear shape, gait, skin luminescence, brain wave patterns, blood pressure, footprints and foot dynamics, etc. Research efforts have started to tackle some of the more esoteric biometrics and it is reasonable to expect that, as algorithms improve and computing power gets cheaper, some of these will move from esoteric to mainstream [136]. In a biometric system, a physical trait needs to be recorded. The recording is referred to as the biometric enrolment. This enrolment is based on the creation of a template, i.e., a digital representation of the physical trait. The template is normally a long string of alphanumeric characters that describe, based on a biometric algorithm, characteristics or features of the physical

trait. The algorithm will also allow the matching of an enrolled template with a new template just created, called a live template. When a stored template and a live template are compared, the system calculates how closely they match. If the match is close enough, a person will be verified. If the match is not close enough, the person will be rejected [104]. A biometric system may operate in the identification mode or the verification mode [38, 120]. A biometric system operating in the identification mode recognizes an individual by searching the entire template database for a match. The objective is to answer the question "Who is this?" It conducts one-to-many comparisons to establish the identity of the individual. A biometric system operating in the verification mode authenticates an individual's identity by comparing the individual only with his/her own template(s). The question to answer is "Is this person who he/she claims to be?" It conducts one-to-one comparisons to determine whether the identity claimed by the individual is true or not. The leading biometric technologies each have their strengths and weaknesses, and are each well-suited for particular applications. There is no single best biometric technology, nor is it likely that any single technology will come to dominate in every area of the biometric industry. Instead, the requirements of a specific application determine which, if any, is the best biometric. Table 1-1 [98] gives an overview of the strengths and weaknesses of several mainstream biometric technologies.

Finger-scan
  Strengths: proven technology capable of high levels of accuracy; range of deployment environments; ergonomic, easy-to-use devices; ability to enrol multiple fingers.
  Weaknesses: inability to enrol some users; performance deterioration over time; association with forensic applications; need to deploy specialized devices.

Facial-scan
  Strengths: ability to leverage existing equipment and imaging processes; ability to operate without physical contact or user complicity; ability to enrol static images.
  Weaknesses: acquisition environment effect on matching accuracy; changes in physiological characteristics that reduce matching accuracy; potential for privacy abuse due to non-cooperative enrolment and identification.

Iris-scan
  Strengths: resistance to false acceptance; stability of characteristic over lifetime; suitability for logical and physical access.
  Weaknesses: false rejection and failure to enrol; user discomfort with eye-based technology; need for a proprietary acquisition device.

Voice-scan
  Strengths: ability to leverage existing telephony infrastructure; synergy with speech recognition and verbal account authentication; resistance to impostors; lack of negative perceptions associated with other biometrics.
  Weaknesses: effect of acquisition devices and ambient noise on accuracy; perception of low accuracy; lack of suitability for today's PC usage; large template size.

Retina-scan
  Strengths: resistance to false acceptance; stable physiological trait.
  Weaknesses: difficult to use; user discomfort with eye-related technology; limited applications.

Hand-scan
  Strengths: ability to operate in challenging environments; established, reliable core technology; general perception as non-intrusive; relatively stable physiological characteristic as basis.
  Weaknesses: inherently limited accuracy; form factor that limits scope of potential applications; price.

Signature-scan
  Strengths: combination of convenience and deterrence; resistance to impostors; leverages existing processes; perceived as non-invasive; users can change signatures.
  Weaknesses: inconsistent signatures lead to increased error rates; users unaccustomed to signing on tablets; limited applications.

Keystroke-scan
  Strengths: leverages existing hardware; leverages common authentication process; can enrol and verify users with little effort; usernames and passwords can be changed.
  Weaknesses: young and unproven technology; does not increase user convenience; retains many flaws of password-based systems.

Table 1-1: Strengths and weaknesses of biometric technologies
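To make the verification-versus-identification distinction described earlier concrete, the sketch below compares feature templates with a Euclidean distance and an illustrative threshold. It is not the implementation of any particular biometric product; the distance metric, the threshold value and the toy templates are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch of the two operating modes: one-to-one verification
# versus one-to-many identification. Templates are modelled as plain
# feature vectors; metric and threshold are assumptions, not a real product's.

def verify(live_template, enrolled_template, threshold=0.5):
    """One-to-one comparison: is the live template close enough to the claimed identity?"""
    distance = np.linalg.norm(live_template - enrolled_template)
    return distance <= threshold

def identify(live_template, database):
    """One-to-many comparison: return the best-matching enrolled identity."""
    best_id, best_dist = None, np.inf
    for identity, enrolled in database.items():
        d = np.linalg.norm(live_template - enrolled)
        if d < best_dist:
            best_id, best_dist = identity, d
    return best_id, best_dist

if __name__ == "__main__":
    db = {"alice": np.array([0.1, 0.9, 0.3]), "bob": np.array([0.8, 0.2, 0.5])}
    live = np.array([0.12, 0.88, 0.31])
    print(verify(live, db["alice"]))   # True: the claimed identity is accepted
    print(identify(live, db))          # ('alice', small distance)
```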

Currently, biometrics are mainly used in security operations [4]. A few examples are prison visitor systems, state benefit payment systems, border control, gold and diamond mines and bank vaults. Clearly, these are areas where security is an issue and fraud is a threat. Recent world events have led to an increased interest in security that will propel biometrics into mainstream use. Areas of future use include workstation and network access, internet transactions, telephone transactions and applications in travel and tourism.

1.2 Proposed application


The application described in this thesis will use a combination of two pattern recognition systems, one of them being a biometric technology. The aim is to protect the cars of people entering a big parking lot, e.g. a university campus. This will be done by linking every car to the person driving the car. The obvious characteristic to describe a car is its license plate. As described in the previous section, people can be characterized by many biometrics, such as a finger-scan, hand-scan, voice-scan, face-scan, etc. In this application, it was decided to characterize people by their faces. The Rand Afrikaans University (RAU) currently has two cameras installed at every gate. One of these cameras continuously grabs images of the front of cars leaving campus. The other camera grabs images of the driver's face when he/she swipes his/her staff or student card to open the boom in order to leave campus. The two cameras are linked, so that it is known which face corresponds with which number plate. When a claim is made about a stolen car, the video is manually rewound to the expected time of the crime, so a human observer can check if any information can be gathered from the video. The shortcomings of this system are clear. If the claim of a stolen car is made a few hours after the event, the robber will have a considerable head start and it will be difficult to arrest him/her. As a university is typically a venue where cars are parked for hours before the owner returns, it is

difficult to estimate the time of the crime. Therefore, a security guard will have to watch several hours of video, probably for every possible gate, to find the car under investigation. A human loses concentration very quickly in a situation like this. If the car is found on the video, it is very possible that the thief has his face covered and is unrecognizable, as all it takes to open the boom is a student or staff card. These are easy to obtain; many people even leave them in their cars. An automated version of this system would check, for every person driving off campus, whether he/she is the legitimate owner of the car. This can be done by comparing the person driving the car in with the person driving it out. As the boom will not open until the system has made a decision, a person cannot deliberately cover his/her face, as it would then be impossible to drive out. If the system cannot confirm that the person driving out is the legitimate owner, a human security guard will be called in to investigate the case. Security guards will no longer have to watch hours of boring videos, as the system will automatically save the proof of the crime. This system would eliminate the thief's advantage, as he/she would be caught before leaving campus. The implementation of the automated system requires the installation of two extra cameras at every gate, as the cars driving into campus need to be registered. The same setup will be used as for the outgoing vehicles. Instead of saving the raw video sequence, the images will be processed immediately. The vehicle's license plate and the person's face information will be extracted from the respective images and saved in a database. This leads to smaller storage needs than the original analogue video. When a vehicle is driving out, the same information will be extracted. The decoded number plate will serve as the search key in the database and the face information of the driver entering and the driver leaving will be compared. If a match is found, the boom will open and the car is allowed to drive out. This person's information and corresponding number plate will be wiped from the system. If the two faces do not match, the case will be further investigated by a security guard. Obviously this also saves storage of video material if the probability of correct identification is very high. The proposed application is not limited to a university campus. This idea can find application at the entrance/exit of every housing complex or shopping centre. As it is not necessary to know the specific identity of the drivers, no information on them is needed in advance. The drivers' information will only be saved in the database the moment they drive into the venue. The moment they leave, their information is erased. In the parking lot of a shopping centre, the system could be extended to automatically charge the appropriate parking fee, as the time between entering and leaving can easily be computed. However, as the driver's face is inside the car while the camera is outside, it is still preferable that people take a ticket, or in the case of a university campus, swipe a student or staff card, so that they are forced to open the window. The reflection of the glass would in all probability cause the system to operate less than optimally. It is the ultimate goal to make the system work with people not looking directly at the camera.
However, if they have to take a ticket, or swipe a card, the chances of grabbing high-quality images, where faces are not occluded and all facial features are visible, are greater. Face recognition is a good choice for this system as it is a non-invasive and widely accepted biometric. It is accepted by most people that humans readily use faces for recognition. Face biometrics can also operate with a relatively low-cost imaging device [101]. However, facial recognition can be greatly influenced by the acquisition environment. Variations in pose, rotation, illumination, etc. degrade the recognition rate. Changes in appearance, e.g. the presence of glasses, can also affect the matching accuracy of a facial recognition system.
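The entry/exit bookkeeping described above, with the decoded number plate as the database key, can be summarised in a short sketch. This is not the system's actual storage or matching code: the in-memory dictionary, the match_faces callback and the example plate number are illustrative assumptions.

```python
# Sketch of the proposed entry/exit workflow. The face matcher is passed in as a
# stand-in callback; records are kept in a dictionary keyed on the decoded plate.
entries = {}

def on_entry(plate_number, face_data):
    # Register the driver entering with this vehicle.
    entries[plate_number] = face_data

def on_exit(plate_number, face_data, match_faces):
    # Compare the driver leaving with the driver who entered with the same plate.
    stored = entries.get(plate_number)
    if stored is None:
        return "refer to security"       # no entry record for this plate
    if match_faces(stored, face_data):
        del entries[plate_number]        # legitimate owner: wipe the record
        return "open boom"
    return "refer to security"           # faces do not match: keep the evidence

if __name__ == "__main__":
    on_entry("BCX 123 GP", {"template": [0.1, 0.2]})          # hypothetical plate
    same = lambda a, b: a == b                                 # stand-in matcher
    print(on_exit("BCX 123 GP", {"template": [0.1, 0.2]}, same))   # "open boom"
```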

1.3 Research hypothesis


In the area of pattern recognition, humans are still far superior to machines. However, machines can bring certain advantages, such as not losing concentration and the ability to compute in parallel. These characteristics are useful when pattern recognition tasks need to be performed on a large scale. Though many face recognition techniques have been proposed and have demonstrated significant promise, the task of robust face recognition is still difficult. At least two major challenges can be observed:
- the illumination variation problem
- the pose variation problem
Either of these problems may cause serious performance degradation for most existing systems. Unfortunately, they are unavoidable when face images are acquired in an uncontrolled environment, as in surveillance video clips [148]. Most recognition performance rates mentioned in the literature are determined under controlled conditions. Illumination, pose and other variations are examined, but most training and test sets are still artificially created. People are asked to sit in a certain way, to look at a certain point, to tilt their head in a certain way, etc. Few results have been published so far on actual situations where no interaction exists between the person on the image and the person taking the image. The research hypothesis for this study can thus be formulated as follows: testing existing pattern recognition algorithms in realistic situations allows for better evaluation of their performance and causes of error. A realistic pattern recognition problem, commonly performed by a human observer, will be automated to test the performance of some well-known pattern recognition methods.

1.4 Hard- and software


The application in this study uses a digital video camera, which outputs digital colour video. A DVD Creation Station [22] was used to transfer the images from the video camera cassette to the hard drive of a personal computer (PC). These videos were de-interlaced and the separate frames saved in JPEG format. Thus the image input to the number plate and face recognition modules consists of a sequence of JPEG images. As the original video was captured in PAL, 25 frames per second are available. The number plate and face recognition modules are implemented on a PC using the visual programming and software development environments of Khoros Pro [47]. The software development environment Craftsman is used to create operators (stand-alone programs written in C). Every operator is represented by a glyph (a rectangular icon) in the visual programming environment Cantata. By placing connections between the glyphs, these stand-alone operators can be combined into comprehensive visual programs. An example of such a visual program, also called a workspace, can be seen in Figure 1-1.
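As a small illustration only (the thesis modules themselves are Khoros/C operators), the de-interlaced JPEG frame sequence could be iterated with timestamps derived from the 25 frames per second PAL rate; the directory name and file naming convention below are assumptions.

```python
import glob
import os

FRAME_RATE = 25.0  # PAL video, as stated above

def frame_sequence(directory):
    """Yield (timestamp_seconds, path) pairs for a directory of de-interlaced JPEG frames.
    Assumes the frames sort correctly by filename (e.g. frame_00001.jpg, frame_00002.jpg)."""
    for index, path in enumerate(sorted(glob.glob(os.path.join(directory, "*.jpg")))):
        yield index / FRAME_RATE, path

# Example: list the timestamp of every frame in a (hypothetical) entry-gate directory.
for t, path in frame_sequence("./entry_gate_frames"):
    print(f"{t:6.2f} s  {os.path.basename(path)}")
```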

1.5 Thesis overview


This thesis falls under the general field of pattern recognition, specifically the recognition of number plates and faces. These tasks are far from new, but existing algorithms still need to be implemented in realistic situations. This is necessary to evaluate their real-world performance and highlight the parameters that cause failure. The remainder of this thesis is outlined as follows. Chapter two introduces the study of pattern recognition. This chapter presents the pattern recognition methods used in the current application. An overview of related research in the fields of license plate recognition and face recognition will follow. The license plate recognition module is presented in chapter three. The different algorithms used for localization, segmentation and recognition are described and some preliminary results presented. Chapter four presents the face recognition module. A detailed description of the face localization, facial feature detection and face verification algorithms is included. A few preliminary tests are performed, the results of which are presented at the end of the chapter. The combination of the license plate and face recognition modules is the topic of chapter five. A detailed example shows the entire process that starts when a person drives his/her car through the gate of RAU campus. The process is concluded when the person is recognized as the legitimate owner of the car and can drive home, or is recognized as an intruder stealing the car and is detained by security. Chapter six rounds off with the final conclusions and some hints for future research.

Figure 1-1: Example of Cantata visual program

Chapter 2

Literature survey
2.1 Introduction to pattern recognition
Pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes [126]. Depending on the application, these objects can be images or signal waveforms or any type of measurements that need to be classified. These objects are generally referred to as patterns. When a set of training data is available and the classifier is designed by exploiting this a priori known information, the process is known as supervised pattern recognition. When no training data with a priori known class labels is available, the pattern recognition task is called unsupervised. In this case, a set of objects is given, and the goal is to unravel the underlying similarities, and cluster (group) similar objects together. We will attempt, as far as possible, to sidestep the philosophical question of what constitutes "similarity". Pattern recognition has a long history. Before the 1960s it was mostly the output of theoretical research in the area of statistics. However, the advent of computers increased the demand for practical applications of pattern recognition, which in turn set new demands for further theoretical development. As society evolves from the industrial to its post-industrial phase, automation in industrial production and the need for information handling and retrieval are becoming increasingly important. This trend has pushed pattern recognition to the high edge of today's engineering applications and research. Pattern recognition is an integral part of most machine intelligence systems built for decision making [126]. Some of the applications of pattern recognition can be found in the areas of machine vision, character (letter or number) recognition, computer-aided diagnosis, speech recognition, fingerprint identification, signature authentication, text retrieval and face and gesture recognition.

Figure 2-1 shows the various stages followed in the design of a classification system. The objects are captured by the system through a sensing device, e.g. a thermometer or a camera. The measurements used for the classification are known as the features. These are extracted from the object in the feature generation stage. In most cases, a larger than necessary number of feature candidates will be generated. The best features are selected in the feature selection stage. Having adopted the features appropriate for the specific task, the classifier is designed in the next stage. Surfaces are used to divide the multi-dimensional feature space into various class regions. Both linear and non-linear methods exist and an optimality criterion must be adopted. Finally, once the classifier has been designed, the performance of the designed classifier is assessed in the system evaluation stage. These stages are not independent, as can be seen from the feedback arrows. On the contrary, they are interrelated and, depending on the results, one may go back to redesign earlier stages in order to improve the overall performance.

Figure 2-1: The basic stages involved in the design of a classification system (patterns → sensor → feature generation → feature selection → classifier design → system evaluation, with feedback paths from the later stages back to the earlier ones)
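To make the stage decomposition of Figure 2-1 concrete, the following minimal sketch composes the stages into a pipeline. Every function is a placeholder (an assumption for illustration) to be replaced by the concrete algorithms of chapters 3 and 4; the feedback arrows of the figure correspond to re-running earlier stages when the evaluation result is poor.

```python
# Skeleton mirroring the stages of Figure 2-1; each stage is a placeholder.

def sense(pattern):                 # sensing device, e.g. a camera
    return pattern

def generate_features(raw):         # feature generation: compute candidate measurements
    return raw

def select_features(features):      # feature selection: keep the most discriminative ones
    return features

def classify(features):             # classifier design: assign a class region
    return "class_1"

def evaluate(predictions, labels):  # system evaluation: measure the success rate
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / max(len(labels), 1)

def pipeline(patterns):
    return [classify(select_features(generate_features(sense(p)))) for p in patterns]

print(evaluate(pipeline(["a", "b"]), ["class_1", "class_2"]))   # 0.5 on toy labels
```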

Pattern recognition methods can be divided into three main approaches [24, 41]: statistical approaches, structural or syntactic approaches and neural approaches. Statistical approaches assume that the underlying model is characterized only by a set of probabilities, but its structure is ignored. Most statistical approaches are based on Bayesian theory and can be implemented using discriminant functions. In structural or syntactic approaches, more attention is given to the structural interrelationships between features as explicitly apparent in the originally captured data. A typical application of a structural approach is scene analysis [25], where a structural description is preferred to a discriminant-based classification. Most syntactical pattern recognition approaches will generate and analyse complex patterns using a hierarchical decomposition into more manageable and quantifiable entities. The neural approaches use large interconnected networks of nonlinear units, simply called neural nets. The philosophical basis is to imitate the (partially understood) human neural system, which has a similar inter-connective structure and which performs complex pattern recognition relatively well in most cases.

2.2 Pattern recognition pertaining to this study


Machine or computer vision is an important area of pattern recognition. Computer vision refers to the automatic extraction of information regarding a scene or three-dimensional objects as captured in two-dimensional images by some photographic process [146]. It is a highly active research area with many applications such as automation (e.g. on the assembly line), inspection (e.g. of integrated circuit chips to detect defects in them), security (e.g. face and fingerprint recognition), medical diagnosis (e.g. detection of abnormal cells that may indicate cancer), remote sensing (e.g. automated recognition of possible hostile terrain to generate maps) and aids for the visually impaired (e.g. mechanical guide dogs). A more detailed list of computer vision applications can be found in [84].

In this study, two applications of pattern recognition in the computer vision area will be described. The proposed license plate recognition system uses learning vector quantization, a statistical method also often implemented as a neural network approach. The described face recognition system uses a statistical method based on Bayesian theory.

2.2.1 Learning vector quantization


In the late 1950s, Rosenblatt proposed the perceptron algorithm for training the perceptron, the basic unit used for modelling neurons of the brain. This simple and popular algorithm works as follows [126]. The N training vectors enter the algorithm cyclically, one after the other. If the algorithm has not converged after the presentation of all the samples once, then the procedure keeps repeating until convergence is achieved, that is, until all training samples have been classified correctly. Let w_i(t) be the weight vector estimate of class ω_i and x(t) the corresponding feature vector, belonging to class ω_i, presented at the t-th iteration step. The algorithm is stated as follows:

\[
\begin{aligned}
w_i(t+1) &= w_i(t) + \rho\, x(t) \qquad && \text{if } w_i^T(t)\,x(t) < w_j^T(t)\,x(t),\ j \neq i\\
w_j(t+1) &= w_j(t) - \rho\, x(t) \qquad && \text{if } w_i^T(t)\,x(t) < w_j^T(t)\,x(t),\ j \neq i\\
w_k(t+1) &= w_k(t) \qquad && \text{for } k \neq i \text{ and } k \neq j
\end{aligned}
\tag{2.1}
\]

where ρ is a suitably chosen positive constant.

In other words, if the current training sample is classified correctly, no action is taken. Otherwise, if the sample is misclassified, the weight vector is corrected by adding (subtracting) an amount proportional to x(t). In 1987, Kohonen [54] added another algorithm to the statistical pattern recognition methods, called learning vector quantization (LVQ). Learning is performed in a supervised, decision-controlled teaching process. Although the principle is related to the perceptron idea, it seems that this simple and still effective scheme had escaped the attention of researchers for a few decades [54]. This method is basically a nearest-neighbour method, as the smallest distance of the unknown vector from a set of reference vectors is sought. However, no statistical samples of tokens of known vectors are used; instead, a fixed number of reference vectors are selected for each class, the values of which are then optimized in a learning process. The learning scheme bears a superficial resemblance to the perceptron learning rule. However, the following features distinguish the present method from the perceptron algorithm:
- In training, only the nearest reference vector is updated.
- Updating of the reference vector is done for both correct and incorrect classification, whereas only the latter is updated in the classical method.
- The corrective process is metrically comparable with the criterion used for identification, which does not seem to be the case in the classical method.
As a result, it turns out that the reference vectors in a way approximate the probability density functions of the pattern classes. More accurately, the nearest neighbours define decision surfaces between the pattern classes which seem to approximate those of the theoretical Bayes classifier very closely, and in classification, the class boundaries are of primary importance; the description of the inside of the class density functions, and any particular distributional information, is less significant.

The algorithm described below is initialized with the first sample values of x, which are identified with m_i = m_i(0). Thereafter, these vectors are labelled, using a set of calibration samples of x with known classification. Distribution of the calibration samples to the various classes, as well as the relative numbers of the m_i assigned to these classes, must comply with the a priori probabilities P(ω_k) of the classes ω_k. Each calibration sample is assigned to that m_i to which it is closest. Each m_i is then labelled according to the majority of classes represented among those samples which have been assigned to m_i. To continue, a training set of samples with known classification is needed, which is applied iteratively during the learning steps. These samples can be used cyclically, or the training vectors can be picked randomly from this set (bootstrap techniques). Let the training vector x(t) belong to class ω_r. Assume that the closest reference vector m_c is labelled according to class ω_s. The supervised learning algorithm, which rewards correct classifications and punishes incorrect ones, is then defined as [54]:

\[
\begin{aligned}
m_c(t+1) &= m_c(t) + \alpha(t)\,[\,x(t) - m_c(t)\,] \qquad && \text{if } s = r\\
m_c(t+1) &= m_c(t) - \alpha(t)\,[\,x(t) - m_c(t)\,] \qquad && \text{if } s \neq r\\
m_i(t+1) &= m_i(t) \qquad && \text{for } i \neq c
\end{aligned}
\tag{2.2}
\]

where α(t) is a learning-rate factor with 0 < α(t) < 1.

So, unlike the perceptron algorithm, both correct and incorrect classifications result in an action from the algorithm. In the case of correct classification, the reference vector is moved closer towards the training sample; in the case of incorrect classification, the reference vector is moved away.
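A minimal sketch of update rule (2.2), commonly referred to as LVQ1, is given below. It is not the Khoros implementation used later in this thesis: the initialization of the reference vectors, the linearly decreasing learning rate and the toy two-class data are simplifying assumptions for illustration.

```python
import numpy as np

def lvq1(train_x, train_y, refs, ref_labels, alpha0=0.1, epochs=20):
    """Minimal LVQ1 sketch implementing update rule (2.2).
    refs: initial reference vectors, assumed already labelled via ref_labels."""
    refs = refs.astype(float).copy()
    n_steps = epochs * len(train_x)
    step = 0
    for _ in range(epochs):
        for x, r in zip(train_x, train_y):
            alpha = alpha0 * (1.0 - step / n_steps)               # decreasing learning rate
            c = np.argmin(np.linalg.norm(refs - x, axis=1))       # nearest reference vector
            if ref_labels[c] == r:
                refs[c] += alpha * (x - refs[c])                  # correct: move closer
            else:
                refs[c] -= alpha * (x - refs[c])                  # incorrect: move away
            step += 1
    return refs

def classify(x, refs, ref_labels):
    return ref_labels[np.argmin(np.linalg.norm(refs - x, axis=1))]

# Toy usage with two Gaussian classes (illustrative data, not thesis data).
rng = np.random.default_rng(0)
x0 = rng.normal([0, 0], 0.5, (50, 2)); x1 = rng.normal([2, 2], 0.5, (50, 2))
X = np.vstack([x0, x1]); y = np.array([0] * 50 + [1] * 50)
refs = lvq1(X, y, refs=np.array([[0.2, 0.1], [1.8, 2.1]]), ref_labels=[0, 1])
print(classify(np.array([1.9, 2.0]), refs, [0, 1]))   # expected: 1
```

As the prose above notes, the reference vectors end up describing the class boundaries rather than the full class densities, which is why a handful of reference vectors per class is often sufficient.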

2.2.2 Bayes decision theory


Given a classification task of M classes, ω_1, ω_2, ..., ω_M, and an unknown pattern, represented by a feature vector x, M conditional probabilities P(ω_i | x), i = 1, 2, ..., M, can be computed. Sometimes, these are also referred to as a posteriori probabilities. Each of them represents the probability that the unknown pattern belongs to the respective class ω_i, given that the corresponding feature vector takes the value x. Classifiers based on Bayes decision theory compute either the maximum of these M conditional probabilities or, equivalently, the maximum of an appropriately defined function of them. The unknown pattern is then assigned to the class corresponding to this maximum. When the a priori probabilities P(ω_i) and the class-conditional probabilities P(x | ω_i) are known, the a posteriori probabilities can be computed using Bayes rule [126]:

\[
P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)}
\tag{2.3}
\]

where the probability P(x) is given by:

\[
P(x) = \sum_{i=1}^{M} P(x \mid \omega_i)\,P(\omega_i)
\tag{2.4}
\]


The Bayes classification rule can now be stated as follows. The feature vector x is assigned to class ω_i if

\[
P(\omega_i \mid x) > P(\omega_j \mid x) \qquad \forall\, j \neq i
\tag{2.5}
\]

The Bayes classifier is known to be optimal with respect to minimizing the classification error probability. In many pattern recognition applications, the a priori and class-conditional probabilities are not known, but need to be estimated from the available experimental evidence, that is, from the feature vectors corresponding to the patterns of the training set. In many practical situations, the a priori probabilities can be easily estimated from the available training feature vectors. If N is the total number of available training patterns, and N_i of them belong to ω_i, then P(ω_i) ≈ N_i / N. A popular model for the class-conditional probabilities is the Gaussian or normal density function. Assume that the class-conditional probabilities in the l-dimensional feature space follow the general multivariate normal density:

\[
P(x \mid \omega_i) = \frac{1}{(2\pi)^{l/2}\,\lvert\Sigma_i\rvert^{1/2}}\,
\exp\!\left(-\tfrac{1}{2}\,(x - \mu_i)^T\,\Sigma_i^{-1}\,(x - \mu_i)\right),
\qquad i = 1, \ldots, M
\tag{2.6}
\]

where μ_i = E[x] is the mean value of the ω_i class and Σ_i = E[(x - μ_i)(x - μ_i)^T] the l × l covariance matrix. The estimation of the class-conditional probabilities is then reduced to an estimation of the mean and covariance matrix.
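As a sketch only (not the verification module developed later in this thesis), the Bayes classifier of equations (2.3) to (2.6) with Gaussian class-conditional densities and priors estimated as N_i / N can be written in a few lines; the toy data, the small regularisation term added to the covariance and the class ordering are assumptions for illustration.

```python
import numpy as np

class GaussianBayesClassifier:
    """Sketch of the Bayes classifier of equations (2.3)-(2.6): Gaussian class-conditional
    densities with per-class mean and covariance, priors estimated as N_i / N."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.covs = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)                # P(omega_i) ~ N_i / N
            self.means[c] = Xc.mean(axis=0)                  # mu_i
            self.covs[c] = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # Sigma_i
        return self

    def log_posterior(self, x, c):
        # log P(omega_i | x) up to the common term log P(x), using equation (2.6)
        d = x - self.means[c]
        cov = self.covs[c]
        log_likelihood = -0.5 * (d @ np.linalg.solve(cov, d)
                                 + np.log(np.linalg.det(cov))
                                 + len(x) * np.log(2 * np.pi))
        return log_likelihood + np.log(self.priors[c])

    def predict(self, x):
        # Bayes rule (2.5): pick the class with the largest posterior
        return max(self.classes, key=lambda c: self.log_posterior(x, c))

# Toy usage (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 40 + [1] * 60)
clf = GaussianBayesClassifier().fit(X, y)
print(clf.predict(np.array([2.8, 3.1])))   # expected: 1
```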

2.3 License plate recognition


License plate recognition (LPR) is an image-based technology which captures, interprets, records and processes the image of a license plate for use in a variety of applications [1, 18, 35, 37, 52, 99, 107, 114, 115, 122]. LPR can save time and alleviate congestion by allowing motorists to pass toll plazas or weigh stations without stopping. It can save money by collecting and processing vehicle data without human intervention. It can improve safety and security by controlling access to secured areas or assisting in law enforcement. LPR is one form of automatic vehicle identification, which can distinguish vehicles as unique. In some applications, such as electronic toll collection and red-light violation enforcement, LPR captures a license plate number so the vehicle owner can be identified and charged the appropriate toll or fine [35]. In other applications, such as commercial vehicle operations or secure-access control, a vehicle's license plate number is checked against a database of acceptable numbers to determine whether a truck can bypass a weigh station or a car can enter a gated community or parking lot [52]. LPR can be used to issue violations to speeders or simply to offer speeding drivers a reminder by displaying a license plate number with the vehicle's speed on a variable messaging sign. It can facilitate vehicle-emissions testing by recording a vehicle's license plate number while its exhaust is automatically analyzed, or it can be used at the roadside to help identify and fine emissions-law violators [99]. LPR can also help monitor the time it takes vehicles to travel from one point to

11

another, keeping traffic management centres apprised of transit times along streets and highways [115]. At international border crossings, license plate numbers can be checked against a database of "hot" cars to locate stolen vehicles or plates, or those registered to fugitives, criminals or persons suspected of smuggling [1, 115]. At least two other technologies are also being used for many of these applications: bar code-based identification and radio-frequency identification (RFID) [99]. LPR offers two advantages over these options. First, it does not require any special owner or driver compliance such as the use of an in-vehicle radio-frequency transponder or bar code identifier, because every road-legal vehicle already has a license plate. Second, because LPR relies on video technology, human interpretation can often overcome any failures of the system to interpret a license plate. However, LPR is less mature than bar code-based and RFID technologies, so concerns about its reliability and accuracy remain. Even in stationary applications, weather and precipitation can affect LPR performance, whereas RFID performance generally is not affected by the weather. In the last decade, quite a few license plate recognition systems have been developed around the world. However, developers are reluctant to publish details of their systems due to the commercial nature of the problem. Recognition rates are often described as "almost perfect" or "close to 100%". As most systems are also developed for a specific and unique application, it is difficult to compare the claims of system producers. Many examples of number plate recognition systems developed in recent years can be found on the internet. A few examples are worth mentioning because they currently have working applications in South Africa or because performance rates are available.

2.3.1 University of Stellenbosch, SA


During 1997, Coetzee et al. [10, 18] developed a PC-based number plate recognition system at the DSP Lab of the University of Stellenbosch. Digital grey-level images were acquired under varying lighting conditions where no special lighting was used. In a pre-processing step, the car images are thresholded using the Niblack algorithm. The calculated threshold adapts intelligently to its pixel neighbourhood and results in a binary image which facilitates rule-based, pixel-oriented image processing. The plate finder algorithm uses an adaptive bounding box searching technique to determine entities that could possibly be alphanumeric. An upside-down L-shaped construct of fixed dimensions is moved through the image. At every position, a set of rules is used to check whether a candidate digit occurs at that location. Next, the size of the bounding box is iteratively adapted to determine the size of the candidate digit. A second bounding box, shaped as a vertical bar, is used to group the candidate digits into candidate plates. Once the plate has been located, characters are segmented from the thresholded plate using blob-colouring. This is a region-growing method which operates on binary images. It labels pixels which form 8-connected contiguous regions (blobs), each region receiving a unique label. It thus colours the blobs. To discard artefacts, bounding boxes for each blob are calculated and a set of heuristics applied. Blobs that are too small or too large are discarded. This eliminates borders, noise blobs, hyphens and the Gauteng-type registration logo. The segmented characters are passed as 15 × 15 pixel bitmaps to a neural network-based optical character recognition (OCR) system. First, a novel dimension reduction technique, based on a binarization linear transform (BLT), reduces the neural network input from 225 to 50 features.


The neural network structure consists of six small multilayer perceptrons (MLPs) in parallel, each with six outputs, to recognise six characters. Each network is trained to produce zero outputs for classes it should not recognise, hence effectively increasing the training data six-fold for each network. Each sub-network is separately optimised for its own group of characters. The largest of all 36 outputs is chosen as the classification result, i.e., the winner-takes-all principle is applied. A classification confidence is estimated by taking the difference between the two largest outputs. The system was tested with a set of images not used during design or training. The size of the plates in the images was about 190 × 56 pixels. The system completed full plate recognition (including location and segmentation) in an average of 1.5 seconds. If pathological cases (i.e., plates with overhangs, tow hooks or very bad lighting) were discarded, 86.1% of the plates were located and read correctly. The system can recognize single- and double-line plates under varying lighting conditions and slight rotation. The greatest weakness of the system is the inability of the segmentation algorithms (both plate location and character extraction) to correctly process characters which are connected to each other or to the border. In practice, system performance is aided by supplying the system with large plate images. This can be achieved by ensuring correct framing and placement of the camera.
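A minimal sketch of the winner-takes-all decision and the confidence estimate described above, assuming the 36 outputs of the six sub-networks have already been computed (the networks themselves are not reproduced here and the activation values below are random placeholders):

```python
import numpy as np

def classify_character(outputs, labels):
    """Winner-takes-all over the concatenated outputs of all sub-networks.

    outputs : array of 36 activation values (6 sub-networks x 6 outputs each)
    labels  : the 36 character classes in the same order
    Returns the winning character and a confidence estimate, taken as the
    difference between the two largest outputs.
    """
    order = np.argsort(outputs)[::-1]          # indices sorted by descending activation
    winner = labels[order[0]]
    confidence = outputs[order[0]] - outputs[order[1]]
    return winner, confidence

# Hypothetical usage with random activations standing in for real network outputs
labels = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
outputs = np.random.rand(36)
print(classify_character(outputs, labels))
```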

2.3.2 University of Bristol, UK


In 1997, John Setchell used computer vision techniques to implement two road-traffic monitoring systems on a transputer-based platform at the University of Bristol [115]. One of these is a number plate recognition system that is capable of monitoring the output from a video camera and detecting when a vehicle passes by. At that moment an image is captured and the vehicle's number plate is located and deciphered. The video camera is positioned to survey a section of road. An automatic trigger continuously monitors the grey-level output of the camera. When it sees a vehicle traversing a specified point in the image, it generates a trigger which causes the system to store the current image and initiate the recognition process. The automatic trigger employs two virtual trip-wires overlaid on the image of the scene. Through low-pass filtering and computing the median deviation of the cross-section under the trip-wire, it can be determined if the trip-wires are broken or complete. If both are broken, in the right order, a signal is passed to the camera to store the image. The grabbed image is passed on to the finder which locates the number plate in the image. The location algorithm is based on the fact that the image of a number plate will contain a significant number of vertical edges. Every nth cross-section of the image is scanned for vertical edges, which are subsequently grouped based on a set of rules. This algorithm will yield a set of clusters, where each cluster represents a possible number plate. The top, bottom and angle of tilt of these possible plates are determined by fitting a simple model of a number plate to the previously detected vertical edges. In the process, a large proportion of the possible plates can be rejected if they do not adequately fit the model. Typically, only a couple of possible plates survive the filtering. These are scaled to a standard size and passed on to the reader.


Three different approaches are described for recognising the characters within the scaled plate:
- template matching;
- one large neural network with n outputs, one for each of the n possible characters;
- n small networks, each with a single output, where each network is trained to recognise one of the n possible characters.

As the individual characters are not segmented from the plate, template matching is performed by sliding each template over the scaled plate. A normalised correlation function is computed at every position, and whenever a maximum occurs, that position is noted as a possible position for that particular character, the value of the maximum giving the confidence in the match. The possible character is entered into a table. After matching all templates, the table is passed on to the syntax checker, where letters and digits are chosen from the table such that the syntax rules are adhered to and maximum overall confidence is achieved. The input to the neural networks is formed by sliding a window over the scaled plate. The number of nodes in the input layer is thus governed by the size of the window. An 8 × 8 window was used, so all networks had 64 nodes in the input layer. The number of nodes in the hidden layer was determined by experimentation. Between 5 and 36 hidden nodes were investigated for use in the large network, while for the small networks this number was between 5 and 20. The networks were trained using the Stuttgart Neural Network Simulator (SNNS) based on scaled conjugate gradient descent. In the recognition phase, the n network outputs represent a confidence in the match for the n possible characters. When a maximum in the confidence of a character occurs, the character is entered into a table. As before, this table is passed on to the syntax checker to extract the deciphered plate. After the syntax checker decides which characters and digits form part of the final plate, the deciphered number plate is searched for in a database of wanted vehicles. If a match is found, the fax module of the system will send a fax to the local police station alerting them that a stolen vehicle has just been spotted. Experiments were performed on six hours of video taken in a realistic environment. On a total of 4357 vehicles, the system generated 94.7% correct triggers, 5.3% false negatives and 1.7% false positives. The false negatives were all caused by two or more vehicles driving very close behind each other. Three thousand images of British number plates were fed to the finder, which located the plate correctly in 99.17% of the cases. This good performance was achieved despite many of the plates being at an angle, dirty, cracked, broken, or partially or completely in shadow. Of the three approaches tested in the reader, the single large neural network performed best, with the plate read completely correctly in 85.4% of the three thousand examples. The n small neural networks resulted in a performance of 69.4%, while template matching achieved 59.5%. The average total processing time per image was 3.18 seconds. The next two systems are included because they are commercially available in South Africa. However, their developers are reluctant to publish any details. Recognition rates are often described as almost perfect or close to 100%.
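To make the template-matching reader concrete, here is a minimal sketch (not Setchell's code) of sliding a single character template over a scaled plate image and recording positions where the normalised correlation exceeds a threshold; the threshold value and image sizes are assumptions for illustration only.

```python
import numpy as np

def normalised_correlation(window, template):
    """Zero-mean normalised correlation between two equally sized patches."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.sqrt((w * w).sum() * (t * t).sum())
    return 0.0 if denom == 0 else float((w * t).sum() / denom)

def match_template(plate, template, threshold=0.6):
    """Slide the template over the plate and return (x, score) candidates."""
    h, w = template.shape
    candidates = []
    for x in range(plate.shape[1] - w + 1):
        window = plate[:h, x:x + w]          # plate is scaled so rows align with the template
        score = normalised_correlation(window, template)
        if score > threshold:
            candidates.append((x, score))    # possible position and confidence of this character
    return candidates

# Hypothetical usage with random images standing in for a scaled plate and a character template
rng = np.random.default_rng(0)
plate, template = rng.random((20, 120)), rng.random((20, 10))
print(match_template(plate, template))
```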


2.3.3 Optasia Systems


IMPS, the license plate recognition system developed by Optasia Systems [100], can be used for several of the above-mentioned applications. It has been used for border control, red-light enforcement, electronic toll collection and access control to parking lots. Using advanced image processing and artificial intelligence techniques such as the AI best-first breadth-wise search algorithm, fuzzy logic, and combined template and neural network recognisers, it automatically locates vehicle license plates and reads the numbers accurately. The system works on monochrome images and uses extra lighting mounted near the camera. No details on South African applications or recognition rates are provided. A demo version can be downloaded from http://singaporegateway.com/optasia/.

2.3.4 Hi-Tech Solutions


A form of SeeCar, the number plate recognition system developed by Hi-Tech Solutions [36], is installed at the gates of a South African university. It is used for gate control and theft prevention. The license plate of the entering car is recorded along with the driver's face. This data is compared to the same information at the exit, and the guard can see if the person at the entrance to the university was different from the person driving the car out. There is no attempt to automatically detect if the person driving out is the same as the person driving in. No recognition rates have been published so far. A demo version can be downloaded from http://www.htsol.com/Index.html.

2.4 Face recognition


A general statement of the problem of face recognition can be formulated as follows [148]: "Given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces." A prerequisite for a fully automated face recognition system is a face detection and/or localization module. According to [142], face detection is defined as follows: "Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image, and, if present, return the image location and extent of each face." Face localization aims to determine the image position of a single face; it is a simplified detection problem with the assumption that an input image contains only one face. The goal of facial feature detection is to detect the presence and location of features, such as eyes, nose, nostrils, eyebrows, mouth, lips, ears, etc., with the assumption that there is only one face in an image. Finally, face recognition is the process in which an unknown face image is compared to a gallery of known faces to determine the person's identity. Face verification authenticates an individual's identity by comparing the individual only with his/her own pre-stored image(s). Many face recognition algorithms only start from the final step of face recognition by assuming that the face has been cropped from the image, by assuming some constraints about the face and/or background such that the face localization problem becomes trivial, or by assigning the facial features and/or contour manually. The rest of the chapter will give some background on how several authors perform face localization and recognition.


Face localization and recognition are among the most difficult problems in the world of pattern recognition. A few of the reasons, as reported in [142, 146] amongst others, are the following:
- Although most faces are similarly structured, with the facial features arranged in roughly the same spatial configuration, there can be a large component of non-rigidity and textural differences among faces. For the most part, these elements of variability are due to the basic differences in facial appearance between individuals: person A has a larger nose than person B, person C has eyes that are further apart than person D, while person E has a darker skin complexion than person F. Even between images of the same person's face, there can still be significant geometrical or textural differences due to changes in expression and the presence or absence of facial make-up. As a result, traditional fixed template matching techniques and geometric model-based object recognition approaches that work well for rigid and articulate objects tend to perform inadequately for localizing and recognizing faces.
- Face localization and recognition is also made difficult because certain common but significant features, such as glasses or a moustache, can either be present or totally absent from a face. Furthermore, these features, when present, can obscure other basic facial features (e.g. the glare in one's glasses may de-emphasize the darkness of one's eyes) and have a variable appearance themselves (e.g. glasses come in many different designs). All this adds more variability to the range of permissible face patterns that a comprehensive face localization and recognition system must handle.
- Face localization and recognition can be further complicated by unpredictable imaging conditions in an unconstrained environment. Because faces are essentially three-dimensional structures, a change in light source distribution can cast or remove significant shadows from a particular face, hence bringing even more variability to 2D facial patterns.
- The two-dimensional image of a face will vary with the viewing direction (technically called pose). The face can become partly occluded when the person does not look directly into the camera. A comprehensive face localization and recognition system should be able to handle different poses and orientations.
- Another factor to take into account is the simple daily variability, such as a change in make-up, a change in hairstyle, the trimming of a beard, etc. These small changes mainly pose a difficulty to the recognition step and to a lesser extent to the localization step.

2.4.1 Face localization


Over the last 15 years, great attention has been given to the face localization problem, and the number and variety of methods for face localization have become extensive. Face detection can actually be seen as a two-class (face vs. non-face) classification problem. Therefore, some techniques developed for face recognition have also been used to detect faces, but most of them are computationally very demanding and cannot handle large variations in face images [39]. The following summary of face localization techniques is based on a detailed review by Yang et al. [142]. Face localization methods can be classified into four major categories:
1. Knowledge-based or rule-based methods encode human knowledge of what constitutes a human face. The rules usually capture the relationships between facial features.
2. Feature invariant approaches aim to find structural features that exist even when the pose, viewpoint or lighting conditions vary, and then use these to locate faces.
3. Template matching methods store several standard patterns of a face to describe the face as a whole or the facial features separately. The correlations between an input image and the stored patterns are computed for detection.


4. Appearance-based methods learn the models (or templates) from a set of training images which should capture the representative variability of facial appearance. These learned models are then used for detection.

These categories are not absolute, as some face detection methods could be classified into more than one category. For example, the boundary between knowledge-based methods and some template matching methods is rather blurry, since the latter usually implicitly applies human knowledge to define the face templates. As will be seen in the following review, some approaches also use different steps which could each be classified into a different category.

2.4.1.1 Knowledge-based methods


In the knowledge-based approach, face detection methods are developed based on some rules derived from the researcher's knowledge of human faces [142]. It is easy to come up with simple rules to describe the features of a face and their relationships. For example, a face often appears in an image with two eyes that are symmetric to each other, a nose and a mouth. The relationships between features can be represented by their relative distances and positions. The facial features in an input image are extracted first, and face candidates are identified based on the coded rules. Usually, a verification process is applied to reduce false detections. One problem with this approach is the difficulty in translating human knowledge into well-defined rules. If the rules are too strict, they may fail to detect faces that do not pass all the rules. If the rules are too general, they may give many false positives. Moreover, it is difficult to extend this approach to detect faces in different poses since it is challenging to enumerate all possible cases. On the other hand, heuristics about faces work well in detecting frontal faces in uncluttered scenes. Liu et al. [79] presented a rule-based localization method. A Laplacian operator converts the grey-level image into a binary edge map. The peaks of the projection profiles of this edge map determine the possible eye locations. Let I(x, y) be the intensity value of an m × n image at position (x, y); the horizontal and vertical projections of the image are then defined as
H_I(x) = \sum_{y=0}^{n} I(x, y)   and   V_I(y) = \sum_{x=0}^{m} I(x, y)                (2.7)

Then, a combination of the grey-level distribution in the eye region and a genetic algorithm is used to eliminate false eye locations. Subsequently, a set of rules is applied to verify if the eyes are part of a face by searching for the presence of a mouth. The reported detection rate is 94% for a test set of 100 images with a simple background and 89% for a test set of 100 images with a complex background.
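A minimal numpy sketch of the projection profiles in equation (2.7), applied to a binary edge map, with candidate rows and columns taken from the profile peaks; the peak-picking below is a simple illustration and not the rule set of [79].

```python
import numpy as np

def projection_profiles(edge_map):
    """Horizontal and vertical projections of equation (2.7)."""
    HI = edge_map.sum(axis=0)   # H_I(x): sum over y for each column x
    VI = edge_map.sum(axis=1)   # V_I(y): sum over x for each row y
    return HI, VI

def profile_peaks(profile, k=2):
    """Return the indices of the k largest values of a projection profile."""
    return np.argsort(profile)[::-1][:k]

# Hypothetical usage on a random binary edge map
edge_map = (np.random.rand(120, 160) > 0.95).astype(np.uint8)
HI, VI = projection_profiles(edge_map)
print("candidate eye columns:", profile_peaks(HI), "candidate eye rows:", profile_peaks(VI))
```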

2.4.1.2 Feature-based methods


In the feature-based methods [2, 30, 31, 142], researchers try to find invariant features for face detection. The underlying assumption is based on the observation that humans can effortlessly detect faces in different poses and lighting conditions and, so, properties or features must exist which are invariant to this variability. Numerous methods have been proposed to first detect facial features and then to infer the presence of a face. Facial features such as eyebrows, eyes, nose, mouth and hair-line are commonly extracted using edge detectors. Based on the extracted features, a statistical model is built to describe their relationships and to verify the existence of a face. One problem with these feature-based algorithms is that the image features can be severely


corrupted due to illumination, noise or occlusion. Feature boundaries can be weakened, while shadows can cause numerous strong edges which together render perceptual grouping algorithms useless. The invariant features most commonly used are facial features (eyes, eyebrows, nose, mouth, etc.), texture and skin colour. Obviously, these features can also be combined into a multiple-features method. Leung et al. [63] introduced a face localization algorithm based on random labelled graph matching. The arrangement of facial features is viewed as a random graph in which the nodes correspond to the features and the arc lengths to the distances between the respective features. Since different people have different distances between their features, the arc lengths are modelled as a random vector drawn from a joint probability distribution. Candidate locations for various facial features are computed based on a template matching technique. The input image is convolved with a set of Gaussian derivative filters at different orientations and scales. The vector of filter responses at a particular spatial location R(x, y) serves as a description of the local image brightness. Candidates for the i-th facial feature are found by matching R(x, y) against a template (or prototype) vector of filter responses P_i. The distances between the facial features are modelled as a joint Gaussian distribution with the mean and covariance estimated from training data. The complexity of the graph matching can then be reduced by exploiting the statistical structure of the graph. The original features will only be coupled with candidates for the other features that appear close to the expected locations, where the meaning of 'close' is determined by the covariance estimates. Finally, the face candidates are ranked based on a hypothesis test. A correct highest-rank localization rate of 86% was achieved on a database with 150 images. If only the quasi-frontal views are taken into account, the performance increased to 95%. Yow and Cipolla [145, 146] presented a feature-based method that uses a large amount of evidence from the visual image together with its contextual evidence. First, they apply a second derivative Gaussian filter, elongated at an aspect ratio of three to one, to a raw image. Interest points, detected at the local maxima in the filter response, indicate the possible locations of facial features. Second, the edges around these interest points are examined and grouped into regions. The perceptual grouping of edges is based on their proximity and similarity in orientation and strength. Measurements of a region's characteristics, such as edge length, edge strength and intensity variance, are stored in a feature vector. From the training data of facial features, the mean and covariance matrix of each facial feature vector are computed. An image region becomes a valid facial feature candidate if the Mahalanobis distance between the corresponding feature vectors is below a threshold. The labelled features are further grouped based on model knowledge of where they should occur with respect to each other. Each facial feature and grouping is then evaluated using a Bayesian network. One attractive aspect is that this method can detect faces at different orientations and poses. The overall detection rate on a test set of 110 images of faces with different scales, orientations and viewpoints is 85%. However, the reported false detection rate is 28% and the implementation is only effective for faces larger than 60 × 60 pixels.
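As an illustration of the Mahalanobis-distance test used to accept candidate feature regions, a small hedged sketch follows; the feature measurements (edge length, edge strength, intensity variance) and the threshold are placeholders, not the trained statistics of [145, 146].

```python
import numpy as np

def mahalanobis(v, mean, cov):
    """Mahalanobis distance of a feature vector v from a learned feature class."""
    d = v - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def is_valid_candidate(v, mean, cov, threshold=3.0):
    """Accept the region as a facial-feature candidate if it lies close enough
    to the training distribution of that feature (threshold is illustrative)."""
    return mahalanobis(v, mean, cov) < threshold

# Hypothetical feature statistics (edge length, edge strength, intensity variance)
mean = np.array([20.0, 0.8, 15.0])
cov = np.diag([25.0, 0.04, 36.0])
print(is_valid_candidate(np.array([18.0, 0.7, 12.0]), mean, cov))
```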
Human faces have a distinct texture that can be used to separate them from other objects. Dai et al. [21] developed a method that infers the presence of a face through the identification of face-like textures. A facial texture model is derived using space grey level dependence matrices (SGLD) computed on sub-images of 16 × 20 pixels. Using the face texture model, they design a scanning scheme for face detection in grey-level or colour images. The reported detection rate is


perfect for a test set of 60 grey-level images with 150 faces [21] and for a test set of 30 colour images with 60 faces [142]. Human skin colour has been used and proven to be an effective feature in many applications, from face detection to hand tracking. Although different people have different skin colour, the major difference lies largely in intensity rather than in chrominance [142]. Several colour spaces have been utilized to label pixels as skin, including RGB, normalized RGB, HSV, YCrCb, YIQ, YES, CIE XYZ and CIE LUV [2, 15, 30, 43, 131, 142, 150]. Several methods have been proposed to build a skin colour model. The simplest model is to use a thresholding technique, either directly on the intensity of the image pixels [2] or on the histogram of the image [142]. In contrast to these non-parametric methods, Gaussian density functions and a mixture of Gaussians are often used to model skin colour. The parameters in a unimodal Gaussian distribution are often estimated using maximum likelihood [150]. The motivation for using a mixture of Gaussians is based on the observation that the colour histogram for the skin of people with different ethnic backgrounds does not form a unimodal distribution. The parameters in a mixture of Gaussians are usually estimated using an expectation-maximization algorithm [43, 139]. Wu et al. [138] presented a face detection method based on fuzzy theory. They used two fuzzy models to describe the distribution of skin and hair colour in the UCS (perceptually uniform colour system) colour space. These models are used to extract the skin and hair colour regions which are then compared to the pre-built head-shape models by using a fuzzy-theory-based pattern-matching method. Skin colour alone is usually not sufficient to detect faces [142]. Numerous methods for face detection or localization combine several facial features. Most of them use global features such as skin colour, size, shape or motion to find face candidates, and then verify these candidates using local, detailed features such as eyebrows, nose and hair. A typical approach begins with the detection of skin-like regions as described above. Next, skin-like pixels are grouped together using connected component analysis or clustering algorithms. If a connected region has an elliptic or oval shape, it becomes a face candidate. Finally, local features are used for verification [43, 130]. Hsu et al. [39] detect faces and facial features based on an elliptical skin colour model. Skin-like regions are detected over the entire image, and based on the spatial arrangement of these skin patches, face candidates are generated. Finally, eye, mouth and boundary maps are created to verify these face candidates. The detection rate on a collection of 382 family and news photos is 80.35%, while the false detection rate is 10.41%. Ahlberg [2] expresses the combined face localization and facial feature extraction problem as an optimization problem. He proposes a four-step system using colour discrimination, statistical pattern matching, iris detectors and deformable graphs. The face detection method proposed by Zuo and de With [150] combines skin-colour segmentation, histogram analysis for the detection of facial features (eyes, mouth, eyebrows and nostrils) and a verification step based on distribution statistics. The system was tested on several video sequences. In 90 to 95% of the frames the face region was correctly identified.
Many more examples of detection algorithms that combine multiple features can be found in [142].
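A hedged sketch of the simplest kind of skin-colour model mentioned above: a single Gaussian fitted to chrominance values, with the threshold and the rough Cr/Cb values chosen purely for illustration (practical systems tune the threshold or use a mixture of Gaussians):

```python
import numpy as np

def fit_skin_model(cr_cb_samples):
    """Fit a unimodal Gaussian to (Cr, Cb) chrominance samples of skin pixels."""
    mean = cr_cb_samples.mean(axis=0)
    cov = np.cov(cr_cb_samples, rowvar=False)
    return mean, cov

def skin_mask(cr_cb_image, mean, cov, threshold=6.0):
    """Label pixels whose squared Mahalanobis distance to the skin model is small."""
    flat = cr_cb_image.reshape(-1, 2) - mean
    inv = np.linalg.inv(cov)
    d2 = np.einsum('ij,jk,ik->i', flat, inv, flat)   # squared Mahalanobis distance per pixel
    return (d2 < threshold).reshape(cr_cb_image.shape[:2])

# Hypothetical usage with random chrominance data standing in for labelled skin samples
rng = np.random.default_rng(0)
skin_samples = rng.normal([150.0, 105.0], 5.0, size=(500, 2))   # rough Cr, Cb values
mean, cov = fit_skin_model(skin_samples)
image = rng.uniform(0, 255, size=(60, 80, 2))
print(skin_mask(image, mean, cov).sum(), "pixels labelled as skin")
```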


2.4.1.3 Template matching methods


In template matching, a standard face pattern is manually predefined or parameterized by a function. The correlation values between an input image and the standard patterns are computed to determine the existence of a face. This approach has the advantage of being simple to implement. However, it cannot effectively deal with variation in scale, pose and shape. Multi-resolution, multi-scale, sub-template and deformable template approaches have subsequently been proposed to achieve scale and shape invariance. Yang et al. [142] report on several authors using sub-templates for the eyes, nose, mouth and face contour to model a face. These sub-templates are defined in terms of line segments. Lines in the input image are extracted based on greatest gradient change or a Sobel operator and subsequently matched against the sub-templates. Often, the correlations between the input image and one type of sub-template (eyes or contour sub-templates) are computed first to detect candidate face locations. Then matching with the other sub-templates is performed at the candidate positions to verify the existence of a face. In other words, the first step determines the focus of attention or region of interest while the second step examines the details to determine the presence of a face. Kirchberg et al. [53] introduced a face detection technique based on template matching. The input image is reduced to an edge magnitude image by using the Sobel operator and a locally adaptive threshold filter to compensate for variable illumination. A face model was created based on a genetic algorithm. Experiments showed that the GA performs better when starting from random initialization than from a hand-drawn model. A modified Hausdorff distance is used to measure the similarity between an input image and the face model. The detection rate on a test set containing 2360 grey-level images was 94.2%. Kwon and da Vitoria Lobo [56] developed a detection method based on snakes and templates. An image is first convolved with a blurring filter, after which a morphological operator is applied to enhance the edges. Small curve segments are detected by dropping a population of snakelets on the image. Those snakelets that stabilized in shallow valleys are eliminated. The remaining snakelets are approximated by a set of ellipses using a Hough transform, and these ellipses are used as candidate face locations. For each of these candidates, a method similar to the deformable template method is used to find detailed features. An energy function links edges, peaks and valleys in the input image to corresponding parameters in a template. The best fit of the elastic model is found by minimizing this energy function. If a substantial number of facial features are found and if their proportions satisfy certain ratio tests, a face is considered to be found.
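Template matching against an edge image is often scored with a (modified) Hausdorff distance, as in the method of Kirchberg et al. The sketch below shows one plausible variant, the directed modified Hausdorff distance between two edge-point sets and its symmetric combination; the exact distance used in [53] is not reproduced here.

```python
import numpy as np

def directed_modified_hausdorff(A, B):
    """Average distance from each point in A to its nearest point in B.

    A, B : arrays of shape (n, 2) and (m, 2) holding edge-pixel coordinates.
    """
    # Pairwise Euclidean distances between the two point sets
    diff = A[:, None, :] - B[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    return float(dists.min(axis=1).mean())

def modified_hausdorff(A, B):
    """Symmetric version: the larger of the two directed distances."""
    return max(directed_modified_hausdorff(A, B), directed_modified_hausdorff(B, A))

# Hypothetical usage on two small point sets
A = np.array([[0, 0], [1, 2], [3, 4]], dtype=float)
B = np.array([[0, 1], [3, 3]], dtype=float)
print(modified_hausdorff(A, B))
```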

2.4.1.4 Appearance-based methods


In contrast with template matching methods where templates are predefined by experts, the templates in appearance-based methods are learned from example images. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face and non-face images. The learned characteristics are in the form of distribution models or discriminant functions that are subsequently used for face detection. Many appearance-based methods can be understood in a probabilistic framework [142]. An image or feature vector derived from an image is viewed as a random variable x, which is characterized by the class-conditional density functions p(x | face) and p(x | non-face). Bayesian classification or maximum likelihood can be used to classify a candidate image location as face or non-face. However, straightforward implementation of Bayesian classification is impossible because of the high dimensionality of x, because of the multi-modality of p(x | face) and p(x |


non-face) and because it is not yet understood if there exist natural parameterized forms for p(x | face) and p(x | non-face). Hence, many appearance-based methods utilize empirically validated parametric and non-parametric approximations for p(x | face) and p(x | non-face). Another approach in appearance-based methods is to find a discriminant function (a decision surface, separating hyperplane or threshold function) between the face and non-face classes. Conventionally, image patterns are projected to a lower-dimensional space and then a discriminant function is formed for classification (usually based on distance metrics) [129], or a nonlinear decision surface can be formed using multilayer neural networks [108]. Another approach is based on the use of support vector machines or other kernel methods. These methods implicitly project patterns to a higher-dimensional space and then form a decision surface between the projected face and non-face patterns. To train an appearance-based classifier, training examples of both faces and non-faces will be required. It is easy to collect a representative sample of face patterns, but much more difficult to get a representative sample of non-face patterns [30]. This problem is often solved by a bootstrap method that selectively adds images to the training set as training progresses [108, 132]. Starting with a small set of non-face examples in the training set, the classifier is trained with this database of examples. Then, the face detector is run on a sequence of random images. All the non-face patterns that the current system wrongly classifies as faces are collected and added to the training database as new non-face examples. This bootstrap method avoids the problem of explicitly collecting a representative sample of non-face patterns. Turk and Pentland [129] applied principal component analysis (PCA), also known as the Karhunen-Loève transform, to face detection. PCA is performed on a training set of face images to generate the eigenfaces which span a subspace (called the face space) of the image space. Images of faces are projected onto the subspace and clustered. Similarly, non-face training images are projected onto the same subspace and clustered. Images of faces do not change radically when projected onto the face space, while the projection of non-face images appears quite different. The presence of a face in a scene is detected by computing the distance between an image region and the face space for all locations in the image. The distance from face space is used as a measure of 'faceness', and by calculating the distance from face space for every location in the image, a face map is generated. A face can then be detected from the local minima of the face map. The same approach is also used for feature detection, such as eyes, nose or mouth [88, 125].
Therefore, the target density is decomposed into two components: the density in the principal subspace (spanned by the principal components) and its orthogonal complement (which is discarded in standard PCA). A multivariate Gaussian or a mixture of Gaussians is used to learn the statistics of the local features of a face. The maximum likelihood rule is then used for object detection based on these probability densities. When tested on a database of 2000 faces, 97% were correctly localized. Many authors have adapted this method for face localization and facial feature detection [86, 87].
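A minimal sketch of the distance-from-face-space idea used in the eigenface detector of Turk and Pentland: project a candidate window onto a few leading eigenfaces and measure the reconstruction error. The eigenfaces are assumed to have been computed from training faces beforehand; the random data below is only a stand-in.

```python
import numpy as np

def distance_from_face_space(window, mean_face, eigenfaces):
    """Reconstruction error of an image window with respect to the face space.

    window     : candidate image patch, flattened to a vector
    mean_face  : mean of the training faces (same length)
    eigenfaces : matrix whose columns are the leading eigenfaces
    """
    x = window - mean_face
    weights = eigenfaces.T @ x               # projection onto the face space
    reconstruction = eigenfaces @ weights    # best approximation within the face space
    return float(np.linalg.norm(x - reconstruction))

# Hypothetical usage: random orthonormal "eigenfaces" for 32x32 windows
rng = np.random.default_rng(0)
eigenfaces, _ = np.linalg.qr(rng.random((32 * 32, 10)))
mean_face = rng.random(32 * 32)
print(distance_from_face_space(rng.random(32 * 32), mean_face, eigenfaces))

# A face map is obtained by evaluating this distance at every image location;
# local minima of the map are candidate face positions.
```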


Principal component analysis is not the only possibility for dimensionality reduction [142]. Other methods proposed in the field of face localization and detection are factor analysis and Fisher's linear discriminant (FLD). Probabilistic methods have been presented for face detection using a mixture of factor analyzers (MFA). The parameters in the mixture model are estimated using an EM algorithm [142]. While PCA determines the most representative projection direction, FLD aims to find the most discriminant projection direction. Consequently, the classification results in the projected subspace may be superior to other methods. Similarly to the method proposed by Moghaddam and Pentland, the FLD technique models each subclass by a Gaussian density whose parameters are estimated using maximum likelihood. To detect faces, each input image is scanned with a rectangular window in which the class-dependent probability is computed. The maximum likelihood decision rule is used to determine whether a face is detected or not. Detection rates of 92.3% for MFA and 93.6% for the FLD-based method were reported on a test set of 225 images with 619 faces [142]. Many pattern recognition problems, such as optical character recognition, object recognition and autonomous robot driving, have successfully applied neural networks. Since face detection can be treated as a two-class pattern recognition problem, various neural network architectures have been proposed. The advantage of using neural networks for face detection is the feasibility of training a system to capture the complex class-conditional density of face patterns. However, one drawback is that the network architecture has to be extensively tuned (number of layers, number of nodes, learning rates, etc.) to get exceptional performance. Barker et al. [6] developed one of the first neural networks for face detection. Their network consists of three layers with 176 input units, 15 hidden units and one output unit. Learning is based on the standard backpropagation algorithm. The input images are typically 150 × 225 pixels in size, but the resolution is reduced with a scale factor of five before the images are scanned by the neural network. Only faces of the same size as the input layer could be detected. A method using hierarchical neural networks with local recurrent connectivity was proposed by Behnke for accurate eye localization [7]. The input images are presented to a neural abstraction pyramid. This hierarchical structure represents images at different levels of abstraction. High-resolution signal-like representations are present at the bottom of the pyramid, while more abstract representations can be found in higher layers, where the resolution decreases and the number of different features increases. Since the network's connections form horizontal and vertical feedback loops, the network's activity develops over time and contextual information can be incorporated. Only in 1.5% of the 1521 test examples were the eyes not localized accurately. Lin et al. [70] presented a face detection system using probabilistic decision-based neural networks (PDBNN). The architecture of PDBNN is similar to a radial basis function (RBF) network with modified learning rules and probabilistic interpretation. Instead of converting a whole face image into a training vector of intensity values for the neural network, feature vectors are extracted based on intensity and edge information in the facial region that contains eyebrows, eyes and nose.
The two extracted feature vectors are fed into two PDBNNs, and the fusion of the outputs determines the detection result. On a test set of 473 images, 98.5% of the errors are within five pixels. The same approach was also used for eye localization where, on a test set of 323 images, 96.4% of the errors are within three pixels. Rowley et al. [108] propose a multilayer neural network for face detection. Multiple neural networks and several arbitration methods are used to improve the performance over a single network. The system consists of two major components: multiple neural networks to detect face patterns and a decision-making module to render the final decision from the multiple detection

results. The first component is a neural network that receives a 20 × 20 pixel region of an image and outputs a score ranging from -1 to 1. Given a test pattern, the output of the trained neural network indicates the evidence for a non-face (close to -1) or face pattern (close to 1). To detect faces anywhere in an image, the neural network is applied at all image locations. To detect faces larger than 20 × 20 pixels, the input image is repeatedly sub-sampled, and the network is applied at each scale. Nearly 1050 face samples of various sizes, orientations, positions and intensities are used to train the network. In each training image, the eyes, tip of the nose, corners and centre of the mouth are labelled manually and used to normalize the face to the same scale, orientation and position. The second component of this method merges overlapping detections and arbitrates between the outputs of multiple networks. Simple arbitration schemes such as logic operators (AND/OR) and voting are used to improve performance. Detection rates on different databases and with different arbitration schemes vary from 76.9 to 100%. Aitkenhead and McDonald [3] trained a neural network composed of an input layer of size 32 × 32, two hidden layers of variable size and an output layer consisting of two nodes, corresponding to presence or absence of a face. Each layer was fully connected to its successor and the backpropagation algorithm was used to adjust the connection weights. During training, the network's control variables (number of nodes in the two hidden layers and the difference between positive and negative face presence output values) were subject to mutative evolutionary computation. The input images were converted into edge maps before being fed into the network. A detection rate of 94.7% with a false positive rate of 0.6% was obtained on a test set of 2000 images. Viola and Jones [132] describe a frontal face detection framework that is capable of processing grey-scale images extremely rapidly while achieving high detection rates. They define a set of rectangle features that are reminiscent of Haar basis functions. Given that the base resolution of the detector is 24 × 24, the exhaustive set of rectangle features is quite large: 45,396. Unlike the Haar basis, the set of rectangle features used is overcomplete. A new image representation, called the integral image, allows for very fast computation of these features. The learning algorithm is based on AdaBoost, which selects a small number of critical features and yields extremely efficient classifiers. The classifiers are combined in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. The final face detector is a 32-layer cascade of classifiers which includes a total of 4297 features. The first classifier in the cascade is constructed using two features and rejects about 60% of non-faces while correctly detecting close to 100% of faces. The next classifier has five features and rejects 80% of non-faces while detecting almost 100% of faces. The next three layers are 20-feature classifiers, followed by two 50-feature classifiers, five 100-feature classifiers and then twenty 200-feature classifiers. The detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image.
The detection rate on a test set of 130 images containing 507 frontal faces is 78.3% with only 10 false positives, while the detection rate increases to 93.7% when 422 false positives are allowed. The system was reported to be about 15 times faster than the detection system constructed by Rowley et al. Lienhart et al. [66] extended the approach of Viola and Jones in two ways. Firstly, a novel set of rotated Haar-like features was introduced. These enrich the features defined in [132] and can also be calculated efficiently. On average, the number of false positives was about 10% lower for the extended Haar-like feature set at comparable detection rates. Secondly, they compared different boosting algorithms and concluded that gentle AdaBoost outperforms discrete and real AdaBoost.
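The integral image that makes the rectangle features of Viola and Jones cheap to evaluate can be sketched in a few lines; the rectangle sum below uses the usual four-corner lookup, and the indexing convention is an illustrative choice rather than the one in [132].

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that ii[y, x] holds the sum of all pixels above and to the left."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x), using four lookups."""
    total = ii[y + h - 1, x + w - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0 and x > 0:
        total += ii[y - 1, x - 1]
    return int(total)

# Sanity check on a small example image
img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()

# A two-rectangle Haar-like feature is then simply the difference of two such sums,
# e.g. rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w).
```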

Support vector machines (SVMs) can be viewed as a way to train polynomial, neural network or radial basis function (RBF) classifiers. While most methods used to train these classifiers are based on minimizing the training error, i.e., the empirical risk, SVMs operate on another induction principle, called structural risk minimization, which aims to minimize an upper bound on the expected generalization error. An SVM classifier is a linear classifier where the separating hyperplane minimizes the expected classification error of the unseen test patterns. This optimal hyperplane is defined by a weighted combination of a small subset of the training vectors, called support vectors. Estimating the optimal hyperplane is equivalent to solving a linearly constrained quadratic programming problem. However, the computation is both time and memory intensive [142, 148]. Li et al. [65] propose an approach to multi-view face detection based on pose estimation. SVM regression is employed to estimate the head pose. Using the pose information, the problem of face detection across a large range of views is decomposed into a set of sub-problems, each of them for a small range of views. The task of detection is accomplished by exhaustively scanning and matching in sub-images which may contain faces. Three methods for multi-view face detection are compared: the eigenface and SVM-based methods extended to the multi-view case and a novel combination of the two methods aiming to improve the overall performance in terms of speed and accuracy. In this last method, the SVM-based classifier is only activated when an ambiguous pattern emerges and the eigenface stage cannot determine with high enough probability if the sub-image contains a face or not. The combined method demonstrates the best overall performance; it is almost as accurate as the SVM method and not significantly slower than the eigenface method. Liu et al. [77] propose a hierarchical shape model (HSM) which is a multi-resolution shape model corresponding to a Gaussian pyramid of the face image. The coarsest shape model can be quickly located in the lowest-resolution image. The located coarse model is then used to guide the search for a finer face model in the higher-resolution image. A global and local (GL) distribution is used to learn the likelihood of the joint distribution of facial features, while a novel hierarchical data-driven Markov chain Monte Carlo (HDDMCMC) approach is proposed to achieve the global optimum for face localization. The output result gives the location of the eyebrows, eyes, nose, mouth and face contour. The experimental results indicate that both modelling and learning of the distributions in HSM are accurate and robust.
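As a hedged illustration of training a face/non-face SVM on flattened image windows, the sketch below uses scikit-learn rather than anything described by the cited authors; the window size, labels and training data are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training data: 19x19 windows flattened to vectors, label 1 = face, 0 = non-face
rng = np.random.default_rng(0)
X_train = rng.random((200, 19 * 19))
y_train = rng.integers(0, 2, size=200)

# A linear SVM finds the separating hyperplane with maximal margin;
# the support vectors are the training samples that define it.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

window = rng.random((1, 19 * 19))
print("face" if clf.predict(window)[0] == 1 else "non-face",
      "margin distance:", clf.decision_function(window)[0])
```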

2.4.2 Face recognition


Within the last two decades, numerous algorithms have been proposed for face recognition. While much progress has been made toward recognizing faces under controlled conditions of lighting, facial expression and pose, reliable techniques for recognition under more extreme variations have proven elusive. The face recognition problem can be simply stated as follows. Given a set of face images labelled with the person's identity (the learning set) and an unlabelled set of face images from the same group of people (the test set), identify each person in the test images. Most face verification methods are derivatives of face recognition techniques where a threshold needs to be determined to discriminate between 'same person' and 'different person'. A comprehensive survey on face recognition methods, both in still images and video sequences, can be found in [148]. A short review of available techniques will now follow. Face recognition techniques can mainly be divided into three categories: template matching methods, feature-based methods and hybrid methods. However, some methods are difficult to


classify and are so widely used that they almost deserve their own category. A few examples are the eigenfaces technique developed by Turk and Pentland [129] and elastic graph matching [134].

2.4.2.1 Template matching


The simplest classification scheme is direct correlation in the image space [8]. An image, represented as a two-dimensional array of intensity values, is compared, using a suitable metric (typically the Euclidean distance), with a single template representing the whole face. There are, of course, several expansions to this basic principle. For example, the array of intensity values may be suitably pre-processed before matching (e.g. a gradient operation [12]). Several full templates for each face may be used to account for recognition from different viewpoints. Another variation is to use, even for a single viewpoint, multiple templates. A face is then stored as a set of distinctive smaller templates (eye-templates, nose-templates, mouth-templates, etc.) [12]. The disadvantages of template matching are well known. First, if the images in the learning and test set are gathered under varying conditions of lighting, pose, facial expression, etc., then the corresponding points in the image space may not be tightly clustered. So, in order for this method to work reliably under these variations, the learning set should densely sample the continuum of possible conditions. Second, face recognition by correlation is computationally expensive as the image of the test face must be correlated with each template in the learning set. Third, large amounts of storage are needed because the learning set must contain numerous images of each person [8, 58]. Lin et al. [67, 68] use a template matching technique based on the edge maps of facial images. The advantage of using edges for matching two objects is that this representation is robust to illumination change. A probe image and a gallery image are compared using a spatially (eigen-) weighted doubly Hausdorff distance. This modified Hausdorff metric incorporates information about the location of important facial features and alleviates the effect of facial expressions. The classifier is trained on 40 individuals, one image each. Tests on six images per person, 240 in total, result in recognition rates between 88% and 91%.
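A minimal sketch of direct correlation in the image space: classify a probe face as the gallery identity whose template lies closest in Euclidean distance. The gallery identities and images below are placeholders, not data from the cited work.

```python
import numpy as np

def nearest_template(probe, gallery):
    """Return the identity whose stored template is closest to the probe image.

    probe   : flattened intensity image
    gallery : dict mapping identity -> flattened template image
    """
    best_id, best_dist = None, np.inf
    for identity, template in gallery.items():
        dist = np.linalg.norm(probe - template)   # Euclidean distance in image space
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id, best_dist

# Hypothetical usage with random templates standing in for gallery faces
rng = np.random.default_rng(0)
gallery = {name: rng.random(64 * 64) for name in ("person_A", "person_B", "person_C")}
probe = gallery["person_B"] + 0.05 * rng.random(64 * 64)
print(nearest_template(probe, gallery))
```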

2.4.2.2 Feature-based methods


A face can be recognized by human observers even when the details of individual features (such as eyes, nose and mouth) are no longer resolved. The remaining information is, in a sense, purely geometrical and represents what is left at a very coarse resolution. Brunelli and Poggio [12] describe a feature-based approach using a vector of numerical data representing the position and size of the main facial features (eyes, eyebrows, nose and mouth) together with a description of the shape of the face outline. Position, scale and rotation invariance are obtained by normalizing the inter-ocular distance and the direction of the eye-to-eye axis. Recognition is then performed using a Bayes classifier. On a database with 47 people, four images per person, a recognition rate of 90% was obtained. Template matching on the same database resulted in perfect recognition [12]. In general, geometrical feature matching based on precisely measured distances between features may be most useful for finding possible matches in a large database such as a mug-shot album. However, it will be dependent on the accuracy of the feature location algorithms. Current automated feature location algorithms do not provide a high degree of accuracy and require considerable computation time [29]. The features used by Kepenekci and Tek [46] to represent a face are based on Gabor wavelets. The face image is convolved with a set of Gabor filters representing five different frequencies and


eight different orientations. Feature vectors containing the Gabor wavelet coefficients are extracted at image pixels where the filter response reaches a local maximum. Thus, a different number of feature vectors are saved for every image and every feature vector only represents a small part of the image. A similarity function ignoring phase is defined to compare two images described by a set of complex feature vectors. Several tests were performed where the recognition rate never falls below 95%, the errors being mainly due to different head poses. This method performs very well in the case of occluded faces. Garcia et al. [31] use a different set of wavelets. The facial image is convolved with a set of low- and band-pass filters. The convolution with the low-pass filter results in a so-called approximation image while the convolutions with the band-pass filters in specific directions result in so-called detail images. The filter responses are represented by their means and variances and concatenated into a feature vector containing 21 components. Classification of the feature vectors is based on a probabilistic measure derived from the Bhattacharyya distance. The algorithm was tested on a database of 200 individuals, where two images per person are used for training and one image per person for testing. The reported recognition rate was 90.83%. The line edge map (LEM) proposed by Gao and Leung [29] uses line-based face coding and line matching techniques to integrate geometrical and structural features. First, the edge map of the facial image is extracted. After thinning the edge map, a polygonal line fitting process is applied to generate the LEM of a face. A novel line segment Hausdorff distance (LHD) is proposed to match LEMs of faces. LHD has better discriminative power than standard pixel-wise Hausdorff distances because it can make use of the additional structural attributes of line orientation, line-point association and number disparity in the line edge map. In several experiments, the LEM technique achieved higher recognition rates than edge map methods and proved to be quite robust to lighting condition changes and size variations.
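A hedged sketch of building a small Gabor filter bank in the spirit of the Gabor-wavelet features described above; the kernel size, wavelengths and number of orientations are illustrative parameters, not those of [46].

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength):
    """Real part of a Gabor kernel with orientation theta and the given wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))   # Gaussian envelope
    carrier = np.cos(2 * np.pi * xr / wavelength)                # oriented sinusoidal carrier
    return envelope * carrier

# A bank of five frequencies and eight orientations, as in the description above
bank = [gabor_kernel(21, sigma=4.0, theta=o * np.pi / 8, wavelength=w)
        for w in (4, 6, 8, 12, 16) for o in range(8)]
print(len(bank), "filters, kernel shape", bank[0].shape)
```

Convolving an image with each filter in the bank and keeping the responses only at local maxima yields the sparse, image-dependent set of feature vectors described above.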

2.4.2.3 Subspace methods


As correlation methods are computationally expensive and require great amounts of storage, dimensionality reduction schemes are pursued. In the next section, both linear and non-linear methods to compute lower-dimensional manifolds are reviewed.

2.4.2.3.1 Eigenfaces and fisherfaces

A technique now commonly used for dimensionality reduction in computer vision is principal component analysis (PCA). PCA techniques, also known as Karhunen-Loève methods, choose a dimensionality-reducing linear projection that maximizes the scatter of all projected samples. In the field of face recognition the technique was introduced by Turk and Pentland [129] and is commonly known as the eigenfaces technique. A more recent technique, called fisherfaces, was proposed by Belhumeur et al. [8]. This method uses a dimensionality-reducing linear projection that maximizes the ratio of the between-class scatter and the within-class scatter.
More formally [8], consider a set of N sample images \{x_1, x_2, \ldots, x_N\} taking values in an n-dimensional image space, and assume that each image belongs to one of c classes \{X_1, X_2, \ldots, X_c\}. Also consider a linear transformation mapping the original n-dimensional image space into an m-dimensional feature space, where m < n. The new feature vectors y_k \in R^m are defined by the following linear transformation:

y_k = W^T x_k,     k = 1, 2, \ldots, N                (2.8)


where W \in R^{n \times m} is a matrix with orthonormal columns. The total scatter matrix S_T is defined as

S_T = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T                (2.9)

where N is the number of sample images and \mu \in R^n is the mean image of all samples. Note that S_T is the covariance matrix of the images x_k with respect to the mean \mu. After applying the linear transformation W^T, the scatter of the transformed feature vectors \{y_1, y_2, \ldots, y_N\} is W^T S_T W. In PCA, the projection W_{opt} is chosen to maximize the determinant of the total scatter matrix of the projected samples, i.e.,

W_{opt} = \arg\max_W |W^T S_T W| = [w_1 \; w_2 \; \ldots \; w_m]                (2.10)

where \{w_i \mid i = 1, 2, \ldots, m\} is the set of n-dimensional eigenvectors of S_T corresponding to the m largest eigenvalues. Since these eigenvectors have the same dimension as the original images, they are referred to as eigenpictures or eigenfaces [129]. If the classification is performed using a nearest neighbour classifier in the reduced feature space and m is chosen to be the number of images N in the training set, then the eigenface method is equivalent to the correlation method in section 2.4.2.1. A disadvantage of this approach is that the scatter being maximized is due not only to the between-class scatter, useful for classification, but also to the within-class scatter which, for classification purposes, is unwanted information. Thus, e.g. if PCA is presented with images of faces under varying illumination, the projection matrix W will contain principal components (i.e., eigenfaces) which retain, in the projected feature space, the variation due to lighting. Consequently, the points in the projected space will not be well clustered, and worse, the classes may be smeared together. Belhumeur et al. [8] propose a linear projection of the faces from the high-dimensional image space to a significantly lower dimensional feature space which is insensitive both to variation in lighting direction and facial expression. They choose projection directions that are nearly orthogonal to the within-class scatter, projecting away variations in lighting and facial expression while maintaining discriminability. Their method, fisherfaces, a derivative of Fisher's linear discriminant (FLD), maximizes the ratio of between-class scatter and within-class scatter. Let the between-class scatter matrix be defined as

S_B = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T                (2.11)

and the within-class scatter matrix be defined as

SW = ( xk i )( x k i )T
i =1 xk i

(2.12)


where μ_i is the mean image of class ω_i, and N_i is the number of samples in class ω_i. If S_W is non-singular, the optimal projection W_opt is chosen as the matrix with orthonormal columns which maximizes the ratio of the determinant of the between-class scatter matrix of the projected samples to the determinant of the within-class scatter matrix of the projected samples, i.e.,

W_{opt} = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} = [w_1 \; w_2 \; \ldots \; w_m] \qquad (2.13)

where {w_i | i = 1, 2, ..., m} is the set of generalized eigenvectors of S_B and S_W corresponding to the m largest generalized eigenvalues {λ_i | i = 1, 2, ..., m}, i.e.,

S_B w_i = \lambda_i S_W w_i, \qquad i = 1, 2, \ldots, m \qquad (2.14)

Note that there are at most c - 1 nonzero generalized eigenvalues, and so an upper bound on m is c - 1, where c is the number of classes, see [8]. Generally, in the technique of fisherfaces, PCA is first used to reduce the dimension of the feature space to N - c, so that the resulting within-class scatter matrix S_W is non-singular. Then, the standard FLD technique (equation (2.14)) is applied to reduce the dimension to c - 1. More formally, W_opt is given by

W_{opt}^T = W_{fld}^T W_{pca}^T \qquad (2.15)

where:

W_{pca} = \arg\max_W |W^T S_T W|

W_{fld} = \arg\max_W \frac{|W^T W_{pca}^T S_B W_{pca} W|}{|W^T W_{pca}^T S_W W_{pca} W|} \qquad (2.16)

For both eigenfaces and fisherfaces, the actual face recognition is performed in the reduced feature space. Let x be an unknown face and x_k^i be a representative of class ω_i. The classification rule is then specified as follows:

x \in \omega_i \ \text{if} \ \mathrm{distance}(y, y_k^i) = \min_j \mathrm{distance}(y, y_k^j), \qquad y = W^T x, \quad j = 1, \ldots, c \qquad (2.17)

The unknown face x is classified as belonging to the class ω_i whose representative yields the minimum distance. Several possible distance metrics have been used in the literature: Manhattan distance, nearest neighbour or Euclidean distance [8, 73, 129], Mahalanobis distance [20, 73, 83, 85, 86, 87], mixture-distance [60], cosine similarity measure [73, 75, 76], nearest feature line [64, 109], nearest feature angle [109], etc.

Recognition rates are difficult to compare as different training and test images are used. Turk and Pentland [129] report recognition rates between 64% and 96% on a database of 2500 face images of 16 different people.
One image per person is used for training. The eigenface algorithm is quite sensitive to size variation, while it performs well under lighting variation. Belhumeur et al. [8] used a database of 16 people with 10 images per person. A leaving-one-out strategy resulted in an error rate of 0.6%.

While the eigenface and fisherface techniques are very often used as described above [8, 83, 116, 129, 147], several improvements have been proposed over the last few years. The first adaptation was to use the training images to create multiple eigenspaces instead of one general eigenspace. Lizama et al. [80] use two eigenspaces for the recognition of frontal faces: one for males and one for females. Moghaddam and Pentland [88] use a view-based recognition paradigm for multiple head orientations. A separate eigenspace is defined for every available view. In this view-based, multiple-observer approach, the location and orientation of the target object is first determined by selecting the eigenspace which best describes the input image. This is accomplished by calculating the residual description error (the distance-from-face-space [129]) using each view space's eigenvectors. Once the proper view space is determined, the image is described using the eigenvectors of that view space, and then recognized.

Kim et al. [49, 50] first perform an expectation-maximization (EM) algorithm in the high-dimensional face space to cluster the training face images into a mixture of Gaussian distributions. A separate set of eigenfaces is calculated for every cluster and each image in the gallery is represented by an appropriate set of eigenfaces. Recognition is performed based on the minimum distance between the input image and a labelled gallery image, where the input image is projected onto the same set of eigenfaces as used for the gallery image. This so-called mixture-of-eigenfaces method performs well for databases with large variations in illumination and pose.

Several authors [33, 83, 86, 87, 88, 103, 127] have applied the eigenface method to individual facial features (e.g. eyes, nose, mouth, left and right side of the face), which results in eigeneyes, eigennoses, eigenmouths and eigensides. These eigenfeatures can be combined in the classification stage. In [83, 127] the global distance between the test and the training image is calculated as the weighted sum of the different distances in such a way that the contribution of each eigenfeature to the recognition stage is the same. However, these weighting factors could be altered to give more importance to certain features than to others. Thresholds could also be introduced to decide if a part of the face is useful for recognition. This approach has proven to be useful in the case of partial occlusions.

In [45, 117, 118] another adaptation of the eigenface method is proposed. Multiple face eigensubspaces are created, each one corresponding to one individual. Compared with the traditional single-subspace face representation, the proposed method captures the extra-personal differences to the largest possible extent, which is crucial to distinguish between individuals, while throwing away most of the intra-personal differences and noise in the input. The experiments described in [118] strongly support the proposed idea, as a 20% improvement in performance over the traditional eigenface method was observed when testing on the same face database.
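To make the projections of equations (2.10) and (2.13)-(2.16) and the classification rule (2.17) concrete, the following minimal sketch implements eigenfaces and fisherfaces with a nearest-neighbour classifier in NumPy/SciPy. It assumes flattened grey-level images of equal size stored as rows of a matrix; the function names and the Gram-matrix shortcut for computing the leading eigenvectors are illustrative implementation choices, not the code used in this study.

```python
import numpy as np
from scipy.linalg import eigh


def eigenfaces(X, m):
    """PCA / eigenfaces (eqs. 2.9-2.10). X: N x n matrix, one flattened image per row."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # For N << n the top eigenvectors of S_T are obtained cheaply from the N x N Gram matrix.
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(vals)[::-1][:m]
    W = Xc.T @ vecs[:, order]            # n x m
    W /= np.linalg.norm(W, axis=0)       # orthonormal columns (the eigenfaces)
    return mu, W


def fisherfaces(X, labels, m=None):
    """Fisherfaces (eqs. 2.13-2.16): PCA to N-c dimensions, then FLD to c-1 dimensions.
    labels: 1-D integer array with one class label per row of X."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, c = X.shape[0], len(classes)
    mu, Wpca = eigenfaces(X, N - c)
    Y = (X - mu) @ Wpca                  # S_W is (generically) non-singular in this space
    gm = Y.mean(axis=0)
    Sb = np.zeros((Y.shape[1], Y.shape[1]))
    Sw = np.zeros_like(Sb)
    for ci in classes:
        Yi = Y[labels == ci]
        mi = Yi.mean(axis=0)
        Sb += len(Yi) * np.outer(mi - gm, mi - gm)
        Sw += (Yi - mi).T @ (Yi - mi)
    vals, vecs = eigh(Sb, Sw)            # generalized eigenproblem S_B w = lambda S_W w (eq. 2.14)
    m = m or (c - 1)                     # at most c-1 nonzero generalized eigenvalues
    Wfld = vecs[:, np.argsort(vals)[::-1][:m]]
    return mu, Wpca @ Wfld


def classify(x, mu, W, gallery, gallery_labels):
    """Classification rule (2.17): nearest gallery image in the reduced feature space."""
    y = (x - mu) @ W
    Yg = (gallery - mu) @ W
    d = np.linalg.norm(Yg - y, axis=1)   # Euclidean distance; other metrics are possible
    return gallery_labels[np.argmin(d)]
```

Any of the distance metrics listed above could be substituted for the Euclidean distance in the last function without changing the rest of the sketch.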
Liu and Wechsler [75] introduced enhanced Fisher linear discriminant models (EFM) to improve the generalization capability of standard FLD-based classifiers (such as fisherfaces) by decomposing the FLD procedure into a simultaneous diagonalisation of the within-class and between-class scatter matrices. They applied this method in their Gabor-Fisher classifier [75], where images are represented by Gabor feature vectors derived from a set of Gabor wavelet transformations of the face images. This improves the classification performance, as a Gabor feature representation is a discriminant rather than an expressive representation method. The combination of a Gabor vector representation with 62 features, the enhanced FLD model and a cosine similarity measure leads to 100% recognition on a database of 200 people (two images per person were used for training, one was used for testing) [75].

2.4.2.3.2 Probabilistic subspaces

Moghaddam [90, 91, 92, 95] extended the eigenface method based on nearest-neighbour matching to a maximum a posteriori (MAP) matching rule using a Bayesian similarity measure derived from dual probabilistic subspaces. This method is called probabilistic visual learning [71, 89]. This similarity measure is based on the probability that the image intensity differences, denoted by Δ = I_1 - I_2, are characteristic of typical variations in appearance of the same object.
Two classes of facial image variations are defined: intrapersonal variations Ω_I (corresponding, for example, to different facial expressions of the same individual) and extrapersonal variations Ω_E (corresponding to variations between different individuals). The similarity measure S(Δ) is then expressed in terms of the intrapersonal a posteriori probability given by the Bayes rule:

S(\Delta) = P(\Omega_I \mid \Delta) = \frac{P(\Delta \mid \Omega_I)\, P(\Omega_I)}{P(\Delta \mid \Omega_I)\, P(\Omega_I) + P(\Delta \mid \Omega_E)\, P(\Omega_E)} \qquad (2.18)

Because of the high dimensionality of the data, the likelihoods P(Δ | Ω_I) and P(Δ | Ω_E) are obtained using an eigenspace density estimation technique, see [89]. The priors P(Ω) can be set to reflect specific operating conditions (e.g. the number of test images versus the size of the database) or other sources of a priori knowledge regarding the two images being matched. Based on this Bayesian formulation, the standard face recognition task (essentially an m-ary classification problem for m individuals) is cast into a binary pattern classification problem with classes Ω_I and Ω_E. This simpler problem is then solved using the maximum a posteriori (MAP) rule, i.e., two images are determined to belong to the same individual if P(Ω_I | Δ) > P(Ω_E | Δ) or, equivalently, if S(Δ) > 1/2. In a proper recognition task, the probe image will be classified as belonging to the same individual as the gallery image with the highest response for S(Δ). The training set consisted of 353 individuals with 947 images, while the test set contained 353 individuals with 882 images. The training and test sets consist of entirely different people. A baseline eigenface test obtained a recognition rate of 78.98%, while the probabilistic subspace method resulted in 94.71% recognition [92].

Moghaddam et al. [93, 94] describe a face recognition technique based on deformable intensity surfaces which incorporates both the shape and texture components of the 2D image. The intensity surface of the facial image is modelled as a deformable 3D mesh in (x, y, I(x, y)) space. A dense correspondence field (or 3D warp) between two images is obtained using a matching technique in terms of the analytic modes of vibration between the two surfaces. The same approach as in the previous paragraph is then used, where the displacement vector U replaces the difference vector Δ. The a posteriori probability S(U) is used to classify faces as belonging to the same class or a different class. The experimental data consisted of 700 training images and 1000 test images containing one or more images of every person in the training gallery, taken at different times, at different locations, and under different imaging conditions. A recognition rate of 97.8% was obtained, while the recognition rate of the baseline eigenface test was 79.5%. Out of all the extrapersonal warps performed in the experiment, only 2% were misclassified as being intrapersonal.
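A simplified sketch of this Bayesian similarity measure is given below. It estimates a Gaussian density for the intensity differences Δ of each class (Ω_I and Ω_E) in a low-dimensional PCA subspace, with an isotropic term for the residual (distance-from-feature-space) energy, and then evaluates equation (2.18). This is only a rough stand-in for the eigenspace density estimation of [89]; the subspace dimension, the residual-variance estimate and the equal priors are illustrative assumptions.

```python
import numpy as np


def fit_delta_model(deltas, m):
    """Gaussian model for the intensity differences of one class (Omega_I or Omega_E),
    estimated in an m-dimensional PCA subspace of the differences.
    deltas: K x n array, each row an intensity difference I1 - I2."""
    deltas = np.asarray(deltas, dtype=float)
    K, n = deltas.shape
    mu = deltas.mean(axis=0)
    D = deltas - mu
    vals, vecs = np.linalg.eigh(D @ D.T)             # Gram-matrix trick, K << n
    order = np.argsort(vals)[::-1][:m]
    lam = vals[order] / K                            # variances along the kept components
    U = D.T @ vecs[:, order]
    U /= np.linalg.norm(U, axis=0)
    # average variance in the discarded directions (isotropic residual term)
    rho = max((np.sum(D ** 2) / K - lam.sum()) / max(n - m, 1), 1e-8)
    return mu, U, lam, rho


def log_likelihood(delta, model):
    """log P(delta | class), up to a constant: in-subspace Mahalanobis term plus an
    isotropic term for the residual (distance-from-feature-space) energy."""
    mu, U, lam, rho = model
    d = delta - mu
    y = U.T @ d
    resid = d @ d - y @ y
    return -0.5 * (np.sum(y ** 2 / lam) + resid / rho
                   + np.sum(np.log(lam)) + (len(d) - len(lam)) * np.log(rho))


def similarity(delta, intra_model, extra_model, p_intra=0.5):
    """Eq. (2.18): posterior probability that the difference delta is intrapersonal."""
    li = log_likelihood(delta, intra_model) + np.log(p_intra)
    le = log_likelihood(delta, extra_model) + np.log(1.0 - p_intra)
    return 1.0 / (1.0 + np.exp(le - li))
```

Two images would then be declared to show the same person whenever similarity(Δ, ...) exceeds 1/2, or a stricter threshold if fewer false acceptances are required.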


Liu and Wechsler [71] also introduced two probabilistic reasoning models (PRM-1 and PRM-2) which combine principal component analysis (PCA) with a Bayes classifier. The conditional probability density function of each object class is modelled using the within-class scatter, and the maximum a posteriori (MAP) classification rule is implemented in the reduced PCA subspace. The first model (PRM-1) assumes that all the within-class covariance matrices are diagonal and identical. The other model (PRM-2) derives the within-class scatter by computing the averaged within-class covariance matrix based on all the within-class scatters in the reduced PCA subspace, diagonalising it, and using the ordered diagonal elements as the estimate. A face image x is then classified to the object class ω_i for which the a posteriori probability is the largest among all the classes:

P(x \mid \omega_i)\, P(\omega_i) = \max_j \{ P(x \mid \omega_j)\, P(\omega_j) \} \;\Rightarrow\; x \in \omega_i \qquad (2.19)

A dataset consisting of 1107 facial images corresponding to 369 subjects was used for experiments. Two images of each subject are used for training and the remaining image for testing. The peak recognition for both PRMs was about 96%.

2.4.2.3.3 Independent component analysis

Independent component analysis (ICA) searches for a linear transformation to express a set of random variables as linear combinations of statistically independent source variables [76]. ICA is similar to PCA, except that the distribution of the components is designed to be sub/super-Gaussian (usually by minimizing/maximizing fourth-order distribution cumulants such as kurtosis). Maximizing non-Gaussianity also promotes statistical independence, which is the desired goal. Like PCA, ICA is also a linear projection from a high-dimensional space R^N to a lower-dimensional space R^M, but with different properties [92]:
- approximate reconstruction: x ≈ Ay
- non-orthogonality of the basis A: A^T A ≠ I
- near factorization of the joint distribution P(y) into marginal distributions of the (non-Gaussian) independent components: P(y) ≈ ∏ p(y_i)

Liu and Wechsler [76] combine independent component analysis and probabilistic reasoning models in their independent Gabor features (IGF) method. The IGF method first derives a Gabor feature vector based upon a set of down-sampled Gabor wavelet representations of face images, incorporating local features at different orientations and scales. Independent component analysis then operates on the Gabor feature vector, whose dimensionality has been reduced by PCA, and derives independent Gabor features. Finally, the independence property of the independent Gabor features leads to the application of the PRM method for classification. Experiments were performed on a dataset that contains 600 frontal face images corresponding to 200 individuals, acquired under variable illumination and facial expression. When two images per person are used for training and the remaining image for testing, the IGF method achieves 98.5% correct recognition when using 180 features.
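The three properties listed above can be checked numerically with an off-the-shelf ICA implementation. The snippet below uses scikit-learn's FastICA on placeholder data standing in for (PCA-reduced) face or Gabor feature vectors; the number of components, the synthetic data and the tolerances are arbitrary illustrative choices, and this is not the IGF pipeline of [76] itself.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.standard_t(df=5, size=(60, 1024))      # heavy-tailed placeholder "face" vectors

ica = FastICA(n_components=20, random_state=0, max_iter=1000)
Y = ica.fit_transform(X)                       # rows of independent components y
A = ica.mixing_                                # basis (mixing) matrix, 1024 x 20

X_rec = ica.inverse_transform(Y)               # approximate reconstruction x ~ Ay + mean
print("relative reconstruction error:", np.linalg.norm(X - X_rec) / np.linalg.norm(X))
print("basis orthogonal?", np.allclose(A.T @ A, np.eye(A.shape[1]), atol=1e-6))
```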

2.4.2.3.4 Kernel principal component analysis

Kernel principal component analysis (KPCA) is a non-linear extension of PCA [51, 92]. The basic idea is to first map the input data x into a high-dimensional feature space F via a nonlinear mapping Φ and then perform a linear PCA in F. Since a PCA in F can be formulated in terms of
the dot products in F, this same formulation can also be performed using kernel functions (the dot product of two data in F) without explicitly working in F. Specifically, assuming that the mapped data are centred, i.e., \sum_{i=1}^{M} \Phi(x_i) = 0, where M is the number of input data, kernel PCA diagonalises the estimate of the covariance matrix of the mapped data Φ(x_i), defined as

C = \frac{1}{M} \sum_{i=1}^{M} \Phi(x_i)\, \Phi(x_i)^T \qquad (2.20)

To do this, the eigenvalue equation \lambda v = C v must be solved for eigenvalues \lambda \geq 0 and eigenvectors v \in F \setminus \{0\}. As C v = \frac{1}{M} \sum_{i=1}^{M} (\Phi(x_i) \cdot v)\, \Phi(x_i), all solutions v with \lambda \neq 0 lie within the span of \Phi(x_1), \ldots, \Phi(x_M), i.e., coefficients \alpha_i (i = 1, \ldots, M) exist such that

v = \sum_{i=1}^{M} \alpha_i \Phi(x_i) \qquad (2.21)

Then the following set of equations can be considered:

\lambda (\Phi(x_i) \cdot v) = (\Phi(x_i) \cdot C v) \quad \text{for all } i = 1, \ldots, M \qquad (2.22)

The substitution of (2.20) and (2.21) in (2.22) and the definition of an M × M matrix K by K_{ij} \equiv k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j)) produces an eigenvalue problem which can be expressed in terms of the dot products of two mappings. Solve

M \lambda \alpha = K \alpha \qquad (2.23)

for nonzero eigenvalues λ_l and column vectors \alpha^l = (\alpha_1^l, \ldots, \alpha_M^l)^T subject to the normalization condition \lambda_l (\alpha^l \cdot \alpha^l) = 1. Subsequently, the KPCA principal components of any input vector can be efficiently computed with simple kernel evaluations against the data set. The lth principal component y_l of x is given by

y_l = (v^l \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^l\, k(x, x_i) \qquad (2.24)

where v^l is the lth eigenvector of the feature space F. As with PCA, the eigenvectors v^l can be ranked by decreasing order of their eigenvalues λ_l and a D-dimensional manifold projection of x is y = (y_1, ..., y_D)^T, with individual components defined by (2.24).


A significant advantage of KPCA over neural networks is that KPCA does not require nonlinear optimization, is not subject to overfitting and does not require prior knowledge of network architecture or number of dimensions [92]. Furthermore, unlike traditional PCA, one can use more eigenvector projections than the input dimensionality of the data (since KPCA is based on the matrix K, the number of eigenvectors or features available is M). On the other hand, the selection of the optimal kernel (and its associated parameters) remains an engineering problem. Typical kernels include Gaussians \exp(-\|x_i - x_j\|^2 / 2\sigma^2) [141], polynomials (x_i \cdot x_j)^d [51, 141], and sigmoids \tanh(a (x_i \cdot x_j) + b) [92]. Note that classical PCA is a special case of kernel PCA with a first order polynomial kernel [141].

Kim et al. [51] adopt kernel PCA as a mechanism for extracting facial features. Through the use of a polynomial kernel, higher order correlations between input pixels can be utilized in the analysis of facial images. This amounts to identifying the principal components within the product space of the input pixels making up a facial image. Based on these features, face recognition is then performed using linear support vector machines (SVMs). An SVM is constructed for each class to separate that class from all the other classes. An expert based on the max-selector principle then arbitrates between the SVM outputs in order to produce the final decision. The experiments were performed on 40 people with five training images and five test images per person. The best performance achieved an error rate of 2.5%.
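The whole procedure of equations (2.20)-(2.24) reduces to an eigen-decomposition of the (centred) kernel matrix. The sketch below is a minimal NumPy implementation with a Gaussian kernel, applied to the training points themselves; the kernel width, the random data and the handling of new test points (which would require centring their kernel values consistently) are simplifying assumptions rather than the exact procedure of [51] or [141].

```python
import numpy as np


def kernel_pca(X, n_components, gamma=1e-3):
    """Kernel PCA with a Gaussian kernel (eqs. 2.20-2.24); returns the projections
    of the M training points themselves. X: M x n data matrix."""
    M = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))   # K_ij = k(x_i, x_j)
    # centre the data in feature space (the derivation assumes centred Phi(x_i))
    one = np.full((M, M), 1.0 / M)
    Kc = K - one @ K - K @ one + one @ K @ one
    # M * lambda * alpha = Kc * alpha  (eq. 2.23)
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    lam = vals[order] / M
    # normalisation lambda_l * (alpha^l . alpha^l) = 1
    alpha = vecs[:, order] / np.sqrt(np.maximum(lam, 1e-12))
    # y_l = sum_i alpha_i^l k(x, x_i); for the training points this is Kc @ alpha (eq. 2.24)
    return Kc @ alpha


# usage sketch on random vectors standing in for flattened face images
Y = kernel_pca(np.random.default_rng(0).normal(size=(100, 256)), n_components=10)
```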

2.4.2.3.5 Kernel Fisher linear discriminant

As with kernel PCA, the kernel Fisher linear discriminant (KFLD) method first maps the input data x into a high-dimensional feature space F via a nonlinear mapping Φ and then performs a Fisher linear discriminant analysis in F. KFLD can similarly be formulated in terms of kernel functions, without explicitly working in F; see [141] for the mathematical derivation.
Yang [141] investigates the use of kernel principal component analysis and kernel Fisher linear discriminant for learning low dimensional representations for face recognition. He calls these techniques kernel eigenfaces and kernel fisherfaces. Both polynomial and Gaussian kernels were used. The types of kernel and corresponding parameters (e.g. polynomial degree) were empirically determined to achieve the best results. Experiments on several datasets showed that nonlinear kernel methods perform up to 15% better than linear PCA and FLD methods.

2.4.2.4 Neural networks


Neural network techniques are very suitable for face recognition. Instead of recognizing a face by following a set of human-designed rules, neural networks learn the underlying rules from the given collection of representative examples [70]. Numerous forms of neural networks have been trained to perform face recognition. These range from the popular multi-layer perceptron trained by backpropagation [147] to more specialized techniques such as the probabilistic decision-based neural nets [70, 119]. Neural networks have their own disadvantages. When new images are added to the training set, the neural network needs to be retrained, which can take a long time for large training sets. The training time increases with the size of the training set, as every image has to be presented to the neural network at least once. Similarly, when an image is removed from the image gallery, the neural network needs to be retrained on the remaining images. Therefore, neural networks cannot generally be adapted to the problem of face verification with changing image galleries.


Intrator et al. [43] perform face recognition using a hybrid supervised/unsupervised neural network. The unsupervised training portion is aimed at finding features (such as clusters) to perform a dimensionality reduction. The supervised portion is aimed at finding features that minimize the classification error on the training set. This hybrid training is applied in a feed-forward neural network by modifying the learning rule of the hidden units to reflect the additional constraints. The supervised training uses the backpropagation algorithm, while the unsupervised training is done by projection pursuit learning. Before being fed into the neural network, the input grey-level image is transformed into a symmetry map, where every pixel is assigned a symmetry magnitude and a symmetry orientation. An error rate of 3.96% was reported on a database of 16 people, where 17 images per person are used for training and 10 for testing.

Lawrence et al. [58, 59] present a hybrid neural network solution which combines local image sampling, a self-organizing map (SOM) neural network and a convolutional neural network. The self-organizing map provides a quantization of the image samples into a topological space where inputs that are nearby in the original space are also nearby in the output space, thereby providing dimensionality reduction and invariance to minor changes in the image sample. The convolutional network provides partial invariance to translation, rotation, scale and deformation, and extracts successively larger features in a hierarchical set of layers. The method is capable of rapid classification, requires only fast, approximate normalization and pre-processing, and consistently exhibits better classification performance than the eigenfaces approach on the same database. Testing was performed on a database of 400 images of 40 individuals which contains a high degree of variability in expression, pose and facial details. When five images per person are used for training, an error rate of 3.5% was reported.

Espinosa-Duró and Faúndez-Zanuy [25] describe a face identification method that combines the eigenfaces theory with neural nets. The eigenfaces method is used as a feature extractor, producing an output represented as a pattern vector. A feed-forward multi-layer perceptron classifier trained by a backpropagation algorithm performs the identification. An 80-40-40 MLP achieved a recognition rate of 87% on a database of 40 people with five images for training and five images for testing.

Lin et al. [70] extended their PDBNN face detector to a PDBNN face recognizer. For a K-person recognition problem, a PDBNN face recognizer consists of K different subnets. A subnet j in the PDBNN recognizer estimates the distribution of the patterns of person j only, and treats the patterns which do not belong to person j as non-j patterns. This has the advantage that the trained system can easily be adapted to face verification applications. Due to the distributed structure of the PDBNN, any individual person's database may be individually retrieved for verification of his/her identity as proclaimed. Because the original PDBNN structure was sensitive to lighting variations, a hierarchical face recognition system was constructed where a face verifier is cascaded after the (original) face recognizer stage.
The face verifier is itself another PDBNN that verifies or rejects the recognition of the primary recognizer based on the classification of the forehead and hairline region. A database of 66 people, with variations in illumination, size and facial expression, was used for testing. When ten images per person are used for training and 16 images for testing, a testing accuracy of 97.75% was obtained.

Shen et al. [119] combine an eigenface classifier and a probabilistic decision-based neural network. The eigenface recognition stage outputs a list of the ten highest ranked classes. The PDBNN is then used to recognize the input face amongst these ten classes. The feature vectors used as input for the PDBNN are based on the horizontal projection profile of the input image. The system was also reversed, where the PDBNN outputs the top ten faces and the eigenface technique decides on the final classification. The system was trained on faces of 135 Chinese people in three different rotations (-22.5°, 0° and 22.5°). The test set contained two images of each individual rotated over -45° and 45°. Recognition rates of 68.2% (PCA/NN) and 69.3% (NN/PCA) were obtained.

Aitkenhead and McDonald [3] extended their face detection neural network with a face recognition module. The output of the face detection is first processed by a substructure detection network. This is a two-layer network whose training uses an adapted reinforcement technique based on Hebbian connection modification. The face recognition module is an extension of this network: a single self-connected layer that obeys the same node and connection adjustment rules but which, by being self-connected, is used to form attractors and thus recognize subjects. As with the detection network, these two networks were also subject to mutative evolutionary computation during training to determine the best values for the control variables. On a test set of 20 people with five images per person, a recognition accuracy of 74.1% was obtained.

Huang et al. [40] propose a neural network that can recognize human faces with any pose in the range of 30 degrees left to 30 degrees right out-of-plane rotation. Seven view-specific eigenspaces [88] are used to build one eigenface set for each view. View-specific neural networks are trained on the feature coefficients calculated in the corresponding eigenspace. A second neural network combines the outputs from the view-specific networks into a final decision. The view-specific networks are conventional feed-forward networks with one hidden layer, trained with the backpropagation algorithm. The combinatorial network outputs both the identity and the pose of the input image. On a test set of 400 images of five persons with unknown poses, an average recognition rate of 98.75% was obtained.

Stochastic modelling of non-stationary vector time series based on hidden Markov models (HMM) has been very successful for speech applications [29]. Samaria and Harter [113] applied this method to human face recognition. Faces can be intuitively divided into regions such as the eyes, nose, mouth, etc., which can be associated with the states of a hidden Markov model. A spatial observation sequence is extracted from a face image by using a band sampling technique. Each face image is represented by a one-dimensional vector series of pixel observations. Each observation is a block of L lines and there is an M-line overlap between successive observations. An unknown test image is first sampled to an observation sequence. Then, it is matched against every HMM in the model face database (each HMM represents a different subject). The match with the highest likelihood is considered the best match and the relevant model reveals the identity of the unknown face. Recognition rates of 95% were reported on a database of 40 individuals, where five images per person were used for training and five images per person for testing.

2.4.2.5 Graph matching and dynamic space warping


Wiskott et al. [134] propose a face recognition method based on elastic bunch graph matching. The basic object representation is a labelled graph, where edges are labelled with distance information and nodes are labelled with Gabor wavelet responses locally bundled in jets. Every jet consists of 40 coefficients, which are the convolution coefficients for Gabor kernels of eight different orientations and five different frequencies at the image pixel corresponding with the node. The nodes represent a set of fiducial points, e.g. the pupils, the corners of the mouth, the tip of the nose, the top and bottom of the ears, etc. The class-specific information is stored in the form of bunch graphs, one for each pose, which are stacks of a moderate number (70 in [134]) of different faces, jet-sampled in an appropriate set of fiducial points. Bunch graphs are treated as combinatorial entities in which, for each fiducial point, a jet from a different sample face can be selected, thus creating a highly adaptable model. This model is matched to new facial images to reliably find the fiducial points in the image. Jets at these points and their relative positions are extracted and combined into an image graph, a representation of the face which has no remaining variation due to size, position or in-plane orientation. For the purpose of recognition, image graphs are compared with the model graphs (in the gallery) at a small computing cost by evaluating the mean jet similarity. A recognition rate of 92% on a database of 216 frontal pose images (108 for training and 108 for testing) was reported. The wrongly recognized images were mainly due to differences in facial expression. When images rotated 22° from the frontal pose are tested against the same gallery, the recognition rate drops to 85% [134].

Sun and Wu [125] apply the technique of elastic graph matching to the entire (inner) face (similar to [134]) and to three smaller windows containing the left eye, nose and mouth respectively. The matching between the new image graph and the model graphs results in a cost function for each gallery face which is a weighted sum of the individual cost functions. The weights are proportional to the corresponding single recognition rates. Although the individual local features do not result in high recognition rates, their combination with the inner-face recognition leads to an increase in performance for the whole face from 91.6% to 96.7%.

Sahbi and Boujemaa [110, 111] propose a face recognition technique based on dynamic space warping. Salient feature extraction is obtained through entropy maximization of local grey-level histograms. The result is a binary image, which is subsequently subdivided into regions describing the shape variation between different faces. A new face X is transformed into a face Y in the gallery using three kinds of operations: the substitution of a region Xi by Yj, the deletion of a region from X and the insertion of a region into X. Dynamic space warping is used to avoid combinatorial explosion by introducing an ordering assumption: if Xi is matched with Yj, then Xi+1 is matched (if possible) only with Yj+k (k > 0) [33, 110, 111]. A scoring function based on maximum likelihood is computed for every gallery image and the image with the highest score is selected to be from the same class as the query image. This technique performs similarly to the eigenface technique in most tests, but shows a higher performance in the case of illumination variation and partial occlusions [33, 110, 111].
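For concreteness, the comparison of two face graphs by mean jet similarity could look roughly like the sketch below, using a magnitude-based similarity between jets of 40 complex Gabor coefficients. The random jets, the number of fiducial points and the similarity formula (magnitudes only, ignoring phase) are illustrative simplifications of the bunch graph matching of [134].

```python
import numpy as np


def jet_similarity(j1, j2):
    """Magnitude-based similarity between two jets of complex Gabor coefficients."""
    a1, a2 = np.abs(j1), np.abs(j2)
    return float(a1 @ a2 / (np.linalg.norm(a1) * np.linalg.norm(a2) + 1e-12))


def graph_similarity(image_jets, model_jets):
    """Mean jet similarity over corresponding fiducial points of two face graphs."""
    return float(np.mean([jet_similarity(a, b) for a, b in zip(image_jets, model_jets)]))


# usage: each graph is a list of 40-coefficient jets, one per fiducial point
rng = np.random.default_rng(1)
g1 = [rng.normal(size=40) + 1j * rng.normal(size=40) for _ in range(14)]
g2 = [rng.normal(size=40) + 1j * rng.normal(size=40) for _ in range(14)]
print(graph_similarity(g1, g2))
```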

2.4.3 Face verification


In this section, two systems specifically developed for face verification are briefly reviewed.

Kim et al. [48] propose a face verification algorithm based on multiple feature combination and a support vector machine. The feature vector describing a face is a combination of eigenface coefficients, eigenfeature coefficients and edge distribution coefficients. These features are projected onto a new intra-person/extra-person similarity space based on eight similarity measures, and finally evaluated by a support vector machine supervisor. Experiments were performed on a realistic and large database where the training and testing sets are totally independent. Therefore, people who participated in making the eigenspaces do not appear in the database used for testing. An equal error rate of 2.9% was achieved.

Romano et al. [106] describe a real-time face verification system for use in screen locking applications. The system is based on template matching. Feature templates (left and right eye, nose and mouth) are extracted from the original grey-level image and from the result of filtering the reference image with several differential operators (e.g. a Laplacian operator). The verification stage computes the similarity between the new image and the reference image using normalized cross-correlation as a distance measure. The training samples form two classes, a positive class and a negative class.
The similarity between the new input image and the reference image is compared on a nearest neighbour basis with the means of these two classes. The closer of the two means determines whether the individual is accepted or rejected. The system yields an estimated false entry rate of less than 0.5% and may be tuned to be more or less tolerant depending on the security level demanded by the application.
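A bare-bones version of such a correlation-based verification rule is sketched below: templates are compared with zero-mean normalised cross-correlation and the mean similarity is assigned to whichever class mean (positive or negative) is nearer. The function names, the use of the mean similarity as a one-dimensional score and the pre-computed class means are assumptions for illustration, not the actual system of [106].

```python
import numpy as np


def ncc(a, b):
    """Zero-mean normalised cross-correlation between two equally sized grey-level patches."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def verify(new_templates, reference_templates, pos_mean, neg_mean):
    """Accept the claimed identity if the mean template similarity is closer to the
    positive-class mean than to the negative-class mean."""
    s = float(np.mean([ncc(t, r) for t, r in zip(new_templates, reference_templates)]))
    return abs(s - pos_mean) < abs(s - neg_mean)
```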

2.5 Conclusion
Two applications in the field of pattern recognition will be investigated in this study. The first one deals with the recognition of car license plates, while the second one aims to verify car drivers by means of their face.

A few working examples of number plate recognition systems were described in this chapter. Neural network based methods lead to satisfying results for the recognition of number plate characters. Traditional approaches are commonly based on the multi-layer perceptron and the backpropagation algorithm. In this study, however, a different approach will be investigated. The learning vector quantization algorithm, first introduced by Kohonen in 1987 [54], will be implemented to perform the character recognition. Chapter three describes the license plate recognition module of the current system.

Related research in the fields of face detection and face recognition was comprehensively reviewed in the current chapter. Several reasons why face detection and/or recognition is such a difficult problem were mentioned. In the current application of driver identification, the system will definitely have to deal with pose, rotation and illumination variation, in addition to the influence of common but significant features such as glasses, make-up, different hairstyles, etc.

Knowledge-based and template matching face detection methods generally do not cope well with pose variation. Appearance-based methods use a set of training images to capture the representative variability of facial appearance. In this way, a probabilistic framework or discriminant function is learned to distinguish between the face and non-face classes. Feature-based methods aim to find structural features that exist even when the viewpoint, pose or lighting conditions vary, and use these to locate faces. Both appearance-based and feature-based methods can deal with pose, rotation and illumination variations. The training set needed by appearance-based methods can be difficult to collect. Therefore, in this study it was chosen to follow a feature-based approach to detect both the faces and the facial features (e.g. the eyes, eyebrows, nose and mouth).

The current application is one of face verification. As mentioned before, most face verification methods are derivatives of face recognition techniques where a threshold needs to be determined between "same person" and "different person". As the application in this study is installed at the gates of a university campus, it must be able to cope with thousands of people every day. It is not known beforehand who will pass the camera and the first images are only gathered the moment a person drives in. Therefore, the system cannot be trained beforehand and there is no time to retrain after every single person enters the campus. This means that neural networks cannot be used, as they require long training times and retraining after every new entry.

Template matching methods need to store a large number of templates per person to be able to cope with variations in pose, facial expression, illumination, etc. As a person generally only spends a few seconds in front of the camera, no large variation in facial expression or illumination can be captured.
Therefore, template matching methods cannot be considered for the current application. Feature-based face recognition methods are very sensitive to the preceding face and feature detection. Current automated feature location algorithms do not provide a high enough degree of accuracy, and therefore feature-based methods were abandoned.

Both subspace methods and graph matching techniques can be trained with an independent training set. This means that facial images can be captured in advance to compose the subspace or compute the graph jets. When the system is in use, the images of people driving into campus are projected onto the subspace or graph and saved in a smaller format. The images of people driving out are also projected onto the same subspace or graph and compared with the previously stored images. This method does not need retraining after every new entry and the data of people matched by the system can simply be erased. The images composing the training set do not have to contain any of the images later encountered during use of the system. However, it is recommended to use the same camera set-up, so as to capture as much as possible of the illumination, pose and rotation variation that will be encountered in the working environment.

The current application will use a probabilistic subspace technique, as this face recognition method is very easily converted to a face verification technique. The system is trained to distinguish between two classes: intrapersonal variations Ω_I and extrapersonal variations Ω_E. Two images belong to the same person if

P(\Omega_I \mid \Delta) > P(\Omega_E \mid \Delta) \qquad (2.25)

If equation (2.25) is not fulfilled, the images belong to different persons. Chapter four will describe the face detection and face verification algorithms in more detail.


Chapter 3

License plate recognition


This chapter presents the license plate recognition system. This work builds upon several algorithms found in the literature, improved to work in the particular situation for access to a university campus. The system components have been subjected to a number of tests. Details of these tests and the results are presented at the end of the chapter.

3.1 Introduction
South African number plates are quite diverse. Every province creates its own type of number plate. A plate contains between five and eight characters, which appear in blue, green or black on a white or yellow background. Some plates include a landscape in the background or a provincial crest. In some provinces old plates are also still in use. A few examples of South African number plates are given in Figure 3-1.

3.2 Process description


An overview of the number plate recognition system can be seen in Figure 3-2. The video camera is positioned to survey a section of the road where the cars stop in order for the driver to swipe his/her staff or student card. When a vehicle is in a suitable position on the road, the system acquires an image from the camera. This image is scanned to locate the vehicle's number plate. Once found, the number plate is segmented into the different characters, which are deciphered by the optical character recognition. The deciphered number plate characters are passed through a post-processing step which ensures the decoded plate adheres to predefined syntax rules. In the case of incoming vehicles, an image of the vehicle, the deciphered plate and a link to the driver information are stored in a database. In the case of vehicles leaving the campus, the plate is compared with the database and the corresponding driver information is passed on to the face recognition module.

Figure 3-1: A few examples of South African number plates (Gauteng, Kwazulu Natal, North West, Mpumalanga, Northern Province, Free State, Eastern Cape, Western Cape, Northern Cape, Kwazulu Natal (old), Free State (old), Northern Province (old))

Figure 3-2: The different steps in a number plate recognition system (Image Acquisition, Pre-processing 1, Localization, Pre-processing 2, Segmentation, Feature Extraction, Character Recognition, Post-processing, Database)


Each of the components which make up the system will now be discussed in more detail.

3.3 Image acquisition


Commonly used cameras in LPR systems are black and white, colour and infra-red cameras. The latter outputs monochrome images since the infra-red spectrum is above the normal colour spectrum [37]. The luminance or intensity data from the infra-red spectrum is mapped to normal black and white luminance. The current system uses a normal colour camera, positioned to take pictures of the front of the car when the driver swipes his/her staff or student card (see Figure 3-3). The video camera captures 25 frames per second. Later on, in the lab, between one and three usable frames per car were manually converted to JPEG format and saved for testing the system. A frame is considered usable when the plate is completely visible in the image, irrespective of size, orientation or quality. This module could be automated by the use of an external trigger (such as the card swiping, a trip wire, an in-ground loop, or a cross-traffic light beam) or an internal trigger, wherein a signal change from the video subsystem alerts the processor that an object of interest may be present.

The South African sun can be very bright, often making the number plate illegible due to reflection. However, every gate on campus has a small roof covering the cars while they wait for the boom to open, so the sun cannot shine directly onto the number plates. This roof is also very helpful in the case of rain, as no precipitation is visible between the camera and the license plate. When the visibility becomes too low, illumination positioned above the camera is switched on. This makes it possible for the system to work day and night, under all weather conditions.

Figure 3-3: Typical data capture station


A typical image acquired by the image acquisition module is an RGB colour image of size 720 × 576 pixels, in which the number plate is typically between 25 and 50 pixels high. However, the system is capable of handling any image and number plate size. This image is passed on to the first pre-processing step.

3.4 Pre-processing 1
The first pre-processing module consists of four steps. First, the black border around the image, caused by the camera, is erased. Secondly, the date and time, recorded by the camera on the picture, are masked. Thirdly, adjustments are made for the perspective distortions caused by the positioning of the camera. These adjustments are determined at the time of positioning the camera and do not change during operation of the recognition system (at least as long as the camera is not moved). Lastly, the input image is reduced to one data channel, which results in a grey-level image (256 grey levels). This reduces the size of the image by a factor of three and makes the subsequent steps faster.

In the current system, the green channel is chosen because the major part of the backgrounds of South African number plates consists of green. By blocking the blue and red channels, these green parts appear as white and provide greater contrast with the black letters. As green letters are used for Northern Cape plates, this approach would not work for them. However, a module could be added so that, if no plate can be localized, the procedure is started all over again using the blue or red channel, which would result in detection of the NC plates. This module is not included in the current system as not enough Northern Cape number plates were available for testing. Parallel analysis of all three channels could also be used to select the best channel for each image, but this was not implemented for the same reason.
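A rough sketch of this first pre-processing stage, using NumPy and OpenCV, is given below. The border width, the timestamp mask region and the perspective warp matrix are placeholders that would be fixed at camera-installation time; the code is illustrative and not the implementation used in this thesis.

```python
import cv2
import numpy as np


def preprocess_frame(bgr, border=8, timestamp_box=None, warp_matrix=None):
    """Crop the camera border, mask the burnt-in date/time, correct perspective
    and keep only the green channel as a 256-level grey image."""
    img = bgr[border:-border, border:-border]
    if timestamp_box is not None:
        x, y, w, h = timestamp_box
        img = img.copy()
        img[y:y + h, x:x + w] = 0                      # mask the recorded date and time
    if warp_matrix is not None:                        # fixed at installation time
        img = cv2.warpPerspective(img, warp_matrix, (img.shape[1], img.shape[0]))
    grey = img[:, :, 1]                                # OpenCV stores BGR: index 1 is green
    return grey
```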

3.5 Localization
3.5.1 Introduction
In order to achieve number plate recognition, the number plate and its constituent characters first need to be located in the image. Several techniques for doing this are documented in the literature: edge extraction, the Hough transform, blob analysis, colour analysis, histogram analysis, morphological operators, gradient analysis, neural networks, etc. [1, 52, 107, 114, 115, 122, 149]

An edge-based approach is normally simple and fast. However, it is very sensitive to undesired edges, which often appear in the image of a car. First, an edge detector (e.g. a Sobel operator) is applied, followed by an image thresholding step, where the threshold is dependent on the average brightness of the image. After the Hough transform is applied to the entire image, the Hough space is searched for peaks, which correspond to lines in the original image. The Hough transform for line detection gives good results on images with a large plate region where it can be assumed that the shape of the license plate is defined by lines [122]. However, a large memory space and a considerable amount of computing time are needed.

Blob analysis identifies connected regions of pixels in the image. These regions, which are called blobs, are represented by a few characteristic features (such as the height, width and top left coordinate). A set of rules can then be used to group the blobs into a candidate number plate [122]. The performance of blob analysis decreases quickly for touching characters or cracked, dirty or otherwise imperfect number plates [18, 115]. Colour analysis uses a threshold value to distinguish between license plate pixels and non-license plate pixels. Colour analysis can only be used when the background of the license plate is uniform (as in yellow Israeli number plates [114]).

This approach would not be useful in the case of South African plates, see Figure 3-1. In a histogram-based approach, the image is first binarized into black and white pixels. Horizontal and vertical projection profiles can be obtained by counting the number of black pixels in a column or row [62]. Based on the peaks and valleys in these profiles, the position of the plate can be determined. This method is very simple and fast; however, histogram-based approaches do not work properly on images with large noise content or on tilted plates. Morphological operators, in contrast, work well on noisy images, but are rarely used in real-time systems because of the speed of operation [115].

Kim et al. [52] propose a method that detects plate candidate areas in the input image using gradient and density measurements. Three statistical features are introduced: the gradient variance, the density of edges and the density variance. These three features are combined with a neural network to determine if a particular pixel could belong to the plate region. Among the plate candidates, the real plate area is determined based on horizontal and vertical projection profiles. When the profile parameters satisfy a list of pre-defined conditions, the candidate plate is regarded as the correct plate region. In a test with 1000 images, the number plate was correctly located in 90% of the images. The sources of failure were the presence of other text blocks or weak gradient information in the plate area. The use of the neural network adds to the robustness of the method against noise and tilted plates.

Adorni et al. [1] exploit the knowledge that license plates are small with respect to the image size and are composed of black symbols on a white background. The area containing the license plate will thus be characterized by many high peaks in the horizontal component of the gradient. Therefore the license plate position and apparent size are detected by locating an area in the scene with a high density of horizontal-gradient peaks. The threshold used to binarize the gradient image is related to the overall brightness of the image and can be adapted to the lighting conditions. The upper border of the box containing the license plate can be found by counting the number of "on" pixels in each row of the binarized gradient image. A large increase in the number of "on" pixels from one row to the next indicates that the row is the upper border of the license plate. The lower border and the vertical borders of the plate can be found in a similar way. Setchell [115] uses a very similar method based on gradient analysis and obtained a localization performance of 99.17%.

Rovetta and Zunino [107, 149] use a vector quantization (VQ) approach to simultaneously compress the image and collect information about the image content. In VQ-based image coding, input images are split into elementary blocks. Such blocks span vectors in a data space, where the quantization process defines a predetermined, fixed codebook of reference vectors (codewords). The coding process associates each block with the codeword that optimizes a similarity criterion, and encodes the block by the codeword's index. Compression derives from using a codebook that is small compared with the number of possible blocks.
The quantization principle can greatly facilitate the localization process because the coding process involves an implicit analysis of the image contents. A codebook is defined in the same (pixel) space of the encoded blocks, hence associating each block with the best-matching codeword implies a classification of the block contents. Classification results may give some hint about the block content itself, and in particular, whether the block is likely to cover a license plate. An overall error rate of 2% was reported on a test set containing more than 300 images.


3.5.2 Current system


The localization module is based on the gradient analysis algorithms proposed by Adorni et al. [1] and Setchell [115]. The process flow for the localization module is shown in Figure 3-4.

Figure 3-4: Process flow for localization algorithm (take new cross-section → locate vertical edges → clusters long enough? → find top, bottom & tilt → size within limits? → scale and rotate; when the end of the image is reached without success, the gradient threshold is adapted and the scan restarts)

The image of a number plate, even if degraded by bad contrast, non-uniform lighting conditions, touching characters, cracks, dirt, mounting bolts or other imperfections, will contain a significant number of vertical edges. Figure 3-5 shows a number plate image and a horizontal cross-section through the plate. The vertical edges are clearly visible as steps in the cross-section of the plate as they are characterised by large differences in the grey-level of adjacent pixels (see Figure 3-5). Other areas in the image will similarly contain vertical edges; however, these are not as regular as in the case of a number plate, see e.g. the headlight in Figure 3-6. This forms the basis for the number plate localization algorithm. This stage identifies the part of the image which represents a number plate and proceeds as follows:


Figure 3-5: Image of a number plate together with a cross-section through the plate (grey-level versus x coordinate)

Figure 3-6: Image of a headlight together with a cross-section through the light (grey-level versus x coordinate)


1. A horizontal cross-section through the image is taken at every nth line of the image, starting from the bottom. From the position of the camera and the possible positions of the car, estimates can be made for the height of the smallest and the largest character that can be encountered. n is chosen to be half the height of the smallest character. This ensures that any number plate in the image is hit by at least one of the cross-sections.

2. For every pixel in the current cross-section, the pixel is marked as a vertical edge if the difference in grey-level between the pixel and its neighbour is greater than some threshold T1.

3. All vertical edge points on the current cross-section are clustered into groups such that the horizontal distance between vertical edge points in the same group is less than a predefined distance d1. Only groups with a sufficient number of vertical edges are retained. This works on the assumption that every character in the plate will produce at least one vertical edge and that the distance between any two characters in the plate will be smaller than d1. (A short illustrative sketch of this clustering step, together with the model fit of equation (3.2), is given at the end of this section.)

Executing this algorithm yields a number of clusters on the current cross-section. The positions of the edge points within each cluster determine the horizontal position and the left and right extremes of the possible plate. These values are passed on to the next stage of the algorithm. The second stage of the algorithm attempts to find the top, bottom and angle of tilt of the possible plate. This stage of the algorithm works as follows:

1. If sufficiently long, the longest cluster on the current cross-section is selected as the possible plate. If no clusters are long enough, the next cross-section is examined.

2. The horizontal line between the left and right extremes of the possible plate is divided into a number of intervals. For each interval, an estimate is found for the position of the top and bottom of the plate in the following way. A horizontal bar is placed over the current interval of the horizontal line and the vertical edges in the image under the bar are counted. Then the bar is slid vertically upwards a pixel at a time and the number of edges at each point is counted. If the bar is sliding over a number plate character, the number of vertical edges will drop dramatically when the bar passes beyond the top of the character (see Figure 3-7). The bar is slid no further when the number of edges reaches zero or the distance between the bar and the original horizontal line is equal to the height of the largest character expected to be encountered. So, as the bar is slid upwards, the point which gave the least number of edges is used as the estimate for the top of the plate. The same procedure is used to estimate the bottom of the plate by sliding the bar downwards.

3. The white space between characters will yield an estimate for the top and bottom of the plate equal to the original horizontal line. These values are deleted.

4. Due to noise or image features other than number plate characters, some of the estimates for the top and bottom of the plate will be wrong. To find the plate's height and angle of tilt, a model of a number plate is fitted to the top and bottom estimates. The simple model consists of two parallel lines having parameters a1, a2 and m (see Figure 3-8). The normal procedure would be to fit the model to the data points using a least squares error method:
e(a_1, a_2, m) = \sum_{i=1}^{N_{top}} (t_i^y - m t_i^x - a_1)^2 + \sum_{i=1}^{N_{bottom}} (b_i^y - m b_i^x - a_2)^2 \qquad (3.1)

where t_i^x, t_i^y are the x and y coordinates of the estimates for the top of the plate and b_i^x, b_i^y are the x and y coordinates of the estimates for the bottom of the plate.


Figure 3-7: Estimating the top of the plate by counting the number of vertical edges under the sliding bar (the number of vertical edges under the bar is plotted against the sliding distance d in pixels; the count drops at the top of the number plate characters)

Figure 3-8: Simple model of a number plate: two parallel lines y = mx + a1 (top) and y = mx + a2 (bottom)

The advantage of the least squares method is that the parameters of the model may be found via simple differentiation and solving the resulting set of linear equations.


However, it only takes a few outliers in the data to cause a rather poor fit (Figure 3-9). To overcome this problem, the following error function was used:
s(a_1, a_2, m) = \sum_{i=1}^{N_{top}} \frac{1}{1 + (t_i^y - m t_i^x - a_1)^2} + \sum_{i=1}^{N_{bottom}} \frac{1}{1 + (b_i^y - m b_i^x - a_2)^2} \qquad (3.2)

As a high value of this function indicates a good fit, it will be referred to as a score function rather than an error function. The important characteristic of this score function is that the further a point is away from the model, the less it contributes. Therefore, outliers are unable to influence the fit of the model. Unfortunately, this function may not be solved via simple differentiation as it may have several local maxima. To find the global maximum, the following approach was adopted. For 200 values of m equally spread in the interval [-1, 1], the values of a1 and a2 are determined which maximize the first and second term in equation (3.2), respectively. This is possible because a1 and a2 are the parameters for two separate lines, therefore, they may be fitted independently. This combination of a1 and a2 represents the greatest value for s which can be attained for each m. The value of m maximizing s is then located. These values of a1, a2 and m give the best fit of the number plate model to the estimates of the top and bottom of the possible plate.

Figure 3-9: Least Squares Error: poor fit of the model due to outliers (comparison of the least squares error fit and the score function fit)

5. Possible plates that do not adequately fit the model can now be filtered out based on certain criteria. If the best fit of the model yields a value for s which is not sufficiently high, the plate is rejected. The difference between a1 and a2 gives a good estimate for the vertical height of the plate. If this does not fall within certain bounds, the plate is rejected. If m, the angle of tilt of the plate, does not fall within certain bounds, the plate is also rejected.

6. For each possible plate that satisfies the criteria, a bounding box specified by the coordinates of the top, bottom, left and right of the fitted number plate model is passed on to the next module. As the bounding box may be at an angle, it is first rotated to make it horizontal. This box is then scaled to a standard height, while the aspect ratio of the plate is maintained.

As soon as a number plate is found on line n, the algorithm is stopped, as it is assumed that at most one license plate appears per image. If the top of the image is reached without detecting a number plate, the threshold T1 is lowered and the algorithm is executed again. If T1 decreases below a certain value and no plate can be found in the image, the threshold is not lowered any further and the localization module sends a warning message to the supervisor of the system that no plate could be found in this particular image. These images can then be stored and later investigated by human intervention. Figure 3-10 shows a few examples of license plates extracted by the localization module.
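The sketch below illustrates, under simplifying assumptions, the two key computations of this stage: marking vertical-edge pixels on a cross-section and clustering them (steps 2 and 3 of the first stage), and the robust fit of the two-line plate model by maximising the score function (3.2) over a grid of slopes. The function names, the parameters T1 and d1 as arguments, and the candidate-offset search inside best_offset are illustrative choices; in particular, how a1 and a2 are maximised for a given m is not specified above, and the coarse per-point candidate search is only one possible way to do it.

```python
import numpy as np


def edge_clusters(cross_section, T1, d1, min_edges):
    """Mark vertical-edge pixels on one horizontal cross-section and group them into
    clusters (candidate plate regions); return the (left, right) extremes of each
    sufficiently dense cluster."""
    g = cross_section.astype(int)
    edges = np.where(np.abs(np.diff(g)) > T1)[0]       # large grey-level step = vertical edge
    clusters, current = [], []
    for x in edges:
        if current and x - current[-1] > d1:           # gap wider than d1: close the cluster
            clusters.append(current)
            current = []
        current.append(x)
    if current:
        clusters.append(current)
    return [(c[0], c[-1]) for c in clusters if len(c) >= min_edges]


def fit_plate_model(top_pts, bottom_pts, n_slopes=200):
    """Fit y = m*x + a1 (top) and y = m*x + a2 (bottom) by maximising the score
    function (3.2) over a grid of slopes m in [-1, 1]."""
    tx, ty = top_pts[:, 0], top_pts[:, 1]
    bx, by = bottom_pts[:, 0], bottom_pts[:, 1]

    def best_offset(x, y, m):
        # For a fixed m, a1 (or a2) can be fitted independently; here every data point
        # proposes a candidate offset and the best-scoring candidate is kept.
        candidates = y - m * x
        scores = [np.sum(1.0 / (1.0 + (y - m * x - a) ** 2)) for a in candidates]
        k = int(np.argmax(scores))
        return candidates[k], scores[k]

    best = None
    for m in np.linspace(-1.0, 1.0, n_slopes):
        a1, s1 = best_offset(tx, ty, m)
        a2, s2 = best_offset(bx, by, m)
        if best is None or s1 + s2 > best[0]:
            best = (s1 + s2, a1, a2, m)
    return best[1], best[2], best[3]        # a1, a2, m of the highest-scoring model
```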

3.6 Pre-processing 2
An extra pre-processing step is performed before proceeding to the segmentation module. This is a normalisation process to enhance the contrast between the number plate characters and the background. The output of the localization module is also checked for columns on the left and right of the image that cannot belong to the actual number plate. These are erased before proceeding to the next step. Examples of contrast enhancement found in the literature use histogram stretching [52], localized (adaptive) contrast enhancement [115, 122] or a binarization algorithm which reduces the grey-level image to a black and white image [35, 112].

The current system uses a normalisation process based on the grey-level histogram of each example. Figure 3-11 shows a typical example of a number plate together with its histogram. The value of min is chosen such that one sixth of the example's pixels have a value less than min, while max is chosen such that one sixth of the pixels have a value greater than max. The transfer function in Figure 3-12 is then used to obtain the normalised pixel values. Figure 3-13 shows the number plate from Figure 3-11 after normalisation, together with its new histogram.

Starting from the left and right side of the image, the output of the previous normalisation step is searched for big dark or light patches. The pixel values in every column are summed; if this sum is below a threshold T2, the column is marked as dark, and if the sum is above a threshold T3, the column is marked as light. Dark patches are formed by consecutive dark columns, light patches by consecutive light columns. Excessively large patches cannot belong to a license plate, as a license plate is characterised by a sequence of short dark and light patches (the characters and the gaps between the characters). If the length of a patch exceeds a threshold d2, determined as a function of the height of the number plate, it is erased and the contrast normalisation step is repeated. The normalised number plate image is then sent to the segmentation module. Figure 3-14 shows a few examples of number plates before and after the pre-processing step.
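A minimal sketch of this contrast normalisation, assuming the transfer function of Figure 3-12 amounts to a linear stretch of the range [min, max] onto [0, 255] with clipping outside that range:

    import numpy as np

    def normalise_plate(gray):
        """Grey-level normalisation of a localized plate: one sixth of the
        pixels fall below `lo`, one sixth above `hi`, and [lo, hi] is
        stretched linearly to [0, 255] (values outside are clipped)."""
        lo = np.percentile(gray, 100.0 / 6.0)
        hi = np.percentile(gray, 100.0 - 100.0 / 6.0)
        out = (gray.astype(np.float32) - lo) * 255.0 / max(hi - lo, 1e-6)
        return np.clip(out, 0, 255).astype(np.uint8)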


Figure 3-10: Examples of localized number plates


Figure 3-11: Typical number plate and its histogram

Figure 3-12: Transfer function used to normalise grey-levels of number plates


Figure 3-13: Typical number plate after normalisation and its histogram

Figure 3-14: Examples of number plates before and after pre-processing

3.7 Segmentation
3.7.1 Introduction
Character segmentation is the operation in which an image of a sequence of characters is decomposed into sub-images of the individual symbols. Character segmentation has long been a critical area of the optical character recognition process. The higher recognition rates for isolated characters versus those obtained for words and connected character strings illustrate this fact. Casey and Lecolinet [14] give an extensive survey of strategies in character segmentation. They list these methods under four main headings. What may be termed the classical approach, or dissection, consists of methods that partition the input image into sub-images, which are then classified. The second class of methods avoids dissection and segments the image either explicitly, by classification of pre-specified windows, or implicitly by classification of subsets of


spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules to define potential segments, but using classification to select from the range of admissible segmentation possibilities offered by these sub-images. Finally, holistic approaches avoid segmentation by recognizing entire character strings as units.

The difficulty of performing accurate segmentation is determined by the nature of the material to be read. Number plates have a reasonably well defined structure and are considered a relatively easy case for character segmentation. Commonly used methods for character segmentation of number plates are morphology operations, connected component analysis, projection techniques and neural networks. Most of the techniques used to extract character regions from an image can also be used to segment characters [123]. The segmentation stage is only briefly mentioned by most authors and individual performance rates are rarely available.

Adorni et al. [1] perform symbol segmentation by searching the license plate region for the vertical white areas separating the symbols. Once symbols have been isolated, each of them is resampled to a size of 8 × 13 pixels and then binarized with a single threshold. Hermida et al. [35] segment the characters using an adaptive search of the connected black spots in the binary number plate image. Some heuristic rules, based on aspect ratio and relative size, remove spots that are not characters. Rodrigues and Thom [105] propose a decision tree based segmentation technique. The first level of the tree is based on the projected density of pixels and an adaptive parameter called the refinement rate. Successive levels of refinement are entered as long as elements exist which were not successfully segmented in the previous level. Heuristic criteria are used to decide when to enter a new level in the tree. No results were mentioned for the segmentation of number plates, but good results are obtained for handwritten postal codes, where a segmentation performance of 88% was reported. Siah et al. [122] and Coetzee et al. [18] use blob analysis to segment the license plate characters. This method identifies connected regions of pixels in the image and consequently clusters them into characters. The greatest weakness of this method is its inability to correctly process characters which are connected to each other or to the border of the license plate. Coetzee et al. [18] report a segmentation performance of 92.5%. Lee et al. [62] propose a segmentation methodology in which the character segmentation regions are determined by using projection profiles and topographic features extracted from the greyscale images. Then a nonlinear character segmentation path in each character segmentation region is found by using a multi-stage graph search algorithm. Finally, in order to confirm the nonlinear character segmentation paths and recognition results, recognition-based segmentation is adopted.

Some authors prefer not to segment the number plate into individual characters [19, 115]. In this case, a window is slid over the scaled plate with the elements of the window providing the input values to the classification module. Both Comelli et al. [19] and Setchell [115] base this decision on comparisons with previous license plate recognition systems, where segmentation techniques based on a single threshold and blob analysis were used.
The great variation in illumination and the presence of connected characters (because of dirt, cracks, mounting bolts or other imperfections) cause these systems to fail. Coetzee et al. [18] who use blob analysis to segment the characters reported a full plate recognition performance of 86.1%, while Comelli et al. obtained a 91.7% recognition performance. Because these systems use different methods in the


recognition stage and different test sets, it is not certain whether the increase in performance is due to the absence of the segmentation step.

3.7.2 Current system


The algorithm used in the segmentation module of the current system is based on projection analysis, as defined by Lee et al. [62]. This method leads to good results because of the constrained structure of a number plate, where adjacent characters can ordinarily be separated at columns. In grey-scale images, the vertical projection profile P(x) is defined as follows:

P(x) = \sum_{g=0}^{L-1} H_x(g)\, c(g), \qquad 0 \le c(g) \le 1    (3.3)

where g(x, y) is the intensity of pixel (x, y) (0 ≤ g(x, y) ≤ L − 1), L is the number of intensity levels, h is the height of the image, H_x(g) is the histogram of column x at intensity g, and c(g) is a factor contributing to the projection profile, emphasizing certain values of H_x(g). Grey-level images have 256 intensity levels (0 = black and 255 = white). In the current system, the darkest values are emphasized by choosing L as 85 and defining c(g) as:

c(g) = L - 3g    (3.4)

The input image is segmented by using the projection profile P(x). The columns which have P(x) less than a threshold T4 are selected as segmentation points. If no segmentation boundary is found within distance d3, the segmentation process is repeated with a threshold T5, which is five times the value of T4. Next, the provincial crest is determined by searching the segmentation results for one character which is smaller and wider than the other characters. This crest is marked as a non-character. If two neighbouring characters are smaller than a threshold distance d4, they are merged. A maximum of eight characters are cut from the image and scaled to a standard height and width. Within every character, the pixel values are divided by the sum of all the pixels in the character, to cancel out differences in illumination. This averaged and normalised character is sent to the classification step. An example of a number plate with the corresponding separated characters is given in Figure 3-15. An example of a wrongly segmented number plate is given in Figure 3-16.
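A compact sketch of the projection-based segmentation step described above; the weighting table c(g) of equation (3.4) is passed in rather than hard-coded, and the handling of the provincial crest and of merged characters is omitted:

    import numpy as np

    def projection_profile(plate, c):
        """Vertical projection profile P(x) of equation (3.3).
        plate: 2-D uint8 grey-level image of the localized plate.
        c: length-256 lookup table with the weighting c(g), zero for g >= L."""
        return c[plate].sum(axis=0)          # sum over rows -> one value per column

    def segmentation_points(profile, T4):
        """Columns whose projection value falls below the threshold T4 are
        candidate boundaries between characters (section 3.7.2)."""
        return np.flatnonzero(profile < T4)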

Figure 3-15: Example of a number plate and its segmented characters


Figure 3-16: Example of a wrongly segmented number plate

3.8 Feature extraction


The whole character can be used as a feature, where the feature vector will contain the intensity levels of the individual pixels. However, in many situations, the selection of a set of representative features can decrease the dimensionality of the feature vector and even improve classification. Trier et al. [128] present an overview of feature extraction methods for off-line recognition of segmented (isolated) characters. They mention the following methods to extract features from grey-scale images: template matching, deformable templates, unitary image transforms (such as the Karhunen-Loève, Fourier, Walsh, Haar, Hadamard, Hough, Gabor and chain-code transforms) [121], zoning, geometric moment invariants (such as Hu's invariants) [27], Zernike moments, n-tuples [115] and neural networks [16]. Other authors describe algorithms to extract geometrical and topological features such as strokes and bays in various directions, the number of holes [124, 125], end points, intersections of lines, loops, angular relations between lines [102] or Kirsch edge descriptors [122]. Irrespective of the particular features detected or the techniques used to identify them, the output of the feature detection stage is a feature vector that contains the measured properties of the character region.

In the current system, two types of feature extraction were used. The first one uses the whole character as a feature, where the feature vector contains the intensity levels of the individual pixels. The second method is based on an algorithm proposed by Viola and Jones [132]. A set of features reminiscent of Haar basis functions is used. The use of the integral image representation (as defined in [132]) allows very fast feature evaluation. More specifically, three kinds of features are used. The value of a two-rectangle feature is the difference between the sums of the pixel intensities within two rectangular regions which are horizontally or vertically adjacent. A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a centre rectangle. Finally, a four-rectangle feature computes the difference between diagonal pairs of rectangles. Feature examples are shown in Figure 3-17, where the sum of the pixel intensities within the white rectangles is subtracted from the sum of pixel intensities in the black rectangles.


Figure 3-17: Example features: A & B are two-rectangle features, C & D are three-rectangle features and E is a four-rectangle feature

These five features are calculated for approximately one sixth of the pixels in the character image, equally spread over the image. The resulting values are collected into a one-dimensional feature vector and passed on to the classification module. Figure 3-18 shows three examples of the output of the rectangle feature extraction. For visual purposes, the feature vectors are represented in 2D instead of 1D, and their size is exaggerated. Every row corresponds to one of the five features from Figure 3-17. It is clear that this output no longer resembles an alphanumeric character.
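The rectangle features can be evaluated with a handful of lookups in an integral image, along the lines of the sketch below (only the two-rectangle feature is shown; the three- and four-rectangle features follow the same pattern):

    import numpy as np

    def integral_image(img):
        """Summed-area table with a leading zero row/column:
        ii[y, x] = sum of img[:y, :x]."""
        return np.pad(img.astype(np.int64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def box_sum(ii, y, x, h, w):
        """Sum of intensities in the h x w box with top-left corner (y, x),
        obtained with four lookups in the integral image."""
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    def two_rectangle_feature(ii, y, x, h, w):
        """Difference between two horizontally adjacent h x w rectangles."""
        return box_sum(ii, y, x, h, w) - box_sum(ii, y, x + w, h, w)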

Figure 3-18: Example of rectangle features applied

3.9 Character recognition


3.9.1 Introduction
Character recognition is a fairly well-developed field in computer vision and several techniques are available; reviews can be found in the literature [99, 115, 124]. Most techniques belong to one of two types: template matching and neural networks. Template matching involves the use of a database of characters or templates, with a separate template for each possible input character. Recognition is achieved by comparing the current input character to each template in order to find the one which matches best. If I(x, y) is the input character and T_n(x, y) is template n, then the matching function s(I, T_n) returns a value indicating how well template n matches the input character. Several common matching functions are:

Manhattan distance:
s(I, T_n) = \sum_{i=0}^{w} \sum_{j=0}^{h} | I(i, j) - T_n(i, j) |    (3.5)

Euclidean distance:
s(I, T_n) = \sum_{i=0}^{w} \sum_{j=0}^{h} ( I(i, j) - T_n(i, j) )^2    (3.6)

Cross-correlation:
s(I, T_n) = \sum_{i=0}^{w} \sum_{j=0}^{h} I(i, j)\, T_n(i, j)    (3.7)

Normalised correlation:
s(I, T_n) = \frac{\sum_{i=0}^{w} \sum_{j=0}^{h} (I(i, j) - \bar{I})(T_n(i, j) - \bar{T}_n)}{\sqrt{\sum_{i=0}^{w} \sum_{j=0}^{h} (I(i, j) - \bar{I})^2 \; \sum_{i=0}^{w} \sum_{j=0}^{h} (T_n(i, j) - \bar{T}_n)^2}}    (3.8)

Character recognition is achieved by identifying which T_n gives the best value of the matching function s(I, T_n). The method can only be successful if the input character and the stored templates are of the same (or at least very similar) font. Template matching can be performed on binary or grey-level characters. In the latter case, comparison functions such as normalised correlation are usually used as they provide improved immunity to variations in brightness and contrast between the input character and the stored template [19, 35, 96, 124]. Neural networks can be used effectively for classification and are therefore suitable for character recognition. They are trained on a set of example characters. While learning to recognize a recurring pattern, the neural network constructs statistical models that adapt to individual characters' distinctive features; neural networks therefore tend to be resilient to noise. Various training algorithms have been used for various neural network character classification architectures, the most popular being the backpropagation algorithm with a standard three-layer perceptron architecture [18, 82, 144]. However, other algorithms have also been tested, such as scaled conjugate gradient descent [115], fuzzy ARTMAP neural networks [122], radial basis function networks [61], self-organizing maps [97], learning vector quantization [13, 82], time-delay neural networks [5], local receptive field networks [121], etc.
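As an illustration of the template-matching alternative (not of the LVQ classifier actually adopted in section 3.9.2), a minimal numpy sketch of the normalised correlation of equation (3.8) could look as follows; the dictionary-of-templates interface is an assumption:

    import numpy as np

    def normalised_correlation(I, T):
        """Normalised correlation between an input character I and a template T
        of the same size; returns a value in [-1, 1]."""
        a = I.astype(np.float64) - I.mean()
        b = T.astype(np.float64) - T.mean()
        return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

    def best_template(I, templates):
        """Return the name of the template with the highest normalised correlation."""
        return max(templates, key=lambda name: normalised_correlation(I, templates[name]))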

3.9.2 Current system


Learning Vector Quantization or LVQ was selected for the character recognition module. This is a supervised version of vector quantization which can be classified under neural networks and is extensively documented [28, 32, 34, 81, 126]. Learning Vector Quantization is a group of algorithms applicable to statistical pattern recognition, in which the classes are described by a relatively small number of codebook vectors, properly placed within each zone such that the decision borders are approximated by the nearest-neighbour rule [55].

The LVQ-PAK program package contains all programs necessary for the correct application of certain Learning Vector Quantization algorithms in an arbitrary statistical classification or pattern recognition task, as well as a program for the monitoring of the codebook vectors at any time during the learning process [55]. The LVQ algorithm chosen for the character recognition module in our system is OLVQ1 (optimized-learning-rate LVQ1). This algorithm starts by selecting a small set of prototypes or codebook vectors from the training set. The initial prototypes are selected at random, with the same number of entries allocated to each class.


In the learning stage, the training set is used to modify the codebook vectors in order to gradually adapt the decision surface they define to that defined by the entire training set and so reduce the classification error rate. Samples from the training set are cyclically or randomly presented for classification. When a training sample x(t) is correctly classified, that is, the training sample is of the same class as the nearest prototype mc(t), this prototype is moved towards the training sample, along the line connecting the two vectors. When misclassification occurs, the nearest neighbour prototype mc(t), which, in this case, is of the wrong class, is moved away from the training sample, along the line connecting the two vectors. The following equations define the basic LVQ1 process:

m_c(t + 1) = m_c(t) + \alpha(t)\,[x(t) - m_c(t)]   if x and m_c belong to the same class,
m_c(t + 1) = m_c(t) - \alpha(t)\,[x(t) - m_c(t)]   if x and m_c belong to different classes,
m_i(t + 1) = m_i(t)   for i \neq c    (3.9)

The term α(t) is a learning rate. In the basic LVQ1 algorithm, α(t) may be constant or decrease monotonically with time. In the OLVQ1 algorithm, an individual learning rate α_i(t) is assigned to each m_i. Thus, the equations for the OLVQ1 algorithm become:

m_c(t + 1) = m_c(t) + \alpha_c(t)\,[x(t) - m_c(t)]   if x and m_c belong to the same class,
m_c(t + 1) = m_c(t) - \alpha_c(t)\,[x(t) - m_c(t)]   if x and m_c belong to different classes,
m_i(t + 1) = m_i(t)   for i \neq c    (3.10)

The individual learning rates α_i(t) can be determined optimally for fastest possible convergence of equation (3.10) by the following recursion:

\alpha_c(t) = \frac{\alpha_c(t-1)}{1 + s(t)\,\alpha_c(t-1)}    (3.11)

where s(t) = +1 if the classification is correct and s(t) = -1 if the classification is wrong. Because of the fast convergence of the OLVQ1 algorithm, its asymptotic recognition accuracy is achieved after a number of learning steps that is about 30 to 50 times the total number of codebook vectors (compared to 50 to 200 times for the basic LVQ1 algorithm). Two more LVQ algorithms are included in the LVQ-PAK: LVQ2.1 and LVQ3. LVQ2.1 modifies both the nearest and next-to-nearest neighbours of a training sample; one of them must belong to the correct class and the other to a wrong class. Furthermore, LVQ2.1 requires the training vector to fall within a window which is determined by the relative distances of the training sample to the prototypes. Finally, LVQ3 is a combination of the earlier LVQ algorithms. In an attempt to improve the recognition accuracy, the OLVQ1 algorithm may be followed by the basic LVQ1, LVQ2.1 or LVQ3, using a low initial learning rate, which is now the same for all the classes. However, in the current application, the optimized LVQ1 learning phase alone is sufficient.
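A minimal sketch of one OLVQ1 training step and of the nearest-codebook classification used at recognition time; the array layout and the absence of the LVQ-PAK bookkeeping (balancing, stopping rule) are simplifications:

    import numpy as np

    def olvq1_step(codebook, labels, alphas, x, y):
        """One OLVQ1 update, equations (3.10) and (3.11).
        codebook: (M, d) float array of codebook vectors m_i
        labels:   (M,) class of each codebook vector
        alphas:   (M,) individual learning rates alpha_i
        x, y:     training sample (length-d array) and its class."""
        c = int(np.argmin(((codebook - x) ** 2).sum(axis=1)))  # nearest codebook vector
        s = 1.0 if labels[c] == y else -1.0
        alphas[c] = alphas[c] / (1.0 + s * alphas[c])          # recursion (3.11)
        codebook[c] += s * alphas[c] * (x - codebook[c])       # update (3.10)

    def classify(codebook, labels, x):
        """Nearest-neighbour classification with the trained codebook."""
        return labels[int(np.argmin(((codebook - x) ** 2).sum(axis=1)))]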


Before the system can be used for recognition, it must first be trained. The LVQ-PAK was used for this [55]. The process requires a large number of training samples, each of which must have associated with it the desired output when the sample is presented as input to the network. There are no fixed rules as to what the size of the training set or the number of codebook vectors should be for any given pattern recognition problem. A rule of thumb is to use a number of learning steps that is 30 to 50 times the number of codebook vectors. If the number of required learning steps is bigger than the number of training samples available, the samples must be used re-iteratively in training, either in a cyclical or in a randomly-sampled order. An upper limit to the total number of codebook vectors is set by the restricted recognition time and computing power available. The highest number of codebook vectors is equal to the number of training samples. A good strategy for the initialization of the codebook vectors is to start with the same number of codebook vectors in each class. Before training starts, LVQ-PAK uses an iterative algorithm, called balance, to compute the medians of the shortest distances between the initial codebook vectors of each class. If the distances turn out to be very different for the different classes, new codebook vectors may be added to or old ones deleted from the deviating classes.

In the current system, a total of 1374 training samples were gathered by using the modules up to the feature extraction to find all number plate characters in an image. Each of these was manually classified. Different numbers of initial codebook vectors were experimented with. It was found that the results do not differ significantly when more than 600 codebook vectors are used. Below 600 codebook vectors, the recognition performance becomes worse as fewer codebook vectors are initialized. The output of the LVQ-PAK after training is a set of codebook vectors with their associated classes. This information is saved in the character recognition module of the system. When this module is given an input feature vector from the feature extraction module, it calculates the Euclidean distance between this feature vector and each codebook vector. The class associated with the closest codebook vector is selected as the output of the character recognition. This output value is then passed on to the post-processing module for syntax checking.

3.10 Post-processing
Once the characters are recognized, their syntax can aid in refining the determination. Syntax refers to the number and placement of characters on the plate, their sequence, and whether each can be a letter, a number or another character. As mentioned above, every province in South Africa has its own design of number plate, and some of the old plates are also still in use. Table 3-1 shows the possible syntax for South African number plates. First, the syntax checker determines which province the number plate comes from. This is done by checking the plate for the province-specific characters shown in Table 3-1. If any of these combinations is found, the rest of the license plate syntax is known. As some cars do not have number plates, or some plates have fewer than eight characters, the localization and segmentation modules will sometimes pass an image of a non-number plate character to the character recognition module, which classifies this image as belonging to the character class of the closest prototype. However, the syntax checker can easily detect these mistakes, as the syntax does not match one of the patterns in Table 3-1. In the case of a plate with fewer than


eight characters, the post-processing module will delete the non-characters; in the case of a non-number plate image, the post-processing module will output a message that the current image does not contain a number plate.

Province                  Possible syntax
Gauteng                   LLL DDD GP, DDD LLL GP or ****** GP
Mpumalanga                LLL DDD MP
Eastern Cape              LLL DDD EC
Northern Cape             LLL DDD NC
North West                LLL DDD NW
Free State                LLL DDD FS
Northern Province         LLL DDD N
Western Cape              CL DDD, CL DDDD, CL DDDDD, CLL DDD, CLL DDDD or CLL DDDDD
KwaZulu Natal             LLL DDD ZN, DDD LLL ZN or ****** ZN
                          NL DDD, NL DDDD, NL DDDDD, NLL DDD, NLL DDDD or NLL DDDDD
Old plates still in use   LLL DDD G or LLL DDD T

Table 3-1: Possible syntax for South African number plates (L = letter, D = digit, * = any character)

Some characters, such as D and 0, B and 8, can appear very similar and it is therefore difficult for the character recognition module to discriminate between them. In most cases the syntax checker is able to resolve these ambiguities as the context of the character dictates whether it must be a letter or a number.
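The syntax check can be expressed with one regular expression per pattern in Table 3-1; the sketch below shows only a few provinces and assumes the recognised characters have been concatenated without spaces:

    import re

    # Only a few rows of Table 3-1 are shown (L = letter, D = digit, * = any character).
    SYNTAX = {
        "Gauteng":      [r"[A-Z]{3}\d{3}GP", r"\d{3}[A-Z]{3}GP", r".{6}GP"],
        "Mpumalanga":   [r"[A-Z]{3}\d{3}MP"],
        "Eastern Cape": [r"[A-Z]{3}\d{3}EC"],
    }

    def check_syntax(plate):
        """Return the province whose syntax matches the recognised string,
        or None if no pattern fits."""
        for province, patterns in SYNTAX.items():
            if any(re.fullmatch(p, plate) for p in patterns):
                return province
        return None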

3.11 Database
When the number plate of an incoming vehicle is deciphered, it is stored in the database, together with the original image of the vehicle and the driver information (see chapter 4). When the number plate of an outgoing vehicle is deciphered, it is searched for in the database. If a match is found, the newly captured driver information, together with the stored driver information, is sent to the face recognition module.

3.12 Results
When assessing the system's performance, three aspects need to be considered: the performance of the localization module, the performance of the segmentation module, and the performance of the recognition and post-processing modules. In order to measure the system performance within these three areas, a camera was set up at a height of about 0.5 metres above the ground, looking towards the front of vehicles as they approached the camera. The vehicles were travelling very slowly or standing still. The weather conditions were dry and lighting varied from strong sunlight to overcast skies. A total of six hours


of video was recorded at different times of the day. From this video, still images of 2375 vehicles were stored in jpeg format. A selection of these images is shown in Figure 3-19. Some of the plates are mounted at a significant angle, some are dirty, cracked or broken and some contain strong shadows. For each of these still images the vehicle's number plate was manually deciphered and stored in a database.

3.12.1 Performance of the localization module


The system is programmed to cut out the possible plate located by the localization module. The 2375 stored images were then processed by the system while a human observed whether the actual vehicle number plate was included in the possible plate located by the system. The actual plate was included in the output of the localization module in 99.07% of the tested examples.

3.12.2 Performance of the segmentation module


The segmentation module cuts out a maximum of eight individual characters. A total of 2353 images were processed while a human checked if the actual characters were among the characters outputted by the system. One wrongly segmented character results in the plate being marked as wrongly segmented. If the segmentation module outputs non-characters instead of characters in the case of a number plate with fewer than eight characters, this is not counted as wrong, as these non-characters can be detected and removed in the recognition module. Of the 2353 number plates tested, 99.28% were correctly segmented.

3.12.3 Performance of the character recognition and post-processing modules


In this experiment the 2336 properly localized and segmented number plate images were presented to the number plate recognition system. This set of test images does not contain any of the images used for training the character recognition module. For each image the plate, as deciphered by the system, was compared to the manually deciphered plate and the number of differing characters counted. Results for these tests are shown in Table 3-2.

                              Feature vector as           Feature vector as
                              pixel intensity levels      described in section 3.8
plate completely correct      99.02%                      98.46%
1 character wrong             0.77%                       1.33%
2 characters wrong            0.13%                       0.17%
3 characters wrong            0.08%                       0.04%
more than 3 characters wrong  0%                          0%
Table 3-2: Performance of character recognition and post-processing modules

It can be seen in Table 3-2 that the first type of feature extraction performs slightly better than the second one. However, the second type of feature extraction leads to feature vectors that are about 15% smaller and thus need less storage space.


Figure 3-19: A selection of the 2375 number plate images

It must be mentioned that, for the intended application, a number plate with one wrongly recognized character can still be useful. Consider an example where the number plate of the incoming vehicle was stored entirely correctly in the database, but one character of the number plate of the outgoing vehicle was wrongly recognized. This means that the number plate of the outgoing vehicle will not be found in the database. In this case, a search can be performed for all number plates that differ by one character. All these plates, with their corresponding face information, can then be passed to the face recognition module. If for any of these number plates the face is recognized with high enough confidence, it is assumed that one number plate character was wrongly recognized,


the car can leave campus and the corresponding license plate and face information are erased from the system. If still no match can be found in the database when number plates with one character different are considered, the process can be repeated for number plates with two characters different, and so on. Even when the character recognition is not perfect, the right result can thus still be found. Similarly, a wrongly segmented image may have only one character wrongly segmented, which can lead to a number plate with seven out of eight characters perfectly recognized. Thus, in some cases, a wrongly segmented number plate can also still lead to the right system output.
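A sketch of this relaxed database lookup, ordering the stored plates by the number of differing characters (a simple character-wise distance is assumed):

    def char_difference(a, b):
        """Number of differing characters between two plate strings."""
        return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

    def candidate_plates(outgoing, stored_plates, max_diff=2):
        """Stored plates ordered by how many characters differ from the outgoing
        plate, up to max_diff, so that their face data can be checked in turn."""
        scored = sorted((char_difference(outgoing, p), p) for p in stored_plates)
        return [p for d, p in scored if d <= max_diff]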

3.12.4 Total performance


The system can only identify a license plate's alphanumeric content after it has properly localized and segmented the plate. The success rate of each step must thus be incorporated into the overall calculation:

T = L \cdot S \cdot R    (3.12)

where
T = total system accuracy,
L = rate of successful plate localization,
S = rate of successful plate segmentation,
R = rate of successful interpretation of entire plate content.

This results in a total performance of 97.39% when the pixel intensities are used to form the feature vector, and a total performance of 96.84% when the other type of features (as described in section 3.8) is used.
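As a check, substituting the module rates reported above into equation (3.12) gives T = 0.9907 \cdot 0.9928 \cdot 0.9902 \approx 0.9739 (97.39%) for the pixel-intensity features, and T = 0.9907 \cdot 0.9928 \cdot 0.9846 \approx 0.9684 (96.84%) for the rectangle features of section 3.8.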

3.12.5 Execution times


No particular effort was made to make the system fast, and no extensive testing was done to measure the execution time. However, when tested on fifty number plates, the average execution time was 2.1 seconds, which is considered fast enough for the intended application.

3.13 Conclusions
The performance of the localization module was excellent, finding 99.07% of the number plates in the test images. This was achieved despite many of the plates being at an angle, dirty, cracked, broken or partially in shadow. The performance of the segmentation module was equally impressive, correctly segmenting 99.28% of the number plates. Most of the wrong segmentations were caused by very dirty or cracked plates. With regard to the character recognition, the above results make it clear that the use of the raw data, i.e., the pixel intensity levels, as the input for the character recognition module, gives the best performance.


It is very difficult to compare these performance rates with results reported in the literature because they are based on different data sets. The only results reported on South African number plates are by Botha and Coetzee [10] (University of Stellenbosch). They report a 96% character recognition accuracy and an 83% total plate finding/segmentation/reading accuracy. When a comparison with the number plate recognition systems reported in the literature is nevertheless attempted, the current system appears to perform at least as well as, and in several cases better than, the reported systems [19, 96, 128, 149].

3.14 Summary
This chapter described the development of a number plate recognition system. The algorithms used in the different modules have been described in detail. The total performance of 97.4% shows that a robust system was developed, capable of reading number plates in a realistic environment without any operator intervention.


Chapter 4

Face recognition
This chapter presents the face recognition system developed for access to a university campus. This work is based on a few algorithms described in the literature, adapted and improved to work in this particular situation. The system components have been subjected to a number of tests. Details of these tests and the results are presented at the end of the chapter.

4.1 Introduction
In this study, only one face in the image is of interest, namely the face of the driver. Therefore, the term face localization will be used for the process of detecting the single face in the input image. Facial feature detection is concerned with the detection of the eyes, eyebrows, nose, mouth, chin and face boundary. This makes it possible in the face representation step to extract the face from the background and position it precisely according to a reference face. Face verification is the process that compares an unknown face image to a set of known faces to verify the person's identity. The term face recognition will be used in this study for the total combination of face localization, facial feature detection, face representation and face verification.

4.2 Process description


An overview of the different modules in the face recognition system can be seen in Figure 4-1. The video camera is positioned to survey a section of the road where the cars stop in order for the driver to swipe his/her staff or student card. The images grabbed from the camera are scanned to find the driver's face. If no face can be found, the image is discarded and the system moves on to the next frame. Once the face is found, the features within the face are localized: the eyebrows, eyes, nose, mouth, chin and face edges. Based on their positions, the face can be normalized into

Figure 4-1: The different steps in a face recognition system: image acquisition; face localization (colour detection, face equalization); facial feature detection (bar filter, grouping, iris detection, boundary detection); face representation (pose estimation, warping, illumination normalization); face verification


a standard form and reduced to a feature vector. In the case of an incoming vehicle, this feature vector is saved in a database, together with the deciphered number plate of the car. When the vehicle drives out, this feature vector, together with the feature vector stored in the database (corresponding to the same number plate), is passed on to the face matching module. At this point it is determined whether the two drivers are the same or not. If they are the same, the boom opens and the car can drive out; if not, a warning message is passed on to the security guards, who will investigate whether the car may have been stolen.

4.3 Image acquisition


The colour video camera is positioned on the side of the road, in order to grab images of the side of the car when the driver swipes his/her staff or student card (see Figure 4-2). The camera captures 25 frames per second. These frames are automatically de-interlaced and converted into jpeg format. All frames from the video sequence are saved for testing the system. As the driver must be able to swipe his/her card, the distance between the driver and the camera cannot change much. Therefore, the current face localization system does not have to cope with a great deal of scale variation. As it takes most drivers about four seconds to swipe their card, about a hundred images are available per person. However, in most of these images the face will be in a non-frontal pose. Also, in many of these images, the face might be occluded by the driver's hand or student/staff card. The main source of illumination is the sun. Pixel intensities are directly modified by a change in illumination intensity and direction. Sunlight also creates shadows and reflections. During the day, the face finding system thus has to cope with large variations in illumination. At night, a light above the camera is switched on, to make it possible for the system to work day and night. While this ensures constant illumination, it can introduce other potential problems such as shadowless faces.

Figure 4-2: Typical data capture station


A typical image acquired by the image acquisition module is an RGB colour image of size 720 × 576, see Figure 4-3. The size of the driver's face in this image is typically about 150 × 200 pixels. As a first pre-processing step, the date and time recorded by the camera on the image are masked. The black border around the image, caused by the camera, is erased. The resulting image is passed on to the face localization module.

Figure 4-3: Typical image captured by the image acquisition

4.4 Face localization


4.4.1 Colour detection
When an image enters the face localization module, a colour discriminator processes the image. Each pixel is classified as being of face or non-face colour. Skin colour detection is a feature-based face detection method that allows fast processing and is highly robust to geometric variations of the face pattern. A comprehensive survey on techniques for skin colour modelling and recognition can be found in [131]. They describe the different colour spaces used for skin detection and several skin colour modelling methods. A very popular approach is to use the YCbCr colour space where the luminance Y is discarded, as it is highly dependent on lighting. This also reduces the mentioned day/night problem. Thus, the classification is made in the two-dimensional chrominance plane [2, 15, 30]. Equally popular is the normalized RGB colour space [131, 150], defined as follows:
r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B}    (4.1)


As the sum of the three normalized components is known (r + g + b = 1), the third component does not hold any significant information and can be omitted, reducing the space dimensionality. The remaining components are often called pure colours, for the dependence of r and g on the brightness of the source RGB colour is diminished by the normalization. The final goal of skin colour detection is to build a decision rule that will discriminate between skin and non-skin pixels. This is usually accomplished by introducing a metric which measures the distance (in a general sense) of the pixel colour to skin tone. The type of this metric is defined by the skin colour modelling method. A simple thresholding technique is often sufficient to discriminate between skin and non-skin pixels [2]. However, a multitude of both parametric and non-parametric techniques has been introduced to classify pixels as face or non-face. A few examples are Bayesian classifiers [15], neural networks [139], Gaussian models [43, 131, 150] and fuzzy pattern matching methods [138].

In the current study, the normalized RGB colour space was chosen. By collecting statistics of face and non-face pixels, the area in the r-g plane corresponding to face pixels can be characterized, e.g. using linear discriminants. To illustrate, Figure 4-4 shows two training images. In each face, 250 pixels have been marked as face pixels and plotted in the r-g plane (see Figure 4-4). Four linear discriminants have then been placed around the clusters. The linear discriminants are:

0.35 \le r \le 0.61, \qquad 0.22 \le g \le 0.35    (4.2)

These linear discriminants were then used to classify pixels in other images. The result can be seen in Figure 4-5. As can be seen, the method works fairly well, but is not stable enough for a final detection or segmentation. However, the algorithm is very fast and can be used as a first step, finding areas in the image that possibly contain a face. In that case, the discriminants should be very generous, making the risk of missing a face very small. Garcia et al. [30] use a technique based on shape analysis and wavelet packet decomposition to determine which of the face candidates can be classified as faces. Ahlberg [2] uses statistical pattern matching and simulated annealing for this purpose. The authors of [39] construct eye, mouth and boundary maps to verify the face candidates.

From the setup of the camera, it is known that a maximum of two faces can be present in the image: the driver and the passenger sitting next to the driver. The distance between the camera and the two possible faces is approximately known, so that minimum and maximum values for the width and height of these faces can be estimated. These distances are used as a threshold to determine which blob represents the driver's face. The area representing the wanted face is cut out of the image and a closing operation is performed to correct the small patches inside the face that were not previously selected as being of skin colour (the eyes often appear too dark while the nostrils and mouth appear too bright). The borders of the face image are smoothed to eliminate peaks of hair or pieces of clothing. The resulting image (see Figure 4-5) is passed on to the next module of the facial feature detection system. The size of the passed images is variable; in Figure 4-5 they are cropped for display purposes. If no area of face colour that fulfils the size requirements is found, the system reports that no face is found and moves on to the next frame. As can be seen in Figure 4-5, significant parts of the hair and/or clothes can be detected as skin colour; however, these parts of the image will be deleted in a later stage of the algorithm. The main aim of this module is to reduce the size of the search space. The reductions seen in Figure 4-5 are typical and represent the success of this pre-processing.
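A sketch of this first-pass skin classification with the discriminants of equation (4.2), assuming an input array with channels in R, G, B order:

    import numpy as np

    def skin_mask(rgb):
        """Classify each pixel as face/non-face colour in the normalized r-g plane."""
        rgb = rgb.astype(np.float32)
        total = rgb.sum(axis=2) + 1e-6                  # avoid division by zero
        r = rgb[..., 0] / total
        g = rgb[..., 1] / total
        return (r >= 0.35) & (r <= 0.61) & (g >= 0.22) & (g <= 0.35)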

Figure 4-4: Two training images and their colour distributions in the RGB colour space, together with the adopted linear discriminants

4.4.2 Face equalization


To make the next step of facial feature detection less dependent on the intensity of the light source, a further pre-processing step is included. Several possibilities have been proposed over the years in the literature to reduce the dependence of image processing on the lighting levels.


Figure 4-5: A few examples of original images, images with pixels classified as non-face pixels set to white and resulting images of the face localization module


According to Belhumeur et al. [8] the result becomes independent of light source identity by reducing the feature vector to zero mean and unit variance. Liu et al. [78] adopt a homomorphic filter to eliminate illumination variations across the image. Wong et al. [135] reduce the lighting effect by transforming the image's histogram into the histogram of a reference face image. Jebara and Pentland [43] also use a histogram fitting process to normalize the illumination in face images. Two transfer functions are computed: one for mapping the left half of the face to a desired histogram and the other for mapping the right half of the face. A weighted mixture of these transfer functions is used when traversing from the left side of the face to the right side, smoothly removing directional shading. The destination histograms were constructed from a properly illuminated, canonical face.

Rowley et al. [108] first attempt to equalize the intensity values across the image. A function which varies linearly across the image is fitted to the intensity values in an oval region inside the image. Pixels outside the oval may represent the background, so those intensity values are ignored in computing the lighting variation across the face. The linear function approximates the overall brightness of each part of the image and can be subtracted from the image to compensate for a variety of lighting conditions. Then, histogram equalization is performed, which nonlinearly maps the intensity values to expand the range of intensities in the image. The histogram is computed for pixels inside an oval region in the image. This compensates for differences in camera input gains, as well as improving contrast in some cases.

In the current system, the pre-processing step is based on the histogram equalization proposed by Rowley et al. [108]. The input to the pre-processing module consists of two images outputted by the face finding module. They have the same size and contain the same face, but one still contains the background information, while the other does not (see Figure 4-6). First, histogram equalization is performed on the non-background pixels (Figure 4-6(b)). To improve contrast, the resulting histogram is only allowed values between 100 and 255. Thus, the histogram transformation of the face can be described as a nonlinear function:

y = H(x), \qquad 0 \le x \le 255, \quad 100 \le y \le 255    (4.3)

Secondly, the same histogram transformation H is performed on the background. The reason for not discarding the background will become clear in the next section. The histogram equalization is performed separately on the three colour channels (red, green and blue). Strictly speaking, these three histograms are not independent, but for the purpose of decreasing the lighting effect and improving contrast, the proposed approach is sufficient. The result of this face equalization serves as input for the next module, the facial feature detection.
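A minimal per-channel sketch of this equalization, assuming the mapping H is built from the cumulative histogram of the face pixels and compressed into [100, 255]; the exact construction of H in the thesis may differ:

    import numpy as np

    def equalize_face(channel, face_mask, lo=100, hi=255):
        """Build the nonlinear mapping y = H(x) of equation (4.3) from the
        non-background (face) pixels of one uint8 colour channel and apply
        the same mapping to the whole channel."""
        hist = np.bincount(channel[face_mask], minlength=256).astype(np.float64)
        cdf = hist.cumsum() / max(hist.sum(), 1.0)
        H = (lo + (hi - lo) * cdf).astype(np.uint8)
        return H[channel]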

4.5 Facial feature detection


A natural approach to detecting facial features is to exploit the geometric and intensity patterns of the facial features. Many researchers have exploited the regular patterns in facial features and have made various assumptions and successful approaches to extract these features. Over the years, various strategies for facial feature detection have been proposed, ranging from edge-map projections [12, 137] to techniques using generalized symmetry operators [133], colour analysis [17] and principal component analysis [89]. As these techniques often form an integral part of face detection and recognition, more information on these techniques can be found in chapter 2.


Figure 4-6: Face equalization: (a) - (b) input, (c) output

The facial feature detection in the current system will be modelled as a two stage algorithm based on Yow [145, 146]. The first stage operates on the raw image data and produces a list of interest points from the image, indicating likely location of facial features. The second stage will examine these interest points, and group and label them based on model knowledge of where they should occur with respect to each other.

4.5.1 Bar filter


The face is modelled as a plane with seven oriented facial features: two eyebrows, two eyes, a nose, a mouth and a chin. In a low-resolution image, these facial features appear as dark elongated blobs against the light background of the face. Therefore, filters with a good response to such patterns of intensity variation can be selected. Such filters are low-intensity bar detectors (also called valley detectors or line detectors), and a good choice for such a filter is the Gaussian derivative filter shown in Figure 4-7. This filter is a second derivative of a Gaussian in one direction, and a Gaussian in the orthogonal direction. The filter is also elongated to three times its width, giving it better orientation selectivity. The Gaussian derivative filter is a separable filter and can be represented as a product of two one-dimensional functions in orthogonal directions. This allows efficient implementation by convolving the image with two one-dimensional kernels rather than a single two-dimensional kernel. Hence, the Gaussian derivative filter, Gdf(x, y), oriented as shown in Figure 4-7, can be expressed as:

Gdf(x, y) = G(x)\,\frac{d^2}{dy^2}G(y)    (4.4)

where

G(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2 / (2\sigma^2)}    (4.5)


is a univariate Gaussian distribution with mean μ and variance σ². The second derivative of a Gaussian detects bars of low intensity, while the Gaussian smoothes out intensity variations in the longitudinal direction. This makes it an excellent detector for the eyebrows, eyes, nose, mouth and chin. Also, the 3:1 elongation of the filter corresponds approximately to the length-width ratio of the eyes, eyebrows and nose, resulting in maximal response when the scale and orientation of the filter match those of the features.

Figure 4-7: (a) A Gaussian derivative filter. (b) The surface plot of (a)

The histogram equalized face in Figure 4-8(a) is convolved with a Gaussian derivative filter of matching scale and orientation. An iterative thresholding and non-maximal suppression operation is applied to the convolution output and the resulting points are marked as facial feature points. The convolution is performed for every colour channel separately, and only if a pixel is marked as belonging to a facial feature in all three channels will it be part of the final convolution output. Only interest points within the region earlier marked as being skin colour are retained. However, it is important that the input image contains the background, as otherwise the border between face and background could result in unwanted responses from the bar filter.

Figure 4-8(b), (c) and (d) show the output of the convolution, the thresholding and the non-maximal suppression, respectively. Figure 4-8(e) shows that the Gaussian derivative filter is indeed able to extract the location of the facial features. Every facial feature is represented by the coordinates of its centre of gravity and the strength of the feature. This feature strength is simply the number of pixels retained by the bar filter. Features with strength lower than a threshold are discarded immediately (see Figure 4-8(f)); the others are passed on to the next stage of the algorithm.

A first problem with using the second derivative Gaussian filter is that a number of false feature candidates are formed, which causes extra computational effort in subsequent stages. Secondly, the feature localization in the longitudinal direction is poor, causing inter-feature measurements to be inconsistent and unreliable. In the next few sections, we will investigate how the effect of these problems can be reduced.


Since the seven facial features used in the face model all lie in the same orientation and have approximately the same size, a single Gdf(x, y) filter at the right scale and orientation will generate strong responses at all the facial feature locations. In the current study, a Gaussian derivative filter with a scale of σ = 2.0 and an aspect ratio of 3:1 was used. As long as the head is within 20 degrees of being upright, all the facial features that can be seen in the image are detected by the Gaussian derivative filter. A way to extend this filter process to larger rotations will be described in section 4.5.5.
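A sketch of the bar filter response using a separable Gaussian derivative (here via scipy.ndimage); the assumption that the 3:1 elongation is obtained by a three times larger smoothing scale along the feature direction is mine:

    import numpy as np
    from scipy import ndimage

    def bar_filter_response(channel, sigma=2.0, elongation=3.0):
        """Response of the Gaussian derivative filter of equation (4.4):
        second derivative of a Gaussian across the feature (axis 0) and plain
        Gaussian smoothing along it (axis 1). Dark horizontal bars give strong
        negative responses, so the sign is flipped before thresholding."""
        response = ndimage.gaussian_filter(channel.astype(np.float32),
                                           sigma=(sigma, elongation * sigma),
                                           order=(2, 0))
        return -response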

Figure 4-8: Bar filter module: (a) face equalized input, (b) convolution output, (c) thresholding output, (d) non-maximal suppression output, (e) detected features, (f) feature centres

A similar approach to detecting possible locations for facial features has been used by Yao et al. [143]. However, they use Gabor-like filters to detect the features as dark blobs. A carefully selected threshold is then used to binarize the image and subsequently extract the candidate feature blobs. Each feature blob is represented by its centre of gravity. Based on the assumption that the potential face must be upright and near frontal, some heuristic rules were designed to exclude most false combinations.


Yow [146] used a combination of three Gaussian derivative filters to detect the facial features. This is followed by an edge detection technique to verify the facial features. In this way, a great many of the falsely detected features can be discarded. By examining the edges, the position of the features can be adjusted so that they lie in the middle of the enclosing edges, giving better localization of the facial features. In the current system, it was observed that the use of one Gaussian derivative filter is sufficient to obtain a list of interest points. This list will contain a few false features, but significantly fewer than in the system in [146] because of the preceding skin colour detection. The remaining false features will be discarded in the following stage.

4.5.2 Grouping of feature candidates


The face is modelled as a plane with seven oriented facial features (the eyebrows, the eyes, nose, mouth and chin). Unfortunately, these seven facial features are not always present in the image. This may be due to occlusion (e.g. a profile view or hair covering the eyebrows) or due to missing features (e.g. indistinct eyebrows). Because of these problems, the face model is decomposed into components consisting of three or four features, which are common occurrences of faces under different viewpoints or different identities. These groups are called partial face groups or PFGs [146]. These PFGs are further subdivided into components consisting of two features (horizontal and vertical pairs: Hpairs and Vpairs), see Figure 4-9.

Figure 4-9: The face model and the component face groups

Leung et al. [63] use a statistical framework based on the inter-feature distances to group facial feature points into face hypotheses. They assume that the variations in inter-feature distances between faces of different identities follow a Gaussian distribution and that any face image (given that the facial features are correctly located) can be represented by a particular instance in the distribution. Leung et al. [63] build a fully-connected graph describing all the inter-feature distances between all facial feature points and then evaluate the mean and variance of each of these distances. Face hypotheses are evaluated for these inter-feature distances to see if they fall


within the distribution specified by a training set. However, this approach is only effective if the face remains in a fixed pose (usually fronto-parallel) with little variation. The feature detection process based on the bar filter resulted in a set of points that could be the actual features. They must now be grouped into face candidates. A grouping framework based on [146] is used, as it is applicable to a wider range of views, not only fronto-parallel. Single features are grouped into vertical and horizontal pairs, pairs are grouped into partial face groups, and partial face groups are grouped into face candidates. A flow diagram of the feature grouping process is shown in Figure 4-10.

Figure 4-10: Flow diagram of grouping process

Because of the application of driver verification, the size of the face in the image is more or less fixed. Therefore it is possible to define minimum and maximum limits for the distance between any two facial features. Based on these limiting distances, pairs of feature candidates are examined to see if they form a valid feature pair (Hpair or Vpair), see Figure 4-9. Interest points that cannot be grouped into an Hpair or Vpair are removed from further processing. Although called Hpair and Vpair, these pairs are not required to be exactly horizontal or vertical. In general, the grouping algorithm can handle any orientation. However, as the bar filter is restricted to faces within 20 degrees left or right from upright, the grouping algorithm makes use of this constraint in defining horizontal and vertical distance limits, see Figure 4-11.

Figure 4-11: Example of distance limits for a Hpair

From the set of feature pairs thus formed, pairs of them are examined to see if they form a valid partial face group. For example, two Hpairs may combine to form a Top PFG if the inter-feature distances are within the defined limits. These limits are tightened from the previous step based on the PFG being formed. For example, the distance between the two features forming each Hpair of a Top PFG must be almost equal. Geometric constraints are also introduced. For example, the two


Hpairs of a Top PFG must be parallel (within a tolerance), or the nose and mouth candidates in a Bottom PFG must lie between the eye candidates. From the set of PFGs thus formed, pairs of them are examined to see if they have features in common. For example, a Top PFG and a Bottom PFG will have two features in common if they are part of the same face. These PFGs are combined to form face candidates. If two face candidates have overlapping features in common, they are combined into one.

In most cases, there will be only one face candidate formed by the combination of a Top PFG, Bottom PFG and Chin PFG. This candidate is put forward to the next module as the localized face with known feature coordinates. If more than one face candidate is supported by three PFGs and seven facial features, the final face is chosen based on a set of heuristic rules. If these face candidates overlap, the rules are based on the strength of the nose and mouth features. If the face candidates do not overlap, the rules are based on the relative orientation of the features, the strength of the features and the relative position of the face candidate in the image. Note that the image is now the reduced image containing only the skin colour pixels and immediate background, not the original image anymore. If no face candidate is supported by three PFGs and seven facial features (e.g., due to occlusion or very light eyebrows), the face candidates with two PFGs and six facial features are examined. Again, the set of heuristic rules is applied if more than one candidate exists. If no such candidate can be found, the candidates with five features are examined, and if still no candidate is available, the candidates with four features. When a decision has been made as to which face candidate to put forward as the final face with localized feature coordinates, this information is passed on to the next module, where the eyes will be localized more accurately and the face boundary will be extracted. Figure 4-12 shows an example of the final face outputted by the grouping algorithm. If no combination of the feature candidates results in a face candidate consisting of four or more features, the system reports that no face is found and moves on to the next frame in the video sequence.

To select the best face among the face candidates, Yow [146] proposed a probabilistic decision framework to model the uncertainties in the candidates and the image evidence. A Bayesian belief network is used to update the true face candidates to a high probability based on the presence of facial features, PFGs and image evidence. The face candidate with the highest probability is then chosen as the final output. Yow [146] needed a statistical framework like this because many face candidates were selected in the background. By using a colour discrimination step before the facial feature detection, the current system discards most of the background, and heuristic rules seem sufficient to select the true face among the face candidates.
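The first grouping step, forming pairs of feature candidates whose separation lies within the allowed limits, can be sketched as follows; a single Euclidean distance limit is used here for brevity, whereas the system applies separate horizontal and vertical limits as in Figure 4-11:

    def valid_pairs(features, d_min, d_max):
        """Pair up feature candidates whose centre-to-centre distance lies within
        the limits derived from the known face size; features is a list of
        (x, y, strength) tuples and the returned pairs are index tuples."""
        pairs = []
        for i, (xi, yi, _) in enumerate(features):
            for j in range(i + 1, len(features)):
                xj, yj, _ = features[j]
                d = ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
                if d_min <= d <= d_max:
                    pairs.append((i, j))
        return pairs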

4.5.3 Iris detection


As can be seen in Figure 4-12, the Gaussian bar filter detects pixels on the eye contour rather than in the iris region. Because the face image will be normalized to a standard size and shape in a later module, it is important that a few facial features can be determined repeatably and with good accuracy. The features detected on the eye contour are very sensitive to the face pose and image illumination. Therefore, these features are replaced with the centres of the irises, which are less sensitive to pose and illumination changes.



Figure 4-12: Grouping module: (a) input, (b) output

Starting from the approximate eye position located by the Gaussian bar filter, it is only necessary to look for the centre of the iris in a small window around this position. The algorithm used is based on the eye location method described in [44] and works under the assumption that an iris is approximately circular and dark against a bright background (the white part of the eye). Because this assumption holds in the three RGB colour channels, only the red channel is used for the accurate iris detection. Note that the original image, and not the face-equalized image, is used. Each eye window is a square window of 40 × 40 pixels around the approximate eye position located by the previous bar filter. Because the position of the above-lying eyebrow is also known, care is taken not to include the eyebrow in this window. Many authors define an eye window that contains both eyes [44, 57, 68]. In that case, symmetry operators can be used to decide on the final iris position. However, because in the current system not only frontal views are examined, it is better to use a separate eye window for each eye. The face image within each eye window is segmented using a low threshold value to isolate the darkest 15% of the pixels, including those in the iris. The result is a binary image with the dark pixels set to black and the bright pixels to white. A closing operation (dilation followed by erosion) is performed to discard the few bright pixels inside the iris caused by reflection. Care is taken to only perform the closing operation in the inside region, to fill up the bright reflection spots in the iris. Next, a round disc pattern is correlated in each eye window by counting the matching points for a range of circle centres and radii. The maximum correlation in each eye window gives the iris centre. This processing is illustrated in Figure 4-13, where parts (a) and (b) contain the eye-window images showing the red channel pixels and the retained dark points, respectively. Part (c) shows the eye-windows after performing the closing operation. The hole in the left eye is too big and cannot be closed while the hole in the right eye has disappeared. Part (d) shows an example of a disk template; the image is enlarged for better visibility. The crosses in Figure 4-14 show the eye positions before and after the iris detection procedure. Although the hole in the left eye could not be closed, an accurate position for the iris centre could still be found. Based on the new positions of the irises, the face image is rotated so that the two irises lie on a horizontal line, see Figure 4-15.
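As an illustration of this procedure (threshold the darkest 15%, close the bright reflection spots, then correlate a dark disc), a small OpenCV/NumPy sketch is given below. The candidate radius range, the normalisation by disc area and the function name are assumptions made for the example, not the thesis parameters:

```python
import numpy as np
import cv2  # OpenCV; any image library with morphological closing would do

def locate_iris(red_window, radii=range(5, 12)):
    """Refine an approximate eye position to the iris centre (illustrative sketch).

    `red_window` is the red-channel eye window (e.g. 40 x 40 pixels) around the
    position returned by the bar filter. Returns (row, col) within the window."""
    thresh = np.percentile(red_window, 15)           # keep the darkest 15% of pixels
    dark = (red_window <= thresh).astype(np.uint8)   # 1 = dark pixel, 0 = bright pixel
    dark = cv2.morphologyEx(dark, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))

    best, best_score = (0, 0), -1.0
    for r in radii:                                   # assumed range of iris radii (pixels)
        ys, xs = np.ogrid[-r:r + 1, -r:r + 1]
        disc = (xs * xs + ys * ys <= r * r).astype(np.float32)
        # Count dark pixels under the disc for every candidate centre (correlation).
        score = cv2.filter2D(dark.astype(np.float32), -1, disc,
                             borderType=cv2.BORDER_CONSTANT)
        cy, cx = np.unravel_index(np.argmax(score), score.shape)
        if score[cy, cx] / disc.sum() > best_score:   # normalise by disc area (a choice)
            best, best_score = (cy, cx), score[cy, cx] / disc.sum()
    return best
```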


Figure 4-13: Illustrating the procedure for locating the iris: (a) red channel eye-windows, (b) thresholded eye-windows, (c) eye-windows after closing, (d) disk template


Figure 4-14: Illustrating the procedure for iris detection: (a) input, (b) output

Figure 4-15: Resulting image from iris detection module

4.5.4 Boundary detection


The face boundary is probably one of the hardest features in a human face image to extract. This is because the human face is a 3D structure and the face boundary is an occluding boundary obtained from projecting the 3D shape of the human head onto a 2D image. As a result, it is very difficult to describe the shape of the face boundary or to build a suitable model. Several algorithms have been proposed to extract the face boundary, ranging from template matching [44] to active contours and snakes [56].


Snakes or active contours are widely used as a face boundary tracking method. A curve defined by a controlled continuity spline function is transformed from an initial state into the shape of the face boundary by minimizing an energy function. The main difficulties with active contours are that a good initial estimation must be available and that the active contour will always converge onto a solution, whether it is the desired solution or a false one. Moreover, the computational cost is high and the method is therefore quite slow [140]. The boundary detection module in the current system is based on [146]. A cubic B-spline snake is used because of the ease and speed of defining and moving the control points rather than the actual snake pixels. Only the face boundary, instead of the entire head boundary, will be located because the hair accounts for a large part of the variance, which makes the detection unreliable. It can be observed that each side of the face can be modelled separately as a quarter ellipse, see Figure 4-16. Therefore, the snake will be initialized as two quarter ellipses as shown in Figure 4-16, sharing the same centre but with two different minor axis parameters. The snake is started in the interior of the face, very close to the facial features, and the search for the face boundary is performed outwards. In this way, face boundaries at different viewpoints can be extracted.

Figure 4-16: Estimating the face boundary with two partial ellipses

The centre of the partial ellipses is chosen to be the intersection point between the line joining the two eyes and the perpendicular from the line to the chin. The two minor axes are chosen as the distances from the ellipse centre, along the direction of each eye, to points on the face boundary (these points are called (x1, y1) and (x2, y2)). The major axis is similarly chosen from the ellipse centre to a point (x3, y3) on the chin. Figure 4-17 shows the ellipse parameters for a face. The positions of the facial features can be obtained from the output of the feature detection process in the previous sections. If the chin could not be detected by the previous bar filter and grouping algorithm, the position of the chin is determined in the following way. An initial position for the chin is estimated as lying on the perpendicular from the mouth to the line joining the two eyes, on the opposite side of the mouth from the nose. The distance from the chin to the mouth is estimated as being one and a half times the distance from the mouth to the nose. This initial estimate is refined by making a search for a short segment of edges which have an orientation parallel to the line joining the two eyes. This search is only performed in a small window around the initial chin estimate. The result of this search gives the chin position (x3, y3). Let (xc, yc) be the coordinates of the centre of the ellipse. Let (xle, yle), (xre, yre) and (x3, y3) be the coordinates of the left eye, right eye and chin, respectively. The equation for the coordinates of the centre of the ellipse (xc, yc) is given by:

x_c = \frac{A^2 x_{le} - A\,y_{le} + A\,y_3 + x_3}{A^2 + 1}        (4.6)

and

y_c = \frac{A^2 y_3 + A\,x_3 - A\,x_{le} + y_{le}}{A^2 + 1}        (4.7)

where

A = \frac{y_{re} - y_{le}}{x_{re} - x_{le}}        (4.8)
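A minimal sketch of this centre computation follows; it directly encodes equations (4.6)-(4.8), assumes the two eyes are not vertically aligned (x_re ≠ x_le), and the function name is illustrative:

```python
import numpy as np

def ellipse_centre(left_eye, right_eye, chin):
    """Foot of the perpendicular from the chin onto the eye-to-eye line,
    i.e. the ellipse centre of equations (4.6)-(4.8). Inputs are (x, y) pairs."""
    (xle, yle), (xre, yre), (x3, y3) = left_eye, right_eye, chin
    A = (yre - yle) / (xre - xle)                            # slope of the eye line, eq. (4.8)
    xc = (A**2 * xle - A * yle + A * y3 + x3) / (A**2 + 1)   # eq. (4.6)
    yc = (A**2 * y3 + A * x3 - A * xle + yle) / (A**2 + 1)   # eq. (4.7)
    return xc, yc
```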


Figure 4-17: (a) Ellipse parameters for a face. (b) Ellipse parameters for a face in profile view

Finding the ellipse parameters is based on the knowledge that the face boundary is made up of intensity edges lying in the same orientation as the tangents of the ellipse at the same point. The major axis parameter r3 is determined by the distance between the ellipse centre and the chin coordinates (x3, y3). After fixing the major axis parameter, the minor axis parameters r1 and r2 are searched for by initializing the boundary point at one of the eye positions and moving the boundary point along the minor axis in the direction of that eye. For each point along the minor axis, a quarter ellipse is drawn from the boundary point to the major axis and the edge strength


and orientation are evaluated along the ellipse. If the edge orientation at a point on the ellipse is the same as the orientation of the tangent to the ellipse at that point (within a specified tolerance), the edge strength is recorded and summed up along with other valid points along the ellipse. The position of the boundary point which gives the largest summation of edge strengths will be chosen as the ellipse parameter for the minor axis under consideration. This process is then repeated for the other minor axis parameter. The ellipse parameters r1, r2 and r3 are then obtained by taking the Euclidean distance of the respective boundary points to the ellipse centre:

r_1 = \sqrt{(x_c - x_1)^2 + (y_c - y_1)^2}        (4.9)

r_2 = \sqrt{(x_c - x_2)^2 + (y_c - y_2)^2}        (4.10)

r_3 = \sqrt{(x_c - x_3)^2 + (y_c - y_3)^2}        (4.11)

As mentioned before, the boundary detection process is based on the position and orientation of intensity edges in the face image. In the current system, these edges are detected by a Canny edge detector [26]. First, the image is smoothed by Gaussian convolution. Then a simple 2D first derivative operator is applied to the smoothed image to highlight regions of the image with high first spatial derivatives. Edges give rise to ridges in the gradient magnitude image. The algorithm then tracks along the top of these ridges and sets to zero all pixels that are not actually on the ridge top so as to give a thin line in the output, a process known as non-maximal suppression. The tracking process exhibits hysteresis controlled by two thresholds: T1 and T2, with T1 > T2. Tracking can only begin at a point on a ridge higher than T1. Tracking then continues in both directions out from that point until the height of the ridge falls below T2. This hysteresis helps to ensure that noisy edges are not broken up into multiple edge fragments. In the current implementation of the Canny edge detection, the lower threshold T2 is fixed at 1, while the higher threshold T1 is determined iteratively so as to make sure a minimum number of edge pixels are retained. When the Canny edge detector is implemented on the three separate colour channels, very similar results are obtained. Therefore it was decided to use only the red channel.

The result of the basic Canny edge detection is shown in Figure 4-18(a). Only the edges within the previously determined skin colour region are retained to avoid the snake being attracted to background edges, see Figure 4-18(b). During experiments, it was noticed that the elliptical snake was often attracted by boundaries around the eye and mouth regions. Therefore, an extra step was added to the edge detection to discard most edges in the interior of the face. First, at each eye position a disk with small diameter is placed over the edge map. If an edge crosses the disk boundary on the outside half, a disk with a slightly larger diameter replaces the previous one. This process continues until no more edges cross the disk boundary on the outside half or a diameter limit is reached, see Figure 4-18(c). Second, a similar growing algorithm is used in the mouth region. A rectangular region covering the nose and mouth coordinates is grown sideways until no more edges cross the left and right boundaries, see Figure 4-18(d). Finally the covered eye and mouth regions are connected to form one region covering most of the edges in the interior of the face, see Figure 4-18(e). The resulting edge map is shown in Figure 4-19(a) and is used as input to the boundary detection algorithm. Figure 4-19(b) then shows the output of the boundary detection module.
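A hedged sketch of this edge-map preparation, using OpenCV's Canny implementation, is given below. Only the iterative choice of the high threshold T1 with T2 fixed at 1 and the masking to the skin region follow the description above; the starting value of T1, the step size, the smoothing kernel and the minimum edge-pixel count are assumptions:

```python
import cv2
import numpy as np

def face_edges(red_channel, skin_mask, min_edge_pixels=2000, t2=1):
    """Edge map for boundary detection (sketch).

    `red_channel` is an 8-bit grey image and `skin_mask` a uint8 mask (255 inside
    the detected skin region). T1 is lowered until at least `min_edge_pixels`
    edge pixels remain; background edges outside the skin region are discarded."""
    smoothed = cv2.GaussianBlur(red_channel, (5, 5), 1.5)
    t1, edges = 255, np.zeros_like(red_channel)
    while t1 > t2:
        edges = cv2.Canny(smoothed, t2, t1)           # hysteresis thresholds (T2, T1)
        if np.count_nonzero(edges) >= min_edge_pixels:
            break
        t1 -= 10                                      # relax the high threshold and retry
    return cv2.bitwise_and(edges, edges, mask=skin_mask)
```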


In the case where only one eye is present (e.g. a profile view), the ellipse centre is taken to be the intersection of the line joining the chin and mouth, and the perpendicular from the line to the detected eye, see Figure 4-17(b). The equations for the ellipse centre will be of the same form as (4.6) to (4.8), replacing the coordinates of (xle, yle), (xre, yre), and (x3, y3) with (xm, ym), (x3, y3) and (xe, ye) respectively, where (xm, ym) is the coordinate of the mouth and (xe, ye) the coordinate of the detected eye.

x_c' = \frac{B^2 x_m - B\,y_m + B\,y_e + x_e}{B^2 + 1}        (4.12)

and

y_c' = \frac{B^2 y_e + B\,x_e - B\,x_m + y_m}{B^2 + 1}        (4.13)

where

B = \frac{y_3 - y_m}{x_3 - x_m}        (4.14)

The occluded eye is assumed to be directly behind the visible eye. Thus the search for the minor axis parameters r1 and r2 for the quarter ellipses will start from the same location but in the opposite direction.


Figure 4-18: Different steps of Canny edge detector

4.5.5 Rotation invariance


As mentioned before, the Gaussian bar filter gives good results as long as the face is within 20° of in-plane rotation from upright. As the face rotates out of this range, fewer features will be detected by the Gaussian filter. Because of the particular application of driver verification, this constraint is fulfilled in most cases. However, to detect faces that are rotated by more than 20°, a solution is to apply more than one filter, each rotated by 20°.



Figure 4-19: Face boundary detection: (a) Edge map input, (b) Face boundary output

Figure 4-20 shows the output of convolving the face image with bar filters in three different rotations (-20°, 0° and 20°). The original image (shown in Figure 4-21(a)) forms an angle of 18° with the vertical direction. As can be seen, the features appear stronger when the orientation of the facial features corresponds closely to the orientation of the filter. With 0° rotation, the facial features are still detected, but many false features are also detected and the output appears cluttered. Therefore, these three filter outputs can be examined for the strongest, most horizontal features and only that output will be forwarded to the following stages. The final output of the feature detection module is shown in Figure 4-21(b).
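The multi-orientation filtering can be sketched as follows. The kernel is a generic elongated, zero-mean Gaussian bar standing in for the thesis' Gaussian bar filter, and the selection rule is simplified to "largest maximum response" rather than the full strongest/most-horizontal criterion; the sizes, sigmas and names are assumptions:

```python
import numpy as np
from scipy.ndimage import convolve, rotate

def oriented_bar_kernel(size=21, sigma_x=6.0, sigma_y=2.0, angle_deg=0.0):
    """Elongated, zero-mean Gaussian bar rotated to `angle_deg` (illustrative)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = np.exp(-(x**2 / (2 * sigma_x**2) + y**2 / (2 * sigma_y**2)))
    kernel -= kernel.mean()                 # zero mean: flat regions give no response
    return rotate(kernel, angle_deg, reshape=False)

def best_rotation_response(gray, angles=(-20.0, 0.0, 20.0)):
    """Convolve with bar filters at several rotations and keep the strongest output."""
    responses = {a: convolve(gray.astype(float), oriented_bar_kernel(angle_deg=a))
                 for a in angles}
    best = max(responses, key=lambda a: responses[a].max())
    return best, responses[best]
```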

4.6 Face representation


4.6.1 Introduction
The next important module, after the face localization and the facial feature detection, is the face representation. In this step the located face needs to be saved in a standard format that can be used in the face verification module. In feature-based recognition systems, a number of features (e.g. inter-feature distances, Gabor filter responses, etc.) are extracted and combined into a feature vector [12, 73, 140]. In template-based recognition systems, the pixel intensities of the face will be used for recognition [72, 74, 80, 88]. In this case, the intensity values of the image pixels will form the feature vector. However, many authors perform some type of normalization (for variations in size, position, orientation, illumination, etc.) before placing the pixels in the feature vector. Lizama et al. [80] use the relative positions between the detected features (left and right eye, nose and mouth) to determine if a front-view of the face is involved. If this is the case, the image is rotated so that the eye-to-eye line is horizontal. Next, the image is horizontally and vertically scaled so that the eye-to-eye distance and the eye-to-mouth distance are the same for all images. Finally, the image is masked, illumination-normalized and cropped.



Figure 4-20: Gaussian bar filter applied over three different rotations: (a) -20°, (b) 0°, (c) 20°


Figure 4-21: Facial feature detection applied on rotated face


Torres et al. [127] apply a morphing technique to every image. The morphing process results in an image where the facial feature points have been moved to standard predefined positions. This is equivalent to a normalization of the shape of the faces. The standard positions correspond to a frontal view with neutral expression and a given fixed size. The morphing technique is based on texture mapping, a widely known technique in computer graphics. The image of a face is split into triangular polygons (based on the Candide model) whose vertices are characteristic facial points. The texture of each original triangle is mapped to the triangle defined by the standard points. The Candide model uses 44 points and 78 polygons. However, it is very difficult to automate the location of these 44 points and therefore manual allocation was used.

Lorente and Torres [83] apply a morphing technique based on a simplified face model. Only the location of four facial points is needed: the eyes, the nose and the mouth. The automatic location of these points is an affordable task even with small images. The model then automatically estimates thirteen additional facial points, so the entire model consists of 17 points which are connected by 22 triangles. All the additional points except three are located on the edge of the face. The three additional inner points are needed to cater for changes in nose appearance in the case of orientation variations. The additional points are located based on the geometrical relations between the four main points. Finally, the contrast of the images is normalized using histogram stretching.

Liu and Wechsler [72, 74] present the face as a feature vector combining shape and texture information. Shape encodes the feature geometry of a face while texture provides a normalized shape-free image by warping the original face image to the mean shape, i.e., the average of aligned shapes. Shape and texture coding, usually used in conjunction with norm-based coding, is a two-stage process once the face has been located. Coding starts by annotating the face using important internal and face boundary points. Once these control points are located, they are aligned using translation, scaling and rotation transformations as necessary, and a corresponding mean shape is derived. The next stage then triangulates the annotated faces and warps each face to the mean shape. The first stage yields the shape, a feature vector containing the coordinates of the facial feature points, while the second stage yields the texture and corresponds to what is known as full anti-caricature. This combination of shape and texture was compared with masked images and shape images. Masked images were derived by first using the eye centres to align the face images and then placing a mask on them. Shape images undergo the same alignment procedure as the shapes do, but preserve the intensity information within the contours of the faces. Experiments showed that the shape and texture combination eventually leads to better recognition performance; however, the allocation of the feature points needed to align the shapes still happens manually [72, 74].

Craw et al. [20] investigate different codings used for automatic face recognition. They found that for eigenface-based recognition, a coding of shape-free faces using manually located landmarks was more effective than the corresponding coding of correctly shaped faces. Configuration (shape information) also proved an effective method of recognition, with the rankings given to incorrect matches relatively uncorrelated with those from shape-free faces. When both sets of information are combined, the performance of either system improves significantly. The addition of a system which directly correlates the intensity values of shape-free images also significantly increased recognition, suggesting extra information was still available. Manipulating the shape-free coding to emphasize distinctive features of the faces, by caricaturing, allowed further increases in performance when the independent shape-free and configuration codings were used.


From the previous examples, it seems clear that shape-free images result in the best recognition performance. However, it is very difficult to automatically detect the control points necessary for the mapping algorithm. Therefore, masked images are used in the current system. The face images will be aligned by a warping technique based on [9]. Because the parameters of the mapping algorithm depend on the pose of the face, a pose estimation step precedes the warping algorithm. Once the images are warped to a standard format, a mask will be placed over them and illumination normalization will be performed.

4.6.2 Pose estimation


The pose estimation used in the current system is based on [40]. The relative location of the eyes and eyebrows in the face is used to estimate the pose of the face in the image. Figure 4-22 shows a human head seen from above. It is assumed the head can be modelled as a circle. The distance a between the projection of the mid-point of the two eyes and the centre of the face can be calculated. The radius of the head r can be obtained from the previous boundary detection stage. The pose can then be estimated by θ = arcsin(a/r). In the particular application of driver verification, it was observed that a person's gaze is not always aligned with the pose of his or her face. For example, in Figure 4-19(b), the person's irises are not in the centre of the eyes. Therefore, the location of the eyebrows, rather than the eyes, is used to estimate the pose of the face. Only when the distance between the eyebrows differs significantly from the distance between the eyes, which indicates that one of the eyebrows was probably mislocated, are the eyes used for pose estimation.


Figure 4-22: Pose estimation
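A one-line version of this estimate, assuming the image has already been rotated so that the irises lie on a horizontal line (so the offset a can be taken along the x-axis), might look as follows; the argument names are illustrative:

```python
import numpy as np

def estimate_pose(brow_mid_x, face_centre_x, face_radius):
    """Pose angle in degrees from theta = arcsin(a / r), where a is the horizontal
    offset between the projected mid-point of the eyebrows and the face centre and
    r is the head radius obtained from the boundary ellipse."""
    a = brow_mid_x - face_centre_x
    ratio = np.clip(a / face_radius, -1.0, 1.0)   # guard against |a| slightly exceeding r
    return np.degrees(np.arcsin(ratio))
```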

4.6.3 Warping
The purpose of this step is to align the faces in some way. In the current system, the faces will be aligned based on the left and right iris, the left and right facial edge, the forehead and the chin. This means that every face will be warped to a canonical form with fixed positions for these six facial features. Bookstein [9] has developed a mapping based on thin-plate splines that can handle both affine and non-affine mappings.


The warping transformation f(x, y): \mathbb{R}^2 \rightarrow \mathbb{R}^2 : (x, y) \mapsto (x', y') is of the following form:

f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i \, U(|Z_i - (x, y)|)        (4.15)

where

U(r) = r^2 \log r^2        (4.16)

is the thin-plate spline function, r = \sqrt{x^2 + y^2} is the distance from the Cartesian origin, n is the number of landmark points, Z_i = (x_i, y_i) are the n landmark points (i = 1, ..., n) and the weights w_i sum to one. The computation of the coefficients in (4.15) involves the solution of two square linear systems of size n + 3 (with the same matrix in each case). An algebraic treatment of the computation of the coefficients for the above mapping can be found in [9].

Six landmark points were chosen for the mapping: the left and right eye, the left and right edge, the forehead and the chin. All these points, except for the forehead, were determined in the previous feature detection steps. The forehead is defined as a point on the line connecting the ellipse centre with the chin, in the opposite direction of the chin. The distance between the forehead and the ellipse centre is taken equal to the distance between the mouth and the chin.

Based on the previous pose estimation, the faces are divided into nine classes between -90° and 90°, each with a range of 20°. Thus, the frontal view class actually contains poses between -10° and 10°. For each class, a canonical face is defined, i.e., the coordinates Z'_i = (x'_i, y'_i) of the six landmark points are defined. Because, even within every pose class, there is a significant difference between poses, the x'-coordinates for the edge landmark points in the canonical face are further refined before calculating the warping function f(x, y). This can be explained using the following example. The landmark coordinates in the canonical face for the class [-10°, 10°] are defined as x'_l = 20 and x'_r = 100. These will be the final coordinates if the pose is equal to 0°. If the pose θ is between -10° and 10°, but different from 0°, these coordinates will be moved over a distance \sin(\theta)\,(x'_r - x'_l)/2. For example, if θ = 5°, the new edge coordinates will be x'_l = 23 and x'_r = 103.
Once the new edge coordinates are calculated, the warping function f(x, y) can be calculated for every face. All warped faces will have a size of 60 × 100 pixels, with the eyes, edges, forehead and chin on a fixed, known position. Eventually, a mask is placed over the face and only the pixels not covered by the mask are retained. Figure 4-23 shows an example of a face image after warping and masking. For each colour channel, the rows of this image are concatenated to form a feature vector. This means that every image is represented by three feature vectors (R, G and B).
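A compact sketch of a Bookstein-style thin-plate spline fit with U(r) = r² log r² is shown below. It solves the (n + 3) × (n + 3) system described above for both output coordinates at once; to warp an actual image one would typically fit the map from canonical to original coordinates and sample the original pixels at the mapped positions. The function names are illustrative, not the thesis code:

```python
import numpy as np

def fit_tps(src_pts, dst_pts):
    """Fit thin-plate spline coefficients mapping `src_pts` to `dst_pts`
    (both (n, 2) arrays of landmark coordinates)."""
    src = np.asarray(src_pts, float)
    n = len(src)
    def U(r2):                      # U(r) = r^2 log r^2, written in terms of r^2
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r2 > 0, r2 * np.log(r2), 0.0)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = U(d2)
    P = np.hstack([np.ones((n, 1)), src])             # affine part [1, x, y]
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])   # (n + 3) x (n + 3) system matrix
    Y = np.vstack([np.asarray(dst_pts, float), np.zeros((3, 2))])
    coeffs = np.linalg.solve(L, Y)                    # both output coordinates at once
    return coeffs, src, U

def apply_tps(pts, coeffs, src, U):
    """Evaluate the fitted warp at query points `pts` of shape (m, 2)."""
    pts = np.asarray(pts, float)
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    return U(d2) @ coeffs[:-3] + np.hstack([np.ones((len(pts), 1)), pts]) @ coeffs[-3:]
```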

4.6.4 Illumination normalization


Because the original pixel intensities were used to form the warped image, it is important to include an illumination normalization step. In section 4.4.2, a histogram equalization algorithm was included to make the facial feature detection less dependent on the intensity of the light source. Several other possibilities, proposed in the literature, were also described.


Figure 4-23: Warped and masked image

Here, two possibilities are examined. In the first option, each feature vector is reduced to zero mean and unit norm as described in [88, 89]. In the second option, the feature vectors first undergo histogram equalization as in section 4.4.2 and are subsequently normalized to zero mean and unit norm. Torres et al. [127] describe the use of colour information in eigenface-based recognition. They show that the use of a colour space with separated luminance and chrominance information (such as YCbCr) can improve the recognition rate, while a colour space where this information is mixed (such as RGB) does not improve the recognition rate when compared to a method that uses only one colour channel. Based on this study, a third option will be examined. The RGB colour space is transformed to the YCbCr colour space (Y for luminance, Cb for blue chrominance and Cr for red chrominance) by the following equations [11]:

Y  = 0.2989\,R + 0.5866\,G + 0.1145\,B

Cb = -0.1687\,R - 0.3312\,G + 0.5\,B + 127.5        (4.17)

Cr = 0.5\,R - 0.4183\,G - 0.0816\,B + 127.5

Again, for each colour channel, the rows of the image are concatenated to form a feature vector. Thus, every image is still represented by three feature vectors (Y, Cb and Cr). Again, each feature vector is reduced to zero mean and unit norm.
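This third option can be sketched in a few lines; the conversion coefficients are those of equation (4.17) and the per-channel normalization is the zero-mean, unit-norm reduction described above (the function name is illustrative):

```python
import numpy as np

def ycbcr_feature_vectors(rgb):
    """Convert an RGB face image (H, W, 3) to YCbCr with the coefficients of (4.17)
    and return three zero-mean, unit-norm feature vectors (Y, Cb, Cr)."""
    R, G, B = [rgb[..., i].astype(float) for i in range(3)]
    Y  =  0.2989 * R + 0.5866 * G + 0.1145 * B
    Cb = -0.1687 * R - 0.3312 * G + 0.5    * B + 127.5
    Cr =  0.5    * R - 0.4183 * G - 0.0816 * B + 127.5
    vectors = []
    for channel in (Y, Cb, Cr):
        v = channel.ravel()                      # concatenate the rows into one vector
        v = v - v.mean()                         # zero mean
        vectors.append(v / np.linalg.norm(v))    # unit norm
    return vectors
```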

4.7 Face verification


4.7.1 Theory
The face verification module is based on the probabilistic subspace technique introduced by Moghaddam [90, 91, 92, 95]. A probabilistic measure is used, based on the probability that the image intensity differences Δ = I_1 − I_2 are characteristic of typical variations in appearance of the same object. Two classes of facial image variations are defined: intrapersonal variations Ω_I (corresponding, for example, to different facial expressions or different lighting conditions for the same individual) and extrapersonal variations Ω_E (corresponding to variations between different individuals). The similarity measure is expressed in terms of the intrapersonal a posteriori probability given by the Bayes rule:

S(\Delta) = P(\Omega_I \mid \Delta) = \frac{P(\Delta \mid \Omega_I)\,P(\Omega_I)}{P(\Delta \mid \Omega_I)\,P(\Omega_I) + P(\Delta \mid \Omega_E)\,P(\Omega_E)}        (4.18)


To deal with the high dimensionality of Δ (which is the same as that of the images), an efficient density estimation method is used [89, 95]. The vector space R^N is divided into two complementary subspaces using eigenspace decomposition. This method relies on principal component analysis (PCA) to form a low-dimensional estimate of the complete likelihood which can be evaluated using only the first M principal components, where M << N. This decomposition is illustrated in Figure 4-24(a), which shows an orthogonal decomposition of the vector space R^N into two mutually exclusive subspaces: the principal subspace F containing the first M principal components and its orthogonal complement F̄, which contains the residual of the expansion. The component of Δ in the orthogonal subspace F̄ is the so-called distance-from-feature-space (DFFS), a Euclidean distance equivalent to the PCA residual error. The component of Δ which lies in the feature space F is referred to as the distance-in-feature-space (DIFS) and is a Mahalanobis distance for Gaussian densities.


Figure 4-24: (a) Decomposition of R^N into the principal subspace F and its orthogonal complement F̄ for a Gaussian density. (b) A typical eigenvalue spectrum and its division into the two orthogonal subspaces

As derived in [89], the complete likelihood estimate can be written as the product of two independent marginal Gaussian densities:
\hat{P}(\Delta \mid \Omega) = \left[ \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{M} \frac{y_i^2}{\lambda_i}\right)}{(2\pi)^{M/2}\prod_{i=1}^{M}\lambda_i^{1/2}} \right]\left[ \frac{\exp\left(-\frac{\varepsilon^2(\Delta)}{2\rho}\right)}{(2\pi\rho)^{(N-M)/2}} \right] = P_F(\Delta \mid \Omega)\,\hat{P}_{\bar{F}}(\Delta \mid \Omega; \rho)        (4.19)

where P_F(Δ | Ω) is the true marginal density in F, P̂_F̄(Δ | Ω; ρ) is the estimated marginal density in the orthogonal complement F̄, y_i are the principal components, and ε²(Δ) is the PCA residual (reconstruction error). The information-theoretic optimal value for the density parameter ρ is derived by minimizing the Kullback-Leibler divergence and is found to be simply the average of the F̄ eigenvalues:


\rho = \frac{1}{N - M}\sum_{i=M+1}^{N} \lambda_i        (4.20)
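Evaluating the two-part estimate of (4.19)-(4.20) for one class is straightforward once the first M eigenvalues and eigenvectors and the residual variance ρ are available. The sketch below returns the log-likelihood, which is numerically safer than the raw product; it is an illustration under these assumptions, not the thesis code:

```python
import numpy as np

def marginal_log_likelihood(delta, eigvecs, eigvals, rho):
    """Log of the density estimate (4.19) for a difference vector `delta` (length N),
    given the first M eigenvectors (N, M) and eigenvalues (M,) of the class
    covariance and the residual variance `rho` from (4.20)."""
    N, M = eigvecs.shape
    y = eigvecs.T @ delta                           # principal components y_i
    difs = np.sum(y**2 / eigvals)                   # distance in feature space (DIFS)
    dffs = np.sum(delta**2) - np.sum(y**2)          # residual epsilon^2 (DFFS squared)
    log_pf = -0.5 * difs - 0.5 * (M * np.log(2 * np.pi) + np.sum(np.log(eigvals)))
    log_pfbar = -0.5 * dffs / rho - 0.5 * (N - M) * np.log(2 * np.pi * rho)
    return log_pf + log_pfbar
```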

In actual practice, the majority of the F̄ eigenvalues are unknown but can be estimated, for example, by fitting a nonlinear function to the available portion of the eigenvalue spectrum and estimating the average of the eigenvalues beyond the principal subspace. Fractal power law spectra of the form 1/f^n are thought to be typical of natural phenomena and are often a good fit to the decaying nature of the eigenspectrum, see Figure 4-24(b). Referring back to equation (4.18), it can be seen that this approach requires two projections of the difference vector Δ, from which likelihoods can be estimated for the Bayesian similarity measure S(Δ). The projection steps are linear, while the posterior computation is nonlinear. Because of the double PCA projections required, this approach has been called a dual eigenspace technique [92, 95]. In the following section, it will be shown that, in practice, each input vector x will have two (precomputed) linear PCA projections y^I and y^E, and that the posterior similarity S(Δ) between any pair of vectors can be expressed in terms of a pair of difference norms between their corresponding dual projections.

Consider a feature space of Δ vectors, the differences between two images I_j and I_k. The two classes of interest in this space correspond to intrapersonal and extrapersonal variations and each is modelled as a high-dimensional Gaussian density as in (4.19). The densities are zero-mean since for each Δ = I_j − I_k there exists a −Δ = I_k − I_j.

P(\Delta \mid \Omega_E) = \frac{\exp\left(-\frac{1}{2}\Delta^T \Sigma_E^{-1} \Delta\right)}{(2\pi)^{D/2}\,|\Sigma_E|^{1/2}}

P(\Delta \mid \Omega_I) = \frac{\exp\left(-\frac{1}{2}\Delta^T \Sigma_I^{-1} \Delta\right)}{(2\pi)^{D/2}\,|\Sigma_I|^{1/2}}        (4.21)

By PCA, the Gaussians are known to only occupy a subspace of the image space (face-space) and, thus, only the first few eigenvectors of the Gaussian densities are relevant for modelling. These densities are used to evaluate the similarity in equation (4.18). Computing the S(Δ) similarity involves first subtracting a candidate image I_j from a database entry I_k. The resulting Δ image is then projected onto the eigenvectors of the extrapersonal Gaussian and the eigenvectors of the intrapersonal Gaussian. The exponentials are computed, normalized, and then combined as in equation (4.18). To compute the likelihoods P(Δ | Ω_I) and P(Δ | Ω_E), the I_k images are pre-processed with whitening transformations. Each image is converted and stored as a set of two whitened subspace coefficients, y^I for intrapersonal space and y^E for extrapersonal space (see (4.22)). Here, Λ and V are matrices of the largest eigenvalues and eigenvectors of Σ_I or Σ_E.

y_j^I = \Lambda_I^{-1/2} V_I^T I_j \qquad\text{and}\qquad y_j^E = \Lambda_E^{-1/2} V_E^T I_j        (4.22)


After this pre-processing, evaluating the Gaussians can be reduced to simple Euclidean distances as in equation (4.23). Denominators are of course pre-computed. These likelihoods are evaluated and used to compute the MAP similarity S(Δ) in (4.18). Euclidean distances are computed between the M_I-dimensional y^I vectors, as well as the M_E-dimensional y^E vectors. Thus, roughly 2(M_I + M_E) arithmetic operations are required for each similarity computation, avoiding repeated image differencing and projections.

P(\Delta \mid \Omega_I) = P(I_j - I_k \mid \Omega_I) = \frac{\exp\left(-\|y_j^I - y_k^I\|^2 / 2\right)}{(2\pi)^{D/2}\,|\Sigma_I|^{1/2}}

P(\Delta \mid \Omega_E) = P(I_j - I_k \mid \Omega_E) = \frac{\exp\left(-\|y_j^E - y_k^E\|^2 / 2\right)}{(2\pi)^{D/2}\,|\Sigma_E|^{1/2}}        (4.23)

The priors P(Ω_I) and P(Ω_E) in equation (4.18) can be set to reflect specific operating conditions (e.g. the number of test images versus the size of the database) or other sources of prior knowledge regarding the two images being matched. Note that Moghaddam [92] originally developed this particular Bayesian formulation to cast the standard face recognition task (essentially an m-ary classification problem for m individuals) into a binary classification task with Ω_I and Ω_E. This simpler problem is then solved using the maximum a posteriori (MAP) rule as in (4.18). Consequently, this formulation can easily be adapted to a face verification task. Two images are determined to belong to the same individual if

P(\Omega_I \mid \Delta) > P(\Omega_E \mid \Delta)        (4.24)

or, equivalently,

S(\Delta) > \frac{1}{2}        (4.25)
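The whitening of (4.22) and the similarity of (4.18)/(4.23) with equal priors can be sketched as follows; the log-domain evaluation and the function names are choices made for this illustration, not the thesis implementation:

```python
import numpy as np

def whiten(images, eigvals, eigvecs):
    """Whitened subspace coefficients y = Lambda^(-1/2) V^T I of equation (4.22).
    `images` is a D x n matrix with one (warped, normalized) face per column."""
    return np.diag(eigvals ** -0.5) @ eigvecs.T @ images

def similarity(y_i_1, y_i_2, y_e_1, y_e_2, log_norm_i=0.0, log_norm_e=0.0):
    """Bayesian similarity S(Delta) of (4.18) for equal priors P(Omega_I) = P(Omega_E).

    The y_* arguments are the pre-whitened intrapersonal / extrapersonal coefficients
    of the two images. `log_norm_i` and `log_norm_e` are the precomputed log
    normalisation constants of the two Gaussians in (4.23); they only cancel when
    |Sigma_I| = |Sigma_E|, so for the exact measure they should be supplied."""
    log_p_i = log_norm_i - 0.5 * np.sum((y_i_1 - y_i_2) ** 2)   # log P(Delta | Omega_I)
    log_p_e = log_norm_e - 0.5 * np.sum((y_e_1 - y_e_2) ** 2)   # log P(Delta | Omega_E)
    return 1.0 / (1.0 + np.exp(log_p_e - log_p_i))              # S = P_I / (P_I + P_E)

# Verification rule (4.25): accept the identity claim when similarity(...) > 0.5.
```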

4.7.2 Current application


Currently, the verification algorithm is only applied to faces with a pose between -10° and 10°. A training set of 31 individuals containing 214 images was collected. Between three and twelve images per person are available. These training images were gathered on four different days, in the conditions where the system will ultimately be applied. Thus, differences in illumination, facial expression and pose were encountered. These images were put through the face localization and facial feature detection algorithms described above. Finally, the 214 warped and illumination-normalized images were used to form the densities of Ω_I and Ω_E. From the 214 available images, 1500 difference vectors Δ_I = I_1 − I_2 were formed, where I_1 and I_2 are images belonging to the same individual. Another 1500 combinations Δ_E = I_3 − I_4 were formed, with I_3 and I_4 images belonging to different individuals. The vectors Δ_I are placed in the columns of a matrix A_I, while the vectors Δ_E form the columns of a matrix A_E. The covariance matrices Σ_I and Σ_E are then computed in the following way:

\Sigma_I = A_I A_I^T \qquad\text{and}\qquad \Sigma_E = A_E A_E^T        (4.26)


Next, the eigenvalues and eigenvectors of Σ_I and Σ_E are calculated and ordered in decreasing order of the eigenvalues. The M_I largest eigenvalues and eigenvectors of Σ_I form the matrices Λ_I and V_I, respectively. The M_E largest eigenvalues and eigenvectors of Σ_E form the matrices Λ_E and V_E, respectively. For every image I, the vectors y^I and y^E can now be calculated (see equation (4.22)). When two images are compared, the probabilities in (4.23) are computed and placed in the similarity measure (4.18).

Several authors improved the results of the eigenface technique by combining it with an eigenfeature technique [83, 86, 87, 88, 127]. A similar principle was tested in this system. The probabilistic subspace technique is applied on seven different images containing, respectively:
- the entire face, from forehead to chin
- a partial face, from just above the eyebrows to just below the mouth
- the left eye region
- the right eye region
- the eyes region, including the nose bridge
- the nose region
- the mouth region

For the first two situations, the resolution of the warped image is first lowered by a factor of two. The resulting images then have the size 30 × 50 in the first case and 30 × 30 in the second one. For the final five situations, the warped images are of size 60 × 100 before extracting the relevant windows. The probabilistic subspace technique is then applied for each colour component separately. For every colour component, seven feature vectors are formed. The similarity measure (4.18) is calculated for each of the seven features and every colour component (twenty-one measures in total). A suitable combination of some (or all) of these twenty-one values needs to be found to distinguish between images of the same individual and images of different individuals.
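The training computations described above (forming Σ_I and Σ_E from the difference matrices and keeping their M_I and M_E leading eigenpairs) can be sketched as follows. For illustration the sketch uses the usual eigenface shortcut of decomposing AᵀA instead of the much larger Σ = AAᵀ (the non-zero eigenvalues are identical); the thesis itself simply states that Σ_I and Σ_E are decomposed directly, and the function names are assumptions:

```python
import numpy as np

def train_dual_eigenspaces(intra_diffs, extra_diffs, m_i=100, m_e=100):
    """Build the intrapersonal and extrapersonal eigenspaces (illustrative sketch).

    `intra_diffs` and `extra_diffs` are D x K matrices whose columns are the
    difference vectors Delta_I and Delta_E. Returns (eigvals, eigvecs) pairs holding
    the M_I / M_E largest eigenvalues and eigenvectors of Sigma_I and Sigma_E (4.26)."""
    def top_eig(A, m):
        # Eigendecompose the small K x K matrix A^T A; its non-zero eigenvalues equal
        # those of A A^T, and A v (normalised) gives the corresponding eigenvector.
        vals, vecs = np.linalg.eigh(A.T @ A)
        order = np.argsort(vals)[::-1][:m]
        vals, vecs = vals[order], vecs[:, order]
        big_vecs = A @ vecs
        big_vecs /= np.linalg.norm(big_vecs, axis=0)   # re-normalise the columns
        return vals, big_vecs
    return top_eig(intra_diffs, m_i), top_eig(extra_diffs, m_e)
```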

4.8 Results
4.8.1 Performance of the face localization module
The performance of the face localization module is determined by the skin colour detector. Three hundred original images, as in Figure 4-3, were used to test the colour detection. The test set contains a representative distribution between men and women, and between white, black and coloured people. Some images contain two faces: the driver and the passenger sitting next to him/her. These 300 images were fed to the colour detection algorithm. The output was marked as correct if it contained at least 90% of the pixels a human observer marked as belonging to the driver's face. It is not seen as incorrect when the system returns, in addition to the driver's face, part of his/her hair or clothing, or the passenger's face. This will create a more complex and difficult input for the following steps of the facial feature detection, but does not necessarily lead to failure. Out of the 300 images containing a driver's face, 292 were detected correctly. The major cause of faces not being detected properly is the presence of dark sunglasses or large moustaches. These

cut the face in two parts which are too small to be recognized by the system as being a face. A dark beard can cause the remaining part of the face to be too small to be classified as a face. A hand rather than a face might be returned, when the hand appears as a skin coloured region of similar size as the head. If the inside of the car is too dark, the colour detection is also unable to detect enough skin pixels to form a face. A few examples where the face could not be localized are shown in Figure 4-25. An additional fifty images, containing a small face or no face at all, were added to the test set. These images should result in the system returning a message that no face could be found. In most of these images, some pixels with skin colour could be detected, but the size of these areas does not fulfil the size requirements for a face. However, in some cases where the face is turned away or occluded and thus too small to detect, the clothing of the driver is classified as being of skin colour and is returned by the face localization module as being a face. These false positives will have to be discarded in the following modules of the facial feature detection. An example of a falsely detected face is given in Figure 4-26.

|                  | Large face  | Small face or no face | Total       |
|------------------|-------------|-----------------------|-------------|
| Number of images | 300         | 50                    | 350         |
| Correct output   | 292 (97.3%) | 47 (94%)              | 339 (96.9%) |

Table 4-1: Performance of colour detection module

These results indicate that a colour detector can distinguish very easily between images containing faces and images not containing faces, given that these faces must be of a minimum size. Therefore, when video sequences are examined, a fast colour detector can be used to indicate which frames do not contain faces and need not be examined by the subsequent feature detection stages. The frames passed on by the colour detector very likely contain a face, and this knowledge can be used in the following algorithms.

4.8.2 Performance of the facial feature detection module


The facial feature detection module consists of four consecutive algorithms: the bar filter, the grouping stage, the iris detection and the boundary detection. In the current application, three bar filters (-20, 0 and 20) are applied to obtain rotation invariance. Two hundred of the images which passed the face localization module were selected to form a representative test set to evaluate the performance of the facial feature detection. This test set contains images of men and women; white, black and coloured people; people wearing glasses, scarves and beards. The seven facial features (eyebrows, eyes, nose, mouth and chin) are visible in every test image. No occluded faces or profile views are included in the test set. These 200 images were passed through the four stages of the facial feature detection. In 73.5% of the images, the facial features and face boundary were correctly detected. A face is marked as correct when the detected iris centres are within three pixels of the real iris, as localized by a human observer. The eyebrows, nose, mouth and chin can be localized anywhere on the respective features. Note that if the eyebrows could not be localized by the system, but the other features are correct, the output of the facial feature detection is marked as correct. If the eyebrows are incorrectly localized, the output is marked as incorrect.


Figure 4-25: Examples where the face could not be located by the colour detector

Figure 4-26: Example of falsely detected face

The face boundary must be the best fitting ellipse to the face contour. It is not counted as incorrect when hair is included in the face boundary; however, when too much background is included, the boundary is classified as incorrect (see the examples in Figure 4-27 and Figure 4-31). A few examples of correctly detected faces can be seen in Figure 4-27.


Figure 4-27: Examples of faces with correctly detected features and boundary


As can be seen in Figure 4-27, the facial feature detection works quite well. It allows for invariance, to a certain extent, to ethnic variations, pose variations, rotation variations and illumination variations. The main reasons why the facial feature detection module fails are the following: The Gaussian bar filter detects too many false feature candidates, which are incorrectly grouped. The presence of glasses, a moustache, a beard, hair, a hat, large horizontal wrinkles, the collar of a shirt, etc. can create extra bars when convolving the image with the Gaussian bar filter. If these extra features fulfil the distance and geometric constraints of the grouping stage, they can incorrectly be selected as one of the seven facial features. Figure 4-28 shows a few examples.

Figure 4-28: Examples of incorrectly detected and/or grouped features

The bar filter relies on the contrast between the face and its facial features. When not enough contrast is present, no bars can be detected by the Gaussian filter. If the remaining bars cannot be grouped, the system returns a message that no face is present, see Figure 4-29 for a few examples.


Figure 4-29: Examples of images with too low contrast


The iris detection does not work properly when the eyes are occluded by glasses. In this case, the iris does not appear as a clear black disk. Even when no glasses are present, shadows can cause the eye corner or eyelid to look more like a black disk than the real iris. In those cases, the iris detection will fail. A few examples can be seen in Figure 4-30.

Figure 4-30: Examples of incorrectly detected irises

When large differences in illumination are present on the face, the boundary detection will fail because the line between the light and dark areas will have higher edge strength than the border between the face and the background. When a passenger is present, the boundary between the two faces might not be clear enough and the ellipse will include both faces. Figure 4-31 shows a few examples of incorrectly located face boundaries.

Figure 4-31: Examples of incorrectly located face boundaries

Yow [146] used a test set of 90 static images with complex background, containing both frontoparallel and fronto-rotated faces. The average performance on these images was 60%. Therefore, the facial feature detection module in this work, resulting in 73.5% correct detection, compares favourably with the algorithms of Yow. The main reason for the improved performance is the presence of a colour detection stage. This prevents the bar filter from detecting too many false features.


However, the Bayesian belief network used by Yow to group the candidate features was omitted in the current system. It is believed that replacing the current grouping algorithm, based on simple heuristic rules, with a more advanced algorithm such as a probabilistic decision network would allow a large part of the currently incorrect results to be detected correctly. A limited test set of 20 profile views was submitted to the facial feature detection algorithms. The performance was unsatisfactory as only four faces were detected properly. The main reason is that the bar filter detects too many false features which are grouped into too many face candidates. A more advanced grouping algorithm needs to replace the heuristic rules for profile views to be detected correctly. Similar unsatisfactory results were obtained for partially occluded faces. The reason for this is twofold. The source of occlusion (e.g. a hand, arm or student card) can either hide the features the system aims to detect, or can introduce additional false features of its own. Both situations often lead to the grouping stage interpreting false features as features that are actually not visible, which results in an incorrect face candidate being output. Figure 4-32 shows a few examples of partially occluded faces for which the facial feature detection failed.

Figure 4-32: Examples of partially occluded faces

4.8.3 Performance of the face verification module


Before evaluating a biometric verification or recognition application, a few terms need to be defined [38, 104]. The correct acceptance rate (CAR) is defined as the probability that a user making a true claim about his/her identity will be correctly verified as him/herself. The false rejection rate (FRR) is defined as the probability that a user making a true claim about his/her identity will be falsely rejected. The correct rejection rate (CRR) is defined as the probability that a user making a false claim about his/her identity will be correctly rejected. The false acceptance rate (FAR) is defined as the probability that a user making a false claim about his/her identity will be falsely verified under that false identity.


To test the verification algorithm, a set of 300 images, containing the face of one of fifty individuals, was collected. Three images per person were frames from a video sequence taken when the person drives into campus; the remaining three were frames from a video sequence taken when the individual leaves campus. The three images belonging to the same video sequence were selected randomly from the available frames in the pose range [-10°, 10°]. So, each time, these three frames were only separated by a fraction of a second to a few seconds maximum, while the two sets of three images were at least six hours apart. None of the images in the test set appeared in the training set earlier used to calculate the probability densities of Ω_I and Ω_E. These 300 images were correctly passed through the face localization and facial feature detection algorithms described above. They all demonstrate a pose between -10° and 10°. Finally, these images are warped and normalized for illumination. Several parameters influence the results of the verification module:
- the prior probabilities P(Ω_I) and P(Ω_E) used to compute the similarity measure S(Δ), see equation (4.18)
- M_I and M_E, the dimensions of the vectors y^I and y^E, respectively
- the illumination normalization technique used, as described in section 4.6.4
- the combination of some or all of the features described in section 4.7.2

The prior probabilities were set to be equal, as no prior knowledge can be incorporated. When two images are tested by the verification algorithm, it is assumed that the probability that the two images are from the same individual is equal to the probability that the two images are from different individuals. Thus:

P(\Omega_I) = P(\Omega_E) = 0.5        (4.27)

The values of M_I and M_E are taken as 100. Some experiments were performed with M_I and M_E equal to 40, 80 and 100. It was found that the a posteriori probabilities for images belonging to the same face are higher when a greater number of eigenvalues and eigenvectors is used to compute the vectors y^I and y^E. This means that these images can be classified as belonging to the same class with a higher certainty. Therefore it was decided to use the values

M_I = 100 \qquad\text{and}\qquad M_E = 100        (4.28)

The illumination normalization technique chosen for the final application is based on the YCbCr colour space. A small-scale experiment was performed on 150 randomly selected combinations of the 300 test images, where the two images being compared were taken at least six hours apart. Only the window containing the entire face was taken into account for calculating the performance rates. The results for the three illumination normalization methods described in section 4.6.4 are summarized in Table 4-2. Two faces are classified as belonging to the same individual when the similarity measure S(Δ) is greater than 0.50; otherwise the images belong to different individuals. When more than one channel is used, the average of the three obtained similarity values is compared to the threshold 0.50. As in the experiments performed by Torres et al. [127], the YCbCr colour space leads to better recognition results than the RGB colour space. Using R, G or B independently provides the same results as using R, G and B simultaneously and almost the same results as using the luminance Y only. By adding the chrominance Cr and Cb to the luminance Y, the verification performance increased by 2.4% over using the luminance Y only. However, based on a chi-square test, the difference between the RGB and YCbCr colour spaces is statistically not significant and can, therefore, not be generalized.

| Colour components       | RGB   | RGB + localized histogram equalization | Y     | YCbCr |
|-------------------------|-------|----------------------------------------|-------|-------|
| Correct acceptance rate | 76.7% | 73.3%                                  | 76.9% | 79.3% |

Table 4-2: Comparison of the performance rates when using different colour components

The histogram equalized RGB images show similar results to the RGB images. Note that in these images the histogram equalization is only applied on the face region, hence the name localized histogram equalization. Using the histogram equalized R, G and B simultaneously does not improve the performance over using any of these three channels independently and leads to a lower performance rate than when no histogram equalization was performed. Although the difference between the RGB and YCbCr colour spaces is not statistically significant, the YCbCr colour space performs better than the RGB colour space on the current test set. Therefore it was decided to use the YCbCr colour space for further experiments. Thus, every image is represented by three feature vectors (Y, Cb and Cr). Illumination normalization is performed by reducing each feature vector to zero mean and unit norm.

Several authors [33, 83, 86, 87, 88, 103, 127] have applied the eigenface method on individual facial features, which results in eigeneyes, eigennoses, eigenmouths and eigensides. These eigenfeatures could be combined in the classification stage. In [83, 127] the global distance between the test and the training image is calculated as the weighted sum of the different distances in such a way that the contribution of each eigenfeature to the recognition stage is the same. However, these weighting factors could be altered to give more importance to certain features than others. In the current application, several combinations of the seven features described in section 4.7.2 (entire face, partial face, eyes region, left eye region, right eye region, nose region and mouth region) were tested. The similarity measure S(Δ) is calculated for each feature (and each colour component) separately. For each feature the results of the three colour components are averaged, resulting in seven similarity measures

S'(\Delta) = S_Y(\Delta) + S_{Cr}(\Delta) + S_{Cb}(\Delta)

(4.29)

A combination of any of these seven measures now needs to be found to decide if the two presented face images belong to the same person or not. The 300 test images were presented to the verification algorithm in 485 combinations; 400 combinations were images of the same individual, 85 were images of different individuals. Each combination consisted of two images taken at least six hours apart. The results for the seven similarity measures are given in Table 4-3.


|                         | Number of claims | Entire face | Partial face | Eyes region | Left eye region | Right eye region | Nose region | Mouth region |
|-------------------------|------------------|-------------|--------------|-------------|-----------------|------------------|-------------|--------------|
| Correct acceptance rate | 400              | 75.3%       | 71.3%        | 65.8%       | 65.3%           | 72%              | 73.5%       | 72.3%        |
| False rejection rate    | 400              | 24.7%       | 28.7%        | 34.2%       | 34.7%           | 28%              | 26.5%       | 27.7%        |
| Correct rejection rate  | 85               | 77.6%       | 77.6%        | 71.8%       | 70.6%           | 45.9%            | 36.5%       | 43.5%        |
| False acceptance rate   | 85               | 22.4%       | 22.4%        | 28.2%       | 29.4%           | 54.1%            | 63.5%       | 56.5%        |
| Total correct           | 485              | 75.7%       | 72.4%        | 66.8%       | 66.2%           | 67%              | 67%         | 67.2%        |

Table 4-3: Comparison of the performance rates when using different features

Every possible combination of one to seven features (127 combinations in total) was tested by forming a weighted sum of the different similarity measures in such a way that the contribution of each feature to the verification stage is the same. A few examples are shown in Table 4-4. No combination could be found that results in better performance than the similarity measure calculated from the window containing the entire face only (75.7%).

|                         | Number of claims | Entire face + nose + mouth | Entire face + eyes + nose + mouth | Partial face + eyes + nose + mouth | Entire face + left eye + right eye + nose + mouth | Partial face + left eye + right eye + nose + mouth | All seven features |
|-------------------------|------------------|----------------------------|-----------------------------------|------------------------------------|---------------------------------------------------|----------------------------------------------------|--------------------|
| Correct acceptance rate | 400              | 72.3%                      | 70.8%                             | 70.3%                              | 70.5%                                             | 69.5%                                              | 69.3%              |
| False rejection rate    | 400              | 27.7%                      | 29.2%                             | 29.7%                              | 29.5%                                             | 30.5%                                              | 30.7%              |
| Correct rejection rate  | 85               | 50.6%                      | 63.5%                             | 68.2%                              | 56.5%                                             | 60%                                                | 65.9%              |
| False acceptance rate   | 85               | 49.4%                      | 36.5%                             | 31.8%                              | 43.5%                                             | 40%                                                | 34.1%              |
| Total correct           | 485              | 68.5%                      | 69.5%                             | 69.9%                              | 68%                                               | 67.8%                                              | 68.7%              |

Table 4-4: Comparison of the performance rates when using different feature combinations


Even when the weighting factors were altered to give more importance to certain features, no combination of any of the seven features could be found that performs better. This implies that holistic recognition leads to better results than summed partial recognition when using the a posteriori probability measure based on the Bayes rule. Therefore, the verification algorithm based on the entire face only will be used in the final application of the face verification module. When tested on 400 true claims and 85 false claims, the FRR was 24.7% while the FAR was 22.4%. Both values are too high for an implementation in a realistic situation. The main reason for this low performance is the illumination variation. Images taken at different times of the day can look very different due to the variation in illumination, see Figure 4-33 for an example. The illumination normalization performed was based on reducing the feature vector to zero mean and unit norm. This is clearly not enough, and other methods of illumination normalization will have to be investigated in the future.

Figure 4-33: Difference due to illumination variation

This major influence of the illumination variation can be verified by performing an additional experiment. A set consisting of 150 combinations of the 300 selected test images was compiled. In this case, when two images belonging to the same individual are compared, they belong to the same video sequence, i.e., they are taken at most a few seconds apart. When two different individuals are tested, the test images are at most half an hour apart. This ensures little change in illumination between the two images being tested. One hundred combinations tested two images of the same individual; fifty combinations tested two different individuals. The results in Table 4-5 confirm the previous observation that the window containing the entire face leads to the best performance. The increase from 75.3% to 89% when the testing is limited to images with the same type of lighting proves that illumination variation plays an important role in the failure of a face verification algorithm.


|                         | Number of claims | Entire face | Partial face | Eyes region | Left eye region | Right eye region | Nose region | Mouth region |
|-------------------------|------------------|-------------|--------------|-------------|-----------------|------------------|-------------|--------------|
| Correct acceptance rate | 100              | 89%         | 81%          | 81%         | 72%             | 81%              | 88%         | 83%          |
| False rejection rate    | 100              | 11%         | 19%          | 19%         | 28%             | 19%              | 12%         | 17%          |
| Correct rejection rate  | 50               | 86%         | 84%          | 74%         | 70%             | 68%              | 60%         | 46%          |
| False acceptance rate   | 50               | 14%         | 16%          | 26%         | 30%             | 32%              | 40%         | 54%          |
| Total correct           | 150              | 88%         | 82%          | 78.7%       | 71.3%           | 76.7%            | 78.7%       | 70.7%        |

Table 4-5: Comparison of the performance rates when testing frames maximum half an hour apart

A second reason for the low performance of the face verification algorithm is the warping stage being based on only six landmarks: the face edges, the eyes, the forehead and the chin. In a realistic situation with such a wide pose variation as the current one, these six landmark points are not sufficient to guarantee that the same pixel in two images will represent the same facial feature. Several landmarks should be located on the eyes, nose, mouth and face contour before warping the face image. An example of this approach is the Candide model which uses 44 points. However, this requires the facial feature detection algorithms to be expanded and much more precise, which proves to be a very difficult task in an unconstrained environment. A third reason for the low performance is the low number of images and individuals used for training. Only 214 images corresponding to 31 people were included in the training set. This number is restricted by the computing power of Khoros Pro. Ideally, a much higher number of images, representative for the conditions encountered in the real application of driver verification, should be used. Moghaddam [92] used a training set containing 1463 images of 566 individuals. When testing with a probe set containing 366 images of 140 individuals, a performance rate of 94.83% was obtained. This higher result is due to the much better controlled illumination and pose variations in the training and test sets, and to the larger and more representative training set.

4.8.4 Total performance


The system can only verify the identity of a face after it has properly localized the face and its features. The success rate of each step must thus be incorporated into the overall calculation:

T = L × F × V                                                    (4.30)

where T = total system accuracy, L = rate of successful face localization, F = rate of successful facial feature detection, and V = rate of successful face verification. Thus:

T = 96.9% × 73.5% × 75.7% = 53.9%                                 (4.31)

This means that the total performance of the static version of the face recognition system is 53.9%. It is clear that both the facial feature detection and face verification will need to improve before realistic implementation will be possible.

4.8.5 Execution times


No effort was made to optimize the system for speed, and no extensive testing was done to measure the execution time. However, the average execution time, with colour detection, facial feature detection and verification included, is approximately 35 seconds, which is too long for the intended application. By optimizing the code and using a faster computer, this time could be decreased drastically.

4.9 Conclusions
This chapter described the development of a face recognition system containing three major modules: the face localization, the facial feature detection and the face verification. Each of these modules was subjected to a number of experiments to determine its performance rate.

The face localization module showed good performance, with a correct localization rate of 96.9%. The main reason for failure of the face localization is the presence of dark sunglasses or beards.

The facial feature detection module performs quite well, with a correct detection rate of 73.5%. It allows for invariance, to a certain extent, to ethnic variations, pose variations, rotation variations and illumination variations. The main reasons why the facial feature detection module fails were also determined. The presence of glasses, beards, hats, etc. can result in the system returning a false face candidate; too bright illumination can result in the face contour being attracted to the border between light and dark areas instead of the border between face and background; and too dark illumination causes low contrast, so that the facial features are not detected. At this stage, the facial feature detection performs badly on partially occluded faces and profile views.

The face verification algorithm results in a correct verification rate of 75.7% based on images of the entire face. The addition of individual features, such as the eyes or mouth, does not lead to a performance increase. The main reasons for failure of the verification module are the large illumination differences and the fact that the current warping algorithm does not produce satisfactory inputs to the verification module.

4.10 Summary
This chapter described the development of a face recognition application. The algorithms used in the different modules of face location, facial feature detection and face verification have been


described in detail. The total performance of 53.9% shows that a great deal of work still needs to be done to develop a system that will be able to identify faces in a realistic environment without any operator intervention.


Chapter 5

Combined application
This chapter presents the combination of the license plate recognition and face recognition modules described in chapters three and four. A detailed example will show the entire process, starting when a person drives his/her car through the gate of RAU campus. The advantages of using video over static images will also be explained.

5.1 Campus entry


It is 08:08 AM when person X enters RAU campus. At this moment, two actions are taken simultaneously. The first camera grabs an image of the car's number plate when it receives a trigger from the trip wire. This is a static image (see Figure 5-1), which is sent immediately to the license plate recognition module described in chapter three. The number plate is localized and segmented and the characters are recognized. The output of this process is an eight character string (MKG638GP), representing the car license plate. When the trip wire is crossed, the second camera grabs a video sequence of images of the side of the car. On average, a driver spends between five and eight seconds in front of this camera, resulting in 120 to 200 frames being captured. In a great many of these, the driver's head is present (see Figure 5-2). This video is de-interlaced into 25 frames per second, which are passed on to the face recognition module. The colour detector discards most of the frames containing no face or a face which is too small for further use. The images containing a face are reduced in size, as the colour detection only retains a small window containing the face and discards most of the background. The images retained by the colour detector are passed on to the facial feature detection stage. The grouping algorithm composes a face candidate from the features detected by the bar filter. Only if the face candidate returned by the grouping stage contains both eyes is it passed on to the iris and face boundary detection steps. Because the currently used algorithms can detect profile views only with very low accuracy, such frames are discarded by the system.
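The frame filtering described above can be summarized as a simple per-frame pipeline. The sketch below is only an outline under assumed interfaces: the three stage functions passed in are placeholders for the colour detector, the bar-filter/grouping feature detector and the pose estimator of chapter four, and are not part of the actual implementation.

    def select_candidate_frames(frames, detect_skin_window, detect_features,
                                estimate_pose, frontal_range=(-10.0, 10.0)):
        """Filter a de-interlaced video sequence down to usable frontal face frames.

        detect_skin_window(frame) should return a face window or None,
        detect_features(window) should return a feature set only when both eyes
        were grouped (None otherwise), and estimate_pose(window, features)
        should return the head rotation in degrees."""
        low, high = frontal_range
        candidates = []
        for frame in frames:
            window = detect_skin_window(frame)       # None: no face, or face too small
            if window is None:
                continue
            features = detect_features(window)       # None: both eyes not found
            if features is None:
                continue                              # profile or occluded faces are dropped
            pose = estimate_pose(window, features)
            if low <= pose <= high:                   # keep only the frontal pose class
                candidates.append((frame, window, features))
        return candidates

Of the frames that survive this filter, three are later selected at random for verification, as described further on in this section.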


Figure 5-1: Static license plate image at entry

Figure 5-2: Video sequence of the driver's head at entry (frames 000 to 120, every 5th frame shown)

This results in only a limited number of face images for which the pose is estimated. In an ideal situation, a few images per pose class should be saved by the system for later comparison.

109

Recall that the faces are divided into nine pose classes between -90° and 90°, each with a range of 20°, for warping purposes. However, in the current application, the face verification algorithm is only developed for the frontal class, with poses between -10° and 10°. This decision was taken based on two observations. First, almost every person spends at least 20 frames in this pose class without the face being occluded, whereas when the face is classified in any of the other pose classes it is often occluded by a hand or a student/staff card. Second, the facial feature detection stage returns a correct result for many images containing non-occluded, non-profile faces (see chapter four), but it was observed that the more the pose deviates from 0°, the less predictable the boundary detection becomes. At these angles, the boundary detection becomes very sensitive to the position of the hair and ears. If the hair is included in the face contour in one frame but not in the next, this leads to quite different results in the pose estimation and warping stages, and therefore to incorrect results in the verification stage. If the pose is between -10° and 10°, the face boundary detection is much more predictable: if the hair is included in one frame, it will also be included in the next frame. The warping stage then produces very similar outputs, which favours the verification stage. For these two reasons, it is felt that the face boundary detection must be made more predictable and the occlusion problem must be dealt with before the verification stage is expanded to larger pose classes.

As mentioned earlier, at least 20 frames are available in the pose class [-10°, 10°] for most drivers. However, because the application of driver verification needs to work in real time, it is impossible to check every frame from the incoming sequence against every frame from the outgoing sequence. Therefore, only a limited number of these images is saved for later verification. Three face images with pose estimates between -10° and 10° are randomly selected among those retained by the facial feature detection module. These images are warped and transformed to the YCbCr colour space. Next, the three feature vectors (Y, Cb and Cr) are reduced to zero mean and unit norm. Finally, each of these three feature vectors is converted into a set of two whitened subspace coefficient vectors, y_I for the intrapersonal space and y_E for the extrapersonal space.
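The last step, the conversion of each normalized colour channel into whitened intrapersonal and extrapersonal coefficients, follows the efficient pre-whitening formulation of the Bayesian similarity measure discussed in chapter four. The sketch below is illustrative only; the eigenvector and eigenvalue arrays are assumed to have been estimated offline from intrapersonal and extrapersonal difference images in the training set, and the variable names are not taken from the actual implementation.

    import numpy as np

    def whitened_coefficients(v, eigvecs, eigvals):
        """Pre-whiten a feature vector with respect to one difference subspace.

        eigvecs has shape (d, k) and eigvals shape (k,). Projecting onto the
        leading eigenvectors and scaling by the inverse square roots of the
        eigenvalues means that the Euclidean distance between two pre-whitened
        vectors gives the Mahalanobis term needed by the intrapersonal or
        extrapersonal likelihood."""
        return (eigvecs.T @ v) / np.sqrt(eigvals)

    def driver_template(normalized_channels, intra_basis, extra_basis):
        """Build the 18-vector driver template (3 frames x 3 channels x 2 subspaces).

        normalized_channels: nine zero-mean, unit-norm vectors (Y, Cb and Cr for
        each of the three selected frames). intra_basis and extra_basis are
        (eigvecs, eigvals) pairs for the two difference subspaces."""
        template = []
        for v in normalized_channels:
            template.append(whitened_coefficients(v, *intra_basis))   # y_I
            template.append(whitened_coefficients(v, *extra_basis))   # y_E
        return template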

Thus, a total of 18 vectors represents the car driver (3 frames × 3 colour components × 2 subspace coefficient vectors). This information, together with the decoded number plate string, is stored in a database. This entire data acquisition process should ideally take no longer than five seconds, the time it takes for a driver to swipe his or her student or staff card. It is repeated for every person driving in, thus a few thousand times per day. However, the computer currently used to perform the testing takes about 90 seconds to perform this process of data acquisition.

Approximately 3 000 cars enter the Rand Afrikaans University daily. The maximum storage space needed, when every driver's information is entered into the database, would be 43.2 MB. This is small compared to the 30 GB needed to store 24 hours of video for one gate. The database centralizes all information, with no difference between the various gates. When the card swipe system and video cameras are placed a few car lengths ahead of the booms, the time available for data acquisition can be extended to approximately 30 seconds. This means that the current system could be applied in real time if a processor is used which is three times faster than the current 1 GHz Pentium III.
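As a rough, purely illustrative check of the storage figure quoted above: 43.2 MB for 3 000 drivers implies about 14.4 KB per driver record, or roughly 800 bytes per stored coefficient vector. The interpretation of those 800 bytes (for example, about 100 double-precision coefficients per subspace projection) is an assumption, since the subspace dimensionality is not restated here.

    cars_per_day = 3000
    database_size_bytes = 43.2e6                             # 43.2 MB, decimal megabytes assumed

    bytes_per_driver = database_size_bytes / cars_per_day    # 14 400 bytes per driver record
    bytes_per_vector = bytes_per_driver / 18                 # ~800 bytes per coefficient vector
    print(bytes_per_driver, bytes_per_vector)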


5.2 Campus exit


5.2.1 Campus exit A
It is 04:48 PM when person X leaves RAU campus. The entire process of data acquisition described in the previous section is repeated for every person intending to drive out of campus. The same information is gathered: a static image of the license plate and a video sequence of the driver's face; see Figure 5-3 and Figure 5-4. Three frames containing the driver's face in a more or less frontal position are selected and converted into 18 vectors (3 frames × 3 colour components × 2 subspace coefficient vectors). The license plate is decoded and represented by an eight character string (MKG638GP). The three frames containing the face are collected as quickly as possible.

Figure 5-3: Static license plate image at exit (A)

Figure 5-4: Video sequence of the driver's head at exit (A) (frames 000 to 120, every 5th frame shown)



As soon as the facial feature detection module outputs three images with pose estimates between -10° and 10°, the verification step can be started. The decoded license plate serves as the search key in the database and the corresponding face information, saved at entry, is retrieved. The stored templates and the new or live templates are then passed on to the face verification module. Each of the three frames retained at exit is compared to each of the three frames stored at entry, resulting in nine comparisons per driver. Every comparison follows the process outlined in chapter four and results in an output stating 'same person' or 'different person'. This means that per comparison three similarity measures S(Δ) are computed, one for each colour component (Y, Cb and Cr). The average of these three values is calculated and compared with the threshold of 0.50. The nine comparisons thus result in nine votes for 'same person' or 'different person'. The two drivers are determined to be the same if at least six out of the nine comparisons vote for 'same person'.

It currently takes approximately four seconds to perform the nine comparisons. This means that if the camera and swipe system were placed a few car lengths ahead of the boom and a faster processor were used, this system could be applied in real time. The driver in Figure 5-4 is verified to be the same as the one in Figure 5-2; therefore the boom will open and she can drive on. All database entries for this person are erased, while one entry-exit pairing is saved for reconciliation purposes.
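The decision rule just described (average the three colour-channel similarities, threshold at 0.50, and require at least six of the nine pairwise votes) can be written compactly as below. This is an illustrative sketch only; channel_similarity stands in for the Bayesian similarity measure of chapter four and is not an actual function from this work.

    def verify_driver(entry_frames, exit_frames, channel_similarity, threshold=0.50):
        """Nine-comparison majority vote between the three frames stored at entry
        and the three frames captured at exit.

        channel_similarity(stored, live, channel) should return a similarity in
        [0, 1] for one colour component of the two warped, normalized faces."""
        votes_same = 0
        for stored in entry_frames:
            for live in exit_frames:
                scores = [channel_similarity(stored, live, c) for c in ("Y", "Cb", "Cr")]
                if sum(scores) / len(scores) > threshold:   # average of the three channels
                    votes_same += 1
        return votes_same >= 6       # 'same person' only with at least six of nine votes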

5.2.2 Campus exit B


It is 03:57 PM when person Y leaves RAU campus, see Figure 5-5 and Figure 5-6. The entire process of data acquisition and face verification described in the previous section is performed. The license plate is decoded and represented by the eight character string MKG638GP. Three frames containing the driver's face in a more or less frontal position are selected and passed on to the facial feature detection and warping algorithms. This, again, results in 18 vectors containing face information. The decoded license plate serves as the search key in the database and the corresponding face information, saved at entry, is retrieved. When the face verification module compares the stored templates with the live templates, the face information extracted from the video sequences in Figure 5-2 and Figure 5-6 cannot be matched. Therefore, the driver in Figure 5-6 is rejected as the legitimate driver of the car under investigation.


Figure 5-5: Static license plate image at exit (B)

Figure 5-6: Video sequence of the driver's head at exit (B) (frames 000 to 120, every 5th frame shown)

A possible next step might be to compare the rejected driver's information to all the faces saved in the database. However, even if the time for comparing two people could be reduced from four to one


seconds, it would still take about 3 000 seconds, or nearly an hour, to check against a full database of the day's entries. Unless some form of tree search is implemented, this is not a useful approach.

5.3 Results
5.3.1 Performance of the license plate recognition
The total performance of the license plate recognition was calculated in chapter three. This performance rate includes the performance of the localization, segmentation and recognition modules. The total recognition rate was computed as 97.4%.

5.3.2 Performance of the driver verification


The performance of the driver verification is dependent on the performance of three separate modules:
- The face localization module
- The facial feature detection module
- The face verification module

The performance of the first two modules was already determined in chapter four. The face is properly localized in 96.9%, while the facial features are correctly detected in 73.5% of the images. The performance of the third and last module will now be assessed. To test the performance of the face verification based on video sequences, 90 video sequences of 45 drivers were used. These were combined into 45 true claims and 36 false claims. The results are summarized in Table 5-1.

                            Number of claims    Percentage
Correct acceptance rate     45                  77.8%
False rejection rate        45                  22.2%
Correct rejection rate      36                  88.9%
False acceptance rate       36                  11.1%
Total correct               81                  82.7%

Table 5-1: Performance of the video-based face verification

Compared to the static face verification, the correct acceptance rate is slightly higher. The main advantage of using video instead of static images, or basing the verification decision on three frames instead of on one, is the decrease of the false acceptance rate. The result for static images in chapter four indicated a false acceptance rate of 22.4%, while the video-based face verification results in a false acceptance rate of only 11.1%. As can be seen in Table 5-1, the face verification


module makes the right decision in 82.7% of the claims. This result was only 75.7% when based on static images.

Since the false acceptance rate decreases when the decision is based on three frames, one might expect that this rate can be lowered even further by using more frames, say four or five. This was not tested on the full set of 81 claims. However, when the face verification was performed based on five frames, the four out of 36 falsely accepted claims from Table 5-1 were still falsely accepted. When the verification step is based on five frames, 25 comparisons must be performed and 20 out of 25 comparisons must vote for 'same person'. Note that this increases the computation time to more than 2.5 times the time needed to check three frames. This indicates that the false acceptance rate cannot simply be lowered by using more frames.

This can be explained by the effect of the warping and illumination normalization algorithms. These algorithms should ideally result in all faces with a pose between -10° and 10° looking exactly the same before being fed to the verification step. The warping algorithm makes sure that the same pixel in two images represents the same feature, while the illumination normalization removes the influence of variable lighting. Therefore, the extra fourth and fifth frames should ideally result in feature vectors which are the same as, or at least very close to, those of the first three frames. Table 5-2 lists the Euclidean distance between any two out of five warped face images of the same person. As the images are of size 60 × 100 with three colour components, these Euclidean distances are calculated in R^N with N = 18 000. These distances all lie very close together, forming a compact cluster. This affirms that any three out of these five images are sufficient to represent an individual.

Combination of 2 images    Euclidean distance
image 1 & image 2          0.130896
image 1 & image 3          0.172972
image 1 & image 4          0.121545
image 1 & image 5          0.131816
image 2 & image 3          0.132829
image 2 & image 4          0.119018
image 2 & image 5          0.0676402
image 3 & image 4          0.166339
image 3 & image 5          0.120566
image 4 & image 5          0.111472
standard deviation         0.029117

Table 5-2: Euclidean distance between warped images
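For completeness, pairwise distances such as those in Table 5-2 would be computed on the flattened warped images, i.e. in R^18000 for 60 × 100 images with three colour components. The sketch below is an illustration only, not the code used to produce the table.

    import numpy as np
    from itertools import combinations

    def pairwise_distances(warped_images):
        """Euclidean distances between every pair of warped face images.

        Each image is flattened into a single vector (18 000 components for a
        60 x 100 image with three colour channels) before the distance is taken."""
        flat = [np.asarray(img, dtype=float).ravel() for img in warped_images]
        return {(i + 1, j + 1): float(np.linalg.norm(flat[i] - flat[j]))
                for i, j in combinations(range(len(flat)), 2)}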

In an ideal situation, this argument would also hold for the use of three frames instead of one: the use of three frames should perform the same as the use of one frame. However, the currently used warping and illumination algorithms are not ideal, and the effect can be seen in the improved performance rates when using three frames. The fact that the performance does not increase when using more than three frames shows that these algorithms work well enough that their non-ideal performance can be compensated for by using three frames instead of one.


The total performance of the driver verification module can now be computed as follows. The success rate of each step must be incorporated into the overall calculation:

T = L × F × V                                                    (5.1)

where T = total system accuracy, L = rate of successful face localization, F = rate of successful facial feature detection, and V = rate of successful face verification. Thus:

T = 96.9% × 73.5% × 82.7% = 58.9%                                 (5.2)

This means that the total performance of the video-based version of the face recognition system is 58.9%. This is a five percentage point increase over the result obtained for the static version. However, a great deal of effort will still be needed to improve the performance of especially the facial feature detection and the face verification before this system can be used in realistic applications. As the correct rejection rate is the most important result of the above discussion, equation (5.2) can also be applied to this value separately:

T' = 96.9% × 73.5% × 88.9% = 63.3%                                (5.3)

Thus, the overall correct rejection rate of the system is 63.3%. In other words, 63.3% of the car robbers will be stopped before leaving campus.

5.4 Conclusions
This chapter introduced the use of video-based face verification. Currently, the face verification step is only implemented for faces with a pose between -10° and 10°; therefore, all frames showing another pose of the face are discarded. Of the remaining frames, three are randomly selected to perform face verification. Each of the three frames saved at entry is compared with each of the three frames retained at the driver's exit. When at least six out of these nine comparisons lead to the decision that the two drivers compared are the same, the boom is opened and the driver can leave campus. When fewer than six comparisons vote for 'same person', a security guard is called to investigate the situation. The main advantage of the use of three frames is the decrease in the false acceptance rate. A FAR of 11.1% was observed, whereas this rate was still 22.4% when the verification step was based on only one frame per person. The use of more than three frames does not improve the false acceptance rate further, because of the application of two pre-processing steps: the warping and illumination normalization algorithms.


5.5 Summary
This chapter described the development of video-based face verification. Instead of only one image per person, three frames are used to verify if the current driver is the legitimate driver of the car. The total performance of 58.9% shows some improvement over the use of static images, but a lot of work still needs to be done to develop a system that will be able to identify faces in a realistic environment without any operator intervention.


Chapter 6

Conclusions and future research


This chapter presents the final conclusions obtained from this study. A few suggestions for future research conclude this thesis.

6.1 Conclusions
The aim of the research hypothesis was to automate a realistic pattern recognition problem, commonly performed by a human observer, in order to test some well-known pattern recognition methods. This allows for a better evaluation of their performance and of their causes of error. The realistic environment in this study was that of car and driver verification. This problem can be split into two pattern recognition applications: license plate recognition and face recognition.

The license plate recognition system performs very well. The localization module is able to locate the number plate correctly in 99.07% of the images, while 99.28% of the remaining number plates were correctly segmented into their individual characters. The character recognition module, based on a learning vector quantization network, decoded 99.02% of the license plates completely correctly. The combination of these three modules leads to a total performance of 97.4%. This proves that the techniques used for the localization, segmentation and recognition of number plates are mature enough to be implemented in a realistic environment.

The face recognition system consists of three main modules: the face localization, the facial feature detection and the face verification. The face localization is based on a skin colour detector in the RGB colour space. This module proved to work very well, with 96.9% of the images leading to correct localization. The facial feature detection module aims to detect the eyes, eyebrows, nose, mouth, chin and face boundary. In 73.5% of the 200 test images, the facial features and face boundary were correctly detected. The facial feature detection allows for invariance, to a certain extent, to ethnic variations, pose variations, rotation variations and


illumination variations. The video-based face verification based on the Bayes rule obtained a total performance of 82.7%, with a false acceptance rate of 11.1%. This results in a total performance for the face recognition module of 58.9%. This figure is still too low for use in a real world application. The main reason for failure of the system is the influence of variable lighting. Both the facial feature detection and the face verification will improve if the images can be better normalised for illumination variations. A second problem is the warping stage: six landmarks are insufficient to standardise the images across all poses. It can therefore be concluded that the face detection and verification algorithms are very dependent on the pre-processing algorithms. The better these techniques, the higher the performance of the face recognition system will be.

In terms of the research hypothesis, the goals of this study have been achieved. Two pattern recognition techniques in the fields of license plate and face recognition were tested in a real world environment. This led to a better evaluation and the determination of their main reasons for failure. The developed system is a possible basis for a real world system; however, the illumination and warping problems will need to be solved first.

The use of colour information proved beneficial to the current application. A colour detector can distinguish very easily between images containing faces and images not containing faces, given that these faces must be of a minimum size. Therefore, when video sequences are examined, a fast colour detector can be used to indicate which frames do not contain faces and need not be examined by the subsequent feature detection stages. The colour detection is also able to reduce the search space for the following detection stages significantly: instead of the whole frame, only a small window containing the driver's face needs to be searched for facial features. The use of colour information could be expanded to the individual's clothes. It can be expected that a person will wear the same clothes when driving in as when driving out. Checking whether the clothes have roughly the same colour in the two events could be helpful during the face verification. If they are the same colour, the face verification decision could be biased in the direction of acceptance by lowering the decision threshold. The same approach could be followed for the hair colour.

This study makes it clear that there are two ways to defeat the current access control application. Both are based on the use of camouflage to confuse the face recognition system. The first approach a car robber can follow to slip through the system is to try to look as much as possible like the legitimate owner of the car. By making sure that he/she is of the same ethnicity as the real driver, has the same facial hair growth and wears the same type of glasses, the thief may be perceived by the verification module to be the same person as the owner of the car. With a false acceptance rate of 11.1% and some effort from the thief's side, this is a real possibility. The second approach is to cause confusion by wearing sunglasses or by driving out at a time when the illumination is very different from what it was when the driver's information was saved into the database. This will cause the system to reject the claim. In principle, the car and driver should then be investigated by a human security guard.
However, the false rejection rate of 22.2% will make the security guards reluctant to check every rejection thoroughly. After a few days of operation, the guards will realise that individuals wearing sunglasses are often rejected (both truly and falsely) by the system, and it lies in human nature to skip further investigation and let the car pass through. Therefore, a car robber can purposely wear sunglasses and hope that the human investigators will blame the rejection on a flaw in the system.


Philosophically, this argument can be extended to defeating any remote-sensing biometric system. As long as no system is developed that guarantees a false acceptance rate and a false rejection rate of zero percent, there will always be a chance of slipping through. A system with an FAR and an FRR equal to zero percent would need to combine much more information than is used in the currently available systems. A few examples of extra information that could be used in the current application are the colour of the clothes or the hair texture of the driver. Biometric systems with an FAR or an FRR of almost zero percent are available, but these are contact-based, e.g. retina-scan or fingerprint-scan applications. However, the time necessary to first obtain a live template and then match the stored and live templates is too long to be viable for driver verification.

6.2 Future research


In the course of this study, various extensions and other possibilities were encountered. It is unfortunately impossible to investigate each one of them. A few alternative ideas are mentioned below and may inspire future researchers.

The current version of the license plate recognition module uses only the green colour channel and discards the red and blue channels. Therefore, the current system does not cater for Northern Cape number plates. However, parallel analysis of the three colour channels could allow a best-of-channel mode of operation and might even improve the current performance.

As made clear in the previous section, illumination variation is a major cause of failure. More advanced techniques are needed to standardize the face images irrespective of the lighting strength or direction.

To obtain higher performance rates, the facial feature detection system used in the current application should be replaced with a more advanced and more accurate technique. One possibility is the elastic bunch graph matching method proposed by Wiskott et al. [134]. This technique, based on Gabor wavelets, can detect more facial features more accurately (e.g. the tip of the nose or the corners of the mouth) than the current bar filter can. In the current system, all facial features are represented by bars and are only labelled as particular features (e.g. eyes or nose) based on geometric constraints. In the elastic bunch graph matching method, the separate features are searched for by different jets and are therefore labelled before grouping occurs. This should result in fewer false face candidates than the current technique and a better performance of the facial feature detection. The separate labelling also makes it possible to detect occluded faces, where one or more of the jets cannot be detected. This is not yet possible with the current bar filter and grouping-based algorithm. The graph matching technique can be applied to both feature detection and face recognition/verification. Like the eigenface-based techniques, this method allows for an independent training set which can be collected in advance. The method does not need retraining after every new entry, and the data of people matched by the system can simply be erased. Tests performed in constrained conditions of illumination and pose result in approximately the same performance as the currently used technique based on a posteriori probabilities and the Bayes rule. Results in a realistic environment of driver verification are not available yet.


References
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] Adorni, G., Bergenti, F. & Cagnoni, S. (1998) Vehicle license plate recognition by means of cellular automata. IEEE International Conference on Intelligent Vehicles, Stuttgart, Germany, pp. 689-693. Ahlberg, J. (1999) A system for face localization and facial feature extraction. Report LiTH-ISY-R2172, Linkping University, Sweden. Aitkenhead, M.J. & McDonald, A.J.S. (2003) A neural network face recognition system. Engineering Applications of Artificial Intelligence, vol. 16, no. 3, pp. 167-176. Atick, J. Biometrics. Available from: http://faculty.darden.virginia.edu/smithr/Biometrics.doc [Accessed 23 march, 2004]. Auger, J.-M., Idan, Y., Chevallier, R. & Dorizzi, B. (1992) Complementary aspects of topological maps and time delay neural networks for character recognition. IEEE International Joint Conference on Neural Networks, Baltimore, MD, vol. 4, pp. 444-449. Barker, S.E., Powell, H.M. & Palmer-Brown, D. (1995) High speed face location at optimal resolution. World Congress on Neural Networks, Washington, DC, vol. 2, pp. 536-541. Behnke, S. (2003) Face localization in the neural abstraction pyramid. 7th International Conference on Knowledge-Based Intelligent Information & Engineering Systems, Oxford, UK, part II, pp. 139-146. Belhumeur, P.N., Hespanha, J.P. & Kriegman, D.J. (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720. Bookstein, F.L. (1999) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567-585. Botha, C. & Coetzee, C. (1997) Number plate recognition system, Available from: http://espresso.ee.sun.ac.za/~cc/npr/ [Accessed February 4, 2004]. Bourke, P. (2000) YCC colour space and image compression. Available from: http://astronomy.swin.edu.au/~pbourke/colour/ycc/ [Accessed April 2, 2004]. Brunelli, R. & Poggio, T. (1993) Face recognition: features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052. Camastra, F. & Vinciarelli, A. (2001) Combining neural gas and learning vector quantization for cursive character recognition. Research report, Dalle Molle Institute for Perceptual Artificial Intelligence, Switzerland. Casey, R.G. & Lecolinet, E. (1996) A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 7, pp. 690-706. Chai, D., Phung, S.L. & Bouzerdoum, A. (2001) Skin color detection for face localization in humanmachine communications. 6th International Symposium on Signal Processing and its Applications, Kuala-Lumpur, Malaysia. Chan, K.-H., Ng, G.-S. & Erdogan, S.S. (1996) Feature extraction for handwritten character recognition. 7th International Conference on Signal Processing Applications & Technology, Boston, MA, pp. 1076-1080. Chang, T.C., Huang, T.S. & Novak, C. (1994) Facial feature extraction from colour images. 12th International Conference on Pattern Recognition Conference B: Computer Vision & Image Processing, Jerusalem, Israel, vol. 2, pp. 39-43. Coetzee, C., Botha, C. & Weber, D. (1998) PC based number plate recognition system. IEEE International Symposium on Industrial Electronics, Pretoria, South Africa, vol. 2, pp. 605-610. Comelli, P., Ferragina, P., Notturno Granieri, M. & Stabile, F. 
(1995) Optical recognition of motor vehicle license plates. IEEE Transactions on Vehicular Technology, vol. 44, no. 4, pp. 790-799.


[20] Craw, I., Costen, N., Kato, T. & Akamatsu, S. (1999) How should we represent faces for automatic recognition? IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8, pp. 725-736. [21] Dai, Y., Nakano, Y. & Miyao, H. (1994) Extraction of facial images from a complex background using SGLD matrices. 12th International Conference on Pattern Recognition Conference A: Computer Vision & Image Processing, Jerusalem, Israel, vol. 1, pp. 137-141. [22] Dazzle DVD Creation Station 200 Hardware, http://www.dazzle.com [Accessed May 3, 2004]. [23] Duda, R.O & Hart, P.E. (1973) Pattern classification and scene analysis. Wiley Interscience. [24] Duda, R.O, Hart, P.E. & Stork, D.G. (2001) Pattern classification. 2nd ed., Wiley Interscience. [25] Espinosa-Dur, V. & Fandez-Zanuy, M. (1999) Face identification by means of a neural net classifier. 33rd IEEE Annual International Carnahan Conference on Security Technology, Madrid, Spain, pp. 182-186. [26] Fisher, B., Perkins, S., Walker, A. & Wolfart, E. (1994) Canny edge detector. Available from: http://www.cee.hw.ac.uk/hipr/html/canny.html [Accessed April 17, 2004]. [27] Flusser, J. & Suk, T. (1994) Affine moment invariants: a new tool for character recognition. Pattern Recognition Letters, vol. 15, pp. 433-436. [28] Fu, L. (1994) Neural networks in computer intelligence. McGraw-Hill. [29] Gao, Y. & Leung, M.K.H. (2002) Face recognition using line edge map. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 764- 779. [30] Garcia, C., Zikos, G. & Tziritas, G. (1999) Face detection in color images using wavelet packet analysis. IEEE International Conference on Multimedia Computing and Systems, Florence, Italy, vol. 1, pp. 703-708. [31] Garcia, C., Zikos, G. & Tziritas, G. (2000) Wavelet packet analysis for face recognition. Image and Vision Computing, vol. 18, no. 4, pp. 289-297. [32] Geva, S. & Sitte, J. (1991) Adaptive nearest neighbor pattern classification. IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 318-322. [33] Gross, R., Yang, J. & Waibel, A. (2000) Face recognitiom in a meeting room. 4th IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 294-299. [34] Haykin, S. (1994) Neural networks, a comprehensive foundation. Macmillan College Publishing Company. [35] Hermida, X.F., Rodrguez, F.M., Lij, J.L.F., Sande, F.P. & Iglesias, M.P. (1997) An O.C.R. for V.L.P.s (Vehicle License Plate). 8th International Conference on Signal Processing Applications & Technology, San Diego, CA. [36] Hi-Tech Solutions, Ramat Gabriel Industrial Park, Migdal Haemek, Israel 10500, http://www.htsol.com/Index.html [Accessed January 27, 2004]. [37] Hofman, Y. License plate recognition A tutorial. Available from: http://www.licenseplaterecognition.com [Accessed February 3, 2004]. [38] Hong, L. & Jain, A. (1998) Integrating faces and fingerprints for personal identification. IEEE Transactions on Patterns Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1295-1307. [39] Hsu, R.-L., Abdel-Mottaleb, M. & Jain, A.K. (2002) Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706. [40] Huang, F.J., Zhou, Z., Zhang, H.-J. & Chen, T. (2000) Pose invariant face recognition. 4th IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 245250. [41] Hunter, A. (2002) Pattern recognition (computer vision). 
Available from: http://www.dur.ac.uk/andrew1.hunter/Vision/CV12PatternRecognition.pdf [Accessed February 24, 2004]. [42] Intrator, N., Reisfeld, D. & Yeshurun, Y. (1994) Face recognition using a hybrid supervised/unsupervised neural network. 12th International Conference on Pattern Recognition Conference B: Computer Vision & Image Processing, Jerusalem, Israel, vol. 2, pp. 50-54. [43] Jebara, T.S. & Pentland, A. (1997) Parametrized structure from motion for 3D adaptive feedback tracking of faces. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 144-150. [44] Jia, X. & Nixon, M.S. (1995) Extending the feature vector for automatic face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, pp. 1167-1176.


[45] Jiang, M., Zhang, G. & Chen, Z. (2001) Face recognition with multiple eigenface spaces. Proceedings of SPIE, vol. 4550, pp. 160-164. [46] Kepenekci, B. & Tek, F.B. (2002) Occluded face recognition based on Gabor wavelets. International Conference on Image Processing, Rochester, NY, vol. 1, pp. 293-296. [47] Khoros Pro 2001, Student Version 3.0.1.2, http://www.khoral.com [Accessed May 3, 2004]. [48] Kim, D.-H., Lee, J.-Y., Soh, J. & Chung, Y.-K. (2003) Real-time face verification using multiple feature combination and a support vector machine supervisor. International Conference on Multimedia and Expo, Baltimore, MD, vol. 3, pp. 145-148. [49] Kim, H.-C., Kim, D. & Bang, S.Y. (2002) Face recognition using the mixture-of-eigenfaces method. Pattern Recognition Letters, vol. 23, no. 13, pp. 1549-1558. [50] Kim, H.-C., Kim, D. & Bang, S.Y. (2002) Face retrieval using 1st- and 2nd-order PCA mixture model. International Conference on Image Processing, Rochester, NY, vol. 2, pp. 605-608. [51] Kim, K.I., Jung, K. & Kim, H.J. (2002) Face recognition using kernel principal component analysis. IEEE Signal Processing Letters, vol. 9, no. 2, pp. 40-42. [52] Kim, S., Kim, D., Ryu, Y. & Kim, G. (2002) A robust license-plate extraction method under complex image conditions. 16th IEEE International Conference on Pattern Recognition, Qubec, Canada, vol. 3, pp. 216-219. [53] Kirchberg, K.J., Jesorsky, O. & Frischholz, R.W. (2002) Genetic model optimization for Hausdorff distance-based face localization. International ECCV Workshop on Biometric Authentication, Copenhagen, Denmark, pp. 103-111. [54] Kohonen, T. (1989) Self-organization and associative memory, 3rd ed., Springer-Verlag. [55] Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J. & Torkkola, K. (1996) LVQ_PAK: The learning vector quantization program package. Helsinki University of Technology, Finland. [56] Kwon, Y.H. & da Vitoria Lobo, N. (1994) Face detection using templates. 12th International Conference on Pattern Recognition Conference A: Computer Vision & Image Processing, Jerusalem, Israel, vol. 1, pp. 764-767. [57] Lam, K.-M. & Li, Y.-L. (1998) An efficient approach for facial feature detection. 4th International Conference on Signal Processing, Beijing, China, vol. 2, pp. 1100-1103. [58] Lawrence, S., Giles, C.L. & Tsoi, A.C. (1996) Convolutional neural networks for face recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 217-222. [59] Lawrence, S., Giles, C.L., Tsoi, A.C. & Back, A.D. (1997) Face recognition: a convolutional neuralnetwork approach. IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98-113. [60] Lawrence, S., Yianilos, P. & Cox, I. (1997) Face recognition using mixture-distance and raw images. IEEE International Conference on Systems, Man, and Cybernetics Computational Cybernetics and Simulation, Orlando, FL, vol. 3, pp. 2016-2021. [61] Lee, S. & Pan, J.C.-J. (1996) Unconstrained handwritten numeral recognition based on radial basis competitive and cooperative networks with spatio-temporal feature representation. IEEE Transactions on Neural Networks, vol. 7, no. 2, pp. 455-474. [62] Lee, S.-W., Lee, D.-J. & Park, H.-S. (1996) A new methodology for gray-scale character segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1045-1050. [63] Leung, T.K., Burl, M.C. & Perona, P. (1995) Finding faces in cluttered scenes using random labelled graph matching. 
5th International Conference on Computer vision, Cambridge, MA, pp. 637-644. [64] Li, S.Z. & Lu, J. (1998) Generalizing capacity of face database for face recognition. 3rd IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, pp. 402- 406. [65] Li, Y., Gong, S., Sherrah, J. & Liddell, H. (2000) Multi-view face detection using support vector machines and eigenspace modelling. 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, Brighton, UK, vol. 1, pp. 241-244. [66] Lienhart, R., Kuranov, A. & Pisarevsky, V. (2002) Empirical analysis of detection cascades of boosted classifiers for rapid object detection. Technical report, Microprocessor Intel Research Lab, Santa Clara, CA. [67] Lin, K.-H., Guo, B., Lam, K.-M & Siu, W.-C. (2001) Human face recognition using a spatially weighted modified Hausdorff distance. International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, China, pp. 477-480.


[68] Lin, K.-H., Lam, K.-M. & Siu, W.-C. (2001) Locating the eye in human face images using fractal dimensions. IEE Proceedings of Vision, Image and Signal Processing, vol. 148, no. 6, pp. 413-421. [69] Lin, K.-H., Lam, K.-M. & Siu, W.-C. (2003) Spatially eigen-weighted Hausdorff distances for human face recognition. Pattern Recognition, vol. 36, no. 8, pp. 1827-1834. [70] Lin, S.-H., Kung, S.-Y & Lin, L.-J. (1997) Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 114-132. [71] Liu, C. & Wechsler, H. (1998) Probabilistic reasoning models for face recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, pp. 827-832. [72] Liu, C. & Wechsler, H. (1999) Face recognition using shape and texture. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, vol. 1, pp. 598-603. [73] Liu, C. & Wechsler, H. (2001) A Gabor feature classifier for face recognition. 8th IEEE International Conference on Computer Vision, Vancouver, Canada, vol. 2, pp. 270-275. [74] Liu, C. & Wechsler, H. (2001) A shape- and texture-based enhanced Fisher classifier for face recognition. IEEE Transactions on Image processing, vol. 10, no. 4, pp. 598-608. [75] Liu, C. & Wechsler, H. (2002) Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, vol. 11, no. 4, pp 467-476. [76] Liu, C. & Wechsler, H. (2003) Independent component analysis of Gabor features for face recognition. IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 919-928. [77] Liu, C., Shum, H.-Y. & Zhang, C. (2002) Hierachical shape modelling for automatic face localization. International ECCV Workshop, Heidelberg, Germany, pp. 687-703. [78] Liu, X., Zhu, Y. & Fujimura, K. (2002) Real-time pose classification for driver monitoring. 5th IEEE International Conference on Intelligent Transportation Systems, Singapore, pp. 174-178. [79] Liu, Z., Chen, Z., He, X., Zhou, J. & Xiong, G. (2002) A novel approach for detecting human face with various poses. IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, Beijing, China, vol. 1, pp. 289-292. [80] Lizama, E., Waldoestl, D. & Nickolay, B. (1997) An eigenfaces-based automatic face recognition system. IEEE International Conference on Systems, Man, and Cybernetics Computational Cybernetics and Simulation, Orlando, FL, vol. 1, pp. 174-177. [81] Lo, Z.-P., Yu, Y. & Bavarian, B. (1992) Derivation of learning vector quantization algorithms. IEEE International Joint Conference on Neural Networks, Baltimore, MD, vol. 3, pp. 561-566. [82] Loncelle, J., Derycke, N. & Fogelman Souli, F. (1992) Cooperation of GBP and LVQ networks for optical character recognition. International Joint Conference on Neural Networks, Baltimore, MD, vol. 3, pp. 694-699. [83] Lorente, L. & Torres, L. (1999) Face recognition of video sequences in a MPEG-7 context using a global eigen approach. IEEE International Conference on Image Processing, Kobe, Japan, vol. 4, pp. 187-191. [84] Lowe, D. The computer vision industry. Available from: http://www.cs.ubc.ca/spider/lowe/vision.html [Accessed February 24, 2004]. [85] Luo, L., Swamy, M.N.S. & Plotkin, E.I. (2003) A modified PCA algorithm for face recognition. IEEE Canadian Conference on Electrical and Computer Engineering, Montreal, Canada, vol. 1, pp. 57-60. [86] Martnez, A.M. 
(2000) Recognition of partially occluded and/or imprecisely localized faces using a probabilistic approach. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, vol. 1, pp. 712-717. [87] Martnez, A.M. (2002) Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 748-763. [88] Moghaddam, B. & Pentland, A. (1994) Face recognition using view-based and modular eigenspaces. Proceedings of SPIE: Automatic Systems for the Identification and Inspection of Humans, vol. 2277, pp. 12-21. [89] Moghaddam, B. & Pentland, A. (1997) Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 696-710. [90] Moghaddam, B. & Pentland, A. (1998) Probabilistic matching for face recognition. IEEE Southwest Symposium on Image Analysis and Interpretation, Tucson, AZ, pp. 186-191. [91] Moghaddam, B. (1999) Principal manifolds and Bayesian subspaces for visual recognition. 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, vol. 2, pp. 1131-1136.


[92] Moghaddam, B. (2002) Principal manifolds and probabilistic subspaces for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 780-788. [93] Moghaddam, B., Nastar, C. & Pentland, A. (1996) Bayesian face recognition using deformable intensity surfaces. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 638-645. [94] Moghaddam, B., Nastar, C. & Pentland, A. (2001) Bayesian face recognition with deformable image models. 11th International Conference on Image Analysis and Processing, Palermo, Italy, pp. 26-35. [95] Moghaddam, B., Wahid, W. & Pentland, A. (1998) Beyond eigenfaces: probabilistic matching for face recognition. 3rd IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, pages 30-35. [96] Naito, T., Tsukada, T., Yamada, K., Kozuka, K. & Yamamoto, S. (2000) Robust license-plate recognition method for passing vehicles under outside environment. IEEE Transaction on Vehicular Technology, vol. 49, no. 6, pp. 2309-2319. [97] Nakayama, K., Chigawa, Y. & Hasegawa, O. (1992) Handwritten alphabet and digit character recognition using feature extracting neural network and modified self-organizing map. IEEE International Joint Conference on Neural Networks, Baltimore, MD, vol. 4, pp. 235-240. [98] Nanavati, S., Thieme, M. & Nanavati, R. (2002) Biometrics. Identity verification in a networked world. John Wiley & Sons. [99] Nelson, L.J. License plate recognition systems. Available from: http://www.ee.psu.ac.th/leang/LPR/ITS101/9701atricle.htm [Accessed 6 January, 2004]. [100] Optasia Systems (Pty) Ltd., 20 Ayer Rajah Crescent #09-16/17, Singapore 139964, http://singaporegateway.com/optasia/ [Accessed 28 January 2004]. [101] Pentland, A. & Choudhury, T. (2000) Face recognition for smart environments. Computer, vol. 33, no. 2, pp. 50-55. [102] Perez, J.-C., Vidal, E. & Sanchez, L. (1994) Simple and effective feature extraction for optical character recognition. IN: Sanfeliu, A. & Casacuberta, F. ed. Advances in Pattern Recognition and Applications: Selected Papers from the 5th Spanish Symposium on Pattern Recognition and Image Analysis, World Scientific. [103] Quintiliano, P., Santa-Rosa, A. & Guadagnin, R. (2001) Face recognition based on eigenfeatures. Proceedings of SPIE, vol. 4550, pp. 140-145. [104] Reid, P. (2004) Biometrics for network security. Prentice Hall. [105] Rodrigues, R.J. & Thom, A.C.G. (2000) Cursive character recognition a character segmentation method using projection profile-based technique. 6th International Conference on Information System, Analysis and Synthesis and 4th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, FL. [106] Romano, R., Beymer, D. & Poggio, T. (1996) Face verification for real-time applications. Image Understanding Workshop, Palm Springs, CA, vol. 1, pp. 747-756. [107] Rovetta, S. and Zunino, R. (1999) License-plate localization by using vector quantization. IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, vol. 2, pp. 1113-1116. [108] Rowley, H.A., Baluja, S. & Kanade, T. (1998) Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38. [109] Ryu, Y.-S. & Oh, S.-Y. (2002) Simple hybrid classifier for face recognition with adaptively generated virtual data. Pattern Recognition Letters, vol. 23, no. 7, pp. 833-841. [110] Sahbi, H. & Boujemaa, N. 
(2001) Robust matching by dynamic space warping for accurate face recognition. IEEE International Conference on Image Processing, Thessaloniki, Greece, vol. 1, pp. 1010-1013. [111] Sahbi, H. & Boujemaa, N. (2002) Robust face recognition using dynamic space warping. European Conference on Computer Vision, Workshop on Biometric Authentication, Copenhagen, Denmark, pp. 121-132. [112] Saifullah, Y. & Manry, M.T. (1993) Classification-based segmentation of ZIP codes. IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, no. 5, pp. 1437-1443. [113] Samaria, F.S. & Harter, A.C. (1994) Parameterisation of a stochastic model for human face identification. 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, pp. 138-142.


[114] Sergey, V., Alexander, R. & Rosen, S. Moving car license plate recognition. Available from: http://www.cs.technion.ac.il/Labs/Isl/Project/Projects_done/cars_plates/finalreport.htm#_Toc4550691 29 [Accessed February 4, 2004]. [115] Setchell, C.J. (1997) Applications of computer vision to road-traffic monitoring. Doctoral thesis, University of Bristol, UK. [116] Shan, S., Cao, B., Gao, W. & Zhao, D. (2002) Extended Fisherface for face recognition from a single example image per person. IEEE International Symposium on Circuits and Systems, PhoenixScottsdale, AZ, vol. 2, pp. 81-84. [117] Shan, S., Gao, W. & Zhao, D. (2002) Face identification from a single example image based on facespecific subspace (FSS). IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, vol. 2, pp. 2125-2128. [118] Shan, S., Gao, W., Chen, X. & Ma, J. (2000) Novel face recognition based on individual eigensubspaces. 5th International Conference on Signal Processing, Beijing, China, vol. 3, pp. 1522-1525. [119] Shen, L.J., Fu, H.C., Xu, Y.Y., Hsu, F.R., Chang, H.T. & Meng, W.Y. (1996) A principal component based probabilistic DBNN for face recognition. IEEE International Conference on Image Processing, vol. 3, pp. 499-502. [120] Shen, W. & Khanna, R. (1997) Prolog to Face recognition: eigenface, elastic matching, and neural nets, An introduction to the paper by Zhan, Yan, and Lades. Proceedings of the IEEE, vol. 85, no. 9, p. 1422. [121] Shustorovich, A. (1994) A subspace projection approach to feature extraction: the two-dimensional Gabor transform for character recognition. Neural Networks, vol. 7, no. 8, pp. 1295-1301. [122] Siah, Y.K., Haur, T.Y., Khalid, M. & Ahmad, T. (1999) Vehicle licence plate recognition by fuzzy artmap neural network. World Engineering Congress, Kuala Lumpur, Malaysia. [123] Smith, G. Automatic licence plate recognition. Available from: http://www.cat.csiro.au/cmst/AC/projects/Project.php?alpr [Accessed 10 February, 2004]. [124] Smith, G. Optical character recognition. Available from: http://www.cat.csiro.au/cmst/AC/expertise/Expertise.php?ocr [Accessed 4 February, 2004]. [125] Sun, D.-R. & Wu, L.-N. (2002) A local-to-holistic face recognition approach using elastic graph matching. 1st International Conference on Machine Learning and Cybernetics, Beijing, China, pp. 240-242. [126] Theodoridis, S. & Koutroumbas, K. (1999) Pattern Recognition. Academic Press. [127] Torres, L., Reutter, J.Y. & Lorente, L. (1999) The importance of the color information in face recognition. IEEE International Conference on Image Processing, Kobe, Japan, vol. 3, pp. 627-631. [128] Trier, .D., Jain A.K. & Taxt, T. (1996) Feature extraction methods for character recognition A survey. Pattern Recognition, vol. 29, no. 4, pp. 641-662. [129] Turk, M.A. & Pentland, A.P. (1991) Face recognition using eigenfaces. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Maui, HI, pp. 586-591. [130] Vezhnevets, V. (2002) Method for localization of human faces in color-based face detectors and trackers. 3rd International Conference on Digital Information Processing and Control in Extreme Situations, Minsk, Belarus, pp. 51-56. [131] Vezhnevets, V., Sazonov, V. & Andreeva, A. (2003) A survey on pixel-based skin color detection techniques. GraphiCon, Moscow, Russia, pp. 85-92. [132] Viola, P. & Jones, M. (2001) Robust real-time object detection. 
2nd International workshop on Statistical and Computational Theories of Vision Modeling, Learning, Computing, and Sampling, Vancouver, Canada. [133] Wang, W., Gao, Y., Hui, S.C. & Leung, M.K. (2002) A fast and robust algorithm for face detection and localization. 9th International Conference on Neural Information Processing, Singapore, vol. 4, pp. 2118-2121. [134] Wiskott, L., Fellous, J.-M., Krüger, N. & von der Malsburg, C. (1999) Face recognition by elastic bunch graph matching. IN: Jain et al. ed. Intelligent Biometric Techniques in Fingerprint and Face Recognition, Chapter 11, pages 355-396, CRC Press. [135] Wong, K.-W., Lam, K.-M. & Siu, W.-C. (2001) An efficient algorithm for human face detection and facial feature extraction under different conditions. Pattern Recognition, vol. 34, no. 10, pp. 1993-2004. [136] Woodward, J.D., Orlans, N.M. & Higgins, P.T. (2003) Biometrics. Identity assurance in the information age. McGraw-Hill/Osborne.


[137] Wu, F.-C., Yang, T.-J. & Ouhyoung, M. (1998) Automatic feature extraction and face synthesis in facial image coding. 6th Pacific Conference on Computer Graphics and Applications, Singapore, pp. 218-219. [138] Wu, H., Chen, Q. & Yachida, M. (1999) Face detection from color images using a fuzzy pattern matching method. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 557-563. [139] Wu, Y. & Huang, T.S. (2002) Nonstationary color tracking for vision-based human-computer interaction. IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 948-960. [140] Xiao, J. & Yan, H. (2003) Face boundary extraction. 8th Conference on Digital Image Computing Techniques and Applications, Sydney, Australia, pp. 947-956. [141] Yang, M.-H. (2002) Kernel eigenfaces vs. kernel fischerfaces: face recognition using kernel methods. 5th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, pp. 215-220. [142] Yang, M.-H., Kriegman, D.J. & Ahuja, N. (2002) Detecting faces in images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34-58. [143] Yao, T., Li, H., Liu, G., Ye, X., Gu, W. & Ji, Y. (2002) A fast and robust face location and feature extraction system. IEEE International Conference on Image Processing, New York, NY, vol. 1, pp. 157-160. [144] Yoo, J.-H., Chun, B.-T. & Shin, D.P. (1989) A neural network for recognizing characters extracted from moving vehicles. The Transputer Data Book, Inmos Ltd., vol. 3, pp. 162-166. [145] Yow, K.C. & Cipolla, R. (1997) Feature-based human face detection. Image and Vision Computing, vol. 15, no. 9, pp. 713-735. [146] Yow, K.C. (1998) Automatic human face detection and localization, Doctoral thesis, University of Cambridge, UK. [147] Zhang, J., Yan, Y. & Lades, M. (1997) Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE, vol. 85, no. 9, pp. 1423-1435. [148] Zhao, W., Chellappa, R., Rosenfeld, A. & Phillips, P.J. (2002) Face recognition: a literature survey. Technical Report CFAR-TR00-948, University of Maryland. [149] Zunino, R. & Rovetta, S. (2000) Vector quantization for license-plate location and image coding. IEEE Transactions on Industrial Electronics, vol. 47, no. 1, pp. 159-167. [150] Zuo, F. & de With, P.H.N. (2002) Automatic human face detection for a distributed video security system. 3rd PROGRESS Workshop on Embedded Systems, Utrecht, the Netherlands, pp. 269-274.

