Face Recognition

VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELGAUM-590014

Project Report
On
Face Recognition using Artificial Neural Networks
Submitted in partial fulfilment of the requirement for the award of degree of
BACHELOR OF ENGINEERING
In
ELECTRONICS & COMMUNICATION
By
Prashanth Namit
1PI05EC072 1PI05EC057
Under the Guidance of

Prof. Koshy George
PES Centre for Intelligent Systems;
Dept. of Telecommunication Engineering

DEPARTMENT OF ELECTRONICS AND COMMUNICATION
PES INSTITUTE OF TECHNOLOGY
BANGALORE 560 085
2008- 2009

PES INSTITUTE OF TECHNOLOGY
BANGALORE 560 085

CERTIFICATE
Certified that the project work entitled Face Recognition Techniques using Neural Networks is a
bona fide work carried out by Prashanth H, Namit Chauhan bearing USN 1PI05EC072 and
1PI05EC043 in partial fulfilment for the award of degree of BACHELOR OF ENGINEERING in
ELECTRONICS & COMMUNICATION of Visvesvaraya Technological University, during the
academic year 2008-2009. It is certified that all corrections/suggestions indicated for internal assessment
have been incorporated in the report. The project report has been approved as it satisfies the academic
requirements with respect to project work prescribed for the mentioned degree.

Dr Koshy George
PES Centre for Intelligent Systems;
Dept. of Telecom. Engg., PESIT
Dr A. Vijaykumar
HOD,
Dept. of E&C Engg., PESIT
Dr K. N. B. Murthy
Principal
PESIT

Prashanth
1PI05EC072
Namit Chauhan
1PI05EC057

External Viva

Name of the Examiners Signature with Date
1._____________________ _____________________
2. _____________________ _____________________

ACKNOWLEDGEMENTS

We would like to express our sincere thanks to Dr. Koshy George Professor, TE
Department, PESIT, Bangalore, for his kind and constant support and guidance during the
course of this project without which, the project would not have been a success

We would like to thank and extend
1
our heart-felt gratitude to Dr. A Vijaya Kumar,
HOD of the Department of Electronics and Communication, for his kind, inspiring and
illustrious guidance and ample encouragement.

We would like to thank Prof. D Jawahar, CEO, PES Group of Institutions, and
Dr. K N Balasubramanya Murthy, Director & Principal, PESIT, for the valuable
resources provided for the completion of the project.

We would like to express our sincere thanks and deep sense of gratitude to our parents
for their everlasting support.

And finally it gives us immense pleasure to thank our friends and everyone who have
been instrumental in the successful completion of the project.

Table of Contents:

Serial
Number
Topic Page Number
1 Face Recognition: An Introduction 1
1.1 Classification 3
1.1.1 Application of face recognition 6
1.1.2 Adopted approach 6
1.2 Literature survey 7
1.2.1 Existing algorithms 7
1.2.2 Existing techniques 8
1.3 Organization of the report 10
2 Neural Networks: An Introduction 11
2.1 Multilayer Perceptrons 13
2.2
Derivation of Error back propagation
algorithm
15
2.3 Activation function 20
2.4 Rate of learning 22
2.5 Modes of Training 23
2.6 Local minima of Error Surface 24
2.7
Supervised learning viewed as an
optimization problem
24
2.8 Face recognition using Neural Networks 26
2.9 Conclusions 26
3
Principal Components Analysis: An
Introduction
27
3.1 Derivation of Principal components 28
3.2 Face Recognition using Eigen faces 30
3.3
Euclidean Distance method for
Classification
33

ii
3.4 Conclusions 33
4 Support Vector Machines: An Introduction 34
4.1 Basics 36
4.2
Linear Support Vector Machines:
The Separable Case
37
4.3
Quadratic Optimization for Finding the
Optimal Hyperplane
39
4.4 The Karush-Kuhn-Tucker Conditions 41
4.5 The Non Separable Case 42
4.6 Nonlinear Support Vector Machines 45
4.6.1 Inner product Kernel 46
4.6.2 Mercers Theorem 48
4.6.3 Summary of Inner-Product Kernels 49
4.7 Multiple Class Support Vector Machine 49
4.8 Binary Search Tree 52
4.9 SVM for Face Recognition 53
4.10 Conclusions 53
5 Image Preprocessing 54
5.1 Histogram Equalization 54
5.2 Median Filtering 57
5.3 Bi-Cubic Interpolation 57
5.4 Conclusions 58
6 Results 59
6.1 MLP 59
6.2 PCA 63
6.3 SVMs 67
6.4 PCA+MLP 70
7 Conclusions and Future work 73
8 References 75
Appendix 78

iii
List of Figures

Figure 1.1: Applications of Face Recognition
Figure 1.2: Supervised Learning
Figure 1.3: Unsupervised Learning
Figure 1.4: (a) Good Generalization, (b) Bad Generalization
Figure 2.1: Block diagram of nervous system
Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks,
Recurrent Networks
Figure 2.3: Nonlinear model of a neuron
Figure 2.4: MLP with one hidden layer
Figure 2.5: Signal-flow graph highlighting the details of output neuron j
Figure 2.7: Signal-flow graphical summary of back-propogation learning. Top part of
the graph: forward pass. Bottom part of the graph: backward pass.
Figure 2.8: Hyperbolic tangent function
Figure 2.9: Error surface
Figure 3.1: (a) The 1
st
PC z
1
is a minimum distance fit to a line in X space. (b) The 2
nd

PC z
2
is a minimum distance fit to a line in the plane perpendicular to the 1
st
PC.
Figure 3.2: Eigen faces
Figure 4.1: General structure of SVM
Figure 4.2: Linear separating hyperplane for separable case
Figure 4.3: Linear separating hyperplane for non-separable case
Figure 4.4: Mapping into higher space
Figure 4.5: Decision surface of Kernels
Figure 4.6: Decision surface using one SVM
Figure 4.7: Problem encountered when one SVM is used for multi- classification
Figure 4.8: Multi-classification using Binary classifiers
Figure 4.9: Binary tree structure for 8 classes
Figure 5.1: Image after Histogram Equalization
Figure 5.2: Median filtering
Figure 6.1: Face recognition accuracy with multilayer feedforward networks with
histogram equalization and median filtering.

iv
Figure 6.2: Face recognition accuracy with multilayer feedforward networks with
only median filtering.
Figure 6.3: Face recognition accuracy with multilayer feedforward networks with no
pre-processing
Figure 6.4: Face recognition accuracy with PCA with histogram equalization and
median filtering.
Figure 6.5: Face recognition accuracy with PCA with only median filtering.
Figure 6.6: Face recognition accuracy with PCA and no pre-processing
Figure 6.7: Face recognition accuracy with PCA+MLP with histogram equalization
and median filtering.
Figure 6.8: Face recognition accuracy with PCA+MLP with only median filtering.

Face Recognition using Artificial Neural Network

Dept. of ECE Jan June 09 Page 1
CHAPTER I
Face Recognition: An Introduction

Human beings have recognition capabilities that are unparalleled in the modern
computing era. These are mainly due to the high degree of interconnectivity, adaptive nature,
learning skills and generalization capabilities of the nervous system. The human brain has
numerous highly interconnected biological neurons which, on some specific tasks, can
outperform super computers. A child can accurately identify a face, but for a computer it is a
cumbersome task. Therefore, the main idea is to engineer a system which can emulate what a
child can do. Advancements in computing capability over the past few decades have enabled
comparable recognition capabilities from such engineered systems quite successfully. Early
face recognition algorithms used simple geometric models, but recently the recognition
process has now matured into a science of sophisticated mathematical representations and
matching processes. Major advancements and initiatives have propelled face recognition
technology into the spotlight.

Face recognition technology can be used in wide range of applications. (A few
example applications are shown in Fig 1.1.) An important aspect is that such technology
should be able to deal with various changes in face images, like rotation, changes in
expression. Surprisingly, the mathematical variations between the images of the same face
due to illumination and viewing direction are almost always larger than image variations due
to changes in face identity. This presents a great challenge to face recognition. At the core,
two issues are central to successful face recognition algorithms: First, the choice of features
used to represent a face. Since images are subject to changes in viewpoint, illumination, and
expression, an effective representation should be able to deal with these possible changes.
Secondly, the classification of a new face image using the chosen representation.


Face Recognition can be of two types:
Feature Based (Geometric)
Template Based (Photometric)
In geometric or feature-based methods, facial features such as eyes, nose, mouth, and chin
are detected. Properties and relations such as areas, distances, and angles between the
features are used as descriptors of faces. Although this class of methods is economical and
efficient in achieving data reduction and is insensitive to variations in illumination and
viewpoint, it relies heavily on the extraction and measurement of facial features.
Unfortunately, feature extraction and measurement techniques and algorithms developed to
date have not been reliable enough to cater to this need.

In contrast, template matching and neural methods generally operate directly on an
image-based representation of faces, i.e., pixel intensity array. Because the detection and
measurement of geometric facial features are not required, this type of method has been more
practical and easier to implement when compared to geometric feature-based methods.

Figure 1.1: Applications of face recognition.


1.1 Classification
Face recognition is nothing but the ability of a machine to successfully categorize a set
of images based on certain discriminatory features. Classification or pattern recognition can
be a very difficult problem and is still a very active field of research. The aim of the current
project can be considered as an attempt to simulate the recognition capability of the human
brain. A human being has the ability to put a certain scenario in a context and identify its
components. Of course it is not entirely obvious how to make the machine discriminate
between elements of different objects of different classes. The classification task may be
mathematically represented as the mapping:
f : A x B = C (1.1)
where A represents the elements that have to be classified, i.e. the pixel intensity vectors,
using B as the function parameters. The output C has to discriminate between images of
different classes so that the class of each element can be determined. Classification by
machines requires two steps: First, the properties which distinguish an element of one class
from that of another class have to be identified. Secondly, the machine has to be trained to
know how to discriminate between the classes by defining a learning model. The learning
model describes the procedure that has to be used for the actual training. Basically there are
two types of training often referred to as supervised and unsupervised learning.

Supervised Learning: When dealing with supervised learning, the trainer feeds the
machine a sample and the machine classifies it. The trainer then tells the machine
whether it has classified the sample correctly or not. In the case that it has
misclassified the sample, the machine has to adjust its classifier parameters to better
classify the given sample. Supervised learning is illustrated in Fig. 1.2. Examples of
these include Bayes Classifier and Neural Networks. Another important aspect
involved is the method of training imparted to these learning machines. It is important
to have a training set which is representative for the given classes so that future
classification can be successful.


Figure 1.2: Supervised Learning

Unsupervised Learning: In unsupervised or self-organized learning there is no
external trainer to oversee the learning process. Rather a task-independent measure of
the quality of classification is computed, and the free parameters of the network are
optimized with respect to that measure. Once the network has become tuned to the
statistical regularities of the input data, it develops the ability to form internal
representations for encoding features of the input and thereby create new classes
automatically. One technique for this is called clustering. In this case the trainer has
nothing to do with the classification. Instead the learning machine determines which
elements should belong to the same class and assigns a class accordingly.
Unsupervised learning is as shown below in Fig. 1.3.

Figure 1.3: Unsupervised Learning

It is to be noted that over-training leads to poor generalization. For example, an over-
trained machine would return different classes for the same image subjected to different
modifications. As illustrated in the Fig 1.4, over-training essentially is memorizing the data,
and leads to poor generalization.


Figure 1.4: (a) Good Generalization, (b) Bad Generalization

Face recognition is a very complex form of pattern recognition. It consists of
classifying highly ambiguous input signals, with multiple dimensions and matching them
with known signals. The following are the problems that occur:
The intrinsic 2D structure of an image matrix is more often than not removed.
Consequently, the spatial information stored therein is discarded and not effectively
utilized.
Curse of Dimensionality: Each image sample is modeled is typically a point in a high-
dimensional space. Consequently, a large number of training samples are often

needed to get reliable and robust estimation about the characteristics of data
distribution.
Usually very limited amounts of data are available in real applications such as face
recognition, image retrieval, and image classification.
A number of ways have been proposed to solve these problems. Finding an effective means
to reduce the dimensionality is the first step in face recognition.

1.1.1 Applications of Face Recognition
Some of the applications are as follows:
Trying to find a face within a large database of faces. In this approach the system
returns a possible list of faces from the database. The most useful applications contain
crowd surveillance, video content indexing, personal identification (example: drivers
license), mug shots matching, etc.
Real time face recognition: Here, face recognition is used to identify a person on the
spot and grant access to a building or a compound, thus avoiding security hassles. In
this case the face is compared against a multiple training samples of a person.

1.1.2 Adopted Approach
Face recognition is done naturally by humans. However, developing a computer
algorithm to achieve the same purpose is a rather difficult task. Starting with images, the aim
is to distinguish between faces of different people. One class of methods presupposes the
existence of certain features in the image, i.e. eyes, nose, mouth, hair, and an algorithm is
devised to find and characterize these features. A second class of methods also assumes that
there are features to be found but does not predefine what these features are or how to
measure them. The approach followed in this project belongs to the latter class of methods
since it much less restrictive than the former. Specifically, given a set of training images, the
strategy is to develop algorithms that construct a different set of images which
approximates best the given image data set. The training set is then projected onto this
subspace. To query a new image, its projection onto this subspace is determined, and a
training image whose projection is closest to that of the new image is sought.

1.2 Literature Survey
Several algorithms and techniques for face recognition have been developed in the past
by researchers. These are discussed briefly in this section.

1.2.1 Existing Algorithms

Face Recognition Based on Principal Component Analysis: Principal Component
Analysis (PCA) is a well known algorithm used in face recognition. The basic idea in PCA is
to determine a vector of much lower dimension that best approximates in some sense a given
data vector. Thus, in face recognition, it takes an s-dimensional vector representation of each
face in a training set of images as input, and determines a t-dimensional subspace whose
basis vector is maximum corresponding to the original image. The dimension of this new
subspace is lower than the original one (t <<s). If the original image elements are considered
as random variables, then the principal components are along the eigenvectors corresponding
to the larger eigenvalues of the correlation matrix, and error minimization is done in a least-
squares sense[2].

Face Recognition Based on Independent Component Analysis: Like the PCA based
technique, this is also a technique to extract the statistics of the random variables. However,
in this case, the second-order and higher-order dependencies of the input data are minimized
by the technique and the basis along which the data is statically independent are found[1].

Evolutionary Pursuit: This is an eigenspace-based approach that searches for the best set of
projection axes in order to maximize a fitness function, measuring at the same time the
classification accuracy and generalization ability of the system. Because the dimension of the
solution space is too big, it is solved using a specific kind of genetic algorithm called
Evolutionary Pursuit (EP) [18],[19].

Elastic Bunch Graph Matching: All human faces share a similar topological structure.
Faces are represented as graphs, with nodes positioned at fiducially points (eyes, nose, etc.)
and edges labeled with 2-D distance vectors. Each node contains a set of 40 complex Gabor
wavelet coefficients at different scales and orientations (phase, amplitude). They are called
"jets". Recognition is based on labeled graphs. A labeled graph is a set of nodes connected by
edges; each node is labeled as jets, edges are labeled as distances [20].

Kernel Methods: The face manifold in subspace need not be linear. Kernel methods are a
generalization of linear methods. Direct non-linear manifold schemes are explored to learn
this non-linear manifold [4],[5],[10],[11].

Trace Transform: The Trace transform, a generalization of the Radon transform, is a new
tool for image processing which can be used for recognizing objects under transformations,
e.g. rotation, translation and scaling. To produce the Trace transform one computes a
functional along tracing lines of an image. Different Trace transforms can be produced from
an image using different trace functional [21],[22].

1.2.2. Existing Techniques
Eigenfaces: Many approaches to the overall face recognition problem have been devised
over the years, but one of the most accurate and fastest ways to identify faces is to use the so-
called eigenface technique. This uses a combination of linear algebra and statistical analysis
to generate a set of basis faces the eigenfaces against which inputs are tested. Eigenface
mostly uses PCA and ICA tools for face recognition. Using these, the efficiency is variable.
Although ICA significantly outperforms the standard PCA, it has been claimed that that the
performance of ICA strongly depends on its involved PCA process. The pure ICA projection
has little effect on the performance of face recognition.


Range Imaging: Range imaging is the name for a collection of techniques, used to produce
a 2D image showing the distance to a set of points in a scene from a specific point, normally
associated with some type of sensor device. The resulting image, the range image, has pixel
values which correspond to the distance, e.g., brighter values mean shorter distance, or vice
versa. If the sensor which is used for produce the range image is properly calibrated, the pixel
values can be given directly in physical units such as centimeters [23].

Line edge map: Image-based face recognition algorithm that uses a set of random rectilinear
line segments of 2D face image views as the underlying image representation, together with a
nearest-neighbor classifier as the line-matching scheme. The combination of 1D line
segments exploits the inherent coherence in one or more 2D face image views in the viewing
sphere [24].

Neural Network based Face Recognition Techniques: Neural networks are used to create
the face database and recognize the face. A separate network for each person is built. The
input face is projected onto the eigenface space first to get a new descriptor. This descriptor
is used as network input and applied to each person's network. The one with maximum
output is selected and reported as the host if it is larger than a predefined recognition
threshold [3],[1],[13],[12].

Gabor wavelet networks (GWN): The Gabor wavelet network is used for an effective
object representation. The Gabor wavelet network has several advantages such as invariance
to some degree with respect to translation, rotation and dilation. Furthermore, it has the
ability to generalize and to abstract from the training data and to assure, for a given network.


Sparse Representation: The input image is divided with L-1 minimization. And then those
sparse will be compared with training datas sparse. There are also many techniques which
has been tried to use for facial recognition techniques, but among all of them, the eigenface
technique shows the fastest and most accurate results than other techniques. So far, from our
research, we have seen that the eigenface technique has faster performance than other
techniques. But in recent times the researchers and scientists are trying to focus on Human
Vision System (HVS) for face recognition. It has not been implemented in practical field of
face recognition yet but has laid the foundation for future studies [25].

1.3 Organization of the Report
The report is organized as follows: The basics of neural networks and the learning
processes involved in neural networks are discussed in Chapter 2. The back propagation
learning and conjugate gradient algorithms are also dealt with. Then the process of face
recognition using neural networks is then described. Chapter 3 covers a famous and widely
used method for face recognition i.e. eigenface approach. Both principal components and
nearest neighbour classification are dealt with. Support vector machines together with the
various optimization techniques involved are presented in Chapter 4. Both linear and
nonlinear cases are explained, as well as the kernels involved in support vector machines
(SVMs). Face recognition using binary SVMs for multi-classification purposes is also
explained. Face recognition involves several pre-processing steps. These are detailed in
Chapter 5. The results obtained using a number of techniques presented in the previous
chapters are compared in Chapter 6. Concluding comments and directions for future work are
indicated in Chapter 7.


CHAPTER II
Artificial Neural Networks

Neural Networks are nonlinear models of the neural pathways in the nervous system. The
flow of signals in the nervous system is as shown in Fig. 2.1. The propagation of stimulus, in
either direction throughout the system of sense organs, is due to the firing of simple
elements operating in parallel. These elements can be structured as:
Single-Layer Feedforward Networks
Multilayer Feedforward Networks
Recurrent Networks
These structures are shown in Fig. 2.2, where each node represents a mathematical model
of a neuron.

Figure 2.1: Block diagram of nervous system

Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent
Networks

The manner in which this transmission proceeds is determined by the learning
abilities of these neurons. Learning may be error-correction learning, memory-based
learning, Hebbian learning; or based on paradigms as supervised learning, unsupervised
learning, reinforcement learning through continued interaction with the environment.

Rosenblatts perceptron is a basic form of a neural network, built around a nonlinear
model of a neuron, namely, the McCulloh-Pitts model of a neuron as shown in Fig. 2.3. The
adder node of the neuron model computes a linear combination of input data applied to the
synapses and the tendency to be biased, given by the value b, during a classification task.
This induced local field is converted to a binary signal, +1 or -1, which determines the
weighted capacity to activate the neighboring neuron.

Figure 2.3: Nonlinear model of a neuron

In its most general form, a neural network is a machine that is designed to model the
way in which the brain performs a particular task or function of interest. It resembles the
brain in the following two respects:
1. Knowledge is acquired by the network from its environment through a learning
process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the
acquired knowledge.
The procedure used to perform the learning process is called a learning algorithm, the
function of which is to modify the synaptic weights of the network in an orderly fashion to

attain a desired design objective. The modification of synaptic weights provides the
traditional basis for the design of neural networks. Further, it is also possible for a neural
network to modify its own topology, which is motivated by the fact that neurons in the
human brain can die and that new synaptic connections can grow.

2.1 Multilayer Perceptrons
Multilayer perceptrons are multilayer feedforward networks consisting of at least three
layers: the input layer, the hidden layer(s) and the output layer:
1. The input layer consists of input sensory nodes to which data, represented by pixel
intensity vectors of pre-processed images, is presented. These propagate through the
neural network in a forward direction, on a layer by layer basis, hence known as
multi-layer feedforward networks.
2. The hidden layer performs a non-linear transformation on the input signal into a new
space called the feature space. More than one hidden layer maybe used to achieve
this purpose. As learning progresses, the hidden neurons gradually discover the
features that characterize the images. The number of hidden neurons in each hidden
layer, and the number of hidden layers is critical to the learning process, and is varied
for higher accuracy.
3. At the output layer, the output/functional signal(s), expressed as a continuous
nonlinear function of the input signal and synaptic weights associated with that
neuron, is obtained.
Fig. 2.4 shows the architectural graph of a multilayer perceptron. The network shown is fully
connected, that is, a neuron in any layer of the network is connected to all the neurons in the
previous layer.

In the case of supervised learning, this output signal is subtracted from the desired
response, to yield the error signal. In the classification problem, the desired response
corresponds to the classes of images. For this specific problem, the number of output neurons
equals the number of classes of images. The backward propagation of the estimate of the
error gradient vector, that is, the gradients of the error surface with respect to the weights
connected to the inputs of a neuron, forms the core of the supervised learning process. In the

training phase, the optimum weights of the hidden and output layers are obtained, which
minimize the error estimate for desired results. The optimum weights are learnt by
propagating the error signal backwards, against the synaptic connections. This supervised
learning algorithm is known as error back propagation algorithm, which is described in detail
in Section 2.2.

Figure 2.4: MLP with one hidden layer


2.2 Derivation of Error Back Propagation Algorithm
The error signal at the output of neuron j at iteration n, that is, presentation of the n
th
training
example, is defined by:
e
j
(n)=d
j
(n)-y
j
(n) (2.1)
The instantaneous value of the error energy for neuron j is defined as 0.5e
2
j
(n).
Correspondingly, the instantaneous value of the error energy E(n) of the total error energy is
obtained by summing 0.5e
2
j
(n) over all neurons in the output layer. Thus:
E(n)=0.5 (2.2)
where the set C includes all the neurons in the output layer of the network. Let N denote the
total number of patterns contained in the training set. The average squared error energy is
obtained by summing E(n) over all n and then normalizing with respect to the set size N, as
shown by:
E
av
=1/N (2.3)
The instantaneous error energy E(n), and therefore the average error energy E
av
, is a function
of all the free parameters (i.e. synaptic weights and bias levels) of the network. For a given
training set, E
av
represents the cost function as a measure of learning performance. The
objective of the learning process is to adjust the free parameters of the network to minimize
E
av
. A simple method of training is considered in which the weights are updated on a pattern
by pattern basis until one epoch, that is, one complete presentation of one training set has
been dealt with. The adjustments to the weights are made in accordance with the respective
errors computed for each pattern presented to the network.



Figure 2.5: Signal-flow graph highlighting the details of output neuron j

The arithmetic average of these individual weight changes over the training set is
therefore an estimate of the true change that would result in from modifying the weights
based on minimizing the cost function E
av
over the entire training set. Consider Fig. 2.5
which depicts neuron j being fed by a set of function signals produced by a layer of neurons
to its left. The induced local field v
j
(n) produced at the input of the activation function
associated with neuron j is therefore:
v
j
(n)= (2.4)
and m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic
weight w
j0
(corresponding to the fixed input y
0
=+1) equals the bias b
j
applied to neuron j.
Hence the function signal y
j
(n) appearing at the output of neuron j at iteration n is:
y
j
(n)=
j
(v
j
(n)) (2.5)
The back-propagation algorithm applies a correction w
ji
(n) to the synaptic weight w
ji
(n),
which is proportional to the partial derivative E(n)/dw
ji
(n). According to the chain rule of
calculus, one can express this gradient as:

E(n)/w
ji
(n)= E(n)/ e
j
(n). e
j
(n)/ y
i
(n). y
j
(n)/ v
j
(n). v
j
(n)/ w
ji
(n) (2.6)
The derivative E(n)/ w
ji
(n) represents the sensitivity factor, determining the direction of
search in weight space for the synaptic weight w
ji
. Differentiating both sides of equation (2.2)
with respect to e
j
(n), equation (2.1) with respect to y
j
(n), equation (2.5) with respect to
y
j
(n), and equation (2.4) with respect to w
ji
(n) one respectively gets:
E(n)/ e
j
(n)=e
j
(n) (2.7)
e
j
(n)/ y
j
(n)=-1 (2.8)
y
j
(n)/ v
j
(n)=
j
(v
j
(n)) (2.9)
v
j
(n)/ w
ji
(n)=y
i
(n) (2.10)
The use of equations (2.7) and (2.10) in (6) yields:
E(n)/ w
ji
(n)=-e
j
(n)(v
j
(n))y
i
(n) (2.11)
The correction w
ji
(n) applied to w
ji
(n) is defined by the delta rule:
w
ji
(n)=-E(n)/w
ji
(n) (2.12)
where is the learning-rate parameter of the back propagation algorithm. The use of the
minus sign accounts for the gradient descent in weight space. Accordingly, the use of (2.11)
in (2.12) yields:
w
ji
(n)=
j
(n)y
i
(n) (2.13)
where local gradient
j
(n) is defined by:
j
(n)=- E(n)/ v
j
(n)=e
j
(n)
j
(v
j
(n)) (2.14)
The local gradient points to required changes in synaptic weights. According to (2.14), the
local gradient
j
(n) for output neuron j is equal to the product of the corresponding error
signal e
j
(n) for that neuron and the derivative
j
(v
j
(n)) of the associated activation function.

Case 1: Neuron j is an output node
When neuron j is located in the output layer of the network, it is supplied with a
desired response of its own. Equation (2.1) is used to compute the error signal e
j
(n)
associated with this neuron.

Case 2: Neuron j is the hidden node
When neuron j is located in a hidden layer of the network, there is no specified
desired response for that neuron. Accordingly, the error signal for a hidden neuron would
have to be determined recursively in terms of the error signals of all the neurons to which
that hidden neuron is directly connected; this is where the development of the back-
propagation algorithm gets complicated. Consider the situation depicted in Fig 2.6, which
depicts neuron j as a hidden node of the network. According to equation (2.14), the local
gradient
j
(n) for hidden neuron j is redefined as:
j
(n)=-[E(n)/ y
j
(n)].[y
j
(n)/ v
j
(n)]
=-[E(n)/ y
j
(n)].
j
(v
j
(n)), neuron j is hidden (2.15)
where in the second line equation (2.9) is used. To calculate the partial derivative E(n)/
y
j
(n), one proceeds as follows. From equation (2.2):
E(n)=0.5 , neuron k is an output node (2.16)
Differentiating equation (2.16) with respect to the function signal y
j
(n):
E(n)/ y
j
(n)= [e
k
(n)/ y
j
(n)] (2.17)
Using the chain rule for the partial derivative e
k
(n)/ y
j
(n), and rewriting in the equivalent
form the following is obtained:
E(n)/y
j
(n)= .e
k
(n)/v
k
(n).v
k
(n)/y
j
(n) (2.18)
However,
e
k
(n)=d
k
(n)-y
k
(n)
=d
k
(n)-
k
(v
k
(n)), neuron k is output node (2.19)
Hence:
e
k
(n)/ v
k
(n)=-
k
(v
k
(n)) (2.20)
Note that for neuron k the induced local field is:
v
k
(n)= (2.21)



Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j

where m is the total number of inputs (excluding the bias) applied to neuron k. Here again,
the synaptic weight w
ko
(n) is equal to the bias b
k
(n) applied to neuron k, and the
corresponding input is fixed at the value +1. Differentiating equation (2.21) with respect to
y
j
(n) yields:
v
k
(n)/ y
j
(n)=w
kj
(n) (2.22)
By using equation (2.20) and equation (2.22) in equation (2.18), the desired partial derivative
is obtained:
E(n)/ y
j
(n)= -
k
(v
k
(n))w
kj
(n)
= - w
kj
(n) (2.23)
where in the second line the definition of the local gradient
k
(n) given in equation (2.14)
with the index k substituted for j is used. Finally, the back-propagation formula for the local
gradient
j
(n) is obtained as described:
j
(n)= (v
j
(n)) w
kj
(n) (2.24)

The factor
j
(v
j
(n)) involved in the computation of the local gradient
j
(n) in (2.24) depends
solely on the activation function associated with hidden neuron j. The remaining factor
involved in this computation, namely the summation over k, depends on two sets of terms.
The first set of terms, the
k
(n), requires knowledge of the error signals e
k
(n), for all neurons
that lie in the layer to the immediate right of hidden neuron j, and that are directly connected
to neuron j. The second set of terms, the w
kj
(n), consists of synaptic weights associated with
these connections. Figure 2.7 summarizes back-propagation learning.

Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward
pass. Bottom part of the graph: backward pass.

2.3 Activation function
The computation of for each neuron of a multilayer perceptron requires knowledge
of the derivative of the activation function (.) associated with that neuron. For this
derivative to exist, the function (.) must be continuous. Therefore, differentiability is the
only requirement that an activation function has to satisfy.


Hyperbolic tangent function as shown in Fig 2.8, an antisymmetric function, is a
commonly used form of sigmoidal non-linearity, which in its most general form is defined
by:
j
(v
j
(n))=atanh(bv
j
(n)), (a,b)>0 (2.25)
where a and b are constants. Its derivative with respect to v
j
(n) is given by:
j
(v
j
(n))=absec
2
h(bv
j
(n))
=ab(1-tanh
2
(bv
j
(n)))
=b/a[a-v
j
(n)][a+y
j
(n)] (2.26)
For a neuron j located in the output layer y
j
(n)=o
j
(n), the local gradient is therefore:
j
(n)=e
j
(n)
j
(v
j
(n))
=b/a[d
j
(n)-o
j
(n)][a-o
j
(n)][a+o
j
(n)] (2.27)
For a neuron j in the hidden layer,
j
(n)=
j
(v
j
(n)) w
kj
(n)
=[b/a][a-y
j
(n)][a+y
j
(n)] w
kj
(n), neuron j is hidden (2.28)

Figure 2.8: Hyperbolic tangent function


2.4 Rate of learning
The back-propagation algorithm provides an approximation to the trajectory in weight
space computed by the method of steepest descent. The smaller the learning-rate parameter
is made, the smaller the changes to the synaptic weights in the network will be from one
iteration to the next, and the smoother will be the trajectory in weight space. The
improvement, however is attained at the cost of a slower rate of learning. If, on the other
hand, the learning-rate parameter is made too large in order to speed up the rate of learning,
the resulting large changes in the synaptic weights assume such a form that the network may
become unstable. A simple method of increasing the rate of learning yet avoiding the danger
of instability is to modify the delta rule, by including a momentum term as follows:
w
ji
(n)= w
ji
(n-1)+
j
(n)y
i
(n) (2.29)
where is a positive number called the momentum constant. Rewriting the above equation
(2.29) as a time series with index t. The index t goes from the initial time 0 to current time n.
It maybe viewed as a first-order difference equation in the weight correction w
ji
(n):
w
ji
(n)=

(
j
(t)y
i
(t))
=-

(E(t)/w
ji
(t)) (2.30)
Based on this relation, the following conclusions are made:
1. The current adjustment w
ji
(n) represents the sum of an exponentially weighted time
series. For the time series to be convergent, the momentum constant must be
restricted to the range 0<||<1.
2. When the partial derivative has the same algebraic sign on consecutive iterations, the
inclusion of momentum in the algorithm tends to accelerate descent in steady
downhill directions.
3. When the partial derivative has the same algebraic sign on consecutive iterations, the
inclusion of momentum in the algorithm has a stabilizing effect in directions that
oscillate in sign.


2.5 Modes of Training
One complete representation of the entire training set during the learning process is called
an epoch. The learning process is maintained on an epoch by epoch basis until the synaptic
weights and bias levels of the network stabilize and the averaged mean squared error over the
entire training set converges to some minimum value. It is good practice to randomize the
order of presentation of training examples from one epoch to the next.

For a given training set, back-propagation learning may proceed in one of two basic
ways:
1. Sequential mode: The sequential mode of back-propagation learning is also referred
to as on-line, pattern or stochastic mode. In this mode of operation weight updating is
performed after the presentation of each training example.
2. Batch mode: In the batch mode of back-propagation learning, weight updating is
performed after the presentation of all training examples that constitute an epoch. For
a particular, the cost function is defined as the averaged squared error of equations
(2.2) and (2.3):
E
av
=1/2N (2.31)
where the error signal pertains to the output neuron j for the training example n.

For an on-line operational point of view, the sequential mode of training is
preferred over the batch mode because it requires less local storage for each synaptic
connection. Moreover, given that the patterns are presented to the network in a random
manner, the use of pattern by pattern updating of weights makes the search in weight space
stochastic in nature. This in turn makes it less likely for the back-propagation algorithm to be
trapped in a local minimum.

In the same way, the stochastic nature of the sequential mode makes it difficult to
establish theoretical conditions for convergence of the algorithm, In contrast, the use of batch
mode of training provides an accurate estimate of the gradient vector; convergence to a local
minimum is guaranteed under simple conditions.


2.6 Local Minima of Error Surface
A peculiarity of the error surface (Fig 2.9) that impacts the performance of the back-
propagation algorithm is the presence of local minima in addition to global minima. It runs
the risk of being trapped in a local minimum where even small change in synaptic weights
increases the cost function. But somewhere else in the weight space there exists another set
of synaptic weights for which the cost function is smaller than the local minima in which the
network is stuck. It is clearly undesirable to have the learning process terminate at local
minima, especially if it is located far above a global minimum.

Figure 2.9: Error surface

2.7 Supervised Learning Viewed as an Optimization Problem
In this case, the supervised training of a multilayer perceptron is viewed as a problem
in numerical optimization. The error surface of a multilayer perceptron is a highly nonlinear
function of the synaptic weight vector w. Let E
av
(w) denote the cost function, averaged over
the training sample. Using the Taylor series expand E
av
(w) about the current point on the
error surface w(n) as:
E
av
(w(n)+w(n)) =

E
av
(w(n))+g
T
(n)w(n)+0.5w
T
(n)H(n)w(n) (2.32)
where g(n) is the local gradient vector defined by:
g(n)=dE
av
(w)/dw, at w=w(n) (2.33)
and H(n) is the local hessian matrix defined by:
H(n)=d
2
E
av
(w)/dw
2
, at w=w(n) (2.34)
The use of an ensemble-averaged cost function E
av
(w) presumes a batch mode of
operation. The basic back-propagation algorithm adjusts the weights in the steepest descent
direction (negative of the gradient). This is the direction in which the performance function is
decreasing most rapidly. It turns out that, although the function decreases most rapidly along
the negative of the gradient, this does not necessarily produce the fastest convergence. In the
conjugate gradient type of algorithms a search is performed along conjugate directions,
which produces generally faster convergence than steepest descent directions.

In most of the training algorithms, a learning rate is used to determine the length of
the weight update (step size). In most of the conjugate gradient algorithms, the step size is
adjusted at the end of each iteration. A search is made along the conjugate gradient direction
to determine the step size, which minimizes the performance function along that line.

Fletcher-Reeves Update: All of the conjugate gradient algorithms start out by searching in
the steepest descent direction (negative of the gradient) on the first iteration:
p
o
=-g
o
(2.35)
A line search is then performed to determine the optimal distance to move along the current
search direction:
x
k+1
=x
k
+
k
p
k
(2.36)
Then the next search direction is determined so that it is conjugate to previous search
directions. The general procedure for determining the new search direction is to combine the
new steepest descent direction with the previous search direction:
P
k
=-g
k
+
k
p
k-1
(2.37)

The various versions of conjugate gradient are distinguished by the manner in which the
constant is computed. For the Fletcher-Reeves update the procedure is:
k
=g
T
k
g
k
/g
T
k-1
g
k-1
(2.38)
This is the ratio of the norm squared of the current gradient to the norm squared of the
previous gradient. Once the MLP has been trained, it can be tested on a database of images.
The output neuron which fires corresponds to the class of image.

2.8 Face Recognition Using Neural Networks
The Cambridge ORL database consists of 400 frontal images of 40 people, i.e. 10
images of each person, of which 5 are used for training and the other 5 are used for testing
purposes. Since there are 40 people that need to be identified, the number of classes that a
neural network must distinguish is also 40. The pixel intensity vectors of the pre-processed
training images are obtained and the MLP is thus trained. The mode of training used is batch
mode, and using the back-propagation algorithm, in conjunction with Fletcher-Reeves
update, the optimum weights for MLP are obtained. In order to improve results, the inputs
are fed at random. Training is conducted for 2500 epochs.

2.9 Conclusions
This chapter dealt with artificial neural networks. Specifically, the architecture of
multilayered feedforward neural networks is introduced. Supervised learning via the back
propagation algorithm is discussed along with the modes of training, the choice of learning
rate, and the potential pitfalls of getting trapped in local minima. Finally, the use of artificial
neural networks for face recognition is also introduced. The results are presented in a Chapter
6. A technique that is frequently used for analysis of images is that of principal components.
This is the topic of the next chapter.



CHAPTER III
Principal Components Analysis

Principal components analysis is a method of unsupervised learning. The main idea is
to discover significant patterns or features in the input data without an external aide. To do
so, the algorithm is provided with a set of rules of a local nature, which enables it to learn to
compute an input-output mapping with specific desirable properties; the term local implies
that the change applied to the synaptic weight of a neuron is confined to the immediate
neighborhood of that neuron.

Feature selection refers to a process whereby the given data is compressed to features
or patterns fewer in number than the given data; Thus, the data space is transformed into a
feature space and undergoes a dimensionality reduction. Evidently, this data compression is
lossy, and it can be shown that in the case of principal component analysis, the mean-square
of the resulting error equals the sum of the variances of the elements of that part of the data
vector that is eventually eliminated. Therefore one seeks a transformation that is optimum in
the mean squared sense; i.e., the eliminated component must have total variance that is less
than a predefined threshold. Principal components analysis computes the basis of a space
which represents the training vectors. These basis vectors are eigenvectors of a related
covariance matrix.

When applied to face recognition, one determines the principal components of the
distribution of faces treating the image as a point in a high dimensional space; i.e., the
eigenvectors of the covariance matrix of the set of face images are computed. These
eigenvectors referred to as eigenfaces in this context contain relevant discriminatory
information extracted from the images. Thus, characteristic features representing the
variation in the collection of faces are captured, and this information is used to encode and
compare individual faces. It is to be noted that the features (i.e., the eigenvectors) are ordered
with respect to the corresponding eigenvalues, and only those features that are significant (in

the sense of the value of the eigenvalues) are considered. Recognition is performed by
projecting a new image on to the subspace spanned by the eigenfaces and then classifying the
face by comparing its position in the face space with the position of known individuals.

Figure 3.1: (a) The 1
st
PC z
1
is a minimum distance fit to a line in X space. (b) The 2
nd
PC z
2
is a
minimum distance fit to a line in the plane perpendicular to the 1
st
PC.

3.1 Derivation of Principal Components (PC)
Given a sample of n observations on a vector of p variable:
x=(x
1
, x
2
,, x
p
) (3.1)
define the first principal component of the sample by the linear transformation:
z
1
= a
1
T
x= (3.2)
where the vector a
1
=(a
11
,a
21
,,a
p1
) is chosen such that var[z
1
] is maximum. Likewise, define
the k
th
PC of the sample by the linear transformation:
z
k
=a
k
T
x, k=1p (3.3)
where the vector a
k
=(a
1k
,a
2k
,,a
pk
) is chosen such that var[z
1
] is maximum:
cov[z
k
,z
l
]=0, for k1, and a
k
T
a
k
=1 (3.4)

To find a
1
first note that:
var[z
1
] = (z
1
2
) (z
1
)
2

= , where S
ij
=
xixj
=(x
i
x
j
) - (x
i
)(x
j
)
= a
1
T
Sa
1
(3.5)
where S is the is the covariance matrix for x.
To find a
1
maximize var[z
1
], subject to a
1
T
a
1
=1. This constrained optimization problem can
be transformed into an unconstrained one by using the Lagrange multiplier . Thus, the
problem is reduced to maximization of a
1
T
Sa
1
(a
1
T
a
1
1). Differentiating with respect to
both the vector a
1
and the Lagrange multiplier and setting the derivatives to zero one obtains
Sa
1
a
1
= 0
(S

I
p
)a
1
= 0

(3.6)
Therefore a
1
is an eigenvector of S corresponding to eigenvalue =
1
. Note that the following
has been maximized:
var[z
1
]= a
1
T
Sa
1
=a
1
T
1
a
1
=
1
(3.7)
So
1
is the largest eigenvalue of S. The first PC z
1
retains the greatest amount of variation in
the sample. An example is illustrated in Fig. 3.1(a). To find the next coefficient vector a
2

maximize var[z
2
] subject to:
cov[z
2
,z
1
]=0 and a
1
T
a
1
=1 (3.8)
First note that:
cov[z
2
,z
1
]= a
1
T
Sa
2
=
1
a
1
T
a
2
(3.9)
then let and be Lagrange multipliers, and maximize a
2
T
Sa
2
(a
2
T
a
2
1) - a
2
T
a
1
. It is
found that find that a
2
is also an eigenvector of S whose eigenvalue =
2
and is the second
largest, as illustrated in Fig. 3.1(b) in the two dimensional case. In general:
var[z
k
]= a
k
T
Sa
k
=
k
(3.10)
The k
th
largest eigenvalue of S is the variance of the k
th
PC.
The k
th
PC z
k
retains the k
th
greatest fraction of the variation in the sample.

Given a sample of n observations on a vector of p variables x=(x
1
, x
2
,, x
p
), define a vector
of p PCs z=(z
1
, z
2
,, z
p
), according to Z=A
T
X, where A is an orthogonal p x p matrix whose
k
th
column is the k
th
eigenvector a
k
of S. Then = A
T
SA is the covariance matrix of the PCs;
the matrix is diagonal with elements
ij
=
ij
ij
.

3.2 Face Recognition using Eigenfaces
Let a face image (x,y) be a two-dimensional N by N array of intensity values. By
stacking the columns of this array into a single vector, an image may also be considered as a
vector of dimension
2
N . Thus, a typical image of size 256 by 256 becomes a vector of
dimension 65,536, or equivalently, a point in 65,536-dimensional space. An ensemble of
images, then, maps to a collection of points in this vector space of a relatively large
dimension.
Images of faces, being similar in overall configuration, will not be randomly
distributed in this huge image space and thus can be described by a relatively low
dimensional subspace. The main idea of the principal component analysis is to find a set of
basis vectors that best accounts for the distribution of face images within the entire image
space. These vectors define the subspace of face images, which is generally referred to as the
face space. Each vector is of length
2
N , describes an N by N image, and is a linear
combination of the original face images. Because these vectors are the eigenvectors of the
covariance matrix corresponding to the original face images, and because they are face-like
in appearance, they are typically referred to as eigenfaces.
As described earlier, a 2-D facial image can be represented as 1-D vector by
concatenating each row (or column). Let the training set of face images be
M
, , ,
2 1
K .
The average face of the set is defined by

=
=
M
n
n
M
1
1
. Each face differs from the average by
the vector =
n n
. This set of very large vectors is then subject to principal component
analysis, which seeks a set of M orthonormal vectors
n
which best describes the distribution
of the data. The k
th
vector
k
is chosen such that:

=
=
M
n
n
T
k k
M
1
2
) (
1

(3.11)
is a maximum, subject to:
=
=
otherwise
k l
k
T
l
, 0
, 1

(3.12)
The vectors
k
and scalars
k
are the eigenvectors and eigenvalues, respectively, of the
covariance matrix:
=
= =
M
n
T T
n n
AA
M
C
1
1
(3.13)
where the matrix ] ... [
2 1 M
A = . The matrix C is however
2
N by
2
N , and determining
the
2
N eigenvectors and eigenvalues is an intractable task for typical image sizes. A
computationally feasible method is needed to find these eigenvectors.

If the number of data points in the image space is less than the dimension of the space
(
2
N M < ), there will be only M, rather than
2
N , meaningful eigenvectors in the sense that
the remaining eigenvectors are associated with zero eigenvalues. Accordingly, it is
computationally better to determine the eigenvalues and eigenvectors of much smaller
matrix. For example, in the situation outlined earlier, one computes the eigenvalues and the
corresponding eigenvectors of an M x M matrix as opposed to a 65,536 x 65,536 matrix.
Consider the eigenvectors
n
of A A
T
such that:

n n n
T
A A =
(3.14)
Premultiplying both sides by A,
n n n
T
A A AA =
(3.15)
from which it can be inferred that
n
A are the eigenvectors of
T
AA C = . Following this
analysis, construct the M by M matrix A A L
T
= , where
n
T
m mn
L = , and find the M
eigenvectors
n
of L. These vectors determine linear combinations of the M training set face
images to form the eigenfaces
n
:

M n A
n
M
k
k nk n
,......, 1 ,
1
= = =
=

(3.16)
With this analysis the calculations are greatly reduced, from the order of the number of pixels
in the images (
2
N ) to the order of the number of images in the training set (M). In practice,
the training set of face images will be relatively small (
2
N M < ), and the calculations
become quite manageable. The value of the associated eigenvalues allows the ordering of the
eigenvectors according to their usefulness in characterizing the variation among the images.
It is to be emphasized at this juncture that such a ordering is possible as all the eigenvalues of
A A
T
(and hence
T
AA ) are positive due to the sign-definiteness of these matrices.

The purpose of PCA is to reduce the large dimensionality of the data space (observed
variables) to the smaller intrinsic dimensionality of feature space (independent variables),

which are needed to describe the data economically. This is the case when there is a strong
correlation between observed variables. Therefore, the goal is to find a set of e
i
s which have
the largest possible projection onto each of the w
i
s. The eigenvectors corresponding to
nonzero eigenvalues of the covariance matrix produce an orthonormal basis for the subspace
within which most image data can be represented with a small amount of error. The
eigenvectors are sorted from high to low according to their corresponding eigenvalues. The
eigenvector associated with the largest eigenvalue is one that reflects the greatest variance in
the image. That is, the smallest eigenvalue is associated with the eigenvector that finds the
least variance. It has been the experience that the usefulness of the eigenvectors decreases in
exponential fashion; i.e., roughly 90% of the total variance is contained in the first 5% to
10% of the dimensions.

3.3 Euclidean Distance Method for Classification
The simplest method for determining which face class provides the best description of
an input facial image is to find the face class k that minimizes the Euclidean distance
k
=|-
k
|, where
k
is a vector describing the k
th
face class. If
k
is less than some predefined
threshold
, a face is classified as belonging to the class k.

Note: Eigenfaces, once obtained, can also be used to train a multilayer perceptron for
classification purposes. This type of training incorporates both supervised and unsupervised
learning. Once trained, it can be tested on a database of images.

3.4 Conclusions
In this chapter the eigenface approach to face recognition is dealt with. This is the
first statistics-based face recognition technique to have been proposed by researchers. A
principal advantage of this is that it is amenable to real-time face recognition. The eigenface
approach followed in this project uses principal components and nearest-neighbour
classification technique. In this next chapter, face recognition using support vector machines
is presented.


CHAPTER IV
Support Vector Machines

Support vector machines are universal feedforward networks, which can be used for pattern
classification and non-linear regression. Basically the support vector machine is a linear
machine with some suitable properties. To explain how it works, it is perhaps easiest to start
with the case of separable patterns that could arise in the context of pattern classification. In
this context, the main idea of a support vector machine is to construct a decision surface,
called a hyperplane, in such a way that the margin of separation between positive and
negative examples is maximized.

The machine achieves this desirable property by following a principled approach
rooted in statistical learning theory. More precisely, support vector machine is an
approximate implementation of the method of structural risk minimization. The induction
principle is based on the fact that the error-rate of a learning machine on test data is bounded
by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis
(VC) dimension; in the case of separable patterns, a support vector machine produces a value
of zero for the first term and minimizes the second term. Accordingly, the support vector
machine can provide a good generalization on pattern classification problems despite the fact
that it does not incorporate problem-domain knowledge. This attribute is unique to support
vector machines.

A notion that is central to the construction of the support vector learning algorithm is
the inner-product kernel between a support vector x
i
and the vector x drawn from the input
space. The support vectors consist of a small subset of the training data extracted by the
algorithm. Depending on how this inner-product kernel is generated, construction of different
learning machines is characterized by their respective nonlinear decision surfaces. In
particular, the support vector learning algorithm is used to construct the following three types
of learning machines:
Polynomial learning machines

Radial-basis function networks
Two-layer perceptrons with a single hidden layer.
That is, for each of these feedforward networks the support vector learning algorithm is used
to implement the learning process using a given set of training data, automatically
determining the required number of hidden units. Stated in another way, whereas the back-
propagation algorithm is devised specifically to train a multilayer perceptron, the support
vector learning algorithm is of a more generic nature because it has wider applicability.

Support Vector machines are rooted in statistical learning theory which gives a family
of bounds which govern the learning capacity of the machine:
R(alpha) = 0.5|y-f(x,alpha)|.dP(x,y) (4.1)
Properties of the bound:
1. It is independent of P(x,y). It assumes only that both the training data and the test data
are drawn independently according to some P(x,y)
2. It is usually not possible to compute the left hand side.
3. If h is known, the right hand side can be easily computed.

Thus given several different learning machines, and choosing a fixed, sufficiently small ,
that machine is chosen which gives the lowest upper bound on the actual risk. This gives a
principled method for choosing a learning machine for a given task, and is the essential idea
of structural risk minimization.
The general structure of SVM is as shown in Fig. 4.1.


Figure 4.1: General structure of SVM

4.1 Basics
Let X R
n
denote the possible input to the Support Vector Machine. These inputs are pixel
intensity vectors obtained after image pre-processing. An SVM is trained with images
belonging to two classes at a time. A single machine is not efficient at encoding multiple
classes. Assume that a point x X is associated with one of two possible classes, denoted -1
and +1. This means that it is sufficient to have the output Y
estimate
={-1,1} from the classifier.
Consider two stochastic variables, representing the point denoted X, Y representing class
label. These variables then give rise to the following conditional distributions, p(X=x|Y=1),
p(X=x |Y=-1). There are two phases concerning support vector machines, first training and
then classifying. By letting f: X -> Y
estimate
represent the discriminating function, the task
during training is to minimize the probability of returning wrong classification for all

elements included in the training set. This means that for the training set Y
estimate
should be
identical to Y in as many cases as possible.

During training, SVMs solve this problem by simply finding a function f, which for
every point x
i
=[x
i1
, x
i2
, , x
in
]
T
, with corresponding class label y
k
, in the training set
S=((x1,y1),,(xl,yl)) has the following property, f(x
k
)0, if y
k
=1 and f(x
k
)<0 if y
k
=-1. This is
only possible if the training set S is such that there exists a seperating hyperplane. This
assumption is relaxed later on. By presenting the SVM with a number of training examples
and optimizing the function parameters so that the sign of the output corresponds as well as
possible to the true classes of the training examples, the property mentioned can be forced to
be valid. When the training has been completed the sign-function is applied to the function f
and the resulting classifier has the desired output, that is {-1,1}.

4.2 Linear Support Vector Machines: The Separable Case
Start with the simplest case i.e., linear machines trained on separable data. It will be
seen in that the analysis for the general case nonlinear machines trained on non-separable
data results in a very similar quadratic programming problem. Label the training data as:
{x
i
,y
i
}, i=1,,N, yi {-1,1}, xi R
d
(4.2)

Suppose that there is a hyperplane, which separates the positive from the negative examples.
The points x which lie on the hyperplane satisfy w.x + b = 0, where w is normal to the
hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is
the Euclidean norm of w. Let d
+
(d
-
) be the shortest distance from the separating hyperplane
to the closest positive (negative) example. Define the margin of a separating hyperplane to be
d
+
+ d
-
. For the linearly separable case, the support vector algorithm simply looks for the
separating hyperplane with largest margin.


To formulate this, suppose that all the training data satisfy the following constraints:
w.x
i
+ b +1 for y
i
= +1 (4.3.a)
w.x
i
+ b -1 for y
i
= -1 (4.3.b)
These can be combined into one set of inequalities:
y
i
(w.x
i
+ b) 1 0, i=1,,N (4.3)

Now consider the points for which the equality in (4.3.a) holds. These points are the support
vectors which lie on the hyperplane H1: w.x
i
+ b = +1, with normal w and perpendicular
distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in (4.3.b)
holds are also the support vectors which lie on the hyperplane H2: w.x
i
+ b = -1 with
normal again w and perpendicular distance from the origin |-1-b|/||w||. Hence d
+
= d
- =
1/||w||
and the margin is simply 2/||w||. Note that H1 and H2 are parallel as they have the same
normal and that no training points fall between them. Thus, the pair of hyperplanes which
gives the maximum margin by minimizing the objective function 0.5||w||
2
subject to
constraints in equation can be found. Thus the solution for a typical two dimensional case is
expected to have the form shown in Figure 4.2. Those training points for which the equality
in (4.3) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2) and whose
removal would change the solution found, are called support vectors; they are indicated in
Figure 4.2 by the extra circles.

Fig. 4.2: Linear separating hyperplanes for the separable case. The support vectors are circled.


4.3 Quadratic Optimization for Finding the Optimal Hyperplane
A Lagrangian formulation is used to obtain the solution. There are two reasons for
doing this. The first is that the constraints variables are replaced by constraints on the
Lagrange multipliers themselves, which are much easier to handle. The second is that in this
reformulation of the problem, the training data appears (in the actual training and test
algorithms) in the form of dot products between vectors. This is a crucial property which will
allows one to generalize the procedure to the nonlinear case.

Thus, introducing positive Lagrange multipliers
i
, i=1.N; one for each of the
inequality constraints in equation. Recall that the rule is that for constraints of the form c
i
0,
the constraint equations are multiplied by positive Lagrange multipliers and subtracted from
the objective function, to form the Lagrangian. For equality constraints, the Lagrange
multipliers are unconstrained. This gives Lagrangian:
LP = 0.5||w||
2
y
i
(w.x
i
+ b) + (4.4)

The solution to the constrained optimization problem is determined by the saddle
point of LP, which must be minimized with respect to w, b, and simultaneously require that
the derivatives of LP with respect to all the
i
vanish, all subject to the constraints
i
0.
Suppose that the set of constraints is referred to as C
1
. Now this is a convex quadratic
programming problem, since the objective function is itself convex, and those points which
satisfy the constraints also form a convex set. This means that following dual problem can
be equivalently solved: maximize LP, subject to the constraints that the gradient of LP with
respect to w and b vanish, and subject also to the constraints
i
0. Suppose that this set of
constraints is referred to as C
2
. This particular dual formulation of the problem is called the
Wolfe dual. It has the property that the maximum of LP, subject to constraints C
2
, occurs at
the same values of w, b and
,
as the minimum of LP, subject to constraints C
1
.


Requiring that the gradient of LP with respect to w and b vanish, gives the conditions:
1. w = y
i
x
i
(4.5)

2. y
i
= 0 (4.6)

Since these are equality constraints in the dual formulation substituting them into equation to
give:
LD =

0.5
j
y
i
y
j
x
i
.x
j
(4.7)

Maximize (4.7) subject to constraints:
1. y
i
= 0 (4.8)
2.
i
0 (4.9)

Note that there are different labels for Lagrangian (LP for primal, LD for dual), to emphasize
that the two formulations are different: LP and LD arise from the same objective function but
with different constraints, and the solution is found by minimizing LP or by maximizing LD.
Note also that if the problem is formulated with b=0, which amounts to requiring that all
hyperplanes contain the origin, the constraint (4.8) does not appear.

Support vector training (for the separable, linear case) therefore amounts to
maximizing LD with respect to
i
subject to constraints (4.8) and positivity of
i
with
solution given by (4.5). Notice that there is a Lagrange multiplier
i
for every training point.
In the solution, those points for which
i
>0 are called support vectors, and lie on one of the
hyperplanes H1, H2. All other training points have
i
=0 and lie on that side of H1 or H2 such
that the strict inequality in Equation (4.3) holds. For these machines, the support vectors are
the critical elements of the training set. They lie closest to the decision boundary, if all other
training points are removed or moved around, but so as not to cross H1 or H2, and training
repeated, the same separating hyperplane is obtained.


4.4 The Karush-Kuhn-Tucker Conditions
The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and
practice of constrained optimization. For the primal problem described in the previous
section, the KKT conditions may be stated as:

(4.10)

The solution vector w is defined in terms of an expansion that involves the N training
examples. Note, however, that although this solution is unique by virtue of convexity of the
Lagrangian, the same cannot be said about the Lagrange coefficients,
i
. It is important to
note that at the saddle point, for each
i
, the product of that multiplier with the corresponding
constraint vanishes, as shown by (4.10). Therefore, only those multipliers exactly meeting the
above equation can assume nonzero values.

Having determined the optimal Lagrange multipliers,
o,i
, computing the optimum
weight vector as:
w
o
= y
i
x
i
, where Ns is the support vectors (4.11)
The optimum bias can be calculated as:

b
o
= 1 w
o
.x
(s)
for d
(s)
=1 (4.12)



4.5 Linear Support Vector Machines: The Non-separable Case
The margin of separation between classes is said to be soft if a data point (x
i
,y
i
)
violates the following condition:
y
i
(w.x
i
+ b) + 1, i=1,,N
This violation can arise in one of two ways, as shown in Figure 4.3:
The data point falls inside the region but on the right hand side of the decision
surface.
The data point falls on the wrong side of the decision surface, resulting in
misclassification.
This can be rectified by introducing positive slack variables
i
, i=1,,N, in the constraints
which then become:
w.x
i
+ b +1 -
i
for y
i
=+1
w.x
i
+ b -1 +
i
for y
i
=-1
These can be combined into one set of inequalities:
y
i
(w.x
i
+ b) 1 -
i
, i=1,,N; where the slack variables 0 for all i. (4.13)



Fig. 4.3: Linear separating hyperplanes for the non-separable case.

Thus, for an error to occur, the corresponding
i
must exceed unity, so
i
is an upper bound
on the number of training errors. Hence a natural way to assign an extra cost for errors is to
change the objective function to be minimized to 0.5||w||
2
to 0.5||w||
2
+ C(
i
)
k
, where C is a
parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to
errors. As before minimizing the first term is related to minimizing the VC dimension of the
support vector machine. As for the second term, it is an upper bound on the number of test
errors. Formulation of the cost function is therefore in perfect accord with the principle of
structural risk minimization.

The parameter C controls the trade-off between the complexity of the machine and
the number of non-separable points; it may therefore be viewed as a form of a
regularization parameter. The parameter C has to be selected by the user. This can be done
in one of two ways:
The parameter C is determined experimentally via the standard use of a
training/(validation) test set, which is a crude form of resampling.
It is determined analytically by estimating the VC dimension and then using the
bounds thus determined.


Maximize:
LD =

0.5
j
y
i
y
j
x
i
.x
j
(4.14)
Subject to the constraints:
1. 0
i
C (4.15)
2. y
i
= 0 (4.16)

Having determined the optimal Lagrange multipliers,
o,i
, computing the optimum weight
vector as:
w
o
= y
i
x
i
(4.17)
The optimum bias can be calculated as:

b
o
= 1 w
o
.x
(s)
for d
(s)
=1 (4.18)

The Karush-Kuhn-Tucker conditions are needed for the primal problem. The primal
Lagrangian is:
LP = 0.5||w||
2
+ C

- {yi(w.x
i
+ b)1+
i
} -
i

(4.19)
where
i
are the Lagrange multipliers introduced to enforce positivity of
i
. The KKT
conditions for the primal problem are therefore:


(4.20) and (4.21)

The optimum bias b
0
is determined by taking any data point in the training set for which 0
0
,
i
C and therefore
i
=0 and using that data point in the KKT conditions. However, from a
numerical perspective it is better to take the mean value of b
0
resulting from all such data
points in the sample.

4.6 Nonlinear Support Vector Machines
Basically the idea of a nonlinear support vector machine hinges on two mathematical
operations summarized here.
Nonlinear mapping of an input vector into a high-dimensional feature space that is
hidden from both input and output.
Construction of an optimal hyperplane for separating the features discovered in
previous step.
An important question at this juncture is whether or not the methods discussed in earlier
sections can be generalized to the case where the decision function is not a linear function of
the data, or, in other words, when the input space is made up of nonlinearly separable
patterns. Covers theorem states that such a multi-dimensional space maybe transformed into
a new feature space where the patterns are linearly separable with high probability, provided

two conditions are satisfied. First, the transformation is nonlinear. Second, the dimensionality
of the feature space is high enough. The next operation exploits the idea of building an
optimal separating hyperplane in accordance with the theory described, but with a
fundamental difference: the separating hyperplane is now defined as a linear function of
vectors drawn from the feature space rather than the original input space.

4.6.1 Inner Product Kernel
Observe that the data appears in the training problem only in the form of dot products,
x
i
.x
j
. Now suppose the data is first mapped to some other (possibly infinite dimensional)
Euclidean space H, using a mapping : R
d
to H. Then of course the training algorithm would
only depend on the data through dot products in H, i.e. on functions of the form (x
i
). (x
j
).
Now if there were a kernel function K such that K(x
i
,x
j
)= (x
i
). (x
j
) only K need to be
used in the training algorithm, and would never need to explicitly even know what is. One
example is: K(x
i
,x
j
)=exp(-0.5||xi-xj||
2
/2*sigma
2
). In this particular example, H is infinite
dimensional, so it would not be very easy to work with explicitly. However, if one replaces
x
i
.x
j
by K(x
i
,x
j
) everywhere in the training algorithm, the algorithm produces a support vector
machine in an infinite dimensional space, and furthermore do so in roughly the same amount
of time it would take to train on the unmapped data.

Suppose that the data belongs to a space denoted L. (The notation L is a mnemonic
for low-dimensional, as is the notation for the range space H for high-dimensional.)
Evidently, a map is not necessarily onto; i.e., there need exist an element in L that is
mapped onto a specific element in H.

Let
j
(x), j=1,,M; where M is the number of hidden units, denote a set of nonlinear
transformations, define a hyperplane acting as the decision surface as follows:
j
j
(x)
+ b=0, where w defines the vector of linear weights connecting the feature space to the
output space, and b is the bias. The quantity
j
(x) represents the input supplied to the weight
w
j
via the feature space. In effect, the vector (x), represents the image induced in the

feature space due to the input vector x. Thus, in terms of this image define the decision
surface in the compact form, as shown in Figure 4.4:
w
T
(x)=0 (4.22)

Fig. 4.4: Image in H of the square [-1, 1] x [-1 1] under the mapping .

From the discussion of Lagrange multipliers it can be written as:
w= y
i
(x
i
) (4.23)
where the feature vector (x
i
) corresponds to the input pattern x
i
in the ith example.
Substituting into (4.23), the decision surface is obtained as:
y
i

T
(x
i
)(x
i
)=0 (4.24)

The term
T
(x
i
)(x
i
) represents the inner product of two vectors induced in the
feature space by the input vector x and the input pattern x
i
pertaining to the ith example.
Introduce the inner product kernel denoted by K(x
i
,x
j
) and defined by K(x
i
,x
j
)=
T
(x
i
)(x
i
).

This kernel is a symmetric function of its arguments. Therefore, the decision surface can be
defined as:
y
i
K(x
i
,x
j
)=0 (4.25)

4.6.2 Mercers Theorem
The above discussion leads to the next important question: For which kernels does
there exist a pair {H, } with the properties described above, and for which does there not?
For a kernel to be valid, it must satisfy Mercers condition. This theorem may be formally
stated as: Let K(x,x) be a continuous symmetric kernel that is defined in the closed interval a
x b and likewise for x. The kernel K(x,x) can be expanded in the series:
K(x,x) = (4.26)
with positive coefficients >0 for all i. For this expansion to be valid and for it to converge
absolutely and uniformly, it is necessary and sufficient that the condition:
.dx.dx 0 (4.27)
holds for all for which:
< (4.28)
The Kernel is positive definite as the eigenvalues of the eigenfunction
i
(x). Mercers
theorem only tells whether or not a candidate kernel is actually an inner-product kernel in
some space and therefore admissible for use in a support vector machine.


4.6.3 Inner-Product Kernels

Type of support vector
machine
Inner product kernel
K(x,x
i
), i=1,,N
Comments
Polynomial Learning
Machine
(x
T
x
i
+ 1)
p
Power p is specified a priori
by the user
Radial-basis function
network
exp(-0.5||xi-xj||
2
/2*sigma
2
)

The widthsigma
2
,
common to all the kernels, is
specified a priori by the user
Two-layer perceptron tanh(
0
x
T
x
i
+
1
) Mercers theorem is
satisfied only for some
values of
0
and
1

Polynomial Kernel

Radial-basis Kernel
Figure 4.5: Decision surface of Kernels
4.7 Multiple Class Support Vector Machine
Suppose that a given set with three different classes are as shown in Fig. 4.6. It is
possible to train a different hyperplane for every class, using the current class as class 1 and
the other two classes as class 2. The resulting multiple class classifiers would now consists of
the three hyperplanes. The classification would simply consist of evaluating which

hyperplane gives the best classification for the given element. This approach gives rise to the
following problem. The output of the SVM in this case is based upon which hyperplane
derives the greatest value, due to the discriminating function, when evaluated for the current
sample. The question is then what happens when two different classes are very similar and
the other class is well distinguished from these two. Although discriminating between class 1
and class 2 is a much harder task than discriminating between class 1 and class 3, or class 2
and class 3, the training algorithm will be prevented to derive an optimal solution for just
classifying between class 1 and class 2 because the whole training set has to be taken under
consideration during training. By just considering the first two classes a better classifier for
discriminating between these could be derived.

Figure 4.6: Decision surface using one SVM
It can be seen from Fig. 4.7 that the hyperplane w
1,2
clearly discriminates better
between class 1 and class 2 than the hyperplane w
2
, although this hyperplane has a better
overall performance. By using some sort of a hierarchical structure shown in Fig. 4.8, the
strength of the binary classifier could be exploited.



Figure 4.7: Problem encountered when one SVM is used for multi-classification

Figure 4.8: Multi-classification using Binary classifiers


4.8 Binary Search Tree
It is proposed to construct a bottom-up binary tree for classification. Suppose there
are eight classes in the data set, the decision tree is as shown, where the numbers 1-8 encode
the classes. Note that the numbers encoding the classes are arbitrary without any means of
ordering. By comparison between each pair, one class number is chosen representing the
winner of the current two classes. In other words, an optimum hyperplane corresponding to
two classes is drawn, and the class, of the two, which best represents the data set, is the
winner. The selected classes (from the lowest level of the binary tree) will come to the
upper level for another round of tests. Finally, the unique class will appear on the top of the
tree. This method is depicted in Fig. 4.9.

Figure 4.9: Binary tree structure for 8 classes

Denote the number of classes as c, the SVMs learns (c(c-1))/2 discrimination
functions in the training stage, and carry out comparisons of c-1 times under the fixed binary
tree structure. If c does not equal to the power of 2, c can be decomposed as: c =
2
n1
+2
n2
++2
nI
, where n1 n2 nI. Because any natural number (even or odd) can be
decomposed into finite positive integers which are the power of 2. If c is an odd number, nI =
0; if c is even, nI > 0. Note that the decomposition is not unique, but the number of
comparisons in the test stage is always c-1.


4.9 SVM for Face Recognition
The SVMs used for non-linear mapping are polynomial learning machines with
degree 2. Each SVM is used to differentiate between a pair of classes. The number of SVMs
used in training corresponds to the number of comparisons made, which in this case is 39.
The number of discrimination functions learnt by SVMs is 780. Since there are 40 classes,
the binary tree structure is divided into two trees, one with 32 leaves and the other with 8.
The number of leaves must be equal to a power of 2. In testing stage, the tests are performed
first in the tree with 32 leaves and then another tree with 8 leaves. Finally, these two outputs
are compared to determine the true class in another tree with only two leaves. The total
number of comparisons for one query is 39. The inputs to an SVM are pixel intensity arrays
of pre-processed images, each SVM being trained for distinguishing between two classes,
which yields the decision surface.

4.10 Conclusions
Support Vector Machines are dealt with in this chapter. Both linearly separable and
nonlinearly separable data are considered. The Karush-Kuhn-Tucker conditions are also
discussed. Face recognition is generally a multi-classification problem. Since SVMs is
typically a binary classification technique, the solution to the multi-classification using
SVMs is also described. This uses a binary tree approach. Essential pre-processing of images
is discussed in the next chapter.


CHAPTER V
Image Preprocessing

5.1 Histogram Equalization
Let the variable r represent the gray levels of the image to be enhanced. Assume that r has
been normalized to the interval [0,1], with r=0 representing black and r=1 representing white.
Consider transformation of the form:
s=T(r), 0r1 (5.1)

that produces a level s for every pixel value r in the original image and satisfies the following
conditions:
T(r) is single valued and monotonically increasing in the interval 0r1.
0T(r)1 for 0r1.
The first requirement is needed to guarantee that the inverse transformation will exist, and
the monotonicity condition preserves the increasing order from black to white in the output
image. A transformation that is not monotonically increasing could result in at least a section
of the intensity range being inverted, thus producing some inverted gray levels in the output
image. The other condition guarantees that the output gray levels will be in the same range as
the input levels. The inverse transformation from s back to r is denoted:
r=T
-1
(s), 0s1 (5.2)

The gray levels in an image may be viewed as random variables in the interval [0,1]. Let
p
r
(r), p
s
(s) denote the probability density functions of random variables r and s. A basic result
is that, if p
r
(r), T(r) are known and T
-1
satisfies the first condition, then p
s
(s) of the
transformed variable s can be obtained by:
p
s
(s)=p
r
(r)|dr/ds| (5.3)


Thus, the probability density function of the transformed variable s is determined by the
gray-level PDF of the input image and by the chosen transformation function.

A transformation function in image processing has the form:
s=T(r)= dw (5.4)

where w is a dummy variable of integration. The right side of equation is described as the
cumulative distribution function (CDF) of random variable r. Since probability density
functions are always positive, and recalling that the integral of a function is the area under
the function, it follows that this transformation is single-valued and monotonically
increasing. Similarly the integral of a probability density function for variables in the range
[0,1] is also in the range [0,1].

Given transformation function T(r), find p
s
(s) by applying equation (5.3). It is known
from basic calculus, Leibnizs rule, that the derivative of a definite integral with respect to its
upper limit is simply the integrand evaluated at that limit. In other words:
ds/dr = dT(r)/dr
= d[ dw]/dr (5.5)
=p
r
(r)

Substituting this result for dr/ds into equation, and keeping in mind that all probability values
are positive, yields:
p
s
(s)=p
r
(r)|dr/ds|
=p
r
(r)|1/p
r
(r)|
=1, 0s1 (5.6)


Because p
s
(s) is a probability density function, it follows that it must be zero outside the
interval [0,1] in this case because its integral over all values of s must equal 1. This value of
p
s
(s) is a uniform probability density function.
Discrete values are dealt with probabilities and summations instead of probability
density functions and integrals. The probability of occurrence of gray level r
k
in an image is
approximated by:
p
r
(r
k
)=n
k
/n, k=0,1,2,,L-1 (5.7)

where n is the total number of pixels in the image, n
k
is the number of pixels that have gray
level r
k
, and L is the total number of possible gray levels in the image. The discrete version of
the transformation function is given as:
s
k
=T(r
k
)= (r
j
)
= ) / n, where k=0,1,2,,L-1 (5.8)

Figure 5.1: Image after Histogram Equalization

Thus, a processed output image is obtained by mapping each pixel with level r
k
in the input
image into a corresponding pixel with level s
k
in the output image. A plot of p
r
(r
k
) versus r
k
is
called a histogram. The transformation given is called histogram equalization. An example is
shown in Fig. 5.1.


5.2 Median Filtering
This is a type of noise-reduction spatial filter which simply smoothes local variations in an
image. Noise is reduced as a result of blurring. Let S
xy
represent the set of coordinates in a
rectangular subimage window of size mxn, centered at point (x,y). The arithmetic mean
filtering process computes the average value of the corrupted image g(x,y) in the area defined
by S
xy
. The value of the restored image f at any point (x,y) is simply the arithmetic mean
computed using the pixels in the region defined by S
xy
. In other words:
f(x,y)=(1/mn)

(5.9)

This operation can be implemented using a convolution mask in which all coefficients
have value (1/mn). Figure 5.2 illustrates Median filtering in a 2-D space.

Figure 5.2: Median filtering


5.3 Bi-Cubic Interpolation
The process of histogram equalization is followed by image resizing. The key
advantage of resizing through bi-cubic interpolation is that it produces smoother surfaces
than any other interpolation technique. For example, reducing the actual resolution of the
image from 92 x 112 to 20 x 20. Bi-cubic interpolation takes into account 16 pixels in the
rectangular grid, takes weighted average of pixels, and replaces them with a single pixel;
accordingly, that pixel has the flavor of all the replaced pixels. This reduction is done to
reduce redundant information and to make neural network input data low dimensional. It may
be recalled that the input layer of a neural network require the same number of neurons as
number of pixels in an input image. In this way, less number of input neurons makes it less
complex and easy for neural network training and testing phases. Here the actual resolution
of an image is reduced also to ease complexity in convergence and significant effects on time
required for training the neural network.

5.4 Conclusions
Pre-processing of images is discussed in this chapter. This project uses histogram
equalization, median filtering and bi-cubic interpolation to process the images before face
recognition. With experience this is found to improve the recognition accuracy. After pre-
processing of images, the different techniques discussed in Chapters 2, 3 and 4 are used for
face recognition, and the results are presented in Chapter 6.



CHAPTER VI
Results
The algorithms were tested on the Cambridge ORL database, which consists of 40
subjects with 10 images of each with varying illumination, lighting conditions and pose. Of
these, 5 images of a subject are used for training and remaining 5 for testing purposes.
Variations are visible before the pre-processing stages, and these changes are noted down.
6.1 Multilayer Feedforward Networks
As mentioned in Chapter 5, the images are pre-processed using histogram
equalization and median filtering. In this section the dependence on the accuracy of face
recognition on pre-processing is presented when a multilayer feedforward network. Three
cases are considered: Images are preprocessed using both histogram equalization and median
filtering, images preprocessed using only median filtering, and images without
preprocessing.
Case A:
Histrogram Equalization ON
Median Filtering ON
The face recognition accuracy for varying number of hidden neurons is summarized in
Table 6.1, and the same data pictured in Fig. 6.1. It can be observed that for larger image
sizes, there is little change in the recognition accuracy for hidden neurons more than 40, and
for smaller image this plateau is reached when the number of neurons is 20. These results
show that neural networks need more neurons to extract the hidden features from larger sized
images. Further, it can be easily seen that with artificial neural networks face recognition
accuracy of better than 90% can be achieved.



Resolution/hidden
neurons
10 20 30 40 50 60
50x40 14 32 58 87.5 88 91
20x20 10 79 87.5 89 92.5 93.5
10x10 6 91.5 89.5 88 90 88.5
Table 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization

Fig. 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization


Case B:
Histrogram Equalization OFF
Median Filtering ON
Table 6.2 and shown in Fig. 6.2. From the data conclusions similar to that of Case A can be
drawn.
Resolution/hidden
neurons
10 20 30 40 50 60
50x40 44 67 71.5 91.5 93.5 92
20x20 10 79 87.5 89 92.5 93.5
10x10 6 91.5 89.5 88 90 88.5
Table 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.

Fig. 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.

Case C:
Median Filtering OFF
Table 6.3 and Fig. 6.3. Again, the conclusions drawn are similar to that of Case A and Case
B.
Resolution/hidden
neurons
10 20 30 40 50 60
50x40 29 63.5 70.5 90 90 93
20x20 36.5 88.5 90.5 91.5 95.5 95
10x10 32 45.5 89 91.5 91.5 90
Table 6.3: Face recognition accuracy with multilayer feedforward networks with no preprocessing.

Fig. 6.3: Face recognition accuracy with multilayer feedforward networks with no preprocessing.
Discussion:

From the results for Cases A, B and C, it can be observed that there is no significant
increase in the recognition accuracy when the number of hidden neurons is increased beyond
50. Large differences in recognition accuracies are seen for resolutions 50x40 and 20x20,
with the fact that accuracies decrease for larger sized images. This may be attributed to the
following fact: neural networks learned the unnecessary features when provided with higher
resolution, eventually leading to poor generalization. It can also be observed that the best
recognition accuracy is achieved when there is no pre-processing. Thus, it can be concluded
that pre-processing by median filtering and histogram equalization removes features
important for neural networks.
6.2 Principal Component Analysis
In this section, the dependence on the accuracy of face recognition, for varying
number of eigenfaces, on pre-processing is presented when the images are analyzed using
principal components. The number of eigenfaces is varied from 1 to 46 in steps of 5. As
before, three cases are considered: Images are preprocessed using both histogram
equalization and median filtering, images preprocessed using only median filtering, and
images without preprocessing.
Case A:
Histogram Equalization ON
Median Filtering ON
The face recognition accuracy for varying number of eigenfaces is summarized in
Table 6.4, and the same data pictured in Fig. 6.4. It can be observed that there is very
little change in the recognition accuracies if the number of eigenfaces is beyond 11.
Further, these accuracies are practically independent of the image size. Moreover, the
accuracies achieved are poorer when compared to those obtained when artificial
neural networks are used.


Resolution/No.
of Eigen Faces
1 6 11 16 21 26 31 36 41 46
50x40 9 74 80 83.5 84.5 85 85 85 85 86
30x20 11.5 73.5 79.5 84 84.5 85 85 85 85 85.5
20x20 11 75 81 83.5 84 85.5 86.5 87 87.5 87.5
10x10 12.5 77.5 85.5 87.5 87.5 89.5 89.5 89.5 89.5 89.5
5x5 11.5 79.5 84 85 86 87.5 87.5 87.5 88 88
Table 6.4: Face recognition accuracy with principal component analysis with histogram equalization and
median filtering.

Fig. 6.4: Face recognition accuracy with principal component analysis with histogram equalization and
median filtering.


Case B:
Histogram Equalization OFF
Median Filtering ON
The face recognition accuracy for varying number of eigenfaces is summarized in
Table 6.5 and Fig. 6.5. From the data conclusions similar to that of Case A can be drawn.

Resolution/No.
of Eigen Faces
1 6 11 16 21 26 31 36 41 46
50x40 8.5 84.5 86.5 87.5 89 89.5 89.5 90 90 90
30x20 7 85 88 89 89.5 89.5 90 90.5 90.5 90.5
20x20 13.5 86 88.5 89 89 90.5 90.5 90.5 90.5 90.5
10x10 15.5 87.5 90 91 91.5 91.5 92 92 92 92
5x5 14 86 91.5 91.5 91 91 91 91 91 91.5
Table 6.5: Face recognition accuracy with PCA with only median filtering.

Fig. 6.5: Face recognition accuracy with PCA with only median filtering.

Case C:
The face recognition accuracy for varying number of eigenfaces is summarized in Table
6.6 and Fig. 6.6. From the data conclusions similar to that of Case A and Case B can be
drawn.
Resolution/No.
of Eigen Faces
1 6 11 16 21 26 31 36 41 46
50x40 12.5 84.5 87.5 88.5 88.5 89.5 90 90 90 90
30x20 12.5 55.5 87 88.5 89.5 89.5 90 90.5 91 91
20x20 15 86 88.5 89 89.5 90.5 90.5 90.5 90.5 90.5
10x10 13 87.5 90.5 91 93 93 92.5 92.5 92.5 92.5
5x5 14 88 89 90 90 90 90 90.5 90.5 90.5
Table 6.6: Face recognition accuracy with PCA with no preprocessing.

Fig 6.6: Face recognition accuracy with PCA with no preprocessing.

Discussion:
Evidently, when the number of eigenfaces is increased beyond 11, there is no
significant increase in the recognition accuracy for all the three cases. It can also be observed
that the best recognition accuracy is achieved when the images are not preprocessed.
Therefore, such preprocessing removes important information necessary for PCA.
When the performance is compared to ANN, it can be observed that the latter
outperforms PCA. However, a significant point to be noted is that image sizes as low as 5 x 5
can be used with PCA to gather information about the faces. This is difficult to achieve using
ANN.
6.3 Support Vector Machines
In this section, the dependence on the accuracy of face recognition on pre-processing
is presented when the images are analyzed using support vector machines. As before, three
cases are considered: Images are preprocessed using both histogram equalization and median
filtering, images preprocessed using only median filtering, and images without
preprocessing.
Case A:
Histogram equalization ON
Median Filtering ON
The face recognition accuracy for varying resolution is summarized in Table 6.7, and the
same data pictured in Fig. 6.7. It can be observed that there is very little change in the
recognition accuracies. Further, these accuracies are practically independent of the image
size. Moreover, the accuracies achieved are better when compared to those obtained when
artificial neural networks are used.


Resolution Accuracy (%)
50x40 96
30x20 96
20x20 96
10x10 97
5x5 94
Table 6.7: Face recognition accuracy with SVM with histogram equalization and median filtering.
Case B:
Median Filtering ON
6.8. From the data, conclusion similar to that of Case A can be drawn.
50x40 97
30x20 97
20x20 97.5
10x10 97
5x5 97
Table 6.8: Face recognition accuracy with SVM with median filtering.
Case C:
6.9 . From the data conclusions similar to that of Case A and Case B can be drawn.


50x40 97
30x20 97
20x20 97
10x10 97
5x5 97.5
Table 6.9: Face recognition accuracy with SVM with no preprocessing.

Discussion:
From Tables 6.7, 6.8 and 6.9, it can be concluded that performance of support vector
machines are almost invariant to changes in resolution. Further, preprocessing does not
appear to have any effect on the recognition accuracy. This suggests that SVMs are more
robust with respect to preprocessing. The required features are extracted despite
preprocessing of images, which is different from that observed when ANN or PCA technique
is used. Furthermore, the SVMs outperform both ANNs and PCA in that SVMs yield the
best recognition accuracies.
6.4 PCA+MLP
Case A:
Histogram Equalization ON
Median Filtering ON
Table 6.10, and the same data pictured in Fig. 6.7. It can be observed that for larger image
sizes, there is little change in the recognition accuracy for hidden neurons more than 40, and
for smaller image this plateau is reached when the number of neurons is 35. These results

show that neural networks need more neurons to extract the hidden features from larger sized
images. Further, it can be easily seen that with PCA and artificial neural networks face
recognition accuracy intermediate to that of PCA and MLP is achieved.
Res/Hidden
Neurons
5 15 25 35 45 55 65 75
50x40 7.5 37 86.5 80.5 82.5 87.5 93 92.5
20x20 2.5 46 78.5 82.5 89 89.5 88 89.5
10x10 2.5 54.5 75 80.5 86 86.5 89 88.5

Table 6.10 Face recognition accuracy with PCA+MLP with pre-processing

Fig 6.7: Face recognition accuracy with PCA+MLP with preprocessing
Case B:
Histogram Equalisation OFF
Median Filtering ON

6.11. From the data, conclusion similar to that of Case A can be drawn.
Accuracy (%):
Res/Hidden Neurons 5 15 25 35 45 55 65 75
20X20 6.5 40 76.5 86.5 2.5 88 93.5 92
10X10 2.5 55 70.5 2.5 87.5 86.5 90 88
Table 6.11 Face recognition accuracy with PCA+MLP with only median filtering.

Fig 6.8: Face recognition accuracy with PCA+MLP with only median filtering.
Case C:
Histogram Equalisation OFF

6.12. From the data, conclusions similar to that of Case A and case B can be drawn.
Accuracy (%):
Res/Hidden
Neurons
5 15 25 35 45 55 65 75
50X40 2.5 41 78.5 78 88 87 89 92.5
20X20 2.5 37 82 86.5 89.5 89 91 91
10X10 2.5 24 57 78 83.5 82 88 87

Table 6.12 Face recognition accuracy with PCA+MLP with no pre-processing.
Discussion:
By intuition this method was expected to have better accuracy than PCA and MLP
alone. But the results obtained do not reflect on the expectations. The results are intermediate
to those of PCA and MLP suggesting that the weights obtained after PCA process are not
optimum for classification using MLPs.


CHAPTER VII
Conclusions and Future Work

7.1 Conclusions
This project has demonstrated the recognition of faces using Principal Component
Analysis, Networks and Support Vector Machines successfully. Face images are the inputs
the face recognition system. The pixel intensities are used as inputs to the respective
methods. The pre-processing of faces is also done and the results are obtained. For each
method the pre-processing changes done are using Histogram Equalization and Median
filtering to the face images.
Chapter 2 introduced Neural Networks and Backpropagation algorithm and used the
same for recognizing faces. Recognizing faces using neural network can be done in two
ways. The first method is a direct neural network based approach, and the other is done by
obtaining the weights of eigenfaces, and then classifying the weights using neural networks.
The parameter of Neural Networks, which is varied at each simulation, is the number of
hidden neurons. The change of recognition rate is not appreciable when the number of hidden
neurons is more than 40. Also, the Neural Network does not perform well when the
resolution is 50x40, but performs very well for lower resolutions. This suggests that the
network was learning unwanted parameters leading to poor generalization capabilities.
Chapter 3 introduced an information theory technique i.e. eigenface approach which
uses principal components. Even in this method different combination of pre-processing is
done and resolution of faces is varied. The performance is very commendable when the
number of eigenfaces is more than 25. Highest accuracy of 93% is obtained when then
number of eigenfaces is 26 and the resolution is 10x10.
In Chapter 4 support vector machines were discussed and using these the method of
face recognition was mentioned. It is concluded that performance of support vector machines
are almost invariant to changes in resolution. Further, preprocessing does not appear to have
any effect on the recognition accuracy. This suggests that SVMs are more robust with respect

to preprocessing. The required features are extracted despite preprocessing of images, which
is different from that observed when ANN or PCA techniques are used. Furthermore, the
SVMs outperform both ANNs and PCA to yield the best recognition accuracies. The
accuracy is around 96%.
7.2 Future Work
Several improvements can be made to the methods discussed in the thesis. The
algorithms introduced in the thesis can be refined in a number of ways, at the expense of
more computations. One such thing is the usage of kernel PCA instead of PCA. Kernel PCA
maps the non-linearities better than PCA. Another better alternative is to use self-organized
maps, which uses unsupervised learning algorithms. These completely eliminate the need for
training altogether. The recognition accuracy needs to be improved. Ensemble based systems
for face recognition is still in nascent stage. They offer a better accuracy.
The further extension of the current can be to apply to a bigger problem of face
detection and recognition. Face detection is the task of finding faces given an image. After
doing so, the whole system can be extended to videos. This will be a similar task since video
is a sequence of images. Though a lot of extensions can be made to the current system, for
the purpose of face recognition the algorithms discussed are satisfactory.



CHAPTER VIII
References
[1] W. Zhao, R. Chellappa, A. Rosenfeld, P.J. Phillips, Face Recognition: A Literature
Survey, ACM Computing Surveys, 2003, pp. 399-458
[2] M. Turk, A. Pentland, Eigenfaces for Recognition, Journal of Cognitive
Neuroscience, Vol. 3, No. 1, Win. 1991, pp. 71-86
[3] Jahan Zeb, Muhammed Younus Javed and Usman Qayyum Low resolution single
neural network based face recognition International journal of Biomedical Sciences
volume 2 number 3 2007 ISSN 1306 -1216 , pg 206-210.
[4] A Tutorial on Support Vector Machines for Pattern Recognition, CHRISTOPHER
J.C. BURGES, Bell Laboratories, Lucent Technologies.
[5] Face Recognition by Support Vector Machines :- Guodong Guo, Stan Z. Li, and
Kapluk Chan. 2000
[6] R. Chellappa, C. L.Wilson, and S. Sirohey. Human and machine recognition of faces:
A survey. Proc. IEEE, 83:705741, May 1995.
[7] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273297,
1995.
[8] E. Osuna, R. Freund, and F. girosi. Training support vector machines: an application
to face detection. Proc. CVPR,1997.
[9] M. Pontil and A. Verri. Support vector machines for 3-d object recognition. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 20:637646, 1998
[10] Master Thesis: Pattern Recognition using Support Vector Machines by Fredrik Gran,
Matematikcentrum LTH.
[11] Yung-Mao Lu; Bin-Yih Liao; Jeng-Shyang Pan, "A Face Recognition Algorithm
Decreasing the Effect of Illumination," Intelligent Information Hiding and
Multimedia Signal Processing, 2008. IIHMSP '08 International Conference on , vol.,
no., pp.378-381, 15-17 Aug. 2008
[12] PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND
TECHNOLOGY VOLUME 31 JULY 2008 ISSN 2070-3740. A New Face


Recognition Method using PCA,LDA and Neural Network by A. Hossein
Sahoolizadeh, B. Zargham Heidari, and C. Hamid Dehghani
[13] Face Recognition Techniques by Sumanth Tamma, Department of Computer
Science, University of New Mexico, Albuquerque
[14] Face Recognition by Support Vector Machines by Guodong Guo, Stan Z. Li, and
Kapluk Chan, School of Electrical and Electronic Engineering, Nanyang
Technological University, Singapore
[15] Face Recognition with Support Vector Machines: Global versus Component-based
Approach, Bemd Heiselet Purdy Hoj Tomaso Poggio, Massachusetts Institute of
Technology, Center for Biological and Computational Learning, Cambridge, MA 02
142
[16] Neural Networks Simon Haykin
[17] Digital Imaging Processing Gonzalez and Woods
[18] M.T. Harandi; M.N. Ahmadabadi; B.N. Araabi; C. Lucas, "Feature selection
using genetic algorithm and it's application to face recognition," Cybernetics and
Intelligent Systems, 2004 IEEE Conference on , vol.2, no., pp.1368-1373, 2004
[19] Liu, C.; Wechsler, H., "Evolutionary pursuit and its application to face recognition,"
Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.22, no.6,
pp.570-582, Jun 2000
[20] Wiskott, L.; Fellous, J.-M.; Kruger, N.; von der Malsburg, C., "Face recognition by
elastic bunch graph matching," Image Processing, 1997. Proceedings., International
Conference on , vol.1, no., pp.129-132 vol.1, 26-29 Oct 1997
[21] Nan Liu; Han Wang, "Modeling Images With Multiple Trace Transforms for Pattern
Analysis," Signal Processing Letters, IEEE , vol.16, no.5, pp.394-397, May 2009
[22] Srisuk, S.; Petrou, M.; Kurutach, W.; Kadyrov, A., "Face authentication using the
trace transform," Computer Vision and Pattern Recognition, 2003. Proceedings. 2003
IEEE Computer Society Conference on , vol.1, no., pp. I-305-I-312 vol.1, 18-20 June
2003
[23] Achermann, B.; Jiang, X.; Bunke, H., "Face recognition using range images," Virtual
Systems and MultiMedia, 1997. VSMM '97. Proceedings., International Conference
on , vol., no., pp.129-136, 10-12 Sep 1997


[24] Yongsheng Gao; Leung, M.K.H., "Face recognition using line edge map," Pattern
Analysis and Machine Intelligence, IEEE Transactions on , vol.24, no.6, pp.764-779,
Jun 2002
[25] Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Yi Ma, "Robust Face Recognition
via Sparse Representation," Pattern Analysis and Machine Intelligence, IEEE
Transactions on , vol.31, no.2, pp.210-227, Feb. 2009



APPENDIX

MATLAB SCRIPTS
PRINCIPAL COMPONENT ANALYSIS
pca.m
cl ear
cl c
cl ose al l
dat abasepat h=ui get di r ( ' C: \ Pr ogr amFi l es\ MATLAB7\ wor k' , ' sel ect dat abase
pat h' ) ;
T=[ ] ;
f or i =1: 2: 400
st r =i nt 2st r ( i ) ;
st r =st r cat ( ' \ ' , st r , ' . sgi ' ) ;
st r =st r cat ( dat abasepat h, st r ) ;

i mg=sgi r ead( st r ) ;

[ i r ow i col ] = si ze( i mg) ;
B = i mr esi ze( i mg, [ 50 40] , ' bi cubi c' ) ;
t emp = r eshape( B' , 50*40, 1) ; %Reshapi ng 2D i mages i nt o 1D i mage
vect or s
T = [ T t emp] ; %' T' gr ows af t er each t ur n
end
T=doubl e( T) ;
T=T/ 256;

m= mean( T, 2) ;
Tr ai n_Number = si ze( T, 2) ;

A = [ ] ;
f or i = 1 : Tr ai n_Number
t emp = doubl e( T( : , i ) ) - m; %Comput i ng t he di f f er ence i mage f or each
i mage i n t he t r ai ni ng set Ai = Ti - m
A = [ A t emp] ; %Mer gi ng al l cent er ed i mages
end
T=A;

C=[ ] ;
f or i =2: 2: 400




vect or s
C = [ C t emp] ; %' T' gr ows af t er each t ur n
end
C=doubl e( C) ;
C=C/ 256;
E = [ ] ;
t emp = doubl e( C( : , i ) ) - m; %Comput i ng t he di f f er ence i mage f or each
E = [ E t emp] ; %Mer gi ng al l cent er ed i mages
end

L=A' *A;
[ V D] =ei g( L) ;
l _ei g_vect or s=[ ] ;

t hr eshol d=max( D) ;

f or i =1: si ze( D, 2)

i f ( D( i , i ) >t hr eshol d( 190) )
l _ei g_vect or s=[ l _ei g_vect or s V( : , i ) ] ;
end
end

ei genf aces=( A*l _ei g_vect or s) / 256;
f or i =1: si ze( ei genf aces, 2)
i mg=r eshape( ei genf aces( : , i ) , 40, 50) ;
i mg=i mg' ;
subpl ot ( 5, 2, i ) ;

i mshow( i mg) ;
dr awnow;
end
Pr oj ect edI mages = [ ] ;
Tr ai n_Number = si ze( ei genf aces, 2) ;
f or i = 1 : 200
t emp = ei genf aces' *A( : , i ) ; %Pr oj ect i on of cent er ed i mages i nt o
f acespace
Pr oj ect edI mages = [ Pr oj ect edI mages t emp] ;
end
Pr oj ect edt est I mages = [ ] ;
Test r ai n_Number = si ze( E, 2) ;
f or i = 1 : Test r ai n_Number
t emp = ei genf aces' *E( : , i ) ; %Pr oj ect i on of cent er ed i mages i nt o
f acespace
Pr oj ect edt est I mages = [ Pr oj ect edt est I mages t emp] ;
end

Z=zer os( 200, 40) ;


j = 1;
f or i =1: 1: 200
Z( i , j ) = 1;
i f ( mod( i , 5) == 0)
j = j + 1;
end
end

euc_di st =[ ] ;
euc_di st _mi n=[ ] ;
r ec_i ndex=[ ] ;
f or i =1: 200
q=Pr oj ect edt est I mages( : , i ) ;
f or j =1: 200
r =Pr oj ect edI mages( : , j ) ;

t emp=( nor m( r - q) ) ^2;
euc_di st =[ euc_di st t emp] ;
end
[ euc_di st _mi n( i ) , r ec_i ndex( i ) ] =mi n( euc_di st ) ;
euc_di st =[ ] ;
end
r ecogni sed_f ace=cei l ( r ec_i ndex/ 5) ;
no_f aces_r ecogni sed=0;
f or j =1: 200
i f ( cei l ( j / 5) ==r ecogni sed_f ace( j ) )
no_f aces_r ecogni sed=no_f aces_r ecogni sed+1;
end

end
accur acy=no_f aces_r ecogni sed/ 2;

MULTILAYER PERCEPTRON

mlp.m

cl ear
cl c
cl ose al l
pat h' ) ;
T=[ ] ;
f or i =1: 2: 400




vect or s
end
T=doubl e( T) ;
T=T/ 256;

m= mean( T, 2) ;

A = [ ] ;
end
T=A;

C=[ ] ;
f or i =2: 2: 400


vect or s
end
C=doubl e( C) ;
C=C/ 256;
E = [ ] ;
end

L=A' *A;
[ V D] =ei g( L) ;
l _ei g_vect or s=[ ] ;

t hr eshol d=max( D) ;

f or i =1: si ze( D, 2)

i f ( D( i , i ) >t hr eshol d( 190) )
l _ei g_vect or s=[ l _ei g_vect or s V( : , i ) ] ;


end
end

ei genf aces=( A*l _ei g_vect or s) / 256;
f or i =1: si ze( ei genf aces, 2)
i mg=r eshape( ei genf aces( : , i ) , 40, 50) ;
i mg=i mg' ;
subpl ot ( 5, 2, i ) ;

i mshow( i mg) ;
dr awnow;
end
Pr oj ect edI mages = [ ] ;
Tr ai n_Number = si ze( ei genf aces, 2) ;
f or i = 1 : 200
t emp = ei genf aces' *A( : , i ) ; %Pr oj ect i on of cent er ed i mages i nt o
f acespace
Pr oj ect edI mages = [ Pr oj ect edI mages t emp] ;
end
Pr oj ect edt est I mages = [ ] ;
Test r ai n_Number = si ze( E, 2) ;
f or i = 1 : Test r ai n_Number
t emp = ei genf aces' *E( : , i ) ; %Pr oj ect i on of cent er ed i mages i nt o
f acespace
Pr oj ect edt est I mages = [ Pr oj ect edt est I mages t emp] ;
end

Z=zer os( 200, 40) ;
j = 1;
f or i =1: 1: 200
Z( i , j ) = 1;
i f ( mod( i , 5) == 0)
j = j + 1;
end
end

euc_di st =[ ] ;
euc_di st _mi n=[ ] ;
r ec_i ndex=[ ] ;
f or i =1: 200
q=Pr oj ect edt est I mages( : , i ) ;
f or j =1: 200
r =Pr oj ect edI mages( : , j ) ;

t emp=( nor m( r - q) ) ^2;
euc_di st =[ euc_di st t emp] ;
end
[ euc_di st _mi n( i ) , r ec_i ndex( i ) ] =mi n( euc_di st ) ;
euc_di st =[ ] ;
end
r ecogni sed_f ace=cei l ( r ec_i ndex/ 5) ;
no_f aces_r ecogni sed=0;
f or j =1: 200
i f ( cei l ( j / 5) ==r ecogni sed_f ace( j ) )
no_f aces_r ecogni sed=no_f aces_r ecogni sed+1;
end



end
accur acy=no_f aces_r ecogni sed/ 2;

SUPPORT VECTOR MACHINES

svm.m

cl ear
cl c
cl ose al l
pat h' ) ;
T=[ ] ;
f or i =1: 2: 400


i mg = medf i l t 2( i mg, [ 3 3] ) ;
i mg=hi st eq( i mg) ;
vect or s
end
T=doubl e( T) ;
T=T/ 256;

m= mean( T, 2) ;

A = [ ] ;
end
T=A;

C=[ ] ;
f or i =2: 2: 400


i mg = medf i l t 2( i mg, [ 3 3] ) ;
i mg=hi st eq( i mg) ;

vect or s

end
C=doubl e( C) ;
C=C/ 256;
E = [ ] ;
end
nbcl ass=40;
xapp=A' ;
f or i =1: 200
t ar get ( i ) =cei l ( ( i ) / 5)
end

yapp=t ar get ;
%- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

nbcl ass=40;
[ n1, n2] =si ze( xapp) ;

%- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
% Lear ni ng and Lear ni ng Par amet er s
c = 1000;
l ambda = 1e- 7;
ker nel opt i on= 1;
ker nel =' pol y' ;
ver bose = 0;

%- - - - - - - - - - - - - - - - - - - - - One Agai nst One al gor i t hms- - - - - - - - - - - - - - - -
nbcl ass=40;
%
%[ xsup, w, b, nbsv, cl assi f i er , pos] =svmmul t i cl assoneagai nst one( xapp, yapp, nbcl as
s, c, l ambda, ker nel , ker nel opt i on, ver bose) ;
ker nel opt i onm. mat r i x=svmker nel ( xapp, ker nel , ker nel opt i on) ;
[ xsup, w, b, nbsv, cl assi f i er , pos] =svmmul t i cl assoneagai nst one( [ ] , yapp, nbcl ass, c
, l ambda, ' numer i cal ' , ker nel opt i onm, ver bose) ;

%[ ypr ed, maxi ] =
svmmul t i val oneagai nst one( xt est , xsup, w, b, nbsv, ker nel , ker nel opt i on) ;
%ker nel opt i onm. mat r i x=svmker nel ( xt est , ker nel , ker nel opt i on, xapp( pos, : ) ) ;
ker nel opt i onm. mat r i x=svmker nel ( E' , ker nel , ker nel opt i on, xapp( pos, : ) ) ;
[ ypr ed, maxi ] =


svmmul t i val oneagai nst one( [ ] , [ ] , w, b, nbsv, ' numer i cal ' , ker nel opt i onm) ;
ypr ed

Important Note:

The MATLAB scripts can run on any basic version of MATLAB which as Neural Network
toolbox and PNM and SVM-KM toolbox is to be added to the work folder.These toolboxes
are provided in the CD.

Face Recognition

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Face Recognition

Uploaded by

Copyright:

Available Formats

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

, a face is classified as belonging to the class k.

You might also like