
A PROJECT REPORT

ON

"SMART CAR SECURITY SYSTEM"


Submitted to Swami Ramanand Teerth Marathwada University, Nanded in partial fulfilment
of the requirements for the degree of BACHELOR OF TECHNOLOGY in
Electronics and Telecommunication Engineering.

By

Vikas M. Gavhane (2013BEC604)


Pooja P. Dhakulkar (2014BEC522)
Gauri G. Sarda (2013BEC154)

Guide
Dr. A. V. Nandedkar

DEPARTMENT OF
ELECTRONICS AND TELECOMMUNICATION ENGINEERING,
S. G. G. S. INSTITUTE OF ENGINEERING AND TECHNOLOGY
VISHNUPURI, NANDED (M.S.), INDIA-431606.
APRIL 2016

CERTIFICATE

This is to certify that the project report entitled “SMART CAR SECURITY SYSTEM”, being submitted by Mr. Vikas M. Gavhane (2013BEC604), Miss Pooja P. Dhakulkar (2014BEC522) and Miss Gauri G. Sarda (2014BEC154) to S.G.G.S.I.E.&T., Nanded for the award of the degree of Bachelor of Technology in Electronics and Telecommunication Engineering, is a record of bonafide work carried out by them under my supervision and guidance. The matter contained in this report has not been submitted to any other university or institute for the award of a degree.

HOD Guide

Prof. A. B. Gonde Dr. A.V. Nandedkar

Content

1 Introduction
2 Literature Review
2.1 Eigenfaces
2.2 Neural Networks
2.3 Graph Matching
2.4 Hidden Markov Models (HMMs)
2.5 Geometrical Feature Matching
2.6 Template Matching
2.7 3D Morphable Model
3 Our Proposed Approach
3.1 Face Detection using Viola-Jones
3.1.1 Scale Invariant Detector
3.1.2 Modified AdaBoost Algorithm
3.1.3 The Cascaded Classifier
3.2 Points of Interest
3.3 SURF Descriptor
3.3.1 Orientation Assignment
3.3.2 Descriptor Component
Result
Conclusion

1. Introduction
1.1. Introduction

Car security has become a major concern, and an optimal solution is needed to prevent a car from being stolen or driven by an unauthorised person. It is also a matter of fact that there is always some flaw in the system provided by the manufacturer, so a backup system is needed to maintain the safety of the vehicle. This can be achieved by a software program that reduces the human effort of keeping an eye on the vehicle. The security can be applied to the ignition of the car, so that if an unauthorised person tries to gain access, the car will not respond. The implementation of the system is carried out using two techniques, face detection and face recognition, which are well advanced in computer authentication technology and are used here to secure the vehicle in a modern way. Face recognition is gradually evolving into a universal biometric solution, since it requires virtually zero effort from the user compared with other biometric options. The system works as follows: a picture is taken by the camera and processed for face detection; once the detected face image is obtained, face recognition is performed, which is further divided into face alignment, pre-processing, feature extraction, and face matching, in which the image is converted into a grayscale image and the result is obtained.

Here we present a technique to recognise the faces of the authorised persons who will drive the car. The general block diagram of a face recognition system consists of three blocks: the first is face detection, the second is feature extraction, and the third is face recognition, as shown in Figure 1.1 below.

Figure 1.1: Block diagram

2. Literature Review

The methods considered are Eigenfaces (Eigenfeatures), neural networks, dynamic link architecture, hidden Markov models, geometrical feature matching, and template matching. The approaches are analysed in terms of the facial representations they use.

2.1 Eigenfaces

Eigenface is one of the most thoroughly investigated approaches to face recognition. It is also known as the Karhunen-Loève expansion, eigenpicture, eigenvector, and principal component approach. References used principal component analysis to efficiently represent pictures of faces. They argued that any face image could be approximately reconstructed from a small collection of weights for each face and a standard face picture (eigenpicture). The weights describing each face are obtained by projecting the face image onto the eigenpicture. Reference used eigenfaces, motivated by the technique of Kirby and Sirovich, for face detection and identification. In mathematical terms, eigenfaces are the principal components of the distribution of faces, or the eigenvectors of the covariance matrix of the set of face images. The eigenvectors are ordered to represent different amounts of the variation among the faces. Each face can be represented exactly by a linear combination of the eigenfaces. It can also be approximated using only the “best” eigenvectors, those with the largest eigenvalues. The best M eigenfaces construct an M-dimensional space, i.e., the “face space”. The authors reported 96 per cent, 85 per cent, and 64 per cent correct classifications averaged over lighting, orientation, and size variations, respectively. Their database contained 2,500 images of 16 individuals. As the images include a large quantity of background area, the above results are influenced by the background. The authors explained the robust performance of the system under different lighting conditions by the significant correlation between images with changes in illumination. However, it was shown that the correlation between images of whole faces is not sufficient for satisfactory recognition performance; illumination normalisation is usually necessary for the eigenfaces approach. Reference proposed a new method to compute the covariance matrix using three images, each taken in different lighting conditions, to account for arbitrary illumination effects if the object is Lambertian. Reference extended their early work on eigenfaces to eigenfeatures corresponding to face components, such as the eyes, nose, and mouth. They used a modular eigenspace composed of these eigenfeatures (i.e., eigeneyes, eigennose, and eigenmouth). This method is less sensitive to appearance changes than the standard eigenface method. The system achieved a recognition rate of 95 per cent on the FERET database of 7,562 images of approximately 3,000 individuals. In summary, eigenfaces appear to be a fast, simple, and practical method. However, in general, they do not provide invariance over changes in scale and lighting conditions.
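
As a concrete illustration of the eigenface computation just described, here is a minimal NumPy sketch; the `faces` training matrix, the function names, and the nearest-neighbour matching rule are assumptions for illustration, not the exact procedure of the cited references.

```python
import numpy as np

def compute_eigenfaces(faces, M):
    """Compute the M best eigenfaces from a stack of training faces.

    faces : (N, D) array, one flattened grayscale face image per row.
    Returns the mean face and the top-M eigenvectors of the covariance
    matrix of the set (the "eigenfaces"), ordered by explained variance.
    """
    mean_face = faces.mean(axis=0)
    A = faces - mean_face                             # centre the data
    _, _, Vt = np.linalg.svd(A, full_matrices=False)  # SVD avoids forming the D x D covariance
    return mean_face, Vt[:M]

def project(face, mean_face, eigenfaces):
    """Weights of a face in the M-dimensional face space."""
    return eigenfaces @ (face - mean_face)
```

Recognition then amounts to projecting a probe face and returning the identity whose stored weight vector is closest, e.g. in Euclidean distance.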

Recently, in experiments with ear and face recognition using the standard principal component analysis approach, it was shown that the recognition performance is essentially identical using ear images or face images, and that combining the two for multimodal recognition results in a statistically significant performance improvement. For example, the difference in the rank-one recognition rate for the day-variation experiment using the 197-image training sets is 90.9% for the multimodal biometric versus 71.6% for the ear and 70.5% for the face. There is substantial related work in multimodal biometrics. For example, face and fingerprint have been used in multimodal biometric identification, as have face and voice. However, the use of face and ear in combination seems more relevant to surveillance applications.

2.2. Neural Networks

The attractiveness of using neural networks could be due to their nonlinearity, which may make the feature extraction step more efficient than the linear Karhunen-Loève methods. One of the first artificial neural network (ANN) techniques used for face recognition was a single-layer adaptive network called WISARD, which contains a separate network for each stored individual. The way a neural network structure is constructed is crucial for successful recognition and is very much dependent on the intended application. For face detection, multilayer perceptrons and convolutional neural networks have been applied; for face verification, a multi-resolution pyramid structure has been used. Reference proposed a hybrid neural network which combines local image sampling, a self-organizing map (SOM) neural network, and a convolutional neural network. The SOM provides a quantization of the image samples into a topological space where inputs that are nearby in the original space are also nearby in the output space, thereby providing dimensionality reduction and invariance to minor changes in the image sample. The convolutional network extracts successively larger features in a hierarchical set of layers and provides partial invariance to translation, rotation, scale, and deformation. The authors reported 96.2% correct recognition on the ORL database of 400 images of 40 individuals. The classification time is less than 0.5 second, but the training time is as long as 4 hours. Reference used a probabilistic decision-based neural network (PDBNN), which inherited its modular structure from its predecessor, the decision-based neural network (DBNN) [40]. The PDBNN can be applied effectively as 1) a face detector, which finds the location of a human face in a cluttered image, 2) an eye localizer, which determines the positions of both eyes in order to generate meaningful feature vectors, and 3) a face recognizer.

The PDBNN does not have a fully connected network topology. Instead, it divides the network into K subnets, each dedicated to recognizing one person in the database. The PDBNN uses the Gaussian activation function for its neurons, and the output of each “face subnet” is the weighted summation of the neuron outputs. In other words, the face subnet estimates the likelihood density using the popular mixture-of-Gaussians model. Compared to the AWGN scheme, a mixture of Gaussians provides a much more flexible and complex model for approximating the likelihood densities in the face space. The learning scheme of the PDBNN consists of two phases. In the first phase, each subnet is trained on its own face images. In the second phase, called decision-based learning, the subnet parameters may be trained on particular samples from other face classes. The decision-based learning scheme does not use all the training samples; only misclassified patterns are used. If a sample is misclassified to the wrong subnet, the rightful subnet tunes its parameters so that its decision region moves closer to the misclassified sample. A PDBNN-based biometric identification system has the merits of both neural networks and statistical approaches, and its distributed computing principle is relatively easy to implement on a parallel computer. In [39], it was reported that the PDBNN face recognizer had the capability of recognizing up to 200 people and could achieve up to a 96% correct recognition rate in approximately 1 second. However, as the number of persons increases, the computing expense becomes more demanding. In general, neural network approaches encounter problems when the number of classes (i.e., individuals) increases. Moreover, they are not suitable for single-model-image recognition tests, because multiple model images per person are necessary to train the system to an “optimal” parameter setting.

2.3. Graph Matching

Graph matching is another approach to face recognition. Reference presented a dynamic link structure for distortion-invariant object recognition which employed elastic graph matching to find the closest stored graph. Dynamic link architecture is an extension of classical artificial neural networks. Memorized objects are represented by sparse graphs, whose vertices are labelled with a multiresolution description in terms of a local power spectrum and whose edges are labelled with geometrical distance vectors. Object recognition can be formulated as elastic graph matching, which is performed by stochastic optimization of a matching cost function. They reported good results on a database of 87 people and a small set of office items comprising different expressions with a rotation of 15 degrees. The matching process is computationally expensive, taking about 25 seconds to compare an input with 87 stored objects on a parallel machine with 23 transputers. Reference extended the technique and matched human faces against a gallery of 112 neutral frontal-view faces. Probe images were distorted by rotation in depth and changing facial expression. Encouraging results on faces with large rotation angles were obtained. They reported recognition rates of 86.5% and 66.4% for the matching tests of 111 faces with 15-degree rotation and 110 faces with 30-degree rotation against a gallery of 112 neutral frontal views. In general, dynamic link architecture is superior to other face recognition techniques in terms of rotation invariance; however, the matching process is computationally expensive.

2.4 Hidden Markov Models (HMMs)

Stochastic modelling of non-stationary vector time series based on hidden Markov models (HMMs) has been very successful for speech applications. Reference applied this method to human face recognition. Faces were intuitively divided into regions such as the eyes, nose, and mouth, which can be associated with the states of a hidden Markov model. Since HMMs require a one-dimensional observation sequence and images are two-dimensional, the images must be converted into either 1D temporal sequences or 1D spatial sequences. In one approach, a spatial observation sequence was extracted from the face image using a band sampling technique: each face image was represented by a 1D vector series of pixel observations, where each observation vector is a block of L lines with an overlap of M lines between successive observations. An unknown test image is first sampled to an observation sequence. Then it is matched against every HMM in the model face database (each HMM represents a different subject). The match with the highest likelihood is considered the best match, and the corresponding model reveals the identity of the test face. The recognition rate of the HMM approach is 87% using the ORL database consisting of 400 images of 40 individuals. A pseudo 2D HMM [44] was reported to achieve a 95% recognition rate in preliminary experiments. Its classification time and training time were not given (believed to be very expensive), and the choice of parameters was based on subjective intuition.
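
To make the band-sampling idea concrete, here is a small sketch of converting a 2D face image into the 1D observation sequence described above; the particular values of L and M are illustrative parameters, not values fixed by the report.

```python
import numpy as np

def band_sampling(image, L=10, M=9):
    """Represent a face image as a 1D series of pixel observation vectors.

    Each observation is a horizontal band of L image rows, flattened;
    successive bands overlap by M rows, so the window slides by L - M.
    """
    step = L - M
    bands = [image[top:top + L].ravel()
             for top in range(0, image.shape[0] - L + 1, step)]
    return np.array(bands)   # the sequence scored against every model HMM
```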

2.5. Geometrical Feature Matching

Geometrical feature matching techniques are based on the computation of a set of geometrical features from the picture of a face. The fact that face recognition is possible even at a coarse resolution as low as 8×6 pixels, where the individual facial features are hardly revealed in detail, implies that the overall geometrical configuration of the face features is sufficient for recognition. The overall configuration can be described by a vector representing the position and size of the main facial features, such as the eyes and eyebrows, nose, mouth, and the shape of the face outline. One of the pioneering works on automated face recognition using geometrical features was done by Kanade in 1973. That system achieved a peak performance of 75% recognition rate on a database of 20 people using two images per person, one as the model and the other as the test image. References showed that a face recognition program provided with features extracted manually could perform recognition with apparently satisfactory results. Reference automatically extracted a set of geometrical features from the picture of a face, such as nose width and length, mouth position, and chin shape. Thirty-five features were extracted to form a 35-dimensional vector, and recognition was then performed with a Bayes classifier. They reported a recognition rate of 90% on a database of 47 people. Reference introduced a mixture-distance technique which achieved a 95% recognition rate on a query database of 685 individuals, where each face was represented by 30 manually extracted distances. Reference used Gabor wavelet decomposition to detect feature points for each face image, which greatly reduced the storage requirement for the database. Typically, 35-45 feature points per face were generated. The matching process utilized the information presented in a topological graph representation of the feature points. After compensating for different centroid locations, two cost values, the topological cost and the similarity cost, were evaluated. The recognition accuracy in terms of the best match to the right person was 86%, and the correct person's face was in the top three candidate matches in 94% of cases. In summary, geometrical feature matching based on precisely measured distances between features may be most useful for finding possible matches in a large database such as a mug shot album. However, it depends on the accuracy of the feature location algorithms; current automated face feature location algorithms do not provide a high degree of accuracy and require considerable computational time.

2.6. Template Matching

A simple version of template matching represents a test image as a two-dimensional array of intensity values, which is compared using a suitable metric, such as the Euclidean distance, with a single template representing the whole face. There are several more sophisticated versions of template matching for face recognition. One can use more than one face template from different viewpoints to represent an individual's face. A face from a single viewpoint can also be represented by a set of multiple distinctive smaller templates. The face image of gray levels may also be properly processed before matching. Brunelli and Poggio automatically selected a set of four feature templates, i.e., the eyes, nose, mouth, and the whole face, for all of the available faces. They compared the performance of their geometrical matching algorithm and template matching algorithm on the same database of faces, which contains 188 images of 47 individuals. The template matching was superior in recognition (100 per cent recognition rate) to geometrical matching (90 per cent recognition rate) and was also simpler.

Since the principal components (also known as eigenfaces or eigenfeatures) are linear combinations of the templates in the database, the technique cannot achieve better results than correlation, but it may be less computationally expensive. One drawback of template matching is its computational complexity. Another problem lies in the description of the templates: since the recognition system has to be tolerant to certain discrepancies between the template and the test image, this tolerance might average out the differences that make individual faces unique. In general, template-based approaches are a more logical approach than feature matching. In summary, no existing technique is free from limitations, and further efforts are required to improve the performance of face recognition techniques, especially in the wide range of environments encountered in the real world.
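
As a sketch of the simple version of template matching described above, the following compares a test face against stored whole-face templates using normalized cross-correlation; the template dictionary and the function names are hypothetical.

```python
import numpy as np

def normalized_correlation(test, template):
    """Normalized cross-correlation of two equal-sized grayscale images;
    returns a score in [-1, 1], higher meaning a better match."""
    t = test - test.mean()
    m = template - template.mean()
    return float((t * m).sum() / (np.linalg.norm(t) * np.linalg.norm(m)))

def best_match(test, templates):
    """Identity whose stored template correlates best with the test face."""
    return max(templates,
               key=lambda name: normalized_correlation(test, templates[name]))
```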

2.7. 3D Morphable Model

The morphable face model is based on a vector space representation of faces constructed such that any convex combination of shape and texture vectors of a set of examples describes a realistic human face. Fitting the 3D morphable model to images can be used in two ways for recognition across different viewing conditions. Paradigm 1: after fitting the model, recognition can be based on the model coefficients, which represent the intrinsic shape and texture of faces and are independent of the imaging conditions. Paradigm 2: three-dimensional face reconstruction can also be employed to generate synthetic views from gallery or probe images; the synthetic views are then transferred to a second, viewpoint-dependent recognition system. More recent work combines deformable 3D models with computer graphics simulation of projection and illumination. Given a single image of a person, the algorithm automatically estimates 3D shape, texture, and all relevant 3D scene parameters. In this framework, rotations in depth or changes of illumination are very simple operations, and all poses and illuminations are covered by a single model. Illumination is not restricted to Lambertian reflection, but takes into account specular reflections and cast shadows, which have considerable influence on the appearance of human skin. This approach is based on a morphable model of 3D faces that captures the class-specific properties of faces. These properties are learned automatically from a data set of 3D scans. The morphable model represents shapes and textures of faces as vectors in a high-dimensional face space, and involves a probability density function of natural faces within face space. The algorithm estimates all 3D scene parameters automatically, including head position and orientation, focal length of the camera, and illumination direction. This is achieved by a new initialization procedure, which also increases the robustness and reliability of the system considerably, using the image coordinates of between six and eight feature points. The percentage of correct identification on the CMU-PIE database, based on a side-view gallery, was 95%, and the corresponding percentage on the FERET set, based on frontal-view gallery images along with the estimated poses obtained from fitting, was 95.9%.

3. Our proposed approach

An outline of the proposed algorithm, portrayed in Figure 3.1, contains the following major modules: (1) face segmentation using the Viola-Jones algorithm, (2) computing points of interest on the detected face, (3) computing the SURF descriptor, and (4) matching the descriptors against the database images to decide whether the face belongs to an authorised person. The following sections present a brief summary of each step.

Figure 3.1: Outline of the proposed algorithm
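
A minimal sketch of this pipeline with OpenCV is shown below. It assumes the opencv-contrib build (SURF is patented and lives in `cv2.xfeatures2d`); the Hessian threshold, the cascade file, and the ratio-test value are illustrative defaults rather than parameters fixed by this report.

```python
import cv2

# Step 1: Viola-Jones detection, using a Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Steps 2-3: SURF interest points and descriptors (opencv-contrib only).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

def describe_face(image_path):
    """Detect the largest face in an image and return its SURF descriptors."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])      # keep largest face
    _, descriptors = surf.detectAndCompute(gray[y:y+h, x:x+w], None)
    return descriptors

def match_count(desc_test, desc_db, ratio=0.7):
    """Step 4: count Lowe ratio-test matches against one database face."""
    pairs = cv2.BFMatcher().knnMatch(desc_test, desc_db, k=2)
    return sum(1 for m, n in pairs if m.distance < ratio * n.distance)
```

A probe face would be accepted if its match count against some authorised user's stored descriptors exceeds a chosen threshold.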

3.1 Face Detection using Viola-Jones

Contrary to the standard approach, Viola and Jones rescale the detector instead of the input image and run the detector many times through the image, each time with a different size. At first, one might suspect both approaches to be equally time consuming, but Viola and Jones have devised a scale invariant detector that requires the same number of calculations whatever its size. This detector is constructed using a so-called integral image and some simple rectangular features reminiscent of Haar wavelets.

3.1.1 Scale Invariant Detector

The first step of the Viola-Jones face detection algorithm is to turn the input image into an
integral image. This is done by making each pixel equal to the entire sum of all pixels above
and to the left of the concerned pixel. This is demonstrated in Figure 3.2.

Figure 3.2: Integral Image

This allows for the calculation of the sum of all pixels inside any given rectangle using only
four values. These values are the pixels in the integral image that coincide with the corners of
the rectangle in the input image. This is demonstrated in Figure 3.3.

Figure 3.3: Integral image formation


Sum of grey rectangle = D - (B + C) + A

Since both rectangle B and rectangle C include rectangle A, the sum of A has to be added back to the calculation. It has now been demonstrated how the sum of pixels within rectangles of arbitrary size can be calculated in constant time (a sketch follows below).
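
A sketch of this constant-time computation, assuming a NumPy image array (function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """Each entry holds the sum of all pixels above and to the left,
    inclusive, of the corresponding input pixel."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum over a rectangle from four corner look-ups: D - (B + C) + A,
    where A is above-left of the rectangle, B above-right, C below-left
    and D below-right in the integral image."""
    A = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    B = ii[top - 1, left + width - 1] if top > 0 else 0
    C = ii[top + height - 1, left - 1] if left > 0 else 0
    D = ii[top + height - 1, left + width - 1]
    return D - (B + C) + A
```

A two-rectangle Haar feature value is then just the difference of two such sums.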
The Viola-Jones face detector analyzes a given sub-window using features consisting of two or more rectangles. The different types of features are shown in Figure 3.4.

Figure 3.4: Haar features.


Each feature results in a single value which is calculated by subtracting the sum of the white rectangle(s) from the sum of the black rectangle(s). Viola and Jones have empirically found that a detector with a base resolution of 24×24 pixels gives satisfactory results. When allowing for all possible sizes and positions of the features shown in Figure 3.4, a total of approximately 160,000 different features can be constructed. Thus, the number of possible features vastly outnumbers the 576 pixels contained in the detector at base resolution. These features may seem overly simple to perform such an advanced task as face detection, but what the features lack in complexity they most certainly make up for in computational efficiency. One could understand the features as the computer's way of perceiving an input image, the hope being that some features will yield large values when on top of a face. Of course, operations could also be carried out directly on the raw pixels, but the variation due to different pose and individual characteristics would be expected to hamper this approach. The goal is now to smartly construct a mesh of features capable of detecting faces, and this is the topic of the next section.

3.1.2 Modified AdaBoost Algorithm


As stated above, approximately 160,000 feature values can be calculated within a detector at base resolution. Among all these features, some few are expected to give almost consistently high values when on top of a face. In order to find these features, Viola and Jones use a modified version of the AdaBoost algorithm developed by Freund and Schapire in 1996. AdaBoost is a machine learning boosting algorithm capable of constructing a strong classifier through a weighted combination of weak classifiers. (A weak classifier classifies correctly in only a little more than half the cases.) To match this terminology to the presented theory, each feature is considered to be a potential weak classifier. A weak classifier is mathematically described as:

h(x, f, p, θ) = 1 if p·f(x) > p·θ, and 0 otherwise,

where x is a 24×24 pixel sub-window, f is the applied feature, p the polarity, and θ the threshold that decides whether x should be classified as a positive (a face) or a negative (a non-face). Since only a small fraction of the possible 160,000 feature values are expected to be potential weak classifiers, the AdaBoost algorithm is modified to select only the best features. The Viola-Jones modified AdaBoost algorithm is presented in pseudo code; a condensed sketch of its core step follows below.
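
Since the pseudo code itself is not reproduced here, the following is a hedged sketch of the brute-force search inside one boosting round: for a single feature, it finds the threshold and polarity with the lowest weighted error. The array names and the single-pass cumulative-sum trick are illustrative; this is not the exact Viola-Jones listing.

```python
import numpy as np

def best_weak_classifier(feature_values, labels, weights):
    """For one feature, find the threshold and polarity that minimise
    the weighted classification error over all training examples.

    feature_values : this feature evaluated on every training example
    labels         : 1 for faces, 0 for non-faces
    weights        : current AdaBoost example weights (summing to 1)
    """
    order = np.argsort(feature_values)
    fv, y, w = feature_values[order], labels[order], weights[order]

    pos_total, neg_total = w[y == 1].sum(), w[y == 0].sum()
    pos_below = np.cumsum(w * y)          # face weight at or below each value
    neg_below = np.cumsum(w * (1 - y))    # non-face weight at or below

    # Error if everything below the threshold is labelled face (polarity +1),
    # or everything above it is (polarity -1); keep whichever is smaller.
    err_pos = neg_below + (pos_total - pos_below)
    err_neg = pos_below + (neg_total - neg_below)
    errors = np.minimum(err_pos, err_neg)

    i = int(np.argmin(errors))
    polarity = 1 if err_pos[i] <= err_neg[i] else -1
    return fv[i], polarity, float(errors[i])
```

In each round this search is run over every candidate feature, the feature with the smallest weighted error is added to the strong classifier, and the example weights are then updated as described below.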

An important part of the modified AdaBoost algorithm is the determination of the best feature, polarity, and threshold. There seems to be no smart solution to this problem, and Viola and Jones suggest a simple brute-force method: the determination of each new weak classifier involves evaluating each feature on all the training examples in order to find the best performing feature. This is expected to be the most time-consuming part of the training procedure. The best performing feature is chosen based on the weighted error it produces. This weighted error is a function of the weights belonging to the training examples. The weight of a correctly classified example is decreased, while the weight of a misclassified example is kept constant. As a result, it is more 'expensive' for the second feature (in the final classifier) to misclassify an example also misclassified by the first feature than an example classified correctly. An alternative interpretation is that the second feature is forced to focus harder on the examples misclassified by the first. The point is that the weights are a vital part of the mechanics of the AdaBoost algorithm. With the integral image, the computationally efficient features, and the modified AdaBoost algorithm in place, it seems like the face detector is ready for implementation, but Viola and Jones have one more ace up their sleeve.

3.1.3 The cascaded classifier

The basic principle of the Viola-Jones face detection algorithm is to scan the detector many times through the same image, each time with a new size. Even if an image contains one or more faces, it is obvious that an excessively large number of the evaluated sub-windows would still be negatives (non-faces). This realization leads to a different formulation of the problem: instead of finding faces, the algorithm should discard non-faces. The thought behind this statement is that it is faster to discard a non-face than to find a face. With this in mind, a detector consisting of only one (strong) classifier suddenly seems inefficient, since its evaluation time is constant no matter the input. Hence the need for a cascaded classifier arises.

The cascaded classifier is composed of stages, each containing a strong classifier. The job of each stage is to determine whether a given sub-window is definitely not a face or maybe a face. When a sub-window is classified as a non-face by a given stage, it is immediately discarded. Conversely, a sub-window classified as a maybe-face is passed on to the next stage in the cascade. It follows that the more stages a given sub-window passes, the higher the chance that the sub-window actually contains a face. The concept is illustrated with two stages in Figure 3.5.

Figure 3.5: Cascaded classifier
In a single-stage classifier, one would normally accept false negatives in order to reduce the false positive rate. However, for the first stages in the staged classifier, false positives are not considered a problem, since the succeeding stages are expected to sort them out. Therefore, Viola and Jones prescribe the acceptance of many false positives in the initial stages. Consequently, the number of false negatives in the final staged classifier is expected to be very small. Viola and Jones also refer to the cascaded classifier as an attentional cascade, a name implying that more attention (computing power) is directed towards the regions of the image suspected to contain faces. It follows that when training a given stage, say stage n, the negative examples should of course be the false positives generated by stage n-1.
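
A minimal sketch of how such a cascade evaluates a sub-window; representing each stage as a (strong classifier, threshold) pair is an assumption made for illustration.

```python
def cascade_classify(sub_window, stages):
    """Run a sub-window through the attentional cascade.

    `stages` is a list of (strong_classifier, threshold) pairs.  A sub-window
    is discarded as a non-face the moment any stage scores it below its
    threshold; only sub-windows surviving every stage are reported as faces.
    """
    for strong_classifier, threshold in stages:
        if strong_classifier(sub_window) < threshold:
            return False   # definitely not a face: rejected immediately
    return True            # passed all stages: most likely a face
```

Because the vast majority of sub-windows are rejected by the first, cheap stages, the average cost per sub-window is far below that of evaluating one monolithic strong classifier.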

3.2 Points of Interest

One of the main advantages of SURF is its ability to compute distinctive descriptors quickly. In addition, the SURF descriptor is invariant to common image transformations, including image rotation, scale changes, illumination changes, and small changes in viewpoint. We base our detector on the Hessian matrix because of its good performance in computation time and accuracy. However, rather than using a different measure for selecting the location and the scale (as was done in the Hessian-Laplace detector [11]), we rely on the determinant of the Hessian for both. Given a point x = (x, y) in an image I, the Hessian matrix H(x, σ) in x at scale σ is defined as follows:

H(x, σ) = [ Lxx(x, σ)   Lxy(x, σ)
            Lxy(x, σ)   Lyy(x, σ) ]

where Lxx(x, σ) is the convolution of the Gaussian second order derivative ∂²g(σ)/∂x² with the image I at point x, and similarly for Lxy(x, σ) and Lyy(x, σ). As Gaussian filters are non-ideal in any case, and given Lowe's success with LoG approximations, we push the approximation even further with box filters.

Figure 3.6 (left to right): The discretised and cropped Gaussian second order partial derivatives in the y-direction and xy-direction, and approximations thereof using box filters.

The 9×9 box filters in Figure 3.6 are approximations of Gaussian second order derivatives with σ = 1.2 and represent our lowest scale (i.e., highest spatial resolution). We denote our approximations by Dxx, Dxy, and Dyy. The relative weight of the filter responses is theoretically sensitive to scale, but in practice it can be kept constant at 0.9. This yields

det(H_approx) = Dxx·Dyy − (0.9·Dxy)²

Furthermore, the filter responses are normalised with respect to the mask size.

Scale spaces are usually implemented as image pyramids: the images are repeatedly smoothed with a Gaussian and subsequently sub-sampled in order to achieve a higher level of the pyramid. Due to the use of box filters and integral images, we do not have to iteratively apply the same filter to the output of a previously filtered layer; instead, we can apply box filters of any size at exactly the same speed directly on the original image, and even in parallel (although the latter is not exploited here). Therefore, the scale space is analysed by up-scaling the filter size rather than iteratively reducing the image size. The output of the above 9×9 filter is considered the initial scale layer, referred to as scale s = 1.2 (corresponding to Gaussian derivatives with σ = 1.2). The following layers are obtained by filtering the image with gradually bigger masks, taking into account the discrete nature of integral images and the specific structure of our filters. Specifically, this results in filters of size 9×9, 15×15, 21×21, 27×27, etc. At larger scales, the step between consecutive filter sizes should also scale accordingly; hence, for each new octave, the filter size increase is doubled (going from 6 to 12 to 24). Simultaneously, the sampling intervals for the extraction of the interest points can be doubled as well.
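
The progression of filter sizes can be generated mechanically. The following small sketch reproduces the sizes quoted above; the number of octaves and layers per octave are illustrative parameters, not values fixed by the report.

```python
def surf_filter_sizes(octaves=4, layers=4):
    """Box-filter side lengths per octave: the first octave steps by 6
    (9, 15, 21, 27), and each new octave starts at the second filter size
    of the previous one while doubling the step (6 -> 12 -> 24 -> ...)."""
    sizes, base, step = [], 9, 6
    for _ in range(octaves):
        sizes.append([base + i * step for i in range(layers)])
        base, step = sizes[-1][1], step * 2   # consecutive octaves overlap
    return sizes

# surf_filter_sizes() ->
# [[9, 15, 21, 27], [15, 27, 39, 51], [27, 51, 75, 99], [51, 99, 147, 195]]
```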

As the ratios of our filter layout remain constant after scaling, the approximated Gaussian derivatives scale accordingly. Thus, for example, our 27×27 filter corresponds to σ = 3×1.2 = 3.6 = s. Furthermore, as the Frobenius norm remains constant for our filters, they are already scale normalised [26]. In order to localise interest points in the image and over scales, non-maximum suppression in a 3×3×3 neighbourhood is applied. The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown et al. [27]. Scale space interpolation is especially important in our case, as the difference in scale between the first layers of every octave is relatively large. This detector is referred to as the 'Fast-Hessian' detector.

3.3. SURF Descriptor

The first step consists of fixing a reproducible orientation based on information from a
circular region around the interest point. Then, we construct a square region aligned to the
selected orientation, and extract the SURF descriptor from it. These two steps are now
explained in turn.

3.3.1 Orientation Assignment

In order to be invariant to rotation, we identify a reproducible orientation for the interest points. For that purpose, we first calculate the Haar wavelet responses in the x and y directions, shown in Figure 3.7, within a circular neighbourhood of radius 6s around the interest point, with s the scale at which the interest point was detected. The sampling step is also scale dependent and chosen to be s. In keeping with the rest, the wavelet responses are computed at the current scale s. Accordingly, at high scales the size of the wavelets is big. Therefore, we again use integral images for fast filtering: only six operations are needed to compute the response in the x or y direction at any scale. The side length of the wavelets is 4s.

Figure 3.7: Haar features

Once the wavelet responses are calculated and weighted with a Gaussian (σ = 2.5s) centered at the interest point, the responses are represented as vectors in space, with the horizontal response strength along the abscissa and the vertical response strength along the ordinate. The dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within the window are summed, and the two summed responses yield a new vector. The longest such vector lends its orientation to the interest point. The size of the sliding window is a parameter which has been chosen experimentally: small sizes fire on single dominating wavelet responses, while large sizes yield maxima in vector length that are not outspoken.
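
A hedged sketch of this sliding-window search is given below; the inputs are assumed to be the Gaussian-weighted wavelet responses and their angles, and the number of window positions tested is an illustrative choice, not a value specified by the method.

```python
import numpy as np

def dominant_orientation(angles, dx, dy, window=np.pi / 3, steps=42):
    """Slide a pi/3 window around the circle; within each position, sum the
    horizontal and vertical responses and keep the orientation of the
    longest resulting vector.

    angles : orientation of each wavelet response point (radians)
    dx, dy : corresponding horizontal and vertical response strengths
    """
    best_len, best_angle = -1.0, 0.0
    for start in np.linspace(0.0, 2 * np.pi, steps, endpoint=False):
        inside = ((angles - start) % (2 * np.pi)) < window   # wrap-around window
        sx, sy = dx[inside].sum(), dy[inside].sum()
        length = sx * sx + sy * sy                           # squared vector length
        if length > best_len:
            best_len, best_angle = length, float(np.arctan2(sy, sx))
    return best_angle
```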

3.3.2 Descriptor Component

For the extraction of the descriptor, the first step consists of constructing a square region centered around the interest point and oriented along the orientation selected in the previous section. (For the upright version of the descriptor, this rotation is not necessary.) The size of this window is 20s. The region is split up regularly into smaller 4×4 square sub-regions, which keeps important spatial information in. For each sub-region, we compute a few simple features at 5×5 regularly spaced sample points. For reasons of simplicity, we call dx the Haar wavelet response in the horizontal direction and dy the Haar wavelet response in the vertical direction (filter size 2s). “Horizontal” and “vertical” here are defined in relation to the selected interest point orientation. To increase the robustness towards geometric deformations and localisation errors, the responses dx and dy are first weighted with a Gaussian (σ = 3.3s) centered at the interest point.

Then, the wavelet responses dx and dy are summed up over each sub-region and form a first set of entries to the feature vector. In order to bring in information about the polarity of the intensity changes, we also extract the sums of the absolute values of the responses, |dx| and |dy|. Hence, each sub-region has a four-dimensional descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|) for its underlying intensity structure.

Figure 3.8: The descriptor entries of a sub-region represent the nature of the underlying intensity pattern. Left: in the case of a homogeneous region, all values are relatively low. Middle: in the presence of frequencies in the x direction, the value of ∑|dx| is high, but all others remain low. Right: if the intensity is gradually increasing in the x direction, both ∑dx and ∑|dx| are high.

Concatenating the descriptor vectors of all 4×4 sub-regions results in a descriptor of length 64. The wavelet responses are invariant to a bias in illumination (offset); invariance to contrast (a scale factor) is achieved by turning the descriptor into a unit vector. Figure 3.8 shows the properties of the descriptor for three distinctively different image intensity patterns within a sub-region. One can imagine combinations of such local intensity patterns resulting in a distinctive descriptor.
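
To summarise the construction, the sketch below assembles the 64-dimensional descriptor from precomputed, Gaussian-weighted Haar responses sampled on the 20×20 grid (5×5 sample points in each of the 4×4 sub-regions); the input arrays are assumed to be given.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Build the SURF descriptor from (20, 20) arrays of weighted Haar
    responses: for each 5x5 sub-region collect (sum dx, sum dy,
    sum |dx|, sum |dy|), then normalise to unit length."""
    entries = []
    for i in range(0, 20, 5):            # 4 x 4 grid of sub-regions
        for j in range(0, 20, 5):
            bx, by = dx[i:i+5, j:j+5], dy[i:i+5, j:j+5]
            entries += [bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()]
    v = np.array(entries)                # 16 sub-regions x 4 entries = 64
    return v / np.linalg.norm(v)         # unit norm -> invariance to contrast
```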

Result

Conclusion

While implementing the algorithm, we faced some problems with thresholding, which caused even non-face objects, such as bright points in the images or gray colour patterns, to be detected as faces. To overcome this, we created a bounding box around each detected object, defined its height and width, and calculated its area, with the condition that any bounding box having a small area and containing only a small number of pixels is discarded.
