ON
By
Guide
Dr.A.V.Nandedkar sir
DEPARTMENT OF
ELECTRONICS AND TELECOMMUNICATION ENGINEERING,
S. G. G. S. INSTITUTE OF ENGINEERING AND TECHNOLOGY
VISNUPURI, NANDED (M.S.), INDIA-431606.
APRIL 2016
CERTIFICATE
This is to certify that the seminar report entitled “SMART CAR SECURITY
SYSTEM”, being submitted by Mr. Vikas M. Gavhane (2013BEC604), Miss Pooja P.
Dhakulkar (2014BEC522) and Miss Gauri G. Sarda (2014BEC154) in S.G.G.S.I.E.&T.,
Nanded for the award of the degree of Bachelor of Technology in Electronics and
Telecommunication Engineering, is a record of bona fide work carried out by them under my
supervision and guidance. The matter contained in this seminar report has not been submitted
to any other university or institute for the award of any degree.
HOD Guide
Content
1 Introduction
2 Literature Review
2.1 Eigenfaces
2.2 Neural Networks
2.3 Graph Matching
2.4 Hidden Markov Models
2.5 Geometrical Feature Matching
2.6 Template Matching
2.7 3D Morphable Model
3 Our proposed approach
3.1 Face Detection using Viola Jones
3.2 Scale Invariant Detector
1. Introduction
1.1. Introduction
Car security is a major concern, and an effective solution is needed to prevent a car from
being stolen or used by an unauthorised person. Security systems provided by manufacturers
can also have flaws, so a backup system is needed to keep the vehicle safe. This can be
achieved with a software program that reduces the human effort of keeping watch over the
vehicle. Security can be applied to the ignition of the car, so that the car does not respond
when an unauthorised person tries to gain access. The system is implemented using two
techniques, face detection and face recognition, both of which are well established in
computer authentication technology. Face recognition is gradually evolving into a universal
biometric solution, since it requires virtually zero effort from the user compared with other
biometric options. The system works as follows: a picture is taken by a camera and passed to
face detection; once the detected face image is obtained, face recognition is performed. Face
recognition is divided into face alignment, pre-processing, feature extraction and face
matching; the image is converted into a grayscale image during this process, and the result is
then reported.
Here we present a technique to recognise the faces of the persons authorised to drive the car.
The general block diagram of a face recognition system consists of three blocks: face
detection, feature extraction, and face recognition. The block diagram is shown in Figure 1.1.
Figure 1.1: Block diagram
2. Literature Review
2.1 Eigenfaces
However, in general, it does not provide invariance over changes in scale and lighting
conditions.
Recent experiments with ear and face recognition, using the standard principal component
analysis approach, showed that the recognition performance is essentially identical using ear
images or face images, and that combining the two for multimodal recognition results in a
statistically significant performance improvement. For example, the rank-one recognition rate
for the day variation experiment using the 197-image training sets is 90.9% for the multimodal
biometric versus 71.6% for the ear and 70.5% for the face. There is substantial related work in
multimodal biometrics. For example, face and fingerprint have been used together in
multimodal biometric identification, as have face and voice. However, use of the face and ear
in combination seems more relevant to surveillance applications.
2.2 Neural Networks
The attractiveness of using neural networks could be due to their nonlinearity. Hence, the
feature extraction step may be more efficient than the linear Karhunen-Loève methods. One of
the first artificial neural network (ANN) techniques used for face recognition is a single-layer
adaptive network called WISARD, which contains a separate network for each stored
individual. The way a neural network structure is constructed is crucial for successful
recognition, and it depends very much on the intended application. For face detection, the
multilayer perceptron and the convolutional neural network have been applied. For face
verification, a multi-resolution pyramid structure has been used. A hybrid neural network has
also been proposed which combines local image sampling, a self-organizing map (SOM)
neural network, and a convolutional neural network. The SOM provides a quantization of the
image samples into a topological space where inputs that are nearby in the original space are
also nearby in the output space, thereby providing dimension reduction and invariance to
minor changes in the image sample. The convolutional network extracts successively larger
features in a hierarchical set of layers and provides partial invariance to translation, rotation,
scale, and deformation. The authors reported 96.2% correct recognition on the ORL database
of 400 images of 40 individuals. The classification time is less than 0.5 second, but the
training time is as long as 4 hours.
Another approach used a probabilistic decision-based neural network (PDBNN), which
inherited the modular structure from its predecessor, the decision-based neural network
(DBNN) [40]. The PDBNN can be applied effectively as 1) a face detector, which finds the
location of a human face in a cluttered image, 2) an eye localizer, which determines the
positions of both eyes in order to generate meaningful feature vectors, and 3) a face
recognizer. The PDBNN does not have a fully connected network topology. Instead, it divides
the network into K subnets. Each subnet is dedicated to recognizing one person in the
database. The PDBNN uses the Gaussian activation function for its neurons, and the output of
each “face subnet” is the weighted summation of the neuron outputs. In other words, the face
subnet estimates the likelihood density using the popular mixture-of-Gaussians model.
Compared to the AWGN scheme, a mixture of Gaussians provides a much more flexible and
complex model for approximating the true likelihood densities in the face space.
The learning scheme of the PDBNN consists of two phases. In the first phase, each subnet is
trained on its own face images. In the second phase, called decision-based learning, the subnet
parameters may be trained by particular samples from other face classes. The decision-based
learning scheme does not use all the training samples; only misclassified patterns are used. If
a sample is misclassified to the wrong subnet, the rightful subnet tunes its parameters so that
its decision region moves closer to the misclassified sample. A PDBNN-based biometric
identification system has the merits of both neural networks and statistical approaches, and its
distributed computing principle is relatively easy to implement on a parallel computer. In [39],
it was reported that the PDBNN face recognizer had the capability of recognizing up to 200
people and could achieve up to 96% correct recognition rate in approximately 1 second.
However, when the number of persons increases, the computing expense becomes more
demanding. In general, neural network approaches encounter problems when the number of
classes (i.e., individuals) increases. Moreover, they are not suitable for a single model image
recognition test, because multiple model images per person are necessary to train the systems
to an “optimal” parameter setting.
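The subnet structure described above can be sketched in a few lines. This is only an illustrative sketch, not the PDBNN implementation from the cited work: it assumes isotropic Gaussians with one variance per component, and the names `gaussian_pdf`, `subnet_likelihood`, `classify` and the toy parameters are hypothetical. Each "face subnet" outputs a weighted sum of Gaussian neuron outputs, and the input is assigned to the person whose subnet gives the highest likelihood.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of an isotropic Gaussian (assumption: one variance for all dimensions)."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * var) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * var)) / norm

def subnet_likelihood(x, components):
    """Weighted summation of Gaussian neuron outputs for one face subnet.
    components: list of (weight, mean, variance), weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

def classify(x, subnets):
    """Return the person whose subnet assigns x the highest likelihood."""
    return max(subnets, key=lambda person: subnet_likelihood(x, subnets[person]))

# Hypothetical two-person database with 2D feature vectors.
subnets = {
    "person_a": [(0.5, [0.0, 0.0], 1.0), (0.5, [1.0, 1.0], 1.0)],
    "person_b": [(1.0, [5.0, 5.0], 1.0)],
}
print(classify([0.4, 0.6], subnets))
```

Because each person has a separate subnet, adding a person means adding one more entry to `subnets`, which also shows why the computing expense grows with the number of persons.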
2.3 Graph Matching
Graph matching is another approach to face recognition. One reported method presented a
dynamic link structure for distortion-invariant object recognition, which employed elastic
graph matching to find the closest stored graph. Dynamic link architecture is an extension of
classical artificial neural networks. Memorized objects are represented by sparse graphs,
whose vertices are labelled with a multiresolution description in terms of a local power
spectrum and whose edges are labelled with geometrical distance vectors. Object recognition
can be formulated as elastic graph matching, which is performed by stochastic optimization
of a matching cost function. Good results were reported on a database of 87 people and a
small set of office items comprising different expressions with a rotation of 15 degrees. The
matching process is computationally expensive, taking about 25 seconds to compare with 87
stored objects on a parallel machine with 23 transputers. A later work extended the technique
and matched human faces against a gallery of 112 neutral frontal view faces. Probe images
were distorted due to rotation in depth and changing facial expression. Encouraging results on
faces with large rotation angles were obtained: recognition rates of 86.5% and 66.4% were
reported for the matching tests of 111 faces of 15 degree rotation and 110 faces of 30 degree
rotation to a gallery of 112 neutral frontal views. In general, dynamic link architecture is
superior to other face recognition techniques in terms of rotation invariance; however, the
matching process is computationally expensive.
2.4 Hidden Markov Models
Stochastic modelling of non-stationary vector time series based on hidden Markov models
(HMM) has been very successful for speech applications. The same method has been applied
to human face recognition. Faces were intuitively divided into regions such as the eyes, nose,
mouth, etc., which can be associated with the states of a hidden Markov model. Since HMMs
require a one-dimensional observation sequence and images are two-dimensional, the images
should be converted into either 1D temporal sequences or 1D spatial sequences. In one
approach, a spatial observation sequence was extracted from a face image by using a band
sampling technique. Each face image was represented by a 1D vector series of pixel
observations. Each observation vector is a block of L lines, and there is an M-line overlap
between successive observations. An unknown test image is first sampled to an observation
sequence. Then, it is matched against every HMM in the model face database (each HMM
represents a different subject). The match with the highest likelihood is considered the best
match, and the relevant model reveals the identity of the test face. The recognition rate of the
HMM approach is 87% using the ORL database consisting of 400 images of 40 individuals. A
pseudo 2D HMM [44] was reported to achieve a 95% recognition rate in preliminary
experiments. Its classification time and training time were not given (believed to be very
expensive). The choice of parameters had been based on subjective intuition.
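The band sampling step can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the cited authors' code: the image is a plain list of pixel rows, each band is flattened row by row into one observation vector, and the function name `band_sample` and its parameter names are hypothetical. With a block of L lines and an overlap of M lines, the window advances by L - M lines per observation.

```python
def band_sample(image, block_lines, overlap):
    """Split an image (list of pixel rows) into overlapping horizontal bands.
    Each band of `block_lines` rows is flattened into one observation vector;
    successive bands share `overlap` rows."""
    if not 0 <= overlap < block_lines:
        raise ValueError("overlap must satisfy 0 <= overlap < block_lines")
    step = block_lines - overlap
    observations = []
    for top in range(0, len(image) - block_lines + 1, step):
        band = image[top:top + block_lines]
        observations.append([pixel for row in band for pixel in row])
    return observations

# Toy 10 x 4 "image" whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(4)] for r in range(10)]
obs = band_sample(image, block_lines=4, overlap=2)
print(len(obs), len(obs[0]))
```

For an image of H lines this produces (H - L) // (L - M) + 1 observations, which is the length of the sequence that each HMM is scored against.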
2.6. Template Matching
Since the principal components (also known as eigenfaces or eigenfeatures) are linear
combinations of the templates in the database, the technique cannot achieve better results
than correlation, but it may be less computationally expensive. One drawback of template
matching is its computational complexity. Another problem lies in the description of these
templates. Since the recognition system has to be tolerant to certain discrepancies between
the template and the test image, this tolerance might average out the differences that make
individual faces unique. In general, template-based approaches are a more logical approach
than feature matching. In summary, no existing technique is free from limitations. Further
efforts are required to improve the performance of face recognition techniques, especially in
the wide range of environments encountered in the real world.
2.7 3D Morphable Model
The morphable face model is based on a vector space representation of faces that is
constructed such that any convex combination of shape and texture vectors of a set of
examples describes a realistic human face. Fitting the 3D morphable model to images can be
used in two ways for recognition across different viewing conditions:
Paradigm 1. After fitting the model, recognition can be based on model coefficients, which
represent intrinsic shape and texture of faces and are independent of the imaging conditions.
Paradigm 2. Three-dimensional face reconstruction can also be employed to generate
synthetic views from gallery or probe images. The synthetic views are then transferred to a
second, viewpoint-dependent recognition system.
More recently, a method has been described that combines deformable 3D models with a
computer graphics simulation of projection and illumination. Given a single image of a
person, the algorithm automatically estimates 3D shape, texture, and all relevant 3D scene
parameters. In this framework, rotations in depth or changes of illumination are very simple
operations, and all poses and illuminations are covered by a single model. Illumination is not
restricted to Lambertian reflection, but takes into account specular reflections and cast
shadows, which have considerable influence on the appearance of human skin. This approach
is based on a morphable model of 3D faces that captures the class-specific properties of faces.
These properties are learned automatically from a data set of 3D scans. The morphable model
represents shapes and textures of faces as vectors in a high-dimensional face space, and
involves a probability density function of natural faces within face space. The algorithm
estimates all 3D scene parameters automatically, including head position and orientation,
focal length of the camera, and illumination direction. This is achieved by a new initialization
procedure that also increases robustness and reliability of the system considerably. The new
initialization uses image coordinates of between six and eight feature points. The percentage
of correct identification on the CMU-PIE database, based on a side-view gallery, was 95%,
and the corresponding percentage on the FERET set, based on frontal view gallery images,
along with the estimated head poses obtained from fitting, was 95.9%.
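The convex combination mentioned at the start of this section can be written out explicitly. The symbols below are assumptions introduced only for illustration: S_i and T_i denote the shape and texture vectors of the m example faces, and a_i and b_i are the model coefficients on which recognition under Paradigm 1 would be based.

```latex
\[
S = \sum_{i=1}^{m} a_i S_i, \qquad
T = \sum_{i=1}^{m} b_i T_i, \qquad
a_i,\, b_i \ge 0, \qquad
\sum_{i=1}^{m} a_i = \sum_{i=1}^{m} b_i = 1
\]
```

Under these constraints every generated face lies inside the span of the examples, which is why each combination is expected to describe a realistic face.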
3. Our proposed approach
An outline of the proposed algorithm is portrayed in Figure 3.1. It contains the following
major modules: (1) face segmentation using the Viola-Jones algorithm, (2) computing points
of interest on the detected face, (3) computing the SURF descriptor, and (4) classifying the
descriptors against the database images to decide whether the face is a match. The following
sections present a brief summary of each step.
Figure 3.1
3.1 Face Detection using Viola Jones
Contrary to the standard approach of rescaling the input image, Viola-Jones rescale the
detector instead and run it many times through the image, each time with a different size. At
first one might suspect both approaches to be equally time consuming, but Viola-Jones have
devised a scale invariant detector that requires the same number of calculations whatever the
size. This detector is constructed using a so-called integral image and some simple
rectangular features reminiscent of Haar wavelets.
The first step of the Viola-Jones face detection algorithm is to turn the input image into an
integral image. This is done by making each pixel equal to the entire sum of all pixels above
and to the left of the concerned pixel. This is demonstrated in Figure 3.2.
Figure 3.2: Integral Image
This allows for the calculation of the sum of all pixels inside any given rectangle using only
four values. These values are the pixels in the integral image that coincide with the corners of
the rectangle in the input image. This is demonstrated in Figure 3.3.
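The integral image and the four-value rectangle sum can be sketched as follows. This is a minimal illustration assuming the image is a list of rows of integers and that rectangle bounds are inclusive; the function names `integral_image` and `rect_sum` are hypothetical. With inclusive bounds, three of the four lookups fall just outside the top and left edges of the rectangle, so they are skipped when the rectangle touches the image border.

```python
def integral_image(image):
    """ii[y][x] = sum of image[r][c] for all r <= y and c <= x."""
    height, width = len(image), len(image[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom and columns left..right (inclusive),
    using at most four lookups in the integral image."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(ii[2][2])                  # sum of the whole image
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9
```

The cost of `rect_sum` does not depend on the rectangle size, which is what allows the detector to be rescaled without increasing the number of calculations.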
The features can be understood as the computer’s way of perceiving an input image, the hope
being that some features will yield large values when on top of a face. Of course, operations
could also be carried out directly on the raw pixels, but the variation due to different pose and
individual characteristics would be expected to hamper this approach. The goal is now to
smartly construct a mesh of features capable of detecting faces, and this is the topic of the
next section.
An important part of the modified AdaBoost algorithm is the determination of the best feature,
polarity and threshold. There seems to be no smart solution to this problem, and Viola-Jones
suggest a simple brute force method. This means that the determination of each new weak
classifier involves evaluating each feature on all the training examples in order to find the
best performing feature. This is expected to be the most time consuming part of the training
procedure. The best performing feature is chosen based on the weighted error it produces.
This weighted error is a function of the weights belonging to the training examples. The
weight of a correctly classified example is decreased and the weight of a misclassified
example is kept constant. As a result, it is more ‘expensive’ for the second feature (in the final
classifier) to misclassify an example also misclassified by the first feature than an example
classified correctly. An alternative interpretation is that the second feature is forced to focus
harder on the examples misclassified by the first. The point is that the weights are a vital part
of the mechanics of the AdaBoost algorithm. With the integral image, the computationally
efficient features and the modified AdaBoost algorithm in place, it seems like the face detector
is ready for implementation, but Viola-Jones have one more ace up the sleeve.
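The brute force search and the weight update can be sketched as below. This is a simplified illustration, not the training code of Viola-Jones: it assumes the feature values have already been computed for every training example, uses the observed feature values as candidate thresholds, and scales the weights of correctly classified examples by beta = error / (1 - error) before renormalising. The function names and the toy data are hypothetical.

```python
def stump_predict(value, threshold, polarity):
    """Weak classifier: 1 (face) if polarity * value < polarity * threshold."""
    return 1 if polarity * value < polarity * threshold else 0

def best_stump(feature_values, labels, weights):
    """Brute force search over every feature, threshold and polarity.
    feature_values[j][i] is feature j evaluated on training example i.
    Returns (weighted_error, feature_index, threshold, polarity)."""
    best = None
    for j, values in enumerate(feature_values):
        for threshold in sorted(set(values)):
            for polarity in (1, -1):
                error = sum(w for v, y, w in zip(values, labels, weights)
                            if stump_predict(v, threshold, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, j, threshold, polarity)
    return best

def update_weights(feature_values, labels, weights, stump):
    """Decrease the weights of correctly classified examples, keep the others,
    then renormalise so that the weights sum to 1."""
    error, j, threshold, polarity = stump
    error = min(max(error, 1e-10), 1 - 1e-10)  # guard against division by zero
    beta = error / (1.0 - error)
    new_weights = [w * beta if stump_predict(v, threshold, polarity) == y else w
                   for v, y, w in zip(feature_values[j], labels, weights)]
    total = sum(new_weights)
    return [w / total for w in new_weights]

# Toy data: two features evaluated on four examples (1 = face, 0 = non-face).
feature_values = [[1, 2, 3, 4], [2, 1, 4, 3]]
labels = [1, 0, 1, 0]
weights = [0.25] * 4
stump = best_stump(feature_values, labels, weights)
new_weights = update_weights(feature_values, labels, weights, stump)
print(stump, new_weights)
```

After the update, the single misclassified example carries half of the total weight, so the next weak classifier is pushed to classify it correctly.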
The basic principle of the Viola-Jones face detection algorithm is to scan the detector many
times through the same image, each time with a new size. Even if an image contains one or
more faces, it is obvious that an excessively large number of the evaluated sub-windows
would still be negatives (non-faces). This realization leads to a different formulation of the
problem: instead of finding faces, the algorithm should discard non-faces.
The thought behind this statement is that it is faster to discard a non-face than to find a face.
With this in mind a detector consisting of only one (strong) classifier suddenly seems
inefficient since the evaluation time is constant no matter the input. Hence the need for a
cascaded classifier arises.
The cascaded classifier is composed of stages each containing a strong classifier. The job of
each stage is to determine whether a given sub-window is definitely not a face or maybe a
face. When a sub-window is classified to be a non-face by a given stage it is immediately
discarded. Conversely a sub-window classified as a maybe-face is passed on to the next stage
in the cascade. It follows that the more stages a given sub-window passes, the higher the
chance the sub-window actually contains a face. The concept is illustrated with two stages in
Figure 3.5.
Figure 3.5: Cascaded classifier
In a single stage classifier one would normally accept false negatives in order to reduce the
false positive rate. However, for the first stages in the staged classifier, false positives are not
considered to be a problem, since the succeeding stages are expected to sort them out.
Therefore Viola-Jones prescribe the acceptance of many false positives in the initial stages.
Consequently, the number of false negatives in the final staged classifier is expected to be
very small.
Viola-Jones also refer to the cascaded classifier as an attentional cascade. This name implies
that more attention (computing power) is directed towards the regions of the image suspected
to contain faces. It follows that when training a given stage, say n, the negative examples
should of course be the false positives generated by stage n-1.
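The early rejection behaviour of the cascade can be sketched in a few lines. This is an illustrative sketch only: each stage is modelled as a function returning True for "maybe a face", and the two stages and the dictionary keys `contrast` and `eye_band` are hypothetical stand-ins for trained strong classifiers.

```python
def cascade_classify(window, stages):
    """Run a sub-window through the stages in order.
    Returns (is_face, number_of_stages_evaluated); a sub-window is discarded
    as soon as one stage classifies it as a non-face."""
    for index, stage in enumerate(stages):
        if not stage(window):
            return False, index + 1
    return True, len(stages)

# Hypothetical stages: a cheap permissive test first, a stricter one after it.
stages = [
    lambda w: w["contrast"] > 0.2,
    lambda w: w["eye_band"] > 0.5,
]

print(cascade_classify({"contrast": 0.1, "eye_band": 0.9}, stages))
print(cascade_classify({"contrast": 0.6, "eye_band": 0.8}, stages))
```

Most sub-windows are non-faces and leave after the first stage, so the average evaluation cost stays far below that of running every stage on every sub-window.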
One of the main advantages of SURF is that it can compute distinctive descriptors quickly. In
addition, the SURF descriptor is invariant to common image transformations including image
rotation, scale changes, illumination changes, and small changes in viewpoint. We base our
detector on the Hessian matrix because of its good performance in computation time and
accuracy. However, rather than using a different measure for selecting the location and the
scale (as was done in the Hessian-Laplace detector [11]), we rely on the determinant of the
Hessian for both. Given a point x = (x, y) in an image I, the Hessian matrix H(x, σ) in x at
scale σ is defined as follows:
\[
H(\mathbf{x}, \sigma) =
\begin{bmatrix}
L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\
L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma)
\end{bmatrix}
\]
where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second order derivative
$\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image I in point x, and similarly for
$L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$. As Gaussian filters are non-ideal
in any case, and given Lowe’s success with LoG approximations, we push the approximation
even further with box filters.
Figure 3.6 (left to right): The discretised and cropped Gaussian second order partial
derivatives in the y-direction and xy-direction, and the approximations thereof using box filters.
The 9×9 box filters in Figure 3.6 are approximations for Gaussian second order derivatives
with σ = 1.2 and represent our lowest scale (i.e. highest spatial resolution). We denote our
approximations by $D_{xx}$, $D_{xy}$, and $D_{yy}$. The relative weight of the filter responses
is theoretically sensitive to scale, but it can be kept constant at 0.9. This yields
\[
\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2
\]
Furthermore, the filter responses are normalised with respect to the mask size.
Scale spaces are usually implemented as image pyramids. The images are repeatedly
smoothed with a Gaussian and subsequently sub-sampled in order to achieve a higher level of
the pyramid. Due to the use of box filters and integral images, we do not have to iteratively
apply the same filter to the output of a previously filtered layer, but instead can apply such
filters of any size at exactly the same speed directly on the original image, and even in
parallel (although the latter is not exploited here). Therefore, the scale space is analysed by
up-scaling the filter size rather than iteratively reducing the image size. The output of the
above 9×9 filter is considered as the initial scale layer, to which we will refer as scale s = 1.2
(corresponding to Gaussian derivatives with σ = 1.2). The following layers are obtained by
filtering the image with gradually bigger masks, taking into account the discrete nature of
integral images and the specific structure of our filters. Specifically, this results in filters of
size 9×9, 15×15, 21×21, 27×27, etc. At larger scales, the step between consecutive filter
sizes should also scale accordingly. Hence, for each new octave, the filter size increase is
doubled (going from 6 to 12 to 24). Simultaneously, the sampling intervals for the extraction
of the interest points can be doubled as well.
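The growth of the filter sizes across octaves can be made concrete with a small sketch. Two points are assumptions rather than statements from the text above: each new octave is taken to start at the second filter size of the previous octave (as in the original SURF description), and the scale of a layer is taken to be proportional to its filter size, with the 9×9 filter at s = 1.2. The function names are hypothetical.

```python
def surf_filter_sizes(octaves=3, layers=4, base_size=9, base_step=6):
    """Box filter side lengths per octave; the size step doubles per octave."""
    sizes = []
    start, step = base_size, base_step
    for _ in range(octaves):
        octave = [start + i * step for i in range(layers)]
        sizes.append(octave)
        start = octave[1]  # assumption: next octave starts at the second size
        step *= 2          # 6 -> 12 -> 24, as stated in the text
    return sizes

def filter_scale(size, base_size=9, base_scale=1.2):
    """Scale s of a filter, assuming s grows in proportion to the filter size."""
    return base_scale * size / base_size

for octave in surf_filter_sizes():
    print(octave, [round(filter_scale(size), 2) for size in octave])
```

Every size is odd, so each filter keeps a central pixel, and the overlap between consecutive octaves means some scales are covered twice.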
As the ratios of our filter layout remain constant after scaling, the approximated Gaussian
derivatives scale accordingly. Thus, for example, our 27×27 filter corresponds to σ = 3 × 1.2 =
3.6 = s. Furthermore, as the Frobenius norm remains constant for our filters, they are already
scale normalised [26]. In order to localise interest points in the image and over scales, a
non-maximum suppression in a 3 × 3 × 3 neighbourhood is applied. The maxima of the
determinant of the Hessian matrix are then interpolated in scale and image space with the
method proposed by Brown et al. [27]. Scale space interpolation is especially important in our
case, as the difference in scale between the first layers of every octave is relatively large.
Figure 3.7 shows an example of the detected interest points using our ’Fast-Hessian’ detector.
The first step consists of fixing a reproducible orientation based on information from a
circular region around the interest point. Then, we construct a square region aligned to the
selected orientation, and extract the SURF descriptor from it. These two steps are now
explained in turn.
3.3.1 Orientation Assignment
Once the wavelet responses are calculated and weighted with a Gaussian (σ = 2.5s) centered
at the interest point, the responses are represented as vectors in a space with the horizontal
response strength along the abscissa and the vertical response strength along the ordinate. The
dominant orientation is estimated by calculating the sum of all responses within a sliding
orientation window covering an angle of π/3. The horizontal and vertical responses within the
window are summed. The two summed responses then yield a new vector. The longest such
vector lends its orientation to the interest point. The size of the sliding window is a
parameter, which has been chosen experimentally. Small sizes fire on single dominating
wavelet responses, large sizes yield maxima in vector length that are not outspoken.
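The sliding window search can be sketched as follows. This is a minimal illustration under stated assumptions: the responses are given as (dx, dy) pairs that have already been Gaussian weighted, the window is moved in 72 discrete steps around the circle (a hypothetical choice), and the function name `dominant_orientation` is not taken from any library.

```python
import math

def dominant_orientation(responses, window=math.pi / 3, steps=72):
    """Return the angle (radians) of the longest summed response vector found
    by sliding an angular window of the given width around the circle.
    responses: list of (dx, dy) wavelet responses, already Gaussian weighted."""
    two_pi = 2.0 * math.pi
    best_length, best_angle = -1.0, 0.0
    for k in range(steps):
        start = two_pi * k / steps
        sum_x = sum_y = 0.0
        for dx, dy in responses:
            angle = math.atan2(dy, dx) % two_pi
            if (angle - start) % two_pi < window:  # response lies in the window
                sum_x += dx
                sum_y += dy
        length = math.hypot(sum_x, sum_y)
        if length > best_length:
            best_length, best_angle = length, math.atan2(sum_y, sum_x)
    return best_angle

# Three strong responses pointing roughly along +x, two weak outliers.
responses = [(1.0, 0.1), (1.0, -0.1), (0.9, 0.0), (0.0, 0.3), (-0.2, 0.0)]
print(dominant_orientation(responses))
```

A much narrower window would lock onto a single strong response, while a much wider one would blend unrelated responses, which matches the remark above that the window size was chosen experimentally.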
For the extraction of the descriptor, the first step consists of constructing a square region
centered around the interest point and oriented along the orientation selected in the previous
section. For the upright version, this transformation is not necessary. The size of this window
is 20s. Examples of such square regions are illustrated in Fig. 2.
The region is split up regularly into smaller 4 × 4 square sub-regions. This keeps important
spatial information in. For each sub-region, we compute a few simple features at 5 × 5
regularly spaced sample points. For reasons of simplicity, we call dx the Haar wavelet
response in the horizontal direction and dy the Haar wavelet response in the vertical direction
(filter size 2s). “Horizontal” and “vertical” here are defined in relation to the selected interest
point orientation. To increase the robustness towards geometric deformations and localisation
errors, the responses dx and dy are first weighted with a Gaussian (σ = 3.3s) centered at the
interest point.
Then, the wavelet responses dx and dy are summed up over each sub-region and form a first
set of entries to the feature vector. In order to bring in information about the polarity of the
intensity changes, we also extract the sum of the absolute values of the responses, |dx| and
|dy|. Hence, each sub-region has a four-dimensional descriptor vector v for its underlying
intensity structure: v = (∑dx, ∑dy, ∑|dx|, ∑|dy|).
Figure 3.8: The descriptor entries of a sub-region represent the nature of the underlying
intensity pattern. Left: In case of a homogeneous region, all values are relatively low.
Middle: In presence of frequencies in x direction, the value of ∑|dx| is high, but all
others remain low. Right: If the intensity is gradually increasing in x direction, both values
∑dx and ∑|dx| are high.
This results in a descriptor vector of length 64 for all 4 × 4 sub-regions. The wavelet responses
are invariant to a bias in illumination (offset). Invariance to contrast (a scale factor) is
achieved by turning the descriptor into a unit vector. Figure 3.8 shows the properties of the
descriptor for three distinctively different image intensity patterns within a sub-region. One
can imagine combinations of such local intensity patterns, resulting in a distinctive descriptor.
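The assembly of the 64-element descriptor can be sketched as below. This is an illustration under stated assumptions, not a complete SURF implementation: the Haar responses dx and dy are assumed to be given as 20 × 20 grids (4 × 4 sub-regions of 5 × 5 samples each), already rotated to the interest point orientation and Gaussian weighted, and the function name `surf_descriptor` is hypothetical.

```python
import math

def surf_descriptor(dx, dy):
    """Build the 64-element descriptor from 20 x 20 grids of Haar responses.
    Each of the 4 x 4 sub-regions contributes (sum dx, sum dy, sum |dx|, sum |dy|);
    the result is normalised to unit length for contrast invariance."""
    descriptor = []
    for block_row in range(4):
        for block_col in range(4):
            s_dx = s_dy = s_adx = s_ady = 0.0
            for r in range(block_row * 5, block_row * 5 + 5):
                for c in range(block_col * 5, block_col * 5 + 5):
                    s_dx += dx[r][c]
                    s_dy += dy[r][c]
                    s_adx += abs(dx[r][c])
                    s_ady += abs(dy[r][c])
            descriptor.extend([s_dx, s_dy, s_adx, s_ady])
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor

# Toy pattern: intensity increasing steadily in x, so dx is constant and dy is zero.
dx = [[1.0] * 20 for _ in range(20)]
dy = [[0.0] * 20 for _ in range(20)]
desc = surf_descriptor(dx, dy)
# The same pattern at three times the contrast gives the same descriptor.
desc_high_contrast = surf_descriptor([[3.0] * 20 for _ in range(20)], dy)
print(len(desc), desc[:4])
```

In this toy case ∑dx and ∑|dx| are equal and high in every sub-region while the dy entries are zero, which corresponds to the gradually increasing intensity case described for Figure 3.8.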
Result
Conclusion