

Unit-V

Pattern Recognition: Introduction, Design principles of a
pattern recognition system, Statistical pattern recognition,
Parameter estimation methods - Principal Component
Analysis (PCA) and Linear Discriminant Analysis (LDA),
Classification techniques - Nearest Neighbor (NN) rule,
Bayes classifier, Support Vector Machine (SVM), K-means
clustering.




PATTERN RECOGNITION: Introduction

Recognition = Re + Cognition

COGNITION: To become acquainted with, to come to know; the act or the
process of knowing an entity (the process of knowing).

RECOGNITION: The knowledge or feeling that the present object has
been met before (the process of knowing again).

PR is the study of ideas and algorithms that provide computers with
a perceptual capability to put abstract objects, or patterns, into
categories in a simple and reliable way.

PATTERN: A pattern is a set of objects, phenomena or concepts
where the elements of the set are similar to one another in certain
ways/aspects. Patterns are described by certain quantities,
qualities, traits, notable features and so on.



[Figure: remote-sensing examples - cloud patterns, forest and
cultivated land, coal mine detection, natural gas detection]

Examples of applications:

Optical Character Recognition (OCR) - Handwritten: sorting letters by
postal code, input device for PDAs. Printed texts: reading machines
for blind people, digitalization of text documents.

Biometrics - Face recognition, verification, retrieval. Fingerprint
recognition. Speech recognition.

Diagnostic systems - Medical diagnosis: X-Ray, EKG analysis. Machine
diagnostics, waste detection.

Military applications - Automated Target Recognition (ATR). Image
segmentation and analysis (recognition from aerial or satellite
photographs).

PATTERN RECOGNITION & CLASSIFICATION PROCESS

Step-1: Stimuli produced by objects are perceived by sensory
devices. The attributes and their relations are used to characterize
an object in the form of a pattern vector X. The range of
characteristic attribute values is known as the measurement
space M.

Step-2: A subset of attributes whose values provide cohesive
object grouping or clustering, consistent with some goals
associated with the object classification, is selected. The range of
this subset of attribute values is known as the feature space F.

Step-3: Using the selected attribute values, object or class
characterization models are learned by forming generalized
prototype descriptions, classification rules or decision functions.
The range of the decision function values or classification rules is
known as the decision space D.

Step-4: Recognition of familiar objects is achieved by applying the
rules learned in Step 3, comparing and matching object features
with the stored models. A minimal sketch of this pipeline is given
below.
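The following is a minimal, hypothetical sketch of Steps 1-4, not taken from the slides: the raw measurements, the two assumed features (mean and standard deviation), and the nearest-prototype decision rule are all invented purely for illustration.

```python
# Hedged sketch of the 4-step process: measurements -> features ->
# learned prototypes -> recognition by matching. Data are synthetic.
import numpy as np

def extract_features(measurement):
    """Steps 1-2: map a raw measurement vector X (measurement space M)
    to a smaller feature vector (feature space F)."""
    x = np.asarray(measurement, dtype=float)
    return np.array([x.mean(), x.std()])   # assumed, illustrative features

def learn_prototypes(samples, labels):
    """Step 3: learn one generalized prototype (mean feature vector)
    per class; these act as the stored models."""
    feats = np.array([extract_features(s) for s in samples])
    return {c: feats[np.array(labels) == c].mean(axis=0)
            for c in set(labels)}

def recognize(measurement, prototypes):
    """Step 4: assign the class whose stored prototype is closest
    (a minimum-distance decision rule over the decision space D)."""
    f = extract_features(measurement)
    return min(prototypes, key=lambda c: np.linalg.norm(f - prototypes[c]))

# toy usage with made-up 1-D signals for two classes
rng = np.random.default_rng(0)
samples = [rng.normal(0, 1, 50) for _ in range(10)] + \
          [rng.normal(5, 2, 50) for _ in range(10)]
labels = ["A"] * 10 + ["B"] * 10
protos = learn_prototypes(samples, labels)
print(recognize(rng.normal(5, 2, 50), protos))   # expected: "B"
```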

PATTERN RECOGNITION APPLICATIONS

Medical diagnosis
Life form analysis
Sonar detection
Radar detection
Image processing
Process control
Information Management systems
Aerial photo interpretation.
Weather prediction
Sensing of life on remote planets.
Behavior analysis
Character recognition
Speech and Speaker recognition etc.

Methodology of PR consists of the following:

1. We observe patterns.
2. We study the relationships between the various patterns.
3. We study the relationships between patterns and ourselves and thus
arrive at situations.
4. We study the changes in situations and come to know about the
events.
5. We study events and thus understand the law behind the events.
6. Using the law, we can predict future events.

EXAMPLE: Astrology/Palmistry

According to this methodology, it consists of the following:

1. We observe the different planets/lines on the hand.
2. We study the relationships between the planets/lines.
3. We study the relations between the positions of planets/lines and
situations in life, and arrive at events.
4. We study the events and understand the law behind the events.
5. Using the law, we can predict the future of a person.

EXAMPLE: DC Machines

According to this methodology, it consists of the following:

1. We observe the patterns like magnetic poles, conductors, core
and so on.
2. We study the relationships between poles, conductors, etc.
3. We study the relationships between patterns and arrive at
voltage, current, etc.
4. We study changes of situation and arrive at events, like rotating
the conductor and a voltage being induced in it because it cuts the
lines of flux.
TYPES OF PATTERNS

1. SPATIAL PATTERNS - These patterns are located in space.
Eg: characters in character recognition
* images of ground covers in remote sensing
* images in medical diagnosis

2. TEMPORAL PATTERNS - These are distributed in time.
Eg: radar signals, speech signals, sonar signals, etc.

3. ABSTRACT PATTERNS - Here the patterns are distributed
neither in space nor in time.
Eg: classification of people based on psychological tests
* medical diagnosis based on medical history and other
medical tests
* classification of people based on the language they speak



APPROACH TO PATTERN RECOGNITION

1. Statistical or decision-theoretic or discriminant approach.
2. Syntactic or grammatical or structural approach.

1. Statistical Pattern Recognition
The data is reduced to vectors of numbers, and statistical
techniques are used for the tasks to be performed.

2. Structural Pattern Recognition
The data is converted to a discrete structure (such as a
grammar or a graph), and the techniques are related to
computer science subjects (such as parsing and graph
matching).
STATISTICAL APPROACH

Patterns → Transducer → Feature Extraction and Feature Selection → Learning → Classification → Results

Fig 1.1: Block diagram representation of the statistical approach

Transducer: It is used for making measurements of the various
attributes of the pattern.
Feature Extractor: From the measurements, it extracts the features
that are required for describing and classifying the pattern.
Feature Selector: Depending on the problem, the feature selector
selects the minimum number of features that are sufficient to
classify the pattern.
STATISTICAL APPROACH
There are two feature selection methods.

1. Transformation Method:
Here we reduce the features by considering linear or nonlinear
combinations of the original features. This is also called the
aggregation method.
Eg: let us assume we originally have four features f1, f2, f3, f4.
One way of reducing them to two features is
f5 = f1 + f2
f6 = f3 + f4
2. Subsetting or Filtering Method:
Here we select a subset of the original features.
Eg: original features are f1, f2, f3, f4.
We can select a subset like
f5 = f1 and f6 = f3 (see the sketch below).

Learning: It is the process of determining useful parameters that are
required for classifying the patterns efficiently.
Classifying: Here the patterns are assigned to different classes using a
suitable classification method.
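A small, hedged numpy illustration of the two feature-selection methods, using the made-up features f1..f4 from the example above:

```python
# Illustration of transformation vs. subsetting on toy feature vectors.
import numpy as np

# each row is one pattern with original features [f1, f2, f3, f4]
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 0.5, 3.5]])

# 1. Transformation (aggregation): f5 = f1 + f2, f6 = f3 + f4
T = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
X_transformed = X @ T        # shape (n_patterns, 2)

# 2. Subsetting (filtering): keep f1 and f3 only
X_subset = X[:, [0, 2]]      # shape (n_patterns, 2)

print(X_transformed)
print(X_subset)
```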
PRINCIPAL COMPONENT
ANALYSIS (PCA)



Why Principal Component Analysis?

Motive:
Find bases which have high variance in the data.
Encode the data with a small number of such bases, with low mean
squared error (MSE).



Derivation of PCs

Assume that the data are zero-mean: E[x] = 0.

Project x onto a unit vector q:
a = qᵀx = xᵀq,  with ||q|| = (qᵀq)^(1/2) = 1

Variance of the projection:
σ² = E[a²] − (E[a])² = E[a²] = E[(qᵀx)(xᵀq)] = qᵀE[xxᵀ]q = qᵀRq

Find the q's maximizing this!
The principal components q can be obtained by eigenvector
decomposition of R, e.g. via SVD:

R = QΛQᵀ,  Q = [q1, q2, ..., qj, ..., qm],  Λ = diag[λ1, λ2, ..., λj, ..., λm]

Rqj = λj qj,  j = 1, 2, ..., m,  i.e.  Rq = λq
Dimensionality Reduction
(1/2)

We can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you
don't lose much:
* n dimensions in the original data
* calculate n eigenvectors and eigenvalues
* choose only the first p eigenvectors, based on their eigenvalues
* the final data set has only p dimensions
(A small sketch of this selection step is given below.)
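Continuing the earlier PCA sketch (same assumed variables X, eigvals, eigvecs), this is one way to keep only the first p eigenvectors and project the data onto them:

```python
# continuation of the PCA sketch: keep the p most significant components
p = 1                                   # chosen from the eigenvalue spectrum
Q_p = eigvecs[:, :p]                    # first p eigenvectors (columns)
X_reduced = X @ Q_p                     # final data set with only p dimensions
print(X_reduced.shape)                  # (1000, p)
explained = eigvals[:p].sum() / eigvals.sum()
print(f"variance retained: {explained:.2%}")
```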
Dimensionality Reduction
(2/2)

[Scree plot: variance captured as a function of the number of dimensions retained]


Reconstruction from
PCs

[Figure: reconstructions of an image from q = 1, 2, 4, 8, 16, 32, 64 and 100
principal components, compared with the original image]


LINEAR DISCRIMINANT
ANALYSIS (LDA)



Limitations of PCA

Are the maximal-variance dimensions necessarily the relevant
dimensions for preservation (i.e., for discriminating between the classes)?



Linear Discriminant Analysis (1/6)

What is the goal of LDA?

Perform dimensionality reduction while preserving as much of the
class-discriminatory information as possible.
Seek directions along which the classes are best separated.
Take into consideration not only the scatter within classes but also
the scatter between classes.
For example, in face recognition LDA is more capable of
distinguishing image variation due to identity from variation due to
other sources such as illumination and expression.



Linear Discriminant
Analysis (2/6)

Within-class scatter matrix:
Sw = Σ(i=1..c) Σ(j=1..ni) (Yj − Mi)(Yj − Mi)ᵀ

Between-class scatter matrix:
Sb = Σ(i=1..c) (Mi − M)(Mi − M)ᵀ

where Yj are the samples, Mi is the mean of class i, M is the overall
mean, c is the number of classes, and ni is the number of samples in
class i.

Projection matrix U:  y = Uᵀx

LDA computes a transformation that maximizes the between-class
scatter while minimizing the within-class scatter:

max |S̃b| / |S̃w| = max |UᵀSbU| / |UᵀSwU|   (products of eigenvalues!)

which leads to the eigenvalue problem  Sw⁻¹Sb U = UΛ,
where S̃b, S̃w are the scatter matrices of the projected data y.
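A hedged numpy sketch of these definitions (synthetic two-class data; all names are illustrative): build Sw and Sb from the class means and solve the eigenproblem Sw⁻¹Sb U = UΛ.

```python
# Hedged sketch of LDA via the scatter matrices defined above.
import numpy as np

rng = np.random.default_rng(1)
# synthetic 2-D data, two classes
Y = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([4, 1], 1.0, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

M = Y.mean(axis=0)                       # overall mean
Sw = np.zeros((2, 2))                    # within-class scatter
Sb = np.zeros((2, 2))                    # between-class scatter
for c in np.unique(labels):
    Yc = Y[labels == c]
    Mc = Yc.mean(axis=0)
    Sw += (Yc - Mc).T @ (Yc - Mc)
    d = (Mc - M).reshape(-1, 1)
    Sb += d @ d.T                        # some texts weight this by the class size ni

eigvals, U = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
u = np.real(U[:, np.argmax(np.real(eigvals))])   # most discriminative direction
y_proj = Y @ u                           # projected data y = uᵀx
print(u)
```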
Linear Discriminant Analysis (3/6)

Does Sw⁻¹ always exist?

If Sw is non-singular, we can obtain a conventional eigenvalue
problem by writing:
Sw⁻¹Sb U = UΛ

In practice, Sw is often singular, since the data are image vectors
with large dimensionality while the size of the data set is much
smaller (M << N).

Note: Since Sb has at most rank C−1, the maximum number of
eigenvectors with non-zero eigenvalues is C−1 (i.e., the maximum
dimensionality of the sub-space is C−1).



Linear Discriminant
Analysis (4/6)
Does Sw⁻¹ always exist? (cont.)

To alleviate this problem, we can use PCA first:

1) PCA is first applied to the data set to reduce its
dimensionality.

2) LDA is then applied to find the most discriminative directions
(a sketch of this two-stage pipeline follows):
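A hedged scikit-learn sketch of the PCA-then-LDA idea (assuming scikit-learn is available; the dataset and component counts are illustrative, not from the slides):

```python
# Hedged sketch: PCA to reduce dimensionality, then LDA for discrimination.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)      # 64-dimensional image vectors, 10 classes

pca_lda = make_pipeline(
    PCA(n_components=40),                # 1) reduce dimensionality so Sw is non-singular
    LinearDiscriminantAnalysis(n_components=9),  # 2) at most C-1 = 9 discriminant directions
)
Z = pca_lda.fit_transform(X, y)          # most discriminative low-dimensional features
print(Z.shape)                           # (n_samples, 9)
```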



Linear Discriminant Analysis (5/6)

[Figure: comparison of PCA and LDA projections of the same data]

D. Swets and J. Weng, "Using Discriminant Eigenfeatures for Image
Retrieval", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 18, no. 8, pp. 831-836, 1996.
Linear Discriminant
Analysis (6/6)
Factors unrelated to classification:
MEF (most expressive feature) vectors show the tendency of PCA to
capture major variations in the training set, such as lighting direction.
MDF (most discriminating feature) vectors discount those factors
unrelated to classification.

D. Swets and J. Weng, "Using Discriminant Eigenfeatures for Image
Retrieval", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 18, no. 8, pp. 831-836, 1996.
PCA vs LDA vs ICA: A short
Review

PCA: proper for dimension reduction.
LDA: proper for pattern classification if the number of training
samples of each class is large.
ICA: proper for blind source separation, or for classification using
independent components (ICs) when the class id of the training data
is not available.

Is LDA always better than PCA?

There has been a tendency in the computer vision community to
prefer LDA over PCA, mainly because LDA deals directly with
discrimination between classes while PCA does not pay attention
to the underlying class structure.
However, when the training set is small, PCA can outperform LDA.
When the number of samples is large and representative for each
class, LDA outperforms PCA.
SVM SUPPORT VECTOR
MACHINE

A classification method for both linear and nonlinear data.
It uses a nonlinear mapping to transform the original training data
into a higher dimension.
In the new dimension, it searches for the linear optimal separating
hyperplane (i.e., decision boundary).
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane.
SVM finds this hyperplane using support vectors (essential training
tuples) and margins (defined by the support vectors). A hedged
example of a kernel SVM is given below.
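A minimal scikit-learn sketch of this idea (assuming scikit-learn; the dataset and parameters are illustrative): an RBF-kernel SVC implicitly maps the data into a higher-dimensional space and finds a separating hyperplane there.

```python
# Hedged sketch: nonlinear (kernel) SVM on data that is not linearly
# separable in the original 2-D space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # implicit nonlinear mapping
clf.fit(X, y)
print(clf.score(X, y))                # close to 1.0 on this toy data
print(clf.support_vectors_.shape)     # the support vectors defining the margin
```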



SVM HISTORY &
APPLICATION

Vapnik and colleagues (1992) - groundwork from Vapnik &
Chervonenkis' statistical learning theory of the 1960s.
Features: training can be slow, but accuracy is high owing to the
ability to model complex nonlinear decision boundaries (margin
maximization).
Used both for classification and prediction.
Applications: handwritten digit recognition, object recognition,
speaker identification, benchmark time-series prediction tests.



SVM GENERAL PHILOSOPHY

[Figure: two separating hyperplanes for the same data, one with a
small margin and one with a large margin; the support vectors lie
on the margin boundaries]


SVM MARGINS & SUPPORT
VECTORS



SVM When Data Is Linearly
Separable

Let the data D be (X1, y1), ..., (X|D|, y|D|), where Xi is a training
tuple and yi is its associated class label.
There are infinitely many lines (hyperplanes) separating the two
classes, but we want to find the best one (the one that minimizes
classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the
maximum marginal hyperplane (MMH).
SVM Linearly Separable

A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).

For 2-D data it can be written as
w0 + w1 x1 + w2 x2 = 0

The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1

Any training tuples that fall on the hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors.

This becomes a constrained (convex) quadratic optimization
problem: a quadratic objective function with linear constraints,
solved by quadratic programming (QP) using Lagrangian multipliers.
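A hedged scikit-learn sketch of the linear case (toy data; names are illustrative): fit a linear SVM, read off W and b, and check that the support vectors lie on the margin hyperplanes H1/H2.

```python
# Hedged sketch: linear SVM, its hyperplane W·X + b = 0, and its support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=7)

clf = SVC(kernel="linear", C=1e3)     # large C approximates a hard margin
clf.fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]          # hyperplane W·X + b = 0
margins = X[clf.support_] @ W + b               # support vectors lie on the margin
print(np.round(margins, 2))                     # approximately ±1 (H1 and H2)
                                                # when the classes are separable
```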



SVM Why Is SVM Effective on High Dimensional
Data?

The complexity of the trained classifier is characterized by the
number of support vectors rather than by the dimensionality of the
data.
The support vectors are the essential or critical training examples;
they lie closest to the decision boundary (MMH).
If all other training examples were removed and training repeated,
the same separating hyperplane would be found.
The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier,
which is independent of the data dimensionality.
Thus, an SVM with a small number of support vectors can have
good generalization, even when the dimensionality of the data is
high.
K NEAREST NEIGHBOUR ALGORITHM

1. All instances correspond to points in the n-D space.
2. The nearest neighbors are defined in terms of Euclidean distance,
dist(X1, X2).
3. The target function could be discrete- or real-valued.
4. For discrete-valued targets, k-NN returns the most common value
among the k training examples nearest to xq (see the sketch after
the figure below).
5. Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples:

[Figure: Voronoi decision surface induced by 1-NN around a query point xq,
with positive (+) and negative (−) training examples]
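A minimal sketch of the discrete-valued k-NN rule described above (pure numpy, synthetic data; names are illustrative):

```python
# Hedged sketch of the discrete-valued k-NN rule.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, xq, k=3):
    """Return the most common label among the k training examples
    nearest to the query xq (Euclidean distance)."""
    dists = np.linalg.norm(X_train - xq, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(["-", "-", "-", "+", "+", "+"])
print(knn_predict(X_train, y_train, np.array([5.2, 5.1]), k=3))   # "+"
```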
Discussion on the k-NN Algorithm

k-NN for real-valued prediction for a given unknown tuple:
returns the mean value of the k nearest neighbors.

Distance-weighted nearest-neighbor algorithm:
weight the contribution of each of the k neighbors according to
its distance to the query xq, giving greater weight to closer
neighbors (a sketch is given below).

k-NN is robust to noisy data because it averages over the k nearest
neighbors.

Curse of dimensionality: the distance between neighbors can be
dominated by irrelevant attributes. To overcome this, stretch the
axes or eliminate the least relevant attributes.
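A hedged sketch of the distance-weighted variant for real-valued prediction, with weights 1/d² as one common choice (data and names are illustrative):

```python
# Hedged sketch: distance-weighted k-NN for real-valued prediction.
import numpy as np

def knn_regress_weighted(X_train, y_train, xq, k=3, eps=1e-12):
    """Weighted mean of the k nearest targets, weight = 1 / distance^2."""
    dists = np.linalg.norm(X_train - xq, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)   # closer neighbors weigh more
    return np.sum(w * y_train[nearest]) / np.sum(w)

# toy usage with real-valued targets
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_real = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 4.9])
print(knn_regress_weighted(X_train, y_real, np.array([5.2, 5.1]), k=3))
```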



