
A Written Report About Bayesian Decision Theory in Pattern Recognition

Aaron Carl T. Fernandez


MSCS – Mapua University

Introduction
One of the goals of pattern recognition is to obtain an optimal decision rule for classifying data into their respective categories. Among the existing decision rules in pattern recognition, such as Chow's rule and the nearest-neighbor rule, Bayesian decision theory is often regarded as the optimal one (Bow, 2002).

The Bayesian approach describes categories by probability distributions over the attributes of the objects, specified by a model function and its parameters. It also has several advantages over other methods, among them: the number of categories can be determined automatically, objects are not assigned to categories absolutely, all attributes are potentially significant, and data can be real-valued or discrete.

This written report presents a review of Bayesian decision theory in pattern recognition. Decision theories deal with the development of methods and techniques that are appropriate for making decisions in an optimal fashion. The optimality of the Bayesian approach is exemplified in this paper by surveying real-world applications in artificial intelligence and pattern recognition research.

The survey revealed that, more often than not, the Bayesian approach outperforms the other machine learning models applied to the task at hand. This paper enumerates five real-world examples of applying Bayesian decision-driven machine learning to English letter recognition, a computer-vision application, spam filtering, database clustering, and association football prediction.

Bayesian Decision Theory


Bayes decision theory is a fundamental statistical approach to pattern classification problems. It quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions, and is based on the Bayes rule:

$P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$

Formula 1 – Bayes rule using probability density functions

It is an expression of conditional probabilities, which represent the chance of an event occurring given some evidence. The formula above can be translated into layman's terms as:

$\text{posterior probability} = \dfrac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}$

Formula 2 – Bayes rule in layman's terms
As shown in the simplified formula above, the posterior probability is the output of the Bayes rule. Specifically, it states the probability of an event occurring (or a condition being true) given specific evidence. Hence the posterior probability is written as P(ωj|x), where ωj is one of a finite set of possible states of nature and x is the feature vector, a point in the feature space.

The likelihood p(x|ωj) is the probability of observing the feature vector given a specific class, while the prior probability is the initial reflection of how likely a certain class is expected to be before the actual observation.

The evidence, denoted p(x), is usually considered a scaling term, which the Bayes theorem states is equivalent to the following formula:

$p(x) = \sum_{j=1}^{n} p(x \mid \omega_j)\,P(\omega_j)$

Formula 3 – Scaling term formula
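To make Formulas 1 to 3 concrete, the following is a minimal Python sketch that computes posterior probabilities for two states of nature from their likelihoods and priors; the numerical values are illustrative placeholders, not taken from any dataset.

```python
# Minimal sketch of the Bayes rule for two states of nature (illustrative numbers only).
likelihoods = {"w1": 0.6, "w2": 0.3}   # p(x | w_j): likelihood of the observation x under each class
priors      = {"w1": 0.4, "w2": 0.6}   # P(w_j): prior probability of each class

# Formula 3: the evidence p(x) is the scaling term, summed over all states of nature.
evidence = sum(likelihoods[w] * priors[w] for w in likelihoods)

# Formula 1: posterior = likelihood * prior / evidence.
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in likelihoods}

print(posteriors)  # the posteriors sum to 1 by construction
```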

Continuous and Discrete Bayes


The continuous case involves feature vectors that can take any value in the dimensional space, such as the temperature of a room or a current bank balance, including decimal values, while the discrete case deals with a limited or finite number of possible observations, such as the state of a light switch, which can only be on (1) or off (0). This leads to minor differences in the way a decision rule is calculated: the continuous case uses probability density functions (see Formula 1) while the discrete case uses probability distributions (see Formula 4).

$P(\omega_j \mid x) = \dfrac{P(x \mid \omega_j)\,P(\omega_j)}{P(x)}$

Formula 4 – Bayes rule using probability distributions

Nonetheless, the Bayes decision rule remains unchanged for both cases as its purpose is to
minimize the risk or cost in the decision.

The Risk Function


For example, suppose we are considering insuring a business worth 1,000,000 pesos against fire, where the insurance premium costs 100,000 pesos a year. If we decide not to insure it and a fire occurs, the cost is 1,000,000 pesos, but if a fire does not occur, we gain 100,000 pesos by not availing of the insurance. The key is to take the action that leads to the minimum risk, hence the need to calculate the risk of every possible decision.

Each action a_i has an associated risk R(a_i | x), and λ(a_i | ω_j) is the loss incurred for taking action a_i when the true state of nature is ω_j. The conditional risk, which is computed in order to minimize the overall risk, is the same for both the continuous and discrete cases:
$R(a_i \mid x) = \sum_{j=1}^{n} \lambda(a_i \mid \omega_j)\,P(\omega_j \mid x)$

Formula 5 – Conditional risk

The risk given our observations for every action is the sum of all losses for that action given all
the states, weighted by the probability of occurrence of each state. The action with the minimum
risk is then selected.
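As a hedged illustration of Formula 5, the sketch below casts the insurance example as a loss table and picks the minimum-risk action; the fire probability used is an assumed placeholder, not a figure given in this report.

```python
# Sketch of a minimum-risk decision (Formula 5) for the insurance example.
p_fire = 0.05                                    # assumed, illustrative probability of a fire
posteriors = {"fire": p_fire, "no_fire": 1.0 - p_fire}

# Loss lambda(a_i | w_j) in pesos: one row per action, one column per state of nature.
loss = {
    "insure":        {"fire": 100_000,   "no_fire": 100_000},   # the premium is paid either way
    "do_not_insure": {"fire": 1_000_000, "no_fire": -100_000},  # lose the business, or save the premium
}

# Conditional risk R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x).
risk = {a: sum(loss[a][w] * posteriors[w] for w in posteriors) for a in loss}

best_action = min(risk, key=risk.get)            # take the action with the minimum risk
print(risk, best_action)
```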

The Two-Category Case


A dichotomizer is a classifier that places a pattern into one of two possible categories. Given two discriminant functions g1(x) and g2(x), x will be assigned to ω1 if g1(x) > g2(x). This enables the following definition:

$g(x) = g_1(x) - g_2(x)$

Formula 6 – Dichotomizer formula

Given the formula above, we can decide ω1 if g(x) > 0, which gives us two forms of the discriminant function:
$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$

$g(x) = \ln\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$

Formula 7 – Two forms of the two-category discriminant function

If the feature vector is binary and the features are assumed (correctly or incorrectly) to be independent, a simplified Bayes rule can be employed:
$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$

where

$w_i = \ln\dfrac{p_i(1 - q_i)}{q_i(1 - p_i)}, \quad i = 1, \ldots, d$

and

$w_0 = \sum_{i=1}^{d} \ln\dfrac{1 - p_i}{1 - q_i} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$

Formula 8 – Two-category discriminant function for independent binary feature vectors

It is important to note that w_i and w_0 are the weights of the linear discriminant. The discriminant function g(x) above indicates whether the current feature vector belongs to class 1 or class 2. A decision boundary lies wherever g(x) = 0; this boundary can be a line or a hyperplane, depending on the dimension of the feature space.
Example of a Two-Category Case Problem

Figure 1 – Observed data and its possible classes

Consider the three-dimensional binary feature vector x = (x1, x2, x3) = (0, 1, 1) of the observed input above, which we will attempt to classify as belonging to class 1 or class 2. The prior probabilities are P(ω1) = 0.6 and P(ω2) = 0.4, so there is already an evident bias towards class 1.

Also, the likelihoods of the individual features are p = {0.8, 0.2, 0.5} and q = {0.2, 0.5, 0.9}. Since the problem definition assumes that the features are independent, the discriminant function can be calculated as follows:

$w_1 = \ln\dfrac{0.8(1-0.2)}{0.2(1-0.8)} = 2.77; \quad w_2 = \ln\dfrac{0.2(1-0.5)}{0.5(1-0.2)} = -1.39; \quad w_3 = \ln\dfrac{0.5(1-0.9)}{0.9(1-0.5)} = -2.19$

$w_0 = \ln\dfrac{0.6}{0.4} + \ln\dfrac{1-0.8}{1-0.2} + \ln\dfrac{1-0.2}{1-0.5} + \ln\dfrac{1-0.5}{1-0.9} = 1.0986$

$g(x) = 2.77x_1 - 1.39x_2 - 2.19x_3 + 1.0986$

Plugging the x_i values into the discriminant function gives g(x) = −2.4849. Since g(x) = −2.4849 < 0, the feature vector x = (0, 1, 1) belongs to class 2.
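The worked example can be reproduced with a short script; this is only a check of the arithmetic of Formula 8 using the priors and likelihoods stated above.

```python
import math

# Priors and per-feature likelihoods from the worked example above.
P_w1, P_w2 = 0.6, 0.4
p = [0.8, 0.2, 0.5]        # p_i = P(x_i = 1 | class 1)
q = [0.2, 0.5, 0.9]        # q_i = P(x_i = 1 | class 2)
x = [0, 1, 1]              # observed binary feature vector

# Formula 8: weights of the linear discriminant for independent binary features.
w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(P_w1 / P_w2)

g = sum(wi * xi for wi, xi in zip(w, x)) + w0
print(round(g, 4), "class 1" if g > 0 else "class 2")   # g ≈ -2.4849, so x belongs to class 2
```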

Higher-Dimensional Cases


The problem becomes more difficult when there are more than two potential classes into which the data can be classified. The procedure presented above does not directly yield the correct answer, since the likelihood ratio in the discriminant function compares only two possible states. However, a neat trick is, instead of determining which of the multiple classes {1, 2, …, n} a feature vector belongs to, to translate the problem into a binary classification and determine the probability that it belongs to a certain class i or not.

This is accomplished by setting g1(x) = gi(x) and g2(x) = g_not i(x). The probabilities for g2(x) can be obtained by summing the probabilities of the classes {1, …, i − 1, i + 1, …, n}. If x belongs to class i, then gi(x) > g_not i(x); otherwise x belongs to some other class.
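A minimal sketch of this one-versus-rest reduction is given below; the posterior values are invented placeholders, and the point is only that the probability of "not i" is obtained by summing over the remaining classes.

```python
# One-versus-rest reduction of a multi-class Bayesian decision (illustrative posteriors only).
posteriors = {1: 0.10, 2: 0.55, 3: 0.20, 4: 0.15}   # P(w_j | x) for n = 4 classes

def belongs_to(i, posteriors):
    """Decide class i versus 'not i' by comparing g_i(x) with g_not_i(x)."""
    g_i = posteriors[i]
    g_not_i = sum(p for j, p in posteriors.items() if j != i)   # sum over the remaining classes
    return g_i > g_not_i

for i in posteriors:
    print(i, belongs_to(i, posteriors))                         # only class 2 exceeds its complement
```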

Literature Survey
English Letter Classification Using Bayesian Decision Theory and Feature Extraction Using
Principal Component Analysis
Husnain and Naweed (2009) utilized Bayesian decision theory to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts, with each image randomly distorted to produce a file of 20,000 unique instances.

The image dataset used in the research was donated by David J. Slate and P. W. Frey in 1991 to the UCI data repository. Different distortion techniques, such as compression and changes in aspect ratio along the x and y axes, were applied to add a tolerable amount of noise to the dataset. For each black-and-white image of an English letter, the authors extracted a 16-dimensional feature vector to summarize the letter image.

The feature vector contains characteristic features of the image, such as the vertical and horizontal position of the rectangular box containing the letter, the total number of ON pixels, and the edge count. Each instance was converted into 16 primitive numerical attributes, such as means, variances, moments, and covariances, scaled to fit into an integer range from 0 to 15. The details of the 16 attributes are as follows:

Letter: capital letter (26 values, A to Z)
X-box: horizontal position of box (integer)
Y-box: vertical position of box (integer)
Width: width of box (integer)
Height: height of box (integer)
Onpix: total number of ON pixels (integer)
X-bar: mean x of ON pixels in box (integer)
Y-bar: mean y of ON pixels in box (integer)
X2bar: mean x variance (integer)
Y2bar: mean y variance (integer)
XYbar: mean x-y correlation (integer)
X2ybr: mean of x * x * y (integer)
Xy2br: mean of x * y * y (integer)
X-edge: mean edge count, left to right (integer)
Xegvy: correlation of x-edge with y (integer)
Y-edge: mean edge count, bottom to top (integer)
Yegvx: correlation of y-edge with x (integer)

Figure 2 – Letter recognition features

In the first set of experiments, 14,000 items were used as training data and the remaining 6,000 as the test set, wherein an instance is selected at random and fed to the classifier to check its corresponding letter class. This achieved 92% accuracy when 100 random input instances were introduced, of which only 8 were misclassified.

The training data was increased from 14,000 to 16,000 in the second set of experiments, which reduced the error rate to 2%, with only 2 of the 100 random character inputs misclassified. The results were far better than what Frey and Slate (1991) achieved using a Holland-style adaptive classifier for letter recognition, which had only 80% accuracy.

The research also revealed that the English letters 'N' and 'H' have almost the same shape and the same number of ON pixels, resulting in similar posterior probabilities:

Figure 3 – Posterior probability of the English letters 'H' (left) and 'N' (right)

The features were reduced from 16 to 8 using principal component analysis (PCA), an eigenvector/eigenvalue-based approach for reducing the dimensionality of multivariate data. It identifies patterns in the data and expresses the data in a way that highlights their similarities and differences. The accuracy of the Bayesian decision theory classifier was checked again on 100 random inputs and reached 98%, with 16,000 instances kept as the training data. As shown in the scree plot below, the first 8 principal components accounted for nearly 90% of the variance, while the remaining components contributed little to the classification and their variance values diminished.

Figure 4 – Scree plot of the principal components and their variances
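A hedged sketch of this pipeline is shown below, using scikit-learn's PCA and Gaussian naive Bayes as stand-ins for the authors' implementation; the local file name, the simple 16,000/4,000 split, and the choice of Gaussian class-conditional densities are all assumptions for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Assumed local copy of the UCI letter-recognition file: the first column is the letter,
# the remaining 16 columns are the integer attributes listed in Figure 2.
cols = ["letter"] + [f"f{i}" for i in range(16)]
data = pd.read_csv("letter-recognition.data", header=None, names=cols)

X, y = data[cols[1:]].values, data["letter"].values
X_train, X_test, y_train, y_test = X[:16000], X[16000:], y[:16000], y[16000:]

# Reduce the 16 attributes to 8 principal components, as in the study.
pca = PCA(n_components=8).fit(X_train)
X_train_p, X_test_p = pca.transform(X_train), pca.transform(X_test)

# Gaussian naive Bayes stands in for the authors' Bayesian decision rule.
clf = GaussianNB().fit(X_train_p, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_p)))
```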

It can be concluded that principal component analysis and Bayesian decision theory together can give efficient results in document analysis and were shown to be more effective than the Holland-style and statistical adaptive similarity classifiers on letter recognition tasks.

A Vision-based Method for Weeds Identification through the Bayesian Decision Theory
Tellaeche, Burgos-Artizzu, Pajares, and Ribeiro (2008) developed an automatic computer-vision-based approach for the detection and differential spraying of weeds in corn crops. Their strategy involves an image segmentation process that divides the incoming image into cells and extracts features and attributes to be used in a decision-making procedure based on the computation of a posterior probability under a Bayesian framework, wherein the prior probability is computed from the dynamics of the tractor in which the method is embedded. The decision to be made is whether a cell is to be sprayed or not, and it requires the existence of a database containing a set of samples classified as items to be sprayed or not, which can be built offline or online.

The knowledge base is built during the offline stage, and the decisions made during the online stage are based upon it. The image segmentation process is identical for both the offline and online stages.

The training process is carried out during the offline stage, while during the online stage new images are processed so that a decision is made about them; the results are then added to the knowledge base that was built and estimated during the offline stage.
Figure 5 – Vision-based segmentation scheme and decision process.

A set of 340 digital images acquired with an HP R817 digital camera during four different days in May 2006 and April/May 2007 was used to assess the validity and performance of the proposed approach. Eighteen video sequences, acquired at a rate of 15 frames per second and driven by the tractor motion, were selected. Ten frames were extracted from each video sequence, so that the kth frame in sequence i is denoted f_i^k, where k = 1, …, 10 and i = 1, …, 18. Given two consecutive frames, f_i^k and f_i^{k+1}, these differ by 3u image rows. Assuming the origin of coordinates is at the bottom-left corner, row number 1 in f_i^{k+1} matches row 3u in f_i^k, where u is a constant parameter set to 50.

The fourth and fifth rows of cells in frame f_i^k expand into the first, second, and third rows of cells in frame f_i^{k+1}, which implies that the final spraying decision should be made for the first, second, and third rows of cells, while the fourth and fifth rows are used for computing the prior probability for the next frame. It should also be noted that the tractor speed is fixed at 4 km/h, which implies that 12 m are covered in about 11 seconds; hence the time elapsed between frames f_i^k and f_i^{k+1} is about 11 seconds.

The authors designed a test strategy that involves an initialization step labeled STEP 0. This step simulates the offline phase with 160 images and was estimated by cross-validation, with 256 cells in the training set and 48 cells in the validation set, both randomly selected. Five training processes were performed, each using a different set for validation and the remaining cells as training data. This guarantees that the number of training samples is always greater than or equal to 256. For each validation set, k was varied and the error was computed. The errors were averaged for each set and for each k. The best performance is obtained for the minimum mean error, which occurred at k = 0.3.

Figure 6 – Minimum mean error obtained

For STEPS 1 to 3, a decision is made in each frame k for the six cells in its bottom part, with each cell described by its area-vector of attributes x. After the decision, a set S_Y of cells belonging to ω_y that are to be sprayed and a set S_N of cells belonging to ω_n that do not require spraying are obtained. Prior probabilities are initially set to 0.5; otherwise, the prior probabilities are the posterior probabilities computed for the four preceding cells in the previous frame.

The knowledge base is updated by adding both sets of cells (S_Y and S_N) to the previous entries, classifying all cells as belonging to ω_y or ω_n; the updated knowledge base is then used to obtain a new estimate of the class-conditional probability density functions.

The performance is established by comparing the criterion of farmers and technical consultants against the results obtained for each test: the number of cells correctly identified as requiring spraying is denoted True Spraying (TS), the number of cells correctly detected as not requiring spraying is True No Spraying (TN), the number of cells that do not require spraying but are identified as cells to be sprayed is False Spraying (FS), and the number of cells requiring spraying that the method identifies as not requiring spraying is False No Spraying (FN).
Figure 7 – Number of images and number of cells to be sprayed or not, according to the Bayesian classifier.

Figure 8 – Correct classification percentage and Yule score values for the tests and steps.

The correct classification percentage is computed as

$CCP = \dfrac{TS + TN}{TS + FS + TN + FN}$

while the Yule coefficient is

$Yule = \left| \dfrac{TS}{TS + FS} + \dfrac{TN}{TN + FN} - 1 \right|$

Figures 7 and 8 show that the best performance was achieved by Test 3 in STEP 3 and that the worst performer was Test 1. The best performance being obtained in STEP 3 is due to the degree of learning performed; as the tables also show, the performance improves as the learning progresses.
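Both scores follow directly from the four confusion counts; the small sketch below simply encodes the two formulas above, with placeholder counts that are not taken from the paper's tables.

```python
def ccp(TS, TN, FS, FN):
    """Correct classification percentage."""
    return (TS + TN) / (TS + FS + TN + FN)

def yule(TS, TN, FS, FN):
    """Yule coefficient built from the spraying and no-spraying precisions."""
    return abs(TS / (TS + FS) + TN / (TN + FN) - 1)

# Illustrative counts only.
counts = dict(TS=120, TN=200, FS=15, FN=10)
print(ccp(**counts), yule(**counts))
```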

Overall, the research was successful in developing an automated decision-making process for detecting weeds in corn crops using Bayesian decision theory. Although the robustness of the proposed approach against illumination variability is still in question according to the paper, it nevertheless achieves an important saving in cost and pollution.

A Bayesian Approach to Filtering Junk E-mail


Sahami, Dumais, Heckerman, and Horvitz (1998) employed Bayesian classification techniques for the problem of junk e-mail filtering. Phrasal features such as the appearance of "FREE!", "only $", and "be over 21" were included in the feature space, as well as non-textual features such as the domain type of the sender and recipient of the message. Other distinctions, such as whether a message has an attachment, the percentage of non-alphanumeric characters in the subject of the mail message, or when a given message was received, were also considered powerful distinguishers in the study.
Due to the large feature space, the authors employed feature selection to reduce the dimension of the feature space and to attenuate the degree to which the independence assumption is violated by the naïve Bayesian classifier.
The first set of experiments dealt with the efficacy of the hand-crafted features, using a corpus of 1,789 actual e-mail messages, of which 1,578 were pre-classified as "junk" and 211 as "legitimate". The data was split into a training set of 1,538 messages and a testing set of 251 messages. Word-based tokens in the subject and body of each e-mail were first considered as the feature set, which was then augmented with approximately 35 hand-crafted phrasal features and finally further enhanced with 20 non-textual domain-specific features for junk e-mail detection.

A decision-theoretic notion of cost-sensitive classification was adopted, as the cost of misclassifying a legitimate e-mail as junk far outweighs the cost of marking a piece of junk as legitimate. With this, a message is only classified as junk if the probability that it would be placed in the junk class is greater than 99.9%.
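A hedged sketch of this cost-sensitive decision is given below, using scikit-learn's multinomial naive Bayes over bag-of-words counts; the 0.999 threshold mirrors the 99.9% criterion, while the tiny corpus and the feature set are invented placeholders rather than the authors' data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the 1,789-message data set described above.
messages = ["FREE! win money now", "meeting notes attached", "only $99, be over 21", "lunch tomorrow?"]
labels   = ["junk", "legit", "junk", "legit"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(messages), labels)

def classify(msg, threshold=0.999):
    """Mark a message as junk only if P(junk | msg) exceeds the 99.9% threshold."""
    p = clf.predict_proba(vec.transform([msg]))[0]
    p_junk = p[list(clf.classes_).index("junk")]
    return "junk" if p_junk > threshold else "legitimate"

print(classify("FREE money, only $ today"))
```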

Figure 9 – Classification results using various feature sets

Figure 9 shows the precision and recall for both junk and legitimate e-mail under each feature regime. It shows that while the phrasal information improves the performance only slightly, the incorporation of even a little domain knowledge improves the resulting classifications.
Figure 10 – Precision and recall curves for junk e-mail using various feature sets.

Figure 10 focuses on the range from 0.85 to 1.0 to clearly show the greatest variation in the junk
mail precision/recall curves. It shows that the incorporation of additional features, especially
non-textual domain-specific information, gives consistently superior results to just considering
the words in the messages.

This research showed that it is possible to automatically learn effective filters that eliminate a large portion of junk e-mail from a user's mail stream. The efficacy of these filters can also be enhanced by a set of hand-crafted features specific to the task at hand. While the Bayesian framework used in the research was successful, it exposed the need for methods aimed at controlling the variance in parameter estimates for text categorization problems; hence, the use of Support Vector Machines (SVMs) in a decision-theoretic framework that incorporates asymmetric misclassification costs is a promising avenue for further research. The use of other Bayesian classifiers that are less restrictive than naïve Bayes is also expected to give better classification probability estimates and more accurate cost-sensitive classifications.

Autoclass: A Bayesian Approach to Classification


Cheeseman, Kelly, Self, Stutz, Taylor, and Freeman (1988) developed a program that automatically discovers classes in a database using a Bayesian statistical technique which determines the most probable number of classes, their probabilistic descriptions, and the probability that each object is a member of each class. The program has been tested on several large, real databases and has discovered previously unsuspected classes.
The authors assumed that the data are in attribute-value vector form and that the attributes are independent within each class. The program models real-valued attributes with a Gaussian (normal) distribution, parameterized by a mean and a standard deviation. It also employs conjugate priors, which carry prior information in the same form as the data.

The Autoclass program breaks the classification problem into two parts: determining the number of classes and determining the parameters that define them. It uses a Bayesian variant of Dempster and Laird's EM algorithm to find the best class parameters for a given number of classes; the algorithm is derived by differentiating the posterior distribution with respect to the class parameters and equating the result to zero.
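The core of that procedure, fitting Gaussian class parameters for a fixed number of classes with EM, can be sketched as follows; this is a bare one-dimensional mixture written for illustration, not the Autoclass implementation, and the data, the class count, and the stopping rule are all assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative one-dimensional data drawn from two overlapping Gaussians.
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])

K = 2                                   # assumed number of classes
pi = np.full(K, 1.0 / K)                # mixing proportions
mu = rng.choice(x, K)                   # initial means
sigma = np.full(K, x.std())             # initial standard deviations

for _ in range(50):                     # fixed number of EM iterations, for simplicity
    # E-step: posterior membership probability of every point in every class.
    dens = np.array([pi[k] / (sigma[k] * np.sqrt(2 * np.pi))
                     * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) for k in range(K)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the proportions, means, and standard deviations.
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)

print(pi, mu, sigma)                    # proportions, means, and spreads of the discovered classes
```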

The program classified data supplied by researchers active in various domains and yielded new and intriguing results, such as the discovery, with high confidence, of three classes in the Iris database, despite the fact that not all cases can be assigned to their classes with certainty. It also found four known classes in Stepp's soybean disease database, exactly matching Michalski's CLUSTER/2 system.

Finally, Autoclass assayed the Infrared Astronomical Satellite database, which contains 5,425 cases and 94 attributes and was considered the least thoroughly understood by domain experts. The program discovered classes which differed significantly from NASA's previous analysis but which clearly reflect physical phenomena in the data.

Predicting Football Results using Bayesian Nets and Other Machine Learning Techniques
Joseph, Fenton, and Neil (2006) compared the performance of an expert-constructed Bayesian net (BN) with other machine learning techniques, such as a naïve BN, KNN, and decision trees, for predicting the outcome of matches played by the English football club Tottenham Hotspur FC from 1995 to 1997. Their objective was to see how the expert-constructed Bayesian net performs in terms of predictive accuracy and explanatory clarity with respect to the factors affecting the results of the matches under investigation.

The expert-constructed BN uses features such as the presence or absence of three key players (Sherringham, Anderton, & Armstrong), whether Wilson is playing in midfield, the quality of the opposing team measured on a simple 3-point scale (high, medium, or low), and whether the game is played at Spurs' home ground or away.

Aside from these, additional factors, such as the quality of the Spurs attacking force, the overall quality of the Spurs team, and how well the team will perform given their own quality and that of their opponents, were related to the outcome of the game (win, lose, or draw) to simplify the structure. All of these were measured as low, medium, or high.
Figure 11 – Expert-constructed Bayesian net for Tottenham Hotspur performance.
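The sketch below hints at how an expert-constructed net of this kind combines evidence by marginalizing over an unobserved performance node; the structure is heavily simplified and every probability is invented for illustration, none of it taken from the Hugin model built in the study.

```python
# P(performance | venue, opposition quality): invented, illustrative conditional probabilities.
P_perf = {
    ("home", "low"):  {"high": 0.7, "medium": 0.2, "low": 0.1},
    ("home", "high"): {"high": 0.3, "medium": 0.4, "low": 0.3},
    ("away", "low"):  {"high": 0.5, "medium": 0.3, "low": 0.2},
    ("away", "high"): {"high": 0.2, "medium": 0.3, "low": 0.5},
}
# P(outcome | performance): invented, illustrative conditional probabilities.
P_outcome = {
    "high":   {"win": 0.7, "draw": 0.2, "lose": 0.1},
    "medium": {"win": 0.4, "draw": 0.3, "lose": 0.3},
    "low":    {"win": 0.1, "draw": 0.3, "lose": 0.6},
}

def predict(venue, opposition):
    """Marginalize over the unobserved performance node to obtain P(outcome | evidence)."""
    perf = P_perf[(venue, opposition)]
    return {o: sum(perf[p] * P_outcome[p][o] for p in perf) for o in ("win", "draw", "lose")}

print(predict("home", "high"))   # e.g. {'win': 0.40, 'draw': 0.27, 'lose': 0.33}
```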

All machine learning models were implemented using the MLC++ package, apart from the expert-constructed Bayesian net, which was built with the Hugin tool. The match data was divided into disjoint subsets used as training and validation sets; the data for each season was divided into three groups of ten matches and one group of eight matches, organized chronologically.
Figure 12 – Comparison of learner accuracy with expert model data.

The expert-constructed Bayesian net was the most accurate predictor of the outcome of the Spurs games, with a classification error of 40.79% over the disjoint training and test data sets. Its poorest performance was on the 1995/1996 data. However, it is worth noting that, with classification errors of 50% and 40.74% for the 1995/1996 and 1996/1997 seasons respectively, it was still the best classifier for the intra-season data. It also produced the best results among all the classifiers for every one of the cross-season test periods, with the classification error averaging 33.62%.

Figure 12 shows the relative accuracy of the machine learning models implemented in this research. KNN was the best performer when the same training and test data for the complete seasons were used, but its accuracy dropped significantly when disjoint training and test data sets were used, in which case the expert-constructed Bayesian net outperformed all the other learners.

This study reveals which of the selected attributes are the crucial factors affecting the outcome of a football game, as well as the relationships between these factors. One limitation of all the non-expert methods used is that they rely only on the supplied attributes, which constrains the learnt Bayesian nets. The performance of the constructed Bayesian network was impressive given the inherent analysis bias against it. Although the study has long since become dated, as it contains variables relating to key players who have since retired or left the team, its results confirm the excellent potential of Bayesian networks when they are built by a reliable domain expert.

A direction in which future work could extend this study is the construction of a more symmetrical model using similar data for all the teams in the league. However, this may multiply the amount of computational work by the number of additional teams in the league. Another potential improvement is to quantify the inherent quality of each player who plays and to use abstract nodes, such as the quality of the attack and defence, to improve the model and ensure its longevity.

Conclusion
This paper has described the Bayesian approach to pattern recognition and exemplified it with five real-world classification tasks. Bayesian decision theory provides a simple and extensible approach that is not limited to classification but also supports prediction and general mixture separation. Its theoretical basis is free from ad hoc quantities and from measures that alter the data to suit the needs of the program. As a result, most of the Bayesian classification models described in this paper lend themselves easily to extension and further research.

References
Bow, S. (2002). Pattern recognition and image preprocessing. New York: Marcel Dekker.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. Toronto: John Wiley &
Sons.

Husnain, M., & Naweed, S. (2009). English Letter Classification Using Bayesian Decision
Theory and Feature Extraction Using Principal Component Analysis. European Journal of
Scientific Research, 34(2), 196-203.

Joseph, A., Fenton, N., & Neil, M. (2006). Predicting football results using Bayesian nets and
other machine learning techniques. Knowledge-Based Systems, 19(7), 544-553.
doi:10.1016/j.knosys.2006.04.011

Nadler, M. (1993). Pattern recognition engineering. New York: John Wiley & Sons.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to
Filtering Junk E-mail.

Stutz, J., & Cheeseman, P. (1996). Autoclass — A Bayesian Approach to Classification.
Maximum Entropy and Bayesian Methods, 117-126. doi:10.1007/978-94-009-0107-0_13

Tellaeche, A., Burgos-Artizzu, X. P., Pajares, G., & Ribeiro, A. (2008). A vision-based method
for weeds identification through the Bayesian decision theory. Pattern Recognition, 41(2), 521-
530. doi:10.1016/j.patcog.2007.07.007
