
An empirical study of the naive Bayes classifier

I. Rish
T.J. Watson Research Center
rish@us.ibm.com

T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, NY 10532. Phone +1 (914) 784-7431.

Abstract

The naive Bayes classifier greatly simplifies learning by assuming that features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers.

Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly-functional feature dependencies, thus reaching its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.

1 Introduction

Bayesian classifiers assign the most likely class to a given example described by its feature vector. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(\mathbf{X} \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where $\mathbf{X} = (X_1, \ldots, X_n)$ is a feature vector and $C$ is a class.

Despite this unrealistic assumption, the resulting classifier, known as naive Bayes, is remarkably successful in practice, often competing with much more sophisticated techniques [6; 8; 4; 2]. Naive Bayes has proven effective in many practical applications, including text classification, medical diagnosis, and systems performance management [2; 9; 5].

The success of naive Bayes in the presence of feature dependencies can be explained as follows: optimality in terms of zero-one loss (classification error) is not necessarily related to the quality of the fit to a probability distribution (i.e., the appropriateness of the independence assumption). Rather, an optimal classifier is obtained as long as both the actual and estimated distributions agree on the most-probable class [2]. For example, [2] prove naive Bayes optimality for some problem classes that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts.

However, this explanation is too general and therefore not very informative. Ultimately, we would like to understand the data characteristics which affect the performance of naive Bayes. While most of the work on naive Bayes compares its performance to that of other classifiers on particular benchmark problems (e.g., UCI benchmarks), our approach uses Monte Carlo simulations that allow a more systematic study of classification accuracy on parametric families of randomly generated problems. Also, our current analysis is focused only on the bias of the naive Bayes classifier, not on its variance. Namely, we assume an infinite amount of data (i.e., perfect knowledge of the data distribution), which allows us to separate the approximation error (bias) of naive Bayes from the error induced by the size of the training sample (variance).

We analyze the impact of the distribution entropy on the classification error, showing that certain almost-deterministic, or low-entropy, dependencies yield good performance of naive Bayes. (Note that almost-deterministic dependencies are a common characteristic in many practical problem domains, such as, for example, computer system management and error-correcting codes.) We show that the error of naive Bayes vanishes as the entropy of the class-conditional distributions $P(\mathbf{X} \mid c)$ approaches zero. Another class of almost-deterministic dependencies generalizes functional dependencies between the features. Particularly, we show that naive Bayes works best in two cases: completely independent features (as expected) and functionally dependent features (which is less obvious), while reaching its worst performance between these extremes.

We also show that, surprisingly, the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features, $I(X_1; X_2 \mid C)$ ($X_1$ and $X_2$ are features and $C$ is the class). Instead, our experiments reveal that a better predictor of naive Bayes accuracy can be the loss of information that features contain about the class when assuming the naive Bayes model, namely $I(C; \mathbf{X}) - I_{NB}(C; \mathbf{X})$, where $I_{NB}(C; \mathbf{X})$ is the mutual information between features and class under the naive Bayes assumption.

This paper is structured as follows. In the next section we provide the necessary background and definitions. Section 3 discusses naive Bayes performance for nearly-deterministic dependencies, while Section 4 demonstrates that the "information loss" criterion can be a better error predictor than the strength of feature dependencies. A summary and conclusions are given in Section 5.

2 Definitions and Background

Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a vector of observed random variables, called features, where each feature $X_i$ takes values from its domain $D_i$. The set of all feature vectors (examples, or states) is denoted $\Omega = D_1 \times \cdots \times D_n$. Let $C \in \{0, \ldots, m-1\}$ be an unobserved random variable denoting the class of an example, where $C$ can take one of $m$ values. Capital letters, such as $X_i$, will denote variables, while lower-case letters, such as $x_i$, will denote their values; boldface letters will denote vectors.

A function $g: \Omega \rightarrow \{0, \ldots, m-1\}$ denotes a concept to be learned. A deterministic $g(\mathbf{x})$ corresponds to a concept without noise, which always assigns the same class to a given example (e.g., disjunctive and conjunctive concepts are deterministic). In general, however, a concept can be noisy, yielding a random function $g(\mathbf{x})$.

A classifier is defined by a (deterministic) function $h: \Omega \rightarrow \{0, \ldots, m-1\}$ (a hypothesis) that assigns a class to any given example. A common approach is to associate each class $i$ with a discriminant function $f_i(\mathbf{x})$, $i = 0, \ldots, m-1$, and let the classifier select the class with the maximum discriminant function on a given example: $h(\mathbf{x}) = \arg\max_{i \in \{0, \ldots, m-1\}} f_i(\mathbf{x})$.

The Bayes classifier $h^*(\mathbf{x})$ (which we also call the Bayes-optimal classifier and denote BO) uses as discriminant functions the class posterior probabilities given a feature vector, i.e. $f_i^*(\mathbf{x}) = P(C = i \mid \mathbf{X} = \mathbf{x})$. Applying Bayes rule gives $P(C = i \mid \mathbf{X} = \mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i) / P(\mathbf{X} = \mathbf{x})$, where $P(\mathbf{X} = \mathbf{x})$ is identical for all classes and therefore can be ignored. This yields the Bayes discriminant functions

$f_i^*(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i),$   (1)

where $P(\mathbf{X} = \mathbf{x} \mid C = i)$ is called the class-conditional probability distribution (CPD). Thus, the Bayes classifier

$h^*(\mathbf{x}) = \arg\max_i P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i)$   (2)

finds the maximum a posteriori probability (MAP) hypothesis given example $\mathbf{x}$. However, direct estimation of $P(\mathbf{X} = \mathbf{x} \mid C = i)$ from a given set of training examples is hard when the feature space is high-dimensional. Therefore, approximations are commonly used, such as the simplifying assumption that features are independent given the class. This yields the naive Bayes classifier $NB(\mathbf{x})$ defined by the discriminant functions

$f_i^{NB}(\mathbf{x}) = \prod_{j=1}^{n} P(X_j = x_j \mid C = i) P(C = i).$   (3)
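To make equations (1)-(3) concrete, here is a small illustrative sketch (it is not part of the original experiments; it assumes two features, a known joint table P(x1, x2, c), and NumPy) that computes the Bayes-optimal and naive Bayes decisions and their risks directly from the true distribution, matching the infinite-data setting used throughout this study:

    import numpy as np

    # Joint distribution P(x1, x2, c) as an array of shape (k, k, m);
    # entries are nonnegative and sum to 1 (infinite-data setting: the
    # true distribution is known, so there is no estimation error).
    rng = np.random.default_rng(0)
    k, m = 4, 2
    P = rng.random((k, k, m))
    P /= P.sum()

    P_c = P.sum(axis=(0, 1))            # class priors P(c)
    P_x1_c = P.sum(axis=1) / P_c        # P(x1 | c), shape (k, m)
    P_x2_c = P.sum(axis=0) / P_c        # P(x2 | c), shape (k, m)

    def bayes_optimal(x1, x2):
        # f*_i(x) = P(x1, x2 | c=i) P(c=i) = P(x1, x2, c=i)   -- eq. (1)-(2)
        return int(np.argmax(P[x1, x2, :]))

    def naive_bayes(x1, x2):
        # f^NB_i(x) = P(x1 | c=i) P(x2 | c=i) P(c=i)           -- eq. (3)
        return int(np.argmax(P_x1_c[x1, :] * P_x2_c[x2, :] * P_c))

    # Error (risk) of a classifier h under the true distribution:
    # R(h) = sum_{x1,x2} [ P(x1, x2) - P(x1, x2, h(x1, x2)) ].
    def risk(h):
        r = 0.0
        for x1 in range(k):
            for x2 in range(k):
                r += P[x1, x2, :].sum() - P[x1, x2, h(x1, x2)]
        return r

    print("Bayes risk R* =", risk(bayes_optimal))
    print("naive Bayes risk R_NB =", risk(naive_bayes))

Because both discriminants are evaluated on the true distribution, the gap between the two printed risks is exactly the approximation error (bias) of naive Bayes discussed above.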
The probability of a classification error, or risk $R(h)$, of a classifier $h$ is defined as

$R(h) = P(h(\mathbf{X}) \neq g(\mathbf{X})) = E_{\mathbf{X}}[\,\mathbf{1}(h(\mathbf{X}) \neq g(\mathbf{X}))\,],$

where $E_{\mathbf{X}}$ is the expectation over $\mathbf{X}$ and $\mathbf{1}(\cdot)$ is the indicator function. $R(h^*)$ denotes the Bayes error (Bayes risk).

We say that a classifier $h$ is optimal on a given problem if its risk coincides with the Bayes risk. Assuming there is no noise (i.e. zero Bayes risk), a concept is called separable by a set of functions $\{f_i(\mathbf{x}),\ i = 0, \ldots, m-1\}$ if every example $\mathbf{x}$ is classified correctly when each $f_i(\mathbf{x})$ is used as a discriminant function.

As a measure of dependence between two features $X_i$ and $X_j$ we use the class-conditional mutual information [1], which can be defined as

$I(X_i; X_j \mid C) = H(X_i \mid C) + H(X_j \mid C) - H(X_i, X_j \mid C),$

where $H(X_i \mid C)$ is the class-conditional entropy of $X_i$, defined as

$H(X_i \mid C) = -\sum_{c} P(C = c) \sum_{x_i} P(x_i \mid c) \log P(x_i \mid c).$

Mutual information is zero when $X_i$ and $X_j$ are mutually independent given class $C$, and increases with increasing level of dependence, reaching its maximum when one feature is a deterministic function of the other.
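For illustration, the same quantity can be computed numerically from a joint table P(x1, x2, c). The sketch below (our own, using the entropy identity just given) returns I(X1;X2|C); the example at the end checks the maximally dependent case mentioned above:

    import numpy as np

    def cond_entropy(P_joint, P_c):
        # H(Y|C) = -sum_c P(c) sum_y P(y|c) log P(y|c),
        # where P_joint[..., c] holds P(y, c) for the (possibly joint) variable Y.
        H = 0.0
        for c, pc in enumerate(P_c):
            p_y_given_c = P_joint[..., c].ravel() / pc
            p = p_y_given_c[p_y_given_c > 0]
            H += pc * -(p * np.log2(p)).sum()
        return H

    def class_conditional_mi(P):
        # I(X1;X2|C) = H(X1|C) + H(X2|C) - H(X1,X2|C)
        P_c = P.sum(axis=(0, 1))
        H_x1 = cond_entropy(P.sum(axis=1), P_c)   # from P(x1, c)
        H_x2 = cond_entropy(P.sum(axis=0), P_c)   # from P(x2, c)
        H_x1x2 = cond_entropy(P, P_c)             # from P(x1, x2, c)
        return H_x1 + H_x2 - H_x1x2

    # Example: two features that are identical given the class (maximal dependence).
    k, m = 4, 2
    P = np.zeros((k, k, m))
    for c in range(m):
        for x in range(k):
            P[x, x, c] = 1.0 / (k * m)    # x2 = x1 with probability 1 given c
    print("I(X1;X2|C) =", class_conditional_mi(P))   # equals H(X1|C) = log2(k) = 2 bits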
3 When does naive Bayes work well? Effects of some nearly-deterministic dependencies

In this section, we discuss known limitations of naive Bayes and then some conditions of its optimality and near-optimality, which include low-entropy feature distributions and nearly-functional feature dependencies.

3.1 Concepts without noise

We focus first on concepts with $P(C \mid \mathbf{x}) = 0$ or $1$ for any $\mathbf{x}$ (i.e. no noise), which therefore have zero Bayes risk. The features are assumed to have finite domains (the $i$-th feature has $k_i$ values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.) Our attention will be restricted to binary classification problems where the class is either 0 or 1.

Some limitations of naive Bayes are well-known: in the case of binary features ($k_i = 2$ for all $i$), it can only learn linear discriminant functions [3], and thus it is always suboptimal for non-linearly separable concepts (the classical example is the XOR function; another one is $m$-of-$n$ concepts [7; 2]). When $k_i > 2$ for some features, naive Bayes is able to learn (some) polynomial discriminant functions [3]; thus, polynomial separability is a necessary, although not sufficient, condition of naive Bayes optimality for concepts with finite-domain features.

Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [2]. These results can be generalized to concepts with any nominal features (see [10] for details):

Theorem 1 [10] The naive Bayes classifier is optimal for any two-class concept with nominal features that assigns class 0 to exactly one example, and class 1 to the other examples, with probability 1. (Clearly, this also holds in the case of a single example of class 1.)

The performance of naive Bayes degrades with an increasing number of class-0 examples (i.e., with increasing prior probability of class 0, $P(C = 0)$, also denoted $P(0)$), as demonstrated in Figure 1a. This figure plots the average naive Bayes error computed over 1000 problem instances generated randomly for each value of $P(0)$. The problem generator, called ZeroBayesRisk, assumes $n$ features (here we only consider two features), each having $k$ values, and varies the number of class-0 examples from 1 to $k^n/2$ (so that $P(0)$ varies from $1/k^n$ to 0.5; the results for $P(0) > 0.5$ are symmetric). (Note that in all experiments perfect knowledge of the data distribution, i.e., an infinite amount of data, is assumed in order to avoid the effect of finite sample size.) As expected, a larger $P(0)$ (equivalently, a larger number of class-0 examples) yields a wider range of problems with various dependencies among features, which results in increased errors of naive Bayes; a closer look at the data shows no other cases of optimality besides $P(0) = 1/k^n$.
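The following sketch illustrates one plausible implementation of such a generator (the exact sampling details, in particular the uniform distribution of examples over the $k^2$ states, are our assumptions rather than a verbatim description of ZeroBayesRisk):

    import numpy as np

    rng = np.random.default_rng(1)

    def zero_bayes_risk_instance(k, n0):
        """One random problem from a ZeroBayesRisk-style generator (our sketch):
        two features with k values each, examples uniform over the k*k states,
        a random subset of n0 states labeled class 0, the rest class 1
        (so P(0) = n0 / k**2 and the Bayes risk is zero)."""
        states = rng.choice(k * k, size=n0, replace=False)
        labels = np.ones((k, k), dtype=int)
        labels[np.unravel_index(states, (k, k))] = 0
        # Joint P(x1, x2, c): uniform over states, class assigned deterministically.
        P = np.zeros((k, k, 2))
        for x1 in range(k):
            for x2 in range(k):
                P[x1, x2, labels[x1, x2]] = 1.0 / (k * k)
        return P

    def nb_error(P):
        """Error of naive Bayes under the true distribution P(x1, x2, c)."""
        k = P.shape[0]
        P_c = P.sum(axis=(0, 1))
        P_x1_c = P.sum(axis=1) / P_c
        P_x2_c = P.sum(axis=0) / P_c
        err = 0.0
        for x1 in range(k):
            for x2 in range(k):
                c_hat = np.argmax(P_x1_c[x1] * P_x2_c[x2] * P_c)
                err += P[x1, x2].sum() - P[x1, x2, c_hat]
        return err

    # Average naive Bayes error as a function of P(0), k = 10
    # (compare qualitatively with Figure 1a).
    k = 10
    for n0 in (1, 10, 25, 50):
        errs = [nb_error(zero_bayes_risk_instance(k, n0)) for _ in range(200)]
        print(f"P(0) = {n0 / k**2:.2f}  average NB error = {np.mean(errs):.3f}")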
Surprisingly, the strength of inter-feature dependencies, measured as the class-conditional mutual information $I(X_1; X_2 \mid C)$ (also denoted $I$), is not a good predictor of naive Bayes performance: while the average naive Bayes error increases monotonically with $P(0)$, the mutual information is non-monotone, reaching its maximum around $P(0) = 0.1$. This observation is consistent with previous empirical results on UCI benchmarks [2] that also revealed low correlation between the degree of feature dependence and the relative performance of naive Bayes with respect to other classifiers, such as C4.5, CN2, and PEBLS.

It turns out that the entropy of the class-conditional marginal distributions, $H(P(x_1 \mid c))$, is a better predictor of naive Bayes performance. Intuitively, low entropy of $P(x_1 \mid c)$ means that most of the class-0 examples are "concentrated around" one state (in the limit, this yields the optimality condition stated by Theorem 1). Indeed, plotting the average $H(P(x_1 \mid c))$ in Figure 1a demonstrates that both the average error and the average entropy increase monotonically in $P(0)$. Further discussion of low-entropy distributions is given next in the more general context of noisy (non-zero Bayes risk) classification problems.

[Figure 1 about here. Panel (a): NBerror, I(X1;X2|C), and H(P(x1|c)) vs. P(0) (n=2, m=2, k=10, N=1000); panel (b): average errors vs. mutual information (n=2, m=2, k=10); panel (c): average error difference vs. mutual information (n=2, m=2, k=10).]

Figure 1: (a) Results for the generator ZeroBayesRisk (k=10, 1000 instances): average naive Bayes error (NBerr), class-conditional mutual information between features, $I(X_1; X_2 \mid C)$, and entropy of the marginal distribution, $H(P(x_1 \mid c))$; the error bars correspond to the standard deviation of each measurement across 1000 problem instances. (b) Results for the generator EXTREME: average Bayes and naive Bayes errors and average $I(X_1; X_2 \mid C)$. (c) Results for the generator FUNC1: average difference between naive Bayes error and Bayes error (the latter is constant for all $\delta$), and scaled $I(X_1; X_2 \mid C)$ (divided by 300).

3.2 Noisy concepts

Low-entropy feature distributions

Generally, concepts can be noisy, i.e. they can have a non-deterministic $P(C \mid \mathbf{x})$ and thus a non-zero Bayes risk.

A natural extension of the conditions of Theorem 1 to noisy concepts yields low-entropy, or "extreme", probability distributions, having almost all the probability mass concentrated in one state. Indeed, as shown in [10], the independence assumption becomes more accurate with decreasing entropy, which yields an asymptotically optimal performance of naive Bayes. Namely,
Theorem 2 [10] Given that one of the following conditions holds:

1. the joint probability distribution $P(X_1, \ldots, X_n, C)$ is such that $P(x_1^0, \ldots, x_n^0, c^0) \geq 1 - \delta$ for some state $(x_1^0, \ldots, x_n^0, c^0)$, or

2. the set of marginal probability distributions $P(X_i \mid c)$ is such that, for each class $c$, $P(x_i^c \mid c) \geq 1 - \delta$ for some value $x_i^c$ of each feature $X_i$,

then the difference between the naive Bayes error and the Bayes error vanishes as $\delta$ approaches zero, i.e. naive Bayes is asymptotically optimal on such low-entropy distributions.

The performance of naive Bayes on low-entropy distributions is demonstrated using a random problem generator called EXTREME. This generator takes the number of classes, $m$, the number of features, $n$, the number of values per feature, $k$, and the parameter $\delta$, and creates class-conditional feature distributions $P(\mathbf{X} \mid c)$, each satisfying the condition $P(\mathbf{x}^c \mid c) = 1 - \delta$, where the $\mathbf{x}^c$ are different states randomly selected from the possible states. For each class $c$, the remaining probability mass $\delta$ in $P(\mathbf{X} \mid c)$ is randomly distributed among the remaining states. Class prior distributions are uniform. Once $P(\mathbf{X} \mid C)$ is generated, the naive Bayes classifier (NB) is compared against the Bayes-optimal classifier (BO).
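A minimal sketch of an EXTREME-style generator is given below (it reflects our reading of the construction; in particular, treating $P(\mathbf{X} \mid c)$ as a joint distribution over feature vectors and the way the residual mass $\delta$ is spread are our assumptions):

    import numpy as np

    rng = np.random.default_rng(2)

    def extreme_instance(m, n, k, delta):
        """One problem from an EXTREME-style generator (our sketch): for each
        class c, the joint class-conditional distribution P(x | c) puts mass
        1 - delta on one randomly chosen state (different states for different
        classes) and spreads the remaining mass delta randomly over the other
        states; class priors are uniform."""
        n_states = k ** n
        peaks = rng.choice(n_states, size=m, replace=False)
        P = np.zeros((n_states, m))
        for c in range(m):
            rest = rng.random(n_states)
            rest[peaks[c]] = 0.0
            rest *= delta / rest.sum()
            P[:, c] = rest
            P[peaks[c], c] = 1.0 - delta
        P /= m                       # uniform class priors: P(x, c) = P(x|c) / m
        return P.reshape((k,) * n + (m,))

    def risks(P):
        """Bayes-optimal and naive Bayes risks for P(x1, x2, c) (n = 2)."""
        k, _, m = P.shape
        P_c = P.sum(axis=(0, 1))
        P1, P2 = P.sum(axis=1) / P_c, P.sum(axis=0) / P_c
        r_bo = r_nb = 0.0
        for x1 in range(k):
            for x2 in range(k):
                p = P[x1, x2]
                r_bo += p.sum() - p.max()
                r_nb += p.sum() - p[np.argmax(P1[x1] * P2[x2] * P_c)]
        return r_bo, r_nb

    # Naive Bayes error shrinks together with delta (compare Figure 1b).
    for delta in (0.5, 0.2, 0.05, 0.01):
        vals = [risks(extreme_instance(2, 2, 10, delta)) for _ in range(100)]
        print(f"delta = {delta:.2f}  avg R* = {np.mean([v[0] for v in vals]):.3f}"
              f"  avg R_NB = {np.mean([v[1] for v in vals]):.3f}")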
Figure 1b shows that, as expected, the naive Bayes error (both the average and the maximum) converges to zero as $\delta \rightarrow 0$ (the simulation was performed on a set of 500 problems with $m = 2$, $n = 2$, $k = 10$). Note that, similarly to the previous observations, the error of naive Bayes is not a monotone function of the strength of feature dependencies; namely, the average class-conditional mutual information plotted in Figure 1b is a concave function reaching its maximum at intermediate values of $\delta$, while the decrease of the average naive Bayes error is monotone in $\delta$.

Almost-functional feature dependencies

Another "counterintuitive" example that demonstrates the non-monotonic relation between feature dependence and naive Bayes accuracy is the case of certain functional and nearly-functional dependencies among features. Formally,

Theorem 3 [10] Given equal class priors, naive Bayes is optimal if $X_i = f_i(X_1)$ for every feature $X_i$, $i = 2, \ldots, n$, where $f_i$ is a one-to-one mapping.

(A similar observation was made in [11], but the important "one-to-one" condition on functional dependencies was not mentioned there. However, it is easy to construct an example of a non-one-to-one functional dependence between the features that yields non-zero error of naive Bayes.)

That is, naive Bayes can be optimal in situations just opposite to class-conditional feature independence (when mutual information is at its minimum), namely, in cases of completely deterministic dependence among the features (when mutual information achieves its maximum). For example, Figure 1c plots the simulation results obtained using a "nearly-functional" feature distribution generator called FUNC1, which assumes uniform class priors and two features, each having $k$ values, and "relaxes" functional dependencies between the features using the noise parameter $\delta$. Namely, this generator selects a random permutation of $k$ numbers, which corresponds to a one-to-one function $f$ that binds the two features: $x_2 = f(x_1)$. Then it generates randomly two class-conditional (marginal) distributions for the first feature, $P^0(x_1)$ and $P^1(x_1)$, for class 0 and class 1, respectively. Finally, it creates class-conditional joint feature distributions satisfying the condition $P(x_1, f(x_1) \mid c) = P^c(x_1)(1 - \delta)$, with the remaining probability mass $P^c(x_1)\,\delta$ spread over the states $(x_1, x_2)$ with $x_2 \neq f(x_1)$. This way the states satisfying the functional dependence obtain $1 - \delta$ of the probability mass, so that by controlling $\delta$ we can get as close as we want to the functional dependence described before, i.e. the generator relaxes the conditions of Theorem 3. Note that, on the other hand, a sufficiently large $\delta$ gives us uniform distributions over the second feature $X_2$, which makes it independent of $X_1$ (given class $C$). Thus varying $\delta$ from 0 to 1 explores the whole range from deterministic dependence to complete independence between the features given the class.
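The sketch below implements a FUNC1-style generator under one concrete choice of how the residual mass $\delta$ is spread (a uniform mixture over $x_2$, which is our assumption; the text above only fixes roughly $1 - \delta$ of the mass on the functional states):

    import numpy as np

    rng = np.random.default_rng(3)

    def func1_instance(k, delta):
        """One problem from a FUNC1-style generator (our sketch): a random
        one-to-one mapping f ties the features via x2 = f(x1); given x1 and c,
        x2 equals f(x1) with weight 1 - delta and is uniform over all k values
        with weight delta (the exact spreading rule is our assumption).
        Class priors are uniform."""
        f = rng.permutation(k)
        P = np.zeros((k, k, 2))
        for c in range(2):
            marg = rng.random(k)
            marg /= marg.sum()                      # random P^c(x1)
            for x1 in range(k):
                P[x1, :, c] = marg[x1] * delta / k
                P[x1, f[x1], c] += marg[x1] * (1.0 - delta)
        return P / 2.0                              # uniform class priors

    def risks(P):
        """Bayes-optimal and naive Bayes risks under the true P(x1, x2, c)."""
        k = P.shape[0]
        P_c = P.sum(axis=(0, 1))
        P1, P2 = P.sum(axis=1) / P_c, P.sum(axis=0) / P_c
        r_bo = r_nb = 0.0
        for x1 in range(k):
            for x2 in range(k):
                p = P[x1, x2]
                r_bo += p.sum() - p.max()
                r_nb += p.sum() - p[np.argmax(P1[x1] * P2[x2] * P_c)]
        return r_bo, r_nb

    # The error difference R_NB - R* is smallest near the two extremes of delta
    # (functional dependence and independence); compare Figure 1c.
    for delta in (0.0, 0.25, 0.5, 0.75, 1.0):
        diffs = []
        for _ in range(200):
            r_bo, r_nb = risks(func1_instance(10, delta))
            diffs.append(r_nb - r_bo)
        print(f"delta = {delta:.2f}  avg (R_NB - R*) = {np.mean(diffs):.4f}")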
The results for 500 problems with $k = 10$ are summarized in Figure 1c, which plots the difference between the average naive Bayes error and the average Bayes risk (which turned out to be constant for all $\delta$) against $\delta$. We can see that naive Bayes is optimal when $\delta = 0$ (functional dependence) and when $\delta = 1$ (complete independence), while its maximum error is reached between the two extremes. On the other hand, the class-conditional mutual information decreases monotonically in $\delta$, from its maximum at $\delta = 0$ (functional dependence) to its minimum at $\delta = 1$ (complete independence). (Note that the mutual information in Figure 1c is scaled, i.e. divided by 300, to fit the error range.)

4 Information loss: a better error predictor than feature dependencies?

As we observed before, the strength of feature dependencies (i.e. the class-conditional mutual information between the features) "ignored" by naive Bayes is not a good predictor of its classification error. This makes us look for a better parameter that estimates the impact of the independence assumption on classification.

We start with a basic question: which dependencies between features can be ignored when solving a classification task? Clearly, the dependencies which do not help distinguish between different classes, i.e. do not provide any information about the class. Formally, let $I(C; X_1, X_2)$ be the mutual information between the features and the class (note the difference from the class-conditional mutual information) given the "true" distribution $P(C, X_1, X_2)$, while $I_{NB}(C; X_1, X_2)$ is the same quantity computed for $P_{NB}(C, X_1, X_2) = P(C) P(X_1 \mid C) P(X_2 \mid C)$, the naive Bayes approximation of $P(C, X_1, X_2)$. Then the parameter

$Idiff = I(C; X_1, X_2) - I_{NB}(C; X_1, X_2)$

measures the amount of information about the class which is "lost" due to the naive Bayes assumption.
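Both quantities in the definition of $Idiff$ can be computed directly from a joint table; the following sketch (our own illustration) builds the naive Bayes approximation P_NB and returns the information loss:

    import numpy as np

    def mutual_info_class(P):
        """I(C; X1, X2) = H(C) + H(X1, X2) - H(X1, X2, C) for a joint table
        P of shape (k, k, m)."""
        def H(p):
            p = p.ravel()
            p = p[p > 0]
            return -(p * np.log2(p)).sum()
        return H(P.sum(axis=(0, 1))) + H(P.sum(axis=2)) - H(P)

    def idiff(P):
        """Information loss Idiff = I(C; X1, X2) - I_NB(C; X1, X2), where the
        second term is computed from the naive Bayes approximation
        P_NB(x1, x2, c) = P(c) P(x1|c) P(x2|c)."""
        P_c = P.sum(axis=(0, 1))
        P1 = P.sum(axis=1) / P_c          # P(x1 | c)
        P2 = P.sum(axis=0) / P_c          # P(x2 | c)
        P_nb = np.einsum('ic,jc,c->ijc', P1, P2, P_c)
        return mutual_info_class(P) - mutual_info_class(P_nb)

    # Tiny check (our example): if the features really are independent given
    # the class, the naive Bayes approximation is exact and Idiff = 0.
    rng = np.random.default_rng(4)
    P1 = rng.random((4, 2)); P1 /= P1.sum(axis=0)
    P2 = rng.random((4, 2)); P2 /= P2.sum(axis=0)
    P = np.einsum('ic,jc,c->ijc', P1, P2, np.array([0.5, 0.5]))
    print("Idiff for an independent example:", idiff(P))   # ~0 up to rounding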

Figure 2a shows that the average $Idiff$ ("information loss") increases monotonically with $P(0)$, just as the average error of naive Bayes. More interestingly, Figure 2b plots the average naive Bayes error versus the average $Idiff$ for three different values of $k$ ($k = 5, 10, 15$), which all yield almost the same curve, closely approximated by a quadratic function. Our results, not shown here due to space restrictions, also demonstrate that the variance of the error increases with $Idiff$ for each fixed $k$; however, the maximum variance decreases with $k$. While the dependence between the error and the information loss requires further study, it is clear that for zero-Bayes-risk problems the information loss is a much better predictor of the error than the mutual dependence between the features (compare to Figure 1a).

[Figure 2 about here. Panel (a): NBerror vs. P(class=0); panel (b): Ediff = R_NB - R* vs. Idiff (n=2, m=2, k=5,10,15, N=2000), with quadratic fit y = 0.31x^2 + 0.098x + 0.00087 (norm of residuals 0.010884).]

Figure 2: Results for the generator ZeroBayesRisk (13 values of $P(0)$, 2000 instances per each value of $P(0)$): (a) average naive Bayes error and average information loss $Idiff$ versus $P(0)$; (b) average naive Bayes error versus average "information loss" $Idiff$ for $k$ = 5, 10, and 15.

[Figure 3 about here. Panels (a) and (b): Ediff = R_NB - R* vs. Idiff and Idiff+ (n=2, m=2, k=10); panel (c): Ediff = R_NB - R* vs. Idiff and MI (n=2, m=2, k=15).]

Figure 3: Information loss on noisy concepts: average error difference between naive Bayes and optimal Bayes, $Ediff$, and average $Idiff$ for (a) generator EXTREME and (b) generator FUNC1; (c) scatter plot of $Ediff$ versus $Idiff$ and versus mutual information $I(X_1; X_2 \mid C)$ for generator RANDOM.

For non-zero Bayes risk, the picture is somewhat less clear. However, the information loss still seems to be a better error predictor than the class-conditional mutual information between the features. Figure 3a plots the average difference between the naive Bayes error and the Bayes risk, called $Ediff$, and the information loss $Idiff$, versus the parameter $\delta$.

At first sight, it looks like $Ediff$ is non-monotone in $\delta$ while $Idiff$ is monotone; particularly, while the error increases with $\delta$, the information loss decreases over part of the $\delta$ range. Note, however, that this interval yields negative (!) values of $Idiff$. It appears that naive Bayes overestimates the amount of information the features have about the class (possibly, by counting the same information twice due to the independence assumption), which results in negative $Idiff$. If we assume that such overestimation is not harmful, just equivalent to not losing any information, and plot instead the average of $\max(Idiff, 0)$ (denoted $Idiff^{+}$), we observe a monotone relationship between the average of $Idiff^{+}$ and the average naive Bayes error, as one would expect (i.e., both increase monotonically up to an intermediate value of $\delta$, and then decrease).

Similarly, in Figure 3b we plot the error difference $Ediff$ as well as $Idiff$ and $Idiff^{+}$ versus $\delta$ for our second generator of non-zero-Bayes-risk problems, FUNC1. In this case, naive Bayes always overestimates the amount of information about the class, thus $Idiff$ is always non-positive, i.e. $Idiff^{+} = 0$. Its relation to the naive Bayes error, which reaches its maximum at some intermediate value of $\delta$, is thus not clear.

Finally, we used a "completely" random problem generator (called RANDOM) to compare the class-conditional mutual information between the features, $I(X_1; X_2 \mid C)$, and the information loss $Idiff$, on arbitrary noisy concepts. For each class, this generator samples each entry $P(x_1, x_2 \mid c)$ from a uniform distribution on the interval [0.0, 1.0]; the resulting probability table is then normalized (divided by the total sum over all entries).
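A sketch of such a RANDOM-style generator, together with the two quantities compared in Figure 3c, is given below (the helper computations mirror the earlier sketches and are our own illustration; whether normalization is per class or over the whole table is not fully specified above, so we normalize globally):

    import numpy as np

    rng = np.random.default_rng(5)

    def random_instance(k, m=2):
        """RANDOM-style generator (our sketch): every entry of the joint table
        is drawn uniformly from [0, 1] and the whole table is normalized by
        its total sum, giving an arbitrary noisy concept P(x1, x2, c)."""
        P = rng.random((k, k, m))
        return P / P.sum()

    def ediff_idiff(P):
        """Ediff = R_NB - R* and Idiff = I(C;X1,X2) - I_NB(C;X1,X2) for P."""
        k, _, m = P.shape
        P_c = P.sum(axis=(0, 1))
        P1, P2 = P.sum(axis=1) / P_c, P.sum(axis=0) / P_c
        r_bo = r_nb = 0.0
        for x1 in range(k):
            for x2 in range(k):
                p = P[x1, x2]
                r_bo += p.sum() - p.max()
                r_nb += p.sum() - p[np.argmax(P1[x1] * P2[x2] * P_c)]
        def I(P):                       # I(C; X1, X2) via entropies
            def H(p):
                p = p.ravel(); p = p[p > 0]
                return -(p * np.log2(p)).sum()
            return H(P.sum(axis=(0, 1))) + H(P.sum(axis=2)) - H(P)
        P_nb = np.einsum('ic,jc,c->ijc', P1, P2, P_c)
        return r_nb - r_bo, I(P) - I(P_nb)

    pairs = [ediff_idiff(random_instance(15)) for _ in range(500)]
    ediffs, idiffs = zip(*pairs)
    print("correlation(Ediff, Idiff) =", np.corrcoef(ediffs, idiffs)[0, 1])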
Figure 3c shows a scatter-plot for $Ediff$, the error difference between the naive Bayes and optimal Bayes classifiers, versus feature dependence $I(X_1; X_2 \mid C)$ and versus information loss $Idiff$. In this case, we can see that both parameters are correlated with the error; however, the variance is quite high, especially for $I(X_1; X_2 \mid C)$. Further study of both parameters on different classes of noisy concepts is needed to gain a better understanding of their relevance to the classification error.
5 Conclusions [6] J. Hilden. Statistical diagnosis based on conditional indepen-
dence does not require it. Comput. Biol. Med., 14(4):429–435,
Despite its unrealistic independence assumption, the naive 1984.
Bayes classifier is surprisingly effective in practice since its
[7] R. Kohavi. Wrappers for performance enhancement and obliv-
classification decision may often be correct even if its prob-
ious decision graphs. Technical report, PhD thesis, Department
ability estimates are inaccurate. Although some optimality
of Computer Science, Stanford, CA, 1995.
conditions of naive Bayes have been already identified in the
past [2], a deeper understanding of data characteristics that [8] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian
affect the performance of naive Bayes is still required. classifiers. In Proceedings of the Tenth National Conference
on Artificial Intelligence, pages 399–406, San Jose, CA, 1992.
Our broad goal is to understand the data characteristics AAAI Press.
which affect the performance of naive Bayes. Our approach
uses Monte Carlo simulations that allow a systematic study of [9] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
classification accuracy for several classes of randomly gener- [10] I. Rish, J. Hellerstein, and T. Jayram. An analysis of data char-
ated problems. We analyze the impact of the distribution en- acteristics that affect naive Bayes performance. Technical Re-
tropy on the classification error, showing that certain almost- port RC21993, IBM T.J. Watson Research Center, 2001.
deterministic, or low-entropy, dependencies yield good per- [11] H. Schneiderman and T. Kanade. A statistical method for 3d
formance of naive Bayes. Particularly, we demonstrate that detection applied to faces and cars. In Proceedings of CVPR-
naive Bayes works best in two cases: completely indepen- 2000, 2000.
dent features (as expected) and functionally dependent fea-
tures (which is surprising). Naive Bayes has its worst perfor-
