
Alternative Strategies to Explore the SNNB Algorithm Performance

Laura Cruz R.¹, Joaquín Pérez O.², Rodolfo A. Pazos R.², Vanesa Landero N.²,
Víctor M. Álvarez H.¹, Claudia G. Gómez S.¹
¹ Technological Institute of Cd. Madero, Mexico
² National Center for Investigation and Technological Development (CENIDET), Mexico
lcruzreyes@prodigy.net.mx, {jperez, pazos}@cenidet.edu.mx,
landerov76@yahoo.com.mx, {mantorvicuel, cggs7}@hotmail.com

Abstract
Data mining is the process of extracting useful knowledge from large datasets. Classification,
a subarea of data mining, induces a set of models from training data and uses them to predict
the class label of unknown instances. The Naive Bayes classifier is simple, efficient and robust;
its performance has been improved by several works, most of which focus on conditionally
finding a subset of instances and selecting the appropriate classifier with the highest
probability. In this paper we propose to modify the Selective Neighborhood based Naive
Bayes (SNNB) algorithm by using and combining other mechanisms for distance measurement,
instance organization, instance space search and model selection. The proposed combinations
are aimed at exploring the classification accuracy of the SNNB algorithm. Experimental results
on 26 datasets from the UCI repository show that the best strategy found won in 15 cases,
tied in 7, and only lost in 3 cases.

1. Introduction
Data mining, also known as knowledge discovery in databases (KDD), is the process of
extracting useful knowledge from large datasets. Classification is a data mining process that
induces a set of models from the training data and uses them to predict the class label of
unknown instances.
In [1] we proposed a statistical approach for heuristic algorithm selection. This approach
uses machine learning techniques for developing classifiers that learn the relationship
between algorithm performance and problem characteristics. When a new instance has to be
solved, the classifier predicts the algorithm with the best performance for it. For this reason,
we are interested in a classification method with high prediction accuracy.
Naive Bayes (NB) [2] is a probability based classification method, which involves the
assumption that, given the class label, the attributes are conditionally independent.
Although Naive Bayes is simple, it has shown good performance in a wide variety of
domains, even in many domains where there are clear dependencies among the attributes.
As a result, Naive Bayes has attracted much attention from researchers and its performance
has been improved.
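
Under this conditional independence assumption, Naive Bayes assigns to an instance with attribute values a1, ..., an the class that maximizes the posterior probability; in standard notation:

    class(x) = argmax_{c ∈ C} P(c) · ∏_{i=1..n} P(ai | c)

where C is the set of class labels and the probabilities are estimated from the training data.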
Most works have focused on conditionally searching for a subset of instances with which to
build classifiers, and the highest probability was then used to select the appropriate learner.
In [3], the distance between the training instances and the input instance was used to
distribute the training instances into different neighborhoods; a parametric mechanism then
searched these neighborhoods to build classifiers, and the classifier with the highest estimated
accuracy was used to make the decision. We chose this classifier because it outperforms NB
and NBTree [4] in a large variety of domains.
In this paper we modify the Selective Neighborhood based Naive Bayes (SNNB) classifier
presented in [3]. We propose other mechanisms for handling distance measures, instance
organization, instance space search and model selection. Combining all of them, we obtained
36 different modifications, 35 of which had never been explored. The objective of these
proposed combinations is to find the modification that increases the accuracy of the SNNB
algorithm. To validate classifier quality we used 26 commonly used datasets from the UCI
repository. The final results showed that the best strategy found won in 15 cases, tied in 7,
and only lost in 3 cases.
This paper is organized as follows: section 2 presents the related work; section 3 describes
the SNNB classification method, a scheme and diagram of the proposed strategies, and one of
the four algorithms implemented; section 4 presents the cases used for experimentation, the
evaluation process and the experimental results; finally, section 5 presents some conclusions
of this work.

2. Related works

Classification methods can be divided into two types: eager learning and lazy learning [3].
The first replaces the training inputs with an abstract representation, such as a rule set, decision
tree, or neural network, and uses it to process queries. The second spends little or no effort
during training; most of the work is deferred until the classification stage.
Naive Bayes has surprisingly good performance in a wide variety of domains, even in many
domains where there are clear dependencies among attributes. As a result, Naive Bayes has
attracted much attention from researchers and its performance has been improved. Research
work to extend Naive Bayes can be broadly divided into three main categories. The first
category aims at improving Naive Bayes by transforming the input feature space.
The second category employs the notion of local learning to extend Naive Bayes. The Naive
Bayes tree (NBTree) [4] uses decision tree techniques to divide the entire instance space into
several subspaces, and then trains a Naive Bayes classifier for each leaf node. Such a single tree
is built to be the most appropriate, on average, for all the instances; although this is generally
adequate, the prediction performance is usually poor for those examples that match paths with
few training examples at their leaves (the small disjunct problem). The Lazy Bayesian Rule
(LBR) classification method [5] was proposed to address this problem of the NBTree.
Zheng Zijian [6] presented another method to generate Naive Bayes Classifier Committees
by constructing individual Naive Bayes classifiers using different attribute subsets in sequential
trials; majority voting over the committee members was applied at the classification stage.
In [3] the author proposed a novel lazy learning algorithm: SNNB. This method computes
different distance neighborhoods of the new input object, lazily learns multiple Naive Bayes
classifiers, and uses the classifier with the highest estimated accuracy to make a decision.
The third category of research relaxes the attribute independence assumption. Friedman and
Goldszmidt [7] explored a restricted subclass of Bayesian networks by inducing a
tree-structured network; this algorithm was named Tree Augmented Naive Bayes (TAN).
Table 1 shows some of the most recent works related to improvements of the Naive Bayes
classifier. The distance measures used for matching the training instances to the input instance
are indicated in columns 2-4. The instance organizations that were used are described in
columns 5-7. The method for selecting a subset of instances for constructing classifiers is
indicated in columns 8-10. Finally, the methods to choose the appropriate learner and classify
the input instance are indicated in columns 11-13.
Most of these works focused on conditionally finding a subset of instances with which to
construct classifiers; additionally, the highest probability was used to select the appropriate
learner. In [3] the distance between the training instances and the input instance was used to
organize them into different neighborhoods. Furthermore, a parametric mechanism was utilized
to find neighborhoods for constructing classifiers, from which the classifier with the highest
estimated accuracy was chosen to make decisions.

Table 1. Related Works

Related Works         Distance Measures   Instances Organization   Instances Search   Model Selection
Kohavi [4]            -                   T                        C                  MP
Zheng and Webb [5]    -                   T                        C                  MP
Zheng [6]             -                   SV                       C                  MV
Friedman [7]          -                   T                        C                  MP
Zhipeng [3]           H                   CV                       BP                 MP
Our approach          H, M, E             SV, CV                   BP, BB             MP, SP, MV

The abbreviations in Table 1 mean: Different Attributes (H), Manhattan (M), Euclidean (E),
Tree (T), Ordered (SV), Neighborhood (CV), Conditional (C), Parametric (BP), Binary (BB),
Best Probability (MP), Sum or Average (SP) and Majority Vote (MV).
In this paper we modified the SNNB classifier [3] and proposed other mechanisms to handle
distance measures, instance organization, space search, and model selection (see Table 1).
Combining all of them, we obtained 36 modifications. The objective is to find the modification
that most increases the accuracy of the SNNB algorithm.

3. Description of the SNNB method and proposed strategies

This section presents a brief explanation of the SNNB classifier; a complete description can be
found in [3]. In order to increase the accuracy of SNNB, we propose mechanisms for handling
distance measures, instance organization, instance space search and model selection.

3.1. SNNB classification method

SNNB (Selective Neighborhood based Naive Bayes for Lazy Learning) is one of the
algorithms that have been proposed to improve the performance of the Bayesian classifier. It
consists of three main steps. The first step calculates the distance between the new input object
x and each training object t in training set T, and stores all the training objects in neighborhoods
according to their distances. The second step finds, in a parametric way, adequate
neighborhoods to construct several NB classifiers.
The SNNB algorithm has two parameters: one is the support difference threshold θ, with 0.5
as the default value; the other is the support threshold φ, with 0.03 as the default value. The
support threshold is used to ensure the generalizing ability of the learned model, while the
support difference threshold is mainly for controlling the speed of the algorithm [3]. The last
step classifies the new object with the most accurate NB classifier.
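
To make these three steps concrete, the following is a minimal Python sketch of the procedure, assuming nominal attributes, the different attributes distance (H), and a hypothetical helper train_nb that fits a Naive Bayes model and reports an estimated accuracy; it is a simplified illustration under these assumptions, not the reference implementation of [3].

def different_attributes(x, t):
    # H distance: number of attributes on which x and t differ [3]
    return sum(a != b for a, b in zip(x, t))

def snnb_classify(x, T, labels, theta=0.5, phi=0.03):
    """Simplified SNNB sketch; train_nb(instances) is an assumed helper that
    returns a Naive Bayes model with an estimated .accuracy and a .predict(x) method."""
    # Step 1: store every training object in a neighborhood keyed by its distance to x
    neighborhoods = {}
    for t, y in zip(T, labels):
        neighborhoods.setdefault(different_attributes(x, t), []).append((t, y))

    # Step 2: grow the neighborhood outward from the closest objects and keep
    # the classifier with the highest estimated accuracy
    best_clf, subset, last_size = None, [], 0
    for d in sorted(neighborhoods):
        subset.extend(neighborhoods[d])
        if len(subset) < phi * len(T):
            continue      # support threshold phi: too few objects to generalize
        if best_clf is not None and len(subset) - last_size < theta * last_size:
            continue      # support difference threshold theta: growth too small to retrain
        clf = train_nb(subset)            # assumed helper
        last_size = len(subset)
        if best_clf is None or clf.accuracy > best_clf.accuracy:
            best_clf = clf

    # Step 3: classify x with the most accurate classifier (global NB as fallback)
    if best_clf is None:
        best_clf = train_nb(list(zip(T, labels)))
    return best_clf.predict(x)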

3.2. Proposed strategies

In lazy learning we need to match the input object to the training objects. This can be carried
out using distance measures; we used the penalized form of [8] for handling the Euclidean (E)
and Manhattan (M) measures. Additionally, we utilized the different attributes distance (H)
defined in [3] (see the first part of Figure 1).
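
As an illustration, below is a minimal sketch of the three base measures on attribute vectors; the penalty term of [8] for mixed attribute types is omitted here, and the plain-list representation of instances is an assumption.

def different_attributes(x, t):
    # H: number of attributes on which the two instances differ [3]
    return sum(a != b for a, b in zip(x, t))

def manhattan(x, t):
    # M: sum of absolute per-attribute differences (numeric attributes assumed)
    return sum(abs(a - b) for a, b in zip(x, t))

def euclidean(x, t):
    # E: square root of the summed squared per-attribute differences
    return sum((a - b) ** 2 for a, b in zip(x, t)) ** 0.5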
The basic instance organizations that we used are: the simple structure (SV), where all the
training instances are ordered and stored in a bidimensional array; and the use of
neighborhoods (UV). With this last approach we used two strategies. One of them is
neighborhoods with intervals (CVI), where the distances of the training instances are divided
into I intervals (I was set to 5 in our experiments) of equal width w = (dmax - dmin)/I, so the
neighborhoods are delimited by dmin+w, dmin+2w, ..., dmin+(I-1)w (this was used with the
Manhattan and Euclidean distances). The other strategy is neighborhoods without intervals
(CV), where measure H, as proposed in [3], is used to calculate the distance between the
training instances and the new instance for creating the neighborhoods. The proposed
modifications for organizing instances are shown in part 2a of Figure 1.
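
For illustration, the following is a minimal sketch of the interval-based organization (CVI), assuming the distances from each training instance to the input instance have already been computed; the function and variable names are hypothetical.

def build_interval_neighborhoods(distances, I=5):
    """Group training-instance indices into I equal-width distance intervals (CVI).
    distances: distances between each training instance and the input instance."""
    d_min, d_max = min(distances), max(distances)
    w = (d_max - d_min) / I                      # common interval width
    neighborhoods = [[] for _ in range(I)]
    for idx, d in enumerate(distances):
        # interval index; the upper boundary d_max falls into the last interval
        k = min(int((d - d_min) / w), I - 1) if w > 0 else 0
        neighborhoods[k].append(idx)
    return neighborhoods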

[Figure 1 depicts the proposed strategies as a pipeline. Part (1), distance measures: calculating
the distances between the input instance and the training instances with the different attributes
(H), Manhattan (M) or Euclidean (E) measure. Part (2), local learning: (2a) instances
organization, either a simple structure of ordered instances (SV) or organized neighborhoods
(UV), the latter built with intervals (CVI) or without intervals (CV); and (2b) space search over
the organized instances, either binary (BB) or parametric (BP). Part (3), model selection: best
probability (MP), majority vote (MV), or averaged probabilities for each class with the best one
selected (SP).]

Figure 1. Proposed strategies

We used two different types of space search to find instance subsets from the original
structure and construct appropriate classifiers. The first uses a binary methodology (BB),
where the object structure is partitioned and the best part is selected to construct a classifier;
this part is then partitioned again, and so on. The second type is similar to that of [3], which
uses configuration parameters (BP) to find and regulate the neighborhood sizes. These two
types of space search are shown in part 2b of Figure 1.
For selecting among the constructed classifiers we used three approaches: choosing the
classifier with the highest probability (MP), majority vote (MV), and choosing the class with
the highest probability after averaging the probabilities of each class over the classifiers (SP)
(see part 3 of Figure 1).
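
A minimal sketch of the three selection rules, assuming every candidate classifier exposes a hypothetical predict_proba(x) method that returns its per-class probabilities for the input instance in a common class order:

import numpy as np

def select_prediction(classifiers, x, mode="MP"):
    """Return the index of the predicted class under the MP, MV or SP rule."""
    probs = np.array([clf.predict_proba(x) for clf in classifiers])
    if mode == "MP":      # best probability: the classifier with the single highest probability decides
        best_clf = probs.max(axis=1).argmax()
        return probs[best_clf].argmax()
    if mode == "MV":      # majority vote over the classifiers' predicted classes
        votes = probs.argmax(axis=1)
        return np.bincount(votes).argmax()
    if mode == "SP":      # average the probabilities per class and pick the best
        return probs.mean(axis=0).argmax()
    raise ValueError("mode must be MP, MV or SP")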

3.3. Proposed strategies diagram

We combined the strategies explained in the last section aiming at exploring possible
improvements to the SNNB algorithm. We obtained 36 improvement alternatives, which are
shown in Figure 2. Each branch represents a strategy alternative denoted by Ei; for example,
E1 applies parametric search (BP), the simple structure (SV), best probability selection (MP)
and the different attributes distance (H), as explained in the previous section. Strategy E10
corresponds to the original SNNB algorithm. The experimental evaluation of each strategy is
presented in section 4.
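
The 36 alternatives are simply the Cartesian product of the four design choices; the small sketch below enumerates them in the same order as Figure 2, so that, for example, E17 maps to (BP, UV, SP, M) and E10 to SNNB's own combination (BP, UV, MP, H).

from itertools import product

searches = ["BP", "BB"]          # space search
organizations = ["SV", "UV"]     # instances organization
selections = ["MP", "MV", "SP"]  # model selection
distances = ["H", "M", "E"]      # distance measure

# Enumerate the 36 strategies in the same order as Figure 2 (E1 ... E36)
strategies = {
    "E%d" % i: combo
    for i, combo in enumerate(product(searches, organizations, selections, distances), start=1)
}

print(strategies["E10"])   # ('BP', 'UV', 'MP', 'H')  -> the original SNNB
print(strategies["E17"])   # ('BP', 'UV', 'SP', 'M')  -> the best strategy found (section 4)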

                                     Distance measure
Search   Organization   Selection   H            M            E
BP       SV             MP          E1           E2           E3
BP       SV             MV          E4           E5           E6
BP       SV             SP          E7           E8           E9
BP       UV             MP          E10 (CV)     E11 (CVI)    E12 (CVI)
BP       UV             MV          E13 (CV)     E14 (CVI)    E15 (CVI)
BP       UV             SP          E16 (CV)     E17 (CVI)    E18 (CVI)
BB       SV             MP          E19          E20          E21
BB       SV             MV          E22          E23          E24
BB       SV             SP          E25          E26          E27
BB       UV             MP          E28 (CV)     E29 (CVI)    E30 (CVI)
BB       UV             MV          E31 (CV)     E32 (CVI)    E33 (CVI)
BB       UV             SP          E34 (CV)     E35 (CVI)    E36 (CVI)

Figure 2. Diagram of proposed strategies

For implementing the 36 strategies, we designed four algorithms that correspond to the
combinations of the first two levels of Figure 2: BB-SV, BB-UV, BP-SV, and BP-UV. One of
these algorithms (BB-UV) is shown below, whose input parameters are: training instances set T,
Global Naive Bayes classifier CLSglobal, a new instance x, a model selection technique
model_selection, a distance measure distance_measure and the number of intervals I.

Algorithm BB-UV
Begin
  For each tj ∈ T, dj = distance between x and tj
  If distance_measure = H Then
    I = number of attributes
  Else
    Create I intervals and assign the corresponding interval to each distance
  For each tj ∈ T, add tj to INST[dj]
  numInstancias = number of instances of T
  total = numInstancias, k_NB = CLSglobal, NHk = T, ini = 0, fn = I - 1
  best_ps = INITIALIZE_PROBABILITIES(k_NB, x, model_selection)
  Repeat
    count = 0, k = fn
    While count < int((1 – θ) * total) and k ≥ ini
      count = count + number of instances of INST[k]
      k = k – 1
    If k < ini Then
      Exit
    If count < int(φ * numInstancias) and
       total – count < int(φ * numInstancias) Then
      Exit
    NHk = INST[ini] ∪ INST[ini+1] ∪ … ∪ INST[k]
    NHk_aux = INST[k+1] ∪ INST[k+2] ∪ … ∪ INST[fn]
    Train a Naive Bayes classifier with NHk in k_NB
    ps = the probabilities of x corresponding to each class
         using the k_NB classifier
    Train a Naive Bayes classifier with NHk_aux in k_NB
    ps_aux = the probabilities of x corresponding to each class
             using the k_NB classifier
    If maximum value of ps_aux > maximum value of ps Then
      ps = ps_aux
      NHk = NHk_aux
      ini = k + 1
    Else
      fn = k
    UPDATE_PROBABILITIES(best_ps, ps, model_selection)
    total = number of instances of NHk
  index = index of the maximum value of best_ps
  clase = the class whose index is index
  Return index, clase
End

The BB-UV algorithm organizes the training instances into neighborhoods according to
intervals of the distance values. First the instance set is divided into two parts and a classifier
is built with each partition; then the data subset of the best classifier is chosen. This process is
repeated with the selected subset and finishes when no improvement in classification accuracy
is achieved. Finally, to classify the new instance, a model selection method (MP, MV, SP) is
applied.

4. Empirical strategy design and results

4.1. Machine learning datasets and performance evaluation

We collected 26 machine learning datasets, which include a large variety of domains from
the UCI repository (Table 2). Additionally, we applied three preprocessing schemes to each
case: in the first, only discretization was applied (WP); in the second, the datasets were
balanced, features were selected, and values were discretized (PP); and in the third, the missing
values were eliminated, the cases were balanced, the features were selected, and the values
were discretized (P).
For evaluation purposes, we obtained a set of samples x1, x2, …, xk by applying successive
tenfold cross-validations (we used the Weka project, www.cs.waikato.ac.nz/ml/weka/) to one
strategy, and a second set of samples y1, y2, …, yk obtained by applying the same technique with
a different strategy. Each cross-validation was generated using a different tenfold partition of
the data; the value k was 10, because we were using ten tenfold cross-validations. The same
cross-validation partitions were used for both strategies, so that x1 and y1 were obtained using
the same cross-validation split, as well as x2 and y2, and so on. We used the paired t-statistic
at a 1% significance level.
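
As an illustration of this protocol, the sketch below performs the paired comparison with scipy's related-samples t-test; the accuracy values are placeholders rather than results from our experiments, since the cross-validations themselves were run with Weka.

import numpy as np
from scipy import stats

def compare_strategies(acc_x, acc_y, alpha=0.01):
    """acc_x, acc_y: accuracies x1..xk and y1..yk of two strategies obtained with
    the same k tenfold cross-validation partitions (here k = 10).
    Returns the t statistic and whether the difference is significant at level alpha."""
    t_stat, p_value = stats.ttest_rel(np.asarray(acc_x), np.asarray(acc_y))
    return t_stat, p_value < alpha

# Example with placeholder accuracy values for two strategies on one dataset
x = [86.1, 85.9, 86.4, 86.0, 86.2, 85.8, 86.3, 86.1, 86.0, 86.2]
y = [85.0, 84.8, 85.2, 84.9, 85.1, 84.7, 85.3, 85.0, 84.9, 85.1]
print(compare_strategies(x, y))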

Table 2. Dataset characteristics


Dataset Attributes Classes Size Dataset Attributes Classes Size
Anneal 38 6 798 Ionosphere 34 2 351
Australian 14 2 690 Iris 4 3 150
Auto 25 7 205 Labor 16 2 57
Breast-w 10 2 699 Led7 7 10 3200
Cleve 13 2 303 Lymph 18 4 148
Crx 15 2 690 Pima 8 2 768
Diabetes 8 2 768 Sick 29 2 2800
German 20 2 1000 Sonar 60 2 229
Glass 9 7 214 Tic-tac 9 2 958
Heart 13 2 270 Vehicle 18 4 846
Hepatitis 19 2 155 Waveform 21 3 5000
Horse 22 2 368 Wine 13 3 178
Hypo 25 2 3163 Zoo 16 7 101

4.2. Experimental results

We tested the proposed strategies on the machine learning datasets. Table 3 shows the results
of each strategy under the different preprocessing schemes. The table shows that the best
strategy was E17, which uses the Manhattan distance (M) to match the input and training
objects, and equal-width intervals for building the object structure as neighborhoods (CVI).
Additionally, it uses parametric search (BP) to find the object subsets for constructing
appropriate classifiers, and its model selection uses the averaged probabilities to classify the
input object (SP). The best result was 87.58 % with the preprocessing scheme PP.
Additionally, Table 4 shows the results of the best strategy and the SNNB algorithm on the 26
cases under the preprocessing scheme PP. These results were satisfactory, because this strategy
won in 15 cases, tied in 7, and only lost in 3 cases by a small difference. (We omitted the NB
classifier results, because it was outperformed by the SNNB of [3].)

Table 3. Results of the proposed strategies


            Preprocessing schemes             Preprocessing schemes
Strategy     WP      PP      P      Strategy   WP      PP      P
E1 84.87 86.25 86.68 E19 84.89 86.06 86.63
E2 84.93 86.22 86.64 E20 84.87 86.08 86.63
E3 84.91 86.32 86.65 E21 84.88 86.06 86.62
E4 84.86 86.21 86.65 E22 84.90 86.11 86.65
E5 84.89 86.18 86.61 E23 84.89 86.15 86.65
E6 84.88 86.24 86.64 E24 84.87 86.12 86.66
E7 84.87 86.22 86.67 E25 84.90 86.07 86.65
E8 84.92 86.20 86.62 E26 84.89 86.11 86.64
E9 84.90 86.29 86.64 E27 84.88 86.09 86.65
E10 85.05 86.89 87.06 E28 84.75 84.88 85.13
E11 85.06 87.51 87.15 E29 84.86 85.32 85.59
E12 85.03 87.21 87.17 E30 84.82 84.82 85.35
E13 85.08 87.21 87.11 E31 84.84 84.71 85.25
E14 85.07 87.35 87.13 E32 84.92 85.43 85.77
E15 84.97 87.10 87.00 E33 84.80 84.67 85.64
E16 85.06 87.01 87.08 E34 84.73 84.62 85.03
E17 85.11 87.58 87.25 E35 84.87 85.33 85.47
E18 85.04 87.16 87.18 E36 84.82 84.65 85.20

Table 4. Results of the best strategy


            With PP scheme   E17 outperforms              With PP scheme   E17 outperforms
Cases       E17      SNNB    SNNB                Cases    E17      SNNB    SNNB
Anneal 97.92 97.89 Yes Iono 95.02 93.71 Yes
Austra 86.94 85.77 Yes Iris 97.07 97.07 Equal
Auto 89.33 89.48 No Labor 88.19 89.17 No
Breast 98.09 97.74 Yes Led7 72.91 72.91 Equal
Cleve 83.17 82.73 Yes Lymph 89.32 90.00 No
Crx 86.46 85.59 Yes Pima 76.86 76.86 Equal
Diabetes 78.71 78.57 Yes Sick 97.69 97.24 Yes
German 79.26 77.15 Yes Sonar 83.38 83.38 Equal
Glass 89.98 90.30 No Tic-tac 83.74 73.21 Yes
Heart 82.93 82.73 Yes Vehicle 64.28 64.28 Equal
Hepatitis 90.12 89.96 Yes Wave 84.95 83.44 Yes
Horse 82.80 82.30 Yes Wine 98.49 98.49 Equal
Hypo 99.36 99.16 Yes Zoo 100 100 Equal

Finally, Table 5 presents only the win cases of the best strategy, which show that E17 has a
superior performance, with an average difference of 1.34 % over all win cases. Although this
percentage is not very high, the best strategy performed well in several demanding cases:
Waveform, Sick and Tic-tac, because they are large datasets; Anneal, Hypo and German,
because they have many attributes; and Anneal, because it has more than two classes.
Table 5. Win cases of the best strategy (with PP scheme)

Cases          E17       SNNB      E17 outperforms SNNB
Anneal 97.92 97.89 Yes
Austra 86.94 85.77 Yes
Breast 98.09 97.74 Yes
Cleve 83.17 82.73 Yes
Crx 86.46 85.59 Yes
Diabetes 78.71 78.57 Yes
German 79.26 77.15 Yes
Heart 82.93 82.73 Yes
Hepatitis 90.12 89.96 Yes
Horse 82.80 82.30 Yes
Hypo 99.36 99.16 Yes
Iono 95.02 93.71 Yes
Sick 97.69 97.24 Yes
Tic-Tac 83.74 73.21 Yes
Waveform 84.95 83.44 Yes
Averages 88.48 87.14 1.34% difference

5. Conclusions

In this work we presented several strategies to modify the SNNB algorithm. These were
based on different ways for handling distance measures, instances organization, instances space
search, and model selection. All of them were combined to yield 36 different modifications, 35
of which have never been explored.
The results of the experiments were satisfactory. We analyzed the results of 26 cases from
the machine learning UCI repository and found that the new strategy E17 was the best. This
strategy combines the Manhattan distance (M), a division of the distances of the training
instances into I intervals for building neighborhoods (CVI), parametric search (BP), and model
selection using averaged probabilities (SP). Strategy E17 obtained an average accuracy of
87.58 %; it won in 15 cases, tied in 7, and only lost in 3 cases by a minimal difference. We also
found that the best strategy performed well in several demanding cases: Waveform, Sick,
Tic-tac, Anneal, Hypo and German.
For future work we plan to conduct experiments with different values of parameter I in order
to find a method to set it automatically. We will also try another interval method for organizing
instances, for example one based on equal frequencies. Finally, since the objective of the
statistical approach proposed in [1] is to predict
with high accuracy the algorithm with the best performance for a new instance, we will use this
strategy to experiment with standard cases [9] for the bin packing problem.
References

[1] Pérez, J., Pazos, R.A., Frausto, J., Rodríguez, G., Romero, D., Cruz, L.: A Statistical Approach for Algorithm
Selection. Lecture Notes in Computer Science, Vol. 3059. Springer-Verlag, Berlin Heidelberg New York, pp.
417-431, 2004.
[2] Chawla, N.V.: C4.5 and imbalanced data sets: Investigating the effect of sampling method, probabilistic
estimate, and decision tree structure. ICML Workshop on Learning from Imbalanced Data Sets II, Washington,
DC, USA, 2003.
[3] Xie, Z., Hsu, W., Liu, Z., Lee, M.L.: SNNB: A selective neighborhood based Naive Bayes for lazy learning.
In: Proceedings of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan,
2002.
[4] Kohavi R.: Scaling up the accuracy of Naïve-Bayes classifiers: a decision-tree hybrid. In Simoudis E. & Han J.,
editor, Proceedings Second International Conference on Knowledge Discovery & Data Mining, pp. 202-207,
AAAI Press/MIT press, Cambridge/Menlo Park, 1996.
[5] Zheng, Z. and Webb, G.I.: Lazy learning of bayesian rules. Machine Learning 41(1), 53-84, Kluwer Academic
Publishers, 2000.
[6] Zheng, Z.: Naïve Bayesian classifier committees. Proceedings of the 10th European Conference on Machine
Learning, Vol. 1398 of Lecture Notes in Computer Science, pp. 196-207, London, UK, Springer-Verlag, 1998.
[7] Friedman, N., Goldszmidt M.: Building classifiers using bayesian networks. In Proceedings of Thirteenth
National Conference on Artificial Intelligence, AAAI Press/MIT press, Cambridge/Menlo Park, pp. 1277-1284,
1996.
[8] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence
Research. 16, 321-357, 2002.
[9] Coffman, E.G. Jr., Garey, M.R., Johnson, D.S.: Approximation algorithms for Bin-Packing, a survey. In Dorit S.
Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pp. 46-93, PWS, Boston MA, 1997.
