85-90
ABSTRACT: Data mining is a step in the knowledge discovery process that applies data mining algorithms to find
patterns or models in data. Data mining can also be defined as an analytic process designed to explore large amounts
of data in search of consistent patterns and systematic relationships between variables, and then to validate the findings by
applying the detected patterns to new subsets of data. Classification is the most commonly applied data mining technique;
it employs a set of pre-classified examples to develop a model that can classify the population of records at large. In
classification techniques a model is built from training data and applied to test data. WEKA is an open source data
mining tool which includes implementations of data mining algorithms. Using WEKA we have compared the ADTree, Bayes
Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBFNetwork and SMO algorithms. To compare these
algorithms we have used five datasets.
Keywords: Algorithms, Data Mining, Classification
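As an illustration of the train-then-test workflow mentioned in the abstract, the following is a minimal sketch of a Naive Bayes classifier for categorical attributes. It is not the WEKA implementation used in this study: the function names, the toy rows (marital status, car ownership) and the add-one (Laplace) smoothing are assumptions made only for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Build the model from training data by counting frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen per attribute, used for add-one smoothing
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def predict(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for i, value in enumerate(row):
            # each attribute is treated independently of the others
            numerator = value_counts[(i, label)][value] + 1  # assumed smoothing
            denominator = count + len(distinct[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, not taken from any dataset in this study.
train_rows = [("married", "car"), ("married", "no_car"), ("single", "car"),
              ("single", "no_car"), ("married", "car"), ("single", "no_car")]
train_labels = ["yes", "yes", "no", "no", "yes", "no"]
test_rows = [("married", "no_car"), ("single", "car")]

model = train_naive_bayes(train_rows, train_labels)   # build on training data
predictions = [predict(model, r) for r in test_rows]  # apply to test data
print(predictions)
```

Working in log space avoids numerical underflow when many attribute likelihoods are multiplied together.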
intuitive concept. Also, in some cases it is seen that Naive Bayes outperforms many other comparatively complex algorithms. It makes use of the variables contained in the data sample, by observing them individually, independent of each other. The Naive Bayes classifier is based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data, and analyses them individually as though they are equally important and independent of each other. The Naive Bayes classifier will consider each of these attributes separately when classifying a new instance. NBTree [10] is a hybrid approach which combines the capabilities of the decision tree and the Naive Bayes classifier: the decision-tree nodes contain splits as in regular decision trees, but the leaves contain Naive Bayesian classifiers. PART [11] is a partial decision tree algorithm which extends C4.5 to generate a rule set. PART builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left. The RBF (radial basis function) network [12] is a variant of the neural network in which the radial basis functions are embedded into a two-layer network. In order to use a radial basis function network we need to specify the hidden unit activation function, the number of processing units, a criterion for modeling the given task and a training algorithm for finding the parameters of the network. SMO (sequential minimal optimization) [13] is an extension of Support Vector Machines (SVM) that solves the problem of handling large datasets in SVM.

2. RELATED WORK
In [2], Xindong Wu, Vipin Kumar et al. have given a descriptive study of 10 data mining algorithms, which include C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. Andrew Secker, Matthew N. Davies et al. [3] compared different classification algorithms for hierarchical prediction of protein function based on the predictive accuracy of the classifiers. In [4], the author has performed experiments on weed and crop images and datasets to test the classification algorithms. In [5], Ryan Potter has compared classification algorithms on a breast cancer dataset to perform the diagnosis of the patients.

3. EXPERIMENTAL DETAILS
We have used the data sets available from DePaul University and the UCI Machine Learning Repository. From the WEKA GUI we have tested the ADTree (ADT), Bayes Network (BayesNet), DecisionTable (DT), J48, Logistic, Naive Bayes (NB), NBTree (NBT), PART, RBFNetwork (RBFN) and SMO algorithms on five data sets and observed the following results. We have used the bank, car, breast cancer, credit-g and diabetes datasets. Table 1 gives a brief description of each dataset. We have studied and compared these algorithms on the following parameters: Correctly Classified Instances (CCI), Incorrectly Classified Instances (ICI), Kappa Statistic (KS), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The Kappa statistic is used to measure the inter-rater agreement for categorical items, i.e. it is an index which compares the observed agreement against that which might be expected by chance. Kappa can be thought of as the chance-corrected proportional agreement, and possible values range from +1 (perfect agreement) via 0 (no agreement above that expected by chance) to -1 (complete disagreement). Mean absolute error measures how close the predictions are to the eventual outcomes; it is the average of the absolute errors in the predictions. The root mean squared error is a measure of the variance of the predictions. It is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed from the thing being modeled or estimated.

Table 1
Description of Datasets available from DePaul University and UCI repository

3.1. Bank Dataset
In the Bank dataset there are 11 attributes (age, sex, region, income, married, children, car, save-account, current-account, mortgage and pep) and 600 data items, classified into two classes according to whether the person will go for a Pension Equity Plan (PEP) or not.

Table 2
Result from WEKA for Bank Dataset

Algorithm     CCI(%)   ICI(%)   KS       MAE      RMSE
ADT           84.67    15.33    0.6853   0.3350   0.3728
BayesNet      70.00    30.00    0.3862   0.3968   0.4487
DT            80.83    19.17    0.6123   0.2988   0.3750
J48           91.00    9.00     0.8178   0.1559   0.2903
Logistic      73.00    27.00    0.4518   0.3607   0.4303
Naive Bayes   69.00    31.00    0.3724   0.3773   0.4397
NBT           88.67    11.33    0.7710   0.1766   0.3194
PART          85.17    14.83    0.7003   0.1803   0.3573
RBFN          73.33    26.67    0.4585   0.3590   0.4317
SMO           70.80    29.20    0.4062   0.2917   0.5401

for bank dataset. The Kappa Statistic value for J48 is the closest to 1 (i.e. 0.8178), which indicates that J48 provides the strongest agreement for classification of data items. J48 also has the lowest mean absolute error and root mean squared error, as it provides more accurate predictions with less variance.

3.2. Car Dataset
There are 6 attributes (buying capacity, maintenance, number of doors, person seating capacity, luggage boot space and safety) plus the class attribute, and 1728 data items in the car data set. The data items are classified into four classes (unacc, acc, good and very good), which give the acceptance level of the car by people based on the six attributes.

Table 3
Result from WEKA for Car Dataset

Algorithm     CCI(%)   ICI(%)   KS       MAE      RMSE
ADT           -        -        -        -        -
BayesNet      85.71    14.29    0.6713   0.1114   0.2254
DT            91.03    08.97    0.7987   0.2748   0.3220
J48           92.36    07.64    0.8343   0.0421   0.1718
Logistic      93.11    06.89    0.8504   0.0428   0.1520
Naive Bayes   85.53    14.47    0.6665   0.1137   0.2262
NBT           94.21    05.79    0.8752   0.0676   0.1571
PART          95.78    04.22    0.9091   0.0241   0.1276
RBFN          94.21    05.79    0.8752   0.0676   0.1571
SMO           93.75    06.25    0.8649   0.2559   0.3202

For the car dataset PART performs the best, followed by RBFNetwork and NBTree. ADTree is disabled for the car dataset in WEKA because it provides predictions only for datasets with two classes. The Kappa Statistic for PART is closest to perfect agreement (i.e. 0.9091). PART has the highest percentage of correctly classified instances and the lowest mean absolute error and root mean squared error.
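The evaluation measures reported in the result tables (CCI, ICI, Kappa Statistic, MAE and RMSE) can be illustrated with a short sketch. This is not WEKA's own code: the function names and the toy labels and errors are assumptions for this example, and MAE and RMSE are computed here over a plain list of per-prediction errors, which may differ from how WEKA derives them for nominal classes.

```python
import math
from collections import Counter


def agreement_measures(actual, predicted):
    """Return CCI (%), ICI (%) and the Kappa statistic for two label lists."""
    n = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    p_observed = correct / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # agreement expected by chance, from the marginal class frequencies
    p_chance = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    # chance-corrected agreement (undefined when p_chance equals 1)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return 100 * p_observed, 100 * (1 - p_observed), kappa


def error_measures(errors):
    """Return MAE and RMSE for a list of prediction errors."""
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Hypothetical labels: 8 of 10 instances are classified correctly.
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "yes"]
cci, ici, kappa = agreement_measures(actual, predicted)
print(round(cci, 2), round(ici, 2), round(kappa, 4))

# Hypothetical errors (predicted value minus observed value).
mae, rmse = error_measures([0.1, -0.2, 0.3, -0.4])
print(round(mae, 4), round(rmse, 4))
```

In the toy example the observed agreement is 0.80 and the chance agreement is 0.52, so Kappa is (0.80 - 0.52) / (1 - 0.52), about 0.58. This shows why Kappa is lower than the raw proportion of correctly classified instances.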
[12] Adrian G. Bors, I. Pitas, Introduction to RBF Network, Online Symposium for Electronics Engineers, 1(1), 1-7 (2001).
[13] Jingmin Wang, Kanzhang Wu, Study of the SMO Algorithm Applied in Power System Load Forecasting, Springer LNCS, pp. 1022-1026, (2006).