
2.1: Classification Trees

2.1(a.)

Using training set, Error Rate (columns: confidence factor C; rows: minNumObj M):

C/M     0.40         0.25         0.05
2       2.1952 %     2.8255 %     4.6946 %
10      4.7381 %     4.9772 %     5.6727 %
50      8.9763 %     8.9763 %     9.1285 %
100     10.9541 %    10.9541 %    10.9541 %

Using training set, Size of Tree:

C/M     0.40    0.25    0.05
2       269     207     113
10      117     99      75
50      23      23      19
100     11      11      11

Confidence Factor: According to the Weka documentation, the confidence factor is used for pruning the tree, and a lower confidence factor means more pruning. The second table above shows this: as the value of C decreases, the size of the tree decreases. In this case, pruning the tree has caused the training-set error rate to increase slightly.

minNumObj: It is the minimum number of objects that are acceptable in the leaf nodes of the tree.

As minNumObj increases, the error rate on the given data set also increases. Since each leaf node must hold more objects, the size of the tree shrinks as the M value increases.

2.1(b.)

10-fold cross-validation, Error Rate:

C/M     0.40         0.25         0.05
2       7.3028 %     7.0202 %     7.4984 %
10      7.8679 %     7.8896 %     8.0417 %
50      9.737 %      9.8457 %     9.9109 %
100     11.4758 %    11.454 %     11.4106 %

10-fold cross-validation, Size of Tree:

C/M     0.40    0.25    0.05
2       269     207     113
10      117     99      75
50      23      23      19
100     11      11      11

K-fold cross-validation: The original data set is divided into k parts, of which k-1 are used as training data and one as validation data. This procedure is repeated k times so that each subsample is used exactly once as validation data. Here, evaluating with 10-fold cross-validation gives a higher error rate than evaluating on the training set, but the reported size of the tree is unchanged, which is expected if the reported tree is the one built on the full data set.
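
The splitting procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name k_fold_indices is hypothetical, the folds are taken in order without shuffling or stratification, and it is not meant to reproduce how Weka builds its folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for plain k-fold cross-validation.

    Assumption: folds are contiguous and unshuffled; a real evaluation
    would normally shuffle (and often stratify) the data first.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]                  # held-out part
        train_idx = indices[:start] + indices[start + size:]   # remaining k-1 parts
        yield train_idx, val_idx
        start += size


# Example with a hypothetical data set of 25 instances and k = 10.
for fold_no, (train_idx, val_idx) in enumerate(k_fold_indices(25, 10), start=1):
    print(fold_no, len(train_idx), len(val_idx))
```

Every instance lands in the validation part of exactly one fold, so each of the k error estimates is computed on data the model did not see during training.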

2.2 a) Naive Bayes Classifier:

Error rate: 20.7129 %.

Compared to the decision tree classifier, Naive Bayes performs badly on this dataset. It has more than double the error rate when predicting the classes.

2.2 b)

2.2 c) No. I feel that the Naive Bayes assumption that the attributes are independent of each other does not hold completely in this case. I think that a message in which certain words appear together, like "money" and "free", is more likely to be spam than a normal message.

2.2 d)
Error rate: 10.15 %. Yes, it improves the performance of the Naive Bayes classifier.

Correctly Classified Instances      4134    89.85 %
Incorrectly Classified Instances     467    10.15 %
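
As a quick check of the arithmetic, the percentages in 2.2 d) follow directly from the two instance counts. The short sketch below only recomputes them; the variable names are illustrative.

```python
# Instance counts reported in 2.2 d).
correct, incorrect = 4134, 467

total = correct + incorrect
error_rate = 100.0 * incorrect / total   # share of misclassified instances
accuracy = 100.0 * correct / total       # share of correctly classified instances

print(f"{error_rate:.2f} % error, {accuracy:.2f} % accuracy")
# prints "10.15 % error, 89.85 % accuracy"
```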

3.1 (a)
3.1a data set

k                    1       5       10      20
f-measure class 1    0.986   0.99    0.983   0.985
f-measure class 2    0.962   0.924   0.962   0.957

F-measure (three values recorded): 0.973 0.977 0.977

3.1b data set

k                    1       5       10      20
F-measure            0.992   0.986   0.983   0.974
f-measure class 1    0.995   0.992   0.99    0.986
f-measure class 2    0.97    0.947   0.936   0.901

3.1c data set

k                    1       5       10      20
F-measure            0.993   0.987   0.972   0.967
f-measure class 1    0.997   0.993   0.987   0.985
f-measure class 2    0.947   0.889   0.75    0.71

An f-measure close to 1 is good and one close to 0 is poor. As the value of k increases, the f-measure tends to decrease for both classes. This is probably because, as the number of nearest neighbours increases, the neighbourhood of a point in a border region contains neighbours from both classes instead of from a single class.
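
For reference, the f-measure of a class is the harmonic mean of its precision and recall. The sketch below computes it from true-positive, false-positive and false-negative counts; the function name and the example counts are hypothetical and are not taken from the experiments above.

```python
def f_measure(tp, fp, fn):
    """F-measure (F1) of one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # actual positives that are found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 0.9, recall = 0.75.
print(round(f_measure(tp=90, fp=10, fn=30), 3))
```

Because it is a harmonic mean, the f-measure is pulled towards the lower of precision and recall, which is why a class that loses border points to its neighbour shows a visible drop.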

3.1(b)
Across datasets 3.1a, 3.1b and 3.1c, the f-measure increases for class 1, while for class 0 it first increases and then decreases. This is because there are fewer class 0 objects in 3.1c than in 3.1a. At the border region between the classes, an object belonging to class 0 may be classified as class 1, but since there are more class 1 objects, the error rate in classifying class 1 objects may be reduced.

3.1 (c)
3.1a data set

k                    5       10      20
F-measure            0.991   0.992   0.995
f-measure class 1    0.993   0.995   0.997
f-measure class 2    0.981   0.986   0.99

3.1b data set

k                    5       10      20
F-measure            0.992   0.992   0.989
f-measure class 1    0.995   0.995   0.994
f-measure class 2    0.967   0.967   0.958

3.1c data set

k                    5       10      20
F-measure            0.987   0.983   0.976
f-measure class 1    0.993   0.992   0.988
f-measure class 2    0.889   0.857   0.788

By using the 1/distance weighting option, we give more importance to the nearest neighbours of a point than to the farther ones. Because of this, the f-measure for both classes is higher than with unweighted voting at the same k.

3.1 (d)
Naive Bayes with 10-fold cross-validation:

File                   3.1a    3.1b    3.1c
Naive Bayes results    0.94    0.942   0.954
f-measure class 1      0.962   0.97    0.98
f-measure class 2      0.877   0.765   0.571

The trend we can observe is that f-measure (Naive Bayes) < f-measure (kNN) < f-measure (kNN with the weighting option). One reason might be that Bayesian models are better suited to high-dimensional data. Another is that in the synthetic data the classes have clear boundaries and there is no noise, which allows the kNN classifiers to perform well.
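
The difference between plain kNN and kNN with the weighting option can be illustrated with a small sketch. It assumes Euclidean distance, votes weighted by 1/distance, and a tiny hypothetical training set; knn_predict is an illustrative name, not Weka's implementation.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k, weighted=False):
    """Classify `query` from `train`, a list of (point, label) pairs."""
    # Keep the k training points closest to the query.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        d = math.dist(point, query)
        # Weighted: closer neighbours count more (small constant avoids division by zero).
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical border case: one very close class 2 point, two farther class 1 points.
train = [((0.0, 0.0), "class 2"), ((1.0, 0.0), "class 1"), ((0.0, 1.0), "class 1")]
query = (0.1, 0.1)

print(knn_predict(train, query, k=3))                 # class 1 (majority vote)
print(knn_predict(train, query, k=3, weighted=True))  # class 2 (nearest point dominates)
```

With plain voting the two distant class 1 points outvote the nearby class 2 point; with 1/distance weights the nearby point wins, which is the kind of border-region case where weighting helps the smaller class.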

3.2(a) data set

Algorithm            Decision tree   Naive Bayes   kNN 1   kNN 5
F-measure            1               0.47          1       1
f-measure class 1    1               0.482         1       1
f-measure class 2    1               0.458         1       1

The classes have clear boundaries between them, which allows the decision tree and kNN classifiers to distinguish between them easily in their models. The Bayesian model, being predictive and relying on the prior probability of the classes over a region, misclassifies the objects and has a low f-measure.

3.2(b) data set

Algorithm            Decision tree   Naive Bayes   kNN 1   kNN 5
F-measure            1               0.516         0.769   0.837
f-measure class 1    1               0.52          0.772   0.839
f-measure class 2    1               0.512         0.766   0.836

The kNN classifiers show a significant drop in performance, while Naive Bayes has roughly maintained the performance it had on the previous data set. This is because the class boundaries are only vaguely defined here, so the neighbours used for kNN classification come from more than one class. Naive Bayes, being predictive, has maintained the same low f-measure it produced previously.

3.3
3.3(a) data set

Algorithm             Decision tree   Naive Bayes   kNN 1   kNN 5
F-measure weighted    0.897           0.491         0.966   0.962
f-measure class 1     0.913           0.669         0.969   0.968
f-measure class 2     0.878           0.269         0.962   0.957

3.3(b) data set

Algorithm             Decision tree   Naive Bayes   kNN 1   kNN 5
F-measure weighted    0.574           0.632         0.559   0.532
f-measure class 1     0.561           0.464         0.53    0.544
f-measure class 2     0.582           0.744         0.58    0.524

3.3(c) data set

Algorithm             Decision tree   Naive Bayes   kNN 1   kNN 5
F-measure weighted    0.523           0.593         0.508   0.54
f-measure class 1     0.427           0.44          0.461   0.528
f-measure class 2     0.59            0.698         0.541   0.548

3.3(a)
The decision tree and kNN classifiers seem to work well for the data set with fewer attributes, while Naive Bayes performed better than the others as the number of attributes increased.

3.3(b)
As the number of attributes increased, the decision tree probably suffered from overfitting. Similarly, the kNN classifier probably suffered from redundant data and noise points, which led to poor classification performance on the data with more attributes.

References:

http://weka.wikispaces.com/Primer
www.cs.columbia.edu/~kathy/cs4701/documents/ai_weka.pp
http://condor.depaul.edu/ntomuro/courses/578/assign/hw1.html
http://www.inf.ed.ac.uk/teaching/courses/dme/html/week4.html
http://www.waset.org/journals/waset/v50/v50-95.pdf
http://research.cs.queensu.ca/home/xiao/doc/dm/PreDS1.pdf
http://www.cs.waikato.ac.nz/~mhall/HallHolmesTKDE.pdf
http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html
https://list.scms.waikato.ac.nz/pipermail/wekalist/2010-September/049723.html
http://en.wikipedia.org/wiki/Precision_and_recall#F-measure
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4420473
