
Experiment No: 4

BE IT A 2

AIM: To study classification in data mining using WEKA.


Theory:
Classification theory:
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the
data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
In the model build (training) process, a classification algorithm finds relationships between the
values of the predictors and the values of the target. Different classification algorithms use
different techniques for finding relationships. These relationships are summarized in a model,
which can then be applied to a different data set in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a
set of test data. The historical data for a classification project is typically divided into two data
sets: one for building the model, the other for testing the model.
Training Model
Data classification is a two-step process, as shown for the loan application data. In the first step,
a classifier is built describing a predetermined set of data classes or concepts. This is the
learning step (or training phase), where a classification algorithm builds the classifier by
analyzing, or "learning from", a training set made up of database tuples and their associated
class labels.
Test Model
First, the predictive accuracy of the classifier is estimated. If we were to use the training set to
measure the accuracy of the classifier, this estimate would likely be optimistic, because the
classifier tends to overfit the data (i.e., during learning it may incorporate some particular
anomalies of the training data that are not present in the general data set overall). Therefore, a
test set is used, made up of test tuples and their associated class labels. These tuples are
randomly selected from the general data set. Classification has many applications in customer
segmentation, business modeling, marketing, credit analysis, and biomedical and drug response
modeling.
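The optimism of training-set accuracy described above can be seen with a small sketch. This uses hypothetical toy data (not the bank set): a "classifier" that simply memorizes its training tuples scores perfectly on them, but only near chance on a held-out test set.

```python
import random

random.seed(0)
# Hypothetical toy data: one random feature, a random YES/NO label (no real signal).
data = [(random.random(), random.choice(["YES", "NO"])) for _ in range(200)]
train, test = data[:150], data[150:]

def predict(x, training_set):
    """Memorizing 'classifier': return the label of the nearest training point."""
    return min(training_set, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(dataset, training_set):
    hits = sum(predict(x, training_set) == y for x, y in dataset)
    return hits / len(dataset)

print(accuracy(train, train))  # optimistic: each training point is its own nearest neighbor
print(accuracy(test, train))   # realistic estimate: near chance on unseen tuples
```

Because the labels carry no signal, the gap between the two numbers is entirely overfitting, which is exactly why a separate test set is needed.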
Classification Mining via decision trees in weka:
This example illustrates the use of the C4.5 (J48) classifier in WEKA. The sample data set used
for this example, unless otherwise indicated, is the bank data available in comma-separated
format (bank-data.csv). This document assumes that appropriate data preprocessing has been
performed; in this case the ID field has been removed. Since the C4.5 algorithm can handle
numeric attributes, there is no need to discretize any of the attributes. For the purposes of this
example, however, the "Children" attribute has been converted into a categorical attribute with
values "YES" or "NO".
WEKA has implementations of numerous classification and prediction algorithms. The basic
ideas behind using all of these are similar. In this example we will use the modified version of
the bank data to classify new instances using the C4.5 algorithm (note that C4.5 is
implemented in WEKA by the classifier class weka.classifiers.trees.J48). The modified (and


smaller) version of the bank data can be found in the file "bank.arff" and the new unclassified
instances are in the file "bank-new.arff".
As usual, we begin by loading the data into WEKA, as seen in Figure A:
Figure A

Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier, as
depicted in Figure B. Note that J48 (an implementation of the C4.5 algorithm) does not
require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5
has evolved.
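As background, C4.5 improves on ID3 by choosing splits with the gain ratio (information gain normalized by the split's own entropy) rather than raw information gain, which penalizes attributes with many values. A minimal sketch of the computation, using a hypothetical mini table of (children, pep) pairs rather than the real bank data:

```python
from math import log2
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr_index, label_index=-1):
    """C4.5's split criterion: information gain divided by split information."""
    labels = [r[label_index] for r in rows]
    # Partition the class labels by the candidate attribute's value.
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[label_index])
    n = len(rows)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    gain = entropy(labels) - remainder
    split_info = entropy([r[attr_index] for r in rows])
    return gain / split_info if split_info else 0.0

# Hypothetical mini table of (children, pep) pairs -- not the real bank data.
rows = [("0", "NO"), ("0", "NO"), ("1", "YES"),
        ("1", "YES"), ("2", "NO"), ("1", "YES")]
print(round(gain_ratio(rows, 0), 3))
```

In this toy table the attribute separates the classes perfectly (gain of 1 bit), but the gain ratio is below 1 because the three-way split itself carries entropy.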


Figure B
Now we can specify the various parameters by clicking in the text box to the right of the
"Choose" button; the selected parameters are depicted in Figure D. In this example we accept
the default values. The default configuration performs some pruning (using the subtree-raising
approach), but does not perform error pruning.


Figure D
Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation
approach. Since we do not have a separate evaluation data set, this is necessary to get a
reasonable idea of the accuracy of the generated model. We now click "Start" to generate the
model. The ASCII version of the tree as well as evaluation statistics will appear in the right
panel when the model construction is completed (see Figure E).
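For reference, 10-fold cross-validation trains on nine folds and tests on the held-out tenth, cycling through all ten folds so every instance is tested exactly once. The sketch below illustrates the idea with a toy majority-class learner standing in for J48 and a simple round-robin split (WEKA additionally stratifies the folds so class proportions are preserved):

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Deal indices 0..n-1 into k roughly equal folds, round-robin."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, train_fn, predict_fn, k=10):
    """Return overall accuracy across k train/test rounds."""
    correct = 0
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        model = train_fn([data[i] for i in range(len(data)) if i not in held])
        for i in held_out:
            x, y = data[i]
            correct += predict_fn(model, x) == y
    return correct / len(data)

# A majority-class "model" stands in for J48 here, on hypothetical toy data.
train_fn = lambda rows: Counter(y for _, y in rows).most_common(1)[0][0]
predict_fn = lambda model, x: model
data = [(i, "YES" if i % 3 else "NO") for i in range(30)]
print(cross_validate(data, train_fn, predict_fn))
```

Each instance contributes to the estimate exactly once as a test case, which is what makes the resulting accuracy a reasonable stand-in for performance on unseen data.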


Figure E

We can view this information in a separate window by right clicking the last result set (inside the
"Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
These steps and the resulting window containing the classification results are depicted in Figures
F and G.


Figure G
Note that the classification accuracy of our model is only about 69%. This indicates that we
may need to do more work (either in preprocessing or in selecting better parameters for
classification) before building another model. In this example, however, we will continue with
this model despite its inaccuracy.
WEKA also lets us view a graphical rendition of the classification tree. This can be done by
right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
The tree for this example is depicted in Figure H. Note that by resizing the window and
selecting various menu items from inside the tree view (using the right mouse button), we can
adjust the tree view to make it more readable.


Figure H


The classifier output, saved in the file bank c1.txt, is as follows:


=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     bank-data-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R4-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
pep
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------

children = 0
| married = NO
| | mortgage = NO: YES (48.0/3.0)
| | mortgage = YES
| | | save_act = NO: YES (12.0)
| | | save_act = YES: NO (23.0)
| married = YES
| | save_act = NO
| | | mortgage = NO: NO (36.0/5.0)
| | | mortgage = YES: YES (25.0/3.0)
| | save_act = YES: NO (119.0/12.0)
children = 1: YES (135.0/25.0)
children = 2
| income = 0-50000: NO (64.0/7.0)
| income = 50001_100000
| | region = INNER_CITY
| | | age = 0_34: NO (2.0)
| | | age = 35_58
| | | | car = NO

| | | | | save_act = NO: NO (3.0/1.0)
| | | | | save_act = YES: YES (5.0/1.0)
| | | | car = YES: NO (4.0)
| | | age = 59_100: YES (6.0/1.0)
| | region = TOWN: YES (14.0/3.0)
| | region = RURAL: NO (7.0/2.0)
| | region = SUBURBAN
| | | sex = FEMALE: NO (4.0/1.0)
| | | sex = MALE: YES (2.0)
| income = 100001_max: YES (23.0/1.0)
children = 3
| income = 0-50000: NO (30.0/3.0)
| income = 50001_100000: NO (30.0/2.0)
| income = 100001_max: YES (8.0)
children = 4: NO (0.0)
Number of Leaves  : 22

Size of the tree  : 35
Time taken to build model: 0 seconds
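The printed tree is just a set of nested rules, and can be applied by hand (or in code) to classify a new tuple. The sketch below encodes the `children = 0` and `children = 1` branches above; the remaining branches are elided for brevity.

```python
def classify(rec):
    """Apply part of the printed J48 tree to a record.

    rec is a dict keyed by the bank attributes, with the nominal string
    values used in the tree listing. Branches for children = 2, 3, 4 are
    omitted here and fall through to None.
    """
    if rec["children"] == "0":
        if rec["married"] == "NO":
            if rec["mortgage"] == "NO":
                return "YES"
            # married = NO, mortgage = YES: split on save_act
            return "YES" if rec["save_act"] == "NO" else "NO"
        # married = YES
        if rec["save_act"] == "NO":
            return "YES" if rec["mortgage"] == "YES" else "NO"
        return "NO"
    if rec["children"] == "1":
        return "YES"
    return None  # children = 2, 3, 4 branches omitted

print(classify({"children": "0", "married": "NO",
                "mortgage": "YES", "save_act": "YES"}))  # NO
```

The numbers in parentheses in the listing (e.g. 48.0/3.0) are the count of training tuples reaching that leaf and, after the slash, how many of them the leaf misclassifies.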


=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         510               85      %
Incorrectly Classified Instances        90               15      %
Kappa statistic                          0.6983
Mean absolute error                      0.2209
Root mean squared error                  0.3518
Relative absolute error                 44.5145 %
Root relative squared error             70.6212 %
Total Number of Instances              600

=== Detailed Accuracy By Class ===


               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.847     0.147     0.829      0.847     0.838       0.862     YES
                 0.853     0.153     0.869      0.853     0.861       0.862     NO
Weighted Avg.    0.85      0.151     0.85       0.85      0.85        0.862
=== Confusion Matrix ===
a b <-- classified as
232 42 | a = YES
48 278 | b = NO
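The summary statistics above follow directly from the confusion matrix. As a check, the sketch below recomputes the accuracy, the YES-class precision, recall and F-measure, and the kappa statistic from the four cell counts:

```python
# Confusion matrix from the run above: rows are actual, columns predicted.
tp, fn = 232, 42   # actual YES: predicted YES / predicted NO
fp, tn = 48, 278   # actual NO:  predicted YES / predicted NO
n = tp + fn + fp + tn

accuracy = (tp + tn) / n                 # 510 / 600
precision_yes = tp / (tp + fp)
recall_yes = tp / (tp + fn)              # same as the YES TP rate
f_measure_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_observed = accuracy
p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(round(accuracy, 4), round(precision_yes, 3), round(recall_yes, 3),
      round(f_measure_yes, 3), round(kappa, 4))
```

The recomputed values match the 85% accuracy, the 0.829/0.847/0.838 YES row of the detailed-accuracy table, and the 0.6983 kappa reported by WEKA.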


Conclusion: Thus, we have studied classification in data mining. We saw how a decision tree
can be built from a data set and viewed in WEKA, and we understood that classification helps
us assign items to classes using training and test models. Data sets are stored in the .arff file
format. The whole process of data mining was understood with the example of the bank data
and the J48 tree.
