
Decision Tree Classification and K Means Clustering Using Weka

Kumar Abhishek (10BM60040)

Keywords: Data mining, decision tree, K Means, Weka.

Abstract: This term paper explains methods of classifying and clustering data using Weka, an open-source data mining and analysis tool. It explores two methods, namely decision tree classification and K Means clustering. The tool, however, can be used for many other methods of analysis.

Vinod Gupta School of Management, IIT Kharagpur

Introduction
Identifying patterns in data, and being able to make predictions based on those patterns, plays a significant role in all aspects of an industry or an individual business. A plethora of methods and tools are available for this purpose. This paper is an attempt to introduce two such methods, namely decision tree classification and K Means clustering, using a tool called Weka.

Decision Trees
This is a method of classification of data. The ultimate aim is to predict the class in which a target variable would lie. We start with a collection of data and let the system learn the various classes; the resulting model can later be used to predict where a new observation fits. This is done by splitting the source data into subsets based on attribute-value rules. The process is called recursive partitioning because it is repeated for each subset of data, and it stops when further partitioning adds no value to the predictions, for example when all instances at a node belong to the same class. Various algorithms are available for performing the partitions, such as CART and C4.5 (implemented in Weka as J48).

Advantages
- Simple to understand and interpret.
- Requires little data preparation.
- Able to handle both numerical and categorical data.
- Possible to validate a model using statistical tests.
- Robust: performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
- Performs well with large data in a short time.

Limitations
- Trees can get very complex, and sometimes prohibitively large.
- Information gain is biased towards attributes with more levels.
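The idea of recursive partitioning can be sketched in a few lines of Python. This is a toy illustration, not Weka's J48 (which scores splits by information gain); it simply picks the threshold on one numeric attribute that minimises misclassifications and recurses. The sample ages and group labels are hypothetical, chosen so the resulting tree has the same shape as the one produced later in this paper.

```python
# Toy recursive partitioning on a single numeric attribute.

def best_split(points):
    """points: list of (value, label). Return the threshold with fewest errors."""
    values = sorted({v for v, _ in points})
    best = None
    for i in range(len(values) - 1):
        t = (values[i] + values[i + 1]) / 2
        left = [l for v, l in points if v <= t]
        right = [l for v, l in points if v > t]
        # errors if each side predicts its majority label
        errs = (len(left) - max(left.count(0), left.count(1)) +
                len(right) - max(right.count(0), right.count(1)))
        if best is None or errs < best[0]:
            best = (errs, t)
    return None if best is None else best[1]

def build_tree(points):
    labels = [l for _, l in points]
    if len(set(labels)) == 1:            # pure node: stop partitioning
        return labels[0]
    t = best_split(points)
    if t is None:                        # no split possible: majority vote
        return max(set(labels), key=labels.count)
    le = [(v, l) for v, l in points if v <= t]
    gt = [(v, l) for v, l in points if v > t]
    return {"t": t, "le": build_tree(le), "gt": build_tree(gt)}

def classify(tree, value):
    while isinstance(tree, dict):
        tree = tree["le"] if value <= tree["t"] else tree["gt"]
    return tree

# hypothetical (age, group) pairs
data = [(25, 0), (30, 0), (35, 0), (40, 1), (50, 1), (60, 1), (65, 0), (70, 0)]
tree = build_tree(data)
```

On this data the first split lands at age 37.5 and the second at 62.5, mirroring the two-level age tree shown in the output section.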

K Means Cluster
This method of data analysis aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The basic steps of the process are:

1. Choose initial cluster centres (for example, k of the data points).
2. For each data point, calculate the distance from the point to each cluster. If the point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
3. Recompute each cluster mean from its current members.
4. Repeat the above steps until a complete pass through all the data points results in no point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion.
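The process above can be sketched with a minimal stdlib-only Python implementation on illustrative 2-D data. The initial centroids are simply the first k points here for reproducibility; a real run would randomise them, and, as noted above, that initial choice affects the final clusters.

```python
# Minimal 2-D K Means sketch: assign each point to its nearest centroid,
# recompute each centroid as the mean of its members, repeat until stable.
import math

def kmeans(points, k, max_iter=100):
    centroids = list(points[:k])
    assign = [None] * len(points)
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
               for p in points]
        if new == assign:                      # stable: no point moved clusters
            break
        assign = new
        for c in range(k):                     # recompute centroid as the mean
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return assign, centroids

points = [(1, 1), (1.5, 2), (5, 7), (6, 8), (1, 0.5), (6.5, 9)]
labels, centres = kmeans(points, 2)
```

The three points near the origin end up in one cluster and the three distant points in the other.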


Advantages
- With a large number of variables, K Means may be computationally faster than hierarchical clustering (if k is small).
- K Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Limitations
- Euclidean distance is used as the metric and variance as the measure of cluster scatter.
- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing K Means, it is important to run diagnostic checks for determining the number of clusters in the data set.
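One such diagnostic is the "elbow" check: run K Means for increasing k and compare the within-cluster sum of squared errors (the same statistic Weka reports). WCSS always falls as k grows; a reasonable k sits where the decrease levels off. A sketch on hypothetical 1-D data, with centroids seeded from the first k points for reproducibility:

```python
# "Elbow" diagnostic sketch for choosing k (1-D toy data).

def kmeans_1d(xs, k, iters=50):
    cent = list(xs[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - cent[c]))].append(x)
        cent = [sum(g) / len(g) if g else cent[i] for i, g in enumerate(groups)]
    return cent, groups

xs = [1, 2, 2, 3, 10, 11, 12, 25, 26, 27]   # three obvious groups
wcss = {}
for k in (1, 2, 3, 4):
    cent, groups = kmeans_1d(xs, k)
    wcss[k] = sum((x - c) ** 2 for c, g in zip(cent, groups) for x in g)
# wcss drops steeply up to k = 3, then only slightly: the elbow is at 3
```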

Data Set
A bank has collected data for 100 customers and categorized each into group 1 or group 0. The attributes for which data is collected are:
- Salary
- Commission
- Age
- Elevel (education level)
- Car
- Zipcode
- Hvalue: highest value of loan
- Hyears: number of years of repayment
- Loan


Decision Tree
This section explains the steps to create a decision tree with Weka and how to interpret and visualize the output.

Steps
1. Launch Weka. It must first be downloaded and installed on the local machine; the latest version is available from the Weka website.

Weka GUI

2. Click on the Explorer button.

Weka Explorer


3. Click on Open file. The data set can be in .ARFF or .csv format; there are also other options for loading data. Find the data file and click on the Choose button. This loads the data into the Preprocess tab.

File selection

Preprocess tab

4. Go to the Classify tab. In the Classifier section, click on the Choose button and select J48 from the list. Leave the rest of the options as shown in the figure; Weka makes reasonable assumptions for the parameters. Then click on the Start button.


Output window

5. Weka shows the output of the analysis on the right-hand side.

6. To visualize the tree, right-click on the result in the result list and select Visualize tree to open the tree display window.

Decision tree


Output
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05
Instances:    100
Attributes:   10
              salary
              commission
              age
              elevel
              car
              zipcode
              hvalue
              hyears
              loan
              group
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

age <= 37: 0 (31.0)
age > 37
|   age <= 62: 1 (39.0/4.0)
|   age > 62: 0 (30.0)

Number of Leaves  : 3

Size of the tree : 5

Time taken to build model: 0.05 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          93               93      %
Incorrectly Classified Instances         7                7      %
Kappa statistic                          0.8472
Mean absolute error                      0.0977
Root mean squared error                  0.2577
Relative absolute error                 21.4022 %
Root relative squared error             53.9683 %
Total Number of Instances              100

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.938     0.086      0.953     0.938      0.946       0.912    0
                 0.914     0.062      0.889     0.914      0.901       0.912    1
Weighted Avg.    0.93      0.077      0.931     0.93       0.93        0.912

=== Confusion Matrix ===

  a  b   <-- classified as
 61  4 |  a = 0
  3 32 |  b = 1


Interpretation of output
The previous section shows the complete output. At the beginning is a summary of the data set: the attributes are listed, and cross-validation was used for evaluation. Then, under "J48 pruned tree", the textual decision tree is shown. Splits are made on the age attribute. In the tree structure, a colon introduces the class label assigned to a particular leaf, followed by the number of instances that reach that leaf, expressed as a decimal number. Thus at the age <= 62 leaf, 39.0/4.0 means that 39 instances reached the leaf, of which 4 are classified incorrectly. The number of leaves and the size of the tree are also given, in this case 3 and 5 respectively. The next part of the output gives the tree's predictive performance: 7% (7 out of 100) of instances are classified incorrectly. At the end the confusion matrix is given. From it we can see that 61 instances of group 0 and 32 instances of group 1 are correctly classified, while 4 instances of group 0 are incorrectly assigned to group 1 and 3 instances of group 1 are assigned to group 0. The Kappa statistic measures the agreement between the predicted and observed categorization of the data set, correcting for agreement that occurs by chance. In this case it is about 0.85.
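The Kappa value can be verified by hand from the confusion matrix: observed agreement po is simply the accuracy, expected chance agreement pe comes from the row and column totals, and Kappa = (po - pe) / (1 - pe).

```python
# Recompute Kappa from the confusion matrix in the output.
# Rows are actual classes (a = 0, b = 1), columns are predicted classes.
matrix = [[61, 4],
          [3, 32]]

n = sum(sum(row) for row in matrix)                       # 100 instances
po = sum(matrix[i][i] for i in range(2)) / n              # observed agreement
rows = [sum(r) for r in matrix]                           # actual class totals
cols = [matrix[0][j] + matrix[1][j] for j in range(2)]    # predicted totals
pe = sum(rows[i] * cols[i] for i in range(2)) / n ** 2    # chance agreement
kappa = (po - pe) / (1 - pe)
print(po, round(kappa, 4))    # 0.93 0.8472, matching Weka's report
```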


K Means Cluster
This section explains K Means clustering. The data set is the same as the one used for the decision tree; here the customers are clustered into groups based on their similarities.

Steps
1. Follow the steps up to step 3 of the decision tree process to load and prepare the data in Weka. Then go to the Cluster tab at the top of the Weka window.

2. Click the Choose button and select SimpleKMeans from the list as shown below. Clicking on the scheme name shows its various parameters; Weka already makes reasonable assumptions for most of them, but the number of clusters must be specified. This can be found from the agglomeration schedule of a hierarchical clustering; in this case it was found to be 2, so 2 is entered. In the Cluster mode panel, select Use training set.

Rule selection

Number of clusters


3. Click on the Start button to generate output as shown below.

Output

4. Right-click on the result for options and select Visualize cluster assignments to see the visualization of the results as shown below.

Cluster assignments


Output
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05
Instances:    100
Attributes:   10
              salary
              commission
              age
              elevel
              car
              zipcode
              hvalue
              hyears
              loan
              group
Test mode:    evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 313.23293708458965
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute       Full Data           0           1
                    (100)        (55)        (45)
================================================
salary         76506.8492  93821.2199  55344.8406
commission     22893.0265  13121.7237    34835.73
age                 49.57     49.5455        49.6
elevel                  2           4           0
car                     6           8          10
zipcode                 1           2           0
hvalue        130829.1991 131134.2686 130456.3364
hyears              16.71     16.9636        16.4
loan          250972.2305 257594.1869 242878.7283
group                   0           0           0

Time taken to build model (full training data) : 0.03 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       55 ( 55%)
1       45 ( 45%)


Interpretation of output
The output of Weka shows that Euclidean distance was used with two clusters. The result of the clustering is shown in a table whose rows are attribute names and whose columns correspond to the cluster centroids; the number of instances in each cluster is given in parentheses at the top of the table. Each table entry is either the mean (for a numeric attribute) or the mode (for a nominal attribute) of the corresponding attribute over the cluster in that column. Characteristics of the clusters:
- The average ages of both clusters are the same (approximately 49 years).
- People in cluster 0 earn more than those in the other cluster.
- People in cluster 0 take larger loans.
- Loan tenure is almost the same, at about 16 years.
- People in cluster 0 are more educated.
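The centroid entries can be reproduced for any cluster in exactly this way: the mean for numeric attributes, the mode for nominal ones. A small sketch with hypothetical cluster members (salary numeric, elevel nominal):

```python
from statistics import mean, mode

# Hypothetical members of one cluster: (salary, elevel) per customer
cluster = [(93000, 4), (95000, 4), (92000, 3)]

centroid = (mean(r[0] for r in cluster),   # numeric attribute -> mean
            mode(r[1] for r in cluster))   # nominal attribute -> mode
print(centroid)
```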

Conclusion
Numerous other analyses are available to users in Weka; here, however, only two of them were shown. Weka also provides easy and efficient ways to visualize results.


