
Data Mining Lab

LABORATORY MANUAL on DATA MINING

Prepared by INDRANEEL K, Associate Professor, CSE Department


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SRI KOTTAM TULASI REDDY MEMORIAL COLLEGE OF ENGINEERING


(Affiliated to JNTU, Hyderabad, Approved by AICTE, Accredited by NBA) KONDAIR, MAHABOOBNAGAR (Dist), AP - 509125



INDEX
The objective of the lab exercises is to use data mining techniques to identify customer segments and understand their buying behaviour, and to use standard databases to understand DM processes with WEKA (or any other DM tool).

1. Gain insight for running pre-defined decision trees and explore results using MS OLAP Analytics.
2. Using IBM OLAP Miner, understand the use of data mining for evaluating the content of multidimensional cubes.
3. Using Teradata Warehouse Miner, create mining models that are executed in SQL.

BI Portal Lab: The objective of the lab exercises is to integrate pre-built reports into a portal application.

4. Publish Cognos cubes to a business intelligence portal.

Metadata & ETL Lab: The objective of the lab exercises is to implement metadata import agents that pull metadata from leading business intelligence tools and populate a metadata repository, and to understand ETL processes.

5. Import metadata from specific business intelligence tools and populate a metadata repository.
6. Publish metadata stored in the repository.
7. Load data from heterogeneous sources, including text files, into a predefined warehouse schema.



CONTENTS
S.No   Experiment                                                                            Week No
 1     Defining weather relation for different attributes                                       1
 2     Defining employee relation for different attributes                                      2
 3     Defining student relation for different attributes                                       3
 4     Defining labor relation for different attributes                                         4
 5     Exploring weather relation using Experimenter and obtaining results in various schemes   5
 6     Exploring employee relation using Experimenter                                           6
 7     Exploring labor relation using Experimenter                                              7
 8     Exploring student relation using Experimenter                                            8
 9     Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation
       using J48                                                                                9
10     Designing a knowledge flow layout to load a dataset, apply attribute selection,
       normalize the attributes, and store the result in a CSV saver                           10



Aim:
Implementation of data mining algorithms using the Attribute-Relation File Format (ARFF).

Introduction to Weka (Data Mining Tool)

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library). Tools (or functions) in Weka include:

- Data preprocessing (e.g., data filters)
- Classification (e.g., BayesNet, KNN, C4.5 decision tree, neural networks, SVM)
- Regression (e.g., linear regression, isotonic regression, SVM for regression)
- Clustering (e.g., simple k-means, Expectation Maximization (EM))
- Association rules (e.g., the Apriori algorithm, Predictive Accuracy, Confirmation Guided)
- Feature selection (e.g., CFS subset evaluation, information gain, chi-squared statistic)
- Visualization (e.g., viewing different two-dimensional plots of the data)
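As a quick illustration of the "own Java code" route, the following minimal sketch loads an ARFF file and builds a J48 (C4.5) decision tree. It assumes weka.jar is on the classpath; the path data/weather.arff is illustrative and refers to a file like the one defined in Experiment 1.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // Load the dataset (illustrative path)
        Instances data = new DataSource("data/weather.arff").getDataSet();
        // Take the last attribute as the class, as the Explorer does by default
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();        // Weka's C4.5 implementation
        tree.buildClassifier(data);  // train on the full dataset
        System.out.println(tree);    // print the induced tree
    }
}

Compile and run with weka.jar on the classpath, e.g. java -cp weka.jar:. BuildTree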

Launching WEKA
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI (multiple document interface) appearance, this is provided by an alternative launcher called Main (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:

- Explorer: an environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).
- Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.
- Knowledge Flow: this environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
- Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command line interface.


Working with Explorer

Weka Data File Format (Input)

The most popular data input format of Weka is ARFF (with .arff being the extension of the input data file).

Experiment 1: WEATHER RELATION

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes


PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program:

- Open file: allows the user to select files residing on the local machine or recorded medium.
- Open URL: provides a mechanism to locate a file or data source at a different location specified by the user.
- Open DB: allows the user to retrieve files or data from a database source provided by the user.
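The same three load paths can be exercised from the Java API as well. A sketch, assuming weka.jar on the classpath; the file name, URL and JDBC settings below are illustrative placeholders, not values from this manual:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.experiment.InstanceQuery;

public class LoadPaths {
    public static void main(String[] args) throws Exception {
        // "Open file": load an ARFF from the local machine
        Instances fromFile = new DataSource("data/weather.arff").getDataSet();

        // "Open URL": DataSource also accepts a URL as the location
        Instances fromUrl = new DataSource("http://example.com/weather.arff").getDataSet();

        // "Open DB": retrieve instances from a database over JDBC
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:mysql://localhost/weka"); // illustrative
        query.setQuery("SELECT * FROM weather");             // illustrative
        Instances fromDb = query.retrieveInstances();

        System.out.println(fromFile.numInstances() + " / "
                + fromUrl.numInstances() + " / " + fromDb.numInstances());
    }
}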


CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Figure 5 shows some of the categories.

Output:
=== Summary ===
Correctly Classified Instances         9               64.2857 %
Incorrectly Classified Instances       5               35.7143 %
Kappa statistic                        0
Mean absolute error                    0.4762
Root mean squared error                0.4934
Relative absolute error              100      %
Root relative squared error          100      %
Total Number of Instances             14

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         1         0.643       1        0.783       0.178     yes
                 0         0         0           0        0           0.178     no
Weighted Avg.    0.643     0.643     0.413       0.643    0.503       0.178

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

Output:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
                (1)
======================
outlook
  sunny           6
  overcast        5
  rainy           6
  [total]        17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE            7
  FALSE           9
  [total]        16
play
  yes            10
  no              6
  [total]        16

Clustered Instances
0    14 (100%)

Log likelihood: -9.4063
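For reference, a minimal sketch of producing an evaluation like the one above from the Java API (assuming weka.jar on the classpath; the path is illustrative, and no class index is set because Weka clusterers expect unlabelled data):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        // Load the data; deliberately do not set a class index
        Instances data = new DataSource("data/weather.arff").getDataSet();

        EM em = new EM();        // defaults match the scheme line shown above
        em.buildClusterer(data);

        // "Evaluate on training data" mode
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}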

Choosing Relationship for cluster:


ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window, and one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
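A sketch of the programmatic equivalent (assuming weka.jar on the classpath). Apriori works on nominal attributes only, so the nominal version of the weather data is assumed here; the path is illustrative:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssocDemo {
    public static void main(String[] args) throws Exception {
        // Apriori requires nominal attributes
        Instances data = new DataSource("data/weather.nominal.arff").getDataSet();

        Apriori apriori = new Apriori();  // defaults: -N 10 -T 0 -C 0.9 -D 0.05 ...
        apriori.buildAssociations(data);
        System.out.println(apriori);      // prints the best rules found
    }
}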

SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

OUTPUT:



=== Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval
Search: weka.attributeSelection.BestFirst -D 1 -N 5
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy
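The same evaluator/search pairing can also be driven from code. A sketch, assuming weka.jar on the classpath (the path is illustrative):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttSelDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class: play

        AttributeSelection attsel = new AttributeSelection();
        attsel.setEvaluator(new CfsSubsetEval()); // the evaluator used above
        attsel.setSearch(new BestFirst());        // the search method used above
        attsel.SelectAttributes(data);            // evaluate on all training data

        System.out.println(attsel.toResultsString());
    }
}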

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.



Experiment 2: EMPLOYEE RELATION (INPUT)

% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony, ravi}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales

OUTPUT

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program:

- Open file: allows the user to select files residing on the local machine or recorded medium.
- Open URL: provides a mechanism to locate a file or data source at a different location specified by the user.
- Open DB: allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.

OUTPUT:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: sales
Time taken to build model: 0 seconds

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

OUTPUT:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
                (1)
======================
ename
  john            3
  tony            2
  ravi            1
  [total]         6
eid
  mean           85
  std. dev.       0
esal
  mean    8833.3333
  std. dev. 471.4045
edept
  sales           3
  admin           2
  [total]         5

Clustered Instances
0    3 (100%)


Log likelihood: 3.84763

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window, and one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: employee
Instances: 3
Attributes: 4
  ename
  eid
  esal
  edept

SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

OUTPUT:
=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy

VISUALIZATION:


The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.



Experiment 3: STUDENT RELATION

% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony, ravi}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, 20
john, 485, ECE, 19

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program:

- Open file: allows the user to select files residing on the local machine or recorded medium.
- Open URL: provides a mechanism to locate a file or data source at a different location specified by the user.
- Open DB: allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.

Output:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 19.333333333333332
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient         -0.5
Mean absolute error              0.5
Root mean squared error          0.6455
Relative absolute error        100 %
Root relative squared error    100 %
Total Number of Instances        3

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
                (1)
======================
outlook
  sunny           6
  overcast        5
  rainy           6
  [total]        17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE            7
  FALSE           9
  [total]        16
play
  yes            10
  no              6
  [total]        16

Clustered Instances
0    14 (100%)

Log likelihood: -9.4063

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window, and one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: student
Instances: 3
Attributes: 4
  sname
  sid
  sbranch
  sage

SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 7
  Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (numeric): 4 sage):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,3 : 2
  sname
  sbranch

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.



Experiment 4: LABOR RELATION

% ARFF file for labor data with some numeric features
@relation labor
@attribute name {rom, tony, santu}
@attribute wage-increase-first-year numeric
@attribute wage-increase-second-year numeric
@attribute working-hours numeric
@attribute pension numeric
@attribute vacation numeric
@data
rom, 500, 600, 8, 200, 15
tony, 400, 450, 8, 200, 15
santu, 600, 650, 8, 200, 15

PREPROCESSING:
In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program:

- Open file: allows the user to select files residing on the local machine or recorded medium.
- Open URL: provides a mechanism to locate a file or data source at a different location specified by the user.
- Open DB: allows the user to retrieve files or data from a database source provided by the user.

CLASSIFICATION:
The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.

Output:
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 15.0
Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          0
Mean absolute error              0
Root mean squared error          0
Relative absolute error        NaN %
Root relative squared error    NaN %
Total Number of Instances        3

CLUSTERING:
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 1

            Cluster
Attribute         0
                (1)
=====================================
name
  rom             2
  tony            2
  santu           2
  [total]         6
wage-increase-first-year
  mean          500
  std. dev.  81.6497
wage-increase-second-year
  mean     566.6667
  std. dev.  84.9837
working-hours
  mean            8
  std. dev.       0
pension
  mean          200
  std. dev.       0
vacation
  mean           15
  std. dev.       0

Clustered Instances
0    3 (100%)

Log likelihood: 25.90833

ASSOCIATION:
The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are a few options for this window, and one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: labor
Instances: 3
Attributes: 6
  name
  wage-increase-first-year
  wage-increase-second-year
  working-hours
  pension
  vacation

SELECTING ATTRIBUTES:
The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wants to exclude certain categories of the data, they can deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options, an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 19
  Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (numeric): 6 vacation):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1 : 1
  name

VISUALIZATION:
The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.

EXPERIMENTER:

The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes. The Experiment Environment can also be run from the command line using the Simple CLI.



Experiment 5

COMMAND LINE: java weka.experiment.Experiment -r -T data/weather.arff

Defining an Experiment

When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.

To define the dataset to be processed by a scheme, first select Use relative paths in the Datasets panel of the Setup window and then click Add New to open a dialog box below.


Select iris.arff and click Open to select the iris dataset.

The dataset name is now displayed in the Datasets panel of the Setup window. Saving the Results of the Experiment To identify a dataset to which the results are to be sent, click on the CSVResultListener entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear.

The output file parameter is near the bottom of the window, beside the text outputFile. Click on this parameter to display a file selection window.



Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window.

The dataset name is displayed in the Destination panel of the Setup window.

Saving the Experiment Definition The experiment definition can be saved at any time. Select Save at the top of the Setup window. Type the dataset name with the extension exp (or select the dataset name if the experiment definition dataset already exists).


The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window. Running an Experiment To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.

Click Start to run the experiment.


If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.txt.
Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
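For orientation, each line above records one randomized train/test run. A rough single-run equivalent in Java (assuming weka.jar on the classpath; the iris path and the random seed are illustrative — the Experimenter varies the seed across its 10 runs):

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class OneRun {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));  // one randomized run

        // 66% of the patterns for training, the rest for testing
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        ZeroR zr = new ZeroR();  // the scheme used in this experiment
        zr.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(zr, test);
        System.out.println(eval.toSummaryString());
    }
}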



Experiment 6

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the employee relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/emp.arff

Add a new relation using the Add New button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.



Choose ZeroR from the Choose menu by clicking the Add New button on the right panel, and click OK.

Click on the Run tab to get the output.



The results of the experiment are saved to the dataset Experiment2.txt.


Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?



Experiment 7

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the labor relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/labor.arff

Add a new relation using the Add New button on the right panel, give the database connection using JDBC, and click OK.


Choose the relation and click the OK button.



Choose ZeroR from the Choose menu by clicking the Add New button on the right panel, and click OK.

Click on the Run tab to get the output.



The results of the experiment are saved to the dataset Experiment3.txt.
Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?



Experiment 8

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the student relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/student.arff



Add a new relation using the Add New button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.


Choose ZeroR from the Choose menu by clicking the Add New button on the right panel, and click OK.

Click on the Run tab to get the output.


The results of the experiment are saved to the dataset Experiment4.txt.


Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

KNOWLEDGE FLOW

The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer.



The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow along with some extra tools.

Features of the KnowledgeFlow:
- intuitive data flow style layout
- process data in batches or incrementally
- process multiple batches or streams in parallel (each separate flow executes in its own thread)
- chain filters together
- view models produced by classifiers for each fold in a cross validation
- visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

The components available in the KnowledgeFlow (DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization) are described in the Components section below.

Launching the KnowledgeFlow


The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "KnowledgeFlow" to start the KnowledgeFlow. Alternatively, you can launch the KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow". At the top of the KnowledgeFlow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are pretty much self-explanatory.

Components
Components available in the KnowledgeFlow:

DataSources
All of WEKA's loaders are available.


DataSinks
All of WEKA's savers are available.

Filters
All of WEKA's filters are available.

Classifiers
All of WEKA's classifiers are available.

Clusterers
All of WEKA's clusterers are available.



Evaluation

- TrainingSetMaker: make a data set into a training set.
- TestSetMaker: make a data set into a test set.
- CrossValidationFoldMaker: split any data set, training set or test set into folds.
- TrainTestSplitMaker: split any data set, training set or test set into a training set and a test set.
- ClassAssigner: assign a column to be the class for any data set, training set or test set.
- ClassValuePicker: choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below).
- ClassifierPerformanceEvaluator: evaluate the performance of batch trained/tested classifiers.
- IncrementalClassifierEvaluator: evaluate the performance of incrementally trained classifiers.
- ClustererPerformanceEvaluator: evaluate the performance of batch trained/tested clusterers.
- PredictionAppender: append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.

Visualization

- DataVisualizer: component that can pop up a panel for visualizing data in a single large 2D scatter plot.
- ScatterPlotMatrix: component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
- AttributeSummarizer: component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.
- ModelPerformanceChart: component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.



- TextViewer: component for showing textual data. Can show data sets, classification performance statistics etc.
- GraphViewer: component that can pop up a panel for visualizing tree based models.
- StripChart: component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

Experiment 9

Aim: Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (Weka's C4.5 implementation).

First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs").


Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.

Alternatively, you can double-click on the icon to bring up the configuration dialog.

Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select "dataSet" under "Connections" in the menu. A "rubber band" line will appear.

Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components.

Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).





Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the layout.

Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.



Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section.

Place a J48 component on the layout.

Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.


Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.


Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.


Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.

Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for ClassifierPerformanceEvaluator.


Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.


When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.
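For comparison, a rough programmatic equivalent of this whole flow — load an ARFF in batch mode, assign the class, cross-validate J48, and view the textual results — assuming weka.jar on the classpath and an illustrative file path:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        // ArffLoader step: load the dataset in batch mode
        Instances data = new DataSource("data/iris.arff").getDataSet();
        // ClassAssigner step: last column is the class (the default)
        data.setClassIndex(data.numAttributes() - 1);

        // CrossValidationFoldMaker + J48 + ClassifierPerformanceEvaluator
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        // TextViewer step: print the evaluation text
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}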


Simple CLI

The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it uses the CLASSPATH with which Weka was started). It offers a simple Weka shell with separate command-line and output areas.

Commands
The following commands are available in the Simple CLI:

java <classname> [<args>]
    invokes a java class with the given arguments (if any)

break
    stops the current thread, e.g., a running classifier, in a friendly manner

kill
    stops the current thread in an unfriendly fashion

cls
    clears the output area

exit
    exits the Simple CLI

help [<command>]
    provides an overview of the available commands if called without a command name as argument; otherwise gives more help on the specified command



Command redirection
Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection, but as part of another parameter.

Command completion
Commands starting with java support completion for classnames and filenames



via Tab (Alt+Backspace deletes parts of the command again). In case there are several matches, Weka lists all possible matches.

Package name completion:

java weka.cl<Tab>

results in the following output of possible matches of package names:

Possible matches:
  weka.classifiers
  weka.clusterers

Classname completion:

java weka.classifiers.meta.A<Tab>

lists the following classes:

Possible matches:
  weka.classifiers.meta.AdaBoostM1
  weka.classifiers.meta.AdditiveRegression
  weka.classifiers.meta.AttributeSelectedClassifier

Filename completion:

In order for Weka to determine whether the string under the cursor is a classname or a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows: C:\Some\Path\file) or relative and starting with a dot (Unix/Linux: ./some/other/path/file; Windows: .\Some\Other\Path\file).



Experiment 10

Aim: To design a knowledge flow layout to load a dataset, apply attribute selection, normalize the attributes, and store the result with a CSV saver.

Procedure:
1) Click on Knowledge Flow in the Weka GUI Chooser.
2) It opens a window called the Weka Knowledge Flow Environment.
3) Click on DataSources and select ArffLoader to read data from an ARFF source.
4) Now click on the Knowledge Flow layout area, which places the ArffLoader in the layout.
5) Click on Filters and select an attribute selection filter from the supervised filters. Place it on the design layout.
6) Now select another filter, Normalize, from the unsupervised filters, to normalize the numeric attribute values. Place it on the design layout.
7) Click on DataSinks and choose CSVSaver, which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
8) Now right-click on the ArffLoader and click on dataSet to direct the flow to the attribute selection filter.
9) Now right-click on the attribute selection filter and select dataSet to direct the flow to Normalize, from which the flow is directed to the CSVSaver in the same way.
10) Right-click on the CSVSaver and click on Configure to specify the destination where the results are stored; here z:\weka@ravi is selected.
11) Now right-click on the ArffLoader and select Configure to specify the source data; here the iris relation is selected as the source.
12) Now right-click on the ArffLoader again and click on Start loading, which executes the knowledge flow layout.
13) We can observe the results of the above process by opening the generated file under z:\Weka@ravi (named after the applied filters, e.g. iris-weka.filters.supervised.attribute...) in Notepad, which displays the results in comma-separated-value form.

Petal length    Petal width    Class
0.067797        0.041667       Iris-setosa
0.067797        0.041667       Iris-setosa
0.050847        0.041667       Iris-setosa
0.627119        0.541667       Iris-versicolor
0.830508        0.833333       Iris-virginica
0.677966        0.791667       Iris-virginica
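The same flow can also be reproduced programmatically with the Weka Java API. The sketch below is a minimal, hedged equivalent of the layout above, assuming an iris.arff file in the working directory and an output file name of our choosing; the supervised AttributeSelection filter uses Weka's defaults (CfsSubsetEval with BestFirst search), which is also what the Knowledge Flow component starts with.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.converters.CSVSaver;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;
    import weka.filters.unsupervised.attribute.Normalize;

    public class FlowToCsv {
        public static void main(String[] args) throws Exception {
            // Load the source ARFF file (path is an assumption; adjust to your copy)
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);  // supervised selection needs a class

            // Step 1: supervised attribute selection (default CfsSubsetEval + BestFirst)
            AttributeSelection select = new AttributeSelection();
            select.setInputFormat(data);
            Instances selected = Filter.useFilter(data, select);

            // Step 2: normalize all numeric attributes to [0,1]
            Normalize norm = new Normalize();
            norm.setInputFormat(selected);
            Instances normalized = Filter.useFilter(selected, norm);

            // Step 3: write the result to a CSV file (output name is an assumption)
            CSVSaver saver = new CSVSaver();
            saver.setInstances(normalized);
            saver.setFile(new File("iris-selected-normalized.csv"));
            saver.writeBatch();
        }
    }

On iris, this typically keeps the petal measurements plus the class, as the sample output table above suggests.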


Description of the German credit dataset in ARFF (Attribute-Relation File Format):

Structure of the ARFF format:

% comment lines
@relation <relation name>
@attribute <attribute name> <type>
@data
set of data items separated by commas

% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut für Statistik und Ökonometrie
% Universität Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. The original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
%    Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
% Status of existing checking account
% A11 : ... < 0 DM
% A12 : 0 <= ... < 200 DM
% A13 : ... >= 200 DM / salary assignments for at least 1 year
% A14 : no checking account


% Attribute 2: (numerical)
% Duration in month
%
% Attribute 3: (qualitative)
% Credit history
% A30 : no credits taken/ all credits paid back duly
% A31 : all credits at this bank paid back duly
% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account/ other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
% Purpose
% A40 : car (new)
% A41 : car (used)
% A42 : furniture/equipment
% A43 : radio/television
% A44 : domestic appliances
% A45 : repairs
% A46 : education
% A47 : (vacation - does not exist?)
% A48 : retraining
% A49 : business
% A410 : others
%
% Attribute 5: (numerical)
% Credit amount
%
% Attribute 6: (qualitative)
% Savings account/bonds
% A61 : ... < 100 DM
% A62 : 100 <= ... < 500 DM
% A63 : 500 <= ... < 1000 DM
% A64 : ... >= 1000 DM
% A65 : unknown/ no savings account
%
% Attribute 7: (qualitative)
% Present employment since
% A71 : unemployed
% A72 : ... < 1 year
% A73 : 1 <= ... < 4 years
% A74 : 4 <= ... < 7 years
% A75 : ... >= 7 years
%
% Attribute 8: (numerical)
% Installment rate in percentage of disposable income


% Attribute 9: (qualitative)
% Personal status and sex
% A91 : male : divorced/separated
% A92 : female : divorced/separated/married
% A93 : male : single
% A94 : male : married/widowed
% A95 : female : single
%
% Attribute 10: (qualitative)
% Other debtors / guarantors
% A101 : none
% A102 : co-applicant
% A103 : guarantor
%
% Attribute 11: (numerical)
% Present residence since
%
% Attribute 12: (qualitative)
% Property
% A121 : real estate
% A122 : if not A121 : building society savings agreement/ life insurance
% A123 : if not A121/A122 : car or other, not in attribute 6
% A124 : unknown / no property
%
% Attribute 13: (numerical)
% Age in years
%
% Attribute 14: (qualitative)
% Other installment plans
% A141 : bank
% A142 : stores
% A143 : none
%
% Attribute 15: (qualitative)
% Housing
% A151 : rent
% A152 : own
% A153 : for free
%
% Attribute 16: (numerical)
% Number of existing credits at this bank
%
% Attribute 17: (qualitative)
% Job
% A171 : unemployed/ unskilled - non-resident
% A172 : unskilled - resident
% A173 : skilled employee / official
% A174 : management/ self-employed/ highly qualified employee/ officer
%
% Attribute 18: (numerical)
% Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
% Telephone
% A191 : none
% A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)
% Foreign worker
% A201 : yes
% A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)
%
%       1   2
%   ----------
% 1     0   1
% 2     5   0
%
% (1 = Good, 2 = Bad)
%
% The rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
% Relabeled values in attribute checking_status
% From: A11 To: '<0'
% From: A12 To: '0<=X<200'
% From: A13 To: '>=200'
% From: A14 To: 'no checking'
%
% Relabeled values in attribute credit_history
% From: A30 To: 'no credits/all paid'
% From: A31 To: 'all paid'
% From: A32 To: 'existing paid'
% From: A33 To: 'delayed previously'
% From: A34 To: 'critical/other existing credit'


% Relabeled values in attribute purpose
% From: A40 To: 'new car'
% From: A41 To: 'used car'
% From: A42 To: furniture/equipment
% From: A43 To: radio/tv
% From: A44 To: 'domestic appliance'
% From: A45 To: repairs
% From: A46 To: education
% From: A47 To: vacation
% From: A48 To: retraining
% From: A49 To: business
% From: A410 To: other
%
% Relabeled values in attribute savings_status
% From: A61 To: '<100'
% From: A62 To: '100<=X<500'
% From: A63 To: '500<=X<1000'
% From: A64 To: '>=1000'
% From: A65 To: 'no known savings'
%
% Relabeled values in attribute employment
% From: A71 To: unemployed
% From: A72 To: '<1'
% From: A73 To: '1<=X<4'
% From: A74 To: '4<=X<7'
% From: A75 To: '>=7'
%
% Relabeled values in attribute personal_status
% From: A91 To: 'male div/sep'
% From: A92 To: 'female div/dep/mar'
% From: A93 To: 'male single'
% From: A94 To: 'male mar/wid'
% From: A95 To: 'female single'
%
% Relabeled values in attribute other_parties
% From: A101 To: none
% From: A102 To: 'co applicant'
% From: A103 To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121 To: 'real estate'
% From: A122 To: 'life insurance'


% From: A123 To: car
% From: A124 To: 'no known property'
%
% Relabeled values in attribute other_payment_plans
% From: A141 To: bank
% From: A142 To: stores
% From: A143 To: none
%
% Relabeled values in attribute housing
% From: A151 To: rent
% From: A152 To: own
% From: A153 To: 'for free'
%
% Relabeled values in attribute job
% From: A171 To: 'unemp/unskilled non res'
% From: A172 To: 'unskilled resident'
% From: A173 To: skilled
% From: A174 To: 'high qualif/self emp/mgmt'
%
% Relabeled values in attribute own_telephone
% From: A191 To: none
% From: A192 To: yes
%
% Relabeled values in attribute foreign_worker
% From: A201 To: yes
% From: A202 To: no
%
% Relabeled values in attribute class
% From: 1 To: good
% From: 2 To: bad
%
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real


@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good


Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment Case Study given to us, the following attributes are found to be applicable for Credit-Risk Assessment:
Total valid attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_commitment
9. personal_status
10. other_parties (debtors)
11. residence_since
12. property_magnitude
13. age
14. other_payment_plans (installment plans)
15. housing
16. existing_credits
17. job
18. num_dependents
19. own_telephone
20. foreign_worker

Categorical or nominal attributes (which take values from a fixed set, e.g. true/false):
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. other_parties
8. property_magnitude
9. other_payment_plans
10. housing
11. job
12. own_telephone
13. foreign_worker

Real-valued attributes:
1. duration
2. credit_amount
3. installment_commitment
4. residence_since
5. age
6. existing_credits
7. num_dependents
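For reference, the split above can be checked programmatically. This is a small sketch, assuming the dataset has been saved locally as german_credit.arff (the path is an assumption); it simply asks each attribute whether it is nominal or numeric.

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ListAttributeTypes {
        public static void main(String[] args) throws Exception {
            // Path is an assumption; point it at your copy of the German credit ARFF file
            Instances data = DataSource.read("german_credit.arff");
            for (int i = 0; i < data.numAttributes(); i++) {
                Attribute att = data.attribute(i);
                // Each attribute reports its own type
                String kind = att.isNominal() ? "nominal"
                            : att.isNumeric() ? "numeric" : "other";
                System.out.println((i + 1) + ". " + att.name() + " : " + kind);
            }
        }
    }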



2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
According to me, the following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment_commitment
8. existing_credits

Based on the above attributes, we can make a decision whether to grant credit or not. Some simple rules in plain English:

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good



checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad



3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can be easily converted into classification rules; examples of decision tree algorithms are ID3, C4.5 and CART.

J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab, select the Choose option, where a list of different decision trees is available. From that list select J48.
3. Now under Test options, select the "Use training set" option.
4. The resulting window in WEKA is as follows:



5. To generate the decision tree, right-click on the result list and select the "Visualize tree" option, by which the decision tree will be generated.

6. The decision tree obtained for credit risk assessment is too large to fit on the screen.

7. The decision tree above is unclear due to the large number of attributes.
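The same training run can be reproduced from Java code. A minimal sketch, assuming the dataset is available locally as german_credit.arff; printing the built classifier outputs the same pruned-tree text the Explorer shows.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);           // 'class' is the last attribute

            J48 tree = new J48();        // pruned C4.5 decision tree, default options
            tree.buildClassifier(data);  // train on the complete dataset

            System.out.println(tree);    // prints the pruned tree in text form
        }
    }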



4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set) Why do you think you cannot get 100 % training accuracy?
In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset.

For example:
IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples are incorrectly classified. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. Due to this the accuracy is affected, and hence we cannot get 100% training accuracy.
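A short sketch of testing on the training set from the Java API, under the same assumption about the file name; eval.pctCorrect() should report roughly the 85.5% noted above.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainingSetAccuracy {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(data);

            // Evaluate the model on the very data it was trained on
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(tree, data);
            System.out.printf("Training accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }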



5. Is testing on the training set as you did above a good idea? Why or why not?
It is a bad idea to take all the data as the training set: if we do, how can we test whether the classification is correct or not? As a rule of thumb, for good accuracy estimates we should take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here in the above model we have taken the complete dataset as the training set, which results in only 85.5% accuracy. This comes from analyzing and training on unnecessary attributes which do not play a crucial role in credit risk assessment; the complexity increases, and finally it leads to lower accuracy. If part of the dataset is used as a training set and the remainder as a test set, it leads to more realistic accuracy estimates and the computation time will be less. This is why we prefer not to take the complete dataset as the training set.

Use Training Set result for the table GermanCreditData:

Correctly Classified Instances      855       85.5 %
Incorrectly Classified Instances    145       14.5 %
Kappa statistic                       0.6251
Mean absolute error                   0.2312
Root mean squared error               0.34
Relative absolute error              55.0377 %
Root relative squared error          74.2015 %
Total Number of Instances          1000
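The 2/3 - 1/3 hold-out split described above can be done in code as follows; a minimal sketch, with the file path and the random seed as assumptions.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HoldoutSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));   // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);     // train only on 2/3 of the data

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);  // test on the held-out 1/3
            System.out.println(eval.toSummaryString("\n=== Hold-out results ===\n", false));
        }
    }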



6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds" D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model.

That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1. The second iteration is trained on the subsets D1, D3, ..., Dk and tested on D2, and so on.

1. Select the Classify tab and the J48 decision tree; in the test options select the cross-validation radio button and set the number of folds to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic near 1 indicates close to 100% accuracy, with all errors zeroed out; but in reality there is no such training set that gives 100% accuracy.
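The equivalent 10-fold cross-validation from the Java API, again assuming german_credit.arff; crossValidateModel handles the fold splitting internally.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidateJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold CV: train on 9 folds, test on the 10th, repeated 10 times
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString("\n=== 10-fold cross-validation ===\n", false));
            System.out.printf("Kappa statistic: %.4f%n", eval.kappa());
        }
    }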



Cross Validation Result at folds: 10 for the table GermanCreditData:
Correctly Classified Instances      705       70.5 %
Incorrectly Classified Instances    295       29.5 %
Kappa statistic                       0.2467
Mean absolute error                   0.3467
Root mean squared error               0.4796
Relative absolute error              82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances          1000

Here there are 1000 instances with 100 instances per partition.

Cross Validation Result at folds: 20 for the table GermanCreditData:


Correctly Classified Instances      698       69.8 %
Incorrectly Classified Instances    302       30.2 %
Kappa statistic                       0.2264
Mean absolute error                   0.3571
Root mean squared error               0.4883
Relative absolute error              85.0006 %
Root relative squared error         106.5538 %
Total Number of Instances          1000

Cross Validation Result at folds: 50 for the table GermanCreditData:


Correctly Classified Instances      709       70.9 %
Incorrectly Classified Instances    291       29.1 %
Kappa statistic                       0.2538
Mean absolute error                   0.3484
Root mean squared error               0.4825
Relative absolute error              82.9304 %
Root relative squared error         105.2826 %
Total Number of Instances          1000

Cross Validation Result at folds: 100 for the table GermanCreditData:


Correctly Classified Instances      710       71   %
Incorrectly Classified Instances    290       29   %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.4771
Relative absolute error              81.959  %
Root relative squared error         104.1164 %
Total Number of Instances          1000

Percentage split does not allow 100%; it allows only up to 99.9%.


Percentage Split Result at 50%:

Correctly Classified Instances      362       72.4 %
Incorrectly Classified Instances    138       27.6 %
Kappa statistic                       0.2725
Mean absolute error                   0.3225
Root mean squared error               0.4764
Relative absolute error              76.3523 %
Root relative squared error         106.4373 %
Total Number of Instances           500


Percentage Split Result at 99.9%:

Correctly Classified Instances        0        0   %
Incorrectly Classified Instances      1      100   %
Kappa statistic                       0
Mean absolute error                   0.6667
Root mean squared error               0.6667
Relative absolute error             221.7054 %
Root relative squared error         221.7054 %
Total Number of Instances             1



7. Check to see if the data shows a bias against "foreign workers"
(attribute 20), or "personal_status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

This increases the accuracy, because the two attributes "foreign worker" and "personal status" are not very important for training and analysis. By removing them, the training time is reduced to some extent, and the accuracy increases. The decision tree created on the full dataset is very large compared to the decision tree we have trained now. This is the main difference between the two decision trees.

After "foreign worker" is removed, the accuracy is increased to 85.9%.


If we remove the 9th attribute, the accuracy is further increased to 86.6%, which shows that these two attributes are not significant for training.
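A sketch of the same experiment in code: the Remove filter drops attributes 9 (personal_status) and 20 (foreign_worker) before training, with the file path as an assumption. Note that Remove uses 1-based attribute indices, matching the numbering used in this question.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class DropAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path

            // Drop personal_status (9) and foreign_worker (20); 1-based indices
            Remove remove = new Remove();
            remove.setAttributeIndices("9,20");
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            reduced.setClassIndex(reduced.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(reduced);
            Evaluation eval = new Evaluation(reduced);
            eval.evaluateModel(tree, reduced);  // training-set accuracy, as in this experiment
            System.out.printf("Accuracy without attributes 9 and 20: %.1f%%%n", eval.pctCorrect());
        }
    }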



Cross validation after removing the 9th attribute.

Percentage split after removing the 9th attribute.



After removing the 20th attribute, the cross-validation result is as above.

After removing the 20th attribute, the percentage split result is as above.



8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21, and click on Invert to remove the remaining attributes.

Here the accuracy is decreased. Select random attributes and then check the accuracy.
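The attribute subset above can be produced in code with the Remove filter's invert option, mirroring the Invert button in the Explorer; a minimal sketch with the file path as an assumption.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class KeepSelectedAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path

            // Keep only attributes 2,3,5,7,10,17 plus the class (21); remove everything else
            Remove remove = new Remove();
            remove.setAttributeIndices("2,3,5,7,10,17,21");
            remove.setInvertSelection(true);  // invert: the listed attributes are KEPT
            remove.setInputFormat(data);
            Instances subset = Filter.useFilter(data, remove);
            subset.setClassIndex(subset.numAttributes() - 1);

            System.out.println("Remaining attributes: " + subset.numAttributes());
        }
    }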


After removing the attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them.


After we remove 14 attributes, the accuracy decreases to 76.4%; hence we can further try random combinations of attributes to increase the accuracy.

Cross validation



Percentage split



9. Sometimes, the cost of rejecting an applicant who actually has good credit (Case 1) might be higher than accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?
In problem 6 we used equal cost and trained the decision tree. Here, we consider two cases with different costs. Let us take cost 5 in Case 1 and cost 2 in Case 2. When we give such costs in both cases and train the decision tree again, we can observe that the tree is almost equal to the one obtained in problem 6.

                   Total Cost    Average Cost
Case 1 (cost 5)       3820          3.82
Case 2 (cost 2)       1705          1.705

We do not find this cost factor in problem 6, as there we used equal cost. This is the major difference between the results of problem 6 and problem 9.

The cost matrices we used here:

Case 1:
5 1
1 5

Case 2:
2 1
1 2


1. Select the Classify tab.
2. Select More Options from Test Options.



3. Tick Cost sensitive evaluation and go to Set.

4. Set the number of classes to 2.
5. Click on Resize; we then get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. The confusion matrix will then be generated, and you can find out the difference between the good and bad classes.
8. Check whether the accuracy changes or not.
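Cost-sensitive evaluation can also be done through the Java API. The sketch below uses the 2x2 cost matrix from the dataset description in section 8 above (cost 5 for classifying an actually-bad customer as good, cost 1 for the reverse); the file path is an assumption, and CostMatrix.setCell fills in the per-cell costs.

    import java.util.Random;
    import weka.classifiers.CostMatrix;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostSensitiveEvaluation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            // 2x2 cost matrix: rows = actual class, columns = predicted class.
            // Class index 0 = good, 1 = bad (order of the class attribute values).
            CostMatrix costs = new CostMatrix(2);
            costs.setCell(1, 0, 5.0);  // actual bad, predicted good: cost 5
            costs.setCell(0, 1, 1.0);  // actual good, predicted bad: cost 1

            Evaluation eval = new Evaluation(data, costs);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("Total cost: %.0f, average cost: %.3f%n",
                    eval.totalCost(), eval.avgCost());
        }
    }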



10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which increases the bias of the model. Because of this, the accuracy of the model can also be affected. This problem can be reduced by considering a simple decision tree: the attributes will be fewer, which decreases the bias of the model, and due to this the result will be more accurate. So it is a good idea to prefer simple decision trees to long complex trees.

1. Open any existing ARFF file, e.g. labor.arff.
2. In the Preprocess tab, select ALL to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.


4. To generate the decision tree, right-click on the result list and select the "Visualize tree" option, by which the decision tree will be generated.



5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this, set the unpruned option to true.
7. Then press OK and then Start. We find that the tree becomes more complex if not pruned.

Visualize tree


8. The tree has become more complex.



11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning: The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to true.
3. Then press OK and then Start.
4. We find that the accuracy has been increased by selecting the reduced error pruning option.
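A sketch of enabling reduced-error pruning on J48 from code, assuming the same german_credit.arff file; setReducedErrorPruning(true) corresponds to the Generic Object Editor option mentioned in step 2 above.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ReducedErrorPruningDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("german_credit.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setReducedErrorPruning(true); // hold back part of the data as a pruning set

            // Cross-validate the configured tree, as the question asks
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("Accuracy with reduced-error pruning: %.1f%%%n", eval.pctCorrect());
        }
    }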


12. (Extra Credit): How can you convert a Decision Tree into "if-then-else rules"? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
------------------
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)

Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (weather), the single attribute for making the decision is "outlook":

outlook:
    sunny    -> no
    overcast -> yes
    rainy    -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place.



Classifier   Time (sec)   Rank
J48          0.12         II
PART         0.14         III
OneR         0.04         I

But if you consider accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR gets last place.

Classifier   Accuracy (%)
J48          70.5
PART         70.2
OneR         66.8

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to Classify.
4. Start.
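The PART and OneR models reported above can be trained from code as follows; a minimal sketch assuming weather.nominal.arff is available locally. Printing each built classifier yields the decision list and the single-attribute rule shown earlier.

    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.PART;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RuleClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            PART part = new PART();      // rules derived from partial C4.5 trees
            part.buildClassifier(data);
            System.out.println(part);    // prints the PART decision list

            OneR oneR = new OneR();      // single rule on the minimum-error attribute
            oneR.buildClassifier(data);
            System.out.println(oneR);    // prints the OneR rule (based on 'outlook' here)
        }
    }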


Here the accuracy is 100%



The tree corresponds to if-then-else rules:

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no, else play = yes
If outlook = rainy and windy = true then play = no, else play = yes

To obtain the rules:

1. Go to Choose, then click on Rules and select PART.
2. Click on Save and Start.
3. Proceed similarly for the OneR algorithm.


If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = low then play = yes
