BY
0853756
SUBMITTED TO
Table of contents
1. Abstract
2. Introduction
2.1 Introduction to WEKA
2.2 Data Mining
2.3 Implementation of Algorithms
2.3.1 Dataset Introduction: Haberman's Survival Data
2.3.2 Data Cleaning
2.3.3 Dataset Transformation
2.3.4 Attribute Description
2.3.5 Class Description
2.3.6 Input Encoding/Input Representation
2.3.7 Input Type
2.4 Implementing the Algorithms
2.4.1 Reasons why WEKA is chosen
2.4.1.1 Implementation of CRUISE
2.4.2 Implementing Algorithms in WEKA
2.4.3 In-depth Analysis of Results of Three Algorithms: J48, Multilayer Perceptron and Naïve Bayes
2.4.3.1 Evaluation of J48
2.4.3.2 Calculation of Accuracy & Confusion Matrix in J48
2.4.3.3 Evaluation of Multilayer Perceptron
2.4.3.4 Evaluation of Naïve Bayes
2.5 Overall Accuracy rate of the algorithms implemented
2.6 Difficulties faced while using WEKA
3. Research analysis
3.1 Introduction
3.2 Decision Tree
3.3 Neural Networks
3.4 Comparison of J48 Decision Tree and Neural Networks (Multilayer Perceptron)
3.5 Results based on J48 and Multilayer Perceptron
3.6 Conclusion
1. Abstract:
2. Introduction:
In this section we will see how Decision Trees, the Naïve Bayes
algorithm and Neural Networks are implemented on the dataset, and
how the algorithms are compared with each other based on their
results.
According to the UCI archive, some datasets possess missing and
inconsistent values, but the Haberman's Survival dataset has no
missing values. Where missing values occur, they are represented
as '?' in the data.
After this, the file is saved in the ARFF format (.arff,
Attribute-Relation File Format) so that it can be used in WEKA, and
it is then opened with the WEKA software in order to extract the
patterns.
Class Description
Survival status: the survival status of the patient.
The input file which we open in WEKA should be in the .arff file
format. Its data section lists one instance per line:
@DATA
30, 64, 1, 1
30, 62, 3, 1
And so on
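Putting the fragments together, a complete ARFF file for this dataset might look like the following sketch (the attribute names here are illustrative choices, not taken from the original file; only the order and types of the declarations must match the data columns):

```
@RELATION haberman

@ATTRIBUTE age NUMERIC
@ATTRIBUTE operation_year NUMERIC
@ATTRIBUTE positive_nodes NUMERIC
@ATTRIBUTE survival_status {1, 2}

@DATA
30, 64, 1, 1
30, 62, 3, 1
```

A missing value, if present, would be written as '?' in place of the number.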
        A      B    Result
      210     15     225
       70     11      81
                     306 (total no. of instances)
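Assuming the table above is the J48 confusion matrix (rows are the actual classes a and b, columns the predicted classes, and Result the row totals), the accuracy can be recomputed with a short sketch:

```python
# Confusion matrix read off the table above: rows are actual classes,
# columns are predicted classes.
confusion = [
    [210, 15],  # actual a: 210 predicted a, 15 predicted b -> 225 total
    [70,  11],  # actual b:  70 predicted a, 11 predicted b ->  81 total
]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
total = sum(sum(row) for row in confusion)                     # 306 instances
accuracy = correct / total

print(f"{correct}/{total} correctly classified = {accuracy:.1%}")
```

This gives 221/306, roughly 72.2%, the kind of accuracy figure WEKA reports in its classifier output.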
After calculating and comparing the accuracy of the data, the errors
are classified. The accuracy rate is higher for discretized data
than for the raw data, because in supervised discretization the
class label is known; the time taken to build the model also
improves.
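For contrast with supervised discretization (which uses the class label to place cut points), a minimal unsupervised equal-width binning can be sketched as follows; the age values below are made-up examples, not rows from the dataset:

```python
def equal_width_bins(values, n_bins=3):
    """Unsupervised equal-width discretization: split the value range
    into n_bins intervals of equal width, ignoring the class label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical patient ages (the first attribute of the dataset).
ages = [30, 34, 38, 42, 47, 53, 59, 63, 70, 78]
bins = equal_width_bins(ages)
print(bins)  # -> [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

A supervised method would instead place cut points where the class distribution changes, which is why the class label must be known and why accuracy can improve.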
The classifier-errors visualisation gives a graph showing the
wrongly classified instances. Clicking on an instance in the graph
displays an "instance info" window, which shows the number of the
wrongly classified instance and what the predicted class could be.
By changing the prediction class in the data, a certain increase in
accuracy could be obtained.
From the table above, the accuracy rate of Naïve Bayes is highest
when supervised discretization is set to false, i.e. on the raw
data. When the raw data is discretized, the highest accuracy rate is
shown by the Multilayer Perceptron. In this comparison of accuracy
rates there is not much difference between the algorithms; there is
only a small increase when the raw data is discretized.
3. Research analysis:
In this section a detailed analysis of the two data mining
techniques is carried out, based on the results obtained.
3.1 Introduction:
Among data mining tools, Neural Networks can be used to model data
patterns and relations between input and output data that are
complex in nature. Using Neural Networks, information can be
extracted from large databases. Neural Networks can be used for
clustering, classification, pattern recognition, function
approximation and prediction. A trained Neural Network, however, is
tied to one specific task.
A simple artificial neuron produces an output from its inputs based
on the data, whereas the human brain, by comparison, contains more
than 10 billion nerve cells, or neurons. Neural Networks are used in
face recognition, where each pixel is encoded as a binary input. A
Neural Network performing recognition over 10 faces first looks for
the features common to them.
In data mining, the Multilayer Perceptron is a Neural Network
algorithm with input, hidden and output layers. The output model
generated in WEKA shows the weights assigned to each input.
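The idea of weighted inputs can be illustrated with a single artificial neuron; the weights, bias and normalised attribute values below are made-up numbers for illustration, not the ones WEKA learns:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of the inputs plus a bias,
    passed through a sigmoid activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

# Hypothetical weights for the three attributes (age, year, nodes),
# with the inputs normalised to [0, 1].
output = neuron([0.3, 0.6, 0.1], weights=[0.5, -0.2, 1.4], bias=0.1)
print(round(output, 3))
```

A Multilayer Perceptron stacks layers of such neurons, with the outputs of one layer feeding the inputs of the next.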
A decision tree has leaf nodes and decision nodes, and is similar to
use cases in UML, where an actor's use cases can be represented in
tree fashion. Leaf nodes hold the classification of the instances,
while decision nodes are the nodes that test attribute values on the
way down to the leaves. In Neural Networks, neurons are used for
classification instead.
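J48 chooses its decision nodes by measuring how much a candidate test reduces the entropy of the class labels (C4.5 refines this into a gain ratio). A minimal sketch of that criterion, using made-up rows rather than the actual dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, test):
    """Entropy reduction achieved by splitting the rows on a boolean test."""
    left = [lab for row, lab in zip(rows, labels) if test(row)]
    right = [lab for row, lab in zip(rows, labels) if not test(row)]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Hypothetical instances: positive axillary nodes vs. survival status.
rows = [{"nodes": 0}, {"nodes": 1}, {"nodes": 8}, {"nodes": 12}]
labels = [1, 1, 2, 2]
gain = information_gain(rows, labels, lambda r: r["nodes"] <= 4)
print(gain)  # a perfect split: entropy drops from 1.0 to 0.0
```

The test with the highest gain becomes a decision node, and the process repeats on each branch until the leaves are (nearly) pure.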
J48 is the algorithm used for decision trees, and the Multilayer
Perceptron algorithm is used for neural networks. Applying J48 to a
dataset makes it possible to visualise the decision tree, and
applying the Multilayer Perceptron with properties such as GUI
(Graphical User Interface) set to true displays the neural network
for the chosen dataset.
3.6 Conclusion:
The dataset used in this coursework comes from the medical and
health care area, where databases hold huge amounts of information.
Data mining in this area means applying different kinds of
algorithms to derive patterns from the data. By applying algorithms
like J48 and the Multilayer Perceptron, efficiency is maintained
throughout the data. Even though the WEKA tool shows some
differences in accuracy between the applied algorithms, each
algorithm plays a vital role in building Decision Trees or Neural
Networks.
References:
1. Rokach, L. and Maimon, O. (2008). Data Mining with Decision Trees: Theory and Applications. World Scientific Pub Co Inc.
3. http://archive.ics.uci.edu/ml/datasets.html