
Department of Computer Science and Electronic Engineering, University of Essex

Report of Data Mining


Data mining classification techniques implemented in Medical Decision Support System

CE802 Machine Learning and Data Mining

Name: Bouchou Mohamed Oussama E-Mail: ombouc@essex.ac.uk Supervisor: Paul Scott Date: 15 January 2012

Abstract
This paper presents the analysis and evaluation of several data mining classification algorithms that are commonly applied in modern Medical Decision Support Systems (MDSS). Medical institutions store very large quantities of data, and the relevant information that can be extracted from them may be precious for the future of science and medicine; suitable algorithms are needed to extract it. Three algorithms were chosen for this experiment, WEKA was the program used to conduct the analyses, and four medical datasets were employed. It is hard to name a single algorithm as the most suitable for this kind of task, and there was no large gap between the results; nevertheless, the experiment reveals that Naïve Bayes is the best classifier for such medical data.

Data Mining report

TABLE OF CONTENTS:
Abstract
1 Introduction
2 The objectives of the experiment
3 Data mining algorithms classifiers
  3-1 Decision tree induction
  3-2 Multilayer perceptron (neural networks)
  3-3 Naïve Bayes
4 Description of the datasets
  4-1 Databases description
    4-1-1 Breast cancer database
    4-1-2 Hepatitis database
    4-1-3 Heart disease database
    4-1-4 Diabetes database
5 The experiment
  5-1 WEKA
  5-2 Method for evaluating the algorithms
  5-3 The application of the algorithms to the data sets
    5-3-1 Breast cancer database
    5-3-2 Hepatitis database
    5-3-3 Heart disease database
    5-3-4 Diabetes database
  5-4 The medical prediction
  5-5 The comparison of the data mining algorithms applied to medical databases
Conclusion
References
Appendix


1- Introduction
Healthcare is one of the richest sectors when it comes to collecting information: medical information, data and knowledge are growing at an incredible rate. It has been estimated that an acute care hospital may generate five terabytes of data a year [1]. Extracting useful knowledge from this huge quantity of data is crucial, and far from trivial. The collected data are stored in databases or even data warehouses, and these database systems differ from one institution to another. In the last few decades a new type of medical system has emerged in the domain of medicine [2], and the data stored in these systems may hide precious knowledge; experiments are needed to process the data and retrieve it. Human decision making is close to optimal when only a small amount of data has to be processed, but it becomes hard and inexact when the amount of data is large. This situation has pushed computer science engineers and medical staff to work together. The objective of this collaboration is to develop the most suitable methods for data processing, allowing the discovery of nontrivial rules; the result is an improved process of diagnosis and treatment, in addition to a reduced risk of medical mistakes. This experiment aims to identify and evaluate the most common data mining algorithms implemented in modern Medical Decision Support Systems (MDSS) [3]. Many experiments have been done in this field, but they were assessed with different measures on different datasets, which makes comparison between algorithms almost impossible. This paper contrasts and compares three of the common data mining classification algorithms (Naïve Bayes, multilayer perceptron, decision tree induction), with all conditions and configurations prepared to ensure the experiment is conducted under the same conditions for each.

2- The objectives of the experiment


The main objective of this experiment is to evaluate three selected data mining classification algorithms which are commonly implemented in medical decision support systems. After obtaining the results, the performances of the algorithms are compared in order to identify the most suitable and powerful one for extracting knowledge from medical data. The rest of this report will:
- give a brief definition of the algorithms used and present the datasets;
- conduct the experiment under WEKA;


- compare the performances of the algorithms and draw a conclusion.

3- Data mining algorithms classifiers


In medicine, data mining is one of the tools used to search for valuable hidden patterns, thereby giving a more precise diagnosis as well as saving precious time. This part describes the three selected algorithms, which are commonly used by experts in this kind of experiment.

3-1 Decision Tree Induction
Decision trees are one of the most frequently used techniques of data analysis, and the advantages of this method are unquestionable: decision trees are, among other things, easy to visualize and understand, and resistant to noise in the data [4]. Commonly, decision trees are used to classify records into a specific class; moreover, they are applicable to both regression and association tasks. Decision trees have been applied successfully in medicine, for example to classify prostate cancer or breast cancer.

3-2 Multilayer perceptron (neural networks)
A neural network is a type of artificial intelligence that attempts to imitate the way a human brain works. Rather than using a digital model, in which all computations manipulate zeros and ones, a neural network works by creating connections between processing elements, the computer equivalent of neurons. The organization and weights of the connections determine the output. Neural networks are particularly effective for predicting events when they have a large database of prior examples to draw on. Strictly speaking, a neural network implies a non-digital computer, but neural networks can be simulated on digital computers. In medical diagnosis the inputs are the symptoms of the patient, while the output is the prediction of different diseases [5]. This type of artificial neural network is built on the basic perceptron unit: the perceptron takes an input value vector and outputs 1 if a linear combination of the input values is greater than a predefined threshold, or -1 otherwise.

3-3 Naïve Bayes
Naïve Bayes is a simple probabilistic classifier. It is based on an assumption of mutual independence of the attributes (an independent feature model). Usually this assumption is far from true, which is the reason for the "naivety" of the method [5].
The probabilities used in the Naïve Bayes algorithm are calculated according to Bayes' rule: the probability of a hypothesis H can be calculated on the basis of the prior probability of H and the evidence E about the hypothesis, according to the following formula:

P(H|E) = P(E|H) P(H) / P(E)

In practice, Naïve Bayes has been shown to work efficiently in real situations; in medicine, for instance, it was used in diagnosis for the treatment of pneumonia.
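As a numerical illustration of Bayes' rule (the probabilities below are made up for the example, not taken from the datasets used here), the posterior probability of a hypothesis can be computed directly:

```python
import math  # not strictly needed here; kept for clarity of a pure-Python sketch

# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
# Made-up numbers: a disease H with a 1% prior, a symptom E seen in 90% of
# patients with the disease, and in 5% of patients without it.
p_h = 0.01            # prior probability of the hypothesis (disease)
p_e_given_h = 0.90    # probability of the evidence given the disease
p_e_given_not_h = 0.05

# P(E) by total probability, then the posterior via Bayes' rule
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))  # prints 0.154
```

Even a strongly indicative symptom yields a modest posterior here because the prior is small, which is exactly the effect the rule captures.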


4- Description of the datasets


To conduct the experiment, four databases were taken into consideration. This part discusses the source of the databases as well as their details. The databases were taken from the well-known UCI machine learning repository. This choice was made because the databases vary from one to another, being collected at different clinics and hospitals, which permits us to compare and evaluate the performances of the algorithms under realistic conditions. Another reason for choosing UCI is to allow others to reproduce the same experiment and compare results.

4-1 Databases description
Below is a detailed description of the four databases chosen for the experiment, all taken from UCI, together with the possible values of every attribute.

4-1-1 Breast cancer database
The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

Table 4-1: Breast cancer database details
  Dataset: Breast cancer
  Number of attributes: 11
  Number of symptoms: 9
  Number of instances: 699
  Number of classes: 2
  Type of attributes: Integer
  Missing values: Yes

The following table summarises the values of the database cited above.

Table 4-2: Attributes and their possible values for the breast cancer database
  Patient id number: e.g. 1000025
  Clump Thickness: 1 - 10
  Uniformity of Cell Size: 1 - 10
  Uniformity of Cell Shape: 1 - 10
  Marginal Adhesion: 1 - 10
  Single Epithelial Cell Size: 1 - 10
  Bare Nuclei: 1 - 10
  Bland Chromatin: 1 - 10
  Normal Nucleoli: 1 - 10
  Mitoses: 1 - 10
  Diagnosis (class): 2: benign, 4: malignant
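The raw file behind Table 4-2 is plain comma-separated text: one row per patient, holding the id, the nine symptom values (with missing values written as '?') and the class at the end. A minimal parsing sketch, using two illustrative rows written in that format:

```python
import csv
import io

# Two illustrative rows in the breast-cancer-wisconsin.data layout:
# patient id, nine symptom values (1-10), diagnosis class at the end.
raw = """1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2"""

rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    patient_id, symptoms, diagnosis = row[0], row[1:10], row[10]
    print(patient_id, symptoms, diagnosis)
```

In a real run the same loop would read the downloaded file instead of the in-memory string, skipping rows whose symptom values are '?' or imputing them first.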


4-1-2 Hepatitis database
The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data

Table 4-3: Hepatitis database details
  Dataset: Hepatitis
  Number of attributes: 20
  Number of symptoms: 19
  Number of instances: 155
  Number of classes: 2
  Type of attributes: Integer, Real
  Missing values: Yes

The following table summarises the values of the database cited above.

Table 4-4: Attributes and their possible values for the hepatitis database
  Age: 10, 20, 30, 40, 50
  Sex: 1: Male, 0: Female
  Antivirals: 1: Yes, 0: No
  Fatigue: 1: Yes, 0: No
  Anorexia: 1: Yes, 0: No
  Liver big: 1: Yes, 0: No
  Liver firm: 1: Yes, 0: No
  Malaise: 1: Yes, 0: No
  Steroid: 1: Yes, 0: No
  Spleen palpable: 1: Yes, 0: No
  Spiders: 1: Yes, 0: No
  Ascites: 1: Yes, 0: No
  Histology: 1: Yes, 0: No
  Varices: 1: Yes, 0: No
  Bilirubin: 0.39, 0.40, 0.41, ..., 4
  Alk Phosphate: 30, 90, 130, ..., 220, 300
  SGOT: 13, 100, 200, 400, 500
  Albumin: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
  Protime: 10, 20, 30, 40, 50, 60, 70, 80, 90
  Diagnosis (class): 1: Die, 2: Live

4-1-3 Heart disease database
The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
The following tables contain the details of the heart disease database.


Table 4-5: Heart disease database details
  Dataset: Heart disease
  Number of attributes: 14
  Number of symptoms: 13
  Number of instances: 303
  Number of classes: 5
  Type of attributes: Integer, Real
  Missing values: Yes

The following table summarises the values of the database cited above.

Table 4-6: Attributes and their possible values for the heart disease database
  Age: 10, 20, 30, 40, 50
  Sex: 1: Male, 0: Female
  Chest pain type: 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
  Resting blood pressure (mm Hg): 94 - 200
  Serum cholesterol (mg/dl): 126 - 564
  Fasting blood sugar > 120 mg/dl: 1: Yes, 0: No
  Resting electrocardiographic results: 0, 1, 2
  Maximum heart rate achieved: 70 - 200
  Exercise induced angina: 1: Yes, 0: No
  ST depression induced by exercise relative to rest: 0, 0.1, ..., 6.01
  Slope of the peak exercise ST segment: 1, 2, 3
  Number of major vessels colored by fluoroscopy: 1, 2, 3
  Thal: 3: Normal, 6: Fixed defect, 7: Reversible defect
  Diagnosis (class): 0, 1, 2, 3, 4

4-1-4 Diabetes database
The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
The following tables contain the details of the diabetes database.

Table 4-7: Diabetes database details
  Dataset: Diabetes
  Number of attributes: 9
  Number of symptoms: 8
  Number of instances: 768
  Number of classes: 2
  Type of attributes: Integer, Real
  Missing values: Yes


The following table summarises the values of the database cited above.

Table 4-8: Attributes and their possible values for the diabetes database
  Age: 10, 20, 30, 40, 50
  Number of times pregnant: 0, 1, 2, 3, 4
  Plasma glucose concentration: 0 - 199
  Diastolic blood pressure (mm Hg): 24 - 122
  Triceps skin fold thickness (mm): 7 - 99
  Serum insulin (mu U/ml): 14 - 850
  Body mass index (kg/m2): 18 - 68
  Diabetes pedigree function: 0.01, 0.02, ..., 0.4, ...
  Diagnosis (class): 0: non-diabetic, 1: diabetic

5- The Experiment

5-1 WEKA
WEKA is a data mining program developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a workbench for developing machine learning (ML) techniques and applying them to real-world data mining problems: a collection of machine learning algorithms for data mining tasks that are applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules, in addition to visualization tools, and new machine learning schemes can also be developed with the package. WEKA is open source software issued under the General Public License (GPL) [6]. The decision tree algorithm is C4.5, called J48 in WEKA. It has a pruning feature which removes sections of the classifier that are based on noisy data, reducing the phenomenon of overfitting. The Naïve Bayes algorithm is simply called NaiveBayes in WEKA; it is a simple algorithm, known for good performance especially on large data sets, because all the attributes are treated as conditionally independent. The neural network algorithm used is called MultilayerPerceptron; it consists of layers of nodes connected to one another, where every node except the inputs is a neuron. It generally classifies well, especially on numeric values.
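The experiment itself was run inside WEKA, but the same three classifier families can be sketched in Python with scikit-learn analogues (GaussianNB for Naïve Bayes, MLPClassifier for the multilayer perceptron, DecisionTreeClassifier for a C4.5-style pruned tree). The tiny data set below is invented purely for illustration:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical symptom vectors (values 1-10) and class labels (2 / 4),
# loosely mimicking the format of the breast cancer database.
X = [[1, 2], [2, 1], [1, 1], [9, 10], [10, 9], [10, 10]]
y = [2, 2, 2, 4, 4, 4]

# Fit each classifier and classify a new hypothetical patient.
for clf in (GaussianNB(),
            MLPClassifier(max_iter=2000, random_state=0),
            DecisionTreeClassifier(random_state=0)):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict([[9, 9]])[0])
```

These are analogues, not WEKA's own implementations; in particular J48's pruning and WEKA's default MLP topology differ from the scikit-learn defaults used here.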


5-2 Method for evaluating the algorithms
There are several methods for analysing and extracting knowledge from data. In our case, a Medical Decision Support System, the rates of correct and incorrect diagnoses should be analysed. This classification is taken seriously into consideration because of the important impact an error might have. In medical records the classification is not a certainty but a prediction: we cannot say whether or when a person will die, or whether a person has cancer. It is crucial that the experiment follow fixed conditions and measures, because one of the objectives is to determine which parameter settings yield the best model; another aim is to find an optimal setting in order to maximise performance. During the experiment we used the default data split and k-fold cross-validation, and the whole training data set was kept, in order to compare the algorithms in a fair environment. Regarding the split, the more data used for training, the better the model is built; at the same time, the more data used for testing, the more accurate the result. So it is important to divide the data set into two parts, training data and testing data. What proportion should each take? It is generally accepted that good results are obtained when 66% of the data is used to build the model and the remaining 34% is used only for testing. For cross-validation, experiments have shown that the best results are obtained with 10 folds (that is, splitting the data set into 10 equal pieces). Every algorithm was analysed under the same data split and cross-validation parameters, and to achieve a fairer result, all the algorithms were initialised with the same learning parameters throughout the experiment.

The following values and results were recorded during the experiment:
- Correctly classified instances
- Incorrectly classified instances
- Mean absolute error (MAE): tightly related to classification accuracy, so if the MAE increases the classification accuracy decreases, and vice versa
- Root mean squared error (RMSE): similar to the MAE, but considered a better indicator of classification quality
- Precision: the probability of being correct, given a decision
- Time of execution of the algorithm
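The measures listed above are easy to state precisely. The sketch below computes MAE, RMSE and precision on made-up probability predictions for a two-class problem; nothing here is taken from the WEKA runs:

```python
import math

# Made-up true classes and predicted probabilities of the positive class (1).
y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.4, 0.7]
y_pred = [1 if p > 0.5 else 0 for p in p_pred]  # hard decisions at 0.5

# MAE and RMSE measured on the probability estimates.
mae = sum(abs(t - p) for t, p in zip(y_true, p_pred)) / len(y_true)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, p_pred)) / len(y_true))

# Precision: of the instances predicted positive, how many really are.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
precision = tp / (tp + fp)

print(round(mae, 2), round(rmse, 2), round(precision, 2))  # prints 0.4 0.46 0.67
```

Note that RMSE penalises large individual errors more heavily than MAE, which is why the report treats it as the stronger indicator.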


5-3 The application of the algorithms to the data sets


5-3-1 Breast cancer database
Naïve Bayes
  Method                    Correct   Incorrect  MAE     RMSE     Precision  Time (s)
  Training set              96.13%    3.86%      3.85%   19.28%   96.3%      0.13
  Testing set (66%)         95.37%    4.62%      4.73%   21.23%   95.4%      0.03
  10-fold cross validation  95.99%    4.00%      4.03%   19.83%   96.2%      0.03

Multilayer Perceptron
  Method                    Correct   Incorrect  MAE     RMSE     Precision  Time (s)
  Training set              99.28%    0.71%      1.79%   8.66%    99.3%      13.7
  Testing set (66%)         95.79%    4.20%      4.71%   19.04%   95.8%      13.54
  10-fold cross validation  95.85%    4.14%      4.72%   19.15%   95.9%      13.42

Decision Tree (C4.5)
  Method                    Correct   Incorrect  MAE     RMSE     Precision  Time (s)
  Training set              98.56%    1.43%      2.83%   11.79%   98.6%      0.23
  Testing set (66%)         95.37%    4.62%      6.48%   21.14%   95.4%      0.11
  10-fold cross validation  94.56%    5.43%      6.91%   22.28%   94.6%      0.09

Table 5-1: The performances of the three algorithms applied to the breast cancer database

The table shows that the three algorithms performed well when applied to the breast cancer database; the lowest figures are recorded for the C4.5 algorithm. This reflects good precision as well as a small margin of error. Precision reaches 99% at its highest and 94% at its lowest, which matches exactly the results obtained for classification. The smallest 10-fold root mean squared error belongs to the multilayer perceptron with 19.15%, with Naïve Bayes next at 19.83%. Overall the multilayer perceptron is slightly better than the others in terms of results, followed by Naïve Bayes and then C4.5. However, when it comes to execution time, Naïve Bayes is by far the winner at 0.03 seconds.

5-3-2 Hepatitis database


Naïve Bayes
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              70.96%    29.03%     28.19%   48.65%   71.1%      0.01
  Testing set (66%)         71.69%    28.30%     30.22%   51.26%   74.7%      0.01
  10-fold cross validation  70.96%    29.03%     29.86%   50.61%   71.2%      0.02

Multilayer Perceptron
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              96.77%    3.22%      6.12%    16.51%   96.8%      7.47
  Testing set (66%)         66.03%    33.96%     36.86%   54.41%   66.5%      7.02
  10-fold cross validation  56.77%    43.22%     42.97%   60.44%   56.8%      7.74

Decision Tree (C4.5)
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              82.58%    17.41%     28.21%   36.59%   82.8%      0.08
  Testing set (66%)         69.81%    30.18%     38.13%   46.6%    72%        0.08
  10-fold cross validation  58.06%    41.93%     43.57%   55.1%    57.7%      0.08

Table 5-2: The performances of the three algorithms applied to the hepatitis database

The performances shown in this table are less convincing than the previous ones for breast cancer. One reason might be that this database contains more missing values than the breast cancer database. Naïve Bayes is the best classifier for this database, but the results are still poor. The multilayer perceptron achieves the best precision, but only on the training set; it obtains the worst proportion under 10-fold cross-validation, where Naïve Bayes is the best classifier, followed by C4.5. Regarding execution time, Naïve Bayes again has the shortest at 0.01 seconds. On the other hand, the multilayer perceptron averages around 7 seconds, which is much higher than Naïve Bayes and even the decision tree with its average of 0.08 seconds.

5-3-3 Heart disease database


Naïve Bayes
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              63.36%    36.63%     16.38%   30.66%   62.3%      0.02
  Testing set (66%)         52.42%    47.57%     19.8%    35.92%   53.3%      0.02
  10-fold cross validation  56.43%    43.56%     18.39%   33.97%   54.9%      0.02

Multilayer Perceptron
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              81.84%    18.15%     9.22%    21.86%   85%        13.85
  Testing set (66%)         54.36%    45.63%     19.24%   39.9%    52.3%      13.01
  10-fold cross validation  53.46%    46.53%     19.07%   38.34%   51.8%      12.22

Decision Tree (C4.5)
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              78.54%    21.45%     12.47%   24.92%   77.5%      0.13
  Testing set (66%)         46.60%    53.39%     21.79%   42.42%   48.3%      0.11
  10-fold cross validation  52.47%    47.52%     21.05%   40.11%   47.5%      0.11

Table 5-3: The performances of the three algorithms applied to the heart disease database

The algorithms were then tested on the third database, heart disease, and the results confirm those of the previous evaluations. As shown, the algorithms performed well on the breast cancer database and less well on hepatitis; the worst classification is that of heart disease, which shows the poorest results since the start of the experiment, with the highest proportion of incorrect classifications and the lowest percentages of correct classification. Concerning timing, the rates remain stable: as always, Naïve Bayes is the fastest algorithm, while the multilayer perceptron is far behind its two competitors and is the slowest.

5-3-4 Diabetes database


Naïve Bayes
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              76.30%    23.69%     28.11%   41.33%   75.9%      0.02
  Testing set (66%)         77.01%    22.98%     26.6%    38.22%   76.7%      0.02
  10-fold cross validation  76.30%    23.69%     28.41%   41.68%   75.9%      0.03

Multilayer Perceptron
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              80.59%    19.40%     28.52%   38.15%   81.9%      12.12
  Testing set (66%)         74.32%    25.67%     31.86%   44.45%   75.6%      12.96
  10-fold cross validation  75.39%    24.60%     29.55%   42.15%   75%        11.76

Decision Tree (C4.5)
  Method                    Correct   Incorrect  MAE      RMSE     Precision  Time (s)
  Training set              84.11%    15.88%     23.83%   34.52%   84.2%      0.11
  Testing set (66%)         76.24%    23.75%     31.25%   40.59%   75.6%      0.11
  10-fold cross validation  73.82%    26.17%     31.58%   44.63%   73.5%      0.14

Table 5-4: The performances of the three algorithms applied to the diabetes database



The results of this last experiment are quite satisfying, better than the previous two. Regarding correctly classified instances, the average is around 75% and the lowest figure is 73.8%, which is a good average; the same is noticed for precision. The root mean squared error results confirm the classification percentages. Nothing has changed for the execution times, which stay close to the earlier results for each algorithm: Naïve Bayes comes first, followed by the decision tree, with the multilayer perceptron last.

5-4 The medical prediction
Scientists pay great attention to medical prediction; this is why data mining is being used to improve medicine and the accuracy of diagnosis as well as prediction. The following table summarises a data mining prediction.
                      Survival   Death
  Predicted survival  TP         FP
  Predicted death     FN         TN
Table 5-5: TP, FN, FP and TN prediction table

An interpretation of the values may be useful. True Positives (TP) are patients predicted to survive among those who survived. True Negatives (TN) are patients predicted to die among those who died. False Positives (FP) are patients predicted to survive among those who died, and False Negatives (FN) are patients predicted to die among those who survived. As the experiment covered four databases, it was preferable to choose one and give a small example of medical prediction. The hepatitis database was chosen, with 10-fold cross-validation as the method, for its good precision in the experiments above. The number of patients is 155, so TP + FN + FP + TN = 155. The following results were obtained relying on the confusion matrices (see Appendix 2).
Naïve Bayes
                      Survival       Death
  Predicted survival  70 (45.16%)    15 (9.67%)
  Predicted death     30 (19.35%)    40 (25.80%)
Table 5-6: prediction for hepatitis by the Naïve Bayes algorithm

Multilayer Perceptron
                      Survival       Death
  Predicted survival  51 (32.9%)     34 (21.93%)
  Predicted death     33 (21.29%)    37 (23.87%)
Table 5-7: prediction for hepatitis by the multilayer perceptron algorithm

Decision Tree (C4.5)
                      Survival       Death
  Predicted survival  57 (36.77%)    28 (18.06%)
  Predicted death     37 (23.87%)    33 (21.29%)
Table 5-8: prediction for hepatitis by the decision tree algorithm
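The cell percentages in Tables 5-6 to 5-8 follow directly from the raw confusion matrix counts. As a check, using the Naïve Bayes counts from Table 5-6:

```python
# Naive Bayes confusion matrix counts from Table 5-6 (hepatitis, 10-fold CV).
tp, fp, fn, tn = 70, 15, 30, 40
total = tp + fp + fn + tn

accuracy = (tp + tn) / total      # proportion of correctly classified patients
print(total)                       # 155 patients in all
print(round(tp / total * 100, 2))  # 45.16, the TP cell's percentage
print(round(accuracy * 100, 2))    # 70.97, matching the ~70.96% correct rate
```

The same three lines reproduce the percentages for the other two classifiers once their counts are substituted.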


5-5 The comparison of the data mining algorithms applied to medical databases
In this part we analyse and evaluate the results and performances of the algorithms presented in the previous tables. Each algorithm was evaluated individually, which makes it possible to select the most suitable one for the classification of medical data. Three evaluation methods were adopted, but two of them are taken into serious consideration: the testing set and 10-fold cross-validation. It is hard to state in advance which test configuration is best, but at the final stage 10-fold cross-validation was selected among all the configurations. The reason is simply its high performance, claimed to be better than any single split, together with its popularity for classifying such databases. A comparison can be made by relying on the results presented in the previous tables and in Appendix 1 (visualisation statistics). The results were almost all satisfying, apart from those for heart disease, which are not a disaster but are the worst among them. Second come hepatitis and diabetes, with nearly similar proportions. Finally, the best results belong to breast cancer, which interacts perfectly with the algorithms; this excellent result allows us to deduce that this data set is well suited as training data. Regarding the algorithms, the leader is Naïve Bayes, the majority of whose proportions are superior to the other results, followed by the multilayer perceptron and then the decision tree in last place. On the training set the multilayer perceptron and the decision tree beat Naïve Bayes in all cases, but training-set performance is not a good factor for judging algorithms.

On the testing set and under 10-fold cross-validation, however, Naïve Bayes produced excellent results, which leads us to award it first place in this test. The errors reported were quite high; the reason might be the heterogeneity of the medical data and the complexity of the values of each attribute, which can affect classification performance negatively. In our case the multilayer perceptron and especially the decision tree might have been overtrained, given the disappointing results obtained. Surprisingly, the results closest to the multilayer perceptron and Naïve Bayes were yielded by the decision tree when training on the hepatitis data set. This singular result invites a careful look at the hepatitis data set, in which almost all attributes are binary; we conclude that a binary data set is a good source of training data for decision tree algorithms. As for execution time, Naïve Bayes is again by far the winner without exception, followed by the decision tree and then, far behind these two, the multilayer perceptron. The final conclusion drawn from this comparison is that Naïve Bayes is the most suitable algorithm for classifying medical data sets, in both timing and performance.



Conclusion
As mentioned in the introduction, a huge amount of data is gathered daily and stored in medical databases. These databases may contain nontrivial dependencies between symptoms and diagnoses. Processing this data can be realised with medical systems, which make it easier to uncover unclear medical results; with better diagnosis and prediction knowledge, it is much easier for doctors to give an accurate diagnosis, and quickly, for future cases. The objective of the experiment was to identify and evaluate the performance of the most suitable data mining algorithms implemented in modern Medical Decision Support Systems. Three algorithms were chosen (Naïve Bayes, multilayer perceptron, decision tree induction) and four datasets were selected (breast cancer, hepatitis, heart disease, diabetes) to conduct the experiment. It was crucial to use, as far as possible, the same measures and configuration in order to obtain a correct classification, and this was realised successfully. While the average classification accuracy of Naïve Bayes is somewhat higher than that of the decision tree and the multilayer perceptron, it would not be fair to conclude that Naïve Bayes is a better classifier in general. It is suggested, however, that the Naïve Bayes classifier has the potential to significantly improve conventional classification methods for use in the medical and bioinformatics sectors.



References
[1] Huang, H. et al., "Business rule extraction from legacy code", Proceedings of the 20th International Conference on Computer Software and Applications (IEEE COMPSAC'96), 1996, pp. 162-167.
[2] Chae, Y. M., Kim, H. S., Tark, K. C., Park, H. J., Ho, S. H., "Analysis of healthcare quality indicator using data mining and decision support system", Expert Systems with Applications, 2003, pp. 167-172.
[3] Duch, W., Grabczewski, K., Adamczak, R., Grudzinski, K., Hippe, Z. S., "Rules for melanoma skin cancer diagnosis", 2001, http://www.phys.uni.torun.pl/publications/kmk/, retrieved 4.04.2007.
[4] Witten, I. H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Elsevier, 2005.
[5] Kamila, A., "Evaluation of selected data mining algorithms implemented in Medical Decision Support Systems", September 2007.
[6] WEKA, available online at http://www.cs.waikato.ac.nz/~ml/weka



Appendix

Appendix 1: Evaluation of the performance of the data mining algorithms for the medical databases under 10-fold cross-validation

[Bar charts, one per measure, comparing Naïve Bayes, the multilayer perceptron and C4.5 across the four databases: correctly classified instances, incorrectly classified instances, mean absolute error (MAE), root mean squared error (RMSE), and precision. All chart axes run from 0% to 100%.]


Appendix 2: Statistics for predicting the diagnosis of the hepatitis data set (10-fold cross-validation)

Naïve Bayes
  Total number of instances: 155
  Correctly classified instances: 110 (70.96%)
  Incorrectly classified instances: 45 (29.03%)
  Kappa statistic: 0.4026
  Mean absolute error: 0.2986
  Root mean squared error: 0.5061
  Relative absolute error: 60.26%
  Root relative squared error: 101.69%
  Time taken to build model: 0.02 seconds
  Precision: 71.2%
  Number of classes: 2
  Confusion matrix:
                        Survival       Death
    Predicted survival  70 (45.16%)    15 (9.67%)
    Predicted death     30 (19.35%)    40 (25.80%)

Multilayer Perceptron
  Total number of instances: 155
  Correctly classified instances: 88 (56.77%)
  Incorrectly classified instances: 67 (43.22%)
  Kappa statistic: 0.1248
  Mean absolute error: 0.4297
  Root mean squared error: 0.6044
  Relative absolute error: 86.72%
  Root relative squared error: 121.11%
  Time taken to build model: 7.44 seconds
  Precision: 56.8%
  Number of classes: 2
  Confusion matrix:
                        Survival       Death
    Predicted survival  51 (32.9%)     34 (21.93%)
    Predicted death     33 (21.29%)    37 (23.87%)

Decision tree induction (C4.5)
  Total number of instances: 155
  Correctly classified instances: 90 (58.06%)
  Incorrectly classified instances: 65 (41.93%)
  Kappa statistic: 0.1436
  Mean absolute error: 0.4357
  Root mean squared error: 0.551
  Relative absolute error: 87.93%
  Root relative squared error: 110.71%
  Time taken to build model: 0.08 seconds
  Precision: 57.7%
  Number of classes: 2
  Confusion matrix:
                        Survival       Death
    Predicted survival  57 (36.77%)    28 (18.06%)
    Predicted death     37 (23.87%)    33 (21.29%)