APPLICATION OF CART ALGORITHM IN HEPATITIS DISEASE DIAGNOSIS

G. Sathyadevi
Department of Computer Science and Engineering
Anna University of Technology, Tiruchirapalli, Tamil Nadu, India
sathya_ciet42@yahoo.co.in

Abstract— The healthcare industry collects a huge amount of data that is not properly mined and not put to optimum use; the hidden patterns and relationships it contains often go unexploited. Our research focuses on this aspect of medical diagnosis by learning patterns from the collected hepatitis data and by developing intelligent medical decision support systems to help physicians. In this paper, we propose the use of the decision tree algorithms C4.5, ID3 and CART to classify the disease and compare their effectiveness and correct-classification rate. The CART-derived model, along with the extended definition for identifying (diagnosing) hepatitis disease, provided a model with good classification accuracy.

Keywords— Active learning, domain expert, data mining, dynamic query, ID3 algorithm, CART algorithm, C4.5 algorithm

I. INTRODUCTION

The proposed approach attempts to maximize the utility of domain experts (oracles) in the active learning process. This study uses the CART algorithm to examine hepatitis disease diagnosis. From the given training datasets, only relevant attributes are selected using the decision tree algorithm CART, which also handles missing values in the given datasets easily.

Identification and selection of the relevant attributes that contribute to hepatitis disease is a challenging task. The analysis has been carried out on the UCI hepatitis patient dataset [1] using the CART decision tree algorithm. CART has always offered sophisticated, high-performance missing-value handling, and it introduces a set of missing-value analysis tools for automatic exploration of the optimal handling of incomplete data. Our empirical studies with the UCI hepatitis patient dataset [1] show that the proposed active learning algorithm is more effective than other state-of-the-art active learning algorithms.

CART Algorithm: CART [2] is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by the UC Berkeley and Stanford statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [3]. The CART methodology solves a number of performance, accuracy, and operational problems that still plague many current decision-tree methods. CART innovations include:
i. using strictly two-way (binary) splitting;
ii. incorporating automatic testing and tree validation; and
iii. providing a completely new method for handling missing values.

1. The visual display enables users to see the hierarchical interaction of the variables.
2. Because simple if-then rules can be read right off the tree, models are easy to grasp and easy to apply to new data.
3. CART uses strictly binary, or two-way, splits that divide each parent node into exactly two child nodes by posing questions with yes/no answers at each decision node.
4. CART is unique among decision-tree tools. The proven CART methodology is characterized by:
a. Reliable pruning strategy - the CART developers determined definitively that no stopping rule could be relied on to discover the optimal tree; the testing and selection of the optimal tree are therefore an integral part of the CART algorithm.
b. Powerful binary-split search approach - CART binary decision trees are more sparing with data and detect more structure before too little data is left for learning.
c. Automatic self-validation procedures - in the search for patterns in databases it is essential to avoid the trap of overfitting.
d. Automated surrogate splitters that intelligently handle missing values.
e. Multiple-tree, committee-of-expert methods that increase the precision of results.

The next section describes the datasets used and gives an overview of this research. Section III outlines the results, explaining the decision tree algorithms and the classification rules extracted using CART. Section IV presents conclusions.

II.

IEEE-ICRTIT 2011

The hepatitis patient dataset is obtained from the UC-Irvine archive [1] of machine learning datasets. The aim is to distinguish between the presence and absence of hepatitis disease and to identify the lifetime of a patient. The input dataset is in WEKA ARFF or .csv file format. The hepatitis disease dataset has 20 attributes, 14 of which are linear valued and relevant. There are 281 instances and 2 classes. The hepatitis patient dataset is run against the CART decision tree algorithm. There are some missing values in the dataset; an instance with a missing value is probabilistically assigned a possible value according to the distribution of values for that attribute in the training data, using the CART algorithm. Figure 1 shows the original hepatitis patient dataset.

There are a number of high-quality commercial and open source tools for data mining. In this research, Weka [4] (Ian Witten and Eibe Frank, 2005) has been used as the core tool. It provides the ability to load, pre-process and visualize data, and it performs standard data mining algorithms with sufficient parameterization. These algorithms can either be applied directly to a dataset or called from custom Java code.

1) Decision tree models: Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf (Dunham, 2003). Some of the key advantages of decision trees are their ease of use and overall efficiency, and the rules derived from them are easy to interpret.
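As a minimal sketch of this structure, a tree can be held as nested nodes and evaluated by walking from the root to a leaf; the attributes, thresholds and leaf labels below are illustrative assumptions, not the paper's fitted model:

```python
# A toy decision tree: interior nodes test one input variable,
# leaves carry a value of the target variable.
TREE = {
    "attr": "protime", "threshold": 46.5,  # interior node: protime <= 46.5?
    "left": {"leaf": "DIE"},               # branch taken when the test is true
    "right": {                             # branch taken when the test is false
        "attr": "albumin", "threshold": 2.8,
        "left": {"leaf": "DIE"},
        "right": {"leaf": "LIVE"},
    },
}

def predict(node, record):
    """Follow the path from the root to a leaf for one record."""
    while "leaf" not in node:
        branch = "left" if record[node["attr"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(TREE, {"protime": 30.0, "albumin": 4.0}))  # DIE
print(predict(TREE, {"protime": 60.0, "albumin": 4.0}))  # LIVE
```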

2) CART algorithm: In this study, the use of Classification and Regression Trees (CART) has been attempted (Breiman et al., 1984). Classification tree analysis applies when the predicted outcome is the class to which the data belongs; regression tree analysis applies when the predicted outcome can be considered a real number. CART has been applied in a number of applications in the medical domain. One of the advantages of classification trees is their ability to provide easy-to-understand classification rules: each node of a classification tree is a rule. The only exception is when the tree is very large, in which case more specific attention to pruning may be required to optimize the tree size. Trees are easy, off-the-shelf classifiers that require no variable transformation. CART builds the tree by recursively splitting the variable space, using the impurity of the variables to determine each split, until the termination condition is met. The Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. The following is a pseudo procedure [15]:

1. Start with the root node (t = 1).
2. Search for a split s* among the set of all possible candidate splits s that gives the largest decrease in impurity.
3. Split node (t = 1) into two nodes (t = 2, t = 3) using the split s*.
4. Repeat the split search process on (t = 2, t = 3) as indicated in steps 1-3 until the tree-growing rules are met.
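The impurity measure and the split search in steps 1-4 can be sketched as follows; the code considers a single predictor rather than the full variable space, and the toy values are invented for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: probability that a randomly drawn element would be
    mislabeled if labeled according to the label distribution of the set."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Search all candidate thresholds on one predictor for the split s*
    giving the largest decrease in impurity (steps 1-2 above)."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)  # (threshold, impurity decrease)
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

# Toy PROTIME-like values with LIVE/DIE outcomes (illustrative, not the paper's data)
protime = [20, 30, 40, 50, 60, 70]
outcome = ["DIE", "DIE", "DIE", "LIVE", "LIVE", "LIVE"]
print(gini(outcome))                        # 0.5 for a balanced two-class set
print(best_binary_split(protime, outcome))  # (40, 0.5): the split separates the classes
```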

C. Overview of this Work

The overall working process is as follows:
1. Load the hepatitis patient dataset, which is in .csv file format.
2. Set threshold values (or some conditions) for all the given attributes in order to identify the relevant attributes.
3. Examine the threshold values and eliminate the weak, irrelevant attributes. Relevant attributes are identified and indicated as shown in Table I.
4. Apply the CART decision tree algorithm to the relevant datasets.
5. Construct the decision tree using Classification and Regression Trees (CART).
6. Extract classification rules from the CART decision tree induction.
7. Construct dynamic queries and submit these to the oracle.

III. RESULTS

A. Experimental Data

The hepatitis disease dataset of 473 patients is used in this experiment [1]. Relevant attributes are identified and selected. The dataset contains 19 attributes and a class variable with two possible values, as shown in Table I. The attributes Age, Bilirubin, Alk Phosphate, SGOT, Albumin and Protime contain continuous values; the other attributes, such as Steroid, Antivirals, Fatigue, Malaise, Anorexia, Liver Big, Liver Firm, Spleen Palpable, Spiders, Ascites, and Varices, are binary valued. So, before this dataset is used in the experiment, the continuous-valued attributes are divided into ranges.
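The division of a continuous attribute into ranges can be sketched with the standard library's bisect module; the cut points and range names below are illustrative assumptions, not the ones used in the paper:

```python
from bisect import bisect_right

def to_range(value, cut_points, labels):
    """Map a continuous value to a discrete range label.
    len(labels) must be len(cut_points) + 1."""
    return labels[bisect_right(cut_points, value)]

# Hypothetical binning for a BILIRUBIN-like attribute
cuts = [0.99, 1.99, 3.99]
names = ["<=0.99", "1.00-1.99", "2.00-3.99", ">=4.00"]
print(to_range(0.7, cuts, names))   # <=0.99
print(to_range(2.5, cuts, names))   # 2.00-3.99
```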

B. Classification Rules

Significant rules, useful for understanding the data pattern and behaviour of the experimental dataset, are extracted by applying the CART decision tree algorithm. The rules are extracted as follows:
(i) if PROTIME <= 46.50000: Improvement = 0.232570; Complexity Threshold = 0.340987
(ii) if ANOREXIA$ = ("No"): Improvement = 0.056518; Complexity Threshold = 0.037313
(iii) if SEX$ = ("Female"): Improvement = 0.038456; Complexity Threshold = 0.022388
(iv) if FATIGUE$ = ("Yes"): Improvement = 0.042526; Complexity Threshold = 0.022388
(v) if ALKPHOS <= 229.00000: Improvement = 0.059490; Complexity Threshold = 0.038462

Once a rule is selected and splits a node into two, the same logic is applied to each child node (i.e., it is a recursive procedure). Splitting stops when CART [14] detects that no further gain can be made, or when some pre-set stopping rules are met. Each branch of the tree ends in a terminal node; each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.

The basic idea of tree growing is to choose, among all the possible splits at each node, the split whose resulting child nodes are the purest. In this algorithm, only univariate splits are considered; that is, each split depends on the value of only one predictor variable, and all possible splits consist of the possible splits of each predictor. Fig. 2 shows the decision tree generated by CART analysis, with splits on PROTIME, ANOREXIA$, ALKPHOS, SEX$ and FATIGUE$.
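To make the extracted splits concrete, the top-level rules can be read as a cascade of yes/no questions. The sketch below uses the thresholds from rules (i), (ii) and (v); how the rules nest and which class each leaf receives are illustrative assumptions, not the paper's actual tree:

```python
def classify(patient):
    """Walk hypothetical yes/no splits built from the extracted rules.
    The LIVE/DIE leaf labels are assumed for illustration only."""
    if patient["PROTIME"] <= 46.5:      # rule (i), the strongest split
        return "DIE"
    if patient["ANOREXIA"] == "No":     # rule (ii)
        return "LIVE"
    if patient["ALKPHOS"] <= 229.0:     # rule (v)
        return "LIVE"
    return "DIE"

print(classify({"PROTIME": 60.0, "ANOREXIA": "No", "ALKPHOS": 100.0}))  # LIVE
```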

TABLE I
DESCRIPTION OF THE FEATURES IN THE HEPATITIS PATIENT DATASET

1. Class: DIE, LIVE
2. Age
3. Sex: male, female
4. Steroid: no, yes
5. Antivirals: no, yes
6. Fatigue: no, yes
7. Malaise: no, yes
8. Anorexia: no, yes
9. Liver Big: no, yes
10. Liver Firm: no, yes
11. Spleen Palpable: no, yes
12. Spiders: no, yes
13. Ascites: no, yes
14. Varices: no, yes
15. Bilirubin
16. Alk Phosphate
17. SGOT
18. Albumin
19. Protime
20. Histology: no, yes

The performance of the model generated by the CART decision tree is shown by the gain chart and the ROC curve. Figures 3 and 4 are the gains for the target classes Live and Die; figures 5 and 6 are the ROC curves for the classes Live and Die respectively.

Trees are formed by a collection of rules based on values of certain variables in the modeling data set. Rules are selected based on how well splits based on variable values can differentiate observations.


We have implemented the ID3, C4.5 [10] and CART [15] algorithms and tested them on our experimental dataset. The accuracy of these algorithms can be examined through the confusion matrices they produce. A confusion matrix contains information about actual and predicted classifications done by a classification system, and the performance of such systems is commonly evaluated using the data in the matrix. Tables II, III and IV show the per-class results for the three classifiers.

TABLE II
CONFUSION MATRIX OF ID3 ALGORITHM

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.686    0.281    0.66       0.686   0.673      0.68      No
0.719    0.314    0.742      0.719   0.73       0.719     Yes

TABLE III
CONFUSION MATRIX OF C4.5 ALGORITHM

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.97     0.615    0.89       0.97    0.929      0.669     Live
0.385    0.03     0.714      0.385   0.5        0.669     Die

TABLE IV
CONFUSION MATRIX OF CART ALGORITHM

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.91     0.769    0.859      0.91    0.884      0.541     Live
0.231    0.09     0.333      0.231   0.273      0.541     Die

Here, the entries in the confusion matrix have the following meaning in the context of our study:
i. a is the number of correct predictions that an instance is negative;
ii. b is the number of incorrect predictions that an instance is positive;
iii. c is the number of incorrect predictions that an instance is negative; and
iv. d is the number of correct predictions that an instance is positive.

Where:
1) The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified: TP = d / (c + d).
2) The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive: FP = b / (a + b).
3) Precision (P) is the proportion of predicted positive cases that were correct: P = d / (b + d).
4) The ROC curve plots the false positive rate on the X axis and the true positive rate on the Y axis.

D. Query Construction and Submission

Dynamic queries are constructed and put to the domain experts. Only relevant attributes are examined and used for constructing the queries, since the missing values are handled easily by CART. More accurate results are generated using CART than with other algorithms such as ID3 and C4.5 [10].

E. Comparison of ID3, C4.5 and CART Algorithms

Comparing these algorithms, we observed that among the attribute selection measures C4.5 [8] performs better than the ID3 algorithm, but CART performs better in respect of both accuracy and time complexity.
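The rates defined from the confusion-matrix entries a, b, c and d can be computed directly; the counts below are made up for illustration:

```python
def rates(a, b, c, d):
    """a: true negatives, b: false positives, c: false negatives, d: true positives."""
    return {
        "tp_rate": d / (c + d),              # recall / true positive rate
        "fp_rate": b / (a + b),              # false positive rate
        "precision": d / (b + d),
        "accuracy": (a + d) / (a + b + c + d),
    }

m = rates(a=50, b=10, c=5, d=35)   # hypothetical counts
print(m["tp_rate"])    # 0.875
print(m["accuracy"])   # 0.85
```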

TABLE V
PREDICTION ACCURACY TABLE

SNO  NAME OF ALGORITHM  ACCURACY %
1    CART Algorithm     83.2
2    ID3 Algorithm      64.8
3    C4.5 Algorithm     71.4

The classification accuracy obtained with the CART algorithm is greater than that reported in previous research using ID3 and C4.5 [17].

IV. CONCLUSION

In this work, the ID3, C4.5 and CART decision tree algorithms were run against the biomedical hepatitis patient datasets and the results compared with other data mining techniques. Among these algorithms, the CART [14] algorithm always generates a binary decision tree: every node of the tree has exactly two children or none, whereas the decision trees generated by the other two algorithms may have two or more children per node. Also, in respect of both accuracy and time complexity, the CART algorithm performs better than the other two algorithms. Dynamic queries and their certain answers from the oracle are examined using the learning model.

REFERENCES

[1] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth Int. Group, 1984.
[14] J.R. Quinlan, "Induction of Decision Trees," Machine Learning.
"...model and its applications to health care dataset," in Proceedings of the 2005 International Conference on Services Systems and Services Management, 2005, pp. 1099-1103.
"...Data Mining for Business Applications," pp. 3-10, Springer, 2009.
K.D. Dwyer, "Decision tree instability and active learning," Master's thesis, University of Alberta, 2007.
"...Tree-Based Symbolic Rule Induction System for Text Categorization," IBM Systems Journal, vol. 41, no. 3, 2002.
A. Juozapavicius and V. Rapsevicius, "Clustering through Decision Tree Construction in Geology," Nonlinear Analysis: Modelling and Control, vol. 6, no. 2, pp. 29-41, 2001.
A.S. Al-Hegami, "Classical and Incremental Classification in Data Mining Process," IJCSNS International Journal of Computer Science and Network Security, vol. 7, no. 12, December 2007.
Kusrini and S. Hartati, "Implementation of C4.5 algorithm to evaluate the cancellation possibility of new student applicants at STMIK AMIKOM Yogyakarta," Proc. International Conference on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, June 17-19, 2007.
C.X. Ling and J. Du, "Active Learning with Direct Query Construction," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 480-487, 2008.
J. Du and C.X. Ling, "Active Learning with Generalized Queries," Proc. Ninth IEEE Int'l Conf. Data Mining, pp. 120-128, 2009.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, San Francisco, CA, 2007.
M.H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, Delhi, India, 2004.
S.R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Trans. on Systems, Man and Cybernetics, 21(3):660-674, 1991.
P. Magni and R. Bellazzi, "A stochastic model to assess the variability of blood glucose time series in diabetic patients self-monitoring," IEEE Trans. Biomed. Eng., vol. 53, no. 6, pp. 977-985, Jun. 2006.
K. Polat and S. Gunes, "An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease," Digital Signal Processing, vol. 17, no. 4, pp. 702-710, Jul. 2007.
J. Friedman, "Fitting functions to noisy data in high dimensions," in Proc. 20th Symp. Interface, Amer. Statistical Assoc., E.J. Wegman, D.T. Gantz, and I.J. Miller, Eds., 1988, pp. 13-43.
T.W. Simpson, C. Clark and J. Grelbsh, "Analysis of support vector regression for approximation of complex engineering analyses," presented at ASME 2003.
L.B. Goncalves, M.M.B.R. Vellasco, M.A.C. Pacheco, and F.J. de Souza, "Inverted hierarchical neuro-fuzzy BSP system: A novel neuro-fuzzy model for pattern classification and rule extraction in databases," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 36, no. 2, pp. 236-248, Mar. 2006.
