
IEEE-International Conference on Recent Trends in Information Technology, ICRTIT 2011

MIT, Anna University, Chennai. June 3-5, 2011

APPLICATION OF CART ALGORITHM IN HEPATITIS DISEASE DIAGNOSIS
G.Sathyadevi
Department of Computer Science and Engineering
Anna University of Technology, Tiruchirapalli,
Tamil Nadu, India
sathya_ciet42@yahoo.co.in
Abstract- The healthcare industry collects a huge amount of data which is not properly mined and not put to optimum use; the hidden patterns and relationships in these data often go unexploited. Our research focuses on this aspect of medical diagnosis: learning patterns from collected hepatitis data in order to develop intelligent medical decision support systems that help physicians. In this paper, we propose the use of the decision tree algorithms C4.5, ID3, and CART to classify the disease, and we compare their effectiveness and correction rates. The CART-derived model, along with the extended definition for identifying (diagnosing) hepatitis disease, provided a model with good classification accuracy.
Keywords- Active learning, domain expert, data mining, dynamic query, ID3 algorithm, CART algorithm, C4.5 algorithm

I. INTRODUCTION

Motivated by domain-driven data mining, this paper attempts to maximize the utility of domain experts (oracles) in the active learning process. This study uses the CART algorithm to examine hepatitis disease diagnosis. From the given training datasets, only the relevant attributes are selected using the CART decision tree algorithm. Missing values in the given datasets are easily handled by CART.
Identification and selection of the relevant attributes that contribute to hepatitis disease is a challenging task. The analysis was carried out on the UCI hepatitis patient dataset [1] using the CART decision tree algorithm. CART has always offered sophisticated, high-performance missing-value handling, and it introduces a set of missing-value analysis tools for automatically exploring the optimal handling of incomplete data. Our empirical studies with the UCI hepatitis patient dataset [1] show that the proposed active learning algorithm is more effective than other state-of-the-art active learning algorithms.
CART Algorithm: CART [2] is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by the UC Berkeley and Stanford statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [3]. The CART methodology solves a number of performance, accuracy, and operational problems that still plague many current decision-tree methods. CART innovations include:

978-1-4577-0590-8/11/$26.00 2011 IEEE

i. solving the "how big to grow the tree" problem;
ii. using strictly two-way (binary) splitting;
iii. incorporating automatic testing and tree validation; and
iv. providing a completely new method for handling missing values.

Features of the CART Algorithm
1. The visual display enables users to see the hierarchical interaction of the variables.
2. Because simple if-then rules can be read right off the tree, models are easy to grasp and easy to apply to new data.
3. CART uses strictly binary (two-way) splits that divide each parent node into exactly two child nodes by posing a yes/no question at each decision node.
4. CART is unique among decision-tree tools. Its proven methodology is characterized by:
a. A reliable pruning strategy - the CART developers determined definitively that no stopping rule could be relied on to discover the optimal tree.
b. A powerful binary-split search approach - CART binary decision trees are more sparing with data and detect more structure before too little data is left for learning.
c. Automatic self-validation procedures - in the search for patterns in databases it is essential to avoid the trap of overfitting; the testing and selection of the optimal tree are an integral part of the CART algorithm.
d. Automated surrogate splitters that intelligently handle missing values.
e. Multiple-tree, committee-of-experts methods that increase the precision of results.
The next section describes the datasets used and gives an overview of this research. Section III outlines the results, explaining the decision tree algorithms and the classification rules extracted using CART. Section IV presents the conclusions.
II. MATERIALS AND METHODS

A. About the Datasets


The hepatitis patient dataset is obtained from the UC Irvine archive [1] of machine learning datasets. The aim is to distinguish between the presence and absence of hepatitis disease and to identify the lifetime of a patient. The input dataset is in WEKA ARFF or .csv file format. The hepatitis disease dataset has 20 attributes, 14 of which are linear valued and relevant. There are 281 instances and 2 classes. The hepatitis patient dataset is run against the CART decision tree algorithm. There are some missing values in the dataset; using the CART algorithm, an instance with a missing value is probabilistically assigned a possible value according to the distribution of values for that attribute in the training data. Figure 1 shows the original hepatitis patient dataset.

Fig1. The original hepatitis patient datasets
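The probabilistic missing-value assignment described above can be sketched in plain Python (a stand-in for the Weka/CART implementation actually used; the column values and the "?" missing marker are illustrative assumptions):

```python
import random

def impute_probabilistically(column, missing="?", seed=42):
    """Replace each missing entry with a value drawn from the empirical
    distribution of the observed entries in the same column."""
    rng = random.Random(seed)
    observed = [v for v in column if v != missing]
    if not observed:
        return list(column)  # no observed values to learn a distribution from
    # rng.choices samples with probability proportional to observed frequency
    return [v if v != missing else rng.choices(observed)[0] for v in column]

# e.g. a PROTIME-like column with two unknown entries
col = ["46", "?", "61", "46", "?", "80"]
filled = impute_probabilistically(col)
```

Because the replacement is drawn from the attribute's own value distribution, frequent values are proportionally more likely to be imputed than rare ones.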

B. Data Mining Tools:
There are a number of high-quality commercial and open-source tools for data mining. In this research, Weka [4] (Ian Witten and Eibe Frank, 2005) has been used directly as the core tool: it can load, pre-process, and visualize data, and it runs standard data mining algorithms with sufficient parameterization. These algorithms can either be applied directly to a dataset or called from custom Java code.
1) Decision trees models: Decision tree learning is a common
method used in data mining. The goal is to create a model that
predicts the value of a target variable based on several input
variables. Each interior node corresponds to one of the input
variables; there are edges to children for each of the possible
values of that input variable. Each leaf represents a value of the
target variable given the values of the input variables
represented by the path from the root to the leaf (Dunham,
2003). Some of the key advantages of using decision trees are
the ease of use and overall efficiency. Rules can be derived that
are easy to interpret.
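As a minimal illustration of this structure (the attribute names and thresholds below are invented for the example, not taken from the hepatitis model), a tree can be stored as nested nodes and a record classified by following a single root-to-leaf path:

```python
# Each interior node tests one input variable; each leaf holds a target value.
tree = {"attr": "protime", "threshold": 46.5,
        "left":  {"leaf": "DIE"},                     # protime <= 46.5
        "right": {"attr": "albumin", "threshold": 3.0,
                  "left": {"leaf": "DIE"},            # albumin <= 3.0
                  "right": {"leaf": "LIVE"}}}

def classify(node, record):
    """Walk from the root to a leaf, answering one yes/no question per node."""
    while "leaf" not in node:
        node = node["left"] if record[node["attr"]] <= node["threshold"] else node["right"]
    return node["leaf"]

print(classify(tree, {"protime": 60, "albumin": 4.2}))  # LIVE
```

The rule for any leaf can be read off as the conjunction of the tests on the path that reaches it, which is why such models are easy to interpret.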
2) CART algorithm: In this study, the use of the Classification and Regression Trees (CART) classification algorithm has been attempted (Breiman et al., 1984). Classification tree analysis is used when the predicted outcome is the class to which the data belongs; regression tree analysis is used when the predicted outcome can be considered a real number. CART has been applied in a number of applications in the medical domain. One of the advantages of classification trees is their ability to provide easy-to-understand classification rules: each node of a classification tree is a rule. The only exception would be cases where the tree is very large, in which a more specific focus on pruning may be required to optimize the tree size. Trees are easy, off-the-shelf classifiers that require no variable transformation. CART builds the tree by recursively splitting the variable space, using the impurity of the variables to determine each split, until the termination condition is met. The Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. The following is a pseudo procedure [15]:
1. Start with the root node (t = 1).
2. Search for a split s* among the set of all possible candidate splits s that gives the largest decrease in impurity.
3. Split node t = 1 into two nodes (t = 2, t = 3) using the split s*.
4. Repeat the split search process on nodes t = 2 and t = 3 as in steps 1-3 until the tree-growing rules are met.
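Steps 1-2 can be sketched as follows: the Gini impurity of a node's labels, and a search over the thresholds of one predictor for the split s* with the largest impurity decrease (a pure-Python illustration, not the actual CART implementation):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the chance a random element is mislabeled when labels
    are drawn according to the subset's own label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Search all univariate thresholds on one predictor for the split s*
    that maximizes the decrease in impurity."""
    n, parent = len(ys), gini(ys)
    best = (None, 0.0)  # (threshold, impurity decrease)
    for t in sorted(set(xs))[:-1]:         # candidate thresholds between values
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        decrease = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if decrease > best[1]:
            best = (t, decrease)
    return best
```

On a perfectly separable column such as `[1, 2, 3, 4]` with labels `["DIE", "DIE", "LIVE", "LIVE"]`, the search finds the threshold 2 with the maximum possible decrease of 0.5.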
C. Overview of this Work:
The overall working process is as follows:
1. Load the hepatitis patient dataset, which is in .csv file format.
2. Set threshold values (or other conditions) for all the given attributes in order to identify the relevant attributes.
3. Examine the threshold values and eliminate the weak/irrelevant attributes. The relevant attributes identified are indicated in Table I.
4. Apply the CART decision tree algorithm to the relevant datasets.
5. Construct the decision tree using Classification and Regression Trees (CART).
6. Extract classification rules from the CART decision tree induction.
7. Construct dynamic queries and submit these to the oracle.
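Steps 1-3 can be sketched with the standard `csv` module. The filtering condition here (dropping attributes whose missing-value rate exceeds a threshold) is one illustrative choice of relevance criterion, not the paper's exact rule, and the tiny embedded dataset is a stand-in for the real .csv file:

```python
import csv, io

def relevant_attributes(rows, header, max_missing=0.3, missing="?"):
    """Keep attributes whose missing-value rate is at most the chosen
    threshold (illustrative; real relevance criteria are domain-specific)."""
    keep = []
    for i, name in enumerate(header):
        missing_rate = sum(1 for r in rows if r[i] == missing) / len(rows)
        if missing_rate <= max_missing:
            keep.append(name)
    return keep

# a tiny stand-in for the hepatitis .csv file
data = "AGE,PROTIME,HISTOLOGY\n30,?,no\n50,46,yes\n40,?,no\n"
reader = csv.reader(io.StringIO(data))
header, rows = next(reader), list(reader)
print(relevant_attributes(rows, header))  # ['AGE', 'HISTOLOGY']
```

The surviving attribute names would then be used to project the dataset before running CART in step 4.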
III. ANALYSIS AND RESULTS

A. Experimental Data
The hepatitis disease dataset of 473 patients is used in this experiment [1]. Relevant attributes are identified and selected. This dataset contains 19 attributes and a class variable with two possible values, as shown in Table I. The attributes Age, Bilirubin, Alk Phosphate, Sgot, Albumin, and Protime contain continuous values; the other attributes, such as Steroid, Antivirals, Fatigue, Malaise, Anorexia, Liver Big, Liver Firm, Spleen Palpable, Spiders, Ascites, and Varices, are binary valued. So, before this dataset is used in the experiment, the continuous-valued attributes are divided into ranges.
B. Classification Rules:
Significant rules, useful for understanding the data pattern and behaviour of the experimental dataset, are extracted by applying the CART decision tree algorithm. The rules are extracted as follows:
(i) if PROTIME <= 46.50000
Improvement = 0.232570; Complexity Threshold = 0.340987
(ii) if ANOREXIA$ = ("No")
Improvement = 0.056518; Complexity Threshold = 0.037313
(iii) if SEX$ = ("Female")
Improvement = 0.038456; Complexity Threshold = 0.022388
(iv) if FATIGUE$ = ("Yes")
Improvement = 0.042526; Complexity Threshold = 0.022388
(v) if ALKPHOS <= 229.00000
Improvement = 0.059490; Complexity Threshold = 0.038462
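Read off the tree, each extracted split is a yes/no test on one attribute, so the five conditions above translate directly into code. The sketch below encodes them as predicates over a record; the dictionary record format and field names are assumptions for illustration:

```python
# Each extracted split becomes a yes/no test on one attribute.
rules = [
    ("PROTIME <= 46.5",  lambda r: r["PROTIME"] <= 46.5),
    ("ANOREXIA = No",    lambda r: r["ANOREXIA"] == "No"),
    ("SEX = Female",     lambda r: r["SEX"] == "Female"),
    ("FATIGUE = Yes",    lambda r: r["FATIGUE"] == "Yes"),
    ("ALKPHOS <= 229.0", lambda r: r["ALKPHOS"] <= 229.0),
]

def fired_rules(record):
    """Return the names of the split conditions this record satisfies."""
    return [name for name, test in rules if test(record)]

patient = {"PROTIME": 40, "ANOREXIA": "No", "SEX": "Male",
           "FATIGUE": "Yes", "ALKPHOS": 120}
print(fired_rules(patient))  # the SEX = Female test is the only one not satisfied
```

In the actual CART model these conditions appear at tree nodes, so a record triggers only the conditions on its root-to-leaf path rather than all of them.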

Splitting rules are selected according to how well they differentiate observations based on the dependent variable. Once a rule is selected and splits a node into two, the same logic is applied to each child node (i.e. it is a recursive procedure). Splitting stops when CART [14] detects that no further gain can be made, or when some pre-set stopping rules are met. Each branch of the tree ends in a terminal node; each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.
The basic idea of tree growing is to choose, among all the possible splits at each node, the split whose resulting child nodes are the purest. In this algorithm, only univariate splits are considered; that is, each split depends on the value of only one predictor variable, and the set of all possible splits consists of the possible splits of each predictor. Fig. 2 shows the decision tree generated by the CART analysis.
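The recursive procedure can be sketched end to end: split on the single best univariate threshold, recurse on each child, and stop when a node is pure or too small. This is a simplified stand-in for CART's actual growing and pruning rules:

```python
from collections import Counter

def grow(rows, labels, min_size=2):
    """Recursively split on the best univariate threshold until a node is
    pure or too small (a simplified stand-in for CART's stopping rules)."""
    if len(set(labels)) == 1 or len(labels) < min_size:
        return Counter(labels).most_common(1)[0][0]          # terminal node
    gini = lambda ys: 1 - sum((c / len(ys)) ** 2 for c in Counter(ys).values())
    best = None
    for j in range(len(rows[0])):                            # each predictor
        for t in sorted({r[j] for r in rows})[:-1]:          # each threshold
            l = [y for r, y in zip(rows, labels) if r[j] <= t]
            g = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(l) * gini(l) + len(g) * gini(g)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:                                         # no usable split
        return Counter(labels).most_common(1)[0][0]
    _, j, t = best
    left  = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {"attr": j, "t": t,
            "left":  grow([r for r, _ in left],  [y for _, y in left],  min_size),
            "right": grow([r for r, _ in right], [y for _, y in right], min_size)}
```

Because every recursive call produces exactly two children (or a terminal node), the result is always a strictly binary tree, which is the defining property of CART noted above.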
TABLE I
DESCRIPTION OF THE FEATURES IN THE HEPATITIS PATIENT DATASET
1. Class: DIE, LIVE
2. Age: 10, 20, 30, 40, 50, 60, 70, 80
3. Sex: male, female
4. Steroid: no, yes
5. Antivirals: no, yes
6. Fatigue: no, yes
7. Malaise: no, yes
8. Anorexia: no, yes
9. Liver Big: no, yes
10. Liver Firm: no, yes
11. Spleen Palpable: no, yes
12. Spiders: no, yes
13. Ascites: no, yes
14. Varices: no, yes
15. Bilirubin: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
16. Alk Phosphate: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500
18. Albumin: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. Protime: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. Histology: no, yes

Fig 2. Decision tree generated by CART (split nodes: PROTIME, ANOREXIA$, ALKPHOS, SEX$, FATIGUE$)

The estimation method used here is the Gini index. The report generated by the CART decision tree is shown by the gain chart and the ROC curve. Figures 3 and 4 show the gains for the target classes Live and Die; figures 5 and 6 show the ROC curves for the classes Live and Die, respectively.

Fig 3. Gains for the class: Live

C. Decision Tree by CART Analysis
Trees are formed by a collection of rules based on the values of certain variables in the modeling dataset. Rules are selected based on how well splits on those variable values can differentiate observations based on the dependent variable.
We have implemented the ID3, C4.5 [10], and CART [15] algorithms and tested them on our experimental dataset. The accuracy of these algorithms can be examined through the confusion matrices they produce. A confusion matrix contains information about the actual and predicted classifications made by a classification system, and the performance of such systems is commonly evaluated using the data in this matrix. Tables II, III, and IV show the confusion-matrix measures for the three classifiers.
Fig 4. Gains for the class: Die

TABLE II
CONFUSION MATRIX OF ID3 ALGORITHM
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.686     0.281     0.66        0.686    0.673       0.68       No
0.719     0.314     0.742       0.719    0.73        0.719      Yes

TABLE III
CONFUSION MATRIX OF C4.5 ALGORITHM
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.97      0.615     0.89        0.97     0.929       0.669      Live
0.385     0.03      0.714       0.385    0.5         0.669      Die

Fig 5. ROC curve for the class: Live

TABLE IV
CONFUSION MATRIX OF CART ALGORITHM
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.91      0.769     0.859       0.91     0.884       0.541      Live
0.231     0.09      0.333       0.231    0.273       0.541      Die

Where,
1) The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, calculated as TP rate = d / (c + d).
2) The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, calculated as FP rate = b / (a + b).

Fig 6. ROC curve for the class: Die


D. Query Construction and Submission
Dynamic queries are constructed and put to the domain experts. Only the relevant attributes are examined and used for constructing the queries. Since missing values are handled easily by CART, more accurate results are generated with CART than with other algorithms such as ID3 and C4.5 [10].
E. Comparison of ID3, C4.5 and CART Algorithms

3) Finally, precision (P) is the proportion of the predicted positive cases that were correct, calculated as P = d / (b + d).
4) A ROC graph is a plot with the false positive rate on the X axis and the true positive rate on the Y axis.
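Using the confusion-matrix entries a (true negatives), b (false positives), c (false negatives), and d (true positives), these measures can be computed with a small sketch (the sample counts below are made up for illustration):

```python
def rates(a, b, c, d):
    """a = true negatives, b = false positives,
       c = false negatives, d = true positives."""
    tp_rate   = d / (c + d)              # recall / true positive rate
    fp_rate   = b / (a + b)              # false positive rate
    precision = d / (b + d)              # precision
    accuracy  = (a + d) / (a + b + c + d)
    return tp_rate, fp_rate, precision, accuracy

tp, fp, p, acc = rates(a=50, b=10, c=9, d=91)
```

A point at (fp, tp) for each classifier is what gets plotted on the ROC graph described above.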
Here, the entries in the confusion matrix have the following
meaning in the context of our study:


a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.

On examining the confusion matrices of these three algorithms, we observed that among the attribute selection measures, C4.5 [8] performs better than the ID3 algorithm, but CART performs better than both with respect to accuracy and time complexity.


TABLE V
PREDICTION ACCURACY TABLE
SNO   NAME OF ALGORITHM   ACCURACY %
1     CART Algorithm      83.2
2     ID3 Algorithm       64.8
3     C4.5 Algorithm      71.4


In this research we found 83.184% accuracy with the CART algorithm, which is greater than the accuracies reported in previous research with ID3 and C4.5 [17].
IV. CONCLUSION

In this paper, we applied the CART decision tree algorithm to the biomedical hepatitis patient dataset and compared the results with other data mining techniques. Among these algorithms, the CART [14] algorithm always generates a binary decision tree: every node of the tree has either exactly two children or none, whereas the decision trees generated by the other two algorithms may have nodes with two or more children. Moreover, with respect to accuracy and time complexity, the CART algorithm performs better than the other two algorithms. Dynamic queries and the answers to them from the domain-expert oracle are examined using the learning model.

REFERENCES

[1] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth Int. Group, 1984.
[3] S. R. Safavin and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Trans. Systems, Man and Cybernetics, vol. 21, no. 3, pp. 660-674, 1991.
[4] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005.
[5] D. E. Johnson, F. J. Oles, T. Zhang, and T. Goetz, "A decision-tree-based symbolic rule induction system for text categorization," IBM Systems Journal, vol. 41, no. 3, 2002.
[6] A. Juozapavicius and V. Rapsevicius, "Clustering through decision tree construction in geology," Nonlinear Analysis: Modelling and Control, vol. 6, no. 2, pp. 29-41, 2001.
[7] A. S. Al-Hegami, "Classical and incremental classification in data mining process," IJCSNS International Journal of Computer Science and Network Security, vol. 7, no. 12, Dec. 2007.
[8] Kusrini and S. Hartati, "Implementation of C4.5 algorithm to evaluate the cancellation possibility of new student applicants at STMIK AMIKOM Yogyakarta," in Proc. Int. Conf. on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, Jun. 17-19, 2007.
[9] C. X. Ling and J. Du, "Active learning with direct query construction," in Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 480-487, 2008.
[10] Z. Yao, P. L., L. Lei, and J. Yin, "R-C4.5 decision tree model and its applications to health care dataset," in Proc. 2005 Int. Conf. on Services Systems and Services Management, pp. 1099-1103, 2005.
[11] L. Cao, "Introduction to domain driven data mining," in Data Mining for Business Applications, pp. 3-10, Springer, 2009.
[12] K. Dwyer, "Decision tree instability and active learning," Master's thesis, University of Alberta, 2007.
[13] J. Du and C. X. Ling, "Active learning with generalized queries," in Proc. Ninth IEEE Int'l Conf. Data Mining, pp. 120-128, 2009.
[14] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, CA, 2007.
[16] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, Delhi, India, 2004.
[17] P. Magni and R. Bellazzi, "A stochastic model to assess the variability of blood glucose time series in diabetic patients self-monitoring," IEEE Trans. Biomed. Eng., vol. 53, no. 6, pp. 977-985, Jun. 2006.
[18] K. Polat and S. Gunes, "An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease," Digital Signal Processing, vol. 17, no. 4, pp. 702-710, Jul. 2007.
[19] J. Friedman, "Fitting functions to noisy data in high dimensions," in Proc. 20th Symp. Interface, Amer. Statistical Assoc., E. J. Wegman, D. T. Gantz, and I. J. Miller, Eds., 1988, pp. 13-43.
[20] T. W. Simpson, C. Clark, and J. Grelbsh, "Analysis of support vector regression for approximation of complex engineering analyses," presented at ASME 2003.
[21] L. B. Goncalves, M. M. B. R. Vellasco, M. A. C. Pacheco, and F. J. de Souza, "Inverted hierarchical neuro-fuzzy BSP system: A novel neuro-fuzzy model for pattern classification and rule extraction in databases," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 36, no. 2, pp. 236-248, Mar. 2006.