
Medical data mining with extended WEKA

R. ROBU* and C. HORA*


* Politehnica University of Timișoara, Romania


raul.robu@aut.upt.ro, cata.hora@gmail.com

Abstract: This paper presents a synthesis on the analysis of
medical data through data mining techniques. Aspects
regarding data mining tools are also mentioned. Certain
aspects related to the use of classification techniques in
the medical field are presented, together with the
improvements made to the classification user interface of
the data mining tool WEKA. These modifications simplify the
prediction process. The extended application was tested on
four medical datasets obtained from the UCI Machine Learning
Repository, on which the models reached cross-validated
accuracies between 69.23% and 97.27%.

I. INTRODUCTION
Modern medicine generates enormous quantities of
heterogeneous data almost every day. Today, the biggest
challenge is to transform these huge quantities of data
into useful information and knowledge. Medical data may
contain images (MRI), signals (ECG), clinical information
such as temperature, cholesterol level, etc., as well as
the doctor's interpretation. More and more medical
procedures rely on medical imaging as a preferred
diagnostic instrument, so efficient methods for exploiting
image databases need to be developed.
Analyzing patient data makes it possible to build the
profile of a patient suffering from a certain disease, to
improve diagnosis [1], and to determine the relationships
between certain medical parameters in order to make
medical predictions. Data mining techniques have been
successfully applied to human brain imaging and to
genetics (for example, to predict the 3D structure of
proteins from their amino-acid sequence). In the genetics
field, data mining techniques are also applied to discover
correlations between variations in the DNA sequences of
different individuals and their susceptibility to certain
diseases. The aim of research in this field is to improve
the diagnosis of diseases, to prevent them more
efficiently and to treat them more easily.
Data mining techniques also support drug discovery and
predictions at the level of the individual patient
(personalized medicine).
Because the files to which data mining techniques are
applied contain medical data linked to human subjects and
their private lives, ensuring data confidentiality is very
important [2].
This paper presents a synthesis on the analysis of
medical data through data mining techniques. Aspects
regarding data mining tools are also mentioned, and certain
aspects related to the use of classification techniques in
the medical field are presented. WEKA is a very powerful
data mining and machine learning tool. Since it is an
open-source application developed in Java, the authors had
the opportunity to improve the part of the system that
makes predictions using a built and tested model. The
legal aspects regarding the improvements made by the
authors are also presented. The modified application was
tested by building models and making predictions on 4
medical datasets obtained from the UCI Machine Learning
Repository. The models reached cross-validated accuracies
between 69.23% and 97.27%.
II. MEDICAL DATA MINING TECHNIQUES

The main data mining techniques used in medicine are:
• Classification - predicting a nominal value, for
  example whether a solitary pulmonary nodule is
  cancerous or benign [3].
• Regression - estimating an output value based on
  input values, for example expressing the respiratory
  muscle strength index PEmax as a function of the
  predictor variables height, weight, age, sex, body
  mass percent, forced expiratory volume in one second,
  residual volume, functional residual capacity and
  total lung capacity for a group of patients with
  cystic fibrosis.
• Time series analysis - examining the value of an
  attribute over a time period, usually at evenly
  spaced time intervals. For example, depending on the
  condition of a patient, the values of certain
  attributes may be obtained on a daily or hourly
  basis. This may be used to predict future values or
  to determine the similarity between different time
  intervals.
• Clustering - a descriptive technique which consists
  of identifying classes or groups in sets of
  unclassified data. Clustering is often one of the
  first steps in a data mining analysis. It identifies
  groups of related records that can be used as a
  starting point for exploring further relationships.
• Association rules - for example, one may discover
  that a set of symptoms often occurs together with
  another set of symptoms [3].
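As an illustration of the association rules item above, the strength of a rule such as "fever implies cough" is commonly measured by its support and confidence. The following is a minimal Python sketch under stated assumptions: the function names and the four patient records are hypothetical, invented only for this example, and are not taken from the datasets used in this paper.

```python
from typing import FrozenSet, Iterable, List


def support(records: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = frozenset(items)
    if not records:
        return 0.0
    return sum(1 for record in records if wanted <= record) / len(records)


def confidence(records: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    antecedent = frozenset(antecedent)
    support_a = support(records, antecedent)
    if support_a == 0.0:
        return 0.0
    return support(records, antecedent | frozenset(consequent)) / support_a


# Hypothetical symptom sets, one per patient (illustration only).
records = [
    frozenset({"fever", "cough", "fatigue"}),
    frozenset({"fever", "cough"}),
    frozenset({"fever", "headache"}),
    frozenset({"cough", "fatigue"}),
]

rule_support = support(records, {"fever", "cough"})
rule_confidence = confidence(records, {"fever"}, {"cough"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst.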
III. DATA MINING TOOLS

Data mining techniques can be applied in the medical
field with commercial software (SPSS Clementine, SAS
E-Miner, MATLAB, Oracle DM, SQL Server), with open-source
software such as WEKA, R [4] or Orange, or with software
developed under contract to solve specific problems
(MaternQual, NEONAT).
MaternQual is an integrated information system that
uses data mining techniques to identify primary and
secondary risk factors for premature birth and to predict
premature birth risk. NEONAT is a tool for recording,
digitally processing and analyzing the cries of newborns.
In this paper the authors focus on the classification
user interface of the data mining and machine learning
tool WEKA, an open-source application developed in Java.
IV. CLASSIFICATION TECHNIQUE ON MEDICAL DATA
One of the data mining techniques successfully applied
in the medical field is classification. Classification
consists of building a classifier model from training data,
testing this model on the test data and using it to make
predictions. A classifier model is built with the aid of
specialized algorithms such as Naive Bayes, ID3, C4.5,
Random Forest, etc. A classifier predicts the nominal
values of the class attribute. For example, classification
rules about diseases can be extracted from known cases and
used to diagnose new patients based on their symptoms.
We may classify patients with heart problems on the
basis of various types of heart diseases. Some knowledge
of data under consideration is assumed before applying
the classification technique.
Suppose $D$ is a database of patients. We may regard $D$
as a set of tuples $(x_1, x_2, \ldots, x_n)$, where
$x_1, x_2, \ldots, x_n$ are the values of attributes
$A_1, A_2, \ldots, A_n$ relevant to a particular disease.
We may define various classes
$C = \{C_1, C_2, \ldots, C_n\}$ of patients depending on the
severity of the disease or on a particular classification
type of the disease. The classification problem is
basically to define a function $f: D \to C$ where each
$t_i \in D$ is mapped to $f(t_i)$ belonging to some
$C_j$ [5].
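To make the idea of a learned function from tuples to classes concrete, the sketch below trains a small Naive Bayes classifier on nominal attributes and returns it as a Python function `f`. This is only an illustrative sketch, not WEKA's implementation: the function name, the use of Laplace smoothing and the six training tuples are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

Row = Tuple[str, ...]


def train_naive_bayes(rows: Sequence[Row],
                      labels: Sequence[str]) -> Callable[[Row], str]:
    """Return a function f that maps an attribute tuple to a class label."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] = number of training rows
    value_counts: Dict[Tuple[int, str], Counter] = defaultdict(Counter)
    domains: List[Set[str]] = [set() for _ in rows[0]]
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)

    def f(row: Row) -> str:
        best_label, best_score = "", -math.inf
        for label in sorted(class_counts):
            count = class_counts[label]
            # Log prior plus the sum of log conditional probabilities.
            score = math.log(count / len(labels))
            for i, value in enumerate(row):
                # Laplace smoothing keeps unseen values from zeroing the score.
                numerator = value_counts[(i, label)][value] + 1
                denominator = count + len(domains[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return f


# Hypothetical tuples (chest pain type, exercise-induced angina), illustration only.
rows = [("typical", "yes"), ("typical", "yes"), ("atypical", "no"),
        ("none", "no"), ("atypical", "yes"), ("none", "no")]
labels = ["disease", "disease", "healthy", "healthy", "disease", "healthy"]

f = train_naive_bayes(rows, labels)
print(f(("typical", "yes")), f(("none", "no")))
```

Here `rows` plays the role of the tuples in $D$, the distinct values in `labels` play the role of the classes, and the returned `f` is one possible classification function.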
V. IMPROVEMENT OF THE WEKA USER INTERFACE FOR CLASSIFICATION
a) Building and testing a classifier
WEKA makes it easy to build a classifier from training
data. Various classification algorithms can be used to
build the model. WEKA 3.6.3, the latest stable version at
the time of writing, offers 117 classification and
regression algorithms (among them NaiveBayes, ZeroR,
RandomForest, etc.). The built model can be tested on the
training data, on test data obtained by randomly splitting
the initial data set into training and test data, or on a
separately supplied test set [6].
After the model is tested, the results appear in the
Classifier Output section. The displayed information
includes the run information, the built classifier model,
the test results, the detailed accuracy by class and the
confusion matrix [7].
If the built model performs satisfactorily, it can be
used to make predictions [8].
b) Using the classifier for predictions
In WEKA, a built model can be used for predictions only
indirectly, by testing the model on a specially prepared
data set. Making a prediction requires the following
steps. First, a new ARFF file is created containing the
instance or instances whose class is to be predicted. For
each instance, the class attribute is filled with either a
question mark or an arbitrary nominal value. Next, the
Output predictions option is set and the ARFF file is
loaded with the Supplied test set command. Finally, either
the user right-clicks the built model and chooses the
Re-evaluate model on current test set command, or the
model is rebuilt and tested on the instances from the
created ARFF file. The authors consider this process
somewhat forced and not very intuitive.
c) Evaluation and improvement of the classification user
interface
The evaluation of the classification interface in WEKA
leads to the following conclusions:
• Using a classifier to make predictions is not at all
  intuitive; the question "How do I make predictions
  with a trained model?" appears in the FAQ list on the
  WEKA website as well as in other sources.
• Much of the data displayed by default in the
  Classifier Output section (Kappa statistic, Mean
  absolute error, Root mean squared error, Relative
  absolute error, etc.) is useful only to people
  experienced in using WEKA (see Figure 1).

Figure 1 - Original classification user interface in WEKA
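The manual workaround described in subsection b) starts from an ARFF file in which the class of each instance is unknown. A minimal Python sketch of that preparation step follows. The function name `build_prediction_arff`, the relation name and the example attributes are hypothetical and chosen only for illustration; attribute names are assumed to contain no spaces (ARFF would otherwise require quoting), and the last attribute is assumed to be the class.

```python
from typing import List, Sequence, Tuple, Union

AttributeSpec = Tuple[str, Union[str, List[str]]]


def build_prediction_arff(relation: str,
                          attributes: Sequence[AttributeSpec],
                          instances: Sequence[Sequence[object]]) -> str:
    """Build ARFF text whose instances all have a missing ('?') class value.

    `attributes` lists (name, spec) pairs, where spec is the string "numeric"
    or a list of nominal values; the last pair is taken to be the class.
    Each instance supplies one value per attribute except the class.
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(spec) + "}")
    lines += ["", "@data"]
    for values in instances:
        if len(values) != len(attributes) - 1:
            raise ValueError("expected one value per non-class attribute")
        # The trailing '?' marks the class value as missing.
        lines.append(",".join(str(v) for v in values) + ",?")
    return "\n".join(lines) + "\n"


# Hypothetical attributes and values, not taken from any real data set.
attributes = [
    ("age", "numeric"),
    ("chest_pain", ["typ_angina", "asympt"]),
    ("diagnosis", ["absent", "present"]),
]
example_arff = build_prediction_arff("to_predict", attributes,
                                     [(63, "typ_angina")])
print(example_arff)
```

The resulting text would be saved to a file and loaded through the Supplied test set command, after which the remaining steps are still carried out by hand in the WEKA interface.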

To improve this interface and to simplify the
prediction process, the Classify user interface was
modified [9] by adding the Prediction of an instance
panel. This panel has dynamic content: for each data set
it displays one label per attribute. For nominal
attributes the panel displays ComboBoxes with the possible
nominal values, and for numeric attributes it displays
TextBoxes. The user selects the values of the nominal
attributes, fills in the values of the numeric attributes
for the instance to be predicted and then presses the
Predict button. The result of the prediction is displayed
beside the Predict button (see Figure 2).
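The rule that drives the dynamic panel (a ComboBox for each nominal attribute, a TextBox for each numeric one) can be sketched independently of the Java GUI code. The Python sketch below is not the authors' implementation: the function name `plan_widgets`, the widget labels and the sample header are hypothetical, and it assumes that the last attribute is the class to be predicted and therefore needs no input widget.

```python
import re
from typing import List, Tuple

ATTRIBUTE_RE = re.compile(r"@attribute\s+(\S+)\s+(.+)", re.IGNORECASE)


def plan_widgets(arff_header: str) -> List[Tuple[str, str, List[str]]]:
    """Return (attribute name, widget kind, choices) for each input attribute."""
    plan = []
    for line in arff_header.splitlines():
        match = ATTRIBUTE_RE.match(line.strip())
        if not match:
            continue  # skip @relation, @data, comments and blank lines
        name, spec = match.group(1), match.group(2).strip()
        if spec.startswith("{") and spec.endswith("}"):
            # Nominal attribute: offer its declared values in a combo box.
            choices = [value.strip() for value in spec[1:-1].split(",")]
            plan.append((name, "combo_box", choices))
        else:
            # Numeric (or other) attribute: free text entry.
            plan.append((name, "text_box", []))
    # Assumption: the last attribute is the class, so it gets no input widget.
    return plan[:-1]


# Hypothetical header used only to exercise the function.
header = """@relation heart
@attribute age numeric
@attribute sex {female,male}
@attribute diagnosis {absent,present}
@data
"""
widget_plan = plan_widgets(header)
print(widget_plan)
```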
Another modification to the Classify section aimed to
make the interface friendlier. For this purpose, a 3D
graph was added that shows the number of correctly
classified instances and the number of incorrectly
classified instances. The information that was previously
displayed by default (including Kappa statistic, Mean
absolute error, Root mean squared error, Relative absolute
error and Root relative squared error [6]) is now shown on
request, by pressing the newly introduced Advance
Information button.
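Both the two counts shown in the graph and the Kappa statistic can be derived from the confusion matrix of a tested model. The following Python sketch shows the standard computation of Cohen's kappa; the function name and the 2x2 matrix are hypothetical and serve only as an illustration, and the sketch assumes the chance agreement is below 1 so that the division is defined.

```python
from typing import List, Tuple


def summarize_confusion_matrix(matrix: List[List[int]]) -> Tuple[int, int, float, float]:
    """Return (correct, incorrect, accuracy, kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    observed = correct / total
    # Agreement expected by chance: sum of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return correct, total - correct, observed, kappa


# Hypothetical confusion matrix for a two-class problem (illustration only).
matrix = [[40, 10],
          [5, 45]]
correct, incorrect, accuracy, kappa = summarize_confusion_matrix(matrix)
print(correct, incorrect, accuracy, kappa)
```

The first two returned values are the quantities a correct/incorrect graph needs, while the last two belong to the more detailed statistics.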
The modifications greatly simplify the actions that a
WEKA user has to perform in a medical data mining study in
order to make predictions with the built model. The
modified files are GUIChooser.java, ClassifierPanel.java,
Explorer.java and PreprocessPanel.java. The files
SimpleBarChart.java and SpringUtilities.java were added.

Figure 2 - Modified WEKA classification user interface

VI. TEST RESULTS
The modified application was tested by building
classifiers and making predictions on 4 datasets obtained
from the UCI Machine Learning Repository: Ljubljana Breast
Cancer, Heart Disease, Dermatology and Diabetes. The
algorithms used to build the classifier models were Naive
Bayes, Random Forest and ID3. Since the models obtained
with these algorithms achieved good accuracy, they were
used to make predictions. The predictions were made with
the help of the new dynamic interface that was added to
WEKA.
The Ljubljana breast cancer data set contains data
supplied by doctors from the Oncology Institute of the
University Medical Centre in Ljubljana, Yugoslavia, and
has 286 instances and 10 attributes. The attributes
include age (divided into ten-year intervals), whether or
not the woman has reached menopause, the size of the tumor
(12 intervals), the degree of malignancy (3 degrees), the
breast (left or right), the area in which the tumor is
located, and whether or not the breast was irradiated. The
class attribute indicates whether or not recurrence events
occurred.
The diabetes data set has 768 instances and 9
attributes. The data describe a group of Pima Amerindians
living near Phoenix, Arizona, and were supplied by the
National Institute of Diabetes and Digestive and Kidney
Diseases. The instances characterize women who are at
least 21 years old. The attributes include the number of
births, the plasma glucose concentration 2 hours into an
oral glucose tolerance test, the diastolic blood pressure,
the triceps skin fold thickness, the 2-hour serum insulin
level, the body mass index, and the class attribute, which
indicates whether the result of the test is positive or
negative.
The Cleveland heart disease data set contains data
supplied by the Cleveland Clinic Foundation and has 303
instances and 14 attributes. The class field indicates
whether the patient has heart disease; its values range
from 0 (no disease) to 4 (seriously ill). Names and other
confidential information, such as personal identification
numbers, were removed from the database. The attributes
considered in the study were age, sex, type of chest pain
(4 possible values), resting blood pressure, cholesterol,
fasting blood sugar, the result of the resting
electrocardiogram, the maximum heart rate achieved,
exercise-induced angina (yes or no), the ST depression
induced by exercise relative to rest, the slope of the
peak exercise ST segment, the number of major vessels
(0-3) colored by fluoroscopy, and the diagnosis of heart
disease (the angiographic disease status).
The dermatology data set has 34 attributes and 366
instances. The class represents the differential diagnosis
of erythemato-squamous diseases, which is a real problem
in dermatology. These diseases all share the clinical
features of erythema and scaling, with very few
differences. The diseases in this group (the values of the
class) are psoriasis, seborrheic dermatitis, lichen
planus, pityriasis rosea, chronic dermatitis and
pityriasis rubra pilaris. A biopsy is usually required to
diagnose these illnesses, but unfortunately they share
many histopathological features as well. Another
difficulty for the differential diagnosis is that a
disease may show the features of another disease at the
beginning stage and develop its specific characteristics
only in the following stages. The patients were first
evaluated clinically on 12 features. Then, skin samples
were taken to evaluate 22 histopathological features,
whose values were determined by analyzing the samples
under a microscope.
Table I presents the accuracy obtained on the data sets
with the Naive Bayes, J48 and Random Forest algorithms.
The built models were tested through cross-validation.
After that, predictions were made with the built models
through the developed dynamic interface.
TABLE I. RESULTS OF MODELS TESTING (ACCURACY, %)

Test data                 Naive Bayes   J48     Random Forest
Ljubljana breast cancer   71.68         75.52   69.23
Diabetes                  76.3          73.83   73.83
Cleveland heart disease   85.3          77.56   81.52
Dermatology               97.27         93.99   94.81
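The cross-validation procedure behind accuracy figures of this kind can be sketched in a few lines of Python. This is a simplified illustration rather than WEKA's procedure: the number of folds, the random seed and the absence of stratification are assumptions (the paper does not state them), the function names are hypothetical, and a majority-class learner in the spirit of ZeroR stands in for the real algorithms.

```python
import random
from collections import Counter
from typing import Callable, Sequence

Model = Callable[[object], str]
Trainer = Callable[[Sequence[object], Sequence[str]], Model]


def train_zero_r(rows: Sequence[object], labels: Sequence[str]) -> Model:
    """Baseline learner: always predict the most frequent training label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority


def cross_validated_accuracy(rows: Sequence[object], labels: Sequence[str],
                             train: Trainer, k: int = 10, seed: int = 1) -> float:
    """Percentage of instances classified correctly over k held-out folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        # Train on the other k-1 folds, then test on the held-out fold.
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(1 for i in fold if model(rows[i]) == labels[i])
    return 100.0 * correct / len(rows)


# Hypothetical data: 15 negative and 5 positive instances (illustration only).
rows = list(range(20))
labels = ["negative"] * 15 + ["positive"] * 5
baseline_accuracy = cross_validated_accuracy(rows, labels, train_zero_r, k=5)
print(baseline_accuracy)
```

Every instance is used for testing exactly once, so the reported percentage covers the whole data set while each prediction comes from a model that never saw that instance during training.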

The application worked correctly in every case
considered, and the built models were successfully used by
the dynamic interface to make predictions. Figure 2 shows
the interface of the application making predictions with
the model built on the Cleveland Heart Disease data set.

VII. CONCLUSIONS
Enormous amounts of data are gathered in medical
databases every day. Medical data is very diverse and can
include images (MRI), signals (ECG), clinical information
(temperature), etc. Analyzing these data through data
mining techniques aims to extract useful knowledge from
them.
Several data mining techniques are used in the medical
field, such as classification (to predict a nominal
value), regression (to predict a numeric value),
clustering (to determine the main groups of similar data),
association rules (to detect associations between
different types of information which apparently have no
dependency), etc.
These data mining techniques can be applied to medical
data by means of open-source tools such as WEKA, R and
Orange.
The authors improved the classification interface of
the open-source WEKA program by introducing a section
whose content changes dynamically according to the opened
data set and which allows the classifier model to be used
to make predictions. The graphical interface was also
improved with a 3D graph which illustrates the number of
correctly and incorrectly classified instances.
The changes were tested through classifications and
predictions on 4 sets of real medical data obtained from
the UCI Machine Learning Repository, with cross-validated
accuracies ranging from 69.23% to 97.27%.
REFERENCES
[1] Y. Atilgan and F. Dogan, Data Mining on Distributed Medical
    Databases: Recent Trends and Future Directions, IT Revolutions
    (2009), Vol. 11, pp. 216-224
[2] K. Gang, P. Yi, S. Yong and C. Zhengxin, Privacy-Preserving
    Data Mining of Medical Data Using Data Separation-Based
    Techniques, Data Science Journal (2007), Vol. 6, pp. 429-434
[3] A. Kusiak, K.H. Kernstine, J.A. Kern, K.A. McLaughlin and T.L.
    Tseng, Data Mining: Medical and Engineering Case Studies,
    Proceedings of the Industrial Engineering Research 2000
    Conference (2000), Cleveland, Ohio, May 21-23, pp. 1-7
[4] K. Hornik, C. Buchta and A. Zeileis, Open-Source Machine
    Learning: R Meets Weka, Computational Statistics (2009),
    pp. 225-232
[5] S.K. Wasan, V. Bhatnagar and H. Kaur, The Impact of Data
    Mining Techniques on Medical Diagnostics, Data Science Journal
    (2006), Vol. 5, pp. 119-126
[6] R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A.
    Seewald and D. Scuse, WEKA Manual for Version 3-6-0,
    University of Waikato, Hamilton, New Zealand, 2008
[7] I.H. Witten and E. Frank, Data Mining: Practical Machine
    Learning Tools and Techniques, Second Edition, Elsevier Inc.,
    2005
[8] R. Frank, M. Ester and A. Knobbe, A Multi-Relational Approach
    to Spatial Classification, Proceedings of the 15th ACM SIGKDD
    International Conference on Knowledge Discovery and Data Mining
    (2009), Paris, France, pp. 309-318
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and
    I.H. Witten, The WEKA Data Mining Software: An Update, ACM
    SIGKDD Explorations Newsletter (2009), Vol. 11, Issue 1,
    pp. 10-18
