
CHAPTER III

HEART DISEASE RISK PREDICTION MODEL (HDRPM)


METHODOLOGY FOR DIABETIC PATIENTS

3.1 NAIVE BAYES APPROACH


In the first experiment, the Naive Bayes data mining classifier was applied to
produce an optimal prediction model, using a minimal training set, that predicts
the chances of a diabetic patient getting heart disease. The diagnosis of diseases
plays a vital role in the medical field. Using attributes recorded for diabetes
diagnosis, such as age, sex, blood pressure and blood sugar, the proposed system
predicts the chances of a diabetic patient getting heart disease. It should be
noted that the attributes used in the proposed method are those used for the
diagnosis of diabetes and are not direct indicators of heart disease.
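As an illustration of the classifier described above, the following is a minimal sketch of Naive Bayes for numeric attributes. It assumes a Gaussian likelihood per attribute, which the text does not specify, and the toy records and the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, not taken from the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            # Small constant keeps the variance positive for constant columns.
            var = sum((v - mean) ** 2 for v in column) / n + 1e-9
            stats.append((mean, var))
        model[label] = (n / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one record."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy records: (age, systolic blood pressure, fasting blood sugar).
X = [(45, 120, 110), (50, 125, 115), (62, 150, 180), (68, 160, 190)]
y = ["No", "No", "Yes", "Yes"]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, (65, 155, 185))
```

The "naive" assumption is visible in the inner loop: each attribute contributes its log likelihood independently of the others.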
3.1.1 Data set and Parameters used in Naive Bayes
Risk of heart disease increases due to various factors including family
history, smoking, poor diet, high blood pressure, high blood cholesterol and obesity.

The Cleveland Clinic Foundation heart disease dataset is available for
determining the accuracy rate. For the present study, however, records of about
500 diabetic patients were initially collected from the Seshiah Diabetic Research
Institute in Chennai, India, to perform the experiments. The clinical data set
specification provides concise, unambiguous definitions for items related to
diabetes.

The diabetes attributes used in our Naive Bayes experiment and their
descriptions are shown in Table 3.1.

Table 3.1 Diabetes attributes used in our Naive Bayes experiments

3.1.2 Data pre-processing and sampling


Each algorithm requires data to be submitted in a specified format. The
conversion of raw data into a machine-understandable format is called
preprocessing. The data preparation phase covers all activities needed to
construct the final dataset from the initial raw data. The raw data can be stored
in several formats, including text, Excel or other database file types. The raw
data is then converted into data sets with a few appropriate characteristics.
Values are missing in most datasets, for many reasons. Raw data also usually
contain a great deal of noise, which is random error or variance in a measured
variable. Such data cannot be used directly by machine-learning algorithms. Data
cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
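The cleaning steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which imputation or outlier rule was used, so filling with the median and flagging by a modified z-score (median absolute deviation, cutoff 3.5) are choices made here, and the blood pressure readings are hypothetical.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

def flag_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score exceeds the cutoff (assumed rule)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > cutoff for v in values]

# Hypothetical blood pressure readings with one gap and one implausible value.
bp = [120, 130, None, 125, 400, 128]
filled = fill_missing_with_median(bp)
flags = flag_outliers(filled)
```

The median is used rather than the mean so that the implausible reading does not distort the value used to fill the gap.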
Cleaning and filtering of the data must be carried out before a data mining
algorithm is applied, to avoid the creation of deceptive or inappropriate rules
or patterns. To make the data appropriate for the mining process, it also needs
to be transformed.
Data integration merges data from multiple sources into a coherent data
store, such as a data warehouse or a data cube. Careful integration of the data
from multiple sources helps to reduce and avoid redundancies and inconsistencies
in the resulting data set, which improves the accuracy and speed of the
subsequent mining process. Data reduction reduces the data size by aggregating
and eliminating redundant features. With data mining techniques, the focus is
placed on specific fields that allow exploration of the data, by selecting and
filtering fields as input, output and predictive fields.
All attributes used in the Naive Bayes experiments and listed in Table 3.1,
with the exception of sex and family heredity, have numeric values. The attribute
sex takes the value M or F to denote male or female respectively. The attribute
family heredity takes the value Father, Mother or Both. Where the patient has no
previous diabetes history, this attribute is left empty in the raw data. Since no
attribute value should be left empty for the mining algorithm to work properly,
the value No can be used for patients without any previous diabetes history.
Likewise, a categorical attribute is needed as the class on which the data sets
are to be classified.
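The handling of the empty family heredity value can be sketched as below. The dictionary keys and sample records are hypothetical; only the values M, F, Father, Mother, Both and No come from the text.

```python
def fill_heredity(records):
    """Return copies of the records with an empty heredity value set to 'No'."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the raw data is left untouched
        if not record.get("heredity"):
            record["heredity"] = "No"
        cleaned.append(record)
    return cleaned

# Hypothetical raw records; an empty string or None marks a missing value.
raw = [
    {"sex": "M", "heredity": "Father"},
    {"sex": "F", "heredity": ""},
    {"sex": "F", "heredity": None},
    {"sex": "M", "heredity": "Both"},
]
cleaned = fill_heredity(raw)
```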

The aim of the present work is to predict the chances of a diabetic patient
getting heart disease. Hence, the LP Tot Y/N attribute has been taken as the class
attribute. Since the underlying LP Tot value is numeric, its values have been
categorized as high cholesterol (Yes) or low cholesterol (No). In the data
exploration mode, almost all attribute selection modules applicable to the data
were explored in order to find an optimal subset of attributes for predicting the
risk of diabetic patients getting various types of heart disease.
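The categorization of the numeric LP Tot value into the Yes/No class can be sketched as a simple threshold rule. The cutoff of 200 used here is an assumption made for illustration; the text does not state the threshold that was applied.

```python
def label_cholesterol(lp_tot, threshold=200.0):
    """Map a numeric total-cholesterol value to the class label.

    The threshold is a hypothetical placeholder, not the study's value.
    """
    return "Yes" if lp_tot >= threshold else "No"

# Hypothetical LP Tot readings.
lp_tot_values = [180.0, 240.0, 199.9, 200.0]
class_labels = [label_cholesterol(v) for v in lp_tot_values]
```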
3.2 SUPPORT VECTOR MACHINE APPROACH
In the second experiment, the Support Vector Machine (SVM) data mining
classifier was used with a radial basis function (RBF) kernel to diagnose the
vulnerability of diabetic patients to heart disease. Most existing systems of
this kind have successfully employed SVM for classification. On this evidence,
the SVM classifier was used in the experiments reported in the present work. The
proposed system exhibits good accuracy in predicting the vulnerability of
diabetic patients to heart disease.
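For reference, the radial basis function kernel measures the similarity of two records as exp(-gamma * ||x - y||^2). A minimal sketch follows; the value of gamma and the sample vectors are assumptions, since the text does not report the kernel parameters.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical normalized attribute vectors for two patients.
patient_a = [0.2, 0.5, 1.0]
patient_b = [0.3, 0.5, 0.0]
similarity = rbf_kernel(patient_a, patient_b, gamma=0.5)
```

Identical records give a kernel value of 1, and the value decays towards 0 as records move apart, at a rate controlled by gamma.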
3.2.1 Data set and Parameters used in SVM
The methodology described here diagnoses the vulnerability of diabetic
patients to heart disease, using the records of about 500 diabetic patients
described in Section 3.1.1. The diabetes attributes used in the proposed SVM
system and their descriptions are shown in Table 3.2. All attribute roles are
regular except that of the vulnerability attribute.

Table 3.2 Diabetes attributes used in our Support vector machine experiments
Attribute role: Regular (10 attributes), Label (1 attribute)

Of the 500 records, 142 relate to patients highly vulnerable to heart
disease, and the remaining 358 to patients found to be less vulnerable. Since SVM
processes only numeric attributes, the nominal attributes were converted to
numeric attributes by replacing each value with a unique integer. For example,
the values of the attribute sex were converted as follows: male to 1 and female
to 0. The values of the attributes were then normalized to the range 0 to 1.
These records were given as input to the SVM classifier and the performance of
the support vector algorithm was analyzed. On that basis, this SVM model can be
recommended for the classification of the diabetic dataset.
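The encoding and normalization steps can be sketched as follows. Min-max scaling is assumed as the normalization method because it maps values onto the range 0 to 1; the function names and sample ages are hypothetical.

```python
def encode_sex(value):
    """Map the nominal sex attribute to an integer: male 1, female 0."""
    return {"M": 1, "F": 0}[value]

def min_max_normalize(values):
    """Rescale numeric values linearly onto the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - low) / (high - low) for v in values]

# Hypothetical sample values.
sexes = ["M", "F", "F"]
ages = [40, 55, 70]
encoded_sexes = [encode_sex(s) for s in sexes]
normalized_ages = min_max_normalize(ages)
```

Scaling matters for an RBF kernel because the kernel depends on distances, so an attribute with a large raw range would otherwise dominate the others.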

3.3 SUPPORT VECTOR MACHINE AND DECISION TREE APPROACH


The third experiment aims to determine which of two techniques, support
vector machine or decision tree induction, more accurately predicts the risk of
heart disease in diabetic patients.
3.3.1 Data structure used in SVM and Decision Tree
Here the records of about 1000 diabetic patients were collected. The
characteristics of the diabetic attributes making up each record are shown in
Table 3.3. All attribute roles are regular except that of the risk class
attribute.
Table 3.3 Diabetes attributes used in our SVM and Decision Tree experiments
Role: label (1 attribute), regular (13 attributes)

A classification model is a mapping from instances to classes or groups. It
predicts categorical class labels: the model is built from the training set and
the values of the class attribute, and is then used to classify new data. The
models here are based on support vector machine and decision tree induction, and
the goal is to accurately predict the target class for each case in the data.
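The core step of decision tree induction, choosing the split that best separates the classes, can be sketched as below. Entropy-based information gain on a single numeric attribute is assumed here; the text does not say which induction algorithm or split criterion was used, and the A1C values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical A1C values with their risk class.
a1c = [5.5, 6.0, 7.5, 8.0, 9.0]
risk = ["low", "low", "high", "high", "high"]
threshold, gain = best_threshold(a1c, risk)
```

A full tree would apply this search to every attribute and recurse on each side of the chosen split.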
3.4 NAIVE BAYES, SUPPORT VECTOR MACHINE AND DECISION
TREE APPROACH
In the final experiment, a comparative analysis is provided, from a machine
learning perspective, of the classifiers that can classify the risk of diabetic
patients getting heart disease. It evaluates and compares three data mining
classification techniques, namely Naive Bayes, Support Vector Machine and
Decision Tree, on their predictive accuracy in predicting the risk of heart
disease for diabetic patients.
3.4.1 Data structure used in Naive Bayes, SVM and Decision Tree
Here too, records of about 1000 diabetic patients were used to evaluate and
compare the three data mining classification techniques. Their performance is
compared through sensitivity, specificity, F-score and accuracy.
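These four measures can all be derived from the counts of true and false positives and negatives. The sketch below assumes the F-score is the balanced F1 measure and treats "Yes" as the positive class; the label lists are hypothetical.

```python
def classification_metrics(actual, predicted, positive="Yes"):
    """Compute sensitivity, specificity, F1-score and accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f_score": f_score, "accuracy": accuracy}

# Hypothetical actual and predicted risk labels for eight patients.
actual = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
metrics = classification_metrics(actual, predicted)
```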
The attributes making up each record, and their characteristics as used in
the NB, SVM and DT experiments, are shown in Table 3.4. In the Statistics column,
numeric and integer attributes are summarized by their average and standard
deviation, which are represented with (+/-). Polynominal and nominal attributes
are summarized by their mode and least values: the mode is the most frequent
value and the least is the least frequent value. Missing values are those left
blank, and values present are those that contain data.
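The summary statistics described above can be reproduced with a short sketch. The sample standard deviation is assumed (the text does not say whether the sample or population form was reported), and the attribute values are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

def summarize_numeric(values):
    """Average, sample standard deviation and missing count for a column."""
    present = [v for v in values if v is not None]
    return {"average": mean(present), "deviation": stdev(present),
            "missing": len(values) - len(present)}

def summarize_nominal(values):
    """Most frequent (mode) and least frequent value with their counts."""
    present = [v for v in values if v is not None]
    ranked = Counter(present).most_common()
    return {"mode": ranked[0], "least": ranked[-1],
            "missing": len(values) - len(present)}

# Hypothetical columns; None marks a missing value.
age_summary = summarize_numeric([40, 50, None, 60])
sex_summary = summarize_nominal(["M", "M", "F", "M", None])
```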

Table 3.4 Data structure and summary used in our NB, SVM, DT experiments
Attribute
Sex
Age
Heredity
Smoking
Alcohol
BP
Fasting
PP
A1C
LP Tot
LDL
VLDL
TGL
Cholesterol HDL
attribute_Label

The overview of the proposed system is shown in Figure 3.1.

Figure 3.1 Overview of the proposed system


The standard way of estimating the error rate of a learning technique, given
a fixed sample of data, is stratified tenfold cross-validation. Extensive tests
on numerous datasets, with different learning techniques, have shown that ten is
about the right number of folds to get the best estimate of error. To measure the
stability of the proposed model, the data is divided into training and testing
data with 10-fold cross-validation to evaluate the accuracy of the learning
model. The dataset is divided into 10 parts; each part in turn is used for
testing while the remaining nine parts are used for training.
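A stratified split of this kind can be sketched as follows. The class counts of 142 and 358 are taken from the 500-record data set described in Section 3.2.1; the round-robin assignment, the random seed and the function names are assumptions made for illustration and do not reproduce the tools' internal procedure.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign record indices to k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

labels = ["high"] * 142 + ["low"] * 358
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

The error estimate is then the average of the ten test-fold results, so every record is used for testing exactly once.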
In this research, three popular data mining classification techniques,
Naive Bayes, decision tree induction and support vector machine, are applied and
compared with one another on their predictive accuracy. Weka and RapidMiner were
used as tools for evaluating and comparing the classification techniques on the
given patient dataset, because their learning operators and operator framework
permit the formation of nearly arbitrary processes.
3.5 SUMMARY
This chapter first describes the application, in the first experiment, of the
Naive Bayes data mining classifier, which produces an optimal prediction model
using a minimal training set to predict the chances of a diabetic patient getting
heart disease. The Support Vector Machine classifier was used with the radial
basis function kernel in the next experiment to diagnose the vulnerability of
diabetic patients to heart disease. In the final experiment, a comparative
analysis is provided, from a machine learning perspective, of the classifiers
that can classify the risk of diabetic patients getting heart disease.
