
Available online at

ScienceDirect
www.sciencedirect.com
IRBM 38 (2017) 305–324

General Review

Stark Assessment of Lifestyle Based Human Disorders Using Data Mining Based Learning Techniques

M. Sharma a,∗, G. Singh b, R. Singh b
a Dept. of Computer Sc. and App., DAV University, Jalandhar, India
b DCS, GNDU, Amritsar, Punjab, India

Received 5 April 2017; received in revised form 20 August 2017; accepted 15 September 2017
Available online 18 October 2017

Highlights

• Lifestyle based human disorders have been diagnosed using different data mining techniques.
• Common and live datasets for lifestyle based human disorders are mentioned.
• The rate of accuracy achieved using different mining techniques is highlighted.
• The effect of preprocessing and the relationship between disease, data type and mining approach are analyzed.
• A novel hybrid diagnosis model for lifestyle based human disorders is proposed.

Abstract
Background: Medical informatics has seen unrestrained growth in its databases. Recent advances in medical science have eliminated many critical diseases. Nowadays, the medical industry is rich in data sources, but these sources are useful only if they are analyzed effectively and on time.
Methods: Data mining techniques are artificially intelligent and are used to investigate known and unknown patterns in medical databases. Nowadays, they are routinely used to mine the abundant data sources of medical science. This paper explores the practice of diverse data mining techniques, the role of the dataset used, the effect of preprocessing, and the performance of different data mining techniques in the diagnosis of different lifestyle based diseases. The aim of this paper is to provide a stark assessment of the different data mining techniques used in medical sciences.
Results: The survey shows that significant effort has so far been made in mining data related to cardiology and diabetes. As per Google Scholar, in the last seven years the percentages of published articles on the diagnosis of cardio, diabetes, digestive, dentistry and ophthalmology diseases using data mining are 42%, 26%, 18%, 10% and 4% respectively. Thus, little attention has been paid to developing predictive models for ophthalmology, dentistry and digestive disorders. In addition, the rate of usage of preprocessing in the diagnosis of disorders related to cardio, diabetes, digestive, dentistry and ophthalmology lies between 10.65%–17.75%, 8.48%–14.80%, 4.58%–8.93%, 2.96%–7.73% and 5.83%–12.93% respectively.
Conclusion: Attention is needed to develop smart diagnostic systems that make people aware of, and protect them from, the wide spectrum of critical diseases related to the ophthalmic, oral and digestive systems.
© 2017 AGBM. Published by Elsevier Masson SAS. All rights reserved.

Keywords: Data mining; Lifestyle; Heart disease; Diabetes; Ophthalmology; Oral and digestive disorder

* Corresponding author.
E-mail address: manik10143@davuniversity.org (M. Sharma).

https://doi.org/10.1016/j.irbm.2017.09.002
1959-0318/© 2017 AGBM. Published by Elsevier Masson SAS. All rights reserved.

1. Introduction to data mining

In modern times, data mining has received an immense deal of attention. Data mining is the automatic extraction of patterns or useful information. It is used to arrive at the outcome of a problem by probing the available facts and data. It helps in measuring the association, nature and degree of correlation among the different parameters of a dataset. Moreover, the dataset can be classified into several classes. At its core, data mining is known as knowledge discovery in databases. One can also predict the future by carefully analyzing the past and current state of a dataset. The commonly used data mining techniques are naive Bayes, neural network (NN), genetic algorithm (GA), support vector machine (SVM), decision tree induction (DTI) etc. Data mining techniques work on two sets of data, i.e. training and testing. Data mining is a difficult process, as one has to train the system on the characteristics or features that have to be extracted. Data mining is one of the dominant research fields and is used in almost all streams, viz. Agriculture, Computer Science, Mathematics, Chemistry, Finance, Economics, Medical Science, Zoology, Bio-informatics etc. The major applications of data mining are given below [1,2]:

• Agriculture: To study characteristics of soil, crop yield analysis, optimizing use of pesticide, diagnosis of plant diseases.
• Medical Science: Early diagnosis and prediction of various diseases.
• Computer Science: Image processing, Network Security, Computer Security.
• Business: Airline, Shopping Malls, Banks, Marketing.
• Natural and Life Sciences: Creation of new hypothesis, Stimulation of new facts, Analyzing Protein structure etc.
• Engineering: CAD, CAM, Fault detection in production lines.

Data mining [3] is a multistep approach. In the first phase, data is collected from heterogeneous sources and converted into a homogeneous format. Different preprocessing and normalization procedures are used to reduce data inconsistency. In the second phase, the data mining procedures are applied to extract meaningful information. The third phase analyzes the processed data and represents it in a standardized format. Finally, the results of the data mining process are used in the decision making process. In general, data mining is used to test a hypothesis or to discover new or hidden patterns. Traditionally, it was associated with hypothesis testing only. The idea was to first prepare a statement that has to be tested against a particular set of data and conditions.

There are certain difficulties [4] in mining a database. The most pressing problem in data mining is the lack of long term vision. Sometimes, the data used in mining is incorrect or not up to date. Moreover, some organizations, or departments within an organization, do not want to reveal the nature and contents of their data. In addition, legal and privacy conditions may create further problems in data mining. One of the crucial factors in decision making is having the right information at the right time. Today, data collection is not the major issue; rather, the analysis of data is the concern. Nowadays, the survival of an organization depends on how effectively it generates information from its data. It is believed that the future and success of the medical industry depend heavily upon the data mining process. This paper briefly summarizes the efforts of different researchers in mining information for the medical industry.

2. Related work

Suvarna Pawar and Smita Sikchi [5] have carried out a survey related to diabetes. The survey focuses on uncovering the difficulties with the existing diagnosis systems. The authors found that it still seems too difficult to diagnose diabetes with high precision. They emphasized the diverse classification techniques used in diabetes diagnosis and observed that, for the PIMA data set, SVM seems to be a good solution for early diagnosis of diabetes, with 97% accuracy.

Shubpreet Kaur and R.K. Bawa [6] stated that data mining provides an efficient way to mine the required clinical information from voluminous, raw and heterogeneous data. One may use techniques such as decision tree, naive Bayes, logistic regression and ANN to forecast different medical problems with low cost and high precision. In addition, the authors presented how data mining can be used to ascertain the interactions between health status and a disease.

Divya Tomar and Sonali Aggarwal [7] examined the efficacy of an assortment of data mining techniques, such as association, classification, clustering and regression, in the health sphere. The authors briefly explained different data mining techniques with their merits and demerits. They suggested that before applying classification techniques, one should preprocess the data, i.e. the data should be normalized by removing all types of redundancies, as these may degrade the execution time. The authors recommended the cross validation method in the classification process. Clustering is beneficial if there is no or missing information. Finally, the authors suggested using a fusion of classification, clustering and association to get better mining performance.

Punam Bajaj and Preeti Gupta [8] have studied the use of distinct data mining techniques in the diagnosis of heart disease. The authors found that significant effort has been carried out by different researchers to diagnose heart related disease using different classification and clustering techniques, feature selection methods, fuzzy logic, naive Bayes, association rules, regression techniques and their combinations. The authors have also proposed a hybrid (GA and ANN) diagnosis system for the same.

Parvez Ahmad et al. [9] have analyzed the usefulness of data mining techniques in healthcare. The authors mentioned that data mining techniques play a noteworthy role in turning the large volume of data into useful information. They discussed the role of classification, clustering, SVM, NN and Bayesian methods in mining data related to breast cancer, lung cancer, heart diseases, smoking behavior, thyroid, dengue, Alzheimer's, diabetes etc. The authors also mentioned some of the challenges that researchers have faced while mining data related to the healthcare industry. They suggested that the medical industry has to focus on developing better and more accurate medical information systems using different data mining techniques.

3. Research objective

The major objective of this research work is to show the contribution of data mining in assessing lifestyle based diseases. An effort is made to review the existing literature of different key authors whose research work has contributed to both patients and experts. The key points (disease, methodology, results, accuracy) of the different research works, along with the tools or techniques used, are highlighted. Finally, the focus is on determining the areas that require more attention from data mining techniques. The existing literature is studied to respond to the following queries:

• What is a longevity or civilization disease?
• Uncovering the association of eating habits and social and economic behavior with lifestyle based diseases.
• Revealing the significance of data mining in the early diagnosis of civilization diseases.
• Contribution of data mining in analyzing data related to heart patients.
• Significance of research work in scrutinizing data related to diabetic patients.
• Highlighting some important healthcare data sets.
• Analyzing the need for and effect of data preprocessing in healthcare diagnostic systems.
• Identifying diseases where significant data mining has been carried out.
• Highlighting the domains of medical science where data mining is still required.
• Need for predictive models for correcting lifestyle based disorders in ophthalmology and the oral and digestive systems.

4. Lifestyle based diseases

Every individual has his or her own way of living, which is evident in how he or she copes with the physical, psychological, social and economic environment on a regular basis. Lifestyle reflects a person's attitude, behavior, eating habits, and social and economic values. It reflects how people perceive themselves and how they want to be seen by society. The lifestyle of a person is very much affected by genes, culture, society and region. In general, there are two major pillars of lifestyle, viz. Eating habits (intake of dietary and sugary products, alcohol consumption, smoking etc.) and Social/Economic Behavior (family background, social environment, economic conditions, employment, working conditions etc.). A healthy or unhealthy lifestyle is likely to be passed on across generations. Case et al. [10] observed that there is around a 27% chance that a child will adopt the same lifestyle as his/her parents.

The bad lifestyle (sedentary habits, laziness, alcohol, tobacco, smoking, drugs, overdose of dairy products, oil, sugary products) of a person may lead to several human disorders (cardiovascular, diabetes, oral, ophthalmology, digestive), also known as lifestyle diseases. In adults, social and economic behavior is generally responsible for lifestyle based diseases. A poor lifestyle plays a significant role in oral diseases. Some of the major longevity diseases are given below:

– Cardiovascular
– Diabetes
– Oral
  ◦ Cancer
  ◦ Gum disease
  ◦ Tingling
  ◦ Trauma
  ◦ Bad breath
  ◦ Temporomandibular disorder
– Ophthalmology
  ◦ Glaucoma
  ◦ Diabetic retinopathy
  ◦ Cataract
– Digestive System
  ◦ Stomach cancer
– Neurology
  ◦ Stroke
  ◦ Autism

This paper briefly summarizes the efforts of different researchers in mining information for lifestyle based human disorders.

5. Data mining techniques (supervised learning)

The concept of data mining originated from three different fields, viz. Statistics, Artificial Intelligence and Machine Learning. Several heuristics have been proposed to improve the efficiency of the data mining process. Classification is a dominant data mining technique. In general, classification is categorized as single or multi class. In single class classification, there is only one class label to be recognized. The elements that belong to the class are known as normal and the rest are categorized as anomalies. Classification procedures work on training and testing data sets. The working of the classification model is represented in Fig. 1. In the first phase, the system is trained using existing labeled data. The training data set is then used to segregate the data into diverse categories based upon its parameters and results. In other words, training is one of the major phases of the classification process and is intended to impart the intelligence on which the mining of data is based. Technically, training is a machine learning process. The quality of mining depends on the nature and level of the training data set. The training data set is used to develop a classification or predictive model.

Fig. 1. Working principle of data mining classification process.

To avoid overfitting and to refine the classification model, the training data set is sometimes decomposed into a training set and a validation set. The role of the validation set is to improve the performance of the model under construction. Finally, the accuracy of the developed system is tested with test data. The accuracy can be measured using a confusion matrix, which counts the cases recognized as true negative, true positive, false positive and false negative. There are several classification techniques. The commonly used classification approaches are given below.

Decision tree induction [3,11] is concerned with learning decision trees. A decision tree is a hierarchical structure consisting of leaf nodes, non-leaf nodes and edges. A testing condition is represented by a non-leaf node. An edge represents the result of a test. The decision tree classification approach is easy to implement. However, for a complex classification problem it requires a lot of training data. The amount of data required increases as the dimension of the classification increases. Class labels are represented by leaf nodes. ID3, C4.5 and CART are some of the milestones of DTI. Some of the major characteristics of DTI are:

• No need of domain knowledge.
• Suitable for exploratory knowledge discovery.

Naive Bayes [12] classification borrows its idea from statistics and probability. It is based upon posterior and prior probability. One of the foremost problems in this approach arises when a zero probability is obtained. To handle the zero probability case, the concept of 'Laplace' estimation was introduced. The best part of this approach is that in general it has a minimum rate of error.

The rule based classification [13] approach is based upon a set of rules. The rules can be represented in 'IF-THEN' form, extracted from a decision tree, or generated using a sequential covering approach. A rule consists of an antecedent (left side of the rule) and a consequent (right side of the rule). One of the major flaws of the rule based approach is the sharp cut off. To overcome the sharp cut off problem, the concept of fuzzy logic is used.

SVM (Support Vector Machine) [14] is used to classify both linear and non-linear data. In SVM, the training data are mapped into a higher dimension using a non-linear mapping. The credit for SVM goes to Vladimir Vapnik. One of the best aspects of SVM is that it leads to a high rate of accuracy. It is applicable to both prediction and classification. It is prominently used in digit recognition, voice identification and object recognition. SVM is strongly based upon the concept of a hyperplane, which is used to separate elements of different classes. For linear data, the maximal marginal hyperplane is represented as

$$d(X^T) = \sum_{i=1}^{k} C_i \alpha_i S_i X^T + b_0$$

Here, $C_i$ is the class label, $\alpha_i$ is the Lagrangian multiplier, $S_i$ is a support vector and $X^T$ is the test data.

A neural network [15] is a soft computing technique that borrows its idea from the human mind. Neural networks are used for both single and multi class problems. The idea appeared in 1943; however, it came into practical use in the 1980s. Besides solving the problem, a neural network can also learn from former systems or applications. NNs are self organizing and adaptive in learning.

'Genetic Algorithm' [16,17] borrows its essential features from natural genetics and lets a population composed of numerous individual chromosomes evolve under specified selection rules to a state that optimizes the objective function. It generally employs heuristics like 'Selection', 'Crossover' and 'Mutation' to develop better solutions. GA can be applied to an enormously wide range of problems like Task Scheduling, Image Processing, Machine Learning, Data Mining, Medical Sciences etc. It starts from a set of solutions rather than a single solution. The initial population is generated randomly. Each solution of the problem is represented by encoding it as a chromosome. Every chromosome has a fitness value associated with it. The collection of chromosomes is called the population. The population at a particular instant is called a generation. The genetic properties of two chromosomes are blended to breed healthier offspring using an operation called crossover. The mutation operation is used to modify the generated child to make it more effective.

The performance of a data mining process is measured using the confusion matrix. The four major parameters of the confusion matrix indicate the efficiency of the technique in finding true positive, true negative, false positive and false negative cases.

6. Data mining techniques and their role in medical science

No doubt, medical informatics is changing the scenario of the medical industry. With this advancement, one is able to diagnose and cure problems effectively. The latest vaccination and other treatment methodologies have eradicated many fatal diseases. However, there are still a number of chronic diseases (Cardiovascular, Respiratory, Malignant, Digestive, Diarrheal, Depression, Malaria, Cancer, Diabetes etc.) that do not appear to be vanishing. A number of researchers have studied the use of data mining techniques and their role in medical science. The remaining part of this section gives brief details about the data mining techniques and the results of the research carried out in the medical field. Nowadays, heart disease, diabetes, brain stroke and cancer are affecting people badly.

6.1. Datasets used in disease diagnosis

A disease diagnostic system in a healthcare application must be trained on existing healthcare data. The data for training a diagnostic system can be generated through different modes like screening and physical and clinical diagnosis [18]. This data can be represented in the form of text, numbers, sound, images and signals. For a disease diagnostic system, the data is initially required to train the system. The training data can be collected from hospitals, clinics, research centers and online repositories. It is difficult to collect data from hospitals and clinics. However, one can easily obtain the required data set from online repositories. UCI (UC Irvine) is an important online repository that contains a variety of data sets related to different domains. Some of the important multivariate and time-series data related to lifestyle based human disorders are available in the UCI repository. A number of researchers have used these data sets for their experimentation and analysis [19] (Table 1).

From existing research, it is observed that diabetes patient data are generally represented in the form of text and numeric attributes. However, some researchers have also used tongue images in diabetes diagnosis. Moreover, signals and images are the most common media for representing the details of cardio and cancer patients. Some of the important and common parameters generally used in the early diagnosis of different lifestyle based human disorders are mentioned in Fig. 2.

Fig. 2. Attributes of datasets for cardio, diabetes and cancer diagnosis.

Table 2 gives brief details of the attributes, instances, type and sources of data used in the early diagnosis of lifestyle based human disorders.

6.2. Diabetes diagnosis using classification approaches

Diabetes is a silent killer disease. Some of the contributing factors behind diabetes are lack of physical activity, sedentary habits and obesity. Millions of people across the world are affected by this so called modern society disease. Improper treatment of diabetes may lead to many other physical problems and even death in some cases. In general, diabetes is divided into three categories, viz. Type 1, Type 2 and Gestational diabetes. A number of researchers have tried to develop prediction models to deal with this life threatening disease. The remaining part of this section briefly describes the contribution of different researchers in this domain.

Iyer, Jeyalatha et al. [21] have performed a study to diagnose diabetes using different classification techniques. The authors stated that diabetes is one of the dominating diseases that

Table 1
Common datasets for lifestyle based human disorder diagnosis.
Data set | Characteristic | Number of instances | Attribute characteristics | Number of attributes
PIMA | Multivariate | 768 | Integer, Real | 8
130-US hospital | Multivariate | 100000 | Integer | 55
AIM '94 | Multivariate, Time-Series |  | Categorical, Integer | 20
Cardiotocography | Multivariate | 2126 | Real | 23
Heart Disease | Multivariate | 303 | Categorical, Integer, Real | 75
Echocardiogram | Multivariate | 132 | Categorical, Integer, Real | 12
PAMAP2 | Multivariate, Time-Series | 3850505 | Real | 52
SPECT | Multivariate | 267 | Categorical | 22
SPECTF | Multivariate | 267 | Integer | 44
Statlog | Multivariate | 270 | Categorical, Real | 13
Breast Cancer | Multivariate | 286 | Categorical | 9
Breast Cancer Wisconsin | Multivariate | 569 | Real | 32
Breast Tissue | Multivariate | 106 | Real | 10
Lung Cancer | Multivariate | 32 | Integer | 56
Liver Disorders | Multivariate | 345 | Categorical, Integer, Real | 7
Indian Liver Patient Dataset | Multivariate | 583 | Integer, Real | 10
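
As a rough illustration of how a dataset of the size of PIMA in Table 1 (768 instances) is used in the train-and-test workflow of Section 5, the sketch below splits labeled rows into training and test sets and derives accuracy, sensitivity and specificity from the four confusion matrix counts. The 500/268 split mirrors the one reported in this review for Olaniyi et al. [45]. The rows, labels, function names and the fixed-threshold rule are hypothetical placeholders, not data or methods from any of the cited studies.

```python
import random

def split_dataset(rows, train_size, seed=0):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def summarize(counts):
    """Derive accuracy, sensitivity and specificity from the four counts."""
    tp, tn, fp, fn = (counts[k] for k in ("tp", "tn", "fp", "fn"))
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical stand-in data: 768 (value, label) rows, not real patient records.
rng = random.Random(42)
rows = [(v, 1 if v > 0.6 else 0) for v in (rng.random() for _ in range(768))]
train, test = split_dataset(rows, train_size=500)

# Placeholder "model": predict positive when the value exceeds a fixed threshold.
predicted = [1 if value > 0.5 else 0 for value, _ in test]
actual = [label for _, label in test]
print(summarize(confusion_counts(actual, predicted)))
```

In a real study the placeholder rule would be replaced by a classifier fitted on `train`, and a validation set could be carved out of `train` in the same way to tune it before the final test.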

Table 2
Datasets and their attributes used in earlier disease diagnosis research.
Authors | Instances | Attributes | Type of data | Source of data
Ilayaraja M., Meyyappan [20] | 1000 | 19 | Text and numeric | Not mentioned
Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly [21] | 768 | 8 | Integer, Real | PIMA, UCI
Haofan Yang, Yi-Ping Phoebe Chen [22] | 500 | 06 | Lung cancer cases, pathology and clinical report | Cancer Genome Atlas
B. Venkatalakshmi, M.V. Shivsankar [23] | 294 | 13 | Text and numeric | UCI
Abhishek Taneja [24] | 7339 | 15 and 8 | Text and numeric | PGI Chandigarh
K. Rajesh and V. Sangeetha [25] | 760 | 9 | Integer, Real | PIMA, UCI
Soni and Ansari [26] | 909 | 14 | Text and numeric | Cleveland Heart Disease database
R. Ramani et al. [27] | 322 | 13 | Images | MIAS
R. Delshi Howsalya Devi, M. Indra Devi [28] | 303 | 14 | Text and numeric | Cleveland (UCI)
Jianfeng Zhang et al. [29] | 827 | 23 | Tongue images | TCM hospital, Shanghai
Rajesh Kumar, Rajeev Srivastava, and Subodh Srivastava [30] | 1000 | 115 | Microscopic biopsy images | Histology Image Dataset
K. Vimala and Dr. V. Kalaivani [31] | 171 | Not mentioned | ECG Bio signal | Hospital
Prakash D., Uma Mageshwari T., Prabakaran K., and Suguna A. [32] | Not mentioned | Not mentioned | Phonocardiogram Signals | Not mentioned
Hang Wu, Sahong Kim, and Keunsung Bae [33] | 325 | 10 | Heart Sound Signals | Clinical audio CD
Vishakha Pareek, R.K. Sharma [34] | 160 | Not mentioned | Voice signal | Not mentioned
Du-Yih Tsai and Yongbum Lee [35] | 90 | Not mentioned | Ultrasonic images | Gifu University Hospital
P. Rajeswari, G. Sophia Reena [36] | 345 | 7 | Textual and numeric | UCI
Gouda I. Salama, M.B. Abdelhalim, Magdy Abd-elghany Zeid [37] | 699 | 11 | Image | Wisconsin Breast Cancer (Original), UCI
Gouda I. Salama, M.B. Abdelhalim, Magdy Abd-elghany Zeid [37] | 569 | 32 | Image | Wisconsin Diagnosis Breast Cancer
Gouda I. Salama, M.B. Abdelhalim, Magdy Abd-elghany Zeid [37] | 198 | 34 | Image | Wisconsin Prognosis Breast Cancer
Disha Sharma, Gagandeep Jindal [38] | 1000 | Not mentioned | Images | Lung Image Database Consortium
A.S. Aneesh Kumar, C. Jothi Venkateswaran [39] | 2453 | 15 | Text and numeric | Sir Ivan Stedeford Hospital, Chennai
Santi Wulan Purnami [40] | 768 | 13 | Text and numeric | UCI (PIMA)
Santi Wulan Purnami [40] | 2701 | 13 | Text and numeric | V.A. Medical Center, Cleveland Clinic Foundation
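
Several of the studies summarized in this review rely on SVM, so a small worked example of the linear decision function given in Section 5 may help. The sketch below evaluates d(X) as the sum over support vectors of C_i · α_i · (S_i · X) plus b_0 and assigns a class by its sign. The two support vectors, multipliers and bias are hypothetical values chosen by hand for a toy two-point problem; they do not come from any dataset in Table 2, and the function names are illustrative only.

```python
def svm_decision(x, support_vectors, class_labels, multipliers, bias):
    """Evaluate d(X) = sum_i C_i * alpha_i * (S_i . X) + b0 for one test point."""
    total = bias
    for s, c, alpha in zip(support_vectors, class_labels, multipliers):
        dot = sum(si * xi for si, xi in zip(s, x))  # inner product S_i . X
        total += c * alpha * dot
    return total

def classify(x, *model):
    """Assign +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if svm_decision(x, *model) >= 0 else -1

# Hypothetical model for a toy problem: (3, 3) belongs to class +1, (1, 1) to class -1.
support_vectors = [(3.0, 3.0), (1.0, 1.0)]
class_labels = [1, -1]        # C_i
multipliers = [0.25, 0.25]    # alpha_i (assumed, hand-derived for this toy case)
bias = -2.0                   # b0
model = (support_vectors, class_labels, multipliers, bias)

for point in [(3.0, 3.0), (1.0, 1.0), (4.0, 4.0), (0.0, 0.0)]:
    print(point, svm_decision(point, *model), classify(point, *model))
```

With these assumed values the two support vectors evaluate to +1 and -1, i.e. they lie exactly on the margins, while points further from the separating line give values of larger magnitude.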

have affected millions of people worldwide. The authors said that women are badly affected by this disease. They have described the different types, symptoms, diagnosis and treatment of diabetes. The authors showed that the use of decision tree and naive Bayes assists in the diagnosis of diabetes. The deviation between the experimental and expected outcomes is insignificant with the decision tree and naive Bayes data mining techniques.

Rajesh and Sangeetha [25] have explained the role of data mining techniques in the diagnosis of diabetes. The authors stated that the success of medical science depends heavily upon the reliable prediction of different parameters. They have taken different parameters, viz. number of pregnancies, diastolic blood pressure, skin fold thickness, BMI, diabetes pedigree function, age etc., to probe the performance of the classification algorithms in diagnosing diabetes. The authors have compared different classification algorithms, viz. C-RT, CS-RT, C4.5, ID3, LDA, SVM, naive Bayes etc. From the confusion matrix of the dataset, the authors obtained a 91% accuracy rate with C4.5. Santi Wulan et al. [40] have presented an innovative smooth SVM solution for the diagnosis of diabetes. The authors have used a modified spline. Using the modified spline SVM, they achieved an accuracy rate of up to 96.48%.

Santhanam et al. [41] have used a hybrid of genetic algorithm and SVM. With this combination, the authors achieved a diabetes diagnosis precision rate of 98.79%. They have tested 50 different cases to assess the performance of the proposed classification approach. Nilesh Jagdish Vispute et al. [42] made an empirical assessment of diverse classification techniques used for the analysis of diabetes. The authors have used the WEKA tool for their experimentation. A 10-fold cross validation approach has been used. The authors have compared various techniques, viz. naive Bayes, J48 Tree, SMO, RepTree and Random tree. They found naive Bayes to be the best classifier. The maximum rate of accuracy achieved is 76.03%. Peter [43] stated that diabetes mellitus is a persistent and dominant type of disease. Hyperglycemia is one of the major reasons behind it. The author has presented an assessment of the diverse algorithms used to mine this disease. The features of NN, SVM and a hybrid approach were compared on different performance metrics, viz. classification accuracy and the processing time required to predict. The author concluded that with a hybrid data mining approach a classification accuracy of up to 96.65% can be achieved. Velide Phani Kumar et al. [44] have developed a predictive model for the diagnosis and treatment of diabetes. The authors have compared different classification techniques. The experiments were carried out on nine different parameters. The authors found J48 to be the best classifier, as it gives 100% accuracy on their data set. They also found that the classification time of J48 was better than that of the other techniques.

Ebenezer Obaloluwa Olaniyi et al. [45] have detected diabetes using ANN. The authors stated that a person suffering from diabetes is normally affected by problems like frequent urination, polyphagia, polydipsia etc. The authors experimented with 500 training samples and 268 testing samples. They compared various techniques, viz. BSS, KNN, C4.5 and BPNN (back propagation neural network). During experimentation, the authors concluded that BPNN is able to attain 82% accuracy, which is higher than that of the other techniques. Aljumah et al. [46] have performed a study on Saudi Arabians. The authors found that diabetes is increasing significantly among Saudi Arabian people. The study covered both young and old people. As per their study, the order of treatment is different for young and old patients. The authors have used one of the data mining tools of Oracle Corporation. They performed a rigorous study on the basis of different parameters, viz. drug, diet, weight, smoking, exercise and insulin. The authors concluded that drug treatment is effective for both young and old diabetes patients; however, it is dominant in older patients. They further revealed that, except for smoking and insulin, the predicted effectiveness of treatment using diet control, weight control and exercise is higher in older patients than in younger ones.

Marinov, Mosa et al. [47] have collected data from the MEDLINE database. The authors examined the role of data mining in diabetes research. As per the authors, data mining is very beneficial for diabetes research. Moreover, it will improve the health status of diabetic persons more effectively and efficiently.

S. Nagarajan et al. [48] used four algorithms, ID3, naive Bayes, C4.5 and Random tree, to diagnose Gestational Diabetes in pregnant women. The authors mined a data set of 600 instances. Each instance has six attributes. The study showed that Random Tree, with an error rate of 0.000 (lowest among all) and an accuracy of 0.938 (highest among all), is the best algorithm. K.R. Lakshmi et al. [49] compared ten dissimilar data mining algorithms: C4.5, SVM, k-NN, PNN, BLR, MLR, PLS-DA, PLS-LDA, k-means and Apriori, to get the best results for diagnosing diabetes in patients. The authors found that PLS-DA gives optimal results as compared to the other techniques. The rate of accuracy achieved with PLS-DA is 74%.

6.3. Diagnosis of heart problems using data mining approaches

Cardiovascular disease is one of the most dangerous diseases. A variety of circumstances influence the heart of a person. As per WHO records, more than ten million casualties occur due to heart related problems. Indians are significantly affected by this disease. Problems related to blood vessels, coronary arteries, abnormal heart rhythm and stroke are the dominant heart related diseases. Data mining methods or techniques have been used to predict the risk of cardiovascular disease so that one can detect it at an early stage and receive adequate treatment.

Ilayaraja and Meyyappan [20] have developed a method to predict the risk of heart disease using frequent itemsets. The authors have taken 19 different symptoms which are the reasons behind heart disease. They compared their results with other techniques, viz. the apriori algorithm, the IMSIA algorithm, the semi-apriori algorithm etc. The authors found that their algorithm works well as compared to the other existing techniques. B. Venkatalakshmi and M.V. Shivsankar [23] stated that untimely diagnosis of heart problems may lead to illness or even death in some cases. The authors have tried to diagnose heart disease using two different data mining techniques, viz. naive Bayes and decision tree. The authors found

that naive bayes gives optimal results as compared to decision tree. The accuracy of naïve bayes is 85.03%. Taneja [24] has designed a predictive model to mine a dataset based upon Transthoracic Echocardiography Reports. Author collected data from one of the reputed hospitals of Chandigarh. Three different supervised machine learning techniques viz. J48 classifier, multilayer perceptron and naive bayes have been used. The accuracy of classification obtained was 95.56%.

Soni et al. [26] have provided an overview of data mining techniques used in predicting heart diseases. Authors have taken 15 different parameters for their study. Authors have studied the working and performance of KNN (k-nearest neighbor), naïve bayes and decision lists. Authors concluded that the performance of decision tree is best. The rate of accuracy is 99%. Moreover, in some circumstances naïve bayes gives a solution very close to that given by the decision tree.

Masethe and Masethe [50] have tried to predict heart disease. Authors have compared and contrasted the performance of five different classification algorithms i.e. J48, RepTree, naïve bayes, CART and bayes net. Authors also used different evaluation criteria like kappa statistics, mean absolute error, root mean square error, relative absolute error and root relative squared error. Authors attained 99% accuracy with J48.

Dangare and Apte [51] stated that data mining techniques are very useful in predicting heart disease based upon the history of the patient. Authors stated that normally 13 attributes like sex, blood pressure, cholesterol etc. are used to diagnose the situation. Authors extended their study by adding two more parameters viz. obesity and smoking. Authors compared the performance of different data mining techniques in predicting the status of heart disease. Authors found that neural network gives 100% accuracy in determining the heart disease status. Sudha et al. [52] stated that stroke is a life threatening disease and is one of the major causes of death. Authors tried to analyze the data of stroke patients. They used different data mining techniques viz. neural network, naïve bayes and decision tree. Authors got better results with the neural network. Authors proposed a model that predicts the nature of data based upon feature subset selection. The accuracy of the proposed framework is 91%. Panzarasa et al. [53] have applied data mining techniques to the stroke register of one particular area of Italy. Authors analyzed around 5000 different cases. The study was carried out to analyze the delay between stroke onset and hospital admission. Authors used a classification tree to mine the data and obtained both expected and unexpected results. G. Subbalakshmi et al. [54] developed a decision support system for heart related diagnosis using the naïve bayes data mining technique. They took basic parameters like age, gender, blood pressure etc. to foresee the likelihood of getting heart disease.

V. Krishnaiah et al. [55] have developed a fuzzy KNN based prophetic system for heart diseases. Authors claimed that a number of other researchers have merely used the intrinsic classifiers of the mining software for diagnosing or forecasting the data related to heart patients and have ignored data uncertainty. Therefore, authors used a fuzzy approach to remove the uncertainty and to improve the accuracy level of prediction related to heart patients. Authors worked on a data set consisting of 550 records. Each record has 13 different parameters. Saba Bashir et al. [56] have proposed an innovative ensemble classifier based upon five different techniques viz. naïve bayes, decision tree induction based on Gini index, decision tree induction based on information gain, memory based learner and SVM. The authors' approach (MV5) was found to be an improved one. The rates of forecasting accuracy, sensitivity and specificity attained using MV5 are 88.52%, 86.96% and 90.83% respectively. Nishara and Gomathy [57] have used a hybrid data mining technique to diagnose heart disease. Authors have proposed a hybrid model combining classification (C4.5), clustering (k-means) and association (maximal frequent itemset) and found it more promising as compared to other techniques. The rates of precision, recall and accuracy achieved with the proposed model are 0.82, 0.89 and 89% respectively. Mai Shouman et al. [58] have used k-NN in the diagnosis of heart related problems. Authors used the standardized benchmark data set from the Cleveland Clinic Foundation. There are around 300 rows in the experimental data set. Authors varied the value of k from 1 to 13. The rate of accuracy varied between 94% and 97.1%. In addition, authors observed that voting does not improve the performance of k-NN in forecasting heart related problems.

6.4. Role of data mining in examining other diseases

Besides diabetes and cardiovascular disease, there exist a number of other problems that affect the life and system of a person. The remaining part of this sub-section depicts the role of data mining in diseases other than diabetes and cardiovascular disease.

Haofan Yang et al. [22] have designed a framework to diagnose lung cancer. The framework finds the association between the clinical and pathological information related to lung cancer patients. Authors used the Apriori algorithm to train the system. Gouda I. Salama et al. [37] have explored the working of individual classifiers and fusions of classifiers in breast cancer diagnosis. Authors found that the best rate of classification is achieved with a four-classifier fusion i.e. when four different types of classifier are blended together.

Ling Chen et al. [59] have developed a model called MyPHI that is used to compute the personal health index of a person. With MyPHI, one is able to know the status of one's health. Authors found that their system is better than other benchmarks of the healthcare industry. Swati Gupta [60] tried to diagnose one of the eye diseases i.e. diabetic retinopathy. Authors have examined features viz. micro aneurysm and exudates. Authors compared the results of SVM and KNN, found better results with SVM, and realized an accuracy rate of 86%. Lincoln F. Silva et al. [61] have developed a predictive system for diagnosis of breast cancer. Authors worked on the patient data of the hospital of Antonio Pedro University. Authors used Bayesian, neural network and decision tree classifiers to classify the patients suffering from breast cancer. Authors found that the performance of neural network and decision tree is the same, with an accuracy rate of 90.91%. The data and training set used for experimentation were very small in size. Chaitali et al. [62] have compared the results of different classifica-

Table 3
Survey of data mining techniques.
Authors and year of publication | Type of disease | Methodology and tool | Results
Gwenolé Quellec, Mathieu Lamard, Ali Erginay, Agnès Chabouis, Pascale Massin, Béatrice Cochener, Guy Cazuguel [68] | Eyes | Apriori | Authors have successfully used mining techniques to diagnose pathological problems of the eyes.
Ilayaraja M., Meyyappan [20] | Heart | Apriori algorithm, IMSIA, Semi-priori algorithm | A predictive method is developed to predict the risk of heart problems.
Aiswarya Iyer, Jeylatha and Ronak Sumlbaly [21] | Diabetes | J48, Naïve Bayes | Diabetes has affected millions of people over the globe. Authors have applied classification algorithms to data collected from pregnant women. Authors have computed the confusion matrix using the WEKA tool. Experimental results show that naïve bayes outperformed the results obtained with decision tree.
Haofan Yang, Yi-Ping Phoebe Chen [22] | Lung Cancer | Apriori algorithm | Authors have proposed a framework to find associations between the clinical information and pathological information for lung cancer patients. The framework was found to be efficient in lung cancer pathologic staging.
Chaitali Vaghela, Nikita Bhatt, Darshana Mistry [62] | Breast Cancer, Heart, Diabetes | Naïve bayes, RBFN (Radial base function network), neural network | Authors have compared the results of different classification techniques for diseases like breast cancer, heart disease and diabetes. For breast cancer, authors have tested 286 samples and the rate of accuracy achieved is 75.4%. For heart disease, 303 samples were tested and RBFN (Radial base function network) outperformed the other techniques. For diabetes, 768 instances were taken and naïve bayes gives the best performance. The rate of accuracy attained is 75.39%.
Masethe and Masethe [50] | Heart | J48, RepTree, Naïve Bayes, CART and Bayes NET | A comparative study of five different classification algorithms is carried out to analyze heart related problems. To evaluate the efficiency of the different classification algorithms, several parameters viz. kappa statistics, mean absolute error, root mean square error, relative absolute error and root relative squared error are also computed.
S. Peter [43] | Diabetes | Neural Network, Support Vector Machine, Hybrid approach | A survey is conducted on different classification techniques used to diagnose diabetes. Authors found that the hybrid approach predicts more accurately as compared to other classification approaches.
Kavita Chaudhary, Pinki Bajaj [69] | RCT | Cross validation and Decision Tree | Authors have used a classification technique to diagnose the need for root canal treatment. It was found that by using certain metrics like age, brushing, smoking, tooth decay etc., one is able to predict the need for RCT by using classification techniques of data mining.
Aljumah, Ahmad et al. [46] | Diabetes | Oracle Data Corporation | Authors carried out a meticulous study of the different types of treatment suggested for both young and older patients. As per the study, except for smoking and insulin, the prediction of treatment effectiveness using weight, diet, exercise and drug is higher in older than in young patients.
Taneja [24] | Heart | Decision Tree Classification, Bayesian Classifier and Neural Network | Author has analyzed the data collected from PGI, Chandigarh, India. The WEKA tool has been used to develop a predictive model for heart disease. Author found optimal results with J48. 95.56% accuracy has been achieved.
Dangare and Apte [51] | | | Authors have used 15 different parameters rather than 13 for predicting heart disease. Authors have taken 571 records out of which 303 are used for training and the rest are used for testing. Authors found that for their particular set of data, neural network gives 100% accurate results.
Sudha, Gayathri et al. [52] | Stroke | Decision Tree, Naïve Bayes, Neural network | Stroke is one of the most dangerous diseases and may lead to death or lifetime disability. Authors tried to predict the chances of stroke by measuring certain parameters. It was found that the neural network gives better performance as compared to decision tree and naïve bayes.
K. Rajesh and V. Sangeetha [25] | Diabetes | Not specified | Authors have performed a comparative study of different classification algorithms in diagnosing diabetes. Authors found that the C4.5 classification algorithm gives better results. Authors also suggested that one can further improve the design of C4.5 to achieve better results.
Soni, Ansari et al. [26] | Heart | Decision Tree, Bayesian Network, KNN, Neural Network, Genetic Algorithm | Authors performed experimentation for predicting the problems related to heart diseases. Authors found that the decision tree outperforms other classification techniques. The rate of accuracy achieved using naïve bayes, decision tree and clustering is 96.5%, 99.2% and 88.3% respectively.
Marinov, Mosa et al. [57] | Diabetes | | Authors have performed a systematic review of data mining techniques used for diabetes. Authors have collected 31 different papers indexed in Medline. Authors stated that one should analyze the genomic data as diabetes is a genetic disease.
Subbalakshmi, Kumar et al. [54] | Heart disease | Naïve Bayes | Authors developed a DSS for predicting the risk factor in heart disease using naïve bayes. Authors have taken basic parameters for developing the system.
Silvia Panzarasa, Silvana Quanglini, Lucia Sacchi, Anna Cavallini, Giuseppe Micieli and Mario Stefanelli [53] | Stroke | Not specified | Authors used data mining techniques to determine the effect of non-compliance with clinical guidelines proposed for stroke patients. The study is confined to one region of Italy. Authors have analyzed the Stroke Unit Network (SUN) register. The objective of the study was to analyze the delay between stroke onset and hospital admission. Authors used a classification tree to mine the data and obtained both expected and unexpected results.
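
Several of the surveyed studies report agreement and error measures (kappa statistic, mean absolute error, root mean square error, relative absolute error and root relative squared error) in addition to plain accuracy. The following is a minimal sketch of how such measures can be computed, assuming the standard textbook definitions and hard 0/1 class labels; the function names and the sample labels are hypothetical and are not taken from any of the cited papers.

```python
import math


def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum((actual.count(l) / n) * (predicted.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


def error_measures(actual, predicted):
    """MAE, RMSE, RAE and RRSE for numeric targets (here: 0/1 labels)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors of a trivial predictor that always outputs the mean.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / base_abs,
        "rrse": math.sqrt(sq_err / base_sq),
    }


# Hypothetical labels (1 = diseased, 0 = healthy); illustrative only.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

kappa = cohen_kappa(actual, predicted)
errors = error_measures(actual, predicted)
print(f"kappa = {kappa:.2f}")
for name, value in errors.items():
    print(f"{name} = {value:.2f}")
```

Mining tools may compute the error measures on predicted class probabilities rather than on hard labels, so values reported in the literature need not match a calculation of this kind exactly.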

tion techniques for diseases like breast cancer, heart disease and diabetes. For breast cancer, authors tested 286 samples and observed a diagnosis accuracy of 75.4%. For heart disease, 303 samples were tested and RBFN (Radial base function network) outperformed the other techniques. For diabetes, 768 instances were taken and naïve bayes gives the best performance. The rate of accuracy attained is 75.39%. P. Ramachandran et al. [63] developed a hybrid model for early detection and prevention of cancer. Authors have used a hybrid approach of clustering and decision tree. Authors evaluated the performance of their model by using support vector machine. The evaluation has been done based upon three different statistical parameters viz. accuracy, sensitivity and specificity. The cancer detection is based upon several parameters like age, education, living area, anemia, family history, weight loss and several habits like smoking, alcohol, chewing etc.

Hayrettin Evirgen et al. [64] have developed a predictive system for diagnosis of retinopathy. Retinopathy is one of the eye problems that occur mainly in people who have been affected by diabetes for a significant period of time. Authors showed that with the use of naïve bayes, one is able to achieve 89% accuracy in the diagnosis of retinopathy. A. Kalairasai et al. [65] have developed a framework for detecting oral cancer. Authors have experimented with the NMDS database. Authors have compared different techniques viz. apriori algorithm, C4.5, SVM, naïve bayes and random forest. On the experimental data, random forest gives the best results with the NMDS dataset. The rate of accuracy attained is 83%. Authors proposed that a hybrid combination of C4.5 and random forest may further enhance the accuracy level of the predictive model. Mohanty et al. [66] have developed a classification model for detecting breast cancer. Authors have taken 26 parameters and found them more suitable for detecting the cancer. Association rule mining is used to classify normal and affected cases. Shweta Khraya [67] has reviewed the literature on the diagnosis and prospects of breast cancer. Author found decision tree to be an optimal predictor. The rate of accuracy achieved is 93.62%. Author suggested developing a web based application so that people from remote areas can self-diagnose their status based upon the concerned parameters.

Table 3 highlights the crux of the above mentioned research works. The performance of data mining techniques strongly depends upon the selected parameters, the training data and the testing data set.

Table 4 represents the accuracy and the data mining approaches that some of the key researchers used to diagnose different types of diseases.

Fig. 3 represents the level of accuracy achieved using different classification techniques for detecting diabetes. It is revealed that, depending upon the selection of parameters and the data set used, the prediction accuracy achieved lies between 75 and 100%.

Fig. 3. Accuracy in diagnosis of diabetes.

Fig. 4 represents the level of accuracy achieved using different classification techniques for detecting heart diseases. It is

Table 4
Accuracy level of used data mining techniques.
Authors and year of publication | Type of disease | Accuracy (%) | Best approach
Aiswarya Iyer, Jeylatha and Ronak Sumlbaly [21] | Diabetes | 79.56 | Naïve bayes
Masethe and Masethe [50] | Heart | 99 | J48
S. Peter (2014) [43] | Diabetes | 96.68 | Hybrid approach
Kavita Chaudhary, Pinki Bajaj (2014) [69] | RCT | 93.06 | Decision Tree with Cross Validation
Taneja (2013) [24] | Heart | 95.56 | J48
Sudha, Gayathri et al. (2012) [52] | Stroke | 91 | Neural network
K. Rajesh and V. Sangeetha (2012) [25] | Diabetes | 91 | C4.5
Jyoti Soni et al. (2011) [26] | Heart | 99.2 | Decision Tree
Silvia Panzarasa, Silvana Quanglini, Lucia Sacchi, Anna Cavallini, Giuseppe Micieli and Mario Stefanelli (2010) [53] | Stroke | 99 | Neural network
Chaitrali S Dangare, Sulabha S. Apte (2012) [51] | Heart disease | 100 | Neural network
B. Venkatalakshmi, M.V. Shivsankar (2014) [23] | Heart disease | 85.3 | Naïve bayes
A. Kalairasai (2013) [65] | Oral Cancer | 83 | Random forest
Shweta Khraya (2012) [67] | Breast Cancer | 93.62 | Decision tree
Lincoln F. Silva, Giomar O. Sequeiros, Maria Lúcia O. Santo, Cristina A. P. Fontes, Débora C. Muchaluat-Saade and Aura Conci (2015) [61] | Breast Cancer | 90.91 | Decision Tree and Neural Network
Santi Wulan Purnami, Jasni Mohamad Zain, Abdullah Embong (2010) [40] | Diabetes | 96.58 | Smooth SVM
T. Santhanam, M.S. Padmavathi (2015) [41] | Diabetes | 98.79 | GASVM
Swati Gupta, Karandikar AM (2015) [60] | Ophthalmology | 86 | SVM
Velide Phani Kumar (2014) [44] | Diabetes | 100 | J48
Ebenezer Obaloluwa Olaniyi, Khashman Adnan (2014) [45] | Diabetes | 82 | BPNN
Chaitali Vaghela, Nikita Bhatt, Darshana Mistry (2015) [62] | Breast Cancer | 75.4 | Decision tree
Chaitali Vaghela, Nikita Bhatt, Darshana Mistry (2015) [62] | Diabetes | 75.39 | Naïve Bayes
Chaitali Vaghela, Nikita Bhatt, Darshana Mistry (2015) [62] | Heart Disease | 83.2 | RBFN
Mahua Nandy (2013) [70] | Breast Cancer | 96.7 | SVM
M.A. Nishara Banu and B. Gomathy (2014) [57] | Heart Disease | 89 | Hybrid
Mai Shouman et al. (2012) [58] | Heart Disease | 97.1 | k-NN
P. Rajeswari, G. Sophia Reena (2010) [36] | Liver disorder | 96.52 | Naïve bayes
P. Rajeswari, G. Sophia Reena (2010) [36] | Liver disorder | 97.10 | FT Tree
P. Rajeswari, G. Sophia Reena (2010) [36] | Liver disorder | 83.4 | K-Star
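
The accuracy values in Table 4, and the sensitivity, specificity, precision and recall values quoted for some of the surveyed models, are all derived from the confusion matrix of a classifier. As a minimal sketch of how these quantities relate, assuming a binary diagnosis task with the diseased class as the positive class, the class name `ConfusionMatrix` and the counts below are hypothetical and do not come from any surveyed study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary diagnosis task (diseased = positive class)."""
    tp: int  # diseased patients correctly flagged
    fp: int  # healthy patients wrongly flagged
    tn: int  # healthy patients correctly cleared
    fn: int  # diseased patients missed

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def sensitivity(self) -> float:
        # Also called recall: share of diseased patients that are detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of healthy patients that are correctly cleared.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of flagged patients that are actually diseased.
        return self.tp / (self.tp + self.fp)


# Hypothetical counts for 100 test patients; illustrative only.
cm = ConfusionMatrix(tp=40, fp=5, tn=45, fn=10)
print(f"accuracy    = {cm.accuracy():.2%}")
print(f"sensitivity = {cm.sensitivity():.2%}")
print(f"specificity = {cm.specificity():.2%}")
print(f"precision   = {cm.precision():.2%}")
```

Because accuracy alone hides the balance between missed cases and false alarms, two studies reporting the same accuracy may still differ considerably in sensitivity and specificity.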

revealed that, depending upon the selection of parameters and the data set used, the prediction accuracy achieved lies between 85 and 100%.

Fig. 4. Accuracy in diagnosis of heart disease.

From Table 3, it is observed that authors have achieved 100% diagnosis accuracy for heart disease and diabetes. In addition, independent of the size of the testing and training data, the rate of accuracy in diagnosis of different diseases lies between 70 and 100%. However, it would be unrealistic if the size of testing data set grows significantly. It is also found that the accuracy in diagnosis of chronic diseases, namely heart and diabetes, with J48 and decision tree respectively is above par as compared to other classification approaches. In addition, the performance of J48 and decision tree is different for diagnosis of other diseases.

7. Effect of preprocessing on predictive accuracy of lifestyle based disease diagnosis

Before mining, the data collected from different sources like signals, images or reports has to be cleaned and normalized to avoid any noise or data variation. Patient data are highly prone to missing, inconsistent and noisy values as they are collected from different sources. Data inconsistency affects the predictive accuracy of disease diagnosis. Data preprocessing helps to improve the quality of healthcare data. Data cleaning and data reduction are two major categories of data preprocessing techniques. Data cleaning helps to remove noise and to minimize data inconsistency. Data reduction is useful in reducing data redundancy, in summarizing and in fea-

ture selection process. Researchers have found that the use of data cleaning and dimension reduction gives better mining results (Table 5).

The initial data set of patients may suffer from several anomalies. For better predictions, the data should be cleaned by using different methods of preprocessing. Amir R. Razavi et al. [71] have tried to analyze the effect of preprocessing on a breast cancer patient dataset. Authors cleaned the data set of 3949 patients by using different laws of logic. Missing values were handled using expectation maximization. Moreover, CCA (canonical correlation analysis) has been used to reduce data dimension. Authors observed that the use of expectation maximization and CCA improved the classification process. The rate of accuracy achieved using DTI without preprocessing, after replacing missing values and after dimension reduction is 54%, 57% and 67% respectively.

Soni and Ansari [26] have used the textual and numeric Cleveland heart disease dataset. Authors used K-means clustering to extract the homogeneous data. GA is used for feature selection so that an optimal and normalized data set can be constructed.

R. Ramani et al. [27] have used four different types of filters viz. median, adaptive median, mean and Wiener to deal with noisy images. Authors examined the performance of the different filters in the context of three different types of noise: Gaussian, salt & pepper and speckle. It was observed that the results produced by the adaptive median filter are better than those of the other filters.

Jianfeng Zhang et al. [29] have diagnosed diabetes using SVM. GA and PCA are used for data preprocessing. Authors have used tongue images for diabetes diagnosis. GA is used for feature selection. PCA is used to minimize the magnitude of the attributes' range so that a scale from −1 to +1 can be set for all selected attributes.

Gouda I. Salama et al. [37] diagnosed breast cancer using UCI data sets. Authors stated that the use of PCA improved the classification rate of the fusion of J48 and SMO. The rate of accuracy achieved using the above mentioned classifier fusion and feature reduction technique (PCA) is 97.56%.

Rajesh Kumar, Rajeev Srivastava, and Subodh Srivastava [30] have detected cancer using biopsy images. Authors have used preprocessing, segmentation and feature extraction techniques to get more precise diagnosis results. The enhancement techniques were used to remove staining and to handle poor contrast. The study was performed over four different types of tissues viz. connective, muscular, nervous and epithelial. Moreover, different feature extraction techniques like wavelet, HOG, LTE, texture and shape were used. Authors found that the combination of the above mentioned preprocessing techniques with KNN is able to provide a 92.19% rate of accuracy.

Razieh Asgarnezhad et al. [72] have discussed different preprocessing techniques used to handle missing values in a diabetes data set. In addition, different methods for attribute selection are also concisely stated. Authors explored the usage of backward elimination and forward selection, brute force and genetic algorithms for choosing the subset of attributes. Additionally, the effect of replacing missing values with the mean, the median and a k-NN approach was also examined. It was found that the use of SVM without any preprocessing techniques was able to achieve a 72.04% predictive rate of classification. It was observed that the use of preprocessing techniques (attribute selection using GA, replacing missing values with the mean) significantly raised the classification rate. The rate of classification achieved using SVM after preprocessing is 83.33%.

Ashok Kumar and Govidsamy [73] have compared the performance of five different classification algorithms viz. SVM, naïve bayes, regression, bayes net and decision tree for diabetes patients. Authors also examined the effect of the greedy stepwise preprocessing technique on the predictive rate of classification. It was observed that after applying the greedy stepwise feature selection technique the performance of each classification technique improved.

Jayalskshmi and Santhakumaran [74] have examined the impact of preprocessing techniques on the performance of ANN in diagnosing diabetes patients. Authors found that the use of PCA with ANN and replacing missing values with the median achieved better accuracy as compared to other preprocessing techniques like min–max, Z-score and Mapstd.

H. Hamidi and A. Daraei [75] have explored the effect of pre and post processing on heart diagnosis problems. Authors surveyed 43 different papers on heart diagnosis using data mining techniques. They found that feature selection is one of the common preprocessing techniques. In addition, data transformation also plays an important role in improving the performance of the diagnosis process.

Hongmei Yan, Jun Zheng, Yingtao Jiang et al. [76] have implemented a real coded GA to determine the best possible and critical features for heart disease diagnosis. Authors extracted 24 features out of 40 and stated that these 24 features are sufficient to precisely diagnose any heart disease problem.

8. Relationship of data type and data mining algorithm

Researchers have used different data sets to diagnose different lifestyle based human disorders. Some of the common datasets used for diabetes diagnosis are PIMA, UCI and hospital data. Most of the researchers have used textual and numeric types of attributes in diabetes diagnosis. However, some of the researchers have also found effective results by using tongue images.

From Table 6, the following observations have been made:

– For diabetes diagnosis, the best results are achieved when the fusion of GA and SVM is applied to text and numeric based datasets.
– Tongue images also seem to be useful in diabetes diagnosis. However, the rate of accuracy achieved using tongue images is not as high as that accomplished with text and numeric based datasets.
– Like diabetes, cardio human disorders are also well diagnosed with text and numeric data sets. However, authors have also used images, signals and sound for heart disease diagnosis. Decision tree with text and numeric based datasets is found to be more effective for cardio problem diagnosis.

Table 5
Effect of preprocessing on classification accuracy.
Authors and year | Preprocessing technique | Classification technique | Accuracy before preprocessing | Accuracy after preprocessing
Kumar and Govindsamy (2015) [73] | Greedy Stepwise | SVM | 77.47% | 77.73%
 | Greedy Stepwise | Naïve bayes | 76.30% | 77.60%
 | Greedy Stepwise | BayesNet | 78.25% | 78.25%
 | Greedy Stepwise | Decision Tree | 77.60% | 79.81%
 | Greedy Stepwise | Regression | 77.34% | 77.60%
Razieh Asgarnezhad et al. (2017) [72] | Backward Elimination & Forward Selection, Replacing missing value with mean | SVM | 72.04 | 74.58
 | Backward Elimination & Forward Selection, Replacing missing value with Median | SVM | 72.04 | 77.78
 | Backward Elimination & Forward Selection, Replacing missing value with k-NN | SVM | 72.04 | 74.14
 | Brute Force, Replacing missing value with Mean | SVM | 72.04 | 75
 | Brute Force, Replacing missing value with Median | SVM | 72.04 | 79.03
 | Brute Force, Replacing missing value with k-NN | SVM | 72.04 | 77.55
 | GA, Replacing missing value with Mean | SVM | 72.04 | 83.33
 | GA, Replacing missing value with Median | SVM | 72.04 | 80.36
 | GA, Replacing missing value with k-NN | SVM | 72.04 | 76.56
Jayalskshmi and Santhakumaran (2010) [74] | Min–max, omitting missing value | ANN | – | 71.59
 | Z-score, omitting missing value | ANN | – | 76.38
 | Mapstd, omitting missing value | ANN | – | 68.35
 | PCA, omitting missing value | ANN | – | 67.84
 | Min–max, replacing with zero | ANN | – | 68.49
 | Z-score, replacing with zero | ANN | – | 72.47
 | Mapstd, replacing with zero | ANN | – | 69.27
 | PCA, replacing with zero | ANN | – | 66.67
 | Min–max, replacing with Mean | ANN | – | 71.36
 | Z-score, replacing with Mean | ANN | – | 69.87
 | Mapstd, replacing with Mean | ANN | – | 67.25
 | PCA, replacing with Mean | ANN | – | 99.9
 | Min–max, replacing with k-NN | ANN | – | 69.87
 | Z-score, replacing with k-NN | ANN | – | 68.09
 | Mapstd, replacing with k-NN | ANN | – | 60.77
 | PCA, replacing with k-NN | ANN | – | 99.8
Jianfeng Zhang et al. (2017) [29] | GA and PCA | SVM | 77.83 | 78.77
Rajesh Kumar, Rajeev Srivastava, and Subodh Srivastava (2015) [30] | Enhancement, HOG, LTE, Wavelet, Shape, Texture | KNN | – | 92.19
K. Vembandasamy, T. Karthikeyan (2016) [77] | Outlier detection | FCM | 84 | 93
K. Vimala and Dr. V. Kalaivani (2013) [31] | DWT | ANN | – | 97.68
Gouda I. Salama et al. (2012) [37] | PCA | Fusion of J48 and SMO | 96.28 | 96.99
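
Several of the preprocessing steps listed in Table 5 (replacing missing values with the mean or median, min–max scaling and Z-score scaling) are simple column-wise transformations. The sketch below illustrates them on a single attribute, assuming missing entries are encoded as `None`; the function names and the sample values are hypothetical, and the code is not the implementation used in any of the cited studies.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def min_max(values, lo=0.0, hi=1.0):
    """Rescale values linearly to [lo, hi]; assumes the column is not constant."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute column with two missing readings; illustrative only.
glucose = [148, 85, None, 89, 137, None, 78]

by_mean = impute(glucose, "mean")
by_median = impute(glucose, "median")
scaled = min_max(by_median)
standardized = z_score(by_median)

print(by_mean)
print(by_median)
print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardized])
```

Calling `min_max(by_median, lo=-1.0, hi=1.0)` rescales the same column to the −1 to +1 range instead of 0 to 1.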

– For breast cancer, images seem to be more effective. SVM provides better results in early diagnosis of breast cancer.
– For digestive disorders, different types of data viz. text, numeric and images have been used. For liver disorder identification, text and numeric based datasets with C4.5 achieved better classification results.

9. Analysis of research publications using Google scholar

An important facet of this paper is to emphasize the medical sciences domains where lots of data have already been mined and the areas that still require attention for data mining. For this, the last seven years of data have been extracted from Google Scholar. Tables 7, 8, 9, 10 and 11 represent the status of use of preprocessing techniques in healthcare diagnosis systems for diabetes, cardio, digestive, ophthalmology and dentistry human disorders.

From Fig. 5, the rate of usage of preprocessing in diabetes diagnosis lies between 8.48 and 14.80%. Over the last seven years, the rate of usage of preprocessing in diabetes diagnosis has increased.

Like diabetes, the rate of usage of preprocessing in cardio disorder diagnosis has also increased in the last seven years. The maximum rate of usage of preprocessing in cardio disorder diagnosis was in year 2017, at 17.75% (Fig. 6).

Table 6
Relationship between disease, datatype and accuracy of data mining approach.
Disease | Data type | Techniques | Accuracy
Diabetes (Aiswarya Iyer, Jeylatha and Ronak Sumlbaly) [21] | Text and numeric | Naïve bayes | 79.56
Diabetes (T. Santhanam, M.S. Padmavathi) [41] | | GA-SVM | 98.79
Diabetes (Mahmoud Heydari, Mehdi Teimouri, Zainabolhoda) [78] | Numeric and nominal | ANN | 97.44
Diabetes (S. Peter) [43] | Text and numeric | Neural Network, Support Vector Machine, Hybrid approach | 96.68
Diabetes (K. Rajesh and V. Sangeetha) [25] | Textual and numeric | C4.5 | 91
Diabetes (Velide Phani Kumar) [44] | Text and numeric | J48 | 100
Heart (Masethe and Masethe) [50] | Nominal | J48, RepTree, Naïve Bayes, CART and Bayes NET | 99
Diabetes (Santi Wulan) [40] | Integer, Real | Smooth SVM | 96.58
Diabetes (Nilesh Jagdish Vispute et al.) [42] | Text and numeric | Naïve Bayes | 76.03
Diabetes (Jianfeng Zhang et al.) [29] | Tongue Images | SVM | 83.06
Diabetes (N.V. Cibin, S. Wilfred Franklin, N.V. Ajin) [79] | Tongue Images | SVM | 66.26
Diabetes (N.V. Cibin, S. Wilfred Franklin, N.V. Ajin) [79] | Tongue Images | LCA | 85.52
Heart Disease (Santi Wulan) [40] | Integer, Real | Smooth SVM | 94.15
Heart (Soni and Ansari) [26] | Text and numeric | Decision Tree | 99.2
Heart diseases (M.A. NisharaBanu and B. Gomathy) [57] | Text and numeric | Hybrid | 89
Heart disease (B. Venkatalakshmi, M.V. Shivsankar) [23] | | Naïve bayes | 85.3
Heart (Taneja) [24] | Transthoracic Echocardiography Report | J48 classifier, multilayer perceptron and naive bayes | 95.56
Heart (Sujata Joshi and Mydhili K. Nair) [11] | Text and numeric | Decision Trees | 92.2
Heart disease (Hyeongsoo Kim et al.) [83] | Ultrasound images and electrocardiogram signal | Neural Network | 86.24
 | | Bayesnet | 79.65
 | | C4.5 | 83.63
 | | SVM | 89.51
 | | CMAR | 89.46
Kumar and Govindsamy [73] | Text and numeric | Decision Tree | 79.81
Breast cancer (Ramani) [27] | Images | Not mentioned | Not mentioned
Breast cancer (Shweta) [67] | Images | Decision tree | 93.62
Breast cancer (Chaitali) [62] | Images | Naïve bayes, RBFN (Radial base function network), neural network | 75.4
Breast cancer (Mahua Nandy) [70] | Images | SVM | 96.7
Cancer (Rajesh Kumar, Rajeev Srivastava, and Subodh) [30] | Biopsy Images | KNN | 92.19
Liver disorder (P. Rajeswari, G. Sophia Reena) [36] | Text and numeric | FT Tree | 97.10
Ophthalmology (Swati Gupta, Karandikar AM) (2015) [60] | Images | SVM | 86
Liver disorder (A.S. Aneeshkumar, C. Jothi Venkateswaran) [39] | Text and numeric | C4.5 | 99.20

Table 7
Year wise publication details with and without preprocessing techniques.
Year | Articles on diabetes diagnosis using data mining | Articles on diabetes diagnosis using data mining with preprocessing
2011 | 6900 | 585
2012 | 8380 | 735
2013 | 8920 | 1090
2014 | 8790 | 1110
2015 | 9100 | 1230
2016 | 9790 | 1390
2017 | 5980 | 885

Table 8
Year wise publication details with and without preprocessing techniques.
Year | Articles on cardio disorder diagnosis using data mining | Articles on cardio disorder diagnosis using data mining with preprocessing
2011 | 13000 | 1480
2012 | 13900 | 1480
2013 | 14900 | 1910
2014 | 13000 | 2000
2015 | 14900 | 2220
2016 | 15100 | 2460
2017 | 8280 | 1470
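
The rate of usage of preprocessing discussed around Tables 7 and 8 can be recomputed from the yearly counts. The following is a minimal sketch, assuming the rate is simply the second count column divided by the first, expressed as a percentage. The names `DIABETES`, `CARDIO`, `preprocessing_rates` and `rate_range` are illustrative and do not come from the paper.

```python
# Yearly article counts copied from Tables 7 and 8:
# year -> (articles using data mining, articles also using preprocessing)
DIABETES = {
    2011: (6900, 585), 2012: (8380, 735), 2013: (8920, 1090),
    2014: (8790, 1110), 2015: (9100, 1230), 2016: (9790, 1390),
    2017: (5980, 885),
}
CARDIO = {
    2011: (13000, 1480), 2012: (13900, 1480), 2013: (14900, 1910),
    2014: (13000, 2000), 2015: (14900, 2220), 2016: (15100, 2460),
    2017: (8280, 1470),
}


def preprocessing_rates(counts):
    """Return {year: percentage of articles that also use preprocessing}.

    Assumption: rate = with_preprocessing / total * 100, rounded to 2 places.
    """
    return {
        year: round(100.0 * with_prep / total, 2)
        for year, (total, with_prep) in counts.items()
    }


def rate_range(counts):
    """Return the (lowest, highest) yearly preprocessing rate in percent."""
    rates = preprocessing_rates(counts).values()
    return min(rates), max(rates)


if __name__ == "__main__":
    for name, counts in (("diabetes", DIABETES), ("cardio", CARDIO)):
        low, high = rate_range(counts)
        print(f"{name}: {low:.2f}% to {high:.2f}%")
```

Under this assumption the script reports 8.48% to 14.80% for diabetes and 10.65% to 17.75% for cardio disorders, with the cardio maximum falling in 2017, which agrees with the figures quoted in the text.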

From Table 9 and Fig. 7, it is observed that the rate of usage of preprocessing in digestive disorder diagnosis is very low. It lies between 4.58% and 8.93%.
From Table 10 and Fig. 8, it is observed that researchers have almost ignored data mining in ophthalmology disorder diagnosis. However, the rate of usage of preprocessing has increased in the last seven years.

Table 9
Year wise publication details with and without preprocessing techniques.
Year | Articles on digestive disorder diagnosis using data mining | Articles on digestive disorder diagnosis using data mining with preprocessing
2011 | 5850 | 268
2012 | 5790 | 315
2013 | 6260 | 559
2014 | 6060 | 387
2015 | 6070 | 461
2016 | 6350 | 473
2017 | 3980 | 259

Table 11
Year wise publication details with and without preprocessing techniques.
Year | Articles on dentistry disorder diagnosis using data mining | Articles on dentistry disorder diagnosis using data mining with preprocessing
2011 | 3610 | 107
2012 | 3210 | 136
2013 | 3840 | 297
2014 | 3180 | 167
2015 | 3250 | 176
2016 | 2860 | 178
2017 | 1940 | 141
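
The publication shares reported with Fig. 10 (42%, 26%, 18%, 10% and 4%) appear to follow from the seven-year totals of the first count column in Tables 7 to 11. The sketch below checks this under that assumption. The paper does not state how the shares were derived, and the names `ARTICLES` and `publication_shares` are hypothetical.

```python
# Yearly counts of data-mining diagnosis articles, 2011 through 2017,
# copied from the first count column of Tables 7 to 11.
ARTICLES = {
    "cardio": [13000, 13900, 14900, 13000, 14900, 15100, 8280],
    "diabetes": [6900, 8380, 8920, 8790, 9100, 9790, 5980],
    "digestive": [5850, 5790, 6260, 6060, 6070, 6350, 3980],
    "dentistry": [3610, 3210, 3840, 3180, 3250, 2860, 1940],
    "ophthalmology": [1080, 1360, 1500, 1240, 1230, 1210, 789],
}


def publication_shares(articles):
    """Return each area's share of all counted articles, in whole percent.

    Assumption: share = area total / grand total over the five areas.
    """
    totals = {area: sum(counts) for area, counts in articles.items()}
    grand_total = sum(totals.values())
    return {
        area: round(100.0 * total / grand_total)
        for area, total in totals.items()
    }


if __name__ == "__main__":
    for area, share in publication_shares(ARTICLES).items():
        print(f"{area}: {share}%")
```

With these totals the rounded shares come out as 42, 26, 18, 10 and 4 percent for cardio, diabetes, digestive, dentistry and ophthalmology disorders, which is consistent with the summary given for Fig. 10.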

Table 10
Year wise publication details with and without preprocessing techniques.
Year | Articles on ophthalmology disorder diagnosis using data mining | Articles on ophthalmology disorder diagnosis using data mining with preprocessing
2011 | 1080 | 63
2012 | 1360 | 80
2013 | 1500 | 137
2014 | 1240 | 134
2015 | 1230 | 142
2016 | 1210 | 136
2017 | 789 | 102

Like ophthalmology, a very small amount of mining work has been done in dentistry disorder diagnosis. The rate of usage of preprocessing in dentistry diagnosis lies between 2.96% and 7.73% (Fig. 9).

Fig. 5. Number of articles published using preprocessing in diabetes diagnosis.

Fig. 6. Number of articles published using preprocessing in cardio disorder diagnosis.

Fig. 7. Number of articles published using preprocessing in digestive disorder diagnosis.

Fig. 8. Number of articles published using preprocessing in ophthalmology disorder diagnosis.

Fig. 9. Number of articles published using preprocessing in dentistry disorder diagnosis.

10. Practice connotation

It was found that data mining techniques were effectively used in mining various health related problems, particularly those that deal with heart, diabetes, stroke, cancer etc. However, the majority of the experimentation was based upon some specific data sets. Researchers have normally focused on certain parameters and not all. By improving the data set contents, one can mine more useful results and information.

11. Need of predictive model for correcting lifestyle based human disorder

Klein and Klein [80] reported that a smoker has 9% more chances of developing nuclear cataract (a type of cataract). Besides, smoking is also one of the major reasons behind diabetic retinopathy. Similarly, excessive alcohol consumption (more than four drinks) increases the chances of cataract by 34%. Moreover, cataract chances are also increased
if a person works under light in the UV-B range. Blue collar persons are 50% more prone to trauma (dysfunction of any component of the eye) than white collars.
Oral health [81,82], which is generally overlooked, is an important factor that contributes to a vigorous lifestyle. Some of the lifestyle habits like smoking, recreational drug use and tongue piercings may lead to gum diseases, bleeding, bad breath, tingling, lumps and sometimes to oral cancer also. Work or social stress may also lead to serious issues including oral cancer, gum disease or tooth decay. It may also lead to temporomandibular disorder that may badly affect the jaw joints and other oral muscles.
Fig. 10 represents the article publication status of the last seven years in the healthcare industry. The shares of work done in diagnosis of different lifestyle based human disorders related to cardio, diabetes, digestive, dentistry and ophthalmology disorders are 42%, 26%, 18%, 10% and 4% respectively.

Fig. 10. Lifestyle based disease diagnosis related article publication summary.

After careful analysis from Google Scholar, it is observed that significant work has been done for mining the data related to heart disease and diabetes. Most of the researchers have tried to mine data related to cardio patients. However, very little attention has been paid to analyzing the data related to Ophthalmology, Dentistry and Digestive disorders. Ophthalmology

and Dentistry are almost ignored by data mining researchers. However, there exist some chronic diseases in these domains that may lead to blindness, oral or lung cancer and even death in some cases. Hence, attention is needed to explore the usage of data mining in these areas. Therefore, there is a need to develop a hybrid predictive model that assists in early diagnosis of the diseases related to the ophthalmology, oral and digestive systems.
Fig. 11 represents the proposed architecture for lifestyle based human disorder diagnosis. The major components of the proposed architecture are biosensors, IoT (Internet of Things), preprocessing and data reduction techniques, a fusion procedure of soft computing and machine learning techniques, a data analysis procedure, patients and healthcare professionals. Here, an intelligent procedure is designed to collect the patient's data. The data can be collected manually or through biosensors which are connected through IoT. A preprocessing module is incorporated to clean, transform and normalize data. A data reduction model is used to remove the unwanted features of data sets. A hybrid predictive system based upon soft computing and data mining based machine learning techniques should be developed. By varying the combinations of soft computing and data mining techniques (GA and SVM, GA and Naïve Bayes, ACO and Decision Tree, PSO and Association Rules etc.) an effective predictive model can be designed that will assist in early diagnosis and treatment of lifestyle based diseases. To further improve the forecasting accuracy, different symptom based ontologies related to ophthalmology, dentistry and the digestive system can also be incorporated. Fuzzy logic and information theory may also be incorporated to deal with uncertainty and missing data. The patient's data is completely analyzed by the data analysis module. The analyzed data will then be communicated through IoT to both patient and healthcare professional so that proper monitoring and care of the patient can be carried out. Finally, the patient's response to the prescribed treatment should be monitored using social media. Accordingly, the ontologies and training data should be restructured from time to time.
The novelty lies in the fusion of machine learning, soft computing, fuzzy logic, ontologies, IoT, biosensors and preprocessing techniques.

Fig. 11. Proposed diagnosis model for lifestyle based human disorders.

12. Conclusion

This paper aims to start an effort toward designing a predictive model for lifestyle based diseases of ophthalmology, dentistry and the digestive system. As per Google Scholar, the percentages of articles published related to cardio, diabetes, digestive, dentistry and ophthalmology disease diagnosis using data mining in the last seven years are 42%, 26%, 18%, 10% and 4% respectively. This study confirms that lots of data have already been mined for cardiology and diabetic issues. The time has come to make people aware of the need to correct their lifestyle to avoid the wide destructive spectrum of digestive, ophthalmology and dentistry disorders. Therefore, a smart and intelligent solution is required for the same.
First, it is not feasible to incorporate all the papers in any survey. However, the best ones are selected for analyzing the role of classification and predictive techniques in medical databases.
Second, data mining techniques are used to predict the aforementioned human disorders with high precision and low cost. Data mining is used to reveal novel clinical and medical knowledge which is helping to decrease the death rate of patients. Early diagnosis of human disorders has significant potential in providing better medical treatment. Moreover, data mining

techniques also help in finding the association between the diseases and the lifestyle of the people. Moreover, depending upon the size and nature of the training and testing data sets, different authors found different levels of accuracy even with the same classification techniques. In addition, the researchers have normally focused on 13 common parameters. They have ignored parameters like level of stress due to family, work environment, sedentary lifestyle, heredity etc. which are more important for cardiology and diabetes. Therefore, to get more accurate results from a large testing data set, one has to incorporate numerous parameters and has to statistically explore them.
Third, several types of datasets like textual, numeric, categorical, sound, images and bio-signals have been used to diagnose lifestyle based human disorders. However, most of the authors have used textual, numeric and categorical variables. In addition, tongue images also seem to be useful in diabetes diagnosis. However, the rate of accuracy achieved using tongue images is not as high as that accomplished with text and numeric based datasets. For diabetes diagnosis, the best results are achieved when the fusion of GA and SVM is applied on text and numeric based datasets. For cardiac problem diagnosis, authors have used different types of datasets like textual, bio-signals, heart sound, human voice etc. Decision tree with text and numeric based datasets is found to be more effective for cardio problem diagnosis. For cancer detection, biopsy images were found to be more useful. SVM provides better results in early diagnosis of breast cancer. For digestive disorders, different types of data viz. text, numeric and images have been used. For liver disorder identification, text and numeric based datasets with C4.5 achieved better classification results.
As well, the predictive rate of disease diagnosis has been significantly improved by using different data preprocessing techniques like data cleaning, transformation, normalization, outlier detection and dimension reduction. Moreover, the use of a fusion classifier with preprocessing techniques gives better results as compared to the use of an individual classifier. However, the rate of usage of preprocessing in diagnosis of different lifestyle based human disorders using data mining related to cardio, diabetes, digestive, dentistry and ophthalmology lies between 10.65%–17.75%, 8.48%–14.80%, 4.58%–8.93%, 2.96%–7.73% and 5.83%–12.93% respectively.
So, it was found that significant work has been done for mining the data related to heart disease and diabetes. However, very little attention has been paid to analyzing the data for Ophthalmology, Dentistry and Digestive system diseases. Therefore, attention is needed to explore the usage of data mining in these areas.
Finally, research should be conducted to develop a hybrid predictive system using soft computing, data mining and information theory. Further, to reduce ambiguity and to improve the predictive precision, fuzzy logic and symptom based ontologies can also be incorporated.

Conflict of interest statement

There are no conflicts of interest.

References

[1] Du Hongbo. Data mining techniques and applications – an introduction. Cengage Learning India; 2010.
[2] Liao Shu-Hsien, Chu Pei-Hui, Hsiao Pei-Yuan. Data mining techniques and applications – a decade review from 2000 to 2011. Expert Syst Appl 2012;39:11303–10.
[3] Han Jiawei, Kamber Micheline, Pei Jian. Data mining concepts and techniques. USA: Morgan Kaufmann Publishers; 2015.
[4] Dunham Margaret H. Data mining introductory and advanced topics. India: Pearson Publications; 2013.
[5] Pawar Suvarna, Sikchi Smita. An extensive survey on diagnosis of diabetes mellitus in healthcare. In: Proceedings of the international conference on data engineering and communication technology. Advances in intelligent systems and computing, vol. 468. 2017.
[6] Kaur Shubpreet, Bawa RK. Future trends of data mining in predicting the various diseases in medical healthcare systems. Int J Energy Inf Commun 2015;6(4):17–34.
[7] Tomar Divya, Aggarwal Sonali. A survey on data mining approaches for healthcare. Int J Bio-Sci Bio-Technol 2013;5(5):241–66.
[8] Bajaj Punam, Gupta Preeti. Review on heart disease diagnosis based on data mining techniques. Int J Comput Sci Manag Res 2014;3(5):1596.
[9] Ahmad Parvez, Qamar Saqib, Qasim Syed, Rizvi Afser. Techniques of data mining in healthcare: a review. Int J Comput Appl 2015;120(15):38–50.
[10] Case A, Lubotsky D, Paxson C. Economic status and health in childhood: the origins of the gradient. Am Econ Rev 2002;92(5):1308–34.
[11] Joshi Sujatha, Nair Mydhili K. Prediction of heart disease using classification based data mining techniques. Comput Intell Data Mini 2015;2:503–11.
[12] Bhagya Shree SR, Sheshadri HS. Diagnosis of Alzheimer’s disease using naive bayesian classifier. Neural Comput Appl 2016.
[13] Bhagya Shree SR, Sheshadri HS, Krishna Murali. Diagnosis of Alzheimer’s disease using rule based approach. Indian J Sci Technol 2016;9(13):1–6.
[14] Sweilam Nasser H, Tharwat AA, Abdel Moniem NK. Support vector machine for diagnosis cancer disease: a comparative study. Egypt Inform J 2010;11:81–92.
[15] Divya, Chhabra Raman, Kaur Sumit, Ghosh Swagata. Diabetes detection using artificial neural networks & back-propagation algorithm. Int J Sci Technol Res 2013;2(1):9–11.
[16] Sharma Manik, Gurvinder Singh, Rajinder Singh, Gurdev Singh. Analysis of DSS queries using entropy based restricted genetic algorithm. Appl Math Inf Sci 2015;9(5):2599–609.
[17] Sharma Manik, Gurvinder Singh, Rajinder Singh. Design and analysis of stochastic DSS query optimizers in a distributed database system. Egypt Inform J 2015. http://dx.doi.org/10.1016/j.eij.2015.10.003.
[18] Jiang Fei, Jiang HuiZhi Yong, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017;2017:1–13.
[19] https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=life&numAtt=&numIns=&type=&sort=nameUp&view=table.
[20] Ilayaraja M, Meyyappan T. Efficient data mining method to predict the risk of heart disease through frequent itemset. Proc Comput Sci 2015;70:586–92.
[21] Aiswarya Iyer, Jeyalatha S, Sumbaly Ronak. Diagnosis of diabetes using classification mining techniques. Int J Data Min Knowl Manag Process 2015;5(1):1–14.
[22] Yang Haofan, Chen Yi-Ping Phoebe. Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information. Expert Syst Appl 2015;42:6168–76.
[23] Venkatalakshmi B, Shivsankar MV. Heart disease diagnosis using predictive data mining. Int J Innov Res Sci Eng Technol 2014;3(3):1873–7.
[24] Taneja Abhishek. Heart disease prediction system using data mining techniques. Orient J Comput Sci Technol 2013;6(4):457–66.
[25] Rajesh K, Sangeetha V. Application of data mining methods and techniques for diabetes diagnosis. Int J Eng Innov Technol 2012;2(3):224–9.

[26] Soni Jyoti, Ansari Ujma, et al. Predictive data mining for medical diagnosis: an overview of heart disease prediction. Int J Comput Appl 2011;17(8):43–8.
[27] Ramani R, Suthanthira Vanitha N, Valarmathy S. The pre-processing techniques for breast cancer detection in mammography images. Int J Image Graph Signal Process 2013;5:47–54.
[28] Delshi Howsalya Devi R, Indra Devi M. Effective diagnosis of heart disease using intern quartile range filter and decision tree classifier. 2016.
[29] Zhang Jianfeng, Xu Jiatuo, Hu Xiaajuan, et al. Diagnostic method of diabetes based on support vector machine and tongue images. BioMed Res Int 2017;2017:1–9.
[30] Kumar Rajesh, Srivastava Rajeev, Srivastava Subodh. Detection and classification of cancer from microscopic biopsy images using clinically significant and biologically interpretable features. J Med Eng 2015;2015:1–14.
[31] Vimala K, Kalaivani V. Classification of cardiac vascular disease from ECG signals for enhancing modern health scenario. Health Inform - Int J 2013;2(4):63–72.
[32] Prakash D, Uma Mageshwari T, Prabakaran K, Suguna A. Detection of heart diseases by mathematical artificial intelligence algorithm using phonocardiogram signals. Int J Innov Appl Stud 2013;3(1):145–50.
[33] Wu Hang, Kim Sahong, Bae Keunsung. Hidden Markov model with heart sound signals for identification of heart diseases. In: Proceedings of 20th international congress on acoustics; 2010.
[34] Pareek Vishakha, Sharma RK. Coronary heart disease detection from voice analysis. In: IEEE students’ conference on electrical, electronics and computer science; 2016.
[35] Tsai Du-Yih, Lee Yongbum. Fuzzy-reasoning-based computer-aided diagnosis for automated discrimination of myocardial heart disease from ultrasonic images. Electron Commun Jpn, Part 3, Fundam Electron Sci 2002;85(11):1–8.
[36] Rajeswari P, Reena G Sophia. Analysis of liver disorder using data mining algorithm. Glob J Comput Sci Technol 2010;10(14):48–52.
[37] Salama GI, Abdelhalim MB, Abd-elghanyZeid Magdy. Breast cancer diagnosis on three different datasets using multi-classifiers. Int J Comput Inf Technol 2012;1(1):36–43.
[38] Sharma Disha, Jindal Gagandeep. Identifying lung cancer using image processing techniques. In: International conference on computational techniques and artificial intelligence; 2011. p. 115–20.
[39] Aneeshkumar AS, Jothi Venkateswaran C. Estimating the surveillance of liver disorder using classification algorithms. Int J Comput Appl 2012;57(6):39–42.
[40] Santi Wulan Purnami, Mohamad Zain J, Embong Abdullah. A new expert system for diabetes disease diagnosis using modified spline smooth support vector machine. In: ICCSA 2010; 2010.
[41] Santhanam T, Padmavathi MS. Application of K-means and genetic algorithms for dimension reduction by integrating SVM for diabetes diagnosis. Proc Comput Sci 2015;47:76–83.
[42] Vispute Nilesh Jagdish, Kumar Sahu Dinesh, Rajput Anil. An empirical comparison by data mining classification techniques for diabetes data set. Int J Comput Appl 2015;131(2):6–11.
[43] Peter S. An analytical study on early diagnosis and classification of diabetes mellitus. Bonfring Int J Data Min 2014;4(2):7–11.
[44] Velide Phani Kumar, Velide Lakshmi. A data mining approach for prediction and treatment of diabetes disease. Int J Sci Invent Today 2014;3(3):73–9.
[45] Obaloluwa Olaniyi Ebenezer, Adnan Khashman. Onset diabetes diagnosis using artificial neural network. Int J Sci Eng Res 2014;5(10):754–9.
[46] Aljumah Abdhullah A, Gulam Ahamad Mohammed, et al. Applications of data mining: diabetes health care in young and old patients. J King Saud Univ, Comput Inf Sci 2013;25:127–36.
[47] Marinov Miroslav, Mosa Abu Saleh Mohammad, Yoo Illhoi, Boren Suzanne Austin. Data mining technologies for diabetes: a systematic review. J Diabetes Sci Technol 2011;5(6):1549–56.
[48] Nagarajan S, Chandrasekaran RM, Ramasubramanian P. Data mining techniques for performance evaluation of diagnosis in gestational diabetes. Int J Curr Res Acad Rev 2014;2(10):91–8.
[49] Lakshmi KR, Kumar SP. Utilization of data mining techniques for prediction of diabetes disease survivability. Int J Sci Eng Res 2013;4(6):933–42.
[50] Masethe Hlaudi Daniel, Masethe Mosima Anna. Prediction of heart disease using classification algorithms. In: Proceeding of the world congress on engineering and computer science; 2014.
[51] Dangare S Chaitrali, Apte Sulabha S. Improved study of heart disease prediction system using data mining classification techniques. Int J Comput Appl 2012;47(10):44–8.
[52] Sudha A, Gayathri P, et al. Effective analysis and predictive model of stroke disease using classification methods. Int J Comput Appl 2012;43(14):26–31.
[53] Panzarasa Silvia, Quaglini Silvana, Sacchi Lucia, Cavallini Anna, Micieli Giuseppe, Stefanelli Mario. Data mining techniques for analyzing stroke care processes. Stud Health Technol Inform 2010;160:939–43.
[54] Subbalakshmi G, Ramesh K, et al. Decision support in heart disease prediction system using naïve bayes. Ind J Comput Sci Eng 2011;2(2):170–6.
[55] Krishnaiah V, Narsimha G, Chandra NS. Heart disease prediction system using data mining technique by fuzzy KNN approach. Adv Intell Syst Comput 2015:371–84.
[56] Bashir S, Qamar U, Hassan Khan F, Younus Javed M. MV5: a clinical dss framework for heart diseases prediction using majority vote based classifier ensemble. Arab J Sci Eng 2014;39:7771–83.
[57] Nishara Banu MA, Gomathy B. Disease forecasting system using data mining methods. In: 2014 IEEE international conference on intelligent computing applications; 2014. p. 130–3.
[58] Shouman Mai, Turner Tim, Stocker Rob. Applying k-nearest neighbour in diagnosing heart disease patients. Int J Inf Educ Technol 2012;2(3):220–3.
[59] Chen Ling, Li Xue, Yang Yi, Kurniawati Hanna, Sheng Quan Z, Hu Hsiao-Yun, et al. Personal health indexing based on medical examinations: a data mining approach. Decis Support Syst 2014;81:54–65.
[60] Gupta Swati, Karandikar AM. Diagnosis of diabetic retinopathy using machine learning. J Res Dev 2015;3(2):1–6.
[61] Silva F Lincoln, Sequeiros Giomar O, Santos Maria Lúcia O, Fontes Cristina AP, Muchaluat-Saade Débora C, Conci Aura. Thermal signal analysis for breast cancer risk verification. In: MEDINFO; 2015.
[62] Vaghela Chaitali, Bhatt Nikita, Mistry Darshana. A survey on various classification techniques for clinical decision support system. Int J Comput Appl 2015;116(23):14–7.
[63] Ramachandran P, Girija N, Bhuvaneswari T. Early detection and prevention of cancer using data mining techniques. Int J Comput Appl 2014;97(13):48–53.
[64] Evirgen Hayrettin, Çerkezi Menduh. Prediction and diagnosis of diabetic retinopathy using data mining technique. Turk Online J Sci Technol 2014;4(3):32–42.
[65] Kalairasai A, Mohammad Amanulla K. Unconscious oral cancer detection using data mining classification approaches. Int J Adv Res Comput Eng Technol 2015;4(7):3177–84.
[66] Kumar Mohanty Ashwini, Ranjan Senapati Manas, Kumar Lenka Saroj. An improved data mining technique for classification and detection of breast cancer from mammographs. Neural Comput Appl 2013;22(1):303–10.
[67] Khraya Shweta. Using data mining techniques for diagnosis and prognosis of cancer disease. Int J Comput Sci Eng Inf Technol 2012;2(2):56–66.
[68] Quellec Gwenole, Lamard Mathieu, Erginay Ali, Chabouis Agnes, Beatrice Massin Pascale, Cochener Azuguel Guy. Automatic detection of referral patients due to retinal pathologies through data mining. Med Image Anal 2016;29:47–64.
[69] Chaudhary Kavita, Bajaj Pinki. Automated prediction of RCT (root canal treatment) using data mining techniques: ICT in health care. In: Elsevier proceeding of international conference on information and communication technologies; 2014.
[70] Nandy Mahua. An analytical study of supervised and unsupervised classification methods for breast cancer diagnosis. In: 2nd international conference on computing communication and sensor network. Int J Comput Appl; 2013. p. 1–4.
[71] Razavi Amir R, Gill Hans, Åhlfeldt Hans, et al. A data pre-processing method to increase efficiency and accuracy in data mining. In: AIME 2005. LNAI, vol. 3581. 2005. p. 434–43.
[72] Asgarnezhad Razieh, Shekofteh Maryam, Boroujeni Farsadzamani. Improving diagnosis of diabetes mellitus using combination of preprocessing techniques. J Theor Appl Inf Technol 2017;95(13).

[73] Ashok Kumar D, Govindsamy R. Performance and evaluation of classification data mining techniques in diabetes. Int J Comput Sci Inf Technol 2015;6(2):1312–9.
[74] Jayalskshmi T, Santhakumaran A. Impact of preprocessing for diagnosis of diabetes mellitus using artificial neural networks. In: Second international conference on machine learning and computing; 2010. p. 109–12.
[75] Hamidi H, Daraei A. Analysis of preprocessing and post-processing methods and using data mining to diagnose heart disease. Int J Eng 2016;29(7):921–30.
[76] Yan Hongmei, Zheng Jun, Jiang Yingtao, et al. Selecting critical clinical features for heart diseases diagnosis with a real-coded GA. Appl Soft Comput 2008;8:1105–11.
[77] Vembandasamy K, Karthikeyan T. Novel outlier detection in diabetes classification using data mining techniques. Int J Appl Eng Res 2016;11(2):1400–3.
[78] Heydari Mahmoud, Teimouri Mehdi, Heshmati Zainabolhoda, et al. Comparison of various classification algorithms in the diagnosis of type 2 diabetes in Iran. Int J Diabetes Dev Countries 2016;36(2):167–73.
[79] Cibin NV, Franklin S Wilfred, Ajin NV. Diagnosis of diabetes mellitus and NPDR in diabetic patient from tongue images using LCA classifier. Int J Adv Res Trends Eng Technol 2015;2:57–62.
[80] Klein Barbara EK, Klein Ronald. Lifestyle exposures and eye diseases in adults. Am J Ophthalmol 2007;144(6):961–9.
[81] http://www.youroralhealth.ca/oral-health-a-your-body/lifestyle.
[82] Sharma P, Busby M, Chapple RL, et al. The relationship between general health and lifestyle factors and oral health outcomes. Br Dent J 2016;221(2):65–9.
[83] Kim Hyeongsoo, Ishag Musa Ibrahim M, Piao Minghao, et al. A data mining approach for cardiovascular disease diagnosis using heart rate variability and images of carotid arteries. Symmetry 2016;8(6):1–15.
