
IPASJ International Journal of Information Technology (IIJIT)

Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm


A Publisher for Research Motivation ........ Email:editoriijit@ipasj.org
Volume 6, Issue 10, October 2018 ISSN 2321-5976

An implementation approach for diagnosis of diabetes type II foot ulcer using R in data mining
R. Christy Pushpaleela (1), Dr. R. Padmajavalli (2)

(1) Research Scholar, Bharathiar University, Coimbatore - 641046, and Assistant Professor, Department of
Computer Science, Women’s Christian College, Chennai.

(2) Research Supervisor, Bharathiar University, Coimbatore - 641046, and Associate Professor, Bhaktavatsalam
Memorial College for Women, Chennai.

Abstract
Data mining helps to analyze patient diseases. Diabetes mellitus is a chronic disease that affects various
organs of the human body and is common in all age groups worldwide. Early prediction can save lives and
help keep the disease under control. The present work focuses on the analysis of diabetes through a decision
tree algorithm with statistical implications using R. The main aim of this paper is to study and discuss
decision tree implementation in R with different kinds of medical data sets; classification is performed
with decision trees in R. The paper is intended to support early detection of complications, which may help
in the timely treatment of individuals. The real-time medical data set was preprocessed manually and with
the RStudio and WEKA tools.
Keywords: Data Mining, Classification, R, RStudio, SVM, Decision Tree, Accuracy Rate, KNN.

1. Data Mining
Data mining is the process of discovering patterns in a data set and using them for tasks such as
classification. It is the extraction, or mining, of knowledge from large amounts of data, and is also called
knowledge mining, knowledge discovery or knowledge extraction in databases. Different types of algorithms
have been developed and implemented for extracting information and discovering knowledge patterns that are
useful for decision support [1]. The process consists of the following steps (Fig 1):
• Problem definition
• Data preparation
• Exploring data and model building
• Validating the models
• Deploying and updating models

Fig 1: Data mining process flow


Real-world data is rarely clean: it contains duplicates and low-quality records, and low-quality data leads
to low-quality results, so data preprocessing is important. Quality decisions must be based on quality data.
Data quality can be measured in terms of accuracy, completeness, consistency, timeliness, believability and
interpretability [1].
Two forms of data analysis can be used for extracting models: classification and prediction.
Supervised classification: the set of possible classes is known in advance.
Unsupervised classification (clustering): the set of possible classes is not known in advance.
Prediction: estimating missing or unavailable values in the data set.
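
The difference between supervised and unsupervised classification can be illustrated with a minimal sketch. It assumes R's built-in iris data as a stand-in (it is not the data set used in this paper) and uses rpart for the supervised case and kmeans for the unsupervised case.

```r
# Supervised vs. unsupervised learning on R's built-in iris data.
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)  # make the k-means initialisation reproducible

# Supervised: the class labels (Species) are known and used for training.
tree_model <- rpart(Species ~ ., data = iris, method = "class")
predicted  <- predict(tree_model, iris, type = "class")

# Unsupervised: only the four measurements are used; labels are ignored.
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Compare each result with the true species.
print(table(Predicted = predicted, Actual = iris$Species))
print(table(Cluster = clusters$cluster, Actual = iris$Species))
```

In the supervised table the rows carry class names, whereas the cluster numbers in the second table are arbitrary and have to be interpreted afterwards.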

2. Related Work
The number of people with type 2 diabetes is increasing in every country, and 82% of people with diabetes
live in lower-income countries. Many researchers have developed prediction models for diabetes using WEKA,
MATLAB and R. A few of the models developed using data mining are summarized below.
K. Sharmila and Dr. S. A. Vetha Manickam, in “Efficient Prediction and Classification of Diabetic Patients
from Bigdata using R”, note that diabetes is one of the most common and fastest-growing health problems and
a major health issue in most countries of the world. They therefore carried out a detailed analysis of a
diabetic data set with the help of R [46].
Aanurag Kumar Srivastava, Chandan Kumar and Neha Mangla, in “Analysis of Diabetic Dataset and Developing
Prediction Model by using Hive and R”, analyzed a diabetic data set and built a prediction model for the
collected data [47].
K. Sharmila and S. A. Vethamanickam, in “Performance Analysis of Diabetic Dataset Using MapReduce Framework
in Cloud and Standalone Computing”, presented a complete analysis of a huge data set using the MapReduce
framework as a new programming model [49].
The researchers mentioned above have developed good prediction models. Most of them used tools such as WEKA
and MATLAB, and a few used other tools for model development. In this paper, a diabetic data set is analyzed
using R and RStudio for prediction.

3. R and RStudio Introduction

R is a programming language for data manipulation, statistical computing and graphics. It is an open-source
tool with excellent graphics capabilities and is supported by a large user community. R contains some
statistical algorithms that are not yet available in other tools. The functions that R provides include:
• Mean, median and mode
• Distribution and covariance
• Linear and non-linear regression
• GLM and GAM
RStudio Desktop runs R programs locally. RStudio Server is accessed through a browser while the R code is
actually executed on a remote server.
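
The statistical functions listed above can be tried directly in an R session. The following is a minimal sketch on the built-in mtcars data, which is assumed here purely for illustration. The helper stat_mode() is hypothetical: base R has no function for the statistical mode. GAMs need an additional package and are left out.

```r
# Descriptive statistics and simple models on the built-in mtcars data.
mpg <- mtcars$mpg

mean_mpg   <- mean(mpg)
median_mpg <- median(mpg)

# Base R has no function for the statistical mode (mode() reports the
# storage type), so stat_mode() is a small hypothetical helper.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
mode_cyl <- stat_mode(mtcars$cyl)

# Covariance between car weight and fuel consumption.
cov_wt_mpg <- cov(mtcars$wt, mtcars$mpg)

# Linear regression and a logistic GLM (am is a 0/1 transmission flag).
lin_fit <- lm(mpg ~ wt, data = mtcars)
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)

print(c(mean = mean_mpg, median = median_mpg, mode_cyl = mode_cyl))
print(cov_wt_mpg)
print(coef(lin_fit))
print(coef(glm_fit))
```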
3.1 Decision Tree
Decision trees are commonly used in operations research. A decision tree is a type of supervised learning
algorithm that is mostly used for classification problems.
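
To show how a classification tree chooses a split, the sketch below computes Gini impurity before and after a candidate split. Gini impurity is one common criterion and is the default splitting index in rpart for classification; this paper does not state which criterion was used, so this is an assumption. The glucose values and thresholds are made up for illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity after splitting on `feature < threshold`.
split_impurity <- function(feature, labels, threshold) {
  left  <- labels[feature <  threshold]
  right <- labels[feature >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Toy data (hypothetical values, not from the paper's data set).
glucose <- c(90, 95, 110, 150, 160, 180)
outcome <- c("no", "no", "no", "yes", "yes", "yes")

print(gini(outcome))                          # impurity before any split
print(split_impurity(glucose, outcome, 130))  # threshold separating the classes
print(split_impurity(glucose, outcome, 100))  # a less useful threshold
```

The tree-growing procedure prefers the threshold with the lowest weighted impurity and then repeats the search inside each branch.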


Fig 2: Decision tree work flow


4. Diabetes
Most of the food we eat is converted to glucose (sugar), which the body uses for energy. There are two
types of diabetes [3].
Type I - Insulin Dependent Diabetes Mellitus (IDDM). It mostly affects children and young adults; it appears
in people younger than 35, usually between the ages of 10 and 16.

Type II - Also called Non-Insulin Dependent Diabetes Mellitus (NIDDM) or Adult Onset Diabetes Mellitus. It
mostly occurs after the age of 40. About 20% of patients with diabetes develop foot ulcers due to nerve
damage and reduced blood flow [2].
4.1.1 Diabetes Foot Ulcer - Symptoms

Diabetic foot ulcers are most commonly caused by:

• poor circulation
• high blood sugar (hyperglycemia)
• nerve damage
• irritated or wounded feet

A doctor grades the severity of the ulcer on a scale of 0 to 3 using the following criteria:

• 0: no ulcer but foot at risk
• 1: ulcer present but no infection
• 2: ulcer deep, exposing joints and tendons
• 3: extensive ulcers or abscesses from infection
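
In R, a graded scale like the one above is naturally stored as an ordered factor. The sketch below is illustrative only: the grade codes and the variable names grade_codes and ulcer_grade are hypothetical and do not come from the paper's data set.

```r
# Hypothetical grades for five patients, on the 0-3 scale listed above.
grade_codes <- c(0, 2, 1, 3, 1)

ulcer_grade <- factor(
  grade_codes,
  levels  = 0:3,
  labels  = c("at risk, no ulcer", "ulcer, no infection",
              "deep ulcer", "extensive ulcer or abscess"),
  ordered = TRUE
)

print(table(ulcer_grade))

# An ordered factor supports comparisons such as "grade 2 or worse".
severe <- ulcer_grade >= "deep ulcer"
print(sum(severe))
```

Keeping the order in the factor lets a model or a filter treat grade 3 as more severe than grade 1 instead of as an unrelated category.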
4.1.2 Risk Factors for Diabetic Foot Ulcers

• poor quality shoes
• poor hygiene (not washing regularly or thoroughly)
• improper trimming of toenails
• alcohol consumption
• eye disease from diabetes
• heart disease
• kidney disease
• obesity


Fig 3: Symptoms of foot ulcer


5. Work Flow Using RStudio

To execute the program using R and RStudio, the following steps are carried out:

Step 1: Load the data by importing the CSV file into R.
Step 2: Execute a sample command to validate the rows and columns of the table.
Step 3: Preprocess the data set using R.
Step 4: Evaluate the real-time collected data set.
Step 5: Apply the decision tree algorithm.
Step 6: Obtain the prediction accuracy rate.
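
The six steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' actual script: the hospital data is not public, so a synthetic table with made-up values is written to a temporary CSV file; the rule that generates the action column and the 70/30 hold-out split used for Step 4 are assumptions. The column names Age, GlycemicIndex, BPH and action follow the plotting commands used later in this paper.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)

# Stand-in data: a small synthetic table written to a temporary CSV file.
# All values and the rule behind `action` are made up for illustration.
n <- 200
synthetic <- data.frame(
  Age           = sample(30:80, n, replace = TRUE),
  GlycemicIndex = round(runif(n, 40, 100)),
  BPH           = round(rnorm(n, 135, 15))
)
synthetic$action <- ifelse(synthetic$GlycemicIndex > 70 & synthetic$BPH > 130,
                           "surgery", "medical")
csv_path <- tempfile(fileext = ".csv")
write.csv(synthetic, csv_path, row.names = FALSE)

# Step 1: load the CSV file.
Diabetes <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 2: validate rows and columns.
print(dim(Diabetes))
print(head(Diabetes, 10))

# Step 3: preprocess (drop incomplete rows, make the target a factor).
Diabetes <- na.omit(Diabetes)
Diabetes$action <- factor(Diabetes$action)

# Step 4: hold out part of the data for evaluation (assumed 70/30 split).
train_rows <- sample(nrow(Diabetes), size = round(0.7 * nrow(Diabetes)))
train <- Diabetes[train_rows, ]
test  <- Diabetes[-train_rows, ]

# Step 5: fit the decision tree.
fit <- rpart(action ~ Age + GlycemicIndex + BPH, data = train, method = "class")

# Step 6: prediction accuracy on the held-out rows.
pred     <- predict(fit, test, type = "class")
accuracy <- mean(pred == test$action)
print(accuracy)
```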

Fig 4: Work flow using R

6. Data Preprocessing
The following methods are used to preprocess the data.


Manual pre-processing
The data was collected from a reputed hospital in Tamil Nadu, where it was received in ‘manually entered’
and optional system data formats. This data set was integrated into the system manually, and preprocessing
was carried out during integration: about 60% of the data preprocessing was done at this stage.
Tool-based pre-processing in R
The manually preprocessed data is uploaded to R, where the remaining preprocessing is done to prepare the
data for model building. The CSV file, which contains the whole real-time data set, is imported into RStudio.
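
Typical tool-based preprocessing steps in R are sketched below. The records are hypothetical and only reuse three attribute names from this paper (Age, SEX, HBA1C); the paper does not list the exact cleaning operations, so removing duplicates, dropping incomplete rows and normalising category codes are assumptions about what such a step usually contains.

```r
# Hypothetical raw records with typical problems: a duplicate row,
# a missing value and inconsistent category spelling.
raw <- data.frame(
  Age   = c(54, 61, 61, 47, NA),
  SEX   = c("M", "f", "f", "F", "m"),
  HBA1C = c(7.9, 8.4, 8.4, 6.8, 9.1),
  stringsAsFactors = FALSE
)

print(colSums(is.na(raw)))  # missing values per column before cleaning

clean <- raw[!duplicated(raw), ]          # remove exact duplicate rows
clean <- clean[complete.cases(clean), ]   # drop rows with missing values
clean$SEX <- factor(toupper(clean$SEX), levels = c("F", "M"))  # normalise codes

print(clean)
```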

6.1 Data Set Attributes

The data set was collected from various laboratories in and around Tamil Nadu; around two lakh (200,000)
medical report samples were collected from reputed hospitals. The whole data set consists of 25 attributes,
but only the attributes listed in Table 1 are considered in this research.

Sno   Attribute
1     Age
2     SEX
3     BMI
4     BP HIGH
5     BP LOW
6     Sugar (Post, Pre)
7     HBA1C
8     Hemoglobin
9     GlycemicIndex
10    Alcohol
11    Single/multiple ulcer
Table 1: Data set attributes

6.2 Experimental Model Analysis

To view the table, use the command below. The table (data frame) is named Diabetes.
View(Diabetes)


Fig 5: Commands to upload the data and view the table

Fig 6: Command to view the first 10 observations in the table

After preprocessing, the real-time data set contains 325 records with 25 attributes. The graphs are drawn
using the commands below.

library(ggplot2)
# Box plots of each measurement, grouped by the outcome column `action`
ggplot(Diabetes, aes(x = action, y = GlycemicIndex)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = BPH)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = SugarPre)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = Age)) + geom_boxplot(aes(fill = factor(action)))


Fig 7: Graphs generated using R


The diabetes data set is classified using decision trees. Figure 8 shows the decision tree generated using
rpart() in R.

Fig 8: Decision tree for the training data set


7. Confusion Matrix
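A confusion matrix is a table that is commonly used to describe the performance of a classification model
on a set of test data for which the true values are known.
Accuracy Rate = (TP + TN) / (TP + FP + TN + FN)
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false
negatives.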
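
As a worked example, the confusion matrix and accuracy rate can be computed in R with table(). The ten label pairs below are hypothetical and are not results from this study.

```r
# Hypothetical true and predicted labels for ten test cases.
actual    <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "no"),
                    levels = c("yes", "no"))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

TP <- cm["yes", "yes"]; TN <- cm["no", "no"]
FP <- cm["yes", "no"];  FN <- cm["no", "yes"]

# Share of correct predictions among all cases.
accuracy <- (TP + TN) / (TP + FP + TN + FN)
print(accuracy)
```

Here TP = 3, TN = 5, FP = 1 and FN = 1, so the accuracy rate is 8/10 = 0.8.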

8. Conclusion & Future Work


We conclude that the outcome can be predicted using R with a decision tree implementation based on the
glycemic index, BP and age values, to decide whether to go for amputation, surgery or other medical
procedures. Future work is to raise the accuracy rate above its present level using different data sets and
other approaches to the prediction model.

References
[1] Laura Auria and Rouslan A. Moro, “Support Vector Machines (SVM) as a Technique for Solvency Analysis”.


[2] S. Peter, “An Analytical Study on Early Diagnosis and Classification of Diabetics Mellitus”, International
Journal of Data Mining, June 2014.
[3] Vincenzo Lagani, Franco Chiarugi, “Development and validation of risk assessment models for
diabetes-related complications based on the DCCT/EDIC data”, Journal of Diabetes and Its Complications.
[4] Sam Chao, Fai Wong, “An Incremental Decision Tree Learning Methodology Regarding Attributes in Medical
Data Mining”, 2012, International Journal of Advanced Research in Computer and Communication Engineering.
[5] Raj Kumar, Dr. Anil Kr. Kapil, Anupam Bhatia, “Modified Tree Classification in Data Mining”, Global Journal
of Computer Science and Technology, Vol. 12, Issue 2 (Ver. 1.0), 2012, pp. 59-62.
[6] S.V.G. Reddy, K. Thammi Reddy, V. Valli Kumari, Kamadi VSRP Varma, “An SVM Based Approach to Breast
Cancer Classification Using RBF and Polynomial Kernel Functions with Varying Arguments”, 2014, International
Journal of Computer Science and Information Technologies.
[7] Trilok Chand Sharma, Manoj Jain, “WEKA Approach for Comparative Study of Classification Algorithm”,
International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 4, April
2013.
[8] Rashedur M. Rahman, Farhana Afroz, “Comparison of Various Classification Techniques Using Different Data
Mining Tools for Diabetes Diagnosis”, 2013, Journal of Software Engineering and Applications.
[9] International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358.
[10] Kawsar Ahmed, Tasnuba Jesmin, “Comparative Analysis of Data Mining Classification Algorithms in Type-2
Diabetes Prediction Data Using WEKA Approach”, International Journal of Science and Engineering (IJSE),
http://ejournal.undip.ac.id/index.php/ijse.
[11] Nathalie Villa and Fabrice Rossi, “Support Vector Machine for Functional Data Classification”, April 2005.
[12] Swasti Singhal, Monika Jena, “A Study on WEKA Tool for Data Preprocessing, Classification and Clustering”,
International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-2,
Issue-6, May 2013.
[13] Payal Dhakate, Suvarna Patil, K. Rajeswari, Dr. V. Vaithiyananthan, Deepa Abin, “Preprocessing and
Classification in WEKA using different classifiers”, Journal of Engineering Research and Applications,
www.ijera.com, ISSN: 2248-9622, Vol. 4, Issue 8 (Version 1), August 2014.
[14] Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H.
Witten, “WEKA - Experiences with a Java Open-Source Project”, Journal of Machine Learning Research,
November 2010.
[15] Viswanathan K, Dr. Mayilvahanan K, R. Christy Pushpaleela, “Performance Analysis of various Data mining
classification (C4.5 Vs SVM) Techniques on Diabetics in Heart Problem”.
[16] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, IMECS 2009.
[17] Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”,
(Chapter 1), AAAI/MIT Press, 1996.
[18] Witten, I. and Eibe, F., “Data Mining: Practical Machine Learning Tools and Techniques”, 2nd ed., San
Francisco: Morgan Kaufmann Series in Data Management Systems, 2005.
[19] Pardha Repalli, “Prediction on Diabetes Using Datamining Approach”.
[20] Joseph L. Breault, “Data Mining Diabetic Databases: Are Rough Sets a Useful Addition”.
[21] G. Parthiban, A. Rajesh, S.K. Srivatsa, “Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes
Method”, International Journal of Computer Applications (0975-8887), Volume 24, No. 3, June 2011.
[22] P. Padmaja, “Characteristic evaluation of diabetes data using clustering techniques”, IJCSNS International
Journal of Computer Science and Network Security, Vol. 8, No. 11, November 2008.
[23] P. Yasodha, M. Kannan, “Analysis of a Population of Diabetic Patients Databases in Weka Tool”, Research
Vol 2, Issue 5, May 2011.
[24] Vikas Chaurasia, Saurabh Pal, “Data Mining Approach to Detect Heart Dieses”, International Journal of
Advanced Computer Science and Information Technology, Vol. 2.
[25] D. Lavanya and Dr. K. Usha Rani, “Ensemble decision tree classifier for breast cancer data”, International
Journal of Information Technology Convergence and Services, Vol. 2, No. 1, February 2011.
[26] Prof. K. Rajeswari, Dr. V. Vaithiyanathan and Shailaja V. Pede, “Feature Selection for Classification in
Medical Data Mining”, International Journal of Emerging Trends and Technology in Computer Science, Vol 2,
Issue 2, March-April 2013.
[27] J. S. Raikwal, Kanak Saxena, “Performance Evaluation of SVM and K-Nearest Neighbor Algorithm over Medical
Data set”, 2012, International Journal of Computer Applications.


[28] Mai Shouman, Tim Turner, and Rob Stocker, “Using Decision Tree for Diagnosing Heart Disease Patients”,
2011, 9th Australasian Data Mining Conference (AusDM'11), Ballarat, Australia.
[29] Ms. Rupali R. Patil, “Heart Disease Prediction System Using Naive Bayes and Jelinek-Mercer Smoothing”, 2014,
International Journal of Advanced Research in Computer and Communication Engineering.
[30] Krati Saxena, Dr. Zubair Khan, Shefali Singh, “Diagnosis of Diabetes Mellitus Using K Nearest Neighbor
Algorithm”, 2014, International Journal of Computer Science Trends and Technology (IJCST).
[31] Nongyao Nai-arun, Rungruttikarn Moungmai, “Comparison of Classifiers for the Risk of Diabetes Prediction”,
2015, Procedia Computer Science 69.
[32] D. Sheela Jeyarani, G. Anushya, R. Rajarajeswari, A. Pethalakshmi, “A Comparative Study of Decision Tree and
Naive Bayesian Classifiers on Medical Datasets”, 2013, International Journal of Computer Applications.
[33] Nipjyoti Sarma, Sunil Kumar, Anupam Kr. Saini, “A Comparative Study on Decision Tree and Bayes Net
Classifier for Predicting Diabetes Type 2”, 2014, International Journal of Scientific Research Engineering &
Technology (IJSRET).
[34] T. Revathi, S. Jeevitha, “Comparative Study on Heart Disease Prediction System Using Data Mining Techniques”,
2013, International Journal of Science and Research (IJSR).
