1 Research Scholar, Bharathiar University, Coimbatore - 641046 & Assistant Professor, Department of Computer Science, Women’s Christian College, Chennai.
2 Research Supervisor, Bharathiar University, Coimbatore - 641046 & Associate Professor, Bhaktavatsalam Memorial College for Women, Chennai.
Abstract
Data mining helps analyze patient diseases. Diabetes mellitus is a chronic disease that affects various
organs of the human body. Early prediction can save lives and help bring the disease under control. The present
work focuses on the analysis of diabetes through a decision tree algorithm, with statistical implications, using R.
Diabetes is a common disease across all age groups worldwide. The main aim of this paper is to study and discuss
decision tree implementation in R with different kinds of medical datasets. Classification is performed with
decision trees in R. This paper will help in the early detection of complications, which might help in the
timely treatment of individuals. The real-time medical dataset was preprocessed manually and with RStudio and the
WEKA tool.
Keywords: Data Mining, Classification, R, RStudio, SVM, Decision Tree, Accuracy Rate, KNN.
1. Data Mining
Data mining is the process of classification based on patterns obtained from a dataset. It is the extraction, or
mining, of knowledge from large amounts of data, and is also called knowledge mining, knowledge discovery, or
knowledge extraction in databases. Different types of algorithms have been developed and implemented for extracting
information and discovering knowledge patterns that are useful for decision support [1].
Problem definition
Data preparation
Exploring data and Model building
Validating the models
Deploying and updating models
Not all data is clean: it may contain duplicates or be of poor quality, and poor-quality data leads to poor-quality
results, so data preprocessing is important. Quality decisions must be based on quality data. Data quality
can be measured in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability [1].
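Two of the quality measures above, duplication and completeness, can be checked with a few lines of base R. The following is a minimal sketch under stated assumptions: the `records` data frame and its values are invented for illustration and are not the paper's dataset.

```r
# Hypothetical patient records; names and values are illustrative only.
records <- data.frame(
  Age   = c(45, 52, 52, 61, NA),
  BMI   = c(27.1, 31.4, 31.4, NA, 24.8),
  HBA1C = c(7.2, 8.1, 8.1, 9.0, 6.5)
)

# Duplicates: duplicated() flags every repeat of an earlier row.
n_duplicates <- sum(duplicated(records))

# Completeness: share of non-missing values in each column.
completeness <- colMeans(!is.na(records))

# Keep only unique, fully observed rows.
clean <- unique(records[complete.cases(records), ])

print(n_duplicates)
print(completeness)
print(nrow(clean))
```

Dropping incomplete rows is only one possible policy; imputing the missing values is an alternative when records are scarce.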
Two forms of data analysis can be used for extracting models: classification and prediction.
Supervised classification: the set of possible classes is known in advance. Unsupervised classification
(clustering): the set of possible classes is not known. Prediction here means estimating the missing values of the dataset.
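The difference between the supervised and unsupervised settings can be illustrated on R's built-in `iris` data, which stands in here for a medical dataset. This is a sketch, not the paper's experiment: `rpart` is assumed to be installed (it ships with standard R distributions as a recommended package), and the choice of three clusters is an assumption made for the example.

```r
library(rpart)  # decision trees; a recommended package in standard R installs

data(iris)  # built-in example data, not the paper's dataset

# Supervised classification: the class labels (Species) are known in advance.
tree_model <- rpart(Species ~ ., data = iris, method = "class")
tree_pred  <- predict(tree_model, iris, type = "class")

# Unsupervised classification (clustering): the labels are withheld and
# k-means groups the rows by similarity alone.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)

# Compare each result with the true species.
print(table(predicted = tree_pred, actual = iris$Species))
print(table(cluster = clusters$cluster, actual = iris$Species))
```

The cluster numbers carry no meaning of their own; only the cross-tabulation shows how closely the discovered groups follow the known classes.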
2. Related Work
The number of people with type 2 diabetes is increasing in every country, and 82% of people with diabetes live in
lower-income countries. Many researchers have developed prediction models using Weka, MATLAB,
and R to predict diabetes. A few of the data mining models that I have taken as references are described below.
K. Sharmila and Dr. S. A. Vetha Manickam, in “Efficient Prediction and Classification of Diabetic Patients from
Bigdata using R”, observe that diabetes is one of the most common growing problems in the world and a major health
issue in most countries. They therefore carried out a detailed analysis of a diabetic dataset with the help of
R [46].
Aanurag Kumar Srivastava, Chandan Kumar, and Neha Mangla, in “Analysis of Diabetic Dataset and Developing
Prediction Model by using Hive and R”, analyzed a diabetic dataset and built a good prediction
model for the collected dataset [47].
K. Sharmila and S. A. Vethamanickam, in “Performance Analysis of Diabetic Dataset Using MapReduce Framework in
Cloud and Standalone Computing”, discussed a complete analysis of a huge dataset using the MapReduce
framework as a new programming model [49].
The above-mentioned researchers have developed good prediction models. Most researchers used tools such as
Weka and MATLAB; a few used other tools to develop their models. This paper analyzes
a diabetic dataset using R and RStudio for prediction.
Type II diabetes is also called Non-Insulin-Dependent Diabetes Mellitus (NIDDM) or adult-onset diabetes
mellitus. Type II disorder occurs mostly after the age of 40. 20% of patients with diabetes will develop foot ulcers
due to nerve damage and reduced blood flow [2].
4.1.1 Diabetes Foot Ulcer - Symptoms
To execute the program using R and RStudio, follow the process below:
Step 1: Load the CSV file into R (import the data file into R).
Step 2: Execute a sample command to validate the table rows and column data.
Step 3: Preprocess the dataset using R.
Step 4: Evaluate the real-time collected dataset.
Step 5: Apply the decision tree algorithm.
Step 6: Obtain the prediction accuracy rate.
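The six steps above can be sketched end to end in R. Everything specific in this sketch is an assumption made for illustration: the hospital data is not available, so a synthetic CSV file is generated first; the columns `Age`, `BMI`, `HBA1C` and the outcome `action` are hypothetical stand-ins; Step 4 is interpreted as holding out 30% of the rows for evaluation; and `rpart` is used as one common decision tree implementation, since the source does not name the package.

```r
library(rpart)  # decision trees; a recommended package in standard R installs

# Synthetic stand-in for the hospital data (all values are made up).
set.seed(42)
n <- 200
synthetic <- data.frame(
  Age   = sample(30:80, n, replace = TRUE),
  BMI   = round(runif(n, 18, 40), 1),
  HBA1C = round(runif(n, 5, 12), 1)
)
# Hypothetical outcome rule, for illustration only.
synthetic$action <- ifelse(synthetic$HBA1C + rnorm(n, sd = 0.5) > 8, "yes", "no")
csv_path <- tempfile(fileext = ".csv")
write.csv(synthetic, csv_path, row.names = FALSE)

# Step 1: load the CSV file into R.
Diabetes <- read.csv(csv_path, stringsAsFactors = TRUE)

# Step 2: validate the rows and columns.
str(Diabetes)
print(dim(Diabetes))

# Step 3: preprocess (here: drop incomplete rows).
Diabetes <- na.omit(Diabetes)

# Step 4: hold out part of the data for evaluation.
train_idx <- sample(nrow(Diabetes), size = round(0.7 * nrow(Diabetes)))
train <- Diabetes[train_idx, ]
test  <- Diabetes[-train_idx, ]

# Step 5: apply the decision tree algorithm.
fit <- rpart(action ~ Age + BMI + HBA1C, data = train, method = "class")

# Step 6: prediction accuracy on the held-out rows.
pred <- predict(fit, test, type = "class")
accuracy <- mean(as.character(pred) == as.character(test$action))
print(accuracy)
```

Measuring accuracy on held-out rows rather than on the training rows gives a less optimistic estimate of how the tree would behave on new patients.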
6. Data Preprocessing
The following methods are used to preprocess the data.
Manual Pre-processing
The data was collected from a reputed hospital in Tamil Nadu, where it was received in both manually entered
and optional system data formats. This dataset was manually integrated into the system, and data preprocessing was
taken care of during integration: 60% of the data preprocessing was done at that time.
Tool-Based Pre-processing (R)
The manually preprocessed data is uploaded into R, which performs the remaining preprocessing needed to build the
right model. The CSV file, which contains the whole real-time dataset, is imported into RStudio.
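As an illustration of what tool-based preprocessing in R can look like, the sketch below recodes an inconsistently entered category and fills in missing numeric values. The source does not say which preprocessing operations were applied, so both operations, the median imputation rule, and the sample values are assumptions; only the column names `Age`, `SEX` and `BMI` come from the attribute list.

```r
# Hypothetical excerpt of imported data; the values are invented.
raw <- data.frame(
  Age = c(54, 61, NA, 47),
  SEX = c("M", "f", "F", "m"),
  BMI = c(28.4, NA, 31.0, 25.2),
  stringsAsFactors = FALSE
)

# Normalise inconsistent category codes, then store them as a factor.
raw$SEX <- factor(toupper(raw$SEX), levels = c("F", "M"))

# Replace missing numeric values with the column median
# (one simple choice among many; an assumption, not the paper's method).
for (col in c("Age", "BMI")) {
  missing <- is.na(raw[[col]])
  raw[[col]][missing] <- median(raw[[col]], na.rm = TRUE)
}

print(raw)
```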
S.No.  Attribute
1      Age
2      SEX
3      BMI
4      BP HIGH
5      BP LOW
6      Sugar (Post, Pre)
7      HBA1C
8      Hemoglobin
9      GlycemicIndex
10     Alcohol
11     Single/multiple ulcer
Table 1: Dataset attributes
Fig 5: Commands for uploading the data and viewing the tables in databases
After preprocessing, the real-time dataset contains 325 records with 25 attributes. The graphs are drawn
using the commands below.
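library(ggplot2)  # provides ggplot(), aes() and geom_boxplot()
# Box plots of each attribute, grouped and coloured by the action column of the Diabetes data frame
ggplot(Diabetes, aes(x = action, y = GlycemicIndex)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = BPH)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = SugarPre)) + geom_boxplot(aes(fill = factor(action)))
ggplot(Diabetes, aes(x = action, y = Age)) + geom_boxplot(aes(fill = factor(action)))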
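Each box plot summarises one attribute separately for every value of `action`. The numbers behind such a plot can be computed directly in base R, as in the sketch below. The data frame is a synthetic stand-in, and `action` is assumed here to be a two-valued outcome column, which the source does not state.

```r
# Synthetic stand-in for the Diabetes data frame (values are made up).
Diabetes <- data.frame(
  action = rep(c(0, 1), each = 5),
  Age    = c(40, 45, 50, 55, 60, 52, 58, 64, 70, 76)
)

# boxplot.stats() returns the lower whisker, lower hinge, median,
# upper hinge and upper whisker for each group.
summaries <- tapply(Diabetes$Age, Diabetes$action,
                    function(v) boxplot.stats(v)$stats)
print(summaries)
```

The hinges computed this way can differ slightly from the box edges drawn by ggplot2, which uses quantiles rather than Tukey hinges, but the medians agree.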
References
[1] Laura Auria and Rouslan A. Moro, “Support Vector Machines (SVM) as a Technique for Solvency Analysis”.
[2] S. Peter, “An Analytical Study on Early Diagnosis and Classification of Diabetics Mellitus”, International Journal of Data Mining, June 2014.
[3] Vincenzo Lagani, Franco Chiarugi, “Development and validation of risk assessment models for diabetes-related complications based on the DCCT/EDIC data”, Journal of Diabetes and Its Complications.
[4] Sam Chao, Fai Wong, “An Incremental Decision Tree Learning Methodology Regarding Attributes in Medical
Data Mining”, 2012, International Journal of Advanced Research in Computer and Communication Engineering.
[5] Raj Kumar, Dr. Anil Kr. Kapil, Anupam Bhatia, “Modified Tree Classification in Data Mining”, Global Journal of
Computer Science and Technology, Vol. 12, Issue 2 (Ver. 1.0), 2012, pp. 59-62.
[6] S.V.G.Reddy, K.Thammi Reddy, V. Valli Kumari , Kamadi VSRP Varma, “An Svm Based Approach To Breast
Cancer Classification Using Rbf And Polynomial Kernel Functions With Varying Arguments”,2014, International
Journal Of Computer Science And Information Technologies.
[7] Trilok Chand Sharma, Manoj Jain, “WEKA Approach for Comparative Study of Classification Algorithm”,
International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 4, April
2013.
[8] Rashedur M. Rahman, Farhana Afroz, “Comparison Of Various Classification Techniques Using Different Data
Mining Tools For Diabetes Diagnosis”, 2013, Journal Of Software Engineering And Applications.
[9] International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358.
[10] Kawsar Ahmed, Tasnuba Jesmin, “Comparative Analysis of Data Mining Classification Algorithms in Type-2 Diabetes Prediction Data Using WEKA Approach”, International Journal of Science and Engineering (IJSE), http://ejournal.undip.ac.id/index.php/ijse.
[11] Nathalie Villa and Fabrice Rossi, “Support Vector Machine for Functional Data Classification”, April 2005.
[12] Swasti Singhal, Monika Jena, “A Study on WEKA Tool for Data Preprocessing, Classification and Clustering”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-2, Issue-6, May 2013.
[13] Payal Dhakate, Suvarna Patil, K. Rajeswari, Dr. V. Vaithiyananthan, Deepa Abin, “Preprocessing and Classification in WEKA using different classifiers”, Journal of Engineering Research and Applications, www.ijera.com, ISSN: 2248-9622, Vol. 4, Issue 8 (Version 1), August 2014.
[14] Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten, “WEKA—Experiences with a Java Open-Source Project”, Journal of Machine Learning Research, November 2010.
[15] Viswanathan K, Dr. Mayilvahanan K, R. Christy Pushpaleela, “Performance Analysis of various Data mining classification (C4.5 Vs SVM) Techniques on Diabetics in Heart Problem”.
[16] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, IMECS 2009.
[17] Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”, (Chapter 1), AAAI/MIT Press, 1996.
[18] Witten, I. and Eibe, F., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., San Francisco: Morgan Kaufmann Series in Data Management Systems, 2005.
[19] Pardha Repalli, “Prediction on Diabetes Using Datamining Approach”.
[20] Joseph L. Breault, “Data Mining Diabetic Databases: Are Rough Sets a Useful Addition”.
[21] G. Parthiban, A. Rajesh, S. K. Srivatsa, “Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method”, International Journal of Computer Applications (0975-8887), Volume 24, No. 3, June 2011.
[22] P. Padmaja, “Characteristic evaluation of diabetes data using clustering techniques”, IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 11, November 2008.
[23] P. Yasodha, M. Kannan, “Analysis of a Population of Diabetic Patients Databases in Weka Tool”, Research Vol 2, Issue 5, May 2011.
[24] Vikas Chaurasia, Saurabh Pal, “Data Mining Approach to Detect Heart Dieses”, International Journal of Advanced Computer Science and Information Technology, Vol. 2.
[25] D. Lavanya and Dr. K. Usha Rani, “Ensemble decision tree classifier for breast cancer data”, International Journal of Information Technology Convergence and Services, Vol. 2, No. 1, February 2011.
[26] Prof. K. Rajeswari, Dr. V. Vaithiyanathan and Shailaja V. Pede, “Feature Selection for Classification in Medical Data Mining”, International Journal of Emerging Trends and Technology in Computer Science, Vol 2, Issue 2, March-April 2013.
[27] J. S. Raikwal, Kanak Saxena, “Performance Evaluation of SVM and K-Nearest Neighbor Algorithm over Medical Data set”, 2012, International Journal of Computer Applications.
[28] Mai Shouman, Tim Turner, and Rob Stocker, “Using Decision Tree For Diagnosing Heart Disease Patients”, 2011, 9th Australasian Data Mining Conference (AusDM'11), Ballarat, Australia.
[29] Ms.Rupali, R.Patil, “Heart Disease Prediction System Using Naive Bayes and Jelinek-Mercer Smoothing”, 2014,
International Journal of Advanced Research in Computer and Communication Engineering.
[30] Krati Saxena, Dr. Zubair Khan, Shefali Singh, “Diagnosis of Diabetes Mellitus Using K Nearest Neighbor
Algorithm”, 2014, International Journal of Computer Science Trends And Technology (Ijcst).
[31] Nongyao Nai-aruna, Rungruttikarn Moungmaia, “Comparison of Classifiers for the Risk of Diabetes Prediction”,
2015, Procedia Computer Science 69.
[32] D.Sheela Jeyarani, G.Anushya, R.Rajarajeswari, A.Pethalakshmi, “A Comparative Study Of Decision Tree and
Naive Bayesian Classifiers on Medical Datasets”, 2013, International Journal of Computer Applications.
[33] Nipjyoti Sarma, Sunil Kumar, Anupam Kr. Saini ,“A Comparative Study On Decision Tree And Bayes Net
Classifier For Predicting Diabetes Type 2”,2014, International Journal Of Scientific Research Engineering &
Technology (Ijsret).
[34] T. Revathi, S. Jeevitha, “Comparative Study On Heart Disease Prediction System Using Data Mining Techniques”,
2013, International Journal Of Science And Research (Ijsr).