---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Data mining is a powerful process for forecasting future behavior. It is used in various domains and disciplines to solve existing problems or to anticipate behavior. The main data mining techniques are Clustering, Association, Classification, Regression and Structured Prediction. Classification is one of the most important tasks for simplifying a dataset, and the decision tree method is commonly used for classification. A decision tree model is represented by branches and nodes. Several decision tree algorithms are implemented in data mining tools. The objective of this study is to compare the most frequently used decision tree algorithms across various domains and to identify the algorithm that predicts best. An educational dataset is used to measure the accuracy of the decision tree algorithms and to predict students' performance levels. This research gives educators an indication of student progress.

Key Words: Classification, Decision Stump, Decision Tree algorithms, J48, Hoeffding Tree, Random Forest, Random Tree, REPTree.

1. INTRODUCTION
Data mining is a widely used technology with great potential to bring out the most essential information in data warehouses. Data mining tools are used to forecast future trends, enabling practical, knowledge-driven decisions. They can answer queries that were traditionally too complex to resolve. Data mining tasks are used to discover hidden patterns in existing information. Data mining is applicable to any kind of data repository, and its methods can be applied on existing software and hardware platforms to improve the value of present information resources.

Classification is a data mining technique used to organize data objects according to given class labels. The classification process learns from a training set in which all data objects are already labeled with classes. The classification algorithm takes in the training set and constructs a model, which is then used to classify large unlabeled datasets. There are many classification algorithms.

The decision tree is a well-known classification technique commonly used in research. A decision tree model is a flowchart-like structure in which each internal node represents a test on the data objects and each leaf node represents a class label. Decision trees are used across many domains to predict hidden patterns. They are popular because they are simple and easy to interpret.

The aim of this research is to compare the efficiency of different decision tree algorithms. Education is one of the domains that benefits from data mining. To compare the decision tree algorithms, an educational dataset from a reputed college is used. Semester marks of college students are collected and analyzed with a data mining tool to classify students into Grade A, Grade B or Grade C and to predict the next-semester percentage of each student.

This study identifies the decision tree algorithms commonly used in various domains. Its objective is to report the efficiency and accuracy of these algorithms. College students' exam performance is analyzed to rank them and predict their future performance. This research helps professors to predict achievement levels and identify a student or a group of students in need of further attention.

2. REVIEW OF LITERATURE
Data mining is used in much research for various purposes in different fields. Researchers have worked in the educational field to predict loyal students and dropout students in order to improve educational quality. In the medical field, data mining is widely used to diagnose diseases such as breast cancer, diabetes and typhoid. In the organizational field, data mining supports decision making and the setting of marketing goals. In the weather domain, data mining is commonly used to predict weather. In the environment domain, data mining has been applied to soil, iris flower and mushroom datasets.

Nilima Patil, Rekha Lathi, and Vidya Chitre [1] presented a customer decision-making process for recommending a membership card using classification
2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 2342
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 03 | Mar -2017 www.irjet.net p-ISSN: 2395-0072
which is helpful to an enterprise's development. This study concludes that the main factor affecting customer rank is income, and found that C4.5 performs better than CART.

Saeide Kakavand, Taha Mokfi and Mohammad Jafar Tarokh [2] propose a decision tree model to predict loyal students in order to increase educational quality. They conclude that data mining is a powerful technology that can predict loyal students with 90% accuracy. They compared three common decision tree algorithms and found that CART performed best, followed by C.5, with CHAID having the lowest accuracy.

Adeyemo and Adeyeye [3] carried out a comparative study of decision tree algorithms and multilayer perceptron (MLP) algorithms for the prediction of typhoid. They compared the algorithms on a medical dataset and found that MLP had greater accuracy and C4.5 had greater speed.

Sweta Rai, Priyanka Saini and Ajit Kumar Jain [4] constructed a decision tree model with the ID3 algorithm to decide whether first-year undergraduate students will continue or drop their studies. The study concludes that student dropout seems to be associated with residence, stress, family type, stream in higher secondary, satisfaction level, enrollment in another institute, change in goal, university infrastructure, participation in extra-curricular activities, adjustment problems in the hostel, and family problems. The ID3 classifier gives an accuracy of 98%.

A. Sivasankari, Mrs. S. Sudarvizhi and S. Radhika Amirtha Bai [5] proposed a comparative study of different decision tree and clustering algorithms. An activities dataset was applied to the clustering algorithms, and the study concluded that K-means is faster and SOM is more accurate. A tumor dataset was applied to the decision tree algorithms, and the study concluded that ID3 is faster and C4.5 is more accurate.

techniques is used to discover accurate and reliable results. This paper concludes that the accuracy of C4.5 is 85.81% and that it can be applied to real-time applications.

Aman Kumar Sharma and Suruchi Sahni [9] construct a model to classify electronic mails as spam or non-spam. Four decision tree algorithms, ID3, J48, Simple CART and Alternating Decision Tree, are used and compared. The paper concludes that J48 performs best with 92.7624% accuracy, and that Simple CART gives results only slightly different from J48.

Sudheep Elayidom M., Sumam Mary Idicula and Joseph Alexander [10] proposed a new decision tree algorithm, which is applied to different UCI datasets and compared with other decision tree algorithms. ADT tree, REP tree, Random Tree, C4.5*stat, C4.5, Neural Network and Naive Bayes algorithms are run on the Iris, Segment, Diabetes, Breast cancer, Glass and Labor datasets in the data mining tool WEKA. The results show that neural networks achieved higher accuracy.

3. METHODOLOGY
Flowchart: Start -> Student Data Collection -> Preprocessing -> Implementation of Classification Algorithms
3.1 Data Collection
An educational dataset is used in this study: the semester percentages of students from a reputed college. For classifying students by grade, the dataset fields consist of the student's name, register number, marks in various subjects and the percentage. For predicting the sixth-semester percentage, a second dataset consists of the student's name, register number and the percentages of the first five semesters. Unnecessary attributes are removed, and only the fields that influence the result are selected.

Table -1: Students datasets attributes and description for classification.

ATTRIBUTE                   DESCRIPTION                                             DATA TYPE
Register Number             Unique register number for all students in the college  Numeric
Students Name               Name of the registered student                          String
Percentage                  Percentage in the semester                              Numeric

Table -2: Students datasets attributes and description for prediction.

ATTRIBUTE                   DESCRIPTION                                             DATA TYPE
Register Number             Unique register number for all students in the college  Numeric
Students Name               Name of the registered student                          String
Percentage of Semester I    Percentage in the first semester                        Numeric
Percentage of Semester II   Percentage in the second semester                       Numeric
Percentage of Semester III  Percentage in the third semester                        Numeric
Percentage of Semester IV   Percentage in the fourth semester                       Numeric
Percentage of Semester V    Percentage in the fifth semester                        Numeric

3.2 Data Preprocessing
Data preprocessing is an important phase since real-world data are imperfect. In this process the missing values of attributes in the dataset are filled in. Unnecessary attributes can be removed to improve performance. In the students dataset the only important field is the percentage. The final attributes selected for the classification process are register number and percentage. For the prediction process, the register number and the five semester-percentage attributes are chosen.

3.3 Data Transformation
WEKA (Waikato Environment for Knowledge Analysis) data mining software is used in this study for implementation. Weka's native file format is ARFF (Attribute-Relation File Format), so the dataset needs to be converted to ARFF. The dataset is stored in Microsoft Excel, saved as a CSV (Comma Separated Values) file, and converted to an ARFF file using Weka.

4. CLASSIFICATION ALGORITHM
4.1 Decision Tree Algorithms
The decision tree is a well-known classification technique. It is a way to display an algorithm in a tree-shaped, flowchart-like structure. Decision tree algorithms are commonly used in research because of their simplicity and ease of understanding. A decision tree model uses flowchart symbols to represent the classification. A main advantage of decision trees is that they can be combined with other decision-making techniques.

4.2 J48
This algorithm is an extension of the well-known decision tree algorithm ID3. In WEKA, a popular data mining tool, J48 is the implementation of the C4.5 algorithm, which was developed by Ross Quinlan. It creates rules by which the dataset is classified and builds a decision tree based on the given class labels.

4.3 Hoeffding Tree
The name of this decision tree is derived from the Hoeffding bound, which is used in the tree induction process. The Hoeffding bound gives confidence that the correct attribute is chosen to split the tree, so that the best model is built. The Hoeffding tree is a decision tree induction technique for data streams. It was developed to overcome the limitations of previous streaming classification techniques.

4.4 Random Forest
The random forest algorithm is popular in several competitions because of its powerful, computation-intensive approach. It is similar to a bootstrapping algorithm combined with the CART (Classification and Regression Trees) model. It builds multiple CART models with different initial attributes. It is also capable of performing
regression in data mining. It builds multiple trees, and each tree produces a classification. The algorithm selects the final classification by majority vote.

visualized tree, efficiency and the time consumed for each decision tree algorithm is noted.
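The voting step described for Random Forest can be illustrated with a minimal sketch. It shows only how the votes of several trees are combined; it does not build the trees from bootstrap samples, and it is not Weka's implementation. The three "trees" and their grade cut-offs are hypothetical stand-ins chosen for illustration, not values from this study.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    # most_common(1) returns [(label, count)] for the top label;
    # on a tie, the label that appeared first in the list wins.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a percentage to a grade.
# The cut-offs are illustrative assumptions only.
def tree_a(pct):
    return "A" if pct >= 80 else "B" if pct >= 60 else "C"


def tree_b(pct):
    return "A" if pct >= 75 else "B" if pct >= 55 else "C"


def tree_c(pct):
    return "A" if pct >= 85 else "B" if pct >= 65 else "C"


forest = [tree_a, tree_b, tree_c]


def forest_predict(pct):
    """Collect one vote per tree and return the majority class."""
    return majority_vote([tree(pct) for tree in forest])


print(forest_predict(78))
```

For a student with 78%, the three hypothetical trees vote B, A and B, so the combined prediction is Grade B.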
Chart -1: Graph on accuracy of decision tree Classifiers (vertical axis: 0% to 100%)

Chart -2: Graph on accuracy of Random Tree to predict next semester percentage on different datasets.
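Classification accuracy is conventionally the fraction of instances whose predicted class matches the actual class. The sketch below assumes that definition; the excerpt does not state how the charted values were computed, and the grade lists are hypothetical examples rather than data from this study.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical grades for five students (illustrative only).
actual = ["A", "B", "C", "B", "A"]
predicted = ["A", "B", "B", "B", "A"]

print(f"{accuracy(actual, predicted):.0%}")
```

Here four of the five hypothetical predictions match, which gives an accuracy of 80%.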
Prediction, International Journal of Software Engineering and Its Applications, Vol. 8, No. 12 (2014), pp. 31-42.

[13] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum, Dr. V. Prasanna Venkatesan, "An Analysis on Performance of Decision Tree Algorithms using Students Qualitative Data", I.J. Modern Education and Computer Science, 2013, 5, 18-27.

[14] Harvinder Chauhan, Anu Chauhan, "Implementation of decision tree algorithm c4.5", International Journal of Scientific and Research Publications, Volume 3, Issue 10, October 2013.

[15] Jaimin N. Undavia, Dr. P. M. Dolia and Dr. Atul Patel, "Comparison of Decision Tree Classification Algorithm to Predict Student's Post Graduation Degree in Weka Environment", International Journal of Innovative and Emerging Research in Engineering, Volume 1, Issue 2, 2014.

[16] Shikha Chourasia, "Survey paper on improved methods of ID3 decision tree classification", International Journal of Scientific and Research Publications, Volume 3, Issue 12, December 2013.

[17] Masud Karim, Rashedur M. Rahman, "Decision Tree and Naive Bayes Algorithm for Classification and Generation of Actionable Knowledge for Direct Marketing", Journal of Software Engineering and Applications, 2013, 6, 196-206.

[18] Anuja Priyam, Abhijeet, Rahul Gupta, Anju Rathee, and Saurabh Srivastava, "Comparative Analysis of Decision Tree Classification Algorithms", International Journal of Current Engineering and Technology, Vol. 3, No. 2, June 2013.

[19] V. Vaithiyanathan, K. Rajeswari, Kapil Tajane, Rahul Pitale, "Comparison of different Classification techniques using different datasets", International Journal of Advances in Engineering & Technology, Vol. 6, Issue 2, pp. 764-768, May 2013.

[20] Sreerama K. Murthy, Simon Kasif, Steven Salzberg, "A System for Induction of Oblique Decision Trees", Journal of Artificial Intelligence Research, 1994, pp. 1-32.

[22] Kasra Madadipouya, "A New Decision tree method for Data mining in Medicine", Advanced Computational Intelligence: An International Journal (ACII), Vol. 2, No. 3, July 2015.

[23] Amit Gupta, Ali Syed, Azeem Mohammad, Malka N. Halgamuge, "A Comparative Study of Classification Algorithms using Data Mining: Crime and Accidents in Denver City the USA", International Journal of Advanced Computer Science and Applications, Vol. 7, No. 7, 2016.

[24] Abid Hasan, "Evaluation of Decision Tree Classifiers and Boosting Algorithm for Classifying High Dimensional Cancer Datasets", International Journal of Modeling and Optimization, Vol. 2, No. 2, April 2012.

[25] Pooja Sharma, Asst. Prof. Rupali Bhartiya, "Implementation of Decision Tree Algorithm to Analysis the Performance", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 10, December 2012.

[26] Mai Shouman, Tim Turner, Rob Stocker, "Using Decision Tree for Diagnosing Heart Disease Patients", CRPIT Volume 121 - Data Mining and Analytics 2011.

[27] Rachna Raghuwanshi, "A Comparative Study of Classification Techniques for Fire Data Set", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 7 (1), 2016, 78-82.

[28] D. Lavanya and Dr. K. Usha Rani, "Ensemble Decision tree Classifier for Breast Cancer data", International Journal of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 1, February 2012.

[29] Md. Nurul Amin, Md. Ahsan Habib, "Comparison of Different Classification Techniques Using WEKA for Hematological Data", American Journal of Engineering Research (AJER), 2015, Volume-4, Issue-3, pp-55-61.

[30] Pardeep Kumar, Nitin and Vivek Kumar Sehgal and Durg Singh Chauhan, "A Benchmark to select Data mining based Classification algorithms for business intelligence and decision support systems", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 2, No. 5, September 2012.