
Comparison of applications for educational data mining in Engineering Education

Conference Paper · March 2017


DOI: 10.1109/EDUNINE.2017.7918187


2 authors:

Diego Buenaño-Fernández Sergio Luján-Mora


Universidad de Las Américas University of Alicante

All content following this page was uploaded by Diego Buenaño-Fernández on 23 January 2018.



Comparison of applications for educational data
mining in Engineering Education

Diego Buenaño Fernández
Facultad de Ingenierías y Ciencias Agropecuarias
Universidad de las Américas
Quito, Ecuador
diego.buenano@udla.edu.ec

Sergio Luján-Mora
Department of Software and Computing Systems
University of Alicante
Alicante, Spain
sergio.lujan@ua.es

Abstract: Many techniques based on information and communication technology are currently used to assess student performance. Data mining applied to the educational field (educational data mining) is one of the most popular techniques for providing feedback on the teaching-learning process. In recent years, a large number of open source applications have appeared in the area of educational data mining. These tools have facilitated the implementation of complex algorithms for identifying hidden patterns of information in academic databases. The main objective of this paper is to compare the technical features of three open source tools (RapidMiner, Knime and Weka) as used in educational data mining. These features have been compared in a practical case study on the academic records of three engineering programs in an Ecuadorian university. This comparison has allowed us to determine which tool is most effective in terms of predicting student performance.

Keywords: Educational Data Mining; performance; Open Source; Software Tool; K-means.

I. INTRODUCTION
The increasing use of information and communication technologies in the educational field entails the storage of large volumes of data in various formats. The applications used in educational environments save information in many different repositories such as archives, blogs, documents, images, videos, audio, scientific data, metadata or hyperlinks, and in many new data formats. The amount of data available in these scenarios is so enormous that traditional processing techniques are insufficient to process them [1]. As a result, the stored information is underutilized and is not taken into account in strategic decision making. Therefore, these data require the application of appropriate methods or techniques to process them and to extract knowledge. In the educational field, these techniques are classified into what is known as Educational Data Mining (EDM), Learning Analytics (LA) and Knowledge Discovery in Databases (KDD) [2]. The main function of these techniques is the application of various methods and algorithms that allow the user to discover and extract patterns in the stored data [3]. In the educational field, the most commonly used algorithms are Regression, Nearest Neighbor, Clustering, Classification, Artificial Intelligence, Neural Networks, Association Rules, Decision Trees, etc. [4].

Nowadays, there are several open source tools that support Data Mining (DM). Some of the principal tools are Hadoop, Orange, Weka, Knime, RapidMiner and Keel, among others [5].

The large number of DM tools currently available on the market, which are used in widely dispersed areas of data analysis, generates uncertainty for specialists who focus on the analysis of educational data [6]. This is especially so because the educational environment has very special features that make it different from other environments. While it is true that specific applications have been developed for educational environments, especially in universities [7], these applications do not offer all the functionality of the most commonly used DM applications on the market.

The main objective of this work was to compare the technical characteristics of three DM tools (RapidMiner, Knime and Weka) in an educational environment. These characteristics were measured in terms of the application of clustering and segmentation techniques using a case study of the academic records of three engineering programs at an Ecuadorian university. The results of this comparison will be of value to people who wish to work in EDM and need to know how to choose the most appropriate tool to perform a particular study.

The content of this paper is organized as follows: Section II covers the description of the method used in this study, Section III includes a detailed description of the case study, and finally, Section IV presents the conclusions and future work.

II. METHOD
The method used to compare the technical features of the tools is to use them in each of the phases of the DM process. The characteristics are compared in terms of aspects such as the results generated by each tool in the development of certain processes, the number of algorithms available for performing DM operations, and the working environment that each tool uses.

The DM process follows a sequence that generally includes the following elements [8] [9]:

978-1-5090-4886-1/17/$31.00 ©2017 IEEE


a) Data collection and the application of pre-processing techniques.
b) The application of DM techniques.
c) The interpretation of the results.

The DM process is shown in Fig. 1 and is described in detail in the following section.

Fig. 1. Usual method in the EDM process

A. Data collection and the application of pre-processing techniques
Before any DM technique is applied, a preliminary analysis of the data is required. The first task that is usually addressed is the exploratory analysis of the data. The second task is the analysis of missing data. Finally, the third task is the detection of uncommon values, often referred to as outliers.

The exploratory data analysis phase begins with the collection and initial understanding of the data. The work done at this stage is to evaluate what type of data they are, how the data can serve us, and what we can get from them.

The analysis of missing values helps solve several problems caused by incomplete data. Missing data can reduce the accuracy of the calculated statistics, because information originally planned to be used is not available. The missing values analysis process has three functions: (a) to describe the pattern of missing data, (b) to fill in missing values using different estimation methods and (c) to estimate, through statistical methods, the influence of these data on the file [10] [11].

The phase involving the detection of outliers seeks to resolve inconsistencies and to eliminate such data, which may arise from the consolidation and integration of the databases. Outliers are data that do not follow the characteristic distribution of the rest of the data. These data could reflect genuine properties of the underlying phenomenon being analyzed or be the result of errors or other anomalies that should not be modeled [11].

B. Application of DM techniques
There are different techniques and algorithms for DM. In this article the emphasis is placed on clustering and segmentation. These are DM techniques that gather the elements of a database into groups with similar characteristics, called clusters. Some characteristics of data segments are not visible at first sight and can be identified through these techniques. They are multivariate techniques and belong to the procedures known as unsupervised, because they do not work with a dependent (endogenous) target variable. All the variables are considered to belong to the same hierarchy and have the same degree of participation in the model [12]. A cluster is a set of data that responds to a given classification based on one or several characteristics that have been defined by the information analyst. A basic classification of grouping techniques is shown in Fig. 2.

Fig. 2. Classification of clustering techniques

C. Interpretation of the results
In this last stage, we analyze and compare the models to determine which has obtained the best results for use in decision making in the educational field. To do this, we analyze the factors that appear in the summary tables of the clusters, and how these are related to other factors [9].

D. Data mining tools
There are numerous commercial and open source tools available for data processing. Nevertheless, we have compared only three of them because they are the most commonly used in educational environments [9]. Furthermore, they are open source and are developed under the GPL. These are RapidMiner, Knime and Weka.

RapidMiner¹ is a package oriented to data mining and the creation of models. Through its sophisticated graphical user interface, DM processes can be implemented and executed quickly and intuitively [6]. The tool allows the development of data analysis processes through the chaining of operators. It provides more than 500 operators for analysis, pre-processing and data visualization purposes. It also allows the user to use algorithms included in the Weka tool.

¹ http://www.rapidminer.com
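
The idea of building an analysis by chaining operators can be illustrated outside any of the tools. The following is a minimal Python sketch, not RapidMiner's API: the records, the operator names (`drop_missing`, `drop_outliers`, `chain`) and the 1.5 × IQR outlier rule are assumptions chosen for illustration. The two operators stand in for the missing-value and outlier tasks of Section II.A.

```python
from statistics import quantiles

# Hypothetical records (student id, grade on a 0-10 scale).
# None marks a missing grade; 95.0 simulates a data-entry error.
records = [
    ("s1", 8.0), ("s2", 6.5), ("s3", None), ("s4", 7.0), ("s5", 6.0),
    ("s6", 7.5), ("s7", 95.0), ("s8", 9.0), ("s9", 5.5), ("s10", 8.5),
]


def drop_missing(rows):
    """Operator 1: keep only rows whose grade is present."""
    return [row for row in rows if row[1] is not None]


def drop_outliers(rows):
    """Operator 2: remove grades outside the 1.5 * IQR fences (assumed rule)."""
    grades = [grade for _, grade in rows]
    q1, _, q3 = quantiles(grades, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [row for row in rows if low <= row[1] <= high]


def chain(*operators):
    """Compose operators left to right into a single process."""
    def process(rows):
        for operator in operators:
            rows = operator(rows)
        return rows
    return process


process = chain(drop_missing, drop_outliers)
clean = process(records)
print(f"{len(records)} records in, {len(clean)} records out")
```

Each operator takes a list of rows and returns a new list, so operators can be reordered or extended without touching the others. This is the property that makes a workflow of chained operators easy to inspect step by step.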
Knime² (Konstanz Information Miner) is a DM platform that allows the development of models in a visual environment. It has been developed on the Eclipse platform and is essentially programmed in Java. Consequently, it can be executed on different operating systems. It offers different visualization options (histograms, pie charts, scatter plots, matrices, etc.) and supports the creation of statistical models and DM representations such as decision trees, regressions, clusters, etc. It also allows the user to call the Weka tool directly and to incorporate, in a simple way, code developed in R or Python [8].

Weka³ (Waikato Environment for Knowledge Analysis) is part of the Pentaho business intelligence suite. It contains a collection of sophisticated machine learning algorithms written in Java. Weka can be used in customized Java applications or through an integrated graphical user interface. It contains tools for data pre-processing, classification, association rules, clustering, regression and visualization. Weka is perhaps the oldest and most successful open source DM software and library, and in recent years it has been integrated into the libraries of RapidMiner, Knime, and R, among others [13].

² http://www.knime.org
³ http://www.cs.waikato.ac.nz

III. CASE STUDY
This section describes the steps mentioned above, as applied to a case study carried out on the academic records of three engineering programs at an Ecuadorian university.

A. Data collection and application of pre-processing techniques
The academic record file to be analyzed is in Excel format and comes from a query to the academic database of the university for the period 2016-2 (March 2016 - July 2016). The file contains information on the students' academic performance in three evaluation stages, each of which is called a progress grade. In addition, it records the number of times the student has repeated the subject and the name of the teacher responsible for it. The programs taken into account for the analysis were: Network and Telecommunications Engineering, Computer and Information Systems Engineering, and Electronics and Information Networks Engineering. Table I shows the details of the information analyzed.

TABLE I. SUMMARY OF FILE RECORDS

Description          Quantity
Number of programs   3
Number of records    3743
Number of students   662
Number of subjects   96
Number of teachers   170

Table II shows the attributes of the file with their respective descriptions.

TABLE II. DESCRIPTION OF ATTRIBUTES OF THE DATA FILE

Attribute            Description
Ano_lectivo          Year of the academic calendar.
Periodo              Academic period (the academic year has two periods).
Cod_carrera          Program (career) code.
Matricula            Academic code of the student.
Cod_asignatura       Code of the subject.
Nom_asignara         Name of the subject.
Nom_docente          Name of the teacher.
Paralelo             Number of courses in each subject.
Nota_progreso_1,     Grades of the three evaluation moments. The maximum
Nota_progreso_2,     grade of each component is 10.
Nota_examen_final
Nota_final           Final grade obtained in the course. The minimum passing grade is 6.
Situacion            Discretized variable: 1 = pass; 0 = fail.
Repitencia           Number of repetitions of the subject.

In the data pre-processing stage, the mining tools identified 9 records (0.2% of the entire file) that did not have complete information (missing values). An impact analysis was performed on the data set, and it was decided to remove them from the file. An important task in this phase is to determine the data type of each attribute and the role that it fulfills within the DM process. The tools used initially recognize all the variables as numerical. For this reason, these variables must be configured manually according to the analyst's requirements. Finally, in this stage of the process, the tools under consideration let the user review the basic statistics and view the different graphical options available. The graphical presentation helps to validate the data under consideration.

Table III describes some of the pre-processing options available in each of the tools analyzed.

TABLE III. PRE-PROCESSING FUNCTIONS

Tool         Pre-processing functions
RapidMiner   Within the cleansing group, RapidMiner presents pre-processing
             options classified into the following groups: Normalization,
             Binning, Missing, Duplicates, Outliers and Dimensionality
             Reduction.
Knime        There is a node called manipulation, and it can be applied at
             the level of rows, columns and tables. Within each one there
             are several pre-processing options, which are applied
             specifically at each of these levels.
Weka         Attribute filters: selection filters, discretization filters,
             filters to add expressions.
             Instance filters: selection of instances with attribute
             conditions.

B. Application of DM techniques
This section describes the tests performed and the application of the DM technique used to classify the data of the file. This classification was based on the three evaluation marks in each subject in the academic period 2016-2 and for each of the three engineering programs. Specifically, the K-Means algorithm was executed on 5 of the 13 original attributes that are part of the file. To select the best attributes, we reviewed the results obtained by applying four specific algorithms that are available in Weka: FilteredSubsetEval, ChiSquaredAttributeEval, GainRatioAttributeEval and OneRAttributeEval. The attributes that were taken into account were: Cod_carrera, Nota_progreso_1, Nota_progreso_2, Nota_examen_final, Nota_final.

Once the best attributes were obtained, the K-Means algorithm was executed. To do this, tests were performed to determine the number of clusters into which to divide the data set. After an initial exploration, the tests were performed with three and four clusters, and after an analysis of the composition of each cluster, it was determined that 3 would be the number of clusters to be analyzed. These tests were performed on the three tools. An additional and important parameter in this technique is the number of iterations that are performed to stabilize the algorithm. The number of iterations depends on the number of variables and the disparity of each of them. Although this parameter can be configured manually, in this work each tool determined the number of iterations automatically.
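
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by RapidMiner, Knime or Weka: the grade tuples are invented, the function names (`kmeans`, `min_separation`) are hypothetical, and the deterministic initialization is chosen only so that the example is reproducible. It runs K-means with three and four clusters and reports the smallest distance between any two centroids, which is one simple way to judge whether two clusters are close enough to be merged.

```python
import math
from itertools import combinations


def kmeans(points, k, max_iter=100):
    """Lloyd's K-means on equal-length tuples; returns (centroids, labels)."""
    # Assumption: start from k records evenly spaced by total grade,
    # so that repeated runs give the same result.
    ordered = sorted(points, key=sum)
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every record to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster in place
        if updated == centroids:  # stable: no centroid moved
            break
        centroids = updated
    return centroids, labels


def min_separation(centroids):
    """Smallest Euclidean distance between any two centroids."""
    return min(math.dist(a, b) for a, b in combinations(centroids, 2))


# Hypothetical (progress 1, progress 2, final exam) grades on a 0-10 scale.
grades = [
    (8.0, 8.2, 7.9), (8.3, 8.0, 8.1), (7.9, 8.4, 8.0),
    (6.4, 6.5, 6.2), (6.6, 6.3, 6.4), (6.3, 6.6, 6.1),
    (4.0, 2.9, 1.8), (3.8, 2.7, 2.0), (4.1, 2.8, 1.7),
]

centroids3, labels3 = kmeans(grades, 3)
centroids4, labels4 = kmeans(grades, 4)
sep3 = min_separation(centroids3)
sep4 = min_separation(centroids4)
print(f"k=3: closest centroids are {sep3:.2f} apart")
print(f"k=4: closest centroids are {sep4:.2f} apart")
```

On this toy data the four-cluster run places two centroids much closer together than any pair in the three-cluster run, which mirrors the kind of evidence that leads an analyst to prefer three clusters. The loop stops as soon as the centroids no longer move, so the number of iterations is determined by the data and does not need to be set by hand.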
Figs. 3, 4 and 5 show the process for executing the algorithm using each tool. In the case of RapidMiner, an operator was applied to evaluate the performance of the centroids established in each cluster. In Knime, an operator was placed to filter the columns, in order to work specifically with variables of the integer type. In addition, Knime can include operators to determine the color of the points in each cluster when making a ScatterPlot-type chart.

Fig. 3. Process for executing the K-means algorithm in RapidMiner

Fig. 4. Process for executing the K-means algorithm in Knime

Fig. 5. Execution of the K-means algorithm in Weka

C. Interpretation of the results
Table IV shows a comparison of the centroid values generated for each cluster in each of the three tools. A centroid of a cluster is defined as the equidistant point of the objects belonging to that cluster [14]. At first glance, the first two tools generate similar results, with no notable differences, whereas the last one differs.

When analyzing the data of the three centroid tables, it is observed that the establishment of 4 clusters is not the most appropriate decision because two of them are separated by a very small distance. That is, two clusters could be grouped together to form one. Consequently, following this analysis, it was decided to work with 3 clusters. It is important to take into account that the goal of the K-means algorithm is to maximize inter-cluster distance and minimize intra-cluster distance.

In Table IV it can be observed that there are three clearly defined groups using the three tools. On the one hand, the records that are around an average grade of 8.1 (Cluster 0), and that have passed the course, represent about 30% of the students registered in the file. On the other hand, the students who have achieved a grade in a range between 6.2 and 7.1 represent 59% of the file data (Cluster 1). These students have also passed the course, although with a grade close to the minimum. Initially this group was divided into two clusters. However, when analyzing the data in detail, we observed overlapping elements. Finally, there are students whose grades range between 1.8 and 3.9 and who therefore have failed the course; they represent approximately 11%. It can be seen in Table IV that the limits of the values in each cluster vary only minimally from tool to tool. This can be explained by the number of iterations performed in each tool.

The three software tools can generate a line plot that helps to detect the trend shown by each group of records across its three evaluation instances. The cluster 0 records (those ranging from 8.1 to 8.2) maintain a constant performance throughout the three evaluation instances. On the other hand, the records of cluster 2 (those that oscillate between 1.8 and 3.9) begin with a maximum of 4 points in the first grade and decrease until they reach a value of less than 2 in the final exam. Finally, the records of cluster 1 (those that oscillate between 6.2 and 6.5) show a stable performance in the three assessments.

Based on this analysis, strategic decisions must be made from the beginning of a new academic period with regard to those students that are represented in cluster 1, and which are part of the current study group.

TABLE IV. TABLE OF CENTROIDS FOUND USING THE THREE TOOLS

(a) RapidMiner cluster centroids
Attribute           Cluster 0   Cluster 1   Cluster 2
Nota_progreso_1     8.01        6.46        3.99
Nota_progreso_2     8.16        6.47        2.81
Nota_examen_final   7.97        6.28        1.85
Nota_total          8.05        6.41        2.88

(b) Knime cluster centroids
Attribute           Cluster 0   Cluster 1   Cluster 2
Nota_progreso_1     8.01        6.46        3.99
Nota_progreso_2     8.16        6.47        2.80
Nota_examen_final   7.97        6.28        1.85
Nota_total          8.05        6.40        2.88

(c) Weka cluster centroids
Attribute           Cluster 0   Cluster 1   Cluster 2
Nota_progreso_1     8.40        6.20        3.50
Nota_progreso_2     7.00        7.10        1.80
Nota_examen_final   7.60        6.50        1.15
Nota_total          7.70        6.60        2.21

IV. CONCLUSIONS AND FUTURE WORK
It is important to mention that a relevant task in this work consisted of the pre-processing of the data. This is because the quality and reliability of the information that is entered into a DM tool affects the results obtained.

The case study allowed us to evaluate the technical characteristics of the tools analyzed in terms of pre-processing and clustering techniques. In applying the K-means algorithm using the three study tools, it is observed in Table IV that Knime and RapidMiner present similar data for the cluster centroids, whereas Weka shows different values that are nevertheless within an acceptable range. This shows that the three tools work very similarly in terms of precision when applying a certain classification algorithm. Based on the reference literature, we can conclude that Weka presents the largest number of natively implemented algorithms, followed by RapidMiner and finally Knime. These last two tools use a significant number of algorithms imported from the Weka database. On the other hand, the Weka graphical interface is not user-friendly, unlike RapidMiner and Knime, in which the entire process can be observed in the workflow. In addition, their use is very intuitive.

This study gives teachers a guide to identifying those students who need special attention from the beginning of the course, so that the most appropriate measures can be taken at the right time.

The experience of the data analyst is critical in any DM process. In the present work an important aspect was the definition of the number of clusters. This was done based on the detailed analysis of the results obtained. After this process, very clearly delimited groups were obtained.

REFERENCES

[1] R. Saptarshi, “Big data in education,” Gravity, the Great Lakes Magazine, vol. 20, pp. 8-10, 2013.
[2] D. Buenaño Fernández and S. Luján-Mora, “Exploring approaches to educational data mining and learning analytics, to measure the level of acquisition of student's learning outcome,” 8th annual International Conference on Education and New Learning Technologies Proceedings, pp. 1845-1850, 2016.
[3] N. Rajadhyax and R. Shirwaikar, “Data mining on educational domain,” arXiv Preprint arXiv:1207.1535, pp. 1-6, 2012.
[4] K. Sin and L. Muthu, “Application of big data in education data mining and learning analytics – a literature review,” ICTACT Journal on Soft Computing, vol. 5, nº 4, pp. 1035-1049, 2015.
[5] K. Radhakrishnan, “Toppersworld.com,” 11 August 2015. [Online]. Available: http://toppersworld.com/top-5-open-source-data-mining-tools/. [Last access: 02 December 2016].
[6] S. Slater, et al., “Tools for educational data mining: a review,” Journal of Educational and Behavioral Statistics, pp. 1-20, 2016.
[7] F. Castro, et al., “Applying data mining techniques to e-learning problems,” in Evolution of teaching and learning paradigms in intelligent environment. Ed. Springer Link, Berlin Heidelberg, pp. 183-221, 2007.
[8] A. Lausc, A. Schmidt and L. Tischendorf, “Data mining and linked open data – New perspectives for data analysis in environmental research,” Ecological Modelling, vol. 295, pp. 5-17, 2015.
[9] C. Márquez, C. Romero and S. Ventura, “Predicción del fracaso escolar mediante técnicas de minería de datos,” IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, vol. 7, nº 3, pp. 109-117, 2012.
[10] IBM, “IBM Knowledge Center,” [Online]. Available: http://www.ibm.com/support/knowledgecenter/es/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/mva/idh_miss.htm. [Last access: 05 December 2016].
[11] G. González Sánchez, S. Delfin Avila and J. Lluís de la Rosa, “Preprocesamiento de bases de datos masivas y multi-dimensionales en minería de uso web para modelar usuarios: comparación de herramientas y técnicas con un caso de estudio,” TAMIDA III Taller Nacional de Minería de Datos y Aprendizaje, pp. 193-202, 2005.
[12] S. Parack, Z. Zahid and F. Merchant, “Application of data mining in educational databases for predicting academic trends and patterns,” Proceedings IEEE international conference on technology enhanced education (ICTEE), pp. 1-4, 2012.
[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD explorations newsletter, vol. 11 (1), pp. 10-18, 2009.
[14] A. Naika and L. Samant, “Correlation review of classification algorithm using data mining tool: WEKA, Rapidminer, Tanagra, Orange and Knime,” International Conference on Computational Modeling and Security (CMS 2016), vol. 85, nº 1, pp. 662-668, 2016.
