Professional Documents
Culture Documents
net/publication/228780408
CITATIONS READS
129 1,658
2 authors, including:
Paulo Cortez
University of Minho
214 PUBLICATIONS 2,793 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
A decision support system for slope management and repair View project
All content following this page was uploaded by Paulo Cortez on 31 May 2014.
Mean Squared (RMSE) is a popular metric (Witten and (i.e. G3 of Table 1) will be modeled using three super-
Frank 2005). A high PCC (i.e. near 100%) suggests a vised approaches:
good classifier, while a regressor should present a low
global error (i.e. RMSE close to zero). These metrics 1. Binary classification – pass if G3≥10, else fail;
can be computed using the equations: 2. 5-Level classification – based on the Erasmus1
grade conversion system (Table 2);
3. Regression – the G3 value (numeric output be-
1 , if yi = ybi
Φ(i) = tween 0 and 20).
0 , else
PN (1)
P CC = q i=1 Φ(i)/N × 100 (%) Figure 1 plots the respective histograms.
PN
RM SE = bi )2 /N Several DM algorithms, each one with its own purposes
i=1 (yi − y
and capabilities, have been proposed for classification
where ybi denotes the predicted value for the i-th exam- and regression tasks. The Decision Tree (DT) is a
ple. 1 European exchange programme that enables student exchange
I II III IV V
Country (excellent/very good) (good) (satisfactory) (sufficient) (fail)
Portugal/France 16-20 14-15 12-13 10-11 0-9
Ireland A B C D F
Mathematics:
250
120
80
100
200
60
80
150
40
60
100
40
20
50
20
0
0
150
150
400
100
300
100
200
50
50
100
0
0
Figure 1: Histograms for the output variables (binary/5-level classification and regression)
branching structure that represents a set of rules, dis- the DT/RF algorithms, since the latter explicitly per-
tinguishing values in a hierarchical form (Breiman et al. form an internal feature selection.
1984). This representation can translated into a set of
IF-THEN rules, which are easy to understand by hu- Computational Environment
mans. The Random Forest (RF) (Breiman 2001) is an
ensemble of T unpruned DT. Each tree is based in a ran- All experiments reported in this study were conducted
dom feature selection from bootstrap training samples using the RMiner (Cortez In press), an open source
and the RF predictions are built by averaging the out- library for the R environment that facilitates the use of
puts of the T trees. The RF is more difficult to interpret DM techniques (Figure 2). R is a free and high-level
when compared with the single DT, although it is still matrix programming language with a powerful suite of
possible to provide explanatory knowledge in terms of tools for statistical and data analysis (R Development
its input variable relevance. Nonlinear functions, such Core Team 2006). It can run in multiple platforms (e.g.
as Neural Networks (NN) and Support Vector Machines Windows, MacOS or Linux) and new features can be
(SVM), have also been proposed for DM tasks (Hastie added by creating packages.
et al. 2001), obtaining better results when a high non- The RMiner library2 presents a set of coherent func-
linearity is present. In this work, the NN model is based tions (e.g. mining, saveMining) for classification and
in the popular multilayer perceptron, with one hidden regression tasks (Cortez In press). In particular, the li-
layer with H hidden nodes, while the SVM will use a brary uses the rpart (DT), randomForest (RF), nnet
gaussian kernel with one hyperparameter (γ). It should (NN) and kernlab (SVM) packages. As an example,
be noted that NN and SVM use model representations the following RMiner/R code was used during the DT
that are difficult to understand by humans. Also, NN predictive experiments of the portuguese binary classi-
and SVM are more affected by irrelevant inputs than
2 http://www.dsi.uminho.pt/˜pcortez/R/rminer.zip
fication: To access the predictive performances, 20 runs of a 10-
fold cross-validation (Hastie et al. 2001) (in a total of 200
source("rminer.R") # read the RMiner library simulations) were applied to each configuration. Under
# load the data:
such scheme, for a given run the data is randomly di-
data=read.table("a-por-bin.dat",sep=";",header=T)
vided in 10 subsets of equal size. Sequentially, one dif-
K=c("kfold",10) # 10-fold cross-validation
# execute 10 runs of a DT classification: ferent subset is tested (with 10% of the data) and the
DT=mining(y~.,data,model="dt",Runs=20,method=K) remaining data used to fit the DM technique. At the
# show mean classification error on test set: end of this process, the evaluated test set contains the
print(mean(DT$error)) whole dataset, although 10 variations of the same DM
# save the results (predictions, ...): model are used to create the predictions.
saveMining(DT,"DT-results") As a baseline comparison, a naive predictor (NV) will
also be tested. For the A setup, this model is equal
to the second period grade (G2 or its binary/5-level
versions). When the second grade is not available (B
setup), the first period grade is used (or its binary/5-
level variants). If no evaluation is present (C setup)
then the most common class (for classification tasks) or
the average output value (regression) is returned.
The test set results are shown in Tables 3 to 5 in terms
of the mean and respective t-student 95% confidence in-
tervals (Flexer 1996). As expected, the A setup achieves
the best results. The predictive performance decreases
when the second period grade is not known (B) and
the wost results are obtained when no student scores
are used (C). Using only the last available evaluation
Figure 2: MacOS example of the R/RMiner tool. (NV method for the first two input setups) is the best
option for the Mathematics classification goals (both bi-
nary and 5-level) and Portuguese regression under the
RESULTS A input selection. This suggests that in these cases the
non evaluation inputs are of no use. However, the sce-
Predictive Performance nario changes for the remaining experiments. The RF
is the best choice in 8 cases, followed by the DT, which
Before fitting the models, some preprocessing was re- obtains 4 best results. In general, the nonlinear func-
quired by the NN and SVM models. The nominal3 vari- tion methods (NN and SVM) are outperformed by the
ables (e.g Mjob) were transformed into a 1-of-C en- tree based ones. This behavior may be explained by a
coding and all attributes were standardized to a zero high number of irrelevant inputs, as shown in the next
mean and one standard deviation (Hastie et al. 2001). section.
Next, the DM models were fitted. The DT node split As an example of the quality of the predictions, Figure 3
was adjusted for the reduction of the sum of squares. shows the confusion matrices for the DT algorithm, Por-
Regarding the remaining methods, the default parame- tuguese class and A setup. In both binary and 5-level
ters were adopted for the RF (e.g. T = 500), NN (e.g. classification, the majority of the values are in/near the
E = 100 epochs of the BFGS algorithm) and SVM (e.g. matrix diagonal, revealing a good fit. The figure also
Sequential Minimal Optimization algorithm). Also, the presents the RF scatter plot (predicted versus the ob-
NN and SVM hyperparameters were optimized using an served values) for the Mathematics regression approach.
internal grid search (i.e. using only training data) where The RF predictions are close to the diagonal line (which
H ∈ {0, 2, 4, 6, 8} and γ ∈ {2−9 , 2−7 , 2−5 , 2−3 , 2−1 }. denotes the perfect forecast) within most of the output
Since we suspected that the G1 and G2 grades would range (i.e. [5, 20]).
have a high impact, three input configurations were
tested for each DM model: Descriptive Knowledge
• A - with all variables from Table 1 except G3 (the
Here the goal is not to infer about the predictive capabil-
output);
ities of each model, as measured in the previous section,
• B - similar to A but without G2 (the second period
but to give a simple description that summarizes the
grade); and
• C - similar to B but without G1 (the first period best DM models. Thus, the whole dataset will be used
grade). in the descriptive experiments. Furthermore, only the
DT/RF algorithms will be considered, since it is more
3 Discrete with more than two non-ordered values. easy to extract knowledge from these models and they
Table 3: Binary classification results (PCC values, in %; underline – best model; bold – best within the input setup)
Table 4: Five-level classification results (PCC values, in %; underline – best model; bold – best within the input
setup)
Table 5: Regression results (RMSE values; underline – best model; bold – best within the input setup)
I. Mathematics Portuguese
S. NV NN SVM DT RF NV NN SVM DT RF
A 2.01±0.00 2.05±0.02 2.09±0.02 1.94±0.04 1.75† ±0.01 1.32±0.00 1.36±0.04 1.35±0.01 1.46±0.03 1.32±0.00
B 2.80±0.00 2.82±0.02 2.90±0.02 2.67±0.04 2.46† ±0.01 1.89±0.00 1.88±0.02 1.87±0.01 1.78∗±0.03 1.79±0.01
C 4.59±0.00 4.41±0.03 4.37±0.03 4.46±0.04 3.90† ±0.01 3.23±0.00 2.79±0.02 2.76±0.02 2.93±0.02 2.67† ±0.01
achieved in general the best predictive performances. grades are the most relevant features, appearing at the
Table 6 presents the relative importance (in percentage) root of the trees, and only a small number (2 to 5) of the
of each input variable as measured by the RF algorithm inputs considered are used. There are some interesting
(Breiman 2001). To clarify the analysis, only the five rules that can be extracted from these trees, for instance:
most relevant results are shown in the table. These five 1. if G1< 10 ∧ G2=9 ∧ Mjob ∈ {teacher, other} then
variables present an overall impact that ranges from 43% pass;
to 77%, which indicates that there is a high number of 2. if G1< 10 ∧ G2=9 ∧ Mjob ∈ {home, health, civil
irrelevant inputs. As expected, the student evaluations services} then fail;
have a high impact (from 23% to 46%) in the models. 3. if 12 ≤G2< 14 ∧ goout> 1 then III;
For instance, G2 is the most important feature for the A 4. if 12 ≤G2< 14 ∧ goout= 1 then II;
input selection example, while G1 is highly relevant for 5. if 11 <G1≤ 13 ∧ absences< 7 then 13;
the B setup. Also, the number of past failures, which is 6. if 11 <G1≤ 13 ∧ absences≥ 7 then 11;
related with previous student performance, is the most These rules show the influence of the mother’s job (rules
important factor when no student scores are available. 1–2), going out with friends (rules 3–4) and the number
Nevertheless, there are other relevant factors, such as of absences (rules 5–6).
school related (e.g. number of absences, extra school
support or travel time), demographic (e.g. the mother’s CONCLUSIONS
job) and social (e.g. going out with friends, alcohol con-
sumption) variables. Education is a crucial element in our society. Business
Figure 4 plots the best four DT. Again, the student Intelligence (BI)/Data Mining (DM) techniques, which
Table 6: Relative importance of the input variables for the best RF models
A setup: B setup:
G2<8 Decision G2<12? G1<12? G1>11?
false true false true false true false true
G1<10? Fail Leave G2<14? G2<9? G1<15 G1<9? G1>8? G1>13?
true false true true false true true false true true
Pass G2<10? G2<16? goout>1? IV V I G1<14? IV V 8 failures<1? absences<7? G1>16?
false true false true false true false true
Pass Mjob=teacher or other? I II II III II absences>7? 10 G1<11? 11 13 15 Fedu<3?
false true false true false true
Fail Pass III reason=school reputation? Fjob=teacher or other? 11 17 18
false true false
IV II 10 11
allow a high level extraction of knowledge from raw data, DM techniques were applied after the data was collected.
offer interesting possibilities for the education domain. However, there is a potential for an automatic on-line
In particular, several studies have used BI/DM methods learning environment, by using a student prediction en-
to improve the quality of education and enhance school gine as part of a school management support system.
resource management. This will allow the collection of additional features (e.g.
grades from previous school years) and also to obtain
In this paper, we have addressed the prediction of sec-
a valuable feedback from the school professionals. Fur-
ondary student grades of two core classes (Mathematics
thermore, we intent to enlarge the experiments to more
and Portuguese) by using past school grades (first and
schools and school years, in order to enrich the student
second periods), demographic, social and other school
databases. Automatic feature selection methods (e.g.
related data. Three different DM goals (i.e. binary/5-
filtering or wrapper) (Witten and Frank 2005) will also
level classification and regression) and four DM meth-
be explored, since only a small portion of the input vari-
ods, i.e. Decision Trees (DT), Random Forests (RF),
ables considered seem to be relevant. In particular, this
Neural Networks (NN) and Support Vector Machines
is expected to benefit the nonlinear function methods
(SVM), were tested. Also, distinct input selections (e.g.
(e.g. NN and SVM), which are more sensitive to irrele-
with or without past grades) were explored. The ob-
vant inputs. More research is also needed (e.g. sociolog-
tained results reveal that it is possible to achieve a high
ical studies) in order to understand why and how some
predictive accuracy, provided that the first and/or sec-
variables (e.g. reason to choose school, parent’s job or
ond school period grades are known. This confirms the
alcohol consumption) affect student performance.
conclusion found in (Kotsiantis et al. 2004): student
achievement is highly affected by previous performances.
Nevertheless, an analysis to knowledge provided by the
best predictive models has shown that, in some cases,
there are other relevant features, such as: school related ACKNOWLEDGMENTS
(e.g. number of absences, reason to choose school, ex-
tra educational school support), demographic (e.g. stu-
dent’s age, parent’s job and education) and social (e.g.
going out with friends, alcohol consumption) variables.
We wish to thank the students and school professionals
This study was based on an off-line learning, since the who voluntarily participated in this study.
Using Machine Learning Techniques. Applied Artifi-
Pass Fail I II III IV V
cial Intelligence (AAI), 18, no. 5, 411–426.
Pass 480 69 I 44 33 4 1 0
Luan J., 2002. Data Mining and Its Applications in
Fail 12 88 II 2 54 42 14 0 Higher Education. New Directions for Institutional
III 0 19 84 48 3 Research, 113, 17–36.
IV 0 0 24 111 66 Ma Y.; Liu B.; Wong C.; Yu P.; and Lee S., 2000.
Targeting the right students using data mining. In
V 0 0 0 1 88 Proc. of 6th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. Boston,
20
USA, 457–464.
Minaei-Bidgoli B.; Kashy D.; Kortemeyer G.; and
15
Figure 3: Example of the predictions on the test set: DT R Development Core Team, 2006. R: A language and
confusion matrices for A-Por-Bin and A-Por-5L setups environment for statistical computing. R Foundation
(top); and RF scatter plot for A-Mat-Reg (bottom) for Statistical Computing, Vienna, Austria, ISBN 3-
900051-00-3, http://www.R-project.org.
REFERENCES Turban E.; Sharda R.; Aronson J.; and King D.,
2007. Business Intelligence, A Managerial Approach.
Breiman L., 2001. Random Forests. Machine Learning, Prentice-Hall.
45, no. 1, 5–32. Witten I. and Frank E., 2005. Data Mining: Practi-
cal Machine Learning Tools and Techniques with Java
Breiman L.; Friedman J.; Ohlsen R.; and Stone C., 1984.
Implementations. Morgan Kaufmann, San Francisco,
Classification and Regression Trees. Wadsworth,
CA.
Monterey, CA.
Cortez P., In press. RMiner: Data Mining with Neural BIBLIOGRAPHIE
Networks and Support Vector Machines using R. In
R. Rajesh (Ed.), Introduction to Advanced Scientific PAULO CORTEZ was born in Braga, Portugal and
Softwares and Toolboxes. went to the University of Minho where he obtained
Eurostat, 2007. Early school-leavers. an Eng. Degree (1995), a M.Sc. (1998) and a Ph.D
http://epp.eurostat.ec.europa.eu/. (2002) in Computer Science. In 2001, he became Lec-
turer at the Dep. of Information Systems and Re-
Flexer A., 1996. Statistical Evaluation of Neural searcher at the Algoritmi R&D Centre. He partic-
Networks Experiments: Minimum Requirements and ipated in 7 R&D projects (principal investigator in
Current Practice. In Proceedings of the 13th Euro- 2) and he is co-author of more than fifty publica-
pean Meeting on Cybernetics and Systems Research. tions in int. conferences and journals. Research in-
Vienna, Austria, vol. 2, 1005–1008. terests: Business Intelligence/Data Mining; Neural and
Evolutionary Computation; Forecasting. Web-page:
Hastie T.; Tibshirani R.; and Friedman J., 2001. The
http://www.dsi.uminho.pt/˜pcortez
Elements of Statistical Learning: Data Mining, Infer-
ALICE SILVA studied Informations Systems, obtain-
ence, and Prediction. Springer-Verlag, NY, USA.
ing a Degree (Bragança Polytechnic Inst., 2003) and a
Kotsiantis S.; Pierrakeas C.; and Pintelas P., 2004. Pre- M.Sc. (U. Minho, 2007). Currently, she teaches Infor-
dicting Students’ Performance in Distance Learning matics at the secondary school of Arcozelo, Portugal.