You are on page 1of 18

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/326610642

An ensemble-based method for the selection of instances in the multi-target


regression problem

Article  in  Integrated Computer Aided Engineering · July 2018

CITATIONS

3 authors:

Oscar Reyes Habib M. Fardoun


University of Cordoba (Spain) King Abdulaziz University
30 PUBLICATIONS   87 CITATIONS    154 PUBLICATIONS   376 CITATIONS   

SEE PROFILE SEE PROFILE

Sebastian Ventura
University of Cordoba (Spain)
271 PUBLICATIONS   7,395 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Serendipity Engineering View project

Doctoral research View project

All content following this page was uploaded by Sebastian Ventura on 06 August 2018.

The user has requested enhancement of the downloaded file.


An ensemble-based method for the selection
of instances in the multi-target regression
problem
Oscar Reyesa, c, Habib M. Fardounb and Sebastián Venturaa, b, c, *
a
Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain.
b
Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia
Kingdom.
c
Knowledge Discovery and Intelligent Systems in Biomedicine Laboratory, Maimonides Biomedical Research
Institute of Córdoba, Córdoba, Spain.

Abstract. The multi-target regression problem comprises the prediction of multiple continuous variables at the same time us-
ing a common set of input variables, and in the last few years, this problem has gained an increasing attention due to the broad
range of real-world applications that can be analyzed under this framework. The complexity of the multi-target regression
problem is higher than the single-target regression one since target variables often have statistical dependencies, and these
dependencies should be correctly exploited in order to effectively solve this problem. Consequently, additional difficulties
appear when the aim is to perform a selection of instances on this type of data. In this work, an ensemble-based method to
perform the instance selection task in multi-target regression problems is proposed. First, a well-known instance selection
method is adapted to directly work with multi-target data. Second, the proposed ensemble-based approach uses a set of these
adapted methods to select the final subset of instances. The members of the ensemble select partial data subsets, where each
member is performed on a different input space that is expanded with target variables, exploiting therefore the underlying in-
ter-target dependencies. Finally, the ensemble-based method aggregates all the selected partial data subsets into a final subset
of relevant instances by means of solving an optimization problem with a simple greedy heuristic. The experimental study
carried out on 18 datasets shows the effectiveness of our proposal for selecting instances in the multi-target regression prob-
lem. Results demonstrate that the size of datasets is considerably reduced, whilst the predictive performance of the multi-target
regressors is maintained or even improved. Also, it is observed that the proposed method is robust to the presence of noise in
data.

Keywords: Instance selection, multi-target regression, ensemble learning, ensemble-based instance selection

1. Introduction MTR) is one of these problems, and it comprises the


prediction of multiple continuous variables from a
In the past few years, the scientific community has common set of input variables [59]. In other words,
paid an increasing attention to problems that com- MTR algorithms aim to learn a predictive model that,
prise the prediction of multiple outputs simultaneous- given an unseen input vector X, predicts a target vec-
ly, mainly due to the many real-world applications tor Z of numeric variables. This type of regression
that are possible to study within this framework [26, problems has been successfully applied to diverse
42, 51, 53, 54]. Multi-target regression (henceforth engineering applications including automatic control

*
Tel.: +34 957 212 218; Fax: +34 957 218 630;
E-mail: sventura@uco.es.
[28], energy efficiency [63] and signal processing Nowadays, these algorithms can bring many benefits
[66]. to the scientific community mainly due to their appli-
Up to date, many methods have been proposed to cations to the Big Data challenge [8]. The IS task has
tackle the MTR problem, and these can be organized been widely studied for the classification problem
into problem transformation and algorithm adapta- (see, for instance, the Olvera-López et al. [45] and
tion methods [10]. Problem transformation methods García et al. [24] works), however, this task for the
decompose an MTR problem into several single- regression problem has been far less studied [3]. The
target regression tasks. Recent researches have fo- IS task in the regression problem has some difficul-
cused on applying some well-known multi-label ties that do not exist in the classification task. For
learning transformation methods to solve the MTR instance, several IS methods assess the relevance of
problem, mainly motivated by the tight connection an instance by means of measuring its usefulness in
between these two learning paradigms2. In this regard, predicting the correct classes of its nearest neighbors.
Spyromitros-Xioufis et al. [59] demonstrated that However, the concept of data class in the regression
several multi-label approaches, such as the binary problem does not exist since the domain of the output
relevance [14], stacked generalization [64] and clas- variables is continuous. On the other hand, the identi-
sifier chains [50], are straightforward to adapt to the fication of class boundaries, an important criterion on
MTR problem. On the other side, the algorithm adap- which many IS methods for the classification prob-
tation category comprises algorithms that do not de- lem are formulated, does not have sense in regression
compose an MTR problem into several single-target [3]. As for performing the IS task in the MTR prob-
regression tasks; i.e. they directly handle the multi- lem, the complexity for selecting the instances is
target data. In this category, many methods have been higher than the one we could have in the single-target
proposed, such as statistical techniques [58], support regression problem, mainly due to the aforemen-
vector machines [44], kernel-based approaches [6], tioned challenges that MTR problem presents.
MTR trees [60], rule-based methods [1], and locally In the last two decades, existing ensemble-based
weighted regression methods [53]. methods have demonstrated to be really effective
Spyromitros-Xioufis et al. [59], Melki et al. [44], techniques to improve the results in complex prob-
Reyes et al. [53] and many other authors have lems [32, 33, 41, 46-48, 55, 67]. Kocev et al. [37] and
demonstrated that the MTR problem can be solved Spyromitros-Xioufis et al. [59], for example, demon-
more effectively if the inter-target correlations are strated how a better predictive performance could be
detected and exploited. However, the major challeng- obtained in solving the MTR problem by using en-
es of MTR lie in how to model such inter-target de- semble-based approaches. On the other hand, several
pendencies correctly, and how to estimate the nonlin- authors have demonstrated that the IS task can be
ear relationships that may exist between the input and significantly improved by means of applying ensem-
output spaces of the problem [75]. On the other hand, ble-based methods [56]. By this way, the relevance of
when the MTR problem is studied not all available the training instances is measured by considering not
training samples are useful to construct an accurate only a single criterion but many approximations and,
predictive model; it is well-known that noisy, redun- therefore, a more reliable estimation of the relevance
dant and incomplete data can significantly deteriorate of the instances is obtained.
the performance of the most learning algorithms [52]. In this work, an ensemble-based method to per-
Consequently, the acquisition of a high-quality and form the IS task in the MTR problem is proposed.
compact dataset, from which an algorithm can learn First, an error accumulation-based approach is intro-
relevant data relationships, is also an important issue duced, which is an adaptation of the well-known fam-
to be considered when tackling the MTR problem. ily of the Decremental Reduction Optimization Pro-
The instance selection task (henceforth IS) is an cedures (henceforth DROP) [71] to multi-target data.
important data preprocessing step, that aims to select Second, an ensemble-based method that effectively
a representative subset of an original dataset by filter- combines the partial data subsets that are previously
ing noisy and redundant data, in such a manner that selected by each member of the ensemble is proposed.
the predictive performance of the learner that was To obtain the final data subset, an aggregation pro-
induced from the data subset would be the same cess is carried out by a simple greedy heuristic that
(even better) as if the original dataset was used [45]. solves an optimization problem. The members of the
ensemble select the partial data subsets on different
2
In multi-label learning, the output variables (a.k.a. labels) are input spaces which are expanded by target variables,
restricted to binary values. exploiting therefore the underlying inter-target de-
pendencies. On the other hand, the method proposed Many IS methods have been proposed in the litera-
does not use any threshold value to decide whether an ture, and a complete description of these methods can
instance is selected or not, resulting in a method less be consulted in [25]. The IS methods can be catego-
dependent on the specific features of each problem. rized by considering the following three criteria: (I)
To the best of our knowledge, this is the first attempt the selection criterion used to select the instances; (II)
to study the selection of instances in MTR, and the the type of points that are removed in the IS process;
main motivation of this work is to analyze the bene- (III) and finally they can be classified according to
fits of the IS task for constructing better MTR models. the search direction used to obtain the final data sub-
The effectiveness of the proposal is assessed set.
through an extensive experimental study, where 18 The first category includes the wrapper [71] and
datasets of varied features and different application filter methods [43], and the main difference between
domains are used. The results showed that the pro- these two type of approaches lies in which the wrap-
posed IS method can significantly boost the predic- per methods select the relevant instances based on the
tive performance of the multi-target regressors, and prediction made by a learning algorithm, whilst the
therefore, it can benefit the development of methods filter methods are not based on a classifier to deter-
for solving complex problems that comprise the pre- mine the instances to be discarded from the training
diction of multiple outputs. A good trade-off between set.
the predictive performance and reduction rate is at- Regarding the second criterion, the IS algorithms
tained; the size of the training sets is reduced without can be classified into condensation [34], edition [69]
significantly deteriorating the predictive performanc- or hybrid methods [71]. The condensation methods
es of the multi-target regressors. In addition, the pro- retain the points closer to the decision boundaries
posed ensemble-based method demonstrates to be (border points), preserving the training error, but at
robust on datasets which have noise samples. the expense of deteriorating the generalization test
The remainder of this paper is arranged as follows. error. The edition methods remove the border points
Section 2 briefly describes the IS task and exposes and maintain the internal points, getting smoother
the related works that have been proposed to perform decision boundaries and reducing the generalization
the selection of instances in the regression problem. test error. The hybrid methods, on the other hand,
Section 3 presents the proposed ensemble-based remove the internal and border points, taking the ad-
method. Section 4 shows a description and discussion vantages of both the condensation and edition meth-
of the experimental results. Finally, some concluding ods.
remarks are presented in Section 5. As for the third criterion, the IS algorithms can be
classified into incremental [29], decremental [71],
2. Related works batch [12] or mixed methods [57]. The incremental
methods start with an empty data subset and continue
Roughly speaking, IS methods aim to reduce the adding instances to it; in this case, the presentation
size of an original training data but retaining or im- order of the instances is an important issue that might
proving the predictive capacity of the models. The affect the effectiveness of the IS algorithms. The dec-
optimal outcome of an IS method is a minimum data remental methods begin with the whole dataset and
subset from which a learning algorithm would ac- continues removing instances of it; in this case, the
complish the same task with no performance loss as presentation order of the instances is still an im-
if the original dataset was used [25]. However, some portant issue, but not so significant as in the case of
authors have noted that in practice, it is not always the incremental methods. As for batch methods, they
possible to maintain the performance levels as the analyze all the instances but without removing them,
dataset is reduced, and a loss of effectiveness may be and at the end of the process, all the instances marked
inevitable [15]. IS methods have the following goals as disposable are removed; the complexity of this
[25]: (I) decrease the computational cost for predict- type of methods is usually higher than the one of in-
ing new patterns; (II) reduce the storage requirements cremental and decremental methods. Finally, the
by removing redundant information from datasets; mixed methods begin with a pre-selected data subset,
(III) improve the performance of learning algorithms and the instances which satisfy a specific criterion
by removing noise and outliers; and (IV) increase the can be added or removed; the pre-selected data subset
efficiency when working on large-scale datasets. may be constructed by either a random selection, an
incremental method, or a decremental one.
Olvera-López et al. [45], García et al. [25], and sented by Blachnik & Kordos [9] and Arnaiz-
many other authors have noted that the IS task for the González et al. [3], who showed that a better IS pro-
classification problem is widely studied. However, cess in the regression problem can be achieved by
the selection of instances in the regression problem using bagging models.
has not followed the same path, existing far less re- Finally, it is important to note that all the afore-
search in this regard [3]. Some works have proposed mentioned works have been designed for selecting
different evolutionary algorithms to perform the IS instances on regression problems that have only one
task in the regression problem. For example, Tolvi target variable, and they are not directly applicable to
[62] presented a genetic algorithm that was able to the MTR problem. As far as we know, an IS method
detect the outliers in linear regression models, and for the MTR problem has not been proposed yet. In
Antonelli et al. [2] addressed the IS task through a addition to the difficulties that appear when the IS
multi-objective evolutionary learning approach. Also, process is performed on any regression problem
there have been other efforts focused on studying the (these were previously mentioned in the introduction
IS task in time series [61]. On the other hand, in the of this work), the major challenges of MTR arise
last few years, several works have been focused on from modelling the inter-target correlations and com-
the adaptation of some IS methods, that were origi- plex input-output relationships. In the next section, an
nally designed for the classification problem, to the ensemble-based method for the selection of instances
regression problem. For example, Kordos & Blach- in the MTR problem is presented.
nick [38] adapted the Condensed Nearest Neighbor
[29] and Edited Nearest Neighbor [69] methods, and 3. An ensemble-based method for the selection of
Arnaiz-González et al. [4] proposed an adaptation of instances
DROP method [71]. Finally, another approach for
performing the IS task in the regression problem In this section, first, an error accumulation-based
comprises the discretization of the target variable [5], approach, which is an adaptation of the well-known
and therefore, in this case any existing IS method can DROP method to multi-target data, is introduced, and
be used directly. then, the ensemble-based method to perform the IS
Independently, in order to improve the efficiency task in the MTR problem is presented.
and accuracy of a method to find a solution for a giv-
en learning problem, ensembles of methods have A DROP-based extension for the MTR problem
gained an increasing popularity in the research com-
munity in the last few decades [21, 73]. Summarizing Let us say S = {(X1, Y1), (X2, Y2), …, (Xn, Yn)}
the advantages of the ensemble-based methods [18, represents a dataset of n training instances. An in-
73]: (I) ensemble methods perform well in both sce- stance i  S is represented as a tuple (Xi, Yi), where
narios, when there are very scarce data samples for Xi  X and Yi  Y are the input and target vectors of i,
learning and when a huge amount of data is available; respectively. X represents the input space that con-
(II) a combined classifier can have a better predictive
tains d input variables {x1, x2, …, xd}, whereas Y is
performance that the best individual classifier; (III)
the output space that comprises q target variables {y1,
combining methods trained from different samples
y2, …, yq}3. On the other hand, xli denotes the value of
could overcome the local optima problem; and (IV)
an exact function may be impossible to be modelled the l-th input variable for the instance i, whereas yli
by any single hypothesis, but the combination of sev- represents the value of its l-th target variable. An
eral hypotheses may expand the space of representa- MTR algorithm aims to learn a predictive model 
ble functions. Taking the advantages provided by the that, given an unseen input vector X, can predict a
ensemble learning paradigm, it is not surprising the target vector Z that best approximates the true target
use of ensemble-based methods to perform the IS vector Y.
task [8, 56]. The main objective of such ensemble- DROP is a well-known IS method that, according
based IS methods is to produce more reliable estima- to the three categories portrayed in Section 2, can be
tions of the relevance of the instances by aggregating classified as a wrapper, hybrid and decremental
the outputs produced by the members of the ensemble. method. In this work, the DROP-based adaptation to
In this regard, there are very few works that have
studied the selection of instances in the regression 3
The domain of the input variables can be continuous, discrete
problem following an ensemble-based approach. The or mixed type, whereas the domain of the target variables is always
most relevant works on this topic are the ones pre- continuous.
the regression problem presented by Arnaiz-González
et al. [4] is extended to multi-target data. 1 q
 (y j − z j )2

jAi , (1)
Let us say Ni represents the set of k-nearest neigh-
bours of the instance i in the input space X. Given a
q =1  (y
jAi
j − ym ) 2
set of instances S, the set of associates of i (denoted
as Ai) comprises those training instances that include
to i into their sets of k-nearest neighbors, i.e. Ai = {j Algorithm 1. DROPMTR algorithm. The function
 S | i  Nj}. Our proposed DROP-based method kNearestNeighbors(i, S, k) computes the k-nearest neighbors
of the instance i in the dataset S. The function predictTar-
uses the following simple rule as removal criterion: getVector(, N, i) trains the multi-target regressor  on N
the instance i can be safely removed from the training and predicts the target vector of the instance i.
set, if the target vectors of its associated instances
can be correctly estimated without considering i. Input
S: training set of multi-target instances, k: number of near-
From this rule arises the necessity of designing a reli- est neighbours, : multi-target regressor
ability function for multi-target data that assigns Output
proper scores according to the error levels obtained in SS  S: subset of training instances
the predictions of the target vectors of the associate Begin
instances. This reliability function is crucial for the SS  S
success of the IS process since the error associated #Create an empty set of associates for each example i  SS
foreach i  SS do
with the elimination of a relevant instance can rein-
Ai  
force itself in the subsequent iterations. However, end
developing a good reliability measure is not a simple foreach i  SS do
task [13]. This has not been entirely resolved yet in # Find the k nearest neighbours of i on the input space X
the single-target regression and classification, and Ni  kNearestNeighbors(i, SS , k)
even less for the MTR problem [39]. # Add i to the sets of associates
Different reliability scores have been proposed for foreach j  Ni do
single-target regression, such as the estimation based Aj  Aj  {i}
end
on sensitivity analysis [13], local cross-validation end
[13], analysis of the density of the distribution of foreach i  SS do
instances [19], the variance of bagged models [31], # Predict the target vectors of the associate instances
and the estimation of the instances’ error by consider- foreach j  Ai do
ing its local environment in the training set [11]. Re- predictTargetVector (, Nj, j)
cently, Levatić et al. [39] defined various reliability predictTargetVector (, Nj \{i}, j)
end
functions for the MTR problem following a semi-
compute Ewithi
supervised approach. However, these last-mentioned compute Ewithouti
functions are not directly applicable to our problem # Check whether the instance i can be removed or not
since they do not consider the true target vector of the if Ewithouti ≤ Ewithi then
instances. SS  SS \{i}
# Remove the instance i from each set of nearest neighbours
In this work, given a training instance i, we follow
foreach j  Ai do
a traditional approach to estimate the error made in Nj  Nj \{i}
predicting the target vectors of the associate instances # Find a new nearest neighbour for j
by considering their local environments; it is similar p  kNearestNeighbors (j, SS \ Nj, 1);
to the approach proposed by Briesemeister et al. in # Add p to the set of nearest neighbours of j
[11]. Given the associate instance j of i, a multi-target Nj  Nj  {p}
# Add j to the set of associates of p
algorithm  is trained on the dataset formed by the Ap  Ap  {j}
set of k-nearest neighbors Nj, and afterward the model end
predicts a target vector for j (Zj). This procedure is end
performed for each associated instance j  Ai, and the end
estimation of the global error can be calculated with return SS
end
the average relative root mean square error (aRRM-
SE)
where ylj and zlj are the values of the l-th target varia- fk(n, d)) steps are needed to compute the list of asso-
ble in the true (Yj) and predicted (Zj) target vectors of ciates of all training instances. On the other hand, let
the associate instance j, respectively. Also, ylm is the us say f(N,i) represents the cost function of training
mean of the true values for the l-th target variable in the multi-target regressor  on the dataset N (dataset
the set of associates of i. aRRMSE is a measure wide- formed by k instances), added to the cost for predict-
ly used in the MTR literature [10], and it averages the ing the target vector of the instance i. So, O(n×aavg×
RMSE values of each target variable, and automati- f(K,i)) steps are required to compute the accumula-
cally re-scales the error contributions of each target tive errors of all the training instances, being aavg the
variable. average number of associates per instance. Conse-
We believe that aRRMSE is a reliable estimator quently, the overall runtime complexity of the pro-
since it considers the actual errors made by the inter- posed IS method is O(max(n×fk(n, d), n×aavg×
nal regressor. It also imposes almost no additional f(K,i))).
computational overhead, as opposed to some other
estimation methods for regression. We denoted as Ensemble-based IS method
Ewithi the global error in predicting the target vectors
of the associated instances of i, but without removing We propose an ensemble-based method for tack-
i from the sets of the nearest neighbors of its associ- ling the IS task in the MTR problem since more reli-
ated instances, whereas Ewithouti represents the op- able estimations on the importance of the instances
posite case. Finally, the instance i can be safely re- could be obtained if multiple approximations were
moved from the training set if Ewithouti ≤ Ewithi. considered. Similar to Spyromitros-Xioufis et al. [59],
Algorithm 1 shows the steps performed by our error we adopted an approach that composes the ensemble
accumulation-based approach (hereafter, dubbed as by means of adding target variables to the input space
DROPMTR). of the MTR problem.
The proposed method is able to detect the errors The rationale of our proposal is as follows. Given a
and outliers in data. For example, if the set of associ- multi-target dataset S with q target variables, our en-
ate instances of the instance i is empty (Ai = ), it semble is formed by q+1 members, where q of them
could mean that this instance is an outlier since i is select a data subset from a slightly different version
not a neighbour of any training instance. In addition, of the original dataset S. The l-th member (l ≤ q) of
it could be the case that Ewithouti = Ewithi but Ai ≠ , the ensemble (denoted as Il) selects the instances
meaning that the instance i does not contribute for a from a multi-target dataset which has an input space
better prediction of the target vectors of its associates, equal to X  {yl}, and an output space equal to Y
and therefore it can be safely removed. \ {yl}. By this way, the member Il allows to model the
The main advantages of the proposed approach are contribution of the l-th target for predicting the rest of
that it can be wrapped around any existing MTR re-
the target variables yp  Y | p ≠ l, and therefore, Il
gressor (problem transformation or algorithm adap-
tation methods), and it also does not depend on any would select a data subset formed by those relevant
threshold value to decide whether an instance is se- instances that reflect the inter-target dependencies
lected. Therefore, the IS process can be significantly that are related with the l-th target variable. The last
benefited from the capacities of the internal regressor. member of the ensemble (Iq+1) is executed on the
The proposed IS method can implicitly exploit the original dataset S, and the selected data subset would
inter-target dependencies for the selection of more comprise those instances that are relevant for predict-
relevant instances if the internal MTR is able to mod- ing all target variables. Finally, each member of the
el such correlations. In this sense, the challenge of ensemble returns a subset S{1, 2, …, q, q+1}, and then, an
modelling complex input-output relationships can be aggregation process determines the best subset (Se)
also effectively tackled since our proposal can be from these q+1 partial data subsets.
applied with any linear and non-linear regression Figure 1 shows the general schema of the proposed
MTR algorithm. ensemble-based method for performing the IS task in
Regarding the runtime complexity of our proposal, the MTR problem. It is noteworthy that the diversity
let us say S is a dataset with n instances, d input vari- of the members of the ensemble is tackled by means
ables and q target variables, and fk(n, d) represents the of executing each member on different datasets. Also,
cost function for determining the k-nearest neigh- note that the members must be IS methods able to
bours of an instance in the input space X. So, O(n × work directly with multi-target data, as DROPMTR
Figure 1. Schema of the proposed ensemble-based method.

method. On the other hand, in order to add more di- aims to determine the subset of instances Se  T that
versity to the ensemble, for each new dataset over generalises well the observations in T and minimises
which the members are performed, the presentation the following cost function
order of the instances is randomly changed. This ac-
tion also allows that the ensemble method will be less
sensitive to the presentation order of the instances, 1 q (y i − zi ) 2
 iT , (3)
that is a limitation of any DROP-based method.
Another important issue to analyse in our approach
q =1 (y
iT
i − ym ) 2
is how to aggregate the q+1 partial data subsets into a where yli and zli are the values of the l-th target varia-
final data subset of instances; this component is of a ble in the true (Yi) and predicted (Zi) target vectors of
major importance in all ensemble-based methods. In i, respectively, and ylm is the mean of true values for
this work, we adopt a stacking approach [72], where the l-th target in the set T. Note that, this cost function
q+1 independent members select data subsets from S,
is simply the measure aRRMSE, but now it is defined
but finally one extra model produces an optimal
over the set T.
combination of the outputs of the members.
Solving this optimization problem with classical
Let us say ci represents the times the instance i  S methods could take a considerable runtime since the
is selected by the members, and it can be calculated objective function requires the evaluation of a multi-
as
q +1 target regressor  for predicting the target vector of
ci =  1S (i ) , (2) each instance i  T. On the other hand, if we consider
=1 this formulation as a searching problem, the number
where 1Sl is the function that indicates whether the of feasible solutions is 2|T| - 1, resulting in a huge
instance i is in the data subset Sl selected by the l-th space. Consequently, in this work, the following sim-
member or not. It is important to note that, although ple heuristic is defined: The instances that were se-
each member is executed on datasets that slightly lected by a higher number of members are preferred
differ from the original dataset S and the presentation over those instances with a fewer selection frequency.
order of the instances is changed in each dataset, a Consequently, the following hill climbing process is
function that maps the new indexes of the instances proposed to compute the final subset of instances Se:
to the original indexes in the dataset S can be easily considering as starting point the set of instances most
constructed. By this way, given the data subset Sl selected (denoted as Rm), add continuously to Se the
selected by the l-th member of the ensemble, it is next set of instances with the highest selection fre-
easy to retrieve the corresponding original instances quency (Rl | 1≤ l <m), until a degradation in estimat-
from S. ing T is obtained.
The set of instances that are exactly selected by l Algorithm 2 shows the steps of the proposed en-
members is denoted as Rl = {i  S | ci = l}, and T rep- semble-based IS method (hereafter, dubbed as
EDROMTR). EDROMTR comprises two phases: (I)
resents the set of instances resulting of the union of
the ensemble’s members select the partial data sub-
the subsets S1, S2, …, Sq, Sq+1. Therefore, our proposal
Algorithm 2. EDROPMTR algorithm. The function Algorithm 2. Continuation.
construct(S, Xnew, Ynew) constructs a new dataset from S but
considering the input and output spaces Xnew and Ynew, respec- foreach l  {m-1, m-2, …, 1} do
tively. The function shuffle(S) changes the presentation order if Rl ≠  then
of the instances in S. The function dropMTR(S, k, ) per- predictTargetVectors (, Se  Rl , T)
forms the DROPMTR algorithm on the dataset S, using the enew  equation 3
internal multi-target regressor  and considering k nearest if enew ≤ ebest then
neighbours. The function retrievOriginal(Sl , S) transforms all Se  Se  Rl
the instances i  Sl, recovering their original forms as they ebest  enew
appear in S. The function predictTargetVectors(, S, T) else
trains the regressor  on S and predicts the target vectors of break
all the instances i  T. end
end
Input end
S: training set, k: number of nearest neighbours, : multi- return Se
target regressor end
Output
Se  S: subset of training instances sets; and (II) the q+1 selected data subsets are aggre-
Begin gated by the proposed greedy heuristic. Generally
#Compute the frequency of selection of each instance
speaking, the first phase requires the construction of
foreach i  S do
ci  0
the datasets from which the members select the par-
end tial data subsets, the execution of q+1 IS methods,
T  and finally, the estimation of the frequency that each
# Construct the members instance is selected. Let us say that fDROPMTR repre-
foreach l  {1, …, q+1} do sents the computational cost of the DROPMTR
Rl 
method. Therefore, the overall runtime complexity of
if l ≤ q then
the first phase is O((q + 1) × fDROPMTR), since the rest
# Remove the l-th target variable from Y and add it to X
Ynew  Y \ {yl }
of the mentioned steps of the first phase can be per-
Xnew  X  {yl } formed in linear time. Note that, each member of the
else ensemble can be executed in parallel, so the efficien-
Ynew  Y cy can be significantly improved.
Xnew  X On the other hand, the second phase of the ensem-
end ble-based method comprises the execution of the
# Construct a new training set from S proposed greedy heuristic, which in turn needs to
Snew  construct (S, Xnew, Ynew)
# Change the order presentation
train and test (at most m times) the internal MTR re-
Snew shuffle (Snew) gressor  for determining the data subset Se  T from
# Execute DROPMTR on the training set Snew which is attained a better estimation of T. Let us say
Sl  dropMTR (Snew, k, ) f(Se, T) represents the cost function of training  on
# Retrieve the original instances from S Se, added to the cost of testing  on T, where Se  T.
Sl  retrieveOriginal (Sl , S)
So, the overall complexity of the second phase is
T  T  Sl
# Increment the frequency of selection of the instances O(m× f(Se, T)). It is noteworthy that the multi-target
foreach i  Sl do regressor used in the second phase is the same to the
ci  ci + 1 one that is employed by each member of the ensem-
end ble. Therefore, the overall complexity of the pro-
end posed ensemble-based method is O(max((q + 1) ×
# Construct the frequency sets
fDROPMTR, m× f(Se, T))).
foreach i  S do
if ci > 0 then
RCi  RCi  {i} 4. Experimental study
end
end In this section, the experimental study is described.
# Determine the set of instances most selected First, a description of the datasets and other experi-
m  arg max Rl ≠ 
l  {1, …, q+1}
mental settings used in the experiments are presented.
# Hill climbing process Second, DROPMTR and EDROPMTR are performed
Se  Rm on all the datasets, with the aim of analysing whether
predictTargetVectors (, Se, T)
ebest  equation 3
the proposed IS methods improve or maintain the
predictive performance of the MTR regressors, and to
demonstrate that the best performance is attained by Table 1. Summary of the benchmark datasets.
the proposed ensemble-based IS method. Dataset #Instances #Input vars. #Target vars.
Andro 49 30 6
4.1 Multi-target datasets Atp1d 337 411 6
Atp7d 296 411 6
Edm 154 16 2
In this experimental study, the largest collection of Enb 768 8 2
MTR datasets publicly available was used [59]. All Jura 359 15 3
the 18 datasets within this collection have a variety of Oes10 403 298 16
Oes97 334 263 16
features and belong to several application domains. Osales 639 413 12
Some of these datasets represent well-known engi- Rf1 9125 64 8
neering problems, for example: the dataset Electrical Rf2 9125 576 8
Discharge Machining (Edm) [36] represents a two- Scm1d 9,803 280 16
Scm20d 8966 61 16
target regression problem, where the task is to mini- Scpf 1137 23 3
mize the machining time by reproducing the behav- Sf1 323 10 3
iour of a human operator that controls two variables; Sf2 1066 10 3
the dataset Energy Building (Enb) [63] concerns the Slump 103 7 3
Wq 1060 16 14
prediction of the heating and cooling loads require-
ments of buildings as a function of eight parameters;
the dataset Concrete Slump (Slump) [74] comprises
tion of the classic kNN algorithm for the MTR prob-
the prediction of three properties of concrete as a
lem. In this work, this kNN-based method was used
function of the content of seven concrete ingredients.
as the internal MTR regressor of our IS methods
On the other hand, the Andromeda (Andro) [30]
(DROPMTR and EDROPMTR). The best number of
and Water Quality (Wq) [20] datasets concern the
nearest neighbours (k) was estimated via cross-
prediction of water quality parameters, whereas the
validation on the original datasets. The main reason
Jura dataset [28] focus on the prediction of the con-
to use this MTR regressor is due to its simplicity and
centration of metals. The Solar Flare datasets [40]
low computational cost. However, note that any other
(Sf1 and Sf2) are about the prediction of the number
MTR algorithm could be used as internal regressor
of solar flares are observed within one day. The River
since the proposed methods follow a wrapper ap-
Flow datasets (Rf1 and Rf2) [59] concern the predic-
proach.
tion of river network flows. Finally, we have the fol-
The parameter k is also important for DROPMTR
lowing datasets associated with the business domain:
since the lists of associates are created by computing
Online Product Sales (Osales) [34], See Click Predict
the k-nearest neighbours of each instance of the train-
Fix (Scpf) [35], Airline Ticket Price (Atp1d and
ing set. This parameter was set the same as the num-
Atp7d) [59], Supply Chain Management (Scm1d and
ber of nearest neighbours used by the internal MTR
Scm20d) [59] and Occupational Employment Survey
regressor described before. As for the distance func-
(Oes10 and Oes97) [59].
tion used to compute the nearest neighbours of a
Table 1 shows a summary of the characteristics of
point, the well-known Heterogeneous Euclidean
the datasets. The datasets vary in size: from 49 up to
Overlap Metric (HEOM) was used [70].
9,803 examples, from 7 up to 576 input variables, and
On the other hand, Spyromitros-Xioufis et al. [59]
from 2 up to 16 target variables. All the datasets have
showed that the method Ensemble of Regressor
numeric input variables, except for Sf1 and Sf2
Chains (ERC) is one of the most significant state-of-
whose input variables are discrete. The datasets Scpf,
the-art MTR methods. Hence, the effectiveness of the
Osales, Rf1, Rf2, Atp1d and Atp1d have missing
proposed IS methods was assessed by means of eval-
values that were replaced by the median values of the
uating ERC on the selected data subsets. ERC is a
corresponding input variables. Finally, all the numer-
problem transformation method, and therefore, it in-
ic variables were centred and scaled.
ternally requires a single-target regressor. Three sin-
gle-target regressors were used, namely RepTree,
4.2 Experimental settings
Linear Regression and the classic kNN (the parame-
ters proposed in [59] were used), resulting in three
Pugelj & Dzeroski [39] presented a simple adapta-
different combinations of ERC (dubbed as ERC- that the final data subsets selected by EDROPMTR
REPTree, ERC-LR and ERC-kNN). will have a size greater than those data subsets select-
To estimate the predictive performance of the ed by one DROPMTR; the partial q+1 data subsets
MTR models, the measure aRRMSE (previously de- could be very diverse between each other and, there-
scribed in Section 3) was analysed on the test sets. In fore, the aggregation process will attain a lower re-
all datasets, a 10-fold cross-validation was performed, duction level than the one that could be obtained by a
and the aRRMSE values were averaged across all single DROPMTR method. However, it is notewor-
fold executions. In each fold execution, the following thy that, although DROPMTR obtained the best re-
steps were conducted: (I) the IS method reduces the duction levels in 13 datasets, there were no statistical
training set; (II) the multi-target regressor is trained differences between the reduction levels attained by
on the selected data subset; and (III) the learned DROPMTR and EDROMTR; the Wilcoxon Signed-
model is assessed on the test set. On the other hand, Rank test [68] did not reject the null hypothesis with
the effectiveness of the IS methods was also studied a p-value equal to 0.092 at the significance level
by means of analysing the reduction levels of the size =0.05.
of the training sets.
Finally, non-parametric statistical tests were con- Table 2. Average reduction levels attained by DROPMTR and
ducted to analyse and validate the obtained results, as EDROPMTR.
proposed by Demsar [17]. All computational methods Dataset DROPMTR EDROPMTR
were implemented in the Java language and integrat- Andro 0.449 0.330
ed into MULAN library [65]. MULAN is constructed Atp1d 0.463 0.400
over the popular framework WEKA [22] and is de- Atp7d 0.459 0.430
Edm 0.385 0.204
signed for researching in multi-label learning and Enb 0.306 0.220
MTR. Jura 0.342 0.219
Oes10 0.494 0.524
4.3 Reduction levels on the size of the datasets Oes97 0.522 0.584
Osales 0.454 0.451
Rf1 0.372 0.351
This experiment aims to analyse whether the two Rf2 0.330 0.179
proposed IS methods (DROPMTR and EDROPMTR) Scm1d 0.344 0.425
Scm20d 0.366 0.199
can significantly reduce the size of the datasets. The Scpf 0.327 0.192
attained reduction rate on a dataset is calculated as 1 Sf1 0.639 0.844
– sr/so, where sr is the number of instances in the se- Sf2 0.731 0.710
lected data subset, and so is the number of instances Slump 0.338 0.272
Wq 0.452 0.728
in the original dataset. The higher a reduction rate,
Ave. reduction rate 0.432 0.403
the higher the percentage of instances that were re-
moved from the training sets. Table 2 shows the re-
4.4 Analyzing the predictive performance of the
duction rates averaged across all fold executions. The
multi-target regressors
best reduction rate attained in each dataset is high-
lighted in bold typeface.
This experiment focusses on determining whether
It is observed that DROPMTR attained reduction
the application of the proposed IS methods implies a
levels from 0.306 till 0.731, whereas EDROPMTR
significant improvement or deterioration in the over-
obtained reduction levels from 0.179 till 0.844.
all predictive performance of the regressors ERC-
DROPMTR method produced a big reduction (73%)
REPTree, ERC-LR and ERC-kNN. These three mul-
in the dataset Sf2, whereas EDROPMTR method
ti-target regressors were trained on the original train-
achieved a significant reduction (84%) in the dataset
ing sets, and on the subsets selected by the IS meth-
Sf1. In average, the experimental results showed that
ods.
the DROPMTR can reduce the size of the datasets
Tables 3-5 show the results of the aRRMSE
more than EDROPMTR. This behaviour was ex-
measure. In each row, the best error value is high-
pected because EDROPMTR intends to determine the
lighted in bold typeface. The column named “Origi-
best subset of instances that produces the lowest pre-
nal” represents the predictive performance obtained
diction error on a test set that contains all the instanc-
on the original datasets, whereas the columns named
es selected by the q + 1 DROPMTR members of the
“SubsetDROPMTR” and “SubsetEDROPMTR” represent the
ensemble. Consequently, the expected tendency is
predictive performance of the MTR regressors on the
data subsets selected by DROPMTR and EDROP- IS methods can select subset of relevant instances,
MTR, respectively. The Friedman’s test [23] was and also that these particular datasets have a consid-
conducted to perform multiple comparisons, and the erable number of irrelevant and/or redundant instanc-
last row of the tables shows the average ranking es. The average rankings computed by Friedman’s
computed by this test. test shows that, in average, the best results were re-
ported when the MTR regressors were executed on
Table 3. Results of the aRRMSE measure for ERC-REPTree. The the data subsets selected by EDROPMTR, indicating
Friedman’s statistic is equal to 8.333, and the null hypothesis was
the effectiveness of the proposed ensemble-based
rejected with a p-value=0.015 at the significance level =0.05.
approach. Furthermore, Friedman’s test rejected all
Dataset Original SubsetDROMTR SubsetEDROMTR the null hypotheses, indicating that significant differ-
Andro 0.595 0.744 0.674
Atp1d 0.438 0.436 0.431
ences exist in the predictive performance of the MTR
Atp7d 0.605 0.699 0.657 regressors.
Edm 0.923 0.906 0.862
Enb 0.133 0.155 0.140 Table 5. Results of the aRRMSE measure for ERC-kNN. The
Jura 0.689 0.687 0.675 Friedman’s statistic is equal to 9.194, and the null hypothesis was
Oes10 0.616 0.614 0.625 rejected with a p-value=0.010 at the significance level =0.05.
Oes97 0.706 0.838 0.811
Osales 0.782 0.898 0.878 Dataset Original SubsetDROMTR SubsetEDROMTR
Rf1 0.121 0.120 0.108 Andro 0.619 0.824 0.628
Rf2 0.147 0.134 0.105 Atp1d 0.452 0.451 0.439
Scm1d 0.358 0.351 0.341 Atp7d 0.616 0.648 0.619
Scm20d 0.476 0.509 0.496 Edm 0.819 0.846 0.835
Scpf 0.887 0.828 0.823 Enb 0.308 0.307 0.300
Sf1 1.081 0.832 0.819 Jura 0.734 0.732 0.720
Sf2 1.018 0.951 0.904 Oes10 0.452 0.462 0.460
Slump 0.783 0.762 0.751 Oes97 0.551 0.570 0.571
Wq 0.952 0.951 0.941 Osales 0.919 0.912 0.910
Avg. ranking 2.278 2.278 1.444 Rf1 0.180 0.120 0.120
Rf2 0.198 0.158 0.132
Table 4. Results of the aRRMSE measure for ERC-LR. The Scm1d 0.351 0.321 0.311
Friedman’s statistic is equal to 7.861, and the null hypothesis was Scm20d 0.303 0.327 0.322
rejected with a p-value=0.020 at the significance level =0.05. Scpf 1.054 0.808 0.801
Sf1 1.144 0.825 0.821
Dataset Original SubsetDROMTR SubsetEDROMTR Sf2 1.705 1.163 1.115
Andro 5.335 1.399 2.562 Slump 0.750 0.749 0.730
Atp1d 1.280 0.833 1.094 Wq 0.932 0.941 0.921
Atp7d 2.119 1.688 1.150 Avg. ranking 2.278 2.305 1.417
Edm 0.835 0.906 0.855
Enb 0.315 0.323 0.319
Jura 0.607 0.611 0.610
The Bergmann-Hommel’s test [7] was conducted in
Oes10 0.833 0.579 0.489 order to perform all pairwise comparisons and detect
Oes97 1.330 0.720 0.649 particular significant differences. Figure 2 shows the
Osales 1.864 1.754 1.657 results of this statistical test, highlighting two im-
Rf1 0.522 0.562 0.533
Rf2 0.488 0.387 0.314
portant results: (I) the predictive performance of the
Scm1d 0.393 0.400 0.287 MTR regressors that were trained on the data subsets
Scm20d 0.643 0.641 0.643 selected by DROPMTR are not significantly different
Scpf 0.887 0.840 0.533 to the performance attained when they were trained
Sf1 1.196 0.969 0.922
Sf2 1.545 1.348 1.291
on the original training sets, so indicating that
Slump 0.683 0.682 0.677 DROPMTR can reduce considerably the size of the
Wq 0.959 0.976 0.964 datasets without deteriorating the performance of the
Avg. ranking 2.361 2.167 1.472 regressors; and (II) the predictive performance of
those regressors that were trained on the data subsets
It was observed that the predictive performance of selected by EDROPMTR is significantly better than
the three MTR regressors is improved in many cases. the performance attained on the original training sets
Also, it is relevant to note that the predictive perfor- and on the data subsets selected by DROPMTR, so
mance was improved even on those datasets for showing the potential of the proposed ensemble-
which the IS methods attained high reduction levels based method.
(e.g. Sf1, Sf2 and Wq), so showing that the proposed
a) ERC-RepTree b) ERC-LR c) ERC-kNN

Figure 2. All pairwise comparisons conducted by Bergmann-Hommel's test. In the diagrams, the groups of methods that are not signifi-
cantly different are connected by a line.

4.5 Noise tolerance


Table 6. Average reduction rates attained at the different noise
levels.
In general, data gathered in real-world problems
include noise and, therefore, the predictive perfor- Dataset Noise level
mance of learning algorithms can be significantly 10% 20% 30%
Andro 0.424 0.534 0.545
deteriorated [25]. In this regard, the IS methods also Atp1d 0.400 0.455 0.519
have the intention of eliminating the noise and outli- Atp7d 0.463 0.510 0.588
ers in data. Edm 0.212 0.301 0.393
Once the superiority of the proposed ensemble- Enb 0.280 0.279 0.390
Jura 0.250 0.279 0.300
based method has been demonstrated, in this section Oes10 0.637 0.757 0.794
we analysed its capacity to eliminate noise from data. Oes97 0.625 0.632 0.653
It was analysed whether EDROPMTR maintains or Osales 0.467 0.580 0.688
even increases the predictive performance levels of Rf1 0.337 0.448 0.498
Rf2 0.189 0.218 0.257
the regressors on datasets which have different noise Scm1d 0.245 0.313 0.399
levels. Similar to the method proposed by Arnaiz- Scm20d 0.202 0.315 0.417
González et al. in [4], we added noise to the original Scpf 0.450 0.485 0.582
datasets by exchanging target vectors of randomly Sf1 0.833 0.855 0.871
Sf2 0.768 0.780 0.838
selected instances; the random selection was made Slump 0.323 0.447 0.466
without replacement. By this way, the sample distri- Wq 0.127 0.200 0.233
butions in the input and output spaces of the training
sets are not modified. Three different noise levels Table 7. The predictive performance of ERC-REPTree at the dif-
(10%, 20% and 30%) were introduced in the training ferent noise levels.
data; target vectors are swapped until these percent- Dataset 10% 20% 30%
ages of the total of instances in the training data are Noisy Red. Noisy Red. Noisy Red
Andro 0.771 0.851 0.939 0.977 1.051 1.037
modified. A 10-fold cross-validation process was Atp1d 0.681 0.600 0.834 0.732 0.875 0.790
executed five times with different seeds, and finally Atp7d 0.657 0.739 0.895 0.943 1.003 0.994
Edm 0.929 0.880 0.964 0.883 1.019 0.900
the results were averaged. Enb 0.567 0.554 0.812 0.808 0.904 0.895
Table 6 shows the average reduction rates at the Jura 0.844 0.808 0.926 0.907 0.987 0.942
Oes10 0.869 0.919 0.970 0.924 0.961 0.931
different noise levels. We can see that better reduc- Oes97 0.821 0.980 0.927 0.974 0.967 0.964
tion rates were obtained as the noise levels increased. Osales 0.834 0.899 0.892 0.942 0.900 0.956
Thus, EDROPMTR is able to detect and remove Rf1 0.471 0.450 0.621 0.546 0.653 0.572
Rf2 0.158 0.112 0.164 0.115 0.175 0.120
noise from data. On the other hand, Tables 7-9 show Scm1d 0.378 0.347 0.399 0.349 0.455 0.356
the predictive performance of the three regressors Scm20d 0.646 0.638 0.729 0.711 0.803 0.747
Scpf 0.937 0.865 0.989 0.900 0.999 0.908
considered in the experimental study. In each row, Sf1 1.119 0.832 1.161 0.850 1.227 0.859
the best aRRMSE value attained at each noise level is Sf2 1.071 0.931 0.995 0.842 1.662 0.856
Slump 0.916 0.859 0.977 0.909 1.000 0.926
highlighted in bold type face. The columns with the Wq 0.975 0.972 0.986 0.985 0.993 0.993
label “Noise” represent the predictive performance of p-value 0.186 0.008 0.000
the regressors on the datasets with noise, whereas the
columns with the label “Red.” represent the predic- detect whether there were significant differences in
tive performance attained on the data subsets selected the predictive performance attained at each noise
by EDROPMTR. level. The p-values computed by the Wilcoxon
It was observed that, in many cases, the predictive Signed-Rank test are shown in the last row of Tables
performance of the multi-target regressors were im- 7-9. The statistical test rejected all the null hypothe-
proved once the training sets are pre-processed with ses at the significance level =0:05, excepting the
the IS method. The Wilcoxon’s test was conducted to
one related to ERC-REPTree regressor at the 10% of 4.6 Discussion
noise. These results indicate that EDROPMTR is able
to detect and eliminate noisy instances, allowing to Two IS methods were proposed: the first one
improve the generalization error of the regressors. (DROPMTR) is a DROP-base extension that removes
Finally, it is noteworthy that, on average, a worse internal and border points that do not contribute to a
predictive performance was obtained as the noise better prediction of the target vectors of their neigh-
level increased. This is an expected result because bours, whereas the second one is an ensemble-based
EDROPMTR selects smaller training sets as the IS method (EDROPMTR) that aggregates multiple
number of noisy instances in data increased. predictions to select a final data subset of relevant
instances. Any of the two methods do not require the
Table 8. The predictive performance of ERC-LR at the different use of threshold values for determining whether an
noise levels. instance is included in the selected data subset or not.
Dataset 10% 20% 30% This is a major advantage because threshold values
Noisy Red. Noisy Red. Noisy Red are usually problem dependent and, therefore, it is
Andro 5.933 2.909 6.545 3.064 8.848 3.205
Atp1d 2.206 1.890 2.787 2.162 3.248 2.290
required to conduct an additional analysis to select
Atp7d 2.605 1.213 5.147 1.519 4.757 1.783 their adequate values. On the other hand, the pro-
Edm 0.910 0.924 0.956 0.955 1.055 1.010 posed IS methods have acceptable runtime complexi-
Enb 0.571 0.553 0.790 0.737 0.862 0.790 ties that allow their use in large-scale datasets. In the
Jura 0.758 0.739 0.861 0.817 0.936 0.834
Oes10 3.079 1.075 3.185 1.169 3.964 1.103
case of EDROPMTR, the members of the ensemble
Oes97 2.497 0.842 2.627 0.975 5.074 1.384 can be easily executed in parallel, so allowing to con-
Osales 2.257 1.785 2.272 2.000 2.525 2.085 siderably decrease the runtime needed to select the
Rf1 0.683 0.607 0.772 0.624 0.790 0.675 final data subset.
Rf2 0.602 0.402 0.639 0.405 0.645 0.410
Scm1d 0.435 0.297 0.483 0.295 0.493 0.299
Another advantage of the proposed ensemble-
Scm20d 0.729 1.000 0.776 0.769 0.826 0.825 based method is that it can implicitly model the inter-
Scpf 0.963 0.643 1.134 0.681 1.160 0.687 target dependencies, so easing the selection of more
Sf1 1.401 0.953 1.469 1.078 1.525 0.121 relevant instances. By analysing the way the mem-
Sf2 1.593 1.318 1.629 1.304 1.707 1.323
Slump 0.811 0.804 0.900 0.885 0.899 0.890
bers of the ensemble are constructed, some similari-
Wq 0.974 0.978 0.984 0.988 0.995 1.007 ties with regard to the approach proposed by Spyro-
p-value 0.000 0.000 0.000 mitros-Xioufis et al. in [59] are observed. In such an
approach, it was demonstrated that the expansion of
the input space with target variables is an effective
Table 9. The predictive performance of ERC-kNN at the different manner to exploit the inter-target dependencies. By
noise levels.
this way, each member of the ensemble models the
Dataset 10% 20% 30% relationship of one target variable with the rest of the
Noisy Red. Noisy Red. Noisy Red
Andro 0.808 0.803 0.855 0.882 1.090 1.032
targets.
Atp1d 0.702 0.685 0.825 0.807 0.879 0.862 On the other hand, it is possible to consider that
Atp7d 0.677 0.708 0.967 0.959 1.032 0.989 the aggregation process formulated in the last step of
Edm 0.859 0.852 0.861 0.851 0.933 0.865 EDROPMTR is an artefact that tries to exploit the
Enb 0.603 0.607 0.848 0.851 1.089 1.012
Jura 0.832 0.827 0.949 0.916 1.012 0.980 similarities between the structural and stochastic parts
Oes10 0.884 0.884 0.912 0.900 0.930 0.906 of the models. The structural-part of models corre-
Oes97 0.776 0.756 0.833 0.839 0.951 0.953 spond to the data subsets selected by each member of
Osales 0.929 0.919 0.959 0.931 0.962 0.943 the ensemble, whereas the stochastic-parts are related
Rf1 0.495 0.404 0.666 0.409 0.696 0.450
Rf2 0.218 0.142 0.243 0.152 0.256 0.169
to the errors associated with the searching of the data
Scm1d 0.366 0.322 0.399 0.342 0.423 0.361 subset that attains the best estimation. According to
Scm20d 0.531 0.537 0.641 0.649 0.845 0.814 Dembczynski et al. [16], those methods that follow
Scpf 1.151 0.880 1.184 0.902 1.191 0.913 an architecture similar to the one used by EDROMTR
Sf1 1.151 0.839 1.185 0.845 1.235 0.866
Sf2 1.804 1.128 1.847 1.132 1.931 1.141
can model the existing marginal and conditional de-
Slump 0.806 0.793 0.901 0.899 0.985 0.972 pendencies between target variables.
Wq 0.967 0.965 0.986 0.970 1.006 1.000 Finally, it is important to highlight that the two
p-value 0.003 0.002 0.000
proposed IS methods follow a wrapper approach, and
therefore, they can be implicitly benefited from the
power of the internal regressor. Consider that, a pow-
erful MTR regressor could tackle not only the model- well-known DROP method to multi-target data. Sec-
ling of inter-target dependencies, but also the estima- ond, an ensemble-based method that effectively com-
tion of complex non-linear input-output relationships. bines the partial data subsets selected by each mem-
Consequently, it is highly likely that the selected data ber of the ensemble has been also presented. The ma-
subsets contain those relevant instances that reflect jor features of our approach are: (I) a wrapper ap-
this type of data relationships, which have shown to proach was adopted where any MTR regressor can be
be of paramount importance for solving more effec- used to estimate the relevance of the instances, so the
tively the MTR problem. IS task can be benefited from the capacities of the
As main drawbacks, it is noteworthy that internal regressor to model the inter-target dependen-
EDROPMTR focuses more on minimizing errors cies and complex input-output relationships; (II) the
than on the size of the final subset and, therefore, the way the ensemble's members are constructed not only
final data subset is not necessarily the most consistent guarantee the diversity between them, but also the
subset of instances. Also, EDROPMTR is not suita- modelling of the inter-target dependencies; (III) the
ble for incremental learning scenarios since it would proposed ensemble-based method selects the final
require recomputing from scratch the subset of rele- data subset by a simple greedy heuristic process,
vant instances every time that new samples are added. avoiding the use of complex optimization algorithms;
In this work, an extensive experimental study was and (IV) no threshold values are used in order to de-
carried out. The first experiment showed that the pro- cide whether an instance is selected or removed, so
posed IS methods is able to reduce significantly the the proposed approach is less problem dependent.
size of the datasets. Excellent reduction rates were The experimental study confirmed the benefits of
attained on datasets with a moderate number of input the IS task for solving the MTR problem, which was
variables (e.g. the datasets Atp1d, Atp7d, Oes10, the main motivation of the present work. A good
Oes97 and Osales), as well as on datasets with many trade-off between the reduction levels of the size of
target variables (e.g. the datasets Oes10, Oes97, the datasets and the predictive performance of the
Osales and Wq). regressors was attained. Consequently, not only the
In addition to the high reduction levels that were runtime needed to construct a regression model on
attained in several datasets, the second experiment large-scale datasets is significantly reduced, but also
showed that EDROPMTR can significantly improve its predictive performance can be even improved.
the predictive performance of the regressors. This Future works will study better solutions for solving
result is very promising since in the past several au- the optimisation problem formulated to aggregate the
thors have noted that not always is possible to main- partial data subsets selected by the ensemble’s mem-
tain the predictive performance of the learning algo- bers. It is noteworthy that the final data subset deter-
rithms after applying an IS method. Also, the results mined by the proposed ensemble-based method is not
showed that EDROPMTR significantly outperforms the minimal data subset and, therefore, this is a rele-
to DROPMTR, demonstrating the effectiveness of the vant point to be studied in future works. On the other
proposed ensemble-based approach. hand, it would be interesting to consider other ways
The third experiment demonstrated that the pro- of exploiting the relationships between targets varia-
posed ensemble-based IS method is robust on da- bles. In this regard, the design of other approaches to
tasets which have noise. The results indicated that constructing the members of the ensemble is a possi-
EDROPMTR is able to detect the noisy instances, ble idea to follow. Finally, it would also be important
allowing that the regression models do not deteriorate to study the benefits of combining the IS task with
so much their predictive performance. Consequently, the feature selection one for constructing better MTR
EDROPMTR is well suited to be used in real-world models.
engineering applications that require the elimination
of noise before performing crucial tasks.
Acknowledgements
4. Conclusions
This research was supported by the Spanish Minis-
In this work, an ensemble-based method to per- try of Economy and Competitiveness and the Euro-
form the IS task in the MTR problem has been pro- pean Regional Development Fund, project TIN2017-
posed. First, an error accumulation-based approach 83445-P.
has been introduced, which is an adaptation of the
References [14] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown,
Learning multi-label scene classification, Pattern
[1] T. Aho, B. Zenko, S. Dzeroski, and T. Elomaa. recognition, 37(9):1757–1771, 2004.
Multi-target regression with rule ensembles. Journal [15] J. Calvo-Zaragoza, J. J. Valero-Mas, and J. R.
of Machine Learning Research, 373:2055-2066, 2009. Rico-Juan. Improving kNN multi-label classification
[2] J. Antonelli, P. Ducange, and F. Marcelloni. Ge- in prototype selection scenarios using class proposals.
netic training instance selection in multiobjective Pattern Recognition, 48(5):1608-1622, 2015.
evolutionary fuzzy systems: a coevolutionary ap- [16] K. Dembczynski, W. Waegeman, W. Cheng, and
proach. IEEE Transactions on Fuzzy Systems, E. Hullermeier. On label dependence and loss mini-
20(2):276-290, 2012. mization in multi-label classification. Machine
[3] A. Arnaiz-González, M. Blachnik, M. Kordos, Learning, 88(1):5-45, 2012.
and C. I. García-Osorio. Fusion of instance selection [17] J. Demsar. Statistical comparisons of classifiers
methods in regression tasks. Information Fusion, over multiple data sets. Journal of Machine Learning
30:69-79, 2016. Research, 7:1-30, 2006.
[4] A. Arnaiz-González, J. F. Díez-Pastor, J. J. Ro- [18] T. G. Dietterich. Ensemble methods in machine
dríguez, and C. I. García-Osorio. Instance selection learning. In International workshop on multiple clas-
for regression: Adapting DROP. Neurocomputing, sifier systems, Springer, Berlin, Heidelberg, LNCS
201:66-81, 2016. 1857: 1-15, 2000.
[5] A. Arnaiz-González, J. F. Díez-Pastor, J. J. Ro- [19] H. Dragos, M. Gilles & V. Alexandre. Predicting
dríguez, and C. I. García-Osorio. Instance selection the predictability: a unified approach to the applica-
for regression by discretization. Expert Systems with bility domain problem of QSAR models. Journal of
Applications, 54:340-350, 2016. Chemical Information and Modeling, 49: 1762–1776,
[6] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. 2009.
Multi-output learning via spectral filtering. Machine [20] S. Dzeroski, D. Demsar, and J. Grbovic. Predict-
Learning, 87(3):259-301, 2012. ing chemical parameters of river water quality from
[7] G. Bergmann and G. Hommel. Improvements of bioindicator data. Applied Intelligence, 13(1):7-17,
general multiple test procedures for redundant sys- 2000.
tems of hypotheses. Multiple Hypotheses Testing, [21] A. Fernández, C. J. Carmona, M. J. del Jesús,
100-115, Springer, 1988. and F. Herrera. A Pareto Based Ensemble with Fea-
[8] M. Blachnik. Ensembles of instance selection ture and Instance Selection for Learning from Multi-
methods based on feature subset. Procedia Computer Class Imbalanced Datasets. International Journal of
Science, 388-396, 2014. Neural Systems, 27(6): 1750028, 2016.
[9] M. Blachnik and M. Kordos. Bagging of instance [22] E. Frank, M. A. Hall, and I. H.Witten. The WE-
selection algorithms. In Artificial Intelligence and KA-Workbench. In Data Mining: Practical Machine
Soft Computing, LNCS, Springer, 8468:40-51, 2014. Learning Tools and Techniques. Morgan Kaufmann,
[10] H. Borchani, G. Varando, C. Bielza, and P. La- 4th edition, 2016.
rra˜naga. A survey on multi-output regression. Wiley [23] M. Friedman. A comparison of alternative tests
Interdisciplinary Reviews: Data Mining and of significance for the problem of m rankings. The
Knowledge Discovery, 5(5):216-233, 2015. Annals of Mathematical Statistics, 11: 86-92, 1940.
[11] S. Briesemeister, J. Rahnenführer, & O. [24] S. García, J. Derrac, J. Cano, and F. Herrera.
Kohlbacher. No longer confidential: estimating the Prototype selection for nearest neighbor classifica-
confidence of individual regression predictions. PloS tion: taxonomy and empirical study. IEEE Transac-
one, 7(11): e48723, 2012. tions on Pattern Analysis and Machine Intelligence,
[12] H. Brighton and C. Mellish. Advances in in- 34(3):417-435, 2012.
stance selection for instance-based learning algo- [25] S. García, J. Luengo, and F. Herrera. Data Pre-
rithms. Data Mining and Knowledge Discovery, processing in Data Mining. Springer, 2015.
6:153-172, 2002. [26] N. M. Ghani and M. O. Tokhi. Simulation and
[13] Z. Bosnic and I. Kononenko. Comparison of control of multipurpose wheelchair for disa-
approaches for estimating reliability of individual bled/elderly mobility. Integrated Computer-Aided
regression predictions, Data and Knowledge Engi- Engineering, 23(4): 331-347, 2016.
neering, 67 (3): 504-516, 2008. [27] P. Goovaerts. Geostatistics for natural resources
evaluation. Oxford University Press on Demand,
1997.
[28] Z. Han, Y. Liu, J. Zhao, and W. Wang. Real [43] E. Marchiori. Class conditional nearest neighbor
time prediction for converter gas tank levels based on for large margin instance selection. IEEE Transac-
multi-output least square support vector regressor. tions on Pattern Analysis and Machine Intelligence,
Control Engineering Practice, 20(12):1400-1409, 32:364-370, 2010.
2012. [44] G. Melki, A. Cano, V. Kecman, and S. Ventura.
[29] P. Hart. The condensed nearest neighbor rule. Multi-target support vector regression via correlation
IEEE Transactions on Information Theory, 14:515- regressor chains. Information Sciences, 415-416:53-
516, 1968. 69, 2017.
[30] E. V. Hatzikos, G. Tsoumakas, G. Tzanis, N. [45] J. A. Olvera-López, J. Carrasco-Ochoa, J. Marti-
Bassiliades, and I. P. Vlahavas. An empirical study nez-Trinidad, and J. Kittler. A review of instance
on sea water quality prediction. Knowledge-Based selection methods. Artificial Intelligence Review,
Systems, 21(6):471-478, 2008. 34(2):133–143, 2010.
[31] T. Heskes. Practical confidence and prediction [46] A. Ortiz, J. Munilla, J. M. Gorriz, and J. Ramírez.
intervals. In Advances in Neural Information Pro- Ensembles of deep learning architectures for the early
cessing Systems, MIT Press, 9:176-182, 1997. diagnosis of the Alzheimer’s disease. International
[32] G. Iacca, F. Caraffini, F. Neri. Multi-strategy Journal of Neural Systems, 26(07): 1650025, 2016.
Coevolving Aging Particle Optimization. Interna- [47] C. Otte and C. Störmann. Improving the accura-
tional Journal of Neural Systems, 24(1):1450008, cy of network intrusion detectors by input-dependent
2014. stacking. Integrated Computer-Aided Engineering,
[33] G. Iacca and F. Caraffini and F. Neri. Continu- 18(3): 291-297, 2011.
ous Parameter Pools in Ensemble Differential Evolu- [48] Y. Ouyang and H. Yin. Multi-Step Time Series
tion. In IEEE Symposium Series on Computational Forecasting with an Ensemble of Varied Length Mix-
Intelligence, 1529-1536, 2015. ture Models. International journal of neural systems,
[34] Kaggle. Kaggle competition: Online product 28(04): 1750053, 2018.
sales. https://www.kaggle.com/c/online-sales. 2012. [49] M. Pugelj and S. Dzeroski. Predicting structured
[35] Kaggle. Kaggle competition: See click predict outputs k-nearest neighbours method. In Discovery
fix. https://www.kaggle.com/c/see-click-predict-fi. Science, Springer, 262-276, 2011.
2013. [50] J. Read, B. Pfahringer, G. Holmes, and E. Frank.
[36] A. Karalic and I. Bratko. First order regression. Classifier chains for multi-label classification. Ma-
Machine Learning, 26(2-3):147-176, 1997. chine Learning, 85(3):333-359, 2011.
[37] D. Kocev, C. Vens, J. Struyf, and S. Džeroski. [51] O. Reyes, C. Morell, and S. Ventura. Evolution-
Tree ensembles for predicting structured outputs. ary feature weighting to improve the performance of
Pattern Recognition, 46:817–833, 2012. multi-label lazy algorithms. Integrated Computer-
[38] M. Kordos and M. Blachnick. Instance selection Aided Engineering, 21(4): 339-354, 2014.
with neural networks for regression problems. In Ar- [52] O. Reyes, A. H. Altalhi, and S. Ventura. Statisti-
tificial Neural Networks and Machine Learning, cal comparisons of active learning strategies over
7553:263-270. Springer, 2012. multiple datasets. Knowledge-Based Systems, 145,
[39] J. Levatić, M. Ceci, D. Kocev and S. Džeroski. 274-288, 2018.
Self-training for multi-target regression with tree en- [53] O. Reyes, A. Cano, H. Fardoun, and S. Ventura.
sembles. Knowledge-Based Systems, 123: 41-60, A locally weighted learning method based on a data
2017. gravitation model for multi-target regression. Interna-
[40] M. Lichman. UCI machine learning repository. tional Journal of Computational Intelligence Systems,
http://archive.ics.uci.edu/ml. 2013. 11:282-295, 2018.
[41] Y. Lim, H. M. Kim, S. Kang, and T. H. Kim. [54] O. Reyes, C. Morell, and S. Ventura. Effective
Vehicle-to-grid communication system for electric active learning strategy for multi-label learning. Neu-
vehicle charging. Integrated Computer-Aided Engi- rocomputing, 273:494-508, 2018.
neering, 19(1): 57-65, 2012. [55] M. Roveri and F. Trovò. An Ensemble Approach
[42] R. Lostado, R. F. Martínez, B. J. Mac Donald for Cognitive Fault Detection and Isolation in Sensor
and P. M. Villanueva. Combining soft computing Networks. International Journal of Neural Systems,
techniques and the finite element method to design 27(03): 1650047, 2017.
and optimize complex welded products. Integrated [56] M. Saidi, M. Bechar, N. Settouti, and M. A.
Computer-Aided Engineering, 22(2): 153-170, 2015. Chikh. Instances selection algorithm by ensemble
margin. Journal of Experimental & Theoretical Arti- [68] F. Wilcoxon. Individual comparisons by ranking
ficial Intelligence, 30 (3):457-478, 2018. methods. Biometrics, 1:80-83, 1945.
[57] B. Sierra, E. Lazkano, I. Inza, M. Merino, P. [69] D. Wilson. Asymptotic properties of nearest
Larrañaga, and J. Quiroga. Prototype selection and neighbor rules using edited data. IEEE Transactions
feature subset selection by estimation of distribution on Systems, Man and Cybernetics, 2:408-421, 1972.
algorithms. A case study in the survival of cirrhotic [70] D. Wilson and T. R. Martínez. Improved hetero-
patients treated with TIPS. In 8th Conference on AI in geneous distance functions. Journal of Artificial In-
Medicine in Europe, LNCS, 2101:20-29, 2001. telligence Research, 6:1-34, 1997.
[58] T. Simila and J. Tikka. Input selection and [71] D. Wilson and T. R. Martínez. Reduction tech-
shrinkage in multiresponse linear regression. Compu- niques for instance-based learning algorithms. Ma-
tational Statistics & Data Analysis, 52(1):406-422, chine Learning, 38:257-286, 2000.
2007. [72] D. H. Wolpert. Stacked generalization. Neural
[59] E. Spyromitros-Xioufis, G. Tsoumakas, W. networks, 5(2):241-259, 1992.
Groves, and I. Vlahavas. Multi-target regression via [73] M. Woźniak, M. Graña, and E. Corchado. A
input space expansion: Treating targets as inputs. survey of multiple classifier systems as hybrid sys-
Machine Learning, 104(1):55-98, 2016. tems. Information Fusion, 16: 3-17, 2014.
[60] D. Stojanova, M. Ceci, A. Appice, and S. [74] I. C. Yeh. Modeling slump flow of concrete us-
Dˇzeroski. Network regression with predictive clus- ing second-order regressions and artificial neural
tering trees. Data Mining and Knowledge Discovery, networks. Cement and Concrete Composites,
25(2):378-413, 2012. 29(6):474-480, 2007.
[61] M. B. Stojanovic, M. M. Bozic, M. M. [75] X. Zhen, M. Yu, X. He, and S. Li. Multi-target
Stankovic, and Z. P. Stajic. A methodology for train- regression via robust low-rank learning. IEEE Trans-
ing set instance selection using mutual information in actions on Pattern Analysis and Machine Intelligence,
time series prediction. Neurocomputing, 141(0):236- 40(2): 497-504, 2018.
245, 2014.
[62] J. Tolvi. Genetic algorithms for outlier detection
and variable selection in linear regression models.
Soft Computing, 8(8):527-533, 2004.
[63] A. Tsanas and A. Xifara. Accurate quantitative
estimation of energy performance of residential
buildings using statistical machine learning tools.
Energy and Buildings, 49:560-567, 2012.
[64] G. Tsoumakas, A. Dimou, E. Spyromitros, V.
Mezaris, I. Kompatsiaris, and I. Vlahavas. Correla-
tion-based pruning of stacked binary relevance mod-
els for multi-label learning. In ECML/PKDD 2009
Workshop on Learning from Multi-Label Data, 101-
116, 2009.
[65] G. Tsoumakas, E. Spyromitros-Xioufi, J. Vilcek,
and I. Vlahavas. MULAN: A Java Library for Multi-
Label Learning. Journal of Machine Learning Re-
search, 12:2411-2414, 2011.
[66] D. Tuia, J. Verrelst, L. Alonso, F. Pérez-Cruz,
and G.Camps-Valls. Multioutput support vector re-
gression for remote sensing biophysical parameter
estimation. IEEE Geoscience and Remote Sensing
Letters, 8(4):804-808, 2011.
[67] E. Wandekokem, E. Mendel, F. Fabris, M.
Valentim, R. J. Batista, F. M. Varejão and T. W.
Rauber. Diagnosing multiple faults in oil rig motor
pumps using support vector machine classifier en-
sembles. Integrated Computer-Aided Engineering,
18(1): 61-74, 2011.

View publication stats

You might also like