3 authors, including Sebastian Ventura (University of Cordoba, Spain). Content uploaded by Sebastian Ventura on 06 August 2018.
Abstract. The multi-target regression problem comprises the prediction of multiple continuous variables at the same time using a common set of input variables, and in the last few years this problem has gained increasing attention due to the broad range of real-world applications that can be analyzed under this framework. The multi-target regression problem is more complex than the single-target one, since the target variables often have statistical dependencies, and these dependencies should be correctly exploited in order to solve the problem effectively. Consequently, additional difficulties appear when the aim is to perform a selection of instances on this type of data. In this work, an ensemble-based method to perform the instance selection task in multi-target regression problems is proposed. First, a well-known instance selection method is adapted to work directly with multi-target data. Second, the proposed ensemble-based approach uses a set of these adapted methods to select the final subset of instances. The members of the ensemble select partial data subsets, where each member is executed on a different input space that is expanded with target variables, thereby exploiting the underlying inter-target dependencies. Finally, the ensemble-based method aggregates all the selected partial data subsets into a final subset of relevant instances by solving an optimization problem with a simple greedy heuristic. The experimental study carried out on 18 datasets shows the effectiveness of our proposal for selecting instances in the multi-target regression problem. The results demonstrate that the size of the datasets is considerably reduced, whilst the predictive performance of the multi-target regressors is maintained or even improved. It is also observed that the proposed method is robust to the presence of noise in the data.

Keywords: Instance selection, multi-target regression, ensemble learning, ensemble-based instance selection
* Tel.: +34 957 212 218; Fax: +34 957 218 630; E-mail: sventura@uco.es.
[28], energy efficiency [63] and signal processing [66].

To date, many methods have been proposed to tackle the MTR problem, and these can be organized into problem transformation and algorithm adaptation methods [10]. Problem transformation methods decompose an MTR problem into several single-target regression tasks. Recent research has focused on applying some well-known multi-label learning transformation methods to solve the MTR problem, mainly motivated by the tight connection between these two learning paradigms². In this regard, Spyromitros-Xioufis et al. [59] demonstrated that several multi-label approaches, such as binary relevance [14], stacked generalization [64] and classifier chains [50], are straightforward to adapt to the MTR problem. On the other hand, the algorithm adaptation category comprises algorithms that do not decompose an MTR problem into several single-target regression tasks; i.e., they directly handle the multi-target data. In this category, many methods have been proposed, such as statistical techniques [58], support vector machines [44], kernel-based approaches [6], MTR trees [60], rule-based methods [1], and locally weighted regression methods [53].

² In multi-label learning, the output variables (a.k.a. labels) are restricted to binary values.

Spyromitros-Xioufis et al. [59], Melki et al. [44], Reyes et al. [53] and many other authors have demonstrated that the MTR problem can be solved more effectively if the inter-target correlations are detected and exploited. However, the major challenges of MTR lie in how to model such inter-target dependencies correctly, and how to estimate the nonlinear relationships that may exist between the input and output spaces of the problem [75]. On the other hand, not all available training samples are useful for constructing an accurate predictive model; it is well known that noisy, redundant and incomplete data can significantly deteriorate the performance of most learning algorithms [52]. Consequently, the acquisition of a high-quality and compact dataset, from which an algorithm can learn relevant data relationships, is also an important issue to be considered when tackling the MTR problem.

The instance selection task (henceforth IS) is an important data preprocessing step that aims to select a representative subset of an original dataset by filtering noisy and redundant data, in such a manner that the predictive performance of a learner induced from the data subset would be the same as (or even better than) if the original dataset was used [45]. Nowadays, these algorithms can bring many benefits to the scientific community, mainly due to their applications to the Big Data challenge [8]. The IS task has been widely studied for the classification problem (see, for instance, the works of Olvera-López et al. [45] and García et al. [24]); however, this task has been far less studied for the regression problem [3]. The IS task in the regression problem has some difficulties that do not exist in the classification task. For instance, several IS methods assess the relevance of an instance by measuring its usefulness in predicting the correct classes of its nearest neighbors. However, the concept of a data class does not exist in the regression problem, since the domain of the output variables is continuous. Likewise, the identification of class boundaries, an important criterion on which many IS methods for the classification problem are formulated, does not make sense in regression [3]. As for performing the IS task in the MTR problem, the complexity of selecting the instances is higher than in the single-target regression problem, mainly due to the aforementioned challenges that the MTR problem presents.

In the last two decades, ensemble-based methods have proven to be really effective techniques for improving the results in complex problems [32, 33, 41, 46-48, 55, 67]. Kocev et al. [37] and Spyromitros-Xioufis et al. [59], for example, demonstrated how a better predictive performance could be obtained in solving the MTR problem by using ensemble-based approaches. On the other hand, several authors have demonstrated that the IS task can be significantly improved by applying ensemble-based methods [56]. In this way, the relevance of the training instances is measured by considering not only a single criterion but many approximations, and therefore a more reliable estimation of the relevance of the instances is obtained.

In this work, an ensemble-based method to perform the IS task in the MTR problem is proposed. First, an error accumulation-based approach is introduced, which is an adaptation of the well-known family of Decremental Reduction Optimization Procedures (henceforth DROP) [71] to multi-target data. Second, an ensemble-based method that effectively combines the partial data subsets previously selected by each member of the ensemble is proposed. To obtain the final data subset, an aggregation process is carried out by a simple greedy heuristic that solves an optimization problem. The members of the ensemble select the partial data subsets on different input spaces which are expanded with target variables, thereby exploiting the underlying inter-target dependencies. On the other hand, the proposed method does not use any threshold value to decide whether an instance is selected or not, resulting in a method less dependent on the specific features of each problem. To the best of our knowledge, this is the first attempt to study the selection of instances in MTR, and the main motivation of this work is to analyze the benefits of the IS task for constructing better MTR models.

The effectiveness of the proposal is assessed through an extensive experimental study, where 18 datasets of varied features and different application domains are used. The results show that the proposed IS method can significantly boost the predictive performance of the multi-target regressors, and therefore it can benefit the development of methods for solving complex problems that comprise the prediction of multiple outputs. A good trade-off between predictive performance and reduction rate is attained; the size of the training sets is reduced without significantly deteriorating the predictive performance of the multi-target regressors. In addition, the proposed ensemble-based method proves to be robust on datasets containing noisy samples.

The remainder of this paper is arranged as follows. Section 2 briefly describes the IS task and reviews the related works that have been proposed to perform the selection of instances in the regression problem. Section 3 presents the proposed ensemble-based method. Section 4 gives a description and discussion of the experimental results. Finally, some concluding remarks are presented in Section 5.

2. Related works

Roughly speaking, IS methods aim to reduce the size of an original training set while retaining or improving the predictive capacity of the models. The optimal outcome of an IS method is a minimum data subset from which a learning algorithm would accomplish the same task with no performance loss as if the original dataset was used [25]. However, some authors have noted that in practice it is not always possible to maintain the performance levels as the dataset is reduced, and a loss of effectiveness may be inevitable [15]. IS methods have the following goals [25]: (I) decrease the computational cost of predicting new patterns; (II) reduce the storage requirements by removing redundant information from datasets; (III) improve the performance of learning algorithms by removing noise and outliers; and (IV) increase the efficiency when working on large-scale datasets.

Many IS methods have been proposed in the literature, and a complete description of these methods can be consulted in [25]. The IS methods can be categorized by considering the following three criteria: (I) the selection criterion used to select the instances; (II) the type of points that are removed in the IS process; and (III) the search direction used to obtain the final data subset.

The first category includes the wrapper [71] and filter methods [43], and the main difference between these two types of approaches lies in that the wrapper methods select the relevant instances based on the predictions made by a learning algorithm, whilst the filter methods do not rely on a classifier to determine the instances to be discarded from the training set.

Regarding the second criterion, the IS algorithms can be classified into condensation [34], edition [69] or hybrid methods [71]. The condensation methods retain the points closest to the decision boundaries (border points), preserving the training error but at the expense of deteriorating the generalization test error. The edition methods remove the border points and maintain the internal points, obtaining smoother decision boundaries and reducing the generalization test error. The hybrid methods, on the other hand, remove both internal and border points, taking the advantages of both the condensation and edition methods.

As for the third criterion, the IS algorithms can be classified into incremental [29], decremental [71], batch [12] or mixed methods [57]. The incremental methods start with an empty data subset and continue adding instances to it; in this case, the presentation order of the instances is an important issue that might affect the effectiveness of the IS algorithms. The decremental methods begin with the whole dataset and continue removing instances from it; in this case, the presentation order of the instances is still an important issue, but not as significant as in the case of the incremental methods. As for the batch methods, they analyze all the instances without removing them, and at the end of the process all the instances marked as disposable are removed; the complexity of this type of methods is usually higher than that of the incremental and decremental methods. Finally, the mixed methods begin with a pre-selected data subset, and the instances that satisfy a specific criterion can be added or removed; the pre-selected data subset may be constructed by either a random selection, an incremental method, or a decremental one.
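The edition family mentioned above is easiest to see in the classification setting where it originated. Below is a minimal, illustrative Python sketch in the spirit of the Edited Nearest Neighbor rule [69]; the function names and toy data are ours, not the paper's. Every instance whose k nearest neighbours misclassify it is marked, and all marked instances are dropped at the end.

```python
from collections import Counter

def knn_label(x, data, labels, k):
    """Majority label among the k nearest neighbours of x (squared Euclidean)."""
    order = sorted(range(len(data)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, data[j])))
    votes = Counter(labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]

def enn(data, labels, k=3):
    """Edition sketch: keep only the instances that are correctly
    classified by their k nearest neighbours (excluding themselves)."""
    keep = []
    for i, (x, lab) in enumerate(zip(data, labels)):
        rest = [p for j, p in enumerate(data) if j != i]
        rest_labels = [l for j, l in enumerate(labels) if j != i]
        if knn_label(x, rest, rest_labels, k) == lab:
            keep.append(i)
    return keep

# Two well-separated classes plus one mislabelled (noisy) point:
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
     (0.05, 0.05)]
y = ["a", "a", "a", "b", "b", "b", "b"]   # last point is noise inside class "a"
print(enn(X, y, k=3))   # indices of the kept instances; the noisy index 6 is dropped
```

On the toy data, the mislabelled point at index 6 is filtered out while the six clean points survive, which illustrates why edition methods smooth decision boundaries rather than compress the interior of the classes.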
Olvera-López et al. [45], García et al. [25], and many other authors have noted that the IS task for the classification problem is widely studied. However, the selection of instances in the regression problem has not followed the same path, there being far less research in this regard [3]. Some works have proposed different evolutionary algorithms to perform the IS task in the regression problem. For example, Tolvi [62] presented a genetic algorithm that was able to detect the outliers in linear regression models, and Antonelli et al. [2] addressed the IS task through a multi-objective evolutionary learning approach. There have also been other efforts focused on studying the IS task in time series [61]. On the other hand, in the last few years several works have focused on adapting IS methods that were originally designed for the classification problem to the regression problem. For example, Kordos & Blachnik [38] adapted the Condensed Nearest Neighbor [29] and Edited Nearest Neighbor [69] methods, and Arnaiz-González et al. [4] proposed an adaptation of the DROP method [71]. Finally, another approach for performing the IS task in the regression problem comprises the discretization of the target variable [5], so that any existing IS method can then be used directly.

Independently, in order to improve the efficiency and accuracy of a method in finding a solution for a given learning problem, ensembles of methods have gained increasing popularity in the research community over the last few decades [21, 73]. Summarizing the advantages of ensemble-based methods [18, 73]: (I) ensemble methods perform well in both scenarios, when there are very scarce data samples for learning and when a huge amount of data is available; (II) a combined classifier can have a better predictive performance than the best individual classifier; (III) combining methods trained from different samples could overcome the local optima problem; and (IV) an exact function may be impossible to model by any single hypothesis, but the combination of several hypotheses may expand the space of representable functions. Given the advantages provided by the ensemble learning paradigm, the use of ensemble-based methods to perform the IS task is not surprising [8, 56]. The main objective of such ensemble-based IS methods is to produce more reliable estimations of the relevance of the instances by aggregating the outputs produced by the members of the ensemble. In this regard, there are very few works that have studied the selection of instances in the regression problem following an ensemble-based approach. The most relevant works on this topic are the ones presented by Blachnik & Kordos [9] and Arnaiz-González et al. [3], who showed that a better IS process in the regression problem can be achieved by using bagging models.

Finally, it is important to note that all the aforementioned works have been designed for selecting instances on regression problems that have only one target variable, and they are not directly applicable to the MTR problem. As far as we know, an IS method for the MTR problem has not been proposed yet. In addition to the difficulties that appear when the IS process is performed on any regression problem (these were previously mentioned in the introduction of this work), the major challenges of MTR arise from modelling the inter-target correlations and complex input-output relationships. In the next section, an ensemble-based method for the selection of instances in the MTR problem is presented.

3. An ensemble-based method for the selection of instances

In this section, first, an error accumulation-based approach, which is an adaptation of the well-known DROP method to multi-target data, is introduced, and then the ensemble-based method to perform the IS task in the MTR problem is presented.

A DROP-based extension for the MTR problem

Let S = {(X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n)} represent a dataset of n training instances. An instance i ∈ S is represented as a tuple (X_i, Y_i), where X_i ∈ X and Y_i ∈ Y are the input and target vectors of i, respectively. X represents the input space that contains d input variables {x_1, x_2, …, x_d}, whereas Y is the output space that comprises q target variables {y_1, y_2, …, y_q}³. On the other hand, x_li denotes the value of the l-th input variable for the instance i, whereas y_li represents the value of its l-th target variable. An MTR algorithm aims to learn a predictive model that, given an unseen input vector X, can predict a target vector Z that best approximates the true target vector Y.

³ The domain of the input variables can be continuous, discrete or mixed, whereas the domain of the target variables is always continuous.

DROP is a well-known IS method that, according to the three categories portrayed in Section 2, can be classified as a wrapper, hybrid and decremental method. In this work, the DROP-based adaptation to
the regression problem presented by Arnaiz-González et al. [4] is extended to multi-target data.

Let N_i represent the set of k-nearest neighbours of the instance i in the input space X. Given a set of instances S, the set of associates of i (denoted as A_i) comprises those training instances that include i in their sets of k-nearest neighbors, i.e. A_i = {j ∈ S | i ∈ N_j}. Our proposed DROP-based method uses the following simple rule as removal criterion: the instance i can be safely removed from the training set if the target vectors of its associated instances can be correctly estimated without considering i. From this rule arises the need to design a reliability function for multi-target data that assigns proper scores according to the error levels obtained in the predictions of the target vectors of the associate instances. This reliability function is crucial for the success of the IS process, since the error associated with the elimination of a relevant instance can reinforce itself in the subsequent iterations. However, developing a good reliability measure is not a simple task [13]. This has not been entirely resolved yet in single-target regression and classification, and even less for the MTR problem [39].

Different reliability scores have been proposed for single-target regression, such as the estimation based on sensitivity analysis [13], local cross-validation [13], analysis of the density of the distribution of instances [19], the variance of bagged models [31], and the estimation of the instances' error by considering their local environment in the training set [11]. Recently, Levatić et al. [39] defined various reliability functions for the MTR problem following a semi-supervised approach. However, these last-mentioned functions are not directly applicable to our problem since they do not consider the true target vectors of the instances.

In this work, given a training instance i, we follow a traditional approach to estimate the error made in predicting the target vectors of the associate instances by considering their local environments; it is similar to the approach proposed by Briesemeister et al. in [11]. Given the associate instance j of i, a multi-target algorithm is trained on the dataset formed by the set of k-nearest neighbors N_j, and afterwards the model predicts a target vector for j (Z_j). This procedure is performed for each associate instance j ∈ A_i, and the estimation of the global error can be calculated with the average relative root mean square error (aRRMSE):

    aRRMSE = (1/q) · Σ_{l=1}^{q} √[ Σ_{j∈A_i} (y_lj − z_lj)² / Σ_{j∈A_i} (y_lj − y_lm)² ] ,    (1)

Algorithm 1. DROPMTR algorithm. The function kNearestNeighbors(i, S, k) computes the k-nearest neighbors of the instance i in the dataset S. The function predictTargetVector(Φ, N, i) trains the multi-target regressor Φ on N and predicts the target vector of the instance i.

Input: S: training set of multi-target instances, k: number of nearest neighbours, Φ: multi-target regressor
Output: SS ⊆ S: subset of training instances
Begin
  SS ← S
  # Create an empty set of associates for each instance i ∈ SS
  foreach i ∈ SS do
    A_i ← ∅
  end
  foreach i ∈ SS do
    # Find the k nearest neighbours of i on the input space X
    N_i ← kNearestNeighbors(i, SS, k)
    # Add i to the sets of associates
    foreach j ∈ N_i do
      A_j ← A_j ∪ {i}
    end
  end
  foreach i ∈ SS do
    # Predict the target vectors of the associate instances
    foreach j ∈ A_i do
      predictTargetVector(Φ, N_j, j)
      predictTargetVector(Φ, N_j \ {i}, j)
    end
    compute Ewith_i
    compute Ewithout_i
    # Check whether the instance i can be removed or not
    if Ewithout_i ≤ Ewith_i then
      SS ← SS \ {i}
      # Remove the instance i from each set of nearest neighbours
      foreach j ∈ A_i do
        N_j ← N_j \ {i}
        # Find a new nearest neighbour for j
        p ← kNearestNeighbors(j, SS \ N_j, 1)
        # Add p to the set of nearest neighbours of j
        N_j ← N_j ∪ {p}
        # Add j to the set of associates of p
        A_p ← A_p ∪ {j}
      end
    end
  end
  return SS
End
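For concreteness, the aRRMSE of Eq. (1) can be computed in a few lines of Python; the function name and the list-of-target-vectors data layout are our own choices, not the paper's:

```python
from math import sqrt

def arrmse(Y_true, Y_pred):
    """aRRMSE as in Eq. (1): for each of the q targets, the root of the
    squared error of the predictions relative to the squared error of a
    per-target mean predictor, averaged over the targets. Each element of
    Y_true/Y_pred is one instance's target vector; every target is
    assumed to vary in Y_true (non-zero denominator)."""
    q = len(Y_true[0])
    total = 0.0
    for l in range(q):
        mean_l = sum(y[l] for y in Y_true) / len(Y_true)
        num = sum((y[l] - z[l]) ** 2 for y, z in zip(Y_true, Y_pred))
        den = sum((y[l] - mean_l) ** 2 for y in Y_true)
        total += sqrt(num / den)
    return total / q

# Perfect predictions score 0; predicting the per-target mean scores 1:
Y = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
print(arrmse(Y, Y))                  # 0.0
print(arrmse(Y, [(2.0, 20.0)] * 3))  # 1.0
```

The two printed values show what makes the measure "relative": errors are rescaled per target against the trivial mean predictor, so targets with very different scales contribute comparably.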
where y_lj and z_lj are the values of the l-th target variable in the true (Y_j) and predicted (Z_j) target vectors of the associate instance j, respectively. Also, y_lm is the mean of the true values of the l-th target variable in the set of associates of i. aRRMSE is a measure widely used in the MTR literature [10]; it averages the RMSE values of each target variable and automatically re-scales the error contribution of each target variable.

We believe that aRRMSE is a reliable estimator since it considers the actual errors made by the internal regressor. It also imposes almost no additional computational overhead, as opposed to some other estimation methods for regression. We denote as Ewith_i the global error in predicting the target vectors of the associated instances of i without removing i from the sets of nearest neighbors of its associated instances, whereas Ewithout_i represents the opposite case. Finally, the instance i can be safely removed from the training set if Ewithout_i ≤ Ewith_i. Algorithm 1 shows the steps performed by our error accumulation-based approach (hereafter dubbed DROPMTR).

The proposed method is able to detect the errors and outliers in data. For example, if the set of associate instances of the instance i is empty (A_i = ∅), it could mean that this instance is an outlier, since i is not a neighbour of any training instance. In addition, it could be the case that Ewithout_i = Ewith_i but A_i ≠ ∅, meaning that the instance i does not contribute to a better prediction of the target vectors of its associates, and therefore it can be safely removed.

The main advantages of the proposed approach are that it can be wrapped around any existing MTR regressor (problem transformation or algorithm adaptation methods), and that it does not depend on any threshold value to decide whether an instance is selected. Therefore, the IS process can benefit significantly from the capacities of the internal regressor. The proposed IS method can implicitly exploit the inter-target dependencies for the selection of more relevant instances if the internal MTR regressor is able to model such correlations. In this sense, the challenge of modelling complex input-output relationships can also be effectively tackled, since our proposal can be applied with any linear or non-linear MTR algorithm.

Regarding the runtime complexity of our proposal, let S be a dataset with n instances, d input variables and q target variables, and let f_k(n, d) represent the cost of determining the k-nearest neighbours of an instance in the input space X. Then O(n × f_k(n, d)) steps are needed to compute the lists of associates of all training instances. On the other hand, let f_Φ(N, i) represent the cost of training the multi-target regressor Φ on the dataset N (a dataset formed by k instances), added to the cost of predicting the target vector of the instance i. Then O(n × a_avg × f_Φ(N, i)) steps are required to compute the accumulative errors of all the training instances, a_avg being the average number of associates per instance. Consequently, the overall runtime complexity of the proposed IS method is O(max(n × f_k(n, d), n × a_avg × f_Φ(N, i))).

Ensemble-based IS method

We propose an ensemble-based method for tackling the IS task in the MTR problem, since more reliable estimations of the importance of the instances could be obtained if multiple approximations were considered. Similarly to Spyromitros-Xioufis et al. [59], we adopted an approach that composes the ensemble by adding target variables to the input space of the MTR problem.

The rationale of our proposal is as follows. Given a multi-target dataset S with q target variables, our ensemble is formed by q+1 members, where q of them select a data subset from a slightly different version of the original dataset S. The l-th member (l ≤ q) of the ensemble (denoted as I_l) selects the instances from a multi-target dataset which has an input space equal to X ∪ {y_l} and an output space equal to Y \ {y_l}. In this way, the member I_l allows modelling the contribution of the l-th target for predicting the rest of the target variables y_p ∈ Y | p ≠ l, and therefore I_l would select a data subset formed by those relevant instances that reflect the inter-target dependencies related to the l-th target variable. The last member of the ensemble (I_{q+1}) is executed on the original dataset S, and the selected data subset would comprise those instances that are relevant for predicting all target variables. Finally, each member of the ensemble returns a subset S_l, l ∈ {1, 2, …, q, q+1}, and then an aggregation process determines the best subset (S_e) from these q+1 partial data subsets.

Figure 1 shows the general schema of the proposed ensemble-based method for performing the IS task in the MTR problem. It is noteworthy that the diversity of the members of the ensemble is achieved by executing each member on a different dataset. Also, note that the members must be IS methods able to work directly with multi-target data, such as the DROPMTR
Figure 1. Schema of the proposed ensemble-based method.
method. On the other hand, in order to add more diversity to the ensemble, the presentation order of the instances is randomly changed in each new dataset over which the members are executed. This also makes the ensemble method less sensitive to the presentation order of the instances, which is a limitation of any DROP-based method.

Another important issue to analyse in our approach is how to aggregate the q+1 partial data subsets into a final data subset of instances; this component is of major importance in all ensemble-based methods. In this work, we adopt a stacking approach [72], where q+1 independent members select data subsets from S, but finally one extra model produces an optimal combination of the outputs of the members.

Let c_i represent the number of times the instance i ∈ S is selected by the members; it can be calculated as

    c_i = Σ_{l=1}^{q+1} 1_{S_l}(i) ,    (2)

where 1_{S_l} is the indicator function that denotes whether the instance i is in the data subset S_l selected by the l-th member or not. It is important to note that, although each member is executed on datasets that slightly differ from the original dataset S and the presentation order of the instances is changed in each dataset, a function that maps the new indexes of the instances to the original indexes in the dataset S can be easily constructed. In this way, given the data subset S_l selected by the l-th member of the ensemble, it is easy to retrieve the corresponding original instances from S.

The set of instances that are selected by exactly l members is denoted as R_l = {i ∈ S | c_i = l}, and T represents the set of instances resulting from the union of the subsets S_1, S_2, …, S_q, S_{q+1}. Therefore, our proposal aims to determine the subset of instances S_e ⊆ T that generalises well the observations in T and minimises the following cost function:

    (1/q) · Σ_{l=1}^{q} √[ Σ_{i∈T} (y_li − z_li)² / Σ_{i∈T} (y_li − y_lm)² ] ,    (3)

where y_li and z_li are the values of the l-th target variable in the true (Y_i) and predicted (Z_i) target vectors of i, respectively, and y_lm is the mean of the true values of the l-th target in the set T. Note that this cost function is simply the measure aRRMSE, but now defined over the set T.

Solving this optimization problem with classical methods could take a considerable runtime, since the objective function requires the evaluation of a multi-target regressor for predicting the target vector of each instance i ∈ T. On the other hand, if we consider this formulation as a search problem, the number of feasible solutions is 2^|T| − 1, resulting in a huge search space. Consequently, in this work the following simple heuristic is defined: the instances that were selected by a higher number of members are preferred over those instances with a lower selection frequency. Accordingly, the following hill climbing process is proposed to compute the final subset of instances S_e: taking as starting point the set of instances most frequently selected (denoted as R_m), continuously add to S_e the next set of instances with the highest selection frequency (R_l | 1 ≤ l < m), until a degradation in estimating T is obtained.

Algorithm 2 shows the steps of the proposed ensemble-based IS method (hereafter dubbed EDROPMTR). EDROPMTR comprises two phases: (I) the ensemble's members select the partial data subsets; and (II) the q+1 selected data subsets are aggregated by the proposed greedy heuristic.

Algorithm 2. EDROPMTR algorithm. The function construct(S, Xnew, Ynew) constructs a new dataset from S but considering the input and output spaces Xnew and Ynew, respectively. The function shuffle(S) changes the presentation order of the instances in S. The function dropMTR(S, k, Φ) performs the DROPMTR algorithm on the dataset S, using the internal multi-target regressor Φ and considering k nearest neighbours. The function retrieveOriginal(S_l, S) transforms all the instances i ∈ S_l, recovering their original forms as they appear in S. The function predictTargetVectors(Φ, S, T) trains the regressor Φ on S and predicts the target vectors of all the instances i ∈ T.

Input: S: training set, k: number of nearest neighbours, Φ: multi-target regressor
Output: S_e ⊆ S: subset of training instances
Begin
  # Compute the frequency of selection of each instance
  foreach i ∈ S do
    c_i ← 0
  end
  T ← ∅
  # Construct the members
  foreach l ∈ {1, …, q+1} do
    R_l ← ∅
    if l ≤ q then
      # Remove the l-th target variable from Y and add it to X
      Ynew ← Y \ {y_l}
      Xnew ← X ∪ {y_l}
    else
      Ynew ← Y
      Xnew ← X
    end
    # Construct a new training set from S
    Snew ← construct(S, Xnew, Ynew)
    # Change the presentation order
    Snew ← shuffle(Snew)
    # Execute DROPMTR on the training set Snew
    S_l ← dropMTR(Snew, k, Φ)
    # Retrieve the original instances from S
    S_l ← retrieveOriginal(S_l, S)
    T ← T ∪ S_l
    # Increment the frequency of selection of the instances
    foreach i ∈ S_l do
      c_i ← c_i + 1
    end
  end
  # Construct the frequency sets
  foreach i ∈ S do
    if c_i > 0 then
      R_{c_i} ← R_{c_i} ∪ {i}
    end
  end
  # Determine the set of instances most frequently selected
  m ← max { l ∈ {1, …, q+1} | R_l ≠ ∅ }
  # Hill climbing process
  S_e ← R_m
  predictTargetVectors(Φ, S_e, T)
  e_best ← equation (3)
  foreach l ∈ {m−1, m−2, …, 1} do
    if R_l ≠ ∅ then
      predictTargetVectors(Φ, S_e ∪ R_l, T)
      e_new ← equation (3)
      if e_new ≤ e_best then
        S_e ← S_e ∪ R_l
        e_best ← e_new
      else
        break
      end
    end
  end
  return S_e
End

Generally speaking, the first phase requires the construction of the datasets from which the members select the partial data subsets, the execution of q+1 IS methods, and finally the estimation of the frequency with which each instance is selected. Let f_DROPMTR represent the computational cost of the DROPMTR method. Then the overall runtime complexity of the first phase is O((q + 1) × f_DROPMTR), since the rest of the mentioned steps of the first phase can be performed in linear time. Note that each member of the ensemble can be executed in parallel, so the efficiency can be significantly improved.

On the other hand, the second phase of the ensemble-based method comprises the execution of the proposed greedy heuristic, which in turn needs to train and test (at most m times) the internal MTR regressor for determining the data subset S_e ⊆ T from which a better estimation of T is attained. Let f_Φ(S_e, T) represent the cost of training Φ on S_e, added to the cost of testing on T, where S_e ⊆ T. Then the overall complexity of the second phase is O(m × f_Φ(S_e, T)). It is noteworthy that the multi-target regressor used in the second phase is the same as the one employed by each member of the ensemble. Therefore, the overall complexity of the proposed ensemble-based method is O(max((q + 1) × f_DROPMTR, m × f_Φ(S_e, T))).

4. Experimental study

In this section, the experimental study is described. First, a description of the datasets and other experimental settings used in the experiments is presented. Second, DROPMTR and EDROPMTR are executed on all the datasets, with the aim of analysing whether the proposed IS methods improve or maintain the predictive performance of the MTR regressors, and to demonstrate that the best performance is attained by

Table 1. Summary of the benchmark datasets.
the proposed ensemble-based IS method. Dataset #Instances #Input vars. #Target vars.
Andro             49            30               6
Atp1d            337           411               6
Atp7d            296           411               6
Edm              154            16               2
Enb              768             8               2
Jura             359            15               3
Oes10            403           298              16
Oes97            334           263              16
Osales           639           413              12
Rf1             9125            64               8
Rf2             9125           576               8
Scm1d           9803           280              16
Scm20d          8966            61              16
Scpf            1137            23               3
Sf1              323            10               3
Sf2             1066            10               3
Slump            103             7               3
Wq              1060            16              14

4.1 Multi-target datasets

In this experimental study, the largest collection of MTR datasets publicly available was used [59]. All the 18 datasets within this collection have a variety of features and belong to several application domains. Some of these datasets represent well-known engineering problems, for example: the dataset Electrical Discharge Machining (Edm) [36] represents a two-target regression problem, where the task is to minimize the machining time by reproducing the behaviour of a human operator that controls two variables; the dataset Energy Building (Enb) [63] concerns the prediction of the heating and cooling load requirements of buildings as a function of eight parameters; the dataset Concrete Slump (Slump) [74] comprises
the prediction of three properties of concrete as a function of the content of seven concrete ingredients. On the other hand, the Andromeda (Andro) [30] and Water Quality (Wq) [20] datasets concern the prediction of water quality parameters, whereas the Jura dataset [28] focuses on the prediction of the concentration of metals. The Solar Flare datasets [40] (Sf1 and Sf2) are about the prediction of the number of solar flares observed within one day. The River Flow datasets (Rf1 and Rf2) [59] concern the prediction of river network flows. Finally, we have the following datasets associated with the business domain: Online Product Sales (Osales) [34], See Click Predict Fix (Scpf) [35], Airline Ticket Price (Atp1d and Atp7d) [59], Supply Chain Management (Scm1d and Scm20d) [59] and Occupational Employment Survey (Oes10 and Oes97) [59].

Table 1 shows a summary of the characteristics of the datasets. The datasets vary in size: from 49 up to 9,803 examples, from 7 up to 576 input variables, and from 2 up to 16 target variables. All the datasets have numeric input variables, except for Sf1 and Sf2, whose input variables are discrete. The datasets Scpf, Osales, Rf1, Rf2, Atp1d and Atp7d have missing values that were replaced by the median values of the corresponding input variables. Finally, all the numeric variables were centred and scaled.

4.2 Experimental settings

Pugelj & Dzeroski [39] presented a simple adaptation of the classic kNN algorithm for the MTR problem. In this work, this kNN-based method was used as the internal MTR regressor of our IS methods (DROPMTR and EDROPMTR). The best number of nearest neighbours (k) was estimated via cross-validation on the original datasets. The main reasons to use this MTR regressor are its simplicity and low computational cost. However, note that any other MTR algorithm could be used as the internal regressor, since the proposed methods follow a wrapper approach.

The parameter k is also important for DROPMTR, since the lists of associates are created by computing the k nearest neighbours of each instance of the training set. This parameter was set to the same value as the number of nearest neighbours used by the internal MTR regressor described before. As for the distance function used to compute the nearest neighbours of a point, the well-known Heterogeneous Euclidean Overlap Metric (HEOM) was used [70].

On the other hand, Spyromitros-Xioufis et al. [59] showed that the method Ensemble of Regressor Chains (ERC) is one of the most significant state-of-the-art MTR methods. Hence, the effectiveness of the proposed IS methods was assessed by means of evaluating ERC on the selected data subsets. ERC is a problem transformation method and, therefore, it internally requires a single-target regressor. Three single-target regressors were used, namely RepTree, Linear Regression and the classic kNN (the parameters proposed in [59] were used), resulting in three
different combinations of ERC (dubbed ERC-REPTree, ERC-LR and ERC-kNN).

To estimate the predictive performance of the MTR models, the measure aRRMSE (previously described in Section 3) was analysed on the test sets. On all datasets, a 10-fold cross-validation was performed, and the aRRMSE values were averaged across all fold executions. In each fold execution, the following steps were conducted: (I) the IS method reduces the training set; (II) the multi-target regressor is trained on the selected data subset; and (III) the learned model is assessed on the test set. On the other hand, the effectiveness of the IS methods was also studied by means of analysing the reduction levels of the size of the training sets.

Finally, non-parametric statistical tests were conducted to analyse and validate the obtained results, as proposed by Demsar [17]. All computational methods were implemented in the Java language and integrated into the MULAN library [65]. MULAN is constructed over the popular framework WEKA [22] and is designed for research in multi-label learning and MTR.

4.3 Reduction levels on the size of the datasets

This experiment aims to analyse whether the two proposed IS methods (DROPMTR and EDROPMTR) can significantly reduce the size of the datasets. The attained reduction rate on a dataset is calculated as 1 − sr/so, where sr is the number of instances in the selected data subset, and so is the number of instances in the original dataset. The higher the reduction rate, the higher the percentage of instances that were removed from the training sets. Table 2 shows the reduction rates averaged across all fold executions. The best reduction rate attained in each dataset is highlighted in bold typeface.

It is observed that DROPMTR attained reduction levels from 0.306 up to 0.731, whereas EDROPMTR obtained reduction levels from 0.179 up to 0.844. The DROPMTR method produced a big reduction (73%) in the dataset Sf2, whereas the EDROPMTR method achieved a significant reduction (84%) in the dataset Sf1. On average, the experimental results showed that DROPMTR can reduce the size of the datasets more than EDROPMTR. This behaviour was expected because EDROPMTR intends to determine the best subset of instances that produces the lowest prediction error on a test set that contains all the instances selected by the q + 1 DROPMTR members of the ensemble. Consequently, the expected tendency is that the final data subsets selected by EDROPMTR will have a size greater than those data subsets selected by a single DROPMTR; the partial q+1 data subsets could be very diverse from each other and, therefore, the aggregation process will attain a lower reduction level than the one that could be obtained by a single DROPMTR method. However, it is noteworthy that, although DROPMTR obtained the best reduction levels in 13 datasets, there were no statistical differences between the reduction levels attained by DROPMTR and EDROPMTR; the Wilcoxon Signed-Rank test [68] did not reject the null hypothesis, with a p-value equal to 0.092 at the significance level α=0.05.

Table 2. Average reduction levels attained by DROPMTR and EDROPMTR.

Dataset                DROPMTR    EDROPMTR
Andro                    0.449       0.330
Atp1d                    0.463       0.400
Atp7d                    0.459       0.430
Edm                      0.385       0.204
Enb                      0.306       0.220
Jura                     0.342       0.219
Oes10                    0.494       0.524
Oes97                    0.522       0.584
Osales                   0.454       0.451
Rf1                      0.372       0.351
Rf2                      0.330       0.179
Scm1d                    0.344       0.425
Scm20d                   0.366       0.199
Scpf                     0.327       0.192
Sf1                      0.639       0.844
Sf2                      0.731       0.710
Slump                    0.338       0.272
Wq                       0.452       0.728
Ave. reduction rate      0.432       0.403

4.4 Analyzing the predictive performance of the multi-target regressors

This experiment focuses on determining whether the application of the proposed IS methods implies a significant improvement or deterioration in the overall predictive performance of the regressors ERC-REPTree, ERC-LR and ERC-kNN. These three multi-target regressors were trained on the original training sets, and on the subsets selected by the IS methods.

Tables 3-5 show the results of the aRRMSE measure. In each row, the best error value is highlighted in bold typeface. The column named "Original" represents the predictive performance obtained on the original datasets, whereas the columns named "SubsetDROPMTR" and "SubsetEDROPMTR" represent the
predictive performance of the MTR regressors on the data subsets selected by DROPMTR and EDROPMTR, respectively. Friedman's test [23] was conducted to perform multiple comparisons, and the last row of the tables shows the average ranking computed by this test.

Table 3. Results of the aRRMSE measure for ERC-REPTree. The Friedman's statistic is equal to 8.333, and the null hypothesis was rejected with a p-value=0.015 at the significance level α=0.05.

Dataset        Original    SubsetDROPMTR    SubsetEDROPMTR
Andro             0.595            0.744             0.674
Atp1d             0.438            0.436             0.431
Atp7d             0.605            0.699             0.657
Edm               0.923            0.906             0.862
Enb               0.133            0.155             0.140
Jura              0.689            0.687             0.675
Oes10             0.616            0.614             0.625
Oes97             0.706            0.838             0.811
Osales            0.782            0.898             0.878
Rf1               0.121            0.120             0.108
Rf2               0.147            0.134             0.105
Scm1d             0.358            0.351             0.341
Scm20d            0.476            0.509             0.496
Scpf              0.887            0.828             0.823
Sf1               1.081            0.832             0.819
Sf2               1.018            0.951             0.904
Slump             0.783            0.762             0.751
Wq                0.952            0.951             0.941
Avg. ranking      2.278            2.278             1.444

Table 4. Results of the aRRMSE measure for ERC-LR. The Friedman's statistic is equal to 7.861, and the null hypothesis was rejected with a p-value=0.020 at the significance level α=0.05.

Dataset        Original    SubsetDROPMTR    SubsetEDROPMTR
Andro             5.335            1.399             2.562
Atp1d             1.280            0.833             1.094
Atp7d             2.119            1.688             1.150
Edm               0.835            0.906             0.855
Enb               0.315            0.323             0.319
Jura              0.607            0.611             0.610
Oes10             0.833            0.579             0.489
Oes97             1.330            0.720             0.649
Osales            1.864            1.754             1.657
Rf1               0.522            0.562             0.533
Rf2               0.488            0.387             0.314
Scm1d             0.393            0.400             0.287
Scm20d            0.643            0.641             0.643
Scpf              0.887            0.840             0.533
Sf1               1.196            0.969             0.922
Sf2               1.545            1.348             1.291
Slump             0.683            0.682             0.677
Wq                0.959            0.976             0.964
Avg. ranking      2.361            2.167             1.472

Table 5. Results of the aRRMSE measure for ERC-kNN. The Friedman's statistic is equal to 9.194, and the null hypothesis was rejected with a p-value=0.010 at the significance level α=0.05.

Dataset        Original    SubsetDROPMTR    SubsetEDROPMTR
Andro             0.619            0.824             0.628
Atp1d             0.452            0.451             0.439
Atp7d             0.616            0.648             0.619
Edm               0.819            0.846             0.835
Enb               0.308            0.307             0.300
Jura              0.734            0.732             0.720
Oes10             0.452            0.462             0.460
Oes97             0.551            0.570             0.571
Osales            0.919            0.912             0.910
Rf1               0.180            0.120             0.120
Rf2               0.198            0.158             0.132
Scm1d             0.351            0.321             0.311
Scm20d            0.303            0.327             0.322
Scpf              1.054            0.808             0.801
Sf1               1.144            0.825             0.821
Sf2               1.705            1.163             1.115
Slump             0.750            0.749             0.730
Wq                0.932            0.941             0.921
Avg. ranking      2.278            2.305             1.417

It was observed that the predictive performance of the three MTR regressors is improved in many cases. Also, it is relevant to note that the predictive performance was improved even on those datasets for which the IS methods attained high reduction levels (e.g. Sf1, Sf2 and Wq), showing that the proposed IS methods can select subsets of relevant instances, and also that these particular datasets have a considerable number of irrelevant and/or redundant instances. The average rankings computed by Friedman's test show that, on average, the best results were reported when the MTR regressors were executed on the data subsets selected by EDROPMTR, indicating the effectiveness of the proposed ensemble-based approach. Furthermore, Friedman's test rejected all the null hypotheses, indicating that significant differences exist in the predictive performance of the MTR regressors.

The Bergmann-Hommel's test [7] was conducted in order to perform all pairwise comparisons and detect particular significant differences. Figure 2 shows the results of this statistical test, highlighting two important results: (I) the predictive performance of the MTR regressors that were trained on the data subsets selected by DROPMTR is not significantly different from the performance attained when they were trained on the original training sets, indicating that DROPMTR can considerably reduce the size of the datasets without deteriorating the performance of the regressors; and (II) the predictive performance of those regressors that were trained on the data subsets selected by EDROPMTR is significantly better than the performance attained on the original training sets and on the data subsets selected by DROPMTR, showing the potential of the proposed ensemble-based method.
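The summary figures underlying this discussion can be cross-checked directly from the reduction levels reported in Table 2 (Section 4.3). The following sketch uses the values as transcribed from that table (dataset order follows the table); it verifies that DROPMTR attains the better reduction on 13 of the 18 datasets and reproduces the average reduction rates of the table's last row:

```python
# Reduction levels transcribed from Table 2 (DROPMTR vs EDROPMTR),
# one value per dataset, in the order Andro, Atp1d, ..., Wq.
dropmtr =  [0.449, 0.463, 0.459, 0.385, 0.306, 0.342, 0.494, 0.522, 0.454,
            0.372, 0.330, 0.344, 0.366, 0.327, 0.639, 0.731, 0.338, 0.452]
edropmtr = [0.330, 0.400, 0.430, 0.204, 0.220, 0.219, 0.524, 0.584, 0.451,
            0.351, 0.179, 0.425, 0.199, 0.192, 0.844, 0.710, 0.272, 0.728]

# Number of datasets where DROPMTR obtained the better (higher) reduction
wins = sum(d > e for d, e in zip(dropmtr, edropmtr))

# Average reduction rates, matching the "Ave. reduction rate" row
avg_drop = round(sum(dropmtr) / len(dropmtr), 3)
avg_edrop = round(sum(edropmtr) / len(edropmtr), 3)

print(wins, avg_drop, avg_edrop)  # 13 0.432 0.403
```

The closeness of the two averages (0.432 vs 0.403) despite the 13 "wins" is consistent with the non-significant Wilcoxon result (p = 0.092) reported above; the Wilcoxon test itself weighs the magnitudes of the paired differences, not only their signs.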
Figure 2. All pairwise comparisons conducted by Bergmann-Hommel's test for (a) ERC-RepTree, (b) ERC-LR and (c) ERC-kNN. In the diagrams, the groups of methods that are not significantly different are connected by a line.
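The aggregation phase of EDROPMTR (the hill-climbing process at the end of Algorithm 2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `greedy_aggregate` and the toy error table are hypothetical names, and the `error` callable stands in for equation 3 (the aRRMSE obtained by training the internal MTR regressor on the candidate subset and testing on T):

```python
# Sketch of EDROPMTR's greedy aggregation: instances are grouped by how
# many ensemble members selected them (frequency sets Rl), and groups are
# added from the most- to the least-frequent while the estimated error
# does not increase; the search stops at the first deterioration.

def greedy_aggregate(frequency_sets, error):
    """frequency_sets: dict mapping frequency l -> set of instance ids (Rl).
    error: callable taking a set of instance ids and returning a float
    (stand-in for equation 3). Returns the aggregated subset Se."""
    # Start from the non-empty frequency set with the highest frequency (Rm)
    levels = sorted((l for l, r in frequency_sets.items() if r), reverse=True)
    if not levels:
        return set()
    se = set(frequency_sets[levels[0]])
    e_best = error(se)
    # Visit the remaining non-empty frequency levels in decreasing order
    for l in levels[1:]:
        candidate = se | frequency_sets[l]
        e_new = error(candidate)
        if e_new <= e_best:   # keep the group: the error did not increase
            se, e_best = candidate, e_new
        else:                 # first deterioration: stop the hill climbing
            break
    return se

# Toy usage: instances picked by all 3 members are kept; adding the
# frequency-2 group lowers the error, the frequency-1 group raises it.
sets = {3: {0, 1}, 2: {2}, 1: {3, 4}}
errors = {frozenset({0, 1}): 0.5,
          frozenset({0, 1, 2}): 0.4,
          frozenset({0, 1, 2, 3, 4}): 0.6}
se = greedy_aggregate(sets, lambda s: errors[frozenset(s)])
print(se)  # {0, 1, 2}
```

Stopping at the first deterioration (rather than scanning all remaining groups) keeps the number of train/test cycles of the internal regressor at most m, matching the O(m × f(Se, T)) bound given for the second phase.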