Nuevos Retos en La Selecci On Evolutiva de Instancias: Escalabilidad, Aprendizaje Con Clases No Balanceadas y Caracterizaci On de La E Cacia

Universidad de Granada
Departamento de Ciencias de la Computacin

o
e Inteligencia Articial
Nuevos Retos en la Seleccin

o
Evolutiva de Instancias: Escalabilidad,
Aprendizaje con Clases no Balanceadas y
Caracterizacin de la Ecacia
o
Tesis Doctoral
Salvador Garc Lpez
a o
Granada, Octubre de 2008
Universidad de Granada
Nuevos Retos en la Seleccin

o
Evolutiva de Instancias: Escalabilidad,
Aprendizaje con Clases no Balanceadas y
Caracterizacin de la Ecacia
o
MEMORIA QUE PRESENTA
Salvador Garc Lpez
a o
PARA OPTAR AL GRADO DE DOCTOR EN INFORMATICA
Octubre de 2008
DIRECTOR
Francisco Herrera Triguero
Departamento de Ciencias de la Computacin

o
e Inteligencia Articial
La memoria titulada Nuevos Retos en la Seleccin Evolutiva de Instancias: Escalabilidad,

o
Aprendizaje con Clases no Balanceadas y Caracterizacin de la Ecacia, que presenta D. Salvador
o
Garc Lpez para optar al grado de doctor, ha sido realizada dentro del programa de doctorado
a o
Diseo, Anlisis y Aplicaciones de Sistemas Inteligentes del Departamento de Ciencias de la
n
a
Computacin e Inteligencia Articial de la Universidad de Granada bajo la direccin del doctor D.
o
o
Francisco Herrera Triguero.
Granada, Octubre de 2008
El Doctorando
El Director
Fdo: Salvador Garc Lpez

a o
Fdo: Francisco Herrera Triguero
Tesis Doctoral Parcialmente Subvencionada por el Ministerio de Educacin y Ciencia bajo el

o
Proyecto Nacional TIN2005-08386-C05. Tambin ha sido Subvencionada bajo el Programa de Becas
e
de Formacin de Profesorado Universitario, en la Resolucin del 30 de Marzo de 2005, bajo la
o
o
referencia AP2005-4197.
Agradecimientos
Esta memoria est dedicada a todas aquellas personas sin las cuales no hubiera sido posible.
a
Ante todo a mis padres, Mar y Salvador, porque todo lo que he conseguido ha sido gracias a
a
ellos, a su apoyo, cario y compresin en las dicultades por las que he pasado desde que empec en
n
o
e
este viaje. Estoy muy orgulloso de ellos por brindarme las oportunidades que me han dado y que
ellos no tuvieron en su vida. Esta memoria es por y para vosotros.
Tambin agradezco en cantidad los nimos y consejos que me han dado mis hermanos, Alfonso,
e
a
Manolo y Miguel Angel, que han visto como he ido evolucionando y aprendiendo y siempre han
estado para ayudarme a enfrentarme a cualquier obstculo. Aado en este agradecimiento a un
a
n
buen amigo, que siempre ha estado ah y con el que he compartido gran parte del tiempo hasta
lograr este objetivo. Diego, siempre has sido para m como un hermano y te agradezco todo el
tiempo que has dedicado a darme un buen consejo, sea a la hora que sea.
Si en el mbito familiar he tenido mucha suerte por los nimos y apoyos recibidos, stos no
a
a
e
han sido menores con respecto a mi director de tesis. Francisco Herrera me ha guiado en todos
los aspectos relacionados con el mundo profesional y muestro mi ms sincero agradecimiento a su
a
gran dedicacin e inters hacia m as como a los valiosos consejos que me ha dado, me da, y me
o
e

seguir dando.
a
No puedo olvidarme de todas aquellas personas que han estado a mi lado desde el principio de
este viaje y me han aguantado durante todo el camino, a pesar de mis d de mal humor. Muchas
as
gracias a los hermanos Alcal, Rafa y Jess, a Antonio Gabriel, a Jos Ramn por mostrarme la
a
u
e
o
otra cara de la moneda en momentos de mucho estrs, a Alberto y Julin, con los que he formado el
e
a
tr de becarios imparables de Paco, a Alicia por su eterna simpat y a mi amigo de estas y ahora
o
a
tambin amigo en el entorno profesional, Manolo Cobo. Gracias tambin a Oscar Crdon, Enrique
e
e
o
Herrera y Manuel Lozano por ayudarme a empezar en este viaje. Tambin, quisiera agradecer los
e
momentos buenos que he pasado con el resto de mis compaeros ya sea fuera o dentro de la vida
n
laboral, y no me olvido de ninguno: Sergio, Jorge, Javi, Coral, Roc Cristina, Carlos Porcel, Carlos
o,
Garc Carlos Mantas, Dani, Jos Santamar Ana, Perico, Igor, Nacho y Oscar Harari.
a,
e
a,
Quiero tambin expresar mi gratitud a los compaeros con los que he compartido reuniones,
e
n
seminarios y congresos y me han ayudado tambin a estar en donde estoy. De Jan, cabe citar a
e
e
Mar Jos del Jesus, a Pedro Gonzlez, a Chequin y a Paco Berlanga, de Crdoba no me olvido
a
e
a
o
de Sebastin Ventura, de Pedro Antonio, Amelia y Juan Carlos, de Barcelona, Ester y Albert y de
a
Gijn, a Luciano y a Pepe.
o
No quiero dejar de mencionar a los amigos por lo vivido y lo que nos queda por vivir: Jorge,
Juanda, Tomy, Edu, Ra, Mar Luisa y al resto del grupo de Linares.
a,
Mi agradecimiento a todas aquellas personas que no por no citarlas han sido menos importantes
para el trmino de esta memoria. Quiero dedicaros el esfuerzo de este, nuestro trabajo.
e
GRACIAS A TODOS

Indice
1. Memoria
1.1. Introduccin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
o
1.1.1. Clasicacin, Aprendizaje basado en Instancias y Seleccin de Prototipos . .

o
o
1.1.2. Algoritmos Evolutivos en Miner de Datos. Seleccin Evolutiva de Prototipos.

a
o
1.1.3. Problemas de Clasicacin con Clases no Balanceadas . . . . . . . . . . . . .

o
1.1.4. Medidas de Complejidad de Datos en Clasicacin . . . . . . . . . . . . . . .

o
1.1.5. Anlisis y Comparacin de Algoritmos: Tests Estad

a
o
sticos . . . . . . . . . . .
1.2. Justicacin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
o
1.3. Objetivos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4. Resumen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5. Discusin Conjunta de los Resultados . . . . . . . . . . . . . . . . . . . . . . . . . . 12
o
1.5.1. Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Ene
o
foque para la Escalabilidad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.2. Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos usando
o
o
una Medida de Solapamiento . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.3. Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balanceados:
o
Propuestas para Aprendizaje basado en Instancias y Seleccin de Conjuntos
o
de Entrenamiento . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.4. Diseo de Experimentos en Inteligencia Computacional: Sobre el Uso de la
n
Inferencia Estad
stica . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6. Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones . . 15
1.6.1. Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Ene
o
foque para la Escalabilidad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6.2. Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos usando
o
o
una Medida de Solapamiento . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6.3. Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balanceados:
o
Propuestas para Aprendizaje basado en Instancias y Seleccin de Conjuntos
o
de Entrenamiento . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.4. Diseo de Experimentos en Inteligencia Computacional: Sobre el Uso de la
n
Inferencia Estad
stica . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
vii

INDICE
viii
1.7. Perspectivas Futuras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2. Publicaciones: Trabajos Publicados, Aceptados y Sometidos
23
2.1. Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Enfoque

e
o
para la Escalabilidad - A Memetic Algorithm for Evolutionary Prototype Selection:
A Scaling Up Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2. Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos usando una
o
o
Medida de Solapamiento - Diagnose of Eective Evolutionary Prototype Selection
using an Overlapping Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3. Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balanceados: Proo
puestas para Aprendizaje basado en Instancias y Seleccin de Conjuntos de Entrenao
miento - Evolutionary Under-Sampling for Classication with imbalanced Data Sets:
Proposals for Instance-Based Learning and Training Set Selection . . . . . . . . . . . 69
2.3.1. Evolutionary Under-Sampling for Classication with Imbalanced Data Sets:
Proposals and Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.3.2. Enhancing the Eectiveness and Interpretability of Decision Tree and Rule
Induction Classiers with Evolutionary Training Set Selection over Imbalanced Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.4. Diseo de Experimentos en Inteligencia Computacional: Sobre el Uso de la Inferencia
n
Estad
stica - Design of Experiments in Computational Intelligence: On the Use of
Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.4.1. A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary
Algorithms Behaviour: A Case Study on the CEC2005 Special Session on
Real Parameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.4.2. An Extension on Statistical Comparisons of Classiers over Multiple Data
Sets for all Pairwise Comparisons . . . . . . . . . . . . . . . . . . . . . . . . 149
2.4.3. A Study of Statistical Techniques and Performance Measures for GeneticsBased Machine Learning: Accuracy and Interpretability . . . . . . . . . . . . 169
Bibliograf
a
193
Cap
tulo 1
Memoria
1.1.
Introduccin
o
La revolucin digital ha posibilitado que la captura de datos sea fcil y su almacenamiento tenga
o
a
un coste prcticamente nulo. Con el desarrollo del software y el hardware y la rpida informatizacin
a
a
o
de los negocios, enormes cantidades de datos son recogidas y almacenadas en bases de datos. Las
herramientas tradicionales de gestin de datos junto con tcnicas estad
o
e
sticas no son adecuadas
para analizar estas enormes cantidades de datos.
Es conocido que los datos por s solos no producen benecio directo. Su verdadero valor radica en
la posibilidad de extraer informacin util para la toma de decisiones o la exploracin y comprensin

o
o
o
del fenmeno que produjo los datos. Tradicionalmente en la mayor de los dominios este anlisis de
o
a
a
datos se hac mediante un proceso manual o semiautomtico: uno o ms analistas con conocimiento
a
a
a
de los datos y con la ayuda de tcnicas estad
e
sticas proporcionaban resmenes y generaban informes,
u
o validaban modelos sugeridos manualmente por los expertos. Sin embargo, este proceso, en especial
la generacin de modelos, es irrealizable conforme aumenta el tamao de los datos y el nmero de
o
n
u
dimensiones o parmetros se incrementa. Bases de datos con un nmero de registros del orden
a
u
de 109 y 103 de dimensin son un fenmeno relativamente comn y slo la tecnolog informtica
o
o
u
o
a
a
puede automatizar el proceso.
Por todo lo anterior, surge la necesidad de plantear metodolog para el anlisis inteligente
as
a
de datos, que permitan descubrir un conocimiento util a partir de estos. Este es el concepto de
proceso del Descubrimiento de Conocimiento en Bases de Datos (en ingls Knowledge Discovery in
e
Databases), con el acrnimo KDD, que puede ser denido como el proceso no trivial de identicar
o
patrones en los datos con las caracter
sticas siguientes: vlidos, novedosos, utiles y comprensibles.
a
El proceso de KDD es un conjunto de pasos interactivos e iterativos, entre los que se incluye el
preprocesado de los datos para corregir los posibles datos errneos, incompletos o inconsistentes, la
o
reduccin del nmero de registros o caracter
o
u
sticas encontrando los ms representativos, la bsqueda
a
u
de patrones de inters con una representacin particular y la interpretacin de estos patrones incluso
e
o
o
de una forma visual. El descubrimiento de conocimiento en bases de datos combina las tcnicas
e
tradicionales de extraccin de conocimiento con numerosos recursos desarrollados en el rea de la
o
a
inteligencia articial. En estas aplicaciones, el trmino miner de datos (data mining) es el que ha
e
a
tenido ms aceptacin siendo utilizado con frecuencia para reejar directamente todo el proceso de
a
o
KDD [FHOR05, Han05, WF05].
1
Cap
tulo 1. Memoria
Los campos de investigacin que intervienen en un proceso de KDD son muy variados: bases
o
de datos y reconocimiento de patrones, estad
stica e inteligencia articial, visualizacin de datos y
o
supercomputacin. Los investigadores de KDD incorporan tcnicas, algoritmos y mtodos de estos
o
e
e
campos. As un proceso KDD engloba todos estos campos y principalmente centra su atencin
,
o
en el proceso completo de extraer conocimiento de grandes volmenes de datos incluyendo el alu
macenamiento y acceso, escalando el algoritmo cuando sea necesario, interpretando y visualizando
los resultados y soportando la interaccin hombre-mquina. La siguiente gura ilustra el proceso
o
a
completo:
Figura 1.1: Proceso de KDD

El paso ms importante de este proceso es conocido como Miner de Datos (MDD a partir
a
a
de ahora) o Data Mining [TSK05]. La MDD es un campo interdisciplinar con el objetivo general
de predecir resultados y/o descubrir relaciones en los datos. La MDD puede ser descriptivo, i.e.
descubrir patrones que describen los datos, o predictivo, para pronosticar el comportamiento del
modelo basado en los datos disponibles.
T
picamente un algoritmo de MDD tiene tres componentes: el modelo, el criterio de preferencia
o eleccin y el algoritmo de bsqueda. El modelo con dos posibles tipolog segn su funcin o su
o
u
as
u
o
representacin. En el primer caso puede ser de clasicacin, regresin, clustering, de generacin de
o
o
o
o
reglas, reglas de asociacin, modelos de dependencia o anlisis de secuencias. Segn su representao
a
u
cin puede ser redes neuronales, rboles de decisin, discriminacin lineal, etc. Cada modelo tiene
o
a
o
o
unos parmetros que deben ser determinados mediante un algoritmo de bsqueda que optimiza
a
u
los parmetros del modelo segn el criterio de eleccin o preferencia que hace un mejor ajuste del
a
u
o
modelo a los datos.
Un concepto primordial, y diferenciador de las tcnicas estad
e
sticas ms clsicas, es el de aprena a
dizaje automtico (en ingls, machine learning), que fue concebido hace aproximadamente cuatro
a
e
dcadas con el objetivo de desarrollar mtodos computacionales que implementar varias formas
e
e
an
de aprendizaje, en particular, mecanismos capaces de inducir conocimiento a partir de datos. Ya
que el desarrollo de software ha llegado a ser uno de los principales cuellos de botella de la tecnolog
a
informtica de hoy, la idea de introducir conocimiento por medio de ejemplos parece particulara
mente atractiva. Tal forma de induccin de conocimiento es deseable en problemas que carecen de
o
solucin algor
o
tmica eciente, son vagamente denidos, o informalmente especicados. Ejemplos
1.1. Introduccin
o
de tales problemas pueden ser la diagnosis mdica, problemas de marketing, reconocimiento de

e
patrones visuales o la deteccin de regularidades en enormes cantidades de datos.
o
Los algoritmos de aprendizaje automtico pueden clasicarse en dos grandes categor
a
as:
mtodos de caja negra (o sin modelo), tales como redes neuronales o los mtodos bayesianos,
e
e
y
mtodos orientados al conocimiento, tales como los que generan rboles de decisin, reglas de
e
a
o
asociacin, o reglas de decisin.
o
o
La propuesta de caja negra desarrolla su propia representacin del conocimiento, que no es
o
visible desde el exterior. Los mtodos orientados al conocimiento, por el contrario, construyen una
e
estructura simblica del conocimiento que intenta ser util desde el punto de vista de la funcionalidad,
o
pero tambin descriptiva desde la perspectiva de la inteligibilidad. Existen tambin mtodos para
e
e
e
extraer reglas comprensibles a partir de estas cajas negras, con lo que en realidad ambas categor
as
pueden ser utiles para la extraccin de conocimiento.
o
Un concepto relacionado es el de Soft-Computing (tambin denominado inteligencia compue
tacional (IC)/computational intelligence) [Kon05], idea que engloba gran parte de las metodolog
as
que pueden ser aplicadas en MDD. Algunas de las metodolog ms extendidas y usadas son alas a
goritmos genticos, lgica fuzzy, redes neuronales, razonamiento basado en casos, conjuntos rough
e
o
o hibridaciones de las anteriores.
Como hemos mencionado anteriormente, el proceso de MDD puede ser descriptivo y predictivo.
A continuacin, enumeramos las disciplinas bsicas en cada uno de los procesos:
o
a
Procesos descriptivos: Clustering, obtencin de reglas de asociacin y descubrimiento de subo
o
grupos.
Procesos predictivos: Clasicacin y regresin.
o
o
Las tcnicas de MDD son sensibles a la calidad de la informacin sobre la que se pretende extraer
e
o
conocimiento. Cuanto mayor sea esta calidad, mayor ser la de los modelos de toma de decisiones
a
generados. En este sentido, la obtencin de informacin util para ser posteriormente procesada es
o
o
un factor clave. Aparece por tanto en el proceso de descubrimiento una etapa de preprocesamiento
de datos previa al MDD [Pyl99].
Podemos considerar como Preprocesamiento o Preparacin de Datos a todas aquellas tcnicas
o
e
de anlisis de datos que permiten mejorar la calidad de los mismos, de modo que los mtodos de
a
e
MDD puedan obtener mayor y mejor informacin [ZZY03].
o
La relevancia de la preparacin de los datos se debe a que:
o
Los datos reales pueden ser impuros, pudiendo conducir a la extraccin de modelos poco
o
utiles. Dicha circunstancia puede estar originada por datos incompletos, datos con ruido o
datos inconsistentes [KCH+ 03].

La preparacin de los datos puede generar un conjunto de menor tamao que el original,
o
n
lo cual puede mejorar la eciencia en MDD. En este aspecto, se pueden desarrollar tareas
dirigidas a seleccionar datos relevantes, eliminar registros duplicados, anomal o bien reducir
as
el volumen de datos, mediante la seleccin de caracter
o
sticas, de instancias, discretizacin, etc.
o
Cap
tulo 1. Memoria
La preparacin origina datos de calidad, los cuales pueden conducir a modelos de calidad.
o
Para ello, se emplean mecanismos que recuperan informacin incompleta, resuelven conictos
o
o bien, eliminan datos errneos.
o
En esta memoria, de entre las diferentes estrategias a seguir en el procesado de datos, vamos
a dirigir nuestra atencin hacia la Reduccin de Datos (RDD), donde el objetivo es extraer del
o
o
conjunto original de datos un conjunto de datos ms pequeo y representativo para confeccionar el
a
n
modelo. As mismo, la reduccin de puede llevar a cabo de mltiples formas. En esta memoria nos
o
u
centraremos en la Seleccin de Instancias (SII) donde se escogen las muestras ms signicativas
o
a
del conjunto de datos [LM01]. Ms concretamente, en la Seleccin de Prototipos (SPP) empleando
a
o
razonamiento basado en casos.
El proceso de SII se puede orientar desde dos perspectivas posibles:
Obtener un clasicador basado en casos v la SPP. Se pretende aumentar la precisin del
a
o
clasicador que utiliza un conjunto de casos previamente conocidos mediante la SPP.
La Seleccin de Conjuntos de Entrenamiento, donde se considerar la calidad de los conjuntos
o
a
obtenidos para la extraccin de modelos mediante tcnicas de MDD. Las medidas de calidad
o
e
consideradas para valorar los subconjuntos dependern del mbito al que se dirijan los modelos
a
a
generados. En nuestro caso, al tratarse de modelos predictivos en problemas de clasicacin,
o
dependern de la precisin e interpretabilidad obtenida.
a
o
Nuestro inters en esta memoria se enmarca principalmente dentro del proceso de clasicacin,
e
o
en el cual mltiples estrategias han emergido y suscitado gran inters. Una de ellas es el razonau
e
miento basado en casos, en el que el conocimiento se representa a partir de los propios ejemplos o
casos obtenidos directamente de los datos iniciales. Una subfamilia de mtodos basados en razonae
miento por casos la constituyen los mtodos de aprendizaje basados en instancias, cuyos conceptos
e
principales sern destacados en la prxima seccin.
a
o
o
1.1.1.
Clasicacin, Aprendizaje basado en Instancias y Seleccin de Prototipos

o
o
En el contexto de la MDD, entendemos por clasicacin el proceso en el que, sabiendo la

o
existencia de ciertas clases o categor establecemos una regla para ubicar nuevas observaciones
as,
en alguna de las clases existentes (aprendizaje supervisado). Las clases resultan de un problema de
prediccin, donde cada clase corresponde a la salida posible de una funcin a predecir a partir de
o
o
los atributos con que describimos los elementos de la base de datos. La necesidad de un clasicador
surge por requerimientos de disponer de un procedimiento mecnico mucho ms rpido que un
a
a a
supervisor humano y que a la vez pueda evitar sesgos y prejuicios adoptados por un experto.
As mismo, tambin nos permite evitar acciones costosas y servir de ayuda a los supervisores
e
humanos, sobretodo en caso particularmente dif
ciles.
Hay cinco consideraciones para calicar un clasicador:
Precisin: Representa el nivel de conanza del clasicador, usualmente representado como la
o
proporcin de clasicaciones correctas que es capaz de producir.
o
Velocidad: Tiempo de respuesta desde que se presenta un nuevo ejemplo a clasicar hasta que
obtenemos la clase que el clasicador predice. Normalmente, la velocidad es tan importante
como la precisin.
o
1.1. Introduccin
o
Interpretabilidad: Claridad y credibilidad, desde el punto de vista humano, de la regla de

clasicacin.
o
Velocidad de aprendizaje: Tiempo requerido por un clasicador para obtener la regla de
clasicacin desde un conjunto de ejemplos.
o
Robustez: Nmero m
u
nimo de ejemplos necesarios para obtener una regla de clasicacin
o
able y precisa.
Un clasicador recibe como entrada un conjunto de ejemplos, denominado conjunto de entrenamiento, con el que se aprende la regla de clasicacin. Adems, en el proceso de validacin
o
a
o
de un clasicador, se utiliza un conjunto de ejemplos, no conocido en el proceso de aprendizaje,
denominado conjunto de test y utilizado para comprobar la precisin del clasicador.
o
En la literatura especializada se han propuesto numerosas estrategias para abordar el problema
de la clasicacin; desde estrategias puramente estad
o
sticas, como discriminantes, hasta redes neuronales, rboles de decisin y reglas lgicas o difusas. Una estrategia exitosa en clasicacin es el
a
o
o
o
Razonamiento Basado en Casos, (RBCC), en el que el conocimiento se representa a partir de los
propios ejemplos o casos obtenidos directamente de los datos iniciales. Una subfamilia de mtodos
e
de RBCC la constituyen los mtodos de Aprendizaje Basados en Instancias (ABII), en donde se
e
utilizan funciones de similitud para describir conceptos probabil
sticos.
Los algoritmos de ABII [AKA91] son los derivados del clasicador de vecinos ms cercanos
a
[CH67], en ingls k-Nearest Neighbours (k-NN). La diferencia principal entre ABII y RBCC radica
e
en que los mtodos RBCC modican los casos y utilizan partes de casos en el aprendizaje. Los
e
algoritmos ABII emplean los ejemplos por completo y dieren en las medidas de similitud adoptadas. Aunque parezca que el clasicador k-NN es demasiado simple, se ha empleado con xito
e
en numerosas aplicaciones [Pap04] y es considerado unos de los 10 mejores algoritmos en MDD
[WKQ+ 07].
No obstante, el clasicador k-NN presenta los siguientes inconvenientes:
Es costoso en trminos computacionales a la hora de clasicar, puesto que retiene todos los
e
ejemplos de entrenamiento.
No es tolerante a la existencia de ejemplos ruidosos, que son aquellos ejemplos considerados
como errores en una base de datos.
No es tolerante frente a la existencia de atributos irrelevantes.
Es sensible a la eleccin de la medida de similitud.
o
En su denicin original, no hay manera natural de trabajar con valores nominales o valores
o
perdidos, aunque en la actualidad existen medidas de similitud que permiten trabajar con
estos valores.
Proporciona poca informacin util con respecto a la estructura de los datos.
o
En la Seccin 1.1, vimos que dentro de la MDD, tambin existe un proceso de preprocesamiento
o
e
de datos [Pyl99, ZZY03]. Este proceso es crucial cuando el algoritmo que obtiene el conocimiento
es muy sensible a la calidad de los datos de entrada, como es el caso de los algoritmos ABII. El
preprocesamiento de datos se constituye de mltiples procesos que actan sobre los datos, tales
u
u
como la reduccin de datos, limpieza de datos, construccin de datos, integracin de datos, cambio
o
o
o
Cap
tulo 1. Memoria
de formato, etc. Un proceso muy interesante y ecaz en los algoritmos ABII es la reduccin de
o
datos [WM00], puesto que con ella podemos dotar al clasicador k-NN de mejoras frente al coste,
en trminos computacionales, de clasicar y tolerancia al ruido y a la existencia de atributos
e
irrelevantes.
La reduccin de datos se puede obtener mediante:
o
Seleccin de Caracter
o
sticas [LM98]: Reduccin del nmero de atributos en la base de datos.
o
u
Discretizacin [LHTD02]: Reduccin del nmero posible de valores en un atributo.
o
o
u
Seleccin de Instancias (SII) [LM02]: Reduccin del nmero de ejemplos en la base de datos.
o
o
u
La SII surgi desde anteriores perspectivas relacionadas con el clasicador k-NN, y el trmino
o
e
de SII se adopta desde el punto de vista general de MDD. Sin embargo, cuando nos referimos a los
algoritmos ABII, en particular k-NN, las tcnicas de reduccin de ejemplos en la base de datos se
e
o
dividen en:
Tcnicas de Condensacin [Har68]: Su principal objetivo consiste en eliminar los ejemplos
e
o
redundantes de la base de datos y buscan la consistencia en el conjunto de entrenamiento;
esto es, mantener la precisin que obtiene k-NN en el conjunto de entrenamiento. Estos
o
algoritmos obtienen tasas de reduccin altas pero pierden precisin en conjuntos de test.
o
o
Tcnicas de Edicin [Wil72]: Se interesan en eliminar unicamente los ejemplos ruidosos de
e
o
la base de datos, con el objetivo de mejorar las capacidades de prediccin de k-NN. Estos
o
algoritmos obtienen tasas de reduccin bajas pero normalmente mejoran la precisin de k-NN
o
o
en conjuntos de test.
Tcnicas de Seleccin de Prototipos (SPP) [WM00]: Buscan reducir lo mximo posible el
e
o
a
conjunto de datos a la vez que mejorar la precisin de k-NN.
o
Por otro lado, cabe destacar la diferencia entre mtodos de SPP y mtodos de Generacin de
e
e
o
Prototipos. Aunque ambos buscan el mismo objetivo, los primeros lo hacen haciendo una seleccin
o
de ejemplos de la base de datos, por lo que los ejemplos nales coinciden o existen en la base de
datos original. La generacin de prototipos selecciona y genera nuevos ejemplos a partir de los
o
originales. La ventaja de la SPP es que nos permite identicar los casos o ejemplos ms inuyentes
a
en la base de datos incrementando en cierta manera la interpretabilidad del modelo.
La combinacin de tcnicas de Soft-Computing, como los algoritmos evolutivos, con la MDD ha
o
e
demostrado ser de gran utilidad y ha obtenido resultados prometedores, particularmente en SII y
SPP como veremos en la siguiente seccin.
o
1.1.2.
Algoritmos Evolutivos en Miner de Datos. Seleccin Evolutiva de Proa

o
totipos.
Los Algoritmos Evolutivos (AAEE) [ES03] son algoritmos de bsqueda basados en los procesos
u
naturales de evolucin y gentica. En los ultimos aos se han consolidado como uno de los tipos
o
e
n
de algoritmos de bsqueda en problemas complejos (gran nmero de variables, mltiples ptimos
u
u
u
o
locales, y/o mltiples objetivos y condiciones/relaciones complejas entre ellas) de ms xito en
u
a e
Inteligencia Articial.
1.1. Introduccin
o
Los AAEE han demostrado ser una herramienta importante para el aprendizaje y extraccin de
o
conocimiento [Fre02]. Se han utilizado combinados con mltiples modelos de representacin de cou
o
nocimiento, tales como redes neuronales, bases de reglas difusas, reglas intervalares, aproximaciones
basadas en prototipos, seleccin de variables e instancias, extraccin de reglas de asociacin, etc.
o
o
o
En la actualidad existe un continuo desarrollo de modelos evolutivos de extraccin de conocimiento.
o
Como muestra citamos libros recientemente publicados que recogen aportaciones recientes en esta
temtica [JG05, GDG08, AGR06].
a
Aunque los AAEE no fueron diseados como algoritmos espec
n
cos de aprendizaje, sino como
algoritmos de bsqueda global, s podemos destacar las ventajas del uso de stos en el campo del
u
e
aprendizaje automtico. Muchas de las metodolog de aprendizaje automtico estn basadas en
a
as
a
a
la bsqueda de un modelo ptimo dentro de un espacio de modelos, como puede ser el espacio de
u
o
bases de reglas, el espacio de pesos y topolog de redes neuronales, o el espacio de conjuntos de
as
prototipos, por sealar algunos ejemplos. Por tanto, pueden ser planteadas como un problema de
n
bsqueda u optimizacin subyacente. Los AAEE permiten realizar la bsqueda en los espacios de
u
o
u
modelos mediante la codicacin de un modelo de solucin en un cromosoma. Son muy exibles en
o
o
cuanto a la codicacin de diferentes modelos, ya que se puede utilizar el mismo AE con diferentes
o
representaciones.
En el preprocesado de datos en MDD, y en particular en la reduccin de datos, los AAEE han
o
sido muy utilizados en la seleccin de caracter
o
sticas [Lee04] y la SII [CHL03, CHL07, GP07]. En
este ultimo caso, hablamos de Seleccin Evolutiva de Instancias, y particularmente de Seleccin
o
o
Evolutiva de Prototipos (SEPP) cuando usamos k-NN como clasicador.
Los algoritmos de SEPP propuestos en [CHL03] dotan al clasicador k-NN de una mayor precisin y obtienen subconjuntos de ejemplos muy reducidos. El esquema que siguen viene especicado
o
en la siguiente gura:
Data Set (D)
Training Set (T)
Test Set (TS)
Evolutionary Prototype
Selection Algorithm
Prototype Subset Selected (S)
1Nearest Neighbour
Classifier
Figura 1.2: Seleccin Evolutiva de Prototipos

o
El incremento del tamao de las bases de datos es un problema bsico en la SPP (problema del
n
a
escalado de datos [PK99]). Este problema produce un requerimiento excesivo de almacenamiento,
incrementa el coste computacional y afecta a la capacidad de generalizacin del clasicador. Obo
viamente, estas debilidades se presentan tambin en la SEPP porque resultan en un incremento
e
considerable del tamao del cromosoma (representacin de soluciones en los AAEE) y en el tiempo
n
o
de ejecucin, junto con una prdida de la capacidades de convergencia en estos algoritmos. Sin
o
e
Cap
tulo 1. Memoria
embargo, los conjuntos de gran tamao puede ser abordables mediante la tcnica de estraticacin
n
e
o
propuesta en [CHL05], en donde se comprueba un gran rendimiento por parte de la SEPP, aunque
dependiendo de la eleccin del tamao de cada estrato, los AAEE pueden o no alcanzar su mximo
o
n
a
rendimiento en funcin de su capacidad de converger hacia soluciones ptimas.
o
o
1.1.3.
Problemas de Clasicacin con Clases no Balanceadas

o
El problema de las clases no balanceadas es uno de los nuevos problemas que surgieron cuando el
aprendizaje automtico madur de una ciencia a una tecnolog aplicada, ampliamente usada en el
a
o
a
mundo de los negocios, industria e investigacin cient
o
ca. Aunque los experimentadores ya ten
an
conocimiento sobre este problema, hizo su aparicin en la comunidad cient
o
ca del aprendizaje
automtico y MDD hace no ms de una dcada. Su importancia creci a medida de que cada vez,
a
a
e
o
los investigadores se daban cuenta de que los conjuntos de datos que analizaban eran no balanceados
y obten modelos de clasicacin por debajo del umbral deseado de ecacia.
an
o
El problema del no balanceo de las clases ocurre t
picamente cuando, en un problema de clasicacin, existen muchas ms instancias o ejemplos de una clase que del resto de clases [CJK04].
o
a
En estos casos, una clasicacin estndar tiende a ser inundada por las clases grandes e ignorar
o
a
las ms pequeas. En aplicaciones prcticas, la tasa de la clases pequea sobre la grande puede
a
n
a
n
ser drstica cuando tenemos 1 ejemplo frente a 100, 1 frente a 1.000 o 1 frente a 10.000. Como se
a
ha mencionado anteriormente, este problema es observable en muchas situaciones, incluyendo la
deteccin de fraude o intrusos, gestin de riesgos, clasicacin de textos, diagnstico mdico, etc.
o
o
o
o
e
Es bueno saber que en determinados dominios (como los mencionados) el problema de las clases
no balanceadas es intr
nseco al problema. Por ejemplo, existen muy pocos casos de fraude comparados con la gran cantidad de uso honesto de las facilidades ofertadas a un cliente. Sin embargo,
el no balanceo de las clases ocurre a veces en dominios que no tienen un desequilibrio intr
nseco.
Esto ocurre cuando el proceso de coleccin de datos est limitado (debido a razones econmicas o
o
a
o
privadas). Adems, puede haber tambin un desequilibrio en los costes asociados a hacer diferentes
a
e
errores, que pueden variar para cada caso.
Se han propuesto un gran nmero de soluciones al problema de las clases no balanceadas en
u
dos tipos de niveles: a nivel de datos y a nivel algor
tmico. En el nivel de datos, estas soluciones
incluyen muchas formas diferentes de re-muestreo, como sobre-muestreo aleatorio con reemplazo,
bajo-muestreo aleatorio [EJJ04], sobre-muestreo enfocado (en el que no se crean nuevos ejemplos,
pero la eleccin de las muestras a reemplazar es informada ms que aleatoria), sobre-muestreo con
o
a
generacin de ejemplos articiales informada [CBHK02], bajo-muestreo informado [BPM04] y como
binaciones o hibridaciones de las tcnicas anteriores. En el nivel algor
e
tmico, las soluciones incluyen
el ajuste de costes de las distintas clases del problema de tal forma que la clase menos representada
es ms costosa a efectos de clasicacin, incluyen el ajuste de la estimacin de probabilidad de
a
o
o
las hojas de un rbol [WP03] (cuando trabajamos con rboles de decisin), incluyen el ajuste del
a
a
o
umbral de decisin y el uso de aprendizaje basado en reconocimiento (aprender con una clase)
o
mejor que el basado en discriminacin (para dos clases).
o
Recientemente, en [CCHJ08], se ha mostrado la relacin emp
o
rica existente entre el tratamiento
de los problemas de clasicacin no balanceados con propuestas a nivel de datos y propuestas a
o
nivel algor
tmico. El preprocesamiento de los datos para ser tratados desde el punto de vista de
problemas no balanceados ha demostrado ser de gran utilidad y tiene la gran ventaja de no necesitar
realizar modicacin alguna de los algoritmos de clasicacin que ya conocemos de antemano.
o
o
1.1. Introduccin
o
1.1.4.
Medidas de Complejidad de Datos en Clasicacin

o
A simple vista, existen una serie de parmetros que claramente condicionan que un problema
a
concreto de clasicacin sea ms complejo que otro. Por ejemplo, podemos destacar el nmero de
o
a
u
instancias en la base de datos, el nmero de atributos por patrn, el nmero de distintas clases a
u
o
u
predecir, etc. Sin embargo, otras medidas de complejidad pueden ser aplicadas para denir a priori
la dicultad de los problemas de clasicacin.
o
Las medidas de complejidad de datos en clasicacin [HB02, HB06] estudian la complejidad
o
geomtrica de los l
e
mites o bordes de decisin entre los ejemplos de distinta clase. Como norma
o
prctica, la dicultad de un problema se considera proporcional a la tasa de error obtenido por un
a
clasicador. No obstante, y segn el teorema de No Free Lunch [WM97], no es posible encontrar
u
un algoritmo que sea mejor en todos los problemas.
En la literatura especializada podemos encontrar una serie de medidas estandarizadas para
medir la complejidad de los datos:
Medidas de Solapamiento: para medir el volumen de solapamiento entre ejemplos de distintas
clases as como por atributos.
Medidas de Separabilidad: para medir el grado de separacin existente entre clases.

o
Medidas de Geometr Topolog y Densidad: miden la alineacin entre clases, el espacio
a,
a
o
cubierto por intervalos de vecindad y la cantidad de ejemplos existente por unidad de volumen.
No est claro an hasta que punto nos pueden ser utiles dichas medidas. El objetivo deseado
a
u
consiste en ser capaz de determinar a priori qu tipo de algoritmo puede ser ms benecioso de ser
e
a
aplicado, o cuando merece la pena usar un algoritmo concreto [BMH05].
1.1.5.
Anlisis y Comparacin de Algoritmos: Tests Estad

a
o
sticos
En la actualidad, una propuesta de un algoritmo debe ser justicada cuando obtiene un benecio
con respecto a otras propuestas ya estudiadas, en relacin a cualquier medida de rendimiento:
o
precisin, eciencia, interpretabilidad, etc. En problemas de clasicacin u optimizacin, un estudio
o
o
o
requiere involucrar varios casos de distinta
ndole, que son los llamados conjuntos de datos (data
sets) en clasicacin o funciones de test en optimizacin.
o
o
Uno de los problemas abiertos en IC es que no existe una teor unicada ni forma de demostrar
a
tericamente la mejora de un mtodo frente a otro [YW06]. Debido a esto, necesitamos hacer
o
e
comparaciones rigurosas que nos permitan trabajan con resultados concretos.
La teor de los tests estad
a
sticos nos permite inducir una probabilidad de error a una hiptesis
o
dada a partir de una muestra nita de resultados. En MDD e IC, la tendencia de los investigadores
es la de usar tcnicas estad
e
sticas pareadas (comparan dos algoritmos entre s por ejemplo el t,
test) sobre muestras de resultados que presentan unas condiciones idneas para ser analizadas por
o
tcnicas paramtricas [She06]. En algunas ocasiones, tambin se utilizan tcnicas de comparaciones
e
e
e
e
mltiples (ANOVA). Las condiciones apropiadas para hacer un anlisis paramtrico vienen dadas
u
a
e
por tres propiedades bsicas: Independencia de resultados, normalidad de resultados y homocedasa
ticidad. En raras ocasiones se cumplen las tres condiciones a la vez.
No obstante, la estad
stica paramtrica no funciona de forma adecuada cuando la muestra de
e
resultados la forman los resultados obtenidos entre varios data sets, debido a que cada resultado
10
Cap
tulo 1. Memoria
representa a un problema diferente y la poblacin se constituye por resultados muy dispares. Demar
o
s
en [Dem06] revisa y propone el uso de tcnicas estad
e
sticas no paramtricas [Con98] para analizar los
e
resultados de los clasicadores entre varios conjuntos de datos. Se centra principalmente en acercar
las tcnicas estad
e
sticas simples al entorno de la clasicacin y mostrar que realmente son ventajosas
o
con respecto al uso de tcnicas paramtricas. Adems, incide en el uso de tcnicas de comparaciones
e
e
a
e
mltiples que nos permiten a priori especicar el nivel de error apropiado de un experimento
u
independientemente del nmero de factores (algoritmos en nuestro caso) que intervienen en l.
u
e
1.2.
Justicacin
o
Una vez conocidos los principales conceptos a los que se reere esta memoria, nos planteamos
una serie de problemas abiertos que nos sitan en el planteamiento y la justicacin del presente
u
o
proyecto de tesis.
Como hemos sealado en la Seccin 1.1.2, los AAEE son tcnicas de optimizacin que precisan
n
o
e
o
de altos requerimientos de cmputo. Por esta razn, el empleo de dichas tcnicas en problemas
o
o
e
de clasicacin basados en instancias, como proceso de SPP, puede resultar ser muy costoso
o
cuando los conjuntos de entrada aumentan su tamao. Por otro lado, los AAEE requieren
n
aumentar su esquema de representacin de soluciones ante estos tipos de problemas y esto
o
puede ocasionar faltas en la convergencia del algoritmo para obtener soluciones precisas en
problemas de mayor tamao.
n
La aplicacin de los AAEE en la SPP es muy efectiva en trminos de tasas de reduccin de
o
e
o
datos obtenidos, pero puede ocurrir que la precisin del modelo obtenido no se incremente.
o
Ser muy util disponer de un mecanismo eciente que nos permita diagnosticar el posible
a
resultado que vamos a obtener tras aplicar un AAEE en SPP a partir del tipo de datos al
que nos enfrentamos.
Como hemos visto anteriomente, una de las tcnicas ms ecaces para tratar el problema
e
a
de la falta de balanceo en las clases en clasicacin consiste en hacer un preprocesamiento
o
previo de los datos. En RDD, diversas tcnicas han sido propuestas para tal n, pero todas
e
ellas se basan en heur
sticas y no obtienen soluciones lo sucientemente precisas. Por otro
lado, existen otras tcnicas basadas en sobre-muestreo o generacin de nuevos datos. Estas
e
o
tcnicas ofrecen altas prestaciones en trminos de precisin cuando son previamente aplicadas
e
e
o
a un algoritmo predictivo de MDD, pero tienen el inconveniente de aumentar el tamao de
n
los modelos y, por tanto, reducir su interpretabilidad.
La propuesta de nuevos mtodos en MDD o IC es una actividad frecuente en la investigacin.
e
o
Todo nuevo mtodo debe disponer de alguna ventaja frente a los ya propuestos y para dee
terminarla se han utilizado tcnicas estad
e
sticas de comparacin. Estas tcnicas no siempre
o
e
son aplicables a todo tipo de resultados obtenidos, sobre todo cuando queremos hacer un
anlisis en el que la comparacin incluye diferentes casos o instancias pertenecientes al mismo
a
o
problema (entorno multi-problema).
1.3. Objetivos
1.3.
11
Objetivos
Como se acaba de mencionar en la seccin anterior, la presente memoria se organiza en torno

o
a cuatro grandes objetivos que involucran la mejora de la eciencia en la seleccin evolutiva de
o
prototipos al tratar con problemas de gran tamao, el diagnstico de la efectividad de la seleccin
n
o
o
evolutiva de prototipos previo a su aplicacin en el proceso de RDD, la aplicacin de la seleccin
o
o
o
evolutiva de instancias para la RDD en problemas de clasicacin con clases no balanceadas y la
o
propuesta de un adecuado anlisis experimental global en entornos de mltiples casos de un mismo
a
u
problema mediante la aplicacin de la estad
o
stica no paramtrica.
e
En concreto, los objetivos que nos proponemos son:
Aumentar la eciencia y precisin de la aplicacin de los AAEE a la SPP. Los algoritmos de
o
o
seleccin evolutiva de prototipos van a verse penalizados por el tamao del conjunto de datos
o
n
sobre el que se aplican. Se pretende estudiar en qu medida se produce esta penalizacin
e
o
y ofrecer una nueva tcnica alternativa que la solvente. Para ello, analizamos una nueva
e
propuesta de AAEE que pondera de forma adecuada la reduccin del conjunto de datos con
o
la menor prdida de precisin cuando se abordan problemas de gran tamao.
e
o
n
Determinar el comportamiento de los algoritmos de seleccin evolutiva de prototipos a partir
o
de la complejidad de los datos. Los problemas de clasicacin pueden tener diferentes grados
o
de complejidad dependiendo de un clasicador concreto. La medicin de esta complejidad
o
es ms ecaz que la utilizacin del algoritmo, por lo que podemos diagnosticar cuando un
a
o
algoritmo de seleccin evolutiva de prototipos va a alcanzar un rendimiento mximo, en
o
a
trminos de ecacia, antes de ser aplicado. En este caso, analizamos diferentes problemas en
e
trminos de su complejidad y modelamos el comportamiento esperado de dichos algoritmos.
e
Evaluar los algoritmos de seleccin evolutiva de instancias en procesos de RDD para problemas
o
con clases no balanceadas. Para ello, proponemos la aplicacin de las tcnicas de seleccin
o
e
o
evolutiva de instancias utilizando medidas de evaluacin propias de los problemas con clases
o
no balanceados, con el n de aumentar la precisin obtenido por las tcnicas ya propuestas
o
e
de bajo-muestro y aumentar la interpretabilidad de los modelos obtenidos por algoritmos de
MDD con respecto a las tcnicas ya propuestas de sobre-muestreo.
e
Proponer una metodolog basada en tcnicas estad
a
e
sticas para realizar comparaciones globales
de algoritmos en entornos multi-problema. Las tcnicas estad
e
sticas comnmente conocidas,
u
aunque aplicables, no ofrecen la certeza necesaria cuando se aplican en diversas muestras
de resultados. La utilizacin de tcnicas estad
o
e
sticas no paramtricas puede solventar este
e
problema, sobre todo cuando la muestra de resultados a analizar se compone de valores
dispares como fruto de la compaginacin de resultados entre problemas de diferente
o
ndole.
Proponemos su utilizacin y damos una serie de indicaciones segn el tipo y cantidad de
o
u
algoritmos que se desean comparar.
1.4.
Resumen
Para desarrollar los objetivos planteados, la memoria est constituida por siete publicaciones
a
distribuidas en cuatro partes que se desarrollarn en el Cap
a
tulo 2. Estas partes son las siguientes:
12
Cap
tulo 1. Memoria
Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Enfoque para la Escae

o
labilidad - A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach
Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos usando una Medida de
o
o
Solapamiento - Diagnose of Eective Evolutionary Prototype Selection using an Overlapping
Measure
Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balanceados: Propuestas para
o
Aprendizaje basado en Instancias y Seleccin de Conjuntos de Entrenamiento - Evolutionary
o
Under-Sampling for Classication with imbalanced Data Sets: Proposals for Instance-Based
Learning and Training Set Selection
Diseo de Experimentos en Inteligencia Computacional: Sobre el Uso de la Inferencia Esn
tad
stica - Design of Experiments in Computational Intelligence: On the Use of Statistical
Inference
Adems, en esta memoria incluimos una Seccin de Discusin Conjunta de los Resultados, que
a
o
o
proporciona una informacin resumida de las propuestas y los resultados ms interesantes obtenidos
o
a
en cada parte. La Seccin Comentarios Finales: Breve Resumen de los Resultados Obtenidos y
o
Conclusiones resume los resultados obtenidos en esta memoria y presenta algunas conclusiones
sobre stos. Finalmente, en la Seccin Perspectivas Futuras se comentarn algunos aspectos
e
o
a
sobre trabajos futuros que quedan abiertos en la presente memoria.
1.5.
Discusin Conjunta de los Resultados

o
Esta seccin muestra un resumen de las distintas propuestas que se recogen en la presente
o
memoria y presenta una breve discusin sobre los resultados obtenidos por cada una de ellas.
o
1.5.1.
Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un

e
o
Enfoque para la Escalabilidad
En esta parte (Seccin 2.1), se analiza el problema de la SPP desde el punto de vista de la
o
escalabilidad, haciendo una breve revisin de las tcnicas clsicas y evolutivas propuestas para
o
e
a
SPP y deniendo los conceptos necesarios para abordarlo. Una vez expuestos los inconvenientes de
los algoritmos de SEPP convencionales, que son la falta de convergencia cuando los utilizamos en
problemas de mayor tamao y su ineciencia, indicamos que el uso de los AAEE en problemas de
n
mayor tamao puede estar ms limitado. Para solucionar este problema en los AAEE, podemos
n
a
hibridarlos con meta-heur
sticas que presentan mejor comportamiento de explotacin local, como
o
los algoritmos de bsqueda local. Estos algoritmos son los denominados Algoritmos Memticos
u
e
[KS05]. Como los algoritmos memticos consiguen un buen equilibrio entre explotacin y explorae
o
cin y hemos observado que los algoritmos de SEPP tradicionales no consiguen buenas soluciones
o
cuando los problemas aumentan su tamao, nosotros proponemos utilizar un modelo espec
n
co de
algoritmo memtico en el problema de la SPP. Para conseguir un menor tiempo de respuesta, utie
lizamos un procedimiento de bsqueda local ad-hoc para el problema de la SPP, en donde solo se
u
permiten operaciones rpidas y se registra toda la informacin que necesita un clasicador k-NN.
a
o
1.5. Discusin Conjunta de los Resultados

o
13
Se incluye el estudio experimental desarrollado, donde se especica la metodolog seguida, sus

a
resultados y un completo anlisis de los mismos. Los resultados obtenidos indican que la propuesta
a
basada en algoritmos memticos mejora a las propuestas no evolutivas y evolutivas, sobre todo
e
cuando aumenta la escalabilidad del problema. Adems, el tiempo de cmputo queda reducido en
a
o
comparacin al resto de tcnicas de SEPP ya propuestas en la literatura.
o
e
Las art
culos asociados a esta parte son:
S. Garc J.R. Cano, F. Herrera, A Memetic Algorithm for Evolutionary Prototya,
pe Selection: A Scaling Up Approach. Pattern Recognition 41:8 (2008) 2693-2709,
doi:10.1016/j.patcog.2008.02.006.
1.5.2.
Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos

o
o
usando una Medida de Solapamiento
En esta parte (Seccin 2.2), se estudia una forma de diagnosticar el uso efectivo de los algoritmos
o
de SEPP en problemas de clasicacin. Como ya hemos dicho anteriormente, los AAEE destacan
o
por la obtencin de resultados excelentes en diferentes campos de aplicacin, pero no as por su
o
o
rapidez en el proceso de optimizacin, y es extrapolable al problema de la SEPP. El estudio de las

o
mtricas de complejidad de los datos en problemas de clasicacin nos puede ser de gran utilidad
e
o
para realizar un diagnstico previo al proceso de RDD cuando un algoritmo de SEPP es ms
o
a
efectivo dependiendo del problema o caso al que se va a aplicar. Aprovechando que el clasicador
k-NN (ABII en general), debido a su simplicidad, es muy sensible a la complejidad de los datos
[SMS07], sobre todo a medidas de complejidad basadas en solapamiento, nos proponemos hacer
un anlisis de efectividad de un algoritmo de SEPP convencional (CHC) basndonos en diferentes
a
a
umbrales de solapamiento de los datos. El estudio nos permite categorizar los conjuntos de datos
segn el solapamiento observado en trminos de efectividad desde el punto de vista de la SEPP.
u
e
Obtenemos un umbral a partir del cual podemos dividir los conjuntos de datos entre muy solapados
y poco solapados. En los primeros, la SEPP obtiene tasas de acierto ms elevadas que la aplicacin
a
o
del k-NN solo o combinado con otros algoritmos clsicos de SPP. En los segundos, se conserva la
a
precisin con los prototipos seleccionados con respecto a k-NN, pero las tasas de reduccin siguen
o
o
siendo muy ventajosas.
Las art
S. Garc J.R. Cano, E. Bernad-Mansilla, F. Herrera, Diagnose of Eective Evolutionary
a,
o
Prototype Selection using an Overlapping Measure. International Journal of Pattern Recognition and Articial Intelligence, in press (2008).
1.5.3.
Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balano

ceados: Propuestas para Aprendizaje basado en Instancias y Seleccin
o
de Conjuntos de Entrenamiento
En esta parte (Seccin 2.3), se presenta el bajo-muestreo evolutivo para problemas de clasicao
cin con clases no balanceadas desde dos perspectivas: bajo-muestreo evolutivo para clasicacin
o
o
14
Cap
tulo 1. Memoria
mediante el vecino ms cercano y seleccin de conjuntos de entrenamiento para obtener modelos

a
o
predictivos utilizando rboles de decisin y algoritmos de induccin de reglas, particularmente C4.5
a
o
o
[Qui93] y PART [FW98]. En un primer momento, se presenta una taxonom de diferentes mtodos
a
e
de bajo-muestreo evolutivos que se proponen, basada en las diferencias entre distintas formas de
realizar la seleccin de ejemplos del conjunto de entrenamiento, distintas funciones de rendimiento
o
que se optimizan con el AE en cuestin y el inters en obtener un conjunto equilibrado en trmio
e
e
nos de nmero de ejemplos por cada clase. Adems, tambin se presentan los diferentes modelos
u
a
e
de bajo-muestreo y sobre-muestreo no evolutivos que se van a estudiar. Se incluye el estudio experimental desarrollado, donde se especica la metodolog seguida, sus resultados y un completo
a
anlisis de los mismos. Los resultados indican que la aplicacin del bajo-muestreo evolutivo sobre el
a
o
clasicador de vecinos cercanos es muy ventajoso sobre todo cuando el grado de desequilibrio entre
la distribucin de clases es muy alto. Adems, en la seleccin de conjuntos de entrenamiento con
o
a
o
C4.5 y PART, el bajo-muestreo evolutivo es superior a los mtodos de bajo-muestreo clsicos e iguae
a
la en rendimiento a los mtodos de sobre-muestreo. La principal ventaja sobre estos ultimos es que
e
permiten obtener rboles o bases de reglas de menor tamao, lo que aumenta la interpretabilidad
a
n
de los modelos.
Las art
S. Garc F. Herrera, Evolutionary Under-Sampling for Classication with Imbalanced Data
a,
Sets: Proposals and Taxonomy. Evolutionary Computation, in press (2008).
S. Garc A. Fernndez, F. Herrera, Enhancing the Eectiveness and Interpretability of
a,
a
Decision Tree and Rule Induction Classiers with Evolutionary Training Set Selection over
Imbalanced Problems. Applied Soft Computing Journal, submitted (2008).
1.5.4.
Dise o de Experimentos en Inteligencia Computacional: Sobre el Uso de

n
la Inferencia Estad
stica
En esta parte (Seccin 2.4), se describe y justica la eleccin de la metodolog de anlisis y

o
o
a
a
comparacin de algoritmos seguida en las anteriores partes de la memoria de tesis. Hemos visto
o
que tanto en MDD, como en IC es necesaria una unicacin a la hora de realizar experimentos y
o
analizar resultados debido principalmente a la ausencia de capacidad de demostrar tericamente
o
la supremac de un mtodo sobre otro. Entonces, se puede apostar por hacer comparaciones esa
e
tad
sticas rigurosas de los resultados obtenidos. Cuando nuestro inters se focaliza en hacer anlisis
e
a
que nos proporcionen un resultado global a un conjunto o serie de casos del problema a tratar, las
tcnicas estad
e
sticas paramtricas convencionales, como puede ser el t-test o un anlisis de varianzas
e
a
ANOVA, pueden conrmar hiptesis errneas debido a que la muestra de resultados en la que se
o
o
aplican no cumple con las condiciones necesarias para que su uso sea correcto. Nos proponemos analizar estas propiedades en dos diferentes entornos relacionados con la IC y la MDD. El primero de
ellos es la aplicacin de AAEE en la optimizacin continua de funciones con parmetros reales y el
o
o
a
segundo es la aplicacin de AAEE al aprendizaje basado en reglas en problemas de clasicacin. En
o
o
ningn caso, las condiciones ptimas a aplicar se cumplen totalmente y debido a esto, nos planteau
o
mos el uso de tcnicas no paramtricas en entornos multi-problema, siguiendo las recomendaciones
e
e
para problemas de clasicacin hechas en [Dem06]. Demar describe las tcnicas ms avanzadas
o
s
e
a
cuando se involucra un mtodo control, esto es, una nueva propuesta. Las comparaciones del tipo
e
n n se introducen con un mtodo poco potente que ha hecho que los investigadores en aprendizaje
e
automtico no lo utilicen demasiado debido a que, para conseguir un buen equilibrio entre poder
a
1.6. Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones
15
del test estad

stico y algoritmos a comparar, las conguraciones de los experimentos requieren un
excesivo nmero de casos distintos del problema. Finalmente, nos proponemos aplicar las tcnicas
u
e
estad
sticas no paramtricas ms potentes para realizar comparaciones del tipo n n, que son las
e
a
basadas en hiptesis lgicamente relacionadas. Se incluye un estudio experimental basado en casos
o
o
prcticos, donde se especica la metodolog seguida, los resultados comparativos entre el uso de
a
a
ambos tipos de tcnicas estad
e
sticas en este tipo de entornos y un completo anlisis de los mismos.
a
Los resultados indican claramente la necesidad de utilizar la estad
stica no paramtrica en entornos
e
multi-problema as como la supremac de nuevos mtodos propuestos para hacer comparaciones
a
e
del tipo n n.
Las art
S. Garc D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for
a,
Analyzing the Evolutionary Algorithms Behaviour: A Case Study on the CEC2005 Special
Session on Real Parameter Optimization. Journal of Heuristics, doi: 10.1007/s10732-008-90804, in press (2008).
S. Garc F. Herrera, An Extension on Statistical Comparisons of Classiers over Multiple
a,
Data Sets for all Pairwise Comparisons. Journal of Machine Learning Research, in press
(2008)
S. Garc A. Fernndez, J. Luengo, F. Herrera, A Study of Statistical Techniques and Perfora,
a
mance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft
Computing, submitted (2008).
1.6.
Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones
Dedicamos esta seccin a resumir brevemente los resultados obtenidos y a destacar las concluo
siones que esta memoria aporta.
Hemos estudiado diferentes aspectos abiertos o nuevos retos en donde se pueden utilizar los
AAEE en RDD para llevar a cabo tareas de clasicacin basadas en vecinos cercanos y basadas en
o
seleccin de conjunto de entrenamiento. Se pretende llevar a cabo una reduccin del conjunto inicial
o
o
para aumentar las prestaciones de los clasicadores que se utilizan posteriormente en problemas
que presentan un mayor tamao o problemas que presentan un desequilibrio en la distribucin de
n
o
clases. Adems, tambin hemos estudiado, mediante las mtricas de complejidad de datos, cundo
a
e
e
a
nos es ms interesante y ecaz la aplicacin de AAEE para la RDD en problemas de clasicacin
a
o
o
convencionales.
El comportamiento de las tcnicas propuestas se ha comparado con los algoritmos ya propuestos
e
en la literatura especializada, considerando la tipolog concreta del problema a tratar. De esta
a
forma, se han considerando los mejores algoritmos de SPP y SEPP en el estudio de la escalabilidad
y los mejores y ms conocidos mtodos de bajo-muestreo y sobre-muestreo en el estudio de nuestra
a
e
propuesta en problemas de clasicacin no balanceados. Las comparaciones de los algoritmos se
o
han realizado conforme a la metodolog propuesta en esta memoria de tesis, que est basada en el
a
a
anlisis estad
a
stico mediante tests no paramtricos en entornos multi-problema. De forma anloga,
e
a
la aplicacin de estos tests se ha justicado mediante la presentacin de varios estudios emp
o
o
ricos
16
Cap
tulo 1. Memoria
correspondientes a distintas situaciones propias de la IC o MDD. Estos nos indican que es ms

a
apropiado utilizar las tcnicas no paramtricas que las paramtricas.
e
e
e
Los siguientes apartados resumen brevemente los resultados obtenidos y presentan algunas conclusiones sobre los mismos.
1.6.1.
Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un

e
o
Enfoque para la Escalabilidad
Para solucionar los problemas observados de falta de convergencia de los algoritmos de SEPP y
aumentar su eciencia cuando aumenta la escalabilidad del problema, hemos utilizado un AE basado
en un algoritmo memtico especialmente diseado para el problema de la SPP. Los resultados
e
n
obtenidos han sido prometedores debido a los siguientes factores:
Nuestra propuesta basada en algoritmos memticos presenta buenas tasas de reduccin de
e
o
datos y de eciencia con respecto a los dems algoritmos de SEPP.
a
Independientemente del tamao de los conjuntos de datos, nuestra propuesta mejora a las
n
propuestas no evolutivas. Al compararla con los algoritmos de SEPP, observamos un comportamiento muy similar en problemas de tamao pequeo. Sin embargo, cuando los problemas
n
n
aumentan de tamao, los niveles de precisin obtenidos por nuestra propuesta superan a los
n
o
de los dems algoritmos de SEPP, que disminuyen su capacidad de explotacin.
a
o
1.6.2.
Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos

o
o
usando una Medida de Solapamiento
Aunque se ha mostrado que la efectividad de los AAEE en problemas de SPP es excelente,

tambin sabemos que su aplicacin requiere un alto coste computacional. Podemos anticipar a su
e
o
aplicacin una fase de diagnstico que nos informe acerca de la efectividad esperada dependiendo
o
o
del problema de clasicacin al que van a ser aplicados. Para ello, hacemos un estudio de la relacin
o
o
de las medidas de complejidad de los datos en problemas de clasicacin, concretamente en medidas
o
de solapamiento, con respecto a la efectividad obtenida por los algoritmos de SEPP. Los resultados
obtenidos nos permiten realizar una clara distincin entre varios tipos de conjuntos de datos:
o
Utilizando la medida de solapamiento F1 [HB02], hemos obtenido un umbral a partir del
cual los algoritmos de SEPP, concretamente CHC, siempre mejora la capacidad predictiva
del clasicador basado en vecinos ms cercanos (k-NN, con k= 1, 3, 5) y combinaciones con
a
algoritmos clsicos de SPP y alcanza altas tasas de reduccin. Este umbral separa los datos
a
o
entre muy solapados y poco solapados.
En los datos poco solapados, hemos comprobado que, aunque la SEPP mantiene la capacidad
predictiva obtenida por k-NN, las tasas de reduccin siguen siendo muy altas, lo que nos
o
permite realizar una tarea de aprendizaje basada en instancias mucho ms rpida.
a a
1.6. Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones
1.6.3.
17
Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balano

ceados: Propuestas para Aprendizaje basado en Instancias y Seleccin
o
de Conjuntos de Entrenamiento
El bajo-muestreo evolutivo para problemas de clasicacin con clases no balanceadas lo hemos

o
explicado siguiendo dos perspectivas:
La obtencin de un clasicador basado en el vecino ms cercano mediante la seleccin de
o
a
o
muestras de entrenamiento, teniendo como objetivo mejorar su precisin en problemas no
o
balanceados.
Generar modelos predictivos basados en rboles de decisin y reglas a partir de la seleccin
a
o
o
de conjuntos de entrenamiento y analizando tanto la precisin obtenida en problemas con
o
clases no balanceadas como su interpretabilidad.
En ambos casos, los resultados presentados siguiendo ambas v han sido satisfactorios, debido
as
a las siguientes razones:
Los algoritmos clsicos de SPP para el clasicador de vecinos ms cercanos no funcionan
a
a
adecuadamente en problemas con clases no balanceadas.
Los algoritmos no evolutivos de bajo-muestreo funcionan adecuadamente cuando el grado de
desequilibrio entre clases no es demasiado alto. En este caso, la aplicacin del bajo-muestreo
o
evolutivo muestra resultados similares a los obtenidos por estos algoritmos, aunque la tasa de
reduccin obtenida es mucho mayor por la propuesta evolutiva.
o
Al aumentar el grado de desequilibrio entre clases y, por tanto, la complejidad del problema
en trminos de aprendizaje no balanceado, la propuesta evolutiva de bajo-muestreo supera al
e
resto de propuestas no evolutivas considerando el clasicador de vecinos cercanos.
Se han estudiado mltiples conguraciones del bajo-muestreo evolutivo, considerando el tipo
u
de seleccin, grado de balanceo exigido y tipo de funcin de evaluacin de soluciones. Se
o
o
o
recomienda cada una en funcin del grado de balanceo que presente el problema.
o
Al tratar con la seleccin de conjuntos de entrenamiento para obtener modelos basados en
o
rboles de decisin y reglas, nuestra propuesta de bajo-muestreo supera con creces a las
a
o
mejores tcnicas de bajo-muestro ya propuestas en trminos de precisin. En este caso, la
e
e
o
interpretabilidad de los modelos es similar.
Cuando comparamos en seleccin de conjuntos de entrenamiento nuestra propuesta con alo
goritmos avanzados de sobre-muestreo, obtenemos comportamientos similares considerando
la precisin. Sin embargo, los modelos obtenidos por los algoritmos de rboles o reglas tras
o
a
usar nuestra propuesta son menor en tamao que los obtenidos por los algoritmos de sobren
muestreo y, por tanto, ms interpretables.
a
1.6.4.
Dise o de Experimentos en Inteligencia Computacional: Sobre el Uso de

n
la Inferencia Estad
stica
18
Cap
tulo 1. Memoria
El anlisis de resultados en entornos que consideran casos de diferente

a
ndole puede realizarse
mediante tcnicas estad
e
sticas. Sin embargo, no todas las tcnicas son apropiadas en todos los
e
casos. Las tcnicas estad
e
sticas no paramtricas son ms apropiadas que las paramtricas debido a
e
a
e
los siguientes factores estudiados:
Recomendamos el uso de las tcnicas no paramtricas en entornos multi-problema debido al
e
e
hecho de que las condiciones necesarias para aplicar de forma segura las tcnicas paramtricas
e
e
no se satisfacen por normal general.
Mostramos cmo funcionan las tcnicas de comparaciones en pareja, como el test de Wilcoxon,
o
e
y las tcnicas de comparaciones mltiples, como los tests de Friedman, Iman-Davenport,
e
u
Bonferroni-Dunn, Holm, Hochberg y Hommel.
Distinguimos y explicamos las diferencias entre realizar comparaciones de algoritmos en pareja y combinarlas y realizar comparaciones mltiples de algoritmos. Tambin destacamos la
u
e
exibilidad que ofrecen estas tcnicas para analizar cualquier tipo de medida de rendimiento,
e
ya sea basada en precisin, tiempos, tamao de modelos, etc.
o
n
Proponemos tests ms avanzados para realizar comparaciones n n, el test de Schaer y el
a
test de Bergmann-Hommel. Mostramos que la potencia obtenida por estos tests es mucho
mayor que el procedimiento conocido en este tipo de comparaciones, es test de Nemenyi,
mediante un anlisis emp
a
rico basado en repeticiones de muchos experimentos.
1.7.
Perspectivas Futuras
A continuacin, se presentan algunas l

o
neas de trabajo futuras que se plantean a partir de los
mtodos propuestos en esta memoria.
e
Uso de Algoritmos Genticos Multiobjetivo con Doble Objetivo, Reduccin y Precie
o
sin, para Seleccin de Instancias
o
o
La funcin objetivo empleada en los AAEE combina de forma ponderada la precisin y reduccin
o
o
o
conseguida en cada solucin. Al combinar dichos factores al 50 % en un mismo valor, se pondera
o
del mismo modo a ambos objetivos. Esta circunstancia le supone la misma dicultad a reducir el
tamao del conjunto que a incrementar la precisin conseguida, cuando no es as Resulta mucho
n
o
.
ms sencillo el reducir el tamao que el incrementar la precisin, lo cual ocasiona que se tienda a
a
n
o
escoger conjuntos de menor tamao a costa de su precisin. Se puede considerar la bsqueda de
n
o
u
soluciones no dominadas considerando los objetivos de forma independiente, lo cual podr ofrecer
a
un abanico de soluciones que podr sernos util para distintos intereses.
a
Por lo tanto, se propone considerar dos objetivos, precisin y simplicidad, mediante un AAEE
o
multiobjetivo [CLV06]. En cualquier problema con mltiples objetivos, siempre hay un conjunto de
u
soluciones que son superiores a las dems en el espacio de bsqueda cuando se consideran todos los
a
u
objetivos. Dichas soluciones se conocen como soluciones no dominadas (conjunto Pareto). Ninguna
de las soluciones contenidas en el conjunto Pareto es absolutamente mejor que el resto de las no
dominadas. De esta manera, se podr obtener un conjunto de soluciones que abarcar desde los
a
a
modelos ms precisos hasta los ms simples, pasando por distintos niveles de equilibrio entre ambos
a
a
criterios.
1.7. Perspectivas Futuras
19
Uso de Algoritmos Genticos Distribuidos para Seleccin de Instancias

e
o
Como hemos visto, los AAEE requieren altas prestaciones en cuanto a cmputo. Este requerio
miento est relacionado con la imposibilidad de ser aplicados a partir de un determinado tamao
a
n
del problema. En SEPP, hemos visto que a partir de unos 20.000 ejemplos aproximadamente, los
AAEE son incapaces de tratar semejantes volmenes de datos.
u
Se puede plantear el problema desde un punto de vista distribuido, mediante el uso de tcnicas
e
evolutivas distribuidas [AK03]. El paralelismo del cmputo se puede buscar desde dos puntos de
o
vista: particionando los datos y cargndolos en cada nodo procesador, como se hace en [CHL05],
a
pero de una forma distribuida, e ideando un mecanismo efectivo y eciente de unin de los modeo
los individuales obtenidos; o construyendo algoritmos realmente distribuidos que se trabajen con
subpoblaciones de soluciones e intercambien soluciones entre ellos siguiendo los esquemas convencionales de distribucin de AAEE.
o
Generacin Evolutiva de Prototipos en Problemas de Clasicacin

o
o
La SPP mantiene la estructura de los ejemplos porque unicamente produce selecciones de sub
conjuntos de instancias pertenecientes al conjunto de entrenamiento. En la generacin de prototipos,
o
tambin se permite la modicacin de los valores de los datos originales, lo que permite alcanzar
e
o
un mejor ajuste de las clases a los l
mites de decisin mejorando las capacidades predictivas de un
o
algoritmo de vecinos cercanos.
Actualmente, existen diversas propuestas en la literatura para generacin de prototipos y han
o
demostrado tener un buen rendimiento. Tambin podemos encontrar aplicaciones de meta-heur
e
sticas a este problema, como AAEE [FI04]. Sin embargo, no se han usado an modelos de AAEE
u
avanzados que manejan representaciones de cromosomas con codicacin real, a excepcin de moo
o
delos bsicos de algoritmos basados en optimizacin de nubes de part
a
o
culas. La utilizacin de AAEE
o
de optimizacin con parmetros reales podr resultar en la obtencin de conjuntos de prototipos
o
a
a
o
o
ptimos para un problema de clasicacin concreto cuando lo tratamos con el clasicador basado
o
en vecinos cercanos.
Combinacin de Procedimientos de Seleccin/Generacin de Instancias, Seleco

o
o
cin/Extraccin de Caracter
o
o
sticas, Medidas de Distancia y Pesos para Construir Propuestas de Lazy Learning Evolutivo
Lazy Learning es la rama del aprendizaje automtico que incluye todo lo referente al razonaa
miento basado en casos [Aha97], en el que el modelo con el que se produce la clasicacin consta
o
unica y exclusivamente de ejemplos o casos de estudio de un problema y alguna informacin ex
o
tra relacionada con stos. Permite construir una regla de clasicacin a partir de un conjunto de
e
o
ejemplos y una medida de similitud entre ellos.
La optimizacin de un modelo de Lazy Learning mediante AAEE puede ser muy interesante
o
debido a la gran cantidad de parmetros que se pueden manejar y que pueden resultar en mejoa
ras interesantes del modelo. Hasta ahora, hemos visto como la seleccin evolutiva de instancias
o
y caracter
sticas ofrecen resultados excelentes en problemas de clasicacin. La optimizacin de
o
o
otros parmetros como la combinacin de caracter
a
o
sticas, pesos por caracter
sticas y ejemplos o
expresiones que nos denan medidas diferentes de distancias podr resultar en algoritmos de Lazy
a
Learning precisos y adaptables a cualquier problema.
20
Cap
tulo 1. Memoria
Anlisis e Hibridacin de Nuevas Tcnicas para Aprendizaje con Clases no Balanceadas

a
o
e
Combinando Bajo-Muestreo y Sobre-Muestreo
El bajo-muestro de ejemplos para aprendizaje no balanceado tiene la desventaja de que puede
permitir la eliminacin de ejemplos realmente inuyentes en el proceso de aprendizaje y perjudica en
o
mayor medida a las clases mayoritarias. Por otro lado, el sobre-muestreo no tiene este inconveniente,
pero necesita replicar o generar un gran nmero de nuevos ejemplos para conseguir una clasicacin
u
o
efectiva y esto incrementa el tiempo de cmputo de los clasicadores aplicados a posteriori.
o
Un esquema que puede solucionar ambos problemas consiste en la combinacin de tcnicas de
o
e
sobre-muestreo con tcnicas de bajo-muestreo. Si primero aplicamos sobre-muestreo, se generan los
e
ejemplos minoritarios necesarios para reforzar el concepto de la clase que representan, sin tener
que eliminar ejemplos de la clase mayoritaria. Justo despus, se puede proceder con una tcnica de
e
e
Bajo-Muestreo o de SII para eliminar ejemplos redundantes y nocivos para la tarea de aprendizaje.
Esta ultima fase puede ser incluso dependiente del tipo de clasicador a usar, e incidir ms en la
a
obtencin de tasas de reduccin altas o bajas dependiendo de la complejidad del clasicador en
o
o
cuestin.
o
Deteccin de Anomal mediante Algoritmos Evolutivos

o
as
Hablamos de deteccin de anomal cuando nos referimos a una tcnica de MDD no supervisada
o
as
e
que produce un modelo para identicar casos extraos que se desv de la norma en un conjunto de
n
an
datos. Los datos proporcionados para la construccin del modelo consisten de casos normales para
o
lo cuales, un algoritmo de deteccin de anomal puede aprender patrones normales. La aplicacin
o
as
o
del modelo a los datos con un esquema y atributos similares produce que cada caso tenga una
probabilidad asociada de ser un caso normal o anmalo.
o
Desde el punto de vista de aprendizaje no balanceado, la deteccin de anomal es su equivao
as
lente pero en aprendizaje no supervisado. En la literatura se han propuesto numerosas tcnicas de
e
agrupamiento y reglas de asociacin evolutivas que tratan con conjuntos de datos no supervisados
o
y ofrecen buenos resultados. La aplicacin de algunos mecanismos y procedimientos usados en las
o
propuestas descritas en esta memoria podr ser de gran utilidad a dichas tcnicas para focalizarlas
an
e
al problema de la deteccin de anomal
o
as.
Determinacin de la Ecacia en Clasicacin a partir de las Mtricas de Complejidad:

o
o
e
Meta-Aprendizaje
Mediante el clculo de los valores de complejidad asociados a un conjunto de problemas de claa
sicacin, podemos obtener informacin atribuida a cada conjunto de datos que puede relacionarse
o
o
con una prediccin del rendimiento asociado a uno o varios tipos de clasicadores. Esto nos puede
o
indicar o hacer ver una tendencia de cundo es ms util utilizar un tipo de clasicador u otro
a
a
dependiendo del problema y las mtricas calculadas sobre el mismo.
e
Pero si disponemos de multitud de conjuntos de datos, ya sean generados de forma articial
u obtenidos desde datos reales, podemos automatizar esta tarea de identicacin de patrones o
o
tendencias sobre la complejidad de los datos por medio de la propia aplicacin de algoritmos de
o
MDD, lo que nos lleva a un meta-aprendizaje. Por ejemplo, mediante tcnicas de agrupamiento,
e
podr
amos identicar diferentes nichos o subconjuntos de datos que son muy diferentes entre s para
recomendar grupos de datos para hacer pruebas; o mediante tcnicas de reglas de asociacin, se
e
o
podr analizar las relaciones existentes entre medidas o incluso proponer otras nuevas.
a
1.7. Perspectivas Futuras
21
Estudio de la Complejidad de los Datos no Balanceados para el Diagnstico util de la

o
Ecacia de las tcnicas de Bajo-Muestreo y Sobre-Muestreo

e
Al igual que podemos medir la complejidad de problemas convencionales de clasicacin, podeo
mos hacer lo mismo con problemas de clasicacin no balanceados. Sabemos de antemano que el
o
grado de desequilibrio en la distribucin de clases inuye en la complejidad del problema, pero tamo
bin podemos averiguar el grado en que tambin inuye otros factores tales como el solapamiento,
e
e
la densidad de los datos o la topolog que siguen.
a
Tambin es posible hacer un diagnstico de la ecacia de distintos mtodos de bajo-muestreo
e
o
e
o sobre-muestreo aplicados a problemas de clasicacin no balanceados y los resultados pueden
o
ayudarnos a predecir el comportamiento de los mismos antes de ser ejecutados o ayudarnos a
tomar la decisin de elegir el mtodo a aplicar ante un determinado conjunto de datos.
o
e
Aplicacin de Nuevas Tcnicas Estad

o
e
sticas en otros Ambitos de Comparacin
o
El anlisis comparativo de algoritmos en determinados entornos an no es posible o es dif de
a
u
cil
imaginar. Por ejemplo, existen muestras de resultados obtenidos por medidas de carcter binario,
a
que involucran la cobertura de un algoritmo sobre otro. Por otro lado, tambin existe la posibilidad
e
de comparar los algoritmos considerando a la vez varias medidas de rendimiento, requiriendo un
anlisis estad
a
stico multi-objetivo. Otro ejemplo podr ser la consideracin de anlisis estad
a
o
a
sticos
no balanceados, en donde nos faltan resultados de algunos algoritmos debido a la imposibilidad de
ser ejecutados en algunos problemas.
Estos problemas abiertos en el anlisis estad
a
stico en IC o en MDD pueden considerarse utilizando y estudiando otras tcnicas estad
e
sticas no demasiado conocidas en nuestra comunidad
cient
ca, adems de la toma en consideracin de tcnicas estad
a
o
e
sticas computacionales an ms
u
a
potentes [Goo05].
Cap
tulo 2
Publicaciones: Trabajos Publicados,

Aceptados y Sometidos
2.1.
Un Algoritmo Memtico para la Seleccin Evolutiva de Proe

o
totipos: Un Enfoque para la Escalabilidad - A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up
Approach
Las publicaciones en revista asociadas a esta parte son:

S. Garc J.R. Cano, F. Herrera, A Memetic Algorithm for Evolutionary Prototya,
pe Selection: A Scaling Up Approach. Pattern Recognition 41:8 (2008) 2693-2709,
doi:10.1016/j.patcog.2008.02.006.
Estado: Publicado

Indice de Impacto (JCR 2007): 2,019.
Area de Conocimiento: Computer Science, Articial Intelligence. Ranking 17 / 93.
Area de Conocimiento: Engineering, Electrical and Electronic. Ranking 25 / 227.
23
Pattern Recognition 41 (2008) 2693 2709

www.elsevier.com/locate/pr
A memetic algorithm for evolutionary prototype selection:

A scaling up approach
Salvador Garca a, , Jos Ramn Cano b , Francisco Herrera a
a Department of Computer Science and Articial Intelligence, University of Granada, 18071 Granada, Spain
b Department of Computer Science, University of Jan, 23700 Linares, Jan, Spain
Received 8 September 2007; received in revised form 18 January 2008; accepted 14 February 2008
Abstract
Prototype selection problem consists of reducing the size of databases by removing samples that are considered noisy or not inuential on
nearest neighbour classication tasks. Evolutionary algorithms have been used recently for prototype selection showing good results. However,
due to the complexity of this problem when the size of the databases increases, the behaviour of evolutionary algorithms could deteriorate
considerably because of a lack of convergence. This additional problem is known as the scaling up problem.
Memetic algorithms are approaches for heuristic searches in optimization problems that combine a population-based algorithm with a local
search. In this paper, we propose a model of memetic algorithm that incorporates an ad hoc local search specically designed for optimizing the
properties of prototype selection problem with the aim of tackling the scaling up problem. In order to check its performance, we have carried
out an empirical study including a comparison between our proposal and previous evolutionary and non-evolutionary approaches studied in
the literature.
The results have been contrasted with the use of non-parametric statistical procedures and show that our approach outperforms previously
studied methods, especially when the database scales up.
2008 Elsevier Ltd. All rights reserved.
Keywords: Data reduction; Evolutionary algorithms; Memetic algorithms; Prototype selection; Scaling up; Nearest neighbour rule; Data mining
1. Introduction
Considering supervised classication problems, we usually
have a training set of samples in which each example is labelled
according to a given class. Inside the family of supervised classiers, we can nd the nearest neighbour (NN) rule method
[1,2] that predicts the class of a new prototype by computing a
similarity [3,4] measure between it and all prototypes from the
training set, called the k-nearest neighbours (k-NN) classier.
Recent studies show that k-NN classier could be improved by
employing numerous procedures. Among them, we could cite
proposals on instance reduction [5,6], for incorporating weights
This work was supported by Projects TIN2005-08386-C05-01 and
TIN2005-08386-C05-03. S. Garca holds a FPU scholarship from Spanish
Ministry of Education and Science.
Corresponding author. Tel.: +34 958 240598; fax: +34 958 243317.
E-mail addresses: salvagl@decsai.ugr.es (S. Garca), jrcano@ujaen.es
(J.R. Cano), herrera@decsai.ugr.es (F. Herrera).
0031-3203/$30.00 2008 Elsevier Ltd. All rights reserved.

doi:10.1016/j.patcog.2008.02.006
for improving classication [7], and for accelerating classication task [8], etc.
Prototype selection (PS) is an instance reduction process consisting of maintaining those instances that are more relevant
in the classication task of the k-NN algorithm and removing the redundant ones. This attempts to reduce the number
of rows in data set with no loss of classication accuracy and
obtain an improvement in the classier. Various approaches of
PS algorithms were proposed in the literature, see Refs. [6,9]
for review. Another process used for reducing the number of
instances in training data is the prototype generation, which
consists of building new examples by combining or computing
several metrics among original data and including them into
the subset of training data [10].
Evolutionary algorithms (EAs) have been successfully used
in different data mining problems (see Refs. [1113]). Given
that PS problem could be seen as a combinatorial problem, EAs
[14] have been used to solve it with promising results [15],
which we have termed evolutionary prototype selection (EPS).
2694
S. Garca et al. / Pattern Recognition 41 (2008) 2693 2709
The increase of the databases size is a staple problem in PS

(which is known as scaling up problem). This problem produces
excessive storage requirement, increases time complexity and
affects generalization accuracy. These drawbacks are presented
in EPS because they result in an increment in chromosome size
and time execution and also involve a decrease in convergence
capabilities of the EA. Traditional EPS approaches generally
suffer from excessively slow convergence between solutions
because of their failure to exploit local information. This often
limits the practicality of EAs on many large-scale problems
where computational time is a crucial consideration. A rst
rapprochement about the use of EAs when this problem scales
up can be found in Ref. [16].
The combination of EAs with local search (LS) was named
memetic algorithm (MA) in Ref. [17]. Formally, a MA is
dened as an EA that includes one or more LS phases within
its evolutionary cycle [18]. The choice of name is inspired by
concept of a meme, which represents a unit of cultural evolution that can show local renement [19]. MAs have been
shown to be more efcient (i.e., needing fewer evaluations
to nd optima) and more effective (identifying higher quality solutions) than traditional EAs for some problem domains.
In the literature, we can nd a lot of applications of MAs
for different problems; see Ref. [20] for an understanding of
MA issues and examples of MAs applied to different domain
problems.
The aim of this paper is to present a proposal of MA for
EPS for dealing with the scaling up problem. The process of
designing effective and efcient MAs currently remains fairly
ad hoc. It is frequently hidden behind problem-specic details.
In our case, the meme used is ad hoc designed for the PS
problem, taking advantage of its divisible nature and simplicity
of hybridization within the EA itself, and allowing us good
convergence with increase of the problem size. We will compare
it with other EPS and non-EPS algorithms already studied in the
literature, paying special attention to the scaling up problem,
analysing its behaviour when we increase the problem size.
This paper is organized in the following manner. Section 2
presents the PS problem formally and enumerates some PS
methods. A review of EPS is given in Section 3. In Section 4
we explain our MA approach and meme procedure. Details
of empirical experiments and results obtained are reported in
Section 5. Section 6 contains a brief summary of the work and
the conclusions reached.
2. Preliminaries: PS
PS methods are instance selection methods [5] which expect
to nd training sets offering best classication accuracy by
using the nearest neighbour rule (1-NN).
A formal specication of the problem is the following: Let
an example where = (x , x , . . . , x , x ), with
xp
xp
xp
p1 p2
pm pl
belonging to a class c given by xpl and a m-dimensional space
in which xpi is the value of the ith feature of the pth sample.
Then, let us assume that there is a training set TR which consists
of n instances and a test set TS composed by t instances

xp
. Let S T R be the subset of selected samples resulted for
xp
the execution of a PS algorithm, then we classify a new pattern

from TS by the 1-NN rule acting over S.
Wilson and Martinez in Ref. [6] suggest that the determination of the k value in the k-NN classier may depend according
to the proposal of the PS algorithm. In k-NN, setting k greater
than 1, decreases the sensitivity of the algorithm to noise and
tends to smooth the decision boundaries. In some PS algorithms, a value k > 1 may be convenient, when its interest lies
in protecting the classication task of noisy instances. In any
case, Wilson and Martinez state that it may be appropriate to
nd a value of k to use during the reduction process, and then
redetermine the best value of k in the classication task. In EPS
we have used the value k = 1, given that EAs need to have the
greatest possible sensitivity to noise during the reduction process. In this manner, an EPS algorithm could better detect the
noisy instances and the redundant ones in order to nd a good
subset of instances perfectly adapted to the simplest method of
NNs. By considering only an instance during the evolutionary
process, the reductionaccuracy trade-off is more balanced and
the efciency is improved. The implication of this fact is the use
of k = 1 in the classication, as Wilson and Martinez point out.
In the next subsection, we will describe the algorithms used
in this study but not the EAs (which will be described in
Section 3).
2.1. PS methods
Algorithms for PS may be classied according to the heuristic
followed in the selection. We have selected the most representative and well-known methods belonging to the non-evolutionary
family and the algorithms that offer the best performance for
the PS problem.
Enn [21]. Edited NN edits out noisy instances, as well as
close border cases, leaving smoother decision boundaries.
It also retains internal points. It works by editing out those
instances in which class does not agree with the majority of
classes of its k NNs.
Allknn [22]. Allknn is an extension of Enn. The algorithm,
for i = 1 to k ags as bad any instance not correctly classied
by its i NNs. When the loop is completed k times, it removes
the instances agged as bad.
Pop [23]. This algorithm consists of eliminating the samples
that are not within the limits of the decision boundaries. This
means that its behaviour is in opposite direction from that of
Enn and Allknn.
Rnn [24]. The reduced NN rule searches a minimal and consistent subset which correctly classies all the learning instances.
x
Drop3 [6]. An associate of is that sample i which has
xp
as NN. This method removes if at least as many of
xp
xp
its associates in TR would be classied correctly without .

xp
Prior to this process, it applies a noise reduction lter (Enn).
Ib3 [25]. It introduces the acceptable concept, based on the
statistical condence of inserting a certain instance in the
subset, to carry out the selection.
Cpruner [26]. C-Pruner is a sophisticated algorithm constructed by extending concepts and procedures taken from
algorithms Icf [27] and Drop3.
Explore [28]. Cameron-Jones used an encoding length
heuristic to determine how good the subset S is in describing
TR. Explore is the most complete method belonging to this
group and it includes three tasks:
It starts from the empty set S and adds instances if only
the cost function is minimized.
After this, it tries to remove instances if this helps to minimize the cost function.
Additionally, it performs 1000 mutations to try to improve
the classication accuracy.
Rmhc [29]. First, it randomly selects a subset S from TR
which contains a xed number of instances s (s =%|T R|). In
each iteration, the algorithm interchanges an instance from
S with another from T R S. The change is maintained if it
offers better accuracy.
Rng [30]. It builds a graph associated with TR in which a
relation of neighbourhood among instances is reected. Misclassied instances by using this graph are discarded following a specic criterion.
2.2. The scaling up problem
Any algorithm is affected when the size of the problem which
it is applied increases. This is the scaling up problem, characterized for producing:
Excessive storage requirements.
Increment of time complexity.
Decrement of generalization capacity, introducing noise and
over-tting.
A way of avoiding the drawbacks of this problem was proposed
in Ref. [16], where a stratied strategy divides the initial data
set into disjoint strata with equal class distribution. The number
of strata chosen will determine their size, depending on the
size of the data set. Using the proper number of strata we can
signicantly reduce the training set and we could avoid the
drawbacks mentioned above.
Following the stratied strategy, initial data set D is divided
into t disjoint sets Dj , strata of equal size, D1 , D2 , . . . , Dt
maintaining class distribution within each subset. Then, PS algorithms will be applied to each Dj obtaining a selected subset
DSj . Stratied prototype subset selected (SPSS) is dened as
SPSS =
DSj ,
J {1, 2, . . . , t}.
(1)
j J
3. EPS: a review
In this section, we will review the main contributions that
have included or proposed an EPS model.
The rst appearance of application of an EA to PS problem
can be found in Ref. [31]. Kuncheva applied a genetic algorithm
(GA) to select a reference set for the k-NN rule. Her GA maps
the TR set onto a chromosome structure composed by genes,
2695
each one with two possible states (binary representation). The

computed tness function measures the error rate by application
of the k-NN rule. This GA was improved in Refs. [32,33].
At this point, all EPS algorithms reviewed above correspond
to adapting a classical GA model to PS problem. Later, a development of more conditioned EPS algorithms to the problem
is made. The rst example of this can be found in Ref. [34].
In this paper, an estimation of distribution algorithm (EDA) is
used.
A GA design for obtaining an optimal NN classier is proposed in Ref. [35]. Ho et al. propose an intelligent genetic algorithm (IGA) based on orthogonal experimental design used
for PS and feature selection. IGA is a GGA that incorporates
an intelligent crossover (IC) operator. IC builds an orthogonal
array (OA) (see Ref. [35]) from two parents of chromosomes
and searches within the OA for the two best individuals according to the tness function. It takes about 2 log2 ( +1) tness
evaluations to perform an IC operation, where is the number
of bits that differ between both parents. Note that only an application of IC on large-size chromosomes (resulting chromosomes from large-size data sets) could consume a high number
of evaluations.
The technical term EPS has been adopted by Cano et al. in
Ref. [15], in which they analyse the behaviour of different EAs,
steady-state GAs (SSGAs), GGAs, the CHC model [36] and
PBIL [37] (which can be considered as one of the basic EDAs).
The representation of solutions as chromosomes follows the
guidelines in Ref. [31], but tness function used in these models
combines two values: classication rate (clas_rat) by using 1NN classier and percentage reduction of prototypes of S with
regards to TR (perc_red):
Fitness(S) = clas_rat + (1 ) perc_red.
(2)
Finally, as a multi-objective approach, we can nd an EPS

algorithm in Ref. [38].
In our empirical study, the four models developed in Refs.
[15,36,37] together with the IGA proposal [35] will be used to
compare with the MA-EPS proposal. In order to prepare IGA
to be applied only as PS method, we ignore feature selection
functionality. We must point out that the GGA described in Ref.
[15] is really an improved model of Kunchevas and Ishibuchis
GGAs.
4. A MA for EPS
In this section, we introduce our proposal of MA for EPS.
It is a steady-state MA (SSMA) that makes use of a LS or
meme specically developed for this purpose. In Section 4.1
we introduce the foundations of the SSMAs. In Section 4.2
we explain the details of the proposed algorithm. Finally, in
Section 4.3 we clarify the application of the ad hoc meme in
the algorithm.
4.1. Steady-state MAs
In SSGA usually one or two offspring are produced in each
generation. Parents are selected to produce offspring and then
2696
1. Select two parents from the population.

2. Create one/two offspring using crossover and mutation.
3. Evaluate the offspring with the tness function.
4. Select one/two individuals in the population, which may
be replaced by the offspring.
5. Decide if this/these individuals will be replaced.
Fitness function: Let S be a subset of instances of TR to evaluate and be coded by a chromosome. We dene a tness
function considering the number of instances correctly classied using the 1-NN classier and the percentage of reduction achieved with regard to the original size of the training
data. The evaluation of S is carried out by considering all the
training set TR. For each object y in S, the NN is searched
for among those in the set S\{y}
Fitness(S) = clas_rat + (1 ) perc_red.
Fig. 1. Pseudocode algorithm for the SSGA model.
a decision is made as to which individuals in the population to

select for deletion in order to make room for the new offspring.
The basic algorithm steps of SSGA are shown at Fig. 1.
In step 4, one can choose the replacement strategy
(e.g., replacement of the worst, the oldest, or a randomly chosen
individual), and step 5, the replacement condition (e.g., replacement if the new individual is better, or unconditional
replacement). A widely used combination is to replace the
worst individual only if the new individual is better. We will
call this strategy the standard replacement strategy. In Ref.
[39] it was suggested that the deletion of the worst individuals
induced a high selective pressure, even when the parents were
selected randomly.
Although SSGAs are less common than GGAs, different authors [40,41] recommend the use of SSGAs for the design of
MAs because they allow the results of LS to be kept in the
population from one generation to the next.
SSMAs integrate global and local searches more tightly than
generational MAs. This interweaving of the global and local
search phases allows the two to inuence each other, e.g., the
SSGA chooses good starting points, and LS provides an accurate representation of that region of the domain. In contrast,
generational MAs proceed in alternating stages of global and
local searches. First, the GGA produces a new population, then
the meme procedure is performed. The specic state of meme
is generally not kept from one generation to the next, though
meme results do inuence the selection of individuals.
4.2. SSMA model for PS problem
The main characteristics of our proposed MA are:
Population initialization: The rst step that the algorithm
makes consists of the initialization of the population, which
is carried out by generating a population of random chromosomes.
Representation: Let us assume a data set denoted TR with n
instances. The search space associated with the instance selection of TR is constituted by all the subsets of TR. Therefore, the chromosomes should represent subsets of TR. This
is achieved by using a binary representation. A chromosome
consists of n genes (one for each instance in TR) with two
possible states: 0 and 1. If the gene is 1, then its associated
instance is included in the subset of TR represented by the
chromosome. If it is 0, then this does not occur.
The objective of the MA is to maximize the tness function

as dened. Note that the tness function is the same as that
used by EPS models previously proposed.
Parent selection mechanism: In order to select two parents
for applying the evolutionary operators, binary tournament
selection is employed.
Genetic operators: We use a crossover operator that randomly
replaces half of rst parents bits with second parents bits
and vice versa. The mutation operator involves a probability
that an arbitrary bit in a genetic sequence will be changed to
its other state.
Replacement strategy: This will use a standard replacement
strategy, which was dened above.
Mechanism of LS application: It is necessary to control the
operation of the LS over the total visited solutions. This is
because the additional function evaluations required for total
search can be very expensive and the MA in question could
become a multi-restart LS and not take advantage of the
qualities of the EAs. In order to do this, we have included
in the algorithm the Adaptive PLS Mechanism, which is an
adaptive tness-based method that is very simple but it offers
good results in Ref. [41]. Indeed, this scheme assigns a LS
probability value to each chromosome generated by crossover
and mutation, cnew :
PLS =
1
if f (cnew ) is better than f (Cworst ),
(3)
0.0625 otherwise,
where f is the tness function and Cworst is the current worst

element in the population. As was observed by Hart [42],
applying LS to as little of 5% of each population results in
faster convergence to good solutions.
Termination condition: The MA continues carrying out iterations until a specied number of evaluations is reached.
Fig. 2 shows the SSMA pseudocode. After the initialization
of the population is done, each generation of the algorithm is
composed by a selection of two parents (step 3), together with
the application of the genetic operators: crossover to create two
offspring (step 4) and mutation applied to each one (with the
corresponding probability associated per bit at step 5). At this
point, the two individuals generated are evaluated, followed by
computation of the value of PLS mechanism for each offspring.
A LS optimization is performed here only if the mechanism
decides so. The computation of the value of PLS is done in
step 8 and in the next step, the result of adaptive mechanism
is determined with an uniform distribution u(0, 1). After step
1. Initialize population.
2. While (not termination-condition) do
3.Use Binary Tournament to select two parents
4.Apply crossover operator to create offspring
(Off1, Off2)
5. Apply mutation to Off1 and Off2
6. Evaluate Off1 and Off2
7.Foreach Offi
8. InvokeAdaptive-PLS-mechanism to obtain PLSi
for Offi
9. If u(0,1) < PLSi then
10. Perform meme optimization for Offi
11. End if
2697
a partial evaluation counts as

PE =
Nnu
,
n
(4)
where Nnu is the number of neighbours updated when a

determined instance is removed by meme procedure and n =
|T R| is the size of the original set of instances (also the size
of the chromosome). Partial evaluations always take place
inside the local optimization procedure.
The SSMA computes total evaluations; when we consider a
partial evaluation we add to the counter evaluation variable the
respective partial value PE (expression (4)). Therefore, a certain
number of partial evaluations (depending on the PE values) will
be considered as a total evaluation.
12. End for

13. Employ standard replacement for Off1 and Off2
14. End while
15. Return the best chromosome
Fig. 2. Pseudocode algorithm for the proposed MA.
number 12, the replacement strategy can be carried out. The

algorithm returns the best chromosome of the population once
the stop criterion is satised.
4.3. Ad hoc meme procedure
In this subsection, we explain the procedure of optimization
via LS performed within the evolutionary cycle described in
Fig. 2 (step 10) according to the following structure: rstly
we present the evaluation mechanism with total and partial
evaluations; secondly, we present the pseudocode describing
the whole procedure; thirdly, we will explain the two strategies
used associated to tness improvement, improving accuracy
stage and avoiding premature convergence stage with loss of
the local objective; and nally, an illustrative example is shown.
4.3.1. Evaluation mechanisms
During the operation of the SSMA, a xed number of evaluations must take place in order to determine the quality of
the chromosomes. We can distinguish between total evaluation
and partial evaluation.
Total evaluation consists of a standard evaluation of performance of a chromosome in EPS, computing the NN of
each instance belonging to the selected subset and counts the
correctly classied instances. Total evaluations always take
place outside the optimization procedure, that is, within the
evolutionary cycle.
Partial evaluation can be carried out on a neighbour solution
of a current instance that has already been evaluated and differs only in one bit position changed from value 1 to 0. If
a total evaluation counts as one evaluation in terms of taking account of number of evaluations for the stop condition,
4.3.2. Description of the optimization procedure

The meme optimization procedure used (step 10 in Fig. 2)
in this method is an iterative process that aims to improve individuals of a population by reducing the number of prototypes
selected and by enhancing classication accuracy.
The pseudocode described in Fig. 3 corresponds to the procedure in question. It can be described as follows: To achieve
the double objective (to improve the number of classied patterns while reducing the subset size) the procedure considers
neighbourhood solutions with m 1 instances selected, where
m is equal to the number of instances selected in a current chromosome (positions with value 1 in the chromosome). In other
words, a neighbour is obtained by changing 1 to 0 in a gene.
In this way, the number of samples represented in a chromosome after optimization has been carried out will always be
less than or equal to the initial number of samples selected in
the chromosome. The algorithm in Fig. 3 receives as inputs the
chromosome to optimize and its list of associated neighbours,
called U (lines 13). The R list will include the identiers of the
instances that have already been removed without having obtained a sufcient gain according to the threshold value. The U
list contains the identiers of the instances considered the NNs
of each instance. U has room for n identiers of instances. It
links the instance identied by a number stored in the ith cell as
the NN of the instance identied by the number i. In this way,
the search of the NN of each instance is only needed when the
instance is removed from the subset selected. Note that the U
list could be upgraded in order to contain more of one neighbour per instance; the case explained here is the easiest.
A partial evaluation can take advantage of U and of the divisible nature of the PS problem when instances are removed.
Next, we explain the concepts necessary to understand the
procedure. Two functions are very useful in this procedure:
Nearest_NeighbourS (): This function returns the NN of an
instance considering only those instances selected by the
chromosome S. It requires as input an integer that will be the
identier of an instance within the training set TR, which is
composed of n instances. The output will also be an integer
that identies the nearest instance of the input belonging to
the S subset (or selected instances in the chromosome).
2698
classication accuracy, since the reduction prot is implicit by

the fact that the choice is always a removal.
Looking at Fig. 3, the variable called gain maintains an account of LS contributions carried out when a move on the
meme procedure is performed. It may be negative or positive,
depending on whether or not the NN of an instance changes the
class label of the previous NN. A negative contribution, which
subtracts 1 from the local objective, is caused when the new
neighbour misclassies a correctly classied instance. A positive contribution, which adds 1 to the local objective, is caused
when the new neighbour classies correctly a badly classied
instance. A null contribution is caused when an instance maintains the same state of classication, which remains misclassied or correctly classied. This process is carried out over
all instances whose NN has changed by using the identiers of
NNs stored in structure U, appropriately updated after a move
of meme procedure.
The acceptance of a choice of removal depends upon the gain
in accuracy that the algorithm is looking for, which is dened
according to the current LS stage.
4.3.3. LS stages
Two stages can be distinguished within the optimization procedure. Each stage has a different objective and its application depends on the progress of the actual search process: the
rst one is an exclusive improvement of tness and the second
stage is a strategy for dealing with the problem of premature
convergence.
Fig. 3. Pseudocode of meme optimization.
cls(): This function returns the class that the instance

belongs to.
From step 4 to 14 in Fig. 3, the procedure tries to nd instances
for removing from the subset selected (which are randomly chosen). Once a choice of removal is done, the procedure attempts
to compute the gain of this choice, making a backup copy of
the U structure at step 8 (the choice may not be good). In order to do this, it locates the instances which have the choice of
removal as NN and updates the new NN (step 10) by using the
remainder of instances belonging to the subset. Meanwhile, it
simultaneously computes the gain, checking if the new neighbours produce a success or a fail in the classication of the instance (steps from 11 to 16). This gain, computed in a relative
and efcient way, allows the algorithm to decide if the choice
of removal is maintained or discarded. The gain only refers to
Improving accuracy stage: This starts from the initial assignment (a recently generated offspring) and iteratively tries to
improve the current assignment by local changes. If, in the
neighbourhood of the current assignment, a better assignment is found, it replaces the current assignment and it continues from the new one. The selection of a neighbour is randomly made without repetition from among all the solutions
that belong to the neighbourhood. In order to consider an
assignment as better than the current one the classication
accuracy must be greater than or equal to the previous one,
but in this last case, the number of selected instances will be
lower than previously, so the tness function value is always
increased.
Avoiding premature convergence stage: When the search process has advanced, a tendency of the population to premature
convergence toward a certain area of the search space takes
place. A local optimization promotes this behaviour when
it considers solutions with better classication accuracy. In
order to prevent this, the meme optimization procedure proposed will accept worse solutions in the neighbourhood, in
terms of accuracy of classication. Here, the tness function
value cannot be increased; it may be decreased or maintained.
The parameter threshold is used in order to determine the way
the algorithm operates depending on the current stage. When
threshold has a value greater or equal to 0, then the stage
in progress is the improving accuracy stage because a new
Class
2699
Instances
{1, 2, 3, 4, 5, 6, 7}
{8, 9, 10, 11, 12, 13}
Representation
0110110100010
Neighbour Solution
Current Solution
010 0110100010
U structure
{3, 5, 8, 8, 3, 2, 6, 2, 8, 8, 3, 2, 3}
{12, 5, 8, 8, 2, 2, 6, 2, 8, 8, 8, 2, 8}
Gain
{1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0}
Correct classied patterns
{-1,,,,0,,,,,,+1,,+1}
7 1+1+1
Nnu
Partial evaluation account: PE = n = 4
13
Number of correct classied patterns: 8
Fig. 4. Example of a move in meme procedure and a partial evaluation.
generated chromosome via meme optimization is accepted

when its tness contribution is not negative (gain 0). Note
that gain = 0 implies that the new chromosome is as good as
the original considering accuracy objective, but it will have
fewer instances (improving reduction objective). On the other
hand, if threshold value is less than 0, then the stage in progress
is the avoiding premature convergence stage because the new
chromosome is accepted when its tness contribution is not
positive. This process is described in Fig. 3 in step 18.
Initially, the algorithm starts with a threshold value equal to
0. The parameter is affected by three possible conditions:
After a certain number of generations, the classication accuracy of the best chromosome belonging to the population
has not been improved: threshold = threshold + 1.
After a certain number of generations, the reduction of the
subset selected with respect to the original set of the best
chromosome belonging to the population has not been improved: threshold = threshold 1.
After a certain number of generations, neither accuracy nor
reduction objectives have been improved: threshold value
does not change.
The exact number of generations was tested by an empirical
way, and at the end we checked that a value of 10 generations
worked appropriately.
Once the change is accepted due to the fact that the gain
equals or exceeds the current threshold, in step 19 of Fig. 3,
the new tness value for the optimized solution is computed as
tnessgain =
1
gain
100 + 100
n
n
2.
That is, the gain obtained is changed in terms of percentage of

classication accuracy and this value is added to the percentage of reduction prot, which is always 1/n given that a LS
movement is always a removal of an instance.
4.3.4. Example of ad hoc meme
An example is illustrated at Fig. 4, where a chromosome of 13
instances is considered. Meme procedure removes the instance
Table 1
Small data sets characteristics
Name
N. instances
N. features
N. classes
Bupa
Cleveland
Glass
Iris
Led7Digit
Lymphography
Monks
Pima
Wine
Wisconsin
345
297
294
150
500
148
432
768
178
683
7
13
9
4
7
18
6
8
13
9
2
5
7
3
10
4
2
2
3
2
number 3. Once removed, the instance number 3 cannot appear

in the U structure as NN of another instance. U must be updated
at this moment obtaining the new NNs for the instances that
had instance number 3 as NN. Then a relative objective with
respect to the original chromosome tness is calculated (the
gain of instances 1, 5, 11 and 13). The result obtained is a
new chromosome with a higher number of correctly classied
patterns (8 instead of 7) that, as well, takes up only 4 of the
13 evaluations, a number which will count towards the total of
evaluations carried out.
5. Experimental study
This section presents the framework used in the experimental study carried out together with results. To scale appropriately the problem we have used three sizes of problems: small,
medium and large. We intend to study the behaviour of the algorithms when the size of the problem increases. When considering large data sets, stratication process [16] is used obtaining strata of medium size. The small-size data sets are summarized in Table 1 and medium and large data sets can be seen in
Table 2. The data sets have been taken from the UCI Machine
Learning Database Repository [43]. Ringnorm data set comes
from DELVE project.1
1 URL: http://www.cs.toronto.edu/delve/.
2700
Table 2
Medium and large data sets characteristics
Table 3
Parameters used in PS algorithms
Name
N. instances
N. features.
N. classes
Algorithm
Parameters
Nursery
Page-Blocks
Pen-Based
Ringnorm
Satimage
Spambase
Splice
Thyroid
12 960
5476
10 992
7400
6435
4597
3190
7200
8
10
16
20
36
57
60
21
5
5
10
2
7
2
3
3
CHC
IGA
Adult (large)
45 222
14
Pop = 50, Eval = 10 000, = 0.5

Pop = 10, Eval = 10 000
pm = 0.01, = 0.5
Pm = 0.001, Pc = 0.6, P op = 50
Eval = 10 000, = 0.5
LR = 0.1, Mutshif t = 0.05, pm = 0.02, Pop = 50
NegativeLR = 0.075, Eval = 10 000
Pm = 0.001, Pc = 1, P op = 50
Eval = 10 000, = 0.5
Acept level = 0.9, Drop level = 0.7
S = 90%, Eval = 10 000
Order = 1
Pop = 30, Eval = 10 000, pm = 0.001, pc = 1
= 0.5
GGA
PBIL
SSGA
We will distinguish two models of partitions used in this

work:
Tenfold cross-validation classic (Tfcv classic): where TRi ,
i = 1, . . . , 10 is a 90% of D and T S i is its complementary 10% of D. It is obtained as the following equations
indicate:
TRi =
Dj ,
(5)
j J
J = {j/1 j
(i 1) and (i + 1) j
10},
TSi = D\T R i .
(6)
Tenfold cross-validation strat (Tfcv strat): where SPSSi is

generated using the DSj instead of Dj (see Section 2.2).
SPSSi =
DSj ,
j J
J = {j/1 j
b (i 1) and (i b) + 1 j
t}.
(7)
The data sets considered are partitioned using the tfcv classic
(see expressions (5) and (6)) except for the adult data set, which
is partitioned using the tfcv strat procedure with t = 10 and
b = 1. (see expression (7)). Deterministic algorithms have been
run once over these partitions, whereas probabilistic algorithms
(including SSMA) run 3 trials over each partition and we show
the average results over these trials.
Whether small, medium or large data sets are evaluated, the
parameters used are the same, as specied in Table 3. They
are specied by following the indications given for their respective authors. With respect to the standard EAs employed in
the study, GGA and SSGA, the selection strategy is the binary
tournament. The mutation operator is the same one used in our
model of SSMA while SSGA uses standard replacement strategy. The crossover operator used by both algorithms denes
two cut points and interchanges substrings of bits.
To compare results we propose the use of non-parametric
tests, according to the recommendations made in Ref. [44].
They are safer than parametric tests since they do not assume
normal distribution or homogeneity of variance. As such, these
non-parametric tests can be applied to classication accuracies,
error ratios or any other measure for evaluation of classiers,
Ib3
Rmhc
Rng
SSMA
even including model sizes and computation times. Empirical

results suggest that they are also stronger than the parametric
test. Demar recommends a set of simple, safe and robust nonparametric tests for statistical comparisons of classiers. We
will use two tests with different purposes, the rst of which
is the Iman and Davenport test [45], a non-parametric test,
derived from the Friedman test, and equivalent to the repeatedmeasures ANOVA. Under the null-hypothesis, which states that
all the algorithms are equivalent, a rejection of this hypothesis
implies the existence of differences of performance among all
the algorithms studied. The second is the Wilcoxon signedranks test [46]. This is analogous to the paired t-test in nonparametrical statistical procedures; therefore, it is a pairwise
test that aims to detect signicant differences in the behaviour
of two algorithms.
We will present four types of tables according to the subsequent structure:
(1) Complete results table: This shows the average of the results obtained by each algorithm in all data sets evaluated
(small or medium group of data sets). These tables are
grouped in columns by algorithms. For each one it shows
the average reduction, accuracy in training and accuracy
in test data with their respective standard deviations (SDs)
(see for example Table 4). The two last rows compute the
average and SD over the average results obtained on each
data set, respectively.
(2) Summary results table: This shows the average of the results obtained by each algorithm in the data sets evaluated
(small or medium group of data sets). The columns show
the following information (see for example Table 9):
The rst column shows the name of the algorithm.
The second and third columns contain the average execution time and the SD associated to each algorithm of
the run of 10-fcv. They have been run in a HP Proliant
DL360 G4p, Intel Xeon 3.0 GHz, 1 GB RAM.
The fourth and fth columns show the average reduction
percentage and associated SD from the initial training
sets.
Table 4
Average results for EPS algorithms over small data sets
Data set
Measure
CHC
GGA
IGA
PBIL
SSGA
SSMA
Bupa
Cleveland
Glass
Iris
Led7Digit
Lymphography
Monks
Pima
Wine
Wisconsin
GLOBAL
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Tra.
Tst.
Red.
Tra.
Tst.
Red.
Tra.
Tst.
Red.
Tra.
Tst.
Red.
Tra.
Tst.
Red.
Tra.
Tst.
97.13
0.81
98.35
0.3
94.34
0.86
96.81
0.47
96.58
0.18
96.55
0.68
99.05
0.23
98.78
0.28
96.94
0.51
99.44
0.08
97.40
1.53
73.24
1.49
63.48
0.92
74.04
1.51
97.41
0.76
68.71
2.88
54.36
1.42
96.86
0.42
80.14
0.87
98.69
0.81
97.57
0.28
80.45
16.27
58.76
6.75
58.75
5.91
65.39
9.97
96.67
3.33
64
7.32
39.49
9.99
97.27
2.65
75.53
3.11
94.93
4.62
96.56
2.42
74.74
20.63
92.27
1.48
95.45
0.87
90.91
0.99
95.56
0.33
95.64
0.18
91.96
1.5
93.98
0.73
95.11
0.75
95.69
0.71
99.08
0.2
94.57
2.37
78.58
1.1
64.76
1
76.44
1.91
97.63
0.55
68.64
3.27
57.21
1.52
94.62
0.99
81.57
0.7
98.69
0.9
97.71
0.18
81.59
15.13
59.57
7.39
54.8
5.45
65.87
13.28
94
4.67
63.8
4.85
35.26
9.83
92.3
3.71
70.73
4.87
93.82
4.61
96
2.46
72.62
20.69
81.61
1.93
86.32
2.15
80.74
1.82
93.33
0.66
95.16
0.33
86.27
2.2
84.83
1.61
86
0.71
93.76
1.01
98.22
0.42
88.62
6.03
83.09
1.47
68.72
1.03
82.55
1.65
98.81
0.49
65.89
2.21
60.59
2.35
98.15
0.81
86.81
0.78
99.88
0.25
98.04
0.23
84.25
14.86
63.69
9.27
51.49
7.16
68.89
10.7
94
3.59
67.6
4.18
40.57
11.1
85.75
6.98
70.84
2.68
94.97
6.78
95.28
3.33
73.31
18.95
91.3
0.91
94.94
0.83
91.06
1.66
96.07
0.58
93.91
0.78
94.75
1.31
92.34
1.13
91.9
0.49
96.32
0.59
98.54
0.28
94.11
2.47
78.39
1.05
64.98
0.8
76.18
2.36
98.07
0.76
68.67
3.09
54.88
1.92
94.57
1.31
82.03
0.57
98.63
0.73
97.66
0.23
81.41
15.58
64.61
4.64
55.77
4.94
64.58
9.49
96
4.42
66
2.53
41.71
13.85
89.17
7.15
72.27
2.96
94.41
5.56
96.99
1.86
74.15
19.08
90.82
1.87
94.02
1.09
90.34
1.69
95.41
0.93
95.71
0.37
92.33
2.32
92.31
1.08
94.23
0.64
94.69
1.29
99.06
0.26
93.89
2.58
79.19
1.67
65.16
1.45
75.86
1.94
97.41
0.68
66.44
3.18
55.57
3.11
94.88
1.27
83.02
0.97
98.69
0.81
97.76
0.25
81.40
15.63
62.25
7.18
52.52
6.29
65.83
12.37
95.33
4.27
64.8
5.15
43.62
9.94
93.39
3.72
73.32
4.53
97.19
3.75
96.57
3.08
74.48
19.86
95.01
0.72
97.84
0.72
92.58
1.32
96.07
0.67
96.71
0.45
94.67
1.52
97.66
1.27
97.38
0.68
96.44
0.68
99.38
0.09
96.38
1.92
76.55
1.41
63.51
0.92
76.12
1.74
97.93
1.4
54.11
8.78
55.78
2.51
97.22
1.31
82.15
0.67
98.69
0.34
97.65
0.19
79.97
17.76
63.99
4.11
57.47
6.51
66.07
10.51
95.33
5.21
75.4
4.29
42.88
12.12
96.58
3.26
73.21
5.5
93.82
6.31
96.57
2.32
76.13
18.94
Red.
2701
2702
The last four columns include the accuracy mean and

SD when 1-NN is applied using S; the rst two show
the training accuracy when classifying TR with S, while
the last two show the test accuracy when classifying TS
with S.
In the fourth, sixth and eighth columns, the best result per
column is shown in bold.
(3) Wilcoxons test tables for n n comparisons: Given that
the evaluation of only the mean classication accuracy over
all the data sets would hide important information and that
each data set represents a different classication problem
with different degrees of difculty, we have included a second type of table that shows a statistical comparison of
methods over multiple data sets. As we have mentioned,
Demar [44] recommends a set of simple, safe and robust
non-parametric tests for statistical comparisons of classiers, one of which is the Wilcoxon signed-ranks test [46].
This is analogous to the paired t-test in non-parametrical
statistical procedures; therefore, it is a pairwise test that
aims to detect signicant differences in the behaviour of
two algorithms. In our study, we always consider a level of
signicance of < 0.05.
Let di be the difference between the performance scores of
the two classiers on ith out of N data sets. The differences
are ranked according to their absolute values; average ranks
are assigned in case of ties. Let R + be the sum of ranks for
the data sets in which the second algorithm outperformed
the rst, and R the sum of ranks for the opposite. Ranks
of di = 10 are split evenly among the sums; if there is an
odd number of them, one is ignored:
R+ =
rank(di ) +
di >0
R =
rank(di ) +
di <0
1
2
1
2
rank(di ),
(8)
rank(di ).
(9)
di =0
to compare all the algorithms in them. In each of the Nalg

Nalg cells three symbols can appear: +, or =. They
show that the algorithm situated in that row is better (+),
worse () or equal (=) in behaviour (accuracy or balance
accuracyreduction) to the algorithm that appears in the
column. The penultimate column represents the number of
algorithms with worse or equal behaviour to the one that
appears in the row (without considering the algorithm itself)
and the last column represents the number of algorithms
with worse behaviour than the one that appears in the row.
(4) Wilcoxons test tables to contrast results for a control algorithm: These tables are made up of three columns: In
the rst one, the name of the algorithm is indicated, in the
second and third columns, symbols +, or = show the
existence or absence of signicant differences between the
control algorithm and the algorithm specied in the row,
according to accuracy performance and accuracyreduction
balance performance, respectively. Note that in this type of
tables, a symbol + indicates that the control method is better than the algorithm in the row. In the previous type of
tables, the meaning of the symbols is just the opposite; that
is, for example, a symbol + indicates that the algorithm in
the row is better than the algorithm located at the column.
(5) Computation of Iman and Davenport statistic tables: We
follow the indications given in Ref. [44] incarrying out
Iman and Davenport test. It ranks the algorithms for each
data set separately, starting by assigning the rank of 1 to
j
the best performing algorithm. Let ri be the rank of the
jth of k algorithms on the ith of N data sets. The Iman and
Davenport statistic is dened as
FF =
2
F
2
F
N (k 1)
in which
di =0
Let T be the smaller of the sums, T = min(R + , R ).

The Wilcoxon signed-ranks test is more sensitive than the
t-test. It assumes commensurability of differences, but only
qualitatively: greater differences count still more, which
is probably to be desired, but absolute magnitudes are ignored. From the statistical point of view, the test is safer
since it does not assume normal distributions. Also outliers
(exceptionally good/bad performances on a few data sets)
have less effect on Wilcoxon than on the t-test. Wilcoxons
test assumes continuous differences di , which therefore
should not be rounded to, say, one or two decimals since
this would decrease the power of the test due to a high
number of ties.
A Wilcoxon table, in this paper, is divided into two parts:
In the rst part, we carry out a Wilcoxon test using as the
performance measure the accuracy classication in the test
set, while in the second part, a balance of reduction and
classication accuracy is used as the performance measure.
This balance corresponds to 0.5 clas_rat + 0.5 perc_red.
The structure of the tables presents Nalg (Nalg + 2) cells
(N 1)
2
F
(10)
2
F
is the Friedman statistic
2
1
12N
k(k + 1)2
j
=
ri
.
k(k + 1)
N
4
j
(11)
FF is distributed according to the F-distribution with k 1

and (k 1)(N 1) degrees of freedom.
These tables are made up of four columns. In the rst and
second, information about the conditions of the experiment
is indicated: the type of result that is measured and the
scale of the data sets, respectively. In the third, the computed value of FF is shown and, in the last column, the
corresponding critical value of the F-distribution table with
= 0.05 is indicated. If the value FF is higher than its associated critical value, then the null-hypothesis is rejected
(this implies a signicant difference of results among all
methods considered).
We divide the experimental study into two groups: Comparison among EPS algorithms and comparison of our proposal
with other non-EPS methods. The large-size data sets will be
separately studied with the objective of ascertaining if the behaviour of the new proposal when the stratication process
85.28
0.81
95.05
1.03
98.04
0.6
92.66
0.89
89.39
0.89
88.3
1.49
74.7
2.67
93.82
0.54
89.66
7.28
Tst.
2703
Table 6
Iman and Davenport statistic for EPS algorithms
88.07
0.24
96.03
0.12
98.74
0.16
95.53
0.38
91.98
0.35
91.23
0.55
83.6
0.38
94.32
0.14
92.44
4.85
94.07
0.89
99.18
0.09
98.56
0.28
98.88
0.13
98.36
0.23
98.12
0.22
97.04
0.48
99.83
0.07
98.01
1.79
80.56
0.79
95.12
0.75
98.37
0.29
84.36
0.94
88.95
1.31
87.64
2.18
72.92
1.84
93.58
0.5
87.69
8.33
90.16
0.17
96.77
0.16
95.27
0.07
93.31
0.22
94.84
0.21
94.79
0.24
93.07
0.49
96.65
0.16
94.36
2.16
80.55
0.94
94.99
0.83
98.27
0.32
77.45
1.66
88.7
1.01
87.3
1.68
72.51
2.7
93.32
0.5
86.64
9.07
82.78
0.32
95.7
0.17
98.7
0.11
80.11
0.44
90.64
0.21
89.39
0.52
78.96
0.83
93.64
0.3
88.74
7.38
88.62
0.24
93.15
0.21
91.65
0.2
90.14
0.28
91.58
0.24
91.71
0.29
91.13
0.53
92.45
0.18
91.30
1.40
74.07
1.98
94.32
0.79
95.87
0.62
87.7
1.51
86.59
0.9
86.84
1.64
71.57
1.66
92.53
2.45
86.19
8.97
74.98
0.88
94.84
0.24
95.87
0.29
88.08
1.07
87.63
0.32
87.63
0.9
75.41
0.76
92.72
2.51
87.15
8.05
93.6
0.82
99.66
0.04
98.91
0.12
99.59
0.05
99.51
0.1
99.53
0.05
99.14
0.16
99.9
0.03
98.73
2.10
GLOBAL
Thyroid
Splice
Spambase
Satimage
Ring
Pen-Based
Page-Blocks
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Nursery
Tra.
82.88
1.1
95.39
0.8
98.94
0.28
77.07
0.82
89.25
1.3
87.82
1.46
73.73
1.52
91.94
0.71
87.13
8.74
87.78
0.17
96.5
0.13
99.32
0.08
82.66
0.29
92.14
0.25
91.08
0.36
80.95
0.37
93.16
0.18
90.45
6.37
69.54
0.19
73.34
0.28
73.95
0.55
67.11
0.17
71.14
0.52
68.7
0.46
65.28
0.62
71.71
0.27
70.10
3.01
79.99
0.23
85.75
0.29
83.32
0.15
81.08
0.2
83.88
0.32
84.76
0.43
83.9
0.5
84.38
0.29
83.38
1.92
86.43
0.21
96.04
0.14
99.08
0.1
81.64
0.39
91.34
0.27
90.53
0.44
80.92
0.44
93.51
0.16
89.94
6.53
83.46
0.92
95.41
0.51
98.74
0.35
76.65
1.17
88.97
1.16
87.77
0.9
74.11
1.57
93.1
0.81
87.28
8.75
84.96
0.27
96.11
0.19
98.95
0.12
87.71
0.44
91.77
0.23
90.94
0.42
83.16
0.5
93.78
0.2
90.92
5.43
Red.
Red.
Red.
Tra.
Red.
Red.
Red.
Tst.
GGA
Measure
Data set
Table 5
Average results for EPS algorithms over medium data sets
CHC
Tra.
Tst.
Tst.
PBIL
IGA
Tra.
Tst.
SSGA
Tra.
Tst.
SSMA
Tra.
Performance
Scale
FF
Critical value
Accuracy
Accuracy
Accuracyreduction
Accuracyreduction
Small
Medium
Small
Medium
2.568
5.645
14.936
179.667
2.422
2.485
2.422
2.485
is employed follows the tendency shown for EAs [16]. A nal subsection will be included to illustrate a time complexity analysis considering EPS algorithms over all the medium
size data sets considered and to study how the execution time
of SSMA is given out among evolutionary and optimization
stages.
5.1. Part I: comparing SSMA with EAs
In this section, we carry out a comparison that includes all
EPS algorithms described in this paper.
Tables 4 and 5 show the average results for EPS algorithms
run over small and medium data sets, respectively.
Iman and Davenport tests result is presented in Table 6.
Tables 7 and 8 present the statistical differences by using
Wilcoxons test among EPS algorithms, considering accuracy
performance and accuracyreduction balance performance,
respectively.
The following conclusions from examination of Tables 48
can be pointed out:
In Table 4, SSMA achieves the best test accuracy rate. EPS
algorithms are prone to present over-tting obtaining a good
accuracy in training data but not in test data. The SSMA
proposal does not stress this behaviour in a noticeable way.
When the problem scales up, in Table 5, SSMA presents the
best reduction and accuracy in training and test data rates.
The ImanDavenport statistic (presented in Table 6) indicates
the existence of signicant differences of results among all
EPS approaches studied.
Considering only the performance in classication over test
data in Table 7, all algorithms are very competitive. Statistically, SSMA always obtains subsets of prototypes with equal
performance to the remaining of the EPS methods, improving GGA with the use of small databases and GGA and CHC
when the problem scales-up to a medium size.
When the reduction objective is included in the measure of
quality, Table 8, our proposal obtains the best result. Only
CHC presents the same behaviour in small data sets. When
the problem scales up, SSMA again outperforms CHC.
Finally, we provide a map of convergence of SSMA, in
Fig. 5, in order to illustrate the optimization process carried
out on the satimage data set. In it, the two goals, reduction and
train accuracy, are shown. We can see that the two goals are
opposite, but the trend of the two lines of convergence usually
rises.
2704
Table 7
Wilcoxon table for EPS algorithms considering accuracy performance
6
Small data sets

CHC (1)
GGA (2)
IGA (3)
PBIL (4)
SSGA (5)
SSMA (6)
=
=
=
=
=
=
=
=
=
=
Medium data sets

CHC (1)
GGA (2)
IGA (3)
PBIL (4)
SSGA (5)
SSMA (6)
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
>
=
=
=
=
5
2
5
5
5
5
1
0
0
0
1
1
=
=
+
+
=
=
=
=
+
4
3
5
5
5
5
0
0
0
0
1
2
=
=
+
+
=
Table 8
Wilcoxon table for EPS algorithms considering reductionaccuracy balance performance
6
Small data sets

CHC (1)
GGA (2)
IGA (3)
PBIL (4)
SSGA (5)
SSMA (6)
+
=
+
=
+
+
Medium data sets

CHC (1)
GGA (2)
IGA (3)
PBIL (4)
SSGA (5)
SSMA (6)
+
+
+
+
Reduction
22
840
1335
1870
2376
2926
3399
3893
4379
4864
5333
5807
6318
6844
7350
7796
8257
8730
9208
9687
Percentage
Acc. Test
100
99
98
97
96
95
94
93
92
91
90
=
+
Evaluations
Fig. 5. Map of convergence of the SSMA algorithm on satimage data set.
5.2. Part II: comparing SSMA with non-evolutionary

algorithms
Tables 9 and 10 show the average results for non-EPS algorithms together with the SSMA proposal run over small and
+
+
+
+
+
+
+
+
=
=
+
+
+
+
>
5
3
0
3
3
5
4
1
0
1
1
4
=
+
4
2
0
1
4
5
3
2
0
1
3
5
medium data sets, respectively. This table summarizes the mean

and SD of the results carried out over all data sets.
The ImanDavenport test result is presented in Table 11.
Table 12 summarizes the existence or absence of statistical difference by using Wilcoxons test between the SSMA proposal
and remaining non-EPS methods (as we mentioned before, every cell presents the comparison of SSMA with the respective
algorithm in the row. The symbol + indicates better behaviour
of SSMA than that of the corresponding one in the row).
We can show the following:
In Table 9, the SSMA proposal clearly outperforms the other
algorithms taking account of accuracy performance in test
data.
When the problem scales up, SSMA presents the best accuracy
rate and the second best rate of reduction (results shown in
Table 10).
Non-EPS algorithms usually present a lower run-time than
that of the EPS algorithm. However, those non-EPS algorithms that present best behaviour in both objectives are also
algorithms with high running times.
2705
Table 9
Average results for non-EPS algorithms over small data sets
Algorithm
Time (s)
SD time (s)
% Red.
SD red.
% Ac. trn.
SD ac. trn
% Ac. test
SD ac. test
1-NN
Allknn
Cpruner
Drop3
Enn
Explore
Ib3
Pop
Rmhc
Rng
Rnn
SSMA
0.2
0.18
0.26
0.71
0.13
0.83
0.17
0.05
11.82
1.35
3.18
6.38
0.00
0.03
0.02
0.02
0.01
0.03
0.01
0.01
0.05
0.03
0.05
0.16
37.00
92.59
83.43
25.50
97.66
68.34
14.56
90.18
25.07
92.40
96.37
23.97
5.09
8.85
19.56
1.16
19.85
15.17
0.17
20.45
4.56
1.92
72.89
77.51
66.38
73.43
77.47
77.59
64.16
70.49
83.68
78.24
74.41
79.97
18.63
16.29
24.16
15.17
15.16
16.78
22.95
20.19
14.25
15.31
17.57
17.76
72.22
72.60
65.31
67.69
73.21
74.42
69.96
72.10
73.93
73.85
73.11
76.13
19.15
18.40
23.22
19.06
18.38
19.60
20.18
19.60
19.44
18.32
17.70
18.94
Table 10
Average results for non-EPS algorithms over medium data sets
Algorithm
Time (s)
SD time (s)
% Red.
SD red.
% Ac. trn.
SD ac. trn
% Ac. test
SD ac. test
1-NN
Allknn
Cpruner
Drop3
Enn
Explore
Ib3
Pop
Rmhc
Rng
Rnn
SSMA
72.93
34.39
23.83
99.48
13.48
1525.52
2.70
1.20
6572.53
3149.19
26 428.38
3775.75
0.15
0.31
0.07
0.82
0.03
92.27
0.05
0.04
172.08
26.62
271.56
184.13
18.68
89.08
88.17
12.22
98.74
77.42
23.17
90.00
7.20
93.65
98.01
13.49
3.56
8.35
10.21
1.03
17.83
32.94
0.01
4.90
4.25
1.79
89.11
88.46
81.31
81.99
87.85
91.44
84.91
86.09
94.30
89.92
88.85
92.44
7.75
11.33
15.51
9.88
10.66
4.96
10.80
13.55
3.57
7.74
7.45
4.85
87.79
85.21
80.52
78.84
85.98
88.99
86.36
84.97
89.53
87.99
86.04
89.66
8.94
12.94
16.09
13.35
12.14
7.21
9.49
13.78
7.61
9.14
9.57
7.28
As we can see in Table 12, Wilcoxons test considers the existence of competitive algorithms in terms of classication
accuracy. None of the non-EPS algorithms outperforms our
proposal in both considerations of evaluation. The unique algorithm that equals the result of SSMA when the reduction
and accuracy are combined is Explore. Note that this last
algorithm obtains a greater reduction rate than SSMA, but
our approach improves upon it when we consider the classication performance. This fact is interesting given that, although a good trade-off between reductionaccuracy is required, accuracy in terms of classication is more difcult to
improve upon than the reduction, so the solutions contributed
by SSMA may be considered of higher quality.
There are algorithms, for example Allknn, Rng or Pop, that
have a similar performance in classication accuracy but that
do not achieve a high reduction rate. This could be critical
when large data sets need to be processed. A small reduction
rate implies a small decrease in classication resources (storage and time); therefore, an application of these algorithms
on large data sets lacks interest.
Table 11
Iman and Davenport statistic for non-EPS algorithms
In relation to the superiority of the proposed algorithm, we

are able show difference in the results obtained by SSMA and
the ones obtained by CHC and Explore by means of graphical
We have seen before the promising results offered by SSMA

in small and medium sized databases. However, a size limit
of data sets exists which makes the execution of an EPS
Performance
Scale
FF
Critical value
Accuracy
Accuracy
Accuracyreduction
Accuracyreduction
Small
Medium
Small
Medium
6.009
5.817
56.303
68.583
1.938
1.969
1.938
1.969
representations of the difference in reduction and test accuracy

between SSMA and the two mentioned algorithms over all data
sets.
Fig. 6 displays these representations, where the positive bars
indicate superiority of SSMA. The bias of SSMA is to obtain less
reduction than its two strong competitors, but it achieves a better
accuracy than them when the problem scales up, more notably
when compared with CHC. Note that SSMA outperforms them
in accuracy on 12 data sets to the two main competitors.
5.3. Part III: a large-size case study
2706
algorithm over them impossible. This limit depends on the algorithms employed, the properties of data treated and easiness
of reduction of instances in the data set. We could argue that,
Table 12
Wilcoxon table for comparing SSMA with Non-EPS algorithms
Accuracy
Acc. 0.5 + red. 0.5
Small data sets

Allknn
Cpruner
Drop3
Enn
Explore
Ib3
Pop
Rmhc
Rng
Rnn
=
+
+
=
=
+
=
=
=
+
+
+
+
+
=
+
+
+
+
+
Medium data sets

Allknn
Cpruner
Drop3
Enn
Explore
Ib3
Pop
Rmhc
Rng
Rnn
=
+
+
=
+
+
=
=
=
+
+
+
+
+
=
+
+
+
+
+
surely, it may not be possible to handle a size of a data set

above 20,000 instances with EPS due to a time complexity of
the evaluations of O(n2 m). SSMA proposal may get closer this
limit.
In this case, authors in Ref. [16] recommend the use of Stratication, which could be combined with a cross-validation procedure as can be seen in expression (7). By using the Tfcv-Strat,
we have run the algorithms considered in this study over the
data set adult with t = 10 and b = 1. We chose these parameters to obtain subsets whose size is not too large as well as to
show the effectiveness of the combination of an EPS and the
stratication procedure.
Fig. 7 shows a representation of an opposition between
the two objectives: reduction and test accuracy. Each algorithm located inside the graphic gets its position from the
average values of each measure evaluated (exact position corresponding to the beginning of the name of the algorithm).
Across the graphic, there is a line that represents the threshold of test accuracy achieved by the 1-NN algorithm without
preprocessing.
As we can see, all EPS methods are above the 1-NN horizontal line. The graphic clearly emphasizes three methods as
best with their position in the graphic at top-right. In addition
to this, we can remark that the SSMA algorithm achieves the
highest accuracy rate, and it is only surpasses in the reduction
objective by the CHC model.
Fig. 8 presents the opposition reductionaccuracy for nonEPS algorithms. Ib3 and Pop algorithms have been removed
SSMA vs. CHC
SSMA vs. Explore

thyroid
splice
thyroid
splice
satimage
satimage
ring
ring
pen-based
pen-based
nursery
spambase
spambase
page-blocks
page-blocks
wisconsin
wisconsin
wine
nursery
wine
pima
pima
monks
monks
lymphography
led7dig
lymphography
led7dig
iris
iris
glass
glass
cleveland
cleveland
bupa
-15
Test Accuracy
bupa
-5
5
Reduction
15
-15
Test Accuracy
-5
15
Reduction
Fig. 6. Difference of results between SSMA and CHC and SSMA and Explore. (a) SSMA vs. CHC. (b) SSMA vs. Explore.
2707
82.5
SSMA
82
Test Accuracy
SSGA
CHC
81.5
81
PBIL
GGA
80.5
80
IGA
79.5
1-NN
79
65
60
70
75
80
85
Reduction
90
95
100
Fig. 7. Accuracy in test vs. reduction in adult data set for EPS algorithms.
83
Rng
Enn
Test Accuracy
82
Allknn
SSMA
Explore
Rmhc
Cpruner
81
80
Rnn
1-NN
79
78
Drop3
77
18
28
38
48
58
68
78
88
98
Reduction
Fig. 8. Accuracy in test vs. reduction in adult data set for non-EPS algorithms.
from the graphic because of their poor accuracy and reduction,

respectively.
The results offered by SSMA situate the algorithm at the top
of the remaining methods that are nearest in the x-axis. This
fact points to its superiority over the nearest methods. However,
others methods further from it in x-axis are above SSMA in yaxis. The case studied practically shows the same behaviour
as in previous cases, so it may be specied a generalization
capacity of the results independently of the size of data.
5.4. Part IV: time complexity analysis for EPS
A way of estimating the efciency of EPS algorithms could
be by an empirical study. This implies a study of the execution time of each method by using different sizes of training
data. We have showed the results obtained by using a graphical
representation in Fig. 9 for medium size data sets.
As it can be seen in Fig. 9, the most efcient EPS algorithms
are SSMA and CHC. Time complexity of an algorithm basically
depends on two factors:

Reduction capacity: When an EPS algorithm can quickly
reduce the subset of instances selected in the chromosome, an
evaluation will have to compute NNs over less data in order
to calculate the tness value. CHC and SSMA are inclined
towards a tendency of removing instances and then improving
classication accuracy, thus obtaining a good reduction rate.
Remaining EPS algorithms try to improve both objectives
simultaneously.
Operational cost: In this case, any algorithm has an operational cost of time in all its procedures. For example, a GGA
algorithm must sort and apply crossover and mutation operators to a part of a population, whereas PBIL must generate new populations from a single probability vector that
receives all possible modications. This explains why GGA
takes more time than PBIL even if the former has a higher
reduction rate than the latter. SSMA takes advantage of the
partial evaluation mechanism, which is an efcient procedure
to evaluate chromosomes.
2708
When the PS problem scales up, the tendency on the part of

classical EPS is to inappropriately converge to two objectives
(accuracy and reduction) at the same time.
Considering all types of data sets, SSMA outperforms or
equalizes all methods when both objectives have the same
importance. Furthermore, it usually outperforms in test accuracy other methods that obtain good rates of reduction.
1072.637
33081.747
26372.639
14751.796
31620.449
3260.008
thyroid
SSMA
IGA
PBIL
493.565
10523.368
4547.91
2880.089
4997.814
872.694
splice
SGA
GGA
CHC
Finally, we would like to point out in conclusion that our SSMA

proposal allows EPS to be competitive with other models for
PS when the size of the databases increases, tackling the scaling
up problem with excellent results.
An open problem, beyond of the scope of this paper, consists
of hybridizing feature selection and weighting with the SSMA
proposal in order to obtain a better behaviour in low complexity
classiers (like 1-NN) [47]. Future research will be focus on
this together with tackling the prototype generation process
with real-coded evolutionary algorithms.
1329.829
43206.889
12652.34
8323.344
13572.339
1807.272
satimage
1343.833
37582.409
13224.822
8869.263
13442.095
3144.271
ring
6102.942
99510.786
27474.19
19183.612
32439.13
5865.286
pen-based
16333.8171
95548.553
36661.43
27832.58
38827.026
14772.908
nursery
678.612
13987.491
8075.734
7345.749
11074.682
1187.331
spambase
702.54
10051.149
4703.519
3155.491
4885.795
1046.733
page-blocks
20000
40000
60000
80000 100000
Fig. 9. Run-time in seconds for EPS over medium size data sets.
6. Conclusion
This paper presents a memetic algorithm for evolutionary
prototype selection and its application over different sizes of
data sets, paying special attention to the scaling up problem.
An experimental study was carried out to establish a comparison between this proposal and previous evolutionary and
non-evolutionary approaches studied in the literature. The main
conclusions reached are as follows:
Our proposal of SSMA presents a good reduction rate and
computational time with respect to other EPS schemes.
SSMA outperforms the classical PS algorithms, irrespective
of the scale of data set. Those algorithms that could be competitive with it in classication accuracy are not so when the
reduction rate is considered.
References
[1] A.N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A
Database Perspective, Springer-Verlag Telos, 2004.
[2] G. Shakhnarovich, T. Darrel, P. Indyk (Eds.), Nearest-Neighbor Methods
in Learning and Vision: Theory and Practice, MIT Press, Cambridge,
MA, 2006.
[3] E. Pekalska, R.P.W. Duin, P. Paclk, Prototype selection for dissimilaritybased classiers, Pattern Recognition 39 (2) (2006) 189208.
[4] S.-W. Kim, B.J. Oommen, On using prototype reduction schemes to
optimize dissimilarity-based classication, Pattern Recognition 40 (11)
(2007) 29462957.
[5] H. Liu, H. Motoda, On issues of instance selection, Data Mining Knowl.
Discovery 6 (2) (2002) 115130.
[6] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based
learning algorithms, Mach. Learn. 38 (2000) 257286.
[7] R. Paredes, E. Vidal, Learning prototypes and distances: a prototype
reduction technique based on nearest neighbor error minimization,
Pattern Recognition 39 (2) (2006) 180188.
[8] E. Gmez-Ballester, L. Mic, J. Oncina, Some approaches to improve
tree-based nearest neighbour search algorithms, Pattern Recognition 39
(2) (2006) 171179.
[9] M. Grochowski, N. Jankowski, Comparison of instance selection
algorithms II. Results and comments, in: Proceedings of the International
Conference on Articial Intelligence and Soft Computing (ICAISC04),
Lecture Notes in Computer Science, vol. 3070, Springer, Berlin, 2004,
pp. 580585.
[10] M. Lozano, J.M. Sotoca, J.S. Snchez, F. Pla, E. Pekalska, R.P.W. Duin,
Experimental study on prototype optimisation algorithms for prototypebased classication in vector spaces, Pattern Recognition 39 (10) (2006)
18271838.
[11] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary
Algorithms, Springer, New York, 2002.
[12] A. Ghosh, L.C. Jain (Eds.), Evolutionary Computation in Data Mining,
Springer, Berlin, 2005.
[13] N. Garca-Pedrajas, D. Ortiz-Boyer, A cooperative constructive method
for neural networks for pattern recognition, Pattern Recognition 40 (1)
(2007) 8098.
[14] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing,
Springer, Berlin, 2003.
[15] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as
instance selection for data reduction in KDD: an experimental study,
IEEE Trans. Evol. Comput. 7 (6) (2003) 561575.
[16] J.R. Cano, F. Herrera, M. Lozano, Stratication for scaling up
evolutionary prototype selection, Pattern Recognition Lett. 26 (7) (2005)
953963.

[17] P. Moscato, On evolution, search, optimization, genetic algorithms and
martial arts: towards memetic algorithms, Technical Report C3P 826,
Pasadena, CA, 1989.
[18] N. Krasnogor, J. Smith, A tutorial for competent memetic algorithms:
model, taxonomy, and design issues, IEEE Trans. Evol. Comput. 9 (5)
(2005) 474488.
[19] R. Dawkins, The Selsh Gene, Oxford University Press, Oxford, 1976.
[20] W.E. Hart, N. Krasnogor, J. Smith (Eds.), Recent Advances in Memetic
Algorithms, Springer, Berlin, 2005.
[21] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited
data, IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408421.
[22] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE
Trans. Syst. Man Cybern. 6 (6) (1976) 448452.
[23] J.C. Riquelme, J.S. Aguilar-Ruiz, M. Toro, Finding representative
patterns with ordered projections, Pattern Recognition 36 (2003) 1009
1018.
[24] G.W. Gates, The reduced nearest neighbour rule, IEEE Trans. Inf. Theory
18 (3) (1972) 431433.
[25] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms,
Mach. Learn. 6 (1991) 3766.
[26] K.P. Zhao, S.G. Zhou, J.H. Guan, A.Y. Zhou, C-pruner: an improved
instance prunning algorithm, in: Second International Conference on
Machine Learning and Cybernetics (ICMLC03), 2003, pp. 9499.
[27] H. Brighton, C. Mellish, Advances in instance selection for instancebased learning algorithms, Data Mining Knowl. Discovery 6 (2002)
153172.
[28] R.M. Cameron-Jones, Instance selection by encoding length heuristic
with random mutation hill climbing, in: Proceedings of the Eighth
Australian Joint Conference on Articial Intelligence, 1995, pp. 99106.
[29] D.B. Skalak, Prototype and feature selection by sampling and random
mutation hill climbing algorithms, in: Proceedings of the Eleventh
International Conference on Machine Learning (ML94), Morgan
Kaufmann, Los Altos, CA, 1994, pp. 293301.
[30] J.S. Snchez, F. Pla, F.J. Ferri, Prototype selection for the nearest
neighbour rule through proximity graphs, Pattern Recognition Lett. 18
(1997) 507513.
[31] L.I. Kuncheva, Editing for the k-nearest neighbors rule by a genetic
algorithm, Pattern Recognition Lett. 16 (1995) 809814.
[32] L.I. Kuncheva, J.C. Bezdek, Nearest prototype classication: clustering
genetic, algorithms, or random search?, IEEE Trans. Syst. Man Cybern.
28 (1) (1998) 160164.
[33] H. Ishibuchi, T. Nakashima, Evolution of reference sets in nearest
neighbor classication, Second Asia-Pacic Conference on Simulated
Evolution and Learning on Simulated Evolution and Learning
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
2709
(SEAL98),f Lecture Notes in Computer Science, vol. 1585, Springer,

Berlin, 1999, pp. 8289.
B. Sierra, E. Lazkano, I. Inza, M. Merino, P. Larraaga, J. Quiroga,
Prototype selection and feature subset selection by estimation of
distribution algorithms. A case study in the survival of cirrhotic patients
treated with tips, in: Proceedings of the Eighth Conference on AI
in Medicine in Europe (AIME01), Springer, London, UK, 2001,
pp. 2029.
S.-Y. Ho, C.-C. Liu, S. Liu, Design of an optimal nearest neighbor
classier using an intelligent genetic algorithm, Pattern Recognition Lett.
23 (13) (2002) 14951503.
L.J. Eshelman, The CHC adaptative search algorithm: how to have
safe search when engaging in nontraditional genetic recombination,
in: G.J.E. Rawlins (Ed.), Foundations of Genetic Algorithms, 1991,
pp. 265283.
S. Baluja, Population-based incremental learning: a method for
integrating genetic search based function optimization and competitive
learning, Technical Report CMU-CS-94-163, Pittsburgh, PA, 1994.
J.-H. Chen, H.-M. Chen, S.-Y. Ho, Design of nearest neighbor classiers:
multi-objective approach, Int. J. Approx. Reasoning 40 (12) (2005)
322.
D.E. Goldberg, K. Deb, A comparative analysis of selection schemes
used in genetic algorithms, in: G.J.E. Rawlins (Ed.), Foundations of
Genetic Algorithms, 1991, pp. 6993.
M.W.S. Land, Evolutionary algorithms with local search for
combinatorial optimization, Ph.D. Thesis, University of California, San
Diego, 1998.
M. Lozano, F. Herrera, N. Krasnogor, D. Molina, Real-coded memetic
algorithms with crossover hill-climbing, Evol. Comput. 12 (3) (2004)
273302.
W.E. Hart, Adaptive global optimization with local search, Ph.D. Thesis,
University of California, San Diego, 1994.
A. Asuncion, D.J. Newman, UCI repository of machine learning
databases, 2007. URL: http://www.ics.uci.edu/mlearn/MLRepository.
html .
J. Demar, Statistical comparisons of classiers over multiple data sets,
J. Mach. Learn. Res. 7 (2006) 130.
R.L. Iman, J.M. Davenport, Approximations of the critical region of the
Friedman statistic, Commun. Stat. (1980) 571595.
D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical
Procedures, CRC Press, Boca Raton, FL, 2003.
D. Wettschereck, D.W. Aha, Weighting features, in: First International
Conference of Case-Based Reasoning, Research and Development, 1995,
pp. 347358.
About the AuthorSALVADOR GARCA received the M.Sc. degree in Computer Science form the University of Granada, Granada, Spain, in 2004. He is
currently a Ph.D. student in the Department of Computer Science and Articial Intelligence, University of Granada, Granada, Spain. His research interests
include data mining, data reduction, evolutionary algorithms, learning from imbalanced data and statistical analysis.
About the AuthorJOS RAMN CANO received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in
1999 and 2004, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jan, Jan, Spain. His research
interests include data mining, data reduction, interpretability-accuracy trade off, and evolutionary algorithms.
About the AuthorFRANCISCO HERRERA received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from
the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Articial Intelligence at the University of Granada. He has published more than 100
papers in international journals. He is coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (World
Scientic, 2001). As edited activities, he has co-edited four international books and co-edited 16 special issues in international journals on different Soft
Computing topics. He currently serves as area editor of the Journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves
as member of the editorial board of the journals: Fuzzy Sets and Systems, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems,
Memetic Computation, International Journal of Computational Intelligence Research, Mediterranean Journal of Articial Intelligence, International Journal
of Information Technology and Intelligent and Computing. He acts as associated editor of the journals: Mathware and Soft Computing, Advances in Fuzzy
Systems, and Advances in Computational Sciences and Technology. His current research interests include computing with words and decision making, data
mining, data preparation, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and
genetic algorithms.
2.2.
Diagnstico de la Efectividad de la Seleccin Evolutiva de

o
o
Prototipos usando una Medida de Solapamiento - Diagnose of Eective Evolutionary Prototype Selection using an
Overlapping Measure

S. Garc J.R. Cano, E. Bernad-Mansilla, F. Herrera, Diagnose of Eective Evolutionary
a,
o
Prototype Selection using an Overlapping Measure. International Journal of Pattern Recognition and Articial Intelligence, in press (2008).
Estado: Aceptado

Your Submission Manuscript No. IJPRAI-D-07-00177R3
1 de 1
Asunto: Your Submission Manuscript No. IJPRAI-D-07-00177R3

De: "Int. J. Pattern Recogn. (IJPRAI)" <ijprai@wspc.com.sg>
Fecha: 8 Oct 2008 03:15:33 -0400
Para: salvagl@decsai.ugr.es
Ref.: Ms. No. IJPRAI-D-07-00177R3
Diagnose of Effective Evolutionary Prototype Selection using an Overlapping Measure
International Journal of Pattern Recognition and Artificial Intelligence
Dear Mr. Salvador Garca,
I am pleased to inform you that your work has now been accepted for publication in
International Journal of Pattern Recognition and Artificial Intelligence.
You will be contacted by the publisher for the source files shortly before the date of
publication.
Thank you for submitting your work to this journal and we hope you will consider to
cite papers published in IJPRAI in your future work.
With kind regards
Xiaoyi Jiang
Managing Editor
International Journal of Pattern Recognition and Artificial Intelligence
08/10/2008 09:45
October 1, 2008 18:43

Herrera-Bernado-IJPRAI
WSPC/INSTRUCTION FILE
Cano-Garcia-
Diagnose of Eective Evolutionary Prototype Selection using an

Overlapping Measure
Salvador Garc
a
Dept. of Computer Science and Articial Intelligence, University of Granada
Granada, 18071, Spain
salvagl@decsai.ugr.es
Jos-Ramn Cano
e
o
Department of Computer Science, University of Jan,
e
Higher Polytechnic Center of Linares, Alfonso X El Sabio street
Linares, 23700, Spain,
jrcano@ujaen.es
Ester Bernad-Mansilla
o
Dept. of Computer Engineering, University of Ramon Llull
Barcelona, 08022, Spain
esterb@salleurl.edu
Francisco Herrera
Dept. of Computer Science and Articial Intelligence, University of Granada
herrera@decsai.ugr.es
Evolutionary prototype selection has shown its eectiveness in the past in the prototype
selection domain. It improves in most of the cases the results oered by classical prototype selection algorithms but its computational cost is expensive. In this paper we
analyse the behaviour of the evolutionary prototype selection strategy, considering a
complexity measure for classication problems based on overlapping. In addition, we
have analysed dierent k values for the nearest neighbour classier in this domain of
study to see its inuence on the results of PS methods. The objective consists of predicting when the evolutionary prototype selection is eective for a particular problem,
based on this overlapping measure.
Keywords: Prototype Selection; Evolutionary Prototype Selection; Complexity Measures; Overlapping Measure; Data Complexity
This
work was supported by Projects TIN2005-08386-C05-01, TIN2005-08386-C05-03 and

TIN2005-08386-C05-04.
1
October 1, 2008 18:43

Cano-Garcia-
1. Introduction
Prototype Selection (PS) is a classical supervised learning problem where the objective consists in, using an input data set, nding those prototypes which improve
the accuracy of the nearest neighbour classier 26 . More formally, lets assume that
there is a training set T which consists of pairs (xi , yi ), i = 1, ..., n, where xi
denes an input vector of attributes and yi denes the corresponding class label. T
contains n samples, which have m input attributes each one and they should belong to one of the C classes. Let S T be the subset of selected samples resulting
from the execution of a prototype selection algorithm. The small size of the subset
selected decreases the requirements of computational resources of the classication
algorithm while keeping the classication performance 1 .
In the literature, another process used for reducing the number of instances can
be found. This is the prototype generation, which consists of building new examples
18,19
. Many of the examples generated may not coincide with the examples belonging
to the original data set, due to the fact that they are articially generated. In some
applications, this behaviour is not desired, as it could be the case in some data
set from the UCI Repository, such as Adult or KDD Cup99, where information
appears about real people or real connections, respectively and if new instances
were generated, it could be possible that they do not correspond to valid real
values. In this paper, we center our attention to the prototype selection domain,
keeping the initial characteristics of the instances unchanged.
There are many proposals of prototype selection algorithms 14,33 . These methods
follow dierent strategies for the prototype selection problem, and oer dierent
behaviours depending on the input data set. Evolutionary algorithms are one of the
most promising heuristics.
Evolutionary Algorithms (EAs) 9,13 are general-purpose search algorithms that
use principles inspired by natural genetic populations to evolve solutions to problems. The basic idea is to evolve a population of chromosomes, which represents
plausible solutions to the problem, by means of a competition process. EAs have
been used to solve the PS problem in 5,20,31 with promising results.
The EAs oer optimal results but at the expense of high computational cost.
Thus, it would be interesting to characterize their eective use in large-scale classication problems beforehand 34 . We consider their work as eective when they
improve the classication capabilities of the nearest neighbours classier. To reach
this objective, we analyse the data sets characteristics previous to the prototype
selection process.
In the literature, several studies have addressed the characterization of the data
set by means of a set of complexity measures 2,17 . Mollineda et al. in 23 present a
previous work where they analysed complexity measures like overlapping and nonparametric separability considering the Wilsons Edited Nearest Neighbor 32 and
the Harts Condensed Nearest Neighbor 15 as prototype selection algorithms.
In this study we are interested in diagnosing when the evolutionary prototype
October 1, 2008 18:43

Cano-Garcia-
selection is eective for a particular problem, using the overlapping measure suggested in 23 . To address this, we have analysed its behaviour by means of statistical comparisons with classical prototype selection algorithms considering data sets
from the UCI Repository24 and dierent values of k neighbours for the prototype
selection problem.
In order to do that, the paper is set out as follows. Section 2 is devoted to
describe the evolutionary prototype selection strategy and the algorithm used in
this study which belongs to this family. In Section 3, we present the complexity
measure considered. Section 4 explains the experimental study and Section 5 deals
with the results and their statistical analysis. Finally, in Section 6, we point out
our conclusions.
2. Evolutionary Prototype Selection
EAs have been extensively used in the past both in learning and preprocessing
5,6,11,25
. EAs may be applied to the PS problem 5 because it can be considered as
a search problem.
The application of EAs to PS is accomplished by tackling two important issues:
the specication of the representation of the solutions and the denition of the
tness function.
Representation: Lets assume a training data set denoted T with n instances. The search space associated with the instance selection is constituted
by all the subsets of T . A chromosome consists of the sequence of n genes
(one for each instance in T ) with two possible states: 1 and 0, meaning
that the instance is or not included in the subset selected respectively (see
Figure 1).
T 1 0 1 0 0 1 0 0
1
S 1 3 6
Fig. 1. Chromosome binary representation of a solution.
Fitness function: Let S be a subset of instances coded in a chromosome

that needs to be evaluated. We dene the tness function that combines
two values: the classication performance (clasper) associated with S and
the percentage of reduction (percred) of instances of S with respect to T :
F itness(S) = clasper + (1 ) percred.
(1)
October 1, 2008 18:43

Cano-Garcia-
where 0 1 is the relative weight of these objectives. The k-Nearest

Neighbour (k-NN) classier is used for measuring the classication rate,
clasper, associated with S. It denotes the percentage of objects from T
correctly classied using only S to nd the nearest neighbours. For each
object y in T, the nearest neighbours are searched among those in the set
S \ {y}. Whereas, percred is dened as:
percred = 100
|T ||S|
.
|T |
(2)
The objective of the EAs is to maximize the tness function dened,

i.e., maximize the classication rate and minimize the number of instances
obtained. We have used the value = 0.5 considering the suggestion of the
authors 5 .
As EA, we have selected the CHC 10 model. This decision is based on its best
competitive behaviour showed in 5 . Figure 2 describes the evolutionary prototype
selection process.
Input Data Set (D)
Training Set (T)
Test Set (TS)
Evolutionary Prototype
Selection Algorithm
Output Prototype Subset

Selected (S)
kNearest Neighbour
Classifier
Fig. 2. Evolutionary Prototype Selection process.
During each generation, Evolutionary Instance Selection CHC (EIS-CHC) method develops the following steps:
(1) It uses a parent population to generate an intermediate population of individuals, which are randomly paired and used to generate potential ospring.
(2) Then, a survival competition is held where the best chromosomes from the
parent and ospring populations are selected to form the next generation.
EIS-CHC also implements a form of heterogeneous recombination using HUX, a
special recombination operator. HUX exchanges half of the bits that dier between
October 1, 2008 18:43

Cano-Garcia-
parents, where the bit position to be exchanged is randomly determined. EIS-CHC

also employs a method of incest prevention. Before applying HUX to two parents,
the Hamming distance between them is measured. Only those parents who dier
from each other by some number of bits (mating threshold) are mated. The initial
threshold is set at L/4, where L is the length of the chromosomes. If no ospring
are inserted into the new population then the threshold is reduced by one.
No mutation is applied during the recombination phase. Instead, when the population converges or the search stops making progress (i.e., the dierence threshold
has dropped to zero and no new ospring are being generated which are better than
any members of the parent population) the population is reinitialized to introduce
new diversity to the search. The chromosome representing the best solution found
along the search is used as a template to reseed the population. Reseeding of the
population is accomplished by randomly changing 35% of the bits in the template
chromosome to form each of the other N 1 new chromosomes in the population.
The search is then resumed.
The tness function (see expression 1) combines two values: the classication
rate (using k-NN) associated with S and the percentage of reduction of instances
of S with respect to T .
The pseudocode of CHC appears in Algorithm 1.
input : A population of chromosomes Pa
output: An optimized population of chromosomes Pa
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
t 0;
Initialize(Pa ,ConvergenceCount);
while not EndingCondition(t,Pa ) do
Parents SelectionParents(Pa );
Ospring HUX(Parents);
Evaluate(Ospring);
Pn ElitistSelection(Ospring,Pa );
if not modified(Pa ,Pn ) then
ConvergenceCount ConvergenceCount 1;
if ConvergenceCount = 0 then
Pn Restart(Pa );
Initialize(ConvergenceCount);
end
end
t t +1;
Pa Pn ;
end
Algorithm 1: Pseudocode of CHC algorithm
October 1, 2008 18:43

Cano-Garcia-
3. Data Set Characterization Measure

Classication problems can be dicult for three reasons
17
Certain problems are known to have nonzero Bayes error 16 . This is because
some classes can be intrinsically ambiguous or due to inadequate feature measurements.
Some problems may present complex decision boundaries so it is not possible
to oer a compact description of them 28 .
Sparsity induced by small sample size and high dimensionality aect the generalization of the rules 27,22 .
Real life problems are often aected by a mixture of the three previously mentioned situations.
The prediction capabilities of classiers are strongly dependent on data complexity. This is the reason why various recent papers have introduced the use of
measures to characterize the data and to relate these characteristics to the classier performance 28 .
In 17 , Ho and Basu dene some complexity measures for classication problems
of two classes. Singh in 30 oers a review of data complexity measures and proposes
two new ones. Dong and Kothari in 8 propose a feature selection algorithm based on
a complexity measure dened by Ho and Basu. Bernad and Ho in 4 investigate the
o
domain of competence of XCS by means of a methodology that characterizes the
complexity of a classication problem by a set of geometrical descriptors. In 21 , Li et
al. analyse some omnivariate decision trees using the measure of complexity based
in data density proposed by Ho and Basu. Baumgartner and Somorjai in 3 dene
specic measures for regularized linear classiers, using Ho and Basus measures
as reference. Mollineda et al. in 23 extend some Ho and Basus measure denitions
for problems with two or more classes. They analyse these generalized measures in
two classic PS algorithms and remark that Fishers discriminant ratio is the most
eective for PS. Snchez et al. in 28 analyse the eect of the data complexity in the
a
nearest neighbours classier.
In this case, according to the conclusions of Mollineda et al. 23 , we have considered Fishers discriminant ratio, which is a geometrical measure of overlapping, for
studying the behaviour of evolutionary prototype selection. Fishers discriminant
ratio is presented in this section.
The plain version of Fishers discriminant ratio oered by Ho and Basu 17 computes the degree of separability of two classes according to a specic feature. It
compares the dierence between the class means with respect to the sum of class
variances. Fishers discriminant ratio is dened as follows:
f=
(1 2 )2
2
2
1 + 2
(3)
2
2
where 1 , 2 , 1 , 2 are the means and the variances of the two classes respectively.
October 1, 2008 18:43

Cano-Garcia-
A possible generalization for C classes is proposed by Mollineda et al.

considers all feature dimensions. Its expression is the following:
F1 =
C
i=1 ni (m, mi )
C ni (xi , m )
i=1 j=1 j i
23
, and
(4)
where ni denotes the number of samples in class i, is the metric, m is the overall
mean, mi is the mean of class i, and xj represents the sample j belonging to class
i
i. Small values of this measure indicate that classes present strong overlapping.
4. Experimental Framework
To analyse the behaviour of EIS-CHC we include in the study two classical prototype selection algorithms and three advanced methods, which will be described
in Section 4.1. In Subsection 4.2 we present the algorithms parameters and data
sets considered.
4.1. Prototype Selection Algorithms
The classical PS algorithms used in this study are: an edition algorithm (Edited
Nearest Neighbor 32 ) and a boundary conservative or condensation algorithm (Condensed Nearest Neighbor 15 ). The advanced methods used in the comparison are: an
edition method (Edition by Normalized Radial Basis Function 14 ), a condensation
method (Modied Selective Subset 1 ) and a hybrid method, which combines edition and condensation (Decremental Reduction Optimization Procedure 33 ). The
use of edition schemes is motivated by the relevance of the analysis of data sets
with low overlapping, where there are noisy instances inside the classes, not just in
the boundaries. This is a situation where the lter PS algorithms could present an
interesting behaviour. The use of condensation methods has as objective the study
of the eect introduced by PS algorithms which keeps the instances situated in the
boundaries, where the overlapping appears.
Their description is the following:
Edited Nearest Neighbor (ENN) 32 . The algorithm starts with S = T and then
each instance in S is removed if it does not agree with the majority of its k
nearest neighbours. ENN lter is considered the standard noise lter and it
is usually employed at the beginning of many algorithms. The pseudocode of
ENN appears in Algorithm 2.
Condensed Nearest Neighbor (CNN) 15 . It begins by randomly selecting one
instance belonging to each output class from T and putting them in S. Then,
each instance in T is classied using only the instances in S. If an instance is
misclassied, it is added to S, thus ensuring that it will be classied correctly.
This process is repeated until there are no misclassied instances in T . The
pseudocode of CNN appears in Algorithm 3.
October 1, 2008 18:43

Cano-Garcia-
input : Training set of examples T

output: Subset of training examples S
1
2
3
4
5
6
7
S T;
foreach example xi in S do
if xi is misclassied by its k nearest neighbours in S then
S S {xi };
end
end
return S;
Algorithm 2: Pseudocode of ENN algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
S;
fail true;
S S {xc1 , xc2 , ..., xcC }, where xci is any example that belongs to class i;
while fail = true do
fail false;
foreach example xi in T do
if xi is misclassied by using S then
S S {xi };
fail true;
end
end
end
return S;
Algorithm 3: Pseudocode of CNN algorithm
Modied Selective Algorithm (MSS) 1 . Let Ri be the set of all xi in T such

that xj is of the same class of xi and is closer to xi than the nearest neighbor
of xi in T of a dierent class than xi . Then, MSS is dened as that subset of
T containing, for every xi in T , that element of its Ri that is the nearest to a
dierent class than that of xi . An ecient algorithmic representation of MSS
method is depicted in Algorithm 4.
Edition by Normalized Radial Basis Function (ENRBF) 14 . It is an Edition algorithm based on the principles of Normalized Radial Basis Functions (NRBF).
NRBF estimates the probability of c-th class given a vector x and training set
October 1, 2008 18:43

Cano-Garcia-

1
2
3
4
5
6
7
8
9
10
11
12
13
Q T;
Sort the examples {xj }n according to increasing values of enemy distance
j=1
Dj ;
foreach example xi in T do
add false;
foreach example xj in T do
if xj Q and d(xi , xj ) < Dj then
Q Q {xj };
add true;
end
end
if add then S S {xi };
if Q = then return S;
end
Algorithm 4: Pseudocode of MSS algorithm
T:
Gi (x; xi ),
P (c|x, T ) =
(5)
iI c
where I c = {i : (xi , yi ) T yi = c}, and Gi (x; xi ) is dened by
Gi (x; xi ) =
G(x; xi , )
,
G(x; xj , )
(6)
n
j=1
and G(x; xi , ) ( is xed) is dened by G(x; xi , ) = e

ENRBF eliminates all vectors if only:
c=yi P (yi |x, T i ) < P (c|x, T i ).
||xxi ||2
.
(7)
Decremental Reduction Optimization Procedure 3 (DROP3) 33 . Its removal

criterion can be restated as: Remove xi if at least as many of its associates
in T would be classied correctly without xi . Each instance xi in T continues
keeping a list of its k + 1 nearest neighbors in S, even after xi is removed from
S. This means that instances in S have associates that are both in and out of
S, while instances that have been removed from S have no associates. DROP3
changes the order of removal of instances. It initially sorts the instances in S
by the distance to their nearest enemy. Instances are then checked for removal
beginning at the furthest instance from its nearest enemy. Additionally, DROP3
employs a noise lter based on ENN at the beginning of this process.
October 1, 2008 18:43

Cano-Garcia-
10
4.2. Data Sets and Parameters

The experimental study is dened in two aspects: data sets and algorithms parameters. They are as follows:
Data Sets: The data sets used have been collected from the UCI repository 24
and their characteristics appear in Table 1. We consider using a ten fold cross
validation in all data sets.
Table 1. Data Sets.
Instances
Australian
Balanced
Bupa
Car
Cleveland
Contraceptive
Crx
Dermatology
Ecoli
Glass
Haberman
Iris
Led7digit
Lymphography
Monks
New-Thyroid
Penbased
Pima
Vehicle
Wine
Wisconsin
Satimage
Thyroid
Zoo
Features
Classes
689
625
345
1727
297
1472
689
366
336
214
305
150
500
148
432
215
10992
768
846
178
683
6435
7200
100
14
4
7
6
13
9
15
34
7
9
3
4
7
18
6
5
16
8
18
13
9
36
21
16
2
3
2
4
5
3
2
6
2
7
2
3
10
4
2
3
10
2
4
3
2
7
3
7
Parameters: The parameters are chosen considering the authors suggestions in

the literature. For each one of the algorithms there are:
CNN : It has not any parameter to be xed.

ENN : Minimum number of neighbours (k) = 3.
MSS : It has not any parameter to be xed.
ENRBF : = 1.0 and = 1.0.
DROP3 : It has not any parameter to be xed.
EIS-CHC : evaluations=10000, population=50 and =0.5.
The algorithms were run three times for each partition in the ten fold cross
validation. The measure F1 is calculated by averaging the F1 obtained in each
training set of the ten fold cross validation.
October 1, 2008 18:43

Cano-Garcia-
11
5. Results and Analysis

This section contains the results and their statistical analysis, considering dierent
values of k for the k -NN classier to study the eect of the data complexity and
the prototype selection.
We present the results obtained after the evaluation of the data sets by the
prototype selection algorithms. The results for 1-NN, 3-NN and 5-NN are presented
in Tables 3, 5 and 7 respectively, whose structure is the following:
In the rst column we oer the name of the data sets, ordered increasingly
considering measure F1.
The second column contains measure F1 computed for the data set in increasing
order.
The third column shows the mean test accuracy rate oered by the k -NN
classier in each data set.
The following columns present the mean test accuracy rate and the mean reduction rate oered by CNN, ENN, MSS, ENRBF, DROP3 and EIS-CHC respectively.
In Tables 3, 5 and 7, the values in bold indicate that the test accuracy rates are
equal to or higher than the ones oered by the k -NN (1-NN, 3-NN or 5-NN) using
the whole data set (that is, without a previous PS process). The separation line in
the tables, xed in F1=0.410, is based on the previous works of Mollineda et. al 23
used as reference.
Associated to each table, we have included a statistical study based on Wilcoxons test (see Appendix A to nd its description) to analyse the behaviour of the
algorithms. This test allows us to establish a comparison over multiple data sets 7 ,
considering those delimited by the F1 measure. Tables 2, 4 and 6 show the statistical results correspoding to 1-NN, 3-NN and 5-NN, respectively. The structure of
these tables is the following:
The rst column indicates the result of the test considering a level of signicance
= 0.10. With the symbol > we represent that the rst algorithm outperforms
the second one. The symbol = denotes that both algorithms behave equally
and nally in the < case, the rst algorithm is worse than the second one.
The second column is the sum of the rankings associated to the rst algorithm
(see Appendix A for more details).
The third column is the sum of the rankings related to the second algorithm.
The fourth column shows the p-value.
In the following subsections we present the results for the case of 1-NN (Subsection 5.1), 3-NN (Subsection 5.2) and nally 5-NN (Subsection 5.3).
5.1. Results and Analysis for the 1-Nearest Neighbour classier
Tables 2 and 3 present the results of the 1-NN case.
October 1, 2008 18:43

Cano-Garcia-
12
The analysis of Tables 2 and 3 is the following:

F1 low [0,0.410] which represents strong overlapping: The evolutionary algorithm (EIS-CHC) outperforms 1-NN when F1 is low. EIS-CHC presents the
best accuracy rates among all the PS algorithms in most of the data sets with
the strongest overlapping. Wilcoxons test supports this observation (Table 2).
F1 high [0.410,...], being small overlapping: There is not any improvement of
PS algorithms with respect to Without PS, as statistical results indicate. The
benet of the use of the PS algorithms in these kind of data sets using the 1-NN
is the reduction of the size of the data set. Only ENN and EIS-CHC obtain
the same performance as not using PS. The comparison between EIS-CHC and
the rest of the models indicates that the accuracy of EIS-CHC is always better
than or equal to that of the method compared (Table 2).
Considering the results that CNN and MSS present in Table 3 we must point
that the PS algorithms which keep boundary instances (condensation methods)
notably aect the classication capabilities of the 1-NN classier, independently of
the overlapping of the data set. DROP3 obtains a performance similar to that of
not using PS, due to the fact that it integrates a noise lter in its denition.
Paying attention to the relation between F1 and the behaviour of EIS-CHC,
we can point out that the use of this measure can help us to decide when the use
of EIS-CHC improves the accuracy rates of 1-NN classier in a concrete data set,
previously to its execution.
Table 2. Wilcoxon test over 1-NN

WILCOXON
1-NN > CNN
1-NN < ENN
1-NN > MSS
1-NN = ENRBF
1-NN = DROP3
1-NN < EIS-CHC
EIS-CHC > CNN
EIS-CHC > ENN
EIS-CHC > MSS
EIS-CHC > ENRBF
EIS-CHC > DROP3
1-NN with F1 < 0.410.

R+
Rp-value
46
9
0.059
5
50
0.022
53
2
0.009
12
33
0.214
36
9
0.11
1
54
0.007
55
0
0.005
55
0
0.005
55
0
0.005
47
8
0.047
55
0
0.005
1-NN with F1 >

WILCOXON
1-NN > CNN
1-NN = ENN
1-NN > MSS
1-NN > ENRBF
1-NN > DROP3
1-NN = EIS-CHC
EIS-CHC = CNN
EIS-CHC = ENN
EIS-CHC = MSS
EIS-CHC > ENRBF
EIS-CHC > DROP3
0.410.
R+
98
45.5
89.5
82.5
97
57
75
48
61
99
95
R7
59.5
15.5
22.5
8
48
30
57
44
6
10
p-value
0.004
0.6
0.023
0.046
0.005
0.778
0.158
0.778
0.594
0.004
0.008
Data Set
thyroid
lymphography
bupa
haberman
pima
contraceptive
cleveland
crx
australian
monks
balanced
dermatology
vehicle
new-thyroid
glass
ecoli
zoo
car
penbased
led7digit
wisconsin
satimage
wine
iris
0.035
0.051
0.166
0.169
0.217
0.224
0.235
0.285
0.287
0.365
0.455
0.473
0.506
0.731
0.745
0.888
0.949
1.022
1.161
1.344
1.354
1.476
1.820
2.670
F1
Accur.
1-NN
0.9258
0.7387
0.6108
0.6697
0.7033
0.4277
0.5314
0.7957
0.8145
0.7791
0.7904
0.9535
0.7010
0.9723
0.7361
0.8070
0.9281
0.8565
0.9935
0.4020
0.9557
0.9058
0.9552
0.9333
Accur.
CNN
0.9007
0.7052
0.5907
0.6465
0.6525
0.4128
0.5052
0.7913
0.8043
0.8252
0.7168
0.9454
0.6655
0.9487
0.6973
0.7532
0.8289
0.8825
0.9853
0.3740
0.9342
0.8835
0.9157
0.9200
0.8060
0.5376
0.4180
0.4622
0.5080
0.2679
0.3905
0.6692
0.6441
0.6970
0.6345
0.8689
0.5035
0.8646
0.5146
0.5989
0.7579
0.7658
0.9555
0.2727
0.8965
0.8041
0.8308
0.8519
Red.
Accur.
ENN
0.9379
0.7547
0.5859
0.6990
0.7449
0.4528
0.5576
0.8449
0.8377
0.7817
0.8560
0.9591
0.6963
0.9494
0.6919
0.8248
0.9114
0.8513
0.9927
0.4920
0.9657
0.9013
0.9552
0.9533
0.0608
0.2146
0.3775
0.3032
0.2617
0.5498
0.4554
0.2013
0.1514
0.0411
0.1559
0.0528
0.2931
0.0579
0.3214
0.1958
0.0815
0.0775
0.0065
0.5660
0.0529
0.0890
0.0337
0.0474
Red.
Accur.
MSS
0.9161
0.7436
0.5965
0.6405
0.6797
0.4196
0.5282
0.7841
0.8043
0.7445
0.7825
0.9372
0.6868
0.9723
0.7206
0.7800
0.8467
0.8455
0.9913
0.4020
0.9485
0.9009
0.9605
0.9467
0.6998
0.4084
0.2055
0.3678
0.3385
0.1621
0.3099
0.4976
0.5074
0.3274
0.3755
0.7450
0.3740
0.7829
0.3905
0.4726
0.7834
0.5489
0.9051
0.1629
0.8210
0.6841
0.7247
0.7844
Red.
Accur.
ENRBF
0.9258
0.7609
0.5789
0.7353
0.6511
0.4481
0.5644
0.8551
0.8609
0.7619
0.8559
0.9618
0.5746
0.6981
0.3956
0.4256
0.9281
0.7002
0.8834
0.4580
0.9414
0.7302
0.9663
0.9400
Table 3. Results considering the 1-NN Classier.
0.0742
0.1546
0.4203
0.2647
0.3490
0.5500
0.4404
0.1417
0.1398
0.2059
0.1150
0.0404
0.5074
0.3023
0.6774
0.5744
0.0571
0.2998
0.1902
0.2609
0.0769
0.4395
0.0543
0.1267
Red.
Accur.
DROP3
0.8581
0.7674
0.5986
0.6697
0.6901
0.4406
0.4947
0.7841
0.8014
0.6571
0.8177
0.9293
0.6099
0.9491
0.6571
0.7263
0.9264
0.6221
0.9431
0.4060
0.8913
0.8308
0.9105
0.9267
0.9566
0.8356
0.7005
0.8003
0.8213
0.7307
0.8317
0.8779
0.8847
0.8670
0.8676
0.9232
0.7727
0.8853
0.7415
0.8423
0.8351
0.8834
0.9719
0.8278
0.9747
0.9101
0.9101
0.9230
Red.
Accur.
EIS-CHC
0.9406
0.7938
0.6058
0.7154
0.7501
0.5180
0.6173
0.8522
0.8420
0.9727
0.8929
0.9617
0.6147
0.9541
0.6227
0.8039
0.9142
0.8802
0.9560
0.6580
0.9642
0.8710
0.9438
0.9733
0.9991
0.9572
0.9752
0.9855
0.9871
0.9918
0.9864
0.9915
0.9918
0.9915
0.9883
0.9736
0.9757
0.9767
0.9507
0.9653
0.9111
0.9897
0.9871
0.9649
0.9941
0.9950
0.9732
0.9674
Red.
October 1, 2008
18:43
Cano-Garcia-Herrera-Bernado-IJPRAI
13
October 1, 2008 18:43

Cano-Garcia-
14

Tables 4 and 5 present the results of the PS methods with the 3-NN classier.

WILCOXON
3-NN > CNN
3-NN = ENN
3-NN = MSS
3-NN = ENRBF
3-NN > DROP3
3-NN < EIS-CHC
EIS-CHC > CNN
EIS-CHC > ENN
EIS-CHC > MSS
EIS-CHC > ENRBF
EIS-CHC > DROP3
3-NN with F1 < 0.410.

R+
Rp-value
53
2
0.009
15.5
39.5
0.26
43
12
0.114
31.5
23.3
0.594
51
4
0.017
6
49
0.028
55
0
0.005
50.5
4.5
0.021
51
4
0.017
47
8
0.047
55
0
0.005
3-NN with F1 > 0.410.

WILCOXON
R+
3-NN > CNN
104.5
3-NN = ENN
63.5
3-NN > MSS
82.5
3-NN > ENRBF
91
3-NN > DROP3
99
3-NN = EIS-CHC
75
EIS-CHC = CNN
54
EIS-CHC = ENN
30
EIS-CHC = MSS
46
EIS-CHC > ENRBF
90
EIS-CHC > DROP3
104
R0.5
41.5
22.5
14
6
30
51
75
59
15
1
p-value
0.001
0.507
0.064
0.016
0.004
0.158
0.925
0.158
0.683
0.019
0.001

F1 low [0,0.410] (strong overlapping): Similarly to the case with the 1-NN,
EIS-CHC outperforms the Without PS when F1 is low. EIS-CHC presents
the best accuracy rates among all the PS algorithms in all data sets with the
strongest overlapping. Wilcoxons test in Table 4 conrms this armation.
F1 high [0.410,...] (small overlapping): The situation is similar to the previous
case. There is not any improvement of PS algorithms with respect to 3-NN, as
the statistical results indicate (see Table 4). Only ENN and EIS-CHC obtain
the rest of the models indicates that the accuracy of EIS-CHC is always better
than or equal to that of the method compared.
Data Set
thyroid
lymphography
bupa
haberman
pima
contraceptive
cleveland
crx
australian
monks
balanced
dermatology
vehicle
new-thyroid
glass
ecoli
zoo
car
penbased
led7digit
wisconsin
satimage
wine
iris
0.035
0.051
0.166
0.169
0.217
0.224
0.235
0.285
0.287
0.365
0.455
0.473
0.506
0.731
0.745
0.888
0.949
1.022
1.161
1.344
1.354
1.476
1.820
2.670
F1
Accur.
3-NN
0.9236
0.7739
0.6066
0.7058
0.7306
0.4495
0.5444
0.8420
0.8478
0.9629
0.8337
0.9700
0.7175
0.9537
0.7011
0.8067
0.9281
0.9231
0.9718
0.4520
0.9600
0.8662
0.9549
0.9400
Accur.
CNN
0.9083
0.7826
0.5845
0.6537
0.6654
0.4447
0.5247
0.8203
0.8203
0.8658
0.7472
0.9511
0.6619
0.9485
0.6493
0.7650
0.8328
0.9010
0.9536
0.4040
0.9542
0.8444
0.9327
0.9400
0.7718
0.5580
0.3649
0.4281
0.5049
0.2475
0.3935
0.6531
0.6581
0.6556
0.6240
0.8607
0.4768
0.8476
0.4762
0.5913
0.6954
0.7662
0.8464
0.2829
0.8894
0.7171
0.8514
0.8415
Red.
Accur.
ENN
0.9250
0.7530
0.6174
0.7125
0.7384
0.4583
0.5447
0.8449
0.8478
0.9567
0.8768
0.9619
0.6927
0.9307
0.6616
0.8126
0.9114
0.8930
0.9618
0.5460
0.9657
0.8646
0.9549
0.9533
0.0759
0.2146
0.3775
0.3032
0.2617
0.5498
0.4448
0.1560
0.1514
0.0411
0.1559
0.0310
0.2931
0.0579
0.3214
0.1958
0.0815
0.0775
0.0287
0.5660
0.0337
0.1322
0.0337
0.0474
Red.
Accur.
MSS
0.9360
0.7589
0.6112
0.6959
0.7176
0.4528
0.5346
0.8304
0.8348
0.9613
0.8304
0.9457
0.6809
0.9394
0.6650
0.7767
0.7811
0.9173
0.9914
0.4320
0.9600
0.9061
0.9552
0.9333
0.6998
0.4084
0.2055
0.3678
0.3385
0.1621
0.3099
0.4976
0.5074
0.3274
0.3755
0.7450
0.3740
0.7829
0.3905
0.4726
0.7834
0.5489
0.9051
0.1629
0.8210
0.6841
0.7247
0.7844
Red.
Accur.
ENRBF
0.9258
0.7530
0.5789
0.7353
0.6511
0.4522
0.5677
0.8522
0.8478
0.8143
0.8768
0.9646
0.5569
0.6981
0.3842
0.4256
0.9114
0.7002
0.8799
0.4300
0.9471
0.7316
0.9719
0.9467
0.0742
0.1546
0.4203
0.2647
0.3490
0.5500
0.4404
0.1417
0.1398
0.2059
0.1150
0.0404
0.5074
0.3023
0.6774
0.5744
0.0571
0.2998
0.1902
0.2609
0.0769
0.4395
0.0543
0.1267
Red.
Accur.
DROP3
0.8056
0.7053
0.6315
0.6500
0.7176
0.4542
0.4688
0.7377
0.7783
0.6957
0.8193
0.8987
0.5545
0.8799
0.5764
0.6636
0.8436
0.6887
0.8765
0.5320
0.9099
0.7706
0.9154
0.8467
0.9812
0.8679
0.7649
0.8885
0.8562
0.7442
0.8775
0.9274
0.9293
0.8678
0.9090
0.9262
0.8164
0.9189
0.8105
0.8714
0.7859
0.8865
0.9783
0.8331
0.9777
0.9367
0.9207
0.9326
Red.
Accur.
EIS-CHC
0.9250
0.8034
0.6524
0.7219
0.7684
0.4875
0.5643
0.8536
0.8681
0.9376
0.9009
0.9429
0.6087
0.9584
0.6267
0.7534
0.9006
0.8409
0.8700
0.6900
0.9685
0.8164
0.9438
0.9600
0.9983
0.9535
0.9755
0.9862
0.9860
0.9911
0.9802
0.9905
0.9897
0.9830
0.9860
0.9505
0.9634
0.9592
0.9465
0.9583
0.8813
0.9853
0.9568
0.9509
0.9908
0.9666
0.9457
0.9333
Red.
October 1, 2008
18:43
15
October 1, 2008 18:43

Cano-Garcia-
16

WILCOXON
5-NN > CNN
5-NN = ENN
5-NN > MSS
5-NN = ENRBF
5-NN > DROP3
5-NN < EIS-CHC
EIS-CHC > CNN
EIS-CHC > ENN
EIS-CHC > MSS
EIS-CHC > ENRBF
EIS-CHC > DROP3
5-NN with F1 < 0.410.

R+
Rp-value
55
0
0.005
25
30
0.799
40
5
0.038
41
14
0.169
40
5
0.038
9
46
0.059
55
0
0.005
52.5
2.5
0.011
49
6
0.028
52.5
2.5
0.011
55
0
0.005
5-NN with F1 > 0.410.

WILCOXON
R+
5-NN > CNN
102.5
5-NN = ENN
76.5
5-NN > MSS
87
5-NN > ENRBF
93
5-NN > DROP3
100
5-NN = EIS-CHC
75.5
EIS-CHC = CNN
50.5
EIS-CHC = ENN
26.5
EIS-CHC = MSS
56
EIS-CHC > ENRBF
86
EIS-CHC > DROP3
104
R2.5
28.5
18
12
5
29.5
54.5
78.5
49
19
1
p-value
0.002
0.158
0.03
0.011
0.003
0.136
0.9
0.116
0.826
0.035
0.001
Note that when k = 3, the nearest neighbour classier is more robust in presence
of noise than the 1-NN classier. Due to this fact, the ENN and ENRBF lters
behave similarly to the 3-NN when F1 is lower than 0.410, according to Wilcoxons
test. The same eect occurs in DROP3. However, a PS process by EIS-CHC prior
to the 3-NN classier improves the accuracy of the classier without using PS and
also achieves a high reduction of the subset selected.
Tables 6 and 7 present the results of the PS methods with the 5-NN classier.
F1 low [0,0.410] (strong overlapping): EIS-CHC outperforms the Without PS
when F1 is low. EIS-CHC presents the best accuracy rates among all the PS
algorithms in most of the data sets with the strongest overlapping (Table 7).
Considering Wilcoxons test in Table 6, only EIS-CHC improves the classication capabilities of 5-NN which reects the proper election of the most
representative instances in presence of overlapping.
F1 high [0.410,...] (small overlapping): The situation is similar to the previous
case. There is not any improvement of PS algorithms with respect to 5-NN, as
the statistical results indicate (see Table 6). Only ENN and EIS-CHC obtain
the rest of models indicates that indicates that the accuracy of EIS-CHC is
always better than or equal to that of the method compared.
In this case, ENN and ENRBF obtain a result similar to the previous subsection
(3-NN case), where F1 is low, but again EIS-CHC oers a signicant improvement
in accuracy with respect to the use of the nearest neighbours classier without using
PS.
Data Set
thyroid
lymphography
bupa
haberman
pima
contraceptive
cleveland
crx
australian
monks
balanced
dermatology
vehicle
new-thyroid
glass
ecoli
zoo
car
penbased
led7digit
wisconsin
satimage
wine
iris
0.035
0.051
0.166
0.169
0.217
0.224
0.235
0.285
0.287
0.365
0.455
0.473
0.506
0.731
0.745
0.888
0.949
1.022
1.161
1.344
1.354
1.476
1.820
2.670
F1
Accur.
5-NN
0.9292
0.7944
0.6131
0.6695
0.7306
0.4685
0.5545
0.8551
0.8478
0.9475
0.8624
0.9646
0.7175
0.9398
0.6685
0.8127
0.9364
0.9520
0.9618
0.4140
0.9657
0.8740
0.9605
0.9600
Accur.
CNN
0.9056
0.7931
0.6101
0.6402
0.7044
0.4542
0.5382
0.8232
0.8159
0.8523
0.8080
0.9481
0.6903
0.9400
0.6531
0.7799
0.8097
0.9323
0.9455
0.3860
0.9500
0.8412
0.9605
0.9533
0.7810
0.5736
0.3304
0.4230
0.4899
0.2383
0.3748
0.6659
0.6591
0.6168
0.6574
0.8443
0.4641
0.8439
0.4429
0.6035
0.6044
0.7748
0.8195
0.2878
0.8886
0.7099
0.8421
0.8356
Red.
Accur.
ENN
0.9250
0.7796
0.6137
0.7288
0.7396
0.4725
0.5676
0.8507
0.8609
0.8855
0.8928
0.9592
0.6822
0.9119
0.6652
0.8065
0.8964
0.9016
0.9482
0.5520
0.9671
0.8708
0.9605
0.9600
0.0716
0.2079
0.3910
0.3061
0.2626
0.5374
0.4488
0.1433
0.1588
0.0432
0.1351
0.0316
0.2871
0.0646
0.3453
0.1812
0.0708
0.0475
0.0376
0.5860
0.0296
0.1317
0.0418
0.0430
Red.
Accur.
MSS
0.9396
0.7922
0.6045
0.6594
0.7306
0.4562
0.5446
0.8435
0.8304
0.9263
0.8496
0.9318
0.6844
0.9165
0.6067
0.7889
0.7536
0.9196
0.9864
0.3940
0.9686
0.9050
0.9663
0.6267
0.6998
0.4084
0.2055
0.3678
0.3385
0.1621
0.3099
0.4976
0.5074
0.3274
0.3755
0.7450
0.3740
0.7829
0.3905
0.4726
0.7834
0.5489
0.9051
0.1629
0.8210
0.6841
0.7247
0.7844
Red.
Accur.
ENRBF
0.9258
0.7801
0.5789
0.7353
0.6511
0.4535
0.5711
0.8536
0.8435
0.8013
0.8831
0.9619
0.5592
0.6981
0.3609
0.4256
0.9197
0.7002
0.8739
0.4180
0.9485
0.7329
0.9719
0.9400
0.0742
0.1546
0.4203
0.2647
0.3490
0.5500
0.4404
0.1417
0.1398
0.2059
0.1150
0.0404
0.5074
0.3023
0.6774
0.5744
0.0571
0.2998
0.1902
0.2609
0.0769
0.4395
0.0543
0.1267
Red.
Accur.
DROP3
0.7901
0.6983
0.6137
0.7019
0.7187
0.4685
0.5222
0.7014
0.7188
0.7490
0.8013
0.8772
0.5497
0.9119
0.5813
0.6620
0.7125
0.7066
0.8551
0.4820
0.8624
0.7416
0.9157
0.9133
0.9842
0.8904
0.8039
0.9346
0.8866
0.7690
0.8969
0.9406
0.9499
0.8382
0.9234
0.9250
0.8240
0.9173
0.8172
0.8810
0.7506
0.8749
0.9774
0.8347
0.9777
0.9473
0.8976
0.9119
Red.
Accur.
EIS-CHC
0.9250
0.8423
0.6464
0.7353
0.7671
0.4820
0.6075
0.8594
0.8580
0.8959
0.8879
0.9426
0.6123
0.9680
0.6331
0.7351
0.8717
0.8368
0.8500
0.6500
0.9657
0.8289
0.9660
0.9600
0.9978
0.9369
0.9710
0.9960
0.9854
0.9909
0.9725
0.9874
0.9865
0.9784
0.9838
0.9375
0.9567
0.9385
0.9216
0.9527
0.8470
0.9819
0.9432
0.9416
0.9876
0.9603
0.9295
0.9148
Red.
October 1, 2008
18:43
17
October 1, 2008 18:43

Cano-Garcia-
18
5.4. Summary of the Analysis

Considering the previous results and analysis we can present as summary the following comments:
Independently of the k value selected for the nearest neighbours classier, when
the overlapping of the initial data set is strong (it presents low values of F1)
EIS-CHC is a very eective PS algorithm to improve the accuracy rates of the
nearest neighbours classier.
When the overlapping of the data set is low, the statistical test has shown
that the PS algorithms are not capable of improving the accuracy of the kNN without using PS. The benets of their use is that they keep the accuracy
capabilities of the nearest neighbours classier, reducing the initial data set
size.
Considering the results that CNN and MSS present, we must point out that the
PS algorithms which keep boundary instances (condensation methods) notably
aect the classication capabilities of the k -NN classier, independently of the
overlapping of the data set and the value of k.
In the classical algorithms, the best behaviour corresponds to ENN. The lter
process that ENN introduces outperforms in some cases the classication capabilities of the k -NN, but the election of the most representative prototypes
that EIS-CHC develops seems to be the most eective strategy. Nevertheless,
ENN in combination with k-NN obtains similar results to k-NN when k 3,
given that the nearest neighbours classier is more robust in presence of noise.
In the most advanced algorithms, the behaviour coincides in most of the cases
with the equivalent in the classic algorithms; MSS behaves very similarly to
CNN and ENRBF to ENN. DROP3, as a hybrid model alike EIS-CHC, obtains
an intermediate behavior between condensation and edition methods, because
it performs adequately when strong overlapping is presented when considering
1-NN. Nevertheless, EIS-CHC always outperforms DROP3 in any case.
Paying attention to the relation between F1 and the behaviour of EIS-CHC,

we can point out that the use of this measure can help us decide when the use
of EIS-CHC improves the accuracy rates of k -NN classier in a concrete data set,
previously to its execution.
With these results in mind, we could analyse the F1 measure in a new data set
and if it is small (F1 between [0,0.410)), we can use EIS-CHC as PS method to
improve the accuracy rate of the k -NN classier. When F1 is greater than 0.410,
EIS-CHC oers interesting behaviour, with accuracy equivalent to the obtained
without reduction as Wilcoxons test indicates, but with reduction rates larger
than 90% in most of the data sets.
October 1, 2008 18:43

Cano-Garcia-
19
6. Concluding Remarks
This paper addresses the analysis of the evolutionary prototype selection considering a complexity data set measure based on overlapping, with the objective of
predicting when the evolutionary prototype selection is eective for a particular
problem.
An experimental study has been carried out using data sets from dierent domains and comparing the results with classical PS algorithms, having the F1 measure as reference. To extend the analysis of the k -NN classier we have considered
dierent values of k. The main conclusions reached are the following:
EIS-CHC presents the best accuracy rate when the input data set has strong
overlapping, even improving condensation algorithms (CNN and MSS), edition
schemes (ENN and ENRBF) and hybrid methods, such as DROP3.
EIS-CHC improves the classication accuracy of k-NN when the data sets have
strong overlapping, independently of the k value, and obtains a high reduction
rate of the data. However, ENN, ENRBF and DROP3 algorithms are not able
to improve the accuracy rate of k-NN when k 3.
In the case of data sets with low overlapping, the results of the PS algorithms
are not conclusive so none of them can be suggested considering accuracy rates. Therefore, their use is recommended to keep the accuracy capabilities by
reducing the initial data set size.
Condensation algorithms, which keep the boundaries (CNN and MSS), have
normally shown negative eects on the accuracy of the k -NN classier.
As we have indicated in the analysis section, the use of this measure can help
us to evaluate a data set previously to the evolutionary PS process and decide if it
is adequate or not to improve the classication capabilities of the k -NN classier.
The results show that when F1 is low (strong overlapping), the best accuracy
rates appear using EIS-CHC, while when F1 is high (low overlapping), the PS
algorithms do not guarantee an accuracy improvement.
As future works, the analysis of the eect of data complexity on evolutionary instance selection for training set selection considering other well-known classication
algorithms will be studied. Another interesting research line is the measurement
of data complexity on imbalanced data sets when we can perform evolutionary
under-sampling 12 .
Appendix A. Wilcoxon Signed Rank Test
Wilcoxons test is used for answering this question: do two samples represent two
dierent populations? It is a non-parametric procedure employed in a hypothesis
testing situation involving a design with two samples. It is the analogous of the paired t-test in non-parametrical statistical procedures; therefore, it is a pairwise test
that aims to detect signicant dierences between the behaviour of two algorithms.
The null hypothesis for Wilcoxons test is H0 : D = 0; in the underlying
October 1, 2008 18:43

Cano-Garcia-
20
populations represented by the two samples of results, the median of the dierence
scores equals zero. The alternative hypothesis is H1 : D = 0, but H1 : D > 0 or
H1 : D < 0 can also be used as directional hypothesis.
In the following, we describe the tests computations. Let di be the dierence
between the performance scores of the two algorithms on i -th out of N data sets.
The dierences are ranked according to their absolute values; average ranks are
assigned in case of ties. Let R+ be the sum of ranks for the data sets on which the
second algorithm outperformed the rst, and R the sum of ranks for the opposite.
Ranks of di = 0 are split evenly among the sums; if there is an odd number of them,
one is ignored:
R+ =
rank(di ) +
di >0
R =
rank(di ) +
di <0
1
2
1
2
rank(di )
di =0
rank(di )
di =0
Let T be the smallest of the sums, T = min(R+ , R ). If T is less than or equal

to the value of the distribution of Wilcoxon for N degrees of freedom (Table B.12
in 35 ), the null hypothesis of equality of means is rejected.
Obtaining the p-value associated to a comparison is performed by means of the
normal approximation for the Wilcoxon T statistic (Section VI, Test 18 in 29 ).
Furthermore, the computation of the p-value for this test is usually included in
well-known statistical software packages (SPSS, SAS, R, etc.).
References
1. R. Barandela, F.J. Ferri and J.S. Snchez,Decision boundary preserving prototype
a
selection for nearest neighbor classication, International Journal of Pattern Recognition and Articial Intelligence 19(6) (2005) 787806.
2. M. Basu and T.K. Ho, Data Complexity in Pattern Recognition, Springer, 2006.
3. R. Baumgartner and R.L. Somorjai, Data complexity assessment in undersampled classication of high-dimensional biomedical data, Pattern Recognition Letters
27(12) (2006) 13831389.
4. E. Bernad-Mansilla and T.K. Ho, Domain of Competence of XCS Classier System
o
in Complexity Measurement Space, IEEE Transactions on Evolutionary Computation
9(1) (2005) 82104.
5. J.-R. Cano, F. Herrera and M. Lozano, Using Evolutionary Computation as Instance
Selection for Data Reduction in KDD: An Experimental Study, IEEE Transactions
on Evolutionary Computation 7(6) (2003) 561575.
6. L.P. Corella, C. De stefano and F. Fontanella, Evolutionary prototyping for handwriting recognition, International Journal of Pattern Recognition and Articial Intelligence 21(1) (2007) 157178.
7. J. Demsar, Statistical Comparison of Classiers over Multiple Data Sets, Journal of
Machine Learning Research 7 (2006) 130.
8. M. Dong and R. Kothari, Feature subset selection using a new denition of classiability, Pattern Recognition Letters 24(9-10) (2003) 12151225.
October 1, 2008 18:43

Cano-Garcia-
21
9. A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computation, Springer Verlag,
2003.
10. L.J. Eshelman, The CHC adaptive search algorithm: How to have safe search when
engaging in nontraditional genetic recombination, in Foundations of Genetic Algorithms 1, Rawlins, G.J.E. (Eds.), Morgan Kauman (1991) 265283.
11. C. Gagn and M. Parizeau, Co-evolution of nearest neighbor classiers, Internatioe
nal Journal of Pattern Recognition and Articial Intelligence 21(5) (2007) 912946.
a
12. S. Garc and F. Herrera, Evolutionary Under-Sampling for Classication with Imbalanced Data Sets: Proposals and Taxonomy, Evolutionary Computation, in press
(2008).
13. D.E. Goldberg, Genetic algorithms in search, optimization, and machine learning,
Addison-Wesley, 1989.
14. M. Grochowski and N. Jankowski, Comparison of instance selection algorithms II.
Results and Comments, Proceedings of the 7th International Conference on Articial
Intelligence and Soft Computing, Lecture Notes in Computer Science 3070 (2004) pp.
580585.
15. P.E. Hart, The Condensed Nearest Neighbour Rule, IEEE Trans. on Information
Theory 14 (1968) 515516.
16. T.K. Ho and H.S. Baird, Large-Scale Simulation Studies in Image Pattern Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(10) (1997)
10671079.
17. T.K. Ho and M. Basu, Complexity Measures of Supervised Classication Problems,
IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3) (2002) 289
300.
18. S.W. Kim and B.J. Oommen, Enhancing Prototype Reduction Schemes with LVQ3Type Algorithms, Pattern Recognition 36 (2004) 10831093.
19. S.W. Kim and B.J. Oommen, On using prototype reduction schemes to optimize
kernel-based nonlinear subspace methods, Pattern Recognition 37 (2004) 227239.
20. L. Kuncheva, Editing for the k-nearest neighbors rule by a genetic algorithm, Pattern Recognition Letters 16 (1995) 809-814.
21. Y.-H. Li, M. Dong and R. Kothari, Classiability-based omnivariate decision trees,
IEEE Transactions On Neural Networks 16(6) (2005) 1547-1560.
22. M. Liwicki and H. Bunke, Handwriting recognition of whiteboard notes - studying the
inuence of training set size and type, International Journal of Pattern Recognition
and Articial Intelligence 21(1) (2007) 8398.
23. R.A. Mollineda, J.S. Snchez and J.M. Sotoca, Data Characterization for Eective
a
Prototype Selection, Proceedings of the IbPRIA 2005, Lecture Notes in Computer
Science 3523 (2005) pp. 2734.
24. A.
Asuncion
and
D.J.
Newman,
UCI
Machine
Learning Repository, [http://www.ics.uci.edu/mlearn/MLRepository.html]. Irvine,
CA: University of California, Schools of Information and Computer Science, 2007.
25. I.S. Oh, J.S. Lee and B.R. Moon, Hybrid genetic algorithms for feature selection,
IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11) (2004) 1424
1437.
26. A.N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A database perspective, Springer Verlag, 2004.
27. X. Qiu and L. Wu, Nearest neighbor discriminant analysis, International Journal
of Pattern Recognition and Articial Intelligence 20(8) (2006) 12451259.
a
28. J.S. Snchez, R.A. Mollineda and J.M. Sotoca, An analysis of how training data
complexity aects the nearest neighbours classiers, Pattern Analysis & Applications
October 1, 2008 18:43

Cano-Garcia-
22
10 (2007) 189201.
29. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Second Edition, Chapman and Hall, 2003.
30. S. Singh, Multiresolution estimates of classication complexity, IEEE Transactions
On Pattern Analysis And Machine Intelligence 25(12) (2003) 1534-1539.
31. H. Shinn-Ying, L. Chia-Cheng and L. Soundy, Design of an optimal nearest neighbour classier using an intelligent genetic algorithm, Pattern Recognition Letter
23(13) (2002) 14951503.
32. D.L. Wilson, Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,
IEEE Transactions on Systems, Man and Cybernetics 2(3) (1972) 408421.
33. D.R. Wilson and T.R. Martinez, Reduction techniques for instance-based learning
algorithms, Machine Learning 38 (2000) 257268.
34. B. Yang, X. Su and Y. Wang, Distributed learning based on chips for classication
with large-scale dataset, International Journal of Pattern Recognition and Articial
Intelligence 21(5) (2007) 899-920.
35. J.H. Zar, Biostatistical Analysis (4th Edition), Pearson, 1999.
Biographical Sketch and Photo

Upon acceptance of an article, a brief biographical sketch and photograph of each
author are to be supplied to the Publisher.
2.3.
Bajo-Muestreo Evolutivo para la Clasicacin en Problemas

o
no Balanceados: Propuestas para Aprendizaje basado en Instancias y Seleccin de Conjuntos de Entrenamiento - Evoluo
tionary Under-Sampling for Classication with imbalanced
Data Sets: Proposals for Instance-Based Learning and Training Set Selection
2.3.1.
Evolutionary Under-Sampling for Classication with Imbalanced Data

Sets: Proposals and Taxonomy
S. Garc F. Herrera, Evolutionary Under-Sampling for Classication with Imbalanced Data

a,
Sets: Proposals and Taxonomy. Evolutionary Computation, in press (2008).
Estado: Aceptado

Area de Conocimiento: Computer Science, Theory and Methods. Ranking 14 / 79.
ENTRADA: 2 mensajes no ledo(s) - salvagl@decsai.ugr.es - 10/07/200...
1 de 1
https://decsai.ugr.es/cgi-bin/openwebmail/openwebmail-read.pl?session...
Fecha: Mon, 11 Feb 2008 17:49:39 +0100

Remitente: ecj@lri.fr
Responder a: Marc.Schoenauer@lri.fr, ecj@lri.fr
Destinatario: salvagl@decsai.ugr.es, herrera@decsai.ugr.es
Asunto: ECJ Status of paper 350
Dear Salvador Garca and Francisco Herrera
We are happy to inform you that your paper with Id 350, entitled "Evolutionary Under-Sampling for Classification with
Imbalanced Data Sets: Proposals and Taxonomy", submitted to ECJ, is now definitely accepted for publication.
To submit this final version, you must access the http://ecj.lri.fr/SubmitPaper.php
upload interface and enter your id and password:
Paper id: 350
Password: 64c976.
Please follow the instructions found at the ECJ site
http://ecj.lri.fr in order to prepare your final version. Note in particular that your final version must be written using
LaTeX (you can find all LaTeX style files at ftp://ecj.inria.fr), and that we will need the full LaTeX sources (including
figures and bibliography file if any) in order to be able to recompile your paper.
Congratulations for the acceptation of your paper.
Best regards,
Marc Schoenauer,
Editor in chief, Evolutionary Computation
10/07/2008 01:41
Evolutionary Under-Sampling for Classication

with Imbalanced Data Sets: Proposals and
Taxonomy
Salvador Garca
Department of Computer Science and Articial Intelligence, University of Granada,
Francisco Herrera
Department of Computer Science and Articial Intelligence, University of Granada,
Abstract
Learning with imbalanced data is one of the recent challenges in machine learning.
Various solutions have been proposed in order to nd a treatment for this problem,
such as modifying methods or application of a preprocessing stage. Within the preprocessing focused on balancing data, two tendencies exist: reduce the set of examples
(under-sampling) or replicate minority class examples (over-sampling).
Under-Sampling with imbalanced data sets could be considered as a prototype selection procedure with the purpose of balancing data sets to achieve a high classication
rate, avoiding the bias towards majority class examples.
Evolutionary algorithms have been used for classical prototype selection showing
good results, where the tness function is associated to the classication and reduction
rates. In this paper, we propose a set of methods called Evolutionary Under-Sampling
which take into consideration the nature of the problem and use different tness functions for getting a good trade-off between balance of distribution of classes and performance. The study includes a taxonomy of the approaches and an overall comparison
among our models and state-of-the-art under-sampling methods. The results have
been contrasted by using non-parametric statistical procedures and show that evolutionary under-sampling outperforms the non-evolutionary models when the degree of
imbalance is increased.
Keywords
Classication, class imbalance problem, under-sampling, prototype selection, evolutionary algorithms
1 Introduction
In the last years, the class imbalance problem is one of the emergent challenges in machine learning. The problem appears when the data presents a class imbalance, which
consists of containing many more examples of one class than the other one (Chawla
et al., 2004; Xie and Qiu, 2007). Many applications have appeared in learning with imbalanced domains, such as fraud detection, intrusion detection (Cieslak et al., 2006),
biological and medical identication (Cohen et al., 2006), etc.
Usually, the instances are grouped into two classes: the majority or negative class,
and the minority or positive class. The latter class, in imbalanced domains, usually represents a concept with the same or greater interest than the negative class. A standard
c 200X by the Massachusetts Institute of Technology
Evolutionary Computation x(x): xxx-xxx
S. Garca and F. Herrera
classier might ignore the importance of the minority class because its representation
inside the data set is not strong enough. As a classical example, if the ratio of imbalance
presented in the data is 1:100 (that is, there is one positive instance versus one hundred
negatives), the error of ignoring this class is only 1%.
A large number of approaches have been proposed to deal with the class imbalance
problem. These approaches can be divided into three groups, depending on the way
they work:
At the algorithmic level, they are the internal approaches. This group of methods
tries to adapt the decision threshold to impose a bias on the minority class or to
improve the prediction performance by adjusting weights for each class. In any
case, they are based on modifying previous algorithms or making specic proposals for dealing with imbalanced data (Grzymala-Busse et al., 2005; Huang et al.,
2006). Recently, in the eld of evolutionary learning, some studies have been pre
sented analyzing the behavior of XCS (Bernado-Mansilla and Garrell-Guiu, 2003;
Kovacs and Kerber, 2006; Butz et al., 2006) for imbalanced data sets (Orriols-Puig
and Bernado-Mansilla, 2006).

At the data level, they are the external approaches. This group of methods does
not consist of modifying existing algorithms, on the contrary they consist of resampling the data in order to decrement the effect caused by the imbalance of
data (Batista et al., 2004). The main advantage of these techniques is that they
are independent of the classier used, so they can be considered as preprocessing
approaches. A preliminary study by using Evolutionary Algorithms (EAs) as resampling for balancing data is found in (Garca et al., 2006), where we proposed
a new evolutionary method. This study provides a method that uses a tness
function designed for performing a prototype selection process with the aim of
balancing data, improving the generalization capability and reducing the training
data.
Combining the data and the algorithmic level, they are the boosting approaches.
This set is composed by methods which consist of ensembles with the objective of
improving the performance of weak classication algorithms. In boosting, the performance of weak classiers is improved by means of focusing on hard examples
which are difcult to classify. These approaches learn a way of combining several
classiers by using weighted examples, in order to increase the attention to hard
examples. Thus, they preprocess the data through the incorporation of weights.
In imbalanced data, as well as handling weights associated to each hard example,
is also used the replication of minority class instances. Moreover, they constitute
an adapted ensemble of classiers developed depending on the data, so the algorithms are also modied for obtaining appropriate models to tackle imbalanced
domains. Two main approaches have been developed with promising results in
this group: the SMOTEBoost approach (Chawla et al., 2003) and DataBoost-IM
approach (Guo and Viktor, 2004).
Re-sampling approaches can be categorized into two tendencies: under-sampling,
that consists of reducing the data by eliminating examples belonging to the majority
class with the objective of equalizing the number of examples of each class; and oversampling, that aims to replicate or generate new positive examples in order to gain importance (Chawla et al., 2002).
2
Evolutionary Computation
Volume x, Number x
Evolutionay Under-Sampling for Balancing Data
In specialized literature, we can nd papers for re-sampling techniques from a

point of view of studying the effect of the class distribution in classication (Weiss
and Provost, 2003; Estabrooks et al., 2004) and adaptations of Prototype Selection (PS)
methods (Wilson and Martinez, 2000) to treat with imbalanced data sets (Kubat and
Matwin, 1997; Batista et al., 2004).
Data may be categorized depending on its Imbalance Ratio (IR), which is dened as
the relation between the majority class and minority class instances, by the expression
1
IR =
N
,
N+
(1)
where N is the number of instances belonging to the majority class, and N + is the
number of instances belonging to the minority class. Logically, a data set is imbalanced
when its IR is greater than 1. We will consider that a IR above 9 represents a high IR
in a data set, due to the fact that ignoring the minority class instances by a classier,
supposes an error of 0.1 in accuracy, which has poor relevance. We will separately
study the data sets belonging to this group in order to analyze the behaviour of the
proposals over them, given that when data sets present a high IR, the difculty of the
learning process increases.
EAs have been used for data reduction with promising results. They have been
successfully used for feature selection (Whitley et al., 1998; Guerra-Salcedo and Whitley, 1998; Guerra-Salcedo et al., 1999; Papadakis and Theocharis, 2006; Wang et al., 2007;
Sikora and Piramuthu, 2007) and PS (Cano et al., 2003, 2005), calling this last one as
Evolutionary Prototype Selection (EPS). EAs also have a good behaviour for training
set selection in terms of getting a trade-off between precision and interpretability with
classication rules (Cano et al., 2007).
PS is an instance reduction process whose results are used as a reference set of
examples for the Nearest Neighbour rule (1-NN) in order to classify new patterns. This
reduces the number of rows in the data set with no loss of classication accuracy and
even with an improvement in the classier. Various approaches of PS algorithms were
proposed in the literature, see (Wilson and Martinez, 2000) for review. A distinction is
needed among those methods that are centered on an efcient selection of prototypes in
order to increase or maintain global accuracy rate and to reduce the size of the training
data, and those that are focused on balancing data by selecting samples in order to
prevent bad behaviours in a subsequent classication process.
In this paper, we propose the use of EAs for under-sampling imbalanced data sets,
we call it Evolutionary Under-Sampling (EUS) approach. The objective is to increase
the accuracy of the classier by means of reducing instances mainly belonging to the
majority class. In fact, the tness functions proposed are designed to achieve a good
trade-off between reduction, data balancing and accuracy in classication. We propose
eight EUS methods and categorize them into a taxonomy depending on their objective,
scheme of selection and metrics of performance employed.
We will distinguish two levels of imbalanced degree among data sets. A high
degree of imbalance may have a remarkable inuence on performance in a classication
task and may cause problems in preprocessing stages carried out by many algorithms at
data level. For this reason, we analyze the use of EUS method under these conditions by
empirically comparing different methods among themselves and arranging them into
a taxonomy. In addition to this, we compare our approach with other under-sampling
methods studied in the literature. The empirical study has been contrasted via nonEvolutionary Computation
Volume x, Number x
parametrical statistical testing.

The rest of the paper is organized as follows: Section 2 gives an explanation about
issues on evaluating imbalanced learning. In Section 3, the EPS concepts are explained,
together with a description of each used model. Section 4 gives the proposed taxonomy
and expands the description of all the EUS models proposed. In Sections 5 and 6, the
experimentation framework and the results and their analysis are presented. Finally,
in Section 7, we point out our conclusion. An appendix is included containing a complete description of under-sampling methods focused on balancing data and prototype
selection methods.
2 How to Evaluate a Classier in Imbalanced Domains?

When we want to evaluate a classier over imbalanced domains, classical ways of evaluating, such as classication accuracy, have no sense. A standard classier that uses
accuracy rate may be biased towards the majority class due to the bias inherent in the
measure, which is directly related to the ratio between the number of instances of each
class. A typical example of this fact is the following: if the ratio of imbalance presented
in the data is 1:100, the error of ignoring this class is only 1%.
The most correct way of evaluating the performance of classiers is based on the
analysis of the confusion matrix. In Table 1, a confusion matrix is illustrated for a
problem of two classes, with the values for the positive and negative classes. From
this matrix it is possible to extract a number of widely used metrics to measure the
FP
performance of learning systems, such as Error Rate, dened as Err = T P +F N+F N+T N
+F P
T P +T N
and Accuracy, dened as Acc = T P +F N +F P +T N = 1 Err.
Positive Class
Negative Class
Positive Prediction
True Positive (TP)
False Positive (FP)
Negative Prediction
False Negative (FN)
True Negative (TN)
Table 1: Confusion matrix for two-class problem.

In relation to the use of error (or accuracy) rate, another type of metric in the domain of the imbalanced problems is considered more correct. Concretely, from Table
1 it is possible to obtain four metrics of performance that measure the classication
performance for the positive and negative classes independently:
False negative rate F Nrate =
classied as negative.
FN
T P +F N
is the percentage of true positive cases mis-
False positive rate F Prate =

classied as positive.
FN
F P +T N
is the percentage of true negative cases mis-
True negative rate T Nrate =

rectly classied as negative.
TN
F P +T N
is the percentage of true negative cases cor-
True positive rate T Prate =

rectly classied as positive.
TP
T P +F N
is the percentage of true positive cases cor-
The goal of a classier is to minimize the false positive and false negative rates or,
in a similar way, to maximize the true positive and true negative rates.
In (Barandela et al., 2003) it was indicated a metric called Geometric Mean (GM),
dened as g = a+ a , where a+ denotes accuracy in positive examples (T Prate ),

4
Volume x, Number x
and a is accuracy on negative examples (T Nrate ). This measure tries to maximize

accuracy in order to balance both classes at the same time. It is an evaluation measure
that allows to simultaneously maximize the accuracy in positive and negative examples
with a good trade-off.
Another metric that could be used to measure the performance of classication
over imbalanced data sets is the Receiver Operating Characteristic (ROC) graphics
(Bradley, 1997). In these graphics, the relationship between F Nrate and F Prate can
be visualized. The area under the ROC curve (AUC) corresponds to the probability of
correctly identifying which of the two stimuli is noise and which is signal plus noise.
AUC provides a single-number summary for the performance of learning algorithms.
Working with imbalanced domains and re-sampling techniques, the ROC analysis
can be carried out by using a parameter of quantity of sampling, which indicates the IR
desired at the end of the preprocess task (Chawla et al., 2002). In this paper, most of the
methods evaluated does not allow to adjust this parameter given that the IR obtained
can not be previously dened as parameter.
We have used two measures to evaluate the performance of the methods studied
in this paper: GM and AUC.
3 Evolutionary Prototype Selection: Representation and Fitness Function

Let us assume that there is a training set T R with N instances which consists of pairs
(xi , yi ), i = 1, ..., N , where xi denes an input vector of attributes and yi denes the
corresponding class label. Each of the N instances have M input attributes and they
should belong to positive or negative class. Let S T R be the subset of selected
instances resulted inthe execution of an algorithm.
PS can be considered as a search problem in which EAs can be applied. To accomplish this, we take into account two important issues: the specication of the representation of the solutions and the denition of the tness function.
Representation: The search space associated is constituted by all the subsets of T R.
This is accomplished by using a binary representation. A chromosome consists of
N genes (one for each instance in T R) with two possible states: 0 and 1. If the
gene is 1, its associated instance is included in the subset of T R represented by the
chromosome. If it is 0, this does not occur (Kuncheva and Bezdek, 1998).
Fitness Function: Let S be a subset of instances of T R and be coded by a chromosome. Classically, we dene a tness function that combines two values: the classication rate (clas rat) associated with S and the percentage of reduction (perc red)
of instances of S with regards to T R (Cano et al., 2003).
F itness(S) = clas rat + (1 ) perc red.
(2)
The 1-NN classier is used for measuring the classication rate, clas rat, associated with S. It denotes the percentage of correctly classied objects from T R using
only S to nd the nearest neighbor. For each object y in S, the nearest neighbor is
searched for among those in the set S \ {y}. Whereas, perc red is dened as
perc red = 100
|T R| |S|
.
|T R|
(3)
The objective of the EAs is to maximize the tness function dened, i.e., maximize
the classication rate and minimize the number of instances obtained. The EAs
with this tness function will be denoted with the extension PS in the name.
Volume x, Number x
Crossover operator for data reduction: In order to achieve a good reduction rate,
Heuristic Uniform Crossover (HUX) implemented for CHC undergoes a change
that makes more difcult the inclusion of instances inside the selected subset.
Therefore, if a HUX switches a bit on in a gene, then the bit could be switched
off depending on a certain probability (it will be specied in Section 5.1).
As the evolutionary computation method in the core of EPS, we have used the
CHC model (Eshelman, 1991; Cano et al., 2003) and Intelligent Genetic Algorithm
(IGA) (Ho et al., 2002):
CHC is a classical evolutionary model that introduces different features to
obtain a trade-off between exploration and exploitation; such as incest prevention, reinitialization of the search process when it becomes blocked and
the competition among parents and offspring into the replacement process.
Recently, it has been used in many applications; for example, as a method for
optimizing learning models (Alcal et al., 2007), information processing (Alba
a
et al., 2006) and image registration (Cordon et al., 2006).

During each generation the CHC develops the following steps.
It uses a parent population of size N to generate an intermediate population of N individuals, which are randomly paired and used to generate N
potential offspring.
Then, a survival competition is held where the best N chromosomes from
the parent and offspring populations are selected to form the next generation.
CHC also implements a form of heterogeneous recombination using HUX, a
special recombination operator. HUX exchanges half of the bits that differ
between parents, where the bit position to be exchanged is randomly determined. CHC also employs a method of incest prevention. Before applying
HUX to the two parents, the Hamming distance between them is measured.
Only those parents who differ from each other by some number of bits (mating
threshold) are mated. The initial threshold is set at L/4, where L is the length
of the chromosomes. If no offspring are inserted into the new population then
the threshold is reduced by one.
No mutation is applied during the recombination phase. Instead, when the
population converges or the search stops making progress (i.e., the difference threshold has dropped to zero and no new offspring are being generated
which are better than any member of the parent population) the population is
reinitialized to introduce new diversity to the search. The chromosome representing the best solution found over the course of the search is used as a template to reseed the population. Reseeding of the population is accomplished
by randomly changing 35% of the bits in the template chromosome to form
each of the other N 1 new chromosomes in the population. The search is
then resumed.
IGA is a Generational Genetic Algorithm (GGA) which incorporates an Intelligent Crossover (IC) operator. IC builds an Orthogonal Array (OA) (see Ho
et al. (2002)) from two parents of chromosomes and searches within the OA the
two best individuals according to tness function. It takes about 2 log2 (+1)
tness evaluations to perform an IC operation, where is the number of bits
6
Volume x, Number x
that differs between both parents. IGA is based on Orthogonal experimental

design used for PS and feature selection.
IGA is a GGA with elitist strategy in initialization, evaluation, selection and
mutation. It randomly generates N individuals at the beginning. The selection uses the rank that replaces the worst Ps N for the best Ps N individuals
to form a new population, where Ps is a selection probability. It randomly selects Pc N individuals to perform IC, where Pc is a crossover probability. The
mutation operator is the conventional bit inverse mutation operator. The best
individual is retained without being subject to the mutation operator. The
algorithm nishes when the number of evaluations achieves a certain value.
IC operator is based upon the developing of a OA, which consists of a set
of orthogonal representations obtained from two chromosomes. Having OA
constructed, IC evaluates all candidates belonging to it and returns the best
and the second best individuals. The development of OA is expensive and
the algorithm for doing it is detailed in (Ho et al., 2002).
4 Evolutionary Under-Sampling: Models and Taxonomy

We will present a taxonomy for EUS methods, identifying the main issues used for the
classication of the respective models and including each method in its corresponding
place. We will use two ways of division, the objective that they pursue and the way
that they do the selection of instances.
Regarding the objective, there are two goals of interest in EUS:
Aiming for an optimal balancing of data without loss of effectiveness in classication accuracy. EUS models that follow this tendency will be called Evolutionary
Balancing Under-Sampling (EBUS).
Aiming for an optimal power of classication without taking into account the balancing of data, considering the latter as a sub-objective that may be an implicit
process. EUS models that follow this tendency will be called Evolutionary UnderSampling guided by Classication Measures (EUSCM).
With respect to the types of instance selection that can be carried out in EUS, we
distinguish:
If the selection scheme proceeds over any kind of instance, then it is called Global
Selection (GS). That is, the chromosome contains the state of all instances belonging to the training data set and removals of minority class instances (those belonging to positive class) are allowed.
If the selection scheme only proceeds over majority class instances then it is called
Majority Selection (MS). In this case, the chromosome saves the state of instances
that belong to the negative class and a removal of a positive or minority class instance is not allowed.
This categorization produces 4 subgroups of EUS methods. Furthermore, two
methods that differ in the measure of evaluation are included in each group (GM and
AUC), obtaining a total number of eight EUS methods.
These methods will be described in the following subsections. They can be easily
distinguished by their names:
Volume x, Number x
If the term GM appears, then it means that the model uses Geometric Mean as evaluator of accuracy. In the other case, it must appears the term AUC for evaluation
with AUC measure.
If the term GS appears, this implies that the selection scheme is Global Selection, as
well as the term MS, which implies that the selection scheme is Majority Selection.
The method will be a EBUS or a EUSCM model and this fact is specied in its
name.
The EUS approaches have been developed by using the CHC evolutionary model.
Figure 1 summarizes the proposed taxonomy for EUS approach and includes the
8 methods studied in this paper.
Evolutionary
Under-Sampling
Evolutionary Balancing
Under-Sampling
Global Selection
Evolutionary UnderSampling guided for

Classification Measures
Majority Selection
Global Selection
Majority Selection
EUSCM-MS-AUC
EUSCM-MS-GM
EUSCM-GS-AUC
EUSCM-GS-AUC
EUSCM-GS-GM
EUSCM-GS-GM
EBUS-MS-AUC
EBUS-MS-AUC
EBUS-MS-GM
EBUS-GS-AUC
EBUS-GS-GM
Figure 1: Evolutionary Under-Sampling Taxonomy

In the following two subsections, we will describe the EBUS and EUSCM models.
4.1 Evolutionary Balancing Under-Sampling
This subgroup contains four methods of EBUS:
EBUS-GS-GM: It is the model used in (Garca et al., 2006), consisting of apply
ing the same tness function dened in expression 4 and the selection over the
majority and minority class samples simultaneously. This model aims to remove
instances of both classes, identifying minority class examples that have a negative
inuence over the classication task and achieving a maximal reduction in positive
instances. A penalization factor used for preserving the same number of instances
belonging to each class helps to maintain a generalization capability in the reduction task, in the way it does not specialize the data subset only for the positive
class.
8
Volume x, Number x
F itnessBal (S) =
g |1
gP
n+
n |
if n > 0
if n = 0
(4)
where g is geometric mean of balanced accuracy dened in Section 2, n+ is the

number of positive instances selected (minority class), n is the number of negative instances selected (majority class), and P is the penalization factor.
The P parameter is considered an inuential value that controls the intensity and
importance of the balance during the evolutionary search. It denes the maximum
penalization applied over the classication measure if there was a total unbalancing between both classes, that is, either no positive instances are selected or no
negative instances are selected. We have empirically determined that the penalization over the classication measure should be closer to 0.02 so that a low value
does not affect sufciently the achievement of the balance, moreover a high value
implies that the trade-off between accuracy and balancing could be lost. So, the
the parameter P value that we will use is 0.2.
Figures 2 and 3 represent the GM accuracy when parameter P is set from 1 to 50, in
the EBUS model with majority (which will be detailed in the next point) and global
selection, respectively. In this case study, we have used the set of imbalanced data
set derived from the Glass data set (see Table 2), which is composed by 6 versions
with different IR values. The graphics show how by employing low and high P
values, it could lead to extremely bad results on some data sets. A value of P = 0.1
or P = 0.2 remains stable on all data sets.
s s a lg
79
P FWB s s a lg
PFNWB s s a lg
29
sr e n i at n oC s s a lg
78
G
77 M
i
28 n
t
e
s
t
WN s s a l g
er aw e l b aT s s a lg
27
76
26
5.0
3. 0
2.0
1.0
10.0
r et ema rap P
Figure 2: Inuence of the P factor in EBUS-MS-GM

This tness function tries to nd subsets of instances making a trade-off between
the classication balanced accuracy and an equal number of examples selected
from each class. This second objective is obtained through the penalization applied
to g in tness value.
EBUS-MS-GM: It is the same model as before, but it only selects instances belonging to the negative class (that is, majority class samples). The tness function corresponds to expression 5. With this scheme, the reduction only affects the negative
Volume x, Number x
s s a lg
P FWB s s a lg
49
PFNWB s s a lg
48
sr e n i at n oC s s a lg
46 G
M
ni
t
e
s
t
WN s s a l g
47
er aw e l b aT s s a lg
45
44
5.0
3.0
2.0
1.0
10.0
r et ema rap P
Figure 3: Inuence of the P factor in EBUS-GS-GM

instances, but the process is controlled in the same way as the previous model;
thus, the reduction is not free, it depends on the minority class examples.
F itnessBal (S) =
g |1
gP
N+
n |
if n > 0
if n = 0
(5)
Given that the instances belonging to the positive class are not affected is this
model, N + is a constant that represents the number of original positives instances
within the training data set.
EBUS-GS-AUC: This model is obtained from the rst one described in this section
by replacing the Geometric Mean to measure the accuracy on training data with
the AUC measure (see Section 2). The tness function corresponds to expression
6.
F itnessBal (S) =
AU C |1
AU C P
n+
n |
if n > 0
if n = 0
(6)
Although the AUC measure is totally valid to achieve a good balance between accuracy in both classes, it does not control the resulting balance of instances selected
in both classes.
EBUS-MS-AUC: Using AUC as performance measure, this model does not remove
examples belonging to positive class. The tness function employed by it corresponds to expression 7.
F itnessBal (S) =
10
AU C |1
AU C P
N+
n |
if n > 0
if n = 0
(7)
Volume x, Number x
4.2 Evolutionary Under-Sampling Guided by Classication Measures

This subgroup is composed by four methods of EUSCM:
EUSCM-GS-GM: It is the same model than the rst, but no penalization factor (P )
is applied during the selection, implying that a balancing between classes is not
expected. The tness function corresponds to expression 8. Usually, the objective
of this model is to select a minimal number of examples representing the whole
training data set. The disadvantage of lack of generalization capability may be
pointed out in this model.
F itness(S) = g,
(8)
EUSCM-MS-GM: In this model, the selection is applied over negative instances.

The tness function can be represented by the expression 8.
As we can notice, penalization factor P is not included in tness function. This
implies that the difference of the number of instances among both classes does not
tend to be equal to 0, so there is not a control over the selection process in terms
of getting a balance. However, a control over negative instances is present, given
that removing instances only affects to the majority class ones.
EUSCM-GS-AUC: This model is obtained from the rst one described in this section by replacing the Geometric Mean to measure the accuracy on training data
with the AUC measure (see Section 2). The tness function corresponds to expression 9.
F itness(S) = AU C,
(9)
AUC measure involves itself in a trade-off between improving the accuracy rate
over positives instances and not losing accuracy rate power over negative instances.
EUSCM-MS-AUC: This model is the same than the previous one, with the exception of the selection carried out, which is only performed in examples belonging to
the majority class. The tness function used corresponds to expression 9.
5 Experimental Framework
This section describes the methodology followed in the experimental study of the
under-sampling methods compared. We will explain the conguration of the experiment: data sets used and parameters for the algorithms. The PS methods used in the
study and the proposals of Under-Sampling found in specialized literature are (for a
detailed description, see Appendix):
Prototype Selection Methods:
Instance-Based 3 (IB3) (Aha et al., 1991): It is an incremental instance selection
algorithm which introduces the acceptable concept in the selection.
Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson and Martinez, 2000): It is based in the rule Any instance incorrectly classied by its
k-NN is removed.
Volume x, Number x
11
Classical Under-Sampling Methods for Balancing Class Distribution:

Random Under-Sampling (RUS): It is a non-heuristic method that aims to balance class distribution through the random elimination of majority class examples to get a balanced instance set.
Tomek Links (TL) (Tomek, 1976): It searches Tomek Links and eliminates examples belonging to the majority class in each Tomek link found.
Condensed Nearest Neighbor Rule (US-CNN) (Hart, 1968): It is a modication
of the classic CNN rule for imbalanced learning.
One-Sided Selection (OSS) (Kubat and Matwin, 1997): It is an under-sampling
method resulting from the application of Tomek links followed by the application of US-CNN.
US-CNN + TL (Batista et al., 2004): It is similar to OSS, but the method USCNN is applied before the Tomek links.
Neighborhood Cleaning Rule (NCL) (Laurikkala, 2001): It is an adaptation of
the ENN rule for imbalanced learning.
Class Purity Maximization (CPM) (Yoon and Kwek, 2005): It is a clustering
based method for imbalanced learning which manages the impurity concept.
Under-Sampling Based on Clustering (SBC) (Yen and Lee, 2006): It is a random under-sampling based on clustering.
Finally, we will briey introduce the use of non-parametric statistical tests employed for the comparison of the results obtained.
5.1 Conguration of the Experiment
Performance of PS methods and under-sampling for balancing data, which will be described in Appendix A, is analyzed by using 28 data sets taken from the UCI Machine
Learning Database Repository (Newman et al., 1998). Multi-class data sets are modied
to obtain two-class non-balanced problems, dening one class as positive and one or
more classes as negative. Missing values have been replaced with the lowest possible
value of the attribute associated domain.
The data sets are sorted by their IR values in an incremental way. The main characteristics of these data sets are summarized in Table 2. For each data set, it shows
the number of examples (#Examples), number of attributes (#Attributes), class name
(minority and majority) together with the class distribution and the IR value. Some
of these data sets have already been used in previous works (Weiss and Provost, 2003;
Batista et al., 2004; Guo and Viktor, 2004; Akbani et al., 2004).
The data sets considered are partitioned using the ten fold cross-validation (10-fcv)
procedure. The parameters of the used algorithms (see Appendix for detailed descriptions of PS and Under-Sampling methods) are presented in Table 3. The EUS approach
always uses the same parameters independently of the tness function it considers,
in order to make them more comparable in performance. All the parameters for the
algorithms are the recommended by their respective authors.
5.2 Non-Parametric Statistical Tests for Statistical Analysis
In this paper we have used a 10-fcv, which is a way of estimating the real error of a classier by testing it against all instances of the data set, while training the classier with
instances independent of the testing ones. The results obtained from this validation are
12
Volume x, Number x
Data set
GlassBWNFP
#Examples
214
#Attributes
9
EcoliCP-IM
Pima
GlassBWFP
220
768
214
7
8
9
German
Haberman
Splice-ie
Splice-ei
GlassNW
VehicleVAN
EcoliIM
New-thyroid
Segment1
EcoliIMU
Optdigits0
Satimage4
Vowel0
GlassVWFP
EcoliOM
GlassContainers
Abalone9-18
GlassTableware
YeastCYT-POX
YeastME2
YeastME1
YeastEXC
Car
Abalone19
1000
306
3176
3176
214
846
336
215
2310
336
5564
6435
990
214
336
214
731
214
483
1484
1484
1484
1728
4177
20
3
60
60
9
18
7
5
19
7
64
36
13
9
7
9
9
9
8
8
8
8
6
9
Class (min., maj.)

(build-window-non oat-proc,
remainder)
(im,cp)
(1,0)
(build-window-oat-proc,
remainder)
(1, 0)
(Die, Survive)
(ie,remainder)
(ei,remainder)
(non-windows glass, remainder)
(van,remainder)
(im,remainder)
(hypo,remainder)
(1,remainder)
(iMU, remainder)
(0, remainder)
(4, remainder)
(0, remainder)
(Ve-win-oat-proc, remainder)
(om, remainder)
(containers, remainder)
(18, 9)
(tableware, remainder)
(POX, CYT)
(ME2, remainder)
(ME1, remainder)
(EXC, remainder)
(good, remainder)
(19, remainder)
%Class(min.,maj.)
(35.51, 64.49)
IR
1.82
(35.00, 65.00)
(34.77, 66.23)
(32.71, 67.29)
1.86
1.9
2.06
(30.00, 70.00)
(26.47, 73.53)
(24.09, 75.91)
(23.99, 76.01)
(23.93, 76.17)
(23.52, 76.48)
(22.92, 77.08)
(16.28, 83.72)
(14.29, 85.71)
(10.42, 89.58)
(9.90, 90.10)
(9.73, 90.27)
(9.01, 90.99)
(7.94, 92.06)
(6.74, 93.26)
(6.07, 93.93)
(5.75, 94.25)
(4.2, 95.8)
(4.14, 95.86)
(3.43, 96.57)
(2.96, 97.04)
(2.49, 97.51)
(3.99, 96.01)
(0.77, 99.23)
2.33
2.68
3.15
3.17
3.19
3.25
3.36
4.92
6.00
8.19
9.10
9.28
10.1
10.39
13.84
15.47
16.68
22.81
23.15
28.41
32.78
39.16
71.94
128.87
Table 2: Imbalanced Data Sets.
Algorithm
IB3
EPS-CHC
EPS-IGA
RUS
SBC
EUS
Parameters
Acept. Level = 0.9, Drop Level = 0.7
P op = 50, Eval = 10000, = 0.5
P op = 10, Eval = 10000, = 0.5
Balancing Ratio = 1 : 1
Balancing Ratio = 1 : 1, N. Clusters = 3
P op = 50, Eval = 10000, P = 0.2, P rob. inclusion HU X = 0.25
Table 3: Parameters considered for the algorithms.
Volume x, Number x
13
not completely independent, therefore the results neither present normal distribution
nor homogeneity of variance. In this situation, we consider the use of non-parametric
tests, according to the recommendations made in (Dem ar, 2006).
s
As such, these non-parametric tests can be applied to classication accuracies, error ratios or any other measure for techniques evaluation, even including model sizes
and computation times. Empirical results suggest that they are also more powerful
than the parametric tests. Dem ar recommends a set of simple, safe and robust nons
parametric tests for statistical comparisons of classiers. We will briey describe the
two tests used in this study.
The rst one is Friedmans test (Sheskin, 2003), which is a non-parametric test
equivalent to the repeated-measures ANOVA. Under the null-hypothesis, it states
that all the algorithms are equivalent, so a rejection of this hypothesis implies the
existence of differences among the performance of all the algorithms studied. After
this, a post-hoc test could be used in order to nd whether the control or proposed
algorithm presents statistical differences with regards to the remaining methods
into the comparison. The simplest of them is the Bonferroni-Dunn test, but we
can use more powerful tests that control the family-wise error rate and reject more
hypothesis than Bonferroni-Dunn test; for example, Holms test.
Due to the fact that Friedmans test could be too conservative, we have used a
derivation of it, Iman and Davenports test (Iman and Davenport, 1980). The descriptions and computations of the tests are explained in (Dem ar, 2006).
s
As post-hoc test of Friedman statistic, we will use Holms procedure (Holm, 1979),
which is a multiple comparison procedure that works with a control algorithm
(normally, the best of them is chosen) and compares it with the remaining methods. The results obtained in each comparison by using Holms procedure will be
reported through p-values.
6 Experimental Results and Analysis

This section shows the experimental results and their associated statistical analysis in
the evaluation of the imbalanced data sets. It is divided into 3 parts, corresponding to
different studies that aim to achieve a certain conclusion. The objectives of the 3 parts
are the following:
Section 6.1 shows a study applying and evaluating PS methods over imbalanced
data.
Section 6.2 compares the approaches of EUS proposed among themselves.
Section 6.3 includes a study between the most promising EUS model and the algorithms of under-sampling focused in balancing data obtained from the state-ofthe-art.
6.1 Using PS Methods over Imbalanced Domains
In Section 2, the metric of GM is described as a good way of evaluating the performance
of classiers over imbalanced domains. The use of a preprocessing stage with the aim
of improving the performance of a posterior classier should obtain a better GM rate
than not using it.
This rst study comprises a comparison among the four PS methods considered in
this study and the classier 1-NN without preprocessing.
14
Volume x, Number x
Table 4 shows us the average and standard deviations of the results offered by the
PS algorithms over the imbalanced data sets. Each column shows:
The PS method employed.
The percentage of reduction with respect to the original data set size. Furthermore, the percentage of reduction associated with each class is showed in posterior
columns.
The accuracy for each class by using an 1-NN classier (a+ and a ), where sub
index tra refers to training data and sub index tst refers to test data. GM value
also is showed for training and test data.
Finally, the AUC measure in test data is reported.
None indicates that no balancing method is employed (original data set is used for
classication with 1-NN).
PS
Method
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
%Red
%Red
%Red+
a
tra
a+
tra
GMtra
a
tst
a+
tst
GMtst
tst
AUC
none
0.0
0.0
61.82
1.49
91.76
1.81
98.85
1.88
89.49
1.79
0.0
0.0
62.00
1.49
94.00
1.83
99.00
1.88
90.00
1.80
0.0
0.0
80.60
1.70
78.13
1.67
97.17
1.86
81.91
1.71
0.9399
0.1832
0.9196
0.1812
0.8879
0.1781
0.9735
0.1865
0.9767
0.1868
0.6414
0.1514
0.475
0.1302
0.7027
0.1584
0.5747
0.1433
0.7206
0.1604
0.7485
0.1635
0.5965
0.146
0.7657
0.1654
0.6757
0.1553
0.8044
0.1695
0.9387
0.1831
0.9227
0.1815
0.8761
0.1769
0.9584
0.185
0.9459
0.1838
0.6175
0.1485
0.4615
0.1284
0.6299
0.15
0.5183
0.1361
0.5919
0.1454
0.6958
0.1576
0.5267
0.1372
0.6751
0.1553
0.6037
0.1468
0.6691
0.1546
0.7606
0.1648
0.6746
0.1552
0.7359
0.1621
0.7206
0.1604
0.7516
0.1638
IB3
DROP3
EPS-CHC
EPS-IGA
Table 4: Average results for PS algorithms over imbalanced data sets

Table 4 reports that, in general, all PS methods lose accuracy and AUC in test data,
given that the usage of this preprocess stage performs worse than 1-NN. EPS-CHC is
the algorithm which obtains the highest reduction rate and EPS-IGA obtains the best
result in training data, indicating us that it over-ts the selected instances to the training
data.
In Figure 4, the values of the average rankings using Friedmans method are specied. Each column represents the average ranking obtained by an algorithm; that is,
if a certain algorithm achieves rankings 1, 3, 1, 4 and 2, on ve data sets, the average
ranking is 1+3+1+4+2 = 11 . The height of each column is proportional to the ranking,
5
5
the lower a column is, the better its associated algorithm is.
Then, we apply the Friedmans and Iman-Davenports tests (considering a level
of signicance = 0.05) to check whether differences exist among all the methods by
using the GM measure, presenting the results in Table 5:
Friedmans statistic
33.143
Iman-Davenport statistic
11.348
Critical value
9.488
Critical value
2.456
hypothesis
rejected
hypothesis
rejected
Table 5: Statistics and critical values for Friedmans and Iman-Davenports test
Table 5 indicates us that both, Friedmans and Iman-Davenports, statistics are
higher than their associated critical value, so the hypothesis of equivalence of results
Volume x, Number x
15
Friedman Rankings
5.4
16 1. 4
4
393.3
5.3
450. 3
3
2 84.2
5.2
11 9. 1
2
5.1
A
v
e
r
a
g
e
R
a
n
k
i
n
g
s
1
5.0
0
3B I
3PORD
CHC-SPE
AG I-SPE
NN-1
Figure 4: Friedman Rankings for classical PS and EPS algorithms

is rejected. Then, a post-hoc test is needed in order to distinguish whether the control
method (1-NN without preprocessing in this case) is signicantly better than the remaining of them. We will apply the post-hoc statistical analysis considering all data
sets. We use Holms procedure to check this over the GM measure, and the results
are offered in Figure 5. Following the indications given in Section 5.2, this procedure
computes the zi value and obtains the associated pi value by using the normal distribution for each hypothesis i to evaluate. The gure represents the p-values associated
to each comparison between 1-NN without preprocessing and the corresponding PS
algorithm indicated in x-axis. The discontinuous line similar to a staircase represents
the /i value established for each comparison following Holms method. If a p-value
of a certain comparison exceeds this line, this hypothesis can not be rejected for the
Holm test and it implies to stop checking the remaining hypotheses. Otherwise, when
a p-value does not exceed the discontinuous line, this implies the rejecting of the hypothesis associated and allows to do the next test.
Holm's Test
1.0
90.0
80.0
60.0
50.0
40.0
p-value
70.0
30.0
20.0
1 48 600. 0
692 67 1.0
3 P ORD
AG I-SPE
25 40 00.0
CHC-S PE
7 0-E2 10 .1
10.0
0
3B I
Control Algorithm: 1-NN
Figure 5: Holms test: The control algorithm is 1-NN without preprocessing

The statistical analysis of this comparison declares that the use of PS methods is
not recommendable for non-balanced domains given that the accuracy of 1-NN is signicantly better than three PS methods studied. With respect to EPS-IGA algorithm, we
16
Volume x, Number x
can not state that 1-NN is better than EPS-IGA preprocessing, but EPS-IGA does not obtain a better value of ranking than 1-NN. The exhaustive search process performed by
EPS-IGA and its capability of reduction (lower than EPS-CHC) of the training data allow to nd subsets of instances that over-t the training set but in test accuracy perform
worse on imbalanced domains with respect to no use of a PS method. Note that EPSCHC algorithm notably performs poorly when we treat with high imbalanced data
sets, so the excessive reduction achieved by this method does not produce benet in
imbalanced domains.
6.2 EUS Methods
As second objective, we analyze all EUS models proposed over all imbalanced data sets.
The 8 algorithms that compose the taxonomy explained in Section 4 will be analyzed
in terms of efcacy and efciency in order to obtain the most appropriate conguration
of EUS over a set of imbalanced data sets. Firstly, we will study the EUS models on the
full set of data considered. Then, we will divide the data sets into two groups: those
that have a IR below 9 and those that have a IR above 9. All the studies will include
statistical analysis of the results.
Beginning with the study that considers the 28 imbalanced data sets, Table 6 shows
us the average and standard deviations of the results offered by the models proposed.
It follows the same structure as Table 4.
EUS
Method
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
%Red
%Red
%Red+
a
tra
a+
tra
GMtra
a
tst
a+
tst
GMtst
tst
AUC
EBUSMS-AUC
EBUSMS-GM
EBUSGS-AUC
EBUSGS-GM
EUSCMMS-AUC
EUSCMMS-GM
EUSCMGS-AUC
EUSCMGS-GM
70.04
1.58
69.93
1.58
96.30
1.85
96.23
1.85
76.86
1.66
76.18
1.65
94.46
1.84
94.34
1.84
81.00
1.70
80.00
1.69
98.09
1.87
98.00
1.87
90.00
1.80
89.00
1.79
95.00
1.85
95.00
1.84
0.00
0.00
0.00
0.00
82.12
1.71
82.13
1.71
0.00
0.00
0.00
0.00
84.01
1.73
84.19
1.73
0.8473
0.174
0.8504
0.1743
0.8749
0.1768
0.8812
0.1774
0.8639
0.1757
0.8714
0.1764
0.9144
0.1807
0.9155
0.1808
0.9323
0.1825
0.9252
0.1818
0.9259
0.1818
0.9195
0.1812
0.9371
0.1829
0.9313
0.1824
0.9116
0.1804
0.9054
0.1798
0.8878
0.1781
0.8862
0.1779
0.8991
0.1792
0.8996
0.1792
0.8961
0.1789
0.8983
0.1791
0.9092
0.1802
0.9068
0.18
0.8289
0.1721
0.8319
0.1724
0.8566
0.1749
0.8595
0.1752
0.8285
0.172
0.8354
0.1727
0.8916
0.1784
0.8894
0.1782
0.8189
0.171
0.8188
0.171
0.7826
0.1672
0.7863
0.1676
0.8084
0.1699
0.8081
0.1699
0.7374
0.1623
0.7278
0.1612
0.7955
0.1686
0.7971
0.1687
0.7872
0.1677
0.7927
0.1683
0.7795
0.1669
0.7861
0.1676
0.7712
0.166
0.7575
0.1645
0.8071
0.1698
0.8085
0.1699
0.8024
0.1693
0.8058
0.1696
0.8014
0.1692
0.805
0.1696
0.797
0.1687
0.7912
0.1681
Table 6: Average results for the proposed models over imbalanced data sets
By analyzing Table 6, we can point out the following:
The best average results are offered by the models EBUS, by measuring the performance with GM accuracy and AUC.
An observable difference between the use of global selection and majority selection
exists. In all cases, the majority selection is preferable to global selection.
The employment of GM or AUC in the tness does not affect too much in the
results obtained.
We are interested in checking if these differences are signicant by using nonparametrical statistical tests. For this, we compute the average rankings by using Friedmans test over the results obtained in all imbalanced data sets, as well as in the results
on data sets with IR < 9 and IR > 9. In Figures 6, 7 and 8 (they follow the same
scheme of Figure 4), we represent the ranking values for each algorithm and for GM
Volume x, Number x
17
and AUC measures. With these values, we have computed Iman-Davenports statistic
(considering a level of condence = 0.05) and the results are showed in Table 7:
Friedman Rankings
MG
CUA
6
34
1 6 1. 5 1 . 5
46 9.4
3 4 1. 5
23 7.4
52 1.16 1 .4
4
5
5 7.4 7.4
7 5 3. 4
6 4 4. 4 . 4
5
92 9.39 8 .3
3
75 8.3
9 8 0. 4
A
v
e
r
a
g
e
3 R
a
n
k
i
n
g
s
4
2
1
0
MG-SG
-MCSUE
CUA-SG
-MCSUE
MG-SM
-MCSUE
CUA-SM
-MCSUE
MG
-SG-SUBE
CUA
-SG-SUBE
MG
-SM-SUBE
CUA
-SM-SUBE
Figure 6: Friedman Rankings for all EUS models over all imbalanced data sets
Friedman Rankings
7
MG
CUA
34 6.5
6 8 7. 5
5. 4
4 1 7. 4
97 6.4
9 7 1. 4 2 . 4
41
17 0.4
9 2 4. 4
MG-SM
-MCSUE
CUA-SG
-MCSUE
6
5 2.4
4 6 4. 4
68
9 2 4. 4 2 . 4
6 3 0. 4
MG
-SM-SUBE
CUA
-SG-SUBE
5
34
9 7 1. 4 1 . 4
A
v
e
r
3 a
g
e
R
a
n
k
i
n
g
s
4
2
1
0
MG-SG
-MCSUE
CUA-SM
-MCSUE
MG
-SG-SUBE
CUA
-SM-SUBE
Figure 7: Friedman Rankings for all EUS models over imbalanced data sets with IR < 9
Imbalance
Data Sets
All
IR < 9
IR > 9
Iman-Davenport
statistic for GM
1.099
0.575
2.355
Iman-Davenport
statistic for AUC
1.049
0.716
1.736
Critical
value
2.058
2.112
2.112
hypothesis
both non-rejected
both non-rejected
rejected for GM measure
Table 7: Statistics and critical values for Iman-Davenport test ( = 0.05)

Imans and Davenports multiple comparison test procedure cannot reject, in the
majority of the cases, the hypothesis of equivalence of means, so we can conclude that
signicant differences do not exist among the distinct models of EUS studied. An exception is produced when we evaluate the models with GM over data sets with IR > 9.
In this case, Iman-Davenports test rejects the null hypothesis (we have proved that
Friedmans test also rejects it). Due to the fact that Iman-Davenports test is more pow18
Volume x, Number x
Friedman Rankings
7
MG
CUA
75 8.75 8 .5
5
7 0 6. 5
68 7.5
CUA-SG
-MCSUE
MG-SG
-MCSUE
6
9 7 6. 4 7 . 4
68
1 7 0. 4 1 . 4
70
4 6 4. 4
41 7.4
5
4
4 1 7. 3
75 8.3
1 2 3. 3
7 0 6. 3
17 5.3
A
v
e
r
3 a
g
e
R
a
n
k
i
n
g
s
4
2
1
0
MG-SM
-MCSUE
CUA-SM
-MCSUE
MG
-SG-SUBE
CUA
-SG-SUBE
MG
-SM-SUBE
CUA
-SM-SUBE
Figure 8: Friedman Rankings for all EUS models over imbalanced data sets with IR > 9
erful than Friedmans test, if it is not able to reject the null hypothesis, Friedmans test
cannot do it either.
An analysis based upon the results obtained from the rankings computed following the guidelines for the Friedmans test allows us to state the following:
With respect to imbalanced data sets with IR < 9 (Figure 7):
The parameter addressed to balancing data (P factor) lacks interest when the data
is not imbalanced enough. A EUSCM model obtains good results without balancing mechanisms. Hence, in general, EUSCM approach behaves better than EBUS.
However, the best performing method is EBUS-MS-AUC, because it obtains low
rankings in both measures, although it is not the best in GM (EUSCM-GS-AUC
outperforms it) and in AUC (EUSCM-MS-AUC is the best in this case).
The differences between the use of global and majority selection or GM and AUC
in the tness function do not follow a specic bias towards carrying out the best
choice.
With respect to imbalanced data sets with IR > 9 (Figure 8):
When the IR becomes high, a GS mechanism has no sense due to the reduced
number of examples belonging to the minority class. Thus, MS mechanism obtains
better results than GS mechanism.
We can observe that EBUS models behave better than EUSCM model. Therefore, a
balancing mechanism may help the under-sampling process over extreme circumstances of imbalance.
In particular, an algorithm that belongs to the group of EBUS models with majority
selection, which is EBUS-MS-GM, is the best performing method in this case.
In spite of the conclusion obtained from Iman-Davenports test, which is that there
are not notably differences among the models, we have to choose a certain model for
performing a comparison with the state-of-the-art techniques in order to stress the benet of using EUS. Thus, we will select the most accurate model: EBUS-MS-GM, which
presents the best result in high imbalanced data sets (IR > 9) and considering all of
them (see Figure 6).
Volume x, Number x
19
Finally, Figure 9 shows a set of bar charts that represent the run-time spent by
each type of model of EUS on some data sets with different IRs. Obviously, they are
inuenced by the size of the data sets, due to the fact that chromosome size grows
agreeing with this increase of size. On the other hand, it is observable the fact that GS
is less affected by the IR, and that MS is very inuenced by it. In pima data set, with a
low IR, the run-time of EUS model with MS is high because of the evaluation cost of
the minority class examples, which are retained in all evaluations. In EBUS, this fact is
more notably due to the interest in balancing both classes. However, when IR is high
(as in the case of yeastEXC data set), the EBUS-MS model is favoured in efciency by
its interest in balancing.
00 6
SG-MCSUE
SM-MCSUE
00 5
SG-SUBE
SM-SUBE
00 4
00 3
t
i
m
e
(
s
e
c
o
n
d
s
)
00 2
00 1
0
) 61.9 3, 48 41(
CXEt s a eY
) 9. 1, 46 7( am iP
) 18.2 2 , 41 2(
er aW elb aT s s a lG
) 28.1, 412(
PFNWB s s alG
Figure 9: Run-time of the EUS models

6.3 EUS versus other Under-Sampling Methods
In this subsection we will compare the EBUS-MS-GM model with 1-NN (called none,
as we mentioned) and the remaining Under-Sampling techniques.
Table 8 shows us the average and standard deviations of the results offered by each
algorithm over the 28 imbalanced data sets considered.
The results offered in Table 8 suggest that the most accurate method is EBUS-MSGM by considering GM and AUC. In general, the classical Under-Sampling methods
behave well, with the exception of SBC algorithm. In relation to the reduction of the
training set achieved, three classes of algorithms exist: algorithms whose reduction rate
is low: TL and NCL; algorithms whose reduction capability is high (in general, all EUS
models with GS, as we saw in Section 4.2) and algorithms which adapt the reduction
to the optimum accuracy; to whose class belong the remaining methods which usually
achieve a reduction rate closer to 70%-80%.
Our interest lies in knowing whether the EUS models may be considered better
than the classical under-sampling algorithms in order to recommend their use. In order
to do this, the results will be contrasted through a multiple comparison test, which will
be the Holms procedure. We will represent the results of this test by using the model
of gures already employed in Section 4.1 (Figure 5).
The Friedmans and Iman-Davenports tests results with a level of signicance =
0.05 by considering the 28 imbalanced data sets and the methods enumerated in Table
8 are reported in Table 9.
Both tests nd signicant differences in the results obtained in this study. There20
Volume x, Number x
EUS
Method
%Red
%Red
%Red+
a
tra
a+
tra
GMtra
a
tst
a+
tst
GMtst
tst
AUC
mean
SD
none
0.00
0.00
0.00
0.00
0.00
0.00
0.9399
0.1832
0.6414
0.1514
0.7485
0.1635
0.9387
0.1831
0.6175
0.1485
0.6958
0.1576
0.7606
0.1648
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
mean
SD
US-CNN +
TL
US-CNN
81.31
1.70
72.95
1.61
81.12
1.70
10.04
0.60
76.37
1.65
69.28
1.57
76.84
1.67
6.67
0.49
96.00
1.85
85.00
1.746
84.00
1.74
13.00
0.69
90.00
1.78
79.0
1.69
90.00
1.79
9.00
0.56
0.00
0.00
0.00
0.00
51.74
1.36
0.00
0.00
0.00
0.00
0.0
0.00
0.00
0.00
0.00
0.90
0.6949
0.1575
0.8702
0.1763
0.8854
0.1778
0.8966
0.1789
0.838
0.173
0.8062
0.1697
0.3275
0.1082
0.9191
0.1812
0.8975
0.179
0.6882
0.1568
0.5778
0.1437
0.822
0.1713
0.8177
0.1709
0.8425
0.1735
0.9279
0.182
0.7804
0.1669
0.7649
0.1653
0.747
0.1633
0.6906
0.157
0.8378
0.173
0.8067
0.1697
0.8222
0.1714
0.3458
0.1111
0.8241
0.1716
0.7093
0.1592
0.8855
0.1778
0.898
0.1791
0.8907
0.1784
0.8475
0.174
0.8045
0.1695
0.3268
0.108
0.9079
0.1801
0.8444
0.1737
0.6882
0.1568
0.6345
0.1505
0.7162
0.1599
0.7543
0.1641
0.8045
0.1695
0.8857
0.1779
0.6925
0.1573
0.7193
0.1603
0.7195
0.1603
0.7039
0.1586
0.7385
0.1624
0.7455
0.1632
0.7757
0.1664
0.3382
0.1099
0.7338
0.1619
0.7618
0.1649
0.7696
0.1658
0.7487
0.1635
0.7862
0.1676
0.7837
0.1673
0.7892
0.1679
0.6063
0.1472
0.7829
0.1672
mean
SD
EBUSMS-GM
69.93
1.58
80.00
1.69
0.00
0.00
0.8504
0.1743
0.9252
0.1818
0.8862
0.1779
0.8319
0.1724
0.8188
0.171
0.7971
0.1687
0.8085
0.1699
CPM
NCL
OSS
RUS
SBC
TL
Table 8: Average results obtained for the state-of-the-art methods and the two proposed
algorithms chosen over imbalanced data sets
Friedman
statistic for GM
74.170
Friedman
statistic for AUC
78.789
Critical
value
16.919
Iman-Davenport
statistic for GM
11.261
Iman-Davenport
statistic for AUC
12.282
Critical
value
1.918
hypotheses
all rejected
Table 9: Statistics and critical values for Friedmans and Iman-Davenports tests
fore, we can apply the Holms procedure as post-hoc test in order to detect the set of
methods which are signicantly worse than the control method. Figures 10 and 11 display the p-values and the threshold of signicance for the Holms procedure with a
= 0.05 and = 0.10. The control method is set as the one that achieves the highest
value of performance in GM and AUC, respectively.
Holm's procedure
= 0.05
= 0.10
0.1
0.09
0.08
p-value
0.07
0.06
0.05
0.04
0.03
0.02
0.01
73 5.0
9 56.0
SUR
LCN
21 1.0
LT
55 0.0
SS O
2 0 0. 0
NNC-SU
10 0.0
enon
00 0.0
00 0.0
M PC
LT+NNC
-SU
0 0 0. 0
CBS
Control Algorithm: EBUS-MS-GM
Figure 10: Holms test on all data sets with GM: The control algorithm is EBUS-MS-GM
The EBUS-MS-GM model is the one that achieves the best ranking, so it is the
control method in both comparisons. As we can see in both gures, EBUS-MS-GM
outperforms ve under-sampling methods: SBC, CPM, US-CNN, US-CNN+TL and no
application of under-sampling. Although EBUS-MS-GM obtains a better performance
than the four remainder algorithms, Holms procedure is not able to detect these difEvolutionary Computation
Volume x, Number x
21
Holm's procedure
= 0.05
= 0.10
0.1
0.09
0.08
p-value
0.07
0.06
0.05
0.04
0.03
0.02
0.01
5 76.0
10 2.0
52 2.0
LT
SUR
LCN
63 0.0
SS O
20 0.0
2 0 0. 0
enon
NNC-SU
00 0.0
00 0.0
M PC
LT+NNC
-SU
0 0 0. 0
CBS
Figure 11: Holms test on all data sets with AUC: The control algorithm is EBUS-MSGM
ferences as signicant, measuring the performance with GM or AUC.
One of the factors that makes more difcult the learning on imbalanced domains,
as we had already commented, is the increase of the degree of imbalance between
classes. In relation to this, we will make a second study that comprises the four algorithms which have no statistical differences with respect to the EBUS-MS-GM model
by dividing the group of imbalanced data sets into two subgroups, in the same way
as we did in previous section: those which have an IR < 9 and those which have an
IR > 9. Note that although the number of algorithms to be compared is lower than
originally, the number of data sets is also reduced to half, so the results reported by the
non-parametric tests are not inuenced in favor or against a desired result.
Firstly, we study the case where imbalanced data sets have IR < 9. Figure 12
shows a graphical representation of the Holms procedure.
Holm's Procedure
AUC
0 1.0=
5 0.0=
GM
0.1
0.09
0.08
p-value
0.07
0.06
0.05
0.04
0.03
0.02
0.01
22 40.0 86 10.0
9 3 6 0 . 0 6 8 4 0. 0
MG-SM-SUBE
LT
85 600 0 .05 100 . 0
SUR
384 00 0010 01 00 .0
.3
Control Algorithm: NCL
SSO
Figure 12: Holms test on data sets with IR < 9: The control algorithm is NCL
In the case of IR < 9, the best method is NCL. Measuring the performance by
means of GM, NCL is statistically better than all the method considered with = 0.05
and = 0.10. However, with AUC as performance metric, EBUS-MS-GM and TL
behave equally to it when considering = 0.05.
Secondly, we study the case where imbalanced data sets have IR > 9. Figure 13
22
Volume x, Number x
shows a graphical representation of the Holms procedure.

Holm's Procedure
AUC
0 1.0=
5 0.0=
GM
0.1
0.09
0.08
p-value
0.07
0.06
0.05
0.04
0.03
0.02
0.01
22 40.0 23 20.0
3 4 9 0 . 0 8 2 0 4. 0
SSO
SUR
6 800. 0 8 200. 0
LCN
68 00 .08 250 00 .0
LT
Figure 13: Holms test on data sets with IR > 9: The control algorithm is EBUS-MS-GM
Considering = 0.05, EBUS-MS-GM is similar in performance to RUS and it is
also similar to OSS when evaluating with AUC.
Considering = 0.10, EBUS-MS-GM is similar to RUS in performance when using
GM as performance measure.
EBUS-MS-GM is the best method for a level of signicance = 0.10 and AUC as
performance measure.
The conclusions that we can extract analyzing these tables and gures are pointed
out as follows:
EUS models usually present an equal or better performance than the remaining
methods, independently of the degree of imbalance of data.
The best performing under-sampling model over imbalance data sets is EBUS-MSGM (Table 8).
EBUS-MS-GM is not the best model when we use imbalanced data sets with low
IR, although it obtains good results. The NCL algorithm is the most appropriate
to be used in this type of data sets (Figure 12), but when IR increases, it does not
behave well. Hence, NCL is not appropriate to use over data sets with high IR.
The tendency of the EUS models follows an improving of the behaviour in classication when the data turns to a high degree of imbalance.
EBUS-MS-GM model is the most accurate when we deal with data sets with IR >
9. This fact is proved by observing Figure 13 in which it is signicantly better than
the remaining of the algorithms by using AUC measure.
An observable difference exists when measuring the behaviour of the classical and
EUS methods between GM and AUC. For instance, with GM evaluation, the algorithm RUS and EBUS-MS-GM are signicantly equivalent to the Holms procedure. GM evaluates a trade-off between accuracy on positive and negative classes.
Volume x, Number x
23
RUS maintains all the positive examples and randomly selects a subset of negative instances. This subset of instances, although randomly selected, may become
a good representative of the negative set of instances. On the other hand, AUC
measures a trade-off between true positives and false positives, so it penalizes the
misclassication of positive instances. As well as it is easy to obtain a random subset of instances that is accurate with respect to examples of the same class, it is not
so easy to nd a random subset of instances of a certain class that does not harm
the classication of the opposite class. For this reason, RUS algorithm performs
well when considering GM and not as well with AUC.
Classical under-sampling algorithms, such as NCL and TL, lose accuracy when IR
becomes high. This is logical because of the fact that their intention is to preserve
minority class instances as well as to not produce massive removing on majority
class instances.
The model EBUS-MS-GM (in general EUS) can adapt to distinct situations of imbalance and it is not problem dependent.
7 Conclusions
This paper addressed the analysis of Prototype Selection and Under-Sampling algorithms over imbalance classication problems when they are applied in different imbalance ratios in the distribution of classes. A proposal of taxonomy of evolutionary
under-sampling methods is offered, categorizing all models according to the objective
of interest, the selection scheme and the evaluation measure.
An experimental study has been carried out to compare the results of the evolutionary under-sampling approach with non-evolutionary techniques.
The main conclusions achieved are the following ones:
Prototype Selection algorithms must not be used for handling imbalanced problems. They are prone to gain global performance by eliminating examples belonging to minority class considered as noisy examples.
During the evolutionary under-sampling process, the employment of majority selection mechanism helps to obtain more accurate subsets of instances than use of
global selection. However, the later mechanism is necessary to achieve highest
reduction rates.
A signicant difference between the use of GM or AUC in the evaluation of solutions in EUS approaches is not observed.
Data sets with a low imbalance ratio may be faced by EUSCM models, especially
by using the model with a global mechanism of selection and evaluation through
the GM measure.
Although over data sets with high imbalance ratio, all EUS models obtain good
results, we emphasize the EBUS models with a special interest in the one that
performs a majority selection by using the GM measure. The superiority of this
model in relation to state-of-the-art under-sampling algorithms has been empirically proved.
24
Volume x, Number x
Finally, we would like to point out that the EUS approach is a good choice for
under-sampling imbalanced data sets, specially when the data presents a high imbalance ratio among the classes. We recommend the use of the EBUS-MS-GM model over
imbalanced data sets.
As future research lines, we could tackle the following topics:
The use of Evolutionary Under-Sampling for training set selection (Cano et al.,
2007) in order to analyze the behaviour of other classication methods (C4.5,
SVMs, etc.), combined with subset selection for imbalanced data sets.
A study on the scalability for making it feasible to apply Evolutionary UnderSampling for very large data sets (Song et al., 2005; Cano et al., 2005).
The analysis of Evolutionary Under-Sampling in terms of data complexity (Ho and
Basu, 2002; Bernado-Mansilla and Ho, 2005) for a better understanding of the behaviour of our approach over data set depending on the data complexity measure
values.
Acknowledgments
This research has been supported by the project TIN2005-08386-C05-01. S. Garca holds
a FPU scholarship from Spanish Ministry of Education and Science. The authors are
very grateful to the anonymous reviewers for their valuable suggestions and comments
to improve the quality of this paper.
References
Aha, D. W., Kibbler, D., and Albert, M. K. (1991). Instance-based learning algorithms.
Machine Learning, 6:3766.
Akbani, R., Kwek, S., and Japkowicz, N. (2004). Applying support vector machines to
imbalanced datasets. In ECML, LNCS 3201, pages 3950.
Alba, E., Luque, G., and Araujo, L. (2006). Natural language tagging with genetic
algorithms. Information Processing Letters, 100(5):173182.
Alcal , R., Alcala-Fdez, J., Herrera, F., and Otero, J. (2007). Genetic learning of accurate
a
and compact fuzzy rule based systems based on the 2-tuples linguistic representation. International Journal of Approximate Reasoning, 44(1):4564.
Barandela, R., S nchez, J. S., Garca, V., and Rangel, E. (2003). Strategies for learning in
a
class imbalance problems. Pattern Recognition, 36(3):849851.

Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of
several methods for balancing machine learning training data. SIGKDD Explorations,
6(1):2029.
Bernado-Mansilla, E. and Garrell-Guiu, J. M. (2003). Accuracy-based learning classier systems: models, analysis and applications to classication tasks. Evolutionary
Computation, 11(3):209238.
Bernado-Mansilla, E. and Ho, T. K. (2005). Domain of competence of XCS classier

system in complexity measurement space. IEEE Transactions on Evolutionary Computation, 9(1):82104.
Volume x, Number x
25
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30(7):11451159.
Butz, M. V., Pelikan, M., Llor ;, X., and Goldberg, D. E. (2006). Automated global
a
structure extraction for effective local building block processing in XCS. Evolutionary
Computation, 14(3):345380.
Cano, J. R., Herrera, F., and Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Transactions
on Evolutionary Computation, 7(6):561575.
Cano, J. R., Herrera, F., and Lozano, M. (2005). Stratication for scaling up evolutionary
prototype selection. Pattern Recognition Letters, 26(7):953963.
Cano, J. R., Herrera, F., and Lozano, M. (2007). Evolutionary stratied training set
selection for extracting classication rules with trade-off precision-interpretability.
Data and Knowledge Engineering, 60:90108.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE:
Synthetic minority over-sampling technique. Journal of Articial Intelligence Research,
16:321357.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning
from imbalanced data sets. SIGKDD Explorations, 6(1):16.
Chawla, N. V., Lazarevic, A., Hall, L. O., and Bowyer, K. W. (2003). Smoteboost: Improving prediction of the minority class in boosting. In PKDD, pages 107119.
Cieslak, D. A., Chawla, N. V., and Striegel, A. (2006). Combating imbalance in network
intrusion datasets. In Proocedings of the IEEE International Conference on Granular Computing, pages 732737.
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., and Geissbuhler, A. (2006). Learning
from imbalanced data in surveillance of nosocomial infection. Articial Intelligence in
Medicine, 37(1):718.
Cordon, O., Damas, S., and Santamara, J. (2006). Feature-based image registration by
means of the CHC evolutionary algorithm. Image Vision Computing, 24(5):525533.

Dem ar, J. (2006). Statistical comparisons of classiers over multiple data sets. Journal
s
of Machine Learning Rechearch, 7:130.
Eshelman, L. J. (1991). The CHC adaptive search algorithm: How to safe search when
engaging in nontraditional genetic recombination. In Rawlings, G. J. E., editor, Foundations of genetic algorithms, pages 265283.
Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for
learning from imbalanced data sets. Computational Intelligence, 20(1):1836.
Garca, S., Cano, J. R., Fern ndez, A., and Herrera, F. (2006). A proposal of evolutionary
a
prototype selection for class imbalance problems. In IDEAL, LNCS 4224, pages 1415
1423.
26
Volume x, Number x
Grzymala-Busse, J. W., Stefanowski, J., and Wilk, S. M. (2005). A comparison of two

approaches to data mining from imbalanced data. Journal of Intelligent Manufacturing,
16:565573.
Guerra-Salcedo, C., Chen, S., Whitley, D., and Smith, S. (1999). Fast and accurate feature
selection using hybrid genetic strategies. In CEC, pages 177184.
Guerra-Salcedo, C. and Whitley, D. (1998). Genetic search for feature subset selection: A
comparison between CHC and GENESIS. In Proceedings of the Third Annual Conference
of Genetic Programming, pages 504509.
Guo, H. and Viktor, H. L. (2004). Learning from imbalanced data sets with boosting
and data generation: the databoost-im approach. SIGKDD Explorations, 6(1):3039.
Hart, P. E. (1968). The condensed nearest neighbour rule. IEEE Transactions on Informarion Theory, 18:515516.
Ho, S.-Y., Liu, C.-C., and Liu, S. (2002). Design of an optimal nearest neighbor classier
using an intelligent genetic algorithm. Pattern Recognitoin Letters, 23(13):14951503.
Ho, T. K. and Basu, M. (2002). Complexity measures of supervised classication problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289300.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics, 6:6570.
Huang, K., Yang, H., King, I., and Lyu, M. R. (2006). Imbalanced learning with a biased
minimax probability machine. IEEE Transactions on Systems, Man, and Cybernetics Part B: Cybernetics, 36(4):913923.
Iman, R. L. and Davenport, J. M. (1980). Approximations of the critical region of the
friedman statistic. Communications in Statistics, 18:571595.
Kovacs, T. and Kerber, M. (2006). A study of structural and parametric learning in XCS.
Evolutionary Computation, 14(1):119.
Kubat, M. and Matwin, S. (1997). Addressing the course of imbalanced training sets:
One-sided selection. In ICML, pages 179186.
Kuncheva, L. I. and Bezdek, J. C. (1998). Nearest prototype classication: Clustering,
genetic algorithms, or random search? IEEE Transactions on Systems, Man, and Cybernetics, 28(1):160164.
Laurikkala, J. (2001). Improving identication of difcult small classes by balancing
class distribution. In AIME 01: Proceedings of the 8th Conference on AI in Medicine in
Europe, pages 6366.
Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. (1998). UCI repository of
machine learning databases.
Orriols-Puig, A. and Bernado-Mansilla, E. (2006). Bounding XCSs parameters for unbalanced datasets. In GECCO 06: Proceedings of the 8th annual conference on Genetic
and evolutionary computation, pages 15611568.
Volume x, Number x
27
Papadakis, E. and Theocharis, B. (2006). A genetic method for designing TSK models
based on objective weighting: application to classication problems. Soft Computing,
10(9):805824.
Sheskin, D. (2003). Handbook of parametric and nonparametric statistical procedures. Chapman & Hall/CRC.
Sikora, R. and Piramuthu, S. (2007). Framework for efcient feature selection in genetic
algorithm based data mining. European Journal of Operational Research, 180:723737.
Song, D., Heywood, M. I., and Zincir-Heywood, A. N. (2005). Training genetic programming on half a million patterns: An example from anomaly detection. IEEE
Transactions on Evolutionary Computation, 9(3):225239.
Tomek, I. (1976). Two modications of cnn. IEEE Transactions on Systems, Man, and
Communications, 6:769772.
Wang, X., Yang, J., Teng, X., Xia, W., and Jensen, R. (2007). Feature selection based on
rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4):459
471.
Weiss, G. M. and Provost, F. J. (2003). Learning when training data are costly: The
effect of class distribution on tree induction. Journal of Articial Intelligence Research,
19:315354.
Whitley, D., Beveridge, R., Guerra, C., and Graves, C. (1998). Messy genetic algorithms
for subset feature selection. In Proceedings of the International Conference on Genetic
Algorithms, pages 568575.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data.
IEEE Transactions on Systems, Man, and Cybernetics, 2:408421.
Wilson, D. R. and Martinez, T. R. (2000). Reduction techniques for instancebasedlearning algorithms. Machine Learning, 38(3):257286.
Xie, J. and Qiu, Z. (2007). The effect of imbalanced data sets on LDA: A theoretical and
empirical analysis. Pattern Recognition, 40:557562.
Yen, S. and Lee, Y. (2006). Under-sampling approaches for improving prediction of the
minority class in an imbalanced dataset. In ICIC, LNCIS 344, pages 731740.
Yoon, K. and Kwek, S. (2005). An unsupervised learning approach to resolving the
data imbalanced issue in supervised learning problems in functional genomics. In
HIS 05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems,
pages 303308.
A Under-Sampling Methods Focused on Balancing Data versus Prototype

Selection Methods
This Appendix summarizes and describes the methods used in the experimental study
of this paper. We distinguish between methods used in PS task (subsection A.1) and
classical under-sampling methods focused in reducing data with the aim of balancing
data (subsection A.2).
28
Volume x, Number x
A.1 Prototype Selection Methods

Two classical models for PS are used in this study: An incremental well-known technique, IB3 (Aha et al., 1991), and a decremental one DROP3 (Wilson and Martinez,
2000). In addition to this, we point out that the study also includes the models CHC
and IGA for PS dened in Section 3. These models will be named as EPS-CHC and
EPS-IGA, respectively.
Next, we describe the two classical methods:
IB3: Instance x from the training set TR is added to the new set S if the nearest
acceptable instance in S (if there are not acceptable instances a random one is used)
has different class to x. The acceptable concept is dened as the condence interval:
p+
z2
2n
z
1
p(p1)
n
2
+z
n
z2
2n2
(10)
z is the condence factor (0.9 is used to accept, 0.7 to reject). p is the classication
accuracy of a x instance (while x is added to S). n is the number of classicationtrials for a given instance (while added to S).
DROP3: T R is copied to the subset selected S. It uses a noise ltering pass before
sorting the instances in S. This is done using the rule: Any instance not classied by
its k-nearest neighbours is removed (we use k = 3). After removing noisy instances
from S in this manner, the instances are sorted by distance to their nearest enemy
remaining in S, and thus points far from the real decision boundary are removed
rst. This allows points internal to clusters to be removed early in the process, even
if there were noisy points nearby. After the noise removal, the steps are described
in Figure 14 (Wilson and Martinez (2000)):
A.2 Classical Under-Sampling Methods for Balancing Class Distribution
In this work, we evaluate 8 different methods of under-sampling to balance the class
distribution on training data:
Random Under-Sampling (RUS): It is a non-heuristic method that aims to balance
class distribution through the random elimination of majority class examples to
get a balanced instance set. The nal ratio of balancing can be adjusted.
Tomek Links (TL) (Tomek, 1976): It can be dened as follows: given two examples
Ei = (xi , yi ) and Ej = (xj , yj ) where yi = yj and d(Ei , Ej ) being the distance
between Ei and Ej . A pair (Ei , Ej ) is called Tomek link if there is not an example
El , such that d(Ei , El ) < d(Ei , Ej ) or d(Ej , El ) < d(Ei , Ej ). Tomek links can be
used as an under-sampling method eliminating only examples belonging to the
majority class in each Tomek link found.
Condensed Nearest Neighbor Rule (US-CNN) (Hart, 1968): First, randomly draw
one majority class example and all examples from the minority class and put these
examples in S. Afterwards, use a 1-NN over the examples in S to classify the
examples in T R. Every misclassied example from T R is moved to S.
One-Sided Selection (OSS) (Kubat and Matwin, 1997): It is an under-sampling
method resulting from the application of Tomek links followed by the application
of US-CNN.
Volume x, Number x
29
Figure 14: Pseudocode of DROP3 algorithm

1. Let S = TR
2. For each instance s in S
3. Find s.N1..k+1, the k+1 nearest neighbors of s in S
4. Add s to each of its neighbors lists of associates
5. For each instance s in S
6. Let with = of associates of s classied correctly with s as a neighbor
7. Let without = of associates of s classied correctly without s
8. If (without - with) = 0
9. Remove s from S if at least as many of its associates in TR would
be classied correctly without s.
10. For each associate a of s
11. Remove s from as list of nearest neighbors
12. Find a new nearest neighbor for a
13. Add a to its new neighbors list of associates
14. For each neighbor k of s
15. Remove s from ks lists of associates
16. Return S
US-CNN + TL (Batista et al., 2004): It is similar to OSS, but the method US-CNN is
applied before the Tomek links.
Neighborhood Cleaning Rule (NCL) (Laurikkala, 2001): Uses the Wilsons Edited
Nearest Neighbor Rule (ENN) (Wilson, 1972) to remove majority class examples. For
each example Ei = (xi , yi ) in the training set, its three nearest neighbors are found.
If Ei belongs to the majority class and the classication given by its three nearest
neighbors contradicts the original class of Ei , then Ei is removed. If Ei belongs to
the minority class and its three nearest neighbors misclassify Ei , then the nearest
neighbors that belong to the majority class are removed.
Class Purity Maximization (CPM) (Yoon and Kwek, 2005): It attempts to nd a pair
of centers, one being a minority class instance while the other is a majority class
instance. Using these centers, it partitions all the instances into two clusters C1 and
C2 . If either of the clusters have less class impurity than its parents impurity (Imp)
then we have found our clusters. The impurity of a set of instances is simply the
proportion of minority class instances. It then recursively partitions each of these
clusters into subclusters. Thus, it forms a hierarchical clustering. If the impurity
cannot be improved then we stop the recursion. The algorithm is described in
Figure 15.
Under-Sampling Based on Clustering (SBC) (Yen and Lee, 2006): Considering that
the number of samples in the class-imbalanced data set is N , within it, the number
of samples belonging to the majority class is N and the number of minority class
samples is N + . SBC rst clusters all samples in the data set into K clusters. The
30
Volume x, Number x
Input: Imp: cluster impurity of parent cluster

parent: parent cluster ID
Output: subclusters Ci rooted at parent
CPM(Imp,parent)
1. impurity
2. While Imp impurity
3. If all the instance pairs in parent were tested then return
4. Pick a pair of majority and minority class instances as centers
5. Partition all instances into 2 clusters C1 and C2
according to nearest center
6. impurity min(impurity(C1 ), impurity(C2 ))
7. CPM(impurity(C1 ), C1 )
8. CPM(impurity(C2 ), C2 )
Figure 15: Pseudocode of CPM algorithm
number of majority class and minority class samples is Ni and Ni+ , respectively.
Therefore, the ratio of the number of majority class samples to the number of minority class samples in the i-th cluster is Ni /Ni+ . If the ratio of Ni to Ni+ in the
training data set is set to be m : 1, the number of selected majority class samples in
the i-th cluster is shown in expression 11:
SNi = (m N + )
Ni /Ni+
K
+
i=1 (Ni /Ni )
(11)
After determining the number of majority class samples in each cluster, it randomly chooses majority class samples in the i-th cluster.
Volume x, Number x
31
2.3.2.
Enhancing the Eectiveness and Interpretability of Decision Tree and

Rule Induction Classiers with Evolutionary Training Set Selection
over Imbalanced Problems
S. Garc A. Fernndez, F. Herrera, Enhancing the Eectiveness and Interpretability of

a,
a
Decision Tree and Rule Induction Classiers with Evolutionary Training Set Selection over
Imbalanced Problems. Applied Soft Computing Journal, submitted (2008).
Estado: Sometido a Revisin
o

Area de Conocimiento: Computer Science, Interdisciplinary Applications. Ranking 23 /

92.
Submission Confirmation
1 de 1
Asunto: Submission Confirmation

De: "Applied Soft Computing" <asoc@cranfield.ac.uk>
Fecha: 19 Oct 2008 23:54:53 +0100
Dear salvagl,
We have received your article "Enhancing the Effectiveness and Interpretability of
Decision Tree and Rule Induction Classifiers with Evolutionary Training Set Selection
over Imbalanced Problems" for consideration for publication in Applied Soft Computing.
Your manuscript will be given a reference number once an editor has been assigned.
To track the status of your paper, please do the following:
1. Go to this URL: http://ees.elsevier.com/asoc/
2. Enter these login details:
Your username is: salvagl
Your password is: garca362488
3. Click [Author Login]
This takes you to the Author Main Menu.
4. Click [Submissions Being Processed]
Thank you for submitting your work to this journal.
Kind regards,
Professor Rajkumar Roy
Editor in Chief
Applied Soft Computing
******************************************
Please note that the editorial process varies considerably from journal to journal. To
view a sample editorial process, please click here:
http://ees.elsevier.com/eeshelp/sample_editorial_process.pdf
For any technical queries about using EES, please contact Elsevier Author Support at
authorsupport@elsevier.com
Global telephone support is available 24/7:
For The Americas: +1 888 834 7287 (toll-free for US & Canadian customers)
For Asia & Pacific: +81 3 5561 5032
For Europe & rest of the world: +353 61 709190
20/10/2008 07:48
Elsevier Editorial System(tm) for Applied Soft Computing

Manuscript Draft
Manuscript Number:
Title: Enhancing the Effectiveness and Interpretability of Decision Tree and Rule Induction
Classifiers with Evolutionary Training Set Selection over Imbalanced Problems
Article Type: Fast Track Paper
Section/Category:
Keywords: Evolutionary Algorithms; Imbalanced Classification; Data Reduction; Training Set
Selection; Decision Trees; Rule Induction; HIS 2008.
Corresponding Author: Mr Salvador Garca, M.D.
Corresponding Author's Institution: University of Granada
First Author: Salvador Garca, M.D.
Order of Authors: Salvador Garca, M.D.; Alberto Fernndez, M.D.; Francisco Herrera, Ph.D.
Manuscript Region of Origin:
Abstract: Classification in imbalanced domains is a recent challenge in data mining. We refer to
imbalanced classification when data presents many examples from one class and few from the
other class, and the less representative class is the one which has more interest from the point of
view of the learning task. One of the most used techniques to tackle this problem consists in
preprocessing the data previously to the learning process. This preprocessing could be done
through under-sampling; removing examples, mainly belonging to the majority class; and oversampling, by means of replicating or generating new minority examples. In this paper, we propose
an under-sampling procedure guided by evolutionary algorithms to perform a training set selection
for enhancing the decision trees obtained by the C4.5 algorithm and the rule sets obtained by
PART rule induction algorithm. The proposal has been compared with other under-sampling and
over-sampling techniques and the results indicate that the new approach is very competitive in
terms of accuracy when comparing with over-sampling and it outperforms standard undersampling. Moreover, the obtained models are smaller in terms of number of leaves or rules
generated and they can considered more interpretable. The results have been contrasted through
non-parametric statistical tests over multiple data sets.
Cover Letter
Department of Computer Science and Artificial Intelligence

University of Granada
Granada, Spain 18071
Monday, 20th October, 2008
Dr. R. Roy, Editor
Applied Soft Computing Journal
Dear Dr. Roy:
Please find enclosed a manuscript entitled: "Enhancing the Effectiveness and
Interpretability of Decision Tree and Rule Induction Classifiers with Evolutionary
Training Set Selection over Imbalanced Problems" which I am submitting for exclusive
consideration of publication as an article in Applied Soft Computing Journal.
The paper demonstrates some new research in the field of data reduction in data mining.
As such this paper should be of interest to a broad readership including those interested
in data mining techniques, especially in data reduction, evolutionary algorithms and
decision tree and rule induction process for classification.
Thank you for your consideration of my work! Please address all correspondence
concerning this manuscript to me at University of Granada and feel free to correspond
with me by e-mail (salvagl@decsai.ugr.es).
Sincerely,
Salvador Garca
Manuscript
Enhancing the Eectiveness and Interpretability of Decision Tree and Rule Induction
Classiers with Evolutionary Training Set Selection over Imbalanced Problems
Salvador Garca,a , Alberto Fern ndeza, Francisco Herreraa
a
a Dept.
of Computer Science and Articial Intelligence, University of Granada, 18071, Granada, Spain.
Abstract
Classication in imbalanced domains is a recent challenge in data mining. We refer to imbalanced classication when data
presents many examples from one class and few from the other class, and the less representative class is the one which has more
interest from the point of view of the learning task. One of the most used techniques to tackle this problem consists in preprocessing
the data previously to the learning process. This preprocessing could be done through under-sampling; removing examples, mainly
belonging to the majority class; and over-sampling, by means of replicating or generating new minority examples. In this paper,
we propose an under-sampling procedure guided by evolutionary algorithms to perform a training set selection for enhancing the
decision trees obtained by the C4.5 algorithm and the rule sets obtained by PART rule induction algorithm. The proposal has
been compared with other under-sampling and over-sampling techniques and the results indicate that the new approach is very
competitive in terms of accuracy when comparing with over-sampling and it outperforms standard under-sampling. Moreover, the
obtained models are smaller in terms of number of leaves or rules generated and they can considered more interpretable. The results
have been contrasted through non-parametric statistical tests over multiple data sets.
Key words:
Evolutionary Algorithms, Imbalanced Classication, Data Reduction, Training Set Selection, Decision Trees, Rule Induction.
1. Introduction
The data used in a classication task could be not perfect.
Data could present dierent types of imperfections, such as
the presence of errors or missing values or imbalanced distribution of classes. In the last years, the class imbalance problem
is one of the emergent challenges in Data Mining (DM) [45].
The problem appears when the data presents a class imbalance,
which consists in containing many more examples of one class
than the other one and the less representative class represents
the most interesting concept from the point of view of learning
[10]. The imbalance classication problem is very related with
the cost-sensitive classication problem [9]. Imbalance in class
distribution is pervasive in a variety of real-world applications,
including but not limited to telecommunications [37], WWW,
nance, ecology [29], biology and medicine [21].
Usually, in imbalanced classication problems, the instances
are grouped into two type of classes: the majority or negative
class, and the minority or positive class. The minority or positive class has more interest and it is also accompanied with a
higher cost of making errors. A standard classier might ignore
the importance of the minority class because its representation
inside the data set is not strong enough. As a classical example, if the ratio of imbalance presented in the data is 1:100 (that
Corresponding
Author. Tel.: +34 958 240598

Email addresses: salvagl@decsai.ugr.es.es (Salvador Garca),
alberto@decsai.ugr.es (Alberto Fern ndez), herrera@decsai.ugr.es

a
(Francisco Herrera)
Preprint submitted to Applied Soft Computing
is, there is one positive instance versus one hundred negatives),

the error of ignoring this class is only 1%, so many classiers
could ignore it or could not make any eort to learn an eective
model for it.
Many approaches have been proposed to deal with the class
imbalance problem. They can be divided into algorithmic approaches and data approaches. The rst ones assume modications in the operation of the algorithms, making them costsensitive towards the minority class [24, 32, 47, 34]. The
data approaches modify the data distribution, conditioned on
an evaluation function. Re-sampling of data could be done by
means of under-sampling, by removing instances from the data,
and over-sampling, by replicating or generating new minority
examples. There have been numerous papers and case studies
exemplifying their advantages [8, 3, 16, 9, 18].
Decision trees and rule induction algorithms are very important techniques and they are used extensively in DM [26]. They
are able to produce human-readable descriptions of trends in the
underlying relationships of a data set and can be used for classication and prediction tasks. In the literature, many techniques
of decision trees and rule induction algorithms have been proposed [4, 35, 19, 38]. In their conventional denition, these algorithms can be applied to imbalanced classication problems,
although the performance that they achieve is not the adequate,
unless we use appropriate algorithms which are adapted to imbalanced performance measures [17].
Evolutionary Algorithms (EAs) [14] have been used for DM
with promising results [20, 11]. In data reduction, they have
October 19, 2008
been successfully used for feature selection [42, 25, 40, 46] and
instance selection [5, 6, 22]. EAs also have a good behaviour
for Training Set Selection (TSS) in terms of getting a trade-o
between precision and interpretability with classication rules
[7].
In the eld of class imbalanced classication, EAs are being applied recently. In [28], an EA is used to search an optimal tree in a global manner for cost-sensitive classication. In
[13], the authors proposes new heuristics and metrics for improving the performance of several genetic programming classiers in imbalanced domains. EAs have also been applied for
under-sampling the data in imbalanced domains in instancebased learning [23].
In this contribution, we propose the use of EAs for TSS in
imbalanced data sets. Our objective is to increase the eectiveness of a well-known decision tree classier, C4.5 [35], and
a rule induction algorithm, PART [19] by means of removing instances guided by an evolutionary under-sampling algorithm. We compare our approach with other under-sampling,
over-sampling methods and hybridization proposals of oversampling and under-sampling [3] studied in the literature. The
empirical study is contrasted via non-parametrical statistical
testing in a multiple data set environment.
To achieve this objective, the rest of the contribution is organized as follows: Section 2 gives an overview about imbalanced classication. In Section 3, the evolutionary TSS issues
are explained, together with a description of the used model. In
Section 4 the experimentation framework, the results obtained
and their analysis are presented. Section 5, remarks our conclusion. Finally, the Appendix A is included in order to illustrate
the comparisons of our proposal with other techniques through
star plots.
1. Solutions at the data level [8, 3, 9]: This kind of solution consists of balancing the class distribution by oversampling the minority class (positive instances) or undersampling the majority class (negative instances).
2. Solutions at the algorithmic level: In this case we may t
our method adjusting the cost per class [24], for example,
adjusting the probability estimation in the leaves of a decision tree bias the positive class [41].
We focus on the two class imbalanced data sets, where there
are only one positive and one negative class. We consider the
positive class as the one with the lower number of examples and
the negative class the one with the higher number of examples.
In order to deal with the class imbalance problem we analyse
the cooperation of some preprocessing methods of instances.
2.2. Evaluation in Imbalanced Domains
The most straightforward way to evaluate the performance
of classiers is based on the confusion matrix analysis. Table
1 illustrates a confusion matrix for a two class problem having
positive and negative class values. From such a table it is possible to extract a number of widely used metrics for measuring
the performance of learning systems, such as Error Rate (1) and
Accuracy (2).
Err =
FP + FN
T P + FN + FP + T N
(1)
Acc =
TP + TN
= 1 Err
T P + FN + FP + T N
(2)
Table 1: Confusion matrix for a two-class problem
Positive Class
Negative Class
2. Imbalanced Data Sets in Classication: Evaluation Metrics and Preprocessing Techniques
Positive Prediction
True Positive (TP)
False Positive (FP)
Negative Prediction
False Negative (FN)
True Negative (TN)
In [41] it is shown that the error rate of the classication of

the rules of the minority class is 2 or 3 time greater than the
rules that identify the examples of the majority class and that
the examples of the minority class are less probable to be predict than the examples of the majority one. Because of this,
instead of using the error rate (or accuracy), in the ambit of imbalanced problems more correct metrics are considered. Specifically, from Table 1 it is possible to derive four performance
metrics that directly measure the classication performance on
positive and negative classes independently:
In this section we will rst introduce the data set imbalance

problem. Then we will present the evaluation metrics for this
kind of classication problem. Finally we will show some preprocessing techniques that are commonly applied in order to
deal with the imbalanced data sets.
2.1. The Problem of Imbalanced Data Sets
The imbalanced data set problem in classication domains
occurs when the number of instances which represents one class
is much larger than the other classes. Furthermore, the minority
class is usually the one which has more interest from the point
of view of the learning task [10]. This problem is very related
with the cost-sensitive classication problem [21, 47, 32].
As we have mentioned, the classical machine learning algorithms may be biased towards the majority class and thus, may
predict poorly the minority class examples.
To solve the imbalance data-set problem there are two main
types of solutions:
TP
True positive rate T Prate = T P+FN is the percentage of
positive cases correctly classied as belonging to the positive class.
TN
True negative rate T Nrate = FP+T N is the percentage of
negative cases correctly classied as belonging to the negative class.
FP
False positive rate FPrate = FP+T N is the percentage of
negative cases misclassied as belonging to the positive
class.
FN
False negative rate FNrate = T P+FN is the percentage of
positive cases misclassied as belonging to the negative
class.
that are distant from the decision border. The remainder

examples, i.e. safe majority class examples and all minority class examples are used for learning.
These four performance measures have the advantage of being independent of class costs and prior probabilities. The aim
of a classier is to minimize the false positive and negative rates
or, similarly, to maximize the true negative and positive rates.
The metric used in this work is the geometric mean of the
true rates [2], which can be dened as
GM = acc+ acc
(3)
Neighborhood Cleaning Rule (NCL) uses the Wilsons Edited Nearest Neighbor Rule (ENN) [43, 31] to remove majority class examples. ENN removes any example
whose class label diers from the class of at least two of its
three nearest neighbors. NCL modies the ENN in order
to increase the data cleaning. For a two-class problem the
algorithm can be described in the following way: for each
example ei in the training set, its three nearest neighbors
are found. If ei belongs to the majority class and the classication given by its three nearest neighbors contradicts
the original class of ei , then ei is removed. If ei belongs
to the minority class and its three nearest neighbors misclassify ei , then the nearest neighbors that belong to the
majority class are removed.
where acc+ means the accuracy in the positive examples

(T Prate ) and acc is the accuracy in the negative examples
(T Nrate ). This metric tries to maximize the accuracy of each
one of the two classes with a good balance. It is a performance
metric that links both objectives.
2.3. Preprocessing Imbalanced Data Sets
2.3.2. Over-sampling methods

Synthetic Minority Over-sampling Technique
(SMOTE) [8] is an over-sampling method. Its main idea
is to form new minority class examples by interpolating
between several minority class examples that lie together.
Thus, the overtting problem is avoided and causes the
decision boundaries for the minority class to spread
further into the majority class space.
In the specialized literature, we may nd some papers for

re-sampling techniques from the point of view of the study of
the eect of the class distribution in classication [41, 16] and
adaptations of instance selection methods [44, 33] to treat with
imbalanced data sets. It has been proved that applying a preprocessing step in order to balance the class distribution is a positive solution to the imbalance data set problem [3]. Besides, the
main advantage of these techniques is that they are independent
of the classier used.
In this work we evaluate dierent instance selection methods
together with over-sampling and hybrid techniques to adjust the
class distribution in the training data. Specically we have used
the methods which oer the best results in [3]. These methods
are classied into three groups:
2.3.3. Hybrid methods: Over-sampling + Under-sampling

SMOTE + Tomek links (TL) Frequently, class clusters
are not well dened since some majority class examples
might be invading the minority class space. The opposite
can also be true, since interpolating minority class examples can expand the minority class clusters, introducing
articial minority class examples too deeply in the majority class space. Inducing a classier under such a situation
can lead to overtting. In order to create better-dened
class clusters, we propose applying Tomek links [39] to
the over-sampled training set as a data cleaning method.
Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes
are removed.
Under-sampling methods that create a subset of the original database by eliminating some of the examples of the
majority class.
Over-sampling methods that create a superset of the original database by replicating some of the examples of the
minority class or creating new ones from the original minority class instances.
SMOTE + ENN The motivation behind this method is

similar to SMOTE + Tomek links. ENN [43] tends to remove more examples than the Tomek links does, so it is
expected that it will provide a more in depth data cleaning. Dierently from NCL which is an under-sampling
method, ENN is used to remove examples from both
classes. Thus, any example that is misclassied by its three
nearest neighbors is removed from the training set.
Hybrid methods that combine the two previous methods

eliminating some of the minority class examples expanded
by the over-sampling method in order to get rid of overtting.
2.3.1. Under-sampling methods
One-sided selection (OSS) [30] is an under-sampling
method resulting from the application of Tomek links followed by the application of CNN. Tomek links are used
as an under-sampling method and removes noisy and borderline majority class examples. Borderline examples can
be considered unsafe since a small amount of noise can
make them fall on the wrong side of the decision border.
CNN aims to remove examples from the majority class
3. Evolutionary Training Set Selection in Imbalanced Classication

Let us assume that there is a training set T R with N instances
which consists of pairs (xi , yi ), i = 1, ..., N, where xi denes an
3
input vector of attributes and yi denes the corresponding class

label. Each of the N instances has M input attributes and they
should belong to positive or negative class. Let S T R be
the subset of selected instances resulted in the execution of an
algorithm.
TSS can be considered as a search problem in which EAs
can be applied. Our approach will be denoted by Evolutionary Under-Sampling for Training Set Selection (EUSTSS). We
take into account two important issues: the specication of the
representation of the solutions and the denition of the tness
function.
providing to the classication costs a higher weight (W) to

the instances that are no included in S than to the instances
included in S . An instance of T R well classied scores a
value W if it is not included in S and a value of 1 if it is
included in S . This procedure encourages the reduction
ability of the selected subset, due to the fact that it is more
benecial to evaluate chromosomes with a higher number
of examples out of the selected ones. Obviously, the instance causes a substraction on accuracy of the same magnitude in case of misclassication. Our empirical studies
have determined that a value of W equal to 3 works appropriately.
Representation: The search space associated is constituted

by all the subsets of T R. This is accomplished by using a
binary representation. A chromosome consists of N genes
(one for each instance in T R) with two possible states: 0
and 1. If the gene is 1, its associated instance is included
in the subset of T R represented by the chromosome. If it
is 0, this does not occur (see Figure 1).
Figure 2 represents the evolutionary under-sampling process followed by our proposal.
Input Imbalanced Data Set (D)

Training Set (T)
T 1 0 1 0 0 1 0 0
1
Evolutionary
Under-Sampling
Algorithm
S 1 3 6
Output Training Set
Selected (S)
Figure 1: Chromosome binary representation of a solution.
Fitness Function: Let S be a subset of instances of T R and

be coded by a chromosome. We dene a tness function
based on the GM measure evaluated over T R.
Fitness(S ) = GM.
Test Set (Ts)
Decision
Tree or Rule
Induction
Classifier
Figure 2: Evolutionary Under-Sampling process.
(4)
input : A population of chromosomes Pa
output: An optimized population of chromosomes Pa
This tness function is related with the proposal of Evolutionary Under-Sampling for nearest neighbours classier
guided by Classication Measures (EUSCM) proposed in
[23]. A decision tree or a rule induction algorithm can be
used for measuring the accuracy associated with the model
induced by using the instances selected in S . Obvioulsy,
the choice of the classier is conditioned to the nal evaluator classier, following a wrapper scheme. The accuracy
independently computed in each class is useful to obtain
GM value associated to the chromosome. The objective of
the EAs is to maximize the tness function dened: maximize the GM rate.
t 0;
Initialize(Pa,ConvergenceCount);
while not EndingCondition(t,Pa) do
Parents SelectionParents(Pa);
Offspring HUX(Parents);
Evaluate(Offspring);
Pn ElitistSelection(Offspring,Pa);
if not modified(Pa,Pn ) then
ConvergenceCount ConvergenceCount 1;
if ConvergenceCount = 0 then
Pn Restart(Pa);
Initialize(ConvergenceCount);
end
end
t t +1;
Pa Pn ;
end
Algorithm 1: Pseudocode of CHC algorithm
A mechanism to avoid overlearning in training data is

needed in the tness function. Although most of the tree
or rule induction algorithms, in their denition, usually incorporate a pruning mechanism to avoid overtting, the
inclusion of the induction process within an evolutionary
cycle can guide the resulting model to be optimal for only
known data, loosing the generalization ability. We incorporate a simple and eective mechanism which consists of
4
Data Set
Abalone9-18
Dermatology2
EcoliCP-IM
EcoliIM
EcoliIMU
EcoliOM
German
GlassBWFP
#Examples
731
366
220
336
336
336
1000
214
#Attributes
9
34
7
7
7
7
20
9
GlassBWNFP
214
GlassNW
GlassVWFP
Haberman
New-thyroid
PageBlocks(2,4,5)-3
Pima
Segment1
VehicleVAN
Vowel0
Yeast(1)
Yeast(2)
214
214
306
215
559
768
2310
846
990
467
1240
9
9
3
5
10
8
19
18
13
8
8
Yeast(3)
Yeast(4)
YeastCYT-POX
YeastNUC-POX
YeastPOX
1334
1120
483
449
1484
8
8
8
8
8
Class (min., maj.)

(18, 9)
(2, remainder)
(im, cp)
(im, remainder)
(iMU, remainder)
(om, remainder)
(1, 0)
(build-window-oat-proc,
remainder)
(build-window-non oat-proc,
remainder)
(non-windows glass, remainder)
(Ve-win-oat-proc, remainder)
(Die, Survive)
(hypo,remainder)
(3, 2+4+5)
(1,0)
(1, remainder)
(van, remainder)
(0, remainder)
(POX, MIT+ME3+EXC+ERL)
(POX+ERL,
MIT+NUC+CYT+ME1+EXC)
(EXC, MIT+NUC+CYT+ME3)
(VAC, NUC+CYT+ME3+EXC)
(POX, CYT)
(POX, NUC)
(POX, remainder)
%Class(min.,maj.)
(5.75, 94.25)
(16.67, 83.33)
(35.00, 65.00)
(22.92, 77.08)
(10.42, 89.58)
(6.74, 93.26)
(30.00, 70.00)
(32.71, 67.29)
(35.51, 64.49)
(23.93, 76.17)
(7.94, 92.06)
(26.47, 73.53)
(16.28, 83.72)
(5.01, 94.99)
(34.77, 66.23)
(14.29, 85.71)
(23.52, 76.48)
(9.01, 90.99)
(4.28, 95.72)
(2.02, 97.98)
(2.62, 97.38)
(2.68, 97.32)
(4.14, 95.86)
(4.45, 95.55)
(1.35, 98.65)
Table 2: Imbalanced Data Sets.
As the evolutionary computation method, we have used

the CHC model [15, 7]. CHC is a classical evolutionary
model that introduces dierent features to obtain a tradeo between exploration and exploitation; such as incest
prevention, reinitialization of the search process when it
becomes blocked and the competition among parents and
ospring into the replacement process.
L/4, where L is the length of the chromosomes. If no ospring are inserted into the new population then the threshold is reduced by one.
No mutation is applied during the recombination phase.
Instead, when the population converges or the search stops
making progress (i.e., the dierence threshold has dropped
to zero and no new ospring are being generated which
are better than any member of the parent population) the
population is reinitialized to introduce new diversity to the
search. The chromosome representing the best solution
found over the course of the search is used as a template
to reseed the population. Reseeding of the population is
accomplished by randomly changing 35% of the bits in
the template chromosome to form each of the other N 1
new chromosomes in the population. The search is then
resumed.
During each generation the CHC develops the following

steps.
It uses a parent population of size N to generate an
intermediate population of N individuals, which are
randomly paired and used to generate N potential ospring.
Then, a survival competition is held where the best
N chromosomes from the parent and ospring populations are selected to form the next generation.
The pseudocode of CHC appears in Algorithm 1.
CHC also implements a form of heterogeneous recombination using HUX, a special recombination operator.
HUX exchanges half of the bits that dier between parents, where the bit position to be exchanged is randomly
determined. CHC also employs a method of incest prevention. Before applying HUX to the two parents, the
Hamming distance between them is measured. Only those
parents who dier from each other by some number of bits
(mating threshold) are mated. The initial threshold is set at
Crossover operator for data reduction: In order to achieve

a good reduction rate, Heuristic Uniform Crossover
(HUX) implemented for CHC undergoes a change that
makes more dicult the inclusion of instances inside the
selected subset. Therefore, if a HUX switches a bit on in
a gene, then the bit could be switched o depending on a
certain probability (its value will be specied in Section
4.1, Table 3).
5
4. Experimental Framework and Results
EUSTSS proposal obtains the best average result in GM

evaluation measure. It clearly outperforms the other
under-sampling methods (OSS and NCL) and it improves
the accuracy even when comparing with over-sampling approaches.
This section describes the methodology followed in the experimental study of the re-sampling compared techniques. We
will explain the conguration of the experiment: used data sets
and parameters for the algorithms. The algorithms used in the
comparison are the same described in Section 2.3.
Over-sampling techniques obtain better accuracy than

under-sampling procedures in combination with C4.5 (see
[3]), but they cannot outperform EUSTSS proposal.
4.1. Experimental Framework
Except for NCL, EUSTSS produces decision trees with

lower number of leaves than the remaining methods. Although the combination NCL + C4.5 yields smaller trees,
its accuracy in GM is worse than the one obtained by EUSTSS and over-sampling approaches.
Performance of the algorithms is analysed by using 25 data

sets taken from the UCI Machine Learning Database Repository [1]. Multi-class data sets are modied to obtain twoclass non-balanced problems, dening the joint of one or more
classes as positive and the joint of one or more classes as negative.
The main characteristics of these data sets are summarized
in Table 2. For each data set, it shows the number of examples
(#Examples), number of attributes (#Attributes) and class name
(minority and majority).
The data sets considered are partitioned using the ten fold
cross-validation (10-fcv) procedure. The parameters of the used
algorithms are presented in Table 3.
Algorithm
SMOTE
EUSTSS
Over-sampling techniques force C4.5 to produce big trees.

This fact is not desirable when our interest lies in interpretable models.
We have included a second type of table accomplishing
a statistical comparison of methods over multiple data sets.
Demar [12] recommends a set of simple, safe and robust
s
non-parametric tests for statistical comparisons of classiers.
We will use two non-parametric procedures for conducting the
comparisons. One of them is Wilcoxon Signed-Ranks Test [36].
It is a pairwise test which can be used for comparing two algorithms. The second one is the Holms procedure [27], which is a
multiple comparison procedure used for contrasting the results
obtained by a control algorithm against a set of algorithms. It
is a 1 n comparison procedure which controls the family wise
error rate [36] and it should be used when we want to compare
a proposal that obtains the best results in a certain performance
measure. Table 7 collects the results of applying Wilcoxons
and Holms tests between our proposed methods and the rest of
re-sampling algorithms studied in this paper over the 25 data
sets considered. This table is divided into three parts: In the
rst part, the measure of performance used is the accuracy classication in test set through GM and the Wilcoxons test is conducted. In the second part, we accomplish Wilcoxons test by
using as performance measure the number of leaves yielded by
C4.5. Finally, the third part contains the results of Holms test
over GM evaluation measure. Note that Holms test cannot be
applied when comparing the number of leaves yielded by the
trees, because by considering this performance measure, the
NCL algorithm outperforms our proposal and a 1 n comparison has no sense. Each part of this table contains one column,
representing our proposed methods, and Na rows where Na is
the number of algorithms considered in this study. In each one
of the cells, three symbols can appear: +, = or -. They represent that the proposal outperforms (+), is similar (=) or is worse
(-) in performance than the algorithm which appears in the column (Table 7). The value in parentheses is the p-value obtained
in the comparison and the level of signicance considered is
= 0.05.
We make a brief analysis of results summarized in Table 7:
Parameters
k = 5, Balancing Ratio = 1 : 1
Pop = 50, Eval = 10000,
Prob. inclusion HUX = 0.25, W = 3
Table 3: Parameters considered for the algorithms.
4.2. Results and Analysis for C4.5

Tables 4 and 5 show the results in training and test data obtained by the re-sampling approaches compared by means of
GM evaluation measure. The column denoted by none corresponds to the case in which no re-sampling is performed previous to C4.5. The best case in each data set is remarked in
bold.
Figure 3 in Appendix A illustrates the comparison of EUSTSS with the remaining techniques considered in this study in
terms of GM accuracy over test data and using C4.5 as classier.
Table 6 shows the average number of leaves obtained by C4.5
in each data set.
Observing Tables 4, 5 and 6, we can make the following analysis:
In training data, the results are mainly favourable to the
SMOTE and SMOTE+ENN algorithms. Nevertheless,
when we take into account the results obtained in test data,
we see that SMOTE, in average, loses performance with
respect to the hybrid techniques and EUSTSS. This points
out that, in spite of the fact that all the techniques used produce overlearning, the one produced by SMOTE is more
remarkable.
The use of Wilcoxons and Holms tests conrms the improvement caused by EUSTSS over OSS and NCL under6
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
0.6611
0.9563
0.9869
0.8602
0.8794
0.9416
0.7779
0.9391
0.8684
0.9770
0.8476
0.4660
0.9678
0.9919
0.8151
0.9908
0.9856
0.9973
0.6699
0.3938
0.8862
0.1086
0.2568
0.6742
0.0000
0.7560
NCL
0.7206
0.9240
0.9526
0.9184
0.8799
0.9197
0.6881
0.7557
0.6501
0.8456
0.8828
0.4856
0.9507
0.9542
0.7115
0.9827
0.8965
0.9531
0.7491
0.7902
0.9053
0.1460
0.8052
0.8265
0.7362
0.8012
OSS
0.7218
0.9437
0.9869
0.9275
0.9234
0.9576
0.7790
0.8528
0.8766
0.9670
0.9691
0.7215
0.9787
0.9918
0.8115
0.9957
0.9696
0.9973
0.6171
0.4203
0.8973
0.4341
0.3438
0.6742
0.0000
0.7903
SMOTE
0.9348
0.9894
0.9906
0.9502
0.9722
0.9782
0.8676
0.9553
0.9450
0.9899
0.9779
0.7733
0.9869
1.0000
0.8631
0.9991
0.9889
0.9941
0.9467
0.8888
0.9642
0.7927
0.9072
0.9215
0.8279
0.9362
SMOTE + ENN
0.9337
0.9853
0.9860
0.9483
0.9625
0.9891
0.8136
0.8915
0.8964
0.9679
0.9611
0.7519
0.9873
1.0000
0.8387
0.9988
0.9784
0.9949
0.9460
0.8918
0.9675
0.8241
0.9205
0.9379
0.8502
0.9289
SMOTE + TL
0.8543
0.9845
0.9862
0.9341
0.9331
0.9566
0.7773
0.8906
0.8720
0.9704
0.8968
0.7520
0.9854
0.9980
0.8210
0.9972
0.9713
0.9947
0.8769
0.8668
0.9334
0.6912
0.8793
0.8970
0.8220
0.9017
EUSTSS
0.8449
0.9820
0.9869
0.9428
0.9374
0.9914
0.7474
0.9157
0.8856
0.9783
0.9608
0.7141
0.9963
1.0000
0.8084
0.9969
0.9666
0.9979
0.9357
0.8936
0.9554
0.7793
0.9377
0.9745
0.8473
0.9191
Table 4: Results obtained by C4.5 using GM evaluation measure over training data
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
0.3763
0.8623
0.9787
0.8167
0.7709
0.8073
0.5759
0.8138
0.6934
0.8942
0.5286
0.4280
0.9048
0.9270
0.6908
0.9852
0.9172
0.9808
0.4121
0.1155
0.7343
0.0000
0.0699
0.5828
0.0000
0.6347
NCL
0.4761
0.8988
0.9486
0.8882
0.7600
0.8220
0.6437
0.6652
0.5648
0.8101
0.6755
0.4329
0.9132
0.9327
0.6457
0.9728
0.8737
0.9360
0.5979
0.7038
0.8653
0.0000
0.7245
0.6151
0.6238
0.7196
OSS
0.4963
0.8928
0.9787
0.8860
0.8092
0.8749
0.6753
0.7551
0.7353
0.9505
0.6884
0.6089
0.8810
0.9260
0.7161
0.9849
0.9118
0.9808
0.3414
0.2151
0.8313
0.1144
0.1000
0.5536
0.0000
0.6763
SMOTE
0.6023
0.9194
0.9751
0.8795
0.8661
0.8412
0.6410
0.8216
0.7511
0.9239
0.6994
0.6832
0.9193
0.9991
0.7155
0.9918
0.9202
0.9657
0.5399
0.6783
0.7983
0.3737
0.5585
0.6974
0.5718
0.7733
SMOTE + ENN
0.6724
0.9181
0.9748
0.9060
0.8137
0.8010
0.6636
0.7599
0.7631
0.9373
0.7572
0.6292
0.9492
0.9991
0.6990
0.9947
0.9216
0.9764
0.6073
0.6940
0.8890
0.4509
0.6156
0.6630
0.5408
0.7839
SMOTE + TL
0.6724
0.9098
0.9787
0.8811
0.8671
0.8725
0.6658
0.7971
0.7427
0.9344
0.4930
0.6022
0.9414
0.9807
0.7181
0.9965
0.9241
0.9671
0.6883
0.7477
0.8649
0.3044
0.6176
0.5647
0.6410
0.7749
Table 5: Results obtained by C4.5 using GM evaluation measure over test data
EUSTSS
0.6697
0.9505
0.9787
0.8809
0.8579
0.9291
0.6419
0.8425
0.7235
0.9321
0.7816
0.6206
0.9463
0.9991
0.7179
0.9891
0.9239
0.9734
0.6271
0.6846
0.8759
0.3749
0.6489
0.6819
0.6154
0.7947
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
8.10
10.6
2.00
5.30
10.00
3.90
91.00
12.20
12.40
6.70
7.50
2.60
4.10
4.7
22.40
10
20.60
7.80
3
3
5
1.4
1.70
2.9
0
10.36
NCL
6.50
5.4
2.50
5.10
5.80
3.40
35.30
5.80
5.50
4.10
6.10
3.90
2.60
3.1
16.10
8.9
12.50
5.00
2.2
3.9
4.2
1.3
3.70
4.2
2
6.36
OSS
7.30
8.9
2.00
6.20
6.50
4.40
57.60
6.70
11.60
4.40
8.40
8.70
4.30
4.7
24.60
12.4
16.30
7.80
3.2
3.1
3.3
5
2.30
3
0
8.91
SMOTE
57.50
15.5
2.90
10.40
16.70
7.80
159.90
15.70
19.90
9.70
13.40
16.10
4.90
4.2
39.50
12.5
28.40
10.70
21.2
38.9
32.6
58.2
23.30
15.1
34.7
26.79
SMOTE + ENN
57.30
14.3
3.10
10.10
13.10
6.60
121.00
10.40
15.90
6.90
13.10
18.20
4.90
4.2
38.90
12.3
23.40
11.40
21.9
39
29.5
61.7
19.70
15.9
36.2
24.36
SMOTE + TL
52.60
14.5
2.00
10.40
14.00
6.80
82.40
10.40
15.90
7.10
13.50
18.00
5.00
4.2
34.90
12.6
22.50
10.50
19.2
36.7
28.8
54.2
21.20
18.5
36.8
22.11
EUSTSS
6.30
7.2
2.00
6.00
5.40
5.40
33.60
7.00
9.60
5.60
6.90
5.70
4.30
4
14.50
7.5
11.10
7.90
8.2
7
5
7.4
7.60
8
5
7.93
Table 6: Average number of leaves obtained by C4.5 decision tree
algorithm
none
NCL
OSS
SMOTE
SMOTE + ENN
SMOTE + TL
EUSTSS
Wilcoxon
GM
num. leaves
+ (.000)
= (.447)
+ (.000)
- (.001)
+ (.001)
= (.316)
+ (.011)
+ (.000)
= (.391)
+ (.000)
= (.317)
+ (.000)
STSS allows C4.5 to induce very precise trees with small

size.
Holm
GM
+ (.000)
+ (.000)
+ (.005)
= (.248)
= (1.000)
= (1.000)
4.3. Results and Analysis for PART

Tables 8 and 9 show the results in training and test data obtained by the re-sampling approaches compared by means of
GM evaluation measure. The column denoted by none corresponds to the case in which no re-sampling is performed previous to PART. The best case in each data set is remarked in
bold.
Figure 4 in Appendix A illustrates the comparison of EUSTSS with the remaining techniques considered in this study in
terms of GM accuracy over test data and using PART as classier.
Table 10 shows the average number of leaves obtained by
C4.5 in each data set.
Observing Tables 8, 9 and 10, we can make the following
analysis:
Table 7: Non-parametric statistical tests results over GM and number of rules

using C4.5
sampling methods. Curiously, it statistically outperforms

SMOTE considering a pairwise comparison. We have seen
in Table 5 that SMOTE obtains a similar average GM
to SMOTE + TL, but Wilcoxons test indicates us that
SMOTE has an irregular behaviour depending on the data
sets.
In the case of interpretability, Wilcoxons test conrms the
results observed in Table 6. The combination EUSTSS +
C4.5 yields a low number of rules.
In training data, the results are mainly favourable to the

SMOTE and SMOTE+TL algorithms. In the case of
PART, the overlearning in training data is less notorios
than in the case of C4.5.
EUSTSS outperforms OSS, NCL and SMOTE in GM

measure and behaves similarly to SMOTE + TL and
SMOTE + ENN. However, the number of leaves produced
by C4.5 when it is applied after EUSTSS is much lower
than the produced by SMOTE and its hybridizations. EU-
EUSTSS proposal obtains the best average result in GM

measure together with SMOTE. It again outperforms the
other under-sampling methods (OSS and NCL) and it
achieves similar rates of accuracy when comparing with
over-sampling approaches.
8
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
0.6828
0.9761
0.9948
0.9061
0.7950
0.9775
0.9368
0.9475
0.8154
0.9793
0.9062
0.5842
0.9923
0.9960
0.7262
0.9983
0.9899
0.9963
0.4147
0.4637
0.8947
0.3732
0.3318
0.3556
0.0000
0.7614
NCL
0.8077
0.9750
0.9864
0.9232
0.9149
0.9873
0.8525
0.8661
0.8674
0.9672
0.9560
0.6973
0.9909
0.9956
0.7937
0.9986
0.9767
0.9962
0.6033
0.4655
0.9094
0.4846
0.2663
0.4376
0.0000
0.7888
OSS
0.6243
0.9363
0.9302
0.9079
0.8477
0.9294
0.7730
0.7860
0.6698
0.7973
0.7902
0.5209
0.9610
0.9542
0.7180
0.9835
0.9477
0.9692
0.7491
0.7929
0.9127
0.1601
0.8208
0.8102
0.7362
0.8011
SMOTE
0.9316
0.9892
0.9935
0.9418
0.9641
0.9879
0.9522
0.9246
0.9145
0.9862
0.9701
0.7321
0.9907
0.9998
0.7964
0.9993
0.9950
0.9970
0.9165
0.8851
0.9496
0.7812
0.9387
0.9162
0.8755
0.9332
SMOTE + ENN
0.8634
0.9855
0.9865
0.9254
0.9235
0.9664
0.8131
0.8927
0.8535
0.9678
0.9062
0.7212
0.9828
0.9980
0.7910
0.9977
0.9820
0.9994
0.8762
0.8695
0.9303
0.6616
0.8885
0.8861
0.8201
0.8995
SMOTE + TL
0.9236
0.9841
0.9804
0.9307
0.9563
0.9773
0.8818
0.9035
0.8939
0.9631
0.9624
0.7050
0.9871
1.0000
0.7789
0.9987
0.9852
0.9972
0.9556
0.9044
0.9618
0.7856
0.9425
0.9259
0.8602
0.9258
EUSTSS
0.8345
0.9830
0.8920
0.9116
0.9250
0.9787
0.7406
0.9251
0.8747
0.9778
0.9358
0.6389
0.9689
0.9945
0.6963
0.9856
0.9699
0.9725
0.9313
0.8974
0.9502
0.7839
0.9277
0.9605
0.8590
0.9006
Table 8: Results obtained by PART using GM evaluation measure over training data
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
0.4305
0.8776
0.9717
0.8335
0.6607
0.7193
0.6305
0.8136
0.6105
0.8963
0.6019
0.5161
0.8891
0.9553
0.6867
0.9890
0.9344
0.9557
0.2113
0.1155
0.8156
0.0000
0.0000
0.2121
0.0000
0.6131
NCL
0.3741
0.8791
0.9787
0.8687
0.7921
0.8144
0.6439
0.7957
0.7560
0.9446
0.6928
0.6111
0.9224
0.9525
0.6967
0.9810
0.9271
0.9557
0.3105
0.2151
0.8700
0.1147
0.0000
0.2414
0.0000
0.6535
OSS
0.4668
0.8882
0.9201
0.8740
0.7652
0.8311
0.6137
0.6973
0.5750
0.7370
0.4838
0.4754
0.9393
0.9327
0.6651
0.9774
0.9059
0.9040
0.5734
0.6266
0.8658
0.0553
0.8097
0.5879
0.6238
0.7118
SMOTE
0.6047
0.9409
0.9751
0.8651
0.8436
0.9014
0.6319
0.8046
0.7371
0.9131
0.7019
0.6417
0.9252
0.9807
0.7145
0.9911
0.9308
0.9706
0.6219
0.6787
0.8484
0.3950
0.4960
0.6928
0.5709
0.7751
SMOTE + ENN
0.5401
0.8855
0.9787
0.8698
0.8648
0.7979
0.6148
0.8102
0.6884
0.9273
0.7638
0.6513
0.9204
0.9624
0.7251
0.9921
0.9388
0.9665
0.6774
0.7248
0.8926
0.1122
0.7181
0.6642
0.6387
0.7731
SMOTE + TL
0.6355
0.9199
0.9606
0.8805
0.8447
0.9535
0.6453
0.7985
0.7136
0.9088
0.5089
0.5765
0.9261
0.9807
0.7134
0.9893
0.9530
0.9557
0.6061
0.6418
0.8651
0.1845
0.7451
0.6239
0.5439
0.7630
Table 9: Results obtained by PART using GM evaluation measure over test data
EUSTSS
0.5862
0.9672
0.8827
0.8806
0.8073
0.8710
0.6126
0.8302
0.7400
0.9213
0.7360
0.5478
0.9231
0.9914
0.6373
0.9838
0.9329
0.9232
0.5967
0.6871
0.8492
0.2934
0.7502
0.7254
0.7016
0.7751
dataset
abalone9-18
dermatology2
ecoliCP-IM
ecoliIM
ecoliMU
ecoliOM
german
glassBWFP
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
pageblocks(2,4,5)-3
pima
segment1
vehicle
vowel0
yeast(1)
yeast(2)
yeast(3)
yeast(4)
yeastCYT-POX
yeastNUC-POX
yeastPOX
AVERAGE
none
8.30
7.10
4.10
5.90
6.00
4.50
108.00
7.50
5.20
5.50
6.50
3.40
4.10
4.00
7.40
7.90
13.70
5.80
4.00
4.80
5.00
4.60
3.30
3.40
1.00
9.64
NCL
9.10
5.80
3.60
5.30
5.50
3.90
66.40
5.00
6.50
3.90
6.30
6.10
3.60
4.00
10.70
7.80
11.50
5.80
4.60
4.40
3.50
5.10
2.80
3.60
1.00
7.83
OSS
5.70
3.40
2.60
2.80
4.20
3.20
56.50
3.90
4.80
3.00
4.50
3.20
2.10
2.00
7.10
6.20
9.50
5.00
2.10
4.00
4.20
2.00
3.10
3.40
2.00
6.02
SMOTE
29.10
9.60
4.80
7.40
9.30
4.40
128.50
7.90
9.00
6.00
9.20
7.30
4.20
2.50
11.50
7.50
16.40
7.40
12.10
20.10
14.30
29.80
13.50
9.30
24.10
16.21
SMOTE + ENN
28.00
8.10
2.60
6.20
8.10
4.40
76.40
6.60
7.40
5.20
8.90
8.20
4.10
2.60
12.70
7.10
14.10
7.50
11.70
21.50
14.70
29.90
12.00
10.20
19.90
13.52
SMOTE + TL
27.00
9.90
4.10
6.30
6.90
4.30
100.70
7.10
8.30
5.00
8.00
9.90
4.40
2.60
12.80
7.60
13.70
7.80
13.90
18.20
14.70
28.50
12.50
11.30
21.30
14.67
EUSTSS
4.90
3.10
3.50
4.30
4.60
3.70
40.10
5.20
5.50
4.40
6.10
4.20
3.20
2.90
5.10
6.50
8.70
4.70
5.60
5.30
4.50
4.80
5.00
6.20
4.60
6.27
Table 10: Average number of rules obtained by PART
Except for OSS, EUSTSS produces smaller rule bases than

the remaining methods. Although the combination OSS +
PART yields the lowest number of rules, the accuracy in
GM is lower than the achieved by EUSTSS.
methods. In the case of PART, SMOTE is more robust

than in the C4.5 case and it lets to obtain accurate sets of
rules without the requirement of hybridization with noise
lters (ENN) or under-sampling techniques (TL).
Over-sampling techniques force PART to produce many

rules as the previous case.
When we refer to interpretability, Wilcoxons test again

conrms the results observed in Table 10. The combination EUSTSS + PART yields a low number of rules.
Table 11 includes the results of applying the non-parametric

statistical test between our proposed methods and the rest of resampling algorithms studied in this paper over the 25 data sets
considered. It follows the same structure as Table 7.
algorithm
none
NCL
OSS
SMOTE
SMOTE + ENN
SMOTE + TL
EUSTSS
Wilcoxon
GM
num. rules
+ (.001)
= (.174)
+ (.048)
= (.339)
+ (.001)
- (.001)
= (.667)
+ (.000)
= (.989)
+ (.000)
= (.925)
+ (.000)
EUSTSS outperforms OSS and NCL in GM measure and

behaves similarly to SMOTE and hybridizations. However, the number of rules produced by PART when it is
applied after EUSTSS is much lower than the produced by
SMOTE. As in the C4.5 case, EUSTSS allows PART to
obtain very accurate sets of rules with small size.
Holm
GM
+ (.000)
+ (.000)
+ (.020)
= (1.000)
= (1.000)
= (1.000)
5. Concluding Remarks
The purpose of this paper is to present a proposal of evolutionary training set selection algorithm for being applied over
imbalanced data sets to improve the performance of decision
tree or rule based induction classiers. The study has been
performed by using the C4.5 decision tree classier and PART
rule induction classier. The results shows that our proposal
allows to each one of the classiers used to obtain very accurate models (trees or rule bases) with a low number of leaves or
rules. The eectiveness of the models obtained is very competitive with respect to advanced hybrids of over-sampling. The
proposal oers more accurate models than the oered by other
Table 11: Non-parametric statistical tests results over GM and number of rules
using PART
We make a brief analysis of results summarized in Table 11:

The use of Wilcoxons test conrms the improvement
caused by EUSTSS over OSS and NCL under-sampling
10
under-sampling techniques, and the interpretability of the models obtained is increased due to the fact that the tree or rule
bases yielded are made up by a lower number of leaves/rules.
[22] S. Garca, J. R. Cano, F. Herrera, A memetic algorithm for evolutionary
prototype selection: A scaling up approach., Pattern Recognition 41 (8)

(2008) 26932709.
[23] S. Garca, F. Herrera, Evolutionary under-sampling for classication with
imbalanced data sets: Proposals and taxonomy., Evolutionary Computation. In press.

[24] J. W. Grzymala-Busse, J. Stefanowski, S. Wilk, A comparison of two
approaches to data mining from imbalanced data, Journal of Intelligent
Manufacturing 16 (2005) 565573.
[25] C. Guerra-Salcedo, S. Chen, D. Whitley, S. Smith, Fast and accurate feature selection using hybrid genetic strategies., in: Proceedings of the International Conference on Evolutionary Computation, 1999, pp. 177184.
[26] J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[27] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6 (1979) 6570.
[28] M. Kretowski, M. Grzes, Evolutionary induction of decision trees for
misclassication cost minimization, in: ICANNGA (1), Lecture Notes
in Computer Science 4431, 2007, pp. 110.
[29] M. Kubat, R. C. Holte, S. Matwin, Machine learning for the detection
of oil spills in satellite radar images, Machine Learning 30 (2-3) (1998)
195215.
[30] M. Kubat, S. Matwin, Addressing the course of imbalanced training sets:
One-sided selection., in: ICML 97Proceeding of the Fourteenth International Conference on Machine Learning, 1997, pp. 179186.
[31] J. Laurikkala, Improving identication of dicult small classes by balancing class distribution, in: AIME 01: Proceedings of the 8th Conference on AI in Medicine in Europe, 2001, pp. 6366.
[32] C. X. Ling, V. S. Sheng, Test strategies for cost-sensitive decision trees,
IEEE Transactions on Knowledge and Data Engineering 18 (8) (2006)
10551067, senior Member-Qiang Yang.
[33] E. Marchiori, Hit miss networks with applications to instance selection,
Journal of Machine Learning Research 9 (2008) 9971017.
[34] A. Orriols-Puig, E. Bernad -Mansilla, Evolutionary rule-based syso
tems for imbalanced data sets., Soft Computing. In press. DOI:
10.1007/s00500-008-0319-7.
[35] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann
Series in Machine Learning), Morgan Kaufmann, 1993.
[36] D. Sheskin, Handbook of parametric and nonparametric statistical procedures., Chapman & Hall/CRC, 2006.
[37] A. Tajbakhsh, M. Rahmati, A. Mirzaei, Intrusion detection using
fuzzy association rules., Applied Soft Computing. In press. DOI:
10.1016/j.asoc.2008.06.001.
[38] F. A. Thabtah, P. I. Cowling, A greedy classication algorithm based on
association rule, Applied Soft Computing 7 (3) (2007) 11021111.
[39] I. Tomek, Two modications of cnn., IEEE Transactions on Systems,
Man, and Communications 6 (1976) 769772.
[40] B. Verma, P. Zhang, A novel neural-genetic algorithm to nd the most
signicant combination of features in digital mammograms, Applied Soft
Computing 7 (2) (2007) 612625.
[41] G. M. Weiss, F. J. Provost, Learning when training data are costly: The
eect of class distribution on tree induction., Journal of Articial Intelligence Research 19 (2003) 315354.
[42] D. Whitley, R. Beveridge, C. Guerra, C. Graves, Messy genetic algorithms for subset feature selection., in: Proceedings of the International
Conference on Genetic Algorithms, 1998, pp. 568575.
[43] D. L. Wilson, Asymptotic properties of nearest neighbor rules using
edited data., IEEE Transactions on Systems, Man, and Cybernetics 2
(1972) 408421.
[44] D. R. Wilson, T. R. Martinez, Reduction techniques for instance-based
learning algorithms, Machine Learning 38 (3) (2000) 257286.
[45] Q. Yang, X. Wu, 10 challenging problems in data mining research., International Journal of Information Technology & Decision Making 5 (4)
(2006) 597604.
[46] H. Yan, J. Zheng, Y. Jiang, C. Peng, S. Xiao, Selecting critical clinical
features for heart diseases diagnosis with a real-coded genetic algorithm,
Applied Soft Computing 8 (2) (2008) 11051111.
[47] S. Zhang, L. Liu, X. Zhu, C. Zhang, A strategy for attributes selection in
cost-sensitive decision trees induction, in: CITWORKSHOPS 08: Proceedings of the 2008 IEEE 8th International Conference on Computer and
Information Technology Workshops, IEEE Computer Society, Washing-
Acknowledgement
This work was supported by TIN2005-08386-C05-01.
References
[1] A. Asuncion, D. Newman, UCI machine learning repository (2007).
URL http://www.ics.uci.edu/mlearn/MLRepository.html
[2] R. Barandela, J. S. S nchez, V. Garca, E. Rangel, Strategies for learning
a
in class imbalance problems., Pattern Recognition 36 (3) (2003) 849851.

[3] G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, A study of the behavior of several methods for balancing machine learning training data,
SIGKDD Explorations 6 (1) (2004) 2029.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classication and
Regression Trees, Wadsworth, 1984.
[5] J. R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study., IEEE
Transactions on Evolutionary Computation 7 (6) (2003) 561575.
[6] J. R. Cano, F. Herrera, M. Lozano, On the combination of evolutionary algorithms and stratied strategies for training set selection in data mining,
Applied Soft Computing 6 (3) (2006) 323332.
[7] J. R. Cano, F. Herrera, M. Lozano, Evolutionary stratied training
set selection for extracting classication rules with trade-o precisioninterpretability., Data and Knowledge Engineering 60 (2007) 90108.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE:
Synthetic minority over-sampling technique., Journal of Articial Intelligence Research 16 (2002) 321357.
[9] N. V. Chawla, D. A. Cieslak, L. O. Hall, A. Joshi, Automatically countering imbalance and its empirical relationship to cost, Data Mining and
Knowledge Discovery 17 225252.
[10] N. V. Chawla, N. Japkowicz, A. Kotcz, Editorial: special issue on learning
from imbalanced data sets., SIGKDD Explorations 6 (1) (2004) 16.
[11] S. Dehuri, S. Patnaik, A. Ghosh, R. Mall, Application of elitist multiobjective genetic algorithm for classication rule generation, Applied
Soft Computing 8 (1) (2008) 477487.
[12] J. Demar, Statistical comparisons of classiers over multiple data sets,
s
Journal of Machine Learning Research 7 (2006) 130.
[13] J. Doucette, M. I. Heywood, Gp classication under imbalanced data sets:
Active sub-sampling and auc approximation, in: EuroGP, Lecture Notes
in Computer Science 4971, 2008, pp. 266277.
[14] A. E. Eiben, J. E. Smith, Introduction to Evolutionary Computing,
Springer-Verlag, 2003.
[15] L. J. Eshelman, The CHC adaptive search algorithm: How to safe search
when engaging in nontraditional genetic recombination., in: G. J. E.
Rawlings (ed.), Foundations of genetic algorithms, 1991, pp. 265283.
[16] A. Estabrooks, T. Jo, N. Japkowicz, A multiple resampling method for
learning from imbalanced data sets, Computational Intelligence 20 (1)
(2004) 1836.
[17] T. Fawcett, PRIE: a system for generating rulelists to maximize roc performance, Data Mining and Knowledge Discovery 17 (2) (2008) 207
224.
[18] A. Fern ndez, S. Garca, M. J. del Jesus, F. Herrera, A study of the bea
haviour of linguistic fuzzy rule based classication systems in the framework of imbalanced data-sets, Fuzzy Sets and Systems 159 (18) (2008)
23782398.
[19] E. Frank, I. H. Witten, Generating accurate rule sets without global optimization, in: ICML 98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1998, pp. 144151.
[20] A. A. Freitas, Data Mining and Knowledge Discovery with Evolutionary
Algorithms, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002.
[21] A. Freitas, A. da Costa Pereira, P. Brazdil, Cost-sensitive decision trees
applied to medical data, in: DaWaK, Lecture Notes in Computer Science
4654, 2007, pp. 303312.
11
ton, DC, USA, 2008, pp. 813.
A. Star Plot Representations: EUSTSS vs.

Methods
Remaining
12
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
yeast(4)
ecoliMU
yeast(3)
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
yeast(4)
ecoliOM
ecoliMU
yeast(3)
ecoliOM
yeast(4)
ecoliMU
yeast(3)
ecoliOM
yeast(2)
german
yeast(2)
german
yeast(2)
german
yeast(1)
glassBWFP
yeast(1)
glassBWFP
yeast(1)
glassBWFP
vowel0
glassBWNFP
vehicle
segment1
pima
pageblocks(2,4,5)-3
vowel0
glassNW
glassVWFP
haberman
new-thyroid
vowel0
yeast(4)
ecoliOM
glassVWFP
haberman
new-thyroid
(c) EUSTSS vs. OSS

abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
ecoliMU
yeast(3)
glassNW
segment1
pima
pageblocks(2,4,5)-3
glassVWFP
haberman
new-thyroid
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
ecoliMU
glassBWNFP
vehicle
(b) EUSTSS vs. NCL
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
yeast(4)
glassNW
segment1
pima
pageblocks(2,4,5)-3
(a) EUSTSS vs. C4.5 without preprocessing
yeast(3)
glassBWNFP
vehicle
ecoliOM
yeast(4)
ecoliMU
yeast(3)
ecoliOM
yeast(2)
german
yeast(2)
german
yeast(2)
german
yeast(1)
glassBWFP
yeast(1)
glassBWFP
yeast(1)
glassBWFP
vowel0
glassBWNFP
vehicle
segment1
pima
pageblocks(2,4,5)-3
vowel0
glassNW
glassBWNFP
vehicle
glassVWFP
haberman
new-thyroid
glassNW
segment1
pima
pageblocks(2,4,5)-3
(d) EUSTSS vs. SMOTE
glassVWFP
haberman
new-thyroid
vowel0
glassBWNFP
vehicle
glassNW
segment1
pima
pageblocks(2,4,5)-3
(e) EUSTSS vs. SMOTE+ENN
glassVWFP
haberman
new-thyroid
(f) EUSTSS vs. SMOTE+TL
Figure 3: Results obtained for C4.5 considering GM in test data
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliIM
yeast(4)
yeastCYT-POX
yeast(3)
ecoliIM
yeast(4)
ecoliMU
yeastCYT-POX
ecoliMU
yeast(3)
ecoliOM
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
ecoliOM
ecoliIM
yeast(4)
ecoliMU
yeast(3)
ecoliOM
yeast(2)
german
yeast(2)
german
yeast(2)
german
yeast(1)
glassBWFP
yeast(1)
glassBWFP
yeast(1)
glassBWFP
vowel0
glassBWNFP
vehicle
segment1
pima
pageblocks(2,4,5)-3
vowel0
glassNW
yeast(4)
yeastCYT-POX
ecoliMU
yeast(3)
ecoliOM
segment1
pima
pageblocks(2,4,5)-3
glassVWFP
haberman
new-thyroid
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
yeastCYT-POX
ecoliMU
yeast(3)
glassNW
(c) EUSTSS vs. OSS
ecoliIM
yeast(4)
glassBWNFP
vehicle
(b) EUSTSS vs. NCL

abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
ecoliIM
vowel0
glassVWFP
haberman
new-thyroid
pima
pageblocks(2,4,5)-3
(a) EUSTSS vs. PART without preprocessing
yeastCYT-POX
glassNW
segment1
glassVWFP
haberman
new-thyroid
abalone9-18
yeastPOX
dermatology2
yeastNUC-POX
ecoliCP-IM
glassBWNFP
vehicle
ecoliOM
yeast(4)
yeast(3)
ecoliIM
ecoliMU
ecoliOM
yeast(2)
german
yeast(2)
german
yeast(2)
german
yeast(1)
glassBWFP
yeast(1)
glassBWFP
yeast(1)
glassBWFP
vowel0
glassBWNFP
vehicle
segment1
pima
pageblocks(2,4,5)-3
glassNW
glassVWFP
haberman
new-thyroid
(d) EUSTSS vs. SMOTE
vowel0
glassBWNFP
vehicle
segment1
pima
pageblocks(2,4,5)-3
glassNW
glassVWFP
haberman
new-thyroid
(e) EUSTSS vs. SMOTE+ENN

Figure 4: Results obtained for PART considering GM in test data
13
vowel0
vehicle
segment1
pima
pageblocks(2,4,5)-3
glassBWNFP
glassNW
glassVWFP
haberman
new-thyroid
(f) EUSTSS vs. SMOTE+TL
2.4.
Dise o de Experimentos en Inteligencia Computacional: Son

bre el Uso de la Inferencia Estad
stica - Design of Experiments in Computational Intelligence: On the Use of Statistical Inference
2.4.1.
A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms Behaviour: A Case Study on the CEC2005 Special
Session on Real Parameter Optimization
S. Garc D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for
a,
Analyzing the Evolutionary Algorithms Behaviour: A Case Study on the CEC2005 Special
Session on Real Parameter Optimization. Journal of Heuristics, doi: 10.1007/s10732-008-90804, in press (2008).
Estado: Aceptado (publicado on-line)

Area de Conocimiento: Computer Science, Theory and Methods. Ranking 51 / 79.
J Heuristics
DOI 10.1007/s10732-008-9080-4
A study on the use of non-parametric tests for analyzing

the evolutionary algorithms behaviour: a case study
on the CEC2005 Special Session on Real Parameter
Optimization
Salvador Garca Daniel Molina Manuel Lozano
Francisco Herrera
Received: 24 October 2007 / Revised: 21 February 2008 / Accepted: 25 April 2008
Springer Science+Business Media, LLC 2008
Abstract In recent years, there has been a growing interest for the experimental
analysis in the eld of evolutionary algorithms. It is noticeable due to the existence
of numerous papers which analyze and propose different types of problems, such as
the basis for experimental comparisons of algorithms, proposals of different methodologies in comparison or proposals of use of different statistical techniques in algorithms comparison.
In this paper, we focus our study on the use of statistical techniques in the analysis of evolutionary algorithms behaviour over optimization problems. A study about
the required conditions for statistical analysis of the results is presented by using
some models of evolutionary algorithms for real-coding optimization. This study is
conducted in two ways: single-problem analysis and multiple-problem analysis. The
results obtained state that a parametric statistical analysis could not be appropriate
specially when we deal with multiple-problem results. In multiple-problem analysis,
we propose the use of non-parametric statistical tests given that they are less restrictive than parametric ones and they can be used over small size samples of results.
As a case study, we analyze the published results for the algorithms presented in the
This work was supported by Project TIN2005-08386-C05-01.
S. Garca holds a FPU scholarship from Spanish Ministry of Education and Science.
S. Garca ( ) M. Lozano F. Herrera
Department of Computer Science and Articial Intelligence, University of Granada, Granada 18071,
Spain
e-mail: salvagl@decsai.ugr.es
M. Lozano
e-mail: lozano@decsai.ugr.es
F. Herrera
e-mail: herrera@decsai.ugr.es
D. Molina
Department of Computer Engineering, University of Cdiz, Cdiz, Spain
e-mail: daniel.molina@uca.es
S. Garca et al.
CEC2005 Special Session on Real Parameter Optimization by using non-parametric

test procedures.
Keywords Statistical analysis of experiments Evolutionary algorithms
Parametric tests Non-parametric tests
1 Introduction
The No free lunch theorem (Wolpert and Macready 1997) demonstrates that it is
not possible to nd one algorithm being better in behaviour for any problem. On the
other hand, we know that we can work with different degrees of knowledge about
the problem which we expect to solve, and that it is not the same to work without
knowledge about the problem (hypothesis of the no free lunch theorem) than to
work with partial knowledge about the problem, knowledge that allows us to design
algorithms with specic characteristics which can make them more suitable for the
solution of the problem.
Once situated in this eld, the partial knowledge of the problem and the necessity
of having disposals of algorithms for its solution, the question about deciding when an
algorithm is better than another one is suggested. In the case of the use of evolutionary
algorithms, the latter may be done attending to the efciency and/or effectiveness
criteria. When theoretical results are not available in order to allow the comparison
of the behaviour of the algorithms, we have to focus on the analysis of empirical
results.
In the last years, there has been a growing interest in the analysis of experiments
in the eld of evolutionary algorithms. The work of Hooker is pioneer in this line and
it shows an interesting study on what we must do and not do when we suggest the
analysis of the behaviour of a metaheuristic about a problem (Hooker 1997).
In relation to the analysis of experiments, we can nd three types of works: the
study and design of test problems, the statistical analysis of experiments and experimental design.
Several authors have focused their interest in the design of test problems which
could be appropriate to do a comparative study among the algorithms. Focusing
our attention to continuous optimization problems, which will be used in this paper,
we can point out the pioneer papers of Whitley and co-authors for the design of
complex test functions for continuous optimization (Whitley et al. 1995, 1996),
and the recent works of Gallagher and Yuan (2006); Yuan and Gallagher (2003).
In the same way, we can nd papers that present test cases for different types of
problems.
Centred on the statistical analysis of the results, if we analyze the published
papers in specialized journals, we nd that the majority of the articles make
a comparison of results based on average values of a set of executions over a
concrete case. In proportion, a little set of works use statistical procedures in
order to compare results, although their use is recently growing and it is being suggested as a need for many reviewers. When we nd statistical studies, they are usually based on the average and variance by using parametric
tests (ANOVA, t-test, etc. . . .) (Czarn et al. 2004; Ozcelik and Erzurumlu 2006;
Rojas et al. 2002). Recently, non-parametric statistical procedures have been considered for being used in analysis of results (Garca et al. 2007; Moreno-Prez et
al. 2007). A similar situation can be found in the machine learning community
(Demar 2006).
The experimental design consists of a set of techniques which comprise methodologies for adjusting the parameters of the algorithms depending on the settings
used and results obtained (Bartz-Beielstein 2006; Kramer 2007). In our study, we
are not interested in this topic; we assume that the algorithms in a comparison have
obtained the best possible results, depending on an optimal adjustment of their parameters in each problem.
We are interested in the use of statistical techniques for the analysis of the behaviour of the evolutionary algorithms over optimization problems, analyzing the
use of the parametric statistical tests and the non-parametric ones (Sheskin 2003;
Zar 1999). We will analyze the required conditions for the usage of the parametric
tests, and we will carry out an analysis of results by using non-parametric tests.
The study of this paper will be organized into two parts. The rst one, we will
denoted it by single-problem analysis, corresponds to the study of the required conditions of a safe use of parametric statistical procedures when comparing the algorithms
over a single problem. The second one, denoted by multiple-problem analysis, will
suppose the study of the same required conditions when considering a comparison of
algorithms over more than one problems simultaneously.
The single-problem analysis is usually found in specialized literature (BartzBeielstein 2006; Ortiz-Boyer et al. 2007). Although the required conditions for using
parametric statistics are usually not fullled, as we will see here, a parametric statistical study could obtain similar conclusions to a non-parametric one. However, in the
multiple-problem analysis, due to the dissimilarities in the results obtained and the
small size of the sample to be analyzed, a parametric test may reach erroneous conclusions. In recent papers, authors start using single-problem and multiple-problem
analysis simultaneously (Ortiz-Boyer et al. 2007).
Non-parametric tests can be used for comparing algorithms whose results represent average values for each problem, in spite of the inexistence of relationships
among them. Given that the non-parametric tests do not require explicit conditions
for being conducted, it is recommendable that the sample of results is obtained following the same criterion, that is, computing the same aggregation (average, mode,
etc.) over the same number of runs for each algorithm and problem. They are used
for analyzing the results of the CEC2005 Special Session on Real Parameter Optimization (Suganthan et al. 2005) over all the test problems, in which average results
of the algorithms for each function are published. We will show signicant statistical differences among the algorithms compared in the CEC2005 Special Session on
Real Parameter Optimization, supporting the conclusions obtained in this session.
In order to do that, the paper is organized as follows. In Sect. 2, we describe the
setting of the CEC2005 Special Session: algorithms, tests functions and parameters.
Section 3 shows the study on the required conditions for safe use of parametric tests,
considering single-problem and multiple-problem analysis. We analyze the published
results of the CEC2005 Special Session on Real Parameter Optimization by using
S. Garca et al.
non-parametric tests in Sect. 4. Section 5 points out some considerations on the use
of non-parametric tests. The conclusions of the paper are presented in Sect. 6. An
introduction to statistics and a complete description of the non-parametric tests procedures are given in Appendix A. The published average results of the CEC2005
Special Session are shown in Appendix B.
2 Preliminaries: settings of the CEC2005 Special Session

In this section we will briey describe the algorithms compared, the test functions,
and the characteristics of the experimentation in the CEC2005 Special Session.
2.1 Evolutionary algorithms
In this section we enumerate the eleven algorithms which were presented in the
CEC2005 Special Session. For more details on the description and parameters used
for each one, please refer to the respective contributions. The algorithms are: BLXGL50 (Garca-Martnez and Lozano 2005), BLX-MA (Molina et al. 2005), CoEVO
(Pok 2005), DE (Rnkknen et al. 2005), DMS-L-PSO (Liang and Suganthan
2005), EDA (Yuan and Gallagher 2005), G-CMA-ES (Auger and Hansen 2005a),
K-PCX (Sinha et al. 2005), L-CMA-ES (Auger and Hansen 2005b), L-SaDE (Qin
and Suganthan 2005), SPC-PNX (Ballester et al. 2005).
2.2 Test functions
In the following we present the set of test functions designed for the Special Session
on Real Parameter Optimization organized in the 2005 IEEE Congress on Evolutionary Computation (CEC 2005) (Suganthan et al. 2005).
It is possible to consult in Suganthan et al. (2005) the complete description of the
functions, furthermore in the link the source code is included. The set of test functions
is composed of the following functions:
5 Unimodals functions
Sphere function displaced.
Schwefels problem 1.2 displaced.
Elliptical function rotated widely conditioned.
Schwefels problem 1.2 displaced with noise in the tness.
Schwefels problem 2.6 with global optimum in the frontier.
20 Multimodals functions
7 basic functions
Rosenbrock function displaced.
Griewank function displaced and rotated without frontiers.
Ackley function displaced and rotated with the global optimum in the frontier.
Rastrigin function displaced.
Rastrigin function displaced and rotated.
Weierstrass function displaced and rotated.
Schwefels problem 2.13.
2 expanded functions.
11 hybrid functions. Each one of them have been dened through compositions
of 10 out of the 14 previous functions (different in each case).
All functions have been displaced in order to ensure that their optima can never be
found in the centre of the search space. In two functions, in addition, the optima can
not be found within the initialization range, and the domain of search is not limited
(the optimum is out of the range of initialization).
2.3 Characteristics of the experimentation
The experiments were performed following the instructions indicated in the document
associated to the competition. The main characteristics are:
Each algorithm is run 25 times for each test function, and the average of error of
the best individual of the population is computed.
We will use the study with dimension D = 10 and the algorithms do 100000 evaluations of the tness function.
In the mentioned competition, experiments with dimension D = 30 and D = 50
have also been done.
Each run stops either when the error obtained is less than 108 , or when the maximal number of evaluations is achieved.
3 Study of the required conditions for the safe use of parametric tests
In this section, we will describe and analyze the conditions that must be satised
for the safe usage of parametric tests (Sect. 3.1). For doing it, we collect the overall
set of results obtained by the algorithms BLX-MA and BLX-GL50 in the 25 functions considering dimension D = 10. With them, we will rstly analyze the indicated
conditions over the complete sample of results for each function, in a single-problem
analysis (see Sect. 3.2). Finally, we will consider the average results for each function
to composite a sample of results for each one of the two algorithms. With these two
samples we will check again the required conditions for the safe use of parametric
test in a multiple-problem scheme (see Sect. 3.3).
3.1 Conditions for the safe use of parametric tests
In Sheskin (2003), the distinction between parametric and non-parametric tests is
based on the level of measure represented by the data which will be analyzed. In this
way, a parametric test uses data composed by real values.
The latter does not imply that when we always dispose of this type of data, we
should use a parametric test. There are other initial assumptions for a safe usage
of parametric tests. The non fulllment of these conditions might cause a statistical
analysis to lose credibility.
In order to use the parametric tests, it is necessary to check the following conditions (Sheskin 2003; Zar 1999):
S. Garca et al.
Independence: In statistics, two events are independent when the fact that one occurs does not modify the probability of the other one occurring.
Normality: An observation is normal when its behaviour follows a normal or Gauss
distribution with a certain value of average and variance . A normality test
applied over a sample can indicate the presence or absence of this condition in
observed data. We will use three normality tests:
Kolmogorov-Smirnov: It compares the accumulated distribution of observed
data with the accumulated distribution expected from a Gaussian distribution,
obtaining the p-value based on both discrepancies.
Shapiro-Wilk: It analyzes the observed data to compute the level of symmetry
and kurtosis (shape of the curve) in order to compute the difference with respect
to a Gaussian distribution afterwards, obtaining the p-value from the sum of the
squares of these discrepancies.
DAgostino-Pearson: It rst computes the skewness and kurtosis to quantify how
far from Gaussian the distribution is in terms of asymmetry and shape. It then
calculates how far each of these values differs from the value expected with
a Gaussian distribution, and computes a single p-value from the sum of these
discrepancies.
Heteroscedasticity: This property indicates the existence of a violation of the hypothesis of equality of variances. Levenes test is used for checking whether or
not k samples present this homogeneity of variances (homoscedasticity). When
observed data does not fulll the normality condition, this tests result is more
reliable than Bartletts test (Zar 1999), which checks the same property.
In our case, it is obvious the independence of the events given that they are independent runs of the algorithm with randomly generated initial seeds. In the following,
we will carry out the normality analysis by using Kolmogorov-Smirnov, ShapiroWilk and DAgostino-Pearson tests on single-problem and multiple-problem analysis, and heteroscedasticity analysis by means of Levenes test.
3.2 On the study of the required conditions over single-problem analysis
With the samples of results obtained from running 25 times the algorithms BLX-GL50
and BLX-MA for each function, we can apply statistical tests for determining whether
they check or not the normality and homoscedasticity properties. We have seen before that the independence condition is easily satised in this type of experiments.
The number of runs may be low for carrying out statistical analysis, but it was a
requirement in the CEC2005 Special Session.
All the tests used in this section will obtain the p-value associated, which represents the dissimilarity of the sample of results with respect to the normal shape.
Hence, a low p-value points out a non-normal distribution. In this study, we will consider a level of signicance = 0.05, so a p-value greater than indicates that the
condition of normality is fullled. All the computations have been performed by the
statistical software package SPSS.
Table 1 shows the results where the symbol * indicates that the normality is not
satised and the p-value in brackets. Table 2 shows the results by applying the test

Table 1 Test of normality of Kolmogorov-Smirnov
f1
BLX-MA
f3
f4
(.01)
(.04)
(.00)
(.00)
(.01)
f10
BLX-GL50
f2
f11
f12
(.20)
BLX-MA
BLX-GL50
BLX-MA
(.10)
(.20)
(.20)
(.00)
(.00)
(.00)
f19
BLX-GL50
f20
f21
(.00)
(.00)
(.00)
(.00)
(.00)
(.00)
f5
f6
f7
(.00)
(.00)
(.00)
(.04)
(.20)
(.00)
(.20)
(.00)
(.00)
(.00)
f13
f14
f15
f16
f17
f18
(.14)
(.16)
f8
f9
(.20)
(.20)
(.20)
(.20)
(.00)
(.00)
(.00)
(.02)
(.00)
(.00)
(.20)
(.20)
f22
f23
f24
f25
(.00)
(.00)
(.00)
(.00)
(.00)
(.00)
(.00)
(.02)
f3
f4
f5
f6
f7
(.01)
(.23)
(.27)
(.03)
(.00)
(.00)
f17
f18
Table 2 Test of normality of Shapiro-Wilk

f1
BLX-MA
(.03)
(.00)
(.00)
(.01)
(.03)
(.00)
(.00)
(.00)
(.00)
(.00)
f10
BLX-GL50
f2
f11
f12
f13
f14
f15
f16
(.06)
BLX-MA
BLX-GL50
BLX-MA
(.07)
(.25)
(.31)
(.00)
(.00)
(.00)
f19
BLX-GL50
f20
f21
(.00)
(.00)
(.00)
(.00)
(.00)
(.00)
(.05)
f8
(.39)
(.41)
(.12)
(.56)
(.00)
(.00)
(.00)
(.01)
(.25)
(.72)
f22
f23
f24
f25
(.00)
(.00)
(.00)
(.00)
(.00)
(.00)
f9
(.00)
(.00)
(.00)
(.02)
of normality of Shapiro-Wilk and Table 3 displays the results of DAgostino-Pearson

test.
In addition to this general study, we show the sample distribution in three cases,
with the objective of illustrating representative cases in which the normality tests
obtain different results.
From Fig. 1 to Fig. 3, different examples of graphical representations of histograms and Q-Q graphics are shown. A histogram represents a statistical variable
by using bars, so that the area of each bar is proportional to the frequency of the
represented values. A Q-Q graphic represents a confrontation between the quartiles
from data observed and those from the normal distributions.
In Fig. 1 we can observe a general case in which the property of abnormality is
clearly presented. On the contrary, Fig. 2 is the illustration of a sample whose distribution follows a normal shape, and the three normality tests employed veried this
fact. Finally, Fig. 3 shows a special case where the similarity between both distributions, the sample of results and the normal one, is not conrmed by all normality
S. Garca et al.
Table 3 Test of normality of DAgostino-Pearson
f1
BLX-GL50
f2
BLX-GL50
BLX-MA
(.06)
(.00)
f10
BLX-MA
(.10)
(.00)
f11
BLX-MA
f4
f5
f6
f7
f8
(.00)
(.24)
(.22)
(.00)
(.00)
(.00)
(.00)
(.00)
f13
f14
f15
f16
(.00)
(.00)
(.00)
(.07)
(.21)
(.54)
f9
f25
f12
(.17)
(.19)
(.89)
(.00)
(.00)
(.03)
f20
f21
f19
BLX-GL50
f3
(.05)
(.05)
(.06)
(.00)
(.00)
(.25)
(.79)
(.47)
(.38)
(.16)
f22
f23
f24
(.01)
(.00)
(.00)
(.00)
(.00)
(.00)
(.28)
(.21)
(.19)
(.12)
f17
(.00)
(.00)
f18
(.03)
(.04)
(.11)
(.20)
Fig. 1 Example of non-normal distribution: Function f20 and BLX-GL50 algorithm: Histogram and Q-Q
Graphic
Fig. 2 Example of normal distribution: Function f10 and BLX-MA algorithm: Histogram and Q-Q
Graphic
Fig. 3 Example of a special case: Function f21 and BLX-MA algorithm: Histogram and Q-Q Graphic
Table 4 Test of heteroscedasticity of Levene (based on means)

f1
LEVENE
(.07)
f10
LEVENE
(.99)
f2
(.07)
f11
(.00)
f3
f4
f5
f6
f7
(.00)
(.04)
(.00)
(.00)
(.00)
f12
f13
f14
f15
f16
(.00)
(.00)
f24
f25
(.00)
(.00)
(.98)
f19
LEVENE
f20
f21
(.01)
(.00)
(.01)
(.18)
f22
(.47)
(.87)
f23
(.28)
f8
(.41)
f17
(.24)
f9
(.00)
f18
(.21)
tests. In this case, a normality test could work better than another, depending on types
of data, number of ties or number of results collected. Due to this fact, we have employed three well-known normality tests for studying the normality condition. The
choice of the most appropriate normality test depending on the problem is out of the
scope of this paper.
With respect to the study of homoscedasticity property, Table 4 shows the results
by applying Levenes test, where the symbol * indicates that the variances of the
distributions of the different algorithms for a certain function are not homogeneities
(we reject the null hypothesis at a level of signicance = 0.05).
Clearly, in both cases, the non fulllment of the normality and homoscedasticity
conditions is perfectible. In most functions, the normality condition is not veried in
a single-problem analysis. The homoscedasticity is also dependent of the number of
algorithms studied, because it checks the relationship among the variances of all population samples. Even though in this case we only analyze this condition on results
for two algorithms, the condition is also not fullled in many cases.
A researcher may think that the non fulllment of these conditions is not crucial
for obtaining adequate results. By using the same samples of results, we will show
an example in which some results offered by a parametric test, the paired t-test, do
not agree with the ones obtained through a non-parametric test, Wilcoxons test. Table 5 presents the difference of average error rates, in each function, between the
S. Garca et al.
Table 5 Difference of error
rates and p-values for paired
t-test and Wilcoxon test in
single-problem analysis
Function
Difference
t-test
Wilcoxon
f1
f2
f3
47129
f4
1.9 108
0.281
f5
0.0212
0.011
f6
1.489618
f7
0.1853
f8
0.2
0.686
0.716
f9
0.716
f10
0.668086
f11
2.223405
0.028
0.037
f12
332.7
0.802
0.51
f13
0.024
0.058
0.058
f14
0.142023
0.827
0.882
f15
130
0.01
0.061
f16
8.5
f17
18
f18
383
f19
314
0.001
f20
354
f21
33
0.178
0.298
f22
88
0.545
0.074
f23
288
f24
24
0.043
0.046
f25
0.558
0.459
algorithms BLX-GL50 and BLX-MA (if it is negative, the best performed algorithm is
BLX-GL50), and the p-value obtained by the paired t-test and Wilcoxon test.
As we can see, the p-values obtained by paired t-test are very similar to the ones
obtained by Wilcoxon test. However, in three cases, they are quite different. We enumerate them:
In function f4, Wilcoxon test considers that both algorithms behave differently,
whereas paired t-test does not. This example perfectly ts with a non-practical
case. The difference of error rates is less than 107 , and in practical sense, this has
no signicant effect.
In function f15, the situation is opposite to the previous one. The paired t-test
obtains a signicant difference in favour of BLX-MA. Is this result reliable? As the
normality condition is not veried in the results of f15 (see Tables 1, 2, 3), the
results obtained by Wilcoxon test are theoretically more reliable.
Finally, in function f22, although Wilcoxon test obtains a p-value greater than the
level of signicance = 0.05, both p-values are again very different.
In 3 of the 25 functions, there are observable differences in the application of

paired t-test and Wilcoxon test. Moreover, in these 3 functions, the required conditions for the safe usage of parametric statistics are not veried. In principle, we could
suggest the usage of the non-parametric test of Wilcoxon in single-problem analysis.
This is one alternative, but there exist other ways for ensuring that the results obtained
are valid for parametric statistical analysis.
Obtaining new results is not very difcult in single-problem analysis. We only
have to run the algorithms again to get larger samples of results. The Central Limit
Theorem conrms that the sum of many identically distributed random variables
tends to a normal distribution. Nevertheless, the number of runs carried out must
not be very high, because any statistical test has a negative effect size. If the sample
of results is too large, a statistical test could detect insignicant differences as
signicant.
For controlling the size effect, we can use the Cohens index d
t
d =
where t is the t-test statistics and n is the number of results collected. If d is near
to 0.5, then the differences are signicant. A value of d lower than 0.25 indicates
insignicant differences and the statistical analysis may not be taken into account.
The application of transformations for obtaining normal distributions, such as logarithm, square root, reciprocal and power transformations (Patel and Read 1982).
In some situations, skip outliers, but this technique must be used with great care.
These alternatives could solve the normality condition, but the homoscedasticity
condition may result difcult to solve. Some parametric tests, such as ANOVA, are
very inuenced by the homoscedasticity condition.
3.3 On the study of the required conditions over multiple-problem analysis
When tackling a multiple-problem analysis, the data to be used is an aggregation of
results obtained from individual algorithms runs. In this aggregation, there must be
only a result representing a problem or function. This result could be obtained through
averaging results for all runs or something similar, but the procedure followed must
be the same for each function; i.e., in this paper we have used the average of the
25 runs of an algorithm in each function. The size of the sample of results to be
analyzed, for each algorithm, is equal to the number of problems. In this way, a
multiple-problem analysis allows us to compare two or more algorithms over a set
of problems simultaneously.
We can use the results published in the CEC2005 Special Session to perform a
multiple-problem analysis. Indeed, we will follow the same procedure as the previous
subsection. We will analyze the required conditions for the safe usage of parametric
tests over the sample of results obtained by averaging the error rate on each function.
Table 6 shows the p-values of the normality tests over the sample results obtained
by BLX-GL50 and BLX-MA. Figures 4 and 5 represent the histograms and Q-Q plots
for such samples.
S. Garca et al.
Obviously, the normality condition is not satised because the sample of results is
composed by 25 average error rates computed in 25 different problems. We compare
the behaviour of the two algorithms by means of pairwise statistical tests:
The p-value obtained with a paired t-test is p = 0.318. The paired t-test does not
consider the existence of difference in performance between the algorithms.
The p-value obtained with Wilcoxon test is p = 0.089. The Wilcoxon t-test does
neither consider the existence of difference in performance between the algorithms,
but it considerably reduces the minimal level of signicance for detecting differences. If the level of signicance considered were = 0.10, Wilcoxons test would
conrm that BLX-GL50 is better than BLX-MA.
Average results for these two algorithms indicate this behaviour, BLX-GL50 usually performs better than BLX-MA (see Table 13 in Appendix B), but a paired t-test
Table 6 Normality tests over multiple-problem analysis
Algorithm
BLX-GL50
BLX-MA
Kolmogorov-Smirnov
Shapiro-Wilk
DAgostino-Pearson
(.00)
(.00)
(.00)
(.00)
(.00)
Fig. 4 BLX-GL50 algorithm: Histogram and Q-Q Graphic
Fig. 5 BLX-MA algorithm: Histogram and Q-Q Graphic
(.10)
cannot appreciate this fact. In multiple-problem analysis it is not possible to enlarge

the sample of results, unless new functions/problems were added. Applying transformations or skipping outliers cannot be used either, because we would be changing
results for certain problems and not for other problems.
These facts may induce us to using non-parametric statistics for analyzing the
results in multiple-problems. Non-parametric statistics do not need prior assumptions
related to the sample of data for being analyzed and, in the example shown in this
section, we have seen that they could obtain reliable results.
4 A case study: on the use of non-parametric statistics for comparing the

results of the CEC2005 Special Session in Real Parameter Optimization
In this section, we study the results obtained in the CEC2005 Special Session in Real
Parameter Optimization as a case study on the use of the non-parametric tests. As we
have mentioned, we will focus on the dimension D = 10.
We will divide the set of functions into two subgroups, according to the suggestion
given in Hansen (2005) about their degrees of difculty.
The rst group is composed by the unimodal functions (from f1 to f5), in which
all participant algorithms in the CEC2005 competition normally achieve the optimum, and the multimodal functions (from f6 to f14), in which at least one run of a
participant algorithm achieves the optimum.
The second group contains the remaining functions, from the function f15 to f25.
In these functions, no participant algorithm has achieved the optimum.
This division is carried out with the objective of showing the differences in the
statistical analysis considering distinct numbers of functions, which is an essential
factor that inuences over the study. It also allows us to compare the behaviour of the
algorithms when they tackle the most complicated functions. Indeed, we could also
study the group of functions f1f14, but we do not include it in order not to enlarge
the content of the paper. Hence, the results offered by the algorithms that take part in
the CEC2005 Special Session are analyzed independently for all functions (from f1
to f25) and the difcult functions (from f15 to f25).
As we have done before, we have considered using, as performance measure, the
error rate obtained for each algorithm. This case corresponds to a multiple-problem
analysis, so the employment of non-parametric statistical tests is preferable to a parametric one, as we have seen in the previous section. Table 13 in Appendix B summarizes the ofcial results obtained in the competition organized by functions and
algorithms.
Values included in Table 13 allow us to carry out a rigorous statistical study in
order to check whether the results of the algorithms are rather signicant for considering them different in terms of quality on approximation of continuous functions.
Our study will be focused on the algorithm that had the lowest average error rate in
the comparison, G-CMA-ES (Hansen 2005). We will study the behaviour of this algorithm with respect to the remaining ones, and we will determine if the results it offers
are better than the ones offered by the rest of algorithms, computing the p-values on
each comparison.
S. Garca et al.
Table 7 Results of the Friedman and Iman-Davenport tests ( = 0.05)
Friedman
Value
value
in 2
Iman-Davenport
Value
value
p-value
in FF
p-value
f15f25
26.942
18.307
0.0027
3.244
1.930
0.0011
All
41.985
18.307
<0.0001
4.844
1.875
<0.0001
Table 8 Rankings obtained

through Friedmans test and
critical difference of
Bonferroni-Dunns procedure
Algorithm
Ranking (f15f25)
Ranking (f1f25)
BLX-GL50
5.227
5.3
BLX-MA
7.681
7.14
CoEVO
9.000
6.44
DE
4.955
5.66
DMS-L-PSO
5.409
5.02
EDA
6.318
6.74
G-CMA-ES
3.045
3.34
K-PCX
7.545
6.8
L-CMA-ES
6.545
6.22
L-SaDE
4.956
4.92
SPC-PNX
5.318
6.42
Crit. Diff. = 0.05
3.970
2.633
Crit. Diff. = 0.10
3.643
2.417
Table 7 shows the result of applying Friedmans and Iman-Davenports tests in

order to see whether there are global differences in the results. Given that the p-values
of Friedman and Iman-Davenport are lower than the level of signicance considered
= 0.05, there are signicant differences among the observed results in the functions
of the rst and second group. Attending to these results, a post-hoc statistical analysis
could help us to detect concrete differences among algorithms.
First of all, we will employ Bonferroni-Dunns test to detect signicant differences
for the control algorithm G-CMA-ES. Table 8 summarizes the ranking obtained by
Friedmans test and the critical difference of Bonferroni-Dunns procedure. Figures
6 and 7 display graphical representations (including the rankings obtained for each
algorithm) for the two groups of functions. In a Bonferroni-Dunns graphic the difference among rankings obtained for each algorithm is illustrated. In them, we can
draw a horizontal cut line which represents the threshold for the best performing algorithm, that one with the lowest ranking bar, in order to consider it better than other
algorithm. A cut line is drawn for each level of signicance considered in the study at
height equal to the sum of the ranking of the control algorithm and the corresponding
Critical Difference computed by the Bonferroni-Dunn method (see Appendix A.3).
Those bars which exceed this line are the associated to an algorithm with worse performance than the control algorithm.
The application of Bonferroni-Dunns test informs us of the following signicant

differences with G-CMA-ES as control algorithm:
f15f25: G-CMA-ES is better than CoEVO and BLX-MA and K-PCX with = 0.05
and = 0.10 (3/10 algorithms).
f1f25: It outperforms CoEVO, BLX-MA, K-PCX, EDA, SPC-PNX and L-CMA-ES
with = 0.05 and = 0.10 (6/10 algorithms). Although G-CMA-ES obtains the
lowest error and ranking rates, Bonferroni-Dunns test is not able to distinguish it
as better than all the remaining algorithms.
Fig. 6 Bonferroni-Dunns graphic corresponding to the results for f15f25
Fig. 7 Bonferroni-Dunns graphic corresponding to the results for f1f25
S. Garca et al.
Table 9 p-values on functions f15f25 (G-CMA-ES is the control algorithm)
G-CMA-ES vs.
Unadjusted p
Bonferroni-Dunn p
Holm p
Hochberg p
CoEVO
4.21050
2.54807 105
2.54807 104
2.54807 104
2.54807 104
BLX-MA
3.27840
0.00104
0.0104
0.00936
0.00936
k-PCX
3.18198
0.00146
0.0146
0.01168
0.01168
L-CMA-ES
2.47487
0.01333
0.1333
0.09331
0.09331
EDA
2.31417
0.02066
0.2066
0.12396
0.12396
DMS-L-PSO
1.67134
0.09465
0.9465
0.47325
0.17704
SPC-NPX
1.60706
0.10804
1.0
0.47325
0.17704
BLX-GL50
1.54278
0.12288
1.0
0.47325
0.17704
DE
1.34993
0.17704
1.0
0.47325
0.17704
L-SaDE
1.34993
0.17704
1.0
0.47325
0.17704
Table 10 p-values on functions f1f25 (G-CMA-ES is the control algorithm)

G-CMA-ES vs.
Unadjusted p
Bonferroni-Dunn p
Holm p
Hochberg p
CoEVO
5.43662
5.43013 108
5.43013 107
5.43013 107
5.43013 107
4.05081
5.10399 105
5.10399 104
4.59359 104
4.59359 104
K-PCX
3.68837
2.25693 104
0.002257
0.001806
0.001806
EDA
3.62441
2.89619 104
0.0028961
0.002027
0.002027
SPC-PNX
3.28329
0.00103
0.0103
0.00618
0.00618
L-CMA-ES
3.07009
0.00214
0.0214
0.0107
0.0107
DE
2.47313
0.01339
0.1339
0.05356
0.05356
BLX-GL50
2.08947
0.03667
0.3667
0.11
0.09213
DMS-L-PSO
1.79089
0.07331
0.7331
0.14662
0.09213
L-SaDE
1.68429
0.09213
0.9213
0.14662
0.09213
BLX-MA
In the same way as the previous section, we will apply more powerful procedures,
such as Holms and Hochbergs (they are described in Appendix A.3), for comparing
the control algorithm with the rest of the algorithms. The results are shown by computing p-values for each comparison. Tables 9 and 10 show the p-value obtained for
Bonferroni-Dunns, Holms and Hochbergs procedures considering both groups of
functions. The procedure used to compute the p-values is explained in Appendix A.3.
Holms and Hochbergs procedures allow us to point out the following differences,
considering G-CMA-ES as control algorithm:
f15f25: G-CMA-ES is better than CoEVO, BLX-MA and K-PCX with = 0.05
(3/10 algorithms) and is better than L-CMA-ES with = 0.10 (4/10 algorithms).
Here, Holms and Hochbergs procedures coincide and they reject an extra hypothesis considering = 0.10, with regards to Bonferroni-Dunns.
f1f25: Based on Holms procedure, it outperforms CoEVO, BLX-MA, K-PCX,
EDA, SPC-PNX and L-CMA-ES with = 0.05 (6/10 algorithms) and it also outperforms DE with = 0.10 (7/10 algorithms). It rejects equal number of hypotheses
as Bonferroni-Dunn does by considering = 0.05. It also rejects an extra hypothesis than Bonferroni-Dunn when = 0.10.
Hochbergs procedure behaves the same as Holms when we establish = 0.05.
However, with a = 0.10, it obtains a different result. All the p-values in the
comparison are lower than 0.10, so all the hypotheses associated with them are
rejected (10/10 algorithms). In fact, Hochbergs procedure conrms that G-CMAES is the best algorithm in the competition considering all functions on the whole.
In the following, we present a study in which the G-CMA-ES algorithm will be
compared with the rest of them by means of pairwise comparisons. In this study we
will use the Wilcoxon test (see Appendix A.2).
Until now, we have used procedures for performing multiple comparisons in order
to check the behaviour of the algorithms. Attending to Hochbergs procedure results,
this process could not be necessary, but we include it for stressing the differences
between using multiple comparisons procedures instead of pairwise comparisons.
Tables 11 and 12 summarize the results of applying Wilcoxon test. They display the
sum of rankings obtained in each comparison and the p-value associated.
Table 11 Wilcoxon test
considering functions f15f25
R+
BLX-GL50
62.5
3.5
0.009
BLX-MA
60.0
6.0
0.016
CoEVO
60.0
6.0
0.016
DE
56.5
9.5
0.028
DMS-L-PSO
47.0
19.0
0.213
EDA
60.5
5.5
0.013
K-PCX
60.0
6.0
0.016
L-CMA-ES
58.0
8.0
0.026
L-SaDE
47.5
18.5
0.203
SPC-PNX
Table 12 Wilcoxon test

considering functions f1f25
G-CMA-ES vs.
63.5
2.5
0.007
p-value
G-CMA-ES vs.
R+
BLX-GL50
289.5
35.5
0.001
BLX-MA
295.5
29.5
0.001
CoEVO
301.0
24.0
0.000
DE
262.5
62.5
0.009
DMS-L-PSO
199.0
126.0
0.357
EDA
284.5
40.5
0.001
K-PCX
269.0
56.0
0.004
L-CMA-ES
273.0
52.0
0.003
L-SaDE
209.0
116.0
0.259
SPC-PNX
305.5
19.5
0.000
p-value
S. Garca et al.
Wilcoxons test performs individual comparisons between two algorithms (pairwise comparisons). The p-value in a pairwise comparison is independent from another one. If we try to extract a conclusion involving more than one pairwise comparison in a Wilcoxons analysis, we will obtain an accumulated error coming from the
combination of pairwise comparisons. In statistical terms, we are losing the control
on the Family Wise Error Rate (FWER), dened as the probability of making one or
more false discoveries among all the hypotheses when performing multiple pairwise
tests. The true statistical signicance for combining pairwise comparisons is given
by:
p = P (Reject H0 |H0 true)
= 1 P (Accept H0 |H0 true)
= 1 P (Accept Ak = Ai , i = 1, . . . , k 1|H0 true)
k1
P (Accept_Ak = Ai |H0 true)
=1
i=1
k1
=1
[1 P (Reject Ak = Ai |H0 true)]

i=1
k1
(1 pHi )
=1
(1)
i=1
Observing Table 11, the statement: The G-CMA-ES algorithm outperforms the
BLX-GL50, BLX-MA, CoEVO, DE, EDA, K-PCX, L-CMA-ES and SPC-PNX algorithms with a level of signicance = 0.05 could not be correct until we cannot
check controlling the FWER. The G-CMA-ES algorithm really outperforms these
eight algorithms considering independent pairwise comparisons due to the fact that
the p-values are below = 0.05. On the other hand, note that two algorithms were
not included. If we include them within the multiple comparison, the p-value obtained is p = 0.4505 in f15f25 group and p = 0.5325 considering all functions. In
such cases, it is not possible to declare that G-CMA-ES algorithm obtains a significantly better performance than the remaining algorithms, due to the fact that the
p-values achieved are too high.
From expression (1), and Tables 11 and 12, we can deduce that G-CMA-ES is
better than the eight algorithms enumerated before with a p-value of
p = 1 ((1 0.009) (1 0.016) (1 0.016) (1 0.028) (1 0.013)
(1 0.016) (1 0.026) (1 0.007)) = 0.123906
for the group of functions f15f25 and
p = 1 ((1 0.001) (1 0.001) (1 0.000) (1 0.009) (1 0.001)
(1 0.004) (1 0.003) (1 0.000)) = 0.018874
considering all functions. Hence, the previous statement has been denitively conrmed only when considering all functions in the comparison.
The procedures designed for performing multiple comparisons control the FWER
in their denition. By using the example considered in this section, in which we
have used the G-CMA-ES algorithm as control, we can easily reect the relationship
among the power of all the testing procedures used. In increasing order of power and
considering all functions in the study, the procedures can be order in the following
way: Bonferroni-Dunn (p = 0.9213), Wilcoxons test (when it is used in multiple
comparisons) (p = 0.5325), Holm (p = 0.1466) and Hochberg (p = 0.0921).
Finally, we must point out that the statistical procedures used here indicate that
the best algorithm is G-CMA-ES. Although in Hansen (2005), the categorization of
the functions depending on their degree of difculty is different than the used in this
paper (we have joined the unimodal and soluble multimodal functions in one group),
the G-CMA-ES algorithm has been stressed as the algorithm with best behaviour
considering error rate. Therefore and to sum up, in this paper the conclusions drawn
in Hansen (2005) have been statistically supported.
5 Some considerations on the use of non-parametric tests

Taking into consideration all the results, tables and gures on the application of the
non-parametric tests shown in this paper, we can suggest some aspects and details
about the use of non-parametric statistical techniques:
A multiple comparison of various algorithms must be carried out rst by using a
statistical method for testing the differences among the related samples means, that
is, the results obtained by each algorithm. Once this test rejects the hypothesis of
equivalence of means, the detection of the concrete differences among the algorithms can be done with the application of post-hoc statistical procedures, which
are methods used for comparing a control algorithm with two or more algorithms.
Holms procedure can always be considered better than Bonferroni-Dunns one,
because it appropriately controls the FWER and it is more powerful than the
Bonferroni-Dunns. We strongly recommend the use of Holms method in a rigorous comparison. Nevertheless, the results offered by the Bonferroni-Dunns test
are suitable to be visualized in graphical representations.
Hochbergs procedure is more powerful than Holms. The differences reported
between it and Holms procedure are in practice rather small, but in this paper,
we have shown a case in which Hochbergs method obtains lower p-values than
Holms (see Table 10). We recommend the use of this test together with Holms
method.
Although Wilcoxons test and the remaining post-hoc tests for multiple comparisons belong to the non-parametric statistical tests, they operate in a different way.
The main difference lies in the computation of the ranking. Wilcoxons test computes a ranking based on differences between functions independently, whereas
Friedman and derivative procedures compute the ranking between algorithms.
In relation to the sample size (number of functions when performing Wilcoxons
or Friedmans tests in multiple-problem analysis), there are two main aspects to be
S. Garca et al.
determined. Firstly, the minimum sample considered acceptable for each test needs
to be stipulated. There is no established agreement about this specication. Statisticians have studied the minimum sample size when a certain power of the statistical
test is expected (Noether 1987; Morse 1999). In our case, the employment of a,
as large as possible, sample size is preferable, because the power of the statistical
tests (dened as the probability that the test will reject a false null hypothesis) will
increase. Moreover, in a multiple-problem analysis, the increasing of the sample
size depends on the availability of new functions (which should be well-known in
real-parameter optimization eld). Secondly, we have to study how the results are
expected to vary if there was a larger sample size available. In all statistical tests
used for comparing two or more samples, the increasing of the sample size benets
the power of the test. In the following items, we will state that Wilcoxons test is
less inuenced by this factor than Friedmans test. Finally, as a rule of thumb, the
number of functions (N ) in a study should be N = a k, where k is the number of
algorithms to be compared and a 2.
Taking into account the previous observation and knowing the operations performed by the non-parametric tests, we can deduce that Wilcoxons test is inuenced by the number of functions used. On the other hand, both the number of
algorithms and functions are crucial when we refer to the multiple comparisons
tests (such as Friedmans test), given that all the critical values depend on the value
of N (see expressions in Appendix A.3). However, the increasing/decreasing of
the number of functions rarely affects in the computation of the ranking. In these
procedures, the number of functions used is an important factor to be considered
when we want to control the FWER.
An appropriate number of algorithms in contrast with an appropriate number of
functions are needed to be used in order to employ each type of test. The number of algorithms used in multiple comparisons procedures must be lower than
the number of functions. In the study of the CEC2005 Special Session, we can
appreciate the effect of the number of functions used whereas the number of algorithms stays constant. See, for instance, the p-value obtained when considering
the f15f25 group and all functions. In the latter case, p-values obtained are always lower than in the rst one, for each testing procedure. In general, p-values
are lower agreeing with the increasing of the number of functions used in multiple
comparison procedures; therefore, the differences among the algorithms are more
detectable.
The previous statement may not be true in Wilcoxons test. The inuence of the
number of functions used is more noticeable in multiple comparisons procedures
than in Wilcoxons test. For example, the nal p-value computed for Wilcoxons
test in group f15f25 is lower than in the group f1f25 (see previous section).
6 Conclusions
In this paper we have studied the use of statistical techniques in the analysis of the
behaviour of evolutionary algorithms in optimization problems, analyzing the use of
the parametric and non-parametric statistical tests.
We have distinguished two types of analysis. The rst one, called single-problem
analysis, is that in which the results are analyzed for each function/problem independently. The second one, called multiple-problem analysis, is that in which the results
are analyzed by considering all the problems studied simultaneously.
In single-problem analysis, we have seen that the required conditions for a safe usage of parametric statistics are usually not satised. Nevertheless, the results obtained
are quite similar between a parametric and non-parametric analysis. Also, there are
procedures for transforming or adapting sample results for being used by parametric
statistical tests.
We encourage the use of non-parametric tests when we want to analyze results obtained by evolutionary algorithms for continuous optimization problems in multipleproblem analysis, due to the fact that the initial conditions that guarantee the reliability of the parametric tests are not satised. In this case, the results come from
different problems and it is not possible to analyze the results by means of parametric
statistics.
With respect to the use of non-parametric tests, we have shown how to use Friedman, Iman-Davenport, Bonferroni-Dunn, Holm, Hochberg, and Wilcoxons tests;
which on the whole, are a good tool for the analysis of the algorithms. We have
employed these procedures to carry out a comparison on the CEC2005 Special Session on Real Parameter Optimization by using the results published for each algorithm.
Acknowledgements The authors are very grateful to the anonymous reviewers for their valuable suggestions and comments to improve the quality of this paper.
Appendix A: introduction to inferential statistical tests

This section is dedicated to introduce the necessary issues to understand the statistical
terms used in this paper. Moreover, a description of the non-parametric tests is given
in order to use them in further research. In order to distinguish a non-parametric
test from a parametric one, we must check the type of data used by the test. A nonparametric test is that which uses nominal or ordinal data. This fact does not force
it to be used only for these types of data. It is possible to transform the data from
real values to ranking based data. In such way, a non-parametric test can be applied
over classical data of parametric test when they do not verify the required conditions
imposed by the test. As a general rule, a non-parametric test is less restrictive than
a parametric one, although it is less robust than a parametric when data are well
conditioned.
A.1 Hypothesis testing and p-values
In inferential statistics, sample data are primarily employed in two ways to draw
inferences about one or more populations. One of them is the hypothesis testing.
The most basic concept in hypothesis testing is a hypothesis. It can be dened as
a prediction about a single population or about the relationship between two or more
S. Garca et al.
populations. Hypothesis testing is a procedure in which sample data are employed

to evaluate a hypothesis. There is a distinction between research hypothesis and statistical hypothesis. The rst is a general statement of what a researcher predicts. In
order to evaluate a research hypothesis, it is restated within the framework of two
statistical hypotheses. They are the null hypothesis, represented by the notation H0 ,
and the alternative hypothesis, represented by the notation H1 .
The null hypothesis is a statement of no effect or no difference. Since the statement of the research hypothesis generally predicts the presence of a difference with
respect to whatever is being studied, the null hypothesis will generally be a hypothesis that the researcher expects to be rejected. The alternative hypothesis represents a
statistical statement indicating the presence of an effect or a difference. In this case,
the researcher generally expects the alternative hypothesis to be supported.
An alternative hypothesis can be nondirectional (two-tailed hypothesis) and directional (one-tailed hypothesis). The rst type does not make a prediction in a specic
direction; i.e. H1 : = 100. The latter implies a choice of one of the following directional alternative hypothesis; i.e. H1 : > 100 or H1 : < 100.
Upon collecting the data for a study, the next step in the hypothesis testing procedure is to evaluate the data through use of the appropriate inferential statistical test.
An inferential statistical test yields a test statistic. The latter value is interpreted by
employing special tables that contain information with regard to the expected distribution of the test statistic. Such tables contain extreme values of the test statistic
(referred to as critical values) that are highly unlikely to occur if the null hypothesis
is true. Such tables allow a researcher to determine whether or not the results of a
study is statistically signicant.
The conventional hypothesis testing model employed in inferential statistics assumes that prior to conducting a study, a researcher stipulates whether a directional
or nondirectional alternative hypothesis is employed, as well as at what level of signicance is represented the null hypothesis to be evaluated. The probability value
which identies the level of signicance is represented by .
When one employs the term signicance in the context of scientic research, it
is instructive to make a distinction between statistical signicance and practical signicance. Statistical signicance only implies that the outcome of a study is highly
unlikely to have occurred as a result of chance, but it does no necessarily suggest
that any difference or effect detected in a set of data is of any practical value. For
example, no-one would normally care if algorithm A solves the sphere function to
within 1010 of error of the global optimum and algorithm B solves it within 1015 .
Between them, statistical signicance could be found, but in practical sense, this difference is not signicant.
Instead of stipulating a priori a level of signicance , one could calculate the
smallest level of signicance that results in the rejection of the null hypothesis. This
is the denition of p-value, which is an useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is signicant or not, and it also indicates something about how
signicant the result is: The smaller the p-value, the stronger the evidence against
the null hypothesis. Most important, it does this without committing to a particular
level of signicance.
The most common way for obtaining the p-value associated to a hypothesis is
by means of normal approximations, that is, once computed the statistic associated to a statistical test or procedure, we can use a specic expression or algorithm for obtaining a z value, which corresponds to a normal distribution statistics.
Then, by using normal distribution tables, we could obtain the p-value associated
with z.
A.2 The Wilcoxon matched-pairs signed-ranks test
Wilcoxons test is used for answering this question: do two samples represent two different populations? It is a non-parametric procedure employed in a hypothesis testing
situation involving a design with two samples. It is the analogous of the paired t-test
in non-parametrical statistical procedures; therefore, it is a pairwise test that aims to
detect signicant differences between the behavior of two algorithms.
The null hypothesis for Wilcoxons test is H0 : D = 0; in the underlying populations represented by the two samples of results, the median of the difference
scores equals zero. The alternative hypothesis is H1 : D = 0, but also can be used
H1 : D > 0 or H1 : D < 0 as directional hypothesis.
In the following, we describe the tests computations. Let di be the difference between the performance scores of the two algorithms on i-th out of N functions. The
differences are ranked according to their absolute values; average ranks are assigned
in case of ties. Let R + be the sum of ranks for the functions on which the second
algorithm outperformed the rst, and R the sum of ranks for the opposite. Ranks
of di = 0 are split evenly among the sums; if there is an odd number of them, one is
ignored:
R+ =
rank(di ) +
di >0
R =
rank(di ) +
di <0
1
2
1
2
rank(di )
di =0
rank(di )
di =0
Let T be the smallest of the sums, T = min(R + , R ). If T is less than or equal

to the value of the distribution of Wilcoxon for N degrees of freedom (Table B.12 in
Zar 1999), the null hypothesis of equality of means is rejected.
The obtaining of the p-value associated to a comparison is performed by means
of the normal approximation for the Wilcoxon T statistic (Section VI, Test 18 in
Sheskin 2003). Furthermore, the computation of the p-value for this test is usually
included in well-known statistical software packages (SPSS, SAS, R, etc.).
A.3 The Friedman two-way analysis of variance by ranks
Friedmans test is used for answering this question: In a set of k samples (where
k 2), do at least two of the samples represent populations with different median
S. Garca et al.
values? It is a non-parametric procedure employed in a hypothesis testing situation

involving a design with two or more samples. It is the analogous of the repeatedmeasures ANOVA in non-parametrical statistical procedures; therefore, it is a multiple comparison test that aims to detect signicant differences between the behavior
of two or more algorithms.
The null hypothesis for Friedmans test is H0 : 1 = 2 = = k ; the median
of the population i represents the median of the population j , i = j, 1 i k,
1 j k. The alternative hypothesis is H1 : N ot H0 , so it is non-directional.
In the following, we describe the tests computations. It computes the ranking of
the observed results for algorithm (rj for the algorithm j with k algorithms) for each
function, assigning to the best of them the ranking 1, and to the worst the ranking k.
Under the null hypothesis, formed from supposing that the results of the algorithms
are equivalent and, therefore, their rankings are also similar, the Friedmans statistic
2
F =
12N
k(k + 1)
2
Rj
j
k(k + 1)2
4
2
is distributed according to F with k 1 degrees of freedom, being Rj =
j
1
i ri , and N the number of functions. The critical values for the Friedmans
N
statistic coincide with the established in the 2 distribution when N > 10 and
k > 5. In a contrary case, the exact values can be seen in Sheskin (2003); Zar
(1999).
Iman and Davenport (1980) proposed a derivation from the Friedmans statistic
given that this last metric produces a conservative undesirably effect. The proposed
statistic is
FF =
2
(N 1)F
2
N (k 1) F
and it is distributed according to a F distribution with k 1 and (k 1)(N 1)

degrees of freedom.
Computation of the p-values given a 2 or FF statistic can be done by using the
algorithms in Abramowitz (1974). Also, most of the statistical software packages
include it.
The rejection of the null hypothesis in both tests described above does not involve
the detection of the existing differences among the algorithms compared. They only
inform us about the presence of differences among all samples of results compared.
In order to conducting pairwise comparisons within the framework of multiple comparisons, we can proceed with a post-hoc procedure. In this case, a control algorithm
(maybe a proposal to be compared) is usually chosen. Then, the post-hoc procedures
proceed to compare the control algorithm with the remain k 1 algorithms. In the
following, we describe three post-hoc procedures:
Bonferroni-Dunns procedure (Zar 1999): it is similar to Dunnets test for ANOVA
designs. The performance of two algorithms is signicantly different if the corre-
sponding average of rankings is at least as great as its critical difference (CD).

CD = q
k(k + 1)
6N
The value of q is the critical value of Q for a multiple non-parametric comparison with a control (Table B.16 in Zar 1999).
Holm (1979) procedure: for contrasting the procedure of Bonferroni-Dunn, we
dispose of a procedure that sequentially checks the hypotheses ordered according to their signicance. We will denote the p-values ordered by p1 , p2 , . . . ,
in the way that p1 p2 pk1 . Holms method compares each pi with
/(k i) starting from the most signicant p-value. If p1 is below than /(k 1),
the corresponding hypothesis is rejected and it leaves us to compare p2 with
/(k 2). If the second hypothesis is rejected, we continue with the process. As
soon as a certain hypothesis can not be rejected, all the remaining hypotheses are
maintained as supported. The statistic for comparing the i algorithm with the j
algorithm is:
z = (Ri Rj )
k(k + 1)
6N
The value of z is used for nding the corresponding probability from the table of
the normal distribution (p-value), which is compared with the corresponding value
of .
Holms method is more powerful than Bonferroni-Dunns and it does no additional assumptions about the hypotheses checked.
Hochberg (1988) procedure: It is a step-up procedure that works in the opposite
direction to Holms method, comparing the largest p-value with , the next largest
with /2 and so forth until it encounters a hypothesis it can reject. All hypotheses with smaller p values are then rejected as well. Hochbergs method is more
powerful than Holms (Shaffer 1995).
When a p-value is within a multiple comparison it reects the probability error
of a certain comparison, but it does not take into account the remaining comparisons
belonging to the family. One way to solve this problem is to report Adjusted P -Values
(APVs) which take into account that multiple tests are conducted. An APV can be
directly taken as the p-value of a hypothesis belonging to a comparison of multiple
algorithms.
In the following, we will explain how to compute the APVs for the three post-hoc
procedures described above, following the indications given in Wright (1992).
Bonferroni APVi : min{v; 1}, where v = (k 1)pi .
Holm APVi : min{v; 1}, where v = max{(k j )pj : 1 j i}.
Hochberg APVi : max{(k j )pj : (k 1) j i}.
Appendix B: published average results of the CEC2005 Special Session
SPC-PNX
L-SaDE
L-CMA-ES
K-PCX
G-CMA-ES
EDA
DMS-L-PSO
DE
CoEVO
BLX-MA
BLX-GL50
Algorithm
SPC-PNX
L-SaDE
L-CMA-ES
K-PCX
G-CMA-ES
EDA
DMS-L-PSO
DE
CoEVO
f20
4.46 102
8 102
8.629 102
4.6 102
8.22 102
6.519 102
3 102
8.13 102
4.42 102
7.13 102
4.4 102
4.49 102
7.628 102
8.445 102
4.2 102
7.143 102
5.644 102
3.26 102
7.51 102
5.16 102
7.049 102
3.8 102
6.65
3.65
4.891
1.91
f19
4.08 10
4.969
7.304
2.334
4.557
9.029
8.47 101
4.623
3.944
9.34 101
4.975
5.643
2.677 10
1.25 10
3.622
5.289
7.96 102
2.39 101
BLX-GL50
BLX-MA
f11
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
109
f10
f2
f1
Algorithm
SPC-PNX
L-SaDE
L-CMA-ES
K-PCX
G-CMA-ES
EDA
DMS-L-PSO
DE
CoEVO
BLX-MA
BLX-GL50
Algorithm
6.893 102
7.218 102
6.349 102
4.92 102
5.36 102
4.84 102
5 102
1.05 103
4.04 102
4.64 102
6.801 102
f21
4.069 102
7.43 10
6.046 102
3.17 10
2.4001
4.423 102
2.93 10
1.49 102
2.09 102
4.501 107
2.595 102
f12
2.121 10
109
4.15 101
109
1.672 102
1.081 105
5.705 102
4.771 104
109
1.94 106
109
f3
7.586 102
6.709 102
7.789 102
7.18 102
6.924 102
7.709 102
7.29 102
6.59 102
7.4 102
7.349 102
7.493 102
f22
0.22
8.379 101
7.498 101
7.736 101
1.137
9.77 101
3.689 101
1.841
6.96 101
6.53 101
4.94 101
f13
109
1.997 108
109
109
1.885 103
109
109
7.94 107
1.76 106
1.418 105
109
f4
Table 13 Average error rate obtained in CEC2005 Special Session in dimension 10
6.389 102
9.267 102
8.346 102
5.72 102
7.303 102
6.405 102
5.59 102
1.06 103
7.91 102
6.641 102
5.759 102
f23
2.172
2.03
3.706
3.45
2.36
2.63
3.01
2.35
4.01
2.915
3.046
f14
109
2.124 102
2.133
109
1.138 106
109
109
4.85 10
109
0.012
109
f5
2 102
2.24 102
3.138 102
2 102
2.24 102
2 102
2 102
4.06 102
8.65 102
2 102
2 102
f24
32
2.538 102
4 102
2.696 102
2.938 102
2.59 102
4.854
3.65 102
2.28 102
5.1 102
2.11 102
f15
109
1.49
1.246 10
1.59 101
6.892 108
4.182 102
109
4.78 101
109
1.199 108
1.891 10
f6
4.036 102
3.957 102
2.573 102
9.23 102
3.657 102
3.73 102
3.74 102
4.06 102
4.42 102
3.759 102
4.06 102
f25
9.349 10
1.016 102
1.772 102
1.13 102
9.476 10
1.439 102
9.13 10
9.59 10
1.05 102
1.012 102
1.096 102
f16
0.02
8.261 102
1.172 102
1.971 101
3.705 102
1.46 101
4.519 102
4.205 101
109
2.31 101
109
f7
9.73 10
5.49 102
1.141 102
1.19 102
1.09 102
1.27 102
2.118 102
1.15 102
1.101 102
1.568 102
1.23 102
f17
2.035 10
2.019 10
2.027 10
2.04 10
2 10
2.034 10
2 10
2 10
2 10
2 10
2.099 10
f8
4.2 102
8.033 102
9.014 102
4 102
7.607 102
4.832 102
3.32 102
7.52 102
4.97 102
7.194 102
4.396 102
f18
1.154
4.379 101
1.919 10
9.55 101
109
5.418
2.39 101
1.19 101
4.49 10
109
4.02
f9
S. Garca et al.
References
Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974)
Auger, A., Hansen, N.: A restart CMA evolution strategy with increasing population size. In: Proceedings
of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 17691776 (2005a)
Auger, A., Hansen, N.: Performance evaluation of an advanced local search evolutionary algorithm. In:
Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 17771784
(2005b)
Ballester, P.J., Stephenson, J., Carter, J.N., Gallagher, K.: Real-parameter optimization performance study
on the CEC-2005 benchmark with SPC-PNX. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 498505 (2005)
Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation: The New Experimentalism.
Springer, New York (2006)
Czarn, A., MacNish, C., Vijayan, K., Turlach, R., Gupta, R.: Statistical exploratory analysis of genetic
algorithms. IEEE Trans. Evol. Comput. 8(4), 405421 (2004)
Demar, J.: Statistical comparisons of classiers over multiple data sets. J. Mach. Learn. Res. 7, 130
(2006)
Gallagher, M., Yuan, B.: A general-purpose tunable landscape generator. IEEE Trans. Evol. Comput. 10(5),
590603 (2006)
Garca, S., Molina, D., Lozano, M., Herrera, F.: An experimental study on the use of non-parametric tests
for analyzing the behaviour of evolutionary algorithms in optimization problems. In: Proceedings of
the Spanish Congress on Metaheuristics, Evolutionary and Bioinspired Algorithms (MAEB2007),
pp. 275285 (2007) (in Spanish)
Garca-Martnez, C., Lozano, M.: Hybrid real-coded genetic algorithms with female and male differentiation. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp.
896903 (2005)
Hansen, N.: (2005). Compilation of results on the CEC benchmark function set. Tech. Report, Institute of Computational Science, ETH Zurich, Switzerland. Available as http://www.ntu.edu.sg/home/
epnsugan/index_les/CEC-05/compareresults.pdf
Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of signicance. Biometrika 75, 800803
(1988)
Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian J. Statist. 6, 6570 (1979)
Hooker, J.: Testing heuristics: we have it all wrong. J. Heuristics 1(1), 3342 (1997)
Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Commun.
Stat. 18, 571595 (1980)
Kramer, O.: An experimental analysis of evolution strategies and particle swarm optimisers using design of experiments. In: Proceedings of the Genetic and Evolutionary Computation Conference 2007
(GECCO2007), pp. 674681 (2007)
Liang, J.J., Suganthan, P.N.: Dynamic multi-swarm particle swarm optimizer with local search. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 522528 (2005)
Molina, D., Herrera, F., Lozano, M.: Adaptive local search parameters for real-coded memetic algorithms.
In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 888895
(2005)
Moreno-Prez, J.A., Campos-Rodrguez, C., Laguna, M.: On the comparison of metaheuristics through
non-parametric statistical techniques. In: Proceedings of the Spanish Congress on Metaheuristics,
Evolutionary and Bioinspired Algorithms (MAEB2007), pp. 286293 (2007) (in Spanish)
Morse, D.T.: Minsize2: a computer program for determining effect size and minimum sample size for
statistical signicance for univariate, multivariate, and nonparametric tests. Educ. Psychol. Meas.
59(3), 518531 (1999)
Noether, G.E.: Sample size determination for some common nonparametric tests. J. Am. Stat. Assoc.
82(398), 645647 (1987)
Ortiz-Boyer, D., Hervs-Martnez, C., Garca-Pedrajas, N.: Improving crossover operators for real-coded
genetic algorithms using virtual parents. J. Heuristics 13, 265314 (2007)
Ozcelik, B., Erzurumlu, T.: Comparison of the warpage optimization in the plastic injection molding using
ANOVA, neural network model and genetic algorithm. J. Mater. Process. Technol. 171(3), 437445
(2006)
Patel, J.K., Read, C.B.: Handbook of the Normal Distribution. Dekker, New York (1982)
S. Garca et al.
Pok, P.: Real-parameter optimization using the mutation step co-evolution. In: Proceedings of the 2005
IEEE Congress on Evolutionary Computation (CEC2005), pp. 872879 (2005)
Qin, A.K., Suganthan, P.N.: Self-adaptive differential evolution algorithm for numerical optimization. In:
Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 17851791
(2005)
Rojas, I., Gonzlez, J., Pomares, H., Merelo, J.J., Castillo, P.A., Romero, G.: Statistical analysis of the
main parameters involved in the design of a genetic algorithm. IEEE Trans. Syst. Man Cybern. Part
C 32(1), 3137 (2002)
Rnkknen, J., Kukkonen, S., Price, K.V.: Real-parameter optimization using the mutation step coevolution. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005),
pp. 506513 (2005)
Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561584 (1995)
Sinha, A., Tiwari, S., Deb, K.: A population-based, steady-state procedure for real-parameter optimization.
In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 514521
(2005)
Suganthan, P.N., Hansen, N., Liang, J.J., Deb, K., Chen, Y.P., Auger, A., Tiwari, S.: (2005). Problem
denitions and evaluation criteria for the CEC 2005 Special Session on Real Parameter Optimization. Tech. Report, Nanyang Technological University. Available as http://www.ntu.edu.sg/home/
epnsugan/index_les/CEC-05/Tech-Report-May-30-05.pdf
Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton
(2003)
Whitley, D.L., Beveridge, R., Graves, C., Mathias, K.E.: Test driving three 1995 genetic algorithms: new
test functions and geometric matching. J. Heuristics 1(1), 77104 (1995)
Whitley, D.L., Rana, S., Dzubera, J., Mathias, K.E.: Evaluating evolutionary algorithms. Artif. Intell.
85(12), 245276 (1996)
Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput.
1(1), 6782 (1997)
Wright, S.P.: Adjusted p-values for simultaneous inference. Biometrics 48, 10051013 (1992)
Yuan, B., Gallagher, M.: On building a principled framework for evaluating and testing evolutionary algorithms: a continuous landscape generator. In: Proceedings of the 2003 Congress on Evolutionary
Computation (CEC2003), pp. 451458 (2003)
Yuan, B., Gallagher, M.: Experimental results for the Special Session on Real-Parameter Optimization at
CEC 2005: a simple, continuous EDA. In: Proceedings of the 2005 IEEE Congress on Evolutionary
Computation (CEC2005), pp. 17921799 (2005)
Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)
2.4.2.
An Extension on Statistical Comparisons of Classiers over Multiple

Data Sets for all Pairwise Comparisons
S. Garc F. Herrera, An Extension on Statistical Comparisons of Classiers over Multiple

a,
Data Sets for all Pairwise Comparisons. Journal of Machine Learning Research, in press
(2008).
Estado: Aceptado

Area de Conocimiento: Automation and Control Systems. Ranking 2 / 52.
JMLR Decision (07-268, 2)
1 de 1
Asunto: JMLR Decision (07-268, 2)

De: John Shawe-Taylor <jst@cs.ucl.ac.uk>
Fecha: Thu, 23 Oct 2008 10:03:41 -0400 (EDT)
CC: jst@cs.ucl.ac.uk
Your manuscript entitled "An Extension on Statistical Comparisons of
Classifiers over Multiple Data Sets for all pairwise comparisons"
(07-268, version 2) has been reviewed by external reviewers and the
JMLR Editorial Board.
I am pleased to inform you that your manuscript has been accepted for
publication. Please contact Rich Maclin, the Production Editor, (at
rmaclin@d.umn.edu) for details about preparing the final version for
publication.
The decision and more detailed information can be found by logging
into to http://jmlr.csail.mit.edu/manudb/center/.
Thank you for considering publishing this work in the Journal of
Machine Learning Research. Congratulations, and please keep us in
mind for your next submission.
John Shawe-Taylor
Action Editor,
Journal of Machine Learning Research
25/10/2008 16:56
Journal of Machine Learning Research n (200x)
Submitted 10/07, Revised 04/08; Published
An Extension on Statistical Comparisons of Classiers over

Multiple Data Sets for all Pairwise Comparisons
Salvador Garc
a
Francisco Herrera
Department of Computer Science and Articial Intelligence

University of Granada
Editor:
Abstract
In a recently published paper in JMLR, Demar (2006) recommends a set of nons
parametric statistical tests and procedures which can be safely used for comparing the
performance of classiers over multiple data sets. After studying the paper, we realize
that the paper correctly introduces the basic procedures and some of the most advanced
ones when comparing a control method. However, it does not deal with some advanced
topics in depth. Regarding these topics, we focus on more powerful proposals of statistical
procedures for comparing n n classiers. Moreover, we illustrate an easy way of obtaining
adjusted and comparable p-values in multiple comparison procedures.
Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses
1. Introduction
In the Machine Learning (ML) scientic community there is a need for rigorous and correct
statistical analysis of published results, due to the fact that the development or modications of algorithms is a relatively easy task. The main inconvenient related to this necessity
is to understand and study the statistics and to know the exact techniques which can or
cannot be applied depending on the situation, i.e. type of results obtained. In a recently
published paper in JMLR, Demar (2006), a group of useful guidelines are given in order
s
to perform a correct analysis when we compare a set of classiers over multiple data sets.
Demar recommends a set of non-parametric statistical techniques (Zar, 1999; Sheskin,
s
2003) for comparing classiers under these circumstances, given that the sample of results
obtained by them does not fulll the required conditions and it is not large enough for
making a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classication tasks and he checked that they are more convenient than parametric
techniques.
Recent studies apply the guidelines given by Demar in the analysis of performance of
s
classiers (Esmeir and Markovitch, 2007; Marrocco et al., 2008). In them, a new proposal
or methodology is oered and it is compared with other methods by means of pairwise
comparisons. Another type of studies assume an empirical comparison or review of already proposed methods. In these cases, no proposal is oered and a statistical comparison
c 200x Salvador Garc and Francisco Herrera.
a
Garc and Herrera

a
could be very useful in determining the dierences among the methods. In the specialized
literature, many papers provide reviews on a specic topic and they also use statistical
methodology to perform comparisons. For example, in a review of ensembles of decision
trees, non-parametric tests are also applied in the analysis of performance (Baneld et al.,
2007). However, only the rankings computed by Friedmans method (Friedman, 1937) are
stipulated and authors establish comparisons based on them, without taking into account
signicance levels. Demar focused his work in the analysis of new proposals, and he introdus
ced the Nemenyi test for making all pairwise comparisons (Nemenyi, 1963). Nevertheless,
the Nemenyi test is very conservative and it may not nd any dierence in most of the
experimentations. In recent papers, the authors have used the Nemenyi test in multiple
comparisons. Due to the fact that this test posses low power, authors have to employ many
data sets (Yang et al., 2007b) or most of the dierences found are not signicant (Yang
et al., 2007a; Nnez et al., 2007). Although the employment of many data sets could seem
u
benecial in order to improve the generalization of results, in some specic domains, i.e.
imbalanced classication (Owen, 2007) or multi-instance classication (Murray et al., 2005),
data sets are dicult to nd.
Procedures with more power than Nemenyis one can be found in specialized literature.
We have based on the necessity to apply more powerful procedures in empirical studies
in which no new method is proposed and the benet consists of obtaining more statistical
dierences among the classiers compared. Thus, in this paper we describe these procedures
and we analyze their behavior by means of the analysis of multiple repetitions of experiments
with randomly selected data sets.
On the other hand, we can see other works in which the p-value associated to a comparison between two classiers is reported (Garc
a-Pedrajas and Fyfe, 2007). Classical
non-parametric tests, such as Wilcoxon and Friedman (Sheskin, 2003), may be incorporated in most of the statistical packages (SPSS, SAS, R, etc.) and the computation of the
nal p-value is usually implemented. However, advanced procedures such as Holm (1979),
Hochberg (1988), Hommel (1988) and the ones described in this paper are usually not incorporated in statistical packages. The computation of the correct p-value, or Adjusted
P -Value (APV) (Westfall and Young, 2004), in a comparison using any of these procedures is not very dicult and, in this paper, we show how to include it with an illustrative
example.
The paper is set up as follows. Section 2 presents more powerful procedures for comparing all the classiers among them in a n n comparison of multiple classiers and a case
study. In Section 3 we describe the procedures for obtaining the APV by considering the
post-hoc procedures explained by Demar and the ones explained in this paper. In Section
s
4, we perform an experimental study of the behavior of the statistical procedures and we
discuss the results obtained. Finally, Section 5 concludes the paper.
2. Comparison of Multiple Classiers: Performing All pairwise

Comparisons
In the paper Demar (2006), referring to carrying out comparisons of more than two class
siers, a set of useful guidelines were given for detecting signicant dierences among the
results obtained and post-hoc procedures for identifying these dierences. Friedmans test
2
An Extension on Statistical Comparisons of Classifiers over Multiple Data Sets
is an omnibus test which can be used to carry out these type of comparison. It allows to
detect dierences considering the global set of classiers. Once Friedmans test rejects the
null hypothesis, we can proceed with a post-hoc test in order to nd the concrete pairwise
comparisons which produce dierences. Demar described the use of the Nemenyi test used
s
when all classiers are compared with each other. Then, he focused on procedures that
control the family-wise error when comparing with a control classier, arguing that the
objective of a study is to test whether a newly proposed method is better than the existing
ones. For this reason, he described and studied in depth more powerful and sophisticated procedures derived from Bonferroni-Dunn such as Holms, Hochbergs and Hommels
methods.
Nevertheless, we think that performing all pairwise comparisons in an experimental
analysis may be useful and interesting in dierent cases when proposing a new method. For
example, it would be interesting to conduct a statistical analysis over multiple classiers in
review works in which no method is proposed. In this case, the repetition of comparisons
choosing dierent control classiers may lose the control of the family-wise error.
Our intention in this section is to give a detailed description of more powerful and
advanced procedures derived from the Nemenyi test and to show a case study that uses
these procedures.
2.1 Advanced Procedures for Performing All pairwise Comparisons
A set of pairwise comparisons can be associated with a set or family of hypotheses. Any
of the post-hoc tests which can be applied to non-parametric tests (that is, those derived
from the Bonferroni correction or similar procedures) work over a family of hypotheses. As
Demar explained, the test statistics for comparing the i-th and j-th classier is
s
z=
(Ri Rj )
k(k+1)
6N
where Ri is the average rank computed through the Friedman test for the i-th classier,
k is the number of classiers to be compared and N is the number of data sets used in the
comparison.
The z value is used to nd the corresponding probability (p-value) from the table of
normal distribution, which is then compared with an appropriate level of signicance
(Table A1 in Sheskin (2003)). Two basic procedures are:
Nemenyi (1963) procedure: it adjusts the value of in a single step (identically to
Bonferroni-Dunn, Olejnik et al. (1997)) by dividing the value of by the number of
comparisons performed, m = k(k 1)/2. This procedure is the simplest but it also
has little power.
Holm (1979) procedure: it was also described in Demar (2006) but it was used for
s
comparisons of multiple classiers involving a control method. It adjusts the value of
in a step down method. Let p1 , ..., pm be the ordered p-values (smallest to largest)
and H1 , ..., Hm be the corresponding hypotheses. Holms procedure rejects H1 to
H(i1) if i is the smallest integer such that pi > /(m i + 1). Other alternatives
3
Garc and Herrera

a
were developed by Hochberg (1988); Hommel (1988); Rom (1990). They are easy to
perform, but they often have a similar power to Holms procedure (they have more
power than Holms procedure, but the dierence between them is not very notable)
when considering all pairwise comparisons.
The hypotheses being tested belonging to a family of all pairwise comparisons are logically interrelated so that not all combinations of true and false hypotheses are possible. As
a simple example of such a situation suppose that we want to test the three hypotheses of
pairwise equality associated with the pairwise comparisons of three classiers Ci , i = 1, 2, 3.
It is easily seen from the relations among the hypotheses that if any one of them is false,
at least one other must be false. For example, if C1 is better/worse than C2 , then it is not
possible that C1 has the same performance as C3 and C2 has the same performance as C3 .
C3 must be better/worse than C1 or C2 or the two classiers at the same time. Thus, there
cannot be one false and two true hypotheses among these three.
Based on this argument, Shaer proposed two procedures which make use of the logical
relation among the family of hypotheses for adjusting the value of (Shaer, 1986).
Shaers static procedure: following Holms step down method, at stage j, instead of
rejecting Hi if pi /(m i + 1), reject Hi if pi /ti , where ti is the maximum
number of hypotheses which can be true given that any (i 1) hypotheses are false.
It is a static procedure, i.e. t1 , ..., tm are fully determined for the given hypotheses H1 , ..., Hm , independent of the observed p-values. The possible numbers of true
hypotheses, and thus the values of ti can be obtained from the recursive formula
k
S(k) =
{
j=1
j
+ x : x S(k j)},
2
where S(k) is the set of possible numbers of true hypotheses with k classiers being
compared, k 2, and S(0) = S(1) = {0}.
Shaers dynamic procedure: it increases the power of the rst by substituting /ti
at stage i by the value /t , where t is the maximum number of hypotheses that
i
i
could be true, given that the previous hypotheses are false. It is a dynamic procedure
since t depends not only on the logical structure of the hypotheses, but also on the
i
hypotheses already rejected at step i. Obviously, this procedure has more power than
the rst one. In this paper, we have not used this second procedure, given that it is
included in an advanced procedure which we will describe in the following.
In Bergmann and Hommel (1988) was proposed a procedure based on the idea of nding
all elementary hypotheses which cannot be rejected. In order to formulate BergmannHommels procedure, we need the following denition.
Denition 1 An index set of hypotheses I {1, ..., m} is called exhaustive if exactly all
Hj , j I, could be true.
4
In order to exemplify the previous denition, we will consider the following case: We
have three classiers, and we will compare them in a n n comparison. We will obtain
three hypotheses:
H1 = C1 es equal in behavior than C2 .
and eight possible sets Si :
S1 : All Hj are true.
S2 : H1 and H2 are true and H3 is false.
S5 : H1 is true and H2 and H3 are false.
S8 : All Hj are false.
Sets S1 , S5 , S6 , S7 and S8 can be possible, because their hypotheses can be true at
the same time, so they are exhaustive sets. Set S2 , basing on logically related hypotheses
principles, is not possible because the performance of C1 cannot be equal to C2 and C3 ,
whereas C2 has dierent performance than C3 . The same consideration can be done to S3
and S4 , which are not exhaustive sets.
Under this denition, it works as follows.
Bergmann and Hommel (1988) procedure: Reject all Hj with j A, where the
/
acceptance set
A=
{I : I exhaustive, min{Pi : i I} > /|I|}
is the index set of null hypotheses which are retained.

For this procedure, one has to check for each subset I of {1, ..., m} if I is exhaustive,
which leads to intensive computation. Due to this fact, we will obtain a set, named
E, which will contain all the possible exhaustive sets of hypotheses for a certain
comparison. A rapid algorithm which was described in Hommel and Bernhard (1994)
allows a substantial reduction in computing time. Once the E set is obtained, the
hypotheses that do not belong to the A set are rejected.
5
Garc and Herrera

a
Figure 1 shows a valid algorithm for obtaining all the exhaustive sets of hypotheses,
using as input a list of classiers C. E is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses. The most important step in the algorithm is
the number 6. It performs a division of the classiers into two subsets, in which the last
classier k always is inserted in the second subset and the rst subset cannot be empty.
In this way, we ensure that a subset yielded in a division is never empty and no repetitions are produced. For example, suppose a set C with three classiers C = {1, 2, 3}.
All possible divisions without taking into account the previous assumptions are:
D1 = {C1 = {}, C2 = {1, 2, 3}}, D2 = {C1 = {1}, C2 = {2, 3}}, D3 = {C1 = {2}, C2 =
{1, 3}}, D4 = {C1 = {1, 2}, C2 = {3}}, D5 = {C1 = {3}, C2 = {1, 2}}, D6 = {C1 =
{1, 3}, C2 = {2}}, D7 = {C1 = {2, 3}, C2 = {1}}, D8 = {C1 = {1, 2, 3}, C2 = {}}.
Divisions D1 and D8 , D2 and D7 , D3 and D6 , D4 and D5 are equivalent, respectively.
Furthermore, divisions D1 and D8 are not interesting. Using the assumptions in step
6 of the algorithm, the possible divisions are: D1 = {C1 = {1}, C2 = {2, 3}}, D2 =
{C1 = {2}, C2 = {1, 3}}, D3 = {C1 = {1, 2}, C2 = {3}}. In this case, all the divisions
are interesting and no repetitions are yielded. The computational complexity of the
2
algorithm for obtaining exhaustive sets is O(2n ). However, the computation requirements may be reduced by means of using storage capabilities. Relative exhaustive
sets for k i, 1 i (k 2) classiers can be stored in memory and there is no
necessity of invoking the obtainingExhaustive function recursively. The computational
complexity using storage capabilities is O(2n ), so the algorithm still requires intensive
computation.
An example illustrating the algorithm for obtaining all exhaustive sets is drawn in
Figure 2. In it, four classiers, enumerated from 1 to 4 in the C set, are used. The
comparisons or hypotheses are denoted by pairs of numbers without a separation
character between them. This illustration does not show the case in which the set
|Ci | < 2, for simplifying the representation. When |Ci | < 2, no comparisons can be
performed, so the obtainExhaustive function returns an empty set E.
An edge connecting two boxes represents an invocation of this function. In each
box, the list of classiers given as input and the rst initialization of the E set
are displayed. The main edges, whose starting point is the initial box, are labeled by the order of invocation. Below the graph, the resulting E subset in each
main edge is denoted. The nal E will be composed by the union of these E
subsets. At the end of the process, 14 distinct exhaustive sets are found: E =
{(12, 13, 14, 23, 24, 34), (23, 24, 34), (13, 14, 34), (12, 14, 24), (12, 13, 23),
(12), (13), (14), (23), (24), (34), (12, 34), (13, 24), (23, 14)}.
Table 1 gives the number of hypotheses (m), the number (2n ) of index sets I and the
number of exhaustive index sets (ne ) for k classiers being compared.
The following subsections present a case study of a nn comparison of some well-known

classiers over thirty data sets. In it, the four procedures explained above are employed.
6
Function obtainExhaustive(C = {c1 , c2 , ..., ck }: list of classiers)

1. Let E =
2. E = E {set of all possible and distinct pairwise comparisons using C}
3. If E ==
4. Return E
5. End if
6. For all possible divisions of C into two subsets C1 and C2 , ck C2 and C1 =
7. E1 = obtainExhaustive(C1 )
8. E2 = obtainExhaustive(C2 )
9. E = E E1
10. E = E E2
11. For each family of hypotheses e1 of E1
12. For each family of hypotheses e2 of E2
13. E = E (e1 e2 )
14. End for
15. End for
16. End for
17. Return E
Figure 1: Algorithm for obtaining all exhaustive sets
C2 = {2, 3}
C2 = {1, 3}
C1 = {1, 2}
E = {(23)}
E = {(13)}
E = {(12)}
C1 = {1, 2}
C2 = {1, 2, 4}
C2 = {1, 2, 3}
C1 = {1, 2}
C2 = {3, 4}
E = {(12,14,24)}
E = {(12)}
E = {(12,13,23)}
E = {(12)}
E = {(34)}
C2 = {1, 4}
1
2
E = {(14)}
C1 = {1, 3}
C = {1, 2, 3, 4}
C2 = {2, 4}
E = {(13)}
E = {(12,13,14,23,24,34)}
C2 = {2, 4}
E = {(24)}
E = {(24)}
6
5
4
C1 = {1, 3}
E = {(13)}
C1 = {2, 3}
C2 = {1, 3, 4}
C2 = {2, 3, 4}
E = {(23)}
E = {(13,14,34)}
C1 = {2, 3}
C1 = {1, 4}
E = {(23)}
E = {(23,24,34)}
E = {(14)}
C2 = {1, 4}
C2 = {2, 4}
E = {(14)}
E = {(24)}
1: E = {(12,13,23),(23),(13),(12)}
2: E = {(12),(34),(12,34)}
C2 = {3, 4}
C2 = {3, 4}
3: E = {(13),(24),(13,24)}
E = {(34)}
E = {(34)}
4: E = {(23,24,34),(23),(24),(34)}
5: E = {(23),(14),(23,14)}
6: E = {(13,14,34),(13),(14),(34)}
7: E = {(12,14,24),(12),(14),(24)}
Figure 2: Example of the obtaining of exhaustive sets of hypotheses considering 4 classiers
Garc and Herrera

a
k
4
5
6
7
8
9
m=
k
2
6
10
15
21
28
36
2m
64
1024
32768
2097152
2.7 108
6.7 1010
ne
14
51
202
876
4139
21146
Table 1: All pairwise comparisons of k classiers

2.2 Performing All Pairwise Comparisons: A Case Study
In the following, we show an example involving the four procedures described with a comparison of ve classiers: C4.5 (Quinlan, 1993); One Nearest Neighbor (1-NN) with Euclidean
distance, NaiveBayes, Kernel (McLachlan, 2004) 1 and, nally, CN2 (Clark and Niblett,
1989).2 The parameters used are specied in Section 4. We have used 10-fold cross validation and standard parameters for each algorithm. The results correspond to average
accuracy or 1 class error in test data. We have used 30 data sets.3 Table 2 shows the
overall process of computation of average rankings.
Friedman (1937) and Iman and Davenport (1980) tests check whether the measured average ranks are signicantly dierent from the mean rank Rj = 3. They respectively use the
2 and the F statistical distributions to determine if a distribution of observed frequencies
diers from the theoretical expected frequencies. Their statistics use nominal (categorical) or ordinal level data, instead of using means and variances. Demar (2006) detailed
s
the computation of the critical values in each distribution. In this case, the critical values
are 9.488 and 2.45, respectively at = 0.05, and the Friedmans and Iman-Davenports
statistics are:
2 = 39.647, FF = 14.309
F
Due to the fact that the critical values are lower than the respective statistics, we can
proceed with the post-hoc tests in order to detect signicant pairwise dierences among
all the classiers. For this, we have to compute and order the corresponding statistics
and p-values. The standard error in the pairwise comparison between two classiers is
k(k+1)
56
=
SE =
6N
630 = 0.408. Table 3 presents the family of hypotheses ordered by
their p-value and the adjustment of by Nemenyis, Holms and Shaers static procedures.
Nemenyis test rejects the hypotheses [14] since the corresponding p-values are smaller
than the adjusted s.
Holms procedure rejects the hypotheses [15].
1. Kernel method is a bayesian classier which employs a non-parametric estimation of density functions
through a gaussian kernel function. The adjustment of the covariance matrix is performed by the ad-hoc
method.
2. NaiveBayes and CN2 are classiers for discrete domains, so we have discretized the data prior to learning
with them. The discretizer algorithm is Fayyad and Irani (1993).
3. Data sets marked with * have been subsampled being adapted to slow algorithms, such as CN2.
Abalone*
Adult*
Australian
Autos
Balance
Breast
Bupa
Car
Cleveland
Crx
Dermatology
German
Glass
Hayes-Roth
Heart
Ion
Led7Digit
Letter*
Lymphography
Mushrooms*
OptDigits*
Satimage*
SpamBase*
Splice*
Tic-tac-toe
Vehicle
Vowel
Wine
Yeast
Zoo
average rank
C4.5
0.219 (3)
0.803 (2)
0.859 (1)
0.809 (1)
0.768 (3)
0.759 (1)
0.693 (1)
0.915 (1)
0.544 (2)
0.855 (2)
0.945 (3)
0.725 (2)
0.674 (4)
0.801 (1)
0.785 (2)
0.906 (2)
0.710 (2)
0.691 (2)
0.743 (3)
0.990 (1.5)
0.867 (3)
0.821 (3)
0.893 (2)
0.799 (2)
0.845 (1)
0.741 (1)
0.799 (2)
0.949 (4)
0.555 (3)
0.928 (2.5)
2.100
1-NN
0.202 (4)
0.750 (4)
0.814 (4)
0.774 (3)
0.790 (2)
0.654 (5)
0.611 (3)
0.857 (3)
0.531 (4)
0.796 (4)
0.954 (2)
0.705 (4)
0.736 (1)
0.357 (4)
0.770 (3)
0.359 (5)
0.402 (4)
0.827 (1)
0.739 (4)
0.482 (5)
0.098 (1)
0.872 (2)
0.824 (4)
0.655 (4)
0.731 (2)
0.701 (2)
0.994 (1)
0.955 (2)
0.505 (4)
0.928 (2.5)
3.250
NaiveBayes
0.249 (2)
0.813 (1)
0.845 (2)
0.673 (4)
0.727 (4)
0.734 (2)
0.572 (4.5)
0.860 (2)
0.558 (1)
0.857 (1)
0.978 (1)
0.739 (1)
0.721 (2)
0.520 (2.5)
0.841 (1)
0.895 (3)
0.728 (1)
0.667 (3)
0.830 (1)
0.941 (3)
0.915 (2)
0.815 (4)
0.902 (1)
0.925 (1)
0.693 (4)
0.591 (5)
0.603 (4)
0.989 (1)
0.569 (1)
0.945 (1)
2.200
Kernel
0.165 (5)
0.692 (5)
0.542 (5)
0.275 (5)
0.872 (1)
0.703 (4)
0.689 (2)
0.700 (5)
0.439 (5)
0.607 (5)
0.541 (5)
0.625 (5)
0.356 (5)
0.309 (5)
0.659 (5)
0.641 (4)
0.120 (5)
0.527 (5)
0.549 (5)
0.857 (4)
0.986 (1)
0.885 (1)
0.739 (5)
0.517 (5)
0.653 (5)
0.663 (3)
0.269 (5)
0.770 (5)
0.312 (5)
0.419 (5)
4.333
CN2
0.261 (1)
0.798 (3)
0.816 (3)
0.785 (2)
0.706 (5)
0.714 (3)
0.572 (4.5)
0.777 (4)
0.541 (3)
0.809 (3)
0.858 (4)
0.717 (3)
0.704 (3)
0.520 (2.5)
0.759 (4)
0.918 (1)
0.674 (3)
0.638 (4)
0.746 (2)
0.990 (1.5)
0.784 (4)
0.778 (5)
0.885 (3)
0.755 (3)
0.704 (3)
0.619 (4)
0.621 (3)
0.954 (3)
0.556 (2)
0.897 (4)
3.117
Table 2: Computation of the rankings for the ve algorithms considered in the study over
30 data sets, based on test accuracy by using ten-fold cross validation
Garc and Herrera

a
i
1
2
3
4
5
6
7
8
9
10
hypothesis
C4.5 vs. Kernel
NaiveBayes vs. Kernel
Kernel vs. CN2
C4.5 vs. 1NN
1NN vs. Kernel
1NN vs. NaiveBayes
C4.5 vs. CN2
NaiveBayes vs. CN2
1NN vs. CN2
C4.5 vs. NaiveBayes
z = (R0 Ri )/SE
5.471
5.226
2.98
2.817
2.654
2.572
2.49
2.245
0.327
0.245
p
4.487 108
1.736 107
0.0029
0.0048
0.008
0.0101
0.0128
0.0247
0.744
0.8065
N M
0.005
0.005
0.005
0.005
0.005
0.005
0.005
0.005
0.005
0.005
HM
0.005
0.0055
0.0063
0.0071
0.0083
0.01
0.0125
0.0167
0.025
0.05
SH
0.005
0.0083
0.0083
0.0083
0.0083
0.0125
0.0125
0.0167
0.025
0.05
Table 3: Family of hypotheses ordered by p-value and adjusting of by Nemenyi (NM),

Holm (HM) and Shaer (SH) procedures, considering an initial = 0.05
Shaers static procedure rejects the hypotheses [16].

Bergmann-Hommels dynamic procedure rst obtains the exhaustive index set of hypotheses. It obtains 51 index sets. We can see them in Table 4. From the index sets,
it computes the A set.4 It rejects all hypotheses Hj with j A, so it rejects the
/
hypotheses [18].
Size 1
(12)
(13)
(23)
(14)
(24)
(34)
(15)
(25)
(35)
(45)
Size 2
(12,34)
(13,24)
(14,23)
(12,35)
(13,25)
(15,23)
(12,45)
(13,45)
(23,45)
(14,25)
(15,24)
(14,35)
(24,35)
Size 3
(12,13,23)
(12,14,24)
(13,14,34)
(23,24,34)
(12,15,25)
(13,15,35)
(23,25,35)
(14,15,45)
(24,25,45)
(34,35,45)
Size 4
(12,13,23,45)
(12,14,24,35)
(12,34,35,45)
(13,14,25,34)
(13,15,24,35)
(13,24,25,45)
(14,15,23,45)
(14,23,25,35)
(15,23,24,34)
Size 6
(12,13,14,15,23,24,25,34,35,45)
(12,13,14,23,24,34)
(12,13,15,23,25,35)
(12,14,15,24,25,45)
(13,14,15,34,35,45)
(23,24,25,34,35,45)
Table 4: Exhaustive sets obtained for the case study. Those belonging to the Acceptance
set (A) are typed in bold.
Bergmann-Hommels dynamic procedure allows to clearly distinguishing among three
groups of classiers, attending to their performance:
4. We have considered that each classier follows the order: 1 - C4.5, 2 - 1-NN, 3 - NaiveBayes, 4 - Kernel,
5 - CN2. For example, the hypothesis 13 represents the comparison between C4.5 and NaiveBayes.
10
Best classiers: C4.5 and NaiveBayes.

Middle classiers: 1-NN and CN2.
Worst classier: Kernel.
In Demar (2006), we can nd a discussion about the power of Hochbergs and Homs
mels procedures with respect to Holms one. They reject more hypothesis than Holms, but
the dierences are in practice rather small (Shaer, 1995). The most powerful procedures
detailed in this paper, Shaers and Bergmann-Hommels, work following the same method
of Holms procedure, so it is possible to hybridize them with other types of step up procedures, such as Hochbergs, Hommels and Roms methods. When we apply these methods
by using the logical relationships among hypothesis in a static way, they do not control the
family-wise error (Hochberg and Rom, 1995). In opposite, when applying these methods
by detecting dynamical relationships, they control the family-wise error. In Hochberg and
Rom (1995), several extensions were given in this way. Furthermore, a small improvement
of power in the Bergmann-Hommel procedure described here can be achieved when using
Simes conjecture (Simes, 1986) in the obtaining of A set (see Hommel and Bernhard (1999)
for more details).
3. Adjusted P-Values
The smallest level of signicance that results in the rejection of the null hypothesis, the
p-value, is a useful and interesting datum for many consumers of statistical analysis. A
p-value provides information about whether a statistical hypothesis test is signicant or
not, and it also indicates something about how signicant the result is: The smaller the
p-value, the stronger the evidence against the null hypothesis. Most important, it does this
without committing to a particular level of signicance.
When a p-value is within a multiple comparison, as in the example in Table 3, it reects
the probability error of a certain comparison, but it does not take into account the remaining
comparisons belonging to the family. One way to solve this problem is to report APVs which
take into account that multiple tests are conducted. An APV can be compared directly
with any chosen signicance level . In this paper, we encourage the use of APVs due to
the fact that they provide more information in a statistical analysis.
In the following, we will explain how to compute the APVs depending on the post-hoc
procedure used in the analysis, following the indications given in Wright (1992); Hommel
and Bernhard (1999). We also include the post-hoc tests explained in Demar (2006) and
s
other for comparisons with a control classier. The notation used in the computation of
the APVs is the following:
Indexes i and j correspond each one to a concrete comparison or hypothesis in the
family of hypotheses, according to an incremental order by their p-values. Index i
always refers to the hypothesis in question whose APV is being computed and index
j refers to another hypothesis in the family.
pj is the p-value obtained for the j-th hypothesis.
11
Garc and Herrera

a
k is the number of classiers being compared.

m is the number of possible comparisons in an all pairwise comparisons design; that
is, m = k(k1) .
2
tj is the maximum number of hypotheses which can be true given that any (j 1)
hypotheses are false (see the description of Shaers static procedure in Section 2.1).
The procedures of p-value adjustment can be classied into:
one-step.
Bonferroni AP Vi : min{v; 1}, where v = (k 1)pi .
Nemenyi AP Vi : min{v; 1}, where v = m pi .
step-up.
Hochberg AP Vi : max{(k j)pj : (k 1) j i}.
Hommel AP Vi : see algorithm at Figure 3.
step-down.
Holm AP Vi (using a control classier): min{v; 1}, where v = max{(k j)pj :
1 j i}.
Nemenyi AP Vi : min{v; 1}, where v = m pi .
Holm AP Vi (using it in all pairwise comparisons): min{v; 1}, where v = max{(m
j + 1)pj : 1 j i}.
Shaer static AP Vi : min{v; 1}, where v = max{tj pj : 1 j i}.
Bergmann-Hommel AP Vi : min{v; 1}, where v = max{|I| min{pj , j I} :
I exhaustive, i I}.
Table 5 shows the results in the nal form of APVs for the example considered in this
section. As we can see, this example is suitable for observing the dierence of power among
the test procedures. Also, this table can provide information about the state of retainment
or rejection of any hypothesis, comparing its associated APV with the level of signicance
previously xed.
4. Experimental Framework
In this section, we want to determine the power and behavior of the studied procedures
through the experiments in which we repeatedly compared the classiers on sets of ten
randomly chosen data sets, recording the number of equivalence hypothesis rejected and
APVs. We follow a similar method used in Demar (2006).
s
The classiers used are the same as in the case study of the previous subsection: C4.5
with minimum number of item-sets per leaf equal to 2 and condence level tted for optimal
12
1. Set AP Vi = pi for all i.

2. For each j = k 1, k 2, ..., 2 (in that order)
3. Let B = .
4. For each i, i > (k 1 j)
5. Compute value ci = (j pi )/(j + i k + 1).
6. B = B ci .
7. End for
8. Find the smallest ci value in B; call it cmin .
9. If AP Vi < cmin , then AP Vi = cmin .
10. For each i, i (k 1 j)
11. Let ci = min(cmin , j pi ).
12. If AP Vi < ci , then AP Vi = ci .
13. End for
Figure 3: Algorithm for calculating APVs based on Hommels procedure
i
1
2
3
4
5
6
7
8
9
10
hypothesis
C4.5 vs .Kernel
NaiveBayes vs .Kernel
Kernel vs .CN2
C4.5 vs .1NN
1NN vs .Kernel
1NN vs .NaiveBayes
C4.5 vs .CN2
NaiveBayes vs .CN2
1NN vs .CN2
C4.5 vs .NaiveBayes
pi
4.487 108
1.736 107
0.0029
0.0048
0.008
0.0101
0.0128
0.0247
0.744
0.8065
AP VN M
4.487 107
1.736 106
0.0288
0.0485
0.0796
0.1011
0.1276
0.2474
1.0
1.0
AP VHM
4.487 107
1.563 106
0.023
0.0339
0.0478
0.0506
0.0511
0.0742
1.0
1.0
AP VSH
4.487 107
1.042 106
0.0173
0.0291
0.0478
0.0478
0.0511
0.0742
1.0
1.0
AP VBH
4.487 107
1.042 106
0.0115
0.0291
0.0319
0.0319
0.0383
0.0383
1.0
1.0
Table 5: APVs obtained in the example by Nemenyi (NM), Holm (HM), Shaers static
(SH) and Bergmann-Hommels dynamic (BH)
13
Garc and Herrera

a
accuracy and pruning strategy, naive Bayesian learner with continuous attributes discretized using Fayyad and Irani (1993) discretization, classic 1-Nearest-Neighbor classier with
Euclidean distance, CN2 with Fayyad-Iranis discretizer, star size = 5 and 95% of examples
to cover and Kernel classier with sigmaKernel = 0.01, which is the inverse value of the
variance that represents the radius of neighborhood. All classiers are available in KEEL
software (Alcal-Fdez et al., 2008).5
a
For performing this study, we have compiled a sample of fty data sets from the UCI
machine learning repository (Asuncion and Newman, 2007), all of them valid for a classication task.6 We measured the performance of each classier by means of accuracy in test
by using ten-fold cross validation. As Demar did, when comparing two classiers, samples
s
of ten data sets were randomly selected so that the probability for the data set i being
chosen was proportional to 1/(1 + ekdi ), where di is the (positive or negative) dierence
in the classication accuracies on that data set and k is the bias through which we can
regulate the dierences between the classiers. With k = 0, the selection is purely random
and as k is being higher, the selected data sets are favorable to a particular classier.
In comparisons of multiple classiers, samples of data sets have to be selected with the
probabilities computed from the dierences in accuracy of two classiers. We have chosen
C4.5 and 1-NN, due to the fact that we have found signicant dierences between them
in the study conducted before (Section 2.2) which involved thirty data sets. Note that the
repeated comparisons done here only involve ten data sets each time, so the rejection of
equivalence of two classiers is more dicult at the beginning of the process.
Figure 4 shows the results of this study considering the pairwise comparison between
C4.5 and 1-NN. It gives an approximation of the power of the statistical procedures considered in this paper. Figure 4(a) reects the number of times they rejected the equivalence
of C4.5 and 1-NN. Obviously, the Bergmann-Hommel procedure is the most powerful, followed by Shaers static procedure. The graphic also informs us about the use of logically
related hypothesis, given that the procedures that use this information have a bias towards
the same point and those which do not use this information, tend to a lower point than the
rst. When the selection of data sets is purely random (k = 0), the benet of using the
Bergmann-Hommel procedure is appreciable. Figure 4(b) shows the average APV of the
same comparison of classiers. As we can see, the Nemenyi procedure is too conservative in
comparison with the remaining procedures. Again, the benet of using more sophisticated
testing procedures is easily noticeable.
Figure 5 shows the results of this study considering all possible pairwise comparisons in
the set of classiers. It helps us to compare the overall behavior of the four testing procedures. Figure 5(a) presents the number of times they rejected any comparison belonging
to the family. Although it could seem that the selection of data sets determined by the
dierence of accuracy between two classiers may not inuence on the overall comparison,
the graphic shows us that it occurs. Furthermore, the lines drawn follow a parallel behavior,
5. It is also available at http://www.keel.es
6. The data sets used are: abalone, adult, australian, autos, balance, bands, breast, bupa, car, cleveland,
dermatology, ecoli, are, german, glass, haberman, hayes-roth, heart, iris, led7digit, letter, lymphography, magic, monks, mushrooms, newthyroid, nursery, optdigits, pageblocks, penbased, pima, ring,
satimage, segment, shuttle, spambase, splice, tae, thyroid, tic-tac-toe, twonorm, vehicle, vowel, wine,
wisconsin, yeast, zoo.
14
Nemenyi
Holm
Shaffer
Nemenyi
Bergmann
800
Shaffer
Bergmann
0.3
700
average p-value
rejected hypotheses
Holm
0.35
900
600
500
400
300
0.25
0.2
0.15
0.1
200
0.05
100
0
9 10 11 12 13 14 15 16 17 18 19 20
9 10 11 12 13 14 15 16 17 18 19 20
(a) Number of hypotheses rejected in pairwise

comparisons
(b) Average APV in pairwise comparisons
Figure 4: C4.5 vs. 1-NN

indicating us the relation and magnitude of power among the four procedures. In Figure
5(b) we illustrate the average APV for all the comparisons of classiers. We can notice that
the conservatism of the Nemenyi test is obvious with respect to the rest of procedures. The
benet of using a more advanced testing procedure is similar with respect to the following
less-powerful procedure, except for Holms procedure.
Nemenyi
Holm
Shaffer
Nemenyi
Bergmann
Shaffer
Bergmann
0.5
average p-value
0.6
2500
Holm
0.7
3000
rejected hypotheses
3500
2000
1500
1000
0.4
0.3
0.2
0.1
500
0
0
0
9 10 11 12 13 14 15 16 17 18 19 20
9 10 11 12 13 14 15 16 17 18 19 20
k
(a) Total number of hypotheses rejected
(b) Average APV in all comparisons
Figure 5: All comparisons

Finally, our recommendation on the usage of a certain procedure depends on the results
obtained in this paper and in our experience in understanding and implementing them:
We do not recommend the use of Nemenyis test, because it is a very conservative
procedure and many of the obvious dierences may not be detected.
When we use a considerable number of data sets with regards to number of classiers,
we could proceed with the Holm procedure.
However, conducting the Shaer static procedure means a not very signicant increase of the diculty with respect to the Holm procedure. Moreover, the benet of
15
Garc and Herrera

a
using information about logically related hypothesis is noticeable, thus we strongly

encourage the use of this procedure.
Bergmann-Hommels procedure is the best performing one, but it is also the most
dicult to understand and computationally expensive. We recommend its usage when
the situation requires so (i.e. the dierences among the classiers compared are not
very signicant), given that the results it obtains are as valid as using other testing
procedure.
5. Conclusions
The present paper is an extension of Demar (2006). Demar does not deal in depth with
s
s
some topics related to multiple comparisons involving all the algorithms and computations
of adjusted p-values.
In this paper, we describe other advanced testing procedures for conducting all pairwise
comparisons in a multiple comparisons analysis: Shaers static and Bergmann-Hommels
procedures. The advantage that they obtain is produced due to the incorporation of more
information about the hypotheses to be tested: in n n comparisons, a logical relationship
among them exists. As a general rule, the Bergmann-Hommel procedure is the most powerful one but it requires intensive computation in comparisons involving numerous classiers.
The second one, Shaers procedure, can be used instead of Bergmann-Hommels in these
cases. Moreover, we present the methods for obtaining the adjusted p-values, which are
valid p-values associated to each comparison useful to be compared with any level of signicance without restrictions and they also provide more information. We have illustrated
them with a case study and we have checked that the new described methods are more
powerful than the classical ones, Nemenyis and Holms procedures.
Acknowledgments
This research has been supported by the project TIN2005-08386-C05-01. S. Garc holds a
a
FPU scholarship from Spanish Ministry of Education and Science. The present paper was
submitted as a regular paper in the JMLR journal. After the review process, the action
editor Dale Schuurmans encourages us to submit the paper to the special topic on Multiple
Simultaneous Hypothesis Testing. We are very grateful to the anonymous reviewers and
both action editors who managed this paper for their valuable suggestions and comments
to improve its quality.
Appendix A. Source code of the procedures

The source code, written in JAVA, that implements all the procedures described in this
paper, is available at http://sci2s.ugr.es/keel/multipleTest.zip. The program allows the
A
input of data in CSV format and obtains as output a L TEX document.
16
References
J. Alcal-Fdez, L. Snchez, S. Garc M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero,
a
a
a,
C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernndez, and F. Herrera. KEEL: A software
a
tool to assess evolutionary algorithms to data mining problems. Soft Computing. doi:
10.1007/s00500-008-0323-y, 2008. In press.
A. Asuncion and D.J. Newman.
UCI machine learning repository, 2007.
http://www.ics.uci.edu/mlearn/MLRepository.html.
URL
R. E. Baneld, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A comparison of decision

tree ensemble creation techniques. IEEE Transactions on Pattern Anaylisis and Machine
Intelligence, 29(1):173180, 2007.
G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors,
Multiple Hypotheses Testing, pages 100115. Springer, Berlin, 1988.
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261283,
1989.
J. Demar. Statistical comparisons of classiers over multiple data sets. Journal of Machine
s
Learning Research, 7:130, 2006.
S. Esmeir and S. Markovitch. Anytime learning of decision trees. Journal of Machine
Learning Research, 8:891933, 2007.
U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous valued attributes
for classication learning. In Proceedings of the 13th International Joint Conference on
Articial Intelligence, pages 10221029. Morgan-Kaufmann, 1993.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association, 32:675701, 1937.
N. Garc
a-Pedrajas and C. Fyfe. Immune network based ensembles. Neurocomputing, 70
(7-9):11551166, 2007.
Y. Hochberg. A sharper bonferroni procedure for multiple tests of signicance. Biometrika,
75:800802, 1988.
Y. Hochberg and D. Rom. Extensions of multiple testing procedures based on Simes test.
Journal of Statistical Planning and Inference, 48:141152, 1995.
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics, 6:6570, 1979.
G. Hommel. A stagewise rejective multiple test procedure. Biometrika, 75:383386, 1988.
G. Hommel and G. Bernhard. A rapid algorithm and a computer program for multiple test
procedures using procedures using logical structures of hypotheses. Computer Methods
and Programs in Biomedicine, 43:213216, 1994.
17
Garc and Herrera

a
G. Hommel and G. Bernhard. Bonferroni procedures for logically related hypotheses. Journal of Statistical Planning and Inference, 82:119128, 1999.
R. L. Iman and J. M. Davenport. Approximations of the critical region of the friedman
statistic. Communications in Statistics, pages 571595, 1980.
C. Marrocco, R. P. W. Duin, and F. Tortorella. Maximizing the area under the ROC curve
by pairwise feature combination. Pattern Recognition, 41:19611974, 2008.
G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series
in Probability and Mathematical Statistics, 2004.
J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. Journal of Machine Learning
Research, 6:783816, 2005.
P. B. Nemenyi. Distribution-free multiple comparisons. PhD thesis, Princeton University,
1963.
M. Nnez, R. Fidalgo, and R. Morales. Learning in environments with unknown dynamics:
u
Towards more robust concept learners. Journal of Machine Learning Research, 8:2595
2628, 2007.
S. Olejnik, J. Li, S. Supattathum, and C.J. Huberty. Multiple testing and statistical power
with modied bonferroni procedures. Journal of Educational and Behavioral Statistics,
22(4):389406, 1997.
A. B. Owen. Innitely imbalanced logistic regression. Journal of Machine Learning Research, 8:761773, 2007.
J. R. Quinlan. Programs for Machine Learning. Morgan Kauman, 1993.
D. M. Rom. A sequentially rejective test procedure based on a modied bonferroni inequality. Biometrika, 77:663665, 1990.
J.P. Shaer. Modied sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826831, 1986.
J.P. Shaer. Multiple hypothesis testing. Annual Review of Psychology, 46:561584, 1995.
D. Sheskin. Handbook of parametric and nonparametric statistical procedures. Chapman &
Hall/CRC, 2003.
R.J. Simes. An improved Bonferroni procedure for multiple tests of signicance. Biometrika,
73:751754, 1986.
P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: examples and methods
for p-value adjustment. John Wiley and Sons, 2004.
S. P. Wright. Adjusted p-values for simultaneous inference. Biometrics, 48:10051013, 1992.
18
Y. Yang, G. Webb, K. Korb, and K. M. Ting. Classifying under computational resource

constraints: anytime classication using probabilistic estimators. Machine Learning, 69:
3553, 2007a.
Y. Yang, G. I. Webb, J. Cerquides, K. B. Korb, J. Boughton, and K. M. Ting. To select
or to weigh: A comparative study of linear combination schemes for superparent-onedependence estimators. IEEE Transcations on Knowledge and Data Engineering, 19(12):
16521665, 2007b.
J. H. Zar. Biostatistical Analysis. Prentice Hall, 1999.
19
2.4.3.
A Study of Statistical Techniques and Performance Measures for

Genetics-Based Machine Learning: Accuracy and Interpretability
S. Garc A. Fernndez, J. Luengo, F. Herrera, A Study of Statistical Techniques and Perfora,

a
mance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft
Computing, submitted (2008).
Estado: Sometido a Revisin
o
Area de Conocimiento: Interdisciplinary Applications. Ranking 61 / 92.
Submission Confirmation for SOCO-D-07-00252R2
1 de 1
Asunto: Submission Confirmation for SOCO-D-07-00252R2

De: "Editorial Office Soft Computing" <bgerla@unisa.it>
Fecha: 26 Aug 2008 12:52:32 -0400
Ref.: Ms. No. SOCO-D-07-00252R2
A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine
Learning: Accuracy and Interpretability
Dear Mr. Garca,
Soft Computing has received your revised submission.
You may check the status of your manuscript by logging onto Editorial Manager at
(http://soco.edmgr.com/).
Kind regards,
20/10/2008 07:50
Editorial Manager(tm) for Soft Computing

Manuscript Draft
Manuscript Number: SOCO-D-07-00252R2

Title: A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine
Learning: Accuracy and Interpretability
Article Type: Original Research
Keywords: Genetics-based machine learning; Genetic algorithms; Statistical tests; Non-Parametric
Tests; Cohen's kappa; Interpretability; Classification.
Corresponding Author: Mr. Salvador Garca, M.D.
Corresponding Author's Institution: University of Granada
First Author: Salvador Garca, M.D.
Order of Authors: Salvador Garca, M.D.; Alberto Fernndez, M.D.; Julin Luengo, M.D.; Francisco
Herrera, Ph.D.
Abstract: The experimental analysis on the performance of a proposed method is a crucial and
necessary task to carry out in a research. This paper is focused on the statistical analysis of the
results in the field of Genetics-Based Machine Learning. It presents a study involving a set of
techniques
which can be used for doing a rigorous comparison among algorithms, in terms of obtaining
successful classification models.
Two accuracy measures for multi-class problems have been employed: classification rate and
Cohen's kappa. Moreover, two interpretability measures have been employed: size of the rule set
and number of antecedents. We have studied whether the samples of results obtained by Geneticbased classifiers, using the performance measures cited above, check the necessary conditions for
being analyzed by means of parametrical tests. The results obtained state that the fulfillment of
these conditions are problem-dependent and indefinite, which supports the use of non-parametric
statistics in the experimental analysis. In addition, non-parametric tests can be satisfactorily
employed for
comparing generic classifiers over various data sets considering any performance measure.
According to these facts, we propose the use of the most powerful non-parametric statistical tests
to carry out multiple comparisons. However, the statistical analysis conducted on interpretability
must be carefully considered.
* Manuscript
Click here to download Manuscript: Garciaetal-SoftComputing-GBML-R2.pdf
Click here to view linked References
myjournal manuscript No.

(will be inserted by the editor)
A Study of Statistical Techniques and Performance Measures for

Genetics-Based Machine Learning: Accuracy and Interpretability
S. Garc 1 , A. Fernndez1 , J. Luengo1 , F. Herrera1
a
a
University of Granada, Department of Computer Science and Articial Intelligence, 18071 Granada, Spain e-mail:
{salvagl,alberto,julianlm,herrera}@decsai.ugr.es
Received: date / Revised version: date
Abstract The experimental analysis on the performance of a proposed method is a crucial and necessary
task to carry out in a research. This paper is focused
on the statistical analysis of the results in the eld of
Genetics-Based Machine Learning. It presents a study
involving a set of techniques which can be used for doing
a rigorous comparison among algorithms, in terms of obtaining successful classication models.
Two accuracy measures for multi-class problems have
been employed: classication rate and Cohens kappa.
Moreover, two interpretability measures have been employed: size of the rule set and number of antecedents.
We have studied whether the samples of results obtained by Genetic-based classiers, using the performance
measures cited above, check the necessary conditions for
being analyzed by means of parametrical tests. The results obtained state that the fulllment of these conditions are problem-dependent and indenite, which supports the use of non-parametric statistics in the experimental analysis. In addition, non-parametric tests can
be satisfactorily employed for comparing generic classiers over various data sets considering any performance
measure. According to these facts, we propose the use
of the most powerful non-parametric statistical tests to
carry out multiple comparisons. However, the statistical
analysis conducted on interpretability must be carefully
considered.
Key words Genetics-based machine learning, Genetic algorithms, Statistical tests, Non-Parametric Tests,
Cohens kappa, Interpretability, Classication.
Supported by the Spanish Ministry of Science and Technology under Project TIN-2005-08386-C05-01. S. Garc and
a
J. Luengo hold a FPU scholarship from Spanish Ministry of
Education and Science.
1 Introduction
In general, the classication problem can be covered by
numerous techniques and algorithms, which belong to
dierent paradigms of Machine Learning (ML). The new
developed methods for ML must be analyzed with previous approaches following a rigorous criterion, since in
any empirical comparison the results are dependent on
the choice of the cases for studying, the conguration
of the experimentation and the measurements of performance. Nowadays, the statistical validation of published
results is a necessity in order to establish a certain conclusion on an experimental analysis [18].
Evolutionary rule-based systems [21] is a kind of Genetics-Based Machine Learning (GBML) that uses sets of
rules as knowledge representation [22]. Many approaches
have been proposed in GBMLs based on oering some
advantages with respect to other existing ML techniques; such as the production of interpretable models, no
assumption of prior relationships among attributes and
possibility of obtaining compact and precise rule sets.
Some examples of proposed GBMLs are: GABIL [17],
SIA [41], XCS [43], DOGMA and JoinGA [25], G-Net
[4], UCS [12], GASSIST [7], OCEC [30] and HIDER [1].
Recently, statistical analysis is highly demanded in
any research work and thus, we can nd recent studies
that propose some methods for conducting comparisons
among various approaches [18,34]. Statistics allows us
to determine whether the obtained results are signicant with respect to the choices taken and whether the
conclusions achieved are supported by the experimentation that we have carried out. On the other hand, the
performance of classiers is not only given by their classication rate and there is a growing interest in proposing
or adapting new accuracy measures [11,19]. Most of the
accuracy measures are proposed for two-class problems
and their adaptation to multi-class problems is not intuitive [32]. Only two accuracy measures have been used
for multi-class problems with successful results: the classical classication rate and Cohens kappa measure. The
main dierence between them is the scoring of the true

classications rates. Classication rate scores all the successes over all classes, whereas Cohens kappa scores the
successes independently for each class and aggregates
them. The second way of scoring is less sensitive to randomness caused by dierent number of examples in each
class, which leads a learner to a bias towards obtaining
data-dependent models.
In GBMLs, the interpretability of the rule sets obtained is very important, due to the fact that very large
sets of rules or very complex rules are rather lacking in
interest.
The use of parametric statistical techniques over the
sample of results is only adequate when they fulll three
necessary conditions: independency, normality and homoscedasticity [37,47]. This paper shows that these conditions are usually not veried when analyzing GBML
algorithms. Under these assumptions, a statistical analysis conducted by means of parametric tests may not be
safe with respect to the achieved results and hence, the
conclusions about an experimental study could be incorrect.
In this paper, we are interested in the study of the
most appropriate statistical techniques and performance
measures for analyzing the experimentation of GBML
algorithms. We mainly focus on ve topics:
To study the fulllment of the necessary conditions
for a safe usage of parametric tests.
To emphasize the existing dierences between a pairwise comparison statistical procedure and a multiple
comparison statistical procedure, pointing out the
advantages of using the second ones.
To notice that the use of dierent performance measures may yield dierent conclusions in the statistical
study, due to the fact that they have dierent purposes in the evaluation of the algorithms.
To show the generality for comparing GBML algorithm with other ML approaches, in spite of the nonstochasticity of the latter methods. For doing it, we
will include CN2 algorithm [14] when conducting the
non-parametric statistical analysis.
Making an analysis based on interpretability is not
trivial. We give some concerns in this paper and we
justify why the available interpretability metrics have
to be treated with a grain of salt.
In order to do that, the paper is organized as follows. Section 2 presents the GBML algorithms used.
The description of the multi-class performance measures together with the experimental framework and the
results obtained are given in Section 3. We introduce
the statistical analysis and we carry out the study of the
necessary conditions for a safe use of parametric tests in
Section 4. Section 5 describes the procedures for doing
pairwise comparisons with non-parametric statistics. In
the case of multiple comparisons tests, we present and
use them in Section 6. We present the analysis based on
S. Garc et al.
a
interpretability and we give our concerns in Section 7.

Finally, the conclusions are summarized in Section 8. An
appendix is included containing an extended description
of the GBML methods used in our study.
2 Genetics-Based Machine Learning Algorithms
for Classication
In this paper we use GBML methods in order to perform classication tasks. Specically, we have chosen
four Genetic Interval Rule Based Algorithms, such as
Pittsburgh Genetic Interval Rule Learning Algorithm
(Pitts-GIRLA), XCS, Genetic Algorithm based Classier System (GASSIST-ADI) and Hierarchical Decision
Rules (HIDER). The algorithms are provided by the
KEEL software [2], which includes update versions of
these GBML methods.
In the following we will give a brief description of
the dierent approaches that we have employed in our
work. A wider explanation about the methods exposed
here can be found in the appendix of this work.
1. Pittsburgh Genetic Interval Rule Learning Algorithm.
The Pitts-GIRLA Algorithm [16] is a GBML method
which makes use of the Pittsburgh approach in order
to perform a classication task. The main structure
of this algorithm is a generational Genetic Algorithm
(GA), in which for each generation the steps of selection, crossover, mutation and replacement are applied.
We initialize all the chromosomes at random, with
values between the range of each variable. The selection mechanism consists in choosing two individuals
at random among all the chromosomes of the population.
The tness of a particular chromosome is simply the
percentage of instances correctly classied by the chromosomes rule set (classication rate).
The best chromosome of the population is always
maintained as in the elitist scheme.
2. XCS Algorithm.
XCS [43] is a Learning Classier System (LCS) [38]
that evolves online a set of rules that describe the feature space accurately. It inherits part of its behavior
from ZCS [42], and diers in several ways from more
traditional LCSs. Firstly, the classier tness is based on the payo prediction instead of the prediction
itself. Secondly, XCS has no message list. Finally, the
GA is applied over niches instead of the whole population. The set of rules has a xed maximum size N
and it is initially built by generalizing some of the
input examples.
3. GASSIST Algorithm.
GASSIST (Genetic Algorithms based claSSIer sySTem) [9] is a Pittsburgh-style LCS originally inspired
in GABIL [17] from where it has taken the semantically correct crossover operator.
Title Suppressed Due to Excessive Length
The core of the system consists of a GA which evolve

individuals formed by a set of production rules. The
individuals are evaluated according to the proportion
of correct classied training examples.
In GASSIST-ADI, the representation for real-valued
attributes is through Adaptive Discretization Intervals Rule Representation [6,8].
4. HIDER Algorithm.
HIerarchical DEcision Rules (HIDER) [1], produces a
hierarchical set of rules, that is, the rules are sequentially obtained and must be, therefore, tried in order
until one, whose conditions are satised, is found.
In order to extract the rule-list, a real-coded GA is
employed in the search process. Two genes will dene
the lower and upper bounds of the rule attribute.
One rule is extracted in each iteration of the GA and
all the examples covered by that rule are deleted. A
parameter called examples pruning factor denes
a percentage of examples that can remain uncovered.
Thus, the termination criterion is reached when there
are no more examples to cover, depending on the
examples pruning factor.
The GA main operators are dened in the following:
(a) Crossover : Where the ospring takes values between the upper and lower bounds of the parents.
(b) Mutation: Where a small value is subtracted or
added in the case of lower and upper bound respectively.
(c) Fitness Function: The tness function considers
a two-objective optimization, trying to maximize
the number of correctly classied examples and
to minimize the number of errors.
3 Performance Measures and Experimental
Results
In this section, we describe the accuracy measures for
multi-class problems and the interpretability metrics used
in this paper. In specialized literature, most of them
are designed for two-class problems [29]. Well-known accuracy measures for two-class problems are: classication rate, precision, sensitivity, specicity, G-mean [10],
F-score, AUC [24], Youdens index [46] and Cohens
Kappa [11].
Some of the two-class accuracy measures have been
adapted for multi-class problems. For example, in a recent paper [32] authors propose an approximating multiclass ROC analysis, which is theoretically possible but
its computation is still restrictive. Only two measures
are widely used because of their simplicity and successful
application when the number of classes is large enough.
We refer to classication rate and Cohens kappa measures, which will be explained in Subsection 3.1. The two
interpretability metrics will be described in Subsection
3.2. Finally, Subsection 3.3 presents the experimental
framework of this paper and shows the average results
obtained for each GBML algorithm employed.
3.1 Accuracy Measures for Multi-Class Problems

The analysis of the four GBML approaches described
above will be carried out by means of the following accuracy measures:
Classication rate: is the number of successful hits
relative to the total number of classications. It is
far the most commonly used metric for assessing the
performance of classiers for years [3,33,44].
Cohens kappa: is an alternative to classication rate,
a method, known for decades, that compensates for
random hits [15]. Its original intent was to measure
the degree of agreement, or disagreement, between
two people observing the same phenomenon. Cohens
kappa can be adapted to classication tasks and it is
recommended to be employed because it takes random successes into consideration as a standard, in the
same way as AUC measure [11]. Also, it is used in
some well-known software packages, such as WEKA
[44], SAS, SPSS, etc.
An easy way of computing Cohens kappa is making
use of the resulting confusion matrix in a classication task. With the expression (1), we can obtain the
Cohens kappa:
kappa =
C
C
i=1 xi. x.i
i=1 xii
,
C
2
n
i=1 xi. x.i
(1)
where xii is the cell count in the main diagonal, n

is the number of examples, C is the number of class
values, and x.i ,xi. are the columns and rows total
counts, respectively.
Cohens kappa ranges from 1 (total disagreement)
through 0 (random classication) to 1 (perfect agreement). Being a scalar, it is less expressive than ROC
curves when applied to binary-class cases. However,
for multi-class problems, kappa is a very useful, yet
simple, meter for measuring classiers classication
rate while compensating for random successes.
The main dierence between classication rate and
Cohens kappa is the scoring of the correct classications. Classication rate scores all the successes over
all classes, whereas Cohens kappa scores the successes independently for each class and aggregates
them. The second way of scoring is less sensitive to
randomness caused by dierent number of examples
in each class, which leads a learner to a bias towards
obtaining data-dependent models.
3.2 Interpretability Measures
The analysis of the four GBML approaches described in
the paper will be carried out by means of two interpretability measures:
Size: it is a measure that considers the number of
rules which compose the model (see expression 2).
S. Garc et al.
a
Reducing the size of the model increases the interpretability by the user.
Size = nR ,
(2)
Number of Antecedents (ANT): Let Ri being a rule

in the form Cond Class, and Cond composed by
(Antecedent1 Antecedent2 ... Antecedentk ), this
measure is dened as the following expression:
Ant(Ri ) = k.
(3)
The average number of antecedents in the rule is described in the expression:

AN T =
1
nR
nR
Ant(Ri ).
(4)
i=1
4 Study on the Initial Conditions for Parametric

Tests using Genetics-Based Machine Learning
3.3 Experimental Results

We have selected 14 data sets from UCI repository [5].
Table 1 summarizes the properties of these data sets. It
shows, for each data set, the number of examples (#Ex.),
number of attributes (#Atts.) and the number of classes
(#C.). In the case of presenting missing values (cleveland
and wisconsin) we have removed the instances with any
missing value before partitioning. We also add in the last
columns some of the Pitts-GIRLA parameters (number
of rules #R and number of generations #Gen) which we
have made problem-dependent in order to increase the
performance of the algorithm. The rest of the parameters
are common for all problems and they are shown in Table
2.
Table 1 Data Sets summary descriptions and Pitts-GIRLA
problem-dependent parameters
Data Set Description
Data set
#Ex.
#Atts.
bupa (bup)
345
6
cleveland (cle)
297
13
ecoli (eco)
336
7
glass (gla)
214
9
haberman (hab)
306
3
iris (iri)
150
4
monk-2 (mon)
432
6
new-Thyroid (new)
215
5
pima (pim)
768
8
vehicle (veh)
846
18
vowel (vow)
988
13
wine (win)
178
13
wisconsin (wis)
683
9
yeast (yea)
1484
8
all data sets, considering the classication rate and kappa

measures in test data, respectively. The column titled
Mean shows the average classication rate achieved and
the column titled SD shows the associated standard deviation. We stress the best result for each data-set and
the average one in boldface.
Using the same data sets and conguration of the
algorithms, Table 5 shows the results obtained for the
GBML approaches studied in this paper over all data
sets, considering size and ANT measures. The column
titled Mean shows the average size/ANT achieved and
the column titled SD shows the associated standard deviation. We stress the best result for each data-set and
the average one in boldface.
#C.
2
5
8
7
2
3
2
3
2
4
11
3
2
10
Pitts-GIRLA
#R
#Gen
30
5000
40
5000
40
5000
20
10000
10
5000
20
5000
20
5000
20
10000
10
5000
20
10000
20
10000
20
10000
50
5000
20
10000
We have used 10-fold cross validation (10fcv) and

we have repeated the experiments using the GBML algorithms with dierent random seed 5 times. Thus, we
will obtain samples composed by 50 results in each of
the measures considered. CN2 is deterministic and has
been run once, thus we can only obtain 10 results.
Tables 3 and 4 show the results obtained for the
GBML approaches studied in this paper and CN2 over
In this paper, we discuss on the use of statistical techniques for the analysis of GBML methods. Firstly, we
distinguish between two types of analysis: single data set
analysis and multiple data set analysis. A single data set
analysis is carried out when the results of two or more
algorithms are compared considering an unique problem
or data set. A multiple data set analysis is given when
our interest lies in comparing two or more approaches
over multiple problems or data sets simultaneously, in
the way of obtaining generalizable conclusions on an experimental study.
The Central Limit Theorem suggests that the sum of
many independent, identically distributed random variables approaches a normal distribution [37]. This theorem
for classication performance is rarely held, it depends
on the case of the problem studied and the number of
runs of the algorithm. However, an excessive number of
runs (the eect size of the samples) aects negatively
in the statistical test due to the fact that it makes a
statistical score more sensitive to a little dierence of results (which would not be detected), by the simple fact
of repeating runs. Thus, our intention is to study the necessary conditions for using parametric statistical tests
on single data set analysis by means of the obtaining
of large size result samples by running the algorithms
several times.
For doing so, we rstly introduce the necessary conditions mentioned above. Then, we present the analysis of
these conditions, and nally we show some case studies
of the normality property.
4.1 Conditions for a safe use of parametric tests
In [37], the distinction done between parametric and
non-parametric tests is based on the level of measure
represented by the data that will be analyzed. In this
way, a parametric test uses data with real values belonging to a range.
Table 2 Parameter specication for the algorithms employed in the experimentation.

Algorithm
Pitt-GIRLA
XCS
GASSIST-ADI
HIDER
CN2
Parameters
Number of rules: problem-dependent, Number of generations: problem-dependent
Population size: 61 chromosomes, Crossover Probability: 0.7, Mutation Probability: 0.5.
number of explores = 100000, population size = 6400, = 0.1, = 0.2, = 0.1,
= 10.0, mna = 2, del = 50.0, sub = 50.0, 0 = 1, do Action Set Subsumption = false,
tness reduction = 0.1, pI = 10.0, FI = 0.01, I = 0.0, = 0.25, = 0.8, = 0.04,
GA = 50.0, doGASubsumption = true, type of selection = RWS,
type of mutation = free, type of crossover = 2 point, P# = 0.33, r0 = 1.0, m0 = 0.1,
l0 = 0.1, doSpecify = false, nSpecify = 20.0 pSpecify = 0.5.
Threshold in Hierarchical Selection = 0,
Iteration of Activation for Rule Deletion Operator = 5,
Iteration of Activation for Hierarchical Selection = 24,
Minimum Number of Rules before Disabling the Deletion Operator = 12,
Minimum Number of Rules before Disabling the Size Penalty Operator = 4,
Number of Iterations = 750, Initial Number Of Rules = 20, Population Size = 400,
Crossover Probability = 0.6, Probability of Individual Mutation = 0.6,
Probability of Value 1 in Initialization = 0.90, Tournament Size = 3,
Possible size in micro-intervals of an attribute = {4, 5, 6, 7, 8, 10, 15, 20, 25},
Maximum Number of Intervals per Attribute = 5, psplit = 0.05, pmerge = 0.05,
Probability of Reinitialize Begin = 0.03, Probability of Reinitialize End = 0,
Use MDL = true, Iteration MDL = 25,
Initial Theory Length Ratio = 0.075, Weight Relaxation Factor = 0.90.
Class Initialization Method = cwinit, Default Class = auto,
Population Size = 100, Number of Generations = 100, Mutation Probability = 0.5,
Percentage of Crossing = 80, Extreme Mutation Probability = 0.05,
Prune Examples Factor = 0.05, Penalty Factor = 1, Error Coecient = 0.
Percentage of Examples to Cover = 95%,
Star Size = 5, Use Disjunct Selectors = NO
Table 3 Average in classication rate oered by the algorithms
bup
cle
eco
gla
hab
iri
mon
new
pim
veh
vow
win
wis
yea
AVG
Pitts-GIRLA
Mean
SD
.5922
.0641
.5583
.0376
.7367
.0850
.6247
.1104
.6997
.1245
.9493
.0514
.6236
.1165
.9140
.0499
.6485
.1161
.4594
.1095
.2467
.0548
.7039
.2199
.7655
.2269
.3723
.0877
.6353
.1039
XCS
Mean
SD
.6568
.0764
.5650
.0540
.8105
.0680
.7181
.1279
.7284
.0484
.9493
.0477
.6728
.0238
.9449
.0545
.7520
.0581
.7359
.0446
.5438
.0682
.9584
.0477
.9666
.0189
.4960
.0598
.7499
.0570
GASSIST-ADI
Mean
SD
.6306
.0932
.5613
.0693
.7985
.0703
.6472
.1035
.7121
.0676
.9653
.0409
.6673
.0407
.9269
.0511
.7425
.0437
.6783
.0421
.4020
.0356
.9056
.0744
.9564
.0247
.5442
.0327
.7242
.0564
HIDER
Mean
SD
.6186
.0986
.5545
.0723
.8422
.0597
.6962
.1331
.7485
.0449
.9640
.0409
.6719
.0206
.9382
.0660
.7473
.0497
.6593
.0502
.7248
.0482
.9476
.0792
.9653
.0236
.5781
.0376
.7625
.0611
CN2
Mean
SD
.5715
.0740
.5412
.0457
.8101
.0618
.6998
.0963
.7349
.0444
.9400
.0492
.6719
.0215
.9446
.0472
.7122
.0393
.6191
.0839
.6212
.0632
.9268
.0648
.9517
.0218
.5560
.0362
.7358
.1529
GASSIST-ADI
Mean
SD
.2382
.1842
.2750
.0948
.7158
.1000
.5019
.1416
.1272
.1921
.9480
.0614
.0460
.1161
.8424
.1077
.4131
.1103
.5714
.0558
.3422
.0391
.8560
.1135
.9040
.0542
.3983
.0453
.5128
.1011
HIDER
Mean
SD
.1793
.1939
.2387
.1182
.7761
.0827
.5665
.1899
.1469
.1719
.9460
.0613
.1095
.1697
.8644
.1363
.3794
.1334
.5450
.0669
.6969
.0530
.9201
.1171
.9222
.0532
.4481
.0505
.5528
.1141
CN2
Mean
SD
.0444
.1580
.1617
.0586
.7317
.0892
.5765
.1284
.1826
.1900
.9100
.0738
.0000
.0000
.8742
.1063
.2476
.1182
.4897
.1130
.5833
.0695
.8870
.1000
.8909
.0501
.4137
.0483
.4995
.0931
Table 4 Average in kappa oered by the algorithms
bup
cle
eco
gla
hab
iri
mon
new
pim
veh
vow
win
wis
yea
AVG
Pitts-GIRLA
Mean
SD
.0916
.1472
.1710
.1192
.6260
.1099
.4663
.1490
.0605
.1156
.9240
.0771
.0067
.0354
.8171
.1013
.1260
.2047
.2802
.1470
.1726
.0602
.5125
.3822
.5465
.3683
.1640
.1226
.3546
.1528
XCS
Mean
SD
.2619
.1837
.2995
.0949
.7345
.0964
.6089
.1731
.0943
.1431
.9240
.0716
.0107
.0536
.8762
.1327
.4321
.1404
.6479
.0593
.4982
.0751
.9371
.0716
.9271
.0411
.3279
.0837
.5415
.1014
The latter does not involve that when we always dispose of this type of data, we should use a parametric test.
It is possible that one or more initial assumptions for the
use of parametric tests may be not fullled, making that
a statistical analysis loses credibility.
In order to use the parametric tests, it is necessary
to check the following conditions [37,47]:
Independence: In statistics, two events are independent when the fact that one occurs does not modify
the probability of the other one occurring.
Normality: An observation is normal when its behaviour follows a normal or Gauss distribution with
a certain value of mean and variance . A normality test applied over a sample can indicate the
S. Garc et al.
a
Table 5 Average of interpretability measures of GBML algorithms

Data Set
bup
cle
eco
gla
hab
iri
mon
new
pim
veh
vow
win
wis
yea
AVG
Pitts-GIRLA
Mean
SD
30.00
0.00
40.00
0.00
40.00
0.00
20.00
0.00
10.00
0.00
20.00
0.00
20.00
0.00
20.00
0.00
10.00
0.00
20.00
0.00
20.00
0.00
20.00
0.00
50.00
0.00
20.00
0.00
24.29
0.00
Size
XCS
Mean
SD
2400.62
198.03
4594.96
109.62
2321.02
147.56
3254.32
155.87
1181.52
360.75
547.08
105.77
283.78
95.72
1037.00
133.42
3576.62
150.34
5211.18
56.17
4284.34
141.49
4098.70
347.37
708.90
78.20
3608.44
221.66
2650.61
164.43
GASSIST-ADI
Mean
SD
16.84
6.20
10.76
4.54
6.32
1.45
8.52
2.51
7.92
3.25
4.08
0.27
5.50
0.61
5.42
1.01
15.34
4.61
11.68
3.92
11.92
4.44
4.30
0.54
5.92
1.35
8.38
2.17
8.78
2.70
HIDER
Mean
SD
5.56
1.05
22.30
2.28
8.66
1.10
22.38
2.43
2.26
0.66
3.00
0.00
6.26
4.15
3.34
0.52
8.84
2.00
46.86
4.59
114.50
4.55
27.50
2.53
2.12
0.33
46.12
8.27
22.84
2.46
presence or absence of this condition in the observed data. A well-known example of normality test is
the Kolmogorov-Smirnov test, which possess a very
low power. In this study, we will use more powerful
normality tests:
Shapiro-Wilk (SW): It analyzes the observed data
for computing the level of symmetry and kurtosis (shape of the curve) in order to compute the
dierence with respect to a Gaussian distribution
afterwards, obtaining the p-value from the sum of
the squares of these discrepancies. The power of
this test has been shown to be excellent; however, its performance is adversely aected in the
common situation where there is tied data.
DAgostino-Pearson (DP): It rst computes the
skewness and kurtosis to quantify how far from
Gaussian the distribution is in terms of asymmetry and shape. It then calculates how far each of
these values diers from the value expected with
a Gaussian distribution, and computes a single
p-value from the sum of these discrepancies. The
performance of this test is not as good as that of
SWs procedure, but it is not as aected by tied
data.
Heteroscedasticity: This property indicates the existence of a violation of the hypothesis of equality of
variances. Levenes test is used for checking whether or not k samples present this homogeneity of variances (homoscedasticity). When observed data does
not fulll the normality condition, it is more reliable
the result of using this test than Bartletts test [47],
which is another test that checks the same property.
With respect to the independence condition, Demar
s
in [18] suggests that independency is not truly veried in
10fcv (a portion of samples is used either for training and
testing in dierent partitions). In the following, we show
a normality analysis by using SW and DP tests, together
with a heteroscedasticity analysis by using Levenes test.
4.2 Analysis of the conditions for a safe use of
parametric tests
We apply the two tests of normality (SW and DP) presented above by considering a level of signicance of
Pitts-GIRLA
Mean
SD
2.96
0.16
6.09
0.27
3.52
0.24
3.96
0.32
1.53
0.23
1.91
0.19
2.73
0.23
2.17
0.21
3.04
0.42
7.77
0.42
5.60
0.43
5.75
0.41
4.46
0.23
3.67
0.31
3.94
0.29
XCS
Mean
SD
2.31
0.28
4.15
0.17
2.06
0.17
2.86
0.23
1.31
0.25
1.19
0.14
1.23
0.12
1.57
0.16
3.17
0.16
5.14
0.14
2.04
0.08
2.83
0.18
2.11
0.14
2.29
0.21
2.45
0.17
ANT
GASSIST-ADI
Mean
SD
3.53
0.46
3.28
0.92
1.69
0.40
2.30
0.64
1.81
0.48
0.91
0.21
1.27
0.95
1.52
0.29
3.50
0.63
3.19
0.65
2.15
0.54
1.74
0.35
3.19
0.72
2.37
0.48
2.32
0.55
HIDER
Mean
SD
5.29
0.36
5.75
0.24
5.39
0.39
8.44
0.20
1.93
0.25
2.26
0.41
3.12
0.99
4.35
0.41
7.20
0.38
17.36
0.16
9.93
0.11
12.79
0.11
3.46
0.59
6.07
0.12
6.67
0.34
= 0.05 (we have employed the statistical software package SPSS). Tables 6 and 7 show the results in classication rate and kappa measures, respectively. Tables 8
and 9 show the results in size and ANT measures, respectively. The symbol * indicates that the normality
condition is not satised and the value in brackets is the
p-value needed for rejecting the normality hypothesis.
As we can observe in the run of the two tests of normality, we can declare that the conditions needed for the
application of parametric tests are not fullled in some
cases. The normality condition is not always satised although the size of the sample of results would be enough
(50 in this case). A main factor that inuences this condition seems to be the nature of the problem, since there
exist some problems in which it is never satised, such
as the wine and the wisconsin problems in both classication rate and kappa measures, and the general trend is
not predictable. In addition, the results oered by PittsGIRLA are very distant to a normal shape. The measure
which yields less rejections of the normality condition is
ANT.
In relation to the heteroscedasticity study, Table 10
shows the results by applying Levenes test, where the
symbol * indicates that the variances of the distributions of the dierent algorithms for a certain function
are not homogeneities (the null hypothesis is rejected).
The homoscedasticity property is even more dicult
to be fullled, since the variances associated to each problem also depend on the algorithms results, that is,
the capacity of the algorithms for oering similar results with random seeds variations. This fact also inuences that an analysis of performance of GBML algorithms performed through parametric statistical treatment could mean erroneous conclusions.
4.3 Case studies of the Normality Property
We present two case studies of the normality property
considering the sample of results obtained by an GBML
method on a data set. Figures 1 and 2, show dierent
examples of graphical representations of histograms and
Q-Q graphics. A histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q-Q
Table 6 Normality condition in classication rate

Shapiro-Wilk
iri
mon
(.00)
* (.00)
(.00)
* (.00)
(.00)
(.07)
(.00)
(.06)
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
* (.02)
(.25)
(.17)
(.11)
cle
* (.00)
* (.03)
* (.01)
(.42)
eco
* (.00)
(.23)
(.22)
(.22)
gla
(.73)
* (.00)
(.31)
* (.00)
hab
* (.00)
* (.02)
(.08)
* (.01)
*
*
*
*
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
(.13)
(.44)
(.16)
(.07)
cle
(.10)
(.09)
(.13)
(.52)
eco
* (.00)
(.61)
(.88)
(.42)
gla
(.69)
(.06)
(.37)
(.05)
hab
* (.00)
(.22)
(.58)
(.78)
DAgostino-Pearson
iri
mon
(.11)
* (.00)
(.06)
* (.00)
(.08)
(.19)
* (.00)
(.19)
*
*
*
*
new
(.01)
(.00)
(.00)
(.00)
pim
(.00)
(.03)
(.01)
(.00)
veh
* (.00)
(.17)
(.96)
(.25)
vow
* (.00)
(.30)
(.32)
(.15)
*
*
*
*
pim
* (.00)
(.24)
* (.02)
* (.00)
veh
* (.02)
(.33)
(.93)
(.43)
vow
* (.00)
(.40)
(.95)
(.37)
pim
(.00)
(.04)
(.01)
(.01)
veh
* (.00)
(.17)
(.98)
(.23)
vow
* (.00)
(.30)
(.32)
(.56)
*
*
*
*
*
*
*
*
new
(.71)
* (.00)
(.70)
* (.00)
win
(.00)
(.00)
(.00)
(.00)
wis
(.00)
(.00)
(.04)
(.00)
yea
* (.00)
(.45)
(.78)
(.23)
win
* (.00)
* (.00)
(.17)
* (.00)
wis
* (.00)
* (.03)
(.36)
* (.02)
yea
* (.00)
(.48)
(.39)
(.18)
win
(.00)
(.00)
(.00)
(.00)
wis
* (.00)
* (.00)
(.07)
* (.00)
yea
* (.00)
(.51)
(.14)
(.20)
*
*
*
*
Table 7 Normality condition in Cohens kappa

Shapiro-Wilk
iri
mon
(.00)
* (.00)
(.00)
* (.00)
(.00)
* (.00)
(.00)
* (.00)
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
* (.00)
(.65)
(.30)
(.61)
cle
* (.02)
(.11)
* (.03)
(.42)
eco
* (.00)
(.37)
(.47)
(.21)
gla
(.79)
* (.00)
(.32)
* (.00)
hab
* (.00)
* (.01)
(.77)
* (.01)
*
*
*
*
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
* (.00)
(.54)
(.16)
(.33)
cle
(.49)
(.41)
(.10)
(.45)
eco
* (.00)
(.72)
(.90)
(.43)
gla
(.58)
(.06)
(.21)
(.05)
hab
* (.00)
* (.03)
(.96)
(.21)
DAgostino-Pearson
iri
mon
(.11)
* (.00)
(.06)
* (.00)
(.09)
* (.00)
* (.00)
* (.00)
new
(.80)
* (.00)
(.66)
* (.02)
pim
* (.00)
(.27)
* (.01)
* (.00)
veh
* (.01)
(.32)
(.95)
(.41)
vow
* (.01)
(.40)
(.95)
(.38)
win
* (.00)
* (.01)
(.18)
* (.00)
wis
* (.00)
* (.04)
(.39)
* (.01)
yea
* (.00)
(.35)
(.19)
(.20)
Shapiro-Wilk
iri
mon
* (.00)
* (.00)
(.75)
(.26)
* (.00)
* (.00)
* (.00)
* (.00)
new
* (.00)
(.74)
* (.00)
* (.00)
*
*
*
*
pim
(.00)
(.00)
(.01)
(.04)
veh
* (.00)
(.42)
(.16)
(.46)
vow
* (.00)
(.46)
* (.00)
(.84)
win
* (.00)
* (.00)
* (.00)
(.21)
wis
* (.00)
(.56)
* (.00)
* (.00)
yea
* (.00)
(.59)
(.13)
(.10)
DAgostino-Pearson
iri
mon
* (.00)
* (.00)
(.76)
(.27)
* (.00)
(.68)
* (.00)
(.10)
new
* (.00)
(.47)
(.54)
* (.01)
pim
* (.00)
* (.00)
* (.03)
(.98)
veh
* (.00)
(.38)
(.43)
(.47)
vow
* (.00)
(.22)
* (.04)
(.80)
win
* (.00)
* (.00)
* (.00)
(.37)
wis
* (.00)
(.52)
* (.00)
* (.00)
yea
* (.00)
(.76)
(.21)
(.21)
*
*
*
*
new
(.04)
(.00)
(.01)
(.00)
*
*
*
*
Table 8 Normality condition in size

cle
* (.00)
(.53)
* (.00)
(.16)
eco
* (.00)
(.86)
* (.00)
* (.00)
gla
* (.00)
(.89)
* (.00)
* (.00)
*
*
*
*
hab
(.00)
(.00)
(.00)
(.00)
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
* (.00)
(.67)
* (.01)
(.86)
cle
* (.00)
(.17)
* (.00)
(.61)
eco
* (.00)
(.86)
* (.00)
(.47)
gla
* (.00)
(.84)
* (.00)
* (.00)
hab
* (.00)
* (.00)
* (.00)
(.23)
graphic represents a confrontation between the quartiles

from data observed and those from the normal distribution.
In Figure 1 we observe a typical case of absolute lack
of normality. Figure 2 illustrates an example in which
the normality hypothesis is accepted as well by the two
tests used.
Normal Q-Q Plot of yeast

Histogram
algorithm GASSIST-ADI
algorithm GASSIST-ADI
12
10
2
Expected Normal
bup
* (.00)
(.91)
* (.00)
* (.00)
Frequency
Pitts-GIRLA
XCS
GASSIST
HIDER
-2
2
-4
0
0.45
0.50
0.55
yeast
Expected Normal
Frequency
30
20
0.60
0.65
-2
10
-4
0
0.10
0.55
Observed Value
Fig. 2 Classication rate results of GASSIST-ADI over

yeast data set: Histogram and Q-Q Graphic.
40
monks
0.50
algorithm XCS
4
50
0.00
0.45
Normal Q-Q Plot of monks
Histogram
algorithm XCS
-0.10
0.60
0.20
0.30
-0.1
0.0
0.1
0.2
0.3
Observed Value
Fig. 1 Cohens kappa results of XCS over monks data set:

Histogram and Q-Q Graphic.
5 Non-parametric tests for Comparing Two

Algorithms in Multiple Data Set Analysis
As we introduced previously, the obtention of results in
a single data set analysis when using GBML algorithms
is a relatively easy task, due to the fact that new results

can be yielded in new runs of the algorithms. In spite of
this fact, a sample of 50 results does not always veried
the necessary conditions for applying parametric tests,
as we could see in the previous section.
On the other hand, other ML approaches are not stochastic and it is not possible to obtain a larger sample of
results. This fact makes dicult the comparison between
GBML methods and deterministic ML algorithms, given
that the sample of results could not be large enough or
there is a necessity for using procedures which can operate with samples of dierent size.
The authors are usually familiarized with parametric and non-parametric tests for pairwise comparisons.
GBML approaches have been compared through para-
S. Garc et al.
a
Table 9 Normality condition in ANT

Pitts-GIRLA
XCS
GASSIST
HIDER
bup
(.83)
(.15)
(.85)
* (.01)
cle
(.13)
(.67)
* (.04)
(.22)
eco
(.10)
(.18)
* (.01)
* (.00)
gla
(.50)
(.10)
* (.00)
* (.01)
hab
(.26)
(.55)
(.19)
* (.00)
Pitts-GIRLA
XCS
GASSIST
HIDER
bup
(.73)
* (.03)
(.88)
* (.00)
cle
(.05)
(.57)
* (.00)
(.18)
eco
* (.04)
(.20)
* (.00)
* (.00)
gla
(.63)
(.23)
* (.00)
* (.00)
hab
(.84)
(.26)
(.13)
(.76)
Shapiro-Wilk
iri
mon
(.27)
* (.00)
(.64)
(.86)
* (.00)
* (.00)
* (.01)
* (.00)
DAgostino-Pearson
iri
mon
(.39)
(.07)
(.67)
(.89)
* (.00)
* (.00)
* (.00)
(.09)
new
(.12)
(.23)
* (.01)
* (.04)
pim
(.20)
(.73)
* (.00)
* (.01)
veh
(.51)
(.67)
(.27)
(.26)
vow
(.96)
(.43)
* (.00)
(.05)
win
(.32)
(.46)
(.09)
(.74)
wis
* (.03)
(.68)
(.58)
* (.00)
yea
(.39)
(.17)
(.38)
(.70)
new
(.38)
(.34)
* (.01)
(.61)
pim
(.41)
(.46)
* (.01)
* (.00)
veh
(.88)
(.50)
(.05)
(.69)
vow
(.84)
(.67)
* (.00)
(.18)
win
(.39)
(.56)
(.69)
(.63)
wis
* (.00)
(.61)
(.57)
(.27)
yea
(.33)
(.18)
(.72)
(.72)
Table 10 Heteroscedasticity condition by using Levenes test

Classification rate
Cohens kappa
Size
ANT
bup
(.16)
(.53)
* (.00)
* (.00)
*
*
*
*
cle
(.00)
(.02)
(.00)
(.00)
eco
(.77)
(.66)
* (.00)
* (.00)
gla
(.26)
(.17)
* (.00)
* (.00)
*
*
*
*
hab
(.01)
(.02)
(.00)
(.00)
iri
(.53)
(.53)
* (.00)
* (.00)
metric tests by means of paired t-tests [1, 4, 13,23]. In

some cases, the t-test is accompanied with the non-parametric Wilcoxons test applied over multiple data sets
[12, 40]. The use of these type of tests is correct when
we are interested in nding the dierences between two
methods, but they must not be used when we are interested in comparisons that include several methods. In
the case of repeating pairwise comparisons, there is an
associated error that grows agreeing with the number
of comparisons done, called the family-wise error rate
(FWER), dened as the probability of at least one error
in the family of hypotheses. For solving this problem,
some authors use the Bonferroni correction for applying
paired t-test in their works [39,7].
Our interest lies in presenting a methodology for analyzing the results oered by the algorithms in a certain
study of GBML, by using non-parametric tests in a multiple data set analysis. Furthermore, we want to remark
the possibility of comparison with other deterministic
ML algorithms. Non-parametric tests could be applied
to small sample of data and their eectiveness have been
proved in complex experiments. They are preferable to
an adjustment of data with transformations or to a discarding of certain extreme observations (outliers) [31].
This section is devoted to describing a non-parametric
statistical procedure for performing pairwise comparisons between two algorithms, which is the Wilcoxon signedrank test, Section 5.1; and to show the operation of this
test in the presented case study, Section 5.2.
5.1 Wilcoxon signed-ranks test

This is the analogous of the paired t-test in non-parametric statistical procedures; therefore, it is a pairwise test
that aims to detect signicant dierences between two
sample means, that is, the behavior of two algorithms.
Let di be the dierence between the performance scores
of the two classiers on i -th out of Nds data sets. The
dierences are ranked according to their absolute values;
average ranks are assigned in case of ties. Let R+ be
the sum of ranks for the data sets on which the rst
*
*
*
*
mon
(.00)
(.00)
(.00)
(.00)
new
(.36)
(.36)
* (.00)
* (.00)
pim
(.05)
* (.00)
* (.00)
* (.00)
*
*
*
*
veh
(.00)
(.00)
(.00)
(.00)
*
*
*
*
vow
(.00)
(.00)
(.00)
(.00)
*
*
*
*
win
(.00)
(.00)
(.00)
(.00)
*
*
*
*
wis
(.00)
(.00)
(.00)
(.00)
*
*
*
*
yea
(.00)
(.00)
(.00)
(.00)
algorithm outperformed the second, and R the sum of

ranks for the opposite. Ranks of di = 0 are split evenly
among the sums; if there is an odd number of them, one
is ignored:
R+ =
rank(di ) +
di >0
R =
rank(di ) +
di <0
1
2
1
2
rank(di )
di =0
rank(di )
di =0
Let T be the smaller of the sums, T = min(R+ , R ).

If T is less than or equal to the value of the distribution
of Wilcoxon for Nds degrees of freedom (Table B.12 in
[47]), the null hypothesis of equality of means is rejected.
Wilcoxon signed ranks test is more sensible than the
t-test. It assumes commensurability of dierences, but
only qualitatively: greater dierences still count more,
which is probably desired, but the absolute magnitudes
are ignored. From the statistical point of view, the test
is safer since it does not assume normal distributions.
Also, the outliers (exceptionally good/bad performances
on a few data sets) have less eect on the Wilcoxon than
on the t-test. The Wilcoxon test assumes continuous differences di , therefore they should not be rounded to, one
or two decimals, since this would decrease the power of
the test due to a high number of ties.
When the assumptions of the paired t-test are met,
Wilcoxon signed-ranks test is less powerful than the paired t-test. On the other hand, when the assumptions are
violated, the Wilcoxon test can be even more powerful
than the t-test. This allows us to apply it over the means
obtained by the algorithms in each data set, without any
assumptions about the sample of results obtained.
5.2 A case study in GBML: Performing pairwise
comparisons
In the following, we will perform the statistical analysis
by means of pairwise comparisons by using the results of
performance measures obtained by the algorithms described in Section 2.
In order to compare the results between two algorithms and to stipulate which one is the best, we can
perform Wilcoxon signed-rank test for detecting dierences in both means. This statement must be enclosed
by a probability of error, that is the complement of the
probability of reporting that two systems are the same,
called the p-value [47]. The computation of the p-value
in Wilcoxons distribution could be carried out by computing a normal approximation [37]. This test is well
known and it is usually included in standard statistics
packages (such as SPSS, R, SAS, etc.).
Tables 11 and 12 show the results obtained in all possible comparisons among the ve algorithms considered
in the study, in classication rate and kappa respectively. We stress in bold the winner algorithm in each row
when the p-value associated is below 0.05.
Table 11 Wilcoxons test applied over the all possible comparisons between the ve algorithms in classication rate
Comparison
Pitts-GIRLA - XCS
Pitts-GIRLA - GASSIST-ADI
Pitts-GIRLA - HIDER
Pitts-GIRLA - CN2
XCS - GASSIST-ADI
XCS - HIDER
XCS - CN2
GASSIST-ADI - HIDER
GASSIST-ADI - CN2
HIDER - CN2
Classication rate
R+
R
p-value
0.5 104.5
0.001
0
105
0.001
1
104
0.001
6
99
0.004
89
16
0.022
53
52
0.975
78
27
0.109
20
85
0.041
52
53
0.975
100
5
0.003
value lower than 0.05 is incorrect whereas we do not

prove the control of the FWER. The HIDER algorithm
really outperforms Pitts-GIRLA and GASSIST-ADI algorithms considering classication rate in independent
comparisons.
The true statistical signication for combining pairwise comparisons is given by expression 5:
p = P (Reject H0 |H0 true) =
= 1 P (Accept H0 |H0 true) =
= 1 P (Accept Ak = Ai , i = 1, . . . , k 1|H0 true) =
k1
= 1 i=1 P (Accept Ak = Ai |H0 true) =
k1
= 1 i=1 [1 P (Reject Ak = Ai |H0 true)] =
k1
= 1 i=1 (1 pHi )
(5)
Wilcoxons test suggests the following information:
Regarding classication rate, the best two algorithms
are XCS and HIDER. In the comparison between
them, XCS obtains the most favorable ranking, but
its dierence with respect to HIDER is rather small,
so they are statistically equal in classication rate.
Nevertheless, HIDER independently outperforms CN2,
whereas XCS does not do it.
Regarding kappa, the best algorithms are XCS, HIDER and GASSIST-ADI. The null hypothesis of equality of means is rejected when Pitts-GIRLA takes
part in a comparison. In their comparison, HIDER
obtains the best ranking and it outperfoms CN2 algorithm (XCS and GASSIST-ADI do not do it).
6 Non-Parametric tests for Multiple
Comparisons among more than two Algorithms
Table 12 Wilcoxons test applied over the all possible comparisons between the ve algorithms in kappa
Comparison
Pitts-GIRLA - XCS
Pitts-GIRLA - GASSIST-ADI
Pitts-GIRLA - HIDER
Pitts-GIRLA - CN2
XCS - GASSIST-ADI
XCS - HIDER
XCS - CN2
GASSIST-ADI - HIDER
GASSIST-ADI - CN2
HIDER - CN2
R+
0.5
0
0
10
74
51
78
28
60
96
Cohens kappa
R
p-value
104.5
0.001
105
0.001
105
0.001
95
0.008
31
0.177
54
0.925
27
0.109
77
0.124
45
0.638
9
0.006
The comparisons performed in this study are independent, so they never have to be considered in a whole.
If we try to extract from previous tables a conclusion
which involves more than one comparison, we are losing
control on the FWER. For instance, the statement: HIDER algorithm obtains a classication rate better than
Pitts-GIRLA and GASSIST-ADI algorithms with a p-
When a new proposal of GBML algorithm is developed,

it could be interesting to compare it with previous approaches. Making pairwise comparisons allows us to conduct this analysis, but the experiment wise error can not
be previously controlled. Moreover, a pairwise comparison is not inuenced by any external factor, whereas in
a multiple comparison, the set of algorithms chosen can
determine the results of the analysis.
Multiple comparison procedures are designed for allowing us to x the FWER before performing the analysis
and for taking into account all the inuences that can
exist within the set of results for each algorithm. Following the same structure as in the previous section, the
basic and advanced non-parametrical tests for multiple
comparisons are described in Section 6.1 and their application on the case study is conducted in Section 6.2.
6.1 Friedman test and post-hoc tests
In order to perform a multiple comparison, it is necessary to check whether all the results obtained by the
10
S. Garc et al.
a
algorithms present any inequality. In the case of nding

it, then we can know, by using a post-hoc test, what algorithms partners average results are dissimilar. In the
following, we describe the non-parametric tests used.
The rst one is the Friedman test [37], which is a nonparametric test equivalent to the repeated-measures
ANOVA. Under the null-hypothesis, it states that all
the algorithms are equivalent, so a rejection of this
hypothesis implies the existence of dierences among
the performance of all the algorithms studied. After
this, a post-hoc test could be used in order to nd
whether the control or proposed algorithm presents
statistical dierences with regards to the remaining
methods in the comparison. The simplest of them is
Bonferroni-Dunns test, but it is a very conservative
procedure and we can use more powerful tests that
control the FWER and reject more hypothesis than
Bonferroni-Dunns test; for example, Holms method
[26].
Friedmans test way of working is described as follows: It ranks the algorithms for each data set separately, the best performing algorithm getting the
rank of 1, the second best rank 2, and so on. In case
of ties average ranks are assigned.
j
Let ri be the rank of the j-th of k algorithms on the ith of Nds data sets. The Friedman test compares the
j
average ranks of algorithms, Rj = N1
i ri . Under
ds
the null-hypothesis, which states that all the algorithms are equivalent and so their ranks Rj should
be equal, the Friedman statistic:
k(k + 1)2
12Nds
2
jRj
k(k + 1)
4
2
is distributed according to F with k 1 degrees of
freedom, when Nds and k are big enough.
The second one of them is Iman and Davenport test
[28], which is a non-parametric test, derived from the
Friedman test, less conservative than the Friedman
statistic:
2
F =
(Nds 1)2
F
Nds (K 1) 2
F
which is distributed according to the F-distribution
with k 1 and (k 1)(Nds 1) degrees of freedom.
Statistical tables for critical values can be found at
[37,47].
Bonferroni-Dunns test: if the null hypothesis is rejected in any of the previous tests, we can continue
with Bonferroni-Dunns procedure. It is similar to
Dunnets test for ANOVA and it is used when we
want to compare a control algorithm opposite to the
remainder. The quality of two algorithms is signicantly dierent if the corresponding average of rankings is at least as great as its critical dierence (CD).
FF =
CD = q
k(k + 1)
.
6N
The value of q is the critical value for a multiple nonparametric comparison with a control (Table B.16 in
[47]).
Holms test [26]: it is a multiple comparison procedure that can work with a control algorithm and compares it with the remaining methods. The test statistics for comparing the i-th and j-th method using
this procedure is:
z = (Ri Rj )/
k(k + 1)
6Nds
The z value is used to nd the corresponding probability from the table of normal distribution, which
is then compared with an appropriate level of condence . In Bonferroni-Dunn comparison, this value is always /(k 1), but Holms test adjusts the
value for in order to compensate for multiple comparison and control the FWER.
Holms test is a step-up procedure that sequentially
tests the hypotheses ordered by their signicance. We
will denote the ordered p-values by p1 , p2 , ..., so that
p1 p2 ... pk1 . Holms test compares each
pi with /(k i), starting from the most signicant
p value. If p1 is below /(k 1), the corresponding
hypothesis is rejected and we allow to compare p2
with /(k 2). If the second hypothesis is rejected,
the test proceeds with the third, and so on. As soon
as a certain null hypothesis cannot be rejected, all
the remain hypotheses are retained as well.
Hochbergs procedure [27]: It is a step-up procedure
that works in the opposite direction to Holms method, comparing the largest p-value with , the next
largest with /2 and so forth until it encounters a hypothesis that it can reject. All hypotheses with smaller p-values are then rejected as well.
The post-hoc procedures described above allow us
to know whether or not a hypothesis of comparison of
means could be rejected at a specied level of signicance . However, it is very interesting to compute the
p-value associated to each comparison, which represents
the lowest level of signicance of a hypothesis that results in a rejection. In this manner, we can know whether
two algorithms are signicantly dierent and we can also
have a metric of how dierent they are.
Next, we will describe the method for computing
these exact p-values for each test procedure, which are
called adjusted p-values [45].
The adjusted p-value for Bonferroni-Dunns test (also
known as the Bonferroni correction) is calculated by
pBonf = (k 1)pi .
The adjusted p-value for Holms procedure is computed by pHolm = (k i)pi . Once computed all of
them for all hypotheses, it is not possible to nd an
adjusted p-value for the hypothesis i lower than for
the hypothesis j, j < i. In this case, the adjusted

Accuracy
50. 0=
01. 0=
9 7 6.4
6 82. 3
34 1. 3
1 7 0. 2
ALR IG- stt iP
2NC
IDA-TS ISSAG
RED IH
1 28. 1
0
5. 0
1 F
r
i
5. 1 e
d
2 m
a
5. 2 n
R
a
3 n
k
5. 3 n
i
g
4
5. 4
5
SC X
p-value for hypothesis i is set equal to the associated

to the hypothesis j.
The adjusted p-value for Hochbergs method is computed with the same formula as Holms, and the same
restriction is applied in the process, but in the opposite sense, that is, it is not possible to nd an adjusted p-value for the hypothesis i lower than for the
hypothesis j, j > i.
11
Fig. 3 Bonferroni-Dunns graphic for classication rate
6.2 A case study in GBML: Performing multiple

comparisons
50. 0=
01. 0=
9 7 6.4
7 53. 3
75 8. 2
3 4 1. 2
ALR IG- stt iP
2NC
IDA-TS ISSAG
RED IH
4 69. 1
0
5. 0
1 F
r
i
5. 1 e
d
2 m
a
5. 2 n
R
a
3 n
k
5. 3 n
i
g
4
5. 4
5
SC X
This section presents the study of applying multiple comparisons procedures to the results of the case study described above. We will use the results obtained in the evaluation of the performance measures considered and we
will dene the control algorithm as the best performing
algorithm (which obtains the lowest value of ranking,
computed through Friedmans test).
First of all, we have to test whether signicant dierences exist among all the means. Table 13 shows the result of applying Friedmans and Iman-Davenports tests.
The table shows the Friedman and Iman-Davenport values, 2 and FF respectively, and it relates them with
F
the corresponding critical values for each distribution
by using a level of signicance = 0.05. The p-value
obtained is also reported for each test. Given that the
statistics of Friedman and Iman-Davenport are clearly
greater than their associated critical values, there are
signicant dierences among the observed results with a
level of signicance 0.05. According to these results,
a post-hoc statistical analysis is needed in the two cases.
Then, we will employ Bonferroni-Dunns test to detect signicant dierences for the control algorithm in
each measure. It obtains the values CD = 1.493 and CD
= 1.34 for = 0.05 and = 0.10 respectively in the
two measures considered. Figures 3 and 4 summarize
the ranking obtained by the Friedman test and draw the
threshold of the critical dierence of Bonferroni-Dunns
procedure, with the two levels of signicance mentioned
above. They display a graphical representation composed by bars whose height is proportional to the average
ranking obtained for each algorithm in each measure
studied. If we choose the smallest of them (which corresponds to the best algorithm), and we sum its height
with the critical dierence obtained by Bonferronis (CD
value), representing its result by using a cut line that
goes through all the graphic, those bars above the line
belong to algorithms whose behaviour are signicantly
worse than the contributed by the control algorithm.
We will apply more powerful procedures, such as
Holms and Hochbergss, for comparing the control algorithm with the rest of algorithms. Table 14 shows all
the adjusted p-values for each comparison which involves the control algorithm. The p-value is indicated in
each comparison and we stress in bold the algorithms
Cohen's kappa
Fig. 4 Bonferroni-Dunns graphic for kappa
which are worse than the control, considering a level of

signicance = 0.05.
Note that the results oered by the two most powerful procedures, Holms and Hochbergs methods, are the
same in this case study. In practice, Hochbergs method
is more powerful than Holms one, but this dierence is
rather small [36]. In any case, the results here do not
exactly coincide with the results obtained with the use
of Wilcoxons test in Section 5.2:
In classication rate, the dierence between XCS and
HIDER is higher in Holm and Hochberg than in Wilcoxon. Anyway, no testing procedure is able to distinguish one of them as the best.
In Cohens kappa, according to Holms and Hochbergs procedures, the dierence between XCS and
HIDER is also higher than according to Wilcoxon.
After conducting the multiple comparison analysis,
we can see that:
By using the classication rate measure, XCS is signicantly better than Pitts-GIRLA and CN2, but it
behaves equally to GASSIST-ADI and HIDER.
Considering the kappa measure, only Pitts-GIRLA
obtains the worst results with respect to the remaining algorithms. The other four GBML algorithms
do not dier signicantly.
HIDER loses performance when we evaluate the results with kappa, whereas GASSIST-ADI achieves a
better kappa rate (Figures 3 and 4). The latter seems
to be more robust against randomness yielded by the
data.
In relation to the sample size (number of data sets
when performing Wilcoxons or Friedmans tests in multiple data set analysis), there are two main aspects to
12
S. Garc et al.
a
Table 13 Results of the Friedman and Iman-Davenport Tests ( = 0.05)
Classication rate
Cohens kappa
Friedman
Value
28.957
26.729
Value
in 2
9.487
9.487
p-value
< 0.0001
< 0.0001
Iman-Davenport
Value
13.920
11.871
Value
in FF
2.55
2.55
p-value
< 0.0001
< 0.0001
Table 14 Adjusted p-values for the comparison of the control algorithm in each measure with the remaining algorithms
(Holms and Hochbergs test)
i
n
5. 2
R
a
3 n
k
a
n
6 82. 1
r
i
5. 1e
d
2m
F
5. 0
1
0
SC X
ALR IG- stt iP
RED IH
IDA-TS ISSAG
By using the data contained in Section 3.3 at Table

5, we can conduct a statistical study of the complexity
of the rule sets obtained in a multiple data sets analysis.
In this study, only multiple comparison procedures will
be used.
Figure 5 shows a Bonferroni-Dunns graphic which
compares the complexity of the rule set and Table 15
displays the adjusted p-values for all the multiple comparison procedures considered in this study.
4 1 2.2
complexity = size AN T.
5. 2
The interpretability of the rule sets obtained will be evaluated by means of the two measures described in Section
3.2, size and ANT. We will aggregate these two measures
in one, which will represent the complexity of the rule
set. It measures the average complexity of the rule set
taking into account the number of rules and the average
number of antecedents per rule:
Interpretability
5. 4
7 Analyzing Interpretability of Models
pHoch
2.230 105
0.05931
0.27033
0.76509
5. 3g
4
be determined. Firstly, the minimum sample considered

acceptable for each test needs to be stipulated. There is
no established agreement about this specication. In our
case, the use of a sample as large as possible is preferable, because the power of the statistical tests (dened
as the probability that the test will reject a false null
hypothesis) will increase. Moreover, in a multiple data
set analysis, the increasing of the sample size depends on
the availability of new data sets. Secondly, we have to
study how the results are expected to vary if there was
a larger sample size available. In all statistical tests used
for comparing two or more samples, the increasing of the
sample size benets the power of the test. As a rule of
thumb, the number of data sets should be greater than
2 k, where k is the number of methods to be compared.
pHoch
6.980 106
0.04283
0.05405
0.67571
i
1
2
3
4
Classication rate (XCS is the control)

unadjusted p
pBonf
pHolm
1.745 106
6.980 106
6.980 106
0.01428
0.05711
0.04283
0.02702
0.10810
0.05405
0.67571
1.00000
0.67571
Cohens kappa (XCS is the control)
algorithm
unadjusted p
pBonf
pHolm
Pitts-GIRLA
5.576 106
2.230 105
2.230 105
CN2
0.01977
0.07908
0.05931
GASSIST-ADI
0.13517
0.54067
0.27033
HIDER
0.76509
1.00000
0.76509
algorithm
Pitts-GIRLA
CN2
GASSIST-ADI
HIDER
50. 0=
01. 0=
i
1
2
3
4
Fig. 5 Bonferroni-Dunns graphic measuring interpretability
As we can see, the two most powerful statistical procedures (Holms and Hochbergs one) are able to distinguish the GASSIST-ADI algorithm as the one whose
rule sets are the most interpretable with a p = 0.05704
(a level of signicance = 0.10 is required).
However, we have to be cautious with respect to the
concept of interpretability. GBML algorithms can produce dierent types of rules or dierent ways for reading
or interpreting the rules. For example, the four algorithms used in this paper produce rule sets with dierent properties. In Table 16 we show an example of rule
for each algorithm, considering the iris data set in the
examples):
Pitts-GIRLA yields a set of conjunctive rules, with
possibility of dont care values, allowing that the
number of antecedents may change in dierent rules.
The classication of a new example implies searching
those rules whose antecedent is compatible with it
and to determine the class agreeing with the maximal
number of rules of the same consequent. If no rules
have been found, the example is not classied.
XCS also uses conjunctive rules, with a generality
index in each attribute. If the generality index covers
the complete domain of a certain attribute, then it
obtains a dont care value. In order to classify a
new example, the rules that match with it are chosen
13
Table 15 Adjusted p-values for the comparison of complexity of rules (Holms and Hochbergs test)
i
1
2
3
Interpretability (GASSIST-ADI is the control)

algorithm
unadjusted p
pBonf
pHolm
XCS
2.657 108
7.972 108
7.972 108
Pitts-GIRLA
0.01283
0.03848
0.02565
HIDER
0.05704
0.17112
0.05704
pHoch
7.972 108
0.02565
0.05704
Table 16 Examples of rules in iris data set

Algorithm
Pitts-GIRLA
XCS
GASSIST-ADI
HIDER
Example of Rule
IF sepalLength = Dont Care AND sepalWidth= Dont Care AND
petalLength = [4.947674287707237,5.965516026050438] AND petalWidth= Dont Care THEN Class = Iris-virginica.
(normalized) IF sepalLength = [0.0,1.0] AND sepalWidth = [0.0, 1.0] AND
petalLength = [0.3641094725703955, 1.0] AND petalWidth = [0.0, 1.0] THEN Class = Iris-setosa
IF petalLength = [1.0,5.071428571428571] AND petalWidth = [0.5363636363636364,1.6272727272727274]
THEN Class = Iris-versicolor
IF sepalLength = (..., 6.65] AND petalLength = (..., 6.7] AND petalWidth = (..., 0.75] THEN Class = iris-setosa
and each one of them votes according to their tness

and consequent.
GASSIST-ADI uses CNF type rules, where disjunctions can coexist with conjunctions. The matching
process is done by means of decision lists, in which
the rules are evaluated from the top of the list to the
bottom, until the antecedent matches the example
to be classied. There always is a default rule, so no
examples remain unclassied.
HIDER yields hierarchical rules similar to decision
lists in the matching process. Some rules may be included within parent rules and the rules are only formed by conjunctions. The rules allow to dene open
extremes of real intervals and the rule set usually
tends to cover all the space of solutions.
Given the dierences among the four algorithms, taking into consideration the characteristics of the rules
and the matching techniques, the comparison of interpretability measures must be cautiously taken. Although
the results indicate that GASSIST-ADI may produce
the most interpretable rule sets, its type of rule could
be considered less understood than the ones yielded by
Pitts-GIRLA algorithm. Moreover, it uses decision lists,
so a certain rule (except the rst in the list) depends of
previous rules. On the other hand, a concept could not
be learned because it is being considered in the default
rule. With regard to HIDER, although both use the same
matching technique, the latter can use open intervals in
the rules and has not default rule.
The choice of the most interpretable type of rule or
rule set is a relative task because it may depend on the
usefulness and purpose of the model. In this paper, this
question is out of the scope, but we want to point out
that a statistical analysis of the interpretability of rule
sets could be valid when the circumstances permit so.
8 Conclusions
In this paper we have studied the use of statistical techniques in the analysis of the behaviour of GBML algorithms in classication problems, analyzing the use of
parametric and non-parametric statistical tests.
We have raised the necessity of applying non-parametric

tests in the use of GBML algorithms in classication, due
to the fact that the initial conditions that guarantee the
reliability of the parametric tests are not satised in a
single data set analysis.
Non-parametric tests can be used in multiple data set
analysis and allow the comparison between GBML methods and deterministic algorithms. We have shown how
to use Friedman, Iman-Davenport, Bonferroni-Dunn,
Holm, Hochberg, and Wilcoxons tests; which on the
whole, are a good tool for the analysis of algorithms
performance. We have employed these procedures to carry out a comparison in a case study composed by an
experimentation that involves several data sets and 4
well-known GBML algorithms.
We have checked that dierent statistical results are
obtained when we consider dierent accuracy measures,
such as classication rate and Cohens kappa. In interpretability analysis, the results cannot predict what is
the algorithm which yields the easiest models, due to
the fact that the rule sets are dierent in structure and
there are many ways of representing knowledge.
As main conclusion on the use of non-parametric statistical methods for analyzing results, we have emphasized the use of the most appropriate test depending on
the circumstances and type of comparison. Specically,
we have recommended the use of Holms and Hochbergs
procedures since they are the most powerful statistical
techniques for multiple comparisons.
Acknowledgments
The authors are very grateful to the anonymous reviewers for their valuable suggestions and comments to improve the quality of this paper. We also are very grateful to Prof. Bacardit, Prof. Bernad-Mansilla and Prof.
o
Aguilar-Ruiz for providing the KEEL software with the
GASSIST-ADI, XCS and HIDER algorithms, respectively.
14
References
1. J.S. Aguilar-Ruiz, R. Girldez and J.C. Riquelme, Naa
tural encoding for evolutionary supervised learning, IEEE
Transactions on Evolutionary Computation 11:4, (2007)
466479.
2. J. Alcal-Fdez, L. Snchez, S. Garc M.J. del Jesus, S.
a
a
a,
Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit,
V.M. Rivas, J.C. Fernndez, F. Herrera, KEEL: A Software
a
Tool to Assess Evolutionary Algorithms to Data Mining
Problems, Soft Computing (2008) In press.
3. E. Alpaydin, Introduction to Machine Learning (MIT
Press, Cambridge, MA 2004) 452.
4. C. Anglano and M. Botta, NOW G-Net: learning classication programs on networks of workstations, IEEE Transactions on Evolutionary Computation 6:13, (2002) 463
480.
5. A.
Asuncion
and
D.
J.
Newman,
UCI
Machine
Learning
Repository.
http://www.ics.uci.edu/mlearn/MLRepository.html.
Irvine, CA: University of California, School of Information
and Computer Science (2007).
6. J. Bacardit and J.M. Garrell, Evolving multiple discretizations with adaptive intervals for a pittsburgh rule-based
learning classier system, In: Proc. of Genetic and Evolutionary Computation Conference (GECCO03), LNCS
2724, (2003) 18181831.
7. J. Bacardit, Pittsburgh genetic-based machine learning in
the data mining era: representations, generalization and
run-time, Dept. Comput. Sci., University Ramon Llull,
Barcelona, Spain, 2004.
8. J. Bacardit and J.M. Garrell, Analysis and improvements
of the adaptive discretization intervals knowledge representation, In: Proc. of Genetic and Evolutionary Computation
Conference (GECCO04), LNCS 3103, (2004) 726738.
9. J. Bacardit and J.M. Garrell, Bloat control and generalization pressure using the minimum description length principle for Pittsburgh approach learning classier system, In:
T. Kovacs, X. Llor and K. Takadama (Eds.), Advances
a
at the frontier of Learning Classier Systems, LNCS 4399,
(2007) 6180.
10. R. Barandela, J.S. Snchez, V. Garc and E. Rangel.
a
a
Strategies for learning in class imbalance problems, Pattern
Recognition 36:3, (2003) 849851.
11. A. Ben-David, A lot of randomness is hiding in accuracy, Engineering Applications of Articial Intelligence 20,
(2007) 875885.
o
12. E. Bernad-Mansilla and J.M. Garrell, Accuracy-based
learning classier systems: models, analysis and applications to classication tasks, Evolutionary Computation
11:3, (2003) 209238.
o
13. E. Bernad-Mansilla and T.K. Ho. Domain of Competence of XCS Classier System in Complexity Measurement
Space, IEEE Transactions on Evolutionary Computation
9:1, (2005) 82104.
14. P. Clark and T. Niblett, The CN2 induction algorithm,
Machine Learning, 3:4, (1989) 261-283.
15. J.A. Cohen, Coecient of agreement for nominal scales,
Educational and Psychological Measurement, (1960) 3746.
16. A.L. Corcoran and S. Sen, Using real-valued genetic algorithms to evolve rule sets for classication, In: Proc. of
the IEEE Conference on Evolutionary Computation, (1994)
120124.
S. Garc et al.
a
17. K.A. De Jong, W.M. Spears and D.F. Gordon, Using
genetic algorithms for concept learning, Machine Learning
13, (1993) 161188.
s
18. J. Demar, Statistical comparisons of classiers over multiple data sets, Journal of Machine Learning Research 7,
(2006) 130.
19. C. Drummond and R.C. Holte, Cost curves: an improved method for visualizing classier performance, Machine
Learning 65:1, (2006) 95130.
20. A.E. Eiben and J.E. Smith, Introduction to Evolutionary
Computing (Springer-Verlag, Berlin 2003) 299.
21. A.A. Freitas, Data Mining and Knowledge Discovery with
Evolutionary Algorithms (Springer-Verlag, Berlin 2002)
264.
22. J.J. Grefenstette, Genetic Algorithms for Machine Learning (Kluwer Academic Publishers, Norwell 1993) 176.
23. S.U. Guan and F. Zhu, An incremental approach to
genetic-algorithms-based classication, IEEE Transactions
on Systems, Man, and Cybernetics, Part B 35:2, (2005)
227239.
24. J, Huang and C. X. Ling. Using AUC and accuracy in
evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering 17:3, (2005) 299310.
25. J. Hekanaho, An evolutionary approach to concept learning, Ph.D. dissertation, Dept. Comput. Sci., Abo akademi
University, Abo, Finland, 1998.
26. S. Holm, A simple sequentially rejective multiple test
procedure, Scandinavian Journal of Statistics 6, (1979) 65
70.
27. Y. Hochberg, A sharper bonferroni procedure for multiple tests of signicance, Biometrika 75, (1988) 800803.
28. R.L. Iman and J.M. Davenport, Approximations of the
critical region of the Friedman statistic, Communications
in Statistics 18, (1980) 571595.
29. M. Sokolova, N. Japkowicz and S. Szpakowicz. Beyond
accuracy, F-score and ROC: A family of discriminant measures for performance evaluation, In: Australian Conference
on Articial Intelligence, LNCS 4304, (2006) 10151021.
30. L. Jiao, J. Liu and W. Zhong, An organizational coevolutionary algorithm for classication, IEEE Transactions on
Evolutionary Computation 10:1, (2006) 6780.
31. G.G. Koch, The use of non-parametric methods in the
statistical analysis of a complex split plot experiment, Biometrics 26:1, (1970) 105128.
32. T.C.W. Landgrebe and R. P.W. Duin, Ecient Multiclass ROC Approximation by Decomposition via Confusion
Matrix Perturbation Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 30:5, (2008) 810
822.
33. T.-S Lim, W.-Y Loh and Y.-S Shih, A comparison of prediction accuracy, complexity, and training time of thirtythree old and new classication algorithms, Machine Learning 40:3, (2000) 203228.
34. M. Markatou, H. Tian, S. Biswas and G. Hripcsak,
Analysis of variance of cross-validation estimators of the
generalization error, Journal of Machine Learning Research
6, (2005) 11271168.
35. R.L. Rivest, Learning decision lists, Machine Learning 2,
(1987) 229246.
36. J.P. Shaer, Multiple hypothesis testing, Annual Review
of Psychology 46 (1995) 561584.
37. D.J. Sheskin, Handbook of parametric and nonparametric
statistical procedures (Chapman & Hall/CRC 2006) 1736.

38. O. Sigaud and S.W. Wilson, Learning classier systems:
a survey, Soft Computing 11, (2007) 10651078.
39. K.C. Tan, Q. Yu and J.H. Ang, A coevolutionary algorithm for rules discovery in data mining, International
Journal of Systems Science 37:12, (2006) 835864.
40. A.F. Tulai and F. Oppacher, Multiple species weighted voting - a genetics-based machine learning system, In:
Proc. of Genetic and Evolutionary Computation Conference (GECCO03), LNCS 3103, (2004) 12631274.
41. G. Venturini, SIA: a supervised inductive algorithm with
genetic search for learning attributes based concepts, In:
Proc. of the Machine Learning ECML93, LNAI 667,
(1993) 280296.
42. S.W. Wilson, ZCS: a zeroth order classier system, Evolutionary Computation 2, (1994) 118.
43. S.W. Wilson, Classier tness based on accuracy, Evolutionary Computation 3:2, (1995) 149175.
44. I.H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques. Second Edition (Morgan Kaufmann, San Francisco 2005) 525.
45. S.P. Wright, Adjusted p-values for simultaneous inference, Biometrics 48, (1992) 10051013.
46. W. Youden, Index for rating diagnostic tests, Cancer 3,
(1950) 32-35.
47. J.H. Zar, Biostatistical analysis (Prentice Hall 1999) 929.
A Genetic Algorithms in Classication

Here we will give a wider description of all the methods
employed in our work, regarding their main components,
structure and operation of each one of them.
For more details about the methods explained here,
please refer to the corresponding references.
Pittsburgh Genetic Interval Rule Learning Algorithm.
The Pitts-GIRLA Algorithm [16] is a GBML method
which makes use of the Pittsburgh approach in order
to perform a classication task. Two real variables indicate the minimum and maximum value of the attribute,
where a dont care condition may occur if the maximum value is lower than the minimum value.
This algorithm employs three dierent operators: modied simple (one point) crossover, creep mutation and
simple random mutation.
XCS Algorithm XCS [43] is a Learning Classier System (LCS) that evolves online a set of rules that describe
the feature space accurately. In the following we will present in detail the dierent components of this algorithm:
1. Interaction with the environment: In keeping
with the typical LCS model, the environment provides as input to the system a series of sensory situations (t) {0, 1}L , where L is the number of bits
in each situation. In response the system executes
actions (t) {a1 , . . . , an } upon the environment.
Each action results in a scalar reward (t).
15
2. A Classier in XCS: XCS keeps a population of

classiers which represent its knowledge about the
problem. Each classier is a condition-action-prediction rule having the following parts: the condition
C {0, 1, #}L , the action A {a1 , . . . , an } and
the prediction p. Furthermore, each classier keeps
certain additional parameters such as the prediction
error , the tness F , the experience exp, the time
stamp ts, the action set size as and the numerosity.
3. The dierent sets: There are four dierent sets
that need to be considered in XCS: the population
[P ], the match set [M ], the action set [A] and the
previous action set [A1 ].
The result of this algorithm is that the knowledge is
represented by a set of rules or classiers with a certain
tness. When classifying unseen examples, each rule that
matches the input votes according its prediction and tness. The most voted class is chosen to be the output.
GASSIST Algorithm GASSIST (Genetic Algorithms
based claSSIer sySTem) [9] is a Pittsburgh style classier system based on GABIL [17] from where it has
taken the semantically correct crossover operator. The
main features of this classier system are presented as
follows:
1. General operators and policies
Matching strategy The matching process follows
a if ... then ... else if ... then... structure, usually
called Decision Lists [35].
Mutation operators When an individual is selected for mutation a random gene is chosen inside
its chromosome to be mutated.
2. Control of the individuals length: This control
is achieved using two dierent operators:
Rule deletion: This operator deletes the rules of
the individuals that do not match any training
example.
Selection bias using the individual size: Tournament selection is used, where the criterion of the
tournament is given by an operator called hierarchical selection, dened as follows:
If |accuracya accuracyb | < threshold then:
If lengtha < lengthb then a is better than
b.
If lengtha > lengthb then b is better than
a.
If lengtha = lengthb then we will use the
general case.
Otherwise, we use the general case: we select
the individual with higher tness.
3. Knowledge Representations
Rule Representations for symbolic or discrete attributes: It uses the GABIL [17] representation
for this kind of attributes.
Rule Representations for real-valued attributes For
GASSIST-ADI, the representation is based on the
16
S. Garc et al.
a
Adaptive Discretization Intervals rule representation [6, 7].

HIDER Algorithm HIerarchical DEcision Rules
(HIDER) [1], produces a hierarchical set of rules, which
may be viewed as a Decision List. In order to extract
the rule-list a real-coded GA is employed in the search
process. The elements of this procedure are described
below.
1. Coding:
Each rule is represented by an individual (chromosome), where two genes dene the lower and upper
bounds of the rule attribute.
2. Algorithm: The algorithm is a typical sequential covering GA. It chooses the best individual of the evolutionary process, transforming it into a rule which
is used to eliminate data from the training le [41].
Initially, the set of rules R is empty, but in each iteration a rule is included in R. In each iteration, the
training le is reduced, eliminating those examples
that have been covered by the description of the rule
r, independently of its class.
The GA main operators are dened in the following:
(a) Initialization: First, an example is randomly selected from the training le for each individual of
the population. Afterwards, an interval to which
the example belongs is obtained.
(b) Crossover : The crossover works as follows:
j
k
let [li , uj ] and [li , uk ] be the intervals of two pai
i
rents, j and k, for the same attribute i. From
these parents one child is generated by selecting
j k
values that satisfy the expression: l [min(li , li ),
j k
j
j
max(li , li )] and u [min(ui , uk ), max(ui , uk )].
i
i
(c) Mutation: a small value is subtracted or added,
depending on whether it is the lower or the upper
boundary, respectively.
(d) Fitness Function: The tness function f considers
a two-objective optimization, trying to maximize
the number of correctly classied examples and
to minimize the number of errors.
Bibliograf
a
[AGR06]
Abraham A., Grosan C., y Ramos V. (2006) Swarm Intelligence in Data Mining (Studies in Computational Intelligence). Springer-Verlag New York, Inc., Secaucus, NJ,
USA.
[Aha97]
Aha D. W. (Ed.) (1997) Lazy learning. Kluwer Academic Publishers, Norwell, MA,
USA.
[AK03]
Alba E. y Khuri S. (2003) Sequential and distributed evolutionary algorithms for combinatorial optimization problemspginas 211233.
a
[AKA91]
Aha D. W., Kibler D., y Albert M. K. (1991) Instance-based learning algorithms.

Machine Learning 6: 3766.
[BMH05]
Bernad-Mansilla E. y Ho T. K. (2005) Domain of competence of xcs classier system

o
in complexity measurement space. IEEE Transactions on Evolutionary Computation
9(1): 82104.
[BPM04]
Batista G. E. A. P. A., Prati R. C., y Monard M. C. (2004) A study of the behavior of

several methods for balancing machine learning training data. SIGKDD Explorations
Newsletter 6(1): 2029.
[CBHK02] Chawla N. V., Bowyer K. W., Hall L. O., y Kegelmeyer W. P. (2002) Smote: Synthetic
minority over-sampling technique. Journal of Articial Intelligence Research 16: 321
357.
[CCHJ08] Chawla N. V., Cieslak D. A., Hall L. O., y Joshi A. (2008) Automatically countering
imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery
17(2): 225252.
[CH67]
Cover T. M. y Hart P. E. (Enero 1967) Nearest neighbor pattern classication. IEEE

Transactions in Information Theory 13(1): 2127.
[CHL03]
Cano J. R., Herrera F., y Lozano M. (2003) Using evolutionary algorithms as instance
selection for data reduction in kdd: an experimental study. IEEE Transactions on
Evolutionary Computation 7(6): 561575.
[CHL05]
Cano J. R., Herrera F., y Lozano M. (2005) Stratication for scaling up evolutionary
prototype selection. Pattern Recogn. Lett. 26(7): 953963.
[CHL07]
Cano J. R., Herrera F., y Lozano M. (2007) Evolutionary stratied training set selection for extracting classication rules with trade o precision-interpretability. Data
Knowledge Engineering 60(1): 90108.
193
[CJK04]
Chawla N. V., Japkowicz N., y Kolcz A. (2004) Special issue on learning from imbalanced datasets. SIGKDD Explorations Newsletter 6(1).
[CLV06]
Coello C. A. C., Lamont G. B., y Veldhuizen D. A. V. (2006) Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation).
Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Con98]
Conover W. J. (1998) Practical Nonparametric Statistics. John Wiley & Sons.
[Dem06]
Demar J. (2006) Statistical comparisons of classiers over multiple data sets. J. Mach.
s
Learn. Res. 7: 130.
[EJJ04]
Estabrooks A., Jo T., y Japkowicz N. (2004) A multiple resampling method for learning
from imbalanced data sets. Computational Intelligence 20(1): 1836.
[ES03]
Eiben A. E. y Smith J. E. (2003) Introduction to Evolutionary Computing. SpringerVerlag.
[FHOR05] Ferri C., Hernndez-Orallo J., y Ram

a
rez M. J. (2005) Introduccin a la Miner de
o
a
Datos. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[FI04]
Fernndez F. y Isasi P. (2004) Evolutionary design of nearest prototype classiers.

a
Journal of Heuristics 10(4): 431454.
[Fre02]
Freitas A. A. (2002) Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[FW98]
Frank E. y Witten I. H. (1998) Generating accurate rule sets without global optimization. En ICML 98: Proceedings of the Fifteenth International Conference on Machine
Learning, pginas 144151. Morgan Kaufmann Publishers Inc., San Francisco, CA,
a
USA.
[GDG08]
Ghosh A., Dehuri S., y Ghosh S. (2008) Multi-objective Evolutionary Algorithms for
Knowledge Discovery from Data Bases. Springer-Verlag New York, Inc.
[Goo05]
Good P. I. (2005) Resampling Methods: A Practical Guide to Data Analysis. Birkhauser.
[GP07]
Gagn C. y Parizeau M. (2007) Coevolution of nearest neighbor classiers. International

e
Journal of Pattern Recognition and Articial Intelligence 21(5): 921946.
[Han05]
Han J. (2005) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA.
[Har68]
Hart P. E. (1968) The condensed nearest neighbour rule. IEEE Transactions in Information Theory 14: 515516.
[HB02]
Ho T. K. y Basu M. (2002) Complexity measures of supervised classication problems.

IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3): 289300.
[HB06]
Ho T. K. y Basu M. (2006) Data Complexity. Springer-Verlag New York, Inc.
[JG05]
Jain L. C. y Ghosh A. (2005) Evolutionary Computation in Data Mining (Studies in

Fuzziness and Soft Computing). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[KCH+ 03] Kim W., Choi B.-J., Hong E.-K., Kim S.-K., y Lee D. (2003) A taxonomy of dirty data.
Data Mining and Knowledge Discovery 7(1): 8199.
[Kon05]
Konar A. (2005) Computational Intelligence: Principles, Techniques and Applications.

Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[KS05]
Krasnogor N. y Smith J. (2005) A tutorial for competent memetic algorithms: model,

taxonomy, and design issues. IEEE Transactions on Evolutionary Computation 9(5):
474488.
[Lee04]
Lee J.-S. (2004) Hybrid genetic algorithms for feature selection. IEEE Transactions
on Pattern Analysis and Machine Intelligence 26(11): 14241437. Member-Il-Seok Oh
and Member-Byung-Ro Moon.
[LHTD02] Liu H., Hussain F., Tan C. L., y Dash M. (2002) Discretization: An enabling technique.
[LM98]
Liu H. y Motoda H. (1998) Feature Selection for Knowledge Discovery and Data Mining.
Kluwer Academic Publishers, Norwell, MA, USA.
[LM01]
Liu H. y Motoda H. (2001) Instance Selection and Construction for Data Mining.
Kluwer Academic Publishers, Norwell, MA, USA.
[LM02]
Liu H. y Motoda H. (2002) On issues of instance selection. Data Mining and Knowledge
Discovery 6(2): 115130.
[Pap04]
Papadopoulos A.N. (2004) Nearest Neighbor Search: A Database Perspective. SpringerVerlag TELOS, Santa Clara, CA, USA.
[PK99]
Provost F. y Kolluri V. (1999) A survey of methods for scaling up inductive algorithms.

[Pyl99]
Pyle D. (1999) Data preparation for data mining. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.
[Qui93]
Quinlan J. R. (1993) C4.5: Programs for Machine Learning (Morgan Kaufmann Series
in Machine Learning). Morgan Kaufmann.
[She06]
Sheskin D. J. (2006) Handbook of Parametric and Nonparametric Statistical Procedures.

CRC Press.
[SMS07]
Snchez J. S., Mollineda R. A., y Sotoca J. M. (2007) An analysis of how training data
a
complexity aects the nearest neighbor classiers. Pattern Analysis and Applications
10(3): 189201.
[TSK05]
Tan P.-N., Steinbach M., y Kumar V. (2005) Introduction to Data Mining, (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[WF05]
Witten I. H. y Frank E. (2005) Data Mining: Practical Machine Learning Tools and
Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[Wil72]
Wilson D. L. (1972) Asymptotic properties of nearest neighbor rules using edited data.
IEEE Transactions on Systems Man and Cybernetics 2(3): 408421.
[WKQ+ 07] Wu X., Kumar V., Quinlan J. R., Ghosh J., Yang Q., Motoda H., McLachlan G. J.,
Ng A., Liu B., Yu P. S., Zhou Z.-H., Steinbach M., Hand D. J., y Steinberg D. (2007)
Top 10 algorithms in data mining. Knowledge and Information Systems 14(1): 137.
[WM97]
Wolpert D. y Macready W. G. (1997) No free lunch theorems for optimization. IEEE

Transactions on Evolutionary Computation 1(1): 6782.
[WM00]
Wilson D. R. y Martinez T. R. (2000) Reduction techniques for instance-based learning

algorithms. Machine Learning 38(3): 257286.
[WP03]
Weiss G. M. y Provost F. J. (2003) Learning when training data are costly: The eect
of class distribution on tree induction. Journal of Articial Intelligence Research 19:
315354.
[YW06]
Yang Q. y Wu X. (2006) 10 challenging problems in data mining research. International

Journal of Information Technology and Decision Making 5(4): 597604.
[ZZY03]
Zhang S., Zhang C., y Yang Q. (2003) Data preparation for data mining. Applied
Articial Intelligence 17(5-6): 375381.

Nuevos Retos en La Selecci On Evolutiva de Instancias: Escalabilidad, Aprendizaje Con Clases No Balanceadas y Caracterizaci On de La E Cacia

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Nuevos Retos en La Selecci On Evolutiva de Instancias: Escalabilidad, Aprendizaje Con Clases No Balanceadas y Caracterizaci On de La E Cacia

Uploaded by

Copyright:

Available Formats

Universidad de Granada

Departamento de Ciencias de la Computacin

Nuevos Retos en la Seleccin

Granada, Octubre de 2008

Nuevos Retos en la Seleccin

PARA OPTAR AL GRADO DE DOCTOR EN INFORMATICA

Departamento de Ciencias de la Computacin

La memoria titulada Nuevos Retos en la Seleccin Evolutiva de Instancias: Escalabilidad,

Granada, Octubre de 2008

Fdo: Salvador Garc Lpez

Fdo: Francisco Herrera Triguero

Tesis Doctoral Parcialmente Subvencionada por el Ministerio de Educacin y Ciencia bajo el

1.1.1. Clasicacin, Aprendizaje basado en Instancias y Seleccin de Prototipos . .

1.1.2. Algoritmos Evolutivos en Miner de Datos. Seleccin Evolutiva de Prototipos.

1.1.3. Problemas de Clasicacin con Clases no Balanceadas . . . . . . . . . . . . .

1.1.4. Medidas de Complejidad de Datos en Clasicacin . . . . . . . . . . . . . . .

1.1.5. Anlisis y Comparacin de Algoritmos: Tests Estad

1.7. Perspectivas Futuras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1. Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Enfoque

la posibilidad de extraer informacin util para la toma de decisiones o la exploracin y comprensin

Figura 1.1: Proceso de KDD

de tales problemas pueden ser la diagnosis mdica, problemas de marketing, reconocimiento de

datos inconsistentes [KCH+ 03].

Clasicacin, Aprendizaje basado en Instancias y Seleccin de Prototipos

En el contexto de la MDD, entendemos por clasicacin el proceso en el que, sabiendo la

Interpretabilidad: Claridad y credibilidad, desde el punto de vista humano, de la regla de

Algoritmos Evolutivos en Miner de Datos. Seleccin Evolutiva de Proa

Data Set (D)

Training Set (T)

Test Set (TS)

Prototype Subset Selected (S)

Figura 1.2: Seleccin Evolutiva de Prototipos

Problemas de Clasicacin con Clases no Balanceadas

Medidas de Complejidad de Datos en Clasicacin

Medidas de Separabilidad: para medir el grado de separacin existente entre clases.

Anlisis y Comparacin de Algoritmos: Tests Estad

Como se acaba de mencionar en la seccin anterior, la presente memoria se organiza en torno

Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un Enfoque para la Escae

Discusin Conjunta de los Resultados

Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un

1.5. Discusin Conjunta de los Resultados

Se incluye el estudio experimental desarrollado, donde se especica la metodolog seguida, sus

Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos

rapidez en el proceso de optimizacin, y es extrapolable al problema de la SEPP. El estudio de las

Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balano

mediante el vecino ms cercano y seleccin de conjuntos de entrenamiento para obtener modelos

Dise o de Experimentos en Inteligencia Computacional: Sobre el Uso de

En esta parte (Seccin 2.4), se describe y justica la eleccin de la metodolog de anlisis y

1.6. Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones

del test estad

Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones

correspondientes a distintas situaciones propias de la IC o MDD. Estos nos indican que es ms

Un Algoritmo Memtico para la Seleccin Evolutiva de Prototipos: Un

Diagnstico de la Efectividad de la Seleccin Evolutiva de Prototipos

Aunque se ha mostrado que la efectividad de los AAEE en problemas de SPP es excelente,

1.6. Comentarios Finales: Breve Resumen de los Resultados Obtenidos y Conclusiones

Bajo-Muestreo Evolutivo para la Clasicacin en Problemas no Balano

El bajo-muestreo evolutivo para problemas de clasicacin con clases no balanceadas lo hemos

Dise o de Experimentos en Inteligencia Computacional: Sobre el Uso de

El anlisis de resultados en entornos que consideran casos de diferente

A continuacin, se presentan algunas l

1.7. Perspectivas Futuras

Uso de Algoritmos Genticos Distribuidos para Seleccin de Instancias