Doctoral Thesis

Salvador García López

Universidad de Granada
October 2008

DIRECTOR
Francisco Herrera Triguero

The Doctoral Candidate                The Director
Acknowledgements

This thesis is dedicated to all those people without whom it would not have been possible.

First of all, to my parents, María and Salvador, because everything I have achieved has been thanks to them, to their support, affection and understanding through the difficulties I have faced since I began this journey. I am very proud of them for giving me the opportunities that they never had in their own lives. This thesis is by and for you.

I also deeply thank my brothers, Alfonso, Manolo and Miguel Ángel, for their encouragement and advice; they have watched me grow and learn and have always been there to help me face any obstacle. I add to this acknowledgement a good friend who has always been there and with whom I have shared a great part of the time spent reaching this goal. Diego, you have always been like a brother to me, and I thank you for all the time you have devoted to giving me good advice, at whatever hour.

If within my family I have been very lucky with the encouragement and support received, it has been no less so with regard to my thesis director. Francisco Herrera has guided me in every aspect of professional life, and I express my most sincere gratitude for his great dedication and interest in me, as well as for the valuable advice he has given me, gives me, and will keep giving me.

I cannot forget all those people who have been at my side since the beginning of this journey and who have put up with me along the whole way, despite my bad-tempered days. Many thanks to the Alcalá brothers, Rafa and Jesús; to Antonio Gabriel; to José Ramón, for showing me the other side of the coin in moments of great stress; to Alberto and Julián, with whom I have formed Paco's unstoppable trio of scholarship holders; to Alicia, for her endless good humour; and to my party friend, now also a friend in the professional world, Manolo Cobo. Thanks also to Óscar Cordón, Enrique Herrera and Manuel Lozano for helping me start out on this journey. I would also like to acknowledge the good times spent with the rest of my colleagues, both inside and outside work, forgetting none of them: Sergio, Jorge, Javi, Coral, Rocío, Cristina, Carlos Porcel, Carlos García, Carlos Mantas, Dani, José Santamaría, Ana, Perico, Igor, Nacho and Óscar Harari.

I also want to express my gratitude to the colleagues with whom I have shared meetings, seminars and conferences, and who have also helped me get to where I am. From Jaén, María José del Jesus, Pedro González, Chequin and Paco Berlanga; from Córdoba, Sebastián Ventura, Pedro Antonio, Amelia and Juan Carlos; from Barcelona, Ester and Albert; and from Gijón, Luciano and Pepe.

I do not want to leave out my friends, for what we have lived through and what is still ahead of us: Jorge, Juanda, Tomy, Edu, Ra, María Luisa and the rest of the Linares group.

My gratitude to all those people who, although not named here, have been no less important in completing this thesis. I dedicate to you the effort of this, our work.

THANK YOU ALL
Contents

1. Thesis Report
1.1. Introduction
1.2. Justification
1.3. Objectives
1.4. Summary
1.5. Joint Discussion of Results
1.5.1. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach
1.5.2. Diagnosing the Effectiveness of Evolutionary Prototype Selection Using an Overlap Measure
1.5.3. Evolutionary Under-Sampling for Classification in Imbalanced Problems: Proposals for Instance-Based Learning and Training Set Selection
1.5.4. Design of Experiments in Computational Intelligence: On the Use of Statistical Inference
1.6. Final Comments: Brief Summary of the Results Obtained and Conclusions
1.6.1. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach
1.6.2. Diagnosing the Effectiveness of Evolutionary Prototype Selection Using an Overlap Measure
1.6.3. Evolutionary Under-Sampling for Classification in Imbalanced Problems: Proposals for Instance-Based Learning and Training Set Selection
1.6.4. Design of Experiments in Computational Intelligence: On the Use of Statistical Inference
1.7. Future Work
Chapter 1

Thesis Report

1.1. Introduction
The digital revolution has made it possible to capture data easily and to store it at practically zero cost. With the development of software and hardware and the rapid computerization of business, enormous quantities of data are collected and stored in databases. Traditional data management tools, together with statistical techniques, are not adequate for analyzing these enormous quantities of data.

It is well known that data by itself produces no direct benefit. Its true value lies in the process of Knowledge Discovery in Databases (KDD), which can be defined as the non-trivial process of identifying patterns in data with the following characteristics: valid, novel, useful and understandable.

The KDD process is a set of interactive and iterative steps, among which are the preprocessing of the data to correct possibly erroneous, incomplete or inconsistent records; the reduction of the number of records or features by finding the most representative ones; the search for patterns of interest under a particular representation; and the interpretation of these patterns, possibly in a visual form. Knowledge discovery in databases combines traditional knowledge extraction techniques with numerous resources developed in the area of artificial intelligence. In these applications, the term data mining (DM) is the one that has gained the most acceptance, being frequently used to refer directly to the whole KDD process [FHOR05, Han05, WF05].
The research fields involved in a KDD process are highly varied: databases and pattern recognition, statistics and artificial intelligence, data visualization and supercomputing. KDD researchers incorporate techniques, algorithms and methods from these fields. Thus, a KDD process encompasses all of them and focuses its attention mainly on the complete process of extracting knowledge from large volumes of data, including storage and access, scaling up the algorithm when necessary, interpreting and visualizing the results, and supporting human-machine interaction. The following figure illustrates the complete process:

[Figure: the complete KDD process]
...but also descriptive from the perspective of intelligibility. There are also methods for extracting comprehensible rules from these black boxes, so that in reality both categories can be useful for knowledge extraction.

A related concept is Soft Computing (also called computational intelligence (CI)) [Kon05], an idea that encompasses a large part of the methodologies that can be applied in DM. Some of the most widespread and widely used methodologies are genetic algorithms, fuzzy logic, neural networks, case-based reasoning, rough sets, and hybridizations of the above.

As mentioned above, the DM process can be descriptive or predictive. We list below the basic disciplines in each kind of process:

- Descriptive processes: clustering, association rule mining and subgroup discovery.
- Predictive processes: classification and regression.

DM techniques are sensitive to the quality of the information from which knowledge is to be extracted. The higher this quality, the higher the quality of the decision-making models generated. In this sense, obtaining useful information for later processing is a key factor. A data preprocessing stage prior to DM therefore appears in the discovery process [Pyl99].

We can consider as Data Preprocessing or Data Preparation all those data analysis techniques that improve the quality of the data, so that DM methods can obtain more and better information [ZZY03].

The relevance of data preparation is due to the following:

- Real data can be impure, which can lead to the extraction of models of little use. This circumstance can be caused by incomplete data, noisy data or inconsistent data.
- Preparation yields quality data, which can lead to quality models. To this end, mechanisms are employed that recover incomplete information, resolve conflicts, or remove erroneous data.

In this thesis, among the different strategies that can be followed in data processing, we will direct our attention to Data Reduction (DR), whose goal is to extract from the original data set a smaller, representative data set on which to build the model. Reduction can be carried out in multiple ways. In this thesis we will focus on Instance Selection (IS), in which the most significant samples of the data set are chosen [LM01]; more specifically, on Prototype Selection (PS) using case-based reasoning.

The IS process can be oriented from two possible perspectives:

- Obtaining a case-based classifier via PS. The aim is to increase, by means of PS, the accuracy of a classifier that uses a set of previously known cases.
- Training Set Selection, where what is considered is the quality of the subsets obtained for model extraction using DM techniques. The quality measures used to assess the subsets depend on the domain the generated models are aimed at. In our case, since we deal with predictive models in classification problems, they depend on the accuracy and interpretability obtained.

Our interest in this thesis lies mainly within the classification process, in which multiple strategies have emerged and attracted great interest. One of them is case-based reasoning, in which knowledge is represented by the examples or cases themselves, obtained directly from the initial data. A subfamily of case-based reasoning methods is that of instance-based learning methods, whose main concepts will be highlighted in the next section.
1.1.1.

...humans, above all in particularly difficult cases.

There are five considerations for assessing a classifier:

- Accuracy: represents the confidence level of the classifier, usually expressed as the proportion of correct classifications it is able to produce.
- Speed: the response time from the moment a new example to classify is presented until we obtain the class predicted by the classifier. Speed is often as important as accuracy.
...format changes, etc. A very interesting and effective process in instance-based learning (IBL) algorithms is data reduction [WM00], since with it we can endow the k-NN classifier with improvements in the computational cost of classification and in tolerance to noise and to the presence of irrelevant attributes.

Data reduction can be achieved by means of:

- Feature Selection [LM98]: reduction of the number of attributes in the database.
- Discretization [LHTD02]: reduction of the number of possible values of an attribute.
- Instance Selection (IS) [LM02]: reduction of the number of examples in the database.

IS arose from earlier perspectives related to the k-NN classifier, and the term IS is adopted from the general point of view of DM. However, when we refer to IBL algorithms, in particular k-NN, the techniques for reducing the number of examples in the database are divided into:

- Condensation techniques [Har68]: their main objective is to remove the redundant examples from the database while seeking consistency on the training set; that is, maintaining the accuracy that k-NN obtains on the training set. These algorithms achieve high reduction rates but lose accuracy on test sets.
- Edition techniques [Wil72]: these are concerned only with removing the noisy examples from the database, with the aim of improving the predictive capabilities of k-NN. These algorithms achieve low reduction rates but usually improve the accuracy of k-NN on test sets. (A minimal sketch of an edition filter is given after this list.)
- Prototype Selection (PS) techniques [WM00]: these seek to reduce the data set as much as possible while at the same time improving the accuracy of k-NN.

On the other hand, it is worth highlighting the difference between PS methods and Prototype Generation methods. Although both pursue the same objective, the former do so by selecting examples from the database, so the final examples coincide with, or exist in, the original database. Prototype generation selects and generates new examples from the original ones. The advantage of PS is that it allows us to identify the most influential cases or examples in the database, increasing to some extent the interpretability of the model.

The combination of Soft Computing techniques, such as evolutionary algorithms, with DM has proved very useful and has obtained promising results, particularly in IS and PS, as we will see in the next section.
1.1.2.

Evolutionary Algorithms (EAs) [ES03] are search algorithms based on the natural processes of evolution and genetics. In recent years they have become established as one of the most successful types of search algorithm in Artificial Intelligence for complex problems (large numbers of variables, multiple local optima, and/or multiple objectives with complex conditions/relations among them).

EAs have proved to be an important tool for learning and knowledge extraction [Fre02]. They have been used in combination with multiple knowledge representation models, such as neural networks, fuzzy rule bases, interval rules, prototype-based approaches, variable and instance selection, association rule mining, etc. There is currently a continuous development of evolutionary knowledge extraction models. As a sample, we cite recently published books that collect recent contributions on this topic [JG05, GDG08, AGR06].

Although EAs were not designed as specific learning algorithms but as global search algorithms, we can highlight the advantages of their use in the field of machine learning. Many machine learning methodologies are based on the search for an optimal model within a space of models, such as the space of rule bases, the space of neural network weights and topologies, or the space of prototype sets, to name a few examples. They can therefore be posed as an underlying search or optimization problem. EAs make it possible to search these model spaces by encoding a solution model in a chromosome. They are very flexible as regards the encoding of different models, since the same EA can be used with different representations.

In data preprocessing in DM, and in particular in data reduction, EAs have been widely used in feature selection [Lee04] and in IS [CHL03, CHL07, GP07]. In the latter case we speak of Evolutionary Instance Selection and, in particular, of Evolutionary Prototype Selection (EPS) when k-NN is used as the classifier.

The EPS algorithms proposed in [CHL03] endow the k-NN classifier with greater accuracy and obtain very reduced subsets of examples. The scheme they follow is specified in the following figure:

[Figure: Evolutionary Prototype Selection algorithm evaluated with a 1-Nearest Neighbour classifier]

However, large data sets can be made tractable by means of the stratification technique proposed in [CHL05], where EPS is shown to perform very well, although, depending on the choice of the size of each stratum, EAs may or may not reach their maximum performance, as a function of their ability to converge towards optimal solutions.
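To illustrate how a PS solution can be encoded in a chromosome, the following minimal Python/NumPy sketch (our illustrative code, not the thesis implementation; the function name evolve and its parameters are assumptions) runs a generational GA over binary chromosomes with one gene per training instance (1 = keep, 0 = drop). The fitness argument stands for any evaluation function, such as the accuracy/reduction trade-off used in EPS:

    import numpy as np

    def evolve(n, fitness, pop_size=20, gens=50, pc=0.9, pm=0.01, rng=None):
        """Minimal generational GA over binary chromosomes of length n."""
        rng = rng or np.random.default_rng()
        pop = rng.integers(0, 2, size=(pop_size, n), dtype=np.int8)
        fit = np.array([fitness(c) for c in pop])
        for _ in range(gens):
            children = []
            while len(children) < pop_size:
                # binary tournament selection of two parents
                a, b = rng.integers(pop_size, size=2)
                p1 = pop[a] if fit[a] >= fit[b] else pop[b]
                a, b = rng.integers(pop_size, size=2)
                p2 = pop[a] if fit[a] >= fit[b] else pop[b]
                c1, c2 = p1.copy(), p2.copy()
                if rng.random() < pc:            # one-point crossover
                    cut = rng.integers(1, n)
                    c1[cut:], c2[cut:] = p2[cut:].copy(), p1[cut:].copy()
                for c in (c1, c2):               # bit-flip mutation
                    flip = rng.random(n) < pm
                    c[flip] ^= 1
                    children.append(c)
            pop = np.array(children[:pop_size])
            fit = np.array([fitness(c) for c in pop])
        return pop[fit.argmax()], fit.max()

The same loop works unchanged for other binary-coded models, which is the flexibility of representation mentioned above.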
1.1.3.

The class imbalance problem is one of the new problems that arose when machine learning matured from a science into an applied technology, widely used in business, industry and scientific research. Although experimenters were already aware of this problem, it made its appearance in the machine learning and DM scientific community no more than a decade ago. Its importance grew as researchers increasingly realized that the data sets they were analyzing were imbalanced and that the classification models they obtained fell below the desired threshold of effectiveness.

The class imbalance problem typically occurs when, in a classification problem, there are many more instances or examples of one class than of the remaining classes [CJK04]. In these cases, a standard classifier tends to be swamped by the large classes and to ignore the small ones. In practical applications, the ratio of the small class to the large one can be as drastic as 1 example to 100, 1 to 1,000 or 1 to 10,000. As mentioned above, this problem is observable in many situations, including fraud or intrusion detection, risk management, text classification, medical diagnosis, etc.

It is worth knowing that in certain domains (such as those mentioned) the class imbalance problem is intrinsic to the problem. For example, there are very few cases of fraud compared with the large amount of honest use of the facilities offered to a customer. However, class imbalance sometimes occurs in domains that have no intrinsic imbalance. This happens when the data collection process is limited (for economic or privacy reasons). In addition, there may also be an imbalance in the costs associated with making different errors, which can vary from case to case.

A large number of solutions to the class imbalance problem have been proposed, at two kinds of level: the data level and the algorithmic level. At the data level, these solutions include many different forms of resampling, such as random over-sampling with replacement, random under-sampling [EJJ04], focused over-sampling (in which no new examples are created, but the choice of samples to replace is informed rather than random), over-sampling with informed generation of artificial examples [CBHK02], informed under-sampling [BPM04], and combinations or hybridizations of the above techniques. At the algorithmic level, the solutions include adjusting the costs of the different classes of the problem so that the less represented class is more costly for classification purposes; adjusting the probability estimates at the leaves of a tree [WP03] (when working with decision trees); adjusting the decision threshold; and using recognition-based learning (learning with one class) rather than discrimination-based learning (for two classes).

Recently, [CCHJ08] showed the empirical relationship between treating imbalanced classification problems with data-level proposals and with algorithm-level proposals. Preprocessing the data from the standpoint of imbalanced problems has proved very useful and has the great advantage of not requiring any modification of the classification algorithms we already know.
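As a concrete instance of the simplest data-level solution named above, the following minimal Python/NumPy sketch (illustrative code, not the thesis implementation; the function name random_undersample is our own) performs random under-sampling until the classes are balanced:

    import numpy as np

    def random_undersample(X, y, majority, rng=None):
        """Randomly drop majority-class examples until both classes
        have the same size (simplest data-level treatment)."""
        rng = rng or np.random.default_rng()
        maj = np.flatnonzero(y == majority)
        mino = np.flatnonzero(y != majority)
        keep_maj = rng.choice(maj, size=mino.size, replace=False)
        idx = np.sort(np.concatenate([keep_maj, mino]))
        return X[idx], y[idx]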
1.1.4.

At first glance, there are a number of parameters that clearly condition whether one specific classification problem is more complex than another. For example, we may highlight the number of instances in the database, the number of attributes per pattern, the number of distinct classes to predict, etc. However, other complexity measures can be applied to define a priori the difficulty of classification problems.

Data complexity measures in classification [HB02, HB06] study the geometric complexity of the decision boundaries between examples of different classes. As a practical rule, the difficulty of a problem is considered proportional to the error rate obtained by a classifier. Nevertheless, according to the No Free Lunch theorem [WM97], it is not possible to find an algorithm that is better on all problems.

In the specialized literature we can find a series of standardized measures of data complexity:

- Overlap measures: to measure the volume of overlap between examples of different classes, as well as per attribute.

...consists in being able to determine a priori which type of algorithm may be more beneficial to apply, or when it is worth using a specific algorithm [BMH05].
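As an example of an overlap measure, the following minimal Python/NumPy sketch (illustrative code, not the thesis implementation) computes Fisher's discriminant ratio F1, one of the measures catalogued in [HB02], for a two-class problem:

    import numpy as np

    def fisher_f1(X, y):
        """Fisher's discriminant ratio F1 (two classes): per attribute,
        (mu0 - mu1)^2 / (var0 + var1); F1 is the maximum over attributes.
        Higher F1 means less class overlap (an easier problem)."""
        X0, X1 = X[y == 0], X[y == 1]
        num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
        den = X0.var(axis=0) + X1.var(axis=0)
        return np.max(num / np.where(den == 0, np.inf, den))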
1.1.5.

Nowadays, a proposed algorithm must be justified by showing a benefit with respect to other, already studied proposals, in terms of some performance measure: accuracy, efficiency, interpretability, etc. In classification or optimization problems, a study must involve several cases of different kinds, namely the so-called data sets in classification or test functions in optimization.

One of the open problems in CI is that there is no unified theory, nor any way of theoretically proving the improvement of one method over another [YW06]. Because of this, we need rigorous comparisons that allow us to work with concrete results.

The theory of statistical tests allows us to attach an error probability to a given hypothesis from a finite sample of results. In DM and CI, the tendency of researchers is to use paired statistical techniques (which compare two algorithms with each other, for example the t-test) on samples of results assumed to satisfy the ideal conditions for analysis by parametric techniques [She06]. On some occasions, multiple comparison techniques (ANOVA) are also used. The appropriate conditions for a parametric analysis are given by three basic properties: independence of the results, normality of the results, and homoscedasticity. The three conditions rarely hold at the same time.

Moreover, parametric statistics do not work adequately when the sample of results is made up of results obtained over several data sets, because each result represents a different problem and the population is made up of very disparate results. Demšar, in [Dem06], reviews and proposes the use of non-parametric statistical techniques [Con98] for analyzing the results of classifiers over several data sets. He focuses mainly on bringing simple statistical techniques to the classification setting and on showing that they really are advantageous with respect to the use of parametric techniques. He also stresses the use of multiple comparison techniques, which allow us to specify a priori the appropriate error level of an experiment regardless of the number of factors (algorithms, in our case) involved in it.
1.2. Justification

Having presented the main concepts this thesis refers to, we now pose a series of open problems that frame the motivation and justification of this thesis project.

- As noted in Section 1.1.2, EAs are optimization techniques with high computational requirements. For this reason, using them in instance-based classification problems, as a PS process, can prove very costly when the input sets grow in size. Moreover, EAs must enlarge their solution representation scheme for these kinds of problems, and this can cause convergence failures that prevent the algorithm from obtaining accurate solutions on larger problems.
- The application of EAs to PS is very effective in terms of the data reduction rates obtained, but it may happen that the accuracy of the resulting model does not increase. It would be very useful to have an efficient mechanism that allows us to diagnose, from the type of data we are facing, the likely outcome of applying an EA to PS.
- As we have seen above, one of the most effective techniques for dealing with the class imbalance problem in classification consists of preprocessing the data beforehand. In DR, several techniques have been proposed for this purpose, but they are all based on heuristics and do not obtain sufficiently accurate solutions. On the other hand, there are other techniques based on over-sampling or on the generation of new data. These techniques offer high performance in terms of accuracy when applied prior to a predictive DM algorithm, but they have the drawback of increasing the size of the models and, therefore, reducing their interpretability.
- Proposing new methods in DM or CI is a frequent research activity. Every new method must offer some advantage over those already proposed, and statistical comparison techniques have been used to determine it. These techniques are not always applicable to every kind of result obtained, above all when we want to carry out an analysis in which the comparison includes different cases or instances belonging to the same problem (multi-problem setting).
1.3. Objectives

1.4. Summary

To develop the objectives posed, this thesis consists of seven publications distributed in four parts, which are developed in Chapter 2. These parts are the following:
1.5. Joint Discussion of Results

This section gives a summary of the different proposals collected in this thesis and presents a brief discussion of the results obtained by each of them.
1.5.1. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach

In this part (Section 2.1), the PS problem is analyzed from the point of view of scalability, with a brief review of the classical and evolutionary techniques proposed for PS and a definition of the concepts needed to address it. Having laid out the drawbacks of conventional EPS algorithms, namely their lack of convergence when used on larger problems and their inefficiency, we point out that the use of EAs on larger problems can be rather limited. To remedy this problem in EAs, we can hybridize them with metaheuristics that exhibit better local exploitation behaviour, such as local search algorithms. The resulting algorithms are the so-called Memetic Algorithms [KS05]. Since memetic algorithms achieve a good balance between exploitation and exploration, and we have observed that traditional EPS algorithms do not reach good solutions as problems grow in size, we propose using a specific memetic algorithm model for the PS problem. To achieve a shorter response time, we use a local search procedure designed ad hoc for the PS problem, in which only fast operations are allowed and all the information needed by a k-NN classifier is recorded.
1.5.2. Diagnosing the Effectiveness of Evolutionary Prototype Selection Using an Overlap Measure

In this part (Section 2.2), a way of diagnosing the effective use of EPS algorithms in classification problems is studied. As we have already said, EAs stand out for obtaining excellent results in different fields of application, but not so for their...
1.5.3. Evolutionary Under-Sampling for Classification in Imbalanced Problems: Proposals for Instance-Based Learning and Training Set Selection

In this part (Section 2.3), evolutionary under-sampling for classification problems with imbalanced classes is presented from two perspectives: evolutionary under-sampling for classification...

...allow smaller trees or rule bases to be obtained, which increases the interpretability of the models.

The articles associated with this part are:

S. García, F. Herrera, Evolutionary Under-Sampling for Classification with Imbalanced Data Sets: Proposals and Taxonomy. Evolutionary Computation, in press (2008).

S. García, A. Fernández, F. Herrera, Enhancing the Effectiveness and Interpretability of Decision Tree and Rule Induction Classifiers with Evolutionary Training Set Selection over Imbalanced Problems. Applied Soft Computing Journal, submitted (2008).
1.5.4. Design of Experiments in Computational Intelligence: On the Use of Statistical Inference

...of the n x n type.

The articles associated with this part are:

S. García, D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms' Behaviour: A Case Study on the CEC'2005 Special Session on Real Parameter Optimization. Journal of Heuristics, doi: 10.1007/s10732-008-9080-4, in press (2008).

S. García, F. Herrera, An Extension on Statistical Comparisons of Classifiers over Multiple Data Sets for all Pairwise Comparisons. Journal of Machine Learning Research, in press (2008).

S. García, A. Fernández, J. Luengo, F. Herrera, A Study of Statistical Techniques and Performance Measures for Genetics-Based Machine Learning: Accuracy and Interpretability. Soft Computing, submitted (2008).
1.6. Final Comments: Brief Summary of the Results Obtained and Conclusions

We devote this section to briefly summarizing the results obtained and to highlighting the conclusions this thesis contributes.

We have studied different open issues or new challenges in which EAs can be used in DR to carry out classification tasks based on nearest neighbours and on training set selection. The intention is to carry out a reduction of the initial set in order to increase the performance of the classifiers used afterwards, on problems of larger size or on problems that present an imbalance in the class distribution. In addition, we have also studied, by means of data complexity metrics, when the application of EAs for DR in conventional classification problems is most worthwhile and effective.

The behaviour of the proposed techniques has been compared with the algorithms already proposed in the specialized literature, taking into account the specific typology of the problem to be addressed. Thus, the best PS and EPS algorithms have been considered in the scalability study, and the best-known and best-performing under-sampling and over-sampling methods in the study of our proposal on imbalanced classification problems. The comparisons of the algorithms have been carried out according to the methodology proposed in this thesis, which is based on statistical analysis by means of non-parametric tests in multi-problem settings. Analogously, the application of these tests has been justified through the presentation of several empirical studies...
1.6.1. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach

To solve the observed problems of lack of convergence of EPS algorithms, and to increase their efficiency as the scale of the problem grows, we have used an EA based on a memetic algorithm specially designed for the PS problem. The results obtained have been promising, owing to the following factors:

- Our proposal based on memetic algorithms presents good data reduction rates and good efficiency with respect to the other EPS algorithms.
- Regardless of the size of the data sets, our proposal improves on the non-evolutionary proposals. When comparing it with the EPS algorithms, we observe very similar behaviour on small problems. However, when problems grow in size, the accuracy levels obtained by our proposal surpass those of the other EPS algorithms, whose exploitation capability declines.
1.6.2. Diagnosing the Effectiveness of Evolutionary Prototype Selection Using an Overlap Measure

1.6.3. Evolutionary Under-Sampling for Classification in Imbalanced Problems: Proposals for Instance-Based Learning and Training Set Selection

1.6.4. Design of Experiments in Computational Intelligence: On the Use of Statistical Inference
1.7. Future Work

Therefore, we propose considering two objectives, accuracy and simplicity, by means of a multi-objective EA [CLV06]. In any problem with multiple objectives there is always a set of solutions that, when all the objectives are considered, are superior to the rest of the search space. Such solutions are known as non-dominated solutions (the Pareto set). No solution contained in the Pareto set is absolutely better than the remaining non-dominated ones. In this way, a set of solutions could be obtained ranging from the most accurate models to the simplest ones, passing through different levels of trade-off between the two criteria.

...obtaining high or low reduction rates depending on the complexity of the classifier in question.

...recommending groups of data sets for testing; or, by means of association rule techniques, the existing relations between measures could be analyzed, or even new measures proposed.

Chapter 2
Received 8 September 2007; received in revised form 18 January 2008; accepted 14 February 2008

Abstract

The prototype selection problem consists of reducing the size of databases by removing samples that are considered noisy or not influential in nearest neighbour classification tasks. Evolutionary algorithms have been used recently for prototype selection, showing good results. However, due to the complexity of this problem when the size of the databases increases, the behaviour of evolutionary algorithms could deteriorate considerably because of a lack of convergence. This additional problem is known as the scaling up problem.

Memetic algorithms are approaches for heuristic searches in optimization problems that combine a population-based algorithm with a local search. In this paper, we propose a model of memetic algorithm that incorporates an ad hoc local search specifically designed for optimizing the properties of the prototype selection problem, with the aim of tackling the scaling up problem. In order to check its performance, we have carried out an empirical study including a comparison between our proposal and previous evolutionary and non-evolutionary approaches studied in the literature.

The results have been contrasted with the use of non-parametric statistical procedures and show that our approach outperforms previously studied methods, especially when the database scales up.

© 2008 Elsevier Ltd. All rights reserved.

Keywords: Data reduction; Evolutionary algorithms; Memetic algorithms; Prototype selection; Scaling up; Nearest neighbour rule; Data mining
1. Introduction

Considering supervised classification problems, we usually have a training set of samples in which each example is labelled according to a given class. Inside the family of supervised classifiers, we can find the nearest neighbour (NN) rule method [1,2], which predicts the class of a new prototype by computing a similarity measure [3,4] between it and all prototypes from the training set; with k neighbours it is called the k-nearest neighbours (k-NN) classifier. Recent studies show that the k-NN classifier could be improved by employing numerous procedures. Among them, we could cite proposals on instance reduction [5,6], on incorporating weights to improve classification [7], on accelerating the classification task [8], etc.

This work was supported by Projects TIN2005-08386-C05-01 and TIN2005-08386-C05-03. S. García holds an FPU scholarship from the Spanish Ministry of Education and Science.
Corresponding author. Tel.: +34 958 240598; fax: +34 958 243317.
E-mail addresses: salvagl@decsai.ugr.es (S. García), jrcano@ujaen.es (J.R. Cano), herrera@decsai.ugr.es (F. Herrera).
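As background, a minimal Python/NumPy sketch of the k-NN rule (illustrative code, not from the paper; the function name knn_predict is our own):

    import numpy as np

    def knn_predict(x, X_train, y_train, k=1):
        """k-NN rule: predict the majority class among the k training
        prototypes closest (Euclidean) to x."""
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(d)[:k]
        return np.bincount(y_train[nn]).argmax()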
Prototype selection (PS) is an instance reduction process consisting of maintaining those instances that are more relevant in the classification task of the k-NN algorithm and removing the redundant ones. It attempts to reduce the number of rows in the data set with no loss of classification accuracy, and to obtain an improvement in the classifier. Various PS algorithms have been proposed in the literature; see Refs. [6,9] for a review. Another process used for reducing the number of instances in training data is prototype generation, which consists of building new examples by combining or computing several metrics among the original data and including them in the subset of training data [10].

Evolutionary algorithms (EAs) have been successfully used in different data mining problems (see Refs. [11-13]). Given that the PS problem can be seen as a combinatorial problem, EAs [14] have been used to solve it with promising results [15]; we refer to this approach as evolutionary prototype selection (EPS).
Let us denote each sample as $x_p = (x_{p1}, x_{p2}, \ldots, x_{pm}, x_{pl})$, belonging to a class $c$ given by $x_{pl}$, in an $m$-dimensional space in which $x_{pi}$ is the value of the $i$th feature of the $p$th sample. Then, let us assume that there is a training set TR which consists of such samples.

Drop3 [6]. An associate of $x_p$ is a sample $x_i$ which has $x_p$ as a nearest neighbour. This method removes $x_p$ if at least as many of its associates would be classified correctly without it.
Cpruner [26]. C-Pruner is a sophisticated algorithm constructed by extending concepts and procedures taken from the algorithms Icf [27] and Drop3.

Explore [28]. Cameron-Jones used an encoding length heuristic to determine how good the subset S is at describing TR. Explore is the most complete method belonging to this group and it includes three tasks:
- It starts from the empty set S and adds instances only if the cost function is minimized.
- After this, it tries to remove instances if this helps to minimize the cost function.
- Additionally, it performs 1000 mutations to try to improve the classification accuracy.

Rmhc [29]. First, it randomly selects a subset S from TR which contains a fixed number of instances s (s = %|TR|). In each iteration, the algorithm interchanges an instance from S with another from TR \ S. The change is kept if it offers better accuracy.

Rng [30]. It builds a graph associated with TR in which a neighbourhood relation among instances is reflected. Instances misclassified according to this graph are discarded following a specific criterion.
2.2. The scaling up problem

Any algorithm is affected when the size of the problem to which it is applied increases. This is the scaling up problem, characterized by producing:
- Excessive storage requirements.
- Increased time complexity.
- Decreased generalization capacity, introducing noise and over-fitting.

A way of avoiding the drawbacks of this problem was proposed in Ref. [16], where a stratified strategy divides the initial data set into disjoint strata with equal class distribution. The number of strata chosen will determine their size, depending on the size of the data set. Using the proper number of strata we can significantly reduce the training set and avoid the drawbacks mentioned above.

Following the stratified strategy, the initial data set D is divided into t disjoint strata of equal size, $D_1, D_2, \ldots, D_t$, maintaining the class distribution within each subset. Then, PS algorithms are applied to each $D_j$, obtaining a selected subset $DS_j$. The stratified prototype subset selected (SPSS) is defined as

$$SPSS = \bigcup_{j \in J} DS_j, \qquad J \subseteq \{1, 2, \ldots, t\}. \qquad (1)$$
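A minimal Python/NumPy sketch of the stratified split (illustrative code, not the implementation of Ref. [16]; the function name stratify is our own). A PS algorithm would then be run on each stratum, and the SPSS of expression (1) is the union of the selected subsets DS_j:

    import numpy as np

    def stratify(X, y, t, rng=None):
        """Split (X, y) into t disjoint strata with roughly equal
        class distribution, dealing each class out round-robin."""
        rng = rng or np.random.default_rng()
        strata = [[] for _ in range(t)]
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            for j, i in enumerate(idx):
                strata[j % t].append(i)
        return [np.sort(np.array(s)) for s in strata]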
3. EPS: a review

In this section, we review the main contributions that have included or proposed an EPS model.

The first application of an EA to the PS problem can be found in Ref. [31]. Kuncheva applied a genetic algorithm (GA) to select a reference set for the k-NN rule. Her GA maps the TR set onto a chromosome structure composed of genes, ...
Fitness function: let S be a subset of instances of TR to evaluate, coded by a chromosome. We define a fitness function that considers the number of instances correctly classified using the 1-NN classifier and the percentage of reduction achieved with regard to the original size of the training data. The evaluation of S is carried out by considering the whole training set TR. For each object y in S, the NN is searched for among those in the set S\{y}:

$$Fitness(S) = \alpha \cdot clas\_rat + (1 - \alpha) \cdot perc\_red.$$

The adaptive mechanism assigns each offspring a local search probability $P_{LS}$:

$$P_{LS} = \begin{cases} 1 & \text{if } f(c_{new}) \text{ is better than } f(C_{worst}), \\ 0.0625 & \text{otherwise.} \end{cases} \qquad (3)$$
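A minimal NumPy sketch of this fitness (illustrative code, not the paper's implementation; alpha = 0.5 here is our assumed default, and mask encodes the chromosome):

    import numpy as np

    def fitness(mask, X, y, alpha=0.5):
        """alpha * clas_rat + (1 - alpha) * perc_red, evaluating the
        selected subset S (mask == 1) by 1-NN over the whole training
        set, searching the NN in S \ {y}."""
        S = np.flatnonzero(mask)
        if S.size == 0:
            return 0.0
        hits = 0
        for i in range(len(X)):
            cand = S[S != i]                   # 1-NN searched in S \ {i}
            if cand.size == 0:
                return 0.0
            d = np.linalg.norm(X[cand] - X[i], axis=1)
            hits += int(y[cand[d.argmin()]] == y[i])
        clas_rat = 100.0 * hits / len(X)             # % correct
        perc_red = 100.0 * (1 - S.size / len(mask))  # % reduction
        return alpha * clas_rat + (1 - alpha) * perc_red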
1.  Initialize population.
2.  While (not termination-condition) do
3.    Use binary tournament to select two parents
4.    Apply crossover operator to create offspring (Off1, Off2)
5.    Apply mutation to Off1 and Off2
6.    Evaluate Off1 and Off2
7.    For each Offi
8.      Invoke Adaptive-PLS-mechanism to obtain PLSi for Offi
9.      If u(0,1) < PLSi then
10.       Perform meme optimization for Offi
11.     End if
12.   End for
13. End while
The partial evaluation account is defined as

$$PE = \frac{N_{nu}}{n}. \qquad (4)$$
4.3.3. LS stages

Two stages can be distinguished within the optimization procedure. Each stage has a different objective, and its application depends on the progress of the actual search process: the first is an exclusive improvement of fitness, and the second is a strategy for dealing with the problem of premature convergence.

Improving accuracy stage: This starts from the initial assignment (a recently generated offspring) and iteratively tries to improve the current assignment by local changes. If, in the neighbourhood of the current assignment, a better assignment is found, it replaces the current assignment and the search continues from the new one. The selection of a neighbour is made randomly, without repetition, from among all the solutions that belong to the neighbourhood. For an assignment to be considered better than the current one, its classification accuracy must be greater than or equal to the previous one; in the latter case the number of selected instances will be lower than before, so the fitness function value always increases.

Avoiding premature convergence stage: When the search process has advanced, the population tends to converge prematurely towards a certain area of the search space. A local optimization promotes this behaviour when it considers only solutions with better classification accuracy. In order to prevent this, the proposed meme optimization procedure will accept worse solutions in the neighbourhood, in terms of classification accuracy. Here, the fitness function value cannot be increased; it may be decreased or maintained.

The parameter threshold is used to determine the way the algorithm operates depending on the current stage. When threshold has a value greater than or equal to 0, the stage in progress is the improving accuracy stage, because a new...
[Fig. 4. Example of a move in the meme procedure and a partial evaluation: current vs. neighbour solution over a 13-instance representation, U structure, gain vector, partial evaluation account PE = N_nu / n = 4/13, and number of correctly classified patterns (8).]

The fitness change produced by a move is computed as

$$\Delta Fitness = \frac{gain \cdot \frac{100}{n} + \frac{100}{n}}{2}.$$
Table 1
Small data sets characteristics

Name           N. instances   N. features   N. classes
Bupa           345            7             2
Cleveland      297            13            5
Glass          294            9             7
Iris           150            4             3
Led7Digit      500            7             10
Lymphography   148            18            4
Monks          432            6             2
Pima           768            8             2
Wine           178            13            3
Wisconsin      683            9             2
Table 2
Medium and large data sets characteristics

Name            N. instances   N. features   N. classes
Nursery         12 960         8             5
Page-Blocks     5476           10            5
Pen-Based       10 992         16            10
Ringnorm        7400           20            2
Satimage        6435           36            7
Spambase        4597           57            2
Splice          3190           60            3
Thyroid         7200           21            3
Adult (large)   45 222         14            2

[Table 3. Parameters used in PS algorithms: CHC, IGA, GGA, PBIL, SSGA, Ib3, Rmhc, Rng, SSMA.]

The classic tfcv partitions are defined by

$$TR_i = \bigcup_{j \in J} D_j, \qquad J = \{j \mid 1 \le j \le (i-1) \ \text{and} \ (i+1) \le j \le 10\}, \qquad (5)$$

$$TS_i = D \setminus TR_i. \qquad (6)$$

The stratified version (tfcv strat) builds the training sets from the selected subsets:

$$TR_i = \bigcup_{j \in J} DS_j, \qquad J = \{j \mid 1 \le j \le b(i-1) \ \text{and} \ (ib)+1 \le j \le t\}. \qquad (7)$$
The data sets considered are partitioned using the classic tfcv (see expressions (5) and (6)), except for the adult data set, which is partitioned using the tfcv strat procedure with t = 10 and b = 1 (see expression (7)). Deterministic algorithms have been run once over these partitions, whereas probabilistic algorithms (including SSMA) are run for 3 trials over each partition, and we show the average results over these trials.

Whether small, medium or large data sets are evaluated, the parameters used are the same, as specified in Table 3. They are set following the indications given by their respective authors. With respect to the standard EAs employed in the study, GGA and SSGA, the selection strategy is binary tournament. The mutation operator is the same one used in our model of SSMA, while SSGA uses the standard replacement strategy. The crossover operator used by both algorithms defines two cut points and interchanges substrings of bits.

To compare results we propose the use of non-parametric tests, according to the recommendations made in Ref. [44]. They are safer than parametric tests since they do not assume normal distributions or homogeneity of variance. As such, these non-parametric tests can be applied to classification accuracies, error ratios or any other measure for the evaluation of classifiers.
[Table 4. Average results for EPS algorithms (CHC, GGA, IGA, PBIL, SSGA, SSMA) over small data sets: training accuracy (Tra.), test accuracy (Tst.) and reduction (Red.), mean and SD per data set, with a GLOBAL average.]
$$R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i), \qquad (8)$$

$$R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i). \qquad (9)$$

$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}, \qquad (10)$$

in which

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j \bar{R}_j^2 - \frac{k(k+1)^2}{4} \right], \qquad \bar{R}_j = \frac{1}{N} \sum_i r_i^j. \qquad (11)$$
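A minimal Python/NumPy sketch of statistics (10) and (11) (illustrative code, not the paper's implementation; the rank matrix below is made up for illustration):

    import numpy as np

    ranks = np.array([  # N data sets x k algorithms, rank 1 = best
        [1, 2, 3], [2, 1, 3], [1, 3, 2], [1, 2, 3], [2, 1, 3], [1, 2, 3],
    ])
    N, k = ranks.shape
    Rj = ranks.mean(axis=0)                 # average rank per algorithm
    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(Rj ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)  # ~ F(k-1, (k-1)(N-1))
    print(chi2_F, F_F)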
[Table 5. Average results for EPS algorithms (CHC, GGA, IGA, PBIL, SSGA, SSMA) over medium data sets: training accuracy (Tra.), test accuracy (Tst.) and reduction (Red.), mean and SD per data set, with a GLOBAL average.]

Table 6
Iman and Davenport statistic for EPS algorithms

Performance          Scale    F_F       Critical value
Accuracy             Small    2.568     2.422
Accuracy             Medium   5.645     2.485
Accuracy-reduction   Small    14.936    2.422
Accuracy-reduction   Medium   179.667   2.485
...is employed follows the tendency shown for EAs [16]. A final subsection is included to illustrate a time complexity analysis of the EPS algorithms over all the medium size data sets considered, and to study how the execution time of SSMA is divided between the evolutionary and optimization stages.

5.1. Part I: comparing SSMA with EAs

In this section, we carry out a comparison that includes all the EPS algorithms described in this paper.

Tables 4 and 5 show the average results for EPS algorithms run over small and medium data sets, respectively. The result of the Iman and Davenport test is presented in Table 6. Tables 7 and 8 present the statistical differences among EPS algorithms according to Wilcoxon's test, considering accuracy performance and accuracy-reduction balance performance, respectively.

The following conclusions can be drawn from an examination of Tables 4-8:

- In Table 4, SSMA achieves the best test accuracy rate. EPS algorithms are prone to over-fitting, obtaining good accuracy on training data but not on test data. The SSMA proposal does not exhibit this behaviour in a noticeable way.
- When the problem scales up, in Table 5, SSMA presents the best reduction rate and the best accuracy on training and test data.
- The Iman-Davenport statistic (presented in Table 6) indicates the existence of significant differences in results among all the EPS approaches studied.
- Considering only the classification performance over test data in Table 7, all algorithms are very competitive. Statistically, SSMA always obtains subsets of prototypes with performance equal to the remaining EPS methods, improving on GGA when small databases are used, and on GGA and CHC when the problem scales up to a medium size.
- When the reduction objective is included in the measure of quality, Table 8, our proposal obtains the best result. Only CHC presents the same behaviour on small data sets. When the problem scales up, SSMA again outperforms CHC.

Finally, we provide a map of convergence of SSMA, in Fig. 5, in order to illustrate the optimization process carried out on the satimage data set. In it, the two goals, reduction and training accuracy, are shown. We can see that the two goals are opposed, but the trend of the two convergence lines usually rises.
[Table 7. Wilcoxon table for EPS algorithms considering accuracy performance.]

[Table 8. Wilcoxon table for EPS algorithms considering reduction-accuracy balance performance.]

[Fig. 5. Map of convergence of the SSMA algorithm on the satimage data set: reduction and test accuracy (percentage) vs. evaluations.]
Table 9
Average results for non-EPS algorithms over small data sets

Algorithm   Time (s)   SD time (s)   % Red.   SD red.   % Ac. trn.   SD ac. trn   % Ac. test   SD ac. test
1-NN        0.2        0.00          -        -         72.89        18.63        72.22        19.15
Allknn      0.18       0.03          37.00    23.97     77.51        16.29        72.60        18.40
Cpruner     0.26       0.02          92.59    5.09      66.38        24.16        65.31        23.22
Drop3       0.71       0.02          83.43    8.85      73.43        15.17        67.69        19.06
Enn         0.13       0.01          25.50    19.56     77.47        15.16        73.21        18.38
Explore     0.83       0.03          97.66    1.16      77.59        16.78        74.42        19.60
Ib3         0.17       0.01          68.34    19.85     64.16        22.95        69.96        20.18
Pop         0.05       0.01          14.56    15.17     70.49        20.19        72.10        19.60
Rmhc        11.82      0.05          90.18    0.17      83.68        14.25        73.93        19.44
Rng         1.35       0.03          25.07    20.45     78.24        15.31        73.85        18.32
Rnn         3.18       0.05          92.40    4.56      74.41        17.57        73.11        17.70
SSMA        6.38       0.16          96.37    1.92      79.97        17.76        76.13        18.94
Table 10
Average results for non-EPS algorithms over medium data sets

Algorithm   Time (s)    SD time (s)   % Red.   SD red.   % Ac. trn.   SD ac. trn   % Ac. test   SD ac. test
1-NN        72.93       0.15          -        -         89.11        7.75         87.79        8.94
Allknn      34.39       0.31          18.68    13.49     88.46        11.33        85.21        12.94
Cpruner     23.83       0.07          89.08    3.56      81.31        15.51        80.52        16.09
Drop3       99.48       0.82          88.17    8.35      81.99        9.88         78.84        13.35
Enn         13.48       0.03          12.22    10.21     87.85        10.66        85.98        12.14
Explore     1525.52     92.27         98.74    1.03      91.44        4.96         88.99        7.21
Ib3         2.70        0.05          77.42    17.83     84.91        10.80        86.36        9.49
Pop         1.20        0.04          23.17    32.94     86.09        13.55        84.97        13.78
Rmhc        6572.53     172.08        90.00    0.01      94.30        3.57         89.53        7.61
Rng         3149.19     26.62         7.20     4.90      89.92        7.74         87.99        9.14
Rnn         26 428.38   271.56        93.65    4.25      88.85        7.45         86.04        9.57
SSMA        3775.75     184.13        98.01    1.79      92.44        4.85         89.66        7.28
As we can see in Table 12, Wilcoxons test considers the existence of competitive algorithms in terms of classication
accuracy. None of the non-EPS algorithms outperforms our
proposal in both considerations of evaluation. The unique algorithm that equals the result of SSMA when the reduction
and accuracy are combined is Explore. Note that this last
algorithm obtains a greater reduction rate than SSMA, but
our approach improves upon it when we consider the classication performance. This fact is interesting given that, although a good trade-off between reductionaccuracy is required, accuracy in terms of classication is more difcult to
improve upon than the reduction, so the solutions contributed
by SSMA may be considered of higher quality.
There are algorithms, for example Allknn, Rng or Pop, that achieve a similar performance in classification accuracy but do not achieve a high reduction rate. This could be critical when large data sets need to be processed. A small reduction rate implies a small decrease in classification resources (storage and time); therefore, applying these algorithms to large data sets is of little interest.
Table 11
Iman and Davenport statistic for non-EPS algorithms

Performance          Scale   F_F     Critical value
Accuracy             Small   6.009   1.938
Accuracy             Medium  5.817   1.969
Accuracy-reduction   Small   56.303  1.938
Accuracy-reduction   Medium  68.583  1.969
algorithm over them impossible. This limit depends on the algorithms employed, the properties of the data treated and how easily the instances in the data set can be reduced. We could argue that,
Table 12
Wilcoxon table for comparing SSMA with Non-EPS algorithms
Fig. 6. Difference of results between SSMA and CHC and SSMA and Explore. (a) SSMA vs. CHC. (b) SSMA vs. Explore.
Fig. 7. Accuracy in test vs. reduction in adult data set for EPS algorithms.
Fig. 8. Accuracy in test vs. reduction in adult data set for non-EPS algorithms.
Fig. 9. Run-time in seconds for EPS over medium size data sets.
6. Conclusion
This paper presents a memetic algorithm for evolutionary prototype selection and its application to data sets of different sizes, paying special attention to the scaling-up problem. An experimental study was carried out to compare this proposal with previous evolutionary and non-evolutionary approaches studied in the literature. The main conclusions reached are as follows:
Our proposal, SSMA, presents a good reduction rate and computational time with respect to other EPS schemes.
SSMA outperforms the classical PS algorithms, irrespective of the scale of the data set. Those algorithms that could be competitive with it in classification accuracy are not so when the reduction rate is considered.
References
[1] A.N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective, Springer-Verlag Telos, 2004.
[2] G. Shakhnarovich, T. Darrel, P. Indyk (Eds.), Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, MIT Press, Cambridge, MA, 2006.
[3] E. Pekalska, R.P.W. Duin, P. Paclík, Prototype selection for dissimilarity-based classifiers, Pattern Recognition 39 (2) (2006) 189–208.
[4] S.-W. Kim, B.J. Oommen, On using prototype reduction schemes to optimize dissimilarity-based classification, Pattern Recognition 40 (11) (2007) 2946–2957.
[5] H. Liu, H. Motoda, On issues of instance selection, Data Mining Knowl. Discovery 6 (2) (2002) 115–130.
[6] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (2000) 257–286.
[7] R. Paredes, E. Vidal, Learning prototypes and distances: a prototype reduction technique based on nearest neighbor error minimization, Pattern Recognition 39 (2) (2006) 180–188.
[8] E. Gómez-Ballester, L. Micó, J. Oncina, Some approaches to improve tree-based nearest neighbour search algorithms, Pattern Recognition 39 (2) (2006) 171–179.
[9] M. Grochowski, N. Jankowski, Comparison of instance selection algorithms II. Results and comments, in: Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC'04), Lecture Notes in Computer Science, vol. 3070, Springer, Berlin, 2004, pp. 580–585.
[10] M. Lozano, J.M. Sotoca, J.S. Sánchez, F. Pla, E. Pekalska, R.P.W. Duin, Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces, Pattern Recognition 39 (10) (2006) 1827–1838.
[11] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer, New York, 2002.
[12] A. Ghosh, L.C. Jain (Eds.), Evolutionary Computation in Data Mining, Springer, Berlin, 2005.
[13] N. García-Pedrajas, D. Ortiz-Boyer, A cooperative constructive method for neural networks for pattern recognition, Pattern Recognition 40 (1) (2007) 80–98.
[14] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer, Berlin, 2003.
[15] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Trans. Evol. Comput. 7 (6) (2003) 561–575.
[16] J.R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary prototype selection, Pattern Recognition Lett. 26 (7) (2005) 953–963.
About the Author: SALVADOR GARCÍA received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2004. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, evolutionary algorithms, learning from imbalanced data and statistical analysis.
About the Author: JOSÉ RAMÓN CANO received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 1999 and 2004, respectively. He is currently an Associate Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, data reduction, the interpretability-accuracy trade-off, and evolutionary algorithms.
About the Author: FRANCISCO HERRERA received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 100 papers in international journals. He is coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (World Scientific, 2001). As editing activities, he has co-edited four international books and 16 special issues in international journals on different Soft Computing topics. He currently serves as area editor of the journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves as a member of the editorial board of the journals Fuzzy Sets and Systems, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, International Journal of Computational Intelligence Research, Mediterranean Journal of Artificial Intelligence, and International Journal of Information Technology and Intelligent Computing. He acts as associate editor of the journals Mathware and Soft Computing, Advances in Fuzzy Systems, and Advances in Computational Sciences and Technology. His current research interests include computing with words and decision making, data mining, data preparation, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
2.2.
Salvador García
Dept. of Computer Science and Artificial Intelligence, University of Granada
Granada, 18071, Spain
salvagl@decsai.ugr.es

José-Ramón Cano
Department of Computer Science, University of Jaén,
Higher Polytechnic Center of Linares, Alfonso X El Sabio street
Linares, 23700, Spain
jrcano@ujaen.es

Ester Bernadó-Mansilla
Dept. of Computer Engineering, University of Ramon Llull
Barcelona, 08022, Spain
esterb@salleurl.edu

Francisco Herrera
Dept. of Computer Science and Artificial Intelligence, University of Granada
Granada, 18071, Spain
herrera@decsai.ugr.es
Evolutionary prototype selection has shown its effectiveness in the prototype selection domain. In most cases it improves the results offered by classical prototype selection algorithms, but its computational cost is expensive. In this paper we analyse the behaviour of the evolutionary prototype selection strategy, considering a complexity measure for classification problems based on overlapping. In addition, we have analysed different values of k for the nearest neighbour classifier in this domain of study, to see their influence on the results of PS methods. The objective consists of predicting, based on this overlapping measure, when evolutionary prototype selection is effective for a particular problem.
Keywords: Prototype Selection; Evolutionary Prototype Selection; Complexity Measures; Overlapping Measure; Data Complexity
1. Introduction
Prototype Selection (PS) is a classical supervised learning problem whose objective consists in finding, from an input data set, those prototypes which improve the accuracy of the nearest neighbour classifier [26]. More formally, let's assume that there is a training set T which consists of pairs (xi, yi), i = 1, ..., n, where xi defines an input vector of attributes and yi defines the corresponding class label. T contains n samples, each of which has m input attributes and belongs to one of the C classes. Let S ⊆ T be the subset of selected samples resulting from the execution of a prototype selection algorithm. The small size of the selected subset decreases the computational resource requirements of the classification algorithm while keeping its classification performance [1].
In the literature, another process used for reducing the number of instances can be found: prototype generation, which consists of building new examples [18,19]. Many of the generated examples may not coincide with the examples belonging to the original data set, since they are artificially generated. In some applications this behaviour is not desired, as could be the case in some data sets from the UCI Repository, such as Adult or KDD Cup'99, where information appears about real people or real connections, respectively; if new instances were generated, they might not correspond to valid real values. In this paper, we center our attention on the prototype selection domain, keeping the initial characteristics of the instances unchanged.
There are many proposals of prototype selection algorithms [14,33]. These methods follow different strategies for the prototype selection problem, and offer different behaviours depending on the input data set. Evolutionary algorithms are one of the most promising heuristics.
Evolutionary Algorithms (EAs) [9,13] are general-purpose search algorithms that use principles inspired by natural genetic populations to evolve solutions to problems. The basic idea is to evolve a population of chromosomes, which represent plausible solutions to the problem, by means of a competition process. EAs have been used to solve the PS problem in [5,20,31] with promising results.
EAs offer optimal results but at the expense of a high computational cost. Thus, it would be interesting to characterize their effective use in large-scale classification problems beforehand [34]. We consider them effective when they improve the classification capabilities of the nearest neighbour classifier. To reach this objective, we analyse the characteristics of the data sets prior to the prototype selection process.
In the literature, several studies have addressed the characterization of the data set by means of a set of complexity measures [2,17]. Mollineda et al. [23] present a previous work where they analysed complexity measures such as overlapping and non-parametric separability, considering Wilson's Edited Nearest Neighbor [32] and Hart's Condensed Nearest Neighbor [15] as prototype selection algorithms.
In this study we are interested in diagnosing when evolutionary prototype
selection is effective for a particular problem, using the overlapping measure suggested in [23]. To address this, we have analysed its behaviour by means of statistical comparisons with classical prototype selection algorithms, considering data sets from the UCI Repository [24] and different values of k neighbours for the prototype selection problem.
The paper is set out as follows. Section 2 is devoted to describing the evolutionary prototype selection strategy and the algorithm used in this study which belongs to this family. In Section 3, we present the complexity measure considered. Section 4 explains the experimental study and Section 5 deals with the results and their statistical analysis. Finally, in Section 6, we point out our conclusions.
2. Evolutionary Prototype Selection
EAs have been extensively used in the past both in learning and preprocessing [5,6,11,25]. EAs may be applied to the PS problem [5] because it can be considered a search problem.
The application of EAs to PS is accomplished by tackling two important issues: the specification of the representation of the solutions and the definition of the fitness function.
Representation: Let's assume a training data set denoted T with n instances. The search space associated with the instance selection is constituted by all the subsets of T. A chromosome consists of a sequence of n genes (one for each instance in T) with two possible states, 1 and 0, indicating whether or not the corresponding instance is included in the selected subset (see Figure 1).
T = (1 0 1 0 0 1 0 0)  ->  S = {x1, x3, x6}
Fig. 1. Chromosome binary representation of a solution.
Fitness function: it combines the classification rate (clas_rat) yielded by a 1-NN classifier using the subset S and the percentage of reduction achieved with respect to T:

Fitness(S) = α · clas_rat + (1 − α) · perc_red,    (1)

perc_red = 100 · (|T| − |S|) / |T|.    (2)
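The following minimal Python sketch illustrates how such a binary chromosome is decoded into the selected subset S and scored with expressions (1)-(2). It is our own illustration, not the authors' code; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def evaluate_chromosome(chrom, clas_rat, alpha=0.5):
    """Fitness of expressions (1)-(2): alpha weights classification vs. reduction.

    chrom    -- binary array of length n (1 = instance kept in S)
    clas_rat -- classification rate (%) obtained with the subset S
    """
    n = chrom.size
    perc_red = 100.0 * (n - chrom.sum()) / n            # expression (2)
    return alpha * clas_rat + (1.0 - alpha) * perc_red  # expression (1)

# The chromosome of Fig. 1 selects instances 1, 3 and 6:
chrom = np.array([1, 0, 1, 0, 0, 1, 0, 0])
S = np.flatnonzero(chrom) + 1        # 1-based indices -> [1 3 6]
print(S, evaluate_chromosome(chrom, clas_rat=80.0))
```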
During each generation, the Evolutionary Instance Selection CHC (EIS-CHC) method develops the following steps:
(1) It uses a parent population to generate an intermediate population of individuals, which are randomly paired and used to generate potential offspring.
(2) Then, a survival competition is held where the best chromosomes from the parent and offspring populations are selected to form the next generation.
EIS-CHC also implements a form of heterogeneous recombination using HUX, a special recombination operator. HUX exchanges half of the bits that differ between the two parents.
t ← 0;
Initialize(Pa, ConvergenceCount);
while not EndingCondition(t, Pa) do
    Parents ← SelectionParents(Pa);
    Offspring ← HUX(Parents);
    Evaluate(Offspring);
    Pn ← ElitistSelection(Offspring, Pa);
    if not modified(Pa, Pn) then
        ConvergenceCount ← ConvergenceCount − 1;
        if ConvergenceCount = 0 then
            Pn ← Restart(Pa);
            Initialize(ConvergenceCount);
        end
    end
    t ← t + 1;
    Pa ← Pn;
end
Algorithm 1: Pseudocode of the CHC algorithm
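The HUX operator admits a compact implementation. The sketch below is our own rendering of the standard description (exchange exactly half of the bits that differ between the two parents); it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def hux(parent_a, parent_b):
    """Exchange exactly half of the differing bits between two binary parents."""
    child_a, child_b = parent_a.copy(), parent_b.copy()
    diff = np.flatnonzero(parent_a != parent_b)          # positions that differ
    swap = rng.choice(diff, size=diff.size // 2, replace=False)
    child_a[swap], child_b[swap] = parent_b[swap], parent_a[swap]
    return child_a, child_b

a = np.array([1, 0, 1, 0, 0, 1, 0, 0])
b = np.array([0, 0, 1, 1, 0, 0, 1, 0])
print(hux(a, b))
```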
Certain problems are known to have nonzero Bayes error [16]. This is because some classes can be intrinsically ambiguous, or due to inadequate feature measurements.
Some problems may present complex decision boundaries, so it is not possible to offer a compact description of them [28].
Sparsity induced by small sample size and high dimensionality affects the generalization of the rules [27,22].
Real-life problems are often affected by a mixture of the three previously mentioned situations.
The prediction capabilities of classifiers are strongly dependent on data complexity. This is the reason why various recent papers have introduced the use of measures to characterize the data and to relate these characteristics to classifier performance [28].
In [17], Ho and Basu define some complexity measures for classification problems of two classes. Singh [30] offers a review of data complexity measures and proposes two new ones. Dong and Kothari [8] propose a feature selection algorithm based on a complexity measure defined by Ho and Basu. Bernadó and Ho [4] investigate the domain of competence of XCS by means of a methodology that characterizes the complexity of a classification problem by a set of geometrical descriptors. In [21], Li et al. analyse some omnivariate decision trees using the measure of complexity based on data density proposed by Ho and Basu. Baumgartner and Somorjai [3] define specific measures for regularized linear classifiers, using Ho and Basu's measures as reference. Mollineda et al. [23] extend some of Ho and Basu's measure definitions to problems with two or more classes. They analyse these generalized measures on two classic PS algorithms and remark that Fisher's discriminant ratio is the most effective for PS. Sánchez et al. [28] analyse the effect of data complexity on the nearest neighbours classifier.
In this case, according to the conclusions of Mollineda et al. [23], we have considered Fisher's discriminant ratio, which is a geometrical measure of overlapping, for studying the behaviour of evolutionary prototype selection. Fisher's discriminant ratio is presented in this section.
The plain version of Fisher's discriminant ratio offered by Ho and Basu [17] computes the degree of separability of two classes according to a specific feature. It compares the difference between the class means with the sum of the class variances. Fisher's discriminant ratio is defined as follows:

f = (μ1 − μ2)² / (σ1² + σ2²),    (3)

where μ1, μ2, σ1², σ2² are the means and the variances of the two classes, respectively.
The generalization of Fisher's discriminant ratio for problems with C classes, proposed in [23], is defined as

F1 = [ Σ_{i=1..C} n_i · δ(m, m_i) ] / [ Σ_{i=1..C} Σ_{j=1..n_i} δ(x_j^i, m_i) ],    (4)

where n_i denotes the number of samples in class i, δ is the metric, m is the overall mean, m_i is the mean of class i, and x_j^i represents the sample j belonging to class i. Small values of this measure indicate that the classes present strong overlapping.
4. Experimental Framework
To analyse the behaviour of EIS-CHC we include in the study two classical prototype selection algorithms and three advanced methods, which are described in Section 4.1. In Section 4.2 we present the algorithms' parameters and the data sets considered.
4.1. Prototype Selection Algorithms
The classical PS algorithms used in this study are an edition algorithm (Edited Nearest Neighbor [32]) and a boundary-conservative or condensation algorithm (Condensed Nearest Neighbor [15]). The advanced methods used in the comparison are an edition method (Edition by Normalized Radial Basis Function [14]), a condensation method (Modified Selective Subset [1]) and a hybrid method which combines edition and condensation (Decremental Reduction Optimization Procedure [33]). The use of edition schemes is motivated by the relevance of the analysis of data sets with low overlapping, where there are noisy instances inside the classes, not just at the boundaries. This is a situation where the filter PS algorithms could present an interesting behaviour. The use of condensation methods has as its objective the study of the effect introduced by PS algorithms which keep the instances situated at the boundaries, where the overlapping appears.
Their description is the following:
Edited Nearest Neighbor (ENN) [32]. The algorithm starts with S = T and then each instance in S is removed if it does not agree with the majority of its k nearest neighbours. The ENN filter is considered the standard noise filter and it is usually employed at the beginning of many algorithms. The pseudocode of ENN appears in Algorithm 2, followed by a sketch implementation.
Condensed Nearest Neighbor (CNN) [15]. It begins by randomly selecting one instance belonging to each output class from T and putting them in S. Then, each instance in T is classified using only the instances in S. If an instance is misclassified, it is added to S, thus ensuring that it will be classified correctly. This process is repeated until there are no misclassified instances in T. The pseudocode of CNN appears in Algorithm 3, followed by a sketch implementation.
S ← T;
foreach example xi in S do
    if xi is misclassified by its k nearest neighbours in S then
        S ← S \ {xi};
    end
end
return S;
Algorithm 2: Pseudocode of ENN algorithm
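A brute-force Python sketch of ENN follows. It is our own illustration (not the authors' code) and assumes integer class labels and Euclidean distance.

```python
import numpy as np

def enn(X, y, k=3):
    """Drop every instance misclassified by its k nearest neighbours."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        neighbours = np.argsort(d)[:k]
        if np.bincount(y[neighbours]).argmax() != y[i]:
            keep[i] = False                # majority vote disagrees
    return X[keep], y[keep]
```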
S ← ∅;
fail ← true;
S ← S ∪ {xc1, xc2, ..., xcC}, where xci is any example that belongs to class i;
while fail = true do
    fail ← false;
    foreach example xi in T do
        if xi is misclassified by using S then
            S ← S ∪ {xi};
            fail ← true;
        end
    end
end
return S;
Algorithm 3: Pseudocode of CNN algorithm
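A corresponding sketch of CNN under the same assumptions (ours, not the authors' code):

```python
import numpy as np

def cnn(X, y, seed=0):
    """Absorb every instance that the current subset S misclassifies."""
    rng = np.random.default_rng(seed)
    S = [int(rng.choice(np.flatnonzero(y == c))) for c in np.unique(y)]
    fail = True
    while fail:
        fail = False
        for i in range(len(X)):
            if i in S:
                continue
            d = np.linalg.norm(X[S] - X[i], axis=1)
            if y[S][d.argmin()] != y[i]:   # 1-NN over S misclassifies xi
                S.append(i)
                fail = True
    return X[S], y[S]
```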
S ← ∅; Q ← T;
Sort the examples {xj}, j = 1..n, according to increasing values of their enemy distance Dj;
foreach example xi in T do
    add ← false;
    foreach example xj in T do
        if xj ∈ Q and d(xi, xj) < Dj then
            Q ← Q \ {xj};
            add ← true;
        end
    end
    if add then S ← S ∪ {xi};
    if Q = ∅ then return S;
end
Algorithm 4: Pseudocode of the MSS algorithm
ENRBF estimates the posterior probability of class c for an instance x through normalized Gaussian kernels over the training set T:

P(c|x, T) = Σ_{i∈I_c} Ĝ_i(x; x_i),    (5)

Ĝ_i(x; x_i) = G(x; x_i, σ) / Σ_{j=1..n} G(x; x_j, σ),    (6)

G(x; x_i, σ) = exp(−||x − x_i||² / σ²).    (7)
Table 1. Characteristics of the data sets used in the study.

Data set       #Examples  #Features  #Classes
australian     689        14         2
balanced       625        4          3
bupa           345        7          2
car            1727       6          4
cleveland      297        13         5
contraceptive  1472       9          3
crx            689        15         2
dermatology    366        34         6
ecoli          336        7          2
glass          214        9          7
haberman       305        3          2
iris           150        4          3
led7digit      500        7          10
lymphography   148        18         4
monks          432        6          2
new-thyroid    215        5          3
penbased       10992      16         10
pima           768        8          2
vehicle        846        18         4
wine           178        13         3
wisconsin      683        9          2
satimage       6435       36         7
thyroid        7200       21         3
zoo            100        16         7
The algorithms were run three times for each partition of the ten-fold cross validation. The F1 measure is calculated by averaging the F1 obtained in each training set of the ten-fold cross validation.
Table 2. Wilcoxon test results for the 1-NN case.

R+:      98, 45.5, 89.5, 82.5, 97, 57, 75, 48, 61, 99, 95
R−:      7, 59.5, 15.5, 22.5, 8, 48, 30, 57, 44, 6, 10
p-value: 0.004, 0.6, 0.023, 0.046, 0.005, 0.778, 0.158, 0.778, 0.594, 0.004, 0.008
Table 3. Accuracy (Acc.) and reduction (Red.) of the PS methods with the 1-NN classifier.

Data Set       F1     1-NN Acc.  CNN Acc./Red.  ENN Acc./Red.  MSS Acc./Red.  ENRBF Acc./Red.  DROP3 Acc./Red.  EIS-CHC Acc./Red.
thyroid        0.035  0.9258     0.9007/0.8060  0.9379/0.0608  0.9161/0.6998  0.9258/0.0742    0.8581/0.9566    0.9406/0.9991
lymphography   0.051  0.7387     0.7052/0.5376  0.7547/0.2146  0.7436/0.4084  0.7609/0.1546    0.7674/0.8356    0.7938/0.9572
bupa           0.166  0.6108     0.5907/0.4180  0.5859/0.3775  0.5965/0.2055  0.5789/0.4203    0.5986/0.7005    0.6058/0.9752
haberman       0.169  0.6697     0.6465/0.4622  0.6990/0.3032  0.6405/0.3678  0.7353/0.2647    0.6697/0.8003    0.7154/0.9855
pima           0.217  0.7033     0.6525/0.5080  0.7449/0.2617  0.6797/0.3385  0.6511/0.3490    0.6901/0.8213    0.7501/0.9871
contraceptive  0.224  0.4277     0.4128/0.2679  0.4528/0.5498  0.4196/0.1621  0.4481/0.5500    0.4406/0.7307    0.5180/0.9918
cleveland      0.235  0.5314     0.5052/0.3905  0.5576/0.4554  0.5282/0.3099  0.5644/0.4404    0.4947/0.8317    0.6173/0.9864
crx            0.285  0.7957     0.7913/0.6692  0.8449/0.2013  0.7841/0.4976  0.8551/0.1417    0.7841/0.8779    0.8522/0.9915
australian     0.287  0.8145     0.8043/0.6441  0.8377/0.1514  0.8043/0.5074  0.8609/0.1398    0.8014/0.8847    0.8420/0.9918
monks          0.365  0.7791     0.8252/0.6970  0.7817/0.0411  0.7445/0.3274  0.7619/0.2059    0.6571/0.8670    0.9727/0.9915
balanced       0.455  0.7904     0.7168/0.6345  0.8560/0.1559  0.7825/0.3755  0.8559/0.1150    0.8177/0.8676    0.8929/0.9883
dermatology    0.473  0.9535     0.9454/0.8689  0.9591/0.0528  0.9372/0.7450  0.9618/0.0404    0.9293/0.9232    0.9617/0.9736
vehicle        0.506  0.7010     0.6655/0.5035  0.6963/0.2931  0.6868/0.3740  0.5746/0.5074    0.6099/0.7727    0.6147/0.9757
new-thyroid    0.731  0.9723     0.9487/0.8646  0.9494/0.0579  0.9723/0.7829  0.6981/0.3023    0.9491/0.8853    0.9541/0.9767
glass          0.745  0.7361     0.6973/0.5146  0.6919/0.3214  0.7206/0.3905  0.3956/0.6774    0.6571/0.7415    0.6227/0.9507
ecoli          0.888  0.8070     0.7532/0.5989  0.8248/0.1958  0.7800/0.4726  0.4256/0.5744    0.7263/0.8423    0.8039/0.9653
zoo            0.949  0.9281     0.8289/0.7579  0.9114/0.0815  0.8467/0.7834  0.9281/0.0571    0.9264/0.8351    0.9142/0.9111
car            1.022  0.8565     0.8825/0.7658  0.8513/0.0775  0.8455/0.5489  0.7002/0.2998    0.6221/0.8834    0.8802/0.9897
penbased       1.161  0.9935     0.9853/0.9555  0.9927/0.0065  0.9913/0.9051  0.8834/0.1902    0.9431/0.9719    0.9560/0.9871
led7digit      1.344  0.4020     0.3740/0.2727  0.4920/0.5660  0.4020/0.1629  0.4580/0.2609    0.4060/0.8278    0.6580/0.9649
wisconsin      1.354  0.9557     0.9342/0.8965  0.9657/0.0529  0.9485/0.8210  0.9414/0.0769    0.8913/0.9747    0.9642/0.9941
satimage       1.476  0.9058     0.8835/0.8041  0.9013/0.0890  0.9009/0.6841  0.7302/0.4395    0.8308/0.9101    0.8710/0.9950
wine           1.820  0.9552     0.9157/0.8308  0.9552/0.0337  0.9605/0.7247  0.9663/0.0543    0.9105/0.9101    0.9438/0.9732
iris           2.670  0.9333     0.9200/0.8519  0.9533/0.0474  0.9467/0.7844  0.9400/0.1267    0.9267/0.9230    0.9733/0.9674
Table 4. Wilcoxon test results for the 3-NN case.

R−:      0.5, 41.5, 22.5, 14, 6, 30, 51, 75, 59, 15, 1
p-value: 0.001, 0.507, 0.064, 0.016, 0.004, 0.158, 0.925, 0.158, 0.683, 0.019, 0.001
Table 5. Accuracy (Acc.) and reduction (Red.) of the PS methods with the 3-NN classifier.

Data Set       F1     3-NN Acc.  CNN Acc./Red.  ENN Acc./Red.  MSS Acc./Red.  ENRBF Acc./Red.  DROP3 Acc./Red.  EIS-CHC Acc./Red.
thyroid        0.035  0.9236     0.9083/0.7718  0.9250/0.0759  0.9360/0.6998  0.9258/0.0742    0.8056/0.9812    0.9250/0.9983
lymphography   0.051  0.7739     0.7826/0.5580  0.7530/0.2146  0.7589/0.4084  0.7530/0.1546    0.7053/0.8679    0.8034/0.9535
bupa           0.166  0.6066     0.5845/0.3649  0.6174/0.3775  0.6112/0.2055  0.5789/0.4203    0.6315/0.7649    0.6524/0.9755
haberman       0.169  0.7058     0.6537/0.4281  0.7125/0.3032  0.6959/0.3678  0.7353/0.2647    0.6500/0.8885    0.7219/0.9862
pima           0.217  0.7306     0.6654/0.5049  0.7384/0.2617  0.7176/0.3385  0.6511/0.3490    0.7176/0.8562    0.7684/0.9860
contraceptive  0.224  0.4495     0.4447/0.2475  0.4583/0.5498  0.4528/0.1621  0.4522/0.5500    0.4542/0.7442    0.4875/0.9911
cleveland      0.235  0.5444     0.5247/0.3935  0.5447/0.4448  0.5346/0.3099  0.5677/0.4404    0.4688/0.8775    0.5643/0.9802
crx            0.285  0.8420     0.8203/0.6531  0.8449/0.1560  0.8304/0.4976  0.8522/0.1417    0.7377/0.9274    0.8536/0.9905
australian     0.287  0.8478     0.8203/0.6581  0.8478/0.1514  0.8348/0.5074  0.8478/0.1398    0.7783/0.9293    0.8681/0.9897
monks          0.365  0.9629     0.8658/0.6556  0.9567/0.0411  0.9613/0.3274  0.8143/0.2059    0.6957/0.8678    0.9376/0.9830
balanced       0.455  0.8337     0.7472/0.6240  0.8768/0.1559  0.8304/0.3755  0.8768/0.1150    0.8193/0.9090    0.9009/0.9860
dermatology    0.473  0.9700     0.9511/0.8607  0.9619/0.0310  0.9457/0.7450  0.9646/0.0404    0.8987/0.9262    0.9429/0.9505
vehicle        0.506  0.7175     0.6619/0.4768  0.6927/0.2931  0.6809/0.3740  0.5569/0.5074    0.5545/0.8164    0.6087/0.9634
new-thyroid    0.731  0.9537     0.9485/0.8476  0.9307/0.0579  0.9394/0.7829  0.6981/0.3023    0.8799/0.9189    0.9584/0.9592
glass          0.745  0.7011     0.6493/0.4762  0.6616/0.3214  0.6650/0.3905  0.3842/0.6774    0.5764/0.8105    0.6267/0.9465
ecoli          0.888  0.8067     0.7650/0.5913  0.8126/0.1958  0.7767/0.4726  0.4256/0.5744    0.6636/0.8714    0.7534/0.9583
zoo            0.949  0.9281     0.8328/0.6954  0.9114/0.0815  0.7811/0.7834  0.9114/0.0571    0.8436/0.7859    0.9006/0.8813
car            1.022  0.9231     0.9010/0.7662  0.8930/0.0775  0.9173/0.5489  0.7002/0.2998    0.6887/0.8865    0.8409/0.9853
penbased       1.161  0.9718     0.9536/0.8464  0.9618/0.0287  0.9914/0.9051  0.8799/0.1902    0.8765/0.9783    0.8700/0.9568
led7digit      1.344  0.4520     0.4040/0.2829  0.5460/0.5660  0.4320/0.1629  0.4300/0.2609    0.5320/0.8331    0.6900/0.9509
wisconsin      1.354  0.9600     0.9542/0.8894  0.9657/0.0337  0.9600/0.8210  0.9471/0.0769    0.9099/0.9777    0.9685/0.9908
satimage       1.476  0.8662     0.8444/0.7171  0.8646/0.1322  0.9061/0.6841  0.7316/0.4395    0.7706/0.9367    0.8164/0.9666
wine           1.820  0.9549     0.9327/0.8514  0.9549/0.0337  0.9552/0.7247  0.9719/0.0543    0.9154/0.9207    0.9438/0.9457
iris           2.670  0.9400     0.9400/0.8415  0.9533/0.0474  0.9333/0.7844  0.9467/0.1267    0.8467/0.9326    0.9600/0.9333
Table 6. Wilcoxon test results for the 5-NN case.

R−:      2.5, 28.5, 18, 12, 5, 29.5, 54.5, 78.5, 49, 19, 1
p-value: 0.002, 0.158, 0.03, 0.011, 0.003, 0.136, 0.9, 0.116, 0.826, 0.035, 0.001
Note that when k = 3, the nearest neighbour classifier is more robust in the presence of noise than the 1-NN classifier. Due to this fact, the ENN and ENRBF filters behave similarly to 3-NN when F1 is lower than 0.410, according to Wilcoxon's test. The same effect occurs with DROP3. However, a PS process by EIS-CHC prior to the 3-NN classifier improves the accuracy of the classifier without using PS and also achieves a high reduction of the selected subset.
5.3. Results and Analysis for the 5-Nearest Neighbour Classifier
Tables 6 and 7 present the results of the PS methods with the 5-NN classifier. The analysis of Tables 6 and 7 is the following:
F1 low [0, 0.410] (strong overlapping): EIS-CHC outperforms 5-NN without PS when F1 is low. EIS-CHC presents the best accuracy rates among all the PS algorithms in most of the data sets with the strongest overlapping (Table 7). Considering Wilcoxon's test in Table 6, only EIS-CHC improves the classification capabilities of 5-NN, which reflects the proper election of the most representative instances in the presence of overlapping.
F1 high [0.410, ...] (small overlapping): The situation is similar to the previous case. There is no improvement of the PS algorithms with respect to 5-NN, as the statistical results indicate (see Table 6). Only ENN and EIS-CHC obtain the same performance as not using PS. The comparison between EIS-CHC and the rest of the models indicates that the accuracy of EIS-CHC is always better than or equal to that of the method compared.
In this case, ENN and ENRBF obtain a result similar to the previous subsection (the 3-NN case) where F1 is low, but again EIS-CHC offers a significant improvement in accuracy with respect to the use of the nearest neighbours classifier without PS.
Table 7. Accuracy (Acc.) and reduction (Red.) of the PS methods with the 5-NN classifier.

Data Set       F1     5-NN Acc.  CNN Acc./Red.  ENN Acc./Red.  MSS Acc./Red.  ENRBF Acc./Red.  DROP3 Acc./Red.  EIS-CHC Acc./Red.
thyroid        0.035  0.9292     0.9056/0.7810  0.9250/0.0716  0.9396/0.6998  0.9258/0.0742    0.7901/0.9842    0.9250/0.9978
lymphography   0.051  0.7944     0.7931/0.5736  0.7796/0.2079  0.7922/0.4084  0.7801/0.1546    0.6983/0.8904    0.8423/0.9369
bupa           0.166  0.6131     0.6101/0.3304  0.6137/0.3910  0.6045/0.2055  0.5789/0.4203    0.6137/0.8039    0.6464/0.9710
haberman       0.169  0.6695     0.6402/0.4230  0.7288/0.3061  0.6594/0.3678  0.7353/0.2647    0.7019/0.9346    0.7353/0.9960
pima           0.217  0.7306     0.7044/0.4899  0.7396/0.2626  0.7306/0.3385  0.6511/0.3490    0.7187/0.8866    0.7671/0.9854
contraceptive  0.224  0.4685     0.4542/0.2383  0.4725/0.5374  0.4562/0.1621  0.4535/0.5500    0.4685/0.7690    0.4820/0.9909
cleveland      0.235  0.5545     0.5382/0.3748  0.5676/0.4488  0.5446/0.3099  0.5711/0.4404    0.5222/0.8969    0.6075/0.9725
crx            0.285  0.8551     0.8232/0.6659  0.8507/0.1433  0.8435/0.4976  0.8536/0.1417    0.7014/0.9406    0.8594/0.9874
australian     0.287  0.8478     0.8159/0.6591  0.8609/0.1588  0.8304/0.5074  0.8435/0.1398    0.7188/0.9499    0.8580/0.9865
monks          0.365  0.9475     0.8523/0.6168  0.8855/0.0432  0.9263/0.3274  0.8013/0.2059    0.7490/0.8382    0.8959/0.9784
balanced       0.455  0.8624     0.8080/0.6574  0.8928/0.1351  0.8496/0.3755  0.8831/0.1150    0.8013/0.9234    0.8879/0.9838
dermatology    0.473  0.9646     0.9481/0.8443  0.9592/0.0316  0.9318/0.7450  0.9619/0.0404    0.8772/0.9250    0.9426/0.9375
vehicle        0.506  0.7175     0.6903/0.4641  0.6822/0.2871  0.6844/0.3740  0.5592/0.5074    0.5497/0.8240    0.6123/0.9567
new-thyroid    0.731  0.9398     0.9400/0.8439  0.9119/0.0646  0.9165/0.7829  0.6981/0.3023    0.9119/0.9173    0.9680/0.9385
glass          0.745  0.6685     0.6531/0.4429  0.6652/0.3453  0.6067/0.3905  0.3609/0.6774    0.5813/0.8172    0.6331/0.9216
ecoli          0.888  0.8127     0.7799/0.6035  0.8065/0.1812  0.7889/0.4726  0.4256/0.5744    0.6620/0.8810    0.7351/0.9527
zoo            0.949  0.9364     0.8097/0.6044  0.8964/0.0708  0.7536/0.7834  0.9197/0.0571    0.7125/0.7506    0.8717/0.8470
car            1.022  0.9520     0.9323/0.7748  0.9016/0.0475  0.9196/0.5489  0.7002/0.2998    0.7066/0.8749    0.8368/0.9819
penbased       1.161  0.9618     0.9455/0.8195  0.9482/0.0376  0.9864/0.9051  0.8739/0.1902    0.8551/0.9774    0.8500/0.9432
led7digit      1.344  0.4140     0.3860/0.2878  0.5520/0.5860  0.3940/0.1629  0.4180/0.2609    0.4820/0.8347    0.6500/0.9416
wisconsin      1.354  0.9657     0.9500/0.8886  0.9671/0.0296  0.9686/0.8210  0.9485/0.0769    0.8624/0.9777    0.9657/0.9876
satimage       1.476  0.8740     0.8412/0.7099  0.8708/0.1317  0.9050/0.6841  0.7329/0.4395    0.7416/0.9473    0.8289/0.9603
wine           1.820  0.9605     0.9605/0.8421  0.9605/0.0418  0.9663/0.7247  0.9719/0.0543    0.9157/0.8976    0.9660/0.9295
iris           2.670  0.9600     0.9533/0.8356  0.9600/0.0430  0.6267/0.7844  0.9400/0.1267    0.9133/0.9119    0.9600/0.9148
Independently of the k value selected for the nearest neighbours classifier, when the overlapping of the initial data set is strong (it presents low values of F1), EIS-CHC is a very effective PS algorithm for improving the accuracy rates of the nearest neighbours classifier.
When the overlapping of the data set is low, the statistical test has shown that the PS algorithms are not capable of improving the accuracy of the k-NN without using PS. The benefit of their use is that they keep the accuracy capabilities of the nearest neighbours classifier while reducing the initial data set size.
Considering the results that CNN and MSS present, we must point out that the PS algorithms which keep boundary instances (condensation methods) notably affect the classification capabilities of the k-NN classifier, independently of the overlapping of the data set and the value of k.
Among the classical algorithms, the best behaviour corresponds to ENN. The filter process that ENN introduces outperforms in some cases the classification capabilities of the k-NN, but the election of the most representative prototypes that EIS-CHC develops seems to be the most effective strategy. Nevertheless, ENN in combination with k-NN obtains results similar to k-NN when k ≥ 3, given that the nearest neighbours classifier is more robust in the presence of noise.
Among the more advanced algorithms, the behaviour coincides in most cases with that of the equivalent classic algorithms; MSS behaves very similarly to CNN, and ENRBF to ENN. DROP3, as a hybrid model like EIS-CHC, obtains an intermediate behaviour between condensation and edition methods, because it performs adequately when strong overlapping is present, considering 1-NN. Nevertheless, EIS-CHC always outperforms DROP3 in any case.
6. Concluding Remarks
This paper addresses the analysis of evolutionary prototype selection considering a data set complexity measure based on overlapping, with the objective of predicting when evolutionary prototype selection is effective for a particular problem.
An experimental study has been carried out using data sets from different domains and comparing the results with classical PS algorithms, taking the F1 measure as reference. To extend the analysis of the k-NN classifier we have considered different values of k. The main conclusions reached are the following:
EIS-CHC presents the best accuracy rate when the input data set has strong overlapping, improving even upon condensation algorithms (CNN and MSS), edition schemes (ENN and ENRBF) and hybrid methods, such as DROP3.
EIS-CHC improves the classification accuracy of k-NN when the data sets have strong overlapping, independently of the k value, and obtains a high reduction rate of the data. However, the ENN, ENRBF and DROP3 algorithms are not able to improve the accuracy rate of k-NN when k ≥ 3.
In the case of data sets with low overlapping, the results of the PS algorithms are not conclusive, so none of them can be suggested considering accuracy rates. Therefore, their use is recommended to keep the accuracy capabilities while reducing the initial data set size.
Condensation algorithms, which keep the boundaries (CNN and MSS), have normally shown negative effects on the accuracy of the k-NN classifier.
As we have indicated in the analysis section, the use of this measure can help us to evaluate a data set prior to the evolutionary PS process and to decide whether or not it is adequate for improving the classification capabilities of the k-NN classifier. The results show that when F1 is low (strong overlapping), the best accuracy rates appear using EIS-CHC, while when F1 is high (low overlapping), the PS algorithms do not guarantee an accuracy improvement.
As future work, the effect of data complexity on evolutionary instance selection for training set selection, considering other well-known classification algorithms, will be studied. Another interesting research line is the measurement of data complexity on imbalanced data sets, where we can perform evolutionary under-sampling [12].
Appendix A. Wilcoxon Signed Rank Test
Wilcoxon's test is used for answering the following question: do two samples represent two different populations? It is a non-parametric procedure employed in a hypothesis testing situation involving a design with two samples. It is the analogue of the paired t-test in non-parametric statistical procedures; therefore, it is a pairwise test that aims to detect significant differences between the behaviour of two algorithms.
The null hypothesis for Wilcoxon's test is H0: D = 0; in the underlying
populations represented by the two samples of results, the median of the difference scores equals zero. The alternative hypothesis is H1: D ≠ 0, although H1: D > 0 or H1: D < 0 can also be used as directional hypotheses.
In the following, we describe the test's computations. Let di be the difference between the performance scores of the two algorithms on the i-th out of N data sets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the data sets on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:
R+ = Σ_{di>0} rank(di) + (1/2) Σ_{di=0} rank(di),
R− = Σ_{di<0} rank(di) + (1/2) Σ_{di=0} rank(di).
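In practice the test can be run with standard statistical software. Below is a minimal SciPy sketch with toy score vectors; zero_method="zsplit" reproduces the even split of zero-difference ranks described above.

```python
from scipy.stats import wilcoxon

# accuracy of two algorithms over the same N data sets (toy numbers)
alg_1 = [0.92, 0.74, 0.61, 0.70, 0.75, 0.52, 0.62]
alg_2 = [0.90, 0.71, 0.59, 0.65, 0.65, 0.41, 0.51]

stat, p = wilcoxon(alg_1, alg_2, zero_method="zsplit")
print(stat, p)   # reject H0 when p is below the chosen significance level
```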
9. A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computing, Springer-Verlag, 2003.
10. L.J. Eshelman, "The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination", in Foundations of Genetic Algorithms 1, G.J.E. Rawlins (Ed.), Morgan Kaufmann (1991) 265–283.
11. C. Gagné and M. Parizeau, "Co-evolution of nearest neighbor classifiers", International Journal of Pattern Recognition and Artificial Intelligence 21(5) (2007) 912–946.
12. S. García and F. Herrera, "Evolutionary under-sampling for classification with imbalanced data sets: proposals and taxonomy", Evolutionary Computation, in press (2008).
13. D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
14. M. Grochowski and N. Jankowski, "Comparison of instance selection algorithms II. Results and comments", Proceedings of the 7th International Conference on Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science 3070 (2004) pp. 580–585.
15. P.E. Hart, "The condensed nearest neighbour rule", IEEE Transactions on Information Theory 14 (1968) 515–516.
16. T.K. Ho and H.S. Baird, "Large-scale simulation studies in image pattern recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence 19(10) (1997) 1067–1079.
17. T.K. Ho and M. Basu, "Complexity measures of supervised classification problems", IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3) (2002) 289–300.
18. S.W. Kim and B.J. Oommen, "Enhancing prototype reduction schemes with LVQ3-type algorithms", Pattern Recognition 36 (2004) 1083–1093.
19. S.W. Kim and B.J. Oommen, "On using prototype reduction schemes to optimize kernel-based nonlinear subspace methods", Pattern Recognition 37 (2004) 227–239.
20. L. Kuncheva, "Editing for the k-nearest neighbors rule by a genetic algorithm", Pattern Recognition Letters 16 (1995) 809–814.
21. Y.-H. Li, M. Dong and R. Kothari, "Classifiability-based omnivariate decision trees", IEEE Transactions on Neural Networks 16(6) (2005) 1547–1560.
22. M. Liwicki and H. Bunke, "Handwriting recognition of whiteboard notes: studying the influence of training set size and type", International Journal of Pattern Recognition and Artificial Intelligence 21(1) (2007) 83–98.
23. R.A. Mollineda, J.S. Sánchez and J.M. Sotoca, "Data characterization for effective prototype selection", Proceedings of the IbPRIA 2005, Lecture Notes in Computer Science 3523 (2005) pp. 27–34.
24. A. Asuncion and D.J. Newman, UCI Machine Learning Repository, [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Schools of Information and Computer Science, 2007.
25. I.S. Oh, J.S. Lee and B.R. Moon, "Hybrid genetic algorithms for feature selection", IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11) (2004) 1424–1437.
26. A.N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A Database Perspective, Springer-Verlag, 2004.
27. X. Qiu and L. Wu, "Nearest neighbor discriminant analysis", International Journal of Pattern Recognition and Artificial Intelligence 20(8) (2006) 1245–1259.
28. J.S. Sánchez, R.A. Mollineda and J.M. Sotoca, "An analysis of how training data complexity affects the nearest neighbour classifiers", Pattern Analysis & Applications 10 (2007) 189–201.
29. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 2nd ed., Chapman and Hall, 2003.
30. S. Singh, "Multiresolution estimates of classification complexity", IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12) (2003) 1534–1539.
31. S.-Y. Ho, C.-C. Liu and S. Liu, "Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm", Pattern Recognition Letters 23(13) (2002) 1495–1503.
32. D.L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data", IEEE Transactions on Systems, Man and Cybernetics 2(3) (1972) 408–421.
33. D.R. Wilson and T.R. Martinez, "Reduction techniques for instance-based learning algorithms", Machine Learning 38 (2000) 257–286.
34. B. Yang, X. Su and Y. Wang, "Distributed learning based on chips for classification with large-scale dataset", International Journal of Pattern Recognition and Artificial Intelligence 21(5) (2007) 899–920.
35. J.H. Zar, Biostatistical Analysis, 4th ed., Pearson, 1999.
2.3.
2.3.1.
Salvador García
salvagl@decsai.ugr.es
Department of Computer Science and Artificial Intelligence, University of Granada,
Granada, 18071, Spain

Francisco Herrera
herrera@decsai.ugr.es
Department of Computer Science and Artificial Intelligence, University of Granada,
Granada, 18071, Spain
Abstract
Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed in order to find a treatment for this problem, such as modifying methods or applying a preprocessing stage. Within the preprocessing focused on balancing data, two tendencies exist: reducing the set of examples (under-sampling) or replicating minority class examples (over-sampling).
Under-sampling with imbalanced data sets could be considered as a prototype selection procedure with the purpose of balancing data sets to achieve a high classification rate, avoiding the bias towards majority class examples.
Evolutionary algorithms have been used for classical prototype selection showing good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods called Evolutionary Under-Sampling, which take into consideration the nature of the problem and use different fitness functions for getting a good trade-off between balance of class distribution and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art under-sampling methods. The results have been contrasted by using non-parametric statistical procedures and show that evolutionary under-sampling outperforms the non-evolutionary models when the degree of imbalance is increased.
Keywords
Classification, class imbalance problem, under-sampling, prototype selection, evolutionary algorithms
1 Introduction
In recent years, the class imbalance problem has become one of the emergent challenges in machine learning. The problem appears when the data present a class imbalance, which consists of containing many more examples of one class than of the other (Chawla et al., 2004; Xie and Qiu, 2007). Many applications have appeared in learning with imbalanced domains, such as fraud detection, intrusion detection (Cieslak et al., 2006), biological and medical identification (Cohen et al., 2006), etc.
Usually, the instances are grouped into two classes: the majority or negative class, and the minority or positive class. The latter class, in imbalanced domains, usually represents a concept with the same or greater interest than the negative class. A standard classifier might ignore the importance of the minority class because its representation inside the data set is not strong enough. As a classical example, if the ratio of imbalance presented in the data is 1:100 (that is, there is one positive instance versus one hundred negatives), the error of ignoring this class is only 1%.
A large number of approaches have been proposed to deal with the class imbalance problem. These approaches can be divided into three groups, depending on the way they work:
At the algorithmic level, there are the internal approaches. This group of methods tries to adapt the decision threshold to impose a bias on the minority class, or to improve the prediction performance by adjusting weights for each class. In any case, they are based on modifying previous algorithms or making specific proposals for dealing with imbalanced data (Grzymala-Busse et al., 2005; Huang et al., 2006). Recently, in the field of evolutionary learning, some studies have been presented analyzing the behavior of XCS (Bernado-Mansilla and Garrell-Guiu, 2003; Kovacs and Kerber, 2006; Butz et al., 2006) for imbalanced data sets (Orriols-Puig
a new evolutionary method. This study provides a method that uses a fitness function designed for performing a prototype selection process with the aim of balancing data, improving the generalization capability and reducing the training data.
Combining the data and the algorithmic levels, there are the boosting approaches. This set is composed of methods which consist of ensembles with the objective of improving the performance of weak classification algorithms. In boosting, the performance of weak classifiers is improved by focusing on hard examples which are difficult to classify. These approaches learn a way of combining several classifiers by using weighted examples, in order to increase the attention paid to hard examples. Thus, they preprocess the data through the incorporation of weights. In imbalanced data, as well as handling weights associated with each hard example, the replication of minority class instances is also used. Moreover, they constitute an adapted ensemble of classifiers developed depending on the data, so the algorithms are also modified to obtain appropriate models to tackle imbalanced domains. Two main approaches have been developed with promising results in this group: the SMOTEBoost approach (Chawla et al., 2003) and the DataBoost-IM approach (Guo and Viktor, 2004).
Re-sampling approaches can be categorized into two tendencies: under-sampling, which consists of reducing the data by eliminating examples belonging to the majority class with the objective of equalizing the number of examples of each class; and over-sampling, which aims to replicate or generate new positive examples in order to gain importance (Chawla et al., 2002).
The Imbalance Ratio (IR) is defined as

IR = N− / N+,    (1)

where N− is the number of instances belonging to the majority class, and N+ is the number of instances belonging to the minority class. Logically, a data set is imbalanced when its IR is greater than 1. We will consider that an IR above 9 represents a high IR in a data set, due to the fact that a classifier ignoring the minority class instances supposes an error of only 0.1 in accuracy, which has little relevance. We will study separately the data sets belonging to this group in order to analyze the behaviour of the proposals over them, given that when data sets present a high IR, the difficulty of the learning process increases.
EAs have been used for data reduction with promising results. They have been successfully used for feature selection (Whitley et al., 1998; Guerra-Salcedo and Whitley, 1998; Guerra-Salcedo et al., 1999; Papadakis and Theocharis, 2006; Wang et al., 2007; Sikora and Piramuthu, 2007) and for PS (Cano et al., 2003, 2005), the latter being called Evolutionary Prototype Selection (EPS). EAs also show a good behaviour for training set selection in terms of getting a trade-off between precision and interpretability with classification rules (Cano et al., 2007).
PS is an instance reduction process whose results are used as a reference set of examples for the Nearest Neighbour rule (1-NN) in order to classify new patterns. This reduces the number of rows in the data set with no loss of classification accuracy, and even with an improvement in the classifier. Various approaches to PS algorithms have been proposed in the literature; see (Wilson and Martinez, 2000) for a review. A distinction is needed between those methods that are centered on an efficient selection of prototypes in order to increase or maintain the global accuracy rate and reduce the size of the training data, and those that are focused on balancing data by selecting samples in order to prevent bad behaviours in a subsequent classification process.
In this paper, we propose the use of EAs for under-sampling imbalanced data sets; we call it the Evolutionary Under-Sampling (EUS) approach. The objective is to increase the accuracy of the classifier by means of removing instances mainly belonging to the majority class. In fact, the fitness functions proposed are designed to achieve a good trade-off between reduction, data balancing and accuracy in classification. We propose eight EUS methods and categorize them into a taxonomy depending on their objective, scheme of selection and metrics of performance employed.
We will distinguish two levels of imbalance degree among data sets. A high degree of imbalance may have a remarkable influence on performance in a classification task and may cause problems in preprocessing stages carried out by many algorithms at the data level. For this reason, we analyze the use of the EUS methods under these conditions by empirically comparing the different methods among themselves and arranging them into a taxonomy. In addition to this, we compare our approach with other under-sampling methods studied in the literature. The empirical study has been contrasted via non-parametric statistical procedures.
                  Positive prediction    Negative prediction
Positive class    True Positive (TP)     False Negative (FN)
Negative class    False Positive (FP)    True Negative (TN)

From this matrix the usual rates are obtained: the true positive rate TPrate = TP/(TP + FN), the false negative rate FNrate = FN/(TP + FN), the true negative rate TNrate = TN/(FP + TN) and the false positive rate FPrate = FP/(FP + TN).
The goal of a classifier is to minimize the false positive and false negative rates or, in a similar way, to maximize the true positive and true negative rates.
In (Barandela et al., 2003) a metric called the Geometric Mean (GM) was indicated:

GM = sqrt(TPrate · TNrate).    (2)
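A direct computation of GM from a vector of predictions could look as follows (our sketch; the positive class is assumed to be encoded as 1 and the negative class as 0):

```python
import numpy as np

def geometric_mean(y_true, y_pred):
    """GM of expression (2): sqrt(TPrate * TNrate)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
```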
The 1-NN classifier is used for measuring the classification rate, clas_rat, associated with S. It denotes the percentage of correctly classified objects from TR using only S to find the nearest neighbour. For each object y in S, the nearest neighbour is searched for among those in the set S \ {y}. In turn, perc_red is defined as

perc_red = 100 · (|TR| − |S|) / |TR|.    (3)

The objective of the EAs is to maximize the fitness function defined, i.e., to maximize the classification rate and minimize the number of instances selected. The EAs with this fitness function will be denoted with the extension PS in their name.
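The leave-one-out style evaluation of clas_rat described above can be sketched as follows (our illustration, assuming Euclidean distance and a boolean selection mask over TR; not the authors' implementation):

```python
import numpy as np

def clas_rat(X_tr, y_tr, sel):
    """Percentage of TR classified correctly by 1-NN using only S = TR[sel]."""
    S = np.flatnonzero(sel)
    hits = 0
    for i in range(len(X_tr)):
        cand = S[S != i]                 # for y in S, search S \ {y}
        if cand.size == 0:
            continue
        d = np.linalg.norm(X_tr[cand] - X_tr[i], axis=1)
        hits += int(y_tr[cand[d.argmin()]] == y_tr[i])
    return 100.0 * hits / len(X_tr)
```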
Crossover operator for data reduction: In order to achieve a good reduction rate, the Heuristic Uniform Crossover (HUX) implemented for CHC undergoes a change that makes the inclusion of instances in the selected subset more difficult. Therefore, if HUX switches a bit on in a gene, the bit can be switched off again with a certain probability (it will be specified in Section 5.1).
As the evolutionary computation method at the core of EPS, we have used the CHC model (Eshelman, 1991; Cano et al., 2003) and the Intelligent Genetic Algorithm (IGA) (Ho et al., 2002):
CHC is a classical evolutionary model that introduces different features to obtain a trade-off between exploration and exploitation, such as incest prevention, reinitialization of the search process when it becomes blocked, and the competition between parents and offspring in the replacement process. Recently, it has been used in many applications, for example as a method for optimizing learning models (Alcalá et al., 2007) and for information processing (Alba
If the term GM appears, it means that the model uses the Geometric Mean as the evaluator of accuracy. Otherwise, the term AUC appears, for evaluation with the AUC measure.
If the term GS appears, the selection scheme is Global Selection, whereas the term MS implies that the selection scheme is Majority Selection.
The method will be an EBUS or a EUSCM model, and this fact is specified in its name.
The EUS approaches have been developed by using the CHC evolutionary model. Figure 1 summarizes the proposed taxonomy for the EUS approach and includes the 8 methods studied in this paper.
Evolutionary Under-Sampling:
- Evolutionary Balancing Under-Sampling (EBUS): Global Selection (EBUS-GS-GM, EBUS-GS-AUC) and Majority Selection (EBUS-MS-GM, EBUS-MS-AUC).
- EUSCM: Global Selection (EUSCM-GS-GM, EUSCM-GS-AUC) and Majority Selection (EUSCM-MS-GM, EUSCM-MS-AUC).
EBUS-GS-GM: the selection is performed over the whole data set, and the Geometric Mean measures the accuracy on training data. Its fitness function corresponds to expression 4:

FitnessBal(S) = g − |1 − n+/n−| · P   if n− > 0,
FitnessBal(S) = g − P                 if n− = 0,    (4)

where g is the Geometric Mean on training data, n+ and n− are the numbers of selected positive and negative instances, and P is a penalty factor.
EBUS-MS-GM: in this model the selection is only performed over examples belonging to the majority class. Its fitness function corresponds to expression 5:

FitnessBal(S) = g − |1 − N+/n−| · P   if n− > 0,
FitnessBal(S) = g − P                 if n− = 0.    (5)

Given that the instances belonging to the positive class are not affected in this model, N+ is a constant that represents the number of original positive instances within the training data set.
EBUS-GS-AUC: This model is obtained from the first one described in this section by replacing the Geometric Mean with the AUC measure (see Section 2) for measuring the accuracy on training data. The fitness function corresponds to expression 6:

FitnessBal(S) = AUC − |1 − n+/n−| · P   if n− > 0,
FitnessBal(S) = AUC − P                 if n− = 0.    (6)

Although the AUC measure is totally valid for achieving a good balance between the accuracy in both classes, it does not control the resulting balance of selected instances in both classes.
EBUS-MS-AUC: Using AUC as the performance measure, this model does not remove examples belonging to the positive class. The fitness function it employs corresponds to expression 7:

FitnessBal(S) = AUC − |1 − N+/n−| · P   if n− > 0,
FitnessBal(S) = AUC − P                 if n− = 0.    (7)
The AUC measure involves itself in a trade-off between improving the accuracy rate over positive instances and not losing accuracy over negative instances.
EUSCM-MS-AUC: This model is the same as the previous one, except that the selection is only performed over examples belonging to the majority class. The fitness function used corresponds to expression 9.
5 Experimental Framework
This section describes the methodology followed in the experimental study of the compared under-sampling methods. We will explain the configuration of the experiment: the data sets used and the parameters for the algorithms. The PS methods used in the study and the under-sampling proposals found in the specialized literature are (for a detailed description, see the Appendix):
Prototype Selection Methods:
Instance-Based 3 (IB3) (Aha et al., 1991): an incremental instance selection algorithm which introduces the concept of acceptability in the selection.
Decremental Reduction Optimization Procedure 3 (DROP3) (Wilson and Martinez, 2000): it is based on the rule "any instance incorrectly classified by its k nearest neighbours is removed".
Data set          #Examples  #Attributes  %Class(min., maj.)  IR
GlassBWNFP        214        9            (35.51, 64.49)      1.82
EcoliCP-IM        220        7            (35.00, 65.00)      1.86
Pima              768        8            (34.77, 66.23)      1.9
GlassBWFP         214        9            (32.71, 67.29)      2.06
German            1000       20           (30.00, 70.00)      2.33
Haberman          306        3            (26.47, 73.53)      2.68
Splice-ie         3176       60           (24.09, 75.91)      3.15
Splice-ei         3176       60           (23.99, 76.01)      3.17
GlassNW           214        9            (23.93, 76.17)      3.19
VehicleVAN        846        18           (23.52, 76.48)      3.25
EcoliIM           336        7            (22.92, 77.08)      3.36
New-thyroid       215        5            (16.28, 83.72)      4.92
Segment1          2310       19           (14.29, 85.71)      6.00
EcoliIMU          336        7            (10.42, 89.58)      8.19
Optdigits0        5564       64           (9.90, 90.10)       9.10
Satimage4         6435       36           (9.73, 90.27)       9.28
Vowel0            990        13           (9.01, 90.99)       10.1
GlassVWFP         214        9            (7.94, 92.06)       10.39
EcoliOM           336        7            (6.74, 93.26)       13.84
GlassContainers   214        9            (6.07, 93.93)       15.47
Abalone9-18       731        9            (5.75, 94.25)       16.68
GlassTableware    214        9            (4.2, 95.8)         22.81
YeastCYT-POX      483        8            (4.14, 95.86)       23.15
YeastME2          1484       8            (3.43, 96.57)       28.41
YeastME1          1484       8            (2.96, 97.04)       32.78
YeastEXC          1484       8            (2.49, 97.51)       39.16
Car               1728       6            (3.99, 96.01)       71.94
Abalone19         4177       9            (0.77, 99.23)       128.87
Table 3: Parameters considered for the algorithms.

Algorithm  Parameters
IB3        Accept. Level = 0.9, Drop Level = 0.7
EPS-CHC    Pop = 50, Eval = 10000, α = 0.5
EPS-IGA    Pop = 10, Eval = 10000, α = 0.5
RUS        Balancing Ratio = 1:1
SBC        Balancing Ratio = 1:1, N. Clusters = 3
EUS        Pop = 50, Eval = 10000, P = 0.2, Prob. inclusion HUX = 0.25
not completely independent; therefore the results present neither a normal distribution nor homogeneity of variance. In this situation, we consider the use of non-parametric tests, according to the recommendations made in (Demšar, 2006).
As such, these non-parametric tests can be applied to classification accuracies, error ratios or any other measure for evaluating techniques, even including model sizes and computation times. Empirical results suggest that they are also more powerful than parametric tests. Demšar recommends a set of simple, safe and robust non-parametric tests for statistical comparisons of classifiers. We briefly describe the two tests used in this study.
The first one is Friedman's test (Sheskin, 2003), a non-parametric equivalent of the repeated-measures ANOVA. Its null hypothesis states that all the algorithms are equivalent, so a rejection of this hypothesis implies the existence of differences among the performance of all the algorithms studied. After this, a post-hoc test can be used to find whether the control or proposed algorithm presents statistical differences with regard to the remaining methods in the comparison. The simplest of them is the Bonferroni-Dunn test, but we can use more powerful tests that control the family-wise error rate and reject more hypotheses than the Bonferroni-Dunn test; for example, Holm's test.
Because Friedman's test can be too conservative, we also use a derivation of it, Iman and Davenport's test (Iman and Davenport, 1980). The descriptions and computations of the tests are explained in (Demšar, 2006).
As a post-hoc test of the Friedman statistic, we use Holm's procedure (Holm, 1979), a multiple comparison procedure that works with a control algorithm (normally, the best one is chosen) and compares it with the remaining methods. The results obtained in each comparison using Holm's procedure will be reported through p-values.
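As an illustration of this pipeline (not the authors' code), the following Python sketch applies Friedman's test and Holm's step-down procedure to a hypothetical results matrix; it assumes SciPy and uses the normal approximation for rank differences given in (Demšar, 2006).

import numpy as np
from scipy import stats

scores = np.random.rand(28, 5)          # hypothetical (data sets x algorithms) results
n, k = scores.shape

# Friedman's test: null hypothesis = all algorithms are equivalent
stat, p_value = stats.friedmanchisquare(*scores.T)
print(f"Friedman statistic = {stat:.3f}, p = {p_value:.4f}")

# average rankings (rank 1 = best on each data set)
ranks = np.mean([stats.rankdata(-row) for row in scores], axis=0)
control = int(np.argmin(ranks))         # best-ranked algorithm as control

# Holm's step-down procedure over the rank differences
se = np.sqrt(k * (k + 1) / (6.0 * n))
p = 2 * stats.norm.sf(np.abs(ranks - ranks[control]) / se)
for i, j in enumerate(jj for jj in np.argsort(p) if jj != control):
    threshold = 0.05 / (k - 1 - i)      # adjusted significance level
    verdict = "rejected" if p[j] <= threshold else "not rejected"
    print(f"algorithm {j}: p = {p[j]:.4f} vs {threshold:.4f} -> {verdict}")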
Table 4 shows the averages and standard deviations of the results offered by the PS algorithms over the imbalanced data sets. Each column shows:
The PS method employed.
The percentage of reduction with respect to the original data set size. Furthermore, the percentage of reduction associated with each class is shown in subsequent columns.
The accuracy for each class using a 1-NN classifier (a+ and a-), where the subindex tra refers to training data and the subindex tst refers to test data. The GM value is also shown for training and test data.
Finally, the AUC measure on test data is reported.
None indicates that no balancing method is employed (the original data set is used for classification with 1-NN).
Table 4: Average results for the PS algorithms over imbalanced data sets (mean ± SD).

PS Method  %Red           %Red-          %Red+          a-tra            a+tra            GMtra            a-tst            a+tst            GMtst            AUCtst
none       0.0 ± 0.0      0.0 ± 0.0      0.0 ± 0.0      0.9399 ± 0.1832  0.6414 ± 0.1514  0.7485 ± 0.1635  0.9387 ± 0.1831  0.6175 ± 0.1485  0.6958 ± 0.1576  0.7606 ± 0.1648
IB3        61.82 ± 1.49   62.00 ± 1.49   80.60 ± 1.70   0.9196 ± 0.1812  0.475 ± 0.1302   0.5965 ± 0.146   0.9227 ± 0.1815  0.4615 ± 0.1284  0.5267 ± 0.1372  0.6746 ± 0.1552
DROP3      91.76 ± 1.81   94.00 ± 1.83   78.13 ± 1.67   0.8879 ± 0.1781  0.7027 ± 0.1584  0.7657 ± 0.1654  0.8761 ± 0.1769  0.6299 ± 0.15    0.6751 ± 0.1553  0.7359 ± 0.1621
EPS-CHC    98.85 ± 1.88   99.00 ± 1.88   97.17 ± 1.86   0.9735 ± 0.1865  0.5747 ± 0.1433  0.6757 ± 0.1553  0.9584 ± 0.185   0.5183 ± 0.1361  0.6037 ± 0.1468  0.7206 ± 0.1604
EPS-IGA    89.49 ± 1.79   90.00 ± 1.80   81.91 ± 1.71   0.9767 ± 0.1868  0.7206 ± 0.1604  0.8044 ± 0.1695  0.9459 ± 0.1838  0.5919 ± 0.1454  0.6691 ± 0.1546  0.7516 ± 0.1638
Table 5: Statistics and critical values for Friedman's and Iman-Davenport's tests.

Test            Critical value  Hypothesis
Friedman        9.488           rejected
Iman-Davenport  2.456           rejected
Table 5 indicates that both the Friedman and Iman-Davenport statistics are higher than their associated critical values, so the hypothesis of equivalence of results is rejected.
[Figure 4: Friedman average rankings of 1-NN, IB3, DROP3, EPS-CHC and EPS-IGA over all imbalanced data sets.]
[Figure 5: Holm's test p-values for IB3, DROP3, EPS-CHC and EPS-IGA, with 1-NN as control.]
cannot state that 1-NN is better than EPS-IGA preprocessing, although EPS-IGA does not obtain a better ranking value than 1-NN. The exhaustive search performed by EPS-IGA and its reduction capability (lower than EPS-CHC) over the training data find subsets of instances that over-fit the training set but, in test accuracy, perform worse on imbalanced domains than not using any PS method. Note that the EPS-CHC algorithm performs notably poorly on highly imbalanced data sets, so the excessive reduction achieved by this method does not produce any benefit in imbalanced domains.
6.2 EUS Methods
As a second objective, we analyze all the proposed EUS models over all imbalanced data sets. The 8 algorithms that compose the taxonomy explained in Section 4 will be analyzed in terms of efficacy and efficiency in order to obtain the most appropriate configuration of EUS over a set of imbalanced data sets. Firstly, we study the EUS models on the full set of data considered. Then, we divide the data sets into two groups: those with an IR below 9 and those with an IR above 9. All the studies include a statistical analysis of the results.
Beginning with the study that considers the 28 imbalanced data sets, Table 6 shows the averages and standard deviations of the results offered by the proposed models. It follows the same structure as Table 4.
Table 6: Average results for the proposed models over imbalanced data sets (mean ± SD).

EUS Method    %Red          %Red-         %Red+         a-tra            a+tra            GMtra            a-tst            a+tst            GMtst            AUCtst
EBUS-MS-AUC   70.04 ± 1.58  81.00 ± 1.70  0.00 ± 0.00   0.8473 ± 0.174   0.9323 ± 0.1825  0.8878 ± 0.1781  0.8289 ± 0.1721  0.8189 ± 0.171   0.7955 ± 0.1686  0.8071 ± 0.1698
EBUS-MS-GM    69.93 ± 1.58  80.00 ± 1.69  0.00 ± 0.00   0.8504 ± 0.1743  0.9252 ± 0.1818  0.8862 ± 0.1779  0.8319 ± 0.1724  0.8188 ± 0.171   0.7971 ± 0.1687  0.8085 ± 0.1699
EBUS-GS-AUC   96.30 ± 1.85  98.09 ± 1.87  82.12 ± 1.71  0.8749 ± 0.1768  0.9259 ± 0.1818  0.8991 ± 0.1792  0.8566 ± 0.1749  0.7826 ± 0.1672  0.7872 ± 0.1677  0.8024 ± 0.1693
EBUS-GS-GM    96.23 ± 1.85  98.00 ± 1.87  82.13 ± 1.71  0.8812 ± 0.1774  0.9195 ± 0.1812  0.8996 ± 0.1792  0.8595 ± 0.1752  0.7863 ± 0.1676  0.7927 ± 0.1683  0.8058 ± 0.1696
EUSCM-MS-AUC  76.86 ± 1.66  90.00 ± 1.80  0.00 ± 0.00   0.8639 ± 0.1757  0.9371 ± 0.1829  0.8961 ± 0.1789  0.8285 ± 0.172   0.8084 ± 0.1699  0.7795 ± 0.1669  0.8014 ± 0.1692
EUSCM-MS-GM   76.18 ± 1.65  89.00 ± 1.79  0.00 ± 0.00   0.8714 ± 0.1764  0.9313 ± 0.1824  0.8983 ± 0.1791  0.8354 ± 0.1727  0.8081 ± 0.1699  0.7861 ± 0.1676  0.805 ± 0.1696
EUSCM-GS-AUC  94.46 ± 1.84  95.00 ± 1.85  84.01 ± 1.73  0.9144 ± 0.1807  0.9116 ± 0.1804  0.9092 ± 0.1802  0.8916 ± 0.1784  0.7374 ± 0.1623  0.7712 ± 0.166   0.797 ± 0.1687
EUSCM-GS-GM   94.34 ± 1.84  95.00 ± 1.84  84.19 ± 1.73  0.9155 ± 0.1808  0.9054 ± 0.1798  0.9068 ± 0.18    0.8894 ± 0.1782  0.7278 ± 0.1612  0.7575 ± 0.1645  0.7912 ± 0.1681
By analyzing Table 6, we can point out the following:
The best average results are offered by the EBUS models, measuring performance with both GM accuracy and AUC.
An observable difference exists between the use of global selection and majority selection. In all cases, majority selection is preferable to global selection.
The use of GM or AUC in the fitness function does not greatly affect the results obtained.
We are interested in checking whether these differences are significant by using non-parametric statistical tests. To do so, we compute the average rankings using Friedman's test over the results obtained in all imbalanced data sets, as well as over the results on data sets with IR < 9 and IR > 9. In Figures 6, 7 and 8 (which follow the same scheme as Figure 4), we represent the ranking values for each algorithm and for the GM and AUC measures. With these values, we have computed Iman-Davenport's statistic (considering a level of confidence α = 0.05) and the results are shown in Table 7:
[Figure 6: Friedman rankings (GM and AUC) for all EUS models over all imbalanced data sets.]
[Figure 7: Friedman rankings (GM and AUC) for all EUS models over imbalanced data sets with IR < 9.]
Table 7: Iman-Davenport statistics and critical values over the EUS models.

Imbalanced Data Sets  Statistic for GM  Statistic for AUC  Critical value  Hypothesis
All                   1.099             1.049              2.058           both non-rejected
IR < 9                0.575             0.716              2.112           both non-rejected
IR > 9                2.355             1.736              2.112           rejected for GM measure
[Figure 8: Friedman rankings (GM and AUC) for all EUS models over imbalanced data sets with IR > 9.]
Since Iman-Davenport's test is more powerful than Friedman's test, if it is not able to reject the null hypothesis, Friedman's test cannot do it either.
An analysis based upon the rankings computed following the guidelines of Friedman's test allows us to state the following:
With respect to imbalanced data sets with IR < 9 (Figure 7):
The parameter addressed to balancing the data (the P factor) lacks interest when the data is not imbalanced enough. An EUSCM model obtains good results without balancing mechanisms. Hence, in general, the EUSCM approach behaves better than EBUS. However, the best performing method is EBUS-MS-AUC, because it obtains low rankings in both measures, even though it is the best neither in GM (EUSCM-GS-AUC outperforms it) nor in AUC (EUSCM-MS-AUC is the best in this case).
The differences between the use of global or majority selection, and of GM or AUC in the fitness function, do not follow a specific bias towards the best choice.
With respect to imbalanced data sets with IR > 9 (Figure 8):
When the IR becomes high, a GS mechanism makes no sense due to the reduced number of examples belonging to the minority class. Thus, the MS mechanism obtains better results than the GS mechanism.
We can observe that the EBUS models behave better than the EUSCM models. Therefore, a balancing mechanism may help the under-sampling process under extreme circumstances of imbalance.
In particular, an algorithm belonging to the group of EBUS models with majority selection, EBUS-MS-GM, is the best performing method in this case.
In spite of the conclusion obtained from Iman-Davenport's test, namely that there are no notable differences among the models, we have to choose a certain model for a comparison with the state-of-the-art techniques in order to stress the benefit of using EUS. Thus, we select the most accurate model: EBUS-MS-GM, which presents the best result on highly imbalanced data sets (IR > 9) and when considering all of them (see Figure 6).
Finally, Figure 9 shows a set of bar charts representing the run-time spent by each type of EUS model on some data sets with different IRs. Obviously, they are influenced by the size of the data sets, since chromosome size grows with the data set size. On the other hand, it can be observed that GS is less affected by the IR, whereas MS is strongly influenced by it. On the pima data set, with a low IR, the run-time of the EUS model with MS is high because of the evaluation cost of the minority class examples, which are retained in all evaluations. In EBUS, this effect is more notable due to its interest in balancing both classes. However, when the IR is high (as in the case of the yeastEXC data set), the EBUS-MS model is favoured in efficiency by its interest in balancing.
[Figure 9: Run-time in seconds of the EUS models (EUSCM-GS, EUSCM-MS, EBUS-GS, EBUS-MS) on YeastEXC (1484, 39.16), Pima (768, 1.9), GlassTableWare (214, 22.81) and GlassBWNFP (214, 1.82); each data set is annotated with (#examples, IR).]
Table 8: Average results obtained for the state-of-the-art methods and the proposed algorithms chosen over imbalanced data sets (mean ± SD).

Method      %Red          %Red-          %Red+         a-tra            a+tra            GMtra            a-tst            a+tst            GMtst            AUCtst
none        0.00 ± 0.00   0.00 ± 0.00    0.00 ± 0.00   0.9399 ± 0.1832  0.6414 ± 0.1514  0.7485 ± 0.1635  0.9387 ± 0.1831  0.6175 ± 0.1485  0.6958 ± 0.1576  0.7606 ± 0.1648
US-CNN+TL   81.31 ± 1.70  96.00 ± 1.85   0.00 ± 0.00   0.6949 ± 0.1575  0.8975 ± 0.179   0.7649 ± 0.1653  0.7093 ± 0.1592  0.8444 ± 0.1737  0.7193 ± 0.1603  0.7618 ± 0.1649
US-CNN      72.95 ± 1.61  85.00 ± 1.746  0.00 ± 0.00   0.8702 ± 0.1763  0.6882 ± 0.1568  0.747 ± 0.1633   0.8855 ± 0.1778  0.6882 ± 0.1568  0.7195 ± 0.1603  0.7696 ± 0.1658
CPM         81.12 ± 1.70  84.00 ± 1.74   51.74 ± 1.36  0.8854 ± 0.1778  0.5778 ± 0.1437  0.6906 ± 0.157   0.898 ± 0.1791   0.6345 ± 0.1505  0.7039 ± 0.1586  0.7487 ± 0.1635
NCL         10.04 ± 0.60  13.00 ± 0.69   0.00 ± 0.00   0.8966 ± 0.1789  0.822 ± 0.1713   0.8378 ± 0.173   0.8907 ± 0.1784  0.7162 ± 0.1599  0.7385 ± 0.1624  0.7862 ± 0.1676
OSS         76.37 ± 1.65  90.00 ± 1.78   0.00 ± 0.00   0.838 ± 0.173    0.8177 ± 0.1709  0.8067 ± 0.1697  0.8475 ± 0.174   0.7543 ± 0.1641  0.7455 ± 0.1632  0.7837 ± 0.1673
RUS         69.28 ± 1.57  79.0 ± 1.69    0.0 ± 0.00    0.8062 ± 0.1697  0.8425 ± 0.1735  0.8222 ± 0.1714  0.8045 ± 0.1695  0.8045 ± 0.1695  0.7757 ± 0.1664  0.7892 ± 0.1679
SBC         76.84 ± 1.67  90.00 ± 1.79   0.00 ± 0.00   0.3275 ± 0.1082  0.9279 ± 0.182   0.3458 ± 0.1111  0.3268 ± 0.108   0.8857 ± 0.1779  0.3382 ± 0.1099  0.6063 ± 0.1472
TL          6.67 ± 0.49   9.00 ± 0.56    0.00 ± 0.90   0.9191 ± 0.1812  0.7804 ± 0.1669  0.8241 ± 0.1716  0.9079 ± 0.1801  0.6925 ± 0.1573  0.7338 ± 0.1619  0.7829 ± 0.1672
EBUS-MS-GM  69.93 ± 1.58  80.00 ± 1.69   0.00 ± 0.00   0.8504 ± 0.1743  0.9252 ± 0.1818  0.8862 ± 0.1779  0.8319 ± 0.1724  0.8188 ± 0.171   0.7971 ± 0.1687  0.8085 ± 0.1699
Table 9: Statistics and critical values for Friedman's and Iman-Davenport's tests.

Test            Statistic for GM  Statistic for AUC  Critical value  Hypotheses
Friedman        74.170            78.789             16.919          all rejected
Iman-Davenport  11.261            12.282             1.918           all rejected
Therefore, we can apply Holm's procedure as a post-hoc test in order to detect the set of methods which are significantly worse than the control method. Figures 10 and 11 display the p-values and the thresholds of significance for Holm's procedure with α = 0.05 and α = 0.10. The control method is set as the one that achieves the highest performance in GM and AUC, respectively.
[Figure 10: Holm's test on all data sets with GM; the control algorithm is EBUS-MS-GM.]
The EBUS-MS-GM model is the one that achieves the best ranking, so it is the control method in both comparisons. As can be seen in both figures, EBUS-MS-GM outperforms five under-sampling methods: SBC, CPM, US-CNN, US-CNN+TL and the non-application of under-sampling. Although EBUS-MS-GM obtains better performance than the four remaining algorithms, Holm's procedure is not able to detect these differences as significant, whether measuring performance with GM or with AUC.
[Figure 11: Holm's test on all data sets with AUC; the control algorithm is EBUS-MS-GM.]
One of the factors that makes learning on imbalanced domains more difficult, as we have already commented, is the increase in the degree of imbalance between classes. In relation to this, we carry out a second study comprising the four algorithms which show no statistical differences with respect to the EBUS-MS-GM model, dividing the group of imbalanced data sets into two subgroups in the same way as in the previous section: those with IR < 9 and those with IR > 9. Note that although the number of algorithms to be compared is lower than originally, the number of data sets is also reduced by half, so the results reported by the non-parametric tests are not influenced in favour of or against a desired result.
Firstly, we study the case where the imbalanced data sets have IR < 9. Figure 12 shows a graphical representation of Holm's procedure.
[Figure 12: Holm's test on data sets with IR < 9; the control algorithm is NCL.]
In the case of IR < 9, the best method is NCL. Measuring performance by means of GM, NCL is statistically better than all the methods considered with α = 0.05 and α = 0.10. However, with AUC as the performance metric, EBUS-MS-GM and TL behave equally to it when considering α = 0.05.
Secondly, we study the case where the imbalanced data sets have IR > 9. Figure 13 shows a graphical representation of Holm's procedure.
[Figure 13: Holm's test on data sets with IR > 9; the control algorithm is EBUS-MS-GM.]
Considering α = 0.05, EBUS-MS-GM is similar in performance to RUS, and it is also similar to OSS when evaluating with AUC.
Considering α = 0.10, EBUS-MS-GM is similar to RUS when using GM as the performance measure.
EBUS-MS-GM is the best method for a level of significance α = 0.10 and AUC as the performance measure.
The conclusions that we can extract from these tables and figures are the following:
EUS models usually present equal or better performance than the remaining methods, independently of the degree of imbalance of the data.
The best performing under-sampling model over imbalanced data sets is EBUS-MS-GM (Table 8).
EBUS-MS-GM is not the best model on imbalanced data sets with low IR, although it obtains good results. The NCL algorithm is the most appropriate for this type of data set (Figure 12), but when the IR increases, it does not behave well. Hence, NCL is not appropriate for data sets with high IR.
The EUS models tend to improve their classification behaviour as the data reaches a high degree of imbalance.
The EBUS-MS-GM model is the most accurate when dealing with data sets with IR > 9. This is shown in Figure 13, in which it is significantly better than the remaining algorithms when using the AUC measure.
An observable difference exists between GM and AUC when measuring the behaviour of the classical and EUS methods. For instance, with GM evaluation, the algorithms RUS and EBUS-MS-GM are significantly equivalent according to Holm's procedure. GM evaluates a trade-off between accuracy on the positive and negative classes.
RUS maintains all the positive examples and randomly selects a subset of negative instances. This subset, although randomly selected, may become a good representative of the negative set of instances. On the other hand, AUC measures a trade-off between true positives and false positives, so it penalizes the misclassification of positive instances. While it is easy to obtain a random subset of instances that is accurate with respect to examples of the same class, it is not so easy to find a random subset of instances of a certain class that does not harm the classification of the opposite class. For this reason, the RUS algorithm performs well when considering GM and not as well with AUC.
Classical under-sampling algorithms, such as NCL and TL, lose accuracy when the IR becomes high. This is logical, given that their intention is to preserve minority class instances while not massively removing majority class instances.
The EBUS-MS-GM model (and EUS in general) can adapt to distinct situations of imbalance and is not problem dependent.
7 Conclusions
This paper addressed the analysis of Prototype Selection and Under-Sampling algorithms over imbalanced classification problems when they are applied with different imbalance ratios in the distribution of classes. A taxonomy of evolutionary under-sampling methods is proposed, categorizing all models according to the objective of interest, the selection scheme and the evaluation measure.
An experimental study has been carried out to compare the results of the evolutionary under-sampling approach with non-evolutionary techniques.
The main conclusions reached are the following:
Prototype Selection algorithms must not be used for handling imbalanced problems. They are prone to gaining global performance by eliminating examples belonging to the minority class, considering them as noisy examples.
During the evolutionary under-sampling process, the use of a majority selection mechanism helps to obtain more accurate subsets of instances than the use of global selection. However, the latter mechanism is necessary to achieve the highest reduction rates.
A significant difference between the use of GM or AUC in the evaluation of solutions in EUS approaches is not observed.
Data sets with a low imbalance ratio may be faced with EUSCM models, especially using the model with a global selection mechanism and evaluation through the GM measure.
Although all EUS models obtain good results over data sets with a high imbalance ratio, we emphasize the EBUS models, with special interest in the one that performs majority selection using the GM measure. The superiority of this model over state-of-the-art under-sampling algorithms has been empirically proven.
Finally, we would like to point out that the EUS approach is a good choice for under-sampling imbalanced data sets, especially when the data presents a high imbalance ratio among the classes. We recommend the use of the EBUS-MS-GM model over imbalanced data sets.
As future research lines, we could tackle the following topics:
The use of Evolutionary Under-Sampling for training set selection (Cano et al., 2007) in order to analyze the behaviour of other classification methods (C4.5, SVMs, etc.), combined with subset selection for imbalanced data sets.
A study on scalability, making it feasible to apply Evolutionary Under-Sampling to very large data sets (Song et al., 2005; Cano et al., 2005).
The analysis of Evolutionary Under-Sampling in terms of data complexity (Ho and Basu, 2002; Bernado-Mansilla and Ho, 2005) for a better understanding of the behaviour of our approach depending on the data complexity measure values.
Acknowledgments
This research has been supported by the project TIN2005-08386-C05-01. S. García holds an FPU scholarship from the Spanish Ministry of Education and Science. The authors are very grateful to the anonymous reviewers for their valuable suggestions and comments to improve the quality of this paper.
References
Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6:37-66.
Akbani, R., Kwek, S., and Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In ECML, LNCS 3201, pages 39-50.
Alba, E., Luque, G., and Araujo, L. (2006). Natural language tagging with genetic algorithms. Information Processing Letters, 100(5):173-182.
Alcalá, R., Alcalá-Fdez, J., Herrera, F., and Otero, J. (2007). Genetic learning of accurate and compact fuzzy rule based systems based on the 2-tuples linguistic representation. International Journal of Approximate Reasoning, 44(1):45-64.
Barandela, R., Sánchez, J. S., García, V., and Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3):849-851.
Bernadó-Mansilla, E. and Garrell-Guiu, J. M. (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation, 11(3):209-238.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159.
Butz, M. V., Pelikan, M., Llorà, X., and Goldberg, D. E. (2006). Automated global structure extraction for effective local building block processing in XCS. Evolutionary Computation, 14(3):345-380.
Cano, J. R., Herrera, F., and Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Transactions on Evolutionary Computation, 7(6):561-575.
Cano, J. R., Herrera, F., and Lozano, M. (2005). Stratification for scaling up evolutionary prototype selection. Pattern Recognition Letters, 26(7):953-963.
Cano, J. R., Herrera, F., and Lozano, M. (2007). Evolutionary stratified training set selection for extracting classification rules with trade-off precision-interpretability. Data and Knowledge Engineering, 60:90-108.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1):1-6.
Chawla, N. V., Lazarevic, A., Hall, L. O., and Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In PKDD, pages 107-119.
Cieslak, D. A., Chawla, N. V., and Striegel, A. (2006). Combating imbalance in network intrusion datasets. In Proceedings of the IEEE International Conference on Granular Computing, pages 732-737.
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., and Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, 37(1):7-18.
Cordón, O., Damas, S., and Santamaría, J. (2006). Feature-based image registration by prototype selection for class imbalance problems. In IDEAL, LNCS 4224, pages 1415-1423.
Orriols-Puig, A. and Bernadó-Mansilla, E. (2006). Bounding XCS's parameters for unbalanced datasets. In GECCO '06: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 1561-1568.
Papadakis, E. and Theocharis, B. (2006). A genetic method for designing TSK models based on objective weighting: application to classification problems. Soft Computing, 10(9):805-824.
Sheskin, D. (2003). Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC.
Sikora, R. and Piramuthu, S. (2007). Framework for efficient feature selection in genetic algorithm based data mining. European Journal of Operational Research, 180:723-737.
Song, D., Heywood, M. I., and Zincir-Heywood, A. N. (2005). Training genetic programming on half a million patterns: An example from anomaly detection. IEEE Transactions on Evolutionary Computation, 9(3):225-239.
Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6:769-772.
Wang, X., Yang, J., Teng, X., Xia, W., and Jensen, R. (2007). Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4):459-471.
Weiss, G. M. and Provost, F. J. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.
Whitley, D., Beveridge, R., Guerra, C., and Graves, C. (1998). Messy genetic algorithms for subset feature selection. In Proceedings of the International Conference on Genetic Algorithms, pages 568-575.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2:408-421.
Wilson, D. R. and Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257-286.
Xie, J. and Qiu, Z. (2007). The effect of imbalanced data sets on LDA: A theoretical and empirical analysis. Pattern Recognition, 40:557-562.
Yen, S. and Lee, Y. (2006). Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In ICIC, LNCIS 344, pages 731-740.
Yoon, K. and Kwek, S. (2005). An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In HIS '05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, pages 303-308.
$$\frac{p + \frac{z^2}{2n} \pm z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}} \qquad (10)$$
z is the confidence factor (0.9 is used to accept, 0.7 to reject), p is the classification accuracy of an instance x (while x is added to S), and n is the number of classification trials for a given instance (while added to S). A sketch follows.
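The following Python sketch (an illustration under our own assumptions, not IB3 itself) computes the confidence interval of expression 10; roughly, an instance is acceptable when the lower bound of its accuracy interval exceeds the upper bound of the interval around its class's relative frequency.

import math

def confidence_interval(p, n, z):
    # Expression 10: confidence interval around accuracy p after n trials.
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom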
DROP3: TR is copied to the selected subset S. It uses a noise-filtering pass before sorting the instances in S. This is done using the rule: any instance not correctly classified by its k nearest neighbours is removed (we use k = 3). After removing noisy instances from S in this manner, the instances are sorted by distance to their nearest enemy remaining in S, so that points far from the real decision boundary are removed first. This allows points internal to clusters to be removed early in the process, even if there were noisy points nearby. After the noise removal, the steps are those described in Figure 14 (Wilson and Martinez, 2000). A sketch of the preprocessing follows.
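A minimal sketch of the two preprocessing steps just described (the noise filter and the nearest-enemy ordering), assuming numpy arrays with binary integer labels; this is not the full DROP3 removal loop of Figure 14.

import numpy as np

def drop3_preprocess(X, y, k=3):
    # pairwise Euclidean distances (fine for a sketch, O(n^2) memory)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # noise filter: drop instances misclassified by their k nearest neighbours
    keep = [i for i in range(len(y))
            if np.bincount(y[np.argsort(d[i])[:k]], minlength=2).argmax() == y[i]]
    keep = np.array(keep)
    # order survivors by distance to their nearest enemy, farthest first
    enemy = np.array([d[i][y != y[i]].min() for i in keep])
    return keep[np.argsort(-enemy)]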
A.2 Classical Under-Sampling Methods for Balancing the Class Distribution
In this work, we evaluate 8 different under-sampling methods for balancing the class distribution of the training data:
Random Under-Sampling (RUS): A non-heuristic method that aims to balance the class distribution through the random elimination of majority class examples until a balanced instance set is obtained. The final balancing ratio can be adjusted. A sketch is shown below.
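A minimal sketch, assuming numpy arrays with the minority class labelled 1 and a fixed random seed; the 1:1 ratio is hard-coded here for brevity.

import numpy as np

def random_under_sampling(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep_neg = rng.choice(neg, size=len(pos), replace=False)  # 1:1 balance
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]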
Tomek Links (TL) (Tomek, 1976): Given two examples Ei = (xi, yi) and Ej = (xj, yj) with yi ≠ yj, and d(Ei, Ej) the distance between Ei and Ej, the pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej). Tomek links can be used as an under-sampling method by eliminating only the examples belonging to the majority class in each Tomek link found, as in the sketch below.
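A sketch under the same assumptions (binary labels, Euclidean distance); it exploits the fact that a Tomek link is a pair of mutual nearest neighbours of opposite classes.

import numpy as np

def remove_tomek_links(X, y):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                   # 1-NN of every example
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:     # mutual 1-NNs of opposite class
            drop.add(i if y[i] == 0 else j) # remove only the majority member
    keep = [i for i in range(len(y)) if i not in drop]
    return X[keep], y[keep]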
Condensed Nearest Neighbor Rule (US-CNN) (Hart, 1968): First, randomly draw one majority class example, take all the examples from the minority class, and put these examples in S. Afterwards, use a 1-NN rule over the examples in S to classify the examples in TR. Every misclassified example from TR is moved to S (see the sketch below).
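A sketch of the single pass described above, again assuming binary numpy labels with 1 as the minority class.

import numpy as np

def us_cnn(X, y, seed=0):
    rng = np.random.default_rng(seed)
    S = list(np.where(y == 1)[0])              # all minority examples
    S.append(rng.choice(np.where(y == 0)[0]))  # one random majority example
    for i in range(len(y)):
        if i in S:
            continue
        d = np.linalg.norm(X[S] - X[i], axis=1)
        if y[S][int(d.argmin())] != y[i]:      # misclassified by 1-NN over S
            S.append(i)
    return X[S], y[S]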
One-Sided Selection (OSS) (Kubat and Matwin, 1997): An under-sampling method resulting from the application of Tomek links followed by the application of US-CNN.
In SBC, the fraction of majority class samples selected from the i-th of the K clusters is proportional to expression 11, where N_i^- and N_i^+ denote the numbers of majority and minority class samples in that cluster:

$$\frac{N_i^{-}/N_i^{+}}{\sum_{i=1}^{K} \left(N_i^{-}/N_i^{+}\right)} \qquad (11)$$

After determining the number of majority class samples for each cluster, it randomly chooses majority class samples from the i-th cluster.
2.3.2. Submission Confirmation
Manuscript Number:
Title: Enhancing the Effectiveness and Interpretability of Decision Tree and Rule Induction Classifiers with Evolutionary Training Set Selection over Imbalanced Problems
Article Type: Fast Track Paper
Section/Category:
Keywords: Evolutionary Algorithms; Imbalanced Classification; Data Reduction; Training Set Selection; Decision Trees; Rule Induction; HIS 2008.
Corresponding Author: Mr Salvador García
Corresponding Author's Institution: University of Granada
First Author: Salvador García
Order of Authors: Salvador García; Alberto Fernández; Francisco Herrera, Ph.D.
Manuscript Region of Origin:
Enhancing the Effectiveness and Interpretability of Decision Tree and Rule Induction Classifiers with Evolutionary Training Set Selection over Imbalanced Problems
Salvador García a,*, Alberto Fernández a, Francisco Herrera a
a Dept. of Computer Science and Artificial Intelligence, University of Granada, 18071, Granada, Spain.
Abstract
Classification in imbalanced domains is a recent challenge in data mining. We refer to imbalanced classification when the data presents many examples from one class and few from the other, and the less represented class is the one of most interest from the point of view of the learning task. One of the most used techniques to tackle this problem consists of preprocessing the data prior to the learning process. This preprocessing can be done through under-sampling, removing examples mainly belonging to the majority class, or over-sampling, by means of replicating or generating new minority examples. In this paper, we propose an under-sampling procedure guided by evolutionary algorithms to perform a training set selection for enhancing the decision trees obtained by the C4.5 algorithm and the rule sets obtained by the PART rule induction algorithm. The proposal has been compared with other under-sampling and over-sampling techniques, and the results indicate that the new approach is very competitive in terms of accuracy when compared with over-sampling and that it outperforms standard under-sampling. Moreover, the models obtained are smaller in terms of the number of leaves or rules generated and can be considered more interpretable. The results have been contrasted through non-parametric statistical tests over multiple data sets.
Key words: Evolutionary Algorithms, Imbalanced Classification, Data Reduction, Training Set Selection, Decision Trees, Rule Induction.
1. Introduction
The data used in a classification task may not be perfect. Data can present different types of imperfections, such as the presence of errors, missing values or an imbalanced distribution of classes. In recent years, the class imbalance problem has become one of the emergent challenges in Data Mining (DM) [45]. The problem appears when the data presents a class imbalance, that is, it contains many more examples of one class than of the other, and the less represented class is the most interesting concept from the point of view of learning [10]. The imbalanced classification problem is closely related to the cost-sensitive classification problem [9]. Imbalance in class distribution is pervasive in a variety of real-world applications, including but not limited to telecommunications [37], the WWW, finance, ecology [29], biology and medicine [21].
Usually, in imbalanced classification problems, the instances are grouped into two types of classes: the majority or negative class, and the minority or positive class. The minority or positive class is of more interest and is also accompanied by a higher cost of making errors. A standard classifier might ignore the importance of the minority class because its representation inside the data set is not strong enough. A classical example is a ratio of imbalance of 1:100 in the data.
Evolutionary Algorithms (EAs) have been successfully used for feature selection [42, 25, 40, 46] and instance selection [5, 6, 22]. EAs also show good behaviour for Training Set Selection (TSS) in terms of obtaining a trade-off between precision and interpretability with classification rules [7].
In the field of imbalanced classification, EAs have recently begun to be applied. In [28], an EA is used to search globally for an optimal tree for cost-sensitive classification. In [13], the authors propose new heuristics and metrics for improving the performance of several genetic programming classifiers in imbalanced domains. EAs have also been applied for under-sampling the data in imbalanced domains in instance-based learning [23].
In this contribution, we propose the use of EAs for TSS in imbalanced data sets. Our objective is to increase the effectiveness of a well-known decision tree classifier, C4.5 [35], and a rule induction algorithm, PART [19], by removing instances guided by an evolutionary under-sampling algorithm. We compare our approach with other under-sampling and over-sampling methods and with hybrid proposals of over-sampling and under-sampling [3] studied in the literature. The empirical study is contrasted via non-parametric statistical testing in a multiple data set environment.
To achieve this objective, the rest of the contribution is organized as follows: Section 2 gives an overview of imbalanced classification. In Section 3, the evolutionary TSS issues are explained, together with a description of the model used. In Section 4, the experimentation framework, the results obtained and their analysis are presented. Section 5 presents our conclusions. Finally, Appendix A is included in order to illustrate the comparisons of our proposal with other techniques through star plots.
1. Solutions at the data level [8, 3, 9]: This kind of solution consists of balancing the class distribution by over-sampling the minority class (positive instances) or under-sampling the majority class (negative instances).
2. Solutions at the algorithmic level: In this case we may fit our method by adjusting the cost per class [24]; for example, adjusting the probability estimation in the leaves of a decision tree to bias the positive class [41].
We focus on two-class imbalanced data sets, where there is only one positive and one negative class. We consider the positive class to be the one with the lower number of examples and the negative class the one with the higher number of examples. In order to deal with the class imbalance problem, we analyse the cooperation of some instance preprocessing methods.
2.2. Evaluation in Imbalanced Domains
The most straightforward way to evaluate the performance of classifiers is based on confusion matrix analysis. Table 1 illustrates a confusion matrix for a two-class problem with positive and negative class values. From such a table it is possible to extract a number of widely used metrics for measuring the performance of learning systems, such as the Error Rate (1) and Accuracy (2).
$$Err = \frac{FP + FN}{TP + FN + FP + TN} \qquad (1)$$

$$Acc = \frac{TP + TN}{TP + FN + FP + TN} = 1 - Err \qquad (2)$$
Table 1: Confusion matrix for a two-class problem.

                 Positive Prediction   Negative Prediction
Positive Class   True Positive (TP)    False Negative (FN)
Negative Class   False Positive (FP)   True Negative (TN)
True positive rate: TP_rate = TP/(TP + FN), the percentage of positive cases correctly classified as belonging to the positive class.
True negative rate: TN_rate = TN/(FP + TN), the percentage of negative cases correctly classified as belonging to the negative class.
False positive rate: FP_rate = FP/(FP + TN), the percentage of negative cases misclassified as belonging to the positive class.
False negative rate: FN_rate = FN/(TP + FN), the percentage of positive cases misclassified as belonging to the negative class.
These four performance measures have the advantage of being independent of class costs and prior probabilities. The aim of a classifier is to minimize the false positive and false negative rates or, equivalently, to maximize the true negative and true positive rates. The metric used in this work is the geometric mean of the true rates [2], which can be defined as

$$GM = \sqrt{acc^{+} \cdot acc^{-}} \qquad (3)$$
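A small helper illustrating expressions 1-3 from the entries of the confusion matrix; the function name is ours, not from the paper.

import math

def geometric_mean(tp, fn, fp, tn):
    acc_pos = tp / (tp + fn)  # true positive rate
    acc_neg = tn / (fp + tn)  # true negative rate
    return math.sqrt(acc_pos * acc_neg)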
Neighborhood Cleaning Rule (NCL) uses Wilson's Edited Nearest Neighbor Rule (ENN) [43, 31] to remove majority class examples. ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors. NCL modifies ENN in order to increase the data cleaning. For a two-class problem the algorithm can be described as follows: for each example ei in the training set, its three nearest neighbors are found. If ei belongs to the majority class and the classification given by its three nearest neighbors contradicts the original class of ei, then ei is removed. If ei belongs to the minority class and its three nearest neighbors misclassify ei, then the nearest neighbors that belong to the majority class are removed. A sketch follows.
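A sketch of this two-class NCL rule, assuming Euclidean 3-NN and the majority class labelled 0.

import numpy as np

def ncl(X, y):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    drop = set()
    for i in range(len(y)):
        nn3 = np.argsort(d[i])[:3]                   # three nearest neighbours
        vote = int(np.sum(y[nn3]) >= 2)              # majority vote of the 3-NN
        if y[i] == 0 and vote != y[i]:
            drop.add(i)                              # misclassified majority example
        elif y[i] == 1 and vote != y[i]:
            drop.update(j for j in nn3 if y[j] == 0) # neighbours causing the error
    keep = [i for i in range(len(y)) if i not in drop]
    return X[keep], y[keep]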
Under-sampling methods create a subset of the original database by eliminating some of the examples of the majority class.
Over-sampling methods create a superset of the original database by replicating some of the examples of the minority class or creating new ones from the original minority class instances.
[Figure: Evolutionary under-sampling process for training set selection: a binary chromosome codifies the training set TR; the evolutionary under-sampling algorithm outputs the selected training set S, which is used to build a decision tree or rule induction classifier.]
The fitness function corresponds to expression 4:

$$Fitness(S) = GM \qquad (4)$$

This fitness function is related to the proposal of Evolutionary Under-Sampling guided by Classification Measures (EUSCM) for the nearest neighbour classifier proposed in [23]. A decision tree or a rule induction algorithm can be used for measuring the accuracy associated with the model induced using the instances selected in S. Obviously, the choice of this classifier is conditioned by the final evaluator classifier, following a wrapper scheme. The accuracy independently computed in each class is used to obtain the GM value associated with the chromosome. The objective of the EA is to maximize the fitness function defined, that is, to maximize the GM rate.

input: A population of chromosomes Pa
output: An optimized population of chromosomes Pa
t ← 0;
Initialize(Pa, ConvergenceCount);
while not EndingCondition(t, Pa) do
    Parents ← SelectionParents(Pa);
    Offspring ← HUX(Parents);
    Evaluate(Offspring);
    Pn ← ElitistSelection(Offspring, Pa);
    if not modified(Pa, Pn) then
        ConvergenceCount ← ConvergenceCount − 1;
        if ConvergenceCount = 0 then
            Pn ← Restart(Pa);
            Initialize(ConvergenceCount);
        end
    end
    t ← t + 1;
    Pa ← Pn;
end
Algorithm 1: Pseudocode of the CHC algorithm
Table: Data sets used in the study.

Data Set              #Examples  #Attributes  %Class(min.,maj.)
Abalone9-18           731        9            (5.75, 94.25)
Dermatology2          366        34           (16.67, 83.33)
EcoliCP-IM            220        7            (35.00, 65.00)
EcoliIM               336        7            (22.92, 77.08)
EcoliIMU              336        7            (10.42, 89.58)
EcoliOM               336        7            (6.74, 93.26)
German                1000       20           (30.00, 70.00)
GlassBWFP             214        9            (32.71, 67.29)
GlassBWNFP            214        9            (35.51, 64.49)
GlassNW               214        9            (23.93, 76.17)
GlassVWFP             214        9            (7.94, 92.06)
Haberman              306        3            (26.47, 73.53)
New-thyroid           215        5            (16.28, 83.72)
PageBlocks(2,4,5)-3   559        10           (5.01, 94.99)
Pima                  768        8            (34.77, 66.23)
Segment1              2310       19           (14.29, 85.71)
VehicleVAN            846        18           (23.52, 76.48)
Vowel0                990        13           (9.01, 90.99)
Yeast(1)              467        8            (4.28, 95.72)
Yeast(2)              1240       8            (2.02, 97.98)
Yeast(3)              1334       8            (2.62, 97.38)
Yeast(4)              1120       8            (2.68, 97.32)
YeastCYT-POX          483        8            (4.14, 95.86)
YeastNUC-POX          449        8            (4.45, 95.55)
YeastPOX              1484       8            (1.35, 98.65)
CHC also implements a form of heterogeneous recombination using HUX, a special recombination operator. HUX exchanges half of the bits that differ between parents, where the bit positions to be exchanged are randomly determined. CHC also employs a method of incest prevention. Before applying HUX to two parents, the Hamming distance between them is measured. Only those parents who differ from each other by some number of bits (the mating threshold) are mated. The initial threshold is set at L/4, where L is the length of the chromosomes. If no offspring are inserted into the new population, then the threshold is reduced by one. A sketch of HUX follows.
No mutation is applied during the recombination phase. Instead, when the population converges or the search stops making progress (i.e., the difference threshold has dropped to zero and no new offspring are being generated which are better than any member of the parent population), the population is reinitialized to introduce new diversity into the search. The chromosome representing the best solution found over the course of the search is used as a template to reseed the population. Reseeding is accomplished by randomly changing 35% of the bits in the template chromosome to form each of the other N − 1 new chromosomes in the population. The search is then resumed.
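As an illustration (not the original code), the following Python sketch implements HUX with the incest-prevention check for binary numpy chromosomes; it returns None when the parents are too similar to be mated.

import numpy as np

def hux(parent_a, parent_b, threshold, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.where(parent_a != parent_b)[0]
    if len(diff) // 2 <= threshold:          # incest prevention: too similar
        return None
    swap = rng.choice(diff, size=len(diff) // 2, replace=False)
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[swap], child_b[swap] = parent_b[swap], parent_a[swap]
    return child_a, child_b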
This section describes the methodology followed in the experimental study of the re-sampling techniques compared. We explain the configuration of the experiment: the data sets used and the parameters for the algorithms. The algorithms used in the comparison are the same as those described in Section 2.3.
Parameters:
k = 5, Balancing Ratio = 1:1
Pop = 50, Eval = 10000, Prob. inclusion HUX = 0.25, W = 3
The use of Wilcoxon's and Holm's tests confirms the improvement achieved by EUSTSS over OSS and NCL under-sampling.
Table 4: Results obtained by C4.5 using the GM evaluation measure over training data.

dataset               none    NCL     OSS     SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           0.6611  0.7206  0.7218  0.9348  0.9337     0.8543    0.8449
dermatology2          0.9563  0.9240  0.9437  0.9894  0.9853     0.9845    0.9820
ecoliCP-IM            0.9869  0.9526  0.9869  0.9906  0.9860     0.9862    0.9869
ecoliIM               0.8602  0.9184  0.9275  0.9502  0.9483     0.9341    0.9428
ecoliMU               0.8794  0.8799  0.9234  0.9722  0.9625     0.9331    0.9374
ecoliOM               0.9416  0.9197  0.9576  0.9782  0.9891     0.9566    0.9914
german                0.7779  0.6881  0.7790  0.8676  0.8136     0.7773    0.7474
glassBWFP             0.9391  0.7557  0.8528  0.9553  0.8915     0.8906    0.9157
glassBWNFP            0.8684  0.6501  0.8766  0.9450  0.8964     0.8720    0.8856
glassNW               0.9770  0.8456  0.9670  0.9899  0.9679     0.9704    0.9783
glassVWFP             0.8476  0.8828  0.9691  0.9779  0.9611     0.8968    0.9608
haberman              0.4660  0.4856  0.7215  0.7733  0.7519     0.7520    0.7141
new-thyroid           0.9678  0.9507  0.9787  0.9869  0.9873     0.9854    0.9963
pageblocks(2,4,5)-3   0.9919  0.9542  0.9918  1.0000  1.0000     0.9980    1.0000
pima                  0.8151  0.7115  0.8115  0.8631  0.8387     0.8210    0.8084
segment1              0.9908  0.9827  0.9957  0.9991  0.9988     0.9972    0.9969
vehicle               0.9856  0.8965  0.9696  0.9889  0.9784     0.9713    0.9666
vowel0                0.9973  0.9531  0.9973  0.9941  0.9949     0.9947    0.9979
yeast(1)              0.6699  0.7491  0.6171  0.9467  0.9460     0.8769    0.9357
yeast(2)              0.3938  0.7902  0.4203  0.8888  0.8918     0.8668    0.8936
yeast(3)              0.8862  0.9053  0.8973  0.9642  0.9675     0.9334    0.9554
yeast(4)              0.1086  0.1460  0.4341  0.7927  0.8241     0.6912    0.7793
yeastCYT-POX          0.2568  0.8052  0.3438  0.9072  0.9205     0.8793    0.9377
yeastNUC-POX          0.6742  0.8265  0.6742  0.9215  0.9379     0.8970    0.9745
yeastPOX              0.0000  0.7362  0.0000  0.8279  0.8502     0.8220    0.8473
AVERAGE               0.7560  0.8012  0.7903  0.9362  0.9289     0.9017    0.9191
Table 5: Results obtained by C4.5 using the GM evaluation measure over test data.

dataset               none    NCL     OSS     SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           0.3763  0.4761  0.4963  0.6023  0.6724     0.6724    0.6697
dermatology2          0.8623  0.8988  0.8928  0.9194  0.9181     0.9098    0.9505
ecoliCP-IM            0.9787  0.9486  0.9787  0.9751  0.9748     0.9787    0.9787
ecoliIM               0.8167  0.8882  0.8860  0.8795  0.9060     0.8811    0.8809
ecoliMU               0.7709  0.7600  0.8092  0.8661  0.8137     0.8671    0.8579
ecoliOM               0.8073  0.8220  0.8749  0.8412  0.8010     0.8725    0.9291
german                0.5759  0.6437  0.6753  0.6410  0.6636     0.6658    0.6419
glassBWFP             0.8138  0.6652  0.7551  0.8216  0.7599     0.7971    0.8425
glassBWNFP            0.6934  0.5648  0.7353  0.7511  0.7631     0.7427    0.7235
glassNW               0.8942  0.8101  0.9505  0.9239  0.9373     0.9344    0.9321
glassVWFP             0.5286  0.6755  0.6884  0.6994  0.7572     0.4930    0.7816
haberman              0.4280  0.4329  0.6089  0.6832  0.6292     0.6022    0.6206
new-thyroid           0.9048  0.9132  0.8810  0.9193  0.9492     0.9414    0.9463
pageblocks(2,4,5)-3   0.9270  0.9327  0.9260  0.9991  0.9991     0.9807    0.9991
pima                  0.6908  0.6457  0.7161  0.7155  0.6990     0.7181    0.7179
segment1              0.9852  0.9728  0.9849  0.9918  0.9947     0.9965    0.9891
vehicle               0.9172  0.8737  0.9118  0.9202  0.9216     0.9241    0.9239
vowel0                0.9808  0.9360  0.9808  0.9657  0.9764     0.9671    0.9734
yeast(1)              0.4121  0.5979  0.3414  0.5399  0.6073     0.6883    0.6271
yeast(2)              0.1155  0.7038  0.2151  0.6783  0.6940     0.7477    0.6846
yeast(3)              0.7343  0.8653  0.8313  0.7983  0.8890     0.8649    0.8759
yeast(4)              0.0000  0.0000  0.1144  0.3737  0.4509     0.3044    0.3749
yeastCYT-POX          0.0699  0.7245  0.1000  0.5585  0.6156     0.6176    0.6489
yeastNUC-POX          0.5828  0.6151  0.5536  0.6974  0.6630     0.5647    0.6819
yeastPOX              0.0000  0.6238  0.0000  0.5718  0.5408     0.6410    0.6154
AVERAGE               0.6347  0.7196  0.6763  0.7733  0.7839     0.7749    0.7947
Table 6: Average number of leaves of the trees obtained by C4.5.

dataset               none    NCL    OSS    SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           8.10    6.50   7.30   57.50   57.30      52.60     6.30
dermatology2          10.6    5.4    8.9    15.5    14.3       14.5      7.2
ecoliCP-IM            2.00    2.50   2.00   2.90    3.10       2.00      2.00
ecoliIM               5.30    5.10   6.20   10.40   10.10      10.40     6.00
ecoliMU               10.00   5.80   6.50   16.70   13.10      14.00     5.40
ecoliOM               3.90    3.40   4.40   7.80    6.60       6.80      5.40
german                91.00   35.30  57.60  159.90  121.00     82.40     33.60
glassBWFP             12.20   5.80   6.70   15.70   10.40      10.40     7.00
glassBWNFP            12.40   5.50   11.60  19.90   15.90      15.90     9.60
glassNW               6.70    4.10   4.40   9.70    6.90       7.10      5.60
glassVWFP             7.50    6.10   8.40   13.40   13.10      13.50     6.90
haberman              2.60    3.90   8.70   16.10   18.20      18.00     5.70
new-thyroid           4.10    2.60   4.30   4.90    4.90       5.00      4.30
pageblocks(2,4,5)-3   4.7     3.1    4.7    4.2     4.2        4.2       4
pima                  22.40   16.10  24.60  39.50   38.90      34.90     14.50
segment1              10      8.9    12.4   12.5    12.3       12.6      7.5
vehicle               20.60   12.50  16.30  28.40   23.40      22.50     11.10
vowel0                7.80    5.00   7.80   10.70   11.40      10.50     7.90
yeast(1)              3       2.2    3.2    21.2    21.9       19.2      8.2
yeast(2)              3       3.9    3.1    38.9    39         36.7      7
yeast(3)              5       4.2    3.3    32.6    29.5       28.8      5
yeast(4)              1.4     1.3    5      58.2    61.7       54.2      7.4
yeastCYT-POX          1.70    3.70   2.30   23.30   19.70      21.20     7.60
yeastNUC-POX          2.9     4.2    3      15.1    15.9       18.5      8
yeastPOX              0       2      0      34.7    36.2       36.8      5
AVERAGE               10.36   6.36   8.91   26.79   24.36      22.11     7.93
Table 7: Non-parametric statistical test results over GM and number of leaves using C4.5.

algorithm   Wilcoxon GM  Wilcoxon num. leaves  Holm GM
none        + (.000)     = (.447)              + (.000)
NCL         + (.000)     - (.001)              + (.000)
OSS         + (.001)     = (.316)              + (.005)
SMOTE       + (.011)     + (.000)              = (.248)
SMOTE+ENN   = (.391)     + (.000)              = (1.000)
SMOTE+TL    = (.317)     + (.000)              = (1.000)
EUSTSS      (control)    (control)             (control)
Table 8: Results obtained by PART using the GM evaluation measure over training data.

dataset               none    NCL     OSS     SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           0.6828  0.8077  0.6243  0.9316  0.8634     0.9236    0.8345
dermatology2          0.9761  0.9750  0.9363  0.9892  0.9855     0.9841    0.9830
ecoliCP-IM            0.9948  0.9864  0.9302  0.9935  0.9865     0.9804    0.8920
ecoliIM               0.9061  0.9232  0.9079  0.9418  0.9254     0.9307    0.9116
ecoliMU               0.7950  0.9149  0.8477  0.9641  0.9235     0.9563    0.9250
ecoliOM               0.9775  0.9873  0.9294  0.9879  0.9664     0.9773    0.9787
german                0.9368  0.8525  0.7730  0.9522  0.8131     0.8818    0.7406
glassBWFP             0.9475  0.8661  0.7860  0.9246  0.8927     0.9035    0.9251
glassBWNFP            0.8154  0.8674  0.6698  0.9145  0.8535     0.8939    0.8747
glassNW               0.9793  0.9672  0.7973  0.9862  0.9678     0.9631    0.9778
glassVWFP             0.9062  0.9560  0.7902  0.9701  0.9062     0.9624    0.9358
haberman              0.5842  0.6973  0.5209  0.7321  0.7212     0.7050    0.6389
new-thyroid           0.9923  0.9909  0.9610  0.9907  0.9828     0.9871    0.9689
pageblocks(2,4,5)-3   0.9960  0.9956  0.9542  0.9998  0.9980     1.0000    0.9945
pima                  0.7262  0.7937  0.7180  0.7964  0.7910     0.7789    0.6963
segment1              0.9983  0.9986  0.9835  0.9993  0.9977     0.9987    0.9856
vehicle               0.9899  0.9767  0.9477  0.9950  0.9820     0.9852    0.9699
vowel0                0.9963  0.9962  0.9692  0.9970  0.9994     0.9972    0.9725
yeast(1)              0.4147  0.6033  0.7491  0.9165  0.8762     0.9556    0.9313
yeast(2)              0.4637  0.4655  0.7929  0.8851  0.8695     0.9044    0.8974
yeast(3)              0.8947  0.9094  0.9127  0.9496  0.9303     0.9618    0.9502
yeast(4)              0.3732  0.4846  0.1601  0.7812  0.6616     0.7856    0.7839
yeastCYT-POX          0.3318  0.2663  0.8208  0.9387  0.8885     0.9425    0.9277
yeastNUC-POX          0.3556  0.4376  0.8102  0.9162  0.8861     0.9259    0.9605
yeastPOX              0.0000  0.0000  0.7362  0.8755  0.8201     0.8602    0.8590
AVERAGE               0.7614  0.7888  0.8011  0.9332  0.8995     0.9258    0.9006
Table 9: Results obtained by PART using the GM evaluation measure over test data.

dataset               none    NCL     OSS     SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           0.4305  0.3741  0.4668  0.6047  0.5401     0.6355    0.5862
dermatology2          0.8776  0.8791  0.8882  0.9409  0.8855     0.9199    0.9672
ecoliCP-IM            0.9717  0.9787  0.9201  0.9751  0.9787     0.9606    0.8827
ecoliIM               0.8335  0.8687  0.8740  0.8651  0.8698     0.8805    0.8806
ecoliMU               0.6607  0.7921  0.7652  0.8436  0.8648     0.8447    0.8073
ecoliOM               0.7193  0.8144  0.8311  0.9014  0.7979     0.9535    0.8710
german                0.6305  0.6439  0.6137  0.6319  0.6148     0.6453    0.6126
glassBWFP             0.8136  0.7957  0.6973  0.8046  0.8102     0.7985    0.8302
glassBWNFP            0.6105  0.7560  0.5750  0.7371  0.6884     0.7136    0.7400
glassNW               0.8963  0.9446  0.7370  0.9131  0.9273     0.9088    0.9213
glassVWFP             0.6019  0.6928  0.4838  0.7019  0.7638     0.5089    0.7360
haberman              0.5161  0.6111  0.4754  0.6417  0.6513     0.5765    0.5478
new-thyroid           0.8891  0.9224  0.9393  0.9252  0.9204     0.9261    0.9231
pageblocks(2,4,5)-3   0.9553  0.9525  0.9327  0.9807  0.9624     0.9807    0.9914
pima                  0.6867  0.6967  0.6651  0.7145  0.7251     0.7134    0.6373
segment1              0.9890  0.9810  0.9774  0.9911  0.9921     0.9893    0.9838
vehicle               0.9344  0.9271  0.9059  0.9308  0.9388     0.9530    0.9329
vowel0                0.9557  0.9557  0.9040  0.9706  0.9665     0.9557    0.9232
yeast(1)              0.2113  0.3105  0.5734  0.6219  0.6774     0.6061    0.5967
yeast(2)              0.1155  0.2151  0.6266  0.6787  0.7248     0.6418    0.6871
yeast(3)              0.8156  0.8700  0.8658  0.8484  0.8926     0.8651    0.8492
yeast(4)              0.0000  0.1147  0.0553  0.3950  0.1122     0.1845    0.2934
yeastCYT-POX          0.0000  0.0000  0.8097  0.4960  0.7181     0.7451    0.7502
yeastNUC-POX          0.2121  0.2414  0.5879  0.6928  0.6642     0.6239    0.7254
yeastPOX              0.0000  0.0000  0.6238  0.5709  0.6387     0.5439    0.7016
AVERAGE               0.6131  0.6535  0.7118  0.7751  0.7731     0.7630    0.7751
Table 10: Average number of rules obtained by PART.

dataset               none    NCL    OSS    SMOTE   SMOTE+ENN  SMOTE+TL  EUSTSS
abalone9-18           8.30    9.10   5.70   29.10   28.00      27.00     4.90
dermatology2          7.10    5.80   3.40   9.60    8.10       9.90      3.10
ecoliCP-IM            4.10    3.60   2.60   4.80    2.60       4.10      3.50
ecoliIM               5.90    5.30   2.80   7.40    6.20       6.30      4.30
ecoliMU               6.00    5.50   4.20   9.30    8.10       6.90      4.60
ecoliOM               4.50    3.90   3.20   4.40    4.40       4.30      3.70
german                108.00  66.40  56.50  128.50  76.40      100.70    40.10
glassBWFP             7.50    5.00   3.90   7.90    6.60       7.10      5.20
glassBWNFP            5.20    6.50   4.80   9.00    7.40       8.30      5.50
glassNW               5.50    3.90   3.00   6.00    5.20       5.00      4.40
glassVWFP             6.50    6.30   4.50   9.20    8.90       8.00      6.10
haberman              3.40    6.10   3.20   7.30    8.20       9.90      4.20
new-thyroid           4.10    3.60   2.10   4.20    4.10       4.40      3.20
pageblocks(2,4,5)-3   4.00    4.00   2.00   2.50    2.60       2.60      2.90
pima                  7.40    10.70  7.10   11.50   12.70      12.80     5.10
segment1              7.90    7.80   6.20   7.50    7.10       7.60      6.50
vehicle               13.70   11.50  9.50   16.40   14.10      13.70     8.70
vowel0                5.80    5.80   5.00   7.40    7.50       7.80      4.70
yeast(1)              4.00    4.60   2.10   12.10   11.70      13.90     5.60
yeast(2)              4.80    4.40   4.00   20.10   21.50      18.20     5.30
yeast(3)              5.00    3.50   4.20   14.30   14.70      14.70     4.50
yeast(4)              4.60    5.10   2.00   29.80   29.90      28.50     4.80
yeastCYT-POX          3.30    2.80   3.10   13.50   12.00      12.50     5.00
yeastNUC-POX          3.40    3.60   3.40   9.30    10.20      11.30     6.20
yeastPOX              1.00    1.00   2.00   24.10   19.90      21.30     4.60
AVERAGE               9.64    7.83   6.02   16.21   13.52      14.67     6.27
Table 11: Non-parametric statistical test results over GM and number of rules using PART (EUSTSS as control method; + means EUSTSS is significantly better, - significantly worse, = no significant difference; p-values in brackets)

algorithm     Wilcoxon GM   Wilcoxon num. rules   Holm GM
none          + (.001)      = (.174)              + (.000)
NCL           + (.048)      = (.339)              + (.000)
OSS           + (.001)      - (.001)              + (.020)
SMOTE         = (.667)      + (.000)              = (1.000)
SMOTE + ENN   = (.989)      + (.000)              = (1.000)
SMOTE + TL    = (.925)      + (.000)              = (1.000)
5. Concluding Remarks
The purpose of this paper is to present a proposal of an evolutionary training set selection algorithm to be applied over imbalanced data sets in order to improve the performance of decision tree or rule-based induction classifiers. The study has been performed using the C4.5 decision tree classifier and the PART rule induction classifier. The results show that our proposal allows each of the classifiers used to obtain very accurate models (trees or rule bases) with a low number of leaves or rules. The effectiveness of the models obtained is very competitive with respect to advanced over-sampling hybrids. The proposal offers more accurate models than those offered by other under-sampling techniques, and the interpretability of the models obtained is increased, due to the fact that the trees or rule bases yielded are made up of a lower number of leaves/rules.
Acknowledgement
This work was supported by TIN2005-08386-C05-01.
2.4.
2.4.1. A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms Behaviour: A Case Study on the CEC2005 Special Session on Real Parameter Optimization
S. García, D. Molina, M. Lozano, F. Herrera, A Study on the Use of Non-Parametric Tests for Analyzing the Evolutionary Algorithms Behaviour: A Case Study on the CEC2005 Special Session on Real Parameter Optimization. Journal of Heuristics, doi: 10.1007/s10732-008-9080-4, in press (2008).
Status: Accepted (published online)
Impact Factor (JCR 2007): 0.644.
Abstract  In recent years, there has been a growing interest in the experimental analysis in the field of evolutionary algorithms. This is noticeable due to the existence of numerous papers which analyze and propose different types of problems, such as the basis for experimental comparisons of algorithms, proposals of different methodologies of comparison, or proposals of use of different statistical techniques in algorithm comparison.
In this paper, we focus our study on the use of statistical techniques in the analysis of the behaviour of evolutionary algorithms over optimization problems. A study about the required conditions for the statistical analysis of the results is presented by using some models of evolutionary algorithms for real-coded optimization. This study is conducted in two ways: single-problem analysis and multiple-problem analysis. The results obtained state that a parametric statistical analysis may not be appropriate, especially when we deal with multiple-problem results. In multiple-problem analysis, we propose the use of non-parametric statistical tests, given that they are less restrictive than parametric ones and they can be used over small samples of results. As a case study, we analyze the published results for the algorithms presented in the CEC2005 Special Session on Real Parameter Optimization.
This work was supported by Project TIN2005-08386-C05-01.
S. García holds an FPU scholarship from the Spanish Ministry of Education and Science.
S. García · M. Lozano · F. Herrera
Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
e-mail: salvagl@decsai.ugr.es
M. Lozano
e-mail: lozano@decsai.ugr.es
F. Herrera
e-mail: herrera@decsai.ugr.es
D. Molina
Department of Computer Engineering, University of Cádiz, Cádiz, Spain
e-mail: daniel.molina@uca.es
1 Introduction
The No Free Lunch theorem (Wolpert and Macready 1997) demonstrates that it is not possible to find one algorithm behaving better for any problem. On the other hand, we know that we can work with different degrees of knowledge about the problem which we expect to solve, and that it is not the same to work without knowledge about the problem (the hypothesis of the No Free Lunch theorem) as to work with partial knowledge about the problem, knowledge that allows us to design algorithms with specific characteristics which can make them more suitable for solving the problem.
Once situated in this field, with partial knowledge of the problem and the necessity of having algorithms at our disposal for its solution, the question of deciding when an algorithm is better than another one arises. In the case of evolutionary algorithms, this may be done by attending to efficiency and/or effectiveness criteria. When theoretical results are not available to allow the comparison of the behaviour of the algorithms, we have to focus on the analysis of empirical results.
In recent years, there has been a growing interest in the analysis of experiments in the field of evolutionary algorithms. The work of Hooker is pioneering in this line, and it shows an interesting study on what we must and must not do when we approach the analysis of the behaviour of a metaheuristic on a problem (Hooker 1997).
In relation to the analysis of experiments, we can find three types of works: the study and design of test problems, the statistical analysis of experiments, and experimental design.
Several authors have focused their interest on the design of test problems which could be appropriate for a comparative study among algorithms. Focusing our attention on continuous optimization problems, which will be used in this paper, we can point out the pioneering papers of Whitley and co-authors on the design of complex test functions for continuous optimization (Whitley et al. 1995, 1996), and the recent works of Gallagher and Yuan (2006); Yuan and Gallagher (2003). In the same way, we can find papers that present test cases for different types of problems.
Centred on the statistical analysis of the results, if we analyze the papers published in specialized journals, we find that the majority of the articles make a comparison of results based on average values of a set of executions over a concrete case. In proportion, few works use statistical procedures in order to compare results, although their use has recently been growing and it is being suggested as a need by many reviewers. When we find statistical studies, they are usually based on the average and variance by using parametric tests (ANOVA, t-test, etc.) (Czarn et al. 2004; Ozcelik and Erzurumlu 2006; Rojas et al. 2002). Recently, non-parametric statistical procedures have been considered for use in the analysis of results (García et al. 2007; Moreno-Pérez et al. 2007). A similar situation can be found in the machine learning community (Demšar 2006).
Experimental design consists of a set of techniques which comprise methodologies for adjusting the parameters of the algorithms depending on the settings used and the results obtained (Bartz-Beielstein 2006; Kramer 2007). In our study, we are not interested in this topic; we assume that the algorithms in a comparison have obtained the best possible results, depending on an optimal adjustment of their parameters for each problem.
We are interested in the use of statistical techniques for the analysis of the behaviour of evolutionary algorithms over optimization problems, analyzing the use of parametric statistical tests and non-parametric ones (Sheskin 2003; Zar 1999). We will analyze the required conditions for the usage of the parametric tests, and we will carry out an analysis of results by using non-parametric tests.
The study of this paper is organized into two parts. The first one, which we will denote by single-problem analysis, corresponds to the study of the required conditions for a safe use of parametric statistical procedures when comparing the algorithms over a single problem. The second one, denoted by multiple-problem analysis, supposes the study of the same required conditions when considering a comparison of algorithms over more than one problem simultaneously.
The single-problem analysis is usually found in the specialized literature (Bartz-Beielstein 2006; Ortiz-Boyer et al. 2007). Although the required conditions for using parametric statistics are usually not fulfilled, as we will see here, a parametric statistical study could obtain conclusions similar to a non-parametric one. However, in the multiple-problem analysis, due to the dissimilarities in the results obtained and the small size of the sample to be analyzed, a parametric test may reach erroneous conclusions. In recent papers, authors have started using single-problem and multiple-problem analysis simultaneously (Ortiz-Boyer et al. 2007).
Non-parametric tests can be used for comparing algorithms whose results represent average values for each problem, in spite of the inexistence of relationships among them. Given that the non-parametric tests do not require explicit conditions to be conducted, it is recommendable that the sample of results be obtained following the same criterion, that is, computing the same aggregation (average, mode, etc.) over the same number of runs for each algorithm and problem. They are used here for analyzing the results of the CEC2005 Special Session on Real Parameter Optimization (Suganthan et al. 2005) over all the test problems, for which the average results of the algorithms on each function are published. We will show significant statistical differences among the algorithms compared in the CEC2005 Special Session on Real Parameter Optimization, supporting the conclusions obtained in this session.
In order to do that, the paper is organized as follows. In Sect. 2, we describe the setting of the CEC2005 Special Session: algorithms, test functions and parameters. Section 3 shows the study on the required conditions for a safe use of parametric tests, considering single-problem and multiple-problem analysis. We analyze the published results of the CEC2005 Special Session on Real Parameter Optimization by using non-parametric tests in Sect. 4. Section 5 points out some considerations on the use of non-parametric tests. The conclusions of the paper are presented in Sect. 6. An introduction to statistics and a complete description of the non-parametric test procedures are given in Appendix A. The published average results of the CEC2005 Special Session are shown in Appendix B.
– 2 expanded functions.
– 11 hybrid functions. Each one of them has been defined through compositions of 10 out of the 14 previous functions (different in each case).
All functions have been displaced in order to ensure that their optima can never be found in the centre of the search space. In two functions, in addition, the optima cannot be found within the initialization range, and the domain of search is not limited (the optimum is out of the range of initialization).
2.3 Characteristics of the experimentation
The experiments were performed following the instructions indicated in the document associated to the competition. The main characteristics are:
– Each algorithm is run 25 times for each test function, and the average error of the best individual of the population is computed.
– We use the study with dimension D = 10, in which the algorithms perform 100,000 evaluations of the fitness function. In the mentioned competition, experiments with dimensions D = 30 and D = 50 have also been done.
– Each run stops either when the error obtained is less than 10^-8, or when the maximal number of evaluations is achieved (a minimal harness for this protocol is sketched below).
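The following sketch (in Python, not part of the original experimentation code) illustrates the run protocol; `optimize`, `benchmark` and `f_star` are hypothetical placeholders for an algorithm under test, the suite of objective functions and their known optima.

```python
# Sketch of the CEC2005 run protocol: 25 runs per function, dimension 10,
# 100,000 fitness evaluations, early stop when the error drops below 1e-8.
import numpy as np

RUNS, DIM, MAX_EVALS, TARGET_ERROR = 25, 10, 100_000, 1e-8

def run_protocol(optimize, benchmark, f_star):
    """Return, per function, the 25 final errors of the best individual."""
    errors = {}
    for name, func in benchmark.items():
        per_run = []
        for seed in range(RUNS):
            best = optimize(func, dim=DIM, max_evals=MAX_EVALS,
                            stop_error=TARGET_ERROR, seed=seed)
            per_run.append(best - f_star[name])  # error of the best individual
        errors[name] = np.array(per_run)
    return errors  # averages of these arrays feed the multiple-problem analysis
```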
3 Study of the required conditions for the safe use of parametric tests
In this section, we will describe and analyze the conditions that must be satisfied for the safe usage of parametric tests (Sect. 3.1). To do so, we collect the overall set of results obtained by the algorithms BLX-MA and BLX-GL50 on the 25 functions considering dimension D = 10. With them, we will first analyze the indicated conditions over the complete sample of results for each function, in a single-problem analysis (see Sect. 3.2). Finally, we will consider the average results for each function to compose a sample of results for each one of the two algorithms. With these two samples we will check again the required conditions for the safe use of parametric tests in a multiple-problem scheme (see Sect. 3.3).
3.1 Conditions for the safe use of parametric tests
In Sheskin (2003), the distinction between parametric and non-parametric tests is based on the level of measure represented by the data which will be analyzed. In this way, a parametric test uses data composed of real values.
The latter does not imply that whenever we dispose of this type of data, we should use a parametric test. There are other initial assumptions for a safe usage of parametric tests. The non-fulfillment of these conditions might cause a statistical analysis to lose credibility.
In order to use the parametric tests, it is necessary to check the following conditions (Sheskin 2003; Zar 1999):
– Independence: In statistics, two events are independent when the fact that one occurs does not modify the probability of the other one occurring.
– Normality: An observation is normal when its behaviour follows a normal or Gaussian distribution with certain values of mean and variance. A normality test applied over a sample can indicate the presence or absence of this condition in the observed data. We will use three normality tests:
  – Kolmogorov-Smirnov: It compares the accumulated distribution of the observed data with the accumulated distribution expected from a Gaussian distribution, obtaining the p-value from both discrepancies.
  – Shapiro-Wilk: It analyzes the observed data to compute the level of symmetry and kurtosis (shape of the curve) in order to compute the difference with respect to a Gaussian distribution afterwards, obtaining the p-value from the sum of the squares of these discrepancies.
  – D'Agostino-Pearson: It first computes the skewness and kurtosis to quantify how far from Gaussian the distribution is in terms of asymmetry and shape. It then calculates how far each of these values differs from the value expected with a Gaussian distribution, and computes a single p-value from the sum of these discrepancies.
– Heteroscedasticity: This property indicates a violation of the hypothesis of equality of variances. Levene's test is used for checking whether or not k samples present this homogeneity of variances (homoscedasticity). When the observed data do not fulfill the normality condition, this test's result is more reliable than Bartlett's test (Zar 1999), which checks the same property.
In our case, the independence of the events is obvious, given that they are independent runs of the algorithm with randomly generated initial seeds. In the following, we will carry out the normality analysis by using the Kolmogorov-Smirnov, Shapiro-Wilk and D'Agostino-Pearson tests in single-problem and multiple-problem analysis, and the heteroscedasticity analysis by means of Levene's test; a sketch of these checks with a statistics library is shown below.
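The sketch below uses SciPy rather than the SPSS package mentioned later, on synthetic stand-in samples rather than the published run results; note that the Kolmogorov-Smirnov call standardizes the data first, a simplification of the Lilliefors-corrected variant that statistical packages apply when the parameters are estimated from the sample.

```python
# Normality (three tests) and homoscedasticity (Levene) checks on two
# samples of 25 results each; the data here is illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_a = rng.normal(1.0, 0.2, 25)      # stand-in for 25 runs of one algorithm
sample_b = rng.lognormal(0.0, 0.5, 25)   # stand-in with a clearly skewed shape

for name, x in (("A", sample_a), ("B", sample_b)):
    z = (x - x.mean()) / x.std(ddof=1)
    ks = stats.kstest(z, "norm")   # Kolmogorov-Smirnov against N(0, 1)
    sw = stats.shapiro(x)          # Shapiro-Wilk
    dp = stats.normaltest(x)       # D'Agostino-Pearson (skewness + kurtosis)
    print(name, round(ks.pvalue, 3), round(sw.pvalue, 3), round(dp.pvalue, 3))

# Levene's test for homogeneity of variances between the two samples
print("Levene p =", round(stats.levene(sample_a, sample_b).pvalue, 3))
```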
3.2 On the study of the required conditions over single-problem analysis
With the samples of results obtained from running the algorithms BLX-GL50 and BLX-MA 25 times for each function, we can apply statistical tests to determine whether or not they satisfy the normality and homoscedasticity properties. We have seen before that the independence condition is easily satisfied in this type of experiment. The number of runs may be low for carrying out a statistical analysis, but it was a requirement in the CEC2005 Special Session.
All the tests used in this section obtain the associated p-value, which represents the dissimilarity of the sample of results with respect to the normal shape. Hence, a low p-value points out a non-normal distribution. In this study, we will consider a level of significance α = 0.05, so a p-value greater than α indicates that the condition of normality is fulfilled. All the computations have been performed with the statistical software package SPSS.
Table 1 shows the results of the Kolmogorov-Smirnov test, where the symbol * indicates that normality is not satisfied, with the p-value in brackets. Table 2 shows the results of applying the Shapiro-Wilk test.
[Tables 1 and 2: p-values of the Kolmogorov-Smirnov and Shapiro-Wilk normality tests for BLX-GL50 and BLX-MA on functions f1-f25; the tabulated values are not reliably recoverable from the extraction.]
Table 3 Test of normality of D'Agostino-Pearson
[p-values for BLX-GL50 and BLX-MA on functions f1-f25; the tabulated values are not reliably recoverable from the extraction.]
Fig. 1 Example of a non-normal distribution, function f20 and the BLX-GL50 algorithm: histogram and Q-Q plot
Fig. 2 Example of a normal distribution, function f10 and the BLX-MA algorithm: histogram and Q-Q plot
Fig. 3 Example of a special case, function f21 and the BLX-MA algorithm: histogram and Q-Q plot
[Table 4: p-values of Levene's test for homogeneity of variances on functions f1-f25; the tabulated values are not reliably recoverable from the extraction.]
The results of the three normality tests do not always agree: a normality test could work better than another depending on the type of data, the number of ties or the number of results collected. Due to this fact, we have employed three well-known normality tests for studying the normality condition. The choice of the most appropriate normality test depending on the problem is out of the scope of this paper.
With respect to the study of the homoscedasticity property, Table 4 shows the results of applying Levene's test, where the symbol * indicates that the variances of the distributions of the different algorithms for a certain function are not homogeneous (we reject the null hypothesis at a level of significance α = 0.05).
Clearly, in both cases, the non-fulfillment of the normality and homoscedasticity conditions is perceptible. In most functions, the normality condition is not verified in a single-problem analysis. Homoscedasticity also depends on the number of algorithms studied, because it checks the relationship among the variances of all the population samples. Even though in this case we only analyze this condition on the results of two algorithms, the condition is also not fulfilled in many cases.
A researcher may think that the non-fulfillment of these conditions is not crucial for obtaining adequate results. Using the same samples of results, we will show an example in which some results offered by a parametric test, the paired t-test, do not agree with those obtained through a non-parametric test, the Wilcoxon test. Table 5 presents the difference of average error rates on each function between the algorithms BLX-GL50 and BLX-MA (if it is negative, the best performing algorithm is BLX-GL50), and the p-values obtained by the paired t-test and the Wilcoxon test.
Table 5 Difference of error rates and p-values for the paired t-test and the Wilcoxon test in single-problem analysis
[The per-function values are not reliably recoverable from the extraction.]
As we can see, the p-values obtained by the paired t-test are very similar to the ones obtained by the Wilcoxon test. However, in three cases, they are quite different. We enumerate them:
– In function f4, the Wilcoxon test considers that both algorithms behave differently, whereas the paired t-test does not. This example perfectly fits a non-practical case: the difference of error rates is less than 10^-7, which in a practical sense has no significant effect.
– In function f15, the situation is the opposite of the previous one. The paired t-test obtains a significant difference in favour of BLX-MA. Is this result reliable? As the normality condition is not verified in the results of f15 (see Tables 1, 2, 3), the results obtained by the Wilcoxon test are theoretically more reliable.
– Finally, in function f22, although the Wilcoxon test obtains a p-value greater than the level of significance α = 0.05, both p-values are again very different.
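This kind of disagreement is easy to reproduce; a minimal sketch, running both tests on synthetic paired run results (stand-ins for the 25 runs on one function, not the published data):

```python
# Paired t-test vs. Wilcoxon signed-ranks test on one function's 25 paired runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
errors_a = rng.lognormal(0.0, 1.0, 25)              # heavy-tailed error sample
errors_b = errors_a * rng.lognormal(0.3, 0.5, 25)   # correlated, slightly worse

print("t-test p =", round(stats.ttest_rel(errors_a, errors_b).pvalue, 3))
print("Wilcoxon p =", round(stats.wilcoxon(errors_a, errors_b).pvalue, 3))
# With skewed samples the two p-values can disagree, as in f4 and f15 above.
```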
Some alternatives can be considered when these conditions are not fulfilled:
– The computation of an effect-size index such as Cohen's d, which for a paired t-test can be obtained as d = t/√n, where t is the t-test statistic and n is the number of results collected. If d is near 0.5, the differences are significant; a value of d lower than 0.25 indicates insignificant differences, and the statistical analysis may not be taken into account.
– The application of transformations for obtaining normal distributions, such as logarithm, square root, reciprocal and power transformations (Patel and Read 1982).
– In some situations, skipping outliers, but this technique must be used with great care.
These alternatives could solve the normality condition, but the homoscedasticity condition may prove difficult to solve. Some parametric tests, such as ANOVA, are very influenced by the homoscedasticity condition.
3.3 On the study of the required conditions over multiple-problem analysis
When tackling a multiple-problem analysis, the data to be used is an aggregation of results obtained from individual algorithm runs. In this aggregation, there must be only one result representing each problem or function. This result could be obtained by averaging the results of all runs or something similar, but the procedure followed must be the same for each function; i.e., in this paper we have used the average of the 25 runs of an algorithm on each function. The size of the sample of results to be analyzed, for each algorithm, is equal to the number of problems. In this way, a multiple-problem analysis allows us to compare two or more algorithms over a set of problems simultaneously (see the sketch below).
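The aggregation step can be sketched as follows; the run matrices are random placeholders for two algorithms' 25 runs on each of the 25 functions, not the published results.

```python
# Multiple-problem aggregation: average the runs of each function, keep one
# value per problem, then compare the two samples of 25 averages.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
runs_a = rng.lognormal(0.0, 1.0, (25, 25))   # 25 functions x 25 runs, algorithm A
runs_b = rng.lognormal(0.1, 1.0, (25, 25))   # same shape, algorithm B

avg_a, avg_b = runs_a.mean(axis=1), runs_b.mean(axis=1)  # one result per function
print("Wilcoxon p =", round(stats.wilcoxon(avg_a, avg_b).pvalue, 3))
```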
We can use the results published in the CEC2005 Special Session to perform a multiple-problem analysis. Indeed, we will follow the same procedure as in the previous subsection. We will analyze the required conditions for the safe usage of parametric tests over the sample of results obtained by averaging the error rate on each function. Table 6 shows the p-values of the normality tests over the sample results obtained by BLX-GL50 and BLX-MA. Figures 4 and 5 represent the histograms and Q-Q plots for such samples.
Obviously, the normality condition is not satisfied, because the sample of results is composed of 25 average error rates computed on 25 different problems. We compare the behaviour of the two algorithms by means of pairwise statistical tests:
– The p-value obtained with a paired t-test is p = 0.318. The paired t-test does not consider the existence of a difference in performance between the algorithms.
– The p-value obtained with the Wilcoxon test is p = 0.089. The Wilcoxon test does not consider the existence of a difference in performance between the algorithms either, but it considerably reduces the minimal level of significance for detecting differences. If the level of significance considered were α = 0.10, Wilcoxon's test would confirm that BLX-GL50 is better than BLX-MA.
The average results for these two algorithms indicate this behaviour, BLX-GL50 usually performing better than BLX-MA (see Table 13 in Appendix B), but a paired t-test is unable to detect the difference.
Table 6 Normality tests over multiple-problem analysis

Algorithm   Kolmogorov-Smirnov   Shapiro-Wilk   D'Agostino-Pearson
BLX-GL50    (.00)                (.00)          (.00)
BLX-MA      (.00)                (.00)          (.10)
Table 7 Results of the Friedman and Iman-Davenport tests (α = 0.05)

Functions   Friedman value   Critical χ² value   p-value    Iman-Davenport value   Critical F_F value   p-value
f15-f25     26.942           18.307              0.0027     3.244                  1.930                0.0011
All         41.985           18.307              <0.0001    4.844                  1.875                <0.0001
Table 8 Average rankings of the algorithms (Friedman)

Algorithm    Ranking (f15-f25)   Ranking (f1-f25)
BLX-GL50     5.227               5.3
BLX-MA       7.681               7.14
CoEVO        9.000               6.44
DE           4.955               5.66
DMS-L-PSO    5.409               5.02
EDA          6.318               6.74
G-CMA-ES     3.045               3.34
K-PCX        7.545               6.8
L-CMA-ES     6.545               6.22
L-SaDE       4.956               4.92
SPC-PNX      5.318               6.42
Critical difference (Bonferroni-Dunn, α = 0.05)   3.970   2.633
Critical difference (Bonferroni-Dunn, α = 0.10)   3.643   2.417
Table 9 p-values on functions f15-f25 (G-CMA-ES is the control algorithm)

G-CMA-ES vs.   z         Unadjusted p    Bonferroni-Dunn p   Holm p          Hochberg p
CoEVO          4.21050   2.54807×10^-5   2.54807×10^-4       2.54807×10^-4   2.54807×10^-4
BLX-MA         3.27840   0.00104         0.0104              0.00936         0.00936
K-PCX          3.18198   0.00146         0.0146              0.01168         0.01168
L-CMA-ES       2.47487   0.01333         0.1333              0.09331         0.09331
EDA            2.31417   0.02066         0.2066              0.12396         0.12396
DMS-L-PSO      1.67134   0.09465         0.9465              0.47325         0.17704
SPC-PNX        1.60706   0.10804         1.0                 0.47325         0.17704
BLX-GL50       1.54278   0.12288         1.0                 0.47325         0.17704
DE             1.34993   0.17704         1.0                 0.47325         0.17704
L-SaDE         1.34993   0.17704         1.0                 0.47325         0.17704
Table 10 p-values on all functions f1-f25 (G-CMA-ES is the control algorithm)

G-CMA-ES vs.   z         Unadjusted p    Bonferroni-Dunn p   Holm p          Hochberg p
CoEVO          5.43662   5.43013×10^-8   5.43013×10^-7       5.43013×10^-7   5.43013×10^-7
BLX-MA         4.05081   5.10399×10^-5   5.10399×10^-4       4.59359×10^-4   4.59359×10^-4
K-PCX          3.68837   2.25693×10^-4   0.002257            0.001806        0.001806
EDA            3.62441   2.89619×10^-4   0.0028961           0.002027        0.002027
SPC-PNX        3.28329   0.00103         0.0103              0.00618         0.00618
L-CMA-ES       3.07009   0.00214         0.0214              0.0107          0.0107
DE             2.47313   0.01339         0.1339              0.05356         0.05356
BLX-GL50       2.08947   0.03667         0.3667              0.11            0.09213
DMS-L-PSO      1.79089   0.07331         0.7331              0.14662         0.09213
L-SaDE         1.68429   0.09213         0.9213              0.14662         0.09213
As in the previous section, we will apply more powerful procedures, such as Holm's and Hochberg's (described in Appendix A.3), for comparing the control algorithm with the rest of the algorithms. The results are shown by computing the p-values for each comparison. Tables 9 and 10 show the p-values obtained by the Bonferroni-Dunn, Holm and Hochberg procedures considering both groups of functions. The procedure used to compute the p-values is explained in Appendix A.3.
Holm's and Hochberg's procedures allow us to point out the following differences, considering G-CMA-ES as the control algorithm:
– f15-f25: G-CMA-ES is better than CoEVO, BLX-MA and K-PCX with α = 0.05 (3/10 algorithms) and better than L-CMA-ES with α = 0.10 (4/10 algorithms). Here, Holm's and Hochberg's procedures coincide, and they reject an extra hypothesis with α = 0.10 with regard to Bonferroni-Dunn's.
– f1-f25: Based on Holm's procedure, G-CMA-ES outperforms CoEVO, BLX-MA, K-PCX, EDA, SPC-PNX and L-CMA-ES with α = 0.05 (6/10 algorithms), and it also outperforms DE with α = 0.10 (7/10 algorithms). It rejects the same number of hypotheses as Bonferroni-Dunn does with α = 0.05, and one more hypothesis than Bonferroni-Dunn with α = 0.10. Hochberg's procedure behaves the same as Holm's when we establish α = 0.05. However, with α = 0.10 it obtains a different result: all the p-values in the comparison are lower than 0.10, so all the associated hypotheses are rejected (10/10 algorithms). In fact, Hochberg's procedure confirms that G-CMA-ES is the best algorithm in the competition considering all the functions as a whole.
In the following, we present a study in which the G-CMA-ES algorithm is compared with the rest of them by means of pairwise comparisons. In this study we will use the Wilcoxon test (see Appendix A.2).
Until now, we have used procedures for performing multiple comparisons in order to check the behaviour of the algorithms. Attending to the results of Hochberg's procedure, this process might not be necessary, but we include it to stress the differences between using multiple comparison procedures and pairwise comparisons. Tables 11 and 12 summarize the results of applying the Wilcoxon test. They display the sum of rankings obtained in each comparison and the associated p-value.
Table 11 Wilcoxon test considering functions f15-f25

G-CMA-ES vs.   R+     R-     p-value
BLX-GL50       62.5   3.5    0.009
BLX-MA         60.0   6.0    0.016
CoEVO          60.0   6.0    0.016
DE             56.5   9.5    0.028
DMS-L-PSO      47.0   19.0   0.213
EDA            60.5   5.5    0.013
K-PCX          60.0   6.0    0.016
L-CMA-ES       58.0   8.0    0.026
L-SaDE         47.5   18.5   0.203
SPC-PNX        63.5   2.5    0.007
Table 12 Wilcoxon test considering all functions

G-CMA-ES vs.   R+      R-      p-value
BLX-GL50       289.5   35.5    0.001
BLX-MA         295.5   29.5    0.001
CoEVO          301.0   24.0    0.000
DE             262.5   62.5    0.009
DMS-L-PSO      199.0   126.0   0.357
EDA            284.5   40.5    0.001
K-PCX          269.0   56.0    0.004
L-CMA-ES       273.0   52.0    0.003
L-SaDE         209.0   116.0   0.259
SPC-PNX        305.5   19.5    0.000
Wilcoxon's test performs individual comparisons between two algorithms (pairwise comparisons). The p-value in a pairwise comparison is independent of the others. If we try to extract a conclusion involving more than one pairwise comparison in a Wilcoxon analysis, we will obtain an accumulated error coming from the combination of pairwise comparisons. In statistical terms, we lose control of the Family-Wise Error Rate (FWER), defined as the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests. The true statistical significance for combining pairwise comparisons is given by:
p = P(Reject H0 | H0 true)
  = 1 − P(Accept H0 | H0 true)
  = 1 − P(Accept A_k = A_i, i = 1, ..., k−1 | H0 true)
  = 1 − ∏_{i=1}^{k−1} P(Accept A_k = A_i | H0 true)
  = 1 − ∏_{i=1}^{k−1} (1 − p_{H_i})    (1)
Observing Table 11, the statement "the G-CMA-ES algorithm outperforms the BLX-GL50, BLX-MA, CoEVO, DE, EDA, K-PCX, L-CMA-ES and SPC-PNX algorithms with a level of significance α = 0.05" may not be correct unless we can check it while controlling the FWER. The G-CMA-ES algorithm really outperforms these eight algorithms considering independent pairwise comparisons, due to the fact that the p-values are below α = 0.05. On the other hand, note that two algorithms were not included. If we include them within the multiple comparison, the p-value obtained is p = 0.4505 in the f15-f25 group and p = 0.5325 considering all functions. In such cases, it is not possible to declare that the G-CMA-ES algorithm obtains a significantly better performance than the remaining algorithms, due to the fact that the p-values achieved are too high.
From expression (1) and Tables 11 and 12, we can deduce that G-CMA-ES is better than the eight algorithms enumerated before with a p-value of
p = 1 − ((1 − 0.009)(1 − 0.016)(1 − 0.016)(1 − 0.028)(1 − 0.013)(1 − 0.016)(1 − 0.026)(1 − 0.007)) = 0.123906
for the group of functions f15-f25 and
p = 1 − ((1 − 0.001)(1 − 0.001)(1 − 0.000)(1 − 0.009)(1 − 0.001)(1 − 0.004)(1 − 0.003)(1 − 0.000)) = 0.018874
considering all functions. Hence, the previous statement is definitively confirmed only when considering all functions in the comparison.
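Expression (1) can be evaluated directly; the short sketch below reproduces the two combined p-values computed above from the published Wilcoxon p-values of Tables 11 and 12.

```python
# FWER-corrected significance of a combination of independent pairwise
# comparisons, per expression (1); the eight comparisons exclude DMS-L-PSO
# and L-SaDE, as in the text.
import numpy as np

def fwer_p(pvalues):
    return 1.0 - np.prod(1.0 - np.asarray(pvalues))

p_f15_f25 = [0.009, 0.016, 0.016, 0.028, 0.013, 0.016, 0.026, 0.007]
p_all = [0.001, 0.001, 0.000, 0.009, 0.001, 0.004, 0.003, 0.000]
print(round(fwer_p(p_f15_f25), 6))  # 0.123906 -> not conclusive at alpha = 0.05
print(round(fwer_p(p_all), 6))      # 0.018874 -> confirmed over all functions
```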
The procedures designed for performing multiple comparisons control the FWER by definition. Using the example considered in this section, in which we have used the G-CMA-ES algorithm as control, we can easily reflect the relationship among the power of all the testing procedures used. In increasing order of power and considering all functions in the study, the procedures can be ordered in the following way: Bonferroni-Dunn (p = 0.9213), Wilcoxon's test (when it is used in multiple comparisons) (p = 0.5325), Holm (p = 0.1466) and Hochberg (p = 0.0921).
Finally, we must point out that the statistical procedures used here indicate that the best algorithm is G-CMA-ES. Although in Hansen (2005) the categorization of the functions depending on their degree of difficulty is different from the one used in this paper (we have joined the unimodal and soluble multimodal functions in one group), the G-CMA-ES algorithm has been stressed as the algorithm with the best behaviour considering error rate. Therefore, and to sum up, in this paper the conclusions drawn in Hansen (2005) have been statistically supported.
5 Considerations on the use of non-parametric tests
Several aspects of the experimental framework need to be determined before applying the tests. Firstly, the minimum sample considered acceptable for each test needs to be stipulated. There is no established agreement about this specification. Statisticians have studied the minimum sample size when a certain power of the statistical test is expected (Noether 1987; Morse 1999). In our case, the employment of a sample size as large as possible is preferable, because the power of the statistical tests (defined as the probability that the test will reject a false null hypothesis) will increase. Moreover, in a multiple-problem analysis, increasing the sample size depends on the availability of new functions (which should be well known in the real-parameter optimization field). Secondly, we have to study how the results would be expected to vary if a larger sample size were available. In all statistical tests used for comparing two or more samples, increasing the sample size benefits the power of the test. In the following items, we will state that Wilcoxon's test is less influenced by this factor than Friedman's test. Finally, as a rule of thumb, the number of functions (N) in a study should be N = a · k, where k is the number of algorithms to be compared and a ≥ 2.
– Taking into account the previous observation and knowing the operations performed by the non-parametric tests, we can deduce that Wilcoxon's test is influenced by the number of functions used. On the other hand, both the number of algorithms and the number of functions are crucial when we refer to the multiple comparison tests (such as Friedman's test), given that all the critical values depend on the value of N (see the expressions in Appendix A.3). However, increasing or decreasing the number of functions rarely affects the computation of the ranking. In these procedures, the number of functions used is an important factor to be considered when we want to control the FWER.
– An appropriate number of algorithms, in contrast with an appropriate number of functions, needs to be used in order to employ each type of test. The number of algorithms used in multiple comparison procedures must be lower than the number of functions. In the study of the CEC2005 Special Session, we can appreciate the effect of the number of functions used while the number of algorithms stays constant. See, for instance, the p-values obtained when considering the f15-f25 group and all functions. In the latter case, the p-values obtained are always lower than in the first one, for each testing procedure. In general, p-values decrease as the number of functions used in multiple comparison procedures increases; therefore, the differences among the algorithms are more detectable.
– The previous statement may not be true for Wilcoxon's test. The influence of the number of functions used is more noticeable in multiple comparison procedures than in Wilcoxon's test. For example, the final p-value computed for Wilcoxon's test in the group f15-f25 is lower than in the group f1-f25 (see the previous section).
6 Conclusions
In this paper we have studied the use of statistical techniques in the analysis of the behaviour of evolutionary algorithms in optimization problems, analyzing the use of parametric and non-parametric statistical tests.
We have distinguished two types of analysis. The first one, called single-problem analysis, is that in which the results are analyzed for each function/problem independently. The second one, called multiple-problem analysis, is that in which the results are analyzed by considering all the problems studied simultaneously.
In single-problem analysis, we have seen that the required conditions for a safe usage of parametric statistics are usually not satisfied. Nevertheless, the results obtained are quite similar between a parametric and a non-parametric analysis. Also, there are procedures for transforming or adapting sample results to be used by parametric statistical tests.
We encourage the use of non-parametric tests when we want to analyze results obtained by evolutionary algorithms for continuous optimization problems in multiple-problem analysis, due to the fact that the initial conditions that guarantee the reliability of the parametric tests are not satisfied. In this case, the results come from different problems and it is not possible to analyze them by means of parametric statistics.
With respect to the use of non-parametric tests, we have shown how to use the Friedman, Iman-Davenport, Bonferroni-Dunn, Holm, Hochberg and Wilcoxon tests, which on the whole are a good tool for the analysis of the algorithms. We have employed these procedures to carry out a comparison on the CEC2005 Special Session on Real Parameter Optimization by using the results published for each algorithm.
Acknowledgements The authors are very grateful to the anonymous reviewers for their valuable suggestions and comments to improve the quality of this paper.
The most common way of obtaining the p-value associated to a hypothesis is by means of normal approximations: once the statistic associated to a statistical test or procedure has been computed, we can use a specific expression or algorithm to obtain a z value, which corresponds to a normal distribution statistic. Then, by using normal distribution tables, we can obtain the p-value associated with z, as sketched below.
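A quick illustration of this conversion (not the original computation) using SciPy:

```python
# Converting a z value into its two-sided p-value with the normal distribution.
from scipy.stats import norm

z = 3.27840              # e.g. the z value of G-CMA-ES vs. BLX-MA in Table 9
p = 2 * norm.sf(abs(z))  # two-sided tail probability of N(0, 1)
print(round(p, 5))       # 0.00104, the unadjusted p-value reported in Table 9
```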
A.2 The Wilcoxon matched-pairs signed-ranks test
Wilcoxon's test is used for answering the following question: do two samples represent two different populations? It is a non-parametric procedure employed in a hypothesis testing situation involving a design with two samples. It is the analogue of the paired t-test in non-parametric statistical procedures; therefore, it is a pairwise test that aims to detect significant differences between the behaviour of two algorithms.
The null hypothesis for Wilcoxon's test is H0: θ_D = 0; in the underlying populations represented by the two samples of results, the median of the difference scores equals zero. The alternative hypothesis is H1: θ_D ≠ 0, although H1: θ_D > 0 or H1: θ_D < 0 can also be used as directional hypotheses.
In the following, we describe the test's computations. Let d_i be the difference between the performance scores of the two algorithms on the i-th out of N functions. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the functions on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. Ranks of d_i = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:
R+ = Σ_{d_i > 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)
R− = Σ_{d_i < 0} rank(d_i) + (1/2) Σ_{d_i = 0} rank(d_i)
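A direct implementation of these two sums (a sketch; for brevity it always splits the ranks of zero differences evenly, omitting the rule of ignoring one of them when their count is odd):

```python
# Computation of R+ and R- as defined above; ties among |d_i| receive
# average ranks (scipy's rankdata default).
import numpy as np
from scipy.stats import rankdata

def wilcoxon_rank_sums(results_a, results_b):
    d = np.asarray(results_a, float) - np.asarray(results_b, float)
    ranks = rankdata(np.abs(d))
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
    return r_plus, r_minus

# Illustrative call on 11 paired results (placeholder numbers):
print(wilcoxon_rank_sums([3, 5, 1, 4, 2, 6, 8, 7, 9, 10, 11],
                         [2, 6, 1, 3, 2, 5, 9, 6, 8, 9, 10]))  # (50.5, 15.5)
```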
The Friedman statistic

χ²_F = (12N / (k(k+1))) [ Σ_j R_j² − k(k+1)²/4 ]

is distributed according to χ² with k − 1 degrees of freedom, where R_j = (1/N) Σ_i r_i^j and N is the number of functions. The critical values for the Friedman statistic coincide with those established for the χ² distribution when N > 10 and k > 5. Otherwise, the exact values can be seen in Sheskin (2003); Zar (1999).
Iman and Davenport (1980) proposed a derivation from the Friedman statistic, given that this last metric produces an undesirably conservative effect. The proposed statistic,

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F),

is distributed according to the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
The Bonferroni-Dunn procedure considers that the performances of two algorithms are significantly different if their average rankings differ by at least the critical difference

CD = q √( k(k+1) / (6N) ).

The value of q is the critical value of Q for a multiple non-parametric comparison with a control (Table B.16 in Zar 1999).
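Both statistics can be computed directly from an N × k matrix of average results, as in the sketch below (illustrative random data; rows are functions, columns are algorithms, lower values are better). Applied to the published average error rates, this computation yields the values of Table 7 (e.g. 26.942 and 3.244 on f15-f25).

```python
# Friedman statistic, Iman-Davenport correction and average rankings,
# mirroring the formulas above.
import numpy as np
from scipy.stats import rankdata, chi2, f as f_dist

def friedman_iman_davenport(results):
    n, k = results.shape
    ranks = np.apply_along_axis(rankdata, 1, results)  # ranks within each function
    avg_ranks = ranks.mean(axis=0)                     # R_j, one per algorithm
    chi2_f = 12 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)    # Iman-Davenport statistic
    p_chi2 = chi2.sf(chi2_f, k - 1)
    p_f = f_dist.sf(f_f, k - 1, (k - 1) * (n - 1))
    return avg_ranks, chi2_f, p_chi2, f_f, p_f

rng = np.random.default_rng(0)
demo = rng.random((25, 11))  # stand-in for 25 functions x 11 algorithms
print(friedman_iman_davenport(demo)[1:])
```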
Holm (1979) procedure: in contrast with the Bonferroni-Dunn procedure, this procedure sequentially checks the hypotheses ordered according to their significance. We denote the ordered p-values by p_1, p_2, ..., such that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. Holm's method compares each p_i with α/(k − i), starting from the most significant p-value. If p_1 is below α/(k − 1), the corresponding hypothesis is rejected and we are left to compare p_2 with α/(k − 2). If the second hypothesis is rejected, we continue with the process. As soon as a certain hypothesis cannot be rejected, all the remaining hypotheses are maintained as supported. The statistic for comparing the i-th algorithm with the j-th algorithm is

z = (R_i − R_j) / √( k(k+1) / (6N) ).

The value of z is used for finding the corresponding probability from the table of the normal distribution (the p-value), which is compared with the corresponding value of α. Holm's method is more powerful than Bonferroni-Dunn's and it makes no additional assumptions about the hypotheses checked.
Hochberg (1988) procedure: it is a step-up procedure that works in the opposite direction to Holm's method, comparing the largest p-value with α, the next largest with α/2, and so forth until it encounters a hypothesis it can reject. All hypotheses with smaller p-values are then rejected as well. Hochberg's method is more powerful than Holm's (Shaffer 1995).
When a p-value is considered within a multiple comparison, it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. One way to solve this problem is to report Adjusted P-Values (APVs), which take into account that multiple tests are conducted. An APV can be directly taken as the p-value of a hypothesis belonging to a comparison of multiple algorithms.
In the following, we explain how to compute the APVs for the three post-hoc procedures described above, following the indications given in Wright (1992):
– Bonferroni APV_i: min{v; 1}, where v = (k − 1) p_i.
– Holm APV_i: min{v; 1}, where v = max{(k − j) p_j : 1 ≤ j ≤ i}.
– Hochberg APV_i: min{(k − j) p_j : (k − 1) ≥ j ≥ i}.
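These three adjustments are easy to implement. In the sketch below (plain Python, assuming the p-values are already sorted in ascending order), feeding in the unadjusted column of Table 9 reproduces its Holm and Hochberg columns:

```python
# Bonferroni-Dunn, Holm and Hochberg adjusted p-values (Wright, 1992) for
# the k - 1 comparisons against a control; `pvals` sorted ascendingly.
def adjusted_pvalues(pvals):
    k = len(pvals) + 1
    bonf = [min((k - 1) * p, 1.0) for p in pvals]
    holm = [min(max((k - j) * pvals[j - 1] for j in range(1, i + 1)), 1.0)
            for i in range(1, k)]
    hoch = [min((k - j) * pvals[j - 1] for j in range(i, k)) for i in range(1, k)]
    return bonf, holm, hoch

table9 = [2.54807e-5, 0.00104, 0.00146, 0.01333, 0.02066,
          0.09465, 0.10804, 0.12288, 0.17704, 0.17704]
for bd, ho, hc in zip(*adjusted_pvalues(table9)):
    print(f"{bd:.5f}  {ho:.5f}  {hc:.5f}")
```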
Appendix B
[Table 13: average error rates obtained by the eleven algorithms (BLX-GL50, BLX-MA, CoEVO, DE, DMS-L-PSO, EDA, G-CMA-ES, K-PCX, L-CMA-ES, L-SaDE, SPC-PNX) on functions f1-f25 with dimension D = 10; the tabulated values are not reliably recoverable from the extraction.]
References
Abramowitz, M.: Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York (1974)
Auger, A., Hansen, N.: A restart CMA evolution strategy with increasing population size. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 1769–1776 (2005a)
Auger, A., Hansen, N.: Performance evaluation of an advanced local search evolutionary algorithm. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 1777–1784 (2005b)
Ballester, P.J., Stephenson, J., Carter, J.N., Gallagher, K.: Real-parameter optimization performance study on the CEC-2005 benchmark with SPC-PNX. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 498–505 (2005)
Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation: The New Experimentalism. Springer, New York (2006)
Czarn, A., MacNish, C., Vijayan, K., Turlach, R., Gupta, R.: Statistical exploratory analysis of genetic algorithms. IEEE Trans. Evol. Comput. 8(4), 405–421 (2004)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Gallagher, M., Yuan, B.: A general-purpose tunable landscape generator. IEEE Trans. Evol. Comput. 10(5), 590–603 (2006)
García, S., Molina, D., Lozano, M., Herrera, F.: An experimental study on the use of non-parametric tests for analyzing the behaviour of evolutionary algorithms in optimization problems. In: Proceedings of the Spanish Congress on Metaheuristics, Evolutionary and Bioinspired Algorithms (MAEB2007), pp. 275–285 (2007) (in Spanish)
García-Martínez, C., Lozano, M.: Hybrid real-coded genetic algorithms with female and male differentiation. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 896–903 (2005)
Hansen, N.: Compilation of results on the CEC benchmark function set. Tech. Report, Institute of Computational Science, ETH Zurich, Switzerland (2005). Available at http://www.ntu.edu.sg/home/epnsugan/index_files/CEC-05/compareresults.pdf
Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803 (1988)
Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian J. Statist. 6, 65–70 (1979)
Hooker, J.: Testing heuristics: we have it all wrong. J. Heuristics 1(1), 33–42 (1997)
Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic. Commun. Stat. 18, 571–595 (1980)
Kramer, O.: An experimental analysis of evolution strategies and particle swarm optimisers using design of experiments. In: Proceedings of the Genetic and Evolutionary Computation Conference 2007 (GECCO2007), pp. 674–681 (2007)
Liang, J.J., Suganthan, P.N.: Dynamic multi-swarm particle swarm optimizer with local search. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 522–528 (2005)
Molina, D., Herrera, F., Lozano, M.: Adaptive local search parameters for real-coded memetic algorithms. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 888–895 (2005)
Moreno-Pérez, J.A., Campos-Rodríguez, C., Laguna, M.: On the comparison of metaheuristics through non-parametric statistical techniques. In: Proceedings of the Spanish Congress on Metaheuristics, Evolutionary and Bioinspired Algorithms (MAEB2007), pp. 286–293 (2007) (in Spanish)
Morse, D.T.: Minsize2: a computer program for determining effect size and minimum sample size for statistical significance for univariate, multivariate, and nonparametric tests. Educ. Psychol. Meas. 59(3), 518–531 (1999)
Noether, G.E.: Sample size determination for some common nonparametric tests. J. Am. Stat. Assoc. 82(398), 645–647 (1987)
Ortiz-Boyer, D., Hervás-Martínez, C., García-Pedrajas, N.: Improving crossover operators for real-coded genetic algorithms using virtual parents. J. Heuristics 13, 265–314 (2007)
Ozcelik, B., Erzurumlu, T.: Comparison of the warpage optimization in the plastic injection molding using ANOVA, neural network model and genetic algorithm. J. Mater. Process. Technol. 171(3), 437–445 (2006)
Patel, J.K., Read, C.B.: Handbook of the Normal Distribution. Dekker, New York (1982)
Pošík, P.: Real-parameter optimization using the mutation step co-evolution. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 872–879 (2005)
Qin, A.K., Suganthan, P.N.: Self-adaptive differential evolution algorithm for numerical optimization. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 1785–1791 (2005)
Rojas, I., González, J., Pomares, H., Merelo, J.J., Castillo, P.A., Romero, G.: Statistical analysis of the main parameters involved in the design of a genetic algorithm. IEEE Trans. Syst. Man Cybern. Part C 32(1), 31–37 (2002)
Rönkkönen, J., Kukkonen, S., Price, K.V.: Real-parameter optimization with differential evolution. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 506–513 (2005)
Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584 (1995)
Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton (2003)
Sinha, A., Tiwari, S., Deb, K.: A population-based, steady-state procedure for real-parameter optimization. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 514–521 (2005)
Suganthan, P.N., Hansen, N., Liang, J.J., Deb, K., Chen, Y.P., Auger, A., Tiwari, S.: Problem definitions and evaluation criteria for the CEC 2005 Special Session on Real Parameter Optimization. Tech. Report, Nanyang Technological University (2005). Available at http://www.ntu.edu.sg/home/epnsugan/index_files/CEC-05/Tech-Report-May-30-05.pdf
Whitley, D.L., Beveridge, R., Graves, C., Mathias, K.E.: Test driving three 1995 genetic algorithms: new test functions and geometric matching. J. Heuristics 1(1), 77–104 (1995)
Whitley, D.L., Rana, S., Dzubera, J., Mathias, K.E.: Evaluating evolutionary algorithms. Artif. Intell. 85(1–2), 245–276 (1996)
Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1(1), 67–82 (1997)
Wright, S.P.: Adjusted p-values for simultaneous inference. Biometrics 48, 1005–1013 (1992)
Yuan, B., Gallagher, M.: On building a principled framework for evaluating and testing evolutionary algorithms: a continuous landscape generator. In: Proceedings of the 2003 Congress on Evolutionary Computation (CEC2003), pp. 451–458 (2003)
Yuan, B., Gallagher, M.: Experimental results for the Special Session on Real-Parameter Optimization at CEC 2005: a simple, continuous EDA. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation (CEC2005), pp. 1792–1799 (2005)
Zar, J.H.: Biostatistical Analysis. Prentice Hall, Englewood Cliffs (1999)
2.4.2. An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
salvagl@decsai.ugr.es
herrera@decsai.ugr.es
Abstract
In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing against a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n × n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.
Keywords: statistical methods, non-parametric tests, multiple comparisons tests, adjusted p-values, logically related hypotheses
1. Introduction
In the Machine Learning (ML) scientific community there is a need for rigorous and correct statistical analysis of published results, due to the fact that the development or modification of algorithms is a relatively easy task. The main difficulty related to this necessity is to understand and study the statistics and to know the exact techniques which can or cannot be applied depending on the situation, i.e. the type of results obtained. In a recently published paper in JMLR, Demšar (2006), a group of useful guidelines is given in order to perform a correct analysis when we compare a set of classifiers over multiple data sets. Demšar recommends a set of non-parametric statistical techniques (Zar, 1999; Sheskin, 2003) for comparing classifiers under these circumstances, given that the sample of results obtained by them does not fulfill the required conditions and is not large enough for a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classification tasks and checked that they are more convenient than parametric techniques.
Recent studies apply the guidelines given by Demšar in the analysis of the performance of classifiers (Esmeir and Markovitch, 2007; Marrocco et al., 2008). In them, a new proposal or methodology is offered and compared with other methods by means of pairwise comparisons. Another type of study assumes an empirical comparison or review of already proposed methods. In these cases, no proposal is offered and a statistical comparison could be very useful in determining the differences among the methods. In the specialized literature, many papers provide reviews on a specific topic and they also use statistical methodology to perform comparisons. For example, in a review of ensembles of decision trees, non-parametric tests are also applied in the analysis of performance (Banfield et al., 2007). However, only the rankings computed by Friedman's method (Friedman, 1937) are stipulated, and the authors establish comparisons based on them, without taking into account significance levels. Demšar focused his work on the analysis of new proposals, and he introduced the Nemenyi test for making all pairwise comparisons (Nemenyi, 1963). Nevertheless, the Nemenyi test is very conservative and it may not find any difference in most of the experimentations. In recent papers, the authors have used the Nemenyi test in multiple comparisons. Due to the fact that this test possesses low power, authors have to employ many data sets (Yang et al., 2007b), or most of the differences found are not significant (Yang et al., 2007a; Núñez et al., 2007). Although the employment of many data sets could seem beneficial in order to improve the generalization of results, in some specific domains, i.e. imbalanced classification (Owen, 2007) or multi-instance classification (Murray et al., 2005), data sets are difficult to find.
Procedures with more power than Nemenyi's can be found in the specialized literature. We have based our work on the need to apply more powerful procedures in empirical studies in which no new method is proposed and the benefit consists of obtaining more statistical differences among the classifiers compared. Thus, in this paper we describe these procedures and we analyze their behavior by means of multiple repetitions of experiments with randomly selected data sets.
On the other hand, we can see other works in which the p-value associated to a comparison between two classifiers is reported (García-Pedrajas and Fyfe, 2007). Classical non-parametric tests, such as Wilcoxon and Friedman (Sheskin, 2003), are incorporated in most statistical packages (SPSS, SAS, R, etc.) and the computation of the final p-value is usually implemented. However, advanced procedures such as Holm (1979), Hochberg (1988), Hommel (1988) and the ones described in this paper are usually not incorporated in statistical packages. The computation of the correct p-value, or Adjusted P-Value (APV) (Westfall and Young, 2004), in a comparison using any of these procedures is not very difficult and, in this paper, we show how to include it with an illustrative example.
The paper is set up as follows. Section 2 presents more powerful procedures for comparing all the classifiers among themselves in an n × n comparison of multiple classifiers, together with a case study. In Section 3 we describe the procedures for obtaining the APV by considering the post-hoc procedures explained by Demšar and the ones explained in this paper. In Section 4, we perform an experimental study of the behavior of the statistical procedures and we discuss the results obtained. Finally, Section 5 concludes the paper.
Friedman's test is an omnibus test which can be used to carry out this type of comparison. It allows us to detect differences considering the global set of classifiers. Once Friedman's test rejects the null hypothesis, we can proceed with a post-hoc test in order to find the concrete pairwise comparisons which produce differences. Demšar described the use of the Nemenyi test when all classifiers are compared with each other. Then, he focused on procedures that control the family-wise error when comparing with a control classifier, arguing that the objective of a study is to test whether a newly proposed method is better than the existing ones. For this reason, he described and studied in depth more powerful and sophisticated procedures derived from Bonferroni-Dunn, such as Holm's, Hochberg's and Hommel's methods.
Nevertheless, we think that performing all pairwise comparisons in an experimental analysis may be useful and interesting in cases different from the proposal of a new method. For example, it would be interesting to conduct a statistical analysis over multiple classifiers in review works in which no method is proposed. In this case, repeating comparisons with different control classifiers may lose control of the family-wise error.
Our intention in this section is to give a detailed description of more powerful and advanced procedures derived from the Nemenyi test, and to show a case study that uses these procedures.
2.1 Advanced Procedures for Performing All Pairwise Comparisons
A set of pairwise comparisons can be associated with a set or family of hypotheses. Any of the post-hoc tests which can be applied to non-parametric tests (that is, those derived from the Bonferroni correction or similar procedures) works over a family of hypotheses. As Demšar explained, the test statistic for comparing the i-th and j-th classifiers is

    z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}},

where R_i is the average rank computed through the Friedman test for the i-th classifier, k is the number of classifiers to be compared and N is the number of data sets used in the comparison.
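As an illustration, the following small sketch (ours, not part of the original paper; it assumes SciPy is available) computes this statistic and its two-sided p-value; the example uses the average ranks obtained later in the case study (Table 2):

    # Sketch: z statistic and two-sided p-value for a pairwise comparison,
    # computed from the Friedman average ranks.
    from math import sqrt
    from scipy.stats import norm

    def pairwise_p(r_i, r_j, k, n):
        """Two-sided p-value for the difference of average ranks r_i and r_j."""
        z = (r_i - r_j) / sqrt(k * (k + 1) / (6.0 * n))
        return z, 2 * norm.sf(abs(z))  # doubled upper tail of N(0, 1)

    # Case study of Section 2.2: k = 5 classifiers, N = 30 data sets.
    print(pairwise_p(2.100, 4.333, k=5, n=30))  # C4.5 vs. Kernel: z ~ -5.47, p ~ 4.5e-8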
The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance α (Table A1 in Sheskin (2003)). Two basic procedures are:
Nemenyi (1963) procedure: it adjusts the value of α in a single step (identically to Bonferroni-Dunn, Olejnik et al. (1997)) by dividing the value of α by the number of comparisons performed, m = k(k-1)/2. This procedure is the simplest, but it also has little power.
Holm (1979) procedure: it was also described in Demšar (2006), but there it was used for comparisons of multiple classifiers involving a control method. It adjusts the value of α in a step-down manner. Let p_1, ..., p_m be the ordered p-values (smallest to largest) and H_1, ..., H_m the corresponding hypotheses. Holm's procedure rejects H_1 to H_{i-1} if i is the smallest integer such that p_i > α/(m - i + 1). Other alternatives were developed by Hochberg (1988), Hommel (1988) and Rom (1990). They are easy to perform, but when considering all pairwise comparisons they often have a power similar to Holm's procedure (they have more power than Holm's procedure, but the difference between them is not very notable).
The hypotheses being tested, belonging to a family of all pairwise comparisons, are logically interrelated, so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation, suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers C_i, i = 1, 2, 3. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if C_1 is better/worse than C_2, then it is not possible that C_1 has the same performance as C_3 and C_2 has the same performance as C_3; C_3 must be better/worse than C_1, or C_2, or both at the same time. Thus, there cannot be one false and two true hypotheses among these three.
Based on this argument, Shaffer proposed two procedures which make use of the logical relations among the family of hypotheses for adjusting the value of α (Shaffer, 1986).
Shaffer's static procedure: following Holm's step-down method, at stage i, instead of rejecting H_i if p_i ≤ α/(m - i + 1), reject H_i if p_i ≤ α/t_i, where t_i is the maximum number of hypotheses which can be true given that any (i - 1) hypotheses are false. It is a static procedure, i.e. t_1, ..., t_m are fully determined for the given hypotheses H_1, ..., H_m, independently of the observed p-values. The possible numbers of true hypotheses, and thus the values of t_i, can be obtained from the recursive formula

    S(k) = \bigcup_{j=1}^{k} \left\{ \binom{j}{2} + x : x \in S(k - j) \right\},

where S(k) is the set of possible numbers of true hypotheses with k classifiers being compared, k ≥ 2, and S(0) = S(1) = {0} (a sketch computing this recursion follows).
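The recursion can be computed directly, as in this sketch (ours):

    # Sketch: S(k), the possible numbers of true hypotheses among the
    # k(k-1)/2 pairwise comparisons of k classifiers.
    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def S(k):
        if k <= 1:
            return frozenset({0})
        out = set()
        for j in range(1, k + 1):  # j = size of one group of equal classifiers
            out |= {comb(j, 2) + x for x in S(k - j)}
        return frozenset(out)

    print(sorted(S(5)))  # [0, 1, 2, 3, 4, 6, 10]
    # t_i is then max{s in S(k) : s <= m - i + 1}; for k = 5, m = 10 this
    # gives t = (10, 6, 6, 6, 6, 4, 4, 3, 2, 1), as used in Table 3 below.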
Shaffer's dynamic procedure: it increases the power of the first by substituting α/t_i at stage i with the value α/t*_i, where t*_i is the maximum number of hypotheses that could be true, given that the previous hypotheses are false. It is a dynamic procedure, since t*_i depends not only on the logical structure of the hypotheses, but also on the hypotheses already rejected at step i. Obviously, this procedure has more power than the first one. In this paper, we have not used this second procedure, given that it is subsumed by a more advanced procedure which we describe in the following.
In Bergmann and Hommel (1988), a procedure was proposed based on the idea of finding all elementary hypotheses which cannot be rejected. In order to formulate Bergmann-Hommel's procedure, we need the following definition.
Definition 1 An index set of hypotheses I ⊆ {1, ..., m} is called exhaustive if exactly all H_j, j ∈ I, could be true.
In order to exemplify the previous definition, we will consider the following case: we have three classifiers, and we will compare them in an n × n comparison. We will obtain three hypotheses:
H_1: C_1 is equal in behavior to C_2.
H_2: C_1 is equal in behavior to C_3.
H_3: C_2 is equal in behavior to C_3.
and eight possible sets S_i:
S_1: all H_j are true.
S_2: H_1 and H_2 are true and H_3 is false.
S_3: H_1 and H_3 are true and H_2 is false.
S_4: H_2 and H_3 are true and H_1 is false.
S_5: H_1 is true and H_2 and H_3 are false.
S_6: H_2 is true and H_1 and H_3 are false.
S_7: H_3 is true and H_1 and H_2 are false.
S_8: all H_j are false.
Sets S_1, S_5, S_6, S_7 and S_8 are possible, because their hypotheses can be true at the same time, so they are exhaustive sets. Set S_2, by the logical relations among the hypotheses, is not possible, because the performance of C_1 cannot be equal to both C_2 and C_3 while C_2 performs differently from C_3. The same consideration applies to S_3 and S_4, which are not exhaustive sets.
Under this definition, the procedure works as follows.
Bergmann and Hommel (1988) procedure: reject all H_j with j ∉ A, where the acceptance set

    A = \bigcup \{ I : I \text{ exhaustive}, \min\{ p_i : i \in I \} > \alpha / |I| \}

is the index set of hypotheses which cannot be rejected.
Figure 1 shows a valid algorithm for obtaining all the exhaustive sets of hypotheses, using as input a list of classifiers C. E is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses. The most important step in the algorithm is step 6. It performs a division of the classifiers into two subsets, in which the last classifier k is always inserted into the second subset and the first subset cannot be empty. In this way, we ensure that a subset yielded in a division is never empty and no repetitions are produced. For example, suppose a set C with three classifiers, C = {1, 2, 3}. All possible divisions, without taking into account the previous assumptions, are: D_1 = {C_1 = {}, C_2 = {1, 2, 3}}, D_2 = {C_1 = {1}, C_2 = {2, 3}}, D_3 = {C_1 = {2}, C_2 = {1, 3}}, D_4 = {C_1 = {1, 2}, C_2 = {3}}, D_5 = {C_1 = {3}, C_2 = {1, 2}}, D_6 = {C_1 = {1, 3}, C_2 = {2}}, D_7 = {C_1 = {2, 3}, C_2 = {1}}, D_8 = {C_1 = {1, 2, 3}, C_2 = {}}. Divisions D_1 and D_8, D_2 and D_7, D_3 and D_6, and D_4 and D_5 are pairwise equivalent. Furthermore, divisions D_1 and D_8 are not interesting. Using the assumptions in step 6 of the algorithm, the possible divisions are: D_1 = {C_1 = {1}, C_2 = {2, 3}}, D_2 = {C_1 = {2}, C_2 = {1, 3}}, D_3 = {C_1 = {1, 2}, C_2 = {3}}. In this case, all the divisions are interesting and no repetitions are yielded. The computational complexity of the algorithm for obtaining exhaustive sets is O(2^{n^2}). However, the computation requirements may be reduced by using storage capabilities: the exhaustive sets for k - i classifiers, 1 ≤ i ≤ (k - 2), can be stored in memory, so there is no need to invoke the obtainingExhaustive function recursively. The computational complexity using storage capabilities is O(2^n), so the algorithm still requires intensive computation.
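A compact sketch of this enumeration (ours; the pseudocode of Figure 1 itself is not reproduced in this copy) could be:

    # Sketch: recursively enumerate all exhaustive families of pairwise
    # hypotheses over the classifiers in C; a hypothesis is a frozenset pair.
    from itertools import combinations

    def obtain_exhaustive(C):
        C = sorted(C)
        if len(C) < 2:
            return set()  # no pairwise comparison can be formed
        # The family in which all classifiers in C behave equally.
        full = frozenset(frozenset(p) for p in combinations(C, 2))
        E = {full}
        rest = C[:-1]
        # Step 6: split C into (C1, C2), with the last classifier always in C2
        # and C1 non-empty, so no split is repeated.
        for r in range(1, len(rest) + 1):
            for C1 in combinations(rest, r):
                C2 = sorted(set(C) - set(C1))
                E1, E2 = obtain_exhaustive(list(C1)), obtain_exhaustive(C2)
                E |= E1 | E2                          # one side alone
                E |= {f1 | f2 for f1 in E1 for f2 in E2}  # unions of both sides
        return E

    print(len(obtain_exhaustive([1, 2, 3, 4])))  # 14, as in the example below

Running it on four classifiers yields the 14 exhaustive sets of the Figure 2 example and of Table 1.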
An example illustrating the algorithm for obtaining all exhaustive sets is drawn in Figure 2. In it, four classifiers, enumerated from 1 to 4 in the set C, are used. The comparisons or hypotheses are denoted by pairs of numbers without a separating character between them. For simplicity, this illustration does not show the cases in which |C_i| < 2: when |C_i| < 2, no comparisons can be performed, so the obtainExhaustive function returns an empty set E.
An edge connecting two boxes represents an invocation of this function. In each box, the list of classifiers given as input and the first initialization of the E set are displayed. The main edges, whose starting point is the initial box, are labeled by the order of invocation. Below the graph, the resulting E subset of each main edge is given. The final E is composed by the union of these E subsets. At the end of the process, 14 distinct exhaustive sets are found: E = {(12, 13, 14, 23, 24, 34), (23, 24, 34), (13, 14, 34), (12, 14, 24), (12, 13, 23), (12), (13), (14), (23), (24), (34), (12, 34), (13, 24), (23, 14)}.
Table 1 gives the number of hypotheses (m), the number (2^m) of index sets I and the number of exhaustive index sets (n_e) for k classifiers being compared.
[Figure 2: invocation tree of the obtainExhaustive function for C = {1, 2, 3, 4}; each box shows the input classifier list and the initialization of E. The E subsets returned by the seven main edges are:
1: E = {(12,13,23), (23), (13), (12)}
2: E = {(12), (34), (12,34)}
3: E = {(13), (24), (13,24)}
4: E = {(23,24,34), (23), (24), (34)}
5: E = {(23), (14), (23,14)}
6: E = {(13,14,34), (13), (14), (34)}
7: E = {(12,14,24), (12), (14), (24)}]
Table 1: Number of hypotheses m, number of index sets 2^m, and number of exhaustive index sets n_e for k classifiers.

k   m = k(k-1)/2   2^m           n_e
4   6              64            14
5   10             1024          51
6   15             32768         202
7   21             2097152       876
8   28             2.7 × 10^8    4139
9   36             6.7 × 10^10   21146
Nemenyi's test rejects the hypotheses [1-4], since the corresponding p-values are smaller than the adjusted α values.
Holm's procedure rejects the hypotheses [1-5].
1. The Kernel method is a Bayesian classifier which employs a non-parametric estimation of density functions through a Gaussian kernel function. The adjustment of the covariance matrix is performed by the ad-hoc method.
2. NaiveBayes and CN2 are classifiers for discrete domains, so we have discretized the data prior to learning with them. The discretizer algorithm is that of Fayyad and Irani (1993).
3. Data sets marked with * have been subsampled to adapt them to slow algorithms, such as CN2.
Table 2: Computation of the rankings for the five algorithms considered in the study over 30 data sets, based on test accuracy using ten-fold cross-validation.

Data set        C4.5         1-NN         NaiveBayes   Kernel       CN2
Abalone*        0.219 (3)    0.202 (4)    0.249 (2)    0.165 (5)    0.261 (1)
Adult*          0.803 (2)    0.750 (4)    0.813 (1)    0.692 (5)    0.798 (3)
Australian      0.859 (1)    0.814 (4)    0.845 (2)    0.542 (5)    0.816 (3)
Autos           0.809 (1)    0.774 (3)    0.673 (4)    0.275 (5)    0.785 (2)
Balance         0.768 (3)    0.790 (2)    0.727 (4)    0.872 (1)    0.706 (5)
Breast          0.759 (1)    0.654 (5)    0.734 (2)    0.703 (4)    0.714 (3)
Bupa            0.693 (1)    0.611 (3)    0.572 (4.5)  0.689 (2)    0.572 (4.5)
Car             0.915 (1)    0.857 (3)    0.860 (2)    0.700 (5)    0.777 (4)
Cleveland       0.544 (2)    0.531 (4)    0.558 (1)    0.439 (5)    0.541 (3)
Crx             0.855 (2)    0.796 (4)    0.857 (1)    0.607 (5)    0.809 (3)
Dermatology     0.945 (3)    0.954 (2)    0.978 (1)    0.541 (5)    0.858 (4)
German          0.725 (2)    0.705 (4)    0.739 (1)    0.625 (5)    0.717 (3)
Glass           0.674 (4)    0.736 (1)    0.721 (2)    0.356 (5)    0.704 (3)
Hayes-Roth      0.801 (1)    0.357 (4)    0.520 (2.5)  0.309 (5)    0.520 (2.5)
Heart           0.785 (2)    0.770 (3)    0.841 (1)    0.659 (5)    0.759 (4)
Ion             0.906 (2)    0.359 (5)    0.895 (3)    0.641 (4)    0.918 (1)
Led7Digit       0.710 (2)    0.402 (4)    0.728 (1)    0.120 (5)    0.674 (3)
Letter*         0.691 (2)    0.827 (1)    0.667 (3)    0.527 (5)    0.638 (4)
Lymphography    0.743 (3)    0.739 (4)    0.830 (1)    0.549 (5)    0.746 (2)
Mushrooms*      0.990 (1.5)  0.482 (5)    0.941 (3)    0.857 (4)    0.990 (1.5)
OptDigits*      0.867 (3)    0.098 (5)    0.915 (2)    0.986 (1)    0.784 (4)
Satimage*       0.821 (3)    0.872 (2)    0.815 (4)    0.885 (1)    0.778 (5)
SpamBase*       0.893 (2)    0.824 (4)    0.902 (1)    0.739 (5)    0.885 (3)
Splice*         0.799 (2)    0.655 (4)    0.925 (1)    0.517 (5)    0.755 (3)
Tic-tac-toe     0.845 (1)    0.731 (2)    0.693 (4)    0.653 (5)    0.704 (3)
Vehicle         0.741 (1)    0.701 (2)    0.591 (5)    0.663 (3)    0.619 (4)
Vowel           0.799 (2)    0.994 (1)    0.603 (4)    0.269 (5)    0.621 (3)
Wine            0.949 (4)    0.955 (2)    0.989 (1)    0.770 (5)    0.954 (3)
Yeast           0.555 (3)    0.505 (4)    0.569 (1)    0.312 (5)    0.556 (2)
Zoo             0.928 (2.5)  0.928 (2.5)  0.945 (1)    0.419 (5)    0.897 (4)
average rank    2.100        3.250        2.200        4.333        3.117
Table 3: Family of hypotheses, ordered by p-value, with the adjusted α values given by the Nemenyi (NM), Holm (HM) and Shaffer static (SH) procedures.

i    hypothesis               z = (R_i - R_j)/SE   p              α (NM)   α (HM)   α (SH)
1    C4.5 vs. Kernel          5.471                4.487 × 10^-8  0.005    0.005    0.005
2    NaiveBayes vs. Kernel    5.226                1.736 × 10^-7  0.005    0.0055   0.0083
3    Kernel vs. CN2           2.98                 0.0029         0.005    0.0063   0.0083
4    C4.5 vs. 1NN             2.817                0.0048         0.005    0.0071   0.0083
5    1NN vs. Kernel           2.654                0.008          0.005    0.0083   0.0083
6    1NN vs. NaiveBayes       2.572                0.0101         0.005    0.01     0.0125
7    C4.5 vs. CN2             2.49                 0.0128         0.005    0.0125   0.0125
8    NaiveBayes vs. CN2       2.245                0.0247         0.005    0.0167   0.0167
9    1NN vs. CN2              0.327                0.744          0.005    0.025    0.025
10   C4.5 vs. NaiveBayes      0.245                0.8065         0.005    0.05     0.05
Table 4: Exhaustive sets obtained for the case study. Those belonging to the acceptance set (A) are typed in bold.

Size 2: (12,34), (13,24), (14,23), (12,35), (13,25), (15,23), (12,45), (13,45), (23,45), (14,25), (15,24), (14,35), (24,35)
Size 3: (12,13,23), (12,14,24), (13,14,34), (23,24,34), (12,15,25), (13,15,35), (23,25,35), (14,15,45), (24,25,45), (34,35,45)
Size 4: (12,13,23,45), (12,14,24,35), (12,34,35,45), (13,14,25,34), (13,15,24,35), (13,24,25,45), (14,15,23,45), (14,23,25,35), (15,23,24,34)
Size 6: (12,13,14,23,24,34), (12,13,15,23,25,35), (12,14,15,24,25,45), (13,14,15,34,35,45), (23,24,25,34,35,45)
Size 10: (12,13,14,15,23,24,25,34,35,45)
Bergmann-Hommel's dynamic procedure allows us to clearly distinguish three groups of classifiers according to their performance.
4. We have considered that each classifier follows the order: 1 - C4.5, 2 - 1-NN, 3 - NaiveBayes, 4 - Kernel, 5 - CN2. For example, the hypothesis 13 represents the comparison between C4.5 and NaiveBayes.
3. Adjusted P-Values
The smallest level of significance that results in the rejection of the null hypothesis, the p-value, is a useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about how significant the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most importantly, it does this without committing to a particular level of significance.
When a p-value is considered within a multiple comparison, as in the example in Table 3, it reflects the error probability of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. One way to solve this problem is to report APVs, which take into account that multiple tests are conducted. An APV can be compared directly with any chosen significance level α. In this paper, we encourage the use of APVs, due to the fact that they provide more information in a statistical analysis.
In the following, we explain how to compute the APVs depending on the post-hoc procedure used in the analysis, following the indications given in Wright (1992) and Hommel and Bernhard (1999). We also include the post-hoc tests explained in Demšar (2006) and others for comparisons with a control classifier. The notation used in the computation of the APVs is the following (a sketch implementing some of the procedures appears after this list):
Indexes i and j each correspond to a concrete comparison or hypothesis in the family of hypotheses, according to an incremental order of their p-values. Index i always refers to the hypothesis whose APV is being computed, and index j refers to another hypothesis in the family.
p_j is the p-value obtained for the j-th hypothesis.
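As the detailed formulas of this section were partly lost in this copy, the following sketch (ours, based on the standard definitions in Wright (1992)) shows how the Nemenyi, Holm and Shaffer-static APVs can be obtained from the m ordered p-values:

    # Sketch: APVs for Nemenyi, Holm and Shaffer-static procedures, given the
    # list p of the m p-values sorted increasingly.

    def apv_nemenyi(p):
        m = len(p)
        return [min(1.0, m * pi) for pi in p]

    def apv_holm(p):
        m, out, running = len(p), [], 0.0
        for i, pi in enumerate(p, start=1):
            running = max(running, (m - i + 1) * pi)  # enforce monotonicity
            out.append(min(1.0, running))
        return out

    def apv_shaffer_static(p, t):
        """t[i] = t_i, the maximum number of hypotheses true at step i + 1."""
        out, running = [], 0.0
        for pi, ti in zip(p, t):
            running = max(running, ti * pi)
            out.append(min(1.0, running))
        return out

    # With the p-values of Table 3 and t = [10, 6, 6, 6, 6, 4, 4, 3, 2, 1]
    # (k = 5), these reproduce the NM, HM and SH columns of Table 5 below.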
4. Experimental Framework
In this section, we want to determine the power and behavior of the studied procedures through experiments in which we repeatedly compared the classifiers on sets of ten randomly chosen data sets, recording the number of equivalence hypotheses rejected and the APVs. We follow a method similar to the one used in Demšar (2006).
The classifiers used are the same as in the case study of the previous subsection:
Table 5: APVs obtained in the example by Nemenyi (NM), Holm (HM), Shaffer's static (SH) and Bergmann-Hommel's dynamic (BH) procedures.

i    hypothesis               p_i            APV (NM)       APV (HM)       APV (SH)       APV (BH)
1    C4.5 vs. Kernel          4.487 × 10^-8  4.487 × 10^-7  4.487 × 10^-7  4.487 × 10^-7  4.487 × 10^-7
2    NaiveBayes vs. Kernel    1.736 × 10^-7  1.736 × 10^-6  1.563 × 10^-6  1.042 × 10^-6  1.042 × 10^-6
3    Kernel vs. CN2           0.0029         0.0288         0.023          0.0173         0.0115
4    C4.5 vs. 1NN             0.0048         0.0485         0.0339         0.0291         0.0291
5    1NN vs. Kernel           0.008          0.0796         0.0478         0.0478         0.0319
6    1NN vs. NaiveBayes       0.0101         0.1011         0.0506         0.0478         0.0319
7    C4.5 vs. CN2             0.0128         0.1276         0.0511         0.0511         0.0383
8    NaiveBayes vs. CN2       0.0247         0.2474         0.0742         0.0742         0.0383
9    1NN vs. CN2              0.744          1.0            1.0            1.0            1.0
10   C4.5 vs. NaiveBayes      0.8065         1.0            1.0            1.0            1.0
C4.5 with the minimum number of item-sets per leaf equal to 2 and confidence level fitted for optimal accuracy and pruning strategy; a naive Bayesian learner with continuous attributes discretized using Fayyad and Irani (1993) discretization; the classic 1-Nearest-Neighbor classifier with Euclidean distance; CN2 with the Fayyad-Irani discretizer, star size = 5 and 95% of examples to cover; and the Kernel classifier with sigmaKernel = 0.01, which is the inverse of the variance that represents the radius of the neighborhood. All classifiers are available in the KEEL software (Alcalá-Fdez et al., 2008).5
For this study, we have compiled a sample of fifty data sets from the UCI machine learning repository (Asuncion and Newman, 2007), all of them valid for a classification task.6 We measured the performance of each classifier by means of test accuracy using ten-fold cross-validation. As Demšar did, when comparing two classifiers, samples of ten data sets were randomly selected so that the probability of data set i being chosen was proportional to 1/(1 + e^{k·d_i}), where d_i is the (positive or negative) difference in the classification accuracies on that data set, and k is the bias through which we can regulate the differences between the classifiers. With k = 0 the selection is purely random, and as k grows the selected data sets become more favorable to a particular classifier.
In comparisons of multiple classifiers, samples of data sets have to be selected with the probabilities computed from the differences in accuracy of two classifiers. We have chosen C4.5 and 1-NN, due to the fact that we found significant differences between them in the study conducted before (Section 2.2), which involved thirty data sets. Note that the repeated comparisons done here only involve ten data sets each time, so rejecting the equivalence of two classifiers is more difficult at the outset.
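A small sketch of this sampling scheme (ours; diffs is assumed to hold the per-data-set accuracy differences between the two chosen classifiers):

    # Sketch: draw a sample of `size` data sets, each kept with probability
    # proportional to 1 / (1 + exp(k * d_i)).
    import math
    import random

    def biased_sample(diffs, k, size=10):
        weights = [1.0 / (1.0 + math.exp(k * d)) for d in diffs]
        indices = list(range(len(diffs)))
        chosen = set()
        while len(chosen) < size:
            i = random.choices(indices, weights=weights)[0]
            chosen.add(i)  # re-draws of the same index give sampling without replacement
        return sorted(chosen)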
Figure 4 shows the results of this study considering the pairwise comparison between C4.5 and 1-NN. It gives an approximation of the power of the statistical procedures considered in this paper. Figure 4(a) reflects the number of times they rejected the equivalence of C4.5 and 1-NN. Obviously, the Bergmann-Hommel procedure is the most powerful, followed by Shaffer's static procedure. The graphic also informs us about the use of logically related hypotheses, given that the procedures that use this information have a bias towards the same point, while those which do not use it tend to a lower point than the first. When the selection of data sets is purely random (k = 0), the benefit of using the Bergmann-Hommel procedure is appreciable. Figure 4(b) shows the average APV of the same comparison of classifiers. As we can see, the Nemenyi procedure is too conservative in comparison with the remaining procedures. Again, the benefit of using more sophisticated testing procedures is easily noticeable.
Figure 5 shows the results of this study considering all possible pairwise comparisons in the set of classifiers. It helps us to compare the overall behavior of the four testing procedures. Figure 5(a) presents the number of times they rejected any comparison belonging to the family. Although it could seem that a selection of data sets determined by the difference in accuracy between two classifiers might not influence the overall comparison, the graphic shows that it does. Furthermore, the lines drawn follow a parallel behavior.
5. It is also available at http://www.keel.es
6. The data sets used are: abalone, adult, australian, autos, balance, bands, breast, bupa, car, cleveland, dermatology, ecoli, flare, german, glass, haberman, hayes-roth, heart, iris, led7digit, letter, lymphography, magic, monks, mushrooms, newthyroid, nursery, optdigits, pageblocks, penbased, pima, ring, satimage, segment, shuttle, spambase, splice, tae, thyroid, tic-tac-toe, twonorm, vehicle, vowel, wine, wisconsin, yeast, zoo.
[Figure 4: Results of the repeated C4.5 vs. 1-NN comparison for the Nemenyi, Holm, Shaffer and Bergmann procedures: (a) number of rejected hypotheses and (b) average p-value, both as a function of the bias k.]
[Figure 5: Results over all pairwise comparisons among the five classifiers for the same four procedures: (a) number of rejected hypotheses and (b) average p-value, as a function of the bias k.]
Bergmann-Hommel's procedure is the best performing one, but it is also the most difficult to understand and the most computationally expensive. We recommend its use when the situation requires it (i.e. when the differences among the classifiers compared are not very significant), given that the results it obtains are as valid as those of any other testing procedure.
5. Conclusions
The present paper is an extension of Demšar (2006). Demšar does not deal in depth with some topics related to multiple comparisons involving all the algorithms and the computation of adjusted p-values.
In this paper, we describe other advanced testing procedures for conducting all pairwise comparisons in a multiple-comparisons analysis: Shaffer's static and Bergmann-Hommel's procedures. The advantage they obtain is due to the incorporation of more information about the hypotheses to be tested: in n × n comparisons, a logical relationship among them exists. As a general rule, the Bergmann-Hommel procedure is the most powerful one, but it requires intensive computation in comparisons involving numerous classifiers. The second one, Shaffer's procedure, can be used instead of Bergmann-Hommel's in these cases. Moreover, we present the methods for obtaining the adjusted p-values, which are valid p-values associated to each comparison, can be compared with any level of significance without restriction, and provide more information. We have illustrated them with a case study, and we have checked that the newly described methods are more powerful than the classical ones, Nemenyi's and Holm's procedures.
Acknowledgments
This research has been supported by the project TIN2005-08386-C05-01. S. García holds an FPU scholarship from the Spanish Ministry of Education and Science. The present paper was submitted as a regular paper to the JMLR journal. After the review process, the action editor Dale Schuurmans encouraged us to submit the paper to the special topic on Multiple Simultaneous Hypothesis Testing. We are very grateful to the anonymous reviewers and to both action editors who managed this paper for their valuable suggestions and comments to improve its quality.
References
J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, and F. Herrera. KEEL: A software tool to assess evolutionary algorithms to data mining problems. Soft Computing, doi: 10.1007/s00500-008-0323-y, 2008. In press.
A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
G. Hommel and G. Bernhard. Bonferroni procedures for logically related hypotheses. Journal of Statistical Planning and Inference, 82:119-128, 1999.
R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571-595, 1980.
C. Marrocco, R. P. W. Duin, and F. Tortorella. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition, 41:1961-1974, 2008.
G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Mathematical Statistics, 2004.
J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. Journal of Machine Learning Research, 6:783-816, 2005.
P. B. Nemenyi. Distribution-free multiple comparisons. PhD thesis, Princeton University, 1963.
M. Núñez, R. Fidalgo, and R. Morales. Learning in environments with unknown dynamics: Towards more robust concept learners. Journal of Machine Learning Research, 8:2595-2628, 2007.
S. Olejnik, J. Li, S. Supattathum, and C.J. Huberty. Multiple testing and statistical power with modified Bonferroni procedures. Journal of Educational and Behavioral Statistics, 22(4):389-406, 1997.
A. B. Owen. Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8:761-773, 2007.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
D. M. Rom. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77:663-665, 1990.
J.P. Shaffer. Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826-831, 1986.
J.P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561-584, 1995.
D. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 2003.
R.J. Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73:751-754, 1986.
P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley and Sons, 2004.
S. P. Wright. Adjusted p-values for simultaneous inference. Biometrics, 48:1005-1013, 1992.
2.4.3.
Abstract The experimental analysis of the performance of a proposed method is a crucial and necessary task in a research work. This paper focuses on the statistical analysis of results in the field of Genetics-Based Machine Learning. It presents a study involving a set of techniques which can be used for carrying out a rigorous comparison among algorithms, in terms of obtaining successful classification models. Two accuracy measures for multi-class problems have been employed: classification rate and Cohen's kappa. Moreover, two interpretability measures have been employed: size of the rule set and number of antecedents. We have studied whether the samples of results obtained by genetics-based classifiers, using the performance measures cited above, satisfy the necessary conditions for being analyzed by means of parametric tests. The results obtained state that the fulfillment of these conditions is problem-dependent and indefinite, which supports the use of non-parametric statistics in the experimental analysis. In addition, non-parametric tests can be satisfactorily employed for comparing generic classifiers over various data sets considering any performance measure. According to these facts, we propose the use of the most powerful non-parametric statistical tests to carry out multiple comparisons. However, the statistical analysis conducted on interpretability must be carefully considered.
Key words Genetics-based machine learning, Genetic algorithms, Statistical tests, Non-parametric tests, Cohen's kappa, Interpretability, Classification.
Supported by the Spanish Ministry of Science and Technology under Project TIN-2005-08386-C05-01. S. García and J. Luengo hold FPU scholarships from the Spanish Ministry of Education and Science.
1 Introduction
In general, the classification problem can be covered by numerous techniques and algorithms belonging to different paradigms of Machine Learning (ML). Newly developed ML methods must be analyzed against previous approaches following a rigorous criterion, since in any empirical comparison the results depend on the choice of the cases studied, the configuration of the experimentation and the measures of performance. Nowadays, the statistical validation of published results is a necessity in order to establish a certain conclusion on an experimental analysis [18].
Evolutionary rule-based systems [21] are a kind of Genetics-Based Machine Learning (GBML) that uses sets of rules as knowledge representation [22]. Many GBML approaches have been proposed, offering some advantages with respect to other existing ML techniques, such as the production of interpretable models, no assumption of prior relationships among attributes, and the possibility of obtaining compact and precise rule sets. Some examples of proposed GBML methods are: GABIL [17], SIA [41], XCS [43], DOGMA and JoinGA [25], G-Net [4], UCS [12], GASSIST [7], OCEC [30] and HIDER [1].
Recently, statistical analysis has come to be highly demanded in any research work, and thus we can find recent studies that propose some methods for conducting comparisons among various approaches [18,34]. Statistics allows us to determine whether the obtained results are significant with respect to the choices taken and whether the conclusions achieved are supported by the experimentation that we have carried out. On the other hand, the performance of classifiers is not only given by their classification rate, and there is a growing interest in proposing or adapting new accuracy measures [11,19]. Most of the accuracy measures are proposed for two-class problems and their adaptation to multi-class problems is not intuitive [32]. Only two accuracy measures have been used for multi-class problems with successful results: the classical classification rate and Cohen's kappa measure. The latter can be computed from the confusion matrix as follows:
    \kappa = \frac{ n \sum_{i=1}^{C} x_{ii} - \sum_{i=1}^{C} x_{i.} x_{.i} }{ n^2 - \sum_{i=1}^{C} x_{i.} x_{.i} }    (1)

where x_{ii} is the count of cases in the main diagonal of the confusion matrix, n is the number of examples, C is the number of classes, and x_{i.} and x_{.i} are the row and column totals, respectively.
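A minimal sketch (ours, assuming NumPy) computing Eq. (1) from a C × C confusion matrix:

    # Sketch: Cohen's kappa from a C x C confusion matrix M (Eq. (1)).
    import numpy as np

    def cohen_kappa(M):
        M = np.asarray(M, dtype=float)
        n = M.sum()                                     # number of examples
        agreement = np.trace(M)                         # sum_i x_ii
        chance = (M.sum(axis=1) * M.sum(axis=0)).sum()  # sum_i x_i. * x_.i
        return (n * agreement - chance) / (n ** 2 - chance)

    print(cohen_kappa([[40, 10], [5, 45]]))  # 0.70 for this two-class example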
Reducing the size of the model increases its interpretability for the user. The interpretability measures used are the number of rules,

    Size = n_R,    (2)

and the average number of antecedents per rule,

    ANT = \frac{1}{n_R} \sum_{i=1}^{n_R} Ant(R_i),    (4)

where Ant(R_i) is the number of antecedents of rule R_i.
Number of classes (#C) and problem-dependent Pitts-GIRLA parameters (number of rules #R, number of generations #Gen) for each data set:

Data set   #C   Pitts-GIRLA #R   #Gen
bup        2    30               5000
cle        5    40               5000
eco        8    40               5000
gla        7    20               10000
hab        2    10               5000
iri        3    20               5000
mon        2    20               5000
new        3    20               10000
pim        2    10               5000
veh        4    20               10000
vow        11   20               10000
win        3    20               10000
wis        2    50               5000
yea        10   20               10000
In this paper, we discuss the use of statistical techniques for the analysis of GBML methods. Firstly, we distinguish between two types of analysis: single data set analysis and multiple data set analysis. A single data set analysis is carried out when the results of two or more algorithms are compared considering a unique problem or data set. A multiple data set analysis is given when our interest lies in comparing two or more approaches over multiple problems or data sets simultaneously, in order to obtain conclusions that generalize across an experimental study.
The Central Limit Theorem states that the sum of many independent, identically distributed random variables approaches a normal distribution [37]. For classification performance this condition rarely holds; it depends on the problem studied and on the number of runs of the algorithm. Moreover, an excessive number of runs (the effect of the sample size) negatively affects the statistical test, due to the fact that it makes a statistical score more sensitive to small differences in results (which would otherwise not be detected), by the simple fact of repeating runs. Thus, our intention is to study the necessary conditions for using parametric statistical tests in single data set analysis, by obtaining large samples of results from running the algorithms several times.
To do so, we first introduce the necessary conditions mentioned above. Then, we present the analysis of these conditions, and finally we show some case studies of the normality property.
4.1 Conditions for a safe use of parametric tests
In [37], the distinction made between parametric and non-parametric tests is based on the level of measurement represented by the data to be analyzed. In this sense, a parametric test uses data consisting of real values belonging to a range.
Parameters of the algorithms:
Pitts-GIRLA: Number of rules: problem-dependent; Number of generations: problem-dependent; Population size: 61 chromosomes; Crossover probability: 0.7; Mutation probability: 0.5.
XCS: number of explores = 100000, population size = 6400, α = 0.1, β = 0.2, δ = 0.1, ν = 10.0, θ_mna = 2, θ_del = 50.0, θ_sub = 50.0, ε_0 = 1, doActionSetSubsumption = false, fitness reduction = 0.1, p_I = 10.0, F_I = 0.01, ε_I = 0.0, = 0.25, χ = 0.8, μ = 0.04, θ_GA = 50.0, doGASubsumption = true, type of selection = RWS, type of mutation = free, type of crossover = 2 point, P_# = 0.33, r_0 = 1.0, m_0 = 0.1, l_0 = 0.1, doSpecify = false, nSpecify = 20.0, pSpecify = 0.5.
GASSIST-ADI: Threshold in Hierarchical Selection = 0, Iteration of Activation for Rule Deletion Operator = 5, Iteration of Activation for Hierarchical Selection = 24, Minimum Number of Rules before Disabling the Deletion Operator = 12, Minimum Number of Rules before Disabling the Size Penalty Operator = 4, Number of Iterations = 750, Initial Number of Rules = 20, Population Size = 400, Crossover Probability = 0.6, Probability of Individual Mutation = 0.6, Probability of Value 1 in Initialization = 0.90, Tournament Size = 3, Possible size in micro-intervals of an attribute = {4, 5, 6, 7, 8, 10, 15, 20, 25}, Maximum Number of Intervals per Attribute = 5, p_split = 0.05, p_merge = 0.05, Probability of Reinitialize Begin = 0.03, Probability of Reinitialize End = 0, Use MDL = true, Iteration MDL = 25, Initial Theory Length Ratio = 0.075, Weight Relaxation Factor = 0.90.
HIDER: Class Initialization Method = cwinit, Default Class = auto, Population Size = 100, Number of Generations = 100, Mutation Probability = 0.5, Percentage of Crossing = 80, Extreme Mutation Probability = 0.05, Prune Examples Factor = 0.05, Penalty Factor = 1, Error Coefficient = 0.
CN2: Percentage of Examples to Cover = 95%, Star Size = 5, Use Disjunct Selectors = NO.
Classification rate (mean and standard deviation) for each algorithm:

Data set   Pitts-GIRLA      XCS              GASSIST-ADI      HIDER            CN2
           Mean    SD       Mean    SD       Mean    SD       Mean    SD       Mean    SD
bup        .5922   .0641    .6568   .0764    .6306   .0932    .6186   .0986    .5715   .0740
cle        .5583   .0376    .5650   .0540    .5613   .0693    .5545   .0723    .5412   .0457
eco        .7367   .0850    .8105   .0680    .7985   .0703    .8422   .0597    .8101   .0618
gla        .6247   .1104    .7181   .1279    .6472   .1035    .6962   .1331    .6998   .0963
hab        .6997   .1245    .7284   .0484    .7121   .0676    .7485   .0449    .7349   .0444
iri        .9493   .0514    .9493   .0477    .9653   .0409    .9640   .0409    .9400   .0492
mon        .6236   .1165    .6728   .0238    .6673   .0407    .6719   .0206    .6719   .0215
new        .9140   .0499    .9449   .0545    .9269   .0511    .9382   .0660    .9446   .0472
pim        .6485   .1161    .7520   .0581    .7425   .0437    .7473   .0497    .7122   .0393
veh        .4594   .1095    .7359   .0446    .6783   .0421    .6593   .0502    .6191   .0839
vow        .2467   .0548    .5438   .0682    .4020   .0356    .7248   .0482    .6212   .0632
win        .7039   .2199    .9584   .0477    .9056   .0744    .9476   .0792    .9268   .0648
wis        .7655   .2269    .9666   .0189    .9564   .0247    .9653   .0236    .9517   .0218
yea        .3723   .0877    .4960   .0598    .5442   .0327    .5781   .0376    .5560   .0362
AVG        .6353   .1039    .7499   .0570    .7242   .0564    .7625   .0611    .7358   .1529
Cohen's kappa (mean and standard deviation) for each algorithm:

Data set   Pitts-GIRLA      XCS              GASSIST-ADI      HIDER            CN2
           Mean    SD       Mean    SD       Mean    SD       Mean    SD       Mean    SD
bup        .0916   .1472    .2619   .1837    .2382   .1842    .1793   .1939    .0444   .1580
cle        .1710   .1192    .2995   .0949    .2750   .0948    .2387   .1182    .1617   .0586
eco        .6260   .1099    .7345   .0964    .7158   .1000    .7761   .0827    .7317   .0892
gla        .4663   .1490    .6089   .1731    .5019   .1416    .5665   .1899    .5765   .1284
hab        .0605   .1156    .0943   .1431    .1272   .1921    .1469   .1719    .1826   .1900
iri        .9240   .0771    .9240   .0716    .9480   .0614    .9460   .0613    .9100   .0738
mon        .0067   .0354    .0107   .0536    .0460   .1161    .1095   .1697    .0000   .0000
new        .8171   .1013    .8762   .1327    .8424   .1077    .8644   .1363    .8742   .1063
pim        .1260   .2047    .4321   .1404    .4131   .1103    .3794   .1334    .2476   .1182
veh        .2802   .1470    .6479   .0593    .5714   .0558    .5450   .0669    .4897   .1130
vow        .1726   .0602    .4982   .0751    .3422   .0391    .6969   .0530    .5833   .0695
win        .5125   .3822    .9371   .0716    .8560   .1135    .9201   .1171    .8870   .1000
wis        .5465   .3683    .9271   .0411    .9040   .0542    .9222   .0532    .8909   .0501
yea        .1640   .1226    .3279   .0837    .3983   .0453    .4481   .0505    .4137   .0483
AVG        .3546   .1528    .5415   .1014    .5128   .1011    .5528   .1141    .4995   .0931
The latter does not imply that, whenever we have this type of data, we should use a parametric test. It is possible that one or more of the initial assumptions for the use of parametric tests are not fulfilled, making a statistical analysis lose credibility.
In order to use the parametric tests, it is necessary to check the following conditions [37,47]:
Independence: in statistics, two events are independent when the fact that one occurs does not modify the probability of the other one occurring.
Normality: an observation is normal when its behaviour follows a normal or Gaussian distribution with a certain value of mean μ and variance σ. A normality test applied over a sample can indicate the presence or absence of this condition in the observed data.
Size of the rule set (mean and standard deviation) for each algorithm:

Data set   Pitts-GIRLA      XCS                  GASSIST-ADI      HIDER
           Mean    SD       Mean      SD         Mean    SD       Mean     SD
bup        30.00   0.00     2400.62   198.03     16.84   6.20     5.56     1.05
cle        40.00   0.00     4594.96   109.62     10.76   4.54     22.30    2.28
eco        40.00   0.00     2321.02   147.56     6.32    1.45     8.66     1.10
gla        20.00   0.00     3254.32   155.87     8.52    2.51     22.38    2.43
hab        10.00   0.00     1181.52   360.75     7.92    3.25     2.26     0.66
iri        20.00   0.00     547.08    105.77     4.08    0.27     3.00     0.00
mon        20.00   0.00     283.78    95.72      5.50    0.61     6.26     4.15
new        20.00   0.00     1037.00   133.42     5.42    1.01     3.34     0.52
pim        10.00   0.00     3576.62   150.34     15.34   4.61     8.84     2.00
veh        20.00   0.00     5211.18   56.17      11.68   3.92     46.86    4.59
vow        20.00   0.00     4284.34   141.49     11.92   4.44     114.50   4.55
win        20.00   0.00     4098.70   347.37     4.30    0.54     27.50    2.53
wis        50.00   0.00     708.90    78.20      5.92    1.35     2.12     0.33
yea        20.00   0.00     3608.44   221.66     8.38    2.17     46.12    8.27
AVG        24.29   0.00     2650.61   164.43     8.78    2.70     22.84    2.46
A well-known example of a normality test is the Kolmogorov-Smirnov test, which possesses very low power. In this study, we will use more powerful normality tests (a sketch applying them appears after this list):
Shapiro-Wilk (SW): it analyzes the observed data, computing the level of symmetry and kurtosis (shape of the curve) in order to compute the difference with respect to a Gaussian distribution, obtaining the p-value from the sum of the squares of these discrepancies. The power of this test has been shown to be excellent; however, its performance is adversely affected in the common situation where there is tied data.
D'Agostino-Pearson (DP): it first computes the skewness and kurtosis to quantify how far from Gaussian the distribution is in terms of asymmetry and shape. It then calculates how far each of these values differs from the value expected with a Gaussian distribution, and computes a single p-value from the sum of these discrepancies. The performance of this test is not as good as that of SW's procedure, but it is not as affected by tied data.
Heteroscedasticity: this property indicates a violation of the hypothesis of equality of variances. Levene's test is used for checking whether or not k samples present this homogeneity of variances (homoscedasticity). When the observed data do not fulfill the normality condition, the result of this test is more reliable than that of Bartlett's test [47], which is another test that checks the same property.
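The following sketch (ours, assuming SciPy; the paper itself employed SPSS) applies the three checks to the per-run result samples of several algorithms on one data set:

    # Sketch: Shapiro-Wilk, D'Agostino-Pearson and Levene checks over the
    # per-run result samples of several algorithms on a single data set.
    from scipy import stats

    def check_conditions(samples, alpha=0.05):
        """samples: one sequence of results (e.g. 50 accuracies) per algorithm."""
        for i, s in enumerate(samples):
            w, p_sw = stats.shapiro(s)       # Shapiro-Wilk
            k2, p_dp = stats.normaltest(s)   # D'Agostino-Pearson omnibus test
            print(f"algorithm {i}: SW p = {p_sw:.3f}, DP p = {p_dp:.3f}")
        stat, p_lev = stats.levene(*samples)  # homogeneity of variances
        print("Levene p =", p_lev, "->",
              "heteroscedastic" if p_lev < alpha else "homoscedastic")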
With respect to the independence condition, Demšar suggests in [18] that independence is not truly verified in 10-fold cross-validation (a portion of samples is used both for training and for testing in different partitions). In the following, we show a normality analysis using the SW and DP tests, together with a heteroscedasticity analysis using Levene's test.
4.2 Analysis of the conditions for a safe use of parametric tests
We apply the two tests of normality (SW and DP) presented above, considering a level of significance α = 0.05 (we have employed the statistical software package SPSS).
Number of antecedents, ANT (mean and standard deviation), for each algorithm:

Data set   Pitts-GIRLA      XCS              GASSIST-ADI      HIDER
           Mean    SD       Mean    SD       Mean    SD       Mean    SD
bup        2.96    0.16     2.31    0.28     3.53    0.46     5.29    0.36
cle        6.09    0.27     4.15    0.17     3.28    0.92     5.75    0.24
eco        3.52    0.24     2.06    0.17     1.69    0.40     5.39    0.39
gla        3.96    0.32     2.86    0.23     2.30    0.64     8.44    0.20
hab        1.53    0.23     1.31    0.25     1.81    0.48     1.93    0.25
iri        1.91    0.19     1.19    0.14     0.91    0.21     2.26    0.41
mon        2.73    0.23     1.23    0.12     1.27    0.95     3.12    0.99
new        2.17    0.21     1.57    0.16     1.52    0.29     4.35    0.41
pim        3.04    0.42     3.17    0.16     3.50    0.63     7.20    0.38
veh        7.77    0.42     5.14    0.14     3.19    0.65     17.36   0.16
vow        5.60    0.43     2.04    0.08     2.15    0.54     9.93    0.11
win        5.75    0.41     2.83    0.18     1.74    0.35     12.79   0.11
wis        4.46    0.23     2.11    0.14     3.19    0.72     3.46    0.59
yea        3.67    0.31     2.29    0.21     2.37    0.48     6.07    0.12
AVG        3.94    0.29     2.45    0.17     2.32    0.55     6.67    0.34
Tables 6 and 7 show the results for the classification rate and kappa measures, respectively. Tables 8 and 9 show the results for the size and ANT measures, respectively. The symbol * indicates that the normality condition is not satisfied, and the value in brackets is the p-value obtained by the test.
As we can observe in the run of the two tests of normality, the conditions needed for the application of parametric tests are not fulfilled in some cases. The normality condition is not always satisfied, although the size of the sample of results would be large enough (50 in this case). A main factor influencing this condition seems to be the nature of the problem, since there are some problems for which it is never satisfied, such as the wine and wisconsin problems in both the classification rate and kappa measures, and the general trend is not predictable. In addition, the results offered by Pitts-GIRLA are very distant from a normal shape. The measure which yields the fewest rejections of the normality condition is ANT.
In relation to the heteroscedasticity study, Table 10 shows the results of applying Levene's test, where the symbol * indicates that the variances of the distributions of the different algorithms for a certain problem are not homogeneous (the null hypothesis is rejected). The homoscedasticity property is even more difficult to fulfill, since the variances associated to each problem also depend on the algorithms' results, that is, on the capacity of the algorithms to offer similar results under random seed variations. This fact also implies that an analysis of the performance of GBML algorithms performed through parametric statistical treatment could lead to erroneous conclusions.
4.3 Case studies of the normality property
We present two case studies of the normality property, considering the sample of results obtained by a GBML method on a data set. Figures 1 and 2 show different examples of graphical representations of histograms and Q-Q plots. A histogram represents a statistical variable by using bars, so that the area of each bar is proportional to the frequency of the represented values. A Q-Q plot compares the quantiles of the observed sample against the quantiles expected under a normal distribution.
[Tables 6 and 7: p-values of the Shapiro-Wilk and D'Agostino-Pearson normality tests on the classification rate and Cohen's kappa samples of Pitts-GIRLA, XCS, GASSIST and HIDER over the 14 data sets; * marks rejection of normality at α = 0.05.]
[Figures 1 and 2: histograms with the expected-normal curve and Q-Q plots for the XCS results on the monks and yeast data sets (axes: Frequency vs. Observed Value and Expected Normal vs. Observed Value).]
[Tables 8 and 9: p-values of the Shapiro-Wilk and D'Agostino-Pearson normality tests on the size and ANT measures; * marks rejection of normality at α = 0.05.]
[Table 10: results of Levene's test for homogeneity of variances; * indicates that the variances of the algorithms on a given problem are not homogeneous.]
In Wilcoxon's signed-rank test, let d_i be the difference between the performance scores of the two algorithms on the i-th data set; the rank sums are

    R^{+} = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i),
    R^{-} = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i).
In order to compare the results of two algorithms and to establish which one is the better, we can perform the Wilcoxon signed-rank test to detect differences between their means. This statement must be accompanied by a probability of error, the p-value [47], which is the complement of the probability of reporting that two systems are the same. The computation of the p-value in Wilcoxon's distribution can be carried out through a normal approximation [37]. This test is well known and is usually included in standard statistics packages (such as SPSS, R, SAS, etc.).
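A minimal sketch (ours, assuming SciPy) of this comparison over the per-data-set results of two algorithms:

    # Sketch: Wilcoxon signed-rank test between two algorithms over the same
    # data sets; zero_method="zsplit" splits the ranks of zero differences
    # between R+ and R-, as in the definition above.
    from scipy.stats import wilcoxon

    acc_a = [0.66, 0.57, 0.81, 0.72, 0.73]   # illustrative values only
    acc_b = [0.62, 0.55, 0.84, 0.70, 0.75]

    stat, p = wilcoxon(acc_a, acc_b, zero_method="zsplit")
    print(stat, p)   # reject equality of the two algorithms when p < 0.05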
Tables 11 and 12 show the results obtained in all possible comparisons among the five algorithms considered in the study, in classification rate and kappa respectively. We stress in bold the winning algorithm in each row when the associated p-value is below 0.05.
Table 11: Wilcoxon's test applied over all possible comparisons between the five algorithms in classification rate.

Comparison                     R+     R-      p-value
Pitts-GIRLA - XCS              0.5    104.5   0.001
Pitts-GIRLA - GASSIST-ADI      0      105     0.001
Pitts-GIRLA - HIDER            1      104     0.001
Pitts-GIRLA - CN2              6      99      0.004
XCS - GASSIST-ADI              89     16      0.022
XCS - HIDER                    53     52      0.975
XCS - CN2                      78     27      0.109
GASSIST-ADI - HIDER            20     85      0.041
GASSIST-ADI - CN2              52     53      0.975
HIDER - CN2                    100    5       0.003
Table 12: Wilcoxon's test applied over all possible comparisons between the five algorithms in Cohen's kappa.

Comparison                     R+     R-      p-value
Pitts-GIRLA - XCS              0.5    104.5   0.001
Pitts-GIRLA - GASSIST-ADI      0      105     0.001
Pitts-GIRLA - HIDER            0      105     0.001
Pitts-GIRLA - CN2              10     95      0.008
XCS - GASSIST-ADI              74     31      0.177
XCS - HIDER                    51     54      0.925
XCS - CN2                      78     27      0.109
GASSIST-ADI - HIDER            28     77      0.124
GASSIST-ADI - CN2              60     45      0.638
HIDER - CN2                    96     9       0.006
The comparisons performed in this study are independent, so they should never be considered as a whole. If we try to extract from the previous tables a conclusion which involves more than one comparison, we lose control of the FWER. For instance, the statement "the HIDER algorithm obtains a classification rate better than the Pitts-GIRLA and GASSIST-ADI algorithms" joins two comparisons, so its true probability of error is higher than the p-value of either individual test.
Iman-Davenport's test: a less conservative derivation of Friedman's statistic,

    F_F = \frac{(N_{ds} - 1)\,\chi_F^2}{N_{ds}(k - 1) - \chi_F^2},

which is distributed according to the F-distribution with k - 1 and (k - 1)(N_{ds} - 1) degrees of freedom. Statistical tables for the critical values can be found in [37,47].
Bonferroni-Dunn's test: if the null hypothesis is rejected in any of the previous tests, we can continue with Bonferroni-Dunn's procedure. It is similar to Dunnett's test for ANOVA and it is used when we want to compare a control algorithm against the remainder. The performance of two algorithms is significantly different if the difference between their average rankings is at least as great as the critical difference (CD):

    CD = q_{\alpha} \sqrt{\frac{k(k + 1)}{6N}}.

The value of q_{\alpha} is the critical value for a multiple non-parametric comparison with a control (Table B.16 in [47]).
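A sketch of these computations (ours, assuming SciPy; q_alpha must still be read from the statistical table cited above):

    # Sketch: Friedman's statistic (via SciPy), the Iman-Davenport correction
    # F_F, and Bonferroni-Dunn's critical difference CD.
    from math import sqrt
    from scipy.stats import f, friedmanchisquare

    def iman_davenport(chi2_f, k, n_ds):
        ff = (n_ds - 1) * chi2_f / (n_ds * (k - 1) - chi2_f)
        p = f.sf(ff, k - 1, (k - 1) * (n_ds - 1))
        return ff, p

    def bonferroni_dunn_cd(q_alpha, k, n_ds):
        return q_alpha * sqrt(k * (k + 1) / (6.0 * n_ds))

    # results[a] holds the per-data-set scores of algorithm a:
    # chi2_f, _ = friedmanchisquare(*results)
    # With chi2_f = 28.957, k = 5 and n_ds = 14, iman_davenport gives ~13.92,
    # matching Table 13 below.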
Holm's test [26]: a multiple comparison procedure that works with a control algorithm and compares it with the remaining methods. The test statistic for comparing the i-th and j-th methods using this procedure is

    z = \frac{R_i - R_j}{\sqrt{\frac{k(k + 1)}{6N_{ds}}}}.

The z value is used to find the corresponding probability from the table of the normal distribution, which is then compared with an appropriate level of confidence α. In a Bonferroni-Dunn comparison this value is always α/(k - 1), but Holm's test adjusts the value of α in order to compensate for multiple comparisons and control the FWER.
Holm's test is a step-down procedure that sequentially tests the hypotheses ordered by their significance. We denote the ordered p-values by p_1, p_2, ..., so that p_1 ≤ p_2 ≤ ... ≤ p_{k-1}. Holm's test compares each p_i with α/(k - i), starting from the most significant p-value. If p_1 is below α/(k - 1), the corresponding hypothesis is rejected and we are allowed to compare p_2 with α/(k - 2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well.
Hochberg's procedure [27]: a step-up procedure that works in the opposite direction to Holm's method, comparing the largest p-value with α, the next largest with α/2, and so forth until it encounters a hypothesis it can reject. All hypotheses with smaller p-values are then rejected as well.
The post-hoc procedures described above allow us to know whether or not a hypothesis of comparison of means can be rejected at a specified level of significance α. However, it is also very interesting to compute the p-value associated to each comparison, which represents the lowest level of significance of a hypothesis that results in a rejection. In this manner, we can know whether two algorithms are significantly different and also have a metric of how different they are.
Next, we describe the method for computing these exact p-values for each test procedure, which are called adjusted p-values [45].
The adjusted p-value for Bonferroni-Dunn's test (also known as the Bonferroni correction) is calculated by pBonf = (k − 1)pi.

The adjusted p-value for Holm's procedure is computed by pHolm = (k − i)pi. Once all of them have been computed, an adjusted p-value for hypothesis i is not allowed to be lower than that of a hypothesis j with j < i; in this case, the adjusted p-value of hypothesis i is set to the largest of the values (k − j)pj for j ≤ i, which keeps the sequence monotonic.
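A sketch of the three adjustments, including the monotonicity repair just described (a running maximum for Holm; Hochberg uses the mirror-image running minimum from the largest p-value downwards):

    # Sketch: adjusted p-values for Bonferroni-Dunn, Holm and Hochberg, given
    # the ordered p-values of the k-1 comparisons with the control algorithm.
    def adjusted(p_sorted):
        k1 = len(p_sorted)                                   # k - 1
        bonf = [min(1.0, k1 * p) for p in p_sorted]          # capped at 1
        holm, running_max = [], 0.0
        for i, p in enumerate(p_sorted):
            running_max = max(running_max, (k1 - i) * p)     # monotonicity
            holm.append(min(1.0, running_max))
        hoch, running_min = [0.0] * k1, 1.0
        for i in range(k1 - 1, -1, -1):                      # largest p first
            running_min = min(running_min, (k1 - i) * p_sorted[i])
            hoch[i] = running_min
        return bonf, holm, hoch

    print(adjusted([0.01, 0.02, 0.03, 0.04]))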
[Figure 3: Bonferroni-Dunn graphic for the classification rate. Bars give the average Friedman ranking (axis "Friedman Ranking", scale 0-5) of Pitts-GIRLA, XCS, GASSIST-ADI, HIDER and CN2, with critical-difference cut lines at α = 0.05 and α = 0.10.]

[Figure 4: Bonferroni-Dunn graphic for Cohen's kappa, with the same layout: average Friedman rankings of the five algorithms and cut lines at α = 0.05 and α = 0.10.]
This section presents the study of applying multiple comparison procedures to the results of the case study described above. We use the results obtained in the evaluation of the performance measures considered, and we define the control algorithm as the best-performing algorithm (the one that obtains the lowest ranking, computed through Friedman's test).

First of all, we have to test whether significant differences exist among all the means. Table 13 shows the result of applying Friedman's and Iman-Davenport's tests. The table shows the Friedman and Iman-Davenport statistics, χ²_F and F_F respectively, and relates them to the corresponding critical values for each distribution using a level of significance α = 0.05. The p-value obtained is also reported for each test. Given that the statistics of Friedman and Iman-Davenport are clearly greater than their associated critical values, there are significant differences among the observed results at α = 0.05. According to these results, a post-hoc statistical analysis is needed in both cases.
Then, we employ Bonferroni-Dunn's test to detect significant differences with respect to the control algorithm in each measure. It obtains the values CD = 1.493 and CD = 1.34 for α = 0.05 and α = 0.10, respectively, in the two measures considered. Figures 3 and 4 summarize the rankings obtained by the Friedman test and draw the threshold of the critical difference of Bonferroni-Dunn's procedure with the two levels of significance mentioned above. They display a graphical representation composed of bars whose height is proportional to the average ranking obtained by each algorithm in each measure studied. If we take the smallest of them (which corresponds to the best algorithm) and add the critical difference obtained by Bonferroni-Dunn's procedure (the CD value) to its height, representing the result by a cut line that goes through the whole graphic, then the bars above the line belong to algorithms whose behaviour is significantly worse than that of the control algorithm.

We also apply more powerful procedures, such as Holm's and Hochberg's, for comparing the control algorithm with the rest of the algorithms. Table 14 shows the adjusted p-values for each comparison that involves the control algorithm; the algorithms whose adjusted p-value falls below α = 0.05 are significantly worse than the control.
Table 13 Results of Friedman's and Iman-Davenport's tests (critical values at α = 0.05)

                      Friedman                                   Iman-Davenport
Measure               Value     Critical χ² value   p-value      Value     Critical F_F value   p-value
Classification rate   28.957    9.487               < 0.0001     13.920    2.55                 < 0.0001
Cohen's kappa         26.729    9.487               < 0.0001     11.871    2.55                 < 0.0001
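As a consistency check (our computation; the critical values χ²(4) = 9.487 and F(4, 52) = 2.55 imply k = 5 and N_ds = 14), the Iman-Davenport values follow directly from the Friedman statistics via the formula given earlier:

$$F_F = \frac{(14-1)\,28.957}{14 \cdot 4 - 28.957} \approx 13.92, \qquad F_F = \frac{(14-1)\,26.729}{14 \cdot 4 - 26.729} \approx 11.87.$$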
Table 14 Adjusted p-values (pHoch) for the comparison of the control algorithm in each measure with the remaining algorithms (Holm's and Hochberg's tests)

i    pHoch (classification rate)    pHoch (Cohen's kappa)
1    2.230 × 10⁻⁵                   6.980 × 10⁻⁶
2    0.05931                        0.04283
3    0.27033                        0.05405
4    0.76509                        0.67571

Interpretability

The interpretability of the rule sets obtained will be evaluated by means of the two measures described in Section 3.2, size and ANT. We aggregate these two measures into a single one, which represents the complexity of the rule set; it measures the average complexity of the rule set, taking into account the number of rules and the average number of antecedents per rule:

$$complexity = size \cdot ANT.$$

[Figure: Bonferroni-Dunn graphic for the complexity measure. Bars give the average Friedman ranking (axis "Friedman Ranking", scale 0-5) of the five algorithms, with critical-difference cut lines at α = 0.05 and α = 0.10.]
As we can see, the two most powerful statistical procedures (Holm's and Hochberg's) are able to distinguish the GASSIST-ADI algorithm as the one whose rule sets are the most interpretable, with p = 0.05704 (Table 15; a level of significance α = 0.10 is required).

However, we have to be cautious with respect to the concept of interpretability. GBML algorithms can produce different types of rules, or different ways of reading or interpreting the rules. For example, the four algorithms used in this paper produce rule sets with different properties. In Table 16 we show an example of rule for each algorithm (the iris data set is used in the examples):

Pitts-GIRLA yields a set of conjunctive rules, with the possibility of "don't care" values, allowing the number of antecedents to vary between rules. Classifying a new example implies searching for those rules whose antecedents are compatible with it and determining the class supported by the maximal number of rules with the same consequent. If no such rules are found, the example is not classified.
XCS also uses conjunctive rules, with a generality index for each attribute. If the generality index covers the complete domain of a certain attribute, it behaves as a "don't care" value. In order to classify a new example, the rules that match it are chosen and the class is decided by a weighted vote among them.
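As an illustration of this matching-and-voting scheme, here is a sketch under our own simplified rule format, not any of the actual implementations; the rules and attribute names are hypothetical:

    # Sketch: classification with conjunctive interval rules, as described
    # above. Attributes omitted from an antecedent act as "don't care".
    from collections import Counter

    rules = [
        ({"petalLength": (4.95, 5.97)}, "Iris-virginica"),
        ({"petalLength": (1.0, 5.07), "petalWidth": (0.54, 1.63)}, "Iris-versicolor"),
    ]

    def matches(antecedent, example):
        return all(lo <= example[attr] <= hi
                   for attr, (lo, hi) in antecedent.items())

    def classify(example):
        votes = Counter(cls for ant, cls in rules if matches(ant, example))
        if not votes:
            return None                       # no compatible rule: unclassified
        return votes.most_common(1)[0][0]     # class supported by most rules

    print(classify({"petalLength": 4.0, "petalWidth": 1.2}))  # Iris-versicolor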
Table 15 Adjusted p-values for the comparison of complexity of rules (Holm's and Hochberg's tests)

i    pHoch
1    7.972 × 10⁻⁸
2    0.02565
3    0.05704
Table 16 Example of rule for each algorithm (iris data set)

Example of Rule
IF sepalLength = Don't Care AND sepalWidth = Don't Care AND petalLength = [4.947674287707237, 5.965516026050438] AND petalWidth = Don't Care THEN Class = Iris-virginica
(normalized) IF sepalLength = [0.0, 1.0] AND sepalWidth = [0.0, 1.0] AND petalLength = [0.3641094725703955, 1.0] AND petalWidth = [0.0, 1.0] THEN Class = Iris-setosa
IF petalLength = [1.0, 5.071428571428571] AND petalWidth = [0.5363636363636364, 1.6272727272727274] THEN Class = Iris-versicolor
IF sepalLength = (..., 6.65] AND petalLength = (..., 6.7] AND petalWidth = (..., 0.75] THEN Class = Iris-setosa
Acknowledgments
The authors are very grateful to the anonymous reviewers for their valuable suggestions and comments to improve the quality of this paper. We are also very grateful to Prof. Bacardit, Prof. Bernadó-Mansilla and Prof. Aguilar-Ruiz for providing the KEEL software with the GASSIST-ADI, XCS and HIDER algorithms, respectively.
References

1. J.S. Aguilar-Ruiz, R. Giráldez and J.C. Riquelme, Natural encoding for evolutionary supervised learning, IEEE Transactions on Evolutionary Computation 11:4, (2007) 466-479.
2. J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms to data mining problems, Soft Computing (2008), in press.
3. E. Alpaydin, Introduction to Machine Learning (MIT Press, Cambridge, MA 2004) 452.
4. C. Anglano and M. Botta, NOW G-Net: learning classification programs on networks of workstations, IEEE Transactions on Evolutionary Computation 6:13, (2002) 463-480.
5. A. Asuncion and D.J. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, School of Information and Computer Science (2007).
6. J. Bacardit and J.M. Garrell, Evolving multiple discretizations with adaptive intervals for a Pittsburgh rule-based learning classifier system, In: Proc. of Genetic and Evolutionary Computation Conference (GECCO'03), LNCS 2724, (2003) 1818-1831.
7. J. Bacardit, Pittsburgh genetic-based machine learning in the data mining era: representations, generalization and run-time, Dept. Comput. Sci., University Ramon Llull, Barcelona, Spain, 2004.
8. J. Bacardit and J.M. Garrell, Analysis and improvements of the adaptive discretization intervals knowledge representation, In: Proc. of Genetic and Evolutionary Computation Conference (GECCO'04), LNCS 3103, (2004) 726-738.
9. J. Bacardit and J.M. Garrell, Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system, In: T. Kovacs, X. Llorà and K. Takadama (Eds.), Advances at the Frontier of Learning Classifier Systems, LNCS 4399, (2007) 61-80.
10. R. Barandela, J.S. Sánchez, V. García and E. Rangel, Strategies for learning in class imbalance problems, Pattern Recognition 36:3, (2003) 849-851.
11. A. Ben-David, A lot of randomness is hiding in accuracy, Engineering Applications of Artificial Intelligence 20, (2007) 875-885.
12. E. Bernadó-Mansilla and J.M. Garrell, Accuracy-based learning classifier systems: models, analysis and applications to classification tasks, Evolutionary Computation 11:3, (2003) 209-238.
13. E. Bernadó-Mansilla and T.K. Ho, Domain of competence of XCS classifier system in complexity measurement space, IEEE Transactions on Evolutionary Computation 9:1, (2005) 82-104.
14. P. Clark and T. Niblett, The CN2 induction algorithm, Machine Learning 3:4, (1989) 261-283.
15. J.A. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement, (1960) 37-46.
16. A.L. Corcoran and S. Sen, Using real-valued genetic algorithms to evolve rule sets for classification, In: Proc. of the IEEE Conference on Evolutionary Computation, (1994) 120-124.
17. K.A. De Jong, W.M. Spears and D.F. Gordon, Using genetic algorithms for concept learning, Machine Learning 13, (1993) 161-188.
18. J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7, (2006) 1-30.
19. C. Drummond and R.C. Holte, Cost curves: an improved method for visualizing classifier performance, Machine Learning 65:1, (2006) 95-130.
20. A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computing (Springer-Verlag, Berlin 2003) 299.
21. A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms (Springer-Verlag, Berlin 2002) 264.
22. J.J. Grefenstette, Genetic Algorithms for Machine Learning (Kluwer Academic Publishers, Norwell 1993) 176.
23. S.U. Guan and F. Zhu, An incremental approach to genetic-algorithms-based classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B 35:2, (2005) 227-239.
24. J. Huang and C.X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering 17:3, (2005) 299-310.
25. J. Hekanaho, An evolutionary approach to concept learning, Ph.D. dissertation, Dept. Comput. Sci., Åbo Akademi University, Åbo, Finland, 1998.
26. S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6, (1979) 65-70.
27. Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75, (1988) 800-803.
28. R.L. Iman and J.M. Davenport, Approximations of the critical region of the Friedman statistic, Communications in Statistics 18, (1980) 571-595.
29. M. Sokolova, N. Japkowicz and S. Szpakowicz, Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation, In: Australian Conference on Artificial Intelligence, LNCS 4304, (2006) 1015-1021.
30. L. Jiao, J. Liu and W. Zhong, An organizational coevolutionary algorithm for classification, IEEE Transactions on Evolutionary Computation 10:1, (2006) 67-80.
31. G.G. Koch, The use of non-parametric methods in the statistical analysis of a complex split plot experiment, Biometrics 26:1, (1970) 105-128.
32. T.C.W. Landgrebe and R.P.W. Duin, Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 30:5, (2008) 810-822.
33. T.-S. Lim, W.-Y. Loh and Y.-S. Shih, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms, Machine Learning 40:3, (2000) 203-228.
34. M. Markatou, H. Tian, S. Biswas and G. Hripcsak, Analysis of variance of cross-validation estimators of the generalization error, Journal of Machine Learning Research 6, (2005) 1127-1168.
35. R.L. Rivest, Learning decision lists, Machine Learning 2, (1987) 229-246.
36. J.P. Shaffer, Multiple hypothesis testing, Annual Review of Psychology 46, (1995) 561-584.
37. D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (Chapman & Hall/CRC 2006) 1736.