
An Ensemble Approach to Variable Selection for Classification of DNA Microarray Data

Francesco Masulli

Department of Computer Science, University of Pisa, Italy. Email: masulli@di.unipi.it

Stefano Rovetta

Department of Computer and Information Sciences, University of Genoa, Italy

The input selection problem


Hard: given d inputs, there are 2^d possible subsets, and there is no guarantee that a larger subset performs better or worse than a smaller one (a.k.a. no monotonicity)
Classic: many references dating back to about the mid-seventies
Important: curse of dimensionality, generalization, cost of measurements, cost of computation, ...

A different perspective
Although old, the input selection problem is being actively studied now.
From optimization... the classic approach: improve training speed, generalization ability, computational resource requirements, ...
...to model analysis: the mainstream approach as of today is to find the subset of inputs which accounts the most for the observed phenomenon.
A tool for scientific inquiry, not for system design.

Gene selection
Bioinformatics is where input selection is a current (hot) topic.
DNA microarrays provide large amounts of simultaneous data, e.g., gene expression.
We have to find out which genes are the most relevant to a given pathology (good candidates to be the true cause).
We are interested in a specific approach: assessing the relative importance of each input variable (gene).

Problem statement
We address: classification problems with 2 classes (only to simplify the analysis; can be extended to multiclass), seeking a saliency ranking on a d-dimensional vector space: x ∈ ℝ^d

A single separating function is assumed, denoted by g(x)

Outline of the technique


The proposed technique has three components:
1. a local analysis step with a basic classifier
2. a resampling procedure to iterate step 1
3. an integration step

Saliency

(or importance or sensitivity or...)

Many definitions exist. Intuitively: some attribute of an input variable which measures its influence on the solution of a given (classification) problem.
The derivative of the output w.r.t. each input variable is a natural measure of influence:

∇g(x) = (∂g(x)/∂x1, ..., ∂g(x)/∂xd)


But...

Finite sample effects


The rule is learned from a training set: random variability

Derivatives and local fluctuations


often it is better to study difference ratios (f(x + ε) − f(x)) / ε rather than derivatives f′(x)
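The difference-ratio idea is easy to state in code. Below is a minimal sketch, assuming a generic decision function g and a hand-chosen step size eps (the helper name and the toy g are illustrative, not from the slides):

```python
import numpy as np

def difference_ratio_saliency(g, x, eps=0.1):
    """Estimate the influence of each input variable on g at x using
    difference ratios (g(x + eps) - g(x)) / eps, which are less
    sensitive to local fluctuations than exact derivatives."""
    base = g(x)
    s = np.empty(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] += eps               # perturb one coordinate at a time
        s[i] = (g(x_pert) - base) / eps
    return s

# Toy separating function g(x) = 2*x0 - 0.5*x1
g = lambda x: 2.0 * x[0] - 0.5 * x[1]
print(difference_ratio_saliency(g, np.array([1.0, 1.0])))  # approx [ 2.  -0.5]
```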

Use of linear separators


If the decision function is of the form g(x) = w · x + b, then the derivatives w.r.t. the inputs are constant and given directly by the coefficient vector w.
SVMs can provide the optimum linear separators w.r.t. a given generalization bound:
2-norm soft margin optimization gives a bound on the generalization error based on the (soft) margin.
Such linear separators are robust in terms of sample variations (they depend on the support vectors only).
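As a concrete illustration of reading influence off a linear separator, the following sketch fits a soft-margin linear SVM with scikit-learn (a library choice of ours; the slides do not prescribe one) and extracts the coefficient vector w, which equals the constant derivative of g w.r.t. the inputs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy 2-class data standing in for gene-expression profiles
X = rng.normal(size=(40, 5))
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # soft-margin linear SVM
w = clf.coef_.ravel()                        # dg/dx is constant and equal to w
print(np.argsort(-np.abs(w)))                # inputs ranked by |influence|
```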

Local analysis
The linear separator is applied on a local basis.
A nonlinear g(x) can be studied by local linearization.

Voronoi partitioning
A Voronoi tessellation is performed on the training set.
Linear analysis is applied within each Voronoi polyhedron (a localized subset of training samples).
We obtain a saliency ranking directly by t = w / max_i{w_i} (signs can be discarded and analyzed separately).
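A minimal sketch of one tessellation round, under details the slides leave open (number of sites, skipping single-class cells, scikit-learn's LinearSVC as the basic classifier; all our assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def local_saliencies(X, y, n_sites=4, rng=None):
    """One random Voronoi round: returns (t, site) pairs, i.e. a local
    saliency ranking and the Voronoi site for each usable polyhedron."""
    if rng is None:
        rng = np.random.default_rng()
    sites = X[rng.choice(len(X), n_sites, replace=False)]   # random sites
    cell = np.argmin(((X[:, None] - sites) ** 2).sum(-1), axis=1)
    pairs = []
    for c in range(n_sites):
        Xc, yc = X[cell == c], y[cell == c]
        if len(np.unique(yc)) < 2:      # skip polyhedra with one class only
            continue
        w = LinearSVC(C=1.0, dual=False).fit(Xc, yc).coef_.ravel()
        pairs.append((w / np.abs(w).max(), sites[c]))  # t = w / max|w_i|
    return pairs
```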

Drawbacks
Several: mainly border effects and small sample size within Voronoi polyhedra

Solution: resampling
The Voronoi tessellation is performed several times.
A different random Voronoi tessellation is used each time.

An ensemble method
The procedure can be seen as an ensemble of localized linear classifiers.
The necessary classifier diversity is provided by the random Voronoi tessellations.
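In code, the ensemble is just a loop over independent random tessellations, pooling the local results; this sketch reuses the hypothetical local_saliencies helper from above:

```python
import numpy as np

def ensemble_pairs(X, y, n_rounds=50, n_sites=4, seed=0):
    """Iterate the random Voronoi analysis; classifier diversity comes
    from drawing new random sites on every round."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_rounds):
        pairs.extend(local_saliencies(X, y, n_sites=n_sites, rng=rng))
    return pairs   # pooled (t_i, y_i) vectors, one per usable polyhedron
```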

What we need next: Integration of local analyses

Integrating by clustering
For each Voronoi polyhedron of each resampling step, we obtain a pair of d-dimensional vectors (or a 2d-dimensional combined vector)
v_i = (t_i, y_i)
where t_i is the saliency ranking and y_i is the Voronoi centroid (site).
To integrate the local analyses we perform a c-means-type clustering on the vectors v_i.

Some details on the clustering step


- The clustering technique is the Graded Possibilistic c-Means algorithm
- The dimensionality problem is easily tackled by working only within the subspace spanned by the training set
- Clusters are obtained by merging (averaging) sets of vectors v_i which are close either in their y (location) or in their t (saliency pattern) components
- The number of clusters currently has to be prespecified (as in standard c-means); it is independent of the number of Voronoi sites used
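The slides use the Graded Possibilistic c-Means algorithm, which is not available in mainstream libraries; purely as a rough stand-in, this sketch clusters the combined 2d-dimensional vectors v_i = (t_i, y_i) with plain k-means and averages the saliency parts within each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def integrate(pairs, n_clusters=3, seed=0):
    """Stand-in for the Graded Possibilistic c-Means step: cluster the
    stacked vectors v_i = (t_i, y_i), then merge (average) the saliency
    patterns t within each cluster."""
    V = np.array([np.concatenate([t, site]) for t, site in pairs])
    d = len(pairs[0][0])               # dimensionality of the t part
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(V)
    return [V[labels == k, :d].mean(axis=0)  # one saliency pattern per cluster
            for k in range(n_clusters)]
```

As in the slides, the cluster count must be prespecified and is independent of the number of Voronoi sites used.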

Results
Leukemia data set by Golub et al.

Discussion and future work


The results indicate that some of the genes highlighted in the original work by Golub et al. are found to be important by our approach as well.
Extensive validation (with the help of domain experts or biologists) must be done.
The direction (sign) of saliency has been found to be always in agreement with the statistical correlation indicated in the original work.
Further experiments: a new data set (still unpublished) is currently being investigated.
An interesting tweak: replacing the general c-means-type clustering with a technique specifically tailored to rank data.
