
Drug Screening
by Nonparametric Posterior Estimation
Alexander Gray
February 2004
CMU-CS-04-109

School of Computer Science


Carnegie Mellon University
Pittsburgh, PA 15213

To be presented at ENAR 04.

Abstract

Automated high-throughput drug screening constitutes a critical emerging approach in modern pharmaceutical research. The statistical task of interest is that of discriminating active versus inactive molecules given a target molecule, in order to rank potential drug candidates for further testing. Because the core problem is one of ranking, our approach concentrates on accurate estimation of unknown class probabilities, in contrast to popular non-probabilistic methods which simply estimate decision boundaries. While this motivates nonparametric density estimation, we are faced with the fact that the molecular descriptors used in practice typically contain thousands of binary features. In this paper we attempt to improve the extent to which kernel density estimation can work well in high-dimensional discrimination settings. We present a synthesis of techniques (SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic Kernels) which yields favorable performance in comparison to previously published approaches to drug screening, as tested on a large proprietary pharmaceutical dataset.

The author was supported by the NASA Graduate Research Fellowship.


Keywords: virtual screening, high-throughput screening, classification, nonparametric, metric learning, nonisotropic, kernel density estimation
1 Introduction: Discrimination for Drug Screening

1.1 Automated Drug Screening

Virtual screening refers to the use of statistical and computational methods for prioritizing candidate molecules for biological testing for their possible use as drugs. Because these assays are time-consuming and expensive, accurate "virtual" assays, or prioritization of molecules by computer, have a direct impact on cost savings and more rapid drug development. Virtual screening, which is part of the more general enterprise of high-throughput screening, has thus become an increasingly pressing new component of modern drug development research.
The discrimination problem. Several scenarios exist for the specific setup of the virtual screening problem, and these demand slightly different emphases. In this paper we are concerned with a scenario that is representative of that of a large pharmaceutical research and development laboratory, which is as follows: We assume there is a single target molecule. There are multiple molecules which are known to interact in the desired fashion with the target molecule, or are active with respect to the target. There are also a number of molecules which are known to be inactive with respect to the target. The number of inactive molecules available is generally much larger than the number of active molecules. A very active research area within computational chemistry continues to explore the use of statistical discrimination (also called classification) using a set of features (or measurements) describing molecular properties to predict whether a previously unseen molecule will be active with respect to the target or not.
The labels. Building such a virtual screening system begins by collecting the training set, or the set of labelled molecules (labelled as "active" or "inactive"), often by a mixture of human-intensive but fairly certain biological testing and automated or semi-automated testing whose outcome may retain some uncertainty. Labels are often obtained by thresholding a continuous "activity" level. Datasets are also sometimes formed by combining molecules obtained from outside sources, such as the purchase of datasets from other research groups.
The features. To date, a succinct characterization of the properties of molecules which are relevant to their activity with respect to a target is not known. The structure of a molecule determines its interaction with a target molecule (whether and how it will interlock, or "dock", with the target), but the interaction is itself a complex dynamic process guided by the pairwise potential energies between the atoms of the two molecules (and atoms within the same molecule), whose complete characterization remains an outstanding problem of science. Thus, molecular descriptions used in virtual screening typically contain hundreds or thousands of binary (0/1) features, collecting all manner of both generic and target-specific properties which might be relevant to the discrimination task. Typical binary features record the absence or presence of certain kinds of atoms or substructures, proximity relationships, and so on. Exploration of different ways to characterize molecules in terms of fixed-length vectors of features is itself an active research topic.
Our dataset and goal. In this work our goal is to design a classifier with the best possible prediction performance based on a proprietary commercial training set of 26,733 molecules, 6,348 binary features, and one output variable ("active" or "inactive"). While further details about this dataset cannot be disclosed, it is similar in nature to the kind of data that can be found in public archives such as the NCI AIDS file, which contains compounds which are known to be active or inactive with respect to HIV.
In general, we are primarily empirically motivated rather than philosophically motivated, and take an approach of synthesizing insights and techniques across statistics, pattern recognition (an outgrowth of electrical engineering), and machine learning (an outgrowth of computer science). We believe that this unified cultural viewpoint is both fruitful and inevitable.
1.2 Previous Discrimination Methods

Recent work in virtual screening. Virtual screening is a rapidly developing area of research. Our work is strongly motivated by two of the most recently published comparisons of discrimination methods for virtual screening ([16], [10]).
The primary methods that have been proposed are more or less those which have enjoyed recent popularity in general, drawn mainly from the fields of pattern recognition and machine learning; they include decision trees, neural networks, and naive Bayes classifiers. However, among them, support vector machines (SVMs) ([15]), the subject of perhaps the most recent attention in the study of discrimination, have been distinguished as one of the most successful empirically.
Among the lesser-known proposals is the 'binary kernel discriminator' (BKD) of [9], a fairly standard kernel estimator for discrimination using a kernel based on the Hamming distance, which was demonstrated to yield good performance in the virtual screening problem. (We note that the BKD is not formulated directly in terms of decision theory.) In [16], a fairly extensive comparison (by a different group of researchers) between support vector machines and the binary kernel discriminator was performed, demonstrating surprisingly clear superiority in the performance of BKDs over SVMs. In that work, molecule descriptions containing up to about 1,000 features were used.
In [10], which performed experiments using the same dataset used in this paper, a conjugate gradient-based logistic regression (LR) method was demonstrated to have consistently favorable performance compared with several popular methods including SVMs with both linear and nonlinear (radial basis function) kernels, decision trees, naive Bayes classifiers, and k-nearest-neighbor classifiers. Interestingly, though logistic regression finds only linear decision boundaries, it outperformed both linear SVMs, which find linear decision boundaries via a different procedure, and RBF-based SVMs, which have the capacity to represent more complex decision boundaries.
Our work helps to understand and relate the success of both of these methods, and ultimately combines aspects of both to achieve a method with performance superior to either one.
1.3 Ranking Versus Binary Decision-making

To score the ranking performance of a discriminator, we use the standard device of receiver operating characteristic (ROC) curves ([2]), which capture more information than simply the percentage of correctly-classified data. An ROC curve is constructed by sorting the data according to the predicted probability for the "active" class, i.e. P(C_1 | x). Starting at the origin and stepping through the data in order of decreasing "active" probability, a point on the curve is plotted by moving up one unit if the true label was actually "active" and moving right one unit if it was actually "inactive". The ROC curve for any discriminator begins at the origin of the graph and ends at the top right corner. One curve reflects ranking performance superior to another to the extent that it sits above it. A summary of an ROC curve is the area under the curve (AUC), which is 0.5 for a discriminator which guesses randomly and 1.0 for one which ranks perfectly.
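As a concrete illustration of this construction, here is a minimal sketch in Python (the names `scores` and `labels` are illustrative, not part of any system described in this paper):

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve, built as described above: sort by
    decreasing predicted P(active|x), step up for each true 'active'
    and right for each 'inactive', then integrate."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order]                  # 1 = active, 0 = inactive
    tp = np.concatenate(([0], np.cumsum(y)))       # cumulative "up" steps
    fp = np.concatenate(([0], np.cumsum(1 - y)))   # cumulative "right" steps
    return np.trapz(tp / tp[-1], fp / fp[-1])      # 0.5 = random, 1.0 = perfect
```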
The starting point for the approach of this paper is that the ranking problem is different from the "pure" discrimination problem, and in fact is more difficult, because the quantity of interest is the posterior class probability rather than simply the error rate of making binary decisions. A discriminator may estimate class probabilities with very large bias, but still perform well when scored in terms of accuracy in binary decision-making as long as the order relation between the class probabilities is maintained. As noted by [3], this simple fact may explain the historically puzzling observation that discriminators that are very different in concept (i.e. modeling assumptions) tend to yield similar error rates.
In this work we pursue the extent to which direct estimation of posterior class probabilities, as opposed to pure discrimination designed to minimize the binary error rate, might yield superior ranking performance. There are additional practical advantages to obtaining accurate class-conditional densities. Among them: imputation of missing data is naturally treated, outliers are more naturally identified, and ambiguous data which are difficult to classify are easy to isolate.

2 General Approach: Decision Theory with Nonparametric Density Estimation

Decision theory. Based on the motivation above, we are led naturally to the general framework of statistical decision theory. The posterior class probability P(C_1 | x) is expressed in terms of the class-conditional density p(x | C_1):

    P(C_1 \mid x) = \frac{p(x \mid C_1) P(C_1)}{p(x \mid C_1) P(C_1) + p(x \mid C_2) P(C_2)}    (1)

If the class-conditional distributions on the right-hand side are known, the so-called Bayes error rate is achieved, meaning that no better performance can be achieved.
Typically in pattern recognition applications an empirical Bayes stance is implicitly taken, in which P(C_1) and P(C_2) are estimated from the data. This has significance in our setting, in which the class proportions are significantly different, and extra accuracy is in fact obtained in practice by incorporating this information.
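A minimal sketch of equation (1) under this empirical Bayes stance might look as follows, assuming hypothetical callables `p_active` and `p_inactive` that return the estimated class-conditional densities:

```python
def posterior_active(x, p_active, p_inactive, n_active, n_inactive):
    """Equation (1), with the priors P(C1) and P(C2) estimated from the
    class frequencies in the training set (the empirical Bayes stance);
    this matters here because the classes are strongly unbalanced."""
    prior1 = n_active / (n_active + n_inactive)
    prior2 = 1.0 - prior1
    num = p_active(x) * prior1
    return num / (num + p_inactive(x) * prior2)
```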
Nonparametric density estimation. When the class-conditional distributions are normal (with diagonal covariance), the resulting estimator is called the (naive) Bayes classifier. We consider the classifier obtained by estimating p(x | C_1) and p(x | C_2) with minimal assumptions, using the nonparametric kernel density estimator:

    \hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x, x_i)    (2)

where N is the number of data, K(\cdot) is called the kernel function and satisfies \int_{-\infty}^{\infty} K_h(z)\,dz = 1, and h is a scaling factor called the bandwidth. We refer to the resulting discriminator as a nonparametric Bayes classifier (NBC), for lack of a standard name.
The standard form of kernel which is most often used is the product kernel, in which

    K_h(x, x_i) = \prod_{d=1}^{D} K\left(\frac{x_d - x_{id}}{h}\right),    (3)

where D is the number of dimensions, i.e. the kernel function is a product of D univariate kernel functions, and all share the same bandwidth h. Though we could consider a setup in which separate bandwidths can be adjusted for each dimension, this creates a combinatorial problem which is intractable in our high-dimensional setting. If we ensure that the scales of the respective features are roughly the same, we need only adjust a single parameter h.
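A direct sketch of equations (2)-(3) with a Gaussian product kernel follows, assuming a data matrix `data` of shape (N, D); this is the naive O(N)-per-query computation, before any fast algorithm is applied:

```python
import numpy as np

def product_kernel_kde(x, data, h):
    """Kernel density estimate with a Gaussian product kernel: a single
    shared bandwidth h across all D dimensions (equation (3)), averaged
    over the N training points (equation (2))."""
    N, D = data.shape
    z = (x - data) / h                               # (N, D) scaled differences
    k = np.exp(-0.5 * (z ** 2).sum(axis=1))          # product of D Gaussian kernels
    return k.mean() / ((2 * np.pi) ** (D / 2) * h ** D)
```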
Note that a particular advantage of the decision-theoretic framework which is relevant for this problem is that unequal misclassification costs are easily handled.
2.1 Pros and Cons

Since we cannot assume any plausible parametric model for the class-conditional densities, nonparametric estimation is required. Kernel density estimation is the most widely-used and well-studied method for nonparametric density estimation, owing to both its simplicity and flexibility, and the many theorems establishing its consistency for near-arbitrary unknown densities and rates of convergence for its many variants ([14], [13]). However, three main factors have traditionally kept it (and nonparametric density estimation in general) from more widespread applicability, particularly in contexts like the present one:

- Computational intractability. Estimation of the density at each of the N points, when performed in the straightforward manner, has O(N^2) computational cost. This quickly becomes prohibitive even for moderate sizes of N.

- Statistical inefficiency in high dimensions. Theoretical bounds establish that in the worst case, the number of samples required for accurate kernel density estimation rises exponentially with the dimension. Even for relatively small D, these worst-case numbers are discouraging.

- Ignores the simpler decision problem. One of Vapnik's central arguments for the non-probabilistic approach underlying the support vector machine is that if the error rate is the desired quantity to be minimized, estimation of entire densities rather than simpler decision boundaries is unnecessary and wasteful of modeling capacity ([15]). Stated differently, the straightforward decision-theoretic approach does not make use of information which can be obtained from the decision boundary, which is possibly more easily characterized than the entire class-conditional densities.

3 SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic Kernels

The SLAMDUNK methodology consists of a set of procedures designed to mitigate the traditional limitations of nonparametric density estimation in the setting of high-dimensional discrimination, so that its distinct advantages may be exploited. We now treat in turn each of the three roadblocks mentioned in the last section.
3.1 Fast Algorithm for Kernel Density Estimation

Computational intractability is the first major roadblock hit in practice. Computational efficiency impacts statistical inference directly: for example, in [16] only 200 data were subsampled for each class to form the training set, due to the computational cost of BKD. In our experiments we use the entire set of 26,733 data. Any high-dimensional context demands the use of as much data as possible, exacerbating the computational issue.
Fortunately, this problem has been largely mitigated in very recent work presenting a fast algorithm yielding simultaneously fast and accurate computation of kernel density estimates ([8]). The method casts the kernel density estimation computational problem within a larger class called 'generalized N-body problems' and is a special case of a more general algorithmic approach called 'higher-order divide-and-conquer' which achieves the best known time complexity (asymptotic order of growth in runtime for a given problem size N) for this class of problems. It is proven in [8] that the algorithm reduces the O(N^2) cost of density estimation at each of the N points to O(N), or constant time per point.
It is shown empirically in [8] that the algorithm's time complexity is not exponential in the dimension D, as indicated by well-known worst-case theoretical results ([4]). It is instead conjectured that such algorithms are sensitive to the intrinsic dimensionality, the local dimensionality of the manifold upon which the data lies ([5]) (see below).
The algorithm employs techniques of computational geometry, in particular space-partitioning data structures called ball-trees, also called metric trees, which are constructed as a preprocessing step in O(N log N) time. The major constraint imposed by this approach which is most relevant in this context is the fact that ball trees require that the underlying distance be a true metric, relying heavily, for example, on the triangle inequality. This will become relevant in constraining other parts of our methodology.
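For illustration only, a tree-backed kernel density estimate can be obtained today from scikit-learn, whose ball-tree implementation is a later, independent realization of the same general idea; this is a sketch of the interface, not the algorithm of [8]:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.rand(5000, 20)                 # placeholder data
kde = KernelDensity(kernel='gaussian', bandwidth=0.5,
                    algorithm='ball_tree',   # space-partitioning tree backend
                    rtol=1e-4)               # small relative error traded for speed
kde.fit(X)
log_p = kde.score_samples(X[:10])            # log of the density estimate
```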
3.2 Nonstationary and Nonisotropic Estimators

We now consider extensions to the standard kernel density estimator as described earlier, regarding the parametrization of the kernel function.
Nonstationary estimators. It has long been noted that the assumption of spatial stationarity, or a single scale h holding across the entire space, is deficient. Visually it is clear that smoothing with a fixed bandwidth is unappealing when the dataset contains regions of differing density, which is inevitable in practice. This problem is clearly seen, for example, in the sparser tails of a typical univariate dataset.
Adaptive (or variable-kernel) kernel density estimators have been studied and shown to be more effective than fixed-width kernel density estimators in experimental studies, e.g. [1]. In these estimators, the variable bandwidth h_i for each point x_i is obtained by scaling the single global bandwidth h by a factor

    \lambda_i \propto \{\tilde{p}(x_i)\}^{-1/2}    (4)

where \tilde{p}(\cdot) is a pilot estimate of the density, to which the overall estimator is largely insensitive. Many simple choices can be used for this pilot estimate, including adaptive Gaussian mixture models or piecewise-constant estimates based on multivariate binning, for example by kd-trees ([4]).
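A sketch of equation (4), assuming an arbitrary `pilot_density` callable; normalizing the scaling factors to have geometric mean 1 is one common convention, not something prescribed here:

```python
import numpy as np

def adaptive_bandwidths(data, h, pilot_density):
    """Per-point bandwidths h_i = h * lambda_i, with lambda_i
    proportional to pilot_density(x_i)^(-1/2) as in equation (4)."""
    p = np.array([pilot_density(x) for x in data])
    lam = p ** -0.5
    lam /= np.exp(np.log(lam).mean())     # normalize geometric mean to 1
    return h * lam
```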
Nonisotropic estimators. It has been noted by many authors (particularly in the field of machine learning, in which high-dimensional data discrimination and clustering is routinely performed) that in practice it is virtually never the case that a dataset's intrinsic dimensionality is equal to its explicit dimensionality D, e.g. [3]. With the assumption that the data lie on a linear manifold, the dimension of the subspace can be estimated using the eigenspectrum from a principal components analysis ([5]). However in general the data may lie on a nonlinear manifold ([12]). A common estimator of the intrinsic dimension with minimal assumptions has been called, among other things, the correlation dimension ([7]), but amounts to the 2-point correlation function used in spatial statistics. Very often in practice the intrinsic dimension D_0 << D, regardless of which variant of its definition is used.
With this in mind, the standard product kernel, which is isotropic, i.e. has equal extent in all directions, is a poor match to realistic high-dimensional data. Further, as noted earlier, the behavior of volumes in high dimensionalities, rising exponentially in D, is disastrous when D is large.
Instead we use an estimator in which the univariate bandwidths h_i are replaced by matrices H_i, resulting in a multivariate kernel such as the multivariate Gaussian

    K_{H_i}(x, x_i) = \frac{1}{(2\pi)^{D/2} |H_i|^{1/2}} \exp\left(-\frac{1}{2} (x - x_i)^T H_i^{-1} (x - x_i)\right)    (5)

where H_i = h_i \hat{\Sigma}_k, with \hat{\Sigma}_k the covariance matrix estimated from the k nearest neighbors of x_i together with x_i itself. Such estimators have received relatively little study, though one example showing their consistency is [6].
By allowing increased sensitivity to the local manifold of the data, we deflate the extent of the curse of dimensionality in kernel density estimation, relative to the naive product kernel estimator.
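A sketch of equation (5); the small ridge added to the local covariance is an illustrative safeguard (not from this paper) against a singular H_i when the neighbors are nearly coplanar:

```python
import numpy as np
from scipy.spatial import cKDTree

def nonisotropic_kernel(x, xi, data, k, h_i, ridge=1e-6):
    """Equation (5): H_i = h_i * Sigma_k, with Sigma_k the covariance of
    the k nearest neighbors of xi (together with xi itself)."""
    D = x.shape[0]
    tree = cKDTree(data)                  # in practice built once, not per query
    _, idx = tree.query(xi, k=k)
    sigma = np.cov(data[idx], rowvar=False) + ridge * np.eye(D)
    H = h_i * sigma
    diff = x - xi
    quad = diff @ np.linalg.solve(H, diff)
    _, logdet = np.linalg.slogdet(H)      # numerically stable log-determinant
    return np.exp(-0.5 * (quad + D * np.log(2 * np.pi) + logdet))
```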

4 Coordinate Transformation and Metric Learning

Metric learning. An implicit part of the kernel estimator is the underlying metric used to obtain the distances. The standard Euclidean distance is used by default. It can be seen as a special case of a more general weighted Euclidean distance

    d(x, y) = \|x - y\|_W = \sqrt{(x - y)^T W (x - y)}    (6)

in which the matrix W is diagonal, containing all 1's. Rather than assume this special case, we take the stance that the metric weight matrix W should be considered a free parameter to adjust to maximize the performance of our estimator. We refer to this as "learning the metric".
The question of the optimal metric has been heavily studied in the context of discrimination by the nearest-neighbor rule (which can be regarded as a special case of the kernel estimator for discrimination). Although asymptotic results imply that the choice of metric does not affect performance, finite-sample experiments show that marked improvements can be made by adjusting the metric to the task at hand. A general theory has been developed ([11]) which formulates the optimal distance in terms of the Bayes optimal posterior class probabilities. However, this form of "distance" does not in general yield a formal metric. We will ensure that metric properties are retained, for the purpose of using the fast algorithm described earlier, by staying within the confines of weighted Euclidean distances.
The linear discriminant metric. We propose a form of W which relates the metric to the decision boundary corresponding to a linear discriminant.
We consider only forms of W which are diagonal. First we obtain the vector w which is the result of a linear classifier such as logistic regression or a linear support vector machine (we use logistic regression based on the favorable experimental results described earlier). The weight vector w describes a discriminator where the class prediction for x is obtained by computing w^T x and comparing it to a threshold w_0. Thus if two points x and y lie on the decision boundary of the discriminator, we have that

    w^T (x - y) = 0,    (7)

i.e. the vector w is orthogonal to the decision boundary.
By taking the metric formed by the norm

    d(x, y) = |w^T (x - y)|    (8)

we obtain a metric which measures distance along w, or between the class means (with the appropriate Gaussian assumptions). This can be interpreted as measuring the extent to which the linear discriminant prefers class 1 or class 2.

This can be regarded as an implicit form of dimensionality reduction, by realizing that values of w tending to zero will cause the metric to assign negligible weight in those directions, which in the limit is akin to removing the corresponding features.
In this manner, we use discriminant information to weight our metric rather than assume isotropy in the metric.
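A minimal sketch of this step with placeholder data; taking W = diag(w)^2 is one plausible instantiation of the diagonal restriction described above (the projection distance of equation (8) itself corresponds to the rank-one choice W = w w^T):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))          # placeholder for the sphered data
y = rng.integers(0, 2, size=1000)         # placeholder 0/1 activity labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
w = clf.coef_.ravel()                     # orthogonal to the decision boundary (eq. (7))

def learned_distance(a, b, w=w):
    """Weighted Euclidean distance (equation (6)) with W = diag(w)^2:
    directions where w is near zero receive negligible weight, the
    implicit dimensionality reduction described above. Still a true
    (pseudo-)metric, so ball trees remain applicable."""
    return np.sqrt(np.sum((w * (a - b)) ** 2))
```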
Sphering. Our diagonal restriction on W motivates the removal of correlation between the features in advance. Normalizing each feature so that they all have roughly the same scale is also important for kernel density estimation, as noted earlier. For this reason we perform these operations (sphering the data) as the first step of our methodology using principal component analysis (PCA). We also take the opportunity at this stage to examine the resulting eigenspectrum and remove low-eigenvalue features.
Our overall dimension reduction scheme thus includes two kinds of steps: this PCA-based explicit feature removal, which aims to 'denoise' the data, and the implicit direction weighting performed by our metric learning procedure.
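A compact sketch of the sphering step, assuming a dense data matrix `X`; the default of 100 retained components follows the experiments below:

```python
import numpy as np

def sphere(X, n_keep=100, eps=1e-12):
    """Sphering via PCA: center, rotate onto the principal directions,
    rescale each retained direction to unit variance, and drop the
    low-eigenvalue directions entirely."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_keep]   # top n_keep components
    Z = Xc @ eigvecs[:, order]
    return Z / np.sqrt(eigvals[order] + eps)     # unit variance per direction
```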

5 Experimental Results

Our dataset contains 26,733 rows and 6,348 attributes, and is sparse, containing 3,732,607 non-zero input values. It has 804 positive output values (the "active" class).
A pre-analysis of the data, however, reveals that 2,290 columns are empty. Furthermore, 388 out of 8,235,711 pairs of columns are identical; these are also removed. Among the remaining columns, a column reduction scheme also reveals linear dependencies, and removal of 406 columns from the remaining 3,871 columns is performed. This leaves about half of the original dimensions. We then perform PCA, keeping only 100 of these dimensions.
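The column reduction scheme is not specified in detail here; the following sketch is one simple instantiation of the three steps (empty columns, duplicate pairs, linear dependencies):

```python
import numpy as np

def reduce_columns(X):
    """Drop empty columns, drop duplicate columns, then greedily keep
    only columns that increase the matrix rank (removing linear
    dependencies). Illustrative only; slow for 6,348 columns."""
    X = X[:, X.any(axis=0)]               # remove all-zero columns
    X = np.unique(X, axis=1)              # remove identical columns (reorders them)
    keep = []
    for j in range(X.shape[1]):
        if np.linalg.matrix_rank(X[:, keep + [j]]) == len(keep) + 1:
            keep.append(j)                # column j is linearly independent
    return X[:, keep]
```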
All experiments were performed using 10-fold cross-validation, in which the data is broken into 10 equally-sized disjoint subsets, and testing (evaluation) is performed on one of them while training is performed on the other 9 put together.
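A sketch of this protocol; the fold assignment here is a plain random partition (whether folds were stratified by class is not stated above):

```python
import numpy as np

rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(26733), 10)   # 10 disjoint subsets

for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the discriminator on train_idx and score AUC on test_idx
    # (see the ROC/AUC sketch in Section 1.3)
```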
The following table lists the results of the experimental evaluation of [10] performed on the same data. It shows the best performance yielded by each method, with and without the use of PCA projecting to 100 dimensions.

Method                          AUC
k-nearest neighbors             0.862 ± 0.017
Bayes classifier                0.891 ± 0.012
Decision tree                   0.893 ± 0.011
Linear support vector machine   0.918 ± 0.010
RBF support vector machine      0.927 ± 0.013
Logistic regression             0.931 ± 0.012
The next table shows the results of the SLAMDUNK methods on this data.

Method                                                        AUC
SLAMDUNK, fixed isotropic kernel                              0.933 ± 0.017
SLAMDUNK, fixed isotropic kernel with metric learning         0.937 ± 0.012
SLAMDUNK, variable nonisotropic kernel with metric learning   0.940 ± 0.012

6 Conclusion

We have presented a methodology called SLAMDUNK which we have designed to have favorable properties for the problem of virtual screening. We have demonstrated its favorable performance on a real pharmaceutical dataset as evidence that this line of thinking may hold promise for this important contemporary problem.
Additionally, this work represents a foray into the more general problem of high-dimensional discrimination, in particular exploring the extent to which probabilistic methods can be successful in high-dimensional problems. We plan to continue developing the seeds of the ideas which have been presented here.

Acknowledgements

The author would like to thank Andrew Moore for his enthusiastic support of this work and for providing a first-class research environment, Paul Komarek and Ting Liu for the excellent software framework which facilitated this work, and our pharmaceutical collaborators for providing the data and their insights regarding this problem.

References

[1] L. Breiman, W. Meisel, and E. Purcell. Variable Kernel Estimates of Multivariate Densities. Technometrics, 19:135–144, 1977.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[3] J. H. Friedman. Flexible Metric Nearest Neighbor Classification. Technical report, Stanford University, 1994.
[4] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.
[5] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[6] G. H. Givens. Consistency of the Local Kernel Density Estimator. Technical report, Colorado State University, 1994.
[7] P. Grassberger and I. Procaccia. Measuring the Strangeness of Strange Attractors. Physica D, pages 189–208, 1983.
[8] A. G. Gray and A. W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. To be submitted to JASA.
[9] G. Harper, J. Bradshaw, J. C. Gittins, and D. V. S. Green. Prediction of Biological Activity for High-Throughput Screening Using Binary Kernel Discrimination. J. Chem. Inf. Comput. Sci., 41:1295–1300, 2001.
[10] P. Komarek and A. W. Moore. Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. In Workshop on AI and Statistics, 2003.
[11] T. P. Minka. Distance Measures as Prior Probabilities. Technical report, Massachusetts Institute of Technology, 2000.
[12] S. Roweis and L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), December 2000.
[13] D. W. Scott. Multivariate Density Estimation. Wiley, 1992.
[14] B. W. Silverman. Density Estimation. Chapman and Hall, New York, 1986.
[15] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[16] D. Wilton and P. Willett. Comparison of Ranking Methods for Virtual Screening in Lead-Discovery Programs. J. Chem. Inf. Comput. Sci., 43:469–474, 2003.
