Drug Screening by Nonparametric Posterior Estimation

Alexander Gray

February 2004
CMU-CS-04-109
Abstract
Automated high-throughput drug screening constitutes a critical emerging approach in modern pharmaceutical research. The statistical task of interest is that of discriminating active versus inactive molecules given a target molecule, in order to rank potential drug candidates for further testing. Because the core problem is one of ranking, our approach concentrates on accurate estimation of unknown class probabilities, in contrast to popular non-probabilistic methods which simply estimate decision boundaries. While this motivates nonparametric density estimation, we are faced with the fact that the molecular descriptors used in practice typically contain thousands of binary features. In this paper we attempt to improve the extent to which kernel density estimation can work well in high-dimensional discrimination settings. We present a synthesis of techniques (SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic Kernels) which yields favorable performance in comparison to previously published approaches to drug screening, as tested on a large proprietary pharmaceutical dataset.
1 Introduction

Virtual screening refers to the use of statistical and computational methods for prioritizing candidate molecules for biological testing for their possible use as drugs. Because these assays are time-consuming and expensive, accurate "virtual" assays, or prioritization of molecules by computer, have a direct impact on cost savings and more rapid drug development. Virtual screening, which is part of the more general enterprise of high-throughput screening, has thus become an increasingly pressing component of modern drug development research.
The discrimination problem. Several scenarios exist for the specific setup of the virtual screening problem, and these demand slightly different emphases. In this paper we are concerned with a scenario that is representative of that of a large pharmaceutical research and development laboratory, which is as follows. We assume there is a single target molecule. There are multiple molecules which are known to interact in the desired fashion with the target molecule, or are active with respect to the target. There are also a number of molecules which are known to be inactive with respect to the target. The number of inactive molecules available is generally much larger than the number of active molecules. A very active research area within computational chemistry continues to explore the use of statistical discrimination (also called classification) using a set of features (or measurements) describing molecular properties to predict whether a previously unseen molecule will be active with respect to the target or not.
The labels. Building such a virtual screening system begins by collecting the training set, or the set of labelled molecules (labelled as "active" or "inactive"), often by a mixture of human-intensive but fairly certain biological testing and automated or semi-automated testing whose outcome may retain some uncertainty. Labels are often obtained by thresholding a continuous "activity" level. Datasets are also sometimes formed by combining molecules obtained from outside sources, such as the purchase of datasets from other research groups.
The features. To date, a succinct characterization of the properties of molecules which are relevant to their activity with respect to a target is not known. The structure of a molecule determines its interaction with a target molecule, i.e. whether and how it will interlock, or "dock", with the target, but the interaction is itself a complex dynamic process guided by the pairwise potential energies between the atoms of the two molecules (and atoms within the same molecule), whose complete characterization remains an outstanding problem of science. Thus, molecular descriptions used in virtual screening typically contain hundreds or thousands of binary (0/1) features, collecting all manner of both generic and target-specific properties which might be relevant to the discrimination task. Typical binary features record the absence or presence of certain kinds of atoms or substructures, proximity relationships, and so on. Exploration of different ways to characterize molecules in terms of fixed-length vectors of features is itself an active research topic.
Our dataset and goal. In this work our goal is to design a classifier with the best possible prediction performance based on a proprietary commercial training set of 26,733 molecules, 6,348 binary features, and one output variable ("active" or "inactive"). While further details about this dataset cannot be disclosed, it is similar in nature to the kind of data that can be found in public archives such as the NCI AIDS file, which contains compounds which are known to be active or inactive with respect to the HIV virus.
In general, we are primarily empirically motivated rather than philosophically motivated, and take an approach of synthesizing insights and techniques across both statistics and pattern recognition (an outgrowth of electrical engineering) and machine learning (an outgrowth of computer science). We believe that this unified cultural viewpoint is both fruitful and inevitable.
1.2 Previous Discrimination Methods
Recent work in virtual screening. Virtual screening is a rapidly developing area of research. Our work is strongly motivated by two of the most recently published comparisons of discrimination methods for virtual screening ([16], [10]).
The primary methods that have been proposed are more or less those which have enjoyed recent popularity in general, drawn mainly from the fields of pattern recognition and machine learning; they include decision trees, neural networks, and naive Bayes classifiers. However, among them, support vector machines (SVMs) ([15]), the subject of perhaps the most recent attention in the study of discrimination, have been distinguished as one of the most successful empirically.
Among the lesser-known method proposals is the 'binary kernel discriminator' (BKD) of [9], a fairly standard kernel estimator for discrimination using a kernel based on the Hamming distance, which was demonstrated to yield good performance on the virtual screening problem. (We note that the BKD is not formulated directly in terms of decision theory.) In [16], a fairly extensive comparison (by a different group of researchers) between support vector machines and the binary kernel discriminator was performed, demonstrating surprisingly clear superiority in the performance of BKDs over SVMs. In that work, molecule descriptions containing up to about 1,000 features were used.
In [10], which performed experiments using the same dataset used in this paper, a conjugate gradient-based logistic regression (LR) method was demonstrated to have consistently favorable performance compared with several popular methods including SVMs with both linear and nonlinear (radial basis function) kernels, decision trees, naive Bayes classifiers, and k-nearest-neighbor classifiers. Interestingly, though logistic regression finds only linear decision boundaries, it outperformed both linear SVMs, which find linear decision boundaries via a different procedure, and RBF-based SVMs, which have the capacity to represent more complex decision boundaries.
Our work helps to understand and relate the success of both of these methods, and ultimately combines aspects of both to achieve a method with performance superior to either one.
1.3 Ranking Versus Binary Decision-Making
To score the ranking performance of a discriminator, we use the standard device of receiver operating characteristic (ROC) curves ([2]), which capture more information than simply the percentage of correctly-classified data. An ROC curve is constructed by sorting the data according to the predicted probability for the "active" class, i.e. P(C_1 | x). Starting at the origin and stepping through the data in order of decreasing "active" probability, a point on the curve is plotted by moving up one unit if the true label was actually "active" and moving right one unit if the prediction was incorrect. The ROC curve for any discriminator begins at the origin of the graph and ends at the top right corner. One curve reflects ranking performance superior to another to the extent that it sits above it. A summary of an ROC curve is the area under the curve (AUC), which is 0.5 for a discriminator which guesses randomly and 1.0 for one which ranks perfectly.
The starting point for the approach of this paper is that the ranking problem is different from the "pure" discrimination problem, and in fact is more difficult, because the quantity of interest is the posterior class probability rather than simply the error rate of making binary decisions. A discriminator may estimate class probabilities with very large bias, but still perform well when scored in terms of accuracy in binary decision-making as long as the order relation between the class probabilities is maintained. As noted by [3], this simple fact may explain the historically puzzling observation that discriminators that are very different in concept (i.e. modeling assumptions) tend to yield similar error rates.
In this work we pursue the extent to which direct estimation of posterior class probabilities, as opposed to pure discrimination designed to minimize the binary error rate, might yield superior ranking performance. There are additional practical advantages to obtaining accurate class-conditional densities. Among them: imputation of missing data is naturally treated, outliers are more naturally identified, and ambiguous data which are difficult to classify are easy to isolate.
2 Discrimination by Nonparametric Density Estimation
Decision theory. Based on the motivation above, we are led naturally to the general framework of statistical decision theory. The posterior class probability P(C_1 | x) is expressed in terms of the class-conditional density p(x | C_1):

    P(C_1 \mid x) = \frac{p(x \mid C_1)\, P(C_1)}{p(x \mid C_1)\, P(C_1) + p(x \mid C_2)\, P(C_2)}    (1)
If the class-conditional distributions on the right-hand side are known, the so-called Bayes error rate is achieved, meaning that no better performance can be achieved.
Typically in pattern recognition applications an empirical Bayes stance is implicitly taken, in which P(C_1) and P(C_2) are estimated from the data. This has significance in our setting, in which the class proportions are significantly different, and extra accuracy is in fact obtained in practice by incorporating this information.
Nonparametric density estimation. When the class-conditional distributions are normal (with diagonal covariance), the resulting estimator is called the (naive) Bayes classifier. We consider the classifier obtained by estimating p(x | C_1) and p(x | C_2) with minimal assumptions, using the nonparametric kernel density estimator:

    \hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x, x_i)    (2)

where N is the number of data, K(\cdot) is called the kernel function and satisfies \int_{-\infty}^{\infty} K_h(z)\, dz = 1, and h is a scaling factor called the bandwidth. We refer to the resulting discriminator as a nonparametric Bayes classifier (NBC), for lack of a standard name.
The standard form of kernel which is most often used is the product kernel, in which

    K_h(x, x_i) = \prod_{d=1}^{D} K\!\left( \frac{\lVert x^{(d)} - x_i^{(d)} \rVert}{h} \right)    (3)

where D is the number of dimensions, i.e. the kernel function is a product of D univariate kernel functions, and all share the same bandwidth h. Though we could consider a setup in which separate bandwidths can be adjusted for each dimension, this creates a combinatorial problem which is intractable in our high-dimensional setting. If we ensure that the scales of the respective features are roughly the same, we need only adjust a single parameter h.
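As an illustration of Eqs. (1)-(3), a direct per-query evaluation of the nonparametric Bayes classifier can be written as follows (a minimal sketch in our own notation, with a Gaussian chosen as the univariate kernel; this is not the paper's optimized implementation):

```python
import numpy as np

def kde_product_gaussian(x, data, h):
    """Eqs. (2)-(3): average of N product kernels, here univariate Gaussians
    sharing a single bandwidth h; data has shape (N, D)."""
    N, D = data.shape
    z = (x - data) / h                        # (N, D) scaled offsets from each x_i
    log_k = -0.5 * np.sum(z ** 2, axis=1) - D * np.log(h * np.sqrt(2.0 * np.pi))
    return np.mean(np.exp(log_k))

def posterior_active(x, actives, inactives, h):
    """Eq. (1), with empirical-Bayes priors P(C1), P(C2) set to the observed
    class proportions, as discussed above."""
    n1, n2 = len(actives), len(inactives)
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)
    num = p1 * kde_product_gaussian(x, actives, h)
    den = num + p2 * kde_product_gaussian(x, inactives, h)
    return num / den if den > 0.0 else p1     # far from all data, fall back to the prior
```

Ranking molecules then amounts to sorting them by posterior_active, exactly the quantity the ROC construction above consumes.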
Note that a particular advantage of the decision-theoretic framework which is relevant for this problem is that unequal misclassification costs are easily handled.
2.1 Pros and Cons
Since we cannot assume any plausible parametric model for the class-conditional densities, nonparametric estimation is required. Kernel density estimation is the most widely-used and well-studied method for nonparametric density estimation, owing to both its simplicity and flexibility, and the many theorems establishing its consistency for near-arbitrary unknown densities and rates of convergence for its many variants ([14], [13]). However, two main factors have traditionally kept it (and nonparametric density estimation in general) from more widespread applicability, particularly in contexts like the present one:
Computational intractability. Estimation of the density at each of the N points, when performed in the straightforward manner, has O(N^2) computational cost. This quickly becomes prohibitive even for moderate sizes of N.
Statistical inefficiency in high dimensions. Theoretical bounds establish that in the worst case, the number of samples required for accurate kernel density estimation rises exponentially with the dimension. Even for relatively small D, these worst-case numbers are discouraging.
Ignores the simpler decision problem. One of Vapnik's central arguments for the non-probabilistic approach underlying the support vector machine is that if the error rate is the desired quantity to be minimized, estimation of entire densities rather than simpler decision boundaries is unnecessary and wasteful of modeling capacity ([15]). Stated differently, the straightforward decision-theoretic approach does not make use of information which can be obtained from the decision boundary, which is possibly more easily characterized than the entire class-conditional densities.
3 SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic Kernels
The SLAMDUNK methodology consists of a set of procedures designed to mitigate the traditional limitations of nonparametric density estimation in the setting of high-dimensional discrimination, so that its distinct advantages may be exploited. We now treat in turn each of the three roadblocks mentioned in the last section.
3.1 Fast Algorithm for Kernel Density Estimation
Computational intractability is the first major roadblock hit in practice. Computational efficiency impacts statistical inference directly: for example, in [16] only 200 data were subsampled for each class to form the training set, due to the computational cost of BKD. In our experiments we use the entire set of 26,733 data. Any high-dimensional context demands the use of as much data as possible, exacerbating the computational issue.
Fortunately, this problem has been largely mitigated in very recent work presenting a fast algorithm yielding simultaneously fast and accurate computation of kernel density estimates ([8]). The method casts the kernel density estimation computational problem within a larger class called 'generalized N-body problems' and is a special case of a more general algorithmic approach called 'higher-order divide-and-conquer' which achieves the best known time complexity (asymptotic order of growth in runtime for a given problem size N) for this class of problems. It is proven in [8] that the algorithm reduces the O(N^2) cost of density estimation at each of the N points to O(N), or constant time per point.
It is shown empirically in [8] that the algorithm's time complexity is not exponential in the dimension D, as indicated by well-known worst-case theoretical results ([4]). It is instead conjectured that such algorithms are sensitive to the intrinsic dimensionality, the local dimensionality of the manifold upon which the data lie ([5]) (see below).
The algorithm employs techniques of computational geometry, in particular space-partitioning data structures called ball-trees, also called metric trees, which are constructed as a preprocessing step in O(N log N) time. The major constraint imposed by this approach which is most relevant in this context is the fact that ball-trees require that the underlying distance be a true metric, relying heavily, for example, on the triangle inequality. This will become relevant in constraining other parts of our methodology.
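The dual-tree algorithm of [8] itself is not sketched here, but the flavor of tree-based KDE can be reproduced with off-the-shelf tools. For instance, scikit-learn's KernelDensity (an independent implementation we reference purely for illustration, not the code behind our experiments) evaluates kernel sums over a ball-tree with pruning tolerances:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.standard_normal((26733, 100))       # stand-in for the sphered 100-d data

# The ball-tree is built once in O(N log N); queries then prune whole balls of
# points using distance bounds that rely on the triangle inequality.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5,
                    algorithm="ball_tree", rtol=1e-3)  # tolerated relative error
kde.fit(X)
log_p = kde.score_samples(X[:10])           # log p^(x) at the first 10 points
```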
3.2 Nonstationary and Nonisotropic Estimators
We now consider extensions to the standard kernel density estimator as described earlier, regarding the parametrization of the kernel function.
Nonstationary estimators. It has long been noted that the assumption of spatial stationarity, or a single scale h holding across the entire space, is deficient. Visually it is clear that smoothing with a fixed bandwidth is unappealing when the dataset contains regions of differing density, which is inevitable in practice. This problem is clearly seen, for example, in the sparser tails of a typical univariate dataset.
Adaptive (or variable-kernel) kernel density estimators have been studied and shown to be more effective than fixed-width kernel density estimators in experimental studies, e.g. [1]. In these estimators, the variable bandwidth h_i for each point x_i is obtained by scaling the single global bandwidth h by a factor

    \lambda_i \propto \{ \tilde{p}(x_i) \}^{-1/2}    (4)

where \tilde{p}(\cdot) is a pilot estimate of the density, to which the overall estimator is largely insensitive. Many simple choices can be used for this pilot estimate, including adaptive Gaussian mixture models or piecewise-constant estimates based on multivariate binning, for example by kd-trees ([4]).
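A minimal sketch of this scheme follows (assuming, as is conventional in the variable-kernel literature, that the lambda_i are normalized to have geometric mean 1; the pilot can be the fixed-bandwidth estimator sketched earlier):

```python
import numpy as np

def adaptive_bandwidths(data, h, pilot_density):
    """Per-point bandwidths h_i = h * lambda_i via Eq. (4).

    pilot_density: callable returning the pilot estimate p~(x) at a point."""
    p_tilde = np.array([pilot_density(x) for x in data])
    lam = p_tilde ** -0.5                        # wider kernels where data are sparse
    lam /= np.exp(np.mean(np.log(lam)))          # normalize geometric mean to 1
    return h * lam
```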
Nonisotropic estimators. It has been noted by many authors (particularly in the field of machine learning, in which high-dimensional data discrimination and clustering is routinely performed) that in practice it is virtually never the case that a dataset's intrinsic dimensionality is equal to its explicit dimensionality D, e.g. [3]. With the assumption that the data lie on a linear manifold, the dimension of the subspace can be estimated using the eigenspectrum from a principal components analysis ([5]). However, in general the data may lie on a nonlinear manifold ([12]). A common estimator of the intrinsic dimension with minimal assumptions has been called, among other things, the correlation dimension ([7]), but amounts to the 2-point correlation function used in spatial statistics. Very often in practice the intrinsic dimension D' << D, regardless of which variant of its definition is used.
With this in mind, the standard product kernel, which is isotropic, i.e. has equal extent in all directions, is a poor match to realistic high-dimensional data. Further, as noted earlier, the behavior of volumes in high dimensionalities, rising exponentially in D, is disastrous when D is large.
Instead we use an estimator in which the univariate bandwidths h_i are replaced by matrices H_i, resulting in a multivariate kernel such as the multivariate Gaussian

    K_{H_i}(x, x_i) = \frac{1}{(2\pi)^{D/2} \lvert H_i \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x - x_i)^T H_i^{-1} (x - x_i) \right)    (5)

where H_i = h_i \Sigma_k, with \Sigma_k the covariance matrix estimated from the k nearest neighbors of x_i and x_i itself. Such estimators have received relatively little study, though one example showing their consistency is [6].
By allowing increased sensitivity to the local manifold of the data, we deflate the extent of the curse of dimensionality in kernel density estimation, relative to the naive product kernel estimator.
Metric learning. An implicit part of the kernel estimator is the underlying metric used to obtain the distances. The standard Euclidean distance is used by default. It can be seen as a special case of a more general weighted Euclidean distance

    d(x, y) = \lVert x - y \rVert_W = \sqrt{ (x - y)^T W (x - y) }    (6)

in which the matrix W is diagonal, containing all 1's. Rather than assume this special case, we take the stance that the metric weight matrix W should be considered a free parameter to adjust to maximize the performance of our estimator. We refer to this as "learning the metric".
The question of the optimal metric has been heavily studied in the context of discrimination by the nearest-neighbor rule (which can be regarded as a special case of the kernel estimator for discrimination). Although asymptotic results imply that the choice of metric does not affect performance, finite-sample experiments show that marked improvements can be made by adjusting the metric to the task at hand. A general theory has been developed ([11]) which formulates the optimal distance in terms of the Bayes optimal posterior class probabilities. However, this form of "distance" does not in general yield a formal metric. We will ensure that metric properties are retained, for the purpose of using the fast algorithm described earlier, by staying within the confines of weighted Euclidean distances.
The linear discriminant metric. We propose a form of W which relates the metric to the decision boundary corresponding to a linear discriminant.

We consider only forms of W which are diagonal. First we obtain the vector w which is the result of a linear classifier such as logistic regression or a linear support vector machine (we use logistic regression based on the favorable experimental results described earlier). The weight vector w describes a discriminator where the class prediction for x is obtained by computing w^T x and comparing it to a threshold w_0. Thus if two points x and y lie on the decision boundary of the discriminator, we have that

    w^T (x - y) = 0,    (7)

i.e. the vector w is orthogonal to the decision boundary.
By taking the metric formed by the norm

    d(x, y) = \lVert w^T (x - y) \rVert    (8)

we obtain a metric which measures distance along w, or between the class means (with the appropriate Gaussian assumptions). This can be interpreted as measuring the extent to which the linear discriminant prefers class 1 or class 2.
This can be regarded as an implicit form of dimensionality reduction, by realizing that values of w tending to zero will cause the metric to assign negligible weight in those directions, which in the limit is akin to removing the corresponding features.
In this manner, we use discriminant information to weight our metric rather than assume isotropy in the metric.
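One concrete reading of this construction (ours; the text above does not spell out the exact mapping from w to a diagonal W) takes W = diag(w_d^2), so that the weighted Euclidean distance of Eq. (6) becomes an ordinary Euclidean distance on features rescaled by |w_d|, which keeps ball-trees applicable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminant_weighted_features(X, y):
    """Fit a linear discriminator, then rescale feature d by |w_d| so that
    plain Euclidean distance on the result equals the weighted metric of
    Eq. (6) with W = diag(w_d^2) (our interpretation, for illustration)."""
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    return X * np.abs(w)                  # directions with w_d near 0 nearly vanish
```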
Sphering. Our diagonal restriction on W motivates the removal of correlation between the features in advance. Normalizing each feature so that they all have roughly the same scale is also important for kernel density estimation, as noted earlier. For this reason we perform these operations (sphering the data) as the first step of our methodology, using principal component analysis (PCA). We also take the opportunity at this stage to examine the resulting eigenspectrum and remove low-eigenvalue features.
Our overall dimension reduction scheme thus includes two kinds of steps: this PCA-based explicit feature removal, which aims to 'denoise' the data, and the implicit direction weighting performed by our metric learning procedure.
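A minimal sphering sketch (standard PCA whitening; the default of keeping the top 100 directions matches the experiments below, but the function itself is our illustration):

```python
import numpy as np

def sphere(X, keep=100):
    """Center, rotate onto principal directions, rescale each retained
    direction to unit variance, and drop low-eigenvalue directions."""
    Xc = X - X.mean(axis=0)
    evals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues
    top = np.argsort(evals)[::-1][:keep]                  # top-`keep` directions
    return (Xc @ V[:, top]) / np.sqrt(evals[top])         # unit variance per direction
```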
5 Experimental Results
Our dataset contains 26,733 rows and 6,348 attributes, and is sparse, containing 3,732,607 non-zero input values. It has 804 positive output values (the "active" class).
A pre-analysis of the data, however, reveals that 2,290 columns are empty. Furthermore, 388 out of the 8,235,711 pairs of columns are identical; these are also removed. Among the remaining columns, a column reduction scheme also reveals linear dependencies, and 406 of the remaining 3,871 columns are removed. This leaves about half of the original dimensions. We then perform PCA, keeping only 100 of these dimensions.
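The pre-analysis steps before PCA can be sketched as follows (a simplified illustration; the duplicate-pair counting and linear-dependency detection in the actual pipeline are more involved):

```python
import numpy as np

def drop_degenerate_columns(X):
    """Drop empty (all-zero) columns, then keep one representative of each
    set of exactly duplicated columns; X is the 0/1 feature matrix.
    (Linearly dependent columns could then be found with, e.g., a pivoted QR.)"""
    X = X[:, X.any(axis=0)]                       # remove empty columns
    _, keep = np.unique(X, axis=1, return_index=True)
    return X[:, np.sort(keep)]                    # first occurrence of each column
```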
All experiments were performed using 10-fold cross-validation, in which the data are broken into 10 equally-sized disjoint subsets, and testing (evaluation) is performed on one of them while training is performed on the other 9 put together.
The following table lists the results of the experimental evaluation of [10] performed on the same data. It shows the best performance yielded by each method, with and without the use of PCA projecting to 100 dimensions.
Method                           AUC
k-nearest neighbors              0.862 ± 0.017
Bayes classifier                 0.891 ± 0.012
Decision tree                    0.893 ± 0.011
Linear support vector machine    0.918 ± 0.010
RBF support vector machine       0.927 ± 0.013
Logistic regression              0.931 ± 0.012
The next table shows the results of the SLAMDUNK methods on this data.
Method                                                        AUC
SLAMDUNK fixed isotropic kernel                               0.933 ± 0.017
SLAMDUNK fixed isotropic kernel with metric learning          0.937 ± 0.012
SLAMDUNK variable nonisotropic kernel with metric learning    0.940 ± 0.012
6 Conclusion
We have presented a methodology called SLAMDUNK which we have designed to have favorable properties for the problem of virtual screening. We have demonstrated its favorable performance on a real pharmaceutical dataset as evidence that this line of thinking may hold promise for this important contemporary problem.
Additionally, this work represents a foray into the more general problem of high-dimensional discrimination, in particular exploring the extent to which probabilistic methods can be successful in high-dimensional problems. We plan to continue developing the seeds of the ideas which have been presented here.
Acknowledgements

The author would like to thank Andrew Moore for his enthusiastic support of this work and for providing a first-class research environment, Paul Komarek and Ting Liu for the excellent software framework which facilitated this work, and our pharmaceutical collaborators for providing the data and their insights regarding this problem.
References

[1] L. Breiman, W. Meisel, and E. Purcell. Variable Kernel Estimates of Multivariate Densities. Technometrics, 19:135-144, 1977.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[3] J. H. Friedman. Flexible Metric Nearest Neighbor Classification. Technical report, Stanford University, 1994.
[4] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209-226, September 1977.
[5] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[6] G. H. Givens. Consistency of the Local Kernel Density Estimator. Technical report, Colorado State University, 1994.
[7] P. Grassberger and I. Procaccia. Measuring the Strangeness of Strange Attractors. Physica D, pages 189-208, 1983.
[8] A. G. Gray and A. W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. To be submitted to JASA.
[9] G. Harper, J. Bradshaw, J. C. Gittins, and D. V. S. Green. Prediction of Biological Activity for High-Throughput Screening Using Binary Kernel Discrimination. J. Chem. Inf. Comput. Sci., 41:1295-1300, 2001.
[10] P. Komarek and A. W. Moore. Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. In Workshop on AI and Statistics, 2003.
[11] T. P. Minka. Distance Measures as Prior Probabilities. Technical report, Massachusetts Institute of Technology, 2000.
[12] S. Roweis and L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), December 2000.
[13] D. W. Scott. Multivariate Density Estimation. Wiley, 1992.
[14] B. W. Silverman. Density Estimation. Chapman and Hall, New York, 1986.
[15] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[16] D. Wilton and P. Willett. Comparison of Ranking Methods for Virtual Screening in Lead-Discovery Programs. J. Chem. Inf. Comput. Sci., 43:469-474, 2003.