Abstract: The Neyman–Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a fundamental lemma of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

Index Terms: Generalization error bounds, Neyman–Pearson (NP) classification, statistical learning theory.

I. INTRODUCTION

We demonstrate the (almost sure) convergence of both the learned miss and false alarm probabilities to the optimal probabilities given by the NP lemma. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

A. Motivation

In the NP theory of binary hypothesis testing, one must decide between a null hypothesis and an alternative hypothesis. A level of significance α (called the size of the test) is imposed on the false alarm (type I error) probability, and one seeks a test that satisfies this constraint while minimizing the miss (type II error) probability, or equivalently, maximizing the detection probability (power). The NP lemma specifies necessary and sufficient conditions for the most powerful test of size α, provided the distributions under the two hypotheses are known, or (in special cases) the likelihood ratio is a monotonic function of an unknown parameter [1]–[3] (see [4] for an interesting overview of the history and philosophy of NP testing). We are interested in extending the NP paradigm to the situation where one has no knowledge of the underlying distributions except for independent and identically distributed (i.i.d.) training examples drawn from each hypothesis.
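To make the classical setting concrete, the following short sketch works out the NP test for two known Gaussian class densities. The densities, the value of α, and all names below are illustrative assumptions for this sketch only, not taken from the paper.

```python
# A worked instance of the classical NP test for two known Gaussians,
# H0: X ~ N(0,1) versus H1: X ~ N(1,1).  The densities and alpha are
# illustrative assumptions.  Here the likelihood ratio is monotone in x,
# so the most powerful test of size alpha thresholds x directly.
from statistics import NormalDist

alpha = 0.05
f0, f1 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)

# Size-alpha threshold: P0(X > eta) = alpha  =>  eta = F0^{-1}(1 - alpha)
eta = f0.inv_cdf(1.0 - alpha)

size = 1.0 - f0.cdf(eta)      # false alarm probability (equals alpha by construction)
power = 1.0 - f1.cdf(eta)     # detection probability of the NP test

print(f"threshold={eta:.3f}, size={size:.3f}, power={power:.3f}")
```

Because the likelihood ratio is monotone in x for this pair of densities, the most powerful test reduces to a simple threshold on x, which is exactly the "special case" alluded to above.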
Cost-sensitive classification (CSC) is another approach to handling disparate kinds of errors in classification (see [7]–[10] and references therein). Following classical Bayesian decision theory, CSC modifies the standard 0–1 loss function to a weighted Bayes cost. Cost-sensitive classifiers assume the relative costs for different classes are known. Moreover, these algorithms assume estimates (either implicit or explicit) of the a priori class probabilities can be obtained from the training data or some other source.

CSC and NP classification are fundamentally different approaches that have differing pros and cons. In some situations it may be difficult to (objectively) assign costs. For instance, how much greater is the cost of failing to detect a malignant tumor compared to the cost of erroneously flagging a benign tumor? The two approaches are similar in that the user essentially has one free parameter to set. In CSC, this free parameter is the ratio of costs of the two class errors, while in NP classification it is the false alarm threshold α. The selection of costs does not directly provide control on α, and conversely setting α does not directly translate into costs. The lack of precise knowledge of the underlying distributions makes it impossible to precisely relate costs with α. The choice of method will thus depend on which parameter can be set more realistically, which in turn depends on the particular application. For example, consider a network intrusion detector that monitors network activity and flags events that warrant a more careful inspection. The value of α may be set so that the number of events that are flagged for further scrutiny matches the resources available for processing and analyzing them.

From another perspective, however, the NP approach seems to have a clear advantage with respect to CSC. Namely, NP classification does not assume knowledge of or about a priori class probabilities. CSC can be misleading if a priori class probabilities are not accurately reflected by their sample-based estimates. Consider, for example, the case where one class has very few representatives in the training set simply because it is very expensive to gather that kind of data. This situation arises, for example, in machine fault detection, where machines must often be induced to fail (usually at high cost) to gather the necessary training data. Here the fraction of faulty training examples is not indicative of the true probability of a fault occurring. In fact, it could be argued that in most applications of interest, class probabilities differ between training and test data. Returning to the cancer diagnosis example, class frequencies of diseased and normal patients at a cancer research institution, where training data is likely to be gathered, in all likelihood do not reflect the disease frequency in the population at large.

B. Previous Work on NP Classification

Although NP classification has been studied previously from an empirical perspective [11], the theoretical underpinnings were apparently first studied by Cannon, Howse, Hush, and Scovel [12]. They give an analysis of a constrained form of empirical risk minimization (ERM) that we call NP-ERM.

The present work builds on their theoretical foundations in several respects. First, using different bounding techniques, we derive predictive error bounds for NP-ERM that are substantially tighter. Second, while Cannon et al. consider only learning from fixed Vapnik–Chervonenkis (VC) classes, we introduce a constrained form of NP-SRM that automatically balances model complexity and training error, and gives rise to strongly universally consistent rules. Third, assuming mild regularity conditions on the underlying distribution, we derive rates of convergence for NP-SRM as realized by a certain family of decision trees called dyadic decision trees. Finally, we present exact and computationally efficient algorithms for implementing NP-SRM for histograms and dyadic decision trees.

In a separate paper, Cannon et al. [13] consider NP-ERM over a data-dependent collection of classifiers and are able to bound the estimation error in this case as well. They also provide an algorithm for NP-ERM in some simple cases involving linear and spherical classifiers, and present an experimental evaluation. To our knowledge, the present work is the third study to consider an NP approach to statistical learning, the second to consider practical algorithms with performance guarantees, and the first to consider model selection, consistency and rates of convergence.

C. Notation

We focus exclusively on binary classification, although extensions to multiclass settings are possible. Let X be a set and let (X, Y) be a random variable taking values in X × {0, 1}. The variable X corresponds to the observed signal (pattern, feature vector) and Y is the class label associated with X. In classical NP testing, Y = 0 corresponds to the null hypothesis.

A classifier is a Borel measurable function h : X → {0, 1} mapping signals to class labels. In standard classification, the performance of h is measured by the probability of error R(h) = P(h(X) ≠ Y). Here P denotes the probability measure for (X, Y). We will focus instead on the false alarm and miss probabilities, denoted by

    R_j(h) = P(h(X) ≠ Y | Y = j)

for j = 0 and j = 1, respectively. Note that

    R(h) = π_0 R_0(h) + π_1 R_1(h)

where π_j = P(Y = j) is the (unknown) a priori probability of class j. The false-alarm probability is also known as the size, while one minus the miss probability is known as the power.

Let α ∈ (0, 1) be a user-specified level of significance or false alarm threshold. In NP testing, one seeks the classifier g* minimizing R_1(h) over all h such that R_0(h) ≤ α. In words, g* is the most powerful test (classifier) of size α. If f_0 and f_1 are the conditional densities of X (with respect to a measure ν) corresponding to classes 0 and 1, respectively, then the NP lemma [1] states that g*(x) = 1{Λ(x) > η}, where 1{·} denotes the indicator function, Λ(x) = f_1(x)/f_0(x) is the likelihood ratio, and η is as small as possible such that R_0(g*) ≤ α. Thus, when f_0 and f_1 are known (or in certain special cases where the likelihood ratio is a monotonic function of an unknown parameter) the NP lemma delivers the optimal classifier.²

²In certain cases, such as when X is discrete, one can introduce randomized tests to achieve slightly higher power. Otherwise, the false alarm constraint may not be satisfied with equality. In this paper, we do not explicitly treat randomized tests. Thus, h* and g* should be understood as being optimal with respect to classes of nonrandomized tests/classifiers.
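As a quick illustration of the quantities just defined, the following sketch evaluates a likelihood-ratio classifier under assumed class densities and estimates its false alarm and miss probabilities by Monte Carlo. The densities, the threshold η, and the sample sizes are illustrative assumptions for this sketch only.

```python
# Monte Carlo estimates of R_0(g) and R_1(g) for a likelihood-ratio classifier
# g(x) = 1{f1(x)/f0(x) > eta}, under assumed densities f0 = N(0,1), f1 = N(1,1).
import random
from statistics import NormalDist

f0, f1 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
eta = 1.0                                      # illustrative likelihood-ratio threshold

def g(x: float) -> int:
    return int(f1.pdf(x) / f0.pdf(x) > eta)

random.seed(0)
m = 100_000
x0 = [random.gauss(0.0, 1.0) for _ in range(m)]   # samples from class 0
x1 = [random.gauss(1.0, 1.0) for _ in range(m)]   # samples from class 1

R0_hat = sum(g(x) for x in x0) / m                # false alarm: predict 1 on class 0
R1_hat = sum(1 - g(x) for x in x1) / m            # miss: predict 0 on class 1
print(f"R0 ~ {R0_hat:.3f}, R1 ~ {R1_hat:.3f}, power ~ {1 - R1_hat:.3f}")
```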
In this paper, we are interested in the case where our only information about the distributions is a finite training sample. Let S_n = ((X_1, Y_1), ..., (X_n, Y_n)) be a collection of n i.i.d. samples of (X, Y). A learning algorithm (or learner for short) is a mapping that assigns to each training sample a classifier from the set of all classifiers. In words, the learner is a rule for selecting a classifier based on a training sample. When a training sample is given, we also use h_n to denote the classifier produced by the learner.

In classical NP theory it is possible to impose absolute bounds on the false alarm and miss probabilities. When learning from training data, however, one cannot make such guarantees because of the dependence on the random training sample. Unfavorable training samples, however unlikely, can lead to arbitrarily poor performance. Therefore, bounds on R_0(h_n) and R_1(h_n) can only be shown to hold with high probability or in expectation. Formally, let P^n denote the product measure on (X × {0, 1})^n induced by the distribution of (X, Y), and let E^n denote expectation with respect to P^n. Bounds on the false alarm and miss probabilities must be stated in terms of P^n and E^n (see Section I-D).

D. Main Results

Moreover, these bounds decay exponentially fast as functions of increasing n_0 and n_1 (see Section II for details).

False Alarm Probability Constraints: In classical hypothesis testing, one is interested in tests/classifiers h satisfying R_0(h) ≤ α. In the learning setting, such constraints can only hold with a certain probability. It is, however, possible to obtain nonprobabilistic guarantees of a related form (see Section II-C for details).

Oracle inequalities: Given a hierarchy of sets of classifiers H^1 ⊂ H^2 ⊂ ···, the NP-SRM classifier does about as well as an oracle that knows which class achieves the proper balance between the estimation and approximation errors. In particular, we will show that, with high probability, both the excess false alarm probability and the excess miss probability are bounded by quantities that tend to zero at a certain rate depending on the choice of class (see Section III for details).

Consistency: If the hierarchy of classes grows (as a function of n) in a suitable way, then h_n is strongly universally consistent [14, Ch. 6] in the sense that the false alarm probability converges to α with probability 1 and the miss probability converges to its optimal value with probability 1, for all distributions of (X, Y) (see Section IV for details).

Rates of Convergence: Under mild regularity conditions, there exist functions ε_0(n) and ε_1(n) tending to zero at a polynomial rate that bound the expected excess false alarm and miss probabilities (see Section V for details).

We investigate classifiers defined in terms of the following sample-dependent quantities. For j = 0, 1, let

    n_j = |{i : Y_i = j}|

be the number of samples from class j. Let

    R̂_j(h) = (1/n_j) Σ_{i : Y_i = j} 1{h(X_i) ≠ Y_i},   j = 0, 1,

denote the empirical false alarm and miss probabilities, corresponding to R_0 and R_1, respectively. Given a class of classifiers H, define³

    ĥ_n = arg min_{h ∈ H} R̂_1(h)   s.t.   R̂_0(h) ≤ α + ε_0/2        (1)

where ε_0 is a tolerance to be specified (see Section II).

³In this paper, we assume that a classifier h achieving the minimum exists. Although not necessary, this allows us to avoid laborious approximation arguments.
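The rule (1) can be made concrete with a small brute-force sketch over a finite set of candidate classifiers. The candidate thresholds, the toy data, and the particular relaxed constraint α + ε_0/2 below are illustrative assumptions; the tolerance ε_0 is developed in Section II.

```python
# A minimal sketch of the constrained rule (1), assuming a small finite set of
# candidate classifiers supplied as Python callables.  Candidates, data, and the
# fixed tolerance eps0 are illustrative assumptions, not the paper's setup.
import numpy as np

def empirical_errors(h, X, y):
    """Empirical false alarm and miss probabilities of classifier h."""
    pred = np.array([h(x) for x in X])
    R0_hat = np.mean(pred[y == 0] != 0)      # false alarm: class 0 predicted as 1
    R1_hat = np.mean(pred[y == 1] != 1)      # miss: class 1 predicted as 0
    return R0_hat, R1_hat

def np_erm(candidates, X, y, alpha, eps0):
    """Among candidates with empirical false alarm <= alpha + eps0/2,
    return the one with smallest empirical miss probability."""
    best, best_miss = None, np.inf
    for h in candidates:
        R0_hat, R1_hat = empirical_errors(h, X, y)
        if R0_hat <= alpha + eps0 / 2 and R1_hat < best_miss:
            best, best_miss = h, R1_hat
    return best, best_miss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y, scale=1.0)          # class j drawn from N(j, 1)
    thresholds = np.linspace(-1.0, 2.0, 61)   # candidate threshold classifiers
    candidates = [lambda x, t=t: int(x > t) for t in thresholds]
    h_hat, miss = np_erm(candidates, X, y, alpha=0.1, eps0=0.05)
    print(miss)
```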
We call this procedure NP-ERM. Cannon et al. demonstrate that NP-ERM enjoys properties similar to standard ERM [14], [15] translated to the NP context. We now recall their analysis. To state the theoretical properties of NP-ERM, introduce the following notation.⁴ Let h* denote an optimal classifier within H, that is, one minimizing R_1(h) over all h ∈ H with R_0(h) ≤ α.

While we focus our discussion on VC and finite classes for concreteness and comparison with [12], other classes and bounding techniques are applicable. Proposition 1 allows for the use of many of the error deviance bounds that have appeared in the empirical process and machine learning literature in recent years. The tolerances may even depend on the full sample or on the individual classifier. Thus, for example, Rademacher averages [17], [18] could be used to define the tolerances in NP-ERM. However, the fact that the tolerances for VC and finite classes (defined below in (2) and (3)) are independent of the classifier, and depend on the sample only through n_0 and n_1, does simplify our extension to SRM in Section III.

The same deviation bound is available for both the false alarm and miss probabilities, so a single form of tolerance may be used for both. As mentioned previously, the key idea is to let the tolerances ε_0 and ε_1 be variable, as follows. Let δ > 0 and define the tolerances ε_j by (2) for j = 0, 1; each ε_j depends on the per-class sample size n_j, on δ, and on the class H. Let ĥ_n be defined by (1) as before, but with the new definition of ε_0 (which now depends on n_0, δ, and H).

The main difference between the rule of Cannon et al. and the rule proposed here is that in their formulation the term constraining the empirical false alarm probability is independent of the sample. In contrast, our constraint is smaller for larger values of n_0. When more training data is available for class 0, a higher level of accuracy is required. We argue that this is a desirable property. Intuitively, when n_0 is larger, R̂_0 should more accurately estimate R_0, and therefore a suitable classifier should be available from among those rules approximating α to within the smaller tolerance. Theoretically, our approach leads to a substantially tighter bound, as the following theorem shows.

Theorem 3: For NP-ERM over a VC class H with tolerances given by (2), with high probability (controlled by δ) over the training sample, the false alarm probability of ĥ_n exceeds α by no more than ε_0, and its miss probability exceeds R_1(h*) by no more than ε_1.

Proof: By Proposition 1, it suffices to show, for j = 0 and j = 1, that the empirical error R̂_j concentrates around R_j uniformly over H at the scale of the corresponding tolerance, which follows from the VC inequality.

The new bound is substantially tighter than that of Theorem 2, as illustrated with the following example.

Example 1: For the purpose of comparison, assume the two class sample sizes are equal and fix δ, α, and the VC dimension. How large should n_0 be so that we are guaranteed that, with at least a prescribed probability, both bounds hold to within a prescribed tolerance? The new bound of Theorem 3 requires a certain minimum n_0, whereas Theorem 2 requires a substantially larger value. To verify that this phenomenon is not a consequence of the particular choice of δ, α, or the VC dimension, Fig. 1 reports the ratios of the necessary sample sizes for a range of these parameters.⁵ Indeed, as α tends to 0, the disparity grows significantly. Finally, we note that the bound of Cannon et al. for retrospective sampling requires even more samples relative to our bound for i.i.d. sampling.

B. NP-ERM With Finite Classes

The VC inequality is so general that in most practical settings it is too loose owing to large constants. Much tighter bounds are possible when H is finite. For example, H could be obtained by quantizing the elements of some VC class to machine precision. Let H be finite and define the NP-ERM estimator ĥ_n as before. Redefine the tolerances ε_0 and ε_1 by (3), which depend on log|H| rather than on the VC dimension.
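For intuition about how such tolerances scale with the per-class sample sizes, the following sketch compares a textbook VC deviation bound with a Hoeffding-plus-union bound for a finite class. The constants here are not the paper's (2) and (3), and the parameter values are illustrative assumptions.

```python
# Numerical comparison of deviation tolerances for a VC class versus a finite
# class.  eps_vc uses a standard VC inequality with the shatter-coefficient
# bound (n+1)^V; eps_finite uses Hoeffding's inequality plus a union bound.
import math

def eps_vc(n_j: int, V: int, delta: float) -> float:
    # from P(sup_h |Rhat - R| > eps) <= 8 (n+1)^V exp(-n eps^2 / 32)
    return math.sqrt(32.0 * (V * math.log(n_j + 1) + math.log(8.0 / delta)) / n_j)

def eps_finite(n_j: int, card: int, delta: float) -> float:
    # from P(|Rhat - R| > eps) <= 2 exp(-2 n eps^2) and a union over |H| classifiers
    return math.sqrt((math.log(card) + math.log(2.0 / delta)) / (2.0 * n_j))

if __name__ == "__main__":
    delta = 0.05
    for n_j in (100, 1_000, 10_000):
        print(n_j,
              round(eps_vc(n_j, V=10, delta=delta), 3),
              round(eps_finite(n_j, card=2**10, delta=delta), 3))
```

The finite-class tolerance is dramatically smaller at the same per-class sample size, which is the motivation for the finite-class analysis of this subsection.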
The inequality in (6) implies that the excess false alarm probability decays at a rate that can be designed through the choice of the tolerances, and this requires no assumption about the underlying distribution.

To interpret the second inequality, observe that for any k the excess miss probability decomposes into two terms, referred to as the estimation error and the approximation error,⁶ respectively. If k is such that the false alarm constraint can be met within H^k, then Theorem 3 implies that the estimation error is bounded above by the corresponding tolerance. Thus, (7) says that ĥ_n performs as well as an oracle that clairvoyantly selects k to minimize this upper bound. As we will soon see, this result leads to strong universal consistency of ĥ_n.

⁶Note that, in contrast to the approximation error for standard classification, here the optimal classifier g* must be approximated by classifiers satisfying the constraint on the false alarm probability.

Oracle inequalities are important because they indicate the ability of learning rules to adapt to unknown properties of the data-generating distribution. Unfortunately, we have no oracle inequality for the false alarm error. Perhaps this is because, while the miss probability involves both an estimation and approximation error, the false alarm probability has only a stochastic component (no approximation is required). In other words, oracle inequalities typically reflect the learner's ability to strike a balance between estimation and approximation errors, but in the case of the false alarm probability, there is nothing to balance. For further discussion of oracle inequalities for standard classification see [18], [21].

Theorem 5 holds in this setting as well. The proof is an easy modification of the proof of that theorem, substituting Theorem 4 for Theorem 3, and is omitted.

IV. CONSISTENCY

The inequalities above for NP-SRM over VC and finite classes may be used to prove strong universal consistency of NP-SRM, provided the sets H^k are sufficiently rich as k → ∞ and provided k(n) and the tolerances are calibrated appropriately.

Theorem 6: Let h_n be the classifier given by (5), with penalties defined by (4) for NP-SRM over VC classes, or by (8) for NP-SRM over finite classes. Specify k(n) and δ_k such that
1) k(n) → ∞;
2) the δ_k are summable (for each n);
3) the resulting penalties tend to zero as n → ∞;
4) if the H^k are VC classes, their VC dimensions satisfy an appropriate growth condition; if the H^k are finite, their cardinalities satisfy an appropriate growth condition.
Assume that for any distribution of (X, Y) there exists a sequence of classifiers h_k ∈ H^k such that R_0(h_k) ≤ α and R_1(h_k) → R_1(g*). Then h_n is strongly universally consistent, i.e., R_0(h_n) → α with probability 1 and R_1(h_n) → R_1(g*) with probability 1, for all distributions.

The proof is given in Appendix II. Note that the conditions on k(n) in the theorem hold if k(n) grows like a power of n, i.e., k(n) ≍ n^ρ for an appropriate ρ > 0.

Example 4: To illustrate the theorem, suppose X = [0, 1]^d and H^k is the family of regular histogram classifiers based on cells of bin-width 1/k. Then |H^k| = 2^(k^d), and NP-SRM is consistent provided k(n) → ∞ and k(n)^d grows sufficiently slowly relative to n, in analogy to the requirement for strong universal consistency of the regular histogram rule for standard classification (see [14, Ch. 9]). Moreover, NP-SRM with histograms can be implemented efficiently, as described in Appendix IV.

V. RATES OF CONVERGENCE

Moreover, we are interested in rates that hold independently of the particular distribution in D.⁷

⁷It is also possible to prove rates involving probability inequalities. While more general, this approach is slightly more cumbersome, so we prefer to work with expected values.

A. Rates for False Alarm Error

The rate for false alarm error can be specified by the choice of δ_k. In the case of NP-SRM over VC classes we have the following result. A similar result holds for NP-SRM over finite classes (but without the logarithmic terms).

Proposition 2: Select k(n) and δ_k so that δ_k decays polynomially in n. Under assumption 3) of Theorem 7, the expected excess false alarm probability of h_n decays at a corresponding polynomial rate.

The proof follows exactly the same lines as the proof of Theorem 7 below, and is omitted.
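Before turning to rates for both errors, the model-selection step of NP-SRM discussed above can be pictured with a short sketch: given NP-ERM solutions for each class in the hierarchy, select the one minimizing penalized empirical miss. The penalty values and toy numbers below are illustrative assumptions, not the paper's penalties (4) or (8).

```python
# Schematic NP-SRM selection across a hierarchy H^1, ..., H^K, assuming the
# NP-ERM solution and a penalty value are already available for each k.
def np_srm_select(results):
    """results[k] = (miss_hat_k, penalty_k) for the NP-ERM solution in H^k.
    Return the k minimizing penalized empirical miss probability."""
    return min(results, key=lambda k: results[k][0] + results[k][1])

if __name__ == "__main__":
    # richer classes fit the training data better but pay a larger penalty
    results = {1: (0.30, 0.02), 2: (0.22, 0.06), 3: (0.20, 0.15)}
    k_star = np_srm_select(results)
    print(k_star, results[k_star])
```

Each per-class NP-ERM solution already enforces its own false alarm constraint, so the selection step only trades empirical miss against the penalty.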
B. Rates for Both Errors

A more challenging problem is establishing rates for both errors simultaneously. Several recent studies have derived rates of convergence for the expected excess probability of error relative to the (optimal) Bayes classifier [22]–[26]. Observe that the excess probability of error decomposes into the two class-conditional excess errors, weighted by the a priori class probabilities. Hence, rates of convergence for NP classification, with α taken to be the false alarm probability of the Bayes classifier, imply rates of convergence for standard classification. We summarize this observation as follows.

Proposition 3: Fix a distribution of the data (X, Y). Let h_n be a classifier, and let ε_0(n), ε_1(n) be nonnegative functions tending toward zero. If, for each n, the expected excess false alarm probability is bounded by ε_0(n) and the expected excess miss probability is bounded by ε_1(n), then the expected excess probability of error is bounded by a weighted combination of ε_0(n) and ε_1(n).

Devroye [27] has shown that for any classifier there exists a distribution of (X, Y) such that the excess probability of error decays at an arbitrarily slow rate. In other words, to prove rates of convergence one must impose some kind of assumption on the distribution. In light of Proposition 3, the same must be true of NP learning. Thus, let D be some class of distributions. Proposition 3 also informs us about lower bounds for learning from distributions in D.

Proposition 4: Assume that a learner for standard classification satisfies a minimax lower bound over D. If ε_0(n), ε_1(n) are upper bounds on the rate of convergence for NP learning (that hold independently of the distribution in D), then either ε_0(n) or ε_1(n) decays no faster than that minimax rate.

In other words, minimax lower bounds for standard classification translate to minimax lower bounds for NP classification.

VI. RATES FOR DYADIC DECISION TREES

In this section, we provide an example of how to derive rates of convergence using NP-SRM combined with an appropriate analysis of the approximation error. We consider a special family of decision trees known as dyadic decision trees (DDTs) [28]. Before introducing DDTs, however, we first introduce the class of distributions with which our study is concerned.

A. The Box-Counting Class

From this point on assume X = [0, 1]^d. Before introducing the class we need some additional notation. Let m denote a positive integer, and define P_m to be the collection of cells formed by the regular partition of [0, 1]^d into hypercubes of sidelength 1/m. Let c_1, c_2 be positive real numbers. Let G* = {x : g*(x) = 1} be the optimal decision set, and let ∂G* be the topological boundary of G*. Finally, let N_m(∂G*) denote the number of cells in P_m that intersect ∂G*.

We define the box-counting class to be the set of all distributions satisfying the following assumptions:
A0: The marginal density of X given Y = 1 is essentially bounded by c_1.
A1: N_m(∂G*) ≤ c_2 m^(d−1) for all m.

The first assumption⁸ is equivalent to requiring that the class-1 probability of any measurable set A be at most c_1 λ(A), where λ denotes the Lebesgue measure on [0, 1]^d. The second assumption essentially requires the optimal decision boundary ∂G* to have Lipschitz smoothness. See [28] for further discussion. A theorem of Tsybakov [22] implies that the minimax rate for standard classification for this class is n^(−1/d) when d ≥ 2 [28]. By Proposition 4, both errors cannot simultaneously decay faster than this rate. In the following, we prove that this lower bound is almost attained using dyadic decision trees and NP-SRM.

⁸When proving rates for standard classification, it is often necessary to place a similar restriction on the unconditional density f(x) of X. Here it is only necessary to bound the class-1 conditional density because only the excess miss probability requires an analysis of approximation error.

B. Dyadic Decision Trees

Scott and Nowak [23], [28] demonstrate that a certain family of decision trees, DDTs, offer a computationally feasible classifier that also achieves optimal rates of convergence (for standard classification) under a wide range of conditions [23], [29], [30]. DDTs are especially well suited for rate of convergence studies. Indeed, bounding the approximation error is handled by the restriction to dyadic splits, which allows us to take advantage of recent insights from multiresolution analysis and nonlinear approximations [31]–[33]. We now show that an analysis similar to that of Scott and Nowak [29] applies to NP-SRM for DDTs, leading to similar results: rates of convergence for a computationally efficient learning algorithm.

A dyadic decision tree is a decision tree that divides the input space by means of axis-orthogonal dyadic splits. More precisely, a dyadic decision tree T is a binary tree (with a distinguished root node) specified by assigning 1) an integer s(v) ∈ {1, ..., d} to each internal node v of T (corresponding to the coordinate that gets split at that node); 2) a binary label 0 or 1 to each leaf node of T. The nodes of DDTs correspond to hyperrectangles (cells) in [0, 1]^d. Given a hyperrectangle A, let A^(s,1) and A^(s,2) denote the hyperrectangles formed by splitting A at its midpoint along coordinate s. Each node of T is associated with a cell according to the following rules: 1) The root node is associated with [0, 1]^d; 2) If v is an internal node associated with the cell A, then the children of v are associated with A^(s(v),1) and A^(s(v),2). See Fig. 2.
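The recursive cell structure just described can be captured in a few lines. This is a minimal illustrative sketch of a DDT; the class and function names are assumptions, not the paper's implementation.

```python
# A minimal sketch of a dyadic decision tree over [0,1]^d.
from dataclasses import dataclass
from typing import Optional, Tuple

Cell = Tuple[Tuple[float, float], ...]     # one (low, high) interval per coordinate

@dataclass
class Node:
    cell: Cell
    split_coord: Optional[int] = None      # None for a leaf
    label: Optional[int] = None            # 0/1 class label for a leaf
    left: Optional["Node"] = None          # x_s <= midpoint
    right: Optional["Node"] = None         # x_s >  midpoint

def split(node: Node, s: int) -> None:
    """Split node's cell at its midpoint along coordinate s (a dyadic split)."""
    lo, hi = node.cell[s]
    mid = 0.5 * (lo + hi)
    left_cell = node.cell[:s] + ((lo, mid),) + node.cell[s + 1:]
    right_cell = node.cell[:s] + ((mid, hi),) + node.cell[s + 1:]
    node.split_coord, node.label = s, None
    node.left, node.right = Node(left_cell), Node(right_cell)

def predict(node: Node, x) -> int:
    """Descend from the root to the leaf cell containing x and return its label."""
    while node.split_coord is not None:
        s = node.split_coord
        lo, hi = node.cell[s]
        node = node.left if x[s] <= 0.5 * (lo + hi) else node.right
    return node.label

if __name__ == "__main__":
    d = 2
    root = Node(cell=tuple((0.0, 1.0) for _ in range(d)))
    split(root, 0)                 # split the root along coordinate 1 (index 0)
    split(root.left, 1)            # split its left child along coordinate 2
    root.left.left.label = 0
    root.left.right.label = 1
    root.right.label = 1
    print(predict(root, (0.2, 0.8)), predict(root, (0.9, 0.1)))
```

Restricting every split to the midpoint of a cell along a single coordinate is precisely what makes the resulting partition dyadic.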
Fig. 2. A dyadic decision tree (right) with the associated recursive dyadic partition (left) in d = 2 dimensions. Each internal node of the tree is labeled with an integer from 1 to d indicating the coordinate being split at that node. The leaf nodes are decorated with class labels.

Let L be a nonnegative integer and define T_L to be the collection of all DDTs such that no leaf cell has a side length smaller than 2^(−L). In other words, when traversing a path from the root to a leaf, no coordinate is split more than L times. Finally, define T_L^k to be the collection of all trees in T_L having k leaf nodes.

C. NP-SRM With DDTs

We study NP-SRM over the family H^k = T_L^k. Since H^k is both finite and a VC class, we may define penalties via (4) or (8). The VC dimension of H^k is simply k, while |H^k| may be bounded as follows: The number of binary trees with k leaves is given by the Catalan number⁹ C_(k−1). The leaves of such trees may be labeled in 2^k ways, while the internal splits may be assigned in d^(k−1) ways. Asymptotically, it is known that C_k ~ 4^k k^(−3/2) π^(−1/2). Thus, for k sufficiently large, |H^k| ≤ (8d)^k. If the penalty is defined by (8) for finite classes, it grows like the square root of k/n_j (times a constant depending on d), while the penalty defined by (4) for VC classes carries an additional logarithmic factor. Therefore, we adopt the penalties for finite classes because they lead to bounds having smaller constants and lacking an additional term.

⁹See http://mathworld.wolfram.com/CatalanNumber.html.
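The counting argument above can be checked with a few lines of arithmetic. The sample size used for comparison is an illustrative assumption.

```python
# The number of DDTs with k leaves is at most C_{k-1} * 2^k * d^{k-1}, where
# C_j is the Catalan number.  Comparing the logarithm of that bound (roughly
# proportional to k) with k*log(n) illustrates why the finite-class penalty
# is preferred over the VC-type penalty here.
import math

def catalan(j: int) -> int:
    return math.comb(2 * j, j) // (j + 1)

def log_num_ddts(k: int, d: int) -> float:
    """log of the bound C_{k-1} * 2^k * d^{k-1} on the number of DDTs with k leaves."""
    return math.log(catalan(k - 1)) + k * math.log(2) + (k - 1) * math.log(d)

if __name__ == "__main__":
    d, n = 2, 10_000
    for k in (4, 16, 64):
        print(k,
              round(log_num_ddts(k, d), 1),        # grows roughly linearly in k
              round(k * math.log(8 * d), 1),       # the (8d)^k upper bound, in logs
              round(k * math.log(n), 1))           # the VC-type factor is much larger
```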
By applying NP-SRM to DDTs¹⁰ with parameters k(n), L(n), and δ_k chosen appropriately, we obtain the following result. Note that the condition on k(n) in the following theorem holds whenever k(n) grows polynomially in n. The proof is found in Appendix III.

Theorem 7: Let h_n be the classifier given by (5), with penalties defined by (8). Specify k(n), L(n), and δ_k such that
1) k(n) grows appropriately with n;
2) L(n) grows appropriately with n;
3) the δ_k are chosen appropriately and are summable.
If the distribution of (X, Y) belongs to the box-counting class, then the expected excess false alarm and miss probabilities of h_n both decay at a polynomial rate.

¹⁰Since L(n) changes with n, the classes H^k are not independent of n as they are in the development of Section III. However, a quick inspection of the proofs of Theorems 5 and 6 reveals that those theorems also hold in this slightly more general setting.

D. Optimal Rates for the Box-Counting Class

The rates given in the previous theorem do not match the lower bound of n^(−1/d) mentioned in Section VI-A. At this point, one may ask two questions: 1) Is the lower bound or the upper bound loose? and 2) If it is the upper bound, is suboptimality due to the use of DDTs or is it inherent in the NP-SRM learning procedure? It turns out that the lower bound is tight, and suboptimality stems from the NP-SRM learning procedure. It is possible to obtain optimal rates (to within a logarithmic factor) using DDTs and an alternate penalization scheme.¹¹

¹¹We do not present this refined analysis here because it is somewhat specialized to DDTs and would require a substantial amount of additional space, detracting from the focus of the paper.

A similar phenomenon appears in the context of standard classification. In [29], we show that DDTs and standard SRM yield suboptimal rates like those in Theorem 7 for standard classification. Subsequently, we were able to obtain the optimal rate with DDTs using a spatially adaptive penalty (which favors unbalanced trees) and a penalized empirical risk procedure [30], [23]. A similar modification works here. That same spatially adaptive penalty may be used to obtain optimal rates for NP classification. Thus, NP-SRM is suboptimal because it does not promote the learning of an unbalanced tree, which is the kind of tree we would expect to accurately approximate a member of the box-counting class. For further discussion of the importance of spatial adaptivity in classification see [28].

E. Implementing Dyadic Decision Trees

The importance of DDTs stems not only from their theoretical properties but also from the fact that NP-SRM may be implemented exactly in polynomial time. In Appendix V, we provide an explicit algorithm to accomplish this task. The algorithm is inspired by the work of Blanchard, Schäfer, and Rozenholc [34], who extend an algorithm of Donoho [35] to perform standard penalized ERM for DDTs.

VII. CONCLUSION

We have extended several results for learning classifiers from training data to the NP setting. Familiar concepts such as empirical risk minimization and structural risk minimization have counterparts with analogous performance guarantees. Under mild assumptions on the hierarchy of classes, NP-SRM allows one to deduce strong universal consistency and rates of convergence. We have examined rates for DDTs, and presented algorithms for implementing NP-SRM with both histograms and DDTs.

This work should be viewed as an initial step in translating the ever growing field of supervised learning for classification to the NP setting. An important next step is to evaluate the potential impact of NP classification in practical settings where different types of error have different consequences or where the a priori class probabilities are not accurately reflected by the training data.
APPENDIX II
PROOF OF THEOREM 6

We prove the theorem in the case of VC classes, the case of finite classes being entirely analogous. Our approach is to apply the Borel–Cantelli lemma [36, p. 40] to show that, with probability 1,

    lim sup_n R_0(h_n) ≤ α   and   lim sup_n R_1(h_n) ≤ R_1(g*).

It then follows that the second inequality must hold with equality, for otherwise there would be a classifier that strictly outperforms the optimal classifier given by the NP lemma (or equivalently, there would be an operating point above the receiver operating characteristic), a contradiction.

First consider the convergence of R_0(h_n) to α. By the Borel–Cantelli lemma, it suffices to show that, for each ε > 0, the probabilities P^n(R_0(h_n) > α + ε) are summable. So let ε > 0 and define the relevant deviation events; by the definition of NP-SRM, the probability in question is bounded by a sum of two terms (9).

Lemma 2 is proved by showing the contrapositive: supposing its conclusion fails, Lemma 1 together with the definition of NP-SRM leads to a contradiction.

The per-class sample sizes are controlled as follows. The relative Chernoff bound [37] states that if X ~ Binomial(n, p), then for all γ ∈ (0, 1), P(X ≤ (1 − γ)np) ≤ exp(−npγ²/2). Since n_j ~ Binomial(n, π_j), the corresponding lemma follows by applying the relative Chernoff bound with an appropriate choice of γ. It now follows that the second term in (9) is summable.

To bound the first term in (9) we use the following lemma.

Lemma 4: There exist constants such that, for all n sufficiently large, the first term in (9) is exponentially small in n.
Proof: Plugging the definition of the tolerances into the uniform deviation bound gives (10), as desired.

Lemma 5: There exists an index beyond which a suitable approximating classifier exists.
Proof: By the assumption of the theorem and since the penalties tend to zero, we may choose an index such that there exists a classifier satisfying i) the false alarm constraint and ii) a near-optimal miss probability. The desired bound then follows by the same argument that produced (10).

APPENDIX III
PROOF OF THEOREM 7

For the miss probability, it can similarly be shown that both the estimation and approximation errors decay at the desired rate; the result will then follow from the oracle inequality.

Let m be a dyadic integer (a power of two), growing with n, such that m ≤ 2^(L(n)) and such that the tree constructed below has at most k(n) leaves. Note that this is always possible by the assumptions on k(n) and L(n). Recall that P_m denotes the partition of [0, 1]^d into hypercubes of side length 1/m. Define B_m to be the collection of all cells in P_m that intersect the optimal decision boundary ∂G*. By the box-counting hypothesis (A1), |B_m| ≤ c_2 m^(d−1) for all m.

Construct T' as follows. We will take T' to be a cyclic DDT. A cyclic DDT is a DDT such that the first coordinate is split when v is the root node and, if v is a cell with child v', then the split coordinate advances cyclically by one. Thus, cyclic DDTs may be grown by cycling through the coordinates and splitting at the midpoint. Define T' to be the cyclic DDT consisting of all the cells in B_m, together with their ancestors, and their ancestors' children. In other words, T' is the smallest cyclic DDT containing all cells in B_m among its leaves. Finally, label the leaves of T' so that they agree with the optimal classifier g* on cells not intersecting ∂G*, and label cells intersecting ∂G* with class 0. By this construction, T' satisfies the false alarm constraint. Note that T' has depth d log_2 m.

By the following lemma we know T' belongs to one of the classes under consideration, with a controlled number of leaves.

Lemma 6: Let k' denote the number of leaf nodes of T'. Then k' is bounded by a constant multiple of m^(d−1).
Proof: Observe that only those nodes in T' that intersect ∂G* can be ancestors of nodes in B_m. By the box-counting hypothesis, the number of nodes of T' at a given depth that can intersect ∂G* grows geometrically with a ratio governed by m^(d−1). Hence, summing over depths, there are at most a constant multiple of m^(d−1) ancestors of cells in B_m. Since the leaf nodes of T' are the children of ancestors of cells in B_m, it follows that k' is bounded by a constant multiple of m^(d−1).

To bound the approximation error, observe that the miss probability of T' exceeds that of g* by at most the class-1 probability of the cells intersecting ∂G*, where the second inequality follows from A0 and the third from A1. This completes the proof.

APPENDIX IV
AN ALGORITHM FOR NP-SRM WITH HISTOGRAMS

NP-SRM for histograms can be implemented by solving NP-ERM for each k. This yields classifiers ĥ^k, one for each k. The final NP-SRM solution is then determined by simply selecting the ĥ^k that minimizes the penalized empirical miss probability. The challenge is in implementing NP-ERM. Thus, let k be fixed.

An algorithm for NP-ERM over H^k is given in Fig. 3. The following notation is used. Let C_i, i = 1, ..., k^d, denote the individual hypercubes of side length 1/k that comprise histograms in H^k. Let n_j^i denote the number of class-j training samples in cell C_i. Let q be the largest integer such that q/n_0 does not exceed the relaxed false alarm constraint. For 0 ≤ r ≤ q, let h_r denote the histogram classifier having minimum empirical miss probability among all histograms committing exactly r class-0 training errors. Let M_r denote the empirical miss probability of h_r. Assume that histogram classifiers are represented by the cells labeled one, so that each h is a collection of cells.

NP-ERM amounts to determining the h_r, 0 ≤ r ≤ q, with the smallest empirical miss probability. The algorithm builds up this optimal classifier in a recursive fashion. The proof that the algorithm attains the NP-ERM solutions is a straightforward inductive argument, and is omitted. The computational complexity of the algorithm is polynomial in the number of cells and in q. Assuming k(n) is chosen so that NP-SRM is consistent, the complexity of NP-ERM here is polynomial in n.
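The recursive construction described above can be read as a knapsack-style dynamic program over the false-alarm budget. The following is a minimal sketch under that interpretation; the cell counts, the budget q, and all names are illustrative assumptions, not the paper's Fig. 3 algorithm.

```python
# A minimal sketch of NP-ERM over a fixed histogram partition.  cells[i] gives
# the class-0 and class-1 training counts in cell i.  Labeling a cell with
# class 1 incurs its class-0 count in false alarms and recovers its class-1
# count; we maximize recovered class-1 points subject to at most q false alarms.
from typing import List, Tuple

def np_erm_histogram(cells: List[Tuple[int, int]], q: int):
    NEG = float("-inf")
    # best[r] = max class-1 points recoverable using exactly r false alarms
    best = [0] + [NEG] * q
    choice = [[False] * len(cells) for _ in range(q + 1)]
    for i, (n0, n1) in enumerate(cells):
        new_best = best[:]
        new_choice = [row[:] for row in choice]
        for r in range(q + 1):
            if best[r] == NEG or r + n0 > q:
                continue
            if best[r] + n1 > new_best[r + n0]:
                new_best[r + n0] = best[r] + n1
                new_choice[r + n0] = choice[r][:]
                new_choice[r + n0][i] = True
        best, choice = new_best, new_choice
    # pick the false-alarm count r <= q with the most recovered class-1 points
    r_star = max(range(q + 1), key=lambda r: best[r] if best[r] != NEG else -1)
    labeled_one = [i for i, used in enumerate(choice[r_star]) if used]
    return labeled_one, best[r_star], r_star

if __name__ == "__main__":
    cells = [(3, 5), (0, 2), (4, 1), (1, 4)]   # toy (class-0, class-1) counts per cell
    print(np_erm_histogram(cells, q=4))
```

Each state r tracks the best achievable class-1 coverage using exactly r class-0 training errors, mirroring the classifiers h_r above; the final NP-ERM solution is read off from the best state with r ≤ q.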
APPENDIX V
AN ALGORITHM FOR NP-SRM WITH DDTS

Our theorems on consistency and rates of convergence tell us how to specify the asymptotic behavior of k(n) and L(n), but in a practical setting these guidelines are less helpful. Assume, then, that L is selected by the user (usually the maximum such that the algorithm runs efficiently; see below), and take k to be the largest possible meaningful value. We replace the symbol h for a generic classifier by the notation T for trees. Let |T| denote the number of leaf nodes of T. We seek an algorithm implementing NP-SRM over DDTs.