Box 2
Classical results of empirical risk minimization algorithms
The results described in this Letter for the special case of exact
minimization are also valid in the general case of almost ERM, in which
the minimum is not assumed to exist (see ref. 16), though the proofs are
technically somewhat more complex.
The key theorem1–5 of classical learning theory relates consistency of ERM to a constraint on the function class H.
Theorem. The following are equivalent for 'well-behaved' loss functions, such as the square loss: (1) ERM generalizes and is consistent; (2) H is a uGC class.
A function class H is a uGC (uniform Glivenko–Cantelli) class if universal, uniform convergence in probability holds, that is, if for every ε > 0

\[
\lim_{n\to\infty}\;\sup_{\mu}\;\Pr\!\left\{\,\sup_{f\in H}\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i)-\int_X f(x)\,d\mu(x)\right|>\varepsilon\right\}=0 .
\]
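As a concrete illustration of this convergence, the following minimal numerical sketch (not from the Letter; the class of threshold functions, the uniform distribution and the finite grid are illustrative assumptions) estimates the uGC quantity sup_{f∈H} |(1/n)Σ f(x_i) − ∫ f dμ| for growing n:

```python
# Minimal sketch (illustrative assumptions): the uniform-convergence quantity
# in the uGC definition, for the class of threshold functions f_t(x) = 1[x <= t]
# under the uniform distribution on [0, 1]. This class has VC dimension 1,
# so the supremum of the deviation shrinks as the sample size n grows.
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 201)   # finite grid standing in for the class H

def sup_deviation(n):
    """sup over f in H of |empirical mean of f - expected value of f under mu|."""
    x = rng.uniform(0.0, 1.0, size=n)
    empirical = (x[:, None] <= thresholds).mean(axis=0)   # (1/n) sum_i f_t(x_i)
    expected = thresholds                                 # integral of f_t d(uniform) = t
    return np.max(np.abs(empirical - expected))

for n in (10, 100, 1_000, 10_000):
    print(n, sup_deviation(n))                            # decreases roughly like 1/sqrt(n)
```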
The result extends to general loss functions if the functions induced by the composition of the loss V and the hypothesis space H are uGC. For binary functions the uGC property reduces to the well-known requirement of finite VC dimension.

Cucker and Smale7 and Zhou25 developed a complete and effective theory exploiting the property of compactness of H, which is sufficient, but not necessary, for generalization and consistency of ERM, because compactness of H implies the uGC property of H but not vice versa.

Figure 1 Example of an empirical minimizer with large expected error. In the case sketched here the data were generated from the 'true' green function. The blue function fits the data set and therefore has zero empirical error (I_S[f_blue] = 0). Yet it is clear that on future data this function f_blue will perform poorly, as it is far from the true function on most of the domain. Therefore I[f_blue] is large. Generalization refers to whether the empirical performance on the training set (I_S[f]) will generalize to test performance on future examples (I[f]). If an algorithm is guaranteed to generalize, an absolute measure of its future predictivity can be determined from its empirical performance.
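The situation in Fig. 1 is easy to reproduce numerically. The sketch below (not from the Letter; the 'memorizing' predictor, the sine target and the sample sizes are illustrative assumptions) constructs an empirical minimizer with I_S[f] = 0 whose expected error I[f] nevertheless stays large:

```python
# Minimal numerical analogue of Fig. 1 (illustrative assumptions throughout):
# a 'memorizing' empirical minimizer f_blue that reproduces the training labels
# exactly, so I_S[f_blue] = 0, but predicts 0 everywhere else, so its expected
# error I[f_blue] remains large no matter how well it fits the sample.
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)         # stand-in for the 'true' green function

x_train = rng.uniform(0.0, 1.0, 10)
y_train = true_f(x_train)

def f_blue(x):
    """Return the stored label on training points, 0 elsewhere (pure memorization)."""
    x = np.atleast_1d(x)
    out = np.zeros_like(x)
    for xi, yi in zip(x_train, y_train):
        out[np.isclose(x, xi)] = yi
    return out

x_test = rng.uniform(0.0, 1.0, 100_000)          # fresh samples approximate I[f] under mu
I_S = np.mean((f_blue(x_train) - y_train) ** 2)  # empirical error: exactly 0
I   = np.mean((f_blue(x_test) - true_f(x_test)) ** 2)  # expected error: about E[sin^2] = 0.5

print(f"I_S[f_blue] = {I_S:.3f}   I[f_blue] ~ {I:.3f}")
```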
We now ask whether CVloo stability is sufficient for generalization of any learning algorithm satisfying it. The answer to this question is negative (Theorem 11 in ref. 15 claims incorrectly that CVloo stability is sufficient for generalization, since there are counterexamples16).

A positive answer can be obtained by augmenting CVloo stability with stability of the expected error and stability of the empirical error to define a new notion of stability, CVEEEloo stability, as follows: the learning map L is distribution-independent, CVEEEloo stable if for all probability distributions μ: (1) L is CVloo stable; (2) $\lim_{n\to\infty}\,\sup_{i\in\{1,\dots,n\}} \bigl|I[f_S]-I[f_{S_i}]\bigr| = 0$ in probability; (3) $\lim_{n\to\infty}\,\sup_{i\in\{1,\dots,n\}} \bigl|I_S[f_S]-I_{S_i}[f_{S_i}]\bigr| = 0$ in probability.
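These three conditions can be probed numerically for any concrete algorithm by retraining with one example removed at a time. The following minimal sketch (not from the Letter; ridge regression, the polynomial features, the square loss and the held-out estimate of the expected error I[f] are all illustrative assumptions) measures the CVloo term |V(f_{S_i}, z_i) − V(f_S, z_i)| together with the changes in expected and empirical error appearing in conditions (2) and (3):

```python
# Minimal sketch (illustrative assumptions): leave-one-out estimates of the
# CVloo and CVEEEloo stability quantities for ridge regression with fixed
# polynomial features and the square loss. The expected error I[f] is
# approximated with a large held-out sample, since mu is unknown in practice.
import numpy as np

rng = np.random.default_rng(0)

def features(x, degree=3):
    return np.vander(np.atleast_1d(x), degree + 1)       # polynomial feature map

def fit(x, y, lam=1e-2):
    """Ridge regression: the hypothesis f_S returned by the learning map L."""
    A = features(x)
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return lambda t: features(t) @ w

def sq_loss(f, x, y):
    return (f(x) - y) ** 2

n = 50                                                    # training set S
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)
x_test = rng.uniform(-1.0, 1.0, 5_000)                    # held-out sample for I[f]
y_test = np.sin(3 * x_test) + 0.1 * rng.standard_normal(5_000)

f_S = fit(x, y)
I_S = sq_loss(f_S, x, y).mean()                           # empirical error I_S[f_S]
I   = sq_loss(f_S, x_test, y_test).mean()                 # estimated expected error I[f_S]

cvloo = d_expected = d_empirical = 0.0
for i in range(n):
    keep = np.arange(n) != i
    f_Si = fit(x[keep], y[keep])                          # trained on S with point i removed
    # CVloo stability: change of the loss at the held-out point z_i
    cvloo = max(cvloo, abs(sq_loss(f_Si, x[i], y[i])[0] - sq_loss(f_S, x[i], y[i])[0]))
    # condition (2): change of the (estimated) expected error
    d_expected = max(d_expected, abs(sq_loss(f_Si, x_test, y_test).mean() - I))
    # condition (3): change of the empirical error
    d_empirical = max(d_empirical, abs(sq_loss(f_Si, x[keep], y[keep]).mean() - I_S))

print(f"CVloo: {cvloo:.2e}  expected-error change: {d_expected:.2e}  "
      f"empirical-error change: {d_empirical:.2e}")
```

For this particular algorithm all three quantities come out small and shrink as n grows, in line with the remark below that conditions (2) and (3) are weak and satisfied by most reasonable algorithms.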
Properties (2) and (3) are weak and satisfied by most 'reasonable' learning algorithms. They are not sufficient for generalization. In the case of ERM, CVloo stability is the key property, since both conditions (2) and (3) are implied by consistency of ERM16. Unlike CVloo stability, CVEEEloo stability is sufficient for generalization for any learning algorithm16:

Theorem B. If a learning algorithm is CVEEEloo stable and the loss function is bounded, then f_S generalizes, that is, uniformly for all μ,

\[
\lim_{n\to\infty} \bigl|I[f_S]-I_S[f_S]\bigr| = 0 \quad \text{in probability.}
\]

Notice that Theorem B provides a general condition for generalization but not for consistency, which is not too surprising because the class of non-ERM algorithms is very large. In summary, CVEEEloo stability is sufficient for generalization for any algorithm, and necessary and sufficient for generalization and consistency of ERM.

Good generalization ability means that the performance of the algorithm on the training set will accurately reflect its future performance on the test set. Therefore an algorithm that guarantees good generalization will predict well if its empirical error on the training set is small. Conversely, notice that it is also possible for an algorithm to generalize well but predict poorly (when both empirical and expected performances are poor). Crucially, therefore, one can empirically determine the predictive performance by looking at the error on the training set.

Learning techniques are similar to fitting a multivariate function to measurement data, a classical problem in nonparametric statistics10,18,19. The key point is that the fitting should be predictive. In this sense 'learning from examples' can also be considered as a stylized model for the scientific process of developing predictive theories from empirical data2. The classical conditions for generalization of ERM can then be regarded as the formalization of a 'folk theorem', which says that simple theories should be preferred among all of those that fit the data. Our stability conditions would instead correspond to the statement that the process of research should, most of the time, only incrementally change existing scientific theories as new data become available.

It is intellectually pleasing that the concept of stability, which cuts across so many different areas of mathematics, physics and engineering, turns out to play such a key role in learning theory. It is somewhat intuitive that stable solutions are predictive, but it is especially surprising that our specific definition of CVEEEloo stability fully subsumes the classical necessary and sufficient conditions on H for consistency of ERM.

In this Letter we have stated the main properties in an asymptotic form. The detailed statements16 of the two stability theorems sketched here, however, provide bounds on the difference between empirical and expected error for any finite size of the training set.

Our observations on stability suggest immediately several other questions. The first challenge is to bridge learning theory and a quite different and broad research area, the study of inverse problems in applied mathematics and engineering20, since stability is a key condition for the solution of inverse problems (our stability condition can be seen as an extension16 to the general learning problem of the classical notion of condition number that characterizes the stability of linear systems). In fact, while predictivity is at the core of classical learning theory, another motivation drove the development of several of the best existing algorithms (such as regularization algorithms, of which SVMs are a special case): well-
Box 3
Uniform stability