INTRODUCTION
The solution, expressed in terms of the Lagrange multipliers ($\alpha$, $\alpha^*$), is defined as

$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, K(x_i, x_{\mathrm{new}}) + b, \qquad (2)$$

where the bias $b$ is determined by using the constraints in (1), to which the minimization is subject:

$$\begin{array}{ll} ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i, & i = 1, \ldots, \ell, \\ y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^*, & i = 1, \ldots, \ell, \\ \xi_i,\ \xi_i^* \ge 0, & i = 1, \ldots, \ell. \end{array} \qquad (3)$$

The input data vectors that correspond with positive Lagrange multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of a linear loss. For linear loss, the second term in (1) becomes $C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ and the Lagrange multipliers in (2) are bounded from above by $C$. More information on Support Vector Machines can be found in [1] and [7].
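As a concrete reading of (2), the following minimal Python sketch evaluates the kernel expansion at a new input. The RBF kernel, the data layout, and all names here are illustrative choices, not part of the paper.

```python
# A minimal sketch of the SVR prediction in (2); kernel choice and
# variable names are hypothetical, not taken from the paper.
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x_new, X, alpha, alpha_star, b, kernel=rbf_kernel):
    """Evaluate f(x_new) = sum_i (alpha_i - alpha_i^*) K(x_i, x_new) + b."""
    beta = alpha - alpha_star  # nonzero only for the support vectors
    return sum(beta[i] * kernel(X[i], x_new) for i in range(len(X))) + b
```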
The parameter $C$ in (1) controls the trade-off between the complexity of the model ($\frac{1}{2}\|w\|^2$) and the training error ($\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$) [7]. $C$ is also called the regularization parameter, since it corresponds to the weight associated with the smoothness of the solution. Therefore, the choice of the regularization parameter cannot be based on one factor alone, but on the combined influence of several. None of the heuristic or estimation methods in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level, and some other contributing factors.
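To make the trade-off tangible, the small sketch below fits scikit-learn's SVR (which uses the linear $\varepsilon$-insensitive loss) for several values of $C$ on synthetic data. The library, the data, and the parameter values are stand-ins, not the paper's experimental setup.

```python
# Illustrative only: small C yields a smoother, higher-bias model,
# large C chases the noise. Data and values are hypothetical.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)[:, None]
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.standard_normal(50)

for C in (0.01, 1.0, 100.0):
    model = SVR(kernel="rbf", C=C, epsilon=0.1).fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)
    print(f"C={C:>6}: support vectors={len(model.support_)}, train MSE={mse:.4f}")
```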
The rest of the paper consists of four sections. In
section two, useful results from the L-Curve method
are discussed. In the third section a method is derived
that estimates the value of C from a priori parameters. The performance of this method is shown in
section four.
II THE L-CURVE METHOD
Figure 1: (a) the L-Curve; (b) models obtained with C = 5 and C = 254.
Consider the following minimization problem

$$x_\gamma = \operatorname*{arg\,min}_{x} \left\{ \|Ax - b\|^2 + \gamma \|x\|^2 \right\}, \qquad (7)$$
where $A$ is a symmetric positive (semi)definite coefficient matrix and $b$ the given output data. Using the SVD decomposition of $A$, the norms of the solution and of the error can be written as

$$\eta(\gamma) = \|x_\gamma\|^2 = \sum_{i=1}^{\ell} \left( f_i \, \frac{u_i^T b}{\sigma_i} \right)^2, \qquad \rho(\gamma) = \|Ax_\gamma - b\|^2 = \sum_{i=1}^{\ell} \left( (1 - f_i) \, u_i^T b \right)^2, \qquad (8)$$
where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, which depend on $\sigma_i$ and $\gamma$ as follows:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}.$$
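The following sketch evaluates (7)-(8) directly: it computes the filter factors from the SVD and accumulates $\eta(\gamma)$ and $\rho(\gamma)$ over a grid of $\gamma$ values. The matrix $A$, the vector $b$, and the grid are hypothetical placeholders.

```python
# Tikhonov solution/error norms via the SVD, per (7)-(8).
# Assumes nonzero singular values; A and b are illustrative.
import numpy as np

def lcurve_points(A, b, gammas):
    U, s, Vt = np.linalg.svd(A)
    coeffs = U.T @ b                    # u_i^T b
    eta, rho = [], []
    for g in gammas:
        f = s**2 / (s**2 + g)           # Tikhonov filter factors
        eta.append(np.sum((f * coeffs / s) ** 2))    # ||x_gamma||^2
        rho.append(np.sum(((1 - f) * coeffs) ** 2))  # ||A x_gamma - b||^2
    return np.array(eta), np.array(rho)

# The L-Curve is then log(rho) plotted against log(eta) over the grid.
A = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 0.5])
gammas = np.logspace(-4, 2, 50)
eta, rho = lcurve_points(A, b, gammas)
```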
In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, a very important relation between the derivatives of $\rho(\gamma)$ and $\eta(\gamma)$ emerged, and it is this relation we are interested in:

$$\rho'(\gamma) = -\gamma \, \eta'(\gamma). \qquad (13)$$
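A quick finite-difference check of (13) on Tikhonov curves of the same form as above (the setup is repeated so the snippet stands alone; all values are illustrative):

```python
# Numerical check that rho'(gamma) = -gamma * eta'(gamma), up to
# finite-difference error. A, b, and the grid are hypothetical.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.5])
U, s, _ = np.linalg.svd(A)
c = U.T @ b
gammas = np.logspace(-3, 1, 200)
f = s[None, :] ** 2 / (s[None, :] ** 2 + gammas[:, None])
eta = np.sum((f * c / s) ** 2, axis=1)
rho = np.sum(((1 - f) * c) ** 2, axis=1)

deta = np.gradient(eta, gammas)
drho = np.gradient(rho, gammas)
# Maximum deviation from (13), relative to the scale of rho':
print(np.max(np.abs(drho + gammas * deta)) / np.max(np.abs(drho)))
```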
III ESTIMATING C FROM A PRIORI PARAMETERS

For the Support Vector Regression model, the error norm can be computed as

$$\rho(\gamma) = \frac{1}{\ell} \sum_{i=1}^{\ell} \left( y_i - \sum_{k=1}^{\ell} \beta_k(\gamma) \, K(x_k, x_i) \right)^2. \qquad (14)$$
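Read this way, (14) is the mean squared residual of the kernel expansion. A minimal sketch, with the Gram matrix, targets, and coefficients as hypothetical inputs:

```python
# Error norm of an SVR model per (14); K is the symmetric Gram
# matrix K[i, k] = K(x_i, x_k), beta the expansion coefficients.
import numpy as np

def svr_error_norm(K, y, beta):
    """rho = (1/l) * sum_i (y_i - sum_k beta_k K(x_k, x_i))^2."""
    residuals = y - K @ beta
    return float(np.mean(residuals ** 2))
```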
Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone, and using (14), leads to

$$\gamma = -\frac{\rho'(\gamma)}{\eta'(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = -\frac{\eta'(\gamma)}{\rho'(\gamma)}. \qquad (15)$$

This equation forms the basis of the estimate. Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution satisfies $\|w\|^2 < R^2$, where $R$ is the radius of the ball, centred at the origin in the feature space, that contains the data, and which can be computed from the kernel function.¹

¹It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve and …

It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data, and the value of $\varepsilon$. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and from experimental results. Let us assume that the resulting model will be a relatively good model, such that the $\varepsilon$-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the $\varepsilon$-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - 2\varepsilon\right)^2.$$

From experimental observations, it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate.
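Where the curves $\eta(\gamma)$ and $\rho(\gamma)$ can be sampled, (15) can be evaluated numerically. The sketch below does so by finite differences, and also encodes the loose lower bound above; the function names and inputs are illustrative, and the paper's final estimate replaces these derivatives with such a priori bounds.

```python
# Evaluating the basis of the estimate, (15), from sampled curves,
# plus the a priori lower bound on rho. Names are hypothetical.
import numpy as np

def estimate_C(gammas, eta, rho):
    """C = -eta'(gamma) / rho'(gamma), by finite differences on a grid."""
    return -np.gradient(eta, gammas) / np.gradient(rho, gammas)

def rho_lower_bound(y, eps):
    """Loose bound from the number of points, Range(y), and epsilon."""
    ell = len(y)
    return (0.5 * (np.max(y) - np.min(y)) - 2.0 * eps) ** 2 / ell
```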
IV EXPERIMENTAL RESULTS
Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point in the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk, and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 obtained from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimation only needs to predict a value of C close to the corner of the L-Curve.
Figure 2(d): model built with the C-value from the L-Curve.
V CONCLUSIONS
A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the
analysis of the L-Curve method. It was mentioned in
the introduction that choosing a value for C should
involve taking into account several factors, including
the kernel function and the noise level. These factors
are all present in the heuristic proposed.
Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimated C-values are in close proximity to the optimal C. Furthermore, the difference in performance between a model using the C-value determined by the L-Curve and a model using the C estimated by the method is very small and often negligible.
The computation time needed to determine a good
estimate of the optimal C is a fraction of the time
needed to determine the (near) optimal value of C
by means of the L-Curve method. Therefore, the
proposed estimation method can be used for online applications.
References
[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic
Publishers, Hingham, MA, 1996.
[3] P. C. Hansen, The L-Curve and its use in the
numerical treatment of inverse problems, invited
paper for P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142,
WIT Press, Southampton, 2001.
[4] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, UK, 1990.
[5] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.
³The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.