Chapter 9

CLASSIFYING MICROARRAY DATA USING SUPPORT VECTOR MACHINES

Sayan Mukherjee
Postdoctoral Fellow, MIT/Whitehead Institute for Genome Research and Center for Biological and Computational Learning at MIT, e-mail: sayan@mit.edu

1. INTRODUCTION

Over the last few years the routine use of DNA microarrays has made possible the creation of large datasets of molecular information characterizing complex biological systems. Molecular classification approaches based on machine learning algorithms applied to DNA microarray data have been shown to have statistical and clinical relevance for a variety of tumor types: leukemia (Golub et al., 1999), lymphoma (Shipp et al., 2001), brain cancer (Pomeroy et al., 2002), lung cancer (Bhattacharjee et al., 2001), and the classification of multiple primary tumors (Ramaswamy et al., 2001). One particular machine learning algorithm, Support Vector Machines (SVMs), has shown promise in a variety of biological classification tasks, including gene expression microarrays (Brown et al., 2000; Mukherjee et al., 1999). SVMs are powerful classification systems based on regularization techniques with excellent performance in many practical classification problems (Vapnik, 1998; Evgeniou et al., 2000).

This chapter serves as an introduction to the use of SVMs in analyzing DNA microarray data. An informal theoretical motivation of SVMs, from both a geometric and an algorithmic perspective, followed by an application to leukemia classification, is given in Section 2. The problem of gene selection is described in Section 3. Section 4 states some results on a variety of cancer morphology and treatment outcome problems. Multiclass classification is described in Section 5. Section 6 lists software sources and rules of thumb that a user may find helpful. The chapter concludes with a brief discussion of SVMs with respect to some other algorithms.
2. SUPPORT VECTOR MACHINES

In binary microarray classification problems, we are given l experiments {(x_1, y_1), ..., (x_l, y_l)}. This is called the training set, where x_i is a vector corresponding to the expression measurements of the i-th experiment or sample (this vector has n components, each component being the expression measurement for a gene or EST) and y_i is a binary class label, which will be +1 or -1. We want to estimate a multivariate function from the training set that will accurately label a new sample, that is, f(x_new) = y_new. This problem of learning a classification boundary given positive and negative examples is a particular case of the problem of approximating a multivariate function from sparse data.

The problem of approximating a function from sparse data is ill-posed. Regularization is a classical approach to making the problem well-posed (Tikhonov and Arsenin, 1977). A problem is well-posed if it has a solution, the solution is unique, and the solution is stable with respect to perturbations of the data (small perturbations of the data result in small perturbations of the solution). This last property is very important in the domain of analyzing microarray data, where the number of variables (genes and ESTs measured) is much larger than the number of samples. In statistics, when the number of variables is much larger than the number of samples, one is said to be facing the curse of dimensionality: the estimated function may fit the samples in the training set very accurately but be very inaccurate in assigning the label of a new sample. This is referred to as over-fitting. The imposition of stability on the solution mitigates this problem by making sure that the function is smooth, so that new samples similar to those in the training set will be labeled similarly; this is often referred to as the blessing of smoothness.
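The setup above can be sketched numerically. The following toy example (my own illustration, not code from the chapter; the data and the regularization strength lam are assumed) builds a training set of l expression vectors with labels +1/-1 and fits a regularized linear classifier f(x) = w . x by subgradient descent on the hinge loss plus an L2 penalty; the penalty is the term that enforces the stability discussed above:

```python
# Toy regularized linear classification on synthetic "expression" data.
# All data here are synthetic; lam and the step size are assumed values.
import numpy as np

rng = np.random.default_rng(0)
ell, n = 20, 5                       # ell training samples, n genes/ESTs
X = rng.normal(size=(ell, n))        # rows are the sample vectors x_i
X[: ell // 2] += 2.0                 # shift one class so the data are separable
y = np.array([1] * (ell // 2) + [-1] * (ell // 2))   # labels y_i in {+1, -1}

w = np.zeros(n)
lam = 0.1                            # regularization strength (assumed)
for _ in range(500):
    margins = y * (X @ w)
    # subgradient of mean hinge loss max(0, 1 - y_i * (w . x_i)) + lam * ||w||^2
    grad = -(X * y[:, None])[margins < 1].sum(axis=0) / ell + 2 * lam * w
    w -= 0.1 * grad

x_new = rng.normal(size=n) + 2.0     # a new sample resembling the +1 class
print(int(np.sign(w @ x_new)))       # predicted label of the new sample
```

Without the lam * ||w||^2 term this procedure could drive w toward a function that fits the l training points exactly while generalizing poorly, which is exactly the over-fitting scenario described above.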
2.1 Mathematical Background of SVMs

We start with a geometric intuition of SVMs and then give a more general mathematical formulation.

2.1.1 A Geometrical Interpretation

A geometric interpretation of the SVM illustrates how this idea of smoothness or stability gives rise to a geometric quantity called the margin, which is a measure of how well separated the two classes can be. We start by assuming that the classification function is linear:

    f(x) = w . x,                                                  (9.1)

where x and w are n-dimensional vectors. The operation w . x is called a dot product. The label of a new point x_new is the sign of the above function, y_new = sign[f(x_new)]. The classification boundary, all values of x for which f(x) = 0, is a hyperplane(1) defined by its normal vector w (see Figure 9.1).

[Figure 9.1. The hyperplane separating two classes. The circles and the triangles designate the members of the two classes. The normal vector to the hyperplane is the vector w.]

Assume we have points from two classes that can be separated by a hyperplane, and that x is the closest data point to the hyperplane; define x_0 to be the closest point on the hyperplane to x, that is, the closest point to x that satisfies w . x = 0 (see Figure 9.2). We then have the following two equations: w . x = k for some k, and w . x_0 = 0. Subtracting these two equations, we obtain w . (x - x_0) = k. Dividing by the norm of w (the norm of w is the length of the vector w), we obtain(2)

(1) A hyperplane is the extension of the two or three dimensional concepts of lines and planes to higher dimensions. Note that in an n-dimensional space, a hyperplane is an (n-1)-dimensional object, in the same way that a plane is a two dimensional object in a three dimensional space. This is why the normal or perpendicular to the hyperplane is always a vector.
(2) The notation |a| refers to the absolute value of the variable a. The notation ||a|| refers to the length of the vector a.
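The subtraction-and-division step above can be verified numerically. This short check (my own illustration, with an assumed w and x) confirms that for a hyperplane w . x = 0, the distance from a point x to the hyperplane is |w . x| / ||w||, and that the nearest hyperplane point x_0 lies along the normal direction w:

```python
# Numeric check of the distance-to-hyperplane derivation; w and x are
# arbitrary assumed values chosen so the arithmetic is easy to follow.
import numpy as np

w = np.array([3.0, 4.0])             # normal vector of the hyperplane, ||w|| = 5
x = np.array([2.0, 1.0])             # a data point off the hyperplane

k = w @ x                            # w . x = k  (here k = 10)
dist = abs(k) / np.linalg.norm(w)    # |w . x| / ||w|| = 10 / 5 = 2

# x_0, the closest hyperplane point to x, is x shifted along the normal:
x0 = x - (k / (w @ w)) * w
assert np.isclose(w @ x0, 0.0)       # x_0 satisfies w . x_0 = 0
assert np.isclose(np.linalg.norm(x - x0), dist)
print(dist)                          # -> 2.0
```

The two assertions mirror the derivation exactly: x_0 lies on the hyperplane, and the length of x - x_0 equals k / ||w||, the quantity the text obtains by subtracting the two equations and dividing by the norm of w.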