Bioinformatics Approaches in Clinical Proteomics: Review

Eric T Fung†, Scot R Weinberger, Ed Gavin and Fujun Zhang
Protein expression profiling is increasingly being used to discover, validate and characterize biomarkers that can potentially be used for diagnostic purposes and to aid in pharmaceutical development. Correct analysis of data obtained from these experiments requires an understanding of the underlying analytic procedures used to obtain the data, statistical principles underlying high-dimensional data and clinical statistical tools used to determine the utility of the interpreted data. This review summarizes each of these steps, with the goal of providing the nonstatistician proteomics researcher with a working understanding of the various approaches that may be used by statisticians. Emphasis is placed on the process of mining high-dimensional data to identify a specific set of biomarkers that may be used in a diagnostic or other assay setting.

Expert Rev. Proteomics 2(6), 847–862 (2005)

KEYWORDS: biomarker discovery, clinical proteomics, clinical statistics, data analysis, expression profiling, mass spectrometry, pattern recognition, time-of-flight mass spectrometry

† Author for correspondence
Ciphergen Biosystems, Inc., 6611 Dumbarton Circle, Fremont, CA 94555, USA
Tel.: +1 510 505 2242
Fax: +1 510 505 2101
efung@ciphergen.com

CONTENTS
Analytic approaches to clinical proteomics
Data preprocessing
Data mining
Biomarker identification
Expert commentary & five-year view
Key issues
References
Affiliations

Clinical proteomics is an emerging field dedicated to the discovery, validation and characterization of biomarkers that can address a variety of clinical questions. Generally, most clinical proteomics studies involve fractionating a set of clinical samples from a case group and a set of clinical samples from a control group, and then analyzing the fractions by mass spectrometry (MS) or 2D gel electrophoresis. The data streams resulting from these processes are then subjected to complex pattern recognition analysis, resulting in the conclusion that a set of biomarkers has been generated. The strength of that conclusion is dependent on a careful approach to each step of the discovery process, involving appropriate study design, attention to data acquisition protocols and a conservative approach to data mining. Perhaps naively, initial forays in this field were met with overexuberance, with the expectation of quick answers to long-standing questions, such as the early detection of cancer. A retrospective assessment of these early promises reveals that errors in seemingly obvious parameters of study design and execution can have devastatingly misleading results when complex multivariate analysis tools are applied [1–4]. Furthermore, statements that one or another particular data mining algorithm was superior to others have also been made, but have proven to be implausible. Each algorithm comes with its own assumptions and biases; consequently, a certain level of agnosticism is recommended when it comes to choosing data analysis approaches. A generic workflow of the clinical proteomics process is shown in Figure 1.

Several papers discuss issues such as study design and sample acquisition which, albeit critical, are beyond the scope of this review [1,5]. Rather, this review starts with the data analysis steps, commencing with acquisition of the data, which are acquired primarily using time-of-flight (TOF) MS, for example, using surface-enhanced laser desorption/ionization (SELDI) TOF-MS or matrix-assisted laser desorption/ionization (MALDI) TOF-MS. This is followed by a discussion of data preprocessing; for example, baseline subtraction, intensity normalization, peak alignment and peak detection. Once these steps are performed, the data are mined with multivariate analysis tools. Selected validated biomarkers are then purified and identified so that assays can be made. This review will discuss various approaches to multivariate analysis and strategies with which to implement them. It will conclude with a discussion of protein identification, the final step in biomarker discovery, using peptide mass fingerprinting or tandem MS (MS/MS) and searching of databases.
www.future-drugs.com 10.1586/14789450.2.6.847 © 2005 Future Drugs Ltd ISSN 1478-9450 847

Analytic approaches to clinical proteomics
No discussion of clinical proteomics would be complete without background on the analytic approaches taken to generate the data that will be the subject of this review. Basic analytic approaches can be coarsely grouped into two different schemes: those that directly analyze nascent proteins as they present themselves in living systems (top-down) and those that directly examine their proteolytic fragments (bottom-up).

2D gel electrophoresis
The most widely used top-down discovery approach is 2D gel electrophoresis MS [6]. While 2D gels do not directly study proteins by MS, they do provide a means to qualitatively and quantitatively generate protein profiles of authentic samples. Briefly, proteins are first separated by their isoelectric point via immobilized pH gradients, and then further fractionated using sodium dodecyl sulfate (SDS) polyacrylamide gel electrophoresis (PAGE). PAGE slabs are then stained, creating a 2D array of spots. Slabs are digitally imaged using an optical scanner, and scanned images can be differentially compared, with particular focus upon spot location and stain intensity. As such, 3D patterns can be generated for each sample, resolving isoelectric point (pI), molecular weight and abundance.

Early attempts at interpreting 2D gel profiles were facilitated by using artificial intelligence and machine learning programs. One particular program, the Medical Electrophoresis Analysis Interactive Expert System (MELANIE), was created to automatically classify 2D profiles using heuristic clustering analysis and hierarchical classification [7,8]. The overall goal of this work was to create a means to determine disease-associated patterns with the intent of creating a computer-based diagnostic regimen. It was successfully applied towards the diagnosis of liver cirrhosis and the distinction of a variety of cancer types from cancerous biopsies [9]. Later, the work was extended towards the comparative analysis of plasma/serum obtained from apparently healthy individuals and from patients with a few selected, known diseases. Despite their apparent complexity, the patient electropherograms revealed readily detectable modifications of the reference protein profile for the selected diseases. Several disease-associated spot patterns were elucidated from patients with monoclonal gammopathies, hypogammaglobulinemia, hepatic failure, chronic renal failure and hemolytic anemia [10].

While initially successful, the 2D approach failed to translate to the clinic because it was inherently troubled in terms of its limited reproducibility, restricted dynamic range of detection and laboriously slow throughput. More recently, a new 2D approach termed difference in-gel electrophoresis (DIGE) has been introduced as a possible means to improve reproducibility and throughput [11]. DIGE is a modification of the classic 2D approach in which multiple samples can be analyzed within the same gel, thus allowing for the simultaneous analysis of experimental and control cohorts. DIGE is performed by fluorescently tagging multiple samples with different amine-reactive dyes, running them on the same 2D gel, and then performing post-run fluorescence imaging of the gel, allowing for direct superimposition of groups. In this manner, the number of gels to be processed and imaged is reduced. Furthermore, the effects of gel-to-gel irreproducibility are somewhat minimized. DIGE-based analysis has been successfully performed in the study of colorectal cancer [12,13].

While 2D gel patterns can be instructive, ultimately, the identity of the unique markers must be established. Towards this end, 2D gel analysis has been married with electrospray ionization (ESI) and MALDI-MS. Protein spots of interest are typically excised from the gel and then destained. The excised gel plugs are then digested using proteases with specific proteolytic activity, such as trypsin, endo-Lys-C and V-8 protease. Liberated peptides diffuse out of the gel plugs, lending themselves to subsequent MS analysis. An excellent overview of this process, along with suggested protocols, is provided by Corthals and coworkers [14]. The specific details of MS-based protein identification are discussed later in this review.

Figure 1 workflow stages:
Data acquisition (10,000–1,000,000s datapoints): baseline subtraction, calibration, normalization, alignment, peak detection
Data mining (100–1000s input features): unsupervised learning, feature selection, supervised learning, cross-validation/permutation testing, independent validation
Selected biomarkers (1–10s validated biomarkers): biomarker purification, biomarker identification, assay development
Clinical assay

Figure 1. Generic clinical proteomics workflow. Clinical samples taken from the comparison groups are analyzed using any of a variety of proteomics technologies. For the purposes of this discussion, these technologies almost uniformly require sample fractionation followed by analysis by mass spectrometry. Thus, each sample generates tens of thousands to millions of data points. The data generated by the mass spectrometer are subjected to preprocessing steps, and it is the processed data that are submitted to data mining techniques. Once data mining is completed, a best-set of validated biomarkers is derived. The biomarkers are then identified so that high-throughput clinical assays can be constructed.
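As an illustration only, the first stage of the workflow in Figure 1 (data acquisition through peak detection) might be sketched in Python as follows. The rolling-minimum baseline, total-intensity normalization, MAD-based noise estimate and local-maximum peak test are assumptions chosen for brevity rather than methods prescribed by this review; calibration and alignment are omitted, and the synthetic spectrum and all function names are hypothetical.

```python
import math
import statistics


def subtract_baseline(intensity, window=51):
    """Subtract a rolling-minimum baseline estimate (an assumed, simple choice)."""
    half = window // 2
    n = len(intensity)
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        corrected.append(intensity[i] - min(intensity[lo:hi]))
    return corrected


def normalize_total(intensity):
    """Scale intensities so that they sum to 1 (total-intensity normalization)."""
    total = sum(intensity)
    return [v / total for v in intensity] if total > 0 else list(intensity)


def detect_peaks(mz, intensity, snr=10.0):
    """Return (m/z, height) for local maxima whose signal-to-noise ratio is >= snr."""
    med = statistics.median(intensity)
    mad = statistics.median([abs(v - med) for v in intensity])
    noise = 1.4826 * mad or 1e-12  # MAD-based noise estimate; guard against zero
    peaks = []
    for i in range(1, len(intensity) - 1):
        v = intensity[i]
        is_local_max = v > intensity[i - 1] and v >= intensity[i + 1]
        if is_local_max and (v - med) / noise >= snr:
            peaks.append((mz[i], v))
    return peaks


def preprocess(mz, intensity):
    """Chain the three steps: baseline subtraction, normalization, peak detection."""
    corrected = subtract_baseline(intensity)
    normalized = normalize_total(corrected)
    return detect_peaks(mz, normalized)


# Hypothetical synthetic spectrum: decaying low-mass background, a small
# deterministic ripple standing in for noise, and two Gaussian peaks.
mz = [1000.0 + i for i in range(1000)]
intensity = [
    5.0 * math.exp(-(m - 1000.0) / 300.0)
    + 0.3 * math.sin(1.7 * i)
    + 100.0 * math.exp(-((m - 1200.0) ** 2) / 18.0)
    + 60.0 * math.exp(-((m - 1600.0) ** 2) / 18.0)
    for i, m in enumerate(mz)
]
peaks = preprocess(mz, intensity)
```

For this synthetic spectrum the sketch reports the two simulated peaks, at m/z 1200 and 1600. The window width and the signal-to-noise threshold are user-adjustable, which is exactly the kind of parameter choice that can bias real biomarker discovery studies.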

To address the reproducibility and throughput issues associated with classic 2D gel analysis, several researchers combined the direct analysis of polyacrylamide gels with MALDI-MS, eliminating the SDS-PAGE step [15]. Ultrathin (<10 µm when dry) gels were soaked in MALDI matrix solution and then directly analyzed in a MALDI-TOF mass spectrometer. Initial spectra were acquired from isoelectric focusing, native and SDS gels. Virtual 2D gels were created by MALDI scanning isoelectric gels. Virtual 2D gels were extended to the study of the Escherichia coli proteome [16]. When compared with classic 2D studies of the same proteome, virtual 2D analysis allowed for the postulation of protein identities (<50 kDa) based upon improved molecular mass determination and pI (±0.3 pH units). Putative identifications were further confirmed by MALDI in-source decay of the intact protein or by peptide mass mapping following gel-wide chemical cleavage. Post-translational modifications, such as fatty acid acylation, were detected. In total, approximately 250 different proteins (2–120 kDa) were discovered in the 5.7 to 6.0 pI range. Data reduction and display algorithms were created to allow for facilitated viewing and study of virtual 2D results [17]. In its most advanced state, virtual 2D gel analysis demonstrated high sensitivity (analogous to silver-stain detection limits) and improved throughput, resolution and mass accuracy when compared with classic 2D analysis. However, complications associated with interfacing gels with MALDI-MS systems fundamentally limited broader adoption of this approach. In some cases, gels were fractured or ruptured during the sample introduction pump-down process, often polluting the MS system with acrylamide particles and dust. Furthermore, while superior to 2D gel analysis, direct MALDI-MS analysis of gels resulted in compromised analytic sensitivity, mass accuracy and mass-resolving power when compared with traditional MALDI or SELDI measurements. As such, only a few investigators continue to use this approach.

Mass spectrometry
Top-down MS-based approaches to biomarker discovery include SELDI-TOF-MS [18], liquid chromatography (LC)-MS [19] and Fourier-transform ion cyclotron resonance (FTICR) MS [20]. As expected, data generation for each approach is technology-dependent. For the most part, top-down MS studies are performed using either FTICR or TOF-MS detection. In FTICR analysis, the detector output is a time-dependent image current that represents the coulombic charge of all orbiting ions. The angular velocity of each orbiting ion is related to its mass-to-charge (m/z) ratio. To deconvolve this complex signal, a Fourier transform is performed to convert the signal from the time to the frequency domain. Since ions of different m/z values demonstrate different angular velocities, they will generate m/z-specific frequencies whose amplitude is roughly dependent on ion abundance. The system is calibrated by analyzing well-characterized analytes and calibrating the observed frequency to the expected m/z.

Almost universally, current TOF-MS systems operate on the constant energy principle. As such, ions are accelerated by the device to a final kinetic energy, and achieve final velocities that are dependent upon their respective m/z values. Direct signal output of a TOF-MS system is a time-dependent current generated by the impact of charged particles upon an ion-to-electron converting detector. Assuming the absence of detector saturation, current amplitude can be taken to represent the number of incident particles over the sampling period. Generated current is either directly monitored or converted to a time-dependent voltage prior to converting to a digital signal. Digitization is achieved using either a time interval recording approach, such as a time-to-digital converter (TDC), or via time array recording devices, such as high-speed analog-to-digital converters (ADCs). With the exception of some orthogonal TOF systems, most TOF devices use ADCs to capture generated data in a digital form. In terms of data processing, the temporal resolution of the device is dependent upon the analog bandwidth of the detector and associated analog electronics as well as the data acquisition rate of the analog/digital converter. Modern TOF systems employ analog bandwidths ranging from 0.5 to 2 GHz with data acquisition rates typically ranging from one to five gigasamples per second. For the purpose of m/z determination, calibration is performed by measuring the TOF for a number of well-characterized samples and then correlating observed TOF to known m/z. A classic calibration function uses a second-order polynomial to convert observed TOF to ion m/z:

$$m/z = a\,(\mathrm{TOF})^{2} + b$$

Values for a and b are empirically determined during the calibration process.

Data preprocessing
The general objective of data preprocessing of spectra is to arrive at a peak list that can be used for downstream data mining (although some investigators prefer to use m/z values directly). Most analysis packages include some combination of background correction, filtering, noise estimation and peak detection algorithms. All systems must include some method to transform the linear TOF data to the m/z domain.

Signal filtering & background subtraction
Since spectra contain noise from electronic and chemical sources, signal filtering is often employed before peak detection. An essential property of a suitable signal filter for mass spectra is that the filter must not shift the time location of features in the spectra. There are many suitable digital filters that meet this requirement, and moving average or Savitzky–Golay filters are commonly employed.

Background signals can produce a relatively large baseline at low mass in MALDI and SELDI-TOF spectra. One method of background correction includes a standardization of the data to a constant noise scale, and a
denoising method that utilizes hard thresholding [21]. The threshold cut-off is determined by a multiple of the standard deviation. Wavelets can also be applied to TOF spectra for denoising and compression. Qu and coworkers found that, with a discrete wavelet, the spectral data could be compressed without significant loss of information when the data were converted back to the time domain [22]. Malyarenko and coworkers removed the background from SELDI-TOF data collected on a ProteinChip Reader® model PBSII (Ciphergen Biosystems, Inc.) by modeling the response of the detector to overload and charge accumulation, and applying appropriate digital filters to correct for the effect [23]. It remains to be seen if the same techniques can be generally employed, since the effects and the corrections may be specific to this instrument model.

Peak detection
Peak detection in mass spectra serves to both reduce the dimensionality of the spectra and allow quantification of the protein or peptide that gave rise to the peak. Since mass spectra often contain considerable amounts of noise, typical peak detection algorithms include user-adjustable parameters to filter peaks based on height, signal-to-noise or other properties such as width. Peak detection algorithms are often criticized because they typically require user adjustment to parameters in order to optimize results on a given spectrum or data set, and this can introduce a considerable human bias to the process of biomarker discovery.

In order to determine the location of the peak, many algorithms find a centroid using the top portion of a peak, typically 10% of the total peak height. Since the centroid is the peak location at which one half of the area is to each side, it provides a more robust measure of peak location than the apex. The centroid is typically calculated on the top portion of the peak because peak shapes are often skewed at the base of the peak. Kempka and coworkers created a peak detection algorithm that uses the entire peak to find the location in MALDI spectra by fitting two deconvoluted Gaussian distributions to the peak [24]. They reported that, when compared with other commercial peak detection methods, this method improved the mass accuracy, especially on low signal-to-noise peaks. Carlson and coworkers have created a method they call simultaneous spectrum analysis to combine multiple spectra into a single spectrum to use for peak detection [25]. The advantage of such a scheme is that the average of groups of spectra will have a higher signal-to-noise ratio than the individual spectra, thus making the detection of peaks more sensitive and robust. An area-under-the-curve filter is used to determine peak locations. Alternatively, an F-test can be used to test whether the same peak occurs in replicate spectra more often than predicted by chance, which serves as a filter against peaks that occurred as a result of noise. Coombes and coworkers combined peak detection and background subtraction by first finding a set of preliminary peak locations, and then subtracting the peaks from a spectrum and fitting a baseline to the resulting spectrum [26]. This baseline is then subtracted from the original spectrum before performing another pass of peak detection.

Spectrum alignment
Since there is always a measurement error associated with the reported mass of a peak, downstream methods must include methods to cluster peaks together for comparison. In order to improve the accuracy of the clustering, an optional spectrum alignment step may be included in the workflow. Jeffries describes an algorithm for aligning spectra, utilizing peaks that are common across the spectra and minimizing the error term using a Nelder–Mead multivariate optimization procedure [27].

Data mining
Once data have been prepared, they are submitted to data mining algorithms. Generally, data mining approaches fall into two categories: unsupervised and supervised. The former refers to approaches that do not take into account class labels, while the latter refers to approaches that do. Unsupervised learning approaches are analogous to clustering, while supervised learning approaches are analogous to classification. Examples of unsupervised learning approaches include k-means clustering, principal component analysis (PCA) and hierarchical clustering. The gamut of supervised learning techniques includes classification and regression trees, neural networks, genetic algorithms and support vector machines. With the vast array of possibilities, it may be difficult to decide which is the ideal algorithm to use. Unfortunately, there is no simple answer to this question and it has been proposed that there is no ideal single method (no free lunch theorem) [28]. Each algorithm has inherent strengths and weaknesses, which must be matched to the specific statistical problem to be addressed. The descriptions below are intended to describe various algorithms and provide insight into some of their strengths and weaknesses.

Prior to the introduction of data into these mining algorithms, a process of feature selection must be undertaken. The ‘curse of dimensionality’ refers to the asymmetry between the number of input features (e.g., peaks) and the number of examples (e.g., samples). Overcoming this problem is challenging but necessary; any classification algorithm given too many features relative to the number of samples will be able to find a solution to the presented problem, even when that solution reflects only chance structure in the data. The curse of dimensionality results from the magnitude of the input data, which are signal amplitudes recorded at a regular interval (set by the digitizer rate of the TOF mass spectrometer). A given spectrum can have tens to hundreds of thousands of such data points; a given sample can have dozens of spectra depending on the laboratory protocols. One method to perform feature reduction is to select m/z values that represent peaks; the process of peak detection is discussed earlier in this review and is not explored further in this section. Other approaches to perform feature selection, which can be performed on the raw data or on the detected peaks, are to use t-tests, unsupervised learning approaches or supervised learning approaches that incorporate feature selection. These are discussed in greater detail below.

Once the features are selected, and prior to implementation of either unsupervised or supervised learning techniques, data transformation is often performed. This reduces the impact
that a high variance in a given input variable might have by altering the distribution of the original data. For example, the measurement of a specific analyte may be highly variable, either for analytic reasons or for clinical reasons. In some data analysis approaches, this high variability may artificially improve the analyte's chances of being selected as a classifying feature. Data transformation reduces the impact of this variability by constraining the values within a more defined range. Typical approaches to data transformation include log transformation, square-root transformation, and linear and logarithmic scaling. Not all data analysis approaches require transformation; for example, classification trees ignore variance and locate cut points that can optimally segregate samples (see below). Therefore, transformation is not necessary for such methods.

Unsupervised learning techniques
A good initial assessment of data quality can be obtained by using unsupervised approaches to visualize the distribution of the data [26,29]. This permits the identification of outlier samples, which may result from sample misclassification, from a previously unrealized clinical subgroup or from variances in laboratory sampling. In addition, using unsupervised learning techniques followed by superimposition of sample labels can also provide a qualitative assessment of the separation power of the data. Finally, unsupervised learning techniques can be used as a basis for feature selection. Each of the uses of unsupervised learning techniques as well as a general description of these approaches are summarized below.

One commonly used unsupervised learning technique is PCA [30]. PCA maps high-dimensional data into a more manageable set of dimensions by creating linear combinations of the input variables (the eigenvectors of their covariance matrix). Each linear combination (or principal component) is a weighted sum of the amplitude at each m/z value (or peak if peak detection provides the input variables). The individual weights assigned to each of the input variables are derived by calculating the covariance structure of the input variables. The principal components are then ordered by the proportion of the overall variance for which each principal component accounts. The feature extraction capabilities of PCA are embodied in the fact that the top (usually less than ten, and often the top two or three) principal components are typically adequate to separate the samples into relatively homogeneous clusters; that is, these top principal components account for the majority of the variance in the data structure. This can be visualized in 2D or 3D plots in which the calculated values for each of the top principal components serve as the x, y and z axes. Moreover, for each principal component, the input variables that have the largest absolute values for their coefficients have the greatest weight and, therefore, have the greatest discriminating power. Generally, for PCA to be used in this capacity, some sort of transformation (e.g., rescaling) needs to be performed so that peaks with the largest variance are not unfairly weighted. For a more detailed, mathematical discussion of PCA, the reader is referred to [28]. Lundquist and coworkers used PCA to analyze data intended to distinguish subspecies of Francisella tularensis by SELDI-TOF-MS [31]. Notably, the first principal component mainly separated the novicida group from the rest of the strains, and the second principal component described differences between the holarctica and tularensis groups. Finally, the third principal component described the differences in protein profiles between the mediasiatica group and the rest of the strains. These three principal components described 72% of the total variance. Lancashire and coworkers similarly used PCA to cluster species of Neisseria based on SELDI-TOF-MS data [32]. An interesting approach to using PCA is to rank peak intensities within each spectrum, rather than compare peaks directly against each other [33]. Such an approach has been applied to 2D gel analysis of various histologic types of lung [34], cervical [35] and borderline ovarian cancer [36].

Hierarchical clustering is another method to visualize the distribution of data [37]. Briefly, hierarchical clustering begins by assigning each sample to its own cluster. It then calculates similarity scores or distance metrics between samples, and places samples that are close to each other together. This is done iteratively until all samples are ordered. Specific hierarchical clustering algorithms can differ in their methodology for calculating distance metrics. Typically, the representation of the data using hierarchical clustering is in the form of dendrograms. Hierarchical clustering can be performed simultaneously in two dimensions: typically, samples and peaks. Using red and green color coding to represent the relative up- and downregulation of peak intensity generates the familiar heat map presented in many publications [38,39].

Supervised learning techniques
By definition, supervised learning techniques require class assignments such that training (learning) can occur on the data obtained from a subset of the provided samples. The two types of variables in this exercise are therefore the predictor variables (intensity at m/z values or peak intensity) and response variables (disease classes). The most straightforward approach to identifying differences between groups using a supervised approach is the t-test. There are numerous ways to calculate the t-statistic and to implement it. The various t-tests make assumptions about the sample size of the respective groups, the distribution (e.g., Gaussian) of values within each group, and whether the variance is equal or nonequal. The Welch t-test assumes two independent, small, normally distributed groups with unequal variance [40]. A more conservative approach is to assume that the groups are not normally distributed. This is embodied by the Mann–Whitney test (or Kruskal–Wallis test, when more than two groups are being analyzed). The Mann–Whitney test assumes equal variances between the two groups. The t-test, however, suffers from several shortcomings, some of which can be overcome but others cannot. First, by calculating the t-statistic for each peak (or m/z value), multiple hypothesis testing can lead to an artificial inflation of the number of variables deemed to be significantly different between the comparison groups. Second, these calculations
assume that the measurements are independent of each other, which ignores the fact that proteins are often coregulated. Additionally, if raw data are used rather than peaks, each m/z value is related to its adjacent m/z value. Binning is often performed to overcome this problem. Third, these calculations ignore the possibility that subsets of clinical groups exist for which a given input variable may be useful. The Bonferroni correction is one method that attempts to reduce the impact of multiple hypothesis testing [41]. Permutation tests and assessment of the false-discovery rate are other approaches [42–51]. Significance analysis of microarrays also attempts to minimize the impact of multiple hypothesis testing, and can be applied to proteomics data [52,53]. Permutation tests are discussed in greater detail below.

t-tests are often used for feature selection to provide the input into more sophisticated classification algorithms. Alternatively, classification algorithms themselves can be used for feature selection as well as for classification. These types of algorithms can be described as heuristic or exact [54]. Heuristic classification algorithms rely on performing multiple iterations to converge on a locally optimal solution, with no guarantee that the solution is globally optimal. Additionally, repeated application of such algorithms on a specific data set usually results in varying classifying outputs (i.e., discriminants) from each run. Examples of such algorithms include genetic algorithms, decision trees and neural networks. In contrast, exact algorithms use closed-form equations and thus are deterministic.

Genetic algorithms attempt to mimic the process of natural selection found in nature by creating chromosomes of concatenated input variables (m/z values) and iteratively recombining chromosomes and mutating genes [55]. Input variables that satisfy a fitness function are kept, while those that do not are eliminated through a process of computational evolution, achieved by recombination and mutation. This process is continued until preset criteria have been met, usually relating to overall classification accuracy and number of fit input variables. Genetic algorithms have been applied to several MS-derived data sets [56–60]. A proprietary algorithm combining genetic algorithms and self-organizing maps was used to develop models that could classify ovarian cancer samples with high accuracy

well as employing cross-validation to define the fitness function, may help reduce the variability. Alternatively, exact methods, such as boosting and discriminant analysis (discussed later), do not suffer from this weakness.

Classification and regression trees utilize recursive partitioning to achieve classification [65]. Decision trees begin with the entire sample set and create a decision rule that partitions the entire sample set into two more homogeneous groups. The decision rule examines one or more input features and uses a ‘less than’ function; that is, if the intensity of peak a is less than x, then a sample is partitioned to the left branch; if not, then it is partitioned to the right branch. Each branch is then examined for homogeneity and can be subdivided using a new rule. Each final, terminal node is then designated as a given class using a majority rule based on the training samples. New samples can then be classified by determining how they satisfy each of the rules in the decision tree and which terminal node they finish in, and are classified according to the terminal nodes’ designation. An advantage of decision trees is that they create easily interpretable and applicable decision rules for classifying samples, and therefore have been used extensively in the analysis of MS data [40,58,66–86].

A weakness of decision trees is that support diminishes as the complexity of the tree increases; there are fewer and fewer samples within each subsequent branch of the tree. Some modifications, such as bagging and boosting approaches, may help to stabilize decision tree structures. Bagging produces different trees from different subsets of the training set, and then aggregates the results on a set of test instances. This helps to reduce the instability and variance in the decision tree learning process by combining multiple models. Bagging, in its simplest form, is performed by bootstrapping. Boosting combines weak individual base classifiers (c1, c2, c3 and so on) into a more powerful classifier by performing a weighted stagewise selection of a base classifier, given all the previously selected base classifiers. At each stage, higher weights are given to samples that are incorrectly classified by the summary classifier; therefore, at each stage, the added base classifier will be selected due to its ability to correctly classify previously misclassified examples [87,88]. In the context of decision trees, boosting generates a sequence of
[61], and this algorithm has subsequently been applied to data decision trees (usually small trees with only one decision node,
sets for prostate cancer [62], Wegener’s granulotomatosis [63], also called stumps) from a data set in which the misclassifica-
pheochromocytoma [64] and cystitis [57]. Jeffries provides a tion rate is used to adjust weights to each sample [87,89]. Each
description of the concepts behind the genetic algorithm, as subsequent tree focuses on the misclassified (more heavily
well as providing some insight into the weaknesses of this algo- weighted) samples. After a defined number of iterations, voting
rithm [56]. Notably, the solutions identified by genetic algo- occurs via a committee of experts approach, where a decision is
rithms are dependent on the initial ordering of input variables made by a majority vote. Since each expert is tuned to the
on chromosomes as well as parameters set for the fitness func- error rate of the prior tree, the committee is able to classify
tions. In addition, given their heuristic nature, genetic algo- more accurately than any given member. The goal is to maxi-
rithms are susceptible to finding locally optimal solutions, and mize the margin, which is the difference in the number of votes
will often find different solutions for the same data set when between the correctly voting members and the incorrectly
run repeatedly. This makes selecting a single solution for further voting members. The larger the margin, the greater the confidence
validation difficult. Certain implementations of genetic algo- in the classification. Boosting has been applied to MS data to
rithms may safeguard against these weaknesses. For example, improve performance of decision tree analysis [73,80]. A forest
setting limits on the number of clusters and selected variables, as (decision forest or random forest) is a collection of decision

trees that, in contrast to boosting, are created in parallel (rather than in series), and no comparisons are made until all the trees are made [90]. Decision tree forests have two randomizing elements: the selection of cases used as input for each tree, and the set of input variables that can be used as splitters for each node. These randomizing elements, along with the committee decision, provide the basis for improving classification accuracy. The drawback of the forest approach is that, rather than a single easily interpreted decision rule, the output model is complex and more of a black box. The random forest approach has been used to analyze data for prostate cancer [75], ovarian cancer [80] and bacterial proteomics [21].

Artificial neural networks (ANNs) attempt to mimic human learning computationally [91–93]. Therefore, the jargon surrounding ANNs is taken from our understanding of learning in the nervous system. Neurons integrate information obtained from inputs, which could be the outside world (i.e., primary data) or previously integrated data (i.e., other neurons). Therefore, there are multiple layers of neuronal integration. Like real neurons, these computational neurons require that the input function exceeds a certain activation threshold. Most neural networks are feed-forward, that is, information flow (and decision making) proceeds only in one direction, starting with the input layer, flowing through n layers of neurons to the output layer. The parameters that can be varied and determine the learning process include the weights given to the input functions, activation thresholds for each neuron and computation function performed by each neuron. Training the neural network involves decreasing the error rate by adjusting these parameters. The two major criticisms of ANNs are that they are prone to overtraining and that they result in black box, difficult-to-assay solutions. Despite these drawbacks, their ability to generate solutions with low error rates has made them attractive algorithms for the analysis of proteomics data [32,39,58,94–107].

Partial least squares (PLS) is one form of exact supervised learning [108]. Conceptually, it is quite similar to PCA, except that knowledge of the class assignments allows it to perform classification in addition to feature selection. The distinction between PCA and PLS is that in deriving the weights for each input variable in each component, PCA uses only the covariance structure of the input variables, while PLS utilizes the covariance structure of the input variables with the response variables (i.e., class assignment).

Discriminant analysis (DA) attempts to maximize the ratio of the difference in class mean to the within-class variance [109–111]. After separation into n-dimensional space using PCA or PLS, a linear discriminant is derived using a hyperplane that maximizes between-class variance, while minimizing within-class variance. A test spectrum is dimensionally reduced by projection onto the principal components and then onto the calculated hyperplane. Its distances to the respective clusters (healthy vs. disease) determine its classification, and the confidence in the classification is based on a Gaussian distribution emanating from the center of the cluster. Linear discriminant analysis examines the two-class problem, while multiple discriminant analysis looks at multiple classes. Logistic regression is now sometimes used in place of DA as it usually involves fewer violations of assumptions (namely, normal distribution and equal within-group variances), is robust, handles categorical as well as continuous variables and has coefficients that many workers in the field find easier to interpret [112,113]. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. k-nearest neighbors (k-NN) is a form of nonparametric discriminant analysis that examines the k nearest samples and takes a majority voting approach among these samples. Majority voting can be thought of conceptually as being equivalent to calculating the prior probability based on the proportion of a given class of the k examples [80]. k-NN is nonparametric and, therefore, makes no assumptions regarding the distribution of samples. Li and coworkers used a genetic algorithm for feature selection, and then k-NN for classification [114]. Lundquist and coworkers demonstrated how PCA, PLS and DA can be used together [31]. After performing PCA to determine the optimal linear combinations, PLS-DA was used to determine the most important contributors to the most important principal components, thus identifying the specific peaks that underlie the differences in bacterial subspecies. PLS-DA has also been used to discriminate species of wheat [103] and to assess wheat quality [115].

A form of machine learning that has been applied to proteomics data is the use of support vector machines (SVMs), which were invented by Vapnik [116]. SVMs operate first by distributing samples in n-dimensional space and then by finding a hypersurface that attempts to split the cases from the controls. The split will be chosen to have the largest distance (margin) from the hypersurface to the nearest of the case and control examples. A detailed explanation of SVMs can be found in [117]. Several studies have used SVMs for analysis of proteomics data [69,80,118–123]. In one study, support vector machines could be adequately substituted for linear discriminant analysis and combined with PLS regression [54].

Examining a single data set using multiple algorithms can be instructive. Perhaps the most scrutinized data set is the ovarian cancer data set generated by Petricoin and coworkers [61]. Levner comprehensively describes various approaches used to analyze this data set as well as prostate cancer data sets generated by Adam and coworkers [66,124]. However, potential flaws in the design of these studies preclude any real conclusions regarding the relative performance of various statistical approaches to those data [3]. Levner points out that the availability of high-quality MS data sets will be required before direct comparisons of various algorithms can be made. However, several smaller studies have benefited from the application of multiple algorithms. Sorensen has applied both neural networks and PLS to data obtained from analysis of wheat proteins, and has shown that, while both approaches could provide comparable classification capabilities, PLS was able to provide additional data regarding wheat quality [103].
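As an illustration of the k-nearest neighbors rule described in this section, the following minimal Python sketch classifies a sample by majority vote among its k closest training samples and reports the fraction of those neighbors that share the winning class. The function name knn_classify, the Euclidean distance metric and the two-peak intensity values are assumptions made for the example; they are not taken from any of the cited studies.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Return (predicted_label, vote_fraction) for one query sample.

    train_points: list of equal-length numeric tuples (e.g., peak intensities).
    train_labels: class label for each training point.
    """
    # Euclidean distance from the query to every training sample.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest samples.
    votes = Counter(label for _, label in by_distance[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


# Hypothetical two-peak intensities for six training spectra.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

print(knn_classify(points, labels, (1.1, 1.0), k=3))
print(knn_classify(points, labels, (4.9, 5.0), k=5))
```

The vote fraction corresponds to the proportion of a given class among the k examples mentioned above. In practice, k and the distance metric would be chosen by cross-validation rather than fixed in advance.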
Implementation of techniques

The authors prefer to utilize several means of feature reduction as well as classification. The actual implementation of these approaches is as follows. The data are divided into a training set and a validation set. Preprocessing (e.g., baseline subtraction and peak detection) is performed on the full complement of data. Once this is performed, the data from the validation set are ignored until the time for validation. If the size of the training data set is adequate (e.g., n > 60 in each class), the training data set can be further divided into a training and a test set. Feature selection and model training are performed on the training set and their performance is assessed in the test set. This can be performed repeatedly to assess multiple sets of features and multiple algorithms. If the training data set is too small (almost always the case), typically, cross-validation is performed. Note that cross-validation can still be performed in the training set even when a separate test set has been set aside. In effect, cross-validation is a method by which the sample size may be artificially increased. v-fold cross-validation refers to subdividing the training data set into v equal-sized subsets. Typically, v is equal to five or ten. When v is equal to n, the number of samples, it is termed ‘leave-one-out’ cross-validation. During cross-validation, v-1 subsets are used to train the model, and the model is then tested on the remaining subset. This is done iteratively until each subset has been used for testing once. One additional note of caution involves selection bias [75,125]. When cross-validation is performed on algorithms generated on a preselected set of variables, rather than selecting the variables with each iteration, it may overestimate the suitability for generalization of an algorithm. An alternative approach to minimizing selection bias is to perform cross-validation during the feature selection process.

Regardless of the specific feature selection and classification approaches used to identify the most important classifying peaks, in the diagnostic setting, the simplest, most transparent algorithm is desired. Therefore, a short decision tree may be a good diagnostic tool due to its interpretability and robustness to outliers. Another approach to a final diagnostic algorithm is to combine the optimal classifiers into a logistic regression model. Finally, discriminant analysis may generate attractive diagnostic models. These models can be linear, quadratic or nonparametric (k-NN).

For any given model, there are two measures that need to be assessed. The most obvious is the overall error rate in the cross-validated sample set. The lower the error rate, the more accurate the model. However, examining the error rate alone is inadequate, since most analysis algorithms can be driven to find a local optimum solution, that is, a solution particularly fit (trained) to the specific input data, but unsuitable for generalization. Therefore, it is important to determine how stable the model is, and whether it truly differs from a random solution. One method by which this can be assessed is a permutation test. Class assignments are randomized in an iterative fashion, and the model is applied to these random class assignments. The average error rate (and standard deviation) can be calculated; the error rate when the model is applied to the true assignments should differ to a statistically significant degree from the error rate when it is applied to the random, permuted class assignments. Lilien and coworkers note that, in their experiments, they could achieve 100% classification accuracy whenever they used more than 50% of the spectra for training, when they analyzed the National Cancer Institute clinical proteomics data [54]. This emphasizes the need to identify multiple train/test splits and to report the performance and variance in performance of the models.

Once a model has been chosen due to the low error rate and robustness, it can be applied to the independent validation sample set. Typically, the independent validation sample set is a subset of the original sample set (as described above). More rigorous independent validation includes samples taken from other medical institutions, as well as those processed at a different time and in a different laboratory from the original study. These steps in independent validation are required to determine whether the findings can be generalized across parameters such as patient demographics, clinical subtypes and laboratory handling techniques [126].

From data mining to clinical statistics

The most common goal of a data mining program is to minimize the error rate. However, in clinical testing, there are different types of errors, and not all errors are created equal. The classic two-by-two table reflects the two types of errors associated with clinical testing (FIGURE 2). Put simply, a test can misdiagnose a healthy person as sick, or a sick person as healthy. One can substitute responder and nonresponder, or prone to adverse reaction and not prone to adverse reaction and so on for the labels healthy and sick. Misdiagnosing a healthy person as having a disease (termed a false positive) results in a loss of specificity. From a patient management viewpoint, it leads to unnecessary testing and therapeutic intervention, leading to discomfort (or worse) for the individual as well as increased healthcare cost. The opposite error, missing a person with disease (termed a false negative), results in loss of sensitivity. Although the ramifications of missing this diagnosis are apparently obvious, the natural history of the disease is also important. Frequent testing can compensate for decreased sensitivity, particularly if the disease is slowly progressive.

Sensitivity and specificity almost always trend in opposite directions. An ideal test has a bimodal distribution with no overlap. In practice, most biomarkers have some degree of overlap; this overlap is called the gray zone. Identifying the right cut-off value for distinguishing healthy from disease within that gray zone means finding the appropriate compromise between sensitivity and specificity. In FIGURE 3A, if a cut-off at point X were chosen and an individual with levels of biomarker above that cut-off were diagnosed as having a disease, then the biomarker would never miss a single case of the disease. At this cut-off, the test would have 100% sensitivity, but many individuals without disease would be inaccurately diagnosed (poor specificity). Alternatively, if a cut-off at point Y were chosen, no healthy person would be misdiagnosed as having disease, but many sick

(A)
                        Actual positive        Actual negative
Positive test result    True positive          False positive
Negative test result    False negative         True negative
Sensitivity = TP/(TP + FN); Specificity = TN/(FP + TN)
Positive predictive value = TP/(TP + FP); Negative predictive value = TN/(TN + FN)

(B)
                        Actual positive        Actual negative
Positive test result    90                     20
Negative test result    10                     80
Sensitivity = 90%; Specificity = 80%
Positive predictive value = 81.8%; Negative predictive value = 88.9%

(C)
                        Actual positive        Actual negative
Positive test result    90                     180
Negative test result    10                     720
Sensitivity = 90%; Specificity = 80%
Positive predictive value = 33.3%; Negative predictive value = 98.6%

(D)
                        Actual positive        Actual negative
Positive test result    90                     90
Negative test result    10                     810
Sensitivity = 90%; Specificity = 90%
Positive predictive value = 50%; Negative predictive value = 98.8%
Expert Review of Proteomics
Figure 2. Two-by-two diagnostic tables. (A) A candidate test (marker) is compared against a gold standard (actual). Instances in which the test matches the
actual are called true positives or negatives, and discordances are termed false positives or negatives. Based on the error rate and types of errors, clinical
parameters for sensitivity, specificity and predictive value can be derived. (B) Most clinical proteomics studies examine an equal number of cases and controls.
This artificially increases the calculated positive predictive value. (C) Most diseases have a prevalence far lower than the distribution used in clinical proteomics
studies, and therefore the actual positive predictive value would be lower. (D) Increased specificity will result in improved positive predictive value. Conversely,
increased sensitivity will result in improved negative predictive value.
FN: False negative; FP: False positive; TN: True negative; TP: True positive.
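The quantities in Figure 2 follow directly from the four cell counts. The short Python sketch below recomputes them for panels B, C and D using the formulas given in panel A; the function name diagnostic_metrics and the dictionary layout are illustrative choices, not part of the original figure.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and predictive values from a two-by-two table."""
    return {
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
    }


# Cell counts copied from panels B, C and D of Figure 2.
panels = {
    "B": dict(tp=90, fp=20, fn=10, tn=80),
    "C": dict(tp=90, fp=180, fn=10, tn=720),
    "D": dict(tp=90, fp=90, fn=10, tn=810),
}

for name, counts in panels.items():
    metrics = diagnostic_metrics(**counts)
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```

Panels B and C share the same sensitivity (90%) and specificity (80%); only the prevalence changes from 50 to 10%, which is what lowers the positive predictive value from 81.8 to 33.3%.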
people would be missed. The trade-off between sensitivity and specificity can be graphically visualized by a receiver operator characteristic (ROC) curve, which plots the sensitivity on the y axis and (1-specificity) on the x axis (FIGURE 3B) [127]. The ideal test would have 100% sensitivity and specificity, and the ROC plot would show a single point at the upper left corner. In practice, no test has 100% accuracy, and the ROC plot shows the relationship between sensitivity and specificity by choosing different cut-offs in the gray zone. The overall accuracy of the test can be assessed by calculating the area under the ROC curve, and the aim is to maximize this. More recently, there has been an interest in examining only the portion of the ROC plot of clinical relevance (i.e., at high sensitivity or specificity) and calculating the area under the curve in these regions [128–131]. These partial ROC plots, which have been used extensively in radiology [132,133], are important when assessing particularly low prevalence diseases (see later).

Sensitivity and specificity do not reside in a vacuum. The frequency of disease impacts the relative importance of sensitivity versus specificity. The frequency of disease is generally described as prevalence and measured as a ratio of individuals with disease at a given time point to the number of individuals at risk for that disease at that time point. Incidence refers to the ratio of the number of new cases in a given time period to the number of individuals at risk for that disease during that time period. The theoretical two-by-two tables in FIGURE 2, with prevalence of 50 and 10% and fixed sensitivity and specificity (i.e., fixed false-negative and false-positive rates), demonstrate this principle. Most clinical proteomics studies have a prevalence of 50%; that is, there are generally an equal number of individuals with the disease as there are healthy (or other relevant disease) controls (FIGURE 2B). However, the disease may have an actual prevalence of 10% (FIGURE 2C). Note the dramatic decrease in positive predictive value; that is, the proportion of tested positives who are actually positive. Those who test positive but do not have the disease will go on to have unnecessary additional procedures and therapies. Of course, the negative predictive value increases dramatically because, in this example, there are many more healthy individuals. Diagnostic clinical tests generally place a greater demand on positive predictive value and, therefore, maximize specificity. Tests designed to rule out a disease or a therapy generally place a greater demand on negative predictive value and, therefore, maximize sensitivity. FIGURE 2D shows the improvement in positive predictive value by improving the specificity. While sensitivity, specificity and ROC analysis are not prevalence dependent, they are less useful measures of the quality of a biomarker than predictive value.

Biomarker identification

Once biomarkers have been confidently discovered and validated, the next task is to purify and identify them. The purification of biomarkers is a biochemical exercise and beyond the scope of this review. However, actual identification requires the searching of databases and therefore deserves comment in
this review. Currently, protein identification and characterization are almost universally achieved using single MS or MS/MS in combination with computational algorithms.

Peptide mass fingerprinting

With the continued growth of protein sequence databases, as well as with the emergence of complementary (c)DNA databases, it has become possible to derive peptide sequence and protein identity by correlating MS measurements with theoretical peptide fragments of known sequences. Henzel and coworkers introduced a computer algorithm, later coined Fragfit, for the automated identity determination of proteins separated by 2D gels [134,135]. Peptides were generated by reduction, alkylation and tryptic digestion and then analyzed via MALDI-TOF. Fragfit functioned by searching an existing protein sequence database for multiple peptides of individual proteins that match the measured masses. To ensure that the most recent database updates were included, a theoretical digest of the entire database was generated each time the program was executed. In a parallel effort, Mann and coworkers created software routines to correlate MS results to protein identities found within protein databases [136]. Initial search routines focused upon protein molecular weight. While sufficient at times, this approach failed as the databases increased in complexity and mass determination error was excessive. Ultimately, search routines based upon partial mass spectrometric peptide maps of target proteins were created. In general, approximately four to six proteolytic peptides, measured with a mass accuracy between 0.1 and 0.01%, allowed for useful search activities in the Protein Identification Resource database.

Today, the process of identifying proteins based upon single MS measurements of specific proteolytic fragments searched against protein or cDNA databases is generically referred to as peptide mass fingerprinting (PMF). High-throughput PMF analysis is frequently performed in hyphenated 2D gel MALDI-MS analysis. Alternatively, several in-gel digestions are queued for automated analysis using LC/ESI-MS. High-throughput PMF studies generate a tremendous amount of data, creating corresponding challenges in ensuring quality protein identifications. Accordingly, computer algorithms directed towards improving protein identification from PMF studies have been created. A parameterized multilevel scoring algorithm was combined with an optimized peak detection scheme to improve identifications derived from 2D gel MALDI-TOF analysis [137]. Proteolytic specificity, species origin, protein isoelectric point, determined molecular weight, chemical modifications and potential number of missed cleavages are differentially weighted and considered to create a hierarchical report of potential protein identities. In another effort, a new PMF algorithm was developed that provided improved m/z calibration and peak rejection as well as the use of a meta-search approach that employed various PMF search engines [138]. The program successfully improved routine PMF identification rate from 6 to 44% when examining 1891 PMF spectra. More recently, a modular, scriptable automated analysis tool suite for high-throughput PMF studies has been introduced [139]. The tool suite consists of automatic peak extraction, peak filtering and protein database matching modules that communicate via extensible markup language (XML). This modular approach affords flexibility and each module can be easily replaced with other software if desired.

[Figure 3 image: (A) percentage of population versus biomarker concentration, showing overlapping Healthy and Disease distributions with cut-off points X and Y marked; (B) ROC plot of sensitivity (y axis, 0 to 1) versus 1-specificity (x axis, 0 to 1).]
Figure 3. Distribution of values in the population and receiver operator characteristic (ROC) plot. (A) Almost all analytes have an overlapping bimodal distribution in concentration between healthy and diseased individuals. This plot shows an equal frequency of cases and controls, which is typical of most clinical proteomics studies, but does not accurately represent the general population. The cut-off level for designating the presence of disease at a level between X and Y affects the number of people misidentified as being healthy or sick. (B) The ROC curve plots the relationship between sensitivity (true-positive rate) and specificity (true-negative rate). A dashed line indicates a random test. Curves trend towards the upper left corner as they become increasingly accurate.

Sequencing & protein identification via tandem mass spectrometry

While PMF can often provide initial protein identification, in cases in which insufficient protein purification is achieved, or in studies with limited peptide coverage, substantial irregular peptide cleavage and/or post-translational modifications (PTMs), PMF algorithms generally fail to provide a confident and complete list of all proteins found in the original sample. Consequently, MS/MS analysis is relied upon as the gold standard of establishing peptide primary sequence, post-translational modifications and protein identification. One of the earliest MS/MS studies of peptides involved high-energy collision-induced dissociation (CID) of peptide ions generated by fast atom bombardment in a double-focusing MS/MS analyzer [140]. A year later, the same

group created a computer program for derivation of primary sequence from CID peptide fragmentation known as SEQPEP [141]. The only required input was a list of product ion masses and relative abundances, the mass of the precursor ion and the mass of any C-terminal modification. The program was capable of processing approximately 100 product ions in no more than 5 min.

In 1990, pioneering work for today’s modern peptide MSn analysis was performed by Cooks and Stafford using quadrupole ion trap (qIT) MS [142]. A number of small peptides were ionized using Cs+ surface ionization, injected into the trap, mass selected and then activated by low-energy CID, resulting in dissociation. Product ions were mass-selectively ejected and then analyzed to determine primary sequence. MS/MS data on subfemtomole levels of gramicidin S were demonstrated. In the same year, Van Berkel and others combined ESI with qIT single and multiple MS analysis to demonstrate low-energy CID fragmentation and peptide sequencing [143]. A year later, online capillary reverse-phase (RP) LC was combined with ESI-qIT-MS analysis [144]. Online HPLC/MS molecular weight determinations for cytochrome c, human serum albumin and myoglobin were shown, as well as LC/MS/MS and LC/MS/MS/MS analysis of selected tryptic peptides in a protein tryptic digest. In addition to ESI, MALDI-generated ions were also analyzed using qIT-MS [145]. Today, the majority of MS/MS experiments are performed using online capillary RP-HPLC with ESI ion trap devices. Sequences are automatically processed and protein identification conferred using various algorithms such as SEQUEST. Other MS/MS schemes currently used for peptide sequence determination include the ESI tandem quadrupole TOF-MS analyzers [146], ESI-FTICR-MS [147], MALDI post-source decay analysis [148], MALDI quadrupole TOF analysis [149], MALDI-TOF/TOF-MS/MS analysis [150–152] and MALDI-qIT-TOF-MS analysis [153,154].

Expert commentary & five-year view

Many approaches exist to mine high-dimensional data, and novel ones are continually being developed. No specific algorithm can be termed an ideal approach; consequently, the authors believe that the best approach to mining data is to utilize a number of feature selection algorithms with appropriate cross-validation, and to utilize the optimal features to generate a final diagnostic algorithm. The authors recommend utilizing various approaches, both in parallel and in sequence, to generate these final algorithms. In the future, as computational power increases, the parallel approaches to data mining will become easier and higher throughput. Moreover, as appropriate data sets become publicly available, more rigorous comparisons of data preprocessing and mining techniques will be possible. Proteomics researchers are urged to work in interdisciplinary groups that include physicians, laboratorians and statisticians. Ultimately, the only true validation of a conclusion is to apply the fixed algorithm to a newly collected set of samples. Therefore, multiple rounds of validation must be undertaken. While the end point of a highly accurate set of biomarkers is often the focus, there can be no substitute for appropriate study design, taking into account the relevant preanalytic and analytic variables that can confound data mining.

A major limitation to the successful application of clinical proteomics programs thus far has been throughput, both clinical and analytic. On the clinical side, there is a tremendous need for well-documented clinical samples with great care in collection, aliquoting and storage. Most studies performed to date have examined a limited number of samples obtained from a single institution, with the result that promising biomarkers fail to pass validation. The failure to be validated may be due to specific sample acquisition parameters, approaches to sample analysis or demographic and clinical parameters of the small population being examined. On the analytic side, automation will become increasingly important so that more powerful fractionation technologies can be used to detect low-abundance proteins as well as to analyze the larger number of samples required for more adequate biomarker discovery and validation. As proteomics and sample preparation technologies become more high-throughput and reproducible, higher powered studies will become feasible, accelerating the discovery and validation of biomarkers that can be translated into clinical practice.

Acknowledgement

The authors thank Leslie Roth for assistance in making the figures.
Key issues
• Data preprocessing steps such as background correction and spectrum alignment are critical before data mining.
• No single mathematical approach is ideal or applicable to all study designs.
• High-dimensional data need to be reduced to fewer variables via a process of feature selection.
• While the temptation is to drive classification algorithms to the lowest error rate achievable, this approach is likely to result in
overfitting. Stable, globally optimal solutions are preferred.
• The statistics and proteomics communities need more sample data to be made publicly available so that statistical approaches can be compared directly and, therefore, better tools can be developed.
• Identification of biomarker candidates needs to address protein identity as well as all salient post-translational modifications.
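
The feature selection and overfitting points above can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from this review: it assumes pure-noise data (no real biomarker), a simple class-mean-difference ranking for feature selection, a nearest-centroid classifier and leave-one-out cross-validation, and all function names are hypothetical. It compares selecting features once on the full data set against selecting them inside each cross-validation fold, the selection bias discussed by Ambroise and McLachlan [125].

```python
import random


def mean(values):
    return sum(values) / len(values)


def select_features(X, y, k):
    """Rank features by absolute difference of class means; return top-k indices."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        diff = abs(mean([r[j] for r in pos]) - mean([r[j] for r in neg]))
        scores.append((diff, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid (on the chosen features) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean([r[j] for r in rows]) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy, selecting features inside or outside the folds."""
    # Selecting on the full data lets the held-out sample influence the features.
    fixed = None if select_inside_fold else select_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k) if select_inside_fold else fixed
        if nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)


def simulate(n_samples=40, n_features=500, k=10, seed=0):
    """Noise-only 'spectra' with alternating case/control labels (assumed setup)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    biased = loo_accuracy(X, y, k, select_inside_fold=False)
    honest = loo_accuracy(X, y, k, select_inside_fold=True)
    return biased, honest


if __name__ == "__main__":
    biased, honest = simulate()
    print(f"selection outside CV: {biased:.2f}, selection inside CV: {honest:.2f}")
```

Because the simulated data contain no signal, any accuracy well above chance in the first estimate reflects overfitting introduced by the selection step rather than a usable classifier; the second estimate, with selection repeated inside every fold, is the one that should be reported.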
References
Papers of special note have been highlighted as:
• of interest
•• of considerable interest

1 Hu J, Coombes KR, Morris JS, Baggerly KA. The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief Funct. Genomic Proteomic 3, 322–331 (2005).
•• Describes how poor study design and execution can lead to false discovery.
2 Coombes KR, Morris JS, Hu J, Edmonson SR, Baggerly KA. Serum proteomics profiling – a young technology begins to mature. Nature Biotechnol. 23, 291–292 (2005).
3 Baggerly KA, Morris JS, Edmonson SR, Coombes KR. Signal in noise: evaluating reported reproducibility of serum proteomic tests for ovarian cancer. J. Natl Cancer Inst. 97, 307–309 (2005).
4 Ransohoff DF. Lessons from controversy: ovarian cancer screening and serum proteomics. J. Natl Cancer Inst. 97, 315–319 (2005).
5 Fung ET, Enderwick C. ProteinChip clinical proteomics: computational challenges and solutions. Biotechniques Suppl. 34–38, 40–41 (2002).
6 Goerg A, Weiss W, Dunn MJ. Current two-dimensional electrophoresis technology for proteomics. Proteomics 4, 3665–3685 (2004).
7 Appel R, Hochstrasser D, Roch C, Funk M, Muller AF, Pellegrini C. Automatic classification of two-dimensional gel electrophoresis pictures by heuristic clustering analysis: a step toward machine learning. Electrophoresis 9, 136–142 (1988).
8 Pun T, Hochstrasser DF, Appel RD et al. Computerized classification of two-dimensional gel electrophoretograms by correspondence analysis and ascendant hierarchical clustering. Appl. Theor. Electrophor. 1, 3–9 (1988).
9 Appel RD, Hochstrasser DF, Funk M et al. The MELANIE project: from a biopsy to automatic protein map interpretation by computer. Electrophoresis 12, 722–735 (1991).
• Describes early attempts at phenomenologically driven biomarker discovery.
10 Tissot JD, Schneider P, James RW, Daigneault R, Hochstrasser DF. High-resolution two-dimensional protein electrophoresis of pathological plasma/serum. Appl. Theor. Electrophor. 2, 7–12 (1991).
11 Unlu M, Morgan ME, Minden JS. Difference gel electrophoresis. A single gel method for detecting changes in protein extracts. Electrophoresis 18, 2071–2077 (1997).
12 Friedman DB, Hill S, Keller JW et al. Proteome analysis of human colon cancer by two-dimensional difference gel electrophoresis and mass spectrometry. Proteomics 4, 793–811 (2004).
13 Alfonso P, Nunez A, Madoz-Gurpide J, Lombardia L, Sanchez L, Casal JI. Proteomic expression analysis of colorectal cancer by two-dimensional differential gel electrophoresis. Proteomics 5, 2602–2611 (2005).
14 Corthals GL, Gygi SP, Aebersold R, Patterson SD. Identification of proteins by mass spectrometry. In: Proteome Research: Two-Dimensional Gel Electrophoresis and Detection Methods (Principles and Practice). Rabilloud T (Ed.), Springer, Berlin, Germany 197–231 (1999).
• Presents theoretical and technical details along with protocols for protein identification from 2D gels.
15 Loo RRO, Mitchell C, Stevenson T, Loo JA, Andrews PC. Interfacing polyacrylamide gel electrophoresis with mass spectrometry. Techniques in Protein Chemistry VII. Symposium of the Protein Society. MA, USA, July 8–12, 1995, 305–313 (1996).
16 Loo RRO, Cavalcoli JD, VanBogelen RA et al. Virtual 2-D gel electrophoresis: visualization and analysis of the E. coli proteome by mass spectrometry. Anal. Chem. 73, 4063–4070 (2001).
17 Walker AK, Rymar G, Andrews PC. Mass spectrometric imaging of immobilized pH gradient gels and creation of ‘virtual’ two-dimensional gels. Electrophoresis 22, 933–945 (2001).
18 Tang N, Tornatore P, Weinberger SR. Current developments in SELDI affinity technology. Mass Spectrom. Rev. 23, 34–44 (2003).
•• Recent review of surface-enhanced laser desorption/ionization technology and applications.
19 Wall DB, Kachman MT, Gong SS, Parus SJ, Long MW, Lubman DM. Isoelectric focusing nonporous silica reversed-phase high-performance liquid chromatography/electrospray ionization time-of-flight mass spectrometry: a three-dimensional liquid-phase protein separation method as applied to the human erythroleukemia cell line. Rapid Commun. Mass Spectrom. 15, 1649–1661 (2001).
20 Kelleher NL, Lin HY, Valaskovic GA, Aaserud DJ, Fridriksson EK, McLafferty FW. Top down versus bottom up protein characterization by tandem high-resolution mass spectrometry. J. Am. Chem. Soc. 121, 806–812 (1999).
21 Satten GA, Datta S, Moura H et al. Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens. Bioinformatics 20, 3128–3136 (2004).
22 Qu Y, Adam BL, Thornquist M et al. Data reduction using a discrete wavelet transform in discriminant analysis of very high dimensionality data. Biometrics 59, 143–151 (2003).
23 Malyarenko DI, Cooke WE, Adam BL et al. Enhancement of sensitivity and resolution of surface-enhanced laser desorption/ionization time-of-flight mass spectrometric records for serum peptides using time-series analysis techniques. Clin. Chem. 51, 65–74 (2005).
24 Kempka M, Sjodahl J, Bjork A, Roeraade J. Improved method for peak picking in matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 18, 1208–1212 (2004).
25 Carlson SM, Najmi A, Whitin JC, Cohen HJ. Improving feature detection and analysis of surface-enhanced laser desorption/ionization-time of flight mass spectra. Proteomics 5, 2778–2788 (2005).
26 Coombes KR, Fritsche HA Jr, Clarke C et al. Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem. 49, 1615–1623 (2003).
27 Jeffries N. Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics 21, 3066–3073 (2005).
28 Duda RO, Hart PE, Stork DG. Pattern Classification. Second Ed. Wiley, NY, USA, 654 (2000).
• Outstanding general statistics textbook.
29 Hong H, Dragan Y, Epstein J et al. Quality control and quality assessment of data from surface-enhanced laser desorption/ionization (SELDI) time-of-flight (TOF) mass spectrometry (MS). BMC Bioinformatics 6(Suppl. 2), S5 (2005).
30 Joliffe IT, Morgan BJ. Principal component analysis and exploratory factor analysis. Stat. Methods Med. Res. 1, 69–95 (1992).

31 Lundquist M, Caspersen MB, Wikstrom P, Forsman M. Discrimination of Francisella tularensis subspecies using surface enhanced laser desorption ionization mass spectrometry and multivariate data analysis. FEMS Microbiol. Lett. 243, 303–310 (2005).
• Describes applications involving both unsupervised and supervised approaches to classify bacterial subspecies.
32 Lancashire L, Schmid O, Shah H, Ball G. Classification of bacterial species from proteomic data using combinatorial approaches incorporating artificial neural networks, cluster analysis and principal components analysis. Bioinformatics 21, 2191–2199 (2005).
• Describes applications involving both unsupervised and supervised approaches to classify bacterial subspecies.
33 Slotta DJ, Heath LS, Ramakrishnan N, Helm R, Potts M. Clustering mass spectrometry data using order statistics. Proteomics 3, 1687–1691 (2003).
34 Seike M, Kondo T, Fujii K et al. Proteomic signatures for histological types of lung cancer. Proteomics 5, 2939–2948 (2005).
35 Hellman K, Alaiya AA, Schedvins K, Steinberg W, Hellstrom AC, Auer G. Protein expression patterns in primary carcinoma of the vagina. Br. J. Cancer 91, 319–326 (2004).
36 Alaiya AA, Franzen B, Hagman A et al. Molecular classification of borderline ovarian tumors using hierarchical cluster analysis of protein expression profiles. Int. J. Cancer 98, 895–899 (2002).
37 Johnson SC. Hierarchical clustering schemes. Psychometrika 2, 241–254 (1967).
38 Kuerer HM, Coombes KR, Chen JN et al. Association between ductal fluid proteomic expression profiles and the presence of lymph node metastases in women with breast cancer. Surgery 136, 1061–1069 (2004).
39 Poon TC, Yip TT, Chan AT et al. Comprehensive proteomic profiling identifies serum proteomic signatures for detection of hepatocellular carcinoma and its subtypes. Clin. Chem. 49, 752–760 (2003).
40 Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18, 546–554 (2002).
41 Belknap JK. Empirical estimates of Bonferroni corrections for use in chromosome mapping studies with the BXD recombinant inbred strains. Behav. Genet. 22, 677–684 (1992).
42 Hu J, Zou F, Wright FA. Practical FDR-based sample size calculations in microarray experiments. Bioinformatics 21, 3264–3272 (2005).
43 Jung SH. Sample size for FDR-control in microarray data analysis. Bioinformatics 21, 3097–3104 (2005).
44 Li SS, Bigler J, Lampe JW, Potter JD, Feng Z. FDR-controlling testing procedures and sample size determination for microarrays. Stat. Med. 24, 2267–2280 (2005).
45 Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21, 3017–3024 (2005).
46 Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics 20, 1737–1745 (2004).
47 Scheid S, Spang R. Twilight; a bioconductor package for estimating the local false discovery rate. Bioinformatics 21, 2921–2922 (2005).
48 Pan W. On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics 19, 1333–1340 (2003).
49 Xu R, Li X. A comparison of parametric versus permutation methods with applications to general and temporal microarray gene expression data. Bioinformatics 19, 1284–1289 (2003).
50 Ludbrook J. Advantages of permutation (randomization) tests in clinical and experimental pharmacology and physiology. Clin. Exp. Pharmacol. Physiol. 21, 673–686 (1994).
51 Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).
52 Larsson O, Wahlestedt C, Timmons JA. Considerations when using the significance analysis of microarrays (SAM) algorithm. BMC Bioinformatics 6, 129 (2005).
53 Sharov AA, Dudekula DB, Ko MS. A web-based tool for principal component and significance analysis of microarray data. Bioinformatics 21, 2548–2549 (2005).
54 Lilien RH, Farid H, Donald BR. Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. J. Comput. Biol. 10, 925–946 (2003).
55 Willett P. Genetic algorithms in molecular recognition and design. Trends Biotechnol. 13, 516–521 (1995).
56 Jeffries NO. Performance of a genetic algorithm for mass spectrometry proteomics. BMC Bioinformatics 5, 180 (2004).
•• Very good discussion regarding strengths and weaknesses of genetic algorithms, with concrete examples.
57 Van QN, Klose JR, Lucas DA et al. The use of urine proteomic and metabonomic patterns for the diagnosis of interstitial cystitis and bacterial cystitis. Dis. Markers 19, 169–183 (2003).
58 Papadopoulos MC, Abel PM, Agranoff D et al. A novel and accurate diagnostic test for human African trypanosomiasis. Lancet 363, 1358–1363 (2004).
59 Wang TH, Chang YL, Peng HH et al. Rapid detection of fetal aneuploidy using proteomics approaches on amniotic fluid supernatant. Prenat. Diagn. 25, 559–566 (2005).
60 Baggerly KA, Morris JS, Wang J, Gold D, Xiao LC, Coombes KR. A comprehensive approach to the analysis of matrix-assisted laser desorption/ionization-time of flight proteomics spectra from serum samples. Proteomics 3, 1667–1672 (2003).
61 Petricoin EF, Ardekani AM, Hitt BA et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577 (2002).
62 Petricoin EF III, Ornstein DK, Paweletz CP et al. Serum proteomic patterns for detection of prostate cancer. J. Natl Cancer Inst. 94, 1576–1578 (2002).
63 Stone JH, Rajapakse VN, Hoffman GS et al. A serum proteomic approach to gauging the state of remission in Wegener’s granulomatosis. Arthritis Rheum. 52, 902–910 (2005).
64 Brouwers FM, Petricoin EF III, Ksinantova L et al. Low molecular weight proteomic information distinguishes metastatic from benign pheochromocytoma. Endocr. Relat. Cancer 12, 263–272 (2005).
65 Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. In: The Wadsworth Statistics/Probability Series. Bickel P, Cleveland W, Dudley R (Eds) Wadsworth International Group, TN, USA (1984).
66 Adam BL, Qu Y, Davis JW et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res. 62, 3609–3614 (2002).
67 Banez LL, Prasanna P, Sun L et al. Diagnostic potential of serum proteomic patterns in prostate cancer. J. Urol. 170, 442–446 (2003).
68 Bhattacharyya S, Siegel ER, Petersen GM, Chari ST, Suva LJ, Haun RS. Diagnosis of pancreatic cancer using serum proteomic profiling. Neoplasia 6, 674–686 (2004).
69 Clarke W, Silverman BC, Zhang Z, Chan DW, Klein AS, Molmenti EP. Characterization of renal allograft rejection by urinary proteomic analysis. Ann. Surg. 237, 660–664; discussion 4–5 (2003).
70 Gerton GL, Fan XJ, Chittams J et al. A serum proteomics approach to the diagnosis of ectopic pregnancy. Ann. NY Acad. Sci. 1022, 306–316 (2004).
71 Geurts P, Fillet M, de Seny D et al. Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 21, 3138–3145 (2005).
72 Markey MK, Tourassi GD, Floyd CE Jr. Decision tree classification of proteins identified by mass spectrometry of blood serum samples from people with and without lung cancer. Proteomics 3, 1678–1679 (2003).
73 Qu Y, Adam BL, Yasui Y et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem. 48, 1835–1843 (2002).
74 Semmes OJ, Cazares LH, Ward MD et al. Discrete serum protein signatures discriminate between human retrovirus-associated hematologic and neurologic disease. Leukemia 19, 1229–1238 (2005).
75 Tong W, Xie Q, Hong H et al. Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence. Environ. Health Perspect. 112, 1622–1627 (2004).
76 Vlahou A, Laronga C, Wilson L et al. A novel approach toward development of a rapid blood test for breast cancer. Clin. Breast Cancer 4, 203–309 (2003).
77 Wadsworth JT, Somers KD, Cazares LH et al. Serum protein profiles to identify head and neck cancer. Clin. Cancer Res. 10, 1625–1632 (2004).
78 Wadsworth JT, Somers KD, Stack BC Jr et al. Identification of patients with head and neck cancer using serum protein profiles. Arch. Otolaryngol. Head Neck Surg. 130, 98–104 (2004).
79 Wilson LL, Tran L, Morton DL, Hoon DS. Detection of differentially expressed proteins in early-stage melanoma patients using SELDI-TOF mass spectrometry. Ann. NY Acad. Sci. 1022, 317–322 (2004).
80 Wu B, Abbott T, Fishman D et al. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 1636–1643 (2003).
81 Yu Y, Chen S, Wang LS et al. Prediction of pancreatic cancer by serum biomarkers using surface-enhanced laser desorption/ionization-based decision tree classification. Oncology 68, 79–86 (2005).
82 Zhang YF, Wu DL, Guan M et al. Tree analysis of mass spectral urine profiles discriminates transitional cell carcinoma of the bladder from noncancer patient. Clin. Biochem. 37, 772–779 (2004).
83 Zhu H, Yu CY, Zhang H. Tree-based disease classification using protein data. Proteomics 3, 1673–1677 (2003).
84 Neville P, Tan PY, Mann G, Wolfinger R. Generalizable mass spectrometry mining used to identify disease state biomarkers from blood serum. Proteomics 3, 1710–1715 (2003).
85 Becker S, Cazares LH, Watson P et al. Surfaced-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) differentiation of serum protein profiles of BRCA-1 and sporadic breast cancer. Ann. Surg. Oncol. 11, 907–914 (2004).
86 Yang SY, Xiao XY, Zhang WG et al. Application of serum SELDI proteomic patterns in diagnosis of lung cancer. BMC Cancer 5, 83 (2005).
87 Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer-Verlag, NY, USA, 301 (2001).
•• Excellent textbook that incorporates statistical principles in data mining.
88 Yasui Y, Pepe M, Thompson ML et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 4, 449–463 (2003).
89 Freund Y, Schapire R. A decision-theoretical generalization of on-line learning and an application to boosting. J. Computer Syst. Sci. 55, 119–139 (1997).
90 Izmirlian G. Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Ann. NY Acad. Sci. 1020, 154–174 (2004).
91 Bishop CM. Neural networks for pattern recognition. Oxford University Press, UK (1995).
92 Bicciato S. Artificial neural network technologies to identify biomarkers for therapeutic intervention. Curr. Opin. Mol. Ther. 6, 616–623 (2004).
93 Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22, 717–727 (2000).
94 Bloch HA, Petersen M, Sperotto MM et al. Identification of barley and rye varieties using matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry with neural networks. Rapid Commun. Mass Spectrom. 15, 440–445 (2001).
95 Chen YD, Zheng S, Yu JK, Hu X. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clin. Cancer Res. 10, 8380–8385 (2004).
96 Goodacre R, Rooney PJ, Kell DB. Discrimination between methicillin-resistant and methicillin-susceptible Staphylococcus aureus using pyrolysis mass spectrometry and artificial neural networks. J. Antimicrob. Chemother. 41, 27–34 (1998).
97 Grus FH, Joachim SC, Pfeiffer N. Analysis of complex autoantibody repertoires by surface-enhanced laser desorption/ionization-time of flight mass spectrometry. Proteomics 3, 957–961 (2003).
98 Grus FH, Podust VN, Bruns K et al. SELDI-TOF-MS ProteinChip array profiling of tears from patients with dry eye. Invest. Ophthalmol. Vis. Sci. 46, 863–876 (2005).
99 Liu J, Zheng S, Yu JK, Zhang JM, Chen Z. Serum protein fingerprinting coupled with artificial neural network distinguishes glioma from healthy population or brain benign tumor. J. Zhejiang Univ. Sci. B 6, 4–10 (2005).
100 Mian S, Ball G, Hornbuckle J et al. A prototype methodology combining surface-enhanced laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell lines exposed to paclitaxel and doxorubicin under in vitro conditions. Proteomics 3, 1725–1737 (2003).
101 Mian S, Ugurel S, Parkinson E et al. Serum proteomic fingerprinting discriminates between clinical stages and predicts disease progression in melanoma patients. J. Clin. Oncol. 23, 5088–5093 (2005).

102 Rogers MA, Clarke P, Noble J et al. Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis: identification of key issues affecting potential clinical utility. Cancer Res. 63, 6971–6983 (2003).
103 Sorensen HA, Petersen MK, Jacobsen S, Sondergaard I. Mass spectrometry and partial least-squares regression: a tool for identification of wheat variety and end-use quality. J. Mass Spectrom. 39, 607–612 (2004).
104 Sorensen HA, Sperotto MM, Petersen M et al. Variety identification of wheat using mass spectrometry with neural networks and the influence of mass spectra processing prior to neural network analysis. Rapid Commun. Mass Spectrom. 16, 1232–1237 (2002).
105 Tatay JW, Feng X, Sobczak N et al. Multiple approaches to data-mining of proteomic data based on statistical and pattern classification methods. Proteomics 3, 1704–1709 (2003).
106 Ball G, Mian S, Holding F et al. An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers. Bioinformatics 18, 395–404 (2002).
107 Poon TC, Hui AY, Chan HL et al. Prediction of liver fibrosis and cirrhosis in chronic hepatitis B infection by serum proteomic fingerprinting: a pilot study. Clin. Chem. 51, 328–335 (2005).
108 Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics 21, 1104–1111 (2005).
• Interesting approach to generating classification algorithms.
109 Fisher L, Van Ness JW. Admissible discriminant analysis. J. Am. Stat. Assoc. 68 (1973).
110 Sidransky D, Irizarry R, Califano JA et al. Serum protein MALDI profiling to distinguish upper aerodigestive tract cancer patients from control subjects. J. Natl Cancer Inst. 95, 1711–1717 (2003).
111 Sorace JM, Zhan M. A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 4, 24 (2003).
112 Cazares LH, Adam BL, Ward MD et al. Normal, benign, preneoplastic, and malignant prostate cells have distinct protein expression profiles resolved by surface enhanced laser desorption/ionization mass spectrometry. Clin. Cancer Res. 8, 2541–2552 (2002).
113 Mitchell BL, Yasui Y, Lampe JW, Gafken PR, Lampe PD. Evaluation of matrix-assisted laser desorption/ionization-time of flight mass spectrometry proteomic profiling: identification of α2-HS glycoprotein B-chain as a biomarker of diet. Proteomics 5, 2238–2246 (2005).
114 Li L, Umbach DM, Terry P, Taylor JA. Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 20, 1638–1640 (2004).
115 Ghirardo A, Sorensen HA, Petersen M, Jacobsen S, Sondergaard I. Early prediction of wheat quality: analysis during grain development using mass spectrometry and multivariate data analysis. Rapid Commun. Mass Spectrom. 19, 525–532 (2005).
116 Vapnik V. Statistical Learning Theory. Wiley, NY, USA (1998).
• Systematic theoretical and practical discussion of support vector machines, neural networks and statistical learning theory.
117 Burges CJC. A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discov. 2, 121–167 (1998).
118 Yu JK, Zheng S, Tang Y, Li L. An integrated approach utilizing proteomics and bioinformatics to detect ovarian cancer. J. Zhejiang Univ. Sci. B 6, 227–231 (2005).
119 Yu JK, Chen YD, Zheng S. An integrated approach to the detection of colorectal cancer utilizing proteomics and bioinformatics. World J. Gastroenterol. 10, 3127–3131 (2004).
120 Xu XQ, Leow CK, Lu X et al. Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics. Proteomics 4, 3235–3245 (2004).
121 Prados J, Kalousis A, Sanchez JC, Allard L, Carrette O, Hilario M. Mining mass spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics 4, 2320–2332 (2004).
122 Zhang Z, Bast RC Jr, Yu Y et al. Three biomarkers identified from serum proteomic analysis for the detection of early stage ovarian cancer. Cancer Res. 64, 5882–5890 (2004).
• Uses a multi-institutional study design with independent and cross-validation to discover biomarkers.
123 Li J, Zhang Z, Rosenzweig J, Wang YY, Chan DW. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem. 48, 1296–1304 (2002).
124 Levner I. Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 6, 68 (2005).
125 Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl Acad. Sci. USA 99, 6562–6566 (2002).
126 Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nature Rev. Cancer 4, 309–314 (2004).
•• Outstanding review of levels of validation required for cancer biomarkers.
127 Metz CE. Basic principles of ROC analysis. Semin. Nucl. Med. 8, 283–298 (1978).
128 Zhang DD, Zhou XH, Freeman DH Jr, Freeman JL. A non-parametric method for the comparison of partial areas under ROC curves and its application to large healthcare data sets. Stat. Med. 21, 701–715 (2002).
129 Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics 59, 614–623 (2003).
130 Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics 1, 123–140 (2000).
131 Walter SD. The partial area under the summary ROC curve. Stat. Med. 24, 2025–2040 (2005).
132 McClish DK. Analyzing a portion of the ROC curve. Med. Decis. Making 9, 190–195 (1989).
133 Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 201, 745–750 (1996).
134 Henzel WJ, Billeci TM, Stults JT, Wong SC, Grimley C, Watanabe C. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc. Natl Acad. Sci. USA 90, 5011–5015 (1993).
• Seminal peptide mass fingerprinting work.
135 Arnott DP, Henzel WJ, Stults JT. Identification of proteins from two-dimensional electrophoresis gels by peptide mass fingerprinting. ACS Symposium Series 619, 226–243 (1996).
136 Mann M, Hoejrup P, Roepstorff P. Use of mass spectrometric molecular weight information to identify proteins in sequence databases. Biol. Mass Spectrom. 22, 338–345 (1993).
137 Gras R, Muller M, Gasteiger E et al. Improving protein identification from peptide mass fingerprinting through a parameterized multi-level scoring algorithm and an optimized peak detection. Electrophoresis 20, 3535–3550 (1999).
138 Chamrad DC, Koerting G, Gobom J et al. Interpretation of mass spectrometry data for high-throughput proteomics. Anal. Bioanal. Chem. 376, 1014–1022 (2003).
139 Samuelsson J, Dalevi D, Levander F, Rognvaldsson T. Modular, scriptable and automated analysis tools for high-throughput peptide mass fingerprinting. Bioinformatics 20, 3628–3635 (2004).
140 Johnson RS, Martin SA, Biemann K. Collision-induced fragmentation of (M + H)+ ions of peptides. Side chain specific sequence ions. Int. J. Mass Spectrom. Ion Processes 86, 137–154 (1988).
141 Johnson RS, Biemann K. Computer program (SEQPEP) to aid in the interpretation of high-energy collision tandem mass spectra of peptides. Biomed. Environ. Mass Spectrom. 18, 945–957 (1989).
142 Kaiser RE Jr, Cooks RG, Syka JEP, Stafford GC Jr. Collisionally activated dissociation of peptides using a quadrupole ion-trap mass spectrometer. Rapid Commun. Mass Spectrom. 4, 30–33 (1990).
143 Van Berkel GJ, Glish GL, McLuckey SA. Electrospray ionization combined with ion trap mass spectrometry. Anal. Chem. 62, 1284–1295 (1990).
144 McLuckey SA, Van Berkel GJ, Glish GL, Huang EC, Henion JD. Ion spray liquid chromatography/ion trap mass spectrometry determination of biomolecules. Anal. Chem. 63, 375–383 (1991).
145 Qin J, Chait BT. Preferential fragmentation of protonated gas-phase peptide ions adjacent to acidic amino acid residues. J. Am. Chem. Soc. 117, 5411–5412 (1995).
146 Shevchenko A, Chernushevich I, Ens W et al. Rapid ‘de novo’ peptide sequencing by a combination of nanoelectrospray isotopic labeling and a quadrupole/time-of-flight mass spectrometer. Rapid Commun. Mass Spectrom. 11, 1015–1024 (1997).
147 Wu Q, Van Orden S, Cheng X, Bakhtiar R, Smith RD. Characterization of cytochrome c variants with high-resolution FTICR mass spectrometry: correlation of fragmentation and structure. Anal. Chem. 67, 2498–2509 (1995).
148 Kaufman R, Spengler B, Lutzenkirchen F. Mass spectrometric sequencing of linear peptides by product-ion analysis in a reflectron time-of-flight mass spectrometer using matrix-assisted laser desorption ionization. Rapid Commun. Mass Spectrom. 7, 902–910 (1993).
149 Krutchinsky AN, Loboda AV, Spicer VL, Dworschak R, Ens W, Standing KG. Orthogonal injection of matrix-assisted laser desorption/ionization ions into a time-of-flight spectrometer through a collisional damping interface. Rapid Commun. Mass Spectrom. 12, 508–518 (1998).
150 Bienvenut WV, Deon C, Pasquarello C et al. Matrix-assisted laser desorption/ionization-tandem mass spectrometry with high resolution and sensitivity for identification and characterization of proteins. Proteomics 2, 868–876 (2002).
151 Yergey AL, Coorssen JR, Backlund PS et al. De novo sequencing of peptides using MALDI/TOF-TOF. J. Am. Soc. Mass Spectrom. 13, 784–791 (2002).
152 Juhasz P, Campbell JM, Vestal ML. MALDI-TOF/TOF technology for peptide sequencing and protein identification. Mass Spectrometry and Hyphenated Techniques in Neuropeptide Research. Silberring J, Ekman R (Eds), Wiley, NY, USA 375–413 (2002).
153 Ding L, Kawatoh E, Tanaka K, Smith AJ, Kumashiro S. High-efficiency MALDI-QIT-ToF mass spectrometer. Proceedings of SPIE-The International Society for Optical Engineering 3777, 144–155 (1999).
154 Tanaka K, Kawatoh E, Ding L, Smith A, Kumashiro S. A MALDI-quadrupole ion trap-TOF mass spectrometer. Proceedings of the 47th ASMS Conference on Mass Spectrometry and Allied Topics, June 1999, TX, USA, 1823–1824 (1999).

Affiliations
• Eric T Fung, MD, PhD
Vice President of Clinical Affairs, Ciphergen Biosystems, Inc., 6611 Dumbarton Circle, Fremont, CA 94555, USA
Tel.: +1 510 505 2242
Fax: +1 510 505 2101
efung@ciphergen.com
• Scot R Weinberger, BSc
President & Founder, GenNext Technologies, PO Box 370645, Montara, CA 94037-0645, USA
Tel.: +1 650 563 9577
Fax: +1 650 563 9577
sweinberger@ix.netcom.com
• Ed Gavin, BSc
Director of Software Development, Ciphergen Biosystems, Inc., 6611 Dumbarton Circle, Fremont, CA 94555, USA
Tel.: +1 510 505 2244
Fax: +1 510 505 2101
egavin@ciphergen.com
• Fujun Zhang, PhD
Staff Statistician, Ciphergen Biosystems, Inc., 6611 Dumbarton Circle, Fremont, CA 94555, USA
Tel.: +1 510 505 2332
Fax: +1 510 505 2101
fzhang@ciphergen.com

www.future-drugs.com 10.1586/14789450.2.6.847 © 2005 Future Drugs Ltd ISSN 1478-9450 847
