
Isotopic Peak Intensity Ratio Based Algorithm for Determination of Isotopic Clusters and Monoisotopic Masses of Polypeptides from High-Resolution Mass Spectrometric Data
Kunsoo Park,* Joo Young Yoon, Sunho Lee, Eunok Paek,* Heejin Park, Hee-Jung Jung,| and Sang-Won Lee|

School of Computer Science and Engineering, Seoul National University, Seoul, Korea, Department of Mechanical and Information Engineering, University of Seoul, Seoul, Korea, College of Information and Communications, Hanyang University, Seoul, Korea, and Department of Chemistry and Center for Electro- and Photo-Responsive Molecules, Korea University, Seoul, Korea

Determining isotopic clusters and their monoisotopic masses is a first step in interpreting complex mass spectra generated by high-resolution mass spectrometers. We propose a mathematical model for isotopic distributions of polypeptides and an effective interpretation algorithm. Our model uses two types of ratios: the intensity ratio of two adjacent peaks and the intensity ratio product of three adjacent peaks in an isotopic distribution. These ratios can be approximated as simple functions of the polypeptide mass, and their values fall within certain ranges that depend on the polypeptide mass. Given a spectrum as a peak list, our algorithm first finds all isotopic clusters consisting of two or more peaks. Then, it scores the clusters using the ranges of the ratio functions and computes the monoisotopic masses of the identified clusters. Our method was applied to high-resolution mass spectra obtained from a Fourier transform ion cyclotron resonance (FTICR) mass spectrometer coupled to reverse-phase liquid chromatography (RPLC). For polypeptides whose amino acid sequences were identified by tandem mass spectrometry (MS/MS), we applied both THRASH-based software implementations and our method. Our method found more masses of known peptides when the total numbers of clusters identified by both methods were fixed. Experimental results show that our method performed better for isotopic mass clusters of weak intensity, where the isotopic distributions deviate significantly from their theoretical distributions. It also correctly identified some isotopic clusters that were not found by THRASH-based implementations, especially those for which THRASH gave 1 Da mismatches. Another advantage of our method is that it is much faster than THRASH, which calculates a least-squares fit.

With the introduction of soft ionization methods such as electrospray ionization (ESI) (1) and matrix-assisted laser desorption/ionization (MALDI) (2), mass spectrometry (MS) has become one of the most robust and powerful analytical tools for characterizing large biological molecules. MS-based proteomic experiments have provided valuable biological information, including qualitative and quantitative identification of proteomes and the types and degrees of post-translational modifications. In particular, high-resolution mass spectrometers, such as Fourier transform ion cyclotron resonance (FTICR) or Orbitrap mass spectrometers, have greatly improved the accuracy of proteomic information.

In a common experimental practice of shotgun proteomics, precursor peptides are dynamically selected for fragmentation, with exclusion to prevent repetitive acquisition of MS/MS spectra for the same peptide. While this experimental scheme has greatly increased the throughput of proteomic experiments, it often leads to fragmentation of peptide ions with weak intensities. MS data of such weak ions exhibit nonstatistical isotopic distributions with missing peaks, which lead to inaccurate determination of monoisotopic masses. A recent study showed that up to 40% of precursor ion masses are interpreted incorrectly (3). Overlapping isotopic clusters are often observed with complex proteome samples and also result in wrong interpretation of their masses. It is also well-known that MS/MS spectra from electron capture dissociation (ECD) of intact proteins often suffer from inaccurate extraction of fragment mass information due to nonideal and overlapping isotopic clusters (4).

Determining isotopic clusters and their monoisotopic masses is the first step in interpreting complex mass spectra generated by high-resolution mass spectrometers such as FTICR or Orbitrap.

* To whom correspondence should be addressed. Eunok Paek, Department of Mechanical and Information Engineering, University of Seoul, Seoul, 130-743, Korea. Phone: +82-2-2210-2680. Fax: +82-2-2210-5575. E-mail: paek@uos.ac.kr. Kunsoo Park, School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea. Phone: +82-2-880-8381. Fax: +82-2-885-3141. E-mail: kpark@theory.snu.ac.kr.
Seoul National University.
University of Seoul.
Hanyang University.
| Korea University.

(1) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Mass Spectrom. Rev. 1990, 9, 37–70.
(2) Karas, M.; Hillenkamp, F. Anal. Chem. 1988, 60, 2299–2301.
(3) Shin, B.; Jung, H.-J.; Hyung, S.-W.; Kim, H.; Lee, D.; Lee, C.; Yu, M.-H.; Lee, S.-W. Mol. Cell. Proteomics 2008, 7, 1124–1134.
(4) Zubarev, R. A.; Kelleher, N. L.; McLafferty, F. W. J. Am. Chem. Soc. 1998, 120, 3265–3266.

Anal. Chem. 2008, 80, 7294–7303
10.1021/ac800913b CCC: $40.75 © 2008 American Chemical Society
7294 Analytical Chemistry, Vol. 80, No. 19, October 1, 2008
Published on Web 08/28/2008
Downloaded by KOREA UNIV LIB on August 31, 2009 | http://pubs.acs.org
Publication Date (Web): August 28, 2008 | doi: 10.1021/ac800913b

An LC/MS/MS experiment using these mass spectrometers routinely generates high-resolution MS data (usually on the order of $10^4$ spectra) along with a large quantity of MS/MS spectra. Fast, automated, and accurate interpretation of this vast amount of MS data is a fundamental and critical step in MS-based proteomic experiments and remains the subject of much research activity. Mann et al. (5) suggested a deconvolution algorithm to find charge states. Senko et al. (6) introduced the notion of an average amino acid, called averagine, and suggested a computational method that uses it to determine monoisotopic masses. Zscore (7) is a fast and automated isotopic cluster identification algorithm based on a charge scoring scheme. Many other algorithms, such as ESI-ISOCONV (8), MATCHING (9), PepList (10), LASSO (11), AID-MS (12), and THRASH (13), have been reported.

Among these algorithms, THRASH has been one of the most widely used. It employs the Fourier transform/Patterson method for charge determination and least-squares fitting to compare a peak cluster with an averagine isotopic distribution. However, the use of least-squares fitting and/or the averagine isotopic distribution often leads to an inaccurate monoisotopic mass that differs from the correct value by 1-2 Da (12). In addition, since least-squares fitting is not a computationally efficient operation, THRASH is known to be computationally demanding.

In this paper, we present a new probabilistic model of an isotopic distribution, which regards peak intensities in an isotopic distribution as the existential probabilities of isotope compositions. Our distribution model has two feature functions: intensity ratios of two adjacent peaks and intensity ratio products of three adjacent peaks in an isotopic cluster. We show that the intensity ratios can be approximated as linear functions of polypeptide mass and the intensity ratio products as constants. These approximations can be computed from theoretical distributions of tryptic peptides generated from a protein database. On the basis of our model, we propose an innovative algorithm that accurately determines isotopic clusters and their monoisotopic masses. Our algorithm outperforms two THRASH implementations, ICR2LS and Decon2LS (http://ncrr.pnl.gov/software/), in both accuracy and speed, as demonstrated with an LC/MS/MS data set of known standard peptide samples. Our program is available on request by e-mail to kpark@theory.snu.ac.kr.

EXPERIMENTAL DATASETS
LC/MS/MS Experiments. We tested our algorithm on a data set from tryptic digests of an 18 protein mixture, the ISB standard protein mix (14). (This mixture was generously provided by the Aebersold group.) The tryptic peptides of the 18 protein mixture were separated using a modified version of the nanoACQUITY UPLC (NanoA, Waters, Milford) system, having a maximum operating pressure of 10 000 psi. Briefly, the NanoA system was modified to accommodate an RPLC capillary column (75 μm i.d. × 360 μm o.d. × 80 cm length, C18-bonded particles, 3 μm, 300 Å pore size, Jupiter, Phenomenex) and an SPE column. The SPE column was prepared by packing a 1-cm-long liner (250 μm i.d.) inside an internal reducer (1/16 in. to 1/32 in.; VICI) with the same C18-bonded particles. The peptides were eluted by a mixture of solvents A (0.1% formic acid in water) and B (99.9% acetonitrile, 0.1% formic acid in water), where the percentage of solvent A was increased linearly from 0 to 15% over 5 min, then increased to 50% over 120 min, and finally increased to 100% over 10 min where it was maintained for 10 min prior to re-equilibration with solvent A.

A 7-T FTICR mass spectrometer (LTQ-FT, Thermo Electron, San Jose, CA) was used to collect the mass spectra. MS precursor ion scans (m/z 400-2000) were acquired in full-profile mode (i.e., with no baseline truncation) with an AGC target value of $1 \times 10^6$, a mass resolution of $1 \times 10^5$, and a maximum ion accumulation time of 1000 ms. Acquiring MS scans in full-profile mode significantly increases the data size: one full LC/MS experiment would produce an MS result file (.raw file) exceeding 2 GB, which cannot be handled by the current Xcalibur software or by other MS data analysis tools that use Xcalibur's API to read the raw file. We therefore divided one full LC/MS experiment of the ISB standard peptide mix into five 30-min experiments (i.e., five segments) by placing five MS acquisition sequences consecutively during an LC gradient. The mass spectrometer was operated in data-dependent tandem MS mode; the seven most abundant ions detected in a precursor MS scan were dynamically selected for MS/MS experiments, with a dynamic exclusion option enabled (exclusion mass width low, 1.10 Th; exclusion mass width high, 2.10 Th; exclusion list size, 120; exclusion duration, 30 s). Collision-induced dissociations of the precursor ions were performed in an ion trap (LTQ) with the collisional energy and isolation width set to 35% and 3 Th, respectively. The Xcalibur software package (v. 2.0 SR1, Thermo Electron) was used to construct the experimental methods.

Database Search. All MS/MS data (i.e., DTA files) were subjected to the postexperiment monoisotopic mass filtering and refinement (PE-MMR) process (3) before they were searched against a protein database containing the sequences of the 18 proteins and common contaminant sequences. The tolerance was set to 10 ppm for precursor ions and 1 Da for fragment ions. Variable modification options were used for the carbamidomethylation of cysteine and arginine (57.021 460 Da) and the oxidation of methionine (15.994 920 Da). The search results were subsequently subjected to statistical validation by PeptideProphet, and the peptide IDs with a probability score of 0.5 or higher (839 nonredundant peptides) were further analyzed by manual inspection to produce the final 494 nonredundant peptide sequences from the 18 protein analysis.

(5) Mann, M.; Meng, C. K.; Fenn, J. B. Anal. Chem. 1989, 61, 1702–1708.
(6) Senko, M. W.; Beu, S. C.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 1995, 6, 229–233.
(7) Zhang, Z. Q.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1998, 9, 225–233.
(8) Wehofsky, M.; Hoffman, R. J. Mass Spectrom. 2002, 37, 223–229.
(9) Fernández-de-Cossio, J.; Gonzalez, L. J.; Satomi, Y.; Betancout, L.; Ramos, Y.; Huerta, V.; Besada, V.; Padron, G.; Minamino, N.; Takao, T. Rapid Commun. Mass Spectrom. 2004, 19, 2465–2472.
(10) Li, X.; Yi, E. C.; Kemp, C. J.; Zhang, H.; Aebersold, R. Mol. Cell. Proteomics 2005, 4, 1328–1340.
(11) Du, P.; Angeletti, R. H. Anal. Chem. 2006, 78, 3385–3392.
(12) Chen, L.; Sze, S. K.; Yang, H. Anal. Chem. 2006, 78, 5006–5018.
(13) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320–332.
(14) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. J. Proteome Res. 2008, 7, 96–103.

METHODS
We first present a probabilistic model of the isotopic distribution of a polypeptide. Then, we describe our approximations of intensity ratio functions, which are the intensity ratios of two adjacent peaks in an isotopic distribution, and of intensity ratio product functions, which are the intensity ratio products of three adjacent peaks. Finally, we show how our algorithm determines isotopic clusters and their monoisotopic masses quickly and accurately.

Isotopic Distribution Model. We first introduce some notation. Let $A = \{\mathrm{C}, \mathrm{H}, \mathrm{N}, \mathrm{O}, \mathrm{S}\}$ be the set of atoms that compose a polypeptide. For each atom $X \in A$, let $X^a$ denote the +a isotope of atom X, and let $P_{X^a}$ denote its existential probability. For example, $P_{\mathrm{C}^1} = 0.011\,07$ because 1.107% of carbon atoms in nature are +1 isotopes (15). $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$ denotes the elemental composition of a polypeptide, where $n_X$ is the number of atoms X in the polypeptide.

Because of the isotopes, the mass of a polypeptide $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$ is not unique. If an instance of the polypeptide has four +1 isotopes, its mass is 4 Da greater than that of an instance with no heavy isotopes. The set of peaks generated by the various instances of a polypeptide is called the isotopic cluster of the polypeptide. We define an isotopic distribution of a polypeptide as the theoretical masses and intensities of the peaks generated by all instances of the polypeptide. In an isotopic distribution, adjacent peaks are separated by about 1 Da (average value 1.002 35 Da (12, 13)). Let $I_k$ denote the intensity of the kth peak, $k \ge 0$, in an isotopic distribution. Specifically, $I_0$ is the intensity of the monoisotopic peak and $I_k$, $k \ge 1$, is the intensity of the peak whose mass difference from the monoisotopic peak is k. We model $I_k$ as in Lemma 1, using the existential probability of the polypeptide instance whose mass is k Da greater than that of the polypeptide instance with no heavy isotopes. A detailed derivation of Lemma 1 is given in Supporting Information section S-1.

Lemma 1. The intensity $I_k$ in an isotopic distribution approximates to

$$I_k = I_0 \sum_{k_1 + 2k_2 + 4k_4 = k} \frac{T_1^{k_1}\, T_2^{k_2}\, T_4^{k_4}}{k_1!\, k_2!\, k_4!}$$

where

$$T_1 = \sum_{X} n_X \frac{P_{X^1}}{P_{X^0}}, \qquad T_2 = \sum_{X} n_X \frac{P_{X^2}}{P_{X^0}}, \qquad \text{and} \qquad T_4 = \sum_{X} n_X \frac{P_{X^4}}{P_{X^0}}$$

For example, when $k_1 + 2k_2 + 4k_4 = 4$, there are four cases: four +1 isotopes ($k_1 = 4$, $k_2 = 0$, $k_4 = 0$); two +1 isotopes and one +2 isotope ($k_1 = 2$, $k_2 = 1$, $k_4 = 0$); two +2 isotopes ($k_1 = 0$, $k_2 = 2$, $k_4 = 0$); and one +4 isotope ($k_1 = 0$, $k_2 = 0$, $k_4 = 1$). Hence $I_4$ approximates to $I_0 \left( T_1^4/4! + T_1^2 T_2/2! + T_2^2/2! + T_4 \right)$.
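
The sum in Lemma 1 can be evaluated by enumerating every triple $(k_1, k_2, k_4)$ that satisfies the constraint. The following is a minimal sketch of that enumeration; the function name `relative_intensity` and its argument names are chosen here for illustration and do not come from the paper.

```python
from math import factorial


def relative_intensity(k, t1, t2, t4):
    """Return I_k / I_0 as the sum over all (k1, k2, k4) with k1 + 2*k2 + 4*k4 == k."""
    total = 0.0
    for k4 in range(k // 4 + 1):
        for k2 in range((k - 4 * k4) // 2 + 1):
            k1 = k - 2 * k2 - 4 * k4  # remaining +1 isotopes
            total += (t1 ** k1) * (t2 ** k2) * (t4 ** k4) / (
                factorial(k1) * factorial(k2) * factorial(k4)
            )
    return total
```

For k = 4 the loop visits exactly the four cases listed in the example above.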
Now we want to simplify further the mathematical form of the intensity $I_k$ in Lemma 1. We assume linearity between the mass m and the numbers of atoms, i.e., $n_X \approx a_X m$, where $a_X$ is a constant for each atom X that may take a range of values depending on the elemental compositions of polypeptides. If each $n_X$ is linear in m, then $T_1$, $T_2$, and $T_4$ are also linear in m and $I_k$ becomes a polynomial in m. In the representation of $I_k$ by $T_1$, $T_2$, and $T_4$ in Lemma 1, the degree of $T_1$ determines that of $I_k$, which is k, because the term with the highest degree is $T_1^k/k!$, from the case of k isotopes of +1 Da.

Lemma 2. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$, the intensity $I_k$ approximates to a polynomial in the mass m of degree k, i.e., $I_k = c_k m^k + c_{k-1} m^{k-1} + \cdots + c_1 m + c_0$.

Because of variations in elemental compositions, each of $T_1$, $T_2$, and $T_4$ has a range of constants in its linear form. For example, consider the extreme case in which a polypeptide consists of one kind of amino acid: polypeptides of phenylalanine (F, $\mathrm{C_9H_9NO}$) give the maximum $T_1 = 6.97 \times 10^{-4}\, m$ and polypeptides of aspartic acid (D, $\mathrm{C_4H_5NO_3}$) the minimum $T_1 = 4.23 \times 10^{-4}\, m$. The average $T_1 = 5.43 \times 10^{-4}\, m$ is computed from the averagine $\mathrm{C_{4.9384}H_{7.7583}N_{1.3577}O_{1.4773}S_{0.0417}}$. Note that the averagine model fixes $T_1$, $T_2$, and $T_4$ at their average values for all values of m. In contrast, we obtain both the minimum and the maximum of $T_1$, $T_2$, and $T_4$ as linear forms in addition to their averages. From the ranges of values that $T_1$, $T_2$, and $T_4$ can take, we can estimate the range of $I_k$.
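
To illustrate how the slope of $T_1$ depends on elemental composition, the sketch below evaluates $T_1 = \sum_X n_X P_{X^1}/P_{X^0}$ per unit of monoisotopic mass for a few compositions. Only the carbon value 0.011 07 is quoted in the text; the other abundances and the monoisotopic atomic masses are assumed typical values, and the names `ABUNDANCE`, `MONO_MASS`, and `t1_slope` are hypothetical. The resulting slopes therefore need not reproduce the coefficients quoted above exactly.

```python
# Assumed natural abundances of the +0 and +1 isotopes (fractions).
# Only the 13C value (0.01107) is taken from the text; the rest are assumptions.
ABUNDANCE = {
    "C": {0: 0.98893, 1: 0.01107},
    "H": {0: 0.99985, 1: 0.00015},
    "N": {0: 0.99634, 1: 0.00366},
    "O": {0: 0.99757, 1: 0.00038},
    "S": {0: 0.9502, 1: 0.0075},
}

# Assumed monoisotopic atomic masses in Da.
MONO_MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}


def t1(composition):
    """T1 = sum over atoms X of n_X * P(X^1) / P(X^0)."""
    return sum(n * ABUNDANCE[x][1] / ABUNDANCE[x][0] for x, n in composition.items())


def mono_mass(composition):
    """Monoisotopic mass of a composition given as {element: count}."""
    return sum(n * MONO_MASS[x] for x, n in composition.items())


def t1_slope(composition):
    """T1 per Da, i.e. the coefficient a in T1 = a * m under the linearity assumption."""
    return t1(composition) / mono_mass(composition)


phenylalanine = {"C": 9, "H": 9, "N": 1, "O": 1}
aspartic_acid = {"C": 4, "H": 5, "N": 1, "O": 3}
averagine = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
```

With these assumed inputs, the carbon-rich phenylalanine residue gives the largest slope and the oxygen-rich aspartic acid residue the smallest, with averagine in between, which is the ordering described in the text.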
Ratio Functions and Ratio Product Functions. On the basis of the approximation of $I_k$ given above, we first show that an intensity ratio, $I_{k+1}/I_k$, can be approximated by a linear function of polypeptide mass and that an intensity ratio product, $I_k I_{k+2}/I_{k+1}^2$, can be approximated by a constant function. Recently, a similar model using the intensity ratio was proposed independently, in which $I_{k+1}/I_k$ is modeled by a polynomial of mass (16). We show here that a simple linear approximation of $I_{k+1}/I_k$ suffices.

Second, we compute their average, minimum, and maximum functions using simulated spectra of tryptic polypeptides generated from a protein database. The algebraic estimation of the min/max functions from $T_1$, $T_2$, and $T_4$ becomes harder for higher degree k, so we compute them using stochastic simulation. These intensity ratio and ratio product functions are simpler than the intensity itself and reveal more features of isotopic distributions.

From Lemma 2, $I_{k+1}/I_k$ is a ratio of two polynomials of degrees k + 1 and k. For a sufficiently large mass m, the highest degree terms ($c_{k+1} m^{k+1}$ in $I_{k+1}$ and $c_k m^k$ in $I_k$) dominate and thus $I_{k+1}/I_k$ approximates to some linear function, $cm + b$.

Theorem 1. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$, the ratio of two adjacent peaks, $I_{k+1}/I_k$, can be approximated by a linear function of the polypeptide mass.
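
As a rough check of Theorem 1, one can keep only the leading terms of Lemma 1. The sketch below assumes $T_1 \approx a m$ for some constant $a$ (a symbol introduced here for illustration) and ignores the contributions of $T_2$ and $T_4$ to the slope:

```latex
\frac{I_{k+1}}{I_k}
  \approx \frac{T_1^{\,k+1}/(k+1)!}{T_1^{\,k}/k!}
  = \frac{T_1}{k+1}
  \approx \frac{a}{k+1}\, m
```

Under these assumptions the slope c decreases as $1/(k+1)$, and the intercept b collects the lower-order terms that this leading-order estimate drops.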
To determine the constants of the ratio function, $I_{k+1}/I_k = cm + b$, we sampled about 100 000 tryptic peptides of 400 Da to 5 200 Da generated from UniProt database 8.0 (17) and computed the ratio $I_{k+1}/I_k$ for each peptide. Figure 1 shows our ratio functions $I_{k+1}/I_k$ for $0 \le k \le 3$. For a sufficiently large mass, $m \ge 1800$, it can be clearly seen that the intensity ratios can be approximated by linear functions of mass, represented as the solid lines in Figure 1, which is in accordance with our theoretical analysis. The solid line, named $R_{avg}(k,m)$, is computed by linear regression using least-squares fitting in the gnuplot program (http://www.gnuplot.info). The dotted line, $R_{max}(k,m)$, is the upper bound and the dashed line, $R_{min}(k,m)$, is the lower bound of the ratios in the graph, also computed by linear regression using least-squares fitting. Note
(15) Beavis, R. B. Anal. Chem. 1993, 65, 496–497.
(16) Valkenborg, D.; Jansen, I.; Burzykowski, T. J. Am. Soc. Mass Spectrom. 2008, 19, 703–712.
(17) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O'Donovan, C.; Redaschi, N.; Suzek, B. Nucleic Acids Res. 2006, 34, D187–D191.
that the min/max functions, $R_{min}(k,m)$ and $R_{max}(k,m)$, represent the variation of $I_{k+1}/I_k$ due to the elemental composition of polypeptides of mass m. In Supporting Information Table S-1, we show that the average function $R_{avg}(k,m)$ is very close to the line estimated by averagine.
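
The fit of the average line $R_{avg}(k,m) = cm + b$ is an ordinary least-squares regression of ratio against mass. The paper reports using gnuplot for this; the sketch below is a plain-Python stand-in with a hypothetical function name `fit_line`. It covers only the average line and makes no claim about how the upper and lower bound lines were fitted.

```python
def fit_line(masses, ratios):
    """Ordinary least-squares fit of ratios = c * mass + b; returns (c, b)."""
    n = len(masses)
    mean_m = sum(masses) / n
    mean_r = sum(ratios) / n
    # Centered sums of squares and cross products.
    sxx = sum((m - mean_m) ** 2 for m in masses)
    sxy = sum((m - mean_m) * (r - mean_r) for m, r in zip(masses, ratios))
    c = sxy / sxx
    b = mean_r - c * mean_m
    return c, b
```

In use, `masses` would hold the monoisotopic masses of the sampled peptides (restricted to $m \ge 1800$ for the linear regime) and `ratios` the corresponding simulated values of $I_{k+1}/I_k$ for one fixed k.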
For a small mass, m < 1800, we use the linearlike quotient of two polynomials with degrees k + 1 and k from Lemma 2. In particular, $I_1/I_0$ is strongly linear for all m, because the quotient $I_1/I_0$ is $cm$. The reason for choosing the threshold 1800 is that a polypeptide below 1800 Da has its monoisotopic peak as its first and most abundant peak. In other words, $I_0$ is the most abundant and $I_{k+1}/I_k$, $k \ge 1$, becomes insignificant in the range m < 1800. Note that Valkenborg et al. (16) propose a refined model of isotopic distributions for low-mass peptides that considers the number of sulfurs in the peptides, which explains the tails of the ratios in the low mass range. However, our simple model performed well on the experimental data, and we expect that the experimental error in peaks dominates the theoretical error in our model.

In a similar way to Theorem 1, we obtain a constant approximation of the ratio product of three adjacent peaks (i.e., $(I_k/I_{k+1})(I_{k+2}/I_{k+1})$). From Lemma 2, the degrees of $I_k I_{k+2}$ and $I_{k+1}^2$ are both 2k + 2. Hence, $I_k I_{k+2}/I_{k+1}^2$ can be approximated as a constant for polypeptides of sufficiently large masses.
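
For intuition, the limiting value of the ratio product can be sketched by keeping only the leading term $T_1^k/k!$ of each intensity in Lemma 1. This is a leading-order estimate only; the constants actually used in the method come from fitting simulated distributions and may differ from it at finite mass.

```latex
\frac{I_k\, I_{k+2}}{I_{k+1}^{2}}
  \approx \frac{\dfrac{T_1^{\,k}}{k!} \cdot \dfrac{T_1^{\,k+2}}{(k+2)!}}
               {\left( \dfrac{T_1^{\,k+1}}{(k+1)!} \right)^{2}}
  = \frac{\left( (k+1)! \right)^{2}}{k!\,(k+2)!}
  = \frac{k+1}{k+2}
```

The powers of $T_1$ cancel, which is why the mass dependence drops out at this order.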
Theorem 2. In an isotopic distribution of a polypeptide $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$, the ratio product of three adjacent peaks, $I_k I_{k+2}/I_{k+1}^2$, can be approximated by a constant.

Similarly to the ratio functions, we define ratio product functions $RP_{max}(k,m)$, $RP_{min}(k,m)$, and $RP_{avg}(k,m)$, corresponding respectively to the maximum, the minimum, and the average values of $I_k I_{k+2}/I_{k+1}^2$. These functions are also computed from the peptide database (Figure 2 and Supporting Information Table S-2). We again divide the mass range at 1800 Da and compute the ratio products for the two intervals.

Algorithm Overview. We present an algorithm for determining isotopic clusters and their monoisotopic masses from a raw spectrum. Before describing our algorithm, we introduce several cluster names. A peak cluster is a list of peaks selected from a raw spectrum and sorted in increasing order of m/z. A pseudo (isotopic) cluster with charge state C is a peak cluster such that the m/z difference of every adjacent peak pair in the peak cluster is 1/C. An isotopic cluster with charge state C is a pseudocluster with charge state C whose intensity pattern corresponds to that of an isotopic distribution. Our determination algorithm consists of the following four steps: (1) peak picking, (2) pseudocluster identification, (3) isotopic cluster identification and monoisotopic mass determination, and (4) duplicate cluster removal. We describe the steps one by one.

Peak Picking. We remove noise and select relatively high intensity peaks from the raw spectrum. It should be noted that this step is not closely related to the essence of our algorithm; rather, it is more related to the noise pattern of a mass spectrometer. Thus, any peak picking algorithm that removes noise well from the raw spectrum can be used. In our experiment, we used the peak picking algorithm of Decon2LS.

Pseudocluster Identification. We identify pseudoclusters by scanning the selected peaks from low m/z to high m/z. Every time we examine a peak, we find all the pseudoclusters starting at the peak: we first find pseudoclusters with charge state 1+ and then find the pseudoclusters with higher charge states by incrementing the charge state. We describe how to enumerate all pseudoclusters starting at a peak P with a charge
Figure 1. Ratio functions ($I_{k+1}/I_k$) obtained from stochastic simulation using 100 000 tryptic peptides sampled from the UniProt database. The four panels show the kth intensity ratios for $0 \le k \le 3$ of the sampled peptides. For a sufficiently large mass, $m \ge 1800$, we represent the kth intensity ratio, $I_{k+1}/I_k$, by a linear function of polypeptide mass m (i.e., $cm + b$) and compute its average (solid line), its upper bound (dotted line), and its lower bound (dashed line) by least-squares fitting. For a small mass, m < 1800, we employ the quotient of two polynomials with degrees k + 1 and k. Supporting Information Table S-1 compares the average ratio functions obtained from the averagine model and from our fitting result.
state C. We first enumerate pseudoclusters with two peaks and then pseudoclusters with more peaks. Let X denote the m/z of P; we first find the next peaks of P, i.e., peaks in the m/z range $[X + (D - E)/C,\; X + (D + E)/C]$, where D is the estimated mass difference between two adjacent peaks in an isotopic cluster and E is the error bound. In our experiment, D is 1.002 35, which is the mass difference of two adjacent averagine peaks, and $E = 10^{-5} X$, which corresponds to 10 ppm mass accuracy. By pairing P with each next peak of P, we generate all pseudoclusters with two peaks. Once pseudoclusters with two peaks are enumerated, we enumerate pseudoclusters with three peaks by extending the pseudoclusters with two peaks to the second next peaks of P. In this way, we can enumerate all pseudoclusters starting at a peak P with a charge state C.
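
The enumeration described above can be sketched as a small search over the peak list. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, peaks are represented by their m/z values only, and each extension step is assumed to use the m/z of the current last peak as X when computing the window.

```python
D = 1.00235   # mass difference between adjacent averagine peaks (from the text)
PPM = 1e-5    # 10 ppm, so the error bound is E = PPM * X


def next_peaks(mz_list, x, charge):
    """Return the peaks whose m/z lies in [x + (D - E)/C, x + (D + E)/C]."""
    e = PPM * x
    lo = x + (D - e) / charge
    hi = x + (D + e) / charge
    return [mz for mz in mz_list if lo <= mz <= hi]


def pseudoclusters(mz_list, start, charge):
    """Enumerate all pseudoclusters with two or more peaks that start at `start`."""
    found = []
    frontier = [[start]]
    while frontier:
        cluster = frontier.pop()
        # Assumption: the window for the next peak is centered on the last peak.
        for mz in next_peaks(mz_list, cluster[-1], charge):
            extended = cluster + [mz]
            found.append(extended)
            frontier.append(extended)
    return found
```

Calling `pseudoclusters` once per charge state, starting from 1 and incrementing, follows the scanning order described in the text.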
Isotopic Cluster Identification and Monoisotopic Mass Determination. From the pseudoclusters, we identify isotopic clusters whose intensity patterns are similar to those of isotopic distributions. For each pseudocluster, we determine whether it is an isotopic cluster by checking the intensity ratio of every adjacent peak pair and the intensity ratio product of every three adjacent peaks in the pseudocluster. In determining isotopic clusters, we also consider the case in which some peaks are missing from a pseudocluster, because the monoisotopic peak and its neighboring peaks are sometimes as weak as the noise level and may be absent. Our algorithm allows up to three leftmost peaks to be missing in a pseudocluster. More specifically, we calculate scores for four cases (in which we assume that zero to three leftmost peaks are missing) and select the case with the highest score. If the score of the selected pseudocluster is above zero, most of the ratios and ratio products lie between $R_{min}(k,m)$ and $R_{max}(k,m)$ and between $RP_{min}(k,m)$ and $RP_{max}(k,m)$, respectively. Therefore, the pseudocluster is selected and becomes an isotopic cluster. Otherwise, the pseudocluster is discarded.

Score calculation for a pseudocluster starts with monoisotopic mass calculation. The monoisotopic mass, denoted by m, is computed from the most abundant peak in the pseudocluster. If the most abundant peak is the qth peak in the pseudocluster and p peaks are assumed to be missing, m is computed as follows.

$$m = (\text{mass of the } q\text{th peak}) - 1.002\,35\,(q + p - 1)$$
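
As a worked instance of this formula, the sketch below assumes that the "mass of the qth peak" is already a neutral (charge-deconvolved) mass in Da; the function name is hypothetical.

```python
D = 1.00235  # average spacing between adjacent isotopic peaks, in Da (from the text)


def monoisotopic_mass(peak_mass, q, p):
    """Mass of the q-th observed peak minus D for each peak to its left.

    q is the 1-based position of the most abundant peak in the pseudocluster,
    and p is the number of leftmost peaks assumed to be missing.
    """
    return peak_mass - D * (q + p - 1)
```

For example, if the most abundant peak is the second observed peak (q = 2), one leftmost peak is assumed missing (p = 1), and its mass is 1502.0047 Da, the formula subtracts 2 × 1.002 35 Da.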
The score of a pseudocluster with p peaks assumed missing
is as follows.
Figure 2. Ratio product functions ($I_k I_{k+2}/I_{k+1}^2$) obtained from stochastic simulation using 100 000 tryptic peptides sampled from the UniProt database. The four panels show the kth intensity ratio product for $0 \le k \le 3$ of the sampled peptides. We represent the kth intensity ratio product, $I_k I_{k+2}/I_{k+1}^2$, by an approximately constant function of polypeptide mass m (i.e., $t + c/(m + b)$) and compute its average (solid line), its upper bound (dashed line), and its lower bound (dotted line) by least-squares fitting. We use the same approximately constant functions but obtain separate fitting results for each mass range. Supporting Information Table S-2 compares the average ratio product functions obtained from the averagine model and from our fitting result.

Figure 3. Numbers of identified clusters of 494 known peptides by each program. Isotopic clusters with a monoisotopic mass within a mass tolerance of 10 ppm are considered the correct isotopic clusters of known peptides.
$$\mathrm{Score} = \sum_{k=0}^{n-2} \mathrm{scoreR}(k, p, m) + \sum_{k=0}^{n-3} \mathrm{scoreRP}(k, p, m), \qquad 0 \le p \le 3$$

where n is the number of peaks in the pseudocluster.

The score is the sum of the ratio score, scoreR(k, p, m), defined on every adjacent peak pair, and the ratio product score, scoreRP(k, p, m), defined on every three adjacent peaks in a pseudocluster. Let $I'_k$ be the intensity of the (k + 1)st peak in a pseudocluster. (Note that $I'_k$ corresponds to $I_{k+p}$ in the isotopic distribution.) The ratio score scoreR(k, p, m) measures the similarity of the intensity ratio $I'_{k+1}/I'_k$ to the intensity ratio $I_{k+p+1}/I_{k+p}$ in the isotopic distribution whose monoisotopic mass is m:

$$\mathrm{scoreR}(k, p, m) =
\begin{cases}
1 - \dfrac{I'_{k+1}/I'_k - R_{avg}(k+p, m)}{R_{max}(k+p, m) - R_{avg}(k+p, m)} & \text{if } I'_{k+1}/I'_k > R_{avg}(k+p, m) \\[2ex]
1 - \dfrac{R_{avg}(k+p, m) - I'_{k+1}/I'_k}{R_{avg}(k+p, m) - R_{min}(k+p, m)} & \text{otherwise}
\end{cases}$$

The ratio score function consists of two linear pieces in the ratio $I'_{k+1}/I'_k$ and is designed to take its maximum value, 1, when the ratio is $R_{avg}(k+p, m)$ and the value 0 when the ratio is $R_{max}(k+p, m)$ or $R_{min}(k+p, m)$. In addition, the score is negative when the ratio is higher than $R_{max}(k+p, m)$ or lower than $R_{min}(k+p, m)$.
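
The two-piece linear score can be written as a single small function. The sketch below is generic in the sense that the caller supplies the average, lower, and upper reference values (for instance $R_{avg}$, $R_{min}$, and $R_{max}$ evaluated at k + p and m); the function and parameter names are hypothetical.

```python
def piecewise_score(observed, avg, lower, upper):
    """Return 1 at `avg`, 0 at `lower` or `upper`, and a negative value outside that range."""
    if observed > avg:
        # Upper branch: falls linearly from 1 at avg to 0 at upper.
        return 1.0 - (observed - avg) / (upper - avg)
    # Lower branch: falls linearly from 1 at avg to 0 at lower.
    return 1.0 - (avg - observed) / (avg - lower)
```

Because the branches use different denominators, the score is asymmetric whenever the average does not sit midway between the two bounds.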
The ratio product score scoreRP(k, p, m) measures the similarity of the intensity ratio product $I'_k I'_{k+2}/{I'}_{k+1}^{2}$ to the intensity ratio product $I_{k+p} I_{k+p+2}/I_{k+p+1}^2$ in an isotopic distribution whose monoisotopic mass is m:

$$\mathrm{scoreRP}(k, p, m) =
\begin{cases}
1 - \dfrac{I'_k I'_{k+2}/{I'}_{k+1}^{2} - RP_{avg}(k+p, m)}{RP_{max}(k+p, m) - RP_{avg}(k+p, m)} & \text{if } I'_k I'_{k+2}/{I'}_{k+1}^{2} > RP_{avg}(k+p, m) \\[2ex]
1 - \dfrac{RP_{avg}(k+p, m) - I'_k I'_{k+2}/{I'}_{k+1}^{2}}{RP_{avg}(k+p, m) - RP_{min}(k+p, m)} & \text{otherwise}
\end{cases}$$

Refer to Supporting Information section S-2 for more techniques to improve the accuracy of our method.
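
Putting the pieces together, the total score sums the ratio scores over adjacent pairs and the ratio product scores over adjacent triples, and the number of missing peaks p is chosen to maximize it. The sketch below takes the two scoring functions and a mass function as caller-supplied callables; all names are hypothetical, and the additional techniques of section S-2 are not represented.

```python
def total_score(n_peaks, p, m, score_r, score_rp):
    """Sum scoreR over k = 0..n-2 and scoreRP over k = 0..n-3."""
    ratio_part = sum(score_r(k, p, m) for k in range(n_peaks - 1))
    product_part = sum(score_rp(k, p, m) for k in range(n_peaks - 2))
    return ratio_part + product_part


def best_missing_count(n_peaks, mass_for, score_r, score_rp, max_missing=3):
    """Try p = 0..max_missing and return (p, score) for the highest score.

    `mass_for(p)` returns the monoisotopic mass under the assumption that p
    leftmost peaks are missing. Returns None when the best score is not above
    zero, in which case the pseudocluster would be discarded.
    """
    best_p, best_score = None, float("-inf")
    for p in range(max_missing + 1):
        s = total_score(n_peaks, p, mass_for(p), score_r, score_rp)
        if s > best_score:
            best_p, best_score = p, s
    return (best_p, best_score) if best_score > 0 else None
```

Passing the scoring functions in as arguments keeps this sketch independent of how the reference functions R and RP are stored or fitted.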
Duplicate Cluster Removal. Because we consider all possible pseudoclusters, many pseudoclusters can be generated from a single isotopic cluster. Suppose that there are five peaks and adjacent peaks are separated by 0.5 Th. In this case, a pseudocluster consisting of five peaks (with charge state 2+), a pseudocluster consisting of four peaks (missing the first peak), and a pseudocluster consisting of three peaks (with charge state 1+) can be generated. We call these clusters duplicate clusters and select one of them. (They are not overlapping clusters.) Generally, if two clusters share one or more peaks and the charge state of one is a multiple of the other, they are duplicate clusters. We then remove one of them as follows. First, we remove the isotopic cluster whose most abundant peak is smaller than the other's. If the most abundant peaks are the same, the isotopic cluster with the lower charge state is removed. If their charge states are also the same, the cluster with the lower score is removed.
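
The pairwise rule above can be applied to a whole list of clusters in several ways. The sketch below uses one possible strategy, stated here as an assumption: rank clusters by (most abundant peak intensity, charge state, score) and greedily keep a cluster only if it is not a duplicate of one already kept. The `Cluster` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    peaks: frozenset       # indices of the spectrum peaks in this cluster
    charge: int            # charge state
    top_intensity: float   # intensity of the most abundant peak
    score: float           # total score of the cluster


def are_duplicates(a, b):
    """Duplicates share at least one peak and one charge is a multiple of the other."""
    shares_peak = bool(a.peaks & b.peaks)
    hi, lo = max(a.charge, b.charge), min(a.charge, b.charge)
    return shares_peak and hi % lo == 0


def remove_duplicates(clusters):
    """Keep the preferred cluster of every duplicate group (greedy, assumed strategy)."""
    # Preference order from the text: larger most abundant peak,
    # then higher charge state, then higher score.
    ranked = sorted(
        clusters,
        key=lambda c: (c.top_intensity, c.charge, c.score),
        reverse=True,
    )
    kept = []
    for c in ranked:
        if not any(are_duplicates(c, k) for k in kept):
            kept.append(c)
    return kept
```

For the five-peak example in the text, the three pseudoclusters share peaks and have charge states 2, 2, and 1, so only one of them survives under this rule.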
RESULTS AND DISCUSSION
To evaluate the performance of our method, we compared it with ICR2LS and Decon2LS, both developed by the Smith group at Pacific Northwest National Laboratory (http://ncrr.pnl.gov/software/). ICR2LS is a powerful FTICR mass analysis software package. For deisotoping, it basically adapts THRASH. Decon2LS also adapts THRASH, but its algorithm has been modified to increase deisotoping speed; the details of the improvements were not disclosed. All three programs were executed on the same PC (Pentium M processor 1.70 GHz, 1 GB RAM, Windows XP OS). To be as fair as possible to each program, parameters were set so that each method works on a similar number of total clusters. Our method and Decon2LS use the same peak picking method. The output of each program contained about 25 000 isotopic clusters.
Identification of Known Peptides. In comparing the three programs, we counted the number of identified isotopic clusters of known peptides whose amino acid sequences were identified by MS/MS. It is difficult, however, to pick out the isotopic clusters of known peptides because the MS data from an LC/MS/MS experiment can contain many peptides whose monoisotopic masses are very similar. Therefore we use the following method to classify peptides. For each known peptide (confidently identified from an MS/MS spectrum), we find isotopic clusters of this peptide at the MS scan where this peptide was identified by MS/MS. If an isotopic cluster has a monoisotopic mass within a mass tolerance of 10 ppm, we consider it a potentially correct isotopic cluster. We also look for this peptide in adjacent scans. If no isotopic cluster is found within any of 10 consecutive scans, the cluster is discarded. We regard these isotopic clusters as true positives.
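The 10 ppm mass check can be written out as a small Python sketch. The function names are hypothetical and the cluster masses in the example are made up; only the tolerance and the 2296.22 Da peptide mass come from the text. The rule about 10 consecutive scans is not modeled here.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def is_candidate(cluster_mass, peptide_mass, tol_ppm=10.0):
    """True if the cluster's monoisotopic mass is within tol_ppm of the peptide."""
    return abs(ppm_error(cluster_mass, peptide_mass)) <= tol_ppm


def candidate_clusters(cluster_masses, peptide_mass, tol_ppm=10.0):
    """Keep only the cluster masses that pass the tolerance check."""
    return [m for m in cluster_masses if is_candidate(m, peptide_mass, tol_ppm)]


if __name__ == "__main__":
    # Hypothetical cluster masses reported in one MS scan.
    masses = [2296.23, 2297.22, 765.40]
    print(candidate_clusters(masses, peptide_mass=2296.22))
```

At this mass a 10 ppm window is only about 0.023 Da wide, so a cluster that is off by 1 Da or has the wrong charge state falls far outside it.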
We counted the isotopic clusters of 494 known peptides. Figure 3 shows the number of isotopic clusters identified by each program. Our method shows a 10.6% improvement over ICR2LS and a 4.8%
improvement over Decon2LS. To observe the performance according to the mass, we divided the 494 peptides into 500 Da intervals and counted the number of identified clusters of peptides that belong to each interval (Table 1). Our method works well regardless of peptide masses.

Table 1. Numbers of Clusters of 494 Known Peptides (a)

                                      number of clusters
mass (Da)   number of peptides   our method   Decon2LS   ICR2LS
1000        47                   790          767        777
1500        158                  2630         2559       2575
2000        109                  2136         2024       1961
2500        72                   1555         1447       1393
3000        52                   1162         1151       1060
3500        26                   963          880        802
4000        19                   969          856        687
4500        2                    42           41         37
5000        2                    30           31         30
>5000       7                    311          348        255
sum         494                  10 588       10 104     9577

(a) We divided the 494 peptides into 500 Da intervals and counted the number of identified clusters of peptides that belong to each interval.

Table 2. Result of Monoisotopic Mass Determination for the Peptide Whose Mass Is 2296.22 Da

                        our method   Decon2LS   ICR2LS
2296.22 Da (correct)    35           27         21
2295.22 Da (-1 Da)      2            1          0
2297.22 Da (+1 Da)      6            10         9
2298.22 Da (+2 Da)      0            1          2
765.40 Da (wrong CS)    0            2          6
not found               0            2          5

7299 Analytical Chemistry, Vol. 80, No. 19, October 1, 2008
Downloaded by KOREA UNIV LIB on August 31, 2009 | http://pubs.acs.org
Publication Date (Web): August 28, 2008 | doi: 10.1021/ac800913b

Figure 4. Examples where our method determines the correct monoisotopic mass. The chemical formula is $\mathrm{C_{101}H_{165}N_{29}O_{32}}$ and the monoisotopic mass is 2296.22 Da. An arrow represents the monoisotopic peak of this peptide, and O, ], and / represent the theoretical isotopic distributions of this peptide calculated by our method, Decon2LS, and ICR2LS, respectively. (a) Decon2LS assigned 2295.22 Da as the monoisotopic mass and ICR2LS found no cluster. (b) Decon2LS and ICR2LS assigned 2297.22 Da. (c) ICR2LS assigned 2298.22 Da. (d) ICR2LS assigned an incorrect charge state and assigned 765.40 Da as the monoisotopic mass.
There can be various reasons why the programs give different search results. Some clusters are inherently ambiguous, and each program can make a different judgment. Sometimes the charge states of clusters are determined incorrectly. For all three programs, the primary errors are 1-2 Da errors. In THRASH-based algorithms, 1-2 Da errors often happen when the position of the most abundant peak of an identified cluster is different from that of averagine. In contrast, our method has low dependency on the most abundant peak. Sometimes THRASH-based algorithms determine the monoisotopic mass of an identified isotopic cluster to be 1 Da larger than the correct mass, even though the correct monoisotopic peak exists in the spectrum. Such an error is uncommon in our method because the existence of the monoisotopic peak in a pseudocluster usually increases our score. However, our method also cannot correctly identify several ambiguous cases because it is still based on the cluster shape.
Detection of false positives in search results can only be performed by manual inspection because many unidentified peptides are crowded in the spectrum and it is possible that there exists a peptide whose monoisotopic mass differs by 1 Da from that of a known peptide. Here we present several examples in which the monoisotopic masses determined by our method are different from the masses determined by the other programs. A peptide whose chemical formula is $\mathrm{C_{101}H_{165}N_{29}O_{32}}$ and whose monoisotopic mass is 2296.22 Da is observed over a relatively long elution time (from scan no. 3464 to 3565) during the LC/MS/MS experiment of the ISB standard peptide mix. The results of mass determination are summarized in Table 2. We show four examples in Figure 4 where our method determines the correct monoisotopic mass. O, ], and / represent the theoretical isotopic distributions of this peptide calculated by our method, Decon2LS, and ICR2LS, respectively. In Figure 4a, Decon2LS determined the mass of the cluster as 1 Da smaller than the correct theoretical mass because the first peak of the cluster is much larger than in the averagine isotopic distribution. ICR2LS found no cluster in this region. On the other hand, in Figure 4b Decon2LS and ICR2LS assigned 2297.22 Da, which is 1 Da larger than the theoretical mass. Figure 4c is a case where the intensities are close to the noise level. Because the fourth peak appears abnormally large, ICR2LS assigned 2298.22 Da, which is 2 Da larger than the theoretical mass. These examples (Figure 4a-c) show that the THRASH algorithm often assigns an incorrect mass when the most abundant peak of the identified cluster shows a discrepancy from the averagine isotopic distribution. Figure 4d is a case where ICR2LS assigned an incorrect charge state and assigned 765.40 Da as the monoisotopic mass. Some clusters that were not found by a program may be found if the parameters are set differently (lowering minimum S/N ratios, for example). However, a different parameter set may well cause false positive determination of other clusters, and there is always a compromise between accuracy and computational cost. The highly accurate determination of monoisotopic masses by our method should increase the accuracy of peptide identification and decrease false positive peptide identification in MS-based proteomics. More scans of this peptide are shown in Supporting Information Figure S-1.

Figure 5. Examples of overlapping clusters. (a) Two clusters share no peak. Two isotopic clusters were identified by all three programs. (b) Two clusters share the peak of 716.62 Th. The isotopic cluster of 6433.46 Da (O) was identified by all three programs, but the isotopic cluster of 3576.03 Da (]) was identified only by our method.
Identification of Overlapping Clusters. Although FTICR MS has a high resolving power, there are many overlapping clusters because hundreds of isotopic clusters are crowded into a narrow range. Even in these cases it is easy to identify all overlapping isotopic clusters if there is no shared peak. All programs correctly found two isotopic clusters in Figure 5a. However, it is very hard to identify all clusters if isotopic clusters share one or more peaks. THRASH fails to identify all clusters that share one or more peaks in many cases, because the subtraction of an identified cluster might eliminate the shared peaks. Our method can identify overlapping clusters that share one or more peaks in many cases because we consider all possible pseudoclusters and do not subtract the peaks of identified clusters. In Figure 5b, the cluster whose monoisotopic mass is 6433.46 Da (O) was identified by all three programs, but the cluster whose monoisotopic mass is 3576.03 Da (]) was identified only by our method. Both clusters belong to the clusters of the 494 known peptides. However, Decon2LS and ICR2LS failed to identify both because the peak at 716.62 Th is shared by both clusters. Elimination of the 716.62 Th peak results in a low match (i.e., low fit number) between the theoretical averagine distribution and the experimental distribution, leading to loss of the mass information.
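To see how two clusters of such different masses can share one peak, the following Python sketch lists which (charge, isotope index) combinations of a given monoisotopic mass fall near a target m/z. It is an illustrative calculation only: the function name, the 0.01 Th tolerance, and the search limits are assumptions, the charge states of the two clusters are not stated in the text, and the isotope spacing is approximated by the 13C-12C mass difference.

```python
PROTON = 1.007276           # proton mass in Da
ISOTOPE_SPACING = 1.003355  # assumed peak spacing in Da (13C - 12C)


def peaks_near(mono_mass, target_mz, tol=0.01, max_charge=12, max_isotope=15):
    """Return (charge, isotope index, m/z) for isotopic peaks near target_mz.

    The k-th isotopic peak of a cluster with neutral monoisotopic mass
    mono_mass and charge z is placed at (mono_mass + k * spacing) / z + proton.
    """
    hits = []
    for z in range(1, max_charge + 1):
        for k in range(max_isotope + 1):
            mz = (mono_mass + k * ISOTOPE_SPACING) / z + PROTON
            if abs(mz - target_mz) <= tol:
                hits.append((z, k, round(mz, 3)))
    return hits


if __name__ == "__main__":
    for mass in (6433.46, 3576.03):
        print(mass, peaks_near(mass, 716.62))
```

Under these assumptions, the 6433.46 Da cluster reaches 716.62 Th at charge 9+ and the 3576.03 Da cluster at charge 5+, which is consistent with the shared peak described for Figure 5b.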
Execution Time. Another noticeable advantage of our method is its speed. Since our method uses simple ratio functions and ratio product functions that are precomputed, it can calculate the scores of isotopic clusters much faster than THRASH, which calculates the least-squares fit on the fly. Execution time for our data set is shown in Supporting Information Table S-3 and Figure 6. ICR2LS is much slower than the other programs. The execution time of our method was similar to that of Decon2LS in deisotoping the first segment data due to the dominant effect of I/O time. We can see a remarkable difference in execution time in analyzing the segment 4 data (almost 5 times faster than Decon2LS), for which it took the longest time. It must also be noted that the number of peaks obtained by the peak picking step is a major factor in execution time.

Figure 6. Execution time of three programs. The 18 protein data set consists of five files. Our method is almost 5 times faster than Decon2LS in analyzing segment 4 data. ICR2LS is much slower than other programs.
CONCLUSION
We have presented a new probabilistic model for isotopic distributions and a novel algorithm for determining isotopic distributions and monoisotopic masses based on the model. Our method was applied to protein mixture data from a high-resolution mass spectrometer, and it performed better than THRASH-based implementations. Our method found more isotopic clusters of identified peptides even though the total numbers of clusters were similar. Our method does not use the averagine fitting method, so it resolves the 1-2 Da mismatch problem in THRASH, which occurs especially for isotopic clusters that deviate from the averagine distribution due to their weak intensity. Overlapping clusters are also identified successfully by our method. Because our method uses simple ratio functions to evaluate the score of isotopic clusters, its execution is very fast. This speed is expected to allow on-the-fly determination of monoisotopic masses during an LC/MS/MS experiment, which provides advantages such as accurate assignment of precursor monoisotopic masses to the corresponding MS/MS data.
ACKNOWLEDGMENT
This study was supported by Grants FPR08-A1-020, FPR08-A1-021, and FPR08-A1-010 of the 21C Frontier Functional Proteomics Project from the Korean Ministry of Education, Science & Technology.
SUPPORTING INFORMATION AVAILABLE
Additional information as noted in text. This material is
available free of charge via the Internet at http://pubs.acs.org.
Received for review May 2, 2008. Accepted July 9, 2008.
AC800913B