
Algorithmen in der Massenspektrometrie-basierten Proteomforschung
(Algorithms in mass spectrometry based proteomics)

Dr. Clemens Gröpl
3 lectures (28.1., 2.2., 4.2.)
Course: Algorithmische Bioinformatik, Winter Semester 2008/09
Free University Berlin, http://www.inf.fu-berlin.de/inst/ag-bio/
Outline

1. Mass spectrometry (and liquid chromatography) based methods in proteomics.
   Introduction to experimental methods and important concepts.
2. Initial data processing of mass spectra and LC-MS data:
   Isotopic distributions
   Mass decomposition
   Baseline filtering
   Noise filtering
   Wavelets
   Peak picking in MS and feature finding in LC-MS
3. Analysis of LC-MS/MS data, identification of peptides and proteins
Isotope distributions

This exposition is based on:

R. Martin Smith: Understanding Mass Spectra. A Basic Approach. Wiley, 2nd
edition 2004. [S04]

Exact masses and isotopic abundances can be found for example at
http://www.sisweb.com/referenc/source/exactmaa.htm or
http://education.expasy.org/student_projects/isotopident/htdocs/motza.html

IUPAC Compendium of Chemical Terminology - the Gold Book.
http://goldbook.iupac.org/ [GoldBook]

Sebastian Böcker, Zsuzsanna Lipták: Efficient Mass Decomposition. ACM
Symposium on Applied Computing, 2005. [BL05]

Christian Huber, lectures given at Saarland University, 2005. [H05]

Wikipedia: http://en.wikipedia.org/, http://de.wikipedia.org/
Isotopes

This lecture addresses some more combinatorial aspects of mass spectrometry
related to isotope distributions and mass decomposition.

Most elements occur in nature as a mixture of isotopes. Isotopes are atom
species of the same chemical element that have different masses. They have the
same number of protons and electrons, but a different number of neutrons. The
main elements occurring in proteins are CHNOPS. A list of their naturally
occurring isotopes is given below.
Isotope   Mass [Da]    Abundance [%]
1H         1.007825     99.985
2H         2.014102      0.015
12C       12 (exact)    98.90
13C       13.003355      1.10
14N       14.003074     99.63
15N       15.000109      0.37
16O       15.994915     99.76
17O       16.999131      0.038
18O       17.999159      0.20
31P       30.973763    100.0
32S       31.972072     95.02
33S       32.971459      0.75
34S       33.967868      4.21
Isotopes (2)

Note that the lightest isotope is also the most abundant one for these elements.
Here is a list of the heavy isotopes, sorted by abundance:

Isotope   Mass [Da]    Abundance [%]
34S       33.967868      4.21
13C       13.003355      1.10
33S       32.971459      0.75
15N       15.000109      0.37
18O       17.999159      0.20
17O       16.999131      0.038
2H         2.014102      0.015

We see that sulfur has a big impact on the isotope distribution. But it is not
always present in a peptide (only the amino acids cysteine and methionine
contain sulfur). Apart from that, 13C is most abundant, followed by 15N. These
isotopes lead to +1 peaks. The heavy isotopes 18O and 34S lead to +2 peaks.
Note that 17O and 2H are very rare.
Isotopes (3)

The two isotopes of hydrogen have special names: 1H is called protium, and
2H = D is called deuterium (or sometimes heavy hydrogen).

Note that whereas the exact masses are universal physical constants, the
relative abundances are different at each place on earth and can in fact be
used to trace the origin of substances. They are also being used in isotopic
labeling techniques.

The standard unit of mass, the unified atomic mass unit, is defined as 1/12 of
the mass of 12C and denoted by u or Da, for Dalton. Hence the atomic mass of
12C is 12 u by definition. The atomic masses of the isotopes of all the other
elements are determined as ratios against this standard, leading to
non-integral values for essentially all of them.

The subtle differences of masses are due to the mass defect (essentially, the
binding energy of the nucleus). We will return to this topic later. For
understanding the next few slides, the difference between nominal and exact
masses is not essential.
Isotopes (4)

The average atomic mass (also called the average atomic weight or just atomic
weight) of an element is defined as the weighted average of the masses of all
its naturally occurring stable isotopes.

For example, the average atomic mass of carbon is calculated as

    (98.9% · 12.0 + 1.1% · 13.003355) / 100% ≈ 12.011 .

For most purposes, such as weighing out bulk chemicals, only the average
molecular mass is relevant, since what one is weighing is a statistical
distribution of varying isotopic compositions.

The monoisotopic mass is the sum of the masses of the atoms in a molecule using
the principal isotope mass of each atom instead of the isotope-averaged atomic
mass, and is most often used in mass spectrometry. The monoisotopic mass of
carbon is 12.
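As a quick numeric check of the two definitions, here is a minimal Python
sketch (the variable names are my own) that reproduces the average atomic mass
of carbon from the abundances in the table above:

    # Average vs. monoisotopic mass of carbon, using the abundances
    # from the isotope table above.
    carbon = [(12.0, 0.9890), (13.003355, 0.0110)]  # (mass [Da], abundance)

    average_mass = sum(m * p for m, p in carbon)  # weighted mean
    monoisotopic = carbon[0][0]                   # principal isotope only

    print(f"average mass:      {average_mass:.6f} Da")  # ~12.011
    print(f"monoisotopic mass: {monoisotopic:.6f} Da")  # 12.000000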
Isotopes (5)

According to the [GoldBook], the principal ion in mass spectrometry is a
molecular or fragment ion which is made up of the most abundant isotopes of
each of its atomic constituents.

Sometimes compounds are used that have been artificially isotopically enriched
in one or more positions, for example CH3-13CH3 or CH2D2. In these cases the
principal ion may be defined by treating the heavy isotopes as new atomic
species. Thus, in the above two examples, the principal ions would have masses
31 (not 30) and 18 (not 16), respectively.

In the same vein, the monoisotopic mass spectrum is defined as a spectrum
containing only ions made up of the principal isotopes of the atoms making up
the original molecule.

You will see that the monoisotopic mass is sometimes defined using the lightest
isotope. In most cases the distinction between principal and lightest isotope
is non-existent, but there is a difference for some elements, for example iron
and argon.
Isotopic distributions

The mass spectral peak representing the monoisotopic mass is not always the
most abundant isotopic peak in a spectrum, although it stems from the most
abundant isotope of each atom type.

This is due to the fact that as the number of atoms in a molecule increases,
the probability of the entire molecule containing at least one heavy isotope
increases. For example, if there are 100 carbon atoms in a molecule, each of
which has an approximately 1% chance of being a heavy isotope, then the whole
molecule is quite likely to contain at least one heavy isotope.

The monoisotopic peak is sometimes not observable, for two primary reasons:

The monoisotopic peak may not be resolved from the other isotopic peaks. In
this case only the average molecular mass may be observed.

Even if the isotopic peaks are resolved, the monoisotopic peak may be below
the noise level and heavy isotopomers may dominate completely.
Isotopic distributions (2)

Terminology (figure)

Isotopic distributions (3)

Example (figure)
Isotopic distributions (4)
To summarize: Learn to distinguish the following concepts!
nominal mass
monoisotopic mass
most abundant mass
average mass
Isotopic distributions (5)
Example: Isotopic distribution of human myoglobin
Screen shot: http://education.expasy.org/student_projects/isotopident/
Isotopic distributions (6)

A basic computational task is:

Given an ion whose atomic composition is known, how can we compute its
isotopic distribution?

We will ignore the mass defects for a moment. It is convenient to number the
peaks by their number of additional mass units, starting with zero for the
lowest isotopic peak. We can call this the isotopic rank.

Let E be a chemical element. Let π_E[i] denote the probability of the isotope
of E having i additional mass units. Thus the relative intensities of the
isotopic peaks for a single atom of element E are

    (π_E[0], π_E[1], π_E[2], ..., π_E[k_E]) .

Here k_E denotes the isotopic rank of the heaviest isotope occurring in nature.
We have π_E[ℓ] = 0 for ℓ > k_E.

For example, carbon has π_C[0] = 98.9% = 0.989 (isotope 12C) and
π_C[1] = 1.1% = 0.011 (isotope 13C).
Isotopic distributions (7)

The probability that a molecule composed out of one atom of element E and one
atom of element E' has a total of n additional neutrons is

    π_EE'[n] = Σ_{i=0}^{n} π_E[i] · π_E'[n−i] .

Note that π_EE'[ℓ] = 0 for ℓ > k_E + k_E'.

This type of composition is very common in mathematics and known as a
convolution operation, denoted by the operator ∗.

Using the convolution operator, we can rewrite the above equation as

    π_EE' = π_E ∗ π_E' .

For example, a hypothetical molecule composed out of one carbon and one
nitrogen would have π_CN = π_C ∗ π_N, that is,

    π_CN[0] = π_C[0] π_N[0] ,
    π_CN[1] = π_C[0] π_N[1] + π_C[1] π_N[0] ,
    π_CN[2] = π_C[0] π_N[2] + π_C[1] π_N[1] + π_C[2] π_N[0]
            ( = π_C[1] π_N[1], since π_C[2] = π_N[2] = 0 ) .
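To make this concrete, here is a minimal Python sketch of the convolution of
two isotopic distributions (the function name is my own); it reproduces the
π_CN example with the abundances from the table above:

    def convolve(p, q):
        """Convolution of two isotopic distributions, each given as a list
        of probabilities indexed by the number of additional mass units."""
        r = [0.0] * (len(p) + len(q) - 1)
        for i, pi in enumerate(p):
            for j, qj in enumerate(q):
                r[i + j] += pi * qj
        return r

    C = [0.9890, 0.0110]   # pi_C
    N = [0.9963, 0.0037]   # pi_N

    print(convolve(C, N))  # [0.98534..., 0.01461..., 0.0000407]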
Isotopic distributions (8)

Clearly the same type of formula applies if we compose a larger molecule out of
smaller molecules or single atoms. Molecules have isotopic distributions just
like elements.

For simplicity, let us define convolution powers. Let π^{∗1} := π and
π^{∗n} := π ∗ π^{∗(n−1)}, for any isotopic distribution π. Moreover, it is
natural to define π^{∗0} by π^{∗0}[0] = 1, π^{∗0}[ℓ] = 0 for ℓ > 0. This way,
π^{∗0} will act as neutral element with respect to the convolution operator ∗,
as expected.

Then the isotopic distribution of a molecule with the chemical formula
E_1^{n_1} ... E_ℓ^{n_ℓ}, composed out of the elements E_1, ..., E_ℓ, can be
calculated as

    π_{E_1^{n_1} ... E_ℓ^{n_ℓ}} = π_{E_1}^{∗n_1} ∗ ... ∗ π_{E_ℓ}^{∗n_ℓ} .
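A direct implementation of this formula might look as follows; it reuses
convolve from the sketch above, and the function names are again my own:

    def conv_power(p, n):
        """n-th convolution power pi^{*n}, computed by n successive
        convolutions, starting from the neutral element pi^{*0} = [1.0]."""
        r = [1.0]
        for _ in range(n):
            r = convolve(r, p)
        return r

    def molecule_distribution(parts):
        """Isotopic distribution of a molecule given as (pi_E, n_E) pairs."""
        r = [1.0]
        for p, n in parts:
            r = convolve(r, conv_power(p, n))
        return r

    # e.g. a hypothetical C2 N3 molecule:
    print(molecule_distribution([(C, 2), (N, 3)]))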
Isotopic distributions (9)

This immediately leads to an algorithm for computing the isotopic distribution
of a molecule.

Now let us estimate the running time for computing
π_{E_1}^{∗n_1} ∗ ... ∗ π_{E_l}^{∗n_l}.

The number of convolution operations is n_1 + ... + n_l − 1, which is linear in
the number of atoms.

Each convolution operation involves a summation for each π[i]. If the highest
isotopic rank for E is k_E, then the highest isotopic rank for E^n is k_E · n.
Again, this is linear in the number of atoms.

We can improve on both of these factors.

Do you see how?
Isotopic distributions (10)

Bounding the range of isotopes

For practical cases, it is possible to restrict the summation range in the
convolution operation.

In principle it is possible to form a molecule solely out of the heaviest
isotopes, and this determines the summation range needed in the convolution
calculations. However, the abundance of such an isotopomer is vanishingly
small. In fact, it will soon fall below the inverse of the Avogadro number
(6.0221415 · 10^23), so we will hardly ever see a single molecule of this type.

For simplicity, let us first consider a single element E and assume that
k_E = 1. (For example, E could be carbon or nitrogen.) In this case the
isotopic distribution is a binomial with parameter p := π_E[1],

    π_E^{∗n}[k] = C(n, k) p^k (1−p)^{n−k} ,

where C(n, k) denotes the binomial coefficient. The mean of this binomial
distribution is pn. Large deviations can be bounded as follows.
Isotopic distributions (11)

We use the upper bound for binomial coefficients

    C(n, k) ≤ (ne/k)^k .

For k ≥ 3pn (that is, k three times larger than expected) we get

    C(n, k) p^k ≤ (nep/(3pn))^k = (e/3)^k ,

and hence

    Σ_{k ≥ 3pn} C(n, k) p^k (1−p)^{n−k} ≤ Σ_{k ≥ 3pn} (e/3)^k
        = O((e/3)^{3pn}) = o(1) .

While 3pn is still linear in n, it is much smaller in practice; one can usually
restrict the calculations to less than 10 isotopic variants for peptides.

Fortunately, if it turns out that the chosen range was too small, we can detect
this afterwards, because the probabilities will no longer add up to 1.
(Exercise: Explain why.) Thus we even have an a posteriori error estimate.
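In code, restricting the summation range amounts to truncating every
convolution at a chosen maximal rank; the a posteriori check is then one line.
A sketch in the style of the convolve function above (names are mine):

    def convolve_trunc(p, q, kmax):
        """Convolution truncated to the isotopic ranks 0..kmax."""
        r = [0.0] * (kmax + 1)
        for i, pi in enumerate(p):
            for j, qj in enumerate(q):
                if i + j <= kmax:
                    r[i + j] += pi * qj
        return r

    # 100 carbon atoms, keeping only ranks 0..9:
    dist = [1.0]
    for _ in range(100):
        dist = convolve_trunc(dist, C, 9)

    # A posteriori error estimate: probability mass lost by the truncation.
    print(1.0 - sum(dist))   # tiny, so rank 9 was a safe cutoff here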
Isotopic distributions (12)

More generally, a peptide is composed out of the elements C, H, N, O, S. For
each of these elements the lightest isotope has a natural abundance above 95%
and the highest isotopic rank is at most 2. Again we can bound the sum of the
abundances of heavy isotopic variants by a binomial distribution:

    Σ_{j ≥ i} π_{E_1^{n_1} ... E_ℓ^{n_ℓ}}[j] ≤ Σ_{j ≥ i/2} C(n, j) 0.05^j 0.95^{n−j} .

(In order to get i additional mass units, at least i/2 of the atoms must be
heavy.)
Isotopic distributions (13)

Computing convolution powers by iterated squaring

There is another trick which can be used to reduce the number of convolutions
needed to calculate the n-th convolution power π^{∗n} of an elemental isotope
distribution.

Observe that, just like for any associative operation, the convolution powers
satisfy

    π^{∗2n} = π^{∗n} ∗ π^{∗n} .

In general, n is not a power of two, so let (b_j, b_{j−1}, ..., b_0) be the
bits of the binary representation of n, that is,

    n = Σ_{ν=0}^{j} b_ν 2^ν = b_j 2^j + b_{j−1} 2^{j−1} + ... + b_0 2^0 .

Then we can compute π^{∗n} as follows:

    π^{∗n} = π^{∗ Σ_ν b_ν 2^ν}
           = π^{∗ b_j 2^j} ∗ π^{∗ b_{j−1} 2^{j−1}} ∗ ... ∗ π^{∗ b_0 2^0} ,

where the product is of course meant with respect to ∗. The total number of
convolutions needed for this calculation is only O(log n).
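Combining both ideas (truncation and iterated squaring), a sketch of the fast
convolution power could look like this, reusing convolve_trunc from above:

    def conv_power_fast(p, n, kmax):
        """n-th convolution power via exponentiation by squaring; every
        intermediate result is truncated to the ranks 0..kmax."""
        result = [1.0]              # pi^{*0}, the neutral element
        base = p[:kmax + 1]         # pi^{*(2^0)}
        while n > 0:
            if n & 1:               # bit b_nu set: multiply pi^{*(2^nu)} in
                result = convolve_trunc(result, base, kmax)
            base = convolve_trunc(base, base, kmax)  # 2^nu -> 2^(nu+1)
            n >>= 1
        return result

    print(conv_power_fast(C, 100, 9))   # same result as the naive loop above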
Isotopic distributions (14)

To summarize: The first k + 1 abundances π_{E_1^{n_1} ... E_ℓ^{n_ℓ}}[i],
i = 0, ..., k, of the isotopic distribution of a molecule
E_1^{n_1} ... E_ℓ^{n_ℓ} can be computed in O(k log n) time and O(k) space,
where n = n_1 + ... + n_ℓ.

(Exercise: To test your understanding, check how these resource bounds follow
from what has been said above.)
Mass decomposition

A related question is:

Given a peak mass, what can we say about the elemental composition of the
ion that generated it?

In most cases, one cannot obtain much information about the chemical structure
from just a single peak. The best we can hope for is to obtain the chemical
formula with isotopic information attached to it. In this sense, the total mass
of an ion is decomposed into the masses of its constituents, hence the term
mass decomposition.
Mass decomposition (2)

This is formalized by the concept of a compomer [BL05].

We are given an alphabet Σ of size |Σ| = k, where each letter has a mass a_i,
i = 1, ..., k. These letters can represent atom types, isotopes, amino acids,
or nucleotides. We assume that all masses are different, because otherwise we
could never distinguish them anyway. Thus we can identify each letter with its
mass, i.e., Σ = {a_1, ..., a_k} ⊆ N. This is sometimes called a weighted
alphabet.

The mass of a string s = s_1 ... s_n ∈ Σ* is defined as the sum of the masses
of its letters, i.e., mass(s) = Σ_{i=1}^{|s|} s_i.

Formally, a compomer is an integer vector c = (c_1, ..., c_k) ∈ (N_0)^k. Each
c_i represents the number of occurrences of letter a_i. The mass of a compomer
is mass(c) := Σ_{i=1}^{k} c_i a_i, as opposed to its length,
|c| := Σ_{i=1}^{k} c_i.

In short: A compomer tells us how many instances of an atomic species are
present in a molecule. We want to find all compomers whose mass is equal to the
observed mass.
Mass decomposition (3)

There are many more strings (molecules) than compomers, but the number of
compomers is still large.

For a string s = s_1 ... s_n ∈ Σ*, we define comp(s) := (c_1, ..., c_k), where
c_i := #{ j | s_j = a_i }. Then comp(s) is the compomer associated with s, and
vice versa.

One can prove (exercise):

1. The number of strings associated with a compomer c = (c_1, ..., c_k) is the
   multinomial coefficient

       ( |c| choose c_1, ..., c_k ) = |c|! / (c_1! ··· c_k!) .

2. Given an integer n, the number of compomers c = (c_1, ..., c_k) with
   |c| = n is

       ( n+k−1 choose k−1 ) .

Thus a simple enumeration will not suffice for larger instances.
Mass decomposition (4)

Using dynamic programming, we can solve the following problems efficiently
(Σ: weighted alphabet, M: mass):

1. Existence problem: Decide whether a compomer c with mass(c) = M exists.
2. One witness problem: Output a compomer c with mass(c) = M, if one exists.
3. All witnesses problem: Compute all compomers c with mass(c) = M.
Mass decomposition (5)

The dynamic programming algorithm is a variation of the classical algorithm
introduced for the Coin Change Problem, originally due to Gilmore and Gomory.

Given a query mass M, a two-dimensional Boolean table B of size k × M is
constructed such that

    B[i, m] = 1  ⟺  m is decomposable over {a_1, ..., a_i} .

The table can be computed with the following recursion:

    B[1, m] = 1  ⟺  m mod a_1 = 0 ,

and for i > 1,

    B[i, m] = B[i−1, m]                      if m < a_i ,
    B[i, m] = B[i−1, m] ∨ B[i, m − a_i]      otherwise .

The table is constructed up to mass M, and then a straightforward backtracking
algorithm computes all witnesses of M.
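A minimal Python sketch of this dynamic program and the backtracking (the
function names and the toy alphabet are my own):

    def build_table(masses, M):
        """B[i][m] == True iff m is decomposable over the first i+1 masses."""
        k = len(masses)
        B = [[False] * (M + 1) for _ in range(k)]
        for m in range(0, M + 1, masses[0]):
            B[0][m] = True                 # base case: multiples of a_1
        for i in range(1, k):
            for m in range(M + 1):
                B[i][m] = B[i-1][m] or (m >= masses[i] and B[i][m - masses[i]])
        return B

    def all_witnesses(masses, M):
        """All compomers c with mass(c) == M, by backtracking through B."""
        B = build_table(masses, M)
        out = []
        def rec(i, m, suffix):
            if i == 0:
                if m % masses[0] == 0:
                    out.append((m // masses[0],) + suffix)
                return
            c = 0
            while c * masses[i] <= m:      # try c_i = 0, 1, 2, ...
                if B[i-1][m - c * masses[i]]:
                    rec(i - 1, m - c * masses[i], (c,) + suffix)
                c += 1
        rec(len(masses) - 1, M, ())
        return out

    print(all_witnesses([2, 3, 5], 10))
    # [(5, 0, 0), (2, 2, 0), (1, 1, 1), (0, 0, 2)]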
Mass decomposition (6)

For the Existence and One Witness Problems, it suffices to construct a
one-dimensional Boolean table A of size M, using the recursion A[0] = 1,
A[m] = 0 for 1 ≤ m < a_1; and for m ≥ a_1, A[m] = 1 if there exists an i with
1 ≤ i ≤ k such that A[m − a_i] = 1, and 0 otherwise. The construction time is
O(kM), and one witness c can be produced by backtracking in time proportional
to |c|, which can be in the worst case (1/a_1) M. Of course, both of these
problems can also be solved using the full table B.

A variant computes γ(M), the number of decompositions of M, in the last row,
where the entries are integers, using the recursion

    C[i, m] = C[i−1, m] + C[i, m − a_i] .

The running time for solving the All Witnesses Problem is O(kM) for the table
construction, and O(γ(M) · (1/a_1) M) for the computation of the witnesses
(where γ(M) is the size of the output set), while storage space is O(kM).
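The counting variant fits in a few lines as well; a sketch with the same toy
alphabet (the 2-D table C is rolled into a single 1-D array, a standard space
optimization):

    def count_decompositions(masses, M):
        """gamma(M): number of compomers of mass M, via the recursion
        C[i, m] = C[i-1, m] + C[i, m - a_i]."""
        counts = [0] * (M + 1)
        counts[0] = 1
        for a in masses:                   # outer loop = index i
            for m in range(a, M + 1):
                counts[m] += counts[m - a]
        return counts[M]

    print(count_decompositions([2, 3, 5], 10))  # 4, matching the witnesses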
Mass decomposition (7)

The number of compomers is O(M^(k−1)). (Exercise: why?) Depending on the mass
resolution, the results can be useful for M up to, say, 1000 Da, but in general
one has to take further criteria into account. (Figure from [BL05].)
Mass decomposition (8)
Example output from http://bibiserv.techfak.uni-bielefeld.de/decomp/
# imsdecomp 1.3
# Copyright 2007,2008 Informatics for Mass Spectrometry group
# at Bielefeld University
#
# http://BiBiServ.TechFak.Uni-Bielefeld.DE/decomp/
#
# precision: 4e-05
# allowed error: 0.1 Da
# mass mode: mono
# modifiers: none
# fixed modifications: none
# variable modifications: none
# alphabet (character, mass, integer mass):
# H 1.007825 25196
# C 12 300000
# N 14.003074 350077
# O 15.994915 399873
# P 30.973761 774344
# S 31.972071 799302
# constraints: none
# chemical plausibility check: off
#
# Shown in parentheses after each decomposition:
# - actual mass
# - deviation from actual mass
#
# mass 218.03 has 1626 decompositions (showing the best 100):
H2 C6 N8 O2 (218.03007; +7.1384e-05)
H8 C7 N1 O7 (218.03008; +7.6606e-05)
H13 C2 N5 O1 P1 S2 (218.02991; -8.7014e-05)
H139 O1 P2 (218.03012; +0.000117068)
H19 C1 N1 O3 P3 S1 (218.02985; -0.000151322)
H16 N3 O4 S3 (218.03029; +0.000293122)
H5 C2 N9 O2 P1 (218.03038; +0.00038199)
H11 C3 N2 O7 P1 (218.03039; +0.000387212)
H10 C6 N4 O1 S2 (218.0296; -0.00039762)
H16 C5 O3 P2 S1 (218.02954; -0.000461928)
H16 C3 N3 P4 (218.02947; -0.000531458)
H21 C2 N1 P1 S4 (218.02944; -0.000556018)
H8 N7 O5 S1 (218.03076; +0.000762126)
H14 C1 O10 S1 (218.03077; +0.000767348)
H18 C5 O1 P4 (218.03081; +0.000811196)
H13 C7 N2 P3 (218.02916; -0.000842064)
H18 C6 S4 (218.02913; -0.000866624)
H12 C8 N1 O2 S2 (218.03095; +0.000945034)
H9 C1 N5 O6 P1 (218.02904; -0.000955442)
H3 N12 O1 P1 (218.02904; -0.000960664)
H21 C1 N1 O1 P5 (218.03112; +0.001121802)
H10 C11 N1 P2 (218.02885; -0.00115267)
H137 O3 S1 (218.02884; -0.001156056)
H15 C4 N2 O2 P1 S2 (218.03126; +0.00125564)
H6 C5 N4 O6 (218.02873; -0.001266048)
C4 N11 O1 (218.02873; -0.00127127)
H4 C8 N5 O3 (218.03141; +0.001414038)
H17 C1 N1 O5 P1 S2 (218.02858; -0.001424446)
H11 N8 P1 S2 (218.02857; -0.001429668)
H7 C15 P1 (218.02854; -0.001463276)
H135 C3 N1 S1 (218.03152; +0.00152403)
H134 C2 N2 P1 (218.02846; -0.001536192)
H18 N3 O2 P2 S2 (218.03157; +0.001566246)
H12 C1 N7 S3 (218.03163; +0.001630554)
H18 C2 O5 S3 (218.03164; +0.001635776)
H7 C4 N6 O3 P1 (218.03172; +0.001724644)
H14 C5 O5 S2 (218.02826; -0.001735052)
H8 C4 N7 S2 (218.02826; -0.001740274)
H14 C3 N3 O2 P2 S1 (218.0282; -0.001804582)
H131 C6 N1 (218.02815; -0.001846798)
H11 C12 P1 S1 (218.03191; +0.001907552)
H10 N7 O3 P2 (218.03204; +0.00203525)
H16 C1 O8 P2 (218.03204; +0.002040472)
H4 C1 N11 O1 S1 (218.0321; +0.002099558)
H10 C2 N4 O6 S1 (218.0321; +0.00210478)
H11 C7 N2 O2 P1 S1 (218.02788; -0.002115188)
H14 C8 N1 P2 S1 (218.03222; +0.002218158)
H13 N1 O10 P1 (218.02771; -0.002292874)
H8 C11 N1 O2 S1 (218.02757; -0.002425794)
H22 C3 S5 (218.0325; +0.002504204)
H17 C4 N2 P3 S1 (218.03253; +0.002528764)
H10 C4 O10 (218.0274; -0.00260348)
H4 C3 N7 O5 (218.02739; -0.002608702)
H6 C10 N2 O4 (218.03276; +0.002756692)
H20 N3 P4 S1 (218.03284; +0.00283937)
H20 C2 O3 P2 S2 (218.03291; +0.0029089)
H14 C3 N4 O1 S3 (218.03297; +0.002973208)
H9 C6 N3 O4 P1 (218.03307; +0.003067298)
H12 C3 N3 O4 S2 (218.02692; -0.003077706)
H12 C1 N6 O1 P2 S1 (218.02685; -0.003147236)
H18 N2 O3 P4 (218.02679; -0.003211544)
H12 C2 N4 O4 P2 (218.03338; +0.003377904)
H6 C3 N8 O2 S1 (218.03344; +0.003442212)
H12 C4 N1 O7 S1 (218.03345; +0.003447434)
H9 C5 N5 O1 P1 S1 (218.02654; -0.003457842)
H15 C4 N1 O3 P3 (218.02648; -0.00352215)
H20 C1 N2 P2 S3 (218.02638; -0.00361624)
H15 N2 O7 P1 S1 (218.03376; +0.00375804)
H6 C9 N4 O1 S1 (218.02623; -0.003768448)
H12 C8 O3 P2 (218.02617; -0.003832756)
H17 C5 N1 P1 S3 (218.02607; -0.003926846)
H8 C2 N3 O9 (218.02605; -0.003946134)
H2 C1 N10 O4 (218.02605; -0.003951356)
H2 C11 N6 (218.03409; +0.004094124)
H22 C2 O1 P4 S1 (218.03418; +0.004182024)
H14 C9 S3 (218.02576; -0.004237452)
H16 C5 N1 O2 S3 (218.03432; +0.004315862)
H5 C7 N7 P1 (218.0344; +0.00440473)
H11 C8 O5 P1 (218.03441; +0.004409952)
H10 C1 N6 O3 S2 (218.02558; -0.00442036)
H16 N2 O5 P2 S1 (218.02552; -0.004484668)
H133 C3 O3 (218.02547; -0.004526884)
H19 C1 N2 O2 P1 S3 (218.03463; +0.004626468)
H8 C3 N8 P2 (218.03472; +0.004715336)
H14 C4 N1 O5 P2 (218.03472; +0.004720558)
H8 C5 N5 O3 S1 (218.03478; +0.004784866)
H13 C4 N1 O5 P1 S1 (218.0252; -0.004795274)
H7 C3 N8 P1 S1 (218.0252; -0.004800496)
H13 C2 N4 O2 P3 (218.02514; -0.004864804)
H18 C1 N2 O2 S4 (218.02511; -0.004889364)
H139 N1 S2 (218.03489; +0.004894858)
H17 N2 O5 P3 (218.03503; +0.005031164)
H11 C1 N6 O3 P1 S1 (218.0351; +0.005095472)
H10 C8 O5 S1 (218.02489; -0.00510588)
H4 C7 N7 S1 (218.02489; -0.005111102)
H10 C6 N3 O2 P2 (218.02482; -0.00517541)
H15 C9 P1 S2 (218.03528; +0.00527838)
H6 N6 O8 (218.02471; -0.005288788)
H21 C2 O1 P3 S2 (218.02467; -0.005333808)
H131 N5 O1 (218.03536; +0.005363862)
# done
Mass defect

The difference between the actual atomic mass of an isotope and the nearest
integral mass is called the mass defect. The size of the mass defect varies
over the periodic table. The mass defect is due to the binding energy of the
nucleus:

http://de.wikipedia.org/w/index.php?title=Datei:Bindungsenergie_massenzahl.jpg
Mass defect (2)

The mass differences of light and heavy isotopes are also not exactly multiples
of the atomic mass unit. We have

    mass(2H)  − mass(1H)  = 1.00628   ≈ 1
    mass(13C) − mass(12C) = 1.003355  ≈ 1
    mass(18O) − mass(16O) = 2.004244  ≈ 2
    mass(15N) − mass(14N) = 0.997035  ≈ 1
    mass(34S) − mass(32S) = 1.995796  ≈ 2

These differences (due to the mass defect) are subtle but become perceptible
with very high resolution mass spectrometers. (Exercise: About which resolution
is necessary?) This is currently an active field of research.
Mass defect (3)

High resolution and low resolution isotopic spectra for C15N20S4O8:

    hires.dat               lores.dat
    715.909120  100.0       716  100.0
    716.906159    7.2       717   26.8
    716.908509    3.2
    716.912479   16.1
    716.913329    0.2
    717.903189    0.2       718   22.6
    717.904919   17.6
    717.905549    0.2
    717.909520    1.1
    717.911870    0.5
    717.913360    1.6
    717.915830    1.1
    718.901960    1.3       719    5.0
    718.904310    0.4
    718.908279    2.8
    718.910399    0.1
    718.916720    0.2
    719.900719    1.1       720    1.8
    719.905319    0.2
    719.909160    0.2
    719.911629    0.2
    720.904080    0.2       721    0.2

(Figures: the isotopic pattern, and a zoom on the +2 mass peak at 718.)
Signal Processing

This exposition is based on:

Steven W. Smith: The Scientist and Engineer's Guide to Digital Signal
Processing. California Technical Publishing, San Diego, California, second
edition 1999. Available at http://www.dspguide.com/ [S99]

Wikipedia articles:
http://en.wikipedia.org/wiki/Precision_and_accuracy
http://de.wikipedia.org/wiki/Pr%C3%A4zision (Präzision)

Joachim Weickert: Image Processing and Computer Vision, lecture 11, 2004. [W04]

Christian Huber: Bioanalytik, lectures given at Saarland University, 2005.
[H05]
Outline

This lecture is an introduction to some of the signal processing aspects
involved in the analysis of mass spectrometry data. Signal processing is a
large field; here we can only give a glimpse of it.

Precision, accuracy, and resolution
Morphological filters for baseline reduction
Linear filters
Precision, accuracy, and resolution

Precision and accuracy.

The two concepts precision and accuracy are often used interchangeably in
non-technical settings, but they have very specific definitions in science,
engineering, and statistics.
Precision, accuracy, and resolution (2)

Accuracy is the degree of conformity of a measured or calculated quantity to
its actual (true) value. Precision is the degree to which further measurements
or calculations will show the same or similar results. The results of
calculations or a measurement can be accurate but not precise; precise but not
accurate; neither; or both. A result is called valid if it is both accurate and
precise.

(Figure: two targets illustrating accuracy and precision; one with high
accuracy but low precision, one with high precision but low accuracy.)
Precision, accuracy, and resolution (3)

Precision is usually characterized in terms of the standard deviation of the
measurements. (In German: Präzision.) Precision is sometimes stratified into:

Repeatability - the variation arising when all efforts are made to keep
conditions constant by using the same instrument and operator, and repeating
during a short time period.

(In German: innere Genauigkeit einer Messung, formerly also
Wiederholgenauigkeit - the stability of the measuring device or its readout
during the measurement itself; it is determined by error and adjustment
calculations after repeating the measurement many times under the same
conditions and with the same measuring device or system.)

Reproducibility - the variation arising using the same measurement process
among different instruments and operators, and over longer time periods.

(In German: äußere Genauigkeit einer Messung - the scatter of the measurements
when they are repeated under varying external conditions.)
Accuracy is related to the difference (bias) between the mean of the
measurements and the reference value. Establishing and correcting for bias is
the task of calibration, which corrects for systematic errors of the
measurement. Of course this requires that a measurement of higher accuracy, or
another source for the true value, is available!

(In German: absolute Genauigkeit einer Messung - the degree of agreement
between the displayed and the true value.)

When deciding which name to call the problem, ask yourself two questions:

First: Will averaging successive readings provide a better measurement?
If yes, call the error precision; if no, call it accuracy.

Second: Will calibration correct the error?
If yes, call it accuracy; if no, call it precision.
Precision, accuracy, and resolution (4)
Resolution is yet another concept that is closely related to precision.
Precision, accuracy, and resolution (5)

The accuracy (in German: Massengenauigkeit) is often reported using a
pseudo-unit of ppm (parts per million): Let

    Δm [u] = m_measured − m_theoretical .

Then

    Δm/m [ppm] = (m_measured − m_theoretical) / m_theoretical · 10^6 [ppm] .

Like accuracy, the resolution is a dimensionless number.
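As a trivial sketch, the ppm formula in code, applied to one of the deviations
from the decomposition example earlier:

    def mass_accuracy_ppm(m_measured, m_theoretical):
        """Relative mass deviation in parts per million."""
        return (m_measured - m_theoretical) / m_theoretical * 1e6

    print(mass_accuracy_ppm(218.03007, 218.03))   # ~0.32 ppm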
Morphological filters

Mathematical morphology is a relatively new branch of mathematics. It was
founded around 1965 at the Ecole Normale Superieure des Mines in Fontainebleau
near Paris (according to [W04]).

Morphological methods only take into account the level sets

    L_i(f) := { (x, y) | f(x, y) ≥ i }

of an image f.

Hence the results are invariant under all strictly monotonous transformations.
Morphological methods are well-suited for analyzing the shape of objects; this
is in fact what motivates the name morphology. The concept of level sets is
related to the rank transformation used in statistics.

Morphological filters provide a nonlinear alternative to the linear filters
which we will learn about later.
Morphological filters (2)

Why is filtering, and signal processing in general, so important for
computational mass spectrometry? A typical mass spectrum can be decomposed into
three additive terms with different frequency ranges:

Information: The real signal we are interested in, e.g. an isotopic pattern
caused by a peptide. Medium frequency.

Baseline: A broad trend, for example caused by signals from matrix ions when
MALDI is used. Very low frequency; should not change much within 5 Th.

Noise: Very high frequency, e.g. detector noise; (hopefully) not even
correlated among consecutive sample points in the raw data.
Morphological filters (3)

The top hat filter is a morphological filter which can be used for baseline
removal.

Example: The image shows the result of a (rather aggressive) baseline reduction
using the top hat filter, applied to a couple of mass spectra.

First we need to define two morphological operations: the erosion and the
dilation.
Morphological filters (4)

Erosion. Intuitively, the erosion is obtained by moving the structuring element
(in German: Strukturelement) within the area under the signal and marking the
area covered by the reference point.

Morphological filters (5)

Dilation. The dilation is defined similarly. This time we move the reference
point within the area under the signal and mark the area covered by the
structuring element. In German this is called Dilatation, but Dilation also
seems to be in use.
Morphological filters (6)

The mathematical definition can be given in very general terms, but in our
case, for the top hat filter, we will only consider a flat structuring element.
Thus the structuring element B is a symmetric interval around zero, and zero is
the reference point. Therefore the definitions are as simple as:

    Dilation: (f ⊕ B)(x) := max{ f(x − x') | x' ∈ B } .
    Erosion:  (f ⊖ B)(x) := min{ f(x + x') | x' ∈ B } .

Since for any set X, min{ x | x ∈ X } = −max{ −x | x ∈ X }, it is sufficient
to explain the algorithm for dilation (the max case).
Morphological filters (7)

Trivial algorithm

Assume that the signal to be processed is x_0, x_1, ..., x_{n−1}, the size of
the structuring element is p > 1, and we want to compute the max filter:

    y_i := max_{0 ≤ j < p} x_{i+j}    for i = 0, ..., n − p .

Then the trivial algorithm (using two nested for loops) has running time
O(np). If p is large, this can be impractical.
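For reference, the trivial max filter in Python (a direct transcription of the
formula above):

    def dilate_trivial(x, p):
        """Trivial max filter: O(n*p) comparisons."""
        return [max(x[i:i + p]) for i in range(len(x) - p + 1)]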
The HGW algorithm

Van Herk and (later, but independently) Gil and Werman found an algorithm that
requires only 3n comparisons, independently of the size of the structuring
element. In a recent paper, Gil and Kimmel improved this further to
(3/2 + (log p)/p + O(1/p)) n comparisons, but here we will explain the simpler
3n algorithm.
Morphological filters (8)

(Disclaimer: The algorithms of Van Herk and Gil-Werman have a slightly
different definition of segments, so the following images should only be
considered exact up to ±1.)

Morphological filters (9)

The idea of the van Herk-Gil-Werman (HGW) algorithm is to split the input
signal into overlapping segments of size 2p − 1, centered at
x_{p−1}, x_{2p−1}, x_{3p−1}, ... (Blackboard!)
Morphological filters (10)

Now let j be the index of the element at the center of a certain segment. The
maxima t_k of all p windows which include x_j are computed in one batch by the
HGW algorithm as follows.

First, we compute incrementally one half of the maxima towards the lower
indices:

    R_k(j) := max{ x_j, x_{j−1}, x_{j−2}, ..., x_{j−k} } ,

and we do the same towards the higher indices:

    S_k(j) := max{ x_j, x_{j+1}, x_{j+2}, ..., x_{j+k} } .

Then we merge the halves together and obtain the max filter by just one more
maximum operation:

    t_k := max{ x_{j−k}, ..., x_j, ..., x_{j+p−k−1} } = max{ R_k(j), S_{p−k−1}(j) }

for k = 1, ..., p − 2. Moreover, we have

    max{ x_{j−p+1}, ..., x_j } = R_{p−1}(j)
    max{ x_j, ..., x_{j+p−1} } = S_{p−1}(j) .
Morphological filters (11)

The HGW algorithm operates in two stages. The auxiliary tables R(j) and S(j)
are only needed for the current value of j, so they can be overwritten for the
next segment, and we will drop the index j from now on.

Preprocessing: Compute R_k and S_k from their definition. Note that
R_k = max{ R_{k−1}, x_{j−k} } and S_k = max{ S_{k−1}, x_{j+k} } for
k = 1, ..., p − 1. The preprocessing requires 2(p − 1) comparisons.

Merging: Merge the R_k and S_k together. This stage requires another p − 2
comparisons.

The procedure computes the maximum of p windows in total. Thus the amortized
number of comparisons per window is

    (2(p − 1) + (p − 2)) / p = 3 − 4/p .

If p is large, the preprocessing step requires about two comparisons per
element, while the merging step requires one more such comparison.
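A compact Python sketch of the HGW max filter. This formulation computes the
block-wise prefix and suffix maxima for all segments at once; it matches the
R/S construction above up to the exact segment bookkeeping (names are mine):

    def dilate_hgw(x, p):
        """Van Herk / Gil-Werman max filter, about 3 comparisons per sample,
        independent of the window size p. Returns y[i] = max(x[i:i+p])."""
        n = len(x)
        s = list(x)                     # prefix maxima, restarted at block
        for i in range(1, n):           # boundaries (the S_k of the text)
            if i % p != 0:
                s[i] = max(s[i - 1], x[i])
        r = list(x)                     # suffix maxima, restarted at block
        for i in range(n - 2, -1, -1):  # boundaries (the R_k of the text)
            if (i + 1) % p != 0:
                r[i] = max(r[i + 1], x[i])
        # each window [i, i+p-1] spans at most one block boundary:
        return [max(r[i], s[i + p - 1]) for i in range(n - p + 1)]

    print(dilate_hgw([1, 3, 2, 5, 4], 2))   # [3, 3, 5, 5]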
Morphological filters (12)

The erosion and dilation operations are also defined for data in two (or even
more) dimensions. The two-dimensional erosion and opening are very common
operations in image processing. They have some nice mathematical properties.

Axis-parallel rectangular structuring elements are separable, that is, erosion
and dilation can be computed by applying the one-dimensional erosion algorithm
to both dimensions one after another. As a consequence, the very efficient 1-D
algorithms can still be applied.

Convex structuring elements are scalable: Dilation / erosion with structuring
element nB is equivalent to n dilations / erosions with structuring element B.

(Proofs: Exercise.)
Morphological filters (13)

Separability

We consider the case where we have two dimensions and compute the dilation.
The proof for higher dimensions and/or erosion is similar.

Let f be the signal and B be the structuring element. We can assume that the
reference point is 0. As we have seen, the value of the dilation of f at a
certain position (x, y) is just the maximum of the original signal within the
region:

    (f ⊕ B)(x, y) = max{ f(x − x', y − y') | (x', y') ∈ B } .

Now consider the case where B is an axis-parallel (!) rectangle B = B_x × B_y.
What would be the result of applying the one-dimensional dilation to both
dimensions one after another?
Morphological filters (14)

From the first dilation we get

    max{ f(x − x', y) | x' ∈ B_x }

and after the second dilation we have

    max{ max{ f(x − x', y − y') | x' ∈ B_x } | y' ∈ B_y }
        = max{ f(x − x', y − y') | x' ∈ B_x, y' ∈ B_y }
        = max{ f(x − x', y − y') | (x', y') ∈ B }
        = (f ⊕ B)(x, y) .
Morphological filters (15)

Scalability

Assume that B is a convex structuring element, that is:

    p ∈ B ∧ q ∈ B ∧ λ ∈ [0, 1]  ⟹  (1 − λ)p + λq ∈ B .

All we need to show is that

    p_1, ..., p_n ∈ B  ⟹  p_1 + ... + p_n ∈ nB .

Now, observe that a simple induction using the convexity property implies that

    p_1, ..., p_n ∈ B  ⟹  (p_1 + ... + p_n) / n ∈ B ,

hence the assertion follows.
Morphological filters (16)

Examples from image processing

(First row: Dilation of a greyscale image (256 × 256 pixels) with a square of
side length 11 and 21. Second row: Same for erosion.)
Morphological filters (17)

As you can see from the examples, erosion and dilation shift the overall
intensity. In order to simplify the image structure while avoiding the
expansion effects of dilation, one can perform an erosion after the dilation.
The resulting operation is called closing:

    f • B := (f ⊕ B) ⊖ B .    (closing := dilation followed by erosion)

The closing removes dark details.

Similarly, we can simplify the image structure while avoiding the shrinkage
effects of erosion by performing a dilation after the erosion. The resulting
operation is called opening:

    f ∘ B := (f ⊖ B) ⊕ B .    (opening := erosion followed by dilation)

The opening removes bright details.
Morphological filters (18)

Examples from image processing

(First row: Closing of a greyscale image (256 × 256 pixels) with a square of
side length 11 and 21. Second row: Same for opening.)
Morphological filters (19)

More nice mathematical properties:

The opening and closing satisfy the following inequalities:

    f ∘ B ≤ f    (opening ≤ signal)
    f • B ≥ f    (closing ≥ signal)

Multiple openings or closings with the same structuring element do not alter
the signal any more; we have

    (f ∘ B) ∘ B = f ∘ B    (opening twice is the same as opening once)
    (f • B) • B = f • B    (closing twice is the same as closing once)

(Proof: Exercise.)
Morphological filters (20)

To show that the opening satisfies the inequality f ∘ B = (f ⊖ B) ⊕ B ≤ f, let
us assume that the contrary holds for some argument x. Let
y := ((f ⊖ B) ⊕ B)(x) > f(x). Then by the definition of ⊕ there exists an
x' ∈ B such that (f ⊖ B)(x − x') = y. But the definition of ⊖ implies that for
all x'' ∈ B it holds that f((x − x') + x'') ≥ (f ⊖ B)(x − x'). If we set
x'' = x', then f(x) ≥ (f ⊖ B)(x − x') = y > f(x), a contradiction. In the same
way, one can prove that the closing satisfies the inequality f • B ≥ f.

To show that the opening satisfies (f ∘ B) ∘ B = f ∘ B, we prove that in fact
a slightly stronger assertion holds: If g = h ⊕ B is the result of a dilation
by B, then (g ⊖ B) ⊕ B = g. (In our case, h = f ⊖ B.) By the preceding
inequality, it suffices to show that (g ⊖ B) ⊕ B ≥ g. Let x be arbitrary; then
the claim is that ((g ⊖ B) ⊕ B)(x) ≥ g(x). By the definition of ⊕, there is an
x' ∈ B such that g(x) = h(x − x'). But then by the definition of ⊕ we also
have (h ⊕ B)((x − x') + x'') ≥ h(x − x') for all x'' ∈ B. Hence by the
definition of ⊖, we have ((h ⊕ B) ⊖ B)(x − x') ≥ h(x − x'). Again by the
definition of ⊕, we have (((h ⊕ B) ⊖ B) ⊕ B)((x − x') + x'') ≥ h(x − x') = g(x)
for all x'' ∈ B. Now let x'' = x'. It follows that
(((h ⊕ B) ⊖ B) ⊕ B)(x) ≥ g(x), as claimed. In the same way, one can prove that
the closing satisfies (f • B) • B = f • B.
Morphological filters (21)

Opening and closing act as morphological lowpass filters: they remove the
small details.

We obtain a morphological highpass filter by subtracting the lowpass filtered
signal from the original signal. This is called the top hat filter. There are
two versions:

White top hat, top hat by opening:

    WTH(f) := f − (f ∘ B) .

It extracts small bright structures.

Black top hat, top hat by closing, also known as bot hat:

    BTH(f) := (f • B) − f .

It extracts small dark structures.

For mass spectra, the white top hat (top hat by opening) is useful to remove
the baseline from the raw data.
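A minimal sketch of baseline removal with the white top hat, using naive
min/max filters with clamped boundaries (a production version, e.g. in OpenMS,
would use a fast algorithm such as HGW from above; all names here are mine):

    def erode(x, p):
        """Min filter, flat structuring element of width p, clamped edges."""
        n, h = len(x), p // 2
        return [min(x[max(0, i - h):min(n, i + h + 1)]) for i in range(n)]

    def dilate(x, p):
        """Max filter, flat structuring element of width p, clamped edges."""
        n, h = len(x), p // 2
        return [max(x[max(0, i - h):min(n, i + h + 1)]) for i in range(n)]

    def white_top_hat(x, p):
        """WTH(f) = f - (f opened with B): removes the broad baseline and
        keeps the narrow, bright peaks."""
        opening = dilate(erode(x, p), p)
        return [a - b for a, b in zip(x, opening)]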
Morphological filters (22)

(Figure: input in red, erosion in green, opening in blue, top hat in yellow.
From the documentation of class OpenMS::MorphologicalFilter,
http://www.openms.de)

Morphological filters (23)

Example from image processing (figure)
Morphological filters (24)

What else can be done? A few other morphological operations exist. For
example, using structuring elements of increasing size, one can remove
small-, middle-, and coarse-scale structures step by step. This is called a
granulometry. Granulometries act as morphological bandpass filters.
Signal processing basics

A signal is a description of how one parameter varies with another parameter.
Signals can be continuous or discrete.

A system is any process that takes an input signal and produces an output
signal.

Depending on the type of input and output, one can distinguish continuous
systems, such as analog electronics, and discrete systems, such as computer
programs that manipulate the values stored in arrays.

Converting a continuous signal into a discrete signal introduces sampling and
quantization errors, due to the discretization of the abscissa (x-axis) and
ordinate (y-axis).
Signal processing basics (2)

Sampling and quantization.

For example, a time-of-flight (TOF) mass spectrometer has to wait until the
ions arrive at the detector after they have been generated and accelerated.

In a quadrupole mass analyzer, the frequency of the oscillating electrical
fields determines the filtered mass-to-charge ratio of the ions which arrive
at the detector. The mass spectrum is obtained by scanning through the mass
range of interest, which takes some time.
Signal processing basics (3)

From a theoretical point of view, sampling is necessary since we cannot store
(or process in digital form) an infinite amount of data.

The intensity values are often reported as ion counts, but this does not mean
that we do not have a quantization step here! Firstly, we should keep in mind
that these counts are the result of internal calculations performed within the
instrument (e.g., the accumulation of micro-scans). Secondly, a mass
spectrometer is simply not designed to count single ions but rather to measure
the concentration of substances in a sample.
Signal processing basics (4)

Fortunately, the quantization error is well-behaved: The digital output is
equivalent to the continuous input plus a quantization error.

In any reasonable case, the quantization error is uniformly distributed over
the interval [−1/2 LSB, +1/2 LSB]. The term least significant bit (LSB) is
jargon for the distance between adjacent quantization levels. Thus we can
easily calculate the standard deviation of the quantization error: It is
√(1/12) LSB ≈ 0.288675 LSB. (Proof: Exercise.)
Signal processing basics (5)

Let X(i) denote the quantization error of the i-th sample. We consider X as a
random variable defined on the sample numbers. It is uniformly distributed on
the interval [−Δ/2, +Δ/2], where Δ := LSB. The probability density function of
X is

    f(x) = 1/Δ   for |x| ≤ Δ/2 ,
    f(x) = 0     otherwise.

By elementary calculus, we have

    E(X) = ∫_{−Δ/2}^{Δ/2} (x/Δ) dx = (1/Δ) [x²/2]_{x=−Δ/2}^{x=Δ/2}
         = (1/Δ) (1/2) ((Δ/2)² − (Δ/2)²) = 0 .

Thus V(X) = E(X²) − E(X)² = E(X²). Again by elementary calculus, we have

    E(X²) = ∫_{−Δ/2}^{Δ/2} (x²/Δ) dx = (1/Δ) [x³/3]_{x=−Δ/2}^{x=Δ/2}
          = (1/(3Δ)) (Δ³/8 + Δ³/8) = Δ²/12 .

Hence the standard deviation of the quantization error X is Δ/√12 = LSB/√12.
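A quick Monte Carlo check of the 1/√12 law (a sketch; the grid spacing plays
the role of the LSB):

    import random

    random.seed(0)
    errors = []
    for _ in range(100_000):
        v = random.uniform(0.0, 1000.0)
        errors.append(round(v) - v)   # quantization error, LSB = 1

    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    print(var ** 0.5)                 # ~0.2887 = 1/sqrt(12)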
Linear systems

A linear system satisfies the following properties:

Homogeneity: If an input signal of x[n] results in an output signal of y[n],
then an input of k x[n] results in an output of k y[n], for any input signal x
and constant k. This kind of operation (multiplication by a scalar) is also
called scaling.

Additivity: If an input of x_1[n] produces an output of y_1[n], and a
different input, x_2[n], produces another output, y_2[n], then the system is
said to be additive if an input of x_1[n] + x_2[n] results in an output of
y_1[n] + y_2[n], for all possible input signals x_1, x_2. In words: signals
added at the input produce signals that are added at the output.

Shift invariance: A shift in the input signal will result in an identical
shift in the output signal. In formal terms, if an input signal of x[n]
results in an output of y[n], an input signal of x[n+s] results in an output
of y[n+s], for any input signal and any constant s.

Note: Shift invariance is not a strict requirement for linearity in
mathematics, but it is a mandatory property for most DSP techniques. When you
see the term linear system used in DSP, you should assume it includes shift
invariance unless you have reason to believe otherwise. Hence we will adopt
this usage of the term linear system as well.

Linear systems (2)

Linear systems have appealing mathematical properties. Let A, B be linear
systems. Then the composition of A and B, defined by

    (A ∘ B)(x)[n] := A(B(x))[n] ,

is a linear system again. Moreover, if A, B, C are linear systems, then

    (A ∘ B)(x) = (B ∘ A)(x)                (commutativity)
    ((A ∘ B) ∘ C)(x) = (A ∘ (B ∘ C))(x)    (associativity)

for all signals x, where ∘ denotes the composition of linear systems. These
properties follow immediately from the definitions. (Exercise.)
Linear systems (3)

Synthesis, decomposition and superposition

Synthesis: Signals can be combined by scaling (multiplication of the signals
by constants) followed by addition. The process of combining signals through
scaling and addition is called synthesis. For example, a mass spectrum can be
composed out of a baseline, a number of actual peaks, and white noise.

Decomposition is the inverse operation of synthesis, where a single signal is
broken into two or more additive components. This is more involved than
synthesis, because there are infinitely many possible decompositions for any
given signal.
Linear systems (4)

Superposition: Consider an input signal, called x[n], passing through a linear
system, resulting in an output signal, y[n].

Assume the input signal can be decomposed into a group of simpler signals:
x_0[n], x_1[n], x_2[n], etc. We will call these the input signal components.

Next, each input signal component is individually passed through the system,
resulting in a set of output signal components: y_0[n], y_1[n], y_2[n], etc.
These output signal components are then synthesized into the output signal,
y[n].

Linear systems (5)

(Figure: a signal is decomposed into components, each component passes through
the system individually, and the outputs are synthesized again.)

Linear systems (6)

The trivial, but important, observation here is:

The output signal obtained by superposition of the components is identical
to the one produced by directly passing the input signal through the system.

Thus instead of trying to understand how complicated signals are changed by a
system, all we need to know is how simple signals are modified.

There are two main ways to decompose signals in signal processing: impulse
decomposition and Fourier decomposition.
Fourier methods

Fourier decomposition

The Fourier decomposition (named after Jean Baptiste Joseph Fourier
(1768-1830), a French mathematician and physicist) uses cosine and sine
functions as component signals:

    c_k[i] := cos(2πki/N)   and   s_k[i] := sin(2πki/N) ,

where k = 0, ..., N/2 and N is the number of sample positions. (It turns out
that s_0 = s_{N/2} = 0, so there are in fact only N components.)

Fourier based methods perform best for periodic signals. They are less
suitable for peak-like signals, e.g. mass spectra or elution profiles. We will
not go into further details here. Fourier theory is important in the analysis
of FTICR mass spectrometers (FTICR = Fourier transform ion cyclotron
resonance) and Orbitrap mass spectrometers.
Linear systems (7)

Impulse decomposition

The impulse decomposition breaks an N-sample signal into N component signals,
each containing N samples. Each of the component signals contains one point
from the original signal, with the remainder of the values being zero. Such a
single nonzero point in a string of zeros is called an impulse.

The delta function is a normalized impulse. It is defined by

    δ[n] := 1 for n = 0 ,
    δ[n] := 0 otherwise.

The impulse response of a linear system, usually denoted by h[n], is the
output of the system when the input is a delta function.
Linear systems (8)

How can we express the response of a linear system in closed form? Let us put
together the pieces:

The input signal can be decomposed into a set of impulses, each of which can
be viewed as a scaled and shifted delta function.

The output resulting from each impulse is a scaled and shifted version of the
impulse response.

The overall output signal can be found by adding these scaled and shifted
impulse responses.

Thus if we know a system's impulse response, then we can calculate what the
output will be for any possible input signal.
Linear systems (9)

We obtain the standard equation for convolution: If x[n] is an N point signal
running from 0 to N − 1, and the impulse response h[n] is an M point signal
running from 0 to M − 1, then the convolution of the two,

    y[n] = x[n] ∗ h[n] ,

is an N + M − 1 point signal running from 0 to N + M − 2, given by:

    y[i] = Σ_{j=0}^{M−1} h[j] · x[i − j] .

This equation is called the convolution sum.

If the system being considered is a filter, then the impulse response is also
called the filter kernel, the convolution kernel, or simply, the kernel. (In
image processing, the impulse response is called the point spread function.)
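The convolution sum translates directly into code; a sketch with a three-point
moving average kernel, as used in the illustration that follows:

    def convolve_sum(x, h):
        """y[i] = sum_j h[j] * x[i-j]; output has length N + M - 1."""
        N, M = len(x), len(h)
        y = [0.0] * (N + M - 1)
        for i in range(len(y)):
            for j in range(M):
                if 0 <= i - j < N:
                    y[i] += h[j] * x[i - j]
        return y

    # an impulse smeared out by an unweighted three-point moving average:
    print(convolve_sum([0, 0, 3, 0, 0], [1/3, 1/3, 1/3]))
    # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]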
Linear systems (10)

(Figures: moving average filter, Gaussian filter.)

Linear systems (11)

The following illustration shows the contributions of the impulse responses in
different colors. A moving average kernel was used that takes an unweighted
average over three consecutive signal positions.

Linear systems (12)

And this shows the convolution with a Gaussian kernel.
Wavelets

So far we have only seen kernels without negative numbers. If we allow
negative samples in the impulse response, we can construct linear filters that
accentuate peaks instead of dilating them. This also leads us to the theory of
wavelets.

A very commonly used wavelet is the Marr wavelet, also called Mexican hat for
obvious reasons. It has the density function

    ψ(x) = (1 − x²) exp(−x²/2) ,

which is essentially the negative second derivative of the normal distribution
density:

    ψ(x) = (1 − x²) exp(−x²/2) = −(d²/dx²) exp(−x²/2) .
Wavelets (2)

To adjust the width of the Marr wavelet, one introduces a scaling parameter a.
Thus the impulse response would be ψ(x/a)/√a.
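A sketch of the scaled Marr wavelet as a filter kernel (the sampling grid and
the names are my own choices):

    import math

    def marr(x, a=1.0):
        """Marr ('Mexican hat') wavelet, scaled as psi(x/a)/sqrt(a)."""
        t = x / a
        return (1.0 - t * t) * math.exp(-t * t / 2.0) / math.sqrt(a)

    # sample the kernel on a grid; convolving a spectrum with it (e.g. using
    # convolve_sum above) suppresses both baseline and high-frequency noise
    kernel = [marr(-5.0 + 0.1 * i, a=1.0) for i in range(101)]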
Wavelets (3)

Using wavelets, we can solve two problems at the same time:

The integral of a wavelet is zero. Therefore a constant baseline has no
influence on the output.

High frequency noise is also filtered out.
Wavelets (4)

The following figures show that the Marr wavelet does a favorable job at
detecting the maximum positions of the isotopic peaks.

(In plots A, B, C, and D the x-axis represents the mass interval between 2230
Da and 2250 Da, and the y-axis shows the intensity. A: Part of a MALDI mass
spectrum. B, C, and D: the continuous wavelet transform of the spectrum using
a Marr wavelet with different dilation values a (B: a = 3, C: a = 0.3,
D: a = 0.06).)
Wavelets (5)

(Figure: a comparison of filter kernels: moving average, Gauss, Marr.)

Wavelets (6)

Even better results can be obtained if we model the whole isotopic pattern by
a wavelet.
Wavelets (7)

(Figure: a mass spectrum (m/z 580-598) and its isotope wavelet transform for
charges one to three; one panel per transformed signal with mother wavelets of
charge 1, 2, and 3. The strongly oscillating part of the transformed signal is
a peptide of charge 3. The remaining peaks are non-peptidic compounds or
noise.)
Averagines

But how can we know the isotopic pattern to use for the wavelet?

If one plots the atomic content of proteins in some protein database (e.g.
SwissProt), it becomes evident that the number of atoms of each type grows
roughly linearly with the molecular weight.

(Figure: molecular weight (x-axis, 0-5000 Da) against the number of atoms of
each type (y-axis, 0-300); series "C stats", "C mean", "N stats", "N mean".)
Averagines (2)

Since the number of C, N, and O atoms grows about linearly with the mass of
the molecule, it is clear that the isotope pattern changes with mass:

    mass    [0]    [1]    [2]    [3]    [4]
    1000    0.55   0.30   0.10   0.02   0.00
    2000    0.30   0.33   0.21   0.09   0.03
    3000    0.17   0.28   0.25   0.15   0.08
    4000    0.09   0.20   0.24   0.19   0.12

Since there is a very nice linear relationship between peptide mass and its
atomic composition, we can estimate the average composition for a peptide of a
given mass (see the sketch below).

Given the atomic composition of a peptide, we can compute the relative
intensities of its peaks in a mass spectrum, the isotopic pattern.

We can use this knowledge for feature detection, i.e. to summarize isotopic
patterns into peptide features and to separate them from noise peaks.
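One common way to do this is the "averagine" model of Senko et al. (1995),
which is not spelled out on these slides; the coefficients below are the
average elemental composition per 111.1254 Da of peptide mass from that paper,
so treat this as an assumption-laden illustration rather than the method of
the lecture:

    # Averagine (Senko et al. 1995): average atoms per 111.1254 Da of peptide.
    # These coefficients are taken from the literature, not from the slides.
    AVERAGINE = {'C': 4.9384, 'H': 7.7583, 'N': 1.3577, 'O': 1.4773,
                 'S': 0.0417}
    AVERAGINE_MASS = 111.1254

    def estimate_composition(mass):
        """Estimated (fractional) atom counts for a peptide of this mass."""
        f = mass / AVERAGINE_MASS
        return {el: n * f for el, n in AVERAGINE.items()}

    print(estimate_composition(2000.0))  # roughly C89 H140 N24 O27 S0.75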
Averagines (3)

Below is a spectrum with interleaved peptide peaks and noise. In this example,
real peptide peaks are solid or dashed; noise is dotted. Peaks 1 to 4 belong
to two interleaved features of charge 1 with molecular weights of 1001 and
1001.5 Da, respectively.

Thanks!

That's all for today!