
International Journal for Parasitology 35 (2005) 543–553 www.parasitology-online.com

Invited review

Analysing proteomic data


J. Barrett*, P.M. Brophy, J.V. Hamilton
Institute of Biological Sciences, University of Wales, Penglais, Aberystwyth, Ceredigion, Wales SY23 3DA, UK Received 25 August 2004; received in revised form 10 January 2005; accepted 12 January 2005

Abstract

The rapid growth of proteomics has been made possible by the development of reproducible 2D gels and biological mass spectrometry. However, despite technical improvements, 2D gels are still less than perfectly reproducible and gels have to be aligned so that spots for identical proteins appear in the same place. Gels can be warped by a variety of techniques to make them concordant. When gels are manipulated to improve registration, information is lost, so direct methods for gel registration which make use of all available data for spot matching are preferable to indirect ones. In order to identify proteins from gel spots, a property or combination of properties that is unique to that protein is required. These can then be used to search databases for possible matches. Molecular mass, pI, amino acid composition and short sequence tags can all be used in database searches. Currently the method of choice for protein identification is mass spectrometry. Proteins are eluted from the gels and cleaved with specific endoproteases to produce a series of peptides of different molecular mass. In peptide mass fingerprinting, the peptide profile of the unknown protein is compared with theoretical peptide libraries generated from sequences in the different databases. Tandem mass spectroscopy (MS/MS) generates short amino acid sequence tags for the individual peptides. These partial sequences combined with the original peptide masses are then used for database searching, greatly improving specificity. Increasingly, protein identification from MS/MS data is being fully or partially automated. When working with organisms which do not have sequenced genomes (the case with most helminths), protein identification by database searching becomes problematical. A number of approaches to cross-species protein identification have been suggested, but if the organism being studied is only distantly related to any organism with a sequenced genome then the likelihood of protein identification remains small. The dynamic nature of the proteome means that there really is no such thing as a single representative proteome, and a complete set of metadata (data about the data) is going to be required if the full potential of database mining is to be realised in the future. © 2005 Australian Society for Parasitology Inc. Published by Elsevier Ltd. All rights reserved.

Keywords: Proteomics; Biological mass spectrometry; 2D gels; Database searching; Peptide mass fingerprinting; Tandem mass spectroscopy; Cross-species identification

1. Introduction

It is proteins, not genes, that are responsible for an organism's phenotype. Post-translational protein modifications, vital for the correct functioning of proteins, are not directly coded for in the genome and, in cells, there is little correlation between mRNA levels and protein levels. The proteome is the collective term for all of the proteins produced from the genetic material in the cell, in the same way that the genome is the collective term for all of the genetic material in a cell. Proteomics encompasses two distinct areas of research, expression proteomics and functional proteomics. Expression proteomics is concerned with establishing quantitative maps of protein expression under specific physiological or developmental conditions, whilst functional proteomics focuses on the role of individual proteins and their interactions with other ligands (including other proteins). The rapid development of proteomics has been made possible by two technical advances: first, the development of immobilised pH gradient (IPG) strips, which has led to reproducible electrophoresis gels, and second, advances in biological mass spectrometry coupled with database searches for protein identification.

* Corresponding author. Tel.: +44 1970 622 315; fax: +44 1970 622 350. E-mail address: jzb@aber.ac.uk (J. Barrett).

0020-7519/$30.00 © 2005 Australian Society for Parasitology Inc. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.ijpara.2005.01.013


2. Comparing gels

2.1. Gel registration

Despite technical improvements, two-dimensional electrophoresis (2DE) remains an imperfect technique and geometric distortions of protein patterns are inherent in the casting, polymerisation and running procedures of gels. Inevitably gels are less than perfectly reproducible and the matching step to align gels is often the bottleneck in gel analysis. Dual or triple labelling of protein samples with covalently linked fluorescent dyes prior to electrophoresis, combined with multi-spectral imaging, can enable two or three proteomes to be compared on the same gel (Spibey et al., 2001). However, there still remains the problem of day-to-day variability. Consequently, it is necessary to align any new gel, and make it concordant with a reference gel, so that all the spots from identical proteins are in the same place (Dowsey et al., 2003). Relative spot intensities can then be compared, new (or missing) spots detected and mass/charge shifts noted. There are two broad approaches to registration, feature-based and direct methods. Feature-based approaches begin with a preliminary stage of feature extraction (in this case, spot identification) and then compute registration from the feature list. An early example of this approach was Flicker (http://www-lmmb.ncifcrf.gov/flicker/), where two gels are compared by rapidly switching between the two pictures; as the image registration is improved the spots appear to move less. Direct methods recover the registration directly from the intensity data without going through the intermediate, time-consuming and often error-prone step of feature detection. Raw images of gels contain a wealth of information that is lost in gels which have been manipulated to improve registration (synthetic gels), so direct methods which use all available data should offer more accurate and reliable matching; moreover, they can be fully automated.
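As a rough illustration of what "recovering the registration directly from the intensity data" can mean, the sketch below searches for the whole-pixel translation that minimises the mean squared intensity difference between a reference gel image and a new one. It is a deliberately simplified assumption: real direct methods estimate smooth deformable warps rather than a single global shift, and the function name `best_shift` and the list-of-lists image format are hypothetical choices made for this example only.

```python
def best_shift(reference, image, max_shift=3):
    """Return the integer (dy, dx) for which image[y + dy][x + dx] best
    matches reference[y][x], judged by mean squared intensity difference
    over the overlapping region. Images are equal-sized lists of rows."""
    rows, cols = len(reference), len(reference[0])
    best = None  # (mean squared error, dy, dx)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0.0, 0
            for y in range(rows):
                for x in range(cols):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy < rows and 0 <= sx < cols:
                        diff = reference[y][x] - image[sy][sx]
                        total += diff * diff
                        count += 1
            if count == 0:
                continue  # no overlap at this shift
            mse = total / count
            if best is None or mse < best[0]:
                best = (mse, dy, dx)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy 8 x 8 "gels" with a single spot, displaced in the second image.
    ref = [[0.0] * 8 for _ in range(8)]
    new = [[0.0] * 8 for _ in range(8)]
    ref[3][3] = 1.0
    new[4][5] = 1.0
    print(best_shift(ref, new))
```

No spot detection is needed here: every pixel contributes to the match, which is the property that distinguishes direct from feature-based registration.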
Both approaches then use deformable models to warp the experimental image to the reference. Methods to improve gel warping have been proposed by Salmi et al. (2002). Similarity between matched gels can then be judged by any of the current methods for computing multivariate distances between digital images. One method that is frequently used for this is principal component analysis (Manly, 1994; Geladi and Grahn, 1996). Most commercial proteome software uses a feature-based approach to gel matching, with the user specifying a number of corresponding spots present in each gel. This selection is done either manually or semi-automatically. Direct matching usually uses a segmentation-based approach which divides the gel picture up into areas, often by contour following or by region growing or other watershed methods (Bettens et al., 1997; Meyer et al., 1997; Pleissner et al., 1999; see also http://gelmatching.inf.fu-berlin.de). Gustaffson et al. (2002) divided gel matching into two phases, an initial global correction for smooth,

non-spot-specific distortions, where the whole image is examined, and a second more traditional spot-specific matching step. Among the direct methods which have been developed, Smilansky (2001) used a shift vector list approach to gel warping (http://www.2dgels.com). Veeser et al. (2001) have developed a method based on a gradient descent algorithm to dictate warping of a multi-resolution grid (http://vip.doc.ic.ac.uk/sd-gel); others have developed one-step registration methods using genetic algorithms (Jacq and Roux, 1995). More recently a non-segmented method based on complex wavelet transforms has been proposed (Woodward et al., 2004). Fully automated spot matching has the advantage of removing subjective error and can be adapted for high throughput. However, it is often not very efficient at resolving overlapping spots, and small differences in position due to post-translational modifications can be lost. Spot shapes can also become distorted during gel matching, so quantification is usually done before the gel-matching step. Another problem that can arise with automated spot matching is resolving spots which have very different intensities, or spot clusters that are so heavily saturated that there are no observable boundaries between the spots. To prevent the masking of minor spots, very abundant proteins can sometimes be removed from the sample prior to gel electrophoresis by specific adsorption (e.g. serum albumin and IgG, Lee and Lee, 2004). In the end, it is the user who has to decide whether spots correspond to each other or not!

2.2. Protein quantification

There is no equivalent to PCR for protein amplification, so detection of low abundance proteins remains a challenge and samples must contain sufficient protein for analysis. Another limitation in proteomics is the limited dynamic range of most protein detection methods. In the cell, the concentrations of the different proteins span at least 5–6 orders of magnitude, and 12–13 orders of magnitude in extra-cellular fluid.
Coomassie blue has a linear range of approximately two orders of magnitude and silver stain four orders of magnitude. Fluorescent stains such as Sypro Ruby have linear ranges of about three orders of magnitude, whilst fluorescent dyes which are covalently linked to the proteins before electrophoresis (e.g. the Cy dyes) have a linear range greater than four orders of magnitude. As well as the linear range of the dye, the limitations of the linear range of the scanner may also need to be considered. Most scanners are linear up to five orders of magnitude and newer models are improving on this. Different proteins have different staining characteristics, so the relationship between staining intensity and concentration is not the same for different proteins. The limit of detection for proteins using silver staining is approximately 1 ng (20 fmol for a 50 kDa protein), the lower limit for fluorescent stains is similar, 125 pg is claimed for fluorescent Cy dyes, and 100 pg for Lightening Fast, a dye


which contains the fluorophore epicocconone (Macintosh et al., 2003). The lower limit for Coomassie staining is about 10 ng protein. So different applications may require different stains. Of the possible methods available for protein quantification (spot area, optical density, relative %optical density, integrated optical density, relative %integrated optical density), relative %integrated optical density (vol%) seems to provide the best estimate of protein amount. Dutt and Lee (2001) have proposed scaled volume (SV) as a more accurate measure as this accounts for more background effects and artefacts:

\[
\mathrm{SV} = \frac{\%\mathrm{Vol}_{\text{spots of interest}}}{\left(\%\mathrm{Vol}_{\text{gel background}} - \%\mathrm{Vol}_{\text{spots not of interest}}\right) / \left(\mathrm{Area}_{\text{gel background}} - \mathrm{Area}_{\text{spots of interest}}\right)}
\]

This combines %volume and area. For fluorescent dyes hypospectral imaging (Woodward et al., 2001) may increase the sensitivity by more than three orders of magnitude. In this approach, dye fluorescence is measured at a range of excitation/emission wavelengths (instead of the standard pairs) and the resulting mass of data analysed by multivariate statistics. Estimates of the coefficient of variation for relative protein volume range from 16 to 39% for analytical variation (replicates of the same sample) and from 22 to 55% for biological variation (replicate samples) (Asirvatham et al., 2002; Molloy et al., 2003). This suggests that, in general, changes in protein volume must be at least two-fold for the results to be statistically significant. Protein volumes (or, following suitable transformations, %protein volumes) can be compared by standard t-tests, one-way analysis of variance (ANOVA) or Mann–Whitney tests. Internal standards can be used to correct for run-to-run variations, but proteomic measurements still remain essentially relative; measurements of absolute abundance are currently not feasible. Spot quality (Garrels, 1989) can be a useful parameter to evaluate gel quality and to set thresholds. Spot quality is a number between zero and 100 based on five attributes of each spot: Gaussian fit, x-stretching, y-stretching, overlap with other spots and whether the peak intensity value is within the linear range of the scanner. Spots tend to have symmetrical diffusion in the pI dimension but often show tailing in the mol. wt direction. Inevitably when duplicate gels are run, there are a number of orphan spots (spots which are present on some gels, but absent from others). These orphan spots may be artefacts from background staining or the result of poorly resolved overlapping spots, or they may represent genuine differences between duplicate samples. Careful examination of orphans is essential to determine whether or not they should be included in any subsequent analysis. A spot quality value of 50 is sometimes used as a threshold for spot inclusion in analyses.

2.3. Comparing total proteomes

Matrices showing relative spot intensities under different experimental or pathological conditions can be globally investigated using multivariate techniques such as correspondence analysis, hierarchical clustering or partial least squares regression (Schmidt et al., 1995; Jessen et al., 2002) or by supervised learning techniques such as neural networks or genetic programming (Kell, 2002). These methods enable protein spots that, individually or in combination with others, vary in relation to the experimental conditions to be identified. This can lead to the discovery of potential relationships between proteins as well as the identification of marker proteins for pathological or other conditions.

2.4. Theoretical proteomes

For organisms with fully sequenced genomes, it is possible to construct theoretical proteomes based on the theoretical pI and molecular mass of all of the predicted proteins in the genome (Campbell et al., 2001). A number of websites are available to do this (e.g. JVirgel at http://www.jvirgel.de/). Theoretical proteomes are interesting from the point of view of protein evolution. The overall pattern of theoretical 2D gels seems to be strongly conserved across all life forms, although there may be some correlation with ecology, particularly for organisms that live in extreme environments (Knight et al., 2004). However, theoretical proteomes do not include post-translational modifications, and in vivo an organism's complete protein repertoire will never all be expressed at the same time or in the same tissue; the pattern is very much dependent on internal and external conditions.

3. Protein identification

Spot identification is one of the key steps in proteomics. In order to identify a protein, you need a property or a combination of properties which is unique to that protein (or class of proteins) and that can then be used to search databases for a match. Potential matches can be limited by using species- or tissue-specific databases or by filtering the results so that hits are restricted to proteins within a predefined molecular mass or pI range. However, specialist databases may not contain all of the relevant protein sequences, so failure to find a match in a species-specific database, for example, does not necessarily mean that a significant match will not be found in a more general database.
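The filtering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Candidate` record, the function name `filter_candidates` and the example entries are hypothetical, and the default tolerances (±20% on mass, ±0.25 pI units) are placeholders taken from the gel-based error estimates discussed in Section 3.1 rather than the behaviour of any particular search engine.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A database entry reduced to the two properties read off a 2D gel."""
    name: str
    mass_da: float  # molecular mass in Da
    pi: float       # isoelectric point


def filter_candidates(candidates, mass_da, pi, mass_tol=0.20, pi_tol=0.25):
    """Keep entries whose mass lies within +/- mass_tol (as a fraction) of
    mass_da and whose pI lies within +/- pi_tol units of pi."""
    low, high = mass_da * (1 - mass_tol), mass_da * (1 + mass_tol)
    return [
        c for c in candidates
        if low <= c.mass_da <= high and abs(c.pi - pi) <= pi_tol
    ]


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    database = [
        Candidate("protein_A", 48_000, 5.6),
        Candidate("protein_B", 52_000, 7.9),
        Candidate("protein_C", 90_000, 5.5),
    ]
    for hit in filter_candidates(database, mass_da=50_000, pi=5.5):
        print(hit.name)
```

Widening `mass_tol` or `pi_tol` trades specificity for the chance of still catching processed or modified forms of the protein.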


Table 1
The table lists some of the free protein identification tools available on the Internet (December 2004)

Search engine and URL | Protein properties
ExPASy, http://ca.expasy.org/tools/ | Comprehensive collection of protein identification tools
CombSearch, http://ca.expasy.org/tools/CombSearch/ | Enables several identification tools to be searched at the same time
AA CompIdent, http://us.expasy.org/tools/aacomp/ | Amino acid composition, molecular mass, pI, sequence tag
FASTS/FASTF, http://fasta.bioch.virginia.edu/ | Sequence tags
MASCOT, http://www.matrixscience.com | Peptide masses
MassSearch, http://cbrg.inf.ethz.ch/Server/MassSearch.html | Peptide masses
MS-BLAST, http://dove.embl-heidelberg.de/Blast2/msblast.html | Peptide sequence tags
MS-Pattern, http://128.40.158.151/mshome3.4.htm | Peptide sequence, mass
PepFrag, http://prowl.rockefeller.edu/prowl/pepfragch.html | Peptide masses
PepMapper, http://wolf.bms.umist.ac.uk/mapper/ | Peptide masses
PepSea, http://www.unb.br/cbsp/paginiciais/pepseaseqtag.htm | Peptide sequence tags
PeptideSearch, http://www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html | Peptide sequence, mass
ProbID, http://projects.systemsbiology.net/probid/ | Peptide MS/MS data
ProteinProspector, http://prospector.ucsf.edu | Peptide masses, sequence tags. Has range of tools available
PROWL (ProFound, PepFrag), http://prowl.rockefeller.edu/ | Peptide MS/MS data
Tagident, http://us.expasy.org/tools/tagident.html | Mw, pI, sequence tag

New tools are continuously being produced and the sites and URLs may change.

3.1. pI and molecular mass

The simplest information that can be obtained from a 2D gel is the approximate pI and molecular mass. These can then be used to search the databases to generate a list of proteins whose molecular mass and pI are close to the given values (Table 1). Estimation of pI and molecular mass from gels is relatively crude (error for pI estimation from gels ±0.25 units, error for estimating molecular mass ±20%; these are often the default values used in search engines), and so the results are not particularly useful. However, the results are significantly better if the molecular mass is determined by mass spectroscopy, which can achieve a mass accuracy of ±0.01%. When using molecular mass as a filter, the tolerances should be set fairly wide, since databases often contain the unprocessed protein or may have part of the sequence missing. In general, a molecular mass error of ±20% is suitable when searching with prokaryote proteins, but for eukaryotes up to ±30% is advisable. For secreted proteins that are often heavily glycosylated, +100% should be used. Similarly, a pI range of ±0.25 units is suitable for prokaryote proteins, but for unmodified eukaryotic proteins use ±0.5 units, or ±1 unit if the protein is likely to have been modified.

3.2. Amino acid composition

Total amino acid composition has been used for spot identification. This relies on different proteins having idiosyncratic amino acid compositions and can sometimes lead to the detection of weak similarities based on composition rather than sequence. In this approach, the determined amino acid composition is compared with the theoretical amino acid composition of the proteins in the databases (Table 1). The results are ranked according to a score (sum of the squared differences between

the %amino acid composition of the query protein and the database entry) with zero being a perfect match and increasing compositional differences giving larger scores. In practice, analytical variation means a perfect match is most unlikely, but a correct identification would be expected to have a score of less than 30. Determining amino acid composition requires at least 250 ng of protein, and it is useful to provide a calibration protein that can be run in parallel to compensate for the errors inherent in the analytical methods. Few amino acid analysis techniques produce composition values for all the amino acids: depending on the method of hydrolysis, cysteine and tryptophan can be destroyed, or it may be impossible to distinguish between cysteine and cystine, aspartate and asparagine or glutamate and glutamine. So when using programmes like AACompSim (http://ca.expasy.org/tools/aacsim/) the most appropriate constellation (list of amino acids to be included in the analysis) must be chosen. Determining amino acid composition is time-consuming, particularly if a large number of spots are involved, and it is not feasible for high throughput studies. Combining amino acid composition searches with filters such as molecular mass, pI and species-specific databases, or combining with short sequence tags, greatly improves the power of a search.

3.3. Amino acid sequence

The amino acid sequence is unique to each protein, and sequence analysis of whole proteins by sequential Edman degradation followed by BLAST searching is a basic technique for investigating proteins at the molecular level. A sequence of 12–15 amino acids is often sufficient for an unambiguous identification. However, Edman sequencing is too slow (and too expensive) for routine proteomics and proteins are frequently N-terminally blocked (although N-terminally blocked proteins can be cleaved, either


enzymatically or chemically and internal sequences obtained, this is even more labour intensive). Edman degradation has now been almost entirely replaced by mass spectrometry (see Section 3.4.3). For micro-organisms with relatively few proteins (500–6000), a four amino acid sequence tag can be unique, but is much less specific in metazoa with many more (20,000–50,000) different proteins. Short sequence tags of amino acids (up to six residues) can be determined relatively quickly by MS/MS (see Section 3.4.2), and can be very useful when combined with other searches (e.g. pI, molecular mass) to filter the results (Table 1).

3.4. Mass spectrometry techniques

Mass spectrometry (MS) is now the method of choice for protein identification in proteomics. It is claimed that proteins can be stored in a dry gel and still be identified by MS months or even years later (Beranova-Giorgianni, 2003). An accurate protein mass determined by MS is sometimes sufficient for identification, particularly if a subproteome is used to limit the number of potential matches (see Section 3.1). Some progress has been made in the direct analysis of proteins by MS (Meng et al., 2002; Kelleher, 2004) and a useful collection of web-based tools for MS of intact proteins can be found at https://prosightptm.scs.uiuc.edu/. As MS techniques have improved there has been less emphasis on sample preparation compared with data analysis. In some cases this has led to the isolation of large numbers of irrelevant proteins for identification. Sample preparation still remains the key to successful proteomics.

3.4.1. Peptide mass fingerprints

Protein identification by peptide mass fingerprinting (PMF) was the first practical method for protein identification using MS (Henzel et al., 1993, 2003). PMF is still quite widely used and has been employed in a number of recent parasite studies (e.g.
Eimeria tenella, Bromley et al., 2003; Echinococcus granulosus, Chemale et al., 2003; Fasciola hepatica, Jefferies et al., 2000; Bernal et al., 2004; Schistosoma mansoni, Curwen et al., 2004). The method involves the generation of peptides from the unknown protein using residue-specific endoproteases. Not all proteases are suitable and the most commonly used one is unmodified trypsin, which cleaves proteins exclusively at the C-terminal end of lysine or arginine residues (provided the next amino acid is not proline). Other endoproteases (e.g. chymotrypsin, Lys-C, Arg-C or modified trypsin) cleave at different sites and give a different peptide profile. These can be used in place of trypsin. In theory, mixtures of proteases could be used for cleaving the protein, but in practice proteases are used singly. If alternative proteases are used, their specificity should be at least as good as that of trypsin. As long as digestion by the protease is complete, that is, the molecule is cleaved at all possible sites,

endoprotease digestion will produce a series of peptides of different masses that are characteristic for that particular protein. The peptide mass profile or fingerprint of the query protein can then be compared with theoretical peptide libraries generated from protein sequence databases to produce a list of likely matches; as few as three to four masses are often enough to get a significant match (Table 1). Protein sequences in the databases containing ambiguous amino acid assignments (residue (R) = X, Z or B, where X stands for any amino acid, B for aspartate or asparagine and Z for glutamate or glutamine) cannot of course be used to generate theoretical peptides. Some PMF tools disregard signal sequences and/or propeptides before calculating theoretical peptide masses and can take into account post-translational modifications and alternative splicing to a limited extent. The range of experimental mass values from the query protein selected for analysis should be large enough to offer good discrimination. For tryptic digests, 1000–3500 Da is a suitable range. In MALDI (matrix-assisted laser desorption/ionisation mass spectroscopy) peaks less than 500 Da are obscured by the matrix and in some systems peptides >6000 Da are not used in searches. If possible the peptide masses resulting from auto-digestion of the protease should be deleted from the list before analysis, particularly if the amounts of protein are small. However, because they appear at known masses, trypsin auto-digestion peaks are often used for internal mass calibration. PMF requires a number of user-specified parameters. The most important are protease used, mass type, charge and tolerance; amino acid modifications; missed cleavages; minimum number of peptide matches required; as well as the usual molecular mass and pI range and database(s) to be searched. Trypsin is usually the default protease and some programmes do not include any alternatives.
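The generation of a theoretical peptide library from a database sequence, as described above, can be sketched as an in-silico tryptic digest followed by a mass calculation. The code below is a minimal sketch, not the procedure of any particular PMF tool: it assumes complete digestion, unmodified residues and neutral monoisotopic masses, and the function names (`tryptic_digest`, `peptide_mass`, `fingerprint`) and the example sequence are hypothetical. Sequences containing ambiguous codes (X, Z, B) are rejected, in line with the text.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per free peptide (N-terminal H + C-terminal OH)


def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass; raises ValueError on ambiguous residues."""
    try:
        return sum(RESIDUE_MASS[r] for r in peptide) + WATER
    except KeyError as err:
        raise ValueError(f"ambiguous or unknown residue: {err}") from None


def fingerprint(sequence, low=1000.0, high=3500.0):
    """Theoretical peptide masses within the mass window used for searching."""
    masses = (peptide_mass(p) for p in tryptic_digest(sequence))
    return sorted(m for m in masses if low <= m <= high)


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    for pep in tryptic_digest("MKRPAGKLLR"):
        print(pep, round(peptide_mass(pep), 4))
```

The default window in `fingerprint` follows the 1000–3500 Da range suggested above for tryptic digests; a real search would compare these values with the experimental masses within a stated tolerance.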
Whether the experimental peptide values are average or monoisotopic needs to be specified, as does the charge state (molecular mass, negative ion, positive ion or in some cases not known). The mass tolerance of the experimental peptides should reflect the accuracy of the mass spectrometer being used; typically this is 0.2 Da or 200 ppm or better. Since there is a mass-dependent error associated with mass spectrometry, relative error (ppm or %) is preferable to absolute error. Increasing the size of the error window by too much will increase the number of false positives. Proteins are usually reduced and alkylated during electrophoresis. If the alkylating agent is specified then the molecular masses of the theoretical peptides containing cysteine are modified accordingly (approximately 89% of proteins contain cysteine and 17% of their tryptic peptides contain at least one cysteine). Cysteine can also form adducts with free acrylamide monomers during gel electrophoresis to give propionamide cysteine, and methionine can become oxidised to methionine sulphoxide. These modifications can also be selected and again the masses of the theoretical peptides are duly


modified. If more than one cysteine (or methionine) occurs in a peptide, all possible combinations of the modification are calculated. Depending on the programme being used other protein modifications may be available. These may be classed as fixed, that is, applied to all theoretical peptides generated, in which case there is no computational penalty, or variable. Variable modifications are those, like the cysteine modifications described above, that may or may not be present in all peptides. Since all possible combinations of variable modifications are tested, the number of theoretical peptides to be searched increases geometrically, resulting in a sharp drop in discrimination. Consequently amino acid modifications should be selected sparingly unless there is good reason to suppose that they are present. Potential cleavage sites can be missed during incomplete proteolysis. If one missed cleavage site is selected, all possible combinations of adjacent peptides are added to the list of theoretical peptides; if two misses are selected, all possible combinations of two and three adjacent peptides are added to the list, and so on. Missed cleavages can increase the number of random matches and reduce discrimination. The number of missed cleavages selected should be kept as low as possible, preferably zero and no more than two. The default value for the minimum number of peptide matches required for a hit is usually set at four. That is, for a protein to be listed in the output, at least four of its theoretical peptides must match peptides from the unknown sample. The results of PMF searches are ordered on a probability basis (probability molecular weight search or MOWSE score). For each protein listed, the total score is given by −10 × log10(P), where P is the absolute probability that the match is random (a probability of 10^−10 thus becomes a score of 100). A high score, therefore, means that there is a low probability that the match is random (i.e. a good match) and vice versa.
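Two of the points above lend themselves to a short sketch: how allowing missed cleavages enlarges the list of theoretical peptides, and how a probability is turned into a score by −10 × log10(P). The helper names `with_missed_cleavages` and `probability_score` are hypothetical, the peptide list is a made-up complete digest, and how P itself is estimated by a given search engine is outside the scope of this example.

```python
import math


def with_missed_cleavages(peptides, max_missed):
    """Given the peptides of a complete digest in sequence order, return
    every peptide obtainable with up to max_missed missed cleavage sites,
    i.e. all runs of 1 .. max_missed + 1 adjacent peptides."""
    result = []
    for start in range(len(peptides)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(peptides):
                break
            result.append("".join(peptides[start:end]))
    return result


def probability_score(p):
    """Score = -10 * log10(P), with P the probability of a random match."""
    if not 0 < p <= 1:
        raise ValueError("P must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    digest = ["MK", "RPAGK", "LLR"]  # made-up complete digest
    for allowed in (0, 1, 2):
        print(allowed, with_missed_cleavages(digest, allowed))
    print(probability_score(1e-10))
```

Even for this three-peptide digest the candidate list doubles when two missed cleavages are allowed, which illustrates why the setting should be kept as low as possible.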
Knowing the absolute probability that a match is random and knowing the size of the database being searched, it is then possible to get an objective measure of the significance of the results. The generally accepted value for deciding if a result is significant or not is P < 0.05 (i.e. significant at the 5% level) and this is usually stated in the output. Each protein score in a PMF search is accompanied by an E-value (expectation value), which is directly equivalent to the E-value in BLAST searches. The E-value is the number of matches with that score or better that you would expect to find by chance (i.e. if the database was probed with random fragments). The value depends on the length of the query and the size of the database. Short queries are more likely to find a match than long ones and the chances of a random match are greater the bigger the database. In general, values less than 0.02 are probably significant, values between 0.02 and 1 cannot be entirely ruled out, and above one you could expect this good a match by chance alone. Occasionally, matching programmes quote z-scores. This measures how unusual a match is in terms of the mean

and standard deviation of the population scores. The higher the z-score, the greater the probability that the match has not arisen by chance. Values greater than five are usually significant; a value of zero is no better than random. Z-scores assume the data is randomly distributed, but this may not always be the case, the most obvious exception being where there are extensive protein repeats. Another useful measure in PMF is %coverage, that is, the proportion of the theoretical protein which is covered by the peptides. More than 20% coverage is likely to be significant, but it is dependent on the size of the protein, with small proteins more likely to yield higher %coverage. The actual difference between the molecular mass of the experimental peptide and the corresponding theoretical peptide, expressed in ppm, is also a useful guide when used in conjunction with %coverage. Good hits will have a low error, ideally around 20 ppm or less. Significance is a function of data quality and ideally the best match (the one with the highest MOWSE score) should be significant. However, the best match may be correct but there may not be enough mass values, or their accuracy may not be good enough, or the tolerances on the search may be set too wide to get a statistically significant result. Statistics is a useful guide, but is no substitute for careful thinking! The main drawback with PMF is the ambiguity in protein identification because of peptide mass redundancy. For example, the peptide sequence LIFWP will have the same mass as PWFIL, FILWP and so on. As a result, a sufficient number of peptides must be obtained to provide the required level of specificity, and consequently PMF is not very successful when used with protein mixtures.

3.4.2. Peptide mass tags

Tandem mass spectroscopy (MS/MS) can be used to generate short amino acid sequences from peptides using triple quadrupole mass spectrometers (for review see Graves and Haystead, 2002).
In the first stage of analysis, a peptide mass profile is produced; in the second stage, selected peptide ions are passed into a collision chamber where they interact with a collision gas (nitrogen or argon). This causes fragmentation along the peptide backbone and generates a series of fragments that differ in mass by a single amino acid. The mass differences can then be used to deduce the amino acid sequence (with the exception of leucine and isoleucine, which have identical masses). The partial amino acid sequences obtained from the MS/MS spectra (the sequence tag), combined with the original peptide masses, are then used for database searching in a similar manner to PMF.

Amino acid sequencing can also be carried out with MALDI-TOF machines using post-source decay. Peptide fragmentation patterns are, however, much less predictable by this method and so sequencing is not as reliable as when collision-induced dissociation is used.

Peptide mass tag searching is a more specific tool for protein identification than PMF (for a parasite example see
Lasonder et al., 2002). Another advantage of using MS/MS is the ability to identify proteins in mixtures. The main disadvantage is that the process is not easily automated: although computer programmes can assist in the interpretation of spectra, they are not able to make accurate assignments without some help. In addition, there is a lack of flexibility in the current search engines that use peptide mass tags. If a single mistake is made in the assignment of a y- or b-ion, the amino acid sequence will be incorrect and the search will return irrelevant proteins (see also Section 3.4.4).

3.4.3. De novo peptide sequences

As well as being used for filtering results, the short amino acid sequences (seven to eight residues) generated by MS/MS can be used on their own for database searches. Using BLAST or FASTA to analyse MS/MS peptide sequences is, however, problematic. These algorithms are designed to compare relatively long, accurate protein sequences, whereas the sequences from MS/MS are inherently redundant and error prone. Nor is it known with MS/MS in what order the sequences should be aligned (nor indeed whether the sequences are all from the same protein). One approach has been to run separate database searches for each peptide sequence and report the hits that occur most often across all of the peptides (Huang et al., 2001). However, MS-Blast and MS-Pattern (originally developed for use with Edman sequences) can accommodate the difficulties arising from incomplete sequence information and allow database searching with peptide sequence tags, as can PepFrag, PeptIdent and MS-Tag (Table 1). Two new search engines, FASTS/TFASTS and FASTF/TFASTF, have been developed for use with short peptide sequences derived from a single protein. FASTS uses multiple short peptide sequences of unknown order to find homologous sequences in protein or DNA databases by evaluating all possible arrangements of the peptides.
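The strategy attributed above to Huang et al. (2001), running a separate search for each peptide and reporting the proteins hit most often, can be sketched in a few lines. This is only an illustrative outline and not the published implementation: the function name, the peptide sequences and the protein accessions are hypothetical, and the per-peptide hit lists are assumed to have been produced already by whatever search tool is in use.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_proteins_by_peptide_support(
    hits_per_peptide: Dict[str, List[str]]
) -> List[Tuple[str, int]]:
    """Rank proteins by the number of distinct query peptides that hit them.

    hits_per_peptide maps each query peptide (hypothetical input) to the
    list of database accessions returned by a separate search with that
    peptide. A protein is counted once per peptide, however many times it
    appears in that peptide's hit list.
    """
    support = Counter()
    for accessions in hits_per_peptide.values():
        support.update(set(accessions))  # de-duplicate within one search
    # Most-supported proteins first; ties broken alphabetically.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Made-up search results for three peptides from one gel spot.
example_hits = {
    "LVNELTEFAK": ["protA", "protB"],
    "YLYEIAR": ["protA"],
    "AEFVEVTK": ["protA", "protC", "protC"],
}
ranking = rank_proteins_by_peptide_support(example_hits)
print(ranking)
```

In this made-up case protA is supported by all three peptides and so heads the list. A simple tally of this kind takes no account of the quality of each individual hit, which is one reason dedicated tools such as MS-Blast or FASTS may be preferable.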
Partial protein sequences can also be determined by conventional N-terminal Edman sequencing on unseparated mixtures of peptides (Damer et al., 1998). Mixed peptide sequencing usually gives a longer sequence read than MS/MS and is less affected by post-translational modification. However, the linear sequence of any one peptide will be unknown, and the residues at each site have to be deconvoluted. The FASTF algorithm searches with mixed peptide sequences and performs the necessary deconvolution. FASTF requires about 25% more sequence data than FASTS to achieve equivalent sensitivity (Mackey et al., 2002).

3.4.4. Multi-dimensional protein identification

Gel-free methods of proteome analysis are being developed in order to provide the high throughput needed to study proteome dynamics. One such method is multidimensional protein identification technology (MudPIT) (Link et al., 1999). In MudPIT, protein complexity is tackled at the peptide level. Protein mixtures are digested and the peptides subsequently separated and sequenced. The problem is then to find the minimum number of proteins that will account for the peptides present in the digest. SEQUEST and MASCOT can be used for MudPIT-type searches, but the number of proteins returned can be very large and robust statistical methods are needed to eliminate duplicates and junk proteins (Keller et al., 2002; Nesvizhskii et al., 2003). The drawback with the MudPIT approach is that, whilst it gives a complete list of the proteins likely to be present, there is no quantification. When comparing two samples, the analysis is therefore restricted to presence and absence; differential display is not possible. Labelling strategies based on metabolic labelling with heavy isotopes or on isotope coded affinity tags have been developed that allow relative protein quantification between two samples using MudPIT (Gygi et al., 1999). MudPIT also provides an alternative method for the analysis of proteins that are traditionally difficult to analyse on 2D gels (e.g. proteins with high pI and membrane proteins).

3.4.5. Uninterpreted MS/MS spectra

Increasingly, peptide identification from tandem mass spectra is being fully or partially automated and both MASCOT and SEQUEST accept uninterpreted product ion spectra. However, the spectra can still require curation to resolve borderline matches, and searches against unannotated or untranslated DNA databases with uninterpreted MS/MS spectra suffer from the same problems as PMF. In particular, sequencing errors, protein polymorphisms and conservative substitutions can lead to identification errors. Search algorithms for uninterpreted spectra are needed that are error-tolerant and incorporate some form of statistical scoring to order the potential protein matches.

3.5. Identification of post-translational modifications

Over 400 different types of protein modification have been reported in the literature (Garavelli, 2004) and, on average, every eukaryote protein has eight to ten post-translational variants. Direct analysis of variants requires the isolation of sufficient protein for biochemical analysis and this is not always possible. On-gel stains for some protein modifications (e.g. phosphoproteins and glycoproteins) are now becoming available. However, MS is routinely used for the identification of post-translational modifications. FindMod is an ExPASy tool that identifies post-translational modifications by comparing the experimentally determined peptide masses for a previously identified protein with the theoretical peptide masses calculated from the protein sequence. In its present form, the programme can identify 22 types of modification which have discrete mass differences, and intelligent rules can be applied to the peptide sequence to predict which amino acid has been modified. Similarly, GlycoMod (also part of the ExPASy suite) can calculate
the composition of glycans from the masses of glycopeptides or of glycans released from the peptide moiety by enzymatic or chemical means. SUMOplot (http://www.abgent.com/sumoplot.html) predicts the attachment of the 11 kDa SUMO protein (Seeler and Dejean, 2003) at one or more positions in the protein.

3.6. Cross species identification

In functional proteomics, since cellular functions are conserved across species, it is quite possible that orthologous protein complexes will share a similar structure and composition. It is, therefore, conceivable that conserved protein complexes could be initially characterised in a genome-verified organism and the information used to identify orthologous complexes in other organisms (Liska and Shevchenko, 2003).

Protein identification is based on a comparison between the properties of the query protein and protein sequences present in the databases. The majority of database entries are derived from genome sequencing programmes but, despite the explosion of sequence data in recent years, there are still relatively few complete genomes available for eukaryotes and these tend to be concentrated in one or two groups. This creates problems when working with organisms with unsequenced genomes. Searches can be carried out against EST databases, of which there are an increasing number (Kwon et al., 2003), but duplications, errors and lack of annotation restrict their usefulness. One strategy is to find a match in the relevant EST database, then to use the EST to probe other databases or to design primers to clone the gene and express the protein prior to further characterisation.

Cross species comparisons between organisms which have fully sequenced genomes suggest that amino acid composition and molecular mass are generally fairly well conserved in homologous proteins across phylogenetic boundaries (Cordwell et al., 1995; Wilkins and Williams, 1997), presumably as a result of shared domains. For a typical eukaryotic protein, the variation in molecular mass across phyla is in the region of 3-10%; for amino acid composition the difference score is usually between 20 and 30 (see Section 3.2). In contrast, pI and tryptic peptides are not well conserved, reflecting the sensitivity of these parameters to single amino acid changes.

In some cases, PMF can be used for cross species identification because only a subset of the peptides from the protein digest needs to be matched. Peptides with amino acid substitutions are not recognised and so do not contribute to the identification. To use PMF for cross species identification, at least 80% sequence identity is needed. If sequence identity is less than 70%, it is unlikely that any peptides will be conserved (Lester and Hubbard, 2002). More sophisticated algorithms for PMF matching which can tolerate a limited number of mutations have been proposed (MS-Convolution, MS-Alignment; Pevzner et al., 2001). Shevchenko et al. (2001) and Habermann et al. (2004) have developed an approach specifically to overcome the limitations of using mass spectrometer data in cross species identification. Their approach involves searching databases sequentially with peptide masses, then with uninterpreted MS/MS spectra and finally with peptide sequences (MS-Blast).
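The effect of amino acid substitutions on cross species PMF can be illustrated with a small in silico digest. The sketch below is a simplification under stated assumptions: the two sequences are invented, trypsin is modelled with the common rule of cleaving after K or R except before P, missed cleavages and modifications are ignored, and the function names and the 20 ppm tolerance are illustrative choices rather than the settings of any particular search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def ppm_error(observed, theoretical):
    """Mass difference expressed in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def matched_peptides(query_seq, database_seq, tol_ppm=20.0):
    """Query peptides whose mass matches any database peptide within tol_ppm."""
    db_masses = [peptide_mass(p) for p in tryptic_digest(database_seq)]
    return [
        pep for pep in tryptic_digest(query_seq)
        if any(abs(ppm_error(peptide_mass(pep), m)) <= tol_ppm for m in db_masses)
    ]


# Invented sequences differing by a single T -> S substitution.
database_protein = "AGLVKDEFTRSSWYK"
query_protein = "AGLVKDEFSRSSWYK"
print(tryptic_digest(query_protein))
print(matched_peptides(query_protein, database_protein))
```

In this toy case the single substitution removes one of the three query peptides (DEFSR) from the match, whereas the mass of the intact protein changes by only about 14 Da. This is the sense in which tryptic peptides are less well conserved than molecular mass.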

Table 2
Some of the primarily eukaryote 2D sites available on the internet (December 2004)

Location | Name | Organism
http://www.chemie.fu-berlin.de/user/pleiss | Heart 2-D Page | Human heart
http://www.doc.ic.ac.uk/vip/hsc-2dpage/index.html | HSC-2-D Page | Human, rat, dog
http://proteomics.cancer.dk | Danish centre for human genome research | Human, mouse, bat, bovine, dog, mink, monkey, rabbit, rat, potato
http://www.mdc-berlin.de/~emu/heart | Heart High Performance | Human + links
http://www.ludwig.edu.au/jpsl/jpslhome.html | JPSL 2D gel database | Human, mouse
http://ca.expasy.org/ch2d/ | Swiss 2D PAGE | Human, mouse, Dictyostelium, yeast, bacteria, Arabidopsis + links
http://www.bio-mol.unisi.it/2d/2d.html | Siena 2D-PAGE | Human, yeast, Caenorhabditis elegans
http://www.umh.ac.be/~biochim/BALF2D.html | BALF 2D database | Human, mouse
http://www.leelab.org/ASMSCSF/map.htm | 2DE Cornel | Human
http://www.kendricklabs.com/ | Kendrick lab | Mouse, rat, trout
http://linux.farma.unimi.it/CSPSG/2D/index.html | Milan serum group | Bovine, rat
http://www-dsv.cea.fr/thema/MitoPick/Mito2D.html | Mito-pick | Human
http://www-lecb.ncifcrf.gov/2dwgDB/ | NCI-database | Human + others
http://www.abdn.ac.uk/shprom/index.shtml | Fishprom | Salmonids
http://www.aber.ac.uk/parasitology/Proteome/Proteome.html | Aberystwyth parasitology group | C. elegans, Fasciola, Moniezia, Hymenolepis
http://www.pmma.pmfhk.cz | PMMA-2D PAGE | Human, mouse
http://gelbank.anl.gov/ | Argonne National Lab | Human, bacteria
http://bioinformatics.icmb.utexas.edu/OPD/ | Open Proteomics Database | Human, yeast, bacteria
http://oto.wustl.edu/thc/ | Washington University | Human
http://www-public.rz.uni-duesseldorf.de/~hscher/ | Toxoplasma gondii 2D Map | T. gondii

The sites and URLs listed may change.

The success of cross species identification is dependent on the content of the available databases. Protein sequence databases are not perfect: the data can contain errors (particularly from some of the earlier genomic studies) and annotations may be missing or incorrect. There is also a danger in trying to pigeonhole protein function entirely on the basis of sequence data. Examples are known where similar proteins have very different functions in different organisms (glutathione transferases acting as detoxification enzymes and as lens proteins; Tomarev and Zinovieva, 1998) and where very different proteins have similar functions (anti-freeze proteins; Barrett, 2001). If the organism being studied is only distantly related to one with a sequenced genome, then the likelihood of protein identification is small. To date, except for theoretical studies, there have been no published papers on successful cross species protein identification other than in micro-organisms (Pando et al., 2000).

4. 2D-Databases

An increasing number of 2D databases are now available via the Internet (Table 2). Some of these are simple gel images, whereas others have spot identifications linked to annotated protein databases. Increasingly, databases are federated (Appel et al., 1996), that is, the data are presented in a standard format which can easily be searched (see http://ca.expasy.org/ch2d/make2ddb.html).

The ideal proteome database is difficult to define. Unlike the genome, the proteome is a dynamic entity, changing with the physiological and developmental state of the organism and influenced by environmental conditions. The dynamic nature of the proteome means that there is really no such thing as a single representative proteome for a particular organism or tissue. In addition, differences in protein extraction methods, electrophoretic conditions and staining protocols can all influence the subsequent gel pattern. Few proteome sites give any information on how the spots were identified, provide the original MS data, or state which databases were used and when. All this necessitates a much more complete set of metadata (data about the data).

As yet there is no general agreement as to the minimum set of information about a proteomics experiment that is needed for the ideal database. Standards are in place for protein structure databases and micro-array data, but not for proteomic data (see http://www.ebi.ac.uk/Submissions/index.html). Taylor et al. (2003) have proposed a Proteomics Experiment Data Repository (PEDRo), which is intended to capture all the data from a proteomics experiment in an easily searchable form. The PEDRo model is written in the Unified Modeling Language (UML) and consists of four sections (schema) covering sample origin, sample processing, mass spectrometry and in silico identification (see http://pedro.man.ac.uk). Another proteomics standards initiative has been launched by HUPO (Human Proteome Organisation; Orchard et al., 2004, see also http://psidev.sourceforge.net/) and guidelines have been published for peptide/protein identification (Carr et al., 2004).

5. Conclusions

At present, our ability to accumulate proteome data is outstripping our ability to analyse it, and the high throughput methods of the future will require powerful data analysis tools (Vihinen, 2001). The power of databases lies in their ability to reveal patterns of association hidden across different types of data. In order to do this, integration of databases is a priority. Ideally this would involve a common ontology (a controlled, structured vocabulary). As databases develop, there remains the question of how old data can be incorporated into the new database structures, or whether they will have to be archived in some way. In the future, advances in data mining of unstructured text may provide a way forward (Buckingham, 2004). XplorMed (http://www.bork.embl-heidelberg.de/xplormed/) is a simple example of text mining and of what might be achieved. Many proteomics techniques, such as activity-based protein profiling and isotope coded affinity tags, are still at the proof of concept stage and it remains to be seen how rapidly these techniques can be applied to parasitological problems.

References
Appel, R.D., Bairoch, A., Sanchez, J.C., Vargas, R.R., Golaz, O., Pasquali, C., Hochstrasser, D.F., 1996. Federated two dimensional electrophoresis database: simple means of publishing two-dimensional electrophoresis data. Electrophoresis 17, 540-546.
Asirvatham, V.S., Watson, B.S., Sumner, L.W., 2002. Analytical and biological variances associated with proteomic studies of Medicago truncatula by two-dimensional polyacrylamide gel electrophoresis. Proteomics 2, 960-968.
Barrett, J., 2001. Thermal hysteresis proteins. Int. J. Biochem. Cell Biol. 33, 105-117.
Beranova-Giorgianni, S., 2003. Proteome analysis by two-dimensional gel electrophoresis and mass spectrometry: strengths and limitations. Trends Anal. Chem. 22, 273-281.
Bernal, D., de la Rubia, J.E., Carrasco-Abad, A.M., Toledo, R., Mas-Coma, S., Marcilla, A., 2004. Identification of enolase as a plasminogen-binding protein in excretory-secretory products of Fasciola hepatica. Fed. Eur. Biochem. Soc. Lett. 563, 203-206.
Bettens, E., Scheunders, P., VanDuck, D., Moens, L., Van Osta, P., 1997. Computer analysis of two-dimensional electrophoresis gels: a new segmentation and modelling algorithm. Electrophoresis 18, 792-798.
Bromley, E., Leeds, N., Clark, J., McGregor, E., Ward, M., Dunn, M.J., Tomley, F., 2003. Defining the protein repertoire of microneme organelles in the apicomplexan secretor of Eimeria tenella. Proteomics 3, 1553-1561.
Buckingham, S., 2004. Data's future shock. Nature (London) 428, 774-777.
Campbell, A.M., Teesdale-Spittle, P.H., Barrett, J., Liebau, E., Jefferies, J.R., Brophy, P.M., 2001. A common class of nematode glutathione S-transferase (GST) revealed by the theoretical proteome of the model organism Caenorhabditis elegans. Comp. Biochem. Physiol. B 128, 701-708.
Carr, S., Aebersold, R., Baldwin, M., Burlingame, A., Clauser, K., Nesvizhskii, A., 2004. The need for guidelines in publication of peptide and protein identification data. Mol. Cell. Proteomics 3, 531-533.
Chemale, G., Van Rossum, A.J., Jefferies, J.R., Barrett, J., Brophy, P.M., Ferreira, H.B., Zaha, A., 2003. Proteomic analysis of the parasite Echinococcus granulosus: causative agent of cystic hydatid disease. Proteomics 3, 1633-1636.
Cordwell, S.J., Wilkins, M.R., Cerpa-Poljak, A., Gooley, A.A., Duncan, M., Williams, K.L., 1995. Cross-species identification of proteins separated by two dimensional gel electrophoresis using matrix-assisted laser desorption time of flight mass spectrometry and amino acid composition. Electrophoresis 16, 438-443.
Curwen, R.S., Ashton, P.D., Johnston, D.A., Wilson, R.A., 2004. The Schistosoma mansoni soluble proteome: a comparison across four lifecycle stages. Mol. Biochem. Parasitol. 138, 57-66.
Damer, C.K., Partridge, J., Pearson, W.R., Haystead, T.A.J., 1998. Rapid identification of protein phosphatase 1-binding proteins by mixed peptide sequencing and database searching. Characterisation of a novel holoenzymic form of protein phosphatase 1. J. Biol. Chem. 273, 24396-24405.
Dowsey, A.W., Dunn, M.J., Yang, G.Z., 2003. The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 3, 1567-1596.
Dutt, J., Lee, K.H., 2001. The scaled volume as an image analysis variable for detecting changes in protein expression levels by silver stain. Electrophoresis 22, 1627-1632.
Garavelli, J.S., 2004. The RESID database of protein modifications as a resource and annotation tool. Proteomics 4, 1527-1533.
Garrels, J.I., 1989. The quest system for quantitative-analysis of two-dimensional gels. J. Biol. Chem. 264, 5269-5282.
Geladi, P., Grahn, H., 1996. Multivariate Image Analysis. Wiley, New York.
Graves, P.R., Haystead, T.A.J., 2002. Molecular biologist's guide to proteomics. Microbiol. Mol. Biol. Rev. 66, 39-63.
Gustaffson, J.S., Blomberg, A., Rudemo, M., 2002. Warping two dimensional electrophoresis gel images to correct for geometric distortion of the spot pattern. Electrophoresis 23, 1731-1744.
Gygi, S.P., Rist, B., Gerber, S.A., Turecek, F., Gelb, M.H., Aebersold, R., 1999. Quantitative analysis of complex protein mixtures using isotope coded affinity tags. Nat. Biotechnol. 17, 994-999.
Habermann, B., Oegema, J., Sunyaev, S., Shevchenko, A., 2004. The power and the limitations of cross-species protein identification by mass spectrometry driven sequence similarity searches. Mol. Cell. Proteomics 3, 238-249.
Henzel, W.J., Billeci, T.M., Stults, J.T., Wong, S.C., Grimley, C., Watanabe, C., 1993. Identifying proteins from 2-dimensional gels by molecular mass searching of peptide-fragments in protein-sequence databases. Proc. Natl Acad. Sci. USA 90, 5011-5015.
Henzel, W.J., Watanabe, C., Stults, J.T., 2003. Protein identification: the origins of peptide mass fingerprinting. J. Am. Soc. Mass Spectrosc. 14, 931-942.
Huang, L., Jacob, R.J., Pegg, S.C.-H., Baldwin, M.A., Wang, C.C., Burlingame, A.L., Babbitt, P.C., 2001. Functional assignment of the 20S proteosome from Trypanosoma brucei using mass spectrometry and new bioinformatics approaches. J. Biol. Chem. 276, 28327-28339.
Jacq, J.J., Roux, C., 1995. Registration of non-segmented images using genetic algorithm. Lect. Notes Comput. Sci. 905, 205-221.
Jefferies, R.J., Brophy, P.M., Barrett, J., 2000. Investigation of Fasciola hepatica sample preparation for two dimensional electrophoresis. Electrophoresis 21, 3724-3729.
Jessen, F., Lametsch, R., Bendixen, E., Kjaersgard, I.V.H., Jorgensen, B.M., 2002. Extracting information from two-dimensional gels by partial least squares regression. Proteomics 2, 32-35.
Kell, D.B., 2002. Metabolomics and machine learning: explanatory analysis of complex metabolome data using genetic programming to produce simple, robust rules. Mol. Biol. Rep. 29, 237-241.
Kelleher, N.L., 2004. Top-down proteomics. Anal. Chem. 76, 196A-203A.
Keller, A., Nesvizhskii, A.I., Kolker, E., Aebersold, R., 2002. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383-5392.
Knight, C.G., Kassen, R., Hebestreit, H., Rainey, P.B., 2004. Global analysis of predicted proteomes: functional adaptation of physical properties. Proc. Natl Acad. Sci. USA 101, 8390-8395.
Kwon, K.H., Kim, M., Kim, J.Y., Kim, K.W., Kim, S.I., Park, Y.M., Yoo, J.S., 2003. Efficiency improvement of peptide identification for an organism without complete genome sequence, using expressed sequence tag database and tandem mass spectra. Proteomics 3, 2305-2309.
Lasonder, E., Ishihama, Y., Anderson, J.S., Vermunt, A.M., Pain, A., Sauerwein, R.W., Eling, W.M., Hall, N., Waters, A.P., Stunnenberg, H.G., Mann, M., 2002. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature (London) 419, 537-542.
Lee, W.-C., Lee, K.H., 2004. Applications of affinity chromatography in proteomics. Anal. Biochem. 324, 1-10.
Lester, P.J., Hubbard, S.J., 2002. Comparative bioinformatics analysis of complete proteomes and protein parameters for cross species identification in proteomics. Proteomics 2, 1392-1405.
Link, A.J., Eng, J., Schieltz, D.M., Carmack, E., Mize, G.J., Morris, D.R., Garvik, B.M., Yates, J.R., 1999. Direct analysis of protein complexes using mass spectroscopy. Nat. Biotechnol. 17, 676-682.
Liska, A.J., Shevchenko, A., 2003. Expanding the organismal scope of proteomics: cross-species protein identification by mass spectrometry and its implications. Proteomics 3, 19-28.
Mackey, A.J., Haystead, T.A.J., Pearson, W.R., 2002. Getting more from less, algorithms for rapid protein identification with multiple short peptide sequences. Mol. Cell. Proteomics 1, 139-147.
Mackintosh, J.A., Choi, H.Y., Bae, S.H., Veal, D.A., Bell, P.J., Ferrari, B.C., Van Dyck, D.D., Verrills, N.M., Paik, Y.K., Karuso, P., 2003. A fluorescent natural product for ultra sensitive detection of proteins in one-dimensional and two-dimensional gel electrophoresis. Proteomics 3, 2273-2288.
Manly, B.F.J., 1994. Multivariate Statistical Methods: A Primer, second ed. Chapman & Hall, London.
Meng, F., Cargille, B.J., Patrie, S.M., Johnson, J.R., McLoughlin, S.M., Kelleher, N.L., 2002. Processing complex mixtures of intact proteins for direct analysis by mass spectroscopy. Anal. Chem. 74, 2923-2929.
Meyer, F., Oliveras, A., Salembier, P., Vachier, C., 1997. Morphological tools segmentation: connected filters and watersheds. Ann. Telecommun. 52, 367-379.
Molloy, M.P., Brzezinsky, E.E., Hang, J.Q., McDowell, M.T., VanBogelen, R.A., 2003. Overcoming technical variation and biological variation in quantitative proteomics. Proteomics 3, 1912-1919.
Nesvizhskii, A.I., Keller, A., Kolker, E., Aebersold, R., 2003. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646-4658.
Orchard, S., Taylor, C.F., Hermjakob, H., Zhu, W., Julian Jr., R.K., Apweiler, A., 2004. Advances in the development of common interchange standard for proteomic data. Proteomics 4, 2363-2365.
Pando, M., Ward, M., Pitarch, A., Sanchez, M., Nombela, C., Blackstock, W., Gil, C., 2000. Cross species identification of novel Candida albicans immunogenic proteins by a combination of two-dimensional polyacrylamide electrophoresis and mass spectroscopy. Electrophoresis 21, 2651-2659.
Pevzner, P.A., Dancik, V., Mulyukov, Z., Tang, C.L., 2001. Efficiency of mutation-tolerant database search with tandem mass spectra. Genome Res. 11, 290-299.
Pleissner, K.-P., Hoffman, F., Kreigel, K., Wenk, C., Wegner, S., Sahlstrom, A., Oswald, H., Alt, H., Fleck, E., 1999. New algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis databases. Electrophoresis 20, 755-765.
Salmi, J., Aittolallia, T., Westerholm, J., Greise, M., Rosengren, A., Nyman, A., Lahesmaa, R., Nevalainen, O., 2002. Hierarchical grid transformation for image warping in the analysis of two-dimensional electrophoresis gels. Proteomics 2, 1504-1515.
Schmidt, H.R., Schmitter, D., Blum, P., Miller, M., Vonderschmitt, D., 1995. Lung-tumor cells: a multivariate approach to cell classification using 2-dimensional protein pattern. Electrophoresis 16, 1961-1968.
Seeler, J.S., Dejean, A., 2003. Nuclear and unclear functions of SUMO. Nat. Rev. Mol. Cell Biol. 4, 690-699.
Shevchenko, A., Sunyaev, S., Loboda, A., Shevchenko, A., Bork, P., Ens, W., Standing, K.G., 2001. Charting the proteomes of organisms with unsequenced genomes by MALDI-Quadrupole Time-of-Flight mass spectrometry and BLAST homology searching. Anal. Chem. 73, 1917-1926.
Smilansky, Z., 2001. Automatic registration for images of two-dimensional protein gels. Electrophoresis 22, 1615-1626.
Spibey, C.A., Jackson, P., Herick, K., 2001. A unique charge-coupled device/xenon arc lamp based imaging system for accurate detection and quantitation of multicolour fluorescence. Electrophoresis 22, 829-836.
Taylor, C.F., Paton, N.W., Garwood, K.L., Kirby, P.D., Stead, D.A., Yin, Z., Deutsch, E.W., Selway, L., Walker, J., Riba-Garcia, I., Mohammed, S., Deery, M.J., Howard, J.A., Dunkley, T., Aebersold, R., Kell, D.B., Lilley, K.S., Roepstorff, P., Yates III, J.R., Brass, A., Brown, A.J.P., Cash, P., Gaskell, S.J., Hubbard, S.J., Oliver, S.G., 2003. A systematic approach to modelling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 21, 247-254.
Tomarev, S.I., Zinovieva, R.D., 1998. Squid major lens polypeptides are homologous to glutathione S-transferases subunits. Nature (London) 336, 86-88.
Veeser, S., Dunn, M.J., Yang, G.-Z., 2001. Multiresolution image registration for two-dimensional gel electrophoresis. Proteomics 1, 856-870.
Vihinen, M., 2001. Bioinformatics in proteomics. Biomol. Eng. 18, 241-248.
Wilkins, M.R., Williams, K.L., 1997. Cross-species identification using amino acid composition, peptide mass fingerprinting, isoelectric point and molecular mass: a theoretical evaluation. J. Theor. Biol. 186, 7-15.
Woodward, A.M., Kaderbhai, N., Kaderbhai, M., Shaw, A., Rowland, J., Kell, D.B., 2001. Histometrics: improvement of the dynamic range of fluorescently stained proteins resolved in electrophoretic gels using hyperspectral imaging. Proteomics 1, 1351-1358.
Woodward, A.M., Rowland, J., Kell, D.B., 2004. Fast automatic registration of images using the phase of a complex wavelet transform: application to proteome gels. Analyst 129, 542-552.
