www.elsevier.com/locate/bba
a EMBL Outstation Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
b Department of Membrane Research and Biophysics, The Weizmann Institute of Science, Rehovot 76100, Israel
Received 15 February 1999; accepted 19 May 1999
Abstract
The SWISS-PROT protein sequence data bank contains at present nearly 75 000 entries, almost two thirds of which
include the potential N-glycosylation consensus sequence, or sequon, NXS/T (where X can be any amino acid but proline)
and thus may be glycoproteins. The number of proteins filed as glycoproteins is however considerably smaller, 7942, of which
749 have been characterized with respect to the total number of their carbohydrate units and sites of attachment of the latter
to the protein, as well as the nature of the carbohydrate-peptide linking group. Of these well characterized glycoproteins,
about 90% carry either N-linked carbohydrate units alone or both N- and O-linked ones, attached at 1297 N-glycosylation
sites (1.9 per glycoprotein molecule) and the rest are O-glycosylated only. Since the total number of sequons in the well
characterized glycoproteins is 1968, their rate of occupancy is 2/3. Assuming that the same number of N-linked units and rate
of sequon occupancy occur in all sequon-containing proteins and that the proportion of solely O-glycosylated proteins (ca.
10%) will also be the same as among the well characterized ones, we conclude that the majority of sequon-containing proteins
will be found to be glycosylated and that more than half of all proteins are glycoproteins. © 1999 Elsevier Science B.V. All
rights reserved.
Keywords: Glycosylation; Glycoprotein; Database
Glycosylation is a common and highly diverse co- and post-translational protein modification reaction.
Perhaps because almost all proteins of human serum
and of hen egg-white are glycosylated [1], as are
those of animal cell membranes [2], the sweeping
statement has been made that `most proteins are glycoproteins' [3,4]. The recent development of computerized protein sequence data banks allows us to put
this statement to a quantitative test. Here, we present
* Corresponding author.
1 Dedicated to Prof. Akira Kobata and Prof. Harry Schachter on the occasion of their 65th birthdays.
0304-4165/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0304-4165(99)00165-8
Fig. 1. Frequency of occurrence of sequons and carbohydrate units in the 749 well characterized glycoproteins listed in the SWISS-PROT database by the end of 1998. Sequons per glycoprotein (A), and carbohydrate units in N-glycoproteins (B), O-glycoproteins
(C) and N-,O-glycoproteins (D).
N-glycosidic bond is always to the amide of an asparagine that is part of the consensus sequence NXS/T,
or `sequon', where X can be any amino acid except proline. The sequons are often referred to as
`potential glycosylation sites' because, for reasons that
are not understood, the asparagine is not glycosylated in all of them. No consensus sequence for
O-glycosylation seems to exist.
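The sequon definition above can be written as a simple pattern search. The following is a minimal sketch, not the procedure used for the database analysis: the function name `find_sequons`, the example peptide and the decision to count overlapping sequons separately are all assumptions made for illustration.

```python
import re

# N, followed by any residue except proline, followed by S or T.
# The lookahead consumes only the N, so overlapping sequons
# (e.g. the two in "NNSS") are each reported. Counting overlaps
# separately is an assumption of this sketch.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of the asparagine in every NXS/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# Hypothetical peptide, not taken from SWISS-PROT.
example = "MKNASLLNPTQNGTANNSS"
print(find_sequons(example))  # [3, 12, 16, 17]; the N at position 8 is followed by P
```

A sequence for which the function returns an empty list contains no sequon and therefore cannot be N-glycosylated in the sense used here.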
By the end of 1998, the SWISS-PROT database (release 36, including updates to 01/11/98) contained
74 988 entries. Potential N-glycosylation sites were
identified 151 993 times in 48 636 sequences, an average of 3.1 per sequon-containing sequence. In 26 352 protein sequences,
such sites are absent, showing that about one third of
the proteins cannot be N-glycosylated. Examination
of the TrEMBL entries leads to similar conclusions
(Table 1).
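The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text for SWISS-PROT release 36.
entries = 74_988
entries_with_sequon = 48_636
sequons = 151_993

entries_without_sequon = entries - entries_with_sequon
fraction_without = entries_without_sequon / entries
mean_per_sequon_entry = sequons / entries_with_sequon

print(entries_without_sequon)            # 26352
print(round(fraction_without, 3))        # 0.351, i.e. about one third
print(round(mean_per_sequon_entry, 1))   # 3.1
```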
The number of proteins in SWISS-PROT that
have been filed as glycoproteins is relatively small,
Table 1
                                   SWISS-PROT        TrEMBL
Number of entries                  74 988            156 187
Entries containing NXS/T sequon    48 636 (64.9%)    107 551 (68.9%)
Number of sequons                  151 933           394 483
Sequon/sequon containing entry     3.12              3.66
Fig. 2. Amino acid residues per sequon in all well characterized glycoproteins (A) and per real glycosylation site in the N-, O- and
N-,O-glycoproteins of the same group (B, C and D, respectively).
sition> <description>'. For example, a glycoprotein with an N-acetylgalactosamine at position 34 will
be annotated as `FT CARBOHYD 34 34 N-ACETYLGALACTOSAMINE'.
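An annotation line of the kind quoted above can be read with a small parser. This is a hedged sketch: it assumes whitespace-separated fields exactly as in the example line and does not handle the fixed-column layout of real flat-file entries. The names `CarbohydSite` and `parse_carbohyd` are hypothetical.

```python
from typing import NamedTuple, Optional


class CarbohydSite(NamedTuple):
    start: int
    end: int
    description: str


def parse_carbohyd(line: str) -> Optional[CarbohydSite]:
    """Parse 'FT CARBOHYD <from> <to> <description>'; return None for other lines."""
    # Split into at most five fields so the description keeps its inner spaces.
    fields = line.split(None, 4)
    if len(fields) < 4 or fields[0] != "FT" or fields[1] != "CARBOHYD":
        return None
    if not (fields[2].isdigit() and fields[3].isdigit()):
        return None
    description = fields[4].strip() if len(fields) > 4 else ""
    return CarbohydSite(int(fields[2]), int(fields[3]), description)


site = parse_carbohyd("FT CARBOHYD 34 34 N-ACETYLGALACTOSAMINE")
print(site)
```

Returning `None` for lines that are not CARBOHYD features lets the function be applied to every line of an entry without prior filtering.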
As a thorough biochemical characterization of
most of the glycoproteins is lacking, numerous <description> tags contain the strings `BY SIMILARITY', `PROBABLE' or `POTENTIAL'. These denote that no biochemical characterization of the
glycosylation site(s) is available. For the purpose of
Table 2
Potential and real glycosylation sites in the 749 well characterized glycoproteins listed in the SWISS-PROT database by the end of 1998

Column groups:
(1) Glycoproteins with at least one biochemically characterized (`real') glycosylation site
(2) Glycoproteins with at least one real N-glycosylation site and at least one real O-glycosylation site
(3) Glycoproteins with at least one real N-glycosylation site and no real O-glycosylation site
(4) Glycoproteins with at least one real O-glycosylation site and no real N-glycosylation site

                                            (1)               (2)               (3)               (4)
                                            Sites   Entries   Sites   Entries   Sites   Entries   Sites   Entries
Potential N-glycosylation sites (sequons)   2 066   697       289     80        1 679   582       98      35
Real glycosylation sites                    1 965   749       556     80        1 041   582       368     87
Real N-glycosylation sites                  1 279   662       238     80        1 041   582       0       0
Real O-glycosylation sites                  686     167       318     80        0       0         368     87
Table 3
Spacing of sequons and real glycosylation sites in the 749 well characterized glycoproteins

Amino acids per   Sequon      Real N-site   Real O-site   Real N+O-site a
Minimum           18.00       23.11         3.10          3.10
Maximum           1 669.00    3 412.00      2 813.00      3 412.00
Mean              159.04      249.92        269.90        231.66
Median            121         144           167           133
Range             1 651.00    3 388.89      2 809.90      3 408.90
S.D.              164.07      348.09        338.83        335.80

a In N-,O-glycoproteins.
[2]
[3]
[4]
[5]
[6]
[7]
References
[1] N. Sharon, Complex Carbohydrates, their Chemistry, Bio-