Professional Documents
Culture Documents
ABSTRACT
A new dataset of 396 protein domains is developed and used to evaluate the performance of the protein secondary structure prediction algorithms DSC, PHD, NNSSP, and PREDATOR.
The maximum theoretical Qs accuracy for combination of these methods is shown to be 787o.A simple
consensus prediction on the 396 domains, with automatically generated multiple sequence alignments
gives an average Q3 prediction accuracy of 72.97o.
This is a l7o irnprovement over PHD, which was the
best single method evaluated. Segment Overlap Accuracy (SOV) is 75.47ofor the consensus method on
the 396-protein set. The secondary structure definition method DSSP defines 8 states, but these are
reduced by most authors to 3 for prediction.Application ofthe different published 8- to 3-state reduction
methods shows variation of over 37o ort apparent
prediction accuracy. This suggests that care should
be taken to compare methods by the same reduction
method. TWo new sequence datasets (CB513 and
CB.25l) are derived which are suitable for crossvalidation of secondary structure prediction methods without artifacts due to internal homology. A
fully automatic World Wide Web service that predicts protein secondary structure by a combination
of methods is available via http://barton.ebi.ac.uk/.
Proteins 1999;34:508-519. o 1999Wiley-Liss, Inc.
Key words: protein; secondary structure prediction; combination of methods; benchmarks
INTRODUCTION
The most successful techniques for prediction of the
protein three dimensional structure rely on aligning the
sequenceofa protein ofunknown structure to a homolog of
known structure (e.g., see Sali for reviewl). Such methods
fail if there is no homolog in the structural database, or if
the technique for searching the structural database is
unable to identify homologs that are present. While absence of a homolog must await further X-ray or NMR
structures, up to 4/5 of known homologuesmay be missed
even by the best conventional pairwise sequencecomparison methods.z
Techniques that exploit evolutionary information from
protein families3-eor use empirical pair-potentialsl0'11can
normally detect more homologs than pairwise sequence
o 1999 WILEY-LISS, INC.
SECONDARYSTRUCTUREPREDICTION METHODS
Prediction from a multiple alignment of protein sequences rather than a single sequence has long been
recognized as a way to improve prediction accuracy.l8
During evolution, residues with similar physico-chemical
properties are conservedifthey are important to the fold
or function ofthe protein. This makes patterns ofhydrophobic residues characteristic of particular secondary structures easier to identify.zl Analysis of conservation in
protein families has been effective in many secondary
structure predictions performed before knowledge of the
protein structure.22-25Zvelebil et al. (1987)26developedan
automatic procedure that showed a 9Voimptovement in
prediction accuracy on a small set of protein families when
multiple sequencedata was included. Most current secondary structure prediction algorithms exploit similar principles to gain higher accuracythan is possiblefrom a single
sequencs.27-3o
The recent CASP31series of experiments in
which predictions are made blind have shown that recent
claims for secondary structure prediction algorithmss2 are
within reasonablelimits.
Prediction accuracy has also been improved by combining more than one algorithm on a single sequence.ss-37
F61
example, Zhang et al. (1992)34obtained 66.4Voacctracy on
a set of 107 proteins, an improvement of 2Voover the best
method they considerec.
In this paper we describe datasets and procedures for
the evaluation of current techniques for secondary structure prediction. We discuss the effects of homologTwithin
the training and test datasets and describe new nonredundant datasets appropriate for developing secondary
structure prediction algorithms. We evaluate the accuracy
of four recently published algorithms that exploit multiple
sequencedataNNSSP,3oPHD,27DSC,2eand PREDATOR28
and two older methods, ZPRED26and MULPRED (Barton,
unpublished). We develop an algorithm that combines the
predictions of PHD, DSC, PREDATOR, and NNSSP and
show that it gives a 7Voimprovement in average accuracy
over the best single method. Finally, we investigate the
effect of the quality of multiple sequencealignment used in
prediction, the effect of secondary structure assignment
algorithm (DSSp,saDEFINE,Seand STRIDEao)and influenceof redundancy in the multiple alignments.
METHODS
The Problem of Objectively Testing Secondary
Structure Prediction Methods
If a protein sequenceshows clear similarity to a protein
of known three dimensional structure, then the most
accurate method ofpredicting the secondarystructure is to
align the sequences by standard dynamic programming
algorithms,al as homolory modeling is much more accurate than secondary structure prediction for high levels of
sequenceidentity. Secondarystructure prediction methods
are of most use when sequencesimilarity to a protein of
known structure is undetectable. Accordingly, it is important that there is no detectable sequence similarity between sequencesused to train and test secondary structure prediction methods.
509
Structures
510
(1)
1eca75
1azu76
2rspa77
lpa2;ts
2lhb52
lcdtaso
3clns1
2ebf'
1ovoa83
1fd1h84
4cmsa8
Q')
2lhb52
2pcf3
5hvpaso
2pcft
4sdha7e
3ebxsa
4cpvss
8abp57
ltgsist
lmcplso
Ser2eae
SD
score
5.L2
5.40
5.81
7.22
7.70
8.26
8.27
8.86
9.45
t2.66
15.98
Than Fivet
Fold (1)
Fold (2)
Globin-like
Cupredoxins
Acid proteases
Cupredoxins
Globin-like
Snake toxin-like
EF Hand-like
Periplasmic binding
Ovomucoid,/PCl-1 like
Immunoglobulin
Acid proteases
Globin-like
Cupredoxins
Acid proteases
Cupredoxins
Globin-like
Snake toxin-like
EF Hand-like
Periplasmic binding
Ovomucoid,/POl-1 like
Immunoglobulin
Acid proteases
tAlignments were generated by theAMPS package6 a blosum62 matrix, and gap penalty of 10, with
100 randomizations. Fold definitions come from the current release (1.37) ofthe SCOP database.aT
SECONDARYSTRUCTUREPREDICTION METHODS
Sequence Aligrrments
With the exceptionof PREDAIOR28's8all methods considered here, required a multiple sequence alignment as
input, where as PREDATOR only required the multiple
sequencesin an unaligned format. In order to simplify the
generation of multiple sequence alignments for large
numbers of proteins, in this study we developed an automatic procedure.
We first perform a BLAST5edatabase search of the OWL
v29.4 d,atabase, which contains 198,742 entries.60 The
BLAST output is then screenedby SCANPS, an implementation of the Smith Waterman dynamic programming
algorithm,61,62
with length dependent statistics. Sequences
are rejected if their SCANPS probability score is higher
than 1 x 10-4. Sequencesare also rejected ifthey do not fit
a length cutoffof 1.5. For example, if the query sequenceis
90 residues long, the sequencelength would have to range
between 60 and 135 residues to be included. If sequences
exceedthe length criterion, they are truncated by removing end residues until the length ofthe sequencesatisfies
the cut off value. Sequences falling short of the lower
length limit are discarded. The value of 1.5 for the length
cutoff was reached by visual inspection of a number of
multiple sequence alignments, produced with different
cut-offvalues. The method removes both rediculously long,
short and unrelated sequences. However it does allow
sequencesthat are longer than the query, and are related,
to be included after truncation. The sequence similar
proteins selected by this method, are then aligned by
CLUSTALW (version 1.7),63with default parameters.
511
Methods Analyzed
512
Method
PHD : Helix
Qs:
predicted'
-Hpg obserued;
1 s rninou(so6";sp"rd)+ 6 x len(s1).
N7
*"-"rtt*;+*l
Q)
Helix
Sheet
Coil
DSSPE8
STRIDE4o
DEFINESe
28.9
22.9
48.1
29.8
24.4
45.8
30.2
30.0
39.7
SECONDARY
STRUCTURE
PREDICTION
METHODS
513
Method
Helix
Strand
DSSP38
STRIDE4O
DEFINE39
DSSP38
STRIDE40
DEFINEs9
Mean
Max
Tbtalnumber
ofsecondary
structures
9
10
14
4
4
6
54
51
65
19
19
26
817
753
553
t302
1303
1030
3
2
D
1
1
4
N l
^ l
-o lt
ffiffi#m----
i6 1
I
o t
20
30
40
DSSP helix sment length
50
1
0
1
5
2
0
DSSP stEnd sgmehtlsgth
tr
3l
ul
*l
"l
20
30
40
Stril helix sgmonl length
50
lw
lffi#xgm**,#ffi
1
0
t
5
2
0
Strids slnnd segmfft longth
:l
3r ffi
R1
o l
ffi
s
Rl
Rl
Ws
El ffiffiffi-
:l
0
3
0
4
Delimhelixssgmsntlength
:l $tE#ilffim-*----
0
1
5
m
DliheslEndsegM longth
CB396set
RS126set
Averagepercent
identity
betweensequences
Average
sequence
length
34
31
157residues
185residues
Average
numberof
sequen@s
per alignment
18
30
5r4
RS126set
number(7o)
25 (20)
17(13)
27 (2r)
38 (20)
3(2)
18(14)
1(<1)
1(<1)
CB396set
number(7o)
107(27)
101(26)
68 (17)
78 (20)
0 (0)
27 (7)
3 (<1)
12(3)
Method
PHD64
DSC29
PREDATORP8
NNSSPsO
CONSENSUS
RS126
proteinset
SOV
Q3
73.5
73.5
7I.T
7r.6
70.3
69.9
72.7
70.6
74.8
74.5
CB396
proteinset
Q3
7I.9
68.4
68.6
77.4
72.9
sov
/ o..J
72.0
69.8
7t.3
ID.+
Quality
The RS126 set of proteins was used to check that the
alignments generated by our method could reproduce
previous results for PHD,27 and DSC.2ePHD27 (run in
cross-validation mode) increased in average Qs accuracy
by 1.97ofrom 71.6 to 73.57o,over the published accuracy.
The accuracy obtained here for DSC improved by 7.07o.
However, the published value of 70.lVo for DSC was for a
full jack-knife test, whereas our test utilized all the data.
These results confirm that the automatic method used to
build multiple alignments in this study is appropriate for
secondarystructure prediction.
Comparison of PredictionAccuracy
Test and RS126 Thaining Sets
p-Value
cutoff
10 10
p-Value
cutoff
r02
1716356
7013
2013632
8974
55.6
7t.2
SECONDARY
STRUCTURE
PREDICTION METHODS
TABLE VItr. Comparison of the Q3Accuracy for a
Decrease in the BI"ASTsep-Value Cut-OffFrom l0-ro to
10-2with the RS126 Setr
p-Value
cutoff10-10
Method
PHD64
DSC29
PREDATORs8
NNSSP3O
MULPRED
ZPRED26
CONSENSUS
DSSP
p-Value
cutoff10-2
STRIDE
DSSP
STRIDE
73.2
7t.0
70.7
72.4
67.2
66.7
74,5
73.2
70.7
70.3
7t.7
66.8
65.9
74.3
t70 ^
72.4
70.2
69.7
71.8
66.7
70.0
69.3
71.2
65.4
64.7
73.7
OD.D
73.9
l00Vo
957o
80Vo
7\Va
73.2
7L.0
70.7
72.4
74.5
73.3
7t.0
70.4
72.5
74.5
73.4
7r.0
70.1
72.7
74.6
73.5
7L.t
70.3
72.7
74.8
60Vo
73.3
70.9
70.4
72.8
74.7
DIO
MethodA
MethodB
Method C
Mean
percent
ofhelix
Mean
percent
ofsheet
Mean
percent
ofcoil
28.9
25.3
25.6
22.9
2t.2
21.2
48.1
52.6
52.3
ReductionmethodA6a
B -- Coil only
G - Coil only
B and G* Coil
GC'GHHHH- HHHHHHH, B and
G * Coil (Methodc)30
B, G and HHHH EE -* Coil (MethodB)5s
Q"
74.8
l D .t
76.6
77.5
77.5
77.9
8974
5907
4320
3833
268t
7T
47
34
30
20
ofAlternative
8- to 3-State
516
DSSP
(A)
STRIDE
(A)
DSSP
(B)
STRIDE
(B)
73.5
7L.l
70.3
72.7
74.8
73.5
70.9
69.6
72.2
74.7
76.3
73.3
75.2
77.3
77,9
76.3
73.4
74.0
76.5
77.9
SIMPAT'
GORIw4
SOPM73
RS126
CB396
Author
76.3
67.3
53.3
66.8
74.2
67.6
64.6
64.7
67.7
64.4
69.0
rThe column "Author" is the authors jack-knife value for the method
with their dataset, and definition reduction method. AII results are
calculated using reduction method A, and also converting G and B
states to coil. For PHDGa the authors qtote 7l.6Vo as their crossvalidated accuracy. However, G and B states were considered in the
accuracy calculation for PHD6a.
Methods
ties for the different states are used, rather than the
binary input, and this idea forms the basis of future work.
When the lower accuracy predictions from ZpRED and
MULPRED were included, the overall accuracy of the
consensus method was reduced. SIMPA, SOpM, and
GORTV were not included at any stage in the consensus
method. Further work aims to discover if the single
sequence prediction methods can be incorporated into a
more accurate consensusmethod.
SUMMARY AND CONCLUSIONS
In this study we have developed a new, nonredundant
test set of396 protein domains (CB3g6).The set doesnot
include any of the 126 proteins with which many current
methods have been trained, nor doesit contain homologsof
those 126 proteins as measured by a stringent test of
sequence similarity. We have shown that by combining
four secondarystructure prediction methods DSC,2epHD,27
PREDATOR,z8and NNSSPs0 by a simple majority-wins
method, the average three-state Q3 prediction accuracy
can be improved by 17ofromTI.gVo(PHD) to 72.g%o
on the
C8396 set. A fair comparison of the accuracy of the
constituent methods is only possible for PHD27and DSC2e
as all other algorithms included some of our test proteins
in their training set. Despite this, PHD27 still gave the
highest accuracyon the new test set (7I.gVo)ofany ofthe
methodsconsidered.
An automatic procedure for database searching to build
a multiple sequencealignment has been developed.Alignments from this procedure give a l.g%oincrease in the
average accuracy ofprediction compared to previous published results for the PHD algorithm on the 126 protein
set.6aThe increase may be attributed to better alignments
and the increased size ofthe current sequencedatabases.
In the literature there are different standards for reducing DSSP388-state (H,C,B,E,T,S,G,I)assignments to 3
states (H,C,E). It was found that changing the reduction
method can alter the apparent prediction accuracy by over
lVo on average. Although we were unable to train the
methods using different 8- to 3-state reductions, testing all
methods with different reduction methods showed that
Method B58consistently gave higher accuracy.This may be
attributed to Method B assigning more of the protein to
Coil (C).
Secondary structure definition methods DSSp.38
DEFINE,Seand STRIDE40were compared.All three agree
at only 75Voof positions. This is mainly due to differences
between DEFINE and DSSP/STRIDE. DSSP and STRIDE
agreeat957o of positions, though DSSP defines many more
4 residue helices than STRIDE.
In summary, with the alignment method presented here,
the method with the highest average accuracy on the new
non-redundant test set of 396 proteins was PHD6a with
7I.9Vo.While the new combination of NNSSqao PHD,27
DSC,zoand PREDATOR2spresented here improves upon
this figure by l%oto 72.97o.
The non-redundant datasets constructed during this
analysis will facilitate the future development and testing
5t7
14. Rost B. TOPITS: threading one-dimensional preditions into threedimensional structures. Proceedings from the Third International
Conference on Intelligent Systems for Molecular Biology. Menlo
Park, CA: AAA1 Press, 1995. p 314-321.
15. Rost B. Protein fold recogrrition by prediction-based threading. J
Mol Biol 1997;270:L-70.
16. Lim VI. Algorithms for prediction of o helices and B structural
regions in globular proteins. J Mol Biol 1924;88:8ZB-894.
17. Chou PY, Fasman GD. Conformational parameters for arnino
acids in helical, B-sheet, and random coil regions calculated frorn
proteins. Biochemistry I97 4;13:2Il-222.
18. Garnier J, Osguthorpe DJ, Robson B. Analysis and implications of
simple methods for predicting the secondary structure ofglobular
proteins. J Mol Biol t978;120:97-t20.
19. Schulz GE, Schirmer RH. Principles of Proteins Structure. New
York: Springer-Verlag, 1979. p 1-314.
20. Kabsch W, Sander C. How good are predictions ofprotein secondary structure? FEBS Lett 1983;155:179-182.
21. Livingstone CD, Barton GJ. Identification offunctional residues
and secondary structure from protein multiple sequence alignment. Methods Enzyrnol t996;266:497 -512.
518
50.
51.
52.
53.
54.
55.
56.
57.
58.
59.
60.
61.
62.
63.
64.
65.
66.
67.
68.
69.
70.
71.
72.
73.
74.
75.
SECONDARYSTRUCTUREPREDICTION METHODS
76. Adman ET, Sieker LC. JensenLH. Structural featuresofAzurin
at 2.7Angstroms.Isr J Chem 1981;21:8-11.
77. Wodawer A, Miller M, Jaskolski M. Crystal structure of a
retroviral proteaseprovesrelationship to aspartic proteasefamily.
Nature 1989;337
:576-579.
78. PetratosK, Dauter Z, Wilson KS. Refinementof the structure of
Pseudoazurinfrom AlcaligenesFaecalis5-6 at 1.55 Angstroms.
Acta CrystallogrSectB. 1988;44:628-636.
79. Royer WE. Jr. High resolution crystallographic analysis of a
cooperativedimeric Haemoglobin.J Mol Biol t994;657:657-681.
80. ReesB, Bilwes A, SamamaJP, Moras D. Cardiotoxin from Naja
Mossanmica:the refined crystal structure. J Mol Biol 1990;214:
28t-297.
519
81. Babu YS, Bugg CE, Cook WJ. Structure of Calmodulin refined at
2.2 Angstroms resolution. J Mol Biol 1988;204:l9t-204.
Vyas NK, Vyas MN, Quiocho FA. Sugar and signal transducer
binding sites of the Escherichia coli galactose chemoreceptor
protein. Science 1988;242:L290-t295.
83. Weber E, Papamokos E, Bode W, Huber R, Kato I, Laskowski M. Ovomucoid, a Kazal-type inhibitor, and model building
studies ofcomplexes with serine proteases. J Mol Biol 1982;158:
DIO-OJ
'.