Secondary structure predictions

As we all know, proteins consist of secondary structure elements. Predicting these elements can help us understand more about the function of a protein without determining its three-dimensional structure. Furthermore, secondary structure prediction has been regarded as a step towards predicting the three-dimensional structure of a protein, as has been shown in some threading methods.

Some methods used in secondary structure prediction

There are many methods for secondary structure prediction including:

Statistical methods (Chou-Fasman)
Physico-chemical methods (Lim; Ptitsyn & Finkelstein)
Sequence patterns
Neural networks
Evolutionary conservation
Neural networks

PHD

First Chou-Fasman method

A long time ago, experiments showed that synthetic polypeptides differ in their intrinsic ability to form different types of secondary structure. This led to the assumption that secondary structure is (partly) determined by the local sequence.

Chou and Fasman created one of the first prediction methods in 1978. From the known protein structures they calculated the probability for each residue type to be in a certain secondary structure type. The residues were then classified into different groups; for instance, Glu, Met, Ala and Leu were classified as strong helix formers, while Val, Ile and Tyr were strong sheet formers. The following classes were created:

Strong Helix Formers
Weak Helix Formers
Indifferent forms
Weak Helix Breakers
Strong Helix Breakers
Strong Sheet Formers
Weak Sheet Formers
Indifferent forms
Weak Sheet Breakers
Strong Sheet Breakers
Capping residues

Pro prefers the first residue of a helix
Asp & Glu prefer the amino terminus
Arg and Lys prefer the carboxyl end.

From a listing of these properties the prediction was done semi-manually, for example by initiating a helix in a region with strong helix formers.
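The statistics behind these classes can be sketched as follows. This is a minimal illustration, not the published procedure: it assumes the propensity of a residue type is its frequency in a given state divided by the overall frequency of that state, and the function name and the toy data are hypothetical.

```python
from collections import Counter


def propensities(sequences, structures, state="H"):
    """Relative frequency of each residue type in `state`, divided by the
    overall frequency of `state` (values > 1.0 suggest a 'former')."""
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_total = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError("state %r never occurs in the data" % state)
    overall = n_state / n_total
    return {aa: (aa_in_state[aa] / aa_total[aa]) / overall for aa in aa_total}


# Toy data (made up): one sequence with its observed states.
toy_propensities = propensities(["AAGG"], ["HHHC"], state="H")
```

With such a table, residues could be sorted into formers, indifferent residues and breakers by choosing thresholds on the propensity values.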

Second Chou-Fasman method

A cluster of four helical residues nucleates a helix. The helix is extended until it reaches a tetrapeptide for which the helix probability is lower than 1.0. If three out of five residues are beta formers, they nucleate a beta sheet, which is extended in a similar fashion to helices. If a region is predicted as both sheet and helix, the secondary structure with the highest propensity is selected.
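The helix part of this rule can be sketched as below. This is one possible reading of the description, not the original implementation: it assumes that "helical residue" means a propensity above 1.0 and that extension continues while the mean propensity of the outermost tetrapeptide stays at or above 1.0. The propensity values in the example are made up for illustration.

```python
def predict_helix(seq, p_helix, window=4, cutoff=1.0):
    """Return a string of 'H' (helix) and 'C' (other) for `seq`.

    p_helix maps each residue letter to a helix propensity.
    """
    n = len(seq)
    states = ["C"] * n

    def window_mean(start):
        # Mean propensity of the tetrapeptide starting at `start`.
        return sum(p_helix[aa] for aa in seq[start:start + window]) / window

    i = 0
    while i + window <= n:
        # Nucleation: four consecutive helix formers.
        if all(p_helix[aa] > cutoff for aa in seq[i:i + window]):
            start, end = i, i + window  # half-open interval [start, end)
            # Extend to the right while the last tetrapeptide is still helical.
            while end < n and window_mean(end - window + 1) >= cutoff:
                end += 1
            # Extend to the left in the same way.
            while start > 0 and window_mean(start - 1) >= cutoff:
                start -= 1
            states[start:end] = "H" * (end - start)
            i = end
        else:
            i += 1
    return "".join(states)


# Illustrative propensities only, not the published table.
toy_table = {"A": 1.4, "G": 0.5}
helix_example = predict_helix("GGAAAAGG", toy_table)  # "CHHHHHHC"
```

Sheet nucleation (three out of five beta formers) would follow the same pattern with a second propensity table.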

GOR

Garnier and Robson extended the Chou-Fasman method and introduced a cleaner and better method. A window of 17 residues is examined to predict the secondary structure of the central residue. For each secondary structure state of the central residue, the probability of finding a certain amino acid at a certain position in the window is calculated. This yields 20*17*3 probabilities. For a given amino acid, the secondary structure is chosen as the one with the highest probability, summed over all 17 residues. This method gives about 65% accuracy.
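A minimal sketch of this window scoring is given below. It assumes the 20*17*3 table has already been estimated from known structures and is stored as one score per state, window position and amino acid; the layout of the table, the function name and the toy values are assumptions made for the example.

```python
STATES = "HEC"   # helix, strand, coil
HALF = 8         # 8 residues on each side of the centre -> window of 17


def gor_predict(seq, table):
    """table[state][offset + HALF][amino_acid] -> score contribution."""
    prediction = []
    for i in range(len(seq)):
        totals = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF, HALF + 1):
                j = i + offset
                if 0 <= j < len(seq):  # skip positions outside the chain
                    total += table[state][offset + HALF].get(seq[j], 0.0)
            totals[state] = total
        # The state with the highest summed score wins.
        prediction.append(max(STATES, key=totals.get))
    return "".join(prediction)


# Toy table (made up): A favours helix, V strand, G coil at every position.
toy_gor = {state: [dict() for _ in range(2 * HALF + 1)] for state in STATES}
for k in range(2 * HALF + 1):
    toy_gor["H"][k]["A"] = 1.0
    toy_gor["E"][k]["V"] = 1.0
    toy_gor["C"][k]["G"] = 1.0

gor_example = gor_predict("AAAAA", toy_gor)
```

A full table would hold an entry for each of the 20 amino acids at each of the 17 positions for each of the 3 states, i.e. 1020 values.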

The development of new methods

In my opinion, one new method completely revolutionized protein secondary structure prediction, taking it into an area where it is actually very useful. For instance, modern methods achieve higher accuracy in secondary structure prediction than CD measurements. The accuracy is so high that secondary structure prediction is often the first method used when trying to predict the structure of a protein. You can read more about PHD.

The best prediction methods before PHD achieved about 66.2% prediction accuracy. Furthermore, most methods predicted beta-sheets much worse.

You should find out how PHD works by reading about it in the original papers, available on paper (I would recommend paper number 5) or electronically (none of these is as complete as paper 5 above, but the best is the 3rd Generation paper).

Why is PHD better?

Overall accuracy of 70.8 %
Beta-sheets predicted with an accuracy of 65.4 %
The lengths of predicted secondary structure segments are more protein-like
Predictions with high reliability (>90%) reach 82% accuracy

How PHD works

A carefully selected data set was created in which all pairs of proteins have a low pairwise sequence identity (<25 %). This is necessary because homology alignment predicts secondary structure better than any other method, so homologous proteins in both the training and the test set would inflate the measured accuracy. The set is used for careful cross-validation studies.

Proteins that have similar sequences will have similar secondary structure elements. Therefore one can predict the secondary structure of an aligned family and thereby obtain the prediction for a single protein. The major difference between PHD and earlier methods is that PHD uses an aligned family of proteins, rather than a single sequence, to predict the secondary structure. This increased the performance significantly.
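One simple way to turn an aligned family into network input is a profile of residue frequencies per alignment column. The sketch below is an assumption about the general idea only, not the PHD code; the function name and the toy alignment are hypothetical, and gaps are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def alignment_profile(alignment):
    """Residue frequencies for each column of equally long aligned sequences."""
    n_columns = len(alignment[0])
    profile = []
    for col in range(n_columns):
        # Keep only real residues; gaps ('-') and unknown letters are skipped.
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        counts = Counter(column)
        total = len(column)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy family of three aligned sequences (made up).
toy_profile = alignment_profile(["ACD", "ACE", "A-E"])
```

Each position is then described by 20 frequencies instead of a single letter, which carries the conservation pattern of the family into the prediction.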

Measurement of accuracy

It is important to define what one wants to predict. The most intuitive measure for secondary structure prediction is the fraction of correctly predicted residues; however, as the example below shows, this is not the best definition.

Ex:

   SS: aaaaaaaaaLLL
    1: laaalaalalll  9/12 correct
    2: llaaaaaaaaal  8/12 correct
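In the example above, prediction 1 scores higher per residue although it breaks the single helix into three pieces, while prediction 2 finds one helix that is merely shifted. A small sketch that computes both the per-residue fraction and the number of helix segments makes this visible; the function names are hypothetical.

```python
from itertools import groupby


def fraction_correct(observed, predicted):
    """Fraction of positions where the predicted state equals the observed one."""
    obs, pred = observed.lower(), predicted.lower()
    matches = sum(o == p for o, p in zip(obs, pred))
    return matches / len(obs)


def count_segments(ss, state):
    """Number of uninterrupted runs of `state` in the string `ss`."""
    return sum(1 for key, _ in groupby(ss.lower()) if key == state)


observed = "aaaaaaaaaLLL"
pred_1 = "laaalaalalll"
pred_2 = "llaaaaaaaaal"

q_1 = fraction_correct(observed, pred_1)   # 9/12
q_2 = fraction_correct(observed, pred_2)   # 8/12
helices_1 = count_segments(pred_1, "a")    # 3 helix segments
helices_2 = count_segments(pred_2, "a")    # 1 helix segment
```

A measure that also takes segments into account would rank prediction 2 above prediction 1.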

Neural network

PHD uses feed-forward neural networks on 3 levels: a) a sequence-to-structure net, b) a structure-to-structure net, c) a jury decision.

a) Sequence-to-structure net: a window 13 amino acids wide, described by residue frequencies, is used to predict one amino acid. The net has 5-15000 junctions. Training is by steepest descent minimization until > 70% accuracy, and is either balanced or unbalanced.

b) Structure-to-structure net: a window of 17 residues from the level 1 output. Training is either balanced or unbalanced.

c) Jury decision (from 2*2 different networks in levels 1 and 2).
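The jury level can be sketched as a simple average over the outputs of the individual networks, followed by a winner-takes-all decision. Averaging is an assumption made here for illustration; the function name and the output values are hypothetical.

```python
def jury_decision(network_outputs, states="HEL"):
    """Average the (helix, strand, loop) outputs of several networks for one
    residue and return the winning state together with the averaged outputs."""
    n = len(network_outputs)
    averaged = [sum(out[k] for out in network_outputs) / n
                for k in range(len(states))]
    best = max(range(len(states)), key=lambda k: averaged[k])
    return states[best], averaged


# Made-up outputs of 2*2 networks for a single residue.
toy_outputs = [
    (0.6, 0.3, 0.1),
    (0.2, 0.5, 0.3),
    (0.7, 0.2, 0.1),
    (0.5, 0.4, 0.1),
]
jury_state, jury_average = jury_decision(toy_outputs)
```

Here one of the four networks prefers strand, but the averaged output still favours helix.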

After prediction, the result is filtered. A helix longer than 3 residues is kept as a helix. A helix shorter than 3 residues is changed into a loop if RI < 4, and is extended if RI > 4.
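The part of the filter that removes short, unreliable helices can be sketched as below. Several details are assumptions: the reliability of a segment is taken as the mean RI of its residues, and the extension of short but reliable helices is left out because the rule for it is not spelled out here. The function name and the example values are hypothetical.

```python
from itertools import groupby


def filter_short_helices(prediction, ri, min_length=3, ri_cutoff=4):
    """Turn helices shorter than `min_length` into loop when their mean
    reliability index is below `ri_cutoff`. Other segments are left alone."""
    filtered = list(prediction)
    position = 0
    for state, run in groupby(prediction):
        length = len(list(run))
        if state == "H" and length < min_length:
            mean_ri = sum(ri[position:position + length]) / length
            if mean_ri < ri_cutoff:
                filtered[position:position + length] = "L" * length
        position += length
    return "".join(filtered)


# Made-up prediction with one short, unreliable helix and one longer helix.
raw_prediction = "LLHHLLHHHHLL"
raw_ri = [9, 9, 2, 3, 9, 9, 8, 8, 8, 8, 9, 9]
filtered_prediction = filter_short_helices(raw_prediction, raw_ri)
```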

Output

PHD outputs a Reliability Index (RI), i.e. how strong the predictions for alpha/beta/loop are.
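A reliability index of this kind can be derived from the network outputs for one residue. The sketch below assumes the index is the difference between the largest and the second largest output, scaled to the integers 0 to 9; this definition and the function name are assumptions and should be checked against the PHD papers.

```python
def reliability_index(outputs):
    """Integer 0-9: scaled gap between the best and second best output."""
    ranked = sorted(outputs, reverse=True)
    return min(9, int(10 * (ranked[0] - ranked[1])))


strong = reliability_index((1.0, 0.0, 0.0))     # clear winner
medium = reliability_index((0.75, 0.25, 0.0))   # moderate gap
weak = reliability_index((0.5, 0.5, 0.0))       # two states tie
```

A residue where two states receive nearly equal output gets a low index and should be trusted less.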

Summary

More than 6 % gained with sequence profiles
Earlier methods using NN overestimated accuracy (due to poor cross-validation)
2% gained with jury decision
The first method with > 70% accuracy
Performance worse for membrane proteins and single sequences
Reliability index helps to evaluate the prediction.
Balanced prediction by balanced training
Substantial improvements in predicting segment lengths
Secondary structure content predicted successfully
No decrease in overall accuracy by filtering method
Marginal influence of free parameters and potential improvements

Nearest-neighbour methods

The accuracy of secondary structure prediction methods has been improved significantly by the use of aligned protein sequences. The PHD method and the NNSSP method reach 71 to 72% of sustained overall three-state accuracy when multiple sequence alignments are used with neural networks and nearest-neighbor algorithms, respectively. We introduce a variant of the nearest-neighbor approach that can achieve similar accuracy using a single sequence as the query input. We compute the 50 best non-intersecting local alignments of the query sequence with each sequence from a set of proteins with known 3D structures. Each position of the query sequence is aligned with the database amino acids in alpha-helical, beta-strand or coil states. The prediction type of secondary structure is selected as the type of aligned position with the maximal total score. On the dataset of 124 non-membrane non-homologous proteins, used earlier as a benchmark for secondary structure predictions, our method reaches an overall three-state accuracy of 71.2%. The performance accuracy is verified by an additional test on 461 non-homologous proteins giving an accuracy of 71.0%. The main strength of the method is the high level of prediction accuracy for proteins without any known homolog. Using multiple sequence alignments as input the method has a prediction accuracy of 73.5%. Prediction of secondary structure by the SSPAL method is available via Baylor College of Medicine World Wide Web server.

One alternative to neural networks is the nearest-neighbour method described by Salamov. This method is based on identifying the nearest neighbours of a sequence. It works in the steps listed below; the number after each step is the fraction of correctly predicted secondary structure after that step.

1) Aligning a probe sequence against a number of known structures. A library of proteins is used, and the sequence is aligned against all of these proteins. The secondary structure of a residue is predicted as the majority state in a window (15 residues) over the nearest neighbouring proteins (50).

2) Using alignments against environments (3*6 classes) -> 66.5 %. The environment is defined by 6 classes according to burial and polarity, and 3 secondary structure classes.

3) More secondary structure types -> 67.2 %. 12 secondary structure classes (5 alpha-helical, 5 beta-strand, plus beta-turns and coils).

4) Restricted database -> 67.6 %. A smaller database was selected from the proteins that were most similar in Chou-Fasman secondary structure coefficients.

5) Balanced prediction -> 64.1 %. Beta sheets were badly predicted; this was changed by balancing. Before Qb = 42 %, after Qb = 65 %.

6) MSA -> 71.3 %. Using the mean score value, plus predicting gaps to be coils.

7) Jury -> 71.8 %. Varying window sizes, balanced/unbalanced and number of nearest neighbours.

8) Filter -> 72.2 %. BaB -> BBB etc.
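Step 1 of the list above can be sketched as a plain nearest-neighbour vote. This is a strong simplification and not Salamov's implementation: windows are compared by counting identical residues instead of using an environment-based score, and each neighbour votes with the state of its central residue. The function names and the toy library are hypothetical.

```python
from collections import Counter


def window_score(a, b):
    """Number of identical residues at matching positions (padding ignored)."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


def nearest_neighbour_predict(query, library, window=15, k=50):
    """library: list of (sequence, secondary_structure) pairs of equal length."""
    half = window // 2
    pad = "-" * half

    # Every library residue contributes one window and its central state.
    library_windows = []
    for seq, ss in library:
        padded = pad + seq + pad
        for i in range(len(seq)):
            library_windows.append((padded[i:i + window], ss[i]))

    padded_query = pad + query + pad
    prediction = []
    for i in range(len(query)):
        query_window = padded_query[i:i + window]
        neighbours = sorted(
            library_windows,
            key=lambda item: window_score(query_window, item[0]),
            reverse=True,
        )[:k]
        votes = Counter(state for _, state in neighbours)
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)


# Toy library (made up): one all-helix and one all-strand "protein".
toy_library = [("AAAAAAAA", "HHHHHHHH"), ("VVVVVVVV", "EEEEEEEE")]
nn_example = nearest_neighbour_predict("AAAA", toy_library, window=5, k=3)
```

The later steps (environment classes, balancing, multiple alignments, jury and filter) would refine the scoring and voting in this basic scheme.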

Arne Elofsson

Last modified: Tue Oct 13 17:49:25 CEST 1998