As we all know, proteins consist of secondary structure elements. Predicting these elements can help us understand more about the function of a protein without determining its three-dimensional structure. Furthermore, it is believed that secondary structure prediction is a step towards predicting the three-dimensional structure of a protein, as has been shown in some threading methods.
Long ago, experiments showed that synthetic polypeptides differ in their intrinsic ability to form different types of secondary structure. This led to the assumption that secondary structure is (partly) determined by the local sequence.
Chou and Fasman created one of the first prediction methods in 1978. From known protein structures they calculated the probability for each residue type to be in a certain secondary structure type. The residues were then classified into different groups, ranging from strong formers to strong breakers for each type of secondary structure: for instance Glu, Met, Ala and Leu were classified as strong helix formers, while Val, Ile and Tyr were strong sheet formers.
From a listing of these properties the prediction was done in a semi-manual fashion, for instance by initiating a helix in a region of strong helix formers.
A cluster of four helical residues nucleates a helix. The helix is then extended until it reaches a tetrapeptide where the average helix probability is lower than 1.0. If three out of five residues are sheet formers, a beta strand is nucleated and extended in a similar fashion as helices. If a region contains both sheet and helix nucleations, the secondary structure with the highest propensity is selected.
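A minimal sketch of this nucleation-and-extension rule for helices is given below. The propensity values are illustrative placeholders, not the published Chou-Fasman parameters, and the default value for non-formers is an assumption.

```python
# Chou-Fasman-style helix nucleation/extension sketch.
# Propensities below are placeholders, NOT the published parameters.
HELIX_P = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21}  # strong helix formers

def p_helix(aa):
    return HELIX_P.get(aa, 0.8)  # assumed default for non-formers

def predict_helices(seq):
    ss = ["-"] * len(seq)
    i = 0
    while i + 4 <= len(seq):
        # nucleate: a cluster of four helix-forming residues
        if all(p_helix(a) > 1.0 for a in seq[i:i + 4]):
            j = i + 4
            # extend until a tetrapeptide's mean helix propensity drops below 1.0
            while j + 4 <= len(seq) and sum(p_helix(a) for a in seq[j:j + 4]) / 4 >= 1.0:
                j += 1
            for k in range(i, j):
                ss[k] = "H"
            i = j
        else:
            i += 1
    return "".join(ss)

print(predict_helices("GAELMAAKKVVIYGAEML"))
```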
Garnier, Osguthorpe and Robson (GOR) extended the Chou-Fasman method with a cleaner and more systematic approach. A window of 17 residues is examined to predict the secondary structure of the central residue. The probability of finding a certain amino acid at a certain position in the window, given a certain secondary structure at the centre, is calculated. This yields 20*17*3 probabilities. For a given residue, the secondary structure is chosen as the state with the highest probability summed over all 17 window positions. This method gives about 65% accuracy.
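The sketch below shows how such a windowed sum could be computed. The 20*17*3 table of information values would be estimated from known structures; here random numbers stand in for it, so the output is meaningless but the mechanics are the same.

```python
# GOR-style prediction sketch: sum per-state information values over a
# 17-residue window and pick the best-scoring state for the centre residue.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
WIN, HALF = 17, 8
STATES = "HEL"  # helix, sheet, loop

rng = np.random.default_rng(1)
info = rng.normal(size=(len(AA), WIN, len(STATES)))  # placeholder table

def predict(seq):
    pred = []
    for i in range(len(seq)):
        score = np.zeros(len(STATES))
        for d in range(-HALF, HALF + 1):   # sum contributions over the window
            j = i + d
            if 0 <= j < len(seq):
                score += info[AA.index(seq[j]), d + HALF]
        pred.append(STATES[int(np.argmax(score))])
    return "".join(pred)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```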
One new method, PhD, completely revolutionized protein secondary structure prediction, taking it into a regime where it actually is very useful. For instance, modern methods achieve higher accuracy in secondary structure prediction than CD measurements. The accuracy is high enough that secondary structure prediction is often the first method used when trying to predict the structure of a protein.
The best prediction methods before PhD achieved about 66.2% accuracy. Furthermore, most methods predicted beta sheets considerably worse than helices.
You should find out how PhD works by reading the real papers about it, available on paper (I would recommend paper number 5) or electronically (none of the electronic papers is as complete as paper number 5 above, but the best of them is the 3rd Generation paper). The important points from these papers are highlighted below. So how is a prediction made by PhD better than a prediction made using, for instance, GOR? These are some statements mentioned in the articles above:
As you can see, the predictions are of higher quality, not only because more residues are correctly predicted but also because the prediction is more protein-like. The difference is very noticeable if you look at the output: the lengths of secondary structure elements in a PhD prediction look like those in a real protein, while it is not unusual for GOR to predict helices of only a few residues with sheet residues in between.
One important aspect of how PhD was created is that it was trained on a carefully selected dataset in which all pairs of proteins have low pairwise sequence identity (<25%). This is necessary because alignment to a homologue of known structure predicts secondary structure better than any other method, and the set was also used for careful cross-validation studies. Several earlier studies did not use equally good training sets.
The new idea in PhD is that proteins with similar sequences have similar secondary structure elements. Therefore one can predict the secondary structure of an aligned family and obtain the prediction for a single protein. This increases the performance significantly.
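One way to use an aligned family is to turn it into a position-specific amino-acid frequency profile and feed that, rather than a single sequence, to the predictor. The sketch below shows the idea; the toy alignment is made up.

```python
# Building a per-column frequency profile from a multiple sequence alignment.
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def profile(alignment):
    """One amino-acid frequency vector per alignment column, ignoring gaps."""
    prof = []
    for col in zip(*alignment):
        counts = Counter(a for a in col if a != "-")
        total = sum(counts.values()) or 1
        prof.append([counts.get(a, 0) / total for a in AA])
    return prof

aln = ["MKTAYIA",
       "MRTAYLA",
       "MKSA-IA"]
for row in profile(aln):
    print(["%.2f" % f for f in row])
```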
PhD uses standard feed-forward neural networks, divided into three levels.
The first network predicts the secondary structure of the central residue in a 13-residue window. The predicted secondary structure is then fed into the second network. The second network also predicts the secondary structure of the central residue of a window, but uses the output from the first network as its input. Finally, several different networks are trained using slightly different parameters, and a jury network is trained on the output from the two earlier levels. After the prediction, the result is filtered: if a predicted helix is more than three residues long it is kept as a helix; if it is shorter than three residues it is changed into a loop when the reliability index (RI) is below 4, and extended into a longer helix when the RI is above 4.
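The sketch below shows the shape of such a two-level window architecture. The layer sizes, the 17-residue second-level window, and the randomly initialised weights are assumptions for illustration; the real PhD networks are trained on profiles from multiple sequence alignments and combined by a jury.

```python
# Two-level window architecture sketch: sequence -> structure, then
# structure -> structure. Weights are random stand-ins for trained ones.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
W1, W2 = 13, 17      # first-level window (13 as in the text); 17 is assumed
N_STATES = 3         # helix, sheet, loop

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_hot(seq):
    m = np.zeros((len(seq), len(AA)))
    for i, aa in enumerate(seq):
        m[i, AA.index(aa)] = 1.0
    return m

def windows(features, w):
    """Pad and slice features into overlapping windows centred on each residue."""
    pad = np.zeros((w // 2, features.shape[1]))
    padded = np.vstack([pad, features, pad])
    return np.array([padded[i:i + w].ravel() for i in range(len(features))])

w_seq = rng.normal(size=(W1 * len(AA), N_STATES))   # level 1 weights
w_str = rng.normal(size=(W2 * N_STATES, N_STATES))  # level 2 weights

def predict(seq):
    x1 = windows(one_hot(seq), W1)
    p1 = np.apply_along_axis(softmax, 1, x1 @ w_seq)  # per-residue state probs
    x2 = windows(p1, W2)                              # feed level-1 output in
    p2 = np.apply_along_axis(softmax, 1, x2 @ w_str)
    return "".join("HEL"[np.argmax(p)] for p in p2)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```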
One alternative to neural networks is the nearest-neighbor methods. These are based on identifying the nearest neighbors of a sequence: you align your sequence (or multiple sequence alignment) against a library of proteins of known structure, under the assumption that it will align to regions with similar secondary structure. It has been shown that performance similar to PhD can be obtained, especially if the alignments are done using structural parameters as in the 3D-1D profile methods (see the threading section).
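A minimal sketch of the nearest-neighbor idea follows. The library of (window, structure) pairs is made up, and plain sequence identity stands in for the alignment score; real methods use substitution matrices or 3D-1D profile scores instead.

```python
# Nearest-neighbor secondary structure sketch with a toy fragment library.
LIBRARY = [
    ("AELMKAA", "HHHHHHH"),  # fragments and their known structure (made up)
    ("VVIYTEV", "EEEEEEE"),
    ("GPNGSDG", "LLLLLLL"),
]
WIN = 7

def identity(a, b):
    return sum(x == y for x, y in zip(a, b))

def predict(seq):
    pred = []
    for i in range(len(seq)):
        # take the window centred on residue i (truncated at the ends)
        start = max(0, i - WIN // 2)
        window = seq[start:start + WIN]
        # find the library fragment most similar to the window
        frag, ss = max(LIBRARY, key=lambda p: identity(window, p[0]))
        # adopt the neighbor's state at the corresponding position
        pred.append(ss[min(i - start, len(ss) - 1)])
    return "".join(pred)

print(predict("AELMKAAVVIYTEV"))
```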