As we all know, proteins consist of secondary structure elements. Predicting these elements can help us understand more about the function of a protein without determining its three-dimensional structure. Furthermore, it is believed that secondary structure prediction is a step towards predicting the three-dimensional structure of a protein, as has been shown in some threading methods.
Long ago, experiments showed that synthetic polypeptides differ in their intrinsic ability to form different types of secondary structure. This led to the assumption that secondary structure is (partly) determined by the local sequence.
Chou and Fasman created one of the first prediction methods in 1978. From known protein structures they calculated the probability for each residue type to be in a certain secondary structure type. The residues were then classified into different groups, ranging from strong formers to strong breakers for each type of secondary structure: for instance Glu, Met, Ala and Leu were classified as strong helix formers, while Val, Ile and Tyr were strong sheet formers.
From a listing of these properties the prediction was done in a semi-manual fashion, for instance by initiating a helix in a region of strong helix formers.
A cluster of four helical residues nucleates a helix. The helix is then extended until it reaches a tetrapeptide where the average helix probability is lower than 1.0. If three out of five residues are sheet formers, a beta strand is nucleated and extended in a similar fashion as helices. If a region contains both sheet and helix nucleations, the secondary structure with the highest propensity is selected.
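A minimal sketch of this nucleation-and-extension rule for helices is given below. The propensity values are illustrative placeholders, not the published Chou-Fasman parameters, and the default value for non-formers is an assumption.

```python
# Chou-Fasman-style helix nucleation/extension sketch.
# Propensities below are placeholders, NOT the published parameters.
HELIX_P = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21}  # strong helix formers

def p_helix(aa):
    return HELIX_P.get(aa, 0.8)  # assumed default for non-formers

def predict_helices(seq):
    ss = ["-"] * len(seq)
    i = 0
    while i + 4 <= len(seq):
        # nucleate: a cluster of four helix-forming residues
        if all(p_helix(a) > 1.0 for a in seq[i:i + 4]):
            j = i + 4
            # extend until a tetrapeptide's mean helix propensity drops below 1.0
            while j + 4 <= len(seq) and sum(p_helix(a) for a in seq[j:j + 4]) / 4 >= 1.0:
                j += 1
            for k in range(i, j):
                ss[k] = "H"
            i = j
        else:
            i += 1
    return "".join(ss)

print(predict_helices("GAELMAAKKVVIYGAEML"))
```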
Garnier, Osguthorpe and Robson (GOR) extended the Chou-Fasman method with a cleaner and more systematic approach. A window of 17 residues is examined to predict the secondary structure of the central residue. The probability of finding a certain amino acid at a certain position in the window, given a certain secondary structure at the centre, is calculated. This yields 20*17*3 probabilities. For a given residue, the secondary structure is chosen as the state with the highest probability summed over all 17 window positions. This method gives about 65% accuracy.
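The sketch below shows how such a windowed sum could be computed. The 20*17*3 table of information values would be estimated from known structures; here random numbers stand in for it, so the output is meaningless but the mechanics are the same.

```python
# GOR-style prediction sketch: sum per-state information values over a
# 17-residue window and pick the best-scoring state for the centre residue.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
WIN, HALF = 17, 8
STATES = "HEL"  # helix, sheet, loop

rng = np.random.default_rng(1)
info = rng.normal(size=(len(AA), WIN, len(STATES)))  # placeholder table

def predict(seq):
    pred = []
    for i in range(len(seq)):
        score = np.zeros(len(STATES))
        for d in range(-HALF, HALF + 1):   # sum contributions over the window
            j = i + d
            if 0 <= j < len(seq):
                score += info[AA.index(seq[j]), d + HALF]
        pred.append(STATES[int(np.argmax(score))])
    return "".join(pred)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```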
One new method, PhD, completely revolutionized protein secondary structure prediction, taking it into a regime where it actually is very useful. For instance, modern methods achieve higher accuracy in secondary structure prediction than CD measurements. The accuracy is high enough that secondary structure prediction is often the first method used when trying to predict the structure of a protein.
The best prediction methods before PhD achieved about 66.2% accuracy. Furthermore, most methods predicted beta sheets considerably worse than helices.
You should find out how PhD works by reading the real papers about it, available on paper (I would recommend paper number 5) or electronically (none of the electronic papers is as complete as paper number 5 above, but the best of them is the 3rd Generation paper). The important points from these papers are highlighted below. So how is a prediction made by PhD better than a prediction made using, for instance, GOR? These are some statements mentioned in the articles above:
As you can see, the predictions are of higher quality, not only because more residues are correctly predicted but also because the prediction is more protein-like. The difference is very noticeable if you look at the output: the lengths of secondary structure elements in a PhD prediction look like those in a real protein, while it is not unusual for GOR to predict helices of only a few residues with sheet residues in between.
One important aspect of how PhD was created is that it was trained on a carefully selected dataset in which all pairs of proteins have low pairwise sequence identity (<25%). This is necessary because alignment to a homologue of known structure predicts secondary structure better than any other method, and the set was also used for careful cross-validation studies. Several earlier studies did not use equally good training sets.
The new idea in PhD is that proteins with similar sequences have similar secondary structure elements. Therefore one can predict the secondary structure of an aligned family and obtain the prediction for a single protein. This increases the performance significantly.
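One way to use an aligned family is to turn it into a position-specific amino-acid frequency profile and feed that, rather than a single sequence, to the predictor. The sketch below shows the idea; the toy alignment is made up.

```python
# Building a per-column frequency profile from a multiple sequence alignment.
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def profile(alignment):
    """One amino-acid frequency vector per alignment column, ignoring gaps."""
    prof = []
    for col in zip(*alignment):
        counts = Counter(a for a in col if a != "-")
        total = sum(counts.values()) or 1
        prof.append([counts.get(a, 0) / total for a in AA])
    return prof

aln = ["MKTAYIA",
       "MRTAYLA",
       "MKSA-IA"]
for row in profile(aln):
    print(["%.2f" % f for f in row])
```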
PhD uses standard feed-forward neural networks, divided into three levels.
The first network predicts the secondary structure of the central residue in a 13-residue window. The predicted secondary structure is then fed into the second network. The second network also predicts the secondary structure of the central residue of a window, but uses the output from the first network as its input. Finally, several different networks are trained using slightly different parameters, and a jury network is trained on the output from the two earlier levels. After the prediction, the result is filtered: if a predicted helix is more than three residues long it is kept as a helix; if it is shorter than three residues it is changed into a loop when the reliability index (RI) is below 4, and extended into a longer helix when the RI is above 4.
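The sketch below shows the shape of such a two-level window architecture. The layer sizes, the 17-residue second-level window, and the randomly initialised weights are assumptions for illustration; the real PhD networks are trained on profiles from multiple sequence alignments and combined by a jury.

```python
# Two-level window architecture sketch: sequence -> structure, then
# structure -> structure. Weights are random stand-ins for trained ones.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
W1, W2 = 13, 17      # first-level window (13 as in the text); 17 is assumed
N_STATES = 3         # helix, sheet, loop

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_hot(seq):
    m = np.zeros((len(seq), len(AA)))
    for i, aa in enumerate(seq):
        m[i, AA.index(aa)] = 1.0
    return m

def windows(features, w):
    """Pad and slice features into overlapping windows centred on each residue."""
    pad = np.zeros((w // 2, features.shape[1]))
    padded = np.vstack([pad, features, pad])
    return np.array([padded[i:i + w].ravel() for i in range(len(features))])

w_seq = rng.normal(size=(W1 * len(AA), N_STATES))   # level 1 weights
w_str = rng.normal(size=(W2 * N_STATES, N_STATES))  # level 2 weights

def predict(seq):
    x1 = windows(one_hot(seq), W1)
    p1 = np.apply_along_axis(softmax, 1, x1 @ w_seq)  # per-residue state probs
    x2 = windows(p1, W2)                              # feed level-1 output in
    p2 = np.apply_along_axis(softmax, 1, x2 @ w_str)
    return "".join("HEL"[np.argmax(p)] for p in p2)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```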
One alternative to neural networks is the nearest-neighbor methods. These are based on identifying the nearest neighbors of a sequence: you align your sequence (or multiple sequence alignment) against a library of proteins of known structure, under the assumption that it will align to regions with similar secondary structure. It has been shown that performance similar to PhD can be obtained, especially if the alignments are done using structural parameters as in the 3D-1D profile methods (see the threading section).
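A minimal sketch of the nearest-neighbor idea follows. The library of (window, structure) pairs is made up, and plain sequence identity stands in for the alignment score; real methods use substitution matrices or 3D-1D profile scores instead.

```python
# Nearest-neighbor secondary structure sketch with a toy fragment library.
LIBRARY = [
    ("AELMKAA", "HHHHHHH"),  # fragments and their known structure (made up)
    ("VVIYTEV", "EEEEEEE"),
    ("GPNGSDG", "LLLLLLL"),
]
WIN = 7

def identity(a, b):
    return sum(x == y for x, y in zip(a, b))

def predict(seq):
    pred = []
    for i in range(len(seq)):
        # take the window centred on residue i (truncated at the ends)
        start = max(0, i - WIN // 2)
        window = seq[start:start + WIN]
        # find the library fragment most similar to the window
        frag, ss = max(LIBRARY, key=lambda p: identity(window, p[0]))
        # adopt the neighbor's state at the corresponding position
        pred.append(ss[min(i - start, len(ss) - 1)])
    return "".join(pred)

print(predict("AELMKAAVVIYTEV"))
```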