Bioinformatics for protein sequence, structure and function

Threading - protein fold recognition

One of the fundamental questions that remains to be solved in today's molecular biology is "How do proteins fold?". This can be divided into two parts: (1) we do not fully understand the physics of protein folding and would like to understand it better; (2) we want to put all crystallographers out of work and be able to predict the structure of a protein by hand (or with the help of a computer). The first of these questions is the fun one; however, it is not covered in this course, whereas the second one is.

The best (only?) method to predict the structure of a protein from its sequence is to find a protein whose structure is already known and that has a similar structure to the protein of interest. If such a protein exists and has high sequence similarity, the problem is solved and one can move on to the problem of homology modeling.

Many proteins with apparently unrelated sequences have been found to have very similar 3-dimensional structures. This has led to the development of methods to detect the fold of a sequence from a library of known folds. Some of these methods are based solely on sequence information (Dayhoff et al., 1993; Vingron & Waterman, 1994), others on multiple aligned sequences (Gribskov et al., 1997), others on structural information (Abagyan et al., 1994; Bowie et al., 1991; Bryant & Lawrence, 1993; Fetrow & Bryant, 1993; Flockner et al., 1995; Godzik et al., 1992; Jones et al., 1992; Kocher et al., 1994; Ouzounis et al., 1993; Rooman et al., 1992; Zhang & Eisenberg, 1994) and still others on both sequence and structural information (Matsuo & Nishikawa, 1994; Wilmanns & Eisenberg, 1993; Wilmanns & Eisenberg, 1995; Yi & Lander, 1994).

CASP

One fundamental problem when comparing structure prediction methods is that one has to know the structure of a protein before one can tell whether a prediction is correct. This also affects the development of prediction methods, as they may have been biased to perform better on already known structures. The solution, as in all science, is a blind test, i.e. one where neither the predictor nor the evaluator knows the answer (the structure in this case) before the prediction starts. To facilitate the evaluation of different methods using blind tests, the CASP process was initiated. Below is a description of CASP.

Methods for obtaining information about protein structure from the amino acid sequence have apparently been advancing rapidly. But just what can these methods currently deliver?

A first large scale experiment aimed at beginning to answer these questions was conducted in 1994, and culminated in a meeting at Asilomar, California at the end of that year. Some 135 predictions were made by 35 different groups. The results are published in a special issue of Proteins: Structure, Function and Genetics, volume 23, No 3, November 1995.

A second meeting on the Critical Assessment of Techniques for Protein Structure Prediction (December 1996) was a culmination of a 9-month-long, community-wide experiment. Before the meeting, 42 structural targets provided by crystallographers and NMR spectroscopists were made available to the prediction community. Prior to the public release of structures, more than 900 predictions by approximately 70 research groups worldwide were collected. The results are published in a special issue of Proteins: Structure, Function and Genetics, Suppl.1, 1997.

Multiple sequence information

Even though multiple sequence methods (such as hidden Markov models and sequence profile methods) are not traditionally considered to be threading methods, they are often among the best choices for protein fold recognition. These methods are described elsewhere in this course.

The 1D-3D profile method

In addition to sequence information, structural information can be used, and it can be included in several different ways. Bowie et al. (1991) described each position of a protein as being in one of eighteen environments. Other researchers have developed similar methods (e.g. Ouzounis et al., 1993; Yi & Lander, 1994). The environments in these methods are characterized by properties such as exposed atomic areas and the type of residue-residue contacts.

The principle of all these methods is as follows:

Reduction of the three-dimensional structure to a one-dimensional string of residue environments. Bowie defined these environments by measuring the area of the side chain that is buried in the protein, the fraction of the side chain area that is exposed to polar atoms, and the local secondary structure.
A scoring matrix is generated from the probabilities of finding each of the twenty amino acids in each of the environment classes as observed in a database of known structures and related sequences.
Generation of a position-dependent comparison matrix known as the 3D profile, i.e. defining the probability of finding a certain amino acid at a certain position of a given protein.
Alignment of a sequence with the 3D profile. The resulting alignment score is a measure of the compatibility of the sequence with the structure described by the 3D profile.
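
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Bowie et al.: the two environment classes "B" (buried) and "E" (exposed), the toy counts, the pseudocount and the linear gap penalty are all hypothetical, and the score is taken to be a log-odds ratio of the form ln(P(amino acid | environment) / P(amino acid)).

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_scores(counts, pseudocount=1.0):
    """Step 2: turn counts[env][aa] into scores[env][aa] = ln(P(aa|env) / P(aa))."""
    envs = list(counts)
    # Add a pseudocount so that unobserved amino acids do not give log(0).
    smoothed = {e: {a: counts[e].get(a, 0) + pseudocount for a in AMINO_ACIDS}
                for e in envs}
    grand_total = sum(sum(row.values()) for row in smoothed.values())
    background = {a: sum(smoothed[e][a] for e in envs) / grand_total
                  for a in AMINO_ACIDS}
    scores = {}
    for e in envs:
        env_total = sum(smoothed[e].values())
        scores[e] = {a: math.log((smoothed[e][a] / env_total) / background[a])
                     for a in AMINO_ACIDS}
    return scores

def build_profile(env_string, scores):
    """Steps 1 and 3: one row of twenty scores per structure position."""
    return [scores[e] for e in env_string]

def align_to_profile(sequence, profile, gap=-2.0):
    """Step 4: global dynamic programming alignment; returns the best score."""
    n, m = len(sequence), len(profile)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + profile[j - 1][sequence[i - 1]],
                           dp[i - 1][j] + gap,   # sequence residue left unaligned
                           dp[i][j - 1] + gap)   # profile position skipped
    return dp[n][m]

# Hypothetical toy data: hydrophobic residues dominate the buried class.
toy_counts = {
    "B": {"L": 30, "V": 25, "I": 20, "K": 2, "E": 2},
    "E": {"K": 30, "E": 25, "D": 20, "L": 3, "V": 3},
}
toy_scores = log_odds_scores(toy_counts)
toy_profile = build_profile("BBEEBB", toy_scores)
compatible = align_to_profile("LVKELV", toy_profile)
incompatible = align_to_profile("KELVKE", toy_profile)
```

With these toy numbers the sequence whose hydrophobic residues fall on the buried positions receives the higher alignment score, which is the sense in which the score measures compatibility between a sequence and a structure.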

Pairwise distance methods

Besides information about the environment of each residue, some researchers have used information about the distances between different amino acids (Bryant & Lawrence, 1993; Godzik et al., 1992; Hendlich et al., 1990; Flockner et al., 1995; Jones et al., 1992; Yi & Lander, 1994). When such distance information is used, the energy contributed by aligning one residue of the probe sequence at a certain position of the target depends explicitly on which other residues are in the vicinity. This distance information can be used in standard dynamic programming alignment with the aid of the "frozen approximation". In the frozen approximation, a probe residue placed at a given position is scored against the template's own amino acids at the neighboring positions, not against the other residues of the probe sequence (Flockner et al., 1995). Lathrop has shown that, without the frozen approximation, the threading problem is NP-complete when pairwise interactions are used and gaps are allowed (Lathrop, 1994). Finding the optimal alignment exactly without the frozen approximation is therefore computationally impractical in general. Besides the frozen approximation, several clever methods have been used to approximate the best possible alignment. Jones et al. used a double dynamic programming algorithm to overcome the frozen approximation (Jones et al., 1992), others used an iterative method in combination with the frozen approximation (Godzik et al., 1992; Wilmanns & Eisenberg, 1993), and still others a Monte Carlo method (Bryant & Lawrence, 1993). However, none of these alternative methods guarantees finding the best possible alignment.
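
The frozen approximation can be illustrated with a short Python sketch. Everything here is a hypothetical simplification: the contact list, the toy template and the two-valued pair energy (a favorable -1.0 for a hydrophobic-hydrophobic contact, 0.0 otherwise) stand in for a real distance-dependent potential. The point it shows is that, once the interaction partners are fixed to the template's own residues, the energy of placing a probe residue at a position no longer depends on the rest of the alignment, so the resulting matrix can be handed to an ordinary dynamic programming alignment.

```python
HYDROPHOBIC = set("AVLIMFWC")

def toy_pair_energy(a, b):
    """Hypothetical contact energy: lower (more negative) is more favorable."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0

def frozen_position_score(residue, position, template_seq, contacts, pair_energy):
    """Energy of placing a probe residue at a template position.

    The neighbors are the template's native residues (the frozen
    approximation), not whatever probe residues end up aligned there.
    """
    return sum(pair_energy(residue, template_seq[j]) for j in contacts[position])

def frozen_score_matrix(probe, template_seq, contacts, pair_energy):
    """Matrix[i][j]: energy of probe residue i at template position j."""
    return [[frozen_position_score(a, j, template_seq, contacts, pair_energy)
             for j in range(len(template_seq))]
            for a in probe]

# Hypothetical five-residue template with its spatial contacts per position.
template = "LKVEL"
contact_map = {0: [2, 4], 1: [3], 2: [0, 4], 3: [1], 4: [0, 2]}
matrix = frozen_score_matrix("IK", template, contact_map, toy_pair_energy)
```

In this toy case the isoleucine of the probe is favorable at positions 0, 2 and 4, whose template neighbors are hydrophobic, while the lysine scores zero everywhere. Because the values are energies, an alignment step using them would minimize the total, or equivalently maximize its negative.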

Prediction based methods

In protein fold recognition, one assigns a probe amino acid sequence of unknown structure to one of a library of target three-dimensional structures. Correct assignment depends on effective scoring of the probe sequence for its compatibility with each of the target structures. Here we show that in addition to the amino acid sequence of the probe, sequence-derived properties of the probe sequence (such as the predicted secondary structure) are useful in fold assignment. The additional measure of compatibility between probe and target is the level of agreement between the predicted secondary structure of the probe and the known secondary structure of the target fold. That is, we recommend a sequence-structure compatibility function that combines previously developed compatibility functions (such as the 3D-1D scores of Bowie et al., 1991 or sequence-sequence replacement tables) with the predicted secondary structure of the probe sequence.
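
One simple way to picture such a combined compatibility function is to add a secondary structure agreement term to an existing position-by-position score. The sketch below is an assumption made for illustration only and is not the function used in the work described above: the additive form, the match and mismatch values and the weight of 0.5 are all hypothetical, as is the use of "H" and "E" as helix and strand labels.

```python
def ss_agreement(predicted, known, match=1.0, mismatch=-1.0):
    """Reward agreement between predicted (probe) and known (target) states."""
    return match if predicted == known else mismatch

def combined_score(base_score, predicted_ss, known_ss, weight=0.5):
    """Base compatibility score plus a weighted secondary structure term."""
    return base_score + weight * ss_agreement(predicted_ss, known_ss)

def combined_matrix(base, probe_ss, target_ss, weight=0.5):
    """Apply combined_score to every (probe position, target position) pair.

    base[i][j] is any previously computed compatibility score, for example
    a 3D-1D score or a sequence-sequence replacement score.
    """
    return [[combined_score(base[i][j], probe_ss[i], target_ss[j], weight)
             for j in range(len(target_ss))]
            for i in range(len(probe_ss))]

# Hypothetical two-residue example: the probe is predicted helix then strand.
base_scores = [[1.0, 0.0],
               [0.0, 1.0]]
scores = combined_matrix(base_scores, probe_ss="HE", target_ss="HE")
```

The combined matrix can then be used in the same alignment procedure as the base scores, so that positions where the predicted and known secondary structure agree are favored.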

The effect on fold assignment of adding predicted secondary structure is evaluated here by using a benchmark set of proteins (Fischer et al., 1996). The 3D structures of the probe sequences of the benchmark are actually known, but are ignored by our method. The results show that the inclusion of the predicted secondary structure improves fold assignment by about 25%. The results also show that if the true secondary structure of the probe were known, correct fold assignment would increase by an additional 8-32%. We conclude that incorporating sequence-derived predictions significantly improves assignment of sequences to known 3D folds. Finally we apply the new method to assign folds to sequences in the Swissprot database; 6 fold assignments are given that are not detectable by standard sequence-sequence comparison methods; for two of these, the fold is known from x-ray crystallography and the fold assignment is correct.

The method by Rost is described in detail here.

Arne Elofsson

Last modified: Wed Oct 27 15:44:37 CEST 1999

Arne Elofsson, Stockholm Bioinformatics Center, Department of Biochemistry, Arrheniuslaboratoriet, Stockholms Universitet, 10691 Stockholm, Sweden. Tel: +46-(0)8/161553 Fax: +46-(0)8/158057 Home: +46-(0)8/6413158 Email: arne@sbc.su.se WWW: /~arne/