A study of sequence comparison methods
M-Amin Bakali
981102
Master's degree project at the
Department of Biochemistry, Stockholm University
Supervisor: Arne Elofsson
Company:
Department of Biochemistry, Stockholm University.
Tutor at company:
Arne Elofsson.
Tutor at Mälardalens Högskola:
Sven Oscarsson.
Abstract:
Using protein sequences whose relationships are known reliably from their structures and functions as described in the SCOP database as our test set. We attempt to assess different sequence comparison methods. The methods used in this work are the three most popular and widely used in the field of sequence comparison methods and in the field of searching databases. Namely SSEARCH, FASTA, and BLAST. These programs were run with different scoring matrices and gap penalties. BLOSUM50, BLOSUM62, and PAM250, were the used matrices. To show the performance of these programs we used the specificity (rejection of unrelated proteins) and sensitivity (detection of homologies) as a measure. The fundamental question in this work has been: how well does different pairwise sequence comparison programs work with different scoring matrices, and different gap penalties. We found that the best method is SSEARCH followed by FASTA ktup1, FASTA ktup2, and BLAST. Concerning matrices BLOSUM50 is the best performing followed by BLOSUM62 and PAM250. We found that the best gap penalty choice is -16-1, -14-2, -14-1, and
-12-2, for SSEARCH, FASTA ktup1, FASTA ktup2, and BLAST respectively with BLOSUM50 as scoring matrix.
Stockholm.
Date: 981102
M-Amin Bakali
To
my family, especially my parents
TABLE OF CONTENTS
1. INTRODUCTION 5
1.2 BACKGROUND 6
1.3 METHODS: GENERAL DESCIPTION 7
1.3.1 Alignment 7
1.3.1.1 Definition of sequence alignment 7
1.3.1.2 Why make sequence alignments 7
1.3.1.3 Methods for making sequence alignments 8
1.3.1.4 Dot plots matrix 9
1.3.2 Scoring matrices 10
1.3.2.1 PAM 11
1.3.2.2 BLOSUM 12
1.3.3 Gap penalty 12
1.3.4 Overview of databases and search programs 13
1.3.4.1 SSEARCH 14
1.3.4.2 FASTA 15
1.3.4.3 BLAST 16
2. SELECTED METHODS 18
3. RESULTS 20
4. DISCUSSION XX
5. REFERENCES XX
1. INTRODUCTION
The fundamental building blocks of life are proteins. Enzymes, which are the molecular machines responsible for virtually all of the chemical transformations that cells are capable of, are proteins. In addition, much of the structure of a cell is made up of proteins. The other parts of the structure which is not made up of proteins is produced by enzymes. Proteins are variable length linear, mixed polymers of 20 different amino acids. These topologically linear polymers fold upon them to generate a shape characteristic of each different protein, and this shape along with different chemical properties of the amino acids determines the function of the protein. One of the most important concepts of modern biology is that the functional properties of proteins is determined largely by the sequence of amino acids in the linear polypeptide chain; thus in theory, knowing the sequence of a protein-the order with which amino acids occurred- one could infer its function. If you have just determined a sequence of an interesting protein, one of the first questions you are likely to ask yourself is: has anybody else seen anything like this?. Fortunately, there has been a very successful international efforts to collect all the sequences people have determined in one place so they can be searched. A number of programs have been written to rapidly search a database for a query sequence.
Which method should you use to search a database?
This question is about controversial as that over the choice of computer (Mac vs PC). In fact as you enter the world of sequence analysis, you find a variety of programs- SSEARCH, FASTA, and BLAST are not the only one- and worse, new programs are constantly appearing. In addition to this, even having selected a program, you will frequently have to select values for parameters-matrices, gap penalties are some of them. There are no magic answer to help you doing these things. This study is an attempt to give guidelines to help searching a protein database.
1.2 BACKGROUND
In the last 10 years a torrent of protein, RNA, and mainly DNA sequence data has been flooding out, largely under the impulse of Human Genom Project, HUGO. Thus, biologists are facing the problem of storing and analyzing tremendous amount of data. Fortunately, during that time, sequencing technology and computer science have grown in parallel, leading to the spread of computational biology. Currently, a lot of genomic and protein sequence information is stored in different public computer data banks. Moreover, user friendly and powerful software for sequence analysis is now readily available. Hence, biologists do not necessarily need to be expert in mathematics and computer science to use tools for computer aided sequence analysis. Neither do they have to play with cryptic computer programs. However the users of these user friendly programs should not use these software as a black box. They rather need to understand the assumptions and the principles of the methods on which they rely. Indeed, to accurately interpret the results obtained with a computer program, researchers must keep in mind the assumptions made by the used method. Moreover, the available programs offer a lot of parameters to be chosen and playing with this parameters if used the right way allows an improvement and good results to be obtained if not good results will be missed. So these user friendly software are not so friendly if the underlying principles of the methods are not somehow understood. Following is an attempt to explain and clarify some of the basics in the field of sequence comparison.
1.3 METHODS: GENERAL DESCRIPTION
1.3.1 Alignment
1.3.1.1 Definition of sequence alignment
An alignment refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Identical or similar characters are placed in the same column, and non- identical characters can either be placed in the same column as a mismatch or opposite to a gap in one of the other sequences. In an optimal alignment, non-identical characters and gaps are so placed to bring as many identical or similar characters as possible into vertical register. Two types of sequence alignment have been recognized, Global and local. The global alignment optimizes the alignment over the full-length of the sequences. In local alignment, stretches of sequence with the highest density of matches are given the highest priority. The following is an example of global and local alignment.
Global alignment:
LGPSTKDFGKISESREFDN
| | | | | |
LNQLERSFGKINMRLEDA
The alignment is stretched over the entire sequence lengths to include as many matching amino acids as possible up to and including the sequence ends. Although there is an obvious region of identity in this example ( the sequence FGKI), a global alignment may not align such regions in order to favor matching more amino acids along the entire sequence length.
Local alignment:
----------FGKI----------
| | | |
----------FGKI----------
Local alignment of the same sequences as above. In this case, the alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favors finding conserved amino acid motifs in related protein sequences.
1.3.1.2 Why make sequence alignment
There are two principle reasons for comparing and aligning protein sequences: To obtain an accurate alignment for protein modeling by comparison to proteins of known three-dimensional structure is the first reason. The second and the most used is to scan a database with newly determined protein sequence and identify possible functions for the protein by analogy with characterized proteins.
Before starting this analysis, it's important to think about the questions we might be asking in sequence comparisons. If we find that many characters in one sequence are the same as they are in the other sequence, then we say they are similar. Later, we will calculate a similarity score, which gives the probability that the sequences are related. The following may be true for similar sequences:
1. The sequences may share a common origin such as a common ancestor sequence or a gene duplication event. If we have additional evidence for an evolutionary relationship, then we say that the sequences are homologous.
2. The sequence may have the same or related structure or function.
The stronger the alignment-high similarity between sequences- the more likely they are to be related. Very similar sequences that are almost identical along their lengths almost certainly have the same function. Sequences that are only weak similar may or may not be related and no firm conclusion may be drawn about their relationship. Since the discovery that the myoglobins are very similar though their sequences are not, it has been apparent that comparing structures is a more powerful if less convenient way to recognize distant evolutionary relationship than comparing sequences. Percentage identity is frequently quoted statistic for an alignment of two sequences. However, the expected value of percentage identity may be overlooked. Clearly, an alignment of length 200 showing 30% identity is more significant than an alignment of length 50 with the same identity. Sander and Schneider [2] used protein structures to evaluate sequence comparison. Their work focused on determining a length-dependent threshold of percentage identity, above which all proteins would be of similar structure. A result of this analysis was the HSSP equation; it states that proteins with 25% identity over 80 residues will have similar structure, whereas shorter alignments require higher identity. Some rule of thumbs for alignments are:
A gap and its length are distinct quantities. Different weights should be applied to each.
Weights for different mismatches should be permitted. A transition is more likely than a transversion; Ile-Val more likely than Ile-Arg change.
Unless two sequences are known to be homologous over their entire length, a local alignment is preferred to global alignment.
An alignment demonstrates similarity, not necessarily, homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance or analogy.
1.3.1.3 Methods for making sequence alignment
Visual editing of alignment has the advantage of giving full control to the user. The experience of the researcher with the sequences he studies can be an important factor in obtaining a meaningful alignment. Computer programs cannot easily match the cognitive capacities of the human brain. On the other hand visual alignment methods are tedious. Especially when alignments get larger, it can put a severe strain on researcher and the time needed to manually adjust the alignment can become just too big. The first programs used for visual editing of alignments were plain text editors or word processors [3], but this method is not practically usable.
1.3.1.4 Dot plots matrix
The comparison of sequences can be done by different ways. The most direct method is to make this comparison via a visual means and this is what dot plots attempt to do. Dot plots are groups of methods that visually compare two sequences and look for regions of close similarity between them. The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two sequences are identical a dot is places, in other words at the intersection of every row and column that have the same letter in both sequences a dot is placed. A diagonal stretch of dots will indicate regions where the two sequences are similar. Done in this fashion a dot plot as shown in figure 1 will be obtained.
A C T T T T G A A
A
A
C
T
T
G
T
A
Figure
1: If the frequency of matched letters between two sequences is
high, particularly in DNA sequences, which are composed of only four
building blocks, the background noise is high.
Maizel and Lenk [4] popularized the dot plot and suggested the use of a filter to reduce the noise. The noise is not a true reflection of similarities between the sequences. Maizel and Lenk suggested the use of a filter to reduce the background noise. Only when a specified proportion of a small group of successive bases match. Figure 2 shows the effect of reducing the background noise.
A C T T T T G A A
A
A
C
T
T
G
T
A
Figure 2: In order to reduce noise, one can place a dot only when several joined letters matched. This word approach allows us to find more significant matches between two sequences. The word length here is 2.
There is another method in which dot plots can be generated very quickly. This method involves a computer method commonly known as hashing or list- sorting. These methods are incorporated into FASTA algorithms. Basically, the idea is that instead of taking the complete matrix and calculating points for every entry in that matrix, a great saving can be made if the algorithm searches only for exact matches. The algorithm simply sub-divides the sequences into all words of a user specified block size. In addition, for both sequences the location of each word is also recorded. These array of words are then sorted alphabetically and the arrays of locations are sorted in parallel with them. Then by comparing the sorted array from one sequence with that from the other sequence immediately gives the location of all identical words.
1.3.2 Scoring matrices:
The simplest scoring scheme is the identity scoring matrix, in which amino acid pairs are classified into two types: identical or non-identical. Non identical pairs are scored 0 and identical pairs scored 1. It is even called the unitary scoring matrix. This type of scoring is generally considered less effective than schemes that weight non identical pairs [5]. Many alternatives to the unitary scoring matrix have been suggested. One of the earliest suggestions was scoring matrix based on the minimum number of bases that must be changed to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences than the unitary matrix approach. The minimal mutation distance matrix is an improvement because it incorporates knowledge about the process of generating mutations from one amino acid into another. However it still ignores the processes of selection that determine which mutations will survive in a population.
Another improvement is a scoring matrix based on selected physical, chemical, or structural properties shared and not shared by the 400 pairs of amino acids. Specific instances of this approach work well for sequences that have been strongly conserved during the evolution. The intuitive scheme developed by McLachlan [6] classified amino acid on the basis of polar or non-polar character, size, shape, and charge, and gives a score of six to interconversions between identical rare amino acids (e.g. F, F) reducing to zero for substitutions between amino acids of quite different character (e. g. F, E) . Another approach similar to McLachlan is by combining information from the structural features of amino acids and genetic code. However this approach suffers from problems with balancing the contributions of the different properties to the positive selection of mutations and from ignoring the different rates at which different mutations are generated. As we can see a metric of similarity between amino acid pairs is important, and not only the choice of which scoring matrix is important. It is also important that all subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.
The importance of scoring matrices:
l Scoring matrices appear in all analysis involving sequence comparison.
l The choice of matrix can strongly influence the outcome of the analysis.
l Scoring matrices implicitly represent a particular theory of evolution.
l understanding theories underlying a given scoring matrix can aid in making a proper choice.
1.3.2.1 PAM:
The best improvement achieved over the unitary matrix was based on evolutionary distances. Margaret Dayhoff pioneered this approach in the 1970's. She made an extensive study of the frequencies in which amino acids substitute for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. This lead to a table of relative frequencies with which amino acids replace each other over an evolutionary period. This table and the relative frequency of occurrence of amino acids in the proteins studied were combined in computing the PAM (point Accepted Mutation) family of scoring matrices. Each PAM matrix has a number associated with it. The number of mutations per 100 amino acids. The traditional PAM matrix, the PAM250 [7], often referred to as the Dayhoff matrix, assumes the occurrence of 250 point mutation per 100 amino acids.
The PAM matrices have theoretical advantages over alternative methods of scoring alignments. From biological point of view PAM matrices are based on observed mutations. Thus they contain information about the processes that generate mutations as well as the criteria that are important in selection and fixing mutation with a population. Another advance of this type of scoring matrix is from the statistical point of view. PAM matrices contain accurate description of the changes in amino acid composition that are expected after a given number of mutations that can be derived from the data used in creating the matrices. Thus the highest scoring alignment is statistically most likely to have been generating by evolution rather than by chance. The PAM matrices and other substitution matrices discussed below are generally presented as log-odds matrices. That is each score in the matrix is the logarithm of an odds ratio. The odds ratio used is the ratio of the number of times residue A is observed to replace residue B divided by the number of times residue A would be expected to replace residue B if the replacement occurred at random. Thus positive scores in the matrix designate a pair of residues that replace each other more often than expected by chance. Negative scores in the matrix designate pairs of residues that replace each other less often than would be expected by chance and are evidence against the sequences being homologous. One can use the matrix to objectively select groups of amino acids that represent conservative substitutions in proteins, because it summarizes the observed replacement that have taken place while conserving the structural and functional properties of proteins. Finally the PAM matrix provides an empirical, experimental determination of conserved replacement and it is more suitable for scoring an alignment than matrices mentioned above.
1.3.2.2 BLOSUM:
one assumption inherent in the Dayhoff model is that the evolutionary rates are uniform over the whole of protein sequence. That is likely not to be true, because the evolutionary rates are obviously lower in conserved regions and higher in non-conserved regions of proteins. Henikoff and Henikoff [8] have approached this in a different way. Their idea was to get a better measure of differences between two proteins specifically for more distantly related proteins. They used the BLOCKS database to search for differences among sequences but only among the very conserved regions of a protein family. Hence the term BLOSUM is from BLOcks SUbstitution Matrix. They first collect all of the sequences in the BLOCKS database and then for each one they sum the number of amino acids in each site to get a frequency table of how often different pairs of amino acids are found together in these conserved regions. These frequencies can be written down in a frequency table, and the odds for relatedness is calculated in a similar way as for Dayhoff matrix.
Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold of positions are clustered. More sequences are added to the cluster if they are identical to any sequence already in the cluster at the same level. This lead to a series of matrices. The matrices are called BLOSUM matrices, with an index denoting the level of clustering, for example BLOSUM62 is derived from sequence blocks clustered at the 62% identity level.
1.3.3 Gap penalty:
Different programs use different names for the gap penalty. In addition the programs allow the user to specify a gap penalty (the penalty for introducing a gap into a sequence) and a gap extension penalty (the penalty for extending an already existing gap). Decreasing the gap creation penalty favors the introduction of gaps into the alignment, and decreasing the gap extension penalty favors the formation of long gaps. There is no hard and fast rule for deciding the correct gap penalty to use for a given alignment. The recommended strategy is to start by using the default gap penalty value. If a correct alignment is critical, the alignment procedure can be repeated using a higher and a lower gap penalty. The resulting alignment can then be compared to identify sequence regions where changing the gap penalty has a significant effect. Ultimately, the best gap penalty value can only be determined using trial and error and knowledge of the biological function of the aligned sequences.
1.3.4 Overview of databases and search programs:
Database
Protein DNA
SSEARCH tfasta
Fasta tfastax3
Protein Blastp tblastn
Q Psi-blast
U
E
R SSEARCH
Y Fastx3 Fasta
DNA Blastn
Blastx Tblastx
Figure 3: An overview of different sequence comparison methods and their utilization
One of the major advances in computer technology has been in connectivity. Most computers can now be connected to a network that permits access to other computers all over the word. This permits any one with a computer to access databases of all kinds, if they know how.
There are three major nucleotide sequence databases. There are EMBL (European Molecular Biology Laboratory), NCBI (the U.S. National Center for Biotechnology Information) and DDBJ (the DNA data Bank of Japan). Each of these databases attempt to collect all of the known nucleic acid (DNA/RNA) sequences. The sequences were collected from published sources and most journals now require submission of the sequences to a database before publication is permitted. In addition many sequences are deposited into the databases that have not been published. In addition to the sequences the databases also contain many other useful bits of data, including organism, tissue, function, and other information. The best feature with these three databases is that they are in electronic contact with each other and exchange sequence information daily. Hence you do not need to worry that one database might not have the sequence of interest but a search of some other databases would have it.
There are also a large number of protein sequence database PIR (Protein Identification Resource) is one of them. This database of protein sequences has data on x-ray crystallography and active site determination. The PDB (Protein Data Bank) also contains other structural features such as bond connectivity data. PROSITE database lists distinct structural motifs in protein. This includes amino acids post-transnational modifications, topogenic sequences, domains of specific biological function (e.g. DNA binding domains), enzyme active sites and signature patterns that are specific to a family or group of proteins. In addition to these protein databases, there are databases devoted to particular families of proteins and to particular organisms. There are also protein databases constructed from translations of the nucleotid databases - NCBI's is called GenPept and EMBL's is termed TREMBL. All of these databases can be accessed through their web sites
1.3.4.1 SSEARCH:
At the extreme slow end of database searcher is SSEARCH. It does sequence comparison using the Smith-Waterman algorithm [9]. Which is a variant of the dynamic programming algorithm. It provides a rigorous mathematical approach towards sequence alignment. It is guaranteed to find the best alignment between a pair of sequences given a particular choice of scoring matrix and gap penalties. The Smith-Waterman algorithm searches a database by using local alignments. This allows the database search to focus on the most highly conserved regions of the sequences without having to overcome interference from less well conserved regions of the sequences. It also allows similar domains within sequences to be identified, such as nucleotide binding domains, even though the sequences may not be related over the entire length. The Smith-Waterman algorithm places no restriction on the alignment it reports other than it have a positive score in terms of the similarity table used to score the alignment. It is worth the effort to understand how Smith-Waterman alignments are actually computed. First users should be convinced that the algorithm is mathematically rigorous in sense described here. That it does test all possible alignments, and always does find the best scoring alignment.
Smith-Waterman is easily described in a recursive, mathematical equation. Recursive means that the results are computed in steps with any subsequent step depending on the answers of previous steps. The equation is:
SWi,j = max {SWi-1,j-1 s(ai,bj) ; SWi-k,j gj ; SWi,j-k gi ; 0}
SWi,j is the Smith-Waterman score for the partial alignment ending at residue i of sequence A and residue j of sequence B. Putting zero in the recursion is saying if the partial alignment score becomes negative during the calculations we want to ignore the preceding calculations and start over from a neutral score. The first term, SWi-1,j-1 s(ai,bj), corresponds to extending the alignment by one residue from each sequence. The second term, SWi-k,j gj, describes extending the alignment by including residue j from sequence B and inserting a gap of k residues in length, aligned to end with residue j of sequence B into sequence A. The third term, SWi,j-k gi, is the equivalent term for inserting a gap into sequence B. Figure 4 shows an example of Smith-waterman alignment for a DNA sequence, where scores are: Match=5, mismatch=-4, gap=-7. It is a useful exercise to calculate by hand some of the values for specific cells in the table to be sure of understanding the algorithm.
G C T G G A A G G C A T
0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 0 0 5 5 0 0 5 5 0 0 0
C 0 0 10 3 0 1 1 0 0 1 10 3 0
A 0 0 3 6 0 0 6 6 0 0 3 15 8
G 0 5 0 0 11 5 0 2 11 5 0 8 11
A 0 0 1 0 4 7 10 5 4 7 1 5 4
G 0 5 0 0 5 9 3 6 10 9 3 0 1
C 0 0 10 3 0 2 5 0 3 6 14 7 0
A 0 0 3 6 0 0 7 10 3 0 7 19 12
C 0 0 5 0 2 0 0 3 6 0 5 12 15
T 0 0 0 10 3 0 0 0 0 2 0 5 17
Figure 4: Matrix generated from the application of Smith-Waterman equation. Hightlighted elements indicate the trackback path from maximal element.
The cells corresponding to the best scoring alignment between the two sequences are highlighted. The best scoring alignment is that ends at the cell in table with the highest score. The resulting alignment is shown below:
G A A G - G C A
G C A G A G C A
1.3.4.2 FASTA:
FASTA has been developed by William R. Person and David J. Lipman [10] as a tool for analyzing protein and DNA sequence similarity that achieve a balance of sensitivity and selectivity on one hand and speed and memory requirements on the other hand. As they described it, the search algorithm proceeds through four steps in determining a score for pair-wise similarity. FASTA uses a lookup table to locate all identities or groups of identities between the two sequences to be compared. During this step FASTA gains much of its speed. The parameter used here is the ktup, which determines how many consecutive identities are required in a match. For example using ktup=2, only those identities that occur in a run of two consecutive matches are examined. The 10 best diagonal regions are found using ktup without considering shorter runs of identities, conservative replacements, insertions or deletions. In the second step these regions are rescored using a scoring matrix. For each of these best diagonal regions, a subregion with maximal score is identified. These regions are referred to as the initial regions. In the third step FASTA optimally joins initial regions with scores greater than a threshold. The fourth step of the comparison considers all possible alignments of the query and library sequences that fall within a band centered around the highest scoring initial regions.
A B
C D
1.3.4.3 BLAST:
Basic Local Alignment Search Tool (BLAST), is a new approach to rapid sequence comparison [11]. Like FASTA, BLAST is basically a simplification of Smith-Waterman algorithm. BLAST takes an input DNA or protein sequence and compares this sequence to all of the sequences in the target sequence database, then reports the results of the best matches and alignments of these matched sequences with the input sequence. BLAST achieves the high speed comparison of sequences by comparing a list of short sequence fragments called words derived from the target sequence with similar sets of words. For protein sequences the word length is 3 and for DNA sequences, 12. The object is to find words in the target database sequences that are closely related to words in the test sequence. When a related word is found, the regions upstream and down stream from the word in the two sequences are compared to see if the word is contained in a longer region that can be aligned. Using an initial short word search to find these longer regions greatly increase the speed of search. In comparing words in the test database sequences, only matches that yield a certain minimum score T are considered. When a score of T is found an attempt is made to extend the alignment from the word in each alignment from the word in each direction along the sequences using the same scoring procedure as used for finding word matches. The extension process in each direction is stopped when the accumulated score stops increasing and has began to fall a certain amount below the best score found for shorter extensions. At this point, a larger stretch of sequences -a segment pair- with larger score may have been found. The last step of BLAST is to determine if these segment pair scores are significantly greater in value than a cutoff score S. this cutoff score is determined empirically by examining the range of scores found by comparing random sequences, and by choosing a value significantly greater in value [12]. Finally, the maximal scoring pairs (MSPs) matched in the entire database are identified and listed. Figure 6 shows an example of identification and extension of word matches. The database sequences found to have highest scoring extended regions are then reported.
L P S L D L L Test sequence
M P S L D L L Database sequence
< word > Three letter word found initially
4 4 6 Blosum62 score, word score = 14
Extension Extension
to left to right
2 7 4 4 6 4 4
< Maximal Segment pair > Score 14 + 9 + 8 = 31
2. SELECTED METHODS
The database used in this study was the Structural Classification Of Protein -SCOP database [1]. This database provides a detailed and comprehensive description of structural and evolutionary relationships of proteins whose three-dimensional structures have been determined. The classification of protein structures in the database is based on evolutionary relationships and on the principles that govern their three-dimensional structure. The choice of the SCOP database is motivated by the its quality [13]. The method used to construct the protein classification in SCOP is essentially the visual inspection. In this study we used a data set with 1434 sequences with less than 40% identity. For the assessment the family was used. referring to scop the family is defined on two criteria: first all proteins that have residue identities of 30% and greater; Second, proteins with lower sequence identities but whose functions and structures are very similar.
Programs used was: SSEARCH, version 3.0t82 November 1, 1997, FASTA, version 3.0t82 November 1, 1997, and BLASTP 2.0.4 [Feb-24-1998].FASTA was run with two modes, ktup 1 and ktup2. The matrices use are BLOSUM50, BLOSUM62, and PAM250. BLOSUM 62 is the default matrix in BLAST, the valid alternative choice include PAM250. BLOSUM50 is the default matrix in both SSEARCH and FASTA, BLOSUM62 and PAM250 are included in the scoring matrix table for protein.
Gap penalties: For BLAST:-12-2, -13-2,-14-2,-15-2, -16-1, -17-1, -18-1 were used as gap opening and gap extension when using BLOSUM50. -10-1, -11-1, -12-1 with BLOSUM62. Using PAM250 gap penalties were: -10-3, -11-3, -12-2, -14-2, -15-2, -16-1, -18-1, -19-1. Note that not all gap penalty combinations were allowed for BLAST as for SSEARCH and FASTA, therefore the combination above was chosen.
For SSEARCH and FASTA -8, -10, -12, -14 and -16 were chosen as gap opening penalties. 0, -1, -2, -3, -4 and -5 were chosen as gap extension penalties.
After choosing program scoring matrix and gap penalty the sequences were compared against each other- all against all. The method used for assessing the results is the Specificity, Sensitivity, we plotted the specificity against the sensitivity for a given score. Specificity and sensitivity are defined as below:
Specificity = #TP/(#TP + #FP)
Sensitivity = #TP/(#TP + #TN)
#TP is the number of True Positive
#FP is the number of False Positive
#TN is the number of True Negative
A pair of sequence that can be called true is the one than is reported by the program as being related, and if it is from the same family according to SCOP. False are sequences that are reported by the program as being related but they are not from the same family according to SCOP.
True positive are the sequences that are true and have higher scores than a certain score, sequences that have lower scores are called true negatives. The false positives and false negatives are defined in an analogous way. Figure 7 shows the relation between true positive, false positive, true negatives and false negatives. The specificity is a measure of finding members of the same family, if the program finds only sequences that are from the same family then the specificity is equal to one. The sensitivity reflects the ability to find members of the family, it is that fraction of sequences belonging to the family that have a score higher than a certain threshold.
TN TP
FN FP
SCORE
3. RESULTS:
Before the presentation of the results, it is may be appropriate to begin with an illustrative example just to help the reader to understand how the results are extracted from the data that will be represented here soon. Suppose that you made a search in any of these programs. You will obtain a list from your search with sequences that the program suggest that they are homologous- and keep in mind that not all of these sequences are really homologous. Because we have an advantage that we know which sequences are homologues to each other we can sort the list and mark the sequences that are homologues with a (+) and the ones that are not with a (-). Now if we obtain 5 sequences in the hit list, and four of them are positive and the other is negative. We can sort them according to the reported scores. Figure 8 shows the presentation of the results as described in the text.
Now, if we begin with score 50 we find that we have four True Positives, one False Positive, no False Negatives and no True Negative. So the Specificity is:
4 TP
4 + 1
TP + FP
The sensitivity can be calculated in similar way, and in this case it is found to be equal to one. Which means that at score 50, 80 % of you results are positives and that all of them are above this score and are True Positive. Repeated calculation gives the Spec-Sens plot, and from them we can decide which parameters are the best if we compare their sensitivity values. Our results are presented first for BLAST, then for FASTA, SSEARCH, and after that for matrices.
BLAST: From table 1, we can see that for BLAST with BLOSUM50 matrix, and different gap penalties give nearly the same sensitivity, but we recommend -12-2 to be chosen because is gives the highest sensitivity value. BLOSUM62 gives the highest sensitivity values with gap penalty -10-1. Gap penalties -11-3, -16-1, -18-1, and -19-1 gave high sensitivity, but we chose -11-3 as the best working gap penalty because the different in sensitivity is insignificant and because -16-1, -18-1 and -19-1 are seldom in the list of gap penalty choice in BLAST.
Matrix Gap penalty Specificity Sensitivity
BLOSUM50 -12-2 0,50 0,824
BLOSUM50 -13-2 0,50 0,817
BLOSUM50 -14-2 0,50 0,816
BLOSUM50 -15-2 0,50 0,817
BLOSUM50 -16-1 0,50 0,818
BLOSUM50 -17-1 0,50 0,819
BLOSUM50 -18-1 0,50 0,819
BLOSUM50 -12-2 0,95 0,749
BLOSUM50 -13-2 0,95 0,746
BLOSUM50 -14-2 0,95 0,746
BLOSUM50 -15-2 0,95 0,747
BLOSUM50 -16-1 0,95 0,739
BLOSUM50 -17-1 0,95 0,748
BLOSUM50 -18-1 0,95 0,747
BLOSUM62 -10-1 0,50 0,808
BLOSUM62 -11-1 0,50 0,799
BLOSUM62 -12-1 0,50 0,803
BLOSUM62 -10-1 0,95 0,738
BLOSUM62 -11-1 0,95 0,734
BLOSUM62 -12-1 0,95 0,735
PAM250 -10-3 0,50 0,763
PAM250 -11-3 0,50 0,756
PAM250 -12-2 0,50 0,761
PAM250 -14-2 0,50 0,760
PAM250 -15-2 0,50 0,764
PAM250 -16-1 0,50 0,767
PAM250 -17-1 0,50 0,761
PAM250 -18-1 0,50 0,766
PAM250 -19-1 0,50 0,769
PAM250 -10-3 0,95 0,695
PAM250 -11-3 0,95 0,706
PAM250 -12-2 0,95 0,691
PAM250 -14-2 0,95 0,695
PAM250 -15-2 0,95 0,696
PAM250 -16-1 0,95 0,696
PAM250 -17-1 0,95 0,694
PAM250 -18-1 0,95 0,696
PAM250 -19-1 0,95 0,700
Table 1: Sensitivity values at 0,50 and 0,95 specificity for different matrices and gap penalties used in BLAST.
Figure 9 shows the Spec-sens plot for BLAST with these gap penalties and scoring matrices. From it we can see that BLOSUM50 with gap penalty -12-2 gives the highest sensitivity followed by BLOSUM62 with gap penalty -10-1, and PAM250 with gap penalty -11-3. So in BLAST, we can rank the matrices as follow: BLOSUM50 > BLOSUM62 > PAM250.
Figure
9: Spec-sens plot for BLAST. Gap penalty -12-2 with BLOSUM50, -10-1
with BLOSUM62, and -11-3 with PAM250.
FASTA: For FASTA and SSEARCH which have another set of gap penalties we extracted Sensitivity values for Specificity at 0.50 and plotted these values in different graphs showing sensitivity change with gap penalties.
First we compared the results from FASTA, ktup1 and FASTA ktup2 with different matrices. Figure 10 shows how FASTA, ktup1 and ktup2 , with BLOSUM50 differ in sensitivity with respect to gap penalties at specificity 0.50.
From this figure one can realize that the gap penalty choice influences the sensitivity of the sequence alignment methods. In general FASTA, ktup1 gives better sensitivity values than FASTA, ktup2, and gap penalties 8-1, -12-0, -14-0, and 16-0 give low sensitivity for both FASTA, ktup1 and ktup2. Gap penalty 10-0 with FASTA, ktup2, gives significantly lower sensitivity compared to other gap penalties.-14-2 and 14-3 are the gap penalties of choice with BLOSUM50. The situation with FASTA with BLOSUM62 and the same gap penalties is different, which indicates that the gap penalties and matrices play a role in finding related sequences.
For BLOSUM62 gap penalty 12-1 and 10-2 are the two gap penalties that give the highest sensitivity. Figure 11 shows the variation of sensitivity with gap penalties. Even here FASTA, ktup1 gives higher sensitivity than FASTA, ktup2 except at some gap penalties such as 8-0, -10-0, and 12-0, where FASTA ktup2, works better than FASTA, ktup1.
Which gap penalty is the best working in FASTA with the same gap penalties as before and with PAM250 matrix?
PAM 250 gives the highest sensitivity at gap penalties 16-1, -14-2. Even here, FASTA, ktup2 is lower than FASTA, ktup1 except at gap penalty 14-0 where it is somewhat higher. Gap penalty 16-0 is an outstanding gap penalty value, the sensitivity value for FASTA, ktup2 is lower than for other gap penalties. Figure 12 shows how the sensitivity varies with the gap penalty for FASTA with PAM250 matrix. Table 2 on page 29, shows highest sensitivity values for the gap penalties, scoring matrices, and different methods. Next we are going to see how SSEARCH works with the different matrices and gap penalties.
SSEARCH: For SSEARCH with BLOSUM50, gap penalties, -16-1 and 14-2 are the two gap penalties that give the highest sensitivity values. Gap penalties 10-3, -10-2, and 10-5 give nearly the same sensitivity. Other gap penalties except those with 8 as a gap opening penalty give nearly the same sensitivity, figure 13 shows the results obtained from SSEARCH when run with BLOSUM50 scoring matrix. We can even see that gap penalties 8-0, -10-0, -12-0, -14-0, and 16-0 give lower sensitivity values and should be avoided.
The two best working gap penalties for BLOSUM62 with SSEARCH are 10-1 and 8-3. With BLOSUM 62 almost all gap penalties give sensitivity values higher than 0.5, figure 14. As we can see from figure 15 there is a stability in sensitivity for PAM250 except that for gap penalties that have 8 as a gap opening penalty and gap penalties with 0 as a gap extension penalty. The PAM250 obtains the highest sensitivity with gap penalties 14-2, and 16-2. Figure 16 shows how Sensitivity change in SSEARCH for BLOSUM50, BLOSUM62 and PAM250. The optimal parameters will be presented in table 2 page 29.
Matrices: Now that we have an idea about which gap penalty that works best or better with a given method and scoring matrix. We tried from the obtained results to answer the question which scoring matrix should be used with a given program. For FASTA, ktup1 and if the gap penalties with 0 as gap extension penalty excluded we can see from figure 16 that between gap 8-1 and 8-5 BLOSUM62 is the scoring matrix that dominates this interval and should be choosen followed by BLOSUM50 and PAM250. The situation is different when gap penalties are between 10-1 and 10-5. In this interval BLOSUM62 works better until it reaches gap penalty 10-3, then BLOSUM50 takes over. PAM250 gives lower sensitivity in the whole interval. For gap penalties 12-1 and 12-5 PAM250 have lowest sensitivity, BLOSUM50 dominates this interval. For gap penalties between 14-1 and 14-5 BLOSUM50 matrix is stil the one that works best followed by BLOSUM62 and PAM250, and this is the same situation even in the interval with gap penalties between 16-1 and 16-5.
For FASTA, ktup2 and even here with gap penalties with gap extension penalties equal to 0 excluded, we tried to answer the question, which scoring matrix works better with different gap penalties. In the interval 8-1 to 8-5 BLOSUM50 is dominating except at gap penalties 8-1 and 8-2, where BLOSUM62 is the preferable scoring matrix, PAM250 gives lower sensitivity. In the interval 10-1 to 10-5, BLOSUM50 is the scoring matrix to be chosen except at 10-1 where BLOSUM62 obtains higher sensitivity. For the interval 12-1 to 12-15, -14-1 to 14-5, and 16-1 to 16-5, BLOSUM50is better than BLOSUM62, better than PAM250. See figure 17.
For SSEARCH one can see the same trend as for FASTA. BLOSUM matrices perform better than PAM250. If we even here exclude gap penalties with zero as the gap opening penalty. We can see that BLOSUM50 works better with gap penalties 8-1 and 8-2, then BLOSUM62 works better in rest of the interval 8-3 to 10-3. After that BLOSUM50 takes over in gap penalties 10-4 and 10-5. BLOSUM62 performs better with gap penalty 12-1, then BLOSUM50 is the one that gives the highest sensitivity values. From 14-1 to 16-5, BLOSUM50 is the best performing scoring matrix followed by BLOSUM62, and PAM250. See figure18.
Figure 19, 20, and 21 show the performance of FASTA ktup1, FASTA ktup2, and SSEARCH with BLOSUM50, BLOSUM62, and PAM250. They give an overview of which program should be used with a chosen scoring matrix and gap penalty. Table 2 was constructed to show the highest sensitivity values obtained from the different programs and scoring matrices. A spec-sens plot with the highest sensitivity is shown in figure 21. From it we can see that SSEARCH is the best performing method with BLOSUM50 as a scoring matrix and gap penalty 16-1, Followed by FASTA ktup1 with BLOSUM50 and gap penalty 14-2. FASTA ktup2 with gap penalty 14-1 is better than BLAST with BLOSUM50 and gap penalty 12-2.
Program Scoring matrix Gap penalty Sensitivity
BLAST BLOSUM50 -12-2 0.824
BLOSUM62 -10-1 0.808
PAM250 -11-3 0.756
FASTA, ktup1 BLOSUM50 -14-2 0.854
BLOSUM62 -12-1 0.851
PAM250 -16-1 0.798
FASTA, ktup2 BLOSUM50 -14-1 0.825
BLOSUM62 -8-2 0.799
PAM250 -16-1 0.763
SSEARCH BLOSUM50 -16-1 0.859
BLOSUM62 -10-1 0.853
PAM250 -14-2 0.801
Figure 22: a Spec-sens plot, For SSEARCH, BLOSUM50,
-16-1; FASTA ktup1, BLOSUM50, -14-2; FASTA ktup2 BLOSUM50, -14-2,
and for BLAST, with BLOSUM50, and gap penalty 12-2.
4. DISCUSSION
The conclusions that can be drawn from the results of this study concerning which of the methods that works better is that SSEARCH performs better than FASTA, and BLAST. FASTA ktup1 performs better than FASTA ktup2 and BLAST. As we have seen earlier the performance is dependent in which parameters are used, so the statement above should be used with a lot of restrictions. Actually, Both FASTA ktup1 and ktup2 works better than SSEARCH with BLOSUM62 with for example gap penalties 8-0, -8-1, and-8-2; figure 20. So it is hard to draw general rules from these results. One must be sure of which scoring matrix, and gap penalty in use before making such a conclusion. Using the optimal parameters figure 22 - the methods can be sorted according to their performance as follow: SSEARCH > FASTA ktup1 > FASTA ktup2 > BLAST. The same results are obtained from a previous work done by Steven E. Brenner et al [14], and this is not a surprise according to the theory presented in the introduction. Figure 19, 20, and 21 can be consulted before making a search with FASTA or SSEARCH with scoring matrices BLOSUM50, BLOSUM62, or PAM250. The scoring matrices can also be arranged as follow: BLOSUM50 > BLOSUM62 > PAM250. Henikoff and Henikoff [15] found that BLOSUM62 matrix performs better than the extrapolated PAM- series, and in this study the performance of BLOSUM50 is better than For BLOSUM62. The optimal gap penalties are reported in table 2.
5. REFERENCES
1. http://scop.mrc-lmb.cam.ac.uk/scop/
2. Sander, C. & Schneider (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68.
3. Boswell R. (1987) Sequence alignment by wordprocessor. TIBS 12:279-280.
4. Mailzel JV, Lenk RP (1981) Enhanced graphic matrix analysis of nucleic acid and protein sequences. PNAS 78:7665.
5. Dayhoff, M. O., Schwarts, R. M., and Orcutt, B. C. (1978). In Atlas of Protein Sequence and structure (ed. M. O. Dayhoff), Vol 5, pp. 345-58. National Biomedical Research Foundation, Washington DC.
6. McLachlan, A. D. (1972) Repeating sequences and gene duplication in proteins. J. Mol. Biol., 64, 417-37
7. Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1979) in Atlas of Protein Sequence and structure, pp. 345-352, National Biomedical Reasearch
foundation, Washington DC.
8. Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89: 10915-10919.
9. Smith T. F. and Waterman M.S. (1981) Identification of common molecular subsequences . J. Mol. Biol. 147: 195-197, 1981.
10. pearson W. R., Lipman D. J. (1988). Improved tools for biological sequence
Comparison. Proc. Natl. Acad. Sci. Vol. 85, pp. 2444-2448.
11. S. F. Altschul, W. Gish, W. miller, E. W Myers and D. J. Lipman. Basic
Local Alignment Search Tool. (1990) J. Mol. Biol. 215, pp. 403-410.
12. Karlin S. and Altschul S. F. (1990) Methods for assessing the statistical
significance of molecular sequence features by using general scoring schemes.
Proc. Natl. Acad Sci. Vol 87, pp. 2265-2268.
13. Alexey G. Murzin, Steven E. Brenner, Tim Hubbard and Cyrus Chothia.
SCOP: A Structural Classification of proteins Database for the investigation of
Sequences and Structures. (1995) J. Mol. Biol.247, pp. 536-540.
14. Steven E. Brenner, Cyrus Chothia, and Hubbard.(1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships, Prod. Natl. Acad. Sci. USA vol. 95, pp. 6077-6078.
15. Henikoff, S. & Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, pp. 10915-10919.
16. Martin Vingron & Michael S. Waterman. Sequence alignment and penalty Choice. (1994) 235 1-12.
Arne Elofsson Stockholm Bioinformatics Center, Department of Biochemistry, Arrheniuslaboratoriet Stockholms Universitet 10691 Stockholm, Sweden |
Tel: +46-(0)8/161553 Fax: +46-(0)8/158057 Hem: +46-(0)8/6413158 Email: arne@sbc.su.se WWW: /~arne/ |
---|