Sequecnce alignments and database searches

Written by Amin M. Bakali

Sequence alignments:

An alignment refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Identical or similar characters are placed in the same column, and non- identical characters can either be placed in the same column as a mismatch or opposite to a gap in one of the other sequences. In an optimal alignment, non-identical characters and gaps are so placed to bring as many identical or similar characters as possible into vertical register. Two types of sequence alignment have been recognized, Global and local. The global alignment optimizes the alignment over the full-length of the sequences. In local alignment, stretches of sequence with the highest density of matches are given the highest priority. The following is an example of global and local alignment.

Global alignment:

LGPSTKDFGKISESREFDN
LNQLERSFGKINMRLE-DA

The alignment is stretched over the entire sequence lengths to include as many matching amino acids as possible up to and including the sequence ends.Although there is an obvious region of identity in this example ( the sequence FGKI), a global alignment may not align such regions in order to favor matching more amino acids along the entire sequence length.

Local alignment:

----------FGKI----------
----------FGKI----------

Local alignment of the same sequences as above. In this case, the alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favors finding conserved amino acid motifs in related protein sequences.

Why make sequence alignment:

There are two principle reasons for comparing and aligning protein sequences. To obtain an accurate alignment for protein modeling by comparison to proteins of known three-dimensional structure is the first reason. The second and the most used is to scan a database with newly determined protein sequence and identify possible functions for the protein by analogy with characterized proteins.

Before starting this analysis, it's important to think about the questions we might be asking in sequence comparisons. If we find that many characters in one sequence are the same in they are in the other sequence, then we say they are similar. Later, we will calculate a similarity score, which gives the probability that the sequences are related. The following may be true for similar sequences:

The sequences may share a common origin such as a common ancestor sequence or a gene duplication event. If we have additional evidence for an evolutionary relationship, then we say that the sequences are homologous.
The sequence may have the same or related structure or function.

The stronger the alignment-high similarity between sequences- the more likely they are to be related. Very similar sequences that are almost identical along their lengths almost certainly have the same function. Sequences that are only weak similar may or may not be related and no firm conclusion may be drawn about their relationship. Since the discovery that the myoglobin are very similar though their sequences are not, it has been apparent that comparing structures is a more powerful if less convenient way to recognize distant evolutionary relationship than comparing sequences. Percentage identity is frequently quoted statistic for an alignment of two sequences. However, the expected value of percentage identity may be overlooked. Clearly, an alignment of length 200 showing 30% identity is more significant than an alignment of length 50 with the same identity. Sander and Schneider used protein structures to evaluate sequence comparison. Their work focused on determining a length-dependent threshold of percentage identity, above which all proteins would be of similar structure. A result of this analysis was the HSSP equation; It states that proteins with 25% identity over 80 residues will have similar structure, whereas shorter alignments require higher identity.

Some rule of thumbs for alignments are:

A gap and its length are distinct quantities. Different weights should be applied to each.

Weights for different mismatches should be permitted. A transition is more likely than a transversion; Ile-Val more likely than Ile-Arg change.

Unless two sequences are known to be homologous over their entire length, a local alignment is preferred to global alignment.

An alignment demonstrates similarity, not necessarily, homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance or analogy.

Methods for making sequence alignment:

Visual editing of alignment has the advantage of giving full control to the user. The experience of the researcher with the sequences he studies can be an important factor in obtaining a meaningful alignment. Computer programs cannot easily match the cognitive capacities of the human brain. On the other hand visual alignment methods are tedious. Especially when alignments get larger, it can put a severe strain on researcher and the time needed to manually adjust the alignment can become just too big. The first programs used for visual editing of alignments were plain text editors or word processors, but this method is not practically usable. Alternative approaches to the manual construction of alignment are: Dot matrix, word or ktup search as in FASTA and BLAST, and dynamic programming.

Dot plots matrix

The comparison of sequences can be done by different ways. The most direct method is to make this comparison via a visual means and this is what dot plots attempt to do. Dot plots are groups of methods that visually compare two sequences and look for regions of close similarity between them. The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two sequences are identical a dot is places, in other words at the intersection of every row and column that have the same letter in both sequences a dot is placed. A diagonal stretch of dots will indicate regions where the two sequences are similar..

Particularly in DNA sequences, which are composed of only four building blocks, the background noise is high.

Maizel and Lenk popularized the dot plot and suggested the use of a filter to reduce the noise. The noise is not a true reflection of similarities between the sequences. Maizel and Lenk suggested the use of a filter to reduce the background noise. Only when a specified proportion of a small group of successive bases match. Figure 2 shows the effect of reducing the background noise.

Matches between two sequences.

There is an other method in which dot plots can be generated very quickly. This method involves a computer method commonly known as "hashing" or list-sorting. These methods are incorporated into FASTA algorithms. Basically, the idea is that instead of taking the complete matrix and calculating points for every entry in that matrix, a great saving can be made if the algorithm searches only for exact matches. The algorithm simply sub-divides the sequences into all "words" of a user specified block size. In addition, for both sequences the location of each word is also recorded. These array of words are then sorted alphabetically and the arrays of locations are sorted in parallel with them. Then by comparing the sorted array from one sequence with that from the other sequence immediately gives the location of all identical words.

Scoring matrices:

The simplest scoring scheme is the identity scoring matrix, in which amino acid pairs are classified into two types: identical or non-identical. Non identical pairs are scored 0 and identical pairs scored 1. It is even called the unitary scoring matrix. This type of scoring is generally considered less effective than schemes that weight non identical pairs [5]. Many alternatives to the unitary scoring matrix have been suggested. One of the earliest suggestions was scoring matrix based on the minimum number of bases that must be changed to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences than the unitary matrix approach. The minimal mutation distance matrix is an improvement because it incorporates knowledge about the process of generating mutations from one amino acid into another. However it still ignores the processes of selection that determine which mutations will survive in a population.

Another improvement is a scoring matrix based on selected physical, chemical, or structural properties shared and not shared by the 400 pairs of amino acids. Specific instances of this approach work well for sequences that have been strongly conserved during the evolution. The intuitive scheme developed by McLachlan [6] classified amino acid on the basis of polar or non-polar character, size, shape, and charge, and gives a score of six to interconversions between identical rare amino acids (e.g. F, F)

reducing to zero for substitutions between amino acids of quite different character (e. g. F, E) . Feng et al. [7] encode feature similar to McLachlan by combining information from the structural features of amino acids and genetic code. However this approach suffers from problems with balancing the contributions of the different properties to the positive selection of mutations and from ignoring the different rates at which different mutations are generated. As we can see a metric of similarity between amino acid pairs is important, and not only the choice of which scoring matrix is important. It is also important that all subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.

The importance of Scoring matrices:

Scoring matrices appear in all analysis involving sequence comparison.

The choice of matrix can strongly influence the outcome of the analysis.

Scoring matrices implicitly represent a particular theory of evolution.

understanding theories underlying a given scoring matrix can aid in making a proper choice

PAM:

The best improvement achieved over the unitary matrix was based on evolutionary distances. Margaret Dayhoff pioneered this approach in the 1970's. She made an extensive study of the frequencies in which amino acids substitutet for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. This lead to a table of relative frequencies with which amino acids replace each other over an evolutionary period. This tableland the relative frequency of occurrence of amino acids in the proteins studies were combined in computing the PAM (point Accepted mutation) family of scoring matrices. Each PAM matrix has a number associated with it. The number of mutations per 100 amino acids. He traditional PAM matrix, the PAM250 [8], often referred to as the Dayhoff matrix, assumes the occurrence of 250 point mutation per 100 amino acids.

The PAM matrices have theoretical advantages over alternative methods of scoring alignments. From biological point of view PAM matrices are based on observed mutations. Thus they contain information about the processes that generate mutations as well as the criteria that are important in selection and fixing mutation with a population. Another advance of this type of scoring matrix is from the statistical point of view. PAM matrices contain accurate description of the changes in amino acid composition that are expected after a given number of mutations that can be derived from the data used in creating the matrices. Thus the highest scoring alignment is statistically most likely to have been generating by evolution rather than by chance. The PAM matrices and other substitution matrices discussed below are generally presented as log-odds matrices. That is each score in the matrix is the logarithm of an odds ratio. The odds ratio used is the ratio of the number of times residue "A" is observed to replace residue "B" divided by the number of times residue "A" would be expected to replace residue "B" if the replacement occurred at random. Thus positive scores in the matrix designate a pair of residues that replace each other more often than expected by chance. Negative scores in the matrix designate pairs of residues that replace each other less often than would be expected by chance and are evidence against the sequences being homologous. One can use the matrix to objectively select groups of amino acids that represent conservative substitutions in proteins, because it summarizes the observed replacement that have taken place while conserving the structural and functional properties of proteins. Finally the PAM matrix provides an empirical, experimental determination of conserved replacement and it is more suitable for scoring an alignment than matrices mentioned above.

BLOSUM:

One assumption inherent in the Dayhoff model is that the evolutionary rates are uniform over the whole of protein sequence. That is likely not to be true, because the evolutionary rates are obviously lower in conserved regions and higher in non-conserved regions of proteins. Henikoff and Henikoff [9] have approached this in a different way. Their idea was to get a better measure of differences between two proteins specifically for more distantly related proteins. They used the BLOCKS database to search for differences among sequences but only among the very conserved regions of a protein family. Hence the term BLOSUM is from BLOcks SUbstitution Matrix. They first collect all of the sequences in the BLOCKS database and then for each one they sum the number of amino acids in each site to get a frequency table of how often different pairs of amino acids are found together in these conserved regions. These frequencies can be written down in a frequency table, and the odds for relatedness is calculated in a similar way as for Dayhoff matrix.

Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold of positions are clustered. More sequences are added to the cluster id they are identical to any sequence already in the cluster at the same level. This lead to a series of matrices. The matrices are called BLOSUM matrices, with an index denoting the level of clustering, for example BLOSUM62 is derived from sequence blocks clustered at the 62% identity level.

Gap penalty:

Different programs use different names for the gap penalty. In addition the programs allow the user to specify a gap penalty (the penalty for introducing a gap into a sequence) and a gap extension penalty (the penalty for extending an already existing gap). Decreasing the gap creation penalty favors the introduction of gaps into the alignment, and decreasing the gap extension penalty favors the formation of long gaps. There is no hard and fast rule for deciding the correct gap penalty to use for a given alignment. The recommended strategy is to start by using the default gap penalty value. If a correct alignment is critical, the alignment procedure can be repeated using a higher and a lower gap penalty. The resulting alignment can then

be compared to identify sequence regions where changing the gap penalty has a significant effect. Ultimately, the "best" gap penalty value can only be determined using trial and error and knowledge of the biological function of the aligned sequences.

Databases:

One of the major advances in computer technology has been in connectivity. Most computers can now be connected to a network that permits access to other computers all over the word. This permits any one with a computer to access databases of all kinds, if they know how.

There are three major nucleotide sequence databases. There are EMBL (European Molecular Biology Laboratory), NCBI (the U.S. National Center for Biotechnology Information) and DDBJ (the DNA data Bank of Japan). Each of these databases attempt to collect all of the known nucleic acid (DNA/RNA) sequences. The sequences were collected from published sources and most journals now require submission of the sequences to a database before publication is permitted. In addition many sequences are deposited into the databases that have not been published. In addition to the sequences the databases also contain many other useful bits of data, including organism, tissue, function, and other information. The best thing with these three databases is that they are in electronic contact with each other and exchange sequence information daily. Henceyou do not need to worry that one database might not have the sequence of interest but a search of some other databases would have it.

There are also a large number of protein sequence database PIR (Protein Identification Resource). This database of protein sequences has data on x-ray crystallography and active site determination. The PDB (Protein Data Bank) also contains other structural features such as bond connectivity data. PROSITE database lists distinct structural motifs in protein. This includes amino acids post-transnational modifications, topogenic sequences, domains of specific biological function (e.g. DNA binding domains), enzyme active sites and signature patterns that are specific to a family or group of proteins. In addition to these protein databases, there are databases devoted to particular families of proteins and to particular organisms. There are also protein databases constructed from translations of the nucleotid databases - NCBI's is called GenPept and EMBL's is termed TREMBL. All of these databases can be accessed through their web sites.

SSEARCH:

At the extreme slow end of database searcher is SSEARCH. Is does sequence comparison using the smith-Waterman algorithm [10]. Which is a variant of the dynamic programming algorithm. It provides a rigorous mathematical approach towards sequence alignment. It is guaranteed to find the best alignment between a pair of sequences given a particular choice of scoring matrix and gap penalties. The smith-Waterman algorithm searches a database by using local alignments. This allows the database search to focus on the most highly conserved regions of the sequences without having to overcome interference from less well conserved regions of the sequences. It also allows similar domains within sequences to be identified, such as nucleotide binding domains, even though the sequences may not be related over the entire length. The Smith-Waterman algorithm places no restriction on the alignment it reports other than it have a positive score in terms of the similarity table used to score the alignment. It is worth the effort to understand how Smith-Waterman alignments are actually computed. First users should be convinced that the algorithm is mathematically rigorous in sense described here. That it does test all possible alignments, and always does find the best scoring alignment.

Smith-Waterman is easily described in a recursive, mathematical equation. Recursive means that the results are computed in steps with any subsequent step depending on the answers previous steps. The equation is:

SW_i,j= max {SW_i0,j0 s(a_i,b_j) ; SW_i-k,jg_{j ;}SW_i,j-k g_i;0}

_Swi,j is the Smith-Waterman score for the partial alignment ending at residue i of sequence A and residue j of sequence B. putting zero in the recursion is saying if the partial alignment score becomes negative during the calculations we want to ignore the preceding calculations and start over from a neutral score. The first term, SW_i0,j0 s(a_i,b_j), corresponds to extending the alignment by one residue from each sequence. The second term, SW_i-k,jg_j, describes extending the alignment by including residue j from sequence B and inserting a gap of k residues in length, aligned to end with residue j of sequence B into sequence A. The third term, SW_i,j-k g_i, is the equivalent term for inserting a gap into sequence B. Figure 3 shows an example of Smith-waterman alignment for a DNA sequence, where scores are: Match=5, mismatch=-4, gap=-7. It is a useful exercise to calculate by hand some of the values for specific cells in the table to be sure of understanding the table.


   -  G  C  T   G   G   A   A  G   G   C    A  T  
-  0  0  0   0   0   0   0  0   0   0   0   0  0  
   G  0  5   0   0   5   5  0   0   5   5   0  0
C  0  0 10   3   0   1   1  0   0   1  10   3  0
A  0  0  3   6   0   0   6  6   0   0   3  15  8
G  0  5  0   0  11   5   0  2  11   5   0   8 11
A  0  0  1   0   4   7  10  5   4   7   1   5  4
G  0  5  0   0   5   9   3  6  10   9   3   0  1
C  0  0 10   3   0   2   5  0   3   6  14   7  0
A  0  0  3   6   0   0   7 10   3   0   7  19 12
C  0  0  5   0   2   0   0  3   6   0   5  12 15
T  0  0  0  10   3   0   0  0   0   2   0   5 17

Figure 3: Matrix generated from the application of Smith-Waterman equation. Hightlighted elements indicate the trackback path from maximal element.

The cells corresponding to the best scoring alignment between the two sequences are highlighted. The best scoring alignment is that ends at the cell in table with the highest score.

The resulting alignment is shown below:

G A A G  G C A
G C A G A G C A

FASTA:

FASTA has been developed by William R. Person and David J. Lipman [11] as a tool for analyzing protein and DNA sequence similarity that achieve a balance of sensitivity and selectivity on one hand and speed and memory requirements on the other hand. As they described it, the search algorithm proceeds through four steps in determining a score for pair-wise similarity. FASTA uses a lookup table to locate all identities or groups of identities between the two sequences to be compared. The parameter used here is the ktup, which determines how many consecutive identities are required in a match. For example using ktup=2, only those identities that occur in a run of two consecutive matches are examined. The 10 best diagonal regions are found using ktup without considering shorter runs of identities, conservative replacements, insertions or deletions. In the second step these regions are rescored using a scoring matrix. For each of these best diagonal regions, a subregion with maximal score is identified. These regions are referred to as the "initial regions". In the third step FASTA optimally joins initial regions with scores greater than a threshold. The fourth step of the comparison considers all possible alignments of the query and library sequences that fall within a band centered around the highest scoring initial regions.

BLAST:

Basic Local Alignment Search Tool (BLAST), is a new approach to rapid sequence comparison [12]. Like FASTA, BLAST is basically a simplification of Smith-Waterman algorithm. BLAST takeees an input DNA or protein sequence and compares this sequence to all of the sequences in the target sequence database, then reports the results of the best matches and alignments of these matched sequences with the input sequence. BLAST achieves the high speed comparison of sequences by comparing a list of short sequence fragments called words derived from the target sequence with similar sets of words. For protein sequences the word length is 3 and for DNA sequences, 12. The object is to find words in the target database sequences that are closely related to words in the test sequence. When a related word is found, the regions upstream and down stream from the word in the two sequences are compared to see if the word is contained in a longerregion that can be aligned. Using an initial short word search to find these longer regions greatly increase the speed of search. In comparing words in the test database sequences, only matches that yield a certain minimum score T are considered. When a score of T is found an attempt is made to extend the alignment from the word in each alignment from the word in each direction alongthe sequences using the same scoring procedure as used for finding word matches. The extension process in each direction is stopped when the accumulated score stops increasing and has began to fall a certain amount below the best score found for shorter extensions. At this point, a larger stretch of sequences -a segment pair- with larger score may have been found. The last step of BLAST is to determine if these segment pair scores are significantly greater in value than a cutoff score S. this cutoff score is determined empirically by examining the range of scores found by comparing random sequences, and by choosing a value significantly greater in value [13]. Finally, the maximal scoring pairs (MSPs) matched in the entire database are identified and listed. Figure 4 shows an example of identification and extension of word matches. The database sequences found to have highest scoring extended regions are then reported.