Friday, February 24, 2012

Biological Sequence Alignment - Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm is discussed here before going into the details of implementation.


This Algorithm is a Dynamic Programming based algorithm. It finds the best, overall character to character match score between the two sequences. For this, an Alignment Matrix is used to calculate the best overall, character to character match score between the 2 sequences. The result of the algorithm is a singular score which reflects the degree of goodness of the match. A trace-back process is then used to retrieve the actual alignment.


1. Alignment takes place in a two-dimensional matrix.
2. Gap characters can never be paired to each other. The matrix is an (m+1) x (n+1) matrix, where ‘m’ and ‘n’ are the lengths of the two sequences.
3. Each cell corresponds to a pairing of one letter from each sequence.
4. Alignment starts at the upper left and follows a mostly diagonal path down and to the right.
5. Each cell of the matrix actually contains two values, a ‘score’ and a ‘pointer.’
6. The score is derived from the scoring scheme.
7. The pointer is a directional indicator (an arrow) that points up, left, or diagonally up and left and navigates the matrix.
8. When two letters are aligned (match/mismatch), the path follows a diagonal trajectory.
9. The path moves horizontally when COELANCANTH is paired to gap characters.
10. The graph moves vertically when the letters from PELICAN are paired with gap characters.
11. Assign values for the first row and column of the matrix.
12. The score of each cell is set to the gap score multiplied by the distance from the origin.
13. Gaps may be present at the beginning of either sequence, and their cost is the same as anywhere else.
14. The arrows all point back to the origin, which ensures that alignments go all the way back to the origin (a requirement for global alignment).
15. We wish to determine the maximum scoring alignment
After computing these scores, assign the maximum value to the cell and point the arrow in the direction of the maximum score. Continue this operation until the entire matrix is filled, and each cell contains the score and pointer to the best possible alignment at that point.


It is abundantly clear, that there exist dependencies between the calculations of cells. At a time, only one cell can be calculated. This poses severe implications on the time and space constraints when we try to speed up the process. Parallel prefix is a remedy to this problem.

Using parallel prefix we reduce the dependencies between the cells and calculate the scores parallel. In fact, we are able to calculate the cells of an entire row in parallel. This allows us to optimize the time as well as space.

The algorithm is optimized in time and is implemented to compare two 8 sequences, each 8 characters long. The time taken for prefix computation is log2(8) = 3 clock cycles. It is a form of data parallel programming

No comments:

Post a Comment