Biological Sequence Alignment - Important Terms

I am introducing a few terms that are important in terms of understanding biological sequence alignment. I do not have a background in biology and understanding these terms are important when implementing the Needle-Wunsch algorithm.

WHAT IS BLAST?
BLAST is an acronym for Basic Local Alignment Search Tool. Despite the adjective "Basic" in its name, BLAST is a sophisticated software package that has become the single most important piece of software in the field of bioinformatics. There are several reasons for this. First, sequence similarity is a powerful tool for identifying the unknowns in the sequence world. Second, BLAST is fast. The sequence world is big and growing rapidly, so speed is important. Third, BLAST is reliable, from both a rigorous statistical standpoint and a software development point of view. Fourth, BLAST is flexible and can be adapted to many sequence analysis scenarios. Finally, BLAST is entrenched in the bioinformatics culture to the extent that the word "blast" is often used as a verb. There are other BLAST-like algorithms with some useful features, but the historical momentum of BLAST maintains its popularity above all others.

WHAT IS ALIGNMENT?
An alignment of two sequences is a one-to-one correspondence between their characters, without reordering, but with the possibility of some number of insertions or deletions (i.e., gaps or indels).

TYPES OF ALIGNMENT
Global Alignment: Both sequences are aligned along their entire lengths and the best alignment is found. Needleman-Wunsch is an algorithm for global alignment.

Local Alignment: The best sub-sequence alignment is found. TreeBLAST and Smith-Waterman algorithms implement Local Alignment. The significant difference is that although Smith-Waterman will report the best sub-sequence alignment between the two sequences, Tree BLAST can be used to report all the statistically significant scores.

ALIGNMENT SCORING THEORY

Sequences, or (more commonly) parts of sequences, are considered to have a possible biological relationship if the scoring procedure outlined here yields a score having statistical significance. Typically, one of the sequences has unknown function (e.g. a hypothesized gene) while the other is the database being searched for matches. We refer to the former as the query sequence of length m and the latter as the database of length n. Since the query is matched with only part of the database at a time, it is convenient to talk about scoring a possible alignment of the two sequences. Frequently, we are interested in the best possible matches of any subsequence of the query with the database, a process called local alignment. More precisely, an alignment of two sequences is a one-one correspondence between their characters, without reordering, but with the possibility of a number of insertions or deletions (i.e., gaps or indels).

The basis of alignment scoring is that character matches can be scored independently, and then combined into an alignment score. Each possible match has an independently generated score, with positive scores for exact or close matches and negative scores for mismatches. These scores are available a priori. We refer to the sequence of initial character- character scores as the Score-Sequence for the alignment. If no indels are considered, then the alignment is said to be un-gapped, and the alignment score is generated by summing the score sequence. Gaps are handled by adding a penalty per gap based on the length of the gap. Usually the first indel in a gap is assigned a larger penalty than its successors; various, generally simple, functions are used to generate gap penalties.
The methodology of scoring the alignments in both the algorithms is significantly different. However, the scoring scheme used is the same.

COMPUTATION OF SCORES

Match Score: Sum of the diagonal cell score and the score for a match (+1) or mismatch (-1).
Vertical Gap Score: Sum of the cell to above and the gap score (-1).
Horizontal Gap Score: Sum of the cell to the left and the gap score (-1).

Comments

Popular posts from this blog

Generating 16k ROM using Xilinx IP COREGEN and simulating it in modelsim

Setting up atalanta ATPG to work on Linux

Sparse matrix - 3 tuple representation