Friday, February 24, 2012

Salient Features - Biological Sequence alignment

Finite state machine cycles through 4 states s0, s1, s2, s3. The states are indicated by a 2 bit counter (counter_2bit) which increments every clock pulse.
s0 = 00
s1 = 01
s2 = 10
s3 = 11
One cycle is one pass through the four states of the finite state machine.
For a particular row i, consider the scoring matrix RAM (w_ram). Assume the scores for the particular row are already available. Address to the RAM is generated by a 3 bit counter. Every s0, the address to the RAM is incremented. At s1 the read enable of the ram is made high. At s2, the scores for the next row are available in the outputs of the ram. Thus during the calculation of table entries for a particular row, the scores required for the next row are fetched from the memory and are available.

While calculating the table entries of a particular row, the table entries of the previous row are already available in the s3 stage of the previous cycle. The parallel prefix module gets the z inputs during the falling edge of the s0 state. This is done so as to provide enough time for the combinational calculation of z[j]s from previous row table entries and values from scoring matrices. The z[j]s is available in w_x_reg at the rising edge of state 0. This value is latched on to x_ram during the falling edge of the same state. This is done so that the inputs to the parallel prefix module are available when it starts running at state s1.

The parallel prefix algorithm then runs on the z[j] values stored in x_ram. The initial z[j] values are
available in its inputs during state s0. The parallel prefix algorithm starts executing at s1, goes to the second stage at state s2 and completes the operation in state s3. The appropriate control signals for the nodes are given based on the table described previously. 8 nodes are instantiated and their mode of operation is controlled by 8 bit control register dup_en. Each bit of this register control the appropriate node. At s1, node 0 is alone in the duplication mode and rest is in operation mode. In s2, node0, node1 are in duplication mode and rest in operation mode. In s3, node0, node1, node2, node3 are in duplication mode and rest is in operation mode. The appropriate dup_en signals are generated using shift_control logic. Thus at s3 the x[j] values are available.

These x[j]s are connected to 8 signed 8 bit adders and added with 8 bytes of y[j] during the same state s1. This is done with minimum delay since it is a combinational circuit. The add_en signal is generated every s3 and it adds the 8 bytes of x[j]s to 8 bytes of y[j]s to get 8 bytes of table entries. These table entries are stored in table_ram as well as fed back to the w_calculation module to calculate w values for the next row.

At state 1, table_ram is write-enabled and the 8 bytes of table entries are stored in the memory. By the end of state s3, previous row table entries and scoring matrix values for the next row are available at the inputs of w_calculation module. w_calculation module is enabled in state s0 of the next cycle. The combinational logic calculates values of z and they are latched on the inputs of parallel prefix module by the falling edge of the s0 state of the next cycle.

This process is repeated 8 cycles to fill all entries of the table. the table generated at the end is optimized only to contain the calculated table values. Row0 and column0 values are not present in the table as they can be easily calculated.

Dynamic programming algorithms require O(mn) time to complete sequence matching. In this case matching 8 word query with a 8 word database, time taken by dynamic programming techniques would have been 64 clock pulses. But parallel prefix algorithm is much faster. Each cycle of the parallel prefix requires 4 clock pulses. Number of cycles in this particular case is 8. so the complete operation of sequence matching is done in a span of 32 clock pulses which is almost half the time required by dynamic programming algorithms. This is achieved by proper pipelining of the different stages. Pipelining is done so as to prefetch the values from memory, calculations are done in parallel for 8 bytes and writng in to the table is pipelined with calculation of w values for the next row. The control signals for both the RAMs, w_calculation module, the 8 nodes and the 8 signed adders are appropriately generated using the 2 bit counter as the reference.

No comments:

Post a Comment