Dynamic programming: the sequence alignment problem (2), trading computation for storage

Introduction

Continuing from the previous article: to compute the optimal alignment of two strings S and T, of lengths n and m, the algorithm there needs an array of size O(mn). That is acceptable for short strings, but once the inputs grow to something like the full text of a paper, the table no longer fits in memory. In addition, the algorithm fills the entire two-dimensional m x n array with a doubly nested for loop, yet part of it (the corners of the array, far from the optimal path) may never be used in the end, so that work is wasted.

Advanced dynamic programming

Dynamic programming is very effective at finding an optimal solution, but it takes up a lot of storage and performs many unnecessary calculations. Advanced dynamic programming addresses this shortcoming, saving storage space and, where possible, running time. The approach is to trade computation for storage: values are recomputed when they are needed instead of all being kept.

Divide and conquer reduces space

In the previous alignment algorithm, the score of each cell depends on only three neighboring cells: the one to its left, the one above it, and the one diagonally up and to the left. No other cells are consulted. So a single step never needs the complete two-dimensional array; it only needs two adjacent columns:
[Figure: the DP table with only two adjacent columns kept]
At the beginning we initialize column 0 (this uses one array). When computing column 1, its first element, -3, can be initialized directly. For the first blue box, the three neighboring values are already known, so the recurrence gives 1 directly. The second blue box is computed from that, and so on, until the first two columns are filled in.

When computing column 2, column 0 no longer needs to be kept, so the values of column 1 are moved into the "previous" array and the next round of calculation begins, as shown in the following figure:
[Figure: the two-column window shifted one column to the right]
In the end the arrays hold only the last two columns, and the final answer, 4, is read from the last cell.

The pseudocode is as follows:
[Figure: pseudocode for the two-column, score-only algorithm]
We can see that this method needs storage for only two columns (about 2n cells), which is linear rather than quadratic. However, it cannot backtrack to recover the alignment path, because the earlier results are not saved.
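As a concrete illustration of the two-column idea, here is a minimal Python sketch. The scoring values are an assumption: the text only shows a gap penalty of -3, and match = +1, mismatch = -1 are chosen because they reproduce the scores 1, 3 and 4 quoted in this article. The function name `last_column` is hypothetical, and rows are indexed by T and columns by S.

```python
# Assumed scoring: only the gap penalty (-3) is visible in the text;
# the match/mismatch values are illustrative.
MATCH, MISMATCH, GAP = 1, -1, -3

def last_column(t, s):
    """Return the last DP column: result[i] is the best score for t[:i] vs. all of s."""
    prev = [i * GAP for i in range(len(t) + 1)]  # column 0: t[:i] against the empty string
    for j in range(1, len(s) + 1):
        cur = [j * GAP] + [0] * len(t)           # first cell of column j is set directly
        for i in range(1, len(t) + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            cur[i] = max(prev[i - 1] + sub,      # diagonal: pair t[i-1] with s[j-1]
                         prev[i] + GAP,          # left: s[j-1] against a gap
                         cur[i - 1] + GAP)       # up: t[i-1] against a gap
        prev = cur                               # column j-1 is no longer needed
    return prev

print(last_column("OCCURRENCE", "OCURRANCE")[-1])  # 4
```

Only `prev` and `cur` are alive at any time, so the score is available at the end but the path that produced it is not.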
Hirschberg's algorithm adds divide and conquer to solve this problem.
The key observation is that, when trading computation for storage, matching from front to back and matching from back to front follow the same logic.
Hirschberg applied the idea of divide and conquer to dynamic programming. View the alignment as a multi-step decision process: how is S generated from T step by step? Divide S into two halves. The first half is generated from some prefix of T, and the second half is generated from the remaining suffix of T. As a formula:
$$OPT(T,S) = OPT(T[1..q], S[1..\frac{n}{2}]) + OPT(T[q+1..m], S[\frac{n}{2}+1..n])$$
With S split in two like this, the first half of S is aligned against prefixes of T (front to back) and the second half against suffixes of T (back to front), and only two arrays are needed to store the intermediate results, as follows:
[Figure: forward and backward score columns, with their sum in the yellow column in the middle]
Adding the two columns entry by entry gives the yellow column in the middle, and the best score computed earlier, 4, appears in it. It is obtained as 1 + 3: 1 is the similarity between "OCUR" and "OCCUR", and 3 is the similarity between "RANCE" and "RENCE". Because the score is additive over the positions of an alignment, 1 + 3 is exactly the similarity between S and T. But what does the position of the 4 mean? Look at the figure below:
[Figure: the position of the 4 in the full table]
The 4 splits the table into an upper-left part and a lower-right part, and the R at this position is exactly position q. This gives the first divide-and-conquer step for the original formula:
$$OPT('OCCURRENCE','OCURRANCE') = OPT('OCCUR','OCUR') + OPT('RENCE','RANCE')$$
What remains is to recurse on the red region in the upper left and the red region in the lower right, which yields a series of red cells like the 4 above.
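The split step can be sketched in Python as follows. This is an illustration under assumptions, not the author's pseudocode: the scoring (match +1, mismatch -1, gap -3) is assumed except for the -3 shown in the text, and the names `last_column` and `best_split` are hypothetical. The backward column is obtained by running the same forward routine on the reversed strings.

```python
MATCH, MISMATCH, GAP = 1, -1, -3  # assumed scoring; only GAP = -3 appears in the text

def last_column(t, s):
    """Last DP column using two arrays: result[i] = best score of t[:i] vs. s."""
    prev = [i * GAP for i in range(len(t) + 1)]
    for j in range(1, len(s) + 1):
        cur = [j * GAP] + [0] * len(t)
        for i in range(1, len(t) + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            cur[i] = max(prev[i - 1] + sub, prev[i] + GAP, cur[i - 1] + GAP)
        prev = cur
    return prev

def best_split(t, s):
    """Return (q, score): t[:q] pairs with the first half of s, t[q:] with the rest."""
    m, mid = len(t), len(s) // 2
    fwd = last_column(t, s[:mid])                    # fwd[i]: t[:i] vs. s[:mid]
    bwd = last_column(t[::-1], s[mid:][::-1])        # bwd[k]: last k chars of t vs. s[mid:]
    totals = [fwd[i] + bwd[m - i] for i in range(m + 1)]  # the "yellow column"
    q = max(range(m + 1), key=lambda i: totals[i])
    return q, totals[q]

print(best_split("OCCURRENCE", "OCURRANCE"))  # (5, 4)
```

For the example, the maximum of the summed column is 4 at q = 5, matching the decomposition into 'OCCUR'/'OCUR' and 'RENCE'/'RANCE'.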

So how do we obtain the path itself? First we need to understand what q really means: q says that the first half of S is generated from T[1...q]. In the example, "OCUR" comes from "OCCUR". Back to the earlier path table:
[Figure: the alignment path through the table]
From the figure we can see that the path must contain at least one cell whose S index is n/2, and q gives the T index of that cell. In the first divide-and-conquer step, q determines the position of the yellow 1 in the middle. Its value is 5, which tells us that cell ⟨5,4⟩ of the original two-dimensional array must be on the path, where the 4 comes from splitting S in half at the start. In the same way, every recursive call determines one new cell on the path, so the path is pinned down step by step.

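The pseudocode is as follows:
[Figure: pseudocode for Hirschberg's algorithm]
The total space consumption of the above algorithm is O(m+n), that is, linear storage. The time complexity is still O(mn): the subproblems at each level of the recursion cover about half the table area of the level above, so the total work is at most about 2mn.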
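Since the pseudocode figure is not available here, the following Python sketch shows one way the recursion could look. It is an assumption-laden illustration: the scoring values other than the gap penalty -3 are assumed, the names `last_column` and `path_points` are hypothetical, string slicing is used for brevity, and ties between equally good split points are broken toward the larger q, which is an arbitrary choice.

```python
MATCH, MISMATCH, GAP = 1, -1, -3  # assumed scoring; only GAP = -3 appears in the text

def last_column(t, s):
    """Last DP column using two arrays: result[i] = best score of t[:i] vs. s."""
    prev = [i * GAP for i in range(len(t) + 1)]
    for j in range(1, len(s) + 1):
        cur = [j * GAP] + [0] * len(t)
        for i in range(1, len(t) + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            cur[i] = max(prev[i - 1] + sub, prev[i] + GAP, cur[i - 1] + GAP)
        prev = cur
    return prev

def path_points(t, s, ti=0, sj=0, out=None):
    """Collect cells <q, n/2> (1-based prefix lengths into T and S) on an optimal path."""
    if out is None:
        out = []
    if len(s) < 2:                 # no interior column left to split on
        return out
    m, mid = len(t), len(s) // 2
    fwd = last_column(t, s[:mid])
    bwd = last_column(t[::-1], s[mid:][::-1])
    # Scan q from m down to 0 so that ties resolve to the larger q (arbitrary choice).
    q = max(range(m, -1, -1), key=lambda i: fwd[i] + bwd[m - i])
    out.append((ti + q, sj + mid))
    path_points(t[:q], s[:mid], ti, sj, out)              # upper-left region
    path_points(t[q:], s[mid:], ti + q, sj + mid, out)    # lower-right region
    return out

print(path_points("OCCURRENCE", "OCURRANCE"))
# [(5, 4), (3, 2), (2, 1), (4, 3), (7, 6), (6, 5), (8, 7), (9, 8)]
```

Each call keeps only two score columns, and the recorded points are what allow the path to be reassembled afterwards. With a different tie-breaking rule some points may differ, because more than one optimal alignment can exist.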

Improving how the split point is found

The above algorithm is very good, but it has a shortcoming. Lines 2 and 3 of the pseudocode require that the problem can be computed both from front to back and from back to front. That works for the alignment problem, but it does not necessarily hold for other problems.

To obtain the path, by the derivation above we really only need to compute q recursively. Define a variable R_{i,j} that records the row in which the optimal path from cell (i,j) back to (0,0) crosses column n/2. It satisfies the following recurrence:
[Figure: recurrence for R_{i,j}]
An example illustrates the meaning of this recurrence. For a cell that lies in column n/2 itself, the row in which it crosses column n/2 is simply its own row. For cells in later columns, the value depends on which predecessor produced the cell's OPT value.
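The recurrence figure is not reproduced here. Based on the description above, a plausible form is the following, where s(T_i, S_j) for the match/mismatch score and g for the gap penalty are notation assumed for this sketch rather than taken from the original:

```latex
R_{i,j} =
\begin{cases}
  i           & \text{if } j = \frac{n}{2} \\
  R_{i-1,j-1} & \text{if } j > \frac{n}{2} \text{ and } OPT(i,j) = OPT(i-1,j-1) + s(T_i, S_j) \\
  R_{i,j-1}   & \text{if } j > \frac{n}{2} \text{ and } OPT(i,j) = OPT(i,j-1) + g \\
  R_{i-1,j}   & \text{if } j > \frac{n}{2} \text{ and } OPT(i,j) = OPT(i-1,j) + g
\end{cases}
```

When several cases apply at once, any of them may be chosen; each corresponds to a different optimal path. The split point is then q = R_{m,n}.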
[Figure: green and blue columns used while propagating R]
During the computation we always keep two arrays (one green, one blue) that store the scores of the columns currently being processed. A cell in column n/2 + 1 wants to know in which row its path crossed column n/2, so it has to ask its predecessor, that is, the cell its score came from. Comparing the OPT scores, we can see that the 5 in the blue box is derived from the 3 in the green box. Continuing to update column by column in the same way, the final column gives the result 5, which means q = R_{10,9} = 5. This is how q is found; the rest is the recursive process. The following is an example of the recursive computation:
[Figure: recursive computation of q]
This recursive call computes the q between "OCUR" and "OCCUR". When computing "OC" against "OCC", q has two possible values. This is easy to understand: "OC" can be aligned with "OCC" as "O-C" or as "OC-", so q has two options.
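The forward-only computation of q can be sketched in Python as below. This is an illustration of the R_{i,j} idea as described above, not the original pseudocode: the scoring values other than the gap penalty -3 are assumed, the name `split_row` is hypothetical, and ties between predecessors are broken toward the larger row, which is an arbitrary choice.

```python
MATCH, MISMATCH, GAP = 1, -1, -3  # assumed scoring; only GAP = -3 appears in the text

def split_row(t, s):
    """Return (q, score) with one forward pass, carrying R next to each score."""
    m, n = len(t), len(s)
    mid = n // 2
    score = [i * GAP for i in range(m + 1)]   # column 0 of OPT
    row = list(range(m + 1))                  # R values; meaningful from column mid on
    for j in range(1, n + 1):
        new_score = [j * GAP] + [0] * m
        new_row = [0] * (m + 1)               # cell (0, j) crosses column mid in row 0
        for i in range(1, m + 1):
            sub = MATCH if t[i - 1] == s[j - 1] else MISMATCH
            # Each candidate carries the R value of the predecessor it comes from.
            new_score[i], new_row[i] = max(
                (score[i - 1] + sub, row[i - 1]),        # diagonal (i-1, j-1)
                (score[i] + GAP, row[i]),                # left     (i,   j-1)
                (new_score[i - 1] + GAP, new_row[i - 1]) # above    (i-1, j)
            )
        if j <= mid:
            new_row = list(range(m + 1))      # in column mid, R is the cell's own row
        score, row = new_score, new_row
    return row[m], score[m]

print(split_row("OCCURRENCE", "OCURRANCE"))  # (5, 4)
```

No backward pass is needed, so the same trick applies to problems that can only be computed from front to back.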
The following shows the path points stored over the entire recursive process:
[Figure: path points recorded during the recursion]
So we have the path point set A = {⟨5,4⟩, ⟨3,2⟩, ⟨2,1⟩, ⟨4,3⟩, ⟨7,6⟩, ⟨6,5⟩, ⟨8,7⟩, ⟨9,8⟩}.

Summary

Applying divide and conquer inside dynamic programming reduces the space complexity, and the core idea is to trade computation for storage. The general steps are as follows:

  1. The original multi-step decision problem, how S is generated from T, is transformed into two subproblems: how the first half of S is generated from the first part of T (T[1...q], S[1...n/2]), plus how the second half of S is generated from the latter part of T (T[q+1...m], S[n/2+1...n]).
  2. Next, we run the optimal alignment front to back for the first part and back to front for the second part, which yields two columns of scores. We add the two columns entry by entry, and the row index of the largest sum (the 4 in the example) is q.
  3. With q, we store ⟨q,n/2⟩ (a point the path must pass through) in the array and then repeat steps 1-2 recursively. The array A obtained at the end gives the path we need.

However, the above method has a limitation: it requires that the problem can be solved both front to back and back to front (prefix optimal alignment and suffix optimal alignment). For this reason lan's algorithm, another divide-and-conquer recursion, was proposed. In essence, finding the path means finding the value of q, so he computes q recursively alongside OPT, which yields the point set A as well.

Origin blog.csdn.net/qq_32505207/article/details/108043063