Dynamic programming: sequence alignment problem (3), reducing redundant calculation

Introduction

In the previous article, the idea of divide and conquer was used to reduce the space complexity of the sequence alignment dynamic program from O(mn) to O(m+n). In this article, we consider whether the time complexity can also be reduced. The final result lowers the O(mn) time complexity to O(α·max{m,n}).

Reduce redundant calculations

Problem analysis

Consider two strings to be aligned that are very dissimilar (for example, "gold" and "time"): is it really necessary to run the full dynamic program to compute their similarity? The applications that motivate aligning two strings in the first place, such as correcting a user's mistyped input or checking whether an article is plagiarized, all rest on the two strings being fairly similar (the threshold for "similar enough" is set by the user). We can therefore add a preprocessing step that filters out string pairs with few matching characters and stops right there, avoiding unnecessary computation. So what characterizes two highly similar strings in the dynamic programming table?
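The preprocessing idea can be sketched as follows. This is a minimal illustration, not the article's actual filter: the multiset character overlap and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter

def rough_similarity(s, t):
    """Crude preprocessing score: fraction of characters the two
    strings share as multisets. 0.0 means nothing in common."""
    overlap = sum((Counter(s) & Counter(t)).values())
    return overlap / max(len(s), len(t))

def worth_aligning(s, t, threshold=0.5):
    # Skip the expensive DP alignment for obviously dissimilar pairs.
    return rough_similarity(s, t) >= threshold
```

For the example above, `worth_aligning("gold", "time")` is False (no shared characters), so the pair is rejected before any DP table is built.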

[Figure: backtracking path of arrows through the DP table]
Direction of the arrows:

  • Upward arrow: a gap in S is aligned with a character of T
  • Left arrow: a gap in T is aligned with a character of S
  • Diagonal arrow: a character of S is aligned with a character of T (they may be equal or unequal)

Notice that cells far from the diagonal, such as the one in the lower-left corner holding a value like -30, do not affect the final backtracking path at all. Since preprocessing has already removed very dissimilar pairs, the final path is guaranteed to lie near the diagonal, so the only region we really need to compute is the area around the diagonal.
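For reference, the full O(mn) table fill that the rest of the article trims down can be sketched as below. The scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not taken from the article's figures.

```python
def align_score(S, T, match=1, mismatch=-1, gap=-2):
    """Full O(mn) alignment DP (minimal sketch with assumed scores).
    OPT[i][j] = best score aligning S[:i] with T[:j]."""
    m, n = len(S), len(T)
    OPT = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        OPT[i][0] = i * gap          # boundary: a prefix aligned only with gaps
    for j in range(1, n + 1):
        OPT[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if S[i - 1] == T[j - 1] else mismatch
            OPT[i][j] = max(
                OPT[i - 1][j - 1] + d,  # diagonal arrow: align two characters
                OPT[i - 1][j] + gap,    # vertical move: a gap in one string
                OPT[i][j - 1] + gap,    # horizontal move: a gap in the other
            )
    return OPT[m][n]
```

For the dissimilar pair "gold"/"time" the best path is forced to take four mismatches, which is exactly the kind of low-value computation the preprocessing step avoids.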

Solutions

By introducing a parameter α, we open a deviation band around the diagonal of the original array, as shown in the figure below:

[Figure: band of width α around the diagonal]

Inside the original dynamic programming array, it looks like this:

[Figure: banded region inside the DP table]

Pseudocode comparison

The original pseudocode uses a double for loop to compute every cell of the table, performing many unnecessary redundant calculations:

[Figure: original O(mn) pseudocode]

After introducing α, the pseudocode becomes:

[Figure: banded pseudocode BANDED-DP(T, S, α)]

Only the cells inside the band of width α are filled in, so the total number of computed cells stays small, and the new time complexity is O(α·max{m,n}).
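A minimal Python sketch of the banded variant is given below. As before, the scoring values are assumptions for illustration; the structure (only filling cells with |i - j| ≤ α) is what implements BANDED-DP(T, S, α).

```python
def banded_align_score(S, T, alpha, match=1, mismatch=-1, gap=-2):
    """Banded alignment DP: only cells with |i - j| <= alpha are filled,
    giving O(alpha * max(m, n)) time. Assumes |len(S) - len(T)| <= alpha,
    otherwise the final cell lies outside the band."""
    m, n = len(S), len(T)
    NEG = float("-inf")  # cells outside the band are never reachable
    OPT = [[NEG] * (n + 1) for _ in range(m + 1)]
    OPT[0][0] = 0
    for i in range(m + 1):
        # Restrict column j to the band around the diagonal.
        lo, hi = max(0, i - alpha), min(n, i + alpha)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                d = match if S[i - 1] == T[j - 1] else mismatch
                best = max(best, OPT[i - 1][j - 1] + d)
            if i > 0:
                best = max(best, OPT[i - 1][j] + gap)
            if j > 0:
                best = max(best, OPT[i][j - 1] + gap)
            OPT[i][j] = best
    return OPT[m][n]
```

When α is large enough to cover the whole table, the banded score agrees with the full DP; the savings come from choosing a small α once preprocessing guarantees the path stays near the diagonal.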

Summary

We usually reduce an algorithm's running time by eliminating redundant calculations. By observing the backtracking path, we find that the pairs we really care about are the highly similar ones; pairs with large differences can be removed directly in a preprocessing step. For the similar pairs, the resulting path usually lies near the diagonal, which means the cells in the far corners are redundant calculations. To avoid them, the steps are designed as follows:

  1. Wrap the region the path can occupy in a band whose width is controlled by the parameter α
  2. Redefine the computation for the string pair as BANDED-DP(T, S, α), which checks whether each cell lies inside the band; this is where the redundant calculation is eliminated

As the sequence alignment case shows, to reduce time complexity in practice we can partition the possible inputs according to the application: cases that obviously cannot produce a useful result are filtered out in the preprocessing stage, which provides the basis for reducing the time complexity.

Tips:
1. In dynamic programming, saving time = removing redundancy

  • Only compute cells near the backtracking path
  • If the OPT table is sparse, only compute the entries that change

2. If the subproblems are too coarse, a recurrence often cannot be established; the subproblems should be refined

  • For example, d(v) -> d(v,k)
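The d(v) -> d(v,k) refinement in tip 2 is the classic move behind Bellman-Ford: d(v) alone ("shortest distance to v") admits no clean recurrence on a graph with cycles, but adding a budget k ("using at most k edges") does. A minimal sketch, shown only as an illustration of the refinement:

```python
def bellman_ford(edges, n, src):
    """Shortest distances from src via the refined subproblem
    d[k][v] = shortest distance to v using at most k edges.
    edges is a list of (u, v, weight) tuples; vertices are 0..n-1."""
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    d[0][src] = 0
    for k in range(1, n):
        d[k] = d[k - 1][:]           # option 1: use at most k-1 edges
        for u, v, w in edges:        # option 2: extend a path by edge (u, v)
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    return d[n - 1]
```

The coarse subproblem d(v) becomes solvable only after the extra index k is introduced, exactly the refinement the tip describes.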

Origin blog.csdn.net/qq_32505207/article/details/108058250