Problem Description
The edit distance problem asks for the minimum number of insertions, deletions, and substitutions required to convert one string into another. For example: $\mathrm{COMMOM} \overset{sub}{\rightarrow} \mathrm{COMMUM} \overset{sub}{\rightarrow} \mathrm{COMMUN} \overset{ins}{\rightarrow} \mathrm{COMMUNE}$. At least 3 operations are required to change the word COMMOM into COMMUNE.
Visualize edit distances to obtain sequence alignments
C | O | M | M | O | M | - |
---|---|---|---|---|---|---|
C | O | M | M | U | N | E |
- A gap (-) in the first row indicates an insertion
- A gap (-) in the second row indicates a deletion
- Columns with different characters represent substitutions
Edit distance = the number of columns with different characters in the sequence alignment
Minimum edit distance = the number of columns with different characters in the optimal sequence alignment
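The column-counting rule above can be sketched in a few lines of C++. This is only an illustration: the function name `alignmentCost` and the convention of passing the alignment as two equal-length rows with `-` for a gap are assumptions, not part of the original article.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Counts the columns whose two symbols differ in an alignment given as two
// equal-length rows, where '-' marks a gap. Hypothetical helper for
// illustration only.
int alignmentCost(const std::string& top, const std::string& bottom) {
    if (top.size() != bottom.size()) {
        throw std::invalid_argument("alignment rows must have equal length");
    }
    int cost = 0;
    for (std::size_t k = 0; k < top.size(); ++k) {
        // A differing column is a substitution, an insertion, or a deletion.
        if (top[k] != bottom[k]) ++cost;
    }
    return cost;
}
```

For the alignment of COMMOM and COMMUNE shown above, `alignmentCost("COMMOM-", "COMMUNE")` counts three differing columns, matching the three operations in the example.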
The edit distance problem can also be expressed like this:
For given strings $A[1...m]$ and $B[1...n]$, find their minimum edit distance $D(m,n)$.
Recurrence relation
Assume that for all $i<m$ and $j<n$, the minimum edit distance $D(i,j)$ between $A[1...i]$ and $B[1...j]$ can already be computed.
C | O | M | M | O | M | - |
---|---|---|---|---|---|---|
C | O | M | M | U | N | E |
Consider an optimal alignment of $A[1...m]$ and $B[1...n]$. Its last column obeys the following rules:
- The last column cannot be two spaces
- When a string is an empty string, the minimum edit distance is the length of another string
- If the last column contains both $A[m]$ and $B[n]$: $D(m,n) = D(m-1,n-1) + (A[m] = B[n] \,?\, 0 : 1)$
- If one side of the last column is a gap, remove the character on the other side: $D(m,n) = \begin{cases} D(m-1,n) + 1 & \text{last column is } A[m] \text{ and } - \\ D(m,n-1) + 1 & \text{last column is } - \text{ and } B[n] \end{cases}$
- To sum up, for any prefix lengths $i \le m$ and $j \le n$, it is enough to recurse along these three paths and take the smallest result: $D(i,j) = \begin{cases} i & \text{if } j = 0 \\ j & \text{if } i = 0 \\ \min \begin{cases} D(i-1,j) + 1 \\ D(i,j-1) + 1 \\ D(i-1,j-1) + (A[i] = B[j] \,?\, 0 : 1) \end{cases} & \text{otherwise} \end{cases}$
- Time complexity: $O(mn)$; space complexity: $O(mn)$.
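The recurrence above can be transcribed almost literally as a memoized recursion. The sketch below is one possible reading, not the article's implementation: the names `solveEditDistance` and `editDistanceMemo` are hypothetical, and because C++ strings are 0-indexed, $A[i]$ in the formula corresponds to `a[i - 1]` in the code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// D(i, j) from the recurrence, with memo[i][j] == -1 meaning "not computed".
// Hypothetical helper name; strings are 0-indexed, so A[i] is a[i - 1].
int solveEditDistance(const std::string& a, const std::string& b, int i, int j,
                      std::vector<std::vector<int>>& memo) {
    if (j == 0) return i;  // B prefix is empty: delete the i characters of A
    if (i == 0) return j;  // A prefix is empty: insert the j characters of B
    if (memo[i][j] >= 0) return memo[i][j];
    int del = solveEditDistance(a, b, i - 1, j, memo) + 1;
    int ins = solveEditDistance(a, b, i, j - 1, memo) + 1;
    int sub = solveEditDistance(a, b, i - 1, j - 1, memo) +
              (a[i - 1] == b[j - 1] ? 0 : 1);
    memo[i][j] = std::min({del, ins, sub});
    return memo[i][j];
}

// Entry point: allocates the (m+1) x (n+1) memo table and evaluates D(m, n).
int editDistanceMemo(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(
        a.size() + 1, std::vector<int>(b.size() + 1, -1));
    return solveEditDistance(a, b, static_cast<int>(a.size()),
                             static_cast<int>(b.size()), memo);
}
```

Each of the $(m+1)(n+1)$ subproblems is solved once, which is where the $O(mn)$ time and space bounds come from. The recursion depth can reach $m+n$, so for long strings the bottom-up table is usually the safer choice.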
Worked example
Each $D[i,j]$ can be obtained from three cells: $D[i-1,j-1]$, $D[i-1,j]$, and $D[i,j-1]$. These three cells correspond to the three operations substitution, deletion, and insertion, respectively.
Using the recurrence above, we fill in the table from top to bottom and from left to right. Once the table is filled, the value in the lower-right corner is the minimum edit distance. The next step is to backtrack through the table to construct an optimal alignment that achieves the minimum edit distance (as shown on the right side of the figure below).
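As a compact illustration of the fill-then-backtrack procedure just described, the sketch below builds the table and recovers a single optimal alignment as two gapped rows. The function name `alignOnce` and the `-` gap symbol are assumptions made for this example. When several optimal alignments exist, the tie-breaking order used here (diagonal, then deletion, then insertion) may produce a different one than the figure.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Fills the full (m+1) x (n+1) table, then walks back from the lower-right
// corner to build one optimal alignment. Hypothetical helper for illustration.
std::pair<std::string, std::string> alignOnce(const std::string& a,
                                              const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int sub = d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            d[i][j] = std::min({sub, d[i - 1][j] + 1, d[i][j - 1] + 1});
        }
    }
    // Backtrack: at each cell, follow a predecessor that explains its value.
    std::string top, bottom;
    std::size_t i = m, j = n;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)) {
            top += a[i - 1]; bottom += b[j - 1]; --i; --j;  // match or substitution
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
            top += a[i - 1]; bottom += '-'; --i;            // deletion
        } else {
            top += '-'; bottom += b[j - 1]; --j;            // insertion
        }
    }
    // The rows were built from the last column to the first.
    std::reverse(top.begin(), top.end());
    std::reverse(bottom.begin(), bottom.end());
    return {top, bottom};
}
```

The backtracking loop visits at most $m+n$ cells, which matches the $O(m+n)$ time for constructing an alignment once the table exists.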
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
// Computes the minimum edit distance and returns its value, filling the edit distance table dp.
// dp must already have (word1.length() + 1) rows and (word2.length() + 1) columns.
int minEditDistance(const string& word1, const string& word2, vector<vector<int>>& dp) {
    int m = static_cast<int>(word1.length());
    int n = static_cast<int>(word2.length());
    for (int i = 0; i <= m; ++i) {
        for (int j = 0; j <= n; ++j) {
            if (i == 0) {
                dp[i][j] = j;  // empty word1 prefix: j insertions
            }
            else if (j == 0) {
                dp[i][j] = i;  // empty word2 prefix: i deletions
            }
            else if (word1[i - 1] == word2[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1];  // characters match: no cost
            }
            else {
                dp[i][j] = 1 + min({
                    dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] });
            }
        }
    }
    return dp[m][n];
}
// Uses backtracking to find all operation sequences that achieve the minimum edit distance.
void findAllSequences(const string& word1, const string& word2, int i, int j, const string& sequence, vector<string>& sequences, vector<vector<int>>& dp) {
    if (i == 0 && j == 0) {
        sequences.push_back(sequence);
        return;
    }
    // Matching characters: move diagonally at no cost.
    if (i > 0 && j > 0 && word1[i - 1] == word2[j - 1]) {
        findAllSequences(word1, word2, i - 1, j - 1, "No operation: " + string(1, word1[i - 1]) + " -> " + string(1, word2[j - 1]) + "\n" + sequence, sequences, dp);
    }
    // Substitution: the diagonal cell explains the current value.
    if (i > 0 && j > 0 && dp[i][j] == dp[i - 1][j - 1] + 1) {
        findAllSequences(word1, word2, i - 1, j - 1, "Replace: " + string(1, word1[i - 1]) + " -> " + string(1, word2[j - 1]) + "\n" + sequence, sequences, dp);
    }
    // Deletion: the cell above explains the current value.
    if (i > 0 && dp[i][j] == dp[i - 1][j] + 1) {
        findAllSequences(word1, word2, i - 1, j, "Delete: " + string(1, word1[i - 1]) + " \n" + sequence, sequences, dp);
    }
    // Insertion: the cell to the left explains the current value.
    if (j > 0 && dp[i][j] == dp[i][j - 1] + 1) {
        findAllSequences(word1, word2, i, j - 1, "Insert: " + string(1, word2[j - 1]) + " \n" + sequence, sequences, dp);
    }
}
int main() {
    string word1 = "ALTRUISTIC";
    string word2 = "ALGORITHM";
    vector<vector<int>> dp(word1.length() + 1, vector<int>(word2.length() + 1, 0));
    int minDistance = minEditDistance(word1, word2, dp);
    cout << "Minimum Edit Distance between " << word1 << " and " << word2 << " is: " << minDistance << endl;
    vector<string> sequences;
    findAllSequences(word1, word2, word1.length(), word2.length(), "", sequences, dp);
    cout << "Operations to convert " << word1 << " to " << word2 << " are: " << endl;
    for (const string& seq : sequences) {
        cout << seq << "----------" << endl;
    }
    return 0;
}
Running result: (output screenshot not included)
Time and space complexity optimization
Now we can calculate the minimum edit distance and construct the optimal alignment. Their time and space complexities are summarized as follows:
 | Compute minimum edit distance | Construct optimal alignment |
---|---|---|
Time | $O(mn)$ | $O(m+n)$ |
Space | $O(mn)$ | $O(mn)$ |
In practice, $O(mn)$ space is a more serious limitation than $O(mn)$ time. Compare the two when $m = n = 10^5$:
- Time: executing $10^{10}$ instructions takes about 10 seconds (assuming the CPU executes $10^9$ instructions per second)
- Space: the table needs $10^{10}$ entries, about 40 GB (assuming 4 bytes per entry)
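The space estimate above is simple arithmetic, sketched here for concreteness. The helper names `tableCells` and `tableBytes` are hypothetical, and the 4-bytes-per-cell figure is an assumption (a 32-bit `int` per entry) that happens to reproduce the 40 GB quoted above.

```cpp
#include <cstdint>

// Number of cells in the table for strings of length m and n, ignoring the
// extra border row and column. Hypothetical helper for illustration.
std::uint64_t tableCells(std::uint64_t m, std::uint64_t n) {
    return m * n;
}

// Memory needed for the table if every cell takes bytesPerCell bytes.
std::uint64_t tableBytes(std::uint64_t m, std::uint64_t n,
                         std::uint64_t bytesPerCell) {
    return tableCells(m, n) * bytesPerCell;
}
```

With $m = n = 10^5$, `tableCells` gives $10^{10}$ cells and `tableBytes(100000, 100000, 4)` gives $4 \times 10^{10}$ bytes, roughly 40 GB, which is more than most machines hold in main memory even though the running time is only a matter of seconds.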
So, can we construct the optimal alignment using only $O(m+n)$ space?
Answer: Hirschberg algorithm can be used.
Hirschberg algorithm
The Hirschberg algorithm is an efficient linear space dynamic programming algorithm. It reduces space complexity by using a divide-and-conquer strategy to compute optimal alignments in linear space.
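To make the divide-and-conquer strategy concrete, here is a minimal sketch of a Hirschberg-style alignment for unit-cost edit distance. It is a simplified illustration under stated assumptions, not a definitive implementation: the names `lastRow` and `hirschbergAlign` are hypothetical, gaps are written as `-`, and the code copies substrings for readability, so its actual memory use is somewhat above the strict $O(m+n)$ that an index-based version would achieve.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Last row of the edit distance table for a versus every prefix of b,
// computed while keeping only two rows in memory.
std::vector<int> lastRow(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev;
}

// Returns one optimal alignment of a and b as two gapped rows.
std::pair<std::string, std::string> hirschbergAlign(const std::string& a,
                                                    const std::string& b) {
    if (a.empty()) return {std::string(b.size(), '-'), b};  // only insertions
    if (b.empty()) return {a, std::string(a.size(), '-')};  // only deletions
    if (a.size() == 1) {
        // Place the single character on a matching position of b if one
        // exists, otherwise substitute it for b[0]; the rest are insertions.
        std::size_t pos = b.find(a[0]);
        if (pos == std::string::npos) pos = 0;
        std::string top(b.size(), '-');
        top[pos] = a[0];
        return {top, b};
    }
    // Divide: split a in the middle and find where an optimal path crosses.
    const std::size_t mid = a.size() / 2;
    const std::string aLeft = a.substr(0, mid), aRight = a.substr(mid);
    const std::vector<int> left = lastRow(aLeft, b);
    const std::string aRightRev(aRight.rbegin(), aRight.rend());
    const std::string bRev(b.rbegin(), b.rend());
    const std::vector<int> right = lastRow(aRightRev, bRev);
    std::size_t split = 0;
    int best = left[0] + right[b.size()];
    for (std::size_t j = 1; j <= b.size(); ++j) {
        int cost = left[j] + right[b.size() - j];
        if (cost < best) { best = cost; split = j; }
    }
    // Conquer: align the two halves independently and concatenate.
    auto l = hirschbergAlign(aLeft, b.substr(0, split));
    auto r = hirschbergAlign(aRight, b.substr(split));
    return {l.first + r.first, l.second + r.second};
}
```

The forward pass scores the first half of `a` against every prefix of `b`, the backward pass scores the second half against every suffix, and the split point minimizing their sum lies on some optimal path. Only the two score rows are kept at each level, which is what removes the need for the full table.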
The idea of this algorithm is based on the following insights:
- In dynamic programming algorithms, a two-dimensional matrix is usually used to store intermediate states, which results in $O(mn)$ space complexity.
- But in fact, by observing the symmetry in the calculation process, the space complexity of dynamic programming can be reduced to $O(m+n)$.
During the dynamic programming computation, we observe that $D(i,j)$ depends only on $D(i-1,j)$, $D(i,j-1)$, and $D(i-1,j-1)$. Based on this, we can use two one-dimensional arrays of length $n+1$ (one entry for each column $j = 0, \dots, n$) to store the intermediate state, keeping only the previous row and the current row at any time.
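The two-row idea can be sketched as follows. The function name `editDistanceTwoRows` is an assumption made for this example. Note that this version returns only the distance: once earlier rows are discarded, the alignment can no longer be recovered by plain backtracking.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Computes the minimum edit distance while keeping only the previous and the
// current row of the table. Hypothetical helper for illustration.
int editDistanceTwoRows(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    // Row 0: turning the empty prefix of a into b[0..j) takes j insertions.
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);  // column 0: i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);  // the current row becomes the previous row
    }
    return prev[b.size()];
}
```

The running time is still $O(mn)$, but the memory drops to two rows of $n+1$ integers each.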