Programmer's algorithm class (6) - the longest common subsequence (LCS)

Disclaimer: This article is the blogger's original work and is licensed under CC 4.0 BY-SA. When reproducing it, please attach the original source link and this statement.
Link to this article: https://blog.csdn.net/m0_37609579/article/details/99999354

In the previous part we talked about dynamic programming. We also know that dynamic programming is particularly effective for problems with overlapping subproblems: it stores the solution of each subproblem, so that when that solution is needed again it can simply be looked up, which avoids repeated computation!

The problem we solve in this part is the longest common subsequence.

First, what is the longest common subsequence?

[Baidu Encyclopedia] LCS is the abbreviation of Longest Common Subsequence. If a sequence S is a subsequence of each of two or more known sequences, and is the longest among all sequences with that property, then S is called the longest common subsequence of the known sequences.

If two strings share some characters, those characters may form subsequences that are equal in both strings. The longest such subsequence is the longest common subsequence of the two strings, and its length can be found with dynamic programming.

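For example, take the string str1: "aabcd". The sequence aabc, whose characters are in order and adjacent to each other, is a subsequence of it; abd, whose characters are in order but not adjacent, is also a subsequence of it. That is, any sequence obtained by picking elements of the given sequence while keeping their original order is a subsequence.

Now take another string str2: "12abcabcd". Comparing the two shows that the longest common subsequence of str1 and str2 is aabcd, because str1 as a whole is a subsequence of str2 (the shorter abcd is a common subsequence too, but not the longest one).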

From this we can conclude:

  1. A subsequence is not a subset: it depends on the order of the elements in the original sequence.
  2. The empty sequence is a common subsequence of any two sequences.
  3. Subsequences, common subsequences, and longest common subsequences are not necessarily unique.
  4. A sequence of length n has 2^n subsequences in total (counted by the positions chosen), of which 2^n - 1 are non-empty.
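The definition of a subsequence can be checked mechanically. The following is a minimal sketch, not taken from the original article: the class name SubsequenceCheck and the method isSubsequence are hypothetical, and the inputs reuse str1 from the example above.

```java
// Hypothetical helper for illustration; not part of the original article's code.
class SubsequenceCheck {
    // Returns true if every character of candidate appears in text
    // in the same relative order (not necessarily adjacent).
    static boolean isSubsequence(String candidate, String text) {
        int matched = 0; // number of candidate characters matched so far
        for (int i = 0; i < text.length() && matched < candidate.length(); i++) {
            if (text.charAt(i) == candidate.charAt(matched)) {
                matched++;
            }
        }
        return matched == candidate.length();
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("aabc", "aabcd")); // in order and adjacent
        System.out.println(isSubsequence("abd", "aabcd"));  // in order, not adjacent
        System.out.println(isSubsequence("dba", "aabcd"));  // same letters, wrong order
        System.out.println(isSubsequence("", "aabcd"));     // the empty sequence
    }
}
```

The first, second and fourth calls return true; the third returns false, which illustrates conclusion 1: order matters, so a subsequence is not just a subset of the characters.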

Second, P problems and NP problems

P problems: problems that can be solved in polynomial time, i.e., with time complexity O(n^k).
NP problems: problems for which a given solution can be verified in polynomial time.

Third, solving the longest common subsequence

PS: A brute-force solution can be written recursively, but it has to enumerate every possibility. Its time complexity is O(2^m * 2^n), which is far too slow.
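To make the cost of brute force concrete, here is one possible sketch. It is an assumption of this rewrite rather than the article's code: the names BruteForceLcs, bruteForce and isSubsequence are hypothetical, and this variant enumerates only the subsequences of one string (via bitmasks) and tests each against the other, so it costs roughly 2^n * (n + m) steps. That is still exponential.

```java
// Illustrative brute force; names are hypothetical and inputs are assumed short.
class BruteForceLcs {
    // True if candidate can be obtained from text by deleting characters.
    static boolean isSubsequence(String candidate, String text) {
        int matched = 0;
        for (int i = 0; i < text.length() && matched < candidate.length(); i++) {
            if (text.charAt(i) == candidate.charAt(matched)) {
                matched++;
            }
        }
        return matched == candidate.length();
    }

    // Tries every subsequence of x (one per bitmask) and keeps the longest
    // one that is also a subsequence of y.
    static String bruteForce(String x, String y) {
        int n = x.length(); // assumed below 31 so that the mask fits in an int
        String best = "";
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder candidate = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.append(x.charAt(i)); // bit i set: keep x[i]
                }
            }
            if (candidate.length() > best.length()
                    && isSubsequence(candidate.toString(), y)) {
                best = candidate.toString();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bruteForce("ABCB", "BDCA"));
    }
}
```

Every extra character in x doubles the number of masks, which is why this approach is only usable on very short inputs.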

The general LCS problem, for an arbitrary number of input sequences, is NP-hard. When the number of sequences is fixed, it can be solved with dynamic programming. For two sequences of lengths n and m, the time complexity is O(n * m), and the space complexity is also O(n * m).

1. Analyzing the pattern

Problems that can be solved with dynamic programming generally have two characteristics: ① optimal substructure; ② overlapping subproblems

① optimal substructure

Let X = (x1, x2, ..., xn) and Y = (y1, y2, ..., ym) be two sequences, and denote the longest common subsequence of X and Y by LCS(X, Y).

Finding LCS(X, Y) is an optimization problem, because we need the longest of all the common subsequences of X and Y. To find the LCS of X and Y, first consider the last element of X and the last element of Y.

1) If xn = ym, i.e., the last element of X is the same as the last element of Y, then this element can be taken as the last element of the common subsequence. So now we only need to find LCS(Xn-1, Ym-1) and append this element to it.

LCS(Xn-1, Ym-1) is a subproblem of the original problem. Why is it called a subproblem? Because its size is smaller than that of the original problem (one element fewer is still smaller).

Why is it an optimal subproblem? Because what we are looking for is the longest common subsequence of Xn-1 and Ym-1. The longest! In other words, it is the optimal one (here "longest" is what "optimal" means).

2) If xn != ym, things get a little more troublesome, because this produces two subproblems: LCS(Xn-1, Ym) and LCS(Xn, Ym-1).

Because the last element of X and the last element of Y are not equal, they cannot both serve as the last element of the longest common subsequence (if they are not equal, how could they be common?).

LCS(Xn-1, Ym) means: the longest common subsequence is sought in (x1, x2, ..., x(n-1)) and (y1, y2, ..., ym).

LCS(Xn, Ym-1) means: the longest common subsequence is sought in (x1, x2, ..., xn) and (y1, y2, ..., y(m-1)).

Solve these two subproblems; whichever yields the longer common subsequence is LCS(X, Y). Expressed mathematically, with max taken over length:

LCS(Xn, Ym) = max{LCS(Xn-1, Ym), LCS(Xn, Ym-1)}

Conditions 1) and 2) together cover all possible cases. Therefore, we have successfully transformed the original problem into three smaller subproblems.

② overlapping sub-problems

What are overlapping subproblems? They arise when the original problem is decomposed into subproblems and some of those subproblems turn out to be the same problem.

Look at the original problem, LCS(X, Y). Its subproblems are ❶ LCS(Xn-1, Ym-1), ❷ LCS(Xn-1, Ym), ❸ LCS(Xn, Ym-1).

At first glance these three subproblems do not overlap. In fact they do, and they overlap for the most part. For example:

The second subproblem, LCS(Xn-1, Ym), contains problem ❶ LCS(Xn-1, Ym-1). Why?

Because when the last elements of Xn-1 and Ym are not the same, LCS(Xn-1, Ym) has to be decomposed in turn, into LCS(Xn-1, Ym-1) and LCS(Xn-2, Ym).

In other words: as the subproblems keep being broken down, some of them coincide.
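The effect of overlapping subproblems can be observed by counting recursive calls. The sketch below is illustrative only: the class OverlapDemo, its call counter and its method names are assumptions of this rewrite, and the input strings are the ones used later in the article. It compares the plain recursion from cases 1) and 2) with a version that stores each subproblem's answer.

```java
import java.util.Arrays;

// Illustrative comparison; names are hypothetical, not from the original article.
class OverlapDemo {
    static long calls; // counts recursive invocations, for illustration only

    // Plain recursion following cases 1) and 2); no results are stored.
    static int naive(String x, String y, int n, int m) {
        calls++;
        if (n == 0 || m == 0) return 0;
        if (x.charAt(n - 1) == y.charAt(m - 1)) {
            return naive(x, y, n - 1, m - 1) + 1;
        }
        return Math.max(naive(x, y, n - 1, m), naive(x, y, n, m - 1));
    }

    // Same recursion, but each (n, m) pair is solved once and then looked up.
    static int memoized(String x, String y, int n, int m, int[][] memo) {
        calls++;
        if (n == 0 || m == 0) return 0;
        if (memo[n][m] >= 0) return memo[n][m];
        int value;
        if (x.charAt(n - 1) == y.charAt(m - 1)) {
            value = memoized(x, y, n - 1, m - 1, memo) + 1;
        } else {
            value = Math.max(memoized(x, y, n - 1, m, memo),
                             memoized(x, y, n, m - 1, memo));
        }
        memo[n][m] = value;
        return value;
    }

    static long countNaiveCalls(String x, String y) {
        calls = 0;
        naive(x, y, x.length(), y.length());
        return calls;
    }

    static long countMemoizedCalls(String x, String y) {
        int[][] memo = new int[x.length() + 1][y.length() + 1];
        for (int[] row : memo) Arrays.fill(row, -1); // -1 marks "not computed yet"
        calls = 0;
        memoized(x, y, x.length(), y.length(), memo);
        return calls;
    }

    public static void main(String[] args) {
        String x = "ABCBDAB", y = "BDCABA";
        System.out.println("naive calls:    " + countNaiveCalls(x, y));
        System.out.println("memoized calls: " + countMemoizedCalls(x, y));
    }
}
```

The memoized version can make at most 2 * n * m + 1 calls, since each table cell is computed once and issues at most two further calls, whereas the plain recursion revisits the same (n, m) pairs again and again.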

2. Approach

Let the two-dimensional array entry c[i][j] denote the LCS length of the first i characters of X and the first j characters of Y. The following recurrence can then be obtained:

    c[i][j] = 0                              if i = 0 or j = 0
    c[i][j] = c[i-1][j-1] + 1                if i, j > 0 and xi = yj
    c[i][j] = max(c[i-1][j], c[i][j-1])      if i, j > 0 and xi != yj

  1. The first case is very easy to understand: when one of the strings has length 0, the result is certainly 0.
  2. The second case, when the two characters are equal, is also easy to understand. For example, take abcd and adcd. When the traversal reaches c, only the leading a has matched so far, which gives 1. Since the two c characters are equal, matching abc against adc must give a length 1 greater than matching ab against ad, and that extra 1 is the matching c. That is, when the characters are equal, the value is 1 larger than c[i-1][j-1].
  3. The third case follows naturally: if the characters are not equal, take the larger of the two neighbouring results.

Therefore, we just fill in the table starting from c[0][0] and continuing up to c[n][m], where n and m are the lengths of X and Y. The resulting c[n][m] is the length of the LCS.
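The table-filling step can be sketched in a few lines. This is a minimal illustration under assumptions: the class LcsLength and method lcsLength are hypothetical names, and the table has one extra row and column for the empty prefix, so the answer sits in the last cell.

```java
// Minimal bottom-up sketch; names are hypothetical, not from the original article.
class LcsLength {
    // Returns the LCS length of x and y by filling the table row by row.
    static int lcsLength(String x, String y) {
        int n = x.length();
        int m = y.length();
        // c[i][j]: LCS length of the first i chars of x and first j chars of y.
        // Java zero-initializes the array, so row 0 and column 0 are already 0.
        int[][] c = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;              // characters match
                } else {
                    c[i][j] = Math.max(c[i - 1][j], c[i][j - 1]); // take the larger neighbour
                }
            }
        }
        return c[n][m];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("ABCBDAB", "BDCABA"));
    }
}
```

Both loops run over every cell exactly once, which is where the O(n * m) time and space bounds come from.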

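But how do we obtain the LCS itself, rather than just its length? We use another two-dimensional array b to record it:

  • When the corresponding characters are equal, mark the cell with ↖
  • When p1 >= p2, mark the cell with ↑
  • When p1 < p2, mark the cell with ←

Here p1 and p2 presumably denote c[i-1][j] (the cell above) and c[i][j-1] (the cell to the left), which matches the direction of the arrows.

The marking function is given by the three rules above.

For example, consider finding the LCS of ABCBDAB and BDCABA.

In the filled-in table, the cells on the traceback path that carry a ↖ arrow (shaded gray in the original figure) are the LCS characters. This is a table-filling process: the completed table records how each cell was obtained, so we can recover the longest subsequence we want by looking things up in the table.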

Here you can see that we build an i*j matrix that holds not only the values (the optimal result so far for each cell) but also a direction arrow, which tells us which way to walk when we trace back.

So each cell has to store two values. You can use two two-dimensional matrices, or a single matrix of structures.
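As an illustration of the "single matrix of structures" option, here is a hedged sketch. The names LcsWithArrows, Cell and Direction are hypothetical, and the enum values stand in for the three arrows. On a tie between the cell above and the cell to the left, the sketch goes up, which is the same choice the article's implementation makes.

```java
// Sketch of one table whose cells hold both the length and the arrow.
// All names are hypothetical; this is not the original article's code.
class LcsWithArrows {
    // Stand-ins for: no arrow, the up-left arrow, the up arrow, the left arrow.
    enum Direction { NONE, DIAGONAL, UP, LEFT }

    // One table cell holding both values.
    static final class Cell {
        int length = 0;
        Direction arrow = Direction.NONE;
    }

    static String lcs(String x, String y) {
        int n = x.length();
        int m = y.length();
        Cell[][] table = new Cell[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                table[i][j] = new Cell(); // row 0 and column 0 keep length 0
            }
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                Cell cell = table[i][j];
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    cell.length = table[i - 1][j - 1].length + 1;
                    cell.arrow = Direction.DIAGONAL;
                } else if (table[i - 1][j].length >= table[i][j - 1].length) {
                    cell.length = table[i - 1][j].length;
                    cell.arrow = Direction.UP;
                } else {
                    cell.length = table[i][j - 1].length;
                    cell.arrow = Direction.LEFT;
                }
            }
        }
        // Walk back from the lower-right corner, collecting DIAGONAL cells.
        StringBuilder reversed = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 && j > 0) {
            switch (table[i][j].arrow) {
                case DIAGONAL:
                    reversed.append(x.charAt(i - 1));
                    i--;
                    j--;
                    break;
                case UP:
                    i--;
                    break;
                default: // LEFT
                    j--;
                    break;
            }
        }
        return reversed.reverse().toString(); // characters were collected back to front
    }

    public static void main(String[] args) {
        System.out.println(lcs("ABCBDAB", "BDCABA")); // prints BCBA
    }
}
```

Keeping both values in one cell keeps the fill loop and the traceback in step with each other, at the cost of allocating one small object per cell; two plain int matrices avoid that allocation.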

 

Fourth, a demo of filling in the c array

Take the LCS length of ABCB and BDCA as an example. The table is filled in cell by cell, row after row, and so on until it is complete.

In the completed table, the LCS length, 2, is in the lower-right corner.
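The demo table can also be reproduced in code. This is a small sketch under assumptions (the class FillTableDemo and its methods fill and format are hypothetical names) that builds the table for ABCB and BDCA and prints it one row per line.

```java
// Illustrative table printer; names are hypothetical, not from the original article.
class FillTableDemo {
    // Builds the (n+1) x (m+1) table of LCS lengths for all prefix pairs.
    static int[][] fill(String x, String y) {
        int n = x.length();
        int m = y.length();
        int[][] c = new int[n + 1][m + 1]; // row 0 and column 0 stay 0
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                } else {
                    c[i][j] = Math.max(c[i - 1][j], c[i][j - 1]);
                }
            }
        }
        return c;
    }

    // Renders the table one row per line, values separated by single spaces.
    static String format(int[][] table) {
        StringBuilder out = new StringBuilder();
        for (int[] row : table) {
            for (int j = 0; j < row.length; j++) {
                if (j > 0) out.append(' ');
                out.append(row[j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Rows correspond to the prefixes of ABCB, columns to the prefixes of BDCA.
        System.out.print(format(fill("ABCB", "BDCA")));
    }
}
```

For these inputs the rows printed are "0 0 0 0 0", "0 0 0 0 1", "0 1 1 1 1", "0 1 1 2 2" and "0 1 1 2 2", with the LCS length 2 in the lower-right corner.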

Fifth, the implementation code

 
    public class LongestCommonSubsequence {
        // mem[i][j]: LCS length of the first i chars of X and the first j chars of Y
        public static int[][] mem;
        // Marking table: 1 = upper-left arrow, 2 = up, 3 = left
        public static int[][] s;
        // Records the (1-based) indices in X of the LCS characters, last character first
        public static int[] result;

        public static int LCS(char[] X, char[] Y, int n, int m) {
            // Base cases: an empty prefix has an LCS of length 0.
            for (int i = 0; i <= n; i++) {
                mem[i][0] = 0;
                s[i][0] = 0;
            }
            for (int i = 0; i <= m; i++) {
                mem[0][i] = 0;
                s[0][i] = 0;
            }
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    if (X[i - 1] == Y[j - 1]) {
                        // Characters match: extend the LCS of the two shorter prefixes.
                        mem[i][j] = mem[i - 1][j - 1] + 1;
                        s[i][j] = 1;
                    } else {
                        // No match: take the larger neighbour, preferring "up" on a tie.
                        mem[i][j] = Math.max(mem[i][j - 1], mem[i - 1][j]);
                        if (mem[i][j] == mem[i - 1][j]) {
                            s[i][j] = 2;
                        } else {
                            s[i][j] = 3;
                        }
                    }
                }
            }
            return mem[n][m];
        }

        // Trace back the solution by following the arrows from the lower-right corner
        public static void trace_solution(int n, int m) {
            int i = n;
            int j = m;
            int p = 0;
            while (true) {
                if (i == 0 || j == 0) break;
                if (s[i][j] == 1) {
                    result[p] = i;
                    p++;
                    i--;
                    j--;
                } else if (s[i][j] == 2) {
                    i--;
                } else { // s[i][j] == 3
                    j--;
                }
            }
        }

        public static void print(int[][] array, int n, int m) {
            for (int i = 0; i < n + 1; i++) {
                for (int j = 0; j < m + 1; j++) {
                    System.out.printf("%d ", array[i][j]);
                }
                System.out.println();
            }
        }

        public static void main(String[] args) {
            char[] X = {'A', 'B', 'C', 'B', 'D', 'A', 'B'};
            char[] Y = {'B', 'D', 'C', 'A', 'B', 'A'};
            int n = X.length;
            int m = Y.length;
            // Key point: this effectively adds an extra first row and first column.
            mem = new int[n + 1][m + 1];
            // 1 means the upper-left arrow, 2 means up, 3 means left
            s = new int[n + 1][m + 1];
            result = new int[Math.min(n, m)];
            int longest = LCS(X, Y, n, m);
            System.out.println("Memo table:");
            print(mem, n, m);
            System.out.println("Marking function table:");
            print(s, n, m);
            System.out.printf("longest : %d \n", longest);

            trace_solution(n, m);
            // Note: result records each character's index in the sequence
            for (int k = longest - 1; k >= 0; k--) {
                // Subtract one more to align with the 0-based X and Y arrays.
                int index = result[k] - 1;
                System.out.printf("%c ", X[index]);
            }
        }
    }

Running it prints:

    Memo table:
    0 0 0 0 0 0 0
    0 0 0 0 1 1 1
    0 1 1 1 1 2 2
    0 1 1 2 2 2 2
    0 1 1 2 2 3 3
    0 1 2 2 2 3 3
    0 1 2 2 3 3 4
    0 1 2 2 3 4 4
    Marking function table:
    0 0 0 0 0 0 0
    0 2 2 2 1 3 1
    0 1 3 3 2 1 3
    0 2 2 1 3 2 2
    0 1 2 2 2 1 3
    0 2 1 2 2 2 2
    0 2 2 2 1 2 1
    0 1 2 2 2 1 2
    longest : 4
    B C B A
 

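Sixth, summary

I feel I have not explained this thoroughly enough; I will leave it here for now as a placeholder to come back to.

  1. Two arrays are needed: one to store the lengths, and one to recover the actual longest common subsequence.
  2. By storing earlier results in a two-dimensional table, later steps only need to look them up.
  3. git's diff algorithm is an extension of the longest common subsequence algorithm, with better performance.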

Origin www.cnblogs.com/anymk/p/11479911.html