[Python] The longest common subsequence VS the longest common substring [Dynamic Programming]

Preface

The original open-source Rouge toolkit depends on an old Perl environment that is hard to set up, so I implemented the metric myself, following the description in the Rouge paper.

Rouge has several variants, including Rouge-N, Rouge-L, Rouge-S, Rouge-W, and Rouge-SU. While reimplementing Rouge-L, I ran into the problem this post covers: finding the longest common subsequence of two strings.

The difference between the longest common subsequence & the longest common substring

1. Longest Common Subsequence (LCS): the longest sequence of elements that appears in both string A and string B in the same relative order as in each parent string. The elements do not have to be adjacent.

2. Longest Common Substring: like the LCS, but the elements must also be consecutive in both strings, that is, the result is a substring of each.

e.g., for csdnblog and belong, the longest common subsequence is blog and the longest common substring is lo.
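
The example above can be checked with a small sketch. The helper names `is_subsequence` and `is_substring` are illustrative and not part of the original post; they only test a given candidate and do not search for the longest one.

```python
def is_subsequence(candidate, text):
    """True if the items of candidate appear in text in order, gaps allowed."""
    it = iter(text)
    # `ch in it` advances the iterator until it finds ch, so order is enforced
    return all(ch in it for ch in candidate)


def is_substring(candidate, text):
    """True if candidate appears in text as one contiguous run."""
    return candidate in text


a, b = "csdnblog", "belong"

# "blog" is a subsequence of both strings, but contiguous only in the first
print(is_subsequence("blog", a), is_subsequence("blog", b))  # True True
print(is_substring("blog", a), is_substring("blog", b))      # True False

# "lo" is contiguous in both strings
print(is_substring("lo", a), is_substring("lo", b))          # True True
```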

Program design and implementation

1 Longest common subsequence

import copy

def longestCommonSubsequence(seqA, seqB):
    """
    Longest common subsequence of two sequences (strings or lists).
    Returns a dict with the LCS length ("len") and its elements ("lcs").
    """
    m = len(seqA)
    n = len(seqB)
    init_unit = {
        "len": 0,
        "lcs": []
    }
    # (m+1) rows by (n+1) columns; dp[i][j] covers seqA[:i] and seqB[:j].
    # Every cell starts as the same init_unit object, which is never mutated,
    # so row 0 and column 0 (an empty prefix) need no further setup.
    dp = [[init_unit] * (n + 1) for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seqA[i-1] == seqB[j-1]:
                # Extend the LCS of the two shorter prefixes by the matching element
                tmp_str = copy.copy(dp[i-1][j-1]["lcs"])
                tmp_str.append(seqA[i-1])
                dp[i][j] = {
                    "len": dp[i-1][j-1]["len"] + 1,
                    "lcs": tmp_str
                }
            elif dp[i-1][j]["len"] > dp[i][j-1]["len"]:  # keep the longer result
                dp[i][j] = dp[i-1][j]
            else:
                dp[i][j] = dp[i][j-1]
    return dp[m][n]

print(longestCommonSubsequence("GM%$ABG", "gbndGFMABG"))  # {'len': 5, 'lcs': ['G', 'M', 'A', 'B', 'G']}
print(longestCommonSubsequence(["G", "M", "%", "$", "A", "B", "G"], ["g", "b", "n", "d", "G", "F", "M", "A", "B", "G"]))  # {'len': 5, 'lcs': ['G', 'M', 'A', 'B', 'G']}
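
The function above stores a full list in every table cell, which is easy to follow but costs extra memory. As an alternative sketch (not the method used in this post), the table can hold lengths only, and one LCS can be recovered afterwards by walking back from the bottom-right corner. The name `lcs_backtrack` is hypothetical.

```python
def lcs_backtrack(seq_a, seq_b):
    """Return one LCS of seq_a and seq_b as a list, using a length-only table."""
    m, n = len(seq_a), len(seq_b)
    # length[i][j] = LCS length of seq_a[:i] and seq_b[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Walk back from the bottom-right corner to recover one LCS
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if seq_a[i - 1] == seq_b[j - 1]:
            lcs.append(seq_a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] > length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    lcs.reverse()
    return lcs


print(lcs_backtrack("GM%$ABG", "gbndGFMABG"))  # ['G', 'M', 'A', 'B', 'G']
```

When several subsequences share the maximum length, this sketch returns one of them; which one depends on the tie-breaking in the walk back.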

2 Longest common substring

import copy

def longestCommonSubstring(strA, strB):
    """
    Longest common substring of two sequences (strings or lists).
    Returns a dict with the substring length ("len") and its elements ("lcs").
    """
    m = len(strA)
    n = len(strB)
    init_unit = {
        "len": 0,
        "lcs": []
    }
    # (m+1) rows by (n+1) columns; dp[i][j] is the common run that ends
    # exactly at strA[i-1] and strB[j-1]. Every cell starts as init_unit.
    dp = [[init_unit] * (n + 1) for i in range(m + 1)]
    result = {
        "len": 0,  # length of the longest common substring found so far
        "lcs": []
    }
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if strA[i-1] == strB[j-1]:
                # Extend the run that ended at the previous pair of elements
                tmp_str = copy.copy(dp[i-1][j-1]["lcs"])
                tmp_str.append(strA[i-1])
                dp[i][j] = {
                    "len": dp[i-1][j-1]["len"] + 1,
                    "lcs": tmp_str
                }
                if dp[i][j]["len"] > result["len"]:  # keep the longest run
                    result = copy.copy(dp[i][j])
            # On a mismatch dp[i][j] stays init_unit: the common run is broken
    return result

print(longestCommonSubstring("GM%$ABG", "gbndGFMABG"))  # {'len': 3, 'lcs': ['A', 'B', 'G']}
print(longestCommonSubstring(["G", "M", "%", "$", "A", "B", "G"], ["g", "b", "n", "d", "G", "F", "M", "A", "B", "G"]))  # {'len': 3, 'lcs': ['A', 'B', 'G']}
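
Since each cell in the substring table depends only on the cell diagonally above it, a lighter variant is possible. The following is a sketch under that observation, not the post's own implementation: it keeps one previous row of run lengths plus the end position of the best run, and slices the answer out of the first input at the end. The name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(seq_a, seq_b):
    """Return the longest common contiguous run of seq_a and seq_b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run inside seq_a
    # prev[j] = length of the common run ending at the previous item of
    # seq_a and at seq_b[j-1]
    prev = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        curr = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    # The result has the same type as seq_a (str slice or list slice)
    return seq_a[best_end - best_len:best_end]


print(longest_common_substring("GM%$ABG", "gbndGFMABG"))  # ABG
print(longest_common_substring("csdnblog", "belong"))     # lo
```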

Applications

Rouge-L is an evaluation metric for machine learning tasks such as automatic text summarization, machine translation, and machine reading comprehension.
Rouge-L has two levels:

  • Sentence level: the longest common subsequence of two sentences
  • Summary level: the union of the longest common subsequences over multiple sentences
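
To show how the LCS feeds into the metric, here is a minimal sketch of the sentence-level score. It assumes the usual definitions from the Rouge paper: recall = LCS / reference length, precision = LCS / candidate length, and F = (1 + β²)·R·P / (R + β²·P). The function names, the whitespace tokenization in the usage lines, and the default `beta=1.0` are assumptions made for illustration; the summary-level union variant is not covered.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sentence(reference, candidate, beta=1.0):
    """Sentence-level Rouge-L recall, precision and F-score on token lists.

    beta weights recall against precision; 1.0 is an assumed default.
    """
    lcs = lcs_length(reference, candidate) if reference and candidate else 0
    if lcs == 0:
        return {"recall": 0.0, "precision": 0.0, "f": 0.0}
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return {"recall": recall, "precision": precision, "f": f}


reference = "police killed the gunman".split()
candidate = "police kill the gunman".split()
# LCS is "police the gunman" (3 tokens) out of 4 tokens on each side
print(rouge_l_sentence(reference, candidate))
# {'recall': 0.75, 'precision': 0.75, 'f': 0.75}
```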

Origin blog.csdn.net/sinat_38682860/article/details/108977162