Information Retrieval Experiment 2: Computing the Edit Distance Between Two Strings

Introduction

        Edit distance is used to measure how similar two strings are. It is the minimum number of editing operations needed to transform one string into the other; the larger the edit distance, the more the two strings differ. The allowed operations are replacing, inserting, and deleting a single character.
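As a small worked illustration (the particular sequence of edits is just one possible choice, not taken from the original post), the term input can be turned into insert with three single-character operations, so their edit distance is at most 3:

```python
# One possible sequence of three edits from "input" to "insert".
word = "input"
steps = [word]

word = word[:2] + "s" + word[3:]   # replace 'p' with 's'
steps.append(word)
word = word[:3] + "e" + word[4:]   # replace 'u' with 'e'
steps.append(word)
word = word[:4] + "r" + word[4:]   # insert 'r' before the final 't'
steps.append(word)

print(" -> ".join(steps))  # input -> insut -> inset -> insert
```

Showing that no shorter sequence exists is what the matrix procedure is for.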

Specific steps:

  • Step 1: create a matrix (a two-dimensional array) d. If the two strings have lengths m and n, the matrix has dimensions (m+1)*(n+1). Initialization: the first column holds the row indices 0..m, the first row holds the column indices 0..n, and every other element starts at 0.
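  • Step 2: taking the rows to index the first string and reading the edits as turning the first string into the second, d[i-1,j]+1 represents a delete operation, d[i,j-1]+1 represents an insert operation, and d[i-1,j-1]+cost represents a replace operation, where cost is 0 if the two current characters are equal and 1 otherwise.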
  • Step 3: traverse every remaining cell of the matrix row by row, use an if statement to set cost, and update each element with d[i,j]=min(d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost). When the traversal finishes, the bottom-right element d[m,n] is the minimum number of edit operations between the two strings.
  • Step 4: compute and output the similarity of the two strings, 1-d/max(len(str1), len(str2)), where d is the edit distance from step 3. End.
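The four steps above can be sketched as two small functions. This is a minimal sketch rather than the post's own listing: the names edit_distance and similarity are assumptions made for illustration, the comments label the operations from the point of view of turning s1 into s2, and returning 1.0 for two empty strings is an assumed convention to avoid dividing by zero.

```python
def edit_distance(s1, s2):
    """Return the edit distance between s1 and s2 (hypothetical helper name)."""
    m, n = len(s1), len(s2)
    # Step 1: (m+1) x (n+1) matrix; first column = row index, first row = column index.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    # Steps 2 and 3: fill each remaining cell from its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete s1[i-1]
                          d[i][j - 1] + 1,         # insert s2[j-1]
                          d[i - 1][j - 1] + cost)  # replace (or keep when cost is 0)
    return d[m][n]


def similarity(s1, s2):
    """Step 4: 1 - d / max(len(s1), len(s2)); assumed to be 1.0 for two empty strings."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(s1, s2) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```

Keeping the distance and the similarity in separate functions means the distance can be reused without the printing that the exercise listings do.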

Exercises:

1. Enter the terms "input" and "insert" and calculate their edit distance


def minEditDist(sm, sn):
    m, n = len(sm) + 1, len(sn) + 1
    len1 = max(len(sm), len(sn))                # longer of the two lengths, used for the similarity
    # create the m x n matrix (one more row and column than the string lengths)
    matrix = [[0] * n for _ in range(m)]
    matrix[0][0] = 0  # initialize the matrix
    for i in range(1, m):
        matrix[i][0] = matrix[i - 1][0] + 1     # first column of each row holds the row index
    for j in range(1, n):
        matrix[0][j] = matrix[0][j - 1] + 1     # first row of each column holds the column index
    for i in range(m):
        print(matrix[i])                        # show the initialized matrix
    print("********** start calculation **********")

    for i in range(1, m):
        for j in range(1, n):           # visit every remaining cell (i is the row, j is the column)
            if sm[i - 1] == sn[j - 1]:  # same character at this position: cost is 0
                cost = 0
            else:
                cost = 1                # different characters at this position: cost is 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + cost)
            # print("row-by-row result:\n", matrix[i], '\n')
    for i in range(m):              # print the matrix after the calculation
        print(matrix[i])
    # print the similarity; two empty strings count as identical, which avoids dividing by zero
    print(1 - matrix[m - 1][n - 1] / len1 if len1 else 1.0)
    return matrix[m - 1][n - 1]     # return the final edit distance


if __name__ == '__main__':
    ABC1 = input("input is: ")
    ABC2 = input("insert is: ")
    mindist = minEditDist(ABC1, ABC2)
    print("Edit distance between the two:", mindist)

2. Enter the terms "solution" and "source" and calculate their edit distance


def minEditDist(sm, sn):
    m, n = len(sm) + 1, len(sn) + 1
    len1 = max(len(sm), len(sn))                # longer of the two lengths, used for the similarity
    # create the m x n matrix (one more row and column than the string lengths)
    matrix = [[0] * n for _ in range(m)]
    matrix[0][0] = 0  # initialize the matrix
    for i in range(1, m):
        matrix[i][0] = matrix[i - 1][0] + 1     # first column of each row holds the row index
    for j in range(1, n):
        matrix[0][j] = matrix[0][j - 1] + 1     # first row of each column holds the column index
    for i in range(m):
        print(matrix[i])                        # show the initialized matrix
    print("********** start calculation **********")

    for i in range(1, m):
        for j in range(1, n):           # visit every remaining cell (i is the row, j is the column)
            if sm[i - 1] == sn[j - 1]:  # same character at this position: cost is 0
                cost = 0
            else:
                cost = 1                # different characters at this position: cost is 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + cost)
            # print("row-by-row result:\n", matrix[i], '\n')
    for i in range(m):              # print the matrix after the calculation
        print(matrix[i])
    # print the similarity; two empty strings count as identical, which avoids dividing by zero
    print(1 - matrix[m - 1][n - 1] / len1 if len1 else 1.0)
    return matrix[m - 1][n - 1]     # return the final edit distance


if __name__ == '__main__':
    ABC1 = input("solution is: ")
    ABC2 = input("source is: ")
    mindist = minEditDist(ABC1, ABC2)
    print("Edit distance between the two:", mindist)
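To cross-check the two exercise results without typing at the prompts, the same recurrence can also be evaluated top-down. The sketch below swaps the matrix for memoized recursion (functools.lru_cache), so it is a different formulation from the listings above; the name lev is an assumption made for illustration.

```python
from functools import lru_cache


def lev(a, b):
    """Edit distance via memoized recursion over prefix lengths (hypothetical helper)."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # go(i, j) is the edit distance between the prefixes a[:i] and b[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(go(i - 1, j) + 1, go(i, j - 1) + 1, go(i - 1, j - 1) + cost)

    return go(len(a), len(b))


for first, second in [("input", "insert"), ("solution", "source")]:
    dist = lev(first, second)
    print(first, second, dist, 1 - dist / max(len(first), len(second)))
```

For these two pairs the distances come out as 3 and 5, giving similarities of 0.5 and 0.375. Recursion is fine for short terms like these; for long strings the matrix version avoids hitting Python's recursion limit.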

Origin blog.csdn.net/rui_qi_jian_xi/article/details/130014185