A detailed explanation of string similarity judgment in Python

1. Background introduction

      In a recent project, two string-similarity algorithms were used to correct the results of OCR text recognition and thereby improve its accuracy. Through this correction (essentially a fuzzy lookup), the recognition accuracy rose from 65% to 90%. The result was exciting enough that I am blogging it for the record.

2. Method and implementation

      The method implemented in this article is edit distance: the minimum number of insertions, deletions, and substitutions required to transform the source string (s) into the target string (t).

      To turn the distance into a similarity score, take the maximum of the two string lengths, maxLen, and compute 1 - (edit distance / maxLen).

      For example, abc and abe (both of length 3) differ by a single substitution, so the similarity is 1 - 1/3 ≈ 0.667. (Note that Levenshtein.ratio, used below, normalizes slightly differently: it divides by the sum of the two lengths and counts a substitution as two edits, so its values can differ from this formula.)
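      To make the formula concrete, here is a minimal pure-Python sketch (an illustration only, not the library's implementation) that computes the edit distance with the standard dynamic-programming recurrence and then normalizes it as 1 - distance / maxLen:

def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from the empty prefix of s to every prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # distance from s[:i] to "" is i deletions
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete s[i-1]
                          curr[j - 1] + 1,      # insert t[j-1]
                          prev[j - 1] + cost)   # substitute (or match for free)
        prev = curr
    return prev[n]

def similarity(s, t):
    """1 - edit distance / maxLen, the normalization described above."""
    max_len = max(len(s), len(t))
    return 1.0 if max_len == 0 else 1 - edit_distance(s, t) / max_len

print(similarity("abc", "abe"))  # 1 - 1/3 = 0.666...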

      Required package: install the Levenshtein module with the command pip install python-Levenshtein.

import Levenshtein

str_list = ["你好", "今天天气很好", "明天去吃大餐", "我喜欢编程"]
string = "天气正好好"

score_list = []

for s in str_list:
    # Compute the edit-distance similarity, i.e. the Levenshtein ratio
    score = Levenshtein.ratio(string, s)
    score_list.append(score)

print("Similarity of %s to each of the other strings:" % string)
print(str_list)
print(score_list)

      Output:

Similarity of 天气正好好 to each of the other strings:
['你好', '今天天气很好', '明天去吃大餐', '我喜欢编程']
[0.2857142857142857, 0.5454545454545454, 0.18181818181818182, 0.0]

    The results look reliable: the higher the score, the closer the two strings. 今天天气很好 scores highest at 0.545 (under ratio's normalization this is (11 - 5) / 11: a total length of 11 and an edit distance of 5 with substitutions counted as two edits), while the completely unrelated 我喜欢编程 scores 0. The blogger used this technique in the project and it greatly improved the accuracy of text recognition.
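    As a sketch of how this can drive OCR correction (the vocabulary, threshold value, and function name correct_ocr here are illustrative assumptions, not the actual project code): snap each recognized string to its most similar entry in a list of known-good strings, falling back to the raw OCR output when nothing is similar enough.

import Levenshtein

def correct_ocr(recognized, candidates, threshold=0.6):
    """Return the candidate most similar to the OCR output, or the
    raw output itself if no candidate scores above the threshold."""
    best, best_score = recognized, 0.0
    for cand in candidates:
        score = Levenshtein.ratio(recognized, cand)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else recognized

# Hypothetical usage: snap a noisy OCR result onto a known vocabulary
vocabulary = ["invoice number", "invoice date", "total amount"]
print(correct_ocr("invoce numbr", vocabulary))  # -> invoice number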

 

Origin: blog.csdn.net/Guo_Python/article/details/110229037