Recently, while comparing short text strings for similarity, I revisited the edit distance algorithm and its applications.
1. Concept:
Edit distance, also known as Levenshtein distance, is the minimum number of editing operations required to convert one string into another.
Permissible editing operations include replacing one character with another, inserting a character, and deleting a character.
For example, converting the word kitten into sitting takes three edits:
- kitten → sitten (replace k with s)
- sitten → sittin (replace e with i)
- sittin → sitting (insert g at the end)
Russian scientist Vladimir Levenshtein proposed this concept in 1965.
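The three edits in the kitten example can be replayed with a few string helpers. This is only an illustrative sketch: the class name KittenToSitting and the helpers substitute, insert, and delete are hypothetical names introduced here, not part of any library.

```java
class KittenToSitting {
    // Replace the character at index i with c.
    static String substitute(String s, int i, char c) {
        return s.substring(0, i) + c + s.substring(i + 1);
    }

    // Insert character c so that it ends up at index i.
    static String insert(String s, int i, char c) {
        return s.substring(0, i) + c + s.substring(i);
    }

    // Delete the character at index i.
    static String delete(String s, int i) {
        return s.substring(0, i) + s.substring(i + 1);
    }

    public static void main(String[] args) {
        String step1 = substitute("kitten", 0, 's'); // sitten
        String step2 = substitute(step1, 4, 'i');    // sittin
        String step3 = insert(step2, 6, 'g');        // sitting
        System.out.println(step1 + " -> " + step2 + " -> " + step3);
    }
}
```

Each helper call counts as one editing operation, so the chain above uses exactly three. The delete helper is not needed for this pair of words, but it completes the set of permitted operations.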
2. Algorithm:
Question: Find the edit distance between two strings, that is, the minimum number of steps required to convert a string s1 into a string s2. Three operations are allowed: inserting a character, deleting a character, and replacing a character.
Analysis: First define a function (matrix) d(i, j), which represents the edit distance between the first i characters of the first string and the first j characters of the second string.
Algorithm process:
- If the length of str1 or str2 is 0, return the length of the other string: if (str1.length == 0) return str2.length; if (str2.length == 0) return str1.length;
- Initialize a matrix d of size (n+1)*(m+1), where n and m are the lengths of str1 and str2, and fill the first row and the first column with 0, 1, 2, ... so that d[i,0] = i and d[0,j] = j;
- Scan the two strings (n*m steps). If the i-th character of str1 equals the j-th character of str2, set temp to 0; otherwise set temp to 1. Then assign d[i,j] the minimum of d[i-1,j]+1, d[i,j-1]+1, and d[i-1,j-1]+temp;
- After the scan, return the last entry d[n,m] of the matrix, which is the edit distance.
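The steps above can be summarized as a recurrence. The notation below is a sketch that assumes 1-based character positions, so str1[i] means the i-th character of str1, and it uses the amsmath package for cases and \text.

```latex
\[
d(i, 0) = i, \qquad d(0, j) = j,
\]
\[
d(i, j) = \min
\begin{cases}
d(i-1, j) + 1 \\
d(i, j-1) + 1 \\
d(i-1, j-1) + \mathrm{temp}(i, j)
\end{cases}
\qquad
\mathrm{temp}(i, j) =
\begin{cases}
0 & \text{if } \mathrm{str1}[i] = \mathrm{str2}[j] \\
1 & \text{otherwise}
\end{cases}
\]
```

The three terms inside the minimum correspond to deleting a character from str1, inserting a character from str2, and replacing (or keeping) a character.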
The similarity is then calculated as: 1 - distance / max(length of str1, length of str2).
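As a small worked example of this formula, the sketch below plugs in the kitten/sitting pair (distance 3, lengths 6 and 7), which gives a similarity of about 0.571. The class and method names are hypothetical, and treating two empty strings as fully similar is an assumption made here to avoid dividing by zero.

```java
class SimilarityExample {
    // Similarity = 1 - distance / max(len1, len2).
    // Assumption: two empty strings are treated as identical (similarity 1.0).
    static double similarity(int distance, int len1, int len2) {
        int longest = Math.max(len1, len2);
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance / longest;
    }

    public static void main(String[] args) {
        // kitten -> sitting: distance 3, lengths 6 and 7
        System.out.println(similarity(3, 6, 7));
    }
}
```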
As the steps above show, the algorithm is a form of dynamic programming. In theory, the problem could also be solved by brute force, trying every possible sequence of edits, but that recomputes the same subproblems again and again and quickly becomes too expensive. Dynamic programming solves the problem well because each d(i, j) is computed once and then reused.
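To make the contrast concrete, here is a minimal sketch of the naive recursive approach, under the same definition of d(i, j). The class name NaiveEditDistance is hypothetical. It returns the correct distance, but it evaluates the same (i, j) pairs many times, which is exactly the repeated work that the matrix avoids.

```java
class NaiveEditDistance {
    // Edit distance between the first i characters of a and the first j characters of b,
    // computed by plain recursion with no table (exponential running time).
    static int distance(String a, int i, String b, int j) {
        if (i == 0) {
            return j; // only insertions remain
        }
        if (j == 0) {
            return i; // only deletions remain
        }
        int temp = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        int del = distance(a, i - 1, b, j) + 1;
        int ins = distance(a, i, b, j - 1) + 1;
        int sub = distance(a, i - 1, b, j - 1) + temp;
        return Math.min(del, Math.min(ins, sub));
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", 6, "sitting", 7)); // prints 3
    }
}
```

This is fine for short words, but the number of recursive calls grows exponentially with the string lengths, so it is only useful as an illustration.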
3. Code implementation
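package tools;

public class EditDistance {
    // DP table: array[j][i] is the edit distance between the first i
    // characters of str1 and the first j characters of str2.
    private int[][] array;
    private String str1;
    private String str2;

    public EditDistance(String str1, String str2) {
        this.str1 = str1;
        this.str2 = str2;
    }

    // Fills the DP table and returns the edit distance between str1 and str2.
    public int edit() {
        int max1 = str1.length();
        int max2 = str2.length();
        // Allocate the table, one cell larger than the string length in each dimension
        array = new int[max2 + 1][max1 + 1];
        // First row: turning a prefix of str1 into the empty string takes i deletions
        for (int i = 0; i <= max1; i++) {
            array[0][i] = i;
        }
        // First column: building a prefix of str2 from the empty string takes j insertions
        for (int j = 0; j <= max2; j++) {
            array[j][0] = j;
        }
        for (int i = 1; i <= max1; i++) {
            for (int j = 1; j <= max2; j++) {
                array[j][i] = levenshtein(i, j, str1.charAt(i - 1), str2.charAt(j - 1));
            }
        }
        return array[max2][max1];
    }

    // Computes cell (j, i) from its three already filled neighbours.
    public int levenshtein(int i, int j, char si, char sj) {
        int result = 0;
        if (i >= 1 && j >= 1) {
            int a = array[j - 1][i] + 1; // insert sj
            int b = array[j][i - 1] + 1; // delete si
            // Replace si with sj: cost 1 if they differ, 0 if they match
            int c = array[j - 1][i - 1] + ((si != sj) ? 1 : 0);
            result = min(a, b, c);
        }
        return result;
    }

    public int min(int a, int b, int c) {
        return Math.min(a, Math.min(b, c));
    }

    // Computes the similarity: 1 - distance / length of the longer string
    public float similarity() {
        if (array == null) {
            edit(); // make sure the table has been filled
        }
        int maxLength = Math.max(str1.length(), str2.length());
        if (maxLength == 0) {
            return 1f; // two empty strings are treated as identical
        }
        return 1 - (float) array[str2.length()][str1.length()] / maxLength;
    }

    public static void main(String[] args) {
        String str1 = "首选的确诊方法是()";
        String str2 = "首选确诊方法是";
        // Tools.cleanTitle is a helper from the same package that is not shown in this post
        EditDistance lt = new EditDistance(Tools.cleanTitle(str1), Tools.cleanTitle(str2));
        System.out.println(lt.edit());
        System.out.println(lt.similarity());
    }
}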
Test result:
3
0.7
4. Application
- DNA analysis
- spell check
- speech recognition
- plagiarism detection
- search engine