Short Text Similarity: Edit Distance Algorithm and Its Applications

I recently needed to compare short text strings for similarity, so I revisited the edit distance algorithm and its applications.

1. Concept:

Edit distance, also known as Levenshtein distance, is the minimum number of editing operations required to convert one string into another.

Permissible editing operations include replacing one character with another, inserting a character, and deleting a character.

For example, converting the word kitten into sitting takes three edits:

  1. kitten → sitten (k→s)
  2. sitten → sittin (e→i)
  3. sittin → sitting (+g)
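
The same three edits can be replayed in code. This is only an illustrative sketch: the class name KittenToSitting and the method transform are made up for this example, and the character positions are hard-coded for this one word.

```java
class KittenToSitting {
    // Applies the three edits from the example above, one at a time.
    static String transform(String word) {
        StringBuilder sb = new StringBuilder(word);
        sb.setCharAt(0, 's'); // replace:  kitten -> sitten (k -> s)
        sb.setCharAt(4, 'i'); // replace:  sitten -> sittin (e -> i)
        sb.append('g');       // insert:   sittin -> sitting (+g)
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform("kitten")); // prints "sitting"
    }
}
```

Three operations are enough here, and no shorter sequence exists, so the edit distance between kitten and sitting is 3.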

Russian scientist Vladimir Levenshtein proposed this concept in 1965.

2. Algorithm:

Question: Find the edit distance between two strings, that is, the minimum number of steps required to convert a string s1 into a string s2. Three operations are allowed: adding a character, deleting a character, and modifying a character.

Analysis: First define a function (matrix) d(i, j), which represents the edit distance from the prefix of length i of the first string to the prefix of length j of the second string.
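
Written as a recurrence, d(i, j) can be sketched as follows. The symbols s_1 and s_2 stand for the two strings with 1-based character positions, and c(i, j) is a name introduced here for the replacement cost (it plays the role of temp in the algorithm process).

```latex
\begin{aligned}
d(i, 0) &= i, \qquad d(0, j) = j, \\
d(i, j) &= \min
\begin{cases}
d(i-1, j) + 1 & \text{(delete)} \\
d(i, j-1) + 1 & \text{(insert)} \\
d(i-1, j-1) + c(i, j) & \text{(replace or keep)}
\end{cases} \\
c(i, j) &=
\begin{cases}
0 & \text{if } s_1[i] = s_2[j] \\
1 & \text{otherwise}
\end{cases}
\end{aligned}
```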

Algorithm process:

  1. If either str1 or str2 has length 0, return the length of the other string: if (str1.length == 0) return str2.length; if (str2.length == 0) return str1.length;
  2. Initialize a matrix d of size (n+1)*(m+1), where n and m are the lengths of str1 and str2, and let the values of the first row and first column increase from 0, so that d[i,0] = i and d[0,j] = j;
  3. Scan the two strings (n*m steps). If the i-th character of str1 equals the j-th character of str2, set temp to 0; otherwise set temp to 1. Then assign d[i,j] the minimum of d[i-1,j]+1, d[i,j-1]+1, and d[i-1,j-1]+temp;
  4. After the scan, return the last value of the matrix, d[n,m], which is the edit distance.

The similarity is calculated as: 1 - distance / max(length of str1, length of str2).
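
As a rough sketch of the steps and the similarity formula together, the following standalone class may help. The names SimilaritySketch, distance and similarity are chosen for this example only, and returning 1.0 for two empty strings is an assumption, since the formula itself is undefined when the maximum length is 0.

```java
class SimilaritySketch {
    // Levenshtein distance using the (n+1) x (m+1) matrix from the steps above.
    static int distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // first column: 0..n
        for (int j = 0; j <= m; j++) d[0][j] = j; // first row: 0..m
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int temp = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
                int viaDelete = d[i - 1][j] + 1;
                int viaInsert = d[i][j - 1] + 1;
                int viaReplace = d[i - 1][j - 1] + temp;
                d[i][j] = Math.min(Math.min(viaDelete, viaInsert), viaReplace);
            }
        }
        return d[n][m];
    }

    // similarity = 1 - distance / max(length of s1, length of s2)
    static double similarity(String s1, String s2) {
        int maxLen = Math.max(s1.length(), s2.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(s1, s2) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // prints 3
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

For kitten and sitting the distance is 3 and the longer string has 7 characters, so the similarity is 1 - 3/7, roughly 0.571.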

From the process above, it is easy to see that the algorithm is a form of dynamic programming. In theory the problem could also be solved by brute force, trying every possible sequence of edits, but that approach recomputes the same subproblems over and over and its running time grows exponentially with the string lengths. Dynamic programming solves each subproblem once and stores the result in the matrix, which handles the problem well.
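
For contrast, the brute-force idea can be sketched as a plain recursion over the same three choices. This is not the approach used in this article; the class NaiveEditDistance and its method names are hypothetical, and the code is only practical for very short strings.

```java
class NaiveEditDistance {
    // Edit distance between the first i characters of s1 and the first j characters of s2,
    // computed by trying all three operations recursively without storing any results.
    static int naive(String s1, int i, String s2, int j) {
        if (i == 0) return j; // only insertions remain
        if (j == 0) return i; // only deletions remain
        int temp = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
        int viaDelete = naive(s1, i - 1, s2, j) + 1;
        int viaInsert = naive(s1, i, s2, j - 1) + 1;
        int viaReplace = naive(s1, i - 1, s2, j - 1) + temp;
        return Math.min(viaDelete, Math.min(viaInsert, viaReplace));
    }

    static int naive(String s1, String s2) {
        return naive(s1, s1.length(), s2, s2.length());
    }

    public static void main(String[] args) {
        System.out.println(naive("kitten", "sitting")); // prints 3
    }
}
```

Each call branches into three further calls, so the same (i, j) pairs are evaluated many times. The matrix version computes each pair exactly once, which is what brings the cost down to n*m steps.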

3. Code implementation

package tools;

public class EditDistance {

	private int[][] array;
	private String str1;
	private String str2;

	public EditDistance(String str1, String str2) {
		this.str1 = str1;
		this.str2 = str2;
	}

	public int edit() {
		int max1 = str1.length();
		int max2 = str2.length();
		// Allocate the matrix, one row and one column larger than the string lengths
		array = new int[max2 + 1][max1 + 1];
		for (int i = 0; i <= max1; i++) {
			array[0][i] = i;
		}
		for (int j = 0; j <= max2; j++) {
			array[j][0] = j;
		}

		for (int i = 1; i <= max1; i++) {
			for (int j = 1; j <= max2; j++) {
				array[j][i] = levenshtein(i, j, str1.charAt(i - 1), str2.charAt(j - 1));
			}
		}
		return array[max2][max1];
	}

	public int levenshtein(int i, int j, char si, char sj) {
		int result = 0;

		if (i >= 1 && j >= 1) {
			int a = array[j - 1][i] + 1;
			int b = array[j][i - 1] + 1;
			// A replacement costs 1 (temp in the algorithm process), not 2
			int c = array[j - 1][i - 1] + ((si != sj) ? 1 : 0);
			result = min(a, b, c);
		}
		return result;
	}

	public int min(int a, int b, int c) {
		int temp = a < b ? a : b;
		return temp < c ? temp : c;
	}

	// Compute the similarity; edit() must be called first so that the matrix is filled
	public float similarity() {
		float similarity = 1 - (float) array[str2.length()][str1.length()] / Math.max(str1.length(), str2.length());
		return similarity;
	}

	public static void main(String args[]) {
		String str1 = "首选的确诊方法是()";
		String str2 = "首选确诊方法是";
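		// Tools.cleanTitle is a helper from the author's own tools package and is not shown in this article
		EditDistance lt = new EditDistance(Tools.cleanTitle(str1), Tools.cleanTitle(str2));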
		System.out.println(lt.edit());
		System.out.println(lt.similarity());
	}
}

Test result:

3
0.7

4. Application

  • DNA analysis
  • spell check
  • speech recognition
  • plagiarism detection
  • search engine

Origin blog.csdn.net/u012998680/article/details/113404323