[Repost] String Similarity Algorithm (Edit Distance / Levenshtein Distance)

While working on CAPTCHA (verification code) recognition, I needed to compare the similarity of character codes and used the "edit distance algorithm". This post records its principle and a C# implementation.

According to Baidu Encyclopedia:

Edit distance, also known as Levenshtein distance, is the minimum number of editing operations required to transform one of two strings into the other. The larger the distance, the more the two strings differ. Permitted editing operations are replacing one character with another, inserting a character, and deleting a character.

  For example, turning the word kitten into sitting takes three edits:

  sitten (k→s, substitution)

  sittin (e→i, substitution)

  sitting (→g, insertion at the end)

  Russian scientist Vladimir Levenshtein proposed this concept in 1965. Therefore, it is also called Levenshtein Distance.

For example:

If str1="ivan" and str2="ivan", the computed distance is 0, since no edits are needed. Similarity = 1 - 0/Math.Max(str1.length, str2.length) = 1
If str1="ivan1" and str2="ivan2", the computed distance is 1: the "1" in str1 is replaced by "2", a single-character edit. Similarity = 1 - 1/Math.Max(str1.length, str2.length) = 0.8

Applications
  DNA analysis

  Spell check

  Speech recognition

  Plagiarism Detection

Thanks to "Big Stone" in the comments for a good link on an application of this method, added here:

Small-scale approximate string search: the requirement is similar to typing keywords into a search engine and getting back a list of similar results. Article link: [Algorithm] String approximate search

The algorithm process

1. If the length of str1 or str2 is 0, return the length of the other string: if(str1.length==0) return str2.length; if(str2.length==0) return str1.length;
2. Initialize a matrix d of size (n+1)*(m+1), where n and m are the lengths of str1 and str2, and let the values in the first row and the first column count up from 0.
3. Scan the two strings (n*m steps). For each pair of positions, if the i-th character of str1 equals the j-th character of str2 (str1[i-1] == str2[j-1] with 0-based indexing), set temp to 0; otherwise set temp to 1. Then assign to d[i,j] the minimum of d[i-1,j]+1, d[i,j-1]+1 and d[i-1,j-1]+temp.
4. After scanning, return the last value of the matrix, d[n,m]; this is the distance between the two strings.
5. Compute the similarity as: 1 - distance / (the larger of the two string lengths).
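
The steps above can be written compactly as a recurrence. This is only a restatement of the same procedure, assuming d(i,j) denotes the distance between the first i characters of str1 and the first j characters of str2, with n and m the two string lengths:

```latex
\[
d_{i,0} = i \quad (0 \le i \le n), \qquad d_{0,j} = j \quad (0 \le j \le m)
\]
\[
d_{i,j} = \min\bigl(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + t_{i,j}\bigr),
\qquad
t_{i,j} =
\begin{cases}
0 & \text{if } \mathrm{str1}[i-1] = \mathrm{str2}[j-1] \\
1 & \text{otherwise}
\end{cases}
\]
\[
\text{similarity} = 1 - \frac{d_{n,m}}{\max(n, m)}
\]
```

The three terms inside the minimum correspond to deleting a character, inserting a character, and replacing (or keeping) a character.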


To make this easier to follow, I write the two strings along the top row and the left column of the matrix; this is not needed in the actual calculation. Taking the strings "ivan1" and "ivan2" as an example, let's look at the values in the matrix at each stage:

1. The values of the first row and first column count up from 0

      i  v  a  n  1
   0  1  2  3  4  5
i  1
v  2
a  3
n  4
2  5

2. Generating the values of the i column. Each cell takes the minimum of Matrix[i - 1, j] + 1, Matrix[i, j - 1] + 1 and Matrix[i - 1, j - 1] + t

          i                              v  a  n  1
   0+t=0  1+1=2                          2  3  4  5
i  1+1=2  minimum of the three = 0
v  2      and so on: 1
a  3      2
n  4      3
2  5      4


3. Generating the values of the v column

      i  v  a  n  1
   0  1  2  3  4  5
i  1  0  1
v  2  1  0
a  3  2  1
n  4  3  2
2  5  4  3


And so on, until the whole matrix has been filled in:

      i  v  a  n  1
   0  1  2  3  4  5
i  1  0  1  2  3  4
v  2  1  0  1  2  3
a  3  2  1  0  1  2
n  4  3  2  1  0  1
2  5  4  3  2  1  1


The bottom-right value gives the distance: 1

Similarity: 1 - 1/Math.Max("ivan1".length, "ivan2".length) = 0.8

using System;

public class LevenshteinDistance
    {
        /// <summary>
        /// Returns the smallest of three integers.
        /// </summary>
        private int LowerOfThree(int first, int second, int third)
        {
            int min = Math.Min(first, second);
            return Math.Min(min, third);
        }

        private int Levenshtein_Distance(string str1, string str2)
        {
            int[,] Matrix;
            int n = str1.Length;
            int m = str2.Length;

            int temp = 0;
            char ch1;
            char ch2;
            int i = 0;
            int j = 0;
            if (n == 0)
            {
                return m;
            }
            if (m == 0)
            {

                return n;
            }
            Matrix = new int[n + 1, m + 1];

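            // Initialize the first column: deleting the first i characters of str1 costs i
            for (i = 0; i <= n; i++)
            {
                Matrix[i, 0] = i;
            }

            // Initialize the first row: inserting the first j characters of str2 costs j
            for (j = 0; j <= m; j++)
            {
                Matrix[0, j] = j;
            }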

            for (i = 1; i <= n; i++)
            {
                ch1 = str1[i - 1];
                for (j = 1; j <= m; j++)
                {
                    ch2 = str2[j - 1];
                    // Substitution cost: 0 if the characters match, 1 otherwise
                    temp = (ch1 == ch2) ? 0 : 1;
                    // Minimum of deletion, insertion and substitution
                    Matrix[i, j] = LowerOfThree(Matrix[i - 1, j] + 1, Matrix[i, j - 1] + 1, Matrix[i - 1, j - 1] + temp);
                }
            }

            // Print the matrix for inspection (debug output only; not needed for the result)
            for (i = 0; i <= n; i++)
            {
                for (j = 0; j <= m; j++)
                {
                    Console.Write(" {0} ", Matrix[i, j]);
                }
                Console.WriteLine("");
            }

            return Matrix[n, m];
        }

        /// <summary>
        /// Calculates string similarity as a value between 0 and 1.
        /// </summary>
        /// <param name="str1">First string</param>
        /// <param name="str2">Second string</param>
        /// <returns>1 - distance / length of the longer string</returns>
        public decimal LevenshteinDistancePercent(string str1, string str2)
        {
            int maxLength = Math.Max(str1.Length, str2.Length);
            if (maxLength == 0)
            {
                // Two empty strings are identical; this also avoids dividing by zero
                return 1;
            }
            int val = Levenshtein_Distance(str1, str2);
            return 1 - (decimal)val / maxLength;
        }
    }


Usage:
static void Main(string[] args)
        {
            string str1 = "ivan1";
            string str2 = "ivan2";
            Console.WriteLine("String 1 {0}", str1);

            Console.WriteLine("String 2 {0}", str2);

            Console.WriteLine("Similarity {0} %", new LevenshteinDistance().LevenshteinDistancePercent(str1, str2) * 100);
            Console.ReadLine();
        }

Results

[Screenshot of the console output in the original post]
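
As the walkthrough of the matrix shows, each row depends only on the row directly above it. When only the distance is needed and the full matrix does not have to be printed, two rows are therefore enough. The following is a minimal sketch of that variant, not the original author's code: the class name TwoRowLevenshtein and its methods Distance and Similarity are hypothetical names chosen for illustration, and returning a similarity of 1 for two empty strings is an assumption.

```csharp
using System;

// Hypothetical helper (not part of the original post): same recurrence, two rows of storage.
public static class TwoRowLevenshtein
{
    public static int Distance(string str1, string str2)
    {
        int n = str1.Length;
        int m = str2.Length;
        if (n == 0) return m;
        if (m == 0) return n;

        int[] previous = new int[m + 1]; // row i - 1 of the matrix
        int[] current = new int[m + 1];  // row i of the matrix

        // Row 0: the first row of the matrix counts up from 0.
        for (int j = 0; j <= m; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= n; i++)
        {
            current[0] = i; // first column counts up from 0
            for (int j = 1; j <= m; j++)
            {
                int temp = str1[i - 1] == str2[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + temp);
            }

            // The row just filled becomes the "previous" row for the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, the last computed row is in "previous".
        return previous[m];
    }

    public static decimal Similarity(string str1, string str2)
    {
        int maxLength = Math.Max(str1.Length, str2.Length);
        if (maxLength == 0) return 1m; // assumption: two empty strings count as identical
        return 1 - (decimal)Distance(str1, str2) / maxLength;
    }
}
```

With this sketch, TwoRowLevenshtein.Distance("kitten", "sitting") evaluates to 3 and TwoRowLevenshtein.Similarity("ivan1", "ivan2") to 0.8, matching the worked examples earlier in the post. Memory use drops from (n+1)*(m+1) integers to 2*(m+1), at the cost of no longer being able to print the whole matrix.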




Reprinted from: http://www.cnblogs.com/ivanyb/archive/2011/11/25/2263356.html
Huaxiamian Studio: http://huaxiamian.cc
