Metrics in Machine Learning - String Distance

      Machine learning is one of the most important directions in today's popular AI technology. Both supervised and unsupervised learning rely on a variety of "metrics" to quantify how different or how similar sample data are. A good metric can significantly improve the accuracy of a classification or prediction algorithm. This series describes the metrics used in machine learning, which mainly comprise distances, similarities, and correlation coefficients. Distances are generally studied for points in a linear space, similarities for vectors in a linear space, and correlation coefficients mainly for distributed data. This article covers string distances.

1 Hamming distance - positional differences between equal-length strings

      In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. In other words, it is the number of characters that must be replaced to turn one string into the other.

      The Hamming weight of a string is its Hamming distance from the all-zero string of the same length, i.e., the number of nonzero elements in the string. For a binary string it is the number of 1s, so the Hamming weight of 11101 is 4. Some examples of the Hamming distance:

      The Hamming distance between 1011101 and 1001001 is 2.
      The Hamming distance between 2143896 and 2233796 is 3.
      The Hamming distance between "karolin" and "kathrin" is 3.
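
      The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names hamming_distance and hamming_weight, and the zero parameter, are chosen here and do not come from the original article.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_weight(s: str, zero: str = "0") -> int:
    """Count the symbols in s that differ from the zero symbol."""
    return sum(ch != zero for ch in s)


print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2143896", "2233796"))  # 3
print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_weight("11101"))                 # 4
```

      Raising an error on unequal lengths is a design choice of this sketch; it reflects the fact that the Hamming distance is only defined for strings of the same length.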

      Figure 1 shows the geometric meaning of the Hamming distance: the minimum distance between any two vertices in the figure is the Hamming distance between the two corresponding binary strings.


Figure 1 Geometric meaning of the Hamming distance
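
      The geometric reading can be checked numerically. The sketch below assumes the figure depicts a hypercube whose vertices are binary strings, with an edge between two vertices that differ in exactly one bit; the function name hypercube_distance is made up for this illustration. It measures the shortest path by breadth-first search and compares it with the number of differing positions.

```python
from collections import deque
from itertools import product


def hypercube_distance(start: str, goal: str) -> int:
    """Shortest path length between two vertices of the n-bit hypercube."""
    if len(start) != len(goal):
        raise ValueError("vertices must have the same number of bits")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for i, bit in enumerate(node):
            # Flip bit i to move along one edge of the hypercube.
            flipped = "1" if bit == "0" else "0"
            neighbour = node[:i] + flipped + node[i + 1:]
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    raise ValueError("goal is not reachable from start")


# Compare the path length with the Hamming distance for every pair
# of vertices of the 3-bit cube.
vertices = ["".join(bits) for bits in product("01", repeat=3)]
all_match = all(
    hypercube_distance(u, v) == sum(x != y for x, y in zip(u, v))
    for u in vertices
    for v in vertices
)
print(all_match)                         # True
print(hypercube_distance("010", "111"))  # 2
```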

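      The Hamming distance is named after Richard Wesley Hamming, who first introduced the concept in his foundational paper on error-detecting and error-correcting codes. In communications it counts the flipped (erroneous) bits in fixed-length binary words, so it is also called the signal distance. Hamming weight analysis is used in fields including information theory, coding theory, and cryptography. However, comparing two strings of different lengths requires insertions and deletions as well as substitutions; in that setting, more sophisticated algorithms such as the edit distance are usually used. The next section introduces this other commonly used string distance, the edit distance.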

2 Edit distance - the distance from one string to another

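      The edit distance quantifies how different two strings (for example, English words) are. It is measured as the minimum number of operations needed to turn one string into the other, where the allowed operations are inserting a character, deleting a character, and substituting one character for another. Edit distance is used in natural language processing; for example, a spell checker can use the edit distance between a misspelled word and the correct words in a dictionary to decide which word (or words) was most likely intended. Here is an example of an edit distance.
The edit distance between "kitten" and "sitting" is 3, because: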

      Step 1: kitten → sitten (substitute one character: "s" replaces "k")

      Step 2: sitten → sittin (substitute one character: "i" replaces "e")

      Step 3: sittin → sitting (insert one character: "g" is inserted at the end of the string).

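      So how is the edit distance computed? Let the function dist(A, B) denote the edit distance from string A to string B. For the following three boundary cases the answer is immediate (NULL denotes the empty string).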

      dist(NULL, NULL) = 0
      dist(NULL, s) = length of s
      dist(s, NULL) = length of s

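      How do we solve dist(A, B) in the general case? Suppose we are computing dist(A+c1, B+c2), where A and B are strings and c1 and c2 are single characters. dist(A+c1, B+c2) is the cost of turning "A+c1" into "B+c2". This transformation splits into three cases: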

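(1) A is turned into B. Then we only need to change c1 into c2 (if c1 != c2). The cost is dist(A, B), plus 1 if c1 != c2.

(2) A+c1 is turned into B. Then we finish by inserting c2. The cost is dist(A+c1, B) + 1.

(3) A is turned into B+c2. Then we need to delete c1. The cost is dist(A, B+c2) + 1.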
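
      The three cases, together with the boundary cases for the empty string, translate almost directly into a recursive function. The following is a minimal sketch rather than a reference implementation: the inner helper go and the use of functools.lru_cache to avoid recomputing subproblems are choices made for this illustration.

```python
from functools import lru_cache


def dist(a: str, b: str) -> int:
    """Edit distance by recursion on the last characters of a and b."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        # go(i, j) is the distance between a[:i] and b[:j].
        if i == 0:
            return j  # dist(NULL, s) = length of s
        if j == 0:
            return i  # dist(s, NULL) = length of s
        c1, c2 = a[i - 1], b[j - 1]
        return min(
            go(i - 1, j - 1) + (c1 != c2),  # case (1): change c1 into c2 if they differ
            go(i, j - 1) + 1,               # case (2): insert c2
            go(i - 1, j) + 1,               # case (3): delete c1
        )

    return go(len(a), len(b))


print(dist("kitten", "sitting"))  # 3
```

      Because the recursion depth grows with the string lengths, this form is only suitable for short inputs; for long strings an iterative table is the safer choice.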

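      Combining the three cases, dist(A+c1, B+c2) is the minimum of the three costs. We can therefore define a function edit(i, j) that gives the edit distance from the length-i prefix of the first string to the length-j prefix of the second string. This leads to the following dynamic programming recurrence: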

      if i = 0 and j = 0, edit(i, j) = 0
      if i = 0 and j > 0, edit(i, j) = j
      if i > 0 and j = 0, edit(i, j) = i
      if i ≥ 1 and j ≥ 1: when the i-th character of A equals the j-th character of B, edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1)}; otherwise edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + 1}.
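
      The recurrence can be filled in row by row as a table. The sketch below is one straightforward way to do it; the function name edit_table is made up for this illustration, and table[i][j] plays the role of edit(i, j).

```python
def edit_table(a: str, b: str) -> list:
    """Build the (len(a)+1) x (len(b)+1) table of edit(i, j) values."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i  # edit(i, 0) = i
    for j in range(1, n + 1):
        table[0][j] = j  # edit(0, j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Substitution is free when the i-th and j-th characters match.
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # delete a[i-1]
                table[i][j - 1] + 1,         # insert b[j-1]
                table[i - 1][j - 1] + cost,  # keep or substitute
            )
    return table


table = edit_table("kitten", "sitting")
for row in table:
    print(row)
print(table[-1][-1])  # 3
```

      The last cell of the table is the edit distance of the full strings. Keeping the whole table costs memory proportional to the product of the two lengths; if only the distance is needed, two rows are enough.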

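      Applying the recurrence to "kitten" and "sitting" gives the edit distance matrix in Figure 2; the value in the lower-right corner of the matrix is the edit distance between the two strings.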


Figure 2 Edit distance matrix for kitten and sitting

Origin www.cnblogs.com/Kalafinaian/p/10992023.html