The Levenshtein Distance (LD) Algorithm Used in Lucene's FuzzyQuery

Category: Programming Languages. Published 2018-05-14 04:33:51

Topic: Levenshtein Distance (LD)

Background: Levenshtein distance was devised in 1965 by the Russian scientist Vladimir Levenshtein and is named after him. If you cannot spell or pronounce "Levenshtein", you can usually call it edit distance.

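Purpose: the algorithm measures the distance between two strings, also described as their fuzziness; as I understand it, this is their degree of difference. The distance is the minimum number of single-character edits needed to turn one string into the other, where an edit is 1) inserting a character (Insert), 2) deleting a character (Delete), or 3) changing a character (Substitute).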

Algorithm description:

Step	Description
1	Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing rows 0..n and columns 0..m.
2	Initialize the first row to 0..m. Initialize the first column to 0..n.
3	Examine each character of s (i from 1 to n).
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the cost is 0. If s[i] doesn’t equal t[j], the cost is 1.
6	Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7	After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

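1. Get the length n of the source string s and the length m of the target string t. If either length is 0, return the other length.

2. Initialize an (n+1)*(m+1) matrix d, with the first row filled with 0..m and the first column filled with 0..n.

3. Visit every pair of characters s[i], t[j] (i and j start at 1). If s[i] equals t[j], cost is 0; otherwise cost is 1. Set d[i,j] to the minimum of d[i-1,j] + 1 (the value above plus 1), d[i,j-1] + 1 (the value to the left plus 1), and d[i-1,j-1] + cost (the value diagonally above and to the left plus cost).

4. Once step 3 has visited every cell, the bottom-right value d[n,m] is the distance between the two strings.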
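The four steps above can be sketched in Java roughly as follows. This is a minimal illustration, not Lucene's own code: the class name LevenshteinSketch and the method name distance are made up for this example, and characters are compared as UTF-16 code units via charAt.

```java
// Minimal sketch of the matrix algorithm described in the steps above.
// LevenshteinSketch and distance are illustrative names, not Lucene APIs.
final class LevenshteinSketch {

    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();

        // Step 1: if either string is empty, the distance is the other length.
        if (n == 0) return m;
        if (m == 0) return n;

        // Step 2: (n+1) x (m+1) matrix; first column is 0..n, first row is 0..m.
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;

        // Step 3: fill the remaining cells row by row.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Java strings are 0-indexed, so s[i] in the text is s.charAt(i - 1).
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                int above = d[i - 1][j] + 1;           // delete s[i]
                int left = d[i][j - 1] + 1;            // insert t[j]
                int diagonal = d[i - 1][j - 1] + cost; // substitute, or keep if equal
                d[i][j] = Math.min(Math.min(above, left), diagonal);
            }
        }

        // Step 4: the bottom-right cell holds the distance.
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("word", "world"));     // prints 1
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The full matrix costs O(n*m) time and memory. Since each row depends only on the row above it, keeping two rows would be enough when only the final distance is needed.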

Demonstration: comparing the source "word" with the target "world".
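To follow the comparison cell by cell, one option is to build the whole matrix and print it. The sketch below does that for s = "word" (rows) and t = "world" (columns); the class name LevenshteinMatrixDemo and the method name matrix are hypothetical names chosen for this example.

```java
// Sketch: build and print the full matrix for source "word" and target "world".
// LevenshteinMatrixDemo and matrix are illustrative names only.
final class LevenshteinMatrixDemo {

    // Returns the complete (n+1) x (m+1) matrix so intermediate cells can be inspected.
    static int[][] matrix(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String s = "word";
        String t = "world";
        int[][] d = matrix(s, t);
        for (int i = 0; i < d.length; i++) {
            // Label each row with the character of s it belongs to (blank for row 0).
            StringBuilder line =
                new StringBuilder(i == 0 ? " " : String.valueOf(s.charAt(i - 1)));
            for (int v : d[i]) line.append(' ').append(v);
            System.out.println(line);
        }
        System.out.println("distance = " + d[s.length()][t.length()]);
    }
}
```

For these inputs the matrix works out to the rows below, with columns corresponding to the empty prefix followed by w, o, r, l, d:

  0 1 2 3 4 5
w 1 0 1 2 3 4
o 2 1 0 1 2 3
r 3 2 1 0 1 2
d 4 3 2 1 1 1

The bottom-right cell is 1: inserting the single letter "l" turns "word" into "world".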

Application examples: according to page 134 of the book "Developing Your Own Search Engine: Lucene 2.0+Heritrix", Lucene's FuzzyQuery (fuzzy matching) applies this algorithm. It can also be used for spell checking, speech recognition, DNA analysis, and plagiarism detection.
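As a rough illustration of the spell-checking use case, the sketch below keeps every dictionary word whose edit distance to the query is at most a given threshold. It is not Lucene's FuzzyQuery implementation and uses no Lucene classes; SpellSuggestSketch, suggest, and the maxEdits parameter are names assumed for this example, and the word list is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of threshold-based fuzzy matching over a small in-memory word list.
// This is a hypothetical helper, not the way Lucene evaluates FuzzyQuery.
final class SpellSuggestSketch {

    // Plain matrix-based Levenshtein distance over UTF-16 code units.
    static int distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[n][m];
    }

    // Returns the dictionary words within maxEdits of the query, in dictionary order.
    static List<String> suggest(String query, List<String> dictionary, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String word : dictionary) {
            if (distance(query, word) <= maxEdits) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("world", "word", "wild", "lucene");
        // "wrld" is one insertion from "world" and one substitution from "wild",
        // but two substitutions from "word".
        System.out.println(suggest("wrld", dictionary, 1)); // prints [world, wild]
    }
}
```

Scanning every dictionary word costs one matrix per word, which is fine for a small list but would likely be too slow for a large index; a real search engine needs a smarter candidate-selection strategy.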

References:

http://www.merriampark.com/ld.htm

http://my.oschina.net/MrMichael/blog/339217

Reposted from m635674608.iteye.com/blog/2234648
