Spelling Correction in Search Engines

More: my github
First, let's introduce the concept of edit distance. Anyone who has practiced on LeetCode should recognize this classic dynamic programming problem:

Given two words word1 and word2, compute the minimum number of operations required to convert word1 into word2.

You can perform any of the following three operations on a word:

Insert a character
Delete a character
Replace a character
Example 1:

Input: word1 = "horse", word2 = "ros"
Output: 3
Explanation:
horse -> rorse (replace 'h' with 'r')
rorse -> rose (delete 'r')
rose -> ros (delete 'e')

Example 2:

Input: word1 = "intention", word2 = "execution"
Output: 5
Explanation:
intention -> inention (delete 't')
inention -> enention (replace 'i' with 'e')
enention -> exention (replace 'n' with 'x')
exention -> exection (replace 'n' with 'c')
exection -> execution (insert 'u')
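The problem above is solved with the standard O(m·n) dynamic-programming table; a minimal sketch in Python:

```python
def edit_distance(word1: str, word2: str) -> int:
    """dp[i][j] = minimum ops to convert word1[:i] into word2[:j]."""
    m, n = len(word1), len(word2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of word1
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of word2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # characters match, no op
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete
                                   dp[i][j - 1],      # insert
                                   dp[i - 1][j - 1])  # replace
    return dp[m][n]
```

For the two examples, `edit_distance("horse", "ros")` returns 3 and `edit_distance("intention", "execution")` returns 5.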

From this, if the probability of a word entered by the user is small, we can assume it may be misspelled, and look for the dictionary word with the smallest edit distance to use as the replacement.

Of course, this can have a large time complexity: we would have to traverse the entire dictionary and compute the edit distance for every word, a cost we usually cannot afford. One idea is to partition the dictionary and search it segment by segment; this is a feasible approach, but still somewhat unsatisfactory.

Another method is to generate every string at edit distance 1 or 2 from the input and then filter them.

Generating the strings is fairly simple; the problem is how to filter them.
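Candidate generation can be sketched as follows: enumerate all deletes, replaces, and inserts for distance 1, and apply the function twice for distance 2. The toy dictionary used for filtering is an assumption for illustration:

```python
import string

def edits1(word: str) -> set:
    """All strings at edit distance 1 (deletes, replaces, inserts)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + replaces + inserts) - {word}

def edits2(word: str) -> set:
    """All strings at edit distance up to 2 (distance-1 applied twice)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

# One simple filter: keep only candidates that are real dictionary words.
dictionary = {"apple", "apply"}          # toy dictionary (assumption)
candidates = edits1("appl") & dictionary # -> {'apple', 'apply'}
```

Even for short words, `edits1` produces hundreds of strings and `edits2` tens of thousands, which is why the filtering step matters.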

Leaving the standard solutions aside, I think we can filter according to rules. For example, on a 24-key keyboard it is easy to mistakenly press a nearby letter, such as typing 'p' when you meant 'o'. We can assign a substitution probability to each letter, such as {o: {p: 0.25, i: 0.25, l: 0.25, k: 0.15, m: 0.05}}, and then select the candidate dictionary words with the highest probability of occurrence to get the final result.

These are just my personal thoughts; in practice, Bayes' formula is used.

In plain terms, Bayes' formula derives the posterior probability from the prior probability. Let's take apple as an example.

Correct word: apple

User 1: app

User 2: appl

User 3: appl

User 4: appla

User 6: appl

From the search logs we can count how many users misspelled apple, and in which ways.

Suppose the user enters the string s and the correct string is c. Then:

p(c|s) ∝ p(s|c) · p(c)

p(s|c) can be understood as: given the correct string c, the fraction of users who typed s.

Plugging in the numbers from the log above:

p(s|c) = p(appl|apple) = 3/6 = 0.5

p(s|c) = p(appla|apple) = 1/6 ≈ 0.16

p(c) is a unigram probability and can be obtained directly from the dictionary.
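A minimal noisy-channel scorer based on p(c|s) ∝ p(s|c) · p(c); the log counts mirror the apple example above, and the unigram prior value is an illustrative assumption:

```python
from collections import Counter

# Toy search-log counts: for each correct word c, how users actually typed it
# (mirrors the apple example; counts are illustrative).
logs = {"apple": Counter({"appl": 3, "app": 1, "appla": 1, "apple": 1})}

# Unigram prior p(c), e.g. the word's frequency in the dictionary
# (value is an assumption for illustration).
unigram = {"apple": 0.001}

def p_s_given_c(s: str, c: str) -> float:
    """p(s|c): fraction of users who typed s when they meant c."""
    counts = logs[c]
    return counts[s] / sum(counts.values())

def score(s: str, c: str) -> float:
    """Noisy-channel score: p(c|s) is proportional to p(s|c) * p(c)."""
    return p_s_given_c(s, c) * unigram[c]
```

Here `p_s_given_c("appl", "apple")` gives 0.5 and `p_s_given_c("appla", "apple")` gives about 0.16, matching the figures above; to correct an input s, one would rank all candidate words c by `score(s, c)`.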


Origin blog.csdn.net/weixin_40631132/article/details/104741313