[Repost] Boyer-Moore string matching algorithm

Author: Ruan Yifeng

Date: May 3, 2013

In the previous article, I introduced the KMP algorithm.

However, it is not the most efficient algorithm and is not used much in practice. The "Find" function (Ctrl+F) of various text editors uses the Boyer-Moore algorithm.

The Boyer-Moore algorithm is not only efficient but also ingenious and easy to understand. It was invented in 1977 by Robert S. Boyer and J Strother Moore, professors at the University of Texas.

Below, I explain the algorithm using Professor Moore's own example.

1.

Suppose the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

2.

First, align the head of the search term with the head of the string, and start comparing from the tail.

This is a clever idea: if the tail characters do not match, a single comparison is enough to know that the first seven characters (as a whole) cannot be the result we are looking for.

We see that "S" and "E" do not match. In this case, "S" is called the "bad character", that is, the character that does not match. We also find that "S" does not occur in the search term "EXAMPLE", which means the search term can be moved directly to the position just after "S".
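The tail-first comparison can be sketched roughly as follows. The function name `first_mismatch_from_tail` and its `offset` parameter are made up for this illustration; they are not part of the original article.

```python
def first_mismatch_from_tail(text, pattern, offset):
    """Compare pattern with text[offset:offset + len(pattern)] from right to left.

    Returns the index inside the pattern of the first mismatch found,
    or -1 if every character matches. Assumes the window fits in text.
    """
    for j in range(len(pattern) - 1, -1, -1):
        if text[offset + j] != pattern[j]:
            return j
    return -1


text = "HERE IS A SIMPLE EXAMPLE"
# With the search term at the head of the string, the very first
# comparison ("S" against "E") fails at pattern index 6.
print(first_mismatch_from_tail(text, "EXAMPLE", 0))  # 6
```

The character of the string at that index (here `text[0 + 6]`, which is "S") is the "bad character".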

3.

Comparing from the tail again, we find that "P" and "E" do not match, so "P" is the "bad character". This time, however, "P" does occur in the search term "EXAMPLE". So the search term moves back by two positions, so that the two "P"s line up.

4.

From this we can summarize the "bad character rule":

  Shift = position of the bad character - position of its last occurrence in the search term

Here the "position of the bad character" is the index in the search term that is lined up with the bad character. If the "bad character" does not occur in the search term, its last-occurrence position is -1.

Take "P" as an example. As the "bad character", it sits at position 6 of the search term (numbering from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2. Now take the "S" from step 2: it sits at position 6, and its last-occurrence position is -1 (it does not occur), so the whole search term shifts 6 - (-1) = 7.
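As a minimal sketch, the rule can be written as a one-line subtraction. The helper name `bad_character_shift` is an assumption made for this example; `str.rfind` is used because it conveniently returns -1 when the character is absent.

```python
def bad_character_shift(pattern, mismatch_index, bad_char):
    """Shift = position of the bad character - its last occurrence in the pattern.

    mismatch_index is the pattern index lined up with the bad character.
    """
    last_occurrence = pattern.rfind(bad_char)  # -1 when bad_char is absent
    # Note: this can be zero or negative when bad_char also occurs to the
    # right of the mismatch, so this rule alone is not always usable.
    return mismatch_index - last_occurrence


print(bad_character_shift("EXAMPLE", 6, "P"))  # 6 - 4 = 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 6 - (-1) = 7
```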

5.

Comparing from the tail again, "E" matches "E".

6.

Compare the preceding character: "LE" matches "LE".

7.

Compare the preceding character: "PLE" matches "PLE".

8.

Compare the preceding character: "MPLE" matches "MPLE". We call this a "good suffix", that is, any suffix of the search term that matches at the tail. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.

9.

Compare the preceding character: "I" and "A" do not match. So "I" is the "bad character".

10.

According to the "bad character rule", the search term should shift 2 - (-1) = 3. The question is: is there a better shift?

11.

We know that there is a "good suffix" at this point. So we can apply the "good suffix rule":

  Shift = position of the good suffix - position of its previous occurrence in the search term

For example, suppose the second "AB" of the string "ABCDAB" is the "good suffix". Its position is 5 (counting from 0, taking the position of the final "B"), and its previous occurrence in the search term is at position 1 (the position of the first "B"), so the shift is 5 - 1 = 4, and the first "AB" moves to where the second "AB" was.

As another example, suppose "EF" of the string "ABCDEF" is the "good suffix". The position of "EF" is 5 and its previous-occurrence position is -1 (it does not occur again), so the shift is 5 - (-1) = 6, that is, the whole string moves to the position just after "F".

There are three things to note about this rule:

  (1) The position of the "good suffix" is that of its last character. If "EF" of "ABCDEF" is the good suffix, its position is that of "F", i.e. 5 (counting from 0).

  (2) If the "good suffix" occurs only once in the search term, its previous-occurrence position is -1. For example, "EF" occurs only once in "ABCDEF", so its previous-occurrence position is -1 (it does not occur again).

  (3) If there are several "good suffixes", then except for the longest one, the previous occurrence of any other "good suffix" must be at the head of the search term. For example, suppose the "good suffixes" of "BABCDAB" are "DAB", "AB" and "B". What is the previous-occurrence position in this case? The answer is that the good suffix used is "B", whose previous occurrence is at the head, i.e. position 0. The rule can also be put this way: if the longest "good suffix" occurs only once, the search term can be rewritten as "(DA)BABCDAB" for the position calculation, that is, with a virtual "DA" prepended.

Back to the example above. Among all the "good suffixes" (MPLE, PLE, LE, E), only "E" also appears at the head of "EXAMPLE", so the shift is 6 - 0 = 6.
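The rule and its three notes can be sketched as below. This follows the article's position arithmetic literally rather than any optimized textbook construction. The function name `good_suffix_shift` and the parameter `suffix_len` (the length of the longest good suffix, assumed to be at least 1) are made up for this illustration.

```python
def good_suffix_shift(pattern, suffix_len):
    """Shift = position of the good suffix - position of its previous occurrence.

    suffix_len is the number of tail characters that matched (>= 1).
    """
    m = len(pattern)
    position = m - 1  # note (1): a suffix is located by its last character
    suffix = pattern[m - suffix_len:]

    # The longest good suffix may reoccur anywhere before its final occurrence.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        return position - (start + suffix_len - 1)

    # Note (3): shorter good suffixes only count if they sit at the head.
    for k in range(suffix_len - 1, 0, -1):
        if pattern[:k] == pattern[m - k:]:
            return position - (k - 1)

    # Note (2): no previous occurrence at all, so its position is -1.
    return position - (-1)


print(good_suffix_shift("ABCDAB", 2))   # "AB":   5 - 1 = 4
print(good_suffix_shift("ABCDEF", 2))   # "EF":   5 - (-1) = 6
print(good_suffix_shift("BABCDAB", 3))  # "DAB":  falls back to "B", 6 - 0 = 6
print(good_suffix_shift("EXAMPLE", 4))  # "MPLE": falls back to "E", 6 - 0 = 6
```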

12.

As you can see, the "bad character rule" only moves the search term 3 positions, while the "good suffix rule" moves it 6. So the basic idea of the Boyer-Moore algorithm is that each shift is the larger of the two rules' values.

More subtly, the shift values of both rules depend only on the search term, not on the original string. It is therefore possible to precompute a "bad character rule table" and a "good suffix rule table", and at search time simply look the values up and compare them.

13.

Comparison resumes from the tail. "P" and "E" do not match, so "P" is the "bad character". By the "bad character rule", the shift is 6 - 4 = 2.

14.

Comparing character by character from the tail, we find that everything matches, so the search ends. To keep looking (that is, to find all matches), the "good suffix rule" gives a shift of 6 - 0 = 6, which moves the "E" at the head to the position of the "E" at the tail.
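Putting the whole walkthrough together, here is a rough sketch of the search with both tables precomputed from the search term alone. It is an illustration, not a tuned implementation. The names `build_tables` and `boyer_moore_search` are assumptions, and the good suffix table is filled by brute force (try each shift and check that the shifted search term agrees with the matched suffix), which is simple but slower than the usual linear-time preprocessing.

```python
def build_tables(pattern):
    """Precompute both rule tables from the search term only."""
    m = len(pattern)
    # Bad character table: last index of each character in the pattern.
    last_seen = {ch: i for i, ch in enumerate(pattern)}
    # Good suffix table: good_shift[k] is the shift to use after k tail
    # characters matched; good_shift[0] stays 0 (no good suffix).
    good_shift = [0] * (m + 1)
    for k in range(1, m + 1):
        for s in range(1, m + 1):
            # Shifting by s must keep the k matched characters consistent
            # wherever the shifted pattern still overlaps them.
            if all(j < s or pattern[j - s] == pattern[j] for j in range(m - k, m)):
                good_shift[k] = s
                break
    return last_seen, good_shift


def boyer_moore_search(text, pattern):
    """Return (offsets of all matches, every alignment that was tried)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], []
    last_seen, good_shift = build_tables(pattern)
    matches, alignments = [], []
    offset = 0
    while offset <= n - m:
        alignments.append(offset)
        j = m - 1
        while j >= 0 and text[offset + j] == pattern[j]:
            j -= 1  # compare from the tail
        if j < 0:
            matches.append(offset)
            offset += good_shift[m]  # keep looking for further matches
        else:
            bad = j - last_seen.get(text[offset + j], -1)
            offset += max(bad, good_shift[m - 1 - j])  # larger of the two rules
    return matches, alignments


matches, alignments = boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")
print(matches)     # [17]
print(alignments)  # [0, 7, 9, 15, 17]
```

The alignments correspond to the shifts of 7, 2, 6 and 2 worked out in the steps above.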

(End)

Document information

Reposted from: https://www.cnblogs.com/ericsun/p/3334135.html

Origin blog.csdn.net/weixin_33829657/article/details/93154977