String Matching Algorithms: The BM Algorithm

Outline

In some extreme cases, the performance of the BF (brute-force) algorithm degrades severely.

The RK (Rabin-Karp) algorithm relies on a hash function, and designing one that copes with all kinds of characters is not simple.

 

BM algorithm

BM (Boyer-Moore) is a very efficient string matching algorithm; its performance is said to be about 3 to 4 times that of the well-known KMP algorithm.

However, the principle behind the BM algorithm is also quite complicated.

 

The idea behind the BM algorithm

Think of matching the pattern string against the main string as the pattern string sliding steadily backward along the main string.

When a character fails to match, the BF and RK algorithms slide the pattern string back by one position and start matching again from the first character of the pattern string.
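For reference, the one-position slide of BF can be sketched as follows. This is a minimal illustration rather than code from the original post; the function name `bf_search` and the sample strings are assumptions chosen for the example.

```python
def bf_search(main: str, pattern: str) -> int:
    """Return the index of the first match of pattern in main, or -1."""
    n, m = len(main), len(pattern)
    for start in range(n - m + 1):
        # Compare left to right, starting from the first pattern character.
        j = 0
        while j < m and main[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
        # On a mismatch the loop moves on: the pattern slides back by
        # exactly one position, no matter which character failed.
    return -1


print(bf_search("abcacabdc", "abd"))  # 5
```

Every alignment from 0 to 5 is tried here, including ones that a smarter rule could rule out in advance.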

However, suppose the mismatched character in the main string is "c", and "c" does not occur anywhere in the pattern string. As the pattern string slides backward, any position where it still overlaps this c cannot possibly match.

Therefore, we can slide the pattern string back several positions at once, moving it to just after c.

 

Sliding the pattern string back several positions at a time in this way improves matching efficiency.

Under what circumstances can the pattern string slide several positions at once, and how many positions? Is there a rule?

 

In essence, the BM algorithm is the search for exactly such a rule.

With this rule, when a character of the pattern string fails to match the main string during matching, we can skip over alignments that are certain not to match.

The pattern string then slides back several positions at once, which improves matching efficiency.

 

The core idea of the BM algorithm: exploit the characteristics of the pattern string itself so that, when a character in the pattern string fails to match the main string, the pattern string slides back several positions at once, reducing unnecessary character comparisons and improving matching efficiency.

 

BM algorithm principle

 

The BM algorithm consists of two parts: the bad character rule and the good suffix rule (good suffix shift).

 

Bad character rule

The BF and RK algorithms both compare the pattern string with the main string in ascending order of the pattern string's indices.

The BM algorithm matches in a rather unusual order: it goes in descending order of the pattern string's indices, that is, backwards from the end.

 

We match backwards from the end of the pattern string, and when we find a character that cannot be matched, we call that unmatched character the bad character.

The bad character refers to the character in the main string.
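The right-to-left comparison can be sketched like this. It is a minimal illustration, not the original author's code; `find_bad_char` is a hypothetical helper and the sample strings are assumptions.

```python
def find_bad_char(main: str, pattern: str, start: int) -> int:
    """Compare pattern with main[start:start+len(pattern)] from right to left.

    Returns the pattern index of the first mismatch (the position aligned
    with the bad character), or -1 if the whole pattern matches.
    """
    for j in range(len(pattern) - 1, -1, -1):
        if main[start + j] != pattern[j]:
            return j
    return -1


main, pattern = "abcacabdc", "abd"
j = find_bad_char(main, pattern, 0)
print(j, main[0 + j])  # 2 c
```

Note that the bad character reported is `main[start + j]`, the character from the main string, not the pattern character it was compared against.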

 

We look for the bad character c in the pattern string and find that it does not occur there; that is, no character in the pattern string can match c.

In this case we can slide the pattern string back three positions directly, to just after c, and then resume comparing from the last character of the pattern string.

 

 

Now the "d" at the end of the pattern string still fails to match the a in the main string. Can we simply slide the pattern string back three positions again?

No, because this time the bad character a does occur in the pattern string: the character at index 0 of the pattern string is a.

 

In this case we can slide the pattern string back two positions so that the two a's line up, and then start matching again from the last character of the pattern string.

 

These two cases differ in whether the bad character occurs in the pattern string. Is there a rule for exactly how far to slide?

 

When a mismatch occurs, we denote by si the index in the pattern string of the character aligned with the bad character.

If the bad character occurs in the pattern string, we denote the index of that occurrence in the pattern string by xi; if it does not occur, we set xi to -1.

The pattern string then moves backward by si - xi positions.

Both indices here are indices into the pattern string.
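The si - xi rule can be checked with a few lines of code. This is a minimal sketch; `bad_char_shift` is a hypothetical helper, and the pattern "abd" is an assumption chosen to be consistent with the two cases described above (c absent from the pattern, a at index 0).

```python
def bad_char_shift(pattern: str, si: int, bad_char: str) -> int:
    """Number of positions to slide according to the bad character rule."""
    # str.rfind returns -1 when the character is absent, which is
    # exactly the value the rule assigns to xi in that case.
    xi = pattern.rfind(bad_char)
    return si - xi


pattern = "abd"
print(bad_char_shift(pattern, 2, "c"))  # 3: c is absent, so xi = -1
print(bad_char_shift(pattern, 2, "a"))  # 2: a is at index 0, so xi = 0
```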

 

Besides these two cases there is another: the bad character occurs in the pattern string more than once.

If the bad character appears several times in the pattern string, we take the rightmost occurrence when computing xi.

This way the pattern string does not slide too far and skip over a position that could have matched.
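A small experiment shows why the rightmost occurrence is the safe choice. The strings below are hypothetical, constructed so that the bad character a occurs twice in the pattern; both helper names are assumptions for illustration.

```python
def shift_with_rightmost(pattern: str, si: int, bad_char: str) -> int:
    # Rule from the text: xi is the rightmost occurrence.
    return si - pattern.rfind(bad_char)


def shift_with_leftmost(pattern: str, si: int, bad_char: str) -> int:
    # Deliberately wrong variant, shown only for contrast.
    return si - pattern.find(bad_char)


main, pattern = "xxaabd", "aabd"
si = 3                 # pattern[3] = 'd' fails against main[3] = 'a'
bad_char = main[si]

safe = shift_with_rightmost(pattern, si, bad_char)   # 3 - 1 = 2
risky = shift_with_leftmost(pattern, si, bad_char)   # 3 - 0 = 3

print(main[safe:safe + len(pattern)] == pattern)    # True: match found
print(main[risky:risky + len(pattern)] == pattern)  # False: match skipped
```

The larger slide jumps straight over the only match in the main string.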

 

Good suffix rule

The idea of the good suffix rule is very similar to that of the bad character rule.

 

Suppose the pattern string has slid to a position where its last two characters already match the main string and the third character from the end mismatches.

We can compute how far to slide with the bad character rule, or we can use the good suffix rule.

 

We call the already-matched bc the good suffix, denoted {u}.

This situation also splits into two cases.

We search the pattern string for {u}: if we find another substring {u*} that matches {u}, we slide the pattern string until the substring {u*} is aligned with {u} in the main string.

 

If the pattern string contains no other substring equal to {u}, we slide the pattern string directly to just after {u} in the main string.

This is because none of the intermediate positions can match {u} in the main string.
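These two basic cases can be sketched as below. This is only the simple version described so far, without any further refinement; `basic_good_suffix_shift` is a hypothetical helper and the patterns are assumptions.

```python
def basic_good_suffix_shift(pattern: str, si: int) -> int:
    """Slide distance from the two basic good suffix cases.

    si is the pattern index of the mismatch; the good suffix {u} is
    pattern[si+1:]. Returns 0 if there is no good suffix at all.
    """
    m = len(pattern)
    u = pattern[si + 1:]
    if not u:
        return 0  # nothing matched yet, so the rule does not apply
    # Rightmost *other* occurrence {u*}: restrict the search to
    # pattern[:m-1] so the suffix itself is not found.
    pos = pattern.rfind(u, 0, m - 1)
    if pos != -1:
        return (si + 1) - pos  # align {u*} with {u} in the main string
    return m                   # no {u*}: slide to just after {u}


print(basic_good_suffix_shift("abcdbc", 3))  # 3: another "bc" at index 1
print(basic_good_suffix_shift("axybc", 2))   # 5: no other "bc"
```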

 

 

This is similar to how the bad character rule works, so could it also slide too far?

 

 

Here the good suffix is bc, and the pattern string contains no other substring {u*} matching it.

But if we move the pattern string entirely past the good suffix, we will miss a position where the pattern string and the main string can match.

 

 

If the pattern string contains no other substring matching the good suffix, then as we slide the pattern string backward step by step, any position where {u} in the main string still lies wholly within the pattern string certainly cannot be an exact match.

However, once the pattern string has slid far enough that its prefix only partially overlaps the tail of {u} in the main string, and the overlapping parts are equal, an exact match may exist.

The previous approach did not take this situation into account, and so it slid too far.
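The over-sliding can be reproduced with a small hypothetical example. The strings are assumptions constructed for illustration: the good suffix bc has no other occurrence in the pattern, but its last character c is also the first character of the pattern.

```python
def matches_at(main: str, pattern: str, start: int) -> bool:
    """True if pattern occurs in main at index start."""
    return main[start:start + len(pattern)] == pattern


main, pattern = "zxbcabc", "cabc"
m = len(pattern)

# Alignment at 0 compares "zxbc" with "cabc": "bc" matches, then
# 'x' fails against 'a'. The good suffix {u} is "bc".
slide_past_u = m         # 4: move the whole pattern past {u}
slide_to_overlap = m - 1  # 3: keep the prefix "c" over the tail of {u}

print(matches_at(main, pattern, slide_past_u))      # False
print(matches_at(main, pattern, slide_to_overlap))  # True
```

Sliding the full pattern length misses the match at index 3.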

 

To handle this, we must check not only whether the pattern string contains another substring matching the good suffix.

We must also check whether any suffix substring of the good suffix matches a prefix substring of the pattern string.

 

A suffix substring of a string s is a substring whose last character is aligned with the last character of s; for example, the suffix substrings of abc are c and bc.

A prefix substring of s is a substring whose first character is aligned with the first character of s; for example, the prefix substrings of abc are a and ab.
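In code, the two definitions amount to simple slicing. A minimal sketch with hypothetical function names; following the examples in the text, the full string itself is not counted.

```python
def suffix_substrings(s: str) -> list[str]:
    """Proper suffixes of s, longest first."""
    return [s[i:] for i in range(1, len(s))]


def prefix_substrings(s: str) -> list[str]:
    """Proper prefixes of s, shortest first."""
    return [s[:i] for i in range(1, len(s))]


print(suffix_substrings("abc"))  # ['bc', 'c']
print(prefix_substrings("abc"))  # ['a', 'ab']
```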

 

Among the suffix substrings of the good suffix, we find the longest one that matches a prefix substring of the pattern string, call it {v}, and slide the pattern string so that this prefix is aligned with {v}.
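Finding {v} can be sketched as follows. The helper names are hypothetical, and the slide distance m - len(v) is my reading of "align the prefix with {v}", stated here as an assumption rather than taken from the text.

```python
def longest_prefix_suffix(pattern: str, u: str) -> str:
    """Longest proper suffix of the good suffix u that is a prefix of pattern.

    Returns "" when no suffix substring of u matches a prefix.
    """
    for k in range(len(u) - 1, 0, -1):  # try longer candidates first
        v = u[-k:]
        if pattern.startswith(v):
            return v
    return ""


def prefix_case_shift(pattern: str, u: str) -> int:
    """Slide so the pattern's prefix lines up with {v} in the main string."""
    v = longest_prefix_suffix(pattern, u)
    # With no {v}, nothing can overlap and the full length is safe.
    return len(pattern) - len(v)


print(longest_prefix_suffix("cabc", "bc"))    # c
print(prefix_case_shift("cabc", "bc"))        # 3
print(longest_prefix_suffix("bcxabc", "abc")) # bc
print(prefix_case_shift("axybc", "bc"))       # 5
```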

 

How to choose between the rules?

The number of positions to slide computed by the bad character rule may be negative.

 

PS: Why can it be negative?

When we hit a bad character, the number of positions to move backward is si - xi, and the crux is computing xi. How do we find xi?

In other words, how do we find where the bad character occurs in the pattern string?

Scanning the pattern string sequentially to find it would be rather inefficient.

To improve efficiency, we can use a hash table.

The hash table records the position of the last occurrence of each distinct character in the pattern string. This is not necessarily the first occurrence found when looking toward the start of the pattern string from position si.
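Building such a table takes one pass over the pattern string. A minimal sketch using a Python dict as the hash table; the name `build_last_occurrence` is hypothetical.

```python
def build_last_occurrence(pattern: str) -> dict[str, int]:
    """Map each character to the index of its last occurrence in pattern."""
    last = {}
    for i, ch in enumerate(pattern):
        last[ch] = i  # later occurrences overwrite earlier ones
    return last


table = build_last_occurrence("abda")
print(table)              # {'a': 3, 'b': 1, 'd': 2}
print(table.get("c", -1)) # -1: absent characters get xi = -1
```

Looking up xi is then a constant-time operation instead of a scan of the pattern string.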

Take an example in which si = 4 and the last occurrence of c in the pattern string is at xi = 6: the bad character rule then gives si - xi = -2.

So whenever si is less than xi, the computed number of positions to slide is negative.
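The numbers si = 4 and xi = 6 can be reproduced with a made-up pattern. The strings below are hypothetical, chosen only so that the indices come out the same; they are not the strings from the original example.

```python
pattern = "abcdefc"   # the last 'c' is at index 6
window = "xxxxcfc"    # part of the main string under the pattern

# Last-occurrence table, as kept in the hash table.
last = {ch: i for i, ch in enumerate(pattern)}

# Compare from right to left until the first mismatch.
si = len(pattern) - 1
while si >= 0 and window[si] == pattern[si]:
    si -= 1

bad_char = window[si]
xi = last.get(bad_char, -1)
print(si, bad_char, xi, si - xi)  # 4 c 6 -2
```

Because the table stores the last occurrence in the whole pattern, xi can lie to the right of si, and the rule would then move the pattern string in the wrong direction.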

 

We can compute the number of positions to slide under the good suffix rule and under the bad character rule separately, and then slide the pattern string by the larger of the two.

This also avoids the problem that the number computed by the bad character rule may be negative.
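Putting both rules together gives a complete search routine. The following is a sketch under my own assumptions, not the original author's implementation: all function names and the `suffix`/`prefix` table layout are illustrative choices, and the good suffix handling follows the two cases plus the prefix check described above.

```python
def build_good_suffix(pattern: str):
    """suffix[k]: start of the rightmost other occurrence of the
    length-k suffix, or -1. prefix[k]: True if the length-k suffix
    is also a prefix of the pattern."""
    m = len(pattern)
    suffix = [-1] * m
    prefix = [False] * m
    for i in range(m - 1):
        # Common suffix of pattern[0..i] and the whole pattern.
        j, k = i, 0
        while j >= 0 and pattern[j] == pattern[m - 1 - k]:
            j -= 1
            k += 1
            suffix[k] = j + 1
        if j == -1:
            prefix[k] = True
    return suffix, prefix


def good_suffix_shift(si: int, m: int, suffix, prefix) -> int:
    k = m - 1 - si                  # length of the good suffix
    if suffix[k] != -1:             # another occurrence {u*} exists
        return si - suffix[k] + 1
    for r in range(si + 2, m):      # longest suffix of {u} that is a prefix
        if prefix[m - r]:
            return r
    return m                        # nothing overlaps: slide past {u}


def bm_search(main: str, pattern: str) -> int:
    """Index of the first occurrence of pattern in main, or -1."""
    n, m = len(main), len(pattern)
    if m == 0:
        return 0
    last = {ch: i for i, ch in enumerate(pattern)}
    suffix, prefix = build_good_suffix(pattern)
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and main[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            return start
        shift = j - last.get(main[start + j], -1)  # bad character rule
        if j < m - 1:                              # a good suffix exists
            shift = max(shift, good_suffix_shift(j, m, suffix, prefix))
        start += shift
    return -1


print(bm_search("abcacabdc", "abd"))  # 5
```

Taking the maximum guarantees a slide of at least one position: when there is no good suffix the bad character rule is always positive, and the good suffix rule never yields less than one.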

 

Some gripes about the BM algorithm

The principle of the BM algorithm is much more complicated than that of BF and RK, but it can be understood with some study; the concrete code, however, is very easy to get wrong.

 

When learning the BM algorithm, the focus should be on the ideas and principles. A fellow learner wrote a very good summary.

I quote it here as my key points for learning the BM algorithm:

1. Have an optimization mindset. The BF and RK algorithms already meet our needs, so why was the BM algorithm invented? To reduce time complexity. The downside is that the optimized code is complicated and costly to maintain.

2. When you need to search and want to reduce time complexity, what should come to mind? A hash table.

3. What if an expression is relatively expensive to compute and is used frequently? Precompute it and cache the result.

 

The BM algorithm performs well. The pursuit of performance calls for a more sophisticated algorithm, but the more sophisticated the algorithm, the more complex its implementation and the easier it is to get the details wrong.
 

As for the performance of the BM algorithm, papers have proved that in the worst case the number of comparisons it makes is bounded above by 3n.

Recorded here as a note of my understanding.


Origin blog.csdn.net/qq_42006733/article/details/105136180