String matching algorithms: the BF algorithm and the RK algorithm

Overview

The BF algorithm and the RK algorithm are both single-pattern matching algorithms. A single-pattern matching algorithm, put simply, matches one string against another string.

Other common single-pattern matching algorithms include the BM algorithm and the KMP algorithm.

 

In contrast, a multi-pattern matching algorithm looks for several pattern strings in one string at the same time; common examples are the Trie tree and the AC automaton.

 

Two concepts

 

Pattern string and main string

 

If we look for string B inside string A, then A is the main string and B is the pattern string.

The length of the main string is denoted n, and the length of the pattern string is denoted m. Since we are looking for the pattern string inside the main string, n > m.

 

 

BF algorithm

BF is short for Brute Force. In Chinese it is called the brute-force matching algorithm, and it is also known as the naive matching algorithm.

This way of matching strings is very "brute": the logic is simple and easy to understand, but the performance is correspondingly low.

 

The idea behind the BF algorithm

The idea behind the BF algorithm can be summarized in one sentence:

In the main string, we check the n-m+1 substrings of length m that start at positions 0, 1, 2, ..., n-m, to see whether any of them matches the pattern string.

Consider an extreme case of this idea: the main string is "aaaaaaaaaaaaa......" (many more a's omitted) and the pattern string is "aaaaab".

Each time we compare m characters, and we do this n-m+1 times.

Therefore, the worst-case time complexity of this algorithm is O(n*m).
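The idea above can be sketched in a few lines of Python. This is only a minimal illustration; the function name bf_search and the convention of returning -1 when nothing matches are assumptions made for this example and do not come from the original text.

```python
def bf_search(main: str, pattern: str) -> int:
    """Return the first index where pattern occurs in main, or -1 if it does not."""
    n, m = len(main), len(pattern)
    # Try every start position 0, 1, ..., n-m (n-m+1 candidates in total).
    for start in range(n - m + 1):
        matched = 0
        # Compare character by character and stop at the first mismatch.
        while matched < m and main[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


if __name__ == "__main__":
    print(bf_search("aaaaaaaaab", "aaaab"))
    print(bf_search("hello", "xyz"))
```

With a main string made almost entirely of a's and a pattern such as "aaaab", the inner loop runs nearly m times for every start position, which is the O(n*m) worst case described above.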

 

When the BF algorithm is appropriate

Although in theory the time complexity of the BF algorithm is high, O(n*m), in practice it is a fairly commonly used string matching algorithm.

There are two main reasons:

  1. In most cases, neither the pattern string nor the main string is very long. Moreover, each time the pattern string is compared with a substring of the main string, we can stop as soon as we meet a character that does not match; there is no need to compare all m characters. So although the theoretical worst-case time complexity is O(n*m), statistically the algorithm is much more efficient than that in most cases.
  2. The idea of the BF algorithm is simple, and the code is very simple to implement. Simple means less error-prone, and if there is a bug, it is easy to expose and fix. In engineering, as long as the performance requirements are met, simple is preferred. This is the design principle we often refer to as KISS (Keep It Simple and Stupid).

 

RK algorithm

The full name of the RK algorithm is the Rabin-Karp algorithm. It is named after its two inventors, Rabin and Karp.

 

The RK algorithm is essentially an upgraded version of the BF algorithm.

In the BF algorithm, if the pattern string has length m and the main string has length n, then the main string contains n-m+1 substrings of length m.

We only need to compare each of these n-m+1 substrings with the pattern string by brute force to find the substrings of the main string that match the pattern string.

However, each check of whether a substring matches requires comparing the characters one by one, so the time complexity of the BF algorithm is relatively high, O(n*m).

With a slight modification to the BF algorithm, namely introducing a hash algorithm, the time complexity can be reduced.

 

The idea behind the RK algorithm

We use a hash algorithm to compute a hash value for each of the n-m+1 substrings of the main string, and then compare each of them with the hash value of the pattern string.

If the hash value of a substring equals that of the pattern string, the substring matches the pattern string (setting hash collisions aside for the moment).

Because a hash value is a number, and comparing two numbers for equality is very fast, comparing the pattern string with a substring becomes more efficient.

 

However, when the hash algorithm computes the hash value of a substring, it needs to traverse every character in that substring.

So although comparing the pattern string with a substring becomes more efficient, the overall efficiency of the algorithm does not improve.

We need the hash algorithm to compute the hash values of the substrings efficiently.

Computing hash values also brings the problem of hash collisions. So the core of the RK algorithm is the design of the hash function.

The hash function should be simple and efficient, and it also needs to keep the probability of hash collisions low.

 

Suppose the strings contain only the 26 letters a~z. For example, we can map each letter to a prime number, in increasing order, add up the numbers for the letters of a string, and use the sum as the hash value. The probability of a collision is then somewhat lower.
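The prime-number sum can be sketched as follows. The concrete mapping (a to 2, b to 3, c to 5, and so on) and the names PRIMES, letter_value, prime_sum_hash and window_hashes are assumptions made for this illustration. The sliding update in window_hashes is one possible way to get the hash of the next substring from the previous one without traversing all m characters again; the original text does not spell out how this is done.

```python
# Assumed mapping: a -> 2, b -> 3, c -> 5, ... (the first 26 primes).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def letter_value(ch: str) -> int:
    """Prime assigned to a lowercase letter a~z."""
    return PRIMES[ord(ch) - ord("a")]


def prime_sum_hash(s: str) -> int:
    """Hash of a string: the sum of the primes of its letters."""
    return sum(letter_value(ch) for ch in s)


def window_hashes(main: str, m: int) -> list:
    """Hashes of all substrings of length m, each derived from the previous one."""
    n = len(main)
    if m == 0 or m > n:
        return []
    h = prime_sum_hash(main[:m])
    hashes = [h]
    for i in range(1, n - m + 1):
        # Remove the letter that left the window, add the letter that entered it.
        h = h - letter_value(main[i - 1]) + letter_value(main[i + m - 1])
        hashes.append(h)
    return hashes


if __name__ == "__main__":
    print(window_hashes("abcd", 2))
    print(prime_sum_hash("ab") == prime_sum_hash("ba"))
```

Because the hash is a plain sum, strings with the same letters in a different order, such as "ab" and "ba", still get the same hash value, so collisions are reduced but not eliminated.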

 

When there is a hash collision, a substring and the pattern string may have the same hash value even though the two strings themselves do not match.

The solution is simple, and it is the same idea as the object comparison issue discussed in an earlier blog post (hashCode and equals).

 

When we find a substring whose hash value equals that of the pattern string, we just need to compare the substring with the pattern string itself once more.

If the hash value of a substring is not equal to that of the pattern string, the substring certainly does not match the pattern string, and there is no need to compare the substring with the pattern string itself.
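Putting the pieces together, the whole procedure might look like the sketch below. The function name rk_search is hypothetical, and the hash used here (the sum of character codes) is only a stand-in chosen to keep the example short; any additive hash, including a sum of primes, would slot in the same way.

```python
def rk_search(main: str, pattern: str) -> int:
    """Return the first index where pattern occurs in main, or -1 if it does not."""
    n, m = len(main), len(pattern)
    if m > n:
        return -1
    # Assumed hash: the sum of character codes (a stand-in for any additive hash).
    pattern_hash = sum(ord(ch) for ch in pattern)
    window_hash = sum(ord(ch) for ch in main[:m])
    for start in range(n - m + 1):
        if start > 0:
            # Slide the window: add the character entering, drop the one leaving.
            window_hash += ord(main[start + m - 1]) - ord(main[start - 1])
        # Equal hashes only suggest a match; compare the strings to rule out a collision.
        if window_hash == pattern_hash and main[start:start + m] == pattern:
            return start
    return -1


if __name__ == "__main__":
    print(rk_search("bacab", "ab"))
    print(rk_search("aaaa", "b"))
```

In the first call the window "ba" has the same hash as "ab", and the string comparison is what rejects it. This mirrors checking hashCode first and equals second.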

 

The key to the RK algorithm is the design of the hash algorithm:

The collision probability of the hash algorithm should be kept relatively low. If there are many collisions, the time complexity of the RK algorithm degrades and its efficiency drops.

In the extreme case where there are so many collisions that the pattern string has to be compared with the substring itself every time, the time complexity degenerates to O(n*m).

But under normal circumstances, if the hash algorithm is designed reasonably, there will not be many collisions, and the RK algorithm is still more efficient than the BF algorithm.

 

 

 


Origin blog.csdn.net/qq_42006733/article/details/105072014