Understanding the idea thoroughly
Among string matching algorithms, KMP achieves roughly O(N) complexity. The key is that the pointer into the main text never backtracks, which saves a lot of time.
For example, to find `abce` in `abcdabce`, the brute-force algorithm proceeds as in the following table: up to 4 characters are compared at each alignment, and 5 alignments are tried in total.
| | a | b | c | d | a | b | c | e |
|---|---|---|---|---|---|---|---|---|
| 1 | a | b | c | e | | | | |
| 2 | | a | b | c | e | | | |
| 3 | | | a | b | c | e | | |
| 4 | | | | a | b | c | e | |
| √ | | | | | a | b | c | e |
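The brute-force procedure in the table can be sketched in Python as follows. The function name `naive_search` and the alignment counter are illustrative choices, not something defined in the article.

```python
def naive_search(txt, pattern):
    """Try every alignment of `pattern` against `txt`, left to right.

    Returns (match_positions, alignments_tried).
    """
    matches = []
    alignments = 0
    for start in range(len(txt) - len(pattern) + 1):
        alignments += 1
        # Compare character by character, stopping at the first mismatch.
        k = 0
        while k < len(pattern) and txt[start + k] == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(start)
    return matches, alignments


print(naive_search("abcdabce", "abce"))  # ([4], 5)
```

The five alignments correspond to the five rows of the table; the match is found at index 4.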
However, we can see at a glance that `d` does not occur in `abce` at all, so if we store some extra information, we may be able to skip past this `d` in one step.
| | a | b | c | d | a | b | c | e |
|---|---|---|---|---|---|---|---|---|
| 1 | a | b | c | e | | | | |
| √ | | | | | a | b | c | e |
But sometimes you cannot jump that far. For example, to find `cab` in `abcabcabc`, a better approach is roughly as follows:
| | a | b | c | a | b | c | a | b | c |
|---|---|---|---|---|---|---|---|---|---|
| 1 | c | a | b | | | | | | |
| √ | | | c | a | b | | | | |
So the crux of the question is: why can the first case jump straight past `d`, while the second case can only shift by exactly two positions, no more and no less?
To sum up: let txt be a long text in which we need to find str, and let ch be the character of txt currently being compared. Two simple rules emerge.
| Condition | Action |
|---|---|
| ch is not in str | str skips past this ch |
| ch happens to be str[0] | str is moved to the position of this ch |
Next: what should happen if ch is in str but is not str[0]?
Of course it cannot simply be skipped, because str may contain repeated sequences. For example, when finding `ababc` in `abababc`, the best approach is:
| | a | b | a | b | a | b | c |
|---|---|---|---|---|---|---|---|
| 1 | a | b | a | b | c | | |
| √ | | | a | b | a | b | c |
That is, in a string such as `ababc` the `a` occurs at different positions, so when a new `a` arrives, the decision depends on how much of the pattern has been matched so far.
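To make "how much has been matched so far" concrete, here is a minimal sketch that recomputes it by brute force after every character: the state is taken to be the length of the longest prefix of the pattern that is also a suffix of the text read so far. The function names are hypothetical and this is only an illustration of the state, not an efficient matcher.

```python
def longest_prefix_suffix(pattern, seen):
    """Length of the longest prefix of `pattern` that is a suffix of `seen`."""
    for k in range(min(len(pattern), len(seen)), 0, -1):
        if seen.endswith(pattern[:k]):
            return k
    return 0


def trace_states(txt, pattern):
    """State after each character of `txt` has been read."""
    return [longest_prefix_suffix(pattern, txt[:i + 1]) for i in range(len(txt))]


print(trace_states("abababc", "ababc"))  # [1, 2, 3, 4, 3, 4, 5]
```

The fifth character is an `a` arriving after `abab`: the state drops from 4 to 3 instead of going back to 0, which is exactly the shift by two shown in the table.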
In the figure below, each circle represents what has been matched so far, each arrow is labelled with a new character and points to the state to jump to next, and an exclamation mark means "any character other than this one".
At this point things should start to click: this is a so-called state machine, and its construction has nothing to do with txt. In other words, a self-match on str before matching against txt is enough to obtain such a state machine.
A state in the state machine really represents the pointer position in the string to be matched. Returning to a means the pointer goes back to 0; each step forward increases the pointer by 1; when abab receives an a and falls back to aba, the pointer rolls back to 3 instead of advancing to 5, as shown in the table below.
| | a | ab | aba | abab | ababc |
|---|---|---|---|---|---|
| a | 0 | 0 | 0 | 3 | |
| b | 0 | 0 | 0 | 0 | |
| c | 0 | 0 | 0 | 0 | success |
| other | 0 | 0 | 0 | 0 | |
Now it becomes clear: the so-called state machine is really just a matrix.
The next thing we have to do is generate this state matrix.
Rough implementation
Again consider finding str in txt. The first step is to build the state matrix of str:
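    test = "ababcdabadc"  # str is a built-in name in Python, so use a different name
    length = len(test)
    # Dictionary that stores the states: one row per character, one column per pointer position
    status = {s: [0] * length for s in set(test)}
    for ch in status:
        for i in range(length):
            for j in range(i + 1):
                # Pointer at i, next character ch: if the last j matched characters
                # plus ch form the prefix of length j+1, the pointer can move to j+1
                if test[i - j:i] + ch == test[0:j + 1]:
                    status[ch][i] = j + 1
    # Print one row per character (row order may vary, since set order is arbitrary)
    for ch, row in status.items():
        print(ch, row)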
This gives:

    b [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0]
    d [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0]
    c [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11]
    a [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1]

For the string `ababcdabadc`, the pointer starts at 0. If an `a` arrives, the pointer jumps to `a[0]=1`, meaning the current state is `a`; if a `b` arrives next, the pointer jumps to `b[1]=2`, meaning the current state is `ab`; and so on.
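The pointer movement just described can be replayed with a short sketch. The rows are copied from the output printed above; the helper name `walk` is hypothetical, and it is only meant for short inputs that do not complete a full match (the rows have no column for the final state).

```python
# Rows of the state matrix for "ababcdabadc", copied from the printed output.
status = {
    "a": [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1],
    "b": [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0],
    "c": [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11],
    "d": [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0],
}


def walk(status, txt):
    """Return the pointer value after each character of `txt`."""
    j = 0
    trace = []
    for ch in txt:
        # Characters that never occur in the pattern send the pointer back to 0.
        j = status[ch][j] if ch in status else 0
        trace.append(j)
    return trace


print(walk(status, "ababa"))  # [1, 2, 3, 4, 3]
print(walk(status, "abax"))   # [1, 2, 3, 0]
```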
Once the state matrix of str has been built, string matching becomes very convenient.
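    def setStatus(test):
        length = len(test)
        # Dictionary that stores the states: one row per character
        status = {s: [0] * length for s in set(test)}
        for ch in status:
            for i in range(length):
                for j in range(i + 1):
                    if test[i - j:i] + ch == test[0:j + 1]:
                        status[ch][i] = j + 1
        return status

    def KMP(txt, test):
        status = setStatus(test)
        length = len(test)
        # State to resume from after a full match, obtained by feeding test[1:]
        # through the state matrix; without it, j == length would index past
        # the end of a row on the next character
        restart = 0
        for s in test[1:]:
            restart = status[s][restart]
        match = []
        j = 0
        for i in range(len(txt)):
            s = txt[i]
            # Characters that do not occur in test reset the pointer to 0
            j = status[s][j] if s in status else 0
            if j == length:
                match.append(i - length + 1)
                j = restart
        return match

    txt = "ababcdabasdcdasdfababcdabadc"
    test = "ababcdabadc"
    print(KMP(txt, test))  # [17]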
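A simple way to sanity-check a matcher like this is to compare its result with Python's built-in `str.find`. The helper below is a sketch for that purpose; the name `find_all` is hypothetical and not part of the article's code.

```python
def find_all(txt, pattern):
    """All start indices of `pattern` in `txt`, overlapping matches included."""
    positions = []
    start = txt.find(pattern)
    while start != -1:
        positions.append(start)
        # Restart one character later so that overlapping matches are found too.
        start = txt.find(pattern, start + 1)
    return positions


print(find_all("ababcdabasdcdasdfababcdabadc", "ababcdabadc"))  # [17]
print(find_all("ababa", "aba"))                                 # [0, 2]
```

The second call is the interesting case: the two matches overlap, so a matcher has to continue from a non-zero state after reporting the first one.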
Optimization of the matching matrix
In general, except for maddening strings such as `aaaaaaa`, the matching matrix is sparse, which means we make a lot of useless comparisons. The construction of the matching matrix can therefore be optimized somewhat; at the very least, the outermost loop can be removed.
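    def setStatus(test):
        length = len(test)
        # Dictionary that stores the states: one row per character
        status = {s: [0] * length for s in set(test)}
        for i in range(length):
            for j in range(i + 1):
                # The last j matched characters equal the prefix of length j,
                # so only the character test[j] can extend that prefix to j+1
                if test[i - j:i] == test[0:j]:
                    ch = test[j]
                    status[ch][i] = j + 1
        return status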
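As a quick check that dropping the outer loop does not change the result, the sketch below builds the matrix both ways and compares them on a few patterns. The two function names are hypothetical labels for the two versions of `setStatus` shown in the article.

```python
def build_three_loops(test):
    """Original construction: loop over characters, positions and prefix lengths."""
    length = len(test)
    status = {s: [0] * length for s in set(test)}
    for ch in status:
        for i in range(length):
            for j in range(i + 1):
                if test[i - j:i] + ch == test[0:j + 1]:
                    status[ch][i] = j + 1
    return status


def build_two_loops(test):
    """Optimized construction: the character is read off as test[j]."""
    length = len(test)
    status = {s: [0] * length for s in set(test)}
    for i in range(length):
        for j in range(i + 1):
            if test[i - j:i] == test[0:j]:
                status[test[j]][i] = j + 1
    return status


for pattern in ["ababcdabadc", "ababc", "aaaaaaa", "cab"]:
    assert build_three_loops(pattern) == build_two_loops(pattern)
print("both constructions agree")
```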
Now only two nested loops remain, which looks cleaner, but a casual onlooker may still object to the $O(N^2)$ complexity of the nested loops. To tidy things up further, let's examine the characteristics of the matching matrix:

    b [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0]
    d [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0]
    c [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11]
    a [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1]

First, the non-zero values in each row are mostly increasing: the non-zero values of `d` are `6, 10`, and those of `c` are `5, 11`. And whenever a smaller value shows up, that value must already have appeared earlier in the row; for example `b[9]=4`, and this 4 already appeared at `b[3]`.
If we take "whenever the value decreases, it must repeat an earlier value" as a principle, then `0` and `1` clearly fit it as well: the 0s in the rows of `b`, `c` and `d` already appear in the first position, while the row of `a` contains no 0, because `a[0]=1` and so its minimum value can only be 1.
In other words, `a[i]` is either `i+1` or a value that has already occurred.
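This observation can be checked mechanically on the four rows printed above. The sketch below treats 0 as always allowed, on the assumption that it is the starting value of every row except that of `a`; the function name `follows_rule` is hypothetical.

```python
# Rows of the matching matrix for "ababcdabadc", copied from the article.
rows = {
    "a": [1, 1, 3, 1, 3, 1, 7, 1, 9, 1, 1],
    "b": [0, 2, 0, 4, 0, 0, 0, 8, 0, 4, 0],
    "c": [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 11],
    "d": [0, 0, 0, 0, 0, 6, 0, 0, 0, 10, 0],
}


def follows_rule(row):
    """True if every entry is i+1, or 0, or a value seen earlier in the row."""
    seen = {0}  # assumption: 0 counts as already seen
    for i, value in enumerate(row):
        if value != i + 1 and value not in seen:
            return False
        seen.add(value)
    return True


print({ch: follows_rule(row) for ch, row in rows.items()})
# {'a': True, 'b': True, 'c': True, 'd': True}
```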
In that case, for a string str of length N, I can take the first M characters of str and match them against the rest of the string. For example, for "ababcdabadc", first match `a` to get the set of positions where it matches; then match `ab` at those positions, and so on, until the only matching position left is position 0.