Algorithm Design - KMP Algorithm

String pattern matching problem

Suppose there are two strings S and T, where S is the main string (the text) and T is the pattern string.

We need to find a substring of S that matches T. If one is found, return the position in S of the first character of the matched substring; otherwise return -1.

Brute-force solution

Assuming S = "aabaabaaf" and T = "aabaaf", the brute-force matching process is shown in the figure below:

The matching process in the figure consists of two nested loops.

The outer loop controls the matching round, that is, the position in S at which matching starts:

  • In round 0, T is matched starting at index 0 of S
  • In round 1, T is matched starting at index 1 of S
  • ...
  • In round k, T is matched starting at index k of S

The inner loop compares T with the range k ~ k + t.length - 1 of S, character by character:

  • If the characters at some position differ, the current round fails and the next round begins
  • If the characters at every position are the same, the match succeeds: a substring equal to T has been found in S, and it starts at index k

Suppose s.length = n and t.length = m. Then the time complexity of the brute-force solution is O(n * m).
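To see where the O(n * m) bound comes from, the following minimal Python sketch counts character comparisons. The function name brute_force_comparisons and the test input are illustrative choices, not part of the original code.

```python
def brute_force_comparisons(s, t):
    """Return (match_index, number_of_character_comparisons)."""
    comparisons = 0
    for k in range(len(s) - len(t) + 1):  # outer loop: one round per start index k
        j = 0
        while j < len(t):                 # inner loop: compare character by character
            comparisons += 1
            if s[k + j] != t[j]:
                break
            j += 1
        if j == len(t):
            return k, comparisons
    return -1, comparisons


# Input chosen so that every round matches m - 1 characters before failing.
n, m = 12, 4
s = "a" * n
t = "a" * (m - 1) + "b"
print(brute_force_comparisons(s, t))  # (-1, 36)
```

On this input each of the n - m + 1 = 9 rounds performs m = 4 comparisons before failing, 36 in total.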

Code

JS algorithm source code

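/**
 * @param {*} s main string (text)
 * @param {*} t pattern string
 * @returns index in s of the first character of the substring matching t, or -1 if no match is found
 */
function indexOf(s, t) {
  // k is the position in s where the current round starts matching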
  for (let k = 0; k <= s.length - t.length; k++) {
    let i = k;
    let j = 0;

    while (j < t.length && s[i] == t[j]) {
      i++;
      j++;
    }

    if (j == t.length) {
      return k;
    }
  }

  return -1;
}

const s = "aabaabaafaab";
const t = "aabaaf";
console.log(indexOf(s, t));

Java algorithm source code

public class Main {
  public static void main(String[] args) {
    String s = "aabaabaaf";
    String t = "aabaaf";
    System.out.println(indexOf(s, t));
  }

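  /**
   * @param s main string (text)
   * @param t pattern string
   * @return index in s of the first character of the substring matching t, or -1 if no match is found
   */
  public static int indexOf(String s, String t) {
    // k is the position in s where the current round starts matching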
    for (int k = 0; k <= s.length() - t.length(); k++) {
      int i = k;
      int j = 0;

      while (j < t.length() && s.charAt(i) == t.charAt(j)) {
        i++;
        j++;
      }

      if (j == t.length()) {
        return k;
      }
    }

    return -1;
  }
}

Python algorithm source code

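def indexOf(s, t):
    """
    :param s: main string (text)
    :param t: pattern string
    :return: index in s of the first character of the substring matching t, or -1 if no match is found
    """

    # k is the position in s where the current round starts matching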
    for k in range(len(s) - len(t) + 1):
        i = k
        j = 0

        while j < len(t) and s[i] == t[j]:
            i += 1
            j += 1

        if j == len(t):
            return k

    return -1


if __name__ == '__main__':
    s = "aabaabaaf"
    t = "aabaaf"

    print(indexOf(s, t))

KMP algorithm

Improving on the brute-force solution

For the string pattern matching problem, the brute-force algorithm is not optimal. Even when s and t are arbitrary strings, they contain regularities that can be exploited.

Take the previous example, s = "aabaabaaf", t = "aabaaf".

After round 0 fails, are rounds 1 and 2 doomed to fail as well?

The following figure shows the point at which round 0 fails:

Look at the part that matched successfully, "aabaa": it has a certain symmetry.

If we abstract away the parts of S and T that follow "aabaa", as shown in the figure below, then:

  • Round 0 fails because the "abstracted part" fails to match
  • Rounds 1 and 2 fail because of a mismatch within the "aabaa" part itself:

Simplifying rounds 1, 2, and 3 once more gives the following figure:

It is now easy to see that rounds 1 and 2 are doomed to fail.
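The claim that rounds 1 and 2 cannot succeed can also be checked mechanically. In the Python sketch below (viable_shifts is a hypothetical helper name), the text is known to equal the matched part "aabaa" over that range, so moving the pattern d places to the right can only work if "aabaa" from index d onward equals the beginning of "aabaa".

```python
def viable_shifts(matched):
    """Shifts d (1 <= d < len(matched)) for which the pattern, moved d places
    to the right, still agrees with the text over the already-matched part."""
    n = len(matched)
    return [d for d in range(1, n) if matched[d:] == matched[:n - d]]


print(viable_shifts("aabaa"))  # [3, 4]
```

Shifts 1 and 2 are missing from the result, which is why those rounds fail; the smallest viable shift, 3, is the round that the symmetric part points to.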

Let's take another example:

If the S and T above fail to match in round 0 because of the abstracted part, the next round can jump directly to the symmetric position and start matching there, because matching at any non-symmetric position is certain to fail.

In this case two rounds of matching are skipped, which saves the time of two rounds.

But think again: does jumping straight to the symmetric part really save only two rounds of matching?

The figure below shows matching resuming directly at the symmetric part after round 0 fails:

Compared with the brute-force process, the parts crossed out with an X below are the steps that are skipped:

Now observe how the i and j pointers change during the jump to the symmetric part:

The position of the i pointer in S does not change, while the j pointer moves back to the middle character "b" of the symmetric string "aabaa" in T.

So what is the time complexity of the improved algorithm?

Since the algorithm guarantees that the i pointer never moves backward, the matching phase takes only O(n) time. Building the prefix table described below adds O(m), for O(n + m) overall.

This algorithm is in fact the KMP algorithm.

Prefix table

We now know the general principle of the KMP algorithm. Its key step is finding the symmetric part of the already-matched portion of the pattern string T.

So how can this be done in code?

The three inventors of the KMP algorithm, Knuth, Morris, and Pratt, proposed the concept of a prefix table.

For example, for T = "aabaaf", we first list all the substrings of T that start at index 0:

  • a
  • aa
  • aab
  • aaba
  • aabaa
  • aabaaf

Then compute, for each of these substrings, the length of its longest identical prefix and suffix.

Suppose the length of string s is n. Then:

  • A prefix is any substring whose start index is 0 and whose end index is < n-1
  • A suffix is any substring whose end index is n-1 and whose start index is > 0

Therefore:

  • Neither a prefix nor a suffix can be the string s itself
  • A prefix and a suffix of s may overlap

For example, here are all the prefixes and suffixes of "aabaa", a substring of T:

length prefix suffix
1 a a
2 aa aa
3 aab baa
4 aaba abaa

Among these, the longest prefix that is identical to the suffix of the same length is "aa".

Note that a prefix and a suffix are compared character by character from left to right, so in the example above the length-3 prefix "aab" and suffix "baa" are different.

An identical prefix and suffix may overlap.

For example, in the string "ababab", the longest identical prefix and suffix is "abab":

length prefix suffix
1 a b
2 ab ab
3 aba bab
4 abab abab
5 ababa babab
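The two tables can be reproduced with a few lines of Python. This is a minimal sketch of the enumeration described above; the name longest_border is my own label for "longest identical prefix and suffix" and does not appear in the original.

```python
def longest_border(s):
    """Longest prefix of s that equals the suffix of the same length,
    excluding s itself. Prefix and suffix are allowed to overlap."""
    for k in range(len(s) - 1, 0, -1):   # try the longest candidates first
        if s[:k] == s[len(s) - k:]:      # compared left to right, as strings
            return s[:k]
    return ""


print(longest_border("aabaa"))   # aa
print(longest_border("ababab"))  # abab
```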

Applying the same procedure to each substring of T = "aabaaf" that starts at index 0 gives:

substring of T | longest identical prefix and suffix | its length
a | none | 0
aa | a | 1
aab | none | 0
aaba | a | 1
aabaa | aa | 2
aabaaf | none | 0

This prefix table is usually represented by an array called next:

next = [0, 1, 0, 1, 2, 0]
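As a check on the hand calculation, here is a minimal Python sketch that builds the same table by plain enumeration. The function name next_by_enumeration is illustrative; this is the slow, direct method, not the generation algorithm KMP actually uses.

```python
def next_by_enumeration(t):
    """For each substring t[0..end-1], record the length of its longest
    identical prefix and suffix (the substring itself is excluded)."""
    table = []
    for end in range(1, len(t) + 1):
        sub = t[:end]
        best = 0
        for k in range(1, len(sub)):          # candidate lengths 1 .. len(sub) - 1
            if sub[:k] == sub[len(sub) - k:]:
                best = k                      # keep the longest length that matches
        table.append(best)
    return table


print(next_by_enumeration("aabaaf"))  # [0, 1, 0, 1, 2, 0]
```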

Application of the prefix table

Earlier we computed the prefix table, the next array, by hand:

next = [0, 1, 0, 1, 2, 0]

So what do the elements of the next array mean?

next[j] is the length of the longest identical prefix and suffix of the substring of T spanning indices 0~j. For example:

  • next[0] is the length of the longest identical prefix and suffix of "a", the substring of T at indices 0~0
  • next[1] is the length of the longest identical prefix and suffix of "aa", the substring of T at indices 0~1
  • next[2] is the length of the longest identical prefix and suffix of "aab", the substring of T at indices 0~2
  • next[3] is the length of the longest identical prefix and suffix of "aaba", the substring of T at indices 0~3
  • next[4] is the length of the longest identical prefix and suffix of "aabaa", the substring of T at indices 0~4
  • next[5] is the length of the longest identical prefix and suffix of "aabaaf", the substring of T at indices 0~5

So how is next used in the KMP algorithm?

In the figure below, when s[i] != t[j], the earlier analysis says we need to do the following:

  • Keep the i pointer where it is
  • Move the j pointer back to the center of the symmetric part

The advantages of doing this are:

  • The i pointer never moves backward, which would add redundant rounds of comparison
  • The part before the center of the symmetric part is not matched again (it is known to be identical, so matching it again would be redundant)

However, "the center of the symmetric part" is not a rigorous description. More precisely, it is "the position immediately after the end of the prefix" in the longest identical prefix and suffix.

And the index of the position immediately after the end of that prefix is exactly the length of the longest identical prefix and suffix.

Therefore, when s[i] != t[j], we should set j = next[j - 1].

In addition, if the mismatch happens at j = 0, next[j - 1] would be an out-of-bounds access, so this case needs special handling. The figure below shows a mismatch at j = 0:

In this case, we increment i, leave j unchanged, and continue matching.

This does not conflict with the earlier requirement that the i pointer never move backward, because i only moves forward here.
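The two rules above (j = next[j - 1] on a mismatch with j > 0, and i++ on a mismatch with j = 0) can be watched in action with a small tracing sketch in Python. The function name kmp_trace and the event labels are illustrative, and the table is the hand-computed one from the text.

```python
def kmp_trace(s, t, table):
    """Run the matching loop and record (i, j, event) for every step."""
    i = j = 0
    steps = []
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            steps.append((i, j, "match"))
            i += 1
            j += 1
        elif j > 0:
            steps.append((i, j, "fallback"))  # i stays, j moves to table[j - 1]
            j = table[j - 1]
        else:
            steps.append((i, j, "skip"))      # mismatch at j == 0: only i advances
            i += 1
    return steps, (i - j if j == len(t) else -1)


steps, index = kmp_trace("aabaabaaf", "aabaaf", [0, 1, 0, 1, 2, 0])
for step in steps:
    print(step)
print("match starts at", index)  # match starts at 3
```

In the printed trace the mismatch appears as (5, 5, 'fallback'), followed by (5, 2, 'match'): i stays at 5 while j drops from 5 to next[4] = 2.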

KMP algorithm implementation (without prefix table generation)

The generation of the prefix table is not implemented yet. Only the matching logic of the KMP algorithm is implemented here, using a hand-computed table.

JS algorithm source code

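/**
 * @param {*} s main string (text)
 * @param {*} t pattern string
 * @returns index in s of the first character of the substring matching t, or -1 if no match is found
 */
function indexOf(s, t) {
  // Hand-computed prefix table for T = "aabaaf"
  let next = [0, 1, 0, 1, 2, 0];
  // Hand-computed prefix table for T = "cabaa"
  // next = [0, 0, 0, 0, 0];

  let i = 0; // pointer scanning S
  let j = 0; // pointer scanning T

  // Stop when i reaches the end of S or j reaches the end of T
  while (i < s.length && j < t.length) {
    if (s[i] == t[j]) {
      // s[i] == t[j]: the current position matches, so move on to the next one
      i++;
      j++;
    } else {
      // s[i] != t[j]: the current position fails to match.
      // Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
      if (j > 0) {
        j = next[j - 1];
      } else {
        // j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
        i++;
      }
    }
  }

  // If a substring matching T exists in S, j has scanned every character of T, so j == t.length
  if (j >= t.length) {
    // The match starts at i - t.length, because i ends one position past the end of the matched substring
    return i - j;
  } else {
    // Otherwise no substring of S matches T
    return -1;
  }
}

const s = "aabaabaafaab";
let t = "aabaaf";
// t = "cabaa"; // this T tests the case where the very first character fails to match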
console.log(indexOf(s, t));

Java algorithm source code

public class Main {
  public static void main(String[] args) {
    String s = "aabaabaaf";
    String t = "aabaaf";
    //    t = "cabaa"; // this T tests the case where the very first character fails to match

    System.out.println(indexOf(s, t));
  }

  /**
   * @param s main string (text)
   * @param t pattern string
   * @return index in s of the first character of the substring matching t, or -1 if no match is found
   */
  public static int indexOf(String s, String t) {
    // Hand-computed prefix table for T = "aabaaf"
    int[] next = {0, 1, 0, 1, 2, 0};
    // Hand-computed prefix table for T = "cabaa"
    //    next = new int[] {0, 0, 0, 0, 0};

    int i = 0; // pointer scanning S
    int j = 0; // pointer scanning T

    // Stop when i reaches the end of S or j reaches the end of T
    while (i < s.length() && j < t.length()) {
      // s[i] == t[j]: the current position matches, so move on to the next one
      if (s.charAt(i) == t.charAt(j)) {
        i++;
        j++;
      } else {
        // s[i] != t[j]: the current position fails to match.
        // Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
        if (j > 0) {
          j = next[j - 1];
        } else {
          // j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
          i++;
        }
      }
    }

    // If a substring matching T exists in S, j has scanned every character of T, so j == t.length()
    if (j == t.length()) {
      // The match starts at i - t.length(), because i ends one position past the end of the matched substring
      return i - j;
    } else {
      // Otherwise no substring of S matches T
      return -1;
    }
  }
}

Python algorithm source code

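def indexOf(s, t):
    """
    :param s: main string (text)
    :param t: pattern string
    :return: index in s of the first character of the substring matching t, or -1 if no match is found
    """

    # Hand-computed prefix table for T = "aabaaf"
    next = [0, 1, 0, 1, 2, 0]

    # Hand-computed prefix table for T = "cabaa"
    # next = [0, 0, 0, 0, 0]

    i = 0  # pointer scanning S
    j = 0  # pointer scanning T

    # Stop when i reaches the end of S or j reaches the end of T
    while i < len(s) and j < len(t):
        # s[i] == t[j]: the current position matches, so move on to the next one
        if s[i] == t[j]:
            i += 1
            j += 1
        else:
            # s[i] != t[j]: the current position fails to match.
            # Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
            if j > 0:
                j = next[j - 1]
            else:
                # j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
                i += 1

    # If a substring matching T exists in S, j has scanned every character of T, so j == len(t)
    if j >= len(t):
        # The match starts at i - len(t), because i ends one position past the end of the matched substring
        return i - j
    else:
        # Otherwise no substring of S matches T
        return -1


if __name__ == '__main__':
    s = "aabaabaaf"
    t = "aabaaf"
    # t = "cabaa"  # this T tests the case where the very first character fails to match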

    print(indexOf(s, t))

Generation of prefix table

We have already computed the prefix table by hand, but that hand computation is brute-force enumeration: list all prefixes and suffixes, then compare each prefix with the suffix of the same length.

The prefix table can instead be generated with dynamic programming.

Suppose we want NEXT[J], and NEXT[J-1] = K is already known, as in the following figure:

If T[J] == T[K], then NEXT[J] = K + 1.

(PS: If this is unclear, replace the "?" in the figure above with "d" and compute NEXT[J] by hand.)

If T[J] != T[K], how do we find NEXT[J]?

Here we can change perspective and reuse the idea of the KMP algorithm itself. As shown in the figure below, imagine the T string as two copies of itself, the strings SS and TT.

The SS string is the suffix range of the original T string, and the TT string is the prefix range of the original T string.

We have established that SS[J] != TT[K], so we move the K pointer of the TT string back, to position NEXT[K-1]:

Then compare T[J] and T[K] again:

  • If T[J] == T[K], then NEXT[J] = K + 1

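Why can we assume that the part 0 ~ K-1 is identical to the part J-K ~ J-1?

Map the parts 0 ~ K-1 and J-K ~ J-1 back onto the T string, as shown in the figure below:

Going one step further, as shown in the figure below:

Write K' for the value of K before the fallback. NEXT[J-1] = K' means that the first K' characters of T equal the K' characters ending at index J-1. The new K = NEXT[K'-1] means that the first K characters of T equal the last K of those first K' characters, and therefore also the K characters ending at index J-1.

  • If T[J] != T[K], then set K = NEXT[K-1] again and repeat the comparison; if K is already 0, there is no shorter prefix left to try and NEXT[J] = 0

So generating the prefix table itself applies the KMP algorithm. The difference is that there is only the single string T, which we treat as two virtual strings: SS (the virtual main string) and TT (the virtual pattern string).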

For the code, see the getNext method in the implementation in the following section. Comparing it with the KMP matching logic shows how similar the two are.
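Since the figures for the fallback step are not reproduced here, the following Python sketch records every fallback K = NEXT[K-1] made while the table is built. The function name next_with_fallbacks and the pattern "aabaaa" are my own illustrative choices; the loop follows the J and K rules described above.

```python
def next_with_fallbacks(t):
    """Build the prefix table and log each fallback as (J, old K, new K)."""
    table = [0] * len(t)
    fallbacks = []
    j, k = 1, 0
    while j < len(t):
        if t[j] == t[k]:
            table[j] = k + 1       # T[J] == T[K]: NEXT[J] = K + 1
            j += 1
            k += 1
        elif k > 0:
            fallbacks.append((j, k, table[k - 1]))
            k = table[k - 1]       # T[J] != T[K]: K = NEXT[K-1], J stays
        else:
            j += 1                 # K == 0 and still no match: NEXT[J] stays 0
    return table, fallbacks


table, fallbacks = next_with_fallbacks("aabaaa")
print(table)      # [0, 1, 0, 1, 2, 2]
print(fallbacks)  # [(2, 1, 0), (5, 2, 1)]
```

At J = 5 the comparison against T[2] = "b" fails, K falls back from 2 to NEXT[1] = 1, and T[5] == T[1] then gives NEXT[5] = 2.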

KMP algorithm implementation (including prefix table generation)

Java algorithm source code

public class Main {
  public static void main(String[] args) {
    String s = "xyz";
    String t = "z";

    System.out.println(indexOf(s, t));
  }

  /**
   * @param s main string (text)
   * @param t pattern string
   * @return index in s of the first character of the substring matching t, or -1 if no match is found
   */
  public static int indexOf(String s, String t) {
    int[] next = getNext(t);

    int i = 0; // pointer scanning S
    int j = 0; // pointer scanning T

    // Stop when i reaches the end of S or j reaches the end of T
    while (i < s.length() && j < t.length()) {
      // s[i] == t[j]: the current position matches, so move on to the next one
      if (s.charAt(i) == t.charAt(j)) {
        i++;
        j++;
      } else {
        // s[i] != t[j]: the current position fails to match.
        // Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
        if (j > 0) {
          j = next[j - 1];
        } else {
          // j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
          i++;
        }
      }
    }

    // If a substring matching T exists in S, j has scanned every character of T, so j == t.length()
    if (j == t.length()) {
      // The match starts at i - t.length(), because i ends one position past the end of the matched substring
      return i - j;
    } else {
      // Otherwise no substring of S matches T
      return -1;
    }
  }

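  public static int[] getNext(String t) {
    int[] next = new int[t.length()]; // every entry starts at 0, including next[0]

    int j = 1; // end index of the substring t[0..j] whose entry is being computed
    int k = 0; // length of the identical prefix/suffix matched so far

    while (j < t.length()) {
      if (t.charAt(j) == t.charAt(k)) {
        // The identical prefix/suffix grows by one character
        next[j] = k + 1;
        j++;
        k++;
      } else {
        if (k > 0) {
          // Fall back to the next shorter candidate, as in the matching loop
          k = next[k - 1];
        } else {
          // No candidate left: next[j] stays 0
          j++;
        }
      }
    }

    return next;
  }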
}

JS algorithm source code

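/**
 * @param {*} s main string (text)
 * @param {*} t pattern string
 * @returns index in s of the first character of the substring matching t, or -1 if no match is found
 */
function indexOf(s, t) {
  let next = getNext(t);

  let i = 0; // pointer scanning S
  let j = 0; // pointer scanning T

  // Stop when i reaches the end of S or j reaches the end of T
  while (i < s.length && j < t.length) {
    if (s[i] == t[j]) {
      // s[i] == t[j]: the current position matches, so move on to the next one
      i++;
      j++;
    } else {
      // s[i] != t[j]: the current position fails to match.
      // Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
      if (j > 0) {
        j = next[j - 1];
      } else {
        // j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
        i++;
      }
    }
  }

  // If a substring matching T exists in S, j has scanned every character of T, so j == t.length
  if (j >= t.length) {
    // The match starts at i - t.length, because i ends one position past the end of the matched substring
    return i - j;
  } else {
    // Otherwise no substring of S matches T
    return -1;
  }
}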

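function getNext(t) {
  const next = new Array(t.length).fill(0); // every entry starts at 0, including next[0]

  let j = 1; // end index of the substring t[0..j] whose entry is being computed
  let k = 0; // length of the identical prefix/suffix matched so far

  while (j < t.length) {
    if (t[j] == t[k]) {
      // The identical prefix/suffix grows by one character
      next[j] = k + 1;
      j++;
      k++;
    } else {
      if (k > 0) {
        // Fall back to the next shorter candidate, as in the matching loop
        k = next[k - 1];
      } else {
        // No candidate left: next[j] stays 0
        j++;
      }
    }
  }

  return next;
}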

const s = "aabaabaafaab";
let t = "aabaaf";
console.log(indexOf(s, t));

Python algorithm source code

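def getNext(t):
    next = [0] * len(t)  # every entry starts at 0, including next[0]

    j = 1  # end index of the substring t[0..j] whose entry is being computed
    k = 0  # length of the identical prefix/suffix matched so far

    while j < len(t):
        if t[j] == t[k]:
            # The identical prefix/suffix grows by one character
            next[j] = k + 1
            j += 1
            k += 1
        else:
            if k > 0:
                # Fall back to the next shorter candidate, as in the matching loop
                k = next[k - 1]
            else:
                # No candidate left: next[j] stays 0
                j += 1

    return next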


def indexOf(s, t):
    """
    :param s: main string (text)
    :param t: pattern string
    :return: index in s of the first character of the substring matching t, or -1 if no match is found
    """

    next = getNext(t)

    i = 0  # pointer scanning S
    j = 0  # pointer scanning T

    # Stop when i reaches the end of S or j reaches the end of T
    while i < len(s) and j < len(t):
        # s[i] == t[j]: the current position matches, so move on to the next one
        if s[i] == t[j]:
            i += 1
            j += 1
        else:
            # s[i] != t[j]: the current position fails to match.
            # Per KMP, only j moves back, to next[j - 1] (the position just after the end of the longest identical prefix); i stays where it is
            if j > 0:
                j = next[j - 1]
            else:
                # j == 0: the first character of T already fails to match and there is no prefix/suffix before t[j] to fall back to, so move on to the next position in S
                i += 1

    # If a substring matching T exists in S, j has scanned every character of T, so j == len(t)
    if j >= len(t):
        # The match starts at i - len(t), because i ends one position past the end of the matched substring
        return i - j
    else:
        # Otherwise no substring of S matches T
        return -1


if __name__ == '__main__':
    s = "aabaabaaf"
    t = "aabaaf"
    # t = "cabaa"  # this T tests the case where the very first character fails to match

    print(indexOf(s, t))
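As a sanity check, the implementation can be compared with Python's built-in str.find on random inputs, while also counting loop iterations to confirm the linear bound on the matching phase. This is a minimal self-contained sketch: get_next and index_of are restatements of the functions above under my own names, and the extra step counter is an addition for illustration.

```python
import random


def get_next(t):
    """Prefix table of t, built the same way as getNext above."""
    table = [0] * len(t)
    j, k = 1, 0
    while j < len(t):
        if t[j] == t[k]:
            table[j] = k + 1
            j += 1
            k += 1
        elif k > 0:
            k = table[k - 1]
        else:
            j += 1
    return table


def index_of(s, t):
    """Return (index of the first match or -1, number of loop iterations)."""
    table = get_next(t)
    i = j = steps = 0
    while i < len(s) and j < len(t):
        steps += 1
        if s[i] == t[j]:
            i += 1
            j += 1
        elif j > 0:
            j = table[j - 1]
        else:
            i += 1
    return (i - j if j == len(t) else -1), steps


random.seed(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    s = "".join(random.choice("ab") for _ in range(random.randint(0, 20)))
    t = "".join(random.choice("ab") for _ in range(random.randint(1, 5)))
    index, steps = index_of(s, t)
    assert index == s.find(t)   # same answer as the built-in search
    assert steps <= 2 * len(s)  # each fallback is paid for by an earlier match
print("all checks passed")
```

The 2 * len(s) bound holds because every iteration either advances i or lowers j, and j can only be lowered as many times as it was previously raised by a match.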

Origin blog.csdn.net/qfc_128220/article/details/131311563