String matching: the BM algorithm

I was preparing for an interview recently and ran into a string matching question before I had ever heard of the BM algorithm (I can only blame myself for not studying algorithms in more depth)!

So, what is the BM algorithm? Let's start with the explanation in Baidu Encyclopedia:

In computer science, the Boyer-Moore string search algorithm is a very efficient string search algorithm. It was designed in 1977 by Bob Boyer and J Strother Moore. The algorithm preprocesses only the search target (the keyword), not the string being searched. Although its execution time also depends linearly on the size of the searched string, it usually takes only a fraction of the time of other algorithms: it does not need to compare the characters of the searched string one by one, and it skips over some of them. Generally, the longer the keyword, the faster the algorithm. Its efficiency comes from the fact that every failed matching attempt gives the algorithm information it can use to rule out as many non-matching positions as possible.

Let's now look at the BM algorithm in detail. This post draws on several other articles and also serves as my own notes on understanding the algorithm; I hope it is helpful to readers. Without further ado, let's get to the topic!

To understand an algorithm, you first need to know its principle and the idea behind it, so we start with the principle of the BM algorithm:

The BM algorithm is based on suffix comparison (it compares from right to left), and it actually combines two rules that are applied side by side: the bad character rule and the good suffix rule.

(Recommended reading on the idea behind the BM algorithm: the blog post "string matching BM algorithm learning".)

First of all, the two rules need to be made clear:

1. Bad character rule

Shift = position of the bad character - last occurrence position of the bad character in the pattern string

Here the "position of the bad character" is the position in the pattern string that the bad character is compared against. If the "bad character" does not occur in the pattern string, its last occurrence position is -1.

For example, suppose a "G" in the main string is compared with the "H" of the pattern string "FGH" and does not match. "G" is then called the "bad character". It faces position 2 of the pattern string (counting from 0), and its last occurrence in the pattern string is at position 1, so the pattern string is shifted back by 2 - 1 = 1 place.
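As a small sketch of the bad character rule (the class and method names are hypothetical, chosen only for illustration), the shift can be computed with String.lastIndexOf. One assumption here: only positions at or to the left of the mismatch are searched, which keeps the shift at 1 or more.

```java
class BadCharacterDemo {
    // Shift = position of the mismatch in the pattern
    //         - last position of the bad character in the pattern (-1 if absent).
    // Assumption: the search starts at the mismatch position and goes left,
    // so the result is always at least 1.
    static int badCharacterShift(String pattern, char badChar, int mismatchPos) {
        return mismatchPos - pattern.lastIndexOf(badChar, mismatchPos);
    }

    public static void main(String[] args) {
        // 'G' faces 'H' (position 2) of "FGH"; 'G' last occurs at position 1.
        System.out.println(badCharacterShift("FGH", 'G', 2)); // 2 - 1 = 1
        // 'C' does not occur in "FGH", so its last position is -1.
        System.out.println(badCharacterShift("FGH", 'C', 2)); // 2 - (-1) = 3
    }
}
```

The second call shows the best case for the rule: a bad character that is absent from the pattern string lets the whole pattern jump past it.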

2. Good suffix rule

Shift = position of the good suffix - previous occurrence position of the good suffix in the pattern string

For example, suppose the trailing "AB" of the pattern string "ABCDEAB" is a "good suffix". Its position is 6 (counting from 0 and taking the position of the last "B"), and its previous occurrence in the pattern string is at position 1 (the position of the first "B"), so the pattern string is shifted back by 6 - 1 = 5 places, which moves the earlier "AB" to where the later "AB" was.

To give another example, if the "EF" of the pattern string "ABCDEF" is a good suffix, the position of "EF" is 5 and its previous occurrence position is -1 (that is, it does not occur again), so the pattern string is shifted back by 5 - (-1) = 6 places, that is, the whole string moves to the position just after "F".

There are three points to note about this rule:

  1. The position of a "good suffix" is the position of its last character. Assuming that the "EF" of "ABCDEF" is a good suffix, its position is that of "F", which is 5 (counting from 0).
  2. If a "good suffix" occurs only once in the pattern string, its previous occurrence position is -1. For example, "EF" occurs only once in "ABCDEF", so its previous occurrence position is -1 (that is, it does not occur again).
  3. If there are several "good suffixes", then for every one except the longest, the previous occurrence must be at the head of the pattern string. For example, if the good suffixes of "BABCDAB" are "DAB", "AB" and "B", what is the previous occurrence position? The longest one, "DAB", does not occur again, and neither "DAB" nor "AB" occurs at the head, so the good suffix used is "B": its previous occurrence is at the head, that is, position 0.
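The good suffix rule and its three notes can be sketched in Java as follows. This is a minimal, unoptimized sketch rather than a definitive implementation: GoodSuffixDemo and goodSuffixShift are hypothetical names, and instead of looking up the previous occurrence directly, the method tries shifts from small to large and returns the first one after which the shifted pattern still agrees with the good suffix. For the "ABCDEF" and "BABCDAB" examples this gives the same numbers as the formula.

```java
class GoodSuffixDemo {
    // pattern:   the pattern string
    // suffixLen: length of the matched "good suffix" at the end of the pattern
    // Returns the smallest shift (1..pattern.length()) after which the pattern
    // agrees with the good suffix wherever the two still overlap.
    static int goodSuffixShift(String pattern, int suffixLen) {
        int m = pattern.length();
        int firstGood = m - suffixLen; // index of the first good-suffix character
        for (int shift = 1; shift < m; shift++) {
            boolean agrees = true;
            // Compare each good-suffix character with the pattern character that
            // lands under it after the shift; stop at the head of the pattern.
            for (int i = m - 1; i >= firstGood && i - shift >= 0; i--) {
                if (pattern.charAt(i - shift) != pattern.charAt(i)) {
                    agrees = false;
                    break;
                }
            }
            if (agrees) {
                return shift;
            }
        }
        return m; // nothing can be reused: move the whole pattern past the suffix
    }

    public static void main(String[] args) {
        // "EF" occurs only once in "ABCDEF": 5 - (-1) = 6
        System.out.println(goodSuffixShift("ABCDEF", 2));
        // Good suffix "DAB" of "BABCDAB": only "B" recurs at the head: 6 - 0 = 6
        System.out.println(goodSuffixShift("BABCDAB", 3));
    }
}
```

Trying shifts in increasing order means the first hit is always the smallest safe move, so a longer reusable suffix is never skipped in favour of a shorter one.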

Now suppose we have the following requirement: we want to know whether one string occurs in another, for example

String originText = "ABCDEFGHHH";
String moduleText = "FGH";

Determine whether moduleText occurs in originText; if it does, return the index of the position where it occurs, and return -1 if it does not. (Of course, I will set aside the String API for now and start from the algorithm itself!) A note on terminology: moduleText, the string being matched, is called the pattern string (also known as the search term), and originText, the string being searched, is called the main string.
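Before turning to BM itself, a brute-force matcher with the same contract is handy as a reference to check results against. The sketch below is only a baseline, not the BM algorithm; NaiveSearchDemo and naiveMatch are hypothetical names.

```java
class NaiveSearchDemo {
    // Tries every start position from left to right.
    // Returns the index of the first occurrence, or -1 if there is none.
    static int naiveMatch(String originText, String moduleText) {
        if (originText == null || moduleText == null || moduleText.isEmpty()) {
            return -1;
        }
        for (int start = 0; start + moduleText.length() <= originText.length(); start++) {
            if (originText.startsWith(moduleText, start)) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(naiveMatch("ABCDEFGHHH", "FGH"));  // 5
        System.out.println(naiveMatch("ABCDEFGHHH", "FGGH")); // -1, "FGGH" does not occur
        System.out.println("ABCDEFGHHH".indexOf("FGH"));      // 5, the String API agrees
    }
}
```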

1. First, the pattern string is aligned with the head of the main string, and the comparison starts from the tail. This idea is very efficient, because if the tail characters do not match, a single comparison is enough to know that the first 3 characters (as a whole) are definitely not the result we are looking for. We see that "C" does not match "H", so "C" is called the "bad character". Since "C" does not occur in the pattern string, the bad character rule gives 2 - (-1) = 3; there is no good suffix yet, so the good suffix rule gives -1. The larger of the two is chosen as the shift, so here the shift is 3.

2. Comparing from the tail again, we find that "F" and "H" do not match, so "F" is the "bad character". "F" last occurs at position 0 of the pattern string, so the shift is 2 - 0 = 2.

Continuing in the same way, the pattern string finally lines up with the "FGH" at index 5 of the main string, and every character matches.
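The walk-through can be reproduced with a short trace program. This is a simplified sketch under stated assumptions: it uses the pattern "FGH" from the steps above, it applies only the bad character rule (enough for this example, since no good suffix ever arises), and BmTraceDemo and search are hypothetical names.

```java
class BmTraceDemo {
    // Right-to-left comparison with bad-character shifts only.
    // Prints each alignment tried and returns the match index, or -1.
    static int search(String text, String pattern) {
        int m = pattern.length();
        int start = 0; // index in text aligned with pattern position 0
        while (start + m <= text.length()) {
            int j = m - 1;
            // Compare from the tail of the pattern towards its head.
            while (j >= 0 && text.charAt(start + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                System.out.println("start=" + start + " match");
                return start;
            }
            char bad = text.charAt(start + j);
            // Bad character rule, searching left from the mismatch position.
            int shift = j - pattern.lastIndexOf(bad, j);
            System.out.println("start=" + start + " bad=" + bad + " shift=" + shift);
            start += shift;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Prints:
        //   start=0 bad=C shift=3
        //   start=3 bad=F shift=2
        //   start=5 match
        System.out.println(search("ABCDEFGHHH", "FGH")); // 5
    }
}
```

Only three alignments are tried for a 10-character main string, which is the skipping behaviour the rule is meant to provide.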

The implementation code is as follows

package www.supermaster.cn.text;

/**
 * Bad character rule: shift = position of the bad character - last occurrence position of the bad character in the pattern string
 * 
 * Good suffix rule: shift = position of the good suffix - previous occurrence position in the pattern string
 * 
 */
public class BMTest
{

	public static void main(String[] args)
	{
		// Main string
		String originText = "ABCDEFGHHFGHH";
		// Pattern string
		String moduleText = "FGH";

		System.out.println("Main string: " + originText);
		System.out.println("Pattern string: " + moduleText);

		int index = bmMatch(originText, moduleText);
		System.out.println("Match index: " + index);
	}

	/**
	 * Matches a string using the BM algorithm.
	 * 
	 * @param originText
	 *            main string
	 * @param moduleText
	 *            pattern string
	 * @return the index of the match if matching succeeds, otherwise -1
	 */
	public static int bmMatch(String originText, String moduleText)
	{
		// The main string must not be null or empty
		if (originText == null || originText.isEmpty())
		{
			return -1;
		}
		// The pattern string must not be null or empty
		if (moduleText == null || moduleText.isEmpty())
		{
			return -1;
		}

		// If the pattern string is longer than the main string, it cannot match
		if (moduleText.length() > originText.length())
		{
			return -1;
		}

		int moduleSuffix = moduleText.length() - 1; // index of the last character of the pattern string
		int moduleIndex = moduleSuffix; // current comparison position in the pattern string
		int originIndex = moduleSuffix; // current comparison position in the main string
		int tailIndex = moduleSuffix; // position in the main string aligned with the pattern tail

		while (originIndex < originText.length() && moduleIndex >= 0)
		{
			char och = originText.charAt(originIndex); // current char of the main string
			char mch = moduleText.charAt(moduleIndex); // current char of the pattern string

			if (och == mch)
			{
				// Characters match: keep comparing towards the left
				originIndex--;
				moduleIndex--;
			}
			else
			{
				// Bad character rule
				int badMove = badCharacterRule(moduleText, och, moduleIndex);
				// Good suffix rule
				int goodMove = goodCharacterRule(moduleText, moduleIndex);

				// The main string stays put; the pattern string moves right by the larger shift
				tailIndex += Math.max(badMove, goodMove);
				originIndex = tailIndex;
				moduleIndex = moduleSuffix;
			}
		}

		if (moduleIndex < 0)
		{
			// originIndex was decremented one step past the start of the match
			return originIndex + 1;
		}

		return -1;
	}

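	/**
	 * Computes the shift using the good suffix rule.
	 * 
	 * @param moduleText
	 *            pattern string
	 * @param charSuffix
	 *            index in the pattern string where the mismatch occurred
	 * @return the shift, or -1 if there is no good suffix
	 */
	private static int goodCharacterRule(String moduleText, int charSuffix)
	{
		// Length of the pattern string
		int moduleMax = moduleText.length();

		// Number of matched characters, i.e. the length of the good suffix
		int charSize = moduleMax - 1 - charSuffix;
		if (charSize <= 0)
		{
			return -1;
		}

		// Try shifts from small to large and take the first one after which the
		// pattern string still agrees with the good suffix wherever they overlap.
		// This covers both an earlier occurrence of the whole good suffix and a
		// shorter good suffix that sits at the head of the pattern string.
		for (int shift = 1; shift < moduleMax; shift++)
		{
			boolean agrees = true;
			for (int i = moduleMax - 1; i > charSuffix && i - shift >= 0; i--)
			{
				if (moduleText.charAt(i - shift) != moduleText.charAt(i))
				{
					agrees = false;
					break;
				}
			}
			if (agrees)
			{
				return shift;
			}
		}

		// No part of the good suffix can be reused: shift by the whole pattern length
		return moduleMax;
	}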

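	/**
	 * Computes the shift using the bad character rule.
	 * 
	 * @param moduleText
	 *            pattern string
	 * @param badChar
	 *            the mismatched character from the main string
	 * @param charSuffix
	 *            index in the pattern string where the mismatch occurred
	 * @return the shift (always at least 1)
	 */
	private static int badCharacterRule(String moduleText, char badChar, int charSuffix)
	{
		// lastIndexOf searches leftwards from charSuffix and returns -1 if badChar is absent
		return charSuffix - moduleText.lastIndexOf(badChar, charSuffix);
	}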

}

This article is aimed mainly at interview preparation, so it mostly just presents the code; I will keep improving it. If readers have good suggestions, please leave a comment so that we can make progress together!

----------------------------------------------------------------------------------------
Author: World coding
Source: CSDN
Original: https://blog.csdn.net/dgxin_605/article/details/92360040
Copyright statement: This is an original article by the blogger; please include a link to the blog post if you reprint it!

 
