String matching basics (part 2): how a text editor implements its find function

Replacing every occurrence of one word with another in Word can be implemented with the BM (Boyer-Moore) algorithm

The core idea of the BM algorithm

While matching the pattern string against the main string, suppose a character c in the main string does not occur anywhere in the pattern string. Then, as the pattern slides backward, no position that still overlaps c can possibly match. The pattern can therefore be slid several positions at once, directly to the position just after c.

Analysis of how the BM algorithm works

The BM algorithm consists of two parts: the bad character rule and the good suffix rule

1. Bad character rule

The pattern string is matched in reverse, from its largest index down to its smallest

We compare from the end of the pattern string toward its front. When a character fails to match, we call that character of the main string the bad character. Suppose the bad character is c. We look for c in the pattern string and find that the pattern does not contain it. In this case we can slide the pattern back three positions at once (the pattern has only 3 characters), to just after c, and then start comparing again from the last character of the pattern

This time the bad character is a. Because a does occur in the pattern, we cannot slide back three positions. Instead we slide the pattern back two positions so that the two a's line up vertically, and then match again starting from the last character of the pattern

When a mismatch occurs, let si be the index of the pattern character that faces the bad character. If the bad character occurs in the pattern string, let xi be the index of that occurrence; if it does not occur, let xi = -1. The pattern then slides back si - xi positions (both indices are indices into the pattern string)

Main string:  a b c a c a b d c
Pattern:            a b d

Here the bad character is a. The pattern character facing it is d, whose index is 2, so si = 2. The bad character a occurs in the pattern at index 0, so xi = 0, and the pattern slides back si - xi = 2 positions

If the bad character occurs several times in the pattern string, use the rearmost occurrence when computing xi, so that the pattern does not slide too far and skip a possible match
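The si - xi rule can be sketched in a few lines of Java. This is only an illustration: the class name BadCharShiftDemo and the method shift are made up for this example, and the linear scan stands in for the table lookup a real implementation would use.

```java
class BadCharShiftDemo {
    // Returns si - xi for a mismatch at pattern index si
    // against the main-string character bad.
    static int shift(char[] pattern, int si, char bad) {
        int xi = -1; // -1 means the bad character does not occur in the pattern
        for (int k = 0; k < pattern.length; k++) {
            if (pattern[k] == bad) {
                xi = k; // later hits overwrite earlier ones: rearmost occurrence
            }
        }
        return si - xi;
    }

    public static void main(String[] args) {
        char[] pattern = "abd".toCharArray();
        System.out.println(shift(pattern, 2, 'c')); // c is absent: 2 - (-1) = 3
        System.out.println(shift(pattern, 2, 'a')); // a is at index 0: 2 - 0 = 2
    }
}
```

Note that the result can be zero or negative when the rearmost occurrence lies at or behind the mismatch position, for example shift("baaa", 0, 'a') gives -3.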

The bad character rule alone is not enough, because si - xi can come out negative. We also need the "good suffix rule"

2. Good suffix rule

When the pattern string has slid to the position shown below, its last two characters match the main string, and the third character from the end does not

Main string:  a b c a c a b c b c b a c a b c
Pattern:            a b b c a b c

The matched part, bc, is the good suffix. The character just before it is where the mismatch occurs

We could use the bad character rule to compute how far to slide the pattern string, but we can also handle this case with the good suffix rule

The suffix bc that has already matched is called the good suffix, written {u}. We search the pattern string for another substring {u*} that matches {u}. If we find one, we slide the pattern so that {u*} lines up with {u} in the main string. If no other substring equals {u}, we slide the pattern directly to just after {u} in the main string, because no smaller slide can line a copy of {u} up with the main string. However, doing so may slide too far

So we must not only check whether another substring of the pattern matches the good suffix. We must also check whether any suffix substring of the good suffix matches a prefix substring of the pattern

A suffix substring of a string s is a substring whose last character is aligned with the last character of s; for example, the suffix substrings of abc are c and bc. A prefix substring is a substring whose first character is aligned with the first character of s; for example, the prefix substrings of abc are a and ab. Among the suffix substrings of the good suffix, we look for the longest one that matches a prefix substring of the pattern, call it {v}, and slide the pattern so that this prefix lines up with {v}
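A brute-force sketch of the search for {v} may make the definition concrete. The class and method names are hypothetical, and the preprocessing tables introduced later in the article replace this scan in the real algorithm.

```java
class GoodSuffixPrefixDemo {
    // Returns the length of the longest suffix substring of goodSuffix
    // (shorter than goodSuffix itself) that is also a prefix substring
    // of pattern, or 0 if there is none.
    static int longestSuffixMatchingPrefix(String pattern, String goodSuffix) {
        for (int len = goodSuffix.length() - 1; len >= 1; len--) {
            String v = goodSuffix.substring(goodSuffix.length() - len);
            if (pattern.startsWith(v)) {
                return len; // lengths are tried longest first
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Good suffix bcab of cabcab: its suffix substring cab is a prefix of the pattern.
        System.out.println(longestSuffixMatchingPrefix("cabcab", "bcab")); // 3
        // Good suffix bc of abbcabc: its only suffix substring, c, is not a prefix.
        System.out.println(longestSuffixMatchingPrefix("abbcabc", "bc")); // 0
    }
}
```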

How do we choose between the good suffix rule and the bad character rule?

Compute the slide distance given by the good suffix rule and the one given by the bad character rule, then take the larger of the two as the number of positions to slide the pattern string

BM algorithm code implementation

How do we find where the bad character occurs in the pattern string?

We can store each character of the pattern string together with its index in a hash table, so that the position of the bad character in the pattern can be looked up quickly. For the hash table, assume the character set is small and every character is 1 byte long. We then use an array of size 256 to record where each character occurs in the pattern: the array index is the character's ASCII code, and the array value is the position at which that character occurs in the pattern string

Pattern string: a b d a
Index:          0 1 2 3

bc index (ASCII code):   0    1  ...   97   98   99  100  ...  255
bc value:               -1   -1  ...    3    1   -1    2  ...   -1

ASCII codes: a = 97, b = 98, c = 99, d = 100, ...

In the code, b is the pattern string, m is the length of the pattern, and bc is the hash table

private static final int SIZE = 256; // global or member variable
private void generateBC(char[] b, int m, int[] bc) {
	for (int i = 0; i < SIZE; ++i) {
		bc[i] = -1;   // initialize bc: -1 means "not in the pattern"
	}
	for (int i = 0; i < m; ++i) {
		int ascii = (int) b[i]; // ASCII value of b[i]
		bc[ascii] = i;          // later occurrences overwrite earlier ones
	}
}
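As a quick check of the table above, the following self-contained sketch builds bc for the pattern abda. The class name BadCharTableDemo and the method buildTable are made up for this example, and it keeps the article's assumption that every character code is below 256.

```java
class BadCharTableDemo {
    static final int SIZE = 256; // assumes 1-byte characters, as in the article

    // Builds the bad character table: bc[c] is the last index of c in b, or -1.
    static int[] buildTable(char[] b) {
        int[] bc = new int[SIZE];
        java.util.Arrays.fill(bc, -1);
        for (int i = 0; i < b.length; i++) {
            bc[b[i]] = i; // a later occurrence overwrites an earlier one
        }
        return bc;
    }

    public static void main(String[] args) {
        int[] bc = buildTable("abda".toCharArray());
        System.out.println(bc['a']); // 3
        System.out.println(bc['b']); // 1
        System.out.println(bc['c']); // -1
        System.out.println(bc['d']); // 2
    }
}
```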

Below is a first version that implements only the bad character rule. It ignores the case where the slide distance si - xi comes out negative

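public int bm(char[] a, int n, char[] b, int m) {
	int[] bc = new int[SIZE];  // last position of each character in the pattern
	generateBC(b, m, bc);      // build the bad character hash table
	int i = 0;  // i is the main-string index aligned with the first pattern character
	while (i <= n - m) {
		int j;
		for (j = m - 1; j >= 0; --j) {   // match the pattern from back to front
			if (a[i + j] != b[j]) break; // the bad character faces pattern index j
		}
		if (j < 0) {
			return i;   // match found: return the main-string index where the pattern starts
		}
		// equivalent to sliding the pattern back j - bc[(int) a[i + j]] positions
		i = i + (j - bc[(int) a[i + j]]);
	}
	return -1;
}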
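To see why the negative case matters, take the main string aaaaaaaa and the pattern baaa: the mismatch is at j = 0 and bc['a'] = 3, so the slide is -3 and the version above would move the pattern the wrong way. The sketch below is a bad-character-only search with a guard that always advances at least one position. The guard and the class name BadCharOnlySearch are assumptions added for illustration; the article itself handles this case with the good suffix rule.

```java
class BadCharOnlySearch {
    static final int SIZE = 256; // assumes every character code is below 256

    // Returns the first index of b in a, or -1, using only the bad character rule.
    static int search(char[] a, char[] b) {
        int n = a.length, m = b.length;
        int[] bc = new int[SIZE];
        java.util.Arrays.fill(bc, -1);
        for (int i = 0; i < m; i++) {
            bc[b[i]] = i;
        }
        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0 && a[i + j] == b[j]) {
                j--; // compare from the end of the pattern toward the front
            }
            if (j < 0) {
                return i;
            }
            int shift = j - bc[a[i + j]]; // si - xi, possibly zero or negative
            i += Math.max(1, shift);      // guard: an assumption, not in the original code
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abcacabdc".toCharArray(), "abd".toCharArray())); // 5
        System.out.println(search("aaaaaaaa".toCharArray(), "baaa".toCharArray())); // -1
    }
}
```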

The good suffix is itself a suffix substring of the pattern string. Therefore, before actually matching the pattern against the main string, we can preprocess the pattern and compute, for each of its suffix substrings, the position of another substring that matches it

How do we represent the different suffix substrings of the pattern? The last character of every suffix substring is at a fixed position, index m-1, so we only need to record the length. The length alone identifies a unique suffix substring, for example

Pattern string: cabcab

Suffix substring   Length
b                  1
ab                 2
cab                3
bcab               4
abcab              5

We now introduce the key variable, the suffix array. Its index k is the length of a suffix substring, and the value stored at that index is the starting index, in the pattern string, of the substring {u*} that matches the good suffix {u}

Pattern string: c a b c a b
Index:          0 1 2 3 4 5

Suffix substring   Length   suffix
b                  1        suffix[1] = 2
ab                 2        suffix[2] = 1
cab                3        suffix[3] = 0
bcab               4        suffix[4] = -1
abcab              5        suffix[5] = -1

If several substrings of the pattern string match the suffix substring {u}, which starting position should the suffix array store? To avoid sliding too far, we store the starting position of the rearmost one, that is, the substring with the largest starting index

We must not only find another substring of the pattern that matches the good suffix. We must also find, among the suffix substrings of the good suffix, the longest one that matches a prefix substring of the pattern. For this we need a boolean array, prefix, which records whether each suffix substring of the pattern matches a prefix substring of the pattern

Pattern string: c a b c a b
Index:          0 1 2 3 4 5

Suffix substring   Length   suffix         prefix
b                  1        suffix = 2     prefix = false
ab                 2        suffix = 1     prefix = false
cab                3        suffix = 0     prefix = true
bcab               4        suffix = -1    prefix = false
abcab              5        suffix = -1    prefix = false

Take each substring of the pattern from index 0 to i (i runs from 0 to m-2) and find its common suffix substring with the whole pattern string. If the common suffix substring has length k, record suffix[k] = j, where j is the starting index of the common suffix substring. If j = 0, the common suffix substring is also a prefix substring of the pattern, so record prefix[k] = true

// b is the pattern string, m is its length; suffix and prefix are allocated by the caller
private void generateGS(char[] b, int m, int[] suffix, boolean[] prefix) {
	for (int i = 0; i < m; ++i) {   // initialize
		suffix[i] = -1;
		prefix[i] = false;
	}
	for (int i = 0; i < m - 1; ++i) {   // substring b[0, i]
		int j = i;
		int k = 0; // length of the common suffix substring
		while (j >= 0 && b[j] == b[m - 1 - k]) {   // common suffix with b[0, m-1]
			--j;
			++k;
			suffix[k] = j + 1;   // j+1 is the start index of the common suffix in b[0, i]
		}
		if (j == -1) prefix[k] = true; // the common suffix is also a prefix of the pattern
	}
}
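Running the preprocessing on cabcab reproduces the table shown earlier. The sketch below repeats the same loop in a self-contained class so it can be run on its own; the class name GoodSuffixTableDemo and the method build are made up for this example.

```java
class GoodSuffixTableDemo {
    // Same preprocessing as generateGS: fills suffix and prefix for pattern b.
    static void build(char[] b, int[] suffix, boolean[] prefix) {
        int m = b.length;
        java.util.Arrays.fill(suffix, -1);
        java.util.Arrays.fill(prefix, false);
        for (int i = 0; i < m - 1; ++i) {
            int j = i;
            int k = 0; // length of the common suffix of b[0, i] and b[0, m-1]
            while (j >= 0 && b[j] == b[m - 1 - k]) {
                --j;
                ++k;
                suffix[k] = j + 1; // a larger i overwrites: rearmost match wins
            }
            if (j == -1) {
                prefix[k] = true; // the common suffix reaches index 0
            }
        }
    }

    public static void main(String[] args) {
        char[] b = "cabcab".toCharArray();
        int[] suffix = new int[b.length];
        boolean[] prefix = new boolean[b.length];
        build(b, suffix, prefix);
        System.out.println(java.util.Arrays.toString(suffix)); // [-1, 2, 1, 0, -1, -1]
        System.out.println(java.util.Arrays.toString(prefix)); // only index 3 is true
    }
}
```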

While matching the pattern string against the main string, when we hit a character that does not match, how do we compute the slide distance according to the good suffix rule?

Let the length of the good suffix be k. We look up the substring matching the good suffix in the suffix array. If suffix[k] != -1 (-1 means no matching substring exists), we slide the pattern back j - suffix[k] + 1 positions (j is the index of the pattern character facing the bad character). If suffix[k] = -1, no other substring of the pattern matches the good suffix, and we fall back to the following rule:

For each suffix substring b[r, m-1] of the good suffix (r ranges from j+2 to m-1), its length is k = m - r. If prefix[k] = true, the suffix substring of length k matches a prefix substring of the pattern, so we can slide the pattern back r positions

If neither rule finds a substring that matches the good suffix or one of its suffix substrings, we slide the whole pattern back m positions
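The three cases can be traced with a small sketch. The suffix and prefix tables are hard-coded from the cabcab example, and the abcd tables are an extra illustration in which nothing matches; the class name GoodSuffixMoveDemo is made up for this example.

```java
class GoodSuffixMoveDemo {
    // j is the pattern index facing the bad character; m is the pattern length.
    static int moveByGS(int j, int m, int[] suffix, boolean[] prefix) {
        int k = m - 1 - j; // length of the good suffix
        if (suffix[k] != -1) {
            return j - suffix[k] + 1; // case 1: another substring matches {u}
        }
        for (int r = j + 2; r <= m - 1; ++r) {
            if (prefix[m - r]) {
                return r; // case 2: a suffix substring of {u} matches a prefix
            }
        }
        return m; // case 3: nothing matches, slide past the whole pattern
    }

    public static void main(String[] args) {
        int[] suffix = {-1, 2, 1, 0, -1, -1}; // tables for cabcab
        boolean[] prefix = {false, false, false, true, false, false};
        System.out.println(moveByGS(3, 6, suffix, prefix)); // good suffix ab: 3 - 1 + 1 = 3
        System.out.println(moveByGS(0, 6, suffix, prefix)); // good suffix abcab: r = 3
        int[] none = {-1, -1, -1, -1};        // tables for abcd
        System.out.println(moveByGS(2, 4, none, new boolean[4])); // 4
    }
}
```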

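// a and b are the main string and the pattern string; n and m are their lengths
public int bm(char[] a, int n, char[] b, int m) {
	int[] bc = new int[SIZE]; // last position of each character in the pattern
	generateBC(b, m, bc);     // build the bad character hash table
	int[] suffix = new int[m];
	boolean[] prefix = new boolean[m];
	generateGS(b, m, suffix, prefix); // build the good suffix tables
	int i = 0;  // i is the main-string index aligned with the first pattern character
	while (i <= n - m) {
		int j;
		for (j = m - 1; j >= 0; --j) {   // match the pattern from back to front
			if (a[i + j] != b[j]) break; // the bad character faces pattern index j
		}
		if (j < 0) {
			return i;   // match found: return the main-string index where the pattern starts
		}
		int x = j - bc[(int) a[i + j]]; // slide given by the bad character rule
		int y = 0;
		if (j < m - 1) {      // there is a good suffix
			y = moveByGS(j, m, suffix, prefix);
		}
		i = i + Math.max(x, y);
	}
	return -1;
}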

// j is the pattern index facing the bad character; m is the pattern length
private int moveByGS(int j, int m, int[] suffix, boolean[] prefix) {
	int k = m - 1 - j; // length of the good suffix
	if (suffix[k] != -1) return j - suffix[k] + 1;
	for (int r = j + 2; r <= m - 1; ++r) {
		if (prefix[m - r]) {
			return r;
		}
	}
	return m;
}
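Putting the pieces together, the following is a compact, self-contained sketch of the whole search that can be compiled and run as is. It follows the structure of the code above, but the class name BoyerMooreDemo, the String-based search signature, and the shortened helper names are choices made for this example. It keeps the article's assumption that every character code is below 256.

```java
class BoyerMooreDemo {
    static final int SIZE = 256; // assumes 1-byte characters

    static int[] buildBC(char[] b) {
        int[] bc = new int[SIZE];
        java.util.Arrays.fill(bc, -1);
        for (int i = 0; i < b.length; i++) bc[b[i]] = i;
        return bc;
    }

    static void buildGS(char[] b, int[] suffix, boolean[] prefix) {
        int m = b.length;
        java.util.Arrays.fill(suffix, -1);
        for (int i = 0; i < m - 1; ++i) {
            int j = i, k = 0;
            while (j >= 0 && b[j] == b[m - 1 - k]) {
                --j;
                ++k;
                suffix[k] = j + 1;
            }
            if (j == -1) prefix[k] = true;
        }
    }

    static int moveByGS(int j, int m, int[] suffix, boolean[] prefix) {
        int k = m - 1 - j;
        if (suffix[k] != -1) return j - suffix[k] + 1;
        for (int r = j + 2; r <= m - 1; ++r) {
            if (prefix[m - r]) return r;
        }
        return m;
    }

    // Returns the first index of pattern in text, or -1 if it does not occur.
    static int search(String text, String pattern) {
        char[] a = text.toCharArray(), b = pattern.toCharArray();
        int n = a.length, m = b.length;
        int[] bc = buildBC(b);
        int[] suffix = new int[m];
        boolean[] prefix = new boolean[m];
        buildGS(b, suffix, prefix);
        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0 && a[i + j] == b[j]) j--; // compare back to front
            if (j < 0) return i;
            int x = j - bc[a[i + j]];                                  // bad character rule
            int y = (j < m - 1) ? moveByGS(j, m, suffix, prefix) : 0;  // good suffix rule
            i += Math.max(x, y);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abcacabdc", "abd"));            // 5
        System.out.println(search("abcacabcbcbacabc", "abbcabc")); // -1
        System.out.println(search("aaaaaaaa", "baaa"));            // -1
    }
}
```

In the last call the bad character rule alone would give a negative slide; taking the maximum with the good suffix slide keeps the pattern moving forward.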

The idea of the BM algorithm is to exploit the characteristics of the pattern string itself: when a character of the pattern fails to match the main string, the pattern is slid back several positions at once, which avoids unnecessary comparisons


Origin blog.csdn.net/ywangjiyl/article/details/104480982