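Algorithms -- Strings (Sequence) (Part 22)

Notes from working through the course 恋上数据结构与算法. This post covers strings.

Strings (Sequence)

The strings studied here are the familiar strings from everyday development: finite sequences of characters. The main topic of this post is string matching algorithms.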

● The string thank
Prefixes (prefix): t, th, tha, than, thank
Proper prefixes (proper prefix): t, th, tha, than
Suffixes (suffix): thank, hank, ank, nk, k
Proper suffixes (proper suffix): hank, ank, nk, k
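The four definitions above can be checked mechanically. The following is a minimal sketch; the class name AffixDemo and its method names are made up for this illustration and do not come from the course.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to illustrate the definitions.
class AffixDemo {
    // All prefixes of s, shortest first; the last one is s itself.
    static List<String> prefixes(String s) {
        List<String> result = new ArrayList<>();
        for (int end = 1; end <= s.length(); end++) {
            result.add(s.substring(0, end));
        }
        return result;
    }

    // All suffixes of s, longest first; the first one is s itself.
    static List<String> suffixes(String s) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            result.add(s.substring(start));
        }
        return result;
    }

    // Proper prefixes: every prefix except s itself.
    static List<String> properPrefixes(String s) {
        List<String> all = prefixes(s);
        return all.subList(0, Math.max(0, all.size() - 1));
    }

    // Proper suffixes: every suffix except s itself.
    static List<String> properSuffixes(String s) {
        List<String> all = suffixes(s);
        return all.isEmpty() ? all : all.subList(1, all.size());
    }

    public static void main(String[] args) {
        System.out.println(prefixes("thank"));       // [t, th, tha, than, thank]
        System.out.println(properPrefixes("thank")); // [t, th, tha, than]
        System.out.println(suffixes("thank"));       // [thank, hank, ank, nk, k]
        System.out.println(properSuffixes("thank")); // [hank, ank, nk, k]
    }
}
```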

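String matching algorithms
For example, finding the position of a pattern string (pattern) within a text string (text)

Some classic string matching algorithms
Brute Force
KMP
Boyer-Moore, Karp-Rabin, Sunday
In this post, tlen denotes the length of the text string text, and plen denotes the length of the pattern string pattern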

Brute Force
Move the pattern from left to right one character at a time until a match is found
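There are 2 common ways to implement the brute-force algorithm
Brute force 1 – execution process
pi ranges over [0, plen), ti ranges over [0, tlen)

On a match: pi++, ti++
On a mismatch: ti -= pi - 1 (ti moves back to one position past where this round started), then pi = 0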

pi == plen means the match succeeded

Brute force 1 – Java implementation

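public static int indexOf01(String text, String pattern) {
		if (text == null || pattern == null) return -1;
		int tlen = text.length();
		int plen = pattern.length();
		if (tlen == 0 || plen == 0 || tlen < plen) return -1;
		int pi = 0, ti = 0;
		while (pi < plen && ti < tlen) {
			if (text.charAt(ti) == pattern.charAt(pi)) {
				// match: advance both indices
				ti++;
				pi++;
			} else {
				// mismatch: move ti to one past this round's start, restart the pattern
				ti -= pi - 1;
				pi = 0;
			}
		}
		// pi == plen means the whole pattern matched; ti - pi is where the match starts
		return pi == plen ? ti - pi : -1;
	}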

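Brute force 1 – optimization
The brute-force implementation above can exit early at the right moment, which reduces the number of comparisons
Once fewer than plen characters remain in text from the start of the current round, any further comparison is completely unnecessary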
● So the loop condition on ti can change from ti < tlen to ti - pi <= tlen - plen
ti - pi is the position in text of the first character compared in the current round

public static int indexOf02(String text, String pattern) {
		if (text == null || pattern == null) return -1;
		int tlen = text.length();
		int plen = pattern.length();
		if (tlen == 0 || plen == 0 || tlen < plen) return -1;
		int pi = 0, ti = 0;
		// ti - pi is the start of the current round; stop once the pattern no longer fits
		while (pi < plen && ti - pi <= tlen - plen) {
			if (text.charAt(ti) == pattern.charAt(pi)) {
				ti++;
				pi++;
			} else {
				// mismatch: start the next round one position to the right
				ti -= pi - 1;
				pi = 0;
			}
		}
		return pi == plen ? ti - pi : -1;
	}
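To see what the tighter loop condition saves, the sketch below runs the same loop with either condition and counts character comparisons. The class EarlyExitDemo, the earlyExit flag and the static counter are assumptions added for this experiment. The null checks are left out for brevity.

```java
// Hypothetical experiment class; not part of the original implementations.
class EarlyExitDemo {
    // Number of character comparisons made by the most recent call to indexOf.
    static int comparisons;

    // Brute force 1 with a switch between the two loop conditions:
    // earlyExit == false uses ti < tlen, earlyExit == true uses ti - pi <= tlen - plen.
    static int indexOf(String text, String pattern, boolean earlyExit) {
        comparisons = 0;
        int tlen = text.length();
        int plen = pattern.length();
        if (tlen == 0 || plen == 0 || tlen < plen) return -1;
        int pi = 0, ti = 0;
        while (pi < plen && (earlyExit ? ti - pi <= tlen - plen : ti < tlen)) {
            comparisons++;
            if (text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                ti -= pi - 1;
                pi = 0;
            }
        }
        return pi == plen ? ti - pi : -1;
    }

    public static void main(String[] args) {
        indexOf("abcdefgh", "xyz", false);
        System.out.println(comparisons); // 8: every position of text is tried
        indexOf("abcdefgh", "xyz", true);
        System.out.println(comparisons); // 6: the last two start positions cannot fit the pattern
    }
}
```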

Brute force 2 – execution process
pi ranges over [0, plen), ti ranges over [0, tlen - plen]
On a mismatch: pi = 0, ti++ (here ti is the start position of each round in text)

pi == plen means the match succeeded

public static int indexOf(String text, String pattern) {
		if (text == null || pattern == null) return -1;
		int tlen = text.length();
		int plen = pattern.length();
		if (tlen == 0 || plen == 0 || tlen < plen) return -1;
		// ti is the start of each round; the last useful start is tlen - plen
		for (int ti = 0; ti <= tlen - plen; ti++) {
			int pi = 0;
			for (; pi < plen; pi++) {
				if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
			}
			// the inner loop ran to the end, so the whole pattern matched at ti
			if (pi == plen) return ti;
		}
		return -1;
	}
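One quick sanity check is to compare the result with Java's built-in String.indexOf. The sketch below repeats the brute-force 2 logic so that it compiles on its own; the class name BruteForceCheck and the sample inputs are assumptions made for this check.

```java
// Hypothetical harness; the search method repeats the brute-force 2 logic.
class BruteForceCheck {
    static int indexOf(String text, String pattern) {
        if (text == null || pattern == null) return -1;
        int tlen = text.length();
        int plen = pattern.length();
        if (tlen == 0 || plen == 0 || tlen < plen) return -1;
        for (int ti = 0; ti <= tlen - plen; ti++) {
            int pi = 0;
            for (; pi < plen; pi++) {
                if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
            }
            if (pi == plen) return ti;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample inputs chosen for this sketch; all patterns are non-empty.
        String[][] cases = {
            {"hello world", "world"}, {"aaaab", "aab"}, {"abc", "abcd"}, {"abc", "x"}
        };
        for (String[] c : cases) {
            int expected = c[0].indexOf(c[1]); // JDK reference result
            int actual = indexOf(c[0], c[1]);
            System.out.println(c[0] + " / " + c[1] + " -> " + actual
                    + (actual == expected ? " (agrees)" : " (differs)"));
        }
    }
}
```

One difference to keep in mind: for an empty pattern this implementation returns -1, while String.indexOf("") returns 0, so empty patterns are left out of the comparison.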

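Brute force – performance analysis
Best case
The very first round matches completely, using m comparisons (m is the length of the pattern)
Time complexity is O(m)
Worst case (the larger the character set, the less likely it is to occur)
n - m + 1 rounds of comparison are performed (n is the length of the text)
Every round fails only at the last character of the pattern (m - 1 successful comparisons, 1 failed)
Time complexity is O(m(n - m + 1)); since m is usually much smaller than n, this is O(mn)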
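The worst-case count can be reproduced with a small experiment: a text made only of a, and a pattern of a's ending in b, so every round fails on the last character. This is a sketch under those assumptions; the class BruteForceWorstCase and its method are hypothetical names, and String.repeat needs Java 11 or later.

```java
// Hypothetical experiment class for counting comparisons in brute force 2.
class BruteForceWorstCase {
    // Runs the brute-force 2 loops and returns how many character comparisons
    // were made before the first match was found (or before giving up).
    static long countComparisons(String text, String pattern) {
        int tlen = text.length();
        int plen = pattern.length();
        long count = 0;
        for (int ti = 0; ti <= tlen - plen; ti++) {
            int pi = 0;
            for (; pi < plen; pi++) {
                count++;
                if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
            }
            if (pi == plen) return count; // matched: stop counting
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 10;
        String pattern = "aaab"; // m = 4
        // Worst case: n - m + 1 = 7 rounds, 4 comparisons each.
        System.out.println(countComparisons("a".repeat(n), pattern)); // 28
        // Best case: the first round matches after m comparisons.
        System.out.println(countComparisons("aaabaaaaaa", pattern));  // 4
    }
}
```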

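KMP

Compared with brute force, the elegance of KMP is that it makes full use of what has already been compared, so it can skip comparison positions that cannot produce a match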
KMP – using the next table
KMP builds a next table (usually an array) from the contents of the pattern in advance

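KMP – core principle

When d and e mismatch (d is the character in text, e is the character in pattern), suppose we want pattern to shift right by a large distance in one step and then compare d directly with c
The precondition is that A equals B (A is a proper prefix and B a proper suffix of the substring to the left of e)
So KMP must find such A and B in the substring to the left of the mismatched character e, which tells it how far to shift right

Shift distance: length of the substring left of e - length of A, which is equivalent to: index of e - index of c
Since index of c == next[index of e], the shift distance is: index of e - next[index of e]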

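Summary
If the mismatch occurs at position pi, the shift distance is pi - next[pi], so the smaller next[pi] is, the larger the shift
next[pi] is the length of the longest string that is both a proper prefix and a proper suffix of the substring to the left of pi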
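The definition of next[pi] can be turned into code directly, without any cleverness. This is a deliberately slow sketch that follows the definition only; the class NextByDefinition is a hypothetical name, the sample pattern ABCDABCE is chosen for this illustration, and next[0] = -1 follows the convention used by the implementations in this post.

```java
import java.util.Arrays;

// Hypothetical class: computes the next table straight from its definition.
class NextByDefinition {
    // Length of the longest string that is both a proper prefix and a proper suffix of s.
    static int longestCommonLength(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            // Compare the suffix of length len with the prefix of the same length.
            if (s.startsWith(s.substring(s.length() - len))) return len;
        }
        return 0;
    }

    // next[pi] describes the substring to the left of pi, that is pattern[0, pi).
    static int[] next(String pattern) {
        int[] next = new int[pattern.length()];
        for (int pi = 0; pi < pattern.length(); pi++) {
            next[pi] = pi == 0 ? -1 : longestCommonLength(pattern.substring(0, pi));
        }
        return next;
    }

    public static void main(String[] args) {
        int[] next = next("ABCDABCE");
        System.out.println(Arrays.toString(next)); // [-1, 0, 0, 0, 0, 1, 2, 3]
        int pi = 7; // suppose the mismatch happens at the final E
        System.out.println(pi - next[pi]);         // 4: the pattern shifts right by 4
    }
}
```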

KMP – longest common length of proper prefixes and proper suffixes
KMP – obtaining the next table

KMP – the elegance of -1

KMP – main algorithm implementation

public static int indexOf(String text, String pattern) {
		if (text == null || pattern == null) return -1;
		int tlen = text.length();
		int plen = pattern.length();
		if (tlen == 0 || plen == 0 || tlen < plen) return -1;
		int[] next = next(pattern);
		int pi = 0, ti = 0;
		while (pi < plen && ti - pi <= tlen - plen) {
			// pi < 0 means even the first pattern character failed: move on in text
			if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
				ti++;
				pi++;
			} else {
				// mismatch: ti stays put, the pattern slides right by pi - next[pi]
				pi = next[pi];
			}
		}
		return pi == plen ? ti - pi : -1;
	}

KMP – why the "longest" common length?
Suppose the text is AAAAABCDEF and the pattern is AAAAB

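◼ After the mismatch at pi = 4, which of the values 1, 2, 3 is the correct one to assign to pi?
◼ Assigning 3 to pi shifts the pattern right by 1 character, and the match eventually succeeds
◼ Assigning 1 to pi shifts the pattern right by 3 characters and misses the match
◼ The smaller the common length, the larger the shift and the less safe it is
◼ The larger the common length, the smaller the shift and the safer it is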
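The effect of picking a value that is too small can be tried out by running the KMP loop with a hand-written next table. In this sketch the class LargestCommonLengthDemo and the idea of passing the table as a parameter are assumptions made for the experiment. The second table is deliberately wrong at index 4.

```java
// Hypothetical experiment: the KMP main loop driven by a caller-supplied next table.
class LargestCommonLengthDemo {
    static int search(String text, String pattern, int[] next) {
        int tlen = text.length();
        int plen = pattern.length();
        int pi = 0, ti = 0;
        while (pi < plen && ti - pi <= tlen - plen) {
            if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                pi = next[pi];
            }
        }
        return pi == plen ? ti - pi : -1;
    }

    public static void main(String[] args) {
        String text = "AAAAABCDEF";
        String pattern = "AAAAB";
        // Correct table: next[4] = 3, the longest common length for AAAA.
        int[] longest = {-1, 0, 1, 2, 3};
        // Deliberately wrong table: next[4] = 1, a shorter common length.
        int[] tooShort = {-1, 0, 1, 2, 1};
        System.out.println(search(text, pattern, longest));  // 1
        System.out.println(search(text, pattern, tooShort)); // -1, the match at index 1 is skipped
    }
}
```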
KMP – how the next table is constructed

◼ Given next[i] == n
① If pattern.charAt(i) == pattern.charAt(n), then next[i + 1] == n + 1
② If pattern.charAt(i) != pattern.charAt(n), let next[n] == k
✓ If pattern.charAt(i) == pattern.charAt(k), then next[i + 1] == k + 1
✓ If pattern.charAt(i) != pattern.charAt(k), substitute k for n and repeat ②

KMP – code for the next table

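private static int[] next1(String pattern) {
		int len = pattern.length();
		int[] next = new int[len];
		int i = 0;
		// next[0] = -1 marks "no shorter candidate left"
		int n = next[i] = -1;
		int imax = len - 1;
		while (i < imax) {
			if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
				// case ①: the common length grows by one
				next[++i] = ++n;
			} else {
				// case ②: fall back to the next shorter candidate
				n = next[n];
			}
		}
		return next;
	}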

KMP – a shortcoming of the next table
Suppose the text is AAABAAAAB and the pattern is AAAAB
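KMP – idea behind the next table optimization
Given: next[i] == n, next[n] == k

If pattern[i] != d (d is the mismatched character in text), the pattern slides so that position next[i] (that is, n) is compared with d
If pattern[n] != d, the pattern slides so that position next[n] (that is, k) is compared with d
If pattern[i] == pattern[n], then after a mismatch at position i the pattern is bound to end up sliding to position k to compare with d
So next[i] can store next[n] (that is, k) directly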

KMP – optimized next table implementation

private static int[] next(String pattern) {
		int len = pattern.length();
		int[] next = new int[len];
		int i = 0;
		int n = next[i] = -1;
		int imax = len - 1;
		while (i < imax) {
			if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
				i++;
				n++;
				// n >= 0 here, so the two characters can be compared directly
				if (pattern.charAt(i) == pattern.charAt(n)) {
					// same character would fail again: skip straight to next[n]
					next[i] = next[n];
				} else {
					next[i] = n;
				}
			} else {
				n = next[n];
			}
		}
		return next;
	}

KMP – effect of the next table optimization
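The effect can be measured on the example used earlier, text AAABAAAAB and pattern AAAAB. The sketch below builds both tables and counts character comparisons in the main loop. The class NextTableComparison, the method names and the returned pair are assumptions made for this experiment, and the pattern is assumed to be non-empty.

```java
import java.util.Arrays;

// Hypothetical experiment class comparing the basic and optimized next tables.
class NextTableComparison {
    // Basic table: next[i] is the longest common length for pattern[0, i), next[0] = -1.
    static int[] basicNext(String pattern) {
        int len = pattern.length(); // assumed to be at least 1
        int[] next = new int[len];
        int i = 0;
        int n = next[0] = -1;
        while (i < len - 1) {
            if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
                next[++i] = ++n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Optimized table: when pattern[i] == pattern[n], store next[n] instead of n.
    static int[] optimizedNext(String pattern) {
        int len = pattern.length(); // assumed to be at least 1
        int[] next = new int[len];
        int i = 0;
        int n = next[0] = -1;
        while (i < len - 1) {
            if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
                i++;
                n++;
                next[i] = pattern.charAt(i) == pattern.charAt(n) ? next[n] : n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Runs the KMP main loop; returns {match index, number of character comparisons}.
    static int[] search(String text, String pattern, int[] next) {
        int tlen = text.length();
        int plen = pattern.length();
        int pi = 0, ti = 0, comparisons = 0;
        while (pi < plen && ti - pi <= tlen - plen) {
            if (pi >= 0) comparisons++; // a character comparison only happens when pi >= 0
            if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                pi = next[pi];
            }
        }
        return new int[] {pi == plen ? ti - pi : -1, comparisons};
    }

    public static void main(String[] args) {
        String text = "AAABAAAAB";
        String pattern = "AAAAB";
        int[] basic = basicNext(pattern);
        int[] optimized = optimizedNext(pattern);
        System.out.println(Arrays.toString(basic));     // [-1, 0, 1, 2, 3]
        System.out.println(Arrays.toString(optimized)); // [-1, -1, -1, -1, 3]
        System.out.println(Arrays.toString(search(text, pattern, basic)));     // [4, 12]
        System.out.println(Arrays.toString(search(text, pattern, optimized))); // [4, 9]
    }
}
```

Both tables find the match at index 4. With the basic table, the mismatch against B at pi = 3 is retried at pi = 2, 1 and 0, each time against another A that cannot match; the optimized table jumps straight to -1 and saves those three comparisons.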
KMP – performance analysis
KMP main logic
Best-case time complexity: O(m)
Worst-case time complexity: O(n), with no more than roughly 2n comparisons
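The bound can be checked empirically on the input that is worst for brute force: a text of n copies of a and a pattern of a's ending in b. This is a sketch under those assumptions; the class KmpComparisonCount and its methods are hypothetical names, the table builder repeats the optimized construction so the block compiles on its own, and String.repeat needs Java 11 or later.

```java
// Hypothetical experiment class: counts character comparisons in the KMP main loop.
class KmpComparisonCount {
    // Optimized next table, repeated here so the sketch is self-contained.
    static int[] next(String pattern) {
        int len = pattern.length(); // assumed to be at least 1
        int[] next = new int[len];
        int i = 0;
        int n = next[0] = -1;
        while (i < len - 1) {
            if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
                i++;
                n++;
                next[i] = pattern.charAt(i) == pattern.charAt(n) ? next[n] : n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Number of character comparisons the main loop makes on this input.
    static long kmpComparisons(String text, String pattern) {
        int tlen = text.length();
        int plen = pattern.length();
        int[] next = next(pattern);
        int pi = 0, ti = 0;
        long comparisons = 0;
        while (pi < plen && ti - pi <= tlen - plen) {
            if (pi >= 0) comparisons++;
            if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                pi = next[pi];
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int n = 20;
        String text = "a".repeat(n);
        String pattern = "aaab";
        int m = pattern.length();
        System.out.println("KMP comparisons: " + kmpComparisons(text, pattern));
        System.out.println("2n bound: " + 2 * n);
        System.out.println("brute-force worst case m(n - m + 1): " + m * (n - m + 1));
    }
}
```

For n = 20 and m = 4 the brute-force worst case is m(n - m + 1) = 68 comparisons, while the KMP count stays under 2n = 40.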