Approximate Matching

Time limit: 1000ms
Per-test time limit: 1000ms
Memory limit: 512MB

Description

String matching, a common problem in DNA sequence analysis and text editing, is to find the occurrences of one certain string (called pattern) in a larger string (called text). In some cases, the pattern is not required to be exactly in the text, and minor differences are acceptable (due to possible typing mistakes). When given a pattern string and a text string, we say pattern P is approximately matched within text S, if there is a substring of S which is at most one letter different from P. Note that the length of this substring and the pattern must be identical. For example, pattern “abb” is approximately matched in text “babc” but not matched in “bbac”.

It is easy to check if a pattern is approximately matched in a text. So your task is to count the number of all text strings of length m in which the given pattern can be approximately matched, and both of the patterns and texts are binary strings in order not to handle big integers.

Input

The first line of input is a single integer T (1 ≤ T ≤ 666), the number of test cases. Each test case begins with a line of two integers n,m (1 ≤ n,m ≤ 40), denoting the length of pattern string and text string. Then a single line of binary string P follows, which denotes the pattern. Note that there will be at most 15 test cases in which n ≥ 16.

Output

For each test case, output a single line with one integer, representing the answer.

Sample Input

5
3 4
110
4 7
1011
2 10
00
7 17
1001110
11 22
11101010001

Sample Output

12
104
1023
72840
291544

Problem summary:

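Given a string $S$ of length $n$ $(1≤n≤40)$ consisting only of $0$ and $1$, a string $T$ is said to approximately match $S$ if $T$ has a substring of length $n$ that differs from $S$ in at most $1$ position. Among all strings of length $m$ $(1≤m≤40)$ consisting only of $0$ and $1$, count those that approximately match $S$.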
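The definition can be checked directly by sliding a window of length $n$ over the text. The following is only an illustrative sketch; the helper name `approxMatch` is an assumption and does not appear in the solution further down.

```cpp
#include <cstddef>
#include <string>

// Returns true if some window of `text` with the same length as `pattern`
// differs from `pattern` in at most one position.
// (Hypothetical helper, written only to illustrate the definition.)
bool approxMatch(const std::string& pattern, const std::string& text)
{
    const std::size_t n = pattern.size(), m = text.size();
    if (n > m) return false;                  // no window of the right length
    for (std::size_t start = 0; start + n <= m; ++start)
    {
        int diff = 0;                         // mismatches inside this window
        for (std::size_t i = 0; i < n && diff <= 1; ++i)
        {
            if (text[start + i] != pattern[i]) ++diff;
        }
        if (diff <= 1) return true;
    }
    return false;
}
```

With the example from the statement, `approxMatch("abb", "babc")` holds because the window "abc" differs from "abb" in one letter, while every window of "bbac" differs in at least two.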

Analysis:

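First, it is easy to see that approximately matching $S$ is the same as containing, as an exact substring, any one of the $(n+1)$ strings $\{S_i\}$ obtained from $S$ by changing at most $1$ bit. To count the strings that approximately match $S$, we can count the number $x$ of strings that do not approximately match $S$ and subtract $x$ from the total number of strings, $2^m$. So we only need to count the strings that contain none of the strings in $\{S_i\}$.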
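As a small illustration of the set $\{S_i\}$, the sketch below generates $S$ itself plus the $n$ strings with exactly one flipped bit. The function name `makeVariants` is an assumption made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the (n+1) exact patterns: S unchanged, followed by S with
// bit 0 flipped, bit 1 flipped, and so on.
// (Hypothetical helper; the input is assumed to contain only '0' and '1'.)
std::vector<std::string> makeVariants(const std::string& s)
{
    std::vector<std::string> variants;
    variants.push_back(s);
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        std::string t = s;
        t[i] = (t[i] == '0') ? '1' : '0';     // flip exactly one bit
        variants.push_back(t);
    }
    return variants;
}
```

For $S = 110$ this yields the four strings 110, 010, 100 and 111.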


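Consider the naive method: enumerate all strings of length $m$ and test each one for an approximate match with $S$. There are $2^m$ candidate strings, however, so this is clearly infeasible. Next, consider what the exhaustive search computes redundantly: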
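Although the naive method is far too slow for $m = 40$, it is handy as a reference for small inputs. A minimal sketch, assuming texts are encoded as the low $m$ bits of an integer (the name `bruteForceCount` is hypothetical):

```cpp
#include <cstdint>
#include <string>

// Counts the length-m binary texts in which `pattern` is approximately
// matched by trying all 2^m texts. Only practical for small m.
std::uint64_t bruteForceCount(const std::string& pattern, int m)
{
    const int n = static_cast<int>(pattern.size());
    if (n > m) return 0;                      // no window can fit

    std::uint64_t pat = 0;                    // pattern as an n-bit integer
    for (char c : pattern) pat = (pat << 1) | static_cast<std::uint64_t>(c - '0');
    const std::uint64_t windowMask = (std::uint64_t(1) << n) - 1;

    std::uint64_t count = 0;
    for (std::uint64_t text = 0; text < (std::uint64_t(1) << m); ++text)
    {
        for (int shift = 0; shift + n <= m; ++shift)
        {
            // Bits in which this window differs from the pattern
            std::uint64_t diff = ((text >> shift) & windowMask) ^ pat;
            // diff has at most one set bit exactly when diff & (diff - 1) == 0
            if ((diff & (diff - 1)) == 0) { ++count; break; }
        }
    }
    return count;
}
```

On the first three sample cases this agrees with the expected answers 12, 104 and 1023.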

If a generated string $T$ has a suffix equal to some $S_i$, then the string is excluded from $x$ no matter what the rest of $T$ is, yet the exhaustive search still visits every string ending in $S_i$.

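Whether a match with some $S_i$ occurs depends only on the suffix of the string generated so far; in other words, only the current suffix affects the count $x$. We can therefore use each suffix that influences the result as a state, that is, every way in which a suffix of the generated string can match a prefix of some $S_i$. $\{S_i\}$ contains $(n+1)$ strings, and each has $(n+1)$ distinct prefixes (counting the empty string of length $0$ as a prefix), so there are at most $(n+1)^2$ states. Given the state that each suffix represents, once we know the next state reached by appending any character to each state, the $dp$ recurrence is easy to write down: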

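Let $dp[L][State]$ be the number of generated strings of length $L$ whose suffix state is $State$. Without loss of generality, let $State=0$ be the state in which the matched length against every pattern in $\{S_i\}$ is $0$.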

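Since $x$ counts only strings that contain no $S_i$, any state in which some $S_i$ is fully matched is discarded (its count stays $0$). For every other state $cur$, reachable from state $pre_i$ by appending character $c_i$, we have: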

$\begin{aligned} dp&[0][0]=1\\ dp&[0][i]=0,\quad i\neq 0\\ dp&[L][cur]=\sum_{i}{dp[L-1][pre_i]},\quad L>0 \end{aligned}$
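The recurrence can be sketched independently of how the states are obtained. The code below assumes a ready-made transition table `trans` and a flag array `terminal` marking states in which a pattern is fully matched; both names, and the function `countAvoiding`, are illustrative assumptions. It pushes counts forward, which is equivalent to summing over the predecessors $pre_i$.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// trans[s][c]: state reached from state s after appending bit c.
// terminal[s]: true if being in state s means some pattern has been matched.
// Returns x, the number of length-m binary strings that never enter a
// terminal state. State 0 is the start state.
std::uint64_t countAvoiding(const std::vector<std::array<int, 2>>& trans,
                            const std::vector<bool>& terminal, int m)
{
    const std::size_t states = trans.size();
    std::vector<std::uint64_t> cur(states, 0), nxt(states, 0);
    cur[0] = 1;                                   // dp[0][0] = 1
    for (int len = 0; len < m; ++len)
    {
        std::fill(nxt.begin(), nxt.end(), 0);
        for (std::size_t s = 0; s < states; ++s)
        {
            if (cur[s] == 0) continue;
            for (int c = 0; c < 2; ++c)
            {
                const int to = trans[s][c];
                if (terminal[to]) continue;       // a match occurred: not part of x
                nxt[to] += cur[s];
            }
        }
        cur.swap(nxt);                            // only two layers are kept
    }
    std::uint64_t total = 0;
    for (std::uint64_t v : cur) total += v;
    return total;
}
```

As a quick check with a hand-built three-state automaton for the single pattern 11 (state 2 terminal), the function returns the Fibonacci-like counts 2, 3, 5, 8 for $m = 1, 2, 3, 4$. Keeping only two layers is a design choice of this sketch; the full table $dp[L][State]$ works just as well at these sizes.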

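To make the dynamic programming step efficient, we first precompute a number for every matching state, together with a transition table giving the state reached from each state after appending each character, and then run the dynamic programming on that table.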

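For the preprocessing, we can obtain the state numbers naively by enumerating all prefixes of every $S_i$ and removing duplicates. The $O(n)$ strings have $O(n^2)$ prefixes in total, so deduplication takes $O(n^3·\log_2{n})$ time. The transition table is then computed by repeatedly deleting the first character of a string and looking the result up. There are $O(n^2)$ states and each yields $O(n)$ suffixes, for $O(n^3)$ candidate strings in total. Comparing two strings costs $O(n)$, so with a binary search tree it takes $O(n·\log_2{n})$ to decide whether one candidate is an existing state, and the whole preprocessing takes $O(n^4·\log_2{n})$, which is barely acceptable for this problem's limits. String hashing can reduce a comparison to $O(1)$ and the preprocessing to $O(n^3·\log_2{n})$, but it is rather tedious.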
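The naive preprocessing can be sketched with `std::map` playing the role of the binary search tree. The struct `NaiveAutomaton` and the function `buildNaive` are hypothetical names for this illustration, and patterns are assumed to be non-empty strings of '0' and '1'.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct NaiveAutomaton                    // hypothetical container for this sketch
{
    std::vector<std::array<int, 2>> trans;   // trans[state][bit] -> next state
    std::vector<bool> terminal;              // true if a pattern ends in this state
};

NaiveAutomaton buildNaive(const std::vector<std::string>& patterns)
{
    // Step 1: number every distinct prefix; the empty prefix is state 0.
    std::map<std::string, int> id;
    std::vector<std::string> prefixes(1, "");
    id[""] = 0;
    for (const std::string& p : patterns)
        for (std::size_t len = 1; len <= p.size(); ++len)
        {
            const std::string pre = p.substr(0, len);
            if (id.find(pre) == id.end())
            {
                id[pre] = static_cast<int>(prefixes.size());
                prefixes.push_back(pre);
            }
        }

    // Step 2: for each state and bit, append the bit and delete leading
    // characters until the remainder is a known prefix.
    NaiveAutomaton a;
    a.trans.resize(prefixes.size());
    a.terminal.assign(prefixes.size(), false);
    for (std::size_t s = 0; s < prefixes.size(); ++s)
    {
        const std::string& cur = prefixes[s];
        for (const std::string& p : patterns)     // does some pattern end here?
            if (cur.size() >= p.size() &&
                cur.compare(cur.size() - p.size(), p.size(), p) == 0)
                a.terminal[s] = true;
        for (int c = 0; c < 2; ++c)
        {
            std::string t = cur + static_cast<char>('0' + c);
            while (id.find(t) == id.end()) t.erase(0, 1);   // "" is always found
            a.trans[s][c] = id[t];
        }
    }
    return a;
}
```

For the patterns 00, 10 and 01 this produces six states (the empty prefix, 0, 00, 1, 10, 01), three of which are terminal.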

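Alternatively, we can use the Aho-Corasick algorithm, which assigns a number to each state and produces the transition table while the AC automaton is being built. The construction runs in $O(\sum{Len})$. In this problem the number of patterns is of the same order as their length, so building the AC automaton takes $O(n^2)$, significantly better than the naive approach.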
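For comparison with the pointer-based solution in this post, here is a compact array-based sketch of the same construction. The struct `AcAutomaton` and its member names are assumptions made for this example; it is one common way to lay out the automaton, not the layout the solution uses.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

struct AcAutomaton                        // hypothetical array-based layout
{
    std::vector<std::array<int, 2>> next; // -1 means "no trie edge" until build()
    std::vector<int> fail;                // failure links
    std::vector<bool> terminal;           // true if a pattern is matched here

    AcAutomaton() { newNode(); }          // node 0 is the root

    int newNode()
    {
        std::array<int, 2> none = {{-1, -1}};
        next.push_back(none);
        fail.push_back(0);
        terminal.push_back(false);
        return static_cast<int>(next.size()) - 1;
    }

    void insert(const std::string& pattern)   // pattern consists of '0'/'1'
    {
        int cur = 0;
        for (char ch : pattern)
        {
            const int c = ch - '0';
            if (next[cur][c] == -1)
            {
                const int id = newNode();     // allocate first, then link
                next[cur][c] = id;
            }
            cur = next[cur][c];
        }
        terminal[cur] = true;
    }

    void build()                          // BFS: fill fail links and missing edges
    {
        std::queue<int> q;
        for (int c = 0; c < 2; ++c)
        {
            const int v = next[0][c];
            if (v == -1) next[0][c] = 0;
            else { fail[v] = 0; q.push(v); }
        }
        while (!q.empty())
        {
            const int u = q.front();
            q.pop();
            if (terminal[fail[u]]) terminal[u] = true;   // a suffix is a pattern
            for (int c = 0; c < 2; ++c)
            {
                const int v = next[u][c];
                if (v == -1) next[u][c] = next[fail[u]][c];
                else { fail[v] = next[fail[u]][c]; q.push(v); }
            }
        }
    }
};
```

After `insert` is called for every string in $\{S_i\}$ and `build` is run, `next` is a complete transition table and `terminal` marks the matched states, which is the input the dynamic programming needs.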

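The dynamic programming has at most $O(n^3)$ states (taking $m$ and $n$ to be of the same order), and each state is updated from at most $O(n)$ states, so $O(n^4)$ is an upper bound on the running time. Because the strings consist only of $0$ and $1$, many states coincide, so this bound is fairly loose. In fact, every automaton state has exactly two outgoing transitions, so when counts are pushed forward along them the table is filled with $O(m·n^2)$ additions.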

The code follows:

#include <cstdio>
#include <cstring>
#include <new>      // placement new
#include <queue>
using namespace std;

typedef long long ll;

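const int MAX_N = 42;         // pattern and text lengths are at most 40
const int MAX_STATE = 1650;   // at most 41 patterns * 40 characters + root = 1641 nodes
const int MAX_W = 2;          // alphabet size: '0' and '1'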

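// One node (state) of the AC automaton
struct P
{
	P* fail;          // failure link
	P* next[MAX_W];   // transitions on '0' and '1'
	int cnt;          // nonzero if a pattern is matched in this state
	P()
	{
		fail = NULL;
		cnt = 0;
		memset(next, 0, sizeof(next));
	}
};
P pool[MAX_STATE];    // static node pool, reused across test cases
int pt;               // number of pool nodes in use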

void insert(const char* const str, P* root);
void buildac(P* root);

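char ch[MAX_N];   // pattern buffer

// dp[t][i]: number of length-t strings that end in state i
// and contain no pattern so far
ll dp[MAX_N][MAX_STATE];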


int main()
{
	int T, n, m;
	scanf("%d", &T);

	while (T--)
	{
		scanf("%d%d \n%s", &n, &m, ch);

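		// Reset the pool; placement new re-runs the constructor on reused nodes
		pt = 0;
		P* root = new(pool + pt++) P();

		// Map '0'/'1' to 1/2 so that the byte 0 still terminates the string
		for (int i = 0; i < n; ++i)
		{
			ch[i] -= '0' - 1;
		}
		// Insert the pattern itself, then each variant with one flipped bit
		insert(ch, root);
		for (int i = 0; i < n; ++i)
		{
			ch[i] = ch[i] == 1 ? 2 : 1;
			insert(ch, root);
			ch[i] = ch[i] == 1 ? 2 : 1;   // restore the bit
		}
		buildac(root);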

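		// The empty string is in state 0 (the root, pool[0])
		dp[0][0] = 1;
		for (int i = 1; i < pt; ++i)
		{
			dp[0][i] = 0;
		}
		for (int t = 0; t < m; ++t)
		{
			for (int i = 0; i < pt; ++i)
			{
				dp[t + 1][i] = 0;
			}

			// Push each count forward along both outgoing transitions
			for (int i = 0; i < pt; ++i)
			{
				for (int j = 0; j < MAX_W; ++j)
				{
					int k = pool[i].next[j] - pool;   // index of the target state
					if (pool[k].cnt)
					{
						// A pattern would be matched: not counted
						continue;
					}
					dp[t + 1][k] += dp[t][i];
				}
			}
		}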

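		// ans = x, the number of texts that contain no pattern
		ll ans = 0;
		for (int i = 0; i < pt; ++i)
		{
			ans += dp[m][i];
		}
		// Answer = all 2^m texts minus the non-matching ones
		printf("%lld\n", (1ll << m) - ans);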
	}
	return 0;
}

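// Inserts one pattern (characters encoded as 1/2, terminated by 0) into the trie
void insert(const char* const str, P* root)
{
	P* p = root;
	int i = -1, index;
	while (str[++i])
	{
		index = str[i] - 1;   // 1/2 -> edge 0/1
		if (p->next[index] == NULL)
		{
			p->next[index] = new(pool + pt++) P();
		}
		p = p->next[index];
	}
	p->cnt = 1;   // a pattern ends at this node
}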

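// Builds failure links and fills in missing edges, so that every node ends up
// with a transition for both characters
void buildac(P* root)
{
	root->fail = NULL;
	queue<P*> q;
	q.push(root);
	// Breadth-first order: a failure link always points to a shallower node,
	// which has already been processed when the link is followed
	while (!q.empty())
	{
		P* const tmp = q.front();
		P* p = NULL;
		q.pop();

		for (int i = 0; i < MAX_W; i++)
		{
			if (tmp->next[i])
			{
				if (tmp == root)
				{
					// Children of the root fall back to the root
					tmp->next[i]->fail = root;
				}
				else
				{
					// Walk tmp's failure chain until a node has an edge labeled i
					p = tmp->fail;
					while (p)
					{
						if (p->next[i])
						{
							tmp->next[i]->fail = p->next[i];
							break;
						}
						p = p->fail;
					}
					if (!p) tmp->next[i]->fail = root;
					// If that node on tmp's failure chain is terminal, a suffix of
					// tmp is a pattern, so tmp is terminal too. All patterns here
					// have the same length n, so this cannot happen in this problem.
					if (p && p->cnt)
					{
						tmp->cnt = 1;
					}
				}

				q.push(tmp->next[i]);
			}
			else
			{
				// Missing edge: borrow the transition of the failure node
				tmp->next[i] = tmp == root ? root : tmp->fail->next[i];
			}
		}
	}
}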

xhxhxhxhx


hihoCoder1877 Approximate Matching - AC automaton - dynamic programming (DP)
