BWT（Burrows–Wheeler transform）

1. 按照字典排序

2. 例子，X=googol$

S(i)，即按照字典排序后的序号i的suffix，所对应的一开始的序号。
B[i]，即按照字典排序后的序号i的suffix，的最后一个字符。

表明了X的长度为n(包括$)，最后一位就是$。

3. 后缀序列区间(SA intervals)

若W为X的子字符串，W在X中的出现位置必然会落在一个后缀区间内。(若落在不同的后缀区间则说明有multi-hits)
例如，W为go，X为上述所说。有go作为前缀的后缀的i为1,2，即后缀区间为[1,2]。
上图中的后缀区间左边界可以这么描述：在按照字典排序后的X后缀集合中，W为其前缀的后缀的新的序号们的最小值。
同理，后缀区间右边界则为最大值。

时刻需要注意的时，查找时，都是查找query是否为subject的前缀，而不是简单的存在与否。

4. 准确匹配：backward search

定义
a指某个字符，字母表中的。
C(a)：在X[0,n-2]（除了$外的原字符串上），比a字典排序小的字符的数量。
O(a,i)：在B[0,i]的字符上，a的出现次数。

结论是

当且仅当上述条件成立时，aW为X的子字符串

例如，W为go，a=o，Rmin（aW） = 3 + 0 + 1，Rmax(aW) = 3 +1，所以ogo为X的子字符串。

如何理解反向搜索，在我们有了第一个字符串W的的后缀区间后，如果我们要继续往前添加字符去搜索，相当于在后缀树上的从下到上的搜索。在后缀区间上，由于后缀与B[i]是形成一个圆形的，所以B[i]其实就是i后缀的前一个字符。我们比较B[0,i-1]和B[0,i]其实就只是比较这个B[i]是否是这个a。

i = 0
j = len(X)
for x in range(len(query)):
  newChar = query[-x-1] # 从后往前找
  newI = C(newChar) + O(newChar, i - 1) +1 # 有了第一个后，第二个则在第一个的基础上找aW
  newJ = C(newChar) + O(newChar, J)
  i = newI
  J = newJ 
matches = S[i,J+1] # i>j不等时就已经说明该字符串已经不在X中

5. 模糊匹配：bounded traversal/backtracking(有界的遍历/回溯)

定义

引入D，按照准确匹配的算法进行计算，但O为反向的X建立的O_reverse的，当I大于J时，则将该字符的位置写入D，作为指示的作用。

模糊的递归：通过在后缀树上的递归+D以进行模糊匹配。

终止条件为：

D比给定的最大不同字符数z还大

z由于设定的penalty已经被减少的小于0

整个字符串都被匹配

k和l的作用类似于准确匹配时的i和j。

    def inexact_recursion(self, query, i, z, k, l):
        tempset = set()
        # 2 stop conditions, one when too many differences have been encountered, another when the entire query has been matched, terminating in success
        if (z < self.get_D(i) and use_lower_bound_tree_pruning) or (
                z < 0 and not use_lower_bound_tree_pruning):  # reached the limit of differences at this stage, terminate this path traversal
            if debug: print "too many differences, terminating path\n"
            return set()  # return empty set
        if i < 0:  # empty query string, entire query has been matched, return SA indexes k:l
            if debug: print "query string finished, terminating path, success! k=%d, l=%d\n" % (k, l)
            for m in range(k, l + 1):
                tempset.add(m)
            return tempset

        result = set()
        if indels_allowed:
            result = result.union(self.inexact_recursion(query, i - 1, z - insertion_penalty, k,
                                                                        l))  # without finding a match or altering k or l, move on down the query string. Insertion
        for char in self.alphabet:  # for each character in the alphabet
            # find the SA interval for the char
            newK = self.C[char] + self.OCC(char, k - 1) + 1
            newL = self.C[char] + self.OCC(char, l)
            if newK <= newL:  # if the substring was found
                if indels_allowed: result = result.union(
                    self.inexact_recursion(query, i, z - deletion_penalty, newK, newL))  # Deletion
                if debug: print "char '%s found' with k=%d, l=%d. z = %d: parent k=%d, l=%d" % (
                char, newK, newL, z, k, l)
                if char == query[
                    i]:  # if the char was correctly aligned, then continue without decrementing z (differences)
                    result = result.union(self.inexact_recursion(query, i - 1, z, newK, newL))
                else:  # continue but decrement z, to indicate that this was a difference/unalignment
                    result = result.union(self.inexact_recursion(query, i - 1, z - mismatch_penalty, newK, newL))
        return result

最后感谢Jwomers的代码，毕竟原文章实在写的太晦涩难懂了....

论文Fast and accurate short read alignment with Burrows-Wheeler transform

Fast and accurate short read alignment with Burrows-Wheeler transform

BWT（Burrows–Wheeler transform）

参考

猜你喜欢