Paper: Fast and accurate short read alignment with Burrows-Wheeler transform

Original link: https://www.jianshu.com/p/3a57b79dffec


BWT(Burrows–Wheeler transform)

1. Sort the suffixes in dictionary (lexicographic) order

2. Example: X = googol$

S(i): the start position in X of the i-th smallest suffix in dictionary order (the suffix array).
B[i]: the last character of the i-th row when the rotations of X are sorted in dictionary order (the BWT string).

 

 

X has length n (including $), and its last character is $.
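
As a minimal sketch of the two definitions above, S and B can be built for googol$ by sorting suffix start positions directly. The function names `suffix_array` and `bwt` are illustrative only, and the naive sort is chosen for readability, not speed.

```python
def suffix_array(x):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt(x, sa):
    """B[i] is the character just before suffix sa[i]; it wraps to '$' when sa[i] == 0."""
    return "".join(x[i - 1] for i in sa)


X = "googol$"
S = suffix_array(X)
B = bwt(X, S)
print(S)  # [6, 3, 0, 5, 2, 4, 1]
print(B)  # lo$oogg
```

The sort relies on '$' comparing smaller than every letter, which holds for ASCII.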

3. Suffix array intervals (SA intervals)

If W is a substring of X, every occurrence of W in X falls within one contiguous interval of the suffix array, because all suffixes that have W as a prefix sort next to each other. (If the interval contains more than one suffix, W has multiple hits.)
For example, with W = go and X as above, the suffixes that have go as a prefix are i = 1 and i = 2, so the SA interval is [1, 2].
The left boundary of the interval, Rmin(W), is the smallest index i in dictionary order whose suffix has W as a prefix.
Similarly, the right boundary, Rmax(W), is the largest such index.

Note that the lookup tests whether the query is a prefix of a suffix of the subject, not merely whether its characters are present.
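
To make the interval concrete, here is a deliberately naive sketch that finds the SA interval by scanning every sorted suffix. `sa_interval` is a hypothetical helper name; a real index would never scan linearly like this.

```python
def sa_interval(x, w):
    """Return (Rmin, Rmax) for w in x, or None if w does not occur in x."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    # Rows whose suffix starts with w; they are contiguous because sa is sorted.
    hits = [k for k, start in enumerate(sa) if x.startswith(w, start)]
    if not hits:
        return None
    return hits[0], hits[-1]


print(sa_interval("googol$", "go"))  # (1, 2)
print(sa_interval("googol$", "o"))   # (4, 6)
print(sa_interval("googol$", "x"))   # None
```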

4. Exact matching: backward search

Definitions
a: a character of the alphabet.
C(a): the number of characters in X[0, n-2] (the original string, excluding the trailing $) that are lexicographically smaller than a.
O(a, i): the number of occurrences of a in B[0, i].


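The conclusion is:

Rmin(aW) = C(a) + O(a, Rmin(W) - 1) + 1
Rmax(aW) = C(a) + O(a, Rmax(W))

where Rmin(W) and Rmax(W) are the left and right boundaries of the SA interval of W. aW is a substring of X if and only if Rmin(aW) <= Rmax(aW).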

 

For example, with W = go and a = o: Rmin(aW) = 3 + 0 + 1 = 4 and Rmax(aW) = 3 + 1 = 4, so ogo is a substring of X.
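
The arithmetic in this example can be checked with a small sketch. `C`, `O` and `extend` below are direct, unoptimized transcriptions of the definitions; `extend` is an illustrative name, and treating O(a, i) as 0 for i < 0 is an assumption of this sketch.

```python
X = "googol$"
S = sorted(range(len(X)), key=lambda i: X[i:])  # suffix array
B = "".join(X[i - 1] for i in S)                # BWT string, "lo$oogg"


def C(a):
    """Number of characters in X (excluding the trailing '$') smaller than a."""
    return sum(1 for ch in X[:-1] if ch < a)


def O(a, i):
    """Number of occurrences of a in B[0..i]; 0 when i < 0 (assumed convention)."""
    return B[:max(i + 1, 0)].count(a)


def extend(a, r_min, r_max):
    """SA interval of aW, given the SA interval [r_min, r_max] of W."""
    return C(a) + O(a, r_min - 1) + 1, C(a) + O(a, r_max)


print(C("o"), O("o", 0), O("o", 2))  # 3 0 1
print(extend("o", 1, 2))             # (4, 4)
print(X[S[4]:])                      # ogol$
```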

How to understand backward search: once we have the SA interval of a string W, prepending one more character a and searching again is equivalent to walking the suffix tree from the bottom up. Because the rows behind B are cyclic rotations of X, B[i] is the character that immediately precedes suffix i in X. Comparing the counts over B[0, i-1] and B[0, i] therefore just tests whether B[i] is a.

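def backward_search(query, S, C, O):
    # C(a) and O(a, i) are the functions defined above; O(a, -1) must return 0
    i, j = 0, len(S) - 1               # SA interval of the empty string
    for a in reversed(query):          # scan the query from its last character to its first
        i = C(a) + O(a, i - 1) + 1     # extend the match W found so far to aW
        j = C(a) + O(a, j)
        if i > j:                      # empty interval: aW is not a substring of X
            return []
    return S[i:j + 1]                  # start positions in X of every occurrence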
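
For a version that runs end to end, the sketch below precomputes C as a dictionary and O as a table of running counts, then performs the backward search. `build_index` and the table layout are assumptions made for illustration, not the layout used by BWA itself.

```python
def build_index(x):
    """Return (sa, c, occ) for a text x that ends with '$'."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join(x[i - 1] for i in sa)
    alphabet = sorted(set(x) - {"$"})
    c = {a: sum(ch < a for ch in x[:-1]) for a in alphabet}
    # occ[a][i + 1] = occurrences of a in b[0..i]; occ[a][0] = 0 covers i = -1.
    occ = {a: [0] for a in alphabet}
    for ch in b:
        for a in alphabet:
            occ[a].append(occ[a][-1] + (ch == a))
    return sa, c, occ


def backward_search(query, sa, c, occ):
    """Return the sorted start positions of every exact occurrence of query."""
    i, j = 0, len(sa) - 1
    for a in reversed(query):
        if a not in c:                # character never occurs in the text
            return []
        i = c[a] + occ[a][i] + 1      # O(a, i - 1) is stored at occ[a][i]
        j = c[a] + occ[a][j + 1]      # O(a, j) is stored at occ[a][j + 1]
        if i > j:
            return []
    return sorted(sa[i:j + 1])


sa, c, occ = build_index("googol$")
print(backward_search("go", sa, c, occ))   # [0, 3]
print(backward_search("ogo", sa, c, occ))  # [2]
print(backward_search("gg", sa, c, occ))   # []
```

Each query character costs two table lookups, so the search time depends on the query length and not on the length of X.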

5. Inexact matching: bounded traversal/backtracking

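Definitions

  1. Introduce the array D, where D(i) is a lower bound on the number of differences needed to match query[0, i]. It is computed with the same recurrence as exact matching, except that O is replaced by O_reverse, the O table built from the reverse of X. Whenever i becomes greater than j, the difference counter is incremented and the interval is reset; the counter is then written to D at the current character position.
  2. Inexact recursion: recursively walk the suffix tree and use D to prune branches, which yields the inexact matches.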
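
A sketch of how D might be computed, under the assumptions stated in the comments: `build_tables` and `calculate_d` are illustrative names, the interval starts at [0, n-1], and a character that never occurs in X is treated as an immediate empty interval.

```python
def build_tables(text):
    """Return (c, occ) for a text that ends with '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    b = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text) - {"$"})
    c = {a: sum(ch < a for ch in text[:-1]) for a in alphabet}
    # occ[a][i + 1] = occurrences of a in b[0..i]; occ[a][0] = 0 covers i = -1.
    occ = {a: [0] for a in alphabet}
    for ch in b:
        for a in alphabet:
            occ[a].append(occ[a][-1] + (ch == a))
    return c, occ


def calculate_d(w, x):
    """d[i] = lower bound on the differences needed to align w[0..i] to x."""
    rev = x[:-1][::-1] + "$"          # reverse of X, keeping '$' at the end
    c, occ = build_tables(rev)        # occ plays the role of O_reverse
    n = len(rev)
    k, l, z = 0, n - 1, 0
    d = []
    for a in w:                       # left to right over the query
        if a in c:
            k = c[a] + occ[a][k] + 1
            l = c[a] + occ[a][l + 1]
        else:
            k, l = 1, 0               # assumption: unknown character gives an empty interval
        if k > l:                     # current stretch of w is not a substring of x
            k, l = 0, n - 1
            z += 1
        d.append(z)
    return d


print(calculate_d("lol", "googol$"))  # [0, 1, 1]
print(calculate_d("goo", "googol$"))  # [0, 0, 0]
```

Searching the reversed text lets the query be scanned left to right: prepending characters in the reversed index corresponds to appending them in X. For "lol", the stretch "lo" does not occur in googol, so at least one difference is needed from position 1 onward.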

Termination conditions:

  1. D(i) is greater than z, the number of differences still allowed
  2. z has dropped below 0 after the penalties were subtracted
  3. The entire query has been matched

Here k and l play the same role as i and j in exact matching.
    def inexact_recursion(self, query, i, z, k, l):
        # Two stop conditions: too many differences have been encountered (failure),
        # or the entire query has been matched (success).
        if (z < self.get_D(i) and use_lower_bound_tree_pruning) or (
                z < 0 and not use_lower_bound_tree_pruning):
            # reached the limit of differences at this stage, terminate this path traversal
            if debug: print("too many differences, terminating path\n")
            return set()
        if i < 0:  # entire query has been matched, return SA indexes k..l
            if debug: print("query string finished, terminating path, success! k=%d, l=%d\n" % (k, l))
            return set(range(k, l + 1))

        result = set()
        if indels_allowed:
            # Insertion: move on down the query string without finding a match or altering k or l
            result = result.union(self.inexact_recursion(query, i - 1, z - insertion_penalty, k, l))
        for char in self.alphabet:  # for each character in the alphabet
            # find the SA interval for the char
            newK = self.C[char] + self.OCC(char, k - 1) + 1
            newL = self.C[char] + self.OCC(char, l)
            if newK <= newL:  # if the substring was found
                if indels_allowed:
                    # Deletion: narrow the interval but stay at the same query position
                    result = result.union(self.inexact_recursion(query, i, z - deletion_penalty, newK, newL))
                if debug: print("char '%s found' with k=%d, l=%d. z = %d: parent k=%d, l=%d" % (
                    char, newK, newL, z, k, l))
                if char == query[i]:  # correctly aligned: continue without decrementing z
                    result = result.union(self.inexact_recursion(query, i - 1, z, newK, newL))
                else:  # mismatch: continue but decrement z to record the difference
                    result = result.union(self.inexact_recursion(query, i - 1, z - mismatch_penalty, newK, newL))
        return result
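
The method above depends on class state and module-level flags that are not shown here. As a self-contained illustration of the same recursion, the following simplified sketch allows mismatches only (no indels, no D pruning); `build_index` and `inexact_search` are hypothetical names and this is not Jwomers' implementation.

```python
def build_index(x):
    """Return (sa, c, occ) for a text x that ends with '$'."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join(x[i - 1] for i in sa)
    alphabet = sorted(set(x) - {"$"})
    c = {a: sum(ch < a for ch in x[:-1]) for a in alphabet}
    # occ[a][i + 1] = occurrences of a in b[0..i]; occ[a][0] = 0 covers i = -1.
    occ = {a: [0] for a in alphabet}
    for ch in b:
        for a in alphabet:
            occ[a].append(occ[a][-1] + (ch == a))
    return sa, c, occ


def inexact_search(query, x, max_mismatches):
    """Start positions in x where query matches with at most max_mismatches."""
    sa, c, occ = build_index(x)

    def recurse(i, z, k, l):
        if z < 0:                     # difference budget exhausted
            return set()
        if i < 0:                     # whole query consumed: rows k..l are hits
            return set(range(k, l + 1))
        rows = set()
        for a in c:                   # try every character of the alphabet
            new_k = c[a] + occ[a][k] + 1
            new_l = c[a] + occ[a][l + 1]
            if new_k <= new_l:        # a + current match occurs in x
                cost = 0 if a == query[i] else 1
                rows |= recurse(i - 1, z - cost, new_k, new_l)
        return rows

    rows = recurse(len(query) - 1, max_mismatches, 0, len(sa) - 1)
    return sorted(sa[r] for r in rows)


print(inexact_search("lol", "googol$", 1))  # [3]
print(inexact_search("lol", "googol$", 0))  # []
```

With one mismatch allowed, "lol" aligns to "gol" at position 3 of googol; with none allowed there is no hit.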

Finally, thanks to Jwomers for the code; after all, the original paper is written rather too obscurely ....

References

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/
  2. https://github.com/Jwomers/burrows_wheeler_alignment


Origin blog.csdn.net/u010608296/article/details/102642601