Fast and accurate short read alignment with Burrows-Wheeler transform
BWT (Burrows–Wheeler transform)
1. Sort all rotations (equivalently, suffixes) of X in dictionary order
2. Example: X = googol$
S(i): the start position in X of the i-th suffix in dictionary order, counting from 0.
B[i]: the last character of the i-th sorted rotation, i.e. the character that precedes suffix S(i) in X.
n denotes the length of X (including $); the last character of X is $.
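The definitions of S and B can be checked on the example string with a few lines of Python. This is only a naive sketch for toy inputs: the helper names `suffix_array` and `bwt` are made up for illustration, and the construction simply sorts all suffixes rather than using an efficient algorithm.

```python
def suffix_array(x):
    """Return S: start positions of the suffixes of x in dictionary order."""
    # Naive construction by sorting suffix strings; fine for a toy example.
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt(x, s):
    """Return B, where B[i] is the character just before suffix s[i]."""
    # For the suffix starting at 0, x[-1] wraps around to the final '$'.
    return "".join(x[i - 1] for i in s)


X = "googol$"
S = suffix_array(X)
B = bwt(X, S)
print(S)  # [6, 3, 0, 5, 2, 4, 1]
print(B)  # lo$oogg
```

Here '$' sorts before every letter because Python compares characters by code point, which matches the convention that $ is the smallest symbol.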
3. Suffix array intervals (SA intervals)
If W is a substring of X, every occurrence of W in X is the start of a suffix that has W as a prefix, and these suffixes fall within one contiguous range of the sorted suffixes. (If the range contains more than one suffix, W has multiple hits.)
For example, let W = go with X as above. The suffixes that have go as a prefix are i = 1 and i = 2, so the SA interval is [1, 2].
In terms of the sorted list in the figure above, the left boundary of the interval is the smallest index among the suffixes of X that have W as a prefix.
Similarly, the right boundary of the interval is the largest such index.
Note that the search checks whether the query is a prefix of some suffix of the subject, not merely whether its characters are present.
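As a sanity check of the interval definition, the sketch below finds the SA interval by brute force: it sorts the suffixes and collects the ranks of those that start with W. The function name `sa_interval` is hypothetical, and this linear scan is not how BWA finds the interval; it only illustrates what the interval means.

```python
def sa_interval(x, w):
    """Return (left, right), the inclusive range of sorted-suffix ranks whose
    suffix starts with w, or None if w does not occur in x."""
    s = sorted(range(len(x)), key=lambda i: x[i:])
    ranks = [r for r, start in enumerate(s) if x.startswith(w, start)]
    if not ranks:
        return None
    # Suffixes sharing a prefix are adjacent after sorting, so the first and
    # last matching ranks delimit the whole interval.
    return ranks[0], ranks[-1]


print(sa_interval("googol$", "go"))  # (1, 2)
print(sa_interval("googol$", "o"))   # (4, 6)
print(sa_interval("googol$", "gg"))  # None
```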
4. Exact matching: backward search
Define
a: a character of the alphabet.
C(a): the number of characters in X[0, n-2] (the original string, excluding the final $) that are smaller than a in dictionary order.
O(a, i): the number of occurrences of a in B[0, i].
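The conclusion is
Rmin(aW) = C(a) + O(a, Rmin(W) - 1) + 1
Rmax(aW) = C(a) + O(a, Rmax(W))
where Rmin(W) and Rmax(W) are the left and right boundaries of the SA interval of W.
aW is a substring of X if and only if Rmin(aW) <= Rmax(aW).
For example, with W = go (SA interval [1, 2]) and a = o: Rmin(aW) = 3 + 0 + 1 = 4 and Rmax(aW) = 3 + 1 = 4, so aW = ogo is a substring of X.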
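The worked example can be reproduced in Python. The sketch below builds C and O for googol$ by direct counting and applies one extension step. The helper name `extend` is an assumption made for illustration, and O is recomputed by counting each time, whereas a real index would precompute it.

```python
X = "googol$"
S = sorted(range(len(X)), key=lambda i: X[i:])   # suffix array
B = "".join(X[i - 1] for i in S)                 # BWT string, "lo$oogg"

# C[a]: number of characters of X (without the final $) that sort before a
C = {a: sum(ch < a for ch in X[:-1]) for a in set(X[:-1])}


def O(a, i):
    """Occurrences of a in B[0..i] inclusive; 0 for an empty prefix (i < 0)."""
    return B[:i + 1].count(a) if i >= 0 else 0


def extend(a, r_min, r_max):
    """SA interval of aW, given the SA interval (r_min, r_max) of W."""
    return C[a] + O(a, r_min - 1) + 1, C[a] + O(a, r_max)


print(sorted(C.items()))   # [('g', 0), ('l', 2), ('o', 3)]
print(extend("o", 1, 2))   # (4, 4): ogo occurs in X
print(extend("l", 1, 2))   # (4, 3): empty interval, lgo does not occur
```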
How to understand backward search: once we have the SA interval of a string W, adding a character in front of W and searching again is equivalent to searching the suffix tree from the bottom up, so the query is consumed from its last character to its first. Because the sorted rows are cyclic rotations of X, B[i] is actually the character that precedes suffix i. Comparing the counts of a in B[0, i-1] and B[0, i] therefore amounts to checking whether B[i] is a.
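i = 0                # left end of the SA interval of the empty string
j = len(X) - 1       # right end; X includes the trailing $
for x in range(len(query)):
    newChar = query[-x - 1]  # scan the query from back to front
    # once W has an interval, the interval of aW is derived from it
    newI = C(newChar) + O(newChar, i - 1) + 1
    newJ = C(newChar) + O(newChar, j)
    i = newI
    j = newJ
    if i > j:  # empty interval: the query is not in X
        break
matches = S[i:j + 1]  # empty when i > j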
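For a version that runs on its own, the loop above can be wrapped in a function that builds S, B, C and O itself. This is a minimal sketch: the name `backward_search` is hypothetical, the index is built naively on every call, and characters that do not occur in X are treated as an immediate miss, which is an assumption the pseudocode does not spell out.

```python
def backward_search(x, query):
    """Return the sorted start positions of query in x (x must end with '$')."""
    s = sorted(range(len(x)), key=lambda i: x[i:])            # suffix array S
    b = "".join(x[i - 1] for i in s)                          # BWT string B
    c = {a: sum(ch < a for ch in x[:-1]) for a in set(x[:-1])}

    def occ(a, i):
        # occurrences of a in B[0..i]; an empty prefix (i < 0) gives 0
        return b[:i + 1].count(a) if i >= 0 else 0

    i, j = 0, len(x) - 1                # SA interval of the empty string
    for a in reversed(query):           # consume the query back to front
        if a not in c:                  # character absent from x
            return []
        i, j = c[a] + occ(a, i - 1) + 1, c[a] + occ(a, j)
        if i > j:                       # empty interval: no occurrence
            return []
    return sorted(s[i:j + 1])


print(backward_search("googol$", "go"))   # [0, 3]
print(backward_search("googol$", "ogo"))  # [2]
print(backward_search("googol$", "gg"))   # []
```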
5. Inexact matching: bounded traversal / backtracking
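Definitions
- D is introduced. It is computed with the exact-match algorithm, except that O is built from the reverse of X (O_reverse) and the query is read from left to right. Whenever i becomes greater than j, the interval is reset and the difference count is incremented; D(i) records the count at query position i, so it is a lower bound on the number of differences needed to match query[0, i].
- Inexact recursion: traverse the suffix tree recursively, using D to prune, to perform the inexact match.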
Termination conditions:
- D(i) is greater than z, the number of differences still allowed.
- z has dropped below 0 after the configured penalties were subtracted (used when pruning with D is disabled).
- The entire string has been matched.
Note: k and l play the same role here as i and j in exact matching.
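The computation of D can be sketched as follows. This is an illustrative standalone version, not the code from Jwomers' repository: the function name `calculate_D` is an assumption, the reversed index is built naively, and a query character that never occurs in X is treated as forcing an empty interval.

```python
def calculate_D(x, query):
    """D[i]: lower bound on the differences needed to match query[0..i] in x."""
    rev = x[:-1][::-1] + "$"                                  # reverse of X, $ kept last
    s = sorted(range(len(rev)), key=lambda i: rev[i:])
    b = "".join(rev[i - 1] for i in s)                        # BWT of the reversed text
    c = {a: sum(ch < a for ch in rev[:-1]) for a in set(rev[:-1])}

    def occ(a, i):
        # plays the role of O_reverse: occurrences of a in b[0..i]
        return b[:i + 1].count(a) if i >= 0 else 0

    d, z = [], 0
    k, l = 0, len(rev) - 1
    for a in query:                     # left to right, because the text is reversed
        if a in c:
            k, l = c[a] + occ(a, k - 1) + 1, c[a] + occ(a, l)
        else:
            k, l = 1, 0                 # unseen character: force an empty interval
        if k > l:                       # current piece is not a substring of X
            k, l = 0, len(rev) - 1      # start a new piece
            z += 1                      # and count one unavoidable difference
        d.append(z)
    return d


print(calculate_D("googol$", "goo"))  # [0, 0, 0]
print(calculate_D("googol$", "lol"))  # [0, 1, 1]
```

For the query lol, the prefix lo is not a substring of googol, so at least one difference is needed from position 1 onward.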
def inexact_recursion(self, query, i, z, k, l):
    tempset = set()
    # Two stop conditions: too many differences have been encountered (failure),
    # or the entire query has been matched (success).
    if (z < self.get_D(i) and use_lower_bound_tree_pruning) or (
            z < 0 and not use_lower_bound_tree_pruning):
        # reached the limit of differences at this stage, terminate this path traversal
        if debug: print("too many differences, terminating path\n")
        return set()  # return empty set
    if i < 0:  # empty query string, entire query has been matched, return SA indexes k:l
        if debug: print("query string finished, terminating path, success! k=%d, l=%d\n" % (k, l))
        for m in range(k, l + 1):
            tempset.add(m)
        return tempset
    result = set()
    if indels_allowed:
        # Insertion: move on down the query string without finding a match or altering k or l
        result = result.union(self.inexact_recursion(query, i - 1, z - insertion_penalty, k, l))
    for char in self.alphabet:  # for each character in the alphabet
        # find the SA interval for the char
        newK = self.C[char] + self.OCC(char, k - 1) + 1
        newL = self.C[char] + self.OCC(char, l)
        if newK <= newL:  # if the substring was found
            if indels_allowed:
                # Deletion: consume a reference character but stay at query position i
                result = result.union(self.inexact_recursion(query, i, z - deletion_penalty, newK, newL))
            if debug: print("char '%s' found with k=%d, l=%d. z = %d: parent k=%d, l=%d" % (
                char, newK, newL, z, k, l))
            if char == query[i]:
                # the char was correctly aligned, continue without decrementing z (differences)
                result = result.union(self.inexact_recursion(query, i - 1, z, newK, newL))
            else:
                # continue but decrement z, to indicate that this was a difference/unalignment
                result = result.union(self.inexact_recursion(query, i - 1, z - mismatch_penalty, newK, newL))
    return result
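Since the method above depends on a surrounding class and several module-level settings, a reduced standalone sketch may help when experimenting. The version below is a simplification under stated assumptions: it allows mismatches only (no insertions or deletions), charges 1 per mismatch, skips the D pruning, and uses the hypothetical name `inexact_search`. It is not Jwomers' implementation.

```python
def inexact_search(x, query, z):
    """Start positions in x where query matches with at most z mismatches."""
    s = sorted(range(len(x)), key=lambda i: x[i:])
    b = "".join(x[i - 1] for i in s)
    alphabet = sorted(set(x[:-1]))
    c = {a: sum(ch < a for ch in x[:-1]) for a in alphabet}

    def occ(a, i):
        return b[:i + 1].count(a) if i >= 0 else 0

    def recurse(i, z, k, l):
        if z < 0:                       # too many mismatches on this path
            return set()
        if i < 0:                       # whole query consumed: report ranks k..l
            return set(range(k, l + 1))
        found = set()
        for a in alphabet:              # try every character in front of the match
            new_k = c[a] + occ(a, k - 1) + 1
            new_l = c[a] + occ(a, l)
            if new_k <= new_l:          # aW still occurs in x
                cost = 0 if a == query[i] else 1
                found |= recurse(i - 1, z - cost, new_k, new_l)
        return found

    ranks = recurse(len(query) - 1, z, 0, len(x) - 1)
    return sorted(s[r] for r in ranks)


print(inexact_search("googol$", "lol", 0))  # []
print(inexact_search("googol$", "lol", 1))  # [3]
print(inexact_search("googol$", "gol", 1))  # [0, 3]
```

With one mismatch allowed, lol aligns to gol at position 3, and gol aligns both exactly at position 3 and with one mismatch to goo at position 0.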
Finally, thanks to Jwomers for the code; after all, the original paper is written in a rather obscure way.