Fast and accurate short read alignment with Burrows-Wheeler transform

Original link: https://blog.csdn.net/math715/article/details/46992623

This paper, from the field of bioinformatics, describes the algorithm design of BWA.

1. BWT transform
Assume T = RUOSHUI.
Append a character $ that is lexicographically smaller than every character of T to the end of T, giving X = RUOSHUI$.
The N cyclic shifts of X (N = |X| = 8) form the rows of an N * N matrix; each row below is labelled with the position in X at which it starts:
0 RUOSHUI$
1 UOSHUI$R
2 OSHUI$RU
3 SHUI$RUO
4 HUI$RUOS
5 UI$RUOSH
6 I$RUOSHU
7 $RUOSHUI
Sorting the rows lexicographically gives the matrix M (columns: row number, starting position in X, rotation):
0 7 $RUOSHUI
1 4 HUI$RUOS
2 6 I$RUOSHU
3 2 OSHUI$RU
4 0 RUOSHUI$
5 3 SHUI$RUO
6 5 UI$RUOSH
7 1 UOSHUI$R

The BWT of X is the last column of M, L = "ISUU$OHR";
the suffix array is the column of starting positions, S = [7,4,6,2,0,3,5,1].
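
The construction above can be sketched in a few lines of Python. This is only a naive illustration: it sorts all rotations explicitly, which works for the example but would be far too slow and memory-hungry for a genome. The function name `bwt_and_suffix_array` is made up for this sketch and is not part of BWA.

```python
def bwt_and_suffix_array(t, sentinel="$"):
    """Return (L, S) for X = t + sentinel by sorting all rotations of X.

    Assumes the sentinel does not occur in t and sorts before every
    character of t (true for '$' and upper-case letters in ASCII).
    """
    x = t + sentinel
    n = len(x)
    # S[r] is the starting position in X of the rotation in row r of M.
    s = sorted(range(n), key=lambda i: x[i:] + x[:i])
    # The last character of the rotation starting at i is X[i - 1]
    # (for i = 0 Python's negative index wraps around to the sentinel).
    l = "".join(x[i - 1] for i in s)
    return l, s


l, s = bwt_and_suffix_array("RUOSHUI")
print(l)  # ISUU$OHR
print(s)  # [7, 4, 6, 2, 0, 3, 5, 1]
```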

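2. Suffix array interval
If W is a substring of X, every occurrence of W is a prefix of some suffix of X, and these suffixes occupy consecutive rows of M. The suffix array interval of W is \([\underline{R}(W), \overline{R}(W)]\), where \(\underline{R}(W)\) is the first and \(\overline{R}(W)\) the last row of M that starts with W.
If W is the empty string, \(\underline{R}(W) = 1\) and \(\overline{R}(W) = |X| - 1\) (7 in the example).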
C(a) is the number of characters of X, not counting $, that are lexicographically smaller than a. The first row of M that starts with a is therefore row C(a) + 1, as shown in the table

a     H  I  O  R  S  U
C(a)  0  1  2  3  4  5
O(a, i) is the number of occurrences of the character a in L[0, i], as shown in the table

L  I  S  U  U  $  O  H  R
i  0  1  2  3  4  5  6  7
$  0  0  0  0  1  1  1  1
H  0  0  0  0  0  0  1  1
I  1  1  1  1  1  1  1  1
O  0  0  0  0  0  1  1  1
R  0  0  0  0  0  0  0  1
S  0  1  1  1  1  1  1  1
U  0  0  1  2  2  2  2  2
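
Both tables can be derived from the BWT string alone. The sketch below assumes C(a) counts the characters of X other than $ that are smaller than a, and it stores O as one full row per character, which is simple but wasteful; the name `occurrence_tables` is hypothetical.

```python
def occurrence_tables(bwt, sentinel="$"):
    """Build C(a) and O(a, i) from the BWT string L.

    c[a]    = number of non-sentinel characters smaller than a
    o[a][i] = number of occurrences of a in L[0..i]
    """
    text_chars = [ch for ch in bwt if ch != sentinel]
    c = {a: sum(ch < a for ch in text_chars) for a in sorted(set(text_chars))}
    o = {}
    for a in sorted(set(bwt)):
        running, row = 0, []
        for ch in bwt:
            running += (ch == a)  # bool counts as 0 or 1
            row.append(running)
        o[a] = row
    return c, o


c, o = occurrence_tables("ISUU$OHR")
print(c)       # {'H': 0, 'I': 1, 'O': 2, 'R': 3, 'S': 4, 'U': 5}
print(o["U"])  # [0, 0, 1, 2, 2, 2, 2, 2]
```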
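3. Exact search
If W is a substring of X and a is a character, the interval of aW follows from the interval of W by the formulas
\[\underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1\]
\[\overline{R}(aW) = C(a) + O(a, \overline{R}(W))\]
aW is a substring of X if and only if \(\underline{R}(aW) \le \overline{R}(aW)\), so the formulas determine exactly whether aW occurs and in which rows.
To find the interval of a query string Y, start from the empty string and apply the formulas once per character, from the last character of Y to the first. For Y = SHU the intervals are [6, 7] for U, [1, 1] for HU and [5, 5] for SHU, and S[5] = 3 is the position of SHU in X. This is called backward search.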
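
The backward search can be sketched as follows. One deliberate deviation from the text, labelled here as an assumption of this sketch: the search starts from the interval [0, |X| - 1] with O(a, -1) taken as 0, instead of [1, |X| - 1]. With a start of 1, O(a, 0) already counts L[0], the last character of T, so a query ending in that character (for example "I") would come out empty. The helper names `build_index` and `backward_search` are hypothetical.

```python
def build_index(t, sentinel="$"):
    """Naive index for X = t + sentinel: suffix array, C table, O table."""
    x = t + sentinel
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    bwt = "".join(x[i - 1] for i in sa)
    c = {a: sum(ch < a for ch in t) for a in sorted(set(t))}
    occ = {}
    for a in c:
        running, row = 0, []
        for ch in bwt:
            running += (ch == a)
            row.append(running)
        occ[a] = row
    return sa, c, occ


def backward_search(w, sa, c, occ):
    """Return the suffix array interval (k, l) of w, or None if w is absent."""
    def o(a, i):
        # O(a, -1) = 0; guards against Python's wrap-around on index -1.
        return occ[a][i] if i >= 0 else 0

    k, l = 0, len(sa) - 1          # interval of the empty string (assumption)
    for a in reversed(w):          # last character of w first
        if a not in c:
            return None
        k, l = c[a] + o(a, k - 1) + 1, c[a] + o(a, l)
        if k > l:
            return None
    return k, l


sa, c, occ = build_index("RUOSHUI")
k, l = backward_search("SHUI", sa, c, occ)
print((k, l), [sa[r] for r in range(k, l + 1)])  # (5, 5) [3]
```

Each step costs two table lookups, so the whole query takes time proportional to its length, independent of the length of X.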

4. Inexact search
Inexact search finds the occurrences that differ from the query string W in at most z places. A difference is either a mismatch or a gap. The intervals are computed recursively, as follows

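InexactSearch( W, z )
    CalculateD( W )
    return inexRecur( W, |W| - 1, z, 1, |X| - 1 )

// D(i) is a lower bound on the number of differences needed to match W[0,i]
CalculateD( W )
    j = 0
    z = 0
    for ( i = 0; i < |W| ; ++i ){
        if ( W[j,i] is not a substring of X ){
             z = z + 1
             j = i + 1
        }
        D(i) = z
    }

// returns the intervals of all matches of W[0,i] followed by the part
// already matched (interval [k, l]), using at most z more differences
inexRecur( W, i, z, k, l )
    if z < D(i)                                          // D(i) = 0 for i < 0
        return NULL
    if i < 0
        return {[k, l]}
    I = NULL
    I = I U inexRecur( W, i - 1, z - 1, k , l )          // gap: W[i] has no partner in X
    for each b in {A, C, G, T }
        k' = C(b) + O ( b, k - 1 ) + 1                   // k and l stay unchanged for the next b
        l' = C(b) + O ( b, l )
        if k' <= l'
            I = I U inexRecur( W, i , z - 1 , k' , l' )      // gap: b has no partner in W
            if b = W[i]
                I = I U inexRecur( W, i - 1, z, k', l' )     // match
            else
                I = I U inexRecur( W, i - 1, z - 1, k', l' ) // mismatch
    return I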
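
The recursion above can be tried out with the following Python sketch. It is a toy version under several assumptions that are not in the text: the index is built naively, D is computed with Python's substring test rather than with an index, the alphabet is taken from the reference instead of being fixed to {A, C, G, T}, and the search starts from row 0 with O(b, -1) = 0 so that queries ending in the last character of T are found. All function names are hypothetical.

```python
def build_index(t, sentinel="$"):
    """Naive index for X = t + sentinel: suffix array, C table, O table."""
    x = t + sentinel
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    bwt = "".join(x[i - 1] for i in sa)
    c = {a: sum(ch < a for ch in t) for a in sorted(set(t))}
    occ = {a: [] for a in c}
    for a in c:
        running = 0
        for ch in bwt:
            running += (ch == a)
            occ[a].append(running)
    return sa, c, occ


def calculate_d(w, t):
    """d[i] = lower bound on the differences needed to match w[0..i] in t."""
    d, z, j = [], 0, 0
    for i in range(len(w)):
        if w[j:i + 1] not in t:
            z += 1
            j = i + 1
        d.append(z)
    return d


def inexact_search(w, z, t):
    """Sorted start positions in t of matches of w with at most z differences."""
    sa, c, occ = build_index(t)
    d = calculate_d(w, t)

    def o(b, i):
        return occ[b][i] if i >= 0 else 0   # O(b, -1) = 0

    def recur(i, budget, k, l):
        if budget < (d[i] if i >= 0 else 0):    # also rejects budget < 0
            return set()
        if i < 0:
            return {(k, l)}
        hits = recur(i - 1, budget - 1, k, l)   # gap: skip w[i]
        for b in c:
            k2 = c[b] + o(b, k - 1) + 1         # fresh names, k and l are reused
            l2 = c[b] + o(b, l)
            if k2 <= l2:
                hits |= recur(i, budget - 1, k2, l2)          # gap: extra b in t
                if b == w[i]:
                    hits |= recur(i - 1, budget, k2, l2)      # match
                else:
                    hits |= recur(i - 1, budget - 1, k2, l2)  # mismatch
        return hits

    intervals = recur(len(w) - 1, z, 0, len(sa) - 1)
    return sorted({sa[r] for k, l in intervals for r in range(k, l + 1)})


print(inexact_search("SHXI", 1, "RUOSHUI"))  # [3]
```

In the example, "SHXI" does not occur in X, but one mismatch (X against U) aligns it with "SHUI" at position 3. The array D is what keeps the recursion small: any branch whose remaining budget is below D(i) is cut off immediately.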

 

Origin blog.csdn.net/u010608296/article/details/102642626