Global alignment and dynamic programming

Foreword

Anyone who has studied bioinformatics will have met the Needleman-Wunsch algorithm and the Smith-Waterman algorithm: the first performs global alignment, the second local alignment. Looking only at the recurrence formula, the algorithm is not complicated. Anyone who has applied the formula to two short sequences and drawn the arrows knows it is not hard. This post briefly describes how Needleman-Wunsch uses dynamic programming to find the maximum similarity between two sequences.
Note: if some of the formulas do not render, the post can also be read here: https://www.jianshu.com/p/002bbebcaaef

Dynamic Programming

Dynamic programming is a mathematical optimization method, and also a method of computer programming. For a humble biology student, looking at dynamic programming from the mathematical side is far too hardcore. Here we try to understand it from the programming side.
Dynamic programming is a way of solving a problem: a big, complex problem is split into several smaller problems, and the optimal solution of the big problem is derived from the optimal solutions of the small ones. Of course, this does not mean that every big problem that can be split can be solved with dynamic programming. The problem also has to satisfy two properties, which are discussed further below. Back to the question this post focuses on, global alignment: for two protein or nucleic acid sequences, how do we split the problem of finding their maximum similarity into several small problems? (Hint: the matrix and the recurrence formula)

Splitting the problem

Trying to work out the split out of thin air, I feel I could rack my brains and still not get there T_T. So let us follow the usual routine, do a global alignment by hand once more, and see from there how the problem splits.

To keep the calculation easy, we use a very simple score matrix:
N | A | T | G | C | -
--- | --- | --- | --- | --- | ---
A | 10 | 5 | 5 | 5 | -5
T | 5 | 10 | 5 | 5 | -5
G | 5 | 5 | 10 | 5 | -5
C | 5 | 5 | 5 | 10 | -5
- | -5 | -5 | -5 | -5 | 0
That is: an exact match scores 10, a mismatch scores 5, and a base aligned to a gap scores -5. A gap aligned to a gap scores 0, which only makes sense at the very start, for initialization. After that, aligning a gap to a gap is meaningless.
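The score matrix can be written as a small lookup function. This is only a minimal sketch: the names `MATCH`, `MISMATCH`, `GAP` and `score` are my own and do not come from the original post.

```python
# Scores as described in the text: match 10, mismatch 5, base vs gap -5.
# All names here are illustrative.
MATCH, MISMATCH, GAP = 10, 5, -5


def score(a: str, b: str) -> int:
    """Score for aligning symbol a with symbol b ('-' stands for a gap)."""
    if a == "-" and b == "-":
        return 0          # gap against gap: only used for initialization
    if a == "-" or b == "-":
        return GAP        # a base aligned to a gap
    return MATCH if a == b else MISMATCH


print(score("A", "A"), score("A", "T"), score("A", "-"), score("-", "-"))
# prints: 10 5 -5 0
```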
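The formula is as follows ($d$ is the gap penalty, i.e. the score for aligning a base with a gap):
$$F(i,j)=\max\begin{cases}F(i-1,j-1)+s(x_i,y_j)\\F(i-1,j)+d\\F(i,j-1)+d\end{cases}$$
(From the teacher's PPT)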

Now suppose we have two sequences, Seq1: CATCCCCCAAA and Seq2: CAAAA, of lengths 11 and 5. Draw a two-dimensional matrix sized to their lengths and fill in the initial data:
[Figure: the initial alignment matrix for Seq1 and Seq2]

The bold letters are Seq1 and Seq2, and - stands for a gap. The alignment starts with a gap against a gap, so we initialize $F(0,0) = 0$ and treat the row and column belonging to that gap as row 0 and column 0. Row 1 and column 1 then correspond to the first letter of each sequence. $(i, j)$ denotes (row, column); how large $i$ and $j$ can get depends on the lengths of the sequences. The score $s(x_i, y_j)$ is the score for aligning base $i$ of one sequence with base $j$ of the other. In the matrix, it is the result of comparing the base on row $i$ with the base on column $j$.

According to the formula for $F(i,j)$, the value of $F(i,j)$ is determined by the value to the upper left ($F(i-1,j-1)$), the value above ($F(i-1,j)$) and the value to the left ($F(i,j-1)$).
(Note: "upper left", "above" and "left" are relative positions. They only make sense under the current layout of the matrix, where $(0,0)$ sits in the top-left corner and the matrix extends to the right and downward. In fact $(0,0)$ could be placed elsewhere, for example in the bottom-right corner.)
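The three-way choice for a single cell can be sketched in a few lines of Python. This is an illustration under my own assumptions: `x` is taken to be the sequence along the rows and `y` the one along the columns, and the helper names `s` and `cell` are hypothetical. Python strings are 0-indexed, so base $i$ of a sequence is `x[i - 1]`.

```python
GAP = -5  # the gap penalty d from the score matrix


def s(a, b):
    # match 10, mismatch 5, as in the score matrix
    return 10 if a == b else 5


def cell(F, i, j, x, y):
    """Value of F[i][j] from its three neighbours (needs i >= 1 and j >= 1)."""
    diag = F[i - 1][j - 1] + s(x[i - 1], y[j - 1])  # align x_i with y_j
    up = F[i - 1][j] + GAP                          # x_i against a gap
    left = F[i][j - 1] + GAP                        # y_j against a gap
    return max(diag, up, left)


# Tiny example: the top-left 2x2 corner when both sequences start with C.
F = [[0, -5], [-5, None]]
F[1][1] = cell(F, 1, 1, "C", "C")
print(F[1][1])  # 10
```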

Extreme alignments

Look first at row 0: its values can only come from the left. Let us compute a few:
$F(0,1) = F(0,1-1) - 5 = 0 - 5 = -5$
$F(0,2) = F(0,2-1) - 5 = -5 - 5 = -10$
$F(0,3) = F(0,3-1) - 5 = -10 - 5 = -15$
Column 0 can only take its values from above. It is computed the same way, so it is not demonstrated again.
Because row 0 and column 0 are so easy to work out, they are usually filled in right away when the matrix is drawn.
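Filling in row 0 and column 0 can be sketched as follows. The function name `init_matrix` is my own, and the sizes 5 and 11 in the example are simply the two sequence lengths. Which sequence goes on the rows is an assumption here, since the original figure is not available.

```python
GAP = -5  # gap penalty d


def init_matrix(m, n, gap=GAP):
    """Build an (m+1) x (n+1) matrix with row 0 and column 0 filled in."""
    F = [[None] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 0                          # gap against gap
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + gap      # row 0: only a left neighbour exists
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + gap      # column 0: only an upper neighbour exists
    return F


F = init_matrix(5, 11)
print(F[0][:4])               # [0, -5, -10, -15]
print([row[0] for row in F])  # [0, -5, -10, -15, -20, -25]
```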

You probably know that extending to the right or downward means aligning to a gap. But why is that?
Take row 0 as an example. In row 0, $i = 0$ is fixed, which stands for element 0 of one sequence (call it $S1$ from here on). $j = 0, 1, 2, 3, \ldots$ stands for elements 0, 1, 2, 3, ... of the other sequence (call it $S2$). Element 1 of a sequence is its first base. So what on earth is element 0? Element 0 is nothing at all, so we treat it as a gap.

  1. The alignment starts from element $(0,0)$ of the two sequences: a gap aligned with a gap, score 0.
  2. With $i = 0$ fixed, we stay on element 0 of $S1$ while $j$ keeps increasing. At $j = 1$, element $j = 1$ of $S2$ would be matched with element $i = 0$ of $S1$. But element $i = 0$ of $S1$ has already been aligned with element $j = 0$ of $S2$ and cannot be aligned again, and it cannot quietly be aligned with element $i = 1$ of $S1$ either, because $i$ is fixed. There is no other way: it can only be aligned with a gap, which incurs a gap penalty on top of the previous score.
  3. The same holds for elements $2, 3, 4, \ldots$ of $S2$: they can only be aligned with gaps, which incurs gap penalties.

When extending to the right, the row number $i$ stays fixed and only the column number $j$ increases. An element of a sequence cannot be aligned twice, and no other element of the other sequence is available, so the only option is to align with a gap. (Extending downward works the same way.)
Let us look at two extreme cases:
[Figure: the matrix with the two extreme routes, one in black and one in red]

##Black route
CATCCCCCAAA-----
-----------CAAAA
##Red route
-----CATCCCCCAAA
CAAAA-----------
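To see how costly these two extreme alignments are, one can simply add up the score of each column. The sketch below assumes the scores used in this post (match 10, mismatch 5, base against gap -5); the function name `alignment_score` is my own.

```python
def alignment_score(a, b, match=10, mismatch=5, gap=-5):
    """Sum the column scores of two gapped strings of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue                  # gap against gap scores 0
        if p == "-" or q == "-":
            total += gap              # base against gap
        else:
            total += match if p == q else mismatch
    return total


print(alignment_score("CATCCCCCAAA-----", "-----------CAAAA"))  # -80
print(alignment_score("-----CATCCCCCAAA", "CAAAA-----------"))  # -80
```

Both routes consist of 16 columns that each pair a base with a gap, hence 16 times -5.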

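So what use are row 0 and column 0, you may ask, holding all those negative scores and producing such extreme alignments? Ruling out these unsuitable alignments also costs a lot of resources.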

Starting the alignment

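That was a small detour. Moving straight to the right or straight down produces a gap, because one of the row and column indices is fixed and cannot supply an element to align with. To align elements properly, $i$ and $j$ have to increase together, i.e. the route extends toward the lower right. Apart from the cells in row 0 and column 0, which have a single source, every other cell of the matrix has three sources: above, left and upper left. To determine the value of $F(i,j)$, follow the formula and take the largest of the three. $F(m,n)$ (where $m,n$ are the lengths of the two sequences) is the final similarity score of the alignment.
Since we want $F(m,n)$ to be as large as possible, we surely should not pile up too many gap penalties. So if we extend toward the lower right from the very start, until we can go no further, will we get the maximum similarity score? Let us try:
[Figure: the partially filled matrix, with arrows marking where each maximum came from and the diagonal route in red]
We use arrows to show which direction the maximum came from. Two arrows mean it can be reached from two directions. Judging by the route of red arrows in the figure, always choosing to extend toward the lower right seems able to reach the maximum, and only three more pairs of elements need to be aligned before the end point is reached. Right now the red arrows look like the star of the whole show. But how does it turn out in the end? Let us fill in the remaining cells and see.
[Figure: the fully filled matrix]
Look again, and the star of the show seems to have been upstaged?? And by one route after another, at that. It is the mantis stalking the cicada with the oriole right behind it, and then the fisherman walking off with the prize...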

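Greedy algorithm: this is the strategy the red route follows. From the very start, at every step, it picks whichever of the three directions gives the highest score, yet the final result is not the best. A method that makes the choice that looks best at each step, without considering the overall optimum, is called a greedy algorithm. Here it does not yield the optimal result.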
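The greedy strategy can be sketched like this. It is my own reading of the red route: at each step take the move with the best immediate score, assuming match 10, mismatch 5 and gap -5. The function names are hypothetical.

```python
def s(a, b):
    # match 10, mismatch 5
    return 10 if a == b else 5


def greedy_score(x, y, gap=-5):
    """Walk from (0,0) to (len(x), len(y)), always taking the best next step."""
    i = j = total = 0
    while i < len(x) or j < len(y):
        moves = []  # (immediate gain, step in i, step in j)
        if i < len(x) and j < len(y):
            moves.append((s(x[i], y[j]), 1, 1))  # diagonal: align two bases
        if i < len(x):
            moves.append((gap, 1, 0))            # base of x against a gap
        if j < len(y):
            moves.append((gap, 0, 1))            # base of y against a gap
        gain, di, dj = max(moves)
        total += gain
        i, j = i + di, j + dj
    return total


print(greedy_score("CATCCCCCAAA", "CAAAA"))  # 5
```

The greedy route collects 10 + 10 + 5 + 5 + 5 = 35 on the diagonal and then pays six gap penalties. An alignment that pairs all five bases of CAAAA with identical bases of CATCCCCCAAA has 5 matches and 6 gaps, so it scores 50 - 30 = 20, which is clearly better.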
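Exhaustive search: this is a very intuitive method. List every possible case and pick the best one. For sequence alignment, though, it is a very poor method. How many routes lead from the top-left of the matrix to the bottom-right? Every cell offers three possible directions and a route takes at most $m+n$ steps, so the count is bounded by $3^{m+n}$ and grows exponentially with the sequence lengths. Let $m,n$ get even slightly large and the number explodes. Completely impractical.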
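For a sense of scale, the routes can also be counted exactly with a small recurrence: the last step into a cell comes from above, from the left or from the upper left. These counts are known as Delannoy numbers. The function name `count_paths` is my own, and the sketch is only meant as an illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_paths(m: int, n: int) -> int:
    """Routes from (0,0) to (m,n) using right, down and diagonal steps."""
    if m == 0 or n == 0:
        return 1  # along the border there is only one straight route
    return (count_paths(m - 1, n)         # last step came from above
            + count_paths(m, n - 1)       # last step came from the left
            + count_paths(m - 1, n - 1))  # last step was diagonal


print(count_paths(11, 5))  # 56695
```

Even for the two short sequences of this post (lengths 11 and 5) there are already 56695 routes to compare.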

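Hmm, having written this far, I still have not explained how global alignment splits the problem. Let us get to that now.
First, reaching cell $(m,n)$ of the matrix means the alignment is over, because every element of both sequences has taken part in the alignment, whether it was aligned to a gap or to an element of the other sequence. Once all elements are aligned, continuing would only pair gaps with gaps, which is meaningless. So $(m,n)$ is the end point.
Let us start from the end point and see how the big problem splits into small ones.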
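$$F(m,n)=\max\begin{cases}F(m-1,n-1)+s(x_m,y_n)\\F(m-1,n)+d\\F(m,n-1)+d\end{cases}$$
Here $s(x_m,y_n)$ and $d$ are known, so the problem $F(m,n)$ can be solved by finding the scores $F(m-1,n-1)$, $F(m,n-1)$ and $F(m-1,n)$. One big problem has been split into 3 subproblems. The scores of the three subproblems are not known either, so we simply keep splitting them into further subproblems, as in the figure:
[Figure: $F(m,n)$ split recursively into subproblems]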

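Divide and conquer: the idea of divide and conquer is also to split one problem into several problems. How does that differ from the way dynamic programming separates subproblems? From the figure above we can see that the subproblems split off from $F(m,n)$ contain repeats. The subproblems of divide and conquer are generally independent and do not overlap, whereas the subproblems of dynamic programming generally do overlap.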
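The overlap can be made visible by counting how often the recursive split is evaluated, with and without remembering the answers. This is a sketch under the scoring assumed in this post (match 10, mismatch 5, gap -5). The names `solve` and `calls` are my own, and `functools.lru_cache` is used only as a convenient way to store each subproblem once.

```python
from functools import lru_cache


def s(a, b):
    # match 10, mismatch 5
    return 10 if a == b else 5


def solve(x, y, gap=-5, memo=True):
    """Return (F(m,n), number of times the body of F was executed)."""
    calls = 0

    def F(i, j):
        nonlocal calls
        calls += 1
        if i == 0:
            return j * gap   # row 0: only gaps so far
        if j == 0:
            return i * gap   # column 0: only gaps so far
        return max(F(i - 1, j - 1) + s(x[i - 1], y[j - 1]),
                   F(i - 1, j) + gap,
                   F(i, j - 1) + gap)

    if memo:
        # Rebinding F makes the recursive calls go through the cache too.
        F = lru_cache(maxsize=None)(F)
    return F(len(x), len(y)), calls


print(solve("CATCCCCCAAA", "CAAAA", memo=False))  # (20, 85042)
print(solve("CATCCCCCAAA", "CAAAA", memo=True))   # (20, 72)
```

Without the cache the same subproblems are recomputed tens of thousands of times. With it, each of the 12 x 6 = 72 cells is computed exactly once.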

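Two properties

Time to cover the two properties that were left unexplained earlier. One is optimal substructure, the other is the absence of aftereffects.
Optimal substructure: if a problem can be split into several subproblems, and the optimal solution of the big problem can be derived from the optimal solutions of the subproblems, the problem is said to have the optimal substructure property. This property is easy to see from the recurrence formula.
No aftereffects: given the state of a certain stage, the development of the process after that stage is not affected by the states of the stages before it (from 阮行止's answer). What does that mean? Take our sequence alignment as an example. Once the alignment of some earlier stage has been fixed, like the front part of the alignments below, the score of that part is already determined. No aftereffects means that the earlier alignment has no influence at all on the score of the later alignment. Any change to the front part of an alignment does not affect the result of the back part either.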

##1
CAT  CCCCC-----AAA
CA-  ----------AAA
##2
CAT  CCCCCAAA-----
CA-  AA----------A
##3
CAT  CCCCCAAA-----
CA-  ----------AAA

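The absence of aftereffects lets us confidently store only the optimal result in each cell of the matrix and discard the many scores that are not optimal, because a worse alignment of an earlier stage can never combine with a worse alignment of a later stage to become the globally optimal alignment.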

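Complexity

Let us look at the complexity of the Needleman-Wunsch algorithm. From the two-dimensional matrix we can see that the optimal score of every cell has to be computed, and every score also has to be stored for the later calculations. So both the time complexity and the space complexity are $O(mn)$.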
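Putting the pieces together, the whole matrix fill plus a traceback fits in a short function. This is a minimal sketch rather than a reference implementation: the function name, the default scores (match 10, mismatch 5, gap -5) and the tie-breaking order in the traceback (diagonal first, then up, then left) are my own choices. The two nested loops touch each of the $(m+1)(n+1)$ cells once, which is where $O(mn)$ time and space come from.

```python
def needleman_wunsch(x, y, match=10, mismatch=5, gap=-5):
    """Global alignment; returns (score, aligned x, aligned y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap                  # column 0: gaps only
    for j in range(1, n + 1):
        F[0][j] = j * gap                  # row 0: gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # upper left
                          F[i - 1][j] + gap,       # above
                          F[i][j - 1] + gap)       # left

    # Traceback from (m, n) to (0, 0), following one optimal route.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


best, a, b = needleman_wunsch("CATCCCCCAAA", "CAAAA")
print(best)  # 20
print(a)
print(b)
```

Several alignments can share the best score, so the printed alignment depends on the tie-breaking order. The score itself does not.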

Other

To understand dynamic programming, I read the Zhihu question "What is dynamic programming (Dynamic Programming)? What is the significance of dynamic programming?" and went through many of the answers. Many experts gave their own understanding. Their answers have much in common, and they also differ. We cannot say who is right (quietly muttering: and I am not going to say who is right either).
Based on global alignment, here are my own views:
Dynamic programming has a lot in common with other methods. If the other methods are defined a little more broadly, dynamic programming can even be described as one of them. For example:
Exhaustive search: this was mentioned above, but there the objects being enumerated were whole alignments. Now think about it from a different angle: we enumerate every cell of the matrix, that is, every optimal $F(i,j)$ (the maximum similarity of the alignment up to $(i,j)$). The object of enumeration changes from routes to $F(i,j)$.
Greedy algorithm: in the greedy algorithm mentioned earlier, we narrowly took "best" to mean the score gained by moving down, right or to the lower right. Now let "best" mean the largest of the neighboring $F(i,j)$ values. Choosing greedily by this criterion means repeatedly comparing the adjacent $F(i,j)$ values and picking the best, until finally the best of the three routes into $F(m,n)$ is selected.
In short, black cat or white cat, the one that helps solve the problem is a good cat! Pick whichever understanding you find best. After running into dynamic programming a few more times, your understanding will gradually improve.


Origin www.cnblogs.com/huanping/p/11257340.html