Gene Prediction Using a Hidden Markov Model

What is a hidden Markov model?

A hidden Markov model (HMM) is a statistical model that describes a Markov process with unobserved (hidden) states. The difficulty lies in inferring the hidden states from the observable output, and then using the model's parameters for further analysis such as pattern recognition, which in today's case means gene prediction. Here, the system being modeled is a Markov process whose observed output is the assembled sequence and whose hidden states indicate which regions are coding and which are not.

Here is a simple example to illustrate.
Suppose I have two dice of different colors in my hand: one is orange (Coding, C) and the other is blue (Noncoding, N). Unlike ordinary dice, each of them has only four possible outcomes, A, T, C, and G, so each die shows A, T, C, or G with probability 1/4.


Two chains together

Suppose we start rolling. First we pick one of the two colored dice, each with probability 1/2. Then we roll it and get one of A, T, C, G. Repeating this process over and over, we obtain a sequence in which each character is one of A, T, C, G, for example CGAAAAAATCG.

This sequence is called the visible chain (the observed chain). In a hidden Markov model, besides this visible chain there is also a chain of hidden states. In our example, the hidden state chain is the sequence of dice you used. For instance, one possible hidden state chain is CCNNNNNNNCC.

In fact, the "Markov chain" in an HMM refers to the hidden state chain, because the transition probabilities are defined between the hidden states (the dice). In our example, whichever die we just used, the next state is C with probability 1/2 and N with probability 1/2. This simple setting was chosen so the beginning is easy to follow, but in fact we are free to set the transition probabilities. For example, we could specify that N can never follow C, or that C follows C with probability 0.1. That would define a new HMM.

Similarly, although there are no transition probabilities between the visible states themselves, each hidden state has a probability of producing each visible symbol, called the emission probability. In our example, the probability that the Coding state (C) emits an A is 1/4, and the probability that the Noncoding state (N) emits an A is also 1/4. Of course, these probabilities are defined arbitrarily here; you could set them to other values.

State transition diagram of the hidden states

In fact, for an HMM, if we know in advance all the transition probabilities between the hidden states and all the emission probabilities from hidden states to visible symbols, running a simulation is quite easy: we just keep picking dice and rolling them.
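The dice-rolling process can be sketched in a few lines of Python. This is a minimal simulation using the uniform 1/2 and 1/4 probabilities from the example; the function name `simulate` and the parameter layout are my own choices, not from the original post:

```python
import random

# Model parameters from the dice example: two hidden states (the dice),
# uniform transitions, and uniform emissions over the four bases.
states = ["C", "N"]
init = {"C": 0.5, "N": 0.5}
trans = {"C": {"C": 0.5, "N": 0.5},
         "N": {"C": 0.5, "N": 0.5}}
emit = {"C": {b: 0.25 for b in "ATCG"},
        "N": {b: 0.25 for b in "ATCG"}}

def simulate(n):
    """Roll the two colored dice n times; return (hidden, visible) chains."""
    hidden, visible = [], []
    state = random.choices(states, weights=[init[s] for s in states])[0]
    for _ in range(n):
        hidden.append(state)
        bases = list(emit[state])
        visible.append(
            random.choices(bases, weights=[emit[state][b] for b in bases])[0])
        # Move to the next hidden state according to the transition matrix.
        state = random.choices(states, weights=[trans[state][s] for s in states])[0]
    return "".join(hidden), "".join(visible)

hidden, visible = simulate(11)
print(hidden)
print(visible)
```

Running it produces one hidden chain of C/N and one visible chain of A/T/C/G, just like the pair of chains described above.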

Doing gene prediction with a hidden Markov model

Next, let's do a simple gene prediction. Given a stretch of genomic DNA sequence, we want to predict the coding regions within it. Following the hidden Markov model described above, we must first distinguish the hidden states, which cannot be observed directly, from the visible symbols, which can.

With two chains

In this example, it is easy to see that the given genomic DNA sequence is the observable symbol string, while coding/non-coding is the hidden state that cannot be observed directly. We can therefore draw the state transition diagram. First, we have two states, coding and noncoding. Because a genome contains both coding and non-coding regions, it is possible to switch between these two states. Of course, each state can also transition to itself, producing a continuous coding or non-coding region. In this way, we get a 2 × 2 transition matrix.

Transition probability

Next, we need to write down the emission probabilities. This is straightforward: in both the coding and the non-coding state, all four bases A, C, G, T can appear, so we have one emission distribution for each of the two states.

Emission probabilities

Now we need a training set to fill these matrices with specific numerical values. Specifically, we need DNA sequences that are well annotated in advance, that is, with the coding and non-coding regions correctly labeled, and preferably fairly long, so that there are plenty of statistics.
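One simple way to fill the matrices is to count transitions and emissions in the annotated sequence and normalize the counts. The sketch below uses a toy, entirely hypothetical annotation; the helper `normalize` is my own name:

```python
from collections import Counter

# Toy annotated training data (hypothetical, for illustration only):
# a sequence and its per-base coding (C) / non-coding (N) labels.
seq    = "CGAAAAAATCGCGAAAAAATCG"
labels = "NNCCCCCCNNNNNCCCCCCNNN"

trans_counts = Counter(zip(labels, labels[1:]))  # (state, next state) pairs
emit_counts  = Counter(zip(labels, seq))         # (state, emitted base) pairs

def normalize(counts):
    """Turn raw pair counts into conditional probabilities P(second | first)."""
    totals = Counter()
    for (first, _), n in counts.items():
        totals[first] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

trans = normalize(trans_counts)  # estimated transition matrix
emit = normalize(emit_counts)    # estimated emission matrix
print(trans)
print(emit)
```

Each row of the estimated matrices sums to 1, as a probability distribution must.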

Suppose that, by analyzing the training set, we have filled out the transition probability matrix and the emission probability matrix. With these data, we now want to infer, for a given unknown genome sequence, the most likely hidden state path, that is, the state path with the greatest probability. As before, we use a dynamic programming algorithm (the Viterbi algorithm): we write down the recurrence formula and the final termination equation.

Results of training

From the formula we can see that we need to perform a large number of multiplications. This is not only slow, but as the number of multiplied factors grows, the value on a computer easily becomes too small to represent, causing underflow. Therefore, we usually introduce logarithms, which convert multiplication into addition. Concretely, we take log10 of the transition and emission probabilities in advance.
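The underflow problem is easy to demonstrate. Multiplying a probability by itself a thousand times underflows to zero in double precision, while the equivalent sum of logs stays in a perfectly safe range:

```python
import math

# Multiplying many probabilities underflows; adding their logs does not.
p = 0.3
product = p ** 1000               # 1e-523-ish, below the double-precision floor
log_sum = 1000 * math.log10(p)    # about -522.88, easily representable
print(product)                    # 0.0 (underflow)
print(log_sum)
```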

Taking logs

OK, let's begin in earnest. Suppose I have assembled a sequence (huh, why so short? For simplicity!):

CGAAAAAATCG

First, as before, we draw the dynamic programming iteration matrix, which contains two states: the non-coding state N and the coding state C. Next, we need to set the boundary condition, that is, the default initial distribution of the two states. For computational convenience, we set them to 0.8 and 0.2, which after log10 conversion become -0.097 and -0.699. Now we fill in the cells step by step.

Then we meet the first base, C. From the emission probabilities, the log10 probability of emitting C in the non-coding state is -0.523; adding this to -0.097 gives -0.62. Similarly, in the coding state the value is -0.699 + (-0.699) = -1.40.

Next, to advance one base we need a state transition. Consider the first case, the transition from the non-coding state to the non-coding state. From the transition matrix, the log10 transition probability is -0.097; adding the log10 emission probability of the next base G in the non-coding state, -0.523, we get (-0.62) + (-0.097) + (-0.523) = -1.24. Similarly, we compute the transition from the coding state to the non-coding state at this position: -1.40 + (-0.398) + (-0.523) = -2.32. This value is smaller than the -1.24 obtained from the non-coding state, so it is not retained (we discard the lower-probability candidate path).
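These cell values can be checked with a few lines of arithmetic. The probabilities below are the ones implied by the rounded log10 values in the worked example (0.8, 0.2, 0.3, 0.4); note that the score carried forward for the N cell at position 1 is -0.62:

```python
import math

log = math.log10
# Position 1 scores: initial distribution plus emission of the first base C.
vN1 = log(0.8) + log(0.3)            # N cell: about -0.62
vC1 = log(0.2) + log(0.2)            # C cell: about -1.40
# Position 2, arriving in N after emitting G:
n_to_n = vN1 + log(0.8) + log(0.3)   # N -> N: about -1.24 (kept)
c_to_n = vC1 + log(0.4) + log(0.3)   # C -> N: about -2.32 (discarded)
print(round(vN1, 2), round(n_to_n, 2), round(c_to_n, 2))
```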

Similarly, we continue the iteration and fill in all the remaining cells one by one, as follows:

Filling in the cells, step by step

Next we do the traceback. First, pick the largest of the final probability values. Starting from it, we trace backwards step by step, and the traceback path gives us the final result. Along the way, whenever there are two possible predecessors, we go to the one with the larger probability.

Traceback path

Marking down the Ns and Cs visited along this path gives the final result:

NNCCCCCCNNN

That is, we have partitioned the input sequence CGAAAAAATCG into non-coding regions (N) and a coding region (C).

Result
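The whole procedure can be put together as a small Viterbi implementation in log10 space. The initial probabilities (0.8/0.2) and the entries used in the worked example (N→N = 0.8, C→N = 0.4, N emits C or G with 0.3, C emits C with 0.2) are taken from the text above; the remaining transition and emission entries are hypothetical completions I chose so that each row sums to 1, and the function name `viterbi` is mine:

```python
import math

init  = {"N": 0.8, "C": 0.2}
trans = {"N": {"N": 0.8, "C": 0.2},
         "C": {"N": 0.4, "C": 0.6}}
emit  = {"N": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "C": {"A": 0.5, "C": 0.2, "G": 0.1, "T": 0.2}}

def viterbi(seq):
    """Most probable state path, computed with log10 scores to avoid underflow."""
    log = math.log10
    states = list(init)
    # Initialization: boundary condition plus emission of the first base.
    score = {s: log(init[s]) + log(emit[s][seq[0]]) for s in states}
    back = []
    for base in seq[1:]:
        prev, new_score, ptr = score, {}, {}
        for s in states:
            # Keep only the best-scoring predecessor (discard lower paths).
            best_prev, best = max(
                ((p, prev[p] + log(trans[p][s])) for p in states),
                key=lambda t: t[1])
            ptr[s] = best_prev
            new_score[s] = best + log(emit[s][base])
        score = new_score
        back.append(ptr)
    # Termination: start the traceback from the highest final score.
    state = max(score, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("CGAAAAAATCG"))  # → NNCCCCCCNNN
```

With these parameters the sketch reproduces the partition obtained by hand above, ending the coding region at the run of As.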

Due to limited time, our MSGP (The Most Simple Gene Predictor) is very simple. But it is easy to extend; you only need to introduce more states. The only constraint is that the emission probabilities of different states, which here means their base composition, must differ appreciably. Only then can the hidden states be inferred from your observed sequence.

For example, GenScan, the gene prediction algorithm proposed by Chris Burge in 1996, defines separate states for exons, introns, UTRs, and so on, which greatly improves prediction accuracy; it is one of the most successful gene prediction tools. In its basic principle, however, it is no different from the very simple MSGP we have just described. Similarly, the same approach can be used to predict 5' splice sites, and so on.

In fact, by separating the hidden states from the observable symbols, the hidden Markov model provides an effective probabilistic framework for data analysis in bioinformatics and is one of the most commonly used algorithmic models in the field today.

This article is essentially a retelling of the two blog posts below; my respect and thanks to their authors.
It also draws on Wu Jun's book 《数学之美》 (The Beauty of Mathematics).


Origin www.cnblogs.com/klausage/p/11831084.html