Understand Markov Models, Hidden Markov Models, Markov Random Fields, and Conditional Random Fields in One Go! (with POS Tagging Code Implementation)

1. The Difference Between Markov Networks, Markov Models, Markov Processes, and Bayesian Networks

If you have read my earlier post on Bayesian networks, you should already understand how probabilistic graphical models are structured. If not, please see that summary first: Bayesian Networks.

In this section we focus on Markov-related concepts. As the title suggests, the terms can look bewildering at first, but we will explain them bit by bit. Please read them in order and you will be able to follow: here I first give reader-friendly definitions, and each one is explained in detail later.

The following breaks these concepts down into six easy, progressive points. Think as you read, and do not skip ahead.

  1. Treat each node as a random variable; if two random variables are correlated (not independent), connect them with an edge. If the edges are given directions, we obtain a directed graph, which constitutes a network.
  2. If the network is a directed acyclic graph, it is called a Bayesian network.
  3. If the graph degenerates into a linear chain, we obtain a Markov model; since each node is a random variable that changes with time (or space), from the viewpoint of stochastic processes it can also be seen as a Markov process.
  4. If the network is undirected, it is an undirected graphical model, known as a Markov random field (or Markov network).
  5. If we study the Markov random field under given conditions, we obtain a conditional random field.
  6. If the conditional random field is used to solve labeling problems, and its network topology is further restricted to a linear chain, we obtain a linear-chain conditional random field (linear-chain CRF).

2. Markov Model

2.1 Markov process

A Markov process is a class of stochastic processes. Its original model, the Markov chain, was proposed by the Russian mathematician A. A. Markov in 1907. Such a process has the following property: given the present state, its future evolution does not depend on its past evolution. The change in the number of animals in a forest, for example, constitutes a Markov process. Many processes in the real world can be regarded as Markov processes, such as the Brownian motion of particles in a liquid, the number of people infected during an epidemic, and the number of people waiting at a station.

If each state transition depends only on the preceding n states, the process is called an n-th order model, where n is the number of states that affect the transition. The simplest case is the first-order Markov process, in which each state depends only on the single state immediately before it; this is known as the Markov property.

The assumption that the current state depends only on the previous state is called the Markov assumption, and it greatly simplifies the problem. Obviously, it can also be a very poor assumption, causing a lot of important information to be lost. Mathematically, the Markov property is written as:

\[P(X_{n+1}=x|X_1=x_1,X_2=x_2,...,X_n=x_n)=P(X_{n+1}=x|X_n=x_n)\]

Assume the weather follows a Markov chain:

From the figure above we can see that:

  • If today is sunny, the probability that tomorrow turns cloudy is 0.1.
  • If today is sunny, the probability that tomorrow is sunny again is 0.9; the probabilities sum to 1, which is consistent with real life.
| Today \ Tomorrow | Sunny | Cloudy |
|------------------|-------|--------|
| Sunny            | 0.9   | 0.1    |
| Cloudy           | 0.5   | 0.5    |

From the table above we can obtain the state transition matrix of the Markov chain:

Thus, a first-order Markov process is defined by the following three parts:

  • States: sunny and cloudy
  • Initial vector: the probability of each state of the system at time 0
  • State transition matrix: the transition probability between each pair of weather states
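
To make these three parts concrete, here is a minimal Python sketch (not from the original post) that encodes the weather chain above and samples a sequence from it; the initial-vector values and the helper name `sample_weather` are assumptions for illustration.

```python
import numpy as np

states = ["Sunny", "Cloudy"]

# Initial vector: probability of each state at time 0 (values assumed for illustration)
pi = np.array([0.5, 0.5])

# State transition matrix from the table above: rows = today, columns = tomorrow
A = np.array([[0.9, 0.1],    # today Sunny  -> Sunny, Cloudy tomorrow
              [0.5, 0.5]])   # today Cloudy -> Sunny, Cloudy tomorrow

def sample_weather(days, seed=0):
    """Sample a weather sequence from the first-order Markov chain."""
    rng = np.random.default_rng(seed)
    seq = [rng.choice(len(states), p=pi)]
    for _ in range(days - 1):
        seq.append(rng.choice(len(states), p=A[seq[-1]]))
    return [states[s] for s in seq]

print(sample_weather(7))
```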

The Markov model is a statistical model widely used in natural language processing applications such as speech recognition, automatic part-of-speech tagging, phonetic-to-character conversion, and probabilistic grammars. After long-term development, and especially its successful application in speech recognition, it has become a general-purpose statistical tool. To date, it is considered the most successful method for building fast and accurate speech recognition systems.

3. Hidden Markov Model (HMM)

In some cases a Markov process is not enough to describe the patterns we hope to find. Returning to the weather example, a recluse may not be able to observe the weather directly, but he does have some seaweed. Folklore tells us that the state of the seaweed is probabilistically related to the weather. In this case we have two sets of states: a set of observable states (the state of the seaweed) and a set of hidden states (the weather). We would like an algorithm that can predict the weather from the state of the seaweed under the Markov assumption.

The model behind such an algorithm is the hidden Markov model (HMM).

A hidden Markov model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters. It is the simplest dynamic Bayesian network and a well-known directed graphical model, mainly used for modeling time-series data, with wide applications in speech recognition, natural language processing, and other fields.

3.1 The three fundamental problems of HMMs

  1. Given the model, how do we compute the probability of generating an observed sequence? In other words, how do we evaluate how well the model matches the observed sequence?
  2. Given the model and an observation sequence, how do we find the state sequence that best matches this observation sequence? In other words, how do we infer the hidden states from the observed sequence?
  3. Given an observation sequence, how do we adjust the model parameters so that the probability of this sequence is maximized? In other words, how do we train the model so that it best describes the observed data?

The first two are pattern-recognition problems: 1) computing the probability of an observable state sequence under a hidden Markov model (evaluation); 2) finding the hidden state sequence that maximizes the probability of generating an observable state sequence (decoding). The third problem is generating a hidden Markov model from a set of observable state sequences (learning).

The solutions corresponding to the three problems are:

  1. Forward algorithm (Forward Algorithm), backward algorithm (Backward Algorithm)
  2. Viterbi algorithm (Viterbi Algorithm)
  3. Baum-Welch algorithm (Baum-Welch Algorithm) (approximately the EM algorithm)

Let us use a concrete scenario to explain what these problems and their solutions really mean.

Xiao Ming has a three-day holiday, and each day he can choose one of three activities to pass the time: walking, shopping, or cleaning (these correspond to the observable sequence). In real life, what we decide to do is generally influenced by the weather: on a sunny day he may want to go shopping or take a walk, while on a rainy day he may not want to go out and will stay home to clean. The weather (sunny, rainy) is hidden. The Markov process can be represented with a probability diagram:

Now we ask three questions, corresponding to the three fundamental HMM problems:

  1. Knowing the whole model, I observe that over three consecutive days his activities were: walking, shopping, cleaning. According to the model, what is the probability of producing this sequence of behaviors?
  2. Still knowing the model, and given the same three activities, I would like to guess what the weather was like on each of the days.
  3. The most difficult: I only know that these three activities were done over the three days, and nothing else. I have to build a model from scratch: the sunny/rainy transition probabilities, the probability distribution of the weather on the first day, and the probability distribution over activities given each kind of weather.

Below we look at the answer to each of these questions in this scenario.

3.1.1 Solution to the first problem

Brute-force (traversal) algorithm:

This is the simplest algorithm. Suppose the first day (time T = 1) is sunny and he wants to shop; then the probability is obtained simply by multiplying the corresponding probabilities on the diagram.

For the second day's activity (time T = 2), multiply the first day's probability by the second day's probability, and so on, until we obtain the probability of the activities over all three days (up to time T = 3). This is the traversal algorithm: simple and brute-force. The problem is that its complexity grows exponentially with the length of the observation sequence and the number of hidden states.

The complexity is \(2TN^T\).
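
A minimal Python sketch of this brute-force enumeration follows. Since the original probability diagram is not reproduced here, the initial, transition, and emission values are assumptions chosen for illustration (the same values are reused in the later sketches so the results can be compared).

```python
import itertools
import numpy as np

# Hidden states: 0 = Rainy, 1 = Sunny; observations: 0 = walk, 1 = shop, 2 = clean
# All parameter values below are assumptions for illustration.
pi = np.array([0.6, 0.4])                 # initial weather distribution
A  = np.array([[0.5, 0.5],                # Rainy -> Rainy, Sunny
               [0.2, 0.8]])               # Sunny -> Rainy, Sunny
B  = np.array([[0.1, 0.4, 0.5],           # Rainy: walk, shop, clean
               [0.6, 0.3, 0.1]])          # Sunny: walk, shop, clean

def brute_force_prob(obs):
    """P(obs) by enumerating every hidden state path: cost grows like N**T."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(brute_force_prob([0, 1, 2]))  # P(walk, shop, clean)
```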

Hence the second algorithm.

Forward algorithm

  1. Suppose the activity on day 1 is shopping; compute the probability of shopping on day 1 (covering both sunny and rainy). Suppose instead it is walking; compute that as well, enumerating each case in turn.
  2. Suppose the first two days are shopping then walking; compute the probability of this case too. Suppose they are walking then cleaning; compute it likewise, enumerating the probabilities of all two-day activity sequences.
  3. The third step is to compute the probabilities of the three-day activity sequences.

Careful readers will have noticed that the probabilities required in step 2 can be built on top of the results of step 1, and likewise step 3 depends on the results of step 2. Doing this saves a great deal of computation, much like dynamic programming.

The complexity of this algorithm is \(N^2T\).
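
Below is a minimal sketch of the forward algorithm, using the same assumed parameters as the brute-force sketch above; it returns the same probability with far fewer multiplications.

```python
import numpy as np

# Same assumed parameters as in the brute-force sketch above.
pi = np.array([0.6, 0.4])                        # Rainy, Sunny
A  = np.array([[0.5, 0.5], [0.2, 0.8]])          # transition probabilities
B  = np.array([[0.1, 0.4, 0.5],                  # Rainy: walk, shop, clean
               [0.6, 0.3, 0.1]])                 # Sunny: walk, shop, clean

def forward(obs):
    """P(obs) via the forward algorithm: O(N^2 * T)."""
    alpha = pi * B[:, obs[0]]                    # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]            # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return alpha.sum()

print(forward([0, 1, 2]))  # identical to the brute-force result
```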

Backward algorithm

In contrast to the forward algorithm, we know the total probability must be 1, so we set \(\beta_T=1\), i.e., the probability at the last time step sums to 1. We first compute the possible probabilities for the last day, then those for the second and first days, following the reverse of the forward algorithm's computation path.
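
A matching sketch of the backward algorithm, again with the same assumed parameters; starting from \(\beta_T=1\) it works back through the sequence and yields the same probability as the forward pass.

```python
import numpy as np

# Same assumed parameters as above.
pi = np.array([0.6, 0.4])
A  = np.array([[0.5, 0.5], [0.2, 0.8]])
B  = np.array([[0.1, 0.4, 0.5],
               [0.6, 0.3, 0.1]])

def backward(obs):
    """P(obs) via the backward algorithm, initialised with beta_T = 1."""
    beta = np.ones(len(pi))                      # beta_T(i) = 1
    for o in reversed(obs[1:]):
        beta = A @ (B[:, o] * beta)              # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return (pi * B[:, obs[0]] * beta).sum()      # P(obs) = sum_i pi_i b_i(o_1) beta_1(i)

print(backward([0, 1, 2]))  # same value as the forward algorithm
```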

3.1.2 Solution to the second problem

Viterbi algorithm

Speaking of Andrew Viterbi, few people outside the communications industry know much about him, but most people in that industry know the Viterbi algorithm named after him. The Viterbi algorithm is the most commonly used algorithm in modern digital communications, and it is also the decoding algorithm adopted in much of natural language processing. It is no exaggeration to say that Viterbi is one of the scientists with the greatest influence on our lives today: the CDMA-based 3G mobile communication standard was largely established by Qualcomm, the company he founded with Irwin Mark Jacobs, and Qualcomm continued to lead mobile communications in the 4G era.

The Viterbi algorithm is a special but very widely applied dynamic programming algorithm. Dynamic programming can solve the shortest-path problem in any graph, and the Viterbi algorithm was proposed for the shortest-path problem on a special kind of directed graph, the lattice (trellis). It matters because any problem described with a hidden Markov model can be decoded with it, including today's digital communications, speech recognition, machine translation, pinyin-to-Chinese-character conversion, word segmentation, and more.

The Viterbi algorithm is generally used in pattern recognition to infer hidden states from observed data. Below we walk through the algorithm step by step.

Since we infer from the observed data, we make an assumption here: suppose the activities on the three days are, respectively, walking, shopping, and cleaning. What we want to find is the weather (the path) on each of these three days.

  1. First compute the probabilities of walking on day 1 when it is rainy and when it is sunny:

    \(\bigtriangleup_1(R)\) denotes the probability for day 1 in the rainy state

    \(\pi_R\) denotes the initial probability of the rainy state

    \(b_R(O_1=w)\) denotes the probability of walking given that it is rainy

    \(a_{R-R}\) denotes the transition probability from a rainy day to a rainy day

    \(\bigtriangleup_1(R)=\pi_R*b_R(O_1=w)=0.6*0.1=0.06\)

    \(\bigtriangleup_1(S)=\pi_S*b_S(O_1=w)=0.4*0.6=0.24\)

    The initial paths are:

    \(\phi_1(R)=Rainy\)

    \(\phi_1(S)=Sunny\)

  2. Compute the probabilities of shopping on day 2 when it is rainy and when it is sunny:

    The corresponding paths are:

  3. Compute the probabilities of cleaning on day 3 when it is rainy and when it is sunny:

    The corresponding paths are:

  4. At each step, compare the \(\bigtriangleup\) probabilities, take the maximum, and record the corresponding path; continuing in this way yields the most likely hidden state path.

    The maximum probability on day 1 is \(\bigtriangleup_1(S)\), with corresponding path Sunny;

    The maximum probability on day 2 is \(\bigtriangleup_2(S)\), with corresponding path Sunny;

    The maximum probability on day 3 is \(\bigtriangleup_3(R)\), with corresponding path Rainy.

  5. Putting these together, the path is Sunny -> Sunny -> Rainy, which is the answer we were looking for.

The above is a fairly accessible account of the Viterbi algorithm. For a rigorous treatment, see Chapter 26 of 《数学之美》 (The Beauty of Mathematics), which covers the Viterbi algorithm in detail.
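
To complement the walkthrough, here is a minimal Viterbi sketch using the same assumed parameters as the earlier sketches; with those assumed values it recovers the Sunny -> Sunny -> Rainy path found above.

```python
import numpy as np

# Same assumed parameters as the earlier sketches.
states = ["Rainy", "Sunny"]
pi = np.array([0.6, 0.4])
A  = np.array([[0.5, 0.5], [0.2, 0.8]])
B  = np.array([[0.1, 0.4, 0.5],
               [0.6, 0.3, 0.1]])

def viterbi(obs):
    """Most likely hidden state path for the observation sequence."""
    delta = pi * B[:, obs[0]]                    # delta_1(i) = pi_i * b_i(o_1)
    backptr = []
    for o in obs[1:]:
        trans = delta[:, None] * A               # trans[i, j] = delta_{t-1}(i) * a_ij
        backptr.append(trans.argmax(axis=0))     # best previous state for each current state j
        delta = trans.max(axis=0) * B[:, o]      # delta_t(j) = max_i trans[i, j] * b_j(o_t)
    path = [int(delta.argmax())]                 # best final state
    for bp in reversed(backptr):                 # backtrack through the stored pointers
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([0, 1, 2]))  # walk, shop, clean -> ['Sunny', 'Sunny', 'Rainy']
```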

3.1.3 Solution to the third problem

Baum-Welch algorithm (approximately the EM algorithm). For a detailed explanation, see: 监督学习方法与Baum-Welch算法
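
As a rough sketch of how Baum-Welch works, the following minimal single-sequence version runs an EM loop over forward-backward expectations. It is an illustration only (no log-space scaling, no convergence check, made-up observation data), not the linked tutorial's code.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Estimate HMM parameters (pi, A, B) from a single observation sequence."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    obs = np.asarray(obs)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: forward and backward passes
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)                 # P(state_t | obs)
        xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi /= xi.sum(axis=(1, 2), keepdims=True)                  # P(state_t, state_{t+1} | obs)
        # M-step: re-estimate parameters from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(n_symbols)], axis=1)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Observations over ten days: 0 = walk, 1 = shop, 2 = clean (made-up data)
pi, A, B = baum_welch([0, 1, 2, 0, 0, 1, 2, 2, 1, 0], n_states=2, n_symbols=3)
print(np.round(A, 2)); print(np.round(B, 2))
```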

4. Markov Networks

4.1 Factor graphs

Wikipedia defines a factor graph as follows: factorize a global function of several variables into a product of local functions; the bipartite graph obtained on this basis is called a factor graph (Factor Graph).

Informally, a factor graph is a probabilistic graph obtained by factorizing a function. It generally contains two kinds of nodes: variable nodes and function (factor) nodes. As noted above, a global function can be factorized into a product of several local functions, and these local functions and their relationships with the corresponding variables are what the factor graph depicts.

For example, consider a global function whose factorization is:

\[g(x_1,x_2,x_3,x_4,x_5)=f_A(x_1)f_B(x_2)f_C(x_1,x_2,x_3)f_D(x_3,x_4)f_E(x_3,x_5)\]

where \(f_A, f_B, f_C, f_D, f_E\) are the local functions, representing relationships between the variables; they can be conditional probabilities or other relationships. The corresponding factor graph is:
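
To make the factorization concrete, here is a tiny Python sketch with made-up local functions over binary variables. It evaluates the global function as the product of its factors and computes an (unnormalized) marginal by brute force, which is exactly the computation the sum-product algorithm organizes efficiently on the factor graph.

```python
import itertools

# Hypothetical local functions over binary variables (values made up for illustration)
f_A = lambda x1: 0.6 if x1 == 1 else 0.4
f_B = lambda x2: 0.7 if x2 == 1 else 0.3
f_C = lambda x1, x2, x3: 1.0 if x3 == (x1 & x2) else 0.1
f_D = lambda x3, x4: 0.9 if x3 == x4 else 0.2
f_E = lambda x3, x5: 0.8 if x3 == x5 else 0.5

def g(x1, x2, x3, x4, x5):
    """Global function = product of the local factors."""
    return f_A(x1) * f_B(x2) * f_C(x1, x2, x3) * f_D(x3, x4) * f_E(x3, x5)

# Unnormalized marginal of x1: sum the global function over all other variables.
# Brute force costs O(2^n); sum-product on the factor graph reuses partial sums instead.
marginal_x1 = {
    v: sum(g(v, x2, x3, x4, x5)
           for x2, x3, x4, x5 in itertools.product([0, 1], repeat=4))
    for v in [0, 1]
}
print(marginal_x1)
```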

4.2 Markov networks

We already know that directed graphical models are also called Bayesian networks. In some situations, however, it is inappropriate to force a direction on the edges between certain nodes. Using undirected edges instead gives an undirected graphical model (Undirected Graphical Model, UGM), also known as a Markov random field or Markov network (Markov Random Field, MRF, or Markov network).

Let X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Ym) be jointly distributed random variables. If the random variables Y form a Markov random field represented by an undirected graph G = (V, E), then the conditional probability distribution P(Y|X) is called a conditional random field (Conditional Random Field, CRF; a later post may elaborate on CRFs). The figure below shows the undirected graphical model of a linear-chain conditional random field:

In probabilistic graphical models, computing the marginal distribution of some variable is a common problem. There are many ways to solve it; one is to convert the Bayesian network or Markov random field into a factor graph and then apply the sum-product algorithm. In other words, on a factor graph the sum-product algorithm can efficiently compute the marginal distribution of every variable.

For a detailed walkthrough of the sum-product algorithm, see the post: 从贝叶斯方法谈到贝叶斯网络

5. Conditional Random Fields (CRF)

An intuitive example

Suppose you have many photos of Xiao Ming taken at different times of a single day, covering every period from getting out of bed in the morning to going to sleep at night (Xiao Ming loves taking photos!). The task is to classify these photos. For instance, a photo of eating gets the label "eating"; a photo taken while running gets the label "running"; a photo taken in a meeting gets the label "meeting". The question is: how would you do it?

A simple, intuitive approach is to ignore the temporal order of the photos and train a multi-class classifier: use some labeled photos as training data, train a model, and classify each photo directly from its features. For example, if a photo was taken at 6:00 a.m. and the scene is dark, label it "sleeping"; if a photo contains a car, label it "driving".

At first glance this seems workable! In practice, however, because we have ignored the important information carried by the temporal order of the photos, the classifier will be flawed. For example, how should a photo of Xiao Ming with his mouth closed be classified? It is hard to judge directly; we need to look at the preceding photos. If they show him eating, this closed-mouth photo is probably of him chewing before swallowing, so it can be labeled "eating"; if they show him singing, it is probably a snapshot between notes, so it can be labeled "singing".

Therefore, for the classifier to perform better, when labeling one photo we must take the labels of the neighboring photos into account. This is exactly where conditional random fields (CRFs) shine! It is much like part-of-speech tagging, with the photos replaced by a sentence; the essence is the same.

Like a Markov random field, a conditional random field is an undirected graphical model in which vertices represent random variables and edges represent dependencies between them. In a conditional random field, the distribution of the random variables Y is a conditional probability, given observed values of the random variables X. The figure below shows a linear-chain conditional random field.

The conditional probability distribution P(Y|X) is called a conditional random field.
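
For reference, the conditional distribution of a linear-chain CRF is usually written in the following standard textbook form (added here for completeness), where the \(f_k\) are feature functions defined on adjacent labels and the observations, and the \(\lambda_k\) are their weights:

\[P(Y|X)=\frac{1}{Z(X)}\exp\Big(\sum_{t}\sum_{k}\lambda_k f_k(y_{t-1},y_t,X,t)\Big),\qquad Z(X)=\sum_{Y}\exp\Big(\sum_{t}\sum_{k}\lambda_k f_k(y_{t-1},y_t,X,t)\Big)\]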

6. Comparison of the EM Algorithm, HMM, and CRF

  1. The EM algorithm is used for maximum likelihood estimation or maximum a posteriori estimation in models with latent variables. It consists of two steps: the E-step computes an expectation, and the M-step performs a maximization. EM is essentially an iterative algorithm: it repeatedly uses the previous iteration's parameters to estimate the latent variables and then re-estimates the parameters, until convergence. Note that EM is sensitive to its initial values, and it approximates maximizing the log-likelihood by repeatedly maximizing a lower bound, so it cannot guarantee finding the global optimum. The derivation of EM is also worth mastering.

  2. The hidden Markov model is a generative model for labeling problems. It has three sets of parameters (π, A, B): the initial state probability vector π, the state transition matrix A, and the observation probability matrix B, known as the three elements of the model. The three fundamental HMM problems are:

    Probability computation: given the model and an observation sequence, compute the probability of the observation sequence under the model. → forward and backward algorithms

    Learning: given an observation sequence, estimate the model parameters, i.e., estimate them by maximum likelihood. → Baum-Welch (i.e., the EM algorithm) and maximum likelihood estimation

    Prediction: given the model and an observation sequence, find the corresponding state sequence. → the approximate (greedy) algorithm and the Viterbi algorithm (dynamic programming for the optimal path)

  3. A conditional random field (CRF) is the conditional probability distribution of one set of output random variables given another set of input random variables. It assumes that the output variables form a Markov random field. What we usually encounter is the linear-chain CRF, a discriminative model that predicts the output from the input. It is fitted by maximum likelihood estimation or regularized maximum likelihood estimation.

  4. HMM and CRF are often compared mainly because both make use of graphs: CRF is based on Markov random fields (undirected graphs), while HMM is based on Bayesian networks (directed graphs). CRF likewise has probability computation, learning, and prediction problems; the computations are broadly similar to those for HMMs, except that its learning problem does not require the EM algorithm.

  5. Comparing HMM and CRF: the fundamental difference lies in the underlying idea, one being a generative model and the other a discriminative model, which in turn leads to different estimation methods.

7. References

  1. 条件随机场的简单理解
  2. 如何轻松愉快地理解条件随机场(CRF)
  3. 《数学之美》
  4. 监督学习方法与Baum-Welch算法
  5. 从贝叶斯方法谈到贝叶斯网络

8. POS Tagging Code Implementation

HMM part-of-speech tagging, GitHub: click here

Author: @mantchs

GitHub:https://github.com/NLP-LOVE/ML-NLP

Everyone is welcome to join the discussion and help improve this project! QQ group: 541954936 (NLP interview study group)
