Probabilistic Graphical Models (3): Maximum Entropy Markov Model

Maximum Entropy Model

The maximum entropy model is a log-linear model. Its parameters are fitted by maximum likelihood estimation (or regularized maximum likelihood estimation) on the given training data:

P_w(y|x) = \frac{1}{Z_w(x)} \exp\left( \sum_{i} w_i f_i(x, y) \right), \qquad Z_w(x) = \sum_{y} \exp\left( \sum_{i} w_i f_i(x, y) \right)

where Z_w(x) is the normalization factor, w is the parameter vector of the maximum entropy model, and f_i(x, y) is a feature function that describes a fact about the pair (x, y).

For the detailed derivation, see: https://blog.csdn.net/asdfsadfasdfsa/article/details/80833781
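To make the log-linear form concrete, here is a minimal sketch of computing P(y|x) under the formula above. The feature functions, weights, and labels are hypothetical toy values invented for illustration, not part of the original post.

```python
import math

def maxent_prob(w, features, x, y, labels):
    """P(y | x) for a log-linear (maximum entropy) model."""
    def score(label):
        # Unnormalized log-score: sum_i w_i * f_i(x, label)
        return sum(w_i * f_i(x, label) for w_i, f_i in zip(w, features))
    z = sum(math.exp(score(lbl)) for lbl in labels)  # normalization factor Z(x)
    return math.exp(score(y)) / z

# Toy indicator features and weights (hypothetical, for illustration only):
features = [
    lambda x, y: 1.0 if x == "capitalized" and y == "NAME" else 0.0,
    lambda x, y: 1.0 if x == "lowercase" and y == "OTHER" else 0.0,
]
w = [2.0, 1.0]
labels = ["NAME", "OTHER"]
p = maxent_prob(w, features, "capitalized", "NAME", labels)
```

Note that the probabilities over all labels sum to 1 by construction, since Z(x) sums the exponentiated scores over the label set.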


Maximum Entropy Markov Model

The HMM makes two assumptions: first, output independence — each observation depends only on the current state; second, the Markov assumption — the current state depends only on the previous state. In reality, however, a sequence labeling decision does not depend on a single word alone, but on the length of the observation sequence, the word's context, and so on. MEMM removes the HMM's output independence assumption. Whereas the HMM only defines a dependence between an observation and its own state, MEMM introduces user-defined feature functions, which can express not only dependencies between observations but also complex dependencies between the current observation and several preceding and following states.
Because the independence assumption is removed, we can no longer write down the joint probability distribution; we can only model the posterior probability, so MEMM is a discriminative model.


 

In the HMM, the observation node o_t depends on the hidden node i_t, meaning each observation node depends only on the hidden state at the current moment. But in more realistic scenarios, the observation sequence needs many features to characterize it. For example, when doing NER, the tag i_t relates not only to the current observation o_t but also to the surrounding observations o_j (j \ne i), as well as features such as capitalization, part of speech, and so on.

To this end, MEMM was proposed: a model that allows users to "define features" directly and learns the conditional probability P(i_t | i_{t-1}, o_t), t = 1, \cdots, n, directly. Overall:

P(I|O) = \prod_{t=1}^{n} P(i_t \mid i_{t-1}, o_t)

Moreover, each factor P(i \mid i', o) is modeled by a maximum entropy classifier (which is where the name MEMM comes from):

P(i \mid i', o) = \frac{1}{Z(o, i')} \exp\left( \sum_{a} \lambda_{a} f_{a}(o, i) \right)

Focus here — this is the ME content and the key to understanding MEMM: Z(o, i') is the normalization term; f_{a}(o, i) is a feature function — specifically, a function that the user must define; \lambda_{a} is the weight of the feature function — these are the unknown parameters, learned during the training stage.

For example, a feature function might be defined as:

\begin{equation} f_{a}(o, i) = \begin{cases} 1 & \text{if a certain condition is met}, \\ 0 & \text{otherwise} \end{cases} \end{equation}

The feature functions f_{a}(o, i), a = 1, \cdots, n, can be formulated arbitrarily, and any number of them can be defined.
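As a concrete sketch of such indicator feature functions, here are two hypothetical examples in the spirit of the NER setting mentioned earlier (the condition each one tests is invented for illustration):

```python
# Hypothetical indicator feature functions f_a(o, i) for a toy NER-style tagger.
def f_cap_name(o, i):
    # Fires (returns 1) when the observed word is capitalized and the tag is "NAME".
    return 1.0 if o[:1].isupper() and i == "NAME" else 0.0

def f_lower_other(o, i):
    # Fires when the observed word is all lowercase and the tag is "OTHER".
    return 1.0 if o.islower() and i == "OTHER" else 0.0
```

Each function fires only when its (observation, tag) condition holds, matching the 0/1 case definition above.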

 

So overall, the MEMM modeling formula looks like this:

P(I|O) = \prod_{t=1}^{n} \frac{ \exp\left( \sum_{a} \lambda_{a} f_{a}(o_t, i_t) \right) }{ Z(o_t, i_{t-1}) }

 

Yes, the reason this part of the formula takes this form is determined by the ME model. (The maximum entropy model's weights \lambda come from the Lagrange multipliers.)
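Putting the pieces together, a minimal sketch of MEMM sequence scoring might look like this. The feature functions, weights, labels, and the `"<s>"` start symbol are all hypothetical choices for illustration; note that each step is normalized locally over the label set, which is exactly the local normalization discussed later.

```python
import math

def memm_step_prob(i_prev, i_cur, o, labels, weights, features):
    """One locally normalized max-ent factor: P(i_cur | i_prev, o)."""
    def score(i):
        # Features may depend on the observation, previous tag, and candidate tag.
        return sum(w * f(o, i_prev, i) for w, f in zip(weights, features))
    z = sum(math.exp(score(i)) for i in labels)  # Z(o, i_prev)
    return math.exp(score(i_cur)) / z

def memm_sequence_prob(tags, obs, labels, weights, features, start="<s>"):
    """P(I|O) = product over t of P(i_t | i_{t-1}, o_t)."""
    p, prev = 1.0, start
    for i_t, o_t in zip(tags, obs):
        p *= memm_step_prob(prev, i_t, o_t, labels, weights, features)
        prev = i_t
    return p

# Toy setup (hypothetical): one feature that fires when tag matches observation.
labels = ["A", "B"]
features = [lambda o, prev, cur: 1.0 if cur == o else 0.0]
weights = [1.0]
p_seq = memm_sequence_prob(["A", "B"], ["A", "B"], labels, weights, features)
```

Because each factor is normalized by Z(o, i_prev) over the labels, every step's conditional distribution sums to 1 regardless of how many features fire.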

Be sure to understand the meaning of these two parts — the discriminative model and the user-defined features — for they already contain the prototype of the CRF.

MEMM requires attention on two points:

  1. Unlike in the HMM, where o_t depends on i_t, in MEMM the current hidden state i_t depends on both the current observation node o_t and the previous hidden node i_{t-1}.
  2. Note that the direction of the arrows in the diagram is determined by the MEMM formula, and the formula itself was defined by the model's creators.

The Label Bias Problem of the Maximum Entropy Markov Model

The most-discussed issue with MEMM is its label bias problem.

1. The phenomenon

MEMM is decoded with the Viterbi algorithm. In this example, state 1 tends to transition to state 2, while state 2 tends to remain in state 2. Details of the decoding process (the Viterbi algorithm requires this setup as a premise):

P(1→1→1→1) = 0.4 × 0.45 × 0.5 = 0.09
P(2→2→2→2) = 0.2 × 0.3 × 0.3 = 0.018
P(1→2→1→2) = 0.6 × 0.2 × 0.5 = 0.06
P(1→1→2→2) = 0.4 × 0.55 × 0.3 = 0.066

Yet the optimal state path obtained is 1→1→1→1 — why? Because state 2 has more outgoing transitions than state 1, each of its transition probabilities is smaller after normalization. As a result, MEMM tends to choose states with fewer outgoing transitions.
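The arithmetic above can be checked directly. Each factor is a locally normalized transition probability, so path scores are simple products:

```python
# Path probabilities from the example above; each factor is a locally
# normalized per-step transition probability, so scores multiply per step.
paths = {
    "1->1->1->1": 0.4 * 0.45 * 0.5,
    "2->2->2->2": 0.2 * 0.3 * 0.3,
    "1->2->1->2": 0.6 * 0.2 * 0.5,
    "1->1->2->2": 0.4 * 0.55 * 0.3,
}
best_path = max(paths, key=paths.get)  # "1->1->1->1", with score 0.09
```

Even though every individual transition out of state 1 toward state 2 is more likely, the all-1 path wins overall, which is exactly the bias described above.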

2. The explanation

Look directly at the MEMM formula:

P(I|O) = \prod_{t=1}^{n} \frac{ \exp\left( \sum_{a} \lambda_{a} f_{a}(o_t, i_t) \right) }{ Z(o_t, i_{t-1}) }

The role of the sum in Z is to normalize the probabilities, but this normalization sits inside each individual factor — call it local normalization. The Viterbi solution process uses a dynamic-programming state-transition recursion (MEMM's is not expanded here; refer to the CRF formula below). Because the normalization is local, the transition part of MEMM's Viterbi recursion is problematic: the DP cannot correctly recurse to the global optimum.

\delta_{i+1}(l) = \max_{1 \le j \le m} \left\{ \delta_{i}(j) + \sum_{k=1}^{M} \lambda_{k} f_{k}(O, I_{i-1}=j, I_{i}=l, i) \right\}
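As an illustration of this recursion, here is a generic log-space Viterbi sketch. The `log_potential` callback is a hypothetical stand-in for the local log-score: for a CRF it would be the sum of weighted features, and for an MEMM the locally normalized log-probability of each transition.

```python
def viterbi(states, obs, log_potential):
    """Generic Viterbi decoding over local log-scores.

    log_potential(prev, cur, obs, t) returns the log-score of moving from
    state `prev` to state `cur` at position t (prev is None at t = 0).
    """
    # delta[s]: best log-score of any state path ending in state s so far
    delta = {s: log_potential(None, s, obs, 0) for s in states}
    backpointers = []
    for t in range(1, len(obs)):
        new_delta, ptr = {}, {}
        for s in states:
            # Maximize over the previous state, matching the recursion above
            prev = max(states, key=lambda p: delta[p] + log_potential(p, s, obs, t))
            new_delta[s] = delta[prev] + log_potential(prev, s, obs, t)
            ptr[s] = prev
        delta = new_delta
        backpointers.append(ptr)
    # Backtrack from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For a globally normalized model (CRF) this recursion reaches the true optimum; the label bias problem arises precisely because MEMM's local normalization distorts the per-step scores fed into it.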


HMM->MEMM->CRF

MEMM removes the HMM's output independence assumption. The HMM defines dependence only between an observation and its own state, whereas MEMM introduces custom feature functions to express P(current tag | previous tag, entire observation sequence). The HMM instead models P(current tag | previous tag) and P(current observation | current tag): in the HMM, the current tag depends on the previous tag, and the current observation depends on the current tag.

The CRF then changes MEMM's directed graph into an undirected graph, solving MEMM's label bias problem.

A common feature of all three models: each is based on the Markov assumption, i.e., the current tag depends only on the preceding tag.

 


Origin blog.csdn.net/asdfsadfasdfsa/article/details/91966876