Maximum Entropy Model
The maximum entropy model is a log-linear model, fit by maximum likelihood estimation (or regularized maximum likelihood estimation) on the given training data:

P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\Big(\sum_i w_i f_i(x, y)\Big), \qquad Z_w(x) = \sum_y \exp\Big(\sum_i w_i f_i(x, y)\Big)

Here Z_w(x) is the normalization factor, w is the parameter vector of the maximum entropy model, and f_i(x, y) is a feature function describing a fact about the pair (x, y).
For a detailed derivation, see: https://blog.csdn.net/asdfsadfasdfsa/article/details/80833781
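As a minimal sketch of the log-linear form above (the feature functions, weights, and labels here are made up purely for illustration, not taken from any real model):

```python
import math

def maxent_prob(x, y, labels, features, weights):
    """P_w(y|x) for a log-linear (maximum entropy) model:
    P_w(y|x) = exp(sum_i w_i * f_i(x, y)) / Z_w(x)."""
    def score(lbl):
        return math.exp(sum(w * f(x, lbl) for w, f in zip(weights, features)))
    z = sum(score(lbl) for lbl in labels)  # normalization factor Z_w(x)
    return score(y) / z

# Toy example: two hypothetical binary features and weights.
features = [
    lambda x, y: 1.0 if x.istitle() and y == "NAME" else 0.0,
    lambda x, y: 1.0 if x.islower() and y == "OTHER" else 0.0,
]
weights = [1.5, 0.8]
labels = ["NAME", "OTHER"]

p = maxent_prob("Alice", "NAME", labels, features, weights)
```

Note that the probabilities over all labels sum to 1 by construction of Z_w(x).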
Maximum Entropy Markov Model
The HMM makes two assumptions: first, output independence — each observation depends strictly on the current state alone; second, in the transition process, the current state depends only on the previous state. In real sequence labeling problems, however, a tag is related not just to a single word but also to the length of the observation sequence, the surrounding word context, and so on. MEMM removes the HMM's output independence assumption: whereas the HMM only defines a dependence between each observation and its state, MEMM introduces custom feature functions, which can express not only the dependence between observations and states but also complex dependencies between the current observation and multiple preceding and following states.
Because the independence assumption is removed, we can no longer write down the joint probability distribution and can only model the posterior probability, so MEMM is a discriminative model.
In an HMM, observation nodes depend on hidden nodes, meaning each observation depends only on the hidden state at the current moment. In more realistic scenarios, however, the observation sequence needs many features to characterize it. For example, when doing NER, my tag depends not only on the current state but even on the surrounding tags, as well as on features such as capitalization and part of speech.
For this reason, MEMM was proposed: a model that allows features to be defined directly and learns the conditional probability directly. Overall:

P(i_1 \dots i_n \mid o_1 \dots o_n) = \prod_{t=1}^{n} P(i_t \mid i_{t-1}, o_1 \dots o_n)
Moreover, each of these per-step probabilities is modeled by a maximum entropy classifier (the reason for the name MEMM):

P(i_t \mid i_{t-1}, o_1 \dots o_n) = \frac{1}{Z(o, i_{t-1})} \exp\Big(\sum_a \lambda_a f_a(o, i_t)\Big)
Focus here — this is the ME content and the key to understanding MEMM: Z(o, i_{t-1}) is the normalization term; f_a(o, i_t) is a feature function which, concretely, must be defined by hand; \lambda_a is the weight of the feature function, an unknown parameter that must be learned during the training stage.
For example, I might define a feature function like this:
The feature functions can be formulated in any way, and their number is arbitrary.
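Such hand-defined feature functions are typically binary indicators. A sketch of two hypothetical ones (the tag names and conditions are invented for illustration):

```python
def f1(prev_tag, tag, obs, t):
    """Fires when the current word is capitalized and tagged as a proper noun."""
    return 1.0 if obs[t][0].isupper() and tag == "NNP" else 0.0

def f2(prev_tag, tag, obs, t):
    """Fires on a specific tag bigram (determiner -> noun), regardless of the word."""
    return 1.0 if prev_tag == "DT" and tag == "NN" else 0.0
```

Each feature can look at the previous tag, the current tag, and the entire observation sequence — exactly the arguments of f_a(o, i_t) above.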
So on the whole, the MEMM modeling formula looks like this:

P(i_1 \dots i_n \mid o_1 \dots o_n) = \prod_{t=1}^{n} \frac{1}{Z(o, i_{t-1})} \exp\Big(\sum_a \lambda_a f_a(o, i_t)\Big)
Yes, the reason this part of the formula is so long is determined by the ME model. (The maximum entropy model's weights \lambda come from Lagrange multipliers.)
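The product of locally normalized maximum entropy steps can be sketched as follows (a minimal illustration; the toy feature, weight, and tag set are assumptions, not from any trained model):

```python
import math

def memm_step_prob(prev_tag, tag, obs, t, tags, features, weights):
    """One locally normalized maximum entropy step:
    P(i_t | i_{t-1}, o) = exp(sum_a lambda_a * f_a) / Z(o, i_{t-1})."""
    def score(y):
        return math.exp(sum(w * f(prev_tag, y, obs, t)
                            for w, f in zip(weights, features)))
    return score(tag) / sum(score(y) for y in tags)  # Z normalizes per step

def memm_sequence_prob(tag_seq, obs, tags, features, weights):
    """P(i_1..i_n | o) as a product of per-step conditionals."""
    p, prev = 1.0, None  # None stands in for a hypothetical START marker
    for t, tag in enumerate(tag_seq):
        p *= memm_step_prob(prev, tag, obs, t, tags, features, weights)
        prev = tag
    return p

# Toy setup: one hypothetical feature that prefers tag "A", two tags.
toy_features = [lambda prev, y, obs, t: 1.0 if y == "A" else 0.0]
toy_weights = [0.5]
toy_tags = ["A", "B"]
obs = ["x", "y"]
```

Because each step is normalized on its own, the probabilities of all tag sequences of a given length still sum to 1 — this local normalization is exactly what the label bias discussion below turns on.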
Be sure to understand the meaning of these two parts — the discriminative model and the defined features — for they already contain the prototype of the CRF.
Two things about MEMM need attention:
- Its dependencies are not the same as the HMM's: in MEMM, the current hidden state depends on the observation node at the current time and the hidden node at the previous time.
- Note that the arrows in the diagram are drawn this way because the MEMM formula dictates it — and the formula itself was simply defined that way by its creators.
Maximum Entropy Markov Model: the Label Bias Problem
The most discussed issue with MEMM is its label bias problem.
1. The phenomenon
Decode the MEMM with the Viterbi algorithm, where state 1 tends to transition to state 2, while state 2 tends to remain in state 2. Details of the decoding process (the Viterbi algorithm needs this premise):
P(1→1→1→1) = 0.4 × 0.45 × 0.5 = 0.09
P(2→2→2→2) = 0.2 × 0.3 × 0.3 = 0.018
P(1→2→1→2) = 0.6 × 0.2 × 0.5 = 0.06
P(1→1→2→2) = 0.4 × 0.55 × 0.3 = 0.066
Yet the optimal state path obtained is 1→1→1→1. Why? Because state 2 has more states it can transition to than state 1 does, each individual transition probability out of state 2 is diluted. As a result, MEMM tends to choose states with fewer outgoing transitions.
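The arithmetic above can be checked directly — each factor below is one of the locally normalized per-step transition probabilities from the example:

```python
import math

# Per-step factors P(next_state | current_state, observation_t) for each
# candidate path, taken from the worked example above.
paths = {
    (1, 1, 1, 1): [0.4, 0.45, 0.5],
    (2, 2, 2, 2): [0.2, 0.3, 0.3],
    (1, 2, 1, 2): [0.6, 0.2, 0.5],
    (1, 1, 2, 2): [0.4, 0.55, 0.3],
}
probs = {path: math.prod(factors) for path, factors in paths.items()}
best = max(probs, key=probs.get)  # (1, 1, 1, 1), probability 0.09
```

The path that stays in state 1 wins even though individual transitions favor state 2 — the signature of label bias.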
2. Why it happens
Look directly at the MEMM formula: the sum in the denominator normalizes the probability, but this normalization sits inside the product over time steps — call it local normalization. The Viterbi solution process is a dynamic program built on the state transition equation (not expanded for MEMM here; compare the CRF formula below). Because of the local normalization, the transition part of the Viterbi recursion for MEMM is flawed: the dp recursion does not lead to the global optimum.
HMM->MEMM->CRF
MEMM removes the HMM's output independence assumption. Whereas the HMM only defines a dependence between each observation and its state, MEMM introduces custom feature functions to express P(current tag | previous tag, entire observation sequence), while the HMM expresses P(current tag | previous tag, current observation): in the HMM, the current tag actually depends on the previous tag, and the current observation depends on the current tag.
The CRF changes MEMM's directed graph into an undirected graph and solves MEMM's label bias problem.
A common feature of the three is that all are based on the Markov assumption, i.e., the current tag depends only on the preceding tag.