Teacher Gavin's Transformer Live Class Reflections - Information Extraction with CRF (Conditional Random Fields) Demystified, Series One

 I. Overview

   CRF (Conditional Random Fields) is a core algorithm for information extraction and can also be considered a framework. CRF is especially effective on sequential data and is currently the best choice for labeling language sequences.

   The following is the architecture diagram of DIET, where you can see that a CRF sits on top of the Transformer. Through the Transformer's multi-head attention mechanism, the dense vector representing each token is passed to a feedforward neural network inside the CRF layer. In addition, there is a transition matrix (state-transition matrix) between one token's label and the next. From the point of view of trainable parameters, besides the feedforward network applied to each token, this transition matrix between adjacent positions in the sequence expresses the relationships between tokens, so deciding whether they are entities becomes more refined. However, CRF also has a fatal weakness: its capacity for representing information is limited, whereas the Transformer can express contextual and global information far more richly.
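
As a rough sketch of this head (toy shapes and names assumed here; this is not Rasa's actual DIET implementation), the trainable pieces are exactly the two things described above: a per-token feedforward projection producing emission scores, and a label-transition matrix:

```python
import torch
import torch.nn as nn

class CRFHead(nn.Module):
    """Minimal sketch of a Transformer + CRF head (illustrative only)."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        # Feedforward layer mapping each token vector to per-label emission scores.
        self.feedforward = nn.Linear(hidden_dim, num_labels)
        # transitions[i, j]: trainable score of moving from label i to label j.
        self.transitions = nn.Parameter(torch.randn(num_labels, num_labels))

    def sequence_score(self, token_vectors, labels):
        """Unnormalized score of one label sequence: emissions + transitions."""
        emissions = self.feedforward(token_vectors)            # (seq_len, num_labels)
        score = emissions[torch.arange(len(labels)), labels].sum()
        score = score + self.transitions[labels[:-1], labels[1:]].sum()
        return score

head = CRFHead(hidden_dim=16, num_labels=4)
tokens = torch.randn(5, 16)   # stand-in for the Transformer's per-token output
print(head.sequence_score(tokens, torch.tensor([0, 1, 1, 2, 0])))
```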

CRF has an information-correction function. In a combined Transformer-and-CRF application, a vector represents the content of each token, such as x1, x2, ... xn, and this representation can carry some deviation. For each token of the input sequence there is a corresponding label probability (the emission score), and there is also a transition probability from one label to the next. The transition matrix provided by the CRF captures these dependencies at the label level very well.
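
To see this correction at work, here is a hand-made example (all numbers are illustrative assumptions, not from the lecture) using BIO-style entity labels, where the transition matrix vetoes a label sequence that the emissions alone would prefer:

```python
import numpy as np

labels = ["O", "B-ENT", "I-ENT"]   # label indices 0, 1, 2

# Emission scores for a 3-token sentence (rows: tokens, cols: labels).
emissions = np.array([
    [2.0, 0.5, 0.1],   # token 1: clearly O
    [0.2, 1.0, 1.2],   # token 2: emissions slightly prefer I-ENT
    [0.1, 0.3, 1.5],   # token 3: clearly inside an entity
])

# Transition scores: transitions[i, j] scores the move from label i to label j.
transitions = np.array([
    [ 1.0,  0.5, -5.0],   # O -> I-ENT heavily penalized (illegal in BIO)
    [ 0.2,  0.1,  1.0],   # B-ENT -> I-ENT encouraged
    [ 0.2,  0.1,  1.0],
])

def score(path):
    """Unnormalized CRF score: sum of emissions plus adjacent transitions."""
    s = sum(emissions[t, y] for t, y in enumerate(path))
    s += sum(transitions[path[t - 1], path[t]] for t in range(1, len(path)))
    return s

print(score([0, 2, 2]))  # O, I-ENT, I-ENT: strong emissions, illegal transition -> 0.7
print(score([0, 1, 2]))  # O, B-ENT, I-ENT: legal sequence wins -> 6.0
```

Decoding with emissions alone would pick I-ENT for the second token; once the transition scores are added, the legal sequence O, B-ENT, I-ENT scores higher (6.0 vs 0.7), which is exactly the label-level correction described above.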

 II. Information Extraction with CRF (Conditional Random Fields) Demystified

  1. Analysis of the CRF Paper's Abstract

CRF is a framework for building probabilistic models to segment and label input sequence data. CRF has several advantages over other models such as HMMs (hidden Markov models), including relaxing the strong independence assumptions of HMMs (meaning that CRF can model the relationships between elements). CRF also avoids the fundamental limitations of MEMMs (maximum entropy Markov models) and other discriminative Markov models based on directed graphs. These models simply classify and do not directly consider the joint probability; it is generative models that consider the joint probability. The joint probability involves a great deal of computation, while the conditional probability involves only part of it. However, a directed-graph model performs a local normalization at each step, so when some states have many outgoing transitions and others have few, it ignores the relative weights of the transition probabilities and biases the choice of subsequent states.
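
To make the local-versus-global normalization contrast concrete, these are the standard forms of the two models (standard notation, not copied from the lecture):

$$ \text{MEMM (locally normalized):} \quad p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, \mathbf{x}), \qquad \sum_{y_t} p(y_t \mid y_{t-1}, \mathbf{x}) = 1 \ \text{at every step} $$

$$ \text{CRF (globally normalized):} \quad p(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\big(\sum_t s(y_{t-1}, y_t, \mathbf{x}, t)\big)}{Z(\mathbf{x})}, \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big(\sum_t s(y'_{t-1}, y'_t, \mathbf{x}, t)\Big) $$

Because Z(x) sums over entire label sequences rather than over each state's outgoing transitions, the weights of different features in different states can trade off against one another, which is the property that lets CRF escape the bias described above.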

 2. Detailed Introduction to the CRF Paper

Compared with hard models (models whose probabilities are only 0 or 1), probabilistic models can capture finer-grained information, and CRF evolved from these models. Probabilistic models are used to segment and label sequences and have a wide range of applications, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation. HMMs are generative models, trained on the joint probability of paired observation sequences and label sequences. In order to define a joint probability over observation and label sequences, a generative model needs to be able to enumerate all possible observation sequences.
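
For reference, the first-order HMM joint probability being described here is (standard textbook form, not quoted from the lecture):

$$ p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t) $$

where each observation $x_t$ is generated only from its own label $y_t$; this is the strong independence assumption that CRF relaxes.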

In the figure below, all labels depend on the input observation sequence X = X1, .... Label Y2 is connected to both Y1 and Y3, with a transition between each pair. The probability of such a transition depends not only on the current observation, but possibly also on past and future observations:

 3. Analysis of the Label Bias Problem

MEMMs and other non-generative models, such as discriminative Markov models, suffer from a problem called label bias. They describe only local information, not global information; that is, the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probability terms, the score of a transition is the conditional probability of the possible next states given the current state and the observation sequence. When the dependencies in the model are expressed only locally, with a local normalization at each step, the global influence is greatly weakened, producing bias. CRF instead uses a single exponential model that computes the joint probability of the entire label sequence given the observation sequence, so that the weights of different features in different states can be traded off against one another.

The image below shows a simple finite-state model for distinguishing between the two words rib and rob. Suppose the observation sequence is rib. In the first step, starting from the initial state, the transitions 0 → 1 and 0 → 4 both match r, so these two transitions get roughly equal probability. Next, we observe i. States 1 and 4 each have only one outgoing transition. State 1 has seen this observation often during training, while state 4 has almost never seen it; but, like state 1, state 4 has no choice except to pass all of its probability mass along its single outgoing transition, since the conditional model does not generate the observation. States with a single outgoing transition therefore effectively ignore their observations, which produces the bias.
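
Here is a toy numeric sketch of this effect (all scores below are illustrative assumptions, not numbers from the paper), comparing local and global normalization on the rib/rob automaton:

```python
# States: 0 (start); 0->1->2->3 spells "rib"; 0->4->5->3 spells "rob".
# Raw compatibility scores: scores[state][obs] -> {next_state: raw_score}
scores = {
    0: {"r": {1: 1.0, 4: 1.0}},          # both branches match 'r' equally
    1: {"i": {2: 1.0}, "o": {2: 0.01}},  # state 1 expects 'i'
    4: {"o": {5: 1.0}, "i": {5: 0.01}},  # state 4 expects 'o', rarely saw 'i'
    2: {"b": {3: 1.0}},
    5: {"b": {3: 1.0}},
}

def memm_path_prob(path, obs_seq):
    """MEMM-style probability: normalize each state's outgoing scores locally."""
    prob, state = 1.0, path[0]
    for obs, nxt in zip(obs_seq, path[1:]):
        outgoing = scores[state][obs]
        prob *= outgoing[nxt] / sum(outgoing.values())  # local normalization
        state = nxt
    return prob

def crf_path_prob(paths, obs_seq, target):
    """CRF-style probability: raw path scores compete through one global Z."""
    def raw(path):
        s, state = 1.0, path[0]
        for obs, nxt in zip(obs_seq, path[1:]):
            s *= scores[state][obs][nxt]
            state = nxt
        return s
    return raw(target) / sum(raw(p) for p in paths)

rib_path, rob_path = [0, 1, 2, 3], [0, 4, 5, 3]

# With local normalization, state 4's single outgoing transition gets
# probability 1 even though 'i' barely matches, so both words tie at 0.5:
print(memm_path_prob(rib_path, "rib"), memm_path_prob(rob_path, "rib"))

# With global normalization, the tiny raw score of the rob path survives,
# and "rib" wins decisively (~0.99 vs ~0.01):
print(crf_path_prob([rib_path, rob_path], "rib", rib_path))
print(crf_path_prob([rib_path, rob_path], "rib", rob_path))
```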

For the bias problem, one solution is to change the state-transition structure of the model; another is to start from a fully connected model and let training determine a good structure.

 4. Detailed Explanation of Conditional Random Fields

In the figure below, X is a random variable over the data sequences to be labeled, and Y is a random variable over the corresponding label sequences. The set of possible labels Y is finite, generally a few dozen; modeling, training, and inference are then carried out over X and Y.

About the definition:

Let G = (V, E) be a graph such that Y is indexed by the vertices of G; then (X, Y) constitutes a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph. In a CRF, every label can depend on the entire input X.
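
Written out, the Markov property in this definition (as stated in the original Lafferty et al. paper) is:

$$ p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v) $$

where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.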

Generally, the linear-chain CRF is used, and the calculation formula is as follows:
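
The formula image did not survive here; the standard linear-chain form (as in the original paper, with edge features $f_k$ and vertex features $g_k$) is:

$$ p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\Big( \sum_{t,k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) + \sum_{t,k} \mu_k g_k(y_t, \mathbf{x}, t) \Big) $$

where $Z(\mathbf{x})$ sums the same exponential over all possible label sequences, giving the global normalization discussed above.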
