Teacher Gavin's Transformer Live Lessons - Detailed Explanation of CRF Modeling in NLP Information Extraction

I. Overview

The strength of a CRF lies in modeling information dependencies and state transitions: it can express arbitrary state-transition and dependency relationships. Its comparatively weak ability to represent the input itself can be compensated for by a Transformer. The figure in the paper shows how the various models can be converted into one another. Among CRFs, the linear-chain CRF is the most commonly used variant.

 II. Detailed explanation of CRF Modeling in NLP information extraction

  1. Applications of CRFs

CRFs have been widely used in many fields, including text processing, computer vision, and bioinformatics. Applications of linear-chain CRFs include named-entity recognition (NER), shallow parsing, semantic role labeling, prediction of pitch accents, word alignment in machine translation, extraction of table information from text documents, Chinese word segmentation, and more.

  2. Detailed explanation of Linear-chain CRFs

To facilitate understanding of linear-chain CRFs, consider the conditional probability p(y|x) induced by the joint probability p(y, x) of an HMM. The key point is that this conditional probability is in fact a CRF. The joint probability of an HMM can be written as follows:

p(y, x) = (1/Z) ∏t=1..T exp( Σi,j∈S θij 1{yt = i} 1{yt−1 = j} + Σi∈S Σo∈O µoi 1{yt = i} 1{xt = o} )

Here, θ = {θij, µoi} are the real-valued parameters of this probability distribution, Z is a normalization constant ensuring that the probabilities sum to 1, θij = log p(y′ = i | y = j), and µoi = log p(x = o | y = i).
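
As a minimal sketch (not from the paper; the toy numbers and the fixed initial state y0 are assumptions for illustration), the following Python code builds θ and µ from an HMM's transition and emission tables and checks that the exponential form above reproduces the ordinary product of HMM factors:

```python
import numpy as np

# Toy HMM; all numbers are invented for illustration.
S, O = 2, 3                                  # number of states / observation symbols
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(S), size=S)    # trans[j, i] = p(y' = i | y = j)
emit = rng.dirichlet(np.ones(O), size=S)     # emit[i, o]  = p(x = o | y = i)

theta = np.log(trans)                        # theta[j, i] = log p(y' = i | y = j)
mu = np.log(emit)                            # mu[i, o]    = log p(x = o | y = i)

x = [2, 0, 1]                                # observations x1..xT
y = [0, 1, 1]                                # states y1..yT
y0 = 0                                       # fixed dummy initial state (assumption)

# Joint probability via the exponential (feature) form -- here Z = 1:
log_p, y_prev = 0.0, y0
for t in range(len(x)):
    log_p += theta[y_prev, y[t]] + mu[y[t], x[t]]
    y_prev = y[t]

# The same joint probability as a plain product of HMM factors:
p, y_prev = 1.0, y0
for t in range(len(x)):
    p *= trans[y_prev, y[t]] * emit[y[t], x[t]]
    y_prev = y[t]

print(np.exp(log_p), p)                      # the two values coincide
```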

Each feature function has the form fk(yt, yt−1, xt), where xt is the input at time t and yt is the current state, which depends on the previous state yt−1. To replicate the HMM, one feature needs to be defined for each transition (i, j):

fij(y, y′, x) = 1{y = i} 1{y′ = j}

and one feature for each state-observation pair (i, o):

fio(y, y′, x) = 1{y = i} 1{x = o}

A feature function is usually written generically as fk, ranging over all the fij and all the fio. With these features, the HMM can be expressed as:

p(y, x) = (1/Z) ∏t=1..T exp( Σk θk fk(yt, yt−1, xt) )
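
Continuing the toy setup above (S, O, theta, and mu are the same illustrative objects), the two feature families can be written as indicator functions and collected into one generic family fk with weights θk; this is a sketch, not the paper's code:

```python
# Indicator features replicating the HMM (illustrative sketch).
def transition_feature(i, j):
    # f_ij(y, y', x) = 1{y = i} 1{y' = j}
    return lambda y, y_prev, x: float(y == i and y_prev == j)

def observation_feature(i, o):
    # f_io(y, y', x) = 1{y = i} 1{x = o}
    return lambda y, y_prev, x: float(y == i and x == o)

# Collect everything into one generic family f_k with weights theta_k.
features, weights = [], []
for j in range(S):
    for i in range(S):
        features.append(transition_feature(i, j))
        weights.append(theta[j, i])
for i in range(S):
    for o in range(O):
        features.append(observation_feature(i, o))
        weights.append(mu[i, o])

def local_log_score(y_t, y_prev, x_t):
    # sum_k theta_k * f_k(y_t, y_{t-1}, x_t): one term of the sum in the exponent
    return sum(w * f(y_t, y_prev, x_t) for w, f in zip(weights, features))
```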

The conditional probability then follows from the formula above:

p(y|x) = p(y, x) / Σy′ p(y′, x) = ∏t exp( Σk θk fk(yt, yt−1, xt) ) / Σy′ ∏t exp( Σk θk fk(y′t, y′t−1, xt) )

The conditional distribution above is a particular kind of linear-chain CRF, one that includes only features of the identity of the current word. Many other linear-chain CRFs use richer features of the input, such as prefixes and suffixes of the current word, the identities of the surrounding words, and so on.

Suppose Y and X are random vectors, θ = {θk} ∈ ℝK is a parameter vector, and {fk(y, y′, xt)}, k = 1, ..., K, is a set of real-valued feature functions.

Then the probability distribution of a linear-chain CRF can be written as follows:

p(y|x) = (1/Z(x)) ∏t=1..T exp( Σk=1..K θk fk(yt, yt−1, xt) )

where the normalization function Z(x) is:

Z(x) = Σy ∏t=1..T exp( Σk=1..K θk fk(yt, yt−1, xt) )
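
To make the normalization concrete, here is a brute-force sketch (all weights invented; enumeration is exponential in T, so this is usable only for tiny examples, and real implementations compute Z(x) with the forward algorithm) that sums over every label sequence and then normalizes:

```python
import itertools
import numpy as np

M = 2                                         # size of the label set {0, 1}
x = [2, 0, 1]                                 # observation sequence x1..xT
theta_trans = np.array([[0.5, -0.2],          # toy weight per transition [y_prev, y_t]
                        [0.1, 0.8]])
theta_obs = np.array([[0.3, -0.1, 0.2],       # toy weight per (label, observation)
                      [-0.4, 0.6, 0.0]])

def unnormalized(y):
    # prod_t exp( sum_k theta_k f_k(y_t, y_{t-1}, x_t) ), dummy initial label 0
    total, y_prev = 0.0, 0
    for t, x_t in enumerate(x):
        total += theta_trans[y_prev, y[t]] + theta_obs[y[t], x_t]
        y_prev = y[t]
    return np.exp(total)

# Z(x): sum over all M**T label sequences
Z = sum(unnormalized(y) for y in itertools.product(range(M), repeat=len(x)))
print(unnormalized((0, 1, 1)) / Z)            # p(y | x) for one particular y
```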

Other kinds of linear-chain CRFs are also useful. In an HMM, a transition from state i to state j receives the same score, log p(yt = j | yt−1 = i), regardless of the input. In a CRF, the transition (i, j) can be allowed to depend on the current observation vector simply by adding the feature 1{yt = j} 1{yt−1 = i} 1{xt = o}. CRFs with such observation-dependent transition features are commonly used in text applications.
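
A minimal sketch of such an observation-dependent transition feature (the triple indicator above, with indices chosen arbitrarily for illustration):

```python
def edge_observation_feature(i, j, o):
    # f(y_t, y_{t-1}, x_t) = 1{y_t = j} 1{y_{t-1} = i} 1{x_t = o}
    return lambda y_t, y_prev, x_t: float(y_t == j and y_prev == i and x_t == o)

f = edge_observation_feature(i=0, j=1, o=2)
print(f(1, 0, 2))   # 1.0: the (0 -> 1) transition fires under observation 2
print(f(1, 0, 1))   # 0.0: same transition, different observation, feature is off
```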

The definition of the linear-chain CRF allows each feature function to depend on the observation at any time step. The observation argument of fk is therefore written as a vector xt, which should be understood as containing all the components of the global observation x that are needed for computing features at time t. For example, if the CRF uses the next word xt+1 as a feature, the feature vector xt is assumed to include the identity of that word xt+1.
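
For instance, a sketch (function and field names invented) of assembling the per-position vector xt so that it carries the identity of the next word:

```python
def build_xt(words, t):
    # The "local" observation vector x_t may include any part of the global x,
    # e.g. the identity of the current word and of the next word x_{t+1}.
    return {
        "word": words[t],
        "next_word": words[t + 1] if t + 1 < len(words) else "</s>",
    }

words = ["Mr", "President", "spoke"]
print(build_xt(words, 0))   # {'word': 'Mr', 'next_word': 'President'}
```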

Moving from the linear-chain CRF to the general CRF is fairly straightforward: we simply replace the linear-chain factor graph with a more general factor graph.

Let G be a factor graph over Y; then p(y|x) is a CRF if, for any fixed x, it factorizes according to G. In the linear-chain case, the same weights are shared by the factors Ψt(yt, yt−1, xt) at every time step. More generally, the factors of G can be partitioned into C = {C1, C2, ..., CP}, where each Cp is a clique template containing a set of factors, so that the CRF can be expressed as:

p(y|x) = (1/Z(x)) ∏Cp∈C ∏Ψc∈Cp Ψc(xc, yc; θp)

where each factor is parameterized as:

Ψc(xc, yc; θp) = exp( Σk=1..K(p) θpk fpk(xc, yc) )
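
The following schematic sketch (structure and names invented for illustration) shows the idea of clique templates: each template owns one weight table shared by all of its factors, and the unnormalized score is the product over templates and over the factors within each template:

```python
import numpy as np

# Two clique templates for a chain: a node template over (y_t, x_t) and an
# edge template over (y_{t-1}, y_t); each ties its parameters across positions.
def node_factor(theta_node, y_t, x_t):
    return np.exp(theta_node[y_t, x_t])

def edge_factor(theta_edge, y_prev, y_t):
    return np.exp(theta_edge[y_prev, y_t])

def unnormalized_score(theta_node, theta_edge, y, x):
    p = 1.0
    for t in range(len(x)):                   # all factors of the node template
        p *= node_factor(theta_node, y[t], x[t])
    for t in range(1, len(x)):                # all factors of the edge template
        p *= edge_factor(theta_edge, y[t - 1], y[t])
    return p

theta_node = np.array([[0.3, -0.1, 0.2], [-0.4, 0.6, 0.0]])   # toy weights
theta_edge = np.array([[0.5, -0.2], [0.1, 0.8]])
print(unnormalized_score(theta_node, theta_edge, y=[0, 1, 1], x=[2, 0, 1]))
```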

Regarding the use of clique templates: when modeling images, for example, different clique templates can be used at different scales, depending on the results of an algorithm that finds interest points.

  3. Detailed explanation of General CRFs

General CRFs have also been applied to several NLP tasks. One promising application is performing multiple annotation tasks simultaneously: for example, a two-level dynamic CRF that jointly performs POS tagging and noun-phrase chunking outperforms doing one task at a time. Another application is multi-label classification (for instance, when user input in a dialog can contain multiple intents), where a single instance can carry several class labels at once; instead of training a separate classifier for each class, such a CRF can improve classification performance by learning the dependencies between classes.

The skip-chain CRF is a general CRF that expresses long-distance dependencies in information extraction. As an example of a long-distance dependency, suppose someone utters a long passage in which a large portion in the middle is irrelevant, while the key information needed for prediction appears at the beginning. To make a prediction at the end of the passage based on that key information, long-distance dependencies are required.
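
A common construction for skip-chain CRFs (the heuristic below is the one often described in the literature; the code itself is an illustrative sketch) adds a skip edge between any two positions that hold the same capitalized word, so that the two distant mentions can share information during inference:

```python
def skip_edges(words):
    # Connect every pair of positions holding the same capitalized token.
    edges = []
    for u in range(len(words)):
        for v in range(u + 1, len(words)):
            if words[u] == words[v] and words[u][:1].isupper():
                edges.append((u, v))
    return edges

words = ["Speaker", "John", "said", "that", "John", "agreed"]
print(skip_edges(words))   # [(1, 4)]: a long-distance edge between the two "John"s
```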

A general CRF structure can also be applied to the coreference problem, i.e., determining which mentions in a document refer to the same entity. For example, "Mr. President" and "he" appearing in a document may both point to the same entity. The paper also mentions using a fully connected CRF to learn a distance metric between mentions and casting the prediction as a graph-partitioning problem. Similar models have also been applied to segmenting handwritten characters.

In computer vision, grid-shaped CRFs are also used for labeling and segmenting images. In some CRF applications, efficient dynamic programming still exists even when the graphical model is hard to specify.

The most important considerations when defining a general CRF are specifying the repeated structure and the parameter tying, and a convenient formalism for this is the clique template. For example, a dynamic CRF allows multiple labels at each time step rather than a single label, analogously to a dynamic Bayesian network.

Source: blog.csdn.net/m0_49380401/article/details/123587324