Teacher Gavin's Transformer live class reflections - Multivariate Prediction and Graphical Modeling decryption series in NLP information extraction

I. Overview

As an information extraction framework, CRF (Conditional Random Fields) can guarantee global optimality during decoding, which is a very important property. Since the predictions of a Transformer neural network may contain some deviations, adding a CRF on top can correct this information very well.

II. Multivariate Prediction and Graphical Modeling decryption series in NLP information extraction

 1. Significance of classification models for multivariate data with dependencies

We often need to predict a large number of variables that have dependencies among them. Structured prediction methods are essentially a combination of classification and graphical modeling: they combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to make predictions using large sets of input features. As a popular probabilistic method for structured prediction, CRF is widely applied in natural language processing, computer vision, and other fields. The inference and parameter estimation methods for CRFs described here also cover the practical problems that arise when implementing large-scale CRFs.

When modeling a sequence, there may be a lot of multivariate data (such as POS tagging results). Once the CRF transition matrix has been trained, it is fixed for all states; from a probabilistic perspective, the multivariate data is captured by this single, unified matrix. The "structured" mentioned here mainly emphasizes the sequential structure of the output and does not put too much emphasis on the input data, because the CRF conditions on all inputs. For example, in the architecture diagram shown in the class, the input is first processed by the Transformer and then handed over to the CRF; whatever is input, it is ultimately unified into context.
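Below is a minimal sketch of this architecture (all sizes and names such as NUM_TAGS are assumptions, not the exact model from the class): a Transformer encoder turns the whole input into context, a linear layer produces per-position emission scores, and a single transition matrix, fixed across all positions once trained, scores neighboring tag pairs.

```python
# Minimal Transformer -> CRF scoring sketch (PyTorch). All sizes are
# hypothetical; a real model would add masking, a log-partition term
# for training, and Viterbi decoding for prediction.
import torch
import torch.nn as nn

VOCAB, D_MODEL, NUM_TAGS = 1000, 64, 5  # assumed sizes

class TransformerCRFTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.emit = nn.Linear(D_MODEL, NUM_TAGS)  # per-position emission scores
        # One transition matrix shared by all positions, as noted above.
        self.trans = nn.Parameter(torch.zeros(NUM_TAGS, NUM_TAGS))

    def score(self, tokens, tags):
        """Unnormalized CRF score of a tag sequence: emissions + transitions."""
        h = self.encoder(self.embed(tokens))           # (B, T, D): the context
        e = self.emit(h)                               # (B, T, NUM_TAGS)
        emit = e.gather(-1, tags.unsqueeze(-1)).squeeze(-1).sum(-1)
        trans = self.trans[tags[:, :-1], tags[:, 1:]].sum(-1)
        return emit + trans

tokens = torch.randint(0, VOCAB, (2, 7))  # toy batch: 2 sentences, 7 tokens
tags = torch.randint(0, NUM_TAGS, (2, 7))
print(TransformerCRFTagger().score(tokens, tags))
```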

 2. Core scenarios and significance of Graphical Modeling

Fundamental to many applications is the ability to predict multiple variables that depend on each other. From a coarse-grained perspective, the task is to predict an output vector y = {y0, y1, . . . , yT } of random variables given an observed input vector x. A relatively simple natural language processing example is POS tagging, where each variable ys is the POS tag of the word at position s, and the input x is divided into feature vectors {x0, x1, . . . , xT }. Each xs contains various information about the word at position s, such as its identity, orthographic features such as prefixes and suffixes, and information from semantic databases such as WordNet.
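As a hypothetical illustration (the feature names below are invented for the example), a per-position feature extractor for POS tagging might look like this:

```python
def word_features(sentence, s):
    """Hypothetical feature vector x_s for the word at position s."""
    w = sentence[s]
    return {
        "identity": w.lower(),   # the word's identity
        "prefix3": w[:3],        # orthographic features:
        "suffix3": w[-3:],       # prefixes and suffixes
        "is_title": w.istitle(),
        # entries from a semantic database such as WordNet could be added here
    }

print(word_features(["Gavin", "teaches", "Transformers"], 0))
```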

Regarding the problem of multivariate prediction, especially if the goal is to maximize the number of correctly classified labels, one could learn an independent per-position classifier that maps x to ys for each position s. The difficulty, however, is that there may be complex dependencies between the output variables. For example, suppose two words form a name (noun); if a third word is added to that name, then when the CRF "sees" the third position it also sees the two adjacent preceding positions, and this is exactly the dependency between them. If the output is a complex structure such as a parse tree, the choice of grammar rules near the top of the tree can have a large effect on the rest of the tree. Dependencies between output variables can be represented by a graphical model, which describes how a given probability density factorizes into a set of conditional independence relationships satisfied by the distribution. A toy sketch of such an output dependency follows.
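For a concrete toy sketch (the BIO tags and the value -1e4 are assumptions for illustration), a transition matrix can forbid tag bigrams that independent per-position classifiers have no way to rule out:

```python
import numpy as np

tags = ["O", "B-PER", "I-PER"]
NEG = -1e4                                 # low enough to forbid a transition
trans = np.zeros((len(tags), len(tags)))   # trans[i, j]: score of tag i -> tag j
trans[tags.index("O"), tags.index("I-PER")] = NEG  # a name cannot continue from O
print(trans)
```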

Graphical modeling is a powerful framework for representing and reasoning about multivariate probability distributions.

 3. Generative Models and their problems and solutions

Generative Models explicitly attempt to model the joint probability p(y, x) over inputs and outputs. While this approach has many advantages, it also has significant limitations: not only can the dimensionality of the input x be very large, but the features can have complex dependencies, so it is difficult to construct a probability distribution over them. Modeling the dependencies among the inputs can lead to intractable models, while ignoring them can lead to poor performance, as the factorization below makes concrete.
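To make the trade-off concrete (the HMM factorization below is standard textbook material, not a formula from the class), a generative sequence model such as an HMM stays tractable only by assuming each observation depends on nothing but its own state:

```latex
% HMM joint distribution: each state depends only on its predecessor,
% and each observation x_t only on its own state y_t.
p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
```

Any richer dependency among the x_t would break this factorization, which is exactly the intractability the paragraph above refers to.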

 4. Analysis of Conditional Model and Joint Model

CRF turns the original joint probability distribution into a conditional probability distribution p(y|x), which simplifies both the modeling process and the amount of computation. CRF essentially combines the advantages of classification and graphical modeling: the ability to compactly model multivariate data together with the ability to make predictions using large-scale input features. The advantage of such a conditional model is that all inputs are treated as context, and dependencies are handled from the perspective of the transition matrix.

The relationship between these models can be summarized as follows: for example, an HMM is converted into a Linear-chain CRF by conditioning.

CRF can be seen as an extension of the logistic regression classifier to graphical structures, or as a discriminative analogue of generative models for structured data.
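To make the analogy concrete (a standard formulation written here in linear-chain notation, not copied from the class), logistic regression and the linear-chain CRF share the same exponential form; the CRF simply extends the single label y to a sequence with pairwise factors between neighboring outputs:

```latex
% Logistic regression over a single label y:
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \theta_k f_k(y, x) \Big)

% Linear-chain CRF over a tag sequence, adding transition features
% between neighboring outputs y_{t-1} and y_t:
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \prod_{t=1}^{T} \exp\Big( \sum_k \theta_k f_k(y_t, y_{t-1}, x_t) \Big)
```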

 5. Undirected Models decryption

Consider a probability distribution over a set of random variables V = X ∪ Y, where X is the set of observed input variables and Y corresponds to the outputs to be predicted. Each variable s ∈ V takes its outcomes from a set of possible values, which may be continuous or discrete. Let the vector x denote an assignment to X; given a variable s ∈ X, the notation xs denotes the value assigned by x to s, and similarly xa denotes the values assigned by x to a subset a ⊂ X.

The undirected model is defined by the following formula, where the left-hand side is the joint probability and the factor 1/Z on the right-hand side performs the normalization:

p(x, y) = (1/Z) ∏a Ψa(xa, ya)

The normalization constant Z sums the product of factors over all assignments:

Z = Σx,y ∏a Ψa(xa, ya)

Each factor Ψa is computed as the exponential of a weighted sum of features, so that on the log scale the multiplication of factors becomes an addition of feature scores. When x and y are discrete, this representation involves no loss of generality:

Ψa(xa, ya) = exp( Σk θak fak(xa, ya) )
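As a toy check (the weights θ below are invented for illustration), one can verify on a tiny discrete model that dividing the product of factors by Z yields a proper probability distribution:

```python
# Two binary variables and one factor Ψ(a, b) = exp(θ_ab);
# brute-force the normalization constant Z over all 4 assignments.
import itertools
import math

theta = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 1.0}  # assumed weights

def psi(a, b):
    return math.exp(theta[(a, b)])

Z = sum(psi(a, b) for a, b in itertools.product([0, 1], repeat=2))
probs = {(a, b): psi(a, b) / Z for a, b in itertools.product([0, 1], repeat=2)}
print(Z, sum(probs.values()))  # the probabilities sum to 1.0
```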

Origin blog.csdn.net/m0_49380401/article/details/123539669