Teacher Gavin's Transformer Live Lessons - Demystifying Generative versus Discriminative Models in NLP Information Extraction

I. Overview

Generative and discriminative models are both essential for understanding, implementing, and optimizing CRFs (Conditional Random Fields).

II. Demystifying Generative versus Discriminative Models in NLP Information Extraction

  1. Analysis of Classification and Sequence Models

Classification means predicting a single discrete variable y given a vector of features x = (x1, x2, . . . , xK). A simple way to accomplish this is to assume that, once the class label is known, all features are independent of each other. Although such an assumption may not seem realistic, this data-driven approach has a wide range of applications in machine learning, including spam classification and various kinds of information identification. The resulting classifier is called the naive Bayes classifier; it is a model based on the joint probability, which is formulated as:
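In the standard notation of the literature, this joint model is written as:

$$p(y, \mathbf{x}) = p(y)\prod_{k=1}^{K} p(x_k \mid y)$$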

You can read this as the prior p(y) over classes, multiplied by the probability of each feature xk given y, where k ranges over the K features.

This model corresponds to the directed graphical model shown on the left side of the figure below (the left is the naive Bayes classifier as a directed model, and the right is the corresponding factor graph). You can see that there is an edge from y to each observed feature, reflecting that every observation depends only on the class. This is the basic structure assumed by naive Bayes:

We can also turn this model into a factor graph by defining a factor Ψ(y) = p(y) and a factor Ψk(y, xk) = p(xk|y) for each feature; the graph then expresses how these factors interact.
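With these factors, the joint distribution factorizes over the factor graph in the standard way, equivalent to the naive Bayes formula above:

$$p(y, \mathbf{x}) = \Psi(y)\prod_{k=1}^{K} \Psi_k(y, x_k)$$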

Another well-known classifier is logistic regression, sometimes referred to as the maximum entropy classifier. Statistically, this model is based on the assumption that the log probability log p(y|x) of each class is a linear function of x plus a normalization constant, which leads to the conditional distribution:
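In the usual notation, this conditional distribution reads:

$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\Big\{\theta_y + \sum_{j=1}^{K}\theta_{y,j}\,x_j\Big\}$$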

Here Z(x) is the normalizing constant:
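It sums the unnormalized scores over all classes so that the probabilities add up to one; in standard form:

$$Z(\mathbf{x}) = \sum_{y}\exp\Big\{\theta_y + \sum_{j=1}^{K}\theta_{y,j}\,x_j\Big\}$$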

Here θy is a bias weight that acts like log p(y) in naive Bayes. Instead of using one weight vector per class, a different notation is used in which a single set of weights is shared across classes. The trick is to define feature functions that are non-zero only for a single class. The feature functions for the feature weights can be defined as:
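A conventional way to write these class-indicator feature functions is:

$$f_{y',j}(y, \mathbf{x}) = \mathbf{1}_{\{y'=y\}}\,x_j$$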

and the feature functions for the bias weights can be defined as:
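In the same notation, the bias features are pure class indicators:

$$f_{y'}(y, \mathbf{x}) = \mathbf{1}_{\{y'=y\}}$$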

Then, using fk to index each feature function fy′,j and θk to index its corresponding weight θy′,j, the logistic regression model can be expressed as:
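In this compact feature-function form, logistic regression is conventionally written as:

$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\Big\{\sum_{k=1}^{K}\theta_k\,f_k(y, \mathbf{x})\Big\}$$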

A classifier predicts only a single categorical variable, whereas the real power of graphical models lies in their ability to model many interdependent variables. One of the simplest dependencies discussed here is that the output variables are arranged in a sequence. The application discussed here comes from NER, which identifies and classifies proper names in text, including locations, people, organizations, and so on. The NER task is, given a sentence, to segment it so as to find which words belong to entities, and then to classify each entity by type (location, person, organization, etc.). The challenge of this task is that many named entities are too rare to appear even in a large training set, so the system must recognize them based on context alone.

One approach to NER is to classify each word independently, ignoring the dependencies between words. This independence assumption clearly does not match reality: in natural language, there are dependencies between the named-entity labels of adjacent words. For example, New York is a location, while New York Times is an organization. One way to relax this independence assumption is to arrange the output variables in a linear chain, which is the approach taken by the HMM. An HMM models a sequence of observations X:

by assuming that there is an underlying sequence of states drawn from a finite state set S:
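In the usual sequence notation, the observations and states are:

$$X = \{x_t\}_{t=1}^{T}, \qquad Y = \{y_t\}_{t=1}^{T}, \qquad y_t \in S$$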

Each observation xt is the identity of the word at position t, and each state yt is a named-entity label, one of the entity types Person, Location, Organization, and Other. To model the joint probability p(y, x), the HMM makes two independence assumptions:

- Assume that each state depends only on its immediate predecessor, i.e. given yt−1, state yt is independent of all earlier states y1, y2, . . . , yt−2

- Assume that each observation variable xt only depends on the current state yt

With the assumptions above, three probability distributions specify the HMM:

- the initial state distribution p(y1)

- the transition probability p(yt | yt−1)

- the observation probability p(xt | yt)

The joint probability of a state sequence y and an observation sequence x is then expressed as follows:
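Using the convention p(y1) = p(y1 | y0) noted below, the standard form is:

$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\,p(x_t \mid y_t)$$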

In order to simplify the formulation, the initial state probability p(y1) is denoted as p(y1|y0). In natural language processing, HMMs are used for sequence labeling tasks, such as POS tagging, NER, and information extraction.
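As a quick illustration of the three distributions above, here is a minimal Python sketch that computes the HMM joint probability for a toy NER tagging; the transition and emission numbers are made up for the example, not taken from the lesson:

```python
# Toy HMM joint probability: p(y, x) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t).
# All probabilities below are invented for illustration, not trained values.

transition = {                      # p(y_t | y_{t-1}); "START" plays the role of y0
    ("START", "Location"): 0.2, ("START", "Other"): 0.8,
    ("Location", "Location"): 0.5, ("Location", "Other"): 0.5,
    ("Other", "Location"): 0.1, ("Other", "Other"): 0.9,
}
emission = {                        # p(x_t | y_t)
    ("Location", "New"): 0.3, ("Location", "York"): 0.4,
    ("Other", "in"): 0.2,
}

def hmm_joint(words, tags):
    """Joint probability of a tag sequence and a word sequence under the HMM."""
    prob, prev = 1.0, "START"
    for word, tag in zip(words, tags):
        prob *= transition[(prev, tag)] * emission[(tag, word)]
        prev = tag
    return prob

# p(Other, Location, Location ; "in New York")
print(hmm_joint(["in", "New", "York"], ["Other", "Location", "Location"]))  # 0.00096
```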

 2. Demystifying Generative and Discriminative Models

Generative models are based on the joint probability p(y, x), which describes how labels and observations are generated together probabilistically, while discriminative models directly model the conditional probability p(y|x). Naive Bayes and hidden Markov models are generative models; logistic regression is a discriminative model.

The main difference between these two kinds of models is that the conditional probability p(y|x) does not include a model of p(x), which is not needed for classification anyway. The difficulty of modeling p(x) is that it often contains many highly dependent features, which are hard to model. For example, in NER an HMM relies on only one feature, the identity of the word, but many proper names do not appear in the training set, so the word-identity feature alone is not enough. To label such unseen words, other features of a word can be exploited, such as capitalization, neighboring words, prefixes and suffixes, membership in predefined lists of people and places, and so on. These modeling problems and their solutions are not limited to HMMs or CRFs: in NLP we want to draw on as many levels of information as possible, for example both the word level and the character level, since the different levels obviously capture different information, and for unseen words that extra information can be very valuable. In the DIET architecture diagram below, the input is processed with both sparse features and pretrained embeddings, so that information from several levels can be mixed:

For discriminative models, the core advantage is that they can better adapt to rich, overlapping features.
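To make the idea of rich, overlapping features concrete, here is a small illustrative Python sketch of the kind of per-word features a discriminative tagger can use; the feature names and the gazetteer list are hypothetical, not from the lesson:

```python
# Overlapping features for the token at position t: word identity, capitalization,
# prefixes/suffixes, neighboring words, and membership in a predefined list.

def word_features(tokens, t):
    word = tokens[t]
    feats = {
        "word.lower": word.lower(),          # word identity
        "word.istitle": word.istitle(),      # capitalization pattern
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],                 # prefix
        "suffix3": word[-3:],                # suffix
        "prev.lower": tokens[t - 1].lower() if t > 0 else "<BOS>",
        "next.lower": tokens[t + 1].lower() if t < len(tokens) - 1 else "<EOS>",
        "in.location.list": word in {"York", "London", "Beijing"},  # toy gazetteer
    }
    return feats

tokens = ["He", "moved", "to", "New", "York"]
print(word_features(tokens, 4))
```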

A CRF makes independence assumptions among the outputs y, and assumptions about how y depends on the entire input sequence x, but not among the inputs themselves. It can be understood this way: if a factor graph is used to represent the joint probability p(y, x), and a graph is then constructed for the conditional probability p(y|x), any factors that depend only on x disappear from the structure of the conditional graph.
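For reference, the linear-chain CRF that results from this construction is conventionally written with feature functions over adjacent labels and the input:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\prod_{t=1}^{T}\exp\Big\{\sum_{k}\theta_k\,f_k(y_t, y_{t-1}, \mathbf{x}, t)\Big\}$$

where Z(x) sums the same product over all possible label sequences y.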

To include interdependent features in a generative model, there are two ways: enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions, as naive Bayes does. Enhancing the model leads to more complex parameters and more computation, while the second approach may degrade performance because of the independence assumptions. For example, although the naive Bayes classifier performs well for document classification, on average across a wide range of applications it performs worse than logistic regression.

 3. Analysis of Naive Bayes, Logistic Regression, HMMs, and Linear-Chain CRFs

The difference between Naive Bayes and logistic regression is that the former is generative, while the latter is discriminative. For discrete inputs, the two classifiers are identical in all other respects. Both classifiers consider the same hypothesis space, in the sense that any logistic regression classifier can be converted to a Naive Bayes classifier using the same decision boundary, and vice versa.

The naive Bayes model defines the same family of probability distributions as the logistic regression model when it is expressed as:
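Interpreted generatively in the same feature-function notation as logistic regression, naive Bayes takes the standard exponential form:

$$p(y, \mathbf{x}) = \frac{\exp\big\{\sum_{k=1}^{K}\theta_k\,f_k(y, \mathbf{x})\big\}}{\sum_{\tilde{y}}\sum_{\tilde{\mathbf{x}}}\exp\big\{\sum_{k=1}^{K}\theta_k\,f_k(\tilde{y}, \tilde{\mathbf{x}})\big\}}$$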

This diagram from the following paper describes the relationship between the various models:

Generative models have several advantages:

- They handle partially labeled or unlabeled data more naturally; when there is no labeled data at all, they can be trained in a fully unsupervised way, whereas unsupervised learning with discriminative models is still an active research area.

- On some data, generative models perform better than discriminative models, because the input model p(x) can have a smoothing effect on the conditional distribution, especially on small datasets. For any particular dataset, it is impossible to predict in advance which kind of model will perform better.

The relationship between naive Bayes and logistic regression is similar to the relationship between HMMs and linear-chain CRFs.


Origin blog.csdn.net/m0_49380401/article/details/123563375