NLP text feature extractor

Features of NLP tasks:

The input is a one-dimensional, ordered linear sequence of variable length.

        Because text may contain long-distance dependencies, whether a feature extractor can capture long-distance features is also critical for solving NLP tasks. Whether a feature extractor fits the characteristics of its problem domain often determines its success or failure, and many model improvements are in fact attempts to make the extractor match the domain's characteristics more closely.

The four categories of NLP tasks:

       Four major types of NLP tasks: classification, sequence labeling, text matching, and text generation. 

  1. The first category is classification, such as the common text classification and sentiment analysis tasks. Its characteristic is that no matter how long the text is, the model only needs to output a single overall category.
  2. The second category is sequence labeling, the most typical kind of NLP task; Chinese word segmentation, part-of-speech tagging, named entity recognition, and semantic role labeling all fall into this type. Its characteristic is that the model must assign a category to every word in the sentence based on its context.
  3. The third category is text matching; tasks such as entailment, QA, paraphrase identification, and natural language inference all follow this pattern. Its characteristic is that, given two sentences, the model judges whether they stand in a certain semantic relationship.
  4. The fourth category is text generation, which includes machine translation, text summarization, poetry generation, and image captioning. Its characteristic is that, given some input content, the model must generate another piece of text on its own.

        The advantage of deep learning is that it is "end-to-end": in the past, engineers had to decide which features to design and extract, whereas in the end-to-end era you no longer need to worry about this. Feed the raw input to a good feature extractor and it extracts useful features automatically. In other words, what you need to do is choose a good feature extractor, feed it a large amount of training data, and set the optimization objective (the loss function) to tell it what you want it to do. The design of the feature extractor therefore becomes the top priority.

1. Bag-of-Words (BoW)

BoW (Bag of Words) is a commonly used text representation method for converting text into numeric vectors. Its basic idea is to treat a text as a bag (or set) of words, ignoring word order and grammatical structure and focusing only on word frequency. The specific steps are as follows:

  1. Building a vocabulary: First, build a vocabulary of unique words that appear in all texts.

  2. Feature vector representation: for each text sample, record at the position of each vocabulary word either its number of occurrences in that sample or a weight for it.

  3. Vectorized representation: the resulting feature vectors form the numerical representation of the texts; that is, each text sample is mapped to a sparse numerical vector (a minimal sketch follows these steps).
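As a minimal illustration of these steps (a toy corpus and plain Python, not any particular library):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Step 1: build the vocabulary of unique words across all texts
vocab = sorted({word for doc in corpus for word in doc.split()})

# Steps 2-3: map each document to a count vector over the vocabulary
def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [bow_vector(doc) for doc in corpus]
print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]  -- counts, mostly zeros for large vocabularies
```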

The advantages of using BoW for text representation include:

  1. Simple and intuitive: the BoW method is easy to understand and implement, since it disregards grammar and word order.

  2. Ignores irrelevant information: BoW filters out some grammatical and word-order information and pays more attention to the frequency of words in the text.

However, the BoW method also has some limitations and disadvantages:

  1. Loss of word order information: BoW ignores the order relationship between words and thus fails to capture contextual and semantic information between words.

  2. High-dimensional sparse vectors: For large vocabularies and large-scale text datasets, BoW representations generate high-dimensional sparse vectors, resulting in increased computational and storage overhead.

Despite some limitations, BoW is still a fundamental method for many text processing tasks and can serve as the basis for other more complex text representation methods.

2. RNN


        In an unrolled RNN, each input corresponds to a hidden-layer node, the hidden-layer nodes form a linear sequence, and information is passed between them step by step from front to back.

        RNN originally adopts a linear sequence structure that collects input information from front to back, but the long backpropagation path easily causes vanishing or exploding gradients. To solve this problem, LSTM and GRU were introduced; by letting intermediate state information propagate onward more directly, they alleviate the vanishing-gradient problem and achieve good results, so LSTM and GRU quickly became the standard RNN models.

        The structure of RNN is naturally suited to NLP: the input in NLP is usually a linear sentence sequence of variable length, and RNN is exactly a network structure that accepts variable-length input and conducts information linearly from front to back. After introducing its three gates, LSTM is also quite effective at capturing long-distance features. RNN is therefore especially suitable for linear-sequence scenarios such as NLP, which is the fundamental reason it became so popular in the NLP world.
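A minimal PyTorch sketch of this usage, assuming a toy padded batch and arbitrary layer sizes (not code from any cited paper):

```python
import torch
import torch.nn as nn

# Toy setup: vocabulary of 100 ids, embedding dim 8, hidden dim 16
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)

token_ids = torch.tensor([[4, 7, 9, 2, 0],     # shorter sentence, padded with 0
                          [5, 3, 8, 6, 1]])
lengths = torch.tensor([4, 5])                 # true lengths of the two sentences

x = embedding(token_ids)                       # (batch, seq_len, 8)
packed = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True,
                                           enforce_sorted=False)
outputs, (h_n, c_n) = lstm(packed)             # hidden states are computed step by step
# h_n holds each sequence's final hidden state, usable as a sentence feature
print(h_n.shape)                               # torch.Size([1, 2, 16])
```

The step-by-step recurrence over the sequence is exactly the sequential dependency discussed next: each time step waits for the previous one.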

        Disadvantage of RNN: each hidden-layer computation depends on two inputs, the current word of the sentence and the output of the previous hidden state. The final result therefore has to be computed step by step along the time dimension, i.e., there is a sequential dependency. Precisely because this sequence-dependent structure is unfriendly to parallel computation, RNN computes slowly and is easily displaced by faster newcomers.

        CNN and Transformer do not have this sequence dependency problem and can be calculated in parallel.

2.3 Word Embedding

Word Embedding differs from the bag-of-words model; typical examples are Word2Vec and GloVe.

        The core idea is a function P that predicts, from a series of preceding words in a sentence, the probability of each word that could come next (in theory, the following context of a word can also be brought in to predict it jointly); the larger the value, the more the sentence reads like something a person would say. Given a large amount of corpus, the task is to train a neural network to do this well, so that when the first few words of a sentence are fed in, the network outputs which word should follow. How is that done? With the Neural Network Language Model (NNLM).
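In standard notation (the usual chain-rule factorization a language model estimates, not a formula taken from this article):

```latex
P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
```

NNLM trains a neural network to approximate each conditional factor on the right-hand side.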

 Any word W is fed in as a one-hot encoding and multiplied by a matrix Q to obtain the vector C(W), which is the Word Embedding of that word: each row of Q corresponds to one word's embedding, and Q has to be learned. The C(W) vectors of the context words are concatenated, passed through a hidden layer, and then a softmax predicts which word should follow. By learning this language-model task, the network not only learns to predict the next word from the preceding context, it also produces a by-product, the matrix Q, and this is exactly how the word embeddings are learned.
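A minimal numpy sketch of this lookup, with made-up sizes: multiplying a one-hot row vector by Q simply selects the corresponding row of Q, which is the word's embedding.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
Q = np.random.randn(vocab_size, embed_dim)   # learned embedding matrix, one row per word

word_index = 3                               # index of word W in the vocabulary
onehot = np.zeros(vocab_size)
onehot[word_index] = 1.0

c_w = onehot @ Q                             # C(W): equals Q[word_index]
assert np.allclose(c_w, Q[word_index])
# In NNLM, the C(W) vectors of the context words are concatenated,
# passed through a hidden layer, and a softmax predicts the next word.
```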

2.3.1 Word2Vec

The network structure of Word2Vec is basically similar to NNLM, but the training objective differs: NNLM feeds in a word's preceding context in order to predict that word, i.e., it predicts the next word from the text before it, and word embedding is merely a by-product.

Word2Vec has two training methods,

  1. CBOW, whose core idea is to remove a word from a sentence and use the word's surrounding context to predict the removed word;
  2. Skip-gram, which is the opposite of CBOW: input a word and ask the network to predict its context words.

But the goal of Word2Vec is different: it is trained purely to obtain word embeddings, which are its main product rather than a by-product, so it is free to set up the training task however it likes.
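A minimal sketch with the gensim library (assuming gensim 4.x and a toy tokenized corpus); `sg=0` selects CBOW and `sg=1` selects Skip-gram:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# CBOW: predict the removed word from its surrounding context
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram: predict the context words from the center word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)         # (50,)  -- the learned word embedding
print(cbow.wv.most_similar("cat"))  # nearest neighbours in embedding space
```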

Why can Word Embedding be regarded as a pre-training process?

This becomes clear when we look at how the learned word embeddings are used in downstream tasks.

In NNLM, each word of a sentence enters the network as a one-hot vector and is multiplied by the learned Word Embedding matrix Q, which directly retrieves that word's embedding; Q is in fact the network's parameter matrix mapping the one-hot layer to the embedding layer. Using Word Embedding is therefore equivalent to initializing that first layer of the downstream network with the pre-trained parameter matrix Q; only the first layer's parameters are initialized this way. This was the typical pre-training practice in the NLP field before 2018. Word Embedding is helpful for many downstream NLP tasks, but the gains are not that large. Why?

Because it does not account for polysemy.

What negative impact does polysemy have on Word Embedding? The same word with different meanings is represented by one and the same embedding. For example, the polysemous word bank has two commonly used senses, but Word Embedding cannot distinguish them when encoding the word: two different kinds of contextual information are encoded into the same embedding space. Word embeddings therefore cannot separate the different senses of a polysemous word, which is a serious problem.

Many solutions have been proposed, but they are expensive or cumbersome; ELMO provides a simple and elegant one.

2.3.2 ELMO

Word Embedding before ELMO is essentially a static method: once training is finished, each word's representation is fixed, and when it is used later the word's embedding stays the same no matter what context the new sentence provides; it does not change as the context changes.

        The essential idea of ELMO is: first use a language model to learn a word's Word Embedding in advance, at which point the senses of a polysemous word cannot yet be distinguished (they are mixed together); then, when the embedding is actually used, the word sits in a specific sentence, and its Word Embedding can be adjusted according to the semantics of the surrounding context, so that the adjusted representation better expresses the word's meaning in that particular context. This naturally solves the polysemy problem. ELMO is thus an approach that dynamically adjusts Word Embedding according to the current context.

 ELMO employs a typical two-stage process,

  • The first stage is to use the language model for pre-training;
  • The second stage is, when doing a downstream task, to extract each word's embedding from every layer of the pre-trained network and supply them to the downstream task as new supplementary features.

The first stage of pre-training:

       The pre-training process works as follows. The network structure uses two stacked layers of bidirectional LSTM, and the training objective of the language-model task is to correctly predict a word W from its context: the word sequence before W is called Context-before (the preceding context), and the word sequence after it is called Context-after (the following context).

        The left two-layer LSTM is a forward encoder whose input is W's preceding context (Context-before), used to predict W; the right side is a backward two-layer LSTM encoder whose input is the sentence's following context (Context-after), read in reverse order from right to left. Each encoder is a stack of two LSTM layers. This network structure is in fact very common in NLP. Using it to run the language-model task on a large corpus pre-trains the network. Once the network is trained and a new sentence Snew is fed in, each word in the sentence obtains three corresponding embeddings: at the bottom, the word's own Word Embedding; one level up, the embedding at that word's position in the first bidirectional LSTM layer, which encodes more syntactic information; and further up, the embedding at that word's position in the second LSTM layer, which encodes more semantic information. In other words, ELMO's pre-training learns not only the words' Word Embeddings but also a two-layer bidirectional LSTM network structure, and both are useful later.

Stage 2: use in downstream tasks

        Take a QA problem as the downstream task. For the question X, feed it into the pre-trained ELMO network to obtain the three embeddings of each word, then give each of the three embeddings a weight a (these weights can be learned) and sum them, weighted accordingly, into a single embedding. Use this integrated embedding as the input for the corresponding word in the task network for sentence X, i.e., as a supplementary new feature for the downstream task. The answer sentence Y in the QA task is handled the same way. Because ELMO supplies each word's representation to the downstream task in the form of features, this type of pre-training is called "Feature-based Pre-Training".
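A minimal PyTorch sketch of this weighted combination (the layer count and dimensions are illustrative; it mimics the learnable mixing described above, not ELMO's official implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Combine the three per-word embeddings with learned weights a_1..a_3."""
    def __init__(self, num_layers=3):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # learned with the downstream task
        self.gamma = nn.Parameter(torch.ones(1))               # optional global scale

    def forward(self, layer_embeddings):
        # layer_embeddings: list of tensors, each of shape (seq_len, dim)
        a = torch.softmax(self.weights, dim=0)
        mixed = sum(w * e for w, e in zip(a, layer_embeddings))
        return self.gamma * mixed

# Token embedding + layer-1 LSTM output + layer-2 LSTM output for a 6-word sentence
embeddings = [torch.randn(6, 128) for _ in range(3)]
mix = ScalarMix()
downstream_input = mix(embeddings)   # (6, 128), fed to the task network as an extra feature
print(downstream_input.shape)
```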

        Experiments show that the covered tasks span a wide range, including sentence-relationship judgment, classification, and reading comprehension, which demonstrates that ELMO applies broadly and generalizes well; this is a very good advantage.

What are ELMO's disadvantages?

  • The feature extraction ability of LSTM is much weaker than that of Transformer.
  • Fusing the two directions by simply concatenating them is a weak form of bidirectional feature fusion.

        Besides the feature-fusion pre-training approach represented by ELMO, NLP has another typical practice, one that looks consistent with what is done in the image field. It is generally called the "Fine-tuning mode", and GPT is the typical pioneer of this mode.

3 Transformer

3.1 GPT

 GPT is the abbreviation of "Generative Pre-Training"; as the name suggests, it refers to generative pre-training. GPT also employs a two-stage process,

  • The first stage is to use the language model for pre-training.
  • The second stage solves downstream tasks through the Fine-tuning mode.

It is in fact similar to ELMO; the main differences lie in two points:

  • First, the feature extractor is not an RNN but a Transformer, whose feature-extraction ability, as noted above, is stronger than RNN's; this choice was clearly very wise. RNN is held back by the sequential dependency inherent in its structure, while CNN, although easy to parallelize and fast, has natural deficiencies in capturing sequence relationships in NLP, especially long-distance features: not impossible, just not good at it, and although many improved CNN models exist, not many are particularly successful. The Transformer both parallelizes well and captures long-distance features well.
  • Second, although GPT's pre-training still uses a language model as the target task, it is a one-way language model: GPT uses only the Context-before words to make the prediction and discards the Context-after. In hindsight this was not a good choice, for a simple reason: it does not incorporate the word's following context, which limits its effectiveness in more application scenarios and needlessly throws away information.

        Previously you could design whatever network structure you liked for each downstream task; now you cannot. You have to adopt GPT's network structure and reshape the task's network to look the same as GPT's. Then, when doing the downstream task, you initialize this network with the parameters pre-trained in the first stage, so that the linguistic knowledge learned during pre-training is brought into your task, which is a very good thing. After that, you train the network on the task at hand, fine-tuning its parameters so that it becomes better suited to solving that problem.

        For the various patterns of NLP tasks, how do we reshape them so that their network structure is close to GPT's?

        The GPT paper gives a set of transformation schemes, which are actually very simple: for classification problems, little needs to change; just add start and end symbols. For sentence-relationship judgment, such as entailment, it suffices to add a delimiter between the two sentences. For text-similarity judgment, build two inputs with the two sentences in both orders, which tells the model that sentence order does not matter. For multiple-choice problems, build multiple inputs, each concatenating the passage with one answer option. As can be seen, the transformation is very convenient: different tasks only require construction at the input side.
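A minimal sketch of this input construction (the special tokens `<start>`, `<delim>`, `<extract>` are illustrative placeholders, not GPT's actual vocabulary entries):

```python
def classification_input(text):
    # Classification: just wrap the text with start / extract symbols
    return ["<start>"] + text.split() + ["<extract>"]

def relation_input(sentence_a, sentence_b):
    # Sentence-relationship judgment (e.g. entailment): add a delimiter between the two sentences
    return (["<start>"] + sentence_a.split() + ["<delim>"]
            + sentence_b.split() + ["<extract>"])

def similarity_inputs(sentence_a, sentence_b):
    # Similarity: two inputs with both orderings, so sentence order does not matter
    return [relation_input(sentence_a, sentence_b),
            relation_input(sentence_b, sentence_a)]

def multiple_choice_inputs(passage, options):
    # Multiple choice: one input per option, passage concatenated with the option
    return [relation_input(passage, option) for option in options]

print(relation_input("a man is sleeping", "a person is asleep"))
```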

        The effect of GPT was stunning: of 12 tasks, it achieved the best results on 9, and on some tasks performance improved dramatically.

GPT disadvantages:

  1. The language model is one-way rather than bidirectional.
  2. Its authors were not good at promoting it (weak at publicity).

3.2 Bert

        Bert uses the same two-stage model as GPT: first, language-model pre-training; second, solving downstream tasks in Fine-Tuning mode. The main difference from GPT is that the pre-training stage adopts a bidirectional language model, similar in spirit to ELMO. Another point, of course, is that the data scale of its language model is larger than GPT's. Bert's pre-training process therefore needs no further discussion here.

        In the second stage, Fine-Tuning, Bert works the same way as GPT; of course, it likewise faces the problem of reshaping the network structure of downstream tasks. Bert differs somewhat from GPT in how the tasks are transformed, which is briefly introduced below.

Taking the main task types in turn:

  • For sentence-relationship tasks it is very simple: similar to GPT, add start and end symbols plus a separator between the two sentences; for the output, attach a softmax classification layer to the position of the last Transformer layer corresponding to the first start symbol (a minimal input sketch follows this list).
  • For classification problems, as with GPT, only the start and end symbols need to be added, and the output side is modified in the same way as for sentence-relationship judgment;
  • For sequence-labeling problems, the input side is the same as single-sentence classification, and the output side only needs to classify each word at its corresponding position in the last Transformer layer.
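As a concrete illustration, a minimal sketch with the Hugging Face transformers library (the checkpoint name and label count are placeholders; a real task would fine-tune the classification head on labeled data):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Sentence-pair input: [CLS] sentence A [SEP] sentence B [SEP]
inputs = tokenizer("A man is playing guitar.", "Someone is making music.",
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # classification head sits on the [CLS] position

print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
print(logits.shape)                   # torch.Size([1, 2])
```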

        It can be seen that, of the four major NLP task types listed above, Bert covers everything except the generation tasks, and the transformation is very simple and intuitive. Although Bert's paper does not mention it, with a little thought you can see that generative tasks such as machine translation, text summarization, or chatbots can also take advantage of Bert's pre-training results with a small modification: attach it to a sequence-to-sequence structure, with a deep Transformer as the encoder and another deep Transformer as the decoder, and initialize the encoder and decoder with pre-training data chosen according to the task. That is a fairly intuitive way to adapt it. It can also be even simpler, for example by attaching an output layer directly on top of a single Transformer structure to generate the output. In any case, all four categories of NLP tasks can easily be transformed into a form Bert can accept. This is in fact a very big advantage of Bert: it can do almost any downstream NLP task, and such universality is a great strength.

        Bert is in fact closely related to ELMO and GPT. For example, if we replace GPT's pre-training stage with a bidirectional language model, we get Bert; and if we replace ELMO's feature extractor with a Transformer, we also arrive at Bert. So you can see that Bert's two most critical points are: using a Transformer as the feature extractor, and using a bidirectional language model during pre-training.

        For a Transformer, how can a bidirectional language-model task be run on this structure? By simply replacing the LSTM with a Transformer? Not quite.

        The core idea of CBOW is: when doing the language-model task, mask out the word to be predicted and then predict it from its preceding context (Context-before) and following context (Context-after). How did Bert actually do it? Bert did just that. From here you can see the inheritance between methods. The authors of Bert say they were inspired by the cloze task. So Bert is not actually very innovative in terms of the model itself; it is more a synthesis of the important NLP techniques of recent years.

        So what innovations does Bert itself bring in terms of models and methods? They are the Masked Language Model pointed out in the paper (whose details differ from CBOW) and Next Sentence Prediction.

Masked language model: randomly select 15% of the words in the corpus and mask them out, i.e., replace the original word with the [MASK] token, and then ask the model to correctly predict the removed words. But there is a problem: the model sees many [MASK] tokens during training, yet this token never appears when the model is actually used later, which would teach the model that its outputs are tied to a token it will never see at inference time, and that naturally causes trouble. To avoid this, Bert adds a twist: of the 15% of words chosen to carry out the glorious [MASK] duty, only 80% are actually replaced with the [MASK] token, 10% are swapped for a random other word, and 10% are left in place unchanged. This is the concrete recipe of the masked bidirectional language model.
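A minimal sketch of this masking strategy (toy token lists and a made-up replacement vocabulary; not Bert's actual preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Of the selected ~15% of tokens: 80% -> [MASK], 10% -> random word, 10% unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # only selected positions are predicted
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = token              # the model must recover the original word here
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return masked, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=["dog", "log", "tree", "river"]))
```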

Next Sentence Prediction: when doing language-model pre-training, sentence pairs are chosen in two ways,

  • One is to select two sentences that are really sequentially connected in the corpus;
  • The other is to roll the dice: randomly select a second sentence from the corpus and append it after the first sentence.

The model is required to perform the masked-language-model task described above and, in addition, to make a sentence-relationship prediction: is the second sentence really the follow-up of the first? The rationale is that many NLP tasks are sentence-relationship judgment tasks, and training at the granularity of word prediction does not reach the level of sentence relationships; adding this task helps downstream sentence-relationship tasks. Bert's pre-training is therefore a multi-task process, which is also one of its innovations.
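A minimal sketch of how such sentence pairs could be assembled (toy corpus; the 50/50 split follows the description above):

```python
import random

def make_nsp_pair(documents):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], 1               # genuinely consecutive sentences
    random_doc = random.choice(documents)
    return sentence_a, random.choice(random_doc), 0      # second sentence drawn at random

docs = [["First sentence of doc one.", "Second sentence of doc one.", "Third sentence."],
        ["First sentence of doc two.", "Second sentence of doc two."]]
print(make_nsp_pair(docs))
```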

        Comparative experiments show that, relative to GPT, the bidirectional language model plays the biggest role, especially for tasks that need to see the following context. Next Sentence Prediction has a smaller impact on overall performance, and its effect is highly dependent on the specific task.

To summarize:

  1. The first point is the two-stage model: the first stage is bidirectional language-model pre-training (note: bidirectional, not one-way), and the second stage uses task-specific Fine-tuning or feature integration;
  2. The second is to use Transformer as a feature extractor instead of RNN or CNN for feature extraction;
  3. Third, the bidirectional language model can be done in the CBOW style (though in my view this is a detail rather than the critical point; the first two factors matter more).

Bert's biggest highlights are its strong results and strong universality: its two-stage solution can be applied to almost all NLP tasks, and the results usually improve noticeably. It is foreseeable that the Transformer will dominate the NLP application field, and that this two-stage pre-training approach will dominate a wide range of applications.

        What is pre-training essentially doing? In essence, it designs a network structure to perform the language-model task and then exploits a huge, effectively unlimited, amount of unlabeled natural-language text; the pre-training task extracts a large amount of linguistic knowledge and encodes it into the network. When the labeled data for the task at hand is limited, this prior linguistic knowledge provides a great supplement, because limited data cannot cover many linguistic phenomena and generalization is weak; incorporating as much general linguistic knowledge as possible naturally strengthens the model's generalization ability. How to introduce prior linguistic knowledge has long been one of the main goals of NLP, especially of NLP in the deep-learning era, and there had been no good solution; the two-stage model of ELMO/GPT/Bert is undoubtedly a natural and concise way to do it, and that is the main value of these methods.

3.3 RoBERTa

        The original Bert model was an unfinished semi-finished product, while RoBERTa is the finished product built along Bert's own lines: RoBERTa can be regarded as a fully trained Bert, and these seemingly small differences greatly improve the results of the original Bert model. Compared with the original Bert model, RoBERTa shows that:

  1. Further increasing the amount of pre-training data improves the model;
  2. Prolonging pre-training, i.e., increasing the number of pre-training steps, improves the model;
  3. Sharply enlarging the pre-training batch size significantly improves the model;
  4. The Next Sentence Prediction sub-task can be removed from pre-training; it does not need to exist (its impact is small);
  5. A dynamic masking strategy for the input text helps (though its impact is not large).

Why is RoBERTa a strong benchmark in pre-trained models?

  1. First, although RoBERTa makes no technical or model improvements and merely trains the Bert model more thoroughly, its results are very good.
  2. Second, any improved model should in principle use RoBERTa as its comparison baseline; if the improved model cannot convincingly beat RoBERTa, the effectiveness of the improvement is questionable, unless you stress that the advantage of your improvement lies not in accuracy but elsewhere, for example being smaller and faster.
  3. Third, subsequent improved pre-trained models should, strategically, stand on RoBERTa's shoulders from the very start of their design; that is, on top of adding a certain amount of data, enlarge the batch size and lengthen the pre-training time so that the model is fully trained. If you do not, there is a high probability that your results will struggle to match RoBERTa's. And if you examine the models with outstanding results that we have seen so far, you will find that they have indeed adopted RoBERTa's key elements.

        Most current mainstream models use the Transformer as the feature extractor, but how should it be used to build a model structure that learns more efficiently? That is the question. High learning efficiency means that, given training data of the same size, the model can encode more knowledge into its parameters. Different ways of using the Transformer produce different model structures, and those structures differ in learning efficiency. Existing research on model structure generally identifies five common model structures, together with the self-supervised learning methods they use. The common learning methods are AutoEncoding (AE) and AutoRegressive (AR): AE is what we usually call the bidirectional language model, while AR stands for the left-to-right, one-way language model.

4. CNN

        CNN was first introduced to NLP in Kim's 2014 work; see that paper for the network structure.

        The convolution layer is essentially a feature-extraction layer. A hyperparameter F specifies how many convolution kernels (filters) the layer contains. For one filter, imagine a d*k window sliding backwards over the input matrix starting from the first word, where k is the window size specified for that filter and d is the Word Embedding length. At each position, a nonlinear transformation in the network converts the input values inside the window into a single feature value; as the window keeps sliding, the filter keeps producing feature values, which together form that filter's feature vector. This is how a convolution kernel extracts features. Every filter in the convolution layer works this way, each producing a different feature sequence. The pooling layer then reduces the dimensionality of each filter's features to form the final features, and a fully connected network is usually attached after the pooling layer for the final classification.
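A minimal PyTorch sketch in the spirit of this architecture (dimensions and filter sizes are illustrative, not the exact configuration of Kim's paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per window size k: a d*k filter sliding over the sentence
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Each filter extracts k-gram features; max-over-time pooling keeps the strongest one
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # class logits

model = TextCNN()
logits = model(torch.randint(0, 5000, (4, 20)))        # batch of 4 sentences of length 20
print(logits.shape)                                    # torch.Size([4, 2])
```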

This is the working mechanism of the CNN model first applied in the NLP field, used to solve sentence-classification tasks; it looks very simple, and improved models based on it appeared one after another. But CNN did not perform as well as RNN in NLP, which shows that this version of CNN still had many problems. The most fundamental crux is that it was not adapted to the characteristics of its new environment, so it faced a problem of acclimatization.

The problem: the key lies in the sliding window covered by the convolution kernel. The features CNN can capture are essentially confined to this window, i.e., k-gram fragments of words, and the size of k determines how distant a feature it can capture. Many improvements target exactly this point. What, then, does the current mainstream CNN for NLP look like?

        It is usually built by stacking 1-D convolution layers in depth, with skip connections added to help optimization, and techniques such as dilated CNN can also be introduced.
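A minimal sketch of such a building block (a 1-D convolution with dilation plus a skip connection; the sizes are illustrative):

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """One 1-D convolution with dilation and a residual (skip) connection."""
    def __init__(self, channels=128, kernel_size=3, dilation=2):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation   # keep the sequence length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                  # x: (batch, channels, seq_len)
        return x + self.act(self.conv(x))  # the skip connection eases optimization of deep stacks

# Growing dilation quickly enlarges the receptive field, i.e. captures longer-distance n-grams
stack = nn.Sequential(*[DilatedConvBlock(dilation=d) for d in (1, 2, 4, 8)])
out = stack(torch.randn(4, 128, 50))
print(out.shape)   # torch.Size([4, 128, 50])
```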

Reference articles:

From Word Embedding to Bert model: the history of pre-training technology development in natural language processing

PTM riding the wind and waves: the technical progress of pre-training models in the past two years

Abandon fantasy and embrace Transformer in an all-round way: a comparison of the three major feature extractors (CNN/RNN/TF) for natural language processing
