[Pytorch Neural Network Theory] 36 Common NLP tasks + the BERT model + development stages + datasets

 

1 NLP development stage

Deep learning in NLP has gone through two stages: the basic neural network stage and the BERTology stage.

1.1 Basic neural network stage

1.1.1 Convolutional Neural Networks

Convolutional neural networks treat text as image-like data (e.g., a matrix of word embeddings) and apply convolution operations to it.

1.1.2 Recurrent Neural Network

A recurrent neural network learns the semantics of a continuous piece of text by processing it in its natural order.

1.1.3 Neural Network Based on Attention Mechanism

Attention-based networks are conceptually related to convolution: they compute the similarity between the input vectors and the target output via matrix multiplication, and use those similarities to build up semantic understanding.
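To make the idea concrete, here is a minimal sketch, in plain PyTorch, of scaled dot-product attention (the shapes and toy tensors are hypothetical): similarities are computed by matrix multiplication, normalized, and then used to weight the values.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # pairwise similarity scores
    weights = F.softmax(scores, dim=-1)          # normalize into attention weights
    return weights @ v                           # weighted sum of the values

q = k = v = torch.randn(2, 5, 64)  # toy self-attention input
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])
```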

1.2 BERTology stage

By building on the three basic models above, researchers constructed models with ever-stronger fitting ability, culminating in the BERT model.

1.2.1 Development of BERT

The BERT model outperformed other models on almost all tasks and eventually evolved into a family of BERT-style pre-trained models:

  1. XLNet, a generalized autoregressive model that introduces bidirectional context information into the BERT approach;
  2. RoBERTa and SpanBERT, which improve BERT's training method and objectives;
  3. MT-DNN, which strengthens BERT by combining multi-task learning with knowledge distillation.

1.2.2 Questions about the BERT model

Researchers have tried to explore the principles of the BERT model and the real reasons it excels at certain tasks. For a period after its appearance, BERT became the mainstream technical approach to NLP tasks; this line of work is also called BERTology.

2 Common tasks of NLP

NLP can be subdivided into two areas: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

2.1 Tasks based on article processing

2.1.1 Meaning

These tasks process all of the text in an article, that is, text mining. The unit of processing is the article: the model reads all the text in the article to obtain its semantics, and the model's output layer then produces results for the specific task from that semantics.

2.1.2 Subdivision of article-processing tasks

  • Sequence-to-category: e.g., text classification and sentiment analysis.
  • Synchronized sequence-to-sequence: an output is generated for each input position, e.g., Chinese word segmentation, named entity recognition, and part-of-speech tagging.
  • Asynchronous sequence-to-sequence: e.g., machine translation and automatic summarization.

2.2 Sentence-based tasks/sequence-level tasks

These mainly include sentence classification tasks (such as sentiment classification), sentence inference tasks (e.g., inferring whether two sentences are synonymous), and sentence generation tasks (such as question answering and image captioning).

2.2.1 Sentence classification task and related datasets

Sentence classification tasks are often used in scenarios such as comment classification and grammatical-error detection. Commonly used datasets include the following (a minimal classification sketch appears after the list):

  1. SST-2 (Stanford Sentiment Treebank): a binary classification dataset for judging the sentiment of a sentence (the sentences are drawn from movie reviews).
  2. CoLA (Corpus of Linguistic Acceptability): a binary classification dataset for judging whether an English sentence is grammatically acceptable.
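As a rough illustration of the task format, here is a minimal sketch assuming the Hugging Face transformers library and the "bert-base-uncased" checkpoint. Note that the classification head below is freshly initialized, so a real system would first fine-tune it on SST-2 or CoLA.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes, e.g. negative/positive
model.eval()

inputs = tokenizer("A gripping, beautifully shot film.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, 2): one score per class
print(logits.argmax(-1).item())      # predicted class id (0 or 1)
```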

2.2.2 Sentence inference task and related datasets

The input to a sentence inference task (also known as a sentence-pair classification task) is a pair of sentences, and the goal is to judge whether the relationship between their meanings is entailment, contradiction, or neutral. Such tasks are common in intelligent question answering, intelligent customer service, and multi-turn dialogue. Common datasets are listed below, followed by a sketch of how BERT encodes a sentence pair:

  1. MNLI (Multi-Genre Natural Language Inference): a large-scale dataset from the GLUE benchmark, drawn from many sources; the goal is to judge the semantic relationship between two sentences.
  2. QQP (Quora Question Pairs): a binary classification dataset for judging whether two questions from Quora are semantically equivalent.
  3. QNLI (Question Natural Language Inference): also a binary classification dataset; each sample contains two sentences (a question and a candidate answer). In positive samples the answer matches the question; in negative samples it does not.
  4. STS-B (Semantic Textual Similarity Benchmark): a regression-style dataset; given a pair of sentences, their semantic similarity is rated on a scale of 0 to 5.
  5. MRPC (Microsoft Research Paraphrase Corpus): a binary classification dataset; the sentence pairs are drawn from news articles covering the same story, and the task is to judge whether the two sentences are semantically equivalent.
  6. RTE (Recognizing Textual Entailment): a binary classification dataset similar to MNLI, but with less data.
  7. SWAG (Situations With Adversarial Generations): a question-answering dataset; given a statement and four candidate sentences, the task is to judge which candidate most logically continues the statement, which is comparable to a reading-comprehension problem.
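As an illustration of how BERT consumes a sentence pair, here is a minimal sketch assuming the Hugging Face transformers library and the "bert-base-uncased" checkpoint: the two sentences are packed into one input, separated by [SEP] and distinguished by token type ids. The 3-way head is freshly initialized, so real use would first fine-tune it on MNLI.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / contradiction / neutral
model.eval()

premise = "A man is playing a guitar on stage."
hypothesis = "Someone is performing music."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")  # one joint input

with torch.no_grad():
    logits = model(**inputs).logits  # (1, 3)
print(logits.argmax(-1).item())      # predicted relation id
```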

2.2.3 Sentence Generation Task and Dataset

Sentence generation tasks belong to category-to-sequence (entity-object-to-sequence) tasks, such as text generation, question answering, and image captioning.

A typical dataset is as follows:

Each sample in the SQuAD dataset is a pair of texts: the first is a passage from an encyclopedia, and the second is a question whose answer is contained in the passage. Given such a pair as input, the model is required to output a short span of text as the answer to the question.

SQuAD 2.0 combines the answerable questions of the existing SQuAD dataset with more than 50,000 hard-to-answer questions written by crowdworkers; the unanswerable questions are written to look semantically similar to answerable ones. It fills a gap in existing datasets, which either focus only on answerable questions or use easily identifiable, automatically generated unanswerable questions.

To perform well on SQuAD 2.0, a model must not only answer a question when possible but also determine when the paragraph does not support an answer.

2.3 Processing tasks based on words in sentences

Processing tasks based on the words in a sentence are also called token-level tasks. They are often used for cloze tests (predicting the word, or entity word, at a certain position in a sentence) and for part-of-speech tagging.

2.3.1 Token-level task and BERT model

One token-level task, cloze, is also one of the BERT model's pre-training tasks: based on the context tokens in the sentence, the model infers what token should appear at the current position.

The Masked Language Model (MLM) objective was used in BERT's pre-training, so the model can be applied directly to token-level tasks. During pre-training, some of the tokens in a sentence are replaced with the special token [MASK], masking part of the words; the model then predicts the original word at each [MASK] position. The advantage of this kind of training is that it requires no manually annotated data: randomly masking sentences in an existing corpus is enough to produce a usable training corpus, and the trained model can then be used directly.
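As an illustration, here is a minimal fill-in-the-blank sketch assuming the Hugging Face transformers library and the pre-trained "bert-base-uncased" checkpoint; the MLM head predicts the vocabulary entry for the [MASK] position.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Locate the single [MASK] token, then take the highest-scoring vocab entry.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))  # expected: "paris"
```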

2.3.2 Token-level tasks and sequence-level tasks

In some cases, sequence-level tasks can also be split into token-level tasks for processing.

The SQuAD dataset is a generative dataset based on sentence processing. Its particularity is that the final answer is a contiguous span of text contained within the sample itself, so the task can be reduced to token-level predictions: for each token, decide whether it is the start or the end of the answer span.
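As a sketch of this reduction, the following example assumes the Hugging Face transformers library and a SQuAD-fine-tuned checkpoint ("bert-large-uncased-whole-word-masking-finetuned-squad"): the model scores every token as a possible start or end of the answer span, and the span between the best start and end is decoded as the answer.

```python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)
model.eval()

question = "Where is the Eiffel Tower?"
context = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)  # per-token start_logits and end_logits

# Pick the most likely start and end tokens, then decode the span between them.
start = out.start_logits.argmax(-1).item()
end = out.end_logits.argmax(-1).item()
answer_ids = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))  # expected: something like "paris, france"
```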

2.3.3 Entity word recognition task and common models

Named Entity Recognition (NER), also called entity identification, entity chunking, or entity extraction, is a subtask of information extraction. It aims to locate named entities in text and classify them into categories such as people, organizations, locations, time expressions, quantities, monetary values, and percentages.

In essence, NER labels each token in a sentence and determines its category. It can be used, for example, to screen resumes quickly, to optimize search-engine algorithms, and to improve recommender systems.

Common entity word recognition models include:

  1. spaCy: a Python library whose statistical named entity recognizer assigns labels to contiguous groups of tokens. It ships with a default set of entity categories covering named and numerical entities, such as company names, locations, organizations, and product names; these default categories can also be updated through training (see the sketch after this list).
  2. Stanford NER: a named entity recognizer implemented in Java. It provides default entity categories such as Organization, Person, and Location, and supports multiple languages.
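For example, a minimal sketch with spaCy (assuming the library is installed and the "en_core_web_sm" model has been downloaded with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each recognized entity span carries its text and its predicted category label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple" ORG, "U.K." GPE, "$1 billion" MONEY
```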
