NLP Tasks and Metrics (Perplexity, BLEU, METEOR, ROUGE, CIDEr)

Task


1. Information Retrieval (IR)

Information retrieval (an NLU task) refers to returning documents or records that match a user's query by searching for relevant information in large-scale text collections or databases. It mainly involves techniques such as index construction, query processing, and result ranking, and aims to help users obtain the information they need quickly and effectively.

The query and the documents are mapped into the same feature space, where similarity is computed; this avoids the vocabulary and semantic mismatch problems of traditional term-based IR.

Model architecture


Cross-Encoder architecture: the tokens of the query and the document are concatenated and fed into a single LM, and the CLS token serves as a joint representation of the pair. The pairwise hinge loss is similar to a triplet loss: it pulls positive query–document pairs together and pushes negative pairs apart.
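
A minimal PyTorch sketch of that pairwise hinge loss, assuming the cross-encoder already produces a scalar relevance score for each (query, document) pair (the margin value is an illustrative choice):

```python
import torch

def pairwise_hinge_loss(pos_scores: torch.Tensor,
                        neg_scores: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    # pos_scores / neg_scores: (batch,) cross-encoder scores s(q, d+) and s(q, d-).
    # Like a triplet loss: require s(q, d+) to exceed s(q, d-) by at least `margin`.
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()
```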


Dual-Encoder architecture: the query and the document are encoded separately; an NLL loss is then computed on the two feature vectors for contrastive learning.
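
A sketch of the contrastive NLL with in-batch negatives, one common way to implement this (the temperature and L2 normalization are typical choices, not mandated by the text):

```python
import torch
import torch.nn.functional as F

def in_batch_nll_loss(q_vecs: torch.Tensor,
                      d_vecs: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    # q_vecs, d_vecs: (batch, dim). Row i of d_vecs is the positive document for
    # query i; every other row in the batch serves as a negative.
    q = F.normalize(q_vecs, dim=-1)
    d = F.normalize(d_vecs, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```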


The advantage of this architecture is that documents can be encoded in advance and the index built offline. When a new query arrives, only the query needs to be encoded, and a KNN vector library such as faiss provides fast matching.
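
A sketch of that offline-index / online-query workflow with faiss (random vectors stand in for real encoder outputs; the 768 dimension is just an assumption matching BERT-base):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 768
doc_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in for encoded docs
faiss.normalize_L2(doc_vecs)             # so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)           # exact inner-product index, built offline
index.add(doc_vecs)

query_vec = np.random.rand(1, dim).astype("float32")      # encoded at query time
faiss.normalize_L2(query_vec)
scores, doc_ids = index.search(query_vec, 10)             # top-10 nearest documents
```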


Evaluation Metric

These metrics measure the performance of the retrieval system when returning the top K results (only the top-K retrieved items are evaluated); a code sketch of all three follows the list below.

  • MRR@K (Mean Reciprocal Rank at K): measures, over a query set Q, the average reciprocal rank of the first relevant document within the top K returned results. Note that "reciprocal rank" means the reciprocal of the position of the first relevant document in the result list; for example, a value of 1 means the relevant document is in the first position. The higher the MRR@K, the earlier the system surfaces the first relevant document.

  • MAP@K (Mean Average Precision at K): measures, over a query set, the mean of the average precision of the top K returned results. MAP@K takes into account both the ranking of results and the precision at each position; the higher the score, the higher the system's average precision within the top K results.

  • NDCG@K (Normalized Discounted Cumulative Gain at K): measures the quality of the relevance ranking of the returned list. In the NDCG calculation, relevance grades are given different weights: documents with higher relevance contribute more, discounted by their position in the ranked list. The higher the NDCG@K, the better the system's relevance ranking within the top K results.
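
A minimal per-query sketch of the three metrics, under one reasonable reading of the standard definitions (conventions differ slightly, e.g. in MAP@K's denominator); averaging each value over the query set Q gives the "mean" in the name:

```python
import math

def mrr_at_k(rels, k):
    # rels: binary relevance of the returned list, best-first, e.g. [0, 1, 0, 1]
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            return 1.0 / rank          # reciprocal rank of the first relevant hit
    return 0.0

def ap_at_k(rels, k):
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank       # precision at each relevant position
    return total / hits if hits else 0.0

def ndcg_at_k(gains, k):
    # gains: graded relevance labels in the order the system returned them
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))
    idcg = sum(g / math.log2(r + 1)
               for r, g in enumerate(sorted(gains, reverse=True)[:k], start=1))
    return dcg / idcg if idcg else 0.0
```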

Directions of development

  1. How to mine more difficult negative sample pairs?

ANCE maintains an asynchronous inference process alongside training: every k training steps, inference is run over the corpus, and the top-ranked incorrect results from that inference are used as hard negatives in the next round of training.

  2. How to better pre-train large models?

SEED uses a very weak decoder to force the encoder to produce a stronger CLS feature representation.


  3. How to improve the few-shot performance of the model?


2. Text Generation (TG)

Text generation (NLG) refers to the process of automatically generating natural language text with computers. Generation can be based on specific rules, templates, or statistical models, and can produce text in various forms, such as articles, dialogues, abstracts, and titles. Application areas include automatic summarization, machine translation, and intelligent customer service.


It mainly includes two generation modes: data2text and text2text. Both belong to natural language generation, but they differ slightly in focus and goal: data2text is concerned with converting structured data into natural language text, while text2text covers the various text-to-text conversion tasks.

2.1 Data-to-text

The input can be non-text data such as images (image understanding), tables (table understanding), graphs, or JSON, and the output is a textual summary of the data.

data2text (data-to-text): the goal of this task is to convert given structured data into natural language text. Common applications include generating reports, summaries, and descriptions. For example, in weather forecasting, meteorological data (temperature, humidity, etc.) are converted into readable forecast text.

2.2 Text-to-text

text2text (text-to-text) is a broad task family covering a variety of natural language generation tasks. The goal is to convert input natural language text into another form of natural language text, as in machine translation, text summarization, question/answer generation, and text style transfer. For example: translating an English article into Chinese, generating a summary of a text, or turning a question into an answer.
For example, text summarization.

Another example is the dialogue system, which aims to enable natural conversational interaction between humans and machines. It can be a task-oriented system that answers questions or completes tasks based on the user's instructions and needs, or an open-domain system that holds free-form conversations with users. Dialogue systems need to understand context, generate coherent responses, and interact effectively with users.

Model architecture


Decoder types:


Evaluation Metric

**Common metrics:** BLEU, Perplexity, ROUGE, NIST, METEOR, CIDEr


Other metrics:


Controllable Text Generation (Control TG)

Input: prompt + text

Model level: Prefix + Model

Modify the probability distribution:

Modify the model structure:


3. Question Answering

Question answering (NLU + NLG) means that the system finds and generates accurate answers from a predefined knowledge base or text collection, based on the questions users raise. QA combines NLU and NLG: NLU is used to understand the user's question, convert it into a form the machine can process, and identify the key information in the query; NLG is used to turn the answers found in the knowledge base into natural-language responses returned to the user.

  • Reading comprehension Q&A: the machine is required to read and understand the input text and question, and on that basis answer questions related to the information in the text.

  • Open-domain Q&A: any factual question can be asked, without providing any supporting text in the input. Answers can be generated directly by the model, or produced by searching external knowledge bases.

3.1 Reading Comprehension (RC)

RC task designs include: cloze tests (CNN/Daily Mail, CBT), multiple choice (RACE), and extractive QA, where the answer is a span of the original text (SQuAD).

Model architecture


The query and the reference text can simply be concatenated and fed into a language model such as BERT, so that they are encoded together and interact with each other.

Large models unify the forms of reading comprehension: extractive reading comprehension (the answer appears in the text), abstractive reading comprehension (the answer does not appear verbatim in the text), multiple-choice questions, true/false questions, and other formats are all cast into a single text2text paradigm.


3.2 Open-Domain Question Answering (OQA)

Open-domain QA means that any factual question can be asked. Generally, a massive text corpus such as Wikipedia or Baidu Baike is provided, and the system must find answers to arbitrary non-subjective questions within it, which is obviously much more difficult.

Model architecture
  • Generation-based models: the huge number of parameters in large models stores a large amount of knowledge, so answers can be generated directly without an external knowledge base.

  • Retrieval-based models: ① text retrieval: a retriever finds, from a massive amount of text (a knowledge base or the Internet), the N documents most relevant to the question, which should contain the answer; ② reading comprehension: a reader then extracts the specific answer from the retrieved documents (see the sketch below).
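
A toy end-to-end sketch of this retrieve-then-read pipeline: TF-IDF similarity stands in for a learned retriever, and Hugging Face's question-answering pipeline (which downloads a default SQuAD-style model) plays the reader. The documents and question are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline  # pip install transformers scikit-learn

# Toy knowledge base; a real system indexes Wikipedia-scale text.
docs = [
    "Paris is the capital and largest city of France.",
    "The Great Wall of China is more than 20,000 km long.",
    "Python was created by Guido van Rossum and first released in 1991.",
]
question = "Who created Python?"

# ① Retriever: rank documents by TF-IDF cosine similarity to the question.
vectorizer = TfidfVectorizer().fit(docs)
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(docs))[0]
best_doc = docs[scores.argmax()]

# ② Reader: an extractive QA model pulls the answer span out of the document.
reader = pipeline("question-answering")   # downloads a default SQuAD-style model
print(reader(question=question, context=best_doc))
```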

RAG (retrieval-augmented generation): the retriever and the large language model are pre-trained jointly, as in REALM.

Spans of unlabeled text are masked to create fill-in-the-blank examples: the masked text serves as the query, it is concatenated with the top-k retrieved knowledge-base passages, and the result is fed to the large model to generate the answer (i.e., to recover the masked text), as sketched below.
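
A toy sketch of that data flow only (mask → retrieve → splice); the word-overlap retriever and the [MASK]/[SEP] markers are illustrative stand-ins, not REALM's actual components:

```python
import random

def realm_style_example(passage, knowledge, tokenize=str.split):
    # ① Mask one token of unlabeled text; the masked sentence becomes the query.
    tokens = tokenize(passage)
    i = random.randrange(len(tokens))
    target, tokens[i] = tokens[i], "[MASK]"
    query = " ".join(tokens)
    # ② Retrieve the most related passages (toy word-overlap scoring here).
    ranked = sorted(knowledge,
                    key=lambda d: len(set(tokenize(d)) & set(tokens)),
                    reverse=True)
    # ③ Splice query and top-k evidence; an LM would now predict `target`.
    lm_input = query + " [SEP] " + " [SEP] ".join(ranked[:2])
    return lm_input, target
```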


In addition to searching external knowledge bases built in advance, the system can also search the Internet at large.


Metric

BLEU, METEOR, and ROUGE are generally used for machine translation, while CIDEr is generally used for image captioning.

Perplexity

https://zhuanlan.zhihu.com/p/633757727
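
In short, perplexity is the exponentiated average negative log-likelihood per token, PPL = exp(-(1/N) · Σ log p(w_i | w_<i)); lower is better. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities the LM assigned to each token
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that gives every token probability 0.25 has perplexity exactly 4:
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```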

BLEU

BLEU was originally used for machine translation, and the idea behind it is actually quite naive. For a given source sentence, there is a reference translation S1 and a machine translation S2 produced by a neural network. BLEU looks at all the phrases appearing in the machine translation S2, counts how many of them also appear in S1, and takes that ratio as the BLEU score (similar to precision). The phrases are first split by n-gram length, giving BLEU-1, BLEU-2, BLEU-3, and BLEU-4: the text is divided into phrases of length 1 word, phrases of length 2 words, and so on. The number of such phrases that appear in the reference translation, divided by the total number of phrases, gives the corresponding BLEU-1 score, BLEU-2 score, etc. This is essentially a precision: how many of the candidate's phrases appear in the reference. Generally speaking, unigram precision measures the adequacy of word-level translation, while higher-order n-gram precision measures the fluency of the sentence (n ∈ {1, 2, 3, 4}).

But BLEU has a flaw: if the output is a single word that happens to appear in the reference (i.e., the translation is extremely short), the precision would be 100%. To address this, BLEU adds a length (brevity) penalty factor that penalizes translations that are too short. Another flaw is that the BLEU score does not consider the order of words in the translation.

For concrete code-level usage, see: https://zhuanlan.zhihu.com/p/404381278
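
A quick sketch with NLTK (the brevity penalty discussed above is applied automatically; the example sentences are made up):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["the cat is on the mat".split()]   # S1: reference translation(s)
candidate = "the cat sat on the mat".split()    # S2: system output

print("BLEU-1:", sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print("BLEU-2:", sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
# Short sentences may have zero 4-gram matches; NLTK's SmoothingFunction helps there.
print("BLEU-4:", sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
```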

METEOR

The general idea: sometimes a translation is correct but simply fails to match the reference exactly (for example, it uses a synonym). METEOR therefore expands matching with synonym sets from knowledge sources such as WordNet, and also takes word forms into account (words sharing the same stem count as partial matches and earn some reward; translating "likes" as "like" is better than producing something unrelated). To evaluate fluency, METEOR uses the notion of a chunk: the candidate and reference translations are aligned, and contiguous matched words form a chunk (the alignment is a somewhat complex heuristic beam search). The smaller the number of chunks, the longer the average chunk, which means the candidate's word order agrees better with the reference's. Finally, both recall and precision are taken into account, and an F-score serves as the final metric.

METEOR's shortcomings are also obvious. The official implementation is in Java, shipped as a jar rather than as an API, and it can only score a whole test set rather than individual sentences (unless each sentence is written to its own file and the jar is invoked separately), which is rather clumsy. In an era when Python dominates deep learning, it is easy to imagine that few people use this metric...
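
The Java-only complaint predates NLTK's pure-Python re-implementation, which can score single sentences (scores may differ slightly from the official Java tool, and whether inputs must be pre-tokenized depends on your NLTK version):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR's synonym matching needs WordNet data

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(meteor_score([reference], candidate))  # single-sentence METEOR score
```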

ROUGE

The basic idea of ROUGE is similar to BLEU's, but it computes recall: for the phrases in the reference translation, it counts how many also appear in the machine translation (exactly the opposite of BLEU). In effect, it measures how much of the reference the system output covers. Since the metric only cares whether the reference's phrases appear in the output, longer machine translations naturally tend to score better.
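
A quick sketch with Google's `rouge-score` package, one common implementation (the example strings are made up):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat is on the mat",   # reference (target)
                      "the cat sat on the mat")  # system output (prediction)
for name, s in scores.items():
    print(name, f"P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```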

CIDEr

Commonly used for image captioning, CIDEr is a combination of BLEU and the vector space model. It treats each sentence as a document and computes the cosine similarity of TF-IDF vectors (whose terms are n-grams rather than single words). This yields the similarity between the candidate sentence and the reference sentences, and the similarities for n-grams of different lengths are averaged to obtain the final score. The advantage is that different n-grams receive different TF-IDF weights: n-grams that are more common across the corpus carry less information. The key point in evaluating image captions is whether the model has captured the key information. For example, if a picture shows "a person swimming in a swimming pool during the day", the most critical information is "swimming"; whether the caption includes or omits secondary details such as "daytime" is largely irrelevant, so down-weighting non-key n-grams is exactly the operation needed.
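
A simplified sketch of the TF-IDF n-gram cosine at the heart of CIDEr, for a single n (the real metric averages n = 1..4 and, in the CIDEr-D variant, adds a length-based Gaussian penalty):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_n(candidate, references, corpus, n=1):
    # candidate / references: token lists; corpus: all reference sentences,
    # used to compute document frequencies for the IDF weights.
    df = Counter(g for sent in corpus for g in ngrams(sent, n))
    log_size = math.log(max(len(corpus), 1))

    def tfidf(tokens):
        counts = ngrams(tokens, n)
        total = sum(counts.values()) or 1
        # rarer n-grams across the corpus get higher weight (more informative)
        return {g: (c / total) * (log_size - math.log(max(df[g], 1)))
                for g, c in counts.items()}

    def cosine(a, b):
        dot = sum(v * b.get(g, 0.0) for g, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand = tfidf(candidate)
    return sum(cosine(cand, tfidf(r)) for r in references) / len(references)
```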

Source: blog.csdn.net/weixin_54338498/article/details/133019398