Text generation tasks & evaluation metrics

Text generation tasks

Depending on the type of input, text generation tasks fall into three categories: text-to-text generation, data-to-text generation, and image-to-text generation.

Typical tasks include:

  • Machine translation
  • Summarization
  • Dialogue
  • Creative writing: storytelling, poetry generation
  • Free-form question answering (answers are generated rather than extracted from text or a knowledge base)
  • Image captioning

Evaluation metrics

Word-overlap-based methods

Common metrics for machine translation & summarization

Word-overlap methods measure, at the lexical level, the similarity between the model's generated text and human-written reference text. The classic representatives are BLEU, METEOR, and ROUGE: BLEU and METEOR are commonly used for machine translation, while ROUGE is commonly used for automatic summarization.
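As an illustration of the core idea, here is a minimal sketch of BLEU's clipped (modified) n-gram precision; full BLEU combines the precisions for n = 1..4 with a brevity penalty, and ROUGE-N is the recall-oriented counterpart. The example sentences are made up:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    # Core of BLEU: each candidate n-gram count is clipped by its
    # count in the reference before computing precision.
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(modified_precision(cand, ref, 1))  # unigram precision: 5/6
print(modified_precision(cand, ref, 2))  # bigram precision: 3/5
```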

Common metrics for data-to-text

The biggest difference between data-to-text and generation tasks such as translation and summarization is that the input is structured data, such as tables or triples. When evaluating the generated results, we therefore also need to consider whether the text accurately covers the information in the data.

  • Relation Generation (RG)

    Extract relations from the generated sentence, then check how many of them also appear in the source data (usually reported with two metrics: precision and count).

  • Content Selection (CS)

    Measures how much of the content in the data appears in the generated sentence, usually with two metrics: precision and recall.

  • Content Ordering (CO)

    The sequences of records in the generated and reference sentences are compared using the normalized Damerau-Levenshtein distance (a sketch appears after this list).

  • Coverage

    If the data-to-text task does not involve complex relation extraction, you can also simply verify through matching whether the text covers the data to be described (see the matching sketch after this list).
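A minimal sketch of the Content Ordering computation, using the optimal-string-alignment variant of the Damerau-Levenshtein distance over record sequences (the record types in the example are hypothetical):

```python
def osa_distance(a, b):
    # Optimal string alignment: edit distance allowing insertions,
    # deletions, substitutions, and adjacent transpositions.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def content_ordering(gen_records, ref_records):
    # Normalize to a similarity in [0, 1]; 1 means identical record order.
    dist = osa_distance(gen_records, ref_records)
    return 1 - dist / max(len(gen_records), len(ref_records), 1)

gen = ["PTS", "REB", "AST"]   # hypothetical record types, in generation order
ref = ["PTS", "AST", "REB"]
print(content_ordering(gen, ref))  # one transposition -> 1 - 1/3 ≈ 0.667
```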
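And a minimal sketch of the Coverage check by direct matching, assuming the data values are simple strings (the records and example sentence are made up; real systems may need value normalization or alias handling):

```python
def coverage(record_values, generated_text):
    # Fraction of data values (e.g., table cells or triple objects)
    # that literally appear in the generated text.
    text = generated_text.lower()
    hits = sum(1 for v in record_values if str(v).lower() in text)
    return hits / max(len(record_values), 1)

records = ["Lionel Messi", "Inter Miami", "2023"]
print(coverage(records, "Lionel Messi joined Inter Miami in 2023."))  # 1.0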

Word-embedding-based metrics

The word-overlap metrics above are essentially n-gram methods that measure how much the generated response overlaps or co-occurs with the reference response. Word-embedding methods instead convert each sentence into a vector representation (via Word2Vec, Sent2Vec, and similar methods), mapping it into a low-dimensional space where the sentence vector captures its meaning to some extent; the similarity between two sentences can then be computed with measures such as cosine similarity. The common variants below are followed by a short sketch.

  • Greedy Matching
  • Embedding Average
  • Vector Extrema
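A minimal sketch of the three variants in Python, assuming pretrained word vectors (e.g., from Word2Vec) are already available as the rows of a matrix; the random vectors in the usage lines merely stand in for real embeddings:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def embedding_average(vecs_a, vecs_b):
    # Mean of the word vectors as the sentence vector, then cosine similarity.
    return cosine(np.mean(vecs_a, axis=0), np.mean(vecs_b, axis=0))

def vector_extrema(vecs):
    # Per dimension, keep the value with the largest magnitude.
    vecs = np.asarray(vecs)
    idx = np.argmax(np.abs(vecs), axis=0)
    return vecs[idx, np.arange(vecs.shape[1])]

def greedy_matching(vecs_a, vecs_b):
    # Match each word in one sentence to its most similar word in the
    # other, average the scores, and symmetrize over both directions.
    def one_way(xs, ys):
        return np.mean([max(cosine(x, y) for y in ys) for x in xs])
    return (one_way(vecs_a, vecs_b) + one_way(vecs_b, vecs_a)) / 2

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 50))  # 5 word vectors for sentence A (stand-ins)
b = rng.normal(size=(7, 50))  # 7 word vectors for sentence B
print(embedding_average(a, b))
print(cosine(vector_extrema(a), vector_extrema(b)))
print(greedy_matching(a, b))
```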

Language-model-based methods

  • PPL (perplexity)

    A good language model should assign higher probability to the sentences in the test set: since the test sentences are all well-formed text, the higher the probability the trained model assigns to them, the better (equivalently, the lower the perplexity, the better). The formula is:

    $PPL(W)=P(w_1,w_2,\dots,w_N)^{-\frac{1}{N}}=\sqrt[N]{\frac{1}{P(w_1,w_2,\dots,w_N)}}$

    In effect, the model scores each sentence in the test set by the probability of that sentence occurring; a higher probability (lower perplexity) indicates a better model.
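A minimal sketch of the computation in Python, assuming the per-token probabilities from a trained language model are already available (the probability values below are made up for illustration):

```python
import numpy as np

def perplexity(token_probs):
    # token_probs: probability the model assigns to each token of a sentence.
    # PPL = P(w_1..w_N)^(-1/N) = exp(-(1/N) * sum(log p_i)); lower is better.
    log_probs = np.log(np.asarray(token_probs))
    return float(np.exp(-np.mean(log_probs)))

# Hypothetical per-token probabilities for a 4-token sentence
print(perplexity([0.2, 0.1, 0.5, 0.25]))  # ≈ 4.47
```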

BERT-based scoring metrics

Sentence similarity is computed from contextual sentence representations (the BERT family of models) combined with hand-designed calculation logic. Such metrics are more robust and perform better when training data is scarce.

  • BERTScore

    Use BERT to extract features for the generated sentence and the reference sentence (tokenized into word pieces), then compute the inner product between every pair of tokens from the two sentences to obtain a similarity matrix. Based on this matrix, the maximum similarity scores are accumulated and normalized along the reference side and the candidate side respectively, yielding BERTScore recall and precision (which can be combined into an F1).

    [Figure: BERTScore computation flow chart]
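A minimal sketch of the greedy-matching step in Python, assuming L2-normalized token embeddings from BERT are already available (so inner products equal cosine similarities); the official BERTScore implementation additionally supports IDF weighting of tokens and baseline rescaling:

```python
import numpy as np

def bertscore_from_embeddings(cand_emb, ref_emb):
    # cand_emb, ref_emb: L2-normalized token embeddings,
    # shapes (n_cand, d) and (n_ref, d). Inner products give the
    # token-to-token similarity matrix.
    sim = cand_emb @ ref_emb.T                 # (n_cand, n_ref)
    precision = sim.max(axis=1).mean()         # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()            # each reference token -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Random embeddings as stand-ins for real BERT token features
rng = np.random.default_rng(0)
cand = rng.normal(size=(6, 768))
ref = rng.normal(size=(8, 768))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref /= np.linalg.norm(ref, axis=1, keepdims=True)
print(bertscore_from_embeddings(cand, ref))
```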
