A Theoretical Understanding of Image Captioning Evaluation Metrics

These metrics are all used to evaluate the quality of generated text. The general approach is the same: compare a candidate text (usually machine-generated) against one or more reference texts (usually human-written) and measure their similarity. Where they differ is in their typical use cases: BLEU, METEOR and ROUGE come from machine translation (ROUGE is now mainly used for summarization), while CIDEr and SPICE are generally used for image description generation.

BLEU

BLEU is the earliest machine translation evaluation metric and, in a sense, the ancestor of all later text-generation metrics. Its basic idea is to measure how much the n-grams of the candidate translation overlap with those of the reference translations (in practice, from unigrams up to 4-grams): the higher the overlap, the better the translation. N-grams of different lengths are used because unigram precision measures the accuracy of word-level translation, while the precision of higher-order n-grams reflects the fluency of the sentence.
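As a concrete illustration of the clipped (modified) n-gram precision that BLEU is built on, here is a minimal sketch in Python; the helper names `ngrams` and `modified_precision` are my own and not from any particular library:

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the most favourable reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "a man is swimming in a pool".split()
references = ["a man swims in a swimming pool".split()]
# Precision drops as n grows: word choice is mostly right, word order less so.
print([round(modified_precision(candidate, references, n), 3) for n in range(1, 5)])
```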

BLEU is a precision-only metric: it cares about how many n-grams in the candidate translation are "correct" (i.e., appear in a reference translation) and ignores recall (which n-grams of the reference translation are missing from the candidate). One patch for this is the brevity penalty, which punishes candidates that are too short (in machine translation, a very short candidate usually means content was dropped, i.e., low recall). Even so, BLEU is generally considered to be biased toward shorter translations, because the brevity penalty is weaker than one might expect.
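The brevity penalty itself is easy to state. Below is a rough sketch of it, together with one way a BLEU score could be computed with NLTK's sentence_bleu (assuming NLTK is installed; the example sentences are made up):

```python
import math
from nltk.translate.bleu_score import sentence_bleu  # assumes nltk is installed

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 when the candidate is at least as long as the reference,
    otherwise exp(1 - r/c), which shrinks quickly as the candidate gets shorter."""
    if candidate_len >= reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1.0 - reference_len / candidate_len)

candidate = "a man is swimming".split()
references = ["a man is swimming in a pool".split()]

print(brevity_penalty(len(candidate), len(references[0])))       # < 1: the short candidate is penalized
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))  # BLEU-2 via NLTK, penalty included
```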

METEOR

METEOR starts from the observation that a translation can be correct yet fail to match the reference exactly (for example, it uses a synonym). It therefore uses knowledge sources such as WordNet to expand matching to synonym sets, and it also considers morphology: words sharing the same stem count as partial matches and receive partial credit (translating "likes" as "like" is still better than producing an unrelated word). To evaluate fluency, METEOR uses the notion of chunks: the candidate and the reference are aligned (the alignment algorithm is a fairly involved heuristic beam search), and runs of consecutively aligned words form chunks. Fewer chunks means each chunk is longer on average, i.e., the word order of the candidate agrees better with the reference. Finally, both precision and recall are taken into account, and an F-score is used as the final metric.
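Leaving the alignment step aside, the final METEOR score is just an F-mean multiplied by a fragmentation penalty. A rough sketch of that scoring formula, assuming the alignment has already produced the match and chunk counts, and using the commonly cited defaults alpha=0.9, beta=3, gamma=0.5:

```python
def meteor_score(matches, chunks, candidate_len, reference_len,
                 alpha=0.9, beta=3.0, gamma=0.5):
    """Final METEOR score from unigram precision/recall and a chunk-based
    fragmentation penalty. `matches` and `chunks` are assumed to come from
    the (non-trivial) alignment step described above."""
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean weighted heavily toward recall (alpha = 0.9).
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fewer chunks => longer contiguous matches => smaller penalty.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# Example: 6 matched unigrams that form only 2 chunks (fairly consistent word order).
print(meteor_score(matches=6, chunks=2, candidate_len=7, reference_len=8))
```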

Its advantage is that word forms and synonyms are taken into account.
Its disadvantages: the reference implementation is only available as a Java jar rather than a library API; it has four hyperparameters (alpha, beta, gamma, delta) that must be tuned so that its scores agree as closely as possible with human judgment; and it needs external knowledge sources (WordNet, etc.) for matching, so languages not covered by WordNet cannot be evaluated with METEOR.

ROUGE

ROUGE is almost the mirror image of BLEU: BLEU only computes precision, while ROUGE only computes recall. It is now mainly used as an auxiliary metric for summarization. Its motivation as a machine translation metric was this: in the SMT (statistical machine translation) era, translation quality was poor and both adequacy and fluency had to be evaluated; once NMT (neural machine translation) arrived, neural models became very good at producing fluent output, but they sometimes "translate blindly", e.g., changing a name or a number, or dropping half a sentence, which is quite common. So the idea was to stop looking at fluency and only look at recall, which reveals whether the NMT system has dropped content (dropped content shows up as low recall). In that sense, ROUGE is suited to evaluating NMT rather than SMT, because it does not care whether the candidate translation is fluent.
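For concreteness, ROUGE-N is simply n-gram recall against the reference. A minimal sketch (the function name `rouge_n` is mine, not a library API):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: the fraction of the reference's n-grams that also appear in
    the candidate (with clipped counts), i.e. n-gram recall."""
    ref_grams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_grams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    if not ref_grams:
        return 0.0
    overlap = sum(min(count, cand_grams[gram]) for gram, count in ref_grams.items())
    return overlap / sum(ref_grams.values())

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(rouge_n(candidate, reference, n=1))  # most reference words are recovered
print(rouge_n(candidate, reference, n=2))  # fewer reference bigrams survive
```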

CIDEr

CIDEr combines the ideas of BLEU and the vector space model and was designed specifically for image captioning. It treats each sentence as a document, represents it as a TF-IDF vector (except that the terms are n-grams rather than single words), and scores the generated caption by the cosine similarity between its vector and those of the reference captions. The advantage is that different n-grams get different TF-IDF weights: n-grams that are common across the whole corpus carry less information and are therefore down-weighted.
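A simplified sketch of the idea, treating each reference caption as a document for the IDF statistics (the actual CIDEr paper computes document frequency per image, averages over n = 1..4, and CIDEr-D adds a length-based penalty, all of which are omitted here):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_n(candidate, references, corpus_refs, n=1):
    """Simplified CIDEr_n: average cosine similarity between the TF-IDF
    n-gram vector of the candidate and those of its references.
    `corpus_refs` is the pool of reference captions used for the IDF statistics."""
    num_docs = len(corpus_refs)
    doc_freq = Counter()
    for ref in corpus_refs:
        doc_freq.update(set(ngram_counts(ref, n)))

    def tfidf(tokens):
        counts = ngram_counts(tokens, n)
        total = sum(counts.values()) or 1
        # Smoothed IDF: n-grams common across the corpus get weights near zero.
        return {g: (c / total) * math.log((1 + num_docs) / (1 + doc_freq[g]))
                for g, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(g, 0.0) for g, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    cand_vec = tfidf(candidate)
    return sum(cosine(cand_vec, tfidf(ref)) for ref in references) / len(references)

refs = ["a man is swimming in a pool".split(), "someone swims alone in a swimming pool".split()]
cand = "a person swimming in a pool".split()
print(cider_n(cand, refs, corpus_refs=refs, n=1))
```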

The key point in evaluating image description generation is whether the model has captured the important information. For example, if a picture shows "someone swimming alone in a swimming pool during the day", the most critical information is "swimming"; whether some other detail (such as "daytime") is included or omitted hardly matters, which is why down-weighting non-key n-grams is needed. In machine translation, the output must be faithful to the source, so multiple translations of the same sentence should be paraphrases of each other and carry the same information; multiple descriptions of the same picture, however, need not be paraphrases, because different descriptions can include different amounts of image detail, and both a finer and a coarser description can be correct.

SPICE

SPICE is also designed specifically for image captioning. Sometimes the generated sentence differs from the reference by only a single word, say "tennis court" in the reference versus "basketball court" in the generation, yet semantically it is a bad caption because the model has misidentified the tennis court as a basketball court. In such cases BLEU and other n-gram metrics cannot judge the quality of the generation well.

SPICE uses a graph-based semantic representation to encode the objects, attributes and relationships in a description. It first parses both the candidate caption and the reference captions into syntactic dependency trees with a probabilistic context-free grammar (PCFG) dependency parser, then maps the dependency trees into scene graphs with a rule-based method, and finally computes an F-score over the objects, attributes and relationships of the candidate caption against those of the references.
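Skipping the parsing stage, the final SPICE score is an F-score over scene-graph tuples. A minimal sketch, assuming the tuples have already been extracted and ignoring SPICE's WordNet-based synonym matching (the example tuples are invented to echo the tennis/basketball court example above):

```python
def spice_f1(candidate_tuples, reference_tuples):
    """F-score over scene-graph tuples: objects, (object, attribute) pairs and
    (object, relation, object) triples. The parsing that produces the tuples
    and SPICE's WordNet synonym matching are assumed to have happened already."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented tuples for "a young girl standing on a tennis court".
reference = {("girl",), ("court",), ("girl", "young"), ("court", "tennis"),
             ("girl", "standing-on", "court")}
# The model said "basketball court", so one attribute tuple fails to match.
candidate = {("girl",), ("court",), ("girl", "young"), ("court", "basketball"),
             ("girl", "standing-on", "court")}
print(spice_f1(candidate, reference))  # < 1: penalized for the wrong attribute
```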

Its advantages: it explicitly accounts for objects, attributes and relationships, and it correlates better with human judgment than n-gram-based metrics.
Its disadvantages: it ignores grammar and fluency and depends on the quality of the semantic parser, and every object, attribute and relationship gets the same weight (even though the objects in an image clearly differ in importance).

