[Deep Learning] Image Annotation and Evaluation Criteria

Recently, image understanding has received a lot of attention. Let's take a quick look at what image understanding is, and what the evaluation criteria for image captioning actually are.


Image understanding means understanding the content of an image; put more plainly, it means describing what you see in a picture in a single sentence.

Deep learning is so popular that of course we have to try it and see whether it can solve this problem. Oh wait, it can definitely be solved; it's just a matter of time, and of how the model is designed. (Everyone should believe one truth: deep learning can handle everything. It's a bit embarrassing to say that out loud.)


OK, back to the subject. Since we want to use the all-purpose DL, we need a dataset, right? And the dataset needs labels, right? Yes, here each image's label is the sentence that describes its content.


Different people (and machines) describe the same picture differently, and naturally some descriptions are better than others. How do we tell good from bad? Our predecessors have proposed the following evaluation metrics:

B-1, B-2, B-3, B-4 (BLEU-1 through BLEU-4), M (METEOR), R (ROUGE-L), and CIDEr

The higher the score, the better. There are now papers whose models score higher than humans on 13 of the 14 entries (2 test settings × 7 metrics) on the MS COCO leaderboard. The two test settings are 5-refs and 40-refs: in the 5-refs setting each image in the test set has 5 reference captions (i.e., correct sentences written by humans), and in the 40-refs setting each image has 40 reference captions. However, this does not mean the current algorithms are already very good, because they still produce some terrible outputs, and humans generally don't make such low-level mistakes.


You can upload your results here to get evaluation scores on MS COCO captions, but the number of submissions is limited.

https://www.codalab.org/competitions/3221
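As a rough sketch of what gets uploaded (the image ids, captions, and file name below are made-up placeholders, not a real submission), the results file is just a JSON list pairing each test image id with the caption your model generated:

import json

# each entry pairs a test-set image id with your generated caption;
# the ids and captions here are placeholders
results = [
    {"image_id": 1, "caption": "a man riding a horse on a beach"},
    {"image_id": 2, "caption": "two dogs playing with a frisbee in a park"},
]

# the file name is also a placeholder; match it to the split you submit for
with open("captions_val2014_results.json", "w") as f:
    json.dump(results, f)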

You can also download the code for local testing on GitHub.

https://github.com/tylin/coco-caption
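Local evaluation with that repo goes roughly like this (a minimal sketch modeled on its demo; the annotation and result file paths are placeholders you would replace with your own files):

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# placeholder paths: the official caption annotations and your results file
ann_file = "annotations/captions_val2014.json"
res_file = "captions_val2014_results.json"

coco = COCO(ann_file)              # load the ground-truth reference captions
coco_res = coco.loadRes(res_file)  # load your generated captions

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # only score submitted images
coco_eval.evaluate()

# prints BLEU-1..4, METEOR, ROUGE_L and CIDEr scores
for metric, score in coco_eval.eval.items():
    print(metric, score)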


There are also Flickr8k and Flickr30k; their data volume is much smaller than MS COCO captions and they probably aren't used much anymore, so I won't go into details.

There are also datasets whose annotations were collected through games.


There is a paper that compares the various evaluation metrics to see whether they can effectively judge the quality of an algorithm; let's jump straight to its conclusion.

The paper's conclusion is to prefer METEOR, or alternatively to use ROUGE-SU4 and Smoothed BLEU. PS: since CIDEr was only released in 2015, it is not covered in that paper.


Perplexity

Perplexity measures how uncertain a language model is about a sentence, roughly, how unlikely the model thinks the sentence is. The lower the value, the better.
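As a minimal sketch (the per-token probabilities below are made up; a real system would take log probabilities from an actual language model), perplexity is just the exponential of the average negative log-likelihood of the tokens:

import math

def sentence_perplexity(token_probs):
    # token_probs[i] = P(w_i | w_1 .. w_{i-1}) under some language model
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# a fluent caption gets high per-token probabilities -> low perplexity (good)
print(sentence_perplexity([0.4, 0.3, 0.5, 0.2]))      # ~3.0
# an awkward caption gets low per-token probabilities -> high perplexity (bad)
print(sentence_perplexity([0.05, 0.1, 0.02, 0.01]))   # ~31.6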



BLEU: the idea is that the closer a machine-generated sentence is to professional human translations (here, the reference captions), the better; the overlap is measured with clipped n-gram matches. The higher the score, the better. (A small sketch of the n-gram matching idea follows below.)

ROUGE:

METEOR:

CIDEr:
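Here is a minimal sketch of the core of BLEU, the clipped (modified) n-gram precision. It leaves out the brevity penalty and the geometric mean over n = 1..4 that full BLEU adds, and the example sentences are made up:

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # clipped n-gram precision: how many of the candidate's n-grams also
    # appear in some reference, with counts clipped so that repeating a
    # matching word over and over cannot inflate the score
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

refs = [["a", "man", "is", "riding", "a", "horse"],
        ["a", "person", "rides", "a", "horse"]]
good = ["a", "man", "riding", "a", "horse"]
bad = ["the", "dog", "sleeps", "on", "the", "grass"]

print(modified_precision(good, refs, 1))  # 1.0, every word appears in a reference
print(modified_precision(bad, refs, 1))   # 0.0, nothing overlaps the references
print(modified_precision(good, refs, 2))  # 0.75

Full BLEU-4 combines these precisions for n = 1 to 4 with a geometric mean and multiplies by a brevity penalty that punishes candidates shorter than the references; Smoothed BLEU additionally smooths the counts so one missing n-gram order doesn't zero out the whole score.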


I'll find some time to finish these later... There are still things to do today, so that's it for now, hahaha.


The content of this article is based on:

https://zhuanlan.zhihu.com/p/22408033?utm_medium=social&utm_source=wechat_session  

