Deep Learning: Evaluation Metrics for Language Models

http://blog.csdn.net/pipisorry/article/details/78677580

Language model evaluation is mainly about measuring closeness, i.e., how closely the generated language approximates real (human-produced) language.

Classification accuracy

Classification accuracy provides additional information about the power of a model, even if the model is not designed explicitly for text classification.
[Jaech, et al. "Low-Rank RNN Adaptation for Context-Aware Language Modeling." arXiv 2017]

Perplexity

Perplexity is the geometric mean of the inverse probability assigned to each predicted word.
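In formula form (this is just the standard way of writing that definition), for a test sequence of N words:

\mathrm{PPL}(w_1,\dots,w_N) = \Big( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1,\dots,w_{i-1})} \Big)^{1/N} = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1,\dots,w_{i-1}) \Big)

so lower perplexity means the model assigns higher probability to the test text.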

Test-set perplexity is the most widely accepted method for evaluating language models, both for recognition/translation applications and for generation. Its advantage is that it is easy to measure and widely used as a criterion for model fit; its limitation is that it is not directly matched to most of the tasks language models are actually used for.

For a fair comparison, when computing perplexity with the 5-gram LM, exclude from consideration all test words marked as <unk> (i.e., low-count or out-of-vocabulary words).
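A minimal sketch of this computation, assuming we already have the model's conditional probability for each test token (the names perplexity and token_probs are illustrative, not from any particular library):

import math

def perplexity(tokens, token_probs, unk="<unk>"):
    # tokens:      the test-set words
    # token_probs: P(w_i | history) for each word, aligned with tokens
    log_sum, count = 0.0, 0
    for tok, p in zip(tokens, token_probs):
        if tok == unk:  # skip <unk> words (low-count / OOV), as described above
            continue
        log_sum += math.log(p)
        count += 1
    if count == 0:
        return float("inf")
    return math.exp(-log_sum / count)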

METEOR

[S. Banerjee and A. Lavie, “Meteor: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization]



ROUGE

ROUGE is a recall-oriented measure widely used in the summarization literature: it measures n-gram recall between the candidate text and the reference text(s).
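The formula the next sentence refers to (presumably what the original post displayed here) is the standard ROUGE-N definition from Lin (2004):

\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}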

where Count_match denotes the number of n-grams co-occurring in the candidate and the reference(s).

The variants most commonly reported are ROUGE-1, ROUGE-2, and ROUGE-W (based on weighted longest common subsequence).

[C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Proc. ACL Workshop Text Summarization Branches Out, 2004]

BLEU

BLEU is a form of precision of word n-grams between generated and reference sentences. Purely measuring recall would inappropriately reward long outputs; BLEU is designed to address this issue by emphasizing precision.

The n-gram precision scores are given by:
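The formula omitted here is presumably the standard modified (clipped) n-gram precision from Papineni et al. (2002):

p_n = \frac{\sum_{C \in \mathrm{Candidates}} \sum_{\mathrm{gram}_n \in C} \mathrm{Count}_{\mathrm{clip}}(\mathrm{gram}_n)}{\sum_{C \in \mathrm{Candidates}} \sum_{\mathrm{gram}_n \in C} \mathrm{Count}(\mathrm{gram}_n)}

where Count_clip truncates each candidate n-gram's count at the maximum number of times it appears in any single reference.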


BLEU then combines the average logarithm of the precision scores (i.e., their geometric mean) with a length penalty: the brevity penalty, which penalizes candidates shorter than the references (overly long candidates are already penalized by the precision terms themselves).
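In Papineni et al.'s notation, with candidate length c and effective reference length r:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad \mathrm{BP} = \begin{cases} 1 & c > r \\ e^{1 - r/c} & c \le r \end{cases}

with uniform weights w_n = 1/N in the standard setting.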

Most previous work reports BLEU-1, i.e., precision is computed only at the unigram level, whereas BLEU-n is a geometric average of the precisions over 1- to n-grams.
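A minimal sentence-level sketch of this computation (corpus-level BLEU instead sums the clipped counts over all sentences before dividing; the function names here are illustrative):

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    cand = Counter(ngrams(candidate, n))
    if not cand:
        return 0.0
    # clip each candidate n-gram count at its maximum count in any single reference
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def bleu_n(candidate, references, n=4):
    if not candidate:
        return 0.0
    precisions = [clipped_precision(candidate, references, k) for k in range(1, n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    r = len(min(references, key=lambda ref: abs(len(ref) - c)))  # closest reference length
    bp = 1.0 if c > r else math.exp(1.0 - r / c)                 # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

BLEU-1 in the sense above corresponds to calling bleu_n with n=1.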

[K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” ACL2002]



Coherence Evaluation

Neither BLEU nor ROUGE attempts to evaluate true coherence. There is no generally accepted and readily available coherence evaluation metric.
In practice, simple approximations are used instead, such as the number of overlapping tokens or topic-distribution similarity (e.g., Yan et al., 2011a; Yan et al., 2011b; Celikyilmaz and Hakkani-Tür, 2011); see the sketch after the reference below.
[Li, Jiwei, et al. "A hierarchical neural autoencoder for paragraphs and documents." ACL2015]
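As an illustration only, here is one way the token-overlap approximation could look; this Jaccard-style formulation is an assumption of mine, not the exact metric used in the cited papers:

def token_overlap(sent_a, sent_b):
    # Jaccard overlap between the token sets of two sentences
    a, b = set(sent_a), set(sent_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def coherence_score(sentences):
    # average overlap between adjacent sentences of the generated text
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(token_overlap(a, b) for a, b in pairs) / len(pairs)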

CIDEr

CIDEr measures the consistency between n-gram occurrences in generated and reference sentences, where each n-gram is weighted by its saliency and rarity (via TF-IDF).
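Concretely, in Vedantam et al.'s formulation, each sentence is represented by a vector g^n(\cdot) of TF-IDF weights over its n-grams, and for a candidate c_i with reference set S_i = \{s_{i1},\dots,s_{im}\}:

\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j=1}^{m} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}, \qquad \mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i)

with uniform weights w_n = 1/N and N = 4 in the paper.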

For a discussion of the shortcomings of the different evaluation methods, see also [Vedantam, R., Lawrence Zitnick, C., & Parikh, D., "CIDEr: Consensus-based image description evaluation," CVPR 2015].


Human Evaluation

Ask raters to give a subjective score to each generated output.

Amazon Mechanical Turk is typically used to collect these ratings.

For example, for image captioning, one can follow the guidelines proposed in [M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," J. Artif. Intell. Res., 2013],

or the setup used in [Jaech et al., "Low-Rank RNN Adaptation for Context-Aware Language Modeling," arXiv 2017].



