Belebele: Meta AI's multilingual reading comprehension dataset

Meta AI announced Belebele, a multilingual reading comprehension dataset covering 122 language variants. The team writes: "We hope this work will spark new discussions around multilingualism in LLMs."

Belebele is the first parallel dataset of its kind, enabling direct comparison of model performance across all of its languages. The dataset spans 29 scripts and 27 language families, mixing high-, medium-, and low-resource languages. Additionally, seven languages are included in two different scripts, yielding the first NLP benchmark for romanized variants of Hindi, Urdu, Bengali, Nepali, and Sinhala.

The dataset enables evaluation of both monolingual and multilingual models, and its fully parallel design also supports evaluating cross-lingual text representations. By assembling training sets from related question-answering (QA) datasets, the task can additionally be evaluated in a fine-tuned setting. Each question is based on a short passage from the FLORES-200 dataset and comes with four multiple-choice answers, carefully constructed to discriminate between models with different levels of general language understanding. A minimal loading sketch follows the list below.

  • 900 questions per language
  • 488 distinct passages, each with 1-2 related questions
  • 4 multiple-choice answers per question, exactly one of which is correct
  • 122 languages/language variants (including English)
  • 900 × 122 = 109,800 questions in total
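The dataset is distributed on the Hugging Face Hub. The sketch below shows one way to load and inspect a single language variant with the `datasets` library; the repository id facebook/belebele, the config name eng_Latn, the test split, and the field names (flores_passage, question, mc_answer1-4, correct_answer_num) reflect the public release at the time of writing and should be verified against the dataset card.

```python
# Minimal sketch: load one Belebele language variant and print a sample.
# Repository id, split name, and field names are assumptions; check the
# dataset card on the Hugging Face Hub if they have changed.
from datasets import load_dataset

# Each config is one language variant, named by its FLORES-200 code.
data = load_dataset("facebook/belebele", "eng_Latn", split="test")

sample = data[0]
print(sample["flores_passage"])   # short passage from FLORES-200
print(sample["question"])         # the comprehension question
for i in range(1, 5):
    print(f"{i}. {sample[f'mc_answer{i}']}")     # four candidate answers
print("correct:", sample["correct_answer_num"])  # "1"-"4", one correct
```

Because every language variant contains the same 900 questions in parallel, swapping the config name is all it takes to evaluate the identical task in another language.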

The researchers used this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). The results show that although English-centric LLMs exhibit significant cross-lingual transfer, smaller MLMs pretrained on balanced multilingual data still understand far more languages. Performance on low-resource languages also improves with larger, more deliberately constructed vocabularies.
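As an illustration of how such an evaluation can be run, the sketch below scores each of the four answers by its average token log-likelihood under a causal LM and picks the highest-scoring option. This is one common zero-shot multiple-choice recipe, not necessarily the paper's exact prompt format; the model name and prompt template here are placeholders.

```python
# Minimal sketch: zero-shot multiple-choice scoring with a causal LM.
# Each answer is scored by its average token log-probability given the
# passage and question; the best-scoring option is the prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_logprob(passage: str, question: str, answer: str) -> float:
    """Average log-probability of the answer tokens given the context."""
    context = f"Passage: {passage}\nQuestion: {question}\nAnswer:"
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    # Assumes the context tokenization is a prefix of the full tokenization.
    full_ids = tokenizer(context + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at position p predict the token at position p + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    token_lps = [log_probs[p, full_ids[0, p + 1]] for p in answer_positions]
    return torch.stack(token_lps).mean().item()

def predict(sample: dict) -> int:
    """Return the 1-indexed answer choice with the highest score."""
    scores = [
        answer_logprob(sample["flores_passage"], sample["question"],
                       sample[f"mc_answer{i}"])
        for i in range(1, 5)
    ]
    return scores.index(max(scores)) + 1
```

Accuracy for a language is then the fraction of questions where predict() matches correct_answer_num; since every language shares the same 900 questions, per-language scores are directly comparable.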

Reference: https://arxiv.org/abs/2308.16884
