An Introduction to the XNLI Cross-Lingual Evaluation Dataset

Table of Contents

I. Introduction

II. XNLI Overview

III. Evaluation Tasks

IV. Experiments

I. Introduction

For research on transfer learning for low-resource languages and on cross-lingual understanding, a standard evaluation dataset is indispensable. In 2018, Facebook proposed XNLI (Cross-Lingual Natural Language Inference), a dataset designed to provide a unified evaluation benchmark to facilitate such research. NLI, i.e., textual entailment, is an important benchmark task in natural language understanding (NLU): given two sentences, the task is to determine whether their relationship is entailment, contradiction, or neutral. In the paper, Facebook also proposed several baselines, including machine-translation approaches as well as bag-of-words and LSTM encoders. For more details on XNLI, refer to the Facebook paper: XNLI: Evaluating Cross-lingual Sentence Representations.
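To make the three relations concrete, here is a toy illustration (the premise/hypothesis pairs are invented for this post, not taken from the dataset):

```python
# Toy premise/hypothesis pairs, one per NLI label (invented for illustration).
examples = [
    ("A man is playing a guitar on stage.", "A man is playing music.",    "entailment"),
    ("A man is playing a guitar on stage.", "The man is sound asleep.",   "contradiction"),
    ("A man is playing a guitar on stage.", "The concert is sold out.",   "neutral"),
]
for premise, hypothesis, label in examples:
    print(f"{label:13s} | premise: {premise} | hypothesis: {hypothesis}")
```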

II. XNLI Overview

  • Obtaining the source dataset:

XNLI supports 15 languages, and the dataset covers 10 genres: Face-To-Face, Telephone, Government, 9/11, Letters, Oxford University Press (OUP), Slate, Verbatim, Travel, and Fiction. The first nine are drawn from the Open American National Corpus; Fiction comes from the English novel "Captain Blood". Each genre contains 750 samples, so the manually annotated English test data totals 7,500 samples across the 10 genres, which, after translation, yields 112,500 annotated English-to-other-language pairs. Each sample consists of two sentences, a premise and a hypothesis, whose relationship is one of entailment, contradiction, or neutral. During annotation, the XNLI developers used a careful voting scheme to keep the labels as unbiased as possible.
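For readers who want to poke at the data, here is a minimal loading sketch. It assumes the copy of XNLI hosted on the Hugging Face hub; the "xnli" config names, the premise/hypothesis/label fields, and the 0/1/2 label order are that hub's conventions, not something defined in this post:

```python
from datasets import load_dataset

# Assumes the Hugging Face "xnli" dataset; config name "fr" selects French.
xnli_fr = load_dataset("xnli", "fr")

# Assumed label order on the hub: 0=entailment, 1=neutral, 2=contradiction.
label_names = ["entailment", "neutral", "contradiction"]

sample = xnli_fr["test"][0]
print("premise:   ", sample["premise"])
print("hypothesis:", sample["hypothesis"])
print("label:     ", label_names[sample["label"]])
```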

  • Obtaining the target datasets:

The target datasets are obtained by translating the English dataset into each corresponding language. This raises a concern: after English sentences are translated into a target language, the correspondence between the sentences might change. Experiments showed, however, that the semantic relationship between the sentence pairs is, on the whole, preserved across languages. Some sample data from XNLI is shown in Figure 1.

III. Evaluation Tasks

1. Translation-based methods

Baseline-1: TRANSLATE TRAIN. Translate the English training set into the target language, and train the model on the translated data;

Baseline-2: TRANSLATE TEST. During the testing phase, translate the target-language test set into the language used in the training phase, and evaluate on the trained model;
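The data flow of these two baselines can be summarized in a short sketch. The translate/train/evaluate helpers below are hypothetical stand-ins, used only to make the two pipelines explicit:

```python
# translate(texts, src, tgt), train(dataset), and evaluate(model, dataset)
# are hypothetical helpers, used only to illustrate the two pipelines.

def translate_train(en_train, tgt_test, tgt="de"):
    # Baseline-1: move the TRAINING data into the target language,
    # then train and test entirely in the target language.
    tgt_train = translate(en_train, src="en", tgt=tgt)
    model = train(tgt_train)
    return evaluate(model, tgt_test)

def translate_test(en_train, tgt_test, tgt="de"):
    # Baseline-2: train in English; at TEST time, translate the
    # target-language test set back into English.
    model = train(en_train)
    en_test = translate(tgt_test, src=tgt, tgt="en")
    return evaluate(model, en_test)
```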

2. Methods based on cross-lingual sentence encoders

While the first family of evaluation methods is based on translation, the second is based on unified, language-independent embeddings. Building on this idea, the authors propose two types of cross-lingual sentence encoders:

Baseline-3: X-CBOW. A pre-trained, unified multilingual sentence-level representation, obtained CBOW-style by averaging the word vectors of the sentence;

Baseline-4: X-BiLSTM. A BiLSTM encoder trained on a multilingual corpus. For this baseline, the authors propose two ways to extract a sentence feature from the hidden units: use the final hidden states of the forward and backward passes as the representation, or take the element-wise maximum over the hidden states. The two variants are referred to as X-BiLSTM-last and X-BiLSTM-max, respectively;
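As a concrete sketch of the two pooling strategies, here is a minimal PyTorch module written for this post (not Facebook's actual implementation; the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Minimal sentence encoder illustrating X-BiLSTM-last vs. X-BiLSTM-max."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, pooling="max"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.pooling = pooling

    def forward(self, token_ids):
        # out: (batch, seq_len, 2*hidden); h_n: (2, batch, hidden)
        out, (h_n, _) = self.lstm(self.embed(token_ids))
        if self.pooling == "max":
            # X-BiLSTM-max: element-wise maximum over the time dimension.
            return out.max(dim=1).values
        # X-BiLSTM-last: concatenate the final hidden states of the
        # forward and backward passes.
        return torch.cat([h_n[0], h_n[1]], dim=-1)

# Toy usage: a batch of 8 sentences of 20 (random) token ids each.
enc = BiLSTMEncoder(vocab_size=50000, pooling="max")
sent = torch.randint(0, 50000, (8, 20))
print(enc(sent).shape)  # torch.Size([8, 1024])
```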

These baselines involve a number of other important concepts, described as follows:

  • Multilingual Word Embeddings

This paper mainly studies the cross-lingual problem at the sentence level, whereas most previous work focused on the word level. The basic idea of word-level embedding alignment is to use a bilingual dictionary of n word pairs to learn a mapping between the embedding spaces of the two languages, as follows:

W^{\star}=\underset{W \in O_{d}(\mathbb{R})}{\operatorname{argmin}}\|W X-Y\|_{\mathrm{F}}=U V^{T}

Here, d is the embedding dimension, and X and Y are (d, n) matrices whose columns hold the dictionary word vectors in the two languages. The formula can be understood as minimizing the distance between the mapped dictionary embeddings, so that in the new embedding space, word vectors with the same meaning lie closer together. Applying the SVD to Y X^T yields U and V, from which the orthogonal matrix W that minimizes the embedding distance is obtained:

U \Sigma V^{T}=\operatorname{SVD}\left(Y X^{T}\right)
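A minimal NumPy sketch of this closed-form solution (written for this post; it assumes X and Y are already column-aligned dictionary matrices):

```python
import numpy as np

def procrustes_align(X, Y):
    """Solve W* = argmin_{W orthogonal} ||W X - Y||_F via the SVD of Y X^T.

    X, Y: (d, n) matrices whose i-th columns are the embeddings of the
    i-th dictionary pair in the source and target language.
    Returns the (d, d) orthogonal mapping W = U V^T.
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy usage with random vectors standing in for real embeddings.
d, n = 300, 5000
X, Y = np.random.randn(d, n), np.random.randn(d, n)
W = procrustes_align(X, Y)
mapped = W @ X  # source embeddings mapped into the target space
```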

For more studies on cross-lingual embeddings, refer to the articles by Mikel Artetxe (homepage: http://www.mikelartetxe.com/ ). Over the past few years, his research on cross-lingual embeddings, moving from supervised to unsupervised settings, has produced a number of top-venue papers, and the accompanying open-source code ( https://github.com/artetxem/vecmap ) is simple to reuse.

  • Aligning Sentence Embeddings

That is, sentence-embedding alignment. The authors first pre-train a good encoder on English, then minimize the following loss function to obtain an encoder for the target language:

\mathcal{L}_{\text {align }}(x, y)=\operatorname{sim}(x, y)-\lambda\left(\operatorname{sim}\left(x_{c}, y\right)+\operatorname{sim}\left(x, y_{c}\right)\right)

Here, x and y denote the sentence embeddings of a translation pair in the two languages, and the similarity is computed using the L2 norm; my understanding is that the second term acts as a regularizer that makes the similarity computation more robust. x_c and y_c denote negative samples, and \lambda is the coefficient controlling the regularization term. The specific details of the alignment are illustrated in a figure in the original paper.
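A small PyTorch sketch of this loss, following the formula above (implementing sim as an L2 distance per the text is this post's reading of the paper, and the 0.25 default for lambda is an arbitrary placeholder):

```python
import torch

def l2_dist(a, b):
    # The text says the similarity is computed with the L2 norm,
    # so "sim" here is implemented as an L2 distance.
    return (a - b).norm(p=2, dim=-1)

def align_loss(x, y, x_c, y_c, lam=0.25):
    """L_align(x, y) = dist(x, y) - lam * (dist(x_c, y) + dist(x, y_c)).

    x, y:     (batch, dim) embeddings of translation pairs.
    x_c, y_c: (batch, dim) negative samples from each language.
    lam:      coefficient of the regularizing negative terms (placeholder).
    """
    return (l2_dist(x, y) - lam * (l2_dist(x_c, y) + l2_dist(x, y_c))).mean()

# Toy usage with random embeddings.
x, y, x_c, y_c = (torch.randn(32, 512) for _ in range(4))
print(align_loss(x, y, x_c, y_c))
```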

IV. Experiments

From the experimental results on each baseline (reported in tables in the paper), the following main conclusions can be drawn:

  • Among the translation-based methods, TRANSLATE TEST works better than TRANSLATE TRAIN: translating the target-language test set into the language the model was trained on beats translating the training set into the target language and retraining;
  • Across the baselines, using the element-wise maximum of the BiLSTM hidden states as the feature works better than using the last hidden state;
  • The translation-based methods outperform the methods based on cross-lingual representations, but in some cross-lingual tasks the cost of real-time translation is high, so the cross-lingual representation methods offer an alternative solution.

That's it for this introduction to XNLI. If you have any questions, follow the public account and leave a message; let's discuss and make progress together ~

 
