Paper Translation: Deep contextualized word representations

Abstract

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).

Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.

We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis.

We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

1 Introduction

Pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component in many neural language understanding models.

However, learning high quality representations can be challenging.

They should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).

Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence.

We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus.
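
For reference, the coupled objective referred to here is spelled out in Sec. 3.1 of the paper: the forward and backward LSTMs share the token representation and softmax parameters, and training jointly maximizes the log-likelihood of both directions,

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

where \Theta_x are the token representation parameters and \Theta_s the softmax parameters, both tied across the two directions.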

For this reason, we call them ELMo (Embeddings from Language Models) representations.

Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM.

More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer.
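
Concretely, the task-specific combination described here takes the form given in Sec. 3.2 of the paper:

\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, \mathbf{h}_{k,j}^{LM}

where \mathbf{h}_{k,j}^{LM} is the biLM's layer-j representation of token k (j = 0 being the context-insensitive token layer), the weights s^{task} are softmax-normalized, and \gamma^{task} is a scalar that lets the task model rescale the whole ELMo vector.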

Combining the internal states in this manner allows for very rich word representations.

Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging).
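
As a rough illustration of this kind of intrinsic probe (not the paper's actual evaluation setup), one can freeze the biLM, treat a single layer's activations as fixed features, and fit a plain linear classifier on top; the file names below are hypothetical placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical precomputed features: one biLM layer vector per token,
# paired with gold labels (POS tags for the lower layer, senses for the upper one).
features = np.load("bilm_layer1_token_vectors.npy")   # shape: (n_tokens, 1024)
labels = np.load("pos_tag_ids.npy")                   # shape: (n_tokens,)

# The representations are used "without modification": only this linear
# probe is trained. In practice accuracy would be measured on held-out data.
probe = LogisticRegression(max_iter=1000)
probe.fit(features, labels)
print("probe accuracy:", probe.score(features, labels))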

Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

Extensive experiments demonstrate that ELMo representations work extremely well in practice.

We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis.

The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions.

For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder.

Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM.

Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.
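
As an illustration of how easily the released biLM can be queried, here is a minimal sketch using the AllenNLP toolkit, where the authors released ELMo; the Elmo and batch_to_ids helpers reflect that toolkit's public interface as I understand it, and the file paths are placeholders for the released option and weight files.

from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths; substitute the actual released biLM files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# Request one mixed representation per token; the biLM weights themselves stay fixed.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "bank", "raised", "interest", "rates"],
             ["She", "sat", "on", "the", "river", "bank"]]
character_ids = batch_to_ids(sentences)              # character ids per token
output = elmo(character_ids)
token_vectors = output["elmo_representations"][0]    # (batch, tokens, 1024)

# "bank" receives two different vectors because each token's representation
# is a function of its entire sentence.
print(token_vectors.shape)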

2 Related work

Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) are a standard component of most state-of-the-art NLP architectures, including for question answering (Liu et al., 2017), textual entailment (Chen et al., 2017) and semantic role labeling (He et al., 2017).

However, these approaches for learning word vectors only allow a single context-independent representation for each word.

Previously proposed methods overcome some of the shortcomings of traditional word vectors by either enriching them with subword information (e.g., Wieting et al., 2016; Bojanowski et al., 2017) or learning separate vectors for each word sense (e.g., Neelakantan et al., 2014).

Our approach also benefits from subword units through the use of character convolutions, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.
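
To make "character convolutions" concrete, here is a minimal PyTorch sketch of a character-level CNN token encoder; the sizes and the single filter width are illustrative only, not the paper's actual configuration (which uses several filter widths plus highway layers).

import torch
import torch.nn as nn

class CharCNNTokenEncoder(nn.Module):
    """Builds each token's vector from its characters, so subword information
    (prefixes, suffixes, morphology) is available even for unseen words."""

    def __init__(self, n_chars=262, char_dim=16, n_filters=128, width=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)

    def forward(self, char_ids):                       # (batch, tokens, max_chars)
        b, t, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * t, c))     # (b*t, chars, char_dim)
        x = self.conv(x.transpose(1, 2))               # (b*t, filters, chars)
        x, _ = x.max(dim=-1)                           # max-pool over character positions
        return x.view(b, t, -1)                        # one vector per token

encoder = CharCNNTokenEncoder()
dummy_chars = torch.randint(0, 262, (2, 7, 50))        # 2 sentences, 7 tokens, 50 chars each
print(encoder(dummy_chars).shape)                      # torch.Size([2, 7, 128])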

Other recent work has also focused on learning context-dependent representations.

context2vec (Melamud et al., 2016) uses a bidirectional Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) to encode the context around a pivot word.

Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017).

Both of these approaches benefit from large datasets, although the MT approach is limited by the size of parallel corpora.

In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences (Chelba et al., 2014).

We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.

Previous work has also shown that different layers of deep biRNNs encode different types of information.

For example, introducing multi-task syntactic supervision (e.g., part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing (Hashimoto et al., 2017) or CCG super tagging (Søgaard and Goldberg, 2016).

In an RNN-based encoder-decoder machine translation system, Belinkov et al. (2017) showed that the representations learned at the first layer in a 2-layer LSTM encoder are better at predicting POS tags than the second layer.

Finally, the top layer of an LSTM for encoding word context (Melamud et al., 2016) has been shown to learn representations of word sense.

We show that similar signals are also induced by the modified language model objective of our ELMo representations, and it can be very beneficial to learn models for downstream tasks that mix these different types of semi-supervision.

Dai and Le (2015) and Ramachandran et al. (2017) pretrain encoder-decoder pairs using language models and sequence autoencoders and then fine-tune with task-specific supervision.

In contrast, after pretraining the biLM with unlabeled data, we fix the weights and add additional task-specific model capacity, allowing us to leverage large, rich and universal biLM representations for cases where downstream training data size dictates a smaller supervised model.
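
A short sketch of the weight-freezing described here, in PyTorch; pretrained_bilm and the small task head below are stand-ins for whatever biLM and downstream model are actually used, not the paper's implementation.

import torch.nn as nn

def add_task_capacity(pretrained_bilm, bilm_dim=1024, n_labels=10):
    # 1) Fix the pretrained biLM weights: they receive no gradient updates.
    for param in pretrained_bilm.parameters():
        param.requires_grad = False

    # 2) Add task-specific capacity on top; only these new weights are trained,
    #    so a small supervised model can still exploit the large biLM.
    task_head = nn.Sequential(
        nn.Linear(bilm_dim, 256),
        nn.ReLU(),
        nn.Linear(256, n_labels),
    )
    return task_head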

3 ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section.

They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2).
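
A minimal sketch of such a linear function of the internal states (a learned "scalar mix" over the L+1 biLM layers), assuming the per-layer activations have already been computed; this mirrors the combination in Sec. 3.2 but is not the authors' implementation.

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Collapses a stack of layer activations into one vector per token
    using softmax-normalized weights s_j and a global scale gamma."""

    def __init__(self, num_layers=3):                  # token layer + 2 biLSTM layers
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers)) # one scalar weight per layer
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):                         # list of (batch, tokens, dim) tensors
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layers))
        return self.gamma * mixed

mix = ScalarMix()
layers = [torch.randn(2, 5, 1024) for _ in range(3)]   # fake biLM activations
print(mix(layers).shape)                               # torch.Size([2, 5, 1024])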

This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

Reposted from www.cnblogs.com/wwj99/p/12295999.html