Heavy! "Natural Language Processing (NLP) Courier" ACL - FaceBook (Context-Adaptive Attention span) && tree Transformer

Source: AINLPer WeChat official account
Editor: ShuYini
Proofreading: ShuYini
Date: 2019-8-24

Introduction

    Both papers in this issue are mainly related to Attention. The first is an adaptive attention span algorithm for the Transformer proposed by the FaceBook AI team; it can significantly extend the context span the Transformer can use. The second presents a new Tree Transformer model, which only needs to traverse trees recursively with the attention mechanism to capture phrase syntax for constituency trees and word dependencies for dependency trees.

PS: You are welcome to follow the AINLPer WeChat official account; paper readings are updated daily, so stay tuned.

First Blood

TITLE: Adaptive Attention Span in Transformers
Contributor: FaceBook AI Research
Paper: https://www.aclweb.org/anthology/P19-1032
Code: None

Abstract

    This paper proposes a new self-attention mechanism that can learn its optimal attention span. It allows the context span used by the Transformer to be significantly extended while keeping memory consumption and computation time under control. The effectiveness of the method is verified on character-level language modeling tasks: using a maximum context of 8k characters, state-of-the-art performance is achieved on text8 and enwik8.

Model Summary

    The paper proposes an alternative to the self-attention layer in order to reduce the Transformer's computational burden. The network is built from layers that each learn their own optimal context size, so that each attention layer gathers information over its own context. In practice, a small Transformer is observed to use very small contexts in the lower layers and very large contexts in the final layers. With this modification, the input sequence can be extended to more than 8k tokens without losing performance and without increasing computation or memory cost. The proposed method is validated on character-level language modeling tasks, and the results show state-of-the-art performance.

Model Introduction

Sequential Transformer network

The goal of language modeling is to assign a probability to a sequence of tokens $(w_1, ..., w_T)$.
    The core mechanism of a Transformer layer is self-attention, which consists of multiple attention heads working in parallel. Each attention head applies the attention mechanism of Bahdanau et al. (2015) to its own input: given a token at time step t, it first computes similarities with the past tokens, then turns these similarities into attention weights with a softmax function, and finally outputs a vector $y_{out}$ that is the weighted average of the past representations.
    If you are not yet very familiar with Attention, you can take a look at my earlier article on the attention mechanism: "Natural Language Processing (NLP)" sincerely recommended: one article to understand the attention mechanism (Attention).
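    To make the similarity → softmax → weighted-average steps above concrete, here is a minimal NumPy sketch of a single attention head over past tokens. It is only an illustration of the mechanism, not FaceBook's implementation; the projection matrices W_q, W_k, W_v, the scaling by √d, and the toy dimensions are my own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_head(x_t, past, W_q, W_k, W_v):
    # x_t: (d,) current token; past: (t, d) representations of past tokens.
    q = x_t @ W_q                        # query for the current position
    K = past @ W_k                       # keys of the past positions
    V = past @ W_v                       # values of the past positions
    sims = K @ q / np.sqrt(q.shape[0])   # similarities with the past
    a = softmax(sims)                    # attention weights via softmax
    return a @ V                         # y_out: weighted average of the past

# Toy usage
d, t = 8, 5
rng = np.random.default_rng(0)
x_t, past = rng.normal(size=d), rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
y_out = attention_head(x_t, past, W_q, W_k, W_v)   # (d,) output vector
```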

Adaptive attention span

    In the Transformer, every attention head shares the same attention span S, which assumes that every head needs the same span to form its representation. As shown below, this assumption does not hold for character-level language modeling; this work therefore learns the attention span of each head independently, reducing the computational and memory overhead.
    As an extension, a dynamic variant is also considered, in which the attention span changes with the current input. At time step t, the attention span $z_t$ of a head is a function of the input parameterized by a vector v and a scalar b, for example $z_t = S\sigma(v^T x_t + b)$. $z_t$ is penalized in the same way as before, and v and b are learned together with the rest of the model's parameters.
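    The NumPy sketch below illustrates the idea under stated assumptions: a soft mask restricts a head's attention to a span z, and the dynamic variant computes $z_t = S\sigma(v^T x_t + b)$ from the current input. The ramp length, the renormalization step, and all variable names here are illustrative choices rather than the paper's actual code.

```python
import numpy as np

def soft_span_mask(distances, z, ramp=32.0):
    # 1 for positions well inside the span z, 0 beyond z + ramp,
    # linear in between (the ramp length used here is an example value).
    return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

def dynamic_span(x_t, v, b, S):
    # z_t = S * sigmoid(v^T x_t + b): span conditioned on the current input.
    return S / (1.0 + np.exp(-(v @ x_t + b)))

def span_masked_attention(scores, distances, z, ramp=32.0):
    # Multiply the exponentiated attention scores by the span mask, then
    # renormalize, so positions beyond the learned span contribute nothing.
    w = np.exp(scores - scores.max()) * soft_span_mask(distances, z, ramp)
    return w / (w.sum() + 1e-8)

# Toy usage: one head attending over 1000 past positions
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                     # raw similarity scores
distances = np.arange(1000, 0, -1).astype(float)   # distance to each past token
x_t, v = rng.normal(size=16), rng.normal(size=16)
z_t = dynamic_span(x_t, v, b=0.0, S=1000.0)        # input-dependent span
weights = span_masked_attention(scores, distances, z_t)
```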

Experimental results

    [Table] Results on text8 character-level language modeling.
    [Table] Results on enwik8.
    [Figure] Adaptive attention span of each attention head in the 12-layer model.
    [Figure] Average dynamic attention span as a function of the input sequence.

Double Kill

TITLE: You Only Need Attention to Traverse Trees
Contributor: University of Western Ontario
Paper: https://www.aclweb.org/anthology/P19-1030
Code: None

Abstract

    For word sequences, models based purely on Attention have two main problems: 1) as sentence length grows, their memory consumption grows quadratically; 2) they cannot effectively capture and exploit semantic information. Recursive neural networks can extract good semantic information by traversing tree structures. To this end, we propose a Tree Transformer model, which only needs to traverse recursively with the attention mechanism to capture phrase syntax for constituency trees and word dependencies for dependency trees. Compared with the standard Transformer, LSTM-based models, and tree-structured LSTMs, the model achieves good results in evaluations on four tasks. Further studies are conducted to determine whether positional information is inherently encoded in the trees and which types of attention are suitable for recursive traversal.

Tree Transformer Model Introduction

    This paper proposes a new recursive neural network architecture built on a decomposable attention framework, called the Tree Transformer. The main idea is as follows: given a dependency tree or constituency tree structure, the task is to traverse each of its subtrees and infer a representation vector for its root. The model uses a composition function to transform a set of child representations into a single parent representation. The model architecture is shown in the figure below.
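    As a rough illustration of the recursive traversal and the composition function, here is a minimal NumPy sketch that composes child vectors into a parent vector with a single attention head and applies it bottom-up over a tree. The Tree class, the mean-child query, and the toy dimensions are hypothetical simplifications; the paper's composition function is a richer attention module.

```python
import numpy as np

class Tree:
    # Minimal tree node: a leaf stores a word id, an internal node its children.
    def __init__(self, word=None, children=()):
        self.word, self.children = word, list(children)
    def is_leaf(self):
        return not self.children

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compose(children, W_q, W_k, W_v):
    # children: (n, d) representations of one subtree's children.
    # Hypothetical single-head composition: the mean child acts as the query,
    # the children as keys/values, and the attended sum is the parent vector.
    q = children.mean(axis=0) @ W_q
    K, V = children @ W_k, children @ W_v
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def encode(tree, embed, params):
    # Recursive traversal: a leaf is its word embedding, an internal node is
    # the composition of its children; the root vector encodes the sentence.
    if tree.is_leaf():
        return embed[tree.word]
    kids = np.stack([encode(c, embed, params) for c in tree.children])
    return compose(kids, *params)

# Toy usage: the constituency-style tree ((w0 w1) w2)
d, vocab = 8, 16
rng = np.random.default_rng(0)
embed = rng.normal(size=(vocab, d))
params = [rng.normal(size=(d, d)) for _ in range(3)]
root = Tree(children=[Tree(children=[Tree(word=0), Tree(word=1)]), Tree(word=2)])
sentence_vec = encode(root, embed, params)   # (d,) root representation
```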

Experimental results

    [Table] Performance comparison of the Tree Transformer with several state-of-the-art sentence encoders.
    [Table] Effect of positional encoding.
    [Table] Results of different attention modules used as the composition function.

ACED

Attention

For more knowledge related to natural language processing, please follow the AINLPer WeChat official account; premium content will be delivered to you right away.

Origin blog.csdn.net/yinizhilianlove/article/details/100054799