Heavy! "Natural Language Processing (NLP) Courier" ACL - FaceBook (Context-Adaptive Attention span) && tree Transformer

Source: AINLPer WeChat official account
Editor: ShuYini
Proofreading: ShuYini
Date: 2019-8-24

Introduction

    Both papers in this issue are mainly related to Attention. The first is an adaptive attention span algorithm for the Transformer proposed by the FaceBook AI team; it can significantly extend the context span the Transformer can use. The second presents a new Tree Transformer model, which only needs to traverse trees recursively with the attention mechanism to capture phrase syntax for constituency trees and word dependencies for dependency trees.

PS: You are welcome to follow the AINLPer WeChat official account; paper readings are updated daily, so stay tuned.

First Blood

TITLE: Adaptive Attention Span in Transformers
Contributor: FaceBook AI Research
Paper: https://www.aclweb.org/anthology/P19-1032
Code: None

Abstract

    This paper proposes a new self-attention mechanism that can learn its optimal attention span. It allows the context span used by the Transformer to be significantly extended while keeping memory consumption and computation time under control. The effectiveness of the method is verified on character-level language modeling tasks: using a maximum context of 8k characters, state-of-the-art performance is achieved on text8 and enwik8.

Model Summary

    The paper proposes an alternative to the self-attention layer in order to reduce the Transformer's computational burden. The network is built from layers that each learn their own optimal context size, so that each attention layer gathers information over its own context. In practice, a small Transformer is observed to use very small contexts in the lower layers and very large contexts in the final layers. With this modification, the input sequence can be extended to more than 8k tokens without losing performance and without increasing computation or memory cost. The proposed method is validated on character-level language modeling tasks, and the results show state-of-the-art performance.

Model Introduction

Sequential Transformer network

The goal of language modeling is to assign a probability to a sequence of tokens $(w_1, ..., w_T)$.
    The core mechanism of a Transformer layer is self-attention, which consists of multiple attention heads working in parallel. Each attention head applies the attention mechanism of Bahdanau et al. (2015) to its own input: given a token at time step t, it first computes similarities with the past tokens, then turns these similarities into attention weights with a softmax function, and finally outputs a vector $y_{out}$ that is the weighted average of the past representations.
    If you are not yet very familiar with Attention, you can take a look at my earlier article on the attention mechanism: "Natural Language Processing (NLP)" sincerely recommended: one article to understand the attention mechanism (Attention).
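    To make the similarity → softmax → weighted-average steps above concrete, here is a minimal NumPy sketch of a single attention head over past tokens. It is only an illustration of the mechanism, not FaceBook's implementation; the projection matrices W_q, W_k, W_v, the scaling by √d, and the toy dimensions are my own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_head(x_t, past, W_q, W_k, W_v):
    # x_t: (d,) current token; past: (t, d) representations of past tokens.
    q = x_t @ W_q                        # query for the current position
    K = past @ W_k                       # keys of the past positions
    V = past @ W_v                       # values of the past positions
    sims = K @ q / np.sqrt(q.shape[0])   # similarities with the past
    a = softmax(sims)                    # attention weights via softmax
    return a @ V                         # y_out: weighted average of the past

# Toy usage
d, t = 8, 5
rng = np.random.default_rng(0)
x_t, past = rng.normal(size=d), rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
y_out = attention_head(x_t, past, W_q, W_k, W_v)   # (d,) output vector
```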

Adaptive attention span

    In the Transformer, every attention head shares the same attention span S, which assumes that every head needs the same span to form its representation. As shown below, this assumption does not hold for character-level language modeling; this work therefore learns the attention span of each head independently, reducing the computational and memory overhead.
    As an extension, a dynamic variant is also considered, in which the attention span changes with the current input. At time step t, the attention span $z_t$ of a head is a function of the input parameterized by a vector v and a scalar b, for example $z_t = S\sigma(v^T x_t + b)$. $z_t$ is penalized in the same way as before, and v and b are learned together with the rest of the model's parameters.
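    The NumPy sketch below illustrates the idea under stated assumptions: a soft mask restricts a head's attention to a span z, and the dynamic variant computes $z_t = S\sigma(v^T x_t + b)$ from the current input. The ramp length, the renormalization step, and all variable names here are illustrative choices rather than the paper's actual code.

```python
import numpy as np

def soft_span_mask(distances, z, ramp=32.0):
    # 1 for positions well inside the span z, 0 beyond z + ramp,
    # linear in between (the ramp length used here is an example value).
    return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

def dynamic_span(x_t, v, b, S):
    # z_t = S * sigmoid(v^T x_t + b): span conditioned on the current input.
    return S / (1.0 + np.exp(-(v @ x_t + b)))

def span_masked_attention(scores, distances, z, ramp=32.0):
    # Multiply the exponentiated attention scores by the span mask, then
    # renormalize, so positions beyond the learned span contribute nothing.
    w = np.exp(scores - scores.max()) * soft_span_mask(distances, z, ramp)
    return w / (w.sum() + 1e-8)

# Toy usage: one head attending over 1000 past positions
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                     # raw similarity scores
distances = np.arange(1000, 0, -1).astype(float)   # distance to each past token
x_t, v = rng.normal(size=16), rng.normal(size=16)
z_t = dynamic_span(x_t, v, b=0.0, S=1000.0)        # input-dependent span
weights = span_masked_attention(scores, distances, z_t)
```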

Experimental results

    [Table] Results on text8 character-level language modeling.
    [Table] Results on enwik8.
    [Figure] Adaptive attention span of each attention head in the 12-layer model.
    [Figure] Average dynamic attention span as a function of the input sequence.

Double Kill

TITLE: You Only Need Attention to Traverse Trees
Contributor: University of Western Ontario
Paper: https://www.aclweb.org/anthology/P19-1030
Code: None

Abstract

    For word sequences, models based purely on Attention have two main problems: 1) as sentence length grows, their memory consumption grows quadratically; 2) they cannot effectively capture and exploit semantic information. Recursive neural networks can extract good semantic information by traversing tree structures. To this end, we propose a Tree Transformer model, which only needs to traverse recursively with the attention mechanism to capture phrase syntax for constituency trees and word dependencies for dependency trees. Compared with the standard Transformer, LSTM-based models, and tree-structured LSTMs, the model achieves good results in evaluations on four tasks. Further studies are conducted to determine whether positional information is inherently encoded in the trees and which types of attention are suitable for recursive traversal.

Tree Transformer Model Introduction

    This paper proposes a new recursive neural network architecture built on a decomposable attention framework, called the Tree Transformer. The main idea is as follows: given a dependency tree or constituency tree structure, the task is to traverse each of its subtrees and infer a representation vector for its root. The model uses a composition function to transform a set of child representations into a single parent representation. The model architecture is shown in the figure below.
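    As a rough illustration of the recursive traversal and the composition function, here is a minimal NumPy sketch that composes child vectors into a parent vector with a single attention head and applies it bottom-up over a tree. The Tree class, the mean-child query, and the toy dimensions are hypothetical simplifications; the paper's composition function is a richer attention module.

```python
import numpy as np

class Tree:
    # Minimal tree node: a leaf stores a word id, an internal node its children.
    def __init__(self, word=None, children=()):
        self.word, self.children = word, list(children)
    def is_leaf(self):
        return not self.children

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compose(children, W_q, W_k, W_v):
    # children: (n, d) representations of one subtree's children.
    # Hypothetical single-head composition: the mean child acts as the query,
    # the children as keys/values, and the attended sum is the parent vector.
    q = children.mean(axis=0) @ W_q
    K, V = children @ W_k, children @ W_v
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def encode(tree, embed, params):
    # Recursive traversal: a leaf is its word embedding, an internal node is
    # the composition of its children; the root vector encodes the sentence.
    if tree.is_leaf():
        return embed[tree.word]
    kids = np.stack([encode(c, embed, params) for c in tree.children])
    return compose(kids, *params)

# Toy usage: the constituency-style tree ((w0 w1) w2)
d, vocab = 8, 16
rng = np.random.default_rng(0)
embed = rng.normal(size=(vocab, d))
params = [rng.normal(size=(d, d)) for _ in range(3)]
root = Tree(children=[Tree(children=[Tree(word=0), Tree(word=1)]), Tree(word=2)])
sentence_vec = encode(root, embed, params)   # (d,) root representation
```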

Experimental results

    [Table] Performance comparison of the Tree Transformer with several state-of-the-art sentence encoders.
    [Table] Effect of positional encoding.
    [Table] Results of different attention modules used as the composition function.

ACED

Attention

For more knowledge related to natural language processing, please follow the AINLPer WeChat official account; premium content will be delivered to you right away.

Origin blog.csdn.net/yinizhilianlove/article/details/100054799