Dependency-based syntax-aware word representations reading notes

[ yygq ]

No way? No way? Do I even deserve to read this paper???

[ Title ]

Dependency-based syntax-aware word representations
Its predecessor paper: Syntax-Enhanced Neural Machine Translation with Syntax-Aware Word Representations

[ Code address ]

https://github.com/zhangmeishan/DepSAWR

[ Background knowledge ]

What is heterogeneity:
What is homogeneity:
Conversion between dependency trees and constituency trees:
What is biaffine dependency parsing: https://arxiv.org/pdf/1611.01734.pdf
What is LAS:
What is bootstrap resampling:

[ Some questions ]

What are Tree-RNN and Tree-Linearization approaches?

1. Background and overview

1.0 Introduction

Aggregate rich dependency-syntax information by using the intermediate layers of a trained parser, instead of its single 1-best parse-tree output,
then concatenate that output with the regular word vectors.

1.1 Related research

The importance of syntactic information:
Dependency syntax shortens the distance between words in a sequence. For example, in the figure below, the distance between "future" and "boomers" drops from 6 in the linear order to 2 when going through the word "lot" along the dependency tree.
(figure: dependency tree of the example sentence)
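
To make the "distance" idea concrete, here is a tiny self-contained sketch (with a made-up sentence and tree, not the paper's example) comparing the linear distance between two words with their path length in a dependency tree:

```python
from collections import deque

def tree_distance(heads, i, j):
    """Path length between tokens i and j (1-based) in a dependency tree,
    where heads[k-1] is the 1-based head of token k (0 = artificial root)."""
    graph = {0: set()}
    for dep, head in enumerate(heads, start=1):
        graph.setdefault(dep, set()).add(head)
        graph.setdefault(head, set()).add(dep)
    dist = {i: 0}
    queue = deque([i])
    while queue:                       # plain BFS over the (undirected) tree
        node = queue.popleft()
        if node == j:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

# Toy sentence (not the paper's example): "the quick brown fox jumps over the lazy dog"
heads = [4, 4, 4, 5, 0, 5, 9, 9, 6]
print(abs(9 - 4))                      # linear distance fox..dog = 5
print(tree_distance(heads, 4, 9))      # tree distance fox..dog = 3
```
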
Statistical learning methods: (this part is completely beyond me)
(figure omitted)
Neural network methods:

  • Tree-RNN: although effective, these methods are seriously inefficient because of the heterogeneity of dependency trees (every sentence has a differently shaped tree).
    (figure omitted)
  • Tree-Linearization: these methods first linearize the hierarchical tree input into a sequence of symbols.
    (figure omitted)
    The shared drawback of both families is that only the parser's single best parse tree is modeled, which propagates parsing errors into downstream tasks.

1.2 Contributions

Proposes the following model: the input is fed directly into the encoder (fed how, exactly?); the output vectors get some light post-processing (normalization? a simple projection?) and are then concatenated with the word embeddings and passed on to the decoder. At the same time the syntax labels are also predicted and fed to the decoder (they are labeled data after all, why not use them?), and after that I'm not sure what happens. (figure omitted)
Tasks and base models:
  • sentence classification task: BiLSTM + dependency syntax + ELMo or BERT representations
  • sentence matching task: ESIM + dependency syntax + ELMo or BERT representations
  • sequence labeling task: BiLSTM-CRF + dependency syntax + ELMo or BERT representations
  • machine translation task: RNN-Search (?) / Transformer
  • reproduces Tree-RNN and Tree-Linearization as comparison baselines

1.3 Related work

1.3.1 Tree-RNN

Differences from earlier Tree-RNN work:

  • uses dependency trees instead of constituency trees
  • uses non-leaf compositions
  • defines an aggregation operation that supports recursive composition over any number of child nodes

Things to note:

  • x_i is the concatenation of the word embedding and the dependency-label embedding
  • what does {L, R} mean?
  • the output is the concatenation of the bottom-up state h↑ and the top-down state h↓

(figures omitted)
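
Since the equations live only in the (omitted) figures, here is a rough sketch of how I picture the bottom-up half of such a dependency Tree-RNN: each node's input x_i is composed with an aggregate of its children's states. The GRU cell and sum aggregation are my assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DepTreeRNNCell(nn.Module):
    """Sketch of one bottom-up composition step: the node input x_i
    (word embedding ++ dependency-label embedding) is combined with an
    aggregate of its children's hidden states."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.GRUCell(input_dim, hidden_dim)     # assumed composition fn
        self.hidden_dim = hidden_dim

    def forward(self, x, child_states):
        if child_states:                                  # any number of children
            agg = torch.stack(child_states).sum(dim=0)    # assumed: sum aggregation
        else:
            agg = torch.zeros(self.hidden_dim)            # leaf node
        return self.cell(x.unsqueeze(0), agg.unsqueeze(0)).squeeze(0)

def encode_bottom_up(cell, xs, children, root):
    """xs: per-node input vectors; children[i]: indices of node i's dependents;
    root: index of the root word. Returns the bottom-up state h_up per node."""
    h_up = [None] * len(xs)
    def visit(i):
        for c in children[i]:
            visit(c)
        h_up[i] = cell(xs[i], [h_up[c] for c in children[i]])
    visit(root)
    return h_up

# Toy 3-word tree: word 1 is the root, words 0 and 2 are its dependents.
cell = DepTreeRNNCell(input_dim=8, hidden_dim=16)
xs = [torch.randn(8) for _ in range(3)]
h_up = encode_bottom_up(cell, xs, children=[[], [0, 2], []], root=1)
# A top-down pass (h_down) would run analogously from the root to the leaves,
# and each node's final representation is the concatenation [h_up; h_down].
```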

The drawback of Tree-RNN is that batching is difficult because of the heterogeneity of dependency trees.

1.3.2 Tree linearization

Perform the following two steps:

  • Maintain a stack of processed words and a buffer of unprocessed words; the recorded shift-reduce operations form the symbol sequence
  • Feed the symbol sequence into a sequential LSTM

The process works as follows: starting from "This", each word is pushed onto the stack with an SH action; when a word that has its own left subtree is encountered, RL actions are applied from right to left to attach those left dependents. Once the last word has been read, we work backwards: for each word with a right subtree, RR actions are applied (from left to right?), and finally PR is added. (Did I actually understand all of this? Genuine admiration for Prof. Zhang and colleagues.)
(figure omitted)
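
My attempt to turn that description into code: a minimal arc-standard-style oracle that linearizes a projective dependency tree into SH / RL / RR / PR symbols. The exact transition definitions in the paper may differ; this is only to make the shift-reduce idea tangible.

```python
def linearize(words, heads):
    """words: tokens; heads[k-1]: 1-based head of token k (0 = root)."""
    n = len(words)
    pending = [0] * (n + 1)            # dependents each token still has to collect
    for h in heads:
        pending[h] += 1                # pending[0] counts the root attachment

    buffer = list(range(1, n + 1))     # 1-based token indices, left to right
    stack, symbols = [], []

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s2 = stack[-1], stack[-2]          # top and second-top
            if heads[s2 - 1] == s1:                # s2 is a left dependent of s1
                symbols.append("RL")
                stack.pop(-2)
                pending[s1] -= 1
                continue
            if heads[s1 - 1] == s2 and pending[s1] == 0:
                symbols.append("RR")               # s1 is a right dependent of s2
                stack.pop()
                pending[s2] -= 1
                continue
        if buffer:
            symbols.append(("SH", words[buffer[0] - 1]))
            stack.append(buffer.pop(0))
        else:
            break                                  # non-projective tree: oracle stuck
    symbols.append("PR")                           # pop the remaining root
    return symbols

# Toy tree for "This is a test": This -> is, is -> root, a -> test, test -> is
print(linearize(["This", "is", "a", "test"], [2, 0, 4, 2]))
# [('SH', 'This'), ('SH', 'is'), 'RL', ('SH', 'a'), ('SH', 'test'), 'RL', 'RR', 'PR']
```
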
About the embeddings:

  • use ELMo or BERT to get the vector of the word corresponding to each SH operation
  • the other symbols have their own learned vector representations

Ah, the formal description of the model is as follows. What exactly does the model learn here, the embeddings of the symbols?
(figure: model equations, omitted)
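
A small sketch of how I imagine the symbol embedding works: SH symbols reuse the contextual vector of their word (e.g., from ELMo/BERT), the remaining symbols get their own learned embeddings, and the whole sequence goes through an LSTM. The shared dimensionality and the bidirectional LSTM on top are my assumptions.

```python
import torch
import torch.nn as nn

class LinearizedTreeEncoder(nn.Module):
    """Sketch: embed the linearized symbol sequence and run an LSTM over it."""
    def __init__(self, sym_dim=1024, hidden_dim=200):
        super().__init__()
        self.sym_emb = nn.Embedding(3, sym_dim)   # learned vectors for RL / RR / PR
        self.lstm = nn.LSTM(sym_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, symbols, word_vectors):
        # symbols: e.g. [("SH", 0), ("SH", 1), ("RL",), ...] where the integer
        # indexes into word_vectors of shape (num_words, sym_dim)
        sym_ids = {"RL": 0, "RR": 1, "PR": 2}
        vecs = []
        for s in symbols:
            if s[0] == "SH":
                vecs.append(word_vectors[s[1]])                     # contextual word vector
            else:
                vecs.append(self.sym_emb(torch.tensor(sym_ids[s[0]])))
        seq = torch.stack(vecs).unsqueeze(0)      # (1, num_symbols, sym_dim)
        out, _ = self.lstm(seq)
        return out

enc = LinearizedTreeEncoder()
word_vecs = torch.randn(4, 1024)                  # e.g. ELMo vectors for 4 words
out = enc([("SH", 0), ("SH", 1), ("RL",), ("SH", 2),
           ("SH", 3), ("RL",), ("RR",), ("PR",)], word_vecs)
print(out.shape)                                  # torch.Size([1, 8, 400])
```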

2. The model

2.1 BiAffine dependency parser

(figure omitted)
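
Since the figure is not reproduced, here is a compact PyTorch sketch of the biaffine arc scoring from the Dozat & Manning paper linked above. The layer sizes and initialization are placeholders, and the label scorer is omitted.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Sketch of biaffine arc scoring: s(i, j) = dep_i^T U head_j + head_j^T u,
    where dep/head are MLP-reduced views of each token's BiLSTM state."""
    def __init__(self, hidden_dim=800, arc_dim=500):
        super().__init__()
        self.mlp_dep = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        self.mlp_head = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) * 0.01)  # bilinear term
        self.u = nn.Parameter(torch.zeros(arc_dim))                  # head bias term

    def forward(self, h):                                # h: (batch, n, hidden_dim)
        dep = self.mlp_dep(h)                            # (batch, n, arc_dim)
        head = self.mlp_head(h)                          # (batch, n, arc_dim)
        bilinear = dep @ self.U @ head.transpose(1, 2)   # (batch, n, n)
        bias = (head @ self.u).unsqueeze(1)              # (batch, 1, n)
        return bilinear + bias                           # score of head j for dependent i

scores = BiaffineArcScorer()(torch.randn(2, 10, 800))
print(scores.shape)                                      # torch.Size([2, 10, 10])
```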

2.2 Dep-SAWR

Inside the dotted box there is no training objective (that part is frozen). The encoder input is the word sequence; passing through the BiLSTM and a linear layer gives o_encoder, which is then projected to obtain the syntax-aware word representation, and that is concatenated with the original embedding. But then where does the prior knowledge from the dependency parse tree itself actually go?
(figure: Dep-SAWR architecture, omitted)
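
My reading of the Dep-SAWR recipe in code form. This is a sketch: the layer names, sizes, and the frozen-parser assumption are mine, not necessarily the released implementation.

```python
import torch
import torch.nn as nn

class DepSAWREmbedding(nn.Module):
    """Sketch: take hidden states from a pre-trained (frozen) parser encoder,
    project them, and concatenate with the task's own word embeddings."""
    def __init__(self, parser_encoder, parser_dim=800, proj_dim=100,
                 vocab_size=30000, emb_dim=300):
        super().__init__()
        self.parser_encoder = parser_encoder          # e.g. the parser's BiLSTM
        for p in self.parser_encoder.parameters():
            p.requires_grad = False                   # no training objective here
        self.proj = nn.Linear(parser_dim, proj_dim)   # learned projection
        self.word_emb = nn.Embedding(vocab_size, emb_dim)

    def forward(self, word_ids, parser_inputs):
        with torch.no_grad():
            o_encoder = self.parser_encoder(parser_inputs)  # (b, n, parser_dim)
        syntax = self.proj(o_encoder)                       # (b, n, proj_dim)
        words = self.word_emb(word_ids)                     # (b, n, emb_dim)
        return torch.cat([syntax, words], dim=-1)           # syntax-aware input

# usage sketch (hypothetical names): emb = DepSAWREmbedding(pretrained_parser.bilstm)
#                                    x = emb(word_ids, parser_inputs)  # feed x to the task model
```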

3. Experiments and evaluation

3.1 Task 1: Sentence classification

word embedding + BiLSTM + pooling, where GloVe, BERT, and ELMo were tried for the word-embedding part
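
For concreteness, a rough sketch of such a classifier (the pooling type and dimensions are guesses):

```python
import torch
import torch.nn as nn

class BiLSTMPoolClassifier(nn.Module):
    """Sketch of the sentence-classification baseline: (possibly syntax-augmented)
    word vectors -> BiLSTM -> max pooling over time -> class logits."""
    def __init__(self, input_dim=400, hidden_dim=200, num_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):             # x: (batch, seq_len, input_dim)
        h, _ = self.bilstm(x)         # (batch, seq_len, 2 * hidden_dim)
        pooled, _ = h.max(dim=1)      # max pooling over the time dimension
        return self.out(pooled)       # class logits

logits = BiLSTMPoolClassifier()(torch.randn(4, 20, 400))
print(logits.shape)                   # torch.Size([4, 5])
```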

3.2 Task 2: Sentence matching

Take the ESIM model as an example:
(figure omitted)

3.3 Task 3: sequence labeling

Take BiLSTM + CRF as an example.
(figure omitted)

3.4 Task 4: machine translation

I don't really understand the machine translation part.
(figure omitted)

3.5 Other details

  • Experiments in three languages
    • English: GloVe-300 / ELMo / BERT
    • Chinese: embeddings pre-trained by the authors themselves???
    • German: FastText / ELMo / BERT
  • About BERT
    word-piece vectors are averaged to get the representation of each whole word (see the sketch after this list)
  • About the syntactic prior
    How is the BiAffine dependency parser pre-trained, and what is the LAS metric?
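
A tiny sketch of that word-piece averaging; the piece-to-word mapping would normally come from the tokenizer, but here it is hard-coded for illustration.

```python
import torch

def average_word_pieces(piece_vectors, piece_to_word):
    """BERT gives one vector per word piece; average the pieces that belong
    to the same word. piece_to_word[i] is the word index of piece i."""
    num_words = max(piece_to_word) + 1
    dim = piece_vectors.size(-1)
    sums = torch.zeros(num_words, dim)
    counts = torch.zeros(num_words, 1)
    for i, w in enumerate(piece_to_word):
        sums[w] += piece_vectors[i]
        counts[w] += 1
    return sums / counts                            # one vector per word

# e.g. "playing" -> ["play", "##ing"]: pieces 1 and 2 both map to word 1
vecs = torch.randn(3, 768)                          # toy piece vectors
print(average_word_pieces(vecs, [0, 1, 1]).shape)   # torch.Size([2, 768])
```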

4. Conclusion and personal summary

5. References

6. Extensions

Original post: blog.csdn.net/jokerxsy/article/details/114551688