[ Sarcastic aside ]
No way, no way, I actually get to read this paper???
[ Title ]
Dependency-based syntax-aware word representations
Its parent paper: Syntax-Enhanced Neural Machine Translation with Syntax-Aware Word Representations
[ Code address ]
https://github.com/zhangmeishan/DepSAWR
[ Knowledge Reserve ]
What is heterogeneity:
What is homogeneity:
The transformation between dependency trees and constituency trees:
What is BiAffine dependency parsing: https://arxiv.org/pdf/1611.01734.pdf
What is LAS (labeled attachment score): see the sketch right after this list
What is bootstrap resampling:
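Since LAS comes up again at the end of these notes, here is a minimal sketch of it (my own toy example, not from the paper): LAS is the percentage of tokens whose predicted head and dependency label are both correct, while UAS checks the head only.

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Compute UAS/LAS over flattened token lists of one or more parsed sentences."""
    total = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / total
    las = sum(gh == ph and gl == pl
              for gh, gl, ph, pl in zip(gold_heads, gold_labels,
                                        pred_heads, pred_labels)) / total
    return uas, las

# Hypothetical 4-token sentence; heads are token indices, 0 = root.
print(attachment_scores([2, 0, 4, 2], ["nsubj", "root", "det", "attr"],
                        [2, 0, 2, 2], ["nsubj", "root", "det", "attr"]))
# (0.75, 0.75)
```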
[ Some questions ]
What are Tree-RNN and Tree-Linearization approaches?
Table of contents
1. Background and overview
1.0 Introduction
Aggregate rich dependency-syntax information by using the intermediate layers of a trained parser, instead of the parser's single best output tree
Concatenate the vectors obtained above with the ordinary word embeddings (a sketch of this follows)
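A minimal sketch of that core idea, under my own assumptions (the dimensions, names, and the frozen-parser detail are mine, not lifted from the paper's code):

```python
import torch

# Hypothetical sizes: 6-word sentence, 100-dim task embeddings,
# 400-dim intermediate states from a pretrained dependency parser.
words = torch.randn(1, 6, 100)           # ordinary word embeddings (batch, len, dim)
parser_hidden = torch.randn(1, 6, 400)   # intermediate-layer output of the trained parser
proj = torch.nn.Linear(400, 100)         # project the parser states down

syntax_aware = torch.cat([words, proj(parser_hidden)], dim=-1)   # (1, 6, 200)
```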
1.1 Related research
The importance of syntactic information:
Dependency syntax can shorten the distance between words in a sequence. For example, in the figure below (not reproduced here), the distance between "future" and "boomers" drops from 6 in the linear word order (a path passing through the word "lot") to 2 in the dependency tree.
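To make "distance in the tree" concrete, here is a small sketch; the heads array is a hypothetical tree, not the sentence from the figure:

```python
def tree_distance(heads, i, j):
    """Number of edges between words i and j; heads[k] is k's head (1-indexed, 0 = root)."""
    def path_to_root(k):
        path = [k]
        while k != 0:
            k = heads[k]
            path.append(k)
        return path
    pi, pj = path_to_root(i), path_to_root(j)
    common = set(pi) & set(pj)   # common ancestors; the closest one is the LCA
    return min(pi.index(c) for c in common) + min(pj.index(c) for c in common)

heads = [0, 3, 3, 0, 3, 6, 4, 6]        # hypothetical 7-word tree, word 3 is the root
print(tree_distance(heads, 1, 7))        # 4 edges, versus 6 in linear order
```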
Statistical learning methods: (I honestly could not make sense of this part)
Neural network methods:
- Tree-RNN: although these methods are effective, they are seriously inefficient because of the heterogeneity of dependency trees (every sentence has a differently shaped tree)
- Tree-Linearization: these methods first linearize the hierarchical tree input into a sequence of symbols
The disadvantage of both neural approaches is that only the parser's single best parse tree is modeled, which propagates parsing errors into downstream tasks.
1.2 Contribution points
Proposed model: the input is fed directly into the Encoder (fed how, exactly?); the output vectors get normalized(?) or simply post-processed(?), are concatenated with the word embeddings, and are then passed on to the Decoder. Meanwhile the syntax labels have to be predicted, so they must also be fed to the Decoder (they are labeled data after all, why not?). Beyond that I'm not sure what happens.
- Sentence classification task: BiLSTM + dependency syntax + ELMo or BERT representations
- Sentence matching task: ESIM + dependency syntax + ELMo or BERT representations
- Sequence labeling task: BiLSTM-CRF + dependency syntax + ELMo or BERT representations
- Machine translation task: RNN-Search(?) / Transformer
- Reproduces Tree-RNN and Tree-Linearization for comparison
1.3 Related work
1.3.1 Tree-RNN
Differences from standard (constituency-based) Tree-RNNs:
- Uses dependency trees instead of constituency trees
- Uses non-leaf compositions
- Defines an aggregation operation that supports recursive combination of any number of child nodes
Things to note:
- x_i is a concatenation of the word embedding and the dependency label embedding
- What does {L, R} mean? (presumably it marks whether a child is a left or right dependent of its head)
- The output is the concatenation of h↑ (bottom-up) and h↓ (top-down)
The disadvantage of Tree-RNN is that batching is difficult, again because of the heterogeneity of dependency trees; a bottom-up sketch follows.
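Here is a minimal bottom-up half of such a Tree-RNN, as I understand it; the sum aggregation, tanh cell, and all sizes are my assumptions rather than the paper's exact equations (a top-down pass producing h↓ would mirror this):

```python
import torch

class BottomUpTreeRNN(torch.nn.Module):
    """Compose a node from its input x_i and any number of children, recursively."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.cell = torch.nn.Linear(in_dim + hid_dim, hid_dim)
        self.hid_dim = hid_dim

    def forward(self, node, x, children):
        # x[i]: word embedding concatenated with dependency-label embedding for word i.
        kids = children.get(node, [])
        # Aggregation over an arbitrary number of children: here a plain sum (assumption).
        agg = sum((self.forward(c, x, children) for c in kids),
                  torch.zeros(self.hid_dim))
        return torch.tanh(self.cell(torch.cat([x[node], agg])))

# Hypothetical 4-word tree: word 2 is the root; children maps head -> dependents.
x = {i: torch.randn(16) for i in range(1, 5)}
h_up_root = BottomUpTreeRNN(16, 32)(2, x, {2: [1, 4], 4: [3]})   # h↑ at the root
```

Because every sentence yields a different recursion shape, none of this batches cleanly, which is exactly the heterogeneity complaint above.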
1.3.2 Tree linearization
Perform the following two steps:
- Maintain a stack of processed words and a buffer (queue) of unprocessed words, and record the corresponding shift-reduce operations to obtain the symbol sequence
- Feed that sequence into a sequential LSTM
The process runs as follows: starting from "This", each word is pushed onto the stack by an SH action; when a word turns out to have its own left subtree, RL is applied repeatedly from right to left to build the arcs between it and its dependents; after the last word is reached we work backwards, and each word with its own right subtree gets RR applied in turn (from left to right?), with PR added at the very end. (Can everyone actually follow this? Serious respect to Prof. Zhang and co-authors.) A sketch of this procedure is given below.
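My reconstruction of that procedure as an arc-standard-style oracle for projective trees; the exact action semantics (RL attaches the lower stack item to the one above it, RR the reverse, PR pops the root) are my assumptions from the description above, not the authors' code:

```python
def linearize(heads):
    """heads[i] = head of word i (1-indexed, 0 = root); returns the shift-reduce symbols."""
    n = len(heads) - 1
    n_children = [0] * (n + 1)
    for d in range(1, n + 1):
        n_children[heads[d]] += 1
    attached = [0] * (n + 1)   # dependents of each word attached so far
    stack, buffer, actions = [], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s0, s1 = stack[-1], stack[-2]
            if heads[s1] == s0:                                     # left arc: s1 <- s0
                actions.append("RL"); stack.pop(-2); attached[s0] += 1; continue
            if heads[s0] == s1 and attached[s0] == n_children[s0]:  # right arc: s1 -> s0
                actions.append("RR"); stack.pop(); attached[s1] += 1; continue
        actions.append("SH")   # per the notes below, each SH is tied to the word it shifts
        stack.append(buffer.pop(0))
    actions.append("PR")       # pop the root
    return actions

# "This is a test": heads of (This, is, a, test) = (2, 0, 4, 2)
print(linearize([0, 2, 0, 4, 2]))
# ['SH', 'SH', 'RL', 'SH', 'SH', 'RL', 'RR', 'PR']
```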
About the embeddings:
- Use ELMo or BERT to obtain the vector representation of the word each SH operation shifts
- The other symbols (RL, RR, PR) have their own learned vector representations
Ah, and the formal description of the model follows. What does the model actually learn here, the embeddings of the symbols...?
2. The model
2.1 BiAffine dependency parser
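My notes for this subsection are empty, so here is a minimal sketch of the biaffine arc scorer from the Dozat & Manning paper linked above; the variable names and sizes are mine:

```python
import torch

def biaffine_arc_scores(h_dep, h_head, U, b):
    """scores[i, j] = h_dep[i] @ U @ h_head[j] + h_head[j] @ b  ("is word j the head of word i?")."""
    return h_dep @ U @ h_head.T + (h_head @ b).unsqueeze(0)

n, d = 6, 128                                          # hypothetical sentence length, MLP size
h_dep, h_head = torch.randn(n, d), torch.randn(n, d)   # two MLP views of the BiLSTM states
scores = biaffine_arc_scores(h_dep, h_head, torch.randn(d, d), torch.randn(d))
# scores is (n, n): softmax each row over candidate heads, train with cross-entropy.
```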
2.2 Dep-SAWR
Nothing inside the dotted box carries a training objective. The Encoder's input is the word sequence; passing it through the so-called BiLSTM and a Linear layer yields o_encoder, and projecting that gives the syntax-aware word representation, which is then concatenated with the original embedding. But then where did the prior knowledge from the dependency parse tree go in the first place?
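Here is that pipeline as I read it, a sketch under my assumptions, not the released DepSAWR code (and the presumable answer to the question above: the BiLSTM weights were trained inside the parser, so the syntax is baked into them):

```python
import torch

class DepSAWREncoder(torch.nn.Module):
    """Reuse a parser's pretrained BiLSTM, project its states, concat with task embeddings."""
    def __init__(self, parser_bilstm, hid_dim, proj_dim):
        super().__init__()
        self.bilstm = parser_bilstm                 # pretrained inside the parser, frozen
        for p in self.bilstm.parameters():
            p.requires_grad = False
        self.proj = torch.nn.Linear(hid_dim, proj_dim)

    def forward(self, parser_word_emb, task_word_emb):
        o_encoder, _ = self.bilstm(parser_word_emb)            # (batch, len, hid_dim)
        return torch.cat([task_word_emb, self.proj(o_encoder)], dim=-1)

parser_bilstm = torch.nn.LSTM(100, 200, batch_first=True, bidirectional=True)
enc = DepSAWREncoder(parser_bilstm, hid_dim=400, proj_dim=100)
out = enc(torch.randn(1, 6, 100), torch.randn(1, 6, 100))      # (1, 6, 200)
```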
3. Experiments and evaluation
3.1 Task 1: Sentence classification
Word embedding + BiLSTM + pooling, where GloVe, BERT, and ELMo were all tried for the word-embedding part; a sketch of this backbone follows.
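A minimal version of that backbone; the max-pooling choice and all sizes are my assumptions, and the Dep-SAWR vector from section 2.2 would be concatenated into emb:

```python
import torch

class BiLSTMClassifier(torch.nn.Module):
    """Word vectors -> BiLSTM -> pooling over time -> linear classifier."""
    def __init__(self, emb_dim=100, hid=128, n_classes=2):
        super().__init__()
        self.lstm = torch.nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)
        self.out = torch.nn.Linear(2 * hid, n_classes)

    def forward(self, emb):          # emb: (batch, len, emb_dim) from GloVe/ELMo/BERT
        h, _ = self.lstm(emb)        # (batch, len, 2 * hid)
        pooled, _ = h.max(dim=1)     # max-pooling over the time dimension (assumption)
        return self.out(pooled)

logits = BiLSTMClassifier()(torch.randn(4, 12, 100))   # (4, 2)
```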
3.2 Task 2: Sentence matching
Take the ESIM model as an example:
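For reference, a sketch of ESIM's core, the soft-alignment ("local inference") layer; the input/composition BiLSTMs, pooling, and classifier are omitted, and in this paper the syntax-aware vectors would be concatenated at the input:

```python
import torch

def esim_align(a, b):
    """Enhance each sentence with a soft summary of the other (ESIM local inference)."""
    e = a @ b.transpose(1, 2)                        # (batch, len_a, len_b) similarities
    a_tilde = torch.softmax(e, dim=2) @ b            # b, summarized for every a-token
    b_tilde = torch.softmax(e, dim=1).transpose(1, 2) @ a
    enhance = lambda x, y: torch.cat([x, y, x - y, x * y], dim=-1)
    return enhance(a, a_tilde), enhance(b, b_tilde)

a, b = torch.randn(2, 7, 300), torch.randn(2, 9, 300)   # BiLSTM-encoded premise/hypothesis
ma, mb = esim_align(a, b)                               # (2, 7, 1200), (2, 9, 1200)
```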
3.3 Task 3: Sequence labeling
Take BiLSTM-CRF as an example; a decoding sketch follows.
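The CRF side in one sketch: Viterbi decoding over the BiLSTM's emission scores and the CRF's transition matrix (random placeholder scores here; start/stop transitions omitted for brevity):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path: emissions (seq_len, n_tags), transitions[prev, cur] (n_tags, n_tags)."""
    n, t = emissions.shape
    score, back = emissions[0].copy(), np.zeros((n, t), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + transitions + emissions[i][None, :]   # (prev, cur)
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

print(viterbi(np.random.randn(5, 3), np.random.randn(3, 3)))   # e.g. [2, 0, 1, 1, 0]
```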
3.4 Task 4: Machine translation
I don't really understand the machine translation part.
3.5 Other details
- Experiments in three languages:
  - English: GloVe-300 / ELMo / BERT
  - Chinese: embeddings the authors pretrained themselves???
  - German: FastText / ELMo / BERT
- About BERT: average the wordpiece vectors to get the vector representation of a word (sketch below)
- About the prior: how is the BiAffine dependency parser pretrained, and what is the LAS metric? (LAS is answered in the sketch near the top of these notes)
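A minimal sketch of that wordpiece averaging (the alignment list is hypothetical):

```python
import torch

def average_wordpieces(piece_vecs, word_to_pieces):
    """piece_vecs: (n_pieces, dim) from BERT; word_to_pieces: piece indices per word."""
    return torch.stack([piece_vecs[idx].mean(dim=0) for idx in word_to_pieces])

piece_vecs = torch.randn(6, 768)   # e.g. "boom ##ers" occupies pieces 4 and 5
word_vecs = average_wordpieces(piece_vecs, [[0], [1, 2], [3], [4, 5]])   # (4, 768)
```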