Basic Information
Author: Qingyun Wang
Paper: Paper Abstract Writing through Editing Mechanism (ACL)
Source: https://github.com/EagleW/Writing-editing-Network
Data Preprocessing
1. Load Data
- Abstract and extracted headline
- Tokenization (done by hand here; tools such as NLTK could of course also be used)
- Corpus structure: list (all samples) ==> sub-list (one sample: headline + abstract) ==> sub-sub-list (the tokens of one sentence)
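The loading step above can be sketched as follows. The names (tokenize, load_corpus) are placeholders rather than the repo's actual code, and whitespace splitting stands in for the real tokenizer:

```python
# Sketch: build the nested corpus structure described above
# list (all samples) -> sub-list (one sample: [headline, abstract]) -> tokens
def tokenize(text):
    # naive whitespace tokenization; NLTK's word_tokenize could be used instead
    return text.lower().split()

def load_corpus(pairs):
    """pairs: iterable of (headline, abstract) strings."""
    return [[tokenize(title), tokenize(abstract)] for title, abstract in pairs]

corpus = load_corpus([
    ("paper abstract writing", "we propose a writing editing network ."),
])
```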
2. Construct Vocabulary
- The vocabulary is built from the training data; words are sorted by frequency, low-frequency words are dropped, and the vocabulary is truncated to vocab_size
- On top of the words and punctuation, add the special tokens <pad>, <unk>, <eos>, <bos>
- <pad> is placed first in the dictionary (id 0)
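A minimal sketch of the vocabulary construction described above. The names (build_vocab, stoi, itos) are hypothetical, and the exact order of the special tokens after <pad> is an assumption; only "<pad> first" is stated in the notes:

```python
from collections import Counter

# Sketch: frequency-sorted vocabulary with special tokens,
# <pad> at index 0, truncated to vocab_size
def build_vocab(corpus_tokens, vocab_size):
    counts = Counter(tok for sample in corpus_tokens for tok in sample)
    specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
    words = [w for w, _ in counts.most_common(vocab_size - len(specials))]
    itos = specials + words                      # id2word
    stoi = {w: i for i, w in enumerate(itos)}    # word2id
    return stoi, itos

stoi, itos = build_vocab([["a", "b", "a", "c"], ["a", "b"]], vocab_size=6)
```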
3. word2id,id2word
- Map the corpus to token ids (no padding at this stage)
- Add <bos> before and <eos> after each abstract; under teacher forcing, the decoder input is equivalent to the target shifted by one position, which is why these flags are needed. The title does not need <bos>/<eos>
- Sort the corpus by the length of the encoder input (i.e. sort the samples by headline length, from long to short)
- Record max_x_len and max_y_len
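The mapping, flag insertion, and sorting steps above can be sketched together; numericalize and prepare are hypothetical names, not the repo's code:

```python
# Sketch: map tokens to ids, wrap abstracts with <bos>/<eos>,
# and sort samples by headline length (long to short) for packing
def numericalize(tokens, stoi):
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

def prepare(corpus, stoi):
    data = []
    for title, abstract in corpus:
        x = numericalize(title, stoi)
        y = [stoi["<bos>"]] + numericalize(abstract, stoi) + [stoi["<eos>"]]
        data.append((x, y))
    # descending length order is required later by pack_padded_sequence
    data.sort(key=lambda xy: len(xy[0]), reverse=True)
    max_x_len = max(len(x) for x, _ in data)
    max_y_len = max(len(y) for _, y in data)
    return data, max_x_len, max_y_len
```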
Model Construction
1. Embedding
- Randomly initialize the embedding matrix via nn.Embedding, which takes two parameters: vocab_size and embed_dim
- Once the embedding is defined, it is shared by the title encoder, the draft encoder, and the decoder; that is, headline, draft, and abstract all use the same embedding
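A sketch of the shared embedding; the sizes are illustrative, and passing padding_idx=0 (so the <pad> row stays zero) is an assumption, not something the notes confirm:

```python
import torch
import torch.nn as nn

# One embedding shared by the title encoder, draft encoder, and decoder
vocab_size, embed_dim = 1000, 64
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)  # <pad> is id 0

title_ids = torch.randint(1, vocab_size, (2, 5))   # [batch_size, max_x_len]
title_vecs = embedding(title_ids)                  # [batch_size, max_x_len, embed_dim]
```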
2. Encoder for Title
- single-layer bidirectional GRU
- Since the input samples are sorted by title length, the input can be packed with nn.utils.rnn.pack_padded_sequence before encoding, and the encoded result restored with nn.utils.rnn.pad_packed_sequence to fill the padding positions back in. This excludes the padding positions from the bidirectional encoding and gives more accurate results
- input_sorted_by_length ==> pack ==> encode ==> pad ==> encoder_output
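The pack ==> encode ==> pad pipeline can be sketched as follows; the sizes are illustrative, not the repo's hyperparameters:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Single-layer bidirectional GRU over a length-sorted, padded batch
embed_dim, hidden = 64, 32
gru = nn.GRU(embed_dim, hidden, num_layers=1, bidirectional=True, batch_first=True)

batch = torch.randn(3, 7, embed_dim)   # padded embeddings, sorted by length
lengths = torch.tensor([7, 5, 2])      # true lengths, descending
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, h_n = gru(packed)
encoder_output, _ = pad_packed_sequence(packed_out, batch_first=True)
# encoder_output: [batch_size, max_x_len, 2 * hidden]; padded steps come back as zeros
```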
3. Encoder for Draft
- single-layer bidirectional GRU
- This encoder encodes the draft produced by the decoder. Although the input has shape [batch_size, max_y_len], the number of valid tokens varies across sequences, and sorting the batch here (then restoring the original sample order after encoding) would be troublesome
- So nn.utils.rnn.pack_padded_sequence and nn.utils.rnn.pad_packed_sequence are not used; the encoding is affected by the padding positions, but this is harmless
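For contrast with the title encoder, a sketch of running the draft encoder directly on the padded draft, with no pack/pad step (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Single-layer bidirectional GRU applied to the padded draft as-is;
# padding positions do influence the states, which the notes deem harmless
embed_dim, hidden = 64, 32
draft_encoder = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)

draft = torch.randn(2, 9, embed_dim)    # [batch_size, max_y_len, embed_dim]
draft_states, _ = draft_encoder(draft)  # [batch_size, max_y_len, 2 * hidden]
```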
4. Decoder for All-pass Decoding
- single-layer unidirectional GRU
- Every decoding pass uses the same decoder
- It contains two attention mechanisms:
- Attention over the decoder hidden states of the previous pass (the draft)
- Attention from the current pass's decoder hidden states over the encoder hidden states
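Both attentions can be illustrated with one generic function. This is a plain dot-product attention sketch for shapes only; the paper's actual score function may differ:

```python
import torch
import torch.nn.functional as F

# Generic dot-product attention, applied once over the encoder states
# and once over the previous pass's draft states
def attend(query, keys):
    """query: [batch, hidden]; keys: [batch, seq_len, hidden]."""
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)     # [batch, seq_len]
    weights = F.softmax(scores, dim=1)
    context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # [batch, hidden]
    return context, weights

dec_hidden = torch.randn(2, 64)       # current decoder state
enc_states = torch.randn(2, 7, 64)    # encoder hidden states
draft_states = torch.randn(2, 9, 64)  # previous pass's decoder states

enc_ctx, _ = attend(dec_hidden, enc_states)      # decoder-encoder attention
draft_ctx, _ = attend(dec_hidden, draft_states)  # attention over the draft
```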
5. Complete Model
Comprises the two encoders above, the decoder, and a word-probability layer (projection to the vocabulary)
6. Instantiate Model
# set the visible GPU for the model
torch.cuda.set_device(0)
# print GPU information
if torch.cuda.is_available():
    print("congratulations! {} GPU(s) can be used!".format(torch.cuda.device_count()))
    print("currently, you are using GPU" + str(torch.cuda.current_device()),
          "named", torch.cuda.get_device_name(torch.cuda.current_device()))
    # use CUDA for acceleration
    model = model.cuda()
    criterion = nn.CrossEntropyLoss(ignore_index=0).cuda()
else:
    print("sadly, CUDA is not available!")
Output:
Congratulations! 1 GPU(s) can be used!
Currently, you are using GPU0 named "GeForce GTX 1050 Ti"
Training Process
Enter train_epoch function
Enter train_batch function
Initialize the previously generated draft to None
input [batch_size, max_x_len], target [batch_size, max_y_len]
Encode the input, shape [batch_size, max_x_len]
Enter the multi-pass decoding loop
- If the previously generated draft is None, this is the first decoding pass, and only decoder-encoder attention is performed
- Obtain the decoder hidden vectors
- Map them to vocabulary size to obtain a probability distribution
- Sample the decoded output (greedy search)
- Use it as the previously generated draft
- If the previously generated draft is not None, this is a later pass: besides the decoder-encoder attention, the hidden states of the generated draft are also attended to
- For the decoding result of each pass (i.e. the draft), compute the cross-entropy loss against the ground truth, back-propagate, and update the parameters (one training batch back-propagates multiple times, pushing each draft as close as possible to the target sequence)
Record the loss of each batch; after an epoch ends, print the average epoch loss
No validation is performed within each epoch
Save the parameters at the end of each epoch
If the current epoch's average loss is greater than the previous epoch's, training stops to prevent over-fitting; note, however, that the loss here is computed on the training set
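The multi-pass training step described above can be sketched with a toy model. Everything here (ToyModel, train_batch, the module names) is a placeholder for shape and control flow only; the toy decoder does not actually attend over enc_states or draft_states, whereas the real model does:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in with the same parts as the notes: two encoders, one decoder,
    one word-probability layer; attention is omitted in this toy."""
    def __init__(self, vocab_size=10, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.title_encoder = nn.GRU(dim, dim, batch_first=True)
        self.draft_encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def decode(self, y, enc_states, draft_states=None):
        # the real decoder attends over enc_states (and draft_states after
        # pass 1); this toy just runs the GRU and projects to the vocabulary
        h, _ = self.decoder(self.embed(y))
        return self.out(h)

def train_batch(model, optimizer, criterion, x, y, num_passes=2):
    enc_states, _ = model.title_encoder(model.embed(x))  # encode the title once
    draft, total = None, 0.0
    for _ in range(num_passes):
        draft_states = None
        if draft is not None:                  # later passes re-encode the draft
            draft_states, _ = model.draft_encoder(model.embed(draft))
        logits = model.decode(y, enc_states, draft_states)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizer.zero_grad()
        loss.backward()                        # back-propagate once per pass
        optimizer.step()
        draft = logits.argmax(dim=-1).detach() # greedy output becomes the draft
        total += loss.item()
    return total / num_passes
```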
Inference Phase
TO DO
- Run validation once at the end of each epoch, and stop training when the validation loss exceeds the previous validation loss
- Add a mixed objective function
- Move the encoder outside the decoding loop, so each batch is encoded only once
- Use a different decoder for each pass, e.g. one Transformer and one RNN (with beam search in a later pass); in this implementation, the first few passes all decode with greedy search
- Work out the details of Transformer decoding, and what tensor dimensions beam search expects as input, to make porting easier
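Since the to-do list mentions beam search, a toy sketch may help pin down the shapes involved. This is not the repo's (unimplemented) decoder; step_fn is a hypothetical callable that maps a prefix of token ids to next-token logits of shape [vocab_size]:

```python
import torch
import torch.nn.functional as F

# Toy beam search over an arbitrary step function; finished beams
# (ending in <eos>) are carried forward unchanged
def beam_search(step_fn, bos_id, eos_id, beam_size, max_len):
    """step_fn(prefix: list[int]) -> logits Tensor of shape [vocab_size]."""
    beams = [([bos_id], 0.0)]                       # (prefix, log-prob score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:
                candidates.append((prefix, score))  # already finished
                continue
            log_probs = F.log_softmax(step_fn(prefix), dim=0)
            topv, topi = log_probs.topk(beam_size)
            for v, i in zip(topv.tolist(), topi.tolist()):
                candidates.append((prefix + [i], score + v))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]
```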