[codes] Writing-editing Network Source Code Analysis

Basic Information

Author: Qingyun Wang

Paper: Paper Abstract Writing through Editing Mechanism (ACL)

Source: https://github.com/EagleW/Writing-editing-Network

Data Preprocessing

1. Load Data

  1. Load each abstract together with its extracted title (headline).
  2. Tokenize (a hand-written tokenizer is used; tools such as NLTK could also be used).
  3. Organize the corpus as nested lists: list (all samples) ==> sub-list (one sample: title + abstract) ==> sub-sub-list (the tokens of one sentence); a sketch follows below.
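
A minimal sketch of the nested-list corpus layout described above; the file name, the tab-separated format, and the whitespace tokenizer are illustrative assumptions, not the repository's exact loading code.

def tokenize(text):
    # A hand-written tokenizer is used in the repo; NLTK would also work.
    return text.lower().split()

def load_corpus(path="train.dat"):
    corpus = []                                  # list: all samples
    with open(path, encoding="utf-8") as f:
        for line in f:
            title, abstract = line.rstrip("\n").split("\t")
            corpus.append([tokenize(title),      # sub-list: one sample
                           tokenize(abstract)])  # sub-sub-lists: token lists
    return corpus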

2. Construct Vocabulary

  1. Build the vocabulary from the training data: sort words by frequency, drop low-frequency words, and cap the vocabulary at vocab_size.
  2. On top of the words and punctuation, add the special tokens <pad>, <unk>, <eos>, <bos>.
  3. Place <pad> at the first position (index 0) of the dictionary; a sketch follows below.
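
A minimal sketch of the vocabulary construction, assuming the token lists from the previous sketch; variable names here are illustrative, not the repository's.

from collections import Counter

PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

def build_vocab(token_lists, vocab_size):
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency and keep only the most frequent words.
    kept = [w for w, _ in counts.most_common(vocab_size)]
    # <pad> sits at index 0 of the dictionary; the other flags follow.
    itos = [PAD, UNK, BOS, EOS] + kept           # id -> word
    stoi = {w: i for i, w in enumerate(itos)}    # word -> id
    return stoi, itos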

3. word2id, id2word

  1. Map the corpus to token ids (no padding at this stage).
  2. Add the two special tokens <bos> and <eos> around each abstract; under teacher forcing the decoder input is the target shifted by one step. <bos> and <eos> also need to be added to the title.
  3. Sort the corpus by the length of the encoder input (i.e., sort samples by title length, from long to short).
  4. Record max_x_len and max_y_len (a sketch of the mapping follows below).
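
A minimal sketch of the id mapping and length sorting, reusing the stoi dictionary and special tokens from the previous sketch; the function names are illustrative.

def to_ids(tokens, stoi):
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

def numericalize(corpus, stoi):
    data = []
    for title, abstract in corpus:
        x = [stoi["<bos>"]] + to_ids(title, stoi) + [stoi["<eos>"]]
        # Under teacher forcing the decoder input is y[:-1] and the gold
        # output is y[1:], i.e. the target shifted by one step.
        y = [stoi["<bos>"]] + to_ids(abstract, stoi) + [stoi["<eos>"]]
        data.append((x, y))
    # Sort by encoder-input (title) length, longest first, so that
    # pack_padded_sequence can be used later.
    data.sort(key=lambda xy: len(xy[0]), reverse=True)
    max_x_len = max(len(x) for x, _ in data)
    max_y_len = max(len(y) for _, y in data)
    return data, max_x_len, max_y_len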

Model Construction

1. Embedding

  1. Randomly initialize the embedding matrix by passing two parameters to nn.Embedding: vocab_size and embed_dim.
  2. Once the embedding is defined, it is shared by the title encoder, the draft encoder, and the decoder; that is, title, draft, and abstract use the same embedding (sketched below).
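
A minimal sketch of the shared embedding; the sizes are illustrative, and padding_idx=0 (keeping the <pad> embedding at zero) is an assumption rather than something stated above.

import torch.nn as nn

vocab_size, embed_dim = 20000, 256               # illustrative sizes

# One randomly initialised embedding matrix, built from the two parameters
# vocab_size and embed_dim, shared by both encoders and the decoder.
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

# The title encoder, draft encoder, and decoder would all be handed this same
# module, so title, draft, and abstract tokens share one embedding table.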

2. Encoder for Title

  1. single-layer bidirectional GRU
  2. Because the corpus has already been sorted by title length, the input can be packed with nn.utils.rnn.pack_padded_sequence before encoding, and the encoder output is then restored with nn.utils.rnn.pad_packed_sequence to fill the padding positions back in. This makes the result more accurate, since the bidirectional encoding is not affected by the padding positions (see the sketch below).
  3. input_sorted_by_length ==> pack ==> encode ==> pad ==> encoder_output
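
A minimal sketch of the title encoder pipeline (input_sorted_by_length ==> pack ==> encode ==> pad ==> encoder_output); module and parameter names are illustrative, not the repository's exact code.

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class TitleEncoder(nn.Module):
    def __init__(self, embedding, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = embedding               # the shared embedding
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, x, lengths):
        # x: [batch_size, max_x_len], already sorted by length (descending)
        packed = pack_padded_sequence(self.embedding(x), lengths,
                                      batch_first=True)
        out, h = self.gru(packed)
        # Restore a padded tensor; the padding positions no longer affect
        # the bidirectional encoding of the real tokens.
        out, _ = pad_packed_sequence(out, batch_first=True)
        return out, h                            # out: [B, max_x_len, 2*hidden_dim]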

3. Encoder for Draft

  1. single-layer bidirectional GRU
  2. The draft encoder encodes the output of the decoder from the previous pass. Although the input still has shape [batch_size, max_y_len], the number of valid (non-padding) tokens varies across the batch, and sorting here and then restoring each sample's original position within the batch after encoding would be troublesome.
  3. So nn.utils.rnn.pack_padded_sequence and nn.utils.rnn.pad_packed_sequence are not used; the encoding is affected by the padding positions, but this is harmless (sketched below).
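
A minimal sketch of the draft encoder under the same assumptions as above: the same kind of bidirectional GRU, but run directly on the padded draft without pack/pad, trading a little padding noise for not having to re-sort the batch.

import torch.nn as nn

class DraftEncoder(nn.Module):
    def __init__(self, embedding, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = embedding               # the shared embedding
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, draft):
        # draft: [batch_size, max_y_len], the previous pass's greedy output;
        # no packing, so padding positions leak into the encoding (harmless).
        out, h = self.gru(self.embedding(draft))
        return out, h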

4. Decoder for All-pass Decoding

  1. single-layer unidirectional GRU
  2. Every decoding pass uses the same decoder.
  3. It contains two attention mechanisms (see the sketch after this list):
    1. Attention over the decoder hidden states of the previous pass (the draft).
    2. Attention from the current pass's decoder hidden states over the encoder hidden states.
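
A minimal sketch of one step of the shared decoder with its two attentions, using plain dot-product attention for brevity; the repository's scoring functions, layer sizes, and names will differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttnDecoderStep(nn.Module):
    def __init__(self, embedding, embed_dim=256, hidden_dim=512, vocab_size=20000):
        super().__init__()
        self.embedding = embedding
        # input = word embedding + encoder context + draft context
        self.gru = nn.GRUCell(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)     # word-probability layer

    @staticmethod
    def attend(query, memory):
        # query: [B, H], memory: [B, T, H] -> context: [B, H]
        scores = torch.bmm(memory, query.unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), memory).squeeze(1)

    def forward(self, y_prev, h, enc_out, draft_out=None):
        # y_prev: [B] previous token ids, h: [B, H] decoder hidden state
        enc_ctx = self.attend(h, enc_out)                # decoder-encoder attention
        if draft_out is not None:                        # passes after the first
            draft_ctx = self.attend(h, draft_out)        # attention over the draft
        else:                                            # first pass: no draft yet
            draft_ctx = torch.zeros_like(enc_ctx)
        h = self.gru(torch.cat([self.embedding(y_prev), enc_ctx, draft_ctx], dim=1), h)
        logits = self.out(h)                             # [B, vocab_size]
        return logits, h

With the illustrative sizes above (bidirectional encoders with hidden_dim=256), enc_out and draft_out are 512-dimensional, matching the decoder's hidden size, so dot-product attention needs no extra projection.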

5. Complete Model

The complete model consists of the two kinds of encoder above, the decoder, and the word-probability (output projection) layer.

6. Instantiate Model

# Set the visible GPU for the model
torch.cuda.set_device(0)

# Print GPU information
if torch.cuda.is_available():
    print("congratulations! {} GPU(s) can be used!".format(torch.cuda.device_count()))
    print("currently, you are using GPU" + str(torch.cuda.current_device()), 
          "named", torch.cuda.get_device_name(torch.cuda.current_device()))

    # Use CUDA for acceleration
    model = model.cuda()
    criterion = nn.CrossEntropyLoss(ignore_index=0).cuda()
    
else:
    print("sadly, CUDA is not available!")

Output:

Congratulations! 1 GPU(s) can be used!
Currently, you are using GPU0 named "GeForce GTX 1050 Ti"

Training Process

  1. Enter train_epoch function

    1. Enter train_batch function

      1. Previously generated draft set to None

      2. input [batch_size, max_x_len], target [batch_size, max_y_len]

      3. Encode the input, shape [batch_size, max_x_len]

      4. Enter the multi-pass decoding loop

        1. If the previously generated draft is None, this is the first decoding pass, and only decoder-encoder attention is performed:
          1. Decode to obtain the hidden state vectors.
          2. Project them to the vocabulary size to get the probability distribution over words.
          3. Sample the decoded output (greedy search).
          4. Use it as the previously generated draft.
        2. If the previously generated draft is not None, a draft has already been decoded, so in addition to the decoder-encoder attention, the hidden states of the generated draft are also attended to.
        3. In this implementation, the decoding result of every pass (i.e., each draft) is compared with the ground truth to compute a cross-entropy loss, back-propagate, and update the parameters; one training batch therefore back-propagates several times, pushing every draft as close as possible to the target sequence (see the sketch after this list).
    2. Record each batch's loss; after the epoch ends, print the average epoch loss

    3. No validation is run within each epoch

  2. Save a set of parameters after each epoch of the loop

  3. If the current epoch's average loss is greater than the previous epoch's average loss, training is stopped to prevent over-fitting; note, however, that this loss is computed on the training set
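
A minimal sketch of the per-batch multi-pass training flow described in the list above. encode() and decode_pass() are hypothetical callables standing in for the modules sketched earlier, and their signatures are assumptions rather than the repository's API; the point of the sketch is that every pass's draft is scored against the same ground truth and back-propagated separately.

import torch.nn as nn

def train_batch(x, y, encode, decode_pass, optimizer, num_passes=2):
    # x: [B, max_x_len] title ids; y: [B, max_y_len] abstract ids (<bos> ... <eos>)
    criterion = nn.CrossEntropyLoss(ignore_index=0)   # id 0 is <pad>
    prev_draft = None                                 # no draft before pass 1
    total_loss = 0.0
    for _ in range(num_passes):
        # Re-encoding each pass keeps the per-pass backward simple; the TODO
        # list below notes that the encoder could be hoisted out of this loop.
        enc_out = encode(x)
        # decode_pass is assumed to return teacher-forced logits of shape
        # [B, max_y_len - 1, vocab] plus the greedy token ids of the draft.
        logits, draft = decode_pass(enc_out, prev_draft, y)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         y[:, 1:].reshape(-1))        # targets shifted by one step
        optimizer.zero_grad()
        loss.backward()                               # back-propagate once per pass
        optimizer.step()
        prev_draft = draft.detach()                   # becomes the next pass's draft
        total_loss += loss.item()
    return total_loss / num_passes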

Inference Phase

TO DO

  1. Run validation once at the end of each epoch, and stop training when the validation loss exceeds the previous validation's average loss.
  2. Add a mixed objective function.
  3. Move the encoder out of the multi-pass loop, so that each batch is encoded only once.
  4. Use a decoder that is not the same for every pass; it could even be one Transformer and one RNN (with beam search in a later pass). In this implementation, the first several passes all decode with greedy search.
  5. Work out the Transformer's decoding procedure and what Tensor shapes beam search expects as input, to make them easier to plug in.

Origin www.cnblogs.com/lauspectrum/p/11256794.html