[code] Transformer For Summarization Source Code Reading [3]

1. Label Smoothing

For classification problems, we want the probability distribution the model outputs over the labels to be close to the one-hot representation of the true label. This causes two problems:

  1. Generalization cannot be guaranteed
  2. The one-hot target encourages the gap between the true class and the other classes to be as wide as possible, which leads to over-confident class predictions

The paper When Does Label Smoothing Help? gives a mathematical description of label smoothing:

  1. (For multi-class classification) the commonly used loss is cross entropy; the source code uses KL divergence as the loss function instead. Minimizing the two is in fact equivalent, since they differ only by a constant (the entropy of the target distribution); see the identity after this list
  2. Normally the one-hot representation of the true y is used to compute the cross entropy with the predicted probability distribution; with label smoothing, the smoothed y is used to compute the cross entropy with the predicted probability distribution
  3. A uniform distribution u(k) is often introduced for label smoothing
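
For item 1, the standard identity relating the two losses (written out here for reference) is

$$H(q', p) = -\sum_{k} q'(k) \log p(k) = D_{\mathrm{KL}}(q' \,\|\, p) + H(q')$$

where p is the predicted distribution and q' the target distribution. Since H(q') does not depend on the model parameters, minimizing the cross entropy and minimizing the KL divergence lead to the same optimum.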

An independent distribution u(k) is introduced; it can serve as a prior probability over the labels and is usually taken to be the uniform distribution.

There are two steps:

  1. Determine the Dirac distribution of the true label, and the distribution u(k)
  2. Take the weighted sum of the two, with weights (1 - epsilon) and epsilon respectively (see the formula below)
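
Putting the two steps together, the smoothed target distribution (reconstructed here from the description above) is

$$q'(k) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\,u(k)$$

where δ_{k,y} equals 1 when k is the true label y and 0 otherwise, and u(k) = 1/K in the uniform case.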

The corresponding code implementation:


def label_smoothing(inputs, epsilon=0.1):
    K = inputs.get_shape().as_list()[-1]    # number of labels
    return ((1-epsilon) * inputs) + (epsilon / K)

The effect is that the one-hot representation becomes less absolute: some probability mass is moved to the other labels, leaving the model room to generalize. Since its main role is to prevent over-fitting, consider adding label smoothing when training over-fits severely.
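
To see the effect concretely, here is a minimal sketch of calling the function above (assuming TensorFlow 2 with eager execution; the numbers are just an example):

import tensorflow as tf

one_hot = tf.constant([[0., 0., 1., 0.]])          # a one-hot label over K = 4 classes
smoothed = label_smoothing(one_hot, epsilon=0.1)   # 0.9 * one_hot + 0.1 / 4
print(smoothed.numpy())                            # [[0.025 0.025 0.925 0.025]]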

2. Copy Mechanism

The copy mechanism used in this implementation is consistent with the one in the paper Get To The Point: Summarization with Pointer-Generator Networks.

The copy mechanism proceeds as follows:

  1. copy-mode: at the current time step, use the decoder-encoder attention weights to compute a probability distribution over the input article's word list

  2. generate-mode: use the decoder hidden state, the decoder input, and the decoder-encoder context to compute a probability distribution over the fixed vocabulary

  3. gate: to choose between the two modes above, use the same three quantities (decoder hidden state, decoder input, decoder-encoder context) to compute the probability of selecting generate-mode at the current step

  4. Compute the joint probability (below):
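
The joint probability from the Pointer-Generator paper (written out here for reference; p_gen is the gate probability from step 3, P_vocab is the generate-mode distribution from step 2, and a_i^t are the attention weights from step 1):

$$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, x_i = w} a_i^t$$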

Words belong to the following categories:

  1. fixed vocabulary V
  2. input word list X
  3. OOV (X - V)
  4. extended vocabulary X + V

How the probability of each kind of word is calculated:

The probability of a word in X is computed by the copy-mode (after weighting), the probability of a word in V is computed by the generate-mode (after weighting), and the probability of a word in X∩V is the weighted sum of the two modes' probabilities. The weights come from another gate network.

Originally we could only obtain a probability distribution over the words of the fixed vocabulary; now we obtain a probability distribution over the extended vocabulary (which is dynamic, since X changes with the input). This lets the decoder output words that do not appear in the fixed vocabulary but do appear in the input word list.
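
A minimal PyTorch sketch of how such an extended-vocabulary distribution can be assembled (an illustration of the idea only; the function name and tensor layout are hypothetical, not the repository's exact code):

import torch

def extended_vocab_dist(p_vocab, attn_dist, p_gen, xids, max_ext_len):
    # p_vocab:     [batch, vocab_size]  generate-mode distribution over V
    # attn_dist:   [batch, max_x_len]   copy-mode distribution over input positions
    # p_gen:       [batch, 1]           gate: probability of choosing generate-mode
    # xids:        [batch, max_x_len]   input token ids in the extended vocabulary
    # max_ext_len: number of in-batch OOV words (size of X - V)
    batch = p_vocab.size(0)
    # pad the fixed-vocabulary distribution with zero slots for the OOV words
    p_ext = torch.cat([p_gen * p_vocab, torch.zeros(batch, max_ext_len)], dim=1)
    # scatter the weighted copy-mode probabilities onto the extended vocabulary;
    # words in both X and V accumulate probability from both modes
    return p_ext.scatter_add(1, xids, (1.0 - p_gen) * attn_dist)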

What if, at decoding time, a word y is copied from X but belongs to the set X - V, i.e. it is an OOV word? Only words in V have embeddings, so y has no embedding; how should the decoder input at the next time step be handled? See Li Piji's implementation:

next_y = []
for e in last_traces:
    eid = e[-1].item()
    if eid in modules["i2w"]:
        next_y.append(eid)
    else:
        next_y.append(modules["lfw_emb"])  # unk for copy mechanism

If the output at the current decoding step is the id of an OOV word copied from the input word list, then the decoder input at the next step is the embedding of UNK, while the current output itself is not UNK but the copied word.

3. Recurrent Decoder?

Before looking at a concrete implementation of the Transformer on a sequence generation task, I had always wondered: if the decoder is auto-regressive, is it also recurrent?

During decoding (the training phase uses teacher forcing; inference outputs word by word), one word is decoded at a time. Under the copy (pointer-generator network) mechanism, the current decoder inputs are as follows:

y_pred, attn_dist = model.decode(ys, tile_x_mask, None, tile_word_emb, tile_padding_mask, tile_x, max_ext_len)

def decode(self, inp, mask_x, mask_y, src, src_padding_mask, xids=None, max_ext_len=None):
    pass
    

The correspondence of the decoder's inputs can be seen:

  1. The decoding input (ys) has shape [seq_len, beam_width], where seq_len is the length of the partial decoded sequence that beam search has produced so far

  2. input_y is not masked: when computing attention, all elements of the input_y sequence are considered, hence mask_y=None

  3. src is tile_word_emb, the embedding of the input sequence, [max_x_len, beam_width, embedding_dim]

  4. src_padding_mask is tile_padding_mask, [max_x_len, beam_width]

  5. xids is tile_x, [max_x_len, beam_width]

    x = torch.tensor([1, 2, 3])
    
    x.shape
    Out[7]: torch.Size([3])
    x.repeat(4,2).shape
    Out[8]: torch.Size([4, 6])
    x.repeat(4,2,1).shape
    Out[9]: torch.Size([4, 2, 3])

The torch.Tensor.repeat function replicates the tensor along the specified dimensions. If x has shape [dx_0] and we call x.repeat(dy_3, dy_2, dy_1, dy_0), x is first (implicitly) unsqueezed so that its number of dimensions matches the number of arguments of repeat, and then each dimension is repeated accordingly.
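
For example, the tile_* tensors above can be produced by repeating a single sample's tensors along the beam dimension (a sketch with assumed sizes, not the repository's exact tiling code):

import torch

max_x_len, emb_dim, beam_width = 400, 512, 5             # assumed example sizes

word_emb = torch.randn(max_x_len, 1, emb_dim)            # encoder-side embeddings of ONE sample
padding_mask = torch.zeros(max_x_len, 1)                 # its padding mask

tile_word_emb = word_emb.repeat(1, beam_width, 1)        # [max_x_len, beam_width, emb_dim]
tile_padding_mask = padding_mask.repeat(1, beam_width)   # [max_x_len, beam_width]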

Inside the decoder, self-attention over the decoding input and external attention over the encoder output are performed. There are two outputs:

  1. y_dec, the decoder's decoding output, with shape [seq_y_len, beam_width, vocab_size].
    Via y_pred = y_pred[-1, :, :], the last hidden state is taken as the current decoding output.
  2. attn_dist, computed by word_prob_layer, is the attention weight of the external attention, with shape [seq_y_len, beam_width, max_x_len]

* In the beam search inference phase, data are read and encoded in batches, but decoding is done separately for each sample

In a nutshell, this is not recurrent, but it is still an auto-regressive decoding mode. Recurrent means that decoding at the current time step receives the hidden state from the previous time step. When the Transformer decoder decodes each word, it considers not only the previous word but all of the words that have already been generated, and performs self-attention over them.
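
A minimal sketch of this auto-regressive but non-recurrent loop, reusing the decode call from above (greedy selection replaces beam search for brevity; bos_id, max_dec_len, model and the tile_* tensors are assumed to be defined already):

ys = torch.full((1, beam_width), bos_id, dtype=torch.long)    # [seq_len=1, beam_width]
for _ in range(max_dec_len):
    # the whole prefix ys is re-fed every step; no hidden state is carried between steps
    y_pred, attn_dist = model.decode(ys, tile_x_mask, None, tile_word_emb,
                                     tile_padding_mask, tile_x, max_ext_len)
    next_token = y_pred[-1, :, :].argmax(dim=-1)              # take the last position only
    ys = torch.cat([ys, next_token.unsqueeze(0)], dim=0)      # grow the decoded prefix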

4. References

  1. 【Network】Optimization problems: Label Smoothing
  2. 【Machine Learning Basics】Entropy, KL divergence, cross entropy
  3. GitHub repository: lipiji/TranSummar


Origin www.cnblogs.com/lauspectrum/p/11237421.html