2019.01.10 Study Notes

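These notes cover the Transformer model in NLP. The exercise is based on the TensorFlow 2.0 tutorials; for details, go straight to the official API documentation.
The Transformer in the official TensorFlow tutorial
1. Preparing the dataset
2. Building the Transformer
   1. encoder
   2. decoder
   3. Linear layer + softmax
3. Train and test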

The following mainly explains how the Transformer is built

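encoder ==> Built mainly from basic encoderLayer blocks. Each encoderLayer consists of Multi-Head Attention and Feed Forward.
The input to Multi-Head Attention has shape (batch_size, seq_len, d_model), which is split into (batch_size, seq_len, num_heads, depth), where num_heads is the number of heads and num_heads * depth = d_model. Because the inputs were padded so that seq_len is the same for every sequence, Multi-Head Attention needs a mask to remove the influence of the padding values. Stacking N encoderLayers and adding the input Embedding and Positional Encoding gives the encoder.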
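As a rough illustration of the two ideas in this paragraph (splitting d_model across heads, and masking padded positions), here is a minimal plain-Python sketch. It uses nested lists instead of TensorFlow tensors so that it runs without dependencies. The helper names, the pad id of 0 and the -1e9 offset are choices made for this sketch, not code quoted from the tutorial.

```python
import math


def split_heads(x, num_heads):
    """Reshape (batch_size, seq_len, d_model) into (batch_size, seq_len, num_heads, depth)."""
    d_model = len(x[0][0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    depth = d_model // num_heads
    return [
        [[vec[h * depth:(h + 1) * depth] for h in range(num_heads)] for vec in seq]
        for seq in x
    ]


def create_padding_mask(token_ids, pad_id=0):
    """Mark padded positions with 1.0 and real tokens with 0.0 (pad_id=0 is assumed)."""
    return [[1.0 if tok == pad_id else 0.0 for tok in row] for row in token_ids]


def masked_softmax(scores, mask, big_negative=-1e9):
    """Softmax over one row of attention scores, pushing masked positions to ~0."""
    shifted = [s + m * big_negative for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# One sequence of two tokens with d_model = 4, split into 2 heads of depth 2.
x = [[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]]
heads = split_heads(x, num_heads=2)

# A sequence padded with two trailing zeros, and one row of attention scores.
mask = create_padding_mask([[7, 6, 0, 0]])[0]
weights = masked_softmax([1.0, 2.0, 3.0, 4.0], mask)
```

The two padded positions receive a weight of 0 even though their raw scores were the largest, so the attention output depends only on the real tokens. In a real model the split tensor is typically also transposed so that the head axis comes before seq_len.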
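decoder ==> Built mainly from basic decoderLayer blocks. Each decoderLayer consists of Masked Multi-Head Attention, Multi-Head Attention and Feed Forward. The input to Masked Multi-Head Attention is tar_input, which is used to produce tar_prediction. The mask here is mainly the look_ahead mask, which masks out the later tokens during prediction. For example, predicting the second word uses only the first word, and predicting the third word uses the first and second words.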
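The look_ahead mask described above can be sketched in a few lines of plain Python. This is only an illustration under the assumption that 1.0 means "masked" and 0.0 means "visible"; the function names are made up for the sketch.

```python
def create_look_ahead_mask(size):
    """Return a size x size mask with 1.0 wherever position j lies after position i."""
    return [[1.0 if j > i else 0.0 for j in range(size)] for i in range(size)]


def visible_positions(mask_row):
    """List the positions that one query position is still allowed to attend to."""
    return [j for j, masked in enumerate(mask_row) if masked == 0.0]


look_ahead_mask = create_look_ahead_mask(3)
for i, row in enumerate(look_ahead_mask):
    print(f"position {i} can attend to {visible_positions(row)}")
```

Row i of the mask hides every position after i, so each position only sees itself and the tokens before it.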
The tutorial describes the decoder's target input like this:
The target is divided into tar_inp and tar_real. tar_inp is passed as an input to the decoder. tar_real is that same input shifted by 1: At each location in tar_input, tar_real contains the next token that should be predicted.
For example, sentence = “SOS A lion in the jungle is sleeping EOS”
tar_inp = “SOS A lion in the jungle is sleeping”
tar_real = “A lion in the jungle is sleeping EOS”
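The shift in the quoted example is just two list slices. A minimal sketch, assuming the sentence is tokenized by splitting on whitespace (a real pipeline would use a subword tokenizer and integer ids):

```python
sentence = "SOS A lion in the jungle is sleeping EOS".split()

tar_inp = sentence[:-1]   # everything except the final EOS, fed to the decoder
tar_real = sentence[1:]   # the same tokens shifted left by one, used as labels

# At each location, tar_real holds the token that should follow tar_inp.
for inp_tok, real_tok in zip(tar_inp, tar_real):
    print(f"{inp_tok} -> {real_tok}")
```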
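The second Multi-Head Attention then takes the encoder output and the output of Masked Multi-Head Attention as its inputs, and its mask is the padding mask.
Adding Embedding and Positional Encoding on top of the decoderLayers gives the decoder.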
Both encoderLayer and decoderLayer also include residual connections, which guard against the degradation that comes with very deep networks, followed by layerNormalization.
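The "add the input back, then normalize" pattern can be sketched as follows. This is a simplified plain-Python version for a single vector: it omits the learned scale and offset that a real layer normalization has, and toy_sublayer is a hypothetical stand-in for attention or feed-forward.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(x, sublayer):
    """Apply a sublayer, add the input back (residual), then normalize."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])


def toy_sublayer(vec):
    """Hypothetical stand-in for Multi-Head Attention or Feed Forward."""
    return [0.5 * v for v in vec]


y = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Because the input is added back before normalization, a sublayer that contributes nothing still lets the input pass through unchanged (up to normalization), which is what keeps very deep stacks trainable.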
A linear layer and softmax are then applied to the decoder output to obtain the desired output.
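For a single decoder position, the final step amounts to a matrix multiply followed by a softmax over the target vocabulary. A minimal sketch with made-up numbers; the sizes (d_model = 2, vocabulary of 3) and the weight values are hypothetical and chosen only to keep the example readable.

```python
import math


def linear(x, weights, bias):
    """Compute logits[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [
        sum(x_i * row[j] for x_i, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical decoder output for one position (d_model = 2).
decoder_output = [1.0, -1.0]
# Hypothetical weight matrix of shape (d_model, vocab_size) = (2, 3).
weights = [[0.5, 0.0, -0.5],
           [0.0, 1.0, 0.5]]
bias = [0.0, 0.0, 0.0]

logits = linear(decoder_output, weights, bias)
probs = softmax(logits)
predicted_id = probs.index(max(probs))  # most likely token id
```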

optimizer

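The optimizer uses a custom learning rate schedule to help the model converge better. The code below defines the schedule and plots how it changes over the training steps.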

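import matplotlib.pyplot as plt
import tensorflow as tf


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  """Learning rate that rises linearly during warmup, then decays as 1/sqrt(step)."""

  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    # Keep d_model as float32 so it can be passed to tf.math.rsqrt.
    self.d_model = tf.cast(d_model, tf.float32)
    self.warmup_steps = warmup_steps

  def __call__(self, step):
    # The optimizer may pass an integer step; rsqrt needs a float.
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)                 # decay branch: step^-0.5
    arg2 = step * (self.warmup_steps ** -1.5)  # warmup branch: linear in step

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


# d_model (the model dimension) is assumed to be defined earlier.
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# Plot the schedule over the first 40000 training steps.
temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")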
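To see the shape of the curve without running TensorFlow or matplotlib, the same formula can be evaluated in plain Python. This is only a sketch: transformer_lr is a name made up here, and d_model = 128 is an assumed value chosen for illustration.

```python
def transformer_lr(step, d_model, warmup_steps=4000):
    """d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    arg1 = step ** -0.5                  # decay branch
    arg2 = step * warmup_steps ** -1.5   # warmup branch, linear in step
    return d_model ** -0.5 * min(arg1, arg2)


d_model = 128  # assumed value, only for this illustration
for step in (1, 1000, 4000, 16000, 40000):
    print(step, transformer_lr(step, d_model))
```

The two branches meet at step = warmup_steps, so the learning rate grows linearly up to that point, peaks there, and then decays in proportion to the inverse square root of the step.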

sharic_song
