Original link
If the link does not open, you can also paste it into https://nbviewer.jupyter.org to view it.

Sequence Models and Attention Mechanism: Neural Machine Translation

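1. Translating human-readable dates into machine-readable dates
- 1.1 Dataset
2. Neural machine translation with attention
- 2.1 Attention mechanism
3. Visualizing attention (optional)
- 3.1 Getting the activations from the network
4. Full code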

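Welcome to the first programming assignment of this week!
You will build a Neural Machine Translation (NMT) model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25"). You will do this with an attention model, one of the most sophisticated sequence-to-sequence models.

This assignment was produced together with NVIDIA's Deep Learning Institute.

Let's load all the packages you will need for this assignment.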

from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
#%matplotlib inline

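1. Translating human-readable dates into machine-readable dates

The model you will build here could be used to translate from one language to another, for example from English to Hindi. However, language translation requires massive datasets and usually takes days of training on GPUs. To let you experiment with these models without such large datasets, we will use a simpler "date translation" task instead.

The network takes as input a date written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987") and translates it into a standardized, machine-readable date (e.g. "1958-08-29", "1968-03-30", "1987-06-24"). We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD.

1.1 Dataset

We will train the model on a dataset of 10000 human-readable dates and their equivalent standardized, machine-readable dates. Let's run the following code to load the dataset and print some examples.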

m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

Result

100%|█████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 17749.83it/s]

Print a few examples:


print(dataset[:10])

Result

[('9 may 1998', '1998-05-09'), ('10.11.19', '2019-11-10'), ('9/10/70', '1970-09-10'), ('saturday april 28 1990', '1990-04-28'), ('thursday january 26 1995', '1995-01-26'), ('monday march 7 1983', '1983-03-07'), ('sunday may 22 1988', '1988-05-22'), ('08 jul 2008', '2008-07-08'), ('8 sep 1999', '1999-09-08'), ('thursday january 1 1981', '1981-01-01')]

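Here is what you have loaded:

dataset: a list of tuples (human-readable date, machine-readable date).
human_vocab: a Python dictionary mapping every character used in the human-readable dates to an integer index.
machine_vocab: a Python dictionary mapping every character used in the machine-readable dates to an integer index. These indices are not necessarily consistent with those of human_vocab.
inv_machine_vocab: the inverse dictionary of machine_vocab, mapping indices back to characters.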
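
The relationship between machine_vocab and inv_machine_vocab can be sketched in a few lines. This is only an illustration: toy_machine_vocab is a hypothetical dictionary built here for the example (the real one is returned by load_dataset()), although this particular ordering happens to reproduce the target indices printed further down for '1998-05-09'.

```python
# Hypothetical toy vocabulary; the real machine_vocab comes from load_dataset().
toy_machine_vocab = {ch: i for i, ch in enumerate("-0123456789")}

# Invert the mapping: index -> character.
toy_inv_machine_vocab = {i: ch for ch, i in toy_machine_vocab.items()}

def decode(indices, inv_vocab):
    """Turn a sequence of integer indices back into a string."""
    return "".join(inv_vocab[i] for i in indices)

# Encode a machine-readable date, then decode it again.
encoded = [toy_machine_vocab[ch] for ch in "1998-05-09"]
decoded = decode(encoded, toy_inv_machine_vocab)
print(encoded)
print(decoded)
```

The inverse dictionary is what later turns the argmax of each softmax output back into a character.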

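Let's preprocess the data and map the raw text to index values. We will also use Tx=30 (which we assume is the maximum length of a human-readable date; longer inputs would have to be truncated) and Ty=10 (since "YYYY-MM-DD" is 10 characters long).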

Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

Result

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)

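You now have:

X: a processed version of the human-readable dates in the training set, where each character is replaced by the index it maps to in human_vocab. Each date is further padded to $T_x$ values with a special padding character (<pad>). X.shape = (m, Tx).
Y: a processed version of the machine-readable dates in the training set, where each character is replaced by the index it maps to in machine_vocab. Y.shape = (m, Ty).
Xoh: the one-hot version of X; the position of the "1" entry in each one-hot vector is the index of the corresponding character in human_vocab. Xoh.shape = (m, Tx, len(human_vocab)).
Yoh: the one-hot version of Y; the position of the "1" entry in each one-hot vector is the index of the corresponding character in machine_vocab. Yoh.shape = (m, Ty, len(machine_vocab)). Here, len(machine_vocab) = 11, since there are 11 characters ('-' and the digits 0-9).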
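
As a rough illustration of what this preprocessing amounts to, the sketch below maps a string to indices, pads it, and one-hot encodes the result. It is not the code of preprocess_data() in nmt_utils: encode_padded, one_hot and toy_vocab are hypothetical names made up for this example, and the toy vocabulary is much smaller than the real human_vocab.

```python
import numpy as np

def encode_padded(text, length, vocab):
    """Map characters to indices, truncate to `length`, then pad with '<pad>'."""
    text = text.lower()[:length]
    indices = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    indices += [vocab["<pad>"]] * (length - len(indices))
    return indices

def one_hot(indices, num_classes):
    """Return an array of shape (len(indices), num_classes) with a single 1 per row."""
    out = np.zeros((len(indices), num_classes))
    out[np.arange(len(indices)), indices] = 1.0
    return out

# Hypothetical toy vocabulary (assumption: the real one has 37 entries).
toy_vocab = {ch: i for i, ch in enumerate(" 0123456789amy")}
toy_vocab["<unk>"] = len(toy_vocab)
toy_vocab["<pad>"] = len(toy_vocab)

idx = encode_padded("9 may 1998", 12, toy_vocab)
oh = one_hot(idx, len(toy_vocab))
print(idx)
print(oh.shape)
```

Each row of the one-hot array contains exactly one 1, at the column given by the character index, which is why Xoh and Yoh gain a trailing vocabulary-sized axis compared with X and Y.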

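Let's also look at some examples of preprocessed training data. Feel free to change index in the code below to browse the dataset and see how source/target dates are preprocessed.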

index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])

Result

Source date: 9 may 1998
Target date: 1998-05-09

Source after preprocessing (indices): [12  0 24 13 34  0  4 12 12 11 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 10  9  0  1  6  0  1 10]

Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

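2. Neural machine translation with attention

If you had to translate a paragraph of a book from French to English, you would not read the whole paragraph, close the book, and then translate. Even while translating, you would read and re-read the parts of the French paragraph that correspond to the parts of the English you are writing down.

The attention mechanism tells a Neural Machine Translation model where it should pay attention at each step.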

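2.1 Attention mechanism

In this part, you will implement the attention mechanism presented in the lecture videos. Here is a figure to remind you how the model works. The diagram on the left shows the attention model. The diagram on the right shows what one "attention" step does: it computes the attention variables $\alpha^{\langle t, t' \rangle}$, which are then used to compute the context variable $context^{\langle t \rangle}$ for each time step of the output ($t = 1, \ldots, T_y$).
[Figure: left, the attention model; right, one attention step]
Here are some properties of the model that you may notice:

There are two separate LSTMs in this model (see the diagram on the left). The one at the bottom of the picture, which comes before the attention mechanism, is a bidirectional LSTM; we will call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM.
- The pre-attention Bi-LSTM goes through $T_x$ time steps;
- The post-attention LSTM goes through $T_y$ time steps.
The post-attention LSTM passes $s^{\langle t \rangle}, c^{\langle t \rangle}$ from one time step to the next. In the lecture videos, we used only a basic RNN for the post-attention sequence model, so the state was captured by the RNN output activation $s^{\langle t\rangle}$. Since we are using an LSTM here, the LSTM has both the output activation $s^{\langle t\rangle}$ and the hidden cell state $c^{\langle t\rangle}$. Unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention LSTM at time $t$ does not take the generated $y^{\langle t-1\rangle}$ as input; it only takes the context vector and the previous states $s^{\langle t-1\rangle}$ and $c^{\langle t-1\rangle}$ as input. We have designed the model this way because (unlike language generation, where adjacent characters are highly correlated) there is not as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
We use $a^{\langle t\rangle}=[\overrightarrow{a}^{\langle t\rangle}; \overleftarrow{a}^{\langle t\rangle}]$ to represent the concatenation of the forward and backward activations of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy the value of $s^{\langle t-1\rangle}$ $T_x$ times, then a Concatenation to concatenate $s^{\langle t-1\rangle}$ with $a^{\langle t'\rangle}$ in order to compute $e^{\langle t, t'\rangle}$, which is then passed through a softmax to compute $\alpha^{\langle t, t'\rangle}$. We explain below how to use RepeatVector and Concatenation in Keras.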

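Let's implement this model. You will start by implementing two functions: one_step_attention() and model().

one_step_attention()

At step $t$, given all the hidden states of the Bi-LSTM ($[a^{<1>}, a^{<2>}, \ldots, a^{<T_x>}]$) and the previous hidden state of the second LSTM ($s^{<t-1>}$), one_step_attention() computes the attention weights ($[\alpha^{<t,1>}, \alpha^{<t,2>}, \ldots, \alpha^{<t,T_x>}]$) and outputs the context vector (see the diagram on the right above):
$context^{<t>} = \sum_{t' = 1}^{T_x} \alpha^{<t,t'>}a^{<t'>}\tag{1}$
Note that in this exercise we denote the attention context by $context^{\langle t\rangle}$. In the lecture videos the context was denoted $c^{\langle t\rangle}$, but here we call it $context^{\langle t\rangle}$ to avoid confusion with the internal memory cell variable of the (post-attention) LSTM, which is sometimes also denoted $c^{\langle t\rangle}$.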
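
To make the computation behind equation (1) concrete, here is a minimal NumPy sketch of one attention step for a single example. It mirrors the repeat, concatenate, two dense layers, softmax and weighted-sum sequence described above, but it is not the Keras implementation used in this assignment: attention_step and the randomly initialised weights W1, b1, W2, b2 are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(a, s_prev, W1, b1, W2, b2):
    """a: (Tx, 2*n_a) encoder states, s_prev: (n_s,) previous decoder state."""
    Tx = a.shape[0]
    s_rep = np.repeat(s_prev[None, :], Tx, axis=0)   # (Tx, n_s), like RepeatVector
    concat = np.concatenate([a, s_rep], axis=-1)     # (Tx, 2*n_a + n_s)
    e = np.tanh(concat @ W1 + b1)                    # intermediate energies, (Tx, 10)
    energies = np.maximum(e @ W2 + b2, 0.0)          # relu, (Tx, 1)
    alphas = softmax(energies, axis=0)               # normalise over the Tx input steps
    context = (alphas * a).sum(axis=0)               # weighted sum of states, (2*n_a,)
    return alphas, context

rng = np.random.default_rng(0)
Tx, n_a, n_s = 30, 32, 64
a = rng.normal(size=(Tx, 2 * n_a))
s_prev = np.zeros(n_s)
W1 = rng.normal(scale=0.1, size=(2 * n_a + n_s, 10))
b1 = np.zeros(10)
W2 = rng.normal(scale=0.1, size=(10, 1))
b2 = np.zeros(1)

alphas, context = attention_step(a, s_prev, W1, b1, W2, b2)
print(alphas.shape, context.shape)
```

The softmax is taken over the input time axis, so the weights for one output step sum to 1 and the context is a convex combination of the Bi-LSTM states.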

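model()

Implements the entire model. It first runs the input through a Bi-LSTM to get back $[a^{<1>}, a^{<2>}, \ldots, a^{<T_x>}]$. Then it calls one_step_attention() $T_y$ times (in a for loop). At each iteration of this loop, it feeds the computed context vector $context^{<t>}$ to the second LSTM, and runs the LSTM's output through a dense layer with softmax activation to generate a prediction $\hat{y}^{<t>}$.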

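Exercise: implement one_step_attention()

The function model() calls the layers in one_step_attention() $T_y$ times using a for loop, and it is important that all $T_y$ copies have the same weights. That is, the weights should not be re-initialized each time; all $T_y$ steps share them. Here is how to implement layers with shareable weights in Keras:

Define the layer objects (for example, as global variables)
Call these objects when propagating the input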
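
The idea of "define once, call many times" can be illustrated without Keras. In the sketch below, TinyDense is a hypothetical stand-in for a layer object, not a Keras class: its weights are created in the constructor, so calling the same object repeatedly reuses them, whereas constructing a new object at every step would produce new weights each time.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in for a layer: weights are created once, at construction."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.W + self.b

x = np.ones((3, 4))

# Shared weights: one object defined outside the loop and called at every step.
shared = TinyDense(4, 2)
shared_outputs = [shared(x) for _ in range(5)]

# Not shared: a new object (and therefore new weights) is built at every step.
fresh_outputs = [TinyDense(4, 2, seed=t)(x) for t in range(5)]

print(all(np.array_equal(shared_outputs[0], o) for o in shared_outputs))
print(np.array_equal(fresh_outputs[0], fresh_outputs[1]))
```

Keras layers behave analogously: a layer object owns its weights, so reusing the same object across the $T_y$ iterations is what makes the steps share parameters.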

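We have defined the layers you need as global variables. Please run the following code to create them.
Please check the Keras documentation to make sure you understand what these layers do: RepeatVector(), Concatenate(), Dense(), Activation(), Dot().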

# Defined shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

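You can now use these layers to implement one_step_attention(). To propagate a Keras tensor object X through one of these layers, use layer(X) (or layer([X,Y]) if it requires multiple inputs). For example, densor2(X) propagates X through the Dense(1) layer defined above.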

# GRADED FUNCTION: one_step_attention

def one_step_attention(a, s_prev):
    """
    Performs one step of attention: outputs a context vector computed as a dot product of the attention
    weights "alphas" and the hidden states "a" of the Bi-LSTM.

    Arguments:
    a -- hidden state output of the Bi-LSTM, tensor of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, tensor of shape (m, n_s)

    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """

    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = repeator(s_prev)

    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = concatenator([a, s_prev])

    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
    e = densor1(concat)

    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = densor2(e)

    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(energies)

    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM cell (≈ 1 line)
    context = dotor([alphas, a])
    ### END CODE HERE ###
    
    return context

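After you have coded the model() function, you will be able to check the expected output of one_step_attention().

Exercise: implement model() as explained in the figure and text above. Again, we have defined global layers that will share weights to be used in model().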

n_a = 32
n_s = 64
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)

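You can now use these layers $T_y$ times in a for loop to generate the outputs, and their parameters will not be re-initialized. You will have to carry out the following steps:

1. Propagate the input into a Bidirectional LSTM.
2. Iterate for $t = 0, \dots, T_y-1$:
2-A. Call one_step_attention() on $[a^{<1>}, a^{<2>}, \ldots, a^{<T_x>}]$ and $s^{<t-1>}$ to get the context vector $context^{<t>}$.
2-B. Feed $context^{<t>}$ to the post-attention LSTM cell. Remember to pass in the previous hidden state $s^{\langle t-1\rangle}$ and cell state $c^{\langle t-1\rangle}$ of this LSTM using initial_state= [previous hidden state, previous cell state]. Get back the new hidden state $s^{<t>}$ and the new cell state $c^{<t>}$.
2-C. Apply a softmax layer to $s^{<t>}$ to get the output.
2-D. Append the output to the list of outputs.
3. Create your Keras model instance; it should have three inputs ("inputs", $s^{<0>}$ and $c^{<0>}$) and output the list "outputs".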
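
The loop in step 2 can be traced with plain NumPy for a single example. This is a simplified sketch under stated assumptions, not the Keras model built below: the attention scoring is reduced to a single linear layer (w_att) instead of the two dense layers, all weights in params are random placeholders, and lstm_step and decode are hypothetical helper names.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, s_prev, c_prev, W, b):
    """One LSTM step. W has shape (len(s_prev) + len(x), 4 * n_s)."""
    n_s = s_prev.shape[0]
    z = np.concatenate([s_prev, x]) @ W + b
    i = sigmoid(z[:n_s])                 # input gate
    f = sigmoid(z[n_s:2 * n_s])          # forget gate
    o = sigmoid(z[2 * n_s:3 * n_s])      # output gate
    g = np.tanh(z[3 * n_s:])             # candidate cell value
    c = f * c_prev + i * g
    s = o * np.tanh(c)
    return s, c

def decode(a, Ty, n_s, p):
    """a: (Tx, 2*n_a) encoder states. Returns a list of Ty probability vectors."""
    Tx = a.shape[0]
    s, c = np.zeros(n_s), np.zeros(n_s)              # s0 and c0
    outputs = []
    for t in range(Ty):
        # 2-A: attention over all encoder states, conditioned on the previous s
        scores = np.concatenate([a, np.tile(s, (Tx, 1))], axis=1) @ p["w_att"]
        alphas = softmax(scores)                     # (Tx,)
        context = alphas @ a                         # (2*n_a,)
        # 2-B: post-attention LSTM step, carrying s and c forward
        s, c = lstm_step(context, s, c, p["W_lstm"], p["b_lstm"])
        # 2-C: softmax output layer applied to the new hidden state
        out = softmax(s @ p["W_out"] + p["b_out"])
        # 2-D: collect the output
        outputs.append(out)
    return outputs

rng = np.random.default_rng(0)
Tx, Ty, n_a, n_s, vocab_size = 30, 10, 32, 64, 11
params = {
    "w_att": rng.normal(scale=0.1, size=2 * n_a + n_s),
    "W_lstm": rng.normal(scale=0.1, size=(n_s + 2 * n_a, 4 * n_s)),
    "b_lstm": np.zeros(4 * n_s),
    "W_out": rng.normal(scale=0.1, size=(n_s, vocab_size)),
    "b_out": np.zeros(vocab_size),
}
a = rng.normal(size=(Tx, 2 * n_a))   # stand-in for the Bi-LSTM outputs
outputs = decode(a, Ty, n_s, params)
print(len(outputs), outputs[0].shape)
```

Note that the same params are used at every iteration, which corresponds to the shared global layers in the Keras version, and that the generated character is never fed back into the decoder.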

The implementation of model() is as follows:

# GRADED FUNCTION: model

def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """

    # Define the inputs of your model with a shape (Tx, human_vocab_size)
    # Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0

    # Initialize empty list of outputs
    outputs = []

    ### START CODE HERE ###

    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)

    # Step 2: Iterate for Ty steps
    for t in range(Ty):

        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a, s)

        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])

        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)

        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)

    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model(inputs=[X, s0, c0], outputs=outputs)

    ### END CODE HERE ###
    
    return model

Run the following code to create the model:

model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

Let's get a summary of the model to check whether it matches the expected output.

model.summary()

Result

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 30, 37)       0
__________________________________________________________________________________________________
s0 (InputLayer)                 (None, 64)           0
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 30, 64)       17920       input_1[0][0]
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 30, 64)       0           s0[0][0]
                                                                 lstm_1[0][0]
                                                                 lstm_1[1][0]
                                                                 lstm_1[2][0]
                                                                 lstm_1[3][0]
                                                                 lstm_1[4][0]
                                                                 lstm_1[5][0]
                                                                 lstm_1[6][0]
                                                                 lstm_1[7][0]
                                                                 lstm_1[8][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 30, 128)      0           bidirectional_1[0][0]
                                                                 repeat_vector_1[0][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[1][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[2][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[3][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[4][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[5][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[6][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[7][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[8][0]
                                                                 bidirectional_1[0][0]
                                                                 repeat_vector_1[9][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 30, 10)       1290        concatenate_1[0][0]
                                                                 concatenate_1[1][0]
                                                                 concatenate_1[2][0]
                                                                 concatenate_1[3][0]
                                                                 concatenate_1[4][0]
                                                                 concatenate_1[5][0]
                                                                 concatenate_1[6][0]
                                                                 concatenate_1[7][0]
                                                                 concatenate_1[8][0]
                                                                 concatenate_1[9][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 30, 1)        11          dense_1[0][0]
                                                                 dense_1[1][0]
                                                                 dense_1[2][0]
                                                                 dense_1[3][0]
                                                                 dense_1[4][0]
                                                                 dense_1[5][0]
                                                                 dense_1[6][0]
                                                                 dense_1[7][0]
                                                                 dense_1[8][0]
                                                                 dense_1[9][0]
__________________________________________________________________________________________________
attention_weights (Activation)  (None, 30, 1)        0           dense_2[0][0]
                                                                 dense_2[1][0]
                                                                 dense_2[2][0]
                                                                 dense_2[3][0]
                                                                 dense_2[4][0]
                                                                 dense_2[5][0]
                                                                 dense_2[6][0]
                                                                 dense_2[7][0]
                                                                 dense_2[8][0]
                                                                 dense_2[9][0]
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1, 64)        0           attention_weights[0][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[1][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[2][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[3][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[4][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[5][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[6][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[7][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[8][0]
                                                                 bidirectional_1[0][0]
                                                                 attention_weights[9][0]
                                                                 bidirectional_1[0][0]
__________________________________________________________________________________________________
c0 (InputLayer)                 (None, 64)           0
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 64), (None,  33024       dot_1[0][0]
                                                                 s0[0][0]
                                                                 c0[0][0]
                                                                 dot_1[1][0]
                                                                 lstm_1[0][0]
                                                                 lstm_1[0][2]
                                                                 dot_1[2][0]
                                                                 lstm_1[1][0]
                                                                 lstm_1[1][2]
                                                                 dot_1[3][0]
                                                                 lstm_1[2][0]
                                                                 lstm_1[2][2]
                                                                 dot_1[4][0]
                                                                 lstm_1[3][0]
                                                                 lstm_1[3][2]
                                                                 dot_1[5][0]
                                                                 lstm_1[4][0]
                                                                 lstm_1[4][2]
                                                                 dot_1[6][0]
                                                                 lstm_1[5][0]
                                                                 lstm_1[5][2]
                                                                 dot_1[7][0]
                                                                 lstm_1[6][0]
                                                                 lstm_1[6][2]
                                                                 dot_1[8][0]
                                                                 lstm_1[7][0]
                                                                 lstm_1[7][2]
                                                                 dot_1[9][0]
                                                                 lstm_1[8][0]
                                                                 lstm_1[8][2]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 11)           715         lstm_1[0][0]
                                                                 lstm_1[1][0]
                                                                 lstm_1[2][0]
                                                                 lstm_1[3][0]
                                                                 lstm_1[4][0]
                                                                 lstm_1[5][0]
                                                                 lstm_1[6][0]
                                                                 lstm_1[7][0]
                                                                 lstm_1[8][0]
                                                                 lstm_1[9][0]
==================================================================================================
Total params: 52,960
Trainable params: 52,960
Non-trainable params: 0
__________________________________________________________________________________________________

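As usual, after creating your model in Keras, you need to compile it and define the loss, optimizer and metrics you want to use.
Compile the model using categorical_crossentropy loss, the Adam optimizer (learning rate = 0.005, $\beta_1 = 0.9$, $\beta_2 = 0.999$, decay = 0.01) and ['accuracy'] as the metric: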

### START CODE HERE ### (≈2 lines)
opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss = 'categorical_crossentropy',optimizer=opt, metrics = ['accuracy'])
### END CODE HERE ###
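
To get a feel for what decay = 0.01 does, the sketch below computes the effective learning rate over training. It assumes the time-based schedule of the classic Keras optimizers, lr_t = lr / (1 + decay * iterations), where one iteration is one batch update; decayed_lr is a hypothetical helper, and the base value 0.005 is the one quoted in the text above.

```python
def decayed_lr(base_lr, decay, iteration):
    """Time-based decay (assumed schedule): lr_t = base_lr / (1 + decay * iteration)."""
    return base_lr / (1.0 + decay * iteration)

base_lr, decay = 0.005, 0.01
steps_per_epoch = 10000 // 100   # m = 10000 examples, batch_size = 100

lr_start = decayed_lr(base_lr, decay, 0)
lr_after_one_epoch = decayed_lr(base_lr, decay, steps_per_epoch)
print(lr_start, lr_after_one_epoch)
```

Under this assumption the learning rate is halved by the end of the first epoch, since decay * iterations reaches 1 after 100 batch updates.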

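The last step is to define all your inputs and outputs and fit the model:

You already have X of shape (m = 10000, $T_x$ = 30) containing the training examples (the model is fed its one-hot version, Xoh).
You need to create s0 and c0, filled with zeros, to initialize your post_activation_LSTM_cell.
Given the model() you coded, "outputs" must be a list of $T_y$ = 10 elements, each of shape (m, 11). Thus outputs[j][i] is the true (one-hot) label of the $j^{th}$ character of the $i^{th}$ training example (X[i]).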

s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
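
The effect of list(Yoh.swapaxes(0,1)) can be checked on a small array. In this sketch Yoh_toy and labels are made-up stand-ins for the real data, with only m = 4 examples.

```python
import numpy as np

m, Ty, vocab_size = 4, 10, 11
rng = np.random.default_rng(0)

# Random integer labels, one per (example, output position).
labels = rng.integers(0, vocab_size, size=(m, Ty))

# One-hot encode them into an array shaped like Yoh: (m, Ty, vocab_size).
Yoh_toy = np.zeros((m, Ty, vocab_size))
Yoh_toy[np.arange(m)[:, None], np.arange(Ty)[None, :], labels] = 1.0

# Swap the example and time axes, then split along the new first axis.
outputs_toy = list(Yoh_toy.swapaxes(0, 1))

print(len(outputs_toy), outputs_toy[0].shape)
```

The result is a list with one (m, vocab_size) array per output position, matching the list of $T_y$ softmax outputs that the model returns.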

Now let's fit the model and run it for one epoch.

model.fit([Xoh, s0, c0], outputs, epochs=1, batch_size=100)

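While training, you can see the loss as well as the accuracy on each of the 10 positions of the output.
The table below gives an example of what the accuracies could be if the batch had two examples:
[Table image: example per-position accuracies for a batch of two examples]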

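We have run this model for longer and saved the weights. Run the code below to load our weights. (By training the model for several minutes, you should be able to obtain a model of similar accuracy, but loading our model will save you time.)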

model.load_weights('models/model.h5')

You can now see the results on new examples.

EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']

for example in EXAMPLES:

    source = string_to_int(example, Tx, human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))).swapaxes(0,1)
    
    prediction = model.predict([source, s0, c0])
    prediction = np.argmax(prediction, axis = -1)

    output = [inv_machine_vocab[int(i)] for i in prediction]
    
    print("source:", example)
    print("output:", ''.join(output))

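Running this raises an error:

ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (37, 30)

This is because the model expects an input of shape (m, 30, 37), while the array actually passed in has shape (37, 30).
The fix is as follows: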

EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']

for example in EXAMPLES:

    source = string_to_int(example, Tx, human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))).swapaxes(0,1)
    source = source.transpose()  # transpose back from (vocab_size, Tx) to (Tx, vocab_size)
    source = np.expand_dims(source, axis=0)  # add a leading batch axis: (1, Tx, vocab_size)

    prediction = model.predict([source, s0, c0])
    prediction = np.argmax(prediction, axis = -1)

    output = [inv_machine_vocab[int(i)] for i in prediction]
    
    print("source:", example)
    print("output:", ''.join(output))

Result

source: 3 May 1979
output: 1979-05-33
source: 5 April 09
output: 2009-04-05
source: 21th of August 2016
output: 2016-08-20
source: Tue 10 Jul 2007
output: 2007-07-10
source: Saturday May 9 2018
output: 2018-05-09
source: March 3 2001
output: 2001-03-03
source: March 3rd 2001
output: 2001-03-03
source: 1 March 2001
output: 2001-03-01
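
The shape changes behind the fix above can be traced on a dummy array. The index values here are placeholders chosen for illustration; only the shapes matter.

```python
import numpy as np

Tx, vocab_size = 30, 37
indices = [3] * Tx                            # placeholder character indices

one_hot = np.eye(vocab_size)[indices]         # (30, 37): one row per input character
swapped = one_hot.swapaxes(0, 1)              # (37, 30): the shape rejected by the model
restored = swapped.transpose()                # (30, 37): swapping the two axes back
batch = np.expand_dims(restored, axis=0)      # (1, 30, 37): batch of one example

print(one_hot.shape, swapped.shape, restored.shape, batch.shape)
```

As the shapes show, transpose() simply undoes the earlier swapaxes(0,1), so the essential part of the fix is the added batch axis.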

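You can also change these examples to test with your own examples.
The next part will give you a better sense of what the attention mechanism is doing, i.e. which part of the input the network is paying attention to when generating a particular output character.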

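3. Visualizing attention (optional)

Since the problem has a fixed output length of 10, it would also be possible to carry out this task using 10 different softmax units to generate the 10 characters of the output. But one advantage of the attention model is that each part of the output (say, the month) knows it needs to depend only on a small part of the input (the characters in the input giving the month). We can visualize which part of the output is looking at which part of the input.

Consider the task of translating "Saturday 9 May 2018" to "2018-05-09". If we visualize the computed $\alpha^{\langle t, t'\rangle}$, we get this:
[Figure: attention map for "Saturday 9 May 2018" -> "2018-05-09"]

Notice how the output ignores the "Saturday" portion of the input. None of the output time steps pays much attention to that portion of the input. We also see that 9 has been translated as 09 and May has been correctly translated into 05, with the output attending to the parts of the input it needs for the translation. The year mostly requires it to attend to the input's "18" in order to generate "2018".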

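3.1 Getting the activations from the network

Let's now visualize the attention values in your network. We will propagate an example through the network, then visualize the values of $\alpha^{\langle t, t'\rangle}$.
To figure out where the attention values are located, let's start by printing a summary of the model.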

model.summary()

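The result has already been shown above.
Navigate through the output of model.summary() above. You can see that the layer named attention_weights outputs the alphas of shape (m, 30, 1) before dot_1 computes the context vector for every time step $t = 0, \ldots, T_y-1$.

The function plot_attention_map() pulls out the attention values from your model and plots them.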

attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday 09 Oct 1993", num = 7, n_s = 64)

Output
[Figure: attention map for "Tuesday 09 Oct 1993"]

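On the generated plot you can observe the values of the attention weights for each character of the predicted output. Examine this plot and check that where the network is paying attention makes sense to you.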
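
Besides looking at the plot, an attention matrix can also be read programmatically. The sketch below uses a hand-made matrix of weights, not values extracted from the trained model, and most_attended is a hypothetical helper: for each output character it reports the input character that received the largest weight.

```python
import numpy as np

def most_attended(alphas, source, target):
    """alphas: (len(target), len(source)). Pair each output char with its top input char."""
    best = alphas.argmax(axis=1)
    return [(target[t], source[best[t]]) for t in range(len(target))]

source = "9 may 1998"
target = "1998"

# Hand-made weights (assumption): each year digit attends mostly to the matching input digit.
alphas = np.full((len(target), len(source)), 0.01)
for t, pos in enumerate([6, 7, 8, 9]):
    alphas[t, pos] = 1.0
alphas /= alphas.sum(axis=1, keepdims=True)   # each row sums to 1, like a softmax output

pairs = most_attended(alphas, source, target)
print(pairs)
```

With real weights from the attention_weights layer, the same kind of summary gives a quick check of whether each output position looks at a sensible part of the input.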

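In the date translation application, you will observe that most of the time attention helps predict the year, and has less impact on predicting the day/month.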

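Congratulations on finishing this assignment.

Machine translation models can be used to map from one sequence to another. They are useful not only for translating human languages (like French->English) but also for tasks like date format translation.
An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
A network using an attention mechanism can translate from inputs of length $T_x$ to outputs of length $T_y$, where $T_x$ and $T_y$ can be different.
You can visualize the attention weights $\alpha^{\langle t, t'\rangle}$ to see what the network is paying attention to while generating each output.

You can now implement an attention model and use it to learn complex mappings from one sequence to another.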

4. Full code

Link

2021-2-6 Andrew Ng, C5 Sequence Models, w3 Sequence Models and Attention Mechanism (Programming Assignment 1: Neural Machine Translation)