deep_learning 04. attention

A word before we start:
Start from the basics, keep learning, and keep at it.
A programmer from Mars who loves life and technology

RNN series

BasicRNNCell

BasicLSTMCell

MultiRNNCell

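Any discussion of RNNs has to cover how attention is used with them. I won't go over the motivation for attention here.

Without further ado, here is the figure (an image with no watermark, and centered at last).

If you have read the previous chapters, the part of the figure below $match$ needs little explanation.
$h_t$ is the output at each time step.
$C_0$ is the final $state$ output.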

In $BahdanauAttention$:
$u^t_l = v^T \tanh(W_1 h_l + W_2 d_t)$
$a^t = softmax(u^t)$
$c^t = \sum_{l=1}^{L} a^t_l h_l$

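$h$ is the output at each time step (written $h_l$ for a single step $l$, with $L$ steps in total)
$d_t$ is the $decoder$ state at decoding step $t$
$v^T$ is a $weights$ vector that has to be learned, as are $W_1$ and $W_2$
The rest is fairly self-explanatory.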
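
As a quick shape check, the three formulas above can be sketched in NumPy. This is only an illustration under assumptions: the sizes are arbitrary, and the random `W1`, `W2` and `v` stand in for weights that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(h, d_t, W1, W2, v):
    """
    h:   [L, H]  encoder outputs, one row per time step
    d_t: [D]     decoder state at decoding step t
    W1:  [H, A], W2: [D, A], v: [A]  (learned in practice, random here)
    """
    # Score for every encoder step l: u_l = v^T tanh(W1 h_l + W2 d_t)
    u = np.tanh(h @ W1 + d_t @ W2) @ v   # [L]
    a = softmax(u)                        # [L], sums to 1
    c = a @ h                             # [H], weighted sum of encoder outputs
    return c, a

rng = np.random.default_rng(0)
L, H, D, A = 5, 4, 3, 6                   # illustrative sizes only
h = rng.normal(size=(L, H))
d_t = rng.normal(size=D)
W1 = rng.normal(size=(H, A))
W2 = rng.normal(size=(D, A))
v = rng.normal(size=A)

c, a = additive_attention(h, d_t, W1, W2, v)
print(a.shape, c.shape, a.sum())
```

The weights `a` form a probability distribution over the `L` encoder steps, and `c` has the same size as a single encoder output.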

In a classification $task$ there is no $decoder$.

Next, I'll walk through the process in detail using a $demo$ together with the figure above.

import tensorflow as tf  # TensorFlow 1.x API (tf.nn.rnn_cell, tf.layers, tf.get_variable)


def attention(inputs, hidden_size, dropout, attention_size):
    """
    :param inputs: [B, T, D] -> [batch_size, sequence_length, embedding_size]
    :param hidden_size: RNN output size
    :param dropout: keep probability, passed to DropoutWrapper as output_keep_prob
                    (a falsy value such as None or 0 disables dropout)
    :param attention_size: attention output size
    :return: (C, A), where C is the attended vector [batch_size, 2 * hidden_size]
             and A holds the attention weights [batch_size, sequence_length]
    """
    fw = tf.nn.rnn_cell.GRUCell(hidden_size, name='fw')
    bw = tf.nn.rnn_cell.GRUCell(hidden_size, name='bw')

    if dropout:
        fw = tf.nn.rnn_cell.DropoutWrapper(fw, output_keep_prob=dropout)
        bw = tf.nn.rnn_cell.DropoutWrapper(bw, output_keep_prob=dropout)

    #   output is a (forward, backward) tuple, each [batch_size, sequence_length, hidden_size]
    output, _ = tf.nn.bidirectional_dynamic_rnn(
        fw,
        bw,
        inputs=inputs,
        dtype=tf.float32
    )

    #   [batch_size, sequence_length, 2 * hidden_size]
    output = tf.concat(output, axis=2)

    #   tanh(W * X + b)
    #   [batch_size, sequence_length, 2 * hidden_size] -> [batch_size, sequence_length, attention_size]
    I = tf.layers.dense(inputs=output, units=attention_size, activation=tf.tanh)

    #   learned scoring vector v
    V = tf.get_variable(name='v_omega', shape=[attention_size], dtype=tf.float32)

    #   [batch_size, sequence_length, attention_size]
    U = tf.multiply(I, V)
    #   sum over attention_size completes the dot product with v
    #   [batch_size, sequence_length]
    U = tf.reduce_sum(U, axis=2)
    #   normalize over the time axis
    #   [batch_size, sequence_length]
    A = tf.nn.softmax(U, axis=1)

    #   multiply is [batch_size, sequence_length, 2 * hidden_size] * [batch_size, sequence_length, 1]
    #   multiply = [batch_size, sequence_length, 2 * hidden_size]
    #   reduce_sum = [batch_size, 2 * hidden_size]
    C = tf.reduce_sum(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)
    return C, A

Once we have the $rnn$ result $output$, we have the $h$ from the formulas above.

I = tf.layers.dense(inputs=output, units=attention_size, activation=tf.tanh)

The line above corresponds to this part of the formula:
$\tanh(W_1h + W_2d_t)$
Since there is no $d_t$ here, it becomes:
$\tanh(W_1h + b)$

    V = tf.get_variable(name='v_omega', shape=[attention_size], dtype=tf.float32)

    #   [batch_size, sequence_length, attention_size]
    U = tf.multiply(I, V)

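These two lines multiply the $tanh$ output elementwise by $v$. Summing over the last axis (the reduce_sum in the next snippet) completes the dot product and gives $u^t$, which is the result after $match$ in the figure.
$u^t = v^T \tanh(W_1h + b)$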

    #   [batch_size, sequence_length]
    U = tf.reduce_sum(U, axis=2)
    #   [batch_size, sequence_length]
    A = tf.nn.softmax(U, axis=1)

After the $softmax$ we have our attention weights, shown as $S_t$ in the figure, corresponding to:
$a^t = softmax(u^t)$
Finally, weight $h$ by the $attention$ and sum.

    #   multiply is [batch_size, sequence_length, 2 * hidden_size] * [batch_size, sequence_length, 1]
    #   multiply = [batch_size, sequence_length, 2 * hidden_size]
    #   reduce_sum = [batch_size, 2 * hidden_size]
    C = tf.reduce_sum(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)
    # C = tf.reduce_mean(tf.multiply(output, tf.expand_dims(A, -1)), axis=1)

This corresponds to:
$c^t = \sum_{l=1}^{L} a^t_l h_l$

The final $C$ is our $vector$ representation of one input, in which the different inputs $x_t$ contribute with different weights.
That is the $demo$ of $attention$ for classification.
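
To check the shapes without building a TensorFlow graph, the pooling steps of the demo can be mirrored in NumPy. This is a sketch under assumptions: `attention_pool` is a hypothetical helper, and random arrays stand in for the RNN outputs and the learned `W`, `b` and `v`.

```python
import numpy as np

def attention_pool(output, W, b, v):
    """
    output: [B, T, 2H]  RNN outputs (the h in the formulas)
    W: [2H, A], b: [A]  dense layer parameters; v: [A] scoring vector
    returns C: [B, 2H] pooled vectors, A: [B, T] attention weights
    """
    I = np.tanh(output @ W + b)                   # [B, T, A]
    U = np.sum(I * v, axis=2)                     # [B, T], dot product with v
    E = np.exp(U - U.max(axis=1, keepdims=True))  # stable softmax over time
    A = E / E.sum(axis=1, keepdims=True)          # [B, T]
    C = np.sum(output * A[:, :, None], axis=1)    # [B, 2H]
    return C, A

rng = np.random.default_rng(0)
B, T, H2, A_size = 2, 7, 8, 5                     # illustrative sizes only
output = rng.normal(size=(B, T, H2))
W = rng.normal(size=(H2, A_size))
b = np.zeros(A_size)

C, A = attention_pool(output, W, b, rng.normal(size=A_size))
print(C.shape, A.shape)

# With v = 0 every score is equal, so attention reduces to mean pooling over time.
C_uniform, A_uniform = attention_pool(output, W, b, np.zeros(A_size))
print(np.allclose(C_uniform, output.mean(axis=1)))
```

The `v = 0` case is a handy sanity check: uniform weights turn the weighted sum into a plain average over the sequence.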
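One last remark: in an $encoder$-$decoder$ setup we have the $rnn$ outputs, i.e. the $h_t$ and $c_0$ in the figure ($c_0$ is also the initial state of the $decoder$). Computing $attention$ from $h_t$ and $c_0$ gives $C$ (the $X_{d-1}$ in the figure below). During decoding, $X_{d-1}$ and $c_0$ serve as the input to the first decoding step, which produces $c_1$. Then $c_1$ and $h_t$ give the input for the next step, and this repeats until the output ends.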
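
The iteration described above can be sketched as a toy loop. Everything here is an assumption made for illustration: the plain tanh update stands in for a real GRU/LSTM cell, the additive score is just one possible $match$ function, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(h, state, W1, W2, v):
    """Additive attention: context over encoder outputs h given a decoder state."""
    a = softmax(np.tanh(h @ W1 + state @ W2) @ v)   # [L]
    return a @ h                                     # [H]

def decode(h, c0, params, steps):
    """Toy decoder loop; the tanh cell below is a stand-in for a GRU/LSTM cell."""
    W1, W2, v, Wx, Ws = params
    state, states = c0, []
    for _ in range(steps):
        x = attend(h, state, W1, W2, v)        # context computed from the current state
        state = np.tanh(x @ Wx + state @ Ws)   # next decoder state (c_1, c_2, ...)
        states.append(state)
    return np.stack(states)                    # [steps, S]

rng = np.random.default_rng(0)
L, H, S, A = 6, 4, 4, 5                        # illustrative sizes only
h = rng.normal(size=(L, H))                    # encoder outputs h_t
c0 = rng.normal(size=S)                        # initial decoder state c_0
params = (rng.normal(size=(H, A)), rng.normal(size=(S, A)), rng.normal(size=A),
          rng.normal(size=(H, S)), rng.normal(size=(S, S)))

states = decode(h, c0, params, steps=3)
print(states.shape)
```

A real decoder would also feed the previously generated token into each step and stop at an end-of-sequence symbol. The fixed `steps` count here only keeps the sketch short.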

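A figure to reinforce the idea:

At this point the $attention$ is pretty much the same as in $BahdanauAttention$.

Of course $attention$ has many variants, which differ mainly in the $match$ step.
In LuongAttention, the $match$ operation looks like this:
$u^t_l = d_t^T W h_l$
$a^t = softmax(u^t)$
$c^t = \sum_{l=1}^{L} a^t_l h_l$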

These two kinds of $attention$ are what are usually called additive $attention$ and multiplicative $attention$.
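
For comparison with the additive form, here is a minimal NumPy sketch of the multiplicative score. The function name, the sizes and the random `W` are assumptions made for illustration. In practice `W` is learned.

```python
import numpy as np

def multiplicative_attention(h, d_t, W):
    """
    h:   [L, H]  encoder outputs
    d_t: [D]     decoder state
    W:   [D, H]  matrix connecting the two spaces (learned in practice)
    """
    u = h @ (d_t @ W)          # u_l = d_t^T W h_l for every step l, shape [L]
    e = np.exp(u - u.max())    # stable softmax
    a = e / e.sum()            # [L]
    c = a @ h                  # [H]
    return c, a

rng = np.random.default_rng(0)
L, H, D = 5, 4, 3              # illustrative sizes only
h = rng.normal(size=(L, H))
d_t = rng.normal(size=D)
W = rng.normal(size=(D, H))

c, a = multiplicative_attention(h, d_t, W)
print(a.shape, c.shape)
```

Only the score changes relative to the additive version. The softmax and the weighted sum are identical, and the score needs one matrix product instead of a small feed-forward layer.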

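This example uses $global$ $attention$, meaning $attention$ is computed over all inputs $X_t$. There is also $local$ $attention$, which computes $attention$ only inside a window around a chosen source position instead of over the whole sequence, which reduces the amount of computation. The difference is whether all $encoder$ states or only some of them are attended to.
See $LuongAttention$ for the details.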
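
The window idea can be sketched as follows. This is a simplified illustration under assumptions: the window center `p_t` is passed in directly, the multiplicative score is reused, and any extra weighting inside the window is left out.

```python
import numpy as np

def local_attention(h, d_t, W, p_t, half_width):
    """
    Attend only to encoder steps in [p_t - half_width, p_t + half_width].
    h: [L, H], d_t: [D], W: [D, H], p_t: int window center
    """
    L = h.shape[0]
    lo = max(0, p_t - half_width)
    hi = min(L, p_t + half_width + 1)
    window = h[lo:hi]                    # [w, H], w <= 2 * half_width + 1
    u = window @ (d_t @ W)               # scores for the window only
    e = np.exp(u - u.max())
    a_win = e / e.sum()                  # softmax over the window
    a = np.zeros(L)
    a[lo:hi] = a_win                     # zero weight outside the window
    c = a_win @ window                   # [H]
    return c, a

rng = np.random.default_rng(0)
L, H, D = 10, 4, 3                       # illustrative sizes only
h = rng.normal(size=(L, H))
d_t = rng.normal(size=D)
W = rng.normal(size=(D, H))

c, a = local_attention(h, d_t, W, p_t=4, half_width=2)
print(np.nonzero(a)[0])                  # indices of the attended steps
```

Only the steps inside the window are scored, so the cost per decoding step depends on the window size rather than on the full sequence length.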

After the folks at $google$ proposed the $transformer$, a more general form of $attention$ appeared:

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$

Mapped onto the notation above:
$Q$ -> $C_0$
$K$ -> $h_t$
$V$ -> $h_t$

If $Q$, $K$ and $V$ all come from the same values, that is $self$-$attention$.
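
A minimal NumPy sketch of this formula, covering both the mapping listed above and the self-attention case. The sizes and random inputs are assumptions for illustration, and the learned projections that a full Transformer applies to its inputs before this step are left out.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v]
    returns output [n_q, d_v] and weights [n_q, n_k]
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # [n_q, n_k]
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 4                      # illustrative sizes only
h = rng.normal(size=(T, d))      # encoder outputs h_t
c0 = rng.normal(size=(1, d))     # final state C_0 used as a single query

ctx, w = scaled_dot_product_attention(c0, h, h)            # Q = C_0, K = V = h_t
self_out, self_w = scaled_dot_product_attention(h, h, h)   # Q = K = V: self-attention
print(ctx.shape, w.shape, self_out.shape, self_w.shape)
```

With a single query the result is one context vector, as in the RNN case. With `Q = K = V` every position gets its own context vector built from the whole sequence.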

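Thanks

For more code, head over to my personal github, which I update from time to time with various frameworks.
The code for this chapter is in code
Feel free to follow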
