A Short Paper a Day ---- Attentive Statistics Pooling for Deep Speaker Embedding

@A Short Paper a Day ---- arXiv:1803.10963v2

Attentive statistics pooling

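This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. The proposed method uses an attention mechanism to give different weights to different frames, and it generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively.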

Core idea

Add an attention mechanism to statistics pooling.

Statistics pooling

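The statistics pooling layer computes the mean vector $\boldsymbol{\mu}$ and, as a second-order statistic, the standard deviation vector $\boldsymbol{\sigma}$ over the frame-level features $h_t$ (t = 1, …, T). The two vectors are concatenated as the output of the pooling layer and passed to the fully connected layers that follow.

Mean: $\boldsymbol{\mu}=\frac{1}{T} \sum_{t}^{T} \boldsymbol{h}_{t}$

Standard deviation: $\boldsymbol{\sigma}=\sqrt{\frac{1}{T}\sum_{t}^{T} \boldsymbol{h}_{t} \odot \boldsymbol{h}_{t}-\boldsymbol{\mu} \odot \boldsymbol{\mu}}$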

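import tensorflow as tf  # TensorFlow 1.x API


def statistic_pooling(inputs, scope=None):
    """
    Statistics pooling
    reference: x-vector
    :param inputs: frame-level features, shape [batch_size, time_steps, features]
    :param scope: optional name scope
    :return: concatenated mean and standard deviation, shape [batch_size, 2 * features]
    """
    with tf.name_scope(scope):
        # mean and variance over the time axis
        mean, variance = tf.nn.moments(inputs, axes=1)
        std = tf.sqrt(variance)
        concat = tf.concat([mean, std], axis=1)

    return concat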
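
As a framework-free sanity check of the two formulas above, here is a minimal pure-Python sketch for a single utterance. The function name `statistics_pooling` and the toy `frames` values are made up for illustration and are not part of the original code; the clamp at zero is an added assumption to absorb rounding error.

```python
import math


def statistics_pooling(frames):
    """Pool T frame vectors (lists of D floats) into [mean..., std...]."""
    T = len(frames)
    D = len(frames[0])
    # mu = (1/T) * sum_t h_t
    mean = [sum(f[d] for f in frames) / T for d in range(D)]
    # (1/T) * sum_t h_t * h_t, element-wise
    second_moment = [sum(f[d] * f[d] for f in frames) / T for d in range(D)]
    # sigma = sqrt(second_moment - mu * mu); clamp at 0 (assumption) for rounding error
    std = [math.sqrt(max(second_moment[d] - mean[d] ** 2, 0.0)) for d in range(D)]
    return mean + std


# Toy utterance: T = 3 frames, D = 2 features (hypothetical values)
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
pooled = statistics_pooling(frames)
print(pooled)  # mean is [3.0, 2.0]; std is about [1.633, 0.0]
```

The second feature is constant across frames, so its standard deviation is zero, while the first feature has variance 8/3.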

Attention mechanism

A learned attention mechanism; reference: arXiv:1409.0743

scalar score: $e_{t}=\boldsymbol{v}^{T} f\left(\boldsymbol{W} \boldsymbol{h}_{t}+\boldsymbol{b}\right)+k$

normalized score: $\alpha_{t}=\frac{\exp \left(e_{t}\right)}{\sum_{\tau}^{T} \exp \left(e_{\tau}\right)}$

weighted mean vector: $\tilde{\boldsymbol{\mu}}=\sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t}$


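def attention(inputs, attention_size, return_alphas=False):
    """
    reference to paper: "Hierarchical Attention Networks for Document Classification"
    :param inputs: tensor of shape [batch_size, time_steps, features], or a tuple of two such tensors (bi-RNN outputs)
    :param attention_size: size of the hidden attention layer
    :param return_alphas: if True, also return the attention weights
    :return: weighted sum of shape [batch_size, features], plus alphas of shape [batch_size, time_steps] if requested
    """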

    # if bi-rnn
    if isinstance(inputs, tuple):
        inputs = tf.concat(inputs, 2)

    # inputs shape [batch_size, time_steps, features]
    hidden_size = inputs.shape[2].value

    # define parameter
    w_2 = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1))
    b_2 = tf.Variable(tf.random_normal([attention_size], stddev=0.1))
    u = tf.Variable(tf.random_normal([attention_size], stddev=0.1))

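    # scores e_t = u^T tanh(W h_t + b); the scalar bias k is omitted (softmax is shift-invariant)
    with tf.name_scope('v'):
        v = tf.tanh(tf.tensordot(inputs, w_2, axes=1) + b_2)  # [batch_size, time_steps, attention_size]
    uv = tf.tensordot(v, u, axes=1, name='uv')  # [batch_size, time_steps]
    alphas = tf.nn.softmax(uv, name='alphas')  # normalize over time_steps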

    # sum the inputs by alphas
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)

    if not return_alphas:
        return output
    else:
        return output, alphas
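
The score formula includes a scalar bias $k$, while the code above has no such term. A small pure-Python sketch suggests why this does not change the weights: adding the same constant to every $e_t$ cancels in the softmax. The `softmax` helper, the `scores` values, and `k = 3.7` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, -1.0, 2.0]  # hypothetical e_t values computed without the bias k
k = 3.7                    # arbitrary scalar bias

alphas = softmax(scores)
alphas_with_k = softmax([s + k for s in scores])

# Largest difference between the two weight vectors; only rounding error remains
max_gap = max(abs(a - b) for a, b in zip(alphas, alphas_with_k))
print(alphas, max_gap)
```

The bias would still matter if the scores were used without normalization, which is not the case here.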

Attentive statistics pooling

Using the attention weights, recompute the mean and the standard deviation:

$\tilde{\boldsymbol{\mu}}=\sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t}$

$\tilde{\boldsymbol{\sigma}}=\sqrt{\sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t} \odot \boldsymbol{h}_{t}-\tilde{\boldsymbol{\mu}} \odot \tilde{\boldsymbol{\mu}}}$
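
A short derivation sketch (not taken from the paper's text) shows that the expression under the square root is the attention-weighted variance. It relies only on $\sum_{t}^{T} \alpha_{t}=1$, which holds because the weights come from a softmax, and on the definition of $\tilde{\boldsymbol{\mu}}$:

$\sum_{t}^{T} \alpha_{t}\left(\boldsymbol{h}_{t}-\tilde{\boldsymbol{\mu}}\right) \odot\left(\boldsymbol{h}_{t}-\tilde{\boldsymbol{\mu}}\right)=\sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t} \odot \boldsymbol{h}_{t}-2 \tilde{\boldsymbol{\mu}} \odot \sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t}+\tilde{\boldsymbol{\mu}} \odot \tilde{\boldsymbol{\mu}} \sum_{t}^{T} \alpha_{t}=\sum_{t}^{T} \alpha_{t} \boldsymbol{h}_{t} \odot \boldsymbol{h}_{t}-\tilde{\boldsymbol{\mu}} \odot \tilde{\boldsymbol{\mu}}$

The left-hand side is a weighted sum of squares, so the quantity is non-negative in exact arithmetic. Small negative values in an implementation come from floating-point rounding, which is why clamping at zero before the square root is a reasonable safeguard.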

[Figure: architecture diagram]

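def attentive_statistic_pooling(inputs, attention_size, scope=None):
    """
    Statistics pooling with an attention mechanism
    reference: arXiv:1803.10963v2
    :param inputs: frame-level features, shape [batch_size, time_steps, features]
    :param attention_size: size of the hidden attention layer
    :param scope: optional name scope
    :return: concatenated weighted mean and weighted std, shape [batch_size, 2 * features]
    """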
    def get_attention_std(inputs, anchor, alphas):
        h_square_with_attention_ = tf.matmul(tf.transpose(tf.square(inputs), [0, 2, 1]),
                                            tf.expand_dims(alphas, -1))
        h_square_with_attention = tf.squeeze(h_square_with_attention_, axis=-1)

        mean_square = tf.square(anchor)
        # floor at a tiny positive value instead of 0.0 so that the sqrt gradient stays finite
        difference = tf.maximum(tf.subtract(h_square_with_attention, mean_square), 1e-12)
        attention_std = tf.sqrt(difference)

        return attention_std


    with tf.name_scope(scope):
        # attention layer output: weighted mean (anchor) and attention weights
        attention_anchor, attention_alphas = attention(inputs,
                                                       attention_size,
                                                       return_alphas=True)

        attention_std = get_attention_std(inputs,
                                          attention_anchor,
                                          attention_alphas)

        concat = tf.concat([attention_anchor, attention_std], axis=1)


    return concat
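
To make the behaviour of the weighted statistics concrete, here is a minimal pure-Python sketch for a single utterance with hand-picked weights. The function name `attentive_statistics`, the toy `frames`, and both weight vectors are assumptions for illustration; in the model the weights come from the learned attention layer.

```python
import math


def attentive_statistics(frames, alphas):
    """Weighted mean and std of T frame vectors under weights that sum to 1."""
    D = len(frames[0])
    # mu~ = sum_t alpha_t * h_t
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(D)]
    # sum_t alpha_t * h_t * h_t, element-wise
    second = [sum(a * f[d] * f[d] for a, f in zip(alphas, frames)) for d in range(D)]
    # sigma~ = sqrt(second - mu~ * mu~); clamp at 0 to absorb rounding error
    std = [math.sqrt(max(second[d] - mean[d] ** 2, 0.0)) for d in range(D)]
    return mean + std


frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]  # T = 3 frames, D = 2 features

# Uniform weights: reduces to plain statistics pooling
uniform = attentive_statistics(frames, [1 / 3, 1 / 3, 1 / 3])

# All weight on the last frame: the mean is that frame and the std collapses to zero
focused = attentive_statistics(frames, [0.0, 0.0, 1.0])
print(uniform, focused)
```

With uniform weights $\alpha_t = 1/T$ the result matches unweighted statistics pooling up to rounding, so attentive statistics pooling can be read as a generalization of it.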