TACONTRON: A Fully End-to-End Text-To-Speech Synthesis Model

通常的TTS模型包含许多模块，例如文本分析，声学模型，音频合成等。而构建这些模块需要大量专业相关的知识以及特征工程，这将花费大量的时间和精力，而且各个模块之间组合在一起也会产生很多新的问题。TACOTRON是一个端到端的深度学习TTS模型，它可以说是将这些模块都放在了一个黑箱子里，我们不用花费大量的时间去了解TTS中需要用的的模块或者领域知识，直接用深度学习的方法训练出一个TTS模型，模型训练完成后，给定input,模型就能生成对应的音频。

TACOTRON是一个端到端的TTS模型，模型核心是seq2seq + attention。模型的输入为一系列文本字向量，输出spectrogram frame, 然后在使用Griffin_lim算法生成对应音频。模型结构如下图：
这里写图片描述

模型结构：

CBHG模块
encoder
decoder
post-processing net and waveform synthesis

Character Embedding

我们知道在训练模型的时候，我们拿到的数据是一条长短不一的(text, audio)的数据，深度学习的核心其实就是大量的矩阵乘法，对于模型而言，文本类型的数据是不被接受的，所以这里我们需要先把文本转化为对应的向量。这里涉及到如下几个操作

构造字典

因为纯文本数据是没法作为深度学习输入的，所以我们首先得把文本转化为一个个对应的向量，这里我使用字典下标作为字典中每一个字对应的id, 然后每一条文本就可以通过遍历字典转化成其对应的向量了。所以字典主要是应用在将文本转化成其在字典中对应的id, 根据语料库构造，这里我使用的方法是根据语料库中的字频构造字典(我使用的是基于语料库中的字构造字典，有的人可能会先分词，基于词构造。不使用基于词是现在就算是最好的分词都会有一些误分词问题，而且基于字还可以在一定程度上缓解OOV的问题)。下面是构造字典的代码：

def create_vocabulary(vocabulary_path, data_paths, max_vocabulary_size, tokenizer=None):
    if not os.path.exists(vocabulary_path):
        print("Creating vocabulary %s from data %s" % (vocabulary_path, str(data_paths)))
        vocab = defaultdict(int)
        for path in data_paths:
            with codecs.open(path, mode="r", encoding="utf-8") as fr:
                counter = 0
                for line in fr:
                    counter += 1
                    if counter % 100000 == 0:
                        print("  processing line %d" % (counter,))
                    tokens = tokenizer(line)
                    for w in tokens:
                        word = re.sub(_DIGIT_RE, " ", w)
#                        word = w
                        vocab[word] += 1

        vocab_list = sorted(vocab, key=vocab.get, reverse=True)
        print("Vocabulary size: %d" % len(vocab_list))
        if len(vocab_list) > max_vocabulary_size:
            vocab_list = vocab_list[:max_vocabulary_size]
        with codecs.open(vocabulary_path, mode="w", encoding="utf-8") as vocab_file:
            for w in vocab_list:
                vocab_file.write(w + "\n")

然后我们就可以将文本数据转化成对应的向量作为模型的输入。

embed layer

光有对应的id，没法很好的表征文本信息，这里就涉及到构造词向量，关于词向量不在说明，网上有很多资料，模型中使用词嵌入层，通过训练不断的学习到语料库中的每个字的词向量，代码如下:

def embed(inputs, vocab_size, num_units, zero_pad=True, scope="embedding", reuse=None):
    '''Embeds a given tensor. 

    Args:
      inputs: A `Tensor` with type `int32` or `int64` containing the ids
         to be looked up in `lookup table`.
      vocab_size: An int. Vocabulary size.
      num_units: An int. Number of embedding hidden units.
      zero_pad: A boolean. If True, all the values of the fist row (id 0)
        should be constant zeros.
      scope: Optional scope for `variable_scope`.  
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A `Tensor` with one more rank than inputs's. The last dimesionality
        should be `num_units`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table', 
                                       dtype=tf.float32, 
                                       shape=[vocab_size, num_units],
                                       initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.01))
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 
                                      lookup_table[1:, :]), 0)
    return tf.nn.embedding_lookup(lookup_table, inputs)

值得注意的是，这里是随机初始化词嵌入层，另一种方法是引入预先在语料库训练的词向量(word2vec)，可以在一定程度上提升模型的效果。

音频特征提取

对于音频，我们主要是提取出它的melspectrogram音频特征。MFCC是一种比较常用的音频特征，对于声音来说，它其实是一个一维的时域信号，直观上很难看出频域的变化规律，我们知道，可以使用傅里叶变化，得到它的频域信息，但是又丢失了时域信息，无法看到频域随时域的变化，这样就没法很好的描述声音，为了解决这个问题，很多时频分析手段应运而生。短时傅里叶，小波，Wigner分布等都是常用的时频域分析方法。这里我们使用短时傅里叶。

所谓短时傅里叶变换，顾名思义，是对短时的信号做傅里叶变化。那么短时的信号怎么得到的? 是长时的信号分帧得来的。这么一想，STFT的原理非常简单，把一段长信号分帧(傅里叶变换适用于分析平稳的信号。我们假设在较短的时间跨度范围内，语音信号的变换是平坦的，这就是为什么要分帧的原因)、加窗，再对每一帧做傅里叶变换（FFT），最后把每一帧的结果沿另一个维度堆叠起来，得到类似于一幅图的二维信号形式。如果我们原始信号是声音信号，那么通过STFT展开得到的二维信号就是所谓的声谱图。

声谱图往往是很大的一张图，为了得到合适大小的声音特征，往往把它通过梅尔标度滤波器组（mel-scale filter banks），变换为梅尔频谱。在梅尔频谱上做倒谱分析（取对数，做DCT变换）就得到了梅尔倒谱。音频MFCC特征提取代码如下，这里主要使用第三方库librosa提取MFCC特征：

def get_spectrograms(sound_file): 
    '''Extracts melspectrogram and log magnitude from given `sound_file`.
    Args:
      sound_file: A string. Full path of a sound file.

    Returns:
      Transposed S: A 2d array. A transposed melspectrogram with shape of (T, n_mels)
      Transposed magnitude: A 2d array.Has shape of (T, 1+hp.n_fft//2)
    '''
    # Loading sound file
    y, sr = librosa.load(sound_file, sr=hp.sr) # or set sr to hp.sr.

    # stft. D: (1+n_fft//2, T)
    D = librosa.stft(y=y,
                     n_fft=hp.n_fft, 
                     hop_length=hp.hop_length, 
                     win_length=hp.win_length) 

    # magnitude spectrogram
    magnitude = np.abs(D) #(1+n_fft/2, T)

    # power spectrogram
    power = magnitude**2 #(1+n_fft/2, T) 

    # mel spectrogram
    S = librosa.feature.melspectrogram(S=power, n_mels=hp.n_mels) #(n_mels, T)

    return np.transpose(S.astype(np.float32)), np.transpose(magnitude.astype(np.float32)) # (T, n_mels), (T, 1+n_fft/2)

Encoder pre-net module

embeding layer之后是一个encoder pre-net模块，它有两个隐藏层，层与层之间的连接均是全连接；第一层的隐藏单元数目与输入单元数目一致，第二层的隐藏单元数目为第一层的一半；两个隐藏层采用的激活函数均为ReLu，并保持0.5的dropout来提高泛化能力
这里写图片描述

def prenet(inputs, is_training=True, scope="prenet", reuse=None):
    '''Prenet for Encoder and Decoder.
    Args:
      inputs: A 3D tensor of shape [N, T, hp.embed_size].
      is_training: A boolean.
      scope: Optional scope for `variable_scope`.  
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3D tensor of shape [N, T, num_units/2].
    '''
    with tf.variable_scope(scope, reuse=reuse):
        outputs = tf.layers.dense(inputs, units=hp.embed_size, activation=tf.nn.relu, name="dense1")
        outputs = tf.nn.dropout(outputs, keep_prob=.5 if is_training==True else 1., name="dropout1")
        outputs = tf.layers.dense(outputs, units=hp.embed_size//2, activation=tf.nn.relu, name="dense2")
        outputs = tf.nn.dropout(outputs, keep_prob=.5 if is_training==True else 1., name="dropout2") 
    return outputs # (N, T, num_units/2)

CBHG Module

CBHG模块由1-D convolution bank ，highway network ，bidirectional GRU 组成。它的功能是从输入中提取有价值的特征，有利于提高模型的泛化能力。

1-D convolution bank：

输入序列首先会经过一个卷积层，注意这个卷积层，它有K个大小不同的1维的filter，其中filter的大小为1,2,3…K。这些大小不同的卷积核提取了长度不同的上下文信息。其实就是n-gram语言模型的思想，K的不同对应了不同的gram, 例如unigrams, bigrams, up to K-grams，然后，将经过不同大小的k个卷积核的输出堆积在一起（注意：在做卷积时，运用了padding，因此这k个卷积核输出的大小均是相同的），也就是把不同的gram提取到的上下文信息组合在一起，下一层为最大池化层，stride为1，width为2。代码如下：

def conv1d_banks(inputs, K=16, is_training=True, scope="conv1d_banks", reuse=None):
    '''Applies a series of conv1d separately.

    Args:
      inputs: A 3d tensor with shape of [N, T, C]
      K: An int. The size of conv1d banks. That is, 
        The `inputs` are convolved with K filters: 1, 2, ..., K.
      is_training: A boolean. This is passed to an argument of `batch_normalize`.

    Returns:
      A 3d tensor with shape of [N, T, K*Hp.embed_size//2].
    '''
    with tf.variable_scope(scope, reuse=reuse):
        outputs = conv1d(inputs, hp.embed_size//2, 1) # k=1
        for k in range(2, K+1): # k = 2...K
            with tf.variable_scope("num_{}".format(k)):
                output = conv1d(inputs, hp.embed_size // 2, k)
                outputs = tf.concat((outputs, output), -1)
        outputs = normalize(outputs, type=hp.norm_type, is_training=is_training, 
                            activation_fn=tf.nn.relu)
    return outputs # (N, T, Hp.embed_size//2*K)

注意ouputs = normalize(outputs, type=hp.norm_type, is_training=is_training, activation_fn=tf.nn.relu)这行代码，这里对output做了batch normalization处理(每个mini_batch中)，至于BN的作用，网上有很多资料，我这里就简单说下，我们知道不加BN的神经网络，每层都会有一个非线性话操作，如sigmoid或者RELU，这样一方面每层的输入分布和上一层都不相同，这样会使得模型的收敛和预测能力下降。其次，对于深层网络而言，这样会带来梯度弥散和梯度爆炸的问题，因为模型在back propogation的时候是依据链式法则，深度越深，问题越严重，BN引入其他参数抹去了w的scale的影响。公式网上都有，这里就不在搬了。

经过池化之后，会再经过两层一维的卷积层。第一个卷积层的filter大小为3，stride为1，采用的激活函数为ReLu；第二个卷积层的filter大小为3，stride为1，没有采用激活函数（在这两个一维的卷积层之间都会进行batch normalization）。

residual connection：

经过卷积层之后，会进行一个residual connection。也就是把卷积层输出的和embeding之后的序列相加起来。使用residual connection也是一个缓解神经网络太深带来的梯度弥散问题的方法。我们知道，在训练神经网络的时候，一个合适的layer size是很重要的，网络太深，会带来梯度弥散的问题，太浅又不能很好的学到特征，residual connection可以缓解网络太深带来的问题，就是把输入和经过卷积后结果相加，这样可以确保经过多层卷积后，没有丢失太多之前输入的信息。代码如下:

enc += prenet_out # (N, T, E/2) # residual connections

这里就是把前面prenet层后经过多层卷积后的结果和prenet的输出相加。

highway network：

下一层输入到highway layers，highway nets的每一层结构为：把输入同时放入到两个一层的全连接网络中，这两个网络的激活函数分别采用了ReLu和sigmoid函数，假定输入为input，ReLu的输出为output1，sigmoid的输出为output2，那么highway layer的输出为output=output1∗output2+input∗（1−output2)。论文中使用了4层highway layer。

代码如下：

def highwaynet(inputs, num_units=None, scope="highwaynet", reuse=None):
    '''Highway networks, see https://arxiv.org/abs/1505.00387

    Args:
      inputs: A 3D tensor of shape [N, T, W].
      num_units: An int or `None`. Specifies the number of units in the highway layer
             or uses the input size if `None`.
      scope: Optional scope for `variable_scope`.  
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3D tensor of shape [N, T, W].
    '''
    if not num_units:
        num_units = inputs.get_shape()[-1]

    with tf.variable_scope(scope, reuse=reuse):
        H = tf.layers.dense(inputs, units=num_units, activation=tf.nn.relu, name="dense1")
        T = tf.layers.dense(inputs, units=num_units, activation=tf.nn.sigmoid, name="dense2")
        C = 1. - T
        outputs = H * T + inputs * C
    return outputs

从代码中我们也能看出highway network公式为：
这里写图片描述
其中C等于1-T，x为输入， y 为对应的输出，T为transfer gate，C为carry gate，其实就是让网络的输出由两部分组成，分别是网络的直接输入以及输入变形后的部分。为什么要使用highway network的节奏呢，其实说白了也是一种减少缓解网络加深带来过拟合问题，以及减少较深网络的训练难度的一个trick。它主要受到LSTM门限机制的启发。

GRU：

然后将输出输入到双向的GRU中，从GRU中输出的结果就是encoder的输出。GRU是RNN的一个变体，它和LSTM一样都使用了门限机制，不同的是它只有更新门和重置门。公式如下：
这里写图片描述

Reset gate

r(t) 负责决定h(t−1) 对new memory h^(t) 的重要性有多大，如果r(t)约等于0 的话，h(t−1) 就不会传递给new memory h^(t)

new memory

h^(t) 是对新的输入x(t) 和上一时刻的hidden state h(t−1) 的总结。计算总结出的新的向量h^(t) 包含上文信息和新的输入x(t).

Update gate

z(t) 负责决定传递多少ht−1给ht 。如果z(t) 约等于1的话，ht−1 几乎会直接复制给ht ，相反，如果z(t) 约等于0， new memory h^(t) 直接传递给ht.

Hidden state:

h(t) 由 h(t−1) 和)h^(t) 相加得到，两者的权重由update gate z(t) 控制。

代码如下:

def gru(inputs, num_units=None, bidirection=False, scope="gru", reuse=None):
    '''Applies a GRU.

    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: An int. The number of hidden units.
      bidirection: A boolean. If True, bidirectional results 
        are concatenated.
      scope: Optional scope for `variable_scope`.  
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      If bidirection is True, a 3d tensor with shape of [N, T, 2*num_units],
        otherwise [N, T, num_units].
    '''
    with tf.variable_scope(scope, reuse=reuse):
        if num_units is None:
            num_units = inputs.get_shape().as_list[-1]

        cell = tf.contrib.rnn.GRUCell(num_units)  
        if bidirection: 
            cell_bw = tf.contrib.rnn.GRUCell(num_units)
            outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell, cell_bw, inputs, 
                                                         dtype=tf.float32)
            return tf.concat(outputs, 2)  
        else:
            outputs, _ = tf.nn.dynamic_rnn(cell, inputs, 
                                           dtype=tf.float32)
            return outputs

到这里，CBHG的部分（encoder）就总结完毕了，CBHG的代码：

def encode(inputs, is_training=True, scope="encoder", reuse=None):
    '''
    Args:
      inputs: A 2d tensor with shape of [N, T], dtype of int32.
      seqlens: A 1d tensor with shape of [N,], dtype of int32.
      masks: A 3d tensor with shape of [N, T, 1], dtype of float32.
      is_training: Whether or not the layer is in training mode.
      scope: Optional scope for `variable_scope`
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A collection of Hidden vectors, whose shape is (N, T, E).
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Load vocabulary 
#        char2idx, idx2char = load_vocab()
        vocab, revocab = load_vocab()

        # Character Embedding
        inputs = embed(inputs, len(vocab), hp.embed_size) # (N, T, E)  

        # Encoder pre-net
        prenet_out = prenet(inputs, is_training=is_training) # (N, T, E/2)

        # Encoder CBHG 
        ## Conv1D bank 
        enc = conv1d_banks(prenet_out, K=hp.encoder_num_banks, is_training=is_training) # (N, T, K * E / 2)

        ### Max pooling
        enc = tf.layers.max_pooling1d(enc, 2, 1, padding="same")  # (N, T, K * E / 2)

        ### Conv1D projections
        enc = conv1d(enc, hp.embed_size//2, 3, scope="conv1d_1") # (N, T, E/2)
        enc = normalize(enc, type=hp.norm_type, is_training=is_training, 
                            activation_fn=tf.nn.relu, scope="norm1")
        enc = conv1d(enc, hp.embed_size//2, 3, scope="conv1d_2") # (N, T, E/2)
        enc = normalize(enc, type=hp.norm_type, is_training=is_training, 
                            activation_fn=None, scope="norm2")
        enc += prenet_out # (N, T, E/2) # residual connections

        ### Highway Nets
        for i in range(hp.num_highwaynet_blocks):
            enc = highwaynet(enc, num_units=hp.embed_size//2, 
                                 scope='highwaynet_{}'.format(i)) # (N, T, E/2)

        ### Bidirectional GRU
        memory = gru(enc, hp.embed_size//2, True) # (N, T, E)

    return memory

decoder模块

decoder模块主要分为三部分：
- pre-net
- Attention-RNN
- Decoder-RNN

Pre-net的结构与encoder中的pre-net相同，主要是对输入做一些非线性变换。

Attention-RNN的结构为一层包含256个GRU的RNN，它将pre-net的输出和attention的输出作为输入，经过GRU单元后输出到decoder-RNN中。

Decode-RNN为两层residual GRU，它的输出为输入与经过GRU单元输出之和。每层同样包含了256个GRU单元。第一步decoder的输入为0矩阵，之后都会把第t步的输出作为第t+1步的输入。(这里paper中使用了一个trick，就是每次decoder的时候，不仅仅预测1帧的数据，而是预测多个非重叠的帧。因为就像我们前面说到的提取音频特征的时候，我们会先分帧，相邻的帧其实是有一定的关联性的，所以每个字符在发音的时候，可能对应了多个帧，因此每个GRU单元输出为多个帧的音频文件。)

为什么这样做呢：
- 减小了model size，比如当前需要预测n个帧，如果每次一个帧的话，就对应了n个GRU，而如果每次预测r个帧的话，只需要n/r个GRU，从而减小了模型的复杂度
- 减少了模型训练和预测的时间
- 提高收敛的速度
其实下两点好处都是第一点好处带来的。代码如下：

def decode1(decoder_inputs, memory, is_training=True, scope="decoder1", reuse=None):
    '''
    Args:
      decoder_inputs: A 3d tensor with shape of [N, T', C'], where C'=hp.n_mels*hp.r, 
        dtype of float32. Shifted melspectrogram of sound files. 
      memory: A 3d tensor with shape of [N, T, C], where C=hp.embed_size.
      is_training: Whether or not the layer is in training mode.
      scope: Optional scope for `variable_scope`
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns
      Predicted melspectrogram tensor with shape of [N, T', C'].
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Decoder pre-net
        dec = prenet(decoder_inputs, is_training=is_training) # (N, T', E/2)

        # Attention RNN
        dec = attention_decoder(dec, memory, num_units=hp.embed_size) # (N, T', E)

        # Decoder RNNs
        dec += gru(dec, hp.embed_size, False, scope="decoder_gru1") # (N, T', E)
        dec += gru(dec, hp.embed_size, False, scope="decoder_gru2") # (N, T', E)

        # Outputs => (N, T', hp.n_mels*hp.r)
        out_dim = decoder_inputs.get_shape().as_list()[-1]
        outputs = tf.layers.dense(dec, out_dim) 

    return outputs

post-processing

和seq2seq网络不通的是，tacotron在decoder-RNN输出之后并没有直将其作为输出通过Griffin-Lim算法合成音频，而是添加了一层post-processing模块。

为什么要添加这一层呢？
首先是因为我们使用了Griffin-Lim重建算法，根据频谱生成音频，Griffin-Lim原理是：我们知道相位是描述波形变化的，我们从频谱生成音频的时候，需要考虑连续帧之间相位变化的规律，如果找不到这个规律，生成的信号和原来的信号肯定是不一样的，Griffin Lim算法解决的就是如何不弄坏左右相邻的幅度谱和自身幅度谱的情况下，求一个近似的相位，因为相位最差和最好情况下天壤之别，所有应该会有一个相位变化的迭代方案会比上一次更好一点，而Griffin Lim算法找到了这个方案。这里说了这么多，其实就是Griffin-Lim算法需要看到所有的帧。post-processing可以在一个线性频率范围内预测幅度谱（spectral magnitude)。

其次，post-processing能看到整个解码的序列，而不像seq2seq那样，只能从左至右的运行。它能够通过正向传播和反向传播的结果来修正每一帧的预测错误。

论文中使用了CBHG的结构来作为post-processing net，前面已经详细介绍过。代码如下：

def decode2(inputs, is_training=True, scope="decoder2", reuse=None):
    '''
    Args:
      inputs: A 3d tensor with shape of [N, T', C'], where C'=hp.n_mels*hp.r, 
        dtype of float32. Log magnitude spectrogram of sound files.
      is_training: Whether or not the layer is in training mode.  
      scope: Optional scope for `variable_scope`
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns
      Predicted magnitude spectrogram tensor with shape of [N, T', C''], 
        where C'' = (1+hp.n_fft//2)*hp.r.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Decoder pre-net
        prenet_out = prenet(inputs, is_training=is_training) # (N, T'', E/2)

        # Decoder Post-processing net = CBHG
        ## Conv1D bank
        dec = conv1d_banks(prenet_out, K=hp.decoder_num_banks, is_training=is_training) # (N, T', E*K/2)

        ## Max pooling
        dec = tf.layers.max_pooling1d(dec, 2, 1, padding="same") # (N, T', E*K/2)

        ## Conv1D projections
        dec = conv1d(dec, hp.embed_size, 3, scope="conv1d_1") # (N, T', E)
        dec = normalize(dec, type=hp.norm_type, is_training=is_training, 
                            activation_fn=tf.nn.relu, scope="norm1")
        dec = conv1d(dec, hp.embed_size//2, 3, scope="conv1d_2") # (N, T', E/2)
        dec = normalize(dec, type=hp.norm_type, is_training=is_training, 
                            activation_fn=None, scope="norm2")
        dec += prenet_out

        ## Highway Nets
        for i in range(4):
            dec = highwaynet(dec, num_units=hp.embed_size//2, 
                                 scope='highwaynet_{}'.format(i)) # (N, T, E/2)

        ## Bidirectional GRU    
        dec = gru(dec, hp.embed_size//2, True) # (N, T', E)  

        # Outputs => (N, T', (1+hp.n_fft//2)*hp.r)
        out_dim = (1+hp.n_fft//2)*hp.r
        outputs = tf.layers.dense(dec, out_dim)

    return outputs

最后使用Griffin-Lim算法来将post-processing net的输出合成为语音。到这里就基本结束了，感兴趣的话可以看下论文加深理解: https://arxiv.org/pdf/1703.10135.pdf

端到端的TTS深度学习模型tacotron(中文语音合成)