A Detailed Walkthrough of the RNN Sample Code in TensorFlow

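The theory behind RNNs was covered in the previous article; this one focuses on how RNNs are implemented in TensorFlow. Unlike Theano, TensorFlow implements RNN cells at a higher level of abstraction, so building an RNN with the TensorFlow API is relatively easy. We start with a few commonly used RNN-related functions in TensorFlow: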

  (1) cell = tf.nn.rnn_cell.BasicLSTMCell(num_units, forget_bias, input_size, state_is_tuple, activation)
      num_units: int, the number of units in the LSTM cell (that is, the number of hidden-layer neurons in the cell);
      forget_bias: float, the bias added to the forget gates (the forget gate is one of the components of an LSTM network);
      input_size: deprecated and unused, so it can be ignored;
      state_is_tuple: if True, the state is a tuple (c_state, m_state). For example, with K stacked layers at each time step, the state has the structure ((c0, m0), (c1, m1), ..., (c(K-1), m(K-1)));
      activation: the activation function used inside the cell;
      Note: this class creates the most basic building block of an RNN. It also has an important method, __call__(self, inputs, state, scope=None), which defines the inputs and outputs when a BasicLSTMCell object is called during forward propagation.
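
To make the roles of num_units and forget_bias more concrete, the following is a minimal pure-Python sketch of a single LSTM step for one example. It is an illustration, not TensorFlow's implementation: the names lstm_step and sigmoid, the weight layout W, and the gate ordering (i, j, f, o) are assumptions chosen for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, c, h, W, b, forget_bias=1.0):
    """One LSTM step for a single example (hypothetical helper).

    x: list of input_size floats; c, h: lists of num_units floats.
    W: (input_size + num_units) rows, each with 4 * num_units columns.
    b: list of 4 * num_units floats.
    Returns (output, (new_c, new_h)), mirroring the (output, state) pair
    that a cell returns when it is called.
    """
    num_units = len(c)
    xh = x + h  # concatenate the input and the previous hidden state
    z = [b[k] + sum(xh[r] * W[r][k] for r in range(len(xh)))
         for k in range(4 * num_units)]
    # Split the pre-activations into input gate, candidate, forget gate, output gate.
    i, j, f, o = (z[n * num_units:(n + 1) * num_units] for n in range(4))
    new_c = [c[u] * sigmoid(f[u] + forget_bias) + sigmoid(i[u]) * math.tanh(j[u])
             for u in range(num_units)]
    new_h = [math.tanh(new_c[u]) * sigmoid(o[u]) for u in range(num_units)]
    return new_h, (new_c, new_h)


# Toy call: input_size = 1, num_units = 2, all weights zero.
W = [[0.0] * 8 for _ in range(3)]
b = [0.0] * 8
output, (new_c, new_h) = lstm_step([0.5], [1.0, -2.0], [0.0, 0.0], W, b)
```

With all weights at zero, the forget gate equals sigmoid(forget_bias), so a positive forget_bias makes the cell keep most of its previous c state at the start of training.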

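  (2) cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)
      cells: list of RNNCells that will be composed in this order (the MultiRNNCell is built from the LSTMCells in this list; here, MultiRNNCell means that the output at each time step is produced by several LSTMCell layers stacked one after another. Each LSTMCell in the list can have its own weights);
      state_is_tuple: same as above;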

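  (3) state = tf.nn.rnn_cell.MultiRNNCell.zero_state(batch_size, dtype)
      batch_size: the mini-batch size;
      dtype: the data type of the returned state;
      Note: this method is called on a cell instance, e.g. cell.zero_state(batch_size, dtype), and returns an all-zero state tensor. The size of the state tensor depends on the number of layers, the number of hidden units, and the batch size. The first two are fixed when the cell object is defined, so only batch_size has to be given here.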
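
The nested state structure described above can be sketched in plain Python. This is only a mock-up of the shape of the state: the local LSTMStateTuple namedtuple and the zero_state helper below are stand-ins written for this example, and nested lists take the place of tensors.

```python
from collections import namedtuple

# Local stand-in that mimics the (c, h) pair stored for each layer.
LSTMStateTuple = namedtuple("LSTMStateTuple", ["c", "h"])


def zero_state(num_layers, batch_size, num_units):
    """Build an all-zero state: one (c, h) pair per layer (hypothetical helper)."""
    def zeros():
        return [[0.0] * num_units for _ in range(batch_size)]
    return tuple(LSTMStateTuple(c=zeros(), h=zeros()) for _ in range(num_layers))


state = zero_state(num_layers=2, batch_size=3, num_units=4)
for layer, (c, h) in enumerate(state):
    print(layer, len(c), len(c[0]), len(h), len(h[0]))
```

The number of layers and num_units come from the cell definition, which is why only batch_size is left to supply when the state is created.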

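The official TensorFlow source code for RNNs is available on GitHub and consists mainly of two files, reader.py and ptb_word_lm.py. This article studies that source code. Because the code is fairly long, the discussion focuses on the parts that may be hard to understand.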

Functions in reader.py

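In NLP, language modeling is a classic application. Before an RNN model is trained, the input data files must be preprocessed: first fix the vocabulary size vocabulary_size, then, based on word frequencies in the training corpus, pick the vocabulary_size most frequent words and map them to 0, ..., vocabulary_size-1. All less frequent words are treated as "unknown" and assigned the index vocabulary_size. Training data usually contains many sentences, and their lengths may differ (a list, or an array of objects, can hold elements of different lengths, so storing the corpus is not a problem). When training finishes, the learned parameters are those that make the sentences in the training corpus as probable as possible. Note that the RNN in this sample is unrolled over a fixed number of time steps, so it takes fixed-length inputs (Theano's scan function supports variable-length inputs, but in practice the inputs are usually padded to a fixed length in advance, because this makes training faster).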
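
The preprocessing described in this paragraph can be sketched as follows. The function names build_vocab and words_to_ids and the toy corpus are made up for this illustration. The sketch follows the paragraph's convention of mapping rare words to the index vocabulary_size, which is not what reader.py itself does.

```python
import collections


def build_vocab(words, vocabulary_size):
    """Map the vocabulary_size most frequent words to 0..vocabulary_size-1."""
    counter = collections.Counter(words)
    # Sort by frequency (descending), then alphabetically to break ties.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked[:vocabulary_size])}


def words_to_ids(words, word_to_id, vocabulary_size):
    """Convert words to ids; out-of-vocabulary words get the id vocabulary_size."""
    return [word_to_id.get(w, vocabulary_size) for w in words]


corpus = "the cat sat on the mat the cat ran".split()
word_to_id = build_vocab(corpus, vocabulary_size=3)
ids = words_to_ids(corpus, word_to_id, vocabulary_size=3)
print(word_to_id)  # {'the': 0, 'cat': 1, 'mat': 2}
print(ids)         # [0, 1, 3, 3, 0, 2, 0, 1, 3]
```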

import os

def ptb_raw_data(data_path=None):
  train_path = os.path.join(data_path, "ptb.train.txt")  # file paths
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")
  # _build_vocab sorts the words by frequency (descending) and breaks ties by
  # the word itself (ascending). It returns a dict whose keys are words and
  # whose values are their unique ids.
  word_to_id = _build_vocab(train_path)
  # _file_to_word_ids converts the contents of a file into a list of ids.
  # Words that are not in word_to_id are skipped. It returns a list of ints,
  # each of which is a word id.
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
  vocabulary = len(word_to_id)  # vocabulary size; 10k for the PTB dataset
  return train_data, valid_data, test_data, vocabulary

import tensorflow as tf

def ptb_producer(raw_data, batch_size, num_steps, name=None):
  # raw_data: one of the raw data outputs from ptb_raw_data.
  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):  # context manager
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
    data_len = tf.size(raw_data)
    # batch_size is the number of words fed in at a single time step. It exists
    # so that the program can exploit the GPU's parallelism.
    batch_len = data_len // batch_size
    # data holds all training examples, one row per parallel stream.
    data = tf.reshape(raw_data[0 : batch_size * batch_len],
                      [batch_size, batch_len])
    # The unrolled RNN takes fixed-length input; the sequence length is num_steps.
    # epoch_size is the number of mini-batches (iterations) in one epoch.
    epoch_size = (batch_len - 1) // num_steps
    assertion = tf.assert_positive(
        epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")
    # tf.control_dependencies makes sure the assertion runs before the ops
    # created inside this context.
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")
    # tf.train.range_input_producer returns a queue holding the ints
    # 0, ..., epoch_size-1. This hides the data input part, so that training
    # code only has to deal with the model and not with reading the data.
    # TensorFlow's queue runner input mechanism was covered in an earlier post.
    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    # Note: the target sequence is the input sequence shifted by one time step,
    # so each target is the word that follows the corresponding input word.
    x = tf.slice(data, [0, i * num_steps], [batch_size, num_steps])
    y = tf.slice(data, [0, i * num_steps + 1], [batch_size, num_steps])
    return x, y
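
The slicing arithmetic in ptb_producer is easier to check on a tiny input. The sketch below reproduces the same index computations with plain Python lists instead of tensors and queues; ptb_batches is a hypothetical helper written for this illustration, not part of reader.py.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs using the same indexing as ptb_producer (sketch)."""
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    # Equivalent of reshaping raw_data[0 : batch_size * batch_len]
    # to [batch_size, batch_len] in row-major order.
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        # Targets are the inputs shifted by one position.
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y


batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
print(len(batches))  # 3
print(batches[0])    # ([[0, 1, 2], [10, 11, 12]], [[1, 2, 3], [11, 12, 13]])
```

The "- 1" in epoch_size leaves room for the shifted targets: the last y slice needs one more element than the last x slice.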

Functions in ptb_word_lm.py

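The functions in ptb_word_lm.py are fairly easy to follow; two places deserve attention.
(1) The __init__() method of the PTBModel class contains the two short code segments shown below.
  In code block 1, the embedding variable inherits the initializer of the enclosing variable_scope, so embedding is initialized with uniformly distributed random numbers. tf.nn.embedding_lookup turns the rank-N input_data into the rank-(N+1) tensor inputs; the extra dimension comes from mapping each word index to its vector in embedding.
  Code block 2 implements forward propagation. At each time step, the state is updated and the output of that step is saved, so in the end we have the outputs of all time steps and the state at the final time step.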

# code block 1 begin....
with tf.device("/cpu:0"):
  embedding = tf.get_variable("embedding", [vocab_size, size], dtype=data_type())
  inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
# code block 1 end....

# code block 2 begin....
outputs = []
state = self._initial_state
with tf.variable_scope("RNN"):
  for time_step in range(num_steps):
    if time_step > 0: tf.get_variable_scope().reuse_variables()
    (cell_output, state) = cell(inputs[:, time_step, :], state)
    outputs.append(cell_output)
# code block 2 end....
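
The two code blocks can be mimicked without TensorFlow to see how the shapes evolve. In the sketch below, embedding_lookup is a hand-written stand-in for tf.nn.embedding_lookup, and toy_cell is a hypothetical placeholder for the LSTM cell (it only keeps a running sum), so only the control flow, not the arithmetic, corresponds to the real model.

```python
def embedding_lookup(embedding, ids):
    """Replace every integer id in a nested list by its embedding row."""
    if isinstance(ids, int):
        return list(embedding[ids])
    return [embedding_lookup(embedding, item) for item in ids]


def toy_cell(x_t, state):
    """Hypothetical cell: the new state is the old state plus the input."""
    new_state = [[s + v for s, v in zip(s_row, x_row)]
                 for s_row, x_row in zip(state, x_t)]
    return new_state, new_state  # (cell_output, state)


embedding = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0]]  # [vocab_size=3, size=2]
input_data = [[1, 2, 2], [0, 1, 0]]                 # [batch_size=2, num_steps=3]
inputs = embedding_lookup(embedding, input_data)    # [2, 3, 2]: one extra dimension

state = [[0.0, 0.0] for _ in input_data]            # initial state, one row per example
outputs = []
for time_step in range(3):
    x_t = [row[time_step] for row in inputs]        # same as inputs[:, time_step, :]
    cell_output, state = toy_cell(x_t, state)
    outputs.append(cell_output)

print(state)  # [[5.0, 50.0], [1.0, 10.0]]
```

After the loop, outputs holds one entry per time step and state is the state at the final time step, which matches the description of code block 2.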

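(2) The run_epoch function contains the code below. Here model.initial_state is a tuple whose elements are LSTMStateTuple(c=(tf.Tensor 'zeros_14:0' shape=() dtype=float32), h=(tf.Tensor 'zeros_15:0' shape=() dtype=float32)). state has a similar structure, except that the c and h in each LSTMStateTuple are concrete arrays. The feed_dict is therefore built at every step within an epoch, so that the final state of the previous mini-batch is fed back in as the initial state of the RNN. You might wonder whether the state is not updated automatically during training. One way to look at it: TensorFlow builds the graph when the model is defined and then, during training, executes along the data flow in that graph. If model.initial_state is not fed, every call to session.run traces back to the statement self._initial_state = cell.zero_state(batch_size, data_type()) and evaluates it as an all-zero matrix, so the state would not be carried over from one training iteration to the next.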

def run_epoch(session, model, eval_op=None, verbose=False):
  """Runs the model on the given data."""
  start_time = time.time()
  costs = 0.0
  iters = 0
  state = session.run(model.initial_state)

  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
  }
  if eval_op is not None:
    fetches["eval_op"] = eval_op

  for step in range(model.input.epoch_size):
    feed_dict = {}
    for i, (c, h) in enumerate(model.initial_state):
      feed_dict[c] = state[i].c
      feed_dict[h] = state[i].h

    vals = session.run(fetches, feed_dict)
    cost = vals["cost"]
    state = vals["final_state"]

    costs += cost
    iters += model.input.num_steps

    if verbose and step % (model.input.epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps" %
            (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
             iters * model.input.batch_size / (time.time() - start_time)))

  return np.exp(costs / iters)
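
The effect of feeding the previous final state back in, and the perplexity computation at the end of run_epoch, can be illustrated with a toy stand-in. In this sketch run_step is a hypothetical replacement for session.run(fetches, feed_dict): the "state" is a single number and the cost is made up, so only the bookkeeping mirrors the real function.

```python
import math


def run_step(initial_state, batch):
    """Hypothetical stand-in for session.run: returns (cost, final_state)."""
    final_state = initial_state + sum(batch)
    cost = 0.5 * len(batch)  # pretend loss summed over the time steps
    return cost, final_state


def run_epoch_sketch(batches, num_steps, carry_state=True):
    zero_state = 0
    state = zero_state
    costs, iters = 0.0, 0
    for batch in batches:
        # Feeding the previous final state plays the role of feed_dict.
        initial = state if carry_state else zero_state
        cost, state = run_step(initial, batch)
        costs += cost
        iters += num_steps
    # Perplexity is exp of the average per-word cost.
    return math.exp(costs / iters), state


batches = [[1, 2, 3], [4, 5, 6]]
ppl, carried = run_epoch_sketch(batches, num_steps=3, carry_state=True)
_, reset = run_epoch_sketch(batches, num_steps=3, carry_state=False)
print(carried, reset)  # 21 15
```

Without carrying the state, the second mini-batch starts again from zero and the information from the first mini-batch is lost, which is the discontinuity that the feed_dict in run_epoch avoids.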

References: https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
            https://www.tensorflow.org/versions/r0.11/tutorials/recurrent/index.html