TensorFlow Seq2Seq Model Notes


0. TensorFlow kept running without touching the GPU...

Embarrassing: the job ran with the CPU pegged while the GPU sat idle. Turned out I had installed the wrong package; it should be tensorflow-gpu.

Code to test whether the GPU is actually being used: https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell
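
A quick check along the lines of that answer (a minimal sketch; device_lib is TensorFlow's internal device-listing module, commonly used for this):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see; a working GPU build shows
# entries like "/device:GPU:0" in addition to the CPU.
print(device_lib.list_local_devices())

# Alternatively, log device placement and watch where ops actually run.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.constant([1.0, 2.0]) + tf.constant([3.0, 4.0])))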


1. Confusion about tf.app.run()

    http://stackoverflow.com/questions/33703624/how-does-tf-app-run-work

    tf.app plays a role similar to Python's argparse: it parses command-line flags before dispatching to main()
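
A minimal sketch of the pattern (the flag names here are made up for illustration):

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer("batch_size", 50, "Batch size for training.")
tf.app.flags.DEFINE_string("data_dir", "/tmp/data", "Where the data lives.")

def main(_):
    # By the time main() runs, tf.app.run() has already parsed argv
    # into FLAGS (much like argparse would).
    print("batch_size =", FLAGS.batch_size)

if __name__ == "__main__":
    tf.app.run()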


2. variable scope and name scope

Variable Scope mechanism: https://www.tensorflow.org/programmers_guide/variable_scope

http://stackoverflow.com/questions/35919020/whats-the-difference-of-name-scope-and-a-variable-scope-in-tensorflow


Key point: Name scopes can be opened in addition to a variable scope, and then they will only affect the names of the ops, but not of variables.

with tf.variable_scope("foo"):
    with tf.name_scope("bar"):
        v = tf.get_variable("v", [1])
        x = 1.0 + v
assert v.name == "foo/v:0"
assert x.op.name == "foo/bar/add"


The difference between scope.original_name_scope and scope.name

http://stackoverflow.com/questions/41756054/tensorflow-variablescope-original-name-scope-vs-name
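
As I understand the answer there, scope.name is the variable scope's reusable name, while scope.original_name_scope is the concrete, uniquified name scope opened for ops on that particular entry. A minimal sketch:

import tensorflow as tf

with tf.variable_scope("foo") as scope1:
    pass
with tf.variable_scope("foo") as scope2:  # enter the same variable scope again
    pass

# The variable scope name is the same both times...
assert scope1.name == scope2.name == "foo"
# ...but each entry opened its own, uniquified name scope for ops.
assert scope1.original_name_scope == "foo/"
assert scope2.original_name_scope == "foo_1/"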


3. Differences between Python 2 and Python 3
While modifying data_utils.py (https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/data_utils.py):
I hadn't paid attention to the version differences: in Python 3 the print statement is gone, replaced by the print() function.
Also, when running under Python 2:
with gfile.GFile(data_path, mode="rb") as f:
      counter = 0
      for line in f:


Here each line still carries its trailing '\n'; after split() it stays attached to the last word that goes into the list, so the list written to file and len(list) end up inconsistent.
For example, writing li = ['a', 'b\n'] to a file: the '\n' adds a line break, so the file comes out as three lines.
On the next line-by-line read, the empty string ('') gets counted as a new word.
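
The fix is simply to strip each line before splitting (a sketch; data_path stands in for the real vocabulary file):

from tensorflow.python.platform import gfile

data_path = "vocab.txt"  # placeholder path
words = []
with gfile.GFile(data_path, mode="rb") as f:
    for line in f:
        # Strip the trailing '\n' first; otherwise split() leaves it glued
        # to the last word ('b\n'), and writing the list back to disk
        # produces an extra blank line that later reads back as a new "word".
        words.extend(line.strip().split())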


4. The TensorFlow Saver class https://www.tensorflow.org/api_docs/python/tf/train/Saver

http://blog.csdn.net/u011500062/article/details/51728830
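
A minimal save/restore sketch (paths are placeholders); this is also what produces the translate.ckpt-16.* files discussed in the next section:

import tensorflow as tf

v = tf.get_variable("v", shape=[2], initializer=tf.zeros_initializer())
saver = tf.train.Saver()  # by default covers all saveable variables

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # With global_step=16 this writes translate.ckpt-16.data-*,
    # translate.ckpt-16.index and translate.ckpt-16.meta.
    saver.save(sess, "/tmp/translate.ckpt", global_step=16)
    # Later: saver.restore(sess, "/tmp/translate.ckpt-16")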


5. Saving the Seq2Seq model

The saved model consists of:

translate.ckpt-16.data-00000-of-00001

translate.ckpt-16.index

translate.ckpt-16.meta

These files are explained in:

https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Y4mzbDAUSec 

http://stackoverflow.com/questions/36195454/what-is-the-tensorflow-checkpoint-meta-file


6. RNN示例

https://uqer.io/community/share/58a9332bf1973300597ae209

http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html


7. List of tensors to a tensor

http://stackoverflow.com/questions/35730161/how-to-convert-a-list-of-tensors-of-dim-n-to-a-tensor-of-dim-n1

http://blog.csdn.net/sherry_up/article/details/52169318
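
The short answer from those links, as far as I can tell, is tf.stack (tf.pack before 1.0), which adds a new leading dimension:

import tensorflow as tf

tensors = [tf.zeros([4]) for _ in range(3)]  # three tensors of shape [4]
stacked = tf.stack(tensors)                  # one tensor of shape [3, 4]
print(stacked.get_shape())                   # (3, 4)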


8. The batch_matmul problem

The intended operation: suppose I have a T x n x k tensor and want to multiply it by a k x k2 matrix, then do a max pool over T and then a mean pool over n. To do this now, I think you need to reshape, do the matmul(), undo the reshape, and then do the pooling.

https://github.com/tensorflow/tensorflow/issues/216

https://www.tensorflow.org/versions/r0.10/api_docs/python/math_ops/matrix_math_functions#batch_matmul

Using it raised AttributeError: 'module' object has no attribute 'batch_matmul'; it turns out 1.0 removed it, and tf.matmul (with appropriately shaped arguments) is the replacement.
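
For the T x n x k case above, a reshape-matmul-reshape sketch (the dimensions are made up); note that tf.matmul on two 3-D tensors also does what batch_matmul used to:

import tensorflow as tf

T, n, k, k2 = 10, 5, 8, 16
x = tf.random_normal([T, n, k])
w = tf.random_normal([k, k2])

# Flatten the leading dims, do a plain matmul, then restore the shape.
y = tf.reshape(tf.matmul(tf.reshape(x, [-1, k]), w), [T, n, k2])

# Max pool over T, then mean pool over n.
pooled = tf.reduce_mean(tf.reduce_max(y, axis=0), axis=0)  # shape [k2]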


9. Cannot feed value of shape (XX) for Tensor u'target/Y:0', which has shape '(YY)'?

First time running into this. Googling says the shape of the feed data doesn't match the placeholder, but the error pointed at the fifth variable (mask5), which made me assume it wasn't an input-feed problem (if it were, why would it only surface at the fifth one?). After a lot of flailing, it turned out to be the input data after all...


10. Loading an existing model

I had always been able to load an existing model from a given directory; then suddenly I couldn't. It took a few hours to notice that the directory used to contain a checkpoint file, which tells the loader two paths:

model_checkpoint_path: "translate.ckpt-101000"
all_model_checkpoint_paths: "translate.ckpt-101000"
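
tf.train.get_checkpoint_state is what reads that checkpoint file; without it, it returns None and nothing gets restored. A minimal sketch (./train_dir is a placeholder):

import tensorflow as tf

v = tf.get_variable("v", shape=[2])
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state("./train_dir")
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        # No checkpoint file in the directory: start from scratch.
        sess.run(tf.global_variables_initializer())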


11. Reading parameter values inside a model

https://www.tensorflow.org/programmers_guide/variables#checkpoint_files:

When you create a Saver object, you can optionally choose names for the variables in the checkpoint files. By default, it uses the value of the tf.Variable.name property for each variable. 

To understand what variables are in a checkpoint, you can use the inspect_checkpoint library, and in particular, the print_tensors_in_checkpoint_file function.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/inspect_checkpoint.py

Usage:

python inspect_checkpoint.py  --file_name=./alpha_easy_nmt/valid_model/translate.ckpt-625000

Output:

Decoder/trg_lookup_table/embedding (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta_1 (DT_FLOAT) [16000,620]


Usage:

python inspect_checkpoint.py  --file_name=./alpha_easy_nmt/valid_model/translate.ckpt-625000 --tensor_name=Decoder/W_sf 

Output:

tensor_name:  Decoder/W_sf
[[ -4.55709170e-07  -9.10816539e-07   4.44753543e-02 ...,  -2.58049741e-02
    4.26506670e-03  -3.64431571e-07]
 [  7.86067460e-07   7.86348721e-07   1.29140466e-02 ...,   7.92008177e-06
    5.49392325e-07   6.99410566e-06]
 [ -5.86683996e-07   5.51591484e-08   9.70983803e-02 ...,   2.75615434e-07
   -4.86231060e-04   1.23817983e-07]
 ...,
 [ -1.40239194e-06  -1.00237912e-06  -1.44313052e-01 ...,  -1.33047411e-06
   -1.17946070e-06  -2.41477892e-07]
 [  1.19242941e-06  -9.48488719e-08  -2.48298571e-02 ...,   1.00101170e-03
   -3.03782895e-03   1.45507602e-06]
 [ -1.27071712e-06  -1.27975386e-06  -2.31240150e-02 ...,  -7.33333752e-02
    2.30671745e-03  -5.72958811e-07]]
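
The same inspection can also be done from Python (the exact argument list of print_tensors_in_checkpoint_file has shifted between TensorFlow versions; this matches the 1.x tool):

from tensorflow.python.tools.inspect_checkpoint import (
    print_tensors_in_checkpoint_file)

print_tensors_in_checkpoint_file(
    file_name="./alpha_easy_nmt/valid_model/translate.ckpt-625000",
    tensor_name="Decoder/W_sf",
    all_tensors=False)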


12. tf.get_variable's default initializer

https://www.tensorflow.org/api_docs/python/tf/get_variable:

If initializer is None (the default), the default initializer passed in the variable scope will be used. If that one is None too, a glorot_uniform_initializer will be used. The initializer can also be a Tensor, in which case the variable is initialized to this value and shape.

Strangely, glorot_uniform_initializer doesn't seem to be mentioned anywhere in the docs; someone did ask about it on GitHub, but nobody answered: https://github.com/tensorflow/tensorflow/issues/7791.
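
A quick sketch of the lookup order described above:

import tensorflow as tf

# No initializer anywhere: falls back to glorot_uniform_initializer
# (i.e. Xavier uniform).
a = tf.get_variable("a", shape=[3, 4])

# An initializer set on the enclosing variable scope takes precedence.
with tf.variable_scope("demo", initializer=tf.constant_initializer(0.5)):
    b = tf.get_variable("b", shape=[3, 4])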


13. System log

13.1 The system's BLEU on NIST06 used to plateau around 20 (both the Theano version and my junior labmate's version reach 34). Printing every beam from beam search showed a lot of over-translation. Digging in, I found a small bug in my system: in beam search I forgot to apply the mask to the source annotations.

Vocab = 16k, batch = 50; for each of the two optimizers BLEU is measured every 3000 batches.

Adam:

72000   BLEU score = 0.2412     BEST BLEU is 0
75000   BLEU score = 0.2377     BEST BLEU is 0.2412
78000   BLEU score = 0.2380     BEST BLEU is 0.2412
81000   BLEU score = 0.2513     BEST BLEU is 0.2412
84000   BLEU score = 0.2231     BEST BLEU is 0.2513
87000   BLEU score = 0.2527     BEST BLEU is 0.2513
90000   BLEU score = 0.2314     BEST BLEU is 0.2527
93000   BLEU score = 0.2498     BEST BLEU is 0.2527
96000   BLEU score = 0.2445     BEST BLEU is 0.2527
99000   BLEU score = 0.2487     BEST BLEU is 0.2527
102000  BLEU score = 0.2497     BEST BLEU is 0.2527
105000  BLEU score = 0.2523     BEST BLEU is 0.2527
108000  BLEU score = 0.2375     BEST BLEU is 0.2527
111000  BLEU score = 0.2380     BEST BLEU is 0.2527
114000  BLEU score = 0.2457     BEST BLEU is 0.2527
117000  BLEU score = 0.2525     BEST BLEU is 0.2527
120000  BLEU score = 0.2519     BEST BLEU is 0.2527
123000  BLEU score = 0.2491     BEST BLEU is 0.2527
126000  BLEU score = 0.2391     BEST BLEU is 0.2527
129000  BLEU score = 0.2304     BEST BLEU is 0.2527
132000  BLEU score = 0.2618     BEST BLEU is 0.2527
135000  BLEU score = 0.2489     BEST BLEU is 0.2618
138000  BLEU score = 0.2458     BEST BLEU is 0.2618
141000  BLEU score = 0.2505     BEST BLEU is 0.2618
144000  BLEU score = 0.2558     BEST BLEU is 0.2618
147000  BLEU score = 0.2492     BEST BLEU is 0.2618
150000  BLEU score = 0.2463     BEST BLEU is 0.2618
153000  BLEU score = 0.2586     BEST BLEU is 0.2618
156000  BLEU score = 0.2495     BEST BLEU is 0.2618
159000  BLEU score = 0.2568     BEST BLEU is 0.2618
162000  BLEU score = 0.2571     BEST BLEU is 0.2618
165000  BLEU score = 0.2611     BEST BLEU is 0.2618
168000  BLEU score = 0.2508     BEST BLEU is 0.2618
171000  BLEU score = 0.2450     BEST BLEU is 0.2618
174000  BLEU score = 0.2459     BEST BLEU is 0.2618
177000  BLEU score = 0.2579     BEST BLEU is 0.2618
180000  BLEU score = 0.2580     BEST BLEU is 0.2618
183000  BLEU score = 0.2520     BEST BLEU is 0.2618
186000  BLEU score = 0.2730     BEST BLEU is 0.2618
189000  BLEU score = 0.2430     BEST BLEU is 0.273
192000  BLEU score = 0.2571     BEST BLEU is 0.273
195000  BLEU score = 0.2541     BEST BLEU is 0.273
198000  BLEU score = 0.2471     BEST BLEU is 0.273
201000  BLEU score = 0.2491     BEST BLEU is 0.273
204000  BLEU score = 0.2589     BEST BLEU is 0.273
207000  BLEU score = 0.2523     BEST BLEU is 0.273
210000  BLEU score = 0.2536     BEST BLEU is 0.273
213000  BLEU score = 0.2557     BEST BLEU is 0.273
216000  BLEU score = 0.2457     BEST BLEU is 0.273
219000  BLEU score = 0.2661     BEST BLEU is 0.273
222000  BLEU score = 0.2515     BEST BLEU is 0.273
225000  BLEU score = 0.2644     BEST BLEU is 0.273
228000  BLEU score = 0.2616     BEST BLEU is 0.273
231000  BLEU score = 0.2554     BEST BLEU is 0.273
234000  BLEU score = 0.2621     BEST BLEU is 0.273
237000  BLEU score = 0.2519     BEST BLEU is 0.273
240000  BLEU score = 0.2440     BEST BLEU is 0.273
243000  BLEU score = 0.2572     BEST BLEU is 0.273
246000  BLEU score = 0.2488     BEST BLEU is 0.273
249000  BLEU score = 0.2631     BEST BLEU is 0.273
252000  BLEU score = 0.2584     BEST BLEU is 0.273
255000  BLEU score = 0.2570     BEST BLEU is 0.273
258000  BLEU score = 0.2581     BEST BLEU is 0.273
261000  BLEU score = 0.2510     BEST BLEU is 0.273
264000  BLEU score = 0.2476     BEST BLEU is 0.273
267000  BLEU score = 0.2667     BEST BLEU is 0.273
270000  BLEU score = 0.2689     BEST BLEU is 0.273
273000  BLEU score = 0.2596     BEST BLEU is 0.273
276000  BLEU score = 0.2592     BEST BLEU is 0.273
279000  BLEU score = 0.2617     BEST BLEU is 0.273
282000  BLEU score = 0.2652     BEST BLEU is 0.273
285000  BLEU score = 0.2651     BEST BLEU is 0.273
288000  BLEU score = 0.2732     BEST BLEU is 0.273
291000  BLEU score = 0.2505     BEST BLEU is 0.2732
294000  BLEU score = 0.2545     BEST BLEU is 0.2732
297000  BLEU score = 0.2737     BEST BLEU is 0.2732
300000  BLEU score = 0.2662     BEST BLEU is 0.2737

Adadelta:

72000   BLEU score = 0.1732     BEST BLEU is 0
75000   BLEU score = 0.1752     BEST BLEU is 0.1732
78000   BLEU score = 0.1888     BEST BLEU is 0.1752
81000   BLEU score = 0.1771     BEST BLEU is 0.1888
84000   BLEU score = 0.1876     BEST BLEU is 0.1888
87000   BLEU score = 0.1968     BEST BLEU is 0.1888
90000   BLEU score = 0.1664     BEST BLEU is 0.1968
93000   BLEU score = 0.2059     BEST BLEU is 0.1968
96000   BLEU score = 0.1816     BEST BLEU is 0.2059
99000   BLEU score = 0.2098     BEST BLEU is 0.2059
102000  BLEU score = 0.2086     BEST BLEU is 0.2098
105000  BLEU score = 0.2029     BEST BLEU is 0.2098
108000  BLEU score = 0.2222     BEST BLEU is 0.2098
111000  BLEU score = 0.1929     BEST BLEU is 0.2222
114000  BLEU score = 0.1951     BEST BLEU is 0.2222
117000  BLEU score = 0.2212     BEST BLEU is 0.2222
120000  BLEU score = 0.2111     BEST BLEU is 0.2222
123000  BLEU score = 0.1981     BEST BLEU is 0.2222
126000  BLEU score = 0.2054     BEST BLEU is 0.2222
129000  BLEU score = 0.2228     BEST BLEU is 0.2222
132000  BLEU score = 0.2250     BEST BLEU is 0.2228
135000  BLEU score = 0.2061     BEST BLEU is 0.225
138000  BLEU score = 0.2333     BEST BLEU is 0.225
141000  BLEU score = 0.2236     BEST BLEU is 0.2333
144000  BLEU score = 0.2123     BEST BLEU is 0.2333
147000  BLEU score = 0.2242     BEST BLEU is 0.2333
150000  BLEU score = 0.2120     BEST BLEU is 0.2333
153000  BLEU score = 0.2404     BEST BLEU is 0.2333
156000  BLEU score = 0.2348     BEST BLEU is 0.2404
159000  BLEU score = 0.2195     BEST BLEU is 0.2404
162000  BLEU score = 0.2383     BEST BLEU is 0.2404
165000  BLEU score = 0.2192     BEST BLEU is 0.2404
168000  BLEU score = 0.2240     BEST BLEU is 0.2404
171000  BLEU score = 0.2265     BEST BLEU is 0.2404
174000  BLEU score = 0.2211     BEST BLEU is 0.2404
177000  BLEU score = 0.2302     BEST BLEU is 0.2404
180000  BLEU score = 0.2360     BEST BLEU is 0.2404
183000  BLEU score = 0.2161     BEST BLEU is 0.2404
186000  BLEU score = 0.2316     BEST BLEU is 0.2404
189000  BLEU score = 0.2298     BEST BLEU is 0.2404
192000  BLEU score = 0.2316     BEST BLEU is 0.2404
195000  BLEU score = 0.2166     BEST BLEU is 0.2404
198000  BLEU score = 0.2350     BEST BLEU is 0.2404
201000  BLEU score = 0.2295     BEST BLEU is 0.2404
204000  BLEU score = 0.2456     BEST BLEU is 0.2404
207000  BLEU score = 0.2392     BEST BLEU is 0.2456
210000  BLEU score = 0.2311     BEST BLEU is 0.2456
213000  BLEU score = 0.2113     BEST BLEU is 0.2456
216000  BLEU score = 0.2223     BEST BLEU is 0.2456
219000  BLEU score = 0.2258     BEST BLEU is 0.2456
222000  BLEU score = 0.2304     BEST BLEU is 0.2456
225000  BLEU score = 0.2165     BEST BLEU is 0.2456
228000  BLEU score = 0.2336     BEST BLEU is 0.2456
231000  BLEU score = 0.2345     BEST BLEU is 0.2456
234000  BLEU score = 0.2444     BEST BLEU is 0.2456
237000  BLEU score = 0.2310     BEST BLEU is 0.2456
240000  BLEU score = 0.2406     BEST BLEU is 0.2456
243000  BLEU score = 0.2294     BEST BLEU is 0.2456
246000  BLEU score = 0.2469     BEST BLEU is 0.2456
249000  BLEU score = 0.2479     BEST BLEU is 0.2469
252000  BLEU score = 0.2464     BEST BLEU is 0.2479
255000  BLEU score = 0.2490     BEST BLEU is 0.2479
258000  BLEU score = 0.2406     BEST BLEU is 0.249
261000  BLEU score = 0.2477     BEST BLEU is 0.249
264000  BLEU score = 0.2392     BEST BLEU is 0.249
267000  BLEU score = 0.2516     BEST BLEU is 0.249
270000  BLEU score = 0.2521     BEST BLEU is 0.2516
273000  BLEU score = 0.2370     BEST BLEU is 0.2521
276000  BLEU score = 0.2431     BEST BLEU is 0.2521
279000  BLEU score = 0.2548     BEST BLEU is 0.2521
282000  BLEU score = 0.2605     BEST BLEU is 0.2548
285000  BLEU score = 0.2421     BEST BLEU is 0.2605
288000  BLEU score = 0.2446     BEST BLEU is 0.2605
291000  BLEU score = 0.2521     BEST BLEU is 0.2605
294000  BLEU score = 0.2529     BEST BLEU is 0.2605
297000  BLEU score = 0.2453     BEST BLEU is 0.2605
300000  BLEU score = 0.2361     BEST BLEU is 0.2605


13.2 System performance was still below my junior labmate's. Further checking turned up a problem with the init_context variable: for convenience I used the last source annotation as the init context, while my labmate and the earlier Theano version use mean pooling. The Nematus paper (Nematus: a Toolkit for Neural Machine Translation) mentions this too: "We initialize the decoder hidden state with the mean of the source annotation, rather than the annotation at the last position of the encoder backward RNN." So my shortcut seems to be wrong.
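
A sketch of what mean pooling with a source mask could look like (all names here, annotations, src_mask, W_init, are illustrative, not from my actual code):

import tensorflow as tf

def mean_pool_init_state(annotations, src_mask, state_size):
    """annotations: [batch, src_len, hidden]; src_mask: [batch, src_len]."""
    mask = tf.expand_dims(src_mask, 2)                 # [batch, src_len, 1]
    # Masked mean over source positions, so padding does not dilute it.
    mean_ctx = (tf.reduce_sum(annotations * mask, axis=1) /
                tf.reduce_sum(mask, axis=1))           # [batch, hidden]
    hidden = mean_ctx.get_shape()[1].value
    W_init = tf.get_variable("W_init", [hidden, state_size])
    return tf.tanh(tf.matmul(mean_ctx, W_init))        # [batch, state_size]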


13.3 


14. The dropout problem

http://blog.csdn.net/wangxinginnlp/article/details/72649820



*. Tracing GRU+Embedding

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell_impl.py
class _RNNCell(object):
  """Abstract object representing an RNN cell.
  Every `RNNCell` must have the properties below and implement `__call__` with
  the following signature.
  This definition of cell differs from the definition used in the literature.
  In the literature, 'cell' refers to an object with a single scalar output.
  This definition refers to a horizontal array of such units.
  An RNN cell, in the most abstract setting, is anything that has
  a state and performs some operation that takes a matrix of inputs.
  This operation results in an output matrix with `self.output_size` columns.
  If `self.state_size` is an integer, this operation also results in a new
  state matrix with `self.state_size` columns.  If `self.state_size` is a
  tuple of integers, then it results in a tuple of `len(state_size)` state
  matrices, each with a column size corresponding to values in `state_size`.
  """

  def __call__(self, inputs, state, scope=None):
    """Run this RNN cell on inputs, starting from the given state.
    Args:
      inputs: `2-D` tensor with shape `[batch_size x input_size]`.
      state: if `self.state_size` is an integer, this should be a `2-D Tensor`
        with shape `[batch_size x self.state_size]`.  Otherwise, if
        `self.state_size` is a tuple of integers, this should be a tuple
        with shapes `[batch_size x s] for s in self.state_size`.
      scope: VariableScope for the created subgraph; defaults to class name.
    Returns:
      A pair containing:
      - Output: A `2-D` tensor with shape `[batch_size x self.output_size]`.
      - New state: Either a single `2-D` tensor, or a tuple of tensors matching
        the arity and shapes of `state`.
    """
    raise NotImplementedError("Abstract method")

  @property
  def state_size(self):
    """size(s) of state(s) used by this cell.
    It can be represented by an Integer, a TensorShape or a tuple of Integers
    or TensorShapes.
    """
    raise NotImplementedError("Abstract method")

  @property
  def output_size(self):
    """Integer or TensorShape: size of outputs produced by this cell."""
    raise NotImplementedError("Abstract method")

  def zero_state(self, batch_size, dtype):
    """Return zero-filled state tensor(s).
    Args:
      batch_size: int, float, or unit Tensor representing the batch size.
      dtype: the data type to use for the state.
    Returns:
      If `state_size` is an int or TensorShape, then the return value is a
      `N-D` tensor of shape `[batch_size x state_size]` filled with zeros.
      If `state_size` is a nested list or tuple, then the return value is
      a nested list or tuple (of the same structure) of `2-D` tensors with
      the shapes `[batch_size x s]` for each s in `state_size`.
    """
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      state_size = self.state_size
      return _zero_state_tensors(state_size, batch_size, dtype)



https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py

_BIAS_VARIABLE_NAME = "biases"
_WEIGHTS_VARIABLE_NAME = "weights"

'''
matmul(): multiplies matrix `a` by matrix `b`, producing `a` * `b`.
concat(): Concatenates tensors along one dimension.
'''

def _linear(args, output_size, bias, bias_start=0.0):
  """Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.
  Args:
    args: a 2D Tensor or a list of 2D, batch x n, Tensors.
    output_size: int, second dimension of W[i].
    bias: boolean, whether to add a bias term or not.
    bias_start: starting value to initialize the bias; 0 by default.
  Returns:
    A 2D Tensor with shape [batch x output_size] equal to
    sum_i(args[i] * W[i]), where W[i]s are newly created matrices.
  Raises:
    ValueError: if some of the arguments has unspecified or wrong shape.
  """
  if args is None or (nest.is_sequence(args) and not args):
    raise ValueError("`args` must be specified")
  if not nest.is_sequence(args):
    args = [args]

  # Calculate the total size of arguments on dimension 1.
  total_arg_size = 0
  shapes = [a.get_shape() for a in args]
  for shape in shapes:
    if shape.ndims != 2:
      raise ValueError("linear is expecting 2D arguments: %s" % shapes)
    if shape[1].value is None:
      raise ValueError("linear expects shape[1] to be provided for shape %s, "
                       "but saw %s" % (shape, shape[1]))
    else:
      total_arg_size += shape[1].value

  dtype = [a.dtype for a in args][0]

  # Now the computation.
  scope = vs.get_variable_scope()
  with vs.variable_scope(scope) as outer_scope:
    weights = vs.get_variable(
        _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size], dtype=dtype)
    if len(args) == 1:
      res = math_ops.matmul(args[0], weights)
    else:
      res = math_ops.matmul(array_ops.concat(args, 1), weights)
    if not bias:
      return res
    with vs.variable_scope(outer_scope) as inner_scope:
      inner_scope.set_partitioner(None)
      biases = vs.get_variable(
          _BIAS_VARIABLE_NAME, [output_size],
          dtype=dtype,
          initializer=init_ops.constant_initializer(bias_start, dtype=dtype))
    return nn_ops.bias_add(res, biases)




class GRUCell(RNNCell):
  """Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078)."""

  def __init__(self, num_units, input_size=None, activation=tanh, reuse=None):
    if input_size is not None:
      logging.warn("%s: The input_size parameter is deprecated.", self)
    self._num_units = num_units
    self._activation = activation
    self._reuse = reuse

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def __call__(self, inputs, state, scope=None):
    """Gated recurrent unit (GRU) with nunits cells."""
    with _checked_scope(self, scope or "gru_cell", reuse=self._reuse):
      with vs.variable_scope("gates"):  # Reset gate and update gate.
        # We start with bias of 1.0 to not reset and not update.
        value = sigmoid(_linear(
          [inputs, state], 2 * self._num_units, True, 1.0))
        r, u = array_ops.split(
            value=value,
            num_or_size_splits=2,
            axis=1)
      with vs.variable_scope("candidate"):
        c = self._activation(_linear([inputs, r * state],
                                     self._num_units, True))
      new_h = u * state + (1 - u) * c
    return new_h, new_h
	
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py
class EmbeddingWrapper(RNNCell):
  """Operator adding input embedding to the given cell.
  Note: in many cases it may be more efficient to not use this wrapper,
  but instead concatenate the whole sequence of your inputs in time,
  do the embedding on this batch-concatenated sequence, then split it and
  feed into your RNN.
  """

  def __init__(self, cell, embedding_classes, embedding_size, initializer=None,
               reuse=None):
    """Create a cell with an added input embedding.
    Args:
      cell: an RNNCell, an embedding will be put before its inputs.
      embedding_classes: integer, how many symbols will be embedded.
      embedding_size: integer, the size of the vectors we embed into.
      initializer: an initializer to use when creating the embedding;
        if None, the initializer from variable scope or a default one is used.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope.  If not `True`, and the existing scope already has
        the given variables, an error is raised.
    Raises:
      TypeError: if cell is not an RNNCell.
      ValueError: if embedding_classes is not positive.
    """
    if not isinstance(cell, RNNCell):
      raise TypeError("The parameter cell is not RNNCell.")
    if embedding_classes <= 0 or embedding_size <= 0:
      raise ValueError("Both embedding_classes and embedding_size must be > 0: "
                       "%d, %d." % (embedding_classes, embedding_size))
    self._cell = cell
    self._embedding_classes = embedding_classes
    self._embedding_size = embedding_size
    self._initializer = initializer
    self._reuse = reuse

  @property
  def state_size(self):
    return self._cell.state_size

  @property
  def output_size(self):
    return self._cell.output_size

  def zero_state(self, batch_size, dtype):
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      return self._cell.zero_state(batch_size, dtype)

  def __call__(self, inputs, state, scope=None):
    """Run the cell on embedded inputs."""
    with _checked_scope(self, scope or "embedding_wrapper", reuse=self._reuse):
      with ops.device("/cpu:0"):
        if self._initializer:
          initializer = self._initializer
        elif vs.get_variable_scope().initializer:
          initializer = vs.get_variable_scope().initializer
        else:
          # Default initializer for embeddings should have variance=1.
          sqrt3 = math.sqrt(3)  # Uniform(-sqrt(3), sqrt(3)) has variance=1.
          initializer = init_ops.random_uniform_initializer(-sqrt3, sqrt3)

        if type(state) is tuple:
          data_type = state[0].dtype
        else:
          data_type = state.dtype

        embedding = vs.get_variable(
            "embedding", [self._embedding_classes, self._embedding_size],
            initializer=initializer,
            dtype=data_type)
        embedded = embedding_ops.embedding_lookup(
            embedding, array_ops.reshape(inputs, [-1]))
    return self._cell(embedded, state)

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn.py

def static_rnn(cell, inputs, initial_state=None, dtype=None,
               sequence_length=None, scope=None):
  """Creates a recurrent neural network specified by RNNCell `cell`.
  The simplest form of RNN network generated is:
  ```python
    state = cell.zero_state(...)
    outputs = []
    for input_ in inputs:
      output, state = cell(input_, state)
      outputs.append(output)
    return (outputs, state)
  ```
  However, a few other options are available:
  An initial state can be provided.
  If the sequence_length vector is provided, dynamic calculation is performed.
  This method of calculation does not compute the RNN steps past the maximum
  sequence length of the minibatch (thus saving computational time),
  and properly propagates the state at an example's sequence length
  to the final state output.
  The dynamic calculation performed is, at time `t` for batch row `b`,
  ```python
    (output, state)(b, t) =
      (t >= sequence_length(b))
        ? (zeros(cell.output_size), states(b, sequence_length(b) - 1))
        : cell(input(b, t), state(b, t - 1))
  ```
  Args:
    cell: An instance of RNNCell.
    inputs: A length T list of inputs, each a `Tensor` of shape
      `[batch_size, input_size]`, or a nested tuple of such elements.
    initial_state: (optional) An initial state for the RNN.
      If `cell.state_size` is an integer, this must be
      a `Tensor` of appropriate type and shape `[batch_size, cell.state_size]`.
      If `cell.state_size` is a tuple, this should be a tuple of
      tensors having shapes `[batch_size, s] for s in cell.state_size`.
    dtype: (optional) The data type for the initial state and expected output.
      Required if initial_state is not provided or RNN state has a heterogeneous
      dtype.
    sequence_length: Specifies the length of each sequence in inputs.
      An int32 or int64 vector (tensor) size `[batch_size]`, values in `[0, T)`.
    scope: VariableScope for the created subgraph; defaults to "rnn".
  Returns:
    A pair (outputs, state) where:
    - outputs is a length T list of outputs (one for each input), or a nested
      tuple of such elements.
    - state is the final state
  Raises:
    TypeError: If `cell` is not an instance of RNNCell.
    ValueError: If `inputs` is `None` or an empty list, or if the input depth
      (column size) cannot be inferred from inputs via shape inference.
  """

  if not isinstance(cell, core_rnn_cell.RNNCell):
    raise TypeError("cell must be an instance of RNNCell")
  if not nest.is_sequence(inputs):
    raise TypeError("inputs must be a sequence")
  if not inputs:
    raise ValueError("inputs must not be empty")

  outputs = []
  # Create a new scope in which the caching device is either
  # determined by the parent scope, or is set to place the cached
  # Variable using the same placement as for the rest of the RNN.
  with vs.variable_scope(scope or "rnn") as varscope:
    if varscope.caching_device is None:
      varscope.set_caching_device(lambda op: op.device)

    # Obtain the first sequence of the input
    first_input = inputs
    while nest.is_sequence(first_input):
      first_input = first_input[0]

    # Temporarily avoid EmbeddingWrapper and seq2seq badness
    # TODO(lukaszkaiser): remove EmbeddingWrapper
    if first_input.get_shape().ndims != 1:

      input_shape = first_input.get_shape().with_rank_at_least(2)
      fixed_batch_size = input_shape[0]

      flat_inputs = nest.flatten(inputs)
      for flat_input in flat_inputs:
        input_shape = flat_input.get_shape().with_rank_at_least(2)
        batch_size, input_size = input_shape[0], input_shape[1:]
        fixed_batch_size.merge_with(batch_size)
        for i, size in enumerate(input_size):
          if size.value is None:
            raise ValueError(
                "Input size (dimension %d of inputs) must be accessible via "
                "shape inference, but saw value None." % i)
    else:
      fixed_batch_size = first_input.get_shape().with_rank_at_least(1)[0]

    if fixed_batch_size.value:
      batch_size = fixed_batch_size.value
    else:
      batch_size = array_ops.shape(first_input)[0]
    if initial_state is not None:
      state = initial_state
    else:
      if not dtype:
        raise ValueError("If no initial_state is provided, "
                         "dtype must be specified")
      state = cell.zero_state(batch_size, dtype)

    if sequence_length is not None:  # Prepare variables
      sequence_length = ops.convert_to_tensor(
          sequence_length, name="sequence_length")
      if sequence_length.get_shape().ndims not in (None, 1):
        raise ValueError(
            "sequence_length must be a vector of length batch_size")
      def _create_zero_output(output_size):
        # convert int to TensorShape if necessary
        size = _state_size_with_prefix(output_size, prefix=[batch_size])
        output = array_ops.zeros(
            array_ops.stack(size), _infer_state_dtype(dtype, state))
        shape = _state_size_with_prefix(
            output_size, prefix=[fixed_batch_size.value])
        output.set_shape(tensor_shape.TensorShape(shape))
        return output

      output_size = cell.output_size
      flat_output_size = nest.flatten(output_size)
      flat_zero_output = tuple(
          _create_zero_output(size) for size in flat_output_size)
      zero_output = nest.pack_sequence_as(structure=output_size,
                                          flat_sequence=flat_zero_output)

      sequence_length = math_ops.to_int32(sequence_length)
      min_sequence_length = math_ops.reduce_min(sequence_length)
      max_sequence_length = math_ops.reduce_max(sequence_length)

    for time, input_ in enumerate(inputs):
      if time > 0: varscope.reuse_variables()
      # pylint: disable=cell-var-from-loop
      call_cell = lambda: cell(input_, state)
      # pylint: enable=cell-var-from-loop
      if sequence_length is not None:
        (output, state) = _rnn_step(
            time=time,
            sequence_length=sequence_length,
            min_sequence_length=min_sequence_length,
            max_sequence_length=max_sequence_length,
            zero_output=zero_output,
            state=state,
            call_cell=call_cell,
            state_size=cell.state_size)
      else:
        (output, state) = call_cell()

      outputs.append(output)

    return (outputs, state)

	


*. Tracing embedding_attention_seq2seq

embedding_attention_decoder -> attention_decoder

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
def attention_decoder(decoder_inputs,
                      initial_state,
                      attention_states,
                      cell,
                      output_size=None,
                      num_heads=1,
                      loop_function=None,
                      dtype=None,
                      scope=None,
                      initial_state_attention=False):
  """RNN decoder with attention for the sequence-to-sequence model.
  In this context "attention" means that, during decoding, the RNN can look up
  information in the additional tensor attention_states, and it does this by
  focusing on a few entries from the tensor. This model has proven to yield
  especially good results in a number of sequence-to-sequence tasks. This
  implementation is based on http://arxiv.org/abs/1412.7449 (see below for
  details). It is recommended for complex sequence-to-sequence tasks.
  Args:
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    initial_state: 2D Tensor [batch_size x cell.state_size].
    attention_states: 3D Tensor [batch_size x attn_length x attn_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    output_size: Size of the output vectors; if None, we use cell.output_size.
    num_heads: Number of attention heads that read from attention_states.
    loop_function: If not None, this function will be applied to i-th output
      in order to generate i+1-th input, and decoder_inputs will be ignored,
      except for the first element ("GO" symbol). This can be used for decoding,
      but also for training to emulate http://arxiv.org/abs/1506.03099.
      Signature -- loop_function(prev, i) = next
        * prev is a 2D Tensor of shape [batch_size x output_size],
        * i is an integer, the step number (when advanced control is needed),
        * next is a 2D Tensor of shape [batch_size x input_size].
    dtype: The dtype to use for the RNN initial state (default: tf.float32).
    scope: VariableScope for the created subgraph; default: "attention_decoder".
    initial_state_attention: If False (default), initial attentions are zero.
      If True, initialize the attentions from the initial state and attention
      states -- useful when we wish to resume decoding from a previously
      stored decoder state and attention states.
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors of
        shape [batch_size x output_size]. These represent the generated outputs.
        Output i is computed from input i (which is either the i-th element
        of decoder_inputs or loop_function(output {i-1}, i)) as follows.
        First, we run the cell on a combination of the input and previous
        attention masks:
          cell_output, new_state = cell(linear(input, prev_attn), prev_state).
        Then, we calculate new attention masks:
          new_attn = softmax(V^T * tanh(W * attention_states + U * new_state))
        and then we calculate the output:
          output = linear(cell_output, new_attn).
      state: The state of each decoder cell the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  Raises:
    ValueError: when num_heads is not positive, there are no inputs, shapes
      of attention_states are not set, or input size cannot be inferred
      from the input.
  """
  if not decoder_inputs:
    raise ValueError("Must provide at least 1 input to attention decoder.")
  if num_heads < 1:
    raise ValueError("With less than 1 heads, use a non-attention decoder.")
  if attention_states.get_shape()[2].value is None:
    raise ValueError("Shape[2] of attention_states must be known: %s" %
                     attention_states.get_shape())
  if output_size is None:
    output_size = cell.output_size

  with variable_scope.variable_scope(
      scope or "attention_decoder", dtype=dtype) as scope:
    dtype = scope.dtype

    batch_size = array_ops.shape(decoder_inputs[0])[0]  # Needed for reshaping.
    attn_length = attention_states.get_shape()[1].value
    if attn_length is None:
      attn_length = array_ops.shape(attention_states)[1]
    attn_size = attention_states.get_shape()[2].value

    # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
    hidden = array_ops.reshape(attention_states,
                               [-1, attn_length, 1, attn_size])
    hidden_features = []
    v = []
    attention_vec_size = attn_size  # Size of query vectors for attention.
    for a in xrange(num_heads):
      k = variable_scope.get_variable("AttnW_%d" % a,
                                      [1, 1, attn_size, attention_vec_size])
      hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
      v.append(
          variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))

    state = initial_state

    def attention(query):
      """Put attention masks on hidden using hidden_features and query."""
      ds = []  # Results of attention reads will be stored here.
      if nest.is_sequence(query):  # If the query is a tuple, flatten it.
        query_list = nest.flatten(query)
        for q in query_list:  # Check that ndims == 2 if specified.
          ndims = q.get_shape().ndims
          if ndims:
            assert ndims == 2
        query = array_ops.concat(query_list, 1)
      for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
          y = linear(query, attention_vec_size, True)
          y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
          # Attention mask is a softmax of v^T * tanh(...).
          s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                  [2, 3])
          a = nn_ops.softmax(s)
          # Now calculate the attention-weighted vector d.
          d = math_ops.reduce_sum(
              array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
          ds.append(array_ops.reshape(d, [-1, attn_size]))
      return ds

    outputs = []
    prev = None
    batch_attn_size = array_ops.stack([batch_size, attn_size])
    attns = [
        array_ops.zeros(
            batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
    ]
    for a in attns:  # Ensure the second shape of attention vectors is set.
      a.set_shape([None, attn_size])
    if initial_state_attention:
      attns = attention(initial_state)
    for i, inp in enumerate(decoder_inputs):
      if i > 0:
        variable_scope.get_variable_scope().reuse_variables()
      # If loop_function is set, we use it instead of decoder_inputs.
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      # Merge input and previous attentions into one vector of the right size.
      input_size = inp.get_shape().with_rank(2)[1]
      if input_size.value is None:
        raise ValueError("Could not infer input size from input: %s" % inp.name)
      x = linear([inp] + attns, input_size, True)
      # Run the RNN.
      cell_output, state = cell(x, state)
      # Run the attention mechanism.
      if i == 0 and initial_state_attention:
        with variable_scope.variable_scope(
            variable_scope.get_variable_scope(), reuse=True):
          attns = attention(state)
      else:
        attns = attention(state)

      with variable_scope.variable_scope("AttnOutputProjection"):
        output = linear([cell_output] + attns, output_size, True)
      if loop_function is not None:
        prev = output
      outputs.append(output)

  return outputs, state


def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  """RNN decoder with embedding and attention and a pure-decoding option.
  Args:
    decoder_inputs: A list of 1D batch-sized int32 Tensors (decoder inputs).
    initial_state: 2D Tensor [batch_size x cell.state_size].
    attention_states: 3D Tensor [batch_size x attn_length x attn_size].
    cell: core_rnn_cell.RNNCell defining the cell function.
    num_symbols: Integer, how many symbols come into the embedding.
    embedding_size: Integer, the length of the embedding vector for each symbol.
    num_heads: Number of attention heads that read from attention_states.
    output_size: Size of the output vectors; if None, use output_size.
    output_projection: None or a pair (W, B) of output projection weights and
      biases; W has shape [output_size x num_symbols] and B has shape
      [num_symbols]; if provided and feed_previous=True, each fed previous
      output will first be multiplied by W and added B.
    feed_previous: Boolean; if True, only the first of decoder_inputs will be
      used (the "GO" symbol), and all other decoder inputs will be generated by:
        next = embedding_lookup(embedding, argmax(previous_output)),
      In effect, this implements a greedy decoder. It can also be used
      during training to emulate http://arxiv.org/abs/1506.03099.
      If False, decoder_inputs are used as given (the standard decoder case).
    update_embedding_for_previous: Boolean; if False and feed_previous=True,
      only the embedding for the first symbol of decoder_inputs (the "GO"
      symbol) will be updated by back propagation. Embeddings for the symbols
      generated from the decoder itself remain unchanged. This parameter has
      no effect if feed_previous=False.
    dtype: The dtype to use for the RNN initial states (default: tf.float32).
    scope: VariableScope for the created subgraph; defaults to
      "embedding_attention_decoder".
    initial_state_attention: If False (default), initial attentions are zero.
      If True, initialize the attentions from the initial state and attention
      states -- useful when we wish to resume decoding from a previously
      stored decoder state and attention states.
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing the generated outputs.
      state: The state of each decoder cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  Raises:
    ValueError: When output_projection has the wrong shape.
  """
  if output_size is None:
    output_size = cell.output_size
  if output_projection is not None:
    proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
    proj_biases.get_shape().assert_is_compatible_with([num_symbols])

  with variable_scope.variable_scope(
      scope or "embedding_attention_decoder", dtype=dtype) as scope:

    embedding = variable_scope.get_variable("embedding",
                                            [num_symbols, embedding_size])
    loop_function = _extract_argmax_and_embed(
        embedding, output_projection,
        update_embedding_for_previous) if feed_previous else None
    emb_inp = [
        embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs
    ]
    return attention_decoder(
        emb_inp,
        initial_state,
        attention_states,
        cell,
        output_size=output_size,
        num_heads=num_heads,
        loop_function=loop_function,
        initial_state_attention=initial_state_attention)


def embedding_attention_seq2seq(encoder_inputs,
                                decoder_inputs,
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,
                                output_projection=None,
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  """Embedding sequence-to-sequence model with attention.
  This model first embeds encoder_inputs by a newly created embedding (of shape
  [num_encoder_symbols x input_size]). Then it runs an RNN to encode
  embedded encoder_inputs into a state vector. It keeps the outputs of this
  RNN at every step to use for attention later. Next, it embeds decoder_inputs
  by another newly created embedding (of shape [num_decoder_symbols x
  input_size]). Then it runs attention decoder, initialized with the last
  encoder state, on embedded decoder_inputs and attending to encoder outputs.
  Warning: when output_projection is None, the size of the attention vectors
  and variables will be made proportional to num_decoder_symbols, can be large.
  Args:
    encoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
    decoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    num_encoder_symbols: Integer; number of symbols on the encoder side.
    num_decoder_symbols: Integer; number of symbols on the decoder side.
    embedding_size: Integer, the length of the embedding vector for each symbol.
    num_heads: Number of attention heads that read from attention_states.
    output_projection: None or a pair (W, B) of output projection weights and
      biases; W has shape [output_size x num_decoder_symbols] and B has
      shape [num_decoder_symbols]; if provided and feed_previous=True, each
      fed previous output will first be multiplied by W and added B.
    feed_previous: Boolean or scalar Boolean Tensor; if True, only the first
      of decoder_inputs will be used (the "GO" symbol), and all other decoder
      inputs will be taken from previous outputs (as in embedding_rnn_decoder).
      If False, decoder_inputs are used as given (the standard decoder case).
    dtype: The dtype of the initial RNN state (default: tf.float32).
    scope: VariableScope for the created subgraph; defaults to
      "embedding_attention_seq2seq".
    initial_state_attention: If False (default), initial attentions are zero.
      If True, initialize the attentions from the initial state and attention
      states.
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x num_decoder_symbols] containing the generated
        outputs.
      state: The state of each decoder cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  """
  with variable_scope.variable_scope(
      scope or "embedding_attention_seq2seq", dtype=dtype) as scope:
    dtype = scope.dtype
    # Encoder.
    encoder_cell = copy.deepcopy(cell)
    encoder_cell = core_rnn_cell.EmbeddingWrapper(
        encoder_cell,
        embedding_classes=num_encoder_symbols,
        embedding_size=embedding_size)
    encoder_outputs, encoder_state = core_rnn.static_rnn(
        encoder_cell, encoder_inputs, dtype=dtype)

    # First calculate a concatenation of encoder outputs to put attention on.
    top_states = [
        array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
    ]
    attention_states = array_ops.concat(top_states, 1)

    # Decoder.
    output_size = None
    if output_projection is None:
      cell = core_rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
      output_size = num_decoder_symbols

    if isinstance(feed_previous, bool):
      return embedding_attention_decoder(
          decoder_inputs,
          encoder_state,
          attention_states,
          cell,
          num_decoder_symbols,
          embedding_size,
          num_heads=num_heads,
          output_size=output_size,
          output_projection=output_projection,
          feed_previous=feed_previous,
          initial_state_attention=initial_state_attention)

    # If feed_previous is a Tensor, we construct 2 graphs and use cond.
    def decoder(feed_previous_bool):
      reuse = None if feed_previous_bool else True
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=reuse):
        outputs, state = embedding_attention_decoder(
            decoder_inputs,
            encoder_state,
            attention_states,
            cell,
            num_decoder_symbols,
            embedding_size,
            num_heads=num_heads,
            output_size=output_size,
            output_projection=output_projection,
            feed_previous=feed_previous_bool,
            update_embedding_for_previous=False,
            initial_state_attention=initial_state_attention)
        state_list = [state]
        if nest.is_sequence(state):
          state_list = nest.flatten(state)
        return outputs + state_list

    outputs_and_state = control_flow_ops.cond(feed_previous,
                                              lambda: decoder(True),
                                              lambda: decoder(False))
    outputs_len = len(decoder_inputs)  # Outputs length same as decoder inputs.
    state_list = outputs_and_state[outputs_len:]
    state = state_list[0]
    if nest.is_sequence(encoder_state):
      state = nest.pack_sequence_as(
          structure=encoder_state, flat_sequence=state_list)
    return outputs_and_state[:outputs_len], state
	


*. Tracing model_with_buckets

def sequence_loss_by_example(logits,
                             targets,
                             weights,
                             average_across_timesteps=True,
                             softmax_loss_function=None,
                             name=None):
  """Weighted cross-entropy loss for a sequence of logits (per example).
  Args:
    logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
    targets: List of 1D batch-sized int32 Tensors of the same length as logits.
    weights: List of 1D batch-sized float-Tensors of the same length as logits.
    average_across_timesteps: If set, divide the returned cost by the total
      label weight.
    softmax_loss_function: Function (labels, logits) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
      **Note that to avoid confusion, it is required for the function to accept
      named arguments.**
    name: Optional name for this operation, default: "sequence_loss_by_example".
  Returns:
    1D batch-sized float Tensor: The log-perplexity for each sequence.
  Raises:
    ValueError: If len(logits) is different from len(targets) or len(weights).
  """
  if len(targets) != len(logits) or len(weights) != len(logits):
    raise ValueError("Lengths of logits, weights, and targets must be the same "
                     "%d, %d, %d." % (len(logits), len(weights), len(targets)))
  with ops.name_scope(name, "sequence_loss_by_example",
                      logits + targets + weights):
    log_perp_list = []
    for logit, target, weight in zip(logits, targets, weights):
      if softmax_loss_function is None:
        # TODO(irving,ebrevdo): This reshape is needed because
        # sequence_loss_by_example is called with scalars sometimes, which
        # violates our general scalar strictness policy.
        target = array_ops.reshape(target, [-1])
        crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
            labels=target, logits=logit)
      else:
        crossent = softmax_loss_function(labels=target, logits=logit)
      log_perp_list.append(crossent * weight)
    log_perps = math_ops.add_n(log_perp_list)
    if average_across_timesteps:
      total_size = math_ops.add_n(weights)
      total_size += 1e-12  # Just to avoid division by 0 for all-0 weights.
      log_perps /= total_size
  return log_perps

def sequence_loss(logits,
                  targets,
                  weights,
                  average_across_timesteps=True,
                  average_across_batch=True,
                  softmax_loss_function=None,
                  name=None):
  """Weighted cross-entropy loss for a sequence of logits, batch-collapsed.
  Args:
    logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
    targets: List of 1D batch-sized int32 Tensors of the same length as logits.
    weights: List of 1D batch-sized float-Tensors of the same length as logits.
    average_across_timesteps: If set, divide the returned cost by the total
      label weight.
    average_across_batch: If set, divide the returned cost by the batch size.
    softmax_loss_function: Function (labels, logits) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
      **Note that to avoid confusion, it is required for the function to accept
      named arguments.**
    name: Optional name for this operation, defaults to "sequence_loss".
  Returns:
    A scalar float Tensor: The average log-perplexity per symbol (weighted).
  Raises:
    ValueError: If len(logits) is different from len(targets) or len(weights).
  """
  with ops.name_scope(name, "sequence_loss", logits + targets + weights):
    cost = math_ops.reduce_sum(
        sequence_loss_by_example(
            logits,
            targets,
            weights,
            average_across_timesteps=average_across_timesteps,
            softmax_loss_function=softmax_loss_function))
    if average_across_batch:
      batch_size = array_ops.shape(targets[0])[0]
      return cost / math_ops.cast(batch_size, cost.dtype)
    else:
      return cost


def model_with_buckets(encoder_inputs,
                       decoder_inputs,
                       targets,
                       weights,
                       buckets,
                       seq2seq,
                       softmax_loss_function=None,
                       per_example_loss=False,
                       name=None):
  """Create a sequence-to-sequence model with support for bucketing.
  The seq2seq argument is a function that defines a sequence-to-sequence model,
  e.g., seq2seq = lambda x, y: basic_rnn_seq2seq(
      x, y, core_rnn_cell.GRUCell(24))
  Args:
    encoder_inputs: A list of Tensors to feed the encoder; first seq2seq input.
    decoder_inputs: A list of Tensors to feed the decoder; second seq2seq input.
    targets: A list of 1D batch-sized int32 Tensors (desired output sequence).
    weights: List of 1D batch-sized float-Tensors to weight the targets.
    buckets: A list of pairs of (input size, output size) for each bucket.
    seq2seq: A sequence-to-sequence model function; it takes 2 input that
      agree with encoder_inputs and decoder_inputs, and returns a pair
      consisting of outputs and states (as, e.g., basic_rnn_seq2seq).
    softmax_loss_function: Function (labels, logits) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
      **Note that to avoid confusion, it is required for the function to accept
      named arguments.**
    per_example_loss: Boolean. If set, the returned loss will be a batch-sized
      tensor of losses for each sequence in the batch. If unset, it will be
      a scalar with the averaged loss from all examples.
    name: Optional name for this operation, defaults to "model_with_buckets".
  Returns:
    A tuple of the form (outputs, losses), where:
      outputs: The outputs for each bucket. Its j'th element consists of a list
        of 2D Tensors. The shape of output tensors can be either
        [batch_size x output_size] or [batch_size x num_decoder_symbols]
        depending on the seq2seq model used.
      losses: List of scalar Tensors, representing losses for each bucket, or,
        if per_example_loss is set, a list of 1D batch-sized float Tensors.
  Raises:
    ValueError: If length of encoder_inputs, targets, or weights is smaller
      than the largest (last) bucket.
  """
  if len(encoder_inputs) < buckets[-1][0]:
    raise ValueError("Length of encoder_inputs (%d) must be at least that of la"
                     "st bucket (%d)." % (len(encoder_inputs), buckets[-1][0]))
  if len(targets) < buckets[-1][1]:
    raise ValueError("Length of targets (%d) must be at least that of last"
                     "bucket (%d)." % (len(targets), buckets[-1][1]))
  if len(weights) < buckets[-1][1]:
    raise ValueError("Length of weights (%d) must be at least that of last"
                     "bucket (%d)." % (len(weights), buckets[-1][1]))

  all_inputs = encoder_inputs + decoder_inputs + targets + weights
  losses = []
  outputs = []
  with ops.name_scope(name, "model_with_buckets", all_inputs):
    for j, bucket in enumerate(buckets):
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=True if j > 0 else None):
        bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]],
                                    decoder_inputs[:bucket[1]])
        outputs.append(bucket_outputs)
        if per_example_loss:
          losses.append(
              sequence_loss_by_example(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))
        else:
          losses.append(
              sequence_loss(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))

  return outputs, losses