TensorFlow最佳实践

内容

TensorFlow 基础
理解静态和动态形状
命名空间以及如何使用
广播的好处和坏处
给TensorFlow喂数据
利用重载书算子的优势
理解计算顺序以及控制依赖
控制流操作：条件和循环
Prototyping kernels and advanced visualization with Python ops
数据并行的多GPU处理
调试TensorFlow模型
TensorFlow中的数值稳定性
使用learn API构建训练框架
TensorFlow Cookbook

我们期望逐步的更新系列文章以便和最新的API保持一致，如果你有更好的建议或者发现有什么描述不当的地方，开一个issue，或者邮件联系我们。
我们建议你使用tf.contrib.learn的API，它可以使用如下语句获得:(https://github.com/vahidk/TensorflowFramework):

git clone https://github.com/vahidk/TensorflowFramework.git

TensorFlow 基础

Tensorflow和其他数值类似于numpy计算库最大的不同是它使用了符号式编程，这是使得tensorflow区别于numpy而变得如此强大的重要概念，但是这也使得学习它更加困难，这篇文章的目的就是揭秘tensorflow以及更高效的使用tensorflow。
让我们先从一个简单的例子开始，我们想让两个随机数相乘，在numpy中可以这么做:

import numpy as np

x = np.random.normal(size=[10, 10])
y = np.random.normal(size=[10, 10])
z = np.dot(x, y)

print(z)

而现在如果在tensorflow中想要做这个操作就得:

import tensorflow as tf

x = tf.random_normal([10, 10])
y = tf.random_normal([10, 10])
z = tf.matmul(x, y)

sess = tf.Session()
z_val = sess.run(z)

print(z_val)

和numpy立即就可以得到计算结果不同，tensorflow只是保存了计算图中的一个节点，如果我们尝试去直接打印z的值的话，会得到类似下面的东西:

Tensor("MatMul:0", shape=(10, 10), dtype=float32)

Since both the inputs have a fully defined shape, tensorflow is able to infer the shape of the tensor as well as its type. In order to compute the value of the tensor we need to create a session and evaluate it using Session.run() method.

Tip: When using Jupyter notebook make sure to call tf.reset_default_graph() at the beginning to clear the symbolic graph before defining new nodes.

为了理解符号计算的强大之处，让我们再看另外一个例子，假设我们有一些曲线上的点(say f(x) = 5x^2 + 3)并且我们希望基于这些点来估计f(x).我们定义一个带有参数x和隐参数w的函数g(x, w) = w0 x^2 + w1 x + w2,我们的目标时找到隐含的参数使得g(x, w) ≈ f(x). 这可以通过最小化如下损失函数得到: L(w) = ∑ (f(x) - g(x, w))^2. 尽管这个简单的问题有个闭式解，我们选择使用更加通用的梯度下降方法它可以推广到任意的函数.我们计算其相对于w的导数L(w).

这儿看下在tensorflow中是怎么做到的:

import numpy as np
import tensorflow as tf

# Placeholders are used to feed values from python to TensorFlow ops. We define
# two placeholders, one for input feature x, and one for output y.
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients. The variable will be
# automatically initialized with random noise.
w = tf.get_variable("w", shape=[3, 1])

# We define yhat to be our estimate of y.
f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
yhat = tf.squeeze(tf.matmul(f, w), 1)

# The loss is defined to be the l2 distance between our estimate of y and its
# true value. We also added a shrinkage term, to ensure the resulting weights
# would be small.
loss = tf.nn.l2_loss(yhat - y) + 0.1 * tf.nn.l2_loss(w)

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
train_op = tf.train.AdamOptimizer(0.1).minimize(loss)

def generate_data():
    x_val = np.random.uniform(-10.0, 10.0, size=100)
    y_val = 5 * np.square(x_val) + 3
    return x_val, y_val

sess = tf.Session()
# Since we are using variables we first need to initialize them.
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    x_val, y_val = generate_data()
    _, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})
    print(loss_val)
print(sess.run([w]))

通过运行这些代码可以得到这些结果:

[4.9924135, 0.00040895029, 3.4504161]

这和我们的参数非常接近.

这只是解释tensorflow所能做的一些小甜点，优化数以百万计参数的大型神经网络也是可以通过寥寥数行几行tensorflow代码来实现的，tensorflow可以在不同设备、不同线程支持在多个平台上运行s.

理解静态和动态的形状

TensorFlow中的Tensors在构建图的过程中有一个形状属性，这个形状可能并没有特殊指定，例如我们可能会定义一个[None, 128]的变量:

import tensorflow as tf

a = tf.placeholder(tf.float32, [None, 128])

这意味着第一个维度可以为任意值并且在Session.run()中可以动态决定，你可以通过如下代码查询Tensor的形状:

static_shape = a.shape.as_list()  # returns [None, 128]

为了得到tensor的动态形状你可以调用tf.shape op, 它会返回代表给定tensor形状的tensor:

dynamic_shape = tf.shape(a)

tensor的形状可以通过Tensor.set_shape()来设定:

a.set_shape([32, 128])  # static shape of a is [32, 128]
a.set_shape([None, 128])  # first dimension of a is determined dynamically

你可以通过tf.reshape函数来改变一个动态的形状:

a =  tf.reshape(a, [32, 128])

It can be convenient to have a function that returns the static shape when available and dynamic shape when it’s not. The following utility function does just that:

def get_shape(tensor):
  static_shape = tensor.shape.as_list()
  dynamic_shape = tf.unstack(tf.shape(tensor))
  dims = [s[1] if s[0] is None else s[0]
          for s in zip(static_shape, dynamic_shape)]
  return dims

现在如果我们想把三阶的张量变形为二阶，只需使用get_shape() 函数:

b = tf.placeholder(tf.float32, [None, 10, 32])
shape = get_shape(b)
b = tf.reshape(b, [shape[0], shape[1] * shape[2]])

请注意不管形状是否是静态指定的这都可以正常工作.

事实上我们可以写一个更加通用的变形函数:

import tensorflow as tf
import numpy as np

def reshape(tensor, dims_list):
  shape = get_shape(tensor)
  dims_prod = []
  for dims in dims_list:
    if isinstance(dims, int):
      dims_prod.append(shape[dims])
    elif all([isinstance(shape[d], int) for d in dims]):
      dims_prod.append(np.prod([shape[d] for d in dims]))
    else:
      dims_prod.append(tf.prod([shape[d] for d in dims]))
  tensor = tf.reshape(tensor, dims_prod)
  return tensor

这样一来变形就变得十分容易:

b = tf.placeholder(tf.float32, [None, 10, 32])
b = reshape(b, [0, [1, 2]])

命名空间以及何时使用它们

变量是tensorflow在计算图中用来名字来区分的一个符号.如果你在创建的时候不显式指定,tensorflow就会自动帮你生成一个名字:

a = tf.constant(1)
print(a.name)  # prints "Const:0"

b = tf.Variable(1)
print(b.name)  # prints "Variable:0"

你可以通过显式指定来改变默认指定的值:

a = tf.constant(1, name="a")
print(a.name)  # prints "a:0"

b = tf.Variable(1, name="b")
print(b.name)  # prints "b:0"

TensorFlow 引入了两种不同的命名空间管理，第一个是tf.name_scope:

with tf.name_scope("scope"):
  a = tf.constant(1, name="a")
  print(a.name)  # prints "scope/a:0"

  b = tf.Variable(1, name="b")
  print(b.name)  # prints "scope/b:0"

  c = tf.get_variable(name="c", shape=[])
  print(c.name)  # prints "c:0"

请注意在tensorflow中有两种创建变量的方式tf.Variable 和tf.get_variable. 使用tf.get_variable 时会创建一个新的变量，但是如果同样名字的变量已经存在时就会抛出一个ValueError 异常, 告诉我们重新声明变量是非法的.

tf.name_scope 影响那些通过tf.get_variable创建的变量但是不会影响tf.get_variable创建的变量.和tf.name_scope不同，tf.variable_scope 还会影响通过tf.get_variable创建的变量:

with tf.variable_scope("scope"):
  a = tf.constant(1, name="a")
  print(a.name)  # prints "scope/a:0"

  b = tf.Variable(1, name="b")
  print(b.name)  # prints "scope/b:0"

  c = tf.get_variable(name="c", shape=[])
  print(c.name)  # prints "scope/c:0"

with tf.variable_scope("scope"):
  a1 = tf.get_variable(name="a", shape=[])
  a2 = tf.get_variable(name="a", shape=[])  # Disallowed

但是如果我们想复用一个已经声明的变量怎么办呢?变量空间还提供了一个函数来实现它:

with tf.variable_scope("scope"):
  a1 = tf.get_variable(name="a", shape=[])
with tf.variable_scope("scope", reuse=True):
  a2 = tf.get_variable(name="a", shape=[])  # OK

当我们使用内置的网络层时就会变得非常繁琐:

with tf.variable_scope('my_scope'):
  features1 = tf.layers.conv2d(image1, filters=32, kernel_size=3)
# Use the same convolution weights to process the second image:
with tf.variable_scope('my_scope', reuse=True):
  features2 = tf.layers.conv2d(image2, filters=32, kernel_size=3)

你还可以在创建变量时把reuse设为tf.AUTO_REUSE来告诉TensorFlow来创建和复用变量:

with tf.variable_scope("scope", reuse=tf.AUTO_REUSE):
  features1 = tf.layers.conv2d(image1, filters=32, kernel_size=3)
  features2 = tf.layers.conv2d(image2, filters=32, kernel_size=3)

If you want to do lots of variable sharing keeping track of when to define new variables and when to reuse them can be cumbersome and error prone. tf.AUTO_REUSE simplifies this task but adds the risk of sharing variables that weren’t supposed to be shared. TensorFlow templates are another way of tackling the same problem without this risk:

conv3x32 = tf.make_template("conv3x32", lambda x: tf.layers.conv2d(x, 32, 3))
features1 = conv3x32(image1)
features2 = conv3x32(image2)  # Will reuse the convolution weights.

You can turn any function to a TensorFlow template. Upon the first call to a template, the variables defined inside the function would be declared and in the consecutive invocations they would automatically get reused.

广播的好处和坏处

TensorFlow支持元素级别的广播操作，一般情况下你进行加或乘的操作时需要确定形状是否匹配，例如，你不可以把形如[3, 2]的张量和[3, 4]进行相加.但是当有一方仅有一个维度时会有一种特例，TensorFlow会隐式的把其转换为可以操作的形状，因此把[3, 2] 的张量和 [3, 1]相加是合法的

import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[1.], [2.]])
# c = a + tf.tile(b, [1, 2])
c = a + b

广播能够进行隐式铺展可以使代码变得更短和更高效因为我们不再需要存储中间结果.一个典型的应用是融合不同长度的特征然后进行拼接. 在神经网络中这是一个很常见的操作:

a = tf.random_uniform([5, 3, 5])
b = tf.random_uniform([5, 1, 6])

# concat a and b and apply nonlinearity
tiled_b = tf.tile(b, [1, 3, 1])
c = tf.concat([a, tiled_b], 2)
d = tf.layers.dense(c, 10, activation=tf.nn.relu)

但是这可以借由广播更高效的进行. 依据f(m(x + y)) 等价于 f(mx + my). 因此我们可以把线性变换分开来做，然后隐式的使用广播进行拼接:

pa = tf.layers.dense(a, 10, activation=None)
pb = tf.layers.dense(b, 10, activation=None)
d = tf.nn.relu(pa + pb)

事实上这段代码相当之常见并且可以用于任意形状的张量:

def merge(a, b, units, activation=tf.nn.relu):
    pa = tf.layers.dense(a, units, activation=None)
    pb = tf.layers.dense(b, units, activation=None)
    c = pa + pb
    if activation is not None:
        c = activation(c)
    return c

更多的例子可以参见例程 included .

上面我们讨论了广播的好处，你会问坏处是啥呢? 隐式转换使得调试变得非常麻烦，考虑下面的例子:

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b)

你认为结果会是什么呢? 如果你猜是6的话就错了. 实际上它是12. 这是因为这两个张量的维度并不匹配，Tensorflow在进行元素级别的操作前会自动的扩展较低维度的张量，因此加法的结果是[[2, 3], [3, 4]], 并且最终的结果是12.

避免这种问题的一种方式是显式声明. Had we specified which dimension we would want to reduce across, catching this bug would have been much easier:

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = tf.reduce_sum(a + b, 0)

这里c的结果会是[5, 7], 我们会立刻意思到这里面的错误. 一个通常的做法是在reduction和tf.squeeze操作前指定操作的维度.

为TensorFlow供给数据

TensorFlow是专为大数据量设计的. 所以为了保持最大性能就不要使网络挨饿，有很多种供给数据的方式.

常量

最简单的方式是使用常量嵌入到网络里:

import tensorflow as tf
import numpy as np

actual_data = np.random.normal(size=[100])

data = tf.constant(actual_data)

这种方式非常高效，但是很不灵活，一个问题就是如果你换了一个数据库就得重写网络，并且一次性加载数据使得它只能用途很小的数据集.

Placeholders

使用Placeholder解决了以上两个问题:

import tensorflow as tf
import numpy as np

data = tf.placeholder(tf.float32)

prediction = tf.square(data) + 1

actual_data = np.random.normal(size=[100])

tf.Session().run(prediction, feed_dict={data: actual_data})

Placeholder返回了一个使用feed_dict输入的接口，请注意如果不供给数据就运行含有Placeholder网络的话会导致错误.

Python ops

另外一个方式就是使用Python ops:

def py_input_fn():
    actual_data = np.random.normal(size=[100])
    return actual_data

data = tf.py_func(py_input_fn, [], (tf.float32))

Python ops使得你把python的函数编程tensorflow操作变为可能.

Dataset API

目前推荐的方式是使用dataset API:

actual_data = np.random.normal(size=[100])
dataset = tf.contrib.data.Dataset.from_tensor_slices(actual_data)
data = dataset.make_one_shot_iterator().get_next()

如果你想从文件读取数据，更有效的方式是使用TFrecord存储然后yogaTFRecordDataset来读:

dataset = tf.contrib.data.TFRecordDataset(path_to_data)

查看 official docs 的例子来学习如何把你的数据转换为TFrecord 格式.

Dataset API 使得你可以使用管道更高效的处理数据，在
trainer.py可以找到相关的例子:

dataset = ...
dataset = dataset.cache()
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.repeat()
    dataset = dataset.shuffle(batch_size * 5)
dataset = dataset.map(parse, num_threads=8)
dataset = dataset.batch(batch_size)

在读取数据之后，我们使用Dataset.cache来把它缓存以便提升效率. 在训练过程中，我们不断的进行这些操作，这使得我们可以处理很多遍数据集，我们还会使用不同的分布来打乱数据集，此外，我们使用Dataset.map函数来进行原始数据的处理并且把它转化内模型需要的格式，最后我们使用Dataset.batch来打包成一批数据.

充分利用重载运算符的优势

和 NumPy一样, TensorFlow 重载了很多python运算符使得构建图变得非常容易而且代码更易读.

slicing操作是重载运算符发挥优势的地方:

z = x[begin:end]  # z = tf.slice(x, [begin], [end-begin])

当使用这个操作时务必小心. slicing操作非常低效并且可以有更好的操作来避免，特别是slice的数目狠多的时候. 让我们通过一个例子来看下这个操作有多么低效. We want to manually perform reduction across the rows of a matrix:

import tensorflow as tf
import time

x = tf.random_uniform([500, 10])

z = tf.zeros([10])
for i in range(500):
    z += x[i]

sess = tf.Session()
start = time.time()
sess.run(z)
print("Took %f seconds." % (time.time() - start))

在我的 MacBook Pro上它需要2.67秒的时间来运行!原因在于我们调用了这个slice操作500尺，使得运行很慢，一个很好的选择是使用 tf.unstack操作来一次性把矩阵分解为列:

z = tf.zeros([10])
for x_i in tf.unstack(x):
    z += x_i

这只需要 0.18秒.当然，正确的方式是使用tf.reduce_sum操作:

z = tf.reduce_sum(x, axis=0)

它仅需要0.008秒, 这比原始的实现快300倍.

TensorFlow还重载了很多数学和逻辑运算:

z = -x  # z = tf.negative(x)
z = x + y  # z = tf.add(x, y)
z = x - y  # z = tf.subtract(x, y)
z = x * y  # z = tf.mul(x, y)
z = x / y  # z = tf.div(x, y)
z = x // y  # z = tf.floordiv(x, y)
z = x % y  # z = tf.mod(x, y)
z = x ** y  # z = tf.pow(x, y)
z = x @ y  # z = tf.matmul(x, y)
z = x > y  # z = tf.greater(x, y)
z = x >= y  # z = tf.greater_equal(x, y)
z = x < y  # z = tf.less(x, y)
z = x <= y  # z = tf.less_equal(x, y)
z = abs(x)  # z = tf.abs(x)
z = x & y  # z = tf.logical_and(x, y)
z = x | y  # z = tf.logical_or(x, y)
z = x ^ y  # z = tf.logical_xor(x, y)
z = ~x  # z = tf.logical_not(x)

你还可以使用增强版本的运算符例如 x += y and x **= 2也是合法的.

注意Python不允许重载”and”, “or”, and “not”等关键字.

TensorFlow也不允许使用booleans类型的张量，因为它很容易出错:

x = tf.constant(1.)
if x:  # This will raise a TypeError error
    ...

如果你想检查张量的值的话你还可以使用tf.cond(x, …)或者”if x is None”.

其他的操作Python支持的equal (==) 和not equal (!=) 在TensorFlow中是不允许的. 使用函数版本的tf.equal and tf.not_equal来替代.

理解执行的顺序和控制依赖

As we discussed in the first item, TensorFlow doesn’t immediately run the operations that are defined but rather creates corresponding nodes in a graph that can be evaluated with Session.run() method. This also enables TensorFlow to do optimizations at run time to determine the optimal order of execution and possible trimming of unused nodes. If you only have tf.Tensors in your graph you don’t need to worry about dependencies but you most probably have tf.Variables too, and tf.Variables make things much more difficult. My advice to is to only use Variables if Tensors don’t do the job. This might not make a lot of sense to you now, so let’s start with an example.

import tensorflow as tf

a = tf.constant(1)
b = tf.constant(2)
a = a + b

tf.Session().run(a)

Evaluating “a” will return the value 3 as expected. Note that here we are creating 3 tensors, two constant tensors and another tensor that stores the result of the addition. Note that you can’t overwrite the value of a tensor. If you want to modify it you have to create a new tensor. As we did here.

TIP: If you don’t define a new graph, TensorFlow automatically creates a graph for you by default. You can use tf.get_default_graph() to get a handle to the graph. You can then inspect the graph, for example by printing all its tensors:

print(tf.contrib.graph_editor.get_tensors(tf.get_default_graph()))

Unlike tensors, variables can be updated. So let’s see how we may use variables to do the same thing:

a = tf.Variable(1)
b = tf.constant(2)
assign = tf.assign(a, a + b)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(assign))

Again, we get 3 as expected. Note that tf.assign returns a tensor representing the value of the assignment.
So far everything seemed to be fine, but let’s look at a slightly more complicated example:

a = tf.Variable(1)
b = tf.constant(2)
c = a + b

assign = tf.assign(a, 5)

sess = tf.Session()
for i in range(10):
    sess.run(tf.global_variables_initializer())
    print(sess.run([assign, c]))

Note that the tensor c here won’t have a deterministic value. This value might be 3 or 7 depending on whether addition or assignment gets executed first.

You should note that the order that you define ops in your code doesn’t matter to TensorFlow runtime. The only thing that matters is the control dependencies. Control dependencies for tensors are straightforward. Every time you use a tensor in an operation that op will define an implicit dependency to that tensor. But things get complicated with variables because they can take many values.

When dealing with variables, you may need to explicitly define dependencies using tf.control_dependencies() as follows:

a = tf.Variable(1)
b = tf.constant(2)
c = a + b

with tf.control_dependencies([c]):
    assign = tf.assign(a, 5)

sess = tf.Session()
for i in range(10):
    sess.run(tf.global_variables_initializer())
    print(sess.run([assign, c]))

This will make sure that the assign op will be called after the addition.

控制操作流:条件和循环

When building complex models such as recurrent neural networks you may need to control the flow of operations through conditionals and loops. In this section we introduce a number of commonly used control flow ops.

Let’s assume you want to decide whether to multiply to or add two given tensors based on a predicate. This can be simply implemented with tf.cond which acts as a python “if” function:

a = tf.constant(1)
b = tf.constant(2)

p = tf.constant(True)

x = tf.cond(p, lambda: a + b, lambda: a * b)

print(tf.Session().run(x))

Since the predicate is True in this case, the output would be the result of the addition, which is 3.

Most of the times when using TensorFlow you are using large tensors and want to perform operations in batch. A related conditional operation is tf.where, which like tf.cond takes a predicate, but selects the output based on the condition in batch.

a = tf.constant([1, 1])
b = tf.constant([2, 2])

p = tf.constant([True, False])

x = tf.where(p, a + b, a * b)

print(tf.Session().run(x))

This will return [3, 2].

Another widely used control flow operation is tf.while_loop. It allows building dynamic loops in TensorFlow that operate on sequences of variable length. Let’s see how we can generate Fibonacci sequence with tf.while_loops:

n = tf.constant(5)

def cond(i, a, b):
    return i < n

def body(i, a, b):
    return i + 1, b, a + b

i, a, b = tf.while_loop(cond, body, (2, 1, 1))

print(tf.Session().run(b))

This will print 5. tf.while_loops takes a condition function, and a loop body function, in addition to initial values for loop variables. These loop variables are then updated by multiple calls to the body function until the condition returns false.

Now imagine we want to keep the whole series of Fibonacci sequence. We may update our body to keep a record of the history of current values:

n = tf.constant(5)

def cond(i, a, b, c):
    return i < n

def body(i, a, b, c):
    return i + 1, b, a + b, tf.concat([c, [a + b]], 0)

i, a, b, c = tf.while_loop(cond, body, (2, 1, 1, tf.constant([1, 1])))

print(tf.Session().run(c))

Now if you try running this, TensorFlow will complain that the shape of the the fourth loop variable is changing. So you must make that explicit that it’s intentional:

i, a, b, c = tf.while_loop(
    cond, body, (2, 1, 1, tf.constant([1, 1])),
    shape_invariants=(tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([]),
                      tf.TensorShape([None])))

This is not only getting ugly, but is also somewhat inefficient. Note that we are building a lot of intermediary tensors that we don’t use. TensorFlow has a better solution for this kind of growing arrays. Meet tf.TensorArray. Let’s do the same thing this time with tensor arrays:

n = tf.constant(5)

c = tf.TensorArray(tf.int32, n)
c = c.write(0, 1)
c = c.write(1, 1)

def cond(i, a, b, c):
    return i < n

def body(i, a, b, c):
    c = c.write(i, a + b)
    return i + 1, b, a + b, c

i, a, b, c = tf.while_loop(cond, body, (2, 1, 1, c))

c = c.stack()

print(tf.Session().run(c))

TensorFlow while loops and tensor arrays are essential tools for building complex recurrent neural networks. As an exercise try implementing beam search using tf.while_loops. Can you make it more efficient with tensor arrays?

Prototyping kernels and advanced visualization with Python ops

Operation kernels in TensorFlow are entirely written in C++ for efficiency. But writing a TensorFlow kernel in C++ can be quite a pain. So, before spending hours implementing your kernel you may want to prototype something quickly, however inefficient. With tf.py_func() you can turn any piece of python code to a TensorFlow operation.

For example this is how you can implement a simple ReLU nonlinearity kernel in TensorFlow as a python op:

import numpy as np
import tensorflow as tf
import uuid

def relu(inputs):
    # Define the op in python
    def _relu(x):
        return np.maximum(x, 0.)

    # Define the op's gradient in python
    def _relu_grad(x):
        return np.float32(x > 0)

    # An adapter that defines a gradient op compatible with TensorFlow
    def _relu_grad_op(op, grad):
        x = op.inputs[0]
        x_grad = grad * tf.py_func(_relu_grad, [x], tf.float32)
        return x_grad

    # Register the gradient with a unique id
    grad_name = "MyReluGrad_" + str(uuid.uuid4())
    tf.RegisterGradient(grad_name)(_relu_grad_op)

    # Override the gradient of the custom op
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": grad_name}):
        output = tf.py_func(_relu, [inputs], tf.float32)
    return output

To verify that the gradients are correct you can use TensorFlow’s gradient checker:

x = tf.random_normal([10])
y = relu(x * x)

with tf.Session():
    diff = tf.test.compute_gradient_error(x, [10], y, [10])
    print(diff)

compute_gradient_error() computes the gradient numerically and returns the difference with the provided gradient. What we want is a very low difference.

Note that this implementation is pretty inefficient, and is only useful for prototyping, since the python code is not parallelizable and won’t run on GPU. Once you verified your idea, you definitely would want to write it as a C++ kernel.

In practice we commonly use python ops to do visualization on Tensorboard. Consider the case that you are building an image classification model and want to visualize your model predictions during training. TensorFlow allows visualizing images with tf.summary.image() function:

image = tf.placeholder(tf.float32)
tf.summary.image("image", image)

But this only visualizes the input image. In order to visualize the predictions you have to find a way to add annotations to the image which may be almost impossible with existing ops. An easier way to do this is to do the drawing in python, and wrap it in a python op:

import io
import matplotlib.pyplot as plt
import numpy as np
import PIL
import tensorflow as tf

def visualize_labeled_images(images, labels, max_outputs=3, name="image"):
    def _visualize_image(image, label):
        # Do the actual drawing in python
        fig = plt.figure(figsize=(3, 3), dpi=80)
        ax = fig.add_subplot(111)
        ax.imshow(image[::-1,...])
        ax.text(0, 0, str(label),
          horizontalalignment="left",
          verticalalignment="top")
        fig.canvas.draw()

        # Write the plot as a memory file.
        buf = io.BytesIO()
        data = fig.savefig(buf, format="png")
        buf.seek(0)

        # Read the image and convert to numpy array
        img = PIL.Image.open(buf)
        return np.array(img.getdata()).reshape(img.size[0], img.size[1], -1)

    def _visualize_images(images, labels):
        # Only display the given number of examples in the batch
        outputs = []
        for i in range(max_outputs):
            output = _visualize_image(images[i], labels[i])
            outputs.append(output)
        return np.array(outputs, dtype=np.uint8)

    # Run the python op.
    figs = tf.py_func(_visualize_images, [images, labels], tf.uint8)
    return tf.summary.image(name, figs)

Note that since summaries are usually only evaluated once in a while (not per step), this implementation may be used in practice without worrying about efficiency.

数据并行的多GPU处理

如果你的软件是用类似于C++语言单核CPU写的话，想在多GPU上跑起来需要从头写很多代码，但是在Tensorflow中就简单多了。由于符号式语言的天然优势，tensorflow隐藏了所有的复杂性，使得你的程序可以在多个CPU和GPU上运行.

让我们先来看一个在CPU上进行两个向量相加的例子:

 import tensorflow as tf

with tf.device(tf.DeviceSpec(device_type="CPU", device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b

tf.Session().run(c)
 ```

同样的事情也可以在GPU上完成:

```python
with tf.device(tf.DeviceSpec(device_type="GPU", device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b
 ```

但是如果我们有两个GPU并且想同时使用它们应该怎么做呢? 为了完成它，我们需要把数据分为两份，每个GPU都有一半的数据:





<div class="se-preview-section-delimiter"></div>

```python
split_a = tf.split(a, 2)
split_b = tf.split(b, 2)

split_c = []
for i in range(2):
    with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
        split_c.append(split_a[i] + split_b[i])

c = tf.concat(split_c, axis=0)
 ```

Let's rewrite this in a more general form so that we can replace addition with any other set of operations:
```python
def make_parallel(fn, num_gpus, **kwargs):
    in_splits = {}
    for k, v in kwargs.items():
        in_splits[k] = tf.split(v, num_gpus)

    out_split = []
    for i in range(num_gpus):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
                out_split.append(fn(**{k : v[i] for k, v in in_splits.items()}))

    return tf.concat(out_split, axis=0)


def model(a, b):
    return a + b

c = make_parallel(model, 2, a=a, b=b)




<div class="se-preview-section-delimiter"></div>

You can replace the model with any function that takes a set of tensors as input and returns a tensor as result with the condition that both the input and output are in batch. Note that we also added a variable scope and set the reuse to true. This makes sure that we use the same variables for processing both splits. This is something that will become handy in our next example.

Let’s look at a slightly more practical example. We want to train a neural network on multiple GPUs. During training we not only need to compute the forward pass but also need to compute the backward pass (the gradients). But how can we parallelize the gradient computation? This turns out to be pretty easy.

Recall from the first item that we wanted to fit a second degree polynomial to a set of samples. We reorganized the code a bit to have the bulk of the operations in the model function:

import numpy as np
import tensorflow as tf

def model(x, y):
    w = tf.get_variable("w", shape=[3, 1])

    f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
    yhat = tf.squeeze(tf.matmul(f, w), 1)

    loss = tf.square(yhat - y)
    return loss

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

loss = model(x, y)

train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss))

def generate_data():
    x_val = np.random.uniform(-10.0, 10.0, size=100)
    y_val = 5 * np.square(x_val) + 3
    return x_val, y_val

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    x_val, y_val = generate_data()
    _, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})

_, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})
print(sess.run(tf.contrib.framework.get_variables_by_name("w")))




<div class="se-preview-section-delimiter"></div>

Now let’s use make_parallel that we just wrote to parallelize this. We only need to change two lines of code from the above code:

loss = make_parallel(model, 2, x=x, y=y)

train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss),
    colocate_gradients_with_ops=True)




<div class="se-preview-section-delimiter"></div>

The only thing that we need to change to parallelize backpropagation of gradients is to set the colocate_gradients_with_ops flag to true. This ensures that gradient ops run on the same device as the original op.

调试TensorFlow模型

Symbolic nature of TensorFlow makes it relatively more difficult to debug TensorFlow code compared to regular python code. Here we introduce a number of tools included with TensorFlow that make debugging much easier.

Probably the most common error one can make when using TensorFlow is passing Tensors of wrong shape to ops. Many TensorFlow ops can operate on tensors of different ranks and shapes. This can be convenient when using the API, but may lead to extra headache when things go wrong.

For example, consider the tf.matmul op, it can multiply two matrices:

a = tf.random_uniform([2, 3])
b = tf.random_uniform([3, 4])
c = tf.matmul(a, b)  # c is a tensor of shape [2, 4]




<div class="se-preview-section-delimiter"></div>

But the same function also does batch matrix multiplication:

a = tf.random_uniform([10, 2, 3])
b = tf.random_uniform([10, 3, 4])
tf.matmul(a, b)  # c is a tensor of shape [10, 2, 4]




<div class="se-preview-section-delimiter"></div>

Another example that we talked about before in the broadcasting section is add operation which supports broadcasting:

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
c = a + b  # c is a tensor of shape [2, 2]




<div class="se-preview-section-delimiter"></div>

使用tf.assert*操作校验张量的值

One way to reduce the chance of unwanted behavior is to explicitly verify the rank or shape of intermediate tensors with tf.assert* ops.

a = tf.constant([[1.], [2.]])
b = tf.constant([1., 2.])
check_a = tf.assert_rank(a, 1)  # This will raise an InvalidArgumentError exception
check_b = tf.assert_rank(b, 1)
with tf.control_dependencies([check_a, check_b]):
    c = a + b  # c is a tensor of shape [2, 2]




<div class="se-preview-section-delimiter"></div>

Remember that assertion nodes like other operations are part of the graph and if not evaluated would get pruned during Session.run(). So make sure to create explicit dependencies to assertion ops, to force TensorFlow to execute them.

You can also use assertions to validate the value of tensors at runtime:

check_pos = tf.assert_positive(a)




<div class="se-preview-section-delimiter"></div>

See the official docs for a full list of assertion ops.

使用tf.Print记录张量的值

Another useful built-in function for debugging is tf.Print which logs the given tensors to the standard error:

input_copy = tf.Print(input, tensors_to_print_list)




<div class="se-preview-section-delimiter"></div>

Note that tf.Print returns a copy of its first argument as output. One way to force tf.Print to run is to pass its output to another op that gets executed. For example if we want to print the value of tensors a and b before adding them we could do something like this:

a = ...
b = ...
a = tf.Print(a, [a, b])
c = a + b




<div class="se-preview-section-delimiter"></div>

Alternatively we could manually define a control dependency.

使用tf.compute_gradient_error检查梯度

Not all the operations in TensorFlow come with gradients, and it’s easy to unintentionally build graphs for which TensorFlow can not compute the gradients.

Let’s look at an example:

import tensorflow as tf

def non_differentiable_softmax_entropy(logits):
    probs = tf.nn.softmax(logits)
    return tf.nn.softmax_cross_entropy_with_logits(labels=probs, logits=logits)

w = tf.get_variable("w", shape=[5])
y = -non_differentiable_softmax_entropy(w)

opt = tf.train.AdamOptimizer()
train_op = opt.minimize(y)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10000):
    sess.run(train_op)

print(sess.run(tf.nn.softmax(w)))




<div class="se-preview-section-delimiter"></div>

We are using tf.nn.softmax_cross_entropy_with_logits to define entropy over a categorical distribution. We then use Adam optimizer to find the weights with maximum entropy. If you have passed a course on information theory, you would know that uniform distribution contains maximum entropy. So you would expect for the result to be [0.2, 0.2, 0.2, 0.2, 0.2]. But if you run this you may get unexpected results like this:

[ 0.34081486  0.24287023  0.23465775  0.08935683  0.09230034]




<div class="se-preview-section-delimiter"></div>

It turns out tf.nn.softmax_cross_entropy_with_logits has undefined gradients with respect to labels! But how may we spot this if we didn’t know?

Fortunately for us TensorFlow comes with a numerical differentiator that can be used to find symbolic gradient errors. Let’s see how we can use it:

with tf.Session():
    diff = tf.test.compute_gradient_error(w, [5], y, [])
    print(diff)




<div class="se-preview-section-delimiter"></div>

If you run this, you would see that the difference between the numerical and symbolic gradients are pretty high (0.06 - 0.1 in my tries).

Now let’s fix our function with a differentiable version of the entropy and check again:

import tensorflow as tf
import numpy as np

def softmax_entropy(logits, dim=-1):
    plogp = tf.nn.softmax(logits, dim) * tf.nn.log_softmax(logits, dim)
    return -tf.reduce_sum(plogp, dim)

w = tf.get_variable("w", shape=[5])
y = -softmax_entropy(w)

print(w.get_shape())
print(y.get_shape())

with tf.Session() as sess:
    diff = tf.test.compute_gradient_error(w, [5], y, [])
    print(diff)




<div class="se-preview-section-delimiter"></div>

The difference should be ~0.0001 which looks much better.

Now if you run the optimizer again with the correct version you can see the final weights would be:

[ 0.2  0.2  0.2  0.2  0.2]




<div class="se-preview-section-delimiter"></div>

which are exactly what we wanted.

TensorFlow summaries, and tfdbg (TensorFlow Debugger) are other tools that can be used for debugging. Please refer to the official docs to learn more.

TensorFlow中的数值稳定性

When using any numerical computation library such as NumPy or TensorFlow, it’s important to note that writing mathematically correct code doesn’t necessarily lead to correct results. You also need to make sure that the computations are stable.

Let’s start with a simple example. From primary school we know that x * y / y is equal to x for any non zero value of x. But let’s see if that’s always true in practice:

import numpy as np

x = np.float32(1)

y = np.float32(1e-50)  # y would be stored as zero
z = x * y / y

print(z)  # prints nan




<div class="se-preview-section-delimiter"></div>

The reason for the incorrect result is that y is simply too small for float32 type. A similar problem occurs when y is too large:

y = np.float32(1e39)  # y would be stored as inf
z = x * y / y

print(z)  # prints 0




<div class="se-preview-section-delimiter"></div>

The smallest positive value that float32 type can represent is 1.4013e-45 and anything below that would be stored as zero. Also, any number beyond 3.40282e+38, would be stored as inf.

print(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45
print(np.finfo(np.float32).max)  # print 3.40282e+38




<div class="se-preview-section-delimiter"></div>

To make sure that your computations are stable, you want to avoid values with small or very large absolute value. This may sound very obvious, but these kind of problems can become extremely hard to debug especially when doing gradient descent in TensorFlow. This is because you not only need to make sure that all the values in the forward pass are within the valid range of your data types, but also you need to make sure of the same for the backward pass (during gradient computation).

Let’s look at a real example. We want to compute the softmax over a vector of logits. A naive implementation would look something like this:

import tensorflow as tf

def unstable_softmax(logits):
    exp = tf.exp(logits)
    return exp / tf.reduce_sum(exp)

tf.Session().run(unstable_softmax([1000., 0.]))  # prints [ nan, 0.]




<div class="se-preview-section-delimiter"></div>

Note that computing the exponential of logits for relatively small numbers results to gigantic results that are out of float32 range. The largest valid logit for our naive softmax implementation is ln(3.40282e+38) = 88.7, anything beyond that leads to a nan outcome.

But how can we make this more stable? The solution is rather simple. It’s easy to see that exp(x - c) / ∑ exp(x - c) = exp(x) / ∑ exp(x). Therefore we can subtract any constant from the logits and the result would remain the same. We choose this constant to be the maximum of logits. This way the domain of the exponential function would be limited to [-inf, 0], and consequently its range would be [0.0, 1.0] which is desirable:

import tensorflow as tf

def softmax(logits):
    exp = tf.exp(logits - tf.reduce_max(logits))
    return exp / tf.reduce_sum(exp)

tf.Session().run(softmax([1000., 0.]))  # prints [ 1., 0.]




<div class="se-preview-section-delimiter"></div>

Let’s look at a more complicated case. Consider we have a classification problem. We use the softmax function to produce probabilities from our logits. We then define our loss function to be the cross entropy between our predictions and the labels. Recall that cross entropy for a categorical distribution can be simply defined as xe(p, q) = -∑ p_i log(q_i). So a naive implementation of the cross entropy would look like this:

def unstable_softmax_cross_entropy(labels, logits):
    logits = tf.log(softmax(logits))
    return -tf.reduce_sum(labels * logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = unstable_softmax_cross_entropy(labels, logits)

print(tf.Session().run(xe))  # prints inf




<div class="se-preview-section-delimiter"></div>

Note that in this implementation as the softmax output approaches zero, the log’s output approaches infinity which causes instability in our computation. We can rewrite this by expanding the softmax and doing some simplifications:

def softmax_cross_entropy(labels, logits):
    scaled_logits = logits - tf.reduce_max(logits)
    normalized_logits = scaled_logits - tf.reduce_logsumexp(scaled_logits)
    return -tf.reduce_sum(labels * normalized_logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = softmax_cross_entropy(labels, logits)

print(tf.Session().run(xe))  # prints 500.0




<div class="se-preview-section-delimiter"></div>

We can also verify that the gradients are also computed correctly:

g = tf.gradients(xe, logits)
print(tf.Session().run(g))  # prints [0.5, -0.5]




<div class="se-preview-section-delimiter"></div>

which is correct.

Let me remind again that extra care must be taken when doing gradient descent to make sure that the range of your functions as well as the gradients for each layer are within a valid range. Exponential and logarithmic functions when used naively are especially problematic because they can map small numbers to enormous ones and the other way around.

使用learn API来构建网络

为了简单起见，在大多数的例子中我们手动的创建回话并且不关心保存和加载检查点，但是在实际中就不行了。你很想知道怎么使用 learn API 来管理和记录会话. 我们提供了一个简单的示例framework 这里我们解释下这个框架是如何工作的.

在实验中你通常会有训练集和测试集的划分，你希望在训练集上训练在测试集上测试得到一些指标，你还需要用检查点保存一些模型的参数并且能够停止和恢复训练.TensorFlow’s learn API 就是用来使得这项工作更简单的，让我们专注于于开发模型（而不是其他的琐事）.

最常见的方式是直接使用tf.Estimator. 你需要定义一个损失函数一个训练op一个或一些预测还有一些评估的操作:

import tensorflow as tf

def model_fn(features, labels, mode, params):
    predictions = ...
    loss = ...
    train_op = ...
    metric_ops = ...
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=predictions,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=metric_ops)

params = ...
run_config = tf.contrib.learn.RunConfig(model_dir=FLAGS.output_dir)
estimator = tf.estimator.Estimator(
    model_fn=model_fn, config=run_config, params=params)




<div class="se-preview-section-delimiter"></div>

你只需要在提供一个输入数据的函数后即可调用Estimator.train()函数训练这个模型:

def input_fn():
    features = ...
    labels = ...
    return features, labels

estimator.train(input_fn=input_fn, max_steps=...)




<div class="se-preview-section-delimiter"></div>

评估模型的话也只需要调用:

estimator.evaluate(input_fn=input_fn)




<div class="se-preview-section-delimiter"></div>

对于简单的例子来说Estimator可能就足够了但是TensorFlow提供了一个更高级的名为Experiment来提供一个额外的功能，创建一个实验变得非常简单:

experiment = tf.contrib.learn.Experiment(
    estimator=estimator,
    train_input_fn=train_input_fn,
    eval_input_fn=eval_input_fn)




<div class="se-preview-section-delimiter"></div>

现在我们可以调用train_and_evaluate function 来在训练的过程中评估一些指标了:

experiment.train_and_evaluate()




<div class="se-preview-section-delimiter"></div>

一个更高级的方式是使用learn_runner.run() 函数. 整个框架看起来大概是这个样子:

import tensorflow as tf

tf.flags.DEFINE_string("output_dir", "", "Optional output dir.")
tf.flags.DEFINE_string("schedule", "train_and_evaluate", "Schedule.")
tf.flags.DEFINE_string("hparams", "", "Hyper parameters.")

FLAGS = tf.flags.FLAGS

def experiment_fn(run_config, hparams):
  estimator = tf.estimator.Estimator(
    model_fn=make_model_fn(),
    config=run_config,
    params=hparams)
  return tf.contrib.learn.Experiment(
    estimator=estimator,
    train_input_fn=make_input_fn(tf.estimator.ModeKeys.TRAIN, hparams),
    eval_input_fn=make_input_fn(tf.estimator.ModeKeys.EVAL, hparams))

def main(unused_argv):
  run_config = tf.contrib.learn.RunConfig(model_dir=FLAGS.output_dir)
  hparams = tf.contrib.training.HParams()
  hparams.parse(FLAGS.hparams)

  estimator = tf.contrib.learn.learn_runner.run(
    experiment_fn=experiment_fn,
    run_config=run_config,
    schedule=FLAGS.schedule,
    hparams=hparams)

if __name__ == "__main__":
  tf.app.run()




<div class="se-preview-section-delimiter"></div>

The schedule flag decides which member function of the Experiment object gets called. So, if you for example set schedule to “train_and_evaluate”, experiment.train_and_evaluate() would be called.

The input function returns two tensors (or dictionaries of tensors) providing the features and labels to be passed to the model:

def input_fn():
    features = ...
    labels = ...
    return features, labels




<div class="se-preview-section-delimiter"></div>

See mnist.py for an example of how to read your data with the dataset API. To learn about various ways of reading your data in TensorFlow refer to this item.

The framework also comes with a simple convolutional network classifier in alexnet.py that includes an example model.

And that’s it! This is all you need to get started with TensorFlow learn API. I recommend to have a look at the framework source code and see the official python API to learn more about the learn API.

TensorFlow 菜谱

这节描述Tensorflow中一些常见的代码段.

Get shape

def get_shape(tensor):
  """Returns static shape if available and dynamic shape otherwise."""
  static_shape = tensor.shape.as_list()
  dynamic_shape = tf.unstack(tf.shape(tensor))
  dims = [s[1] if s[0] is None else s[0]
          for s in zip(static_shape, dynamic_shape)]
  return dims




<div class="se-preview-section-delimiter"></div>

Batch Gather

def batch_gather(tensor, indices):
  """Gather in batch from a tensor of arbitrary size.

  In pseudocode this module will produce the following:
  output[i] = tf.gather(tensor[i], indices[i])

  Args:
    tensor: Tensor of arbitrary size.
    indices: Vector of indices.
  Returns:
    output: A tensor of gathered values.
  """
  shape = get_shape(tensor)
  flat_first = tf.reshape(tensor, [shape[0] * shape[1]] + shape[2:])
  indices = tf.convert_to_tensor(indices)
  offset_shape = [shape[0]] + [1] * (indices.shape.ndims - 1)
  offset = tf.reshape(tf.range(shape[0]) * shape[1], offset_shape)
  output = tf.gather(flat_first, indices + offset)
  return output




<div class="se-preview-section-delimiter"></div>

Beam Search

import tensorflow as tf

def rnn_beam_search(update_fn, initial_state, sequence_length, beam_width,
                    begin_token_id, end_token_id, name="rnn"):
  """Beam-search decoder for recurrent models.

  Args:
    update_fn: Function to compute the next state and logits given the current
               state and ids.
    initial_state: Recurrent model states.
    sequence_length: Length of the generated sequence.
    beam_width: Beam width.
    begin_token_id: Begin token id.
    end_token_id: End token id.
    name: Scope of the variables.
  Returns:
    ids: Output indices.
    logprobs: Output log probabilities probabilities.
  """
  batch_size = initial_state.shape.as_list()[0]

  state = tf.tile(tf.expand_dims(initial_state, axis=1), [1, beam_width, 1])

  sel_sum_logprobs = tf.log([[1.] + [0.] * (beam_width - 1)])

  ids = tf.tile([[begin_token_id]], [batch_size, beam_width])
  sel_ids = tf.zeros([batch_size, beam_width, 0], dtype=ids.dtype)

  mask = tf.ones([batch_size, beam_width], dtype=tf.float32)

  for i in range(sequence_length):
    with tf.variable_scope(name, reuse=True if i > 0 else None):

      state, logits = update_fn(state, ids)
      logits = tf.nn.log_softmax(logits)

      sum_logprobs = (
          tf.expand_dims(sel_sum_logprobs, axis=2) +
          (logits * tf.expand_dims(mask, axis=2)))

      num_classes = logits.shape.as_list()[-1]

      sel_sum_logprobs, indices = tf.nn.top_k(
          tf.reshape(sum_logprobs, [batch_size, num_classes * beam_width]),
          k=beam_width)

      ids = indices % num_classes

      beam_ids = indices // num_classes

      state = batch_gather(state, beam_ids)

      sel_ids = tf.concat([batch_gather(sel_ids, beam_ids),
                           tf.expand_dims(ids, axis=2)], axis=2)

      mask = (batch_gather(mask, beam_ids) *
              tf.to_float(tf.not_equal(ids, end_token_id)))

  return sel_ids, sel_sum_logprobs




<div class="se-preview-section-delimiter"></div>

Merge

import tensorflow as tf

def merge(tensors, units, activation=tf.nn.relu, name=None, **kwargs):
  """Merge features with broadcasting support.

  This operation concatenates multiple features of varying length and applies
  non-linear transformation to the outcome.

  Example:
    a = tf.zeros([m, 1, d1])
    b = tf.zeros([1, n, d2])
    c = merge([a, b], d3)  # shape of c would be [m, n, d3].

  Args:
    tensors: A list of tensor with the same rank.
    units: Number of units in the projection function.
  """
  with tf.variable_scope(name, default_name="merge"):
    # Apply linear projection to input tensors.
    projs = []
    for i, tensor in enumerate(tensors):
      proj = tf.layers.dense(
          tensor, units, activation=None,
          name="proj_%d" % i,
          **kwargs)
      projs.append(proj)

    # Compute sum of tensors.
    result = projs.pop()
    for proj in projs:
      result = result + proj

    # Apply nonlinearity.
    if activation:
      result = activation(result)
  return result




<div class="se-preview-section-delimiter"></div>

Entropy

import tensorflow as tf

def softmax_entropy(logits, dim=-1):
  """Compute entropy over specified dimensions."""
  plogp = tf.nn.softmax(logits, dim) * tf.nn.log_softmax(logits, dim)
  return -tf.reduce_sum(plogp, dim)




<div class="se-preview-section-delimiter"></div>

KL-Divergence

def gaussian_kl(q, p=(0., 0.)):
  """Computes KL divergence between two isotropic Gaussian distributions.

  To ensure numerical stability, this op uses mu, log(sigma^2) to represent
  the distribution. If q is not provided, it's assumed to be unit Gaussian.

  Args:
    q: A tuple (mu, log(sigma^2)) representing a multi-variatie Gaussian.
    p: A tuple (mu, log(sigma^2)) representing a multi-variatie Gaussian.
  Returns:
    A tensor representing KL(q, p).
  """
  mu1, log_sigma1_sq = q
  mu2, log_sigma2_sq = p
  return tf.reduce_sum(
    0.5 * (log_sigma2_sq - log_sigma1_sq +
           tf.exp(log_sigma1_sq - log_sigma2_sq) +
           tf.square(mu1 - mu2) / tf.exp(log_sigma2_sq) -
           1), axis=-1)




<div class="se-preview-section-delimiter"></div>

Make parallel

def make_parallel(fn, num_gpus, **kwargs):
  """Parallelize given model on multiple gpu devices.

  Args:
    fn: Arbitrary function that takes a set of input tensors and outputs a
        single tensor. First dimension of inputs and output tensor are assumed
        to be batch dimension.
    num_gpus: Number of GPU devices.
    **kwargs: Keyword arguments to be passed to the model.
  Returns:
    A tensor corresponding to the model output.
  """
  in_splits = {}
  for k, v in kwargs.items():
    in_splits[k] = tf.split(v, num_gpus)

  out_split = []
  for i in range(num_gpus):
    with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
      with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        out_split.append(fn(**{k : v[i] for k, v in in_splits.items()}))

  return tf.concat(out_split, axis=0)




<div class="se-preview-section-delimiter"></div>

Leaky relu

def leaky_relu(tensor, alpha=0.1):
    """Computes the leaky rectified linear activation."""
    return tf.maximum(tensor, alpha * tensor)




<div class="se-preview-section-delimiter"></div>

Batch normalization

def batch_normalization(tensor, training=False, epsilon=0.001, momentum=0.9, 
                        fused_batch_norm=False, name=None):
  """Performs batch normalization on given 4-D tensor.

  The features are assumed to be in NHWC format. Noe that you need to 
  run UPDATE_OPS in order for this function to perform correctly, e.g.:

  with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_op = optimizer.minimize(loss)

  Based on: https://arxiv.org/abs/1502.03167
  """
  with tf.variable_scope(name, default_name="batch_normalization"):
    channels = tensor.shape.as_list()[-1]
    axes = list(range(tensor.shape.ndims - 1))

    beta = tf.get_variable(
      'beta', channels, initializer=tf.zeros_initializer())
    gamma = tf.get_variable(
      'gamma', channels, initializer=tf.ones_initializer())

    avg_mean = tf.get_variable(
      "avg_mean", channels, initializer=tf.zeros_initializer(),
      trainable=False)
    avg_variance = tf.get_variable(
      "avg_variance", channels, initializer=tf.ones_initializer(),
      trainable=False)

    if training:
      if fused_batch_norm:
        mean, variance = None, None
      else:
        mean, variance = tf.nn.moments(tensor, axes=axes)
    else:
      mean, variance = avg_mean, avg_variance

    if fused_batch_norm:
      tensor, mean, variance = tf.nn.fused_batch_norm(
        tensor, scale=gamma, offset=beta, mean=mean, variance=variance, 
        epsilon=epsilon, is_training=training)
    else:
      tensor = tf.nn.batch_normalization(
        tensor, mean, variance, beta, gamma, epsilon)

    if training:
      update_mean = tf.assign(
        avg_mean, avg_mean * momentum + mean * (1.0 - momentum))
      update_variance = tf.assign(
        avg_variance, avg_variance * momentum + variance * (1.0 - momentum))

      tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, update_mean)
      tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, update_variance)

  return tensor




<div class="se-preview-section-delimiter"></div>

Squeeze and excitation

def squeeze_and_excite(tensor, ratio=16, name=None):
  """Apply squeeze/excite on given 4-D tensor.

  Based on: https://arxiv.org/abs/1709.01507
  """
  with tf.variable_scope(name, default_name="squeeze_and_excite"):
    original = tensor
    units = tensor.shape.as_list()[-1]
    tensor = tf.reduce_mean(tensor, [1, 2], keep_dims=True)
    tensor = tf.layers.dense(tensor, units / ratio, use_bias=False)
    tensor = tf.nn.relu(tensor)
    tensor = tf.layers.dense(tensor, units, use_bias=False)
    tensor = tf.nn.sigmoid(tensor)
    tensor = original * tensor
  return tensor

高效使用TensorFlow的最佳实践