Pitfalls encountered while learning TensorFlow

1. TensorFlow's variable sharing mechanism

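TensorFlow provides a variable sharing mechanism, so variable references do not have to be passed around explicitly. tf.get_variable() creates a variable or returns an existing one (returning an existing one requires reuse to be enabled in the enclosing variable_scope), while tf.Variable() always creates a new variable. TensorFlow therefore has name_scope and variable_scope to manage these two kinds of variables.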

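The differences are:

For tf.Variable(), both kinds of scope add a prefix to the variable name.

For tf.get_variable(), only the prefix added by variable_scope takes effect; name_scope is ignored.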

import tensorflow as tf  # TensorFlow 1.x API

with tf.name_scope("my_scope"):
    v1 = tf.get_variable("var1", [1], dtype=tf.float32)
    v2 = tf.Variable(1, name="var2", dtype=tf.float32)
    a = tf.add(v1, v2)

print(v1.name)  # var1:0
print(v2.name)  # my_scope/var2:0
print(a.name)   # my_scope/Add:0

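# The names below assume a fresh graph: if "my_scope" is already in use as a
# name scope in the same graph, v2 and a get the prefix my_scope_1/ instead.
with tf.variable_scope("my_scope"):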
    v1 = tf.get_variable("var1", [1], dtype=tf.float32)
    v2 = tf.Variable(1, name="var2", dtype=tf.float32)
    a = tf.add(v1, v2)

print(v1.name)  # my_scope/var1:0
print(v2.name)  # my_scope/var2:0
print(a.name)   # my_scope/Add:0

2. Averaging multiple tensors

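tf.concat concatenates a list of tensors along one dimension.

tf.concat(values, axis, name='concat') (before TensorFlow 1.0 the signature was tf.concat(concat_dim, values, name='concat')): axis=0 concatenates along rows, axis=1 concatenates along columns.

tf.expand_dims(input, axis, name=None) inserts a new dimension of size 1 at index axis (older releases call this argument dim).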

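In the multi-GPU training version of cifar10, the gradients computed on the individual GPUs have to be averaged at the end, in the role of a parameter server. The implementation is as follows:

The input has the form [ [g,v in tower1], [g,v in tower2] ... ], so zip(*var) regroups it into a list whose length is the number of trainable parameters, where each item holds the (grad, var) pair of every tower for one parameter. The following example shows this kind of unzipping: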

>>> a=[[(1,2), (5,6)], [(3,4), (7,8)]]
>>> a
[[(1, 2), (5, 6)], [(3, 4), (7, 8)]]
>>> list(zip(*a))
[((1, 2), (3, 4)), ((5, 6), (7, 8))]

    for g, _ in grad_and_vars:
      # Add 0 dimension to the gradients to represent the tower.
      expanded_g = tf.expand_dims(g, 0)

      # Append on a 'tower' dimension which we will average over below.
      grads.append(expanded_g)

    # Average over the 'tower' dimension.
    grad = tf.concat(axis=0, values=grads)
    grad = tf.reduce_mean(grad, 0)

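To average the (grad, var) pairs across towers, each gradient first gets a new dimension 0; concat then stacks the gradients as rows, and reduce_mean over dimension 0 averages down each column, which gives the averaged gradient for each parameter.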

    v = grad_and_vars[0][1]
    grad_and_var = (grad, v)
    average_grads.append(grad_and_var)

Because the parameters are shared across towers, it is enough to take the var from any one tower (here the first) to identify the parameter.
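The same regroup-then-average logic can be tried out without a GPU. The following is a minimal sketch that uses numpy arrays in place of TensorFlow tensors; the function name average_gradients, the variable names "w" and "b", and the gradient values are assumptions made up for illustration, not the cifar10 code itself.

```python
import numpy as np

def average_gradients(tower_grads):
    """Average gradients over towers.

    tower_grads: one list per tower; each list holds (grad, var) pairs
    in the same parameter order.
    """
    average_grads = []
    # zip(*tower_grads) yields one tuple per parameter, containing that
    # parameter's (grad, var) pair from every tower.
    for grad_and_vars in zip(*tower_grads):
        # Add a leading 'tower' axis to each gradient.
        grads = [np.expand_dims(g, 0) for g, _ in grad_and_vars]
        # Join along the tower axis, then average over it.
        grad = np.concatenate(grads, axis=0).mean(axis=0)
        # Variables are shared, so the first tower's var identifies the parameter.
        v = grad_and_vars[0][1]
        average_grads.append((grad, v))
    return average_grads

# Illustrative values only: two towers, two parameters.
tower1 = [(np.array([1.0, 2.0]), "w"), (np.array([[1.0]]), "b")]
tower2 = [(np.array([3.0, 6.0]), "w"), (np.array([[3.0]]), "b")]
averaged = average_gradients([tower1, tower2])
for grad, name in averaged:
    print(name, grad)
```

Each averaged gradient keeps the shape of the original per-tower gradient, because the tower axis that expand_dims adds is the one that the mean removes.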

3. get_collection and add_to_collection

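get_collection('loss', scope) returns the list of values stored in the collection under the key 'loss'. If scope is given, the list is filtered down to the items whose name matches scope; the match uses re.match, so a plain scope string acts as a name prefix.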

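With `with name_scope() as scope`, variables created by a plain get_variable() do not carry the enclosing scope prefix, but anything produced by running a TensorFlow operation does, for example loss and weight_decay; see section 1 for details.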

In the multi-GPU version of cifar10, the variables are shared between towers but the loss is not, so every loss carries the name_scope prefix of its tower by default.
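The prefix filtering can be illustrated in plain Python. This is only a sketch of the matching rule, not TensorFlow's implementation: Item, filter_by_scope and the tower_0/... names are hypothetical stand-ins for graph ops and their names.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a graph op or tensor: only its name matters here.
Item = namedtuple("Item", ["name"])

def filter_by_scope(collection, scope=None):
    """Return the items whose name matches scope from the start (re.match)."""
    if scope is None:
        return list(collection)
    return [item for item in collection
            if hasattr(item, "name") and re.match(scope, item.name)]

# Illustrative per-tower loss names.
losses = [
    Item("tower_0/cross_entropy"),
    Item("tower_0/weight_loss"),
    Item("tower_1/cross_entropy"),
]

tower0_losses = filter_by_scope(losses, "tower_0/")
print([item.name for item in tower0_losses])

# re.match anchors at the start of the name, so a scope that only appears
# in the middle of a name selects nothing.
print(filter_by_scope(losses, "cross_entropy"))
```

Because the match is anchored at the start, passing the tower's scope selects exactly that tower's losses, which is what makes per-tower loss collection work when the variables themselves are shared.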

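GraphKeys defines the standard names of the collections attached to a graph, including common ones such as GLOBAL_VARIABLES, TRAINABLE_VARIABLES and TRAIN_OP. Subclasses of Optimizer optimize the variables in TRAINABLE_VARIABLES by default.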

4. Threading and queues

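A Queue is a node in the TensorFlow graph; other nodes can operate on it by enqueuing and dequeuing. The usual pattern is several threads enqueuing and one thread dequeuing, which is efficient. Handling the rest by hand is tedious, though: coordinating the threads, catching an exception and shutting the thread pool down when something fails, and releasing the Queue once training finishes. TensorFlow therefore provides Coordinator and QueueRunner to simplify this.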

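The Coordinator coordinates the threads and controls stopping and waiting through three methods: should_stop, request_stop and join. The argument to join is the list of threads to wait for: the calling thread blocks until all the threads created by the QueueRunner have terminated.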

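QueueRunner provides the multi-threaded execution environment. It also runs a closer thread in the background that closes the queue when an exception is reported to the coordinator. A typical multi-threaded data processing pattern from the TensorFlow documentation looks like this: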

example = ...ops to create one example...
# Create a queue, and an op that enqueues examples one at a time in the queue.
queue = tf.RandomShuffleQueue(...)
enqueue_op = queue.enqueue(example)
# Create a training graph that starts by dequeuing a batch of examples.
inputs = queue.dequeue_many(batch_size)
train_op = ...use 'inputs' to build the training part of the graph...

qr = tf.train.QueueRunner(queue, [enqueue_op] * 4)

# Launch the graph.
sess = tf.Session()
# Create a coordinator, launch the queue runner threads.
coord = tf.train.Coordinator()
enqueue_threads = qr.create_threads(sess, coord=coord, start=True)
# Run the training loop, controlling termination with the coordinator.
for step in range(1000000):
    if coord.should_stop():
        break
    sess.run(train_op)
# When done, ask the threads to stop.
coord.request_stop()
# And wait for them to actually do it.
coord.join(enqueue_threads)
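The should_stop / request_stop / join protocol can be mimicked with the Python standard library alone. The sketch below is an analogue of the pattern, not TensorFlow's implementation: SimpleCoordinator, enqueue_loop and run_demo are hypothetical names, and a queue.Queue stands in for the TensorFlow queue.

```python
import queue
import threading

class SimpleCoordinator:
    """Toy stand-in for tf.train.Coordinator, built on threading.Event."""

    def __init__(self):
        self._stop_event = threading.Event()

    def should_stop(self):
        return self._stop_event.is_set()

    def request_stop(self):
        self._stop_event.set()

    def join(self, threads):
        # Block the caller until every listed thread has finished.
        for t in threads:
            t.join()

def enqueue_loop(coord, q, worker_id):
    """Producer: keep enqueuing until the coordinator asks to stop."""
    item = 0
    while not coord.should_stop():
        try:
            # The timeout lets a blocked producer re-check should_stop().
            q.put((worker_id, item), timeout=0.05)
            item += 1
        except queue.Full:
            continue

def run_demo(num_workers=4, num_steps=20):
    coord = SimpleCoordinator()
    q = queue.Queue(maxsize=8)
    threads = [threading.Thread(target=enqueue_loop, args=(coord, q, i))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    consumed = []
    # Single consumer, playing the role of the training loop.
    for _ in range(num_steps):
        if coord.should_stop():
            break
        consumed.append(q.get())
    coord.request_stop()   # ask the producers to stop
    coord.join(threads)    # and wait until they actually have
    return consumed, threads

consumed, threads = run_demo()
print(len(consumed), "items consumed")
```

The put timeout matters in this sketch: without it, a producer blocked on a full queue would never see the stop request and join would hang.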

5. stack

The stack function packs a list of rank-r tensors into a single rank-(r+1) tensor; axis defaults to 0.

x = tf.constant([1, 4])
y = tf.constant([2, 5])
z = tf.constant([3, 6])
tf.stack([x, y, z])  # [[1, 4], [2, 5], [3, 6]] (Pack along first dim.)
tf.stack([x, y, z], axis=1)  # [[1, 2, 3], [4, 5, 6]]
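For inputs of higher rank it helps to look at where the new axis ends up. The sketch below uses np.stack, which follows the same axis convention as the examples above; the array contents and the choice of four (2, 3) inputs are arbitrary.

```python
import numpy as np

# Four rank-2 arrays of shape (2, 3); the contents are arbitrary.
parts = [np.full((2, 3), i) for i in range(4)]

# Stacking adds one axis of length len(parts) at the requested position.
shapes = {axis: np.stack(parts, axis=axis).shape for axis in (0, 1, 2)}
for axis, shape in shapes.items():
    print("axis =", axis, "->", shape)
```

The new axis always has length equal to the number of stacked inputs, and axis only decides where it is inserted.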

6. Transfer learning

add_jpeg_decoding() provides the ops that decode an image: jpeg_data_tensor is a placeholder, and decoded_image_tensor is the tensor holding the decoded value of that placeholder.

jpeg_data_tensor, decoded_image_tensor = add_jpeg_decoding(
        model_info['input_width'], model_info['input_height'],
        model_info['input_depth'], model_info['input_mean'],
        model_info['input_std'])

resized_input_values = sess.run(decoded_image_tensor, {jpeg_data_tensor: image_data})

image_data, a string, is passed in through feed_dict; running the decoding and normalization in the session yields resized_input_values.

When the graph is rebuilt, two tensors have to be exposed: the input layer and the output layer of the reused model.

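resized_input_tensor_name = 'input:0' is the input tensor. When the graph is rebuilt, this tensor is used as the feed_dict key, and the decoded image values (resized_input_values) are fed in under it.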

bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0' is the tensor of the penultimate layer; its value is the output of the FC layer.

7. Ways to turn a multi-dimensional array into a one-dimensional array

In numpy, reshape(-1), flatten and ravel all turn a multi-dimensional array into a one-dimensional one. squeeze is different: it only removes the axes of length 1 from the shape.

x = np.array([[[0], [1], [2]]])

In [12]: x.shape
Out[12]: (1, 3, 1)

In [13]: np.squeeze(x).shape
Out[13]: (3,)

In [14]: np.squeeze(x)
Out[14]: array([0, 1, 2])
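The difference shows up as soon as the array has no axes of length 1. A small comparison follows; the array m is an arbitrary example chosen for illustration.

```python
import numpy as np

x = np.array([[[0], [1], [2]]])    # shape (1, 3, 1)
m = np.arange(6).reshape(2, 3)     # shape (2, 3), no axes of length 1

# All three flattening methods give a 1-D array of 6 elements.
flat_shapes = (m.reshape(-1).shape, m.flatten().shape, m.ravel().shape)
print(flat_shapes)

# squeeze only drops axes of length 1, so m keeps its shape.
squeezed_x = np.squeeze(x).shape
squeezed_m = np.squeeze(m).shape
print(squeezed_x, squeezed_m)

# flatten always copies; ravel returns a view when the memory layout allows.
ravel_shares = np.shares_memory(m, m.ravel())
flatten_shares = np.shares_memory(m, m.flatten())
print(ravel_shares, flatten_shares)
```

So squeeze flattens an array only when every axis but one has length 1, as in the (1, 3, 1) case above.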

8. Cannot interpret feed_dict key as Tensor

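While running the RNN demo ptb_word_lm, the code builds one graph, initializes the variables and exports them to collections. In a second graph it then starts run_epoch, and when the learning rate is initialized it fails with an error saying that the feed_dict key cannot be interpreted as a Tensor. The cause is that the two variables were not created in the same graph, and a feed_dict key must belong to the graph the session is running. In this example, the first graph exported the parameters, but the second graph did not import them before it started.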