Computer vision, deep learning, TensorFlow: miscellaneous notes and lessons learned (continuously updated)

These notes used to be scattered across my paper notebooks. I am collecting them in this post so they are easy to look up later.

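Lessons learned:

1. Always look at a random batch of the images you feed into the network!! If the loss is very large or the results look strange, check not only the network architecture but also whether the input was corrupted during preprocessing.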

import matplotlib.pyplot as plt

def sample_stack(stack, rows=6, cols=6, start_with=0, show_every=5):
    """
    Show a grid of images from a stack; a very handy tool.
    args:
    stack: shape (N, H, W), value range [0, 1];
           needs at least start_with + (rows*cols - 1)*show_every + 1 slices
    show_every: step between the displayed slices
    """
    fig, ax = plt.subplots(rows, cols, figsize=[18, 18])
    for i in range(rows * cols):
        ind = start_with + i * show_every
        ax[i // cols, i % cols].set_title('slice %d' % ind)
        ax[i // cols, i % cols].imshow(stack[ind], cmap='gray')
        ax[i // cols, i % cols].axis('off')
    plt.show()
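
Alongside the visual check, a quick numeric summary of a random batch can also catch broken preprocessing. The sketch below is only an illustration: `summarize_batch` is a hypothetical helper name (not from any library), and it assumes the same `(N, H, W)` layout and `[0, 1]` value range as `sample_stack`.

```python
import numpy as np

def summarize_batch(stack, batch_size=8, seed=None):
    """Draw a random batch from `stack` (N, H, W) and report basic statistics."""
    rng = np.random.default_rng(seed)
    n = min(batch_size, len(stack))
    idx = rng.choice(len(stack), size=n, replace=False)  # random, non-repeating indices
    batch = np.asarray(stack)[idx]
    return {
        "indices": idx,
        "min": float(batch.min()),
        "max": float(batch.max()),
        "mean": float(batch.mean()),
        "has_nan": bool(np.isnan(batch).any()),
        "in_unit_range": bool(batch.min() >= 0.0 and batch.max() <= 1.0),
    }

# Synthetic data for illustration: 20 random "images" of size 4x4 in [0, 1)
stack = np.random.default_rng(0).random((20, 4, 4))
report = summarize_batch(stack, batch_size=5, seed=1)
print(report)
```

If `in_unit_range` is False or `has_nan` is True, the preprocessing is worth a second look before blaming the network.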

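2. With tensorflow-gpu, after the first run failed, every later run hit an out-of-memory error. I checked the free GPU memory and almost none was left, yet the list of running processes was empty. In the end I logged out, logged back in and re-entered the virtual environment, which fixed it: the GPU had plenty of free memory again. I still do not know the cause.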

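Miscellaneous notes

First update:
1. Each time ResNet downsamples, it doubles the number of channels, which helps offset the information lost by downsampling.
2. Lessons from VGGNet:
2.1. Deeper networks;
2.2. Prefer 3*3 kernels: two stacked 3*3 convolutions cover the same receptive field as one 5*5 convolution with fewer parameters, and three stacked 3*3 convolutions match one 7*7 convolution, again with fewer parameters;
(The receptive field is the size of the region in the input image that a pixel on a layer's output feature map maps back to.)
2.3. A 1*1 convolution is a linear map across channels at each pixel; followed by an activation, it adds a nonlinear transformation without enlarging the receptive field
2.4. The number of channels doubles after each pooling layer (until it reaches 512)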
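
The claim in 2.2 can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of channels `C` in and out and ignores biases; the function names and the value `C = 64` are chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (no biases) for stacked convs with `channels` in and out."""
    return num_layers * kernel_size * kernel_size * channels * channels

C = 64  # assumed channel count, for illustration only

# Two 3*3 layers vs one 5*5 layer: same receptive field, fewer weights
print(stacked_receptive_field(2), conv_weights(3, C, num_layers=2), conv_weights(5, C))
# Three 3*3 layers vs one 7*7 layer
print(stacked_receptive_field(3), conv_weights(3, C, num_layers=3), conv_weights(7, C))
```

In general the comparison is 18·C² against 25·C² weights for the 5*5 case, and 27·C² against 49·C² for the 7*7 case.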
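3. Ways to explain why Dropout works:
3.1. Ensemble view: every dropout pass effectively trains a sub-network, and the final result behaves like a combination of many sub-networks.
3.2. Motivation view: it removes the co-dependence between neurons and so improves generalization.
3.3. Data view: for any output after dropout you can always find an input sample that would produce it, so dropout acts like data augmentation.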
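
To make the ensemble view concrete, here is a minimal NumPy sketch of "inverted" dropout, which is one common way to implement it. The function name `dropout_forward` and the `keep_prob` values are assumptions for illustration, not taken from any particular framework.

```python
import numpy as np

def dropout_forward(x, keep_prob=0.5, training=True, rng=None):
    """Inverted dropout: random mask during training, identity at inference."""
    if not training:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < keep_prob  # each call samples a different sub-network
    return x * mask / keep_prob             # rescale so the expected activation is unchanged

x = np.ones((2, 5))
train_out = dropout_forward(x, keep_prob=0.8, rng=np.random.default_rng(0))
test_out = dropout_forward(x, training=False)
print(train_out)
print(test_out)
```

Each training call keeps a different random subset of units, while inference uses the full network with no extra scaling.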
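4. Momentum gradient descent affects not only the size of each step but also its direction.
4.1. At the start of training, momentum accumulates and speeds training up
4.2. When training oscillates near a local extremum where the gradient is close to 0, the accumulated momentum can carry the parameters out of the trap
4.3. When the gradient changes direction, momentum damps the oscillation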
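
The update rule behind these three points can be sketched as follows. This is the classical momentum form on a toy quadratic; the names `momentum_step`, `lr` and `beta` and their values are assumptions made for the example.

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """One classical momentum update; returns the new weights and velocity."""
    velocity = beta * velocity - lr * grad  # keep a decaying sum of past gradients
    return w + velocity, velocity

# Toy objective f(w) = 0.5 * w**2, whose gradient is simply w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(w)

# With a zero gradient (flat region), the accumulated velocity still moves w
w_flat, v_flat = momentum_step(np.array([1.0]), np.array([-0.5]), grad=np.zeros(1))
print(w_flat, v_flat)
```

The second call shows point 4.2: even though the gradient is zero, the stored velocity keeps the parameters moving.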
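5. A parameter-count question: a convolutional layer has 3 input channels, 192 output channels and a 3*3 kernel. How many parameters does it have?
Answer: because the weights are shared across spatial positions, there are (3*(3*3))*192 = 5184 weights (plus 192 biases if the layer uses a bias term)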
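
The same count as a small helper, in case the numbers change. `conv2d_param_count` is a hypothetical name used only for this sketch.

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, use_bias=False):
    """Number of trainable parameters in a standard 2D convolution layer."""
    weights = in_channels * kernel_h * kernel_w * out_channels
    return weights + (out_channels if use_bias else 0)

print(conv2d_param_count(3, 192, 3, 3))                 # weights only
print(conv2d_param_count(3, 192, 3, 3, use_bias=True))  # plus one bias per output channel
```

Note that the count does not depend on the spatial size of the input image, which is exactly the effect of weight sharing.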
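6. Do not put a convolutional layer after a fully connected layer: once the features have passed through a fully connected layer, the spatial position information of the image is lost
7. Stacking images and labels in general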

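import numpy as np

a=[1,1,1] # simulates the data of one image; the result is the same if a is an array

b=[2,2,2] # simulates the data of another image

data=[] # simulates the training set

data.append(a) # simulates reading the data inside a loop

data.append(b)

data1=np.vstack(data) # simulates stacking the samples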

data1
Out[38]: 
array([[1, 1, 1],
       [2, 2, 2]])

data1.shape
Out[39]: (2, 3)

data2=np.hstack(data)  # Important: the labels that go with the images must be joined this way!!

data2
Out[41]: array([1, 1, 1, 2, 2, 2])

data2.shape
Out[42]: (6,)

8. How to shuffle the training set

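# using data1 from above as the example
data1[[1,0]] # this indexing returns a reordered copy and leaves data1 itself unchanged
Out[43]: 
array([[2, 2, 2],
       [1, 1, 1]])

"""
This form is actually more common, because we usually need to shuffle x and ground_truth together
"""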
p = np.random.permutation(2)

p
Out[47]: array([0, 1])

data_new = data1[p]

"""
Note that np.random.shuffle shuffles the original array in place
"""

data_new
Out[49]: 
array([[1, 1, 1],
       [2, 2, 2]])
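
The session above only reorders the data array. A minimal sketch of shuffling samples and labels together with one shared permutation might look like this; the arrays `x` and `y` are made up for the example.

```python
import numpy as np

x = np.arange(12).reshape(4, 3)   # 4 samples, 3 features each
y = np.array([0, 1, 2, 3])        # label i belongs to row i of x

rng = np.random.default_rng(0)
p = rng.permutation(len(x))       # one permutation shared by both arrays
x_shuffled, y_shuffled = x[p], y[p]

print(p)
print(x_shuffled)
print(y_shuffled)
```

Because both arrays are indexed with the same `p`, every row of `x_shuffled` still lines up with its label, and `x` and `y` themselves are left untouched.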

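Second update:
1. In TensorFlow, even if the machine has several CPUs, TF does not distinguish between them: all CPUs share the device name /cpu:0. Different GPUs on one machine do get different names: the first is /gpu:0, the second is /gpu:1, and so on.
By default, if an operation can run on a GPU, TF prefers to place it on /gpu:0.
2. You will often see sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
allow_soft_placement=True means that if an operation cannot run on the GPU, TF automatically places it on the CPU instead
Note that TensorFlow's kernels define which operations can run on a GPU.
log_device_placement=True means the device used for each operation is logged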
3. There are two ways to parallelize the training of a deep learning model: data parallelism and model parallelism
(1) Model parallelism: different GPUs run different parts of the code. Batches of data pass through all GPUs.
(2) Data parallelism: multiple GPUs run the same code. Each GPU is fed a different batch of data.
With data parallelism, every iteration has to start and finish together on all GPUs, so it is the better choice when the GPUs have similar memory and compute capacity.
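
The synchronization requirement of data parallelism can be illustrated without any GPU. The NumPy sketch below simulates two devices on a toy linear regression; `grad_mse`, the data and `num_devices` are all assumptions for the example, and a real setup would of course use the framework's own multi-GPU tools.

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)**2) with respect to w."""
    return x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

# Split the batch into equal shards, one per simulated device
num_devices = 2
shard_grads = [grad_mse(w, xs, ys)
               for xs, ys in zip(np.split(x, num_devices), np.split(y, num_devices))]

# Synchronization point: every shard must finish before the average is taken
avg_grad = np.mean(shard_grads, axis=0)
full_grad = grad_mse(w, x, y)
print(np.allclose(avg_grad, full_grad))
```

With equal-sized shards the averaged gradient matches the full-batch gradient, and the averaging step is where a slower device would make the others wait.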
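4. There are two ways to copy a NumPy array from main memory into GPU memory:
(1) pass the array to the session through feed_dict
(2) load the array as a tensor with tf.constant
5. tensor.get_shape() returns the static shape, which the Python wrappers compute when the graph is built. Dimensions that are unknown at that point, typically batch_size, are left undefined in the static shape. tf.shape(tensor) is computed at run time and returns the full shape, including batch_size.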
6. Formula for the output size of a transpose convolution:
same padding: trans_conv_size = input_size * stride
valid padding: trans_conv_size = input_size * stride + max(filter_size - stride, 0)
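
The two formulas above translate directly into a small helper. `trans_conv_output_size` is a hypothetical name, and the sketch covers only one spatial dimension and only the two padding modes listed.

```python
def trans_conv_output_size(input_size, stride, filter_size, padding):
    """Output length of a transpose convolution along one spatial dimension."""
    if padding == "same":
        return input_size * stride
    if padding == "valid":
        return input_size * stride + max(filter_size - stride, 0)
    raise ValueError("padding must be 'same' or 'valid'")

print(trans_conv_output_size(4, stride=2, filter_size=3, padding="same"))
print(trans_conv_output_size(4, stride=2, filter_size=3, padding="valid"))
```

For an input of size 4 with stride 2 and a 3-wide filter, this gives 8 with same padding and 9 with valid padding.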
7. For a tensor t, t.eval() is equivalent to tf.get_default_session().run(t)
8. Pay attention to the following code:

import tensorflow as tf  # TensorFlow 1.x API

w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3
with tf.Session() as sess:
    print(y.eval())
    print(z.eval())

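In the code above, evaluating y computes w and x, and evaluating z then computes w and x all over again! This is because all node values are dropped between graph runs, except variable values, which the session maintains across graph runs. So if you want to be more efficient and avoid computing w and x twice, ask TensorFlow to evaluate y and z in a single graph run, as in the following code: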

with tf.Session() as sess:
    y_val, z_val = sess.run([y,z])
    print(y_val)
    print(z_val)

9. On where batch normalization should be placed, the original paper says:

we add the BN transform before the nonlinearity, by normalizing x= Wu+b

Deep Learning for Computer Vision with Python (volume 1, chapter 11.2.6), however, recommends placing it after the nonlinearity. The original text reads:

However, this view of batch normalization doesn’t make sense from a statistical point of view. In this context, a BN layer is normalizing the distribution of features coming out of a CONV layer. Some of these features may be negative, in which case they will be clamped (i.e., set to zero) by a nonlinear activation function such as ReLU.
If we normalize before activation, we are essentially including the negative values inside the normalization. Our zero-centered features are then passed through the ReLU where we kill off any activation less than zero (which includes features which may not have been negative before the normalization) - this layer ordering entirely defeats the purpose of applying batch normalization in the first place.

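In short: an activation layer such as ReLU deactivates negative values (sets them to zero). Normalizing before ReLU means that some values that were not negative to begin with become negative and are then wiped out by ReLU, which completely defeats the purpose of applying batch normalization.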

Instead, if we place the batch normalization after ReLU we will normalize the positive valued features without statistically biasing them with features that would have otherwise not made it to the next CONV layer. In fact, Francois Chollet, the creator and maintainer of Keras confirms this point stating that BN should come after the activations.
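
The effect the book describes can be reproduced on a tiny example. The NumPy sketch below uses a simplified `batch_norm` without the learned scale and shift, and made-up input values; it only illustrates the two orderings and makes no claim about which one trains better in practice.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize over the batch axis (no learned scale/shift in this sketch)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # all features are positive

bn_then_relu = relu(batch_norm(x))   # ordering from the original paper
relu_then_bn = batch_norm(relu(x))   # ordering recommended in the book

print(bn_then_relu.ravel())
print(relu_then_bn.ravel())
```

With BN first, the two below-average features become negative after normalization and are zeroed by ReLU even though they were positive to begin with. With ReLU first, all four features survive and are then normalized.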

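10. Lesson learned: do not put Chinese text in print statements. Otherwise it is easy to forget to change them when the code moves to a server, and the job then fails halfway through the run.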