1_Convolution

1. Some tips on architecture

  1. In order to ensure that the spatial size is preserved, i.e. input volume == output volume (e.g. 5x5 in, 5x5 out):

    there is a simple formula: P = (F-1)/2 when S = 1, with a simple proof:

    (W - F + 2*P)/S + 1 == W; setting S == 1 and solving for P gives P = (F-1)/2

    Consider a filter; it has some key components: spatial size (W, H), depth (the number of filters), stride, and zero-padding size.

  2. Constraints on strides:

    If the output-volume formula (W-F+2*P)/S+1 does not divide evenly, the current W, F, P, S do not match: this set of hyperparameters F, P, S is invalid. A framework like TensorFlow will then (1) zero-pad the input or (2) crop the input so that the formula works out. (See the Python sketch at the end of this section.)

  3. Local Connectivity:

    1. xx
    2. xxx
    3. Revisit the cs231n notes again to see how they explain local connectivity.
  4. Parameter Sharing:

    • One reasonable assumption:

      That if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias.

      Note: i.e. the neurons of the volume are grouped by depth slice; each group (a 2-D feature map corresponding to one depth index) shares a single 2-D set of filter weights.

    • Backpropagation:

      In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.

      Note: all neurons are grouped by depth slice; the gradients of the neurons within one group are accumulated, and the sum is used to compute the gradients of that group's single set of weights (the weights that make up one filter).

    • The parameter sharing assumption is only relatively reasonable (note "relatively": there are exceptions):

      Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. (translation invariance)

    • Exceptions to the parameter sharing above:

      Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.

      Note: images with an obvious centered structure, such as faces, do not fully satisfy translation invariance. [Extracting certain features has a preference for where the Conv op is applied; performing the same Conv op at different positions produces different results.]

    • For the numpy example discussion of parameter sharing, see: http://cs231n.github.io/convolutional-networks/#overview [Working through the output map computation in that example by hand helps you hand-write your own tf.nn.conv2d and deepens your understanding of the Conv op.]
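
  Below is a minimal Python sketch of the output-size arithmetic from tips 1 and 2 above (`conv_output_size` is a hypothetical helper, not a library function):

  ```python
  def conv_output_size(W, F, P, S):
      """Spatial size of a conv output; raise if the hyperparameters don't fit."""
      numerator = W - F + 2 * P
      if numerator % S != 0:
          # the case a framework would handle by extra padding or cropping
          raise ValueError("invalid hyperparameters: (W - F + 2P) not divisible by S")
      return numerator // S + 1

  # S=1 with P=(F-1)/2 preserves the spatial size (5x5 in, 5x5 out):
  assert conv_output_size(W=5, F=3, P=1, S=1) == 5
  # The AlexNet-style example used later: (227 - 11 + 0)/4 + 1 = 55
  assert conv_output_size(W=227, F=11, P=0, S=4) == 55
  ```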

2. Some computation tips for Conv layers:

  1. filter x input volume: the analogy is two cuboids colliding and combining into a rectangular plane (W_o x H_o). So K filters applied to one input volume produce a cuboid (W_o x H_o x K).

  2. Implementing the Conv above with matrix multiplication (see the numpy sketch after this list):

    • Convolve a [227x227x3] input volume with 96 [11x11x3] filters at stride 4, pad 0.
    • (1) The output volume's W (and H): (227-11+2x0)/4 + 1 = 55, so one filter slides over the input volume 55x55 times (counting both the w and h directions), performing one convolution computation per position.
      • The "convolution computation" here means convolving the [11x11x3] filter with an [11x11x3] patch of the image; compared with the earlier 2-D case, depth is added. (This probably still doesn't count as 3-D convolution, to be verified.)
    • (2) The input volume contributes [55x55] = 3025 patches to the convolution. Each [11x11x3] patch, stretched out, is a 363-dim vector, so we can simulate the filter's sliding over the raw [227x227x3] input volume to produce 3025 363-dim vectors; these vectors are the patches that will later be convolved with the [11x11x3] filters in 3-D. Now assemble these patches into a matrix of shape [363x3025]: each column is one patch's 363-dim vector representation, 3025 columns in total (i.e. 3025 patches).
    • (3) Likewise, stretch each [11x11x3] filter into a 363-dim vector; with 96 filters this builds a matrix of shape [96x363]: 96 vectors of 363 dims.
    • (4) The earlier convolution of the [227x227x3] input volume with 96 [11x11x3] filters at stride 4, pad 0 can then be formulated as the matrix product: W matrix [96x363] * X matrix [363x3025], giving a matrix C of shape [96x3025].
    • (5) C is really 96 2-D feature maps, so reshape [3025] to [55x55] to obtain the convolution result [96x55x55], or reorder it to [55x55x96].
    • For more detail see: http://cs231n.github.io/convolutional-networks/#overview
    • The construction of the X matrix above uses the im2col idea, which will come up again when implementing pooling.
    • The next optimization step: exploit the structure of the W and X matrices for numerical optimizations. (For example, analyzing how X is generated, it is easy to see that many values in X are replicated, i.e. duplicates.)
  3. Note that we operate over 3-dimensional volumes, and that the filters always extend through the full depth of the input volume. For example, if the input is [32x32x3] then doing 1x1 convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels).

    So even a 1x1 convolution is in essence a 1x1xC filter. Keep this "extends through the full depth" idea firmly in mind!

  4. Dilated convolutions:

    This can be very useful in some settings to use in conjunction with 0-dilated filters because it allows you to merge spatial information across the inputs much more aggressively with fewer layers.

    The main use: efficiently merging feature map information across different levels [feature pyramid]. I am not yet sure about this conclusion; I will revise it once I have actually used dilated convolutions.
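
  A minimal numpy sketch of the im2col-based convolution walked through in tip 2 above (the `im2col` helper and variable names are illustrative; shapes follow the [227x227x3] example with 96 [11x11x3] filters, stride 4, pad 0):

  ```python
  import numpy as np

  def im2col(x, F, S):
      """Stretch every FxFxC patch of x (shape H, W, C) into one column."""
      H, W, C = x.shape
      out = (H - F) // S + 1              # assumes P=0 and exact divisibility
      cols = np.empty((F * F * C, out * out))
      for i in range(out):
          for j in range(out):
              patch = x[i*S:i*S+F, j*S:j*S+F, :]
              cols[:, i * out + j] = patch.reshape(-1)
      return cols

  x = np.random.randn(227, 227, 3)        # input volume
  w = np.random.randn(96, 11, 11, 3)      # 96 filters of size 11x11x3

  X = im2col(x, F=11, S=4)                # [363 x 3025]: one column per patch
  W = w.reshape(96, -1)                   # [96 x 363]: one row per flattened filter
  C = W @ X                               # [96 x 3025]
  out = C.reshape(96, 55, 55).transpose(1, 2, 0)   # [55x55x96] output volume
  ```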

3. Pooling Layer:

  1. The pooling layer downsamples the volume spatially, independently in each depth slice of the input volume.

  2. For Pooling layers, it is not common to pad the input using zero-padding.

    However, it is not impossible: in SPP-net, for example, the only way to pool input volumes of arbitrary size is to choose a suitable amount of zero padding based on the input volume.

  3. Overlapping pooling also exists, e.g. a pooling layer with F=3, S=2.
  4. Max pooling is generally used. So far the only use of average pooling I have seen is the global average pooling in NiN (Network in Network), used as the last layer to map each 2-D feature map to a scalar, substituting for the classification role of softmax.
  5. For the backprop of max pooling there is one principle, the backprop of max(x,y,z): routing the gradient to the input that had the highest value in the forward pass (if x > y > z in the forward pass, then during backprop the upstream gradient is routed only to x, and the upstream gradients received by y and z are both 0). For more detail, consult a paper and several blogs later. (See the sketch after this list.)
  6. Since pooling loses a lot of information, many later papers recommend FCNs (fully convolutional networks), reducing or even removing pooling layers. [As for reducing parameters, there are many other means, e.g. using a larger stride in Conv.]
  7. Regarding the various normalization layers, e.g. LRN (local response normalization) in AlexNet: the current view is that they do not help much. As far as I know, the best-performing one in current CNN architectures is Batch Normalization, proposed by Google in 2015 (it helps your network converge more smoothly, without requiring very careful weight initialization or worrying about vanishing/exploding gradients).
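
  A minimal numpy sketch of max pooling with the gradient-routing rule from item 5 above, on a single depth slice (pooling acts independently per slice); the helper names are illustrative:

  ```python
  import numpy as np

  def maxpool_forward(x, F=2, S=2):
      """Max-pool one depth slice x (H, W); record each window's argmax for backprop."""
      H, W = x.shape
      out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
      y = np.empty((out_h, out_w))
      argmax = np.empty((out_h, out_w, 2), dtype=int)
      for i in range(out_h):
          for j in range(out_w):
              win = x[i*S:i*S+F, j*S:j*S+F]
              r, c = np.unravel_index(np.argmax(win), win.shape)
              y[i, j] = win[r, c]
              argmax[i, j] = (i*S + r, j*S + c)  # position of the forward-pass max
      return y, argmax

  def maxpool_backward(dy, argmax, in_shape):
      """Route each upstream gradient only to the input that won the forward max."""
      dx = np.zeros(in_shape)
      for i in range(dy.shape[0]):
          for j in range(dy.shape[1]):
              r, c = argmax[i, j]
              dx[r, c] += dy[i, j]    # all non-max inputs receive zero gradient
      return dx
  ```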

4. Converting CONV layers to FC layers

  1. If a Conv layer is expressed as an FC layer, the FC layer's huge weight matrix has two properties: (1) it is sparse, mostly zeros, because of local connectivity; (2) many blocks are identical, because of parameter sharing. Apart from that, the convolution operation is essentially a dot product, no different from the matmul in an FC layer.

5. FC Layer -> Conv Layer

  1. Key idea: make the filter size exactly identical to the input volume's size (W, H, C) and set P=0, S=1, so that each filter convolves with the input volume exactly once, yielding a scalar. [With multiple filters, we get a 1-dim vector.]

  2. For example, with an input volume of 7x7x512, the original FC layer size is K = 4096 (the FC layer size meaning the layer outputs a 4096-dim vector).

  3. The conversion: F = 7, P = 0, S = 1, K = 4096, i.e. 4096 filters of size 7x7x512.

  4. Because P=0, S=1, and the filter size exactly equals the input volume size, the two convolve only once: [7x7x512 convolved with 7x7x512 gives a 1x1 2-D feature map]. And because there are 4096 filters, the output volume cuboid is [1x1x4096], i.e. a 4096-dim vector. This output is exactly the same as that of the FC layer with K=4096. (A numpy check of this equivalence appears at the end of this section.)

  5. Converting an FC layer to a Conv layer is very useful in practice. For example, in AlexNet, the raw image input volume [227x227x3] passes through a series of Conv-Pooling layers to give a feature volume [7x7x512]. But this is followed by 3 consecutive FC layers with K_1 = 4096, K_2 = 4096, K_3 = 1000. Since the FC layers carry too many parameters, consider converting them to Conv layers.

    The concrete steps are:

    • Replace the first FC layer that looks at the [7x7x512] volume with a CONV layer that uses filter size F=7 (i.e. 4096 filters of size 7x7x512), giving output volume [1x1x4096].
    • Replace the second FC layer with a CONV layer that uses filter size F=1 (i.e. 4096 filters of size 1x1x4096), giving output volume [1x1x4096].
    • Replace the last FC layer similarly, with F=1 (i.e. 1000 filters of size 1x1x4096), giving final output [1x1x1000].
    • For the Conv layers above, the filters use P=0, S=1 (i.e. each filter slides only once).
  6. As for the benefit of this conversion, the Notes say: the larger the input volume, the more computationally efficient the FC-to-Conv conversion becomes. The original text:

    Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix W in each FC layer into CONV layer filters. It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

    For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.

  7. I suddenly understood the point of item 6 above:

    • This AlexNet originally has [5 Conv layers + 3 FC layers].

    • Now replace the 3 FC layers with [7x7x512 Conv6, S=1, P=0, K=4096] + [1x1x4096 Conv7, S=1, P=0, K=4096] + [1x1x4096 Conv8, S=1, P=0, K=1000].

    • First pass: input one [224x224x3] image; the 5 Conv layers give [7x7x512], which is fed as the input volume into the following 3 Conv layers (converted from the 3 FC layers), outputting a [1x1x1000] score.

    • Second pass: input one [384x384x3] image; the 5 Conv layers give [12x12x512], which is fed as the input volume into the following 3 Conv layers: with input volume [12x12x512], (1) Conv6 [7x7x512, S=1, P=0, K=4096] outputs [6x6x4096]; (2) the [6x6x4096] input volume through Conv7 [1x1x4096, S=1, P=0, K=4096] outputs [6x6x4096]; (3) the [6x6x4096] input volume through Conv8 [1x1x4096, S=1, P=0, K=1000] outputs the [6x6x1000] output volume.

    • Call the earlier [5 Conv + 3 FC] the original ConvNet, and the converted [5 Conv + 3 Conv] the converted ConvNet.

    • The original text describes this as follows:

      Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.

      My understanding: cropping the 384x384 input volume yields 6x6 [224x224] crops; evaluate these 6x6 crops with the original ConvNet (36 runs?) and combine the results into res_original; also feed the 384x384 input volume through the converted ConvNet directly (one run?) and combine the results into res_converted. It turns out that res_original and res_converted are identical!

      Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.

      My interpretation: in practice, enlarge the small image into a bigger one and run the converted ConvNet; this is computationally more efficient than iterating the original ConvNet over all the crops.

    • An IPython Notebook on Net Surgery shows how to perform the conversion in practice, in code (using Caffe)
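
  A minimal numpy check of the FC-to-Conv equivalence described in items 1-4, for the first converted layer (FC with K=4096 on a [7x7x512] volume vs. 4096 filters of size 7x7x512 with P=0, S=1); variable names are illustrative:

  ```python
  import numpy as np

  x = np.random.randn(7, 7, 512)               # input volume
  W_fc = np.random.randn(4096, 7 * 7 * 512)    # FC weight matrix
  b = np.random.randn(4096)

  # FC view: flatten the volume and do a matmul.
  out_fc = W_fc @ x.reshape(-1) + b            # 4096-dim vector

  # Conv view: reshape each FC row into one 7x7x512 filter. With P=0, S=1
  # the filter fits the input exactly once, so each "convolution" is one
  # dot product and the output volume is [1x1x4096].
  W_conv = W_fc.reshape(4096, 7, 7, 512)
  out_conv = np.array([np.sum(W_conv[k] * x) for k in range(4096)]) + b

  assert np.allclose(out_fc, out_conv)         # identical outputs
  ```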

6. ConvNet Architectures

  1. Basic components: Conv, Pool (usually max pooling), FC, ReLU (used here to explicitly denote the non-linear activation function).

  2. How to stack them together efficiently?

  3. Prefer a stack of small-filter CONV layers to one CONV layer with a large receptive field: obtain a large effective receptive field by cascading several layers of small filters, rather than directly using a single larger filter to get the same effective receptive field.

    • Effective receptive field:
    • For the analysis, starting from the 1-D case is more intuitive and simpler; once the analysis is formalized, generalize it to 2-D and higher dimensions.
    • Starting from the 1-D analysis: (insert my own hand-drawn figure, 1->3->5->7)
    • xxx
    • Parameter count comparison: one 7x7xC vs. three 3x3xC (see the sketch after this list):
      • One 7x7xC: Cx(7x7xC) = 49xC^2
      • Three 3x3xC: 3x[Cx(3x3xC)] = 27xC^2
      • Essence: the parameter count involves [the filter's W^2 or H^2, which is quadratic, and the number of filters K, which is linear], and the quadratic term grows much faster than the linear one!
      • A confusion I have long had: why Cx(7x7xC) and not just (7x7xC)? Why the leading factor of C? An answer by HaoZhang_NJU on Zhihu explains: by default the depth of the output feature map and the depth of the input feature map are both C; in other words, the number of filters is also C by default, keeping the feature map depth unchanged before and after the convolution.
      • (7x7xC) is easy to understand: it is the volume of one 7x7xC filter, i.e. its total parameter count, applied to one input volume.
      • But what does the extra leading C mean? That C is not the number of filters! [Think about this again when there is time.]
  4. Benefits of smaller filters:

    • With more layers, more non-linearities (e.g. ReLU) can be inserted.
    • Fewer parameters (what about the computation cost?)
  5. Layer Sizing Patterns:

    • The input layer (that contains the image): divisible by 2 many times, e.g. 32 (CIFAR-10), 224 (ImageNet).
    • The conv layers: use small filters (e.g. 3x3 or at most 5x5) with a stride of S=1; then (F=3, P=1 or F=5, P=2, i.e. P=(F-1)/2) will retain the original size of the input.
    • The pool layers: (1) downsample the spatial dimensions of the input; (2) most common: F=2, S=2 max pooling; (3) SPP-Net once used F=3, S=2, but it is very uncommon.
    • There is also an alternative to the pool layer: use a stride greater than 1 with P=0 in Conv; just set the parameter combinations carefully so that [(W-F+2xP)/S+1] works out, otherwise you will hit errors or information loss (frameworks like TensorFlow handle the non-divisible cases by discarding some pixels).
    • Why use a stride of 1 in Conv? Answer: it can be paired with P=(F-1)/2 to keep the spatial resolution unchanged, i.e. Conv preserves the spatial resolution and handles depth (e.g. growing or shrinking it), while Pool focuses on reducing the spatial resolution without touching depth.
    • Why use padding? Answer: it reduces the loss of border information when performing Conv while also regulating the spatial resolution; in practice it improves performance.
    • Compromising based on memory constraints: more filters produce more and larger feature maps and demand more GPU memory, so a CNN architecture usually has to make some compromises. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filter sizes of 11x11 and stride of 4.
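
  A short numeric sketch of the two claims in pattern 3 above (receptive-field growth of stacked 3x3 filters and the 49xC^2 vs 27xC^2 parameter counts); plain arithmetic, with C chosen arbitrarily:

  ```python
  def effective_receptive_field(num_layers, F=3):
      """Receptive field of a stack of conv layers with filter size F and S=1."""
      rf = 1
      for _ in range(num_layers):
          rf += F - 1          # with S=1 each extra layer adds F-1: 1 -> 3 -> 5 -> 7
      return rf

  assert effective_receptive_field(3, F=3) == 7   # three 3x3 layers see 7x7

  C = 64                                   # assume depth C is kept through every layer
  params_one_7x7 = C * (7 * 7 * C)         # 49 * C^2  (C filters of size 7x7xC)
  params_three_3x3 = 3 * C * (3 * 3 * C)   # 27 * C^2  (three layers of C 3x3xC filters)
  assert params_three_3x3 < params_one_7x7
  ```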

7. Case Study:

  1. LeNet/AlexNet/ZFNet/NiN/GoogLeNet/VGGNet/ResNet

  2. Conv layers contribute the vast majority of a CNN's computation, while FC layers contribute the vast majority of its parameters.

  3. Computational Considerations:

    • The largest bottleneck when constructing a ConvNet: the memory of a single GPU.

    • There are three major sources of memory to keep track of (three big consumers of memory):

    • (1) The intermediate volume sizes. There is a trick, applicable only at the inference stage, to reduce memory use: keep only the current layer's activation values and discard the earlier layers' activation values (analogous to the bottom-up style in DP, where later computation reuses earlier results to speed things up).

    • (2)the parameter sizes: the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp

    • (3)Every ConvNet implementation has to maintain miscellaneous memory, such as the image data batches, perhaps their augmented versions, etc.

    • A back-of-the-envelope estimate for these three memory uses (this notion comes from JunhuiDeng_THU); see the sketch after this list:

      Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn’t fit, a common heuristic to “make it fit” is to decrease the batch size, since most of the memory is usually consumed by the activations.

    • If memory runs out during training, decreasing the batch size is the universal, simplest fix.
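
  A minimal sketch of the back-of-the-envelope estimate quoted above; the value counts are illustrative placeholders (roughly VGG-16-scale numbers from the cs231n notes), not measurements:

  ```python
  def values_to_gb(num_values, bytes_per_value=4):
      """Convert a count of float32 values (4 bytes each) to gigabytes."""
      return num_values * bytes_per_value / 1024 ** 3

  activations_per_image = 24_000_000   # forward activations; ~x2 to keep gradients too
  parameters = 138_000_000             # ~x3 with gradients and an optimizer step cache
  batch_size = 64

  total_values = activations_per_image * 2 * batch_size + parameters * 3
  print(f"~{values_to_gb(total_values):.1f} GB")   # too big? decrease the batch size
  ```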

8. References

[1]. http://cs231n.github.io/convolutional-networks/#overview
