Deep Learning: Problem-Driven Model Optimization

Techniques arise in response to problems. If we apply a technique without starting from the problem it is meant to solve, the result degenerates into a pile of stacked techniques, and the side effects of each one go on to create new problems.

This post takes a problem-driven approach to summarizing commonly used model-training techniques and the problems each of them is meant to solve.

The commonly used configurations for these techniques can generally be found in the papers that introduce them. The material is split into two parts: one for CNNs and one for RNNs.

Techniques shared by both are generally placed in the CNN part.

1 CNN
1.1 weight decay

Problem solved: This prevents the weights from growing too large, and can be seen as gradient descent on a quadratic regularization term. 【keeps the weights from growing too large; acts like a regularization term】

Reference: https://metacademy.org/graphs/concepts/weight_decay_neural_networks

How to observe: look at the final weight values and their overall distribution.

Typical value:
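
A minimal sketch of how weight decay is usually attached, assuming PyTorch: the optimizer adds an L2-style penalty on the weights at every update. The model and the 1e-4 value are illustrative, not taken from the text above.

```python
import torch
import torch.nn as nn

# Toy model; weight decay is applied through the optimizer's weight_decay argument,
# which penalizes large weights (L2-style regularization) at every update.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # 1e-4 is illustrative

# To observe the effect, inspect the final weight magnitudes / distribution:
for name, p in model.named_parameters():
    print(name, p.data.abs().mean().item())
```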

1.2 momentum

Problem solved: Plain gradient descent can easily get stuck in a local minimum, and the algorithm may behave as if it has reached the global minimum, leading to sub-optimal results. To avoid this, we add a momentum term to the update, a value between 0 and 1 that increases the size of the steps taken towards the minimum and helps the optimizer jump out of local minima. 【helps jump out of local minima】

How to observe: watch the loss curve.

Typical value: 0.9

Reference: https://distill.pub/2017/momentum/
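
A minimal sketch of the classical momentum update written out by hand in plain NumPy, to make the "velocity" intuition concrete; the gradient values are made up for illustration.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """One update with momentum: v <- mu*v - lr*grad, then w <- w + v."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.3, -0.1])            # illustrative gradient
w, v = sgd_momentum_step(w, grad, v)    # repeated calls let the velocity accumulate
print(w, v)
```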

1.3 single-scale training

Images are resized such that their scale (the shorter side of the image) is 600 pixels.

Problem solved: the CNN model needs inputs of the same size.

Similar solution: SPP (spatial pyramid pooling).
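
A minimal sketch of the single-scale resizing quoted above, assuming PIL: the shorter side is scaled to 600 pixels while keeping the aspect ratio; the file name is a placeholder.

```python
from PIL import Image

def resize_shorter_side(img, target=600):
    """Resize so that min(width, height) == target, preserving aspect ratio."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

# img = resize_shorter_side(Image.open("example.jpg"))   # "example.jpg" is a placeholder
```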

1.4 multi-GPU and batch size

Each GPU holds 1 image and selects B = 128 RoIs for backprop. We train the model with 8 GPUs (so the effective mini-batch size is 8×).

What to observe: GPU utilization.
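
A minimal sketch, assuming PyTorch's DataParallel, of how the effective mini-batch size scales with the number of GPUs; the 128-RoI sampling in the quote is detector-specific and not shown.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()   # splits each batch across the visible GPUs

images_per_gpu = 1
effective_batch = images_per_gpu * max(torch.cuda.device_count(), 1)
print("effective mini-batch size:", effective_batch)
```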

1.5 4-step alternating training

1.6 train test split

Our experiments involve the 80k train set, 40k val set, and 20k test-dev set. 【R-FCN】

What to observe: whether the model overfits, and how it performs on new data.
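
A minimal sketch of a fixed-seed train/val/test split; the 80k/40k/20k sizes mirror the figures quoted above, and the integer ids stand in for a real dataset.

```python
import random

samples = list(range(140_000))        # placeholder ids for 140k images
random.Random(0).shuffle(samples)     # fixed seed so the split is reproducible

train = samples[:80_000]
val = samples[80_000:120_000]
test_dev = samples[120_000:]
print(len(train), len(val), len(test_dev))
```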

1.7 learning rate

Learning rate of 0.001 for 90k iterations and 0.0001 for the next 30k iterations. 【R-FCN】

How to decide whether to use it: watch how the loss curve changes.
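
A minimal sketch, assuming PyTorch's MultiStepLR, of the quoted schedule: 0.001 for the first 90k iterations, then 0.0001 for the next 30k. The model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Drop the learning rate by 10x at iteration 90k.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90_000], gamma=0.1)

for it in range(120_000):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])   # 0.0001 after the drop at 90k
```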

1.8 batch normalization

Problem solved: speeds up learning by reducing internal covariate shift.

Reference: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

How to decide whether to use it: the distribution of each layer's outputs, training speed, and performance on the training and test sets.
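
A minimal sketch, assuming PyTorch, of a conv + BatchNorm + ReLU block; BatchNorm2d normalizes each channel over the mini-batch and then applies a learnable scale and shift.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),        # per-channel normalization + learnable gamma/beta
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)  # batch of 8 random images
print(block(x).shape)
```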

1.9 ResNet / DenseNet

Problem solved: the vanishing gradient problem. More layers can help, but deep plain networks suffer from vanishing gradients; the objective of ResNet is to preserve the gradient via identity shortcut connections.

How to decide whether to use it: watch the weight curves during training.
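
A minimal sketch of a basic residual block, the mechanism by which ResNet preserves gradient flow: the identity shortcut "+ x" gives gradients a direct path backwards. Channel counts are illustrative.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut: gradients flow straight through "+ x"

print(BasicBlock(64)(torch.randn(1, 64, 32, 32)).shape)
```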

1.10 Inception

Problem solved: recognizing the same image content at different scales.

What to observe: build test data containing the same images at different scales and check how the model performs.
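
A minimal sketch of an Inception-style module: parallel branches with different receptive fields (1x1, 3x3, 5x5, pooling) are concatenated along the channel dimension, so the same pattern can be captured at several scales. Channel counts are illustrative.

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)                 # 1x1 branch
        self.b3 = nn.Conv2d(in_ch, 16, 3, padding=1)      # 3x3 branch
        self.b5 = nn.Conv2d(in_ch, 16, 5, padding=2)      # 5x5 branch
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 16, 1))  # pooling branch

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

print(MiniInception(32)(torch.randn(1, 32, 28, 28)).shape)   # -> (1, 64, 28, 28)
```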

1.11 MobileNet / ShuffleNet

Problem solved: reduces model size.
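
A minimal sketch of the depthwise-separable convolution MobileNet uses to shrink model size: a depthwise conv (groups = channels) followed by a 1x1 pointwise conv, compared against a standard conv by parameter count.

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, 3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, groups=64),  # depthwise: one filter per input channel
    nn.Conv2d(64, 128, 1),                       # pointwise: mixes channels
)
print(count_params(standard), count_params(separable))   # roughly 8x fewer parameters here
```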

2 RNN 

2.1 LSTM 

Problem solved: the vanishing gradient problem.

How to observe: watch the weight curves during training.

Reference: https://en.wikipedia.org/wiki/Long_short-term_memory
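
A minimal sketch, assuming PyTorch's nn.LSTM: the gated cell state gives gradients a path that is not repeatedly squashed, which is what mitigates vanishing gradients over long sequences. Sizes are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1, batch_first=True)
x = torch.randn(4, 50, 32)       # batch of 4 sequences, 50 steps, 32 features
output, (h_n, c_n) = lstm(x)     # c_n is the cell state that carries long-range memory
print(output.shape, h_n.shape, c_n.shape)
```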

2.2 GRU 

Problem solved: performance on par with LSTM, but computationally more efficient (a simpler structure, as noted in the reference below).

How to observe: watch the weight curves during training.

Reference: https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm
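
A minimal sketch comparing the parameter counts of nn.GRU and nn.LSTM at the same sizes, which is where the efficiency difference comes from (3 gate blocks vs. 4).

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
print("LSTM params:", count_params(lstm))   # 4 gate blocks
print("GRU params: ", count_params(gru))    # 3 gate blocks -> about 25% fewer
```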

2.3 Attention

Problem solved:

No matter how long the preceding context is or how much information it contains, an RNN ultimately has to compress it into a single vector of a few hundred dimensions. This means that the larger the context, the more information the final state vector loses. As Figure 1 of the blog post cited in the linked answer shows, the decoder's translations get noticeably worse as the input sentence grows longer. In fact, since the context is already known at input time, a model can make use of the entire context during decoding rather than only the last state. 【information loss during forward propagation】

Reference: https://www.zhihu.com/question/36591394/answer/69124544
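
A minimal sketch of dot-product attention over encoder states (a simplification of the mechanism described above): instead of relying only on the last state, the decoder takes a weighted sum of all encoder states at every step. Dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_states):
    """decoder_state: (hidden,); encoder_states: (seq_len, hidden)."""
    scores = encoder_states @ decoder_state    # similarity of each encoder step, (seq_len,)
    weights = F.softmax(scores, dim=0)         # attention distribution over the input
    context = weights @ encoder_states         # weighted sum of all encoder states, (hidden,)
    return context, weights

enc = torch.randn(10, 64)   # 10 encoder time steps
dec = torch.randn(64)       # current decoder hidden state
context, weights = attention(dec, enc)
print(context.shape, weights.shape)
```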

2.4 gradient clipping

Problem solved: exploding gradients, where the gradients get exponentially large from being multiplied by numbers larger than 1. Gradient clipping clips the gradients between two bounds to prevent them from getting too large.

How to observe: watch the weight curves during training.

Reference: https://www.quora.com/What-is-gradient-clipping-and-why-is-it-necessary

Vanishing vs. exploding gradients: Gradient clipping is most common in recurrent neural networks. When gradients are propagated back in time, they can vanish because they are continuously multiplied by numbers less than one; this is the vanishing gradient problem, which is addressed by LSTMs and GRUs, and in deep feedforward networks by residual connections. Exploding gradients are the opposite case and are handled by clipping, as described above.
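
A minimal sketch, assuming PyTorch, of clipping the global gradient norm before the optimizer step; the model and the max_norm of 5.0 are illustrative.

```python
import torch
import torch.nn as nn

model = nn.LSTM(16, 32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 4, 16)        # (seq_len, batch, features)
loss = model(x)[0].sum()          # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # cap the gradient norm
optimizer.step()
```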

2.5 beam search

Problem solved: a greedy, locally optimal choice of path at each step does not necessarily lead to a globally optimal result, so beam search keeps several candidate sequences open at each step. It trades efficiency for accuracy.

What to observe: compare accuracy with and without beam search.
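
A minimal sketch of beam search over a hypothetical `step_log_probs` function that returns the log-probabilities of the next token given the sequence so far; a real decoder would supply this function.

```python
import heapq

def beam_search(step_log_probs, start_token, max_len=10, beam_size=3):
    beams = [(0.0, [start_token])]                # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        # Keep the beam_size best partial sequences instead of the single greedy best.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams[0]

def toy_decoder(seq):                             # toy stand-in for a real model
    return {"a": -0.1, "b": -0.5, "c": -2.0}

print(beam_search(toy_decoder, "<s>", max_len=3))
```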

2.6 padding / bucketing / dynamic seq2seq

Problem solved: input sentences of varying lengths.
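
A minimal sketch, assuming PyTorch, of padding variable-length sequences to a common length; the lengths tensor is what a later masking or packing step would use.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)      # shape (3, 3); shorter sequences padded with 0
print(lengths)
```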

Reposted from blog.csdn.net/gao8658/article/details/81779496