In-depth understanding of dropout

Original link: https://blog.csdn.net/xinzhangyanxiang/article/details/49022443
Note: the images are hosted on GitHub; if they fail to load, you may need a proxy to view them.
Please cite when reposting: http://blog.csdn.net/stdcoutzyx/article/details/49022443

  
  
To state the point up front: dropout means that, while training a deep network, neural units are temporarily dropped from the network with a certain probability. Note the word "temporarily": with stochastic gradient descent, because units are dropped at random, every mini-batch effectively trains a different network.

Dropout is a powerful weapon for preventing overfitting and improving performance in CNNs, but why it works is still debated. I read two representative papers that take two different viewpoints, and I would like to share them here.

The ensemble school

This viewpoint comes from the first paper in the references, from Hinton's group. Hinton's standing in the deep-learning community needs no introduction; on that standing alone, this school of thought counts as the "Wudang or Shaolin" of the field. (The label is my own invention, so please don't laugh.)

Viewpoint

The paper starts from the problems of neural networks and, step by step, arrives at an explanation of why dropout works. Large neural networks have two drawbacks:

  • Time-consuming
  • Prone to overfitting

These two drawbacks are the two heavy burdens hanging on the leg of deep learning; they go hand in hand and reinforce each other. Overfitting is a common problem in machine learning, and once a model overfits it is basically useless. The usual cure for overfitting is ensembling: train several models and combine them. But then time becomes the big problem, because both training and testing multiple models are very slow. In short, the two issues nearly form a deadlock.

Dropout neatly resolves this dilemma. Each application of dropout is equivalent to sampling a thinner subnetwork from the original network, as shown below:

img1

Thus, a network with N units can, once dropout is used, be viewed as a collection of 2^N models, while the number of parameters to be trained stays unchanged, which relieves the time problem. A toy illustration of this counting argument follows below.
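As a small sketch of that counting argument (my own illustration, not code from the paper), the snippet below enumerates every Bernoulli mask over a tiny hidden layer of N = 4 units: each mask defines a different subnetwork, yet all 2^N of them share the same single weight matrix.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_hidden = 4                                    # N units -> 2**N possible subnetworks
W = rng.standard_normal((3, n_hidden))          # one shared set of parameters
x = rng.standard_normal(3)

outputs = []
for mask in itertools.product([0, 1], repeat=n_hidden):
    h = np.maximum(x @ W, 0) * np.array(mask)   # drop units according to this mask
    outputs.append(h)

print(f"{len(outputs)} = 2**{n_hidden} subnetworks, sharing {W.size} weights")
```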

Motivation

Although, intuitively, dropout approximates an ensemble in terms of classification performance, in practice it is still applied to a single neural network, and only one set of model parameters is trained. So why does it work? This is where the motivation comes in. In the paper, the authors give a very nice analogy for the motivation behind dropout:

In nature, most large animals reproduce sexually, meaning that offspring inherit half of their genes from each parent. Intuitively, though, asexual reproduction seems more reasonable, because it can pass down large blocks of good genes intact, whereas sexual reproduction tears genes apart at random and destroys large blocks of co-adapted genes.

Yet natural selection did not choose asexual reproduction for higher animals; it chose sexual reproduction, survival of the fittest. Let us make an assumption: the power of a gene lies in its ability to mix well with other genes, rather than in the strength of any single gene on its own. Both modes of reproduction have to be judged under this assumption. To see why sexual reproduction wins, consider a small probability argument.

For example, suppose an attack is planned in one of two ways:
- Concentrate 50 people with a tight, precise division of labor and attempt one big strike.
- Split the 50 people into 10 groups of 5 that act independently; it counts as a success if any one group succeeds.

Which plan has the higher probability of success? Clearly the latter: a single pitched battle is turned into guerrilla warfare.

Back to the analogy: sexual reproduction not only passes good genes down, it also reduces the co-adaptation between genes, turning large blocks of genes whose fitness depends on staying together into small pieces that must each be useful on their own.

Dropout achieves the same effect: it forces a neuron to work together with randomly chosen sets of other neurons and still produce good results. This weakens the co-adaptation between neuron nodes and strengthens generalization.

A personal addition: most plants and microbes reproduce asexually because their environments change very little, so they do not need strong adaptability to new environments; keeping large blocks of genes well suited to the current environment is enough. Higher animals are different: they must be ready to adapt to new environments, so breaking the co-adaptation between genes into smaller pieces increases the probability of survival.

Model changes introduced by dropout

To obtain this ensemble-like behavior, both training and prediction of the neural network change somewhat once dropout is used.

  • Training phase

Inevitably, every unit of the network acquires an extra probabilistic step during training.
img2

The corresponding formulas change as follows:

  • Neural network without dropout
    img3
  • Neural network with dropout
    img4
  • Test phase

  • At prediction time, the weights of each unit are pre-multiplied by p (a minimal sketch of this train/test rule follows right below).
    img5
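To make the two formulas above concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of a two-layer network with dropout on the hidden units, using an assumed keep probability p = 0.5: at training time each hidden unit is kept with probability p, and at test time nothing is dropped but the weights that consume the hidden activations are pre-multiplied by p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # keep probability for hidden units (assumed value)

def forward_train(x, W1, b1, W2, b2):
    """Training pass: hidden activations are dropped with probability 1 - p."""
    h = np.maximum(x @ W1 + b1, 0)          # y = f(Wx + b), with f = ReLU here
    r = rng.binomial(1, p, size=h.shape)    # r ~ Bernoulli(p), one mask entry per unit
    return (h * r) @ W2 + b2                # the thinned activations feed the next layer

def forward_test(x, W1, b1, W2, b2):
    """Test pass: no unit is dropped; the weights that receive the hidden
    activations are scaled by p so the expected input matches training."""
    h = np.maximum(x @ W1 + b1, 0)
    return h @ (p * W2) + b2
```

Many modern implementations instead use "inverted" dropout, dividing the kept activations by p during training so that nothing has to be rescaled at test time; the expected behavior is the same.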

Other technical points in the paper

  • Methods for preventing overfitting:

    • Early stopping (stop when performance on the validation set starts to degrade)
    • L1 and L2 weight regularization
    • Soft weight sharing
    • Dropout
  • Choosing the dropout rate

    • Cross-validation shows that a dropout rate of 0.5 for hidden units gives the best results, because p = 0.5 generates the largest variety of random network structures.
    • Dropout can also be applied directly to the input layer as a way of adding noise; there the keep probability should be closer to 1 (for example 0.8), so that the input does not change too much.
  • Training process

    • Constraining the weight vector w of each unit to lie inside a sphere (max-norm regularization) is useful when training with dropout; a small sketch of this constraint follows this list.
    • The radius c of the sphere is a hyperparameter that should be tuned on the validation set.
    • Dropout is strong on its own, but it works even better combined with max-norm regularization, a large decaying learning rate, and high momentum; for example, max-norm regularization prevents a large learning rate from blowing up the parameters.
    • Pretraining can also help dropout training; when dropout is used after pretraining, all pretrained weights should be multiplied by 1/p.
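Here is a small sketch of the max-norm constraint referred to above (my own illustration; treating each column of W as the incoming weight vector of one unit is an assumption of this sketch): after every gradient step, each weight vector is projected back onto the ball of radius c, the hyperparameter to tune on the validation set.

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Rescale any column whose L2 norm exceeds c back onto the sphere of radius c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

# typical use inside a training loop (sketch):
#   W1 -= learning_rate * grad_W1
#   W1 = max_norm_project(W1, c=3.0)
```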
  • Selected experimental conclusions

  • The experimental section of the paper is extensive, with a large amount of evaluation data.

    • Maxout, another technique used in neural networks, surpasses dropout on CIFAR-10.

    • On text classification, dropout gives only a limited improvement; the likely reason is that the Reuters-RCV1 corpus is large enough that overfitting is not the model's main problem.

    • Comparison of dropout with other standard regularizers:
      • L2 weight decay
      • lasso
      • KL-sparsity
      • max-norm regularization
      • dropout
    • Feature learning
      • In a standard neural network, correlations between nodes let them cooperate to fix the noise produced by other nodes, but this cooperation does not generalize to unseen data, hence overfitting; dropout breaks this correlation. On autoencoders, the dropout-trained model learns more meaningful features (though this can only be judged intuitively, not quantitatively).
      • The resulting representation vectors are sparse.
      • Experiments vary the dropout rate while keeping the number of hidden units fixed, and vary the number of hidden units while keeping the expected number of activated hidden units fixed.
    • With little data, dropout does not help much; with more data, it works well.
    • Model-averaged prediction

    • Use weight scaling to approximate the averaged prediction.
    • Use a Monte-Carlo method for prediction: for each sample, first draw k thinned networks according to the dropout rate, then average their predictions; the larger k is, the better the result (a small sketch of this follows below).
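A self-contained sketch of the two prediction schemes above (my own illustration): predict_mc samples k thinned networks per input and averages them, while predict_weight_scaling makes one deterministic pass with the weights multiplied by p; as k grows, the Monte-Carlo estimate stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # keep probability (assumed value)

def predict_mc(x, W1, b1, W2, b2, k=100):
    """Monte-Carlo averaging: sample k thinned networks and average their outputs."""
    preds = []
    for _ in range(k):
        h = np.maximum(x @ W1 + b1, 0)
        r = rng.binomial(1, p, size=h.shape)   # fresh dropout mask per pass
        preds.append((h * r) @ W2 + b2)
    return np.mean(preds, axis=0)

def predict_weight_scaling(x, W1, b1, W2, b2):
    """Weight-scaling rule: one deterministic pass with the weights scaled by p."""
    h = np.maximum(x @ W1 + b1, 0)
    return h @ (p * W2) + b2
```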
  • Multiplicative Gaussian noise
    Use dropout noise drawn from a Gaussian distribution instead of the Bernoulli model (see the sketch just below).
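A minimal sketch of the Gaussian variant (my own illustration; choosing sigma^2 = (1 - p) / p to match the variance of Bernoulli dropout is my reading of the paper's parameterization): each activation is multiplied by noise drawn from N(1, sigma^2), and because the noise has mean 1, no rescaling is needed at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
sigma = np.sqrt((1 - p) / p)   # noise variance matched to Bernoulli(p) dropout

def gaussian_dropout(h):
    """Multiply each activation by multiplicative noise drawn from N(1, sigma^2)."""
    return h * rng.normal(1.0, sigma, size=h.shape)
```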

  • The drawback of dropout is that training takes two to three times as long as training a network without dropout.
  • Further topics worth knowing

    • dropout RBM
    • Marginalizing dropout
      Concretely, this replaces the stochastic dropout procedure with a deterministic one; for logistic regression, for example, dropout amounts to adding a regularization term. (A worked linear-regression case is sketched after this list.)
    • Bayesian neural networks are especially useful for sparse data, e.g. medical diagnosis, genetics, drug discovery and other computational biology applications.
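To make the marginalization point concrete, here is the linear-regression case as I understand it from reference [1] (the restatement and the check below are my own): with elementwise input dropout masks R ~ Bernoulli(p), the expected squared loss E_R ||y - (R ∘ X) w||^2 equals ||y - p X w||^2 + p(1 - p) Σ_j (XᵀX)_jj w_j^2, so the randomness can be replaced by a deterministic ridge-like penalty. The snippet verifies the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

# Monte-Carlo estimate of the expected dropout loss over random masks R
mc = np.mean([np.sum((y - (rng.binomial(1, p, X.shape) * X) @ w) ** 2)
              for _ in range(20000)])

# Closed form: ||y - pXw||^2 + p(1-p) * sum_j (X^T X)_jj * w_j^2
closed = (np.sum((y - p * X @ w) ** 2)
          + p * (1 - p) * np.sum(np.diag(X.T @ X) * w ** 2))

print(mc, closed)   # the two values should agree closely
```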

The noise school

The viewpoint in the second paper of the references is also quite compelling.

Viewpoint

The viewpoint is very explicit: training each network obtained after dropout is equivalent to doing data augmentation, because one can always find a sample that, fed to the original network, produces the same effect as dropping those units. For example, if dropping some units of a layer yields (1.5, 0, 2.5, 0, 1, 2, 0), where the zeros are the dropped units, then one can always find a sample for which the original network produces exactly this result at that layer. In this way, every application of dropout effectively adds training samples.
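A rough sketch of this idea (my own illustration, not the actual procedure of reference [2]): take a dropped-out hidden vector such as (1.5, 0, 2.5, 0, 1, 2, 0) as a target, and search by gradient descent for an input x* whose hidden representation in the original, dropout-free layer matches it; each dropout mask then behaves like an extra, synthesized training sample.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 7)) * 0.5
b = np.zeros(7)
target = np.array([1.5, 0.0, 2.5, 0.0, 1.0, 2.0, 0.0])   # a dropped-out hidden vector

x = rng.standard_normal(7)                 # start the search from a random input
for _ in range(2000):
    z = x @ W + b
    h = np.maximum(z, 0)                   # hidden representation without dropout
    grad = 2 * (((h - target) * (z > 0)) @ W.T)
    x -= 0.05 * grad                       # move x* toward reproducing the target

print(np.round(np.maximum(x @ W + b, 0), 2))   # should be close to `target`
```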

Sparsity

Background point A

First, a piece of background:

    When the data points belonging to a particular class are distributed along a linear manifold, or sub-space, of the input space, it is enough to learn a single set of features which can span the entire manifold. But when the data is distributed along a highly non-linear and discontinuous manifold, the best way to represent such a distribution is to learn features which can explicitly represent small local regions of the input space, effectively “tiling” the space to define non-linear decision boundaries.

Roughly, this means: when the data lie in a linear space, it is enough to learn one feature set that spans the whole space; but when the data are distributed over a highly non-linear, discontinuous manifold, it is better to learn feature sets that cover small local regions of the space.

Background point B

Suppose we have a set of data represented by M different non-contiguous clusters, and K feature dimensions available to represent them. An effective feature representation is one in which, after each input cluster is mapped to features, the overlap between clusters is as low as possible. Let A_i denote the set of dimensions that are active in the feature representation of cluster i; overlap means the Jaccard similarity between the A_i and A_j of two different clusters, which should be minimal. Then:

    • When K is large enough, the minimal overlap can be learned even if the A_i are large.
    • When K is small and M is large, the only way to learn the minimal overlap is to shrink the size of the A_i, i.e. to make the representation sparse.

The explanation above may be a bit too technical and hard to parse. The gist is this: to tell different classes apart, the learned features need to be sufficiently discriminative. With enough data there is no overfitting and nothing to worry about, but when the data are scarce, sparsity can be used to increase the discriminability of the features.
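To make the overlap measure concrete (my own sketch): represent each cluster by the set A_i of feature dimensions that are active for it and compare clusters with the Jaccard similarity of those sets; the sparser each A_i, the easier it is to keep the overlap low when clusters are many and features are few.

```python
import numpy as np

def active_set(h, threshold=0.0):
    """Set of feature dimensions activated in one cluster's representation."""
    return set(np.flatnonzero(h > threshold))

def jaccard(a, b):
    """Jaccard similarity between two activation sets (0 = no overlap)."""
    return len(a & b) / max(len(a | b), 1)

# two hypothetical sparse cluster representations over K = 8 features
A_i = active_set(np.array([0.9, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7, 0.0]))
A_j = active_set(np.array([0.0, 0.8, 0.0, 0.0, 0.6, 0.0, 0.0, 0.3]))
print(jaccard(A_i, A_j))   # 0.0: the two clusters use disjoint feature dimensions
```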

Hence an interesting hypothesis: using dropout effectively produces more local clusters; with the same amount of data, there are more clusters, so to keep them distinguishable the representation has to become sparser.

To verify this hypothesis, the paper runs an experiment, shown in the figure below:

    img6

The experiment uses simulated data: 15,000 points lying on a circle, with the circle divided into a number of arcs; points on the same arc belong to the same class, and there are 10 classes in total, so different arcs may belong to the same class. By changing the size of the arcs, the number of arcs belonging to the same class can be varied.

The conclusion is that as the arc length grows, the number of clusters decreases and the sparsity decreases, which matches the hypothesis.

Personal view: this hypothesis explains more than why dropout leads to sparsity. Dropout makes the local clusters stand out more clearly, and by background point A, exposing the local clusters is the real reason dropout prevents overfitting; sparsity is merely its outward manifestation.

Other technical points in the paper

    • Training a complete network on the samples obtained by mapping dropout back to the input space achieves the same effect as dropout.
    • Letting the dropout rate vary over an interval rather than being a fixed value improves the results.
    • When mapping the dropped-out representation back to the input space, one cannot find a single sample x* that reproduces the dropout result at every layer simultaneously, but one can find a sample for each layer; in this way, every dropout pattern corresponds to a set of samples that simulate its result.
    • Dropout also has a counterpart called DropConnect; the formulas for the two are as follows (a short sketch contrasting them follows this list):

      • dropout
        img7
      • DropConnect
        img8

  • In the experiments, purely binarized features also perform very well, which supports the hypothesis that sparse representations partition the input space: whether a feature is activated indicates whether the sample lies in a particular subspace.
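Since the two formulas are only shown as images above, here is a hedged sketch of the difference as I understand it (my own illustration): dropout applies one Bernoulli mask per unit to the activation vector, r = m ∘ a(Wv), while DropConnect applies one Bernoulli mask per connection to the weights, r = a((M ∘ W) v).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
v = rng.standard_normal(4)
W = rng.standard_normal((4, 3))

# dropout: one Bernoulli variable per output unit masks the activations
dropout_out = np.maximum(v @ W, 0) * rng.binomial(1, p, size=3)

# DropConnect: one Bernoulli variable per weight masks the connections
dropconnect_out = np.maximum(v @ (W * rng.binomial(1, p, size=W.shape)), 0)
```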
References

    [1]. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15(1): 1929-1958.

    [2]. Dropout as data augmentation. http://arxiv.org/abs/1506.08700

