Machine learning related techniques

Update parameters

The goal of neural network learning is to find the parameters that make the value of the loss function as small as possible. This is the problem of finding the optimal parameters, and the process of solving it is called optimization.

      SGD

Here the weight parameter to be updated is written as W, and the gradient of the loss function with respect to W as ∂L/∂W. The update rule is W ← W − η ∂L/∂W. η is the learning rate; in practice a value decided in advance, such as 0.01 or 0.001, is used. The ← means that the value on the left is replaced by the value on the right.
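A minimal sketch of this rule in Python (the dict-of-arrays interface for params and grads is an assumption made here for illustration):

```python
import numpy as np

class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # params and grads are dicts of numpy arrays keyed by parameter name
        for key in params.keys():
            params[key] -= self.lr * grads[key]
```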

Disadvantage of SGD: the root cause of SGD's inefficiency is that the direction of the gradient does not necessarily point toward the minimum.

    Momentum

Momentum means "momentum" in the physical sense. A new variable v appears here, corresponding to velocity in physics. The update is v ← αv − η ∂L/∂W, followed by W ← W + v. The term αv has the job of gradually decelerating the object when no force acts on it (α is set to a value such as 0.9), corresponding to ground friction or air resistance in physics.
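A minimal sketch of these two update equations (again assuming the dict-based params/grads interface):

```python
import numpy as np

class Momentum:
    """Momentum update: v <- alpha*v - lr*dL/dW, then W <- W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # one velocity array per parameter, initialized to zero
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```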

    AdaGrad

Among techniques concerning the learning rate, there is a method called learning rate decay: the learning rate is gradually reduced as learning progresses. AdaGrad develops this idea further by adapting the learning rate for each individual parameter element while learning.

A new variable h appears here; as in Equation (6.5), it accumulates the sum of the squares of all previous gradient values. When a parameter is updated, multiplying by 1/√h rescales the step. This means that elements that have changed a lot (been updated substantially) get a smaller learning rate. In other words, the learning rate decays per parameter element, with heavily varying parameters decaying faster. Because AdaGrad records the sum of all past squared gradients, the update magnitude shrinks as learning proceeds; if learning continued without end, the update amount would approach 0 and the parameters would stop being updated entirely.
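A sketch of this accumulation; a small constant is added before taking the square root to avoid division by zero (the value 1e-7 is a common choice, assumed here):

```python
import numpy as np

class AdaGrad:
    """AdaGrad: h accumulates squared gradients; each step is scaled by 1/sqrt(h)."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 keeps the division finite while h is still near 0
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```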

    Adam

Momentum moves according to the physical analogy of a ball rolling in a bowl, while AdaGrad adjusts the update step appropriately for each parameter element; Adam's idea is to fuse the two. In the update process Adam also moves like a ball rolling in a bowl. The movement resembles Momentum's, but in comparison the left-right shaking of the ball is reduced, because the degree of each update is adjusted appropriately.
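A sketch of the standard Adam update with bias-corrected first and second moment estimates (the book's own implementation is a slightly simplified variant; the default hyperparameters below are the commonly used ones, assumed here):

```python
import numpy as np

class Adam:
    """Standard Adam: moving averages of the gradient and its square, with bias correction."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(v) for k, v in params.items()}
            self.v = {k: np.zeros_like(v) for k, v in params.items()}
        self.t += 1
        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            # bias-corrected moment estimates
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```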

Initial values of the weights

Briefly, weight decay is a learning method whose aim is to reduce the values of the weight parameters; by making the weights smaller it suppresses overfitting. If we want small weights, the right approach is to start from small initial values. In fact, the weight initial values used up to this point were values like 0.01 * np.random.randn(10, 100), i.e., values drawn from a Gaussian distribution and multiplied by 0.01 (a Gaussian distribution with standard deviation 0.01). To prevent the weights from becoming uniform (strictly speaking, to break the symmetric structure of the weights), the initial values must be generated randomly.

When a Gaussian distribution with standard deviation 1 is used, the distribution of the activation values in each layer, as shown in the figure, is biased toward 0 and 1. The sigmoid function used here is an S-shaped function: as its output approaches 0 (or 1), the value of its derivative approaches 0. A data distribution biased toward 0 and 1 therefore makes the gradient values in backpropagation smaller and smaller until they vanish. This problem is known as gradient vanishing. As deep learning models get deeper, the vanishing-gradient problem can become more serious.

When a Gaussian distribution with standard deviation 0.01 is used, the activation values of each layer concentrate around 0.5. Because they are not biased toward 0 and 1 as in the previous example, the vanishing-gradient problem does not occur. However, the fact that the activation values are biased points to a serious problem in terms of expressiveness: if many neurons output almost the same value, there is no point in having them all. For example, if 100 neurons output almost the same value, a single neuron could express essentially the same thing. A biased distribution of activation values thus leads to the problem of "limited expressiveness."

The distribution of activation values in each layer should have an appropriate breadth. Why? Because when diverse data flows between the layers, the neural network can learn efficiently. Conversely, if biased data is passed along, gradient vanishing or "limited expressiveness" occurs, and learning may not proceed well.
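A small experiment in the spirit of the figures described above (the network size, sample count, and weight scale below are assumptions for illustration): push random data through several sigmoid layers and check how broad the activations are for different initial weight scales.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.random.randn(1000, 100)   # 1000 input samples with 100 features
node_num = 100                   # nodes per hidden layer
activations = {}

for i in range(5):
    if i != 0:
        x = activations[i - 1]
    # Weight scale to compare: 1.0 saturates activations toward 0/1,
    # 0.01 collapses them around 0.5, 1/sqrt(node_num) (Xavier) keeps breadth.
    w = np.random.randn(node_num, node_num) * 1.0
    activations[i] = sigmoid(np.dot(x, w))

for i, a in activations.items():
    print("layer %d: mean=%.3f std=%.3f" % (i, a.mean(), a.std()))
```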

Xavier initial value: when a layer is connected to n nodes in the previous layer, the initial values are drawn from a distribution with standard deviation 1/√n.

Weight initial values for ReLU

When the previous layer has n nodes, the He initial value uses a Gaussian distribution with standard deviation √(2/n). In neural network learning, the initial weight values are very important: they often determine whether learning succeeds or fails.
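A sketch of both scaling rules in numpy (the helper name and its activation argument are illustrative, not from the original text):

```python
import numpy as np

def init_weight(n_in, n_out, activation="relu"):
    # Helper name and signature are illustrative, not from the original text.
    if activation == "sigmoid":
        scale = 1.0 / np.sqrt(n_in)      # Xavier: std = 1/sqrt(n)
    else:
        scale = np.sqrt(2.0 / n_in)      # He (for ReLU): std = sqrt(2/n)
    return scale * np.random.randn(n_in, n_out)

# e.g. a 100-unit layer feeding a 50-unit ReLU layer:
W1 = init_weight(100, 50, activation="relu")
```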

Batch Normalization

To give the distribution of each layer's activation values an appropriate breadth, Batch Normalization "forcibly" adjusts the distribution of the activations. Batch Norm has the following advantages:

• Learning can proceed quickly (a larger learning rate can be used).

• It is less dependent on the initial weight values (no need to be so nervous about them).

• It suppresses overfitting (reducing the need for Dropout and the like).

Batch Norm, as the name suggests, normalizes the data in units of the mini-batches used during learning. Concretely, it normalizes the data so that its distribution has mean 0 and variance 1.
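A sketch of the training-time forward pass only (at inference time running averages of the mean and variance are normally used instead; gamma and beta are the learned scale and shift):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """Normalize a mini-batch to mean 0 / variance 1, then scale and shift.
    x has shape (batch_size, features); gamma and beta are per-feature arrays."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```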

Regularization

Overfitting occurs mainly for the following two reasons.

• The model has a large number of parameters and high expressive power.
• The training data is scarce.

Weight decay is a method that has long been used to suppress overfitting. It works by penalizing large weights during learning; much overfitting arises precisely because the weight parameters take on large values. The L2 norm, L1 norm, and L∞ norm can all serve as the regularization term, each with its own characteristics, but here we implement the commonly used L2 norm.
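A sketch of adding the L2 penalty to the loss (the helper name and the value of lam are illustrative; on the backward pass this penalty contributes lam * W to each weight gradient):

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, lam=0.1):
    """Add the L2 penalty (lam/2) * sum(W**2) over all weight matrices."""
    penalty = 0.0
    for W in weights:
        penalty += 0.5 * lam * np.sum(W ** 2)
    return data_loss + penalty
```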

Dropout is a method that randomly deletes neurons during learning. At training time, neurons in the hidden layers are selected at random and deleted; the deleted neurons no longer transmit signals, as shown in the figure.
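A minimal sketch of a Dropout layer (scaling the outputs at test time by the keep probability is one common convention, assumed here):

```python
import numpy as np

class Dropout:
    """Randomly drop units during training; scale the outputs at test time."""
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # keep each unit with probability (1 - dropout_ratio)
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time, scale by the fraction of units that were kept
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # gradients flow only through the units that were kept
        return dout * self.mask
```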

Ensemble learning is often used in machine learning. In ensemble learning, multiple models are trained independently, and at inference time the average of their outputs is taken. In neural-network terms, one might prepare five networks with the same (or similar) structure, train them separately, and at test time use the average of the five networks' outputs as the answer.

Hyperparameter optimization

The procedure for hyperparameter optimization is as follows (a minimal sampling sketch appears after the steps):

Step 1
Sample hyperparameters at random from the specified ranges.

Step 2
Train using the hyperparameter values sampled in Step 1, and evaluate the recognition accuracy on the validation data (with the number of epochs set small).

Step 3
Repeat Step 1 and Step 2 (e.g., 100 times), and narrow the hyperparameter ranges based on the recognition-accuracy results.
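A sketch of the random-sampling loop (the log-scale ranges and the train_and_evaluate helper are assumptions standing in for Steps 1 and 2):

```python
import numpy as np

for trial in range(100):
    # Step 1: sample on a log scale (the ranges here are illustrative)
    weight_decay = 10 ** np.random.uniform(-8, -4)
    lr = 10 ** np.random.uniform(-6, -2)
    # Step 2 would train briefly and evaluate on validation data, e.g.:
    # val_acc = train_and_evaluate(lr, weight_decay, epochs=5)  # hypothetical helper
    print("trial %d: lr=%.2e, weight_decay=%.2e" % (trial, lr, weight_decay))
```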

If a more refined method is needed for hyperparameter optimization, Bayesian optimization can be used. Bayesian optimization applies mathematical theory centered on Bayes' theorem to carry out the optimization more rigorously and efficiently.

 
