Deep Learning Articles - Neural Networks (4): Tuning Neural Networks in Detail


Back to the neural network directory

Previous: Deep Learning Articles - Neural Networks (3): Network Topology and Training Neural Networks

Next: Deep Learning Articles - Neural Networks (5): Optimizers in Detail

 

This section covers tuning neural networks in detail; the next section covers optimizers in detail.

 

Earlier, I explained the network topology of ANNs and DNNs and how to train them, so we can already start training a neural network. However, to get a better model, we also need to learn how to tune a neural network. That is the focus of this section.

 

5. Tuning Neural Networks

(1). Understanding neural networks

①. The flexibility of neural networks is also one of the algorithm's main drawbacks: there are many hyperparameters to adjust, such as the number of layers, the number of neurons in each layer, the activation function used in each layer, and the weight-initialization logic.

②. Since a DNN is a fully connected neural network, its learning capacity is very strong and it easily overfits. Early stopping, \large L_{1}, \large L_{2} regularization, Dropout, data augmentation and other techniques can be used to prevent overfitting.

③. Normalizing the samples helps the optimizer speed up training, i.e. it improves the convergence rate. Moreover, normalizing the samples can, to a certain extent, improve accuracy.

④. Vanishing gradients: control the number of layers in the network, because gradients vanish easily when the network is too deep. Applying BN (Batch Normalization) before each layer can, to some extent, alleviate the vanishing-gradient problem, prevent overfitting, improve accuracy, and speed up learning.

⑤. Use Grid Search, Cross Validation, or similar methods to evaluate how good the model is and choose the most appropriate one (a small sketch follows below).
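
As a minimal sketch of grid search plus cross-validation (assuming scikit-learn is available and using its MLPClassifier simply as a stand-in network; the data and the parameter-grid values are only illustrative):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# illustrative data: 200 samples, 10 features, binary labels
x = np.random.randn(200, 10)
y = np.random.randint(0, 2, size=200)

# hyperparameter grid to search over (values are only examples)
param_grid = {
    "hidden_layer_sizes": [(32,), (64, 32)],
    "alpha": [1e-4, 1e-3],              # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],
}

search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
search.fit(x, y)

print(search.best_params_)   # best hyperparameter combination
print(search.best_score_)    # its mean cross-validation score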

 

(2). Early Stopping

①. Why use early stopping

  a. To achieve good performance, setting up a neural network requires many decisions about hyperparameters. One such hyperparameter is the number of epochs: how many passes should be made through the dataset. If the number of epochs is too small, the network may underfit; if it is too large, overfitting may occur.

  b. Early stopping solves the problem of having to set the number of epochs manually. It can also be regarded as a regularization method that avoids overfitting the network (similar to \large L_1 , \large L_2 weight decay and dropout).

  c. The fundamental reason is that continuing to train causes the accuracy on the test (validation) set to decline. The likely reasons why continued training makes validation accuracy drop are:

   (A). Overfitting

   (B). The learning rate is too large, so training fails to converge.

 

②. Principle

  a. Split the dataset into a training set and a validation set.

  b. After each epoch ends (or after every N epochs):

      Evaluate on the validation set. As the number of epochs increases, if the error on the validation set starts to rise, stop training.

  c. Use the weights at the point where training stopped as the final parameters of the network.

      This approach matches intuition: once accuracy stops improving, further training is unhelpful and only increases training time. The difficulty is deciding when validation accuracy has truly stopped improving, because accuracy may drop after one epoch and then rise again in later epochs, so we cannot conclude anything from two consecutive decreases alone. The usual practice is to record the best validation accuracy seen so far during training; if no new best is reached within 10 (or more) epochs, we conclude that accuracy will not improve further and stop (a sketch of this rule follows below).
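
A minimal sketch of this "best so far + patience" rule, shown here purely on a list of hypothetical per-epoch validation accuracies:

def early_stopping_epoch(val_accuracies, patience=10):
    """Return the index of the epoch at which to stop, given per-epoch validation accuracies."""
    best_acc = -1.0
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc = acc
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch                      # no new best within `patience` epochs: stop here
    return len(val_accuracies) - 1            # trained to the end without triggering

# toy example: accuracy improves, then plateaus and slowly degrades
accs = [0.60, 0.70, 0.75, 0.78, 0.78, 0.77, 0.78, 0.77, 0.76, 0.77,
        0.78, 0.77, 0.76, 0.75, 0.74, 0.73]
print(early_stopping_epoch(accs, patience=10))   # stops at epoch 13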

 

③. Stopping criteria

  a. The first stopping criterion

     Define a new variable called the generalization loss (Generalization Loss, GL), which describes, at the current iteration \large t , how much the generalization error has grown relative to the minimum error so far:

             \large E_{opt}(t) = \min_{t' < t} E_{val} (t')

             \large GL(t) = 100 \cdot (\frac{E_{val}(t)}{E_{opt}(t)} - 1)

             \large E_{opt}(t): the best (lowest) validation-set error obtained up to iteration \large t.

             \large E_{val}(t): the validation-set error at iteration \large t.

       A high generalization loss is obviously a candidate criterion for stopping training, because it directly indicates overfitting: stop training when the generalization loss exceeds a certain threshold. In terms of the GL definition, the criterion \large GL_{\alpha} stops training when \large GL(t) exceeds a threshold \large \alpha.

 

  b. The second class of stopping criteria

      When training is still progressing quickly, we may want the model to keep training, because if the training error is still dropping rapidly, there is a good chance the generalization loss will recover. A common assumption is that overfitting only starts once the training error decreases very slowly. Here we define a strip of length \large k and, based on it, a new variable that measures progress (Measure Progress):

              \large P_{k}(t) = 1000 \cdot (\frac{\sum_{t' = t - k + 1}^{t} E_{tr}(t')}{k \cdot \min_{t' = t - k + 1}^{t} E_{tr}(t')} - 1)

              \large E_{tr}(t'): the error on the training set at iteration \large t'.

       It expresses how much larger the average training error within the current strip is than the minimum training error in that strip.

       Note that when training is unstable, the progress measure can be large even though the training error is increasing rather than decreasing. In practice, this jitter often comes from algorithms choosing an inappropriately large step size. Unless training is globally unstable, after long training the progress measure tends to 0 (it essentially measures how quickly the training-set error drops on average within the strip). The resulting criterion: stop when the quotient of the generalization loss and the progress, \large PQ_{\alpha}, exceeds the specified value, i.e. when \large \frac{GL(t)}{P_{k}(t)} > \alpha.

 

  c. The third stopping criterion

     The third stopping criterion depends entirely on the change of the generalization error: stop when the generalization (validation) error has grown for 8 consecutive strips (UP).

     When the validation-set error increases in eight consecutive strips, we take this as a sign of overfitting, regardless of how large the error growth is. This criterion measures change in general rather than its magnitude, so it can also be used with pruning algorithms, i.e. during training the error may be allowed to remain much higher than the previous minimum for a long time before stopping. (A sketch of all three criteria follows below.)
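
The three criteria above can be written directly from the formulas; a small sketch using hypothetical per-epoch validation and training error lists, evaluated at the latest epoch (the threshold values alpha, k and s are only illustrative):

import numpy as np

def generalization_loss(val_errors):
    """GL(t) = 100 * (E_val(t) / E_opt(t) - 1), evaluated at the latest epoch t."""
    return 100.0 * (val_errors[-1] / min(val_errors) - 1.0)

def progress(train_errors, k=5):
    """P_k(t): how much larger the average training error over the last k epochs
    is than the minimum training error within that strip."""
    strip = train_errors[-k:]
    return 1000.0 * (np.mean(strip) / np.min(strip) - 1.0)

def stop_gl(val_errors, alpha=5.0):
    return generalization_loss(val_errors) > alpha                      # first criterion: GL_alpha

def stop_pq(val_errors, train_errors, alpha=0.5, k=5):
    return generalization_loss(val_errors) / progress(train_errors, k) > alpha   # second: PQ_alpha

def stop_up(val_errors, s=8):
    """Third criterion: validation error increased in s consecutive steps."""
    recent = val_errors[-(s + 1):]
    return len(recent) == s + 1 and all(b > a for a, b in zip(recent, recent[1:]))

# hypothetical error curves
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.37, 0.39, 0.40, 0.42]
train_errors = [0.45, 0.35, 0.30, 0.27, 0.25, 0.24, 0.235, 0.233, 0.232, 0.231]

print(stop_gl(val_errors), stop_pq(val_errors, train_errors), stop_up(val_errors))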

 

④. Rules for choosing a stopping criterion

   In general, the "slower" criteria perform slightly better on average and can improve generalization, but they require longer training time. Overall, the systematic differences between these criteria are small. The main selection rules are:

  a. Use a faster stopping criterion unless a small improvement in performance is worth a large increase in training time.

  b. To maximize the probability of finding a good solution, use a GL criterion.

  c. To maximize the average quality of the solutions, use a PQ criterion if the network overfits only a little; otherwise use a UP criterion.

 

⑤. Advantages and disadvantages:

  a. Advantages:

      Running gradient descent only once lets you explore small, medium, and large values of \large w, without having to try many values of the \large L_{1},\; L_{2} regularization hyperparameters.

  b. Disadvantages:

     Instead of using separate tools to handle the two problems of optimizing the loss function and reducing variance, early stopping handles both with a single mechanism, which makes the things to consider more complicated. The two problems can no longer be treated independently, because stopping the optimization of the cost function early means its value may not yet be small enough, while at the same time we also do not want to overfit.

 

(3). Dropout

①. When training deep neural networks we often run into two problems: they overfit easily and training is time-consuming. Dropout can effectively alleviate overfitting, providing a regularization effect to some extent, and it also speeds up training.

②. Dropout is a regularization technique applied in deep learning. It works as follows:

     In one training cycle, randomly select some units in the network and temporarily hide (drop) them, then train and optimize the network for that cycle. In the next cycle, another set of neurons is randomly selected and temporarily hidden, and the network is trained and optimized again. This repeats until training ends, and each selection is independent of the previous ones.

③. Dropout is generally applied to the fully connected layers.

④. During training, each neuron is retained with probability \large p (the dropout/discard rate is \large 1 - p). In the test phase, every neuron is present, and the weight parameters \large w are multiplied by \large p, becoming \large pw.

⑤. Why the test-time scaling by \large p is needed

     Consider a hidden-layer neuron whose output before dropout is \large x. After dropout, its expected output is \large E = px + (1 - p) \cdot 0. At test time the neuron is always active, so to keep the expected output at the same level and obtain the same result, we need to adjust \large x \rightarrow px. The retain mask follows a Bernoulli (0-1) distribution whose value is 1 with probability \large p.

⑥. Inverted dropout

    During training, some neurons are dropped, so at test time the activations would have to be scaled by \large p. This requires modifying the test code and adds computation at test time, which hurts test performance. To improve test performance (reduce test-time computation), the scaling can be moved to the training phase, so that the test phase is exactly the same as when dropout is not used; this is called inverted dropout. During forward propagation in training, the outputs of the retained neurons are multiplied by \large \frac{1}{p} (this can be seen as a compensation term, expanding the weights to \large \frac{1}{p} times their value, so that no rescaling is needed at test time). With inverted dropout, this change only affects the training process and does not affect the testing process. A commonly used dropout value is \large p = 0.5.
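
A minimal numpy sketch of both variants, applied to one layer's activations (the array x and the value of p are illustrative):

import numpy as np

def dropout_train(x, p=0.5):
    """Standard dropout at training time: keep each unit with probability p.
    At test time the weights (or activations) must then be scaled by p."""
    mask = np.random.binomial(1, p, size=x.shape)   # Bernoulli(p) retain mask
    return x * mask

def inverted_dropout_train(x, p=0.5):
    """Inverted dropout: scale by 1/p during training, so test time needs no change."""
    mask = np.random.binomial(1, p, size=x.shape)
    return x * mask / p

x = np.random.randn(4, 8)                # activations of one layer (toy example)
out = inverted_dropout_train(x, p=0.5)
# at test time, simply use x as-is (no mask, no scaling)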

⑦. Why dropout alleviates overfitting

  a. Averaging effect

      With dropout, dropping different hidden neurons is like training different networks: randomly dropping part of the hidden neurons means the network structure is already different each time, so the whole dropout procedure is equivalent to averaging over many different neural networks. Different networks produce different fits, and some fits that are "opposite" to each other cancel out, which reduces overfitting overall.

  b. Reducing complex co-adaptation between neurons

     Because of dropout, two neurons do not necessarily appear in the same dropout network every time. Weight updates therefore no longer rely on the joint action of hidden nodes with fixed relationships, which prevents situations where certain features are only effective in the presence of other specific features. This forces the network to learn more robust features, features that also exist in random subsets of the other neurons. In other words, if the neural network is making some prediction, it should not be overly sensitive to one particular set of clues; even if a specific clue is lost, it should still be able to learn common features from the many other clues. From this perspective, dropout is somewhat like \large L_{1}, \; L_{2} regularization: shrinking the weights makes the network more robust to the loss of any particular neuron connection.

  c. Dropout is analogous to the role of sexual reproduction in biological evolution

      To survive, species tend to adapt to their environment, and a sudden change in the environment makes it hard for a species to respond in time. Sexual reproduction can produce variants adapted to the new environment and effectively prevents "overfitting", i.e. it avoids the extinction a species might face when the environment changes. Likewise, dropout randomly recombines neurons into new neural networks, much like gene recombination in sexual reproduction.

 

(4). BN (Batch Normalization)

Batch normalization normalizes each mini-batch of data.

①. Why use BN

    Once the network starts training, the parameters are updated. Apart from the input-layer data (assuming the input data has already been normalized per sample by hand), the distribution of the input data to every later layer keeps changing during training, because the updates of the earlier layers' parameters \large w change the distribution of the inputs of the later layers. Take the second layer of the network as an example: its input is computed from the first layer's parameters \large w and the network input, and the first layer's parameters keep changing throughout training, which inevitably causes the input data of every subsequent layer to change. This change of data distribution in the intermediate layers during training is called "Internal Covariate Shift". BN was proposed precisely to address the problem of the intermediate layers' data distribution changing during training.

②. How BN is used

     Insert a BN layer between each fully connected layer and its activation function in the network: perform the BN operation on each layer's input, then feed the data returned by BN, together with the fully connected layer's \large w,\;b, into the activation function.

③. The network structure of BN

④. The BN procedure

 Input: a mini-batch \large x = \{x_{1},\; x_{2}, \; ......, \; x_{m} \}

 Output: the normalized \large x_{scale}

  a. Compute the mean of each training batch:

          \large \bar{x} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}

  b. Compute the variance of each training batch:

         \large \sigma ^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - \bar{x})^{2}

  c. Normalize with the mean and variance; \large \varepsilon is a very small number used to avoid a zero denominator:

         \large x_{scale} = \frac{x - \bar{x}}{\sqrt{\sigma ^{2} + \varepsilon }}

  d. Return the mean-variance-normalized \large x_{scale}:

         \large return \;\;\;\; x_{scale}
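
The four steps above translate directly into numpy; a minimal sketch for one mini-batch, following the procedure exactly as listed (without the extra learnable scale and shift parameters that full Batch Normalization layers usually add):

import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a mini-batch x of shape (m, features) to zero mean and unit variance."""
    mean = np.mean(x, axis=0)                    # a. per-feature mean of the batch
    var = np.var(x, axis=0)                      # b. per-feature variance of the batch
    x_scale = (x - mean) / np.sqrt(var + eps)    # c. mean-variance normalization
    return x_scale                               # d. return the normalized batch

x = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 7.0],
              [1.0, 5.0, 7.0]])
print(batch_norm(x))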

⑤. Understanding BN

    Through normalization, for every hidden-layer neuron BN forces the input distribution, which would otherwise gradually drift toward the saturated extremes of the nonlinear activation function's range, back toward a fairly standard normal distribution with mean 0 and variance 1, so that the inputs of the nonlinear transformation fall into a region where the function is sensitive. This avoids the vanishing-gradient problem.

⑥. Advantages of BN

  a. BN keeps the input data distribution of each layer relatively stable, which speeds up model learning.

  b. BN makes the model less sensitive to the parameters in the network, simplifying hyperparameter tuning and making training more stable.

  c. BN allows the network to use saturating activation functions (e.g. Sigmoid, Tanh), alleviating the vanishing-gradient problem.

  d. BN has a certain regularization effect.

 

(5). Parameter Initialization

Three commonly used parameter-initialization methods:

①. Uniform-distribution initialization

   w = np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

   scale: for Xavier initialization, suitable for ordinary activation functions (Tanh, Sigmoid, etc.):

                             scale = np.sqrt(3 / n)

                for He initialization, suitable for ReLU:

                             scale = np.sqrt(6 / n)

   n_in: the input size of the layer

   n_out: the output size of the layer

   n: either n_in or \large \frac{n\_in + n\_out}{2}
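
Putting the pieces above together, a small sketch of a helper that draws uniformly distributed Xavier or He initial weights (the function name and the choice of n = n_in are only illustrative):

import numpy as np

def uniform_init(n_in, n_out, mode="xavier"):
    n = n_in                                   # or (n_in + n_out) / 2
    if mode == "xavier":                       # Tanh / Sigmoid, etc.
        scale = np.sqrt(3 / n)
    else:                                      # "he": ReLU
        scale = np.sqrt(6 / n)
    return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

w = uniform_init(256, 128, mode="he")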

 

②. Gaussian (normal) distribution initialization

    w = np.random.randn(n_in, n_out) * stdev

    stdev: the standard deviation of the Gaussian; the mean is set to 0

                 for Xavier initialization, suitable for ordinary activation functions (Tanh, Sigmoid, etc.):

                           stdev = np.sqrt(1 / n)

                 for He initialization, suitable for ReLU:

                           stdev = np.sqrt(2 / n)

     Xavier initialization can also be computed with the following method:
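
Presumably the intended alternative is the form that uses both n_in and n_out (the Glorot formulation); as a sketch:

stdev = np.sqrt(2 / (n_in + n_out))
w = np.random.randn(n_in, n_out) * stdev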

 

③. SVD (whitening) initialization, which works well for RNNs.
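
A minimal sketch of SVD-based (orthogonal) initialization, assuming we take an orthogonal factor of a random Gaussian matrix (the helper name is illustrative):

import numpy as np

def svd_init(n_in, n_out):
    a = np.random.randn(n_in, n_out)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    # pick whichever orthogonal factor has the required (n_in, n_out) shape
    return u if u.shape == (n_in, n_out) else vt

w = svd_init(128, 128)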

 

(6). Data Preprocessing

①. Zero-centering (zero-mean), which is quite commonly used.

x -= np.mean(x, axis=0)

      axis: the dimension of x along which to operate.

                 For example, axis=0 computes the mean over the columns of x,

                      and axis=1 computes the mean over the rows of x.

#!/usr/bin/env python
# _*_ coding:utf-8 _*_
# ============================================
# @Time     : 2020/02/01 16:04
# @Author   : WanDaoYi
# @FileName : test_code.py
# ============================================

import numpy as np

x = [[1, 3, 5],
     [2, 4, 7],
     [1, 5, 7]
     ]

print(np.mean(x, axis=0))

The output is the mean of each of the 3 columns: [1.33333333, 4.0, 6.33333333]

Similarly, there is also x /= np.std(x, axis=0), or normalize.

 

②. PCA whitening, which is used relatively rarely.

  a. Whitening is equivalent to inserting a rotation between zero-centering and normalization, projecting the data onto its principal axes.

   (a). First zero-center the data

x -= np.mean(x, axis=0)

   (b). Then compute the covariance matrix

cov = np.dot(x.T, x) / x.shape[0]

   (c). Compute the singular value decomposition of the covariance matrix

u, s, v = np.linalg.svd(cov)

   (d). Decorrelate the data

x_rot = np.dot(x, u)

   (e). Whiten, i.e. divide the data in the eigenbasis by each dimension's eigenvalue to normalize the scale. The 1e-5 here prevents the denominator from being zero.

x_whiten = x_rot / np.sqrt(s + 1e-5)

  b. Drawbacks of PCA whitening

   (a). The whitening computation is expensive.

   (b). Whitening amplifies the noise in the data, because it stretches all dimensions of the input to the same scale, including the noise dimensions (which are typically uncorrelated and have small variance). In practice this drawback can be mitigated by increasing 1e-5 to a larger value to introduce stronger smoothing.

  c. Reasons for choosing BN at the fully connected layers instead of PCA whitening:

   (a). Whitening is computationally expensive, and this costly operation would have to be performed at every layer in every round of training.

   (b). Because whitening changes the distribution of every layer in the network, it changes the expressive power of the data in the network itself; the parameter information learned by the lower layers would be lost by the whitening operation.

 

(7). Training Tips

a. There need to be enough samples, they need to be sufficiently random, and they should be normalized.

b. Normalize the gradients, i.e. divide the computed gradient by the mini-batch size.

c. Gradient clipping

    Limit the maximum gradient: compute \large value = \sqrt{w_{1}^{2}+ w_{2}^{2} + w_{3}^{2} + \cdots}; if value exceeds a threshold, compute a decay coefficient so that value equals the threshold, e.g. 5, 10, or 15 (see the sketch after this list).

d. Dropout works well for preventing overfitting on small datasets; the value is generally set to 0.5. \large L_{1},\;L_{2} regularization can also be used to prevent overfitting.

e. Keep the inputs between network layers zero-mean; try not to use Sigmoid, and use activation functions such as Tanh or ReLU instead.

    The Sigmoid function only has a sizeable gradient in the interval (-4, 4); outside this interval the gradient is close to zero, which easily causes the vanishing-gradient problem.

    Even if the input is zero-mean, the output of the Sigmoid function is not zero-mean, so the input to the next layer will not be zero-mean either.

f. The ReLU + BN combination covers about 95% of situations; the identity activation is used only in special cases, such as regression problems.

g. Shuffle and augment the data.

h. Lower the learning rate

    As training progresses, the learning rate should gradually be reduced; likewise, when fine-tuning, set an appropriate learning rate based on the model's performance.

i. Make good use of tensorboard to monitor the state of the network and adjust network parameters.

j. Save model checkpoints regularly and keep a validation set.

   Save each epoch and its corresponding validation result; this lets you analyze when overfitting begins and makes it convenient to load a checkpoint for fine-tuning later.

k. Number of network layers

    Reduce it to the minimum that does not hurt performance.

l. batch_size

   Generally start tuning from around 128 and adjust in multiples of 2. An appropriate batch_size is what matters most; bigger is not necessarily better.
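
A minimal sketch of the gradient clipping described in item c above (clipping the global norm of the gradient down to a threshold; the toy gradients are illustrative):

import numpy as np

def clip_gradient_by_norm(grads, threshold=5.0):
    """Scale the gradient down so its global L2 norm does not exceed `threshold`."""
    value = np.sqrt(sum(np.sum(g ** 2) for g in grads))    # global gradient norm
    if value > threshold:
        decay = threshold / value                           # decay coefficient
        grads = [g * decay for g in grads]
    return grads

grads = [np.random.randn(3, 4) * 10, np.random.randn(4) * 10]   # toy gradients
clipped = clip_gradient_by_norm(grads, threshold=5.0)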

 

(8). Ensemble Learning

Ensembling is the ultimate weapon for squeezing better results out of papers. In deep learning, the common approaches are:

①. Same hyperparameters, different initializations

②. Different hyperparameters, with the best set chosen by cross validation

③. Same hyperparameters, but models from different stages of training, i.e. after different numbers of iterations

④. Linearly combining different models, e.g. an RNN and a traditional model

⑤. Train multiple models and average their results at test time; this gives roughly a 2% improvement.

⑥. When training a single model, averaging the checkpoints from different stages of training can also give an improvement (a small sketch follows below).

⑦. The parameters used for training and those used at test time can also be combined.
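
As a small sketch of point ⑥ (averaging checkpoints), assuming each checkpoint is a dict of numpy weight arrays with the same keys (the toy checkpoints are illustrative):

import numpy as np

def average_checkpoints(checkpoints):
    """Average several checkpoints (dicts of weight arrays) element-wise."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# toy example: three "checkpoints" of the same tiny model
ckpts = [{"w1": np.random.randn(4, 3), "b1": np.random.randn(3)} for _ in range(3)]
avg = average_checkpoints(ckpts)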

 

 

 

                

 
