Neural Networks: Deep Learning Optimization Methods

1. What methods can improve the generalization ability of CNN models?

  1. Collect more data: Data determines the upper limit of the algorithm.

  2. Optimize data distribution: data category balance.

  3. Choose an appropriate objective function.

  4. Design an appropriate network structure.

  5. Data augmentation.

  6. Weight regularization.

  7. Use appropriate optimizers, etc.

2. Summary of high-frequency questions in BN-level interviews

What problem does the BN layer solve?

A classic assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are inconsistent, new machine learning problems arise, such as transfer learning / domain adaptation. Covariate shift is a sub-problem under the inconsistent-distribution setting: the conditional probabilities of the source and target domains are the same, but their marginal distributions differ. The output of each layer of a neural network, having passed through that layer's convolution operations, clearly has a different distribution from the layer's input, and the difference grows as the network gets deeper; yet the labels these activations represent remain unchanged, which matches the definition of covariate shift.

As the network gets deeper, the distribution of the activation inputs before the nonlinear transformation gradually shifts or changes (the covariate shift described above). Training converges slowly mainly because the overall distribution drifts toward the saturated regions of the nonlinearity (e.g., the upper and lower limits of the sigmoid), which makes the gradients of the lower layers vanish during backpropagation; this is the essential reason why training a deep neural network converges slower and slower. BN uses a normalization step to force the distribution of the input to every neuron in each layer back to a standard normal distribution with mean 0 and variance 1, avoiding the vanishing-gradient problem caused by the activation function. So rather than saying that BN's role is to alleviate covariate shift, one can equally say that BN alleviates the vanishing-gradient problem.

BN formula
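For a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the standard BN transform is:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2,\qquad \hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta$$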

Here, scale and shift ($\gamma$ and $\beta$) are two learnable parameters. Simply subtracting the mean and dividing by the standard deviation may not give the best distribution: the data itself may be very asymmetric, or the activation function may not work best on data with unit variance. Therefore, scale and shift variables are added so the network can adjust the normalized distribution and achieve better results.

The difference between training and testing of BN layer

In the training phase, the BN layer standardizes each batch of training data, i.e., it uses the mean and variance of each batch (these statistics differ from batch to batch).

In the testing phase, we generally feed in a single test sample, so there is no concept of a batch. The mean and variance used at this time are those of the entire training set, which can be obtained with a moving average during training:
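Following the standard BN formulation, the population statistics are estimated from the per-batch statistics as:

$$E[x] \leftarrow E_B[\mu_B],\qquad \mathrm{Var}[x] \leftarrow \frac{m}{m-1}\,E_B[\sigma_B^2]$$

where $m$ is the batch size; in practice these are accumulated with a moving average during training.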

A simple way to read the formula above: for the mean, directly average the per-batch means $\mu_B$; for the variance, use an unbiased estimate based on the per-batch variances $\sigma_B^2$.

When testing, the formula used by BN is:
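With the population statistics above and the learned $\gamma$ and $\beta$:

$$y = \gamma\,\frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta$$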

Why not use the mean and variance of the entire training set when training BN?

Because using the mean and variance of the entire training set makes it easy to overfit. For BN, the point is that each batch of data is standardized toward the same distribution, while the mean and variance of different batches differ somewhat rather than being fixed values. This difference increases the robustness of the model and reduces overfitting to a certain extent.
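A minimal PyTorch sketch of the train/test difference discussed above (layer sizes and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)
x = torch.randn(8, 16, 32, 32)   # a batch of 8 feature maps

bn.train()           # training mode: uses the current batch's mean/variance
y_train = bn(x)      # and updates running_mean / running_var as a moving average

bn.eval()            # test mode: uses the accumulated running statistics
y_test = bn(x[:1])   # a single sample works because no batch statistics are needed
```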

Where is the BN layer used?

In a CNN, the BN layer should be placed before the nonlinear activation function. The input of a hidden layer is the output of the previous layer's nonlinear activation, whose distribution still changes drastically in the early stage of training, so constraining its first and second moments cannot alleviate covariate shift well. The pre-activation values that BN operates on, by contrast, are closer to a normal distribution, and limiting their first and second moments makes the distribution of the values fed into the activation function more stable.

The parameters of the BN layer

We know that $\gamma$ and $\beta$ are parameters that need to be learned; the essence of BN is to use optimization to adjust the mean and variance. In a CNN, because a feature corresponds to an entire feature map, BN is also applied per feature map rather than per dimension. For example, if a layer has $c$ feature maps, then the number of BN parameters is $c \times 2$.
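As a quick check, a small PyTorch sketch (the value of c is arbitrary):

```python
import torch.nn as nn

c = 64                                   # number of feature maps in this layer
bn = nn.BatchNorm2d(c)
n_learnable = sum(p.numel() for p in bn.parameters())
print(n_learnable)                       # 128 = c * 2 (one gamma and one beta per feature map)
```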

Advantages and Disadvantages of BN

advantage:

  1. A larger initial learning rate can be chosen, because training with BN converges quickly.

  2. The need for Dropout and L2 regularization is reduced (they can often be removed).

  3. There is no need to use local response normalization.

  4. The data set can be thoroughly shuffled.

  5. The model is more robust.

shortcoming:

  1. Batch Normalization is very dependent on the size of the Batch. When the Batch value is small, the calculated mean and variance are unstable.

  2. Therefore, BN is not suitable for the following scenarios: small batch, RNN, etc.

3. The role of Instance Normalization

Instance Normalization (IN), like Batch Normalization (BN), is a normalization method; the difference is that IN acts on a single image, while BN acts on a whole batch.

BN normalizes the same channel of every image in the batch together, while IN normalizes a single channel of a single image separately. As shown in the figure below, C represents the number of channels and N represents the number of images in the batch.

IN is suitable for generative models, such as image style transfer. Because the result of image generation mainly depends on a single image instance, normalizing over the whole batch is not appropriate for image stylization. Using IN in style transfer not only accelerates model convergence but also maintains the independence between image instances.

Here is the formula for IN:
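In the standard per-instance, per-channel formulation (with $\epsilon$ a small constant):

$$\mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm},\qquad \sigma_{ti}^{2} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H}\left(x_{tilm}-\mu_{ti}\right)^{2},\qquad y_{tijk} = \frac{x_{tijk}-\mu_{ti}}{\sqrt{\sigma_{ti}^{2}+\epsilon}}$$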

Among them, t represents the index of the picture, and i represents the index of the feature map.
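A small PyTorch comparison of the two layers (shapes are made up for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 3, 64, 64)       # N=4 images, C=3 channels

bn = nn.BatchNorm2d(3)              # statistics over (N, H, W) for each channel
inorm = nn.InstanceNorm2d(3)        # statistics over (H, W) for each channel of each image

print(bn(x).shape, inorm(x).shape)  # both preserve the shape: (4, 3, 64, 64)
```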

4. What are the tricks to improve the stability of GAN training?

1. Normalize the inputs

  1. Normalize the input images to the range [-1, 1].
  2. The output of the last layer of the generator uses the Tanh activation function.

Normalization is very important; unnormalized images will not converge. A simple way to normalize images is (images - 127.5) / 127.5, and the result is then fed to the discriminator for training. By the same token, the generated images also go through the discriminator, i.e., the generator's output should also lie between -1 and 1, so the Tanh activation function is the appropriate choice for the last layer.
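A minimal sketch of this preprocessing, assuming 8-bit input images (layer sizes are illustrative):

```python
import numpy as np
import torch.nn as nn

# scale uint8 images from [0, 255] to [-1, 1] before feeding the discriminator
images = np.random.randint(0, 256, size=(8, 3, 64, 64)).astype(np.float32)
images = (images - 127.5) / 127.5

# the generator's last layer uses Tanh so its output also lies in [-1, 1]
generator_tail = nn.Sequential(
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),
)
```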

2. Replace the original GAN loss function and label inversion

  1. The original GAN loss function suffers from vanishing gradients and mode collapse early in training; the Earth Mover (Wasserstein) distance can be used instead.

  2. In actual projects, it is often convenient to train with inverted labels, i.e., treat the generated images as real and the real images as fake.

3. Use random noise $Z$ with spherical structure as input

  1. Don't sample from a uniform distribution.
  2. Sample from a Gaussian distribution instead (a minimal sketch follows this list).
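A short sketch of the recommended sampling (dimensions are arbitrary):

```python
import torch

batch_size, z_dim = 16, 128

z = torch.randn(batch_size, z_dim)            # Gaussian noise (recommended)
# z = torch.rand(batch_size, z_dim) * 2 - 1   # uniform noise in [-1, 1] (discouraged)
```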

4. Use BatchNorm

  1. A mini-batch must contain only real data or only fake data; do not mix them together for training (see the sketch after this list).
  2. Use BatchNorm if you can; if not, use Instance Normalization instead.
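A toy sketch of a discriminator update that keeps real and fake samples in separate mini-batches (all layer sizes and names here are made up for illustration):

```python
import torch
import torch.nn as nn

# toy models, just to illustrate the batching rule
D = nn.Sequential(nn.Linear(64, 128), nn.BatchNorm1d(128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(16, 64), nn.Tanh())
criterion = nn.BCELoss()
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)

real_images = torch.randn(32, 64)
z = torch.randn(32, 16)

# one forward pass on only real images, one on only fake images -- never mixed,
# so the BatchNorm statistics of each pass come from a single data source
opt_D.zero_grad()
loss_real = criterion(D(real_images), torch.ones(32, 1))
loss_fake = criterion(D(G(z).detach()), torch.zeros(32, 1))
(loss_real + loss_fake).backward()
opt_D.step()
```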

5. Avoid using operations such as ReLU and MaxPool to introduce sparse gradients

  1. The stability of GAN will be greatly affected by the introduction of sparse gradients.
  2. It is better to use an activation function like LeakyReLU. (used in both D and G)
  3. For downsampling, it is best to use: Average Pooling or convolution + stride.
  4. For upsampling, it is best to use: PixelShuffle or transposed convolution + stride.

It is best to remove Pooling altogether, because Pooling loses information, which is not helpful for GAN training.
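A sketch of typical building blocks that follow these recommendations (channel counts are illustrative):

```python
import torch.nn as nn

# discriminator downsampling block: strided conv + LeakyReLU, no MaxPool
d_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2, inplace=True),
)

# generator upsampling block: conv + PixelShuffle (a transposed conv also works)
g_block = nn.Sequential(
    nn.Conv2d(128, 64 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(upscale_factor=2),   # (N, 256, H, W) -> (N, 64, 2H, 2W)
    nn.LeakyReLU(0.2, inplace=True),
)
```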

6. Use Soft and Noisy tags

  1. Soft Label: replace the hard labels of positive and negative samples with random values drawn from the intervals [0.7, 1.2] and [0, 0.3] respectively (see the sketch after this list).
  2. You can add some noise to the labels during training, such as randomly flipping the labels of some samples.
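A small sketch of soft and noisy labels (the 5% flip rate is an arbitrary choice):

```python
import torch

batch_size = 32

# soft labels: real targets in [0.7, 1.2], fake targets in [0.0, 0.3]
real_labels = torch.empty(batch_size, 1).uniform_(0.7, 1.2)
fake_labels = torch.empty(batch_size, 1).uniform_(0.0, 0.3)

# noisy labels: randomly flip about 5% of the real labels
flip = torch.rand(batch_size, 1) < 0.05
real_labels = torch.where(flip, 1.0 - real_labels, real_labels)
```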

7. Use the Adam optimizer

  1. The Adam optimizer is very useful for GANs.
  2. Use Adam in the generator and SGD in the discriminator.

8. Track signs of training failure

  1. If the discriminator's loss goes to 0, model training has failed.
  2. If the generator's loss steadily decreases, the discriminator is not doing its job.

9. Add appropriate noise to the input end

  1. Add some artificial noise to the input of the discriminator.
  2. Gaussian noise is added to each layer of the generator.

10. Differential training of generator and discriminator

  1. Train the discriminator more, especially when adding noise.

11. Two Timescale Update Rule (TTUR)

Use different learning rates for the discriminator and generator. The generator is updated with a lower learning rate and the discriminator is updated with a higher learning rate.
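A minimal sketch of TTUR with two Adam optimizers (the 1e-4 / 4e-4 split and the toy models are illustrative):

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(128, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

# TTUR: lower learning rate for the generator, higher for the discriminator
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
```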

12. Gradient Penalty

Using the gradient penalty mechanism can greatly enhance the stability of GAN and minimize the occurrence of mode collapse problems.
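A compact sketch of a WGAN-GP style gradient penalty, assuming samples are flattened to shape (N, D) (the toy critic is illustrative):

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """Penalize the critic's gradient norm on interpolates between real and fake samples."""
    alpha = torch.rand(real.size(0), 1)                       # one mixing weight per sample
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interpolates,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = nn.Sequential(nn.Linear(64, 1))
real, fake = torch.randn(8, 64), torch.randn(8, 64)
gp = gradient_penalty(critic, real, fake)                     # add lambda * gp to the critic loss
```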

13. Spectral Normalization

Spectral normalization can be used as a weight normalization technique for the discriminator to ensure that the discriminator is K-Lipschitz continuous.
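In PyTorch this can be done by wrapping the discriminator's layers with torch.nn.utils.spectral_norm (layer sizes here are illustrative):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(64, 1, kernel_size=4)),
)
```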

14. Use multiple GAN structures

Multiple GAN/multi-generator/multi-discriminator structures can be used to make GAN training more stable, improve the overall effect, and solve more difficult problems.

5. Some hyperparameters that can be adjusted in deep learning alchemy

  1. Preprocessing (data size, data Normalization)
  2. Batch-Size
  3. Learning rate
  4. Optimizer
  5. Loss function
  6. Activation function
  7. Epoch
  8. Weight initialization
  9. NAS network architecture search

6. Related knowledge of Spectral Normalization

Spectral Normalization is a weight normalization technique. Like weight clipping and gradient penalty, it is one of the ways to make a model satisfy the 1-Lipschitz condition.

The Lipschitz condition limits how drastically a function can change, i.e., it bounds the function's gradient and thus keeps the relevant statistics bounded. The function is therefore smoother, and during the optimization of a neural network the parameter updates are more stable, making gradient explosion less likely.

The constraints of the Lipschitz condition are as follows:

$$\|f(x_1) - f(x_2)\| \le K\,\|x_1 - x_2\|$$

where $K$ is a constant, the Lipschitz constant. If $K = 1$, the function is 1-Lipschitz.

In the field of GANs, Spectral Normalization has many applications. In WGAN, only when the 1-Lipschitz constraint is satisfied can the Wasserstein distance be converted into a more tractable dual problem, which makes WGAN easier to train.

If a matrix $A$, viewed as a map $\mathbb{R}^{n} \to \mathbb{R}^{m}$, is to satisfy K-Lipschitz continuity, the minimum value of $K$ is $\sqrt{\lambda_{1}}$, where $\lambda_{1}$ is the largest eigenvalue of $A^{T}A$. Therefore, to make $A$ satisfy 1-Lipschitz continuity, it suffices to divide all elements of $A$ by $\sqrt{\lambda_{1}}$ (the spectral norm).

What Spectral Normalization actually does is divide each layer's parameter matrix by its own largest singular value, which is essentially a layer-by-layer SVD. Actually computing the SVD is too time-consuming, so the power iteration method is used to approximate it. The process is shown in the figure below:

Power iteration method process

After obtaining the spectral norm $\sigma_l(W)$, the parameters of each weight matrix are divided by it to achieve normalization.
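A standalone sketch of estimating the spectral norm of a single weight matrix by power iteration (not the exact implementation used inside spectral-norm layers):

```python
import torch

def power_iteration_spectral_norm(W, n_iters=5, eps=1e-12):
    """Estimate the largest singular value (spectral norm) of a 2-D matrix W."""
    u = torch.randn(W.size(0))
    for _ in range(n_iters):
        v = W.t() @ u
        v = v / (v.norm() + eps)
        u = W @ v
        u = u / (u.norm() + eps)
    sigma = u @ (W @ v)          # approximate spectral norm
    return sigma

W = torch.randn(64, 128)
sigma = power_iteration_spectral_norm(W)
W_sn = W / sigma                 # the normalized weight used in place of W
print(sigma, torch.linalg.matrix_norm(W, ord=2))   # the two values should be close
```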


Origin blog.csdn.net/weixin_51390582/article/details/135124638