Experience with deep learning (RNN, CNN) hyperparameter tuning

Author: Zhihu user
Link: https://www.zhihu.com/question/41631631/answer/94816420
Source: Zhihu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.

Parameter initialization

Choose any one of the following methods; the results are basically the same. But initialization must be done, otherwise it may slow down convergence, hurt the final result, or even cause a series of problems such as NaN.

In the formulas below, n_in is the input size (fan-in), n_out is the output size (fan-out), and n is either n_in or (n_in + n_out) * 0.5.

Xavier initialization paper: Glorot & Bengio (2010)

He initialization paper: He et al. (2015)

  • Uniform distribution initialization: w = np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])
    • Xavier initialization, suitable for ordinary activation functions (tanh, sigmoid): scale = np.sqrt(3/n)
    • He initialization, suitable for ReLU: scale = np.sqrt(6/n)
  • Normal (Gaussian) distribution initialization: w = np.random.randn(n_in, n_out) * stdev # stdev is the standard deviation of the Gaussian; the mean is set to 0
    • Xavier initialization, suitable for ordinary activation functions (tanh, sigmoid): stdev = np.sqrt(1/n)
    • He initialization, suitable for ReLU: stdev = np.sqrt(2/n)
  • SVD (orthogonal) initialization: tends to work better for RNNs. A numpy sketch of all three approaches follows this list.
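A minimal numpy sketch of the initializations above; the helper names and the choice n = (n_in + n_out) / 2 are illustrative, not from the original post.

    import numpy as np

    def uniform_init(n_in, n_out, activation="tanh"):
        # Xavier scale for tanh/sigmoid, He scale for ReLU
        n = (n_in + n_out) / 2.0
        scale = np.sqrt(3.0 / n) if activation in ("tanh", "sigmoid") else np.sqrt(6.0 / n)
        return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

    def normal_init(n_in, n_out, activation="tanh"):
        # Xavier stdev for tanh/sigmoid, He stdev for ReLU
        n = (n_in + n_out) / 2.0
        stdev = np.sqrt(1.0 / n) if activation in ("tanh", "sigmoid") else np.sqrt(2.0 / n)
        return np.random.randn(n_in, n_out) * stdev

    def svd_init(n_in, n_out):
        # SVD / orthogonal initialization, often helpful for RNN recurrent weights
        a = np.random.randn(n_in, n_out)
        u, _, vt = np.linalg.svd(a, full_matrices=False)
        return u if u.shape == (n_in, n_out) else vt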

Data preprocessing

  • Zero-center and normalize, this is quite common:
    • X -= np.mean(X, axis=0) # zero-center
    • X /= np.std(X, axis=0) # normalize
  • PCA whitening, this is used less often.
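A short sketch of both preprocessing steps; the function names and the eps constant are illustrative, not from the original post.

    import numpy as np

    def zero_center_normalize(X):
        # statistics should be computed on the training set and reused at test time
        X = X - np.mean(X, axis=0)   # zero-center
        X = X / np.std(X, axis=0)    # normalize
        return X

    def pca_whiten(X, eps=1e-5):
        X = X - np.mean(X, axis=0)
        cov = np.dot(X.T, X) / X.shape[0]       # data covariance matrix
        U, S, _ = np.linalg.svd(cov)
        Xrot = np.dot(X, U)                     # decorrelate (rotate into eigenbasis)
        return Xrot / np.sqrt(S + eps)          # scale each component to unit variance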

Training tips

  • Normalize the gradient, i.e. divide the accumulated gradient by the minibatch size.
  • Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(w1^2 + w2^2 + ...) over all gradients; if value exceeds a threshold, apply a decay coefficient so that value equals the threshold (common thresholds: 5, 10, 15). A sketch of both steps follows this list.
  • Dropout works well for preventing overfitting on small datasets; the value is generally set to 0.5. In most of my experiments, dropout + SGD on small data gave a very noticeable improvement, so try it if you can. The placement of dropout matters: for RNNs, it is recommended to put it at the input->RNN and RNN->output positions. For how to use dropout inside an RNN, you can refer to Zaremba et al. (2014), "Recurrent Neural Network Regularization".
  • Adam, Adadelta, etc.: on small data, in my experiments they did not work as well as SGD. SGD converges more slowly, but the final result is generally better. If you use SGD, you can start with a learning rate of 1.0 or 0.1, check the validation set after a while, and halve the learning rate if the cost has not dropped. I have seen many papers do this, and my own experimental results were also very good (a sketch of this schedule follows the list). Of course, you can also run an Ada-family optimizer first and switch to SGD once it is converging quickly, which also gives an improvement. It is said that Adadelta generally works better on classification problems, while Adam works better on generation problems.
  • Apart from places like gates, where the output must be limited to 0-1, try not to use sigmoid; use activation functions such as tanh or ReLU instead. 1. The sigmoid function only has a sizable gradient in roughly the -4 to 4 interval; outside it the gradient is close to 0, which easily causes vanishing gradients. 2. Even with zero-mean input, the output of the sigmoid function is not zero-mean.
  • The RNN hidden dimension and the embedding size are generally tuned starting from around 128. The batch size is also generally tuned starting from around 128; what matters is an appropriate batch size, not the largest possible one.
  • Initializing embeddings with word2vec vectors can, on small data, not only speed up convergence noticeably but also improve the results.
  • Shuffle the data as much as possible
  • Initializing the bias of the LSTM forget gate to 1.0 or larger can give better results (see Jozefowicz et al., 2015). In my experiments here, setting it to 1.0 improved the convergence speed; in practice, different tasks may need different values (a sketch follows this list).
  • Batch Normalization is said to improve results, though I have not tried it; it is recommended as a final means of improving the model. Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
  • If your model contains a fully connected layer (MLP) whose input and output sizes are the same, consider replacing the MLP with a Highway Network; in my attempts it improved the results slightly. The idea is that a gate is added to the output to control the flow of information (a sketch follows this list). For details, refer to the Highway Networks paper (Srivastava et al., 2015).
  • A tip from @张新宇: train one round with regularization and one round without, and repeat.
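A sketch of the gradient normalization and clipping described above, using the thresholds mentioned (5, 10, or 15); the function name is illustrative.

    import numpy as np

    def normalize_and_clip(grads, batch_size, threshold=5.0):
        # divide the accumulated gradient by the minibatch size
        grads = [g / batch_size for g in grads]
        # global norm over all parameters: sqrt(g1^2 + g2^2 + ...)
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        # rescale so the norm equals the threshold when it is exceeded
        if total_norm > threshold:
            grads = [g * (threshold / total_norm) for g in grads]
        return grads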
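A sketch of the SGD schedule described above (start at 1.0 or 0.1, halve the learning rate when the validation cost stops dropping); train_one_epoch and evaluate are hypothetical helpers standing in for your own training and validation code.

    lr = 1.0                                     # or 0.1
    best_val_cost = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, train_data, lr)   # hypothetical helper
        val_cost = evaluate(model, val_data)     # hypothetical helper
        if val_cost >= best_val_cost:
            lr = lr / 2.0                        # halve when validation cost stops dropping
        else:
            best_val_cost = val_cost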
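A sketch of the forget-gate bias initialization; it assumes a fused LSTM bias vector laid out as [input, forget, cell, output] gates, which varies between implementations.

    import numpy as np

    hidden_size = 128
    b = np.zeros(4 * hidden_size)              # fused bias for all four gates
    b[hidden_size:2 * hidden_size] = 1.0       # forget-gate slice set to 1.0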
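A sketch of the highway idea mentioned above (a gate mixing the transformed output with the raw input, so input and output sizes must match); the parameter names are illustrative.

    import numpy as np

    def highway(x, W_h, b_h, W_t, b_t):
        h = np.tanh(np.dot(x, W_h) + b_h)                     # transformed input H(x)
        t = 1.0 / (1.0 + np.exp(-(np.dot(x, W_t) + b_t)))     # transform gate T(x), sigmoid
        return t * h + (1.0 - t) * x                          # gate controls information flow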

Ensemble

Ensembling is the ultimate nuclear weapon for boosting paper results. In deep learning there are generally the following approaches:

  • Same parameters, different initialization methods
  • Different parameters: select the best few groups via cross-validation
  • Same parameters, different stages of training, i.e. models saved at different iterations
  • Different models, combined by linear fusion, e.g. an RNN and a traditional model (a sketch follows this list)
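A sketch of linear fusion over several models' predictions; the predict() interface and the weights are illustrative assumptions, not from the original post.

    def ensemble_predict(models, x, weights=None):
        preds = [m.predict(x) for m in models]           # hypothetical predict() interface
        if weights is None:
            weights = [1.0 / len(preds)] * len(preds)    # simple average by default
        return sum(w * p for w, p in zip(weights, preds))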
