Hyperparameter Setting in Deep Learning

1. Setting of network hyperparameters

  • Input image size settings:

    • To facilitate GPU parallel computation, the input image size is generally set to a power of 2 (e.g., 32, 64, 128, 256)
  • Convolutional layer parameter settings:

    • The convolution kernel size is typically 1×1, 3×3, or 5×5
    • Zero padding makes full use of edge information and keeps the output spatial size the same as the input
    • The number of convolution kernels is usually a power of 2, such as 64, 128, 256, 512, or 1024
  • Pooling layer parameter settings:

    • The pooling window size is generally 2×2 with a stride of 2
    • A strided convolutional layer can also be used instead of a pooling layer to achieve downsampling
  • Fully connected layers (often replaced by Global Average Pooling):
    • The difference between Global Average Pooling and (local) Average Pooling lies in the word "Global". Both "Global" and "Local" describe the pooling window: local pooling averages over a sub-region of the feature map and then slides the window, whereas global pooling averages over the entire feature map (the kernel size is set equal to the size of the feature map)
    • Therefore, the number of output nodes equals the number of feature maps, and the outputs can usually be fed directly to the softmax layer (a model sketch follows this list)
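To make these settings concrete, here is a minimal tf.keras sketch of a network that follows the conventions above. The input size, filter counts, and number of classes are illustrative assumptions, not values from the post:

```python
import tensorflow as tf

# Input size is a power of 2; filter counts are powers of 2 as well (assumed values).
inputs = tf.keras.Input(shape=(128, 128, 3))

# 3x3 kernels with zero padding ('same') keep the spatial size unchanged.
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)   # 2x2 pooling, stride 2

x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
# A strided convolution can replace the pooling layer for downsampling:
x = tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)

# Global Average Pooling averages each feature map down to a single value,
# so the number of outputs equals the number of feature maps.
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)  # fed directly to softmax

model = tf.keras.Model(inputs, outputs)
```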

2. Training hyperparameter settings and techniques

  • Choice of learning rate
    • With a fixed learning rate, 0.01 or 0.001 are usually good choices. In addition, the quality of a learning rate can be judged from how the validation error evolves during training.
    • When adjusting the learning rate automatically, learning-rate decay schedules (step decay, exponential decay) or adaptive optimization algorithms (Adam, Adadelta) can be used (see the decay-schedule sketch after this list).
  • Selection of mini-batches

    • A batch size that is too small makes training very slow; one that is too large speeds up training but increases memory usage and may reduce accuracy.
    • So 32 to 256 is a good range of initial values, especially 64 and 128. Powers of 2 are preferred because computer memory is organized in powers of 2 (binary addressing), so these batch sizes map well onto the hardware.
  • Number of training iterations or epochs

    • Early Stopping: stop training if the validation error does not decrease within a given period (e.g., 200 iterations) of training (see the training-loop sketch after this list)
    • Two predefined stopping hooks already exist among the training hooks in tf.train:
      • StopAtStepHook: a hook used to request that training stop after a certain number of steps
      • NanTensorHook: a hook that monitors the loss and stops training when a NaN loss is encountered
  • Shuffle per epoch: reshuffle the training data at the start of every epoch so that mini-batches differ between epochs (illustrated in the training-loop sketch after this list)

  • Gradient Clipping: prevents exploding gradients by rescaling gradients whose norm exceeds a threshold (see the sketch after this list)
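A minimal sketch of the two decay schedules mentioned above (step decay and exponential decay), written as plain Python functions; the drop factor, decay constant, and epoch counts are illustrative assumptions:

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** math.floor(epoch / epochs_per_drop))

def exponential_decay(initial_lr, epoch, k=0.05):
    """Exponential decay: lr = lr0 * exp(-k * epoch)."""
    return initial_lr * math.exp(-k * epoch)

# Example: start from the commonly used fixed value 0.01
for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.01, epoch), exponential_decay(0.01, epoch))
```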

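A framework-agnostic sketch of early stopping combined with per-epoch shuffling. `train_one_epoch` and `validation_error` are hypothetical placeholders for the model-specific steps, and the patience value is an illustrative assumption; the tf.train hooks mentioned above provide ready-made step-limit and NaN checks for TensorFlow training loops.

```python
import numpy as np

def fit(X, y, train_one_epoch, validation_error, max_epochs=200, patience=20):
    """Early stopping: quit when the validation error has not improved for `patience` epochs."""
    best_err, epochs_without_improvement = float('inf'), 0
    rng = np.random.default_rng(0)

    for epoch in range(max_epochs):
        # Shuffle the training data (numpy arrays) at the start of every epoch.
        order = rng.permutation(len(X))
        train_one_epoch(X[order], y[order])   # placeholder: one pass over the shuffled data

        err = validation_error()              # placeholder: error on the validation set
        if err < best_err:
            best_err, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f'early stopping at epoch {epoch}')
                break
```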

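A minimal NumPy sketch of gradient clipping by global norm, one common way to implement the idea (the threshold and the example gradients are illustrative assumptions). Keras optimizers also accept `clipnorm`/`clipvalue` arguments that perform this rescaling for you.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm does not exceed max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        grads = [g * (max_norm / (global_norm + 1e-12)) for g in grads]
    return grads

# Usage: clip the gradients before applying the parameter update.
grads = [np.random.randn(3, 3) * 100, np.random.randn(3) * 100]  # exaggerated gradients
clipped = clip_by_global_norm(grads, max_norm=5.0)
```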
