1. Setting of network hyperparameters
Input image size settings:
- To facilitate GPU parallel computing, the image size is generally set to a power of 2.
Convolutional layer parameter settings:
- The convolution kernel is generally small, with odd sizes such as 3×3 or 5×5 being common choices.
- Zero padding makes full use of edge information and keeps the spatial size of the input unchanged.
- The number of convolution kernels is usually set to a power of 2, such as 64, 128, 256, 512, or 1024.
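The interaction between kernel size, zero padding, and output size follows the standard convolution arithmetic; the function below is an illustrative sketch (the name is mine, not from any library).

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# "Same" zero padding for a stride-1 convolution, p = (k - 1) // 2,
# keeps the input size unchanged for odd kernel sizes.
print(conv_output_size(32, 3, p=1))  # 3x3 kernel, padding 1 -> 32
print(conv_output_size(32, 5, p=2))  # 5x5 kernel, padding 2 -> 32
print(conv_output_size(32, 3, p=0))  # no padding -> 30
```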
Pooling layer parameter settings:
- The pooling kernel size is typically 2×2, with a stride of 2.
- A strided convolutional layer can also be used in place of the pooling layer to achieve downsampling.
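A minimal NumPy sketch of 2×2 max pooling with stride 2, the common pooling configuration, showing how it halves each spatial dimension (assumes an even-sized single feature map):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H, W even)."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take the max of each.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
y = max_pool_2x2(x)
print(y.shape)  # (2, 2) -- each spatial dimension is halved
```

A stride-2 convolution produces the same 2× downsampling, but with learned weights instead of a fixed max operation.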
Fully connected layer (use Global Average Pooling instead):
- The difference between Global Average Pooling and (local) Average Pooling lies in the word "Global". Both "Global" and "Local" literally describe the area covered by the pooling window: local pooling averages over a sub-region of the Feature Map and then slides the window, while global pooling averages over the entire Feature Map (the kernel size is set equal to the size of the Feature Map).
- Therefore, Global Average Pooling outputs as many nodes as there are Feature Maps, and the output can generally be fed directly to the softmax layer.
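The "one output node per feature map" behavior can be sketched in a few lines of NumPy (the shapes below, 512 maps of 7×7, are illustrative):

```python
import numpy as np

# Feature maps with shape (C, H, W): global average pooling averages over
# the entire H x W area of each map, producing one value per feature map.
feature_maps = np.random.rand(512, 7, 7)
gap = feature_maps.mean(axis=(1, 2))   # kernel size == feature-map size

print(gap.shape)  # (512,) -- one output node per feature map
```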
2. Training hyperparameter settings and techniques
- Choice of learning rate
- When using a fixed learning rate, values such as 0.01 or 0.001 are good choices. In addition, the quality of the learning rate can be judged from the validation error, as shown in the following figure.
- When adjusting the learning rate automatically, a learning-rate decay schedule (step decay, exponential decay) or an adaptive optimization algorithm (Adam, Adadelta) can be used.
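The two decay schedules mentioned above can be written as simple functions of the epoch; the parameter values here (drop factor, decay rate) are illustrative defaults, not recommendations from the text.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, k=0.05):
    """Exponential decay: lr = lr0 * exp(-k * epoch), a smooth decrease."""
    return lr0 * math.exp(-k * epoch)

print(step_decay(0.01, 25))  # 0.01 * 0.5**2 = 0.0025
```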
Selection of mini-batch size
- A batch size that is too small makes training very slow; one that is too large speeds up training but increases memory usage and may lower accuracy.
- A value between 32 and 256 is therefore a good initial choice, especially 64 or 128. Powers of 2 are preferred because computer memory is generally organized in powers of 2, using binary encoding.
Number of training iterations or epochs
- Early Stopping: stop the training process if the validation error does not decrease within a given window of training (e.g. 200 iterations).
- Two predefined stopping hooks already exist among the training hooks of tf.train:
- StopAtStepHook: hook used to request that training stop after a certain number of steps
- NanTensorHook: hook that monitors the loss and stops training when a NaN loss is encountered
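The early-stopping rule itself is framework-agnostic; here is a minimal sketch in plain Python (the function name and the patience default are mine, not from tf.train):

```python
def train_with_early_stopping(validation_errors, patience=200):
    """Return the step at which training stops: when the validation error
    has not improved for `patience` consecutive evaluation steps."""
    best, best_step = float("inf"), 0
    for step, err in enumerate(validation_errors):
        if err < best:
            best, best_step = err, step          # new best: reset the window
        elif step - best_step >= patience:
            return step                          # no improvement within window
    return len(validation_errors) - 1            # trained to the end

# Best error at step 1; with patience=3, training stops at step 4.
stop = train_with_early_stopping([0.5, 0.4, 0.41, 0.42, 0.43, 0.44], patience=3)
```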
Shuffle per epoch: reshuffle the training data at the start of every epoch
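Per-epoch shuffling can be done with a fresh random permutation each epoch, applied identically to data and labels so they stay aligned (a NumPy sketch with toy shapes):

```python
import numpy as np

x = np.arange(10).reshape(5, 2)   # 5 samples, 2 features each
y = np.arange(5)                  # matching labels

for epoch in range(3):
    # Draw a new permutation every epoch and apply the SAME one to x and y.
    perm = np.random.permutation(len(x))
    x_shuf, y_shuf = x[perm], y[perm]
    # ... iterate over mini-batches of (x_shuf, y_shuf) here ...
```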
Gradient Clipping: Preventing Gradient Explosion
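Clipping by global norm, one common form of gradient clipping, rescales all gradients together whenever their combined L2 norm exceeds a threshold; a NumPy sketch (the threshold of 5.0 is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm   # shrink all gradients uniformly
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=5.0)     # rescaled to norm 5
```

Because every gradient is scaled by the same factor, the update direction is preserved; only its magnitude is bounded.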