Parameter interpretation in YOLO

https://blog.csdn.net/jinlong_xu/article/details/76375334

1. Batch_Size (batch size)

This parameter is used by mini-batch gradient descent. In each iteration the algorithm processes all samples in the current batch, and those samples jointly determine the direction of the update. Batch_Size is simply the number of samples in one batch.

If the data set is relatively small, the full data set can be used at once (Full Batch Learning); this does not scale to large data sets. 
The other extreme is training on only one sample at a time, i.e. Batch_Size = 1, where each update direction is determined by the gradient of that single sample.
Increasing Batch_Size within a reasonable range can: 
(1) improve memory utilization and thus the parallel efficiency of large matrix multiplications; 
(2) reduce the number of iterations needed to run one epoch (a full pass over the data set), so the same amount of data is processed faster; 
(3) within a certain range, a larger Batch_Size generally gives a more accurate descent direction and smaller oscillations during training. 
The disadvantages of blindly increasing Batch_Size: 
(1) it may exceed memory capacity; 
(2) although fewer iterations are needed per epoch, more epochs are needed to reach the same accuracy, so the parameters are corrected more slowly; 
(3) beyond a certain point, a larger Batch_Size barely changes the descent direction any more. 
Batch_Size tuning: 
When GPU memory allows, a large Batch_Size converges faster but can sometimes get stuck in a local minimum; a small Batch_Size introduces more randomness, which may give better results but converges more slowly; if Batch_Size is too small and the number of classes is large, the loss may oscillate and fail to converge. In practice, set an upper bound according to GPU memory (usually a multiple of 8), select part of the data, run a few batches to check whether the loss is decreasing, and then choose a suitable Batch_Size. 
Parameters are updated once every Batch_Size samples.

2. subdivisions 
If GPU memory is not large enough, the batch is split into subdivisions sub-batches, and the size of each sub-batch is batch/subdivisions; 
in the darknet code, batch/subdivisions is the quantity that is actually named batch.
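As an illustration, both parameters are set together in the [net] section of the cfg file; the values below are only an example in the style of the stock yolo-voc.cfg, not a prescription:

    [net]
    # 64 images are accumulated before each weight update
    batch=64
    # the 64 images are loaded and forwarded in 8 groups of 64/8 = 8 images,
    # so only 8 images need to fit in GPU memory at a time
    subdivisions=8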

3. Momentum - momentum 
Momentum is a commonly used technique for accelerating gradient descent. Plain SGD updates the weights along the negative gradient direction:

$w \leftarrow w - \eta \nabla L(w)$

while SGD with a momentum term keeps a running update $v$:

$v \leftarrow \mu v - \eta \nabla L(w)$, then $w \leftarrow w + v$

where $\mu$ is the momentum coefficient. Intuitively, if the previous update $v$ and the current negative gradient point in the same direction, the step taken this time becomes larger, which accelerates convergence. The recommended value for momentum is 0.9.

4. Weight decay - decay 
The purpose of weight decay is to prevent overfitting. As a network starts to overfit, its weights tend to grow larger, so to avoid this each weight is shrunk by a small factor in every iteration. This is equivalent to adding a penalty term to the error function; a common penalty term is the sum of the squares of all weights multiplied by a decay constant. The weight decay penalty drives the weights toward smaller absolute values.
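As a sketch of that equivalence (the notation here is ours, not from the original post): with error function $E(w)$, learning rate $\eta$ and decay constant $\lambda$, the penalized objective and the resulting update are

$$E'(w) = E(w) + \frac{\lambda}{2}\sum_i w_i^2 \qquad\Longrightarrow\qquad w_i \leftarrow w_i - \eta\frac{\partial E}{\partial w_i} - \eta\lambda\, w_i$$

so every step both follows the gradient of the original error and shrinks each weight by the factor $(1-\eta\lambda)$. In darknet cfg files this constant is the decay parameter (commonly decay=0.0005).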

5. angle, saturation, exposure, hue 
angle: the rotation range of the picture, in degrees; with angle=5, each newly generated picture is randomly rotated between -5 and 5 degrees. 
saturation & exposure: the range of saturation and exposure changes; in tiny-yolo-voc.cfg they vary between 1x and 1.5x (and between 1/1.5x and 1x). 
hue: the range of hue changes, -0.1 to 0.1 in tiny-yolo-voc.cfg. 
In each iteration, new training pictures are generated according to angle, saturation, exposure and hue.
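For reference, these augmentation parameters also sit in the [net] section of the cfg file; the snippet below is only an example in the style of the stock tiny-yolo-voc.cfg, not a copy of it:

    [net]
    # rotate each generated picture by a random angle in [-angle, +angle] degrees
    angle=0
    # vary saturation between 1/1.5x and 1.5x
    saturation=1.5
    # vary exposure between 1/1.5x and 1.5x
    exposure=1.5
    # shift hue randomly within +/- 0.1
    hue=.1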

6. Learning rate - learning_rate 
The learning rate determines how fast the parameters move toward the optimal value. If it is too large, the optimum may be overshot and the loss may fail to converge or even diverge; if it is too small, optimization becomes inefficient, the algorithm may fail to converge for a long time, and it is also easier to get stuck in a local optimum (for non-convex functions there is no guarantee of reaching the global optimum). A good learning rate converges as quickly as possible while still guaranteeing convergence. 
Setting a good learning_rate requires repeated experiments. At the beginning it can be set somewhat large so that the weights change quickly, and after a certain number of epochs it is reduced manually. 
In YOLO training, the network is trained for 160 epochs with an initial learning rate of 0.001, and the learning rate is divided by 10 at epochs 60 and 90.

7. burn_in 
burn_in controls the dynamic adjustment of the learning rate at the start of training: during the first burn_in batches the learning rate is ramped up from 0 to the configured learning_rate along a polynomial curve (a warm-up), as this code from darknet's network.c shows: 

    if (batch_num < net.burn_in)
        return net.learning_rate * pow((float)batch_num / net.burn_in, net.power);
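As a worked example (assuming net.power is 4, which we believe is darknet's default): with learning_rate=0.001 and burn_in=1000, at batch_num = 500 the returned rate is 0.001 * (500/1000)^4 = 6.25e-5, and the full 0.001 is reached at batch 1000.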

8. Maximum number of iterations - max_batches 
The total number of weight updates; training stops once this many batches have been processed.

9. Learning rate adjustment policy - policy 
The policy used to adjust the learning rate. The available policies are: CONSTANT, STEP, EXP, POLY, STEPS, SIG, RANDOM.

10. Iterations at which the learning rate changes - steps 
The learning rate is adjusted according to batch_num. If steps=100,25000,35000, the learning rate changes at iterations 100, 25000 and 35000; this parameter corresponds to the STEPS policy above.

11. Learning rate change ratios - scales 
The ratios by which the current learning rate is multiplied at the corresponding steps; the factors accumulate multiplicatively, and the number of entries must match the number of entries in steps.
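Putting sections 6 to 11 together, a learning-rate schedule in the style of the public yolo-voc.cfg might look like the following (the concrete numbers are illustrative, not prescriptive):

    [net]
    learning_rate=0.001
    # warm-up: ramp the rate from 0 to 0.001 over the first 1000 batches
    burn_in=1000
    # stop training after 80200 weight updates
    max_batches=80200
    # multiply the rate by the matching scale each time a step is reached
    policy=steps
    steps=40000,60000
    scales=.1,.1

With this schedule the learning rate is 0.001 until batch 40000, 0.0001 between batches 40000 and 60000, and 0.00001 afterwards.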

12. Whether to apply batch normalization (BN) - batch_normalize

13. Activation function - activation 
The available activation functions include: logistic, loggy, relu, elu, relie, plse, hardtan, lhtan, linear, ramp, leaky, tanh, stair.
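As a small illustration of where sections 12 and 13 appear, a convolutional layer in a YOLO cfg is typically written like this (a generic example, not copied from any particular file):

    [convolutional]
    # normalize this layer's output with batch normalization
    batch_normalize=1
    filters=64
    size=3
    stride=1
    pad=1
    # leaky ReLU is the activation used by most YOLO convolutional layers
    activation=leaky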

14. [route] layer 
The route layer brings finer-grained features in from earlier in the network. 

15. [reorg] layer 
The reorg layer makes those features match the feature map size of a later layer. The final feature map is 13x13, while the feature map taken from earlier in the network is 26x26x512. The reorg layer rearranges the 26x26x512 feature map into a 13x13x2048 feature map so that it can be concatenated with the feature maps at 13x13 resolution.
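In a yolov2-style cfg this pass-through connection is expressed roughly as follows (the layer offsets here are illustrative; the actual file references layers by negative offsets relative to the current position):

    # jump back to an earlier, higher-resolution feature map
    [route]
    layers=-9

    # stride-2 reorg: 26x26x512 -> 13x13x2048
    [reorg]
    stride=2

    # concatenate the reorganized features with the 13x13 main-branch output
    [route]
    layers=-1,-4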

16. anchors 
anchors: the initial widths and heights of the predicted boxes, given as w,h pairs (the first value is w, the second is h), with num*2 values in total. The YOLOv2 author states that the anchors are obtained with k-means clustering, which essentially finds which box shapes occur most often; good anchors speed up convergence. If anchors are not set, the default is 0.5. (See the [region] example after section 19.)

17. jitter 
jitter suppresses overfitting by adding noise: training images are randomly jittered (cropped and offset) according to the jitter value.

18. rescore 
Can be understood as a switch: when it is non-zero, l.delta (the difference between the predicted value and the ground truth) is adjusted by re-scoring.

19. random (YOLO model training) 
When random is 1, multi-scale training is enabled and images of randomly chosen sizes are used for training; when it is 0, every training image is kept at the size given in the [net] section; 
i.e., it controls whether the size used for the final predictions is determined randomly during training.
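To show where the parameters of sections 16 to 19 live, here is a [region] layer sketched in the style of yolo-voc.cfg (the anchor values and other numbers are only illustrative):

    [region]
    # num anchor boxes, listed as w,h pairs (num*2 values in total)
    anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
    num=5
    classes=20
    # random crop/offset augmentation
    jitter=.3
    # re-score l.delta when non-zero
    rescore=1
    # enable multi-scale training
    random=1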

A few notes on terminology

(1) batch_size: the batch size. Deep learning training generally uses (mini-batch) SGD, i.e. batch_size samples are taken from the training set for each update; 
(2) iteration: 1 iteration means training once on batch_size samples; 
(3) epoch: 1 epoch means training once on all samples in the training set. For example, with 16,000 training images and batch_size=64, one epoch takes 16000/64 = 250 iterations.

The meaning of each value in the training log 
Region Avg IOU: the average IOU, i.e. the ratio of the intersection to the union of the predicted bounding box and the ground truth; this value should approach 1. 
Class: the probability assigned to the labeled object class; should approach 1. 
Obj: the confidence (objectness) predicted where an object is present; should approach 1. 
No Obj: the confidence predicted where no object is present; should become smaller and smaller, but not reach zero. 
Avg Recall: the fraction of ground-truth objects that are recalled; should approach 1. 
avg: the average loss; should approach 0.
