【SOT】MDNet (online update) code notes

Paper: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Code: https://github.com/hyeonseobnam/py-MDNet

Overall network structure: (figure omitted; an ASCII version appears under "network structure" below)

Highlights from the paper:

  • Multi-domain learning is done first. Given a test sequence, all the domain-specific binary classification branches used during training are discarded and a single new branch is constructed to compute target scores on that sequence. During tracking, this new classification layer and the fully-connected layers of the shared part are fine-tuned online to fit the new domain. The online update models the long-term and short-term appearance changes of the target separately, for robustness and adaptability respectively, and an effective and efficient hard example mining technique is introduced into the learning procedure.
  • We consider two complementary aspects of visual tracking, robustness and adaptability, via long-term and short-term updates. Long-term updates are performed periodically using positive samples collected over a long period, while short-term updates are performed whenever a potential tracking failure is detected, i.e. when the estimated target is classified as background by the short-term positive samples. In both cases we use the negative samples observed in the short term, since old negative samples are usually redundant or irrelevant to the current frame. Note that a single network is maintained during tracking, and which update is performed depends on how quickly the target's appearance changes.

Others: (Reference: https://blog.csdn.net/laizi_laizi/article/details/107475865 )

  • The network is divided into shared layers and domain-specific layers. The former uses the VGG-M architecture and its pre-trained parameters; the latter has K branches corresponding to the K training sequences, also called K domains. (Only the conv part is loaded from the pre-trained model; in the offline learning stage both conv and fc layers are trained.)
  • During training, each iteration draws 32 positive and 96 negative samples from one sequence to form a mini-batch and trains the corresponding fc6^{k} branch (i.e. within each epoch, every fc6 branch is trained once in turn).
  • Each fc6 branch outputs a two-dimensional score representing pos_score and neg_score for target and background respectively. The label for a positive sample is [0, 1] and for a negative sample [1, 0]. The loss is a binary softmax cross-entropy loss.
  • Training is very similar to R-CNN: the crops from the original image are each sent through the shared layers, so the computation is redundant and inefficient.
  • Such a classification scheme uses one sequence to train one domain. Without the online update part, I personally feel it would only track accurately on objects it has already seen; with online updates the results are much better, though the speed undoubtedly drops.

MDNet(model.py)

  • Includes the optimizer, network structure, loss function, and the accuracy and precision metrics

optimizer

  • a per-layer learning rate multiplier (fc6 trains at 10× the base rate of the other layers; see the sketch below)
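
A minimal sketch of how such a multiplier can be realized with PyTorch parameter groups. The 10× fc6 factor and the 0.0005 base rate follow the run_tracker.py notes later in this document; the momentum and weight-decay values are assumptions, and the parameter-name test is tied to the model sketch below rather than to py-MDNet's actual helper:

```python
import torch.optim as optim

def build_optimizer(model, base_lr=0.0005, fc6_mult=10,
                    momentum=0.9, weight_decay=0.0005):
    # Two parameter groups: the fc6 branches train at base_lr * fc6_mult,
    # everything else (conv1-3, fc4-5) at base_lr.
    fc6_params, rest = [], []
    for name, p in model.named_parameters():
        (fc6_params if 'fc6' in name or 'branches' in name else rest).append(p)
    return optim.SGD([{'params': rest, 'lr': base_lr},
                      {'params': fc6_params, 'lr': base_lr * fc6_mult}],
                     lr=base_lr, momentum=momentum, weight_decay=weight_decay)
```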

network structure

3x107x107
    | conv1
96x51x51
    | conv2
256x11x11
    | conv3
 512x3x3
    | fc4
   512
    | fc5
   512
    | fc6 ... (one branch per video sequence)
    2
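
A minimal PyTorch sketch matching the diagram; the VGG-M layer hyper-parameters (kernel sizes, strides, LRN, dropout) follow the pattern of py-MDNet rather than being a verbatim copy of model.py:

```python
import torch
import torch.nn as nn

class MDNet(nn.Module):
    def __init__(self, K=1):
        super().__init__()
        # Shared conv layers (VGG-M style): 3x107x107 -> 512x3x3
        self.conv = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),    # conv1
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),  # conv2
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True), # conv3
            nn.Flatten())
        # Shared fc layers: 512*3*3 -> 512 -> 512
        self.fc = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(512 * 3 * 3, 512), nn.ReLU(inplace=True),  # fc4
            nn.Dropout(0.5), nn.Linear(512, 512), nn.ReLU(inplace=True))           # fc5
        # fc6: one 2-way branch per training sequence (domain)
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Dropout(0.5), nn.Linear(512, 2)) for _ in range(K)])

    def forward(self, x, k=0, in_layer='conv'):
        # in_layer='fc' lets the tracker feed cached conv3 features to fc4-6.
        if in_layer == 'conv':
            x = self.conv(x)
        return self.branches[k](self.fc(x))
```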

loss function

  • Binary softmax cross-entropy loss (the BCELoss class in model.py)
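
A hedged sketch of that loss, assuming pos_score and neg_score are the raw two-dimensional fc6 outputs for the positive and negative samples of a mini-batch:

```python
import torch.nn.functional as F

def binary_softmax_ce(pos_score, neg_score):
    # Positive samples carry label [0, 1] (class 1 = target),
    # negative samples carry label [1, 0] (class 0 = background).
    pos_loss = -F.log_softmax(pos_score, dim=1)[:, 1]
    neg_loss = -F.log_softmax(neg_score, dim=1)[:, 0]
    return pos_loss.sum() + neg_loss.sum()
```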

Accuracy

  • the proportion of samples predicted correctly (both positives and negatives)

Precision

  • the fraction of positive samples among the top-scoring predictions (a sketch of both metrics follows)
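
A sketch of both metrics under the label convention above; the precision here follows a top-k formulation (fraction of true positives among the top-N target scores, N = number of positives), which may differ in detail from model.py:

```python
import torch

def accuracy(pos_score, neg_score):
    # A sample counts as correct if its higher-scoring class matches its label.
    pos_correct = (pos_score[:, 1] > pos_score[:, 0]).float().sum()
    neg_correct = (neg_score[:, 1] < neg_score[:, 0]).float().sum()
    return (pos_correct + neg_correct) / (pos_score.size(0) + neg_score.size(0))

def precision(pos_score, neg_score):
    # Rank all samples by target score; ideally the positives fill the top N.
    scores = torch.cat((pos_score[:, 1], neg_score[:, 1]), dim=0)
    topk = torch.topk(scores, pos_score.size(0))[1]
    return (topk < pos_score.size(0)).float().sum() / pos_score.size(0)
```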

pretrain

data_prov.py

  • input:
    1. img_list: a list of image paths, e.g. length 303 for one sequence
    2. gt: ndarray (303, 4), the ground-truth boxes
    3. opts: various parameters
  • __next__: selects 8 frames from the sequence and generates 4 pos_regions and 12 neg_regions per frame (see the sketch below)
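
A minimal sketch of the frame/region arithmetic, with a hypothetical sample_fn standing in for the actual crop-and-resize code in data_prov.py:

```python
import numpy as np

# Per __next__ call: 8 frames, 4 positive and 12 negative crops per frame,
# i.e. the 32 pos / 96 neg mini-batch consumed by train_mdnet.py.
N_FRAMES, N_POS, N_NEG = 8, 4, 12

def next_minibatch(img_list, gt, pointer, sample_fn):
    # sample_fn(image_path, box, n, positive) is a hypothetical cropper that
    # returns an (n, 3, 107, 107) array of regions near / away from the box.
    idx = np.arange(pointer, pointer + N_FRAMES) % len(img_list)
    pos = np.concatenate([sample_fn(img_list[i], gt[i], N_POS, True) for i in idx])
    neg = np.concatenate([sample_fn(img_list[i], gt[i], N_NEG, False) for i in idx])
    return pos, neg, (pointer + N_FRAMES) % len(img_list)
```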

train_mdnet.py

  1. Each iteration takes one sequence and generates 32 pos_regions and 96 neg_regions
  2. pos_regions are sent through the network to get pos_score; neg_regions to get neg_score
  3. The loss is computed from pos_score and neg_score (see the sketch below)
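
Putting the pieces together, a hedged sketch of one pretraining epoch over the K domains, reusing the MDNet and loss sketches above (the real train_mdnet.py also shuffles the domain order and handles logging and checkpointing):

```python
import torch

def train_one_epoch(model, datasets, criterion, optimizer, device='cpu'):
    # One epoch: every fc6 branch (domain) is trained once, in turn.
    model.train()
    for k, dataset in enumerate(datasets):        # one dataset per sequence
        pos_regions, neg_regions = next(dataset)  # 32 pos / 96 neg crops
        pos = torch.as_tensor(pos_regions, dtype=torch.float32, device=device)
        neg = torch.as_tensor(neg_regions, dtype=torch.float32, device=device)
        pos_score = model(pos, k=k)               # scores from branch fc6^k
        neg_score = model(neg, k=k)
        loss = criterion(pos_score, neg_score)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```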

pretraining summary

  • Generate positive and negative samples for each sequence --> Train the network with positive and negative samples --> Calculate classification loss
  • The essence is to train a classifier to distinguish whether the input is a positive sample or a negative sample (foreground or background)

tracking

gen_config.py

output:

  • img_list
  • init_bbox
  • gt
  • savefig_dir
  • args.display
  • result_path

run_tracker.py

  1. gen_config --> training data and basic information
    ------------- Initialization -------------
  2. Initialize variables result, result_bb, overlap
  3. Initialize the model MDNet and import the pre-trained weights of the shared layer
  4. Initialize the loss function BCELoss
  5. Initialize the optimizer:
    ① Optimizer for first-frame training (fc4-5: lr 0.0005, fc6: ×10)
    ② Optimizer for the online training phase (fc4-5: lr 0.001, fc6: ×10)
    ------------- First frame -------------
    // train fc4-6
  6. read in the first frame image
  7. Generate 500 positive samples and 5000 negative samples (the two are sampled in different ways)
  8. For positive and negative samples, extract the sample areas and send them to MDNet to obtain the output features of conv3
  9. Hard example mining: switch to eval mode, send the conv3 features of 1024 negative candidates through fc4-6, keep the 96 negatives with the highest fc6 target score (the paper mines 96 hard negatives out of 1024 candidates), then switch back to train mode; see the mining sketch after this list
  10. Each iteration feeds the conv3 features of 32 positive and 96 negative samples into fc4-6 to get the positive and negative scores
  11. Compute the loss and update the fc4-6 parameters (with gradient clipping)
    // train BBox regression
  12. Generate another 1000 samples for regression
  13. Send them through MDNet to get the conv3 output features
  14. Fit a regressor from the conv3 features to the bbox offsets
    ------------- Online update -------------
  15. Generate another 200 negative samples and extract their conv3 features; together with the positive-sample features from step 8, these initialize pos_feats_all and neg_feats_all. Then enter the tracking loop
  16. Draw 256 candidate samples around the target position predicted in the previous frame, compute their fc6 outputs, take the 5 candidates with the highest pos_score, and average their positions as the predicted target bbox; the corresponding mean score is target_score. If target_score > 0, then success = 1
  17. If success = 0, expand the search area
  18. If success = 1, BBox regression
  19. Save the original prediction result (result) and the regression result (result_bb)
  20. If success = 1, generate 50 positive and 200 negative samples, append their features to pos_feats_all and neg_feats_all, and drop the oldest frames when the buffers exceed their length limits
  21. If success = 0, perform a short-term update: retrain fc4-6 with the recent portion of pos_feats_all and neg_feats_all
  22. Every 10 frames, perform a long-term update: retrain the network with all of pos_feats_all and neg_feats_all
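
A sketch of the hard example mining in step 9, reusing the MDNet sketch from the network-structure section; the 96-out-of-1024 scheme follows the paper, and batch_test is just the chunk size for the forward pass:

```python
import torch

def hard_negative_mining(model, neg_feats, batch_neg=96, batch_test=256):
    # Score every cached negative conv3 feature through fc4-6 in eval mode,
    # then keep the batch_neg hardest ones (highest target-class score).
    model.eval()
    with torch.no_grad():
        scores = torch.cat([model(neg_feats[i:i + batch_test], in_layer='fc')[:, 1]
                            for i in range(0, neg_feats.size(0), batch_test)])
    model.train()
    top_idx = scores.topk(min(batch_neg, scores.size(0)))[1]
    return neg_feats[top_idx]
```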

Summary

Pre-training phase: train the entire network.

Tracking phase: freeze the first three convolutional layers; use the information in the first frame to train fc4-5, the randomly initialized fc6, and the bbox regression network. For subsequent frames the regression network is fixed. Candidate samples are generated around the position predicted in the previous frame, and the average of the 5 highest-scoring candidates is taken as the prediction. If the score is greater than zero, the prediction is refined by the regression network, and a fixed number of positive and negative sample features around the predicted bbox are added to the short-term and long-term feature sets. If the score is less than or equal to zero, the search area for candidates is enlarged (for the next frame) and the features in the short-term set are used to update the model. In addition, the features in the long-term set are used to update the model every 10 frames.
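
A sketch of the candidate-scoring step from the summary above, again reusing the earlier MDNet sketch and assuming the candidates' conv3 features have already been extracted:

```python
import torch

def estimate_target(model, candidates, candidate_feats):
    # candidates: (256, 4) tensor of boxes around the previous prediction;
    # candidate_feats: their cached conv3 features.
    model.eval()
    with torch.no_grad():
        scores = model(candidate_feats, in_layer='fc')[:, 1]  # pos_score
    top5 = scores.topk(5)[1]
    target_bbox = candidates[top5].mean(dim=0)   # average of the best 5 boxes
    target_score = scores[top5].mean()
    success = target_score.item() > 0            # the success flag in step 16
    return target_bbox, target_score, success
```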

Origin: https://blog.csdn.net/zylooooooooong/article/details/123228842