Paper: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Code: https://github.com/hyeonseobnam/py-MDNet
Overall network structure:
Highlights from the paper:
- Do multi-domain learning first. After this pre-training, given a test sequence, all the binary-classification branches used in the training phase are discarded and a new branch is constructed to compute the target score on the test sequence. During tracking, the new classification layer and the fully-connected layers of the shared part are fine-tuned online to adapt to the new domain. The online update models the long-term and short-term appearance changes of the target separately, for robustness and adaptability, and an effective and efficient hard negative mining technique is used during learning.
- We consider two complementary aspects of visual tracking, robustness and adaptability, via long-term and short-term updates. Long-term updates are performed periodically, using positive samples collected over a long period; short-term updates are performed whenever a potential tracking failure is detected, i.e. when the estimated target is classified as background, using positive samples from the short term. In both cases we use only negative samples observed in the short term, since old negative samples are usually redundant or irrelevant to the current frame. Note that a single network is maintained during tracking, and which update runs depends on how quickly the target's appearance changes.
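The long/short-term bookkeeping described above can be sketched as follows; the window sizes and the `train_fn` callback are illustrative placeholders rather than the exact py-MDNet values:

```python
from collections import deque

# Hypothetical window sizes for the long-term and short-term sample sets.
LONG_TERM, SHORT_TERM = 100, 20

pos_feats_all = deque(maxlen=LONG_TERM)   # positive features: long window
neg_feats_all = deque(maxlen=SHORT_TERM)  # negative features: short window

def update_model(frame_idx, success, train_fn):
    """Decide which online update to run on this frame.

    train_fn(pos, neg) is a placeholder for one fine-tuning pass.
    """
    if not success:
        # Short-term update on a tracking failure: recent positives only.
        recent_pos = list(pos_feats_all)[-SHORT_TERM:]
        train_fn(recent_pos, list(neg_feats_all))
    elif frame_idx % 10 == 0:
        # Long-term update every 10 frames: all stored positives.
        train_fn(list(pos_feats_all), list(neg_feats_all))
```

Both updates draw negatives from the same short window, matching the note that old negatives are redundant or irrelevant.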
Others: (Reference: https://blog.csdn.net/laizi_laizi/article/details/107475865 )
- The network is divided into shared layers and domain-specific layers. The former uses the VGG-M architecture with its pre-trained parameters; the latter has K branches corresponding to the K training sequences, also called K domains. Only the conv layers are initialized from the pre-trained model; during the offline learning stage both the conv and fc layers are trained.
- During training, each iteration extracts 32 positive samples and 96 negative samples from one sequence to form a mini-batch and trains the corresponding fc6^{k} branch (i.e., within each epoch, every fc6 branch is trained once in turn)
- Each fc6 branch outputs a two-dimensional score, representing the neg_score (background) and pos_score (target) respectively. The label for a positive sample is [0, 1] and for a negative sample [1, 0]. The loss is a binary softmax cross-entropy
- Training is very similar to R-CNN: crops of the original image are each sent through the shared layers separately, so the computation is redundant and inefficient
- This classification scheme trains one domain per sequence. Without the online-update part, I personally feel the tracker would only be accurate on objects it has already seen. With online updates it performs much better, though the speed inevitably drops
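The binary softmax cross-entropy mentioned above, over two-dimensional [neg_score, pos_score] outputs, can be written as a small numpy sketch:

```python
import numpy as np

def binary_softmax_ce(scores, is_positive):
    """Binary softmax cross-entropy over 2-dim scores [neg_score, pos_score].

    scores: (N, 2) array; is_positive: True for positive samples
    (label [0, 1]), False for negatives (label [1, 0]).
    """
    # log-softmax with the usual max-shift for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    col = 1 if is_positive else 0
    return -log_prob[:, col].mean()
```

With a large pos_score on a positive sample the loss approaches zero; with equal scores it is ln 2, as expected for a two-way softmax.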
MDNet(model.py)
- Includes the optimizer, network structure, loss function, accuracy, and precision
optimizer
- learning rate multiplier
network structure
3x107x107
| conv1
96x51x51
| conv2
256x11x11
| conv3
512x3x3
| fc4
512
| fc5
512
| fc6 ...(one branch per video sequence)
2
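The spatial sizes in the diagram (107 → 51 → 11 → 3) can be checked with the standard conv output-size formula; the kernel/stride values below follow the VGG-M style layers and are stated as assumptions:

```python
def conv_out(size, kernel, stride, padding=0):
    """Output spatial size of a conv/pool layer: floor((size - k + 2p) / s) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# Assumed VGG-M style kernel/stride values for MDNet's shared conv stack.
s = 107
s = conv_out(s, kernel=7, stride=2)   # conv1 -> 51
s = conv_out(s, kernel=3, stride=2)   # max-pool -> 25
s = conv_out(s, kernel=5, stride=2)   # conv2 -> 11
s = conv_out(s, kernel=3, stride=2)   # max-pool -> 5
s = conv_out(s, kernel=3, stride=1)   # conv3 -> 3
print(s)  # 3
```

Under these assumptions the sizes match the diagram: 51 after conv1, 11 after conv2, and 3 after conv3.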
loss function
- Binary cross-entropy loss
Accuracy
- proportion of correct predictions
Precision
- proportion of positive samples among the top-scored predictions
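A minimal numpy sketch of how such accuracy and precision metrics can be computed from the two-dimensional [neg_score, pos_score] outputs (the exact py-MDNet definitions may differ slightly):

```python
import numpy as np

def accuracy(pos_score, neg_score):
    """Fraction of samples classified correctly: a positive sample is correct
    when its pos_score beats its neg_score, and vice versa for negatives.

    pos_score: (N, 2), neg_score: (M, 2), each row [neg_score, pos_score].
    """
    pos_correct = (pos_score[:, 1] > pos_score[:, 0]).sum()
    neg_correct = (neg_score[:, 1] < neg_score[:, 0]).sum()
    return (pos_correct + neg_correct) / (len(pos_score) + len(neg_score))

def precision(pos_score, neg_score):
    """Fraction of positives among the top-N scored samples, where N is the
    number of positive samples."""
    scores = np.concatenate([pos_score[:, 1], neg_score[:, 1]])
    topk = np.argsort(-scores)[:len(pos_score)]
    # Positives occupy the first len(pos_score) indices after concatenation.
    return (topk < len(pos_score)).mean()
```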
pretrain
data_prov.py
- input:
- img_list: list(303), e.g. ['','','',…], image file paths
- gt: ndarray:(303, 4), ground-truth bounding boxes
- opts, various parameters
- __next__: select 8 frames from a sequence and generate 4 pos_regions and 12 neg_regions per frame
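The `__next__` sampling logic above can be sketched as follows; the region-cropping step is elided and the function only plans the counts:

```python
import random

def sample_minibatch(n_frames, frames_per_batch=8,
                     pos_per_frame=4, neg_per_frame=12):
    """Pick 8 frames from a sequence and plan 4 positive / 12 negative
    regions per frame, giving the 32/96 mini-batch described above.

    Returns (frame_ids, n_pos, n_neg); actual region cropping is elided.
    """
    frame_ids = random.sample(range(n_frames), min(frames_per_batch, n_frames))
    n_pos = len(frame_ids) * pos_per_frame
    n_neg = len(frame_ids) * neg_per_frame
    return frame_ids, n_pos, n_neg
```

For a 303-frame sequence this yields 8 distinct frame indices and the 32/96 split used by train_mdnet.py.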
train_mdnet.py
- Each sequence yields 32 pos_regions and 96 neg_regions per iteration
- pos_regions are sent through the network to get pos_score; neg_regions are sent through to get neg_score
- The loss is computed from pos_score and neg_score
pretraining summary
- Generate positive and negative samples for each sequence --> Train the network with positive and negative samples --> Calculate classification loss
- The essence is to train a classifier to distinguish whether the input is a positive sample or a negative sample (foreground or background)
tracking
gen_config.py
output:
- img_list
- init_bbox
- gt
- savefig_dir
- args.display
- result_path
run_tracker.py
- gen_config --> training data and basic information
------------- Initialization -------------
- Initialize variables result, result_bb, overlap
- Initialize the model MDNet and import the pre-trained weights of the shared layer
- Initialize the loss function BCELoss
- Initialize the optimizer:
① Optimizer for first-frame training (fc4-5: 0.0005, fc6: ×10)
② Optimizer for the online training phase (fc4-5: 0.001, fc6: ×10)
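Per-layer learning rates such as fc6's ×10 multiplier are typically built as optimizer parameter groups; a framework-agnostic sketch (the names and dict structure are illustrative):

```python
def build_param_groups(named_params, base_lr, lr_mult):
    """Group parameters with per-layer learning rates.

    named_params: iterable of (name, param) pairs.
    lr_mult: dict mapping a name substring to a multiplier, e.g.
    {'fc6': 10} gives fc6 ten times the base learning rate
    (first-frame training: fc4-5 at 0.0005, fc6 at 0.005).
    Returns a list of optimizer-style {'params', 'lr'} groups.
    """
    groups = []
    for name, param in named_params:
        mult = next((m for key, m in lr_mult.items() if key in name), 1.0)
        groups.append({"params": [param], "lr": base_lr * mult})
    return groups
```

In PyTorch such a list can be passed directly to an optimizer constructor, e.g. `optim.SGD(groups, lr=base_lr)`.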
------------- First frame -------------
// train fc4-6
- Read in the first-frame image
- Generate 500 positive samples and 5000 negative samples (the two are generated in different ways)
- For positive and negative samples, extract the sample areas and send them to MDNet to obtain the output features of conv3
- Hard negative mining: switch to eval mode, send the conv3 features of 1024 negative samples through fc4-6, select the 256 negatives with the highest fc6 output score, then switch back to train mode
- Each iteration inputs 32 positive sample conv3 features and 96 negative sample conv3 features to get positive and negative sample scores
- Compute the loss and update the fc4-6 parameters (with gradient clipping)
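The hard negative mining step can be sketched as selecting the top-scoring negatives; `neg_scores` here stands for the fc6 pos_score of each negative candidate (a hypothetical input, since the real code runs the features through fc4-6 first):

```python
import numpy as np

def hard_negative_mining(neg_scores, batch_size=1024, n_hard=256):
    """Select the hardest negatives: out of each batch of 1024 candidates,
    keep the 256 whose positive score is highest, i.e. the negatives the
    classifier is most confused about.

    neg_scores: (N,) array of pos_score values for negative candidates.
    Returns the indices of the hardest negatives within the first batch.
    """
    order = np.argsort(-neg_scores[:batch_size])  # descending by score
    return order[:n_hard]
```

These indices then select the conv3 features actually used in the fc4-6 update.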
// train BBox regression
- Regenerate 1000 samples for regression
- Send them through MDNet to obtain conv3 output features
- Fit a regressor from the conv3 features to the bbox regression offsets
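The bbox regression fit is R-CNN style; below is a closed-form ridge-regression sketch (the regularization strength `lam` is an assumption, not the repository's value):

```python
import numpy as np

def fit_bbox_regressor(feats, targets, lam=1000.0):
    """Ridge-regression fit of bbox offsets from conv3 features.

    feats: (N, D) sample features; targets: (N, 4) offsets (dx, dy, dw, dh).
    Returns weights W of shape (D, 4); predicted offsets = feats @ W.
    """
    D = feats.shape[1]
    # Closed-form ridge solution: (X^T X + lam I) W = X^T y
    A = feats.T @ feats + lam * np.eye(D)
    return np.linalg.solve(A, feats.T @ targets)
```

At test time the regressor stays fixed and only refines the predicted bbox of high-confidence frames.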
------------- Online update -------------
- Regenerate 200 negative samples and obtain their conv3 features; together with the positive-sample features collected earlier, build pos_feats_all and neg_feats_all, then enter the tracking loop
- Draw 256 candidate samples around the target location predicted in the previous frame, obtain the fc6 outputs, select the 5 candidates with the highest pos_score, and average their positions as the predicted target bbox; the corresponding mean score is target_score. If target_score > 0, then success = 1
- If success = 0, expand the search area
- If success = 1, BBox regression
- Save the original prediction result (result) and the regression result (result_bb)
- If success = 1, generate 50 positive samples and 200 negative samples, append their features to pos_feats_all and neg_feats_all, and drop the oldest entries if the lengths exceed their limits
- If success = 0, perform a short-term update: train the network with the most recent portions of pos_feats_all and neg_feats_all
- Every 10 frames, perform a long-term update: train the network with pos_feats_all and neg_feats_all
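The target-estimation step (top-5 average and the success test) can be sketched as:

```python
import numpy as np

def estimate_target(candidates, pos_scores, top_k=5):
    """Average the top-k highest-scoring candidate boxes to get the new
    target bbox; the frame counts as a success when the mean score > 0.

    candidates: (N, 4) boxes [x, y, w, h]; pos_scores: (N,) fc6 pos scores.
    Returns (target_bbox, target_score, success).
    """
    top = np.argsort(-pos_scores)[:top_k]      # indices of the best k
    target_bbox = candidates[top].mean(axis=0)  # averaged position
    target_score = pos_scores[top].mean()
    return target_bbox, target_score, target_score > 0
```

On failure (score ≤ 0) the search area for the next frame's candidates is expanded and a short-term update is triggered.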
Summary
Pre-training phase: train the entire network.
Tracking phase: fix the first three convolutional layers; use the information of the first frame to train fc4-5, the randomly initialized fc6, and the bbox regression network. For subsequent frames, fix the bbox regressor, generate candidate samples around the position predicted in the previous frame, and take the mean of the 5 highest-scoring candidates as the prediction. If the score is greater than zero, pass the prediction through the regression network, and collect a fixed number of positive and negative sample features around the predicted bbox into the short-term and long-term feature sets. If the score is less than or equal to zero, expand the candidate search area (for the next frame) and update the model with the features in the short-term feature set. In addition, the features in the long-term feature set are used to update the model every 10 frames.