py-MDNet detailed explanation with code (2): online tracking


For the training part, see the previous post: py-MDNet detailed explanation with code (1): training.
This post explains the tracking part, which is more involved than the training phase.
Reading only the paper, or skimming a blog, leaves many questions about online tracking unanswered, so a few key points are laid out below in question-and-answer form.

Overall procedure

In fact, the tracking stage mainly carries out the four green points in the flow chart above, and the overall process is as described in the paper:

  • Input: the trained network model and the target state of the first frame
  • Output: the predicted target states of the subsequent frames

MDNet online tracking overall procedure (algorithm figure)
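Before going through the four points one by one, here is a hedged outline of the whole loop, my own summary of run_tracker.py rather than runnable code:

# Outline of online tracking in run_tracker.py (summary, not runnable code):
#
#  frame 1:  fine-tune fc4-fc6 with 500 pos / 5000 neg samples
#            train a Ridge bbox regressor on 1000 samples near the target
#  frame t:  draw 256 candidates around the previous target_bbox
#            success = (mean of top-5 pos_score) > 0
#            if success: target_bbox = mean of top-5 boxes, refined by bbreg;
#                        collect 50 pos / 200 neg features for later updates
#            else:       expand the search area; run a short-term update
#            every 10 frames: run a long-term update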

The four points are explained below, following the order of the overall procedure:

Initial test frame

Training

After loading $w_1$-$w_5$, the tracker freezes conv1-conv3 and fine-tunes the fully connected layers from the training phase together with a new domain-specific layer, as the paper says:

the multiple branches of domain-specific layers ($fc6^{1}$-$fc6^{K}$) are replaced with a single branch (fc6) for a new test sequence. Then we fine-tune the new domain-specific layer and the fully connected layers in the shared network online at the same time
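In code this amounts to making only the fc parameters learnable. py-MDNet does it through model.set_learnable_params(opts['ft_layers']); the loop below is an equivalent sketch of the idea, not the repo's exact code:

# Sketch: freeze conv1-conv3, fine-tune only fc4/fc5 and the new fc6
for name, param in model.named_parameters():
    param.requires_grad = name.startswith('fc')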

As mentioned in [line 3-4] of the algorithm above, $S_1^{+}=500$ positive samples and $S_1^{-}=5000$ negative samples are used to train the network and update $w_3, w_4, w_5$. In the code, this is in run_tracker.py:

# Draw pos/neg samples
pos_examples = SampleGenerator('gaussian', image.size, opts['trans_pos'], opts['scale_pos'])(
                    target_bbox, opts['n_pos_init'], opts['overlap_pos_init'])
# opts['n_pos_init'] = 500 positive samples
neg_examples = np.concatenate([
                SampleGenerator('uniform', image.size, opts['trans_neg_init'], opts['scale_neg_init'])(
                    target_bbox, int(opts['n_neg_init'] * 0.5), opts['overlap_neg_init']),
                SampleGenerator('whole', image.size)(
                    target_bbox, int(opts['n_neg_init'] * 0.5), opts['overlap_neg_init'])])
neg_examples = np.random.permutation(neg_examples)
# opts['n_neg_init'] = 5000 negative samples
# Extract pos/neg features
pos_feats = forward_samples(model, image, pos_examples)
neg_feats = forward_samples(model, image, neg_examples)

# Initial training including hard negative mining
train(model, criterion, init_optimizer, pos_feats, neg_feats, opts['maxiter_init'])
del init_optimizer, neg_feats
torch.cuda.empty_cache()
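Here forward_samples extracts the features of every sample. A minimal sketch of its role is below; crop_and_resize is a hypothetical helper (the real implementation batches crops through a RegionExtractor):

import torch

def forward_samples_sketch(model, image, samples, batch_size=256, out_layer='conv3'):
    # Crop each sample bbox from the frame, run the crops through the
    # network up to out_layer, and stack the resulting features.
    model.eval()
    feats = []
    for i in range(0, len(samples), batch_size):
        regions = crop_and_resize(image, samples[i:i + batch_size])  # hypothetical helper
        with torch.no_grad():
            feats.append(model(regions, out_layer=out_layer))
    return torch.cat(feats, 0)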

BBox regression

This is also done in the first frame. Bounding-box regression improves target localization accuracy, because the randomly generated candidates do not necessarily fit the target tightly:

we train a simple linear regression model to predict the precise target location using conv3 features of the samples near the target location.
For bounding-box regression, we use 1000 training examples with the same parameters as [13]

The code uses from sklearn.linear_model import Ridge to build the linear regression model; see the figure below for details:
bbox regression illustration
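A minimal sketch of what such a regressor does, assuming boxes in [x, y, w, h] format and the standard R-CNN offset parameterization (my reconstruction, not the repo's exact BBRegressor):

import numpy as np
from sklearn.linear_model import Ridge

class RidgeBBRegressor:
    def __init__(self, alpha=1000):
        self.model = Ridge(alpha=alpha)

    def train(self, feats, boxes, gt):
        # Regression targets: offsets from each sample box to the ground
        # truth (dx, dy normalized by box size; dw, dh in log scale).
        dx = (gt[0] + gt[2] / 2 - (boxes[:, 0] + boxes[:, 2] / 2)) / boxes[:, 2]
        dy = (gt[1] + gt[3] / 2 - (boxes[:, 1] + boxes[:, 3] / 2)) / boxes[:, 3]
        dw = np.log(gt[2] / boxes[:, 2])
        dh = np.log(gt[3] / boxes[:, 3])
        self.model.fit(feats, np.stack([dx, dy, dw, dh], axis=1))

    def predict(self, feats, boxes):
        # Apply predicted offsets to shift and rescale each box.
        d = self.model.predict(feats)
        cx = boxes[:, 0] + boxes[:, 2] / 2 + boxes[:, 2] * d[:, 0]
        cy = boxes[:, 1] + boxes[:, 3] / 2 + boxes[:, 3] * d[:, 1]
        w = boxes[:, 2] * np.exp(d[:, 2])
        h = boxes[:, 3] * np.exp(d[:, 3])
        return np.stack([cx - w / 2, cy - h / 2, w, h], axis=1)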

Of course, the bounding box is regressed only on success; as the paper puts it:

In the subsequent frames, we adjust the target locations estimated from Eq. (1) using the regression model if the estimated targets are reliable (i.e. $f^{+}(x^{*}) > 0.5$).

The reliability test in the code differs slightly from the paper: a frame is judged successful when the mean of the top-5 pos_score values is greater than 0. This is reflected in the code:
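For reference, the per-frame scoring that produces top_idx (used in the snippet below) looks roughly like this in run_tracker.py, lightly simplified:

# Score 256 candidates; success = mean of the top-5 pos_scores > 0
samples = sample_generator(target_bbox, opts['n_samples'])              # 256 candidates
sample_scores = forward_samples(model, image, samples, out_layer='fc6')
top_scores, top_idx = sample_scores[:, 1].topk(5)
top_idx = top_idx.cpu().numpy()
target_score = top_scores.mean()
target_bbox = samples[top_idx].mean(axis=0)   # average the top-5 boxes
success = target_score > 0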

# Train bbox regressor
bbreg_examples = SampleGenerator('uniform', image.size, opts['trans_bbreg'], opts['scale_bbreg'], opts['aspect_bbreg'])(
    target_bbox, opts['n_bbreg'], opts['overlap_bbreg'])
bbreg_feats = forward_samples(model, image, bbreg_examples)
bbreg = BBRegressor(image.size)
bbreg.train(bbreg_feats, bbreg_examples, target_bbox)
# bbreg_feats [1000, 512*3*3]
# bbreg_examples [1000, 4]
# target_bbox [4,]
########### at inference time, call the predict method #########
bbreg_samples = samples[top_idx]
if top_idx.shape[0] == 1:
    bbreg_samples = bbreg_samples[None,:]
bbreg_feats = forward_samples(model, image, bbreg_samples)
bbreg_samples = bbreg.predict(bbreg_feats, bbreg_samples)
bbreg_bbox = bbreg_samples.mean(axis=0)

Subsequent frame updates

With the first frame's prior information in place, the tracker then loops over the remaining frames of the sequence to predict the target state, which is what [line 6-18] of the algorithm does.
First, 256 candidates are drawn around the target location predicted in the previous frame and sent through the network for scoring; the 5 candidates with the highest pos_score are selected and their positions averaged to give the predicted target bbox. Then there are two cases:

  • success:
    • 1. Perform bounding-box regression
    • 2. Draw $S_t^{+}=50$ positive samples and $S_t^{-}=200$ negative samples around the predicted target bbox, and append their features to the short-term feature set $\mathcal{T}_s$ and the long-term feature set $\mathcal{T}_l$ for later model updates (see the generator setup right after this list; if a set exceeds its size limit, the oldest features are deleted)
  • not success:
    • 1. Expand the search area (achieved by enlarging the offset distance of the sample centers)
    • 2. Update the model with the short-term feature sets $S_{v \in \mathcal{T}_s}^{+}$ and $S_{v \in \mathcal{T}_s}^{-}$, to cope with target appearance changes
  • Perform a long-term update every 10 frames (it is called long-term because it uses the long-term positive features $S_{v \in \mathcal{T}_l}^{+}$ together with $S_{v \in \mathcal{T}_s}^{-}$)
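One note before the code: pos_generator and neg_generator used in the collection step are SampleGenerators configured once after the first frame, roughly like this in run_tracker.py (shown for context, lightly simplified):

# Sample generators for per-frame candidates and update-sample collection
sample_generator = SampleGenerator('gaussian', image.size, opts['trans'], opts['scale'])
pos_generator = SampleGenerator('gaussian', image.size, opts['trans_pos'], opts['scale_pos'])
neg_generator = SampleGenerator('uniform', image.size, opts['trans_neg'], opts['scale_neg'])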

Reflected in the code:

# Bbox regression
if success:
    bbreg_samples = samples[top_idx]
    if top_idx.shape[0] == 1:
        bbreg_samples = bbreg_samples[None,:]
    bbreg_feats = forward_samples(model, image, bbreg_samples)
    bbreg_samples = bbreg.predict(bbreg_feats, bbreg_samples)
    bbreg_bbox = bbreg_samples.mean(axis=0)
else:
    bbreg_bbox = target_bbox
    

# Data collect
if success:
    pos_examples = pos_generator(target_bbox, opts['n_pos_update'], opts['overlap_pos_update'])
    pos_feats = forward_samples(model, image, pos_examples)
    pos_feats_all.append(pos_feats)
    if len(pos_feats_all) > opts['n_frames_long']:
        del pos_feats_all[0]

    neg_examples = neg_generator(target_bbox, opts['n_neg_update'], opts['overlap_neg_update'])
    neg_feats = forward_samples(model, image, neg_examples)
    neg_feats_all.append(neg_feats)
    if len(neg_feats_all) > opts['n_frames_short']:
        del neg_feats_all[0]
# Expand search area at failure
if success:
    sample_generator.set_trans(opts['trans'])
else:
    sample_generator.expand_trans(opts['trans_limit'])
    
# Short term update
if not success:
    nframes = min(opts['n_frames_short'], len(pos_feats_all))
    pos_data = torch.cat(pos_feats_all[-nframes:], 0)
    neg_data = torch.cat(neg_feats_all, 0)
    train(model, criterion, update_optimizer, pos_data, neg_data, opts['maxiter_update'])
# Long term update
elif i % opts['long_interval'] == 0:
    pos_data = torch.cat(pos_feats_all, 0)
    neg_data = torch.cat(neg_feats_all, 0)
    train(model, criterion, update_optimizer, pos_data, neg_data, opts['maxiter_update'])
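The update_optimizer used above is plain SGD over the fc layers, usually with a larger learning rate on fc6. py-MDNet builds it with a set_optimizer helper; the sketch below is an assumed equivalent, and the hyperparameter values are defaults I recall rather than guarantees:

import torch.optim as optim

def build_update_optimizer(model, lr_base=0.001, lr_mult_fc6=10,
                           momentum=0.9, w_decay=0.0005):
    # One param group per learnable parameter; fc6 gets a 10x learning rate.
    param_groups = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        lr = lr_base * (lr_mult_fc6 if name.startswith('fc6') else 1)
        param_groups.append({'params': [p], 'lr': lr})
    return optim.SGD(param_groups, lr=lr_base, momentum=momentum,
                     weight_decay=w_decay)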

Hard negative mining

Hard negative mining trains the classifier with difficult negative samples so that it becomes better at discriminating similar-looking objects. It is carried out during the initial training on the first frame: the 96 negative samples with the highest pos_score are selected out of 1024 candidate negatives and fed to the classifier as hard negatives. Below is a visualization of my training process:
(hard negative mining visualization)
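Inside train(), the selection step can be sketched like this (hedged: variable names follow py-MDNet's trainer loosely, and in_layer='fc4' assumes the negative features were precomputed up to conv3):

# Hard minibatch mining: score 1024 candidate negatives with the
# current model and keep the 96 with the highest pos_score.
model.eval()
with torch.no_grad():
    scores = model(batch_neg_cand_feats, in_layer='fc4')  # (1024, 2) fc6 scores
model.train()
_, top_idx = scores[:, 1].topk(batch_neg)                 # batch_neg = 96
batch_neg_feats = batch_neg_cand_feats[top_idx]           # hard negatives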

Results on OTB100

The following is the result of running the Python version of the OTB benchmark with the author's pre-trained model mdnet_vot-otb.pth; it is very close to the figures the author reports:
MDNet OTB100 result graph

Demo

The GIFs are too big to put on the blog, so I uploaded them to Gitee:
Blurcar1: https://gitee.com/laisimiao/picBed/raw/master/image/BlurCar1.gif
Biker: https://gitee.com/laisimiao/picBed/raw/master/image/Biker.gif

Bonus

Video explanation: https://www.bilibili.com/video/BV1qt4y1X7SQ/

References

  1. The algorithm flow chart comes from this blog: MDNet of Target Tracking (1)

Origin: blog.csdn.net/laizi_laizi/article/details/107488362