py-MDNet explained in detail with code (2): online tracking
For the training part, see the previous post: py-MDNet code refinement (1)
This post mainly explains the tracking part, which is more complicated than the training phase.
If you have not read the online tracking section of the paper, or have only skimmed a blog post about it, many questions will remain unanswered, so the key points are laid out below:
Overall procedure
In fact, the tracking stage mainly carries out the four green points in the picture above, and the overall process is as described in the paper:
- Input: the trained network model and the target state of the first frame
- Output: the predicted target state of the subsequent frame
The following goes through these four points in the order of the overall process:
Initial test frame
training
After the pretrained weights $w_{1}$ to $w_{5}$ are loaded, conv1, conv2 and conv3 are frozen, and only the fully connected layers and the new domain-specific layer are updated during training on the first frame, as the paper says:
the multiple branches of domain-specific layers ($fc6^{1}$-$fc6^{K}$) are replaced with a single branch (fc6) for a new test sequence. Then we fine-tune the new domain-specific layer and the fully connected layers in the shared network online at the same time
As mentioned in [line 3-4] above, $S_{1}^{+}=500$ positive samples and $S_{1}^{-}=5000$ negative samples are used to train the network and update its weights $w_{4}, w_{5}, w_{6}$ (with conv1 to conv3 frozen, these are the fully connected layers and the new fc6). In the code this is in run_tracker.py:
# Draw pos/neg samples
pos_examples = SampleGenerator('gaussian', image.size, opts['trans_pos'], opts['scale_pos'])(
    target_bbox, opts['n_pos_init'], opts['overlap_pos_init'])
# 500
neg_examples = np.concatenate([
    SampleGenerator('uniform', image.size, opts['trans_neg_init'], opts['scale_neg_init'])(
        target_bbox, int(opts['n_neg_init'] * 0.5), opts['overlap_neg_init']),
    SampleGenerator('whole', image.size)(
        target_bbox, int(opts['n_neg_init'] * 0.5), opts['overlap_neg_init'])])
neg_examples = np.random.permutation(neg_examples)
# 5000
# Extract pos/neg features
pos_feats = forward_samples(model, image, pos_examples)
neg_feats = forward_samples(model, image, neg_examples)
# Initial training including hard negative mining
train(model, criterion, init_optimizer, pos_feats, neg_feats, opts['maxiter_init'])
del init_optimizer, neg_feats
torch.cuda.empty_cache()
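The `overlap_*` options above restrict how much a sampled box may overlap the target box (intersection over union, IoU). A minimal sketch of that filtering is given here, assuming boxes in `[x, y, w, h]` form. The helper names `iou` and `label_sample` and the thresholds 0.7 and 0.5 are illustrative assumptions, not values read from the repository's options:

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_sample(sample, target, pos_thresh=0.7, neg_thresh=0.5):
    """Return 'pos', 'neg', or None for an ambiguous sample that is discarded.

    pos_thresh and neg_thresh are assumed values for illustration.
    """
    overlap = iou(sample, target)
    if overlap >= pos_thresh:
        return 'pos'
    if overlap <= neg_thresh:
        return 'neg'
    return None
```

With a target of `[10, 10, 20, 20]`, a box shifted by 2 pixels is labelled positive, a box far away is labelled negative, and a box shifted by 6 pixels falls between the two thresholds and is dropped.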
BBox regression
This is also done in the first frame. Bounding box regression is used to improve target localization accuracy, because randomly generated candidates do not necessarily fit the target tightly:
we train a simple linear regression model to predict the precise target location using conv3 features of the samples near the target location.
For bounding-box regression, we use 1000 training examples with the same parameters as [13]
The code uses Ridge (from sklearn.linear_model import Ridge) to build the linear model; see the following picture for details:
Of course, bounding box regression is only needed when tracking is judged a success. The paper says:
In the subsequent frames, we adjust the target locations estimated from Eq. (1) using the regression model if the estimated targets are reliable (i.e. $f^{+}(x^{*}) > 0.5$).
The reliability condition in the code may differ from the paper: it checks whether the mean of the top 5 values of pos_score is greater than 0, and if so, the frame is considered a success. This is reflected in the code:
# Train bbox regressor
bbreg_examples = SampleGenerator('uniform', image.size, opts['trans_bbreg'], opts['scale_bbreg'], opts['aspect_bbreg'])(
    target_bbox, opts['n_bbreg'], opts['overlap_bbreg'])
bbreg_feats = forward_samples(model, image, bbreg_examples)
bbreg = BBRegressor(image.size)
bbreg.train(bbreg_feats, bbreg_examples, target_bbox)
# bbreg_feats [1000, 512*3*3]
# bbreg_examples [1000, 4]
# target_bbox [4,]
########### At use time, call the predict method #########
bbreg_samples = samples[top_idx]
if top_idx.shape[0] == 1:
    bbreg_samples = bbreg_samples[None,:]
bbreg_feats = forward_samples(model, image, bbreg_samples)
bbreg_samples = bbreg.predict(bbreg_feats, bbreg_samples)
bbreg_bbox = bbreg_samples.mean(axis=0)
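To make clearer what a linear bounding-box regressor learns, here is a minimal sketch of the R-CNN-style offset parameterization. That BBRegressor uses exactly this form is an assumption here; the helper names `encode_offsets` and `decode_offsets` are hypothetical and boxes are taken to be `[x, y, w, h]`. The regressor would be fitted to predict the encoded offsets from the conv3 features, and at tracking time the predicted offsets are decoded back into a refined box:

```python
import math


def encode_offsets(sample, target):
    """Regression targets that map a sample box onto the target box."""
    sx, sy, sw, sh = sample
    tx, ty, tw, th = target
    # Shift of the box centre, normalised by the sample size.
    dx = ((tx + tw / 2) - (sx + sw / 2)) / sw
    dy = ((ty + th / 2) - (sy + sh / 2)) / sh
    # Log-ratio of widths and heights.
    dw = math.log(tw / sw)
    dh = math.log(th / sh)
    return dx, dy, dw, dh


def decode_offsets(sample, offsets):
    """Apply (predicted) offsets to a sample box and return the refined box."""
    sx, sy, sw, sh = sample
    dx, dy, dw, dh = offsets
    w = sw * math.exp(dw)
    h = sh * math.exp(dh)
    cx = sx + sw / 2 + dx * sw
    cy = sy + sh / 2 + dy * sh
    return [cx - w / 2, cy - h / 2, w, h]
```

Decoding the encoded offsets of a pair of boxes recovers the target box, which is a quick way to check that the two functions are inverses of each other.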
Subsequent frames update
After the prior information of the first frame has been used, the target state has to be predicted frame by frame for the rest of the sequence, which is what [line 6-18] above does.
First, 256 candidates are constructed around the target location predicted in the previous frame and sent through the network to get their scores. The 5 candidates with the highest pos_score are then selected, and their positions are averaged to give the predicted target bbox. Then there are two situations:
- success:
  1. Perform bounding-box regression
  2. Draw $S_{t}^{+}=50$ positive samples and $S_{t}^{-}=200$ negative samples around the target bbox predicted at this time, and merge their features into the short-term feature set $\mathcal{T}_{s}$ and the long-term feature set $\mathcal{T}_{l}$ for later use when updating the model (if a set exceeds its size limit, its oldest feature is deleted)
- not success:
  1. Expand the search area (achieved by enlarging the offset range of the sample centers)
  2. Use the short-term feature sets $S_{v \in \mathcal{T}_{s}}^{+}$ and $S_{v \in \mathcal{T}_{s}}^{-}$ to update the model, in order to handle changes in target appearance
- Perform a long-term update every 10 frames (it is called a long-term update because it uses the long-term features $S_{v \in \mathcal{T}_{l}}^{+}$ and $S_{v \in \mathcal{T}_{s}}^{-}$)
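The candidate scoring step described above (take the 5 best candidates, average them, and decide success from their mean score) can be sketched in plain Python as follows. This is only an illustration: `estimate_target` is a hypothetical helper, the scores are plain floats instead of network outputs, and the candidate list is assumed to be non-empty:

```python
def estimate_target(samples, pos_scores, top_k=5, success_thresh=0.0):
    """Average the top_k highest-scoring candidate boxes and flag success.

    samples: list of [x, y, w, h] boxes; pos_scores: one score per box.
    """
    # Indices sorted by descending positive score.
    order = sorted(range(len(pos_scores)), key=lambda i: pos_scores[i], reverse=True)
    top_idx = order[:top_k]
    mean_score = sum(pos_scores[i] for i in top_idx) / len(top_idx)
    # Coordinate-wise mean of the selected boxes.
    bbox = [sum(samples[i][d] for i in top_idx) / len(top_idx) for d in range(4)]
    success = mean_score > success_thresh
    return bbox, mean_score, success, top_idx
```

Averaging several high-scoring boxes instead of taking the single best one makes the estimate less sensitive to one noisy score.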
Reflected in the code:
# Bbox regression
if success:
    bbreg_samples = samples[top_idx]
    if top_idx.shape[0] == 1:
        bbreg_samples = bbreg_samples[None,:]
    bbreg_feats = forward_samples(model, image, bbreg_samples)
    bbreg_samples = bbreg.predict(bbreg_feats, bbreg_samples)
    bbreg_bbox = bbreg_samples.mean(axis=0)
else:
    bbreg_bbox = target_bbox
# Data collect
if success:
    pos_examples = pos_generator(target_bbox, opts['n_pos_update'], opts['overlap_pos_update'])
    pos_feats = forward_samples(model, image, pos_examples)
    pos_feats_all.append(pos_feats)
    if len(pos_feats_all) > opts['n_frames_long']:
        del pos_feats_all[0]
    neg_examples = neg_generator(target_bbox, opts['n_neg_update'], opts['overlap_neg_update'])
    neg_feats = forward_samples(model, image, neg_examples)
    neg_feats_all.append(neg_feats)
    if len(neg_feats_all) > opts['n_frames_short']:
        del neg_feats_all[0]
# Expand search area at failure
if success:
    sample_generator.set_trans(opts['trans'])
else:
    sample_generator.expand_trans(opts['trans_limit'])
# Short term update
if not success:
    nframes = min(opts['n_frames_short'], len(pos_feats_all))
    pos_data = torch.cat(pos_feats_all[-nframes:], 0)
    neg_data = torch.cat(neg_feats_all, 0)
    train(model, criterion, update_optimizer, pos_data, neg_data, opts['maxiter_update'])
# Long term update
elif i % opts['long_interval'] == 0:
    pos_data = torch.cat(pos_feats_all, 0)
    neg_data = torch.cat(neg_feats_all, 0)
    train(model, criterion, update_optimizer, pos_data, neg_data, opts['maxiter_update'])
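The bookkeeping in the snippet above (bounded feature lists plus the choice between short-term and long-term updates) can be isolated into a small standalone sketch. The class `FeatureMemory`, the function `choose_update`, and the default sizes 100, 20 and 10 are illustrative assumptions; the features are stand-in Python objects rather than tensors:

```python
from collections import deque


class FeatureMemory:
    """Bounded per-frame feature buffers for online updates (illustrative)."""

    def __init__(self, n_frames_long=100, n_frames_short=20):
        self.n_frames_short = n_frames_short
        # deque(maxlen=...) drops the oldest entry automatically when full,
        # which mirrors the `del ...[0]` lines in the tracker code.
        self.pos = deque(maxlen=n_frames_long)
        self.neg = deque(maxlen=n_frames_short)

    def add(self, pos_feats, neg_feats):
        self.pos.append(pos_feats)
        self.neg.append(neg_feats)

    def short_term(self):
        # Positives from the most recent n_frames_short frames only.
        return list(self.pos)[-self.n_frames_short:], list(self.neg)

    def long_term(self):
        # All stored positives; negatives always come from the short window.
        return list(self.pos), list(self.neg)


def choose_update(success, frame_idx, long_interval=10):
    """Return 'short', 'long', or None for the current frame."""
    if not success:
        return 'short'
    if frame_idx % long_interval == 0:
        return 'long'
    return None
```

Negatives are kept only for the short window in both cases, matching the code above, where `neg_feats_all` is capped by `n_frames_short`.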
Hard negative mining
Hard negative mining trains the classifier with difficult negative samples so that it discriminates better between the target and similar objects. It is carried out during the initial training on the first frame: out of 1024 negative samples, the 96 with the highest pos_score are selected and fed to classifier training as hard negatives. Below is a visualization of my training process:
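The selection step itself, picking the 96 highest-scoring negatives out of a pool of 1024, can be sketched without any network. In this minimal example `mine_hard_negatives` is a hypothetical helper and the scores are random stand-ins for the positive-class scores that the model would assign to negative samples:

```python
import random


def mine_hard_negatives(neg_scores, n_hard=96):
    """Indices of the n_hard negatives with the highest positive score."""
    order = sorted(range(len(neg_scores)), key=lambda i: neg_scores[i], reverse=True)
    return order[:n_hard]


# Toy pool: 1024 negatives with random stand-in scores (no network involved).
random.seed(0)
pool_scores = [random.gauss(0.0, 1.0) for _ in range(1024)]
hard_idx = mine_hard_negatives(pool_scores)
```

The negatives that the current classifier scores highest are the ones it is most likely to confuse with the target, so training on them focuses each mini-batch on the most informative mistakes.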
Results on OTB100
The following is the result of using the author's pre-trained model mdnet_vot-otb.pth and then running the Python version of the OTB benchmark; it is very close to the figures he gave:
Demo
The GIFs are too big to put on the blog, so I put them on Gitee:
Blurcar1: https://gitee.com/laisimiao/picBed/raw/master/image/BlurCar1.gif
Biker: https://gitee.com/laisimiao/picBed/raw/master/image/Biker.gif
Bonus
Video explanation: https://www.bilibili.com/video/BV1qt4y1X7SQ/
References
- The algorithm flow chart comes from this blog: MDNet of Target Tracking (1)