Detailed explanation of py-MDNet with code (1): train

It is already 2020, so why still study an algorithm as old as MDNet (CVPR 2016)? There are two reasons:

  • The paper contains a classic implementation of online updating, which is a useful reference for understanding similar techniques later on.
  • Although it is essentially a classification-based tracker, it won VOT2015, and MBMD, the winner of VOT2018-LT, also borrowed from MDNet.

Paper: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Code: https://github.com/hyeonseobnam/py-MDNet
This post starts with the easier part of the whole pipeline: training.

Architecture

MDNet's overall architecture
The figure in the paper clearly shows the overall architecture of MDNet:

  • The network is divided into two parts: shared layers and domain-specific layers. The shared layers use the VGG-M model with pre-trained parameters. The domain-specific part has K branches, one per training sequence, and each sequence is called a domain. Only the conv layers are initialized from the pre-trained model; both the conv and fc layers are then trained in the offline learning phase.
  • During training, each iteration draws 32 positive samples and 96 negative samples from one sequence to form a mini-batch and trains the corresponding branch fc6^k. In other words, in every epoch each fc6 branch is trained once in turn.
  • Each fc6 branch outputs a two-dimensional vector: index 0 is the background score (neg_score) and index 1 is the target score (pos_score). The label for positive samples is [0, 1], the label for negative samples is [1, 0], and the loss is the binary softmax cross-entropy.
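
The round-robin schedule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the repository's training loop: `multi_domain_schedule` and `label_for` are hypothetical names, the values of K and the number of epochs are made up, and shuffling the domain order inside each epoch is an assumption of this sketch.

```python
import random


def multi_domain_schedule(num_domains, num_epochs, seed=0):
    """Return the (epoch, domain) order in which the fc6 branches are trained.

    Every epoch visits each domain exactly once, so each fc6 branch
    receives exactly one mini-batch update per epoch.
    """
    rng = random.Random(seed)
    schedule = []
    for epoch in range(num_epochs):
        order = list(range(num_domains))
        rng.shuffle(order)  # assumption: domain order is shuffled each epoch
        for k in order:
            # Here the real code would draw 32 pos / 96 neg regions from
            # sequence k and update the shared layers plus branch k only.
            schedule.append((epoch, k))
    return schedule


def label_for(is_positive):
    """Two-class label: index 0 = background, index 1 = target."""
    return [0, 1] if is_positive else [1, 0]


schedule = multi_domain_schedule(num_domains=3, num_epochs=2)
print(schedule)
```

The point of the schedule is that the shared layers see every domain in each epoch, while each fc6 branch only ever sees samples from its own sequence.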

Now let's look at how these key parts are implemented in the code. The most important question is how the positive and negative samples are generated:

Regions

Because MDNet tracks the target by classification, it needs positive and negative samples to train the classifier. They come mainly from the following two statements:

dataset[k] = RegionDataset(seq['images'], seq['gt'], opts)
# take vot2013/cup sequence for example
# seq['images'] -> 'datasets/VOT/vot2013/cup/00000001.jpg',...
# seq['gt'] -> (303, 4) np.ndarray
# training
pos_regions, neg_regions = dataset[k].next()

The RegionDataset class defines the magic method __next__(), which generates 32 pos_regions and 96 neg_regions per call. It selects 8 frames from the sequence and generates 4 pos_regions and 12 neg_regions in each frame. The pos_regions are samples whose IoU with the GT bbox lies in [0.7, 1], and the neg_regions are samples whose IoU with the GT bbox lies in [0, 0.5]. These thresholds follow the paper: "For offline multi-domain learning, we collect 50 positive and 200 negative samples from every frame, where positive and negative examples have ≥ 0.7 and ≤ 0.5 IoU overlap ratios with ground-truth bounding boxes, respectively." Note that the per-frame counts in the code (4 and 12) differ from the 50 and 200 quoted from the paper. The implementation:

def __next__(self):
    next_pointer = min(self.pointer + self.batch_frames, len(self.img_list)) # 8
    idx = self.index[self.pointer:next_pointer]
    if len(idx) < self.batch_frames:
        self.index = np.random.permutation(len(self.img_list))
        next_pointer = self.batch_frames - len(idx)
        idx = np.concatenate((idx, self.index[:next_pointer]))
    self.pointer = next_pointer

    pos_regions = np.empty((0, 3, self.crop_size, self.crop_size), dtype='float32')
    neg_regions = np.empty((0, 3, self.crop_size, self.crop_size), dtype='float32')
    for i, (img_path, bbox) in enumerate(zip(self.img_list[idx], self.gt[idx])):
        image = Image.open(img_path).convert('RGB')
        image = np.asarray(image)

        # split what is still missing evenly over the remaining frames
        n_pos = (self.batch_pos - len(pos_regions)) // (self.batch_frames - i)  # 32 in total -> 4 per frame
        n_neg = (self.batch_neg - len(neg_regions)) // (self.batch_frames - i)  # 96 in total -> 12 per frame
        pos_examples = self.pos_generator(bbox, n_pos, overlap_range=self.overlap_pos) # IoU in [0.7, 1]
        neg_examples = self.neg_generator(bbox, n_neg, overlap_range=self.overlap_neg) # IoU in [0, 0.5]

        pos_regions = np.concatenate((pos_regions, self.extract_regions(image, pos_examples)), axis=0)
        neg_regions = np.concatenate((neg_regions, self.extract_regions(image, neg_examples)), axis=0)

    pos_regions = torch.from_numpy(pos_regions)
    neg_regions = torch.from_numpy(neg_regions)
    return pos_regions, neg_regions
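
The IoU-based selection rule and the per-frame quota used in __next__ can be illustrated with a small plain-Python sketch. It is not the repository's pos_generator/neg_generator (those also sample the candidate boxes). `iou`, `in_range` and `per_frame_quota` are hypothetical helpers, boxes are assumed to be (x, y, w, h), and the quota helper assumes every frame yields exactly the number of samples requested.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def in_range(candidate, gt, overlap_range):
    """True if the candidate's IoU with the GT box lies inside [lo, hi]."""
    lo, hi = overlap_range
    return lo <= iou(candidate, gt) <= hi


def per_frame_quota(batch_total, batch_frames):
    """Mirror the `(total - collected) // (frames - i)` rule from __next__."""
    collected, quotas = 0, []
    for i in range(batch_frames):
        n = (batch_total - collected) // (batch_frames - i)
        quotas.append(n)
        collected += n  # assumption: the frame delivered all n samples
    return quotas


gt = (100, 100, 50, 50)
print(in_range((105, 100, 50, 50), gt, (0.7, 1.0)))  # small shift: positive candidate
print(in_range((125, 100, 50, 50), gt, (0.0, 0.5)))  # large shift: negative candidate
print(per_frame_quota(32, 8), per_frame_quota(96, 8))
```

Writing the quota as a running division rather than a fixed 4 and 12 means the batch still fills up to the requested total when the total is not divisible by the number of frames.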

Once the training data is available, three more components are constructed: the loss function (criterion), the evaluator used to check whether training is converging, and the optimizer:

Loss function

The loss function is very simple: it is just the two-class cross-entropy:

class BCELoss(nn.Module):
    def forward(self, pos_score, neg_score, average=True):
        # pos_score:(32, 2) | neg_score:(96, 2)
        pos_loss = -F.log_softmax(pos_score, dim=1)[:, 1]
        neg_loss = -F.log_softmax(neg_score, dim=1)[:, 0]

        loss = pos_loss.sum() + neg_loss.sum()
        if average:
            loss /= (pos_loss.size(0) + neg_loss.size(0))
        return loss

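Normally F.log_softmax is used together with F.nll_loss. Here the log-probability of the correct class is picked out directly by indexing, so the multiplication by the label entry (which is 1) is omitted. The indexing also shows that the label for positive samples is [0, 1] and the label for negative samples is [1, 0].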
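
To make the "multiplying by 1 is omitted" remark concrete, here is a minimal plain-Python sketch (no PyTorch) that computes the same quantity both ways. `log_softmax`, `binary_softmax_ce` and `one_hot_nll` are hypothetical helper names written for this illustration, and the scores are made-up numbers.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one score row."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def binary_softmax_ce(pos_scores, neg_scores, average=True):
    """Same computation as the loss above; rows are [background, target]."""
    pos_loss = [-log_softmax(r)[1] for r in pos_scores]  # label [0, 1]
    neg_loss = [-log_softmax(r)[0] for r in neg_scores]  # label [1, 0]
    loss = sum(pos_loss) + sum(neg_loss)
    if average:
        loss /= (len(pos_loss) + len(neg_loss))
    return loss


def one_hot_nll(row, label):
    """Explicit form: negative sum of label * log-probability."""
    return -sum(t * lp for t, lp in zip(label, log_softmax(row)))


row = [1.0, 3.0]
# Indexing the correct class equals multiplying by the one-hot label.
print(one_hot_nll(row, [0, 1]), -log_softmax(row)[1])
print(binary_softmax_ce([[1.0, 3.0]], [[2.0, -1.0]]))
```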

Evaluator

The metric used to check whether training is progressing effectively is Precision, defined as follows:

class Precision():
    def __call__(self, pos_score, neg_score):
        scores = torch.cat((pos_score[:, 1], neg_score[:, 1]), 0)
        topk = torch.topk(scores, pos_score.size(0))[1]
        # torch.topk -> (values, indexes)
        prec = (topk < pos_score.size(0)).float().sum() / (pos_score.size(0) + 1e-8)
        return prec.item()

Precision is the fraction of the N highest target scores that belong to pos_regions, where N is the number of pos_regions. If the lowest target score among the pos_regions is higher than the highest target score among the neg_regions, then Precision = 1. In general, the more pos_regions that outscore the neg_regions, the higher the precision.
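
A small worked example may help. The sketch below re-implements the same top-k idea in plain Python on made-up target scores; `precision` is a hypothetical stand-in for the Precision class above, not the repository's code.

```python
def precision(pos_target_scores, neg_target_scores):
    """Fraction of the top-N scores that come from positives (N = #positives)."""
    n_pos = len(pos_target_scores)
    scores = list(pos_target_scores) + list(neg_target_scores)
    # Indices of the n_pos highest scores; positives occupy indices [0, n_pos).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_pos]
    hits = sum(1 for i in top if i < n_pos)
    return hits / (n_pos + 1e-8)


# All positives outscore all negatives: precision is (almost exactly) 1.
print(precision([0.9, 0.8], [0.1, 0.2, 0.3]))
# One negative (0.5) pushes one positive (0.1) out of the top 2: precision is about 0.5.
print(precision([0.9, 0.1], [0.5, 0.2, 0.3]))
```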

Hyper-parameters

Here are a few hyperparameters used in the paper and code:

Hyperparameter                 Paper       Code
Offline learning iterations    100K        50K
lr for conv                    0.0001      0.0001
lr for fc                      0.001       0.001
IoU range for pos samples      [0.7, 1]    [0.7, 1]
IoU range for neg samples      [0, 0.5]    [0, 0.5]
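
The two learning rates in the table differ by a factor of 10. One way to express that is a base rate plus a per-prefix multiplier, sketched below in plain Python. This is an assumption about how such a setup could be written, not a copy of the repository's optimizer code: `build_param_groups` and the layer names are hypothetical, and each group stores a name where a real PyTorch per-parameter option dict would hold the tensors under a 'params' key.

```python
def build_param_groups(param_names, lr_base=0.0001, lr_mult=None):
    """Assign a learning rate to each parameter by name prefix."""
    if lr_mult is None:
        lr_mult = {"fc": 10}  # fc layers learn 10x faster than conv layers
    groups = []
    for name in param_names:
        lr = lr_base
        for prefix, mult in lr_mult.items():
            if name.startswith(prefix):
                lr = lr_base * mult
        groups.append({"name": name, "lr": lr})
    return groups


# Hypothetical parameter names, for illustration only.
names = ["conv1_weight", "conv2_weight", "fc4_weight", "fc6_weight"]
for group in build_param_groups(names):
    print(group)
```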

Figures

The picture below visualizes the positive and negative samples in the training phase (I just selected 16 samples here). Red is pos_examples, blue is neg_examples, and green is GT bbox.
Visualization of positive and negative samples on the original image

The following visualizes the pos_regions and neg_regions obtained by cropping 2 positive and 2 negative samples from the image above; these crops are the input of the network. Patch-level data augmentation could be applied at this point.
Positive and negative regions fed to the network
Below is a screenshot of the training process. Note that the code does not keep the model with the highest precision; instead, the saved model is overwritten after every cycle.
Screenshot of training process

Discussion

The key to MDNet lies less in the training phase than in the tricks of the online tracking phase. Still, the training part raises a few points worth discussing:

  • Training is very similar to R-CNN: all the regions cropped from the original image are fed through the shared layers, so overlapping areas are recomputed, which is redundant and inefficient.
  • This classification approach trains one domain per sequence. Without the online update, my personal feeling is that it would track accurately mainly on objects it has already seen. With the online update the results are much better, although the speed inevitably drops. See the next part for more details.

Origin blog.csdn.net/laizi_laizi/article/details/107475865