Single target tracking algorithm: SiamRPN++

Original link - SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks

Preface: The algorithm proposed in this paper can fairly be called a classic and effective single-target tracking algorithm. It builds on "Fully-Convolutional Siamese Networks for Object Tracking" and adds bounding box regression for the target. In my own experience the results are very good, but only for short-term tracking; long-term tracking may not work as well.

Table of contents

1 Summary

2 Strict translation invariance?

3 SiamRPN++

3.1 Network input

3.2 Network structure

3.3 Output of the network

3.4 Network Training

3.5 Network Prediction


1 Summary

As the title suggests, this paper is an improved version of the original single-target tracking paper "Fully-Convolutional Siamese Networks for Object Tracking". The earlier paper's results were not strong, but its idea (using a Siamese network) was arguably groundbreaking. The main contributions of this paper can be summarized as follows:

  • It demonstrates the specific reason why the earlier "Fully-Convolutional Siamese Networks for Object Tracking" approach did not perform well.
  • It proposes a new tracking algorithm, SiamRPN++, together with a depth-wise cross-correlation mode that saves computation without losing accuracy and uses relatively few parameters. At the time, the algorithm achieved the best results in single-target tracking.

2 Strict translation invariance?

This section examines why the single-target tracking performance of "Fully-Convolutional Siamese Networks for Object Tracking" (the first Siamese-network-based tracker) was not good: strict translation invariance. This sounds abstract, but just look at the data preprocessing part of that paper. Before training, it crops the images in the dataset so that the target ends up at the center of the cropped picture (I think this is for convenience when computing the loss during training), as shown in the figure below. Strict translation invariance therefore refers to the target being strictly placed at the center of the picture during training.

Returning to the SiamRPN++ paper: on top of the same crop as above, it randomly shifts the image, and shows that a random shift within a certain range improves the tracker's performance. The figure below shows the authors' experimental result; the abscissa is the range of the random offset (in pixels), and the ordinate is EAO, a common evaluation metric for single-target tracking (higher is better).
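As a minimal sketch of this idea (not the exact pysot augmentation), the crop center can be jittered by a uniform random offset before the search region is cropped; max_shift plays the role of the offset range on the abscissa above:

```python
import random

def shift_crop_center(cx, cy, max_shift=64):
    """Jitter the crop center by a uniform random offset (in pixels)
    before cropping the search region around the target."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    return cx + dx, cy + dy
```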

3 SiamRPN++

Having pointed out the problems of the other algorithms, it is time to propose our own. First, let's look at the network structure of SiamRPN++. (The "structure diagram" mentioned later refers to the figure below.)

3.1 Network input 

Don't panic; let's analyze it bit by bit. First look at the left side of the dotted line in the structure diagram above. The green square Target is the cropped target image; its size is 127*127*3, and it is fed into the green backbone network as one input. If we want to track a bird, the following picture is a Target:

During training, Search is taken from an entire frame of the video; its size is 255*255*3, and it is fed into the purple backbone network as the other input. The following picture is a Search:

Both Target and Search are obtained by a crop operation; the specific method is similar to the one introduced in "Fully-Convolutional Siamese Networks for Object Tracking", except that a random offset is added.
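For reference, these are the tensor shapes of the two inputs in PyTorch's channels-first layout (a sketch for batch size 1; the text above writes them as height*width*channels):

```python
import torch

target = torch.zeros(1, 3, 127, 127)  # Target patch Z, cropped around the object
search = torch.zeros(1, 3, 255, 255)  # Search region X, cropped from the current frame
```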

3.2 Network structure

Once we know what the inputs are, it is not hard to see that the stacked multi-dimensional blocks in the structure diagram are the backbone network, and the upper and lower backbones form the Siamese network. The algorithm applies a Siamese RPN operation to the output of each of three layers of the backbone, as shown in the following figure:

The specific operation of the Siamese RPN is shown on the right side of the dotted line in the structure diagram:

First of all, don't panic; please read on. The green F(Z) and the purple F(X) represent the output of one layer of the backbone network of the corresponding color (remember that at the beginning of 3.2 I said "the algorithm applies a Siamese RPN operation to the output of each of three layers of the backbone"?). The adj_n in the penultimate row of the figure above is a convolution operation whose purpose is to compress the number of channels of the green F(Z) and the purple F(X) to 256. Then comes the key point, and the essence of this paper: DW-Corr (depth-wise cross correlation). The specific operation is shown in the figure below:

It doesn't matter if you don't follow yet; let me explain briefly, taking adj_1 and adj_2 as an example. adj_1 and adj_2 have already compressed the channels of F(Z) and F(X) to 256 (call the outputs F'(Z) and F'(X)). Now the i-th channel of F'(Z) is used as a convolution kernel and is convolved with the i-th channel of F'(X) (i = 1, 2, 3, ..., 256), so the resulting output still has 256 channels. This is the operation corresponding to DW-Corr_n.
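In code, DW-Corr can be written as a grouped convolution where the template features act as the kernels. The sketch below follows this idea (it mirrors the xcorr_depthwise routine found in pysot; the feature sizes in the example are assumptions roughly matching the paper):

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(x, kernel):
    """Depth-wise cross correlation: channel i of `kernel` (from F'(Z)) is used
    as a convolution kernel over channel i of `x` (from F'(X)), so the output
    keeps the same number of channels."""
    batch, channels = kernel.size(0), kernel.size(1)
    x = x.view(1, batch * channels, x.size(2), x.size(3))
    kernel = kernel.view(batch * channels, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch * channels)
    return out.view(batch, channels, out.size(2), out.size(3))

z = torch.randn(1, 256, 7, 7)       # compressed template features F'(Z)
x = torch.randn(1, 256, 31, 31)     # compressed search features F'(X)
print(xcorr_depthwise(x, z).shape)  # torch.Size([1, 256, 25, 25])
```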

Finally, the output of DW-Corr_n is passed through a Box Head (or Cls Head) operation (essentially also a convolution), which produces B_l and S_l. Let's set these two aside and explain them later.

Remember when I said "the algorithm applies a Siamese RPN operation to the output of each of three layers of the backbone"? That's right: the above operation is performed three times in total, and each time the inputs (the green F(Z) and the purple F(X)) come from a different layer of the backbone. So there are three sets of B_l and S_l (B_1, B_2, B_3 and S_1, S_2, S_3). The final B is obtained as a weighted sum of B_1, B_2, B_3, and likewise S is a weighted sum of S_1, S_2, S_3; the coefficients of these weighted sums are also learned by the network.
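A minimal sketch of such a learnable weighted fusion (the module and weight names are my own, not pysot's exact implementation) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse the three per-layer outputs with weights learned by the network."""
    def __init__(self, num_branches=3):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_branches))

    def forward(self, outputs):
        w = F.softmax(self.weight, dim=0)                # normalize the learned coefficients
        return sum(wi * oi for wi, oi in zip(w, outputs))

fuse = WeightedFusion()
B = fuse([torch.randn(1, 20, 25, 25) for _ in range(3)])  # B_1, B_2, B_3 -> B
```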

B corresponds to BBox Regression in the structure diagram, and S corresponds to CLS in the structure diagram.

3.3 Output of the network

Here is a detailed look at what B_l and S_l actually are. If the input to the network is a single pair of images (that is, the batch size is 1), then B_l is a 25*25*(4*k) tensor (k is 5 in the author's code). The 25*25 means the original image (Search) is divided into 25*25 small cells, k means that k anchor boxes are generated in each cell, and 4 refers to the coordinates of the anchor box (x_min, y_min, x_max, y_max).

(Understanding them as coordinates is fine here, but strictly speaking they are correction amounts applied to the anchor box coordinates.)
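To make that note concrete, here is a standard RPN-style decoding sketch (my own illustration; the anchor here is parameterized as center and size, and pysot's exact layout may differ): the predicted corrections shift and scale an anchor, and the result is converted back to corner coordinates.

```python
import math

def decode_anchor(anchor, delta):
    """Apply predicted corrections (dx, dy, dw, dh) to an anchor (cx, cy, w, h)
    and return corner coordinates (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    cx, cy = cx + dx * w, cy + dy * h            # shift the center by a fraction of the size
    w, h = w * math.exp(dw), h * math.exp(dh)    # scale the width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```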

At this point you have probably realized that this is essentially the same idea as in YOLO.

S_l is then a 25*25*(2*k) tensor; the only difference from the above is the "2", which refers to the two confidence scores for each anchor box: whether it contains the object or not. Again, this is very similar to YOLO.
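For batch size 1 the two outputs therefore have the following shapes (shown in PyTorch's channels-first layout; the text above writes them as 25*25*(4*k) and 25*25*(2*k)):

```python
import torch

k = 5                                # anchors per cell (k in the author's code)
B_l = torch.zeros(1, 4 * k, 25, 25)  # box branch: 4 correction values per anchor
S_l = torch.zeros(1, 2 * k, 25, 25)  # cls branch: object / no-object score per anchor
```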

3.4 Network Training

To be honest, it's a bit complicated. If you are interested, take a look at the official code: STVIR/pysot, the SenseTime Research platform for single object tracking, implementing algorithms like SiamRPN and SiamMask: https://github.com/STVIR/pysot

Simply put, an L1 loss is used for the anchor box offsets, and a cross-entropy loss is used for the confidence scores.
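As a minimal sketch of this objective (simplified; pysot additionally ignores some anchors and weights the two terms via its config), the two losses could be combined as follows:

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_pred, cls_label, box_pred, box_target, pos_mask, loc_weight=1.0):
    """Cross-entropy on the per-anchor classification plus an L1 loss on the
    box offsets, computed only for positive (object-containing) anchors."""
    # cls_pred: (N, 2) logits; cls_label: (N,) with 0 = background, 1 = foreground
    cls_loss = F.cross_entropy(cls_pred, cls_label)
    # box_pred / box_target: (N, 4) offsets; pos_mask: (N,) bool for positive anchors
    if pos_mask.any():
        loc_loss = F.l1_loss(box_pred[pos_mask], box_target[pos_mask])
    else:
        loc_loss = box_pred.sum() * 0.0
    return cls_loss + loc_weight * loc_loss
```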

3.5 Network Prediction

First of all, given a video, you need to manually draw a bounding box around the object to be tracked in the first frame, crop the content inside the box, and feed it into the network as the Target in the structure diagram. Note that the Target does not change while tracking through the entire video.

Starting from the second frame, the image around the target's position in the previous frame is cropped and fed into the network as the Search in the structure diagram (this differs from training, where the crop is taken from the whole labelled frame). The network then outputs the corresponding B and S; the box in S with the highest confidence of containing the object is taken as the prediction for this frame, that is, the position of the target in this frame.

Then repeat the above calculation for the third frame, and so on, until the entire video is traversed.

The end.

Origin: https://blog.csdn.net/fuss1207/article/details/124217167