Siamese-RPN reading notes

High Performance Visual Tracking with Siamese Region Proposal Network

This paper draws on SiamFC and RPN, combining the Siamese network with a region proposal network to perform visual object tracking.

Visual object tracking: given the ground-truth bounding box of the object in the first frame, find the object in each subsequent frame and mark it with a bounding box.

Siamese-RPN process:

Siamese-RPN consists of two parts. The first part is the Siamese network, which extracts features from the template frame (here, the first frame) and the detection frames (the second frame through the last frame) using AlexNet. The paper does not explain why AlexNet was chosen, but the authors presumably tried other networks such as VGGNet and GoogLeNet and found they did not work as well; the CVPR 2019 paper "Deeper and Wider Siamese Networks for Real-Time Visual Tracking" addresses exactly this question. After AlexNet, the resulting feature maps are 6*6*256 (template) and 22*22*256 (detection). The parameters of the Siamese part are shared between the two branches.

The second part is the RPN, used mainly for anchor classification and regression. The template-frame feature map is passed through 3*3 convolutions to produce two branches: one of size 4*4*(2k*256) for classification and one of size 4*4*(4k*256) for regression, where k is the number of anchors. Likewise, the detection-frame feature map is passed through 3*3 convolutions to produce two branches, each of size 20*20*256. Next, the template-frame feature maps are used as convolution kernels on the detection-frame feature maps (cross-correlation), yielding final feature maps of size 17*17*2k and 17*17*4k. In the 2k map, each position has k anchors, and each anchor gets a foreground/background probability, hence 2k channels; the 4k map gives each anchor's regression offsets relative to the ground truth, represented by dx, dy, dw, dh.
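Below is a minimal PyTorch sketch of this head, assuming 256-channel backbone features and k = 5 anchors; the layer names and class name are my own, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SiameseRPNHead(nn.Module):
    def __init__(self, in_ch=256, k=5):
        super().__init__()
        self.k = k
        # Template branch: 6x6x256 -> 4x4x(2k*256) and 4x4x(4k*256)
        self.conv_cls_z = nn.Conv2d(in_ch, in_ch * 2 * k, kernel_size=3)
        self.conv_reg_z = nn.Conv2d(in_ch, in_ch * 4 * k, kernel_size=3)
        # Detection branch: 22x22x256 -> 20x20x256 for each task
        self.conv_cls_x = nn.Conv2d(in_ch, in_ch, kernel_size=3)
        self.conv_reg_x = nn.Conv2d(in_ch, in_ch, kernel_size=3)

    def forward(self, z_feat, x_feat):
        # z_feat: (1, 256, 6, 6) template features; x_feat: (1, 256, 22, 22)
        cls_kernel = self.conv_cls_z(z_feat)   # (1, 2k*256, 4, 4)
        reg_kernel = self.conv_reg_z(z_feat)   # (1, 4k*256, 4, 4)
        cls_x = self.conv_cls_x(x_feat)        # (1, 256, 20, 20)
        reg_x = self.conv_reg_x(x_feat)        # (1, 256, 20, 20)
        # Reshape the template outputs into 2k (resp. 4k) kernels of 256x4x4
        # and cross-correlate them with the detection features.
        cls_kernel = cls_kernel.view(2 * self.k, 256, 4, 4)
        reg_kernel = reg_kernel.view(4 * self.k, 256, 4, 4)
        cls_out = F.conv2d(cls_x, cls_kernel)  # (1, 2k, 17, 17)
        reg_out = F.conv2d(reg_x, reg_kernel)  # (1, 4k, 17, 17)
        return cls_out, reg_out
```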

Training this RPN is roughly the same as in Faster R-CNN: the classification loss is cross-entropy, and the regression loss is smooth L1 applied to the normalized coordinates.
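A minimal sketch of that loss, assuming anchors labelled 1 (positive), 0 (negative), and -1 (ignored), and regression targets already normalized to (dx, dy, dw, dh); the function name and weighting are assumptions:

```python
import torch.nn.functional as F

def rpn_loss(cls_logits, reg_pred, labels, reg_targets, lam=1.0):
    # cls_logits: (N, 2) scores per sampled anchor; labels: (N,) in {1, 0, -1}
    # reg_pred / reg_targets: (N, 4) normalized offsets (dx, dy, dw, dh)
    keep = labels >= 0                       # ignore anchors labelled -1
    cls_loss = F.cross_entropy(cls_logits[keep], labels[keep])
    pos = labels == 1                        # regression only on positives
    if pos.any():
        reg_loss = F.smooth_l1_loss(reg_pred[pos], reg_targets[pos])
    else:
        reg_loss = reg_pred.sum() * 0.0      # keep the graph valid with no positives
    return cls_loss + lam * reg_loss
```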

Siamese-RPN training details:

  In the training phase, sample pairs are drawn randomly from ILSVRC and continuously from Youtube-BB, with the template frame and the detection frame always taken from the same video. The network is first pre-trained on ImageNet and then trained with SGD, using data augmentation during training. For the RPN, the authors assume that the object does not change much between two adjacent frames, so the anchors use only one scale but several aspect ratios. When selecting samples, anchors with IoU greater than 0.6 against the ground truth are positives, and those with IoU less than 0.3 are negatives. At most 64 samples are selected from one training pair, of which at most 16 are positives.
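A sketch of that sampling rule under the stated thresholds (IoU 0.6 / 0.3, at most 16 positives and 64 samples per pair); the helper names are mine, not the authors' code.

```python
import numpy as np

def iou(anchors, gt):
    # anchors: (N, 4) as (x1, y1, x2, y2); gt: (4,)
    x1 = np.maximum(anchors[:, 0], gt[0])
    y1 = np.maximum(anchors[:, 1], gt[1])
    x2 = np.minimum(anchors[:, 2], gt[2])
    y2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def assign_labels(anchors, gt, max_pos=16, max_total=64):
    overlaps = iou(anchors, gt)
    labels = np.full(len(anchors), -1, dtype=np.int64)   # -1 = ignored
    labels[overlaps > 0.6] = 1                            # positives
    labels[overlaps < 0.3] = 0                            # negatives
    # Cap positives at 16 and the total sample count at 64.
    pos = np.flatnonzero(labels == 1)
    if len(pos) > max_pos:
        labels[np.random.choice(pos, len(pos) - max_pos, replace=False)] = -1
    neg = np.flatnonzero(labels == 0)
    keep_neg = max_total - (labels == 1).sum()
    if len(neg) > keep_neg:
        labels[np.random.choice(neg, len(neg) - keep_neg, replace=False)] = -1
    return labels
```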

one-shot detection:

  First, what "one-shot" means: learning from very few samples, possibly only a single one.

  

  During training, no supervision other than the bounding boxes is required. In the inference stage, the target in the first frame is fed into the template branch, and its output becomes the convolution (cross-correlation) kernels applied to the detection branch (these kernels carry the category information of the object). The template branch is then removed, leaving only the detection branch. Frames 2 through N are fed into the detection branch one after another, so Siamese-RPN turns into a one-shot detection task: since only the first frame supplies the target, it can be regarded as one-shot detection.
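A sketch of this inference flow, where `backbone` and `head` stand in for the shared AlexNet and the RPN head sketched earlier; both are placeholders, not the authors' API.

```python
import torch

@torch.no_grad()
def track(backbone, head, template_img, frames):
    # The template branch runs once on the first frame; in a full implementation
    # the template-side convolutions would also be folded into fixed kernels here.
    z_feat = backbone(template_img)
    outputs = []
    for frame in frames:                        # frames 2..N, detection branch only
        x_feat = backbone(frame)
        cls_out, reg_out = head(z_feat, x_feat) # 17*17*2k and 17*17*4k maps
        outputs.append((cls_out, reg_out))      # proposal selection comes next
    return outputs
```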

proposal selection: to make the one-shot detection framework suitable for the tracking task, the authors propose two strategies for selecting the candidate box.

  • Discard bounding boxes whose centers are too far from the center of the search region; the authors argue that the target is unlikely to move far between adjacent frames.
  • Re-rank the proposals with a cosine window and a scale-change penalty to pick the best one. The cosine window suppresses large displacements, and the penalty suppresses large changes in size.

After these operations, the classification score is multiplied by the temporal penalty, and the top K proposals are re-ranked. NMS is then applied to obtain the final tracking bounding box. Once the final bounding box is selected, the target size is updated by linear interpolation so that the shape changes smoothly over time.
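A sketch of this re-ranking step; the penalty form and the constants (penalty_k, window_influence, lr) follow common SiamRPN implementations and are assumptions here, not values quoted from the paper. NMS is omitted for brevity.

```python
import numpy as np

def rerank(scores, boxes, prev_size, penalty_k=0.055, window_influence=0.42, lr=0.3):
    # scores: (N,) classification scores; boxes: (N, 4) as (cx, cy, w, h)
    # prev_size: (w, h) of the target in the previous frame
    def change(r):
        return np.maximum(r, 1.0 / r)

    s_prev = np.sqrt(prev_size[0] * prev_size[1])
    s_prop = np.sqrt(boxes[:, 2] * boxes[:, 3])
    r_prev = prev_size[0] / prev_size[1]
    r_prop = boxes[:, 2] / boxes[:, 3]
    # Penalize proposals whose size or aspect ratio changed a lot.
    penalty = np.exp(-(change(s_prop / s_prev) * change(r_prop / r_prev) - 1) * penalty_k)
    pscore = penalty * scores
    # Cosine window over the 17x17 score map suppresses large displacements.
    side = int(np.sqrt(len(scores) // 5))            # assumes k = 5 anchors
    hann = np.outer(np.hanning(side), np.hanning(side)).flatten()
    window = np.tile(hann, 5)
    pscore = pscore * (1 - window_influence) + window * window_influence
    best = int(np.argmax(pscore))
    # Linear interpolation keeps the box size changing smoothly.
    new_w = prev_size[0] * (1 - lr) + boxes[best, 2] * lr
    new_h = prev_size[1] * (1 - lr) + boxes[best, 3] * lr
    return best, (new_w, new_h)
```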

Experiment: not much to add here. The tracker leads on the benchmark metrics while its speed is far ahead of many other methods.

 

other:

  • Experiments show that the larger the training data set, the better Siamese-RPN performs.
  • The authors fix the anchor scale and vary only the aspect ratio (see the sketch after this list). They tried 3, 5, and 7 ratios; 5 works better than 3, but 7 is worse than 5, which the authors attribute to over-fitting. With more training data, the 7-ratio setting improves somewhat.
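A sketch of anchor generation with one scale and several aspect ratios; the ratio list, stride, and base size are commonly used SiamRPN-style values and should be read as assumptions, not figures from these notes.

```python
import numpy as np

def generate_anchors(score_size=17, stride=8, base_size=64,
                     ratios=(0.33, 0.5, 1.0, 2.0, 3.0)):
    # One scale (base_size), k = len(ratios) anchors per score-map position.
    anchors = []
    for r in ratios:
        w = base_size / np.sqrt(r)
        h = base_size * np.sqrt(r)
        anchors.append((w, h))
    anchors = np.array(anchors)                                # (k, 2)
    # Centre each anchor on every position of the 17x17 score map.
    coords = (np.arange(score_size) - score_size // 2) * stride
    cx, cy = np.meshgrid(coords, coords)
    grid = np.stack([cx, cy], axis=-1).reshape(-1, 1, 2)       # (289, 1, 2)
    wh = np.broadcast_to(anchors, (grid.shape[0], len(ratios), 2))
    centres = np.broadcast_to(grid, wh.shape)
    return np.concatenate([centres, wh], axis=-1)              # (289, k, 4) as (cx, cy, w, h)
```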

Conclusion: The authors propose the Siamese-RPN network, which is trained offline end-to-end. With the bounding-box regression, the localization accuracy is greatly improved. In the tracking phase, Siamese-RPN can be viewed as a local one-shot detection task. Experiments show that Siamese-RPN not only delivers leading performance but also runs in real time, reaching 160 FPS.

 

Origin www.cnblogs.com/liualexsone/p/11366587.html