The correspondence between anchors in Faster R-CNN and bounding boxes in the original image

Whenever I set myself a goal for a blog post, I always want to cram too many things into one article, and it turns into a maze: the writing drifts further and further from the main line, and the original plan ends up as an unfinished building.

The best cure for procrastination is to abandon perfectionism: no matter how rough the writing is, get a version out first and revise it later. This article was written in that spirit, so some parts may not be explained well; I hope readers will discuss and correct them.

Enough small talk; let's get to the topic.

-----

The use of anchors is the core innovation of Faster R-CNN. Faster R-CNN predefines a set of anchors at relatively fixed positions, which correspond to bounding boxes (bboxes) in the original image; the RPN network then performs foreground/background classification and position regression on them to obtain the final proposals. Everything after that is just Fast R-CNN. Generally speaking, when applying Faster R-CNN, a good anchor setting can play a critical role.

But how exactly does an anchor correspond to a proposal? If you only read the paper and never look at the code, this point is quite confusing, and I have not found a particularly good technical blog covering it. I used to assume it was so simple that nobody bothered to discuss it, but after asking around, few people could give a satisfying answer. I have always believed that for an algorithm, understanding how it works is more important than being able to tune its code: only by grasping the essence can you adapt it to your own circumstances and make it work for you. With that in mind, this post explains how anchors correspond to bboxes in the original image, and the caveats that come with this process.

1. How are the corresponding bboxes in the original image determined?

Given an anchor setting, the corresponding bboxes in the original image are fixed by just two factors:

  1. the position of each bbox's center point;
  2. the height and width of each bbox.

The center point of each bbox is determined by the size of the feature map produced by the last convolutional layer of the backbone. If the feature map is 50 * 38, the original image is divided into a 50 * 38 grid, and the center of each grid cell is a bbox center.
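As a concrete illustration, here is a minimal NumPy sketch of the center computation. It assumes a total backbone stride of 16 (the downsampling factor of the VGG-16 backbone used in the paper); the exact offset convention (cell center vs. cell corner) varies between implementations.

```python
import numpy as np

# Feature map of 50 x 38 after the backbone; assume a total stride of 16.
feat_w, feat_h, stride = 50, 38, 16

# Each feature-map cell maps back to a stride x stride patch of the
# original image; the center of that patch is used as the bbox center.
cx = (np.arange(feat_w) + 0.5) * stride
cy = (np.arange(feat_h) + 0.5) * stride
centers = np.stack(np.meshgrid(cx, cy), axis=-1).reshape(-1, 2)
print(centers.shape)  # (1900, 2): one center per cell, 50 * 38 = 1900
```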

The height & width of the bboxes are determined by the scale and aspect ratio of the anchors. In the original Faster R-CNN paper, the reference areas are [128^2, 256^2, 512^2] and the aspect ratios are [0.5, 1, 2], giving the heights and widths of 9 bboxes at each position. The height and width of each bbox are determined as follows:

  1. Pick an area from the reference areas, e.g. area = 128^2.
  2. Pick a ratio from the aspect ratios, e.g. height/width = 2.
  3. Let the width of the bbox be w, so the height is 2w; then 2w^2 = 128^2, which gives w ≈ 90.5 and height ≈ 181.

(Note that many implementations first fix a base rectangle base_size, such as 16 * 16, and use scales of [8, 16, 32]; the scale multiplies the side length of the base rectangle, giving side lengths of 128, 256 and 512, so this is equivalent to the reference areas above.)
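The sketch below implements this height/width logic. The function name is my own, and real implementations differ in details (py-faster-rcnn's generate_anchors, for example, rounds to integers), so treat it as illustrative:

```python
import numpy as np

def anchor_sizes(areas=(128**2, 256**2, 512**2), ratios=(0.5, 1.0, 2.0)):
    """(w, h) for every combination of reference area and aspect ratio.

    With ratio = height / width and a fixed area:
        w * h = area and h = ratio * w  =>  w = sqrt(area / ratio).
    """
    return np.array([(np.sqrt(a / r), r * np.sqrt(a / r))
                     for a in areas for r in ratios])

print(anchor_sizes().round(1))
# The area = 128^2, ratio = 2 row gives w ~ 90.5, h ~ 181.0, matching
# the worked example above.
```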


Once the center points and the heights & widths are determined, the positions of all bboxes in the original image are fixed. That is to say, the bbox centers are determined by the feature map size of the backbone, and the heights & widths by the areas and aspect ratios set for the anchors.
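Putting the two factors together, here is a self-contained sketch (under the same stride-16 assumption) that enumerates every bbox in original-image coordinates:

```python
import numpy as np

def all_anchors(feat_w, feat_h, stride=16,
                areas=(128**2, 256**2, 512**2), ratios=(0.5, 1.0, 2.0)):
    """All bboxes as (x1, y1, x2, y2) in original-image coordinates."""
    # the 9 (w, h) shapes from the areas and aspect ratios
    wh = np.array([(np.sqrt(a / r), r * np.sqrt(a / r))
                   for a in areas for r in ratios])               # (9, 2)
    # one center per feature-map cell, mapped back through the stride
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    ctr = np.stack(np.meshgrid(cx, cy), axis=-1).reshape(-1, 1, 2)
    half = wh / 2.0
    return np.concatenate([ctr - half, ctr + half], axis=-1).reshape(-1, 4)

print(all_anchors(50, 38).shape)  # (17100, 4): 50 * 38 * 9 bboxes
```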


2. What should be kept in mind when setting anchors?

Once the initial bboxes are determined, the RPN network can be trained. Two preparation steps come first (both are sketched in code after the list):

1. Compute the IoU between the bboxes and the ground truth. An anchor whose IoU exceeds an upper threshold (0.7 in the paper) is labeled foreground, one below a lower threshold (0.3) is labeled background, and the rest are ignored (used to train the RPN classification branch).

2. Compute the offsets between each bbox and its nearest ground truth (used to train the RPN regression branch).
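A sketch of both steps, using the (tx, ty, tw, th) parameterization from the paper. The 0.7/0.3 thresholds are the paper's defaults, and the helper names are my own:

```python
import numpy as np

def iou(anchors, gt):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4)."""
    tl = np.maximum(anchors[:, None, :2], gt[None, :, :2])  # intersection top-left
    br = np.minimum(anchors[:, None, 2:], gt[None, :, 2:])  # intersection bottom-right
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (anchors[:, 2:] - anchors[:, :2]).prod(axis=1)
    area_g = (gt[:, 2:] - gt[:, :2]).prod(axis=1)
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def rpn_targets(anchors, gt, fg_thresh=0.7, bg_thresh=0.3):
    """Labels and (tx, ty, tw, th) offsets for training the RPN.

    (The paper additionally marks the highest-IoU anchor of each ground
    truth as foreground; that detail is omitted here for brevity.)
    """
    overlaps = iou(anchors, gt)              # (N, M)
    best = overlaps.argmax(axis=1)           # nearest ground truth per anchor
    best_iou = overlaps.max(axis=1)
    labels = np.full(len(anchors), -1)       # -1 = ignored during training
    labels[best_iou < bg_thresh] = 0         # background
    labels[best_iou >= fg_thresh] = 1        # foreground
    m = gt[best]                             # matched ground truth per anchor
    wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
    wg, hg = m[:, 2] - m[:, 0], m[:, 3] - m[:, 1]
    xg, yg = m[:, 0] + wg / 2, m[:, 1] + hg / 2
    deltas = np.stack([(xg - xa) / wa, (yg - ya) / ha,
                       np.log(wg / wa), np.log(hg / ha)], axis=1)
    return labels, deltas
```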

After the RPN produces proposals, they are mapped back to bboxes in the original image (the predicted offsets are applied to the anchors). What follows is the standard Fast R-CNN routine: the new bboxes are projected onto the conv feature map in proportion to the stride to find the corresponding region, and the RoI pooling layer transforms each proposal's feature-map region into a uniform size for per-category classification and position regression.
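Decoding is just the inverse of the regression parameterization above; a sketch (real implementations also clip the resulting boxes to the image boundary and apply NMS before passing the proposals on):

```python
import numpy as np

def apply_deltas(anchors, deltas):
    """Apply predicted (tx, ty, tw, th) offsets to anchors, yielding
    proposal boxes (x1, y1, x2, y2) in original-image coordinates."""
    wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
    xc, yc = deltas[:, 0] * wa + xa, deltas[:, 1] * ha + ya
    w, h = np.exp(deltas[:, 2]) * wa, np.exp(deltas[:, 3]) * ha
    return np.stack([xc - w / 2, yc - h / 2,
                     xc + w / 2, yc + h / 2], axis=1)
```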

With the above in mind, there are a few things to pay attention to when tuning Faster R-CNN:

1. Set the anchors according to the characteristics of your own targets, and don't blindly copy the original settings. For example, if the targets are people, an aspect ratio (height/width) of 0.5 is rare, so don't spend anchors on it; be flexible (see the snippet after this list).

2. Pay attention to the depth of the backbone; deeper is not always better. With a stride of 16, one node of the conv feature map corresponds to a region of at least 16 * 16 pixels in the original image, so if the target is small (say, 12 pixels wide, less than one feature-map cell), the feature map simply cannot provide a suitable corresponding region. Conversely, the shallower the layer, the higher the proposal resolution, but the model capacity may be limited and the number of bboxes grows. The depth of the backbone is therefore a trade-off against the size of the target objects. A relatively good solution is to introduce the idea of FPN (feature pyramid network) into Faster R-CNN (the idea is simple; if interested, read the original paper directly).
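To make the first point concrete, here is an illustrative snippet reusing the hypothetical all_anchors helper from section 1; the areas and ratios are placeholders that would need tuning on real data:

```python
# For pedestrians, drop the wide 0.5 ratio and favor tall shapes
# (ratio = height / width); illustrative values only.
person_anchors = all_anchors(50, 38, stride=16,
                             areas=(64**2, 128**2, 256**2),
                             ratios=(1.0, 2.0, 3.0))
```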

-----

Alright, that is about it for this post. If anything is wrong, please point it out. As my knowledge develops and accumulates, my views on the problem keep spiraling upward, so this post will be continuously updated to check for omissions and fill in gaps.

The next article will introduce how, after the anchors have been mapped to bboxes, classification and regression produce the final proposals.


References:

1. Faster R-CNN paper: https://arxiv.org/pdf/1506.01497.pdf
2. "Read Faster RCNN in one article" (Zhihu): https://zhuanlan.zhihu.com/p/31426458
