Detailed Architecture of Region Proposal Network (RPN)

Introduction

If you are reading this article [1], I assume you have heard of the RCNN family for object detection, and if so, you have probably come across RPNs (Region Proposal Networks). If you are new to the RCNN series, I highly recommend reading an introduction to it before delving into RPNs.

In an object detection algorithm, the goal is to generate candidate boxes, that is, boxes that may contain an object. These boxes are then localized by a bounding box regression method and assigned to their respective categories by a classifier.

In earlier object detection algorithms, these candidate boxes were generated with traditional computer vision techniques. One such method is "selective search", but its disadvantage is that it is an offline algorithm and computationally intensive.

This is where the RPN (Region Proposal Network) comes in: it generates box proposals in a very short time and, most importantly, it can be plugged into any object detection network, which makes it useful for any object detection model.

RPN

Just as CNNs learn to classify from feature maps, RPNs learn to generate proposals from feature maps. A typical region proposal network is shown in the following figure.

[Figure: block diagram of a typical region proposal network]

Let us understand the above block diagram step by step.

Step 1

In the first step, the input image goes through a convolutional neural network, and the last layer produces a feature map as output.

Step 2

In this step, a sliding window is run over the feature map obtained in the previous step. The size of the sliding window is n×n (3×3 here). For each sliding-window position, a specific set of anchor boxes is generated, with 3 different aspect ratios (1:1, 1:2, 2:1) and 3 different scales (128, 256 and 512).

So, with 3 different aspect ratios and 3 different scales, there are a total of 9 possible proposals per feature-map position. For a feature map of size W×H with K anchors per position, the total number of anchor boxes is W×H×K.

The image below shows the 9 anchor boxes at position (450, 350) for an image of size (600, 900).

[Figure: 9 anchor boxes at position (450, 350) on an image of size (600, 900)]

In the image above, the three colors represent three scales or sizes: 128×128, 256×256, 512×512.

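Let us single out the brown boxes/anchors (the innermost boxes in the image above). The three boxes have height-to-width ratios of 1:1, 1:2 and 2:1.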
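The anchor set for one position can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function name `make_anchors` is hypothetical, (450, 350) is assumed to be an (x, y) center, and each anchor is assumed to keep an area of scale × scale while its height-to-width ratio changes.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes centred on (cx, cy).

    ratio is height / width (1:1, 1:2, 2:1). Assumption: every anchor
    keeps an area of scale * scale, whatever its ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # width shrinks as the box gets taller
            h = s * math.sqrt(r)   # height grows by the same factor
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(450, 350)
for box in anchors:
    print([round(v, 1) for v in box])
```

The sketch does not clip anchors to the image, so the largest boxes can extend past the image border.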

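We now have 9 anchor boxes for each position of the feature map. But many of these boxes may not contain any object, so the model needs to learn which anchor boxes are likely to contain an object. Anchor boxes that contain an object are classified as foreground, and the rest as background. At the same time, the model needs to learn offsets for the foreground boxes so that they can be adjusted to fit the object. This brings us to the next step.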

Step 3

The localization and classification of the anchor boxes are done by the Bounding Box Regressor layer and the Bounding Box Classifier layer.

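The Bounding Box Classifier is trained with labels derived from the IoU score between the ground truth boxes and the anchor boxes, and it classifies each anchor box as foreground or background with a certain probability.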
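As a rough sketch of how IoU can turn anchors into foreground/background training labels: the helper names below are hypothetical, and the thresholds 0.7 and 0.3 are assumptions taken from the Faster R-CNN paper rather than from this article.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored in training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1  # ambiguous overlap: assumed to be left out of the loss

gt_boxes = [(100, 100, 228, 228)]           # made-up ground truth box
print(label_anchor((110, 110, 238, 238), gt_boxes))  # large overlap
print(label_anchor((400, 400, 528, 528), gt_boxes))  # no overlap
print(label_anchor((150, 100, 278, 228), gt_boxes))  # partial overlap
```

Anchors that fall between the two thresholds are neither clearly foreground nor clearly background, which is why this sketch marks them as ignored.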

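The Bounding Box Regressor layer learns the offsets (or differences) between the x, y, w, h values of the ground truth box and those of each anchor box classified as foreground, where (x, y) is the center of the box and w and h are its width and height.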
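One common way to express these offsets is the parameterization used in the Faster R-CNN paper: center shifts scaled by the anchor size, and log ratios for width and height. The sketch below assumes that parameterization; the function names `encode` and `decode` are hypothetical.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def encode(anchor, gt):
    """Offsets (tx, ty, tw, th) the regressor is trained to predict."""
    xa, ya, wa, ha = to_center(anchor)
    x, y, w, h = to_center(gt)
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(anchor, t):
    """Apply predicted offsets to an anchor to get the refined box."""
    xa, ya, wa, ha = to_center(anchor)
    tx, ty, tw, th = t
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

anchor = (100, 100, 228, 228)   # made-up foreground anchor
gt = (110, 120, 238, 248)       # made-up ground truth box
offsets = encode(anchor, gt)
print(offsets)
print(decode(anchor, offsets))
```

Encoding and then decoding returns the ground truth box, which is a quick sanity check that the two functions are inverses.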

Since the RPN is a model, and every model has a cost function to train against, the RPN has one too. The loss or cost function of the RPN can be written as

[Figure: RPN loss function (two equation images, not preserved)]
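For reference, the multi-task loss given in the Faster R-CNN paper has the form below. It is assumed here to be the loss the equation images showed; the symbols follow the paper, not this article.

$$
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
$$

Here $i$ indexes the anchors, $p_i$ is the predicted probability that anchor $i$ is foreground, $p_i^*$ is its label (1 for foreground, 0 for background), $t_i$ are the predicted box offsets and $t_i^*$ the target offsets. $L_{cls}$ is a log loss over the two classes, $L_{reg}$ is a smooth L1 loss, and the factor $p_i^*$ means the regression term only counts for foreground anchors. $N_{cls}$ and $N_{reg}$ are normalization terms and $\lambda$ balances the two parts.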

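Note: the RPN does not care what the final class of the object is (e.g. cat, dog, car or person). It only cares whether it is a foreground object or background.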

Example

Let us describe the whole concept of the RPN with an example.

So, if we have an image of size 600×800, after passing through a convolutional neural network (CNN) block the input is reduced to a 38×56 feature map with 9 anchor boxes per feature-map position. That gives 38 × 56 × 9 = 19,152 proposals or anchor boxes to consider. Each anchor box has two possible labels (foreground or background). If we set the depth of the feature map to 18 (9 anchors × 2 labels), we get a vector for each anchor with two values representing foreground and background (often called logits). If we feed the logits into a softmax/logistic regression activation function, it will predict the label.

The exact feature map size depends on the CNN used. Suppose instead that the 600×800 image is scaled down by a factor of about 16 to a 39×51 feature map after applying the CNN. Each location in the feature map still has 9 anchors with two possible labels each, so the same depth of 18 applies, now over 39 × 51 × 9 = 17,901 anchors. The training data now contains features and labels, and the model is trained on them.
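The arithmetic in this example, and the step from logits to a label, can be checked with a short Python sketch. The helper names and the logit values are made up for illustration.

```python
import math

def anchor_count(feat_h, feat_w, k=9):
    """Total anchors for a feat_h x feat_w feature map with k per position."""
    return feat_h * feat_w * k

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(anchor_count(38, 56))  # 38 * 56 * 9
print(anchor_count(39, 51))  # 39 * 51 * 9
print(9 * 2)                 # output depth: 9 anchors x 2 labels

# Made-up logits for one anchor: [background, foreground].
probs = softmax([0.5, 2.0])
label = "foreground" if probs[1] > probs[0] else "background"
print(probs, label)
```

With two classes, a softmax over the pair of logits is equivalent to a logistic (sigmoid) function of their difference, which is why the text mentions both options.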

Summary

The output of a Region Proposal Network (RPN) is a set of boxes/proposals, which are passed to classifiers and regressors that finally check for the occurrence of objects. In short, the RPN predicts the likelihood of an anchor being background or foreground and refines the anchor.

Reference

[1] Source: https://towardsmachinelearning.org/region-proposal-network/

Origin blog.csdn.net/swindler_ice/article/details/130997394