Mask RCNN - Principle Analysis (1)

Mask RCNN principle:

Briefly, Mask R-CNN is a two-stage framework. The first stage scans the image and generates proposals (regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN, proposed last year, extends Faster R-CNN, an earlier framework from the same authors. Faster R-CNN is a popular object detection framework, and Mask R-CNN turns it into an instance segmentation framework.

The main building blocks of Mask R-CNN:

1. Backbone Architecture

Simplified illustration of backbone network

This is a standard convolutional neural network (usually ResNet50 or ResNet101) that serves as a feature extractor. The lower layers detect low-level features (edges, corners, etc.), and the higher layers detect higher-level features (cars, people, sky, etc.).

Through the forward propagation of the backbone network, the image is transformed from a tensor of 1024x1024x3 (RGB) into a feature map of shape 32x32x2048. This feature map will serve as the input for the next stage.
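
The shape change above follows from the downsampling strides of the ResNet stages. The sketch below works through that arithmetic, assuming the standard ResNet-50/101 layout (stages C2 to C5 with strides 4, 8, 16, 32). The names `RESNET_STAGES` and `stage_shapes` are made up for this illustration.

```python
# Stride and channel count of each ResNet stage (standard ResNet-50/101 layout).
RESNET_STAGES = {
    "C2": (4, 256),
    "C3": (8, 512),
    "C4": (16, 1024),
    "C5": (32, 2048),
}


def stage_shapes(height, width):
    """Return the (height, width, channels) of every stage's feature map."""
    return {
        name: (height // stride, width // stride, channels)
        for name, (stride, channels) in RESNET_STAGES.items()
    }


shapes = stage_shapes(1024, 1024)
print(shapes["C5"])  # the last stage, i.e. the feature map described above
```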

Feature Pyramid Network (FPN)

Source: Feature Pyramid Networks for Object Detection

The backbone described above can be further improved. The Feature Pyramid Network (FPN), introduced by the same authors as Mask R-CNN, extends the backbone so that it better represents objects at multiple scales.

FPN improves on the standard feature extraction pyramid by adding a second pyramid that takes the high-level features from the first pyramid and passes them down to the lower layers. This gives the features at every level access to both high-level and low-level information.
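
The top-down pass can be sketched roughly as follows. This is a simplified illustration, not the real FPN: it uses single-channel feature maps stored as lists of lists, and it leaves out the 1x1 lateral convolutions and the 3x3 smoothing convolutions of the actual network. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of the same size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_pyramid(bottom_up):
    """Merge bottom-up maps (ordered fine to coarse, each half the previous size).

    The coarsest map is passed down level by level: it is upsampled and added
    to the next finer map, so every output level mixes high- and low-level
    information. The result is returned in the same fine-to-coarse order.
    """
    merged = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]


c2 = [[1] * 4 for _ in range(4)]   # fine, low-level map
c3 = [[10] * 2 for _ in range(2)]
c4 = [[100]]                       # coarse, high-level map
p2, p3, p4 = top_down_pyramid([c2, c3, c4])
```

In this toy input every cell of `p2` ends up holding a contribution from all three levels, which is the point of the second pyramid.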

The ResNet101+FPN backbone network is used in our Mask R-CNN implementation.

2. Region Proposal Network (RPN)

Simplified diagram showing 49 anchor boxes

RPN is a lightweight neural network that scans an image with a sliding window and looks for regions where objects exist.

The regions that the RPN scans are called anchors: boxes distributed over the image area, as shown in the image above. This is only a simplified diagram. In practice there are about 200,000 anchors of different sizes and aspect ratios, and they overlap to cover as much of the image as possible.
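
To make the idea of anchors concrete, here is a minimal sketch of how such a grid could be generated: one anchor per scale and aspect ratio, centered on every cell of a feature map. The function name, the `(y1, x1, y2, x2)` box layout, and the parameter values in the example are assumptions chosen for illustration, not the settings of any particular implementation.

```python
import math


def generate_anchors(feature_h, feature_w, stride, scales, ratios):
    """Return anchors as (y1, x1, y2, x2) boxes in image pixels.

    One anchor per (scale, ratio) pair is centered on every feature-map cell.
    `ratio` is width / height; `scale` is the side of the equivalent square.
    """
    anchors = []
    for fy in range(feature_h):
        for fx in range(feature_w):
            # Center of this feature-map cell in image coordinates.
            cy = (fy + 0.5) * stride
            cx = (fx + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    h = scale / math.sqrt(ratio)
                    w = scale * math.sqrt(ratio)
                    anchors.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return anchors


# A 7x7 grid with a single scale and ratio gives 49 anchors,
# the same count as in the simplified diagram.
grid = generate_anchors(7, 7, stride=16, scales=[64], ratios=[1.0])
print(len(grid))
```

Adding more scales, more ratios, and larger feature maps multiplies the count quickly, which is how the total reaches the hundreds of thousands.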

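How fast can the RPN scan that many anchors? Very fast. The sliding window is implemented by the RPN's convolutions, so all regions can be scanned in parallel on a GPU. In addition, the RPN does not scan the image directly; it scans the backbone feature map. This lets the RPN reuse the extracted features efficiently and avoid redundant computation. With these optimizations, the RPN can finish its scan within 10 ms, according to the Faster R-CNN paper that introduced it. In Mask R-CNN we usually use higher-resolution images and more anchors, so the scan may take longer.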

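The RPN generates two outputs for each anchor:

Anchor class: foreground or background (FG/BG). The foreground class means that there is probably an object in that anchor box.
Bounding box refinement: a foreground anchor (also called a positive anchor) may not be centered perfectly on the object. So the RPN estimates a delta (a percentage change in x, y, width, and height) to refine the anchor box so that it fits the object better.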
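
A minimal sketch of how such a delta could be applied to an anchor is shown below. It assumes the parameterization commonly used in the Faster R-CNN family: the center shift is a fraction of the anchor's height and width, and the size change is expressed in log space. That parameterization, the `(y1, x1, y2, x2)` box layout, and the function name are assumptions for illustration; the text above only says the delta is a relative change.

```python
import math


def apply_deltas(box, deltas):
    """Refine a box (y1, x1, y2, x2) with deltas (dy, dx, dh, dw)."""
    y1, x1, y2, x2 = box
    dy, dx, dh, dw = deltas
    height, width = y2 - y1, x2 - x1
    center_y = y1 + 0.5 * height
    center_x = x1 + 0.5 * width
    # Shift the center by a fraction of the box size.
    center_y += dy * height
    center_x += dx * width
    # Scale height and width (assumed log-space deltas).
    height *= math.exp(dh)
    width *= math.exp(dw)
    return (
        center_y - 0.5 * height,
        center_x - 0.5 * width,
        center_y + 0.5 * height,
        center_x + 0.5 * width,
    )


# A zero delta leaves the anchor unchanged; dy = 0.1 moves it down by 10% of its height.
unchanged = apply_deltas((0, 0, 10, 10), (0, 0, 0, 0))
shifted = apply_deltas((0, 0, 10, 10), (0.1, 0, 0, 0))
```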

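Using the RPN predictions, we pick the anchors that are most likely to contain objects and refine their location and size. If several anchors overlap too much, we keep the one with the highest foreground score and discard the rest (non-maximum suppression). This gives us the final region proposals, which we pass to the next stage.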
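
Non-maximum suppression can be sketched as a greedy loop over the boxes sorted by score. This is a plain illustration rather than an optimized implementation; the function names and the default threshold value are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (y1, x1, y2, x2) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """Return indices of the boxes to keep, highest score first.

    A box is dropped when it overlaps an already kept, higher-scoring box
    by more than `iou_threshold` (an assumed default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores, iou_threshold=0.5)
```

In this example the second box overlaps the first with an IoU of about 0.68, so with a threshold of 0.5 it is suppressed and only the first and third boxes survive.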

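3. ROI Classifier and Bounding Box Regressor

This stage runs on the ROIs proposed by the RPN. Just like the RPN, it generates two outputs for each ROI:

Illustration of stage 2. Source: Fast R-CNN

Class: the class of the object in the ROI. Unlike the RPN, which has only two classes (foreground or background), this network is deeper and can classify regions into specific classes (person, car, chair, etc.). It can also produce a background class, in which case the ROI is discarded.
Bounding box refinement: much like in the RPN, its purpose is to further refine the location and size of the bounding box so that it encloses the object.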
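
The class output can be illustrated with a small sketch: turn each ROI's scores into probabilities, pick the most likely class, and drop the ROI if that class is background. The function names, the use of index 0 for the background class, and the example scores are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify_rois(roi_logits, background_id=0):
    """Return (roi_index, class_id, probability) for every non-background ROI."""
    results = []
    for index, logits in enumerate(roi_logits):
        probs = softmax(logits)
        class_id = max(range(len(probs)), key=probs.__getitem__)
        if class_id != background_id:  # background ROIs are discarded
            results.append((index, class_id, probs[class_id]))
    return results


# Two ROIs, three classes (0 = background, 1 and 2 = object classes).
detections = classify_rois([[5.0, 0.0, 0.0], [0.0, 3.0, 1.0]])
```

Here the first ROI is classified as background and removed, and the second is kept with class 1.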

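ROI Pooling

Before we continue, there is a problem to solve. Classifiers do not handle variable input sizes well; they typically require a fixed input size. But because of the bounding box refinement step in the RPN, ROI boxes can have different sizes. That is the problem ROI pooling solves.

The feature map shown in the figure comes from a lower layer.

ROI pooling means cropping a part of the feature map and resizing it to a fixed size. In principle this is similar to cropping an image and then rescaling it, although the implementation details differ.

The authors of Mask R-CNN propose a method called ROIAlign, which samples the feature map at different points and applies bilinear interpolation. In our implementation, for simplicity, we use TensorFlow's crop_and_resize function for this step.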
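
The general idea of cropping a region and resampling it to a fixed size with bilinear interpolation can be sketched in plain Python. This is an illustration of the technique on a single-channel map, not a reimplementation of ROIAlign or of TensorFlow's function; the names and the choice of a corner-aligned sampling grid are assumptions.

```python
def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map at a fractional (y, x) position."""
    height, width = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def crop_and_resize(fmap, box, out_h, out_w):
    """Crop box (y1, x1, y2, x2), in feature-map coordinates, to out_h x out_w."""
    y1, x1, y2, x2 = box
    out = []
    for i in range(out_h):
        # Spread the sample rows evenly between the box edges.
        sy = y1 + (y2 - y1) * i / (out_h - 1) if out_h > 1 else (y1 + y2) / 2
        row = []
        for j in range(out_w):
            sx = x1 + (x2 - x1) * j / (out_w - 1) if out_w > 1 else (x1 + x2) / 2
            row.append(bilinear_sample(fmap, sy, sx))
        out.append(row)
    return out


fmap = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
pooled = crop_and_resize(fmap, (0, 0, 2, 2), 2, 2)
```

Whatever the size of the input box, the output always has the requested fixed size, which is what the classifier needs.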

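4. Segmentation Masks

Up to the end of section 3, what we have is exactly a Faster R-CNN for object detection. The mask network is the addition that the Mask R-CNN paper introduces.

The mask branch is a convolutional network that takes the positive regions selected by the ROI classifier as input and generates masks for them. The generated masks are low resolution: 28x28 pixels. But they are soft masks, represented by floating-point numbers, so they hold more detail than binary masks. The small mask size helps keep the mask branch lightweight. During training, we scale the ground-truth masks down to 28x28 to compute the loss. During inference, we scale the predicted masks up to the size of the ROI bounding box, which gives the final masks, one per object.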
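
The inference step can be sketched as follows: resize the small soft mask to the size of its bounding box, threshold it, and paste it into a full-size image mask. To keep the example short it uses nearest-neighbour resizing (a real implementation would more likely interpolate) and a tiny 2x2 mask in place of 28x28. The function names and the 0.5 threshold are assumptions.

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a 2-D mask with nearest-neighbour sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def paste_mask(soft_mask, box, image_h, image_w, threshold=0.5):
    """Place a soft mask into a binary image-size mask.

    `box` is (y1, x1, y2, x2) in integer pixel coordinates inside the image.
    """
    y1, x1, y2, x2 = box
    full = [[0] * image_w for _ in range(image_h)]
    resized = resize_nearest(soft_mask, y2 - y1, x2 - x1)
    for i, row in enumerate(resized):
        for j, value in enumerate(row):
            full[y1 + i][x1 + j] = 1 if value >= threshold else 0
    return full


soft = [[0.9, 0.1], [0.2, 0.8]]
final_mask = paste_mask(soft, (1, 1, 5, 5), image_h=6, image_w=6)
```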
