Faster-RCNN Code Walkthrough 7: Main Files, Part 3

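Preface

I recently decided to try reproducing Faster-RCNN. To be clear, I am not skilled enough to reproduce all of the code on my own, so this series follows someone else's implementation and adds my own reading of it.

The code comes from a Bilibili uploader who has published everything on GitHub. Both links are below (this should not count as infringement, since the code is open source):

Bilibili link: https://www.bilibili.com/video/BV1of4y1m7nj/?vd_source=afeab8b555e5eb1bfa1e7f267262cbf2

GitHub link: https://github.com/WZMIAOMIAO/deep-learning-for-image-processing

Purpose

The uploader has already explained the code very well in video form. I still prefer learning from blog posts, though, and the videos run to about six hours, which makes it easy to doze off while watching. That is why I am writing these posts as study notes.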

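Posts so far

Part 1: A Detailed Introduction to the VOC Dataset

Part 2: Faster-RCNN Code Walkthrough 2: Quick Start

Part 3: Faster-RCNN Code Walkthrough 3: Building Your Own Data Loader

Part 4: Faster-RCNN Code Walkthrough 4: Helper Files

Part 5: Faster-RCNN Code Walkthrough 5: Main Files, Part 1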

Part 6: Faster-RCNN Code Walkthrough 6: Main Files, Part 2

Part 7: Faster-RCNN Code Walkthrough 7: Main Files, Part 3 (this post)

1. Introduction:

The previous posts covered most of the files in the project. What remains is the RPN part and one helper file.

This post walks through those two files:

det_utils.py
rpn_function.py

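2. The det_utils.py file:

2.1 The smooth_l1_loss function:

We start with the simplest function. It defines the smooth L1 function used by the Faster-RCNN loss.

[Image: the smooth L1 loss formula]

Unlike the original version, the smooth L1 defined here introduces a β parameter: the loss is quadratic where the absolute error is below β and linear elsewhere.

# Smooth L1 loss with an extra beta parameter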
n = torch.abs(input - target)
# cond = n < beta
cond = torch.lt(n, beta)
loss = torch.where(cond, 0.5 * n ** 2 / beta, n - 0.5 * beta)
if size_average:
    return loss.mean()
return loss.sum()
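
To make the effect of β concrete, here is a minimal scalar sketch of the same piecewise rule in plain Python (no torch). The function name smooth_l1 and the sample values are illustrative assumptions and are not part of the project:

```python
def smooth_l1(pred, target, beta=1.0 / 9):
    """Scalar smooth L1 with a beta parameter (illustrative helper, not from det_utils.py)."""
    n = abs(pred - target)
    if n < beta:
        # quadratic region: small errors are penalized gently
        return 0.5 * n * n / beta
    # linear region: large errors grow linearly
    return n - 0.5 * beta


small = smooth_l1(0.0, 0.5, beta=1.0)   # falls in the quadratic branch
large = smooth_l1(0.0, 2.0, beta=1.0)   # falls in the linear branch
```

At n == beta both branches evaluate to 0.5 * beta, so the function stays continuous wherever the switch point is placed. A smaller β simply shrinks the quadratic region.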

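2.2 The Matcher class:

The main job of this class is to split anchors into positive and negative samples.

It first defines a few class-level constants:

# Labels for the two kinds of unmatched anchors:
# below the low threshold: negative sample, labelled -1
# between the two thresholds: ignored sample, labelled -2
BELOW_LOW_THRESHOLD = -1
BETWEEN_THRESHOLDS = -2

__annotations__ = {
    'BELOW_LOW_THRESHOLD': int,
    'BETWEEN_THRESHOLDS': int,
}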

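The __init__ method:

Parameters:

Parameter	Meaning
high_threshold	The higher of the two thresholds; an IoU at or above it gives a positive sample
low_threshold	The lower of the two thresholds; an IoU below it gives a negative sample
allow_low_quality_matches	Whether each gt_box also keeps its highest-IoU anchor as a positive sample; defaults to False

The third parameter deserves an explanation, which also covers how the class works. Normally each anchor is matched against the gt_boxes: anchors whose best IoU is below 0.3 (the threshold used in the paper) are negative, those between 0.3 and 0.7 are ignored, and those at or above 0.7 are positive. There is one more way to become positive: the anchor with the highest IoU for a given gt_box is also positive. This matters because a gt_box can otherwise end up with no anchor at all, for example when every anchor that overlaps it falls into the ignored range.

The allow_low_quality_matches parameter controls whether this second rule is applied.

The initializer is simple and only stores these values: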

self.BELOW_LOW_THRESHOLD = -1
self.BETWEEN_THRESHOLDS = -2
assert low_threshold <= high_threshold
self.high_threshold = high_threshold  # 0.7
self.low_threshold = low_threshold    # 0.3
self.allow_low_quality_matches = allow_low_quality_matches

The __call__ method:

This method matches gt_boxes to anchors and labels each anchor as a positive, negative, or ignored sample.

The comments explain the code:

# An empty IoU matrix means something is wrong with the input
if match_quality_matrix.numel() == 0:
    # Raise an error that says which side is empty
    if match_quality_matrix.shape[0] == 0:
        raise ValueError(
            "No ground-truth boxes available for one of the images "
            "during training")
    else:
        raise ValueError(
            "No proposal boxes available for one of the images "
            "during training")

# In the M x N matrix, each column holds the IoU of one anchor with every gt
# matched_vals is the maximum of each column, i.e. the best IoU of each anchor over all gts
# matches is the row index (gt index) of that maximum
matched_vals, matches = match_quality_matrix.max(dim=0)  # the dimension to reduce.
# If the low-quality rule is enabled
if self.allow_low_quality_matches:
    # keep a copy of the matches before thresholding
    all_matches = matches.clone()
else:
    all_matches = None

# Mask of anchors whose best IoU is below low_threshold
below_low_threshold = matched_vals < self.low_threshold
# Mask of anchors whose best IoU lies in [low_threshold, high_threshold)
between_thresholds = (matched_vals >= self.low_threshold) & (
    matched_vals < self.high_threshold
)
# Anchors below low_threshold get the label -1
matches[below_low_threshold] = self.BELOW_LOW_THRESHOLD  # -1

# Anchors in [low_threshold, high_threshold) get the label -2
matches[between_thresholds] = self.BETWEEN_THRESHOLDS    # -2

# Optionally restore the highest-IoU anchor of each gt_box
if self.allow_low_quality_matches:
    assert all_matches is not None
    self.set_low_quality_matches_(matches, all_matches, match_quality_matrix)

The argument match_quality_matrix can be thought of as the IoU matrix. It has shape M x N, with one row per gt_box and one column per anchor:

[Image: layout of the IoU matrix]

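The set_low_quality_matches_ method:

The idea: for each gt_box, find the anchor(s) with the highest IoU and keep them as matches even if that IoU is below the high threshold.

The comments explain the code:

# For each gt box, find the highest IoU it reaches with any anchor
# highest_quality_foreach_gt holds that maximum IoU
highest_quality_foreach_gt, _ = match_quality_matrix.max(dim=1)  # the dimension to reduce.

# Find the anchor indices that reach that maximum; several anchors can tie for one gt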
gt_pred_pairs_of_highest_quality = torch.where(
    torch.eq(match_quality_matrix, highest_quality_foreach_gt[:, None])
)
# Example gt_pred_pairs_of_highest_quality:
#   tensor([[    0, 39796],
#           [    1, 32055],
#           [    1, 32070],
#           [    2, 39190],
#           [    2, 40255],
#           [    3, 40390],
#           [    3, 41455],
#           [    4, 45470],
#           [    5, 45325],
#           [    5, 46390]])
# Each row is a (gt index, prediction index)
# Note how gt items 1, 2, 3, and 5 each have two ties

# The first element of the tuple returned by torch.where holds the gt indices (not needed)
# pre_inds_to_update = gt_pred_pairs_of_highest_quality[:, 1]
pre_inds_to_update = gt_pred_pairs_of_highest_quality[1]
# Restore the original gt index for these anchors, even if their IoU is below the thresholds
matches[pre_inds_to_update] = all_matches[pre_inds_to_update]
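
The two matching rules can be tried out on a tiny IoU matrix. The following is a minimal sketch in plain Python, assuming the 0.3 / 0.7 thresholds mentioned earlier; match_anchors and the sample matrix are hypothetical and only mirror the logic of Matcher, they are not the project's code:

```python
BELOW_LOW_THRESHOLD = -1
BETWEEN_THRESHOLDS = -2


def match_anchors(iou, low=0.3, high=0.7, allow_low_quality=False):
    """iou[g][a] is the IoU of gt g with anchor a; returns one label per anchor."""
    num_gt, num_anchors = len(iou), len(iou[0])
    best_gt = []   # best gt index per anchor, before thresholding
    matches = []
    for a in range(num_anchors):
        best = max(range(num_gt), key=lambda g: iou[g][a])
        best_gt.append(best)
        if iou[best][a] < low:
            matches.append(BELOW_LOW_THRESHOLD)
        elif iou[best][a] < high:
            matches.append(BETWEEN_THRESHOLDS)
        else:
            matches.append(best)
    if allow_low_quality:
        # for every gt, restore the anchors that reach its maximum IoU
        for g in range(num_gt):
            top = max(iou[g])
            for a in range(num_anchors):
                if iou[g][a] == top:
                    matches[a] = best_gt[a]
    return matches


iou = [[0.8, 0.1, 0.5, 0.2],   # gt 0 against four anchors
       [0.1, 0.2, 0.4, 0.6]]   # gt 1 against four anchors
strict = match_anchors(iou)
relaxed = match_anchors(iou, allow_low_quality=True)
```

Here gt 1 never reaches 0.7, so under the strict rule it has no positive anchor. With allow_low_quality=True its best anchor (index 3, IoU 0.6) is promoted from ignored to positive.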

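2.3 The BoxCoder class:

This class implements the encoding and decoding of the box regression parameters, i.e. it computes them from the regression formulas below:

[Image: the bounding-box regression formulas]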
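
For readers without the image, the regression formulas can be written out as in the Faster R-CNN paper. Here (x, y, w, h) is the predicted box, (x_a, y_a, w_a, h_a) the anchor and (x^*, y^*, w^*, h^*) the gt box, all in center + size form. Note that the code in this file additionally multiplies each parameter by the matching entry of weights, which this paper-style form leaves out:

```latex
\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, & t_w &= \log(w/w_a), & t_h &= \log(h/h_a) \\
t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, & t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a)
\end{aligned}
```

Solving the first row for the box gives the decoding direction: x = t_x w_a + x_a, y = t_y h_a + y_a, w = w_a exp(t_w), h = h_a exp(t_h).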

The __init__ method:

Parameters:

Parameter	Meaning
weights	Hyperparameters, four values of type Tuple[float, float, float, float]
bbox_xform_clip	Upper bound applied to dw and dh

The body just stores the two values:

self.weights = weights
self.bbox_xform_clip = bbox_xform_clip

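The encode method:

This method computes the regression parameters from the anchors and the gt_boxes matched to them.

Parameters:

Parameter	Meaning
reference_boxes	The gt_boxes matched to each proposal/anchor
proposals	The anchors/proposals

See the comments for details:

# Record the number of boxes per image so the concatenated result can be split again afterwards
boxes_per_image = [len(b) for b in reference_boxes]
# reference_boxes and proposals have the same structure (one tensor per image)
reference_boxes = torch.cat(reference_boxes, dim=0)
proposals = torch.cat(proposals, dim=0)

# Pass the gt boxes and the anchors to encode_single
targets = self.encode_single(reference_boxes, proposals)
# Split back into one tensor per image
return targets.split(boxes_per_image, 0)

The encode_single method is described next.

The encode_single method:

The parameters are the same as above, so straight to the code:

# Get the dtype and device
dtype = reference_boxes.dtype
device = reference_boxes.device
# Convert the weights to a tensor with the same dtype and device
weights = torch.as_tensor(self.weights, dtype=dtype, device=device)
# Encode
targets = encode_boxes(reference_boxes, proposals, weights)

The key function is encode_boxes, covered in Section 2.4.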

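The decode method:

decode is the inverse of encode. encode turns anchors and their matched gt_boxes into regression targets; decode takes the regression parameters predicted by the network and applies them to the anchors, which yields corrected box coordinates.

Parameters:

Parameter	Meaning
rel_codes	The predicted bbox regression parameters (same format as the return value of encode)
boxes	The anchors/proposals

The code is short and, like encode, delegates to another method:

# Check the argument types
assert isinstance(boxes, (list, tuple))
assert isinstance(rel_codes, torch.Tensor)
# Number of anchors per image; it can differ from run to run
boxes_per_image = [b.size(0) for b in boxes]
# Concatenate the boxes of the whole batch
concat_boxes = torch.cat(boxes, dim=0)

# Total number of anchors
box_sum = 0
for val in boxes_per_image:
    box_sum += val

# Apply the predicted regression parameters to the anchors to get the predicted box coordinates
pred_boxes = self.decode_single(
    rel_codes, concat_boxes
)

# Only reshape when there are boxes, so an empty pred_boxes does not raise an error
if box_sum > 0:
    pred_boxes = pred_boxes.reshape(box_sum, -1, 4)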

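The decode_single method:

This method applies the correction to the coordinates, following the formulas below:

[Image: the formulas that apply the regression parameters to a box]

The parameters were described above, so straight to the code:

First, cast the boxes to the dtype of the regression parameters and convert the anchors from corner form to center + width/height form (xa, ya, wa, ha in the formulas):

# Cast boxes to the same dtype as rel_codes
boxes = boxes.to(rel_codes.dtype)

# xmin, ymin, xmax, ymax
widths = boxes[:, 2] - boxes[:, 0]   # anchor/proposal width
heights = boxes[:, 3] - boxes[:, 1]  # anchor/proposal height
ctr_x = boxes[:, 0] + 0.5 * widths   # anchor/proposal center x
ctr_y = boxes[:, 1] + 0.5 * heights  # anchor/proposal center y

Next, extract the regression parameters and clamp them (tx, ty, tw, th in the formulas):

# Hyperparameters wx, wy, ww, wh
wx, wy, ww, wh = self.weights  # [1,1,1,1] in the RPN, [10,10,5,5] in Fast R-CNN
# Slicing with 0::4 keeps the result two-dimensional
dx = rel_codes[:, 0::4] / wx   # predicted regression parameter for center x
dy = rel_codes[:, 1::4] / wy   # predicted regression parameter for center y
dw = rel_codes[:, 2::4] / ww   # predicted regression parameter for width
dh = rel_codes[:, 3::4] / wh   # predicted regression parameter for height

# Clamp dw and dh from above (there is no lower bound) so that exp() cannot overflow
dw = torch.clamp(dw, max=self.bbox_xform_clip)
dh = torch.clamp(dh, max=self.bbox_xform_clip)

Then use the formulas to compute x, y, w, h, i.e. the corrected center and size:

# Apply the predictions to the anchors
# [:, None] adds a dimension so the shapes broadcast
pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
pred_w = torch.exp(dw) * widths[:, None]
pred_h = torch.exp(dh) * heights[:, None]

Finally, convert back from center + size form to corner form and stack the results:

# Convert from center form to top-left + bottom-right form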
# xmin
pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
# ymin
pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
# xmax
pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
# ymax
pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h

# Stack the four coordinates along a new last dimension, then flatten it
pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)

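2.4 The encode_boxes function:

First, the formulas again:

[Image: the bounding-box regression formulas]

This function evaluates these formulas and returns the result in a fixed layout.

First, read the weights, which are set by hand:

# Read the weight hyperparameters
wx = weights[0]
wy = weights[1]
ww = weights[2]
wh = weights[3]

Next, take the coordinates of the anchors/proposals and of the gt boxes, adding a dimension to each (so that every intermediate value has shape [N, 1]):

# Select each coordinate
# and add a trailing dimension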
proposals_x1 = proposals[:, 0].unsqueeze(1)
proposals_y1 = proposals[:, 1].unsqueeze(1)
proposals_x2 = proposals[:, 2].unsqueeze(1)
proposals_y2 = proposals[:, 3].unsqueeze(1)

reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)
reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)
reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)
reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)

Then convert the corner form (top-left + bottom-right) to center + width/height form:

# Center, width and height of the proposals
ex_widths = proposals_x2 - proposals_x1
ex_heights = proposals_y2 - proposals_y1
# parse coordinate of center point
ex_ctr_x = proposals_x1 + 0.5 * ex_widths
ex_ctr_y = proposals_y1 + 0.5 * ex_heights

# Center, width and height of the gt boxes
gt_widths = reference_boxes_x2 - reference_boxes_x1
gt_heights = reference_boxes_y2 - reference_boxes_y1
gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths
gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights

Finally, evaluate the formulas and concatenate the results:

# Evaluate the formulas
targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
targets_dw = ww * torch.log(gt_widths / ex_widths)
targets_dh = wh * torch.log(gt_heights / ex_heights)

# Concatenate into shape [N, 4]
targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)
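
Since decode is meant to undo encode, a round trip on a single box is a quick sanity check. The sketch below uses plain Python with math instead of torch; encode_box and decode_box are hypothetical helper names, and the clamping by bbox_xform_clip is left out on purpose:

```python
import math


def encode_box(gt, anchor, weights=(1.0, 1.0, 1.0, 1.0)):
    """Regression targets (dx, dy, dw, dh) that map `anchor` onto `gt`; boxes are (x1, y1, x2, y2)."""
    wx, wy, ww, wh = weights
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return (wx * (gx - ax) / aw, wy * (gy - ay) / ah,
            ww * math.log(gw / aw), wh * math.log(gh / ah))


def decode_box(deltas, anchor, weights=(1.0, 1.0, 1.0, 1.0)):
    """Inverse of encode_box: apply (dx, dy, dw, dh) to `anchor` (no clamping in this sketch)."""
    wx, wy, ww, wh = weights
    dx, dy, dw, dh = deltas[0] / wx, deltas[1] / wy, deltas[2] / ww, deltas[3] / wh
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    cx, cy = dx * aw + ax, dy * ah + ay
    w, h = math.exp(dw) * aw, math.exp(dh) * ah
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (0.0, 0.0, 10.0, 10.0)
gt = (2.0, 2.0, 12.0, 14.0)
deltas = encode_box(gt, anchor)          # (0.2, 0.3, 0.0, log(1.2))
restored = decode_box(deltas, anchor)    # should reproduce gt up to rounding
```

The weights only rescale the targets: encoding multiplies by them and decoding divides by them, so the round trip holds for any non-zero weights.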

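3. The rpn_function.py file:

For reference, here is the RPN part of the architecture:

[Image: the RPN architecture]

3.1 The RegionProposalNetwork class:

This is the main class of the file. It defines the RPN and ties the other pieces together.

The __init__ method:

Parameters:

Parameter	Meaning
anchor_generator	The module that generates the anchors
head	The RPN head
fg_iou_thresh	Foreground threshold (0.7), i.e. the positive-sample threshold
bg_iou_thresh	Background threshold (0.3), i.e. the negative-sample threshold
batch_size_per_image	Number of anchors (positive plus negative) sampled per image for the loss
positive_fraction	Fraction of positive samples among the sampled anchors
pre_nms_top_n	Number of proposals kept before NMS
post_nms_top_n	Number of proposals kept after NMS, i.e. the number of proposals the RPN outputs
nms_thresh	IoU threshold used by NMS
score_thresh	Score threshold used when filtering proposals

The method just initializes attributes:

# Initialize the attributes
self.anchor_generator = anchor_generator  # anchor generator
self.head = head  # RPN head
# Create a BoxCoder with the RPN weights
self.box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))

# Use box_iou as the similarity function
self.box_similarity = box_ops.box_iou

# Matcher: labels anchors as positive, negative, or ignored
self.proposal_matcher = det_utils.Matcher(
    fg_iou_thresh,  # IoU at or above fg_iou_thresh (0.7): positive sample
    bg_iou_thresh,  # IoU below bg_iou_thresh (0.3): negative sample
    allow_low_quality_matches=True
)

# Sampler: draws positive and negative samples in a fixed ratio
self.fg_bg_sampler = det_utils.BalancedPositiveNegativeSampler(
    batch_size_per_image, positive_fraction  # 256, 0.5
)

# Store the remaining parameters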
self._pre_nms_top_n = pre_nms_top_n
self._post_nms_top_n = post_nms_top_n
self.nms_thresh = nms_thresh
self.score_thresh = score_thresh
self.min_size = 1.

The forward method:

Next comes the forward pass. The rest of the file is explained by following this method.

Parameters:

Parameter	Meaning
images	The input images
features	The feature maps output by the backbone, as a dict
targets	The ground-truth information for each image

The code first extracts the values of features:

# features is an OrderedDict of all prediction feature maps; take its values
# Without FPN the dict has a single value of shape [batch, channel, h, w]
features = list(features.values())

The feature maps are then passed to the RPN head for classification and regression (see Section 3.2 for the RPN head), which returns the objectness scores and the regression parameters:

# Run the head to get the objectness scores and bbox regression parameters on each feature map
# objectness and pred_bbox_deltas are both lists
# Without FPN each list has a single element
objectness, pred_bbox_deltas = self.head(features)

Next, the anchors are generated (see Section 3.3 for details):

# Generate all anchors for the images in the batch
anchors = self.anchor_generator(images, features)

Then a few values are collected, such as the number of anchors:

# anchors is a list with batch_size elements
num_images = len(anchors)

# Number of anchors on each prediction feature map
# Without FPN there is only one level
# o.shape = [batch, c, h, w]; o[0].shape = [c, h, w]
num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
# The product of the three values is the number of anchors on that level
num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]

The outputs of the RPN head then need some reshaping (see Section 3.4 for concat_box_prediction_layers):

# Adjust the layout and shape of the tensors
objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness,
                                                            pred_bbox_deltas)

The decode method from Section 2.3 turns the anchors into proposals, which are then reshaped:

# Apply the predicted bbox regression parameters to the anchors to get the predicted box coordinates
proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
# Reshape to [num_images, -1, 4]
proposals = proposals.view(num_images, -1, 4)

The proposals are then filtered, e.g. by removing small boxes and running NMS (filter_proposals is covered below):

# Remove small boxes, run NMS, and keep the top post_nms_top_n proposals by score
boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)

Finally, the losses are computed:

# Compute the losses
losses = {}
if self.training:
    assert targets is not None
    # Find the best-matching gt for each anchor and label the anchors as foreground, background, or ignored
    labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
    # Compute the regression targets from the anchors and their matched gt boxes
    regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
    # Compute the losses
    loss_objectness, loss_rpn_box_reg = self.compute_loss(
        objectness, pred_bbox_deltas, labels, regression_targets
    )
    losses = {
        "loss_objectness": loss_objectness,
        "loss_rpn_box_reg": loss_rpn_box_reg
    }

The filter_proposals method:

This method removes small boxes, runs NMS, and keeps the top post_nms_top_n proposals by predicted score.

Parameters:

Parameter	Meaning
proposals	The predicted bbox coordinates
objectness	The predicted objectness scores
image_shapes	The size of each image in the batch
num_anchors_per_level	The number of anchors on each prediction feature map

First, get the batch size and the device:

# Number of images and device
num_images = proposals.shape[0]
device = proposals.device

Within this method the objectness scores are only used to rank and filter proposals, so gradients should not flow back through them:

# Drop the gradient information and keep only the values
objectness = objectness.detach()
objectness = objectness.reshape(num_images, -1)

Next, a levels tensor records which feature map each anchor belongs to:

# levels marks which prediction feature map each anchor comes from
# torch.full creates a tensor of length n filled with idx
# Without FPN the tensor is all zeros
levels = [torch.full((n, ), idx, dtype=torch.int64, device=device)
          for idx, n in enumerate(num_anchors_per_level)]
# Concatenate into a 1-D tensor (without FPN there is only one piece anyway)
levels = torch.cat(levels, 0)

# Reshape to 2-D and expand to the shape of objectness
levels = levels.reshape(1, -1).expand_as(objectness)

Then get the indices of the top pre_nms_top_n anchors by score on each prediction feature map (_get_top_n_idx is covered below):

# Indices of the top pre_nms_top_n anchors by score on each prediction feature map
top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)

Next, build a batch index, which the indexing in the following step needs:

image_range = torch.arange(num_images, device=device)
batch_idx = image_range[:, None]  # [batch_size, 1]

Use the indices to select the corresponding scores, levels, and box coordinates:

# Select the scores of the top pre_nms_top_n anchors on each level
objectness = objectness[batch_idx, top_n_idx]
levels = levels[batch_idx, top_n_idx]
# Select the box coordinates of the same anchors
proposals = proposals[batch_idx, top_n_idx]

Then convert the scores to probabilities and prepare the output lists:

objectness_prob = torch.sigmoid(objectness)
final_boxes = []
final_scores = []

Finally, process each image in turn:

# Loop over the predictions of each image
for boxes, scores, lvl, img_shape in zip(proposals, objectness_prob, levels, image_shapes):
    # Clip boxes that extend past the image to the image border
    boxes = box_ops.clip_boxes_to_image(boxes, img_shape)

    # Indices of boxes whose width and height both pass the min_size check
    keep = box_ops.remove_small_boxes(boxes, self.min_size)
    boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

    # Remove low-scoring boxes, see
    # https://github.com/pytorch/vision/pull/3205
    keep = torch.where(torch.ge(scores, self.score_thresh))[0]  # ge: >=
    boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

    # Run NMS, with lvl separating the boxes of different feature maps
    keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)

    # Keep only the top-scoring predictions
    keep = keep[: self.post_nms_top_n()]
    boxes, scores = boxes[keep], scores[keep]

    # Collect the results
    final_boxes.append(boxes)
    final_scores.append(scores)
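
The NMS step is delegated to box_ops.batched_nms, whose internals are not shown in this post. As a rough illustration of what such a step does, here is a greedy NMS sketch in plain Python. It assumes the common convention that a box is dropped when its IoU with an already kept, higher-scoring box from the same level exceeds the threshold; box_iou and batched_nms_sketch are hypothetical names:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def batched_nms_sketch(boxes, scores, levels, thresh):
    """Greedy NMS where boxes only suppress other boxes from the same level."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(levels[i] == levels[j] and box_iou(boxes[i], boxes[j]) > thresh
                         for j in keep)
        if not suppressed:
            keep.append(i)
    return keep  # indices in descending score order


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7]
levels = [0, 0, 1]
keep = batched_nms_sketch(boxes, scores, levels, thresh=0.7)
```

Box 1 overlaps box 0 with IoU 90/110 and is dropped, while box 2 survives even though it coincides with box 0, because it sits on a different level. Since the kept indices come back sorted by score, slicing with keep[: post_nms_top_n] retains the best ones.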

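The _get_top_n_idx method:

This method returns the indices of the top pre_nms_top_n anchors by score on each prediction feature map.

Parameters:

Parameter	Meaning
objectness	Tensor (the predicted objectness scores of each image)
num_anchors_per_level	List (the number of anchors on each prediction feature map)

The code is straightforward:

r = []  # top pre_nms_top_n indices for each prediction feature map
offset = 0
# Loop over the objectness scores of each prediction feature map
# objectness.split(num_anchors_per_level, 1) splits the scores by level
for ob in objectness.split(num_anchors_per_level, 1):
    # torchvision._is_tracing() is normally False, so this branch is skipped
    if torchvision._is_tracing():
        num_anchors, pre_nms_top_n = _onnx_get_num_anchors_and_pre_nms_top_n(ob, self.pre_nms_top_n())
    else:
        num_anchors = ob.shape[1]  # number of anchors on this feature map
        # A level may hold fewer anchors than pre_nms_top_n (say 100); in that case take all of them
        pre_nms_top_n = min(self.pre_nms_top_n(), num_anchors)

    # topk returns the indices of the k largest elements along the given dimension
    _, top_n_idx = ob.topk(pre_nms_top_n, dim=1)
    # The anchors of all levels were concatenated earlier, so indices within a level
    # must be shifted by the number of anchors on the preceding levels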
    r.append(top_n_idx + offset)
    offset += num_anchors
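
The role of offset is easiest to see on a single image with two small levels. This is a minimal sketch in plain Python, with sorted standing in for topk; top_n_idx_per_level and the sample scores are illustrative assumptions:

```python
def top_n_idx_per_level(scores, num_anchors_per_level, pre_nms_top_n):
    """Indices (into the concatenated scores) of the top anchors on every level."""
    result = []
    offset = 0
    for n in num_anchors_per_level:
        level_scores = scores[offset:offset + n]
        k = min(pre_nms_top_n, n)  # a level may hold fewer than pre_nms_top_n anchors
        order = sorted(range(n), key=lambda i: level_scores[i], reverse=True)[:k]
        # indices are local to the level, so shift them back into the global range
        result.extend(i + offset for i in order)
        offset += n
    return result


scores = [0.1, 0.9, 0.5, 0.3, 0.8]      # 3 anchors on level 0, 2 on level 1
idx = top_n_idx_per_level(scores, [3, 2], pre_nms_top_n=2)
```

Without the offset, the winners of level 1 would be reported as indices 1 and 0 and would point at anchors of level 0 instead.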

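The pre_nms_top_n method:

Returns the number of proposals kept before NMS.

The code checks whether the model is in training or testing mode and returns the corresponding value:

if self.training:
    return self._pre_nms_top_n['training']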
return self._pre_nms_top_n['testing']

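The compute_loss method:

This method computes the RPN loss, which consists of the classification loss (foreground vs. background) and the bbox regression loss.

Parameters:

Parameter	Meaning
objectness	The predicted foreground scores
pred_bbox_deltas	The predicted bbox regression parameters
labels	The ground-truth labels 1, 0, -1 (a List with one element per image in the batch)
regression_targets	The ground-truth bbox regression parameters

The implementation is simple; see the comments:

# Sample positive and negative anchors according to batch_size_per_image and positive_fraction
sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
# Concatenate the per-image masks of the batch and take the indices of the nonzero entries
# sampled_pos_inds = torch.nonzero(torch.cat(sampled_pos_inds, dim=0)).squeeze(1)
sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
# sampled_neg_inds = torch.nonzero(torch.cat(sampled_neg_inds, dim=0)).squeeze(1)
sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]

# Concatenate the positive and negative indices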
sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)
objectness = objectness.flatten()

labels = torch.cat(labels, dim=0)
regression_targets = torch.cat(regression_targets, dim=0)

# Bbox regression loss (smooth L1): summed over the positive samples, then divided by the number of sampled anchors
box_loss = det_utils.smooth_l1_loss(
    pred_bbox_deltas[sampled_pos_inds],
    regression_targets[sampled_pos_inds],
    beta=1 / 9,
    size_average=False,
) / (sampled_inds.numel())

# Objectness loss: binary cross-entropy computed on the raw logits of all sampled anchors
objectness_loss = F.binary_cross_entropy_with_logits(
    objectness[sampled_inds], labels[sampled_inds]
)
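
The normalization of the two terms is easy to miss in the tensor code, so here is a scalar sketch in plain Python. It assumes the default mean reduction for the cross-entropy term; rpn_loss_sketch and its helpers are hypothetical names, and the inputs are made up:

```python
import math


def smooth_l1_sum(pred, target, beta):
    """Sum of smooth L1 over paired values."""
    total = 0.0
    for p, t in zip(pred, target):
        n = abs(p - t)
        total += 0.5 * n * n / beta if n < beta else n - 0.5 * beta
    return total


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def rpn_loss_sketch(logits, deltas, labels, targets, pos_inds, neg_inds, beta=1.0 / 9):
    sampled = list(pos_inds) + list(neg_inds)
    # regression loss: summed over positives, divided by the number of sampled anchors
    box_loss = sum(smooth_l1_sum(deltas[i], targets[i], beta) for i in pos_inds) / len(sampled)
    # objectness loss: mean binary cross-entropy over all sampled anchors
    obj_loss = sum(bce_with_logits(logits[i], labels[i]) for i in sampled) / len(sampled)
    return obj_loss, box_loss


logits = [0.0, 0.0, 3.0]
labels = [1.0, 0.0, -1.0]               # anchor 2 is ignored and never sampled
deltas = [[0.0, 0.0, 0.0, 0.0]] * 3
targets = [[1.0, 0.0, 0.0, 0.0]] * 3
obj_loss, box_loss = rpn_loss_sketch(logits, deltas, labels, targets, pos_inds=[0], neg_inds=[1])
```

Only positive anchors contribute to the regression term, because negative anchors have no gt box to regress to, yet the sum is divided by the count of all sampled anchors. Ignored anchors (label -1) appear in neither term.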

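3.2 The RPNHead class:

This class defines the following structure:

[Image: structure of the RPN head]

The __init__ method:

Parameters:

Parameter	Meaning
in_channels	Number of channels of the input feature map (512 for VGG16, 256 for ZF)
num_anchors	Number of anchors per grid cell

The initializer defines the RPN head, i.e. one 3x3 convolution and two 1x1 convolutions, and initializes their parameters:

# 3x3 sliding window
self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
# Two predictors
# Objectness score (here "object" only means foreground vs. background)
self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
# Bbox regression parameters, 4k values per position
self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

# Initialize the parameters of the three layers above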
for layer in self.children():
    if isinstance(layer, nn.Conv2d):
        torch.nn.init.normal_(layer.weight, std=0.01)
        torch.nn.init.constant_(layer.bias, 0)

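The forward method:

This is a regular forward pass through the structure above. Because some networks use FPN, it loops over the feature maps:

logits = []
bbox_reg = []
# Loop over the prediction feature maps; here there is one
# With FPN there is more than one
for i, feature in enumerate(x):
    # Sliding-window convolution
    t = F.relu(self.conv(feature))
    # Predict the objectness scores and regression parameters and append them to the results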
    logits.append(self.cls_logits(t))
    bbox_reg.append(self.bbox_pred(t))
return logits, bbox_reg

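3.3 The AnchorsGenerator class:

This class generates the anchors.

The __init__ method:

Parameters (3 base sizes x 3 aspect ratios = 9 anchors):

Parameter	Meaning
sizes	The base sizes of the anchors, default (128, 256, 512)
aspect_ratios	The aspect ratios of the anchors, default (0.5, 1.0, 2.0)

In principle the initializer only needs to store these values. Because training may use FPN, the arguments can arrive in different forms and are normalized first:

# Check whether the elements of sizes are lists or tuples
# A flat tuple such as sizes=(128, 256, 512) becomes ((128,), (256,), (512,)),
# and a flat aspect_ratios=(0.5, 1.0, 2.0) is repeated once per entry of sizes
# Nested input such as sizes=((32, 64, 128, 256, 512),) with aspect_ratios=((0.5, 1.0, 2.0),)
# meets neither condition and is used unchanged
if not isinstance(sizes[0], (list, tuple)):
    # TODO change this
    sizes = tuple((s,) for s in sizes)
# Same check for aspect_ratios
if not isinstance(aspect_ratios[0], (list, tuple)):
    aspect_ratios = (aspect_ratios,) * len(sizes)

Then the attributes are initialized:

# Initialize
self.sizes = sizes
self.aspect_ratios = aspect_ratios
self.cell_anchors = None
self._cache = {}  # filled in later by cached_grid_anchors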

The forward method:

As before, the other methods of the class are explained by following forward.

Parameters:

Parameter	Meaning
image_list	Holds two things: the batched images as one tensor, and the size of each image after resizing (see the earlier post on ImageList for a refresher)
feature_maps	The prediction feature maps; without FPN the list has a single element

The code:

First, get the sizes of the feature maps and of the input images, plus the dtype and device:

# Size (height, width) of each prediction feature map
grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])
# Height and width of the batched input images
image_size = image_list.tensors.shape[-2:]
# dtype and device
dtype, device = feature_maps[0].dtype, feature_maps[0].device

Next, compute the stride: how many pixels in the original image correspond to one step on the feature map. With VGG16, which downsamples by 16, one step on the feature map is 16 pixels in the image:

# Stride = image size // feature map size, computed separately for height and width
strides = [[torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
            torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device)] for g in grid_sizes]

Then generate the anchor templates from sizes and aspect_ratios (set_cell_anchors is covered below):

# Generate the anchor templates from sizes and aspect_ratios
# The sizes are given in pixels of the original image, so the templates live at image scale
self.set_cell_anchors(dtype, device)

The templates are then shifted. They start out centered at (0, 0); after shifting there is one set for every feature-map cell, placed at the matching position in the original image. These are the actual anchors:

# Compute (or read from the cache) the coordinates of all anchors, mapped onto the original image (not the templates)
# The result is a list with one entry per prediction feature map
anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)

Finally, store the anchors for each image and concatenate them:

# Declare the type of anchors
anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
# Loop over the images in the batch
for i, (image_height, image_width) in enumerate(image_list.image_sizes):
    anchors_in_image = []
    # Loop over the anchors of each prediction feature map
    # Without FPN, anchors_over_all_feature_maps has a single element
    for anchors_per_feature_map in anchors_over_all_feature_maps:
        anchors_in_image.append(anchors_per_feature_map)
    # Append to anchors
    anchors.append(anchors_in_image)
# Concatenate the anchors of all prediction feature maps for each image
# anchors is a list; each element holds all anchors of one image
anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
# Clear the cache in case that memory leaks.
self._cache.clear()

The set_cell_anchors method:

This method generates the anchor templates.

The comments explain the code:

# Check whether self.cell_anchors has been set; it starts out as None
if self.cell_anchors is not None:
    cell_anchors = self.cell_anchors
    assert cell_anchors is not None
    # suppose that all anchors have the same device
    # which is a valid assumption in the current state of the codebase
    if cell_anchors[0].device == device:
        return

# On the first call the branch above is skipped
# Generate the anchor templates from sizes and aspect_ratios; every template is centered at (0, 0)
# There is one entry per level; without FPN cell_anchors has a single entry holding 15 anchors
cell_anchors = [
    # generate_anchors builds the templates for one level
    self.generate_anchors(sizes, aspect_ratios, dtype, device)
    for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
]
# Store as an attribute
self.cell_anchors = cell_anchors

The generate_anchors method is described next.

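The generate_anchors method:

This method is where the anchor templates are actually built.

The parameters were covered above: the base sizes (scales), the aspect ratios, and so on.

First, convert the inputs to tensors:

# Convert to tensors
scales = torch.as_tensor(scales, dtype=dtype, device=device)
aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)

Then derive the height and width factors:

# With h = sqrt(r) * scale and w = scale / sqrt(r), the ratio h / w equals r
# while the area w * h stays equal to scale * scale for every aspect ratio
h_ratios = torch.sqrt(aspect_ratios)
w_ratios = 1.0 / h_ratios

Next, scale the width and height:

# [r1, r2, r3]' * [s1, s2, s3]
# w_ratios and scales are vectors; adding a dimension to each makes the product broadcast
# into a matrix holding the width of every (ratio, scale) combination,
# which view(-1) then flattens into a 1-D vector
# Count = len(ratios) * len(scales), e.g. 3 * 5 = 15
ws = (w_ratios[:, None] * scales[None, :]).view(-1)
hs = (h_ratios[:, None] * scales[None, :]).view(-1)

Finally, turn widths and heights into coordinates. The resulting templates are all centered at (0, 0):

# All templates are centered at (0, 0), shape [len(ratios)*len(scales), 4]
# With the center at (0, 0), the top-left corner is (-ws/2, -hs/2)
# torch.stack joins the tensors along a new dimension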
base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
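
To check the claim about ratio and area, the same construction can be written with plain Python loops. This is a sketch under the assumption that no rounding is applied; generate_anchor_templates is a hypothetical name:

```python
import math


def generate_anchor_templates(scales, aspect_ratios):
    """Templates (x1, y1, x2, y2) centered at (0, 0); ratio-major order, like the view(-1) above."""
    templates = []
    for r in aspect_ratios:
        h_ratio = math.sqrt(r)
        w_ratio = 1.0 / h_ratio
        for s in scales:
            w, h = w_ratio * s, h_ratio * s
            templates.append((-w / 2, -h / 2, w / 2, h / 2))
    return templates


templates = generate_anchor_templates(scales=(128, 256), aspect_ratios=(0.5, 1.0, 2.0))
```

For ratio 0.5 and scale 128 this gives a box roughly 181 wide and 90.5 high: height / width is 0.5, and the area is still 128 * 128.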

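The cached_grid_anchors method:

This method caches the computed anchors.

Parameters:

Parameter	Meaning
grid_sizes	The sizes of the feature maps
strides	The strides of the feature maps relative to the original image

The comments explain the code:

# Convert the values to strings and concatenate them into a key
key = str(grid_sizes) + str(strides)
# self._cache is a dict
# If the key is present, the anchors were computed before, so return the cached value
if key in self._cache:
    return self._cache[key]
# Otherwise compute them with grid_anchors
anchors = self.grid_anchors(grid_sizes, strides)
# Store them in the cache
self._cache[key] = anchors

The grid_anchors method is described next.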

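The grid_anchors method:

This method computes the coordinates of all anchors on the original image for each prediction feature map.

Parameters:

Parameter	Meaning
grid_sizes	Height and width of each prediction feature map
strides	Stride in the original image that corresponds to one step on the feature map

Assign the previously generated templates to a variable and check that they exist:

# Output list
anchors = []
# Anchor templates
cell_anchors = self.cell_anchors
# The templates must have been generated already
assert cell_anchors is not None

Then loop over the grid_size, stride and cell_anchors of each prediction feature map (a single level without FPN):

# Loop over grid_size, stride and cell_anchors of each prediction feature map; one level without FPN
for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):
    .......

Inside the loop, compute the offset of every feature-map cell in the original image and add it to the templates, which gives the actual anchors:

# Loop over grid_size, stride and cell_anchors of each prediction feature map; one level without FPN
for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):
    # Size of the feature map
    grid_height, grid_width = size
    # Stride of the feature map relative to the original image
    stride_height, stride_width = stride
    # Device
    device = base_anchors.device

    # See the figure below
    # shape: [grid_width], x coordinates (columns) in the original image
    shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
    # shape: [grid_height], y coordinates (rows) in the original image
    shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

    # Coordinates in the original image of every feature-map cell (the offsets applied to the templates)
    # torch.meshgrid takes the row and column coordinates and returns two grids of shape [grid_height, grid_width]
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
    # Flatten
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)

    # Offsets for (xmin, ymin, xmax, ymax); the top-left and bottom-right corners share the same offset
    # The offset is the position of the feature-map cell mapped back to the original image
    # shape: [grid_width*grid_height, 4]
    shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)

    # Add the offsets to the templates to get all anchors on the original image (broadcasting handles the shapes)
    # shifts.view(-1, 1, 4): first dimension inferred, second is 1, third is 4
    # base_anchors.view(1, -1, 4): first dimension is 1, second inferred, third is 4
    shifts_anchor = shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)
    # e.g. shifts_anchor has shape [850, 15, 4]: 850 cells, 15 anchors per cell, 4 coordinates

    # Finally reshape so the first dimension is the total number of anchors
    anchors.append(shifts_anchor.reshape(-1, 4))

The code above is hard to follow, so here is a figure (a screenshot from the uploader's video):

[Image: anchor templates shifted to every feature-map cell]

In short: so far only the templates exist. They differ in size but are all centered at (0, 0). The actual anchors exist at every cell, so the templates are shifted according to the feature-map size, as in the figure.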
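
The broadcasting in grid_anchors amounts to three nested loops. The following plain Python sketch spells them out for a single level; grid_anchors_sketch and the 2 x 3 grid with stride 16 are illustrative assumptions:

```python
def grid_anchors_sketch(grid_height, grid_width, stride_height, stride_width, templates):
    """Shift every template to every feature-map cell; cells in row-major order, templates innermost."""
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            shift_x, shift_y = col * stride_width, row * stride_height
            for x1, y1, x2, y2 in templates:
                # both corners move by the same offset, so the box keeps its size
                anchors.append((x1 + shift_x, y1 + shift_y, x2 + shift_x, y2 + shift_y))
    return anchors


anchors = grid_anchors_sketch(2, 3, 16, 16, templates=[(-8, -8, 8, 8)])
```

The loop order matters: cells are visited row by row and the templates of one cell stay together, which is the order produced by reshaping the [cells, templates, 4] tensor to (-1, 4).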

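3.4 The concat_box_prediction_layers function:

This function rearranges the order and shape of the prediction tensors of each feature map in the two lists box_cls and box_regression.

Parameters:

Parameter	Meaning
box_cls	The predicted objectness scores on each prediction feature map
box_regression	The predicted bbox regression parameters on each prediction feature map

First, define two empty lists for the results:

# Two empty lists
box_cls_flattened = []  # objectness scores
box_regression_flattened = []  # regression parameters

Then loop over the feature maps:

# Loop over the prediction feature maps
for box_cls_per_level, box_regression_per_level in zip(box_cls, box_regression):
    ....

Inside the loop: first read the shapes of the classification and regression tensors; then compute the number of anchors and the number of classes; then reorder and flatten (see Section 3.5 for permute_and_flatten):

# Note: for RPN proposals classes_num is 1, since only object vs. background is distinguished
# So instead of the 2 (foreground or background) * k outputs described in the paper, there is 1 * k
N, AxC, H, W = box_cls_per_level.shape
# [batch_size, anchors_num_per_position * 4, height, width]
Ax4 = box_regression_per_level.shape[1]
# Number of anchors per position
A = Ax4 // 4
# Number of classes
C = AxC // A

# Flatten to [N, -1, C]; -1 covers all anchors and C is the number of classes, i.e. 1
box_cls_per_level = permute_and_flatten(box_cls_per_level, N, A, C, H, W)
box_cls_flattened.append(box_cls_per_level)

# [N, -1, 4]; each anchor has 4 regression parameters
box_regression_per_level = permute_and_flatten(box_regression_per_level, N, A, 4, H, W)
box_regression_flattened.append(box_regression_per_level)

Finally, concatenate and flatten the results:

# Concatenate, then flatten
box_cls = torch.cat(box_cls_flattened, dim=1).flatten(0, -2)  # start_dim, end_dim
box_regression = torch.cat(box_regression_flattened, dim=1).reshape(-1, 4)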

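3.5 The permute_and_flatten function:

This function reorders the dimensions of a tensor and reshapes it.

Parameters:

Parameter	Meaning
layer	The objectness scores or bbox regression parameters predicted on one feature map
N	batch_size
A	Number of anchors per cell
C	Number of classes, or 4 (the 4 coordinates)
H	height
W	width

The body is as follows; see the comments:

# [batch_size, anchors_num_per_position * (C or 4), height, width]
layer = layer.view(N, -1, C,  H, W)
# Reorder the dimensions
layer = layer.permute(0, 3, 4, 1, 2)  # [N, H, W, -1, C]; the numbers index into (N, -1, C, H, W)
# view and reshape both rearrange the elements into the given shape
# view only works on tensors that are contiguous in memory; permute makes a tensor non-contiguous, so view can no longer be called on it
# reshape does not require the tensor to be contiguous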
layer = layer.reshape(N, -1, C)
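
What the view / permute / reshape chain does to the element order can be reproduced with nested lists. The sketch below handles a single image and assumes the channel layout implied by view(N, -1, C, H, W), namely channel = a * C + c; permute_and_flatten_sketch is a hypothetical name:

```python
def permute_and_flatten_sketch(layer, A, C, H, W):
    """layer[channel][h][w] for one image, with channel = a * C + c; returns A*H*W rows of length C."""
    rows = []
    for h in range(H):
        for w in range(W):
            for a in range(A):
                rows.append([layer[a * C + c][h][w] for c in range(C)])
    return rows


A, C, H, W = 2, 4, 1, 2
# fill each entry with channel * 10 + w so its origin stays visible
layer = [[[ch * 10 + w for w in range(W)] for _ in range(H)] for ch in range(A * C)]
rows = permute_and_flatten_sketch(layer, A, C, H, W)
```

Each row collects the C values of one anchor at one cell. The rows run cell by cell with the anchors of a cell kept together, which is the same order in which grid_anchors emits the anchors, so row i of the predictions belongs to anchor i.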

4. Summary:

This post covered the implementation of the RPN in detail, which is one of the most important parts of Faster-RCNN.