[RCNN series] Faster RCNN paper summary and source code

Object detection paper summary

[RCNN Series]
RCNN
Fast RCNN
Faster RCNN



Foreword

Summary of some classic papers.


1. Pipeline

[Figure: Faster R-CNN pipeline (RPN + Fast RCNN sharing the convolutional layers)]

Faster RCNN is essentially RPN + Fast RCNN, with the RPN and Fast RCNN sharing the same convolutional layers. The input image is fed into a CNN (VGG or ZF) to obtain a feature map. An n×n sliding window (n = 3 in the paper, implemented in practice as a 3×3 convolution) is slid over the feature map and followed by two heads: one performs binary classification (foreground vs. background) and the other predicts 4 coordinate values. The regions classified as foreground become the RoIs (region proposals) sent to the subsequent network; this is the RPN part. The convolutional layers (conv layers) of the Fast RCNN part are the same as those of the RPN: the input image is fed into the CNN (VGG, ZF) to get the feature map, the foreground RoIs output by the RPN are mapped onto that feature map, and, just as in the earlier Fast RCNN, classification and bounding-box regression are performed after a RoI pooling layer.

It is the RPN that replaces the selective search (SS) algorithm of the earlier RCNN series for generating RoIs, which greatly speeds up Fast RCNN.
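As a concrete illustration of this composition (a sketch, not code from the original post), the snippet below builds a Faster RCNN out of a shared backbone, an anchor generator for the RPN, and torchvision's default RoI heads; module paths and default arguments may differ slightly between torchvision versions.

import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator

# any feature extractor exposing `out_channels` can serve as the shared backbone
backbone = torchvision.models.mobilenet_v2().features
backbone.out_channels = 1280

# one feature map, 3 scales x 3 aspect ratios = 9 anchors per location, as in the paper
anchor_generator = AnchorGenerator(sizes=((128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

model = FasterRCNN(backbone,
                   num_classes=21,  # e.g. 20 VOC classes + background
                   rpn_anchor_generator=anchor_generator)

model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 600, 800)])  # list of dicts with boxes, labels, scores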

2. Model design

1. RPNHead

Before looking at the full RPN network, let's first look at RPNHead.
The code of RPNHead is very simple. The feature map is passed in and goes through a 3×3 convolution, which is the n×n sliding window from the paper (n = 3) used to select proposals; with padding, the spatial shape is unchanged after the 3×3 convolution. Two 1×1 convolutions follow: one distinguishes the foreground from the background, and the other predicts the offsets of the 4 coordinates. Why 1×1 convolutions? A 1×1 convolution reduces the dimensionality, i.e. the number of channels, from in_channels (512 for a VGG backbone, 256 for ZF) down to num_anchors (the paper uses 9). As shown in the figure below, the 1×1 convolution produces a three-dimensional tensor [C, H, W], where H and W are the height and width of the feature map and the number of channels C is 9 in the code.
[Figure: the 1×1 classification convolution outputs a C×H×W tensor with C = 9 objectness channels]
Taking out the one-dimensional vector marked in yellow means taking out the 9 channels, which represent the objectness of the 9 anchors at that location (the probability of belonging to the foreground or the background). The paper describes a two-class softmax, so following the paper's formulation there should be 2×9 = 18 channels, with each pair of channels giving one anchor's objectness. However, the authors also note that a simpler logistic regression can be used instead, with 0.5 as the threshold: a score greater than 0.5 means foreground, otherwise background. That is why the code uses num_anchors channels rather than the paper's num_anchors*2.
In the same way, the predicted coordinate offsets use num_anchors*4 = 36 channels, i.e. 4 coordinate predictions for each of the 9 anchors.


In fact, I feel this is very similar to YOLO's prediction scheme: YOLO also outputs a three-dimensional tensor at the end, except that YOLO predicts multiple classes. In my view, YOLO can be regarded as an RPN, or as an improved RPN (it skips Fast RCNN and predicts directly with the RPN-like head); their structures are very similar.


import torch
import torch.nn as nn
import torch.nn.functional as F


class RPNHead(nn.Module):
    """
    add a RPN head with classification and regression
    Computes the predicted objectness and the bbox regression parameters via the sliding window.

    Arguments:
        in_channels: number of channels of the input feature
        num_anchors: number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        # 3x3 sliding window; with padding=1 the spatial size is unchanged
        # input/output: bs*512*h*w
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        # predicted objectness score ("object" here only means foreground vs. background)
        # logistic regression with 0.5 as the threshold
        # output: bs*9*h*w
        # every point of the feature map has 9 anchors; as in YOLO, each of the 9 channels is one anchor's objectness
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        # predicted bbox regression parameters
        # output: bs*36*h*w, the 4 offsets for each of the 9 anchors
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

        for layer in self.children():
            if isinstance(layer, nn.Conv2d):
                torch.nn.init.normal_(layer.weight, std=0.01)
                torch.nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        # type: (List[Tensor]) -> Tuple[List[Tensor], List[Tensor]]
        logits = []
        bbox_reg = []
        for i, feature in enumerate(x):
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
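A quick shape check for this head (a small sketch; the two random feature maps stand in for two FPN levels):

head = RPNHead(in_channels=512, num_anchors=9)
feats = [torch.rand(1, 512, 50, 50), torch.rand(1, 512, 25, 25)]
logits, bbox_reg = head(feats)
print(logits[0].shape)    # torch.Size([1, 9, 50, 50])   one objectness channel per anchor
print(bbox_reg[0].shape)  # torch.Size([1, 36, 50, 50])  4 offsets per anchor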

2. Anchors

The anchors of Faster RCNN have three aspect ratios [0.5, 1, 2] and three area sizes [128×128, 256×256, 512×512], i.e. 9 anchors per location.
Steps to generate the anchors:
1. First generate anchor templates with the three aspect ratios, all centered on (0, 0). An anchor's coordinates are represented as [x1, y1, x2, y2], where (x1, y1) is the lower-left corner and (x2, y2) is the upper-right corner. This is equivalent to generating 9 anchors at the origin.
2. According to the scaling ratio between the feature map and the original image, translate these (0, 0)-centered anchors by the corresponding offsets, i.e. map each point of the feature map back onto the original image and place the anchors there. So the anchors live on the original image, not on the feature map; the feature map only acts as a bridge.
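As a quick, standalone check of step 1, the (0, 0)-centered templates for the three ratios and three scales can be computed directly (this is a sketch independent of the class below; the values agree with the anchor templates listed in the comment above set_cell_anchors):

import torch

scales = torch.tensor([128., 256., 512.])
aspect_ratios = torch.tensor([0.5, 1.0, 2.0])   # ratio = h / w

# w = scale / sqrt(r), h = scale * sqrt(r)  =>  h * w = scale^2 and h / w = r
h_ratios = torch.sqrt(aspect_ratios)
w_ratios = 1.0 / h_ratios

ws = (w_ratios[:, None] * scales[None, :]).view(-1)  # 3 ratios x 3 scales = 9 widths
hs = (h_ratios[:, None] * scales[None, :]).view(-1)

# (0, 0)-centered templates in [x1, y1, x2, y2] form
base_anchors = (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()
print(base_anchors[0])   # ratio 0.5 at scale 128 -> tensor([-91., -45.,  91.,  45.])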

class AnchorsGenerator(nn.Module):
    __annotations__ = {
        "cell_anchors": Optional[List[torch.Tensor]],
        "_cache": Dict[str, List[torch.Tensor]]
    }

    """
    Anchor generator
    Module that generates anchors for a set of feature maps and
    image sizes.

    The module support computing anchors at multiple sizes and aspect ratios
    per feature map.

    sizes and aspect_ratios should have the same number of elements, and it should
    correspond to the number of feature maps.

    sizes[i] and aspect_ratios[i] can have an arbitrary number of elements,
    and AnchorGenerator will output a set of sizes[i] * aspect_ratios[i] anchors
    per spatial location for feature map i.

    Arguments:
        sizes (Tuple[Tuple[int]]):
        aspect_ratios (Tuple[Tuple[float]]):
    """
    # size=128,256,512每个不同大小的特征图的base anchor大小不一致
    def __init__(self, sizes=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
        super(AnchorsGenerator, self).__init__()
        # 128*128
        # 转换成((128,),(256,),(512,))
        # 把每个元素都转换成tuple
        if not isinstance(sizes[0], (list, tuple)):
            # TODO change this
            sizes = tuple((s,) for s in sizes)
        # 把每个aspect_ratios转化成tuple
        # ((0.5, 1, 2), (0.5, 1, 2), (0.5, 1, 2))
        # 每个tuple里面tuple长度和sizes长度一致
        if not isinstance(aspect_ratios[0], (list, tuple)):
            # 9种anchor的比例
            # 每个tuple里面tuple长度和sizes长度一致
            aspect_ratios = (aspect_ratios,) * len(sizes)

        assert len(sizes) == len(aspect_ratios)

        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.cell_anchors = None
        # private variable used as a cache for the computed anchors
        self._cache = {}

    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device=torch.device("cpu")):
        # type: (List[int], List[float], torch.dtype, torch.device) -> Tensor
        """
        compute anchor sizes
        Arguments:
            # 即上文的sizes
            scales: sqrt(anchor_area)
            # anchor宽高比
            aspect_ratios: h/w ratios
            dtype: float32
            device: cpu/gpu
        """
        # as_tensor浅拷贝
        # shape [3,1]
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        # shape [3,1]
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        # 开根号
        # h*w=h*h=ratios
        # 所以开根号
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1.0 / h_ratios

        # [r1, r2, r3]' * [s1, s2, s3]
        # number of elements is len(ratios)*len(scales)
        # w_ratios[:, None]注意这里是在中间插入一维数据[3,1,3]
        # scales[None, :]意这里是在中间插入一维数据[1,3,3]
        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        # torch.Size([3, 1, 3])
        # torch.Size([1, 3, 1])
        # 不看通道相当于1*3的矩阵和3*1的向量相乘
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        # left-bottom, right-top coordinate relative to anchor center(0, 0)
        # 生成的anchors模板都是以(0, 0)为中心的, shape [len(ratios)*len(scales), 4]
        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2

        return base_anchors.round()  # round 四舍五入

    # 分组生成anchor模板
    # output三组tensor 左下右上的格式
    """

     [tensor([[-91., -45.,  91.,  45.], # 128*128
             [-64., -64.,  64.,  64.],  # 256*256
             [-45., -91.,  45.,  91.]]),# 512*512
     tensor([[-181.,  -91.,  181.,   91.],
             [-128., -128.,  128.,  128.],
             [ -91., -181.,   91.,  181.]]),
     tensor([[-362., -181.,  362.,  181.],
             [-256., -256.,  256.,  256.],
             [-181., -362.,  181.,  362.]])]
     """
    def set_cell_anchors(self, dtype, device):
        # type: (torch.dtype, torch.device) -> None
        # 如果传入anchor模板就不用生成了
        if self.cell_anchors is not None:
            cell_anchors = self.cell_anchors
            assert cell_anchors is not None
            # suppose that all anchors have the same device
            # which is a valid assumption in the current state of the codebase
            if cell_anchors[0].device == device:
                return

        # 根据提供的sizes和aspect_ratios生成anchors模板
        # anchors模板都是以(0, 0)为中心的anchor
        cell_anchors = [
            self.generate_anchors(sizes, aspect_ratios, dtype, device)
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = cell_anchors
        # cell_anchor list类型
    def num_anchors_per_location(self):
        # 计算每个预测特征层上每个滑动窗口的预测目标数
        return [len(s) * len(a) for s, a in zip(self.sizes, self.aspect_ratios)]
    # [3,3,3]

    # For every combination of (a, (g, s), i) in (self.cell_anchors, zip(grid_sizes, strides), 0:2),
    # output g[i] anchors that are s[i] distance apart in direction i, with the same dimensions as a.
    def grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        """
        anchors position in grid coordinate axis map into origin image
        计算预测特征图对应原始图像上的所有anchors的坐标
        Args:
            grid_sizes: 预测特征矩阵的height和width
            strides: 预测特征矩阵上一步 对应 原始图像上的步距
            # 比如VGG最后一层缩放了16倍
        """
        anchors = []
        cell_anchors = self.cell_anchors
        assert cell_anchors is not None

        # 遍历每个预测特征层的grid_size,strides和cell_anchors
        for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):
            grid_height, grid_width = size
            stride_height, stride_width = stride
            device = base_anchors.device

            # For output anchor, compute [x_center, y_center, x_center, y_center]
            # shape: [grid_width] 对应原图上的x坐标(列)
            # 特征图大小grid_width
            shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
            # shape: [grid_height] 对应原图上的y坐标(行)
            shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

            # 计算预测特征矩阵上每个点对应原图上的坐标(anchors模板的坐标偏移量)
            # torch.meshgrid函数分别传入行坐标和列坐标,生成网格行坐标矩阵和网格列坐标矩阵
            # shape: [grid_height, grid_width]
            # 生成网格坐标
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)
            shift_y = shift_y.reshape(-1)

            # 计算anchors坐标(xmin, ymin, xmax, ymax)在原图上的坐标偏移量
            # shape: [grid_width*grid_height, 4]
            # 给base anchor的左下和右上坐标同时加上shift,所以要写成如下形式
            shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)

            # For every (base anchor, output anchor) pair,
            # offset each zero-centered base anchor by the center of the output anchor.
            # 将anchors模板与原图上的坐标偏移量相加得到原图上所有anchors的坐标信息(shape不同时会使用广播机制)
            # shifts.view(-1, 1, 4) shape [grid_width*grid_height,1,4]
            # base_anchors.view(1, -1, 4) shape [1,3,4]
            # base anchor的shape是[3,4]
            # [3,4]表示3个anchor的4个坐标左下右上
            shifts_anchor = shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)
            # shifts_anchor [12,3,4]
            anchors.append(shifts_anchor.reshape(-1, 4))

        return anchors  # List[Tensor(all_num_anchors, 4)]

    def cached_grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        """将计算得到的所有anchors信息进行缓存"""
        key = str(grid_sizes) + str(strides)
        # self._cache是字典类型
        if key in self._cache:
            return self._cache[key]
        anchors = self.grid_anchors(grid_sizes, strides)
        self._cache[key] = anchors
        return anchors

    def forward(self, image_list, feature_maps):
        # type: (ImageList, List[Tensor]) -> List[Tensor]
        # 获取每个预测特征层的尺寸(height, width)
        grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])

        # 获取输入图像的height和width
        image_size = image_list.tensors.shape[-2:]

        # 获取变量类型和设备类型
        dtype, device = feature_maps[0].dtype, feature_maps[0].device

        # one step in feature map equate n pixel stride in origin image
        # 计算特征层上的一步等于原始图像上的步长
        # 缩放了多少倍
        strides = [[torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
                    torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device)] for g in grid_sizes]

        # 根据提供的sizes和aspect_ratios生成anchors模板
        self.set_cell_anchors(dtype, device)

        # 计算/读取所有anchors的坐标信息(这里的anchors信息是映射到原图上的所有anchors信息,不是anchors模板)
        # 得到的是一个list列表,对应每张预测特征图映射回原图的anchors坐标信息
        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)

        anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
        # 遍历一个batch中的每张图像
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            # 遍历每张预测特征图映射回原图的anchors坐标信息
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                anchors_in_image.append(anchors_per_feature_map)
            anchors.append(anchors_in_image)
        # 将每一张图像的所有预测特征层的anchors坐标信息拼接在一起
        # anchors是个list,每个元素为一张图像的所有anchors信息
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        # Clear the cache in case that memory leaks.
        self._cache.clear()
        return anchors
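A small usage sketch for the generator above (it assumes the class is defined together with the usual imports: torch, torch.nn as nn, and typing's List/Dict/Optional; ImageList comes from torchvision and simply pairs the padded batch tensor with the original image sizes):

import torch
from torchvision.models.detection.image_list import ImageList

# one 800x800 image and a single 50x50 feature map, i.e. a stride of 16
images = ImageList(torch.rand(1, 3, 800, 800), image_sizes=[(800, 800)])
feature_maps = [torch.rand(1, 512, 50, 50)]

# one scale per feature map and three aspect ratios -> 3 anchors per location
gen = AnchorsGenerator(sizes=((128,),), aspect_ratios=((0.5, 1.0, 2.0),))
anchors = gen(images, feature_maps)
print(anchors[0].shape)  # torch.Size([7500, 4]) = 50 * 50 * 3 anchors, in original-image coordinates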

3. RPN (Region Proposal Networks)

From forward we can see the workflow of the RPN:
1. Obtain the feature maps from the convolutional network. Since FPN (a multi-scale feature pyramid that helps detect small targets) is used here, several feature maps of different sizes are passed in.
2. Pass the feature maps into RPNHead, which predicts the coordinate offsets and the objectness (foreground vs. background) for each anchor.
3. Generate the anchors and apply the offsets computed by RPNHead to them to obtain the predicted proposal coordinates.
4. filter_proposals filters the candidate regions and uses NMS to eliminate redundant proposals (a standalone sketch of these steps follows the list). Specifically:

  • First, sort the proposals generated by each feature level in descending order of confidence (foreground score); with FPN, proposals from different levels are handled independently. Keep at most the top pre_nms_top_n (a manually set value) per level.
  • Then clip proposals that extend beyond the image, since some anchors exceed the boundaries of the original image.
  • Remove proposals whose area is too small.
  • Apply NMS; note that proposals generated on feature maps of different levels are processed independently of each other.
  • Finally, sort the NMS survivors in descending order of confidence and return at most the first post_nms_top_n proposals. If fewer than post_nms_top_n boxes remain after NMS, all of them are passed on to the roi_head layer.
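A minimal, single-level sketch of that filtering pipeline using torchvision.ops (the tensors and default numbers here are illustrative; the real filter_proposals below additionally handles batching and per-level bookkeeping):

import torch
from torchvision.ops import clip_boxes_to_image, remove_small_boxes, batched_nms

def filter_proposals_single_level(proposals, scores, image_size,
                                  pre_nms_top_n=2000, post_nms_top_n=1000,
                                  nms_thresh=0.7, min_size=1.0):
    # 1. keep the top pre_nms_top_n proposals ranked by objectness score
    scores, idx = scores.topk(min(pre_nms_top_n, scores.numel()))
    proposals = proposals[idx]

    # 2. clip boxes that extend beyond the image boundary
    proposals = clip_boxes_to_image(proposals, image_size)

    # 3. drop boxes whose width or height is too small
    keep = remove_small_boxes(proposals, min_size)
    proposals, scores = proposals[keep], scores[keep]

    # 4. NMS (a single level here, so every box gets the same level index)
    levels = torch.zeros_like(scores, dtype=torch.int64)
    keep = batched_nms(proposals, scores, levels, nms_thresh)

    # 5. keep at most post_nms_top_n of the survivors
    keep = keep[:post_nms_top_n]
    return proposals[keep], scores[keep]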

The best way to understand the design of the RPN is to read the source code. The following is the official PyTorch code for the RPN together with my own comments:

class RegionProposalNetwork(torch.nn.Module):
    """
    Implements Region Proposal Network (RPN).

    Arguments:
        anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature
            maps.
        # RPNhead
        head (nn.Module): module that computes the objectness and regression deltas
        # 确定为正样本的IoU阈值 论文为0.7
        fg_iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be
            considered as positive during training of the RPN.
        # 确定为负样本的IoU阈值 论文为0.3
        bg_iou_thresh (float): maximum IoU between the anchor and the GT box so that they can be
            considered as negative during training of the RPN.

        # batch_size的大小 论文是256 正负样本1:1
        batch_size_per_image (int): number of anchors that are sampled during training of the RPN
            for computing the loss
        # minibatch中正负样本的比例 论文为1:1
        positive_fraction (float): proportion of positive anchors in a mini-batch during training
            of the RPN
        # 按分类得分降序保留前pre_nms_top_n个proposals,  训练是2000和预测1000
        pre_nms_top_n (Dict[str]): number of proposals to keep before applying NMS. It should
            contain two fields: training and testing, to allow for different values depending
            on training or evaluation

        # 返回NMS后的前post_nms_top_n个proposals,  训练是2000和预测1000
        post_nms_top_n (Dict[str]): number of proposals to keep after applying NMS. It should
            contain two fields: training and testing, to allow for different values depending
            on training or evaluation
        # NMS阈值 0.7
        nms_thresh (float): NMS threshold used for postprocessing the RPN proposals

    """
    __annotations__ = {
        'box_coder': det_utils.BoxCoder,
        'proposal_matcher': det_utils.Matcher,
        'fg_bg_sampler': det_utils.BalancedPositiveNegativeSampler,
        'pre_nms_top_n': Dict[str, int],
        'post_nms_top_n': Dict[str, int],
    }

    def __init__(self, anchor_generator, head,
                 fg_iou_thresh, bg_iou_thresh,
                 batch_size_per_image, positive_fraction,
                 pre_nms_top_n, post_nms_top_n, nms_thresh, score_thresh=0.0):
        super(RegionProposalNetwork, self).__init__()
        self.anchor_generator = anchor_generator
        self.head = head
        self.box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))

        # use during training
        # 计算anchors与真实bbox的iou
        self.box_similarity = box_ops.box_iou

        self.proposal_matcher = det_utils.Matcher(
            fg_iou_thresh,  # 当iou大于fg_iou_thresh(0.7)时视为正样本即前景
            bg_iou_thresh,  # 当iou小于bg_iou_thresh(0.3)时视为负样本即背景
            allow_low_quality_matches=True
        )

        self.fg_bg_sampler = det_utils.BalancedPositiveNegativeSampler(
            batch_size_per_image, positive_fraction  # 256, 0.5
        )

        # use during testing
        self._pre_nms_top_n = pre_nms_top_n
        self._post_nms_top_n = post_nms_top_n
        self.nms_thresh = nms_thresh
        self.score_thresh = score_thresh
        self.min_size = 1.

    def pre_nms_top_n(self):
        if self.training:
            return self._pre_nms_top_n['training']
        return self._pre_nms_top_n['testing']

    def post_nms_top_n(self):
        if self.training:
            return self._post_nms_top_n['training']
        return self._post_nms_top_n['testing']

    def assign_targets_to_anchors(self, anchors, targets):
        # type: (List[Tensor], List[Dict[str, Tensor]]) -> Tuple[List[Tensor], List[Tensor]]
        """
        计算每个anchors最匹配的gt,并划分为正样本,背景以及废弃的样本
        Args:
            anchors: (List[Tensor])
            targets: (List[Dict[Tensor])
        Returns:
            labels: 标记anchors归属类别(1, 0, -1分别对应正样本,背景,废弃的样本)
                    注意,在RPN中只有前景和背景,所有正样本的类别都是1,0代表背景
            matched_gt_boxes:与anchors匹配的gt
        """
        labels = []
        matched_gt_boxes = []
        # 遍历每张图像的anchors和targets
        for anchors_per_image, targets_per_image in zip(anchors, targets):
            # 获取GT的信息/取出GTbox对应的值
            gt_boxes = targets_per_image["boxes"]
            # 判断元素个数
            if gt_boxes.numel() == 0:
                device = anchors_per_image.device
                # 感觉可以替换为zeros_like
                # 没有目标全0
                matched_gt_boxes_per_image = torch.zeros(anchors_per_image.shape, dtype=torch.float32, device=device)
                labels_per_image = torch.zeros((anchors_per_image.shape[0],), dtype=torch.float32, device=device)
            else:
                # 计算anchors与真实bbox的iou信息
                # set to self.box_similarity when https://github.com/pytorch/pytorch/issues/27495 lands
                match_quality_matrix = box_ops.box_iou(gt_boxes, anchors_per_image)
                # 计算每个anchors与gt匹配iou最大的索引(如果iou<0.3索引置为-1,0.3<iou<0.7索引为-2)
                matched_idxs = self.proposal_matcher(match_quality_matrix)
                # get the targets corresponding GT for each proposal
                # NB: need to clamp the indices because we can have a single
                # GT in the image, and matched_idxs can be -2, which goes
                # out of bounds
                # 这里使用clamp设置下限0是为了方便取每个anchors对应的gt_boxes信息
                # 负样本和舍弃的样本都是负值,所以为了防止越界直接置为0
                # 因为后面是通过labels_per_image变量来记录正样本位置的,
                # 所以负样本和舍弃的样本对应的gt_boxes信息并没有什么意义,
                # 反正计算目标边界框回归损失时只会用到正样本。
                # 相当于把小于0的都设置为0 因为只需要把正样本取出来 其他样本无所谓不用区分
                matched_gt_boxes_per_image = gt_boxes[matched_idxs.clamp(min=0)]

                # 记录所有anchors匹配后的标签(正样本处标记为1,负样本处标记为0,丢弃样本处标记为-2)
                labels_per_image = matched_idxs >= 0
                labels_per_image = labels_per_image.to(dtype=torch.float32)

                # background (negative examples)
                bg_indices = matched_idxs == self.proposal_matcher.BELOW_LOW_THRESHOLD  # -1
                labels_per_image[bg_indices] = 0.0

                # discard indices that are between thresholds
                inds_to_discard = matched_idxs == self.proposal_matcher.BETWEEN_THRESHOLDS  # -2
                labels_per_image[inds_to_discard] = -1.0

            labels.append(labels_per_image)
            matched_gt_boxes.append(matched_gt_boxes_per_image)
        return labels, matched_gt_boxes
        # 返回标签和匹配的GTbox

    def _get_top_n_idx(self, objectness, num_anchors_per_level):
        # type: (Tensor, List[int]) -> Tensor
        """
        获取每张预测特征图上预测概率排前pre_nms_top_n的anchors索引值
        Args:
            objectness: Tensor(每张图像的预测目标概率信息 )
            num_anchors_per_level: List(每个预测特征层上的预测的anchors个数)
        Returns:

        """
        r = []  # 记录每个预测特征层上预测目标概率前pre_nms_top_n的索引信息
        offset = 0
        # 遍历每个预测特征层上的预测目标概率信息
        for ob in objectness.split(num_anchors_per_level, 1):
            if torchvision._is_tracing():
                num_anchors, pre_nms_top_n = _onnx_get_num_anchors_and_pre_nms_top_n(ob, self.pre_nms_top_n())
            else:
                num_anchors = ob.shape[1]  # 预测特征层上的预测的anchors个数
                pre_nms_top_n = min(self.pre_nms_top_n(), num_anchors)

            # Returns the k largest elements of the given input tensor along a given dimension
            _, top_n_idx = ob.topk(pre_nms_top_n, dim=1)
            r.append(top_n_idx + offset)
            offset += num_anchors
        return torch.cat(r, dim=1)

    def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):
        # type: (Tensor, Tensor, List[Tuple[int, int]], List[int]) -> Tuple[List[Tensor], List[Tensor]]
        """
        筛除小boxes框,nms处理,根据预测概率获取前post_nms_top_n个目标
        Args:
            proposals: 预测的bbox坐标
            objectness: 预测的目标概率
            image_shapes: batch中每张图片的size信息
            num_anchors_per_level: 每个预测特征层上预测anchors的数目

        Returns:

        """
        num_images = proposals.shape[0]
        device = proposals.device

        # do not backprop throught objectness
        objectness = objectness.detach()
        objectness = objectness.reshape(num_images, -1)

        # Returns a tensor of size size filled with fill_value
        # levels负责记录分隔不同预测特征层上的anchors索引信息
        levels = [torch.full((n, ), idx, dtype=torch.int64, device=device)
                  for idx, n in enumerate(num_anchors_per_level)]
        levels = torch.cat(levels, 0)

        # Expand this tensor to the same size as objectness
        levels = levels.reshape(1, -1).expand_as(objectness)

        # select top_n boxes independently per level before applying nms
        # 获取每张预测特征图上预测概率排前pre_nms_top_n的anchors索引值
        top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)

        image_range = torch.arange(num_images, device=device)
        batch_idx = image_range[:, None]  # [batch_size, 1]

        # 根据每个预测特征层预测概率排前pre_nms_top_n的anchors索引值获取相应概率信息
        objectness = objectness[batch_idx, top_n_idx]
        levels = levels[batch_idx, top_n_idx]
        # 预测概率排前pre_nms_top_n的anchors索引值获取相应bbox坐标信息
        proposals = proposals[batch_idx, top_n_idx]

        objectness_prob = torch.sigmoid(objectness)

        final_boxes = []
        final_scores = []
        # 遍历每张图像的相关预测信息
        for boxes, scores, lvl, img_shape in zip(proposals, objectness_prob, levels, image_shapes):
            # 调整预测的boxes信息,将越界的坐标调整到图片边界上
            boxes = box_ops.clip_boxes_to_image(boxes, img_shape)

            # 返回boxes满足宽,高都大于min_size的索引
            keep = box_ops.remove_small_boxes(boxes, self.min_size)
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

            # 移除小概率boxes,参考下面这个链接
            # https://github.com/pytorch/vision/pull/3205
            keep = torch.where(torch.ge(scores, self.score_thresh))[0]  # ge: >=
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

            # non-maximum suppression, independently done per level
            # 每个特征层单独NMS
            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)

            # keep only topk scoring predictions
            # 调用post_nms_top_n方法
            keep = keep[: self.post_nms_top_n()]
            boxes, scores = boxes[keep], scores[keep]

            final_boxes.append(boxes)
            final_scores.append(scores)
        return final_boxes, final_scores

    def compute_loss(self, objectness, pred_bbox_deltas, labels, regression_targets):
        # type: (Tensor, Tensor, List[Tensor], List[Tensor]) -> Tuple[Tensor, Tensor]
        """
        计算RPN损失,包括类别损失(前景与背景),bbox regression损失
        Arguments:
            objectness (Tensor):预测的前景概率
            pred_bbox_deltas (Tensor):预测的bbox regression
            labels (List[Tensor]):真实的标签 1, 0, -1(batch中每一张图片的labels对应List的一个元素中)
            regression_targets (List[Tensor]):真实的bbox regression

        Returns:
            objectness_loss (Tensor) : 类别损失
            box_loss (Tensor):边界框回归损失
        """
        # 按照给定的batch_size_per_image, positive_fraction选择正负样本
        sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
        # 将一个batch中的所有正负样本List(Tensor)分别拼接在一起,并获取非零位置的索引
        # sampled_pos_inds = torch.nonzero(torch.cat(sampled_pos_inds, dim=0)).squeeze(1)
        sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
        # sampled_neg_inds = torch.nonzero(torch.cat(sampled_neg_inds, dim=0)).squeeze(1)
        sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]

        # 将所有正负样本索引拼接在一起
        sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)
        objectness = objectness.flatten()

        labels = torch.cat(labels, dim=0)
        regression_targets = torch.cat(regression_targets, dim=0)

        # 计算边界框回归损失
        box_loss = det_utils.smooth_l1_loss(
            pred_bbox_deltas[sampled_pos_inds],
            regression_targets[sampled_pos_inds],
            beta=1 / 9,
            size_average=False,
        ) / (sampled_inds.numel())

        # 计算目标预测概率损失
        objectness_loss = F.binary_cross_entropy_with_logits(
            objectness[sampled_inds], labels[sampled_inds]
        )

        return objectness_loss, box_loss

    def forward(self,
                images,        # type: ImageList
                features,      # type: Dict[str, Tensor]
                targets=None   # type: Optional[List[Dict[str, Tensor]]]
                ):
        # type: (...) -> Tuple[List[Tensor], Dict[str, Tensor]]
        """
        Arguments:
            images (ImageList): images for which we want to compute the predictions
            features (Dict[Tensor]): features computed from the images that are
                used for computing the predictions. Each tensor in the list
                correspond to different feature levels
            targets (List[Dict[Tensor]): ground-truth boxes present in the image (optional).
                If provided, each element in the dict should contain a field `boxes`,
                with the locations of the ground-truth boxes.

        Returns:
            boxes (List[Tensor]): the predicted boxes from the RPN, one Tensor per
                image.
            losses (Dict[Tensor]): the losses for the model during training. During
                testing, it is an empty dict.
        """
        # RPN uses all feature maps that are available
        # features是所有预测特征层组成的OrderedDict
        features = list(features.values())

        # 计算每个预测特征层上的预测目标概率和bboxes regression参数
        # objectness和pred_bbox_deltas都是list
        # objectness, pred_bbox_deltas的元素都是tensor
        objectness, pred_bbox_deltas = self.head(features)

        # 生成一个batch图像的所有anchors信息,list(tensor)元素个数等于batch_size
        anchors = self.anchor_generator(images, features)

        # batch_size
        num_images = len(anchors)

        # numel() Returns the total number of elements in the input tensor.
        # 计算每个预测特征层上的对应的anchors数量
        num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
        num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]

        # 调整内部tensor格式以及shape
        objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness,
                                                                    pred_bbox_deltas)

        # apply pred_bbox_deltas to anchors to obtain the decoded proposals
        # note that we detach the deltas because Faster R-CNN do not backprop through
        # the proposals
        # 将预测的bbox regression参数应用到anchors上得到最终预测bbox坐标
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)

        # 筛除小boxes框,nms处理,根据预测概率获取前post_nms_top_n个目标
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)

        losses = {}
        if self.training:
            assert targets is not None
            # 计算每个anchors最匹配的gt,并将anchors进行分类,前景,背景以及废弃的anchors
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            # 结合anchors以及对应的gt,计算regression参数
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets
            )
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg
            }
        return boxes, losses
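For completeness, here is a small construction/usage sketch that wires these pieces together using torchvision's built-in versions of the same classes, with paper-style hyperparameters (module paths and defaults may differ slightly between torchvision versions):

import torch
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.image_list import ImageList
from torchvision.models.detection.rpn import RPNHead, RegionProposalNetwork

anchor_generator = AnchorGenerator(sizes=((128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
head = RPNHead(in_channels=512,
               num_anchors=anchor_generator.num_anchors_per_location()[0])  # 9

rpn = RegionProposalNetwork(
    anchor_generator, head,
    fg_iou_thresh=0.7, bg_iou_thresh=0.3,             # positive / negative IoU thresholds
    batch_size_per_image=256, positive_fraction=0.5,  # 256 sampled anchors per image, 1:1
    pre_nms_top_n=dict(training=2000, testing=1000),
    post_nms_top_n=dict(training=2000, testing=1000),
    nms_thresh=0.7)

rpn.eval()
images = ImageList(torch.rand(1, 3, 800, 800), image_sizes=[(800, 800)])
features = {"0": torch.rand(1, 512, 50, 50)}          # a single VGG-style feature map
with torch.no_grad():
    proposals, losses = rpn(images, features)         # losses is an empty dict in eval mode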

4. RPN positive and negative sample division threshold

One threshold identifies positive samples: an anchor whose IoU with a ground-truth box is greater than 0.7, or the anchor with the largest IoU with a GT box (this prevents the case where no anchor exceeds 0.7). The other marks negative samples, i.e. the background class: anchors whose IoU with every GT box is less than 0.3. Anchors falling between the two thresholds are hard negatives: if labeled positive they contain too much background, and if labeled negative they still contain features of the object to be detected, so neither label helps training. These proposals are therefore ignored; they are neither positive nor negative samples.

Each anchor is matched to the GT box with which it has the largest IoU. If max_iou > 0.7, the anchor's label is 1, i.e. the anchor is considered to contain a target; if max_iou < 0.3, the label is 0, i.e. the anchor is treated as background; if max_iou lies between 0.3 and 0.7, the anchor is ignored and not included in the loss function.

There is also a special case: a GT box may have no matching anchor, i.e. its IoU with every anchor is below 0.7. In that case the anchor with the largest IoU with this GT box is allowed to be treated as a positive sample, which guarantees that every GT box has at least one paired anchor.
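A minimal standalone sketch of this assignment rule (it uses torchvision.ops.box_iou directly instead of the Matcher class in the source above; the helper name is made up for illustration):

import torch
from torchvision.ops import box_iou

def assign_anchor_labels(anchors, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return a label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = box_iou(gt_boxes, anchors)     # [num_gt, num_anchors]
    max_iou, _ = iou.max(dim=0)          # best IoU over all gt boxes, per anchor

    labels = torch.full((anchors.shape[0],), -1, dtype=torch.int64)
    labels[max_iou < bg_thresh] = 0      # background
    labels[max_iou >= fg_thresh] = 1     # foreground

    # special case: for every gt box, force its best-matching anchor to be positive
    best_anchor_per_gt = iou.argmax(dim=1)
    labels[best_anchor_per_gt] = 1
    return labels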

The loss function of Faster RCNN has not changed much from that of Fast RCNN.
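For reference, the RPN's multi-task loss from the paper (essentially the same form as Fast RCNN's):

$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
$$

Here p_i is the predicted objectness of anchor i and p_i^* its label (1 for positive, 0 for negative), t_i and t_i^* are the predicted and ground-truth box regression parameters, L_cls is the log (binary cross-entropy) loss, and L_reg is the smooth L1 loss, counted only for positive anchors because of the p_i^* factor. In the paper N_cls is the mini-batch size (256), N_reg is roughly the number of anchor locations, and λ = 10.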

5. Training strategy

The RPN is a separate network structure that can be trained on its own. During training, each mini-batch contains 256 anchors sampled from one image, with a 1:1 ratio of positive to negative samples (a standalone sketch of this sampling follows). The positive/negative sample division of the Fast RCNN part is the same as before.
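A minimal standalone sketch of that sampling rule (the source above relies on det_utils.BalancedPositiveNegativeSampler; the helper below is only illustrative):

import torch

def sample_anchors(labels, batch_size_per_image=256, positive_fraction=0.5):
    """labels: 1 = positive, 0 = negative, -1 = ignored (for one image)."""
    pos = torch.where(labels == 1)[0]
    neg = torch.where(labels == 0)[0]

    # at most half the batch is positive; negatives fill whatever is left
    num_pos = min(pos.numel(), int(batch_size_per_image * positive_fraction))
    num_neg = min(neg.numel(), batch_size_per_image - num_pos)

    pos = pos[torch.randperm(pos.numel())[:num_pos]]
    neg = neg[torch.randperm(neg.numel())[:num_neg]]
    return pos, neg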

Faster RCNN uses four-step alternating training; the paper describes it as a pragmatic algorithm that learns shared features via alternating optimization.

In the first step, the RPN is trained on its own: its convolutional network is initialized from an ImageNet pre-trained model and fine-tuned, and the trained RPN is used to generate proposals.
In the second step, Fast RCNN is trained using the proposals generated by the RPN. Its convolutional network is also initialized from an ImageNet pre-trained model and fine-tuned; at this point the two networks do not share convolutional layers, i.e. there are two separately fine-tuned backbones.
In the third step, the convolutional network of the Fast RCNN from the second step is used as the backbone to train the RPN, fine-tuning only the layers unique to the RPN while keeping the shared CNN part fixed. The two networks now share convolutional layers, i.e. the same backbone.
In the fourth step, keeping the shared convolutional layers fixed, the RPN trained in the third step generates proposals for Fast RCNN, and only the layers unique to Fast RCNN (RoI pooling and the layers after it) are fine-tuned.
This alternation could be cycled for more rounds, but the paper observes negligible improvement from doing so.


3. Summary

Faster RCNN solves the problem of region proposal generation: it replaces the SS algorithm with the RPN and further speeds up detection.
The improvement path of the RCNN series is very clear and easy to follow:
RCNN: the first two-stage detection network
Fast RCNN: improves the pipeline, removing the drawback of sending every proposal through the convolutional network
Faster RCNN: RPN + Fast RCNN, introducing the RPN
