The most complete explanation of Faster R-CNN

One: Improvement of Faster R-CNN

If you want to understand Faster R-CNN well, you first need to understand the principles of the earlier R-CNN and Fast R-CNN. You can refer to my two previous blog posts: the most complete explanation of R-CNN and the explanation of Fast R-CNN.

Back to the topic. Building on R-CNN and Fast R-CNN, Ross B. Girshick and his collaborators proposed Faster R-CNN in 2016. The name is very direct, so where exactly is it "faster" compared with Fast R-CNN? The answer: in how the region proposals are extracted.

Although Fast R-CNN introduced ROI Pooling for feature extraction, which fixed the drawback of traditional R-CNN that every Region Proposal had to be fed through the CNN separately, it still relied on the traditional Selective Search algorithm to determine the Region Proposals, and a large share of training and testing time was spent on that RP search. The breakthrough of Faster R-CNN is to extract the RPs directly with an RPN network integrated into the overall model, which greatly improves overall performance, especially detection speed.

Two: Network Architecture

[Figure: Faster R-CNN network structure of the VGG16 model in faster_rcnn_test.pt]
The figure above shows the network structure of the VGG16 model in the Python version of faster_rcnn_test.pt. It can be clearly seen that the network is divided into the following modules:

  • Conv layers
    The Backbone layer is mainly used to extract the features in the input image and generate a Feature Map for use by the latter two modules.
  • Region Proposal Networks (RPN)
    The RPN module is used to train and extract the Region Proposal area in the original image, which is the most important module in the entire network model.
  • Semi-Fast R-CNN
    Semi-Fast R-CNN is my own name for this part, because it is almost identical to the head of Fast R-CNN; it is more commonly called the RoiHead layer. After the RPs have been determined by the RPN module, this Fast R-CNN-style head is trained to classify each RP region and fine-tune its bounding box.

To sum up, a careful reader will notice that Conv layers + Semi-Fast R-CNN is exactly Fast R-CNN! So Faster R-CNN is really RPN + Fast R-CNN, i.e. a two-stage detector. The two modules are also trained separately; at test time the RPN first generates the RPs, and then the Feature Map together with the RPs is fed into the Fast R-CNN head to complete the classification and box-regression tasks. Below, I explain the three modules in detail, one by one.

Three: Conv layers module

[Figure: structure of the Conv layers (VGG16 backbone)]

Conv layers contain three kinds of layers: conv, pooling, and relu. Taking the VGG16 network of the Python version faster_rcnn_test.pt as an example, the Conv layers part has 13 conv layers, 13 relu layers, and 4 pooling layers. Readers not familiar with VGG16 should pay attention to two details:

  • All conv layers are: kernel_size=3, pad=1, stride=1
  • All pooling layers are: kernel_size=2, pad=0, stride=2

After the Conv layers module, an input image of size M×N (e.g. 800×600) becomes a Feature Map of size (M/16)×(N/16): the four pooling layers each halve the spatial size (2^4 = 16), while every 3×3, pad 1, stride 1 convolution preserves it. In this way, each position of the feature map produced by the Conv layers corresponds to a 16×16 region of the original image.
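As a quick sanity check, here is a small sketch (plain Python, purely illustrative) that applies the standard output-size formulas to the 800×600 example above:

def conv_out(size, kernel=3, pad=1, stride=1):
    # standard formula: floor((size + 2*pad - kernel) / stride) + 1; size is unchanged for k=3, p=1, s=1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # 2x2 pooling with stride 2 halves the size (floored)
    return (size - kernel) // stride + 1

h, w = 800, 600
for _ in range(4):          # the 13 convs keep the size, the 4 poolings each halve it
    h, w = pool_out(h), pool_out(w)
print(h, w)                 # 50 37, i.e. roughly (800/16, 600/16)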

Four: Region Proposal Networks (RPN) module

[Figure: structure of the RPN module]
Finally we arrive at the most important module, the RPN, so lift your spirits and analyze it with me! In essence, the RPN consists of two functional sub-modules. The first sub-module uses a binary classifier to score each anchor as foreground or background, and uses regression to predict the four fine-tuning parameters between each anchor and its corresponding GT box. The second sub-module takes the scores and the four fine-tuning parameters output by the first, produces the ROIs, and selects the suitable RPs from them.
In fact, only the first sub-module of the RPN needs to be trained; the second is just a selection operation with no trainable parameters, which picks out the region proposals to be used by RoiHead for training and testing. Let me explain these two sub-modules in turn:

【Module 1】

The first sub-module covers how anchors are generated and labeled on the original image, and how the labeled anchors are used to train the RPN to classify each anchor as foreground or background and to regress the fine-tuning of its position. The following is a step-by-step explanation:

step1: generate_anchor_base

First, we generate anchors with the generate_anchor_base function. The main idea of the code: the base anchors are generated around the top-left point of the feature map, with three aspect ratios, and each ratio is combined with three scales, giving 9 anchors. Next, these 9 anchors at the top-left of the feature map are scaled up by base_size, the down-sampling factor of the original image after the 4 pooling layers, i.e. 16. The anchor center thus moves from (0.5, 0.5) on the feature map to (8, 8) on the original image, and the w and h of the nine anchors are likewise enlarged 16× on the original image. Then, starting from the anchors at the top-left corner of the original image, 9 anchors are drawn every base_size pixels; after this tiling there are roughly 20,000 anchors over the original image.
For the specific implementation, see the following code, together with the hand-drawn diagram attached below:

import numpy as np
import six  # six.moves hides the differences between Python 2 and 3 for relocated functions


def generate_anchor_base(base_size=16, ratios=[0.5, 1, 2],
                         anchor_scales=[8, 16, 32]):
    # Generate the 9 base anchors around a single reference point of the feature map, using
    # base length 16 and the chosen ratios/scales. base_size=16 is used because input images
    # are roughly 600x800, so a base length of 16 corresponds to a 256x256 region on the
    # original image, and the rescaled 128x128 and 512x512 regions are also reasonable sizes.
    # ratios=[0.5, 1, 2] are aspect ratios; anchor_scales=[8, 16, 32] multiply base_size, giving
    # anchor areas (16*8)^2, (16*16)^2, (16*32)^2, i.e. 128^2, 256^2 and 512^2.
    py = base_size / 2.
    px = base_size / 2.

    anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4),
                           dtype=np.float32)  # (9, 4); note these 9 anchors are only for the top-left reference point
    for i in six.moves.range(len(ratios)):
        for j in six.moves.range(len(anchor_scales)):
            h = base_size * anchor_scales[j] * np.sqrt(ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / ratios[i])  # the 9 combinations of h and w

            index = i * len(anchor_scales) + j
            anchor_base[index, 0] = py - h / 2.
            anchor_base[index, 1] = px - w / 2.
            anchor_base[index, 2] = py + h / 2.
            anchor_base[index, 3] = px + w / 2.  # (ymin, xmin, ymax, xmax) of each of the 9 anchors
    return anchor_base

# The 9 anchor shapes (h * w) are:
#  90.50967 * 181.01933 = 128^2
# 181.01933 * 362.03867 = 256^2
# 362.03867 * 724.07733 = 512^2
# 128.0     * 128.0     = 128^2
# 256.0     * 256.0     = 256^2
# 512.0     * 512.0     = 512^2
# 181.01933 *  90.50967 = 128^2
# 362.03867 * 181.01933 = 256^2
# 724.07733 * 362.03867 = 512^2
# The return value anchor_base has shape (9, 4), holding the (ymin, xmin, ymax, xmax) of the 9 anchors:
#  -37.2548   -82.5097    53.2548    98.5097
#  -82.5097  -173.019     98.5097   189.019
# -173.019   -354.039    189.019    370.039
#  -56        -56         72         72
# -120       -120        136        136
# -248       -248        264        264
#  -82.5097   -37.2548    98.5097    53.2548
# -173.019    -82.5097   189.019     98.5097
# -354.039   -173.019    370.039    189.019

[Figure: hand-drawn diagram of the 9 base anchors produced by generate_anchor_base]
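The 9 base anchors are then tiled over every position of the feature map to obtain the roughly 20,000 anchors on the original image. The reference implementation does this in a small helper (commonly named _enumerate_shifted_anchor); the following is a minimal sketch of that idea, with illustrative names:

import numpy as np

def enumerate_shifted_anchors(anchor_base, feat_stride, height, width):
    # Shift the 9 base anchors by feat_stride (=16) original-image pixels for every cell of the feature map.
    shift_y = np.arange(0, height * feat_stride, feat_stride)
    shift_x = np.arange(0, width * feat_stride, feat_stride)
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shift = np.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)  # (H*W, 4) offsets in (y, x, y, x) order
    A = anchor_base.shape[0]   # 9 base anchors
    K = shift.shape[0]         # H*W positions, e.g. 50*37 = 1850 for an 800x600 input
    anchors = anchor_base.reshape((1, A, 4)) + shift.reshape((K, 1, 4))
    return anchors.reshape((K * A, 4)).astype(np.float32)

# anchors = enumerate_shifted_anchors(generate_anchor_base(), 16, 50, 37)  # shape (16650, 4), i.e. on the order of 20,000 anchors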

Several coordinate-conversion utility functions are used in the explanations that follow, so I post them here first; you can get a first understanding of them from the comments:

import numpy as xp  # the reference code aliases numpy (or cupy) as xp


def loc2bbox(src_bbox, loc):  # given a source bbox and the offsets (dy, dx, dh, dw), compute the target box G
    if src_bbox.shape[0] == 0:
        return xp.zeros((0, 4), dtype=loc.dtype)
    # src_bbox: (R, 4); R is the number of bboxes, the 4 values are the top-left and bottom-right
    # corners (ymin, xmin, ymax, xmax) in image coordinates (the y axis points down)
    src_bbox = src_bbox.astype(src_bbox.dtype, copy=False)
    src_height = src_bbox[:, 2] - src_bbox[:, 0]   # ymax - ymin
    src_width = src_bbox[:, 3] - src_bbox[:, 1]    # xmax - xmin
    src_ctr_y = src_bbox[:, 0] + 0.5 * src_height  # y0 + 0.5h
    src_ctr_x = src_bbox[:, 1] + 0.5 * src_width   # x0 + 0.5w, i.e. the center coordinates
    # src_height is Ph, src_width is Pw, src_ctr_y is Py, src_ctr_x is Px
    dy = loc[:, 0::4]  # python slicing [start:stop:step]
    dx = loc[:, 1::4]
    dh = loc[:, 2::4]
    dw = loc[:, 3::4]
    # bounding-box regression as proposed in R-CNN: map the original proposal P to the
    # approximate target box G using the formulas above
    ctr_y = dy * src_height[:, xp.newaxis] + src_ctr_y[:, xp.newaxis]  # ctr_y is Gy
    ctr_x = dx * src_width[:, xp.newaxis] + src_ctr_x[:, xp.newaxis]   # ctr_x is Gx
    h = xp.exp(dh) * src_height[:, xp.newaxis]  # h is Gh
    w = xp.exp(dw) * src_width[:, xp.newaxis]   # w is Gw
    # the four lines above give the regressed target box (Gx, Gy, Gh, Gw)
    dst_bbox = xp.zeros(loc.shape, dtype=loc.dtype)  # loc.shape: (R, 4), same as src_bbox
    dst_bbox[:, 0::4] = ctr_y - 0.5 * h
    dst_bbox[:, 1::4] = ctr_x - 0.5 * w
    dst_bbox[:, 2::4] = ctr_y + 0.5 * h
    dst_bbox[:, 3::4] = ctr_x + 0.5 * w  # convert from center form back to top-left / bottom-right corners
    return dst_bbox


def bbox2loc(src_bbox, dst_bbox):  # given the source box and the target box, compute the offsets between them
    height = src_bbox[:, 2] - src_bbox[:, 0]
    width = src_bbox[:, 3] - src_bbox[:, 1]
    ctr_y = src_bbox[:, 0] + 0.5 * height
    ctr_x = src_bbox[:, 1] + 0.5 * width  # center of the source box

    base_height = dst_bbox[:, 2] - dst_bbox[:, 0]
    base_width = dst_bbox[:, 3] - dst_bbox[:, 1]
    base_ctr_y = dst_bbox[:, 0] + 0.5 * base_height
    base_ctr_x = dst_bbox[:, 1] + 0.5 * base_width  # center of the target box

    eps = xp.finfo(height.dtype).eps  # the smallest positive float
    height = xp.maximum(height, eps)
    width = xp.maximum(width, eps)  # clamp height and width so they are strictly positive

    dy = (base_ctr_y - ctr_y) / height
    dx = (base_ctr_x - ctr_x) / width
    dh = xp.log(base_height / height)
    dw = xp.log(base_width / width)  # dx, dy, dh, dw computed from the regression formulas above

    loc = xp.vstack((dy, dx, dh, dw)).transpose()  # np.vstack stacks the arrays row-wise
    return loc


def bbox_iou(bbox_a, bbox_b):  # intersection-over-union between two sets of bboxes
    if bbox_a.shape[1] != 4 or bbox_b.shape[1] != 4:
        raise IndexError  # each bbox must have the four coordinates (ymin, xmin, ymax, xmax)
    # tl is the top-left corner of the intersection (element-wise maximum). bbox_a[:, None, :2]
    # has shape (N, 1, 2) and bbox_b[:, :2] has shape (K, 2); by numpy broadcasting both become
    # (N, K, 2), i.e. every box in a is compared with every box in b
    tl = xp.maximum(bbox_a[:, None, :2], bbox_b[:, :2])
    br = xp.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:])  # br is the bottom-right corner of the intersection
    # the product over the last axis gives (ymax-ymin)*(xmax-xmin); the (tl < br) mask zeroes out
    # pairs of boxes that do not overlap
    area_i = xp.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = xp.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)  # area of each box in a
    area_b = xp.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)  # area of each box in b
    return area_i / (area_a[:, None] + area_b - area_i)  # IoU
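A tiny usage example of bbox_iou (the boxes are made up for illustration):

import numpy as np

a = np.array([[0., 0., 10., 10.]])    # one box: (ymin, xmin, ymax, xmax)
b = np.array([[0., 0., 5., 5.],
              [20., 20., 30., 30.]])  # two boxes
print(bbox_iou(a, b))  # [[0.25 0.  ]] : 25 / (100 + 25 - 25) = 0.25, and no overlap with the second box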

step2: AnchorTargetCreator

After generating the roughly 20,000 anchors on the original image, the AnchorTargetCreator function annotates them for training. The main idea of the code: for the classification labels, first remove the anchors that exceed the image boundary, leaving roughly 15,000. Then, for each anchor, compute which GT bbox gives it the maximum IoU and what that value is: anchors with IoU > 0.7 become positive (pos_anchor) and anchors with IoU < 0.3 become negative (neg_anchor). At the same time, for each GT bbox compute which anchor gives it the maximum IoU (this is the difference between taking the maximum along rows vs. along columns of the IoU matrix), and those anchors are also set to positive directly. Finally, 128 positives and 128 negatives are randomly sampled (256 in total; if there are fewer than 128 positives, negatives fill the shortfall): the positive samples get label 1, the negative samples get label 0, and all remaining anchors of the ~20,000 get label -1 and are ignored. For the 4 regression targets, the anchors that fall outside the image are first set to (0, 0, 0, 0); for the roughly 15,000 anchors inside the image, the 4 targets are the actual offsets to the GT bbox with which they have the maximum IoU. The specific code is as follows:

# Below is the AnchorTargetCreator() code. It produces the anchors used for RPN training:
# the coordinates of the sampled positive/negative anchors (up to 128 each) and the 256 labels (0 or 1).
import numpy as np

class AnchorTargetCreator(object):  # assign ground truth to the anchors using the true bboxes of each image
    # It provides the training samples for the RPN: the RPN is trained on the labeled samples produced
    # here, so that the predicted anchor class and position become more precise. Anchors still need
    # position refinement before they become real ROIs; these labeled samples are exactly what the RPN
    # learns that refinement from.
    def __init__(self, n_sample=256, pos_iou_thresh=0.7, neg_iou_thresh=0.3, pos_ratio=0.5):
        # defaults as in the reference implementation
        self.n_sample = n_sample
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh = neg_iou_thresh
        self.pos_ratio = pos_ratio

    def __call__(self, bbox, anchor, img_size):  # anchor: (S, 4), S is the number of anchors
        img_H, img_W = img_size
        n_anchor = len(anchor)  # roughly 20000 anchors in general
        inside_index = _get_inside_index(anchor, img_H, img_W)  # helper: keep only the indices of anchors that lie fully inside the image
        anchor = anchor[inside_index]  # keep only the anchors inside the image
        argmax_ious, label = self._create_label(inside_index, anchor, bbox)  # sample up to 128 positives and 128 negatives and give them labels
        loc = bbox2loc(anchor, bbox[argmax_ious])  # for every in-image anchor, compute the offset to the bbox with which it has the largest IoU
        label = _unmap(label, n_anchor, inside_index, fill=-1)  # helper: map the labels of the in-image anchors back onto all ~20000 anchors (others filled with -1)
        loc = _unmap(loc, n_anchor, inside_index, fill=0)  # map the regression targets back onto all ~20000 anchors (others filled with 0)
        return loc, label

    # the _create_label() function called above
    def _create_label(self, inside_index, anchor, bbox):
        label = np.empty((len(inside_index),), dtype=np.int32)  # inside_index holds the indices of all anchors inside the image
        label.fill(-1)  # initialize every label to -1
        argmax_ious, max_ious, gt_argmax_ious = self._calc_ious(anchor, bbox, inside_index)
        # _calc_ious() returns, for each anchor, which bbox it overlaps most and that IoU value, and for
        # each bbox, which anchor overlaps it most (note the difference between taking the maximum over
        # rows and over columns)
        label[max_ious < self.neg_iou_thresh] = 0  # anchors whose best IoU is below the negative threshold get label 0; pos_iou_thresh=0.7, neg_iou_thresh=0.3
        label[gt_argmax_ious] = 1  # the anchor with the largest IoU for each bbox gets label 1
        label[max_ious >= self.pos_iou_thresh] = 1  # anchors whose best IoU exceeds the positive threshold get label 1
        n_pos = int(self.pos_ratio * self.n_sample)  # target number of positives; pos_ratio=0.5, n_sample=256
        pos_index = np.where(label == 1)[0]  # indices of all positives
        if len(pos_index) > n_pos:  # if there are more positives than wanted, randomly disable the surplus by setting their label to -1
            disable_index = np.random.choice(
                pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1
        n_neg = self.n_sample - np.sum(label == 1)  # target number of negatives
        neg_index = np.where(label == 0)[0]  # indices of the negatives
        if len(neg_index) > n_neg:
            disable_index = np.random.choice(
                neg_index, size=(len(neg_index) - n_neg),
                replace=False)  # randomly disable the surplus negatives (len(neg_index) - n_neg of them) by setting their label to -1
            label[disable_index] = -1
        return argmax_ious, label

    # the _calc_ious() function called above
    def _calc_ious(self, anchor, bbox, inside_index):
        ious = bbox_iou(anchor, bbox)  # IoU between anchors and bboxes; ious: (N, K), N anchors (~15000) by K bboxes
        argmax_ious = ious.argmax(axis=1)  # axis=1 takes the maximum over each row (per anchor), axis=0 over each column (per bbox)
        max_ious = ious[np.arange(len(inside_index)), argmax_ious]  # for each anchor: which bbox gives the maximum IoU and that value; max_ious: (N,)
        gt_argmax_ious = ious.argmax(axis=0)
        gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])]  # for each bbox: which anchor gives the maximum IoU and that value; gt_max_ious: (K,)
        gt_argmax_ious = np.where(ious == gt_max_ious)[0]  # indices of the anchors achieving those per-bbox maxima (at least K of them)
        return argmax_ious, max_ious, gt_argmax_ious

step3: training RPN

After generating and labeling the training samples, we finally reach the training of the first functional sub-module. First, a 3×3 convolution is applied to the Feature Map, and the result is fed into two branches. Each branch starts with a 1×1 convolution whose purpose is to compress the channels. The first branch compresses the channel count to 9×2: 9 stands for the 9 anchors at each anchor point, and 2 for the probability that each anchor is foreground or background. The second branch compresses the channel count to 9×4: 9 again stands for the 9 anchors at each anchor point, and 4 for the predicted values of each anchor's 4 position parameters. For each mini-batch, the classification and regression losses are computed only over the 256 sampled anchors (128 positives and 128 negatives), and the regression loss is computed only for the positive samples.
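To make the two branches concrete, here is a minimal PyTorch sketch of the RPN head described above (the layer names are illustrative, not the exact names of any particular repository):

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal sketch of the RPN head: a 3x3 conv followed by two 1x1 conv branches."""

    def __init__(self, in_channels=512, mid_channels=512, n_anchor=9):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, stride=1, padding=1)
        self.score = nn.Conv2d(mid_channels, n_anchor * 2, 1)  # foreground/background score per anchor
        self.loc = nn.Conv2d(mid_channels, n_anchor * 4, 1)    # 4 regression parameters per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv1(feature_map))
        rpn_scores = self.score(h)  # (N, 9*2, H/16, W/16)
        rpn_locs = self.loc(h)      # (N, 9*4, H/16, W/16)
        return rpn_locs, rpn_scores

# e.g. an 800x600 image gives a 512 x 50 x 37 VGG16 feature map:
# rpn_locs, rpn_scores = RPNHead()(torch.zeros(1, 512, 50, 37))

The loss function used to train these two branches is shown below: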
[Figure: RPN multi-task loss function]
The classification loss is the traditional cross-entropy loss, and the regression loss is the Smooth L1 loss, shown below:
[Figure: Smooth L1 regression loss]
In practice, $N_{cls}$ is the mini-batch size and $N_{reg}$ is the size of the feature map (the number of anchor locations), so the gap between the two normalizers is large; the parameter $\lambda$ is used to balance them, so that the total network loss weighs the two kinds of loss evenly.
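For reference, the full multi-task loss from the Faster R-CNN paper (what the first figure above shows) is

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*),$$

where $p_i$ is the predicted foreground probability of anchor $i$, $p_i^*$ is its label (1 for a positive sample, 0 for a negative one), $t_i$ and $t_i^*$ are the predicted and ground-truth box offsets, and $L_{reg}$ is the Smooth L1 loss

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$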

【Module 2】

The second sub-module takes the scores and four position parameters output by the first sub-module, produces ROIs, and selects the suitable RPs. This is done in the ProposalCreator function. The core idea of the code: use the roughly 20,000 × 4 position parameters output by the trained first sub-module to fine-tune all anchors on the original image, producing roughly 20,000 ROIs. Next, the ROIs are clipped to the image, and ROIs whose width or height after clipping is smaller than a set threshold are eliminated. The remaining ROIs are then sorted by foreground score from high to low. For RoiHead training, the top 12,000 ROIs are kept, and after a second round of filtering by NMS only the top 2,000 are used as the final region proposals. For RoiHead testing, the top 6,000 ROIs are kept, and after NMS only the top 300 become the final region proposals. The specific code is as follows:

# Below is the ProposalCreator code. This part needs no back-propagation, so it can be implemented with numpy/tensors.
import numpy as np
# cp refers to cupy; non_maximum_suppression is a helper from the reference implementation

class ProposalCreator:
    # For each image, use its feature map to compute the foreground probability of the
    # (H/16)x(W/16)x9 (~20000) anchors, keep the ~12000 with the highest scores, correct their
    # positions with the regression parameters, and finally use non-maximum suppression to select
    # 2000 ROIs and their position parameters.
    def __init__(self, parent_model, nms_thresh=0.7,
                 n_train_pre_nms=12000, n_train_post_nms=2000,
                 n_test_pre_nms=6000, n_test_post_nms=300,
                 min_size=16):
        # defaults as in the reference implementation
        self.parent_model = parent_model
        self.nms_thresh = nms_thresh
        self.n_train_pre_nms = n_train_pre_nms
        self.n_train_post_nms = n_train_post_nms
        self.n_test_pre_nms = n_test_pre_nms
        self.n_test_post_nms = n_test_post_nms
        self.min_size = min_size

    def __call__(self, loc, score, anchor, img_size,
                 scale=1.):  # loc and score come from the 1x1 classification/regression convolutions in region_proposal_network
        if self.parent_model.training:
            n_pre_nms = self.n_train_pre_nms    # 12000
            n_post_nms = self.n_train_post_nms  # 2000 left after NMS
        else:
            n_pre_nms = self.n_test_pre_nms    # 6000
            n_post_nms = self.n_test_post_nms  # 300 left after NMS

        roi = loc2bbox(anchor, loc)  # convert the anchors plus offsets into rois that approximate the ground truth
        roi[:, slice(0, 4, 2)] = np.clip(roi[:, slice(0, 4, 2)], 0, img_size[0])  # clip the rois' ymin, ymax to [0, H]
        roi[:, slice(1, 4, 2)] = np.clip(roi[:, slice(1, 4, 2)], 0, img_size[1])  # clip the rois' xmin, xmax to [0, W]

        min_size = self.min_size * scale  # 16
        hs = roi[:, 2] - roi[:, 0]  # roi heights
        ws = roi[:, 3] - roi[:, 1]  # roi widths
        keep = np.where((hs >= min_size) & (ws >= min_size))[0]  # keep only rois whose height and width exceed the minimum size
        roi = roi[keep, :]

        score = score[keep]  # scores of the remaining rois (the foreground probability predicted by region_proposal_network)
        order = score.ravel().argsort()[::-1]  # flatten and sort the scores in descending order
        if n_pre_nms > 0:
            order = order[:n_pre_nms]  # keep the top 12000 of ~20000 rois at train time, the top 6000 at test time
        roi = roi[order, :]

        keep = non_maximum_suppression(
            cp.ascontiguousarray(cp.asarray(roi)),
            thresh=self.nms_thresh)  # suppress heavily overlapping rois (see NMS for details); 2000 boxes remain for the train set, 300 for the test set
        if n_post_nms > 0:
            keep = keep[:n_post_nms]
        roi = roi[keep]
        return roi

Five: Semi-Fast R-CNN (RoiHead)

After introducing the RPN module, the most important task, RP extraction, is complete. Next, RoiHead only needs to take the RPs output by the RPN as input for training and testing. I will explain the training phase and the testing phase separately:

【Training stage】

step1: Label training samples in RP

In the training phase, the RPN outputs about 2,000 region proposals. So how do we select samples among them and label them? The ProposalTargetCreator function performs this task. The core idea of the code: first, the 2,000 RPs and the M Ground Truth boxes are concatenated together, i.e. every GT is also used as an RP. Why?

Spoiler ahead: the answer is actually very simple. We are now in the training phase of RoiHead, training its classification ability and its second round of regression, which means the network must be fed training data that carries class labels and accurate position targets. The ROIs that the RPN picks out around the real objects can serve as that training data, but honestly they are all a bit "crooked", slightly off the objects. Using them to train RoiHead mainly reflects the reality of the test phase: at test time the RPN digs out the RPs, RoiHead classifies them and corrects their bboxes, and those RPs are crooked too. Relying only on such imperfect boxes, however, means the classification samples never cover the objects exactly. Since this is the training stage anyway, there is no harm in secretly feeding the network a little "high-quality nutrition": also train it directly on the GTs, the highest-quality boxes available, which are perfectly aligned classification samples for free.

Back to the topic: after concatenating the RPs and GTs, compute for each RP the GT with which it has the maximum IoU, and use that GT's (label + 1) as the RP's category label (1~20, with 0 reserved for background). Then sample 64 of the RPs with IoU > 0.5 as positive samples and 64 of the RPs with IoU < 0.5 as negative samples, setting the negative samples' label to 0, and finally pack these 128 positive and negative samples together as the training input of RoiHead. The specific implementation code is as follows:

# Below is the ProposalTargetCreator code. ProposalCreator produces 2000 ROIs, but not all of them are
# used for training; this class filters them down to the 128 used to train RoiHead itself.
import numpy as np

class ProposalTargetCreator(object):  # assign ground truth to the 2000 rois (strictly, pick out 128 of them and assign ground truth)
    # Input: 2000 rois, all bbox ground truths of one batch (one image) (R, 4), and the label of each bbox (R, 1) (for VOC2007, 20 classes, 0-19)
    # Output: 128 sample rois (128, 4), 128 gt_roi_loc (128, 4), 128 gt_roi_label (128, 1)
    def __init__(self, n_sample=128, pos_ratio=0.5, pos_iou_thresh=0.5,
                 neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0):
        # sampling hyper-parameters as described above
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_hi = neg_iou_thresh_hi
        self.neg_iou_thresh_lo = neg_iou_thresh_lo

    def __call__(self, roi, bbox, label, loc_normalize_mean=(0., 0., 0., 0.),
                 loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):  # this data is fed into the whole network for training, so the position targets are normalized
        n_bbox, _ = bbox.shape
        roi = np.concatenate((roi, bbox), axis=0)  # first concatenate the 2000 rois and the m bboxes into a new roi array (2000+m, 4)
        pos_roi_per_image = np.round(
            self.n_sample * self.pos_ratio)  # n_sample=128, pos_ratio=0.5; np.round rounds to the nearest integer
        iou = bbox_iou(roi, bbox)  # IoU of every roi with every bbox, shape (2000+m, m)
        gt_assignment = iou.argmax(axis=1)  # per row (per roi): which bbox gives the maximum IoU
        max_iou = iou.max(axis=1)  # per roi: the maximum IoU with its assigned bbox
        gt_roi_label = label[gt_assignment] + 1  # class index starting from 1: map the labels 0-19 to 1-20 (0 is background)
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]  # positive samples selected by the IoU threshold, pos_iou_thresh=0.5
        pos_roi_per_this_image = int(
            min(pos_roi_per_image, pos_index.size))  # number of positive rois to keep (the smaller of 64 and the number satisfying the threshold)
        if pos_index.size > 0:
            pos_index = np.random.choice(
                pos_index, size=pos_roi_per_this_image, replace=False)  # if there are too many, randomly drop some

        neg_index = np.where((max_iou < self.neg_iou_thresh_hi) &
                             (max_iou >= self.neg_iou_thresh_lo))[0]  # neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image  # number of negative rois to keep (fill up to 128 in total)
        neg_roi_per_this_image = int(min(neg_roi_per_this_image,
                                         neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(
                neg_index, size=neg_roi_per_this_image, replace=False)  # if there are too many, randomly drop some

        keep_index = np.append(pos_index, neg_index)
        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0  # negative samples get label 0
        sample_roi = roi[keep_index]
        # The resulting 128x4 sample_roi can now be fed into the RoIHead network for classification and
        # regression. RoIHead takes sample_roi + feature as input and outputs the predictions for
        # classification (21 classes) and regression (further fine-tuning of the bbox); the ground truth
        # for that classification/regression is exactly the gt_roi_label and gt_roi_loc output here.
        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])  # ground-truth offsets for the 128 samples
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)
                       ) / np.array(loc_normalize_std,
                                    np.float32))  # ProposalTargetCreator is the first place the real 21-class labels are used; loc is normalized here, so at test time the predictions must be de-normalized with the same mean/std
        return sample_roi, gt_roi_loc, gt_roi_label

step2: formal training

Project the 128 labeled training samples from the original image onto the corresponding ROI regions of the Feature Map, pass them through the RoiPooling layer so that these differently sized ROI regions all become vectors of the same length, and then pass them through two 4096-dimensional FC layers. Two heads then produce, respectively, the softmax scores over 21 classes and the predicted 84 bbox parameters (21 × 4); both are fed into the loss function for back-propagation to update the network weights, and the regression loss is computed only for the positive samples. The loss function is similar to that of the RPN, so I will not repeat it here; below is the core code of the loss function:

import torch as t
from torch import nn
# _smooth_l1_loss is a helper defined alongside this function in the reference code


def _fast_rcnn_loc_loss(pred_loc, gt_loc, gt_label, sigma):  # inputs: the predicted box offsets, the ground-truth offsets, and the labels
    in_weight = t.zeros(gt_loc.shape).cuda()
    # Localization loss is calculated only for positive rois.
    # NOTE: unlike the original implementation, we don't need inside_weight and
    # outside_weight; they can be computed from gt_label
    in_weight[(gt_label > 0).view(-1, 1).expand_as(in_weight).cuda()] = 1
    loc_loss = _smooth_l1_loss(pred_loc, gt_loc, in_weight.detach(), sigma)  # sigma is set to 1
    # Normalize by the total number of negative and positive rois.
    loc_loss /= ((gt_label >= 0).sum().float())  # ignore gt_label == -1 for rpn_loss (i.e. exclude the ignored samples)
    return loc_loss

roi_cls_loss = nn.CrossEntropyLoss()(roi_score, gt_roi_label.cuda())  # cross-entropy classification loss
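For completeness, here is a minimal PyTorch sketch of the RoiHead forward pass described at the beginning of this step (a VGG16-style head; RoIPool comes from torchvision, and the layer names are illustrative rather than those of any specific repository):

import torch
import torch.nn as nn
from torchvision.ops import RoIPool

class RoIHead(nn.Module):
    """Minimal sketch: RoIPool -> two FC-4096 layers -> 21-way scores and 21*4 box offsets."""

    def __init__(self, n_class=21, roi_size=7, spatial_scale=1. / 16):
        super().__init__()
        self.roi_pool = RoIPool((roi_size, roi_size), spatial_scale)
        self.classifier = nn.Sequential(
            nn.Linear(512 * roi_size * roi_size, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True))
        self.cls_loc = nn.Linear(4096, n_class * 4)  # 84 bbox regression parameters
        self.score = nn.Linear(4096, n_class)        # 21-class scores

    def forward(self, feature_map, rois_with_batch_index):
        # rois_with_batch_index: (R, 5) = (batch_index, xmin, ymin, xmax, ymax) in image coordinates
        pool = self.roi_pool(feature_map, rois_with_batch_index)  # (R, 512, 7, 7)
        fc = self.classifier(pool.flatten(start_dim=1))
        return self.cls_loc(fc), self.score(fc)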

【Test stage】

In the test phase, RoiHead takes the 300 RPs output by the RPN as input and finally outputs, for each RP, its class scores and 4 regression fine-tuning parameters. RPs that are classified as background (class 0), or whose highest foreground-class (1~20) score is below the score threshold, are eliminated. Finally, the remaining RPs are fine-tuned according to the regression parameters to obtain the final bounding boxes. And with that, the job is done.
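A minimal sketch of that per-class selection step (the thresholds and the function name are illustrative; the per-class NMS uses torchvision's nms, and boxes are assumed in (xmin, ymin, xmax, ymax) order as torchvision expects):

import torch
from torchvision.ops import nms

def suppress(boxes, probs, score_thresh=0.7, nms_thresh=0.3):
    """boxes: (R, 21, 4) class-wise decoded boxes; probs: (R, 21) softmax scores."""
    final_boxes, final_labels, final_scores = [], [], []
    for cls in range(1, 21):  # skip class 0 (background)
        cls_boxes = boxes[:, cls, :]
        cls_scores = probs[:, cls]
        mask = cls_scores > score_thresh               # drop low-scoring RPs for this class
        cls_boxes, cls_scores = cls_boxes[mask], cls_scores[mask]
        keep = nms(cls_boxes, cls_scores, nms_thresh)  # per-class non-maximum suppression
        final_boxes.append(cls_boxes[keep])
        final_labels.append(torch.full((len(keep),), cls, dtype=torch.long))
        final_scores.append(cls_scores[keep])
    return torch.cat(final_boxes), torch.cat(final_labels), torch.cat(final_scores)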

Six: Faster R-CNN training method

Faster-RCNN has two training methods: four-step alternating iterative training and joint training. This article mainly explains the training method of four-step alternating iterations, as follows:

1. Train the RPN: initialize the shared convolution layers and the RPN with a model pre-trained on a large dataset, and train the RPN end-to-end to generate Region Proposals;
2. Train Fast R-CNN: initialize a new shared convolutional network of the same structure with the same pre-trained model [note: not the one trained in step 1], fix the RPN weights trained in step 1, and train the RCNN network with the Proposals produced by that RPN;
3. Tune the RPN: take the shared convolution layers and the RCNN trained in step 2, fix the shared convolutional layers, and continue training the RPN. I regard this step as fine-tuning the RPN trained in step 1;
4. Tune Fast R-CNN: with the shared convolution layers and the RPN from step 3 (the shared convolutional layers stay fixed), continue training and fine-tuning the RCNN;
5. Steps 3 and 4 can be repeated iteratively. (In general, stopping after step 4 is enough; further iterations bring almost no improvement.)

Here is a flow chart of the training process, which should be clearer:
[Figure: flow chart of the four-step alternating training process]

Seven: Faster R-CNN test method

Next, we will explain the testing process of the entire network. You're almost done!

step1: The input image passes through the convolutional layers to obtain the feature map

step2: Feature map gets 300 RP through RPN

step3: Input RP into the RoiHead network

step4: Obtain the category score and bbox position parameters of each RP

step5: Select the final ROI by the score threshold

step6: Fine-tune the bbox box of ROI in combination with the position parameters

step7: Draw the final detection frame after NMS
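Putting the seven steps together, here is a pseudocode-style sketch of the test-time flow (every name here is an illustrative placeholder for the components described in the sections above; decode_boxes stands for the per-class loc2bbox decoding, and suppress is the per-class NMS sketch from the test-stage section):

import torch

@torch.no_grad()
def detect(image, backbone, rpn_head, proposal_creator, roi_head, anchors):
    feature_map = backbone(image)                               # step1: conv layers
    rpn_locs, rpn_scores = rpn_head(feature_map)                # RPN forward pass
    rois = proposal_creator(rpn_locs, rpn_scores, anchors,      # step2: ~300 RPs after NMS
                            img_size=image.shape[-2:])
    roi_locs, roi_scores = roi_head(feature_map, rois)          # step3/4: class scores and 84 offsets per RP
    boxes = decode_boxes(rois, roi_locs)                        # step6: apply the regression fine-tuning
    return suppress(boxes, roi_scores.softmax(dim=-1))          # step5/7: score threshold + per-class NMS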

Eight: Summary

Although Fast R-CNN had greatly improved both speed and accuracy, it still failed to achieve end-to-end object detection: for example, the candidate regions could not be obtained inside the network itself, so there was still room to improve the speed. That is exactly the gap that Faster R-CNN closes with the RPN.

Finally, I attach one last super-sized schematic flow chart as an ending, for reference:

[Figure: full Faster R-CNN schematic flow chart]


  So far, I have given an in-depth explanation of the entire process and details of Faster R-CNN. I hope it is helpful to you. If there is anything you do not understand, or you have questions or suggestions, please leave a comment below. (Writing all this up was not easy, so please give it a thumbs-up; the giver of roses keeps a lingering fragrance. Thank you!)

I am a Jiangnan salted fish struggling in the CV quagmire, let's work hard together and leave no regrets!
