Instance Segmentation 1: Building a Mask R-CNN Instance Segmentation Platform with TensorFlow 2

Preface

I re-implemented Mask R-CNN with TensorFlow 2, at least to keep up with the times, right?

What is Mask R-CNN

Mask R-CNN is Kaiming He's 2017 work. It performs instance segmentation alongside object detection and achieved excellent results.
Its network design is also fairly simple: on top of Faster R-CNN, a third branch for semantic segmentation is added to the original two branches (classification + bounding-box regression).

Source code download

https://github.com/bubbliiiing/mask-rcnn-tf2
If you like it, you can give it a star.

Mask R-CNN implementation ideas

1. Prediction part

1. Introduction to the backbone network

Mask R-CNN uses ResNet101 as its backbone feature extraction network, corresponding to the CNN part in the figure. The input image size must be divisible by 2^6 (64). After feature extraction, the feature layers whose height and width have been compressed 2, 3, 4 and 5 times are used to construct the feature pyramid structure.

ResNet101 has two basic blocks, named Conv Block and Identity Block. The input and output of a Conv Block have different dimensions, so Conv Blocks cannot be stacked consecutively; their role is to change the dimensions of the network. The input and output of an Identity Block have the same dimensions, so Identity Blocks can be stacked to deepen the network.
The structure of Conv Block is as follows:
(figure: Conv Block structure)
The structure of Identity Block is as follows:
(figure: Identity Block structure)
Both are residual network structures.

Taking the input shape used for the official COCO dataset as an example, the input shape is 1024x1024x3 and the shapes change as follows:
(figure: shape changes of ResNet101 for a 1024x1024 input)
We take the feature layers whose height and width have been compressed 2, 3, 4 and 5 times to construct the feature pyramid structure.

Implementation code:

from tensorflow.keras.layers import (Activation, Add, BatchNormalization,
                                     Conv2D, MaxPooling2D, ZeroPadding2D)
from tensorflow.keras.regularizers import l2


#----------------------------------------------#
#   The main difference between conv_block and
#   identity_block: conv_block compresses the
#   width and height of the input feature layer,
#   while identity_block is used to deepen the network.
#----------------------------------------------#
def identity_block(input_tensor, kernel_size, filters, stage, block, use_bias=True, weight_decay=0, train_bn=True):
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), name=conv_name_base + '2a', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same', name=conv_name_base + '2b', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    x = Add()([x, input_tensor])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

def conv_block(input_tensor, kernel_size, filters, stage, block, strides=(2, 2), use_bias=True, weight_decay=0, train_bn=True):

    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), strides=strides, name=conv_name_base + '2a', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same', name=conv_name_base + '2b', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    shortcut = Conv2D(nb_filter3, (1, 1), strides=strides, name=conv_name_base + '1', use_bias=use_bias, kernel_regularizer=l2(weight_decay))(input_tensor)
    shortcut = BatchNormalization(name=bn_name_base + '1')(shortcut, training=train_bn)

    x = Add()([x, shortcut])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

#----------------------------------------------#
#   Build the ResNet backbone
#----------------------------------------------#
def get_resnet(input_image, train_bn=True, weight_decay=0):
    #----------------------------------------------#
    #   Assume the input image is 1024,1024,3
    #----------------------------------------------#

    # 1024,1024,3 -> 512,512,64
    x = ZeroPadding2D((3, 3))(input_image)
    x = Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True, kernel_regularizer=l2(weight_decay))(x)
    x = BatchNormalization(name='bn_conv1')(x, training=train_bn)
    x = Activation('relu')(x)

    # 512,512,64 -> 256,256,64
    x = MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    C1 = x

    # 256,256,64 -> 256,256,256
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', weight_decay=weight_decay, train_bn=train_bn)
    C2 = x

    # 256,256,256 -> 128,128,512
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', weight_decay=weight_decay, train_bn=train_bn)
    C3 = x
    
    # 128,128,512 -> 64,64,1024
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', weight_decay=weight_decay, train_bn=train_bn)
    block_count = 22
    for i in range(block_count):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i), weight_decay=weight_decay, train_bn=train_bn)
    C4 = x
    
    # 64,64,1024 -> 32,32,2048
    x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', weight_decay=weight_decay, train_bn=train_bn)
    x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', weight_decay=weight_decay, train_bn=train_bn)
    C5 = x 
    return [C1, C2, C3, C4, C5]
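
For reference, a minimal sketch of how the shape flow above can be checked, assuming the imports and the get_resnet function defined above are in scope (the Input layer and the 1024x1024 size are only for this check):

from tensorflow.keras.layers import Input

# Build the backbone on a 1024x1024x3 input and inspect the feature layer shapes
inputs = Input(shape=[1024, 1024, 3])
C1, C2, C3, C4, C5 = get_resnet(inputs, train_bn=False)
for name, feat in zip(["C1", "C2", "C3", "C4", "C5"], [C1, C2, C3, C4, C5]):
    print(name, feat.shape)   # expected heights/widths: 256, 256, 128, 64, 32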

2. Construction of Feature Pyramid FPN

The feature pyramid FPN is built to fuse features at multiple scales. In Mask R-CNN, we take out C2, C3, C4 and C5 from the backbone feature extraction network, whose height and width have been compressed 2, 3, 4 and 5 times respectively, and use them to construct the feature pyramid structure.
The resulting P2, P3, P4, P5 and P6 serve as the effective feature layers of the RPN network. The RPN proposal network operates on these feature layers and decodes the prior boxes (anchors) to obtain the proposal boxes.

The resulting P2, P3, P4 and P5 serve as the effective feature layers of the Classifier and Mask networks. The Classifier network operates on these feature layers and decodes the proposal boxes to obtain the final prediction boxes; the Mask semantic segmentation network operates on them to obtain the semantic segmentation result inside each prediction box.

The implementation code is as follows:

#----------------------------------------------#
#   Assemble the feature pyramid structure
#   P5 has been downsampled 5 times in width/height
#   P5 is 32,32,256
#----------------------------------------------#
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
#----------------------------------------------#
#   Upsample P5 and add it to P4
#   P4 has been downsampled 4 times in width/height
#   P4 is 64,64,256
#----------------------------------------------#
P4 = Add(name="fpn_p4add")([
    UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
#----------------------------------------------#
#   Upsample P4 and add it to P3
#   P3 has been downsampled 3 times in width/height
#   P3 is 128,128,256
#----------------------------------------------#
P3 = Add(name="fpn_p3add")([
    UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
#----------------------------------------------#
#   Upsample P3 and add it to P2
#   P2 has been downsampled 2 times in width/height
#   P2 is 256,256,256
#----------------------------------------------#
P2 = Add(name="fpn_p2add")([
    UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])
    
#-----------------------------------------------------------#
#   Apply a 256-channel convolution to each level; afterwards
#   P2, P3, P4 and P5 all have the same number of channels
#   P2 is 256,256,256
#   P3 is 128,128,256
#   P4 is 64,64,256
#   P5 is 32,32,256
#-----------------------------------------------------------#
P2 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
P3 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
P4 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
#----------------------------------------------#
#   The proposal network also uses a P6
#   to generate proposal boxes
#   P6 is 16,16,256
#----------------------------------------------#
P6 = MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

#----------------------------------------------#
#   P2, P3, P4, P5 and P6 are used to generate proposal boxes
#----------------------------------------------#
rpn_feature_maps    = [P2, P3, P4, P5, P6]
#----------------------------------------------#
#   P2, P3, P4 and P5 are used to obtain mask information
#----------------------------------------------#
mrcnn_feature_maps  = [P2, P3, P4, P5]

3. Obtaining the Proposal boxes

The effective feature layers obtained in the previous step are the Feature Maps in the figure. They have two uses: one is used in combination with ROIAlign, and the other is fed into the Region Proposal Network to obtain the proposal boxes.

When obtaining the proposal boxes, the effective feature layers we use are P2, P3, P4, P5 and P6. They share the same RPN proposal network, which predicts the prior box adjustment parameters and whether each prior box contains an object.

In Mask R-CNN, the structure of the RPN proposal network is similar to the RPN in Faster R-CNN.

First, a 3x3 convolution with 512 channels is applied.

Then an anchors_per_location x 4 convolution and an anchors_per_location x 2 convolution are applied separately.

The anchors_per_location x 4 convolution predicts the adjustment of each prior box at each grid point of the common feature layer. (Why an adjustment? Because, as in Faster R-CNN, the prediction has to be combined with the prior boxes to obtain the prediction boxes; what the network predicts is the change relative to the prior boxes.)

The anchors_per_location x 2 convolution predicts whether there is an object inside each prediction box at each grid point of the common feature layer.

When the input image shape is 1024x1024x3, the shapes of the common feature layers are 256x256x256, 128x128x256, 64x64x256, 32x32x256 and 16x16x256, which is equivalent to dividing the input image into grids of different sizes. Each grid point is then assigned 3 (anchors_per_location) prior boxes by default; they have different sizes and densely cover the image.

The result of the anchors_per_location x 4 convolution adjusts these prior boxes to obtain new boxes.
The anchors_per_location x 2 convolution determines whether the new boxes obtained above contain objects.

At this point we obtain some useful boxes, and the anchors_per_location x 2 convolution has judged whether they contain objects.

At this point the boxes are only roughly localized; they are the proposal boxes. Next we continue looking for objects inside the proposal boxes.

The implementation code is:

#------------------------------------#
#   The five feature layers of different
#   sizes are passed into the RPN to
#   obtain proposal boxes
#------------------------------------#
def rpn_graph(feature_map, anchors_per_location, weight_decay=0):
    #------------------------------------#
    #   Integrate features with a 3x3 convolution
    #------------------------------------#
    shared = Conv2D(512, (3, 3), padding='same', activation='relu',
                       name='rpn_conv_shared', kernel_regularizer=l2(weight_decay))(feature_map)
    
    #------------------------------------#
    #   batch_size, num_anchors, 2
    #   whether each prior box contains an object
    #------------------------------------#
    x = Conv2D(anchors_per_location * 2, (1, 1), padding='valid', activation='linear', name='rpn_class_raw', kernel_regularizer=l2(weight_decay))(shared)
    rpn_class_logits = Reshape([-1,2])(x)
    rpn_probs = Activation("softmax", name="rpn_class_xxx")(rpn_class_logits)
    
    #------------------------------------#
    #   batch_size, num_anchors, 4
    #   the adjustment parameters of each prior box
    #------------------------------------#
    x = Conv2D(anchors_per_location * 4, (1, 1), padding="valid", activation='linear', name='rpn_bbox_pred', kernel_regularizer=l2(weight_decay))(shared)
    rpn_bbox = Reshape([-1, 4])(x)

    return [rpn_class_logits, rpn_probs, rpn_bbox]

#------------------------------------#
#   Build the proposal network model
#   (the RPN model)
#------------------------------------#
def build_rpn_model(anchors_per_location, depth, weight_decay=0):
    input_feature_map = Input(shape=[None, None, depth], name="input_rpn_feature_map")
    outputs = rpn_graph(input_feature_map, anchors_per_location, weight_decay=weight_decay)
    return Model([input_feature_map], outputs, name="rpn_model")
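
As noted above, the same RPN is shared across P2 to P6. A hedged sketch of how it might be applied to the FPN outputs (assumptions: the rpn_feature_maps list from the FPN code above, anchors_per_location = 3, and a pyramid channel size of 256, i.e. config.TOP_DOWN_PYRAMID_SIZE):

from tensorflow.keras.layers import Concatenate

rpn = build_rpn_model(anchors_per_location=3, depth=256)
# Run the shared RPN on every pyramid level
layer_outputs = [rpn([p]) for p in rpn_feature_maps]
# Concatenate each of the three outputs across levels along the anchor axis
names = ["rpn_class_logits", "rpn_class", "rpn_bbox"]
rpn_class_logits, rpn_class, rpn_bbox = [
    Concatenate(axis=1, name=n)(list(o)) for o, n in zip(zip(*layer_outputs), names)
]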

4. Decoding the Proposal boxes

Through the previous step, we obtained the prediction results of many prior boxes. The prediction result consists of two parts.

The anchors_per_location x 4 convolution predicts the adjustment of each prior box at each grid point of the effective feature layer.

The anchors_per_location x 2 convolution predicts whether there is an object inside each prediction box at each grid point of the effective feature layer.

It is equivalent to dividing the entire image into several grids and then placing 3 prior boxes centered at each grid point. When the input image is 1024x1024x3, the total number of prior boxes is 196608 + 49152 + 12288 + 3072 + 768 = 261,888.

When the input image shape is different, the number of prior boxes also changes.
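
A quick check of this arithmetic (assumptions: P2 to P6 have grid sizes 256, 128, 64, 32 and 16 on a 1024x1024 input, with 3 prior boxes per grid point):

feature_sizes = [256, 128, 64, 32, 16]     # grid sizes of P2..P6 for a 1024x1024 input
anchors_per_location = 3
total = sum(s * s * anchors_per_location for s in feature_sizes)
print(total)                               # 261888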

Although the prior boxes can represent certain box positions and sizes, they are limited and cannot represent every possible box, so they need to be adjusted.

In anchors_per_location x 4, anchors_per_location is the number of prior boxes at each grid point, and 4 represents the adjustment of the box center and of its width and height.

The implementation code is as follows:

#------------------------------------------------------------------#
#   Adjust the prior boxes with the predicted adjustment parameters
#   to obtain the coordinates of the proposal boxes
#------------------------------------------------------------------#
def apply_box_deltas_graph(boxes, deltas):
    #---------------------------------------#
    #   Compute the center, height and width of the prior boxes
    #---------------------------------------#
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    #---------------------------------------#
    #   Compute the adjusted center, height and width
    #---------------------------------------#
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= tf.math.exp(deltas[:, 2])
    width *= tf.math.exp(deltas[:, 3])
    #---------------------------------------#
    #   Compute the top-left and bottom-right coordinates
    #---------------------------------------#
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out")
    return result

def clip_boxes_graph(boxes, window):
    wy1, wx1, wy2, wx2 = tf.split(window, 4)
    y1, x1, y2, x2 = tf.split(boxes, 4, axis=1)
    
    y1 = tf.maximum(tf.minimum(y1, wy2), wy1)
    x1 = tf.maximum(tf.minimum(x1, wx2), wx1)
    y2 = tf.maximum(tf.minimum(y2, wy2), wy1)
    x2 = tf.maximum(tf.minimum(x2, wx2), wx1)
    clipped = tf.concat([y1, x1, y2, x2], axis=1, name="clipped_boxes")
    clipped.set_shape((clipped.shape[0], 4))
    return clipped

#----------------------------------------------------------#
#   Proposal Layer
#   This part converts prior boxes into proposal boxes
#----------------------------------------------------------#
class ProposalLayer(Layer):
    def __init__(self, proposal_count, nms_threshold, config=None, **kwargs):
        super(ProposalLayer, self).__init__(**kwargs)
        self.config = config
        self.proposal_count = proposal_count
        self.nms_threshold = nms_threshold

    def call(self, inputs):
        #----------------------------------------------------------#
        #   The inputs contain three tensors:
        #   inputs[0]   rpn_class   : Batch_size, num_anchors, 2
        #   inputs[1]   rpn_bbox    : Batch_size, num_anchors, 4
        #   inputs[2]   anchors     : Batch_size, num_anchors, 4
        #----------------------------------------------------------#

        #----------------------------------------------------------#
        #   Whether each prior box contains an object [Batch_size, num_anchors, 1]
        #----------------------------------------------------------#
        scores = inputs[0][:, :, 1]

        #----------------------------------------------------------#
        #   The adjustment parameters of the prior boxes [batch, num_rois, 4]
        #----------------------------------------------------------#
        deltas = inputs[1]

        #----------------------------------------------------------#
        #   The coordinates of the prior boxes
        #----------------------------------------------------------#
        anchors = inputs[2]

        #----------------------------------------------------------#
        #   RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2] rescales the deltas
        #----------------------------------------------------------#
        deltas = deltas * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4])

        #----------------------------------------------------------#
        #   Select the boxes with the top 6000 scores (PRE_NMS_LIMIT)
        #----------------------------------------------------------#
        pre_nms_limit = tf.minimum(self.config.PRE_NMS_LIMIT, tf.shape(anchors)[1])

        #----------------------------------------------------------#
        #   Get the indices of these boxes
        #----------------------------------------------------------#
        ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True,
                         name="top_anchors").indices

        #----------------------------------------------------------#
        #   Gather the prior boxes, their scores and adjustment parameters
        #----------------------------------------------------------#
        scores = batch_slice([scores, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        deltas = batch_slice([deltas, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        pre_nms_anchors = batch_slice([anchors, ix], lambda a, x: tf.gather(a, x),
                                    self.config.IMAGES_PER_GPU,
                                    names=["pre_nms_anchors"])

        #----------------------------------------------------------#
        #   [batch, pre_nms_limit, (y1, x1, y2, x2)]
        #   Decode the prior boxes
        #----------------------------------------------------------#
        boxes = batch_slice([pre_nms_anchors, deltas],
                                  lambda x, y: apply_box_deltas_graph(x, y),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors"])

        #----------------------------------------------------------#
        #   [batch, pre_nms_limit, (y1, x1, y2, x2)]
        #   Clip the boxes so they do not exceed the image
        #----------------------------------------------------------#
        window = np.array([0, 0, 1, 1], dtype=np.float32)
        boxes = batch_slice(boxes,
                                  lambda x: clip_boxes_graph(x, window),
                                  self.config.IMAGES_PER_GPU,
                                  names=["refined_anchors_clipped"])

        #---------------------------------------------------------#
        #   After non-maximum suppression we obtain proposals
        #   with shape [batch, NMS_ROIS, 4]
        #---------------------------------------------------------#
        def nms(boxes, scores):
            indices = tf.image.non_max_suppression(
                boxes, scores, self.proposal_count,
                self.nms_threshold, name="rpn_non_max_suppression")
            proposals = tf.gather(boxes, indices)
            padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0)
            proposals = tf.pad(proposals, [(0, padding), (0, 0)])
            return proposals
        proposals = batch_slice([boxes, scores], nms, self.config.IMAGES_PER_GPU)
        
        return tf.reshape(proposals, (-1, self.proposal_count, 4))

    def compute_output_shape(self, input_shape):
        return (None, self.proposal_count, 4)
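
For intuition, a hedged usage sketch of the two decoding utilities above on a single normalized box (dummy numbers, eager TF2 assumed; apply_box_deltas_graph and clip_boxes_graph are the functions defined above):

import tensorflow as tf

# One prior box (y1, x1, y2, x2) in normalized coordinates and one set of deltas (dy, dx, log(dh), log(dw))
boxes  = tf.constant([[0.2, 0.2, 0.6, 0.6]], dtype=tf.float32)
deltas = tf.constant([[0.5, 0.0, 0.0, 0.0]], dtype=tf.float32)

decoded = apply_box_deltas_graph(boxes, deltas)                       # -> [[0.4, 0.2, 0.8, 0.6]]
clipped = clip_boxes_graph(decoded, tf.constant([0., 0., 1., 1.]))    # clip to the image window
print(clipped.numpy())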

5. Using the Proposal boxes (ROI Align)

Let us get an overall picture of the proposal boxes:
a proposal box is, in effect, a preliminary screening of which regions of the image contain objects.

What Mask R-CNN actually does here is: the backbone feature extraction network provides multiple common feature layers, and the proposal boxes are then used to crop these common feature layers.

Each point of a common feature layer is effectively a condensation of all the features inside a certain region of the original image.

A proposal box crops its corresponding common feature layer, and the cropped result is then resized. In the classifier model, the cropped content is resized to 7x7x256. In the mask model, it is resized to 14x14x256.
When cropping the common feature layers with the proposal boxes, note that we must first determine which feature layer each proposal box belongs to, and this is judged from the size of the proposal box.
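
The rule used in the PyramidROIAlign code below follows the FPN paper: larger proposal boxes are mapped to coarser pyramid levels. A hedged standalone sketch of this mapping for a single box (the function name and the 1024x1024 image area are only for illustration):

import math

def roi_pyramid_level(h, w, image_area=1024 * 1024, k_min=2, k_max=5):
    # h and w are the box height/width in normalized [0, 1] coordinates
    level = 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    return int(min(k_max, max(k_min, round(level))))

# A box covering roughly 224x224 pixels of a 1024x1024 image is assigned to P4
print(roi_pyramid_level(224 / 1024, 224 / 1024))   # 4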

In the classifier model, a 7x7 convolution with 1024 channels and then a 1x1 convolution with 1024 channels are applied to the 7x7x256 region obtained by ROIAlign; these two 1024-channel convolutions simulate two fully connected layers of size 1024. The result is then fully connected to num_classes and to num_classes x 4, representing respectively the class of the object in the proposal box and the adjustment parameters for that proposal box.

In the mask model, four 3x3 convolutions with 256 channels are first applied to the resized local feature layer, followed by a transposed convolution and then a convolution with num_classes channels. The final output has shape 28x28xnum_classes and represents the class of each pixel.

The code that crops the shared feature layers with the proposal boxes is as follows:

def log2_graph(x):
    return tf.math.log(x) / tf.math.log(2.0)

def parse_image_meta_graph(meta):
    """
    将meta里面的参数进行分割
    """
    image_id = meta[:, 0]
    original_image_shape = meta[:, 1:4]
    image_shape = meta[:, 4:7]
    window = meta[:, 7:11]  # (y1, x1, y2, x2) window of image in pixels
    scale = meta[:, 11]
    active_class_ids = meta[:, 12:]
    return {
        "image_id": image_id,
        "original_image_shape": original_image_shape,
        "image_shape": image_shape,
        "window": window,
        "scale": scale,
        "active_class_ids": active_class_ids,
    }

#----------------------------------------------------------#
#   ROIAlign Layer
#   Crop the feature layers with the proposal boxes
#----------------------------------------------------------#
class PyramidROIAlign(Layer):
    def __init__(self, pool_shape, **kwargs):
        super(PyramidROIAlign, self).__init__(**kwargs)
        self.pool_shape = tuple(pool_shape)

    def call(self, inputs):
        #----------------------------------------------------------#
        #   The coordinates of the proposal boxes
        #----------------------------------------------------------#
        boxes = inputs[0]
        #----------------------------------------------------------#
        #   image_meta contains some necessary image information
        #----------------------------------------------------------#
        image_meta = inputs[1]
        #----------------------------------------------------------#
        #   All the feature layers [batch, height, width, channels]
        #----------------------------------------------------------#
        feature_maps = inputs[2:]

        #----------------------------------------------------------#
        #   The width and height of the proposal boxes
        #----------------------------------------------------------#
        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
        h = y2 - y1
        w = x2 - x1

        #----------------------------------------------------------#
        #   The size of the input image
        #----------------------------------------------------------#
        image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
        
        #----------------------------------------------------------#
        #   Use the size of each proposal box to decide which
        #   feature layer it belongs to
        #----------------------------------------------------------#
        image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
        roi_level = tf.minimum(5, tf.maximum(2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
        roi_level = tf.squeeze(roi_level, 2)

        pooled = []
        box_to_level = []
        # Crop from P2-P5 respectively
        for i, level in enumerate(range(2, 6)):
            #-----------------------------------------------#
            #   Find the proposal boxes belonging to this feature layer
            #-----------------------------------------------#
            ix = tf.where(tf.equal(roi_level, level))
            level_boxes = tf.gather_nd(boxes, ix)
            box_to_level.append(ix)

            #-----------------------------------------------#
            #   Get the image each of these proposal boxes belongs to
            #-----------------------------------------------#
            box_indices = tf.cast(ix[:, 0], tf.int32)

            # Stop gradients here
            level_boxes = tf.stop_gradient(level_boxes)
            box_indices = tf.stop_gradient(box_indices)

            #--------------------------------------------------------------------------#
            #   Crop the feature layer with the proposal boxes
            #   [batch * num_boxes, pool_height, pool_width, channels]
            #--------------------------------------------------------------------------#
            pooled.append(tf.image.crop_and_resize(
                feature_maps[i], level_boxes, box_indices, self.pool_shape,
                method="bilinear"))

        pooled = tf.concat(pooled, axis=0)
        #--------------------------------------------------------------------------#
        #   Stack the crop order together with the image each box belongs to
        #--------------------------------------------------------------------------#
        box_to_level = tf.concat(box_to_level, axis=0)
        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range], axis=1)

        # box_to_level[:, 0] is the image index
        # box_to_level[:, 1] is the box index within that image
        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
        # Sort so that boxes from the same image are grouped together
        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
            box_to_level)[0]).indices[::-1]

        # Get the indices in order
        ix = tf.gather(box_to_level[:, 2], ix)
        pooled = tf.gather(pooled, ix)

        #--------------------------------------------------------------------------#
        #   Reshape back to
        #   [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]
        #--------------------------------------------------------------------------#
        shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)
        pooled = tf.reshape(pooled, shape)
        return pooled

    def compute_output_shape(self, input_shape):
        return input_shape[0][:2] + self.pool_shape + (input_shape[2][-1], )

The classifier model and the mask model are constructed as follows:

#------------------------------------#
#   Build the classifier model
#   Its predictions adjust the proposal boxes
#   to obtain the final prediction boxes
#------------------------------------#
def fpn_classifier_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True,
                         fc_layers_size=1024, weight_decay=0):
    #---------------------------------------------------------------#
    #   ROI Pooling: crop the feature layers with the proposal boxes
    #   x   : [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]
    #---------------------------------------------------------------#
    x = PyramidROIAlign([pool_size, pool_size], name="roi_align_classifier")([rois, image_meta] + feature_maps)

    #------------------------------------------------------------------#
    #   Integrate features with convolutions
    #   x   : [batch, num_rois, 1, 1, fc_layers_size]
    #------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(fc_layers_size, (pool_size, pool_size), padding="valid", kernel_regularizer=l2(weight_decay)),  name="mrcnn_class_conv1")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn1')(x, training=train_bn)
    x = Activation('relu')(x)
    #------------------------------------------------------------------#
    #   x   : [batch, num_rois, 1, 1, fc_layers_size]
    #------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(fc_layers_size, (1, 1), kernel_regularizer=l2(weight_decay)), name="mrcnn_class_conv2")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn2')(x, training=train_bn)
    x = Activation('relu')(x)

    #------------------------------------------------------------------#
    #   x   : [batch, num_rois, fc_layers_size]
    #------------------------------------------------------------------#
    shared = Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),  name="pool_squeeze")(x)

    #------------------------------------------------------------------#
    #   Classifier head
    #   This prediction is the class of the object inside each proposal box
    #   mrcnn_probs   : [batch, num_rois, num_classes]
    #------------------------------------------------------------------#
    mrcnn_class_logits = TimeDistributed(Dense(num_classes), name='mrcnn_class_logits')(shared)
    mrcnn_probs = TimeDistributed(Activation("softmax"), name="mrcnn_class")(mrcnn_class_logits)

    #------------------------------------------------------------------#
    #   BBox head
    #   This prediction adjusts the proposal boxes
    #   mrcnn_bbox : [batch, num_rois, num_classes, 4]
    #------------------------------------------------------------------#
    x = TimeDistributed(Dense(num_classes * 4, activation='linear'), name='mrcnn_bbox_fc')(shared)
    mrcnn_bbox = Reshape((-1, num_classes, 4), name="mrcnn_bbox")(x)

    return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox


#----------------------------------------------#
#   Build the mask model
#   This model applies ROIAlign to the feature
#   layers with the prediction boxes and performs
#   semantic segmentation on the cropped features
#----------------------------------------------#
def build_fpn_mask_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True, weight_decay=0):
    #--------------------------------------------------------------------#
    #   ROI Pooling: crop the feature layers with the prediction boxes
    #   x   : batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels
    #--------------------------------------------------------------------#
    x = PyramidROIAlign([pool_size, pool_size], name="roi_align_mask")([rois, image_meta] + feature_maps)

    #--------------------------------------------------------------------#
    #   x   : batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, 256
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask_conv1")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn1')(x, training=train_bn)
    x = Activation('relu')(x)

    #--------------------------------------------------------------------#
    #   x   : batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, 256
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask_conv2")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn2')(x, training=train_bn)
    x = Activation('relu')(x)

    #--------------------------------------------------------------------#
    #   x   : batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, 256
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask_conv3")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn3')(x, training=train_bn)
    x = Activation('relu')(x)

    #--------------------------------------------------------------------#
    #   x   : batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, 256
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(256, (3, 3), padding="same", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask_conv4")(x)
    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn4')(x, training=train_bn)
    x = Activation('relu')(x)

    #--------------------------------------------------------------------#
    #   x   : batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, 256
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2DTranspose(256, (2, 2), strides=2, activation="relu", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask_deconv")(x)
    #--------------------------------------------------------------------#
    #   After the transposed convolution, a 1x1 convolution adjusts the
    #   number of channels to num_classes, one channel per class
    #   x   : batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, numclasses
    #--------------------------------------------------------------------#
    x = TimeDistributed(Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid", kernel_regularizer=l2(weight_decay)), name="mrcnn_mask")(x)
    return x

6. Decoding the prediction boxes

The proposal boxes obtained in part four also represent regions on the image; in the subsequent classifier model they play the role of prior boxes.

That is, the classifier model's prediction represents the class of the object inside each proposal box and the adjustment parameters for that proposal box.

Adjusting the proposal boxes with these parameters gives the final prediction result, which can be drawn on the image.

The decoding of the prediction boxes includes the following steps:
1. Take out the proposal boxes that do not belong to the background and whose score is greater than config.DETECTION_MIN_CONFIDENCE.
2. Use the proposal boxes and the classifier model's predictions to decode the positions of the final prediction boxes.
3. Apply non-maximum suppression to the final prediction boxes and their scores to avoid duplicate detections.

The code for this decoding process is as follows:

#----------------------------------------------------------#
#   Adjust the proposal boxes with the classifier's predictions
#   to obtain the prediction boxes, and get the class of
#   each prediction box
#----------------------------------------------------------#
def refine_detections_graph(rois, probs, deltas, window, config):
    #----------------------------------------------------------#
    #   Inputs:
    #   rois        : N, 4
    #   probs       : N, num_classes
    #   deltas      : N, num_classes, 4
    #   window      : 4,
    #
    #   Outputs:
    #   detections  : num_detections, 6
    #----------------------------------------------------------#

    #----------------------------------------------------------#
    #   Find the class with the highest score
    #----------------------------------------------------------#
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)
    #----------------------------------------------------------#
    #   Index + class, used to gather the scores and the
    #   adjustment parameters of the proposal boxes
    #----------------------------------------------------------#
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)
    #----------------------------------------------------------#
    #   Gather the scores and the adjustment parameters
    #----------------------------------------------------------#
    class_scores = tf.gather_nd(probs, indices)
    deltas_specific = tf.gather_nd(deltas, indices)
    #----------------------------------------------------------#
    #   Decode
    #   refined_rois    : boxes, 4
    #----------------------------------------------------------#
    refined_rois = apply_box_deltas_graph(rois, deltas_specific * config.BBOX_STD_DEV)
    refined_rois = clip_boxes_graph(refined_rois, window)

    #----------------------------------------------------------#
    #   Remove the background and low-scoring regions
    #----------------------------------------------------------#
    keep = tf.where(class_ids > 0)[:, 0]
    if config.DETECTION_MIN_CONFIDENCE:
        conf_keep = tf.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[:, 0]
        keep = tf.compat.v1.sets.set_intersection(tf.expand_dims(keep, 0),
                                        tf.expand_dims(conf_keep, 0))
        keep = tf.compat.v1.sparse_tensor_to_dense(keep)[0]

    #----------------------------------------------------------#
    #   The non-background, high-scoring boxes, plus their
    #   classes and scores
    #----------------------------------------------------------#
    pre_nms_class_ids = tf.gather(class_ids, keep)
    pre_nms_scores = tf.gather(class_scores, keep)
    pre_nms_rois = tf.gather(refined_rois,   keep)
    unique_pre_nms_class_ids = tf.unique(pre_nms_class_ids)[0]

    def nms_keep_map(class_id):
        ixs = tf.where(tf.equal(pre_nms_class_ids, class_id))[:, 0]

        class_keep = tf.image.non_max_suppression(
                tf.gather(pre_nms_rois, ixs),
                tf.gather(pre_nms_scores, ixs),
                max_output_size=config.DETECTION_MAX_INSTANCES,
                iou_threshold=config.DETECTION_NMS_THRESHOLD)

        class_keep = tf.gather(keep, tf.gather(ixs, class_keep))

        gap = config.DETECTION_MAX_INSTANCES - tf.shape(class_keep)[0]
        class_keep = tf.pad(class_keep, [(0, gap)],
                            mode='CONSTANT', constant_values=-1)

        class_keep.set_shape([config.DETECTION_MAX_INSTANCES])
        return class_keep
    #------------------------------------------------------------#
    #   Apply per-class non-maximum suppression to the boxes that
    #   pass the score threshold and are not background
    #------------------------------------------------------------#
    nms_keep = tf.map_fn(nms_keep_map, unique_pre_nms_class_ids, dtype=tf.int64)
    nms_keep = tf.reshape(nms_keep, [-1])
    nms_keep = tf.gather(nms_keep, tf.where(nms_keep > -1)[:, 0])

    keep = tf.compat.v1.sets.set_intersection(tf.expand_dims(keep, 0), tf.expand_dims(nms_keep, 0))
    keep = tf.compat.v1.sparse_tensor_to_dense(keep)[0]

    #------------------------------------------------------------#
    #   Keep the num_keep boxes with the highest scores
    #------------------------------------------------------------#
    roi_count = config.DETECTION_MAX_INSTANCES
    class_scores_keep = tf.gather(class_scores, keep)
    num_keep = tf.minimum(tf.shape(class_scores_keep)[0], roi_count)
    top_ids = tf.nn.top_k(class_scores_keep, k=num_keep, sorted=True)[1]
    keep = tf.gather(keep, top_ids)

    #------------------------------------------------------------#
    #   Stack the prediction results; the final shape is [N, 6],
    #   i.e. N, (y1, x1, y2, x2, class_id, score)
    #------------------------------------------------------------#
    detections = tf.concat([
        tf.gather(refined_rois, keep),
        tf.cast(tf.gather(class_ids, keep), tf.float32)[..., tf.newaxis],
        tf.gather(class_scores, keep)[..., tf.newaxis]
        ], axis=1)

    #------------------------------------------------------------#
    #   Pad if there are fewer detections than required
    #------------------------------------------------------------#
    gap = config.DETECTION_MAX_INSTANCES - tf.shape(detections)[0]
    detections = tf.pad(detections, [(0, gap), (0, 0)], "CONSTANT")
    return detections

def norm_boxes_graph(boxes, shape):
    h, w = tf.split(tf.cast(shape, tf.float32), 2)
    scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0)
    shift = tf.constant([0., 0., 1., 1.])
    return tf.divide(boxes - shift, scale)

#----------------------------------------------------------#
#   Detection Layer
#   Adjust the proposal boxes with the classifier's
#   predictions to obtain the prediction boxes
#----------------------------------------------------------#
class DetectionLayer(Layer):
    def __init__(self, config=None, **kwargs):
        super(DetectionLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        #------------------------------------------------------------------#
        #   The inputs:
        #   rpn_rois            : Batch_size, proposal_count, 4
        #   mrcnn_class         : Batch_size, num_rois, num_classes
        #   mrcnn_bbox          : Batch_size, num_rois, num_classes, 4
        #------------------------------------------------------------------#
        rois = inputs[0]
        mrcnn_class = inputs[1]
        mrcnn_bbox = inputs[2]
        image_meta = inputs[3]

        #------------------------------------------------------------------#
        #   Express the window in normalized (fractional) coordinates
        #------------------------------------------------------------------#
        m = parse_image_meta_graph(image_meta)
        image_shape = m['image_shape'][0]
        window = norm_boxes_graph(m['window'], image_shape[:2])

        #------------------------------------------------------------------#
        #   Decode the results for each image
        #------------------------------------------------------------------#
        detections_batch = batch_slice(
            [rois, mrcnn_class, mrcnn_bbox, window],
            lambda x, y, w, z: refine_detections_graph(x, y, w, z, self.config),
            self.config.IMAGES_PER_GPU)

        #------------------------------------------------------------#
        #   The final output shape is
        #   [Batch_size, num_detections, 6]
        #------------------------------------------------------------#
        return tf.reshape(
            detections_batch,
            [self.config.BATCH_SIZE, self.config.DETECTION_MAX_INSTANCES, 6])

    def compute_output_shape(self, input_shape):
        return (None, self.config.DETECTION_MAX_INSTANCES, 6)

7. Acquisition of mask semantic segmentation information

In step six we obtained the final prediction boxes, which are more accurate than the earlier proposal boxes, so we use these prediction boxes as the crop regions for the mask model and crop the common feature layers used by the mask model with them.

After cropping, the mask model classifies each pixel to obtain the semantic segmentation result inside each prediction box.
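
This section does not list the inference-time post-processing, so here is a hedged sketch of how the 28x28 mask output is typically turned into a full-image mask (assumptions: detection is one row (y1, x1, y2, x2, class_id, score) in pixel coordinates, mask_probs is the 28x28xnum_classes sigmoid output of the mask model for that box, and OpenCV is available):

import numpy as np
import cv2

def unmold_single_mask(mask_probs, detection, image_shape, threshold=0.5):
    y1, x1, y2, x2, class_id, _ = detection
    y1, x1, y2, x2 = int(y1), int(x1), int(y2), int(x2)
    # Take the 28x28 probability map of the predicted class
    class_mask = mask_probs[:, :, int(class_id)]
    # Resize it to the size of the prediction box and binarize it
    class_mask = cv2.resize(class_mask, (x2 - x1, y2 - y1)) >= threshold
    # Paste the box mask back into a full-size image mask
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = class_mask
    return full_mask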

2. Training part

The loss function used to train Mask R-CNN consists of several parts: the loss of the proposal (RPN) network, the loss of the classifier network, and the loss of the mask network.

1. Training the proposal (RPN) network

To obtain the proposal predictions from the common feature layers, a 3x3 convolution is applied, followed by a 1x1 convolution with anchors_per_location x 2 channels and a 1x1 convolution with anchors_per_location x 4 channels.

In Mask R-CNN, anchors_per_location, i.e. the number of prior boxes per grid point, is 3 by default, so the results of the two 1x1 convolutions are actually:

The anchors_per_location x 4 convolution predicts the adjustment of each prior box at each grid point of the effective feature layer.

The anchors_per_location x 2 convolution predicts whether each prior box at each grid point of the effective feature layer contains an object.

That is, what the Mask R-CNN proposal network predicts directly is not the real position of the proposal boxes on the image; the predictions still need to be decoded to obtain the real positions.

During training we need to compute the loss function, which is measured relative to the predictions of the Mask R-CNN proposal network. We feed the image into the current proposal network to obtain its predictions; at the same time we need to encode the ground-truth boxes, i.e. convert their position information into the same format as the proposal network's predictions.

That is, for each image used in training, we need to find the prior box corresponding to each ground-truth box and work out what the proposal network's prediction should be in order to obtain that ground-truth box.

The process of obtaining prediction boxes from the proposal network's outputs is called decoding, and the process of obtaining the ideal outputs from the ground-truth boxes is called encoding.

Therefore, we only need to reverse the decoding process to obtain the encoding process.

The implementation code is as follows:

def build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config):
    #------------------------------#
    #   In rpn_match,
    #   1 means positive sample, -1 means negative sample,
    #   0 means ignored
    #------------------------------#
    rpn_match = np.zeros([anchors.shape[0]], dtype=np.int32)
    #-----------------------------------------------#
    #   This part encodes the prior boxes against the ground-truth boxes
    #-----------------------------------------------#
    rpn_bbox = np.zeros((config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4))

    '''
    When iscrowd=0, the annotation is a single object whose outline is given as a Polygon (a list of points).
    When iscrowd=1, the annotation covers objects that cannot be separated and the outline is given as an RLE encoding.
    For example, if an image contains three people, one standing alone and two hugging each other (too close to
    separate when labelling), the single person gets iscrowd=0 with a Polygon segmentation, while the other two
    are placed in the same annotation with an RLE-encoded segmentation.
    '''
    crowd_ix = np.where(gt_class_ids < 0)[0]
    if crowd_ix.shape[0] > 0:
        non_crowd_ix    = np.where(gt_class_ids > 0)[0]
        crowd_boxes     = gt_boxes[crowd_ix]
        gt_class_ids    = gt_class_ids[non_crowd_ix]
        gt_boxes        = gt_boxes[non_crowd_ix]
        crowd_overlaps  = compute_overlaps(anchors, crowd_boxes)
        crowd_iou_max   = np.amax(crowd_overlaps, axis=1)
        no_crowd_bool   = (crowd_iou_max < 0.001)
    else:
        no_crowd_bool   = np.ones([anchors.shape[0]], dtype=bool)

    #-----------------------------------------------#
    #   Compute the overlap (IoU) between prior boxes and ground-truth boxes
    #   [num_anchors, num_gt_boxes]
    #-----------------------------------------------#
    overlaps = compute_overlaps(anchors, gt_boxes)

    #-----------------------------------------------#
    #   1. An overlap below 0.3 marks a negative sample
    #-----------------------------------------------#
    anchor_iou_argmax = np.argmax(overlaps, axis=1)
    anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax]
    rpn_match[(anchor_iou_max < 0.3) & (no_crowd_bool)] = -1
    #-----------------------------------------------#
    #   2. For each ground-truth box, the prior box with the
    #      highest overlap is a positive sample
    #-----------------------------------------------#
    gt_iou_argmax = np.argwhere(overlaps == np.max(overlaps, axis=0))[:,0]
    rpn_match[gt_iou_argmax] = 1
    #-----------------------------------------------#
    #   3. An overlap of 0.7 or more marks a positive sample
    #-----------------------------------------------#
    rpn_match[anchor_iou_max >= 0.7] = 1

    #-----------------------------------------------#
    #   Balance positive and negative samples
    #   Find the indices of the positive samples
    #-----------------------------------------------#
    ids = np.where(rpn_match == 1)[0]
    
    #-----------------------------------------------#
    #   If there are more than (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2), drop some
    #-----------------------------------------------#
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2)
    if extra > 0:
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0
        
    #-----------------------------------------------#
    #   Find the indices of the negative samples
    #-----------------------------------------------#
    ids = np.where(rpn_match == -1)[0]
    
    #-----------------------------------------------#
    #   Make the total equal to config.RPN_TRAIN_ANCHORS_PER_IMAGE
    #-----------------------------------------------#
    extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE -
                        np.sum(rpn_match == 1))
    if extra > 0:
        # Reset the extra ones to neutral
        ids = np.random.choice(ids, extra, replace=False)
        rpn_match[ids] = 0

    #-----------------------------------------------#
    #   Find the prior boxes that actually contain objects and encode them
    #-----------------------------------------------#
    ids = np.where(rpn_match == 1)[0]
    ix = 0 
    for i, a in zip(ids, anchors[ids]):
        gt = gt_boxes[anchor_iou_argmax[i]]
        #-----------------------------------------------#
        #   Compute the center, height and width of the ground-truth box
        #-----------------------------------------------#
        gt_h = gt[2] - gt[0]
        gt_w = gt[3] - gt[1]
        gt_center_y = gt[0] + 0.5 * gt_h
        gt_center_x = gt[1] + 0.5 * gt_w
        #-----------------------------------------------#
        #   Compute the center, height and width of the prior box
        #-----------------------------------------------#
        a_h = a[2] - a[0]
        a_w = a[3] - a[1]
        a_center_y = a[0] + 0.5 * a_h
        a_center_x = a[1] + 0.5 * a_w
        #-----------------------------------------------#
        #   The encoding computation
        #-----------------------------------------------#
        rpn_bbox[ix] = [
            (gt_center_y - a_center_y) / np.maximum(a_h, 1),
            (gt_center_x - a_center_x) / np.maximum(a_w, 1),
            np.log(np.maximum(gt_h / np.maximum(a_h, 1), 1e-5)),
            np.log(np.maximum(gt_w / np.maximum(a_w, 1), 1e-5)),
        ]
        #-----------------------------------------------#
        #   Rescale (change the order of magnitude)
        #-----------------------------------------------#
        rpn_bbox[ix] /= config.RPN_BBOX_STD_DEV
        ix += 1
    return rpn_match, rpn_bbox

Using the code above, for each ground-truth box we obtain all prior boxes with a sufficiently large IoU, and we compute the predictions these prior boxes should produce.

Mask R-CNN ignores prior boxes whose overlap is moderate but not high enough; in general, prior boxes with an IoU between 0.3 and 0.7 are ignored.

The loss of the proposal network is obtained by comparing the predictions the proposal network should produce with its actual predictions.
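
The loss code itself is not listed in this section; a hedged per-image sketch of the two RPN loss terms (assumptions: rpn_match is the -1/0/1 target produced by build_rpn_targets, rpn_class_logits are the raw two-class logits of the RPN, rpn_bbox are its predicted offsets, and target_bbox are the encoded offsets packed at the front as in the code above):

import tensorflow as tf

def rpn_class_loss(rpn_match, rpn_class_logits):
    # Anchors marked 0 are ignored; 1 -> label 1 (object), -1 -> label 0 (background)
    anchor_class = tf.cast(tf.equal(rpn_match, 1), tf.int32)
    indices = tf.where(tf.not_equal(rpn_match, 0))
    logits = tf.gather_nd(rpn_class_logits, indices)
    labels = tf.gather_nd(anchor_class, indices)
    return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))

def rpn_bbox_loss(target_bbox, rpn_match, rpn_bbox):
    # Only positive anchors contribute to the regression loss (smooth L1)
    indices = tf.where(tf.equal(rpn_match, 1))
    pred = tf.gather_nd(rpn_bbox, indices)
    target = target_bbox[:tf.shape(pred)[0]]
    diff = tf.abs(target - pred)
    return tf.reduce_mean(tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5))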

2. Training of the Classifier model

The previous part provides the loss of the RPN network. In Mask R-CNN we also need to adjust the proposal boxes to obtain the final prediction boxes; in the classifier model, the proposal boxes play the role of prior boxes.

Therefore, we need to compute the overlap between all proposal boxes and the ground-truth boxes and filter them: if the overlap between a ground-truth box and a proposal box is greater than 0.5, the proposal box is considered a positive sample; if it is less than 0.5, the proposal box is considered a negative sample.

Therefore, we can encode the ground-truth boxes relative to the proposal boxes, i.e. given these proposal boxes, determine what predictions the Classifier model needs to produce in order to adjust these proposal boxes into the ground-truth boxes.

The implementation code is as follows:

#----------------------------------------------------------#
#   Detection Target Layer
#   This part takes the proposal boxes as input,
#   checks how much they overlap the ground-truth boxes,
#   filters out the proposal boxes that contain objects,
#   encodes the proposal boxes against the ground-truth boxes,
#   and adjusts the mask format so it matches the prediction format
#----------------------------------------------------------#
#----------------------------------------------------------#
#   Encode the incoming ground-truth boxes
#----------------------------------------------------------#
def box_refinement_graph(box, gt_box):
    box = tf.cast(box, tf.float32)
    gt_box = tf.cast(gt_box, tf.float32)

    height = box[:, 2] - box[:, 0]
    width = box[:, 3] - box[:, 1]
    center_y = box[:, 0] + 0.5 * height
    center_x = box[:, 1] + 0.5 * width

    gt_height = gt_box[:, 2] - gt_box[:, 0]
    gt_width = gt_box[:, 3] - gt_box[:, 1]
    gt_center_y = gt_box[:, 0] + 0.5 * gt_height
    gt_center_x = gt_box[:, 1] + 0.5 * gt_width

    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = tf.math.log(gt_height / height)
    dw = tf.math.log(gt_width / width)

    result = tf.stack([dy, dx, dh, dw], axis=1)
    return result

#----------------------------------------------------------#
#   Detection Target Layer
#   This part takes the proposal boxes as input,
#   checks how much they overlap the ground-truth boxes,
#   filters out the proposal boxes that contain objects,
#   encodes the proposal boxes against the ground-truth boxes,
#   and adjusts the mask format so it matches the prediction format
#----------------------------------------------------------#
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    asserts = [
        tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                  name="roi_assertion"),
    ]
    with tf.control_dependencies(asserts):
        proposals = tf.identity(proposals)

    #----------------------------------------------------------#
    #   The inputs were padded earlier to satisfy a fixed length;
    #   the padding is removed here
    #----------------------------------------------------------#
    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros, name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2, name="trim_gt_masks")

    #----------------------------------------------------------#
    #   Ignore the crowd portions of the COCO dataset; they are
    #   hard to distinguish and are simply ignored during training
    #----------------------------------------------------------#
    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

    #----------------------------------------------------------#
    #   Compute the overlap between the proposal boxes and all ground-truth boxes
    #   overlaps    : proposals, gt_boxes
    #----------------------------------------------------------#
    overlaps = overlaps_graph(proposals, gt_boxes)

    #----------------------------------------------------------#
    #   Compute the overlap between the proposal boxes and the crowd boxes
    #   overlaps    : proposals, crowd_boxes
    #----------------------------------------------------------#
    crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
    crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
    no_crowd_bool = (crowd_iou_max < 0.001)

    #----------------------------------------------------------#
    #   The maximum overlap of each proposal box with the ground-truth boxes
    #   roi_iou_max    : proposals,
    #----------------------------------------------------------#
    roi_iou_max = tf.reduce_max(overlaps, axis=1)
    #----------------------------------------------------------#
    #   1. Positive proposal boxes overlap a ground-truth box by 0.5 or more
    #----------------------------------------------------------#
    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]
    #----------------------------------------------------------#
    #   2. Negative proposals: IoU with every ground-truth box below 0.5;
    #      proposals with a large overlap with a crowd region are discarded
    #----------------------------------------------------------#
    negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]

    #----------------------------------------------------------#
    #   Balance positive and negative samples: keep at most 33% positives
    #----------------------------------------------------------#
    positive_count = int(config.TRAIN_ROIS_PER_IMAGE * config.ROI_POSITIVE_RATIO)
    positive_indices = tf.random.shuffle(positive_indices)[:positive_count]
    positive_count = tf.shape(positive_indices)[0]
    #----------------------------------------------------------#
    #   Keep the positive/negative ratio
    #----------------------------------------------------------#
    r = 1.0 / config.ROI_POSITIVE_RATIO
    negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
    negative_indices = tf.random.shuffle(negative_indices)[:negative_count]
    #----------------------------------------------------------#
    #   Gather the positive and negative proposals
    #----------------------------------------------------------#
    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)

    #----------------------------------------------------------#
    #   Overlaps between the positive proposals and the ground-truth boxes
    #----------------------------------------------------------#
    positive_overlaps = tf.gather(overlaps, positive_indices)
    
    #----------------------------------------------------------#
    #   Check whether any ground-truth boxes exist
    #----------------------------------------------------------#
    roi_gt_box_assignment = tf.cond(
        tf.greater(tf.shape(positive_overlaps)[1], 0),
        true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
        false_fn = lambda: tf.cast(tf.constant([]),tf.int64)
    )
    #----------------------------------------------------------#
    #   Find the ground-truth box and class assigned to each proposal
    #----------------------------------------------------------#
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)

    #----------------------------------------------------------#
    #   Encode the regression targets the network should predict
    #----------------------------------------------------------#
    deltas = box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV

    #----------------------------------------------------------#
    #   Reshape the masks to [N, height, width, 1]
    #----------------------------------------------------------#
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
    
    #----------------------------------------------------------#
    #   Take the mask corresponding to each proposal
    #----------------------------------------------------------#
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)

    #----------------------------------------------------------#
    #   Crop the masks with the proposal boxes to obtain the training masks
    #----------------------------------------------------------#
    boxes = positive_rois
    if config.USE_MINI_MASK:
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)

    masks = tf.squeeze(masks, axis=3)
    masks = tf.round(masks)

    #----------------------------------------------------------#
    #   Normally config.TRAIN_ROIS_PER_IMAGE proposals are used for training;
    #   if there are not enough, pad with zeros
    #----------------------------------------------------------#
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [(0, N + P), (0, 0), (0, 0)])

    return rois, roi_gt_class_ids, deltas, masks


class DetectionTargetLayer(Layer):
    """
    找到建议框的ground_truth
    Inputs:
    proposals       : [batch, N, (y1, x1, y2, x2)]                                          建议框
    gt_class_ids    : [batch, MAX_GT_INSTANCES]                                             每个真实框对应的类
    gt_boxes        : [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)]                           真实框的位置
    gt_masks        : [batch, MINI_MASK_SHAPE[0], MINI_MASK_SHAPE[1], MAX_GT_INSTANCES]     真实框的语义分割情况

    Returns: 
    rois            : [batch, TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)]                       内部真实存在目标的建议框
    target_class_ids: [batch, TRAIN_ROIS_PER_IMAGE]                                         每个建议框对应的类
    target_deltas   : [batch, TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw)]              每个建议框应该有的调整参数
    target_mask     : [batch, TRAIN_ROIS_PER_IMAGE, height, width]                          每个建议框语义分割情况
    """

    def __init__(self, config, **kwargs):
        super(DetectionTargetLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        proposals = inputs[0]
        gt_class_ids = inputs[1]
        gt_boxes = inputs[2]
        gt_masks = inputs[3]

        # Encode the ground-truth boxes, one image (slice) of the batch at a time
        names = ["rois", "target_class_ids", "target_bbox", "target_mask"]
        outputs = batch_slice([proposals, gt_class_ids, gt_boxes, gt_masks],
            lambda w, x, y, z: detection_targets_graph(w, x, y, z, self.config),
            self.config.IMAGES_PER_GPU, names=names)
        return outputs

    def compute_output_shape(self, input_shape):
        return [
            (None, self.config.TRAIN_ROIS_PER_IMAGE, 4),  # rois
            (None, self.config.TRAIN_ROIS_PER_IMAGE),  # class_ids
            (None, self.config.TRAIN_ROIS_PER_IMAGE, 4),  # deltas
            (None, self.config.TRAIN_ROIS_PER_IMAGE, self.config.MASK_SHAPE[0],
             self.config.MASK_SHAPE[1])  # masks
        ]

    def compute_mask(self, inputs, mask=None):
        return [None, None, None, None]
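
For orientation, the sketch below shows how this layer is typically wired into the training graph; the input tensor names are illustrative rather than copied from the repository, and the shapes follow the docstring above:

#   Illustrative wiring of DetectionTargetLayer (tensor names are placeholders)
rois, target_class_ids, target_bbox, target_mask = DetectionTargetLayer(
    config, name="proposal_targets")([
        target_rois,          # proposals from the proposal layer, [batch, N, (y1, x1, y2, x2)]
        input_gt_class_ids,   # [batch, MAX_GT_INSTANCES]
        gt_boxes,             # [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)], normalized
        input_gt_masks        # [batch, MINI_MASK_SHAPE[0], MINI_MASK_SHAPE[1], MAX_GT_INSTANCES]
    ])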

3. Training of the mask model

When training the mask model, note that the boxes used to crop the shared feature layer for the mask branch are the proposal boxes, which do not coincide with the ground-truth boxes. We therefore need to compute the position of each cropping box (proposal) relative to its matched ground-truth box, so that the correct semantic segmentation target can be extracted.

The code is shown below; the large block in the middle computes the position of the proposal box relative to the ground-truth box. Once this relative position is known, it is used to crop the semantic segmentation information and obtain the correct mask target.

# Compute mask targets
boxes = positive_rois
if config.USE_MINI_MASK:
    # Transform ROI coordinates from normalized image space
    # to normalized mini-mask space.
    y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
    gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
    gt_h = gt_y2 - gt_y1
    gt_w = gt_x2 - gt_x1
    y1 = (y1 - gt_y1) / gt_h
    x1 = (x1 - gt_x1) / gt_w
    y2 = (y2 - gt_y1) / gt_h
    x2 = (x2 - gt_x1) / gt_w
    boxes = tf.concat([y1, x1, y2, x2], 1)
box_ids = tf.range(0, tf.shape(roi_masks)[0])
masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                    box_ids,
                                    config.MASK_SHAPE)

With these mask targets, the mask branch can be trained by comparing them against the masks predicted by the model.
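
As a minimal sketch of what that comparison looks like (assumed shapes are noted in the comments; this is not copied verbatim from the repository), the mask loss is a per-pixel binary cross-entropy computed only on positive ROIs, and only on the predicted mask channel of each ROI's ground-truth class:

import tensorflow as tf
from tensorflow.keras import backend as K

#   Sketch of the mask loss. Assumed shapes:
#   target_masks     : [batch, num_rois, height, width]               (targets from DetectionTargetLayer)
#   target_class_ids : [batch, num_rois]
#   pred_masks       : [batch, num_rois, height, width, num_classes]  (sigmoid output of the mask head)
def mask_loss_sketch(target_masks, target_class_ids, pred_masks):
    target_class_ids = K.reshape(target_class_ids, (-1,))
    mask_shape       = tf.shape(target_masks)
    target_masks     = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
    pred_shape       = tf.shape(pred_masks)
    pred_masks       = K.reshape(pred_masks, (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
    #   Move the class channel forward so one (roi, class) pair selects one [height, width] mask
    pred_masks       = tf.transpose(pred_masks, [0, 3, 1, 2])

    #   Only positive ROIs contribute, and only the channel of their ground-truth class
    positive_ix        = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(tf.gather(target_class_ids, positive_ix), tf.int64)
    indices            = tf.stack([positive_ix, positive_class_ids], axis=1)

    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather_nd(pred_masks, indices)
    loss   = K.switch(tf.size(y_true) > 0,
                      K.binary_crossentropy(target=y_true, output=y_pred),
                      tf.constant(0.0))
    return K.mean(loss)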

Train your own Mask-RCNN model

The overall folder structure of Mask-RCNN is as follows:
insert image description here

1. Preparation of the dataset

This article trains on a custom dataset organized in the COCO dataset format.

If the dataset has not been labeled yet, you can first annotate it with labelme. The annotated data consists of the image files and the corresponding json files, both of which are placed in the before folder; for the exact format, refer to the shapes dataset.
When labeling targets, note that different instances of the same class must be distinguished with an underscore (_).
For example, if you want the network to detect triangles and squares, and a picture contains two triangles, they are labeled as:

triangle_1
triangle_2

Put it in the before folder:
insert image description here
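
The underscore simply separates the class name from the instance index, so the class can always be recovered from the label; a tiny illustration (the helper below is only for illustration, not a function from the repository):

#   Illustrative only: recover the class name from a label such as "triangle_1"
def label_to_class(label):
    return label.rsplit("_", 1)[0]

print(label_to_class("triangle_1"))   # triangle
print(label_to_class("square_2"))     # square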

2. Processing of the dataset

Modify the parameters in coco_annotation.py. For the first training run, only classes_path needs to be modified; classes_path points to the txt file that lists the categories to detect.
When training on your own dataset, create a cls_classes.txt yourself and write the categories you want to distinguish in it, one per line.
The content of the model_data/cls_classes.txt file is, for example:

cat
dog
...

Set classes_path in coco_annotation.py so that it corresponds to cls_classes.txt, then run coco_annotation.py.
insert image description here
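
Concretely, the change amounts to something like the following (the exact line in coco_annotation.py may look slightly different):

#   In coco_annotation.py: point classes_path at your own category list
classes_path = 'model_data/cls_classes.txt'
#   then run:  python coco_annotation.py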

3. Start training the network

There are many training parameters, and they are all in train.py; read the comments carefully after downloading the repository. The most important one is still classes_path in train.py.
classes_path points to the txt file listing the detection categories and must be the same txt used in coco_annotation.py. It must be modified when training on your own dataset!
After modifying classes_path, you can run train.py to start training. After several epochs, the weights will be saved in the logs folder.
insert image description here
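
For example, the one mandatory change looks roughly like this (the exact line in train.py may differ; the other hyperparameters are explained in the file's comments):

#   In train.py: classes_path must be the same txt used by coco_annotation.py
classes_path = 'model_data/cls_classes.txt'
#   then run:  python train.py   (weights are written to the logs folder during training)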

4. Model prediction

Prediction with the trained model requires two files: mask_rcnn.py and predict.py.
First, open mask_rcnn.py and modify model_path and classes_path; these two parameters must be changed.
model_path points to the trained weights file in the logs folder.
classes_path points to the txt file listing the detection categories.

After making these changes, run predict.py for detection; once it is running, enter the path of the image you want to detect.
insert image description here
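
For example (the weight filename below is only a placeholder; use the file actually produced in your logs folder):

#   In mask_rcnn.py: point the two parameters at your own files
model_path   = 'logs/your_trained_weights.h5'   # placeholder name, replace with your own weights
classes_path = 'model_data/cls_classes.txt'
#   then run:  python predict.py   and enter the path of the image to detect when prompted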

Origin blog.csdn.net/weixin_44791964/article/details/125105892