RetinaFace: A Powerful Face Detection Algorithm That Runs in Real Time on a CPU

Preface

Today I will walk through a paper I recently read: RetinaFace.

What Is RetinaFace?

RetinaFace is a single-stage face detection framework that followed MTCNN. Using a multi-task loss that combines extra-supervised and self-supervised signals, it delivers state-of-the-art dense face localization. On top of detection, it also regresses five facial landmarks to localize the key points of each face.

Advantages of RetinaFace

1. Multi-task learning: face classification, face box regression, and facial landmark localization are all predicted simultaneously (see the loss formula after this list).

2. With a ResNet-152 backbone it reaches high accuracy (AP 91.8% on the WIDER FACE hard set); with the small MobileNetV1-0.25 backbone (AP 78.2% on the WIDER FACE hard set) it achieves real-time detection on a CPU.
3. A feature pyramid strengthens feature fusion.
4. SSH context modules are added as lateral structures; stacked 3×3 convolutions replace the 5×5 and 7×7 convolutions, further enlarging the receptive field of the extracted features.
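For reference, the multi-task loss proposed in the paper for each training anchor i is (the weights below are the paper's; the dense-regression term is not used in this Keras implementation):

L = L_cls(p_i, p_i*) + λ1 · p_i* · L_box(t_i, t_i*) + λ2 · p_i* · L_pts(l_i, l_i*) + λ3 · p_i* · L_pixel

Here L_cls is the face/background softmax loss, L_box and L_pts are smooth L1 regression losses for the box and the five landmarks (counted only for positive anchors, i.e. p_i* = 1), L_pixel is the self-supervised dense regression loss, and the paper sets λ1, λ2, λ3 to 0.25, 0.1, 0.01.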

GitHub Code

https://github.com/yanjingke/retinaface-keras

Data

Link: https://pan.baidu.com/s/15WqLbrNzFJXxZ7zwP-FkOQ  Extraction code: 3msf

RetinaFace Implementation Details

Network Backbone: Feature Extraction

For the feature extraction part, this post focuses on MobileNetV1-0.25. MobileNet is a framework designed to run well on mobile devices, and its key building block is the depthwise separable convolution.
So how does a depthwise separable convolution reduce the amount of computation?
Suppose we have a 3×3 convolutional layer with 16 input channels and 32 output channels. In the standard case, each of the 32 3×3 kernels convolves over all 16 input channels to produce the 32 output channels, which takes 16×32×3×3 = 4608 parameters.

With a depthwise separable convolution, 16 3×3 kernels first convolve the 16 channels one-to-one, producing 16 feature maps. Then, to fuse them, 32 1×1 kernels convolve across these 16 feature maps, which takes 16×3×3 + 16×32×1×1 = 656 parameters.
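A quick check of those two parameter counts in plain Python:

in_ch, out_ch, k = 16, 32, 3

# Standard convolution: every output channel mixes all input channels.
standard = in_ch * out_ch * k * k         # 16*32*3*3 = 4608

# Depthwise separable: one 3x3 filter per input channel, then a 1x1 pointwise conv.
depthwise = in_ch * k * k                 # 16*3*3 = 144
pointwise = in_ch * out_ch                # 16*32  = 512

print(standard, depthwise + pointwise)    # 4608 656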

In our code, DepthwiseConv2D performs the depthwise convolution, while the 1×1 pointwise convolution that follows it adjusts the channel count.
(Figure: the MobileNetV1 architecture.)
MobileNetV1-0.25 shrinks every channel count to a quarter of the original, bringing the network down to roughly 200,000 parameters.

Depthwise separable convolution code:

from keras import backend as K
from keras.layers import Activation, BatchNormalization, Conv2D, DepthwiseConv2D

def relu6(x):
    # ReLU capped at 6, as used in the MobileNet papers.
    return K.relu(x, max_value=6)

def _depthwise_conv_block(inputs, pointwise_conv_filters,
                          depth_multiplier=1, strides=(1, 1), block_id=1):
    # 3x3 depthwise convolution: one filter per input channel.
    x = DepthwiseConv2D((3, 3),
                        padding='same',
                        depth_multiplier=depth_multiplier,
                        strides=strides,
                        use_bias=False,
                        name='conv_dw_%d' % block_id)(inputs)
    x = BatchNormalization(name='conv_dw_%d_bn' % block_id)(x)
    x = Activation(relu6, name='conv_dw_%d_relu' % block_id)(x)

    # 1x1 pointwise convolution: mixes channels and sets the output width.
    x = Conv2D(pointwise_conv_filters, (1, 1),
               padding='same',
               use_bias=False,
               strides=(1, 1),
               name='conv_pw_%d' % block_id)(x)
    x = BatchNormalization(name='conv_pw_%d_bn' % block_id)(x)
    return Activation(relu6, name='conv_pw_%d_relu' % block_id)(x)
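A minimal smoke test of the block (assuming the definitions and imports above; the stride-2 block halves the spatial size):

from keras.layers import Input
from keras.models import Model

inp = Input(shape=(640, 640, 3))
out = _depthwise_conv_block(inp, 16, strides=(2, 2), block_id=99)
print(Model(inp, out).output_shape)  # (None, 320, 320, 16)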

MobileNet code:

def _conv_block(inputs, filters, kernel=(3, 3), strides=(1, 1)):
    x = Conv2D(filters, kernel,
               padding='same',
               use_bias=False,
               strides=strides,
               name='conv1')(inputs)
    x = BatchNormalization(name='conv1_bn')(x)
    return Activation(relu6, name='conv1_relu')(x)

def MobileNet(img_input, depth_multiplier=1):
    # 640,640,3 -> 320,320,8
    x = _conv_block(img_input, 8, strides=(2, 2))
    # 320,320,8 -> 320,320,16
    x = _depthwise_conv_block(x, 16, depth_multiplier, block_id=1)

    # 320,320,16 -> 160,160,32
    x = _depthwise_conv_block(x, 32, depth_multiplier, strides=(2, 2), block_id=2)
    x = _depthwise_conv_block(x, 32, depth_multiplier, block_id=3)

    # 160,160,32 -> 80,80,64
    x = _depthwise_conv_block(x, 64, depth_multiplier, strides=(2, 2), block_id=4)
    x = _depthwise_conv_block(x, 64, depth_multiplier, block_id=5)
    feat1 = x

    # 80,80,64 -> 40,40,128
    x = _depthwise_conv_block(x, 128, depth_multiplier, strides=(2, 2), block_id=6)
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=7)
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=8)
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=9)
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=10)
    x = _depthwise_conv_block(x, 128, depth_multiplier, block_id=11)
    feat2 = x

    # 40,40,128 -> 20,20,256
    x = _depthwise_conv_block(x, 256, depth_multiplier, strides=(2, 2), block_id=12)
    x = _depthwise_conv_block(x, 256, depth_multiplier, block_id=13)
    feat3 = x

    return feat1, feat2, feat3
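A quick shape check of the three effective feature layers for a 640×640 input (assuming the Input/Model imports from the earlier sketch; the parameter count should come out on the order of the ~200,000 mentioned above):

from keras.layers import Input
from keras.models import Model

inp = Input(shape=(640, 640, 3))
feats = MobileNet(inp)
backbone = Model(inp, list(feats))
print(backbone.output_shape)
# [(None, 80, 80, 64), (None, 40, 40, 128), (None, 20, 20, 256)]
print(backbone.count_params())  # roughly 2e5 parameters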

Feature Fusion: The Feature Pyramid

The backbone's three effective feature layers C3, C4, and C5 are first brought to the same channel count (cfg['out_channel']) with 1×1 convolutions, then fused top-down in FPN style: each deeper map is upsampled to the size of the shallower one and added to it.

def RetinaFace(cfg, backbone="mobilenet"):
    inputs = Input(shape=(None, None, 3))

    if backbone == "mobilenet":
        C3, C4, C5 = MobileNet(inputs)
    elif backbone == "resnet50":
        C3, C4, C5 = ResNet50(inputs)
    else:
        raise ValueError('Unsupported backbone - `{}`, Use mobilenet, resnet50.'.format(backbone))

    # Small out_channel (the MobileNet setting) uses LeakyReLU activations.
    leaky = 0
    if (cfg['out_channel'] <= 64):
        leaky = 0.1
    P3 = Conv2D_BN_Leaky(cfg['out_channel'], kernel_size=1, strides=1, padding='same', name='C3_reduced', leaky=leaky)(C3)
    P4 = Conv2D_BN_Leaky(cfg['out_channel'], kernel_size=1, strides=1, padding='same', name='C4_reduced', leaky=leaky)(C4)
    P5 = Conv2D_BN_Leaky(cfg['out_channel'], kernel_size=1, strides=1, padding='same', name='C5_reduced', leaky=leaky)(C5)

    P5_upsampled = UpsampleLike(name='P5_upsampled')([P5, P4])
    P4 = Add(name='P4_merged')([P5_upsampled, P4])
    P4 = Conv2D_BN_Leaky(cfg['out_channel'], kernel_size=3, strides=1, padding='same', name='Conv_P4_merged', leaky=leaky)(P4)

    P4_upsampled = UpsampleLike(name='P4_upsampled')([P4, P3])
    P3 = Add(name='P3_merged')([P4_upsampled, P3])
    P3 = Conv2D_BN_Leaky(cfg['out_channel'], kernel_size=3, strides=1, padding='same', name='Conv_P3_merged', leaky=leaky)(P3)
    SSH1 = SSH(P3, cfg['out_channel'], leaky=leaky)
    SSH2 = SSH(P4, cfg['out_channel'], leaky=leaky)
    SSH3 = SSH(P5, cfg['out_channel'], leaky=leaky)

    SSH_all = [SSH1,SSH2,SSH3]

    bbox_regressions = Concatenate(axis=1,name="bbox_reg")([BboxHead(feature) for feature in SSH_all])
    classifications = Concatenate(axis=1,name="cls")([ClassHead(feature) for feature in SSH_all])
    ldm_regressions = Concatenate(axis=1,name="ldm_reg")([LandmarkHead(feature) for feature in SSH_all])

    output = [bbox_regressions, classifications, ldm_regressions]

    model = Model(inputs=inputs, outputs=output)
    return model
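A usage sketch (the cfg dict below is illustrative only; the real config lives in the repo):

cfg_mnet = {'out_channel': 64}   # hypothetical minimal config for illustration
model = RetinaFace(cfg_mnet, backbone="mobilenet")
model.summary()

Since out_channel <= 64 here, the 1×1 reduction layers use LeakyReLU with slope 0.1, per the leaky logic above.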

Strengthening Feature Fusion with SSH

The SSH module processes each pyramid level with three parallel branches: one 3×3 convolution, two stacked 3×3 convolutions (an equivalent receptive field to a 5×5), and three stacked 3×3 convolutions (equivalent to a 7×7); their outputs are concatenated along the channel axis.

def SSH(inputs, out_channel, leaky=0.1):
    # Branch 1: a single 3x3 conv (receptive field 3x3).
    conv3X3 = Conv2D_BN(out_channel//2, kernel_size=3, strides=1, padding='same')(inputs)

    # Branch 2: two stacked 3x3 convs (receptive field equivalent to 5x5).
    conv5X5_1 = Conv2D_BN_Leaky(out_channel//4, kernel_size=3, strides=1, padding='same', leaky=leaky)(inputs)
    conv5X5 = Conv2D_BN(out_channel//4, kernel_size=3, strides=1, padding='same')(conv5X5_1)

    # Branch 3: three stacked 3x3 convs (receptive field equivalent to 7x7).
    conv7X7_2 = Conv2D_BN_Leaky(out_channel//4, kernel_size=3, strides=1, padding='same', leaky=leaky)(conv5X5_1)
    conv7X7 = Conv2D_BN(out_channel//4, kernel_size=3, strides=1, padding='same')(conv7X7_2)

    # Concatenate: out_channel//2 + out_channel//4 + out_channel//4 = out_channel.
    out = Concatenate(axis=-1)([conv3X3, conv5X5, conv7X7])
    out = Activation("relu")(out)
    return out

The Prediction Heads

By default, every feature-map cell has 2 anchors (candidate face boxes), and a 1×1 convolution adjusts the channel count of each SSH output.
ClassHead predicts whether each anchor contains a face: num_anchors × 2 channels.
BboxHead frames the face, predicting adjustment parameters for the box center coordinates and width/height: num_anchors × 4 channels.
LandmarkHead predicts adjustment parameters for the five facial keypoints: num_anchors × 10 channels (5 points × 2 coordinates).
Code:

def ClassHead(inputs, num_anchors=2):
    outputs = Conv2D(num_anchors*2, kernel_size=1, strides=1)(inputs)
    return Activation("softmax")(Reshape([-1,2])(outputs))

def BboxHead(inputs, num_anchors=2):
    outputs = Conv2D(num_anchors*4, kernel_size=1, strides=1)(inputs)
    return Reshape([-1,4])(outputs)

def LandmarkHead(inputs, num_anchors=2):
    outputs = Conv2D(num_anchors*5*2, kernel_size=1, strides=1)(inputs)
    return Reshape([-1,10])(outputs)
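With a 640×640 input, the three SSH levels correspond to strides 8, 16, and 32 (the 80×80, 40×40, and 20×20 grids from the backbone above), so the total number of predictions the heads produce is fixed; a quick check:

strides = [8, 16, 32]
num_anchors = 2
cells = [(640 // s) ** 2 for s in strides]   # 80x80, 40x40, 20x20 grids
print(num_anchors * sum(cells))              # 16800 anchors
# bbox_reg: (16800, 4), cls: (16800, 2), ldm_reg: (16800, 10)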

The raw predictions must then be decoded: each grid point (prior) is shifted by its predicted x_offset and y_offset, which gives the center of the predicted box; the prior is then combined with the predicted h and w adjustments to compute the box height and width. Together these give the full position of each predicted box.

Code:

# The two methods below belong to the repo's box-decoding utility class.
def detection_out(self, predictions, mbox_priorbox, confidence_threshold=0.4):
    # Raw network outputs
    mbox_loc = predictions[0][0]
    # Face confidence (probability of the "face" class)
    mbox_conf = predictions[1][0][:, 1:2]
    # Landmark adjustments
    mbox_ldm = predictions[2][0]

    decode_bbox = self.decode_boxes(mbox_loc, mbox_ldm, mbox_priorbox)

    conf_mask = (mbox_conf >= confidence_threshold)[:, 0]

    detection = np.concatenate((decode_bbox[conf_mask][:, :4], mbox_conf[conf_mask], decode_bbox[conf_mask][:, 4:]), -1)

    best_box = []
    scores = detection[:, 4]
    # Sort detections by score, highest first.
    arg_sort = np.argsort(scores)[::-1]
    detection = detection[arg_sort]
    while np.shape(detection)[0] > 0:
        # Non-maximum suppression: keep the highest-scoring box, then drop
        # every remaining box that overlaps it too much.
        best_box.append(detection[0])
        if len(detection) == 1:
            break
        ious = iou(best_box[-1], detection[1:])  # iou() is the repo's IoU helper
        detection = detection[1:][ious < self._nms_thresh]

    return best_box

def decode_boxes(self, mbox_loc, mbox_ldm, mbox_priorbox):
    # Width and height of the priors
    prior_width = mbox_priorbox[:, 2] - mbox_priorbox[:, 0]
    prior_height = mbox_priorbox[:, 3] - mbox_priorbox[:, 1]
    # Center of the priors
    prior_center_x = 0.5 * (mbox_priorbox[:, 2] + mbox_priorbox[:, 0])
    prior_center_y = 0.5 * (mbox_priorbox[:, 3] + mbox_priorbox[:, 1])

    # Offset of the decoded center from the prior center (variance 0.1)
    decode_bbox_center_x = mbox_loc[:, 0] * prior_width * 0.1
    decode_bbox_center_x += prior_center_x
    decode_bbox_center_y = mbox_loc[:, 1] * prior_height * 0.1
    decode_bbox_center_y += prior_center_y

    # Decoded width and height (variance 0.2)
    decode_bbox_width = np.exp(mbox_loc[:, 2] * 0.2)
    decode_bbox_width *= prior_width
    decode_bbox_height = np.exp(mbox_loc[:, 3] * 0.2)
    decode_bbox_height *= prior_height

    # Top-left and bottom-right corners of the decoded box
    decode_bbox_xmin = decode_bbox_center_x - 0.5 * decode_bbox_width
    decode_bbox_ymin = decode_bbox_center_y - 0.5 * decode_bbox_height
    decode_bbox_xmax = decode_bbox_center_x + 0.5 * decode_bbox_width
    decode_bbox_ymax = decode_bbox_center_y + 0.5 * decode_bbox_height

    prior_width = np.expand_dims(prior_width, -1)
    prior_height = np.expand_dims(prior_height, -1)
    prior_center_x = np.expand_dims(prior_center_x, -1)
    prior_center_y = np.expand_dims(prior_center_y, -1)

    # Decode the five landmarks the same way as the box center.
    mbox_ldm = mbox_ldm.reshape([-1, 5, 2])
    decode_ldm = np.zeros_like(mbox_ldm)
    decode_ldm[:, :, 0] = np.repeat(prior_width, 5, axis=-1) * mbox_ldm[:, :, 0] * 0.1 + np.repeat(prior_center_x, 5, axis=-1)
    decode_ldm[:, :, 1] = np.repeat(prior_height, 5, axis=-1) * mbox_ldm[:, :, 1] * 0.1 + np.repeat(prior_center_y, 5, axis=-1)

    # Stack the corners and landmarks into one array.
    decode_bbox = np.concatenate((decode_bbox_xmin[:, None],
                                  decode_bbox_ymin[:, None],
                                  decode_bbox_xmax[:, None],
                                  decode_bbox_ymax[:, None],
                                  np.reshape(decode_ldm, [-1, 10])), axis=-1)
    # Clip to [0, 1]
    decode_bbox = np.minimum(np.maximum(decode_bbox, 0.0), 1.0)
    return decode_bbox
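A tiny worked example of the center/size decoding rule on a single prior (pure NumPy, with made-up offsets):

import numpy as np

prior = np.array([0.40, 0.40, 0.60, 0.60])           # xmin, ymin, xmax, ymax
pw = prior[2] - prior[0]                             # 0.2
cx = 0.5 * (prior[0] + prior[2])                     # 0.5

dx, dw = 1.0, np.log(2.0) / 0.2   # predicted offsets: shift right, double the width
new_cx = cx + dx * pw * 0.1       # 0.5 + 0.02 = 0.52
new_w = pw * np.exp(dw * 0.2)     # 0.2 * 2 = 0.4
print(new_cx, new_w)              # 0.52 0.4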

Loss Computation

Three kinds of loss are computed:
1. Box Smooth L1 Loss: regression loss over the box predictions of all positive anchors.
2. MultiBox conf_loss: cross-entropy loss over the class predictions.
3. Landmark Smooth L1 Loss: regression loss over the keypoint predictions of all positive anchors.
For conf_loss, positives and negatives are heavily imbalanced, which would otherwise blow up the loss, so hard negative mining with a positive:negative ratio of 1:7 is used.
For the landmark Smooth L1 loss, some annotated faces are too unclear to carry keypoint labels, so the loss is computed only over priors judged to contain a face that also has keypoint annotations; in the code, the last channel of y_true is the indicator that marks which anchors count.

def softmax_loss(y_true, y_pred):
    # Clip predictions to avoid log(0).
    y_pred = tf.maximum(y_pred, 1e-7)
    softmax_loss = -tf.reduce_sum(y_true * tf.log(y_pred),
                                  axis=-1)
    return softmax_loss

def conf_loss(neg_pos_ratio=7, negatives_for_hard=100):
    def _conf_loss(y_true, y_pred):
        batch_size = tf.shape(y_true)[0]
        num_boxes = tf.to_float(tf.shape(y_true)[1])

        labels         = y_true[:, :, :-1]
        classification = y_pred

        cls_loss = softmax_loss(labels, classification)

        # Number of positive anchors per image (last channel is the indicator).
        num_pos = tf.reduce_sum(y_true[:, :, -1], axis=-1)

        pos_conf_loss = tf.reduce_sum(cls_loss * y_true[:, :, -1],
                                      axis=1)
        # Number of negatives to mine: at most neg_pos_ratio per positive.
        num_neg = tf.minimum(neg_pos_ratio * num_pos,
                             num_boxes - num_pos)

        # Which images actually have negatives to mine
        pos_num_neg_mask = tf.greater(num_neg, 0)
        # 1.0 if any image has negatives, else 0.0
        has_min = tf.to_float(tf.reduce_any(pos_num_neg_mask))
        # Fall back to a fixed number of hard negatives if no image has any.
        num_neg = tf.concat(axis=0, values=[num_neg,
                                [(1 - has_min) * negatives_for_hard]])

        # Average number of negatives to take per image
        num_neg_batch = tf.reduce_mean(tf.boolean_mask(num_neg,
                                                      tf.greater(num_neg, 0)))
        num_neg_batch = tf.to_int32(num_neg_batch)

        # Face-class confidence of every anchor
        max_confs = y_pred[:, :, 1]

        # Hard negative mining: the top-k most confident non-positive anchors
        x, indices = tf.nn.top_k(max_confs * (1 - y_true[:, :, -1]),
                                 k=num_neg_batch)

        # Convert the per-image indices to flat indices
        batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
        batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
        full_indices = (tf.reshape(batch_idx, [-1]) * tf.to_int32(num_boxes) +
                        tf.reshape(indices, [-1]))

        neg_conf_loss = tf.gather(tf.reshape(cls_loss, [-1]),
                                  full_indices)
        neg_conf_loss = tf.reshape(neg_conf_loss,
                                   [batch_size, num_neg_batch])
        neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

        # Normalize by the number of positives (at least 1).
        num_pos = tf.where(tf.not_equal(num_pos, 0), num_pos,
                            tf.ones_like(num_pos))
        total_loss = tf.reduce_sum(pos_conf_loss) + tf.reduce_sum(neg_conf_loss)
        total_loss /= tf.reduce_sum(num_pos)
        return total_loss
    return _conf_loss

def box_smooth_l1(sigma=1):
    sigma_squared = sigma ** 2

    def _smooth_l1(y_true, y_pred):
        regression        = y_pred
        regression_target = y_true[:, :, :-1]
        anchor_state      = y_true[:, :, -1]

        # Keep only the positive anchors
        indices           = tf.where(keras.backend.equal(anchor_state, 1))
        regression        = tf.gather_nd(regression, indices)
        regression_target = tf.gather_nd(regression_target, indices)

        # Smooth L1 loss:
        # f(x) = 0.5 * (sigma * x)^2          if |x| < 1 / sigma^2
        #        |x| - 0.5 / sigma^2          otherwise
        regression_diff = regression - regression_target
        regression_diff = keras.backend.abs(regression_diff)
        # `backend` here is the repo's helper module; backend.where behaves like tf.where.
        regression_loss = backend.where(
            keras.backend.less(regression_diff, 1.0 / sigma_squared),
            0.5 * sigma_squared * keras.backend.pow(regression_diff, 2),
            regression_diff - 0.5 / sigma_squared
        )

        # Normalize by the number of positive anchors (at least 1).
        normalizer = keras.backend.maximum(1, keras.backend.shape(indices)[0])
        normalizer = keras.backend.cast(normalizer, dtype=keras.backend.floatx())
        loss = keras.backend.sum(regression_loss) / normalizer

        return loss

    return _smooth_l1

# Identical to box_smooth_l1, except the regression target here holds the
# 10 landmark offsets (5 points x 2 coordinates) instead of the 4 box offsets.
def ldm_smooth_l1(sigma=1):
    sigma_squared = sigma ** 2

    def _smooth_l1(y_true, y_pred):
        regression        = y_pred
        regression_target = y_true[:, :, :-1]
        anchor_state      = y_true[:, :, -1]

        # Keep only the positive anchors
        indices           = tf.where(keras.backend.equal(anchor_state, 1))
        regression        = tf.gather_nd(regression, indices)
        regression_target = tf.gather_nd(regression_target, indices)

        # Smooth L1 loss:
        # f(x) = 0.5 * (sigma * x)^2          if |x| < 1 / sigma^2
        #        |x| - 0.5 / sigma^2          otherwise
        regression_diff = regression - regression_target
        regression_diff = keras.backend.abs(regression_diff)
        regression_loss = backend.where(
            keras.backend.less(regression_diff, 1.0 / sigma_squared),
            0.5 * sigma_squared * keras.backend.pow(regression_diff, 2),
            regression_diff - 0.5 / sigma_squared
        )

        # Normalize by the number of positive anchors (at least 1).
        normalizer = keras.backend.maximum(1, keras.backend.shape(indices)[0])
        normalizer = keras.backend.cast(normalizer, dtype=keras.backend.floatx())
        loss = keras.backend.sum(regression_loss) / normalizer

        return loss

    return _smooth_l1
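Wiring the three losses to the model's three outputs is then straightforward; the loss order matches the [bbox_regressions, classifications, ldm_regressions] output list returned by RetinaFace above (the optimizer choice and cfg_mnet from the earlier sketch are illustrative):

from keras.optimizers import Adam

model = RetinaFace(cfg_mnet, backbone="mobilenet")
model.compile(optimizer=Adam(lr=1e-3),
              loss=[box_smooth_l1(),   # bbox_reg output
                    conf_loss(),       # cls output
                    ldm_smooth_l1()])  # ldm_reg output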

Training Data



Reposted from blog.csdn.net/qq_35914625/article/details/108052881