Face detection - MTCNN

1. Face detection

Face detection/face recognition is a biometric technology that identifies people from facial feature information. A camera or video camera collects images or video streams containing faces, the faces in those images are automatically detected and tracked, and recognition is then performed on the detected faces. The technology is also commonly called portrait recognition or facial recognition.

Face detection is difficult in practice, and face recognition is used mainly for identification.
With the rapid spread of video surveillance, many applications urgently need a technology that can identify people quickly, at a distance, and without their cooperation, so that identities can be confirmed from afar and intelligent early warning becomes possible. Face recognition is a natural fit for this: fast face detection can locate faces in surveillance video in real time and compare them against a face database, achieving rapid identification.

Face recognition products are already widely used in finance, justice, the military, public security, border inspection, government, aerospace, electric power, factories, education, medical care, and many other enterprises and institutions. As the technology matures further and social acceptance grows, it will be applied in even more fields, for example:

  1. Business and residential security and management, such as face recognition access control and attendance systems, and face recognition anti-theft doors.
  2. Electronic passports and ID cards.
  3. Public security, justice, and criminal investigation, for example searching for fugitives nationwide by combining a face recognition system with the Internet.
  4. Self service.
  5. Information security, such as mobile phone and computer login, e-government, and e-commerce.

2. MTCNN

2.1 Overview

MTCNN is short for Multi-task Cascaded Convolutional Neural Network. It performs face region detection and face key-point (landmark) detection in a single framework.

From an engineering point of view, MTCNN offers a good balance of detection speed and accuracy, and its inference pipeline is instructive. Face detection can also be done with Faster R-CNN or YOLO, but MTCNN remains one of the strongest choices dedicated specifically to face detection.

(figure: MTCNN detection results, with a face box and five landmarks per face)

In the image on the left, MTCNN not only frames each face but also marks five points inside each box: the two eyes, the nose, and the two corners of the mouth.
MTCNN is not perfectly accurate, however; the dark-skinned person in the image was not detected. Two reasons are likely. First, open-source training datasets may contain relatively few dark-skinned faces. Second, lower contrast between facial features (for example, the eyes) and the surrounding skin can make those features harder for the detector to pick out, so the face was not recognized.

2.2 Network structure

MTCNN is called a multi-task convolutional neural network because it jointly learns face classification, bounding-box regression, and landmark localization; structurally, it is a cascade of three sub-networks: P-Net, R-Net, and O-Net.

Given an input image, the first step is resizing, but not simply resizing the image to the size the network requires. Instead, an image pyramid is generated. Every image in the pyramid is fed into P-Net, after which NMS removes redundant boxes and bounding-box regression refines the remaining ones. The results from P-Net are then passed to the second stage, R-Net, where the same steps are repeated, and finally to O-Net, which also produces the five key points. The whole process is therefore a progressive calibration of the face boxes.
(figure: MTCNN three-stage pipeline)

The overall process is as follows

  1. The predicted bounding boxes are generated from the original image and PNet.
  2. Input the original image and the bounding box generated by PNet, and generate the corrected bounding box through RNet.
  3. Input the original image and the bounding box generated by RNet, and generate the corrected bounding box and facial contour key points through ONet.

The network structure diagram is as follows
(figure: MTCNN network structure diagram)

MTCNN mainly consists of three networks:

  1. The first layer, P-Net, applies convolution and pooling and outputs a classification result (whether the corresponding region contains a face) and a regression result (the bounding box).
  2. The second layer uses non-maximum suppression (NMS) to remove highly overlapping candidate boxes from the first layer's output and feeds the remaining candidates into R-Net for refinement. R-Net rejects a large number of false boxes, corrects the regression boxes, and applies NMS again to remove overlapping boxes; its output branches are again classification and regression.
  3. Finally, the candidate boxes that R-Net still regards as faces are fed into O-Net for one more refinement pass that rejects the remaining false boxes. At this stage there are three output branches:
     a. Face / no face: 2 outputs;
     b. Regression: the x and y coordinates of the box's starting point (or center) and the box's width and height obtained by regression, 4 outputs;
     c. Facial landmark localization: the x and y coordinates of the 5 facial landmarks, 10 outputs.

Note: all three stages apply NMS, but with different thresholds.

2.2.1 Building an Image Pyramid

First, the image is resized: the original image is scaled to several different sizes to build an image pyramid. Images at these different scales are then fed into the network so that faces of different sizes can be detected, achieving multi-scale detection.
The pyramid is built by repeatedly multiplying the image's height and width by a scaling factor, so each level is factor times the size of the previous one.

Note: The minimum length and width after shrinking cannot be less than 12
(figure: image pyramid)

So why transform the image into a "pyramid" at all?
Faces appear in images at many different scales, and making a detector insensitive to target scale has always been a challenge.

MTCNN uses an image pyramid to handle multiple target scales: the original image is scaled down repeatedly by a fixed ratio (for example 0.709), producing a set of images that, stacked together, look like a pyramid.

The P-Net model is trained on single-scale (12×12) images. At inference time, the smallest side of a scaled image must not drop below 12.

Training on input images of multiple scales is very time-consuming, so the image pyramid is usually used only at the inference stage to improve accuracy.

Why can setting an appropriate minimum face size and scaling factor optimize computational efficiency?

  • factor is the ratio by which each side of the image is scaled at every pyramid level.
  • In the first stage, the original image is scaled down several times to obtain the image pyramid. The goal is to bring the faces in the scaled images close to the scale used when training P-Net (12px × 12px).
  • Extended optimization: first rescale the image to a fixed reference size, and only then shrink it repeatedly by factor. This reduces the amount of computation.

Here is a rough statistic: adjusting minsize changes the number of pyramid levels, and hence the number of P-Net passes, in the first stage.
Example: if the test image is 1200px × 1200px, we want the scaled sizes to approach the scale of the training images (12px × 12px).
(figure: number of pyramid levels for different minsize values)

  • minsize is the smallest face size (in px) that you want to be able to detect in the image.
  • Note: the code in this article uses the "extended optimization" strategy (see calculateScales in section 4.3); the more common minsize-based schedule is sketched right below.
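
As an illustration, here is a minimal sketch of the minsize-based scale schedule used by many MTCNN implementations: the first scale maps a face of minsize pixels down to the 12px P-Net window, and each further level shrinks the image by factor. The function name is illustrative; this is not the calculateScales() from section 4.3, which first normalizes the image toward 500px instead of using minsize.

def pyramid_scales(height, width, minsize=20, factor=0.709):
    # Sketch of a minsize-based pyramid schedule (hypothetical helper, for illustration).
    # The first scale maps a minsize-pixel face onto the 12x12 P-Net window;
    # each subsequent level multiplies the image size by `factor`,
    # stopping once the smaller side would drop below 12 pixels.
    scales = []
    m = 12.0 / minsize
    min_side = min(height, width) * m
    while min_side >= 12:
        scales.append(m * factor ** len(scales))
        min_side *= factor
    return scales

# Larger minsize -> fewer pyramid levels -> fewer P-Net passes.
print(len(pyramid_scales(1200, 1200, minsize=20)))   # 12 levels
print(len(pyramid_scales(1200, 1200, minsize=60)))   # 9 levels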

Why is the scaling factor officially chosen to be 0.709?

  • When the picture pyramid is zoomed, if the width and height are changed to 1/2 of the original by default, the area after zooming will become 1/4 of the original;
  • If you feel that shrinking the area to 1/4 at each step is too aggressive, you can instead shrink the area to 1/2 of the original.
  • This is exactly the intuition behind the chosen value: the scaling factor 0.709 ≈ sqrt(2)/2, so the width and height each become sqrt(2)/2 of the original and the area becomes 1/2 of the original.
  • In a practical sense, factor should be set to less than 1.
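
A quick check of the arithmetic behind this choice (plain Python, independent of the MTCNN code):

import math

factor = math.sqrt(2) / 2
print(factor)        # ~0.7071, commonly rounded to 0.709 in MTCNN implementations
print(factor ** 2)   # ~0.5, i.e. each pyramid level halves the image area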

Disadvantages of image pyramids: Slow.

  1. First, generating image pyramids is slow;
  2. Second, pictures of each scale need to be input into the model, which is equivalent to performing multiple model inference processes.

2.2.2 P-Net (Proposal Network)

P-Net is basically a fully convolutional network (FCN). Each image of the pyramid built in the previous step is run through this FCN for preliminary feature extraction and box calibration.
(figure: P-Net structure)

Why can the MTCNN algorithm accept input images of any size?

  • Because P-Net, the first stage, is a fully convolutional network (FCN).
  • Convolution, pooling, and nonlinear activations can all operate on matrices of any size, whereas fully connected layers require a fixed input size. If a network contains a fully connected layer, the input image size (generally) has to be fixed; if it contains none, the image size can be arbitrary.
  • At inference time, the scale of the faces in the test image is unknown. Because the P-Net structure is fixed, a 12×12 input produces exactly a 1×1 output containing 5 useful values (a face confidence and 4 box offsets), so P-Net as a whole behaves like a 12×12 convolution kernel sliding over the picture from the top-left corner with an effective stride of 2. Consequently, while the picture is still large, a 12×12 window may only cover part of a large face, such as an eye, an ear, or the nose; as the image is shrunk level by level (the image pyramid), the 12×12 window covers more and more of the face until it frames the whole face.
  • After 3 convolutions and 1 pooling operation, the original 12×12×3 input becomes a 1×1×32 feature.
  • This 1×1×32 vector then passes through a 1×1 convolution with 2 output channels to obtain the "face / no face" classification result.
  • Let the input image matrix be A. The network slides over A, scoring each 12×12×3 region for the presence of a face, and the result is a two-dimensional matrix S whose entries lie in [0, 1] and represent face probabilities. In other words, A is mapped to S by a series of matrix operations, as the shape check below illustrates.
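
A quick way to see this is to build a P-Net-shaped model without any trained weights and just inspect its output shape for different input sizes. This is a minimal sketch assuming Keras is installed; the PReLU layers and the regression branch are omitted because they do not change the spatial shape.

import numpy as np
from keras.layers import Conv2D, Input, MaxPool2D
from keras.models import Model

# P-Net-like backbone: three 3x3 "valid" convolutions and one 2x2 max-pool.
inp = Input(shape=[None, None, 3])
x = Conv2D(10, (3, 3))(inp)
x = MaxPool2D(pool_size=2)(x)
x = Conv2D(16, (3, 3))(x)
x = Conv2D(32, (3, 3))(x)
score = Conv2D(2, (1, 1), activation='softmax')(x)   # "face / no face" map
model = Model(inp, score)

print(model.predict(np.zeros((1, 12, 12, 3))).shape)  # (1, 1, 1, 2): one 12x12 window, one score
print(model.predict(np.zeros((1, 70, 70, 3))).shape)  # (1, 30, 30, 2): a 30x30 score map S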

The output of P-Net:

  1. The first output branch determines whether the image contains a face; the output size is 1×1×2, i.e. two values.
  2. The second branch gives the precise position of the box, i.e. bounding-box regression. The 12×12 patch fed into P-Net is usually not a perfect face box; the face may sit slightly to the left or right, so the network outputs the offset of the current box relative to the ideal face box. This offset has size 1×1×4: the relative offset of the box's top-left x coordinate, the relative offset of its top-left y coordinate, the error in the box width, and the error in the box height. The sketch after this list shows how such offsets are applied.
  3. The third branch gives the positions of the 5 facial key points: the left eye, the right eye, the nose, the left mouth corner, and the right mouth corner. Each key point needs two values, so the output is a vector of size 1×1×10.
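
For intuition, here is a minimal sketch of how four regression outputs refine a candidate box. It follows the convention used later in filter_face_24net / filter_face_48net (section 4.3), where each offset is expressed relative to the box's width or height; the function name is illustrative and not part of the original code.

def refine_box(box, offsets):
    # box     = (x1, y1, x2, y2) in original-image coordinates
    # offsets = (dx1, dy1, dx2, dy2) predicted by the network, relative to the box size
    x1, y1, x2, y2 = box
    dx1, dy1, dx2, dy2 = offsets
    w, h = x2 - x1, y2 - y1
    return (x1 + dx1 * w, y1 + dy1 * h, x2 + dx2 * w, y2 + dy2 * h)

# Example: shift a 12x12 candidate slightly right and make it a bit taller.
print(refine_box((100, 100, 112, 112), (0.1, 0.0, 0.1, 0.2)))
# -> (101.2, 100.0, 113.2, 114.4)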

Example:
For a 70×70 image, after the fully convolutional P-Net the output side length is (70-2)/2 - 2 - 2 = 30, i.e. a 5-channel 30×30 feature map. This is equivalent to sliding the 12×12 P-Net window over the picture, producing 30×30 = 900 proposal boxes, each with 1 confidence value conf and 4 offsets. NMS then keeps the proposals whose conf exceeds the set threshold of 0.6, the corresponding offsets are used for bounding-box regression to recover coordinates in the original image, and the surviving P-Net proposals are passed on to R-Net. The sketch below reproduces this size calculation.
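
The output-size arithmetic above can be written as a tiny helper (a sketch for illustration only; the function name is made up):

def pnet_output_side(side):
    # P-Net spatial shape: 3x3 "valid" conv -> 2x2 max-pool (stride 2) -> two more 3x3 "valid" convs
    side = side - 2     # conv1
    side = side // 2    # max-pool
    side = side - 2     # conv2
    side = side - 2     # conv3
    return side

print(pnet_output_side(12))   # 1  -> a single 12x12 window gives one output cell
print(pnet_output_side(70))   # 30 -> 30 * 30 = 900 proposal boxes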

2.2.3 R-Net (Refine Network)

As the network diagram shows, R-Net differs from P-Net mainly in having one additional fully connected layer at the end.
Candidate regions are scaled to 24×24×3 before being fed into R-Net. Its outputs have the same form as P-Net's; the purpose of R-Net is to reject a large number of non-face boxes.
(figure: R-Net structure)

2.2.4 O-Net (Output Network)

O-Net has one more convolutional layer than R-Net, so its results are more refined. The input size is 48×48×3, and the output contains the coordinates of N bounding boxes, their scores, and the key-point positions.
(figure: O-Net structure)

3. Summary

From P-Net to R-Net and finally O-Net, the input image to the network gets larger, the convolutional layers gain more channels, and the network gets deeper, so the accuracy of face detection increases stage by stage.

The training data for MTCNN face detection can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/ . The WIDER FACE dataset contains 32,203 images with a total of 393,703 labeled faces.
(figure: sample images from the WIDER FACE dataset)

4. Code example

How to use MTCNN
Run detect.py; the models it loads are pnet.h5, rnet.h5, and onet.h5 from the model_data directory.

4.1 mtcnn.py

from keras.layers import Conv2D, Input,MaxPool2D, Reshape,Activation,Flatten, Dense, Permute
from keras.layers.advanced_activations import PReLU
from keras.models import Model, Sequential
import tensorflow as tf
import numpy as np
import utils
import cv2
#-----------------------------#
#   Roughly locate face boxes
#   Outputs bbox positions and a face / no-face score
#-----------------------------#
def create_Pnet(weight_path):
    input = Input(shape=[None, None, 3])

    x = Conv2D(10, (3, 3), strides=1, padding='valid', name='conv1')(input)
    x = PReLU(shared_axes=[1,2],name='PReLU1')(x)
    x = MaxPool2D(pool_size=2)(x)

    x = Conv2D(16, (3, 3), strides=1, padding='valid', name='conv2')(x)
    x = PReLU(shared_axes=[1,2],name='PReLU2')(x)

    x = Conv2D(32, (3, 3), strides=1, padding='valid', name='conv3')(x)
    x = PReLU(shared_axes=[1,2],name='PReLU3')(x)

    classifier = Conv2D(2, (1, 1), activation='softmax', name='conv4-1')(x)
    # No activation function: linear output.
    bbox_regress = Conv2D(4, (1, 1), name='conv4-2')(x)

    model = Model([input], [classifier, bbox_regress])
    model.load_weights(weight_path, by_name=True)
    return model

#-----------------------------#
#   Second stage of MTCNN
#   Refines the boxes
#-----------------------------#
def create_Rnet(weight_path):
    input = Input(shape=[24, 24, 3])
    # 24,24,3 -> 11,11,28
    x = Conv2D(28, (3, 3), strides=1, padding='valid', name='conv1')(input)
    x = PReLU(shared_axes=[1, 2], name='prelu1')(x)
    x = MaxPool2D(pool_size=3,strides=2, padding='same')(x)

    # 11,11,28 -> 4,4,48
    x = Conv2D(48, (3, 3), strides=1, padding='valid', name='conv2')(x)
    x = PReLU(shared_axes=[1, 2], name='prelu2')(x)
    x = MaxPool2D(pool_size=3, strides=2)(x)

    # 4,4,48 -> 3,3,64
    x = Conv2D(64, (2, 2), strides=1, padding='valid', name='conv3')(x)
    x = PReLU(shared_axes=[1, 2], name='prelu3')(x)
    # 3,3,64 -> 64,3,3
    x = Permute((3, 2, 1))(x)
    x = Flatten()(x)
    # 576 -> 128
    x = Dense(128, name='conv4')(x)
    x = PReLU( name='prelu4')(x)
    # 128 -> 2 128 -> 4
    classifier = Dense(2, activation='softmax', name='conv5-1')(x)
    bbox_regress = Dense(4, name='conv5-2')(x)
    model = Model([input], [classifier, bbox_regress])
    model.load_weights(weight_path, by_name=True)
    return model
    
#-----------------------------#
#   Third stage of MTCNN
#   Refines the boxes and outputs the five landmarks
#-----------------------------#
def create_Onet(weight_path):
    input = Input(shape = [48,48,3])
    # 48,48,3 -> 23,23,32
    x = Conv2D(32, (3, 3), strides=1, padding='valid', name='conv1')(input)
    x = PReLU(shared_axes=[1,2],name='prelu1')(x)
    x = MaxPool2D(pool_size=3, strides=2, padding='same')(x)
    # 23,23,32 -> 10,10,64
    x = Conv2D(64, (3, 3), strides=1, padding='valid', name='conv2')(x)
    x = PReLU(shared_axes=[1,2],name='prelu2')(x)
    x = MaxPool2D(pool_size=3, strides=2)(x)
    # 10,10,64 -> 4,4,64
    x = Conv2D(64, (3, 3), strides=1, padding='valid', name='conv3')(x)
    x = PReLU(shared_axes=[1,2],name='prelu3')(x)
    x = MaxPool2D(pool_size=2)(x)
    # 4,4,64 -> 3,3,128
    x = Conv2D(128, (2, 2), strides=1, padding='valid', name='conv4')(x)
    x = PReLU(shared_axes=[1,2],name='prelu4')(x)
    # 3,3,128 -> 128,3,3
    x = Permute((3,2,1))(x)

    # 1152 -> 256
    x = Flatten()(x)
    x = Dense(256, name='conv5') (x)
    x = PReLU(name='prelu5')(x)

    # Classification
    # 256 -> 2 256 -> 4 256 -> 10 
    classifier = Dense(2, activation='softmax',name='conv6-1')(x)
    bbox_regress = Dense(4,name='conv6-2')(x)
    landmark_regress = Dense(10,name='conv6-3')(x)

    model = Model([input], [classifier, bbox_regress, landmark_regress])
    model.load_weights(weight_path, by_name=True)

    return model

class mtcnn():
    def __init__(self):
        self.Pnet = create_Pnet('model_data/pnet.h5')
        self.Rnet = create_Rnet('model_data/rnet.h5')
        self.Onet = create_Onet('model_data/onet.h5')

    def detectFace(self, img, threshold):
        #-----------------------------#
        #   Normalize to speed up convergence:
        #   map [0, 255] to (-1, 1)
        #-----------------------------#
        copy_img = (img.copy() - 127.5) / 127.5
        origin_h, origin_w, _ = copy_img.shape
        #-----------------------------#
        #   Compute the scaling ratio for each
        #   pyramid level of the original input image
        #-----------------------------#
        scales = utils.calculateScales(img)

        out = []
        #-----------------------------#
        #   Rough face box computation
        #   (P-Net stage)
        #-----------------------------#
        for scale in scales:
            hs = int(origin_h * scale)
            ws = int(origin_w * scale)
            scale_img = cv2.resize(copy_img, (ws, hs))
            inputs = scale_img.reshape(1, *scale_img.shape)
            # Feed each image of the pyramid into Pnet to get its output
            output = self.Pnet.predict(inputs)
            # Collect all outputs in out
            out.append(output)

        image_num = len(scales)
        rectangles = []
        for i in range(image_num):
            # Probability that a face is present
            cls_prob = out[i][0][0][:,:,1]
            # Corresponding box positions
            roi = out[i][1][0]

            # Height and width of the output map for this scale
            out_h, out_w = cls_prob.shape
            out_side = max(out_h, out_w)
            print(cls_prob.shape)
            # Decoding step: map P-Net output back to boxes in the original image
            rectangle = utils.detect_face_12net(cls_prob, roi, out_side, 1 / scales[i], origin_w, origin_h, threshold[0])
            rectangles.extend(rectangle)

        # Non-maximum suppression
        rectangles = utils.NMS(rectangles, 0.7)

        if len(rectangles) == 0:
            return rectangles

        #-----------------------------#
        #   More precise face box computation
        #   (R-Net stage)
        #-----------------------------#
        predict_24_batch = []
        for rectangle in rectangles:
            crop_img = copy_img[int(rectangle[1]):int(rectangle[3]), int(rectangle[0]):int(rectangle[2])]
            scale_img = cv2.resize(crop_img, (24, 24))
            predict_24_batch.append(scale_img)

        predict_24_batch = np.array(predict_24_batch)
        out = self.Rnet.predict(predict_24_batch)

        cls_prob = out[0]
        cls_prob = np.array(cls_prob)
        roi_prob = out[1]
        roi_prob = np.array(roi_prob)
        rectangles = utils.filter_face_24net(cls_prob, roi_prob, rectangles, origin_w, origin_h, threshold[1])

        if len(rectangles) == 0:
            return rectangles

        #-----------------------------#
        #   Final face box computation
        #   (O-Net stage)
        #-----------------------------#
        predict_batch = []
        for rectangle in rectangles:
            crop_img = copy_img[int(rectangle[1]):int(rectangle[3]), int(rectangle[0]):int(rectangle[2])]
            scale_img = cv2.resize(crop_img, (48, 48))
            predict_batch.append(scale_img)

        predict_batch = np.array(predict_batch)
        output = self.Onet.predict(predict_batch)
        cls_prob = output[0]
        roi_prob = output[1]
        pts_prob = output[2]

        rectangles = utils.filter_face_48net(cls_prob, roi_prob, pts_prob, rectangles, origin_w, origin_h, threshold[2])

        return rectangles


4.2 detect.py

import cv2
import numpy as np
from mtcnn import mtcnn

img = cv2.imread('img/test1.jpg')

model = mtcnn()
threshold = [0.5,0.6,0.7]  # the three stages use different confidence thresholds
rectangles = model.detectFace(img, threshold)
draw = img.copy()

for rectangle in rectangles:
    if rectangle is not None:
        W = -int(rectangle[0]) + int(rectangle[2])
        H = -int(rectangle[1]) + int(rectangle[3])
        paddingH = 0.01 * W
        paddingW = 0.02 * H
        crop_img = img[int(rectangle[1]+paddingH):int(rectangle[3]-paddingH), int(rectangle[0]-paddingW):int(rectangle[2]+paddingW)]
        if crop_img is None:
            continue
        if crop_img.shape[0] <= 0 or crop_img.shape[1] <= 0:  # skip empty crops
            continue
        cv2.rectangle(draw, (int(rectangle[0]), int(rectangle[1])), (int(rectangle[2]), int(rectangle[3])), (255, 0, 0), 1)

        for i in range(5, 15, 2):
            cv2.circle(draw, (int(rectangle[i + 0]), int(rectangle[i + 1])), 2, (0, 255, 0))

cv2.imwrite("img/out.jpg",draw)

cv2.imshow("test", draw)
c = cv2.waitKey(0)

4.3 utils.py

import sys
from operator import itemgetter
import numpy as np
import cv2
import matplotlib.pyplot as plt
#-----------------------------#
#   Compute the scaling ratio for each
#   pyramid level of the original input image
#-----------------------------#
def calculateScales(img):
    copy_img = img.copy()
    pr_scale = 1.0
    h,w,_ = copy_img.shape
    # Extended optimization  = resize(h*500/min(h,w), w*500/min(h,w))
    if min(w,h)>500:
        pr_scale = 500.0/min(h,w)
        w = int(w*pr_scale)
        h = int(h*pr_scale)
    elif max(w,h)<500:
        pr_scale = 500.0/max(h,w)
        w = int(w*pr_scale)
        h = int(h*pr_scale)

    scales = []
    factor = 0.709
    factor_count = 0
    minl = min(h,w)
    while minl >= 12:
        scales.append(pr_scale*pow(factor, factor_count))
        minl *= factor
        factor_count += 1
    return scales

#-------------------------------------#
#   Post-process the P-Net outputs
#-------------------------------------#
def detect_face_12net(cls_prob,roi,out_side,scale,width,height,threshold):
    cls_prob = np.swapaxes(cls_prob, 0, 1)
    roi = np.swapaxes(roi, 0, 2)

    stride = 0
    # stride is approximately 2
    if out_side != 1:
        stride = float(2*out_side-1)/(out_side-1)
    (x,y) = np.where(cls_prob>=threshold)

    boundingbox = np.array([x,y]).T
    # Map the indices back to positions in the original image
    bb1 = np.fix((stride * (boundingbox) + 0 ) * scale)
    bb2 = np.fix((stride * (boundingbox) + 11) * scale)
    # plt.scatter(bb1[:,0],bb1[:,1],linewidths=1)
    # plt.scatter(bb2[:,0],bb2[:,1],linewidths=1,c='r')
    # plt.show()
    boundingbox = np.concatenate((bb1,bb2),axis = 1)
    
    dx1 = roi[0][x,y]
    dx2 = roi[1][x,y]
    dx3 = roi[2][x,y]
    dx4 = roi[3][x,y]
    score = np.array([cls_prob[x,y]]).T
    offset = np.array([dx1,dx2,dx3,dx4]).T

    boundingbox = boundingbox + offset*12.0*scale
    
    rectangles = np.concatenate((boundingbox,score),axis=1)
    rectangles = rect2square(rectangles)
    pick = []
    for i in range(len(rectangles)):
        x1 = int(max(0     ,rectangles[i][0]))
        y1 = int(max(0     ,rectangles[i][1]))
        x2 = int(min(width ,rectangles[i][2]))
        y2 = int(min(height,rectangles[i][3]))
        sc = rectangles[i][4]
        if x2>x1 and y2>y1:
            pick.append([x1,y1,x2,y2,sc])
    return NMS(pick,0.3)
#-----------------------------#
#   Turn rectangles into squares
#-----------------------------#
def rect2square(rectangles):
    w = rectangles[:,2] - rectangles[:,0]
    h = rectangles[:,3] - rectangles[:,1]
    l = np.maximum(w,h).T
    rectangles[:,0] = rectangles[:,0] + w*0.5 - l*0.5
    rectangles[:,1] = rectangles[:,1] + h*0.5 - l*0.5 
    rectangles[:,2:4] = rectangles[:,0:2] + np.repeat([l], 2, axis = 0).T 
    return rectangles
#-------------------------------------#
#   Non-maximum suppression
#-------------------------------------#
def NMS(rectangles,threshold):
    if len(rectangles)==0:
        return rectangles
    boxes = np.array(rectangles)
    x1 = boxes[:,0]
    y1 = boxes[:,1]
    x2 = boxes[:,2]
    y2 = boxes[:,3]
    s  = boxes[:,4]
    area = np.multiply(x2-x1+1, y2-y1+1)
    I = np.array(s.argsort())
    pick = []
    while len(I)>0:
        xx1 = np.maximum(x1[I[-1]], x1[I[0:-1]]) # I[-1] has the highest score; I[0:-1] are the rest
        yy1 = np.maximum(y1[I[-1]], y1[I[0:-1]])
        xx2 = np.minimum(x2[I[-1]], x2[I[0:-1]])
        yy2 = np.minimum(y2[I[-1]], y2[I[0:-1]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        o = inter / (area[I[-1]] + area[I[0:-1]] - inter)
        pick.append(I[-1])
        I = I[np.where(o<=threshold)[0]]
    result_rectangle = boxes[pick].tolist()
    return result_rectangle


#-------------------------------------#
#   Post-process the R-Net outputs
#-------------------------------------#
def filter_face_24net(cls_prob,roi,rectangles,width,height,threshold):
    
    prob = cls_prob[:,1]
    pick = np.where(prob>=threshold)
    rectangles = np.array(rectangles)

    x1  = rectangles[pick,0]
    y1  = rectangles[pick,1]
    x2  = rectangles[pick,2]
    y2  = rectangles[pick,3]
    
    sc  = np.array([prob[pick]]).T

    dx1 = roi[pick,0]
    dx2 = roi[pick,1]
    dx3 = roi[pick,2]
    dx4 = roi[pick,3]

    w   = x2-x1
    h   = y2-y1

    x1  = np.array([(x1+dx1*w)[0]]).T
    y1  = np.array([(y1+dx2*h)[0]]).T
    x2  = np.array([(x2+dx3*w)[0]]).T
    y2  = np.array([(y2+dx4*h)[0]]).T

    rectangles = np.concatenate((x1,y1,x2,y2,sc),axis=1)
    rectangles = rect2square(rectangles)
    pick = []
    for i in range(len(rectangles)):
        x1 = int(max(0     ,rectangles[i][0]))
        y1 = int(max(0     ,rectangles[i][1]))
        x2 = int(min(width ,rectangles[i][2]))
        y2 = int(min(height,rectangles[i][3]))
        sc = rectangles[i][4]
        if x2>x1 and y2>y1:
            pick.append([x1,y1,x2,y2,sc])
    return NMS(pick,0.3)
#-------------------------------------#
#   Post-process the O-Net outputs
#-------------------------------------#
def filter_face_48net(cls_prob,roi,pts,rectangles,width,height,threshold):
    
    prob = cls_prob[:,1]
    pick = np.where(prob>=threshold)
    rectangles = np.array(rectangles)

    x1  = rectangles[pick,0]
    y1  = rectangles[pick,1]
    x2  = rectangles[pick,2]
    y2  = rectangles[pick,3]

    sc  = np.array([prob[pick]]).T

    dx1 = roi[pick,0]
    dx2 = roi[pick,1]
    dx3 = roi[pick,2]
    dx4 = roi[pick,3]

    w   = x2-x1
    h   = y2-y1

    pts0= np.array([(w*pts[pick,0]+x1)[0]]).T
    pts1= np.array([(h*pts[pick,5]+y1)[0]]).T
    pts2= np.array([(w*pts[pick,1]+x1)[0]]).T
    pts3= np.array([(h*pts[pick,6]+y1)[0]]).T
    pts4= np.array([(w*pts[pick,2]+x1)[0]]).T
    pts5= np.array([(h*pts[pick,7]+y1)[0]]).T
    pts6= np.array([(w*pts[pick,3]+x1)[0]]).T
    pts7= np.array([(h*pts[pick,8]+y1)[0]]).T
    pts8= np.array([(w*pts[pick,4]+x1)[0]]).T
    pts9= np.array([(h*pts[pick,9]+y1)[0]]).T

    x1  = np.array([(x1+dx1*w)[0]]).T
    y1  = np.array([(y1+dx2*h)[0]]).T
    x2  = np.array([(x2+dx3*w)[0]]).T
    y2  = np.array([(y2+dx4*h)[0]]).T

    rectangles=np.concatenate((x1,y1,x2,y2,sc,pts0,pts1,pts2,pts3,pts4,pts5,pts6,pts7,pts8,pts9),axis=1)

    pick = []
    for i in range(len(rectangles)):
        x1 = int(max(0     ,rectangles[i][0]))
        y1 = int(max(0     ,rectangles[i][1]))
        x2 = int(min(width ,rectangles[i][2]))
        y2 = int(min(height,rectangles[i][3]))
        if x2>x1 and y2>y1:
            pick.append([x1,y1,x2,y2,rectangles[i][4],
                 rectangles[i][5],rectangles[i][6],rectangles[i][7],rectangles[i][8],rectangles[i][9],rectangles[i][10],rectangles[i][11],rectangles[i][12],rectangles[i][13],rectangles[i][14]])
    return NMS(pick,0.3)

Origin: blog.csdn.net/m0_63260018/article/details/132305127