Overview of Deep Learning Face Detection

Deep learning is now widely used in computer vision, where it has greatly improved on the results of traditional methods. This article introduces the development of deep learning in the field of face detection.

One of the earliest representative works in deep learning face detection is the CVPR 2015 paper "A Convolutional Neural Network Cascade for Face Detection", hereinafter referred to as CascadeCNN. It retains the Cascade concept of traditional face detection, using three shallow classification networks with input sizes of 12, 24, and 48, each followed by a correction network that regresses the position of the face bounding box.

Comparing CascadeCNN with traditional face detection methods, the similarities are as follows: 1. Both use a Cascade structure, in which the earlier stages quickly filter out easy samples and the later stages produce more accurate classification results; 2. Both use an image pyramid, obtained by scaling the image, to handle faces of different sizes; 3. Both process the image with a sliding window and a step size; 4. Both merge the final windows with NMS (non-maximum suppression) according to their IOU (intersection over union).

The differences are as follows: 1. A CNN classifier replaces the traditional classifier in each stage; 2. After each classification stage, a correction network is applied to make the position of the face bounding box more accurate. At the time, the method was the fastest CNN-based face detector.

CascadeCNN test process
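
The IOU-based NMS step mentioned above can be sketched as follows. This is a minimal greedy version for illustration, not the code used in the paper; the `(x1, y1, x2, y2)` box format and the default `iou_threshold` of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping windows and one separate window (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second window overlaps the first with an IOU of about 0.68, so it is suppressed and only the first and third windows remain.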

Another representative 2015 paper on deep learning face detection is "Multi-view Face Detection Using Deep Convolutional Neural Networks". The network used in this paper is AlexNet. Because the network has relatively strong classification ability, the paper does not use the Cascade structure. It still uses the image pyramid, and each level of the pyramid is processed by AlexNet. A highlight of the paper is its use of a fully convolutional network: the parameters of the fully connected layers are rearranged into convolutional layers, so the network places no limit on the size of the input image. AlexNet turns each pyramid level into a heat map, and each point of the heat map corresponds to a region of the original image.

Original image and HeatMap image
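
The correspondence between a heat map point and a region of the original image can be sketched as below. The function name and its parameters are hypothetical, and the values used in the example (a stride of 32, a 227-pixel window, a pyramid scale of 0.5) are illustrative assumptions rather than figures taken from the paper.

```python
def heatmap_point_to_box(row, col, stride, window, scale):
    """Map a heat map cell back to a box (x1, y1, x2, y2) in the original image.

    stride: total downsampling of the network (input pixels per heat map cell)
    window: input size the classifier was trained on (its effective window)
    scale:  factor by which the original image was resized for this pyramid level
    """
    x1 = col * stride / scale
    y1 = row * stride / scale
    size = window / scale
    return (x1, y1, x1 + size, y1 + size)


# Cell (row 2, column 3) on a pyramid level resized to half the original size.
box = heatmap_point_to_box(row=2, col=3, stride=32, window=227, scale=0.5)
```

Dividing by the pyramid scale is what lets a fixed-size network window cover larger faces on the smaller pyramid levels.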

In 2016, papers on deep learning face detection embraced the multi-task concept, which combines face detection (classification) with face bounding box correction (regression), as well as face key point localization, pose, and other attributes. (In the author's experiments, none of the other tasks except face bounding box regression improved the accuracy of face detection.) Related papers are: "HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition", "Joint Training of Cascaded CNN for Face Detection" (hereinafter referred to as JTCCNN), and "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks" (hereinafter referred to as MTCNN). The network structure of HyperFace is also a variant of AlexNet; its distinguishing feature is a separate fc layer of size 512 for each task. For the specific network structure, see the attached figure.

HyperFace network structure
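
The multi-task idea of combining classification with bounding box regression can be illustrated with a weighted sum of two losses. This is a minimal sketch and not the loss of any of the papers above: the function names, the weights `w_cls` and `w_box`, and the choice of cross-entropy plus squared error are all assumptions made for illustration.

```python
import math


def cross_entropy(p_face, is_face):
    """Binary cross-entropy for the face / non-face score."""
    eps = 1e-12
    p = min(max(p_face, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_face else -math.log(1.0 - p)


def squared_error(pred, target):
    """Sum of squared differences between predicted and target box offsets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def multitask_loss(p_face, is_face, box_pred, box_target, w_cls=1.0, w_box=0.5):
    """Weighted sum of the classification loss and the box regression loss.

    The regression term is only counted for positive (face) samples,
    because negative samples have no target box.
    """
    loss = w_cls * cross_entropy(p_face, is_face)
    if is_face:
        loss += w_box * squared_error(box_pred, box_target)
    return loss


# A positive sample with a slightly wrong box, and a negative sample.
pos = multitask_loss(0.5, True, (0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
neg = multitask_loss(0.5, False, (0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```

Further tasks such as key point localization would add more weighted terms of the same kind.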

JTCCNN and MTCNN have many similarities in ideas and network structure. Both build on the ideas of the 2015 CascadeCNN paper. The biggest difference is that JTCCNN connects low-level features to high-level features: the 12net features are concatenated into 24net, and the concatenated 12net and 24net features are further concatenated into 48net. The cascade idea is expressed inside the network: 12net, 24net, and 48net each have their own threshold, and samples that do not pass a threshold are judged to be negative samples. MTCNN, by contrast, keeps the networks of the cascade stages independent of one another. It improves the network structure of CascadeCNN: 12net becomes a fully convolutional network, and multi-task learning is introduced with face bounding box regression and key points, which further improves the speed and accuracy of face detection.

JTCCNN network structure
MTCNN network structure
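
The per-stage thresholds described above amount to early rejection, which can be sketched as follows. The stage scores and thresholds are made-up numbers, and each `score_fn` is a hypothetical stand-in for a trained network such as 12net, 24net, or 48net.

```python
def run_cascade(window, stages):
    """Pass a candidate window through the cascade stages in order.

    stages: list of (score_fn, threshold) pairs.
    Returns (accepted, stages_evaluated). A window is rejected as a negative
    sample at the first stage whose score falls below that stage's threshold.
    """
    for n, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, n
    return True, len(stages)


# Toy stand-ins for the three networks: each reads a precomputed score.
stages = [
    (lambda w: w["s12"], 0.5),
    (lambda w: w["s24"], 0.6),
    (lambda w: w["s48"], 0.7),
]

easy_negative = {"s12": 0.1, "s24": 0.0, "s48": 0.0}
hard_negative = {"s12": 0.8, "s24": 0.7, "s48": 0.3}
face = {"s12": 0.9, "s24": 0.9, "s48": 0.95}

results = [run_cascade(w, stages) for w in (easy_negative, hard_negative, face)]
```

The easy negative is discarded after the cheapest stage alone, which is where the cascade gets its speed.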

This year, papers on face detection have paid more attention to face scale. Is it possible to detect faces of different scales without so many image pyramid levels? The answer given by "Finding Tiny Faces" is to combine fewer pyramid levels with more template scales. The paper also verifies experimentally that for small faces (less than 20 pixels in the image), adding context information around the face greatly improves the detection rate. The idea of "Scale-Aware Face Detection" is rather unusual. A Scale Proposal Network (SPN) first estimates the scale of every face in the image; the image is then resized accordingly (so that all faces in the resized image are at the same scale), after which a detector for a single scale is enough to find all faces. "Face Detection through Scale-Friendly Deep Convolutional Networks" extracts anchors of various scales from the image when generating proposals, aiming to cover faces of all scales.

Scale-Aware Face Detection Test Process
Scale-friendly training process
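
The resize step of the scale-aware approach can be illustrated with a small calculation. This is only a sketch under assumptions: the face sizes stand in for the output of a scale proposal step, and `detector_size`, `tolerance`, and the grouping rule are hypothetical choices, not details from the paper.

```python
def resize_factors(face_sizes, detector_size, tolerance=0.25):
    """Return the resize factors that bring every face near detector_size.

    Faces whose factors differ by at most `tolerance` (relative to an already
    chosen factor) share one resized image instead of getting their own.
    """
    factors = []
    for size in sorted(face_sizes):
        f = detector_size / size
        if not any(abs(f - g) / g <= tolerance for g in factors):
            factors.append(f)
    return factors


# Estimated face sizes in pixels (made-up values) and a single-scale detector
# assumed to work on faces of about 40 pixels.
factors = resize_factors([20, 22, 200], detector_size=40)
```

Here the two small faces share one enlarged image and the large face gets one reduced image, so only two resized images are needed rather than a full pyramid.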

At this point, many readers may wonder why object detection methods such as RCNN, fastRCNN, fasterRCNN, sppnet, YOLO, SSD, and DenseBox have not been mentioned. Many works do apply these methods to face detection, but because they are general-purpose they are not introduced here; the author will describe them in detail in a separate series of articles on object detection.

To sum up: the fully convolutional network replaces the sliding window with a step size, and relies on the convolutional layers to share computation across overlapping windows; the Cascade structure follows the same idea as proposals in object detection, since both aim to obtain face candidate positions quickly while excluding a large number of negative samples; image pyramids, feature-level pyramids, anchors of different scales, and models of different sizes all address the problem of face scale. For example, it is difficult for the same detector to detect both 20*20 and 200*200 faces; adding face bounding box regression can improve robustness to scale to a certain extent.

Finally, a word on the speed of face detection. Quoting a detection speed without stating the image size and the minimum detection scale is misleading. When the minimum detection scale is doubled, the speed may more than double. Among all the methods above, MTCNN offers the best balance of speed and detection rate (detection algorithms built on large networks such as VGG16 and ResNet34 are difficult to use in practice). The author will introduce the training and test process of MTCNN in detail later.
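
The effect of the minimum detection scale on speed can be estimated by counting the pixels in the image pyramid. The sketch below is a rough proxy and not a benchmark: the 12-pixel detector window matches the 12net input size, while the pyramid factor of 0.709 and the 640*480 image are assumptions chosen for illustration.

```python
def pyramid_scales(width, height, min_face, window=12, factor=0.709):
    """Resize factors of the pyramid levels.

    The first level maps a face of min_face pixels onto the detector window;
    each further level shrinks the image by `factor` until the shorter side
    is smaller than the window.
    """
    scales = []
    scale = window / min_face
    while min(width, height) * scale >= window:
        scales.append(scale)
        scale *= factor
    return scales


def pyramid_pixels(width, height, min_face, window=12, factor=0.709):
    """Total pixel count over all pyramid levels, a rough proxy for cost."""
    return sum((width * s) * (height * s)
               for s in pyramid_scales(width, height, min_face, window, factor))


# Compare a minimum face size of 20 pixels with one of 40 pixels.
ratio = pyramid_pixels(640, 480, 20) / pyramid_pixels(640, 480, 40)
```

Under these assumptions, doubling the minimum face size cuts the pixel count to roughly a quarter, because both image dimensions shrink. This is consistent with the speed more than doubling.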
