Face Recognition in Natural Scenes (1) - Face Detection

There are many application scenarios for face detection technology. The most common one is detecting faces and applying beautification filters when taking photos.

The topic of my undergraduate graduation project is related to face recognition. The first step of face recognition is face detection; face comparison and recognition follow once a face has been detected. Here is a summary of what I have learned during this time, and I will continue with face recognition later.

Overview

The problem of face recognition was raised a long time ago.

A well-known algorithm in the field of face detection was proposed by Viola and Jones in 2004. It uses Haar-like features and the AdaBoost algorithm to train classifiers, and obtains a strong face detection model through a cascade of classifiers. This model detects frontal faces very well. However, for photos taken in natural scenes, the algorithm has a very high missed detection rate due to lighting, facial pose angle, and occlusion. Viola and Jones also proposed methods to address the missed detections caused by different facial pose angles, but the complexity of those models is relatively high.

Another way to perform face detection is the DPM model. DPM improves on HOG features by using multiple HOG feature maps at different resolutions. Using prior knowledge of the spatial relationship between the root model and the part models, it is robust to changes in facial pose angle and to occlusion.

However, for the above two methods and their improved versions, the training process is relatively complicated, and the missed detection rate is still high in some scenarios. When I discussed this with my graduation project advisor, he suggested trying deep learning for face detection. So my graduation project took the path of using neural networks, and this summary is organized around the papers I read.

Face Detection Based on Neural Network

The earliest representative paper on deep learning face detection is "A Convolutional Neural Network Cascade for Face Detection" (CVPR 2015), which retains the cascade approach of traditional face detection methods. Another is "Multi-view Face Detection Using Deep Convolutional Neural Networks", whose highlight is using a fully convolutional network to obtain a heatmap of the image.

" Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks " published by kpzhang in 2016 comprehensively uses the above two methods. Since kpzhang open sourced his test code and model on github, the face detection method I am currently using is also this method.

There is in fact a lot of work applying object detection methods such as RCNN, Fast RCNN, and YOLO to face detection, and the method mentioned above also borrows many object detection practices. Since my graduation project is about face recognition, the papers I read all focus on the narrower problem of face detection.

Paper reading and related notes

Here are my notes from the 3 papers I read while learning about face detection.

1. A Convolutional Neural Network Cascade for Face Detection

This paper is a representative work on applying convolutional neural networks to face detection. The proposed method retains the Detector Cascade structure of traditional approaches and cascades 6 CNNs: 3 perform binary classification (face or not), and the other 3 perform calibration of the face bounding box.

  • Binary classification networks for face detection: 12-net, 24-net, 48-net
  • Networks used for candidate box calibration: 12-calibration-net, 24-calibration-net, 48-calibration-net


(The whole process of face detection)

The three candidate box calibration networks are implemented as multi-class classification networks rather than regression networks.

A predefined set of scales and offsets in 3 dimensions forms N calibration patterns, written as \(\{[s_n,x_n,y_n]\}_{n=1}^{N}\). Given a detection window \((x,y,w,h)\) (where \((x,y)\) is the upper-left corner and \((w,h)\) the size), calibration pattern n adjusts the detection window as follows:
\[(x-\frac{x_n w}{s_n},\; y-\frac{y_n h}{s_n},\; \frac{w}{s_n},\; \frac{h}{s_n})\]
In actual use, the values of \(s_n, x_n, y_n\) are:
\[ \begin{array}{lll} s_n & \in & \{0.83, 0.91, 1.0, 1.10, 1.21\} \\ x_n & \in & \{-0.17, 0, 0.17\} \\ y_n & \in & \{-0.17, 0, 0.17\} \\ \end{array} \]

So \(N=45\). Because the 45 calibration patterns are not orthogonal to each other, the paper does not simply apply the highest-scoring pattern to calibrate the candidate box, but uses a weighted average (where \(t\) is a threshold):
\[ \begin{array}{ccc} [s,x,y] = \frac{1}{Z}\sum_{n=1}^{N}[s_n,x_n,y_n]I(c_n > t) \\ \\ Z=\sum_{n=1}^{N}I(c_n>t)\\ \\ I(c_n > t) = \left\{ \begin{array}{ll} 1 & \textrm{if $c_n > t$} \\ 0 & \textrm{otherwise.} \end{array} \right. \end{array} \]
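As a concrete illustration, the weighted-average calibration above can be sketched as follows (the threshold value and function names here are my own, not from the paper):

```python
import numpy as np

# The 45 calibration patterns from the predefined scale/offset sets.
S = [0.83, 0.91, 1.0, 1.10, 1.21]
OFFSETS = [-0.17, 0.0, 0.17]
PATTERNS = [(s, xo, yo) for s in S for xo in OFFSETS for yo in OFFSETS]  # N = 45

def calibrate(window, scores, t=0.5):
    """Average all patterns whose confidence exceeds t, then adjust the box.

    window: (x, y, w, h); scores: the 45 calibration-net confidences.
    """
    x, y, w, h = window
    selected = [p for p, c in zip(PATTERNS, scores) if c > t]
    if not selected:
        return window  # no pattern above threshold: leave the box unchanged
    s_n, x_n, y_n = np.mean(selected, axis=0)
    return (x - x_n * w / s_n, y - y_n * h / s_n, w / s_n, h / s_n)
```

With only the identity pattern \([1.0, 0, 0]\) above threshold, the window is returned unchanged, which is a quick sanity check on the formula.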

(three calibration networks)

12-net, 24-net, 48-net

12-net is a very shallow binary classification convolutional network. A 12×12 sliding window with stride 4 is passed over the image, and the network outputs a score indicating whether each window contains a face. To detect faces at different scales, the input image is scaled to form an image pyramid, and each level of the pyramid is processed. In practical applications, if the minimum acceptable face size is F, the first scaling ratio of the image pyramid is \(\frac{12}{F}\).
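The pyramid construction can be sketched like this; the step factor between successive pyramid levels is my own assumption for illustration, not a value fixed by the paper:

```python
def pyramid_scales(img_w, img_h, min_face=40, net_input=12, factor=0.709):
    """Scales at which to run the 12-net so a face of size min_face maps
    onto the 12x12 input. The 0.709 step factor is an assumed choice."""
    scale = net_input / min_face          # first scale: 12 / F
    scales = []
    min_side = min(img_w, img_h)
    # keep shrinking until the image is smaller than the network input
    while min_side * scale >= net_input:
        scales.append(scale)
        scale *= factor
    return scales
```

Each returned scale corresponds to one pyramid level that the 12-net scans in full.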

24-net takes the windows corrected by 12-calibration-net as input and serves the same purpose as 12-net. The paper adopts a multi-resolution architecture: the fully connected layer of 12-net is concatenated into the fully connected layer of 24-net, which the paper says helps find smaller faces.

(24-net structure)


(Using a multi-resolution structure brings performance improvements)

The structure of 48-net is similar to that of 24-net, but it is deeper.

(48-net)


2. Multi-View Face Detection Using Deep Convolutional Neural Networks

Compared with the previous paper, the highlight of this one is that the author uses a fully convolutional network instead of a sliding window. The image is passed through the fully convolutional network to generate a face detection heatmap, which represents the detector's responses at different positions on the image. Face candidates are then obtained according to the response scores.

The paper uses AlexNet, fine-tuned on the LFW dataset. The parameters of AlexNet's fully connected layers are converted into convolutional layers. Because AlexNet already has strong classification ability, the paper does not use a cascaded architecture to detect faces. The final output of the network is the heatmap of the image.
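The FC-to-convolution conversion rests on a simple weight-reshaping identity, sketched here with toy shapes (not AlexNet's real dimensions): a fully connected layer over a flattened C×H×W feature map equals a convolution whose kernel covers the whole H×W extent, and on larger inputs that kernel then slides, producing a heatmap instead of a single score.

```python
import numpy as np

# Toy shapes for illustration only.
C, H, W, OUT = 3, 4, 4, 2
fc_weights = np.random.randn(OUT, C * H * W)

# Reshape the FC weight matrix into OUT conv kernels of shape (C, H, W).
conv_kernels = fc_weights.reshape(OUT, C, H, W)

feat = np.random.randn(C, H, W)
fc_out = fc_weights @ feat.reshape(-1)                       # dense layer
conv_out = np.array([(k * feat).sum() for k in conv_kernels])  # conv at one position
assert np.allclose(fc_out, conv_out)
```

The assertion confirms that, at a single spatial position, the convolution reproduces the dense layer exactly; sliding the same kernels over a bigger input yields the response map.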

(Figure 7)

In addition to introducing the use of fully convolutional networks to detect faces, the paper devotes considerable space to the impact of the distribution of face pose angles in the training samples on detector performance. As Figure 7 shows, frontal faces receive a very high score (99.9%), but as faces rotate in plane, their scores drop below those of frontal faces. The authors compute statistics over the training set and obtain the distribution of face pose angles in it.

Because cross-entropy is used as the loss function, unevenly distributed samples bias the network towards frontal faces, where the loss decreases fastest. The authors did some data balancing based on this phenomenon. For example, the number of negative samples is 100 times the number of positives, so with random sampling each batch of 128 would contain only about 2 positive samples on average. They therefore force positives to make up \(\frac{1}{4}\) of each batch (32 positive + 96 negative).
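The forced 1:3 positive-to-negative batch composition can be sketched like this (function and variable names are illustrative, not from the paper):

```python
import random

def balanced_batch(positives, negatives, batch_size=128, pos_frac=0.25):
    """Build one training batch with a fixed positive fraction,
    regardless of the ~1:100 class imbalance in the sample pool."""
    n_pos = int(batch_size * pos_frac)                  # 32 positives
    batch = random.choices(positives, k=n_pos) \
          + random.choices(negatives, k=batch_size - n_pos)  # 96 negatives
    random.shuffle(batch)
    return batch
```

Sampling with replacement is used here so the sketch works even when the positive pool is smaller than 32.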
The authors also propose some data augmentation techniques worth trying:

  • Make each face rotation angle appear with equal probability in the training samples.
  • Add occlusions to the training samples, but do not simply set some pixels to 0, because the network will learn to recognize such artificial patterns.

3. Joint Face Detection and Alignment using Multi-task Cascaded Convolution Networks

Multi-task learning suddenly became popular in 2016. This paper uses a multi-task convolutional neural network and combines some of the practices of the two papers above. The main ideas are as follows:

  1. Adopt a cascaded multi-task structure with 3 cascade levels, each level trained separately
  2. Jointly optimize face classification, bounding box regression, and facial landmark localization as multiple tasks in one network; the correlation among the three promotes accuracy
  3. Use a fully convolutional network instead of a sliding window to generate the image heatmap, filtering out most background windows in the early stage
  4. Improve accuracy with online hard example mining
  5. Use an image pyramid to detect faces at different scales

The following figure shows the structure of the neural networks used in the paper.

The three network structures used in this paper also go from shallow to deep. The first stage uses a shallow convolutional network to filter out a large number of background candidate boxes; since the goal at this stage is only to discard irrelevant background quickly, the network does not need to be particularly strong. With the multi-task setup, candidate boxes can be extracted quickly in the early stage. The author calls the first-stage network P-Net.

The following two convolutional neural networks, R-Net and O-Net, have similar structures, with gradually increasing capacity. The sizes of the input candidate images also go from small to large: 24px and 48px respectively.

The network diagram above is drawn from the open-source model released by kpzhang93. Sharp-eyed readers will notice that P-Net and R-Net each have one less output than in the paper. This is because the author trained on the WIDER FACE dataset, which has no landmark (facial key point) annotations, so P-Net and R-Net only perform classification and bounding box regression, leaving the landmark task to O-Net.

Definition of multi-task loss function

The loss function is defined as a weighted sum of the errors of the individual tasks:

\(L_i^{det} = -(y_i^{det}\log(p_i) + (1-y_i^{det})\log(1-p_i))\)

\(L_i^{box} = \left \| \hat{y}_i^{box} -y_i^{box} \right \|^2_2\)

\(L_i^{landmark} = \left \| \hat{y}_i^{landmark} - y_i^{landmark} \right \|^2_2\)

Combining these gives:
\[\min\sum_{i=1}^{N}\sum_{j\in\left\{det, box, landmark\right\}} \alpha_j\beta_i^jL_i^j\]
The task weights \(\alpha_j\) differ when training the different networks.
Since the task is face detection, if a sample is not a face, the losses for the bounding box and landmark positions are not computed (\(\beta_i^j = 0\)).
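A minimal sketch of the per-sample multi-task loss, with illustrative \(\alpha\) values (the paper's actual per-network training weights differ):

```python
import numpy as np

def sample_loss(p, y_det, box_err, lmk_err, alpha, is_face):
    """p: predicted face probability; y_det: 0/1 label;
    box_err / lmk_err: prediction minus ground truth vectors;
    alpha: per-task weights; is_face: whether beta enables the regression terms."""
    eps = 1e-12  # numerical guard for log
    l_det = -(y_det * np.log(p + eps) + (1 - y_det) * np.log(1 - p + eps))
    # beta = 0: skip box and landmark losses on non-face samples
    beta = 1.0 if is_face else 0.0
    l_box = np.sum(np.square(box_err))        # squared L2 norm
    l_lmk = np.sum(np.square(lmk_err))
    return (alpha['det'] * l_det
            + alpha['box'] * beta * l_box
            + alpha['landmark'] * beta * l_lmk)
```

For a non-face sample only the detection cross-entropy contributes, exactly as the \(\beta_i^j\) indicator dictates.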

Online Hard Sample Mining

OHSM is used during training to improve model accuracy: in each mini-batch, the 70% of samples with the largest loss are treated as hard samples, and only their losses are used to compute the gradient, because easy samples contribute little to improving the model.
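The hard-sample selection can be sketched as follows (a simplified NumPy version, not the authors' implementation):

```python
import numpy as np

def hard_sample_losses(losses, keep_ratio=0.7):
    """Keep the top keep_ratio fraction of per-sample losses in a
    mini-batch; only these contribute to the backward pass."""
    losses = np.asarray(losses)
    k = int(len(losses) * keep_ratio)
    idx = np.argsort(losses)[::-1][:k]   # indices of the largest losses
    return idx, losses[idx].sum()
```

The returned indices would be used to mask the gradient so easy samples are ignored for that step.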

(improvement of performance using hard sample mining)

-------------------

Final Results


The image on the left uses the HOG+SVM method; the image on the right uses mtcnn.


This picture uses the mtcnn method for detection

It can be seen that there is a false detection in the picture above, even though its face confidence is 93%. Such false detections can be suppressed by adjusting the threshold in engineering practice. More interestingly, even the golden statue is detected as a face.


Summary

This article is a summary of what I have learned in my undergraduate graduation project. If readers have any opinions or suggestions, please leave them in the comments section, and if there are any mistakes in the writing, please feel free to correct them. Thank you. After a while, I will continue to summarize what I learn about face recognition and face comparison.
