Comprehensive summary of face recognition technology

Researchers from the University of Hertfordshire and GBG Plc recently published a review paper that comprehensively organizes and summarizes face recognition methods, covering both traditional methods and the deep learning methods now in the limelight. This article introduces the deep-learning-related techniques.

For the remaining content, please refer to the original paper and the accompanying Machine Heart article; links are given in the references below.

Since the 1970s, face recognition has been one of the hottest research directions in computer vision and biometrics. Deep neural networks trained on large-scale datasets have largely replaced traditional methods based on hand-crafted features and classical machine learning techniques. The paper provides a comprehensive and up-to-date survey of mainstream face recognition methods, including traditional methods (geometry-based, holistic, feature-based, and hybrid methods) as well as deep learning methods.

introduction

Facial recognition refers to technology that can identify or verify the identity of subjects in images and videos. The first face recognition algorithms were developed in the early 1970s. Facial recognition is now often preferred over biometric methods traditionally considered more robust, such as fingerprint or iris recognition. One important reason is that facial recognition is non-intrusive: fingerprint recognition requires the user to press a finger on a sensor, iris recognition requires the user to stand very close to a camera, and voice recognition requires the user to speak aloud. In contrast, modern facial recognition systems only require the user to be within the camera's field of view (assuming a reasonable distance from the camera). This makes facial recognition the most user-friendly biometric method. It also gives facial recognition a wider range of potential applications, since it allows facial information to be collected passively, for example in surveillance systems. Other common applications include access control, fraud detection, identity authentication, and social media.

When deployed in unconstrained environments, face recognition is also one of the most technically challenging biometric methods, because real-world face images (often referred to as faces in-the-wild) are highly variable. Sources of variation include head pose, age, lighting conditions, facial expressions, and occlusions. Figure 1 gives examples of these conditions.


Figure 1: Typical variations found in natural face images. (a) head pose, (b) age, (c) illumination, (d) facial expression, (e) occlusion.

Facial recognition technology has undergone major changes over the years. Traditional approaches rely on a combination of hand-crafted features (such as edge and texture descriptors) and machine learning techniques (such as principal component analysis, linear discriminant analysis, or support vector machines). The difficulty of hand-engineering features that are robust to the different variations found in unconstrained environments led past researchers to focus on dedicated methods for each type of variation, such as methods that can handle different ages, methods that can handle different poses, methods that can handle different lighting conditions, and so on. Recently, traditional face recognition methods have been replaced by deep learning methods based on convolutional neural networks (CNNs). The main advantage of deep learning methods is that they can be trained on large datasets to learn the best features to represent that data. The many natural face images available on the web make faces in-the-wild the easiest face datasets to collect, and these images contain the kinds of variation found in the real world. CNN-based face recognition methods trained on such datasets have achieved very high accuracy. In addition, the growing popularity of deep learning in computer vision is accelerating face recognition research, since CNNs are also being used to solve many other computer vision tasks, such as object detection and recognition, segmentation, optical character recognition, facial expression analysis, and age estimation.

A face recognition system usually consists of the following building blocks (a minimal end-to-end sketch in code follows the list):

  • Face detection. A face detector finds the position of any faces in an image and, if faces are present, returns the coordinates of a bounding box for each one, as shown in Figure 3(a).

  • Face alignment. The goal of face alignment is to scale and crop face images so that a set of reference points lies at fixed locations in the image. This usually requires a landmark detector to find a set of facial feature points, followed by 2D alignment via an affine transformation. Figures 3(b) and 3(c) show two face images aligned with the same set of reference points. More sophisticated 3D alignment algorithms can additionally achieve frontalization, that is, adjust the pose of the face so that it faces forward.

  • Face representation. In the face representation stage, the pixel values of a face image are converted into a compact, discriminative feature vector, also known as a template. Ideally, all faces of the same subject should map to similar feature vectors.

  • Face matching. In the face matching building block, two templates are compared to obtain a similarity score that indicates the likelihood that they belong to the same subject.
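Putting the four building blocks together, the flow of a face verification system can be sketched in a few lines of Python. This is a minimal illustration only: detect_faces, align_face, and embed are hypothetical stand-ins for a real detector, landmark-based aligner, and CNN feature extractor, and the decision threshold is arbitrary.

```python
import numpy as np

def verify(image_a, image_b, detect_faces, align_face, embed, threshold=0.5):
    """Decide whether two images show the same subject."""
    templates = []
    for image in (image_a, image_b):
        boxes = detect_faces(image)            # face detection: bounding boxes
        face = align_face(image, boxes[0])     # face alignment: crop and warp to reference points
        templates.append(embed(face))          # face representation: compact feature vector (template)
    a, b = templates
    # Face matching: cosine similarity between the two templates.
    score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return score > threshold, score
```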

Figure 2: Building blocks of face recognition.

Face representation is widely considered the most important part of a face recognition system; it is also the focus of the second section of the paper.

Figure 3: (a) Bounding boxes found by the face detector. (b) and (c): Aligned faces and reference points.

deep learning methods

Convolutional neural networks (CNNs) are the most commonly used type of deep learning model for face recognition. The main advantage of deep learning is that it can be trained with large amounts of data to learn features that represent those data well. The main difficulty is that large amounts of data must be collected, and the data must contain enough variation for the model to generalize to unseen samples. Fortunately, several large-scale face datasets containing in-the-wild face images are now publicly available for researchers to use. In addition to learning discriminative features, neural networks can reduce dimensionality, and they can be trained either as classifiers or with metric learning methods. CNNs are considered end-to-end trainable systems that do not need to be combined with any other specific method.

CNNs can be trained in different ways. One is to treat face recognition as a classification problem, where each subject in the training set corresponds to one class. After training, the model can be used to recognize subjects absent from the training set by removing the classification layer and using the features of the previous layer as the face representation. In the deep learning literature, these features are commonly referred to as bottleneck features. After this first training stage, the model can be further trained with other techniques to optimize the bottleneck features for the target application (for example, using joint Bayesian, or fine-tuning the CNN with a different loss function). Another common way to learn face representations is to learn bottleneck features directly by optimizing a distance metric between pairs of faces or triplets of faces.
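As a concrete sketch of this first strategy, the snippet below (PyTorch; the backbone, the 512-dimensional feature size, and the number of identities are assumptions for illustration) trains a CNN as a classifier over the training-set identities and then discards the classification layer, keeping the penultimate activations as bottleneck features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceClassifier(nn.Module):
    """CNN trained as an identity classifier; the backbone output serves as the face template."""
    def __init__(self, backbone, feat_dim=512, num_identities=10_000):
        super().__init__()
        self.backbone = backbone                   # any CNN mapping an image to a feat_dim vector
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, x):
        features = self.backbone(x)                # bottleneck features
        logits = self.classifier(features)         # identity logits, used only during training
        return logits, features

# Training uses the softmax (cross-entropy) loss over identities:
#   logits, _ = model(images); loss = F.cross_entropy(logits, identity_labels)

@torch.no_grad()
def extract_template(model, face_batch):
    """At deployment, drop the classifier and use normalized bottleneck features as templates."""
    _, features = model(face_batch)
    return F.normalize(features, dim=1)
```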

Using neural networks for face recognition is nothing new. In 1997, researchers proposed a "probabilistic decision-based neural network" (PDBNN) for face detection, eye localization, and face recognition, designed to reduce the number of hidden units and thereby avoid overfitting. The researchers trained two PDBNNs separately, one on intensity features and one on edge features, and combined their outputs to obtain the final classification. Another early method used a combination of self-organizing maps (SOMs) and convolutional neural networks. A self-organizing map is a type of neural network trained in an unsupervised way that maps input data onto a lower-dimensional space while preserving the topological properties of the input space (that is, inputs that are similar in the original space remain similar in the output space). Note that neither of these early methods was trained end-to-end, and both proposed architectures were shallow. A later paper proposed an end-to-end CNN for face recognition, using a siamese architecture trained with a contrastive loss function. The contrastive loss implements a metric learning procedure whose goal is to minimize the distance between pairs of feature vectors corresponding to the same subject while maximizing the distance between pairs corresponding to different subjects. The CNN architecture used in that approach was also shallow, and the training dataset was small.
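The contrastive loss described above is simple to express in code. The following is a minimal PyTorch sketch of the standard formulation (the margin value is an arbitrary illustrative choice): genuine pairs are pulled together, impostor pairs are pushed apart up to the margin.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, same_subject, margin=1.0):
    """f1, f2: batches of feature vectors; same_subject: 1.0 for genuine pairs, 0.0 for impostor pairs."""
    dist = F.pairwise_distance(f1, f2)
    # Genuine pairs: minimize distance. Impostor pairs: push beyond the margin.
    loss = same_subject * dist.pow(2) + (1.0 - same_subject) * torch.clamp(margin - dist, min=0).pow(2)
    return loss.mean()
```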

The methods mentioned above failed to achieve breakthrough results, mainly because of the limited capacity of the networks used and the relatively small datasets available for training. It was not until such models were scaled up and trained with large amounts of data that the first deep learning methods for face recognition reached state-of-the-art performance. Especially worth mentioning is Facebook's DeepFace, one of the first CNN-based face recognition systems to achieve such results: it reached 97.35% accuracy on the LFW benchmark, reducing the error rate of the previous state of the art by 27%. The researchers trained a CNN with softmax loss on a dataset of 4.4 million faces from 4,030 subjects. This work made two novel contributions: (1) an effective face alignment system based on explicit 3D modeling of faces; and (2) a CNN architecture containing locally connected layers which, unlike regular convolutional layers, can learn different features from each region of the image.

For CNN-based face recognition methods, three main factors affect accuracy: the training data, the CNN architecture, and the loss function. In general, the accuracy of a CNN trained for a classification task increases with the number of samples per class, because with more intra-class variation the model can learn more robust features and thus generalize better to subjects that did not appear in the training set. Some papers have studied the effect of dataset composition on face recognition accuracy: is a wider dataset better, or a deeper one? (A dataset is considered wider if it contains more subjects, and deeper if it contains more images per subject.) One such study concluded that, given an equal number of images, wider datasets lead to better accuracy. The researchers attribute this to wider datasets containing more inter-class variation and therefore generalizing better to unseen subjects. Table 1 shows some of the most commonly used public datasets for training face recognition models.

Table 1: Publicly available large-scale face datasets.

CNN architectures for face recognition have taken much inspiration from architectures that performed well on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). For example, [11] uses a 16-layer version of the VGG network [112], and [10] uses a similar but smaller network. Two different types of CNN architectures are explored in [102]: VGG-style networks [112] and GoogLeNet-style networks [113]. Even though the two achieve comparable accuracy, the GoogLeNet-style network has 20 times fewer parameters. More recently, residual networks (ResNets) [114] have become the preferred choice for many object recognition tasks, including face recognition [115-121]. The main innovation of ResNets is the introduction of a building block that uses shortcut connections to learn a residual mapping, as shown in Figure 7. Shortcut connections allow researchers to train much deeper architectures because they facilitate the flow of information across layers. The best trade-off between accuracy, speed, and model size has been obtained with a 100-layer ResNet built from such residual modules.

Figure 7: The original residual module proposed in [114].
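A residual module of the kind shown in Figure 7 is straightforward to write down. The following is a generic PyTorch sketch of the original building block from [114] (two 3x3 convolutions plus an identity shortcut), not the exact architecture of any of the face recognition papers cited above:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual module: output = ReLU(F(x) + x), where F is two 3x3 conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: information flows around the conv layers,
        # so the block only needs to learn the residual mapping F(x).
        return self.relu(out + x)
```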

The choice of loss function for training CNN-based methods has recently been the most active area of research in face recognition. Even though CNNs trained with softmax loss have been very successful, some researchers argue that this loss function does not generalize well to subjects that are not present in the training set. This is because the softmax loss encourages features that increase inter-class differences (to separate the classes in the training set) but does not necessarily reduce intra-class differences. Researchers have proposed several ways to mitigate this problem. A simple way to optimize the bottleneck features is to use a discriminative subspace method, such as joint Bayesian. Another approach is metric learning: for example, [100, 101] used a pairwise contrastive loss as the only supervisory signal, and [124-126] combined it with a classification loss. The most commonly used metric learning method for face recognition is the triplet loss. The goal of the triplet loss is to separate the distances between positive pairs from the distances between negative pairs by a margin. Mathematically, the following condition must be satisfied for each triplet:

||f(x_a) - f(x_p)||^2 + α < ||f(x_a) - f(x_n)||^2

where x_a is an anchor image, x_p is another image of the same subject, x_n is an image of a different subject, f is the mapping learned by the model, and α is the margin imposed between the distances of positive and negative pairs.

In practice, CNNs trained with triplet loss converge more slowly than those trained with softmax loss, because a large number of triplets (or pairs, for the contrastive loss) is needed to cover the whole training set. Although this problem can be alleviated by selecting hard triplets (i.e., triplets that violate the margin condition) during training [102], it is common practice to train with softmax loss in a first stage and then fine-tune the bottleneck features with triplet loss in a second stage [11, 129, 130]. Several variants of the triplet loss have been proposed. For example, [129] used the dot product as the similarity measure instead of the Euclidean distance; [130] proposed a probabilistic triplet loss; and [131, 132] proposed a modified triplet loss that also minimizes the standard deviations of the distributions of positive and negative scores.

Another loss function used to learn discriminative features is the center loss proposed in [133]. The goal of the center loss is to minimize the distances between bottleneck features and the centers of their corresponding classes. When trained jointly with the softmax loss, the features learned by the CNN effectively increase inter-class differences (softmax loss) while reducing intra-class differences (center loss). Compared with the contrastive and triplet losses, the center loss has the advantage of being more efficient and easier to implement, since it does not require pairs or triplets to be constructed during training. A related metric learning approach is the range loss proposed in [134], which was designed to improve training with imbalanced datasets. The range loss has two components: the intra-class component minimizes the k largest distances between samples of the same class, while the inter-class component maximizes the distance between the two nearest class centers in each training batch. By using these extreme cases, the range loss exploits the same amount of information from each class, regardless of how many samples per class are available. Like the center loss, the range loss needs to be combined with the softmax loss to prevent it from degenerating to zero [133].
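In code, the triplet condition above translates into a hinge on squared distances between embeddings. Below is a minimal PyTorch sketch; the margin value α = 0.2 is a typical illustrative choice, not a prescription from any of the cited papers.

```python
import torch

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """anchor/positive: embeddings of the same subject; negative: a different subject."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||^2
    neg_dist = (anchor - negative).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||^2
    # Hinge: penalize triplets that violate pos_dist + alpha < neg_dist.
    return torch.clamp(pos_dist - neg_dist + alpha, min=0).mean()
```

PyTorch also provides a built-in torch.nn.TripletMarginLoss; the explicit version above simply makes the margin condition visible.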

When combining different loss functions, a difficulty arises in finding the correct balance between the terms. Recently, several approaches have been proposed to modify the softmax loss so that it can learn discriminative features without being combined with other losses. One approach shown to increase the discriminative power of the bottleneck features is feature normalization [115, 118]. For example, [115] proposes normalizing the features to have unit L2 norm, and [118] proposes normalizing them to have zero mean and unit variance. Another successful approach introduces a margin into the decision boundary between classes in the softmax loss [135]. For simplicity, consider binary classification with the softmax loss. In this case, the decision boundary between the two classes (if the biases are zero) is given by:

||x|| (||W_1|| cos θ_1 - ||W_2|| cos θ_2) = 0

where x is the feature vector, W_1 and W_2 are the weight vectors of the two classes, and θ_1 and θ_2 are the angles between x and W_1 and W_2, respectively. The decision boundaries can be made more stringent by introducing a multiplicative margin m into the above equation, giving one boundary per class:

||x|| (||W_1|| cos(mθ_1) - ||W_2|| cos θ_2) = 0 for class 1, and
||x|| (||W_1|| cos θ_1 - ||W_2|| cos(mθ_2)) = 0 for class 2.

As shown in Figure 8, this margin can effectively increase the separation between classes and the compactness within each class. Depending on how the margin is incorporated into the loss, several methods have been proposed [116, 119-121]. For example, in [116] the weight vectors are normalized to unit norm so that the decision boundary depends only on the angles θ_1 and θ_2. An additive cosine margin is proposed in [119, 120]; compared with the multiplicative margin [135, 116], the additive margin is easier to implement and optimize. In that work, in addition to the weight vectors, the feature vectors are also normalized and scaled as in [115]. Another additive margin is proposed in [121], which retains the advantages of [119, 120] while having a better geometric interpretation, since the margin is added to the angle rather than to the cosine. Table 2 summarizes the decision boundaries of the different variants of softmax loss with margin. These approaches represent the current state of the art in face recognition.

Figure 8: Effect of introducing a margin m in the decision boundary between two classes. (a) softmax loss, (b) softmax loss with margin.

Table 2: Decision boundaries for different variants of softmax loss with margin. Note that these decision boundaries are for class 1 in the binary classification case.
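To make these margin-based variants concrete, here is a sketch of a softmax loss with an additive angular margin in the style of [121], where the margin is added to the angle and both features and class weights are L2-normalized. The scale s = 64 and margin m = 0.5 are typical illustrative values, not prescriptions from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginSoftmax(nn.Module):
    """Softmax loss with an additive angular margin (sketch in the style of [121])."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # With features and class weights L2-normalized, each logit equals cos(theta_i).
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        # Add the margin m to the angle of the ground-truth class only, then rescale by s.
        logits = self.s * torch.cos(theta + one_hot * self.m)
        return F.cross_entropy(logits, labels)
```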

references

1. Comprehensive summary of face recognition technology: from traditional methods to deep learning

2. Face Recognition: From Traditional to Deep Learning Methods. Paper: https://arxiv.org/abs/1811.00116
