An Introduction to Face Recognition Algorithms and Systems

 
  


Editor's Recommendation

 

In the AI training camp jointly held by Inspur and NVIDIA, the speakers covered face recognition algorithms, their evaluation metrics, and the composition of face recognition systems.

Reprinted from | Inspur AIHPC

The goal of face recognition

To sum up, there are two goals. First, recognize the same person: no matter how your appearance changes, the system knows you are you. Second, distinguish different people: two people may look alike, or both may wear makeup, but however their appearance changes, face recognition can still tell that they are two different people.

Face recognition is a biometric technology whose main purpose is to provide a means of identity authentication. In terms of accuracy it is not the best biometric, because it is affected by many conditions, such as lighting. Its advantage is that it generally requires little cooperation from the user. Surveillance cameras everywhere, computer cameras, and phone cameras have become very widespread, and any of these visible-light devices can be used for face recognition. As a result, introducing face recognition often requires very little new investment, which is its advantage.

The process of face recognition

The core pipeline of face recognition is essentially the same in every face recognition system. The first step is face detection, the second is face alignment, and the third is feature extraction; these three steps must be performed on every photo. At comparison time, the extracted features are compared to determine whether two faces belong to the same person.


Face Detection

Face detection judges whether there is a face in a large scene, finds the position of the face, and crops it out. It is a form of object detection and is the basis of all face perception tasks. The basic method is to slide a window over an image pyramid, use a classifier to select candidate windows, and use a regression model to refine their positions.


The three windows drawn above correspond to scales of 0.3x, 0.6x, and 1.0x. When the position of the face is uncertain and its size is unknown, this technique rescales the image itself while the sliding window stays the same size. The input size of a deep network is generally fixed, so the sliding window is essentially fixed; to let this fixed window cover different ranges, the whole image is scaled to several proportions. The factors 0.3, 0.6, and 1.0 here are just examples; in practice many other scales can be used.
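To make the idea concrete, here is a minimal sketch of building an image pyramid with OpenCV; the scale factors and function name are illustrative, not from the talk:

```python
import cv2

def build_pyramid(image, scales=(0.3, 0.6, 1.0)):
    """Rescale the image once per scale factor; the sliding window itself
    stays a fixed size, so each pyramid level lets it cover faces of a
    different absolute size."""
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(w * s), int(h * s))) for s in scales]
```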

The classifier looks at the window at each position and decides whether it contains a face. Because a sliding window may not contain the whole face, or may be somewhat larger than the face, feeding the window into a regression model helps refine the accuracy of the detection.

The input is a sliding window; the output says, if there is a face, in which direction and by how much the window should be corrected. Δx, Δy, Δw, Δh are the corrections to its coordinates, width, and height. Combining these correction amounts with the classifier's decision that the window contains a face yields a more accurate face position.
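As a rough illustration of that correction step, the sketch below applies the regression outputs to a candidate window. The exact parameterization varies between detectors; the shift-by-fractions, exponential-rescale convention used here is an assumption, not something specified in the talk:

```python
import math

def correct_window(x, y, w, h, dx, dy, dw, dh):
    """Apply bounding-box regression offsets to a candidate window.
    (dx, dy) shift the corner by fractions of the window size;
    (dw, dh) rescale width and height (one common convention)."""
    return (x + dx * w,          # corrected x
            y + dy * h,          # corrected y
            w * math.exp(dw),    # corrected width
            h * math.exp(dh))    # corrected height
```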

The above is the process of face detection; the same process also applies to other kinds of object detection.

Evaluation Metrics for Face Detection

Whatever the model, evaluation comes down to two aspects: speed and accuracy.

1. Speed

(1) Speed means the detection speed at a specified resolution

The resolution is specified because the sliding window must make a classification and regression judgment at every position it slides to; the larger the picture, the more windows there are to detect and judge, and the longer the whole face detection takes.

Therefore, to evaluate an algorithm or model, look at its detection speed at a fixed resolution. What kind of value is this detection speed? It is typically the time taken to detect faces in one picture, such as 100 milliseconds, 50 milliseconds, or 30 milliseconds.

Another way to express speed is fps, the number of pictures processed per second. Typical network cameras run at 25 or 30 fps. The advantage of fps is that it tells you whether face detection is real-time: as long as the detector's fps is greater than the camera's fps, detection can keep up in real time; otherwise it cannot.
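A minimal way to measure this, assuming `detector` is some callable that processes one frame (a hypothetical stand-in for a real face detector):

```python
import time

def measure_fps(detector, frames):
    """Return frames processed per second; real-time operation requires
    this value to exceed the camera's fps (typically 25 or 30)."""
    start = time.perf_counter()
    for frame in frames:
        detector(frame)
    return len(frames) / (time.perf_counter() - start)
```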

(2) Whether the speed is affected by the number of faces in the same frame

In our practical experience, speed is mostly unaffected, because the time is dominated by the number of sliding windows rather than the number of faces hit; the face count has only a slight impact.

2. Accuracy

Accuracy is mainly judged by the recall rate, the false detection rate, and the ROC curve. Recall is the proportion of photos that really are faces and that the model correctly judges to be faces. The false detection rate, the error rate on negative samples, is the proportion of photos that are not faces but are misjudged as faces.

ACC (accuracy)

ACC is calculated as the number of correctly classified samples divided by the total number of samples. For example, take 10,000 photos for face detection, some containing faces and some not, and measure what proportion is judged correctly.
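In code, ACC is essentially one line; the threshold argument foreshadows the problem discussed next, since the reported number moves with it:

```python
import numpy as np

def acc(scores, labels, threshold=0.5):
    """Correct samples / total samples.
    scores: model confidences in [0, 1]; labels: 1 = face, 0 = not a face."""
    preds = np.asarray(scores) >= threshold
    return float(np.mean(preds == (np.asarray(labels) == 1)))
```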

But there is a problem with this metric: it is completely insensitive to the ratio of positive and negative samples; that is, it does not distinguish the correct rate on positive samples from the correct rate on negative samples, only the overall total. When a model reports 90% accuracy, you cannot tell how it behaves on positives versus negatives. Also, whether for classification or regression, a classification model generally first regresses a so-called confidence; when the confidence is greater than some threshold the sample is judged positive, and when it is below that threshold, negative.

That threshold is adjustable in the ACC statistic; in other words, the accuracy changes as the confidence threshold is adjusted.

Therefore, the ACC value is strongly affected by the sample proportions, and using it alone to represent model quality is somewhat problematic. When a test metric claims 99.0%, looking only at that value makes it relatively easy to be deceived, or the statistic may simply be biased. To solve this problem, a curve called ROC is generally used to characterize the accuracy of a model.

ROC (receiver operating characteristic) curve


Abscissa: FPR (False Positive Rate), the error rate on negative samples

Ordinate: TPR (True Positive Rate), the correct rate on positive samples

The curve separates the algorithm's performance on positive and negative samples, and its shape is independent of the ratio of positive to negative samples.

The ROC (Receiver Operating Characteristic) curve plots the negative-sample error rate on the abscissa against the positive-sample correct rate on the ordinate. On this plot, a model is not a single point or a single number but a line: each point corresponds to one confidence threshold. The higher the threshold is set, the stricter the model; the lower, the looser. The curve therefore reflects how changing the confidence threshold affects the model.

In the future, rather than directly asking what a model's accuracy is, it is better to look at its ROC curve; this makes it much easier to judge the model's ability.
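A minimal sketch of tracing the curve by sweeping the threshold (for real work, libraries such as scikit-learn's `roc_curve` do the same thing):

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """One (FPR, TPR) point per confidence threshold; plotting all of
    them traces the ROC curve."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        preds = scores >= t
        tpr = float(np.mean(preds[labels == 1]))  # positive-sample correct rate
        fpr = float(np.mean(preds[labels == 0]))  # negative-sample error rate
        points.append((fpr, tpr))
    return points
```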

Face Alignment


The purpose of face alignment is to adjust the face texture to a standard position as much as possible, reducing the difficulty for the face recognizer.

To artificially reduce the difficulty, faces are aligned first: the eyes, nose, and mouth of every detected face are placed at the same positions. Then, when the model compares two faces, it only needs to look near the same locations to decide whether they are similar to each other or very different. That is why we do this alignment step. The common method today is two-dimensional: find the key feature points in the picture. There are schemes with 5 points, 19 points, around 60 points, or more than 80 points; for face recognition, 5 points are basically enough.

Pixels other than these five points are mapped with an operation similar to interpolation and pasted to the corresponding positions; the result is then sent to the face recognizer behind for identification. That is the common approach. There are also more cutting-edge approaches: some research institutions use so-called 3D face alignment, where the model is taught what a frontal face looks like and what it looks like when rotated, say, 45 degrees. After training on such pictures, when the model sees a face rotated 45 degrees it can guess what that face would look like turned back to frontal.
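A minimal sketch of the 2D five-point alignment with OpenCV: estimate a similarity transform from the detected landmarks to a canonical template and warp the image. The template coordinates below (for a 112x112 crop) are an assumed example; real systems each define their own:

```python
import cv2
import numpy as np

# Assumed canonical positions of eyes, nose tip, and mouth corners.
TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

def align_face(image, landmarks):
    """Warp the image so the five detected landmarks land on TEMPLATE."""
    M, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
    return cv2.warpAffine(image, M, (112, 112))
```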

Face Feature Point Extraction Algorithms

Traditional methods include the local texture model, the global texture model, the shape regression model, and so on. Popular now are deep convolutional neural networks, recurrent neural networks, convolutional networks with 3DMM parameters (3DMM meaning the model carries three-dimensional information), and cascaded deep neural networks.


With a cascaded deep neural network, given the face, we first need to infer the positions of the five points. Doing this in one shot with a single model would require a very complicated model.

So how can the complexity of the model be reduced?

Use multiple stages. The first network makes a rough but acceptable guess, so you roughly know where the five points of the face are. Then these five points and the original picture go into a second network, which outputs an approximate correction. Computing a correction on top of a rough initial estimate is a little easier than finding accurate five points directly from the original picture. This progressive refinement, cascading several networks together, achieves a better balance between speed and accuracy. In practice, two stages are usually about enough.
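Schematically, the two-stage cascade looks like this, where `stage1` and `stage2` are hypothetical trained networks returning landmark arrays:

```python
import numpy as np

def cascaded_five_points(face_image, stage1, stage2):
    """Coarse-to-fine landmark estimation: stage1 makes a rough guess,
    stage2 predicts a small correction given that guess."""
    rough = np.asarray(stage1(face_image))         # coarse 5x2 estimate
    delta = np.asarray(stage2(face_image, rough))  # predicted correction
    return rough + delta                           # refined landmarks
```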

Evaluation Metrics for Face Feature Point Extraction


NRMSE (Normalized Root Mean Square Error) measures the difference between the predicted coordinates of each feature point and the labeled coordinates.

Precision

To allow faces of different sizes to be compared on the same footing, a statistic called the normalized root mean square error is used. For example: we mark five points on paper and let the machine predict them; the closer the predicted positions are to the real ones, the more accurate the prediction. Predictions always have some deviation, so how do we express this precision? Usually with the mean or root mean square of the point-to-point distances. The problem is that the same machine, predicting on images of different sizes, produces different values, because the larger the image, the larger the absolute error; the same holds for faces of different sizes. The solution is to take the original size of the face into account. The denominator is generally the person's interocular distance or the diagonal of the face box: dividing the distance error by the interocular distance, or by the face diagonal, yields a value that is basically independent of face size and can be used for evaluation.
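A sketch of NRMSE with interocular normalization (indexing assumes the first two points are the eyes, which is just a convention chosen here):

```python
import numpy as np

def nrmse(pred, gt, left_eye=0, right_eye=1):
    """Root mean square landmark error divided by interocular distance,
    so the value does not grow with face size. pred, gt: (N, 2) arrays."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rmse = np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1)))
    return rmse / np.linalg.norm(gt[left_eye] - gt[right_eye])
```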

Face Comparison

(1) Purpose: to judge whether two aligned faces belong to the same person

(2) Difficulties: the same face shows different states under different conditions. First, it is strongly affected by lighting, smoke, makeup, and the like. Second, the parameters of the mapping from the face to a two-dimensional photo, such as the shooting angle and lighting accumulation, also vary, causing the same face to produce different appearances. Third, age and plastic surgery have an impact.

Methods of face comparison

(1) Traditional methods

Manually extract features such as HOG, SIFT, or wavelet transforms. Generally these extractors have fixed parameters: no training or learning is required, just a fixed algorithm, and the extracted features are then compared.

(2) Deep methods

The mainstream approach is the deep method, that is, deep convolutional neural networks. A DCNN replaces the earlier feature extractors, pulling various features out of a picture of a face. A DCNN has many parameters, and these parameters are learned rather than specified by people; learned features work better than hand-designed ones.

The resulting feature vector generally has 128, 256, 512, or 1024 dimensions, and these vectors are then compared. To judge the distance between feature vectors, Euclidean distance or cosine similarity is generally used.
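Both measures are a few lines of NumPy:

```python
import numpy as np

def euclidean_distance(a, b):
    """Lower = more similar."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def cosine_similarity(a, b):
    """Higher = more similar; typical feature dimensions are 128/256/512/1024."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```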

The evaluation metrics for face comparison are also divided into speed and accuracy. Speed includes the computation time of a single face feature vector and the comparison speed; accuracy includes ACC and ROC. Since these were introduced earlier, here we focus on comparison speed.

An ordinary comparison is a simple operation, the distance between two points: a single inner product of two vectors may be all that is needed. But when face recognition runs a 1:N comparison against a very large gallery, taking one photo and searching the whole gallery means a huge number of comparisons; if the gallery holds one million identities, you may need to search a million times. Since a million comparisons must still finish within some total time budget, various techniques exist to speed this comparison up.
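A simple version of the brute-force 1:N search: L2-normalize everything so one matrix-vector product gives all N cosine similarities at once. For million-scale galleries, approximate nearest-neighbor libraries (FAISS is one well-known example) trade a little accuracy for large speedups:

```python
import numpy as np

def search_gallery(query, gallery):
    """gallery: (N, D) matrix of enrolled features; returns the index of
    the most similar identity and its cosine score."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = g @ q                     # one pass over all N identities
    best = int(np.argmax(scores))
    return best, float(scores[best])
```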


Other algorithms related to face recognition

Mainly face tracking, quality assessment, and liveness recognition.

●  Face Tracking

In video face recognition scenarios such as surveillance, running the entire face recognition pipeline on every frame of the same person passing by not only wastes computing resources but may also cause false recognitions on low-quality frames. It is necessary to determine which faces belong to the same person and to select suitable photos for recognition, which greatly improves the overall performance of the system.

Tracking algorithms are now used not only for faces but also for general objects, vehicles, and so on. Such algorithms do not necessarily depend on detection: for example, after detecting an object once at the beginning, a pure tracker may never run detection again and rely only on tracking. But to achieve very high precision and never lose the target, each tracking step then becomes time-consuming.

To ensure the tracked face still matches what the face recognizer expects, a face detector is generally still used, with a relatively lightweight tracker built on top of the detections. In certain scenarios this achieves a balance between speed and quality.


This approach is called Tracking by Detection: face detection is still performed in every frame. After faces are detected, the four values of each face, its coordinate position, width, and height, are compared between two consecutive frames; from the positions and sizes of the faces, one can roughly infer whether two faces belong to the same moving object.
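A common way to make that inference is intersection-over-union between boxes in consecutive frames; the sketch below uses (x, y, w, h) boxes, and the matching threshold would be chosen empirically:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes; high IoU across
    consecutive frames suggests the same moving face."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```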

●  Interval full-frame detection (optional)

When doing Tracking by Detection on consecutive frames, one option is full-frame detection, scanning the entire frame, but this is time-consuming. So another method is sometimes used: run a full-frame scan only every few frames. In between, predict from the previous frame; the position will not change much, so just expand the previous frame's box a little up, down, left, and right and detect again within that region. With high probability the face is found there, and most frames can skip the full scan.

Why do we still need a full-frame detection every few frames?

To catch new objects coming in. If you only search around the positions of previous objects, a new object entering the frame may go undetected. To prevent this, do a full-frame detection every five or ten frames.
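A sketch of that schedule, with `detect` and `expand_box` as hypothetical helpers (a detector that can optionally be restricted to a region, and a function that grows a box slightly):

```python
def track_with_interval(frames, detect, expand_box, full_every=10):
    """Full-frame detection every `full_every` frames to catch new faces;
    in between, only re-detect inside slightly expanded previous boxes."""
    boxes = []
    for i, frame in enumerate(frames):
        if i % full_every == 0 or not boxes:
            boxes = detect(frame)                  # full-frame scan
        else:
            rois = [expand_box(b) for b in boxes]  # grow last boxes a bit
            boxes = [b for roi in rois for b in detect(frame, roi=roi)]
        yield boxes
```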

●  Face quality assessment

Because of limits in the face recognizer's training data and other factors, it cannot perform well on faces in every state. Quality assessment judges how well a detected face matches the characteristics the recognizer handles well, and only faces with a high degree of agreement are sent for recognition, improving the overall performance of the system.

Face quality assessment covers the following four elements:

① Face size: if the face is too small, recognition performance drops sharply.

② Face pose, i.e., the rotation angles around the three axes. This is generally tied to the recognizer's training data: if most training faces have small pose angles, it is best not to select strongly deflected faces for recognition, otherwise the recognizer will not cope.

③ Blur: this element is very important; if the photo has lost information, recognition will run into problems. A simple sharpness measure is sketched after this list.

④ Occlusion: if the eyes, nose, and so on are covered, the features of that region cannot be obtained, or what is obtained is wrong, being features of the occluder, which harms subsequent recognition. If a face can be judged to be occluded, discard it or apply special handling, such as not feeding it into the recognition model.
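For the blur element, one widely used sharpness proxy (an assumption here; the talk does not name a specific metric) is the variance of the Laplacian, which drops sharply on defocused or motion-blurred crops:

```python
import cv2

def blur_score(face_bgr):
    """Variance of the Laplacian of the grayscale crop: lower = blurrier.
    A system would reject faces scoring below an empirically chosen cutoff."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```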

● Liveness recognition

This is a problem every face recognition system encounters: if only the face itself is recognized, a photo can fool the system. To prevent attacks, the system must also judge whether it is seeing a real face or a fake one.

There are basically three current methods:

① Traditional dynamic recognition: many bank cash machines require user cooperation, such as asking the user to blink or turn their head, and then judge from the blink or head turn whether the user cooperated as instructed. The problem with dynamic recognition is that it demands more cooperation from the user, so the user experience suffers.

② Static recognition: no action is required; whether the face is real or fake is judged from the photo itself. This targets the attack methods that are most convenient today, for example holding up a phone or display and attacking with the screen.

The light emitted by such a screen differs from a human face under real illumination. For example, a display with 16 million colors cannot reproduce the continuous, full-band spectrum of natural visible light. So when such a screen is photographed again, even the human eye can see changes and a certain unnaturalness compared with imaging in a real natural environment. Training a model on this unnaturalness lets it judge, from these subtle differences, whether the face is real.

③ Stereo recognition: with two cameras, or a camera that captures depth information, the distance from each captured point to the camera is known. If the points lie on a plane, the system realizes it is looking at a plane, and a plane is definitely not a real person. This uses three-dimensional recognition to exclude flat fakes.


System composition of face recognition

First, a classification. By comparison form, there are 1:1 and 1:N recognition systems; by comparison object, photo comparison and video comparison systems; and by deployment form, private deployment, cloud deployment, or mobile-device deployment.

Photo 1:1 recognition system


The 1:1 recognition system is the simplest: take two photos, generate a feature vector for each, and compare the two feature vectors to decide whether they belong to the same person.

Photo 1:N recognition system


A 1:N recognition system judges whether a probe photo appears in a gallery. The gallery is prepared in advance, perhaps as a whitelist or blacklist, containing one photo per person, from which a set of feature vectors is generated. The uploaded photo is compared with all the features in the gallery to find the most similar one, and the person is identified as that match. This is a 1:N recognition system.

Video 1:1 recognition system


The 1:1 recognition system for video is similar to the 1:1 system for photos, but the object of comparison is not a photo, it is a video stream. After getting the video stream, we do detection, tracking, and quality assessment, and once suitable photos are obtained, we do the comparison.

Video 1:N recognition system


The video 1:N recognition system is similar to the photo 1:N system, except that identification is preceded by detection, tracking, and quality assessment on the video stream.


Generally speaking, this system composition is not specific to face recognition; most AI systems look roughly like this. The first layer is the computing resource layer, running on CPU or GPU, possibly with CUDA, cuDNN, and similar support on the GPU.

The second is the computing tool layer, including a deep learning inference library, matrix computing libraries, and image processing libraries. Since algorithm developers cannot all write numerical computation from scratch, they use existing libraries such as TensorFlow, MXNet, or Caffe, or write their own.

The last is the application algorithm layer, including the implementations of face detection, feature point localization, quality assessment, and the other algorithms. That is the general system composition.

