[Computer Vision|Face Modeling] Basic knowledge of 3D face reconstruction (introduction)

This series of blog posts contains notes on deep learning / computer vision papers. Please indicate the source when reprinting.

1. The basis of 3D reconstruction

3D reconstruction (3D Reconstruction) refers to the process of reconstructing 3D information from single-view or multi-view images.

1. Common 3D reconstruction techniques

| | Manual geometric modeling | Instrument acquisition | Image-based modeling |
| --- | --- | --- | --- |
| Description | Build 3D geometric models of objects through human-computer interaction in geometric modeling software | 3D imagers based on structured light and laser scanning | Recover the 3D structure of an object or scene from one or more 2D images |
| Advantages | High accuracy | High precision (mm level); real 3D data of objects | Low cost |
| Disadvantages | Requires professionals; high labor and time cost | High instrument cost makes large-scale collection difficult | Difficult and complicated |
| Examples | 3DMax, Blender (provide API interfaces for programmable development) | Usually used to build 3D databases (datasets) | Start from a face model, then perform shape fitting, deformation, and texture mapping |

2. Characteristics and Difficulties of 3D Face Reconstruction

Reconstructing the real 3D shape and texture of human faces from 2D images is an ill-posed problem.

2.1 Features

  • There are many face image preprocessing techniques

    • Face detection, feature point positioning, face alignment and segmentation, etc.
  • Faces share a lot of commonality, and facial features have clear relative positional relationships

    • This allows the face reconstruction problem to be transformed into estimating the parameters of a personalized face model

2.2 Difficulties

  • The physiological structure and geometry of the human face are very complex

    • Existing mathematical surface representations easily produce overly smooth results

    • Surfaces created using point cloud data triangulation are rough

  • Faces have few obvious feature points, large low-texture regions, smooth brightness variation, and little gradient information

    • Difficult to generate reliable 3D point cloud data

3. Basic technology of face 3D reconstruction

Face features are divided into geometric features and texture features, and modeling needs to recover both kinds of information at the same time.

3.1 Geometric modeling

  1. Polygon mesh modeling technology (mainstream)

    • A mesh model consists of a number of 3D vertices and polygonal faces bounded by those vertices. The more vertices and faces, the more realistic the model (see the sketch after this list).

      • The image below is an example of a quadrilateral mesh

      • The image below is an example of a triangular mesh

    • Select control points on the mesh, and move the control points to deform the mesh.

      • Uniform mesh (not commonly used)

      • Non-uniform mesh: denser around facial details such as the eyes and lips, sparser on the cheeks and elsewhere, which reduces complexity.

  2. Surface modeling techniques (not commonly used)
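To make the vertex-and-face representation above concrete, here is a minimal sketch of how a triangular mesh is commonly stored as a vertex array plus a face-index array; the values are toy numbers for illustration only:

```python
import numpy as np

# A tiny triangular mesh: 4 vertices and 2 triangular faces.
# (Toy values; a real face mesh has tens of thousands of vertices.)
vertices = np.array([
    [0.0, 0.0, 0.0],   # vertex 0: (x, y, z)
    [1.0, 0.0, 0.0],   # vertex 1
    [0.0, 1.0, 0.0],   # vertex 2
    [1.0, 1.0, 0.2],   # vertex 3
])

# Each face is a triple of vertex indices; more vertices and faces -> more realistic surface.
faces = np.array([
    [0, 1, 2],
    [1, 3, 2],
])

# Deforming the mesh = moving (control) vertices; the face indices stay the same.
vertices[3, 2] += 0.1   # lift one control point along z
print(vertices.shape, faces.shape)   # (4, 3) (2, 3)
```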

3.2 Texture Mapping

Any point of a 3D model in the Basel face database (Basel Face Model, BFM) can be represented by (x, y, z, r, g, b), where (x, y, z) are the position coordinates and (r, g, b) are the color coordinates, which can also be called the texture.

The Basel Face Model is a 3D model of human face shape, together with associated skin tone and texture models. Built from a large number of 3D scans and color images, it is a statistical model of facial shape and texture variation and can be used in various computer vision and graphics applications such as face recognition, animation, and special effects. There are currently three versions of the dataset (2009, 2017, and 2019).

Texture mapping simply puts a two-dimensional image onto the surface of a three-dimensional model, and generally needs to be combined with lighting models, image fusion, and other techniques. High-resolution two-dimensional texture maps (see below) are also often used to characterize three-dimensional textures. The BFM model and 3DMM technique are described in more detail later.

Specifically, after the 3D model is reconstructed, it is projected onto the 2D image, and the pixel values are sampled and assigned to the 3D mesh.
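As a rough sketch of this sampling step (assuming the per-vertex projected pixel coordinates are already available; the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def sample_vertex_colors(image, uv_pixels):
    """Assign each 3D vertex the pixel color at its projected 2D location.

    image:     H x W x 3 uint8 array (the input photo)
    uv_pixels: N x 2 array of projected (u, v) pixel coordinates, one per vertex
    returns:   N x 3 array of (r, g, b) values, i.e. the per-vertex texture
    """
    h, w = image.shape[:2]
    u = np.clip(np.round(uv_pixels[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv_pixels[:, 1]).astype(int), 0, h - 1)
    return image[v, u]   # nearest-neighbour sampling; bilinear sampling is also common
```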

As of 2021, the latest approach is to use deep learning for rendering, such as neural rendering methods.


2. Traditional 3D face reconstruction technology

  • Methods based on multi-view stereo matching
  • 3DMM fitting
  • shape from shading
  • structure from motion

This section focuses on the first two.

1. Multi-view stereo matching

Multi-view stereo (Multiple View Stereo, MVS)

1.1 Coordinate systems and camera calibration

To image a three-dimensional object, it is necessary to understand the following four coordinate systems. (There are slight discrepancies between the symbols in the figure and in the text; follow the text.)

  1. world coordinate system, world

    • The three-dimensional space coordinates of the real world; coordinate axes $(X_w, Y_w, Z_w)$, in meters
  2. Camera coordinate system, camera

    • The three-dimensional coordinate system in which the world coordinate system appears inside the camera, according to the principle of lens imaging; coordinate axes $(X_c, Y_c, Z_c)$, in meters

    • Going from the world coordinate system to the camera coordinate system requires a rigid-body transformation (rotation and translation)

    • The conversion from the world coordinate system to the camera coordinate system is:

      $$\begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix}=\begin{bmatrix}r_{00}&r_{01}&r_{02}\\r_{10}&r_{11}&r_{12}\\r_{20}&r_{21}&r_{22}\end{bmatrix}\begin{bmatrix}X_w\\Y_w\\Z_w\end{bmatrix}+\begin{bmatrix}T_x\\T_y\\T_z\end{bmatrix}$$

      • In particular, $\begin{bmatrix}r_{00}&r_{01}&r_{02}\\r_{10}&r_{11}&r_{12}\\r_{20}&r_{21}&r_{22}\end{bmatrix}$ is the rotation matrix and $\begin{bmatrix}T_x\\T_y\\T_z\end{bmatrix}$ is the translation vector
      • They are independent of the camera itself (they only describe the camera's pose relative to the world), so these two are called the camera's extrinsic parameters (Extrinsic Parameters)
  3. image coordinate system, image

    • Project the three-dimensional coordinates presented by the camera to a two-dimensional plane, the origin is the center point of the imaging plane, and the coordinate axes (X, Y) are in millimeters

    • The perspective projection (Perspective Projection) from the camera coordinate system to the image coordinate system is based on the principle of pinhole imaging, and the coordinates of its projection point on the imaging plane are obtained by using a simple similar triangle proportional relationship, such as the formula:

      $$\begin{cases}x=\dfrac{fX_c}{Z_c}\\y=\dfrac{fY_c}{Z_c}\end{cases}$$

      • $f$ is the focal length of the camera, which is one of the camera's intrinsic parameters (Intrinsic Parameters)
  4. Image (pixel coordinate system), pixel

    • The image obtained by discrete sampling from the image coordinate system is a two-dimensional coordinate system, the origin is located in the upper left corner, the coordinate axis is (U, V), and the unit is pixel

    • This step is mainly for discretization, refer to the formula:

      $$\begin{cases}u=\dfrac{x}{dx}+u_0\\v=\dfrac{y}{dy}+v_0\end{cases}$$

Based on the above, the overall transformation process is

$$Z_c\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}\frac{1}{dx}&0&u_0\\0&\frac{1}{dy}&v_0\\0&0&1\end{bmatrix}\begin{bmatrix}f&0&0&0\\0&f&0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}r_{00}&r_{01}&r_{02}&T_x\\r_{10}&r_{11}&r_{12}&T_y\\r_{20}&r_{21}&r_{22}&T_z\\0&0&0&1\end{bmatrix}\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}=P\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}$$

where:

  • The $3\times4$ matrix $P$ is also called the camera model (Camera Model)

  • $\begin{bmatrix}\frac{1}{dx}&0&u_0\\0&\frac{1}{dy}&v_0\\0&0&1\end{bmatrix}\begin{bmatrix}f&0&0&0\\0&f&0&0\\0&0&1&0\end{bmatrix}$ is called the camera's intrinsic parameter matrix, because it is determined only by the camera's internal parameters.
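A minimal sketch of the world-to-pixel transformation chain above, with made-up extrinsic and intrinsic values; it simply chains the rigid-body transform, the pinhole projection, and the pixel discretization:

```python
import numpy as np

# Extrinsic parameters: rotation R and translation T (world -> camera). Toy values.
R = np.eye(3)
T = np.array([0.0, 0.0, 2.0])

# Intrinsic parameters: focal length f (metres), pixel sizes dx, dy, principal point (u0, v0).
f, dx, dy, u0, v0 = 0.035, 1e-5, 1e-5, 320.0, 240.0

def world_to_pixel(Xw):
    Xc = R @ Xw + T                                # world -> camera coordinates
    x, y = f * Xc[0] / Xc[2], f * Xc[1] / Xc[2]    # perspective projection onto the image plane
    u, v = x / dx + u0, y / dy + v0                # image plane -> pixel coordinates
    return u, v

print(world_to_pixel(np.array([0.1, -0.05, 1.0])))
```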

1.2 Reconstruction principle based on stereo vision matching

Stereo vision using two images is called binocular stereo vision; stereo vision using multiple images is called multi-view stereo.

It usually includes the following important steps, illustrated by binocular stereo vision matching:

  1. Image acquisition: acquire two images, usually one left and one right
  2. Camera modeling, i.e., calibration of the cameras' intrinsic parameters
  3. Binocular rectification: based on each camera's intrinsic matrix and the cameras' relative pose, the images are warped so that corresponding imaging points lie on the same horizontal line
  4. Feature extraction: obtain the salient feature points in the images (of course, all image points can also be used)
  5. Image matching, the key step: match the two images; there are two approaches:
    1. Feature matching: establish correspondences between sparse image feature points, such as SIFT feature points.
    2. Dense matching: determine a corresponding matching pixel for every pixel and build a dense disparity map.
  6. Disparity map (depth map) estimation
  7. Interpolation

Binocular stereo matching principle:

The projection of a point $p$ in space onto the image coordinates of the left camera is $p_l(x,y)$, and its projection onto the image coordinates of the right camera is $p_r(x',y')$. Since the imaging points of the left and right images after binocular rectification lie on the same line, $y=y'$. According to the triangle-similarity principle, the following relationship holds:

$$\frac{a}{Z}=\frac{a-(x-x')}{Z-f}$$

So the depth is $Z=\frac{af}{x-x'}$, where $a$ is the baseline distance between the two cameras, $f$ is the camera focal length, and $x-x'$ is the disparity between the two projection points.
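A small worked example of the depth formula above; the baseline, focal length (in pixels), and disparities are illustrative values:

```python
import numpy as np

def depth_from_disparity(disparity, baseline=0.06, focal_px=800.0):
    """Z = a*f / (x - x') applied per pixel; disparity in pixels, baseline in metres."""
    disparity = np.asarray(disparity, dtype=float)
    return np.where(disparity > 0, baseline * focal_px / disparity, np.inf)

print(depth_from_disparity([40.0, 20.0, 10.0]))   # farther points have smaller disparity
```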


1.3 Characteristics of the reconstruction method based on stereo vision matching

  • The advantage is that dense disparity information can be obtained to recover dense 3D point clouds, so the reconstructed 3D face surface is finer.

  • The disadvantage is that matching points must be found. For face images, the skin is relatively smooth, so there is little texture information and only a small number of feature points can be obtained, usually no more than 200. Stereo matching needs to match dense points; when applied to face images, many similar regions produce relatively high matching responses, resulting in severe ambiguity. In addition, the illumination changes and self-occlusion shadows common in face images also make stereo matching very difficult.

Therefore, this method does not have much advantage, and the calculation amount is large and the steps are complicated.

2. 3DMM

3DMM (3D Morphable Model) was proposed in 1999. It assumes that every 3D face can be represented in a basis-vector space composed of all the faces in a dataset, and solving for the model of any 3D face is then equivalent to solving for the coefficients of each basis vector. For a face, we have the following features:

  • Forehead length (varies in [0,1])

  • eye size

  • nose size

  • ……

These features are the face basis vectors .

For any given 2D face picture, its 3D face can be represented as a weighted combination of a series of 3DMM face basis vectors, where $M$ is the number of basis vectors (i.e., the number of face models in the dataset), $\alpha_i$ is the coefficient of each basis vector controlling the face shape (geometric model, Shape), and $\beta_i$ is the coefficient of the basis vectors controlling the face texture (texture mapping, Texture):

$$\begin{cases}\pmb{S}_{mod}=\sum\limits_{i=1}^M\alpha_i\pmb{S}_i\\\pmb{T}_{mod}=\sum\limits_{i=1}^M\beta_i\pmb{T}_i\\\sum\limits_{i=1}^M\alpha_i=\sum\limits_{i=1}^M\beta_i=1\end{cases}$$

$\pmb{S}_i$ and $\pmb{T}_i$ are the shape vector and texture vector of the $i$-th face in the dataset.
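A minimal numpy sketch of this weighted combination, with random stand-ins for the dataset's $\pmb{S}_i$ and $\pmb{T}_i$ (the dimensions are illustrative, much smaller than a real face model):

```python
import numpy as np

M, n = 10, 500                       # M example faces, n vertices each (toy sizes)
S = np.random.rand(M, 3 * n)         # stand-in shape vectors S_i
T = np.random.rand(M, 3 * n)         # stand-in texture vectors T_i

alpha = np.random.rand(M); alpha /= alpha.sum()   # coefficients with sum(alpha) = 1
beta  = np.random.rand(M); beta  /= beta.sum()    # coefficients with sum(beta)  = 1

S_mod = alpha @ S                    # new face shape   = weighted combination of S_i
T_mod = beta @ T                     # new face texture = weighted combination of T_i
print(S_mod.shape, T_mod.shape)
```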

2.1 BFM dataset

The average face shape and average texture of the face of the earliest published dataset (2009), as shown in the figure:

Note that the face basis vectors are mutually orthogonal basis vectors obtained through PCA (Principal Component Analysis), which addresses the coupling problem between the face basis vectors.

In the figure, the first column shows the mean shape and mean texture, and the two rows from the second to the fourth column show the models obtained for the 1st, 2nd, and 3rd principal components at ±5 variance.

About the dataset:

Each model is described by 53490 points. Each model also includes tags such as gender, face height, age, etc.; in the two versions of 2017 and 2019, expression coefficients are also provided. For more accurate reconstruction, the face is also divided into 4 regions, each region can be reconstructed more accurately and then fused.

2.2 3DMM reconstruction principle

As mentioned above, any point can be represented by (x, y, z, r, g, b).

(x, y, z) belongs to the shape vector and (r, g, b) to the texture vector. The former determines the outline of the face, while the latter determines the skin color of the face.

Therefore, each face can be expressed as:

Shape Vector: $\pmb{S}=(X_1,Y_1,Z_1,\dots,X_n,Y_n,Z_n)$

Texture Vector: $\pmb{T}=(R_1,G_1,B_1,\dots,R_n,G_n,B_n)$

Other formulas refer to the content written above.

The orthogonalization is explained in detail below:

In the actual modeling process, $\pmb{S}_i$ and $\pmb{T}_i$ cannot be used directly as the basis vectors because they are not orthogonal, so PCA is used for dimensionality-reducing decomposition.

  1. Calculate $\overline{\pmb{S}}$ and $\overline{\pmb{T}}$, the averages of the shape and texture vectors.

  2. Center the face data: $\Delta\pmb{S}=\pmb{S}_i-\overline{\pmb{S}},\ \Delta\pmb{T}=\pmb{T}_i-\overline{\pmb{T}}$

  3. Calculate the covariance matrices $C_S$ and $C_T$ respectively.

  4. Find the eigenvalues $\alpha_i,\beta_i$ and eigenvectors $s_i,t_i$ of the shape and texture covariance matrices.

  5. So the formula can be transformed into:

    $$\begin{cases}\pmb{S}_{mod}=\overline{\pmb{S}}+\sum\limits_{i=1}^{M-1}\alpha_is_i\\\pmb{T}_{mod}=\overline{\pmb{T}}+\sum\limits_{i=1}^{M-1}\beta_it_i\end{cases}$$

Note:

  • $\alpha_i,\beta_i$ are sorted in descending order of magnitude
  • The right-hand side still involves $M$ faces, but the summation is reduced by one dimension (one term fewer)
  • The first few components of $s_i,t_i$ already approximate the original samples well, so the number of parameters to estimate can be greatly reduced without loss of accuracy

The method based on 3DMM is to solve these coefficients. Many later models will add parameters such as expression and lighting on this basis, but the principle is similar.
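The PCA construction in steps 1-5 above can be sketched as follows (shape only; texture is handled identically). The helper names, the toy dimensions, and the use of an SVD instead of an explicit covariance eigen-decomposition are my own choices for a compact, numerically stable illustration:

```python
import numpy as np

def build_pca_basis(S, keep=None):
    """S: M x 3n matrix of shape vectors, one row per face in the dataset."""
    S_mean = S.mean(axis=0)                      # step 1: average shape
    dS = S - S_mean                              # step 2: centre the data
    # steps 3-4: principal components via SVD of the centred data
    U, sigma, Vt = np.linalg.svd(dS, full_matrices=False)
    basis = Vt                                   # rows are the orthogonal components s_i
    if keep is not None:
        basis = basis[:keep]                     # keep only the leading components
    return S_mean, basis

def reconstruct(S_mean, basis, alpha):
    """Step 5: S_mod = mean + sum_i alpha_i * s_i"""
    return S_mean + alpha @ basis

S = np.random.rand(200, 3 * 500)                 # small stand-in dataset (real BFM is far larger)
S_mean, basis = build_pca_basis(S, keep=99)
S_mod = reconstruct(S_mean, basis, np.random.randn(99) * 0.1)
print(S_mod.shape)
```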

2.3 3DMM solution method

As mentioned earlier, the process of reconstructing the real three-dimensional shape and texture of the face from two-dimensional images is called Model Fitting, which is an ill-posed problem.

The classic approach is the 1999 article "A Morphable Model for the Synthesis of 3D Faces"

A Morphable Model For The Synthesis Of 3D Faces-Notes- Zhihu(zhihu.com)

Its solution idea is as follows:

  1. Initialize a 3D model: this requires initializing the model parameters $\alpha_i,\beta_i$ as well as the external rendering parameters, such as camera position, image-plane rotation angle, the components of directed and ambient light, image contrast, etc., more than 20 dimensions in total. With these parameters, the projection from the 3D model to the 2D image is uniquely determined.
  2. Under the control of the current parameters, a 2D image is obtained from the 3D model by 3D-to-2D projection, and the error between it and the input image is computed. Error backpropagation is then used to adjust the coefficients and hence the 3D model, iterating continuously. One triangle of the mesh participates in the computation at a time; if part of the face is occluded, that part does not participate in the loss computation.
  3. The iteration follows a coarse-to-fine strategy. Initially a low-resolution image is used and only the first principal-component coefficient is optimized; more principal components are added gradually. In some later iteration steps the external parameters are fixed and $\alpha_i,\beta_i$ are optimized separately for each face region.

If high accuracy is not required and only the face shape model is needed, many methods use 2D face keypoints to estimate the shape coefficients when fitting a 3DMM, which requires less computation and simpler iteration.
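A hedged sketch of such keypoint-based shape estimation: under a simplified orthographic, fixed-pose assumption, the shape coefficients can be obtained by regularized linear least squares from the 2D landmark residuals. All names, shapes, and the orthographic simplification are assumptions for illustration, not the procedure of any particular paper:

```python
import numpy as np

def fit_shape_from_landmarks(lm2d, mean_lm3d, basis_lm, reg=1e-3):
    """Estimate shape coefficients alpha from 2D landmarks (orthographic camera, fixed pose).

    lm2d:       K x 2 detected 2D landmarks
    mean_lm3d:  K x 3 mean-shape positions of the same landmark vertices
    basis_lm:   M x K x 3 shape basis restricted to the landmark vertices
    """
    # Keep only x, y (orthographic projection straight down the z-axis).
    A = basis_lm[:, :, :2].reshape(basis_lm.shape[0], -1).T   # (2K) x M
    b = (lm2d - mean_lm3d[:, :2]).reshape(-1)                 # (2K,)
    # Regularized least squares: (A^T A + reg*I) alpha = A^T b
    num_coeffs = A.shape[1]
    return np.linalg.solve(A.T @ A + reg * np.eye(num_coeffs), A.T @ b)

# Toy usage with 68 landmarks and 99 shape components (random stand-ins).
alpha = fit_shape_from_landmarks(np.random.rand(68, 2), np.random.rand(68, 3),
                                 np.random.rand(99, 68, 3))
print(alpha.shape)
```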

However, there are the following disadvantages

  • The ill-posed problem itself does not have a unique global optimal solution, and it is easy to fall into a poor local optimum
  • The background interference and occlusion of the face will affect the accuracy, and the error function itself is not continuous
  • Sensitive to initial conditions, such as when optimizing based on key points, if the accuracy of key points is poor, the accuracy of the reconstructed model will also be greatly affected.

2.4 Expression Model

As mentioned earlier, "in the two versions of 2017 and 2019, the expression coefficient is also provided", and the explanation is as follows:

In the 2009 version of the dataset, all images are collected based on neutral expressions, whereas real face images have a wide variety of expressions.

In 2014, FaceWarehouse added facial expressions on the basis of 3DMM. In 2017, the authors also upgraded the BFM model and added expression coefficients.

FaceWarehouse (kunzhou.net)

At present, there are mainly two types of expression models based on 3DMM, namely the additive model and the multiplicative model.

  • The additive model treats the expression as an offset of the shape, for example (a small sketch follows this list):
    $$c(W^s,W^e)=\overline{c}+E^sW^s+E^eW^e$$
    where $\overline{c}$ is the average face, $E^s$ and $E^e$ are the shape and expression bases respectively, and $W^s$ and $W^e$ are the shape and expression coefficients.

  • Unlike texture, an expression also changes the shape of the face and is therefore not completely orthogonal to it, so some researchers proposed a multiplicative model, for example:
    $$c(W^s,W^e)=\sum\limits_{j=1}^{d_e}w^e_jT_j\bigl(c(W^s)+\delta^s\bigr)+\delta^e_j$$
    where $W^e$ is a set of expression transfer operations, the $j$-th operation being $T_j$; $\delta^s$ and $\delta^e_j$ are calibration vectors, and $w^e_j$ is the coefficient of the $j$-th component of $W^e$.

2.5 Appearance Model

The appearance of a model is affected by both albedo and lighting, but most 3DMMs do not distinguish between the two, so they are treated as a single factor, albedo.

In the BFM models proposed since 2009, the appearance model is linear, i.e., a linear combination of multiple texture bases. Subsequent researchers added texture details, such as facial wrinkles, on this basis.

3. Shape from Shading(SfS)

Shape from shading recovers shape from light and shade, i.e., it is a technique that directly recovers depth from grayscale, based on the relationship between image brightness (the intensity of light reflected from the object's surface) and the geometry of the object's surface.

Principle: Use the brightness information of the grayscale image and the brightness generation principle to obtain the normal vector of each pixel in the 3D space, and finally obtain the depth information according to the normal vector.

Brightness information is determined by the following four factors

  • Lighting, including direction, position, and energy distribution of light
  • The reflectivity of the surface of the object, which determines how the incident light is reflected on the surface of the object, generally determined by the material of the surface of the object
  • geometry of the surface
  • Camera model, including intrinsic and extrinsic parameters

Rendering uses lighting and the object's reflectance to simulate a camera and produce a 2D image from a 3D model; SfS is the inverse of this process.

4. Structure from Motion

It is a technique for estimating three-dimensional structures from a series of multiple two-dimensional image sequences containing visual motion information. The basic steps are as follows:

  1. Image feature point detection and matching: match feature points between each pair of images, keeping only points that satisfy geometric constraints.
  2. Iterative optimization to recover the camera's intrinsic and extrinsic parameters

Different from the stereo vision method, in stereo vision the relative positions of the cameras are given by calibration, while in SfM the relative camera poses must be computed before reconstruction.

3. Deep Learning 3D Face Modeling

1. 3DMM-based method

1.1 Fully supervised 3DMM

Traditional methods need to solve for the relevant coefficients by optimization, whereas deep learning can regress these coefficients directly with a model.

A representative is the 3DMM-CNN method proposed in 2017:

Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network | IEEE Conference Publication | IEEE Xplore

This is a paper on a single-image 3D reconstruction method , an overview of the overall process:

  1. A large number of unconstrained photos are used to fit a 3DMM for each subject.
    • A ResNet-101 network is used to directly regress the 3DMM shape and texture coefficients, each with 99 dimensions.
  2. 3DMM shape and texture parameters are first fitted to each image separately. All 3DMM estimates for the same subject are then pooled together to provide a single estimate per subject.
  3. These pooled estimates are used, instead of costly real face scans, to train the model to directly regress the relevant 3DMM parameters.

The following problems need to be addressed:

  • Dataset acquisition. Pairs of real 3D faces and 2D face images are very scarce and expensive to acquire. The author uses multiple images of each subject in the CASIA-WebFace dataset to solve Model Fitting and generate a corresponding 3D face model, which is used as the ground truth, thereby obtaining 2D-3D image pairs.

    CASIA-WebFace of Dataset: Introduction, Installation, and Usage of CASIA-WebFace Dataset - A Virgo Programmer's Blog-CSDN Blog

  • Design of the optimization target. Because the reconstruction result is a 3D model, the loss function is computed in 3D space. If the standard Euclidean loss is used to minimize the distance, the resulting face models tend toward the average. To address this, the author proposed the asymmetric Euclidean loss, so that the model can learn more distinctive features.
    $$L(\gamma_p,\gamma)=\lambda_1\cdot\underbrace{||\gamma^+-\gamma_{max}||^2_2}_{\text{over-estimate}}+\lambda_2\cdot\underbrace{||\gamma^+_p-\gamma_{max}||^2_2}_{\text{under-estimate}}$$
    where $\gamma^+\doteq abs(\gamma)\doteq sign(\gamma)\cdot\gamma;\quad \gamma^+_p\doteq sign(\gamma)\cdot\gamma_p;\quad \gamma_{max}\doteq max(\gamma^+,\gamma^+_p)$

    $\gamma$ is the label and $\gamma_p$ is the predicted value; the two weights $\lambda_1$ and $\lambda_2$ trade off the over-estimation and under-estimation errors of the loss (a code sketch follows this list)

    • When $\lambda_1:\lambda_2=1:1$, this reduces to the traditional Euclidean loss
    • The author sets $\lambda_1:\lambda_2=1:3$, which changes the behavior of training so that it escapes under-fitting faster, encouraging the network to generate more detailed and realistic 3D face models
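A PyTorch sketch of the asymmetric Euclidean loss as written above; this is a direct reading of the formula, not the authors' released code, and the default $\lambda_1:\lambda_2=1:3$ follows the ratio mentioned in the text:

```python
import torch

def asymmetric_euclidean_loss(gamma_pred, gamma_true, lambda1=1.0, lambda2=3.0):
    """L = lambda1 * ||gamma+ - gamma_max||^2 + lambda2 * ||gamma_p+ - gamma_max||^2"""
    sign = torch.sign(gamma_true)
    gamma_plus = sign * gamma_true                 # == abs(gamma_true)
    gamma_p_plus = sign * gamma_pred
    gamma_max = torch.maximum(gamma_plus, gamma_p_plus)
    over = torch.sum((gamma_plus - gamma_max) ** 2)      # over-estimation term
    under = torch.sum((gamma_p_plus - gamma_max) ** 2)   # under-estimation term
    return lambda1 * over + lambda2 * under

# Toy usage on 99-dimensional coefficient vectors.
loss = asymmetric_euclidean_loss(torch.randn(99), torch.randn(99))
```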

In addition to the classic work above, researchers have also proposed ExpNet (2018) for predicting expression coefficients and FacePoseNet (2017) for predicting pose coefficients, verifying the feasibility of learning these coefficients from data with CNN models.

1.2 Self-supervised 3DMM

Real paired datasets are difficult and expensive to obtain and therefore scarce, so the robustness of models trained on real data alone needs improvement. Many methods use synthetic datasets, which can generate more data for learning, but the distribution of synthetic data differs from that of real data after all, and because hair and other parts are missing, models trained this way generalize relatively poorly to real data.

Based on this, self-supervised methods have been studied. They do not rely on real paired datasets: a 2D image is reconstructed into 3D and then projected back to a 2D image. A representative of this type of model is the MoFA method.

[1703.10580] MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction (arxiv.org)

The input is a single 2D image; 2D landmarks are optional and not required, but providing them can speed up the convergence of the model.
The purpose of the encoder is to encode the image into semantic features, namely pose, shape, expression, facial reflectance (texture), and illumination.
These semantic features are then rendered (decoded) into the corresponding 3D model by a differentiable decoder, and the whole model is trained unsupervised with a reconstruction loss.

1.3 Combination with other methods

Usually a 3DMM takes only one image as input; providing more useful supervision information as input helps the optimization of the model.

One example is the 3DDFA (2018) framework:

Two, Face Alignment in Full Pose Range: A 3D Total Solution (3DDFA) 300w-lp dataset introduction :) Annual Blog-CSDN Blog

Most face alignment algorithms are designed for faces in small to medium poses (yaw angle less than 45°), and lack the ability to align faces in large poses (side faces) up to 90°.

The challenge has three dimensions.

  • Commonly used landmark face models assume that all landmarks are visible and thus are not suitable for large poses.
  • Facial appearance changes more in large poses from frontal to side views.
  • Labeling landmarks in large poses is very challenging because invisible landmarks have to be guessed.

In this paper, the authors propose to address these three challenges in a new alignment framework called 3D Dense Face Alignment (3DDFA), in which dense 3D deformable models (3DMMs) are fitted to images via cascaded convolutional neural networks. At the same time, 3D information is also used to synthesize face images in side view, which provides abundant samples for training.

This is a cascaded regression process. The 3DDFA framework takes the RGB image and the PNCC (Projected Normalized Coordinate Code) feature map as input, and the output is the updated model parameters, including a 6-dimensional pose, 199-dimensional shape, and 29-dimensional expression coefficients.

A 3D vertex is described by its (x, y, z) position and an (r, g, b) value; after the coordinates are normalized to 0~1 they are called the NCC (Normalized Coordinate Code). If the 3DMM model is projected onto the XY plane and rendered with the Z-buffer algorithm, using NCC as the color map of the Z-buffer algorithm, the result is PNCC.
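A minimal sketch of the NCC idea: normalize the mean shape's coordinates to [0, 1] and treat them as per-vertex colors (the Z-buffer rendering that turns NCC into PNCC is omitted here; the stand-in mean shape is random):

```python
import numpy as np

def normalized_coordinate_code(mean_shape):
    """mean_shape: N x 3 vertex positions of the 3DMM mean face.
    Returns N x 3 values in [0, 1], used as (r, g, b) colors -> NCC."""
    mn = mean_shape.min(axis=0)
    mx = mean_shape.max(axis=0)
    return (mean_shape - mn) / (mx - mn)

ncc = normalized_coordinate_code(np.random.rand(53490, 3))   # stand-in mean shape
# Rendering the posed mesh with NCC as the colour map (via a Z-buffer) would give PNCC.
```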

In addition, the author also introduced PAF (Pose Adaptive Feature). The PAC in the network diagram is Pose Adaptive Convolution, a pose-adaptive feature representation: the face keypoints are projected onto the image, and the directional information between the keypoints in the image is encoded into the PAF. The main role of PAF is to introduce pose information into the model during the training phase, improving the model's robustness and accuracy under pose changes.

Since coefficients of different dimensions have different importance, the author carefully designed the loss function: by introducing weights, the network first prioritizes the pose and important shape parameters, including scale, rotation, and translation. When the face shape is close to the ground truth, the remaining shape parameters are then fitted. Experiments show that this design improves the accuracy of the alignment model.

Since the parametric shape limits the ability to deform the face, after 3DDFA fitting the author extracts HOG features as input and uses linear regression to further improve the localization accuracy of the 2D feature points.

Typical failure causes of 3DDFA include

  • complex shading and occlusion
  • extreme poses and expressions
  • extreme lighting
  • Limited shape variation of 3DMM on the nose

Derivative version: 3DDFA-V2 (2020). The previous version focused on static modeling; this version focuses on dynamic changes, such as correlation between adjacent frames. According to the paper it is more accurate and stable. The project does not release the training code, only a test demo, which can be used as a tool for processing data.

1.4 Challenges of 3DMM

3DMM is classic, but 20 years old (since 1999). Although it has shifted from the early traditional optimization strategy to the coefficient regression of the deep learning model, the current 3DMM still faces many challenges:

  • Limited to human faces, there is no information about eyes, lips and hair. Sometimes this information is necessary and useful.
  • For lower-dimensional spatial parameters, the texture model is relatively simple, and it is difficult to reconstruct details such as facial wrinkles.
  • PCA is mainly used to extract principal component information, but the interpretability is too poor, and it does not conform to the usual description of the face, so it is not a very reasonable feature space.
  • There are many variants of 3DMM, but few such models can achieve optimal results in various scenarios.

2. Methods based on end-to-end general models

With the development of deep learning, researchers began to study using CNN architectures to directly regress and reconstruct 3D face models end to end, rather than just predicting 3DMM coefficients. Representative methods are introduced below.

2.1 PRNet

PRNet (Position map Regression Network) was published in 2018.

3D reconstruction needs to predict the vertex coordinates of a 3D topological mesh, but direct prediction is difficult. The paper proposes PRNet, which uses a UV position map to describe the 3D shape.

A concise translation of PRNet papers_Schrödinger's Alchemy Furnace! Blog-CSDN Blog

In the BFM model the number of 3D vertices is 53490. The PRNet authors chose a $256\times256\times3$ image to encode the 3D vertices; its number of pixels, $256\times256=65536$, is greater than and close to 53490. This map is called the UV position map, and its three channels are X, Y, Z, recording the 3D position information. It is worth noting that each 3D vertex is mapped to the UV position map without overlap.

With the above representation, a CNN with an encoder-decoder structure can be used to directly predict the UV position map.
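A toy encoder-decoder sketch in the spirit of this description, only to show the input/output shapes; it is nothing like the actual PRNet architecture, which uses residual blocks and transposed convolutions in a much deeper network:

```python
import torch
import torch.nn as nn

class TinyUVPosNet(nn.Module):
    """Toy encoder-decoder: RGB image in, 256x256x3 UV position map out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),    # 256 -> 128
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 128
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid()  # 128 -> 256
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = TinyUVPosNet()
uv_pos = net(torch.randn(1, 3, 256, 256))   # -> (1, 3, 256, 256): channels are X, Y, Z
```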

To make the predictions better, or rather more meaningful, the author weights the vertex errors of different regions when computing the loss function. There are four regions in total:

  • Feature points
  • Nose, Eyes, Mouth Area
  • Other parts of the face
  • neck

Their weight ratio is 16:4:3:0; it can be seen that feature points are the most important, and the neck does not participate in the computation.
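A sketch of this region-weighted loss, assuming a per-pixel weight mask with the 16/4/3/0 values laid out in UV space (how the mask is built from the region annotations is omitted; the toy mask below is purely illustrative):

```python
import torch

def weighted_position_map_loss(pred, target, weight_mask):
    """pred, target: (B, 3, 256, 256) UV position maps.
    weight_mask:     (1, 1, 256, 256) with values 16 / 4 / 3 / 0 per region."""
    return torch.mean(weight_mask * (pred - target) ** 2)

# Toy usage: everything weighted like "other face region" (3) except a zero-weight band.
mask = torch.full((1, 1, 256, 256), 3.0)
mask[..., 200:, :] = 0.0   # pretend the bottom rows are the neck
loss = weighted_position_map_loss(torch.randn(2, 3, 256, 256),
                                  torch.randn(2, 3, 256, 256), mask)
```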

Although the paper is long, much of the space is devoted to the shortcomings of previous methods, and there is not much substantive description of the method itself. The implementation is not particularly complicated, but the choice of this dimension conversion is quite clever.

2.2 VRNet

The 3D reconstruction of the face can be equivalent to a depth estimation problem. Currently, there are many successful cases of direct regression of depth information based on CNN. If we discretize the three-dimensional space, and then predict whether there are points in each three-dimensional space, all the points can be combined to complete the reconstruction. This is VRNet (2017).

Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression Reading Notes- Zhihu (zhihu.com)

Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression | IEEE Conference Publication | IEEE Xplore

The core idea of VRNet (Volumetric Regression Network): input a single RGB image and output a $192\times192\times200$ volume of points. It converts the 3D vertex prediction problem into a 3D segmentation problem, using an hourglass network to predict whether each point belongs to the face (0 or 1).

Its training data is the same as 3DDFA's: roughly a combination of the BFM and FaceWarehouse models, i.e., the expression basis of the FaceWarehouse model is added to BFM (in fact, as mentioned above, this is about when BFM gained expression coefficients). Model Fitting is then performed on the 300W dataset to obtain the model parameters, the reconstruction results are used as ground truth, and training is carried out. The loss is a sigmoid cross-entropy loss.
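A minimal sketch of the voxel-occupancy target and the sigmoid cross-entropy loss described above (tensor shapes follow the 192×192×200 description; the hourglass network itself is omitted, and the ground-truth volume is a random stand-in):

```python
import torch
import torch.nn.functional as F

# Predicted occupancy logits and a binary ground-truth volume (1 = inside the face surface).
pred_logits = torch.randn(1, 200, 192, 192)                  # e.g. 200 depth slices of 192x192
gt_volume = (torch.rand(1, 200, 192, 192) > 0.5).float()     # stand-in ground truth

# Sigmoid cross-entropy over every voxel, as used for the volumetric regression target.
loss = F.binary_cross_entropy_with_logits(pred_logits, gt_volume)
```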

In addition, key point detection can also be added to the model as an additional supervision to obtain a VRNet-Guided model, which can further improve the accuracy of the model.

The problem is fairly obvious:

  • It discards semantic information; the 3D face vertices predicted by the CNN have no fixed correspondence and require further alignment.
  • Its reconstruction resolution is low, and most of the output points of the network are not on the surface, resulting in a waste of computing resources.

3. Other difficulties in face 3D modeling

  1. The biggest problem with traditional or improved 3DMM-based methods is that the results are too average and lack detail. Some methods adopt a coarse-to-fine strategy and use shape-from-shading to capture details, but the effect is not satisfactory. In any case, realistically reconstructing the details of a face is very difficult.
  2. Once part of the 2D face is occluded, it is difficult to reconstruct accurately. Besides using the symmetric prior of the face for completion, some methods borrow the idea of retrieval matching: build an unoccluded dataset, perform pose matching and face-recognition similarity matching against the reconstructed model, then apply 2D alignment and use a gradient-based method for texture transfer. But obviously this approach is insensitive to unknown faces.


Origin blog.csdn.net/I_am_Tony_Stark/article/details/132011344