[Computer Vision | Face Modeling] Learn to regress 3D facial shape and expression from images without 3D supervision

This post is part of a series of deep learning / computer vision paper notes. Please indicate the source when reprinting.

Title: Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

Link: [1905.06817] Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision (arxiv.org)

Summary

The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusion. Robustness requires a large training set of in-the-wild images, which by construction lack ground-truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual's face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D facial features. It uses a novel loss that encourages face shapes to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face with the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can easily be animated. Additionally, we created a new database of faces, NoW (Not quite in-the-Wild), containing 3D head scans and high-resolution images of subjects captured under a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes at http://ringnet.is.tuebingen.mpg.de.

1 Introduction

Our goal is to estimate 3D head and face shape from a single image of a person. Unlike previous methods, we are interested in more than just a tightly cropped region around the face; instead, we estimate the full 3D face, head, and neck. Such a representation is necessary for applications like VR/AR, virtual glasses try-on, animation, biometrics, etc. Additionally, we seek a representation that captures 3D facial expression, factors expression from shape, and can be reposed and animated. Although many methods in the computer vision literature address the facial shape estimation problem [40], no single method meets all of these goals.

Specifically, we train a neural network to regress directly from image pixels to the parameters of a 3D facial model. Here, we use FLAME [21] because it is more accurate than other models, covers a variety of shapes, models the entire head and neck, is easy to animate, and is freely available. However, training a network to solve this problem is challenging because there is little data pairing 3D heads/faces with natural images of people. To make the model robust to imaging conditions, pose, facial hair, camera noise, lighting, etc., we want to train from a large number of images in the wild. Such images by definition lack controlled real 3D data.

This is a general problem in computer vision - finding 2D training data is easy, but learning regression from 2D to 3D becomes difficult when paired 3D training data is very limited and difficult to obtain. In the absence of true 3D, there are several options, but each has problems. Synthetic training data often fails to capture real-world complexity. It is possible to fit 3D models to 2D image features, but this mapping is ambiguous and therefore inaccurate. Due to ambiguity, training neural networks using only the loss between observed 2D features and projected 3D features does not achieve good results (see [17]).

To address the lack of training data, we propose a new method that learns the mapping from pixels to 3D shape without any supervised 2D-to-3D training data. To this end, we learn the mapping using only 2D facial features automatically extracted with OpenPose [29]. To make this possible, our key observation is that multiple images of the same person provide strong constraints on 3D face shape, because the shape remains constant while other factors, such as pose, lighting, and expression, may change. FLAME factors pose and shape, which allows our model to learn what is constant (shape) and factor out what changes (pose and expression).

Although the facial shape of a person is constant across images, we need a training method that allows the neural network to exploit this shape constancy. To this end, we introduce RingNet. RingNet takes multiple images of a person and enforces that the estimated shapes of all image pairs are similar, while minimizing the 2D error between observed features and projected 3D features. While this encourages the network to encode the shapes similarly, we found that it is not sufficient. We therefore also add images of different, randomly chosen people to the "ring" and enforce that the distance in shape space to all other images in the ring is larger than the distance between images of the same person. Similar ideas have been used in manifold learning (e.g., the triplet loss) [37] and face recognition [26], but to our knowledge they have not previously been used to learn a mapping from 2D to 3D geometry. We find that extending triplets to larger rings is critical for learning accurate geometry.

Although we use multiple images of a person during training, note that at test time we only need a single image. With this formulation, we are able to train a network that regresses directly from image pixels to FLAME parameters. Because we train on in-the-wild images, the network is robust across a wide range of conditions, as shown in Figure 1. The method is, however, more general and could be applied to other 2D-to-3D learning problems.

Figure 1: RingNet learns a mapping from the pixels of a single image to the parameters of the FLAME model [21] without 3D supervision. Top: images from the CelebA dataset [22]. Bottom: estimated shape, pose, and expression.

Evaluating the accuracy of 3D face estimation methods remains a challenge; although many methods have been published, no rigorous comparison of their 3D accuracy across a range of imaging conditions, poses, lighting, and occlusions has been performed. To address this, we collected a new dataset called NoW (Not quite in-the-Wild), which contains high-resolution ground-truth scans and high-quality images of 100 subjects taken under a variety of conditions (Figure 2). NoW is more complex than previous datasets, and we use it to evaluate all recent state-of-the-art methods with publicly available implementations. Specifically, we compare against [34], [35], and [9], which are all trained with 3D supervision. Although RingNet uses no 2D-to-3D supervision, it recovers more accurate 3D face shape. We also evaluate the method qualitatively on challenging in-the-wild face images.

Figure 2: The NoW dataset contains images taken under various conditions (top) and high-resolution 3D head scans (bottom). The dark blue region is the area considered in the face challenge.

Overall, the main contributions of our paper are: (1) full face, head, and neck reconstruction from a single face image; (2) RingNet, an end-to-end trainable network that enforces shape consistency across face images of a subject under varying viewing angle, lighting, resolution, and occlusion; (3) a novel shape consistency loss for learning 3D geometry from 2D input; (4) NoW, a benchmark dataset for qualitative and quantitative evaluation of 3D face reconstruction methods; and (5) the model, training code, and new dataset, made freely available to encourage quantitative comparison [25].

2. Related work

There are several approaches to estimating 3D face shape from images. One class of methods estimates depth maps, normals, etc.; that is, representations tied to the image pixels rather than an animatable model of the face. Another class estimates a 3D shape model that can be animated. We focus on the latter. In a recent review, Zollhöfer et al. [40] describe the state of the art in monocular face reconstruction and lay out a set of challenges for the field. Note that the boundaries between supervised, weakly supervised, and unsupervised methods are blurry. Most methods use some form of 3D shape model learned beforehand from scans; we do not call this supervision here. Rather, "supervised" means that paired 2D-to-3D data is used, whether real or synthetic. If a 3D model is first optimized to fit 2D image features, then we say that 2D-to-3D supervision is used. If 2D image features are used during network training but no 3D data is involved, this is weakly supervised in general, but unsupervised with respect to the 2D-to-3D task.

Quantitative evaluation: Quantitative comparisons between methods have been limited due to the lack of common datasets with complex images and high-quality ground truth. Recently, Feng et al. [10] organized a single image to 3D face reconstruction challenge where ground-truth scans of subjects were provided. Our NoW benchmark is complementary to this approach as it focuses on extreme viewing angles, facial expressions and partial occlusions.

Optimization: Most existing methods require tightly cropped input images and/or reconstruct only a tightly cropped region of the face. Most current shape models are descendants of the original Blanz and Vetter 3D Morphable Model (3DMM) [3]. Although there are many variants and improvements to this model, such as [13], we use FLAME [21] here because both its shape and expression spaces are learned from more scans than other models, it is the only model that includes the neck region in its shape space, and it models pose-dependent deformations of the neck under head rotation. Tightly cropped face regions make the estimation of head rotation ambiguous. Until recently, optimization-based fitting was the dominant paradigm [2, 30, 11]. For example, Kemelmacher-Shlizerman and Seitz [18] use multi-image shading to reconstruct from image collections, allowing changes in viewpoint and shape. Thies et al. [33] obtain accurate results on monocular video sequences. Although these methods can achieve good, high-fidelity results, they are computationally expensive.

Learning with 3D supervision: Deep learning methods are rapidly replacing optimization-based methods [35, 39, 19, 16]. For example, Sela et al. [27] use a synthetic dataset to learn an image-to-depth mapping and a pixel-to-vertex mapping, which are combined to generate the face mesh. Tran et al. [34] directly regress the 3DMM parameters of a face model with a deep neural network. Their key idea is to use multiple images of the same subject and fit a 3DMM to each image using 2D landmarks; they then take a weighted average of the fitted meshes and use it as ground truth for training the network. Feng et al. [9] regress from images to a UV position map that records the 3D positions of the face and provides dense semantic correspondence for every point in UV space. All of the above methods use some form of 3D supervision, such as synthetic rendering, 3DMM fits obtained by optimization, or 3DMM-generated UV maps or volumetric representations. None of the fitting-based approaches can produce true ground truth for real-world face images, and synthetically generated faces may not generalize well to the real world [31]. Methods that rely on fitting a 3DMM to images and using the 2D-3D correspondence as pseudo ground truth are always limited by the expressiveness of the 3DMM and the accuracy of the fitting process.

Learning with weak 3D supervision: Sengupta et al. [28] learn to mimic a Lambertian rendering process using a mixture of synthetically rendered and real images. They work with tightly cropped faces and do not produce a model that can be animated. Genova et al. [12] propose an end-to-end learning approach with a differentiable rendering process; they also use synthetic data and its corresponding 3D parameters to train their encoder. Tran and Liu [36] learn a nonlinear 3DMM in a weakly supervised manner using an analytically differentiable rendering layer.

Learning without 3D supervision: MoFA [32] estimates the parameters of a 3DMM and is trained end-to-end with a photometric loss and an optional 2D feature loss. In essence, it is a neural-network version of the Blanz and Vetter model, as it models shape, skin reflectance, and illumination to produce a realistic image that matches the input. The advantage of this approach is that it is much faster than optimization methods [31]. MoFA estimates a tight crop of the face and produces results that look good but have trouble with extreme expressions. Their quantitative evaluation on real images uses the FaceWarehouse model as "ground truth", which is not an accurate representation of true 3D face shape.

All learning methods without any 2D to 3D supervision explicitly model the image formation process (such as Blanz and Vetter) and formulate photometric losses, often combined with 2D facial feature detection with known correspondences to the 3D model. The problem with photometric loss is that the image formation model is always approximate (e.g. Lambertian). Ideally, one hopes that the network will learn not only the shape of the face, but also the complexity of real-world images and their relationship to shape. To this end, our RingNet method only uses 2D facial features without photometric terms. Despite (or because of) this, the method is able to learn directly from pixels to 3D facial shapes. This is one of the least supervised methods published.

3. Proposed method

The goal of our method is to estimate 3D head and face shape from a single face image I. Given an image, we assume that the face has been detected, loosely cropped, and roughly centered. During training, our method utilizes 2D landmarks and identity labels as input. During inference, it only uses image pixels; 2D landmarks and identity labels are not used.

Key ideas:
The key ideas can be summarized as follows:

  1. The shape of a person's face remains the same even if the facial image changes in perspective, lighting conditions, resolution, occlusion, expression, or other factors.
  2. Each person has a unique facial shape (identical twins are not considered). We exploit this idea by introducing a shape consistency loss, embodied in our ring network structure.

RingNet (Figure 3) is an architecture based on multiple encoder-decoders, where weights are shared between encoders and shape constraints are imposed on the shape variables. Each encoder in the ring is a combination of a feature extraction network and a regressor network. Imposing shape constraints on shape variables forces the network to decouple facial shape, expression, head pose, and camera parameters. We use FLAME [21] as a decoder to reconstruct 3D faces from semantically meaningful embeddings, as well as to obtain a decoupling of semantically meaningful parameters (i.e., shape, expression, and pose parameters) in the embedding space.

Figure 3: During training, RingNet takes multiple images of the same person (Subject A) and one image of a different person (Subject B), and enforces shape consistency between images of the same subject and shape inconsistency between different subjects. 3D landmarks computed from the predicted 3D mesh are projected into the 2D domain to compute a loss against the ground-truth 2D landmarks. During inference, RingNet takes a single image as input and predicts the corresponding 3D mesh. Images are from [6]. The figure is simplified for illustration purposes.

We will introduce the FLAME decoder, RingNet architecture and losses in more detail next.

3.1. FLAME model

FLAME uses linear transformations to describe identity- and expression-dependent shape variation, and standard Linear Blend Skinning (LBS) to model rotation around $K = 4$ joints for the neck, jaw, and eyeballs. Parameterized by shape coefficients $\vec{\beta} \in \mathbb{R}^{|\vec{\beta}|}$, pose $\vec{\theta} \in \mathbb{R}^{|\vec{\theta}|}$, and expression $\vec{\psi} \in \mathbb{R}^{|\vec{\psi}|}$, FLAME returns $N = 5023$ vertices.

FLAME models identity-dependent shape variation $B_S(\vec{\beta}; \mathcal{S}): \mathbb{R}^{|\vec{\beta}|} \rightarrow \mathbb{R}^{3N}$, pose-dependent corrective blendshapes $B_P(\vec{\theta}; \mathcal{P}): \mathbb{R}^{|\vec{\theta}|} \rightarrow \mathbb{R}^{3N}$, and expression blendshapes $B_E(\vec{\psi}; \mathcal{E}): \mathbb{R}^{|\vec{\psi}|} \rightarrow \mathbb{R}^{3N}$ as linear transformations with learned bases $\mathcal{S}$, $\mathcal{P}$, and $\mathcal{E}$. Given a template $\overline{\mathbf{T}} \in \mathbb{R}^{3N}$ in the "zero pose", the identity, pose, and expression blendshapes are modeled as vertex offsets from $\overline{\mathbf{T}}$. Each pose vector $\vec{\theta} \in \mathbb{R}^{3K+3}$ contains $(K+1)$ rotation vectors in axis-angle representation, i.e., one vector per joint plus the global rotation. The blend skinning function $W(\overline{\mathbf{T}}, \mathbf{J}, \vec{\theta}, \mathcal{W})$ then rotates the vertices linearly and smoothly around the joints $\mathbf{J} \in \mathbb{R}^{3K}$, using blend weights $\mathcal{W} \in \mathbb{R}^{K \times N}$.

More formally, FLAME is given by:

$$M(\vec{\beta},\vec{\theta},\vec{\psi}) = W(T_P(\vec{\beta},\vec{\theta},\vec{\psi}),\; \mathbf{J}(\vec{\beta}),\; \vec{\theta},\; \mathcal{W}), \tag{1}$$

where

$$T_P(\vec{\beta},\vec{\theta},\vec{\psi}) = \overline{\mathbf{T}} + B_S(\vec{\beta};\mathcal{S}) + B_P(\vec{\theta};\mathcal{P}) + B_E(\vec{\psi};\mathcal{E}). \tag{2}$$

Since different facial shapes require different joint locations, the joints are defined as a function of $\vec{\beta}$. We use Equation 1 to decode our embedding space into a 3D mesh of the complete head and face.
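
To make Equations 1 and 2 concrete, below is a minimal NumPy sketch of a FLAME-style forward model. The parameter names (`shape_basis`, `pose_basis`, `expr_basis`, `joint_regressor`, `skin_weights`) are hypothetical placeholders, and the skinning step is simplified: the released FLAME model ships its own learned bases and a full LBS implementation that chains transforms along the kinematic tree.

```python
import numpy as np


def rodrigues(axis_angle):
    """Convert one axis-angle vector (3,) to a 3x3 rotation matrix."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    axis = axis_angle / angle
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)


def flame_forward(beta, theta, psi, template, shape_basis, pose_basis,
                  expr_basis, joint_regressor, skin_weights):
    """Simplified FLAME-style forward pass (Eqs. 1-2).

    beta: (|beta|,) shape, theta: (3K+3,) pose, psi: (|psi|,) expression.
    template: (N, 3), *_basis: (N, 3, dim), joint_regressor: (K, N),
    skin_weights: (K, N).
    """
    N = template.shape[0]
    K = joint_regressor.shape[0]

    # Eq. 2: add identity, expression, and pose-corrective offsets to the
    # template. The pose blendshapes are driven by the per-joint rotation
    # matrices minus identity, flattened (as in SMPL/FLAME).
    rots = [rodrigues(theta[3 * (k + 1):3 * (k + 2)]) for k in range(K)]
    pose_feature = np.concatenate([(R - np.eye(3)).ravel() for R in rots])
    t_p = (template
           + shape_basis @ beta
           + expr_basis @ psi
           + pose_basis @ pose_feature)

    # Joints depend on identity shape: J(beta), regressed from shaped vertices.
    joints = joint_regressor @ (template + shape_basis @ beta)  # (K, 3)

    # Eq. 1, simplified LBS: rotate each vertex about every joint and blend
    # with the skinning weights, then apply the global rotation theta[:3].
    global_rot = rodrigues(theta[:3])
    out = np.zeros((N, 3))
    for k in range(K):
        rotated = (t_p - joints[k]) @ rots[k].T + joints[k]
        out += skin_weights[k][:, None] * rotated
    return out @ global_rot.T
```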

3.2. RingNet

Recent advances in face recognition (e.g., [38]) and facial landmark detection (e.g., [4, 29]) have led to large image datasets with identity labels and 2D facial landmarks. During training, we assume a corpus of 2D face images $I_i$ with corresponding identity labels $c_i$ and landmarks $k_i$.

The shape consistency assumption can be formalized as $\vec{\beta}_i = \vec{\beta}_j \;\; \forall c_i = c_j$ (i.e., the facial shape of one subject should remain the same across multiple images) and $\vec{\beta}_i \neq \vec{\beta}_j \;\; \forall c_i \neq c_j$ (i.e., the facial shapes of different subjects should differ). RingNet introduces a ring-shaped structure that jointly optimizes shape consistency over an arbitrary number of input images. Shape consistency is described in detail in Section 3.3.

RingNet consists of $R$ ring elements $\{e_i\}_{i=1}^{R}$, as shown in Figure 3, where each $e_i$ contains an encoder and a decoder network (see Figure 4). The encoder weights are shared across the $e_i$, and the decoder is kept fixed during training. The encoder is the combination of a feature extraction network $f_{feat}$ and a regression network $f_{reg}$. Given an image $I_i$, $f_{feat}$ outputs a high-dimensional vector, which $f_{reg}$ then encodes into a semantically meaningful vector (i.e., $f_{enc}(I_i) = f_{reg}(f_{feat}(I_i))$). This vector can be expressed as the concatenation of the camera, pose, shape, and expression parameters, i.e., $f_{enc}(I_i) = [cam_i, \vec{\theta}_i, \vec{\beta}_i, \vec{\psi}_i]$, where $\vec{\theta}_i, \vec{\beta}_i, \vec{\psi}_i$ are FLAME parameters.

Figure 4: A ring element, which outputs a 3D mesh for an input image.

For simplicity, we omit $I$ below and write $f_{enc}(I_i) = f_{enc,i}$ and $f_{feat}(I_i) = f_{feat,i}$. Rather than regressing $f_{enc,i}$ directly from $f_{feat,i}$, the regression network estimates $f_{enc,i}$ through an iterative error feedback loop [17, 7]. At each iteration step, a progressive shift is made from the previous estimate to reach the current estimate. Formally, the regression network takes the concatenation $[f^t_{feat,i}, f^t_{enc,i}]$ as input and outputs $\delta f^t_{enc,i}$. We then update the current estimate by

$$f_{enc,i}^{t+1} = f_{enc,i}^{t} + \delta f_{enc,i}^{t}. \tag{3}$$

This iterative network performs multiple regression iterations. The initial estimate is set to $\vec{0}$. The output of the regression network is then fed to the differentiable FLAME decoder, which outputs the 3D head mesh.
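
A minimal sketch of the iterative error-feedback update of Equation 3, assuming a generic `regressor` callable that maps the concatenated features and current estimate to a parameter update; the function names and the number of iterations are illustrative, not the released implementation.

```python
import numpy as np


def iterative_regression(f_feat, regressor, num_params=159, num_iters=3):
    """Iterative error feedback (Eq. 3): start from a zero estimate and
    repeatedly predict an additive update from [features, current estimate].

    f_feat: (feat_dim,) image feature vector from the feature network.
    regressor: callable mapping a (feat_dim + num_params,) vector to a
               (num_params,) update delta.
    """
    f_enc = np.zeros(num_params)          # initial estimate is the zero vector
    for _ in range(num_iters):
        delta = regressor(np.concatenate([f_feat, f_enc]))
        f_enc = f_enc + delta             # f_enc^{t+1} = f_enc^t + delta^t
    return f_enc
```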

The number of ring elements $R$ is a hyperparameter of our network; it determines how many images are processed in parallel to optimize consistency of $\vec{\beta}$. RingNet allows any combination of images of the same subject and of different subjects to be used simultaneously. Without loss of generality, however, we feed face images of the same identity to $\{e_j\}_{j=1}^{R-1}$ and an image of a different identity to $e_R$. Thus, within each training batch, every slice contains $R-1$ images of the same person and one image of a different person (see Figure 3).
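
A minimal sketch of how such a ring could be assembled from an identity-labeled image list. The `images_by_identity` dictionary is a hypothetical structure introduced for illustration of the "R-1 same / 1 different" scheme, not the released data loader.

```python
import random


def sample_ring(images_by_identity, R=6):
    """Sample one ring: R-1 images of one identity plus 1 image of another.

    images_by_identity: dict mapping identity label -> list of image paths
    (only identities with at least R-1 images should be used for subject A).
    """
    subject_a, subject_b = random.sample(list(images_by_identity), 2)
    same = random.sample(images_by_identity[subject_a], R - 1)
    different = random.choice(images_by_identity[subject_b])
    return same + [different]   # elements e_1 ... e_{R-1}, then e_R
```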

3.3. Shape Consistency Loss

For simplicity, let us call two subjects with the same identity label a "matched pair" and two subjects with different identity labels an "unmatched pair". A key goal of our work is a robust, end-to-end trainable network that produces the same shape from images of the same subject and different shapes for different subjects. In other words, we want our shape estimates to be discriminative. We enforce this by requiring the distance between matched pairs in shape space to be smaller than the distance between unmatched pairs by a margin $\eta$. Distances are computed in the space of facial shape parameters, which corresponds to a Euclidean space of vertices in the neutral pose.

Within RingNet, $e_j$ and $e_k$ produce $\vec{\beta}_j$ and $\vec{\beta}_k$, which form a matched pair when $j \neq k$ and $j, k \neq R$. Similarly, $e_j$ and $e_R$ produce $\vec{\beta}_j$ and $\vec{\beta}_R$, which form an unmatched pair when $j \neq R$. Our shape consistency term is:

$$\left\| \vec{\beta}_j - \vec{\beta}_k \right\|_2^2 + \eta \leq \left\| \vec{\beta}_j - \vec{\beta}_R \right\|_2^2. \tag{4}$$

Therefore, we minimize the following loss when training RingNet end-to-end:

$$L_S = \sum_{i=1}^{n_b} \sum_{j,k=1}^{R-1} \max\left(0, \left\| \vec{\beta}_{ij} - \vec{\beta}_{ik} \right\|_2^2 - \left\| \vec{\beta}_{ij} - \vec{\beta}_{iR} \right\|_2^2 + \eta\right), \tag{5}$$

which is normalized to:

$$L_{SC} = \frac{1}{n_b \times R} \times L_S, \tag{6}$$

where $n_b$ is the batch size of each ring element.
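
A minimal NumPy sketch of the normalized shape-consistency loss in Equations 5 and 6. The array layout (a batch of rings stacked along the first axis) is an assumption for illustration, not the released TensorFlow code.

```python
import numpy as np


def shape_consistency_loss(betas, eta=0.5):
    """Ring shape-consistency loss (Eqs. 5-6).

    betas: array of shape (n_b, R, |beta|) with the shape codes regressed for
           each ring in the batch; elements 0..R-2 share one identity and
           element R-1 belongs to a different identity.
    """
    n_b, R, _ = betas.shape
    same = betas[:, :R - 1, :]                 # beta_{i,1..R-1}
    diff = betas[:, R - 1, :]                  # beta_{i,R}
    loss = 0.0
    for i in range(n_b):
        for j in range(R - 1):
            for k in range(R - 1):
                d_same = np.sum((same[i, j] - same[i, k]) ** 2)
                d_diff = np.sum((same[i, j] - diff[i]) ** 2)
                loss += max(0.0, d_same - d_diff + eta)   # hinge of Eq. 5
    return loss / (n_b * R)                    # normalization of Eq. 6
```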

3.4. 2D Feature Loss

Finally, we compute an L1 loss between the ground-truth 2D landmarks provided during training and the predicted landmarks. Note that we do not predict 2D landmarks directly; instead, we retrieve them from the 3D mesh, whose topology is known.

Given the FLAME template mesh, we define for each OpenPose [29] keypoint the corresponding 3D point on the mesh surface. Note that this is the only place where we provide supervision connecting 2D and 3D, and it is done only once. The keypoints of the mouth, nose, eyes, and eyebrows have fixed corresponding 3D points (referred to as static 3D landmarks), while the position of the contour features changes with head pose (referred to as dynamic 3D landmarks). Similar to [5, 31], we model the contour landmarks as dynamically moving with global head rotation (see Sup. Mat.). To compute this dynamic contour automatically, we rotate the FLAME template between -20 and 40 degrees to the left and right, render the mesh with texture, run OpenPose to predict the 2D landmarks, and project these 2D points onto the 3D surface. The resulting trajectories are transferred symmetrically between the left and right sides of the face.
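
A minimal sketch of how the dynamic contour landmarks could be looked up at training time, assuming a precomputed table of contour vertex indices built once with the rotate/render/OpenPose procedure described above; the table name and the yaw binning are hypothetical.

```python
import numpy as np


def select_landmark_vertices(yaw_deg, static_vertex_ids, contour_vertex_ids,
                             yaw_bins):
    """Pick the 3D landmark vertices for the current head pose.

    static_vertex_ids: (S,) fixed vertex ids for mouth/nose/eyes/eyebrows.
    contour_vertex_ids: (B, C) contour vertex ids, one row per yaw bin,
                        precomputed by rotating the template, rendering it,
                        and projecting OpenPose detections onto the surface.
    yaw_bins: (B,) yaw angles in degrees associated with each row.
    """
    b = int(np.argmin(np.abs(np.asarray(yaw_bins) - yaw_deg)))  # nearest bin
    return np.concatenate([static_vertex_ids, contour_vertex_ids[b]])
```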

During training, RingNet outputs 3D meshes, computes the static and dynamic 3D landmarks for these meshes, and projects these landmarks onto the image plane using the camera parameters predicted by the encoder. We then compute the following L1 loss between the projected landmarks $k_{pi}$ and the ground-truth 2D landmarks $k_i$:

$$L_{\text{proj}} = \left\| w_i \times (k_{pi} - k_i) \right\|_1, \tag{7}$$

where $w_i$ is the confidence score of each ground-truth landmark provided by the 2D landmark predictor; it is set to 1 if the confidence is above 0.41 and to 0 otherwise. The total loss for training RingNet end-to-end is:

$$L_{\text{tot}} = \lambda_{SC} L_{SC} + \lambda_{\text{proj}} L_{\text{proj}} + \lambda_{\vec{\beta}} \left\| \vec{\beta} \right\|_2^2 + \lambda_{\vec{\psi}} \left\| \vec{\psi} \right\|_2^2, \tag{8}$$

where the $\lambda$ are the weights of the individual loss terms and the last two terms regularize the shape and expression coefficients. Since $B_S(\vec{\beta}; \mathcal{S})$ and $B_E(\vec{\psi}; \mathcal{E})$ are scaled by the squared variance, the $L_2$ norms of $\vec{\beta}$ and $\vec{\psi}$ represent the Mahalanobis distance in the orthonormal shape and expression space.
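
A minimal NumPy sketch of the confidence-weighted landmark loss (Eq. 7) and the total loss (Eq. 8), with default weights taken from Section 3.5; the function signatures are illustrative.

```python
import numpy as np


def projection_loss(pred_kp, gt_kp, gt_conf, conf_thresh=0.41):
    """L1 loss between projected and ground-truth 2D landmarks (Eq. 7).

    pred_kp, gt_kp: (L, 2) landmark coordinates; gt_conf: (L,) OpenPose
    confidences. Landmarks below the confidence threshold get weight 0.
    """
    w = (gt_conf > conf_thresh).astype(np.float64)[:, None]
    return np.sum(np.abs(w * (pred_kp - gt_kp)))


def total_loss(l_sc, l_proj, beta, psi,
               lam_sc=1.0, lam_proj=60.0, lam_beta=1e-4, lam_psi=1e-4):
    """Total training loss (Eq. 8): shape consistency + landmark projection
    + L2 regularization of the shape and expression coefficients."""
    return (lam_sc * l_sc + lam_proj * l_proj
            + lam_beta * np.sum(beta ** 2) + lam_psi * np.sum(psi ** 2))
```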

3.5. Implementation details

The feature extraction network uses a pre-trained ResNet50 [15] architecture, which is also optimized during training. It outputs a 2048-dimensional vector that serves as input to the regression network. The regression network consists of two fully connected layers of dimension 512 with ReLU activation and dropout, followed by a final linear fully connected layer with a 159-dimensional output. This 159-dimensional output vector is the concatenation of the camera, pose, shape, and expression parameters. The first three elements represent scale and 2D image translation. The next six elements are the global rotation and the jaw rotation, both in axis-angle representation. The neck and eyeball rotations of FLAME are not regressed, since they are not constrained by the facial landmarks. The next 100 elements are the shape parameters, followed by the 50 expression parameters of FLAME. The differentiable FLAME layer is kept fixed during training. We train RingNet for 10 epochs using the Adam optimizer [20] with a learning rate of 1e-4. The model parameters are $R = 6$, $\lambda_{SC} = 1$, $\lambda_{proj} = 60$, $\lambda_{\vec{\beta}} = 1e{-}4$, $\lambda_{\vec{\psi}} = 1e{-}4$, and $\eta = 0.5$. The RingNet architecture is implemented in TensorFlow [1] and will be publicly released. We use the VGG2 face database [6] as the training dataset, which contains face images with identity labels. We run OpenPose [29] on the database and compute 68 facial landmarks per image. OpenPose fails in many cases; after removing these failures, we obtain around 800,000 images with corresponding identity labels and facial landmarks for our training corpus. We additionally use about 3,000 extreme-pose images with corresponding landmarks provided by [4]. Since we have no identity labels for these extreme images, we duplicate each image with random crops and scalings to form matched pairs.
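
A minimal Keras sketch of the regression head described above (two 512-d fully connected layers with ReLU and dropout, followed by a linear 159-d output). For brevity it omits the iterative error-feedback concatenation of Section 3.2, and the dropout rate is an assumption since the paper does not specify it; this is an illustration, not the released code.

```python
import tensorflow as tf


def build_regressor(num_params=159, dropout_rate=0.5):
    """Regression head on top of 2048-d ResNet50 features; the 159-d output
    holds 3 camera + 6 pose + 100 shape + 50 expression parameters."""
    features = tf.keras.Input(shape=(2048,))          # pooled ResNet50 features
    x = tf.keras.layers.Dense(512, activation="relu")(features)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    params = tf.keras.layers.Dense(num_params)(x)     # final linear layer
    return tf.keras.Model(features, params)


def split_params(params):
    """Split a single 159-d parameter vector into its semantic parts."""
    cam, pose = params[:3], params[3:9]               # scale/translation, rotations
    shape, expr = params[9:109], params[109:159]      # FLAME shape, expression
    return cam, pose, shape, expr
```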

4. Benchmark datasets and evaluation metrics

This section presents our NoW benchmark for the task of 3D face reconstruction from monocular images. The goal of this benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D facial reconstruction methods under changes in viewing angle, illumination, and common occlusions.

Dataset: This dataset contains 2054 2D images of 100 subjects captured with an iPhone X, together with a separate 3D head scan for each subject. The head scan serves as ground truth for the evaluation. Subjects were selected to span variations in age, BMI, and sex (55 women, 45 men).

We divided the captured data into four challenges: neutral (620 images), expression (675 images), occlusion (528 images), and selfie (231 images). The neutral, expression, and occlusion categories contain neutral, expressive, and partially occluded face images of all subjects, ranging from frontal to profile views. The expression category covers different expressions such as happiness, sadness, surprise, disgust, and fear. The occlusion category contains images with various occlusions, such as glasses, sunglasses, facial hair, hats, or hoods. For the selfie category, participants were asked to take selfies with the iPhone, without any constraints on the facial expression. The images were captured indoors and outdoors to provide variation in natural and artificial light.

The challenge for all categories is to reconstruct a neutral 3D face given a monocular image. Note that facial expressions are present in several images, requiring methods to separate identity and expression in order to evaluate the quality of the predicted identities.

Capture setup: For each subject, we captured raw head scans of neutral expressions using an active stereo system (3dMD LLC, Atlanta). The multi-camera system includes six pairs of grayscale stereo cameras, six color cameras, five speckled pattern projectors and six white LED panels. The reconstructed 3D geometry contained approximately 120,000 vertices for each subject. Each subject wore a hood during scanning to avoid occlusion of the face or neck area due to hair and scanner noise.

Data processing: Most existing 3D face reconstruction methods require the face to be localized. To mitigate the influence of this preprocessing step, we provide a bounding box covering the face for each image. To obtain the bounding boxes, we first run a face detector [38] on all images and then predict keypoints [4] for each detected face; for failure cases we select the 2D landmarks manually. We then expand the landmark bounding box by 5% (bottom), 10% (left and right), and 30% (top) to obtain a box that covers the entire face, including the forehead. For the face challenge, we follow a processing protocol similar to [10]: for each scan, the center of the face is selected and the scan is cropped by removing everything outside a specified radius. The radius is subject-specific and is computed as 0.7 × (outer eye distance + nose distance) (see Figure 2).
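
A minimal sketch of the bounding-box expansion described above, assuming the landmark bounding box is given in pixel coordinates with the origin at the top-left; the helper name is hypothetical.

```python
def expand_face_bbox(left, top, right, bottom):
    """Expand a 2D-landmark bounding box by 5% (bottom), 10% (left/right),
    and 30% (top) of its size so the box covers the whole face, including
    the forehead."""
    width, height = right - left, bottom - top
    return (left - 0.10 * width,      # left
            top - 0.30 * height,      # top (extra margin for the forehead)
            right + 0.10 * width,     # right
            bottom + 0.05 * height)   # bottom
```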

Evaluation metrics: The challenge is to reconstruct a neutral 3D face given a single monocular image. Since the predicted meshes live in different local coordinate systems, each reconstructed 3D mesh is first rigidly aligned (rotation, translation, and scale) to the ground-truth scan using a corresponding set of landmarks between the prediction and the scan. We further refine this rigid alignment based on the scan-to-mesh distance (i.e., the absolute distance between each scan vertex and the closest point on the mesh surface), using the landmark alignment as initialization. The error for each image is then computed as the scan-to-mesh distance between the ground-truth scan and the reconstructed mesh. We report cumulative error plots over all distances, as well as the median, mean, and standard deviation.
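
A minimal sketch of the error computation, assuming the rigid alignment has already been applied. Note the simplification: the benchmark measures the distance to the closest point on the mesh *surface*, which needs point-to-triangle queries, whereas this sketch uses nearest-vertex distances (an upper bound) to stay short.

```python
import numpy as np
from scipy.spatial import cKDTree


def scan_to_mesh_error(scan_points, mesh_vertices):
    """Approximate scan-to-mesh error statistics.

    scan_points: (S, 3) ground-truth scan vertices (after rigid alignment).
    mesh_vertices: (V, 3) vertices of the reconstructed mesh.
    """
    dists, _ = cKDTree(mesh_vertices).query(scan_points)   # nearest vertex
    return {"mean": float(np.mean(dists)),
            "median": float(np.median(dists)),
            "std": float(np.std(dists))}
```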

How to participate: To participate in the challenge, we provide a website [25] for downloading test images and uploading the reconstruction results and selected landmarks for each registration. Then, the error metric is automatically calculated and returned. Note that we do not provide ground truth scans to prevent fine-tuning on test data.

5. Experiments

We conduct qualitative and quantitative evaluations of RingNet and compare it with publicly available methods: PRNet (ECCV 2018 [9]), Extreme3D (CVPR 2018 [35]), and 3DMM-CNN (CVPR 2017 [34]).

Quantitative evaluation: We compare different methods on [10] and our NoW dataset.

Feng et al.’s benchmark: Feng et al. [10] describe a benchmark dataset for evaluating 3D facial reconstruction from a single image. They provide a test dataset containing facial images and their corresponding 3D ground-truth face scans from a subset of the Stirling/ESRC 3D face database. The test dataset contains 2000 2D neutral face images, including 656 high-quality (HQ) images and 1344 low-quality (LQ) images. High-quality images are taken in controlled scenes, while low-quality images are extracted from video frames. This data focuses on neutral faces, while our data has higher variability in expression, occlusion and lighting, as discussed in Section 4.

Note that the methods we compare against (PRNet, Extreme3D, 3DMM-CNN) use 3D supervision during training, while our method does not. PRNet [9] requires tightly cropped face regions to achieve good results and performs poorly on the loosely cropped input images of the benchmark database (see Sup. Mat.). Rather than cropping the images for PRNet, we run it on the given images and record when it succeeds: it outputs 918 meshes for the low-quality test images and 509 meshes for the high-quality images. For the comparison with PRNet, we run all other methods only on the 1427 images on which PRNet succeeds.

We calculate the error using the method in [10], which calculates the distance from the ground truth scan point to the estimated mesh surface. Figure 5 (left and center) shows the cumulative error curves of different methods on low- and high-quality images; RingNet outperforms other methods. Table 1 reports the mean, standard deviation, and median error.

Figure 5: Cumulative error curves. From left to right: low-quality data from [10], high-quality data from [10], and the NoW dataset face challenge.

Table 1: Statistics of Feng et al. [10] benchmark

NoW face challenge: For this challenge, we evaluate the different methods on cropped scans, as in [10]. We first rigidly align the predicted meshes of all compared methods and then compute the scan-to-mesh distance between each predicted mesh and the scan, as in [10]. Figure 5 (right) shows the cumulative error curves of the different methods; again, RingNet outperforms the other methods. We report the mean, median, and standard deviation of the error in Table 2.

Table 2: Statistics of face challenge on NoW dataset.

Qualitative results: Here we show qualitative results for estimating 3D face/head meshes from single face images of the CelebA [22] and MultiPIE [14] datasets. Figure 1 shows results from RingNet, illustrating its robustness to expression, gender, head pose, hair, occlusion, etc. Figures 6 and 7 demonstrate the robustness of our method under different conditions such as lighting, pose, and occlusion. A qualitative comparison is provided in Sup. Mat.

Figure 6: RingNet’s robustness to different lighting conditions. Images are from MultiPIE dataset [14].

Figure 7: RingNet’s robustness to occlusion, pose changes, and lighting changes. Images are from NoW dataset.

Ablation study: Here we compare different choices of the number of ring elements $R$, which motivates the ring architecture of RingNet. We evaluate these values on a validation set of 10 subjects (six from [8] and four from [21]). For each subject, we select one neutral scan and two to four images, reconstruct a 3D mesh for each image, and measure the scan-to-mesh reconstruction error after rigid alignment. Using a ring with more elements reduces the error compared to a single triplet loss, but it also increases training time. To balance time against error, we chose $R = 6$ (see Table 3).

Table 3: Effect of the number of ring elements R. We evaluate on the validation set described in the ablation study.

6 Conclusion

We address the challenging problem of learning to estimate the 3D, articulated, and deformable shape of faces from a single 2D image without paired 3D training data. We apply the RingNet model to faces, but the formulation is general. The key idea is a set of pairwise losses that encourage solutions in which images of the same person share the same shape and images of different people have different shapes. We exploit the FLAME face model to factor facial pose and expression from shape, so that RingNet can keep the shape fixed while allowing the other parameters to vary. Our method requires a dataset in which some people appear multiple times, together with 2D facial features that can be estimated with existing methods. The only 2D-to-3D correspondence we provide is between the standard 2D facial features and vertices of the 3D FLAME model. Unlike previous methods, we neither optimize a 3DMM to 2D features nor use synthetic data. Competing methods typically exploit a photometric loss based on approximate generative models of facial albedo, reflectance, and shading; RingNet needs none of this to learn the relationship between image pixels and 3D shape. Furthermore, our formulation captures the entire head and its pose. Finally, we created a new public dataset with accurate ground-truth 3D head shapes and high-quality images taken under a wide range of conditions. Surprisingly, RingNet outperforms methods that use 3D supervision.

This opens many directions for future work, such as extending RingNet with [24]. Here we focus on the case without 3D supervision, but this could be relaxed to use supervision where available; we expect that a small amount of supervision would improve accuracy, while a large in-the-wild image dataset provides robustness to lighting, occlusion, etc. Our 2D feature detector does not include the ears, although ears are highly distinctive; adding 2D ear detections would further improve 3D head pose and shape. Although our model stops at the neck, we plan to extend it to the full body [23]. It will be interesting to see whether RingNet can be extended to reconstruct 3D body pose and shape from images using only 2D joints, potentially going beyond current methods such as HMR [17] by also learning about body shape. While RingNet learns a mapping to an existing 3D face model, we could relax this and additionally optimize over a low-dimensional shape space to learn a more detailed shape model from examples. To this end, integrating shading cues [32, 28] would help constrain the problem.

Acknowledgments: We thank T. Alexiadis for building the NoW dataset, J. Tesch for rendering the results, D. Lleshaj for annotations, A. Osman for the supplementary video, and S. Tang for helpful discussions.

Disclosure: Michael J. Black has received research gifts from Intel, Nvidia, Adobe, Facebook, and Amazon. He is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH. His research is conducted exclusively at MPI.

References

(……)

Appendix

In the following section, we show the cumulative error plots for each challenge (Neutral, Figure 8; Expression, Figure 9; Occlusion, Figure 10; Selfie, Figure 11) on the NoW dataset. The right side of Figure 5 shows the cumulative error across all challenges.

Figure 8: Cumulative error curve for the neutral challenge.

Figure 9: Cumulative error curve of expression challenge.

Figure 10: Cumulative error curve for occlusion challenge.

Figure 11: Cumulative error curve for selfie challenge.

Origin blog.csdn.net/I_am_Tony_Stark/article/details/133644334