[Detailed interpretation of the CVPR 2020 best paper] Unsupervised Learning of Probably Symmetric Deformable 3D Objects

This article is a detailed interpretation of Unsup3D, the CVPR 2020 best paper by Shangzhe Wu of the Oxford VGG group. In addition to the detailed analysis, I took the author's code and replaced the differentiable rendering module, swapping the original Neural Renderer for the Soft Rasterizer; the result was barely passable~.

0. Abstract

The goal of this method is to recover/learn 3D deformable objects from raw single-view images.

Our method is based on an autoencoder that decouples the input image into depth, albedo, viewpoint and illumination.

To disentangle these four factors without supervision, the method relies on the assumption that many object classes are, in principle, structurally symmetric.

The authors model this by predicting, for potentially symmetric objects, a dense map of symmetry probabilities, and learn depth, albedo, viewpoint and illumination in an end2end way.

Compared with a baseline supervised with 2D image correspondences, the authors report that their method performs better!

1. Introduction

[Figure 1] Recovering 3D deformable objects from in-the-wild images.
Training data: only the single-view images themselves (no ground-truth 3D, no multiple views, no other prior model).
Once training is completed, the model can reconstruct the pose, shape, albedo and illumination of a 3D model from a single image to a very high standard.

Understanding the 3D structure in images is of great significance for many CV applications. Furthermore, while many neural networks appear to have gotten better and better at grasping the 2D texture information of images, 3D modeling can account for much of the variability in natural images and potentially improve general understanding of images.

Based on this, Shangzhe Wu set out to study the problem of recovering 3D structure from 2D images. My understanding is that Wu distinguishes this work from methods such as the Max Planck Institute's RingNet by the prior knowledge it relies on: no prior 3D shape model, only the generic symmetry assumption.

In his 2D-to-3D recovery task, the first condition is: no 2D or 3D ground-truth information. This eases the collection of the image dataset and greatly reduces the difficulty of gathering data when applying deep learning to this task.

The second condition is: the algorithm must work with an unconstrained collection of single-view images. This breaks the limitation of methods that reconstruct a 3D model from multiple views of a single instance, and addresses the fact that in many cases we only have a single still image to process.

Therefore, Wu's idea of restoring 3D deformable objects from single-view images is proposed here. The goal is to estimate the 3D shape from a single input image (to produce as output a deep network that can estimate the 3D shape of any instance given a single image of it).

Wu and his advisors use an Auto Encoder structure to inherently decompose the image into reflectance (albedo), depth, lighting and viewpoint (as mentioned above, without any other form of supervision signal).

However, without further assumptions, decomposing an image into these four factors is an ill-posed problem.

In pursuit of minimal assumptions towards this goal, we consider most object classes to be structurally symmetric.

Assuming that an object is perfectly symmetric, you can obtain a virtual second view by simply mirroring the image.

In fact, if correspondences between the image and its mirrored version were available, 3D reconstruction could be achieved through stereo reconstruction.

Based on this, we seek to leverage symmetry as a geometric cue to constrain the decomposition.

However, there is no such thing as perfect symmetry (in either appearance or shape) for a given object.

For example, even if an object is symmetrical in shape and albedo, its appearance may still be asymmetrical due to the influence of asymmetric lighting .

To address this problem, the authors first explore the potential symmetry structure by explicitly modeling lighting. (My understanding is that lighting is regarded as an additional cue rather than a property of the object itself?)

Second, we augment the model to account for potential asymmetries in the object.

Through steps 1 and 2, the model predicts, in addition to albedo and the other factors, an extra dense map containing the probability that a given pixel has a symmetric counterpart in the image.

We have combined the above content into an end2end learning formulation. In this pipeline, all components including the confidence map are learned only from RGB images.

We observe that symmetry can be enforced by flipping internal representations, which is particularly useful for reasoning about symmetries probabilistically.

The method is tested on several datasets (including human faces, cat faces, and cars). The results are very good~, not only exceeding methods that do not rely on 2D or 3D GT information ([45] Lifting AE, ICCVW 2019; [52] Szabó et al., arXiv 2019), but also exceeding a method that uses keypoint supervision ([37], NeurIPS 2018).

[Figure: Lifting AE [45], ICCVW 2019]

[Figure: Szabó et al. [52]]

We demonstrate that our trained face model can generalize to non-natural images, such as face drawings and cartoons, without fine-tuning.

2. Related Work

To position the contribution of this paper relative to previous image-based 3D reconstruction methods, related work is considered mainly from three aspects:

  • ① what category of information is used
  • ② what assumptions are adopted
  • ③ what is produced as output

In Table 1, we compare the similarities and differences between this method and previous papers in these situations.


| Reference | Notes | Author |
| --- | --- | --- |
| [43] | University of Basel, 2008, cited 705 times. This is BFM! | Pascal Paysan |
| [44] | ECCV 2018. The familiar Michael Black, the familiar Max Planck Institute... https://coma.is.tue.mpg.de/ This is the famous CoMA. | Michael J. Black |
| [16] | By Chinese researchers at Stanford and Snapchat. | |
| [47] | University of Maryland, College Park and UC Berkeley. Instead of directly predicting from a single image, it predicts N, A and L from the image, which is very similar to the idea of this paper and may have inspired it. | |
| [60] | Imperial College, IJCV 2019. Also predicts disentangled expressions and other factors to facilitate face manipulation. | |
| [7] | NeurIPS 2019, DIB-R, from Sanja Fidler's group at the University of Toronto (with NVIDIA). https://www.cs.utoronto.ca/~fidler/ Mainly changes the interpolation scheme inside triangles, building on Soft Rasterizer. | Sanja Fidler |
| [52] | University of Bern, Switzerland, 2019. Compared against qualitatively in the experiments. | Szabó et al. |
| [45] | Lifting AE. From CentraleSupélec (formed by a merger in 2015) and Imperial College; Britain and France joined forces~~~ | |
  • Structure from Motion (SfM):
    Structure-from-motion methods are not suitable for estimating/reconstructing 3D deformable objects from the raw pixels of a single view, because they require multiple views or supervision signals between 2D keypoints.

  • Structure from X (SFX):
    Inspired by mirror symmetry ⇒ two-view stereo geometry [11] (USC, 2003) and by Shape from Shading [24] (the 1989 book by Berthold K. P. Horn & Michael J. Brooks, MIT).

[Figure: mirror symmetry ⇒ two-view stereo geometry [11], USC (University of Southern California), 2003]

[Figure: Shape from Shading [24], the 1989 book by Berthold K. P. Horn & Michael J. Brooks (MIT)]

  • Category-specific reconstruction:
    Learning-based methods have recently been widely used to reconstruct objects from single-view images. This task is ill-posed, so some works try to learn a suitable object prior from training data, possibly with a series of supervision signals. Focusing mainly on faces and human bodies, methods like [26, 17, 60, 14] need a predefined shape model (SMPL [34] or BFM [43]) to construct a 3D deformable object from a single image. These prior models are built using specialized hardware and supervision, which is not very friendly for in-the-wild images (data collection is difficult, and building the model is costly).

[Figure: SMPL [34], Max Planck Institute, 2015]

[Figure: the famous Basel Face Model [43], the classic 3D Morphable Model]

Classmate Wu summarizes the research directions of current work:

James Thewlis et al. (Oxford VGG group) learn dense landmarks using equivariance, in order to recover the 2D geometric structure of the object (NeurIPS 2017, NeurIPS 2018).

[Figure: the paper published by James Thewlis et al. at NeurIPS 2017 [54]]

DAE (Deforming Autoencoders) comes from Stony Brook University in the United States and the INRIA research institute in France.
Zhixin Shu et al. predict the deformation field by restricting the AE to a small bottleneck embedding. The output of this idea is very similar to Wu's in this article (ECCV 2018).

Similarly, the idea of adversarial learning has also been introduced here.

The paper published by Hiroharu Kato (University of Tokyo) at CVPR 2019 trains a discriminator on raw images and uses viewpoints as an additional supervision signal.

[Figure: Learning View Priors for Single-view 3D Reconstruction, Hiroharu Kato et al., University of Tokyo, CVPR 2019 [28]]

Szabo and others from the University of Bern in Switzerland used adversarial learning to reconstruct 3D meshes, but did not conduct quantitative analysis.


Some of these experiments also have limitations. For example, in Henzler's experiments, the backgrounds of the test subjects were all white.

In the experimental part, Wu compares against [45] (Lifting AE) and [52] (Szabó et al.) and verifies the effectiveness of his method.

3. Method

Taking human faces as an example: given an unconstrained collection of images, the goal is to learn a model whose input is an image instance and whose output is the 3D shape, albedo, viewpoint and illumination.

As shown in Figure 2, the author calls this Photo-geometric Autoencoding .

Because Wu's method relies on a symmetric structure, a problem arises: the appearance of real objects is never perfectly symmetric, and asymmetry is very common. To solve this problem,

Wu does two things:

  • 1 Explicitly model asymmetric illumination.

  • 2 The model estimates, for each pixel, the probability that it has a symmetric counterpart in the image, via the confidence maps ($\sigma$, $\sigma'$ in Figure 2).

3.1 Photo-geometric autoencoding

Assumption of photo-geometric autoencoding: the object in the input image is (roughly) bilaterally symmetric, i.e. left-right mirror symmetric.

Goal: map the input image I into 4 factors (d, a, w, l):

  • depth map d
  • albedo image a
  • global light direction l
  • viewpoint w

Formula 1: $\hat{I} = \Pi\big(\Lambda(a, d, l),\, d,\, w\big)$

The lighting function $\Lambda$ generates the shaded object in the canonical viewpoint (w = 0) from the depth map, light direction and albedo. The reprojection function $\Pi$ simulates the viewpoint change from canonical to actual and generates the image $\hat{I}$. A reconstruction loss then enforces $I \approx \hat{I}$.
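To make the decomposition above concrete, here is a minimal sketch of what this forward pass computes. It is my own paraphrase, not the author's code: `depth_net`, `albedo_net`, `view_net`, `light_net` and the `shade` / `reproject` callables are hypothetical stand-ins for the four predictors and the $\Lambda$ / $\Pi$ operations.

```python
import torch.nn as nn

class PhotoGeometricAE(nn.Module):
    """Sketch of the decomposition I -> (d, a, w, l) followed by the
    reconstruction I_hat = Pi(Lambda(a, d, l), d, w).  All sub-modules are
    hypothetical placeholders, not the author's actual networks."""
    def __init__(self, depth_net, albedo_net, view_net, light_net, shade, reproject):
        super().__init__()
        self.depth_net, self.albedo_net = depth_net, albedo_net
        self.view_net, self.light_net = view_net, light_net
        self.shade, self.reproject = shade, reproject   # Lambda and Pi

    def forward(self, img):                  # img: (B, 3, H, W)
        d = self.depth_net(img)              # canonical depth map
        a = self.albedo_net(img)             # canonical albedo
        w = self.view_net(img)               # viewpoint (rotation + translation)
        l = self.light_net(img)              # global light direction
        canon = self.shade(a, d, l)          # Lambda: shade in the canonical view
        recon = self.reproject(canon, d, w)  # Pi: warp to the actual view
        return recon, (d, a, w, l)
```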

3.2 Probably symmetric objects

Exploiting symmetry in 3D reconstruction requires identifying symmetrical object points in the image.

This article implements it implicitly:

Assume that depth and albedo are reconstructed in a canonical coordinate frame, symmetric about a fixed vertical plane.

The advantage of doing this is that it can help the model discover the "canonical view" of an object, which is important for reconstruction.

So how to achieve it?

Flip a and d horizontally to obtain a' and d'. If we directly required d = d' and a = a', it would be difficult to achieve a balance (my understanding: enforcing a = a' may push d away from d', and vice versa, so the optimization may never reach the best trade-off).

Therefore, Wu ingeniously achieves this goal in an indirect way: Formula 2.

Formula 2: $\hat{I}' = \Pi\big(\Lambda(a', d', l),\, d',\, w\big)$, where $a'$ and $d'$ are the horizontally flipped albedo and depth.

Well, the symmetry constraints are implicitly implemented through the above.
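A minimal sketch of this indirect symmetry constraint, assuming (B, C, H, W) tensors and reusing the hypothetical `shade` / `reproject` stand-ins from the sketch above; the only new operation is the horizontal flip of depth and albedo.

```python
import torch

def symmetric_reconstructions(a, d, w, l, shade, reproject):
    """Produce I_hat and I_hat' (Formula 2): the second reconstruction uses the
    horizontally flipped albedo a' and depth d', with the same viewpoint and light."""
    a_flip = torch.flip(a, dims=[3])   # flip along the width axis
    d_flip = torch.flip(d, dims=[3])
    recon = reproject(shade(a, d, l), d, w)                       # I_hat
    recon_flip = reproject(shade(a_flip, d_flip, l), d_flip, w)   # I_hat'
    return recon, recon_flip
```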

So, what is the calculation of the reconstruction error (mentioned in 3.1)? Please see formula 3
Formula 3: $\mathcal{L}(\hat{I}, I, \sigma) = -\dfrac{1}{|\Omega|}\sum_{uv \in \Omega} \ln \dfrac{1}{\sqrt{2}\,\sigma_{uv}} \exp\!\left(-\dfrac{\sqrt{2}\,\ell_{1,uv}}{\sigma_{uv}}\right)$, with $\ell_{1,uv} = |\hat{I}_{uv} - I_{uv}|$ and $\sigma$ the predicted confidence map.

Modeling uncertainty is particularly important for our task.

Because we compute not only the error between $I$ and $\hat{I}$, but also the error between $I$ and $\hat{I}'$.

The existence of the confidence map allows us to mine which positions of the input image may not be symmetrical.

Taking the human face as an example, the hair is usually asymmetrical, so the confidence map will assign greater reconstruction uncertainty to the location of the hair (because the hair is asymmetrical!).

Note that this is just an explanation. The specific confidence map value is learned by the model itself based on data-distribution.

In summary, the learning goal of this article is Formula 4.
Formula 4: $\mathcal{E}(\Phi; I) = \mathcal{L}(\hat{I}, I, \sigma) + \lambda_f\, \mathcal{L}(\hat{I}', I, \sigma')$.
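Below is a sketch of how such a confidence-weighted L1 loss (the Laplacian negative log-likelihood form of Formula 3, up to an additive constant) and the combined objective of Formula 4 could be written. The strictly positive `sigma` maps are assumed to come from a softplus head, and `lambda_flip` is an assumed value, not one taken from the paper.

```python
import torch

def confidence_weighted_l1(recon, target, sigma, eps=1e-7):
    """Per-pixel Laplacian NLL (up to a constant): a large sigma (low confidence)
    down-weights the L1 error at that pixel but pays a log(sigma) penalty."""
    l1 = (recon - target).abs().mean(dim=1, keepdim=True)            # (B, 1, H, W)
    loss = (2 ** 0.5) * l1 / (sigma + eps) + torch.log(sigma + eps)
    return loss.mean()

def total_reconstruction_loss(recon, recon_flip, target, sigma, sigma_flip,
                              lambda_flip=0.5):                      # weight is an assumption
    return (confidence_weighted_l1(recon, target, sigma)
            + lambda_flip * confidence_weighted_l1(recon_flip, target, sigma_flip))
```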

3.3 Image formation model

Mapping a point P in the real world to a pixel p is achieved through the mapping of Formula 5. The model assumes a perspective camera with a certain field of view (FOV). The nominal distance between the object and the camera is assumed to be approximately 1 meter, and since the images are cropped around a particular object, a relatively narrow FOV of around 10 degrees is assumed.
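Under these stated assumptions (object roughly 1 m away, image cropped around the object, FOV of about 10 degrees, principal point at the image center), the pinhole intrinsics could be built as in the sketch below; the exact parameterization is my guess at a standard convention rather than a transcription of Formula 5.

```python
import math
import torch

def build_intrinsics(width, height, fov_deg=10.0):
    """Pinhole intrinsics K for a camera with the given horizontal field of view.
    Principal point at the image center; focal length derived from the FOV."""
    cu, cv = (width - 1) / 2.0, (height - 1) / 2.0
    f = (width - 1) / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return torch.tensor([[f,   0.0, cu],
                         [0.0, f,   cv],
                         [0.0, 0.0, 1.0]])
```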

The depth map d assigns a depth value $d_{uv}$ to each pixel (u, v) in the canonical view.

The viewpoint w represents the Euclidean transformation (R, T). The first three values of w represent the rotation angles, and the last three values represent the translation.

The map (R, T) converts the 3D points of the canonical view to the actual view, warping the pixels (u, v) of the canonical view to (u', v') of the actual view (the yellow part, Formula 6).

Formula 6: $p' \propto K\big(d_{uv}\, R\, K^{-1} p + T\big)$, with $p = (u, v, 1)^T$ and $p' = (u', v', 1)^T$.

Finally, the reprojection function $\Pi$ takes the depth map d and the viewpoint change w as input and warps the canonical image $J$ to obtain the actual-view image $\hat{I}$.
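A sketch of the geometric map behind Formula 6, assuming `R` and `T` come from the predicted viewpoint w and `K` from an intrinsics builder such as the one above. The real implementation also needs backward warping / grid sampling to resample the canonical image, which is omitted here.

```python
import torch

def warp_canonical_to_actual(depth, K, R, T):
    """Lift each canonical pixel (u, v) with its depth d_uv to a 3D point,
    apply the rigid motion (R, T), and project back with K to get (u', v').
    depth: (H, W); K, R: (3, 3); T: (3,)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)   # (H*W, 3)
    pts = (torch.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)           # 3D points
    pts = (R @ pts.T).T + T                                                # rigid motion
    proj = (K @ pts.T).T
    uv_actual = proj[:, :2] / proj[:, 2:3]                                 # (u', v')
    return uv_actual.reshape(H, W, 2)
```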

How is the normal n of each pixel constructed? Take $t_{uv}^{u}$ as an example ($t_{uv}^{v}$ follows the same principle); the normal is then obtained as the cross product of the two tangent vectors:
$n_{uv} \propto t_{uv}^{u} \times t_{uv}^{v}$ (normalized to unit length).

With the normal direction of each pixel and the coefficients $k_s$ and $k_d$ of the ambient and diffuse shading terms (predicted by the model and mapped with tanh to lie between 0 and 1), the shaded canonical image can be computed.

The light direction is parameterized by $l_x$ and $l_y$, predicted with tanh, modeling the light direction as a point on a sphere:

$J_{uv} = \big(k_s + k_d\,\max\{0,\ \langle l, n_{uv} \rangle\}\big)\, a_{uv}$, with $l = (l_x, l_y, 1)^T / \|(l_x, l_y, 1)^T\|$.
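A minimal sketch of such an ambient-plus-diffuse shading step; the tensor shapes and the argument names (`k_ambient`, `k_diffuse`) are my assumptions for illustration, not the author's interface.

```python
import torch

def shade(albedo, normals, l_xy, k_ambient, k_diffuse):
    """albedo: (B, 3, H, W); normals: (B, 3, H, W), unit length;
    l_xy: (B, 2) predicted with tanh; k_ambient, k_diffuse: (B, 1, 1, 1)."""
    light = torch.cat([l_xy, torch.ones_like(l_xy[:, :1])], dim=1)   # (l_x, l_y, 1)
    light = light / light.norm(dim=1, keepdim=True)                  # unit light direction
    dot = (normals * light[:, :, None, None]).sum(1, keepdim=True).clamp(min=0)
    shading = k_ambient + k_diffuse * dot                            # (B, 1, H, W)
    return albedo * shading
```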

3.4 Perceptual Loss

In the calculation formula of Formula 3 (measuring the reconstruction error), the application of L1 loss will have some problems: L1 loss is very sensitive to small geometric defects, which can easily cause the reconstructed image to be blurry.

Therefore, based on L1 Loss, we added perceptual loss to alleviate this problem.

Experiments verified that the relu3_3 layer of VGG16 is sufficient as the feature extraction layer for the perceptual loss. Then, combining Formula 3 and Formula 7, the loss function of the entire network is designed as
$\mathcal{L} + \lambda_p \mathcal{L}_p$, where $\lambda_p = 1$.
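As an illustration, a relu3_3 perceptual loss could be sketched with torchvision's VGG16 as below. The layer index is my reading of torchvision's layer list, and the paper additionally weights this loss with its own (lower-resolution) confidence maps, which this sketch omits.

```python
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """Sketch of a relu3_3 feature-matching loss.  In torchvision's vgg16,
    features[:16] ends with relu3_3 (index 15); double-check against your version."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(pretrained=True).features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, recon, target):
        return (self.features(recon) - self.features(target)).abs().mean()
```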

4 Experiments

4.1 Setup

Data set: CelebA, 3DFAW, BFM.

Metric: since 3D reconstruction with a projective camera has an inherent scale ambiguity, the evaluation needs to take this into account.

In Wu's implementation, the scale-invariant depth error (SIDE) is computed between the warped predicted depth map $\hat{d}$ and the GT depth map $d^{*}$. Only valid depth values are compared.


In addition, Wu also measures the quality of the surface reconstruction by comparing the normals computed from the ground-truth depth and from the predicted depth (their mean angular deviation).
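A sketch of how these two metrics might be computed, under my reading of them as the standard deviation of the log-depth difference (SIDE, which is insensitive to a global scale factor) and the mean angular deviation between normal maps; masks and exact definitions should be checked against the paper and its supplementary material.

```python
import torch

def side_error(pred_depth, gt_depth, mask):
    """Scale-invariant depth error: std of the log-depth difference over valid pixels.
    pred_depth, gt_depth: (H, W); mask: (H, W) bool of valid GT pixels."""
    diff = (pred_depth.log() - gt_depth.log())[mask]
    return diff.var(unbiased=False).sqrt()

def mean_angle_deviation(pred_normals, gt_normals, mask):
    """Mean angle (degrees) between predicted and GT unit normals.
    pred_normals, gt_normals: (3, H, W); mask: (H, W) bool."""
    cos = (pred_normals * gt_normals).sum(dim=0).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos[mask])).mean()
```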

According to Wu, the method in this article is significantly improved compared to baseline 3, and baseline 3 can access GT information. This shows that the unsupervised method in this article can learn a good 3D representation.

It can be seen from the ablation study that the albedo flip has the greatest impact (2), followed by using a predicted shading map instead of computing shading from depth & light direction (4), and then the depth flip (3)...

Turning off the confidence map, as in row 7 of Table 3, means that in the losses of Formulas 3 and 7 plain L1/L2 losses with fixed weights are used in place of the confidence maps predicted by the network. Without the confidence map, accuracy does not drop too much (because BFM faces are highly symmetric and have no hair), but the variance increases a lot. To better understand the role of the confidence map, Wu perturbs the faces to make them asymmetric.

Implementation details

① The AE network of depth & albedo does not use skip connection because the input and output images are not spatially aligned.

② Viewpoint and lighting are regressed using a simple encoder network.

③ For depth, albedo, viewpoint and lighting, the last activation layer is tanh; for the confidence maps, the last activation layer is softplus (see the small sketch after this list). Since the photometric and perceptual losses are calculated at different resolutions, these four confidence maps are predicted with the same network at different decoding layers. Moreover, depth needs to be normalized before the tanh.

④ Adam optimizer, image resolution is 64 x 64. Training is about 50k iteration steps, bs=64. Please see the supplementary material for details.

⑤ I observed that the normalization in all networks in Unsup3D is Group Normalization.
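As referenced in item ③ above, here is a small sketch of those output activations. The head tensors are hypothetical, and mean subtraction is only one plausible reading of "normalized before the tanh".

```python
import torch
import torch.nn.functional as F

def apply_output_activations(depth_raw, albedo_raw, conf_raw):
    """depth_raw, albedo_raw, conf_raw: raw decoder outputs of shape (B, C, H, W)."""
    depth = torch.tanh(depth_raw - depth_raw.mean(dim=(2, 3), keepdim=True))  # normalize, then tanh
    albedo = torch.tanh(albedo_raw)
    conf = F.softplus(conf_raw)        # keeps the confidence map strictly positive
    return depth, albedo, conf
```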

4.2 Results

To better evaluate the contribution of the confidence map (i.e., the significance of this paper's uncertainty modeling),

Wu applied asymmetric perturbations to BFM: color patches of random colors (occupying 20 to 50% of the image) were generated and blended with the original data with an alpha value of 0.5 to 1, as shown in Figure 3.
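A sketch of how such a perturbation could be generated; the square patch shape and the uniform sampling ranges beyond the stated 20-50% area and 0.5-1 alpha are my assumptions.

```python
import torch

def perturb(img):
    """Blend a random-colored patch covering ~20-50% of the image with alpha in [0.5, 1].
    img: (B, 3, H, W) in [0, 1]."""
    _, _, H, W = img.shape
    area = torch.empty(1).uniform_(0.2, 0.5).item()
    ph, pw = int(H * area ** 0.5), int(W * area ** 0.5)
    y = torch.randint(0, H - ph + 1, (1,)).item()
    x = torch.randint(0, W - pw + 1, (1,)).item()
    color = torch.rand(1, 3, 1, 1, device=img.device)
    alpha = torch.empty(1).uniform_(0.5, 1.0).item()
    out = img.clone()
    out[:, :, y:y+ph, x:x+pw] = (1 - alpha) * out[:, :, y:y+ph, x:x+pw] + alpha * color
    return out
```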

Then, the perturbed data is trained using the structure without confidence map. The results are shown in Table 4.


It can be seen that confidence maps can make the model resist this kind of noise and disturbance, while the model without confidence map does not have this ability.
(confidence maps allow the model to reject such noise, while the vanilla model without confidence maps breaks.)

Figure 4 shows the reconstruction results for human faces, cars, and cat faces (faces from CelebA and 3DFAW, cat faces from [66, 42], and synthetic cars from ShapeNet).

Even in the case of extreme facial expressions, the reconstructed 3D face contains details of the nose, eyes and mouth.

To further verify the generalization of the model, Wu applied the model trained on CelebA to a series of paintings and cartoon drawings.

As shown in Figure 5, although our method has never seen such images during training, it still works very well.


The model predicts a canonical view of the object that is symmetric about the vertical centerline of the image.

We can visualize these planes of symmetry. As shown in Figure 6, we warp the centerline to the actual view.


As can be seen from Figure 6a, our method can accurately discover symmetry lines under asymmetric texture and lighting conditions.

Figure 6b shows the predicted confidence map overlaid on the image, confirming that the model is able to assign low confidence values to asymmetric regions.

4.3 Comparison with the state of the art

As shown in Table 1, many reconstruction methods more or less require image annotations, prior 3D models or both.

But without such annotations or prior knowledge, the task of reconstruction becomes very difficult, and there is little prior work that can be directly compared.

For [45] and [52], it is not possible to directly obtain their code and trained models for testing, so a direct (qualitative and quantitative) comparison is not feasible.

Wu therefore took the relevant figures from the papers [45, 52] and made a qualitative comparison (Figure 7). It can be seen that the results of this paper are good~.

[Figure 7: qualitative comparison. The method in this paper recovers higher-quality shapes.]

It should be mentioned that the input images in [52] are generated by a GAN.

4.4 Limitations

Although the method performs well in challenging scenarios such as extreme facial expressions and abstract drawings, some failure cases were also observed (Figure 8). During training, a simple Lambertian shading model is assumed; shadows and specularity are ignored. As a result, the method performs poorly under extreme lighting conditions or on surfaces that violate the Lambertian assumption, as shown in Figure 8a.

It can also be seen from Figure 8c that reconstruction of extreme side faces is poor, possibly because the supervision signal from side images is weak. This could be improved by imposing constraints from accurate reconstructions of frontal poses.

5. Conclusions

This article proposes a method that can construct realistic 3D deformable objects based on unconstrained single-view images of certain types of objects (faces, etc.). This method can obtain high-fidelity monocular 3D reconstruction of individual object instances.

This article is entirely based on reconstruction loss rather than any other supervision signals or prior information.

Through the ablation experiments, we demonstrate the importance of the symmetry assumption and of modeling lighting for ideal unsupervised reconstruction.

The model in this article is better than the method using 2D keypoint supervision.

For future work, the model currently uses depth maps to represent 3D shapes from a canonical view, which is sufficient for objects such as faces with roughly convex shapes and natural canonical viewpoints.

For complex objects, the model can be extended to use multiple canonical views or different 3D representations, such as mesh or voxel maps.

6. Code

Starting from the original version, I swapped in a differentiable renderer based on PyTorch3D. You are welcome to try it~ (a minimal sketch of such a renderer setup follows the links below).

  • Original version:https://github.com/elliottwu/unsup3d
  • My version:https://github.com/tomguluson92/unsup3D_pytorch3d
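As mentioned above, here is a minimal sketch of a PyTorch3D renderer setup that could stand in for the original Neural Renderer. The FOV and image size follow the values mentioned in this post (about 10 degrees, 64 x 64); building the mesh and vertex colors from the predicted depth and albedo is not shown, and this is illustrative rather than the exact code in my repository.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer, MeshRasterizer,
    SoftPhongShader, PointLights, TexturesVertex,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cameras = FoVPerspectiveCameras(fov=10.0, device=device)             # narrow ~10 degree FOV
raster_settings = RasterizationSettings(image_size=64, blur_radius=0.0, faces_per_pixel=1)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftPhongShader(device=device, cameras=cameras, lights=PointLights(device=device)),
)

def render(verts, faces, vert_colors):
    """verts: (1, V, 3); faces: (1, F, 3); vert_colors: (1, V, 3), built elsewhere
    from the predicted depth map and albedo."""
    mesh = Meshes(verts=verts, faces=faces,
                  textures=TexturesVertex(verts_features=vert_colors))
    return renderer(mesh)   # (1, 64, 64, 4) RGBA image
```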

The effect of my version:
[result images]


Origin blog.csdn.net/g11d111/article/details/106975135