Read SMR 3D reconstruction

Article title: Self-Supervised 3D Mesh Reconstruction from Single Images
Link: https://openaccess.thecvf.com/content/CVPR2021/papers/Hu_Self-Supervised_3D_Mesh_Reconstruction_From_Single_Images_CVPR_2021_paper.pdf
Authors: Tao Hu, Liwei Wang, Xiaogang Xu, Shu Liu, Jiaya Jia
code:

What:

  1. Given a 2D image and its mask as input, the method generates a 3D mesh (similar to Mesh R-CNN) without requiring ground-truth 3D supervision (unlike Pixel2Mesh);
  2. Similar to Pixel2Mesh, the final result is obtained by deforming an ellipsoid (sphere) template mesh;
  3. The authors pose the question of whether attribute-level 3D reconstruction can be achieved with only 2D annotations. The main goal is to reach comparable results with less supervision.
  4. "Category-specific 3D mesh": so is a single category such as birds used in training? That is, can a bird model only reconstruct birds? (Answer: yes, birds can only be birds.)
  5. Evaluated on ShapeNet, BFM (which supports both 2D-supervised and unsupervised reconstruction), and CUB.
  6. No template, camera pose annotation, or part parsing is needed.

Questions before reading:

Are the keypoints obtained in an unsupervised way, like the old unsupervised landmark method from VGG, or like the approach in First Order Motion Model? How are they actually obtained?

  • Answer: the vertices of the initial ellipsoid (sphere) mesh are used directly as the keypoints.
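
A quick sanity check on the keypoint count (my own assumption, not stated in the paper: the sphere template looks like a subdivided icosphere): a level-n icosphere has 10*4^n + 2 vertices, and n = 3 gives exactly the 642 landmarks mentioned later.

```python
# Assumption: the sphere template is a subdivided icosphere (not confirmed by the paper).
# A level-n icosphere has 10 * 4**n + 2 vertices.
for n in range(5):
    print(n, 10 * 4 ** n + 2)  # 12, 42, 162, 642, 2562
# n = 3 gives 642 vertices, matching the 642 predicted keypoints discussed later.
```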

How:

structure:

1. The 3D attributes are: camera parameters, shape, texture, and lighting. These four are what the network has to predict.
2. At the 2D supervision level, the reconstructed model must be able to project back to the 2D image.
3. At the 3D (attribute) supervision level, as shown in Figure 1 of the paper, two consistency losses are proposed: Interpolated Consistency and Landmark Consistency.
4. For Interpolated Consistency, the interpolated 3D attributes can also be rendered and supervised. Some self-supervised loss may be needed on the interpolated mesh (perhaps a GAN loss? Answer: a GAN is used in the code, but it is not mentioned in the paper).
5. For Landmark Consistency, a landmark classification is performed. Judging from the figure, an additional UNet is introduced to distinguish the order of the keypoints and predict which keypoint each one is. (Perhaps a semantic-segmentation-style network? The keypoint index is used directly as the label.)
6. The encoder contains four sub-networks: camera pose prediction, lighting prediction, shape prediction, and texture prediction. Only the texture branch looks like a UNet; the others are ordinary downsampling networks that regress a few parameters.
7. There are V vertices, each with a 3-channel value, which together form S. Is this S the xyz coordinates?
8. A UV map is used for the texture features. Note that this UV map is still H x W x 3; in effect, the pixel positions have been rearranged.
9. The camera parameters are distance, azimuth (horizontal angle), and elevation (vertical angle).

  10. For lighting, spherical harmonics are used; this needs to be checked against the renderer. For now, all I know is that it is a vector of coefficients.
  11. So once we know C, L, S, T, we can actually render a 3D model. (Reader's note: the vertex normals vn are still not discussed here?) This 3D model can then be projected back to 2D to obtain the image Ir and the mask Mr.
  12. On reflection, the camera information still needs background + foreground, and the lighting also needs background information, while shape and texture only need the foreground.
  13. When predicting the azimuth, two values are regressed and combined with atan (atan2), which makes the regression easier for the model. The elevation, by comparison, is presumably regressed directly as a value in [0, 1].
  14. The shape encoder predicts displacements relative to the sphere template mesh, rather than regressing the shape directly.
  15. The texture encoder does not predict the UV map directly; it predicts a 2D flow map and then uses a spatial transformation (grid sampling) to warp the input image into the UV map. A sketch of these four heads follows below.
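
To make the items above concrete, here is a minimal PyTorch sketch of the four encoder heads as I understand them; the backbone, layer sizes, and names are my own placeholders, not the authors' implementation. It shows the azimuth regressed as two values combined by atan2, the elevation as a value in [0, 1], the shape as per-vertex offsets added to a sphere template, the texture as a 2D flow map applied with grid_sample, and the lighting as a spherical-harmonics coefficient vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEncoder(nn.Module):
    """Sketch of the four attribute heads described above (placeholder architecture and sizes)."""
    def __init__(self, feat_dim=512, num_vertices=642, sh_dim=9):
        super().__init__()
        self.backbone = nn.Sequential(                       # stand-in for the downsampling CNN
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.cam_head = nn.Linear(feat_dim, 4)               # distance, elevation, (sin, cos) of azimuth
        self.light_head = nn.Linear(feat_dim, sh_dim)        # spherical-harmonics coefficients
        self.shape_head = nn.Linear(feat_dim, num_vertices * 3)  # per-vertex offsets
        self.flow_head = nn.Sequential(                      # a UNet in the paper; a small conv stack here
            nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, 1, 1), nn.Tanh(),            # 2D flow in [-1, 1]
        )

    def forward(self, img, sphere_vertices):                 # sphere_vertices: V x 3 template
        f = self.backbone(img)
        cam = self.cam_head(f)
        distance = cam[:, 0]
        elevation = torch.sigmoid(cam[:, 1])                 # kept in [0, 1]
        azimuth = torch.atan2(cam[:, 2], cam[:, 3])          # two values -> one angle
        light = self.light_head(f)                           # SH vector
        offsets = self.shape_head(f).view(-1, sphere_vertices.shape[0], 3)
        shape = sphere_vertices.unsqueeze(0) + offsets       # deform the sphere template
        flow = self.flow_head(img).permute(0, 2, 3, 1)       # B x H x W x 2 sampling grid
        uv_texture = F.grid_sample(img, flow, align_corners=True)  # warp the image into the UV map
        return (distance, elevation, azimuth), light, shape, uv_texture
```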

2D reconstruction Loss (Section 3.3.1)

  1. In the most basic case, if ground-truth rendering attributes are available, A and A_{gt} can be supervised directly, as in Eq. 3;
  2. If not, we can project back to 2D as in Eq. 4 and measure the reconstruction quality there;
  3. Specifically, this 2D reconstruction loss has two parts: a foreground L1 loss (Eq. 5) and a mask loss using an IoU loss (Eq. 6). A sketch follows below.
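
A minimal sketch of these two terms as I read Eqs. 5 and 6 (tensor shapes and names are my own assumptions, not the authors' code):

```python
import torch

def reconstruction_loss_2d(img_r, img_gt, mask_r, mask_gt, eps=1e-6):
    """Foreground L1 (Eq. 5) plus a soft-IoU mask loss (Eq. 6), as sketched from the notes above."""
    # L1 loss computed only on foreground pixels, selected by the ground-truth mask
    fg_l1 = torch.abs((img_r - img_gt) * mask_gt).mean()
    # soft IoU between the rendered and ground-truth masks
    inter = (mask_r * mask_gt).sum(dim=(-2, -1))
    union = (mask_r + mask_gt - mask_r * mask_gt).sum(dim=(-2, -1))
    iou_loss = (1.0 - inter / (union + eps)).mean()
    return fg_l1 + iou_loss
```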

Interpolated Consistency Loss (Section 3.3.2)

  1. Eq. 8 feels similar to CycleGAN and MUNIT, with the renderer playing the role of the decoder, but with one difference: this renderer has no learnable parameters. E(R(E(x))) is required to equal E(x).
  2. Eq. 8 is only the most basic form; it is essentially Eq. 11 without interpolation.
  3. This paragraph discusses how each attribute is interpolated. Fig. 4 illustrates the effect nicely.
  4. The value of alpha is sampled randomly. Based on prior experience with mixup, a Beta distribution could be tried during training. A sketch follows below.
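
A rough sketch of the interpolated-consistency idea (Eqs. 8-11) as I read it; the encoder E and the differentiable renderer R are placeholders, and the GAN term used in the released code is omitted:

```python
import torch

def interpolated_consistency_loss(encoder, renderer, img_a, mask_a, img_b, mask_b):
    """E(R(alpha * A_a + (1 - alpha) * A_b)) should match the interpolated attributes."""
    attrs_a = encoder(img_a, mask_a)                 # tuple of attribute tensors (C, L, S, T)
    attrs_b = encoder(img_b, mask_b)
    # alpha is sampled at random; a Beta distribution (as in mixup) is one possible choice
    alpha = torch.distributions.Beta(1.0, 1.0).sample()
    attrs_mix = [alpha * a + (1 - alpha) * b for a, b in zip(attrs_a, attrs_b)]
    img_mix, mask_mix = renderer(*attrs_mix)         # the renderer has no learnable parameters
    attrs_rec = encoder(img_mix, mask_mix)           # re-encode the rendered interpolation
    return sum(torch.abs(r - m).mean() for r, m in zip(attrs_rec, attrs_mix))
```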

Landmark Consistency Loss (Section 3.3.3)

  1. Here a classification network predicts which vertex each projected landmark pixel corresponds to, as in Eq. 12.
  2. There is a v_k that indicates whether a vertex is visible. I am wondering whether to take visibility directly from the mask or to judge it from depth (the z coordinate); using depth feels more reasonable.
  3. Note that the input of this network is an image, not a mesh. A sketch follows after this list.
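
A sketch of the landmark-consistency step as I understand Eq. 12; the UNet feature extractor, the projection of vertices to image coordinates, and the visibility computation are assumed to be provided:

```python
import torch
import torch.nn.functional as F

def landmark_consistency_loss(feat_map, proj_xy, visible, classifier):
    """
    feat_map:   B x C x H x W features from a UNet run on the *image* (not the mesh).
    proj_xy:    B x V x 2 projected vertex locations, normalized to [-1, 1].
    visible:    B x V boolean visibility (e.g. from the depth / z-buffer rather than the mask).
    classifier: maps a C-dim feature to V logits, i.e. which vertex index it is.
    """
    B, V, _ = proj_xy.shape
    grid = proj_xy.view(B, V, 1, 2)                                    # sample one feature per vertex
    sampled = F.grid_sample(feat_map, grid, align_corners=True)        # B x C x V x 1
    sampled = sampled.squeeze(-1).permute(0, 2, 1).reshape(B * V, -1)  # (B*V) x C
    logits = classifier(sampled)                                       # (B*V) x V
    target = torch.arange(V, device=logits.device).repeat(B)           # label = vertex index
    loss = F.cross_entropy(logits, target, reduction="none")
    vis = visible.reshape(-1).float()
    return (loss * vis).sum() / vis.sum().clamp(min=1.0)               # average over visible landmarks
```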

PCK metric

  1. There are different variants of this metric; a common criterion is that a detected keypoint counts as correct if its distance to the ground-truth keypoint is below a threshold (e.g., 150 mm).
  2. But I have a concern: with more predicted points, the PCK score will naturally be higher.
    This paper predicts 642 keypoints, whereas the original CUB dataset only annotates 15 parts. A small PCK sketch follows below.
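
For reference, a minimal PCK sketch (the threshold convention is an assumption; for CUB landmarks a fraction of the image size is often used rather than a millimeter threshold):

```python
import torch

def pck(pred_kp, gt_kp, threshold):
    """Percentage of Correct Keypoints: fraction of predictions within `threshold` of the GT.
    pred_kp, gt_kp: N x 2 (or N x 3) keypoint coordinates; `threshold` is in the same units
    (e.g. 150 mm for 3D poses, or 0.1 * image size for 2D landmarks)."""
    dist = torch.norm(pred_kp - gt_kp, dim=-1)
    return (dist < threshold).float().mean().item()
```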

Loss weights

In the paper, the 2D reconstruction loss has weight 10, IC has weight 1, and LC has weight 0.1.

This is inconsistent with the code, where the 2D reconstruction weight is 1, IC is 0.1, and LC is 0.001.
Relatively speaking, LC is smaller in the code, as the comparison below shows.
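
Normalizing both settings by the 2D reconstruction weight makes the difference explicit (simple arithmetic, not taken from the paper):

```python
# Relative loss weights, normalized by the 2D reconstruction term.
paper = {"recon2d": 10.0, "ic": 1.0, "lc": 0.1}
code = {"recon2d": 1.0, "ic": 0.1, "lc": 0.001}
for name, w in (("paper", paper), ("code", code)):
    print(name, {k: round(v / w["recon2d"], 4) for k, v in w.items()})
# paper {'recon2d': 1.0, 'ic': 0.1, 'lc': 0.01}
# code  {'recon2d': 1.0, 'ic': 0.1, 'lc': 0.001}
# -> the LC term carries ten times less relative weight in the released code.
```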

possible defects

Compared with CMR, it feels like the birds' feet are hard to reconstruct well. Everything else looks OK. Maybe it is a CUB dataset issue? Looking at Fig. 7, the horses and cows look fine.

Finally, you are welcome to check out some of my other articles, thank you for your kindness~

Zheng Zhedong: [New UAV Dataset] From pedestrian re-identification to UAV target positioning

Zheng Zhedong: TOMM | Using CNN to classify 100,000 categories of images

Zheng Zhedong: IJCV | Use Uncertainty to correct pseudo-labels in Domain Adaptation

Does Pytorch have any tricks to save video memory?


Original post: blog.csdn.net/Layumi1993/article/details/120593356