Single-Image 3D Reconstruction Algorithms

Introduction

Single-image 3D reconstruction algorithms are mostly implemented with deep learning and fall into the following three main technical routes:

  1. Model the target object in the image to directly obtain its three-dimensional shape (a template), then use a second model to handle coloring and lighting.
  2. Start from a given prior body (template) and learn its pose, while sampling the image to learn color features; combining the two yields the reconstructed 3D body. The main implementation relies on a differentiable renderer (also called a neural renderer): the current 3D shape is rendered and projected to 2D, the segmentation map, keypoint coordinates, RGB pixels, etc. of the 2D projection are extracted, and the segmentation and coloring branches are then optimized accordingly.
  3. Use a Neural Radiance Field (NeRF) with volume rendering to learn the three-dimensional structure; the shape and color of the object can be learned directly. This approach is currently the mainstream in academia, and the most mature results include virtual human and animal avatars.

Compared with other algorithms, NeRF-based algorithms can generally generate virtual images with higher pixel quality, higher resolution, and greater sharpness. However, such algorithms cannot generate a result for a specific given input.

Specific Algorithms for Single-Image 3D Reconstruction

Shape Modeling + Color Rendering

This technical route splits single-image 3D reconstruction into two subtasks: shape modeling and color rendering. Each subtask introduces its own sub-model, using the respective state-of-the-art (SOTA) algorithm. One difficulty of this route is parameter tuning: a trade-off has to be made between shape modeling and color rendering.

Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop (WLDO) studies how to reconstruct the shape of animals (mainly dogs), and manages to do so without any 3D ground truth. Reconstruction is built on the SMAL 3D prior together with 2D keypoints and segmentation maps. Specifically, the authors use an encoder for feature learning, then fit shape, pose, and camera parameters from the learned features; combining the three yields the overall body. Because the given prior body shape does not match the actual shapes in the dataset, the authors use an EM algorithm to estimate the shape more accurately: the E-step estimates the expected shape parameters while freezing the updates of the other parameters, and the M-step updates those other parameters. Iterating the two steps learns the full shape, as in the sketch below.
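A minimal sketch of this alternating scheme in PyTorch; `model.reprojection_loss` and `model.num_betas` are hypothetical stand-ins for the keypoint and silhouette losses and the shape dimensionality, not WLDO's actual API:

```python
import torch

def em_fit(model, images, keypoints_2d, masks, rounds=10, steps=100):
    # Expected shape prior refined in the E-step (hypothetical dimensionality)
    expected_shape = torch.zeros(1, model.num_betas, requires_grad=True)
    opt_e = torch.optim.Adam([expected_shape], lr=1e-3)    # E-step parameters
    opt_m = torch.optim.Adam(model.parameters(), lr=1e-4)  # pose/camera/instance shape

    for _ in range(rounds):
        for _ in range(steps):
            # E-step: refine the expected shape while everything else is frozen
            loss = model.reprojection_loss(images, keypoints_2d, masks,
                                           shape_prior=expected_shape)
            opt_e.zero_grad(); opt_m.zero_grad()
            loss.backward()
            opt_e.step()
        for _ in range(steps):
            # M-step: update the remaining parameters under the fixed prior
            loss = model.reprojection_loss(images, keypoints_2d, masks,
                                           shape_prior=expected_shape.detach())
            opt_e.zero_grad(); opt_m.zero_grad()
            loss.backward()
            opt_m.step()
    return expected_shape
```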
After the shape has been learned, the remaining problem is coloring; the algorithm used here is Texformer. Texformer was designed for human body modeling: it exploits the global information of the input image for finer-grained learning, and fuses the input image with color information to produce a complete coloring. The model uses SMPL to predict body pose and a Vision Transformer to learn global information.
Texformer uses a precomputed color map as the query, where each pixel of the map corresponds to a vertex in three-dimensional space; the input image as the value; and a 2D part segmentation map as the carrier that maps the image into UV space. The authors also use a blend mask to combine the texture flow and RGB colors for better color prediction; a sketch of these query/key/value roles follows the figure.
[Figure: Texformer architecture]
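As a rough illustration of those query/key/value roles as a single cross-attention block (a sketch of the idea, not the actual Texformer architecture; all feature maps are assumed to share a channel dimension):

```python
import torch
import torch.nn as nn

class UVCrossAttention(nn.Module):
    """Query: UV color map features; Key: 2D part segmentation features;
    Value: input image features. The output lives in UV space."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, uv_color_feat, part_seg_feat, image_feat):
        # flatten (B, C, H, W) feature grids into token sequences (B, H*W, C)
        q = self.to_q(uv_color_feat.flatten(2).transpose(1, 2))
        k = self.to_k(part_seg_feat.flatten(2).transpose(1, 2))
        v = self.to_v(image_feat.flatten(2).transpose(1, 2))
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # each UV position gathers image features via the parts
```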
To summarize the advantages and disadvantages of this technical route:
Advantages:

  • The pose is trained first and the color second; splitting the task into two stages distributes the work evenly and reduces training difficulty.
  • In theory, each sub-model can be trained to a strong result in its own stage, so the overall quality can be guaranteed.

Disadvantages:
  • Many inputs are required, including input images, masks, keypoints, etc.
  • The shape estimate is produced per category only (a single template per class).
  • The multi-stage pipeline leads to longer training and inference times.

Using a Neural Renderer

Before neural renderers appeared, the basic way to learn a 3D model was to use prepared 3D ground truth: for example, given a toy model and its 3D coordinate information, we would directly regress the 3D parameters to achieve 3D modeling. The neural renderer removes this requirement, because it offers a way to obtain 2D projections directly, so the characteristics of the 3D model can be learned using only 2D ground truth. Compared with requiring 3D ground truth, this is far cheaper and much easier to commercialize. With a neural renderer, end-to-end learning becomes possible, and the learning targets are pose, body shape, camera parameters, and color information. A neural renderer achieves differentiability by optimizing the pixel sampling process; common examples are neural-render, soft-render, and DIB-R. By using a differentiable renderer to render the reconstructed 3D shape and project it back to 2D, the difference between the rendering and the original image can be computed, so the key reconstruction parameters can be estimated and learned quickly; a minimal sketch of this loop follows.
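This is what one optimization step looks like, assuming PyTorch, a `model` that predicts mesh vertices, texture, and camera from an image, and a generic differentiable `renderer` that returns an RGB rendering plus a soft silhouette; both interfaces are illustrative rather than any particular library's API:

```python
import torch
import torch.nn.functional as F

def reconstruction_step(model, renderer, image, mask, optimizer):
    # Predict the 3D parameters from the 2D input
    verts, texture, camera = model(image)
    # Differentiably render the 3D prediction back to a 2D projection
    rgb_pred, sil_pred = renderer(verts, texture, camera)
    # Compare against 2D ground truth only: image pixels and mask
    loss = F.mse_loss(rgb_pred, image) \
         + F.binary_cross_entropy(sil_pred, mask)
    optimizer.zero_grad()
    loss.backward()  # gradients flow through the renderer to the 3D parameters
    optimizer.step()
    return loss.item()
```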

The CMR paper was the first to propose solving the 3D reconstruction problem by learning category-level templates, although the initial template must be computed with structure from motion (SfM), and masks and keypoints are needed for weakly supervised learning; a spherical coordinate mapping is used to map the UV sampling results for rendering and coloring. Please refer to the figure below for the framework. CMR is a very classic paper: UMR and SMR mentioned later, as well as U-CMR (not covered here), are all further improvements on it. In particular, its coloring scheme is the model that the following papers basically imitate; a sketch of the texture-sampling idea follows the figure.

[Figure: CMR framework]
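The coloring scheme amounts to predicting, for every texel of the UV map, where to sample the input image, then filling the texture by bilinear sampling. A minimal sketch, assuming PyTorch and a network that has already predicted the per-texel sampling coordinates (the function name is illustrative):

```python
import torch.nn.functional as F

def sample_uv_texture(image, texture_flow):
    """image:        (B, 3, H, W) input image
    texture_flow: (B, Hu, Wu, 2) predicted sample coordinates in [-1, 1]
    returns:      (B, 3, Hu, Wu) UV texture filled from the image"""
    return F.grid_sample(image, texture_flow, align_corners=True)
```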
The UMR paper attempts to use part segmentation maps in place of masks and keypoints to simplify the 3D reconstruction problem. The authors' view is that an object can be divided into multiple sub-regions, the regions are connected to one another, and the color information within and between regions is coherent; the conversion between 2D and 3D in either direction should preserve this relationship. With this idea, the UMR algorithm does not need to construct a category template, so it has no category restriction. At the same time, the part segmentation map helps UMR delineate object boundaries more clearly, which plays a very important role in learning object color in finer detail. This is part of the reason the Texformer mentioned earlier also chose the part segmentation map; a sketch of the semantic consistency idea follows the figure.
[Figure: UMR part-segmentation-based reconstruction]
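As a rough illustration of the 2D/3D semantic consistency idea (the renderer interface and every name here are assumptions, not UMR's released code): each mesh vertex carries a part probability, those probabilities are rendered into the image plane, and the rendering is compared with the predicted 2D part segmentation.

```python
import torch
import torch.nn.functional as F

def part_consistency_loss(renderer, verts, vert_part_logits, camera, part_labels_2d):
    vert_parts = F.softmax(vert_part_logits, dim=-1)   # (V, P) per-vertex part probs
    rendered = renderer(verts, vert_parts, camera)     # (B, P, H, W) rendered probs
    log_probs = torch.log(rendered.clamp(min=1e-8))
    return F.nll_loss(log_probs, part_labels_2d)       # labels: (B, H, W) part ids
```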
SMR enables modeling by interpolating the key attributes of the 3D reconstruction process. Since the shape, texture, and keypoints of the key body parts of a reconstructed object should stay as consistent as possible with the original image, the authors propose two constraints ((c) and (d) in the figure) to maintain the consistency of the reconstructed object. In addition, a bidirectional 2D -> 3D -> 2D projection keeps the 2D input consistent with the prediction, and a GAN interpolates the camera viewpoint, texture, 3D shape, and other attributes to generate new data that augments the training set and yields better results; a sketch of the cycle constraint follows the figure.
[Figure: SMR, including consistency constraints (c) and (d)]
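A rough sketch of the 2D -> 3D -> 2D consistency constraint, with illustrative `encoder`/`renderer` interfaces (the encoder is assumed to return a tuple of attribute tensors such as shape, texture, and camera):

```python
import torch.nn.functional as F

def cycle_consistency_loss(encoder, renderer, image):
    attrs = encoder(image)              # estimate shape/texture/camera/etc.
    rendered = renderer(*attrs)         # project the 3D attributes back to 2D
    attrs_cycle = encoder(rendered)     # re-estimate attributes from the rendering
    img_loss = F.l1_loss(rendered, image)
    attr_loss = sum(F.l1_loss(a, b.detach()) for a, b in zip(attrs_cycle, attrs))
    return img_loss + attr_loss
```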
Advantages and disadvantages of this technical route:
Advantages:

  • Direct single-stage learning; the framework is more concise and clear.
  • The amount of required supervision is progressively reduced; in the best case only masks are needed to produce the desired result.

Disadvantages:
  • The training objects are assumed to be symmetric, and training needs an initialization template (a sphere). Without a template, or for non-rigid or asymmetric objects, learning becomes significantly harder.
  • Because the learning is self-supervised, with no clearly defined ground truth, it easily converges to a suboptimal state or fails to converge.
  • Limited by object size and complexity: results on complex objects are poor, and fine details are not captured well.

Using Neural Radiance Fields (NeRF)

The Neural Radiance Field (NeRF) is also a recently emerged renderer with functions similar to a neural renderer's, though with its own distinct advantages. A neural radiance field works by combining 3D spatial information with 2D pose information: it learns a view-dependent radiance field and volume density, taking 3D spatial coordinates and the 2D viewing direction as input and mapping them to RGB color values.

The concrete implementation uses a fixed positional encoding plus a multi-layer perceptron (MLP) to translate the input into pixel colors and volume densities, and the posed 2D input is then related to the 3D representation through rendering. Before the neural radiance field, 3D reconstruction either represented the 3D object with a voxel grid, or represented the object's corresponding features implicitly;

The former consumes a lot of memory and can therefore only be used for low-precision 3D reconstruction; the latter requires an additional decoder to decode features into RGB pixels, so multi-view consistency is not good enough. A neural radiance field, compared with grid-based methods, does not discretize space and does not restrict the topology, and learns objects better. Finally, it is worth mentioning that many NeRF implementations are built on GANs, one reason being that GANs compensate well for the lack of training data. A sketch of NeRF's two core steps follows.
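The two pieces described above, the fixed positional encoding and the volume rendering that composites colors and densities along each ray, can be sketched as follows (a minimal PyTorch version of the standard formulation; ray generation and the MLP itself are omitted):

```python
import torch

def positional_encoding(x, num_freqs=10):
    """Map each coordinate through sin/cos at geometrically growing frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    angles = x[..., None] * freqs                    # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                           # (..., dim * 2 * num_freqs)

def volume_render(rgb, sigma, dists):
    """rgb:   (rays, samples, 3) colors predicted by the MLP
    sigma: (rays, samples) volume densities predicted by the MLP
    dists: (rays, samples) spacing between adjacent samples along each ray"""
    alpha = 1.0 - torch.exp(-sigma * dists)          # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                          # opacity x transmittance
    return (weights[..., None] * rgb).sum(dim=-2)    # (rays, 3) pixel colors
```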

Building on the neural radiance field, GRAF introduces a generative adversarial network and trains on unposed images, with the goal of generating 3D reconstruction results at unseen viewpoints. The generator is mainly responsible for sampling 2D image coordinates, obtaining a patch (K x K points) each time, and then sampling N points along the corresponding rays with hierarchical sampling for fine-grained learning. The generator additionally takes two latent codes, z_shape and z_appearance, which learn the body shape and appearance features directly while decoupling the two so they can be controlled separately. The discriminator is mainly responsible for comparing patches sampled from real images with the generated patches. During training, the patches start with a relatively large receptive field that then gradually shrinks; a sketch of this patch sampling follows.
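A rough sketch of the scale-controlled patch sampling (names and defaults are illustrative):

```python
import torch

def sample_patch_coords(img_size, patch_size=32, scale=1.0):
    """Draw a K x K grid of 2D pixel coordinates at a random position;
    `scale` controls the receptive field: large early in training, small later."""
    span = scale * (img_size - 1)                    # patch extent in pixels
    u0 = torch.rand(1) * (img_size - 1 - span)       # random top-left corner
    v0 = torch.rand(1) * (img_size - 1 - span)
    us = u0 + torch.linspace(0, span, patch_size)
    vs = v0 + torch.linspace(0, span, patch_size)
    vv, uu = torch.meshgrid(vs, us, indexing="ij")
    return torch.stack([uu, vv], dim=-1)             # (K, K, 2) coordinates
```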

pi-GAN improves on GRAF. It uses a sinusoidal representation network (SIREN) based on periodic activation functions to strengthen the positional encoding effect in the neural radiance field, generating reconstruction results over wider viewing angles. Compared with GRAF, SIREN is used instead of positional encoding, and a StyleGAN-style mapping network makes the shape and appearance features depend only on the given input. Staged (progressive) training is also used to converge the model gradually; a sketch of the modulated layer follows.
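A minimal sketch of a FiLM-modulated sinusoidal (SIREN) layer of the kind pi-GAN uses in place of positional encoding; the paper's exact architecture and initialization differ:

```python
import torch
import torch.nn as nn

class FiLMSiren(nn.Module):
    """A linear layer with sine activation whose frequency (gamma) and phase
    (beta) are produced by a latent-conditioned mapping network."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, gamma, beta):
        # x: (B, N, in_dim); gamma, beta: (B, 1, out_dim) from the mapping net
        return torch.sin(gamma * self.linear(x) + beta)
```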

ShadeGAN considers the influence of illumination on 3D reconstruction on top of pi-GAN, with the goal of resolving the ambiguity between shape and color that hurts reconstruction quality. The authors argue that a good 3D reconstruction model, when rendered under different lighting conditions, should show little variation in shape. They also propose a surface tracking method to speed up volume rendering. The main change relative to pi-GAN is that a lighting-based constraint is introduced, and the network no longer outputs the final color directly but the pre-shading value, so that lighting can be applied as a post-processing step; the specific shading model is Lambertian, sketched after the figure.
[Figure: ShadeGAN pipeline]
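The Lambertian shading step is simple enough to sketch directly; the coefficient names and defaults are illustrative:

```python
import torch

def lambertian_shading(albedo, normals, light_dir, ambient=0.5, diffuse=0.5):
    """albedo:    (..., 3) pre-shading color output by the network
    normals:   (..., 3) unit surface normals
    light_dir: (..., 3) unit direction toward the light source"""
    n_dot_l = (normals * light_dir).sum(dim=-1, keepdim=True).clamp(min=0.0)
    return albedo * (ambient + diffuse * n_dot_l)    # shaded RGB color
```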

CIPS-3D further improves on pi-GAN. The authors found that existing methods (such as pi-GAN) control the viewing angle implicitly by editing latent vectors, but cannot reconstruct from arbitrary rendering angles at high resolution; at the same time, mirror-symmetry artifacts appear as a suboptimal solution when training is incomplete. They therefore propose modulating the SIREN module to handle the effect of different generated-image scales on reconstruction. They also found that using the viewing direction as input leads to inconsistent imaging across views, so the 3D point is used as input instead. In addition, since the generated results are mirror-symmetric with some probability, an implicit neural representation network is used to map the implicit features to the corresponding RGB pixels, and an auxiliary discriminator is added to handle the mirror-symmetry problem. Experiments show that this treatment works very well.
[Figure: CIPS-3D]
Advantages and disadvantages of this technical route:
Advantages:

  • GANs are used to solve the data scarcity problem, and the SOTA versions need only a single image as input to reproduce the object from multiple angles; compared with the two routes above, the overall cost of the solution is lower.
  • The neural radiance field itself learns 3D features implicitly. Compared with 3D-template-based methods, there is no symmetry requirement, the scope of use extends to non-rigid categories, and the generalization ability is stronger.
  • It offers some interpretability: the learned latent features can be processed to visualize the reconstructed 3D template.

Disadvantages:
  • Can only fit single-image reconstruction along a single axis.
  • Cannot reconstruct from a specific given image (the models are generative rather than conditioned on an input).

Origin: blog.csdn.net/u012655441/article/details/125520540