High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies

Reference papers:
1: "High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies"
Purpose: generate a 3D avatar corresponding to a single input photo.
Generating a vivid 3D avatar mainly requires producing: 1) a 3D head model, 2) a normal map, 3) a texture map, and 4) various coefficients (expression, lighting, pose, etc.). As long as the 3D head model and the normal map are accurate, the result is 90% of the way there.

3D head model and coefficient generation

What kind of 3D head model can be considered a high-quality head model?

3D head model generation is the first step in 3D avatar generation. The three-dimensional head shape is composed of tens of thousands of 3D points, as shown in Figure 1, and its visualization is shown in Figure 2.
Figure 1
Figure 2
The quality of a generated 3D head model can only be roughly judged by looking at the model directly. To judge it precisely, you need to visualize it on pictures of the same person taken from different angles: the face contour and the facial features may look different across the 2D pictures, but they must be consistent on the 3D head model. If they are not, the head model has not been generated well.
For example, in Figure 3 (an experiment using images of colleagues, deleted to avoid infringing on portrait rights), the generated head model is visualized on the frontal face and looks acceptable, but after multiplying by the pose coefficients of the corresponding side face and visualizing on the side face, the problems are exposed, indicating that the head model was not generated well.

Figure 3
A good head model should be as shown in Figure 4: no matter what the pose coefficients are, the contour fits well when visualized from different viewing angles.
Figure 4

How to generate a high-quality head model?

How is the 3D head model generated?

s = s̄ + S · Xshp
a = ā + A · Xalb
s is the newly generated 3D head shape; s̄ is the basic (mean) head model; S is the shape basis, i.e. the principal components computed by PCA over N scanned heads; Xshp is the shape parameter vector to be estimated, a 1×500 array.
a is the newly generated color (albedo) map; ā is the basic color map; A is the color basis, likewise computed by PCA over N scanned heads; Xalb is the color parameter vector to be estimated, a 1×199 array.
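A minimal sketch of this linear model (the vertex count, array layout and variable names here are assumptions for illustration, not the paper's actual topology):

```python
import numpy as np

N_VERTS = 20000                           # assumed vertex count, for illustration only
s_mean = np.zeros(3 * N_VERTS)            # s_bar: basic (mean) head shape, flattened xyz
S_basis = np.zeros((3 * N_VERTS, 500))    # S: shape basis (PCA of N scanned heads)
a_mean = np.zeros(3 * N_VERTS)            # a_bar: basic per-vertex color
A_basis = np.zeros((3 * N_VERTS, 199))    # A: color basis (PCA of N scanned heads)

def reconstruct(x_shp, x_alb):
    """s = s_bar + S @ Xshp and a = a_bar + A @ Xalb."""
    s = s_mean + S_basis @ x_shp          # new 3D head shape
    a = a_mean + A_basis @ x_alb          # new per-vertex color
    return s.reshape(-1, 3), a.reshape(-1, 3)
```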

How to obtain the basic head model

Use a high-resolution 3D scanner such as Artec Eva to scan human heads and obtain the corresponding 3D head models. In general, N heads are scanned to obtain N head models (Tencent uses N = 200), and the N models are then averaged vertex-wise to obtain the basic head model.

How to obtain the basic shape basis

Algorithm explanation:
1: Use the 200 scanned head models as the initial set S.
2: Enter the outer loop:
2.1) Randomly draw m = 1000 head models without replacement from a pool of 10,000 head models; record them as D.
2.2) Initialize the iteration counter k and the mean head-model error ε.
3: Enter the inner loop:
3.1) Compute the PCA of S and keep the first k components that explain 99.9% of the variance of S; record them as Sk.
3.2) Express each head model in D with Sk.
3.3) Move the head models with large error (exceeding the mean error ε) into a set M; suppose there are m' of them.
3.4) Mirror all head models in M, giving 2m' models.
3.5) Compute the mean of all errors in M and use it to update ε.
3.6) Merge M into S to obtain the new S.
3.7) Accumulate the iteration count.
If ε falls below the error threshold thresh, end the inner loop and return to the outer loop: randomly draw another 1000 head models from the remaining 9000 and re-enter step 3. In the end, the PCA of the 10,000+ head models is used as the basic shape basis. A rough sketch of this loop is given below.
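The following is a hedged sketch of that bootstrapping loop, assuming each head model is stored as a flattened vertex vector; the PCA energy threshold, error metric, mirroring and stopping conditions are placeholders rather than the paper's exact procedure:

```python
import numpy as np

def pca_basis(S, energy=0.999):
    """PCA of the set S; keep the components explaining `energy` of the variance."""
    mean = S.mean(axis=0)
    U, sig, Vt = np.linalg.svd(S - mean, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(sig**2) / np.sum(sig**2), energy)) + 1
    return Vt[:k], mean

def fit_error(basis, mean, heads):
    """Per-head reconstruction error when expressed in the current basis Sk."""
    coeff = (heads - mean) @ basis.T
    recon = mean + coeff @ basis
    return np.linalg.norm(heads - recon, axis=1)

def mirror(heads):
    # placeholder: a real implementation flips each head about its symmetry plane
    return heads.copy()

def bootstrap_basis(scanned, pool, batch=1000, thresh=1e-3, max_inner=20, seed=0):
    """scanned: (200, d) initial set S; pool: (10000, d) candidate head models."""
    S = scanned.copy()
    rng = np.random.default_rng(seed)
    while len(pool):
        idx = rng.choice(len(pool), size=min(batch, len(pool)), replace=False)
        D, pool = pool[idx], np.delete(pool, idx, axis=0)   # 2.1) draw D without replacement
        eps = None
        for _ in range(max_inner):                          # inner loop
            basis, mean = pca_basis(S)                      # 3.1) PCA of S at 99.9% energy
            err = fit_error(basis, mean, D)                 # 3.2) express D with Sk
            eps = err.mean() if eps is None else eps
            M = D[err > eps]                                # 3.3) poorly expressed heads
            if len(M) == 0 or eps < thresh:
                break
            eps = err[err > eps].mean()                     # 3.5) update the mean error
            S = np.concatenate([S, M, mirror(M)])           # 3.4)+3.6) merge M and its mirrors
    return pca_basis(S)                                     # final basis from all heads
```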

How the shape parameters Xshp are estimated

(Formula: the fitting objective, a weighted sum of the five loss terms listed below, minimized over the parameter vector P.)
By minimizing these five losses through iterative optimization, P is obtained; it contains Xshp with 500 parameters, Xalb with 199 parameters, 27 lighting coefficients and 6 pose coefficients.
The six pose coefficients are: [y-axis translation, x-axis translation, z-axis translation, kx left-right rotation amplitude, ky up-down rotation amplitude, th scaling factor]. By changing the pose coefficients, the pose of the generated 3D head model can be changed, as shown in Figure 4 above.
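As a purely illustrative sketch of how such pose coefficients could be applied to the head vertices (the paper's actual rotation convention and parameterization may differ):

```python
import numpy as np

def apply_pose(verts, pose):
    """verts: (V, 3) head vertices; pose: [ty, tx, tz, kx, ky, th] as described above."""
    ty, tx, tz, kx, ky, th = pose
    cy, sy = np.cos(kx), np.sin(kx)          # kx: left-right (yaw) rotation
    cp, sp = np.cos(ky), np.sin(ky)          # ky: up-down (pitch) rotation
    R_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    R = R_pitch @ R_yaw
    return th * (verts @ R.T) + np.array([tx, ty, tz])   # rotate, scale, then translate
```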
1: Photometric loss: the L1 norm between the original image and the rendered image.

2: Depth image loss. We do not have depth-image data, so this loss is not used.
3: Identity loss: extract deep features of the original image and the rendered image (the fc7-layer features of the VGGFace model), then take the L2 norm between the two feature vectors.
4: Landmark loss: fit the 2D contour points. Tencent fits 86 contour points here; we fit 206 + 86 = 292 points and dynamically adjust the weight of each point, i.e. accurate points get a high weight and inaccurate points a low weight. As a result, the shape details of the head model we generate are more accurate. The figure below shows our fitting points.
(Figure: our fitting points)
By adjusting the positions of the points, the appearance and size of the facial features can be adjusted.

5: Regularization: apply L2 regularization to the shape and texture parameters. A rough code sketch of these loss terms is given below.
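The sketch below lists the loss terms in plain numpy; in practice they are computed with a differentiable renderer and minimized jointly over P by gradient descent, and the weights, inputs and function names here are assumptions:

```python
import numpy as np

def photometric_loss(image, rendered, face_mask):
    return np.abs((image - rendered) * face_mask).mean()        # 1: L1 in the face region

def identity_loss(feat_image, feat_rendered):
    return np.sum((feat_image - feat_rendered) ** 2)            # 3: L2 on VGGFace fc7 features

def landmark_loss(pred_2d, gt_2d, weights):
    # 4: dynamic per-point weights: reliable landmarks count more
    return np.sum(weights[:, None] * (pred_2d - gt_2d) ** 2)

def regularization(x_shp, x_alb, w_shp=1.0, w_alb=1.0):
    return w_shp * np.sum(x_shp ** 2) + w_alb * np.sum(x_alb ** 2)   # 5: L2 on parameters

def total_loss(pho, idt, lan, reg, w=(1.0, 0.2, 1.0, 1e-4)):
    # the depth term (2) is omitted here because no depth data is available
    return w[0] * pho + w[1] * idt + w[2] * lan + w[3] * reg
```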

Texture generation

First, a few kinds of maps:
Albedo map: mainly reflects the texture and color of the model; also called the color map.
Normal map: stores the normal direction of the surface at each point.
UV map: "UV" refers to the u, v texture-map coordinates (analogous to the X, Y, Z axes of the 3D model). A UV map defines, for every point of the texture image, where it lies on the surface of the 3D model, so the surface texture can be placed correctly. UV mapping maps each point of the image onto the model surface, and the software smoothly interpolates the gaps between mapped points; the result is called a UV map.
There are three main parts to texture generation:
(1) Generating the Unwrap
(2) Region fitting
(3) Using pix2pix to refine the texture
These three parts are explained below.

How to get the Unwrap? That is, how to unwrap a 3D model into a 2D texture?

The paper does not go into detail here; you need to look up material online and read it together with the official source code.
Theory: use the texture coordinates (u, v) as the screen position of each vertex (remap the texture coordinates from the [0, 1] range to normalized position coordinates in the [-1, 1] range). Note that the UV mapping of the model must be good, i.e. each point on the texture must map to a unique point on the model, with no overlaps. Then color the unwrapped mesh with the color of each corresponding pixel of the 3D model to obtain the unwrapped 2D map.
In summary, the process is roughly: split the mesh, create a planar mapping, unfold the UV mesh, color the mesh, and so on.
The case of a single frontal face:
Implementation steps:
1: Use the head-model geometry's vertex positions (v), the texture coordinates (vt) and the vertex normals (vn) to unwrap the head model obtained in the previous step and get the position of each UV coordinate point; the size of the unwrapped map is the UV map size.
2: Resize the original image to the same size as the uv_map, and then color each UV coordinate point's pixel through the position index.
3: Perform image Laplacian-pyramid fusion between the basic uv_map and the uv_map obtained in the previous step to obtain the final Unwrap. A simplified sketch of these steps is given below.
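A simplified sketch of the three steps, assuming per-vertex point splatting instead of full triangle rasterization, power-of-two map sizes, and OpenCV for the pyramid fusion; visibility handling is omitted:

```python
import cv2
import numpy as np

def unwrap_frontal(image, verts_2d, uv_coords, uv_size=1024):
    """verts_2d: projected (x, y) pixel position of each vertex in `image`;
       uv_coords: the (u, v) texture coordinate of each vertex, in [0, 1]."""
    h, w = image.shape[:2]
    uv_map = np.zeros((uv_size, uv_size, 3), np.float32)
    xs = np.clip(verts_2d[:, 0].astype(int), 0, w - 1)
    ys = np.clip(verts_2d[:, 1].astype(int), 0, h - 1)
    us = np.clip((uv_coords[:, 0] * (uv_size - 1)).astype(int), 0, uv_size - 1)
    vs = np.clip((uv_coords[:, 1] * (uv_size - 1)).astype(int), 0, uv_size - 1)
    uv_map[vs, us] = image[ys, xs]                 # color each UV point from the photo
    return uv_map

def laplacian_blend(fg, bg, mask, levels=5):
    """Blend fg over bg with a soft mask using Laplacian pyramids (power-of-two sizes)."""
    if mask.ndim == 2:
        mask = np.repeat(mask[:, :, None], 3, axis=2)
    ga, gb, gm = [fg.astype(np.float32)], [bg.astype(np.float32)], [mask.astype(np.float32)]
    for _ in range(levels):
        ga.append(cv2.pyrDown(ga[-1])); gb.append(cv2.pyrDown(gb[-1])); gm.append(cv2.pyrDown(gm[-1]))
    out = ga[-1] * gm[-1] + gb[-1] * (1 - gm[-1])  # blend the coarsest level
    for i in range(levels - 1, -1, -1):
        size = ga[i].shape[1::-1]
        la = ga[i] - cv2.pyrUp(ga[i + 1], dstsize=size)   # Laplacian band of fg
        lb = gb[i] - cv2.pyrUp(gb[i + 1], dstsize=size)   # Laplacian band of bg
        out = cv2.pyrUp(out, dstsize=size) + la * gm[i] + lb * (1 - gm[i])
    return out
```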
The case of multiple pictures (e.g. one frontal face plus left and right side faces):
Implementation steps:
1: Apply steps 1 and 2 above to each of the three pictures to obtain uv_maps from the three viewpoints.
2: Multiply each uv_map obtained in the previous step with its corresponding mask below, then add them together.
3: Finally, perform image Laplacian-pyramid fusion between the basic uv_map and the combined uv_map, using the three masks summed and normalized to [0, 1], to obtain the final Unwrap. A short sketch of this fusion is given after the figure below.
(Figure: the three view masks)
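A short sketch of that multi-view fusion, assuming soft per-view masks in [0, 1] and reusing the `laplacian_blend` helper from the previous sketch (all names are illustrative):

```python
import numpy as np

def fuse_views(uv_maps, masks, base_uv):
    """uv_maps / masks: lists of three arrays (front, left, right) at the same resolution."""
    weighted = sum(m[..., None] * u for u, m in zip(uv_maps, masks))   # mask-weighted sum
    weight_sum = sum(masks)[..., None]
    fused = np.where(weight_sum > 0, weighted / np.maximum(weight_sum, 1e-6), base_uv)
    total_mask = np.clip(sum(masks), 0, 1)          # combined mask, normalized to [0, 1]
    return laplacian_blend(fused, base_uv, total_mask)
```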

The regional pyramid

The region-fitting process uses the results of the regional pyramid, so before going into the details of region fitting we first introduce the regional pyramid and its construction process.

Use a pyramid-based parametric representation to synthesize high-resolution color maps and normal maps

The pyramid construction process:
1) Resize the 200 color maps to two resolutions: 512×512 and 2048×2048.
2) Divide the face area into 8 regions, represented by different colors in the UV map below.
(Figure: the 8 face regions in UV space)
3) In this way, each sample forms a 3-tuple for each region k of the 8 regions (the 512×512 color map, the 2048×2048 color map and the 2048×2048 normal map of that region).
4) Vectorize the 3-tuple and concatenate it into a single one-dimensional vector, then compute PCA over the 200 samples to obtain its principal components.
5) Split the principal components obtained in 4) back into their parts according to the indices, giving the pyramid features of region k. A sketch of this construction is given below.
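A hedged sketch of this construction, assuming the textures are already loaded as arrays and the region masks are available at both resolutions (names, shapes and the number of kept components are placeholders):

```python
import numpy as np

def build_region_pyramid(albedo_512, albedo_2048, normal_2048,
                         region_masks_512, region_masks_2048, n_components=199):
    """albedo_512: (200, 512, 512, 3); albedo_2048 / normal_2048: (200, 2048, 2048, 3);
       region_masks_*: dict region_id -> boolean mask at the matching resolution."""
    pyramids = {}
    for k in region_masks_2048:
        # 3-tuple per sample: low-res color, high-res color, high-res normal of region k
        lo = albedo_512[:, region_masks_512[k]].reshape(200, -1)
        hi = albedo_2048[:, region_masks_2048[k]].reshape(200, -1)
        nm = normal_2048[:, region_masks_2048[k]].reshape(200, -1)
        X = np.concatenate([lo, hi, nm], axis=1)          # concatenate the 3-tuple
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        comps = Vt[:n_components]                         # principal components of region k
        splits = np.cumsum([lo.shape[1], hi.shape[1]])    # split back by index
        pyramids[k] = {
            "mean": np.split(mean, splits),
            "A512": comps[:, :splits[0]],                 # 512 color basis of region k
            "A2048": comps[:, splits[0]:splits[1]],       # 2048 color basis of region k
            "G2048": comps[:, splits[1]:],                # 2048 normal basis of region k
        }
    return pyramids
```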

Region fitting details

Region fitting has two parts:
1: Parameter fitting
2: High-resolution map synthesis
Let's discuss parameter fitting first. The formula below defines the parameter fitting:
(Formula: the parameter-fitting objective; its variables are explained below.)
First, the meaning of the variables in the formula:

The first variable is the unwrap generated in the previous step.
The second variable means that the rough color-map parameters Xalb produced during head-model generation are multiplied, region by region, with A512 (the 512×512-resolution color-map basis from the regional pyramid) and the per-region results are summed into one full color map.
Implementation steps:
1: Compute the L2 norm between the unwrap and the color map reconstructed from Xalb; record it as loss1.
2: Eliminate artifacts at the boundaries of the 8 regions. Method: use the uv_mask to split the whole color map into a boundary mask and a non-boundary mask, compute the boundary distance and the non-boundary distance in the color map respectively, and add the two; record this as loss2.
3: Apply L2 regularization to the texture parameters Xalb.
Minimize the total loss and iterate until the trained Xalb is obtained.
These steps are the parameter-fitting part in the figure above: a better Xalb is obtained by minimizing the loss, iteratively refining the rough Xalb produced during head-model generation. A sketch is given below.
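A rough sketch of this parameter-fitting objective, with the per-region A512 bases and means from the regional pyramid passed in as placeholder dictionaries; the optimizer itself (e.g. gradient descent on Xalb) is omitted:

```python
import numpy as np

def synthesize(x_alb, bases, means, region_masks, size):
    """Sum the per-region reconstructions (mean_k + x_k @ basis_k) into one full map."""
    out = np.zeros((size, size, 3), np.float32)
    for k, mask in region_masks.items():
        rec = means[k] + x_alb[k] @ bases[k]       # reconstruction of region k
        out[mask] = rec.reshape(-1, 3)
    return out

def fitting_loss(x_alb, unwrap, bases512, means512, region_masks, boundary_mask,
                 w_bnd=1.0, w_reg=1e-3):
    recon = synthesize(x_alb, bases512, means512, region_masks, unwrap.shape[0])
    diff = (unwrap - recon) ** 2
    loss1 = diff.sum()                                                          # data term
    loss2 = w_bnd * (diff[boundary_mask].mean() + diff[~boundary_mask].mean())  # boundary term
    loss3 = w_reg * sum(np.sum(x ** 2) for x in x_alb.values())                 # L2 regularization
    return loss1 + loss2 + loss3
```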
Next, we will introduce high-resolution texture synthesis.
6: Multiply the fitted Xalb from the previous part, region by region, with A2048 (the 2048×2048-resolution color-map basis from the regional pyramid) and sum the results into one full color map. This is the final color map produced by the region-fitting part.
7: Multiply the fitted Xalb, region by region, with G2048 (the 2048×2048-resolution normal-map basis from the regional pyramid) and sum the results into one full normal map. This is the final normal map produced by the region-fitting part, as in the short sketch below.
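In terms of the `synthesize` helper from the previous sketch, steps 6 and 7 amount to the following (the bases, means and fitted coefficients are placeholder names):

```python
# reuse synthesize() from the previous sketch, now with the 2048-resolution bases
color_2048  = synthesize(x_alb_fitted, A2048_bases, A2048_means, region_masks_2048, 2048)
normal_2048 = synthesize(x_alb_fitted, G2048_bases, G2048_means, region_masks_2048, 2048)
```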

Color map and normal map composition details

Use two GAN-based networks to synthesize details
3.1) Take the initial color map obtained in the previous step, called the unwrap uv, together with the mask uv.
(Figures: unwrap uv on the left, mask uv on the right)
3.2) Fit the unwrap map region by region using the pyramid features. The left image is the fitted (detailed) color map and the right image is the generated normal map.
(Figures: fitted color map on the left, generated normal map on the right)
3.3) Use the fit_unwrap map as input and obtain more refined color and normal maps at the same resolution through a GAN (the pix2pix network). (pix2pix is a classic CVPR 2017 paper applying GANs to supervised image-to-image translation.) A detailed explanation of pix2pix: https://blog.csdn.net/u014380165/article/details/98453672
3.3.1) Color map generation:
Input: the rough 2048×2048 color map obtained in the previous step.
Output: a refined 2048×2048 color map.
3.3.2) Normal map generation:
Input: the refined color map obtained in this step, concatenated along the channel dimension with the coarser normal map obtained in the previous step.
Output: a refined 2048×2048 normal map. A minimal wiring sketch is given below.
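A minimal sketch of how the two refinement passes are wired together, treating the trained pix2pix generators as opaque callables (the generator interfaces and names are assumptions):

```python
import numpy as np

def refine_textures(color_gan, normal_gan, rough_color_2048, rough_normal_2048):
    """color_gan / normal_gan: callables mapping an HxWxC array to a refined HxWx3 array."""
    fine_color = color_gan(rough_color_2048)                     # 2048x2048x3 -> 2048x2048x3
    # concatenate refined color and rough normal along the channel dimension (-> 2048x2048x6)
    gan_input = np.concatenate([fine_color, rough_normal_2048], axis=-1)
    fine_normal = normal_gan(gan_input)
    return fine_color, fine_normal
```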
(Figures: refined color map on the left, refined normal map on the right)
The overall process: head model and coefficient generation → unwrap generation → region fitting → GAN-based refinement of the color and normal maps.

Origin: https://blog.csdn.net/jiafeier_555/article/details/125428388