Study notes: Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition


NeRF stands for Neural Radiance Field.

The main problem to be solved: recovering a 3D model from images.

What is a 3D model?

Abstractly speaking, a 3D model consists mainly of shape and appearance. Appearance can be roughly divided into material and lighting. Once we have this basic information, we only need a rendering algorithm to obtain an image. The forward process from model to image is called rendering, and the reverse process of recovering the model from images is called inverse rendering.

Shape representation:

There are four mainstream representation methods, namely Mesh, Point Cloud, Occupancy Field, and Signed Distance Field.

 

Of course, there are also other representations, such as voxels and multi-view images, which are not listed here. Interested readers can look up the relevant material on their own.

The shape representation used by NeRF can be called a soft shape: starting from an empty 3D volume, the object is gradually built up according to what the images require. NeRF samples many points along each ray, as shown in the figure below:

 

This is also one of the keys to NeRF standing out among neural reconstruction methods: earlier neural networks typically reconstructed 3D models with hard geometry rather than soft geometry.
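To make the "soft" volumetric idea concrete, here is a minimal NumPy sketch of sampling points along a ray and alpha-compositing their densities and colors into a pixel color, in the spirit of NeRF's volume rendering; `density_fn` and `color_fn` are toy placeholder fields, not the actual networks.

```python
import numpy as np

def render_ray(origin, direction, density_fn, color_fn, near=2.0, far=6.0, n_samples=64):
    """Volume-render one ray by sampling points and alpha-compositing (simplified sketch)."""
    t = np.linspace(near, far, n_samples)                  # sample distances along the ray
    pts = origin + t[:, None] * direction                  # (n_samples, 3) sample positions
    sigma = density_fn(pts)                                 # volume density at each sample
    rgb = color_fn(pts)                                     # color at each sample
    delta = np.append(np.diff(t), 1e10)                     # distance between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)                    # opacity of each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))    # accumulated transmittance
    weights = alpha * trans                                  # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)              # composited pixel color

# Toy usage: a fuzzy sphere of radius 1 centered at the origin.
density_fn = lambda p: np.clip(1.0 - np.linalg.norm(p, axis=-1), 0, None) * 10.0
color_fn = lambda p: np.ones((p.shape[0], 3)) * 0.8
pixel = render_ray(np.array([0.0, 0.0, -4.0]), np.array([0.0, 0.0, 1.0]), density_fn, color_fn)
```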

There are two major advantages to the soft approach:

① No need for object segmentation masks

② No boundary discontinuity problem, which makes differentiable rendering easier

Of course, there are also some disadvantages, such as high rendering cost and difficulty in editing.

Readers interested in this approach can refer to the following papers:

Yu A., Fridovich-Keil S., Tancik M., et al. Plenoxels: Radiance Fields without Neural Networks. 2021.

Sun C., Sun M., Chen H.-T. Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction. 2021.

 

Next, we introduce the neural network part of NeRF:

We can use a neural network to fit these mappings as functions, for example mapping a coordinate to color, to occupancy, or to volume density (density).

The problem is that neural networks have a representation bias: as shown in the figure below, they tend to produce overly smooth results, whereas we want sharp results, so high-quality reconstructions may not be obtained directly. Therefore, NeRF maps the input coordinates into Fourier features (positional encoding) represented by sin and cos. As can be seen in the figure below, the results improve significantly.
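A minimal sketch of such a Fourier-feature mapping (positional encoding) is given below; the number of frequency bands is an arbitrary choice for illustration, not a value taken from the paper.

```python
import numpy as np

def positional_encoding(x, num_bands=10):
    """Map coordinates x (..., d) to Fourier features [x, sin(2^k pi x), cos(2^k pi x)]."""
    feats = [x]
    for k in range(num_bands):
        freq = (2.0 ** k) * np.pi
        feats.append(np.sin(freq * x))
        feats.append(np.cos(freq * x))
    return np.concatenate(feats, axis=-1)   # (..., d * (2 * num_bands + 1))

# A 3D point becomes a 63-dimensional feature vector with 10 bands.
print(positional_encoding(np.array([0.1, -0.4, 0.7])).shape)   # (63,)
```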

We call this way of representing signals with neural networks Neural Fields. This representation has three major benefits:

① Each 3D model needs only about 10 MB to represent, which makes it convenient to transmit

② The scene does not need to be discretized

③ More flexible and easier to optimize: for example, regularization is more convenient and generalization is stronger


Next, we introduce the shooting scenarios of five types of NeRF , as shown in the figure below:

①The object is in the center and the camera shoots around, often used to reconstruct the subject

②The camera direction is fixed and moves in a small range

③Similar to panorama shooting, the camera is in the middle and shoots in all directions, usually used to reconstruct the background

④ Shooting in random directions and distribution in a fixed space

⑤ is a combination of ① and ③, which not only reconstructs the main body of the object, but also reconstructs the background.

The main problem NeRF needs to solve here is that when the near content is reconstructed well, the distant background becomes blurred, and conversely, when the distant background is reconstructed well, the near content becomes blurred. We would like both near and far content to be reconstructed sharply.

To solve this problem, NeRF++ proposes a decompose-and-compose solution: use one NeRF to represent the foreground and another to represent the background, and finally composite them together, as shown below:

The fifth capture scenario mentioned above is used here, and the final result is very good, as shown in the following figure:

Readers interested in the specific details can refer to the following paper:

Zhang K., Riegler G., Snavely N., et al. NeRF++: Analyzing and Improving Neural Radiance Fields. 2020.


Next, we introduce the anti-aliasing issue in NeRF:

Anti-aliasing is a classic problem in graphics: image aliasing caused by an insufficient sampling rate. NeRF therefore also needs anti-aliasing.

However, we do not want to anti-alias by rendering at a low resolution and downsampling as in traditional NeRF, because that hurts image quality at high resolution and lowers PSNR. Mip-NeRF therefore works on the positional encoding that maps inputs into Fourier features represented by sin and cos, as shown below:

Because the signal contains frequency components that are too high for the sampling rate to satisfy the Nyquist sampling theorem, Mip-NeRF introduces a low-pass filter, i.e., it low-pass filters the Fourier features; the size of this filter depends on the pixel footprint.
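Below is a simplified sketch in the spirit of Mip-NeRF's filtered encoding: each frequency band is attenuated by a Gaussian factor that depends on the sample's footprint, so frequencies above the pixel's Nyquist limit are suppressed. This is an illustrative approximation, not the exact integrated positional encoding formula from the paper.

```python
import numpy as np

def filtered_positional_encoding(x, footprint_var, num_bands=10):
    """Positional encoding with per-band low-pass attenuation (simplified sketch).

    x:             coordinates, shape (..., d)
    footprint_var: variance of the pixel footprint at this sample, shape (..., d)
    Each frequency band is damped by a Gaussian factor, so bands above the
    pixel's Nyquist limit contribute almost nothing.
    """
    feats = []
    for k in range(num_bands):
        freq = 2.0 ** k
        damp = np.exp(-0.5 * (freq ** 2) * footprint_var)   # higher freq -> stronger damping
        feats.append(np.sin(freq * np.pi * x) * damp)
        feats.append(np.cos(freq * np.pi * x) * damp)
    return np.concatenate(feats, axis=-1)

# A far-away sample (large footprint) keeps only the low-frequency bands.
x = np.array([0.3, 0.5, -0.2])
print(filtered_positional_encoding(x, footprint_var=np.full(3, 1e-4)).shape)  # (60,)
```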

Those who want to learn the details can check out the following paper:

Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.

Next, we introduce the editability issue of NeRF:

Real objects can be digitized using NeRF, but in the digital world, for example in VR/AR, we often need to further edit three-dimensional objects, such as editing lighting, editing materials, and even stylization or artistic creation. This has always been a problem NeRF needs to solve. However, as mentioned earlier, the 3D shape representation used by NeRF is not suitable for editing, so we would like to convert it into a mesh that is easy to edit and modify. A neural inverse rendering pipeline called IRON has been proposed, which operates on photometric images and outputs high-quality 3D content in the form of triangle meshes and material textures, ready to be deployed into existing graphics pipelines. As shown below:

During optimization, IRON uses neural representations for geometry and materials, which makes the three-dimensional objects more flexible and compact. IRON optimizes an SDF (signed distance field) in two stages: first, a volumetric radiance-field approach is used to recover the correct topology; then physics-based, edge-aware surface rendering is used to further refine the geometry and separate materials from lighting.

Specific details can be found in the following paper on arXiv:

IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images.

Appearance representation:

As shown in the figure below, the most common representation is material/texture maps plus environment lighting. This method is easy to edit and modify and is widely used in rendering work. Its disadvantage is that the representation is cumbersome: not only must textures be mapped, but the interaction of different lights with different surfaces must also be handled.

Then there is the Radiance Field or Surface Light Field. This method gives, at each surface point (x, y, z), the color (r, g, b) seen from each viewing direction (θ, φ). In this way, the color, reflection, and shadow effects of the object surface can be described easily. The disadvantage is that it is not easy to modify and edit.

Interested readers can refer to the following paper:

Wood D. N., Azuma D. I., Aldinger K., et al. Surface Light Fields for 3D Photography. In SIGGRAPH 2000.

Abstract

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic and audio-lips synchronized talking portrait videos, while also being highly efficient compared to previous methods.


Summary

Challenges:

  1. How to effectively represent spatial and audio information with grid-based NeRF remains unresolved. Typically, audio is encoded as a 64-dimensional vector and fed into an MLP together with the 3D spatial coordinates. However, in a grid-based setting, the additional dimensions introduced by audio make the cost of linear interpolation grow exponentially.
  2. Efficiently modeling the torso, which is less complex but equally important for a realistic portrait, is not trivial. Previous practice either uses another complete 3D radiance field [22] or learns an entangled 3D deformation field [32], both of which are excessive and expensive.

Work:

We propose a decomposed audio-spatial encoding module that decomposes the audio and spatial representations into two grids. While the 3D spatial coordinates remain static, the audio is dynamically encoded into low-dimensional "coordinates". Furthermore, instead of querying audio and spatial coordinates in one high-dimensional feature grid, we show that they can be split into two independent low-dimensional feature grids, which further reduces the interpolation cost. This decomposed audio-spatial encoding enables efficient modeling of the dynamic head.

As for the torso, we study its motion pattern in pursuit of lower computational cost. Observing that torso motion involves fewer topological changes, we propose a lightweight pseudo-3D deformable module that models the torso with a 2D feature grid. Combining these two modules with further portrait-specific NeRF acceleration designs, our approach achieves real-time inference speed on modern GPUs.

Summary of contributions:

  • We propose a decomposed audio-spatial encoding module to efficiently model the inherently high-dimensional audio-driven facial dynamics with two low-dimensional feature grids.
  • We propose a lightweight pseudo-3D deformable module to further improve the efficiency of synthesizing natural torso motion synchronized with head motion.
  • Our framework can run 500 times faster than previous works, render with better quality, and also supports a variety of explicit controls of the talking portrait, such as head pose, eye blinks, and background images.

As shown in Figure 1 (network architecture), the head is modeled with the decomposed audio-spatial encoding module. The input audio signal is first processed by an Audio Feature Extractor (AFE) [22] and then compressed into a low-dimensional, spatially dependent audio coordinate $\mathbf{x}_a$. The spatial coordinate $\mathbf{x}$ and the audio coordinate $\mathbf{x}_a$ are encoded by separate grid encoders; the spatial feature $\mathbf{f}$ and the audio feature $\mathbf{g}$ are fused at a late stage to produce the head color $\mathbf{c}$ and density $\sigma$ for volume rendering. The torso is modeled with the pseudo-3D deformable module: only one torso coordinate $\mathbf{x}_t$ is sampled per pixel, and a deformation field conditioned on the head pose $\mathbf{p}$ models the torso dynamics. Another grid encoder $E^2_{\mathrm{torso}}$ produces the torso feature $\mathbf{f}_t$, which is fed to a later MLP to obtain the torso color $\mathbf{c}_t$ and alpha $\alpha_t$.

Loss functions:

$$\text{Color: } \mathcal{L}_{\text{color}} = \sum_{\mathbf{C}\in\mathcal{I}} \|\mathbf{C} - \mathbf{C}_{\text{gt}}\|_2^2$$

$$\text{Pixel transparency: } \mathcal{L}_{\text{entropy}} = -\sum_{\alpha\in\mathcal{I}} \big(\alpha\log\alpha + (1-\alpha)\log(1-\alpha)\big)$$

$$\text{Facial region: } \mathcal{L}_{\text{dynamic}} = \sum_{\mathbf{x}_a\in\bar{\mathcal{I}}_{\text{face}}} |\mathbf{x}_a|$$

$$\text{Lip fine-tuning: } \mathcal{L}_{\text{fine-tune}} = \sum_{\mathbf{C}\in\mathcal{P}} \|\mathbf{C} - \mathbf{C}_{\text{gt}}\|_2^2 + \lambda\,\mathrm{LPIPS}(\mathcal{P}, \mathcal{P}_{\text{gt}})$$

We train the network with an MSE loss on the color $\mathbf{C}$ of each pixel;

The entropy regularization term drives the pixel transparency toward 0 or 1, where $\alpha$ is the transparency of each pixel in the image $\mathcal{I}$;

The audio condition should only affect the facial region. To stabilize the dynamic modeling, an $L_1$ regularization term on the audio coordinates is also proposed, encouraging the audio coordinate $\mathbf{x}_a$ to be close to 0 in the non-face region $\bar{\mathcal{I}}_{\mathrm{face}}$, which helps avoid unintended jitter outside the facial area (such as hair and ears);

High-quality lips are crucial for making the synthesized portrait look natural. Experiments show that the complex structure of the lips is hard to learn with a pixel-wise MSE loss alone, so a patch-based structural loss is used to fine-tune the lip region. An image patch $\mathcal{P}$ containing the lips is sampled based on facial landmarks, and a combination of LPIPS loss and MSE loss, balanced by $\lambda$, is applied to fine-tune the lip region;
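A minimal sketch of how these loss terms could be computed is shown below (NumPy, with `lpips_fn` left as a placeholder, since LPIPS is a learned network); the loss weights are illustrative assumptions, and note that in the paper the lip term is applied in a separate fine-tuning stage rather than summed with the others.

```python
import numpy as np

def entropy_loss(alpha, eps=1e-6):
    """Push per-pixel transparency toward 0 or 1."""
    a = np.clip(alpha, eps, 1.0 - eps)
    return -np.sum(a * np.log(a) + (1.0 - a) * np.log(1.0 - a))

def total_loss(pred_rgb, gt_rgb, alpha, audio_coords_nonface, lip_pred, lip_gt,
               lpips_fn, lam=0.01, w_entropy=1e-3, w_dynamic=1e-3):
    """Combine the four terms; weights are illustrative assumptions, not the paper's values."""
    l_color = np.sum((pred_rgb - gt_rgb) ** 2)                  # pixel-wise MSE
    l_entropy = entropy_loss(alpha)                              # transparency entropy
    l_dynamic = np.sum(np.abs(audio_coords_nonface))             # L1 on x_a outside the face
    l_finetune = np.sum((lip_pred - lip_gt) ** 2) + lam * lpips_fn(lip_pred, lip_gt)
    return l_color + w_entropy * l_entropy + w_dynamic * l_dynamic + l_finetune
```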

Related Work

Audio-driven Talking Portrait Synthesis:

  • Audio-driven talking portrait synthesis aims to animate a specific person's portrait given arbitrary input speech audio. Various methods have been proposed to achieve realistic and well-synchronized talking portrait videos.
  • Methods [6, 7] define a set of phoneme-mouth correspondence rules and use stitching-based techniques to modify the mouth shape. Deep learning enables image-based methods that synthesize images corresponding to the input audio. One limitation of these methods is that they can only generate images at a fixed resolution and cannot control the head pose.
  • Another research direction is model-based methods, where structural representations such as facial landmarks and 3D morphable face models are used to assist talking portrait synthesis. However, estimating these intermediate representations may introduce additional errors.

Recently, some works [19, 22, 32, 46] utilize NeRF [36] to synthesize talking portraits. NeRF-based methods can achieve photorealistic rendering at arbitrary resolution with less training data, but current work on audio-driven talking portrait synthesis is still hampered by slow training and inference speeds.

Dynamic Modeling

  • Since vanilla NeRF is only capable of modeling static scenes, many different methods have been proposed to model dynamic scenes.
  • Deformation-based methods aim to map all observations back to a canonical space by learning a deformation field alongside the radiance field. Modulation-based methods modulate NeRF directly with latent codes, which can represent time or audio. The latter are better suited to complex dynamics involving topological changes, and thus to modeling facial dynamics.

Efficiency

  • To speed up rendering, recent works propose reducing the size of the MLP or removing it entirely, and storing 3D scene features in an explicit 3D feature grid structure. For example, DVGO [49] directly uses dense feature grids for acceleration. Instant-NGP [37] employs multi-resolution hash tables to control model size. TensoRF [10] decomposes dense 3D feature grids into compact low-rank tensor components. However, these grid-based NeRFs are only suitable for static scenes.
  • Works [Fast Dynamic Radiance Fields with Time-Aware Neural Voxels; Neural Deformable Voxel Grid for Fast Optimization of Dynamic View Synthesis; DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes; Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time] apply these acceleration techniques to dynamic NeRF, but they are deformation-based or only support time-dependent dynamics, which is not suitable for audio-driven talking portrait synthesis. In contrast, our approach is designed for talking portrait synthesis in the audio-driven setting.

Method

Preliminaries

Dynamic NeRF

For novel view synthesis of dynamic scenes, an additional condition is required (e.g., the current time t). Previous approaches typically model dynamic scenes in one of two ways (see the sketch after this list):

  1. Deformation-based methods learn a deformation $\Delta\mathbf{x}$ at each position and time step, $\mathcal{G}: \mathbf{x}, t \rightarrow \Delta\mathbf{x}$, which is then added to the original position $\mathbf{x}$.
  2. Modulation-based methods directly condition the plenoptic function on time: $\mathcal{F}: \mathbf{x}, \mathbf{d}, t \rightarrow \sigma, \mathbf{c}$.
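A minimal sketch contrasting the two conditioning strategies is given below; `deform_fn`, `canonical_field`, and `modulated_field` are placeholder functions that only illustrate the different signatures, not the actual networks.

```python
# Deformation-based: warp the query point back into a canonical static field.
def deformation_based(x, t, deform_fn, canonical_field):
    dx = deform_fn(x, t)                  # G: (x, t) -> delta_x
    return canonical_field(x + dx)        # query the static canonical NeRF at the warped point

# Modulation-based: condition the radiance field directly on t (or an audio code).
def modulation_based(x, d, t, modulated_field):
    return modulated_field(x, d, t)       # F: (x, d, t) -> (sigma, c)
```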

Since deformation-based methods are poor at topological changes (e.g., the opening and closing of the mouth) due to the inherent continuity of the deformation field [39], we choose a modulation-based strategy to model the head, and a deformation-based strategy to model the torso, whose motion pattern is simpler.

Training data are typically 3-5 minute scene-specific videos recorded by a static camera with a synchronized audio track. There are three main preprocessing steps for each image frame: (1) semantic parsing of the head, neck, torso, and background; (2) extraction of 2D facial landmarks, including eyes and lips; (3) face tracking to estimate the head pose parameters. For audio processing, an automatic speech recognition (ASR) model is applied to extract audio features from the audio track. Conditioned on head pose and audio, NeRF can learn to synthesize the head part. Since the torso is not in the same coordinate system as the head, it needs to be modeled separately.

Grid-based NeRF

Recent grid-based NeRF uses a 3D feature grid encoder $E^3_{\mathrm{spatial}}: \mathbf{f} = E^3_{\mathrm{spatial}}(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^3$ is the spatial coordinate and $\mathbf{f}$ is the encoded spatial feature. The feature grid encoder queries spatial features with cheap linear interpolation, which significantly improves training and inference efficiency and enables real-time rendering of static 3D scenes. We take this inspiration and extend it to encode the high-dimensional audio-spatial information required for dynamic talking portrait synthesis.
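Below is a minimal sketch of such a grid encoder for a single 3D point: features stored at grid vertices are queried by trilinear interpolation over the $2^3 = 8$ surrounding corners. A dense grid is used here for simplicity; hash-grid encoders follow the same query pattern.

```python
import numpy as np

class FeatureGrid3D:
    """Dense 3D feature grid queried by trilinear interpolation (simplified sketch)."""
    def __init__(self, resolution=64, feature_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.grid = rng.normal(size=(resolution, resolution, resolution, feature_dim))
        self.res = resolution

    def query(self, x):
        """x: point in [0, 1]^3 -> interpolated feature of shape (feature_dim,)."""
        pos = np.asarray(x) * (self.res - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, self.res - 1)
        w = pos - lo                                  # fractional offsets
        feat = 0.0
        for dx in (0, 1):                             # 2^3 = 8 corner lookups
            for dy in (0, 1):
                for dz in (0, 1):
                    corner = self.grid[(hi if dx else lo)[0],
                                       (hi if dy else lo)[1],
                                       (hi if dz else lo)[2]]
                    weight = ((w[0] if dx else 1 - w[0]) *
                              (w[1] if dy else 1 - w[1]) *
                              (w[2] if dz else 1 - w[2]))
                    feat = feat + weight * corner
        return feat

f = FeatureGrid3D().query([0.2, 0.7, 0.5])   # spatial feature for one sample point
```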

Decomposed Audio-spatial Encoding Module

Audio-spatial Decomposition

Previous implicit NeRF methods usually encode the audio signal into a high-dimensional audio feature and concatenate it with the spatial features. However, integrating high-dimensional features with grid-based NeRF is not straightforward, because the complexity of linear interpolation grows exponentially with the input dimension. If we directly feed a high-dimensional concatenation of audio and spatial features into a grid encoder, it quickly becomes computationally unaffordable. Therefore, we propose two designs to model the audio-spatial information in reduced dimensions.

First, we compress the high-dimensional audio feature $\mathbf{a}$ into a low-dimensional audio coordinate $\mathbf{x}_a \in \mathbb{R}^D$, where the dimension $D \in \{1, 2, 3\}$ is small. This is achieved in a spatially dependent manner via an MLP: $\mathbf{x}_a = \mathrm{MLP}(\mathbf{a}, \mathbf{f})$. Here the spatial feature $\mathbf{f}$ is concatenated so that the audio coordinate depends explicitly on the spatial position. This frees the audio feature from having to implicitly learn spatial information, which enables more compact audio coordinates. The audio coordinate is inspired by the ambient coordinates with deformable slicing surfaces in HyperNeRF, but is integrated with a feature grid encoder for high efficiency.

Second, instead of using a single higher-dimensional audio-spatial grid encoder $\mathbf{g} = E^{3+D}(\mathbf{x}, \mathbf{x}_a)$, we decompose it into two lower-dimensional grid encoders that encode the spatial and audio coordinates separately: $\mathbf{f} = E^3_{\mathrm{spatial}}(\mathbf{x})$, $\mathbf{g} = E^D_{\mathrm{audio}}(\mathbf{x}_a)$. This further reduces the interpolation cost from $2^{3+D}$ to $2^3 + 2^D$ ($D \geq 1$). The spatial feature $\mathbf{f}$ and audio feature $\mathbf{g}$ are concatenated after interpolation.
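As a concrete example, with $D = 2$ a joint 5D grid would need $2^{3+2} = 32$ corner lookups per query, while the decomposed encoders need only $2^3 + 2^2 = 12$. A minimal sketch of the decomposed query is shown below; `spatial_grid`, `audio_grid`, and `audio_mlp` are placeholders for the learned components, not the paper's exact implementation.

```python
import numpy as np

def encode_audio_spatial(x, audio_feat, spatial_grid, audio_grid, audio_mlp):
    """Decomposed audio-spatial encoding (sketch).

    x:            3D spatial coordinate in [0, 1]^3
    audio_feat:   high-dimensional audio feature a for the current frame
    spatial_grid: 3D feature grid encoder  E_spatial^3  (2^3 corner lookups)
    audio_grid:   D-dimensional grid encoder E_audio^D  (2^D corner lookups)
    audio_mlp:    small MLP compressing (a, f) into the low-dim audio coordinate x_a
    """
    f = spatial_grid.query(x)                               # spatial feature
    x_a = audio_mlp(np.concatenate([audio_feat, f]))        # spatially dependent audio coordinate
    g = audio_grid.query(x_a)                               # audio feature
    return np.concatenate([f, g])                           # fused after interpolation
```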

Explicit Eye Control

Eye movement is also a key factor in natural talking portrait synthesis. However, since blinking is only weakly correlated with the audio signal, previous methods often ignore eye control, which leads to artifacts such as blinking too fast or half-blinks. We provide a way to control blinking explicitly. As shown in Figure 2, we compute the percentage of the eye area in the whole image based on the 2D facial landmarks and use this ratio (usually ranging from 0% to 0.5%) as a 1D eye feature e. We condition the NeRF network on this eye feature and show that this simple modification is sufficient for the model to learn eye dynamics with a plain RGB loss. At test time, we can simply adjust the eye percentage to control blinking.
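A minimal sketch of computing such an eye-area ratio from 2D landmarks is given below; the shoelace polygon area is a plausible stand-in, since the exact computation is not spelled out in the text.

```python
import numpy as np

def polygon_area(pts):
    """Shoelace formula for the area of a 2D polygon given as (N, 2) vertices."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * np.abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def eye_ratio(left_eye_lms, right_eye_lms, image_h, image_w):
    """1D eye feature e: eye area as a percentage of the whole image (roughly 0%-0.5%)."""
    area = polygon_area(left_eye_lms) + polygon_area(right_eye_lms)
    return 100.0 * area / (image_h * image_w)

# At test time this ratio can simply be overridden, e.g. ramped down to 0 to force a blink.
```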


Figure 2. Example of landmark information. Based on the predicted 2D facial landmarks, we extract three features to assist training: the facial region $\mathcal{I}_{\mathrm{face}}$ for dynamic regularization, the eye ratio $e$ for eye control, and the lip patch $\mathcal{P}$ for lip fine-tuning.

Overall Head Representation

The spatial feature $\mathbf{f}$, audio feature $\mathbf{g}$, and eye feature $e$ are concatenated together with a latent appearance embedding $\mathbf{i}$, and a small MLP produces the density and color: $\mathbf{c}, \sigma = \mathrm{MLP}(\mathbf{f}, \mathbf{g}, e, \mathbf{i})$

Pseudo-3D Deformable Module

Compared to the head, the torso is almost static, containing only slight movements without topological changes. Previous methods either use another fully dynamic NeRF to model the torso [22] or learn a deformation field entangled with the head [32]. We consider these methods redundant and propose a more efficient pseudo-3D deformable module, as shown in the lower part of Figure 1.

Our method can be viewed as a 2D version of deformation-based dynamic NeRF. Instead of sampling a sequence of points along each camera ray, we only need to sample one pixel coordinate $\mathbf{x}_t \in \mathbb{R}^2$ in image space. The deformation is conditioned on the head pose $\mathbf{p}$, so that the torso motion is synchronized with the head motion. An MLP predicts the deformation: $\Delta\mathbf{x} = \mathrm{MLP}(\mathbf{x}_t, \mathbf{p})$. The deformed coordinate is fed to a 2D feature grid encoder to obtain the torso feature: $\mathbf{f}_t = E^2_{\mathrm{torso}}(\mathbf{x}_t + \Delta\mathbf{x})$. Another MLP produces the torso RGB color and alpha value: $\mathbf{c}_t, \alpha_t = \mathrm{MLP}(\mathbf{f}_t, \mathbf{i}_t)$, where $\mathbf{i}_t$ is a latent appearance embedding that adds model capacity. We show that this deformation-based module can successfully model the torso dynamics and synthesize natural torso images that match the head. More importantly, the pseudo-3D representation built on 2D feature grids is very lightweight and efficient. The separately rendered head and torso images can be alpha-composited with any provided background image to obtain the final output portrait.
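A minimal sketch of this torso branch and the final compositing is shown below; `deform_mlp`, `torso_grid`, and `color_mlp` stand in for the learned components, and the compositing order (head over torso over background) follows the description above.

```python
import numpy as np

def render_torso_pixel(x_t, head_pose, i_t, deform_mlp, torso_grid, color_mlp):
    """Pseudo-3D deformable torso branch for one pixel coordinate x_t in [0, 1]^2."""
    dx = deform_mlp(np.concatenate([x_t, head_pose]))       # pose-conditioned 2D deformation
    f_t = torso_grid.query(x_t + dx)                         # 2D feature grid lookup
    rgb_t, alpha_t = color_mlp(np.concatenate([f_t, i_t]))   # torso color and alpha
    return rgb_t, alpha_t

def composite(head_rgb, head_alpha, torso_rgb, torso_alpha, background_rgb):
    """Alpha-composite the head over the torso over the provided background image."""
    out = torso_rgb * torso_alpha + background_rgb * (1.0 - torso_alpha)
    return head_rgb * head_alpha + out * (1.0 - head_alpha)
```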

Experiment

