[3D Reconstruction] Factor Fields: A Unified Framework Beyond Neural Fields

Paper: Factor Fields: A Unified Framework for Neural Fields and Beyond
arXiv: https://arxiv.org/abs/2302.01226
Project: https://apchenstu.github.io/FactorFields/



Summary

Factor Fields is a new framework for modeling and representing signals: the signal is decomposed into a product of factors, each represented by a classical or neural field that operates on transformed input coordinates. This decomposition yields a unified framework that accommodates several recent signal representations, including NeRF, Plenoxels, EG3D, Instant-NGP, and TensoRF. Furthermore, the model enables powerful new signal representations, such as the "Dictionary Field" (DiF). Among fast reconstruction methods, DiF improves approximation quality, compactness, and training time.

Experiments show that Factor Fields achieve better approximation quality on 2D image regression, higher geometric quality when reconstructing SDFs (signed distance fields), and higher compactness on NeRF reconstruction tasks. Furthermore, DiF can generalize to unseen images and 3D scenes by sharing the signal basis during training, which greatly benefits use cases such as image regression from sparse observations and few-shot radiance field reconstruction.



1. Introduction

Efficiently representing multidimensional digital content—such as 2D images or 3D geometry and appearance—is critical for computer graphics and vision applications. These digital signals are traditionally represented discretely as pixels,
voxels, textures, or polygons. In recent years, [36, 51, 37, 10, 49] have made significant progress in developing advanced neural representations that outperform traditional representations in terms of modeling accuracy and efficiency.

Factor Fields unifies neural representations of multi-dimensional signals by decomposing a signal into multiple factor fields (f_1, …, f_N) and selecting appropriate coordinate transformations (γ_1, …, γ_N), as shown in Figure 1. At any spatial location in the coordinate-transformed signal domain, the factor fields decode multi-channel features, and the target signal is then regressed from the product of the factors via a learned projection function (e.g., an MLP).

Most existing neural representations fit into this model; many can be expressed as a single factor with a suitable domain transformation in our framework—for example, the MLP with positional-encoding transformation in NeRF [36], the hash-transformed table encoding in Instant-NGP [37], and the feature grids with identity transform in DVGO [51] and Plenoxels [61]. TensoRF [10] introduces a representation based on tensor factorization, which can be viewed as two (vector-matrix) or three (CANDECOMP-PARAFAC) factors with axis-aligned orthogonal 2D and 1D projections.

[Figure 1: overview of the Factor Fields framework]

This motivates us to generalize previous classical and neural representations through a single unified framework, enabling simple and flexible combinations of fields and transformation functions and leading to new representation designs. For example, the Dictionary Field (DiF) is a two-factor representation consisting of (1) a basis factor with a periodic transformation that models patterns shared across the entire signal domain, and (2) a coefficient field factor with an identity transformation whose coefficient functions represent spatially varying local content. The combination of these two factors allows efficient representation of both global and local properties of signals. Note that most previous single-factor representations can be viewed as using only one of these components—either basis functions, as in NeRF and Instant-NGP, or coefficient functions, as in DVGO and Plenoxels. By jointly modeling both factors (basis and coefficients), DiF achieves better quality than previous methods such as Instant-NGP and enables compact and fast reconstruction, as we demonstrate on various downstream tasks.

Since DiF is a member of the general Factor Fields family, we conduct a rich set of ablation experiments on the choice of basis/coefficient functions and basis transformations. Compared with Instant-NGP, our method has better reconstruction and rendering quality while effectively halving the total number of model parameters (capacity) for SDF and radiance field reconstruction, demonstrating superior accuracy and efficiency. Furthermore, compared to recent neural representations designed purely for per-scene optimization, our factorized representation framework can learn basis functions shared across scenes (e.g., across multiple 2D images or 3D radiance fields), yielding signal representations that improve reconstruction from sparse observations, as in the few-shot NeRF reconstruction setting.

2. Factor Fields

We aim to compactly represent a continuous Q-dimensional signal s over a D-dimensional domain, s: R^D → R^Q. We assume that signals are not random but structured, and thus share similar features within the same signal (across spatial locations and scales) as well as across different signals. We develop the Factor Fields model step by step, starting from a standard basis expansion.

2.1. Dictionary Field (DiF)

First consider a one-dimensional signal s(x): R^D → R. Using a basis expansion, we decompose s(x) into a set of coefficients c = (c_1, …, c_K) and basis functions b(x) = (b_1(x), …, b_K(x)), where c_k ∈ R and b_k: R^D → R:

ŝ(x) = Σ_{k=1}^{K} c_k b_k(x) = c · b(x)        (1)

Note that s(x) denotes the true signal, while ŝ(x) denotes its approximation.

Representing the signal s(x) with a single global set of basis functions is inefficient because information cannot be shared spatially. We therefore generalize the formulation by (i) using a spatially varying coefficient field c(x) = (c_1(x), …, c_K(x)) with c_k: R^D → R, and (ii) transforming the input coordinates of the basis functions with a coordinate transformation γ: R^D → R^B:

ŝ(x) = Σ_{k=1}^{K} c_k(x) b_k(γ(x)) = c(x) · b(γ(x))        (2)

When γ is chosen to be a periodic function, this formulation allows us to apply the same basis at multiple spatial locations, and optionally at multiple scales, while varying the coefficients c(x), as shown in Figure 2. Note that in general B does not need to match D, so the domain of the basis functions changes accordingly: b_k: R^B → R. Setting c(x) = c and γ(x) = x recovers Equation (1) as a special case.

[Figure 2: the same basis applied at multiple locations and scales through a periodic coordinate transformation]

The above considered a one-dimensional signal s(x). However, many signals have multiple output dimensions (e.g., 3 for an RGB image and 4 for a radiance field). We generalize the model to Q-dimensional signals by introducing a projection function P: R^K → R^Q and replacing the inner product with an element-wise product (denoted ∘ below):

ŝ(x) = P(c(x) ∘ b(γ(x)))        (3)

Equation (3) defines the Dictionary Field (DiF). In contrast to the scalar product c(x) · b(γ(x)) in Equation (2), the output of c ∘ b is a K-dimensional vector—the element-wise product of coefficients and basis values—which is fed to the projection function P. The projection P can be linear or nonlinear; in the linear case P(x) = Ax with A ∈ R^{Q×K}, and for Q = 1 and A = (1, …, 1) we recover Equation (2) as a special case. As described in Section 2.3, the projection operator P can also model the volume rendering operation when reconstructing a 3D radiance field from 2D image observations.
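To make Equation (3) concrete, here is a minimal PyTorch sketch of a 2D DiF: a coefficient grid queried with the identity transform, a smaller basis grid tiled over the domain by a sawtooth transform, an element-wise product, and a shallow MLP projection. All sizes (K, grid resolutions, frequency) are illustrative, not the paper's settings:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiF2D(nn.Module):
    """Minimal sketch of Eq. (3): s_hat(x) = P(c(x) ∘ b(γ(x))) for a 2D signal."""
    def __init__(self, K=16, coeff_res=64, basis_res=32, freq=8.0, out_dim=3):
        super().__init__()
        # coefficient field c(x): a dense 2D feature grid, identity coordinate transform
        self.coeff = nn.Parameter(torch.randn(1, K, coeff_res, coeff_res) * 0.1)
        # basis field b(.): a smaller grid that is tiled over the domain by the sawtooth transform
        self.basis = nn.Parameter(torch.randn(1, K, basis_res, basis_res) * 0.1)
        self.freq = freq
        # projection P: a shallow MLP mapping the K-dim product to the output signal
        self.proj = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, out_dim))

    @staticmethod
    def _sample(grid, uv):
        # uv in [0,1]^2 -> grid_sample expects [-1,1]; returns (N, K) features
        g = uv.view(1, -1, 1, 2) * 2.0 - 1.0
        return F.grid_sample(grid, g, mode='bilinear', align_corners=False).squeeze(0).squeeze(-1).t()

    def forward(self, x):                 # x: (N, 2) coordinates in [0, 1]
        c = self._sample(self.coeff, x)   # identity transform γ1(x) = x
        gamma = (x * self.freq) % 1.0     # sawtooth transform repeats the basis over the domain
        b = self._sample(self.basis, gamma)
        return self.proj(c * b)           # element-wise product, then projection P

model = TinyDiF2D()
pred = model(torch.rand(4096, 2))         # (4096, 3) predicted RGB values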

2.2 Factor Fields f_i

To cover more than two factors, we generalize Equation (3) to the full Factor Fields framework by replacing the coefficients c(x) and basis b(x) with a set of factor fields {f_i(x)}_{i=1}^{N}:

ŝ(x) = P(Π_{i=1}^{N} f_i(γ_i(x)))        (4)
Here Π denotes the element-wise product over the factors. In this general form, each factor f_i: R^{F_i} → R^K can be equipped with its own coordinate transformation γ_i: R^D → R^{F_i}. Setting N = 2, f_1(x) = c(x), γ_1(x) = x, f_2(x) = b(x), and γ_2(x) = γ(x) recovers Equation (3) as a special case of Equation (4).

The coordinate transformations {γ_i} are deterministic functions, while P and {f_i} are parametric maps (e.g., polynomials, multi-layer perceptrons, or 3D feature grids) whose parameters θ can be optimized for a single signal or jointly for multiple signals. When jointly optimizing multiple signals, we share the parameters of the projection function and the basis factors (but not of the coefficient factors) between signals. To model the factor fields f_i: R^{F_i} → R^K, we consider various representations (polynomials, MLPs, 2D and 3D feature grids, and 1D feature vectors). MLPs have been used as signal representations in Occupancy Networks [35], DeepSDF [40], and NeRF. While MLPs excel in compactness and introduce a useful smoothness bias, they are slow to evaluate, which increases training and inference time.

To speed this up, DVGO [51] proposes a dense 3D voxel grid representation for the radiance field. While voxel grids can be optimized quickly, they require significantly more memory and do not scale easily to higher dimensions. To better exploit the sparsity of the signal, Instant-NGP [37] replaces the dense voxel grid with a hash function that indexes into one-dimensional feature vectors, and TensoRF [10] decomposes the signal into a product of matrices and vectors. The Factor Fields framework allows any of these representations to be used for any factor f_i.
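The general model is easiest to read as a list of (coordinate transformation, factor field) pairs whose outputs are multiplied element-wise and then projected. A hypothetical, framework-agnostic sketch (class and variable names are illustrative, not the paper's code):

import torch
import torch.nn as nn

class FactorField(nn.Module):
    """Sketch of Eq. (4): s_hat(x) = P( Π_i f_i(γ_i(x)) )."""
    def __init__(self, transforms, factors, projection):
        super().__init__()
        self.transforms = transforms           # list of callables γ_i: R^D -> R^{F_i}
        self.factors = nn.ModuleList(factors)  # list of modules f_i: R^{F_i} -> R^K
        self.projection = projection           # P: R^K -> R^Q

    def forward(self, x):
        feat = None
        for gamma, f in zip(self.transforms, self.factors):
            fi = f(gamma(x))                            # evaluate factor i on transformed coordinates
            feat = fi if feat is None else feat * fi    # element-wise product over factors
        return self.projection(feat)

# e.g. an MLP-only, single-factor model in the spirit of Occupancy Networks / DeepSDF:
mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 16))
proj = nn.Linear(16, 1)
occnet_like = FactorField([lambda x: x], [mlp], proj)
sdf = occnet_like(torch.rand(1024, 3))                  # (1024, 1)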

2.3 Coordinate transformations γ_i

The input coordinates of each factor field f_i are transformed by the coordinate transformation function γ_i: R^D → R^{F_i}:

1. Coefficient

When a factor field f_i represents coefficients, we use the identity map γ_i(x) = x as the coordinate transformation, since the coefficients vary freely over the signal domain.

2. Local Basis

The coordinate transformation γ_i can apply the same basis function f_i at multiple locations, as shown in Figure 2. Specifically, sawtooth, triangular, sinusoidal (as in NeRF), hash (as in Instant-NGP), and orthogonal (as in TensoRF [10]) transformations can be used; see Figure 3.

[Figure 3: coordinate transformation functions]
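For reference, here is a sketch of the periodic transformations from Figure 3 (sawtooth, triangular, sinusoidal) plus a toy spatial hash; the actual hash used by Instant-NGP differs, so treat these as illustrative implementations rather than the paper's code:

import torch

def sawtooth(x, freq):                      # repeats the basis domain: output in [0, 1)
    return (x * freq) % 1.0

def triangular(x, freq):                    # like sawtooth but mirrored, so it stays continuous
    t = (x * freq) % 2.0
    return torch.where(t > 1.0, 2.0 - t, t)

def sinusoidal(x, freq):                    # NeRF-style encoding for a single frequency
    return torch.cat([torch.sin(x * freq), torch.cos(x * freq)], dim=-1)

def hashed(x, freq, table_size=2**14):      # toy spatial hash onto a 1D index (not Instant-NGP's)
    cell = (x * freq).long()
    primes = torch.tensor([1, 2654435761, 805459861])[: x.shape[-1]]
    return (cell * primes).sum(-1) % table_size

x = torch.rand(5, 3)
print(sawtooth(x, 8.0).shape, sinusoidal(x, 8.0).shape, hashed(x, 8.0).shape)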

3. Multi-scale basis

The coordinate transformation γ_i also allows the same basis f_i to be applied at different spatial resolutions of the signal by transforming the coordinate x with transformations of different frequencies (or, equivalently, periods), as shown in Figure 2. This is important because signals typically contain both high and low frequencies, and the multi-scale basis can model fine details as well as smooth components of the signal.

Specifically, we model the target signal with a set of multi-scale (multi-resolution) basis functions. We divide the basis into L levels, each covering a different scale. Let [u, v] denote the bounding box of the signal along one dimension; the scale of level l is (v − u)/f_l, where f_l is the frequency of level l. A coarse level (e.g., level 1) has a low frequency f_1 and covers a large area of the target signal, while a fine level (e.g., level L) has a high frequency f_L and covers a small area.

We obtain the multi-scale representation by multiplying the scene coordinate x by the per-level frequency f_l, feeding the result into the coordinate transformation function γ_i, and concatenating the results over levels l = 1, …, L:

γ_PR(x) = (γ_i(f_1 x), …, γ_i(f_L x))        (5)
where γ_i is any of the coordinate transformations in Figure 3 and γ_PR is the final multi-scale coordinate transformation. The target signal s(x) is thus decomposed into the product of a spatially varying coefficient field and a multi-level basis field built from repeated local basis functions.
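A sketch of the multi-scale transformation γ_PR, assuming the sawtooth transform and the frequency schedule described in Section 4.1; the per-level results are stacked so that each level can index its own basis grid:

import torch

def gamma_pr(x, freqs=(2.0, 3.2, 4.4, 5.6, 6.8, 8.0)):
    """Multi-scale coordinate transform: apply the same sawtooth at L frequencies."""
    levels = [(x * f) % 1.0 for f in freqs]      # each level tiles the basis at a different scale
    return torch.stack(levels, dim=-2)           # (..., L, D): one transformed coordinate per level

coords = torch.rand(4096, 3)                     # normalized scene coordinates in [0, 1]
print(gamma_pr(coords).shape)                    # torch.Size([4096, 6, 3])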

2.4 Projection P

To represent multi-dimensional signals, we introduce a projection function P: R^K → R^Q that maps the K-dimensional Hadamard product Π_i f_i to the Q-dimensional target signal. We distinguish two cases: direct observations of the target signal (e.g., pixels of an RGB image), and indirect observations that are projections of the target signal (e.g., pixels rendered from a radiance field).

Direct observations: In the simplest case, the projection is a learnable linear map P(x) = Ax with parameters A ∈ R^{Q×K}, mapping the K-dimensional Hadamard product Π_i f_i to the Q-dimensional signal. By default, the experiments represent P with a shallow nonlinear multi-layer perceptron (MLP), which makes the model more flexible.

Indirect observations: Sometimes only indirect observations of the signal are available. For example, in the NeRF setting only 2D images are observed, not the 4D signal (density and radiance) itself. In this case we extend P to include a differentiable volume rendering pass. Specifically, an MLP first maps the view direction d ∈ R^3 and the multiplied features Π_i f_i at a location x ∈ R^3 to a color value c ∈ R^3 and a volume density σ ∈ R; NeRF's discretized volume rendering integral is then applied:

C = Σ_j T_j (1 − exp(−σ_j δ_j)) c_j,   with T_j = exp(−Σ_{m&lt;j} σ_m δ_m)        (6)
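A minimal sketch of this discretized accumulation (standard NeRF-style alpha compositing; variable names are mine, not the repository's):

import torch

def composite(sigma, rgb, dists):
    """sigma: (R, S) densities, rgb: (R, S, 3) colors, dists: (R, S) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * dists)                       # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weight = alpha * trans                                        # contribution of each sample
    rgb_map = (weight[..., None] * rgb).sum(dim=-2)               # (R, 3) rendered pixel colors
    acc_map = weight.sum(dim=-1)                                  # (R,) accumulated opacity
    return rgb_map, acc_map, weight

rgb_map, acc, w = composite(torch.rand(8, 64), torch.rand(8, 64, 3), torch.full((8, 64), 0.01))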

2.5 Space contraction

We normalize the input coordinates x ∈ R^D to [0, 1] by applying a simple spatial contraction function before passing them to the coordinate transformation γ_i(x). Two settings are distinguished:

For a D-dimensional bounded signal with bounds [u, v] (u, v ∈ R^D), the coordinates are normalized to [0, 1] with a linear map, as in Equation (7); for unbounded signals (e.g., an outdoor radiance field), we use the space contraction function of Mip-NeRF360 [3], as in Equation (8).

x̃ = (x − u) / (v − u)        (7)
[Equation (8): the Mip-NeRF360 space contraction for unbounded scenes]
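A sketch of the two settings: the linear normalization of Equation (7), and, for unbounded scenes, the Mip-NeRF360-style contraction shown here in its published form (which maps points into a ball of radius 2; the paper may rescale the result differently, so treat this as an assumption):

import torch

def normalize_bounded(x, u, v):
    """Eq. (7): linearly map a bounded signal domain [u, v] to [0, 1]."""
    return (x - u) / (v - u)

def contract_unbounded(x):
    """Mip-NeRF360-style contraction: identity inside the unit ball, compresses the outside."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-9)
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))

pts = torch.randn(1024, 3) * 5.0
print(contract_unbounded(pts).norm(dim=-1).max())   # always < 2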

2.6 Optimization

The optimization objective, where Ψ(θ) is a regularization term on the model parameters:

θ* = argmin_θ Σ_x ‖ŝ(x) − s(x)‖² + Ψ(θ)        (9)

Sparsity regularization: Using an L0 penalty to obtain sparse coefficients would be desirable but is difficult to optimize. Instead, we use a simple strategy: at each iteration we randomly drop (set to zero) a subset of the model's K features with probability µ, which regularizes our objective. This forces the signal to be represented by a random combination of features at every iteration, encouraging sparsity and preventing co-adaptation of features. We implement this dropout regularization with a random binary vector applied to Π_i f_i via an element-wise product.
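A sketch of this dropout-style regularization: with probability µ, each of the K feature channels is zeroed before the projection; the binary mask is resampled every iteration and only applied during training:

import torch

def feature_dropout(features, mu=0.1, training=True):
    """features: (N, K) multiplied factor features Π_i f_i; zero each channel with prob. mu."""
    if not training or mu <= 0.0:
        return features
    keep = (torch.rand(features.shape[-1], device=features.device) > mu).float()
    return features * keep                     # random binary mask, shared across all points

feats = torch.rand(4096, 18)
reg_feats = feature_dropout(feats, mu=0.2)     # fed to the projection function P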

Initialization: The experiments initialize the basis factors with the discrete cosine transform (DCT) and randomly initialize the coefficient factors and the parameters of the projection MLP (see the ablation studies in Tables 3a to 3e).

Multi-signal optimization: When jointly optimizing multiple signals, the parameters of the projection function and the basis factors (but not of the coefficient factors) are shared across signals. The experiments in Section 4.3 demonstrate that sharing a basis across signals, while encouraging sparse coefficients, improves generalization and enables reconstruction from sparse observations.

3. Factor fields as a general framework

Inspired by classical factorization and learning techniques, such as sparse coding [57, 59, 18] and principal component analysis (PCA) [46, 33], we propose a new neural representation framework based on neural factorization. Factor fields unify many recent neural representations and enable instantiation of new models in the factor fields family.

3.1 Occupancy Networks, IMNet and DeepSDF

These methods represent a surface implicitly, either as the continuous decision boundary of an MLP classifier or by regressing a signed distance value. A single MLP provides a continuous implicit 3D map, allowing mesh extraction at arbitrary resolution. In the Factor Fields framework this corresponds to a single factor (N = 1) with γ_1(x) = x, f_1(x) = x, and P(x) = MLP(x), so ŝ(x) = MLP(x). While this representation can produce high-quality meshes, it cannot model high-frequency signals such as images due to the implicit smoothness bias of MLPs.

3.2 NeRF

NeRF encodes the spatial coordinates with a set of Fourier (sinusoidal) functions and represents the radiance field with an MLP in this Fourier space: γ_1(x) = (sin(x f_1), cos(x f_1), …, sin(x f_L), cos(x f_L)), f_1(x) = x, and P(x) = MLP(x). Here the coordinate transformation γ_1(x) is the sinusoidal mapping shown in Figure 3, which enables representing high-frequency content.

3.3 Plenoxels

Plenoxels represents the 3D scene with a sparse voxel grid, requiring no neural network and enabling fast training. In the Factor Fields framework: N = 1, γ_1(x) = x, f_1(x) = 3D-Grid(x); for the density field P(x) = x, and for the radiance field P(x) = SH(x) (spherical harmonics). In related work, DVGO [51] replaces the sparse 3D grid with a dense grid and uses a tiny MLP as the projection function P. Dense grids are simple and give fast feature lookups, but they require high spatial resolution (and thus memory) to represent details.

3.4 ConvONet and EG3D

These methods apply orthogonal coordinate transformations to spatial points in a bounded scene and use a tri-plane representation to model the 3D scene; each point is represented by concatenating features queried from a set of 2D feature maps. This representation uses 2D convolutions to aggregate 3D features, greatly reducing the memory footprint.

In the Factor Fields framework: N = 1, γ_1(x) = Orthogonal-2D(x), f_1(x) = 2D-Maps(x), and P(x) = MLP(x).

However, while axis-aligned transformations allow dimensionality reduction and feature sharing along axes, it can be challenging to handle complex structures due to the axis-aligned bias of the representation.

3.5 Instant-NGP

Instant-NGP uses a multi-level hash grid to model the target signal efficiently, hashing spatial positions into one-dimensional feature vectors. In the Factor Fields framework: N = 1, γ_1(x) = Hash(x), f_1(x) = Vectors(x), and P(x) = MLP(x), with L = 16 levels. However, multi-level hash maps lead to dense collisions at fine scales. This many-to-one mapping forces the model to bias its capacity toward densely observed regions, while less observed regions produce noise. The concurrent work VQAD [52] introduces a hierarchical Vector-Quantized Auto-Decoder (VQ-AD) representation that learns an index table as the coordinate transformation function, allowing higher compression rates.

3.6 TensoRF

TensoRF decomposes the radiance field into a product of vectors and matrices (TensoRF-VM) or multiple vectors (TensoRF-CP), enabling efficient feature queries with a low memory footprint. These setups instantiate Factor Fields as:

N = 2, γ_1(x) = Orthogonal-1D(x), f_1(x) = Vectors(x), γ_2(x) = Orthogonal-2D(x), f_2(x) = 2D-Maps(x): VM decomposition
N = 3, γ_i(x) = Orthogonal-1D(x), f_i(x) = Vectors(x): CP decomposition

TensoRF uses both SH and MLP models for the projection P. Similar to ConvONet and EG3D, TensoRF is sensitive to the orientation of the coordinate system because of the orthogonal transformation functions. Note that, except for TensoRF, all of the above representations decompose the signal with a single factor field, i.e., N = 1. Tables 3a to 3d show that using multiple factor fields (N > 1) provides stronger modeling capability.

3.7 ArXiv Preprints

The field of neural representation learning is evolving rapidly, and many new representations have recently appeared as preprints on arXiv. The Phase Embedding Field (PREF) represents the target signal with phase volumes and transforms them to the spatial domain with an inverse fast Fourier transform (iFFT) for compact representation and efficient scene editing; this is similar in spirit to DVGO and extends the projection function P with an iFFT. Tensor4D [48] extends the tri-plane representation to 4D human reconstruction by using tri-plane factor fields with three orthogonal coordinate transformations (i.e., Orthogonal-2D(x) = (xy, xt, yt), Orthogonal-2D(x) = (xz, xt, zt), and Orthogonal-2D(x) = (yz, yt, zt)).

D-TensoRF [20] uses a matrix-matrix decomposition to reconstruct dynamic scenes, similar to TensoRF's VM decomposition but replacing γ_1(x) = Orthogonal-1D(x) and f_1(x) = Vectors(x) with γ_1(x) = Orthogonal-2D(x) and f_1(x) = 2D-Maps(x). In the Factor Fields framework, the 2D representation corresponds to N = 1, γ_1(x) = Sinusoidal(x), f_1(x) = 2D-Maps(x), and P(x) = MLP(x).

3.8 Dictionary Field (DiF)

Beyond the representations above, Factor Fields also admits new representations with desirable properties, such as Equation (3). DiF offers implicit regularization, compactness, fast optimization, and generalization across multiple signals. The core idea is to decompose the target signal into two fields: a global field (the basis) and a local field (the coefficients). The global field captures structured signal features that are shared across spatial locations, scales, and signals, while the local field accounts for spatially varying content.

DiF decomposes the target signal into a coefficient field f_1(x) = c(x) and basis functions f_2(x) = b(x), which differ mainly in their coordinate transformations: we choose the identity map γ_1(x) = x for c(x) and a periodic coordinate transformation γ_2(x) for b(x), as shown in Figure 3.

As representations of the two factor fields f_1 and f_2, we can choose any of those shown in Figure 1 (bottom left). For ease of comparison, the experiments use the sawtooth function as the basis coordinate transformation γ_2 and uniform grids as the representations of the coefficient field f_1 and the basis functions f_2.

4. Experiments

4.1 Implementation Details

All models are implemented in PyTorch and evaluated on a single RTX 6000 GPU, using the Adam optimizer with a learning rate of 0.02.

We instantiate DiF with linearly increasing frequencies f_l ∈ [2.0, 3.2, 4.4, 5.6, 6.8, 8.0], L = 6 levels, and feature channels K = [4, 4, 4, 2, 2, 2] · 2^η, where η controls the number of feature channels; η = 3 is used for the 2D experiments and η = 0 for the 3D experiments. The model parameters θ are distributed over three components: the coefficients θ_c, the basis θ_b, and the projection function θ_P. The size of each component depends on the chosen representation.
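To make the level configuration concrete, the short sketch below just expands the channel schedule K = [4, 4, 4, 2, 2, 2] · 2^η for the two settings used in the experiments:

freqs = [2.0, 3.2, 4.4, 5.6, 6.8, 8.0]          # per-level frequencies f_l

def channels(eta):
    return [k * 2 ** eta for k in [4, 4, 4, 2, 2, 2]]

print(channels(3))   # 2D experiments: [32, 32, 32, 16, 16, 16] -> 144 channels in total
print(channels(0))   # 3D experiments: [4, 4, 4, 2, 2, 2]       -> 18 channels in total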

The default experimental model, "DiF-Grid", uses learnable tensor grids with P(x) = MLP(x) and γ(x) = Sawtooth(x), where Sawtooth(x) = x mod 1.0. In the DiF-Grid setting, the total number of optimizable parameters is mainly determined by the resolutions M_l^c and M_l^b of the coefficient and basis grids:

[Equation: total parameter count as a function of the coefficient and basis grid resolutions M_l^c and M_l^b]
The basis grids use linearly increasing resolutions M_l^b ∈ [32, 128], scaled by min(v − u)/1024 for scene bounds [u, v]. We use the same coefficient grid resolution M_l^c across all L levels to improve query efficiency and reduce the memory footprint per signal.

The different DiF variants are labeled "DiF-xx", where "xx" indicates the difference from the default "DiF-Grid". For example, "-MLP-B" denotes a representation with an MLP basis, while "-SL" stands for single level.

4.2 Single Signals

We first evaluate the accuracy and efficiency of our DiF-Grid representation on various multi-dimensional signals and compare it with several recent neural signal representations. We consider three commonly used benchmark tasks for evaluating neural representations: 2D image regression, 3D signed distance field (SDF) reconstruction, and radiance field reconstruction / novel view synthesis. We evaluate each method's ability to approximate high-frequency patterns, its interpolation quality, its compactness, and its robustness to ambiguous and sparse observations.

2D image regression: In this task, we directly regress RGB colors from pixel coordinates. We evaluate DiF-Grid on four complex high-resolution images with total pixel counts ranging from 4M to 213M. Figure 4 shows that, at the same model size, DiF-Grid achieves higher PSNR than baselines such as Instant-NGP (although Instant-NGP's highly optimized CUDA framework still optimizes faster).

[Figure 4: 2D image regression results]

Signed distance field (SDF) reconstruction:

As a classical geometric representation, an SDF describes a set of continuous iso-surfaces. We compare with state-of-the-art neural representations, including Fourier Feature Networks [53], SIREN [49], and Instant-NGP [37]. All methods use the same training protocol: 8M SDF points are pre-sampled from the target mesh, 80% of which lie close to the surface and the remaining 20% are uniformly distributed within the unit volume. Following Instant-NGP, we randomly sample 16M points for evaluation and compute the geometric IoU metric based on the SDF sign (X is the set of evaluation points):

[Equation: geometric IoU computed from the signs of the predicted and ground-truth SDFs over the evaluation point set X]
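A sketch of the geometric IoU under the common convention that a negative SDF value means "inside" the surface (the paper's exact sign convention may differ):

import torch

def g_iou(sdf_pred, sdf_gt):
    """Geometric IoU over an evaluation point set X, treating sdf < 0 as 'inside'."""
    inside_pred = sdf_pred < 0
    inside_gt = sdf_gt < 0
    inter = (inside_pred & inside_gt).float().sum()
    union = (inside_pred | inside_gt).float().sum()
    return (inter / union.clamp(min=1)).item()

# toy example with 1M random points (the paper evaluates on 16M points)
print(g_iou(torch.randn(1_000_000), torch.randn(1_000_000)))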
Figure 5 shows quantitative and qualitative comparisons: our method reproduces high-frequency geometric details with less noise on smooth surfaces, achieves the highest gIoU, reconstructs fastest, and uses half as many parameters as Instant-NGP.

[Figure 5: SDF reconstruction comparison]

Radiance field reconstruction:

The goal of radiance field reconstruction is to recover density and radiance at every spatial location from multi-view RGB images. Many encoding functions and advanced representations have been proposed to significantly improve reconstruction speed and quality, such as sparse voxel grids [16], hash tables [37], and tensor decompositions [10].

Table 1 quantitatively compares DiF-Grid with several state-of-the-art fast radiance field reconstruction methods (Plenoxels [16], DVGO [51], Instant-NGP [37], and TensoRF-VM) on synthetic scenes [36] and real scenes (Tanks and Temples objects) [24].

Overall, our model achieves state-of-the-art results on three benchmark tasks. Baselines are mostly single-factor, utilizing local fields (such as DVGO and Plenoxels) or global fields (such as Instant-NGP). The DiF model is a two-factor method that combines local coefficients and global basis fields, resulting in better reconstruction quality and memory efficiency .

[Table 1: radiance field reconstruction comparison]

4.3 Generalization

Recent advanced neural representations (such as NeRF, SIREN, ACORN, Plenoxels, Instant-NGP, and TensoRF) optimize each signal separately and lack the ability to jointly model multiple signals or to learn useful priors from them. In contrast, the DiF representation not only reconstructs each signal accurately and efficiently (Section 4.2) but also generalizes across signals by simply sharing the basis field across signal instances.

We evaluate the benefit of basis sharing through image regression from partial pixel observations and few-shot radiance field reconstruction. In these experiments we adopt DiF-MLP-B (row (5) in Table 3d) instead of DiF-Grid: a tensor grid models the coefficients, while six tiny MLPs (two layers with 32 neurons each) model the basis. DiF-MLP-B outperforms DiF-Grid in the generalization setting thanks to the MLP's strong inductive smoothness bias.

Image regression from sparse observations:
This experiment considers the setting where only a fraction of the pixels is used during optimization. Without additional priors, single-signal optimization easily overfits in this setting due to the sparse observations and limited inductive bias, and thus fails to recover unseen pixels.

We learn a data prior with the DiF-MLP-B model by pre-training on 800 face images from the FFHQ dataset [21], sharing the MLP basis and projection function parameters across images. The final image reconstruction task is then performed by optimizing the coefficient grid for each new test image.

Figure 7 shows face image regression results for three different masks. Even without pre-training or other image priors, DiF-MLP-B captures structural information within the image being optimized to a certain extent: as shown in the eye region, the model learns the pupil shape from the right eye and reuses the structure stored in the shared basis functions to regress the left eye (which is masked during training).
[Figure 7: image regression from sparse observations]
Few-shot radiance field reconstruction:

Previous work addresses the few-shot setting by imposing sparsity assumptions in per-scene optimization [38, 22] or by training feed-forward networks on datasets [61, 12, 26]. We consider 3 and 5 input views per scene and exploit the data prior in the pre-trained basis field of our DiF model during optimization. Note that the views are selected on a quarter sphere, so the overlap between views is rather limited.

Specifically, we first train the DiF model on 100 Google Scanned Objects scenes [15] with 250 views per scene. During cross-scene training we keep one coefficient field per scene (100 in total) and share the basis b and projection function P. After cross-scene training, we use the average of the pre-trained coefficient fields as initialization, keep the pre-trained basis and projection (b and P) fixed, and fine-tune the coefficient field for each new scene observed from few shots. In this experiment we compare DiF-MLP-B and DiF-Grid with and without pre-training.

We also train Instant-NGP and other few-shot methods (PixelNeRF [61] and MVSNeRF [11]) on the same training set and test with the same 3 or 5 views. As shown in Table 2 and Figure 8, the pre-trained DiF-MLP-B provides strong regularization for few-shot reconstruction, with fewer artifacts and better reconstruction quality. Single-scene optimization methods without data priors overfit to the input images, leading to many outliers. MVSNeRF and PixelNeRF achieve reasonable reconstructions because they learn feed-forward predictions and avoid per-scene optimization, but they suffer from blurry artifacts.

4.4 Impact of Factor Fields Design Choices

We evaluate the main design choices of the Factor Fields framework in terms of efficiency, compactness, reconstruction quality, and generality: the number of factors N, the number of levels L, the coordinate transformation function γ_i, the field representation f_i, and the factor connector ∘.

For the 2D image regression task, we use the same model settings in Section 4.2 and test on 256 high-fidelity images with a resolution of 1024×1024 from the DIV2K dataset [1].


Number of factors N:

As in Equation (4), the number of factors N is the number of factor fields used to represent the target signal. Table 3a compares one-factor models (rows 1, 3, 5) with two-factor models (rows 2, 4, 6). Models (2) DiF-Hash-B, (4) TensoRF-VM, and (6) DiF-Grid use the same factors as (1) Instant-NGP, (3) EG3D, and (5) DiF-no-C respectively, but extend them to two factors, improving PSNR by 3 dB on image regression and 0.35 dB on 3D radiance field reconstruction, at the cost of slightly longer training time (∼10%) and model size (∼5%).

Although the computational overhead is small, the multiplication between factors lets the two factor fields mutually adjust each other's feature encoding and jointly represent the entire signal more flexibly, alleviating the feature-conflict problem of Instant-NGP and other single-factor models. Furthermore, as shown in Table 1, multi-factor modeling (N ≥ 2) enables more compact models and allows generalization by partially sharing fields across instances, e.g., cross-scene radiance field modeling.

[Table 3: ablation study results]

Number of levels L:

The DiF model employs multi-level transformations to obtain a pyramidal basis field, similar to the set of sinusoidal positional encoding functions used in NeRF [36]. Table 3b compares multi-level models (including DiF and NeRF) with simplified single-level versions that use only a single transformation level. Note that Occupancy Networks (OccNet, row (1)) do not use positional encoding and can be viewed as a single-level version of NeRF (row (2)); NeRF's multi-level sinusoidal encoding improves PSNR by roughly 10 dB on both the 2D image and 3D reconstruction tasks.

Coordinate transformation γ_i:

Table 3c evaluates four coordinate transformation functions (sinusoidal, triangular, hash, and sawtooth) with the DiF representation; the transformation curves are shown in Figure 3. Compared to a random hash function, the periodic transformation functions (rows 2, 3, 4) share spatially coherent information through repeated patterns: adjacent points share spatially adjacent features in the basis domain, maintaining local connectivity. We observe that periodic bases perform significantly better when modeling dense signals such as 2D images. For sparse signals, such as 3D radiance fields, all four transformation functions achieve reconstruction quality on par with previous state-of-the-art fast radiance field reconstruction methods.

Field representation f_i:

Table 3d compares various factor representations in the DiF model, including MLPs, vectors, 2D maps, and 3D grids. Discrete feature grids (3D grids, 2D maps, and vectors) generally lead to faster reconstruction than MLPs (e.g., DiF-Grid is faster than DiF-MLP-B and DiF-MLP-C). While all variants provide reasonable reconstruction quality in single-signal optimization, our dual-grid representation achieves the best performance on image regression and single-scene radiance field reconstruction. On the other hand, few-shot radiance field reconstruction benefits from stronger regularization of the basis functions; representations with stronger inductive bias (e.g., the vectors in TensoRF-VM and the MLPs in DiF-MLP-B) lead to better reconstruction quality than the other variants.

Factor connector ∘:

Another key design choice of the Factor Fields and DiF models is connecting multiple factors by element-wise multiplication. Table 3e compares element-wise multiplication with direct concatenation for three models; element-wise multiplication consistently outperforms concatenation in reconstruction quality.
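For clarity, the two connectors compared in Table 3e differ only in how the per-factor features are combined before the projection; a sketch with illustrative shapes:

import torch
import torch.nn as nn

f1 = torch.rand(4096, 18)                          # coefficient features c(x)
f2 = torch.rand(4096, 18)                          # basis features b(γ(x))

combined_mul = f1 * f2                             # element-wise multiplication (DiF default)
combined_cat = torch.cat([f1, f2], dim=-1)         # concatenation baseline

proj_mul = nn.Linear(18, 3)(combined_mul)          # P sees K channels
proj_cat = nn.Linear(36, 3)(combined_cat)          # P sees 2K channels instead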

5. Installation and Usage

1. Installation environment

Clone the code, and then follow the steps below to install dependencies

conda create -n FactorFields python=3.9
conda activate FactorFields
conda install -c "nvidia/label/cuda-11.7.1" cuda-toolkit
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt 

tiny-cuda-nn does not have to be installed; it is mainly used for acceleration. The installation steps are as follows (I installed it):

conda install -c "nvidia/label/cuda-11.7.1" cuda-toolkit
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

2. How to use

Make sure to download and extract the relevant data into the data folder. Five tasks are available; choose the one you need:

  1. Image

The training script is scripts/2D_regression.ipynb, with config file configs/image.yaml.

  2. SDF

The training script is scripts/sdf_regression.ipynb, with config file configs/sdf.yaml.

  3. NeRF

The corresponding datasets are provided; the training script is train_per_scene.py:

python train_per_scene.py configs/nerf.yaml defaults.expname=lego dataset.datadir=./data/nerf_synthetic/lego
  4. Image set regression

The training script is 2D_set_regression.ipynb.

  5. Sparse reconstruction NeRF (few-shot)
python train_across_scene.py configs/nerf_set.yaml

The commands above run training. Testing uses the same Python files, which accept additional parameters:

  • model.basis_dims=[4, 4, 4, 2, 2, 2] sets the number of levels and channels per level; here, 6 levels with 18 channels in total.
  • model.basis_resos=[32, 51, 70, 89, 108, 128] sets the resolutions of the feature embeddings.
  • model.freq_bands=[2.0, 3.2, 4.4, 5.6, 6.8, 8.0] indicates the frequency parameters applied at each level of the coordinate transformation function.
  • model.coeff_type represents the coefficient field representations and can be one of the following: [none, x, grid, mlp, vec, cp, vm].
  • model.basis_type represents the basis field representation and can be one of the following: [none, x, grid, mlp, vec, cp, vm, hash].
  • model.basis_mapping represents the coordinate transformation and can be one of the following: [x, triangle, sawtooth, trigonometric]. Please note that if you want to use orthogonal projection, choose the cp or vm basis type, as they automatically utilize the orthogonal projection functions.
  • model.total_params controls the total model size. Note that the model's capacity is mainly determined by model.basis_resos and model.basis_dims; total_params mainly affects the capacity of the coefficients.
  • exportation.render_only: render items after training by setting this flag to 1. Please also specify the defaults.ckpt label.
  • exportation....: specify whether to render [render_test, render_train, render_path, export_mesh] after training by setting the corresponding flag to 1.

Some pre-defined configurations (such as occNet, DVGO, nerf, iNGP, EG3D) can be found in README_FactorField.py.

3. Code Analysis

This section mainly analyzes the few-shot NeRF reconstruction code. The network consists of 100 per-scene coefficient grids, a 6-level basis, a linear projection layer, and a rendering layer. The structure is as follows:
[Model structure printout omitted]

Training process:

scene_idx = torch.randint(0, len(train_dataset.all_rgb_files), (1,)).item()      # randomly pick one of the 100 scenes
model.scene_idx = scene_idx
for j in range(steps_inner):                       # inner loop, 16 iterations
    if j % steps_inner == 0:                       # first iteration: only the coefficients receive gradients
        model.set_optimizable(['coef'], True)
        model.set_optimizable(['proj','basis','renderer'], False)
    elif j % steps_inner == steps_inner-3:         # from iteration 13 on: freeze the coefficients, train the rest
        model.set_optimizable(['coef'], False)
        model.set_optimizable(['mlp', 'basis','renderer'], True)

    # prepare training data
    data = train_dataset[scene_idx]  # next(iterator)                                     # each scene has 250 images
    rays_train, rgb_train = data['rays'].view(-1,6), data['rgbs'].view(-1,3).to(device)      # rays: (4095,6), rgb: normalized (4095,3)

    # render
    rgb_map, depth_map, coefffs = render_ray(rays_train, model, chunk=batch_size,     # batch_size = 4096
                                    N_samples=449, white_bg=True, ndc_ray=0, device=device, is_train=True)

    # compute loss
    loss = torch.mean((rgb_map - rgb_train) ** 2) #+ torch.mean(coefffs.abs())*1e-4
    PSNRs.append(-10.0 * np.log(loss) / np.log(10.0))

The forward pass of the rendering function in the FactorFields class (models/FactorFields.py):

# 1. Sample points along the rays ------------------------------------------------------------------------------------
xyz_sampled, z_vals, inner_mask = self.sample_point(rays_chunk[:, :3], viewdirs, is_train=True,N_samples=443)

    # expanded below
    def sample_point(self, rays_o, rays_d, is_train=True, N_samples=-1):
        N_samples = N_samples if N_samples > 0 else self.nSamples                 # 443
        vec = torch.where(rays_d == 0, torch.full_like(rays_d, 1e-6), rays_d)     # replace zero components of rays_d with 1e-6 to avoid division by zero
        rate_a = (self.aabb[1, :self.in_dim] - rays_o) / vec                      # (4095,3) entry/exit w.r.t. the aabb [-1,-1,-1,1,1,1]
        rate_b = (self.aabb[0, :self.in_dim] - rays_o) / vec
        # (ray origin - box bound) / ray direction gives the intersection distances with the bounding planes

        t_min = torch.minimum(rate_a, rate_b).amax(-1).clamp(min=0.05, max=1e3)   # (4095): 0.05~0.33
        rng = torch.arange(N_samples)[None].float()                               # [0,1,2,...,442]
        if is_train:
            rng = rng.repeat(rays_d.shape[-2], 1)
            rng += torch.rand_like(rng[:, [0]])                                   # (4095,443) + noise
        step = self.stepSize * rng.to(rays_o.device)                              # self.stepSize = 0.0079; step: (4095,443), 0.0032~3.2
        interpx = (t_min[..., None] + step)

        rays_pts = rays_o[..., None, :] + rays_d[..., None, :] * interpx[..., None]    # (4095,443,3) -2.4~2.6
        mask_outbbox = ((self.aabb[0, :self.in_dim] > rays_pts) | (rays_pts > self.aabb[1, :self.in_dim])).any(dim=-1)    # (4095,443) mask

        return rays_pts, interpx, ~mask_outbbox

dists = torch.cat((z_vals[:, 1:] - z_vals[:, :-1], torch.zeros_like(z_vals[:, :1])), dim=-1)

viewdirs = viewdirs.view(-1, 1, 3).expand(xyz_sampled.shape)      # (4095,443,3)
ray_valid = torch.ones_like(xyz_sampled[..., 0]).bool() if self.is_unbound else inner_mask    # mask: points inside the cube

# 2. Coordinate transformation of the sampled points ------------------------------------------------------------------
 
    # 001 normalize the 3D points to the unit cube
    pts = self.normalize_coord(xyz_sampled).view([1, -1] + [1] * (dim - 1) + [dim])
          def normalize_coord(self, xyz_sampled):
              invaabbSize = 2.0 / (self.aabb[1] - self.aabb[0])      # [1,1,1]
              return (xyz_sampled - self.aabb[0]) * invaabbSize - 1

    # 002 compute coeff and basises; details below
    feats, coeffs = self.get_coding(xyz_sampled[ray_valid])
        
            coeff = self.get_coeff(x)             # x: (1103929,3) coordinates of the valid points
                    if 'grid' in self.coeff_type:
                       # non-hash sampling: self.coeffs[self.scene_idx] is a (1,72,16,16,16) feature volume;
                       # sample per-point feature coefficients from it, result: (1103929,72)
                        coeffs = F.grid_sample(self.coeffs[self.scene_idx], pts, mode=self.cfg.model.coef_mode, align_corners=False,
                                  padding_mode='border').view(-1, N_points).t()      

            basises = []
            for i in range(freq_len):             # freq_len = 6 levels
                basises = self.get_basis(x)
                    xyz = grid_mapping(x, self.freq_bands , self.aabb[:, :self.in_dim], self.cfg.model.basis_mapping).view(1, *([1] * (3 - 1)), -1, 3)
                          def grid_mapping(positions, freq_bands, aabb, basis_mapping='sawtooth'):
                              aabbSize = max(aabb[1] - aabb[0])       # 2
                              scale = aabbSize[..., None] / freq_bands   # freq_bands: [2.0, 3.2, 4.4, 5.6, 6.8, 8] -> scale: [1.0, 0.62, 0.45, ..., 0.25]
                              if basis_mapping == 'sawtooth':
                                 pts_local = (positions - aabb[0])[..., None] % scale    # (1103929,3,6)
                                 pts_local = pts_local / (scale / 2) - 1
                                 pts_local = pts_local.clamp(-1., 1.)
               basises = torch.cat(basises, dim=-1)    # (6, 1103929, 16) --> (1103929, 72) 

            return basises * coeff, coeff

    # 003 MLP produces density and color features
    feat = self.linear_mat(feats, is_train=is_train)           # (1103929, 72) -MLP-> (1103929, 32)
    sigma[ray_valid] = self.basis2density(feat[..., 0])        # F.softplus -> (1103929)

    alpha, weight, bg_weight = raw2alpha(sigma, dists * self.cfg.renderer.distance_scale)   # dists*25; NeRF's volume rendering weights, see Eq. (6); alpha (4095,443), weight (4095,443), bg_weight (4095)

    # keep only foreground samples whose weight exceeds 0.001
    app_mask = weight > self.cfg.renderer.rayMarch_weight_thres                      # 0.001
    ray_valid_new = torch.logical_and(ray_valid, app_mask)
    app_mask = ray_valid_new[ray_valid] 

if app_mask.any():             
    # only becomes True later in training
    valid_rgbs = self.renderModule(viewdirs[ray_valid_new], feat[app_mask, 1:])
                      # the code below lifts the features with a sinusoidal positional encoding, then reduces them with an MLP
                      indata += [positional_encoding(features, self.feape)]     # self.feape = 2
                                 def positional_encoding(positions, freqs):
                                     freq_bands = (2 ** torch.arange(freqs).float()).to(positions.device)  # (F,) freqs=2  --> [1,2]
                                     pts = (positions[..., None] * freq_bands).reshape(positions.shape[:-1] + (freqs * positions.shape[-1],))  # (..., DF)  (131,31) -> (131,62)
                                     pts = torch.cat([torch.sin(pts), torch.cos(pts)], dim=-1)    # (131,62) -> (131,124)
                                     return pts

                      indata += [positional_encoding(viewdirs, self.viewpe)]    # self.viewpe

                      h = torch.cat(indata, dim=-1)          # (131,194)
                      for l in range(self.num_layers):       # 3 layer
                          h = self.mlp[l](h)
                          if l != self.num_layers - 1:
                             h = F.relu(h, inplace=True)
                      rgb = torch.sigmoid(h)

    rgb[ray_valid_new] = valid_rgbs


acc_map = torch.sum(weight, -1)
rgb_map = torch.sum(weight[..., None] * rgb, -2)

if white_bg:
    rgb_map = rgb_map + (1. - acc_map[..., None])

rgb_map = rgb_map.clamp(0, 1)

with torch.no_grad():
    depth_map = torch.sum(weight * z_vals, -1)

Additional helper functions

The basis is initialized with DCT atoms via self.basises = self.init_basis(). For the grid basis type, init_basis does the following:

elif 'grid' in self.basis_type:
     basises.append(torch.nn.Parameter(dct_dict(int(np.power(basis_dim, 1. / self.in_dim) + 1), reso,     n_selete=basis_dim, dim=self.in_dim).reshape(
                    [1, basis_dim] + [reso] * self.in_dim).to(self.device)))



def dct_dict(n_atoms_fre, size, n_selete, dim=2):
    """
    Create a dictionary using the Discrete Cosine Transform (DCT) basis. If n_atoms is
    not a perfect square, the returned dictionary will have ceil(sqrt(n_atoms))**2 atoms
    :param n_atoms:
        Number of atoms in dict
    :param size:
        Size of first patch dim
    :return:
        DCT dictionary, shape (size*size, ceil(sqrt(n_atoms))**2)
    """
    # todo flip arguments to match random_dictionary
    p = n_atoms_fre  # int(math.ceil(math.sqrt(n_atoms)))
    dct = np.zeros((p, size))

    for k in range(p):
        basis = np.cos(np.arange(size) * k * math.pi / p)
        if k > 0:
            basis = basis - np.mean(basis)

        dct[k] = basis

    kron = np.kron(dct, dct)
    if 3 == dim:
        kron = np.kron(kron, dct)

    if n_selete < kron.shape[0]:
        idx = [x[0] for x in np.array_split(np.arange(kron.shape[0]), n_selete)]
        kron = kron[idx]

    for col in range(kron.shape[0]):
        norm = np.linalg.norm(kron[col]) or 1
        kron[col] /= norm

    kron = torch.FloatTensor(kron)
    return kron

Get the origin and direction of the rays (the inputs are an image and its corresponding camera intrinsics and extrinsics):

def get_ray_directions(H, W, focal, center=None):
    """
    Get the ray directions in camera coordinates.
    Inputs:
        H, W, focal: image height, width and focal length
    Outputs:
        directions: (H, W, 3), the direction of the rays in camera coordinate
    """
    grid = create_meshgrid(H, W, normalized_coordinates=False)[0] + 0.5

    i, j = grid.unbind(-1)
    # the direction here is without +0.5 pixel centering as calibration is not so accurate
    # see https://github.com/bmild/nerf/issues/24
    cent = center if center is not None else [W / 2, H / 2]
    directions = torch.stack([(i - cent[0]) / focal[0], (j - cent[1]) / focal[1], torch.ones_like(i)], -1)  # (H, W, 3)

    return directions


def get_rays(directions, c2w):
    """
    Get the ray origins and directions in world coordinates.
    Inputs:
        directions: (H, W, 3) precomputed ray directions in camera coordinate
        c2w: (3, 4) transformation matrix from camera coordinate to world coordinate
    Outputs:
        rays_o: (H*W, 3), the origin of the rays in world coordinate
        rays_d: (H*W, 3), the normalized direction of the rays in world coordinate
    """
    # Rotate ray directions from camera coordinate to the world coordinate
    rays_d = directions @ c2w[:3, :3].T  # (H, W, 3) (512,512,3)
    # rays_d = rays_d / torch.norm(rays_d, dim=-1, keepdim=True)
    # The origin of all rays is the camera origin in world coordinate
    rays_o = c2w[:3, 3].expand(rays_d.shape)  # (H, W, 3)

    rays_d = rays_d.view(-1, 3)
    rays_o = rays_o.view(-1, 3)

    return rays_o, rays_d


Origin blog.csdn.net/qq_45752541/article/details/132276554