[3D Generation and Reconstruction] ZeroRF: Fast sparse view 360° reconstruction with Zero Pretraining


Title : ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining
Task : Sparse Reconstruction; Extension: Image to 3D, Text to 3D
Author : Ruoxi Shi*, Xinyue Wei*, Cheng Wang, Hao Su (UC San Diego)
Code : https://github.com/eliphatfs/zerorf



Summary

  ZeroRF is a novel per-scene optimization method for sparse-view 360° reconstruction with neural fields. NeRF (Neural Radiance Fields) has demonstrated high-fidelity image synthesis, but it struggles with sparse input views, and existing remedies face limitations in data dependence, computational cost, and generalization across scenes. To overcome these challenges, the key idea of ZeroRF is to integrate a tailored deep image prior into a factorized NeRF representation. Unlike traditional methods, ZeroRF parameterizes the feature grids with a neural network generator, enabling efficient sparse-view 360° reconstruction without any pre-training or additional regularization. It also extends naturally to applications in 3D content generation and editing.

1. Introduction

  Breakthroughs in neural field representations, such as NeRF and its successors, enable high-fidelity image synthesis, accelerated optimization, and various downstream applications, but they rely on dense input views; this is especially limiting for 3D content generation tasks. Reconstruction from sparse views is therefore a significant challenge.

  In recent years, methods for sparse view reconstruction [7, 23, 26, 33, 41, 60, 64, 66, 77] have attracted increasing attention. One line of work [7, 28, 33, 77], often referred to as generalizable NeRF, relies on extensive pre-training with substantial time and data requirements to reconstruct the scene of interest directly. The performance of these models is therefore closely tied to the quality of the training data, and their resolution is limited by the high computational cost of large neural networks. In addition, these models struggle to generalize across different scenes. Other methods follow the per-scene optimization paradigm but include additional modules such as vision-language models [23] and depth estimators [64] to aid reconstruction; these have proven effective on narrow baselines but perform poorly for 360° reconstruction. Furthermore, their applicability to real-world data is limited by the reliance on additional supervision, which may not always be available or accurate. Hand-designed priors have also been proposed, spanning continuity [41], information-theoretic [26], symmetry [51], and frequency [73] regularization. However, extra regularization may prevent NeRF from faithfully reconstructing the scene [73], and handcrafted priors often fail to accommodate even subtle changes in the setting.

  Furthermore, existing per-scene optimization methods for 360° reconstruction typically require hours of training even on large GPUs, converging much more slowly than fast NeRF representations such as Instant NGP [39] or TensoRF [8], which makes them difficult to apply in practice.

  We fit a TensoRF [8] on the Lego scene of the NeRF synthetic dataset using 4 and 100 views respectively, and after training converged we visualized one channel of the plane features, as shown in Figure 2.

[Figure 2: one channel of the TensoRF plane features fitted from 4 views vs. 100 views]

You can clearly see that in the sparse 4-view setting the lighting effects create noisy and distorted features, whereas in the dense (100-view) setting the feature planes look almost exactly like orthographic projections of the Lego model. We conducted similar experiments on triplane [6, 17] and Dictionary Fields [9] representations and found that this is not specific to TensoRF but is a general phenomenon for these grid-based factorized representations. We therefore hypothesize that fast, optimization-based sparse-view reconstruction can be achieved if the factorized features are kept clean under sparse view supervision.

  Proposal: integrate a tailored deep image prior [61] into a factorized NeRF representation (see Figure 3). Rather than directly optimizing the feature grids as in TensoRF, K-planes, or Dictionary Fields [8, 9, 17], ZeroRF uses a randomly initialized deep neural network (a generator) to parameterize the feature grids. The intuition is that under uncertain supervision, neural networks generalize better than lookup grids in the vast majority of cases. More formally, neural networks show higher impedance to noise and artifacts than to natural content, which they perceive and memorize more easily [21, 22, 61]. The design does not require any additional regularization or pre-training and can be applied uniformly to multiple representations. The parameterization is also "lossless" in the sense that for any given target feature grid there exists a set of deep network parameters that reproduces it.

  ZeroRF, as a new per-scene optimization method, is accompanied by extensive experiments on different generative networks for the parameterization and different factorized representations, to find the combination best suited to sparse-view 360° reconstruction. Its advantages are:

  1. No pre-trained model is required, avoiding any potential bias in training data and any restrictions on settings such as resolution or camera distribution;
  2. Training and inference are fast because it is built on a factorized NeRF representation, running in less than 30 seconds;
  3. It has the same theoretical expressive power as the underlying factorized representation;
  4. It achieves state-of-the-art novel view synthesis quality from sparse view inputs on the NeRF-Synthetic [38] and OpenIllumination [30] benchmarks.

Given ZeroRF's high-quality 360° reconstruction capability, our method can be applied in various fields, including 3D content generation and editing.

2. Related work & background

2.1 Novel view synthesis

  Neural rendering paves the way to photorealistic quality in novel view synthesis. Neural Radiance Fields (NeRF) was the first to use a multi-layer perceptron (MLP) to store the radiance field, achieving remarkable rendering quality with volume rendering. Subsequently, Plenoxels and DVGO adopted voxel-based representations; TensoRF, Instant-NGP and DiF [9] proposed factorization strategies to accelerate training; MipNeRF and RefNeRF are coordinate-based MLPs, and Point-NeRF relies on a point-cloud-based representation. Some methods replace the density field with a signed distance function (SDF) [Neuralangelo, Unisurf, Permutosdf, Neus] or convert the density field into a mesh representation [Mobilenerf, Neumanifold, Bakedsdf] to improve surface reconstruction; these methods can extract high-quality meshes without seriously affecting rendering quality. In addition, [3d gaussian splatting, 4d gaussian splatting, Deformable 3d gaussians] use Gaussian splatting to achieve real-time radiance field rendering.

2.2 Deep network prior

  While the success of deep neural networks is generally attributed to their ability to learn from large-scale datasets, the architecture of a deep network in fact captures a substantial prior before any learning.
Training a linear classifier on features from a randomly initialized convolutional network yields performance far above random guessing [20]. Randomly initialized networks have also been shown to act as few-shot learners [1, 18, 50]. This prior can be pushed further by distilling these random features, starting from the inductive bias and applying different approaches to improve image representation learning.

  In contrast to these works, the deep image prior [61] exploits this prior directly without further distillation. It shows that a GAN-style generator can serve as a parameterization with high impedance to noise and can therefore be applied to image restoration tasks such as denoising, super-resolution, and inpainting. It has since been applied to various imaging and microscopy applications [36, 43, 54, 55, 62] and extended with theoretical and practical improvements in deep decoders [21, 22]. ZeroRF follows a similar paradigm by embedding the deep prior into the parameterization of the radiance field.

2.3 Sparse view reconstruction

  NeRF exhibits limitations under sparse observations due to insufficient information. To address this, some methods pre-train on large datasets [Mvsnerf, SRF, Sharf, Grf, pixelnerf] to transfer prior knowledge and then fine-tune the model on the target scene. In contrast, another line of research optimizes each scene with manually designed regularization [Putting nerf on a diet, Flipnerf, Mixnerf, Geconerf, Sparf, Freenerf]. For example, DietNeRF uses the CLIP vision transformer to extract high-level features to enforce semantic consistency. Many of these methods design loss functions to mitigate cross-view inconsistencies, for instance based on information theory. SPARF utilizes pre-trained networks for correspondence or depth estimation to compensate for the lack of 3D information.

2.4 NeRF

  NeRF represents the radiance field of a three-dimensional scene with an MLP: given a 3D position $x$ and a viewing direction $d$, it outputs the volume density $\sigma_x$ and the view-dependent color $c_x$:

$$\sigma_x,\ c_x = \mathrm{MLP}(x, d)$$
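For intuition, a minimal PyTorch sketch of such a radiance-field MLP (layer sizes and activations are illustrative, not NeRF's exact architecture, and positional encoding is omitted):

import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy radiance field: map (position x, direction d) to (density sigma, color c)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(nn.Linear(hidden + 3, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, x, d):
        h = self.trunk(x)                                   # features from position only
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)  # view-independent density
        rgb = self.color_head(torch.cat([h, d], dim=-1))    # view-dependent color
        return sigma, rgb

sigma, rgb = TinyRadianceField()(torch.rand(4096, 3), torch.rand(4096, 3))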

2.5 TensoRF

  TensoRF replaces the MLP in NeRF with a feature volume to speed up training, and factorizes the feature volume using CANDECOMP/PARAFAC (CP) or vector-matrix (VM) decomposition. ZeroRF builds on the VM decomposition: given a three-dimensional tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$, it decomposes the tensor into vectors and matrices:

$$\mathcal{T} = \sum_{r=1}^{R_1} v_r^1 \circ M_r^{2,3} + \sum_{r=1}^{R_2} v_r^2 \circ M_r^{1,3} + \sum_{r=1}^{R_3} v_r^3 \circ M_r^{1,2}$$

where $v_r^a$ is a vector factor along axis $a$ and $M_r^{b,c}$ is a matrix factor over the remaining two axes $(b, c)$.
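As a concrete illustration, the VM decomposition can be written in a few lines of PyTorch (the rank and resolutions here are arbitrary, not the values used by ZeroRF):

import torch

I, J, K, R = 32, 32, 32, 4
v_x, M_yz = torch.randn(R, I), torch.randn(R, J, K)   # vector along X, matrix over (Y, Z)
v_y, M_xz = torch.randn(R, J), torch.randn(R, I, K)   # vector along Y, matrix over (X, Z)
v_z, M_xy = torch.randn(R, K), torch.randn(R, I, J)   # vector along Z, matrix over (X, Y)

# Each term is an outer product of a vector and a matrix; summing over ranks and axes
# reconstructs the full (I, J, K) feature tensor.
T = (torch.einsum('ri,rjk->ijk', v_x, M_yz)
   + torch.einsum('rj,rik->ijk', v_y, M_xz)
   + torch.einsum('rk,rij->ijk', v_z, M_xy))
print(T.shape)  # torch.Size([32, 32, 32])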

3. Method

  The ZeroRF pipeline is shown in Figure 3: a deep generator network takes frozen standard Gaussian noise samples as input and produces planes and vectors in the TensoRF-VM style, forming a factorized tensorial feature volume. The feature volume is then sampled along the rendered rays and decoded by a multilayer perceptron (MLP), followed by the standard volume rendering procedure trained with an MSE loss.

  The main idea of ZeroRF is to use an untrained deep generator network as the parameterization of the spatial feature grid (see the Tensorial3D network in the code, which generates the 3D tensors forming the feature space). The network can learn patterns at different scales from sparse observations and naturally generalizes to unseen views, without further upsampling tricks or explicit regularization, which in prior sparse-view reconstruction work typically requires extensive manual tuning.
The important design choices are: the spatial decomposition (how the feature volume is factorized), the architecture of the generator, and the architecture of the feature decoder.

[Figure 3: ZeroRF pipeline]

3.1 Decomposition of Feature Volume

  The principle of using a deep generative network for parameterization applies to any grid-based representation. The most straightforward option is to parameterize a full feature volume directly. However, if high rendering quality is desired, the feature volume becomes very large, memory-hungry, and computationally inefficient. TensoRF uses tensor decomposition to exploit the low-rank nature of feature volumes; the triplane representation used in [17] can be regarded as a special case of TensoRF-VM in which the vectors are constant. DiF decomposes the feature volume into multiple smaller volumes encoding different frequencies. Instant-NGP [39] adopts a multi-resolution hash map because the information in the features is inherently sparse.

  Among these decompositions, hashing breaks the spatial correlation between adjacent cells, so the deep prior cannot be applied to Instant-NGP; deep generative networks can, however, parameterize TensoRF, triplane, and DiF.
We build generator architectures for producing 1D vectors, 2D matrices, and 3D volumes, and experiment with all three decompositions. Because they share the same working principle, all of them outperform prior techniques; TensoRF-VM performs best and is our final choice of factorization.

3.2 Generator architecture

  The quality of the deep parameterization depends heavily on the architecture. Most generators to date are built from convolution and attention blocks, including the deep decoder (DD), the Stable Diffusion (SD) variational autoencoder (VAE), the decoder in Kandinsky, and the ViT-based decoder from SimMIM. ZeroRF converts their 2D convolution, pooling, and upsampling layers into 1D and 3D counterparts to obtain the 1D and 3D generators required by the different decompositions.

  These generators are originally quite large, as they are designed to fit very large datasets and produce high-quality content. This would lead to unnecessarily long run times and slower convergence when fitting a single NeRF scene. Fortunately, we find that the performance of ZeroRF after convergence remains unchanged when we shrink the width and depth of the model. We therefore keep the block composition but reduce the sizes of these architectures to improve training speed. Note that during inference we only need to store the radiance field representation, not the generator, so at rendering time ZeroRF has zero overhead compared to its underlying factorization.

  We find that the SD VAE (and its decoder part) and the Kandinsky decoder are equally effective for novel view synthesis, followed by the deep decoder, while the SimMIM architecture proves ineffective as a deep prior for the radiance field. The SD and Kandinsky decoders are mostly plain convolutional structures; Kandinsky adds self-attention in the first two blocks. We choose the (modified) SD decoder as the final generator architecture because it is the least computationally intensive.
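To make the 1D/2D/3D conversion concrete, the sketch below builds the same upsample-convolution block for vectors, planes, and volumes (channel counts and block layout are illustrative, not the exact modified SD decoder):

import torch
import torch.nn as nn

CONV = {1: nn.Conv1d, 2: nn.Conv2d, 3: nn.Conv3d}

def up_block(dim, in_ch, out_ch):
    """Upsample + convolution + activation, instantiated for 1D vectors, 2D planes, or 3D volumes."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        CONV[dim](in_ch, out_ch, kernel_size=3, padding=1),
        nn.SiLU(),
    )

noise_plane = torch.randn(1, 8, 4, 4)          # frozen Gaussian noise for a feature plane
print(up_block(2, 8, 16)(noise_plane).shape)   # torch.Size([1, 16, 8, 8])
noise_volume = torch.randn(1, 8, 4, 4, 4)      # frozen Gaussian noise for a feature volume
print(up_block(3, 8, 16)(noise_volume).shape)  # torch.Size([1, 16, 8, 8, 8])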

3.3 Decoder architecture

  Our decoder architecture follows SSDNeRF: we sample the feature grid with linear interpolation (bilinear or trilinear) and project the result with a first linear layer, producing a base feature code shared between density and appearance decoding. We found that this shared feature code helps reduce floaters by tightly coupling geometry and appearance. We then apply a SiLU activation and another linear layer for density prediction. For color prediction, we encode the view direction with spherical harmonics (SH), project it with a linear layer, and add it to the base features to introduce view dependence; we then apply a SiLU activation and predict RGB values with another linear layer, analogous to density prediction. This can be written as:

$$\sigma_x = \Theta_\sigma\big(\mathrm{SiLU}(\Theta_b F_x)\big), \qquad c_x = \sigma\big(\Theta_c\,\mathrm{SiLU}(\Theta_b F_x + \Theta_d\,\mathrm{SH}(d))\big)$$

where $F_x$ is the feature field, $\sigma(\cdot)$ is the sigmoid function, and the $\Theta_\bullet$ are linear layers. Unlike the decoders used in TensoRF and DiF, this decoder does not consume any positional encoding, which would otherwise leak positional information past the deep prior and destroy or degrade ZeroRF's performance.
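A minimal PyTorch sketch of this decoder (feature and hidden dimensions are illustrative, and the spherical-harmonics encoding of the direction is assumed to be computed elsewhere):

import torch
import torch.nn as nn

class SharedFeatureDecoder(nn.Module):
    """Shared-base-feature decoder as described above (sizes illustrative)."""
    def __init__(self, feat_dim=32, hidden=64, sh_dim=16):
        super().__init__()
        self.base = nn.Linear(feat_dim, hidden)     # Theta_b: shared base feature
        self.density = nn.Linear(hidden, 1)         # Theta_sigma
        self.dir_proj = nn.Linear(sh_dim, hidden)   # Theta_d: project SH-encoded direction
        self.rgb = nn.Linear(hidden, 3)             # Theta_c
        self.act = nn.SiLU()

    def forward(self, F_x, sh_d):
        base = self.base(F_x)                                                 # shared feature code
        sigma = self.density(self.act(base)).squeeze(-1)                      # density branch
        rgb = torch.sigmoid(self.rgb(self.act(base + self.dir_proj(sh_d))))   # view-dependent color
        return sigma, rgb

sigma, rgb = SharedFeatureDecoder()(torch.rand(4096, 32), torch.rand(4096, 16))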

4. Experiment

4.1 Experimental configuration

  The experiments use the AdamW optimizer with β1 = 0.9, β2 = 0.98, and weight decay 0.2. The learning rate starts at 0.002 and decays to 0.001 on a cosine schedule. ZeroRF is trained for 10k iterations. During volume rendering we uniformly sample 1024 points per ray, and use occupancy pruning and occlusion culling to speed up this process.
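In PyTorch, the optimizer setup described above can be sketched as follows (`model` merely stands in for the ZeroRF parameters):

import math
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the ZeroRF parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, betas=(0.9, 0.98), weight_decay=0.2)
total_iters = 10_000
# cosine decay of the learning rate from 2e-3 down to 1e-3 over 10k iterations
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: 0.5 + 0.25 * (1 + math.cos(math.pi * min(it, total_iters) / total_iters)))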

4.2 Datasets and metrics

  The evaluation metrics are PSNR, SSIM, and LPIPS. All input views are obtained by running KMeans on the camera transformation vectors and selecting the view closest to each cluster centroid.
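A sketch of this view-selection procedure (assuming `poses` holds the 4x4 camera transformation matrices of all candidate views):

import numpy as np
from sklearn.cluster import KMeans

def select_sparse_views(poses: np.ndarray, n_views: int) -> np.ndarray:
    """Cluster flattened camera transforms and keep the view closest to each centroid."""
    feats = poses.reshape(len(poses), -1)                       # (N, 16) flattened transforms
    km = KMeans(n_clusters=n_views, n_init=10, random_state=0).fit(feats)
    dists = np.linalg.norm(feats[:, None] - km.cluster_centers_[None], axis=-1)   # (N, n_views)
    return dists.argmin(axis=0)                                 # one view index per cluster

selected = select_sparse_views(np.random.randn(100, 4, 4), 4)   # e.g. pick 4 of 100 cameras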

  1. NeRF synthetic dataset :
      NeRF-synthetic dataset contains 8 objects of different materials and geometries. Experiments were conducted using 4 or 6 views as input and the model was evaluated on 200 test views.

  2. OpenIllumination dataset :

  A real-world dataset captured with a light stage, containing 8 objects with complex geometry under a single illumination; 4 or 6 views are selected from the 38 training views, and evaluation is on 10 test views.

  3. DTU dataset :

  DTU focuses primarily on forward-facing objects rather than 360° reconstruction, but for completeness we include our results on DTU in Figure 6. We use 3 views as input and test the model on the remaining views.

4.3 Results

  Comparison with state-of-the-art few-shot NeRF methods:

1. RegNeRF: continuity regularization plus a pre-trained RealNVP prior
2. DietNeRF: uses a pre-trained CLIP model
3. InfoNeRF: uses entropy as a regularizer
4. FreeNeRF: based on frequency regularization
5. FlipNeRF: uses a spatial symmetry prior

[Qualitative and quantitative comparison with the baselines]

  Most baselines exhibit obvious deficiencies to varying degrees, including floaters and noticeable color shifts in the synthesized results (highlighted by the red boxes in the figure). Among the pre-trained priors, RegNeRF's prior model was not trained on wide-baseline images and fails to reconstruct objects in the 360° setting; interestingly, DietNeRF, which uses CLIP as its prior, performs better on real images than on synthetic ones, consistent with CLIP's pre-training data distribution. FreeNeRF and FlipNeRF perform relatively well on NeRF-Synthetic but fail on OpenIllumination.

4.4 Analysis

  Impact of the number of input views. As shown in Figure 7, ZeroRF has a significant advantage over the base TensoRF representation with sparse views; when the views become denser, ZeroRF remains competitive, albeit with a smaller margin.

  Feature volume decomposition. We apply the ZeroRF generator to the triplane, TensoRF, and DiF representations and compare performance on the NeRF synthetic dataset (6 input views); the results are shown in Table 3. Adding the generator consistently improves each base representation, and all variants achieve state-of-the-art performance, showing that the deep parameterization principle applies generally to grid-based representations.


  Generator architecture. Table 4 compares the performance of different generator architectures, and Figure 8 visualizes one channel of the plane features produced by each. Without any prior (directly optimized planes), the features contain high-frequency noise and visible view boundaries. In comparison, the SD decoder and the Kandinsky model produce clean, well-structured features. SimMIM's fully attention-based ViT decoder uses patch partitioning, and blocky artifacts are visible. An MLP assumes very smooth transitions over the grid and therefore cannot faithfully represent the scene content. In general, convolutional architectures produce features that are most consistent with the scene.

  Importance of the noise. The noise input is the key to the prior: replacing it with zero-initialized trainable features breaks the framework completely (last row of Table 4). Unfreezing the noise brings no performance improvement, because with a learning rate that is small relative to the noise magnitude, the structure of the noise barely changes during training, while it introduces extra overhead and slows convergence. We therefore keep the noise frozen during training.

5. Specific applications

5.1 Text-to-3D and Image-to-3D

  Given ZeroRF's strong sparse-view reconstruction capability, a natural idea is to use existing models to generate consistent multi-view images and apply ZeroRF to lift the sparse views into 3D. For image-to-3D, we use Zero123++ to lift a single image to 6 views and fit a ZeroRF on the generated images. For text-to-3D, SDXL is first called to generate an image from the text, and the image-to-3D process described above is applied. As shown in Figure 9, ZeroRF produces reliable, high-quality reconstructions from the generated multi-view images. Fitting ZeroRF takes just 30 seconds on an A100 GPU.

[Figure 9: image-to-3D and text-to-3D results]

5.2 Mesh textures and texture editing

  ZeroRF can also reconstruct appearance given frozen geometry: we randomly render 4 images from the mesh, tile them into one large image, and apply Instruct-Pix2Pix [4] to edit the image according to a text prompt. We then fit a ZeroRF on the four edited images and bake the color values back onto the mesh surface. In this case, fitting ZeroRF takes only 20 seconds, as shown in Figure 10.

[Figure 10: mesh texturing and texture editing results]
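A hedged sketch of the tile-and-edit step using the diffusers Instruct-Pix2Pix pipeline (rendering the 4 views and the ZeroRF fitting/baking are project-specific and only stubbed here; the timbrooks/instruct-pix2pix checkpoint is the public one, not necessarily the one used by the authors):

import numpy as np
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# `views` stands in for 4 rendered RGB images (H, W, 3) of the frozen mesh.
views = [np.zeros((320, 320, 3), dtype=np.uint8) for _ in range(4)]

# Tile the 4 renders into a single 2x2 image so the edit is applied consistently across views.
tiled = np.concatenate([np.concatenate(views[:2], axis=1),
                        np.concatenate(views[2:], axis=1)], axis=0)

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix")
edited = pipe("make it look like it is made of gold",
              image=Image.fromarray(tiled), image_guidance_scale=1.5).images[0]

# Split back into 4 edited views; a ZeroRF is then fit on these and colors are baked onto the mesh.
edited = np.array(edited)
h, w = edited.shape[0] // 2, edited.shape[1] // 2
edited_views = [edited[:h, :w], edited[:h, w:], edited[h:, :w], edited[h:, w:]]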



6. Code analysis

1. Installation environment

First set up a Python 3.7 and CUDA 11.x environment:

# In .bashrc, point the default CUDA to 11.5
export PATH=/usr/local/cuda-11.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64:$LD_LIBRARY_PATH

# Create the conda environment
conda create -y -n ssdnerf python=3.7
conda activate ssdnerf

# Install PyTorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# Install MMCV and MMGeneration
pip install -U openmim
mim install mmcv-full==1.6
git clone https://github.com/open-mmlab/mmgeneration && cd mmgeneration && git checkout v0.7.2
pip install -v -e .
cd ..

# Install SpConv
pip install spconv-cu114

# Clone this repo and install other dependencies
git clone <this repo> && cd <repo folder> && git checkout ssdnerf-sd
pip install -r requirements.txt

# Install Instant-NGP (torch-ngp) dependencies
git clone https://github.com/ashawkey/torch-ngp.git

cd lib/ops/raymarching/
pip install -e .
cd ../shencoder/
pip install -e .
cd ../../..

2. Image-to-3D

First, use Zero123++ (https://github.com/SUDO-AI-3D/zero123plus) to generate 6 views from a single image. An example input image is shown below:
[Example input image: examples/ice.png]
Execute the following code (you need to log in to your wandb account in advance):

python zerorf.py --load-image=examples/ice.png

zerorf.py code analysis: from a single input image, a 360° rendering (video) is produced:

# 1. Load the 6 images (assumed to have been generated from 6 viewpoints by the zero123++ model) -------------------------------------------------------
image = torch.tensor(numpy.array(Image.open(args.load_image)).astype(numpy.float32) / 255.0).cuda()                                      # (960,640,4)
images = einops.rearrange(image, '(ph h) (pw w) c -> (ph pw) h w c', ph=3, pw=2)[None]  # (1,6,320,320,4)

# 2. Load the meta information and build each view's intrinsics (intri) and pose (extrinsics) -----------------------------------------------
cond_intrinsics = data['cond_intrinsics']  # [fx, fy, cx, cy]  [350,350,160,160]

BLENDER_TO_OPENCV_MATRIX = numpy.array([
    [1,  0,  0,  0],
    [0, -1,  0,  0],
    [0,  0, -1,  0],
    [0,  0,  0,  1]
], dtype=numpy.float32)

poses = numpy.array([(numpy.array(frame['transform_matrix']) @ BLENDER_TO_OPENCV_MATRIX) * 2
        for frame in meta['sample_0']['view_frames']])      # the camera poses are fixed by default; see the zero123 model for details

The camera poses (extrinsics) are pre-stored in meta.json.


# 3. Load the cache (on the first pass there is no cache; four all-zero tensors are created by default) --------------------------------------------------------
code_list_, code_optimizers, density_grid, density_bitfield = self.load_cache(data)

# 4. Build rays from the camera intrinsics and extrinsics ---------------------------------------------------------------------------------
cond_rays_o, cond_rays_d = get_cam_rays(cond_poses, cond_intrinsics, h, w)        # still for all 6 images

# 5. Decode step
loss, log_vars, out_rgbs, target_rgbs = self.loss_decoder(self.decoder, code, density_bitfield, cond_rays_o, cond_rays_d, cond_imgs)

loss.backward()


# 4. Sample a batch of rays (4096) via self.ray_sample: -----------------------------------------------------
rays_o = cond_rays_o.reshape(num_scenes, num_scene_pixels, 3)    # 320*320*6=614400 -> (1,614400,3)
rays_d = cond_rays_d.reshape(num_scenes, num_scene_pixels, 3)    # 320*320*6=614400 -> (1,614400,3)
if num_scene_pixels > n_samples:                                 # total number of rays > 4096
    sample_inds = [torch.randperm(target_rgbs.size(1), device=device
                   )[:n_samples if self.patch_loss is None and self.patch_reg_loss is None else n_samples // (self.patch_size ** 2)]
                   for _ in range(num_scenes)]                   # randomly pick 4096 of the 614400 rays
    sample_inds = torch.stack(sample_inds, dim=0)
    scene_arange = torch.arange(num_scenes, device=device)[:, None]
    rays_o = rays_o[scene_arange, sample_inds]                   # (1,4096,3)
    rays_d = rays_d[scene_arange, sample_inds]                   # (1,4096,3)
    target_rgbs = target_rgbs[scene_arange, sample_inds]         # (1,4096,4)

# 4.5 Initialize code
code = code.to(next(self.parameters()).dtype)          # (1,3,8,20,20) all-zero initialized tensor

# 4.6 Update code
for _ in range(update_extra_state):
    self.update_extra_state(code, *extra_args, **extra_kwargs)

#---------------------------- The decode procedure inside VolumeRenderer ------------------------------------

# 5. The prepro(code) step, which generates the 3D features. The key code is:
self.code_proc_buffer = self.code_proc_pr_inv(self.preprocessor(self.code_proc_pr(code)))
# self.preprocessor: TensorialGenerator generates two 3D tensors -----------------------------------------------------------
# loops twice, generating two Tensorial3D tensors from frozen random noise --> (2, 524288 = 16*32^3)
class TensorialGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.subs = nn.ModuleList([(Tensorial3D)(in_ch=8, out_ch=16, noise_res=4)])  # Tensorial1D / Tensorial2D are also options here

    def forward(self, _):
        r = []
        for sub in self.subs:
            sub_out = sub()                            # generate features from noise: (1,16,32,32,32)
            r.append(torch.flatten(sub_out, 1))        # (1,16,32,32,32) -> (1, 524288)
        return torch.cat(r, 1)                         # over two iterations, two Tensorial3D tensors --> (1, 1048576)
# The Tensorial3D network generates features from fixed random noise; its architecture is sketched below:

[Figure: Tensorial3D generator architecture]
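Since the original architecture figure is not reproduced here, the sketch below shows what such a noise-to-volume generator could look like (block count and channel widths are illustrative guesses, not the exact Tensorial3D implementation):

import torch
import torch.nn as nn

class Tensorial3DSketch(nn.Module):
    """Generate a (1, out_ch, 32, 32, 32) feature volume from frozen 4^3 Gaussian noise."""
    def __init__(self, in_ch=8, out_ch=16, noise_res=4):
        super().__init__()
        # Frozen standard Gaussian noise input (registered as a buffer, never optimized).
        self.register_buffer("noise", torch.randn(1, in_ch, noise_res, noise_res, noise_res))
        blocks, ch = [], in_ch
        for _ in range(3):   # 4 -> 8 -> 16 -> 32
            blocks += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv3d(ch, 32, kernel_size=3, padding=1), nn.SiLU()]
            ch = 32
        blocks.append(nn.Conv3d(ch, out_ch, kernel_size=3, padding=1))
        self.net = nn.Sequential(*blocks)

    def forward(self):
        return self.net(self.noise)   # (1, out_ch, 32, 32, 32)

print(Tensorial3DSketch()().shape)    # torch.Size([1, 16, 32, 32, 32])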

# 6. Generate densely sampled coordinates -------------------------------------------------------------------------------------------

grid_size = 64
# sample the grid in chunks (S is the chunk size)
X = torch.arange(grid_size, dtype=torch.int32, device=device).split(S)   # (64):[0,1,2,...]
Y = torch.arange(grid_size, dtype=torch.int32, device=device).split(S)   # (64):[0,1,2,...]
Z = torch.arange(grid_size, dtype=torch.int32, device=device).split(S)   # (64):[0,1,2,...]

for xs in X:
    for ys in Y:
        for zs in Z:
            # build the 3D grid coordinates
            xx, yy, zz = custom_meshgrid(xs, ys, zs)          # (64,64,64)
            coords = torch.cat([xx.reshape(-1, 1), yy.reshape(-1, 1), zz.reshape(-1, 1)], dim=-1)    # (262144,3), values in [0, 64)
            # Morton (Z-order) indices of the 262144 cells
            indices = morton3D(coords).long()                 # [N=262144]
            # indices = [0, 4, 32, 36, 256, 260, 288, 292, 2048, 2052, 2080, 2084, ..., 262111, 262139, 262143]

            xyzs = (coords.float() - (grid_size - 1) / 2) * (2 * self.bound / grid_size)   # normalize to (-1, 1)

            # jitter each sample within half a voxel
            half_voxel_width = self.bound / grid_size    # 1/64
            xyzs += torch.rand_like(xyzs) * (2 * half_voxel_width) - half_voxel_width
# 7.point_decode-----------------------------------------------------------------------

Input: code (two 3D feature blocks for density and color, 2 × (16,32,32,32)); features are interpolated at xyzs (262144 = 64³, 3) to produce a (262144, 16*2) output:
point_code = self.get_point_code(code, xyzs)
# self.get_point_code expands as follows:
class FreqFactorizedDecoder(TensorialDecoder):
    def get_point_code(self, code, xyzs):
        codes = []
        for i, (cfg, band) in enumerate(zip(preprocessor.tensor_config, self.freq_bands)):
            start = sum(map(numpy.prod, preprocessor.sub_shapes[:i]))         # 0
            end = sum(map(numpy.prod, preprocessor.sub_shapes[:i + 1]))       # 524288 = 16*32^3
            got: torch.Tensor = code[..., start: end].reshape(code.shape[0], *preprocessor.sub_shapes[i])     # (1,16,32,32,32)
            assert len(cfg) + 2 == got.ndim == 5, [len(cfg), got.ndim]
            coords = xyzs[..., ['xyzt'.index(axis) for axis in cfg]]          # (1,262144,3)
            if band is not None:
                coords = ((coords % band) / (band / 2) - 1)
            coords = coords.reshape(code.shape[0], 1, 1, xyzs.shape[-2], 3)   # (1,1,1,262144,3)
            codes.append(
                F.grid_sample(got, coords, mode='bilinear', padding_mode='border', align_corners=False)
                .reshape(code.shape[0], got.shape[1], xyzs.shape[-2]).transpose(1, 2)
            )                                                               # (262144,16); the 3D feature volume `got` differs across the two iterations
        return torch.cat(codes, dim=-1)

# 8. Rendering ----------------------------------------------------------------------------------------------------------------------

sigmas, rgbs = self.point_code_render(point_code, dirs=None)

The point_code features from the previous step are then decoded into density and color separately:

# 8.1 Shared feature point_code: predict sigma ---------------------------------------------------------------
base_x = self.base_net(point_code)                  # linear: (32,64)
base_x_act = self.base_activation(base_x)           # SiLU
sigmas = self.density_net(base_x_act).squeeze(-1)   # linear: (64,1) -> (262144,1)

# 8.2 Predict RGB (skipped here since dirs is None) -----------------------------------------------------------------------------
if dirs is None:
    rgbs = None


#
density_grid = sigma 
mean_density = torch.mean(density_grid.clamp(min=0))  # - 0.977
density_thresh = min(mean_density, 0.05)

# 9. near_far_from_aabb
# self.aabb = [-1,-1,-1,1,1,1]; see the Expand section at the end for the full function
nears, fars = batch_near_far_from_aabb(rays_o, rays_d, self.aabb.to(rays_o), self.min_near)    # (1,4096)(1,4096)



# 10. march_rays_train: sample points along rays for rendering (wrapped in CUDA/C; see the Expand section for the code)
# The kernel takes ray origins, directions, the occupancy grid, etc. as input, marches along each ray, and outputs the sampled points and related information. It implements the ray-marching part of the NeRF pipeline.
    xyzs = torch.zeros(M, 3, dtype=rays_o.dtype, device=rays_o.device)
    dirs = torch.zeros(M, 3, dtype=rays_o.dtype, device=rays_o.device)
    ts = torch.zeros(M, 2, dtype=rays_o.dtype, device=rays_o.device)

    get_backend().march_rays_train(rays_o, rays_d, density_bitfield, bound, contract, dt_gamma, max_steps, N, C, H,
                                   nears, fars, xyzs, dirs, ts, rays, step_counter, noises)
return xyzs, dirs, ts, rays

More to follow...

Expand


1. morton3D encoding

  Morton coding, also known as Z-order coding, maps coordinates in a multi-dimensional space to a one-dimensional index. It interleaves the bits of the coordinates so that points that are close in the multi-dimensional space tend to remain close in the one-dimensional ordering. This encoding is commonly used in spatial indexing and computer graphics to quickly search and process data in multi-dimensional spaces. As a 2D example, take the point (3, 5), whose coordinates in binary are 011 and 101. Interleaving these bits (one bit from each coordinate in turn, with y supplying the higher bit of each pair) gives 100111, which is 39 in decimal; this maps the two-dimensional coordinate to a single value in one-dimensional space while preserving spatial locality.

void morton3D(const at::Tensor coords, const uint32_t N, at::Tensor indices);
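A plain-Python sketch of the bit interleaving performed by morton3D (the convention of which axis occupies the lowest bit may differ from the CUDA implementation):

def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x bits at positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)    # y bits at positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)    # z bits at positions 2, 5, 8, ...
    return code

print(morton3d(1, 0, 0), morton3d(0, 1, 0), morton3d(1, 1, 1))  # 1 2 7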

2. near_far_from_aabb: compute each ray's near and far intersection distances with the AABB

void near_far_from_aabb(const at::Tensor rays_o, const at::Tensor rays_d, const at::Tensor aabb, const uint32_t N, const float min_near, at::Tensor nears, at::Tensor fars);

Specific steps are as follows:

1. Calculate the currently processed ray index n based on the thread index.
2. Locate the corresponding ray starting point and direction according to the ray index.
3. Calculate the intersection parameters of the ray and AABB on the three axes of x, y, and z.
4. Calculate the near and far points of the ray and AABB based on the intersection parameters.
5. The calculated near and far points are stored in nears and fars.

The kernel relies on CUDA's parallelism: each ray is processed independently by one thread, so many rays are handled simultaneously and the computation is greatly accelerated. The kernel is as follows:

__global__ void kernel_near_far_from_aabb(
    const scalar_t * __restrict__ rays_o,
    const scalar_t * __restrict__ rays_d,
    const scalar_t * __restrict__ aabb,
    const uint32_t N,
    const float min_near,
    scalar_t * nears, scalar_t * fars
) {
    
    
    // parallel per ray
    const uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    if (n >= N) return;

    // locate
    rays_o += n * 3;
    rays_d += n * 3;

    const float ox = rays_o[0], oy = rays_o[1], oz = rays_o[2];
    const float dx = rays_d[0], dy = rays_d[1], dz = rays_d[2];
    const float rdx = 1 / dx, rdy = 1 / dy, rdz = 1 / dz;

    // get near far (assume cube scene)
    float near = (aabb[0] - ox) * rdx;
    float far = (aabb[3] - ox) * rdx;
    if (near > far) swapf(near, far);

    float near_y = (aabb[1] - oy) * rdy;
    float far_y = (aabb[4] - oy) * rdy;
    if (near_y > far_y) swapf(near_y, far_y);

    if (near > far_y || near_y > far) {
    
    
        nears[n] = fars[n] = std::numeric_limits<scalar_t>::max();
        return;
    }

    if (near_y > near) near = near_y;
    if (far_y < far) far = far_y;

    float near_z = (aabb[2] - oz) * rdz;
    float far_z = (aabb[5] - oz) * rdz;
    if (near_z > far_z) swapf(near_z, far_z);

    if (near > far_z || near_z > far) {
    
    
        nears[n] = fars[n] = std::numeric_limits<scalar_t>::max();
        return;
    }

    if (near_z > near) near = near_z;
    if (far_z < far) far = far_z;

    if (near < min_near) near = min_near;

    nears[n] = near;
    fars[n] = far;
}

3.march_rays_train

This kernel takes the ray origins, directions, occupancy grid, etc. as input, marches along each ray according to the occupancy structure, and outputs the sampled points along the ray and related information. It implements the ray-marching part of a NeRF-style renderer.

__global__ void kernel_march_rays_train(
    const scalar_t * __restrict__ rays_o,
    const scalar_t * __restrict__ rays_d,
    const uint8_t * __restrict__ grid,
    const float bound, const bool contract,
    const float dt_gamma, const uint32_t max_steps,
    const uint32_t N, const uint32_t C, const uint32_t H,
    const scalar_t* __restrict__ nears,
    const scalar_t* __restrict__ fars,
    scalar_t * xyzs, scalar_t * dirs, scalar_t * ts,
    int * rays,
    int * counter,
    const scalar_t* __restrict__ noises
) {
    
    
    // parallel per ray
    const uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
    if (n >= N) return;

    // is first pass running.
    const bool first_pass = (xyzs == nullptr);

    // locate
    rays_o += n * 3;
    rays_d += n * 3;
    rays += n * 2;

    uint32_t num_steps = max_steps;

    if (!first_pass) {
    
    
        uint32_t point_index = rays[0];
        num_steps = rays[1];
        xyzs += point_index * 3;
        dirs += point_index * 3;
        ts += point_index * 2;
    }

    // ray marching
    const float ox = rays_o[0], oy = rays_o[1], oz = rays_o[2];
    const float dx = rays_d[0], dy = rays_d[1], dz = rays_d[2];
    const float rdx = 1 / dx, rdy = 1 / dy, rdz = 1 / dz;
    const float rH = 1 / (float)H;
    const float H3 = H * H * H;

    const float near = nears[n];
    const float far = fars[n];
    const float noise = noises[n];

    const float dt_min = 2 * SQRT3() / max_steps;
    const float dt_max = 2 * SQRT3() * bound / H;
    // const float dt_max = 1e10f;

    float t0 = near;
    t0 += clamp(t0 * dt_gamma, dt_min, dt_max) * noise;
    float t = t0;
    uint32_t step = 0;

    //if (t < far) printf("valid ray %d t=%f near=%f far=%f \n", n, t, near, far);

    while (t < far && step < num_steps) {
    
    
        // current point
        const float x = clamp(ox + t * dx, -bound, bound);
        const float y = clamp(oy + t * dy, -bound, bound);
        const float z = clamp(oz + t * dz, -bound, bound);

        float dt = clamp(t * dt_gamma, dt_min, dt_max);

        // get mip level
        const int level = max(mip_from_pos(x, y, z, C), mip_from_dt(dt, H, C)); // range in [0, C - 1]

        const float mip_bound = fminf(scalbnf(1.0f, level), bound);
        const float mip_rbound = 1 / mip_bound;

        // contraction
        float cx = x, cy = y, cz = z;
        const float mag = fmaxf(fabsf(x), fmaxf(fabsf(y), fabsf(z)));
        if (contract && mag > 1) {
    
    
            // L-INF norm
            const float Linf_scale = (2 - 1 / mag) / mag;
            cx *= Linf_scale;
            cy *= Linf_scale;
            cz *= Linf_scale;
        }

        // convert to nearest grid position
        const int nx = clamp(0.5 * (cx * mip_rbound + 1) * H, 0.0f, (float)(H - 1));
        const int ny = clamp(0.5 * (cy * mip_rbound + 1) * H, 0.0f, (float)(H - 1));
        const int nz = clamp(0.5 * (cz * mip_rbound + 1) * H, 0.0f, (float)(H - 1));

        const uint32_t index = level * H3 + __morton3D(nx, ny, nz);
        const bool occ = grid[index / 8] & (1 << (index % 8));

        // if occpuied, advance a small step, and write to output
        //if (n == 0) printf("t=%f density=%f vs thresh=%f step=%d\n", t, density, density_thresh, step);

        if (occ) {
    
    
            step++;
            t += dt;
            if (!first_pass) {
    
    
                xyzs[0] = cx; // write contracted coordinates!
                xyzs[1] = cy;
                xyzs[2] = cz;
                dirs[0] = dx;
                dirs[1] = dy;
                dirs[2] = dz;
                ts[0] = t;
                ts[1] = dt;
                xyzs += 3;
                dirs += 3;
                ts += 2;
            }
        // contraction case: cannot apply voxel skipping.
        } else if (contract && mag > 1) {
    
    
            t += dt;
        // else, skip a large step (basically skip a voxel grid)
        } else {
    
    
            // calc distance to next voxel
            const float tx = (((nx + 0.5f + 0.5f * signf(dx)) * rH * 2 - 1) * mip_bound - cx) * rdx;
            const float ty = (((ny + 0.5f + 0.5f * signf(dy)) * rH * 2 - 1) * mip_bound - cy) * rdy;
            const float tz = (((nz + 0.5f + 0.5f * signf(dz)) * rH * 2 - 1) * mip_bound - cz) * rdz;

            const float tt = t + fmaxf(0.0f, fminf(tx, fminf(ty, tz)));
            // step until next voxel
            do {
    
    
                dt = clamp(t * dt_gamma, dt_min, dt_max);
                t += dt;
            } while (t < tt);
        }
    }

    //printf("[n=%d] step=%d, near=%f, far=%f, dt=%f, num_steps=%f\n", n, step, near, far, dt_min, (far - near) / dt_min);

    // write rays
    if (first_pass) {
    
    
        uint32_t point_index = atomicAdd(counter, step);
        rays[0] = point_index;
        rays[1] = step;
    }
}

4. TV regularization

TV regularization (Total Variation regularization) smooths an image by minimizing the magnitude of its gradients.

Specifically, for a two-dimensional image, minimizing the total gradient magnitude suppresses noise and fine-scale detail, making the image smoother. TV regularization is usually added as a regularization term to an optimization problem to balance data fitting against smoothness.

For methods such as NeRF (Neural Radiance Fields) used for three-dimensional reconstruction, TV regularization can improve the quality of the results, because data sparsity and noise often introduce spurious detail and artifacts into the reconstruction.

In addition, TV regularization strengthens the constraints on the model and can improve its generalization and robustness to noise. For 3D reconstruction tasks such as NeRF, applying TV regularization therefore helps improve reconstruction quality and model robustness.
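A minimal sketch of a TV loss on a 2D image or feature plane (the weighting of the term is application-specific):

import torch

def tv_loss(plane: torch.Tensor) -> torch.Tensor:
    """Total-variation penalty on a (C, H, W) tensor: mean absolute finite difference along both spatial axes."""
    dh = (plane[:, 1:, :] - plane[:, :-1, :]).abs().mean()
    dw = (plane[:, :, 1:] - plane[:, :, :-1]).abs().mean()
    return dh + dw

# e.g. total_loss = mse_loss + lambda_tv * tv_loss(feature_plane)
print(tv_loss(torch.rand(16, 64, 64)))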


Origin blog.csdn.net/qq_45752541/article/details/135070014