NeRF and 3D Reconstruction Column (3): Interpretation of the nerf_pl source code and the use of colmap and cuda operators

Preface: In the previous chapter, we introduced the NeRF principle, traditional volume rendering methods and the relationship between the two. In this chapter, we explain the installation and use of colmap and part of the nerf_pl source code. During development, some operations are not supported by python/torch, so we have to build our own wheels; we will also encounter cuda operators in later columns, so this chapter also explains how to write and use cuda operators.


Colmap installation and use

colmap is an open-source, C++-based computer vision library that provides tools for 3D reconstruction, image retrieval, structure from motion, and 2D/3D feature extraction and matching. Its goal is to provide a powerful, flexible and easy-to-use platform for academic research, education and product development. colmap can run its computations on either the CPU or the GPU, supports multiple image formats and camera models, and provides several dataset import and export formats.

Since this column focuses on NeRF with known poses, we first need to know how to obtain the camera poses.

1. Download colmap and compile:

Download the colmap source code, select the latest release tag, and compile following the steps in its documentation.

There are many other colmap compilation and installation tutorials online, so we will not go into how to solve build problems caused by environment differences here.

2. Use colmap

In fact, there are already many sample tutorials of colmap on the Internet. Here we briefly introduce the use of colmap gui:

Enter `colmap gui` in bash to open the GUI:

(screenshot: colmap GUI main window)

In `File`, select `New project`; in the `Database` field click `New` and create a new database.db (you don't need to create the file manually, typing database.db in the input field creates it automatically); in the `Images` field select the image directory (subdirectories are allowed, but all pictures under the image directory should be shot in the same scene); then click `Save`;

(screenshot: new project dialog)

Then in `Processing`, select `Feature extraction`. If you don't know the camera intrinsics, just click `Extract`. If you do have intrinsics, choose `Custom parameters` and enter the camera intrinsics in order (the parameter list varies with the camera model you choose; generally, as long as the focal length is accurate the reconstruction quality will not be too bad):

(screenshot: feature extraction dialog)

After feature extraction finishes, move on to matching: in `Processing`, select `Feature matching`. If you have a vocabulary tree, you can import the vocabulary tree file; if the images are consecutive frames, you can select `Sequential` and limit the maximum number of neighboring frames; if neither applies, select `Exhaustive`, i.e. brute-force matching (for a small scene it does not take much time):

(screenshot: feature matching dialog)

After matching, select `Reconstruction` → `Start reconstruction` to run sparse reconstruction and obtain the result:

(screenshot: sparse reconstruction result)
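For reference, the same pipeline can also be run without the GUI through colmap's command line. Below is a minimal sketch that drives it from python via subprocess (paths such as database.db, images/ and sparse/ are placeholders, and the flags assume a reasonably recent colmap build):

import os
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

os.makedirs("sparse", exist_ok=True)

# 1. feature extraction (fills database.db)
run(["colmap", "feature_extractor",
     "--database_path", "database.db",
     "--image_path", "images"])

# 2. exhaustive (brute-force) matching; use sequential_matcher for video frames
run(["colmap", "exhaustive_matcher",
     "--database_path", "database.db"])

# 3. sparse reconstruction; results go to sparse/0, sparse/1, ...
run(["colmap", "mapper",
     "--database_path", "database.db",
     "--image_path", "images",
     "--output_path", "sparse"])

# 4. convert the binary model to txt so it is easy to parse
run(["colmap", "model_converter",
     "--input_path", "sparse/0",
     "--output_path", "sparse/0",
     "--output_type", "TXT"])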

Then we can choose `File` → `Export model` to output cameras, images and points3D in bin or txt format: cameras stores the camera intrinsics, images stores the name and pose of each picture (the rotation is stored as a quaternion) together with its feature points, and the points3D file stores the sparse point cloud. (If you want the PLY format, select `File` → `Export model as...` and choose PLY at the bottom right.)

(screenshot: export model dialog)
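As a quick illustration of what the exported txt files contain, here is a minimal sketch that reads images.txt and turns each quaternion + translation into a 4x4 camera-to-world matrix (the file path and the read_colmap_poses helper are our own, assuming the txt export described above):

import numpy as np

def qvec2rotmat(q):
    # colmap stores the rotation as a unit quaternion (qw, qx, qy, qz)
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y]])

def read_colmap_poses(path="sparse/0/images.txt"):
    poses = {}
    with open(path) as f:
        lines = [l for l in f if not l.startswith("#")]
    # each image occupies two lines; the second one lists its 2D keypoints
    for l in lines[::2]:
        elems = l.split()
        qvec = np.array(list(map(float, elems[1:5])))
        tvec = np.array(list(map(float, elems[5:8])))
        name = elems[9]
        R = qvec2rotmat(qvec)          # world-to-camera rotation
        c2w = np.eye(4)
        c2w[:3, :3] = R.T              # camera-to-world rotation
        c2w[:3, 3] = -R.T @ tvec       # camera center in world coordinates
        poses[name] = c2w
    return poses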

cuda operator

Because some operations are not supported in python, or because torch/multithreaded CPU code is slower than GPU computation, we sometimes need to write cuda operators ourselves, for example the ray marching algorithm often used in NeRF methods, or the AABB intersection test from graphics. But our goal is not to become wheel makers: we don't need to write a cuda operator for every operation that could run on the GPU; learning cuda operators is just another way to realize our ideas. At the same time, we don't need a deep understanding of advanced cuda techniques such as data alignment or texture memory; just treat it as a C++ implementation. Note that we assume the reader has already installed cuda.

1. Overview of GPU Parallel Computing Principles

In single-threaded CPU computing, one thread runs one piece of code, and that code processes one piece of data at a time, which means the program's computing power is limited to one CPU core; when single-threaded performance is not enough, multiple CPU cores can run multiple threads in parallel to process multiple pieces of data.

However, CPU multithreading tops out at only dozens of threads running at the same time (depending on your memory size and CPU model). Compared with graphics cards of the same period, the number of computing units can differ by three or four orders of magnitude: for example, the NVIDIA A2000 has 3328 CUDA cores and the A5000 has 8192. These cores are distributed over dozens of streaming multiprocessors (SMs, 26 for the A2000 and 64 for the A5000), and each SM can run 2048 threads at the same time.

In the cuda framework, how do we control the calculation of each thread? First we need to introduce two abstract data structures of cuda: block and thread

(figure: grid/block/thread hierarchy)

A kernel function controls the computing resources of one grid; each grid can be configured with multiple blocks, and each block with multiple threads (blocks and threads are not physically laid out on the GPU this way, they are just abstractions). Each computing unit runs this kernel function independently and processes different data according to its __thread number__. For example, we can establish a one-to-one mapping from thread numbers to image pixel coordinates, so that each thread processes one pixel independently (e.g. computing the mean of that pixel's neighborhood).

We can obtain the block and thread where the current computing unit is located through the built-in variables `blockIdx` and `threadIdx`. Both are triplets with `x`, `y`, `z` members, indicating the position of the block within the grid and the position of the thread within the block (in the CV field we usually don't use `z`; `x` and `y` are enough to cover a whole image), for example:

// kernel
const int BLOCK_W = 64;
const int BLOCK_H = 16;
const dim3 blockSize(BLOCK_W,BLOCK_H,1);
const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
// num_pixs = 500 * 300
...

In the above code, we declare a block of size $64\times16$ for the kernel, and a grid containing $\frac{500\times300+64\times16-1}{64\times16}\approx147$ blocks; these blocks are arranged linearly in one dimension along the `blockIdx.x` direction. Then the "absolute number" of each thread can be expressed as:

// device
const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y;

Here we can draw the layout of block and thread to better understand:

(figures: block layout and thread layout)

If a thread is located in block (2,0,0) and its thread index within that block is (25,16,0), then its absolute number is the __number of threads in all blocks before the current block__ plus the thread's offset within the current block, i.e. $2\times16\times64+25\times16+16=2464$; this number means the thread corresponds to the 2465th element of the length-$300\times500$ vector.
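As a quick sanity check of this arithmetic, the same mapping can be written out in a few lines of python (a toy verification of ours; BLOCK_W and BLOCK_H match the launch configuration above):

BLOCK_W, BLOCK_H = 64, 16   # blockDim.x, blockDim.y

def absolute_thread_id(block_x, thread_x, thread_y):
    # threads in all preceding blocks + this thread's offset inside its block
    return block_x * BLOCK_W * BLOCK_H + thread_x * BLOCK_H + thread_y

print(absolute_thread_id(2, 25, 16))   # -> 2464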

Students who want to go deeper can refer to the official cuda documentation .

2. cuda operator and pybinding

A normal cuda program has to be compiled with the nvcc compiler, which on linux usually means writing a cumbersome CMakeLists. Fortunately, with pybinding we can hand the compilation over to python's setuptools, and the copying of data from host memory to the GPU can be done with torch, so we don't need to understand cuda's relatively complicated memcpy strategies; the only disadvantage is that we cannot easily debug the cuda operator we write.

But so that you can still debug, we also provide the C++ project code later.

3. Examples

A text description alone is too abstract, so let's implement a simple cuda program:

In large-scene NeRF, due to the limited capacity of the network, we need to divide the scene into blocks.

Suppose the scene is divided into $4\times4$ sub-regions. The sub-NeRF of each sub-region needs to be trained with the images that observe that region, but an image often observes more than one region, so for each sub-region we need to compute, for every image that observes it, which pixels see the region and mark them with a mask (as in mega-nerf).

For an $m\times n$ mask image, the data type is bool; although this is already the smallest data type in python (1 byte per element), for a high-resolution image (for example $5000\times3000$) the space required to store one mask is $\frac{5000\times3000}{1024^2}\approx14$ MB. Assuming there are 1000 pictures, the storage for all masks is about $14$ GB; in large-scene NeRF, each of the 16 block sub-NeRFs needs its own $14$ GB, i.e. $14\times16=224$ GB in total, which is at least unbearable for the author, especially when we want to test the effect of different blocking strategies.

Then if we vectorize each mask image into a one-dimensional vector of length $15000000$ where each element occupies 1 bit (not 1 byte), the storage can be reduced from 224 GB to $\frac{15000000\times16\times1000}{8\times1024^3}\approx28$ GB, which is on the same order of magnitude as the 1000 RGB images themselves (about 11 GB).
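These estimates are easy to reproduce; here is a small back-of-the-envelope check of ours using the numbers above:

H, W = 3000, 5000            # mask resolution
n_images = 1000
n_blocks = 4 * 4             # 4x4 scene partition

bool_mask_mb = H * W / 1024**2                               # ~14.3 MB per bool mask
bool_total_gb = bool_mask_mb * n_images * n_blocks / 1024    # ~224 GB
bit_total_gb = H * W * n_images * n_blocks / (8 * 1024**3)   # ~28 GB

print(f"one bool mask:        {bool_mask_mb:.1f} MB")
print(f"all bool masks:       {bool_total_gb:.1f} GB")
print(f"all bit-packed masks: {bit_total_gb:.1f} GB")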

But implementing this packing in pure python is very slow, and even with multi-threading it takes a long time. Recalling the cuda principles above, we might as well implement this process as a cuda operator:

3.1 Core part of cu operator

#include <iostream>
#include <torch/extension.h>
#include <ATen/ATen.h>
#include <ATen/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>

using namespace std;

__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // for a 1-D block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // for a 2-D block
    if(n >= bits_array.size(0))
        return;
    int mask_size = 32;
    if ((n + 1) * 32 > idx_array.size(0)) // the last chunk may be partial
        mask_size = idx_array.size(0) % 32;
    const int64_t flag = 1;
    for(int i = 0 ; i < mask_size ; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}


torch::Tensor packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles 32 bits, i.e. 32 pixels
    const int num_pixs = (idx_array.size(0) + 31) / 32; // ceil(N/32)
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);

    AT_DISPATCH_ALL_TYPES(idx_array.type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return bits_array;
}

Among the includes, the torch headers come from libtorch (the cu111 build here; different versions of libtorch can be downloaded from here); the ATen library provides the basic tensor and mathematical operations (it is a sub-library of libtorch).

In the above code, the host-side function `packbits_u32_cu` receives the vectorized mask image `idx_array` and an int64 array `bits_array` that will hold the packed bits, and plans the block and thread layout of the grid controlled by this kernel;

Then `AT_DISPATCH_ALL_TYPES` passes the data to the device-side kernel `packbits_u32_kernel`.

`AT_DISPATCH_ALL_TYPES` automatically dispatches tensors of different data types to the corresponding functions; it takes three arguments: the data type, a name string, and the lambda that launches the kernel;

`[&]` is a C++ lambda capture: the code block that follows is packaged as a lambda function, and `&` means all outer variables are captured by reference;

`<<< >>>` is the cuda syntax for the kernel launch configuration: the device-side function is on the left, the grid and block configuration goes between the angle brackets, and the parameter list follows on the right.

In `packbits_u32_kernel`, each thread takes 32 elements from `idx_array` and, via bit operations, packs them into the 32 low bits of one element of `bits_array`, completing the data reduction (we use int64 instead of uint32 here because ATen does not support uint32 or uint64).

If readers want to know more, they can refer to torch's official documentation or go directly to libtorch's api documentation. Another example the author finds easy to start with is kwea123's trilinear interpolation tutorial; his cuda extension series of tutorials is also suitable for most beginners who want to go further.

With that, the core part of the cuda operator is complete; next we introduce how to call this operator from python.

3.2 pybinding

Usually, when calling language-B code from language A, we first need to package the language-B code into a dynamic library, and pybinding is no exception. We first build the complete cuda source code and then package it as a dynamic library. At this point our project needs at least three files: a header file, binding.cpp, and main.cu:

utils.h header file:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array);

torch::Tensor un_packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array);

torch::Tensor packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
);

torch::Tensor un_packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
);

binding.cpp

#include "utils.h"

torch::Tensor packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array){
    CHECK_CUDA(idx_array);
    CHECK_CUDA(bits_array);
    
    return packbits_u32_cu(idx_array,bits_array);
}

torch::Tensor un_packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array){
    CHECK_CUDA(idx_array);
    CHECK_CUDA(bits_array);
    
    return un_packbits_u32_cu(idx_array,bits_array);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){
    m.def("packbits_u32",&packbits_u32);
    m.def("un_packbits_u32",&un_packbits_u32);
}

main.cu

#include "utils.h"
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cmath>


__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // for a 1-D block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // for a 2-D block
    if(n >= bits_array.size(0))
        return;
    int mask_size = 32;
    if ((n + 1) * 32 > idx_array.size(0)) // the last chunk may be partial
        mask_size = idx_array.size(0) % 32;
    const int64_t flag = 1;
    for(int i = 0 ; i < mask_size ; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}

torch::Tensor packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles 32 bits, i.e. 32 pixels
    const int num_pixs = (idx_array.size(0) + 31) / 32; // ceil(N/32)
    // const int threads = 256, blocks = (num_pixs+threads-1)/threads;
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);

    AT_DISPATCH_ALL_TYPES(idx_array.type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return bits_array;
}

__global__ void un_packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // for a 1-D block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // for a 2-D block

    if(n >= bits_array.size(0))
        return;
    int mask_size = 32;
    if ((n + 1) * 32 > idx_array.size(0)) // the last chunk may be partial
        mask_size = idx_array.size(0) % 32;
    const int64_t flag = 1;
    for(int i = 0 ; i < mask_size ; i++){
        if (bits_array[n] & (flag << i)){
            idx_array[n*32 + i]++;
        }
    }
}

torch::Tensor un_packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles 32 bits, i.e. 32 pixels
    const int num_pixs = (idx_array.size(0) + 31) / 32; // ceil(N/32)
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);

    AT_DISPATCH_ALL_TYPES(idx_array.type(),"un_packbits_u32_cu",
    ([&] {
        un_packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return idx_array;
}

The file structure is:

files

Among them, binding.cpp establishes the interface between the cuda functions and the python functions. Next, we package the entire project as a python package named PackBit:

setup.py

import glob
import os.path as osp
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

ROOT_DIR = osp.dirname(osp.abspath(__file__))
include_dirs = [osp.join(ROOT_DIR, "include")]

sources = glob.glob('*.cpp')+glob.glob('*.cu')


setup(
    name='PackBit',
    version='1.0',
    author='will',
    ext_modules=[
        CUDAExtension(
            name='PackBit',
            sources=sources,
            include_dirs=include_dirs,
            extra_compile_args={'cxx': ['-O2'],
                                'nvcc': ['-O2']}
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

Since we skipped the CMakeLists.txt step, linking other third-party libraries when compiling the cuda program is inconvenient. In that case we can add the `library_dirs` and `libraries` parameters to the `ext_modules` entry to specify the dynamic libraries to link, for example:

setup(
    name='PackBit',
    version='1.0',
    author='will',
    ext_modules=[
        CUDAExtension(
            name='PackBit',
            sources=sources,
            include_dirs=include_dirs,
            library_dirs=["/home/will/Downloads/libtorch/libtorch/lib","/usr/local/cuda-11.6/lib64"],
            libraries=["c10","torch_python","torch"],
            extra_compile_args={'cxx': ['-O2'],
                                'nvcc': ['-O2']},
            extra_link_args=["-Wl,-rpath=/home/will/Downloads/libtorch/libtorch/lib"]
        )
    ],
    cmdclass={
        # tell setuptools how to build the extension
        'build_ext': BuildExtension
    }
)

Put setup.py, main.cu, and binding.cpp in the same directory, cd into that directory in bash, and run `python setup.py install` to compile and install the cuda extension.

Finally, let's do a simple test (note that since the PackBit extension we wrote is built on top of the torch library, we must `import torch` first):

test.py

import torch
import PackBit
import numpy as np
import math

if __name__ == "__main__":
    # test packbits
    a = torch.randint(2, (5463*3460,), dtype=torch.int32).cuda()          # random 0/1 mask
    b = torch.zeros([math.floor(5463*3460/32)], dtype=torch.int64).cuda() # packed output
    c = PackBit.packbits_u32(a, b)

    print('first 32 elements of a:', a[0:32])
    print('b[0]:', b[0])
    # binary form of b[0] (most-significant bit first)
    print('b[0] in binary:', np.binary_repr(int(b[0]), width=32))

    # test unpackbits
    # a = torch.zeros([5463*3460], dtype=torch.int32).cuda()
    # b = torch.randint(2**32, (math.floor(5463*3460/32),), dtype=torch.int64).cuda()
    # c = PackBit.un_packbits_u32(a, b)
    #
    # print('b[0]:', b[0])
    # print('vector decoded from b[0]:', a[0:32])
    # print('b[0] in binary:', np.binary_repr(int(b[0]), width=32))

Test Results:

packbits test results
unpackbits test results
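To double-check the operator's output, we can compare it against a slow pure-torch reference; a minimal sketch of such a check (our own verification code, not part of the package):

import math
import torch

def packbits_u32_reference(a: torch.Tensor) -> torch.Tensor:
    # slow reference: pack each group of 32 flags into the low bits of one int64
    n_chunks = math.floor(a.numel() / 32)
    bits = (a[:n_chunks * 32] > 0).view(n_chunks, 32).to(torch.int64)
    weights = torch.tensor([1 << i for i in range(32)], dtype=torch.int64)  # bit i -> 2**i
    return (bits * weights).sum(dim=1)

# usage, assuming a and c come from the test script above:
# assert torch.equal(c.cpu(), packbits_u32_reference(a.cpu()))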

3.3 debug

We mentioned before that the cu-operator-plus-pybinding approach is inconvenient to debug. If you want to debug, you can only start a separate C++ project. Here we provide the CMakeLists.txt and main.cu:

CMakeLists.txt

cmake_minimum_required(VERSION 2.8 FATAL_ERROR)
project(test)

find_package(PythonInterp REQUIRED)
find_package(CUDA REQUIRED)
# find_package(OpenCV)
find_package(Python3 COMPONENTS Interpreter Development REQUIRED)
find_package(Torch REQUIRED)

set(CMAKE_BUILD_TYPE DEBUG)
set(CMAKE_CUDA_COMPILER /usr/local/cuda-11.6/bin/nvcc)
set(CUDACXX /usr/local/cuda-11.6/bin/nvcc)
set(CMAKE_CXX_STANDARD 14)
set(Torch_DIR /home/will/Downloads/libtorch/libtorch/share/cmake/Torch) # required for the torch library
set(CUDA_INCLUDE_DIRS "/usr/local/cuda/include")
include_directories(${PYTHON_INCLUDE_DIR})
include_directories(
        ./include
        /usr/include/python3.10
        # /usr/local/include/opencv4
        ${CUDA_INCLUDE_DIRS}
        )

set(LIBRARIES 
    # opencv_features2d
    # opencv_calib3d
    # opencv_flann
    # opencv_highgui
    # opencv_imgcodecs
    # opencv_imgproc
    # opencv_core
)

set(SRC_LIST ./main.cu)
add_executable(demo ${SRC_LIST}) # build the source files into an executable
target_link_libraries(demo "${TORCH_LIBRARIES}") # required for the torch library
target_link_libraries(demo ${LIBRARIES}) # link static/dynamic libraries to the executable

main.cu

// #include "utils.h"
#include <torch/extension.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>

using namespace std;

__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // for a 1-D block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // for a 2-D block
    if(n >= bits_array.size(0))
        return;
    int mask_size = 32;
    if ((n + 1) * 32 > idx_array.size(0)) // the last chunk may be partial
        mask_size = idx_array.size(0) % 32;
    const int64_t flag = 1;
    for(int i = 0 ; i < mask_size ; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        // printf("asd");
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}


void packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles 32 bits, i.e. 32 pixels
    const int num_pixs = (idx_array.size(0) + 31) / 32; // ceil(N/32)
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    AT_DISPATCH_ALL_TYPES(idx_array.type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<8, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
}



int main() {
    auto data = torch::randint(0,2,{10000}).to(torch::kInt32);
    auto idx = torch::zeros({10}, torch::kInt64);

    auto data1 = data.to(torch::kCUDA);
    auto idx1 = idx.to(torch::kCUDA);

    packbits_u32_cu(data1,idx1);
    cudaDeviceSynchronize();
    auto modified_tensor = idx1.to(torch::kCPU); // copy the result back to the CPU
    std::cout << modified_tensor << std::endl;

    return 0;
}

We will not cover cuda-gdb debugging here; there is plenty of related material online, and you can refer to the official cuda-gdb documentation.

For more detailed tutorials on the c++ configuration of pybinding, please refer to the official documentation .

Interpretation of part of the source code of nerf_pl

nerf_pl is kwea123's reproduction of NeRF using pytorch-lightning. Compared with the original NeRF code it is easy to read and easy to modify; we do need to learn pytorch-lightning, but by combining GPT with the official api documentation we can understand it and get started quickly. Next, we explain the concrete network structure and training process of NeRF in combination with the paper.

In the author's opinion, the main body of the project consists of three parts: the model, ray generation, and rendering; for the project's environment configuration, refer to the repository README.

1. Model part

The network structure of nerf is in models/nerf.py:

class NeRF(nn.Module):
    def __init__(self,
                 D=8, W=256,
                 in_channels_xyz=63, in_channels_dir=27, 
                 skips=[4]):
        super(NeRF, self).__init__()
        self.D = D
        self.W = W
        self.in_channels_xyz = in_channels_xyz
        self.in_channels_dir = in_channels_dir
        self.skips = skips

        # xyz encoding layers
        for i in range(D):
            if i == 0:
                layer = nn.Linear(in_channels_xyz, W)
            elif i in skips:
                layer = nn.Linear(W+in_channels_xyz, W)
            else:
                layer = nn.Linear(W, W)
            layer = nn.Sequential(layer, nn.ReLU(True))
            setattr(self, f"xyz_encoding_{i+1}", layer)
        self.xyz_encoding_final = nn.Linear(W, W)

        # direction encoding layers
        self.dir_encoding = nn.Sequential(
                                nn.Linear(W+in_channels_dir, W//2),
                                nn.ReLU(True))

        # output layers
        self.sigma = nn.Linear(W, 1)
        self.rgb = nn.Sequential(
                        nn.Linear(W//2, 3),
                        nn.Sigmoid())

    def forward(self, x, sigma_only=False):
        if not sigma_only:
            input_xyz, input_dir = \
                torch.split(x, [self.in_channels_xyz, self.in_channels_dir], dim=-1)
        else:
            input_xyz = x

        xyz_ = input_xyz
        for i in range(self.D):
            if i in self.skips:
                xyz_ = torch.cat([input_xyz, xyz_], -1)
            xyz_ = getattr(self, f"xyz_encoding_{i+1}")(xyz_)

        sigma = self.sigma(xyz_)
        if sigma_only:
            return sigma

        xyz_encoding_final = self.xyz_encoding_final(xyz_)

        dir_encoding_input = torch.cat([xyz_encoding_final, input_dir], -1)
        dir_encoding = self.dir_encoding(dir_encoding_input)
        rgb = self.rgb(dir_encoding)

        out = torch.cat([rgb, sigma], -1)

        return out

We can see that the whole network is actually very simple: the `NeRF` class contains an 8-layer, 256-wide MLP that outputs the volume density and the feature vector `xyz_encoding_final`, plus a one-layer MLP that outputs the color, as shown below, where the positional encoding (embedding) is:

$$\gamma(p)=(\sin(2^0\pi p),\cos(2^0\pi p),\ldots,\sin(2^{L-1}\pi p),\cos(2^{L-1}\pi p))$$
nerf_pipeline

The input of the entire network is an array of 3D points $x\in R^{N\times3}$ and the array of viewing directions for these points $d\in R^{N\times3}$; the outputs are the corresponding rgb values $rgb\in R^{N\times3}$ and volume densities $sigma\in R^{N}$.
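To make the embedding concrete, here is a minimal sketch of the frequency positional encoding (our own simplified version, not the exact nerf_pl Embedding class); with L=10 for positions and L=4 for directions, the output sizes match in_channels_xyz=63 and in_channels_dir=27 above:

import math
import torch

def positional_encoding(p: torch.Tensor, L: int) -> torch.Tensor:
    # p: (N, 3) points or directions; output: (N, 3 + 3*2*L)
    out = [p]                                   # the raw input is kept as well
    for i in range(L):
        freq = (2.0 ** i) * math.pi
        out += [torch.sin(freq * p), torch.cos(freq * p)]
    return torch.cat(out, dim=-1)

x = torch.rand(1024, 3)   # 3D sample points
d = torch.rand(1024, 3)   # viewing directions
print(positional_encoding(x, L=10).shape)   # torch.Size([1024, 63])
print(positional_encoding(d, L=4).shape)    # torch.Size([1024, 27])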

2. Generate rays

The main code for generating rays is in datasets/ray_utils.py. In the `get_rays` function, the input parameter `directions` represents the direction vectors, in the camera coordinate system, from the camera optical center to each pixel on the image plane; `c2w` $=[R|t]$ is the transformation matrix from the camera coordinate system to the world coordinate system:

def get_rays(directions, c2w):
    
    # Rotate ray directions from camera coordinate to the world coordinate
    rays_d = directions @ c2w[:, :3].T # (H, W, 3)
    rays_d = rays_d / torch.norm(rays_d, dim=-1, keepdim=True)
    # The origin of all rays is the camera origin in world coordinate
    rays_o = c2w[:, 3].expand(rays_d.shape) # (H, W, 3)

    rays_d = rays_d.view(-1, 3)
    rays_o = rays_o.view(-1, 3)

    return rays_o, rays_d

In this function, we use `c2w` to transform the direction vectors `directions` from the camera coordinate system to the world coordinate system, obtaining the actual viewing rays of the camera in real space:

direction

It should be noted that the transformation matrix can be regarded as a linear transformation, and under different bases the transformation matrix is also different. For details refer to the following table (where the letter codes give the directions of the $x, y, z$ axes; e.g. BDF corresponds to down, right, front and BRU to back, up, right):

transform

Therefore, the llff and blender datasets (which use the same camera coordinate convention) construct `directions` as:

directions = \
        torch.stack([(i-W/2)/focal, -(j-H/2)/focal, -torch.ones_like(i)], -1) # (H, W, 3)

For a dataset made with colmap, we need to construct `directions` like this:

directions = \
        torch.stack([(u-cx+0.5)/fx, (v-cy+0.5)/fy, torch.ones_like(u)], -1)
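For completeness, here is a minimal sketch of how the pixel grid (u, v) used in the snippet above can be built, assuming fx, fy, cx, cy come from colmap's cameras file (our own helper, not nerf_pl code):

import torch

def get_directions_colmap(H, W, fx, fy, cx, cy):
    # pixel coordinate grid: u indexes columns (x), v indexes rows (y)
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing='ij')
    directions = torch.stack([(u - cx + 0.5) / fx,
                              (v - cy + 0.5) / fy,
                              torch.ones_like(u)], dim=-1)   # (H, W, 3)
    return directions

dirs = get_directions_colmap(H=3000, W=5000, fx=4000.0, fy=4000.0, cx=2500.0, cy=1500.0)
print(dirs.shape)   # torch.Size([3000, 5000, 3])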

Of course, we can also apply a similar transformation directly to `c2w`, which is equivalent to the operation in the following figure:

coordinate

(The picture above is from the blog )

In short, when we use `c2w`, we must first figure out which coordinate system it transforms from and which it transforms to.
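As an example of such a convention mismatch, a c2w stored in the right-up-back convention (OpenGL/blender style) can be converted to the right-down-front convention (colmap/OpenCV style) by flipping the camera y and z axes; a minimal sketch of this common trick, stated here as an assumption rather than as nerf_pl code:

import torch

def rub_to_rdf(c2w: torch.Tensor) -> torch.Tensor:
    # c2w: (3, 4) or (4, 4); the first three columns are the camera axes in world coordinates
    out = c2w.clone()
    out[:3, 1] *= -1   # flip the camera y axis (up -> down)
    out[:3, 2] *= -1   # flip the camera z axis (back -> front)
    return out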

3. Rendering part

The main rendering code of nerf_pl is in models/rendering.py. This part is more complicated, but overall it can be divided into coarse-stage sampling and fine-stage sampling.

The `inference` function represents one forward pass: it sends `(x, d)` into the model, and the output corresponds to `rgb` and `sigma`. This function is easy to understand by comparing it with the NeRF principle we covered in the previous chapter. nerf_pl abstracts this process to make it easy to implement both the coarse forward pass and the fine forward pass:

Coarse forward pass:

        rgb_coarse, depth_coarse, weights_coarse = \
            inference(model_coarse, embedding_xyz, xyz_coarse_sampled, rays_d,
                      dir_embedded, z_vals, weights_only=False)
        result = {'rgb_coarse': rgb_coarse,
                  'depth_coarse': depth_coarse,
                  'opacity_coarse': weights_coarse.sum(1)
                 }

Fine forward pass:

		rgb_fine, depth_fine, weights_fine = \
            inference(model_fine, embedding_xyz, xyz_fine_sampled, rays_d,
                      dir_embedded, z_vals, weights_only=False)

        result['rgb_fine'] = rgb_fine
        result['depth_fine'] = depth_fine
        result['opacity_fine'] = weights_fine.sum(1)

After the coarse forward pass, we obtain the weights in the volume rendering formula (covered in detail in the previous chapter); the `sample_pdf` function then uses these weights to compute more accurate sampling positions, and the uniform samples together with the fine samples are sent to the fine forward pass for volume rendering.
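The idea behind sample_pdf is inverse-transform sampling: treat the coarse weights as a piecewise-constant pdf over the depth bins, build its cdf, and invert it at uniformly drawn values. A minimal sketch of this idea (our own simplified version, not the exact nerf_pl implementation):

import torch

def sample_pdf_simple(bins, weights, n_samples, eps=1e-5):
    # bins: (N_rays, M+1) bin edges along each ray; weights: (N_rays, M) coarse weights
    pdf = (weights + eps) / torch.sum(weights + eps, dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)      # (N_rays, M+1)

    u = torch.rand(weights.shape[0], n_samples, device=weights.device)  # uniform samples
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    cdf_lo = torch.gather(cdf, -1, idx - 1)
    cdf_hi = torch.gather(cdf, -1, idx)
    bin_lo = torch.gather(bins, -1, idx - 1)
    bin_hi = torch.gather(bins, -1, idx)

    # linear interpolation inside the selected bin
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)
    return bin_lo + t * (bin_hi - bin_lo)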

4. Training process

Since the author adopts the pytorch-lightning framework, most beginners will be confused when reading train.py. Let us first list the order in which the member functions of a `pytorch_lightning.LightningModule` are executed during `fit` (a minimal skeleton illustrating this order follows the list):

prepare_data(): This function is used to load the dataset before starting the training. It will only be called once in a single process, usually before initializing the trained model.

setup(): This function is used to initialize the model and dataset. It is called after the prepare_data() function and before the training process.

train_dataloader(): This function returns the data loader for the training data, which is called after the setup() function.

val_dataloader(): This function returns the data loader for the validation data, which is called after the train_dataloader() function.

test_dataloader(): This function returns the data loader for the test data, which is called after the val_dataloader() function.

forward(): This function is used to define the forward pass process of the model, which is called during training and inference.

training_step(): This function is called in each training batch to calculate the loss and perform backpropagation.

validation_step(): This function is called in each validation batch to evaluate the performance of the model on the validation set.

test_step(): This function is called in each test batch to evaluate the performance of the model on the test set.

training_epoch_end(): This function is called at the end of each training round (epoch) to process and summarize statistical information during the training process.

validation_epoch_end(): This function is called at the end of each validation round (epoch) to process and summarize statistical information during the validation process.

test_epoch_end(): This function is called at the end of each test round (epoch) to process and summarize statistics during the test.

configure_optimizers(): This function is used to define the optimizer and its hyperparameters, and is called before the training process.
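To make the call order concrete, here is a minimal LightningModule skeleton (a toy example of ours, unrelated to nerf_pl) showing the hooks most relevant to training:

import torch
import pytorch_lightning as pl

class LitExample(pl.LightningModule):
    # trainer.fit() calls the hooks roughly in the order listed above:
    # setup -> dataloaders -> training_step per batch -> epoch-end hooks
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(3, 1)

    def forward(self, x):
        return self.net(x)

    def train_dataloader(self):
        data = torch.utils.data.TensorDataset(torch.rand(64, 3), torch.rand(64, 1))
        return torch.utils.data.DataLoader(data, batch_size=8)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return loss   # lightning calls loss.backward() and optimizer.step() for us

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=1)
# trainer.fit(LitExample())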

Knowing the roles of these member methods, we only need to focus on `forward` and `training_step`:

def training_step(self, batch, batch_nb):
        log = {'lr': get_learning_rate(self.optimizer)}
        rays, rgbs = self.decode_batch(batch)
        results = self(rays)
        log['train/loss'] = loss = self.loss(results, rgbs)
        typ = 'fine' if 'rgb_fine' in results else 'coarse'

        with torch.no_grad():
            psnr_ = psnr(results[f'rgb_{typ}'], rgbs)
            log['train/psnr'] = psnr_

        return {'loss': loss,
                'progress_bar': {'train_psnr': psnr_},
                'log': log
               }

After pytorch-lightning loads each batch from the dataloader, it enters `training_step` directly, passing in the data returned by the dataset's `__getitem__()`, for example in datasets/blender.py:

def __getitem__(self, idx):
        if self.split == 'train': # use data in the buffers
            sample = {'rays': self.all_rays[idx],
                      'rgbs': self.all_rgbs[idx]}

        else: # create data for each image separately
            frame = self.meta['frames'][idx]
            c2w = torch.FloatTensor(frame['transform_matrix'])[:3, :4]

            img = Image.open(os.path.join(self.root_dir, f"{frame['file_path']}.png"))
            img = img.resize(self.img_wh, Image.LANCZOS)
            img = self.transform(img) # (4, H, W)
            valid_mask = (img[-1]>0).flatten() # (H*W) valid color area
            img = img.view(4, -1).permute(1, 0) # (H*W, 4) RGBA
            img = img[:, :3]*img[:, -1:] + (1-img[:, -1:]) # blend A to RGB

            rays_o, rays_d = get_rays(self.directions, c2w)

            rays = torch.cat([rays_o, rays_d,
                              self.near*torch.ones_like(rays_o[:, :1]),
                              self.far*torch.ones_like(rays_o[:, :1])],
                              1) # (H*W, 8)

            sample = {'rays': rays,
                      'rgbs': img,
                      'c2w': c2w,
                      'valid_mask': valid_mask}

        return sample

Then the `batch` parameter of `training_step` is the `sample` above; `training_step` executes `results = self(rays)` to run `forward`, and computes the loss between the forward result and the original image `rgbs`; the loss is back-propagated automatically afterwards, without us having to call `loss.backward()` or `opt.step()` inside `training_step`.

def forward(self, rays):
        """Do batched inference on rays using chunk."""
        B = rays.shape[0]
        results = defaultdict(list)
        for i in range(0, B, self.hparams.chunk):
            rendered_ray_chunks = \
                render_rays(self.models,
                            self.embeddings,
                            rays[i:i+self.hparams.chunk],
                            self.hparams.N_samples,
                            self.hparams.use_disp,
                            self.hparams.perturb,
                            self.hparams.noise_std,
                            self.hparams.N_importance,
                            self.hparams.chunk, # chunk size is effective in val mode
                            self.train_dataset.white_back)

            for k, v in rendered_ray_chunks.items():
                results[k] += [v]

        for k, v in results.items():
            results[k] = torch.cat(v, 0)
        return results

Summary

This blog briefly introduced the use of the colmap GUI, the principle of cuda operators together with a simple example, and finally the general workflow of nerf_pl, hoping to help readers further understand the principle of NeRF.

In the next blog, we will introduce the use of some open-source frameworks such as sdfstudio and nerfstudio, and the acceleration library nerfacc based on occupancy grid and ray marching.
