Preface
In the previous chapter, we introduced the principle of NeRF, traditional volume rendering methods, and the relationship between the two. In this chapter, we explain how to install and use COLMAP and walk through part of the nerf_pl source code. In addition, because some operations are not supported by Python/torch during development, we sometimes have to build our own wheels, and we will meet CUDA operators again in later columns, so this chapter also explains how to use CUDA operators.
Reproduction of this tutorial is prohibited. This tutorial comes from Knowledge Planet [CV Technical Guide].
Colmap installation and use
COLMAP is an open-source computer vision library written in C++. It provides tools for 3D reconstruction, image retrieval, structure-from-motion estimation, and 2D/3D feature extraction and matching. Its goal is to provide a powerful, flexible and easy-to-use platform for academic research, education and product development. COLMAP can run on the CPU or the GPU, supports multiple image formats and camera models, and offers several import and export formats for datasets.
Since this column focuses on NeRF with known camera poses, we first need to know how to obtain those poses.
1. Download colmap and compile:
Download the COLMAP source code, select the latest release tag, and then compile it by following the build instructions in its documentation.
There are many other COLMAP compilation and installation tutorials online, so we will not go into detail about build errors caused by environment problems.
2. Use colmap
Many COLMAP tutorials already exist online, so here we only briefly introduce the COLMAP GUI.
Run colmap gui in bash to open the GUI:
In File, select New project. Under Database, click New and create database.db (you do not need to create the file manually; typing database.db in the input field creates it automatically). Under Images, click Select and choose the image directory (it may contain subdirectories, but all pictures under the image directory should be shot in the same scene), then click Save.
Then select Processing, then Feature extraction. If you do not know the camera intrinsics, just click Extract. If you do, select Custom parameters and enter the intrinsics in order (the parameter list varies with the camera model you choose; generally, as long as the focal length is accurate, the reconstruction will not be too bad):
After feature extraction is finished, move on to matching: select Processing, then Feature matching. If you have a vocabulary tree, you can import the vocabulary tree file; if the images are consecutive frames, you can select Sequential and limit the number of adjacent frames to match; if neither applies, select Exhaustive, that is, brute-force matching (for a small scene this does not take much time).
After matching, select Reconstruction, then Start reconstruction, to start sparse reconstruction and get the reconstruction result:
Then, in File, select Export model. We can output cameras, images and points3D in bin or txt format. cameras stores the camera intrinsics; images stores the name and pose of each picture (the rotation is stored as a quaternion) together with its 2D feature points; points3D stores the information of the sparse point cloud. If you want PLY format, select Export model as ... in File and choose PLY at the bottom right.
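NeRF code usually expects camera-to-world matrices, while the exported images file stores each pose as a quaternion and a translation. Below is a minimal sketch of the conversion. It assumes the stored pose is world-to-camera and ordered as (QW, QX, QY, QZ, TX, TY, TZ), as the COLMAP documentation describes; the function names and the sample pose are ours and are not part of COLMAP.

```python
import math

def qvec_to_rotmat(qw, qx, qy, qz):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    return [
        [1 - 2*qy*qy - 2*qz*qz, 2*qx*qy - 2*qw*qz,     2*qx*qz + 2*qw*qy],
        [2*qx*qy + 2*qw*qz,     1 - 2*qx*qx - 2*qz*qz, 2*qy*qz - 2*qw*qx],
        [2*qx*qz - 2*qw*qy,     2*qy*qz + 2*qw*qx,     1 - 2*qx*qx - 2*qy*qy],
    ]

def w2c_to_c2w(qvec, tvec):
    """Invert a world-to-camera pose: R_c2w = R^T, t_c2w = -R^T t.

    Returns the 3x4 matrix [R_c2w | t_c2w] as a list of rows.
    """
    R = qvec_to_rotmat(*qvec)
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    center = [-sum(Rt[i][k] * tvec[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [center[i]] for i in range(3)]

# Hypothetical pose: rotation of 90 degrees about z, translation (1, 2, 3).
s = math.sqrt(0.5)
c2w = w2c_to_c2w((s, 0.0, 0.0, s), (1.0, 2.0, 3.0))
for row in c2w:
    print([round(v, 6) for v in row])
```

The last column of the result is the camera center in world coordinates. Any change of axis convention (discussed in the ray generation section) is a separate step on top of this inversion.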
cuda operator
Because some operations are not supported in Python, or because torch or multithreaded CPU code is slower than a dedicated GPU implementation, we sometimes need to write CUDA operators ourselves, for example the ray marching algorithm often used in NeRF methods, or AABB intersection tests in graphics. But our goal is not to become wheel makers: we do not need a CUDA operator for every operation that CUDA can compute, and learning CUDA operators is just another means of realizing our ideas. Likewise, we do not need a deep understanding of advanced CUDA techniques such as data alignment or texture memory; just treat CUDA as a C++ implementation. Note that we assume the reader has already installed CUDA.
1. Overview of GPU Parallel Computing Principles
In single-threaded CPU computing, one thread runs one piece of code, and that code processes one piece of data at a time, so the computing power of the program is limited to one CPU core. When single-threaded computing power is not enough, multiple CPU cores can run multiple threads in parallel to process multiple pieces of data.
However, a CPU can only run a few dozen threads truly in parallel (depending on the CPU model). Graphics cards of the same period have far more computing units, often two or more orders of magnitude more: for example, the NVIDIA A2000 has 3328 computing cores and the NVIDIA A5000 has 8192. These cores are distributed over dozens of streaming multiprocessors (SMs; 26 for the A2000, 64 for the A5000), and each SM can keep on the order of a thousand or more threads resident at the same time (the exact limit depends on the architecture).
In the CUDA framework, how do we control the computation of each thread? First we need to introduce the abstractions CUDA uses to organize threads: grid, block and thread.
A function (kernel function) is launched over one grid; each grid can be configured with multiple blocks, and each block with multiple threads (blocks and threads are not physically laid out on the GPU in this way; they are only an abstraction). Each thread runs the kernel function independently and processes different data according to its thread number. For example, we can establish a one-to-one mapping from thread numbers to image pixel coordinates, so that each thread processes one pixel independently (for example, computing the mean of that pixel's neighborhood).
We can obtain the block and thread of the current computing unit through the built-in variables blockIdx and threadIdx. Both are triplets with the member variables x, y and z, which give the position of the block within the grid and the position of the thread within the block (in the CV field we usually do not use z; x and y are enough to cover the whole image). For example:
// host side: launch configuration
const int num_pixs = 500 * 300;
const int BLOCK_W = 64;
const int BLOCK_H = 16;
const dim3 blockSize(BLOCK_W,BLOCK_H,1);
const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
...
In the above code, we declared a block of size $64\times16$, and the grid contains $\frac{500\times300+64\times16-1}{64\times16}\approx147$ blocks, arranged linearly in one dimension along the blockIdx.x direction. The "absolute number" of each thread can then be expressed as:
// device
const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y;
Here we can draw the layout of block and thread to better understand:
If a thread is located in Block(2,0,0) and its thread number inside that block is thread(25,15,0) (indices are zero-based, so threadIdx.y ranges from 0 to 15), then its number is the number of threads in all blocks before the current block plus its thread number inside the current block, that is, $2\times16\times64+25\times16+15=2463$. This thread therefore handles the element with index 2463, i.e. the 2464th element of the vector of length $300\times500$.
Readers who want to go deeper can refer to the official CUDA documentation.
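To check the indexing scheme without a GPU, the launch configuration and the flattened thread number can be mirrored in plain Python. This is only a sketch of the arithmetic: the helper name flat_index is ours, and the loop stands in for what the GPU does in parallel.

```python
BLOCK_W, BLOCK_H = 64, 16      # blockDim.x, blockDim.y
num_pixs = 500 * 300

# Integer ceiling division, the same expression as in gridSize.
grid_x = (num_pixs + BLOCK_W * BLOCK_H - 1) // (BLOCK_W * BLOCK_H)

def flat_index(block_x, thread_x, thread_y):
    """Mirror of blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y."""
    return block_x * BLOCK_W * BLOCK_H + thread_x * BLOCK_H + thread_y

# Enumerate every (block, thread) pair, as the GPU would launch them.
indices = {
    flat_index(b, tx, ty)
    for b in range(grid_x)
    for tx in range(BLOCK_W)
    for ty in range(BLOCK_H)
}

print("blocks in grid:", grid_x)
print("thread (25, 15) of block 2 ->", flat_index(2, 25, 15))
print("distinct indices:", len(indices), "largest:", max(indices))
```

Every launched thread receives a distinct number, and the numbers cover all 150000 pixels with some threads left over in the last block, which is why the kernel needs a bounds check before touching memory.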
2. cuda operator and pybinding
A CUDA program normally has to be compiled with the nvcc compiler, which on Linux usually means writing a cumbersome CMakeLists.txt. Fortunately, with Python bindings we can hand the compilation over to Python's setuptools library, and copying data from host memory to the GPU can be done with torch, so we do not need to understand CUDA's relatively complicated memcpy strategies. The only disadvantage is that we cannot easily debug the CUDA operator we wrote.
So that debugging remains possible, we also provide the code of a standalone C++ project later.
3. Examples
A text description alone is too abstract, so we now implement a simple CUDA program.
In NeRF for large scenes, the learning capacity of a single network is limited, so we need to divide the scene into blocks.
Suppose the scene is divided into $4\times4$ sub-regions. The sub-NeRF of each sub-region must be trained with the images that observe that region, but an image often observes more than one region, so for each sub-region we need to mark, with a mask, the pixels of every image that observe the region (as in Mega-NeRF).
For an $m\times n$ mask image, the data type is bool. Although this is already the smallest data type in Python (1 byte), for a high-resolution image (for example, $5000\times3000$) the space required to store one mask is $\frac{5000\times3000}{1024^2}\approx14$ MB. With 1000 pictures, the masks of one sub-region take about $14$ GB, and since each of the 16 sub-NeRFs needs its own set of masks, the total is $14\times16=224$ GB. This is unbearable, at least for the author, especially when we want to test the effect of different blocking strategies.
If instead we flatten each mask image into a one-dimensional vector of length $15000000$ in which each element is 1 bit (not 1 byte), the storage drops from 224 GB to $\frac{15000000\times16\times1000}{8\times1024^3}\approx28$ GB, which is of the same order of magnitude as the 1000 RGB images themselves (about 11 GB).
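The storage estimates above can be reproduced with a few lines of arithmetic. The sketch below uses the numbers from the text (a 5000 by 3000 mask, 1000 images, 16 sub-regions); the variable names are ours.

```python
H, W = 3000, 5000
num_images, num_blocks = 1000, 16
pixels = H * W

# One byte per pixel (bool mask).
bool_mask_mb = pixels / 1024**2
bool_one_block_gb = pixels * num_images / 1024**3
bool_total_gb = bool_one_block_gb * num_blocks

# One bit per pixel (bit-packed mask): 8 pixels per byte.
bit_total_gb = pixels * num_images * num_blocks / 8 / 1024**3

print(f"one bool mask:          {bool_mask_mb:.1f} MB")
print(f"bool masks, one block:  {bool_one_block_gb:.1f} GB")
print(f"bool masks, all blocks: {bool_total_gb:.1f} GB")
print(f"bit masks, all blocks:  {bit_total_gb:.1f} GB")
```

Packing to bits is exactly an eightfold saving; the figures in the text differ slightly from the printed ones only because the text rounds 14 GB before multiplying by 16.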
But implementing this process in Python is very slow; even with multithreading it takes a long time. Recalling how CUDA works, we might as well implement this process as a CUDA operator:
3.1 Core part of cu operator
#include <iostream>
#include <torch/extension.h>
#include <ATen/ATen.h>
#include <ATen/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>
using namespace std;
__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // when the block is one-dimensional
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // when the block is two-dimensional
    // threads beyond the last output word have nothing to do
    if (n >= bits_array.size(0))
        return;
    // the last word may cover fewer than 32 valid elements
    int mask_size = 32;
    const int64_t remaining = idx_array.size(0) - (int64_t)n * 32;
    if (remaining < 32)
        mask_size = (int)remaining;
    const int64_t flag = 1;
    for (int i = 0; i < mask_size; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}
// bits_array is expected to hold ceil(idx_array.size(0) / 32) zero-initialized words
torch::Tensor packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles one 32-bit group, i.e. 32 pixels (rounded up)
    const int num_pixs = (idx_array.size(0) + 31) / 32;
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
    AT_DISPATCH_ALL_TYPES(idx_array.scalar_type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return bits_array;
}
Among the includes, the torch library comes from libtorch (the cu111 build is used here; other versions of libtorch can be downloaded separately), and ATen is the basic tensor and math library (a sub-library of libtorch).
In the above code, the host-side function packbits_u32_cu receives the vectorized mask image idx_array and an int64 array bits_array that receives the packed bits, and plans the block and thread layout of the grid launched by this function. It then uses AT_DISPATCH_ALL_TYPES to pass the data to the device-side function packbits_u32_kernel.
AT_DISPATCH_ALL_TYPES dispatches on the data type of a tensor so that the matching version of a function is called. It takes three arguments: the data type, a name, and the code to run (here a lambda).
[&] introduces a lambda expression in C++: the code block that follows [&] is wrapped as a lambda function, and & means that all variables it uses are captured by reference.
<<< >>> is the CUDA syntax for the execution configuration of a kernel: the device-side function is on the left, the grid and block configuration is in the middle, and the argument list is on the right.
In packbits_u32_kernel, each thread takes 32 elements of idx_array and uses bit operations to write them into the 32 bits of one element of bits_array, completing the data reduction (here we use int64 instead of uint32 because ATen does not fully support uint32 and uint64).
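What one thread does with its 32 elements can be written as a slow CPU reference in plain Python, which is handy for checking the output of the operator on small inputs. This is a sketch of the same bit layout (element i of a group sets bit i of its word), not the author's code; packbits_ref and unpackbits_ref are hypothetical names.

```python
def packbits_ref(idx_array):
    """Pack every 32 entries into one integer; entry i of a group sets bit i."""
    num_words = (len(idx_array) + 31) // 32   # round up so the tail is kept
    bits_array = [0] * num_words
    for n in range(num_words):                # n plays the role of the thread number
        group = idx_array[n * 32:(n + 1) * 32]
        for i, hit_pix in enumerate(group):
            if hit_pix > 0:
                bits_array[n] |= 1 << i
    return bits_array

def unpackbits_ref(bits_array, length):
    """Inverse of packbits_ref: recover a 0/1 list of the given length."""
    return [(bits_array[k // 32] >> (k % 32)) & 1 for k in range(length)]

# 35 entries, so the second word is only partially filled.
mask = [1, 0, 1, 1] + [0] * 28 + [0, 1, 0]
packed = packbits_ref(mask)
print(packed)
print(unpackbits_ref(packed, len(mask)) == mask)
```

The partially filled last word is the case that the bounds handling in the kernel has to get right, so a test input whose length is not a multiple of 32 is worth keeping around.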
If readers want to know more, they can refer to torch's official documentation or directly to libtorch's API documentation. Another example that the author finds easier to get started with is kwea123's trilinear interpolation tutorial; kwea123's series of CUDA extension tutorials is also suitable for most beginners who want to learn further.
With that, the core part of the CUDA operator is complete. Next we introduce how to call this operator from Python.
3.2 pybinding
Usually, when we program across languages and call a program written in language B from language A, we first need to package the code in language B as a dynamic library. Python bindings are no exception: we first need a complete CUDA source tree before packaging it as a dynamic library. Our project then needs at least three files: a header file, binding.cpp and main.cu.
The header file utils.h:
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
torch::Tensor packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array);
torch::Tensor un_packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array);
torch::Tensor packbits_u32_cu(
torch::Tensor idx_array,
torch::Tensor bits_array
);
torch::Tensor un_packbits_u32_cu(
torch::Tensor idx_array,
torch::Tensor bits_array
);
binding.cpp:
#include "utils.h"
torch::Tensor packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array){
CHECK_CUDA(idx_array);
CHECK_CUDA(bits_array);
return packbits_u32_cu(idx_array,bits_array);
}
torch::Tensor un_packbits_u32(torch::Tensor idx_array, torch::Tensor bits_array){
CHECK_CUDA(idx_array);
CHECK_CUDA(bits_array);
return un_packbits_u32_cu(idx_array,bits_array);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){
m.def("packbits_u32",&packbits_u32);
m.def("un_packbits_u32",&un_packbits_u32);
}
main.cu
#include "utils.h"
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cmath>
__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // one-dimensional block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // two-dimensional block
    // threads beyond the last output word have nothing to do
    if (n >= bits_array.size(0))
        return;
    // the last word may cover fewer than 32 valid elements
    int mask_size = 32;
    const int64_t remaining = idx_array.size(0) - (int64_t)n * 32;
    if (remaining < 32)
        mask_size = (int)remaining;
    const int64_t flag = 1;
    for (int i = 0; i < mask_size; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}
torch::Tensor packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles one 32-bit group, i.e. 32 pixels (rounded up)
    const int num_pixs = (idx_array.size(0) + 31) / 32;
    // const int threads = 256, blocks = (num_pixs+threads-1)/threads;
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
    AT_DISPATCH_ALL_TYPES(idx_array.scalar_type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return bits_array;
}
__global__ void un_packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // one-dimensional block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // two-dimensional block
    if (n >= bits_array.size(0))
        return;
    // the last word may cover fewer than 32 valid elements
    int mask_size = 32;
    const int64_t remaining = idx_array.size(0) - (int64_t)n * 32;
    if (remaining < 32)
        mask_size = (int)remaining;
    const int64_t flag = 1;
    for (int i = 0; i < mask_size; i++){
        if (bits_array[n] & (flag << i)){
            idx_array[n*32 + i]++;
        }
    }
}
torch::Tensor un_packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles one 32-bit group, i.e. 32 pixels (rounded up)
    const int num_pixs = (idx_array.size(0) + 31) / 32;
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
    AT_DISPATCH_ALL_TYPES(idx_array.scalar_type(),"un_packbits_u32_cu",
    ([&] {
        un_packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
    return idx_array;
}
The file structure is:
Here binding.cpp establishes the interface between the CUDA functions and the Python functions. Next, we package the whole project as a Python package named PackBit:
setup.py
import glob
import os.path as osp
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension
ROOT_DIR = osp.dirname(osp.abspath(__file__))
include_dirs = [osp.join(ROOT_DIR, "include")]
sources = glob.glob('*.cpp')+glob.glob('*.cu')
setup(
name='PackBit',
version='1.0',
author='will',
ext_modules=[
CUDAExtension(
name='PackBit',
sources=sources,
include_dirs=include_dirs,
extra_compile_args={
'cxx': ['-O2'],
'nvcc': ['-O2']}
)
],
cmdclass={
'build_ext': BuildExtension
}
)
Since we skipped the CMakeLists.txt step, it is less convenient to link third-party libraries when compiling the CUDA program. In that case we can add the library_dirs and libraries arguments to the CUDAExtension in ext_modules to specify the dynamic libraries to link, for example:
setup(
name='PackBit',
version='1.0',
author='will',
ext_modules=[
CUDAExtension(
name='PackBit',
sources=sources,
include_dirs=include_dirs,
library_dirs=["/home/will/Downloads/libtorch/libtorch/lib","/usr/local/cuda-11.6/lib64"],
libraries =["c10","torch_python","torch"],
extra_compile_args={
'cxx': ['-O2'],
'nvcc': ['-O2']},
extra_link_args=["-Wl,-rpath=/home/will/Downloads/libtorch/libtorch/lib"]
)
],
cmdclass={
# tell setuptools to build the extension with BuildExtension
'build_ext': BuildExtension
}
)
Put setup.py, main.cu and binding.cpp in the same directory, enter that directory in bash, and run python setup.py install to compile and install the CUDA program.
Finally, let's do a simple test (note that since the extension library we wrote is built on the torch library, we need to import torch first):
test.py
import math
import torch  # import torch first so that the extension can find the torch libraries
import PackBit  # module name as set in setup.py

if __name__ == "__main__":
    # test packbits
    num_pix = 5463 * 3460
    a = torch.randint(2, (num_pix,), dtype=torch.int32).cuda()
    # one int64 word per 32 mask elements, rounded up so the tail is kept
    b = torch.zeros([math.ceil(num_pix / 32)], dtype=torch.int64).cuda()
    c = PackBit.packbits_u32(a, b)
    print('first 32 elements of a:', a[0:32])
    print('b[0]:', b[0])
    print('binary form of b[0]:', format(b[0].item(), 'b').zfill(32))
    # test unpackbits
    # a = torch.zeros([num_pix], dtype=torch.int32).cuda()
    # b = torch.randint(2**32, (math.ceil(num_pix / 32),), dtype=torch.int64).cuda()
    # c = PackBit.un_packbits_u32(a, b)
    # print('b[0]:', b[0])
    # print('vector decoded from b[0]:', a[0:32])
    # print('binary form of b[0]:', format(b[0].item(), 'b').zfill(32))
Test Results:
3.3 debug
We mentioned before that a CUDA operator built with Python bindings is inconvenient to debug. If you want to debug it, you have to set up a separate C++ project. Here we give the CMakeLists.txt file and main.cu:
CMakeLists.txt
cmake_minimum_required(VERSION 2.8 FATAL_ERROR)
project(test)
find_package(PythonInterp REQUIRED)
find_package(CUDA REQUIRED)
# find_package(OpenCV)
find_package(Python3 COMPONENTS Interpreter Development REQUIRED)
set(Torch_DIR /home/will/Downloads/libtorch/libtorch/share/cmake/Torch) # required for the torch library; must be set before find_package(Torch)
find_package(Torch REQUIRED)
set(CMAKE_BUILD_TYPE DEBUG)
set(CMAKE_CUDA_COMPILER /usr/local/cuda-11.6/bin/nvcc)
set(CUDACXX /usr/local/cuda-11.6/bin/nvcc)
set(CMAKE_CXX_STANDARD 14)
set(CUDA_INCLUDE_DIRS "/usr/local/cuda/include")
include_directories(${PYTHON_INCLUDE_DIR})
include_directories(
./include
/usr/include/python3.10
# /usr/local/include/opencv4
${CUDA_INCLUDE_DIRS}
)
set(LIBRARIES
# opencv_features2d
# opencv_calib3d
# opencv_flann
# opencv_highgui
# opencv_imgcodecs
# opencv_imgproc
# opencv_core
)
set(SRC_LIST ./main.cu)
add_executable(demo ${SRC_LIST}) # build the source files into an executable
target_link_libraries(demo "${TORCH_LIBRARIES}") # required when using the torch library
target_link_libraries(demo ${LIBRARIES}) # link static/dynamic libraries to the executable
main.cu
// #include "utils.h"
#include <torch/extension.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>
using namespace std;
__global__ void packbits_u32_kernel(
    torch::PackedTensorAccessor32<int32_t,1,torch::RestrictPtrTraits> idx_array,
    torch::PackedTensorAccessor64<int64_t,1,torch::RestrictPtrTraits> bits_array
){
    // const int32_t n = blockIdx.x * blockDim.x + threadIdx.x; // one-dimensional block
    const int32_t n = blockIdx.x*blockDim.x*blockDim.y + threadIdx.x*blockDim.y + threadIdx.y; // two-dimensional block
    // threads beyond the last output word have nothing to do
    if (n >= bits_array.size(0))
        return;
    // the last word may cover fewer than 32 valid elements
    int mask_size = 32;
    const int64_t remaining = idx_array.size(0) - (int64_t)n * 32;
    if (remaining < 32)
        mask_size = (int)remaining;
    const int64_t flag = 1;
    for (int i = 0; i < mask_size; i++){
        int32_t hit_pix = idx_array[n*32 + i];
        // printf("asd");
        if (hit_pix > 0){
            bits_array[n] |= flag << i;
        }
    }
}
void packbits_u32_cu(
    torch::Tensor idx_array,
    torch::Tensor bits_array
){
    // each thread handles one 32-bit group, i.e. 32 pixels (rounded up)
    const int num_pixs = (idx_array.size(0) + 31) / 32;
    const int BLOCK_W = 64;
    const int BLOCK_H = 16;
    const dim3 blockSize(BLOCK_W,BLOCK_H,1);
    const dim3 gridSize((num_pixs + BLOCK_W*BLOCK_H - 1)/(BLOCK_W * BLOCK_H),1,1);
    AT_DISPATCH_ALL_TYPES(idx_array.scalar_type(),"packbits_u32_cu",
    ([&] {
        packbits_u32_kernel<<<gridSize, blockSize>>>(
            idx_array.packed_accessor32<int32_t,1,torch::RestrictPtrTraits>(),
            bits_array.packed_accessor64<int64_t,1,torch::RestrictPtrTraits>()
        );
    }));
}
int main() {
    auto data = torch::randint(0,2,{10000}).to(torch::kInt32);
    // one int64 word per 32 input elements, rounded up
    auto idx = torch::zeros({(10000 + 31) / 32}, torch::kInt64);
    auto data1 = data.to(torch::kCUDA);
    auto idx1 = idx.to(torch::kCUDA);
    packbits_u32_cu(data1,idx1);
    cudaDeviceSynchronize();
    auto modified_tensor = idx1.to(torch::kCPU);
    std::cout << modified_tensor << std::endl;
    return 0;
}
We will not go into debugging with cuda-gdb here. There is plenty of related material online, and you can refer to the official cuda-gdb documentation.
For more detailed tutorials on the C++ side of Python bindings, please refer to the official documentation.
Interpretation of part of the source code of nerf_pl
nerf_pl is kwea123's reimplementation of NeRF using PyTorch Lightning. Compared with the original NeRF code, it is easy to read and easy to modify. We do need to learn PyTorch Lightning, but with the help of GPT and the official API documentation we can understand it and get started quickly. Next, we combine the paper with the code to explain the concrete network structure and training process of NeRF.
In the author's opinion, the main body of the project consists of three parts: the model, ray generation, and rendering. For the environment configuration of the project, refer to the README of the repository.
1. Model part
The network structure of NeRF is in models/nerf.py:
import torch
from torch import nn

class NeRF(nn.Module):
    def __init__(self,
                 D=8, W=256,
                 in_channels_xyz=63, in_channels_dir=27,
                 skips=[4]):
        super(NeRF, self).__init__()
        self.D = D
        self.W = W
        self.in_channels_xyz = in_channels_xyz
        self.in_channels_dir = in_channels_dir
        self.skips = skips
        # xyz encoding layers
        for i in range(D):
            if i == 0:
                layer = nn.Linear(in_channels_xyz, W)
            elif i in skips:
                layer = nn.Linear(W+in_channels_xyz, W)
            else:
                layer = nn.Linear(W, W)
            layer = nn.Sequential(layer, nn.ReLU(True))
            setattr(self, f"xyz_encoding_{i+1}", layer)
        self.xyz_encoding_final = nn.Linear(W, W)
        # direction encoding layers
        self.dir_encoding = nn.Sequential(
                                nn.Linear(W+in_channels_dir, W//2),
                                nn.ReLU(True))
        # output layers
        self.sigma = nn.Linear(W, 1)
        self.rgb = nn.Sequential(
                        nn.Linear(W//2, 3),
                        nn.Sigmoid())

    def forward(self, x, sigma_only=False):
        if not sigma_only:
            input_xyz, input_dir = \
                torch.split(x, [self.in_channels_xyz, self.in_channels_dir], dim=-1)
        else:
            input_xyz = x
        xyz_ = input_xyz
        for i in range(self.D):
            if i in self.skips:
                xyz_ = torch.cat([input_xyz, xyz_], -1)
            xyz_ = getattr(self, f"xyz_encoding_{i+1}")(xyz_)
        sigma = self.sigma(xyz_)
        if sigma_only:
            return sigma
        xyz_encoding_final = self.xyz_encoding_final(xyz_)
        dir_encoding_input = torch.cat([xyz_encoding_final, input_dir], -1)
        dir_encoding = self.dir_encoding(dir_encoding_input)
        rgb = self.rgb(dir_encoding)
        out = torch.cat([rgb, sigma], -1)
        return out
We can see that the whole network is actually very simple. The NeRF class contains an $8\times256$ MLP that outputs the volume density and the feature vector xyz_encoding_final, and a one-layer MLP that outputs the color. The embedding applied to the inputs is:
$$\gamma(p)=(\sin(2^0\pi p),\cos(2^0\pi p),\dots,\sin(2^{L-1}\pi p),\cos(2^{L-1}\pi p))$$
The input of the entire network is an array of three-dimensional points x $\in R^{N\times3}$ and an array of the directions from which these points are observed, d $\in R^{N\times3}$ (both after embedding); the output is the corresponding color value rgb $\in R^{N\times3}$ and the volume density value sigma $\in R^{N}$.
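The embedding formula also explains the default channel counts in the constructor above. The sketch below implements $\gamma$ as written, in plain Python, under two assumptions of ours: the raw input is concatenated in front of the sines and cosines (consistent with 63 = 3 + 3·2·10 and 27 = 3 + 3·2·4), and the number of frequencies is 10 for positions and 4 for directions. Actual implementations may order the terms differently or drop the factor $\pi$.

```python
import math

def positional_encoding(p, L, include_input=True):
    """gamma(p) for a point p given as a list of coordinates.

    For each frequency 2^k * pi, the sines of all coordinates are
    appended first, then the cosines.
    """
    out = list(p) if include_input else []
    for k in range(L):
        freq = (2 ** k) * math.pi
        out += [math.sin(freq * c) for c in p]
        out += [math.cos(freq * c) for c in p]
    return out

xyz_embedded = positional_encoding([0.1, -0.2, 0.3], L=10)
dir_embedded = positional_encoding([0.0, 0.0, 1.0], L=4)
print(len(xyz_embedded), len(dir_embedded))  # prints: 63 27
```

Concatenating the two embeddings gives the 90-channel row that forward splits back into input_xyz and input_dir.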
2. Generate rays
The main code for generating rays is in datasets/ray_utils.py. In the get_rays function, the input parameter directions represents the direction vectors, in the camera coordinate system, from the optical center of the camera to each pixel on the image plane; c2w $=[R|t]$ is the transformation matrix from the camera coordinate system to the world coordinate system:
def get_rays(directions, c2w):
    # Rotate ray directions from camera coordinate to the world coordinate
    rays_d = directions @ c2w[:, :3].T # (H, W, 3)
    rays_d = rays_d / torch.norm(rays_d, dim=-1, keepdim=True)
    # The origin of all rays is the camera origin in world coordinate
    rays_o = c2w[:, 3].expand(rays_d.shape) # (H, W, 3)
    rays_d = rays_d.view(-1, 3)
    rays_o = rays_o.view(-1, 3)
    return rays_o, rays_d
In this function, we use c2w to transform the direction vectors directions from the camera coordinate system to the world coordinate system, obtaining the actual lines of sight of the camera in real space:
It should be noted that the transformation matrix can be regarded as a linear transformation, and under different bases the matrix is different. For details, refer to the following table (where BDF means that $x,y,z$ point down, right and forward, and BRU means back, up and right):
Therefore, for the llff and blender datasets (the two datasets use the same coordinate convention here), directions is constructed as:
directions = \
torch.stack([(i-W/2)/focal, -(j-H/2)/focal, -torch.ones_like(i)], -1) # (H, W, 3)
For a dataset made with COLMAP, we need to construct directions like this:
directions = \
torch.stack([(u-cx+0.5)/fx, (v-cy+0.5)/fy, torch.ones_like(u)], -1)
Of course, we can also apply the corresponding change-of-basis matrix to c2w directly to perform a similar transformation, which is equivalent to the operation in the following figure:
(The picture above is from the blog)
In short, before using c2w, we must first figure out which coordinate system it transforms from and which coordinate system it transforms to.
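To make the two steps concrete (build a direction in the camera frame, then rotate it into the world frame), here is a minimal single-pixel sketch in plain Python. It assumes the COLMAP-style construction shown above, with x to the right, y down and z forward, and a 3x4 c2w given as a list of rows; pixel_ray and the camera values are hypothetical.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy, c2w):
    """Return (origin, unit direction) in world coordinates for pixel (u, v)."""
    # Direction in the camera frame, pointing through the pixel center.
    d_cam = [(u - cx + 0.5) / fx, (v - cy + 0.5) / fy, 1.0]
    # Rotate into the world frame: d_world = R @ d_cam.
    d_world = [sum(c2w[i][k] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    # All rays of one image start at the camera center, the last column of c2w.
    origin = [c2w[i][3] for i in range(3)]
    return origin, [c / norm for c in d_world]

# Hypothetical camera: identity rotation, placed at (0, 0, -4).
c2w = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, -4.0]]
origin, direction = pixel_ray(49.5, 49.5, fx=80.0, fy=80.0, cx=50.0, cy=50.0, c2w=c2w)
corner_origin, corner_direction = pixel_ray(0, 0, 80.0, 80.0, 50.0, 50.0, c2w)
print(origin, direction)
print(corner_direction)
```

The rotation step is the same operation as directions @ c2w[:, :3].T in get_rays, written out for one direction.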
3. Rendering part
The main body of nerf_pl's rendering is in models/rendering.py. This part is more complicated, but overall it can be divided into coarse-stage sampling and fine-stage sampling.
The inference function represents one forward pass: (x, d) is sent to the model, which outputs the corresponding rgb and sigma, and these are then composited by volume rendering. This function is easy to understand when compared with the NeRF principle we discussed in the previous chapter. nerf_pl abstracts this process so that the coarse forward pass and the fine forward pass can share it:
Coarse Forward:
rgb_coarse, depth_coarse, weights_coarse = \
inference(model_coarse, embedding_xyz, xyz_coarse_sampled, rays_d,
dir_embedded, z_vals, weights_only=False)
result = {
'rgb_coarse': rgb_coarse,
'depth_coarse': depth_coarse,
'opacity_coarse': weights_coarse.sum(1)
}
Fine Forward:
rgb_fine, depth_fine, weights_fine = \
inference(model_fine, embedding_xyz, xyz_fine_sampled, rays_d,
dir_embedded, z_vals, weights_only=False)
result['rgb_fine'] = rgb_fine
result['depth_fine'] = depth_fine
result['opacity_fine'] = weights_fine.sum(1)
After the coarse forward pass, we obtain the weights in the volume rendering formula (covered in detail in the previous chapter). The sample_pdf function uses these weights to compute more informative sample positions, and then both the uniform samples and the fine samples are sent to the fine forward pass for volume rendering.
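The weights mentioned above can be sketched for a single ray in plain Python. This is a simplified illustration of the standard compositing rule $w_i=T_i\,(1-e^{-\sigma_i\delta_i})$ with $T_i=\prod_{j<i}(1-\alpha_j)$, not a copy of the code in models/rendering.py; the function name, the large last interval and the sample values are assumptions of ours.

```python
import math

def render_weights(sigmas, z_vals, far_delta=1e10):
    """Per-sample weights along one ray from densities and sample depths."""
    # Distance between adjacent samples; the last interval is treated as very long.
    deltas = [z_vals[i + 1] - z_vals[i] for i in range(len(z_vals) - 1)] + [far_delta]
    weights, transmittance = [], 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weights.append(transmittance * alpha)    # light that reaches it and stops here
        transmittance *= 1.0 - alpha             # light that continues
    return weights

# Hypothetical ray: empty space, a dense surface, thin fog, empty space.
z_vals = [2.0, 3.0, 4.0, 5.0]
sigmas = [0.0, 5.0, 0.5, 0.0]
weights = render_weights(sigmas, z_vals)
opacity = sum(weights)
depth = sum(w * z for w, z in zip(weights, z_vals))
print([round(w, 4) for w in weights], round(opacity, 4), round(depth, 4))
```

The sum of the weights corresponds to the opacity entries stored in result, and because the weights concentrate where the density is high, they can serve as the distribution from which the fine samples are drawn.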
4. Training process
Since the author uses the PyTorch Lightning framework, most beginners will be confused when reading train.py. Let us first list, in order of execution during fit, the member functions of pytorch_lightning.LightningModule:
prepare_data(): This function is used to load the dataset before starting the training. It will only be called once in a single process, usually before initializing the trained model.
setup(): This function is used to initialize the model and dataset. It is called after the prepare_data() function and before the training process.
train_dataloader(): This function returns the data loader for the training data, which is called after the setup() function.
val_dataloader(): This function returns the data loader for the validation data, which is called after the train_dataloader() function.
test_dataloader(): This function returns the data loader for the test data, which is called after the val_dataloader() function.
forward(): This function is used to define the forward pass process of the model, which is called during training and inference.
training_step(): This function is called on each training batch to compute and return the loss; Lightning then performs backpropagation and the optimizer step.
validation_step(): This function is called in each validation batch to evaluate the performance of the model on the validation set.
test_step(): This function is called in each test batch to evaluate the performance of the model on the test set.
training_epoch_end(): This function is called at the end of each training round (epoch) to process and summarize statistical information during the training process.
validation_epoch_end(): This function is called at the end of each validation round (epoch) to process and summarize statistical information during the validation process.
test_epoch_end(): This function is called at the end of each test round (epoch) to process and summarize statistics during the test.
configure_optimizers(): This function is used to define the optimizer and its hyperparameters, and is called before the training process.
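The call order can be mimicked with a toy stand-in for the trainer. Everything below (ToyModule, toy_fit, the calls list) is hypothetical and greatly simplified. The real Trainer.fit does much more (sanity checks, callbacks, device placement), and the exact order of some hooks differs between Lightning versions.

```python
class ToyModule:
    """Stand-in for a LightningModule that only records which hook ran."""

    def __init__(self):
        self.calls = []

    def prepare_data(self):
        self.calls.append("prepare_data")

    def setup(self):
        self.calls.append("setup")

    def configure_optimizers(self):
        self.calls.append("configure_optimizers")

    def train_dataloader(self):
        self.calls.append("train_dataloader")
        return [0, 1]          # two fake training batches

    def val_dataloader(self):
        self.calls.append("val_dataloader")
        return [0]             # one fake validation batch

    def training_step(self, batch):
        self.calls.append("training_step")

    def validation_step(self, batch):
        self.calls.append("validation_step")


def toy_fit(module, max_epochs=1):
    """Simplified imitation of a fit loop (an assumption, not Lightning's code)."""
    module.prepare_data()
    module.setup()
    module.configure_optimizers()
    train_batches = module.train_dataloader()
    val_batches = module.val_dataloader()
    for _ in range(max_epochs):
        for batch in train_batches:
            module.training_step(batch)    # a real trainer also runs backward + optimizer step
        for batch in val_batches:
            module.validation_step(batch)


module = ToyModule()
toy_fit(module)
print(module.calls)
```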
Knowing the role of these member methods, we only need to focus on forward and training_step:
def training_step(self, batch, batch_nb):
    log = {'lr': get_learning_rate(self.optimizer)}
    rays, rgbs = self.decode_batch(batch)
    results = self(rays)
    log['train/loss'] = loss = self.loss(results, rgbs)
    typ = 'fine' if 'rgb_fine' in results else 'coarse'

    with torch.no_grad():
        psnr_ = psnr(results[f'rgb_{typ}'], rgbs)
        log['train/psnr'] = psnr_

    return {'loss': loss,
            'progress_bar': {'train_psnr': psnr_},
            'log': log
           }
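The PSNR logged in training_step is only a monitoring metric, which is why it is computed under torch.no_grad(). As a small sketch, assuming colors in [0, 1] and the common definition PSNR = -10 * log10(MSE), it could look as follows. The helper psnr_sketch is hypothetical and is not the psnr function imported by nerf_pl.

```python
import torch

def psnr_sketch(pred, target):
    """PSNR in dB for images whose values lie in [0, 1] (assumed definition)."""
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)

# Toy check: every predicted channel is off by 0.1, so MSE = 0.01.
target = torch.zeros(4, 3)
pred = torch.full((4, 3), 0.1)
with torch.no_grad():              # no gradients needed for a logged metric
    value = psnr_sketch(pred, target)
print(value.item())
```

A lower MSE gives a higher PSNR, so the logged value should rise as training progresses.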
After Lightning has built each dataloader, it enters training_step directly and passes in the data returned by the dataloader, that is, what the dataset's __getitem__() returns. For example, in datasets/blender.py:
def __getitem__(self, idx):
    if self.split == 'train': # use data in the buffers
        sample = {'rays': self.all_rays[idx],
                  'rgbs': self.all_rgbs[idx]}

    else: # create data for each image separately
        frame = self.meta['frames'][idx]
        c2w = torch.FloatTensor(frame['transform_matrix'])[:3, :4]

        img = Image.open(os.path.join(self.root_dir, f"{frame['file_path']}.png"))
        img = img.resize(self.img_wh, Image.LANCZOS)
        img = self.transform(img) # (4, H, W)
        valid_mask = (img[-1]>0).flatten() # (H*W) valid color area
        img = img.view(4, -1).permute(1, 0) # (H*W, 4) RGBA
        img = img[:, :3]*img[:, -1:] + (1-img[:, -1:]) # blend A to RGB

        rays_o, rays_d = get_rays(self.directions, c2w)

        rays = torch.cat([rays_o, rays_d,
                          self.near*torch.ones_like(rays_o[:, :1]),
                          self.far*torch.ones_like(rays_o[:, :1])],
                         1) # (H*W, 8)

        sample = {'rays': rays,
                  'rgbs': img,
                  'c2w': c2w,
                  'valid_mask': valid_mask}

    return sample
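How a dictionary returned by __getitem__ becomes the batch seen by training_step can be shown with a toy dataset. This is a sketch under assumptions: ToyRayDataset and its random data are hypothetical, and only the standard torch.utils.data Dataset and DataLoader classes are used.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyRayDataset(Dataset):
    """Hypothetical dataset: each item is one ray (8 numbers) and its RGB color."""

    def __init__(self, n_rays=10):
        # 8 = origin (3) + direction (3) + near (1) + far (1), as in the rays tensor above
        self.all_rays = torch.rand(n_rays, 8)
        self.all_rgbs = torch.rand(n_rays, 3)

    def __len__(self):
        return len(self.all_rays)

    def __getitem__(self, idx):
        return {'rays': self.all_rays[idx], 'rgbs': self.all_rgbs[idx]}

# The default collate function stacks each dictionary entry along a new batch dimension.
loader = DataLoader(ToyRayDataset(), batch_size=4, shuffle=True)
batch = next(iter(loader))
rays, rgbs = batch['rays'], batch['rgbs']   # what a decode_batch-style helper would unpack
print(rays.shape, rgbs.shape)
```

Each key of the sample dictionary keeps its name in the batch. Only a leading batch dimension is added.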
The batch argument of training_step is therefore the sample above, stacked into a batch by the dataloader. training_step runs the forward pass through results = self(rays) and computes a loss between the rendered result and the ground-truth pixel colors rgbs. Lightning then backpropagates the returned loss and updates the weights automatically, so we do not need to call loss.backward() or opt.step() ourselves inside training_step.
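For comparison, the steps that Lightning takes care of look roughly like this in a hand-written PyTorch loop. This is a sketch under assumptions: the linear layer merely stands in for the NeRF networks, and the random rays and rgbs tensors are made-up data.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 3)                      # stand-in for the NeRF networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
rays, rgbs = torch.rand(64, 8), torch.rand(64, 3)

losses = []
for step in range(20):
    results = model(rays)                    # forward pass, like results = self(rays)
    loss = nn.functional.mse_loss(results, rgbs)
    optimizer.zero_grad()                    # the calls below are the ones
    loss.backward()                          # a trainer framework would make
    optimizer.step()                         # on our behalf
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With Lightning, only the forward pass and the loss computation remain in training_step.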
The forward method invoked by self(rays) splits the rays into chunks, renders each chunk with render_rays, and concatenates the results:
def forward(self, rays):
    """Do batched inference on rays using chunk."""
    B = rays.shape[0]
    results = defaultdict(list)
    for i in range(0, B, self.hparams.chunk):
        rendered_ray_chunks = \
            render_rays(self.models,
                        self.embeddings,
                        rays[i:i+self.hparams.chunk],
                        self.hparams.N_samples,
                        self.hparams.use_disp,
                        self.hparams.perturb,
                        self.hparams.noise_std,
                        self.hparams.N_importance,
                        self.hparams.chunk, # chunk size is effective in val mode
                        self.train_dataset.white_back)

        for k, v in rendered_ray_chunks.items():
            results[k] += [v]

    for k, v in results.items():
        results[k] = torch.cat(v, 0)
    return results
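The chunking pattern used in forward can be isolated in a few lines. In this sketch, render_chunk is a hypothetical stand-in for render_rays that returns a dictionary of per-ray outputs, and batched_forward is an illustrative name rather than part of nerf_pl.

```python
from collections import defaultdict

import torch

def render_chunk(rays):
    """Hypothetical stand-in for render_rays: a dict of per-ray outputs."""
    return {'rgb': rays[:, :3] * 0.5, 'depth': rays[:, 6]}

def batched_forward(rays, chunk):
    """Process rays in slices of at most `chunk` rays to bound peak memory."""
    results = defaultdict(list)
    for i in range(0, rays.shape[0], chunk):
        for k, v in render_chunk(rays[i:i + chunk]).items():
            results[k].append(v)
    # Stitch the per-chunk outputs back together along the ray dimension.
    return {k: torch.cat(v, 0) for k, v in results.items()}

rays = torch.rand(10, 8)
out = batched_forward(rays, chunk=4)         # processed as chunks of 4, 4 and 2 rays
print(out['rgb'].shape, out['depth'].shape)
```

Because every ray is rendered independently, the chunked result matches a single pass over all rays. Chunking only limits how many rays are held in memory at once.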
Summary
This blog briefly introduced the use of the colmap GUI, the principle of CUDA operators together with a simple example, and finally the general workflow of nerf_pl, in the hope of helping readers further understand the principle of NeRF.
In the next blog, we will introduce the use of some open source frameworks such as sdfstudio and nerfstudio, as well as nerfacc, an acceleration library based on occupancy grids and ray marching.