[NeRF] (1) NeRF paper study notes


Overview:

  1. Reconstruction: reconstruct three-dimensional objects from the available two-dimensional images taken from different angles.

    Use an MLP network to learn the scene representation: the input is the 3D spatial coordinate and viewing direction corresponding to the source image, $(x, y, z, \theta, \phi)$; the output is $(R, G, B, \sigma)$, representing the RGB color value and the volume density respectively.

  2. Training and rendering: Volume Rendering generates predicted 2D images.

    • Let the camera ray pass through the scene and generate a set of sampled 3D points
    • These points and their corresponding 2D viewing directions are used as input to the neural network to produce a set of color and density outputs.
    • Accumulate these colors and densities into a 2D image using classic volume rendering techniques
    • L2 loss is calculated between the predicted 2D image and ground truth image, and then the MLP is optimized.

The overall process is as follows:

*(figure: the overall NeRF process)*

1 Implementation process

1.1 Camera parameters: how to obtain input data from photos taken from different angles

Process summary:

  • Data preprocessing

    • Given a set of images, use the COLMAP software to estimate the camera extrinsic and intrinsic parameters.

    • Then convert the COLMAP output into the LLFF format, which stores 17 parameters per image (12 extrinsic parameters, 3 intrinsic parameters, and 2 depth-range bounds).

  • Input

    • Pick a position on the image plane (whose ground-truth pixel value is known); the line connecting it with the camera center forms a ray. First compute the ray direction vector in the camera coordinate system, then transform it into the world coordinate system using the inverse of the extrinsic matrix (c2w).
    • Sample n points along the ray between the near and far bounds to obtain 3D positions; together with the viewing direction these form n five-dimensional vectors, which are the network input (see the sketch below).
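
As a rough illustration of that second step, here is a minimal NumPy sketch (function and variable names are my own, not from the official code) that samples n points between the near and far bounds of one ray and pairs each point with the viewing direction; in practice the direction is usually carried as a 3D unit vector rather than $(\theta, \phi)$:

```python
import numpy as np

def build_ray_inputs(ray_o, ray_d, near, far, n_samples):
    """Sample n points along one ray and pair them with the viewing direction."""
    t_vals = np.linspace(near, far, n_samples)                 # depths along the ray
    pts = ray_o[None, :] + t_vals[:, None] * ray_d[None, :]    # (n, 3) world-space positions
    dirs = np.broadcast_to(ray_d / np.linalg.norm(ray_d), pts.shape)  # (n, 3) unit view direction
    return np.concatenate([pts, dirs], axis=-1)                # (n, 6): xyz + direction

# usage: one ray from the origin looking down -z, sampled at 8 depths
inputs = build_ray_inputs(np.zeros(3), np.array([0.0, 0.0, -1.0]), near=2.0, far=6.0, n_samples=8)
print(inputs.shape)  # (8, 6)
```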

Additional knowledge points:

  1. Homogeneous coordinate system

    If a point is at infinity, its coordinates become $(\infty, \infty)$, which is meaningless in Euclidean space. With homogeneous coordinates, parallel lines intersect at a point at infinity in projective space, so points at infinity can be represented.

    In short, homogeneous coordinates use N+1 dimensions to represent N-dimensional coordinates.

    We can append an extra variable $w$ to a 2D Cartesian coordinate to form a 2D homogeneous coordinate. A point $(X, Y)$ in the Cartesian coordinate system thus becomes $(x, y, w)$ in homogeneous coordinates, with
    $$X=\frac{x}{w};\quad Y=\frac{y}{w}$$
    For example, $(1, 2)$ in the Cartesian coordinate system can be expressed as $(1, 2, 1)$ in homogeneous coordinates. If the point $(1, 2)$ moves to infinity, its Cartesian coordinates become $(\infty, \infty)$, while its homogeneous coordinates can be written as $(1, 2, 0)$. Note that in this case we no longer need "$\infty$" to represent a point at infinity.

    Also note that the fourth component of a direction vector's homogeneous coordinates is 0, while the fourth component of a point's coordinates is 1 (see the sketch below).

    Reference: What is a homogeneous coordinate system? Why use a homogeneous coordinate system? - Zhihu (zhihu.com)
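
A tiny NumPy example of that last point, using an arbitrary 4x4 rigid transform of my own choosing: with w = 1 the translation applies (a point), with w = 0 it does not (a direction).

```python
import numpy as np

# An assumed 4x4 rigid transform: 90-degree rotation about z plus a translation
T = np.array([[0., -1., 0., 5.],
              [1.,  0., 0., 0.],
              [0.,  0., 1., 2.],
              [0.,  0., 0., 1.]])

point     = np.array([1., 2., 3., 1.])   # w = 1: a point, translation applies
direction = np.array([1., 2., 3., 0.])   # w = 0: a direction, translation ignored

print(T @ point)      # [ 3.  1.  5.  1.]
print(T @ direction)  # [-2.  1.  3.  0.]
```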

  2. Camera coordinate system:


    To establish the mapping between points in 3D space and the image plane, as well as the relative relationship between multiple cameras, we define a local camera coordinate system. The figure below shows common coordinate-system conventions.

    *(figure: common camera coordinate-system conventions)*
  3. Camera extrinsics and intrinsics

    • Camera extrinsics

      The camera extrinsic is a 4x4 matrix $M$ that transforms a point in the world coordinate system, $P_{world} = [x, y, z, 1]^T$, into the camera coordinate system: $P_{camera} = M P_{world}$. For this reason the extrinsic matrix is also called the world-to-camera (w2c) matrix. NeRF mainly uses the camera-to-world (c2w) matrix, the inverse of the extrinsic matrix, which transforms points from the camera coordinate system into the world coordinate system.


      The values of the c2w matrix directly describe the orientation and origin of the camera coordinate system:

      *(figure: structure of the c2w matrix; the columns of its rotation block are the camera axes in world coordinates, and its translation column is the camera origin)*
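
A minimal NumPy sketch of the w2c/c2w relationship described above, using an arbitrary example pose (not any specific dataset): c2w is obtained by inverting the extrinsic matrix, its 3x3 block holds the camera axes and its last column holds the camera origin, both expressed in world coordinates.

```python
import numpy as np

# Assumed extrinsic (w2c): rotation R and translation t, so that P_camera = w2c @ P_world
R = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
t = np.array([1., 2., 3.])
w2c = np.eye(4)
w2c[:3, :3], w2c[:3, 3] = R, t

c2w = np.linalg.inv(w2c)            # camera-to-world, as used by NeRF
cam_origin_world = c2w[:3, 3]       # camera center expressed in world coordinates
cam_axes_world = c2w[:3, :3]        # columns are the camera x/y/z axes in world coordinates

print(cam_origin_world)             # equals -R.T @ t -> [-3. -1. -2.]
```
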
    • Camera intrinsics

      The camera intrinsic matrix maps 3D coordinates in the camera coordinate system onto the 2D image plane. Taking a pinhole camera as an example, the intrinsic matrix K is:

      $$K=\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

      The intrinsic matrix K contains 4 values: $f_x$ and $f_y$ are the camera's focal lengths in the horizontal and vertical directions (for an ideal pinhole camera, $f_x = f_y$). The physical meaning of the focal length is the distance from the camera center to the image plane, measured in pixels. $c_x$ and $c_y$ are the horizontal and vertical offsets of the image origin relative to the camera optical center; they can often be approximated by half of the image width and height. In practice, the intrinsics are recorded with only three values $(F, H, W)$, i.e. the focal length and the image height and width.
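
As a quick sanity check of the intrinsic matrix, here is a minimal NumPy sketch (the focal length and principal-point values are made up for illustration) that projects a point from the camera coordinate system onto the image plane:

```python
import numpy as np

fx = fy = 500.0          # focal length in pixels (ideal pinhole: fx = fy)
cx, cy = 320.0, 240.0    # principal point, roughly half of a 640x480 image

K = np.array([[fx, 0.,  cx],
              [0., fy,  cy],
              [0., 0.,  1.]])

P_cam = np.array([0.2, -0.1, 2.0])   # a point in camera coordinates (z = depth)
uvw = K @ P_cam
u, v = uvw[:2] / uvw[2]              # perspective divide by depth
print(u, v)                          # 370.0 215.0
```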

  4. From COLMAP to LLFF:

    This conversion is implemented by imgs2poses.py.

  5. How to construct 3D space rays


    First, the x and y coordinates of the pixel in 3D are the 2D image coordinates $(i, j)$ minus the optical-center coordinates $(c_x, c_y)$, and the z coordinate is the focal length $f$ (because the distance between the image plane and the camera center is exactly the focal length $f$). The ray direction vector is therefore $(i-c_x, j-c_y, f) - (0, 0, 0) = (i-c_x, j-c_y, f)$. Since it is a direction vector, we can divide it by the focal length $f$ to normalize the z component, obtaining $\left(\frac{i-c_x}{f}, \frac{j-c_y}{f}, 1\right)$.
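
Putting the pieces of this subsection together, a minimal NumPy sketch of ray generation might look as follows. The function name and the simple "camera looks along +z" convention are mine; the official NeRF implementation additionally negates the y and z components to match its own coordinate convention.

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Ray origin and direction in world coordinates for every pixel (simple +z convention)."""
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")       # pixel grid
    dirs = np.stack([(i - W * 0.5) / focal,                             # (i - cx) / f
                     (j - H * 0.5) / focal,                             # (j - cy) / f
                     np.ones_like(i, dtype=np.float64)], axis=-1)       # z normalized to 1
    rays_d = dirs @ c2w[:3, :3].T                   # rotate directions into world coordinates
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # all rays start at the camera origin
    return rays_o, rays_d

rays_o, rays_d = get_rays(4, 4, focal=2.0, c2w=np.eye(4))
print(rays_o.shape, rays_d.shape)                   # (4, 4, 3) (4, 4, 3)
```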

Reference: NeRF code interpretation - camera parameters and coordinate system transformation - Chen Guanying's article - Zhihu

1.2 MLP

The mapping from $(x, y, z, \theta, \phi)$ to $(R, G, B, \sigma)$ is done by an MLP, whose structure is shown below:

*(figure: NeRF MLP architecture)*

Here $(\theta, \phi)$ affects only the color values, not the density $\sigma$, which is why this structure is adopted: the viewing direction is injected into the network only after $\sigma$ has been predicted.
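
A simplified PyTorch sketch of this structure (not the exact architecture of the paper, which uses 8 fully connected layers of width 256 with a skip connection; the input sizes 63 and 27 correspond to the positionally encoded inputs described in Section 1.4):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Simplified NeRF MLP: sigma from position only, RGB from position + view direction."""
    def __init__(self, pos_dim=63, dir_dim=27, width=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)                   # density: no view dependence
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),  # direction injected only here
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h))                  # density is non-negative
        rgb = self.rgb_head(torch.cat([h, dir_enc], dim=-1))    # color in [0, 1]
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.randn(1024, 63), torch.randn(1024, 27))
print(rgb.shape, sigma.shape)   # torch.Size([1024, 3]) torch.Size([1024, 1])
```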

1.3 Volume rendering and discretization

For each pixel of an image, we first generate a ray based on the camera pose, then divide the interval $[t_n, t_f]$ along the ray into N equal bins and draw one sample uniformly from each bin:
$$t_i\sim\mathcal{U}\left[t_n+\frac{i-1}{N}(t_f-t_n),\ t_n+\frac{i}{N}(t_f-t_n)\right] \tag{1}$$
The color of the corresponding pixel in the 2D image can then be computed in discretized form:
$$\hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp(-\sigma_i\delta_i)\right)\mathbf{c}_i,\quad\text{where } T_i=\exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right) \tag{2}$$
Here $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples.
In this way, we can get the color of each pixel in the image.

The detailed process can be seen at: Derivation of the volume rendering formula
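
Formula (2) maps almost directly onto code. A minimal NumPy sketch for a single ray (function and variable names are my own, not the official implementation):

```python
import numpy as np

def render_ray(rgb, sigma, t_vals):
    """Discretized volume rendering, formula (2): rgb (N,3), sigma (N,), t_vals (N,)."""
    delta = np.diff(t_vals, append=1e10)          # delta_i = t_{i+1} - t_i (last one "infinite")
    alpha = 1.0 - np.exp(-sigma * delta)          # opacity of each segment
    trans = np.cumprod(1.0 - alpha + 1e-10)       # prod_{j<=i} (1 - alpha_j)
    T = np.concatenate([[1.0], trans[:-1]])       # T_i = prod_{j<i} (1 - alpha_j)
    weights = T * alpha                           # w_i = T_i (1 - exp(-sigma_i delta_i))
    return (weights[:, None] * rgb).sum(axis=0)   # accumulated pixel color

# toy example: 64 samples along one ray
t = np.linspace(2.0, 6.0, 64)
color = render_ray(np.random.rand(64, 3), np.random.rand(64), t)
print(color)
```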

1.4 Optimization points

  1. Positional encoding
    Define
    $$\gamma(p)=\left(\sin\left(2^0\pi p\right),\cos\left(2^0\pi p\right),\cdots,\sin\left(2^{L-1}\pi p\right),\cos\left(2^{L-1}\pi p\right)\right) \tag{3}$$
    The input values are expanded into this Fourier-style series, which makes high-frequency, high-dimensional features easier for the neural network to extract and leads to better results.

    Here $L=10$ is used for the three coordinate values $(x, y, z)$, and $L=4$ is used for the direction vector, i.e. $\gamma(\mathbf{d})$.
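
A minimal NumPy sketch of formula (3). Whether the raw input is concatenated alongside the sines and cosines varies by description; the commonly used implementation keeps it, and I include it here, which is where the often-quoted input sizes 63 and 27 come from.

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p) from formula (3), applied element-wise; p has shape (..., D)."""
    freqs = 2.0 ** np.arange(L) * np.pi                              # 2^0 pi, ..., 2^{L-1} pi
    angles = p[..., None] * freqs                                    # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return np.concatenate([p, enc.reshape(*p.shape[:-1], -1)], axis=-1)

xyz = np.random.rand(5, 3)
print(positional_encoding(xyz, L=10).shape)   # (5, 63) = 3 + 3*2*10
```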

  2. Hierarchical volume sampling

    Two rounds of sampling are performed (coarse and fine):

    • First, coarse uniform sampling ($N_c = 64$). Rewriting the color from formula (2) in a slightly different form:
      $$\hat{C}_c(\mathbf{r})=\sum_{i=1}^{N_c}w_ic_i,\quad w_i=T_i\left(1-\exp(-\sigma_i\delta_i)\right) \tag{4}$$
      This gives the weight of each sampling point.

      The weights are then normalized, $\hat{w}_i = w_i \big/ \sum_{j=1}^{N_c}w_j$, and treated as a piecewise-constant probability density function (PDF) along the ray.

    • For the second, fine sampling, inverse transform sampling is performed on the PDF obtained from the coarse pass ($N_f = 128$); all of the sampled points ($N_c + N_f$) are then used to train the fine network and obtain $\hat{C}_f(\mathbf{r})$.

    Training is performed after sampling is complete. The loss function is
    $$\mathcal{L}=\sum_{\mathbf{r}\in\mathcal{R}}\left[\left\|\hat{C}_c(\mathbf{r})-C(\mathbf{r})\right\|_2^2+\left\|\hat{C}_f(\mathbf{r})-C(\mathbf{r})\right\|_2^2\right]$$
    Although the final rendering comes from $\hat{C}_f(\mathbf{r})$, the loss of $\hat{C}_c(\mathbf{r})$ is also minimized, so that the weight distribution from the coarse network can be used to allocate samples in the fine network.
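
A simplified sketch of the inverse transform sampling step (it plays the same role as the sample_pdf helper found in common NeRF implementations, but this is my own reduced version): build the CDF of the normalized coarse weights and invert it at uniform random values to obtain the fine sample depths.

```python
import numpy as np

def sample_pdf(bins, weights, n_fine):
    """Inverse transform sampling: draw n_fine depths from the piecewise-constant PDF."""
    pdf = weights / weights.sum()                    # normalized weights \hat{w}_i
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])    # piecewise-linear CDF over the bins
    u = np.random.rand(n_fine)                       # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1  # which bin each u falls into
    idx = np.clip(idx, 0, len(weights) - 1)
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom                    # position of the sample inside its bin
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])

coarse_t = np.linspace(2.0, 6.0, 65)                 # 64 bins between near and far
coarse_w = np.random.rand(64)                        # weights w_i from the coarse pass
fine_t = np.sort(sample_pdf(coarse_t, coarse_w, n_fine=128))
print(fine_t.shape)                                  # (128,)
```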

References and materials
[1]: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[2]: NeRF code interpretation - camera parameters and coordinate system transformation - Chen Guanying's article - Zhihu
[3]: A thorough explanation of NeRF principles friendly to “graphics newbies”
