[Paper reading] Critical analysis of 3D reconstruction based on NeRF


Abstract

This paper provides a critical analysis of image-based 3D reconstruction using neural radiance fields (NeRFs), focusing on quantitative comparisons with traditional photogrammetry. The aim is to objectively evaluate the advantages and disadvantages of NeRFs and gain insights into their applicability in different real-life scenarios, ranging from small objects to heritage and industrial settings. After a comprehensive overview of photogrammetry and NeRF methods, their respective advantages and disadvantages are highlighted, and various NeRF methods are compared on objects of different sizes and surface characteristics, including textureless, metallic, translucent and transparent surfaces. The quality of the 3D reconstruction results is evaluated using several criteria, such as noise level, geometric accuracy, and the number of images required (i.e., image baseline). The results show that NeRF exhibits superior performance to photogrammetry on non-cooperative objects with textureless, reflective and refractive surfaces. In contrast, photogrammetry outperforms NeRF in cases where the object surface has cooperative textures. This complementarity should be further exploited in future work.

1. Introduction

In the fields of computer vision and photogrammetry, high-quality 3D reconstruction is an important topic with many applications, such as quality inspection, reverse engineering, structural monitoring, digital preservation, etc. Measurement techniques that combine low cost, portability and flexibility with high geometric accuracy and high-resolution detail have been in high demand for many years. Existing 3D reconstruction methods can be broadly classified into contact and non-contact techniques [1]. To determine the precise 3D shape of an object, contact-based techniques often employ physical tools such as calipers or coordinate measuring machines. While precise geometric 3D measurements are feasible and well suited for many applications, they have some drawbacks, such as the long time required to acquire data and the sparse 3D reconstructions produced, the limitations of the measurement system, and/or the need for expensive instrumentation, which limits their use to professional laboratories and projects with unique metrology specifications. On the other hand, non-contact technologies enable accurate 3D reconstruction without these disadvantages. Most researchers focus on passive image-based methods due to their low cost, portability, and flexibility in a wide range of applications, including industrial inspection and quality control [2-5] and heritage 3D documentation [6-9].
Figure 1. Basic concepts of NeRF scene representation (after [16]) – see also Section 2.2.

As shown in Figure 1, the neural network takes as input a set of continuous 5D coordinates, consisting of a spatial position (x, y, z) and a viewing direction (θ, φ), and outputs the volume density (σ) at each point together with the view-dependent emitted radiance (RGB). Novel views are then rendered from the NeRF, and the 3D geometry can be derived, for example in the form of a mesh, by marching along the camera rays [23].
However, despite their recent popularity, NeRF-based methods still need to be critically compared against more traditional photogrammetry in order to objectively quantify the quality of the generated 3D models and fully understand their advantages and limitations.

Aims of This Research
NeRF methods have recently emerged as a promising alternative to photogrammetry and computer vision in the field of image-based 3D reconstruction. Therefore, this study aims to thoroughly analyze NeRF methods for 3D reconstruction. We evaluate the accuracy of 3D reconstructions generated with NeRF-based techniques against photogrammetry, on objects of various sizes and surface characteristics (textured, untextured, metallic, translucent, and transparent). The data generated by each technique are examined in terms of surface deviation (noise level) and geometric accuracy. The final goal is to evaluate the applicability of NeRF methods in real-world scenarios and provide objective evaluation metrics regarding the advantages and limitations of NeRF-based 3D reconstruction methods.
This paper is structured as follows: Section 2 provides an overview of previous research on 3D reconstruction using photogrammetry-based and NeRF-based methods. Section 3 introduces the proposed quality assessment procedure and the datasets used, and Section 4 reports the assessment and comparison results. Finally, Section 5 provides conclusions and future research plans.

2. The State of the Art

In this section, a comprehensive overview of previous 3D reconstruction studies is provided, covering photogrammetry- and NeRF-based methods and considering their application to non-cooperative surfaces (reflective, textureless, etc.).

2.1. Photogrammetric-Based Methods

Photogrammetry is a widely accepted method for 3D modeling of well-textured objects, capable of accurately and reliably recovering the 3D shape of objects through multi-view stereo (MVS) methods. Photogrammetry-based methods [19, 24-30] either rely on feature matching for depth estimation [27, 28] or use voxels to represent shape [24, 29, 31, 32]. Learning-based MVS methods are also available, but they usually replace parts of the classic MVS pipeline, such as feature matching [33-36], depth fusion [37, 38] or multi-view depth inference [39-41]. However, objects with textureless, reflective or refractive surfaces are difficult to reconstruct, as all photogrammetric methods require matching correspondences across multiple images [14]. To address this problem, various photogrammetric approaches have been developed to reconstruct such non-cooperative objects. For textureless objects, solutions such as random pattern projection [13, 42, 43] or synthetic patterns [14, 44] have been proposed. However, these methods have difficulty handling highly reflective surfaces with strong specular reflection or interreflection [43]. Other methods such as cross-polarization [7, 45] and image preprocessing [46, 47] have been used for reflective or non-cooperative surfaces, but some of these techniques may smooth out surface roughness and affect texture consistency between views [48, 49]. Photogrammetry is also used in hybrid methods [50-53], where MVS generates a sparse 3D shape that serves as the basis for high-resolution measurements with photometric stereo (PS). Traditional [52, 54, 55] and learning-based [56-58] PS methods are also used to solve the image irradiance equation and retrieve the geometry of the imaged object, but specular surfaces remain challenging for all image-based methods.

2.2. NeRF-Based Methods

Synthesizing photorealistic images and videos is at the core of computer graphics and has been a research focus for decades [59]. Neural rendering is a learning-based approach to image and video generation that offers control over scene properties such as illumination, camera parameters, pose, geometry and appearance. It combines deep learning methods with physical knowledge from computer graphics to obtain controllable and photorealistic (3D) scene models. Among these approaches, NeRF, first proposed by Mildenhall et al. in 2020, is a method for rendering novel views and reconstructing 3D scenes using an implicit representation (Figure 1). In the NeRF approach, a neural network is employed to learn the 3D shape of an object from 2D images. The radiance field, as defined in Equation (1), captures the color and volume density of every point in the scene from every possible viewing direction:

$$F_{\Theta} : (X, d) \rightarrow (c, \sigma) \tag{1}$$

The NeRF model is represented by a neural network, where X denotes the 3D coordinates, d the azimuthal and polar viewing angles, c the color, and σ the volume density of the scene. To ensure multi-view consistency, the prediction of σ is designed to be independent of the viewing direction, whereas the color c may vary with both viewing direction and position. To achieve this, a multilayer perceptron (MLP) is employed in two steps. In the first step, the MLP takes X as input and outputs σ together with a high-dimensional feature vector. The feature vector is then combined with the viewing direction d and passed through an additional MLP that produces the color c. The original NeRF implementation, as well as subsequent methods, adopts a non-deterministic stratified sampling approach, as shown in Equations (2)-(4). Each ray is partitioned into N evenly spaced bins, and one sample is drawn uniformly from each bin:
$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \tag{2}$$

$$t_i \sim \mathcal{U}\!\left[t_n + \frac{i-1}{N}\left(t_f - t_n\right),\; t_n + \frac{i}{N}\left(t_f - t_n\right)\right] \tag{3}$$

$$\alpha_i = 1 - e^{-\sigma_i \delta_i} \tag{4}$$
where δi denotes the distance between consecutive samples (i and i + 1), while σi and ci denote the estimated density and color at sample point i. The transparency (or opacity) αi at sample point i is computed with Equation (4).
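To make the sampling and compositing steps concrete, below is a minimal NumPy sketch of Equations (2)-(4) for a single ray; the array names are illustrative, and the densities/colors are assumed to come from a trained MLP.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Composite per-sample density/color along one ray (Eqs. (2) and (4)).

    sigmas : (N,)  densities predicted at the sample points
    colors : (N,3) view-dependent RGB predicted at the sample points
    t_vals : (N,)  sample depths drawn with stratified sampling (Eq. (3))
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)  # delta_i between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)             # opacity alpha_i, Eq. (4)
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                            # contribution of sample i
    rgb = (weights[:, None] * colors).sum(axis=0)       # rendered color C(r), Eq. (2)
    return rgb, weights

# Stratified sampling (Eq. (3)): split [near, far] into N bins, one uniform draw each.
near, far, N = 2.0, 6.0, 64
edges = np.linspace(near, far, N + 1)
t_vals = edges[:-1] + np.random.rand(N) * (edges[1:] - edges[:-1])
```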
Subsequent methods [60-62] also incorporate the estimated depth, as shown in Equation (5), either to constrain the densities so that they resemble delta functions at the scene surface or to enforce depth smoothness:
$$\hat{D}(r) = \sum_{i=1}^{N} T_i\,\alpha_i\,t_i \tag{5}$$
To optimize the MLP parameters, a squared-error photometric loss is applied per pixel:

$$\mathcal{L} = \sum_{r \in R} \left\lVert \hat{C}(r) - C_{gt}(r) \right\rVert_2^2 \tag{6}$$
where C_gt(r) denotes the ground-truth color of the pixel in the training image corresponding to ray r, and R refers to the batch of rays associated with the image to be synthesized. It should be noted that the learned NeRF implicit 3D representation is designed for view rendering. To obtain explicit 3D geometry, depth maps for the different views need to be extracted by taking the maximum likelihood of the depth distribution along each ray. These depth maps can then be fused to derive point clouds or fed into the Marching Cubes [23] algorithm to derive 3D meshes.
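As a hedged sketch of this mesh-extraction step, the snippet below runs Marching Cubes (via scikit-image) on a density grid assumed to have been sampled from the trained network; the file name, iso-level and voxel size are placeholders, not values from the paper.

```python
import numpy as np
from skimage import measure

# Hypothetical density grid: sigma evaluated on a regular 3D lattice (precomputed).
density = np.load("density_grid.npy")           # shape (Nx, Ny, Nz)

# Extract the isosurface at a chosen density threshold; a low level keeps
# semi-transparent regions, a high level suppresses floaters/noise.
verts, faces, normals, _ = measure.marching_cubes(density, level=10.0)

# Vertices are in voxel coordinates; rescale with the lattice spacing for metric units.
voxel_size = 0.005                              # assumed 5 mm spacing
verts_metric = verts * voxel_size
```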
Although NeRF offers an alternative to traditional photogrammetry for 3D reconstruction and can produce promising results in situations where photogrammetry does not deliver accurate results, it still faces some limitations, as reported by different authors [63-68]. From a 3D metrology perspective, the main issues to consider include:

(1) The resolution of the resulting neural rendering (which is subsequently converted to a 3D mesh) may be limited by the quality and resolution of the input data. In general, higher resolution input data will produce higher resolution 3D meshes, but at the expense of increased computational requirements.
(2) Generating neural renderings (and then generating 3D meshes) using NeRF can be computationally intensive, requiring a lot of computing power and memory.
(3) It is generally impossible to accurately model the 3D shape of non-rigid objects.
(4) The original NeRF model is optimized with a per-pixel RGB reconstruction loss, which may lead to noisy reconstructions since infinitely many photo-consistent interpretations exist when only RGB images are used as input.
(5) NeRF usually requires a large number of input images with small baselines to generate accurate 3D meshes, especially for scenes with complex geometry or occlusions. This can be a challenge where images are difficult to obtain or computing resources are limited.

In response to the above problems, researchers have proposed modifications and extensions to the original NeRF method to improve performance and 3D results. Owing to the insufficient high-frequency representation capability of NeRFs, Tancik et al. [69] and Sitzmann et al. [70] adopted positional encoding operations at different frequencies to improve the resolution of the neural rendering results. Following this, other methods focused on improving the efficiency and resolution of neural rendering in different ways, including model acceleration [20, 71], compression [72-74], re-lighting [75-77], view-dependent normalization [78] or high-resolution 2D feature planes [68]. Müller et al. [20] introduced instant neural graphics primitives with multi-resolution hash encoding, which allows fast and efficient generation of 3D models. Barron et al. [64, 79] proposed Mip-NeRF, a modified version of the original NeRF that allows scenes to be represented at continuously valued scales. Mip-NeRF greatly improves NeRF's ability to capture fine details by rendering anti-aliased conical frustums instead of rays. However, limitations of this approach may include training difficulties and computational efficiency issues. Chen et al. [72] proposed TensoRF, a new method that models and reconstructs the radiance field of a scene as a 4D tensor, representing a 3D voxel grid with multi-channel features per voxel. In addition to superior rendering quality, this approach achieves significantly lower memory usage compared to previous and contemporary methods. Yang et al. [80] proposed a fusion-based method named PS-NeRF, which combines the advantages of NeRF with photometric stereo methods; it aims to address the limitations of traditional photometric stereo techniques by leveraging NeRF's scene reconstruction ability, ultimately improving the resolution of the resulting mesh. Reiser et al. [68] introduced the Memory-Efficient Radiance Field (MERF) representation, which allows rapid rendering of large-scale scenes using sparse feature grids and high-resolution 2D feature planes. Li et al. [21] introduced Neuralangelo, an innovative method that combines multi-resolution 3D hash grids with neural surface rendering to recover dense 3D surface structures from multi-view images, achieving highly detailed large-scale scene reconstruction from RGB video captures.

Several methods [67, 81-85] have been proposed to extend NeRF to the dynamic domain, making it possible to reconstruct and render objects undergoing rigid and non-rigid motion from a single camera moving through the scene. For example, Yan et al. [84] introduced the surface-aware dynamic NeRF (NeRF-DS) with a mask-guided deformation field. NeRF-DS improves the representation of the complex reflection properties of specular surfaces by incorporating surface position and orientation as conditioning factors in the neural radiance field function. In addition, using masks to guide the deformation field enables NeRF-DS to effectively handle the large deformations and occlusions that occur during object motion.

To improve the accuracy of 3D reconstruction in the presence of noise, especially for smooth and textureless surfaces, some studies incorporate various priors into the optimization process. These priors include semantic similarity [86], depth smoothness [60], surface smoothness [87, 88], the Manhattan-world hypothesis [89] and monocular geometric priors [90]. In contrast, the NoPe-NeRF method proposed by Bian et al. [91] uses monocular depth maps to constrain the relative poses between frames and regularize the geometry of NeRF; this enables better pose estimation, thereby improving the quality of novel view synthesis and geometric reconstruction. Rakotosaona et al. [92] introduced a novel and versatile architecture for 3D surface reconstruction that efficiently distills volumetric representations from NeRF-driven methods into a signed surface approximation network; this approach extracts accurate 3D meshes and appearance while maintaining real-time rendering capabilities across a variety of devices. Elsner et al. [93] proposed adaptive Voronoi NeRFs, a technique that improves efficiency by partitioning the scene into cells using Voronoi diagrams; these cells are then subdivided to effectively capture and represent complex details, improving performance and accuracy. Similarly, Kulhanek and Sattler [94] introduced Tetra-NeRF, a new radiance field representation that successfully adapts to 3D geometric priors given in the form of sparse point clouds to recover more details. However, it is worth noting that the quality of the rendered scene may vary with the density of the point cloud in different areas.

Some works aim to reduce the number of input images [60, 70, 78, 86, 90, 95]. Yu et al. [95] proposed an architecture that conditions NeRF on image inputs in a fully convolutional manner, enabling the network to learn a scene prior by training across multiple scenes. This allows it to perform feed-forward novel view synthesis from a small number (or even just one) of views. Similarly, Niemeyer et al. [60] introduced a method that samples unseen views and regularizes the appearance and geometry of patches rendered from these views. Jain et al. [86] proposed DietNeRF, which improves few-shot quality with an auxiliary semantic consistency loss, thereby enhancing realistic rendering at novel poses. DietNeRF is trained on individual scenes to accurately render the input images from the same poses and to match high-level semantic features across different random poses.

In the field of cultural heritage, only a few publications explicitly investigate and recognize the potential of NeRF for 3D reconstruction, digital preservation and conservation purposes [96,97].

3. Analysis and Evaluation Methodology

The main goal is to critically evaluate NeRF-based methods against traditional photogrammetry by objectively measuring the quality of the resulting 3D data. To achieve this, a variety of objects and scenes with different sizes and surface characteristics are considered, including textured, untextured, metallic, translucent and transparent ones (Section 3.3). The proposed evaluation strategies and metrics (Sections 3.1 and 3.2) should help researchers understand the strengths and limitations of each method and can be used to quantitatively evaluate newly proposed methods. All experiments are based on the SDFStudio [98] and Nerfstudio [22] frameworks. It is worth recalling that the NeRF output is a neural rendering; therefore, a mesh model is created from the per-view depth maps using the Marching Cubes method [23]. The Open3D library is then used to extract point clouds from the mesh vertices for the quantitative evaluation [78].
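A minimal Open3D sketch of this last step (file names are placeholders): the vertices of the Marching Cubes mesh are reused directly as the point cloud for the quantitative comparisons.

```python
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("nerf_mesh.ply")   # mesh from Marching Cubes
pcd = o3d.geometry.PointCloud()
pcd.points = mesh.vertices                          # vertices become evaluation samples
o3d.io.write_point_cloud("nerf_points.ply", pcd)
```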

3.1. Proposed Methodology

First, the various NeRF methods available in dedicated frameworks [22, 98] are applied to two datasets to understand their performance and select the best-performing method (Section 4.1). This method is then applied to the other datasets to run evaluations and comparisons (Sections 4.2-4.7) against traditional photogrammetry and available ground truth (GT) data.
Figure 2 shows a general overview of the proposed procedure for quantitatively evaluating the performance of NeRF-based 3D reconstruction. All collected images or videos require camera poses to generate 3D reconstructions with either traditional photogrammetry or NeRF-based methods. Starting from the available images, Colmap is used to retrieve the camera poses. Then, multi-view stereo (MVS) or NeRF is applied to generate the 3D data. Finally, a consistent environment and set of conditions are provided for objective geometric comparisons. To this end, the photogrammetric and NeRF-generated 3D data are co-registered and rescaled (using the Iterative Closest Point (ICP) algorithm [99]) with respect to the ground truth (GT) data in CloudCompare, and a quality assessment is performed. To provide an unbiased assessment of geometric accuracy, different well-known criteria are applied [13, 43, 100-102], including best-fitting plane analysis, cloud-to-cloud comparison, profile analysis, and accuracy and completeness analysis. The first two criteria use standard metrics such as standard deviation (STD), mean error (Mean_E), root mean square error (RMSE) and mean absolute error (MAE) (Section 3.2).
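The paper performs the co-registration in CloudCompare; the sketch below shows the equivalent operation in Open3D, assuming the clouds are already roughly pre-aligned and using point-to-point ICP with uniform scale estimation (the correspondence radius is a placeholder).

```python
import open3d as o3d

source = o3d.io.read_point_cloud("nerf_points.ply")  # NeRF or photogrammetry result
target = o3d.io.read_point_cloud("gt_points.ply")    # ground truth (GT)

# Scale-aware ICP: image-based reconstructions are defined up to an arbitrary
# scale, so rotation, translation and scale are estimated jointly.
result = o3d.pipelines.registration.registration_icp(
    source, target,
    max_correspondence_distance=0.05,                # assumed search radius
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(
        with_scaling=True),
)
source.transform(result.transformation)              # co-registered, rescaled cloud
```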

Figure 2. Overview of the proposed procedure to evaluate the performance of NeRF-based 3D reconstructions relative to traditional photogrammetry.

Best-fitting plane analysis is accomplished using a least-squares fitting (LSF) algorithm, which defines the best-fitting plane over an object area that is assumed to be planar. This criterion allows the noise level in the 3D data generated by photogrammetry or NeRF methods to be evaluated.
Profile analysis is performed by extracting cross-sections from the 3D data to highlight the fine geometric details of the reconstructed surface. Examining the profiles allows us to evaluate how well each method preserves geometric details such as edges and corners and avoids smoothing effects.
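A compact sketch of the LSF criterion (illustrative, not the authors' exact implementation): the plane is fitted by SVD and the dispersion of the signed residuals serves as the noise estimate.

```python
import numpy as np

def fit_plane_lsf(points):
    """Least-squares plane fit over a nominally planar region.

    points : (N,3) array of 3D samples; returns unit normal, centroid, residuals.
    """
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    residuals = (points - centroid) @ normal     # signed point-to-plane distances
    return normal, centroid, residuals

# Noise level of the reconstructed surface:
#   STD = residuals.std(),  RMSE = np.sqrt((residuals**2).mean())
```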
Cloud-to-cloud (C2C) comparison refers to measuring the nearest neighbor distance between corresponding points in two point clouds.

3.2. Metrics

Despite the increasing popularity and widespread application of NeRF for 3D reconstruction purposes, there is still a lack of quality assessment based on specified standards or guidelines (e.g., VDI/VDE 2634 Part 3). Following the previously mentioned co-registration process and criteria, the following metrics are used (in particular for the cloud-to-cloud and plane-fitting procedures):

$$\mathrm{Mean\_E} = \frac{1}{N}\sum_{j=1}^{N} X_j, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert X_j \rvert$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N} X_j^{2}}, \qquad \mathrm{STD} = \sqrt{\frac{1}{N}\sum_{j=1}^{N} \left(X_j - \bar{X}\right)^{2}}$$
where N is the number of points in the compared point cloud, Xj is the distance from each point to the nearest corresponding reference point or surface, and X̄ is the mean of the observed distances.
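Written out as a small NumPy helper over the nearest-neighbour distances X_j (names are illustrative):

```python
import numpy as np

def c2c_metrics(distances):
    """Summary statistics over nearest-neighbour distances X_j (Section 3.2)."""
    d = np.asarray(distances)
    return {
        "Mean_E": d.mean(),                  # mean error
        "MAE":    np.abs(d).mean(),          # mean absolute error
        "RMSE":   np.sqrt((d ** 2).mean()),  # root mean square error
        "STD":    d.std(),                   # dispersion around the mean
    }
```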
Accuracy and completeness, also known as precision and recall respectively [101, 102], involve measuring the distance between two models. When evaluating accuracy, distances are calculated from the computed data to the ground truth (GT). Conversely, to assess completeness, the distance from the GT to the computed data is calculated. These distances can be signed or unsigned, depending on the specific evaluation method. Accuracy reflects how closely the reconstructed points match the ground truth, while completeness indicates the degree to which all GT points are covered. Typically, a threshold distance is employed to determine the fraction or percentage of points that fall within an acceptable limit; the threshold is chosen based on factors such as data density and noise level.
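A hedged Open3D sketch of these two measures: distances are computed in both directions and thresholded, as described above.

```python
import numpy as np
import open3d as o3d

def accuracy_completeness(recon, gt, threshold):
    """Fraction of points within `threshold` (precision and recall)."""
    d_acc = np.asarray(recon.compute_point_cloud_distance(gt))   # recon -> GT
    d_comp = np.asarray(gt.compute_point_cloud_distance(recon))  # GT -> recon
    return (d_acc < threshold).mean(), (d_comp < threshold).mean()
```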

3.3. Testing Objects

To achieve the goals of this work, different datasets were used (Figure 3): they contain objects of different sizes and surface types, captured under different lighting conditions, materials, camera networks, scales and resolutions.
Figure 3. A set of objects with different surface characteristics used to evaluate the NeRF method.

The Ignatius and Truck datasets come from the Tanks and Temples benchmark [101], where GT data (acquired with laser scanning) are also available. The other datasets (Stair, Synthetic, Industrial, Bottle_1 and Bottle_2) were created at FBK. The Stair dataset provides flat, reflective and well-textured surfaces with sharp edges; the GT is given by the ideal planes of the step surfaces. The Synthetic 3D object, created using Blender v3.2.2 (for the geometry, UV textures and materials) and Quixel Mixer v2022 (for the PBR textures), has well-textured surfaces with complex geometry including edges and corners. A virtual camera with specific parameters (focal length: 50 mm; sensor size: 36 mm; image size: 1920 × 1080 pixels) was used to create a sequence of images along a spiral path around the object, and the 3D model generated in Blender was used as GT for the accuracy evaluation. The Industrial object has an untextured and highly reflective metallic surface, which poses problems for all passive 3D methods; its GT data was collected with a Hexagon/AICON Primescan active scanner with a nominal accuracy of 63 μm. Finally, two bottles with transparent and refractive surfaces are included: their GT data were generated using photogrammetry after powdering/spraying the surfaces.
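For illustration only, one simple way to generate such a spiral acquisition path (camera centres circling and rising around an object at the origin); all parameter values below are assumptions, not those used by the authors.

```python
import numpy as np

def spiral_camera_centers(n_views=200, radius=3.0, height=1.5, turns=2.0):
    """Camera centres on a spiral path, each paired with a look-at rotation."""
    t = np.linspace(0.0, 1.0, n_views)
    ang = 2.0 * np.pi * turns * t
    centers = np.stack([radius * np.cos(ang),
                        radius * np.sin(ang),
                        height * t], axis=1)
    poses = []
    for c in centers:
        fwd = -c / np.linalg.norm(c)                   # view direction: towards origin
        right = np.cross(fwd, [0.0, 0.0, 1.0])         # assumes fwd not parallel to z
        right /= np.linalg.norm(right)
        up = np.cross(right, fwd)
        # 3x4 camera-to-world matrix (columns: right, up, backward, centre)
        poses.append(np.stack([right, up, -fwd, c], axis=1))
    return poses
```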

The authors are preparing a specific benchmark of the NeRF method and will make it available at https://github.com/3DOM-FBK/NeRFBK [103], which contains more datasets with real data.

4. Comparisons and Analyses

This section presents experiments to evaluate and compare the performance of NeRF-based techniques with standard photogrammetry (Colmap). After comparing multiple state-of-the-art methods (Section 4.1), Instant-NGP was selected as the NeRF-based method for the comprehensive evaluation, as it provides better results than the other methods. NeRF training was performed on an Nvidia A40 GPU, while the geometric comparison of the 3D results was performed on a standard PC.

4.1. State-of-the-Art Comparison

The main goal is to conduct a comprehensive analysis of multiple NeRF-based methods. To achieve this, the unified SDFStudio framework developed by Yu et al. [98] is used, as it combines multiple neural implicit surface reconstruction methods in a single framework. SDFStudio is built on the Nerfstudio framework [22]. Among the implemented methods, ten were selected for the performance comparison: Nerfacto and TensoRF from Nerfstudio; Mono-Neus, Neus-facto, MonoSDF, VolSDF, NeuS, MonoUnisurf and UniSurf from SDFStudio; and the original implementation of Instant-NGP from Müller et al. [20].

Two datasets are used: (i) the Synthetic dataset, consisting of 200 images (1920 × 1080 px), and (ii) the Ignatius dataset [101], which contains 263 images (1920 × 1080 px frames extracted from a video).

The comparison results against the GT data are shown in Figure 4. In terms of RMSE, MAE and STD, the Instant-NGP and Nerfacto methods achieved the best results, outperforming all other methods. In terms of processing time, training on each dataset takes less than a minute with Instant-NGP and about 15 minutes with Nerfacto. It is worth noting that for the Ignatius sequence (Figure 4b), although the neural renderings of MonoSDF, VolSDF and Neus-facto are visually satisfactory, the Marching Cubes extraction of the mesh model fails; therefore, their evaluation was not possible.

Therefore, based on the achieved accuracy and processing time, this paper selects and uses Instant-NGP for subsequent experiments.


Figure 4. Comparison results of various NeRF-based methods on Synthetic (a) and Ignatius (b) datasets, containing 200 and 263 images respectively.

4.2. Image Baseline’s Evaluation

This section reports the evaluation of NeRF-based methods as the number of input images decreases (i.e., the image baseline increases). A comparative evaluation was conducted between Instant-NGP, considered the best method in Section 4.1, and Mono-Neus, a well-established method for sparse-image scenarios [66, 90]. The experiment uses the Synthetic dataset with four subsets of input images, ranging from 200 down to 20 images (Figure 5), progressively reducing the number of input images (i.e., approximately doubling the image baseline at each step). For each subset of input images, both NeRF methods are used to generate 3D results, maintaining a similar number of epochs. For each subset, the RMSE is estimated through a point-to-point comparison with the GT data, as shown in Figure 5. The results show that Instant-NGP performs better than Mono-Neus when a large number of input images is available, whereas in scenarios with few images Mono-Neus outperforms Instant-NGP. It is worth noting, however, that neither Instant-NGP nor Mono-Neus could successfully generate a 3D reconstruction using only 10 input images.

Figure 5. Comparative performance evaluation of Instant-NGP and Mono-Neus on subsets of the Synthetic dataset.

4.3. Monte Carlo Simulation

The purpose is to evaluate the quality of NeRF-based 3D results when the camera poses are perturbed. Monte Carlo simulation [104] is employed to randomly perturb the rotations and translations of the camera parameters within a limited range. After each perturbation, Instant-NGP was used to generate a 3D reconstruction, which was compared with the reference data. A total of 30 iterations (runs) were performed for each of two scenarios: (A) translations randomly perturbed within ±20 mm and rotations within ±2 degrees; (B) translations perturbed within ±40 mm and rotations within ±4 degrees. The Ignatius dataset was used for this simulation, and the results are shown in Figure 6 and Table 1. The results clearly demonstrate the importance of accurate camera parameters. In scenario A, the average estimated RMSE is 19.72 mm with an uncertainty of 2.95 mm. In scenario B, the average estimated RMSE remains almost unchanged (19.97 mm), while the uncertainty doubles (5.87 mm) due to the larger perturbation range.
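A minimal sketch of one perturbation draw using SciPy rotations; the limits shown correspond to scenario A and are simply widened for scenario B (this mirrors the described setup, not the authors' exact code).

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

def perturb_pose(rot_matrix, translation, max_t_mm=20.0, max_r_deg=2.0):
    """One Monte Carlo draw: uniform perturbation of a camera pose (scenario A)."""
    dt = rng.uniform(-max_t_mm, max_t_mm, size=3) / 1000.0        # mm -> m
    dr = R.from_euler("xyz", rng.uniform(-max_r_deg, max_r_deg, size=3), degrees=True)
    return (dr * R.from_matrix(rot_matrix)).as_matrix(), translation + dt

# Scenario B: perturb_pose(..., max_t_mm=40.0, max_r_deg=4.0).
# Repeat 30 runs, retrain Instant-NGP each time, and log the RMSE against the GT.
```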


Figure 6. Monte Carlo simulation results of perturbed camera parameters. Table 1 reports summary statistics.


Table 1. Summary of Monte Carlo simulation results on the Ignatius dataset. The error margin is the difference between the maximum and minimum RMSE, and the uncertainty is calculated as half the error margin.

4.4. Plane Fitting

Plane fitting can be used to evaluate the noise level on reconstructed flat surfaces. In a first experiment using the Stair dataset (Figure 7a), the photogrammetric point cloud and the NeRF-based reconstruction were derived using the same number of images and camera poses. Two horizontal and three vertical planes were identified and analyzed with the best-fitting procedure (Figure 7b). The derived metrics are reported in Table 2.

Figure 7. Images of the Stair dataset (a) and the horizontal and vertical planes (b) used to evaluate the noise level of the photogrammetry and NeRF 3D reconstructions.

Table 2. Evaluation of the noise level of the 3D surfaces [unit: mm] for the Stair dataset processed with photogrammetry and Instant-NGP.

In a similar manner, the Synthetic dataset was used, with 200 images for Instant-NGP and 24 images for the photogrammetric processing. Five vertical and five horizontal planes were selected (Figure 8), and the surface deviation analysis was performed by fitting ideal planes to the reconstructed object surfaces. Table 3 reports the derived metrics.

Figure 8. Synthetic object with some horizontal and vertical planes used for evaluation.

Table 3. Evaluation metrics of photogrammetry and NeRF results of synthetic objects [unit: mm].

From these results (Tables 2 and 3), it is clear that for both objects photogrammetry outperforms NeRF, yielding less noisy results. In general, the NeRF RMSE is at least 2-3 times higher than that of photogrammetry.

4.5. Profiling

The extraction of cross-sectional profiles helps demonstrate the ability of a 3D reconstruction method to retrieve geometric details rather than smoothing the 3D geometry. The results on the Synthetic dataset presented in Section 4.4 are processed in CloudCompare: multiple cross-sections are extracted at predefined distances (Figure 9) and geometrically compared with the reference data using different metrics, as shown in Table 4.


Figure 9. Close-up views of the meshes generated from GT (a), photogrammetry (b) and NeRF (c). Positions of the profiles on the synthetic object (d). Example profiles for the reference 3D data (black line), photogrammetry (red line) and NeRF (blue line).

Table 4. Comparison of the profiles and derived metrics [unit: mm] - see Figure 9.

The results obtained for the individual cross-sectional profiles, as well as the average over all profiles, show that photogrammetry outperforms NeRF, which generally produces noisier results (Figure 9a-c). For example, the average RMSE and STD estimates for photogrammetry are about 0.09 mm and 0.08 mm, while the values for NeRF are greater than 0.13 mm.

4.6. Cloud-to-Cloud Comparison

A cloud-to-cloud comparison evaluates the Euclidean distances between corresponding 3D samples in a dataset and the reference data. Different objects with different characteristics are considered (Figure 3): Ignatius, Truck, Industrial and Synthetic. These include small and large objects with textureless or shiny metallic surfaces. For each dataset, 3D data were generated using photogrammetry (Colmap) and Instant-NGP and then co-registered to the available GT (Figure 10). The derived metrics are reported in Table 5. It is worth noting that the number of images used in the tests is not always the same: for the Synthetic, Ignatius and Truck datasets, photogrammetry already provides accurate results with a low number of images, so adding more images does not lead to further improvement. For NeRF, on the other hand, all available images were used, since fewer images (i.e., larger baselines) did not lead to good results (see also Section 4.2).

Figure 10. Color-coded cloud-to-cloud comparison of Instant-NGP and photogrammetry methods relative to ground truth data [unit: mm]

Table 5. Cloud-to-cloud comparison metrics for Instant-NGP and photogrammetry [unit: mm]. For all objects except the Industrial one, photogrammetry uses a smaller number of images, as the achieved accuracy is already better than NeRF's.

From the presented results, it can be seen that for metallic and highly reflective objects (Industrial dataset) NeRF performs better than photogrammetry, while for the other scenes photogrammetry yields more accurate results.
Two further translucent and transparent objects are considered: Bottle_1 and Bottle_2 (Figure 3). Glass objects do not diffusely reflect incident light and have no texture of their own for photogrammetric 3D reconstruction tasks; their appearance depends on the shape of the object, the surrounding background and the lighting conditions. Therefore, photogrammetry can easily fail or produce very noisy results in this case. NeRF, on the other hand, as claimed by Mildenhall et al. [16], can learn to correctly generate transparency-dependent geometry thanks to the view dependence of the NeRF model. For both objects, the photogrammetry- and NeRF-based 3D results were co-registered to the GT data and the metrics were calculated (Figure 11 and Table 6). The results demonstrate that NeRF outperforms photogrammetry on transparent objects. For example, the estimated RMSE, STD and MAE for photogrammetry on Bottle_1 are 6.5 mm, 7.1 mm and 7.5 mm respectively, whereas the NeRF values drop significantly to 1.3 mm, 1.7 mm and 2.1 mm.

Figure 11. Cloud-to-cloud comparison of Instant-NGP and photogrammetry on two transparent objects [unit: mm].

Table 6. Cloud-to-cloud comparison statistics for the transparent objects [unit: mm].

4.7. Accuracy and Completeness

Three different datasets are used to compare the accuracy and completeness of photogrammetry and NeRF: Ignatius, Industrial and Bottle_1. For both NeRF (Instant-NGP) and photogrammetry, the two metrics are calculated with respect to the available ground truth data. The results shown in Figure 12 reveal the following insights: (i) on the Ignatius dataset, photogrammetry shows higher accuracy and completeness than NeRF; (ii) on the Industrial and Bottle_1 datasets, NeRF shows slightly better results. These findings quantitatively confirm Section 4.6: NeRF-based methods perform well on objects with non-cooperative surfaces, especially transparent or shiny ones, whereas photogrammetry struggles to capture the intricate details of such surfaces, making NeRF a more suitable or complementary option.

Figure 12. Accuracy and completeness of NeRF and photogrammetry on three different objects.

5. Conclusions

This paper presented a comprehensive analysis of image-based 3D reconstruction using Neural Radiance Field (NeRF) methods, comparing them with traditional photogrammetry and reporting quantitative and visual results to understand their respective advantages and disadvantages on many types of surfaces and scenes. The study objectively evaluates the pros and cons of NeRF-generated 3D data and provides insights into its applicability in different real-life scenarios and applications. A range of textured, untextured, metallic, translucent and transparent objects was employed, imaged at different scales and with different image sets. Various evaluation methods and metrics were used to assess the quality of the generated NeRF-based 3D data, including noise level, surface deviation, geometric accuracy and completeness.

The reported results show that NeRF outperforms photogrammetry in situations where traditional photogrammetric methods fail or produce noisy results, such as on textureless, metallic, highly reflective and transparent objects. This is because NeRF-based methods are able to generate reflectance- and transparency-dependent geometry thanks to the view dependence of the NeRF model. In contrast, photogrammetry still performs better on well-textured and partially textured objects.

This study provides valuable insights into the applicability of NeRF in different real-life scenarios, especially heritage and industrial scenarios where surfaces are particularly challenging. More datasets are in preparation and will be shared soon at https://github.com/3DOM-FBK/NeRFBK [103]. The results of this study highlight the potential and limitations of NeRF and photogrammetry, laying the foundation for upcoming research in this area. Future research could explore the combination of NeRF and photogrammetry to improve the quality and efficiency of 3D reconstructions in challenging scenes.


Remarks:

  1. In terms of overall efficiency and quality, Instant-NGP is indeed very representative. As far as pure reconstruction quality is concerned, however, more advanced work has been published recently and implemented in SDFStudio (e.g., BakedSDF, Neuralangelo), although these methods are less efficient and relatively time-consuming.
  2. There are quite a few NeRF variants. According to my experimental observations, the results of different methods still differ considerably.
  3. It is a pity that no comparison on large scenes is included.
  4. The overall comparison procedure and evaluation metrics are classic and worth learning from.
  5. Since there is no comparison with the current state-of-the-art methods, the conclusions should be taken as reference only.


Origin blog.csdn.net/m0_50910915/article/details/131954972