Intensive Reading of the SIMBAR Paper

SIMBAR: Single Image-Based Scene Relighting for Effective Data Augmentation for Automated Driving Vision Tasks


Single IMage-BAsed scene Relighting

Summary

Real-world autonomous driving datasets are aggregated from data collected by different drivers on the road. Being able to relight captured scenes to unseen lighting conditions in a controllable manner presents an opportunity to enrich datasets with the variety of lighting conditions encountered in the real world. This paper proposes a new image-based relighting pipeline, SIMBAR, which can take a single image as input. To the best of our knowledge, there is no prior work that exploits explicit geometric representations from a single image for scene relighting. We present a qualitative comparison with previous multi-view scene relighting baselines. To further validate and effectively quantify the benefit of using SIMBAR for data augmentation in autonomous driving vision tasks, CenterTrack trained on SIMBAR-augmented KITTI achieves 93.3% Multi-Object Tracking Accuracy (MOTA), compared to a baseline MOTA of 85.6% for CenterTrack trained on the original KITTI, a relative improvement of 9.0%; both models are trained from scratch and tested on Virtual KITTI. For more details and datasets, please visit our project website ( https://simbarv1.github.io ).

1 Introduction

Lack of diversity in lighting conditions is a known problem with manually collected real-world autonomous driving datasets [1, 3, 15, 19]. For example, KITTI [19] only captures video sequences at noon, and the lighting and shadow conditions are similar across sequences. Newer datasets [33, 39, 61], such as BDD100K [61], offer relatively better diversity and capture images at multiple times of the day. Nonetheless, there is little variation in lighting conditions between images collected from the same drive. Furthermore, acquiring data for all types of lighting conditions is prohibitive in terms of time and cost.

This variability in lighting conditions, along with the presence of shadows in a scene, is often a key barrier to the successful deployment of perception models in safety-critical autonomous driving applications. Models trained under limited lighting conditions fail to generalize to the plethora of lighting conditions encountered in the real world [27, 29]. The ability to relight existing datasets in a controlled manner provides an opportunity to develop improved perception models.

However, scene relighting is an extremely difficult vision task without a depth sensor. It implicitly consists of three main subtasks: shadow detection [10, 26, 55], removal [24, 25, 55] and insertion [63]. Among them, shadow removal and insertion are the most challenging, because shadows are tightly coupled with the geometry of the objects that cast them [2, 16]. This coupling makes it difficult to separate shadows from their parent objects without a strong 3D geometric understanding of the scene [4, 8, 21]. To address this issue, most existing scene relighting methods rely on multiple camera views under the source lighting conditions to estimate 3D scene geometry [44, 51, 64]. The relatively few existing methods that can operate on a single image are based on generative adversarial networks (GANs) [6]. GANs are known to be difficult to train [32, 40], offer limited controllability [54], and often produce results that are physically inconsistent with the scene geometry [17]. To the best of our knowledge, there is no prior work on controllable scene relighting that exploits explicit geometry from a single input image.

Figure 1. Input images (left) shown next to SIMBAR relighting outputs (middle, right). SIMBAR synthesizes two lighting variants each for (a), (b) Div2k, (c) BDD100K and (d) KITTI.


This paper proposes a novel, single image-based scene relighting pipeline, SIMBAR. It takes one image as input and generates relit versions for various sun positions and sky zeniths, as shown in Figure 1. The first two rows show relit results on Div2k [1]. Div2k is an internet-scraped dataset containing images of a wide variety of object classes, which SIMBAR is able to relight effectively. The first row shows realistic variations in sky color, shadow color, consistently cast shadow positions, and light intensity for an outdoor scene with complex structures. The second row is a challenging low-light desert scene. SIMBAR cleanly removes the existing hard cast shadows of the rocks in the foreground and realistically recasts geometrically consistent shadows for the specified sun angles. In addition, the mountainous landscape on the horizon is effectively preserved. The third and fourth rows show geometrically consistent and visually realistic relit versions of a BDD100K tunnel/underpass scene and a KITTI road driving scene, respectively. Most notable are the hard shadows of the tunnel in the BDD100K example and the shadow variations of the two cars in the KITTI example.

SIMBAR consists of two main modules: (i) geometry estimation and (ii) image relighting. The geometry estimation module is responsible for computing scene mesh proxies and lighting buffers. Inspired by WorldSheet [23], we use an external deep network to obtain the scene mesh. Note that WorldSheet is a novel view synthesis pipeline and was not designed for relighting. The image relighting module is inspired by previous multi-view scene relighting work using geometry-aware networks [44], MVR for short. Section 3.1 gives a brief overview of single image-based scene geometry estimation and MVR, followed by a detailed description of SIMBAR's pipeline in Section 3.2. Our work is closest to MVR in terms of objectives and overall pipeline structure. Therefore, scene relighting comparisons are performed against out-of-the-box MVR and an improved version, MVR-I; our improvements to MVR for view-limited autonomous driving datasets are described in Section 3.3. In summary, even though SIMBAR takes only a single image as input, it produces more realistic and geometrically consistent relit images than MVR/MVR-I, which take multiple images of the same scene as input.

Another major limitation of all existing work on scene relighting is the lack of quantitative evaluation of scene relighting for augmenting vision datasets. In the absence of such a metric, it is impossible to determine the true applicability and usefulness of any scene relighting method. To address this issue, in Section 4 we conduct data augmentation experiments based on image relighting using the state-of-the-art joint object detection and tracking network CenterTrack [66]. Section 4.1 provides a detailed overview of our experimental setup. We train three different CenterTrack models on: (i) the original KITTI tracking dataset, which contains 21 real-world sequences captured at noon; (ii) KITTI augmented with MVR-I relit sequences; and (iii) KITTI augmented with SIMBAR relit sequences. All models are tested on Virtual KITTI (vKITTI) [18], which consists of clones of the real KITTI sequences under various lighting conditions. Section 4.2 shows that the CenterTrack models augmented with relit KITTI images (from MVR-I or SIMBAR) significantly outperform the baseline CenterTrack. Specifically, the CenterTrack model trained on KITTI augmented with SIMBAR achieves the highest Multi-Object Tracking Accuracy (MOTA) of 93.3%, a relative improvement of 9.0% over the baseline MOTA of 85.6%. The same model also achieves the highest Multi-Object Detection Accuracy (MODA) of 94.1%, again an 8.9% relative improvement over the baseline MODA of 86.4%.
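As a quick check, these relative improvements are computed against the baseline scores rather than as absolute percentage-point differences:

(93.3 − 85.6) / 85.6 ≈ 9.0% (MOTA), and (94.1 − 86.4) / 86.4 ≈ 8.9% (MODA).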

In summary, the main contributions of this paper are as follows:

  1. A new single-view image-based scene relighting pipeline, called SIMBAR, which provides controllable lighting manipulation without the need for multi-view images.

  1. Single image-based scene geometry estimation by adapting a Dense Prediction Transformer (DPT) monocular depth model and better representing distant background objects.

  1. An improved version of MVR [44], called MVR-I, with fewer artifacts and smoother surfaces in the generated mesh for road driving scenarios with limited views, resulting in more realistic relit images.

  1. Qualitative evaluation and comparison of scene relighting results using MVR, MVR-I and SIMBAR on several autonomous driving datasets such as KITTI [19] and BDD100K [61].

  1. Quantitative evaluation of the effectiveness of augmenting the popular KITTI 2D tracking dataset with SIMBAR and MVR-I relit images for joint object detection and tracking with CenterTrack.

2. Related work

Our work is closely related to the fields of novel view synthesis [36, 49, 56], 3D reconstruction [9, 58, 59] and physically-based differentiable rendering [30, 43]. Given the direct connection between the relighting task and scene geometry [12, 62, 67], we group related work into two broad categories: (i) implicit methods that learn and encode geometric priors into models; and (ii) explicit methods that generate 3D meshes from multiple views of an input scene and then apply rendering and image processing techniques. While explicit methods provide better controllability and geometrically consistent shadows, their multi-view prerequisite limits their applicability to most autonomous driving datasets. This is due to the unique challenges of the limited field of view of forward-facing automotive cameras and the high scene complexity of constantly moving cars and pedestrians. Our work falls within the explicit category while leveraging insights from implicit methods.

2.1 Using Implicit Geometric Representations

Both Generative Adversarial Networks (GANs) [22] and Neural Radiance Fields (NeRFs) [37] have been explored for scene relighting. As is typical for GANs, the shadow manipulation network in [6] suffers from geometric inconsistency and is hard to train, leading to conservative relighting effects. This also applies to GANs [11, 17] that focus on image-to-image translation while ignoring geometric priors. The recent success of NeRF-based novel view synthesis has naturally led to its application to scene relighting tasks. Instead of querying explicit scene geometry, a NeRF encodes the scene into a multi-layer perceptron (MLP) [35], which takes a viewing direction and position as input and outputs color and density values, which are then used for volume rendering [41, 42]. At training time, many different views of a static scene are fed to the network to learn the scene geometry. At test time, the input viewing direction and position are used to render the scene with accurate lighting and shadows. Recent work repurposes NeRF for scene relighting by modeling surface materials and reflection properties [5, 51, 64]. However, these methods face significant computational hurdles when applied to autonomous driving datasets with dynamic scenes, since each scene requires training a separate model.
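To make the NeRF formulation above concrete, here is a minimal PyTorch sketch of such an MLP. Positional encoding and the volume rendering step are omitted, and the architecture is an illustrative simplification rather than any specific paper's network:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: (3D position, view direction) -> (RGB color, density)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)      # density depends on position only
        self.color_head = nn.Sequential(              # color also depends on view direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, position, direction):
        h = self.trunk(position)
        density = torch.relu(self.density_head(h))
        color = self.color_head(torch.cat([h, direction], dim=-1))
        return color, density  # per-sample values consumed by volume rendering

# Query 1024 sample points along camera rays.
model = TinyNeRF()
color, density = model(torch.rand(1024, 3), torch.rand(1024, 3))
```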

2.2 Using Explicit Geometric Representations

Combining structure from motion with multi-view stereo (SFM+MVS) is a common way to model scene geometry. It relies on feature matching between images captured from different views of a single scene of interest. After applying SFM+MVS, bundle adjustment [53] can be used to generate 3D point clouds, as is done in COLMAP [47, 48]. Point clouds allow the application of traditional mesh reconstruction techniques, such as Delaunay [7] or Poisson [31] reconstruction, to generate an explicit geometric representation of the scene. Vision tasks that exploit geometric priors, such as novel view synthesis, can leverage this explicit scene representation [46, 60]. Meshes can also be applied to scene relighting tasks, as described in [44]. In that work, physically-based rendering is used to approximate shadow locations using the generated mesh, and an additional network is used for shadow refinement. The relighting results are realistic and geometrically consistent. However, this approach is severely limited in its applicability to diverse datasets; for example, limited views and dynamic scenes cause mesh reconstruction to fail [28]. For relatively simple and constrained datasets such as human portraits, relighting from a single view has been successful [38, 65] due to the highly similar structure of facial data. However, this is not the case for outdoor scene datasets, which contain a much broader range of structures and content [13].

3. Scene relighting based on a single image

Our proposed pipeline, SIMBAR, models the scene as a 3D mesh to explicitly represent the scene geometry. Physically based rendering is then used together with a shadow refinement network to generate realistic shadow maps. The original image can be composited with the target shadow map to form the final relit output. This approach addresses the multi-view limitation of existing scene relighting work and generalizes across scenes.

3.1 Preliminaries

3.1.1 Single image-based scene geometry estimation

To address the multi-view limitation of SFM+MVS based mesh reconstruction, we use external depth predictions for scene geometry estimation, inspired by WorldSheet [23], to perform single image-based mesh reconstruction. Note that the goals of the overall WorldSheet and SIMBAR pipelines are quite different: WorldSheet is a differentiable rendering pipeline trained end-to-end for novel view synthesis, while SIMBAR aims to manipulate existing views by casting different shadows.

For scene mesh formation, the external depth predictions are treated as ground truth, so there is no need to predict mesh offsets in the x and y directions. Let z_{w,h} be the depth prediction at the corresponding sheet coordinate (w, h), and let x_{w,h} and y_{w,h} simply be linearly spaced samples in normalized device coordinate (NDC) space over [0, 1], with the camera at the origin. Given a fixed grid sheet size of 129×129, the depth predictions are grid-sampled to account for the difference in resolution. For a field-of-view angle θF, this gives the following equation for the vertex coordinates:
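The equation itself is not reproduced in these notes. Assuming the standard pinhole unprojection used by WorldSheet, with x_{w,h}, y_{w,h} ∈ [0, 1] and the camera at the origin, a plausible form of Equation 1 for each vertex v_{w,h} is:

v_{w,h} = ( z_{w,h} · tan(θF/2) · (2·x_{w,h} − 1),  z_{w,h} · tan(θF/2) · (2·y_{w,h} − 1),  z_{w,h} )

This is a reconstruction for readability, not a verbatim copy from the paper.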

Mesh edges connecting adjacent vertices form the mesh faces [23]. The faces of the final output mesh are then smoothed with Laplacian smoothing [50].
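A small NumPy sketch of these two steps, purely for intuition; SIMBAR's actual mesh construction and smoothing live inside its rendering framework, so the function names and the uniform-weight smoothing here are illustrative assumptions:

```python
import numpy as np

def grid_faces(H, W):
    """Triangulate an H x W vertex lattice: two triangles per grid cell."""
    idx = np.arange(H * W).reshape(H, W)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    return np.concatenate([np.stack([tl, bl, tr], 1),
                           np.stack([tr, bl, br], 1)])

def laplacian_smooth(verts, faces, iterations=1, lam=0.5):
    """Uniform Laplacian smoothing: move each vertex toward the mean of its neighbors."""
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)   # undirected, deduplicated
    V = verts.astype(np.float64).copy()
    for _ in range(iterations):
        nbr_sum = np.zeros_like(V)
        nbr_cnt = np.zeros((len(V), 1))
        np.add.at(nbr_sum, edges[:, 0], V[edges[:, 1]])
        np.add.at(nbr_sum, edges[:, 1], V[edges[:, 0]])
        np.add.at(nbr_cnt, edges[:, 0], 1.0)
        np.add.at(nbr_cnt, edges[:, 1], 1.0)
        V += lam * (nbr_sum / np.maximum(nbr_cnt, 1.0) - V)
    return V

# 129 x 129 grid sheet, as in Section 3.1.1.
faces = grid_faces(129, 129)
verts = np.random.rand(129 * 129, 3)   # placeholder vertex positions from Equation 1
smoothed = laplacian_smooth(verts, faces, iterations=3)
```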

Figure 2: (a) Geometry estimation component: A single input image I is fed to a monocular depth estimation network (m). The predicted depth map D is used to form the scene mesh using the vertex coordinates in Equation 1. The resulting collection of vertices and faces forms a 3D mesh M. A set of input buffers IB is rendered using M relative to the camera pose. (b) Image relighting component: Using estimated input lighting parameters and the desired target lighting change, a source shadow map Ssrc and a target shadow map Stgt are generated. The shadow refinement networks rsrc and rtgt refine the shadow maps Ssrc and Stgt, respectively. Finally, the relighting network rout uses IB and the refined shadow maps to generate the final relit image.


3.1.2 Geometry-aware multi-view relighting

Encoding scene geometry priors and the relationship between scene geometry and lighting effects is an established way of providing strong signals to shadow removal and synthesis networks [37, 44, 64]. The image relighting networks in SIMBAR follow MVR [44], in which a set of geometric priors is used as input in addition to the source image. A set of input buffers IB is generated, consisting of a normal map, a reflectance map, and refined shadow maps. The normal map encodes the surface normal at each pixel. The reflectance map is the dot product between the surface normal and the sun direction. To obtain the refined shadow maps, a set of coarse RGB shadow maps is used as input to two shadow refinement networks, one each for the source and target lighting conditions. These coarse RGB shadow maps are created from rays cast onto the 3D mesh of the scene to generate shadow locations. For each ray that intersects the mesh and casts a shadow, let mi denote the intersection point. The coordinates of mi can be reprojected to find the corresponding 2D image pixel and its RGB value; the latter is encoded in the shadow map to create an RGB shadow map. Encoding the RGB values corresponding to the objects casting the shadows helps the shadow refinement networks correct errors introduced by the 3D mesh reconstruction, in order to produce the final refined shadow maps for the relighting network.
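As a concrete illustration of one of these buffers, here is a minimal NumPy sketch of the reflectance-map computation described above (a per-pixel dot product between surface normals and the sun direction). The function name and array shapes are assumptions for illustration; in the paper these buffers are rendered from the scene mesh:

```python
import numpy as np

def reflectance_map(normals, sun_dir):
    """Reflectance buffer: per-pixel dot product of surface normal and sun direction.

    normals: (H, W, 3) array of unit surface normals (e.g. rendered from the scene mesh M)
    sun_dir: length-3 unit vector pointing toward the sun
    Returns an (H, W) map; values near 1 face the sun, values <= 0 face away from it.
    """
    sun_dir = np.asarray(sun_dir, dtype=np.float64)
    sun_dir = sun_dir / np.linalg.norm(sun_dir)
    return np.einsum("hwc,c->hw", normals, sun_dir)

# Example: a flat ground plane (normals pointing up) lit by a sun 45 degrees above the horizon.
normals = np.zeros((4, 4, 3)); normals[..., 2] = 1.0
print(reflectance_map(normals, sun_dir=[0.0, -np.sqrt(0.5), np.sqrt(0.5)]))
```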

To complete the relighting process, a third network is used in conjunction with the shadow refinement networks; all of them are pre-trained on synthetically rendered data. Given the input image and the RGB shadow maps for the source and target lighting conditions, the source and target shadow refinement networks attempt to refine the shadow maps to correct errors from mesh construction. This is followed by the final relighting network, which takes both the scene priors and the refined shadow maps to generate the relit output.

3.2 Method: SIMBAR

Most existing scene relighting methods [44, 51, 64] require multiple images with different viewpoints. In contrast, SIMBAR leverages monocular depth estimation to obtain a geometric approximation. SIMBAR is modular, with two distinct components: geometry estimation and image relighting. The full pipeline is shown in Figure 2. The geometry estimation module (a) represents the scene as a 3D mesh, which allows a variety of informative priors to be generated for the image relighting module (b). This enables a novel system design for single image-based scene relighting that exploits an explicit geometric scene representation.

3.2.1 Geometry estimation component

The geometry estimation module in SIMBAR generates a 3D scene mesh M from a single input image I, as shown in Figure 2. This is in direct contrast to MVR, which relies on SFM+MVS [47, 48] for multi-view scene reconstruction. The steps taken to generate the mesh M from a single image I are inspired by WorldSheet (see Section 3.1.1), but with additional modifications to the mesh reconstruction.

In SIMBAR, an external pre-trained monocular depth estimation network is used to provide the depth information for generating the scene mesh. This is because outdoor driving scenes yield higher quality meshes with the WorldSheet variant that uses external depth predictions, rather than the full end-to-end pipeline that predicts both depth and mesh offsets. This observation makes sense because in WorldSheet's end-to-end training mode there is no direct loss on the mesh M; supervision is obtained only through the rendering loss on the final synthesized image. The predicted mesh offsets may therefore be less geometrically accurate than those obtained using an external depth network. In addition, we adapt a newer monocular depth backbone to improve the scene geometry estimation used for relighting.

Figure 3. (a) With MiDaS v2.1, the 3D scene mesh misses details, so that (b) no prominent shadows are created. (c) Our improvement with DPT-Hybrid leverages dense vision transformers to capture the distant car objects, (d) creating realistic shadows.


Improved monocular depth estimation:

While WorldSheet uses MiDaS v2.1 as its external depth backbone, we have experimented with the Dense Prediction Transformer (DPT) monocular depth models [45]. Figure 3 shows that with MiDaS v2.1 depth predictions, the generated mesh M misses the distant car objects and therefore fails to encode structural details that could cast shadows. This is especially evident in the KITTI scene in the top row, where the distant car objects are not well captured. To address this limitation, we find that using the improved dense vision transformer in DPT-Hybrid-KITTI (fine-tuned on KITTI) helps generate more detailed meshes.
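The snippet below is a minimal sketch of obtaining monocular inverse-depth predictions from a DPT model via the MiDaS torch.hub entry point. It assumes the stock DPT_Hybrid weights as a stand-in; the paper uses a DPT-Hybrid variant fine-tuned on KITTI, which is distributed separately:

```python
import cv2
import torch

# Load a DPT-Hybrid monocular depth model and its matching preprocessing transform.
# (The paper uses a KITTI-finetuned DPT-Hybrid; the stock hub weights are a stand-in here.)
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.dpt_transform

img = cv2.cvtColor(cv2.imread("kitti_frame.png"), cv2.COLOR_BGR2RGB)  # placeholder path
with torch.no_grad():
    prediction = midas(transform(img))            # (1, H', W') relative inverse depth
    inv_depth = torch.nn.functional.interpolate(  # resize back to the input resolution
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()
```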

Foreground/background scene separation:

As shown in Figure 2, for a given input image I, a pre-trained monocular depth estimation network is used to obtain pixel-wise inverse depth values D. These values are then used to drive the warping of a flat scene mesh. We observe that thresholding the inverse depth at different levels allows us to focus on different levels of detail.

Figure 4 shows experiments with different inverse depth thresholds. With a high minimum inverse depth of 800, the resulting wall surface sits quite close to the camera and the scene content. This setting may work for scenes with a small depth range, but it fails on diverse outdoor scenes with varying depth bounds, leading to over-shadowed results in which the spurious surface casts its own shadow over the scene. We instead choose a lower minimum inverse depth, since this corresponds to a distance further from the camera position. This allows the mesh to extend further back and produces cleaner shadows. With the lower inverse depth threshold, both the sky and the more distant surfaces on the horizon are better represented in the mesh M.
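A minimal sketch of this thresholding step, under the assumption that the inverse depth is simply clamped from below before the mesh is warped (the exact mechanism is not spelled out in these notes, so the function and scale are illustrative):

```python
import numpy as np

def clamp_inverse_depth(inv_depth, min_inv_depth=100.0):
    """Floor the (relative, unitless) inverse depth before warping the scene mesh.

    Distant content such as sky has inverse depth near zero; flooring it places all
    of that content on a single far "wall". A low floor (e.g. 100) pushes that wall
    far from the camera and yields cleaner shadows, while a high floor (e.g. 800)
    pulls it close and causes the over-shadowing artifact shown in Figure 4.
    """
    return np.maximum(inv_depth, min_inv_depth)

# Depth used to displace the mesh vertices (up to a relative scale).
inv_depth = np.random.uniform(0.0, 3000.0, size=(384, 1280)).astype(np.float32)
depth = 1.0 / clamp_inverse_depth(inv_depth, min_inv_depth=100.0)
```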

Figure 4. (a) With a minimum inverse depth of 800, the scene mesh forms a flat vertical surface at the corresponding threshold distance. This shows up as a flat light-gray wall that incorrectly cuts off the top of the pyramid geometry. (b) This wall artifact casts a huge shadow in the relit image, labeled "over-shadowed". (c) Using a minimum inverse depth of 100 effectively pushes the wall boundary further away, giving the scene mesh a higher level of detail and resulting in (d) more realistic, crisp shadow effects.


3.2.2 Image relighting component

As shown in Figure 2, given the scene mesh M from the geometry estimation module, a set of priors or input buffers is generated as described in Section 3.1.2. They are fed as input to the shadow refinement networks (rsrc, rtgt) and the subsequent image relighting network (rout). We choose to use MVR's pre-trained networks for rsrc, rtgt and rout because they perform well despite the imperfect mesh structures across different datasets. Moreover, obtaining a large and diverse set of high-resolution synthetic data to retrain the relighting networks would be time-consuming and costly. In SIMBAR, we therefore focus on the novel adaptation to single-view, geometry-aware scene relighting.

3.3 Improved MVR method as baseline: MVR-I

The out-of-the-box MVR method fails on autonomous driving datasets collected from a single view. To allow comparison against a strong baseline, we optimize MVR for road driving scenes with limited fields of view, which we call MVR-I. We use MVR-I as the baseline for all qualitative (Section 3.4) and quantitative (Section 4.2) comparisons.

Figure 5. RGB point cloud overlaid on top of the scene mesh generated for a KITTI scene, for visualization. Out-of-the-box MVR hallucinates surfaces in the sky (a), producing phantom shadows (b); we improve this with MVR-I (c), leading to more realistic image relighting results (d).


Removing hallucinated mesh surfaces:

First, we found that running MVR on KITTI scenes hallucinates sky surfaces in the generated mesh, which cast corresponding phantom shadows on the ground. This happens because SFM+MVS reconstruction triangulates selected 3D feature points that have low reprojection error across the input images. In Figure 5, note the triangulated points in (a) that lead to surface reconstruction in the sky. These hallucinated surfaces cast prominent shadows in the sky and over the foreground corners of the image in (b). While the shadow refinement networks can correct minor inaccuracies in the mesh [44], major inaccuracies like these lead to unrealistic scene relighting effects. To address this, we implemented a simple but effective fix: by segmenting the input multi-view images with Detectron2 [57], we exclude confounders that appear in the sky, such as clouds, as well as the sky itself (c). This resolves the hallucinated mesh surfaces in the sky and the corresponding phantom shadows (d).
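The notes only state that Detectron2 is used for this segmentation step; the sketch below is one plausible way to build a sky mask with an off-the-shelf Detectron2 panoptic model. The config choice and mask-out logic are assumptions, not the paper's exact implementation:

```python
import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor

# Off-the-shelf COCO panoptic segmentation model (assumed; the paper does not name a config).
cfg = get_cfg()
cfg_file = "COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml"
cfg.merge_from_file(model_zoo.get_config_file(cfg_file))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(cfg_file)
predictor = DefaultPredictor(cfg)

img = cv2.imread("kitti_frame.png")  # placeholder path
panoptic_seg, segments_info = predictor(img)["panoptic_seg"]
stuff_classes = MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).stuff_classes

# Collect pixels belonging to any "stuff" segment whose class name mentions sky,
# so they (and their SFM feature points) can be excluded from mesh reconstruction.
sky_mask = torch.zeros_like(panoptic_seg, dtype=torch.bool)
for seg in segments_info:
    if not seg["isthing"] and "sky" in stuff_classes[seg["category_id"]]:
        sky_mask |= panoptic_seg == seg["id"]
```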

Figure 6. (a) Delaunay surface reconstruction is sensitive to noise, leading to triangular artifacts. (b) The Poisson-reconstructed mesh has smoother surfaces.


Improved surface reconstruction:

The second improvement replaces the Delaunay surface reconstruction algorithm [7] used for mesh generation with the Poisson surface reconstruction algorithm [31]. Figure 6 (left) shows that the Delaunay algorithm leads to a noisy mesh, especially for the ground. Poisson surface reconstruction of the same scene (right) reduces the angular edges and makes the road and tree surfaces smoother overall.
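For reference, a minimal sketch of Poisson surface reconstruction from an SFM+MVS point cloud using Open3D; this only illustrates the technique and is not the paper's implementation (file names and the octree depth are placeholders):

```python
import open3d as o3d

# Dense point cloud from SFM+MVS (e.g. a fused COLMAP output); placeholder file name.
pcd = o3d.io.read_point_cloud("fused_points.ply")
pcd.estimate_normals()  # Poisson reconstruction requires oriented normals

# Poisson surface reconstruction; a higher octree depth yields a finer mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
o3d.io.write_triangle_mesh("scene_mesh.ply", mesh)
```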

The natural result of these two fixes is more realistic relighting, as shown in Figure 5(d).

3.4 Scene relighting results

Both MVR and MVR-I require multiple viewpoints of a scene in order to generate an approximate 3D mesh with SFM+MVS. This approach fails on video sequences captured by a stationary ego vehicle, because multiple viewpoints are missing from the captured sequence. This is a known limitation of SFM+MVS, and it causes many transparent shadows to be rendered in KITTI frames with MVR-I, as can be observed in the top row of Figure 7.

Figure 7. Relighting results from MVR-I (a)(c) and SIMBAR (b)(d) on KITTI and BDD100K, respectively.


In contrast, SIMBAR provides noticeably more realistic and geometrically consistent relighting results, as shown in Figure 7(b) and (d). While MVR-I fails to realistically relight images of road driving scenes from KITTI (top) and BDD100K (bottom), SIMBAR's relighting results are consistently more realistic in terms of target shadow direction and sky color. However, some strong cast shadow residue remains that cannot be removed cleanly.

3.5 Limitations

Full occlusion:

With our proposed improvements in the geometry estimation module (see Section 3.2.1), the generated mesh improves significantly, with more surface detail on foreground objects and better inclusion of background objects. However, a natural drawback of monocular depth methods is that fully occluded objects are excluded. While mesh errors on partially occluded objects can be corrected by the shadow refinement networks, fully occluded objects currently pose a problem for shadow removal. Without an additional view containing the object, the mesh cannot represent it, yet the object still casts a shadow in the real input image. We find that, when working from a single source view, this sometimes leads to shadow residue during shadow removal due to the missing context about the object.

Scene mesh manipulation:

With a low inverse depth threshold, the sky is generated as a wall surface placed further away on the horizon (see Figure 4). Ideally, we would like to remove this flat wall surface through scene mesh manipulation for a more robust separation of the scene mesh. To achieve a better geometric understanding of individual objects in the scene, and more precise control over scene relighting and shadow manipulation, another refinement could be to leverage 3D mesh predictions from neural networks such as Mesh R-CNN [20]. We currently use the 3D mesh purely as a geometric representation and do not model specific surface properties. Further modeling could enable realistic lighting effects that are important for specular reflections.

Paper link

https://arxiv.org/pdf/2204.00644.pdf

Official project page

https://simbarv1.github.io/

Personal summary

This is a paper on dataset augmentation for lighting conditions. I had not worked in this direction before, so I did not fully understand the paper as a whole. If you want to understand it thoroughly, I suggest reading more papers in this area first.
