OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving: A Close Reading of the Paper

OmniDet: A multi-task visual perception network for autonomous driving based on surround-view cameras

Summary

Surround-view fisheye cameras are commonly deployed in autonomous driving for 360° near-field sensing around the vehicle. This work proposes a multi-task visual perception network that operates on unrectified fisheye images so that the vehicle can perceive its surroundings. It covers six primary tasks required by an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. We demonstrate that the jointly trained model outperforms the respective single-task versions. Our multi-task model has a shared encoder, which provides a significant computational advantage, and synergized decoders in which tasks support each other. We propose a novel camera-geometry-based adaptation mechanism that encodes the fisheye distortion model at both training and inference time. This is crucial for training on the WoodScape dataset, which consists of data collected in different parts of the world by 12 different cameras mounted on three different cars, with different intrinsics and viewpoints. Because bounding boxes are a poor representation for distorted fisheye images, we also extend object detection to use polygons with non-uniformly sampled vertices. We additionally evaluate our model on standard automotive datasets, namely KITTI and Cityscapes. We obtain state-of-the-art results on KITTI for the depth estimation and pose estimation tasks and competitive performance on the other tasks. We conduct extensive ablation studies on various architectural choices and task weighting methods. A short video at https://youtu.be/xbSjZ5OfPes provides qualitative results.

1. Introduction

Surround-view fisheye cameras have been deployed in advanced vehicles for more than a decade, first for visualization on dashboard display units and later for near-field perception in automated parking. Fisheye cameras have strong radial distortion, and correcting it has drawbacks such as a reduced field of view and resampling artifacts at the periphery [1], so we work directly on the uncorrected images. Because the distortion varies spatially, object appearance varies more widely, especially for nearby objects. Fisheye perception is therefore a challenging task, yet despite the prevalence of these cameras it remains comparatively little studied.

Autonomous driving applications require a variety of perception tasks to provide a robust system that covers a wide range of use cases. Detecting objects in several alternative ways in parallel is necessary to reach high accuracy. For example, objects can be detected from appearance, motion, and depth cues. Despite the growing computing power of automotive embedded systems, efficient designs are always needed because the number of cameras and perception tasks keeps increasing. Multi-task learning (MTL) is an efficient design pattern in which all tasks share most of the computation [2], [3]. In addition, the features learned for multiple tasks can act as regularizers that improve generalization. Mao et al. [4] showed that multi-task learning improves adversarial robustness, which is crucial for safety applications. In the automotive multi-task setting, MultiNet [5] was one of the first to demonstrate a three-task network on KITTI, and most subsequent work has also focused on three-task settings.

Figure 1: Results of our real-time OmniDet model on raw fisheye images. a. Rear camera input image; b. Distance estimation; c. Semantic segmentation; d. Motion estimation; e. Object detection based on 24-sided polygon; f. Soiling segmentation (asynchronous)

This work demonstrates a multi-task perception model for six essential perception tasks on unrectified fisheye images (shown in Figure 1). Because we discuss a complete perception system that builds on much of our previous work, it is difficult to cover every technical detail, so we focus on the main contributions. Our contributions are as follows:

• We demonstrate the first real-time six-task model for surround view fisheye camera perception.

• We propose a novel camera tensor representation of radial distortion that lets the CNN adapt to the 12 different camera models in the WoodScape dataset.

• We propose novel design techniques including VarNorm task weighting.

• We design synergized decoders in which the tasks help each other, in addition to sharing the encoder.

• We demonstrate that our six-task model on WoodScape and our five-task models on KITTI and Cityscapes outperform the single-task baselines.

• We obtain state-of-the-art results among monocular methods on the KITTI depth and pose estimation tasks.

2. Perceptual tasks and losses in our MTL

Our goal is to build a multi-task model that covers the modules required for near-field sensing use cases such as parking or traffic jam assistance. This paper builds on our previous papers, each of which focused on a single task, so here we mainly discuss the new improvements. In the following subsections, we refer the reader to those papers for further details and a literature review. In general, there has been little work on fisheye perception. For multi-task learning specifically, there is only our previous work FisheyeMultiNet [6], which discusses a simpler three-task network.

The perception system comprises semantic tasks, geometric tasks, and lens soiling detection. The standard semantic tasks are object detection (pedestrians, vehicles, and cyclists) and semantic segmentation (roads, lanes, and curbs). Fisheye cameras are mounted low on the vehicle, so the lens is easily soiled by mud or water splashed up from the road. It is therefore crucial to detect soiling on the camera lens and trigger the cleaning system [7]. Semantic tasks usually require large annotated datasets covering many kinds of objects, and it is not practical to cover every possible object. For this reason, generic object cues based on geometric information, such as motion or depth, are widely used to handle rare objects. They also complement the detection of standard objects and add robustness. We therefore include motion segmentation and depth estimation tasks. Motion is a dominant cue in automotive scenes, and estimating it requires at least two frames or dense optical flow [8]. Self-supervised methods have recently dominated depth estimation and have also been demonstrated on fisheye images [9]. Finally, visual odometry is needed to place the detected objects in a temporally consistent map.

A. Self-Supervised Distance and Pose Estimation Networks

We use our previous work FisheyeDistanceNet [9] to build a self-supervised monocular structure-from-motion (SfM) framework for distance and pose estimation, and perform view synthesis by incorporating a polynomial projection model function. The total loss consists of a reconstruction matching term Lr and a regularization term Ls that enforces edge-aware smoothness within the distance map, as described in [10]. In addition, the cross-sequence distance consistency loss Ldc and the scale recovery technique are used. The following paragraphs discuss the new improvements, which lead to a significant increase in accuracy.

We incorporate the feature-metric losses from [11], where Ldis and Lcvt are computed on a feature representation learned with a self-attention autoencoder. The main purpose of these losses is to keep the training objective from getting stuck in the many local minima of homogeneous regions, which are much larger in fisheye images than in rectilinear ones. The discriminative loss Ldis penalizes small feature gradients and uses image gradients to emphasize low-texture regions. Regularizing the target features through their first derivative shapes the self-supervised loss landscape into proper convergence basins. However, first-order gradients can be inconsistent, that is, spatially adjacent gradients can point in opposite directions, so applying the discriminative loss alone does not guarantee that gradient descent moves toward the optimal solution. Shu et al. [11] therefore proposed a convergent loss with a relatively large convergence radius, so that gradient descent can succeed from far away. It obtains consistent gradients during optimization by encouraging smooth feature gradients and a large convergence radius. The total objective for distance estimation, Ldist, is

where β, γ, ω, and µ weight the distance regularization Ls, cross-sequence distance consistency Ldc, discriminative Ldis, and convergent Lcvt losses, respectively.
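Going only by the terms and weights named above, the objective reads as a weighted sum. The following is a sketch of its likely form, not a quotation: it assumes that the image reconstruction term $\mathcal{L}_r$ and a feature reconstruction term (written $\tilde{\mathcal{L}}_r$ here, a notation chosen for illustration) enter unweighted, and it omits the arguments of each term.

```latex
\mathcal{L}_{dist} = \mathcal{L}_r + \tilde{\mathcal{L}}_r
  + \beta\,\mathcal{L}_s + \gamma\,\mathcal{L}_{dc}
  + \omega\,\mathcal{L}_{dis} + \mu\,\mathcal{L}_{cvt}
```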

We compute the image and feature reconstruction losses using the target frame $I_t$ and its estimated features $\hat{F}_t$, together with the reconstructed target frame $\hat{I}_{t' \to t}$ and the reconstructed features $\hat{F}_{t' \to t}$. Each loss is a linear combination of a general robust pixel-wise loss term [12] and structural similarity (SSIM) [13], as described in [14].

B. Generalized Object Detection

Standard bounding box representation in fisheye cameras fails due to severe radial distortion, especially at the periphery. In parallel work [15], we explore different output representations for fisheye images, including oriented bounding boxes, curved boxes, ellipses, and polygons. We have integrated this model into the MTL framework, where we use a 24-sided polygon representation for object detection. We briefly summarize the details here and refer to our extended paper [15] for more details on generalized object detection.

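Polygons are a generic representation for arbitrary shapes; however, they are more expensive to annotate than bounding boxes. The object contour can be sampled uniformly over 360° into N equally spaced polygon vertices, each represented by its radial distance r from the object centroid, as in PolyYOLO [16]. We observe that uniform sampling cannot efficiently represent the high-curvature variations in the contours of objects in fisheye images. We therefore use adaptive sampling based on the local curvature of the contour, distributing the vertices non-uniformly to represent the contour as well as possible. We adopt the algorithm in [17] to detect the dominant points of a given curve shape, which best represent the object. We then reduce the point set using the algorithm in [18] to obtain the most representative simplified curve. We adapt the YOLOv3 [19] decoder to output polygons and the other representations listed above so that they can be compared uniformly.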
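To make the polar polygon encoding concrete, here is a minimal sketch of the uniform baseline only: a contour is reduced to N radial distances from its centroid, one per equal angular sector. It does not implement the curvature-adaptive sampling of [17] and [18]. The function names, the rule of keeping the farthest point per sector, and the decoding at sector centers are assumptions made for illustration.

```python
import math


def uniform_polar_polygon(contour, n_vertices=24):
    """Encode a closed contour as n radial distances from its centroid.

    Each of the n equal angular sectors keeps the farthest contour point
    that falls inside it (an illustrative choice); empty sectors get 0.0.
    """
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    sector = 2.0 * math.pi / n_vertices
    radii = [0.0] * n_vertices
    for x, y in contour:
        angle = math.atan2(y - cy, x - cx) % (2.0 * math.pi)
        k = min(int(angle / sector), n_vertices - 1)  # guard against angle == 2*pi
        radii[k] = max(radii[k], math.hypot(x - cx, y - cy))
    return (cx, cy), radii


def decode_polar_polygon(centroid, radii):
    """Turn the radial encoding back into (x, y) vertices at sector centers."""
    cx, cy = centroid
    sector = 2.0 * math.pi / len(radii)
    return [
        (cx + r * math.cos((k + 0.5) * sector), cy + r * math.sin((k + 0.5) * sector))
        for k, r in enumerate(radii)
    ]


# Example: a circle of radius 5 around (10, 20) gives 24 radii of about 5.
circle = [
    (10 + 5 * math.cos(2 * math.pi * i / 360), 20 + 5 * math.sin(2 * math.pi * i / 360))
    for i in range(360)
]
centroid, radii = uniform_polar_polygon(circle)
vertices = decode_polar_polygon(centroid, radii)
```

On a contour with sharp corners, many of these evenly spaced vertices land on nearly straight segments, which is the inefficiency that motivates placing vertices at high-curvature points instead.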

C. Segmentation Tasks

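Three of our tasks are modeled as segmentation problems. Semantic segmentation and soiling segmentation have seven and four output classes, respectively, on the WoodScape dataset. Motion segmentation takes two frames and outputs a binary moving-or-static mask. During training, the network predicts the posterior probability Yt, which is optimized in a supervised fashion with the Lovasz-Softmax [20] loss and the focal [21] loss to handle class imbalance, replacing the cross-entropy loss used in our previous work. We obtain the final segmentation mask Mt by applying a pixel-wise argmax to the posterior probabilities.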
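As a small illustration of two pieces of this pipeline, the sketch below computes the focal loss for a single pixel and applies the pixel-wise argmax that produces the final mask. It is written in plain Python for readability rather than speed. The Lovasz-Softmax term is not covered, and the default gamma of 2.0 is an assumption, since the focusing parameter is not stated here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one pixel: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    down-weights pixels that are already classified confidently.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def argmax_mask(score_map):
    """score_map[i][j] is a list of per-class scores; returns class indices."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in score_map
    ]


# A confidently correct pixel contributes far less than an uncertain one.
easy = focal_loss([4.0, 0.0], target=0)
hard = focal_loss([0.0, 0.0], target=0)
mask = argmax_mask([[[0.1, 2.0], [3.0, -1.0]]])
```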

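The soiling dataset was built independently, so it cannot be trained jointly in the usual way. We therefore freeze the shared encoder trained on the other five tasks and train only the soiling decoder. This shows the potential for reusing the encoder features for additional tasks. We also trained soiling jointly using asynchronous backpropagation [22], but this achieved the same accuracy as the frozen encoder. Compared with our previous work SoilingNet [7], we move from tile-based output to pixel-level segmentation.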

D. Joint Optimization

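Balancing the task losses is an important problem when training multi-task models. Our contribution is twofold. First, we evaluate various task weighting strategies for five tasks, in contrast to the three-task experiments in the prior literature. We evaluate the uncertainty loss of Kendall [23], gradient magnitude normalization (GradNorm) [24], dynamic task prioritization (DTP) [25], dynamic weight average (DWA) [26], and the geometric loss [27].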

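Second, we propose a novel variance normalization method, VarNorm. It normalizes each loss by its variance over the last n epochs. The loss weight of task i at time t is formulated as: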

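where $\bar{L}_i$ is the mean loss of task i over the last n epochs. We choose n = 5. The method is motivated by the simple idea that the values of a task loss can be viewed as a distribution whose dispersion is its variance. Variance normalization rescales the dispersion of the different task loss distributions based on the previous n epochs. A larger dispersion leads to a lower task weight, and a smaller dispersion to a higher one. The net effect tends to homogenize the learning speed across tasks. As shown in Table III, equal weighting performs worst, and with any of the dynamic task weighting methods above our multi-task network outperforms the single-task networks. We use the proposed VarNorm method for all further experiments because it achieves the best results.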
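The VarNorm idea described above can be sketched in a few lines. This is a minimal sketch, assuming that the weight of a task is the reciprocal of the sample variance of its last n epoch losses. The function name, the `eps` stabilizer, and the fallback weight of 1.0 when fewer than two epochs are available are all assumptions added for illustration.

```python
from statistics import variance


def varnorm_weights(loss_history, n=5, eps=1e-8):
    """Compute one weight per task from the variance of its recent losses.

    loss_history maps a task name to a list of per-epoch mean losses.
    A larger variance over the last n epochs yields a smaller weight.
    """
    weights = {}
    for task, losses in loss_history.items():
        window = losses[-n:]
        if len(window) < 2:
            weights[task] = 1.0  # assumed fallback: variance is undefined
        else:
            weights[task] = 1.0 / (variance(window) + eps)
    return weights


# Hypothetical loss curves: "distance" fluctuates more than "semantic".
history = {
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "semantic": [1.0, 1.1, 1.2, 1.3, 1.4],
}
weights = varnorm_weights(history)
total_loss = sum(weights[t] * history[t][-1] for t in history)
```

In a training loop the weights would be recomputed once per epoch from the running history and then held fixed while that epoch's batches are processed.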

3. Network details of the MTL OmniDet

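Encoder-decoder architectures are commonly used for dense prediction tasks. We use this type of architecture because it extends easily to a shared encoder for multiple tasks. We design our encoder by incorporating the vector-attention-based pairwise and patchwise self-attention encoders from [28]. These networks efficiently adapt weights across both the spatial dimensions and the channels. We adopt a Siamese approach for the motion prediction network, in which we concatenate the source and target frame features and pass them to a super-resolution motion decoder. Because the weights are shared in the Siamese encoder, the encoder output for the previous frame can be saved and reused instead of being recomputed. Inspired by [11], we develop an auxiliary self-attention autoencoder for single-view reconstruction. The next two subsections detail our main novel contributions. First, we use a novel camera geometry tensor to handle multiple viewpoints and changes in camera intrinsics for distance estimation. Second, we use synergized decoders with cross-task connections so that the tasks improve each other's performance.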

A. Camera Geometry Tensor Ct

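1) Motivation: In an industrial deployment setting, we aim to design a model that can be deployed in millions of vehicles, each with its own set of cameras. Although the underlying camera intrinsics model is the same for a particular family of vehicles, there are variations due to the manufacturing process, so each camera instance must be calibrated. Even after deployment, the calibration can change because of high ambient temperatures or aging. A calibration adaptation mechanism in the model is therefore essential. This contrasts with public datasets, which have a single camera instance for both the training and test sets. The WoodScape fisheye dataset [29] contains 12 different cameras with slight intrinsic variations for evaluating this effect. Using a single model for the four cameras instead of four separate models also has several practical advantages: (1) better efficiency on the embedded system, which needs less memory and data transfer bandwidth, (2) improved training through access to a larger dataset and regularization from the different views, and (3) maintaining and certifying one model instead of four.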

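To address this, we propose converting all camera geometry properties into a tensor, called the camera geometry tensor Ct, which is then passed to the CNN model. The closest work is CAM-Convs [30], which uses camera-aware convolutions for pinhole cameras. We build on this work and generalize it to arbitrary camera geometries, including fisheye cameras.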

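Figure 2: Overview of our surround-view-camera-based multi-task visual perception framework.

The distance estimation task (blue block) uses semantic guidance and dynamic object masking from semantic/motion estimation (green and blue haze blocks) and camera-geometry-adaptive convolutions (orange block).

In addition, we guide the detection decoder features (gray block) with semantic features. The encoder blocks (shown in the same color) are common to all tasks. Our framework consists of processing blocks for training self-supervised distance estimation (blue blocks), semantic segmentation (green blocks), motion segmentation (blue haze blocks), polygon-based fisheye object detection (gray blocks), and the asynchronous task of soiling segmentation (rose fog blocks).

We obtain surround-view geometric information by post-processing the predicted distance maps in 3D space (perano block). The camera tensor Ct (orange block) helps OmniDet produce distance maps across multiple camera viewpoints and makes the network camera-independent.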

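2) Approach: We introduce the camera geometry tensor Ct into the mapping from RGB features to 3D information in the self-attention network (SAN) encoder module, as shown in Figure 2. It is included at every self-attention stage and is also applied to every skip connection. The camera geometry tensor Ct is formed in a three-step process. For efficient training, the pixel-wise coordinates and the incidence angle maps are precomputed. The normalized coordinates of each pixel are used for these channels by incorporating information from the camera calibration. We concatenate these tensors, denote the result Ct, and pass it together with the input features to the SAN pairwise and patchwise operation modules. It adds six channels to the existing decoder channel inputs. In principle, the proposed approach can be applied to any of the fisheye projection models explained in [1]. The maps included in our shared self-attention encoder are computed from the camera intrinsic parameters: the distortion coefficients a1, a2, a3, a4 are used to create the incidence angle maps (ax, ay), the principal point (cx, cy) is used to compute the principal point coordinate maps (ccx, ccy), and the sensor dimensions (width w and height h) are used to form the normalized coordinate maps.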

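3) Centered coordinates (cc): The principal point position is fed into the SAN pairwise and patchwise operation modules by including the ccx and ccy coordinate channels, which are centered at (0, 0). We concatenate ccx and ccy after resizing them with bilinear interpolation to match the input feature size. We define the ccx and ccy channels as: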

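4) Incidence angle maps (ax, ay): For the pinhole (rectilinear) camera model, the horizontal and vertical incidence angle maps are computed from the cc maps using the camera focal length f: $a_{ch}[i,j] = \arctan(cc_{ch}[i,j] / f)$, where ch is x or y (see Equation 3). For the different fisheye camera models, the incidence angle maps can be derived analogously by inverting the radial distortion function r(θ) explained in [1]. Specifically, for the polynomial model used in this paper, the incidence angle θ is obtained by numerically computing the root of the fourth-order polynomial $r(\theta) = \sqrt{x_I^2 + y_I^2} = a_1\theta + a_2\theta^2 + a_3\theta^3 + a_4\theta^4$. We store the precomputed roots in a lookup table for all pixel coordinates for training efficiency, and create the ax and ay maps by setting $x_I = cc_x[i,j],\ y_I = 0$ and $x_I = 0,\ y_I = cc_y[i,j]$, respectively.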
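The numerical inversion of r(θ) can be sketched as follows. The text only says the root is found numerically, so the use of bisection, the assumption that r(θ) increases monotonically on [0, π/2], the signed-angle convention, and the example coefficients are all illustrative choices rather than the paper's procedure.

```python
import math


def radial_distortion(theta, coeffs):
    """Fourth-order polynomial model r(theta) = a1*t + a2*t^2 + a3*t^3 + a4*t^4."""
    a1, a2, a3, a4 = coeffs
    return a1 * theta + a2 * theta**2 + a3 * theta**3 + a4 * theta**4


def incidence_angle(radius, coeffs, theta_max=math.pi / 2, iters=60):
    """Invert r(theta) = radius by bisection on [0, theta_max].

    Assumes r is monotonically increasing on that interval; a radius beyond
    r(theta_max) simply converges to theta_max.
    """
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if radial_distortion(mid, coeffs) < radius:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def signed_incidence_angle(cc_value, coeffs):
    """Angle for one entry of a cc map, keeping the sign of the offset."""
    return math.copysign(incidence_angle(abs(cc_value), coeffs), cc_value)


# Hypothetical distortion coefficients, chosen only so that r is monotonic.
coeffs = (300.0, -5.0, 10.0, -2.0)
r = radial_distortion(0.7, coeffs)
theta = incidence_angle(r, coeffs)      # recovers roughly 0.7
a_x = signed_incidence_angle(-r, coeffs)  # roughly -0.7 for a pixel left of center
```

Because the roots depend only on the calibration and the pixel grid, they can be computed once per camera and stored, which matches the lookup-table approach described above.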

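5) Normalized coordinates (nc): In addition, we add two normalized coordinate channels [31], [30], whose values vary linearly between −1 and 1 with respect to the image coordinates. These channels are independent of the camera sensor properties and characterize the spatial position in the x and y directions (for example, a value of the $\hat{x}$ channel closer to 1 indicates that the feature is closer to the right border of the image).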
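To show how the six channels fit together, here is a minimal sketch that builds them for a pinhole camera, where the incidence angle has the closed form arctan(cc / f). The function name and channel order are assumptions, and the cc channels are assumed to be the pixel index minus the principal point. A fisheye camera would replace the arctan step with the inversion of its distortion polynomial.

```python
import math


def camera_geometry_channels(width, height, cx, cy, focal_length):
    """Return six H x W maps: [ccx, ccy, ax, ay, ncx, ncy].

    ccx/ccy: pixel coordinates centered on the principal point (cx, cy).
    ax/ay:   pinhole incidence angles, arctan(cc / f).
    ncx/ncy: coordinates normalized to [-1, 1], independent of the sensor.
    Requires width > 1 and height > 1.
    """
    ccx = [[j - cx for j in range(width)] for _ in range(height)]
    ccy = [[i - cy for _ in range(width)] for i in range(height)]
    ax = [[math.atan(v / focal_length) for v in row] for row in ccx]
    ay = [[math.atan(v / focal_length) for v in row] for row in ccy]
    ncx = [[2.0 * j / (width - 1) - 1.0 for j in range(width)] for _ in range(height)]
    ncy = [[2.0 * i / (height - 1) - 1.0 for _ in range(width)] for i in range(height)]
    return [ccx, ccy, ax, ay, ncx, ncy]


# Tiny 5 x 3 example with a hypothetical principal point and focal length.
ct = camera_geometry_channels(width=5, height=3, cx=2.0, cy=1.0, focal_length=2.0)
```

The resulting maps would be resized to each feature resolution and concatenated with the features along the channel dimension.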

B. Synergized Tasks

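1) Handling dynamic objects and the infinite depth problem: Because dynamic objects violate the static-world assumption, information about their depth/distance is crucial in autonomous driving; otherwise we would encounter the infinite depth problem at inference time and the reconstruction loss would be contaminated. We use motion segmentation information to exclude potentially moving dynamic objects, while distance is still learned from non-moving dynamic objects. To this end, we define a pixel-wise mask µt that contains 1 if a pixel belongs neither to a dynamic object in the current frame $I_t$ nor to a wrongly projected dynamic object in the reconstructed frame $\hat{I}_{t' \to t}$, and 0 otherwise. Accordingly, we predict the motion segmentation mask $M_t$ for the target frame $I_t$ as well as the motion masks $M_{t'}$ for the source frames $I_{t'}$. Dynamic objects inside a source frame are detected in $M_{t'}$. However, to obtain the wrongly projected dynamic objects, we need to warp the motion masks to the target frame by nearest-neighbor sampling, which yields the projected motion masks $M_{t' \to t}$. We also propose an alternative technique that enables filtering of dynamic objects when the motion segmentation task is not available: we use the semantic segmentation output and follow a similar procedure. By defining the set of dynamic object classes $S_{DC} \subset S$, we can reduce the semantic segmentation mask to a binary mask that satisfies the conditions above. Its element at position ij is: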

Dynamic objects can then be masked by multiplying the mask pixel-wise with the reconstruction losses for both images and features.
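The masking logic can be sketched on small nested lists. This is a minimal sketch of the semantic variant: a pixel is kept only if neither the target segmentation nor the warped source segmentation assigns it a dynamic class. The function names, the class ids, and the averaging over valid pixels are assumptions, since the text only specifies the pixel-wise multiplication.

```python
def dynamic_object_mask(seg_target, seg_warped, dynamic_classes):
    """Binary mask: 1 where neither segmentation labels the pixel as dynamic."""
    return [
        [0 if (a in dynamic_classes or b in dynamic_classes) else 1
         for a, b in zip(row_t, row_w)]
        for row_t, row_w in zip(seg_target, seg_warped)
    ]


def masked_mean_loss(loss_map, mask):
    """Multiply the per-pixel loss by the mask and average over kept pixels."""
    total = sum(l * m for lr, mr in zip(loss_map, mask) for l, m in zip(lr, mr))
    count = sum(m for mr in mask for m in mr)
    return total / count if count else 0.0


# Hypothetical class ids: 0 = road, 1 = curb, 2 = vehicle (dynamic).
seg_target = [[0, 1], [2, 0]]
seg_warped = [[0, 0], [0, 2]]
mask = dynamic_object_mask(seg_target, seg_warped, dynamic_classes={2})
loss = masked_mean_loss([[1.0, 3.0], [5.0, 7.0]], mask)
```

With motion segmentation available, the same two functions apply unchanged by passing binary moving/static masks and `dynamic_classes={1}`.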

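2) Semantically guided distance and detection decoders: Following our previous work [14], to better incorporate the semantic knowledge extracted by the segmentation branch of the multi-task network into distance estimation, we inject it into the distance decoder using pixel-adaptive convolutions [32] (PAC), which distill knowledge from the semantic features. In particular, this breaks the spatial invariance of convolution and allows location-specific semantic knowledge to be incorporated into the multi-level distance features. As shown in Figure 2 (green blocks), features are extracted at different levels of the segmentation decoder. Here, the input signal x to the pixel-adaptive convolution is processed as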

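with pixel location ij, the distance $r_{a-i,b-j}$ between pixel locations, and a k×k neighborhood window $N_k(i,j)$ around location ij. The elements of x within the neighborhood window $N_k(i,j)$ serve as input to a convolution with weights W, bias $B \in \mathbb{R}^1$, and a kernel function K, which computes the correlation between the semantically guided features $F \in \mathbb{R}^D$ extracted from the segmentation branch.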
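The following single-channel sketch illustrates the mechanism: each neighbor's contribution is the ordinary convolution weight for its offset, scaled by a kernel that compares the guidance features at the two locations. The Gaussian form of K, the zero padding at the borders, and the function names are assumptions chosen for illustration, not details given in the text.

```python
import math


def gaussian_kernel(f_a, f_b):
    """K(f_a, f_b) = exp(-0.5 * ||f_a - f_b||^2) on guidance feature vectors."""
    sq = sum((u - v) ** 2 for u, v in zip(f_a, f_b))
    return math.exp(-0.5 * sq)


def pixel_adaptive_conv(x, guide, weight, bias=0.0):
    """Pixel-adaptive convolution of a single-channel map x (H x W).

    guide[i][j] is a D-dimensional guidance feature (e.g. semantic features);
    weight is a k x k list indexed by the neighbor offset.
    Neighbors outside the image are skipped (zero padding).
    """
    h, w = len(x), len(x[0])
    half = len(weight) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < h and 0 <= b < w:
                        k = gaussian_kernel(guide[i][j], guide[a][b])
                        acc += k * weight[di + half][dj + half] * x[a][b]
            out[i][j] = acc
    return out


ones = [[1.0] * 3 for _ in range(3)]
box = [[1.0] * 3 for _ in range(3)]

# Uniform guidance: the kernel is 1 everywhere, so this is a plain convolution.
flat_guide = [[[0.0] for _ in range(3)] for _ in range(3)]
plain = pixel_adaptive_conv(ones, flat_guide, box)

# Center pixel with very different semantics: its neighbors are suppressed.
edge_guide = [[[0.0] for _ in range(3)] for _ in range(3)]
edge_guide[1][1] = [10.0]
adaptive = pixel_adaptive_conv(ones, edge_guide, box)
```

The second case shows why the operation is no longer spatially invariant: the same weights produce a different response wherever the guidance features change, for example across a semantic boundary.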

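3) Linking self-attention and semantic features to 2D detection: To exploit the multi-task learning setup, we first extract the self-attention network (SAN) [28] encoder features and feed them as the input signal to Equation 5. We bypass spatial information from the SAN encoder to the semantic decoder and fuse these features (skip connections). Finally, we fuse these features with the detection decoder embeddings by applying PAC and obtain content-agnostic features. This novel fusion technique in the OmniDet framework significantly improves the accuracy of the detection decoder, as shown in Table I.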

C. Implementation Details

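We use PyTorch and adopt a single-stage learning process for the OmniDet framework to facilitate network optimization. We incorporate the recently proposed SAN into the encoder. Its authors proposed two convolution variants, pairwise and patchwise. We mainly use patchwise, but include both in the ablation study. We use the Ranger (RAdam [33] + LookAhead [34]) optimizer to minimize the training objective. We train the model for 20 epochs with a batch size of 24 on a 24 GB Titan RTX, using an initial learning rate of 4×10−4 for the first 15 epochs and reducing it to 10−5 for the last 5 epochs. The sigmoid output σ of the distance decoder is converted to a distance by D = m·σ + n, where m and n are chosen to constrain D between 0.1 and 100 units. Finally, we set β, γ, ω, and µ to 10−3. All images from the surround-view cameras with their multiple viewpoints are shuffled thoroughly and fed to the distance and pose networks along with their respective intrinsics to create the camera geometry tensor Ct, as shown in Figure 2 and described in Section III-A.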
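The mapping from the sigmoid output to a distance follows directly from the stated range: since σ lies in (0, 1), choosing n = 0.1 and m = 100 − 0.1 keeps D between 0.1 and 100. A minimal sketch, with function names chosen for illustration:

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def distance_scale(d_min=0.1, d_max=100.0):
    """Return (m, n) such that D = m * sigma + n spans [d_min, d_max]."""
    return d_max - d_min, d_min


def sigmoid_to_distance(logit, d_min=0.1, d_max=100.0):
    """Convert a raw decoder output to a bounded distance."""
    m, n = distance_scale(d_min, d_max)
    return m * sigmoid(logit) + n


near = sigmoid_to_distance(-50.0)  # approaches the lower bound 0.1
mid = sigmoid_to_distance(0.0)     # midpoint of the range, 50.05
far = sigmoid_to_distance(50.0)    # approaches the upper bound 100
```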

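4. Experimental results

A. Datasets

We systematically train and test all our single-task and multi-task models on WoodScape [29], a surround-view fisheye dataset with annotations for multiple tasks, and on the pinhole camera datasets KITTI [35] and Cityscapes [36].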

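1) WoodScape: The WoodScape dataset consists of 10,000 images split into training, validation, and test sets in a 6:1:3 ratio. Additional proprietary data is used for pre-training and initializing our models. We train our 2D box detection on the 5 most essential object categories: pedestrians, vehicles, riders, traffic signs, and traffic lights. The polygon prediction task on raw fisheye images is limited to two classes, pedestrians and vehicles. Unlike traffic lights and traffic signs, these categories are non-rigid and quite diverse in appearance, which makes them suitable for polygon regression. For the polygon regression task, we sample 24 points with high curvature values from each object instance contour. Learning these points helps regress better polygon shapes, because the high-curvature points define the shape of the object contour. We perform semantic segmentation for the road, lane, and curb classes. Images are resized from the native 1 MP resolution to 544×288 pixels.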

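2) Cityscapes: For Cityscapes, we extract 2D boxes from the instance polygons. We train the OmniDet MTL model on 2,975 images in both single-task and multi-task settings. We report all results on the validation split of 500 images. Images are resized to 640×384 pixels for training and validation.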

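3) KITTI: The KITTI dataset consists of 42,382 stereo sequences with corresponding raw LiDAR scans, 7,481 images with bounding box annotations, and 200 training images with semantic annotations. We use the data split of Eigen et al. [37] for self-supervised depth estimation, and the input size for all tasks is 640×192 pixels. See [9] for details of the split. For motion segmentation, we use the annotations provided by DeepMotion [38] for Cityscapes and KITTI MoSeg [8]. The labels here are available only for the car class.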

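B. Single-Task vs. Multi-Task Learning

In Table II, we perform an extensive ablation of the proposed framework on all the datasets described above. The quantitative results of our experiments show that the multi-task networks, with six tasks or with five diverse tasks, perform better than the single-task models, helped by the proposed synergies explained in Section III-B. For KITTI and Cityscapes, we employ our novel VarNorm task weighting technique. Through the synergy of the perception tasks, we obtain state-of-the-art depth and pose estimation results on the KITTI dataset, as shown in Tables V and VI, respectively. We run inference with TensorRT (FP16) on the NVIDIA Jetson AGX platform and report the FPS for all tasks.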

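C. Ablation Study of Our Contributions

For the ablation analysis of our main features, shown in Table I, we consider the two variants of the self-attention encoder, pairwise and patchwise. First, we replace the L1 loss with the general parameterized loss function and test it with the patchwise variant of the self-attention encoder. We cap the distance estimates at 40 m. We achieve a significant gain in this setting, which we attribute to the better supervision signal provided by the discriminative feature loss Ldis. Incorrect distance values are appropriately penalized in combination with Lcvt, which provides the correct optimization direction. These losses help gradient descent transition smoothly toward the optimal solution. When the camera geometry tensor is added to this setting, we observe a significant increase in accuracy, because we train on multiple cameras with different intrinsics and viewpoints. This is an important feature of the OmniDet framework. This training strategy makes the network camera-independent and lets it generalize better to images taken by different cameras.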

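To achieve synergy between geometric and semantic features, we add semantic guidance to the distance decoder. It helps the network reason about geometry and content within the same shared features and resolves photometric ambiguities. To build a robust reconstruction loss that is not contaminated by dynamic objects, we introduce the semantic and motion masks described in Section III-B1 to filter out all dynamic objects. Filtering based on the motion mask, together with the camera geometry tensor (CGT), yields a larger gain than using the semantic mask, because the semantic classes may not cover all dynamic objects in their class set, as indicated in Equation 4. Moreover, this contribution has the potential to solve the infinite distance problem. Finally, to complete our synergies, we use semantically guided features in the detection decoder as described in Section III-B2, which yields a significant gain in mAP, and the overall results for all tasks improve inherently thanks to the better shared features. The synergy among all the features and tasks we contribute helps the OmniDet framework achieve good scene understanding with high accuracy in each task's predictions. We also experiment with cylindrical rectification (Cyl Rect.), which offers a good trade-off between loss of field of view and reduced distortion [29].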

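For object detection on native fisheye images, in addition to the standard 2D box representation, we benchmark oriented boxes, ellipses, curved boxes, and the 24-sided polar polygon representation in Table IV. Here, mIoU GT denotes the maximum performance, in terms of instance segmentation, that can be achieved with each representation. It is computed between the ground-truth instance segmentation and the ground truth of the corresponding representation. mIoU, in contrast, denotes the performance our network achieves. We also list the number of parameters each representation involves, for a complexity comparison.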

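D. State-of-the-Art Comparison on KITTI

To facilitate comparison with previous methods, we also train our distance estimation method in the classical depth estimation setting on the KITTI Eigen split [39]; the results are shown in Table V. With the synergy between the depth, semantic, motion, and detection tasks, together with the features described in Table I, whose importance is explained in Section IV-C, we outperform all previous monocular methods. Following best practice, we cap depth at 80 m. We also evaluate using both the original [40] and the improved [41] ground-truth depth maps. Methods marked with * use the online refinement technique [42], in which the model is trained during inference. With the online refinement method from [42], we obtain a significant improvement.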

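In Table VI, we report the average trajectory error (in meters) of the pose estimation network, following the same protocol as Zhou [43] on the official KITTI odometry split, which contains 11 sequences with ground-truth (GT) odometry obtained from IMU/GPS measurements and used for evaluation purposes only. We use sequences 00-08 for training and 09-10 for testing. We outperform the previous methods listed in Table VI, mainly by applying our bundle adjustment framework through the cross-sequence distance consistency loss [9], which introduces more constraints and simultaneously optimizes distances and camera poses over an implicitly extended training input sequence. This provides additional consistency constraints that previous methods lack.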

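Personal summary

Personal understanding

Overall, this paper has real substance. To understand it fully, you also need to read the earlier papers it builds on: PolyYOLO, FisheyeDistanceNet, SAN, Ranger, and so on.

The overall idea is to run multiple perception tasks on fisheye images with multi-task learning (MTL). The idea is a good one, and the results do look substantive.

I did not find the soiling detection part in the code. Soiling detection is what I am currently working on, so I did not investigate the paper any further.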