PETR: Position Embedding Transformation for Multi-View 3D Object Detection

Background

Problems in Existing Research

PETR (Position Embedding Transformation for Multi-View 3D Object Detection) improves on DETR3D (3D Object Detection from Multi-view Images via 3D-to-2D Queries), whose 2D-to-3D conversion still suffers from three problems:

(1) The interaction between 3D space and the multi-view features depends on the accuracy of the estimated 3D reference points; when a reference point is inaccurate, the sampled features fall outside the object region, cannot be projected into a valid area, and thus fail to interact with the 2D image.
(2) Only the image features at the projected points are collected, so the object queries interact solely with the 2D features at the points projected from the 3D reference points and cannot perform representation learning from a global view.
(3) The complex feature-sampling procedure hinders practical deployment of the detector: because sampling and projection are required, the pipeline is relatively complex, which hurts inference efficiency. This motivates an end-to-end 3D object detection framework without online 2D-to-3D conversion and feature sampling.
Although PETR shows good application potential in multi-view 3D object detection, it still has several open problems, including:

  1. Computational complexity: the PETR model encodes the position information of 3D coordinates into the image features, which increases computational complexity and requires more computing resources and time for training and inference.
  2. Accuracy: because PETR generates 3D position-aware features by encoding 3D position information, the position information of objects far from the camera can be distorted, which may reduce detection accuracy.
  3. Dataset limitations: PETR requires a large amount of training data with 3D position annotations, and large-scale datasets of this kind are still scarce, which limits the application and adoption of the technique.
  4. Variations in object shape and pose: PETR may be affected by changes in object shape and pose, causing a drop in detection accuracy; objects with complex shapes and poses may therefore require more complex and robust models to achieve good detection performance.

The Solution Proposed in This Paper

In response to the problems above, this paper provides a simple and elegant solution for multi-view 3D object detection: PETR. PETR abandons sampling and projection; it directly computes the 3D position encoding corresponding to the 2D multi-view features, adds it to the 2D image features, and then lets the 3D object queries interact with the result to update themselves directly, which greatly simplifies the pipeline. To achieve this, the camera frustum space shared by the different views is first discretized into a grid of coordinates. These coordinates are then transformed with the parameters of each camera to obtain coordinates in 3D world space. Next, the 2D image features extracted by the backbone and the 3D coordinates are fed into a simple 3D position encoder to produce 3D position-aware features. Finally, the 3D position-aware features interact with the object queries in the transformer decoder, and the updated object queries are used to predict object classes and 3D bounding boxes.
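To make the contrast with DETR3D concrete, the fragment below is an illustrative sketch (not code from either paper; the tensor layouts and the `lidar2img` projection matrices are assumptions): DETR3D must project every 3D reference point into each camera and sample features there, while PETR only adds a 3D position embedding to the feature maps and lets global attention handle the interaction.

```python
# Illustrative contrast between DETR3D-style feature sampling and PETR-style interaction.
import torch
import torch.nn.functional as F

def detr3d_style_sampling(feat2d, ref_points_3d, lidar2img, img_hw):
    # feat2d: (N, C, H, W); ref_points_3d: (Q, 4) homogeneous 3D reference points;
    # lidar2img: (N, 4, 4) projection matrices; img_hw: (height, width) of the input images.
    proj = torch.einsum('nij,qj->nqi', lidar2img, ref_points_3d)           # project into each camera
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-5)                    # perspective division -> pixels
    h, w = img_hw
    grid = torch.stack([uv[..., 0] / w, uv[..., 1] / h], dim=-1) * 2 - 1   # map to [-1, 1]
    # Points that project outside the image fall outside [-1, 1] and sample nothing useful,
    # which is the failure mode described above when a reference point is inaccurate.
    return F.grid_sample(feat2d, grid.unsqueeze(2), align_corners=False)   # (N, C, Q, 1)

def petr_style_interaction(feat2d, pos3d):
    # No projection or sampling: the 3D position embedding is simply added to every location,
    # and the object queries later attend to all views globally in the transformer decoder.
    return feat2d + pos3d
```

The comparison is only about data flow; the actual DETR3D implementation additionally handles multiple feature levels and masks invalid projections.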

What are the advantages over traditional methods

Compared with traditional methods, PETR directly computes the 3D position encoding corresponding to each 2D view and uses it for interaction, which greatly reduces information loss in the data flow and removes the need to repeatedly project the reference points. In addition, it has the following six advantages:
1. Higher detection accuracy: PETR processes image data from multiple views at the same time and uses 3D position-aware features to improve detection accuracy; especially for objects far from the camera, it has an advantage over traditional methods and other solutions.
2. Better pose estimation: because PETR encodes 3D position information and introduces the position embedding transformation, it estimates object orientation and rotation angle more accurately.
3. Higher robustness and reliability: PETR detects and recognizes objects across multiple viewpoints and angles, so it is more robust and reliable and adapts to different scenarios and application requirements.
4. End-to-end learning: PETR learns features directly from the raw data through the neural network, avoiding the hand-crafted features of traditional methods.
5. Better scalability: PETR can use more views and more data for training and optimization, so it scales to larger scenes and larger data volumes.
6. Global attention: the attention mechanism in the transformer decoder lets object queries exploit both local and global image features for detection.

What are the application scenarios

AI multi-view 3D scene detection analyzes and recognizes 3D scenes with deep learning. It takes images or videos from multiple viewpoints as input and can model and reconstruct objects across views, which makes it suitable for the following application scenarios:
(1) Autonomous driving: Multi-view 3D scene detection technology can be used for real-time environment perception and driving safety warning of autonomous vehicles, such as detecting road signs, traffic lights, pedestrians, vehicles, etc.
(2) UAVs: multi-view 3D scene detection can be used for scene perception and aerial photography by unmanned aerial vehicles, such as detecting buildings, roads, water bodies, and farmland.
(3) Industrial manufacturing: Multi-view 3D scene detection technology can be used for quality inspection and production line optimization in industrial manufacturing, such as detecting product parts, defects, and dimensions.
(4) Architecture and urban planning: Multi-view 3D scene detection technology can be used for building recognition and 3D model construction in architecture and urban planning, such as detecting buildings, street views, parks, etc.
(5) Security monitoring: Multi-view 3D scene detection technology can be used in the field of security monitoring, such as detecting abnormal behaviors, identifying people and vehicles, etc.
Multi-view 3D scene detection technology has broad application prospects and can bring huge economic benefits and social value to many industries.

Feasibility Analysis

AI multi-view 3D scene detection has received extensive attention and research in recent years and has made considerable progress. From a technical point of view, its feasibility is relatively high, for the following main reasons:

(1) Multi-view scene information can provide a more comprehensive and rich data source, which is conducive to improving the accuracy and robustness of scene perception and detection.
(2) By using advanced artificial intelligence technologies such as deep learning and neural networks, multi-dimensional and multi-modal scene data can be effectively processed and analyzed, thereby achieving more accurate and efficient scene detection.
(3) The improvement of computer hardware performance and the development of cloud computing, distributed computing and other technologies provide more powerful and flexible computing resources and platforms for the realization of AI multi-view 3D scene detection.
(4) AI multi-view 3D scene detection technology has been widely applied and validated; for example, it performs well and has accumulated practical experience in autonomous driving, drones, and security monitoring, which further demonstrates its feasibility and practicality.
In addition, the application of AI multi-view 3D scene detection also faces some challenges and limitations, such as the difficulty of collecting and processing multi-view scene data and the large amount of computing resources required to train and optimize the algorithms.

For the PETR method specifically, encoding the position information of 3D coordinates into image features yields 3D position-aware features and thus enables end-to-end 3D object detection. With the continued development of deep learning and neural networks, the technical feasibility of PETR has been well verified in practice.

PETR does require a large amount of 3D scene data as training samples, which may be difficult to obtain in some fields. However, as 3D perception technology develops and spreads, the cost and difficulty of acquiring and processing 3D scene data keep decreasing, so data feasibility is improving as well. PETR also needs substantial computing resources for training and optimization, but the steady improvement of hardware and the growth of cloud and distributed computing are making this increasingly practical.

Finally, PETR-style methods have been applied in autonomous driving, drones, industrial manufacturing, architecture and urban planning, and security monitoring. Practical experience shows that they work well for 3D scene detection, so the feasibility of real-world application is high as well.

Datasets

  1. nuScenes dataset
• Type of data: large-scale autonomous driving image dataset.
• Size: 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2 Hz, giving 28,130 training samples, 6,019 validation samples, and 6,008 test samples.
• Number of instances: 23 object categories annotated with accurate 3D bounding boxes at 2 Hz over the whole dataset, plus object-level attributes such as visibility, activity, and pose.
• Attributes: a complete autonomous-vehicle sensor suite: 32-beam lidar, 6 cameras, and radars with full 360° coverage. The 3D object detection challenge evaluates 10 classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, traffic cone, and barrier.
• Labels: 93,000 images labeled with instance masks and 2D boxes for 800k foreground objects and 100k semantic segmentation masks.
• Summary: nuTonomy Scenes (nuScenes) is the first dataset to carry the full autonomous-vehicle sensor suite: 6 cameras, 5 radars, and 1 lidar, all with a 360-degree field of view. nuScenes contains 1000 scenes, each 20 seconds long, fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. The authors define novel 3D detection and tracking metrics and provide a careful dataset analysis plus baselines for lidar- and image-based detection and tracking. Data, development kits, and more information are available online. A minimal example of browsing the dataset with the official devkit is given after this list.
• Data source: https://www.nuscenes.org/; https://www.nuscenes.org/nuimages; https://www.nuscenes.org/nuscenes
• Data reference: Caesar, Holger, et al. "nuScenes: A multimodal dataset for autonomous driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
  2. Waymo Open Dataset
• Type of data: high-resolution video and images collected by autonomous vehicles.
• Size: currently contains 1,950 segments of 20 seconds each, with sensor data collected at 10 Hz (390,000 frames) across various geographic locations and conditions.
• Composition: two datasets: the Perception dataset with high-resolution sensor data and labels for 2,030 scenes, and the Motion dataset with object trajectories and corresponding 3D maps for 103,354 scenes.
• Test set: 80 segments of 20-second camera images serve as the test set for the 3D camera-only detection challenge.
• Labels: four object classes (vehicle, pedestrian, cyclist, sign); high-quality labels for 1,200 segments of lidar data with 12.6M 3D bounding boxes carrying tracking IDs; high-quality labels for 1,000 segments of camera data with 11.8M 2D bounding boxes carrying tracking IDs.
• Summary: a new large-scale, high-quality, and diverse dataset containing 1,150 scenes, each spanning 20 seconds, with synchronized and calibrated high-quality LiDAR and camera data captured across a range of urban and suburban geographic regions. According to the proposed diversity metric, it is 15x more diverse than the largest existing camera+LiDAR dataset. The data are exhaustively annotated with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Strong baselines are provided for 2D and 3D detection and tracking, and the impact of dataset size and cross-geography generalization on 3D detection is investigated.
• Data source: https://waymo.com/open/; https://github.com/waymo-research/waymo-open-dataset
• Data reference: Sun, Pei, et al. "Scalability in perception for autonomous driving: Waymo Open Dataset." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
  3. SemanticKITTI dataset
• Type of data: large-scale outdoor-scene dataset for point cloud semantic segmentation.
• Size: 22 sequences.
• Categories: 28 classes, including classes that distinguish non-moving from moving objects.
• Training: 23,201 point clouds for training.
• Test: 20,351 point clouds for testing.
• Attributes: derived from the KITTI Vision Odometry Benchmark and extended with dense point-wise annotations for the full 360° field of view of the automotive LiDAR used.
• Summary: a large dataset introduced to advance research in laser-based semantic segmentation. All sequences of the KITTI visual odometry benchmark are annotated with dense point-wise labels for the full 360° field of view of the automotive LiDAR. Three benchmark tasks are proposed: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using multiple past scans, and (iii) semantic scene completion, which requires anticipating the complete semantic scene. Baseline experiments show that more sophisticated models are required to handle these tasks effectively; the dataset opens the door to more advanced methods and provides ample data for new research directions.
• Data source: http://www.semantic-kitti.org/dataset.html; https://github.com/PRBonn/semantic-kitti-api; https://github.com/PaddlePaddle/Paddle3D
• Data reference: Behley, Jens, et al. "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
  4. A*3D dataset
• Type of data: annotated image dataset.
• Size: 39,179 point cloud frames.
• Annotations: 230K human-labeled 3D object annotations in the frontal RGB images.
• Summary: with the growing popularity of autonomous vehicles worldwide, there is an urgent need for challenging real-world datasets for benchmarking and training on computer vision tasks such as 3D object detection. Existing datasets either represent simple scenes or provide only daytime data. A*3D is a new, challenging dataset consisting of RGB images and LiDAR data with significant diversity in scenes, time, and weather. It features high-density images (about 10x the pioneering KITTI dataset), heavy occlusion, and a large number of night-time frames (about 3x the nuScenes dataset), addressing the gaps in existing datasets and pushing the task boundary of autonomous-driving research toward more challenging, highly diverse environments. The dataset contains 39K frames, 7 classes, and 230K 3D object annotations. Extensive 3D detection benchmarking on A*3D across attributes such as high density and day/night provides interesting insights into the advantages and limitations of training and testing 3D object detection in realistic environments.
• Data source: https://github.com/I2RDL2/ASTAR-3D
• Data reference: Pham, Quang-Hieu, et al. "A*3D dataset: Towards autonomous driving in challenging environments." 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.
  5. KITTI dataset
• Type of data: mobile robotics and autonomous driving image dataset.
• Categories: a tracking challenge over ten object classes (building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence); also used with 11 classes (building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and cyclist).
• Splits for different uses: 252 sequences (140 for training, 112 for testing); 170 training images and 46 test images.
• Attributes: hours of traffic scenes recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. The dataset itself does not contain semantic segmentation labels.
• Summary: an autonomous driving platform is used to develop challenging new benchmarks for stereo, optical flow, visual odometry/SLAM, and 3D object detection. The recording platform is equipped with four high-resolution cameras, a Velodyne laser scanner, and a state-of-the-art localization system. The benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences 39.2 km in length, and more than 200k 3D object annotations captured in cluttered scenes (up to 15 cars and 30 pedestrians visible per image).
• Data source: https://www.cvlibs.net/datasets/kitti/
• Data reference: Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
  6. UrbanScene3D dataset
• Type of data: large-scale urban scene video dataset.
• Size: 128k high-resolution images.
• Categories: 16 scenes, including large-scale real urban regions and synthetic cities covering a total area of 136 km².
• Summary: UrbanScene3D is a large data platform for research on urban scene perception and reconstruction. It contains more than 128k high-resolution images covering 16 scenes, including large-scale real urban regions and synthetic cities with a total area of 136 km². The dataset also provides high-precision LiDAR scans and hundreds of image sets with different observation patterns, offering a comprehensive benchmark for designing and evaluating aerial path planning and 3D reconstruction algorithms. In addition, the dataset built on Unreal Engine and the AirSim simulator, together with manually annotated unique instance labels for every building, can generate various kinds of data such as 2D depth maps, 2D/3D bounding boxes, and 3D point cloud/mesh segmentations. The simulator, with its physics engine and lighting system, can not only produce diverse data but also let users simulate cars and drones in the proposed urban environments for future research.
• Data source: https://vcc.tech/UrbanScene3D; https://github.com/Linxius/UrbanScene3D
• Data reference: Lin, Liqiang, et al. "Capturing, reconstructing, and simulating: the UrbanScene3D dataset." Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VIII. Cham: Springer Nature Switzerland, 2022.
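As noted in the nuScenes entry above, the annotations can be browsed with the official nuscenes-devkit package. The snippet below is a minimal sketch under the assumption that the devkit is installed (pip install nuscenes-devkit) and the data is extracted to a hypothetical /data/nuscenes directory.

```python
# Minimal sketch: browsing nuScenes keyframes and 3D box annotations with the official devkit.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=True)

sample = nusc.sample[0]                         # one keyframe (annotated at 2 Hz)
cam_token = sample['data']['CAM_FRONT']         # sample_data token of the front camera
cam_data = nusc.get('sample_data', cam_token)
print('image file:', cam_data['filename'])

# Each keyframe carries 3D bounding-box annotations drawn from the 23 object categories.
for ann_token in sample['anns'][:5]:
    ann = nusc.get('sample_annotation', ann_token)
    print(ann['category_name'], ann['translation'], ann['size'], ann['rotation'])
```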

Technical Solution

• Linux, Python == 3.6.8
• CUDA == 11.2
• PyTorch == 1.9.0
• mmdet3d == 0.17.1

Overall Framework of the PETR Method

Figure 1: Overall framework of the PETR method

The multi-view images are fed into a backbone network (e.g., ResNet) to extract multi-view 2D image features. In the 3D Coordinates Generator, the camera frustum space shared by all views is discretized into a 3D meshgrid. The meshgrid coordinates are transformed by the parameters of each camera, producing coordinates in 3D world space. The 2D image features and the 3D coordinates are then injected into the proposed 3D Position Encoder to generate 3D position-aware features. Object queries produced by the query generator are updated through interaction with the 3D position-aware features in the transformer decoder, and the updated queries are used to predict 3D bounding boxes and object classes.
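The data flow can be summarized in a minimal PyTorch-style sketch (illustrative only, not the official implementation; the toy convolutional backbone, single feature level, channel sizes, and the D*3-channel coordinate tensor are simplifying assumptions):

```python
# Minimal PyTorch sketch of the PETR forward pass (illustrative only, not the official code).
import torch
import torch.nn as nn

class PETRSketch(nn.Module):
    def __init__(self, embed_dim=256, depth_bins=64, num_queries=900, num_classes=10):
        super().__init__()
        # Toy backbone standing in for ResNet + neck: one 2D feature map per view.
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # 3D position encoder: an MLP (1x1 convs) over the normalized 3D coordinates.
        # The coordinates are assumed here to have D*3 channels (a simplification of the
        # (D x 4)-channel tensor described in the paper).
        self.position_encoder = nn.Sequential(
            nn.Conv2d(depth_bins * 3, embed_dim * 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim * 4, embed_dim, 1),
        )
        self.query_embed = nn.Embedding(num_queries, embed_dim)            # object queries
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(embed_dim, num_classes)                  # class logits
        self.reg_head = nn.Linear(embed_dim, 10)                           # center, size, yaw, velocity

    def forward(self, images, coords3d):
        # images:   (B, N, 3, H, W)            multi-view images
        # coords3d: (B, N, D*3, H/16, W/16)    normalized 3D coordinates from the generator
        B, N = images.shape[:2]
        feat2d = self.backbone(images.flatten(0, 1))            # (B*N, C, h, w)
        pos3d = self.position_encoder(coords3d.flatten(0, 1))   # (B*N, C, h, w): the 3D PE
        feat3d = feat2d + pos3d                                 # 3D position-aware features
        memory = feat3d.flatten(2).transpose(1, 2)              # (B*N, h*w, C)
        memory = memory.reshape(B, -1, feat3d.shape[1])         # all views together: (B, N*h*w, C)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(queries, memory)                     # global cross-attention over all views
        return self.cls_head(out), self.reg_head(out)
```

A real implementation uses a pretrained ResNet/VoVNet backbone, per-layer auxiliary predictions, and Hungarian matching for training; see the official repository linked later in this article.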

Figure 2: Comparison of DETR, DETR3D, and PETR

In Figure 2(a), DETR object queries interact with 2D features to perform 2D detection. (b) DETR3D repeatedly projects the generated 3D reference points onto the image planes and samples 2D features there to interact with the object queries in the decoder. (c) PETR generates 3D position-aware features by encoding a 3D position embedding (3D PE) into the 2D image features; the object queries interact directly with these 3D position-aware features and output 3D detection results.
Compared with DETR3D, the proposed PETR architecture brings several advantages. It keeps the end-to-end spirit of the original DETR while avoiding the complex 2D-to-3D projection and feature sampling. During inference, the 3D position coordinates can be generated offline and used as an extra input position embedding, which makes practical deployment relatively easy.

3D Coordinates Generator

Figure 3: Illustration of the space transformation

PETR's space transformation follows the paper DSGN, as shown in Figure 3. The camera frustum space is represented by (u, v, d), where (u, v) are the pixel coordinates in the image and d is the depth orthogonal to the image plane. The world space is represented by (x, y, z). Using the camera intrinsics, a frustum point (u, v, d) can be back-projected into 3D camera coordinates:

x = (u - c_u) · d / f_u,   y = (v - c_v) · d / f_v,   z = d

where (c_u, c_v) is the principal point and (f_u, f_v) are the focal lengths.
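As a quick sanity check of this back-projection, here is a small numeric example (the intrinsics and the pixel/depth values are made up for illustration):

```python
# Hypothetical numeric check of the frustum -> camera-space back-projection above.
f_u, f_v, c_u, c_v = 1000.0, 1000.0, 800.0, 450.0   # made-up intrinsics (pixels)
u, v, d = 1000.0, 450.0, 20.0                        # made-up pixel and depth (m)

x = (u - c_u) * d / f_u   # 4.0 m to the right of the optical axis
y = (v - c_v) * d / f_v   # 0.0 m (on the horizontal centerline)
z = d                     # 20.0 m in front of the camera
print(x, y, z)            # 4.0 0.0 20.0
```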

Since the nuScenes dataset has 6 cameras, PETR's space transformation differs slightly from DSGN's. First, the camera frustum space is discretized into a 3D meshgrid of size (W_F, H_F, D). Each point in the grid can be written as:

p_j^m = (u_j · d_j, v_j · d_j, d_j, 1)^T

where (u_j, v_j) is a pixel coordinate and d_j the corresponding depth.
Because the 6 cameras have overlapping views, a point in 3D world space may lie in the frustum of more than one camera. The world-space coordinate of point j in the frustum of camera i is denoted:

p_{i,j}^{3d} = (x_{i,j}, y_{i,j}, z_{i,j}, 1)^T
The camera frustum space is transformed into world space using the camera intrinsics and extrinsics:

p_{i,j}^{3d} = K_i^{-1} · p_j^m

where K_i ∈ R^{4×4} is the transformation matrix of the i-th camera, computed from its intrinsic and extrinsic parameters.
Finally, the world-space points are normalized with a given spatial range [x_min, y_min, z_min, x_max, y_max, z_max]:

x = (x - x_min) / (x_max - x_min)
y = (y - y_min) / (y_max - y_min)
z = (z - z_min) / (z_max - z_min)
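Putting the three steps above together, here is a minimal PyTorch sketch of the coordinates generator (an illustrative re-implementation of the equations above; the function name, frustum size, depth range, and region of interest are example assumptions rather than values taken from the paper, and out-of-range handling is simplified to clamping):

```python
# Sketch of the 3D Coordinates Generator: frustum meshgrid -> world space -> normalization.
import torch

def generate_3d_coords(K_inv, img_hw=(640, 1600), grid=(64, 40, 100),
                       d_min=1.0, d_max=61.2,
                       roi=(-61.2, -61.2, -10.0, 61.2, 61.2, 10.0)):
    # K_inv: (N, 4, 4) transforms K_i^{-1} from each camera's frustum space to 3D world space.
    N = K_inv.shape[0]
    D, H_F, W_F = grid
    img_h, img_w = img_hw
    u = torch.linspace(0, img_w - 1, W_F).view(1, 1, W_F).expand(D, H_F, W_F)
    v = torch.linspace(0, img_h - 1, H_F).view(1, H_F, 1).expand(D, H_F, W_F)
    d = torch.linspace(d_min, d_max, D).view(D, 1, 1).expand(D, H_F, W_F)

    # Each frustum point is p_j^m = (u_j * d_j, v_j * d_j, d_j, 1)^T.
    frustum = torch.stack([u * d, v * d, d, torch.ones_like(d)], dim=-1).reshape(-1, 4)

    # p_{i,j}^{3d} = K_i^{-1} p_j^m for every camera i and every frustum point j.
    world = torch.einsum('nij,pj->npi', K_inv, frustum)[..., :3]        # (N, D*H_F*W_F, 3)

    # Normalize with the region of interest [x_min, y_min, z_min, x_max, y_max, z_max].
    lo, hi = torch.tensor(roi[:3]), torch.tensor(roi[3:])
    world = ((world - lo) / (hi - lo)).clamp(0.0, 1.0)                  # out-of-range points saturate

    # (N, D*3, H_F, W_F), ready to be consumed by the 3D position encoder.
    return world.reshape(N, D, H_F, W_F, 3).permute(0, 1, 4, 2, 3).reshape(N, D * 3, H_F, W_F)
```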

3D Position Encoder

Figure 4: The 3D Position Encoder

The backbone and the 3D Coordinates Generator produce the 2D image features and the world-space points:

F^{2d} = { F_i^{2d} ∈ R^{C×H_F×W_F}, i = 1, 2, ..., N },   P^{3d} = { P_i^{3d} ∈ R^{(D×4)×H_F×W_F}, i = 1, 2, ..., N }

P_i^{3d} is passed through an MLP to produce the 3D position embedding, which is added to F_i^{2d} to obtain the 3D position-aware features:

F_i^{3d} = ψ(F_i^{2d}, P_i^{3d}),   i = 1, 2, ..., N

where N is the number of cameras. Finally, the 3D position-aware features are flattened and used as the input to the transformer decoder.
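The encoder ψ itself is lightweight. The sketch below is an illustrative implementation (the 1x1-convolution projection of the 2D features, the hidden width, and the D*3-channel coordinate layout are assumptions consistent with the generator sketch above, not the exact configuration of the paper):

```python
# Sketch of the 3D Position Encoder: feat3d = proj(F_2d) + MLP(P_3d), then flatten for the decoder.
import torch
import torch.nn as nn

class PositionEncoder3D(nn.Module):
    def __init__(self, coord_channels=64 * 3, embed_dim=256):
        super().__init__()
        self.feat_proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)   # 1x1 conv on 2D features
        self.coord_mlp = nn.Sequential(                                   # MLP on the 3D coordinates
            nn.Conv2d(coord_channels, embed_dim * 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim * 4, embed_dim, kernel_size=1),
        )

    def forward(self, feat2d, coords3d):
        # feat2d:   (N, C, H_F, W_F)   per-view 2D image features from the backbone
        # coords3d: (N, D*3, H_F, W_F) normalized 3D coordinates from the generator
        pos3d = self.coord_mlp(coords3d)                 # 3D position embedding (3D PE)
        feat3d = self.feat_proj(feat2d) + pos3d          # 3D position-aware features
        N, C = feat3d.shape[:2]
        # Flatten every view and concatenate along the token axis for the transformer decoder.
        return feat3d.flatten(2).transpose(1, 2).reshape(1, -1, C)   # (1, N*H_F*W_F, C)
```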
To illustrate the effect of the 3D PE, the PEs of three randomly selected pixels in the front view were compared with the PEs of all other view images, as shown in Figure 5. A point to the front-left in 3D world space should in theory appear both on the left side of the front camera and on the right side of the front-left camera, and the first row of the figure shows that the PE similarity indeed matches this prior. This demonstrates that the 3D PE establishes positional correlation between different views in 3D space.
Figure 5: 3D PE similarity
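A similarity map in the spirit of Figure 5 can be produced with a few lines (a sketch; pos3d is assumed to be the (N, C, H, W) stack of 3D position embeddings for the N views, and the chosen pixel is arbitrary):

```python
# Sketch: cosine similarity between one front-view PE vector and the PEs of all views.
import torch
import torch.nn.functional as F

def pe_similarity(pos3d, view_idx=0, y=10, x=20):
    # pos3d: (N, C, H, W) 3D position embeddings for N camera views.
    query = pos3d[view_idx, :, y, x]                              # (C,) PE of one chosen pixel
    sim = F.cosine_similarity(pos3d, query[None, :, None, None], dim=1)
    return sim                                                    # (N, H, W) similarity maps
```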

Decoder, Head, and Loss

The second half of the PETR network largely follows DETR and DETR3D: L standard transformer decoder layers iteratively update the object queries; the detection and regression heads follow DETR3D and regress the offset of the object center relative to the anchor point; classification uses focal loss and 3D box regression uses an L1 loss.
As shown in Figure 6 below, the reference points are passed through inverse_sigmoid to obtain the reference; every decoder layer produces an output with its own independent reg and cls branches, and Δxyz is not updated iteratively across layers.
Figure 6: Decoder and prediction heads
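The per-layer prediction logic described above can be sketched as follows (an illustrative fragment, not the official code; decoder_layers, reg_branches, and cls_branches are assumed module lists, and inverse_sigmoid is the usual clamped logit):

```python
# Sketch of the decoder loop: fixed reference points, per-layer cls/reg heads, no iterative Δxyz update.
import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))

def decode(queries, memory, reference_points, decoder_layers, reg_branches, cls_branches):
    # queries:          (B, Q, C) object queries
    # memory:           (B, S, C) flattened 3D position-aware features
    # reference_points: (B, Q, 3) normalized 3D anchor points in [0, 1]
    reference = inverse_sigmoid(reference_points)        # logit-space reference reused by every layer
    all_cls, all_boxes = [], []
    for layer, reg, cls in zip(decoder_layers, reg_branches, cls_branches):
        queries = layer(queries, memory)                 # each layer has its own independent heads
        delta = reg(queries)                             # (B, Q, 10) box parameters
        center = (delta[..., :3] + reference).sigmoid()  # offset w.r.t. the anchor; reference is NOT updated
        all_boxes.append(torch.cat([center, delta[..., 3:]], dim=-1))
        all_cls.append(cls(queries))
    return all_cls, all_boxes                            # classification: focal loss; boxes: L1 loss
```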

Open-Source Project

The project code is open source and comes in two versions, PETR and PETRv2. PETRv2 is a unified framework for 3D perception from multi-view images. Building on PETR, PETRv2 explores the effectiveness of temporal modeling, using the temporal information of previous frames to boost 3D object detection: the 3D PE achieves temporal alignment of object positions across frames. A feature-guided position encoder is further introduced to improve the data adaptability of the 3D PE. To support high-quality BEV segmentation, PETRv2 provides a simple yet effective solution by adding a set of segmentation queries, each responsible for segmenting one specific patch of the BEV map. PETRv2 achieves state-of-the-art performance on 3D object detection and BEV segmentation.
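To make the idea of temporal alignment concrete, the fragment below is a conceptual sketch (not PETRv2's actual code): 3D points generated for the previous frame are warped into the current ego frame with the relative ego-pose matrices before the 3D PE is computed, so that the position embeddings of both frames refer to the same world locations. The 4x4 pose matrices are assumed inputs.

```python
# Conceptual sketch of PETRv2-style temporal alignment of 3D coordinates (not the official code).
import torch

def align_previous_frame(coords_prev, ego_prev_to_global, ego_global_to_cur):
    # coords_prev:        (P, 3) 3D points generated in the previous frame's ego coordinates
    # ego_prev_to_global: (4, 4) pose of the previous frame in the global frame
    # ego_global_to_cur:  (4, 4) inverse pose of the current frame
    ones = torch.ones(coords_prev.shape[0], 1)
    homo = torch.cat([coords_prev, ones], dim=1)             # homogeneous coordinates (P, 4)
    transform = ego_global_to_cur @ ego_prev_to_global       # previous ego -> current ego
    aligned = (transform @ homo.T).T[:, :3]                  # (P, 3) in the current frame
    return aligned                                           # fed to the same 3D position encoder
```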

Project repository: https://github.com/megvii-research/PETR

Related open-source projects:

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Overview:
This paper proposes a robust 3D detector, called Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as input and directly outputs accurate 3D bounding boxes. The spatial alignment of the multi-modal tokens is performed by encoding 3D points into the multi-modal features. The core design of CMT is quite simple while its performance is impressive: it achieves 74.1% NDS (state of the art for a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains highly robust even when the LiDAR is missing.

Project code:
https://github.com/junjie18/CMT
Paper PDF:
https://arxiv.org/pdf/2301.01283v2.pdf

OpenLane is the first real-world and, so far, the largest 3D lane dataset. It collects valuable content from public perception datasets, providing lane and closest-in-path object (CIPO) annotations for 1,000 road segments. In short, OpenLane has 200K frames and more than 880K carefully annotated lanes.

Repository:
https://github.com/OpenDriveLab/OpenLane

Comment: because a large number of multi-modal tokens and global attention are used in the transformer decoder, the computational cost is relatively high. Addressing this will likely require effort in two directions, the first being to reduce the redundancy of the multi-modal tokens.

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

Overview:
This paper addresses 3D object detection from multi-view images. Current query-based methods rely on a global 3D position embedding (PE) to learn the geometric correspondence between images and 3D space. The authors argue that directly interacting 2D image features with a global 3D PE can increase the difficulty of learning the view transformation, due to variation in camera extrinsics. They therefore propose a new method based on CAMera-view Position Embedding, called CAPE. The 3D position embeddings are formed in the local camera-view coordinate system instead of the global coordinate system, so that they do not encode the camera extrinsic parameters. Furthermore, CAPE is extended to temporal modeling by exploiting object queries from previous frames and encoding ego motion to boost 3D object detection. CAPE achieves state-of-the-art performance among all LiDAR-free methods on the nuScenes dataset (61.0% NDS and 52.5% mAP).

Project code:
https://github.com/kaixinbear/CAPE
https://github.com/PaddlePaddle/Paddle3D
Paper PDF:
https://arxiv.org/abs/2303.10209

Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

Overview:
While recent camera-only 3D detection methods leverage multiple time steps, the limited history they use significantly hampers how much temporal fusion can improve object perception. Observing that existing works' fusion of multi-frame images is an instance of temporal stereo matching, the authors find that performance is hindered by the interaction between 1) the low granularity of the matching resolution and 2) the sub-optimal multi-view setup produced by the limited history. Theoretical and empirical analysis shows that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many time steps over a long-term history. Building on this investigation, the method generates a cost volume from a long-term history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. In addition, the per-frame monocular depth predictions used for long-term, coarse matching are combined with short-term, fine-grained matching, and the two forms of temporal fusion are found to be highly complementary. While maintaining high efficiency, the framework sets a new state of the art on nuScenes, ranking first on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set.

Training code, pretrained models, and project code:
https://github.com/divadi/solofusion
Paper PDF:
https://arxiv.org/pdf/2210.02443v1.pdf


Source: blog.csdn.net/weixin_44348719/article/details/131091984