arXiv Academic Digest Notes, December 7

1. ScAR: Scaling Adversarial Robustness for LiDAR Object Detection

Title: ScAR: Scaling Adversarial Robustness for LiDAR Object Detection
Link: https://arxiv.org/abs/2312.03085
Authors: Xiaohu Lu, Hayder Radha
Abstract: The adversarial robustness of a model is its ability to resist adversarial attacks in the form of small perturbations to input data. Universal adversarial attack methods such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are popular for LiDAR object detection, but they often fall short compared to task-specific adversarial attacks. Additionally, these universal methods typically require unrestricted access to the model's information, which is difficult to obtain in real-world applications. To address these limitations, we present a black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection. By analyzing the statistical characteristics of 3D object detection datasets such as KITTI, Waymo, and nuScenes, we find that the model's predictions are sensitive to the scaling of 3D instances. We propose three black-box scaling adversarial attacks based on the available information: a model-aware attack, a distribution-aware attack, and a blind attack. We also introduce a strategy for generating scaling adversarial examples to improve the model's robustness against these three attacks. Comparisons with other methods on public datasets under different 3D object detection architectures demonstrate the effectiveness of the proposed method.
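
As a rough illustration of the perturbation family described above, the sketch below rescales the LiDAR points of a single object instance about its bounding-box center, in the spirit of a blind scaling attack. The function name, the ±10% scale budget, and the synthetic points are assumptions made for the example, not ScAR's actual implementation.

```python
import numpy as np

def scale_instance(points, box_center, scale):
    """Scale the LiDAR points of one object instance about its box center.

    points     : (N, 3) array of xyz coordinates belonging to the instance
    box_center : (3,) xyz center of the instance's 3D bounding box
    scale      : scalar scaling factor, e.g. 0.9 shrinks the object by 10%
    """
    return box_center + scale * (points - box_center)

# A "blind" perturbation needs no model or dataset statistics: draw a random
# scale factor within a small budget and apply it to the instance's points.
rng = np.random.default_rng(0)
instance_points = rng.uniform(-1.0, 1.0, size=(128, 3)) + np.array([10.0, 2.0, 0.5])
center = instance_points.mean(axis=0)
s = rng.uniform(0.9, 1.1)          # assumed perturbation budget of +/-10% in size
attacked_points = scale_instance(instance_points, center, s)
```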

2. DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

Title: DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance
Link: https://arxiv.org/abs/2312.03018
Authors: Cong Wang, Jiaxi Gu, Panwen Hu, Songcen Xu, Hang Xu, Xiaodan Liang
Abstract: Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. Nevertheless, these methods often result in either low fidelity or flickering over time due to their shallow image guidance and poor temporal consistency. To tackle these problems, we propose a high-fidelity image-to-video generation method, named DreamVideo, by devising a frame retention branch on top of a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at the semantic level, our DreamVideo perceives the reference image via convolution layers and concatenates the features with the noisy latents as model input. In this way, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying prompt texts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on the public dataset; both quantitative and qualitative results indicate that our method outperforms the state-of-the-art methods. Especially for fidelity, our model has a powerful image retention ability and results in a high FVD on UCF101 compared to other image-to-video models. Also, precise control can be achieved by giving different text prompts. Further details and comprehensive results of our model will be presented at https://anonymous0769.github.io/DreamVideo/.
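
The key design stated in the abstract, perceiving the reference image through convolution layers and concatenating the features with the noisy latents, can be sketched as a small PyTorch stub. All class and layer names, channel counts, and tensor shapes below are assumptions for illustration only, not DreamVideo's real architecture.

```python
import torch
import torch.nn as nn

class FrameRetentionStub(nn.Module):
    """Illustrative stub: encode a reference image with a small conv stack and
    concatenate the result with the noisy video latents along the channel axis."""

    def __init__(self, latent_channels=4, image_channels=3, feat_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(image_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, feat_channels, kernel_size=3, stride=2, padding=1),
        )
        # A downstream denoiser would now take latent_channels + feat_channels inputs.
        self.in_channels = latent_channels + feat_channels

    def forward(self, noisy_latents, reference_image):
        # noisy_latents: (B, T, C, H, W); reference_image: (B, 3, 4*H, 4*W)
        feats = self.encoder(reference_image)                           # (B, feat_channels, H, W)
        feats = feats.unsqueeze(1).expand(-1, noisy_latents.shape[1], -1, -1, -1)
        return torch.cat([noisy_latents, feats], dim=2)                 # concat on channel dim

stub = FrameRetentionStub()
z = torch.randn(1, 16, 4, 32, 32)          # 16 frames of 4-channel latents
ref = torch.randn(1, 3, 128, 128)          # reference image at 4x the latent resolution
print(stub(z, ref).shape)                  # torch.Size([1, 16, 8, 32, 32])
```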

3. Self-conditioned Image Generation via Generating Representations

Title: Self-conditioned Image Generation via Generating Representations
Link: https://arxiv.org/abs/2312.03701
Authors: Tianhong Li, Dina Katabi, Kaiming He
Notes: Pay attention to the two image-generation metrics: FID and IS.
Abstract: This paper presents Representation-Conditioned image Generation (RCG), a simple yet effective image generation framework that sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from this representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256×256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state of the art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.
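
The two-stage sampling pipeline (draw a representation from the RDM, then condition a pixel generator on it) reduces to a few lines of data flow. Both models below are toy placeholders with assumed interfaces; they are not RCG's actual components, only a sketch of how the stages connect.

```python
import torch

class ToyRepDiffusion:
    """Placeholder for the representation diffusion model (RDM)."""
    def sample(self, batch_size, dim, device="cpu"):
        # A real RDM would run a reverse-diffusion loop; here we just draw noise.
        return torch.randn(batch_size, dim, device=device)

class ToyPixelGenerator:
    """Placeholder for the representation-conditioned pixel generator."""
    def sample(self, condition):
        b = condition.shape[0]
        return torch.rand(b, 3, 256, 256)   # fake 256x256 RGB images

@torch.no_grad()
def rcg_style_sample(rdm, pixel_gen, num_images=4, rep_dim=256):
    # Stage 1: sample a self-supervised representation (no human labels involved).
    reps = rdm.sample(batch_size=num_images, dim=rep_dim)
    # Stage 2: generate pixels conditioned on the sampled representation.
    return pixel_gen.sample(condition=reps)

imgs = rcg_style_sample(ToyRepDiffusion(), ToyPixelGenerator())
print(imgs.shape)   # torch.Size([4, 3, 256, 256])
```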

4. Generating Visually Realistic Adversarial Patch

Title: Generating Visually Realistic Adversarial Patch
Link: https://arxiv.org/abs/2312.03030
Authors: Xiaosen Wang, Kunyu Wang
Abstract: Deep neural networks (DNNs) are vulnerable to various types of adversarial examples, bringing huge threats to security-critical applications. Among these, adversarial patches have drawn increasing attention due to their good applicability for fooling DNNs in the physical world. However, existing works often generate patches with meaningless noise or patterns, making them conspicuous to humans. To address this issue, we explore how to generate visually realistic adversarial patches to fool DNNs. First, we analyze that a high-quality adversarial patch should be realistic, position-irrelevant, and printable in order to be deployed in the physical world. Based on this analysis, we propose an effective attack called VRAP to generate visually realistic adversarial patches. Specifically, VRAP constrains the patch to the neighborhood of a real image to ensure visual realism, optimizes the patch at the poorest position for position irrelevance, and adopts a Total Variation loss as well as gamma transformation to make the generated patch printable without losing information. Empirical evaluations on the ImageNet dataset demonstrate that the proposed VRAP exhibits outstanding attack performance in the digital world. Moreover, the generated adversarial patches can be disguised as scrawls or logos in the physical world to fool deep models without being detected, bringing significant threats to DNN-enabled applications.
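
To make the three ingredients in the abstract concrete (staying in the neighborhood of a real image, a Total Variation smoothness term, and a gamma transformation for printability), here is a minimal PyTorch sketch of one optimization step. The epsilon budget, step size, and the dummy convolution standing in for the attacked model are illustrative assumptions, not VRAP's actual settings.

```python
import torch

def total_variation(patch):
    # Total Variation: penalizes abrupt pixel changes so the patch prints smoothly.
    tv_h = (patch[:, :, 1:, :] - patch[:, :, :-1, :]).abs().mean()
    tv_w = (patch[:, :, :, 1:] - patch[:, :, :, :-1]).abs().mean()
    return tv_h + tv_w

def project_to_neighborhood(patch, real_image, eps=16 / 255):
    # Keep the patch within an L-inf ball around a real image (visual realism).
    return real_image + torch.clamp(patch - real_image, -eps, eps)

def gamma_transform(patch, gamma=2.2):
    # Simple gamma transformation, a rough stand-in for printer/camera response.
    return patch.clamp(0, 1) ** gamma

# One illustrative descent step on a stand-in objective. The Conv2d is a dummy
# substitute for the attacked model; a real attack would maximize its loss instead.
real = torch.rand(1, 3, 64, 64)
patch = real.clone().requires_grad_(True)
dummy_model = torch.nn.Conv2d(3, 1, kernel_size=3)
loss = dummy_model(gamma_transform(patch)).mean() + 0.1 * total_variation(patch)
loss.backward()
with torch.no_grad():
    patch -= 0.01 * patch.grad.sign()
    patch.copy_(project_to_neighborhood(patch, real).clamp(0, 1))
```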

5. Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Title: Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
Link: https://arxiv.org/abs/2312.03031
Authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez
Notes: Focuses on the end-to-end autonomous driving problem; note the use of the nuScenes dataset; GitHub code is provided.
Abstract: End-to-end autonomous driving has recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations of the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious about whether the continued pursuit of state-of-the-art results will yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner.
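
The abstract introduces a metric for whether predicted trajectories adhere to the road. The snippet below is only a rough approximation of such a check (the fraction of waypoints falling outside a drivable-area BEV mask); the grid range, resolution, and toy road are assumed, and this is not the paper's exact metric definition.

```python
import numpy as np

def off_road_rate(trajectory_xy, drivable_mask, bev_min=-50.0, resolution=0.5):
    """Fraction of predicted waypoints that fall outside the drivable area.

    trajectory_xy : (T, 2) waypoints in ego coordinates, in meters
    drivable_mask : (H, W) boolean BEV grid, True where the road is drivable
    """
    cols = ((trajectory_xy[:, 0] - bev_min) / resolution).astype(int)   # x -> column
    rows = ((trajectory_xy[:, 1] - bev_min) / resolution).astype(int)   # y -> row
    h, w = drivable_mask.shape
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    on_road = np.zeros(len(trajectory_xy), dtype=bool)
    on_road[inside] = drivable_mask[rows[inside], cols[inside]]
    return 1.0 - on_road.mean()

# Toy example: a 200x200 grid at 0.5 m/cell covering [-50, 50] m in both axes,
# with a straight 10 m-wide "road" along the y axis.
mask = np.zeros((200, 200), dtype=bool)
mask[:, 90:110] = True
traj = np.stack([np.zeros(6), np.arange(6) * 2.0], axis=1)   # drive straight ahead
print(off_road_rate(traj, mask))                             # 0.0
```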

Reposted from blog.csdn.net/m0_38068876/article/details/134856905