arXiv Academic Digest Notes, 12.8

Table of Contents

1. GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction
2. AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
3. Camera Height Doesn't Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation
4. On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
5. Towards Knowledge-driven Autonomous Driving
6. Receding Horizon Re-ordering of Multi-Agent Execution Schedules
References

1. GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction

Title: GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction
Link: https://arxiv.org/abs/2312.04479
Authors: Zhongchang Luo, Marion Robin, Pavan Vasishta
Abstract: Pedestrian trajectory prediction, vital for self-driving cars and socially-aware robots, is complicated by intricate interactions between pedestrians, their environment, and other Vulnerable Road Users. This paper presents GSGFormer, an innovative generative model adept at predicting pedestrian trajectories by considering these complex interactions and offering a plethora of potential modal behaviors. We incorporate a heterogeneous graph neural network to capture interactions between pedestrians, semantic maps, and potential destinations. The Transformer module extracts temporal features, while our novel CVAE-Residual-GMM module promotes diverse behavioral modality generation. Through evaluations on multiple public datasets, GSGFormer not only outperforms leading methods with ample data but also remains competitive when data is limited.
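
To make the phrase "diverse behavioral modality generation" more concrete, the sketch below shows how sampling from a Gaussian mixture can yield several distinct candidate positions for one pedestrian. It is not the paper's CVAE-Residual-GMM module; the function name `sample_gmm_2d`, the mixture parameters, and the diagonal-covariance assumption are all hypothetical and chosen only for illustration.

```python
import random


def sample_gmm_2d(weights, means, stds, n_samples, seed=0):
    """Draw 2D points from a diagonal-covariance Gaussian mixture.

    Hypothetical helper: each mixture component stands for one
    behavioral mode (e.g. "turn left", "go straight").
    Returns a list of (component_index, x, y) tuples.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Pick a mode according to its mixture weight.
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        mean_x, mean_y = means[k]
        std_x, std_y = stds[k]
        # Sample a position around that mode's mean.
        samples.append((k, rng.gauss(mean_x, std_x), rng.gauss(mean_y, std_y)))
    return samples


if __name__ == "__main__":
    # Assumed toy values: two modes, one to the left and one straight ahead.
    weights = [0.3, 0.7]
    means = [(-2.0, 4.0), (0.0, 6.0)]
    stds = [(0.3, 0.3), (0.2, 0.5)]
    for mode, x, y in sample_gmm_2d(weights, means, stds, n_samples=5):
        print(f"mode={mode} x={x:.2f} y={y:.2f}")
```

Because each sample first selects a component, repeated draws cover several modes instead of averaging them into a single, possibly implausible, trajectory.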

2. AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

Title: AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
Link: https://arxiv.org/abs/2312.03793
Authors: Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang
Comments: Project Page: this https URL
Abstract: Large-scale text-to-video (T2V) diffusion models have made great progress in recent years in terms of visual quality, motion, and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation, which decouples the video as one specific appearance with the corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control abilities for it. For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation to ensure the generated first frame is equal to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure other frames align well with the first frame. Empowered by the proposed methods, AnimateZero can successfully control the generation process without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. Detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications.

3. Camera Height Doesn't Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation

Title: Camera Height Doesn't Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation
Link: https://arxiv.org/abs/2312.04530
Authors: Genki Kinoshita, Ko Nishino
Abstract: Monocular depth estimators either require explicit scale supervision through auxiliary sensors or suffer from scale ambiguity, which renders them difficult to deploy in downstream applications. A possible source of scale is the sizes of objects found in the scene, but inaccurate localization makes them difficult to exploit. In this paper, we introduce a novel scale-aware monocular depth estimation method called StableCamH that does not require any auxiliary sensor or supervision. The key idea is to exploit prior knowledge of object heights in the scene but aggregate the height cues into a single invariant measure common to all frames in a road video sequence, namely the camera height. By formulating monocular depth estimation as camera height optimization, we achieve robust and accurate unsupervised end-to-end training. To realize StableCamH, we devise a novel learning-based size prior that can directly convert car appearance into its dimensions. Extensive experiments on KITTI and Cityscapes show the effectiveness of StableCamH, its state-of-the-art accuracy compared with related methods, and its generalizability. The training framework of StableCamH can be used for any monocular depth estimation method and will hopefully become a fundamental building block for further work.
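
The idea of folding many noisy object-height cues into one camera height can be illustrated with a few lines of arithmetic. This is a simplified sketch, not the StableCamH training procedure: it assumes each detected object provides a prior metric height, its height in the network's unscaled units, and the camera height in those same units. The function names and the choice of the median as the aggregator are assumptions made for this example.

```python
from statistics import median


def camera_height_from_objects(observations):
    """Aggregate per-object scale cues into one metric camera height.

    observations: list of (prior_height_m, predicted_height, predicted_camera_height),
    where the two predicted values share the same unknown scale.
    """
    estimates = []
    for prior_height_m, predicted_height, predicted_camera_height in observations:
        # Each object implies its own scale factor from prior vs. predicted size.
        scale = prior_height_m / predicted_height
        estimates.append(scale * predicted_camera_height)
    # The median damps outliers caused by poorly localized objects.
    return median(estimates)


def rescale_depth(depth_values, predicted_camera_height, metric_camera_height):
    """Convert unscaled depths to meters using the camera-height ratio."""
    scale = metric_camera_height / predicted_camera_height
    return [d * scale for d in depth_values]


if __name__ == "__main__":
    # Assumed toy observations from three cars in a sequence.
    obs = [(1.5, 0.5, 0.55), (1.6, 0.5, 0.5), (1.4, 0.5, 0.6)]
    print(camera_height_from_objects(obs))
    print(rescale_depth([10.0, 20.0], 0.5, 1.5))
```

A single object gives an unreliable scale, but the camera height is shared by every frame of a driving sequence, so pooling the estimates gives a steadier value than any one cue.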

4. On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Title: On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Link: https://arxiv.org/abs/2312.03777
Authors: Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim
Abstract: Recent advances in instruction tuning have led to the development of state-of-the-art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answering (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair, helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task, with only an 8.10% drop in performance compared to their visual counterparts, which dropped 99.73%. We also propose a new approach to real-world image classification, which we term query decomposition. By incorporating existence queries into our input prompt, we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.

5. Towards Knowledge-driven Autonomous Driving

Title: Towards Knowledge-driven Autonomous Driving
Link: https://arxiv.org/abs/2312.04316
Authors: Xin Li, Yeqi Bai, Pinlong Cai, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Botian Shi, Yong Liu, Liang He, Yu Qiao
Abstract: This paper explores the emerging knowledge-driven autonomous driving technologies. Our investigation highlights the limitations of current autonomous driving systems, in particular their sensitivity to data bias, difficulty in handling long-tail scenarios, and lack of interpretability. Conversely, knowledge-driven methods with the abilities of cognition, generalization, and life-long learning emerge as a promising way to overcome these challenges. This paper delves into the essence of knowledge-driven autonomous driving and examines its core components: dataset & benchmark, environment, and driver agent. By leveraging large language models, world models, neural rendering, and other advanced artificial intelligence techniques, these components collectively contribute to a more holistic, adaptive, and intelligent autonomous driving system. The paper systematically organizes and reviews previous research efforts in this area, and provides insights and guidance for future research and practical applications of autonomous driving. We will continually share the latest updates on cutting-edge developments in knowledge-driven autonomous driving along with the relevant valuable open-source resources at: https://github.com/PJLab-ADG/awesome-knowledge-driven-AD.

6. Receding Horizon Re-ordering of Multi-Agent Execution Schedules

Title: Receding Horizon Re-ordering of Multi-Agent Execution Schedules
Link: https://arxiv.org/abs/2312.04190
Authors: Alexander Berndt, Niels van Duijkeren, Luigi Palmieri, Alexander Kleiner, Tamás Keviczky
Comments: IEEE Transactions on Robotics (T-Ro) preprint, 17 pages, 32 figures
Abstract: The trajectory planning for a fleet of Automated Guided Vehicles (AGVs) on a roadmap is commonly referred to as the Multi-Agent Path Finding (MAPF) problem, the solution to which dictates each AGV’s spatial and temporal location until it reaches its goal without collision. When executing MAPF plans in dynamic workspaces, AGVs can be frequently delayed, e.g., due to encounters with humans or third-party vehicles. If the remainder of the AGVs keep following their individual plans, synchrony of the fleet is lost and some AGVs may pass through roadmap intersections in a different order than originally planned. Although this could reduce the cumulative route completion time of the AGVs, a change in the original ordering can generally cause conflicts such as deadlocks. In practice, synchrony is therefore often enforced by using a MAPF execution policy employing, e.g., an Action Dependency Graph (ADG) to maintain ordering. To safely re-order without introducing deadlocks, we present the concept of the Switchable Action Dependency Graph (SADG). Using the SADG, we formulate a comparatively low-dimensional Mixed-Integer Linear Program (MILP) that repeatedly re-orders AGVs in a recursively feasible manner, thus maintaining deadlock-free guarantees, while dynamically minimizing the cumulative route completion time of all AGVs. Various simulations validate the efficiency of our approach when compared to the original ADG method as well as robust MAPF solution approaches.
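
Why re-ordering can introduce a deadlock is easiest to see on a small dependency graph. The sketch below is a deliberately simplified model and not the paper's SADG or its MILP: actions are nodes, a dependency `(u, v)` means `v` must wait for `u`, and "switching" is modeled as simply flipping one edge. A switch is accepted only if the graph stays acyclic, since a cycle means every action on it waits for another. The names `is_acyclic` and `try_switch` are hypothetical.

```python
from collections import defaultdict, deque


def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Any node left unvisited lies on, or behind, a cycle.
    return visited == len(nodes)


def try_switch(nodes, edges, edge):
    """Flip one dependency; return the new edge list, or None if it would deadlock."""
    u, v = edge
    switched = [e for e in edges if e != edge] + [(v, u)]
    return switched if is_acyclic(nodes, switched) else None


if __name__ == "__main__":
    # Two AGVs, A and B, each with two sequential actions (assumed toy plan).
    nodes = ["a1", "a2", "b1", "b2"]
    plan = [("a1", "a2"), ("b1", "b2"), ("a1", "b2")]
    print(try_switch(nodes, plan, ("a1", "b2")))  # safe: B may now go first
    locked = plan + [("a2", "b1")]
    print(try_switch(nodes, locked, ("a1", "b2")))  # None: the flip creates a cycle
```

In the second case the flipped edge closes the loop a1, a2, b1, b2, a1, which is the kind of conflict an execution policy has to rule out before changing the crossing order.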

References

Computer Vision and Pattern Recognition Academic Digest [12.8]