Next-generation autonomous driving systems need large models, and the systematic research is here

This article introduces a survey of how multimodal large language models (MLLMs) can be integrated into next-generation autonomous driving systems.

With the emergence of large language models (LLMs) and vision foundation models (VFMs), multimodal AI systems that benefit from large models have the potential to perceive the real world comprehensively and make decisions the way humans do. In recent months, LLMs have attracted considerable attention in autonomous driving research. Despite this great potential, there is still a lack of work that elucidates in detail the key challenges, opportunities, and future research directions for LLMs in driving systems.

In this article, researchers from Tencent Maps, Purdue University, UIUC, and the University of Virginia conduct a systematic survey of this field. The study first introduces the background of multimodal large language models (MLLMs), the progress in building multimodal models on top of LLMs, and a review of the history of autonomous driving. It then gives an overview of existing MLLM tools for driving, traffic, and mapping systems, as well as existing datasets. The study also summarizes work from the first WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), the first workshop devoted to applying LLMs to autonomous driving. To further promote the development of this field, the study discusses how MLLMs can be applied in autonomous driving systems and highlights important open problems that academia and industry need to solve together.


  • Review link: https://arxiv.org/abs/2311.12320 
  • Workshop link: https://llvm-ad.github.io/ 
  • Github link: https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving


Review structure

Multimodal large language models (MLLMs) have attracted widespread attention recently. They combine the reasoning capabilities of LLMs with image, video, and audio data, enabling them to perform various tasks more effectively through multimodal alignment, including image classification, aligning text with the corresponding videos, and speech detection. Furthermore, some studies have demonstrated that LLMs can handle simple tasks in robotics. However, the integration of MLLMs into autonomous driving has been slow. One cannot help but ask: do LLMs such as GPT-4, PaLM-2, and LLaMA-2 have the potential to improve existing autonomous driving systems?

In this review, the researchers argue that integrating LLMs into autonomous driving can bring a significant paradigm shift in driving perception, motion planning, human-vehicle interaction, and motion control, providing user-centered, more adaptable, and more trustworthy future transportation solutions. For perception, an LLM can use Tool Learning to call external APIs and access real-time information sources such as high-definition maps, traffic reports, and weather data, so that the vehicle gains a fuller understanding of its surroundings. After ingesting real-time traffic data, an autonomous vehicle can use the LLM to reason about congested routes and suggest alternative paths, improving both efficiency and driving safety. For motion planning and human-vehicle interaction, LLMs can enable user-centered communication, allowing passengers to express their needs and preferences in everyday language. For motion control, LLMs first allow control parameters to be customized to the driver's preferences, delivering a personalized driving experience; they can also provide transparency by explaining each step of the motion control process to the user. The review predicts that in future SAE L4-L5 autonomous vehicles, passengers will be able to communicate their requests with words, gestures, and even eye gaze while driving, with the MLLM providing real-time in-cabin and driving feedback through integrated visual displays or voice responses.
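To make the Tool Learning idea above more concrete, here is a minimal sketch of how an LLM-based perception and planning module might call external information sources. All function names (get_traffic_report, get_weather, query_llm) and return values are hypothetical placeholders used only to illustrate the pattern described in the review; they are not APIs from any specific system.

```python
# Minimal sketch of LLM "Tool Learning" for driving perception and planning.
# All tool names, return values, and query_llm() are hypothetical placeholders.
import json

def get_traffic_report(route_id: str) -> dict:
    """Hypothetical stand-in for a real-time traffic API."""
    return {"route": route_id, "congestion_level": "heavy", "delay_min": 18}

def get_weather(lat: float, lon: float) -> dict:
    """Hypothetical stand-in for a weather service."""
    return {"condition": "rain", "visibility_km": 2.5}

def query_llm(prompt: str) -> str:
    """Placeholder: replace with a call to an actual LLM backend."""
    return "Stub answer: congestion and rain detected; suggest the alternative route."

def perceive_and_plan(route_id: str, lat: float, lon: float) -> str:
    # Gather real-time context through external tools, then let the LLM reason over it.
    context = {
        "traffic": get_traffic_report(route_id),
        "weather": get_weather(lat, lon),
    }
    prompt = (
        "You are the reasoning module of an autonomous vehicle.\n"
        f"Real-time context: {json.dumps(context)}\n"
        "Decide whether to keep the current route or switch to an alternative, "
        "and explain the decision in one sentence."
    )
    return query_llm(prompt)

print(perceive_and_plan(route_id="A-12", lat=31.23, lon=121.47))
```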


The development history of autonomous driving and multi-modal large language models


Research summary of MLLMs for autonomous driving: the LLM backbones used in current work mainly include LLaMA, Llama 2, GPT-3.5, GPT-4, Flan5XXL, and Vicuna-13b. In the table, FT, ICL, and PT refer to fine-tuning, in-context learning, and pre-training. For links to the literature, see the GitHub repo: https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

To build a bridge between autonomous driving and LLMs, the researchers organized the first Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD) at the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). The workshop aims to strengthen collaboration between academic researchers and industry professionals in exploring the possibilities and challenges of applying multimodal large language models to autonomous driving. LLVM-AD will also further promote the release of open-source, practical traffic language understanding datasets.

A total of nine papers were accepted to the first LLVM-AD workshop. Several revolve around multimodal large language models in autonomous driving, focusing on integrating LLMs into user-vehicle interaction, motion planning, and vehicle control, while others explore new applications of LLMs for human-like interaction and decision-making in autonomous vehicles. For example, "Drive Like a Human" and "Drive as You Speak" explore LLM frameworks for interpreting and reasoning about complex driving scenarios in a way that imitates human behavior. "Human-Centric Autonomous Systems With LLMs" emphasizes the importance of designing LLMs with the user at the center, using LLMs to interpret user commands; this represents a major shift toward human-centric autonomous systems. Beyond LLM-based methods, the workshop also covered pure-vision approaches and proposed innovative data processing and evaluation methods. For example, NuScenes-MQA introduces a new annotation scheme for autonomous driving datasets. Collectively, these papers demonstrate progress in integrating language models and advanced techniques into autonomous driving, paving the way for more intuitive, efficient, and human-centered autonomous vehicles.


For future development, this study proposes the following research directions:

1. New datasets for multimodal large language models in autonomous driving

        Despite the success of large language models in language understanding, applying them to autonomous driving remains challenging, because these models need to integrate and understand inputs from different modalities such as panoramic images, 3D point clouds, and HD maps. Current limitations in data size and quality mean that existing datasets cannot fully address these challenges. Furthermore, visual-language datasets annotated on top of early open-source datasets such as NuScenes may not provide a strong enough baseline for visual language understanding in driving scenarios. There is therefore an urgent need for new, large-scale datasets covering a wide range of traffic and driving scenarios to compensate for the long-tail (imbalance) problem in the distributions of previous datasets, so that the performance of these models in autonomous driving applications can be tested and improved effectively.

2. Hardware support for large language models in autonomous driving

        Different functions in self-driving cars have different hardware requirements. Using an LLM inside the vehicle for driving planning or vehicle control requires real-time processing and low latency to ensure safety, which raises computational requirements and affects power consumption. If the LLM is deployed in the cloud, the bandwidth for data exchange becomes another critical safety factor. In contrast, using an LLM for navigation planning or for handling commands unrelated to driving (such as in-car music playback) does not demand high query throughput or strict real-time performance, making remote services a feasible option. In the future, LLMs for autonomous driving can be compressed through knowledge distillation to reduce computational requirements and latency; there is still considerable room for progress in this area.
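The knowledge-distillation route mentioned above can be sketched with a standard soft-label distillation loss. The code below is a generic, minimal PyTorch sketch; the temperature, loss weighting, and tensor shapes are illustrative assumptions, not a recipe given in the review.

```python
# Minimal sketch of knowledge distillation for compressing a large driving LLM
# into a smaller on-vehicle model. Models, data, and hyperparameters are
# placeholders; this only shows the standard soft-label distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: match the student's output distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random tensors just to show the shapes (batch of 4, vocab of 10).
s, t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(distillation_loss(s, t, y))
```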

3. Using large language models to understand high-definition (HD) maps

        HD maps play a vital role in autonomous driving because they provide essential information about the physical environment in which the vehicle operates. The semantic layer of an HD map is especially important, as it captures the meaning and contextual information of that environment. To encode this information effectively into the next generation of LLM-driven autonomous driving, new models are needed to map these multimodal features into the language space. Tencent has developed THMA, an AI-based automatic HD-map annotation system built on active learning, which can produce and label HD maps at the scale of hundreds of thousands of kilometers. To promote development in this area, Tencent has also proposed the MAPLM dataset based on THMA, which includes panoramic images, 3D lidar point clouds, and context-based HD map annotations, along with a new question-answering benchmark, MAPLM-QA.
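To illustrate what pairing multimodal map data with question answering might look like, here is a hypothetical sketch of a single sample. The field names and structure are assumptions made purely for illustration and do not describe the actual MAPLM release format.

```python
# Illustrative sketch of a multimodal map-understanding QA sample.
# Field names and structure are assumptions for illustration; they do NOT
# describe the real MAPLM data format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MapQASample:
    panoramic_images: List[str]          # paths to surround-view camera images
    lidar_point_cloud: str               # path to a 3D point-cloud file
    hd_map_annotation: dict              # semantic-layer context (lanes, signs, ...)
    questions: List[str] = field(default_factory=list)
    answers: List[str] = field(default_factory=list)

sample = MapQASample(
    panoramic_images=["cam_front.jpg", "cam_left.jpg", "cam_right.jpg"],
    lidar_point_cloud="sweep_0001.bin",
    hd_map_annotation={"lane_count": 3, "speed_limit_kph": 60, "has_crosswalk": True},
    questions=["How many lanes does the current road have?"],
    answers=["Three lanes."],
)
print(sample.questions[0], "->", sample.answers[0])
```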

4. Large language models in human-vehicle interaction

        Human-vehicle interaction and understanding human driving behavior also pose major challenges for autonomous driving. Human drivers often rely on nonverbal signals, such as slowing down to yield or using body movements to communicate with other drivers or pedestrians, and these signals play a vital role in on-road communication. Many past accidents involving self-driving systems occurred because the autonomous vehicle behaved in ways other drivers did not expect. In the future, MLLMs will be able to integrate rich contextual information from multiple sources and analyze a driver's gaze, gestures, and driving style to better understand these social signals and plan efficiently. By estimating the social signals of other road users, an LLM can improve the decision-making ability and overall safety of autonomous vehicles.

5. Personalized autonomous driving

        As autonomous vehicles mature, an important question is how they adapt to each user's individual driving preferences. There is growing consensus that self-driving cars should be able to mimic the driving style of their users. To achieve this, autonomous driving systems need to learn and integrate user preferences across many aspects, such as navigation, vehicle maintenance, and entertainment. The instruction-tuning and in-context learning capabilities of LLMs make them well suited to integrating user preferences and driving-history information into autonomous vehicles to provide a personalized driving experience.
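As an illustration of how in-context learning could carry user preferences into the vehicle, the sketch below assembles a preference profile and recent driving history into a prompt for an LLM. The profile fields, history entries, and prompt wording are hypothetical; this only shows the general pattern.

```python
# Minimal sketch of in-context personalization: user preferences and driving
# history are injected into the prompt so an LLM can tailor its suggestions.
# Profile fields and prompt wording are hypothetical placeholders.
def build_personalized_prompt(profile: dict, history: list, request: str) -> str:
    pref_lines = "\n".join(f"- {k}: {v}" for k, v in profile.items())
    hist_lines = "\n".join(f"- {h}" for h in history)
    return (
        "You are an in-car assistant for an autonomous vehicle.\n"
        f"User preferences:\n{pref_lines}\n"
        f"Recent driving history:\n{hist_lines}\n"
        f"User request: {request}\n"
        "Respond with a plan that respects the stated preferences."
    )

prompt = build_personalized_prompt(
    profile={"driving_style": "smooth, gentle braking", "route_choice": "avoid highways"},
    history=["Chose a scenic route last week", "Asked to lower cabin temperature"],
    request="Take me to the office, I'm not in a hurry.",
)
print(prompt)  # pass this to an LLM backend of your choice
```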


Summary

        Autonomous driving has been a focus of attention for years and has attracted substantial venture investment. Integrating LLMs into autonomous vehicles presents unique challenges, but overcoming them will significantly enhance existing autonomous systems. It is foreseeable that LLM-powered smart cockpits will be able to understand driving scenarios and user preferences and build deeper trust between the vehicle and its occupants. In addition, autonomous driving systems that deploy LLMs will be better able to handle ethical dilemmas involving trade-offs between the safety of pedestrians and that of vehicle occupants, promoting decision-making that is more likely to be ethical in complex driving scenarios. This article integrates insights from the WACV 2024 LLVM-AD workshop committee members and aims to inspire researchers to contribute to the development of next-generation autonomous vehicles powered by LLM technology.

