A key step toward putting large models "in the car": the world's first full-stack open-source language + autonomous driving dataset is here

Source | Machine Heart ID | almosthuman2014

When it comes to recent tech-world news, few topics have generated more gossip than the feud between Musk and Zuckerberg.

A few days ago, Musk livestreamed a drive over to look for Zuckerberg. He never found him, but that hardly mattered: the visit was clearly a pretext to demonstrate Tesla's latest FSD V12 autonomous driving feature.

But just as Musk was proudly introducing it to viewers, the autonomous driving system slipped up. It made a wrong judgment at an intersection, forcing Musk to intervene manually. Musk remarked, somewhat awkwardly, that he would need to "feed the network more relevant data."


When Musk demonstrated FSD V12, the only manual intervention occurred at an intersection with a complex traffic environment

Evidently, even a system as capable as Tesla's FSD V12 still lacks sufficient decision-making and reasoning ability in complex scenarios. That naturally raises the question: is there a way to fix this?

OpenDriveLab at the Shanghai AI Laboratory believes that introducing today's popular large models may be one answer.


DriveLM | Motivation

Large models have proven their power on natural language processing tasks, but that power depends on massive amounts of data. In autonomous driving, the infrastructure for large-scale data collection from mass-produced vehicles is still being built, and competition among carmakers makes open sharing of such datasets impractical.

Looked at from another angle, however, the reasoning paradigms and common sense embedded in large language models are universal in the real world. By leveraging existing mature large language models and massive text corpora, together with prompting techniques such as CoT (Chain of Thought) and GoT (Graph of Thought), and standing on the shoulders of giants, we can raise an autonomous driving system's ability to handle complex situations to a higher level.

Accordingly, OpenDriveLab at the Shanghai AI Laboratory, the Autonomous Vision Group at the University of Tübingen, and the Tübingen AI Center jointly released DriveLM, the world's first full-stack open-source language + autonomous driving dataset. It aims to use large language models and massive natural language data to build safe, accurate, and explainable autonomous driving systems for complex scenarios, pushing past the current ceiling of autonomous driving reasoning capability.

DriveLM is also an important component of DriveAGI, proposed by OpenDriveLab. OpenDriveLab will follow up with a series of language + autonomous driving competitions built around DriveLM to promote exchange and cross-pollination between the natural language processing and autonomous driving communities.


DriveAGI overall framework proposed by OpenDriveLab

Today, Machine Heart takes a closer look at how this language-integrated autonomous driving dataset builds a bridge between large language models and autonomous driving systems, helping the latter gain more powerful and explainable reasoning ability.

Repository: https://github.com/OpenDriveLab/DriveLM

Page: https://opendrivelab.github.io/DriveLM

Hugging Face: https://huggingface.co/datasets/OpenDrive/DriveLM

DriveLM | Features

  • Structured reasoning and Graph-of-Thoughts evaluation


DriveLM provides quantitative standards for evaluating reasoning ability, addressing the current difficulty of quantitatively assessing a model's structured reasoning or Graph-of-Thoughts capability. DriveLM provides a complete logical chain from object recognition and object motion-state judgment to future trajectory prediction and ego-vehicle motion planning, ensuring that every step of the decision-making process is reasonable and interpretable.
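
To make the idea concrete, here is a minimal, hypothetical sketch in Python of such a linked question chain for one keyframe. The field names, questions, and answers are invented for illustration and do not reproduce the actual DriveLM schema.

```python
# A hypothetical illustration (not the actual DriveLM schema) of how a
# perception -> prediction -> planning question chain for one keyframe
# might be organized, with each step building on the previous answer.
qa_chain = [
    {
        "stage": "perception",
        "question": "What are the important objects in the front camera view?",
        "answer": "A black sedan in the ego lane and a pedestrian near the crosswalk.",
    },
    {
        "stage": "prediction",
        "question": "What is the pedestrian near the crosswalk likely to do next?",
        "answer": "The pedestrian is likely to start crossing the road.",
    },
    {
        "stage": "planning",
        "question": "What action should the ego vehicle take?",
        "answer": "Slow down and yield until the pedestrian has crossed.",
    },
]

# Walking the chain in order preserves the step-by-step reasoning that
# DriveLM is designed to make explicit and evaluable.
for step in qa_chain:
    print(f"[{step['stage']}] Q: {step['question']}")
    print(f"    A: {step['answer']}")
```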

  • Full-stack data coverage


DriveLM's annotations cover the perception, prediction, planning, and other modules of the autonomous driving stack, providing full-stack language annotation for the entire system.

  • Hypothetical reasoning


DriveLM's annotations include counterfactual reasoning ("What if..."), which helps train models to anticipate events that have not yet occurred.
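
As a toy illustration, a counterfactual annotation might look something like the following; the field names and wording are made up for this sketch, not taken from the released data.

```python
# A made-up example of a counterfactual ("What if...") annotation.
what_if_qa = {
    "stage": "prediction",
    "question": "What if the white truck ahead suddenly brakes?",
    "answer": "The ego vehicle would need to brake promptly and keep a safe "
              "following distance to avoid a rear-end collision.",
}
print(what_if_qa["question"], "->", what_if_qa["answer"])
```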

  • Driving goal decomposition


DriveLM provides scene-level global driving goal descriptions and corresponding frame-level driving goal descriptions, and introduces a driving goal decomposition task. By breaking complex, high-level driving tasks into simpler, more concrete sub-tasks, the autonomous driving system can learn to handle more complex and changeable traffic environments.
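
Below is a rough sketch, again with invented content rather than the DriveLM format, of how a scene-level goal might be decomposed into frame-level sub-goals.

```python
# A hypothetical scene-level goal and its frame-level decomposition;
# the structure and wording are illustrative only.
scene_goal = "Turn left at the next intersection and merge into the middle lane."

frame_goals = {
    "frame_01": "Keep to the current lane and decelerate as the intersection approaches.",
    "frame_03": "Yield to oncoming traffic while waiting in the left-turn pocket.",
    "frame_05": "Complete the left turn and align with the middle lane.",
}

# Decomposing the scene-level goal into simpler frame-level sub-goals gives
# the planner a sequence of concrete, checkable intermediate objectives.
for frame, sub_goal in frame_goals.items():
    print(f"{frame}: {sub_goal}  (toward: {scene_goal})")
```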

DriveLM | Data and Annotation


Distribution of different categories of questions in DriveLM annotation

DriveLM is built on the nuScenes autonomous driving dataset and is centered on keyframe descriptions plus question-and-answer pairs (Description + Q&A).

The Q&A pairs fall into three main categories: Perception, Prediction, and Planning. Perception questions ask about the position or motion state of objects relative to the ego vehicle; Prediction questions ask about the possible future behaviors and states of vehicles or pedestrians; Planning questions ask about the actions the ego vehicle can take.

The dataset is split into a training set of 697 scenes and a validation set of 150 scenes. Each scene contains roughly 40 frames (sampled at about 2 Hz), from which annotators select 4-8 keyframes to annotate.

For more details on the dataset, see the DriveLM demo data released by OpenDriveLab on GitHub.
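
For readers who want to poke at the demo data programmatically, the following is a minimal sketch of how one might count questions per category after downloading the files. The file name "demo_data.json" and the scene → keyframe → QA layout are assumptions for illustration; consult the repository's README for the actual file names and schema.

```python
import json
from collections import Counter

# Load the demo annotations; the file name and nested layout below are
# assumptions, not the confirmed DriveLM format.
with open("demo_data.json", "r", encoding="utf-8") as f:
    scenes = json.load(f)

category_counts = Counter()
for scene_id, scene in scenes.items():
    for frame_id, frame in scene.get("key_frames", {}).items():
        for qa in frame.get("QA", []):
            # Tally how many questions fall under perception / prediction / planning.
            category_counts[qa.get("category", "unknown")] += 1

print(category_counts)
```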
