Interpreting XNet: Xpeng Motors' New-Generation Perception Architecture


Author | Zhang Mengyu

At the recent CVPR conference, Xpeng Motors, the only Chinese new carmaker invited to give a talk, shared with attendees its experience in bringing assisted-driving systems to mass production in China.

As Xpeng Motors' latest-generation perception architecture, XNet plays a role in mass production that should not be underestimated.

The author had the opportunity to interview Patrick, chief perception engineer at Xpeng's Autopilot Center, to learn more about XNet's performance and architecture, as well as the work Xpeng's autonomous driving team put into building it.

1. Performance improvements achieved by XNet

XNet upgrades the perception architecture and delivers better performance in three main respects.

1.1 Strong environmental perception: generating a "high-precision map" in real time

[Figure: XNet's real-time perception output while the vehicle passes through a roundabout]

XNet can construct a "high-precision map" of the surrounding environment in real time. In the figure above, the vehicle is passing through a roundabout; the lane lines shown do not come from a high-precision map, but from XNet's perception output. Beyond lane lines, XNet also outputs stop lines, sidewalks, drivable areas, and so on. This is one of Xpeng's core capabilities for handling unmapped scenarios and for delivering high-level urban assisted driving in the future.

1.2 Stronger 360-degree perception: better negotiation with other traffic and a higher lane-change success rate

The previous-generation perception architecture struggled with blind spots. In the areas closest to the ego vehicle, especially around the lower edges of nearby vehicles, detection quality was often poor. XNet adopts a multi-camera, multi-frame, early-fusion perception scheme: it can infer a vehicle's 3D position in the BEV view from vehicle-body cues in the image, which mitigates the cameras' limited vertical field of view, and it can fuse information from multiple cameras at the same moment more effectively, especially for objects spanning the fields of view of two cameras, thereby avoiding perception blind spots.

In addition, with a video stream carrying temporal information as input, XNet's ability to recognize objects near the ego vehicle is greatly improved, and it can detect nearby objects more stably. As a result, the autonomous driving system negotiates with surrounding traffic more effectively and the vehicle's lane-change success rate is higher.

1.3 More accurate estimation of dynamic objects' speed and intent, greatly improving negotiation ability; redundant motion perception for higher safety in urban scenes

XNet can not only detect an object's position, but also estimate its speed and even predict its future trajectory. Millimeter-wave radar usually has difficulty measuring the speed of vehicles cutting across in front of the ego vehicle, whereas XNet can estimate this speed easily, significantly complementing the radar. In scenarios where millimeter-wave radar excels, XNet also provides redundancy, improving overall safety in urban scenes.

2. The architecture of XNet

Why can XNet achieve better performance? Patrick introduced the specific architecture and workflow of XNet.

XNet adopts a multi-camera, multi-frame approach: the video streams from all cameras are fed directly into a single large deep learning network, which performs multi-frame temporal early fusion and outputs, in the BEV view, 4D information for dynamic targets (such as the size, distance, position, speed, and behavior prediction of vehicles, two-wheelers, etc.) and 3D information for static objects (such as lane lines and the position of road edges).

As shown below.

[Figure: XNet network architecture]

Each input camera image passes through a network backbone and a network neck (specifically, a BiFPN) to generate multi-scale feature maps in image space.

These feature maps then pass through the most critical part of XNet, the BEV view transformer, to form a single-frame feature map in the BEV view.

Single-frame feature maps from different times are then fused temporally and spatially according to the ego vehicle's pose, forming a spatio-temporal feature map in the BEV view.

These spatio-temporal feature maps are the basis for decoding and inference in BEV. Two decoders are attached after the spatio-temporal feature map to decode and output XNet's dynamic and static results: dynamic results include pose, size, velocity, and so on, while static results include road boundaries, lane markings, and so on.

At this point, the perception part is basically completed.
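
To make the data flow concrete, below is a minimal PyTorch sketch of a pipeline with this shape: per-camera backbone-plus-neck features, a BEV view transformer, temporal fusion of BEV frames, and separate dynamic and static decoders. All module sizes, the learned-query cross-attention view transform, and the simple concatenation-based temporal fusion are illustrative assumptions on my part, not Xpeng's actual implementation.

```python
# Illustrative sketch of an XNet-style pipeline (assumptions throughout, not Xpeng's code).
import torch
import torch.nn as nn


class PerCameraEncoder(nn.Module):
    """Backbone + neck stand-in: turns one camera image into an image-space feature map."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img):
        return self.net(img)


class BEVViewTransformer(nn.Module):
    """Stand-in for the BEV view transformer: cross-attention from a learned BEV query
    grid to the flattened image-space features of all cameras."""
    def __init__(self, ch=64, bev_size=32):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, ch))
        self.attn = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
        self.bev_size = bev_size

    def forward(self, cam_feats):  # cam_feats: (B, N_cam, C, H, W)
        b, n, c, h, w = cam_feats.shape
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, kv, kv)
        return bev.transpose(1, 2).reshape(b, c, self.bev_size, self.bev_size)


class XNetSketch(nn.Module):
    def __init__(self, ch=64, bev_size=32, n_frames=4):
        super().__init__()
        self.encoder = PerCameraEncoder(ch)
        self.view_transformer = BEVViewTransformer(ch, bev_size)
        # Temporal fusion stand-in: concatenate BEV frames along channels and mix them.
        self.temporal_fusion = nn.Conv2d(ch * n_frames, ch, 1)
        self.dynamic_decoder = nn.Conv2d(ch, 10, 1)  # e.g. heatmap + size/velocity channels
        self.static_decoder = nn.Conv2d(ch, 4, 1)    # e.g. lane line / road edge channels

    def forward(self, frames):  # frames: (B, T, N_cam, 3, H, W), oldest first
        bev_frames = []
        for i in range(frames.shape[1]):
            cam_feats = torch.stack(
                [self.encoder(frames[:, i, j]) for j in range(frames.shape[2])], dim=1)
            bev_frames.append(self.view_transformer(cam_feats))
        # In the real system each past BEV frame would first be warped into the current
        # frame using the ego pose; here we simply concatenate along channels.
        fused = self.temporal_fusion(torch.cat(bev_frames, dim=1))
        return self.dynamic_decoder(fused), self.static_decoder(fused)


# Dummy forward pass: 2 samples, 4 frames, 6 cameras, 128x128 images.
dyn, stat = XNetSketch()(torch.randn(2, 4, 6, 3, 128, 128))
print(dyn.shape, stat.shape)  # torch.Size([2, 10, 32, 32]) torch.Size([2, 4, 32, 32])
```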

3. The team's efforts to build XNet

Realizing the above architecture is not easy. Across four aspects, namely data collection, labeling, training, and deployment, Xpeng's autonomous driving team has done a great deal of work to optimize the whole pipeline.

3.1 Data collection

Real vehicle data and simulation data are two major sources of data.

Xpeng has nearly 100,000 customer vehicles on the road, all of which can be used for data collection. As shown in the figure below, the on-vehicle model reports situations that the autonomous driving system does not yet handle well. For these problems, Xpeng's autonomous driving team sets corresponding triggers on the vehicle side to collect the relevant data in a targeted way. The data is then uploaded to the cloud, filtered, and labeled for model training and subsequent OTA upgrades.

[Figure: closed loop of vehicle-side triggers, cloud upload, labeling, model training, and OTA]
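
Below is a minimal sketch of what such a vehicle-side trigger could look like. The low-confidence condition, threshold, and clip-window logic are hypothetical illustrations of the mechanism described above, not Xpeng's on-board software.

```python
# Hypothetical vehicle-side data-collection trigger (illustrative only).
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float               # seconds
    perception_confidence: float   # 0..1, hypothetical per-frame score


@dataclass
class Trigger:
    name: str
    confidence_threshold: float = 0.4
    clip_seconds: float = 10.0     # how much context to keep around the event

    def check(self, frame: Frame) -> bool:
        return frame.perception_confidence < self.confidence_threshold


def collect_clips(frames, trigger):
    """Return (start, end) time windows that should be uploaded for labeling."""
    clips = []
    for f in frames:
        if trigger.check(f):
            half = trigger.clip_seconds / 2
            clips.append((f.timestamp - half, f.timestamp + half))
    return clips


frames = [Frame(t * 0.1, 0.9 if t != 42 else 0.2) for t in range(100)]
print(collect_clips(frames, Trigger("low_perception_confidence")))
# -> one ~10 s window centered on the single low-confidence frame
```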

In addition, simulation data is an important data source. Wu Xinzhou gave an example at the 1024 Tech Day: while driving, a large truck ahead caught fire because of friction between its tires and the road surface. Such a situation is extremely rare in real life, and collecting it from real vehicles is very difficult; even with Xpeng's nearly 100,000 production cars, it might take several years to gather enough data.

For such situations, simulation data can play a valuable auxiliary role. As shown in the figure below, Xpeng's autonomous driving team can use Unreal Engine 5 to generate thousands of similar cases based on real-vehicle data, simulating various situations in which wheels detach.

[Figure: simulated wheel-detachment scenes generated with Unreal Engine 5]

Of course, simulation data cannot be used indiscriminately; it needs to be as close to reality as possible. Xpeng's autonomous driving team ensures the authenticity of its simulation data mainly in two respects: realistic lighting and shadows, and realistic scenes.

The team uses the technologically advanced Unreal Engine 5 as its rendering engine, so that simulated images look realistic rather than cartoonish, ensuring "realistic lighting and shadows".

In addition, when generating simulation data, the team first identifies the scenarios where the model is weak, builds digital twins of these scenes, and then makes targeted modifications on that basis. Specifically, 4D auto-labeling can extract 4D structured information from the real scene, including the 4D trajectories of dynamic objects and the 3D layout of the static scene, and the rendering engine then renders and populates this structured information to form simulated images. In this way, the generated scenes simulate situations that could occur in the real world, ensuring "realistic scenes".
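
As an illustration of the kind of "4D structured information" such a pipeline might hand to a renderer, here is a hypothetical schema sketch; the field names and structure are assumptions, not Xpeng's actual data format.

```python
# Hypothetical schema for a digital-twin scene description (illustrative only).
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DynamicTrack:
    object_id: int
    category: str                                   # e.g. "truck", "two_wheeler"
    # One (t, x, y, z, yaw) sample per frame: the 4D trajectory of the object.
    trajectory: List[Tuple[float, float, float, float, float]]
    size_lwh: Tuple[float, float, float]            # length, width, height in meters


@dataclass
class StaticLayout:
    lane_lines: List[List[Tuple[float, float, float]]]  # 3D polylines
    road_edges: List[List[Tuple[float, float, float]]]


@dataclass
class SceneTwin:
    """Digital twin of one weak scenario, ready to be rendered (and perturbed) in UE5."""
    scene_id: str
    dynamic_tracks: List[DynamicTrack] = field(default_factory=list)
    static_layout: StaticLayout = field(default_factory=lambda: StaticLayout([], []))


twin = SceneTwin(
    scene_id="highway_wheel_detachment_001",
    dynamic_tracks=[DynamicTrack(1, "truck",
                                 [(0.0, 50.0, 0.0, 0.0, 0.0), (0.1, 49.2, 0.0, 0.0, 0.0)],
                                 (12.0, 2.5, 3.8))],
)
print(len(twin.dynamic_tracks), twin.scene_id)
```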

3.2 Labeling

Training XNet requires 500,000 to 1 million short video clips, containing on the order of hundreds of millions or even billions of dynamic targets. At current manual-labeling efficiency, it would take a team of 1,000 people two years to label the data required for training XNet.

Xpeng Motors has therefore built a fully automatic labeling system whose labeling efficiency is nearly 45,000 times that of manual labor; it can finish the labeling work in just 16.7 days. In addition, the system delivers higher quality, more complete information (3D position, size, speed, trajectory, etc.), and greater output (a peak daily output of 30,000 clips, equivalent to 15 nuScenes datasets).
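
As a rough consistency check (taking a year as 365 calendar days, my assumption rather than a figure from the article), 1,000 annotators working for two years represent about 730,000 person-days of labeling, so finishing the same workload in 16.7 days implies

$$
\frac{1{,}000 \times 2 \times 365\ \text{person-days}}{16.7\ \text{days}} \approx 43{,}700
$$

which is in line with the quoted figure of nearly 45,000 times the throughput of a single human annotator.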

How does the fully automatic labeling system achieve high efficiency?

First, moving from manual to automatic labeling changes the role of humans. In manual labeling, people are the annotators; in automatic labeling, people are quality inspectors who only identify and correct the cases where the automatic labeling system performs poorly, so human efficiency improves by orders of magnitude.

Second, in automatic labeling, the training data, which makes up the majority of the dataset, is quality-checked automatically; only the evaluation set is quality-checked manually, reducing the amount of data requiring manual work by orders of magnitude.

Finally, automatic labeling shifts the output bottleneck from human resources to computing resources. In the cloud, compute can be expanded easily, and large amounts of it can be flexibly deployed on demand.

3.3 Training

Xpeng and Alibaba Cloud have jointly built "Fuyao", the largest autonomous-driving computing center in China. Fuyao's computing power reaches 600 PFLOPS, equivalent to a training platform composed of thousands of Orin chips. With Fuyao's computing power, Xpeng's autonomous driving team adopted cloud-based, large-scale multi-machine training, shortening XNet's training time from 276 days to 11 hours, a 602-fold improvement in training efficiency.

As shown in the figure below, training the full XNet with a single-machine, full-precision setup would take 276 days. Xpeng's autonomous driving team first reduced the single-machine training time from 276 days to 32 days by optimizing the training scheme to use fewer epochs, optimizing the network structure and operators, and customizing mixed-precision training for the Transformer. The team then made full use of cloud computing power, switching from single-machine training to 80-machine parallel training and shortening the training time from 32 days to 11 hours.

[Figure: XNet training time reduced from 276 days to 11 hours]
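
The quoted 602-fold speedup follows directly from these numbers, and it decomposes into the two stages described above:

$$
\frac{276 \times 24\ \text{h}}{11\ \text{h}} \approx 602,
\qquad
\frac{276}{32} \times \frac{32 \times 24\ \text{h}}{11\ \text{h}} \approx 8.6 \times 70 \approx 602
$$

where the first factor (roughly 8.6x) comes from the single-machine optimizations and the second (roughly 70x) from the switch to 80-machine parallel training.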

In addition, the team introduced a Golden Backbone model to decouple improvements to the base network from model releases, further improving training efficiency. Specifically, as shown in the figure below, the Golden Backbone forms a closed loop with data mining, automatic labeling, and the autonomous-driving supercomputing platform. Within this loop, as long as data keeps flowing in, the Golden Backbone's capabilities keep improving. When a model needs to be released, only some optimization on top of the Golden Backbone is required, rather than training from scratch.

[Figure: closed loop formed by the Golden Backbone with data mining, automatic labeling, and the supercomputing platform]

3.4 Deployment

At the deployment level, Xpeng's autonomous driving team has accumulated considerable experience. After the team's optimization, the Transformer's computation time was reduced to 5% of the original, and a model that originally would have required 122% of Orin-X's computing power can now run with only 9% of it.

What are the highlights of the team's deployment work? According to Patrick, it mainly comes down to three steps.

"The first is the rewriting of the Transformers layer. After analyzing the running time of the model board, we found that the original version of the Transformers layer took up a lot of time. Therefore, we tried many variant construction methods of Transformers and found a model that worked well. Run the faster version on the board."

"Then there is the pruning of the network backbone. After rewriting Transformers, we found that the network backbone (backbone) is our performance bottleneck. So we pruned the network backbone to reduce the running time of the backbone part."

"Finally, it is multi-hardware cooperative scheduling. On our Orin-X-based computing platform, there are three kinds of computing units—GPU, DLA, and CPU. These three kinds of hardware support different operators of the network in different ways. We put the different components of the network where it is most suitable for its operation, and then uniformly schedule the three types of computing hardware, so that the three can cooperate to complete network reasoning."

END

