A ten-thousand-word deep dive into the applications of large models in autonomous driving


Author | Zhang Mengyu

With the rise of ChatGPT, large models have drawn more and more attention, and the capabilities they display are striking.

Large models have already begun to play a role in areas such as image generation, recommendation systems, and machine translation. Given a few prompt words, the image-generation site Midjourney produces design drawings that surpass the work of many professional designers.

Why can large models show such astonishing capabilities? Why does model performance improve as the parameter count and capacity of the model grow?

An expert from an AI algorithm company told the author: increasing a model's parameter count can be understood as increasing its dimensionality, which means we can model the laws of the real world in more complex ways. Take the simplest scenario: given a scatter plot on a plane, if we describe the pattern of the points with a straight line (a linear function), some points will always fall off the line no matter how we choose its parameters. If we instead use a parabola (a quadratic function), more points can fall on the curve. As the degree of the function, or its degrees of freedom, increases, more and more points land on the curve, meaning the pattern of the points is fitted more and more accurately.
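To make this intuition concrete, here is a minimal sketch in Python (numpy only, with made-up scatter data): as the degree of the fitted polynomial grows, the fitting error on the same points shrinks.

```python
# As the degree (degrees of freedom) of the fitted function grows, the fit
# to the same scatter points improves. The data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.shape)  # noisy cubic pattern

for degree in (1, 2, 3, 5):
    coeffs = np.polyfit(x, y, deg=degree)     # fit a degree-d polynomial
    residual = y - np.polyval(coeffs, x)      # points not "on the curve"
    print(f"degree={degree}  mean squared error={np.mean(residual**2):.3f}")
```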

In other words, the more parameters a model has, the more easily it can fit the patterns in massive data.

With the emergence of ChatGPT, people found that once a model's parameter count reaches a certain level, the result is not just "better performance" but "performance better than expected."

In the field of NLP (natural language processing), there is an exciting phenomenon whose underlying mechanism neither academia nor industry can yet explain: "emergent abilities."

What is "emergence"? "Emergence" means that when a model's parameter count grows to a certain scale, its accuracy starts to increase exponentially rather than in proportion.

Consider the figure below. Its left side shows the scaling law, a phenomenon OpenAI researchers identified before 2022: as a model's parameter scale grows exponentially, its accuracy rises roughly linearly (on such charts the parameter axis is logarithmic, so parameter growth that looks linear is in fact exponential).

Then, in January 2022, some researchers found that once a model's parameter scale exceeded a certain threshold, the improvement in model accuracy significantly outpaced the proportional curve, as shown on the right of the figure below.

△ Schematic diagram of "emergence"

At the application level, this means large models can accomplish tasks that small models cannot, such as addition and subtraction, and simple reasoning.

What kind of model can be called a large model?

Generally speaking, a model with more than 100 million parameters is considered a "large model." In the field of autonomous driving, "large model" carries two meanings: one is a model with more than 100 million parameters; the other is a "large model" composed of multiple small models stacked together.

By this definition, large models are already widely used in autonomous driving. In the cloud, we can exploit the capacity gains that come with larger parameter counts and use large models for tasks such as data mining and data labeling. On the vehicle side, we can merge the small models in charge of different sub-tasks into one "large model," which saves inference time in the onboard computing pipeline and improves safety.

Specifically, how can large models help? From the author's exchanges with industry experts, the industry currently applies large models mainly to perception. The following sections introduce how large models can empower perception tasks in the cloud and on the vehicle side.

1. Applications of large models

1.1

Application of large models in the cloud

1.1.1 Automatic labeling of data

Automatic labeling can be achieved with large-model pretraining. Taking video clip annotation as an example, a large model can first be pretrained by self-supervision on a large volume of unlabeled clips, then fine-tuned on a small amount of manually labeled clips until it gains detection capability, after which it can label clip data automatically.
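As a rough illustration of the second stage, here is a minimal PyTorch sketch of fine-tuning a detection head on a small labeled set, assuming the backbone has already been pretrained with self-supervision. The backbone, head, and data are toy stand-ins, not any company's actual pipeline.

```python
# Fine-tuning sketch: a pretrained backbone plus a small detection head,
# trained on a handful of manually labeled samples. All modules and data
# below are hypothetical placeholders.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Tiny head that turns backbone features into class/box predictions."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.box = nn.Linear(feat_dim, 4)
    def forward(self, feats):
        return self.cls(feats), self.box(feats)

# Stand-in for the large self-supervised pretrained model.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
head = DetectionHead(feat_dim=512, num_classes=10)

# Toy "small manually labeled set": (image, class label, box) batches.
labeled_loader = [(torch.randn(4, 3 * 32 * 32),
                   torch.randint(0, 10, (4,)),
                   torch.rand(4, 4)) for _ in range(3)]

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5)
criterion_cls, criterion_box = nn.CrossEntropyLoss(), nn.L1Loss()

for images, labels, boxes in labeled_loader:
    cls_logits, box_preds = head(backbone(images))
    loss = criterion_cls(cls_logits, labels) + criterion_box(box_preds, boxes)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```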

The higher the model's labeling accuracy, the more fully it can replace human labelers.

At present, many companies are studying how to improve the accuracy of large-model automatic labeling, hoping to achieve fully unmanned labeling once accuracy reaches the required standard.

Leo, Product Director of SenseTime Intelligent Driving, told the author: "We have run evaluations, and for common targets on the road, the automatic labeling accuracy of SenseTime's large model can exceed 98%, which can greatly streamline subsequent manual work."

In the development of its intelligent driving products, SenseTime Jueying has introduced large-model automatic pre-labeling for most perception tasks. Compared with before, the labeling cycle and labeling cost for the same number of data samples can both be reduced by tens of times, significantly improving development efficiency.

Generally, expectations for labeling center on three things: an efficient process, accurate results, and high consistency. Efficiency and accuracy are easy to understand, but what does consistency mean? In BEV algorithms for 3D recognition, engineers need joint annotation of lidar and vision, processing point cloud and image data together. In this step, they may also need annotations at the temporal level, so the results of consecutive frames cannot differ too much.

With manual labeling, the quality depends on the labeler's skill, and uneven skill levels lead to inconsistent results: the bounding box may be drawn larger in one frame and smaller in the next. Labeling results from a large model, by contrast, are generally consistent.

However, some industry experts also report that automatic labeling with large models still faces practical difficulties, especially in the handoff between self-driving companies and labeling companies. Many self-driving companies outsource part of their labeling work to labeling companies, and some have no internal labeling team at all and outsource everything.

At present, the targets pre-labeled by large models are mainly dynamic 3D targets. The self-driving company first runs the large model over the video to be labeled, then hands the inference results, the 3D boxes generated by the model, to the labeling company. Pre-labeling with a large model and then passing the results to a labeling company raises two problems: first, the labeling company's platform may not support loading pre-labeled results; second, the labeling company may not be willing to modify them.

To load pre-labeled results, a labeling company needs a software platform that supports importing the 3D boxes generated by the large model. But some labeling companies rely mainly on manual labeling and lack such a platform; if they receive model pre-labeled results from a customer, they have no way to take the job on.

In addition, from the labeling company's perspective, pre-labeling truly "saves effort" only if it is good enough; otherwise it may add work.

If the pre-labeling is not good enough, the labeling company still has plenty of follow-up work: adding missed boxes, deleting wrongly placed ones, unifying box sizes. In that case, adopting pre-labeling may not really reduce their workload.

Therefore, in practical applications, whether to use a large model for pre-labeling needs to be weighed by the autonomous driving company and the labeling company.

Of course, manual labeling is currently expensive: if a labeling company starts from scratch, manually labeling 1,000 frames of video can cost around 10,000 yuan. So autonomous driving companies still hope to push the accuracy of large-model pre-labeling as high as possible and cut manual work as much as possible, thereby reducing labeling costs.

1.1.2 Data Mining

Large models have strong generalization and are suitable for mining long-tail data.

An expert from WeRide told the author: "With traditional tag-based mining of long-tail scenes, a model can generally only distinguish known image categories. In 2021, OpenAI released CLIP, a text-image multimodal model that, after unsupervised pretraining, can match text with images and thus classify pictures by text descriptions rather than relying only on image tags. We can use such a text-image multimodal model to retrieve image data from drive logs with text descriptions, for example long-tail scenes like 'a construction vehicle towing cargo' or 'a traffic light with two bulbs lit at the same time.'"
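As an illustration, here is a sketch of such text-based retrieval using the publicly released CLIP weights via Hugging Face transformers; the file names and the query string are made-up examples.

```python
# Text-to-image retrieval with a CLIP-style model: score each drive-log
# frame against a natural-language query. Image paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["log_000.jpg", "log_001.jpg"]]  # log frames
query = "a construction vehicle towing cargo"

inputs = processor(text=[query], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

scores = out.logits_per_image.squeeze(-1)  # similarity of each image to query
best = scores.argmax().item()
print(f"most similar frame: index {best}, score {scores[best]:.2f}")
```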

In addition, large models can better extract features from data, and then find objects with similar features.

Suppose we want to find pictures containing sanitation workers among many pictures, without labeling them first. We can pretrain a large model on a large number of pictures containing sanitation workers so that it extracts their characteristic features, then search the pool for samples matching those features, thereby mining almost all pictures containing sanitation workers.
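A minimal sketch of this query-by-example mining follows, with a random stand-in for the pretrained encoder and toy tensors in place of real drive-log frames.

```python
# Embed a few seed images of the target class, then rank the unlabeled pool
# by cosine similarity to the mean seed embedding. The encoder is a dummy.
import torch
import torch.nn.functional as F

def embed(batch):
    # Stand-in for a large pretrained image encoder; returns unit vectors.
    return F.normalize(torch.randn(batch.shape[0], 512), dim=-1)

seed_images = torch.randn(5, 3, 224, 224)     # a few known positives
pool_images = torch.randn(1000, 3, 224, 224)  # unlabeled drive-log frames

prototype = F.normalize(embed(seed_images).mean(dim=0), dim=-1)
similarity = embed(pool_images) @ prototype   # cosine similarity (unit vectors)

top = similarity.topk(20).indices             # most likely matches to review
print(top.tolist())
```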

1.1.3 "Teaching" Small Models by Knowledge Distillation

The large model can also "teach" the small model by means of knowledge distillation.

What is knowledge distillation? In the plainest terms, the large model first learns some knowledge from the data, or extracts some information, and then uses what it has learned to "teach" the small model.

In practice, we can first give the pictures that need labeling to the large model, let it label them, and then use those labeled pictures to train the small model. That is one of the simplest forms of knowledge distillation.

Of course, we can use more elaborate schemes, such as having the large model extract features from massive data and using those features to train small models. We can even complicate the design further by adding a medium-sized model between the large and small models: the features extracted by the large model first train the medium model, and the trained medium model then extracts features for the small model to use. Engineers can choose the design that fits their needs.
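A minimal sketch of the simplest form, response-based distillation, where a toy student is trained to match the softened outputs of a toy teacher:

```python
# Knowledge distillation: the small "student" learns to match the softened
# output distribution of the large "teacher". Both networks are toys.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 4.0  # temperature: softens the teacher's distribution

for _ in range(100):
    x = torch.randn(64, 128)                  # unlabeled inputs
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=-1)
    log_probs = F.log_softmax(student(x) / T, dim=-1)
    # Standard KD loss: KL divergence, rescaled by T^2.
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```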

The author learned from Pony.ai that small models, such as pedestrian attention and pedestrian intention recognition, can be obtained by distilling and fine-tuning on features extracted by the large model. Moreover, since the feature extraction stage shares one large model, the amount of computation can be reduced.

1.1.4 Testing the performance ceiling of vehicle-side models

Large models can also be used to test the performance ceiling of vehicle-side models. When considering which model to deploy on the vehicle, some companies first test several candidates in the cloud, increasing the parameter counts to see which model performs best.

The best-performing model is then taken as the base model, which is pruned and optimized before being deployed to the vehicle.

1.1.5 Reconstruction and data generation of autonomous driving scenarios

Haomo Zhixing (HAOMO.AI) mentioned at its AI DAY in January 2023: "Using NeRF technology, we can store a scene implicitly in a neural network, learn the scene's implicit parameters through supervised learning on rendered pictures, and then reconstruct the autonomous driving scene."

For example, we can feed the network pictures, the corresponding poses, and a dense colored scene point cloud; based on a point-grid network, rasterize the colored point cloud at different resolutions according to the input picture's pose to generate neural descriptors at multiple scales; and then fuse the features across scales through the network.

Then the dense point cloud descriptors, positions, corresponding camera parameters, and image exposure parameters are fed into a subsequent network for fine tone mapping, synthesizing a picture with consistent color and exposure.

In this way we can reconstruct the scene. We can then generate all kinds of highly realistic data by changing the viewpoint, the lighting, or the textures and materials. For example, by changing the viewpoint we can simulate ego-vehicle behaviors such as lane changes, detours, and U-turns, and even synthesize high-risk data for scenes on the verge of collision.
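For reference, here is a toy sketch of the core NeRF idea mentioned above: an MLP mapping positionally encoded 3D points to color and density. This shows only the bare concept, not the point-cloud descriptor pipeline described in the text, and the view-direction input is omitted for brevity.

```python
# Toy implicit scene representation: a network that maps a positionally
# encoded 3D point to RGB color and volume density. A volume renderer would
# supervise these outputs with real pictures; that part is not shown.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    # Map each coordinate to [x, sin(2^k x), cos(2^k x)] features.
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, num_freqs=6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * (1 + 2 * num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4),            # RGB (3) + volume density (1)
        )
    def forward(self, xyz):
        out = self.mlp(positional_encoding(xyz, self.num_freqs))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.rand(1024, 3))   # radiance/density at sampled points
print(rgb.shape, sigma.shape)
```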

1.2

Applications of large models on the vehicle side

1.2.1 Merging small models for different detection tasks

The main way to use large models on the vehicle side is to merge the small models handling different sub-tasks into one "large model" and run joint inference. The "large model" here is not large in the traditional parameter-count sense, say over 100 million parameters; of course, the merged model is still much larger than any of the individual small models.

In the traditional vehicle-side perception stack, the models for different sub-tasks run inference independently: one model detects lane lines, another detects traffic lights. As perception tasks multiply, engineers keep adding task-specific models to the system.

Earlier autonomous driving systems had few functions, and their perception tasks were relatively easy. But as system functions are upgraded, perception tasks keep increasing; if every task keeps its own independent small model, system latency grows too large and creates safety risks.

In Juefei Technology's BEV multi-task perception framework, the single-task perception small models for different targets are merged into one model that simultaneously outputs static information, including lane lines, ground arrows, intersection zebra crossings, and stop lines, and dynamic information, including the position, size, and orientation of traffic participants. Juefei Technology's BEV multi-task perception algorithm framework is shown in the figure below:

△ Schematic diagram of Juefei Technology's BEV multi-task perception algorithm framework

The multi-task perception model fuses features temporally: BEV features from historical moments are stored in a feature queue, aligned in space and time (including feature rotation and translation), and then spliced with the BEV features of the current moment.
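A simplified sketch of this temporal fusion pattern follows, reducing ego-motion to a single 2D rigid transform for illustration; in practice each stored frame needs its own cumulative pose offset.

```python
# BEV temporal fusion: keep a queue of historical BEV feature maps, warp
# each into the current ego frame (rotation + translation), then concatenate
# with the current features along the channel dimension.
import collections
import math
import torch
import torch.nn.functional as F

def warp_bev(feat, dtheta, dx, dy):
    """Rigidly warp a BEV feature map (B, C, H, W) by an ego-motion delta."""
    cos, sin = math.cos(dtheta), math.sin(dtheta)
    theta = torch.tensor([[cos, -sin, dx],
                          [sin,  cos, dy]], dtype=feat.dtype)
    theta = theta.unsqueeze(0).repeat(feat.size(0), 1, 1)
    grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

history = collections.deque(maxlen=4)  # queue of historical BEV features

def fuse(current_feat, ego_delta):
    # Simplification: the same delta warps every stored frame.
    aligned = [warp_bev(f, *ego_delta) for f in history]
    history.append(current_feat)
    return torch.cat(aligned + [current_feat], dim=1)  # channel-wise splice

for _ in range(3):  # simulate three time steps
    fused = fuse(torch.randn(1, 64, 128, 128), ego_delta=(0.05, 0.2, 0.0))
print(fused.shape)  # channels grow with the number of fused frames
```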

In autonomous driving scenarios, temporal fusion improves the accuracy of perception algorithms and compensates, to a degree, for the limits of single-frame perception. Taking the 3D object detection sub-task shown in the figure as an example, with temporal fusion the model can detect targets that single-frame perception misses (such as targets occluded at the current moment), judge target speeds more accurately, and assist downstream trajectory prediction.

Dr. Qi Yuhan, head of BEV perception technology at Juefei Technology, told the author: "With this model architecture, as perception tasks grow more complex, a multi-task joint perception framework can keep perception real-time while outputting increasingly accurate results for downstream use by the autonomous driving system."

However, merging multi-task small models also brings problems. At the algorithm level, the merged model's performance on some sub-tasks may "regress," that is, its detection performance falls below that of an independent single-task model. Although the network structure of a model merged from different small models can still be carefully designed, the merged model must solve the problem of multi-task joint training.

In multi-task joint training, the sub-tasks may not converge in step, and tasks can suffer from "negative transfer," so the merged model regresses in accuracy on certain tasks. The algorithm team needs to optimize the merged model's structure and tune the joint training strategy as much as possible to reduce the impact of negative transfer.
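One common starting point for such tuning is per-task loss weighting on a shared backbone; here is a minimal sketch with toy tasks and arbitrary weights.

```python
# Multi-task joint training sketch: a shared backbone, one head per task,
# and per-task loss weights as a simple knob against negative transfer.
# Tasks, dimensions, and weights are illustrative.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(256, 512), nn.ReLU())
heads = nn.ModuleDict({
    "lane":  nn.Linear(512, 8),   # e.g. lane-line parameters
    "light": nn.Linear(512, 4),   # e.g. traffic-light attributes
})
weights = {"lane": 1.0, "light": 0.5}  # tuned to balance convergence

params = list(backbone.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
mse = nn.MSELoss()

for _ in range(10):
    x = torch.randn(32, 256)                       # shared input features
    targets = {k: torch.randn(32, h.out_features) for k, h in heads.items()}
    feats = backbone(x)                            # computed once, shared
    loss = sum(weights[k] * mse(heads[k](feats), targets[k]) for k in heads)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```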

1.2.2 Object Detection

An industry expert told the author: objects with relatively fixed ground truth are well suited to detection with large models.

So, what is an object with relatively fixed ground truth?

Objects with fixed ground truth are objects whose true state is unaffected by weather, time, and similar factors: lane lines, pillars, lampposts, traffic lights, zebra crossings, parking lines and parking spaces in basements, and so on. Whether they exist and where they are located is fixed; it does not change because of rain or darkness. Whenever the vehicle passes the corresponding area, their positions are the same. Such objects are suitable for detection with large models.

1.2.3 Lane Topology Prediction

A self-driving company mentioned at its AI DAY: "Based on BEV feature maps, using the standard-definition map as guide information, we decode the BEV features into a structured topological point sequence with an autoregressive encoder-decoder network, realizing lane topology prediction."
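As a loose illustration of the autoregressive idea (not that company's actual network, and with the standard-map guidance omitted), here is a toy decoder that emits a 2D point sequence conditioned on a BEV feature vector.

```python
# Autoregressive point-sequence decoding sketch: a GRU conditioned on a BEV
# feature vector emits one 2D point per step, feeding each prediction back
# in as the next input. All sizes are arbitrary.
import torch
import torch.nn as nn

class LanePointDecoder(nn.Module):
    def __init__(self, bev_dim=256, hidden=128):
        super().__init__()
        self.init_h = nn.Linear(bev_dim, hidden)   # condition on BEV features
        self.cell = nn.GRUCell(2, hidden)          # input: previous 2D point
        self.out = nn.Linear(hidden, 2)
    def forward(self, bev_feat, num_points=20):
        h = torch.tanh(self.init_h(bev_feat))
        point = torch.zeros(bev_feat.size(0), 2)   # start token: origin
        points = []
        for _ in range(num_points):
            h = self.cell(point, h)
            point = self.out(h)                    # next structured point
            points.append(point)
        return torch.stack(points, dim=1)          # (B, num_points, 2)

decoder = LanePointDecoder()
print(decoder(torch.randn(2, 256)).shape)          # torch.Size([2, 20, 2])
```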

2. How to make good use of large models

Under the industry trend toward open source, basic model frameworks are no secret. In many cases, it is engineering capability that determines whether a company can build a good product.

Engineering capability determines whether, when we think of a method that might improve the system, we can quickly verify its feasibility. What Tesla and OpenAI have in common is strong engineering capability: both can test the reliability of an idea as fast as possible and then apply large-scale data to the chosen model.

To give full play to the capabilities of large models in practice, the company's engineering capabilities are very important. Next, we will explain what kind of engineering capabilities are needed to make good use of large models according to the process of model development.

2.1

Upgrade data storage and file transfer systems

A large model has many parameters, and correspondingly the amount of data used to train it is large. For example, Tesla's algorithm team used about 1.4 billion pictures to train the 3D occupancy network presented at last year's AI Day.

In fact, the raw picture count is probably dozens or hundreds of times the number actually used, because data valuable for training must first be filtered out of the mass of raw data. Since 1.4 billion pictures were used for training, the number of original pictures must be far greater than 1.4 billion.

So how do we store tens or even hundreds of billions of images? This is a huge challenge for both the file reading system and the data storage system. In particular, current autonomous driving data comes in clips, so the number of files is enormous, which places very high demands on the efficiency of random small-file access.

To cope with this, some companies in the industry store data in slices and adopt a distributed architecture to support multi-user, multi-concurrency access, reaching data throughput of 100G/s with I/O latency as low as 2 milliseconds. "Multi-user" means many users access a data file at the same time; "multi-concurrency" means a data file is accessed by multiple threads, for example when an engineer trains a model with multiple threads and each thread needs the data file.

2.2

Efficiently find the right network architecture

With big data in hand, how do we ensure the model abstracts the information well? The model needs a network architecture suited to the task, so that its large parameter count is fully exploited and it gains a strong ability to extract information.

Lucas, senior manager of large-model R&D at SenseTime, told the author: "We have a standardized, industrial-grade, semi-automatic design system for very large models. Relying on it, we use a neural-architecture search system as the base when designing the network architecture of a very large model, to find the architecture best suited to learning from large-scale data."

When designing small models, we mainly rely on manual design, tuning, and iteration to obtain a model with satisfactory results. The result may not be optimal, but after iteration it basically meets requirements.

For large models, the network structure is very complex, so manual design, tuning, and iteration would consume a great deal of computing power at correspondingly high cost. How to quickly and efficiently design a good-enough architecture for training under limited resources is therefore a problem to be solved.

Lucas explained: "We have an operator library, and a model's network structure can be regarded as a permutation and combination of operators. Given basic constraints such as the number of layers and the number of parameters, this industrial-grade search system can work out how to arrange and combine operators to achieve better model results."

A model's quality can be evaluated against several indicators, including prediction accuracy on certain datasets, memory footprint at runtime, and running time. By assigning weights to these indicators, we can iterate until we find a satisfactory model. Of course, during the search we first use some small scenes for a preliminary evaluation of model quality.
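A toy sketch of such weighted multi-objective search follows, with random stand-ins for the real accuracy, memory, and latency measurements.

```python
# Architecture search sketch: sample candidate operator sequences from a
# library, score each on weighted accuracy/memory/latency proxies, and keep
# the best. The library, weights, and scoring are all illustrative.
import random

OPERATOR_LIBRARY = ["conv3x3", "conv1x1", "attention", "mlp", "identity"]
WEIGHTS = {"accuracy": 1.0, "memory": -0.3, "latency": -0.5}  # minus = cost

def sample_architecture(num_layers):
    return [random.choice(OPERATOR_LIBRARY) for _ in range(num_layers)]

def evaluate(arch):
    # Stand-in for training/measuring the candidate on small proxy scenes.
    return {"accuracy": random.random(),
            "memory": random.random(),
            "latency": random.random()}

def weighted_score(metrics):
    return sum(WEIGHTS[k] * v for k, v in metrics.items())

best = max((sample_architecture(num_layers=8) for _ in range(200)),
           key=lambda a: weighted_score(evaluate(a)))
print(best)
```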

When evaluating a model, how do we choose representative scenes?

Generally, common scenarios are chosen. The main purpose of designing the network architecture is to ensure the model can extract key information from a large amount of data, not to have it learn the characteristics of particular scenes. So although the finished model will be used for tasks like mining long-tail scenarios, general scenarios are used to evaluate its capability when the architecture is being selected.

With an efficient, high-precision neural-architecture search system, computation is efficient and accurate enough that model quality converges quickly, and a well-performing architecture can be found quickly in a huge search space.

2.3

Improve model training efficiency

After the previous groundwork is done, we come to training, where there are many places worth optimizing.

2.3.1 Operator optimization

A neural network can be understood as a combination of many basic operators. Operator computation consumes computing resources on one hand and memory on the other. If operators are optimized to compute more efficiently, training efficiency improves.

There are already AI training frameworks on the market, such as PyTorch and TensorFlow, which provide basic operators for machine learning engineers to call when building models. To improve training efficiency, some companies build their own training frameworks and optimize the underlying operators.

Because PyTorch and TensorFlow must remain as general-purpose as possible, the operators they provide are very basic. Enterprises can fuse basic operators according to their own needs, skipping the storage of intermediate results, saving GPU memory, and avoiding performance loss.
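A small sketch of the idea: TorchScript can fuse a chain of elementwise operators (mainly on GPU) so that the intermediate tensors the eager version materializes need not all be stored.

```python
# Operator fusion sketch: the eager function materializes intermediates one
# by one; the scripted version is eligible for kernel fusion by the JIT
# (fusion mostly kicks in on CUDA; results are identical either way).
import torch

def bias_gelu_eager(x, bias):
    y = x + bias                                         # intermediate 1
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356))   # more intermediates

bias_gelu_fused = torch.jit.script(bias_gelu_eager)      # fusible version

x = torch.randn(1024, 4096)
bias = torch.randn(4096)
assert torch.allclose(bias_gelu_eager(x, bias), bias_gelu_fused(x, bias))
```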

In addition, some specific operators depend heavily on intermediate results during computation and so cannot exploit GPU parallelism well. To solve this, some companies in the industry have built their own acceleration libraries that reduce these operators' dependence on intermediate results, letting the computation fully exploit the GPU's parallel advantages and speeding up training.

For example, on four mainstream Transformer models, ByteDance's LightSeq achieved up to an 8x speedup over PyTorch.

2.3.2 Make good use of parallel strategies

Parallel computing trades space for time: data without computational dependencies is parallelized as much as possible, large batches are split into small ones, the GPU's idle waiting time at each step is reduced, and computational throughput rises.

Many companies now train with PyTorch, which offers DDP mode. As a distributed data-parallel training mode, DDP provides a data distribution mechanism that supports multi-machine, multi-card training: if a company has 8 servers with 8 cards each, it can train on 64 cards at once.

Without this mode, engineers could only train on a single machine with multiple cards. Suppose we train a model on 100,000 pictures: in single-machine multi-card mode, training takes more than a week. If we want to use the results to evaluate a conjecture or choose the best of several candidate models, such training times make the wait for verification very long and R&D efficiency very low.

With multi-machine, multi-card parallel training, most experimental results can be seen within 2-3 days, which greatly speeds up verifying model quality.
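For reference, here is a minimal sketch of DDP training as launched with torchrun; the model and data are toy stand-ins.

```python
# Minimal PyTorch DDP sketch. Each process drives one GPU; gradients are
# all-reduced across all processes, so 8 servers x 8 cards train as 64
# workers. Launch: torchrun --nnodes=8 --nproc_per_node=8 train.py
# (use the "gloo" backend instead of "nccl" on CPU-only machines).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # torchrun sets the env vars
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(512, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(64, 512, device=rank)  # each rank sees its own shard
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()                        # gradients all-reduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```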

In terms of specific parallel methods, the main options are model parallelism and sequence parallelism.

Model parallelism can be divided into Pipeline parallelism and Tensor parallelism, as shown in the figure below.

△ Schematic diagram of pipeline parallelism and tensor parallelism; image from NVIDIA

Pipeline parallelism is inter-layer parallelism (upper part of the figure). Engineers can assign different layers of the model to different GPUs during training; for example, as shown in the upper part of the figure, the green layers and the blue layers can be computed on different GPUs.

Tensor parallelism is intra-layer parallelism (lower part of the figure): engineers split the computation of one layer across different GPUs. This mode suits large matrix computations because it balances load across GPUs, but the number of communications and the data volume are relatively large.

Besides model parallelism, there is sequence parallelism. Because tensor parallelism does not split Layer-norm and Dropout, these two operators are computed redundantly on every GPU; the computation is small, but it occupies a lot of activation memory.

To solve this, we can exploit the fact that Layer-norm and Dropout are independent along the sequence dimension (that is, they do not interact across positions in the sequence) and split them along the sequence, as shown in the figure below. The advantage of this split is that it adds no communication traffic while greatly reducing memory usage.

△ Schematic diagram of sequence parallelism; image from NVIDIA

In practice, different models suit different parallel strategies. Engineers need to find a suitable strategy through continuous debugging, according to the model's characteristics, the hardware used, and the intermediate computation process.

2.3.3 Make good use of "sparseness"

When training the model, we also need to make good use of sparsity: not every neuron must be "activated." That is, when new training data is added, only some of the model's parameters are updated based on that data, while the rest remain unchanged.

Good sparse processing can ensure the training efficiency of the model while maintaining accuracy.

For example, in a perception task, when new pictures come in, we can select which parameters to update based on those pictures, performing targeted feature extraction.
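A minimal sketch of one simple form of such selective updating follows: freeze most parameters so the optimizer only touches a chosen subset.

```python
# Selective parameter updating: most weights stay frozen and only a chosen
# subset moves with each new batch. This is just one simple realization of
# the sparse-update idea; the split here is arbitrary.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(),   # [0] treated as a frozen part
    nn.Linear(512, 10),               # [2] the part we keep updating
)
for p in model[0].parameters():
    p.requires_grad = False           # these weights stay unchanged

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

x, target = torch.randn(32, 256), torch.randn(32, 10)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()                      # only the selected subset is updated
```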

2.3.4 Unified processing of basic information

Generally, a company uses more than one model, and those models may consume the same data. For example, most models use video data; if each model loads and preprocesses the video itself, much computation is repeated. We can instead preprocess the modalities that most models need, such as video, point clouds, maps, and CAN signals, in one place, so that different models can reuse the results.

2.3.5 Optimize hardware configuration

In practice, distributed training may involve 1,000 machines. Fetching intermediate results of training, such as gradients, from the different servers where they are stored, and then running large-scale distributed training on top of that, is a great challenge.

To meet this challenge, we first need to consider how to configure CPUs and GPUs, which network cards to choose and at what speed, so that transmission between machines is fast.

Second, parameters must be synchronized and intermediate results saved; at large scale this becomes very difficult and involves substantial network communication work.

In addition, the entire training process takes a long time, so the stability of the cluster needs to be high.

3. Is it meaningful to continue to increase model parameters?

Now that large models are already playing a role in autonomous driving, if we keep increasing model parameters, can we expect them to show some astonishing effects?

Judging from the author's exchanges with algorithm experts in autonomous driving, the answer for now is probably no, because the "emergence" phenomenon described above has not yet appeared in CV (computer vision). The parameter counts currently used in autonomous driving are far smaller than ChatGPT's. Without the "emergence" effect, the relationship between model performance and parameter count is roughly linear, and given cost constraints, companies have not pushed parameter counts to the limit.

Why hasn't there been an "emergence" phenomenon in computer vision yet? An expert explained:

First, although the world contains far more visual data than text, image data is sparse: most photos carry little effective information, and most pixels in an image provide none. In a selfie, apart from the face in the middle, the background area carries no useful information.

Second, image data suffers from significant scale variation and is completely unstructured. Scale variation means objects with the same semantics can appear large or small in a picture: if I take a selfie and then ask a friend standing farther away to photograph me, the proportion of the frame my face occupies differs greatly between the two photos. Unstructured means the relationships between pixels are uncertain.

In natural language processing, by contrast, language is a tool for human communication, so context is usually related, the information density of each sentence is generally high, and there is no scale-variation problem: in any language, the word "apple" is never very long.

Therefore, the understanding of visual data itself will be more difficult than natural language.

An industry expert told the author: although we can expect model performance to rise with parameter count, continuing to increase the parameter count is currently not cost-effective.

For example, if we expand the model's capacity tenfold from its current size and its relative error rate falls by 90%, the model may already handle computer vision tasks such as face recognition. If we then expand capacity another tenfold and the relative error rate again falls by 90%, but the value the model delivers does not grow tenfold, there is no need to keep expanding.

Expanding the model capacity will increase the cost, because a larger model requires more training data and more computing power. When the accuracy of the model reaches the acceptable range, we need to make a trade-off between the increase in cost and the increase in accuracy, and reduce the cost as much as possible under the condition of acceptable accuracy according to actual needs.

Although some tasks still need higher accuracy, large models in the cloud mainly replace manual work: automatic labeling, data mining, and the like can all be done by humans. If the cost is too high, the economics simply do not work out.

Still, some industry experts told the author: although the qualitative tipping point has not been reached, as model parameters and data volume grow, we can indeed observe steadily improving accuracy, which in turn feeds back into automatic labeling. Once the labeling model is accurate enough, labeling needs far less manpower. Training cost does rise with model size, but cost currently scales roughly linearly with parameter count, and the manpower saved can offset the added training cost, so on the whole, increasing the parameter count is still worthwhile.

Moreover, as the parameter count grows, we also adopt methods to improve training efficiency and cut training cost as much as possible. At the current model scale, we can basically increase parameters and accuracy while keeping cost roughly unchanged; in effect, cost does not grow linearly with parameters, and the increase can be held to almost nothing or very little.

4. Other possible applications of large models

In addition to the applications mentioned above, how can we discover the value of large models?

4.1

In the field of perception

Max, a research scientist at CMU, told the author: "To use large models for perception tasks, the key is not stacking parameters but building a framework that can run an 'inner loop.' If the model cannot form an internal loop, that is, cannot be trained continuously online, it will be hard to achieve good results."

So, how to realize the "inner loop" of the model? We can refer to the training framework of ChatGPT, as shown in the figure below.

△ ChatGPT training framework; image from the OpenAI official website

ChatGPT's training framework can be divided into three steps. First, supervised learning: engineers collect and label some data and use it to train the model. Second, a reward model is designed that can score outputs by itself. Third, through a path similar to reinforcement learning, the model achieves self-supervised learning, in popular terms "playing against itself," the "inner loop."

Once the third step is reached, the model no longer needs engineers to supply labeled data; given unlabeled data, it can compute the loss itself and update its parameters, cycling continuously until training completes.
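A toy sketch of this training pattern: once a (frozen) reward model exists, a policy can improve from unlabeled inputs by maximizing predicted reward, with no new human labels in the loop. Both networks and the data below are random stand-ins that illustrate the pattern only.

```python
# "Inner loop" sketch: a fixed reward model scores the policy's outputs, and
# the policy updates itself to maximize that score, label-free.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for p in reward_model.parameters():
    p.requires_grad = False            # reward model is fixed inside the loop

for _ in range(100):
    state = torch.randn(32, 8)         # unlabeled situations
    action = policy(state)             # the model "plays by itself"
    loss = -reward_model(action).mean()  # maximize predicted reward
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```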

"If we can design a suitable Reward Policy when doing perception tasks, so that model training no longer depends on labeled data, it can be said that the model has realized an 'inner loop' and can continuously update parameters based on unlabeled data."

4.2

In the planning field

In fields such as Go, it is easy to judge whether each move is good or bad, because the goal is generally just to win the game in the end.

In autonomous driving planning, however, the human evaluation system for the system's behavior is not clear-cut. Beyond guaranteeing safety, everyone feels comfort differently, and we may also want to reach the destination as fast as possible.

In a chat scenario, whether each response from the bot is "good" or "bad" has no evaluation system as clear as Go's. Autonomous driving is similar: everyone has different standards for "good" and "bad," and may also have needs that are hard to articulate.

In the second step of ChatGPT's training framework, annotators rank the model's outputs, and the ranked results are used to train the reward model. At first this reward model is imperfect, but continued training brings it ever closer to the effect we want.

An expert from an artificial intelligence company told the author: in autonomous driving planning, we can keep collecting driving data and tell the model when humans take over (that is, when people feel danger) and in which situations driving proceeds normally; the reward model then approaches perfection as the data grows.

In other words, we can give up trying to write a perfect reward model explicitly and instead approach perfection by continuously feeding the model feedback.

Compared with the current common practice in planning, trying to explicitly find the optimal solution with hand-written rules, starting from an initial reward model and continuously optimizing it with data is a paradigm shift.

With this method, optimizing the planning module follows a fairly standard process: all we need to do is keep collecting data and training the reward model, no longer depending, as the traditional method does, on each engineer's depth of understanding of the whole planning module.

In addition, all historical data can be used for training, and we need not worry that changing one rule will make previously solved problems reappear, a problem that can plague the traditional approach.

END



