Domain Large Models: Training Tricks & Thoughts on Deployment

From: NLP Workstation


Preface

Hello everyone, I am Cong Liu NLP.

Domain large models have been getting a lot of attention lately. It so happens that our company has also built a domain knowledge model, "Ask in the Cloud", so I would like to take this opportunity to talk about training tricks for domain large models and some thoughts on putting them into production.

Some of these points have no theoretical backing; they are simply conclusions from my own experiments and may differ from your experience. Discussion, exchange and feedback are welcome.

Training Tricks for Domain Large Models

1. Domain technical standards and other domain-related documents are the key to Continue PreTrain for a domain model.

Existing large models already include books and papers in their pre-training data, and these two kinds of data are just as indispensable for domain pre-training: their quality is high, they are strongly domain-relevant, and their knowledge coverage (density) is high, which also makes the model better on exam-style evaluations. That is not to say other data are unimportant; domain-related web pages and news are also useful, but in my personal view their importance, or knowledge density, is lower than that of books and technical standards.

2. After training on domain data, general capability usually degrades, so general data needs to be mixed in to mitigate forgetting of general abilities.

If only domain data is used for training, the model is prone to catastrophic forgetting, so general data is usually mixed in during domain training. What is the right ratio? There is still no precise answer. BloombergGPT (pre-trained from scratch) used roughly a 1:1 ratio of financial to general data, while ChatHome (continued pre-training) found a domain:general ratio of 1:5 to be optimal. My own feeling is that it depends on how much domain data you have; when the amount of domain data is not that large, a ratio between 1:5 and 1:10 is more appropriate.
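To make the mixing concrete, here is a minimal sketch using Hugging Face `datasets`; the file names `domain.jsonl` / `general.jsonl` and the exact 1:5 ratio are only illustrative assumptions:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical corpora: each JSONL line is {"text": "..."}.
domain = load_dataset("json", data_files="domain.jsonl", split="train")
general = load_dataset("json", data_files="general.jsonl", split="train")

# Sample domain vs. general data at roughly 1:5 for continued pre-training.
mixed = interleave_datasets(
    [domain, general],
    probabilities=[1 / 6, 5 / 6],
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until both corpora are used up
)
```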

3. SFT data can be added during the domain model's Continue PreTrain, i.e. MIP (Multi-Task Instruction PreTraining).

Downstream SFT data can be added during pre-training so that the model learns more task knowledge at that stage. For example, the multi-task learning used in T5, ExT5 and GLM-130B suggests that multi-task training may help more in the pre-training stage than in fine-tuning, and ChatHome found that MIP gave the best results on its in-domain evaluation set.
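A rough sketch of how SFT data might be folded into the pre-training stream; the field names and the question/answer rendering are assumptions for illustration, not the exact format used by T5, ExT5, GLM-130B or ChatHome:

```python
from datasets import load_dataset

# Hypothetical SFT file: each JSONL line is {"instruction": "...", "output": "..."}.
sft = load_dataset("json", data_files="domain_sft.jsonl", split="train")

def to_pretrain_text(example):
    # Render each instruction/answer pair as a single plain-text document
    # so it can be packed into the pre-training corpus like any other text.
    return {"text": f"Question: {example['instruction']}\nAnswer: {example['output']}"}

sft_as_text = sft.map(to_pretrain_text, remove_columns=sft.column_names)
# sft_as_text can then be interleaved with the domain/general corpora above.
```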

4. When the domain model is built with SFT only: with limited resources, fine-tune on top of the Chat model; with sufficient resources, fine-tune on top of the Base model. (Resources = data + GPUs.)

I have discussed this question with many people: when doing SFT, should we train on the Base model or on the Chat model?

It is actually quite simple. If you only have 5k samples, fine-tune the Chat model; if you have 100k samples, fine-tune the Base model. Since you do not know what data quality went into the Chat model's SFT, once your resources allow it, it is better to rely on yourself than on someone else's alignment.

5. When performing SFT on a Chat model, follow the Chat model's original system prompt and data input format.

If you perform SFT on a Chat model, keep your data consistent with the Chat model's input format; otherwise, when data is scarce, the training effect may not be obvious. Full-parameter training is also not recommended, because it makes the model forget more of its original abilities. A minimal sketch of both points follows.
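The sketch below assumes a LLaMA-style Chat model that ships a chat template; the model name, LoRA hyper-parameters and target modules are placeholders that depend on your actual model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "your-org/your-chat-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Reuse the Chat model's own system prompt and input format instead of inventing a new one.
messages = [
    {"role": "system", "content": "You are a helpful domain assistant."},
    {"role": "user", "content": "A domain question goes here."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Parameter-efficient training (LoRA) instead of full-parameter SFT to reduce forgetting.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
lora = LoraConfig(
    task_type="CAUSAL_LM",
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the model architecture
)
model = get_peft_model(model, lora)
```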

6. A domain evaluation set is essential. It is best to have two: one with multiple-choice questions for automatic evaluation, and one with open-ended questions for manual evaluation.

Be sure to have your own domain evaluation set to verify the model and pick the best checkpoint. Multiple-choice questions can be scored automatically, which makes them convenient for a first pass over checkpoints; open-ended manual evaluation is time-consuming but useful for finer screening, and its task form is closer to the real scenario. A scoring sketch is given below.
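For the multiple-choice set, scoring can be as simple as the following sketch; the JSONL fields (`id`, `answer`) are assumed for illustration, not a fixed standard:

```python
import json

def multiple_choice_accuracy(pred_file: str, gold_file: str) -> float:
    """Compare predicted option letters against gold answers."""
    def load(path):
        with open(path, encoding="utf-8") as f:
            return {x["id"]: x["answer"].strip().upper() for x in map(json.loads, f)}
    preds, golds = load(pred_file), load(gold_file)
    return sum(preds.get(i) == a for i, a in golds.items()) / len(golds)
```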

7. Is it necessary to expand the domain model vocabulary?

In my opinion, what domain vocabulary expansion really solves is decoding efficiency; it may not improve model quality much. (Vocabulary expansion here means adding domain tokens to a model in the same language, not the Chinese localization of an English model.) A quick way to check the efficiency effect is sketched below.
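One quick check is to compare token counts before and after adding domain terms; the model name and the domain terms below are placeholders, and newly added tokens would still require resizing and training the embeddings:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/your-base-model", trust_remote_code=True)
text = "本标准规定了超融合数据中台的技术要求。"  # illustrative domain sentence

before = len(tok.tokenize(text))
tok.add_tokens(["超融合", "数据中台"])  # hypothetical domain terms
after = len(tok.tokenize(text))
print(f"tokens before: {before}, after: {after}")
# Fewer tokens per sentence means faster decoding; quality gains are not guaranteed.
# The model side would still need model.resize_token_embeddings(len(tok)) plus training.
```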

8. So-called domain large models will be updated faster and faster, and there will be more and more of them.

Since many people and companies do not have the resources to build a base model from scratch, incremental pre-training and fine-tuning have to be done on existing base models. Given how hard the major players (ChatGLM, BaiChuan, Qwen, Llama) are competing for share of the open-source community, it looks like many more 7B- and 13B-level models will be open-sourced.

And let's wait for the day ChatGPT open-sources a small model; maybe when GPT-5 comes out, OpenAI will open-source a small version of GPT-3.5.

Thoughts on Deploying Domain Large Models

1. It is often said that domain adaptation of a general model may be a false proposition; is generalization of a domain large model also a false proposition?

Since we started training the model, I have been going back and forth with my leader on whether a domain large model needs generalization ability at all. It is like the slogan of Huawei's Pangu model, "only do things, don't write poems". Is it enough for a domain model to solve just a few fixed tasks?

My humble opinion: if you want to put a domain large model into production quickly, the easiest path is to upgrade the system's existing capabilities, i.e. have the large model beat the original models on one or a few fixed tasks.

Take the Text2SQL task as an example. Many earlier systems solved it by extracting key elements and stitching them together, and end-to-end solutions were not very good; now it can be solved with a large model's SQL-generation ability. Upgrading existing products is the lowest-cost way to land. Our company's "Ask in the Cloud" reaches 90%+ on SQL tasks in a specific domain, far higher than existing open-source models and open APIs. A prompting sketch is given below.
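A minimal prompting sketch of the idea; the schema, question and the `generate` call are placeholders for illustration, not the actual "Ask in the Cloud" implementation:

```python
# Placeholder schema and question.
schema = "CREATE TABLE sales (region TEXT, month TEXT, revenue REAL);"
question = "What is the total revenue per region in 2023?"

prompt = (
    "You are a SQL assistant. Given the table schema, write a single SQL query "
    "that answers the question. Return only the SQL.\n"
    f"Schema:\n{schema}\nQuestion: {question}\nSQL:"
)

# sql = model.generate(prompt)  # whatever inference API your deployed model exposes
```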

Of course, many other tasks can be upgraded the same way, such as D2QA, D2SPO, Search2Sum and so on.

2. When deploying a domain large model, the task scenario matters more than model capability.

Although upgrading existing products is the lowest-cost way to deploy, GPT-4 and AutoGPT have raised expectations very high: everyone hopes to simply state a requirement and have the large model solve it directly. That is very hard for today's domain models, so it matters a great deal which scenarios you put the large model into and how you package it, so that users still get a good experience even when model capability falls short.

Many people are now stuck: never mind whether they have a large model, even if they did, they would not know where to use it, because they cannot find a suitable scenario in their own business.

So in the end, landing a large model is not about the model's raw performance; it is about a complete industry solution, and "know-how" becomes the key element.

3. For most enterprises, the model size that actually gets deployed will top out around 13B.

Given domestic conditions, most enterprises will end up with on-premises deployment, which raises hardware questions. I do not think many companies can deploy 100B-level models; realistically, deployment will be limited to the 10B level. Even though many methods (e.g. llama.cpp) can accelerate large models, a 100B-level model still consumes a lot of resources even after acceleration. A rough memory estimate is sketched below.
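A back-of-the-envelope sketch of weight memory alone (ignoring KV cache and runtime overhead), which shows why 10B-level models are the practical ceiling for most on-premises hardware:

```python
def weight_memory_gib(n_params_billion: float, bits: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params_billion * 1e9 * bits / 8 / 1024**3

for size in (13, 100):
    for bits in (16, 8, 4):
        print(f"{size}B params @ {bits}-bit: ~{weight_memory_gib(size, bits):.0f} GiB")
# e.g. 13B at 16-bit needs ~24 GiB for weights alone; 100B needs ~186 GiB.
```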

I said before that "those who have never tried a 33B model will think 13B is enough". Larger models are certainly worth building, but that does not change the fact that what ultimately gets deployed is at the 10B level.

The Mental Journey of Building a Large Model

When ChatGPT first took off, it never occurred to me that we were also in a position to build large models. But when many Chinese large models emerged, and Alpaca showed that a 7-billion-parameter model could also produce good results, it gave me a lot of confidence, and of course it gave many other people and companies confidence too.

When a small or medium-sized company builds large models, the question you hear most is "how can you build large models without a hundred GPUs?" I just want to say: it depends on your definition of "large". We are indeed not qualified to touch 175B models, but 33B models are still playable. It takes one group of people to truly catch up with OpenAI, and another group of people to put models into production.

It is our luck to have caught the wave of large models, and it is my luck to have a say in domain large models.

Summary

Finally, some encouragement: TextCNN was still being used in the BERT era, so why shouldn't a 13B model be called a large model?

