Vertical-domain GPT models: some thoughts on domain-specific large models and a summary of existing open-source domain models

On the ToC side, the public's expectations have already been shaped by ChatGPT, and that market has largely been captured by it. Although major domestic vendors are racing to catch up, most of their products are still in closed beta and will probably not be opened up broadly (after all, the big vendors are focusing on the ToB and ToG markets, as can be seen from Huawei's presentations at WAIC). On the ToB and ToG side, on-premises deployment, strong results in a specific field or industry, and localization have undoubtedly become the key evaluation criteria.

Personally, I believe that vertical-domain large models, that is, adapting large models to specific domains and industries, are the key to putting large models into production. Just a few days ago, ChatLaw (a large-model product for the legal field) went viral. I got early access to its internal beta, tested it for a while, and had a long chat with the model's author. So I took advantage of the weekend to think through, organize, and summarize some content on vertical-domain large models.

The article starts from ChatLaw, moves on to some discussion of large models in vertical fields, and ends with a summary of existing open-source domain models.

Let's talk about my views on ChatLaw

The emergence of ChatLaw makes me even more certain that future deployments of large models need domain-specific characteristics. Compared with most current domain models, ChatLaw is not just a model but a deliberately designed domain product built around a large model, and it already has a solid product form for the legal field.

Paper: https://arxiv.org/pdf/2306.16092.pdf

Github: https://github.com/PKU-YuanGroup/ChatLaw

Official website: https://www.chatlaw.cloud/

There may be some doubts, such as: isn't this just LangChain? Can it guarantee factual accuracy in the legal domain? And so on. However, I feel that before dismissing something, we should first understand it more deeply.

ChatLaw has two modes: a normal mode and a professional mode. The normal mode answers questions using the large model alone.

The professional mode uses retrieval to match the user's query against a knowledge base, filters out the relevant evidence, and then produces the final answer using the large model's summarization ability.

Because the professional mode draws on the knowledge base, the answers users get are more accurate. For the professional version, ChatLaw has designed a complete workflow: follow-up prompts for information completion, user information confirmation, similar case retrieval, suggestion summarization, and so on.
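To make the retrieve-then-summarize idea of the professional mode concrete, here is a minimal sketch, not ChatLaw's actual code: a plain TF-IDF retriever stands in for the real retrieval component, and `call_llm` is a hypothetical placeholder for whatever chat model is deployed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base of placeholder statute excerpts.
knowledge_base = [
    "Statute excerpt A: conditions under which a contract may be rescinded ...",
    "Statute excerpt B: limitation periods for civil claims ...",
    "Statute excerpt C: damages available to the no-fault party in a divorce ...",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query (toy TF-IDF retriever)."""
    vectors = TfidfVectorizer().fit_transform(docs + [query])
    scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:top_k]]

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for the deployed chat model."""
    return f"[answer generated from a prompt of {len(prompt)} characters]"

def professional_mode(query: str) -> str:
    """Retrieve evidence, then let the LLM summarize an answer grounded in it."""
    evidence = retrieve(query, knowledge_base)
    prompt = (
        "Answer the user's legal question using only the evidence below.\n"
        "Evidence:\n- " + "\n- ".join(evidence) +
        f"\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)

print(professional_mode("Can I claim damages in a divorce if my spouse was at fault?"))
```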

The author @JessyTsui (Zhihu) also said that, in fact, ChatLaw = ChatLaw LLM + keyword LLM + laws LLM. The keyword LLM is what really caught my eye. My previous understanding of keyword extraction was to pick the right words out of the text, and in traditional retrieval to rely on synonyms and similar tricks to improve recall. The keyword LLM instead uses a large model to generate keywords, which can not only find the key content in the text but also paraphrase and interpret certain terms. This makes the whole product much more effective when retrieving evidence.
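As an illustration of the keyword-LLM idea (the prompt wording and the `call_llm` helper are assumptions for this sketch, not ChatLaw's implementation), one can ask a general LLM to generate retrieval keywords, including formal legal terms the user never typed, and feed those to the retriever instead of the raw query:

```python
def generate_keywords(user_query: str, call_llm) -> list[str]:
    """Ask an LLM for retrieval keywords instead of extracting words verbatim."""
    prompt = (
        "You are a legal search assistant. Read the user's question and output "
        "5-10 search keywords, one per line. Include formal legal terms that "
        "paraphrase the user's everyday wording.\n"
        f"Question: {user_query}\nKeywords:"
    )
    raw = call_llm(prompt)
    # One keyword per line; drop empties and duplicates while keeping order.
    seen, keywords = set(), []
    for line in raw.splitlines():
        keyword = line.strip("-• ").strip()
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords
```

For example, a query like "my boss hasn't paid me for three months" might yield keywords such as "wage arrears", "labor arbitration", and "labor contract", which match the knowledge base far better than the user's original phrasing.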


At the same time, since different models perform differently on different types of problems, in actual use HuggingGPT is used as a scheduler: for every user request, it selects and calls the most suitable model. In other words, let the appropriate model do the appropriate work.
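The scheduling idea can be sketched as a small dispatcher: a controller decides which specialized model handles each request, then routes to it. In the sketch below the registry entries and the rule-based `classify_request` are toy stand-ins; the real system lets an LLM make the routing decision, HuggingGPT-style.

```python
from typing import Callable, Dict

# Toy registry of specialized models; in practice these would be the
# keyword LLM, laws LLM, and ChatLaw LLM mentioned above.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "keyword": lambda q: f"[keyword LLM handles: {q}]",
    "statute_qa": lambda q: f"[laws LLM handles: {q}]",
    "general_chat": lambda q: f"[ChatLaw LLM handles: {q}]",
}

def classify_request(query: str) -> str:
    """Rule-based stand-in for the controller LLM's routing decision."""
    lowered = query.lower()
    if "which law" in lowered or "article" in lowered:
        return "statute_qa"
    if "find cases" in lowered or "search" in lowered:
        return "keyword"
    return "general_chat"

def dispatch(query: str) -> str:
    """Let the most appropriate model do the most appropriate work."""
    return MODEL_REGISTRY[classify_request(query)](query)

print(dispatch("Which law governs overtime pay?"))
```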

Let's talk about my views on large models in vertical fields

Currently there are two main ways large models are used: the first relies on the large model itself to solve the user's problem; the second relies on external knowledge. Personally, I believe that answering questions with the help of external knowledge is the future. Although it adds extra inference cost, external knowledge is an effective way to mitigate model hallucination.

However, as the underlying capabilities of general-purpose large models keep improving and the context they can accept keeps getting longer, in-context learning (ICL) can be used to boost a general model's performance in a vertical domain. Does that make training a vertical-domain large model a false proposition? Do we still need to do it?

Personally, I think it is still necessary, for several reasons:

  • 1. Personally, I feel that a true vertical-domain large model should start from pre-training. SFT only elicits the capabilities the base model already has; pre-training is the real knowledge-infusion stage, where the model genuinely learns domain data and adapts to the domain (a data-level sketch follows this list). However, many vertical-domain large models today are still stuck at the SFT stage.

  • 2. For many companies, it is enough for a domain model to excel at certain capabilities. Does our energy industry really need to care whether the model can write poetry? So as long as the domain model outperforms the general model within its industry, there is no need to demand everything at once.

  • 3. We should not dismiss vertical-domain models just because they are not as strong as ChatGPT. Have you ever considered a scary possibility: ChatGPT may have seen more of your vertical domain's data than your own domain model has. Even so, there are still fields whose data ChatGPT has never seen.

  • 4. Considering deployment cost: at the 7B and 13B scale, a general model really cannot beat a domain model. As for whether a 175B domain model would beat a 175B general model, that comparison simply hasn't been made; the larger the model, the more data it needs, and a domain may not actually have that much data.
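On point 1, the practical difference between the two stages shows up in how training samples are built. Below is a minimal sketch under the common convention for causal LMs (the toy tokenizer is only there so the snippet runs standalone): continued pre-training computes the loss on every token of raw domain text, while SFT masks the prompt tokens with -100 so only the response is learned.

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label for the LM loss

class ToyTokenizer:
    """Whitespace tokenizer stub so the sketch runs without downloading a model."""
    def __init__(self):
        self.vocab = {}
    def encode(self, text: str) -> list[int]:
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]

def build_pretrain_sample(tokenizer, raw_text: str) -> dict:
    """Knowledge infusion: predict every token of the raw domain text."""
    ids = tokenizer.encode(raw_text)
    return {"input_ids": ids, "labels": list(ids)}

def build_sft_sample(tokenizer, instruction: str, response: str) -> dict:
    """Capability elicitation: loss is computed on the response tokens only."""
    prompt_ids = tokenizer.encode(f"Instruction: {instruction} Answer:")
    response_ids = tokenizer.encode(response)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(response_ids),
    }

tok = ToyTokenizer()
print(build_pretrain_sample(tok, "Domain standard clause: transformers must be tested annually."))
print(build_sft_sample(tok, "How often must transformers be tested?", "Annually, per the standard."))
```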

PS: People who work on algorithms outside NLP often have questions about how large-model products actually land:

Q: I have a lot of technical standards and domain text data. Can I just hand it over and have you train a domain large model directly?

A: Yes and no. Plain text can only be used to pre-train the model. To actually support downstream question answering, what is needed is instruction data. Of course, AI methods can be used to generate some instruction data automatically, but to ensure factual accuracy, human proofreading is still required. High-quality SFT data is the key to fine-tuning.
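As a hedged sketch of what "using AI methods to generate instruction data" can look like (the prompts and the `call_llm` client are illustrative assumptions), one can ask an LLM to turn each domain passage into a candidate question-answer pair and explicitly flag every pair for human proofreading before it enters the SFT set:

```python
import json

def passage_to_instruction(passage: str, call_llm) -> dict:
    """Turn one domain passage into a candidate instruction sample."""
    question = call_llm(
        "Write one question that the following technical passage answers. "
        "Return only the question.\n\n" + passage
    )
    answer = call_llm(
        "Answer the question using only the passage. If the passage does not "
        f"contain the answer, say so.\n\nPassage: {passage}\nQuestion: {question}"
    )
    return {
        "instruction": question.strip(),
        "input": "",
        "output": answer.strip(),
        "needs_human_review": True,  # factual proofreading before it becomes SFT data
    }

def build_candidate_sft_file(passages, call_llm, path: str = "candidate_sft.jsonl") -> None:
    """Dump candidate samples as JSON lines for reviewers to proofread."""
    with open(path, "w", encoding="utf-8") as f:
        for passage in passages:
            record = passage_to_instruction(passage, call_llm)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```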

Q: You fine-tuned the large model with domain data. Why not just answer questions directly? Why do you still need a knowledge base?

A: External knowledge is mainly used to counter model hallucination and improve the accuracy of the model's responses.

Q: Why does the model give two different responses to the same question?

A: To ensure diversity, large models generally decode with Top-P and Top-K sampling, which makes the generated results non-deterministic. If greedy decoding were used instead, the output would be deterministic but only locally optimal.
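The difference can be seen directly in the Hugging Face `generate` API; the model id below is a placeholder for whatever fine-tuned checkpoint is being served.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-finetuned-model"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("What is the limitation period for a civil claim?", return_tensors="pt")

# Sampling with Top-P / Top-K: diverse, so two calls can return different answers.
sampled = model.generate(**inputs, do_sample=True, top_p=0.85, top_k=50,
                         temperature=0.8, max_new_tokens=128)

# Greedy decoding: deterministic, always picks the locally optimal next token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=128)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```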

Q: Is it enough for me to train a model myself using the open source 6B and 7B models? 

A: Brother, people who have never trained a 33B model will always think that 13B is enough.

The above are some personal thoughts and answers to common questions. If you disagree, feel free to scroll past; discussion is welcome. After all, everyone sees things differently.

Summary of open-source large models in vertical fields

Many vertical-domain large models have already been open-sourced, mainly in medicine, finance, law, education, and similar fields. This section summarizes the Chinese open-source models.

"PS: Some large models in the field are not included in this summary if they are not open source; and everyone is welcome to leave a message to check for deficiencies."

medical field

Non-Chinese projects such as BioMedLM, PMC-LLaMA, ChatDoctor, and BioMedGPT are not covered here.

MedicalGPT-zh

Github: https://github.com/MediaBrain-SJTU/MedicalGPT-zh

  • Introduction: A Chinese medical general-purpose model built by instruction fine-tuning ChatGLM-6B.

  • Data: 182k samples constructed with ChatGPT based on 16 diagnosis-and-treatment scenarios and the medical guidelines of 28 departments; the data has also been open-sourced.

  • Training method: Based on ChatGLM-6B, trained with LoRA in 16-bit precision (see the sketch below, which applies to most of the LoRA-trained models in this list).
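Since "trained with LoRA on a 6B/7B base" recurs throughout this summary, here is a minimal sketch of what that setup looks like with the `peft` library; the hyperparameters and target module are illustrative defaults rather than any particular project's actual configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "THUDM/chatglm-6b"  # the base checkpoint named by the project
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
                                  torch_dtype=torch.float16)  # 16-bit weights

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                     # adapter rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # ChatGLM's fused attention projection; adjust for other bases
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable

# From here, training proceeds with a standard training loop / Trainer over the
# instruction data described earlier; the frozen base plus small adapters is what
# keeps 6B/7B-scale domain fine-tuning affordable.
```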

DoctorGLM

Github: https://github.com/xionghonglin/DoctorGLM

  • Introduction: A Chinese consultation model based on ChatGLM-6B.

  • Data: Mainly uses CMD (Chinese Medical Dialogue Data) data.

  • Training method: Based on the ChatGLM-6B model, Lora and P-tuning-v2 are used for model training.

PS: The data comes from the Chinese-medical-dialogue-data project.

Huatuo-Llama-Med-Chinese

Github: https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese

  • Introduction: BenTsao (Materia Medica, originally named HuaTuo): a LLaMA model fine-tuned on Chinese medical knowledge.

  • Data: A Chinese medical instruction dataset constructed from a medical knowledge graph via the GPT-3.5 API, with a total of about 9k open-sourced samples.

  • Training method: Based on the Llama-7B model, the Lora method is used for model training.

Med-ChatGLM

Github: https://github.com/SCIR-HI/Med-ChatGLM

  • Introduction: A ChatGLM model fine-tuned on Chinese medical knowledge; a sibling project of BenTsao (Materia Medica).

  • Data: Same as Huatuo-Llama-Med-Chinese.

  • Training method: Based on the ChatGLM-6B model, the Lora method is used for model training.

ChatMed

Github: https://github.com/michael-wzhu/ChatMed

  • Introduction: A Chinese medical large model that is good at answering patients'/users' everyday medical questions online.

  • Data: 500,000+ online consultation queries paired with ChatGPT-generated responses as the training set.

  • Training method: Based on the Llama-7B model, the Lora method is used for model training.

ShenNong-TCM-LLM

Github: https://github.com/michael-wzhu/ShenNong-TCM-LLM

  • Introduction: "Shen Nong" large model, the first Chinese large model of traditional Chinese medicine, is a brother project with ChatMed.

  • Data: Based on a traditional Chinese medicine knowledge graph, using an entity-centric self-instruct method, ChatGPT was called to obtain 110,000+ instruction samples centered on traditional Chinese medicine.

  • Training method: Based on the Llama-7B model, the Lora method is used for model training.

BianQue

Github: https://github.com/scutcyr/BianQue

  • Introduction: Bian Que, Chinese medical conversation model.

  • Data: Starting from the existing open-source Chinese medical QA datasets (MedDialog-CN, IMCS-V2, CHIP-MDCFNPC, MedDG, cMedQA2, Chinese-medical-dialogue-data), the authors analyzed their single-turn/multi-turn characteristics and doctor-inquiry patterns, and combined them with the lab's long-term self-built health dialogue data to construct BianQueCorpus, a BianQue health dataset on the scale of tens of millions.

  • Training method: BianQue-1.0 was trained with full-parameter fine-tuning on ChatYuan-large-v2 as the base model; BianQue-2.0 was trained with full-parameter fine-tuning on ChatGLM-6B.

SoulChat

Github: https://github.com/scutcyr/SoulChat

  • Introduction: A Chinese mental-health dialogue large model; a sibling project of BianQue.

  • Data: More than 150,000 single-turn long-text psychological counseling instruction samples were constructed, and ChatGPT and GPT-4 were used to generate about 1 million turns of multi-turn dialogue data.

  • Training method: Based on the ChatGLM-6B model, the full parameter fine-tuning method is used for model training.

legal field

LaWGPT

Github: https://github.com/pengxiao-song/LaWGPT

  • Introduction: A large language model based on Chinese legal knowledge.

  • Data: Built on public legal document data from China Judgements Online, judicial examination data, and other datasets; dialogue QA data was generated with the Stanford Alpaca and self-instruct approaches, knowledge-guided data generation was used, and ChatGPT was brought in to clean the data and help construct a high-quality dataset.

  • Training method: (1) Legal-Base-7B: the legal base model, further pre-trained on 500,000 Chinese judgment documents. (2) LaWGPT-7B-beta1.0: a legal dialogue model, instruction fine-tuned on Legal-Base-7B with a constructed dataset of 300,000 high-quality legal QA pairs. (3) LaWGPT-7B-alpha: instruction fine-tuned directly on Chinese-LLaMA-7B with a 300,000-pair legal QA dataset. (4) LaWGPT-7B-beta1.1: a legal dialogue model, instruction fine-tuned on Chinese-alpaca-plus-7B with a 350,000-pair high-quality legal QA dataset.

ChatLaw

Github: https://github.com/PKU-YuanGroup/ChatLaw

  • Introduction: Chinese legal model

  • Data: Mainly composed of forum posts, news, statutes, judicial interpretations, legal consultations, judicial examination questions, and judgment documents; the dialogue data was then constructed through cleaning, data augmentation, and so on.

  • Training methods: (1) ChatLaw-13B: based on the Ziya-LLaMA-13B-v1 ("Jiang Ziya") model, trained with LoRA. (2) ChatLaw-33B: based on Anima-33B, trained with LoRA.

LexiLaw

Github: https://github.com/CSHaitao/LexiLaw

  • Introduction: Chinese legal model

  • Data: BELLE-1.5M general data; 52k single-turn QA pairs and 92k scenario QA pairs with legal grounding from the LawGPT project; bar exam data and legal instruction fine-tuning data from the Lawyer LLaMA project; 20k high-quality QA pairs from Hualv.com; 36k legal QA pairs collected from Baidu Zhidao; plus laws and regulations, legal reference books, and legal documents.

  • Training method: Based on the ChatGLM-6B model, three methods of Freeze, Lora, and P-Tuning-V2 are used for model training.

LAW-GPT

Github: https://github.com/LiuHC0428/LAW-GPT

  • Introduction: Chinese legal model (Haizhi)

  • Data: Existing legal QA datasets, plus high-quality legal QA data constructed via self-instruct guided by statutes and real cases.

  • Training method: Based on ChatGLM-6B, the Lora&16bit method is used for model training.

lawyer-llama

Github: https://github.com/AndrewZhe/lawyer-llama

  • Introduction: Chinese Law LLaMA

  • Data: 7k legal examination data, 14k legal consultation data

  • Training method: Using Chinese-LLaMA-13B as the base, without continual training on a legal corpus, SFT was performed with general instructions and legal instructions.

financial field

Notable non-Chinese projects such as BloombergGPT and PIXIU are not covered here.

FinGPT

Github: https://github.com/AI4Finance-Foundation/FinGPT

  • Introduction: Financial Big Model

  • Data: From Eastmoney (Oriental Fortune).

  • Training method: Based on ChatGLM-6B, the Lora method is used to train the model.

FinTuo

Github: https://github.com/qiyuan-chen/FinTuo-Chinese-Finance-LLM

  • Introduction: A Chinese financial large model project, aiming to provide an out-of-the-box and easy-to-expand large model tool chain in the financial field.

  • Data: Not yet completed.

  • Training method: Not yet completed.

education field

EduChat

Github: https://github.com/icalk-nlp/EduChat

  • Introduction: An educational dialogue large model built on pre-trained large models. In educational scenarios it provides rich functions such as automatic question generation, homework grading, emotional support, course guidance, and college entrance examination counseling, serving teachers, students, and parents, and helping to realize intelligent education that is tailored to each student, fair, and warm.

  • Data: A mix of multiple open-source Chinese and English instruction and dialogue datasets, about 4 million samples after deduplication.

  • Training method: Based on LLaMA model training.


Source: blog.csdn.net/WASEFADG/article/details/132290116