Interpretation of Lawyer LLaMA, fine-tuning a large model for a professional field: dataset construction and model training

Project address: link

Most domain-specific fine-tuning projects follow roughly the same recipe as this one: some start from LLaMA, some from Chinese-LLaMA, and some from other open-source large models. This article is based on my own training process, and also refers to the "Lao Liu Shuo NLP" article "Also Read Lawyer LLaMA Fine-Tuning Large Models in the Legal Field: From Training Data and Model Training to Experimental Results". Starting from the results the model should achieve, it walks through the whole pipeline backwards for your reference.


1. Capabilities the model should focus on


Applying large models in a professional field requires three capabilities:

1. Generate accurate, unambiguous answers. In any professional field, replacing a single word can change the meaning entirely and lead to hugely different outcomes. For example, 定金 and 订金 differ by only one character in Chinese (both are commonly translated as "deposit"), yet their meanings and legal effects in contract law are completely different.

2. Understand and distinguish professional terms. Many concepts appear only in their professional field, such as "Taiwan area" in the legal context. Even widely used words can carry different meanings in a professional field, so the model must interpret them according to the specific sentence context.

3. Identify and analyze real events in professional scenarios. Real-world scenarios are always complex and diverse; the model needs to master legal terminology and apply professional-field knowledge to analyze and answer concrete questions.

To realize these capabilities, the LLaMA model is adapted in three ways:

1. Inject professional-field knowledge: collect a large amount of raw text in the professional field and let the model learn from it through unsupervised continued pre-training.

2. Train domain-specific skills: supervised fine-tuning that teaches the model how to solve domain-specific tasks with the appropriate knowledge.

3. Augment with external knowledge: to make the model's answers more accurate and precise, an information-retrieval module is introduced. Before generating each reply, the system first uses the user's query and contextual information to retrieve relevant legal articles, and the answer is then grounded in these professional-field articles.
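A minimal sketch of this retrieve-then-prompt flow (the function names, prompt template, and toy character-overlap retriever are illustrative assumptions, not the actual Lawyer LLaMA implementation):

```python
def retrieve_articles(query: str, corpus: dict[str, str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank articles by character overlap with the query.
    A real system would use a trained dense or TF-IDF retriever."""
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(set(query) & set(kv[1])),
        reverse=True,
    )
    return [title for title, _ in scored[:top_k]]

def build_prompt(query: str, corpus: dict[str, str]) -> str:
    """Prepend the retrieved articles to the user query before generation."""
    titles = retrieve_articles(query, corpus)
    context = "\n".join(f"[{t}] {corpus[t]}" for t in titles)
    return f"Reference articles:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = {
    "Article 586": "A deposit contract is formed when the deposit is actually paid.",
    "Article 587": "If the party paying the deposit fails to perform, it may not reclaim the deposit.",
}
print(build_prompt("Can I get my deposit back?", corpus))
```

The generated prompt is what the fine-tuned model (or ChatGPT, during data construction) actually sees, so the retrieved articles directly constrain the answer.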

2. Data preparation

1. Pre-training corpus

To improve the model's answering performance in the Chinese professional domain, and to prevent catastrophic forgetting after it learns professional-field data, this work uses two corpora for continued training of the LLaMA model.

The model is first trained on a general multilingual corpus to strengthen its Chinese ability, and then on a corpus from the specialized field.

(1) Multilingual general corpus

Since LLaMA is trained mainly on English and other-language corpora, its understanding and generation of Chinese sentences is imperfect. To address this, general Chinese corpora are collected for continued pre-training, and English corpora are also included as memory replay to avoid catastrophic forgetting. For the general Chinese corpus, articles can be extracted from WuDaoCorpora, CLUECorpus2020, and the Simplified Chinese version of Wikipedia. For the general English corpus, articles are extracted from the C4 corpus.

(2) Chinese professional field corpus

Collect professional-field data through various channels, classify it by subject and source, and then clean and analyze it. Questions to settle include the mixing proportion of each source and how to segment the text into passages for training.
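As a sketch of the segmentation step (the chunk size and the paragraph-aligned strategy are assumptions for illustration, not the project's actual preprocessing):

```python
def chunk_document(text: str, max_chars: int = 512) -> list[str]:
    """Split a long professional document into paragraph-aligned chunks of
    at most `max_chars` characters, so each pre-training sample stays
    semantically coherent instead of being cut mid-sentence."""
    chunks, current = [], ""
    for para in filter(None, (p.strip() for p in text.split("\n"))):
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Paragraph-aligned chunking keeps each statute or case description intact, which matters more in legal text than in general web text.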

2. Instruction fine-tuning data in professional field

(1) General Ability Question Answering Public Dataset Address

https://github.com/chaoswork/sft_datasets/tree/master

(2) Classify the professional-field data to build a fine-tuning dataset, subdivide it into multi-round and single-round dialogues, and use ChatGPT to generate the replies

To cover both single-round and multi-round ability, collect single-round and multi-round dialogue data at the same time; and to improve the accuracy of the generated answers, the retrieved legal articles are added to the prompt to help ChatGPT generate accurate replies.

1) Construction of single-round question-answering data

Let ChatGPT act as the answerer and respond to the client's question. The input prompt requires the generated answer to meet the following requirements:
1. Correctly quote the legal provisions;
2. Correctly understand the meaning of the question and give a well-founded analysis based on the legal provisions;
3. Answer comprehensively and analyze potential possibilities;
4. Raise appropriate questions to uncover facts that help with further answers;
5. Use plain language;
6. Give preliminary opinions and consultation conclusions.
The input to ChatGPT uses a format such as:

{
    "instruction": "阅读以下文章:[],请回答:[]",
    "input": "",
    "output": "[答案]"
}

(The instruction reads "Read the following article: [], please answer: []"; "[答案]" is the answer.)
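A small helper can fill this template when building the dataset (the helper itself is hypothetical; the field names follow the JSON above):

```python
import json

def make_sft_example(article: str, question: str, answer: str) -> dict:
    """Fill the instruction template ("Read the following article: [],
    please answer: []") with a retrieved article and a user question."""
    return {
        "instruction": f"阅读以下文章:[{article}],请回答:[{question}]",
        "input": "",
        "output": answer,
    }

example = make_sft_example("第五百八十六条 ...", "定金能退吗?", "根据第五百八十六条……")
print(json.dumps(example, ensure_ascii=False, indent=2))
```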

The generated replies follow this format (the original post shows a screenshot here).

2) Construction of multi-round question answering data

To generate multi-round dialogues, design two different prompts and let ChatGPT play the two dialogue roles respectively: use the two prompts alternately, feeding the dialogue history back to ChatGPT as input each time.
In this way, a set of single-round dialogues and 5,000 dialogues of 2 or 3 rounds were constructed (the original post shows an example screenshot here).
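The alternating-prompt loop can be sketched as follows; the two role prompts and the `generate` stub are illustrative stand-ins for the actual ChatGPT API calls:

```python
USER_PROMPT = "You are a client consulting a lawyer. Continue the dialogue:\n"
LAWYER_PROMPT = "You are a lawyer answering a client. Continue the dialogue:\n"

def generate(prompt: str) -> str:
    """Stub standing in for a ChatGPT call, so the control flow runs locally."""
    return "..."

def build_dialogue(first_question: str, rounds: int = 3) -> list[tuple[str, str]]:
    """Alternate the two role prompts, passing the full history each time."""
    history = [("user", first_question)]
    for _ in range(rounds - 1):
        transcript = "\n".join(f"{role}: {text}" for role, text in history)
        history.append(("lawyer", generate(LAWYER_PROMPT + transcript)))
        transcript = "\n".join(f"{role}: {text}" for role, text in history)
        history.append(("user", generate(USER_PROMPT + transcript)))
    # Final lawyer reply closes the last round.
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    history.append(("lawyer", generate(LAWYER_PROMPT + transcript)))
    return history
```

Because the entire history is re-serialized into every prompt, each role "sees" what the other said, which is what makes the generated dialogue coherent across rounds.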

3. Retrieval enhancement of external knowledge

For single-round question answering, use a text-retrieval tool to select the top 3 related legal articles and insert them into the prompt.
For multi-round question answering, assume the topic of the dialogue stays unchanged and keep using the same 3 related articles.
Alternatively, use a ready-made text-retrieval framework; see my other blog post on text retrieval systems, or use LangChain (link).

3. Model training

The steps for fine-tuning the open-source LLaMA model proceed stage by stage from S1 to S12, as shown in the figure below.
(figure: the training stages, shown as an image in the original post)

As can be seen, training proceeds stage by stage, and a series of comparative experiments were run:
(table: model performance at different stages, shown as an image in the original post)

The table shows the model's performance on NLP tasks at different stages; for the details of each stage, refer to the figure above. (1)–(6) denote the pre-training corpora or supervised fine-tuning datasets used to train LLaMA at different stages: (1) the multilingual general corpus, (2) the Chinese legal corpus, (3) the general SFT dataset, (4) judicial-examination and legal-consultation data, (5) multi-round legal dialogues, and (6) multi-round legal dialogues with retrieved legal articles added. A tick means the corresponding corpus/dataset was used in a previous stage, while a different marker means it is used for training in the current stage.

1. Improving LLaMA's Chinese ability, S0–S1

To improve LLaMA's Chinese comprehension and generation, LLaMA is continuously pre-trained on the general Chinese corpus. Following Chinese-LLaMA, the vocabulary is expanded with Chinese tokens, and a mixed English–Chinese corpus is used for pre-training: many of the model's complex reasoning abilities likely come from its English training data, and the mixed corpus helps it retain those abilities during continued pre-training.
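The vocabulary-expansion idea can be illustrated with a simplified sketch. Chinese-LLaMA actually merges a SentencePiece model trained on Chinese text and then resizes the embedding matrix; this toy version only shows the id assignment:

```python
def extend_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Append tokens absent from the base vocabulary, assigning fresh ids
    at the end. The model's embedding matrix must then be resized to the
    new vocabulary size, with the new rows initialized and trained."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1 if vocab else 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab

base = {"<s>": 0, "the": 1, "law": 2}
extended = extend_vocab(base, ["法", "律", "law"])
```

Existing ids stay fixed, so the pre-trained embeddings remain valid; only the newly appended rows start from scratch.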

2. Adding professional-field knowledge, S4

Add professional-field texts to the pre-training data so the model acquires domain knowledge.

3. Learning reasoning ability, S7

Collect question–answer pairs from real scenarios in the professional field and ask ChatGPT to provide detailed explanations. During training, the QA pairs are treated as instructions and the model is required to produce the explanations.

4. Learning realistic reply ability, S9

Train the model on the single-round and multi-round question-answering data so it generates appropriate responses to users' specific queries.

5. Improving the reliability of model replies, S12

Introduce a legal-text retrieval module so that the model generates credible responses.
Preliminary experiments showed that even though the model saw these legal articles repeatedly during the continued-training phase, it could not use them correctly at generation time: it might cite irrelevant statutes, or substitute a similarly worded term for one whose meaning is clearly different in the legal field.

A reliable retrieval model is therefore needed to recall the three legal articles most relevant to the user's query. Concretely: collect user consultation questions, ask legal professionals to label each question with the (at most) 3 articles needed to answer it, and then train a text-retrieval model on this data, e.g. a RoBERTa-based dual-tower model. The resulting retriever achieves recall@1 of 0.85 and recall@5 of 0.94 on a held-out test set.
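One common definition of the recall@k metric reported above, as a sketch (the helper is hypothetical; it counts a query as a hit if its top-k list contains at least one gold-labelled article):

```python
def recall_at_k(ranked: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of queries whose top-k retrieved articles contain at least
    one gold-labelled article."""
    hits = sum(1 for r, g in zip(ranked, gold) if set(r[:k]) & g)
    return hits / len(gold)

ranked = [["a586", "a587", "a588"], ["a100", "a586", "a101"]]
gold = [{"a586"}, {"a101"}]
print(recall_at_k(ranked, gold, 1))  # 0.5: only the first query hits at rank 1
print(recall_at_k(ranked, gold, 3))  # 1.0: both gold articles appear in the top 3
```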

Moreover, the work also found that when the retrieved articles are directly concatenated with the user's question as the new input, the model tends to cite the provided articles in its response without distinguishing whether they are relevant to the current context.

4. Experimental results

Collect common English and Chinese tasks from different fields, including natural language inference, sentiment analysis, commonsense reasoning, and dialogue question answering, and test the model's performance at different stages:
(table of results, shown as an image in the original post)

First, comparing the results of s0 and s1, LLaMA gains +5.3% accuracy on C3; on English commonsense reasoning, s1 performs no worse than s0 on SciQ and PIQA. This suggests that pre-training on the multilingual corpus enhances the model's Chinese ability without sacrificing its English ability.

Second, comparing the CMNLI results of s2 vs. s3 and s7 vs. s9/s8/s6, the models of s3/s9/s8/s6 handle the Chinese NLI task better after fine-tuning on judicial-examination examples and legal consultations, with accuracy improving by up to +9.3%.

Finally, the model cannot handle the English NLI and sentiment-analysis tasks: at all stages it outputs only "Yes" for every MRPC instance, and continued training brings no significant improvement on SST-2. The likely reason is that there are not enough English NLI and sentiment-analysis examples in the SFT data, so the model cannot understand the instructions in the prompts for such tasks.

5. Summary

The overall approach to fine-tuning a large model and constructing the datasets is roughly the same across projects. In practice, the hardest part is constructing your own data: the unsupervised corpus is large and takes real effort to clean and analyze, and building good supervised data is harder still, so effective classification and organization of the data is essential.
The core conclusions: adding a retrieval module improves the reliability of question answering, and introducing vertical-domain pre-training data and fine-tuning data both improve domain performance. In practice, one must also consider the mix of domain data and general data, as well as alignment with downstream tasks.

Origin: blog.csdn.net/dream_home8407/article/details/131052716