ChatLaw: A Chinese Legal Large Language Model


  Paper title: ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases
  Paper date: 2023/06/28
  Official website address: https://www.chatlaw.cloud
  Paper address: https://arxiv.org/abs/2306.16092
  GitHub address: https://github.com/PKU-YuanGroup/ChatLaw

Abstract

  LLMs have demonstrated the potential to revolutionize natural language processing across many domains, sparking great interest in vertical, domain-specific large models. However, unlike proprietary models such as BloombergGPT and FinGPT, which have leveraged their unique data accumulation to make progress in the financial field, there are few comparable large language models in the Chinese legal domain to facilitate its digital transformation.
  This paper proposes ChatLaw, an open-source large language model for the legal domain. Because data quality is paramount, the authors carefully designed a legal-domain fine-tuning dataset. In addition, to address model hallucination when screening legal reference data during retrieval, a method combining vector-database retrieval with keyword retrieval is proposed, which effectively reduces the inaccuracy of relying on vector-database retrieval alone. Furthermore, a self-attention method is proposed to strengthen the model's ability to overcome errors present in the reference data, further mitigating hallucination at the model level and improving the model's problem-solving ability.

  The first step in applying a large model to a vertical domain: a high-quality, domain-specific dataset is required for fine-tuning the LLM.


1. Introduction

  The continuous expansion of artificial intelligence has provided fertile soil for the proliferation of large language models. Models such as ChatGPT, GPT-4, LLaMA, Falcon, Vicuna, and ChatGLM have shown remarkable performance on a variety of routine tasks, unlocking great potential for the legal field. At the same time, access to high-quality, relevant, and up-to-date data is clearly a key factor in developing large language models, so building effective and efficient open-source legal language models becomes crucial.
  Large models have already permeated fields such as healthcare, education, and finance: BloombergGPT, FinGPT, HuaTuo, and ChatMed have demonstrated their usefulness in handling complex tasks and generating valuable insights. The legal field, however, given its inherent importance and demand for accuracy, calls for dedicated research and the development of specialized legal models.
  Law plays a key role in shaping societies, regulating human interaction, and upholding justice. Legal professionals rely on accurate and up-to-date information to make informed decisions, interpret the law, and provide legal counsel. The complexity of legal language, nuanced interpretation, and the evolving nature of legislation present unique challenges that require tailored solutions. Yet even state-of-the-art models like GPT-4 often suffer from hallucinations and nonsensical output on legal questions. People tend to believe that fine-tuning a model with domain-specific knowledge will yield satisfactory results; in reality, this was not the case with earlier legal LLMs such as LawGPT, which still produce many hallucinated and unreliable outputs.
  The authors recognized early on the need for a Chinese legal LLM. At the time, however, no commercially usable Chinese model exceeded 13 billion parameters. Therefore, starting from the commercially viable OpenLLaMA model, they expanded the Chinese vocabulary and incorporated training data from MOSS and other sources to create a base Chinese language model. They then trained the legal model, ChatLaw, on top of it with legally relevant data.
  The main contributions of this paper are as follows:
  (1) An effective approach to mitigate hallucination: a method that enhances the model's training process and adds four modules to inference: "consult", "reference", "self-suggestion", and "response". The reference module connects the vertical model with the knowledge base, injecting domain-specific knowledge into the model and using the accurate information in the knowledge base to reduce hallucinations.
  (2) An LLM-based legal feature-word extraction model: a model trained to extract legally meaningful words from users' everyday language, which can effectively identify and analyze the legal context in user input.
  (3) A BERT-based legal text similarity model: a model trained to measure the similarity between users' everyday language and a dataset of roughly 930,000 relevant legal case texts. This enables a vector database for efficient retrieval of similar legal texts, facilitating further analysis and reference.
  (4) Construction of a Chinese legal examination test dataset: a dataset dedicated to testing knowledge of Chinese law, together with an ELO arena scoring mechanism for comparing the performance of different models on legal multiple-choice questions (a minimal sketch of the ELO update follows this list).
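  As a concrete reference for the arena mechanism in (4), here is a minimal sketch of the standard ELO rating update; the K-factor and starting ratings are illustrative assumptions, since this summary does not give the paper's exact parameters.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update ELO ratings after one head-to-head match.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a draw.
    """
    # Expected score of A against B under the logistic ELO model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: both models start at 1500; model A answers a legal
# multiple-choice question correctly while model B does not.
print(elo_update(1500, 1500, score_a=1.0))  # -> (1516.0, 1484.0)
```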
  Furthermore, the authors observe that a single general-purpose legal LLM may not perform optimally on every task in this domain. Therefore, different models are trained for different scenarios such as multiple-choice questions, keyword extraction, and question answering. To handle the selection and deployment of these models, the method provided by HuggingGPT is used, with a large LLM acting as the controller. The controller dynamically determines which specific model to invoke for each user request, ensuring that the most appropriate model handles a given task.

2. Dataset

  In constructing the dataset, various methods were adopted to ensure its comprehensiveness and diversity. The dataset was assembled from the following components:
  Collection of a vast amount of original legal data: legal news, social-media content, and discussions in legal-industry forums. These sources provide a wide variety of real-world legal texts and insights into diverse legal topics and discussions.

[Figure: example of collected raw legal data]

  Construction based on legal regulations and judicial interpretations: to ensure comprehensive coverage of legal knowledge, relevant laws, regulations, and judicial interpretations are incorporated into the dataset, so that it reflects the legal framework and provides accurate, up-to-date information.

[Figure: example constructed from legal regulations and judicial interpretations]
  Crawling real legal consultation data: real legal consultation data is retrieved and existing legal consultation datasets are reused. This brings in real-world legal scenarios and the problems users frequently encounter, enriching the dataset with practical examples.

[Figure: example of crawled legal consultation data]
  Construction of multiple-choice questions for the bar exam: a set of multiple-choice questions specially designed for the judicial examination was created, covering a variety of legal topics and testing the understanding and application of legal principles.

[Figure: example multiple-choice question for the judicial examination]
  By combining data from these different sources and construction methods, the dataset covers a wide range of legal contexts, ensuring that the resulting models can effectively understand and address diverse legal scenarios.
  Once collected, the data goes through a rigorous cleaning process, including filtering out short and incoherent responses so that only high-quality, meaningful text remains. Furthermore, the ChatGPT API is used as an auxiliary construction tool, enabling the generation of supplementary data based on the existing dataset.
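  As a rough illustration of that augmentation step, the sketch below uses the pre-1.0 openai Python client (current when the paper appeared); the prompt wording, model choice, and the augment_example helper are hypothetical, since the blog does not describe the actual prompts used.

```python
import openai  # openai<1.0 interface, contemporary with the paper

openai.api_key = "sk-..."  # set your own key

def augment_example(question: str, answer: str) -> str:
    """Ask ChatGPT to paraphrase an existing consultation pair into a
    new training example. The prompt below is illustrative only."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You rewrite Chinese legal Q&A pairs "
             "into new, fluent consultation examples without changing the law."},
            {"role": "user", "content": f"问题：{question}\n回答：{answer}\n"
             "请改写成一条新的法律咨询问答。"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```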

3. Training Process

  Keyword LLM is a language model that extracts keywords from the abstract consultation questions raised by users. Law LLM extracts the legal provisions that a user's inquiry may involve. ChatLaw LLM is the final language model that outputs the response to the user: it consults the relevant legal provisions and uses its own summarization and question-answering abilities to advise the user.
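  To make this division of labor concrete, here is a schematic sketch of how the three models might cooperate at inference time; the function names and prompt wording are our own stand-ins, not the authors' released code.

```python
from typing import Callable, List

def answer_consultation(
    query: str,
    keyword_llm: Callable[[str], List[str]],   # extracts legal keywords
    law_llm: Callable[[str], List[str]],       # suggests candidate provisions
    retrieve: Callable[[List[str]], List[str]],# knowledge-base lookup by keyword
    chatlaw_llm: Callable[[str], str],         # final answering model
) -> str:
    """Schematic ChatLaw-style pipeline: consult -> reference ->
    self-suggestion -> response."""
    keywords = keyword_llm(query)                 # "consult" step
    references = law_llm(query) + retrieve(keywords)  # "reference" step
    prompt = (
        f"Question: {query}\n"
        f"Reference provisions: {references}\n"
        # "self-suggestion": remind the model the references may be imperfect
        "Note: some references may be irrelevant; rely only on applicable law.\n"
        "Answer:"
    )
    return chatlaw_llm(prompt)                    # "response" step
```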

3.1 ChatLaw LLM

  To train ChatLaw, the authors fine-tuned Ziya-LLaMA-13B using Low-Rank Adaptation (LoRA). In addition, a self-suggestion role is introduced to further alleviate model hallucination. Training runs on multiple A100 GPU nodes, with DeepSpeed used to further reduce training cost.
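  A minimal sketch of this setup with Hugging Face peft is shown below; the target modules, rank, and other hyperparameters are illustrative assumptions, and Ziya-LLaMA-13B must first be reconstructed from its delta weights.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumes the Ziya weights have already been merged with LLaMA locally.
base = "IDEA-CCNL/Ziya-LLaMA-13B-v1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# Low-rank adapters on the attention projections; r and alpha are illustrative.
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights train
# ...then train with transformers.Trainer (optionally under DeepSpeed)
# on the legal fine-tuning dataset.
```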

3.2 Keyword LLM

  To build a product like ChatLaw that combines an LLM's vertical-domain knowledge with a knowledge base, it is crucial to retrieve information from the knowledge base that is relevant to the user's query. The authors first tried traditional software approaches such as MySQL and Elasticsearch search, but the results were unsatisfactory. They then used a pre-trained BERT model to compute embeddings and Faiss to compute cosine similarity, extracting the top-k laws and regulations related to the user query. However, this approach tends to produce suboptimal results when the user's question is vague. The paper therefore extracts key information from user queries and uses the vector embeddings of that information to design an algorithm with better matching accuracy.
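  A minimal sketch of that embedding-plus-Faiss step follows, using an off-the-shelf Chinese sentence encoder as a stand-in for the authors' BERT (the checkpoint name is an assumption); with L2-normalized vectors, inner-product search is exactly cosine similarity.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in encoder; the paper trains its own BERT on legal texts.
encoder = SentenceTransformer("shibing624/text2vec-base-chinese")

laws = ["中华人民共和国劳动法 第五十条 工资应当以货币形式按月支付...",
        "中华人民共和国民法典 第五百七十七条 当事人一方不履行合同义务..."]
law_vecs = encoder.encode(laws, normalize_embeddings=True)

index = faiss.IndexFlatIP(law_vecs.shape[1])       # inner product on unit
index.add(np.asarray(law_vecs, dtype="float32"))   # vectors = cosine sim

query_vec = encoder.encode(["公司拖欠工资怎么办？"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {laws[i]}")
```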
  Because large models have a significant advantage in understanding user queries, the authors fine-tuned an LLM for keyword extraction. After obtaining multiple keywords, the following algorithm is used to retrieve the relevant legal provisions:

[Figure: keyword-based legal provision retrieval algorithm]
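  The exact procedure is given in the paper's figure; one plausible reading, sketched below under our own formulation rather than as a transcription of the paper's algorithm, is to embed each extracted keyword separately, aggregate the per-keyword cosine similarities for every statute, and return the top-scoring statutes.

```python
import numpy as np

def retrieve_by_keywords(keyword_vecs: np.ndarray,
                         law_vecs: np.ndarray,
                         top_k: int = 5):
    """Score each law by summing its cosine similarity to every keyword
    vector, then return the indices and scores of the top_k laws.
    All vectors are assumed L2-normalized, so dot product = cosine."""
    sims = keyword_vecs @ law_vecs.T          # (n_keywords, n_laws)
    scores = sims.sum(axis=0)                 # aggregate over keywords
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]
```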

3.3 Law LLM

  A BERT model is trained on a dataset of 937k national court cases to extract the corresponding legal provisions and judicial interpretations from user queries. This Law LLM model forms an important part of the ChatLaw product.
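  A hedged sketch of how such a similarity encoder could be trained with sentence-transformers follows; the pairing scheme, loss, and toy examples are assumptions, as the blog only states that a BERT model was trained on the 937k cases.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Wraps bert-base-chinese with mean pooling as a sentence encoder.
model = SentenceTransformer("bert-base-chinese")

# Toy (query, provision, similarity) pairs; the real data would be the
# 937k case texts paired with their cited provisions.
train_examples = [
    InputExample(texts=["公司拖欠工资怎么办？",
                        "劳动法 第五十条 工资应当按月支付..."], label=1.0),
    InputExample(texts=["公司拖欠工资怎么办？",
                        "民法典 第一千零七十九条 离婚诉讼..."], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```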


4. Experiment and Analysis

  Evaluating the performance of large language models (LLMs) has always been a challenge. To this end, more than ten years of national judicial examination questions were collected, and a test dataset containing 2,000 questions with standard answers was compiled to measure models' ability to handle legal multiple-choice questions.
  However, the authors found that the accuracy of these models was generally low, and in that regime simply comparing accuracy rates is not very meaningful. Therefore, inspired by the matchmaking mechanism in e-sports and by the design of Chatbot Arena, an ELO-based arena scoring mechanism is established to evaluate models' ability to handle legal multiple-choice questions more effectively. Analysis of the experimental results supports the following conclusions:
  (1) Introducing legal question-answering and statute data improves model performance on multiple-choice questions to a certain extent.
  (2) Adding a specific task type to the training data significantly improves performance on that task; for example, ChatLaw outperforms GPT-4 here because the authors used a large number of multiple-choice questions as training data.
  (3) Legal multiple-choice questions require complex logical reasoning, so models with more parameters usually perform better.

[Figure: arena results (model ELO scores on legal multiple-choice questions)]

5. Conclusions

  This paper proposes ChatLaw, a large legal language model developed with legal-domain knowledge. In particular, a new method is proposed for combining the LLM with a vector knowledge base, which significantly alleviates the hallucination problem common to LLMs. The resulting stable model-handling strategy makes it possible to solve problems across various legal fields. The authors also publish a dataset of legal multiple-choice questions and design an ELO-based model ranking mechanism.
  However, a limitation emerges from the limited size of the base model: ChatLaw's performance on tasks such as logical reasoning and deduction is not optimal. In addition, further research is needed on how to preserve ChatLaw's generalization to general-purpose tasks after adding large amounts of domain data. There are also potential social risks, and the authors advise users to apply the method only for appropriate purposes.


  Follow the WeChat public account 夏小悠 for more articles, papers, PPTs, and other materials ^_^
