Paper title: ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases
Paper date: 2023/06/28
Official website: https://www.chatlaw.cloud
Paper address: https://arxiv.org/abs/2306.16092
GitHub address: https://github.com/PKU-YuanGroup/ChatLaw
Abstract
LLMs have demonstrated the potential to revolutionize natural language processing across domains, sparking great interest in vertical-specific large models. However, unlike proprietary models such as BloombergGPT and FinGPT, which have leveraged their unique data accumulation to make progress in the financial field, there are few comparable large language models in the Chinese legal field to facilitate its digital transformation.
This paper proposes an open-source legal large language model, ChatLaw. Because data quality is critical, the authors carefully designed a legal-domain fine-tuning dataset. To address model hallucination when screening legal data during reference retrieval, they combine vector-database retrieval with keyword retrieval, which effectively reduces the inaccuracy of relying on vector retrieval alone. They also propose a self-attention method that strengthens the large model's ability to overcome errors in the reference data, further mitigating hallucination at the model level and improving the model's problem-solving ability.
The first step in applying a large model to a vertical domain: a high-quality domain-specific dataset is required for fine-tuning the LLM.
1. Introduction
The continuous expansion of artificial intelligence has provided fertile soil for the proliferation of large language models. Models such as ChatGPT, GPT-4, LLaMA, Falcon, Vicuna, and ChatGLM have shown remarkable performance on a variety of routine tasks, unlocking great potential for the legal field. However, access to high-quality, relevant, and up-to-date data is clearly a key factor in developing large language models, so building effective and efficient open-source legal language models becomes crucial.
In artificial intelligence, the development of large models has permeated fields such as healthcare, education, and finance: BloombergGPT, FinGPT, Huatuo, and ChatMed have demonstrated their usefulness in handling complex tasks and generating valuable insights. The field of law, however, given its inherent importance and need for accuracy, requires dedicated research and the development of specialized legal models.
Law plays a key role in shaping societies, regulating human interaction, and upholding justice. Legal professionals rely on accurate and up-to-date information to make informed decisions, interpret the law, and provide legal counsel. The complexity of legal language, nuanced interpretations, and the evolving nature of legislation present unique challenges that require tailored solutions. Yet even state-of-the-art models such as GPT-4 often suffer from hallucinations and nonsensical outputs on legal issues. People tend to believe that fine-tuning a model with domain-specific knowledge will yield satisfactory results; in reality, this was not the case with earlier legal LLMs such as LawGPT, which still produced many hallucinated and unreliable outputs.
The authors initially recognized the necessity of a Chinese legal LLM. At the time, however, no commercially usable Chinese model exceeded 13 billion parameters. Therefore, starting from the commercially viable OpenLLaMA model, they expanded the Chinese vocabulary and incorporated training data from MOSS and other sources to create a base Chinese language model. They then trained the legal model, ChatLaw, on legally relevant data.
The main contributions of this paper are as follows:
(1) An effective approach to mitigate hallucination: a method that addresses hallucination by enhancing the training process and incorporating four modules in inference: "consult", "reference", "self-suggestion", and "response". The reference module integrates the vertical model with the knowledge base, injecting domain-specific knowledge into the model and using accurate information from the knowledge base to reduce hallucinations.
(2) An LLM-based legal feature-word extraction model: a model trained to extract words with legal meaning from users' everyday language, effectively identifying and analyzing the legal context in user input.
(3) A BERT-based legal text similarity model: a model trained to measure the similarity between the user's everyday language and a dataset of 937k relevant legal case texts. This enables a vector database for efficient retrieval of similar legal texts, facilitating further analysis and reference.
(4) Construction of a Chinese legal examination test dataset: a dataset dedicated to testing knowledge of the Chinese legal field, together with an ELO arena scoring mechanism for comparing how different models perform on legal multiple-choice questions.
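The consult/reference/self-suggestion/response flow described in contribution (1) can be sketched roughly as follows. Every name here (`knowledge_base.retrieve`, `llm.generate`, the prompt wording) is a hypothetical placeholder for illustration, not ChatLaw's actual API.

```python
# Illustrative sketch of the four-module inference flow: "consult",
# "reference", "self-suggestion", "response". All object methods used
# here are hypothetical placeholders, not ChatLaw's actual API.

def answer_legal_query(query, knowledge_base, llm):
    # 1. Consult: the user's raw question is the consultation input.
    consult = query

    # 2. Reference: retrieve relevant statutes from the knowledge base
    #    and inject them into the prompt as grounding material.
    references = knowledge_base.retrieve(consult, top_k=3)

    # 3. Self-suggestion: remind the model to rely on the cited
    #    provisions and to flag anything they do not support.
    self_suggestion = ("Answer strictly based on the provisions below; "
                       "if they are insufficient, say so instead of guessing.")

    # 4. Response: generate the final answer from the assembled prompt.
    prompt = (f"{self_suggestion}\n\nProvisions:\n"
              + "\n".join(references)
              + f"\n\nQuestion: {consult}\nAnswer:")
    return llm.generate(prompt)
```

The key design point is that the reference module grounds the response in retrieved text, so the generator is steered toward the knowledge base rather than its own parametric memory.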
Furthermore, the authors observe that a single general-purpose legal LLM may not perform optimally on every task in this domain. They therefore train different models for different scenarios, such as multiple-choice questions, keyword extraction, and question answering. Following the approach of HuggingGPT, a large LLM serves as a controller to handle the selection and deployment of these models: the controller dynamically decides which specific model to invoke for each user request, ensuring that the most appropriate model handles each task.
2. Dataset
In constructing the dataset, various methods were adopted to ensure its comprehensiveness and diversity. The dataset is composed as follows:
Collect a large amount of raw legal data (Collection of a vast amount of original legal data)
: including collecting legal news, social media content and discussions in legal industry forums. These sources provide a wide variety of real-world legal texts, providing insights into a variety of legal topics and discussions.
Construction based on laws, regulations and judicial interpretations (Construction based on legal regulations and judicial interpretations)
: In order to ensure comprehensive coverage of legal knowledge, relevant laws, regulations and judicial interpretations are incorporated into the dataset. This ensures that the dataset reflects the legal framework and provides accurate and up-to-date information.
Crawl real legal consultation data (Crawling real legal consultation data)
: Retrieve real legal consultation data and use existing legal consultation data sets. This enables the inclusion of real-world legal scenarios and problems frequently encountered by users, enriching the dataset with practical legal examples.
Construction of Multiple Choice Questions for the Judicial Examination (Construction of multiple-choice questions for the bar exam)
: A set of multiple choice questions specially designed for the judicial examination was created. The questions cover a variety of legal topics and test users' understanding and application of legal principles.
By combining data from these different sources and construction methods, the dataset encompasses a wide range of legal contexts, ensuring that the developed models can effectively understand and address various legal scenarios.
Once these data components are collected, the dataset goes through a rigorous cleaning process, including filtering short and incoherent responses, so that only high-quality, meaningful text remains. In addition, the ChatGPT API is used as an auxiliary construction tool to generate supplementary data based on the existing dataset.
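A minimal sketch of the cleaning step described above (filtering short and incoherent responses) might look like the following; the length threshold and the checks are illustrative assumptions, not the paper's actual criteria.

```python
def clean_dataset(pairs, min_answer_chars=30):
    """Keep only QA pairs whose answer is reasonably long and meaningful.

    `pairs` is a list of (question, answer) tuples. The threshold is an
    illustrative assumption, not the value used in the paper.
    """
    cleaned = []
    for question, answer in pairs:
        answer = answer.strip()
        # Drop very short answers, which tend to be low quality.
        if len(answer) < min_answer_chars:
            continue
        # Drop "answers" that merely repeat the question.
        if answer == question.strip():
            continue
        cleaned.append((question, answer))
    return cleaned
```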
3. Training Process
Keyword LLM is a language model that extracts keywords from the abstract consultation questions raised by users. Law LLM, in turn, extracts the legal provisions that a user's inquiry may involve. ChatLaw LLM is the final language model that outputs responses to the user: it consults the relevant legal provisions and uses its own summarization and question-answering abilities to advise the user.
3.1 ChatLaw LLM
To train ChatLaw, the authors fine-tuned Ziya-LLaMA-13B using Low-Rank Adaptation (LoRA). In addition, self-suggestion roles are introduced to further alleviate the model hallucination problem. Training ran on multiple A100 GPU nodes, with DeepSpeed further reducing the training cost.
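LoRA trains only a small low-rank update on top of frozen weights instead of updating the full weight matrix. A dependency-free sketch of the forward pass, with toy dimensions and the conventional alpha/r scaling (real training would use a library such as PEFT on top of Ziya-LLaMA-13B), is:

```python
# LoRA idea: keep the frozen weight W (d_out x d_in) fixed and train two
# small matrices A (r x d_in) and B (d_out x r) with r << d, computing
# y = W x + (alpha / r) * B (A x). Toy plain-Python matrices only.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=16, r=2):
    """Frozen path W x plus the scaled trainable low-rank path B (A x)."""
    s = alpha / r
    base = matvec(W, x)                 # frozen pretrained weights
    low_rank = matvec(B, matvec(A, x))  # the only trained parameters
    return [b + s * l for b, l in zip(base, low_rank)]
```

Because only A and B are optimized, the number of trainable parameters drops from d_out * d_in to r * (d_out + d_in), which is what makes fine-tuning a 13B model affordable on a few GPU nodes.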
3.2 Keyword LLM
To create a product like ChatLaw that combines an LLM's vertical-domain knowledge with a knowledge base, it is crucial to retrieve information from the knowledge base that is relevant to the user's query. The authors first tried traditional software-engineering approaches such as MySQL and Elasticsearch search, but the results were unsatisfactory. They then used embeddings from a pre-trained BERT model with Faiss to compute cosine similarity and extract the top-k laws and regulations related to the user query. However, this approach tends to produce suboptimal results when the user's question is ambiguous. The paper therefore extracts key information from user queries and designs an algorithm over the vector embeddings of that information to improve matching accuracy.
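The baseline top-k retrieval step can be sketched as follows. This is a pure-Python stand-in: in the real system the vectors would come from a BERT encoder and the search would run on a Faiss index, and the toy vectors in the usage are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_laws(query_vec, law_vecs, k=3):
    """Return the indices of the k statutes most similar to the query.

    Stand-in for BERT embeddings + a Faiss index; shows the logic only.
    """
    order = sorted(range(len(law_vecs)),
                   key=lambda i: cosine(query_vec, law_vecs[i]),
                   reverse=True)
    return order[:k]
```

This is exactly the scheme the authors found brittle for ambiguous questions: a vague query embedding sits far from every statute, so the ranking becomes noisy, motivating the keyword step below.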
Because large models have significant advantages in understanding user queries, the authors fine-tuned an LLM for keyword extraction. After obtaining multiple keywords, the following algorithm retrieves the relevant legal provisions:
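One plausible reading of keyword-based retrieval is to score each statute by accumulating its similarity to every extracted keyword embedding. The sketch below assumes L2-normalized vectors (so a dot product equals cosine similarity, as Faiss's inner-product index is typically used); the paper's exact weighting may differ.

```python
def retrieve_by_keywords(keyword_vecs, law_vecs, k=3):
    """Score each statute by summing its similarity to every extracted
    keyword vector, then return the top-k statute indices.

    Vectors are assumed L2-normalized, so the dot product equals cosine
    similarity. A plausible sketch, not the paper's exact algorithm.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [sum(dot(kw, law) for kw in keyword_vecs)
              for law in law_vecs]
    order = sorted(range(len(law_vecs)), key=scores.__getitem__,
                   reverse=True)
    return order[:k]
```

Matching on several focused keyword embeddings instead of one embedding of the whole query makes the retrieval less sensitive to vague or conversational phrasing.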
3.3 Law LLM
A BERT model is trained on a dataset of 937k national legal case texts to extract the corresponding legal clauses and judicial interpretations from user queries. This Law LLM model forms an important part of the ChatLaw product.
4. Experiment and Analysis
Evaluating the performance of large language models (LLMs) has long been a challenge. To this end, the authors collected more than ten years of national judicial examination questions and compiled a test dataset of 2,000 questions with standard answers to measure the models' ability to handle legal multiple-choice questions.
However, the authors found that the accuracy of these models was generally low, and simply comparing accuracy rates seemed meaningless. Therefore, inspired by the matchmaking mechanism in e-sports and by chatbot arena designs, an ELO-based model-competition scoring mechanism was established to evaluate models' ability to handle legal multiple-choice questions more effectively. Analysis of the experimental results supports the following conclusions: (1) introducing legal question-answering and statutory data improves model performance on multiple-choice questions to some extent; (2) training on a specific task type significantly improves performance on that task; for example, ChatLaw outperforms GPT-4 because the authors used a large number of multiple-choice questions as training data; (3) legal multiple-choice questions require complex logical reasoning, so models with more parameters usually perform better.
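The ELO mechanism can be sketched with the standard rating update: after each head-to-head comparison of two models' answers, the winner gains rating from the loser in proportion to how surprising the result was. The K factor and 400-point scale below are the conventional chess values; the paper's exact constants are not specified here.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One ELO update after a head-to-head answer comparison.

    score_a is 1.0 if model A's answer wins, 0.5 for a draw, 0.0 for a
    loss. K=32 and the 400-point scale are conventional assumptions.
    """
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Running many such pairwise comparisons over the 2,000-question set yields a stable relative ranking even when every model's absolute accuracy is low, which is exactly why the authors preferred it to raw accuracy.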
5. Conclusions
This paper proposes ChatLaw, a large legal language model developed with legal domain knowledge. Specifically, a new method combines the LLM with a vector knowledge base, which significantly alleviates the hallucination problem common to LLMs. Stable model-handling strategies make it possible to solve problems across various legal fields. The authors also published a dataset of legal multiple-choice questions and designed an ELO model-ranking mechanism.
However, limited by the scale of the base model, ChatLaw's performance is not optimal on tasks such as logical reasoning and deduction. Further research is also needed to improve generalization to common tasks after adding large amounts of domain data. ChatLaw carries potential social risks, and the authors advise users to use it for appropriate purposes.