Large-scale pre-trained language models for LegalAI: a roundup, summary, and introduction (continuously updated...)

诸神缄默不语 - personal CSDN blog post index

Last updated: 2023.6.15
First published: 2023.6.7

1. General-purpose large-scale pre-trained language models

English:

  1. LegalBERT
    1. Original paper: (2020 EMNLP) LEGAL-BERT: The Muppets straight out of Law School - ACL Anthology
    2. Download: huggingface
  2. CaseLaw-BERT
    1. Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
  3. BERTLaw
    1. Original paper: (2021) Sublanguage: A Serious Issue Affects Pretrained Models in Legal Domain
    2. Download: https://huggingface.co/nguyenthanhasia/BERTLaw
  4. PoL-BERT
    1. Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
  5. legal-longformer
    1. Download: https://huggingface.co/saibo/legal-longformer-base-4096
  6. LegalLAMA
    1. Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
  7. (India) InLegalBERT
    1. Original paper: (2023 ICAIL) Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law
    2. Download: https://huggingface.co/law-ai/InLegalBERT

Chinese:

  1. Lawformer
    1. Original paper: (2021) Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents
    2. Download: thunlp/LegalPLMs: Source code and checkpoints for legal pre-trained language models.

Italian:

  1. ITALIAN-LEGAL-BERT
    1. Original paper: (2022) ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law
    2. Download: https://huggingface.co/dlicari/Italian-Legal-BERT

Romanian:

  1. jurBERT
    1. Original paper: (2021 NLLP) jurBERT: A Romanian BERT Model for Legal Judgement Prediction

Spanish:

  1. RoBERTalex
    1. Original paper: (2021) Spanish Legalese Language Model and Corpora
    2. Download: PlanTL-GOB-ES/RoBERTalex · Hugging Face

Multilingual:

  1. ParaLaw Nets (judging from the paper, Japanese and English)
    1. Original paper: (2021 COLIEE) ParaLaw Nets – Cross-lingual Sentence-level Pretraining for Legal Text Processing
    2. Download: my guess is that it is this one: nguyenthanhasia/XLM-Paralaw · Hugging Face
  2. LegalXLMs
    1. Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
    2. Download: too many checkpoints to list here, to be added

Vietnamese:

  1. nguyenthanhasia/VNBertLaw · Hugging Face
  2. PhoBERT
    1. Original paper: (2020 EMNLP) PhoBERT: Pre-trained language models for Vietnamese
    2. Official GitHub project (lists where each pre-trained checkpoint lives and how to download it): VinAIResearch/PhoBERT: PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)

French:

  1. JuriBERT
    1. Original paper: (2022) JuriBERT: A Masked-Language Model Adaptation for French Legal Text
    2. Download: http://master2-bigdata.polytechnique.fr/resources#juribert (used through the transformers package)
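
Most of the download links above point at Hugging Face repositories, so the usual transformers loading pattern should apply. The following is a minimal sketch rather than a tested recipe: the repo IDs are copied from the links in this list, while the names `LEGAL_PLMS`, `hub_url`, and `load_legal_plm` are hypothetical helpers made up for illustration, and it is an assumption that every checkpoint loads cleanly through `AutoTokenizer` and `AutoModel`.

```python
# Hugging Face repo IDs copied from the download links listed above.
LEGAL_PLMS = {
    "BERTLaw": "nguyenthanhasia/BERTLaw",
    "legal-longformer": "saibo/legal-longformer-base-4096",
    "InLegalBERT": "law-ai/InLegalBERT",
    "ITALIAN-LEGAL-BERT": "dlicari/Italian-Legal-BERT",
    "RoBERTalex": "PlanTL-GOB-ES/RoBERTalex",
    "ParaLaw Nets": "nguyenthanhasia/XLM-Paralaw",
    "VNBertLaw": "nguyenthanhasia/VNBertLaw",
}


def hub_url(name):
    """Return the Hugging Face page for a model in LEGAL_PLMS (hypothetical helper)."""
    return "https://huggingface.co/" + LEGAL_PLMS[name]


def load_legal_plm(name):
    """Download the tokenizer and encoder for a model in LEGAL_PLMS.

    transformers is imported lazily so the catalog itself can be used
    without the library installed. Assumption: the checkpoint is
    compatible with the generic Auto* classes.
    """
    from transformers import AutoModel, AutoTokenizer

    repo_id = LEGAL_PLMS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id)
    return tokenizer, model
```

Calling `load_legal_plm("InLegalBERT")` would download the checkpoint on first use and return a `(tokenizer, model)` pair. Models hosted elsewhere (Lawformer, JuriBERT) are left out on purpose, since their loading instructions live on their own project pages.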

2. Dialogue models

Chinese:

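  1. Lawyer LLaMA
    1. Original paper: (2023) Lawyer LLaMA Technical Report
    2. Official GitHub project: AndrewZhe/lawyer-llama: 中文法律LLaMA
      Online web demo: you can apply for access directly (only 100 uses are granted; reportedly the quota will be adjusted dynamically later, which I take to mean more uses once there is more funding)
      Local deployment: lawyer-llama-13b-beta1.0 has been released (lawyer-llama/run_inference.md at main · AndrewZhe/lawyer-llama · GitHub), but it requires the LLaMA weights, and I am still in the queue for LLaMA access, so this will have to wait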

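English:

  1. LawGPT 1.0
    The name sounds very official and imposing, but in practice nothing was released, so it has a bit of a "pics or it didn't happen" feel.
    1. Original paper: A Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3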

3. Sentence segmentation

Multilingual:

  1. https://huggingface.co/models?search=rcds/distilbert-sbd (English, Spanish, German, Italian, Portuguese, French)
    1. Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
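
To show why a dedicated sentence boundary detector helps on legal text, here is a small sketch that compares a naive split on ". " with a split driven by predicted boundary offsets. Everything in it is an assumption for illustration: `split_at_boundaries` is a hypothetical helper, the example sentence is made up, and the offsets are written by hand rather than produced by any of the models above, whose actual output format is not covered here.

```python
def split_at_boundaries(text, boundary_ends):
    """Cut text into sentences at the given end offsets (exclusive)."""
    sentences = []
    start = 0
    for end in sorted(boundary_ends):
        piece = text[start:end].strip()
        if piece:
            sentences.append(piece)
        start = end
    # Whatever follows the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The court cited Art. 5 para. 2 of the Act. The appeal was dismissed."

# Naive approach: legal abbreviations such as "Art." and "para." are
# wrongly treated as sentence ends.
naive = text.split(". ")

# Hypothetical model output: a single boundary right after "the Act."
predicted_ends = [text.index("the Act.") + len("the Act.")]
sentences = split_at_boundaries(text, predicted_ends)

print(len(naive), "naive pieces")
print(sentences)
```

The naive split breaks the citation into fragments, while the boundary-driven split keeps "Art. 5 para. 2" inside one sentence.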

4. Text classification

Multilingual:

  1. PyEuroVoc (languages of EU member states and candidate countries): classifies documents by EuroVoc descriptors. BERT-based.
    1. Original paper: (2021 RANLP) PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors
    2. Download: https://pypi.org/project/pyeurovoc/

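5. Information extraction

  1. FPDM
    The original model is about transferring from the open domain to a specific domain; in the legal domain the main task is contract review (extracting key information).
    1. Original paper: (2023) FPDM: Domain-Specific Fast Pre-training Technique using Document-Level Metadata
    2. Code and datasets are provided: https://drive.google.com/drive/folders/1RT7g_cTR_twz75xmFjDgQmCPWC8sZSFK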


Reposted from blog.csdn.net/PolarisRisingWar/article/details/130746106