A collection, summary and introduction of large-scale pre-trained language models in the field of LegalAI (continuously updated)

The gods are silent-personal CSDN blog post directory

Last update date: 2023.6.15
Earliest update date: 2023.6.7

1. General large-scale pre-trained language models

English:

  1. LegalBERT
    1. Original paper: (2020 EMNLP) LEGAL-BERT: The Muppets straight out of Law School - ACL Anthology
    2. Download link: Hugging Face
  2. CaseLaw-BERT
    1. Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
  3. BERTLaw
    1. Original paper: (2021) Sublanguage: A Serious Issue Affects Pretrained Models in Legal Domain
    2. Download link: https://huggingface.co/nguyenthanhasia/BERTLaw
  4. PoL-BERT
    1. Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
  5. legal-longformer
    1. Download link: https://huggingface.co/saibo/legal-longformer-base-4096
  6. LegalLAMA
    1. Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
  7. (India) InLegalBERT
    1. Original paper: (2023 ICAIL) Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law
    2. Download link: https://huggingface.co/law-ai/InLegalBERT

Chinese:

  1. Lawformer
    1. Original paper: (2021) Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents
    2. Download method: thunlp/LegalPLMs: Source code and checkpoints for legal pre-trained language models.

Italian:

  1. ITALIAN-LEGAL-BERT
    1. Original paper: (2022) ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law
    2. Download address: https://huggingface.co/dlicari/Italian-Legal-BERT

Romanian:

  1. jurBERT
    1. Original paper: (2021 NLLP) jurBERT: A Romanian BERT Model for Legal Judgment Prediction

Spanish:

  1. RoBERTalex
    1. Original paper: (2021) Spanish Legalese Language Model and Corpora
    2. Download address: PlanTL-GOB-ES/RoBERTalex · Hugging Face

Multilingual:

  1. ParaLaw Nets (Japanese and English, judging from the paper)
    1. Original paper: (2021 COLIEE) ParaLaw Nets – Cross-lingual Sentence-level Pretraining for Legal Text Processing
    2. Download address: probably this one: nguyenthanhasia/XLM-Paralaw · Hugging Face
  2. LegalXLMs
    1. Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
    2. Download address: too many checkpoints to list here; to be added

Vietnamese:

  1. nguyenthanhasia/VNBertLaw · Hugging Face
  2. PhoBERT
    1. Original paper: (2020 EMNLP) PhoBERT: Pre-trained language models for Vietnamese
    2. Official GitHub project (introduces the address and download method of each pre-trained model checkpoint): VinAIResearch/PhoBERT: PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)

French:

  1. JuriBERT
    1. Original paper: (2022) JuriBERT: A Masked-Language Model Adaptation for French Legal Text
    2. Download address: http://master2-bigdata.polytechnique.fr/resources#juribert (using transformers package)
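
For the checkpoints above that come with a Hugging Face link, loading usually follows the same pattern. The following is a minimal sketch, not an official recipe from any of the papers: it assumes the `transformers` and `torch` packages are installed and that each repository ships a standard config that `AutoModel` can read. The `CHECKPOINTS` table and the helper names `hub_id`, `load_encoder` and `embed` are made up for this illustration; only the repository IDs are taken from the links listed above.

```python
# Hub repository IDs copied from the download links in the list above.
CHECKPOINTS = {
    "BERTLaw": "nguyenthanhasia/BERTLaw",
    "legal-longformer": "saibo/legal-longformer-base-4096",
    "InLegalBERT": "law-ai/InLegalBERT",
    "ITALIAN-LEGAL-BERT": "dlicari/Italian-Legal-BERT",
}


def hub_id(name: str) -> str:
    """Return the Hub repository ID for a model name from the list (hypothetical helper)."""
    if name not in CHECKPOINTS:
        raise KeyError(f"no download link recorded for {name!r}")
    return CHECKPOINTS[name]


def load_encoder(name: str):
    """Download (or read from the local cache) the tokenizer and encoder weights.

    Assumption: the repository can be loaded with the generic Auto* classes.
    """
    # Imported lazily so the lookup table can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    repo = hub_id(name)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    model.eval()  # inference only, disables dropout
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return the hidden state of the first token as a simple text representation."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, hidden_size); position 0 is the [CLS]/<s> token.
    return outputs.last_hidden_state[:, 0]
```

Calling `load_encoder("InLegalBERT")` and then `embed("some legal text", tokenizer, model)` would give one vector per input text. Models without a Hugging Face link in the list (for example Lawformer or JuriBERT) should be downloaded as their own project pages describe.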

2. Dialogue Model

Chinese:

  1. Lawyer LLaMA
    1. Original paper: (2023) Lawyer LLaMA Technical Report
    2. Official GitHub project: AndrewZhe/lawyer-llama (Chinese legal LLaMA)
    3. Online demo: you can apply directly for access to the web version (only 100 access slots are given out for now; the number is said to be adjusted dynamically later, which probably means more will be opened if the budget allows)
    4. Local deployment: lawyer-llama-13b-beta1.0 is public (lawyer-llama/run_inference.md at main · AndrewZhe/lawyer-llama · GitHub), but it requires the original LLaMA weights. I am still on the LLaMA waiting list, so this has to wait

English:

  1. LawGPT 1.0
    The name sounds authoritative, but in fact nothing has been released, so there is nothing to back up the claims.
    1. Original paper: A Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3

3. Sentence segmentation

Multilingual:

  1. https://huggingface.co/models?search=rcds/distilbert-sbd (English, Spanish, German, Italian, Portuguese, French)
    1. Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

4. Text Classification

Multilingual:

  1. PyEuroVoc (languages of EU member states and candidate states): classifies documents by EuroVoc descriptors; based on BERT
    1. Original paper: (2021 RANLP) PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors
    2. Download address: https://pypi.org/project/pyeurovoc/

5. Information extraction

  1. FPDM
    This work adapts an open-domain model to specific domains. In the legal field, its main task is contract review (extracting key information).
    1. Original paper: (2023) FPDM: Domain-Specific Fast Pre-training Technique using Document-Level Metadata
    2. Code and dataset: https://drive.google.com/drive/folders/1RT7g_cTR_twz75xmFjDgQmCPWC8sZSFK


Origin blog.csdn.net/PolarisRisingWar/article/details/130746106