A Curated Overview of Public LegalAI Datasets (continuously updated)

诸神缄默不语: index of my CSDN blog posts

Last updated: 2023.6.13
First published: 2023.6.7

1. Legal Judgment Prediction

Chinese:

  1. CAIL2018
    Criminal law
    1. Original papers: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction
      Overview of CAIL2018: Legal Judgment Prediction Competition
    2. Data download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip (for details on the data beyond the papers above, see thunlp/CAIL: Chinese AI & Law Challenge)
    3. Task: (classification) predict the applicable law articles, charges, and prison term
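
The three CAIL2018 prediction targets can be read off a single record roughly as follows. This is a minimal sketch, not the official loader: it assumes the JSON-lines layout of the public release as I understand it (a `fact` string plus a `meta` object with `relevant_articles`, `accusation`, and `term_of_imprisonment`), and the sample record and the `extract_labels` helper are made up for illustration.

```python
import json

# Illustrative record only. The field names are an assumption about the
# public CAIL2018 JSON-lines files; the values are invented for this sketch.
SAMPLE_LINE = json.dumps({
    "fact": "The defendant took a mobile phone from the victim's bag.",
    "meta": {
        "relevant_articles": [264],
        "accusation": ["盗窃"],  # "theft"
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 6,
        },
    },
}, ensure_ascii=False)


def extract_labels(line):
    """Turn one JSON line into the three prediction targets (hypothetical helper)."""
    record = json.loads(line)
    meta = record["meta"]
    term = meta["term_of_imprisonment"]
    # Death penalty and life imprisonment are kept as separate classes;
    # otherwise the numeric term is used (assumed to be in months).
    if term["death_penalty"]:
        term_label = "death"
    elif term["life_imprisonment"]:
        term_label = "life"
    else:
        term_label = int(term["imprisonment"])
    return {
        "fact": record["fact"],
        "articles": sorted(int(a) for a in meta["relevant_articles"]),
        "charges": list(meta["accusation"]),
        "term": term_label,
    }


labels = extract_labels(SAMPLE_LINE)
print(labels["articles"], labels["charges"], labels["term"])
```

Before relying on these field names, check them against the files in the downloaded archive.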

2. General Corpora

Multilingual:

  1. MultiLegalPile
    1. Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
    2. Data download: https://huggingface.co/datasets/joelito/Multi_Legal_Pile
    3. Data included in the project:
      1. https://huggingface.co/datasets/joelito/eurlex_resources
      2. https://huggingface.co/datasets/joelito/legal-mc4
      3. Pile of Law
  2. LeXFiles
    1. Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
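
The Hugging Face links above all have the form `https://huggingface.co/datasets/<owner>/<name>`, and the `<owner>/<name>` part is the identifier the Hugging Face `datasets` library expects. As a small sketch, the hypothetical helper `hf_repo_id` below extracts that identifier. It assumes the owner/name URL form and deliberately does not download anything.

```python
from urllib.parse import urlparse

# Dataset URLs copied from the list above.
DATASET_URLS = [
    "https://huggingface.co/datasets/joelito/Multi_Legal_Pile",
    "https://huggingface.co/datasets/joelito/eurlex_resources",
    "https://huggingface.co/datasets/joelito/legal-mc4",
]


def hf_repo_id(url):
    """Return the 'owner/name' part of a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expect a path like: datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not an owner/name Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


repo_ids = [hf_repo_id(u) for u in DATASET_URLS]
print(repo_ids)
```

Each identifier can then be passed to `datasets.load_dataset`. Corpora of this size may also need a configuration name or streaming mode, so check the individual dataset cards before loading.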

Spanish:

  1. Spanish Legal Domain Corpora
    1. Original paper: (2021) Spanish Legalese Language Model and Corpora
    2. Data download: Spanish Legal Domain Corpora | Zenodo

English:

  1. CaseHOLD
    English Harvard Law case corpus (1965-2021)
    1. Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
  2. Pile of Law
    1. Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
    2. Data download: https://huggingface.co/datasets/pile-of-law/pile-of-law

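Chinese:

  1. Legal consultation data from 华律网 (China law network) and the corpus used in the accompanying thesis; the thesis published alongside it: Design and research of legal consultation text classification system
    The legal consultation data and corpus of the thesis from China law network. Replication Data for: Design and research of legal consultation text classification system. - Data Driven Innovation Research Competition for University of China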

3. Other Integrated Projects

Multilingual:

  1. LexGLUE
    coastalcph/lex-glue: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
    1. Original paper: (2021) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
  2. LEXTREME
    1. Original paper: (2023) LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
    2. Data download: https://huggingface.co/datasets/joelito/lextreme

Not yet sorted through:

  1. https://github.com/neelguha/legal-ml-datasets

4. Reasoning

  1. legalbench
    1. Original paper: (2022) LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
    2. Data download: https://github.com/HazyResearch/legalbench

English:

  1. SARA: roughly, reasoning about whether a given statute applies to a given situation (9 sections of the US tax code)
    1. Original paper: (2020) A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering

5. NLU

  1. SemEval 2023 Task 6: LegalEval - Understanding Legal Texts
    1. Tasks: rhetorical role labeling, named entity recognition, and explainable legal judgment prediction
  2. MAUD
    1. Original paper: (2023) MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding
    2. Data download: https://drive.google.com/drive/folders/1RujOK2FZKdFSCJ15tqdyd42g8WLsYagj

6. NLG

1 QA

Chinese:

  1. JEC-QA
    Dataset built from China's National Judicial Examination
    https://jecqa.thunlp.org/
    1. Original paper: (2020 AAAI) JEC-QA: A Legal-Domain Question Answering Dataset

Vietnamese:

  1. (Traffic law) (2017 KSE) Question analysis for Vietnamese legal question answering

2 Text Summarization

English:

  1. BillSum
    1. Original paper: (2019 WS) BillSum: A Corpus for Automatic Summarization of US Legislation
    2. Data download: billsum · Datasets at Hugging Face
  2. VerbCL (one-sentence summarization / highlight extraction based on the case citation graph)
    1. Original paper: (2021 CIKM) VerbCL: A Dataset of Verbatim Quotes for Highlight Extraction in Case Law
    2. Data download: https://uvaauas.figshare.com/articles/dataset/VerbCL_Dataset/14798878/1

Multilingual:

  1. EUR-Lex-Sum (24 official European languages)
    1. Original paper: (2022 EMNLP) EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain
    2. Data download: dennlinger/eur-lex-sum · Datasets at Hugging Face

7. Information Extraction

1 Named Entity Recognition

Portuguese (Brazil):

  1. CDJUR-BR
    1. Original paper: (2023) CDJUR-BR – A Golden Collection of Legal Document from Brazilian Justice with Fine-Grained Named Entities

2 Sentence Boundary Detection (sentence splitting)

Multilingual:

  1. MultiLegalSBD (English, Spanish, German, Italian, Portuguese, French)
    1. Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
    2. Data download: https://huggingface.co/datasets/rcds/MultiLegalSBD

3 Argument Mining

  1. English
    1. Original paper: (2023) Mining Legal Arguments in Court Decisions
    2. Download: trusthlt/mining-legal-arguments: Mining Legal Arguments in Court Decisions - Data and software (https://github.com/trusthlt/mining-legal-arguments)

8. Other Tasks

Structured:

  1. DiscoveringTheRationaleOfDecisions (for extracting the rationale from judgment outcomes; I have not yet looked into what exactly it does)
    1. Original paper: (2021 ICAIL) Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning
    2. For the data download, see the official GitHub project: CorSteging/DiscoveringTheRationaleOfDecisions: Discovering the Rationale of Decisions

  1. GENTLE (out-of-domain evaluation for English, including legal documents)
    1. Original paper: (2023 ACL) GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
    2. Download: gucorpling/gentle: Repository for the GENTLE corpus

9. Fairness

Multilingual:

  1. FairLex
    1. Original paper: (2022 ACL) FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
    2. Data download: coastalcph/fairlex · Datasets at Hugging Face

Reposted from blog.csdn.net/PolarisRisingWar/article/details/126058246