【AI实战】开源大语言模型LLM汇总

大语言模型
开源大语言模型
参考

大语言模型

大语言模型（LLM）是指使用大量文本数据训练的深度学习模型，可以生成自然语言文本或理解语言文本的含义。大语言模型可以处理多种自然语言任务，如文本分类、问答、对话等，是通向人工智能的一条重要途径。来自百度百科

发展历史

2020年9月，OpenAI授权微软使用GPT-3模型，微软成为全球首个享用GPT-3能力的公司。2022年，Open AI发布ChatGPT模型用于生成自然语言文本。2023年3月15日，Open AI发布了多模态预训练大模型GPT4.0。

2023年2月，谷歌发布会公布了聊天机器人Bard，它由谷歌的大语言模型LaMDA驱动。2023年3月22日，谷歌开放Bard的公测，首先面向美国和英国地区启动，未来逐步在其它地区上线。

2023年2月7日，百度正式宣布将推出文心一言，3月16日正式上线。文心一言的底层技术基础为文心大模型，底层逻辑是通过百度智能云提供服务，吸引企业和机构客户使用API和基础设施，共同搭建AI模型、开发应用，实现产业AI普惠。

开源大语言模型

本文列举了截止到 2023 年 6 月 8 日开源的大语言模型

1、LLaMA

简介
meta 开源的 LLaMA
LLaMA完全是在公共开源预训练数据上训练。并且取得相当不错的效果，LaMA-13B在绝大部分的benchmarks上超越了GPT-3(175 B)，并且LLaMA-65B的效果能够和最好的大模型，Chinchilla-70B以及PaLM-540B相比。
Meta宣称会将LLaMA开源出来的。
论文及代码
论文：https://arxiv.org/abs/2302.13971v1
代码：https://github.com/facebookresearch/llama

2、ChatGLM - 6B

简介
ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于 General Language Model (GLM) 架构，具有 62 亿参数。结合模型量化技术，用户可以在消费级的显卡上进行本地部署（INT4 量化级别下最低只需 6GB 显存）。 ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练，辅以监督微调、反馈自助、人类反馈强化学习等技术的加持，62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。
论文及代码
论文：
代码：https://github.com/THUDM/ChatGLM-6B
官网：https://chatglm.cn/blog
硬件需求
开源协议
本仓库的代码依照 Apache-2.0 协议开源，ChatGLM-6B 模型的权重的使用则需要遵循 Model License。

【个人认为】 ChatGLM-6B 是目前开源的中文大语言模型的佼佼者。

3、Alpaca

简介
论文及代码

4、PandaLLM

简介

Panda: 海外中文开源大语言模型

Panda 系列语言模型目前基于 Llama-7B, -13B, -33B, -65B 进行中文领域上的持续预训练, 使用了接近 15M 条数据, 并针对推理能力在中文 benchmark 上进行了评测, 希望能够为中文自然语言处理领域提供具有泛用性的通用基础工具.

我们的 Panda 模型以及训练涉及的中文数据集将以开源形式发布，任何人都可以免费使用并参与开发。我们欢迎来自全球的开发者一起参与到该项目中，共同推动中文自然语言处理技术的发展。我们后续会进一步完善针对中文语言模型基础能力的评测，同时开放更大规模的模型。
论文及代码
论文：https://arxiv.org/pdf/2305.03025v1.pdf
代码：https://github.com/dandelionsllm/pandallm
模型版本：
模型测评

5、GTP4ALL

简介
Open-source assistant-style large language models that run locally on your CPU.

GPT4All is made possible by our compute partner Paperspace.

GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer grade CPUs.

A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to enforce quality and security alongside spearheading the effort to allow any person or enterprise to easily train and deploy their own on-edge large language models.

论文及代码

代码：https://github.com/nomic-ai/gpt4all

6、DoctorGLM （MedicalGPT-zh v2）

简介
基于 ChatGLM-6B的中文问诊模型
论文及代码
论文：https://arxiv.org/pdf/2304.01097.pdf
代码：https://github.com/xionghonglin/DoctorGLM
huggingface：https://huggingface.co/zhaozh/medical_chat-en-zh
训练数据

7、MedicalGPT-zh v1

简介
本项目开源了基于ChatGLM-6B LoRA 16-bit指令微调的中文医疗通用模型。基于共计28科室的中文医疗共识与临床指南文本，我们生成医疗知识覆盖面更全，回答内容更加精准的高质量指令数据集。以此提高模型在医疗领域的知识与对话能力。
论文及代码
论文：https://arxiv.org/pdf/2304.01097.pdf
代码：https://github.com/MediaBrain-SJTU/MedicalGPT-zh
数据集构建

8、Cornucopia-LLaMA-Fin-Chinese

简介
聚宝盆(Cornucopia): 基于中文金融知识的LLaMA微调模型
本项目开源了经过中文金融知识指令精调/指令微调(Instruct-tuning) 的LLaMA-7B模型。通过中文金融公开数据+爬取的金融数据构建指令数据集，并在此基础上对LLaMA进行了指令微调，提高了 LLaMA 在金融领域的问答效果。

基于相同的数据，后期还会利用GPT3.5 API构建高质量的数据集，另在中文知识图谱-金融上进一步扩充高质量的指令数据集。
论文和代码

代码：https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese/tree/main
模型下载

数据集构建
目前采用了公开和爬取的中文金融领域问答数据，涉及到保险、理财、股票、基金、贷款、信用卡、社保等。

指令微调的训练集数据示例如下：

  问题：办理商业汇票应遵守哪些原则和规定？

  回答: 办理商业汇票应遵守下列原则和规定：1.使用商业汇票的单位，必须是在银行开立帐户的法人；2.商业汇票在同城和异地均可使用；3.签发商业汇票必须以合法的商品交易为基础；4.经承兑的商业汇票，可向银行贴现；5.商业汇票一律记名，允许背书转让；6.商业汇票的付款期限由交易双方商定，最长不得超过6个月；7.商业汇票经承兑后，承兑人即付款人负有到期无条件交付票款的责任；8.商业汇票由银行印制和发售。

针对现有数据仍存在不准确和不完善的地方，后续我们会利用GPT3.5接口围绕中文金融知识库进一步构建与拓展问答数据，设置多种Prompt形式来充分利用知识迭代更新数据集。

9、minGPT

简介
A PyTorch re-implementation of GPT, both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code (see mingpt/model.py). All that’s going on is that a sequence of indices feeds into a Transformer, and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency.
论文及代码

代码：https://github.com/karpathy/minGPT

10、InstructGLM

简介
基于ChatGLM-6B+LoRA在指令数据集上进行微调。
论文及代码
代码：https://github.com/yanqiangmiffy/InstructGLM
开源指令数据集

11、FastChat

简介
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. The core features include:
- The weights, training code, and evaluation code for state-of-the-art models (e.g., Vicuna, FastChat-T5).
- A distributed multi-model serving system with Web UI and OpenAI-compatible RESTful APIs.
论文及代码
代码：https://github.com/lm-sys/FastChat
Model Weights
Vicuna Weights
We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. Instructions:

Get the original LLaMA weights in the Hugging Face format by following the instructions here.
Use the following scripts to get Vicuna weights by applying our delta. They will automatically download delta weights from our Hugging Face account.

在这里插入图片描述

12、Luotuo-Chinese-LLM

简介
骆驼(Luotuo): 开源中文大语言模型
骆驼(Luotuo)项目是由冷子昂 @ 商汤科技, 陈启源 @ 华中师范大学以及李鲁鲁 @ 商汤科技发起的中文大语言模型开源项目，包含了一系列语言模型。
论文及代码

代码：https://github.com/LC1332/Luotuo-Chinese-LLM

13、CamelBell-Chinese-LoRA

简介
同【 12、Luotuo-Chinese-LLM】
论文及代码

代码：https://github.com/LC1332/CamelBell-Chinese-LoRA

其他开源项目，待补充。。。

参考

https://github.com/mymusise/ChatGLM-Tuning
https://huggingface.co/BelleGroup/BELLE-7B-2M
https://github.com/LianjiaTech/BELLE
https://huggingface.co/datasets/BelleGroup/generated_train_0.5M_CN
https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
https://guanaco-model.github.io/
https://github.com/carbonz0/alpaca-chinese-dataset
https://github.com/THUDM/ChatGLM-6B
https://huggingface.co/THUDM/chatglm-6b
https://github.com/lich99/ChatGLM-finetune-LoRA

【AI实战】开源大语言模型LLMs汇总