1. Download the ChatGLM2 code

The code is in the GitHub repository THUDM/ChatGLM2-6B (ChatGLM2-6B: An Open Bilingual Chat LLM).

2. Download the ChatGLM2-6B model

Use git lfs clone to fetch the THUDM/chatglm2-6b repository from Hugging Face.

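If the clone fails with the error "OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443", clear the global Git proxy setting:

git config --global --unset http.proxy

Then rerun the git lfs clone xxx command; it may take several attempts before it succeeds.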
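Retrying by hand can be automated. The following is a minimal sketch, not part of the original tutorial: `run_with_retries` is a hypothetical helper, and a harmless stand-in command is used so the snippet runs anywhere. Replace the stand-in with the real clone command when using it.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=5, delay_seconds=0.0):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns the number of attempts used; raises RuntimeError on failure.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{attempts} failed with exit code {result.returncode}")
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {attempts} attempts: {cmd}")


# Stand-in command; for the real download use something like
# ["git", "lfs", "clone", REPO_URL] with REPO_URL set to the model repository.
used = run_with_retries([sys.executable, "-c", "print('clone would run here')"])
print(f"succeeded after {used} attempt(s)")
```

A short delay between attempts (for example `delay_seconds=10`) is usually kinder to a flaky connection than retrying immediately.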

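3. Run ChatGLM2

In web_demo2.py, change the model path to the directory of the downloaded model.

Then start the demo with: streamlit run web_demo2.py

At runtime the model is loaded in FP16 precision and uses 13161MiB of GPU memory.

Note: make sure the transformers version is 4.30.2; otherwise the demo fails with: ImportError: cannot import name 'GenerationConfig' from 'transformers.generation.utils'.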
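The version requirement can be checked before launching the demo. This is a small sketch using only the standard library; `check_version` is a hypothetical helper name, and the expected version is the one quoted in the note above.

```python
from importlib import metadata

EXPECTED_TRANSFORMERS = "4.30.2"  # version named in the note above


def check_version(package, expected):
    """Return 'ok', 'mismatch', or 'missing' for a package in this environment."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else "mismatch"


status = check_version("transformers", EXPECTED_TRANSFORMERS)
print(f"transformers: {status} (expected {EXPECTED_TRANSFORMERS})")
```

If the status is not `ok`, installing the pinned version with `pip install transformers==4.30.2` should resolve it.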

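4. Fine-tuning with P-tuning

(1) Official INT4 quantized version

Official tutorial: https://www.heywhale.com/mw/project/64984a7b72ebe240516ae79c

Download the AdvertiseGen dataset (the link is in the tutorial) into the ptuning directory, so that it ends up at:

/data/work/xiehao/ChatGLM2-6B/ptuning/AdvertiseGen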
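Before training, it can help to confirm that the dataset files have the columns the training command expects (`content` for the prompt and `summary` for the response). The sketch below assumes each line of train.json is one JSON object, which is how the AdvertiseGen files are commonly distributed; `check_columns` is a hypothetical helper, and a tiny stand-in file is used so the snippet runs without the real dataset.

```python
import json
import os
import tempfile


def check_columns(path, columns=("content", "summary"), limit=100):
    """Verify that the first `limit` records contain the expected keys."""
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = [c for c in columns if c not in record]
            if missing:
                raise KeyError(f"record {checked + 1} is missing {missing}")
            checked += 1
            if checked >= limit:
                break
    return checked


# Stand-in record; point check_columns at AdvertiseGen/train.json for a real check.
sample = {"content": "类型#上衣*颜色#白色", "summary": "placeholder advertisement text"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    checked = check_columns(demo_path)
print(f"checked {checked} record(s)")
```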

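Install the Python packages that fine-tuning needs in addition to the ChatGLM2-6B dependencies:

pip install rouge_chinese nltk jieba datasets transformers[torch] -i https://pypi.douban.com/simple/

Then run the training command from the ptuning directory: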

torchrun --standalone --nnodes=1 --nproc-per-node=1 main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --preprocessing_num_workers 1 \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/work/xiehao/chatglm2-6b-model \
    --output_dir output/adgen-chatglm2-6b-pt \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 2e-2 \
    --pre_seq_len 128 \
    --quantization_bit 4

Training uses 7945MiB of GPU memory, and the 3000 steps take about 4 hours in total.
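The flags in the command determine how much data the run actually sees. As a rough worked example, assuming `--max_steps` counts optimizer updates (as it does in the Hugging Face Trainer), the effective batch size and per-step time follow directly from the numbers above:

```python
per_device_train_batch_size = 1   # --per_device_train_batch_size
gradient_accumulation_steps = 16  # --gradient_accumulation_steps
num_processes = 1                 # --nproc-per-node=1
max_steps = 3000                  # --max_steps
reported_hours = 4                # wall-clock time quoted above

# Samples that contribute to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_processes
# Total training samples consumed over the whole run.
samples_seen = effective_batch * max_steps
# Average wall-clock time per optimizer update.
seconds_per_step = reported_hours * 3600 / max_steps

print(effective_batch, samples_seen, round(seconds_per_step, 2))
```

So the run consumes 48000 training samples in batches of 16, at roughly 4.8 seconds per optimizer step.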


When training finishes, the fine-tuned checkpoint is written to:

ChatGLM2-6B/ptuning/output/adgen-chatglm2-6b-pt/checkpoint-3000

Model size comparison:

The ChatGLM2 base model is about 12G, whereas the fine-tuned checkpoint is only about 7M.
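The small checkpoint size is consistent with saving only the prefix encoder. The estimate below is a sketch under assumptions: the layer count, key/value width, and multi-query group count are taken from the published ChatGLM2-6B config.json rather than from this tutorial, the prefix encoder is assumed to be a single embedding table (no prefix projection), and the weights are assumed to be stored in FP32.

```python
# Assumed values from the published ChatGLM2-6B config.json.
num_layers = 28
kv_channels = 128
multi_query_group_num = 2
pre_seq_len = 128  # --pre_seq_len from the training command

# One key vector and one value vector per layer and per key/value group.
kv_size = num_layers * kv_channels * multi_query_group_num * 2
# The embedding table has one row per prefix position.
prefix_params = pre_seq_len * kv_size
# 4 bytes per FP32 parameter, converted to MiB.
size_mib = prefix_params * 4 / 2**20

print(prefix_params, size_mib)
```

Under these assumptions the prefix encoder has 1,835,008 parameters, or 7.0 MiB, which matches the roughly 7M checkpoint reported above.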

Comparing outputs before and after fine-tuning:

Test code:

from transformers import AutoTokenizer, AutoModel, AutoConfig
import torch

# Prompt in the AdvertiseGen "key#value*key#value" format
chat_str = "类型#上衣*材质#牛仔布*颜色#白色*风格#简约*图案#刺绣*衣样式#外套*衣款式#破洞"

model_path = "/data/work/xiehao/chatglm2-6b-model"
# Despite the variable name, this is the P-tuning checkpoint (prefix encoder weights), not a LoRA adapter
lora_model_path = "/data/work/xiehao/ChatGLM2-6B/ptuning/output/adgen-chatglm2-6b-pt/checkpoint-3000/pytorch_model.bin"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Before fine-tuning: uncomment this block (and comment out the block below) to test the base model
#model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device='cuda')
#model = model.eval()
#response, history = model.chat(tokenizer, chat_str, history=[])
#print(response)

# After fine-tuning: build the model with a prefix encoder of the same length used in training
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained(model_path, config=config, trust_remote_code=True)

# Keep only the prefix encoder weights and strip their key prefix before loading
prefix_state_dict = torch.load(lora_model_path)
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

model = model.cuda()
# The prefix encoder is kept in FP32
model.transformer.prefix_encoder.float()
model = model.eval()
response, history = model.chat(tokenizer, chat_str, history=[])
print(response)
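The key-renaming loop in the test code is easier to see on a toy dictionary. The sketch below uses plain strings in place of tensors; `strip_prefix` is a hypothetical helper and the second key is made up to show that unrelated entries are dropped.

```python
PREFIX = "transformer.prefix_encoder."


def strip_prefix(state_dict, prefix=PREFIX):
    """Keep entries whose key starts with `prefix` and remove that prefix."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


toy_state = {
    "transformer.prefix_encoder.embedding.weight": "prefix weights",
    "transformer.some_other_module.weight": "ignored",  # hypothetical key
}
stripped = strip_prefix(toy_state)
print(stripped)
```

The renamed keys are relative to the prefix encoder module, which is why they can be passed to `model.transformer.prefix_encoder.load_state_dict` directly.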

Output before fine-tuning:

这是一个描述一件上衣的文本,它采用牛仔布材质,颜色为白色,风格是简约的,图案是刺绣的,衣样式是外套,衣款式是破洞的。

Output after fine-tuning:

这一款牛仔外套采用白底黑字的图案设计，简约大方，彰显出帅气又酷酷的气息。衣身上的刺绣图案，在微光下显得特别的帅气有型。衣身上的破洞处理，彰显出酷酷的时尚感，让整件外套充满了个性。

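(2) Non-quantized version

The official fine-tuning script passes "--quantization_bit 4", so training runs against an INT4-quantized base model and produces a P-tuning checkpoint for it.

Quantization is optional: to train without it, simply remove that flag.

In that case the accelerate package is required; install it with: pip install accelerate -U. All other parameters stay the same as before.

This run uses 15393MiB of GPU memory, takes only about 2 hours, and the loss also drops faster.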
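To switch between the two variants without editing the shell command by hand, the argument list can be built programmatically. This is only a sketch: `build_command` is a hypothetical helper, and the flags are copied from the training command earlier in this section.

```python
import shlex

# Flags shared by the quantized and non-quantized runs (copied from the command above).
BASE_ARGS = [
    "torchrun", "--standalone", "--nnodes=1", "--nproc-per-node=1", "main.py",
    "--do_train",
    "--train_file", "AdvertiseGen/train.json",
    "--validation_file", "AdvertiseGen/dev.json",
    "--preprocessing_num_workers", "1",
    "--prompt_column", "content",
    "--response_column", "summary",
    "--overwrite_cache",
    "--model_name_or_path", "/data/work/xiehao/chatglm2-6b-model",
    "--output_dir", "output/adgen-chatglm2-6b-pt",
    "--overwrite_output_dir",
    "--max_source_length", "64",
    "--max_target_length", "128",
    "--per_device_train_batch_size", "1",
    "--per_device_eval_batch_size", "1",
    "--gradient_accumulation_steps", "16",
    "--predict_with_generate",
    "--max_steps", "3000",
    "--logging_steps", "10",
    "--save_steps", "1000",
    "--learning_rate", "2e-2",
    "--pre_seq_len", "128",
]


def build_command(quantization_bit=None):
    """Return the argument list, adding --quantization_bit only when requested."""
    cmd = list(BASE_ARGS)
    if quantization_bit is not None:
        cmd += ["--quantization_bit", str(quantization_bit)]
    return cmd


print(shlex.join(build_command()))                    # non-quantized run
print(shlex.join(build_command(quantization_bit=4)))  # INT4 run
```

The returned list can be passed to `subprocess.run` from inside the ptuning directory.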

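5. How AutoModel loads and uses the model

Entry point: model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

First, AutoConfig reads the parameters from config.json in the model_path directory.

Then the model's network definition and weights are loaded dynamically.

In config.json, the AutoModel entry under auto_map is: modeling_chatglm.ChatGLMForConditionalGeneration

Here modeling_chatglm is the module (file) name and ChatGLMForConditionalGeneration is the class name.

The ChatGLMForConditionalGeneration class in modeling_chatglm.py provides the network structure of the model; the weight files are then loaded to create the model instance.

Finally, model.chat(tokenizer, chat_str, history=[]) generates the result by calling the chat method of that ChatGLMForConditionalGeneration instance.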
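The lookup described above can be imitated with the standard library. This is a simplified sketch of the idea, not the actual transformers implementation: `load_auto_class` is a hypothetical helper, and a toy model directory with a stub class stands in for the real model so no weights are needed.

```python
import importlib.util
import json
import os
import tempfile


def load_auto_class(model_dir, auto_key="AutoModel"):
    """Resolve the auto_map entry in config.json to a class defined in model_dir."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    # "modeling_chatglm.ChatGLMForConditionalGeneration" -> module name, class name
    module_name, class_name = config["auto_map"][auto_key].rsplit(".", 1)
    module_path = os.path.join(model_dir, module_name + ".py")
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Stub standing in for the real modeling_chatglm.py.
STUB_SOURCE = '''class ChatGLMForConditionalGeneration:
    def chat(self, tokenizer, query, history=None):
        reply = "stub reply"
        return reply, (history or []) + [(query, reply)]
'''

with tempfile.TemporaryDirectory() as model_dir:
    auto_map = {"AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration"}
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"auto_map": auto_map}, f)
    with open(os.path.join(model_dir, "modeling_chatglm.py"), "w", encoding="utf-8") as f:
        f.write(STUB_SOURCE)
    model_class = load_auto_class(model_dir)
    response, history = model_class().chat(None, "hello", history=[])

print(model_class.__name__, response)
```

Because the class comes from a Python file shipped with the model, loading it executes that file, which is why `trust_remote_code=True` has to be passed explicitly.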

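6. How the fine-tuning works

As the post-fine-tuning usage shows, ChatGLM retrains only the PrefixEncoder part of the network; the rest of the model's weights stay unchanged.

This layer maps a fixed sequence of learned prefix tokens (of length pre_seq_len) to embeddings that are passed to the model as past_key_values. See this excerpt from the model source: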

    def get_prompt(self, batch_size, device, dtype=torch.half):
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
        past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)

After fine-tuning, these PrefixEncoder weights are loaded back into the original model, as in the test code above.
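The excerpt above stops before the reshaping step, so the following shape trace is an assumption rather than a quotation: it follows the published modeling_chatglm.py as recalled (view, permute, then split into one key/value pair per layer) and uses the ChatGLM2-6B config values for the layer count and key/value sizes. `prefix_shapes` is a hypothetical helper that only computes shapes; no tensors are created.

```python
def prefix_shapes(batch_size, pre_seq_len=128, num_layers=28,
                  multi_query_group_num=2, kv_channels=128):
    """Trace the tensor shapes through the prefix path, without creating tensors."""
    # Width of one prefix embedding: keys and values for every layer and group.
    kv_size = num_layers * kv_channels * multi_query_group_num * 2
    return {
        # Token ids 0..pre_seq_len-1, repeated for every sample in the batch.
        "prefix_tokens": (batch_size, pre_seq_len),
        # Output of the prefix encoder (an embedding lookup).
        "prefix_encoder_output": (batch_size, pre_seq_len, kv_size),
        # After reshaping: one (key, value) pair per transformer layer.
        "per_layer_key_value": (2, pre_seq_len, batch_size, multi_query_group_num, kv_channels),
        "num_layer_entries": num_layers,
    }


shapes = prefix_shapes(batch_size=1)
for name, shape in shapes.items():
    print(name, shape)
```

Each layer therefore receives 128 extra key/value positions that act as a trainable soft prompt, while the transformer weights themselves are not updated.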

ChatGLM2 usage and fine-tuning tutorial