(2) ChatGLM-6B model deployment and ptuning fine-tuning detailed tutorial

Introduction: what is ChatGLM-6B?

The following is the official description. The reason for choosing it is simply that it can run on a consumer-grade computer. For the stronger 130B model, see https://github.com/THUDM/GLM-130B

ChatGLM-6B is an open-source, Chinese-English bilingual conversational language model based on the General Language Model (GLM) architecture, with 6.2 billion parameters. Combined with model quantization technology, users can deploy it locally on consumer-grade graphics cards (only 6GB of video memory is required at the INT4 quantization level). ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese Q&A and dialogue. After bilingual Chinese-English training on about 1T tokens, supplemented by supervised fine-tuning, feedback bootstrap, reinforcement learning from human feedback, and other techniques, the 6.2-billion-parameter ChatGLM-6B can already generate answers that are reasonably aligned with human preferences. For more information, please refer to our blog.

To make it easier for downstream developers to customize the model for their own application scenarios, we have also implemented an efficient parameter fine-tuning method based on P-Tuning v2 (usage guide): only 7GB of video memory is required to start fine-tuning at the INT4 quantization level.

However, due to the small size of ChatGLM-6B, it is currently known to have quite a few limitations, such as factual and mathematical-logic errors, possible generation of harmful or biased content, weak contextual ability, confused self-identity, and producing content for English instructions that completely contradicts what it produces for Chinese instructions. Please be aware of these issues before use so as not to cause misunderstanding. A larger ChatGLM based on the 130-billion-parameter GLM-130B is under development in closed beta.

In order to better promote the development of large model technology together with the community, we open-source the ChatGLM-6B model at the same time. ChatGLM-6B is a Chinese-English bilingual language model with 6.2 billion parameters. Using the same technology as ChatGLM (chatglm.cn), ChatGLM-6B supports Chinese question answering and dialogue, and can run inference on a single 2080 Ti. Specifically, ChatGLM-6B has the following characteristics:

  • Sufficient bilingual pre-training in Chinese and English: ChatGLM-6B was trained on 1T tokens with a 1:1 ratio of Chinese to English corpus, giving it bilingual ability.
  • Optimized model architecture and size: Drawing on the GLM-130B training experience, the two-dimensional RoPE position encoding implementation was revised and the traditional FFN structure is used. The 6B (6.2 billion) parameter size also makes it feasible for researchers and individual developers to fine-tune and deploy ChatGLM-6B themselves.
  • Lower deployment threshold: At FP16 half precision, ChatGLM-6B needs at least 13GB of video memory for inference. Combined with model quantization, this requirement can be reduced further to 10GB (INT8) or 6GB (INT4), making ChatGLM-6B deployable on consumer-grade graphics cards (see the loading sketch just after this list).
  • Longer sequence length: Compared with GLM-10B (sequence length 1024), ChatGLM-6B has a sequence length of 2048, supporting longer conversations and applications.
  • Human intent alignment training: Supervised fine-tuning, feedback bootstrap, reinforcement learning from human feedback, and other methods are used so that the model has an initial ability to understand the intent of human instructions. The output format is Markdown, which is convenient for display.
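
To make the deployment-threshold point concrete, loading the model with quantization looks roughly like this. This is a minimal sketch following the usage pattern in the official README; the model path and quantization level (4 or 8) are examples to adapt to your own setup.

# Minimal sketch: load ChatGLM-6B quantized so it fits on a consumer GPU
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# FP16 needs ~13GB of VRAM; quantize(8) -> ~10GB, quantize(4) -> ~6GB
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)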

Torch

Torch check: if it prints False, the CUDA driver is not set up correctly. Enter the Python command line and run:

# CUDA support check
import torch
print(torch.cuda.is_available())

https://pytorch.org/get-started/previous-versions/

Execute a command similar to the following to install the matching cuXXX version:

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
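
After installing, a quick sanity check from the Python command line (a small sketch; the printed versions will depend on the wheel you installed):

# Confirm the CUDA build of torch is installed and the GPU is visible
import torch

print(torch.__version__)          # e.g. 2.0.0+cu118
print(torch.version.cuda)         # CUDA version the wheel was built for
print(torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))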

Install the ChatGLM-6B model

Installation date: 2023-04-08
THUDM/ChatGLM-6B github
zero_nlp: this project is a good starting point and covers many relevant topics

Installation process

git clone https://github.com/THUDM/ChatGLM-6B.git
cd ChatGLM-6B
python3.9 -m venv venv
source venv/bin/activate
pip3.9 install -r requirements.txt
pip3.9 install accelerate
pip3.9 install streamlit streamlit_chat

Model data preparation

mkdir THUDM
cd THUDM

# Note: at this point the large model weight files (e.g. the pytorch_model-00001-of-00008.bin LFS files) are not downloaded yet
git clone https://huggingface.co/THUDM/chatglm-6b
# Download the weight files from the Tsinghua mirror site instead
# The Python downloader from this article is recommended and verified to work: https://aistudio.baidu.com/aistudio/projectdetail/5741753?channelType=0&channel=0

Attachment: python code for downloading large files

# Part 1: download the 8 pytorch_model shard files ------------------------------------------------------------------
import requests

url1 = 'https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/files/?p=%2Fpytorch_model-0000'
url2 = '-of-00008.bin&dl=1'
save_path1 = 'pytorch_model-0000'
save_path2 = '-of-00008.bin'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
# Loop over the 8 base model shards
for i in range(8):
    url = url1 + str(i + 1) + url2
    save_path = save_path1 + str(i + 1) + save_path2
    res = requests.get(url, headers=headers)
    with open(save_path, 'wb') as f:
        f.write(res.content)
    print("Shard {} downloaded".format(i + 1))

# Part 2: download the ice_text.model tokenizer file ----------------------------------------------------------------
# wget against the Tsinghua mirror kept failing, so a plain requests GET is used instead
import requests

url = 'https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/files/?p=%2Fice_text.model&dl=1'
save_path = 'ice_text.model'
# Set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
# Fetch the file and write it to disk
res = requests.get(url, headers=headers)
with open(save_path, 'wb') as f:
    f.write(res.content)
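
A failed download from the mirror often writes a small HTML error page instead of the real weight file, so it is worth checking file sizes before moving on. A simple sketch (each pytorch_model shard should be on the order of a GB or more):

# Sanity-check the downloaded files: a failed download is usually only a few KB
import os

files = ['pytorch_model-0000{}-of-00008.bin'.format(i + 1) for i in range(8)] + ['ice_text.model']
for name in files:
    if not os.path.exists(name):
        print('MISSING:', name)
    else:
        size_mb = os.path.getsize(name) / 1024 / 1024
        print('{:<35} {:.1f} MB'.format(name, size_mb))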

Run the demo test

# Some of web_demo.py's front-end assets are loaded from Google, so the page may fail to open; the second option is recommended
python3.9 web_demo.py
# or
streamlit run web_demo2.py

Make sure the model path used in the demo matches the actual path on disk.
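
For reference, the model-loading lines in web_demo2.py look roughly like the following (the exact code differs slightly between repository versions); the path passed to from_pretrained must match the directory the weights were actually downloaded into, THUDM/chatglm-6b here.

# Approximate shape of the loading code in web_demo2.py (may differ by version)
from transformers import AutoModel, AutoTokenizer

model_path = "THUDM/chatglm-6b"  # must match the directory holding the downloaded weights
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
model = model.eval()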

Ptuning fine-tuning

THUDM/ChatGLM-6B ptuning official tutorial

The fine-tuning code is in the ptuning directory of THUDM/ChatGLM-6B.

Installation process

Enter the directory:

[root@VM-245-24-centos ChatGLM-6B]#  cd ptuning

Initialize the environment

To prevent package conflicts, I re-initialized the venv environment here:

python3.9 -m venv venv
source venv/bin/activate
pip3.9 install rouge_chinese nltk jieba datasets transformers torch icetk cpm_kernels

Train

The official example dataset takes far too long to train on (to my surprise, a full 11 hours for 54MB of data), so we gave up on it and prepared our own dataset.

What to change in train.sh:

PRE_SEQ_LEN=8
LR=1e-2

CUDA_VISIBLE_DEVICES=0 python3.9 main.py \
    --do_train \
    --train_file mydata/train.json \
    --validation_file mydata/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ../THUDM/chatglm-6b \
    --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN
#    --quantization_bit 4

PRE_SEQ_LEN and LR in train.sh are the soft prompt length and the training learning rate respectively, and can be adjusted for the best results. The P-Tuning v2 method freezes all model parameters; the quantization level of the original model can be adjusted via quantization_bit. If this option is not given, the model is loaded at FP16 precision.

Under the default configuration of quantization_bit=4, per_device_train_batch_size=1, gradient_accumulation_steps=16, the INT4 model parameters are frozen, and one training iteration performs 16 accumulated forward and backward passes with a batch size of 1, which is equivalent to a total batch size of 16. At that setting the minimum video memory usage is only 6.7GB. If you want to improve training efficiency at the same total batch size, you can increase per_device_train_batch_size while keeping the product of the two unchanged, but this also consumes more video memory; please adjust according to your actual situation.
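
The arithmetic behind that paragraph, using the values from the train.sh above:

# Effective batch size and total samples drawn under the train.sh settings above
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 3000

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)               # 16 samples per optimizer step
print(effective_batch * max_steps)   # 48000 samples drawn over the full run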

Modify as follows

# My Python version is 3.9
python3 -> python3.9

# The training/validation files change (they will be created later), so update them first
    --train_file mydata/train.json \
    --validation_file mydata/dev.json \
# Change the model path: the model now sits in the parent directory
 --model_name_or_path THUDM/chatglm-6b \ ->  --model_name_or_path ../THUDM/chatglm-6b \
# Disable quantization_bit: my GPU has enough memory, and quantization_bit=4 actually raised an error for me, so I simply turned it off; without this option the model is loaded at FP16 precision
    --pre_seq_len $PRE_SEQ_LEN \ ->   --pre_seq_len $PRE_SEQ_LEN
   --quantization_bit 4 ->  #--quantization_bit 4

Prepare your own dataset

For now, this process means creating the data yourself: it is supervised learning, one question paired with one answer. Topics to look into in later research include unsupervised next-token prediction (文字接龙), teacher forcing, Self-Instruct, and few-shot learning.

From the earlier exploration we know the format is {"content": "", "summary": ""} (the advertising dataset in the official tutorial uses this structure): one field is the input, the other is the output, and we can create data following this format. (Following a tip from netizen zy: for the same answer you should generalize the input into many synonymous sentences (同义句); with plenty of them, the fine-tuning works well.)

Prepare the generalized data in advance (if you don't feel like generalizing, you can simply write your own data in the {"content": "", "summary": ""} format, which amounts to much the same thing). Self-Instruct, which uses GPT to generalize automatically, seems to be a separate topic of study; as a novice, I create the data manually for the time being.

mkdir mydata
vim mydata/dev.json
vim mydata/train.json

Write the training data into dev.json (any single line will do) and train.json (all of the lines):

{"content": "你叫什么名字?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你可以告诉我你的名字吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是GPT吗","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你能告诉我你的名字吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是机器人吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你能告诉我你的名字吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你知道我叫啥名字吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "介绍一下你自己","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "介绍自己","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是ChatGPT吗","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "自我介绍一下","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你知道我叫啥名字吗?","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是谁","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你可以告诉我你的名字吗","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是哪位","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你叫啥","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你叫什么","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}
{"content": "你是谁","summary":"你好,我是小君,很高兴认识你,有什么可以帮你?"}

Start the training (*^▽^*):

bash train.sh

At this point I found that the training time has little to do with the size of the training data: even with only 20 custom samples, it is still estimated to take 11 hours ( ̄ェ ̄;), presumably because max_steps is fixed at 3000 regardless of dataset size.

The whole training process takes about 11 to 12 hours, and the data size after training is about 39G:

du -sh output/
39G	output/

Inference

As with train.sh, modify evaluate.sh: what I need to change is the Python version and disabling quantization_bit.

PRE_SEQ_LEN=8
CHECKPOINT=adgen-chatglm-6b-pt-8-1e-2
STEP=3000

CUDA_VISIBLE_DEVICES=0 python3.9 main.py \
    --do_predict \
    --validation_file mydata/dev.json \
    --test_file mydata/dev.json \
    --overwrite_cache \
    --prompt_column content \
    --response_column summary \
    --model_name_or_path ./output/$CHECKPOINT/checkpoint-$STEP  \
    --output_dir ./output/$CHECKPOINT \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_eval_batch_size 1 \
    --predict_with_generate \
    --pre_seq_len $PRE_SEQ_LEN
    #--quantization_bit 4

Start inference; this time it finishes quickly.

bash evaluate.sh
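
Once the run finishes, the generated answers can be inspected directly. This sketch assumes the output layout of the official repo, where main.py writes a generated_predictions.txt under the output directory with one JSON object per line containing labels and predict fields:

# Inspect the predictions written by evaluate.sh
# (assumes the official repo's output format under output/<CHECKPOINT>/)
import json

path = 'output/adgen-chatglm-6b-pt-8-1e-2/generated_predictions.txt'
with open(path, encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print('expected:', record.get('labels'))
        print('got     :', record.get('predict'))
        print('-' * 40)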


Verify

In the corresponding demo or code, replace THUDM/chatglm-6b with the path of the checkpoint produced by P-Tuning fine-tuning (in this example, ptuning/output/adgen-chatglm-6b-pt-8-1e-2/checkpoint-3000). Note that the current fine-tuning does not support multi-turn data, so only the responses from the first round of each conversation are fine-tuned.

Let's modify the web_demo2.py startup file. To use the trained model, we need to point the model path at the checkpoint under ptuning/output/adgen-chatglm-6b-pt-8-1e-2.

# Return to the ChatGLM-6B directory
cd ../
# Switch to the venv environment
source venv/bin/activate
# Change the model path: THUDM/chatglm-6b --> ptuning/output/adgen-chatglm-6b-pt-8-1e-2/checkpoint-3000
vim web_demo2.py
# Start the demo
streamlit run web_demo2.py
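
If you want to check the result without the web demo, the fine-tuned checkpoint can also be loaded from a plain Python shell. A sketch assuming the checkpoint directory contains the full model weights (as in this run, where quantization was disabled); if the tokenizer files are missing there, load the tokenizer from THUDM/chatglm-6b instead:

# Quick check of the fine-tuned checkpoint outside the web demo
from transformers import AutoModel, AutoTokenizer

ckpt = "ptuning/output/adgen-chatglm-6b-pt-8-1e-2/checkpoint-3000"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).half().cuda().eval()

response, history = model.chat(tokenizer, "你叫什么名字?", history=[])
print(response)  # should now answer with the 小君 persona from the training data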

Show results

Questions and thoughts

Related issue discussion: see issue #542, https://github.com/THUDM/ChatGLM-6B/issues/542

As my understanding of ptuning deepens, I find that I can easily change the model's self-identity, but other knowledge seems to get forgotten (the answers become very strange), and repetition appears easily. So in terms of implementation, I agree that it is better suited to feature-oriented task scenarios (intent recognition, information extraction, etc.), combined with traditional NLP techniques in the application.

In addition, I hear that LoRA-based fine-tuning also works well; I will try it later.

Generalization learning

SimBERT (optional reading)

SimBERT is a supervised training approach.

This method seems a bit dated now; the mainstream approach appears to be using ChatGPT for Self-Instruct, but that requires an API key. That is another story, to be explored later; for now, let's understand this process first.

Su Jianlin's Scientific Spaces (科学空间): https://spaces.ac.cn/
simbert: https://spaces.ac.cn/archives/7427
simbertv2: https://spaces.ac.cn/archives/8454

This project has not been updated for a long time. Do not share an environment with ChatGLM; the package dependencies will conflict (I have already hit this pitfall).

Be careful here: do not use too new or too old a Python version (not 3.9, and not 3.6). I used Python 3.7 here; see the installation tutorial and the Tsinghua University Open Source Software Mirror Station (links in the original post).

[root@VM-245-24-centos ~]# git clone https://github.com/425776024/nlpcda.git
[root@VM-245-24-centos ~]# cd nlpcda/

python3.7 -m venv venv
source venv/bin/activate
pip3.7 install -r requirements.txt
pip3.7 install nlpcda  keras==2.3.1  bert4keras==0.7.7 tensorflow==1.13.1 tensorflow-gpu==1.13.1
pip install 'protobuf~=3.19.0'

Download the SimBERT model according to the tutorial at https://github.com/425776024/nlpcda. As shown in the screenshots of the original post, running it then reports an error; fix it by changing the decorator name:

vim nlpcda/tools/simbert/generator.py

Line 46:
    @AutoRegressiveDecoder.set_rtype('probas')

I downloaded the tiny model file via the Baidu network disk link above, then prepared a test.py:

from nlpcda.tools.Simbert import Simbert

config = {
    'model_path': 'chinese_simbert_L-4_H-312_A-12',
    'CUDA_VISIBLE_DEVICES': '0',
    'max_len': 32,
    'seed': 1
}
simbert = Simbert(config=config)
sent = '把我的一个亿存银行安全吗'
synonyms = simbert.replace(sent=sent, create_num=5)
print(synonyms)
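
Tying this back to the fine-tuning data: the generated paraphrases can be written straight into the {"content", "summary"} JSON-lines format used by train.sh. A sketch that assumes, as in the nlpcda README, that simbert.replace returns a list of (sentence, score) pairs; the answer string and output path are the ones used earlier in this post:

# Turn SimBERT paraphrases into ptuning training lines (mydata/train.json)
import json
from nlpcda.tools.Simbert import Simbert

simbert = Simbert(config={'model_path': 'chinese_simbert_L-4_H-312_A-12',
                          'CUDA_VISIBLE_DEVICES': '0', 'max_len': 32, 'seed': 1})
answer = "你好,我是小君,很高兴认识你,有什么可以帮你?"
seed_questions = ["你叫什么名字?", "介绍一下你自己"]

with open("mydata/train.json", "a", encoding="utf-8") as f:
    for q in seed_questions:
        for synonym, _score in simbert.replace(sent=q, create_num=5):
            f.write(json.dumps({"content": synonym, "summary": answer}, ensure_ascii=False) + "\n")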

SimBERT is trained in a supervised fashion: the training corpus is pairs of similar sentences collected by the authors. The Seq2Seq part is built by having one sentence of a pair predict the other (a similar-sentence generation task), while the [CLS] vector mentioned earlier actually represents the input sentence embedding, so a retrieval task can be trained at the same time.

For an in-depth look at the parameters and their use, see https://kexue.fm/archives/7427

Original post: https://blog.csdn.net/q116975174/article/details/130034839