ChatGLM2 usage and fine-tuning tutorial

1. Download the ChatGLM2 code

The code repository is on GitHub: https://github.com/THUDM/ChatGLM2-6B (ChatGLM2-6B: An Open Bilingual Chat LLM / open-source bilingual dialogue language model).

2. Download the chatglm2-6b model

git lfs clone https://huggingface.co/THUDM/chatglm2-6b

If you hit an error like: OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443,

unset the git proxy with: git config --global --unset http.proxy

and then retry the git lfs clone command a few times.

3. Run ChatGLM2

Modify the model loading path in web_demo2.py so that it points to the locally downloaded model, roughly as sketched below.
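A minimal sketch of the change, assuming the model was downloaded to /data/work/xiehao/chatglm2-6b-model (the path used later in this tutorial); the exact lines in web_demo2.py vary between repository versions, but the idea is to replace the default "THUDM/chatglm2-6b" repo id with the local directory:

    # web_demo2.py (sketch): load tokenizer and model from the local path instead of the hub
    from transformers import AutoModel, AutoTokenizer

    MODEL_PATH = "/data/work/xiehao/chatglm2-6b-model"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).cuda()  # FP16 weights on GPU
    model = model.eval()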

Then start the web demo with: streamlit run web_demo2.py

At runtime the model is loaded in FP16 precision and occupies about 13161 MiB of GPU memory.

Note: Make sure the version of transformers is 4.30.2, otherwise an error will be reported: ImportError: cannot import name 'GenerationConfig' from 'transformers.generation.utils'.
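A quick way to confirm the installed version from Python (just a sanity check against the 4.30.2 requirement above):

    # Print the installed transformers version; it should be 4.30.2 for this tutorial.
    import transformers
    print(transformers.__version__)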

4. Fine-tuning with P-Tuning

(1) Official INT4-quantized version

Official tutorial address: https://www.heywhale.com/mw/project/64984a7b72ebe240516ae79c

Download the AdvertiseGen dataset (see the link in the tutorial) into the ptuning directory, for example:

/data/work/xiehao/ChatGLM2-6B/ptuning/AdvertiseGen
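For reference, each record in train.json / dev.json has a content field (the structured prompt) and a summary field (the target ad copy); these are the columns passed to --prompt_column and --response_column below. A minimal sketch for peeking at the first record, assuming AdvertiseGen's usual line-delimited JSON layout:

    # Print the first AdvertiseGen training record (fields: "content", "summary").
    import json

    with open("AdvertiseGen/train.json", encoding="utf-8") as f:
        first = json.loads(f.readline())

    print(first["content"])   # structured prompt, e.g. "类型#上衣*材质#牛仔布*..."
    print(first["summary"])   # reference advertising copy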

Install the additional Python dependencies required for fine-tuning (beyond the base ChatGLM2-6B requirements):

pip install rouge_chinese nltk jieba datasets transformers[torch] -i https://pypi.douban.com/simple/

Execute the training command:

torchrun --standalone --nnodes=1 --nproc-per-node=1 main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --preprocessing_num_workers 1 \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/work/xiehao/chatglm2-6b-model \
    --output_dir output/adgen-chatglm2-6b-pt \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 2e-2 \
    --pre_seq_len 128 \
    --quantization_bit 4

Training occupies about 7945 MiB of GPU memory and takes roughly 4 hours in total for the 3000 steps.


After the run completes, the resulting model is located at:

ChatGLM2-6B/ptuning/output/adgen-chatglm2-6b-pt/checkpoint-3000

Model size comparison:

The base ChatGLM2 model is about 12 GB, while the fine-tuned checkpoint (the trained prefix-encoder weights) is only about 7 MB.
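The small size is plausible because only the prefix encoder is trained. Below is a minimal sketch for inspecting the saved checkpoint; the parameter-count comment assumes ChatGLM2-6B's configuration of 28 layers, 2 multi-query key/value groups and 128-dim KV channels, together with pre_seq_len=128 (these constants are assumptions here, check config.json):

    # List the tensors in the p-tuning checkpoint and count the trained parameters.
    import torch

    ckpt = "ChatGLM2-6B/ptuning/output/adgen-chatglm2-6b-pt/checkpoint-3000/pytorch_model.bin"
    state_dict = torch.load(ckpt, map_location="cpu")

    total = 0
    for name, tensor in state_dict.items():
        print(name, tuple(tensor.shape))
        total += tensor.numel()

    # Rough expectation: 128 * 28 * 2 * 2 * 128 ≈ 1.8M parameters, i.e. ~7 MB in FP32.
    print(f"{total} trained parameters")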

Comparing the model output before and after fine-tuning:

Test code:

from transformers import AutoTokenizer, AutoModel, AutoConfig
import torch

chat_str = "类型#上衣*材质#牛仔布*颜色#白色*风格#简约*图案#刺绣*衣样式#外套*衣款式#破洞"

model_path = "/data/work/xiehao/chatglm2-6b-model"
# p-tuning prefix-encoder checkpoint produced by the training run above
ptuning_checkpoint = "/data/work/xiehao/ChatGLM2-6B/ptuning/output/adgen-chatglm2-6b-pt/checkpoint-3000/pytorch_model.bin"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Before fine-tuning
#model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device='cuda')
#model = model.eval()
#response, history = model.chat(tokenizer, chat_str, history=[])
#print(response)

# After fine-tuning: rebuild the model with pre_seq_len and load the trained prefix encoder
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained(model_path, config=config, trust_remote_code=True)
prefix_state_dict = torch.load(ptuning_checkpoint, map_location="cpu")
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

model = model.cuda()
model.transformer.prefix_encoder.float()  # keep the prefix encoder in FP32
model = model.eval()
response, history = model.chat(tokenizer, chat_str, history=[])
print(response)

Output before fine-tuning:

This is a text describing a jacket, which is made of denim, white in color, simple in style, embroidered in pattern, jacket in style, and ripped in style.

Output after fine-tuning:

This denim jacket is designed with a black pattern on a white background, which is simple and elegant, showing a handsome and cool atmosphere. The embroidered pattern on the body looks particularly handsome and stylish in the dim light. The hole treatment on the body shows a cool fashion sense and makes the whole coat full of personality.

(2) Non-quantized version

The official fine-tuning script includes "--quantization_bit 4", i.e. the base model is quantized to INT4 during training.

To fine-tune without quantization, simply remove this argument and keep the other parameters unchanged.

In that case the script requires the accelerate package; install it with: pip install accelerate -U.

Training then occupies 15393 MiB of GPU memory, finishes in only about 2 hours, and the loss drops faster.

5. How AutoModel loads the model

The entry call is: model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

First, AutoConfig reads the parameters from config.json in the model_path directory.

Then, the model weights and the model network definition are loaded dynamically:

The AutoModel entry of auto_map in config.json is modeling_chatglm.ChatGLMForConditionalGeneration,

where modeling_chatglm is the module (file) name and ChatGLMForConditionalGeneration is the class name.

Through the ChatGLMForConditionalGeneration class in modeling_chatglm.py, the network structure of the model is obtained; the weight files are then loaded to create a model instance.

Finally, model.chat(tokenizer, chat_str, history=[]) calls the chat method of that ChatGLMForConditionalGeneration instance.
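A minimal sketch for inspecting this mapping yourself (it assumes the checkpoint's config.json carries an auto_map entry, as the chatglm2-6b checkpoint on Hugging Face does):

    # Show how the checkpoint maps the Auto* classes to its bundled custom code.
    from transformers import AutoConfig

    model_path = "/data/work/xiehao/chatglm2-6b-model"
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    # Expected to include "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration"
    print(getattr(config, "auto_map", None))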

6. Interpretation of the fine-tuning part

From the post-fine-tuning usage above, it is clear that ChatGLM2 only retrains the PrefixEncoder part of the network.

This layer generates embeddings from the learned prefix (prompt) tokens; see this excerpt from the model source code:

    def get_prompt(self, batch_size, device, dtype=torch.half):
        # learned virtual prefix token ids, expanded to the batch size
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
        # the prefix encoder maps them to the key/value states injected into every layer
        past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)
        # ... (the rest of the method reshapes this into per-layer past_key_values)
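For intuition, here is a simplified stand-in for what such a prefix encoder looks like: a trainable embedding table whose width covers the keys and values of every layer. This is an illustrative sketch, not the exact ChatGLM2 implementation (which can optionally add an MLP projection), and the constructor argument names are hypothetical:

    import torch

    # Simplified stand-in for a prefix encoder: one learned embedding row per
    # virtual prefix token, wide enough to hold K and V for all layers.
    class MiniPrefixEncoder(torch.nn.Module):
        def __init__(self, pre_seq_len, num_layers, kv_groups, kv_channels):
            super().__init__()
            kv_width = num_layers * kv_groups * kv_channels * 2  # *2 for keys and values
            self.embedding = torch.nn.Embedding(pre_seq_len, kv_width)

        def forward(self, prefix_tokens):
            # (batch, pre_seq_len) token ids -> (batch, pre_seq_len, kv_width) prefix states
            return self.embedding(prefix_tokens)

    # Only weights like these are trained and saved during p-tuning, which is why
    # the resulting checkpoint is a few MB instead of ~12 GB.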

After fine-tuning, only this small set of weights needs to be loaded on top of the original large model, as shown in the test code above.
