Practical deployment of ChatGLM3, Tsinghua's open-source large language model


ChatGLM3 is a new-generation conversational pre-trained model jointly released by Zhipu AI and the KEG Laboratory of Tsinghua University.
Project repository: https://github.com/THUDM/ChatGLM3

Environment setup

It is recommended to use a virtual environment (e.g. conda or venv) to keep the dependencies isolated.

git clone https://github.com/THUDM/ChatGLM3
cd ChatGLM3
pip install -r requirements.txt

The recommended version of the transformers library is 4.30.2, and torch 2.0 or above is recommended for the best inference performance.
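
A quick sanity check of the environment before loading the model can save time; this small snippet only prints the installed versions:

import torch
import transformers

print(transformers.__version__)   # recommended: 4.30.2
print(torch.__version__)          # recommended: 2.0 or above
print(torch.cuda.is_available())  # True means GPU inference is possible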

Download the model files

git lfs install
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git

This may take a long time.
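
If git lfs is inconvenient, the ModelScope SDK can also download the weights. This is only a sketch, assuming the modelscope package is installed (pip install modelscope); the exact arguments may differ between modelscope versions:

from modelscope import snapshot_download

# Downloads the checkpoint into the local ModelScope cache and returns its directory
model_dir = snapshot_download("ZhipuAI/chatglm3-6b")
print(model_dir)  # pass this path to from_pretrained() later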

Test whether the installation is successful

For inference, replace THUDM/chatglm3-6b in the snippets below with the local path where you downloaded the model.
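
For example (the folder name below is only an illustration; use wherever git clone actually placed the weights):

from transformers import AutoTokenizer

model_path = "./chatglm3-6b"  # hypothetical local path to the downloaded weights
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)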

GPU inference

Inference requires more than 13 GB of GPU memory (VRAM).

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model; device='cuda' places the model on the GPU
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device='cuda')
model = model.eval()

# "你好" = "Hello"
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
# "晚上睡不着应该怎么办" = "What should I do if I can't sleep at night"
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
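
The bundled modeling code also exposes a stream_chat method (as in earlier ChatGLM releases), which yields the partial response as it is generated. A minimal sketch, continuing from the GPU example above (model and tokenizer already loaded):

# Each iteration yields the full response generated so far, so print only the new part
current_length = 0
for response, history in model.stream_chat(tokenizer, "你好", history=[]):
    print(response[current_length:], end="", flush=True)
    current_length = len(response)
print()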

CPU inference

Inference requires more than 32 GB of RAM.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# .float() keeps the weights in FP32 so that inference runs on the CPU
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)

Quantized inference

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# quantize(4) converts the weights to 4-bit, greatly reducing GPU memory usage
# at a small cost in answer quality; quantize(8) is the 8-bit variant
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
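
To get a rough idea of the memory savings from quantization, you can check the allocated GPU memory after loading (this only counts tensors allocated by PyTorch, not the CUDA context itself):

import torch

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.1f} GB")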

If you run into any problems, feel free to discuss them in the comments.
For more technical exchange, join the group: 130856474


Reprinted from: blog.csdn.net/Silver__Wolf/article/details/134247535