LLM: Basic Usage of the InternLM-Chat-7B (书生) and Baichuan-13B-Chat (百川) Models

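Running on a specific GPU:

# Option 1 (both lines must come before `import torch`)
# import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '5'

# Option 2 (use this if option 1 has no effect; this is the one that worked for me)
# import torch
# torch.cuda.set_device(5)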
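
Option 1 can be wrapped in a small helper so the ordering requirement is harder to get wrong. This is only a sketch: `restrict_visible_gpus` is a hypothetical name, not part of PyTorch, and the check on `sys.modules` is a heuristic that warns when `torch` has already been imported, since the variable is read when CUDA is initialized.

```python
import os
import sys
import warnings


def restrict_visible_gpus(*gpu_ids: int) -> str:
    """Expose only the given physical GPU indices to this process.

    Hypothetical helper: call it before importing torch.
    """
    if "torch" in sys.modules:
        # The setting may be ignored if CUDA was already initialized.
        warnings.warn("torch is already imported; CUDA_VISIBLE_DEVICES may have no effect")
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


visible = restrict_visible_gpus(5)
print(visible)
```

After this call, `import torch` would see a single device, addressed as `cuda:0`, that corresponds to physical GPU 5.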

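1. InternLM-Chat-7B (书生)

References:
https://huggingface.co/internlm/internlm-chat-7b ## model download
https://github.com/InternLM/InternLM ## web demo code for reference

GPU memory needed in my tests: more than 15 GB (with only 15 GB, memory still runs out after several dialogue turns)
The streamlit demo code works quite well.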

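from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets transformers run the chat code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
model = model.eval()

# First turn ("Hello"), starting from an empty history
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
# Output: 你好！有什么我可以帮助你的吗？

# Second turn ("Please give three tips for managing time"), reusing the returned history
response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
print(response)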
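
Since memory ran out after several turns in my tests, one possible mitigation is to cap how much history is passed back into `model.chat`. The sketch below assumes `history` is a list of (query, response) pairs, as returned by the call above; `trim_history` and `max_turns` are hypothetical names, and how much memory this actually saves depends on the length of each turn.

```python
from typing import List, Tuple

History = List[Tuple[str, str]]


def trim_history(history: History, max_turns: int = 3) -> History:
    """Keep only the most recent `max_turns` (query, response) pairs."""
    if max_turns <= 0:
        # history[-0:] would return the whole list, so handle this case explicitly.
        return []
    return list(history[-max_turns:])


# Toy history standing in for what model.chat returns
history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
trimmed = trim_history(history, max_turns=2)
print(trimmed)
```

In a real session the trimmed list would be passed as `history=trim_history(history)` on the next `model.chat` call, at the cost of the model forgetting older turns.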


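2. Baichuan-13B-Chat (百川)

References:
https://huggingface.co/baichuan-inc/Baichuan-13B-Chat ## model download
https://github.com/baichuan-inc/Baichuan-13B ## web demo code for reference

GPU memory needed in my tests: more than 20 GB (device_map="auto" here spreads the model across multiple GPUs by default)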

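(The screenshot below shows on-the-fly quantization. The model first has to be loaded onto the CPU, which is also very resource-heavy, so I will wait for the official pre-quantized int8 and int4 weights before testing further.)
[Screenshot: on-the-fly quantization]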
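
To get a rough feel for why quantized weights matter, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch: the 13e9 parameter count is taken from the model name as an assumption, `weight_memory_gib` is a hypothetical helper, and activations, the KV cache, and framework overhead are ignored.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NOMINAL_PARAMS = 13e9  # assumption: nominal size implied by "13B"

estimates = {dtype: weight_memory_gib(NOMINAL_PARAMS, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gib in estimates.items():
    print(f"{dtype}: {gib:.1f} GiB")
```

Real usage will be higher than these figures, which is consistent with the fp16 model needing more than 20 GB in my tests.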

In the code, loading the model with device_map="auto" uses all available GPUs. To restrict which devices are used, set something like export CUDA_VISIBLE_DEVICES=0,1 (which uses GPUs 0 and 1).
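
One detail that is easy to trip over: once CUDA_VISIBLE_DEVICES is set, the process renumbers the visible GPUs starting from 0. The sketch below illustrates that mapping. `physical_gpu_index` is a hypothetical helper, and it assumes the variable holds plain integer indices rather than GPU UUIDs; the value "2,3" is only an illustrative setting.

```python
def physical_gpu_index(logical_index: int, visible_devices: str) -> int:
    """Map a logical device index (as seen by the process) to the physical GPU index.

    Assumes `visible_devices` is a comma-separated list of integers such as "0,1".
    """
    physical_ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return physical_ids[logical_index]


# With CUDA_VISIBLE_DEVICES=2,3 the process sees two devices, numbered 0 and 1
first = physical_gpu_index(0, "2,3")
second = physical_gpu_index(1, "2,3")
print(first, second)
```

This is also why, after setting CUDA_VISIBLE_DEVICES='5', the only valid device inside the process is index 0.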

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-13B-Chat", use_fast=False, trust_remote_code=True)
# device_map="auto" places the weights on all visible GPUs; float16 halves memory compared with float32
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan-13B-Chat")

# The conversation is a list of {"role", "content"} dicts
messages = []
messages.append({"role": "user", "content": "Which mountain is the second highest one in the world?"})
response = model.chat(tokenizer, messages)
print(response)
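
The snippet above runs a single turn. For a multi-turn conversation, the reply can be appended to `messages` before the next question. This is a sketch under assumptions: it assumes earlier replies are passed back with the role "assistant", `chat_turn` is a hypothetical helper, and `echo_chat` is a stand-in for the real model so the example runs without a GPU.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(messages: List[Message], user_text: str,
              chat_fn: Callable[[List[Message]], str]) -> str:
    """Append the user message, get a reply, and record it in the conversation."""
    messages.append({"role": "user", "content": user_text})
    reply = chat_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_chat(messages: List[Message]) -> str:
    """Stand-in for the model: echoes the last user message."""
    return f"(reply to: {messages[-1]['content']})"


messages: List[Message] = []
chat_turn(messages, "Which mountain is the second highest one in the world?", echo_chat)
chat_turn(messages, "How high is it?", echo_chat)
roles = [m["role"] for m in messages]
print(roles)
```

With the real model loaded, `chat_fn` could be `lambda msgs: model.chat(tokenizer, msgs)`.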
