MOSS is a chat language model developed by Fudan University, which can help you complete various language tasks. If you want to use MOSS on your own local or remote server , you can follow the steps below to deploy:
To download the content of this warehouse to a local or remote server, you can use the following command: git clone https://github.com/OpenLMLab/MOSS.git
Enter the MOSS directory and create a conda environment, you can use the following command:
conda create --name moss python=3.8
conda activate moss
To install dependencies, you can use the following command: pip install -r requirements.txt
For single card deployment ( A100/A800 ):
MOSS can be run on a single A100/A800 or CPU with the following sample code :
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True).half().cuda()
model = model.eval()
In code, you can have a conversation with MOSS using your own questions and utterances , like so:
query = " Hello <eoh>\n:"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(inputs, do_sample=True, temperature=0.7, top_p=0.8, repetition_penalty=1.1, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
For multi-card deployment (two or more 3090 ):
MOSS inference can be run on two NVIDIA 3090 graphics cards with the following code :
import us
import torch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
model_path = "fnlp/moss-moon-003-sft"
if not os.path.exists(model_path):
model_path = snapshot_download(model_path)
config = AutoConfig.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
with init_empty_weights():
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=True)
model.tie_weights()
model = load_checkpoint_and_dispatch(model, model_path, device_map="auto", no_split_module_classes=["MossBlock"], dtype=torch.float16)
The following is a sample code that simply calls moss-moon-003-sft to generate a dialogue, you can run it on a single A100/A800 or CPU , and it takes about 30GB of video memory when using FP16 precision:
If you have any questions, please leave a message in the comment area.
Special thanks:
"Duan Xiaocao" https://www.zhihu.com/question/596908242/answer/2994650882
「孙天祥」https://www.zhihu.com/question/596908242/answer/2994534005