llama.cpp LLM model: Windows CPU installation and deployment; running LLaMA-2 model tests

Reference:
https://www.listera.top/ji-xu-zhe-teng-xia-chinese-llama-alpaca/
https://blog.csdn.net/qq_38238956/article/details/130113599

CMake Windows installation reference: https://blog.csdn.net/weixin_42357472/article/details/131314105

llama.cpp download and compile

1. Download:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

2. Compile

mkdir build
cd build
cmake ..
cmake --build . --config Release
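If the build is slow, the Release build can be parallelized; a minimal sketch, assuming CMake 3.12 or newer (which provides --parallel) and the Visual Studio generator picked up by the cmake step above:

# run from the build directory; main.exe and quantize.exe end up in build\bin\Release
cmake --build . --config Release --parallel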


3. Test run

cd bin\Release
.\main.exe -h


Run the LLaMA-7B model test

Reference:
https://zhuanlan.zhihu.com/p/638427280

Model download:
https://huggingface.co/nyanko7/LLaMA-7B/tree/main
After downloading, create the LLamda\7B directory under llama.cpp-master\models\ and put the downloaded model files in it.
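A minimal sketch of the directory setup in PowerShell, assuming the usual 7B file layout from that repository (consolidated.00.pth, params.json, plus tokenizer.model):

# run from the llama.cpp-master root
mkdir models\LLamda\7B
# copy consolidated.00.pth, params.json and tokenizer.model into models\LLamda\7B before converting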
1. Convert the 7B model to ggml FP16 format
The convert.py script is in the llama.cpp-master root directory.

python3 convert.py models/LLamda/7B/

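convert.py needs a few Python packages before it will run; a minimal setup sketch, assuming the requirements.txt in the llama.cpp-master root covers them (at the time it listed numpy and sentencepiece):

# install the conversion dependencies (run from the llama.cpp-master root)
python3 -m pip install -r requirements.txt
# convert.py also accepts --outtype f16 if you want to state the output format explicitly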
2. Quantize the model to 4 bits (using the q4_0 method)
quantize.exe is under llama.cpp-master\build\bin\Release; quantization shrinks the model from roughly 13 GB to under 4 GB.

 .\quantize.exe D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-f16.bin  D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-q4_0.bin  q4_0

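To confirm the size reduction, a quick PowerShell check of the two files (paths as in the quantize command above; adjust to your own layout):

# list the f16 and q4_0 model files with their sizes in bytes
Get-ChildItem D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-*.bin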
3. Run
main.exe, under llama.cpp-master\build\bin\Release, is run interactively from the command line:

 .\main.exe -m D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-q4_0.bin  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt
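The same run with the flags annotated, plus a thread-count option; the -t value is an assumption and should be tuned to your machine (all flags are listed by .\main.exe -h):

# -m: model file; -n 128: tokens to generate per turn; --repeat_penalty 1.0: no repetition penalty
# -i: interactive mode; -r "User:": reverse prompt that hands control back to the user
# -f: file with the initial prompt; -t 8: CPU threads (assumption, match your physical core count)
.\main.exe -m D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-q4_0.bin -t 8 -n 128 --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt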

LLaMA's Chinese support is not very good; the meaning is roughly understandable, but if you need Chinese support you may want to choose another model.

You can also directly download a third-party converted GGML model, such as Llama-2:

Reference address:
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

Running on Windows consumes a lot of memory (32 GB is basically full) and generation is also very slow, but the 13B Llama-2 chat model can reply directly in Chinese.

## Run
.\main.exe -m "C:\Users\loong\Downloads\llama-2-13b-chat.ggmlv3.q4_0.bin"  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt
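If memory pressure is an issue, reducing the context size and pinning the thread count can help; a sketch assuming the -c (context length) and -t (threads) options shown by .\main.exe -h:

# a smaller context (-c 512) lowers memory use; -t should match your physical core count
.\main.exe -m "C:\Users\loong\Downloads\llama-2-13b-chat.ggmlv3.q4_0.bin" -c 512 -t 8 -n 128 --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt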


Chinese-Llama-2 (Chinese second-generation Llama)

Model download:
https://huggingface.co/soulteary/Chinese-Llama-2-7b-ggml-q4

## Run
 .\main.exe -m "C:\Users\loong\Downloads\Chinese-Llama-2-7b-ggml-q4.bin"  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt
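For a quick non-interactive Chinese test, the prompt can also be passed directly with -p (listed by .\main.exe -h); a sketch:

# one-shot generation from a Chinese prompt ("please briefly introduce yourself in Chinese"), no interactive mode
.\main.exe -m "C:\Users\loong\Downloads\Chinese-Llama-2-7b-ggml-q4.bin" -n 128 -p "请用中文简单介绍一下你自己。"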


Origin: https://blog.csdn.net/weixin_42357472/article/details/131313977