llama.cpp: installing and deploying LLM models on Windows (CPU only), and test-running LLaMA 2 models

References:
https://www.listera.top/ji-xu-zhe-teng-xia-chinese-llama-alpaca/
https://blog.csdn.net/qq_38238956/article/details/130113599

For installing CMake on Windows, see: https://blog.csdn.net/weixin_42357472/article/details/131314105

Downloading and building llama.cpp

1. Download:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

2. Build:

mkdir build
cd build
cmake ..
cmake --build . --config Release
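
The two CMake steps above can also be scripted. The following is a minimal Python sketch, assuming cmake is on PATH and the repository has already been cloned; build_commands and run_build are hypothetical helper names chosen for illustration, not part of llama.cpp.

```python
import subprocess
from pathlib import Path


def build_commands(repo_dir):
    """Return (working_directory, argv) pairs mirroring the manual build steps."""
    build_dir = Path(repo_dir) / "build"
    return [
        # Configure: generate the build files from the CMakeLists.txt one level up.
        (build_dir, ["cmake", ".."]),
        # Compile in Release mode.
        (build_dir, ["cmake", "--build", ".", "--config", "Release"]),
    ]


def run_build(repo_dir):
    """Create build/ and run each step, stopping at the first failure."""
    (Path(repo_dir) / "build").mkdir(exist_ok=True)
    for cwd, argv in build_commands(repo_dir):
        print("running:", " ".join(argv), "in", cwd)
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_build("llama.cpp") from the directory that contains the clone would then perform the same steps as the commands above.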

3. Test run:

cd bin\Release
.\main.exe -h

Test-running the LLaMA-7B model

Reference:
https://zhuanlan.zhihu.com/p/638427280

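Model download:
https://huggingface.co/nyanko7/LLaMA-7B/tree/main
After downloading, create an LLamda\7B directory under llama.cpp-master\models\ and put the model files in it. (The folder is named llama.cpp-master when the repository is downloaded as a ZIP archive; git clone creates llama.cpp instead.)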
1. Convert the 7B model to ggml FP16 format
The convert.py script is in the llama.cpp-master directory.

python3 convert.py models/LLamda/7B/

2. Quantize the model to 4 bits (using the q4_0 method)
quantize.exe is in llama.cpp-master\build\bin\Release. Quantization shrinks the model from roughly 13 GB to just under 4 GB.

 .\quantize.exe D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-f16.bin  D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-q4_0.bin  q4_0
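
The size reduction from quantization can be sanity-checked with a little arithmetic. The sketch below assumes that LLaMA-7B has roughly 6.74 billion parameters and that q4_0 stores each block of 32 weights as 16 bytes of 4-bit values plus one 2-byte FP16 scale, which works out to about 4.5 bits per weight. Both figures are assumptions not stated in this post, and real files may differ slightly because of metadata and tensors that are not quantized.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes) for a given weight precision."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: approximate parameter count of LLaMA-7B.
N_PARAMS_7B = 6.74e9

FP16_BITS = 16

# Assumption: a q4_0 block holds 32 weights in 16 bytes of 4-bit values
# plus one 2-byte FP16 scale, i.e. 18 bytes per 32 weights.
Q4_0_BITS = (16 + 2) * 8 / 32  # 4.5 bits per weight

fp16_gb = model_size_gb(N_PARAMS_7B, FP16_BITS)
q4_gb = model_size_gb(N_PARAMS_7B, Q4_0_BITS)
print(f"FP16: {fp16_gb:.1f} GB, q4_0: {q4_gb:.1f} GB")
# prints "FP16: 13.5 GB, q4_0: 3.8 GB"
```

Under these assumptions the estimate agrees with the observed drop from about 13 GB to just under 4 GB.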

3. Run interactively from the command line
main.exe is also in llama.cpp-master\build\bin\Release.

 .\main.exe -m D:\llm\llama.cpp-master\models\LLamda\7B\ggml-model-q4_0.bin  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt
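
In the command above, -m selects the model file, -n limits the number of generated tokens, -i turns on interactive mode, -r "User:" hands control back to you whenever the model prints that string, and -f loads the initial prompt from a file. Since the same invocation is reused for every model in this post, it can be wrapped in a small helper. This is a minimal Python sketch; chat_command is a hypothetical name used only for illustration, and the paths you pass in are your own.

```python
import subprocess


def chat_command(main_exe, model_path, prompt_file,
                 n_predict=128, reverse_prompt="User:"):
    """Build the argument list for an interactive main.exe chat session."""
    return [
        str(main_exe),
        "-m", str(model_path),        # model file to load
        "-n", str(n_predict),         # maximum number of tokens to generate
        "--repeat_penalty", "1.0",    # 1.0 means repetition is not penalized
        "--color",                    # colorize output
        "-i",                         # interactive mode
        "-r", reverse_prompt,         # return control to the user at this string
        "-f", str(prompt_file),       # file containing the initial prompt
    ]
```

To start a session, pass the list to subprocess.run, for example subprocess.run(chat_command(r".\main.exe", model, prompt)) with model and prompt set to your own file paths. Switching models then only means changing the model path.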

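LLaMA's Chinese support is not great: it can roughly understand the meaning, but if you need Chinese support you will probably have to choose a different model.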

You can also directly download ggml models that third parties have already converted, for example Llama-2

Links:
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

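Running on Windows is memory-hungry: 32 GB of RAM was almost fully used, and generation is fairly slow. However, the 13B Llama-2 model can reply directly in Chinese.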

## Run
.\main.exe -m "C:\Users\loong\Downloads\llama-2-13b-chat.ggmlv3.q4_0.bin"  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt

Chinese-Llama-2 (second-generation Chinese model)

Model download:
https://huggingface.co/soulteary/Chinese-Llama-2-7b-ggml-q4

## Run
 .\main.exe -m "C:\Users\loong\Downloads\Chinese-Llama-2-7b-ggml-q4.bin"  -n 128  --repeat_penalty 1.0 --color -i -r "User:" -f D:\llm\llama.cpp-master\prompts\chat-with-bob.txt
