Deployment and operation record of an AI novel-writing model [ChatRWKV]


Foreword

I saw a tutorial on Zhihu where the author got an AI to continue writing novels, and I ran it on my 2018 laptop. Here is the continuation effect of the 1.5B model:

[Screenshot: sample continuation from the 1.5B model]

What follows is just my own record; for the full write-up, see the original article: https://zhuanlan.zhihu.com/p/609154637


1. Environment installation

[If you are in mainland China, it is recommended to switch the pip source to the Tsinghua mirror first]

1. Python environment: Python 3.10.

Use Anaconda to create an environment and select 3.10.x.

[Screenshot: creating a Python 3.10 environment in Anaconda]
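If you prefer the command line, creating the environment looks roughly like this (matching the python310 name used in the activate command below):

conda create -n python310 python=3.10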

2. Install the required pip packages: numpy, tokenizers, prompt_toolkit

Make sure you install them into the correct environment.

There are two ways:
1. From the cmd command line, switch to the newly created Python 3.10 environment and run the following install commands:

conda activate python310
pip install numpy tokenizers prompt_toolkit

2. Install from the Anaconda UI: select the corresponding environment, enter the required package in the search box at the upper right, install it, and check whether it is already installed (see the quick check below).

[Screenshot: installing packages from the Anaconda UI]
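Either way, you can quickly confirm that the three packages landed in the right environment (run this inside the activated environment):

python -c "import numpy, tokenizers, prompt_toolkit; print('ok')"

If this prints ok, the environment is ready.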

3. Install PyTorch 1.13.1 + CUDA 11.7

My laptop has a GTX 1050. It is underpowered, but it still has a few hundred CUDA cores.

You should be able to install PyTorch the same way as above, but if you are worried about getting the wrong version (or for other reasons), it is recommended to install it with the following command line:

pip install torch --extra-index-url https://download.pytorch.org/whl/cu117 --upgrade

When I installed it, the download was about 2.4 GB. Fortunately the download speed was fast, so the whole installation took about 10 minutes.
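Before going further, you can check that PyTorch was installed with CUDA support and can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

This should print something like 1.13.1+cu117 True; if it prints False, the CPU-only build was installed.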

2. Operation record

1. Download the code

After the environment installation is complete, pull the code:

git clone https://github.com/BlinkDL/ChatRWKV

If git is installed locally, open the folder where you want to store the code, type cmd in the address bar, press Enter, and then run the command above in the command-line window to pull the code. The repository itself is not big.

2. Download the trained model weights

Download model weights that match your GPU memory. According to online specs my 1050 has 2 GB, but here it shows 4 GB; I am not sure why.

[Screenshot: reported GPU memory]

Different amounts of VRAM call for different model sizes, and 4 GB is just enough for the 1.5B model. Later, overestimating my hardware, I downloaded the medium model; it did run after changing the parameters, but the generation speed was dire: about one minute per 10 Chinese characters... If you only have 4 GB, just run the small model.
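As a rough sanity check on these sizes: in fp16 each parameter takes 2 bytes, so the 1.5B model needs about 1.5B × 2 ≈ 3 GB just for the weights, leaving little headroom on a 4 GB card; the 3B model already needs ≈ 6 GB for weights alone, which matches the recommendations below.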

The model pages below contain many files; you do not need to download all of them, just the EngChn-testNovel model:

Large model: 7B parameters, the best results; 14 GB VRAM recommended, less VRAM also works (the less VRAM, the slower): https://huggingface.co/BlinkDL/rwkv-4-pile-7b/tree/main
(40% trained so far; it will get stronger as training completes)

Medium model: 3B parameters, above-average results; 6 GB VRAM recommended, less VRAM also works (the less VRAM, the slower): https://huggingface.co/BlinkDL/rwkv-4-pile-3b/tree/main

Small model: 1.5B parameters, decent results; 3 GB VRAM recommended: https://huggingface.co/BlinkDL/rwkv-4-pile-1b5/tree/main

For the 1.5B small model, for example, you only need that one file. Surprisingly, the download speed is fast; full marks to the author!! If the model were hosted on GitHub, pulling it down would take forever.

[Screenshot: model files on the Hugging Face page]
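If you would rather script the download than click through the browser, here is a minimal sketch using the huggingface_hub package; the package choice and the file name are my assumptions, not part of the original tutorial (copy the real file name from the repo page):

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id='BlinkDL/rwkv-4-pile-1b5',  # the 1.5B model repo from the links above
    filename='RWKV-4-Pile-1B5-EngChn-testNovel-xxx-ctx2048-20230xxx.pth',  # placeholder; use the real name from the repo
)
print(path)  # local path of the downloaded .pth file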

3. Edit the code to run

After downloading, open chat.py in the v2 folder of the source code with VS Code and update it as follows (for once, the author had the conscience to write plenty of Chinese comments...):

Set CHAT_LANG = 'Chinese'

Set args.MODEL_NAME = 'C:/xxx/xxx/RWKV-4-Pile-7B-EngChn-testNovel-xxx-ctx2048-20230xxx'
Change MODEL_NAME to the path and file name of the model you downloaded (without the .pth extension), and note that the path must use / (not \).

The default args.strategy = 'cuda fp16' means the whole model is loaded into the GPU.

If you get an error saying VRAM is insufficient, try args.strategy = 'cuda fp16 *12+' (note the plus sign after the number!).
Then raise the 12 as far as you can (as long as there is no error, the larger this number, the faster the model runs).
But do not push it to the absolute limit (if you do, VRAM may run out during generation); it is recommended to find the limit and then subtract 1 or 2.
With this method, even 3 GB of VRAM can run the 7B model (it will be quite slow for now, faster in the future).

You can also try 'cuda fp16 *12 -> cpu fp32', again raising the 12 as high as possible, and compare which is faster.
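Put together, the edited lines near the top of v2/chat.py look roughly like this (the path below keeps the source's xxx placeholders; substitute your actual file name):

CHAT_LANG = 'Chinese'
args.MODEL_NAME = 'C:/xxx/xxx/RWKV-4-Pile-7B-EngChn-testNovel-xxx-ctx2048-20230xxx'  # no .pth extension, use / not \
args.strategy = 'cuda fp16 *12+'  # or 'cuda fp16' if the whole model fits in VRAM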

Measured on my machine:
cuda fp16 uses only the graphics card; the GPU runs at full load and generation is fast, but my card only has 4 GB, so models larger than 4 GB error out.
cuda fp16 *12+ uses the graphics card and system memory together and can run models larger than 4 GB, but it is very slow.

Once the above is done, you can start it up. Here is what it looks like when running:

[Screenshots: ChatRWKV running and generating text]

The repo also includes an api_demo.py, but it has no comments, so I could not quite follow it... If you get it running, it should be able to serve an external API the way ChatGPT does, which would be quite practical.
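For reference, here is a minimal sketch of driving the model from your own script, assuming the rwkv pip package that ChatRWKV v2 is built on (pip install rwkv); the path keeps the xxx placeholders and the sampling values are just examples:

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Load the model with the same strategy string as in chat.py
model = RWKV(model='C:/xxx/xxx/RWKV-4-Pile-1B5-EngChn-testNovel-xxx-ctx2048-20230xxx',
             strategy='cuda fp16 *12+')
# 20B_tokenizer.json ships in the ChatRWKV repo
pipeline = PIPELINE(model, '20B_tokenizer.json')

prompt = '这是一篇玄幻小说。\n'  # any Chinese opening line works for the EngChn novel model
out = pipeline.generate(prompt, token_count=100,
                        args=PIPELINE_ARGS(temperature=1.0, top_p=0.85))
print(out)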


Summary

Reference link:
https://zhuanlan.zhihu.com/p/609154637

https://huggingface.co/BlinkDL/rwkv-4-pile-1b5/tree/main

https://github.com/BlinkDL/ChatRWKV

Origin: https://blog.csdn.net/lyk520dtf/article/details/129261340