Summary of hardware requirements for Llama-2 inference and fine-tuning: an RTX 3080 can fine-tune the smallest model

Large language model fine-tuning refers to additional training of an already pre-trained large language model (such as Llama-2 or Falcon) to adapt it to a specific task or domain. Fine-tuning usually requires a lot of compute, but with techniques such as quantization and LoRA it can also be tested on consumer-grade GPUs, although consumer-grade GPUs cannot hold relatively large models. In my tests, a 7B model can run on an RTX 3080 (10GB), which is very helpful for simple research; for more in-depth work, professional hardware is still needed.
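
As a rough illustration of the quantization-plus-LoRA approach described above, the sketch below loads Llama-2-7B in 4-bit and attaches LoRA adapters. It is a minimal sketch assuming the Hugging Face transformers, peft, and bitsandbytes packages; the model id and LoRA hyperparameters are illustrative, not the exact settings used in this test.

```python
# Minimal QLoRA-style setup: load Llama-2-7B in 4-bit and attach LoRA adapters.
# Assumes: transformers, peft, bitsandbytes installed and access to the model weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative model id

# 4-bit NF4 quantization keeps the 7B weights within consumer-GPU VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized base model and train only the small LoRA matrices.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,                      # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable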

Let’s take a look at the hardware configuration first:

Amazon's g3.xlarge instance comes with a Tesla M60 that has 8GB of VRAM and 2048 CUDA cores; the RTX 3080 has 10GB of GDDR6 VRAM. In terms of memory capacity, the two GPUs are roughly comparable.

The test here fine-tunes Llama-2-7b (~7GB) on a small (65MB of text) custom dataset.
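
For reference, a fine-tuning run of this kind can be set up roughly as below. This is a sketch, not the exact script used here: it continues from the 4-bit LoRA model in the previous snippet, assumes the Hugging Face datasets library, and uses a hypothetical plain-text file at data/custom.txt in place of the real dataset.

```python
# Rough training setup for a small custom text dataset (sketch; paths and
# hyperparameters are illustrative, not the exact ones used in this test).
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# `model` and `tokenizer` are the 4-bit LoRA model/tokenizer from the previous snippet.
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

dataset = load_dataset("text", data_files={"train": "data/custom.txt"})  # hypothetical path

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="llama2-7b-lora",
    per_device_train_batch_size=1,      # small batch to stay within ~8-10GB of VRAM
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```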

As you can see, the 3080 draws a lot of power: peak power consumption reached 364W during training (total PC power draw exceeded 500W).

The training logs show that training runs normally and completes successfully.

To verify the memory consumption, I ran the same job again on the 8GB M60, and it also completed without problems; this is probably right at the limit of the GPU memory.

Training occupies almost 7.1GB of VRAM. Anything much larger probably would not fit, but fortunately this is enough.
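
The power and memory numbers quoted above can be reproduced roughly with a snippet like the one below, which polls the GPU through the nvidia-ml-py (pynvml) bindings and reads PyTorch's peak-allocation counter. It is a generic monitoring sketch, not the tool used for the measurements in this post.

```python
# Quick GPU power / memory check (sketch; not the exact tooling used above).
# Assumes: nvidia-ml-py (pynvml) installed and PyTorch with CUDA available.
import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Instantaneous board power draw, reported by the driver in milliwatts.
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

# Total VRAM usage as seen by the driver (all processes on the GPU).
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
used_gb = mem.used / 1024**3
total_gb = mem.total / 1024**3

# Peak memory allocated by this PyTorch process since start (or last reset).
peak_gb = torch.cuda.max_memory_allocated() / 1024**3

print(f"power: {power_w:.0f} W, VRAM: {used_gb:.1f}/{total_gb:.1f} GB, "
      f"PyTorch peak: {peak_gb:.1f} GB")
pynvml.nvmlShutdown()
```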

Finally, here is a rough list of how much VRAM each model size requires. The figures below are for inference only; LoRA fine-tuning requires roughly an additional 20%. A back-of-the-envelope estimation sketch follows the list.

LLaMA-7B

A GPU with at least 6GB VRAM is recommended. An example of a GPU that fits this model is the RTX 3060, which is available in an 8GB VRAM version.

LLaMA-13B

A GPU with at least 10GB VRAM is recommended. GPUs that meet this requirement include the AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, and A2000. These GPUs provide the necessary VRAM capacity to efficiently handle the computational demands of LLaMA-13B.

LLaMA-30B

A GPU with at least 20GB of VRAM is recommended. The RTX 3080 20GB, A4500, A5000, RTX 3090, RTX 4090, RTX 6000, or Tesla V100 are examples of GPUs that provide the required VRAM capacity and handle LLaMA-30B efficiently.

LLaMA-65B

LLaMA-65B needs at least 40GB of VRAM. Examples of setups suitable for this model include an A100 40GB, 2x RTX 3090, 2x RTX 4090, an A40, an RTX A6000, or an RTX 8000.
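
The figures in the list above roughly follow bytes-per-parameter arithmetic: parameter count times the byte width of the precision, plus some overhead for activations and the KV cache. The helper below is a back-of-the-envelope estimator under that assumption, not an exact calculator; with 4-bit weights its estimates land close to the list above, and the 20% fine-tuning factor matches the LoRA rule of thumb mentioned earlier.

```python
# Back-of-the-envelope VRAM estimate (sketch, not an exact calculator).
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,   # fp16/bf16; use 0.5 for 4-bit
                     overhead: float = 1.2,          # ~20% for activations / KV cache
                     lora_finetune: bool = False) -> float:
    """Rough VRAM needed to run a model with `params_billion` parameters."""
    gb = params_billion * bytes_per_param * overhead
    if lora_finetune:
        gb *= 1.2  # rule of thumb from above: ~20% extra for LoRA fine-tuning
    return gb

# 4-bit examples, for comparison with the list above.
for size in (7, 13, 30, 65):
    print(f"{size}B, 4-bit: ~{estimate_vram_gb(size, bytes_per_param=0.5):.1f} GB")
```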

For speed:

As an example of inference speed, I use an RTX 4090 on the GPU side and an Intel i9-12900K on the CPU side.
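
A simple way to get comparable speed numbers is to time generation and divide by the number of new tokens. The sketch below does this on a GPU with transformers; the model id and generation length are illustrative, and the throughput it reports will of course depend on your hardware and quantization settings.

```python
# Rough tokens-per-second measurement for GPU inference (sketch).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain LoRA fine-tuning in one paragraph.",
                   return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")
```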

LLaMA can also run on a CPU, but it will be very slow; it is best to use the CPU only for inference, not training. The following looks at the inference speed of the 13B model on different CPUs.
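
To produce per-CPU numbers like these yourself, the usual route is a quantized GGUF build of the model run through llama.cpp. The sketch below times 13B generation on the CPU via the llama-cpp-python bindings; the model path and thread count are placeholders you would adapt to your own machine.

```python
# CPU inference timing with a quantized GGUF model (sketch; path/threads are placeholders).
# Assumes: pip install llama-cpp-python and a local 13B GGUF file.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,
    n_threads=8,       # match your physical core count
)

start = time.time()
out = llm("Explain LoRA fine-tuning in one paragraph.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```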

Configuration and performance of individual systems may vary. It is best to experiment and benchmark different setups to find the best solution for your specific needs; the tests above are for reference only.

https://avoid.overfit.cn/post/0dd29b9a89514a988ae54694dccc9fa6


Origin: blog.csdn.net/m0_46510245/article/details/132846400