LLM - Baichuan-13B multi-card loading and inference test

Table of contents


1. Introduction

2. Model loading

1. Quantized loading

◆ Basic configuration

◆ 8-bit loading

◆ 4-bit loading

2. Multi-card loading

◆ API loading

◆ accelerate loading

3. Model inference

1. GPU memory monitoring

◆ NVIDIA GPU monitoring

◆ Python subprocess call

2. Dual-card inference

◆ Dual-card device allocation

◆ Dual-card inference GPU-Util

3. Three-card inference

◆ Three-card device allocation

◆ Three-card inference GPU-Util

◆ Multi-card inference efficiency differences

4. Summary


1. Introduction

Baichuan-13B produces good Chinese-language output. When deploying the Baichuan-13B inference service, the author tried different numbers of GPUs. Below we compare GPU memory usage and inference time across card counts, with and without quantization.

2. Model loading

1. Quantized loading

◆ Basic configuration

    config_kwargs = {
        "trust_remote_code": True,
        "cache_dir": None,
        "revision": 'main',
        "use_auth_token": None,
    }

◆ 8-bit loading

    config_kwargs["load_in_8bit"] = True
 
    config_kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0
    )

◆ 4-bit loading

 
    config_kwargs["load_in_4bit"] = True

    config_kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

Tips:

In the actual test here, Baichuan-13B's GPU memory consumption was the same before and after quantization, i.e., quantization did not take effect. You can try adjusting llm_int8_threshold and test again.
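For reference, below is a minimal sketch of how such a quantization config might be passed into from_pretrained; the original post does not show this step, and ori_model_path is a placeholder for the local Baichuan-13B weights directory.

    # A sketch, not the author's exact loading code.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,          # or the 4-bit settings shown above
        llm_int8_threshold=6.0,
    )

    bc_model = AutoModelForCausalLM.from_pretrained(
        ori_model_path,             # placeholder: path to the Baichuan-13B weights
        trust_remote_code=True,
        revision="main",
        torch_dtype=torch.float16,
        quantization_config=quant_config,
        device_map="auto",          # bitsandbytes loading places weights on GPU
    )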

2. Multi-card loading

◆ API loading

    bc_model = AutoModelForCausalLM.from_pretrained(
        ori_model_path,
        config=config,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        revision='main',
        device_map='auto'
    )

Add the device_map='auto' parameter. If the model still does not spread across cards, try setting the visible devices in the launch script:

export CUDA_VISIBLE_DEVICES=0,1

Adjust the device IDs according to the cards actually available.
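The same restriction can also be applied from Python before torch initializes CUDA; a small sketch (not part of the original script):

    # Equivalent to the shell export above; must run before CUDA is initialized.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # adjust to the cards you want to use

    import torch
    print(torch.cuda.device_count())             # should report 2 with the setting above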

◆ accelerate loading

if torch.cuda.device_count() > 1:
    from accelerate import dispatch_model
    from accelerate.utils import infer_auto_device_map, get_balanced_memory
    # Split the layers evenly across the visible GPUs, then move them into place
    device_map = infer_auto_device_map(bc_model, max_memory=get_balanced_memory(bc_model))
    bc_model = dispatch_model(bc_model, device_map)
    print('multi GPU predict => {}'.format(device_map))
else:
    bc_model = bc_model.cuda()
    print("single GPU predict")

infer_auto_device_map returns the device assignment for each layer; the accelerate version used here is 0.21.0.
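If the balanced split is not what you want, infer_auto_device_map also accepts an explicit per-device memory cap. A sketch with illustrative limits for 24 GB P40s (the "22GiB" values are assumptions, not taken from the test above):

    from accelerate import dispatch_model
    from accelerate.utils import infer_auto_device_map

    # Cap how much memory each card may receive; remaining layers spill to the next device.
    device_map = infer_auto_device_map(
        bc_model,
        max_memory={0: "22GiB", 1: "22GiB"},
    )
    bc_model = dispatch_model(bc_model, device_map)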

3. Model inference

Here the base GPU is the 24 GB P40. Loading Baichuan-13B without quantization requires about 28 GB of GPU memory, so the tests below run inference on two and then three P40 cards.

1. GPU memory monitoring

To observe the GPU memory usage of each card during inference, we monitor memory with both a shell command and a Python function.

◆ NVIDIA GPU monitoring

Run the following command in a shell to invoke nvidia-smi every 3 seconds and watch GPU usage:

watch -n 3 nvidia-smi

◆ Python subprocess call

import subprocess

def get_gpu_memory_usage(info):
    # Query per-GPU memory usage via nvidia-smi
    cmd = "nvidia-smi --query-gpu=memory.used --format=csv,nounits,noheader"
    result = subprocess.run(cmd, stdout=subprocess.PIPE, shell=True, encoding='utf-8')
    memory_used = result.stdout.strip().split('\n')
    print("[%s Memory Usage: %s]" %(info, ','.join(memory_used)))

Python's subprocess module runs the nvidia-smi command, and memory_used ends up holding the memory usage of each card. Call this function wherever GPU memory needs to be checked; info is the corresponding log tag, e.g. before/after model loading or before/after an inference task.
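A sketch of where such calls might sit in the loading and inference flow (the tag strings and the load_model / tokenizer / prompt names are placeholders, not the author's actual script):

    get_gpu_memory_usage("Before model loading")
    bc_model = load_model()                      # placeholder for the loading code shown earlier
    get_gpu_memory_usage("After model loading")

    inputs = tokenizer(prompt, return_tensors="pt").to(bc_model.device)
    get_gpu_memory_usage("Before generation")
    outputs = bc_model.generate(**inputs, max_new_tokens=512)
    get_gpu_memory_usage("After generation")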

2. Dual-card inference

◆ Dual-card device allocation

{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 
 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 
 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0,
 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 
 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 
 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1,
 'model.layers.30': 1, 'model.layers.31': 1, 'model.layers.32': 1, 'model.layers.33': 1, 'model.layers.34': 1, 
 'model.layers.35': 1, 'model.layers.36': 1, 'model.layers.37': 1, 'model.layers.38': 1, 'model.layers.39': 1,
 'model.norm': 1, 'lm_head': 1, 'model.layers.19': 1}

model.layers 0-18 are allocated to card 0, and model.layers 19-39 (note layer 19 appears at the end of the map) to card 1, along with model.norm and lm_head; embed_tokens sits on card 0. Check the GPU memory log:

[After model loading Memory Usage: 12232,13436]
[Before generation Memory Usage: 13746,13548]

Each card's base load is a bit over 12 GB.
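As a side note, when the model is loaded through from_pretrained(..., device_map='auto') rather than dispatch_model, the same layer-to-card placement can be read back from the model object. A quick check, assuming the loading code shown earlier:

    # hf_device_map is set by transformers when device_map is used at load time.
    print(bc_model.hf_device_map)

    # Count how many modules landed on each card.
    from collections import Counter
    print(Counter(bc_model.hf_device_map.values()))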

◆ Dual-card inference GPU-Util

During inference, memory usage on both cards was about 13 GB and GPU-Util hovered around 50% on each. The timing test was run twice:

Cost: 804.4720668792725 Count: 54
Cost: 673.4583792686462 Count: 54

On average it takes about 13.67 s to generate one sample. The gap between the two runs is still fairly large; more accurate numbers need more trials and should account for the token counts of your own inputs and outputs.
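The original timing harness is not shown; the Cost / Count numbers above could come from a loop roughly like the following sketch (samples, tokenizer, and the generation settings are placeholders):

    import time

    samples = ["test prompt"] * 54               # the original run used 54 real samples
    start = time.time()
    for prompt in samples:
        inputs = tokenizer(prompt, return_tensors="pt").to(bc_model.device)
        outputs = bc_model.generate(**inputs, max_new_tokens=512)
    cost = time.time() - start
    print("Cost: {} Count: {}".format(cost, len(samples)))
    print("Avg per sample: {:.2f} s".format(cost / len(samples)))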

3. Three-card inference

◆ Three-card device allocation

 {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0,
 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 
 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.13': 1, 'model.layers.14': 1, 
 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1,
 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 
 'model.layers.25': 1, 'model.layers.27': 2, 'model.layers.28': 2, 'model.layers.29': 2, 'model.layers.30': 2,
 'model.layers.31': 2, 'model.layers.32': 2, 'model.layers.33': 2, 'model.layers.34': 2, 'model.layers.35': 2, 
 'model.layers.36': 2, 'model.layers.37': 2, 'model.layers.38': 2, 'model.layers.39': 2, 'model.norm': 2, 
 'lm_head': 2, 'model.layers.26': 2, 'model.layers.12': 1}

model.layers 0-11 are assigned to card 0, 12-25 to card 1, and 26-39 to card 2. In addition, card 0 holds embed_tokens, and card 2 holds model.norm and lm_head. Check the GPU memory log:

[After model loading Memory Usage: 8018,8596,9222]
[Before generation Memory Usage: 9510,8674,9300]

Each card's base load is about 8-9 GB.

◆ Three-card inference GPU-Util

During inference, memory usage on each of the three cards is around 9 GB and GPU-Util is around 30%. The same timing test was run twice:

Regular:   Cost: 751.8843202590942 Count: 54
Quantized: Cost: 773.8875942230225 Count: 54

On average, it takes 14.11 s to generate a sample.

◆ Multi-card inference efficiency differences

In the tests above, three-card inference takes 14.11 s per sample while two-card inference takes 13.67 s per sample: one more card is actually slower. Possible reasons why multi-card inference slows down:

● Communication overhead

In a multi-card GPU system, data transmission and synchronization operations are inevitably required. When three GPUs are used, more data transmission and synchronization operations are required, which results in additional communication overhead and thus reduces the performance of inference.

● Memory bandwidth limit

In a multi-card GPU system, the memory of each GPU is independent, and one GPU cannot directly access data held on another. When the model and data do not fit into a single GPU's memory, the computation has to be split across GPUs. In a three-card system the data movement between cards increases, so memory bandwidth may become a bottleneck and affect inference speed.

● Computing power utilization

In some cases, the size of the model may not fully utilize the parallel computing capabilities of multi-card GPU systems. For example, if the model is small or there are a lot of serial calculations in the inference process, the advantages of a multi-card GPU system may not be fully utilized. In this case, using a three-card GPU may add additional overhead and not bring significant performance improvements.

In the case above, the slowdown is most likely caused by [communication overhead] and [computing power utilization].

4. Summary

Baichuan-13B was used here to try different quantization strategies and multi-card inference. Quantization did not take effect in this test; with the same configuration on LLaMA-33B, 8-bit quantization did work and reduced memory usage from about 65 GB to around 33 GB, so the actual effect of quantization depends on the model and your business scenario. In addition, multi-card inference currently spreads the memory load evenly across cards but brings no significant efficiency gain; readers with relevant experience are welcome to share in the comments.


Origin blog.csdn.net/BIT_666/article/details/132538581