The out-of-memory dilemma: how to run Qwen2-VL on an RTX 4090

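Introduction to the Qwen2-VL model

Qwen2-VL comes in three sizes, with 2 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2-VL model.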

Evaluation results

Image benchmarks

Benchmark	InternVL2-8B	MiniCPM-V 2.6	GPT-4o-mini	Qwen2-VL-7B
MMMU (val)	51.8	49.8	60	54.1
DocVQA (test)	91.6	90.8	-	94.5
InfoVQA (test)	74.8	-	-	76.5
ChartQA (test)	83.3	-	-	83.0
TextVQA (val)	77.4	80.1	-	84.3
OCRBench	794	852	785	845
MTVQA	-	-	-	26.3
RealWorldQA	64.4	-	-	70.1
MME (sum)	2210.3	2348.4	2003.4	2326.8
MMBench-EN (test)	81.7	-	-	83.0
MMBench-CN (test)	81.2	-	-	80.5
MMBench-V1.1 (test)	79.4	78.0	76.0	80.7
MMT-Bench (test)	-	-	-	63.7
MMStar	61.5	57.5	54.8	60.7
MMVet (GPT-4-Turbo)	54.2	60.0	66.9	62.0
HallBench (avg)	45.2	48.1	46.1	50.6
MathVista (testmini)	58.3	60.6	52.4	58.2
MathVision	-	-	-	16.3

Video benchmarks

Benchmark	InternVL2-8B	LLaVA-OneVision-7B	MiniCPM-V 2.6	Qwen2-VL-7B
MVBench	66.4	56.7	-	67.0
PerceptionTest (test)	-	57.1	-	62.3
EgoSchema (test)	-	60.1	-	66.7
Video-MME (wo/w subs)	54.0/56.9	58.2/-	60.9/63.6	63.3/69.0

Requirements

The Qwen2-VL code is included in the latest Hugging Face Transformers. We recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error:

KeyError: 'qwen2_vl'
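
This KeyError typically means the installed Transformers build does not yet register the qwen2_vl model type. As a rough pre-flight check, the sketch below compares the installed version against 4.45.0, which is assumed here to be the first release that ships Qwen2-VL. The helper names parse_version and supports_qwen2_vl are hypothetical, and installing from source remains the approach recommended above.

```python
import re
from importlib import metadata

# Assumption: Qwen2-VL support first shipped in the 4.45.0 release.
MIN_VERSION = (4, 45, 0)


def parse_version(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    parts = re.findall(r"\d+", version.split("+")[0])[:3]
    numbers = [int(p) for p in parts]
    # Pad short versions such as '4.45' to three components
    return tuple(numbers + [0] * (3 - len(numbers)))


def supports_qwen2_vl(version: str) -> bool:
    """True if the given version is at least the assumed minimum."""
    return parse_version(version) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen2_vl(installed) else "too old, expect KeyError: 'qwen2_vl'"
        print(f"transformers {installed}: {status}")
```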

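Quick start

We provide a toolkit that makes it easier to handle various types of visual input, including base64 strings, URLs, and interleaved images and videos. Install it with: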

pip install qwen-vl-utils

The following example shows how to use transformers together with qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download

# Download the model from ModelScope and get the local directory
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_dir)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
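
The toolkit accepts images as URLs, local paths, or base64 strings. To illustrate the base64 case, the sketch below wraps raw image bytes into a user message. The data:image;base64, prefix follows the qwen-vl-utils README, while the helper name build_base64_image_message and the placeholder bytes are hypothetical.

```python
import base64


def build_base64_image_message(image_bytes: bytes, prompt: str) -> list:
    """Build a single-turn chat message carrying an image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    # Placeholder bytes for illustration only; in practice, read a real image
    # file, e.g. open("demo.jpeg", "rb").read()
    messages = build_base64_image_message(b"not-a-real-image", "Describe this image.")
    print(messages[0]["content"][0]["image"][:40])
```

With real image bytes, the resulting list should be usable in place of the messages variable in the quick-start example.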

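Limitations

Although Qwen2-VL can be applied to a wide range of vision tasks, it is equally important to understand its limitations:

Lack of audio support: the current model cannot understand audio information in videos.
Data timeliness: our image dataset is updated through June 2023, so later information may not be covered.
Individuals and intellectual property: the model's ability to recognize specific individuals or intellectual property is limited, and it may not cover all well-known people or brands.
Limited capacity for complex instructions: when faced with complex multi-step instructions, the model's understanding and execution need improvement.
Insufficient counting accuracy: object counting is less accurate in complex scenes.
Weak spatial reasoning: the model's inference of positional relationships between objects is inadequate, especially in 3D space.

These limitations point to ongoing directions for optimization and improvement, and we are committed to continually improving the model's performance and range of applications.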

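Hands-on test on an RTX 4090

Base environment

Operating system: Ubuntu 22.04

Software environment:

PyTorch: 2.3.0
Python: 3.12

Hardware environment:

GPU: RTX 4090D (24 GB) × 1
CPU: 15 vCPU Intel® Xeon® Platinum 8474C
RAM: 80 GB (measured peak usage: 20 GB)
Disk:
System disk: 30 GB
Data disk: 50 GB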

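When running Qwen2-VL inference on the RTX 4090, I ran into GPU out-of-memory errors. To work around this, I took a few measures that let the model just barely run on this card.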
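
To get a feel for why 24 GB is tight, a rough estimate of the memory taken by the weights alone can help. The sketch below assumes the nominal 7 billion parameters stated earlier and ignores activations, the KV cache, and image tokens, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
# Bytes used to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

Under these assumptions, float32 weights (about 26 GiB) would not fit on a 24 GB card at all, while bfloat16 weights (about 13 GiB) leave some room for activations and image tokens.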

First, I installed flash-attn to reduce GPU memory usage:

pip install flash-attn --no-build-isolation
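
Building flash-attn can fail on some setups. As a hedged sketch, the snippet below picks the attention backend at runtime: it checks whether the flash_attn package is importable and otherwise falls back to "sdpa", PyTorch's built-in scaled dot-product attention. The helper name pick_attn_implementation is hypothetical, and the fallback may not give the same memory savings.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return 'flash_attention_2' if flash-attn is installed, else 'sdpa'."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

The returned string can be passed as the attn_implementation argument of from_pretrained.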

Then I loaded the model with the following code:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
import torch

# Download the model from ModelScope and get the local directory
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# Enable Flash Attention for better speed and lower GPU memory usage
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained(model_dir)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
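
If memory is still tight, one further option (not part of the run described here) is to cap the input resolution. The Qwen2-VL model card describes passing min_pixels and max_pixels to AutoProcessor.from_pretrained, with each visual token corresponding to a 28×28 pixel area. The sketch below assumes the 256 to 1280 token range suggested there, and the helper names are hypothetical.

```python
PIXELS_PER_TOKEN = 28 * 28  # each visual token covers a 28x28 pixel area


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token budget into a pixel count."""
    return num_tokens * PIXELS_PER_TOKEN


def load_processor_with_limits(model_dir: str, min_tokens: int = 256, max_tokens: int = 1280):
    """Load the processor so that images are resized to stay within the token range."""
    # Imported lazily so the arithmetic helper works without transformers installed
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(
        model_dir,
        min_pixels=pixel_budget(min_tokens),
        max_pixels=pixel_budget(max_tokens),
    )


if __name__ == "__main__":
    print(pixel_budget(256), pixel_budget(1280))
```

A lower max_tokens value means fewer image tokens and less memory, at the cost of some detail in high-resolution images.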

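With these settings, Qwen2-VL ran successfully on the RTX 4090 and produced the expected output. Loading the weights in bfloat16 with FlashAttention-2 enabled kept GPU memory usage under control and made inference more efficient.

Single-threaded generation speed test

The loop reuses the model, processor, and inputs from the previous example and runs generation 100 times:
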
from tqdm import tqdm

# Reuses model, processor, and inputs from the example above
for _ in tqdm(range(100)):
    # Inference: generate the output
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    # print(output_text)

Test results

100%|██████████| 100/100 [06:09<00:00, 3.69s/it]
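
The progress bar reports 3.69 s per generation. As a rough upper bound on decoding throughput, the sketch below assumes every call actually produced all 128 new tokens. Generation may stop earlier at an end-of-sequence token, which would lower the real figure. The function name tokens_per_second is hypothetical.

```python
def tokens_per_second(seconds_per_iter: float, tokens_per_iter: int) -> float:
    """Throughput if every iteration generates tokens_per_iter tokens."""
    return tokens_per_iter / seconds_per_iter


if __name__ == "__main__":
    total_seconds = 6 * 60 + 9      # 06:09 from the tqdm output
    per_iter = total_seconds / 100  # 100 iterations
    print(f"{per_iter:.2f} s/it, at most {tokens_per_second(per_iter, 128):.1f} tokens/s")
```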
