[Notes] Solving "CUDA Error: no kernel image is available for execution on the device" when learning ChatGLM2

Problem encountered while learning ChatGLM2: running chatglm2-6b-int4 with model.half().cuda() fails with:

        CUDA Error: no kernel image is available for execution on the device

This error typically means the precompiled quantization kernels were not built for your GPU's compute capability. If you just want the model to run and don't mind the speed, you can try the following simple workaround, which replaces the kernel calls with plain PyTorch operations:

1. When loading the model, import the local model files so they can be modified directly
 

# Import the implementation from the local chatglm2_6b_int4 directory
# (a local copy of the model files), so that quantization.py can be edited.
from chatglm2_6b_int4.configuration_chatglm import *
from chatglm2_6b_int4.modeling_chatglm import *
from chatglm2_6b_int4.tokenization_chatglm import *
from chatglm2_6b_int4.quantization import *

tokenizer = ChatGLMTokenizer.from_pretrained("chatglm2_6b_int4/")
model = ChatGLMForConditionalGeneration.from_pretrained("chatglm2_6b_int4/").half().cuda()
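For reference, a minimal smoke test after loading, assuming the standard ChatGLM2 chat() API (the prompt string is just an example):

model = model.eval()
# chat() returns the reply plus the updated conversation history
response, history = model.chat(tokenizer, "Hello", history=[])
print(response)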

2. Modify the extract_weight_to_half function in chatglm2_6b_int4/quantization.py: comment out the CUDA kernel call and dequantize the int4 weights with plain PyTorch instead
 

# The original CUDA kernel call, commented out:
# func(
#     gridDim,
#     blockDim,
#     0,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(n),
#         ctypes.c_int32(m),
#     ],
# )

# Dequantize in PyTorch instead: each int8 in weight packs two signed int4 values.
out[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)          # high nibble, sign-extended
out[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)   # low nibble, sign-extended
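To see why the two shift lines reproduce what the kernel did, here is a standalone sketch with made-up toy values (not from the model) that packs two signed int4 values into one int8 and recovers them the same way:

import torch

# Toy values: pack 3 and -2 into a single int8 (high nibble, low nibble).
high, low = 3, -2
weight = torch.tensor([[(high << 4) | (low & 0x0F)]], dtype=torch.int8)
scale_list = torch.tensor([1.0], dtype=torch.half)

out = torch.empty(1, 2, dtype=torch.half)
out[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)          # arithmetic shift keeps the sign
out[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)
print(out)  # expected: [[ 3., -2.]]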

3. Modify the quant_gemv function in chatglm2_6b_int4/quantization.py in the same way: comment out the CUDA kernel call and replace it with dequantize-then-matmul in PyTorch
 

# The original CUDA kernel call, commented out:
# func(
#     gridDim,
#     blockDim,
#     shm_size,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(input.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(m),
#         ctypes.c_int32(k),
#     ],
# )

if input.dtype == torch.float:
    source_bit_width = 8
elif input.dtype == torch.float16:
    source_bit_width = 4
else:
    assert False, f"unsupported input type: {input.dtype}"

# Dequantize the packed weights into a temporary half-precision matrix,
# then use a plain matmul in place of the GEMV kernel.
tmp = torch.empty(weight.size(0), weight.size(1) * (8 // source_bit_width), dtype=input.dtype, device="cuda")
tmp[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)          # high nibble
tmp[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)   # low nibble
out = torch.matmul(input, tmp.transpose(1,0))
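The same idea on hypothetical toy shapes (the sizes below are made up for illustration, and a CUDA device is assumed to be available), showing that the matmul against the transposed dequantized matrix produces the output the GEMV kernel used to write:

import torch

# Hypothetical sizes: k = 8 input features, n = 4 output rows.
# The int4 weights are packed two per byte, so the stored shape is (n, k // 2).
n, k = 4, 8
input = torch.randn(2, k, dtype=torch.half, device="cuda")
weight = torch.randint(-128, 128, (n, k // 2), dtype=torch.int8, device="cuda")
scale_list = torch.rand(n, dtype=torch.half, device="cuda")

tmp = torch.empty(n, k, dtype=input.dtype, device="cuda")
tmp[:, 0::2] = scale_list.view(-1,1) * (weight >> 4)
tmp[:, 1::2] = scale_list.view(-1,1) * ((weight << 4) >> 4)
out = torch.matmul(input, tmp.transpose(1,0))
print(out.shape)  # torch.Size([2, 4])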

Tested in practice, this does run; the speed is about the same as running chatglm2-6b on the CPU.

Original post: blog.csdn.net/miles2007/article/details/132805941