Dynamic input resolution in TensorRT

In this article, only the Python TensorRT part involves dynamic-resolution settings; C++ is not covered.

See also the Zhihu blog post:

tensorrt dynamic input (dynamic shapes) - Programmer Sought

There are two reasons for writing this post: 1. Plenty of people probably need it. 2. None of the posts I found explain it clearly, and the official documentation is hard to follow, so I had to guess my way through. Enough talk, straight to the code.

Take PyTorch to ONNX to TensorRT as the example; the dynamic dimensions are the image height and width.

PyTorch to ONNX:

import torch

def export_onnx(model, image_shape, onnx_path, batch_size=1):
    height, width = image_shape
    # Dummy input, used only to trace the model during export.
    img = torch.zeros((batch_size, 3, height, width))
    dynamic_onnx = True
    if dynamic_onnx:
        # Mark axes 2 and 3 (height and width) as dynamic on both input and output.
        dynamic_ax = {'input_1': {2: 'image_height', 3: 'image_width'},
                      'output_1': {2: 'image_height', 3: 'image_width'}}
        torch.onnx.export(model, (img), onnx_path,
                          input_names=["input_1"], output_names=["output_1"],
                          verbose=False, opset_version=11, dynamic_axes=dynamic_ax)
    else:
        torch.onnx.export(model, (img), onnx_path,
                          input_names=["input_1"], output_names=["output_1"],
                          verbose=False, opset_version=11)
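
The dynamic_axes argument is just a nested dictionary: tensor name, then axis index, then a symbolic axis name. As a small sketch (make_dynamic_axes is a hypothetical helper, not part of PyTorch or of the original post), the mapping for several tensors could be built like this:

```python
def make_dynamic_axes(tensor_names, height_axis=2, width_axis=3):
    """Build a dynamic_axes mapping that marks height and width as symbolic.

    Hypothetical helper for illustration; the result has the shape that
    torch.onnx.export expects for its dynamic_axes argument.
    """
    return {
        name: {height_axis: "image_height", width_axis: "image_width"}
        for name in tensor_names
    }


dynamic_ax = make_dynamic_axes(["input_1", "output_1"])
print(dynamic_ax)
```

Any axis not listed in the mapping keeps the fixed size of the dummy input used for export, which is why the batch and channel axes stay at 1 and 3 here.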

ONNX to TensorRT:

According to the definition of dynamic shapes in NVIDIA's official documentation, "dynamic" simply means leaving a dimension unspecified when the engine is defined, writing -1 in its place, and fixing the actual value at inference time. So both the engine-building code and the inference code need to change.
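
To make the -1 placeholder idea concrete, here is a minimal pure-Python sketch, with no TensorRT involved. The function name resolve_dynamic_shape is made up for illustration; it only mimics the bookkeeping of swapping each -1 for the real dimension while leaving fixed dimensions alone.

```python
def resolve_dynamic_shape(engine_shape, runtime_shape):
    """Replace every -1 in engine_shape with the matching runtime dimension.

    Illustrative only: fixed dimensions must match exactly, dynamic ones
    (-1) take whatever the actual input provides.
    """
    if len(engine_shape) != len(runtime_shape):
        raise ValueError("rank mismatch between engine and runtime shapes")
    resolved = []
    for declared, actual in zip(engine_shape, runtime_shape):
        if declared != -1 and declared != actual:
            raise ValueError(f"fixed dimension {declared} cannot become {actual}")
        resolved.append(actual)
    return tuple(resolved)


print(resolve_dynamic_shape((1, 3, -1, -1), (1, 3, 768, 1024)))  # (1, 3, 768, 1024)
```

With an engine input declared as (1, 3, -1, -1), only the last two dimensions are free; trying to change the channel count would raise an error in this sketch.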

When building the engine, the input and output of the network parsed from the ONNX file already have dynamic shapes. You only need to add an optimization profile that sets the range of input sizes.

import tensorrt as trt

# The three definitions below are not shown in the original post; they are
# assumed here and follow the conventions of the NVIDIA Python samples.
# The code targets the TensorRT 7-era Python API that the post uses.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * (1 << 30)

def build_engine(onnx_path, using_half, engine_file, dynamic_input=True):
    trt.init_libnvinfer_plugins(None, '')
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_batch_size = 1  # always 1 for explicit batch
        config = builder.create_builder_config()
        config.max_workspace_size = GiB(1)
        if using_half:
            config.set_flag(trt.BuilderFlag.FP16)
        # Load the ONNX model and parse it in order to populate the TensorRT network.
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # Added for dynamic input: the (min, opt, max) shapes allowed for "input_1".
        if dynamic_input:
            profile = builder.create_optimization_profile()
            profile.set_shape("input_1", (1, 3, 512, 512), (1, 3, 1024, 1024), (1, 3, 1600, 1600))
            config.add_optimization_profile(profile)
        # Append a sigmoid layer to the network output.
        previous_output = network.get_output(0)
        network.unmark_output(previous_output)
        sigmoid_layer = network.add_activation(previous_output, trt.ActivationType.SIGMOID)
        network.mark_output(sigmoid_layer.get_output(0))
        return builder.build_engine(network, config)
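
The three shapes passed to set_shape are the minimum, optimum and maximum input shapes: at inference time every input dimension has to fall within [min, max], and the builder tunes for the optimum. The following is a small pure-Python sketch of that range check, useful for validating an image size before handing it to the engine. shape_in_profile is a hypothetical helper, not a TensorRT function; the constants simply repeat the values used in build_engine.

```python
MIN_SHAPE = (1, 3, 512, 512)
OPT_SHAPE = (1, 3, 1024, 1024)
MAX_SHAPE = (1, 3, 1600, 1600)


def shape_in_profile(shape, min_shape=MIN_SHAPE, max_shape=MAX_SHAPE):
    """Return True if every dimension of shape lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


print(shape_in_profile((1, 3, 800, 1200)))    # True: inside the profile
print(shape_in_profile((1, 3, 2048, 2048)))   # False: larger than the maximum
print(shape_in_profile((1, 3, 256, 512)))     # False: height below the minimum
```

An image outside this range would need resizing or a rebuilt engine with a wider profile.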

Python TensorRT inference:

There is a big hidden pitfall at inference time. My earlier understanding was that, since the input is dynamic, I only needed to allocate a suitable buffer for the input and could then run inference at any size. That turned out to be naive. According to the official documentation, you must add the line context.active_optimization_profile = 0 at inference time to select the corresponding optimization profile. Fine, I added it, but it still raised an error: because the input size was not fixed when the engine was defined, it has to be set from the actual input at inference time.

import numpy as np
import pycuda.autoinit  # initializes CUDA and creates a context on import
import pycuda.driver as cuda

# preprocess_image and allocate_buffers are the author's own helpers; they are not shown in this post.
def profile_trt(engine, imagepath, batch_size):
    assert engine is not None

    input_image, input_shape = preprocess_image(imagepath)

    segment_inputs, segment_outputs, segment_bindings = allocate_buffers(engine, True, input_shape)

    stream = cuda.Stream()
    with engine.create_execution_context() as context:
        context.active_optimization_profile = 0  # added: select optimization profile 0
        origin_inputshape = context.get_binding_shape(0)
        # Added: if the engine input is dynamic (-1), set the real height and width.
        if origin_inputshape[-1] == -1:
            origin_inputshape[-2], origin_inputshape[-1] = input_shape
            context.set_binding_shape(0, origin_inputshape)
        input_img_array = np.array([input_image] * batch_size)
        img = input_img_array.astype(np.float32)
        segment_inputs[0].host = img
        # Copy the inputs from host to device asynchronously.
        for inp in segment_inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, stream)
        stream.synchronize()  # wait for all activity on this stream to finish

        # Asynchronously execute inference on a batch.
        context.execute_async(bindings=segment_bindings, stream_handle=stream.handle)
        stream.synchronize()
        # Copy the outputs from device back to host asynchronously.
        for out in segment_outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, stream)
        stream.synchronize()
        results = np.array(segment_outputs[0].host).reshape(batch_size, input_shape[0], input_shape[1])
    return results.transpose(1, 2, 0)
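
The allocate_buffers helper is not shown in the post, so how it sizes the buffers is unknown. One plausible approach, sketched here purely as an assumption, is to size each buffer for the largest shape in the optimization profile so that a single allocation can serve every resolution. buffer_nbytes is a hypothetical name, and float32 data (4 bytes per element) is assumed.

```python
import math


def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of the given shape (float32 by default)."""
    return math.prod(shape) * itemsize


MAX_SHAPE = (1, 3, 1600, 1600)      # maximum shape from the optimization profile
current_shape = (1, 3, 768, 1024)   # an example runtime input

max_bytes = buffer_nbytes(MAX_SHAPE)
current_bytes = buffer_nbytes(current_shape)

print(max_bytes)                    # 30720000
print(current_bytes)                # 9437184
print(current_bytes <= max_bytes)   # True: the max-sized buffer is large enough
```

Sizing for the maximum costs some memory but avoids reallocating device buffers whenever the image size changes; allocating per input, as the call with input_shape above suggests the author may do, is the other option.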

It was only a few lines of code, yet it cost me a whole day of fiddling. Fortunately the dynamic-input problem is solved, and there was no need to write a pile of messy code.

Original link: https://blog.csdn.net/weixin_42365510/article/details/112088887