PyTorch 2.x C++ deployment


High-performance deployment of PyTorch models has always been a topic of discussion. There are two key requirements:

  • Highly optimized operators

  • An architecture and runtime that can execute computation graphs efficiently

Highly optimized operators need little explanation: TensorRT is fast precisely because, while building the engine, it searches each platform (A10, A100, T4, etc.) for the best and fastest kernels (implementations of particular ops). Running the computation graph efficiently is the other key point. After TensorRT builds an engine, libnvinfer.so is needed to drive it, and what it implements is easy to guess from how it is used:

  • Serialization and deserialization, i.e. generating and loading the engine

  • The inference runtime: executing the computation graph on multiple streams and managing the resources the engine needs, such as GPU memory and intermediate variables

To achieve maximum performance, the entire TensorRT runtime lives in C++. Although a Python API is provided, the actual work happens in C++; Python only provides a thin layer of functions, while the operators and the entire computation graph are executed in C++.

C++ API vs Python API

Python has the advantage of rapid development and verification, but compared to C++ it is slower and consumes more memory. High-performance scenarios are therefore generally deployed in C++, avoiding the Python environment as much as possible.

C++ deployment in TORCH 1.x era

In real torch 1.x scenarios, libtorch + torchscript are generally used together; this combination has been validated in many production environments.

With libtorch, the C++ API can do the same things as PyTorch ops in Python, for example:

#include <ATen/ATen.h>

at::Tensor a = at::ones({2, 2}, at::kInt);
at::Tensor b = at::randn({2, 2});
auto c = a + b.to(at::kInt);

The equivalent PyTorch Python code is:

import torch

a = torch.ones((2, 2), dtype=torch.int32)
b = torch.randn((2, 2))
c = a + b.to(torch.int32)

torchscript is used to trace or script our model for deployment in a C++ environment. The speed itself does not change much; the main benefit is that a model exported through torchscript can be loaded and run in C++ without depending on Python, removing some Python overhead:

#include <torch/script.h> // One-stop header.

#include <iostream>
#include <memory>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: example-app <path-to-exported-script-module>\n";
    return -1;
  }


  torch::jit::script::Module module;
  try {
    // Deserialize the ScriptModule from a file using torch::jit::load().
    module = torch::jit::load(argv[1]);
  }
  catch (const c10::Error& e) {
    std::cerr << "error loading the model\n";
    return -1;
  }

  std::cout << "ok\n";
}
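
To complete the picture, here is a minimal sketch of actually running inference on the loaded module (my own example, not part of the official tutorial; the file name traced_model.pt and the {1, 3, 224, 224} input shape are placeholders that must match whatever was traced):

#include <torch/script.h>

#include <iostream>
#include <vector>

int main() {
  // Load a TorchScript module exported from Python; the file name is illustrative.
  torch::jit::script::Module module = torch::jit::load("traced_model.pt");
  module.eval();

  // Inference only: disable autograd bookkeeping.
  torch::NoGradGuard no_grad;

  // Build the inputs; the shape here is a placeholder and must match the traced model.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));

  // Run the graph and read the result back as a tensor.
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << output.sizes() << std::endl;
  return 0;
}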

C++ deployment of TORCH 2.x

The most important thing to arrive with torch 2.0 is the new torch.compile API, which can optimize a model directly.

The core of torch.compile is dynamo. Compared with torch.jit.trace and torch.jit.script, dynamo is a more powerful tracing tool[1] that can capture the model graph for optimization. After dynamo appeared, I wondered whether torchscript would be abandoned.

torchscript

For now it seems torchscript will continue to exist, but in a frozen state: existing functionality will be maintained and bugs fixed, but no new features will be added.

The model export path based on torch.jit.trace is becoming a thing of the past, so what is the new C++ deployment solution for PT 2.0?

The PyTorch team published a new blog post a week ago that officially mentions the cpp wrapper. The core is torch.export[2] + cpp wrapper[3]:

  • PyTorch 2.1 Contains New Performance Features for AI Developers[4]

Using the cpp wrapper to invoke TorchInductor's generated kernels and external kernels reduces Python overhead. In actual tests, the faster the model, the larger the share of Python overhead, and the bigger the improvement:

[Figure: cpp wrapper benchmark]

We know that torch 2.0 can generate high-performance kernels with triton, for example:

@torch.compile
def opt_foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

for _ in range(100):
    opt_foo(torch.randn(10).cuda(), torch.randn(10).cuda())

After defining a function, add the @torch.compile decorator and execute it a few times to get the optimized version. The default backend is TorchInductor. With the help of depyf[5], we can see the triton code generated after optimization (GPU side):

import triton
import triton.language as tl
from torch._inductor.ir import ReductionHint
from torch._inductor.ir import TileHint
from torch._inductor.triton_heuristics import AutotuneHint, pointwise
from torch._inductor.utils import instance_descriptor
from torch._inductor import triton_helpers

@pointwise(
    size_hints=[16], 
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_add_cos_sin_0', 'mutated_arg_names': []},
    min_elem_per_thread=0
)
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 10
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp2 = tl.load(in_ptr1 + (x0), xmask)
    tmp1 = tl.sin(tmp0)
    tmp3 = tl.cos(tmp2)
    tmp4 = tmp1 + tmp3
    tl.store(out_ptr0 + (x0), tmp4, xmask)

This triton code can be called directly, but it depends on the python environment. If you want to switch to the C++ side, modify the config:

import torch._inductor.config as config
config.cpp_wrapper = True

After re-executing it several times, you can get the generated cpp calling code:

#include <ATen/ATen.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/native/BinaryOps.h>
#include <torch/csrc/inductor/aoti_torch/tensor_converter.h>
#include <torch/csrc/inductor/inductor_ops.h>
#define reinterpret_tensor torch::inductor::_reinterpret_tensor
#define alloc_from_pool torch::inductor::_alloc_from_pool
#include <c10/util/generic_math.h>

[[maybe_unused]] static int64_t align(int64_t nbytes) {
  return (nbytes + 64 - 1) & -64;
}
#include <filesystem>

#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

#define CUDA_DRIVER_CHECK(EXPR)                    \
do {                                               \
    CUresult code = EXPR;                          \
    const char *msg;                               \
    cuGetErrorString(code, &msg);                  \
    if (code != CUDA_SUCCESS) {                    \
        throw std::runtime_error(                  \
            std::string("CUDA driver error: ") +   \
            std::string(msg));                     \
    }                                              \
} while (0);

namespace {

struct Grid {
    Grid(uint32_t x, uint32_t y, uint32_t z)
      : grid_x(x), grid_y(y), grid_z(z) {}
    uint32_t grid_x;
    uint32_t grid_y;
    uint32_t grid_z;

    bool is_non_zero() {
        return grid_x > 0 && grid_y > 0 && grid_z > 0;
    }
};

}  // anonymous namespace

static inline CUfunction loadKernel(
        std::string filePath,
        const std::string &funcName,
        uint32_t sharedMemBytes,
        const std::optional<std::string> &cubinDir = std::nullopt) {
    if (cubinDir) {
        std::filesystem::path p1{*cubinDir};
        std::filesystem::path p2{filePath};
        filePath = (p1 / p2.filename()).string();
    }

    CUmodule mod;
    CUfunction func;
    CUDA_DRIVER_CHECK(cuModuleLoad(&mod, filePath.c_str()));
    CUDA_DRIVER_CHECK(cuModuleGetFunction(&func, mod, funcName.c_str()));
    if (sharedMemBytes > 0) {
        CUDA_DRIVER_CHECK(cuFuncSetAttribute(
            func,
            CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
            sharedMemBytes
        ))
    }
    return func;
}

static inline void launchKernel(
        CUfunction func,
        uint32_t gridX,
        uint32_t gridY,
        uint32_t gridZ,
        uint32_t numWarps,
        uint32_t sharedMemBytes,
        void* args[],
        cudaStream_t stream) {
    CUDA_DRIVER_CHECK(cuLaunchKernel(
        func, gridX, gridY, gridZ, 32*numWarps, 1, 1, sharedMemBytes, stream, args, nullptr
    ));
}

static CUfunction triton_poi_fused_add_cos_sin_0 = nullptr;

std::vector<at::Tensor> inductor_entry_cpp(const std::vector<at::Tensor>& inputs) {

    py::gil_scoped_release release;
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);

    at::cuda::CUDAGuard device_guard(0);
    auto buf0 = at::empty_strided({10L, }, {1L, }, at::TensorOptions(c10::Device(at::kCUDA, 0)).dtype(at::kFloat));
    // Source Nodes: [a, add, b], Original ATen: [aten.add, aten.cos, aten.sin]
    if (triton_poi_fused_add_cos_sin_0 == nullptr) {
        triton_poi_fused_add_cos_sin_0 = loadKernel("/tmp/torchinductor_oldpan/rg/crgz7xmq52z337gwizafhl5xeujixy6bjenwk4nrtrulwqolpnzf.cubin", "triton__0d1d2d3", 0);
    }
    CUdeviceptr var_0 = reinterpret_cast<CUdeviceptr>(arg0_1.data_ptr());
    CUdeviceptr var_1 = reinterpret_cast<CUdeviceptr>(arg1_1.data_ptr());
    CUdeviceptr var_2 = reinterpret_cast<CUdeviceptr>(buf0.data_ptr());
    auto var_3 = 10;
    void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
    cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
    Grid triton_poi_fused_add_cos_sin_0_grid_0 = Grid(1L, 1L, 1L);
    launchKernel(triton_poi_fused_add_cos_sin_0, triton_poi_fused_add_cos_sin_0_grid_0.grid_x, triton_poi_fused_add_cos_sin_0_grid_0.grid_y, triton_poi_fused_add_cos_sin_0_grid_0.grid_z, 1, 0, kernel_args_var_0, stream0);
    arg0_1.reset();
    arg1_1.reset();
    return {buf0};
}

The cubin being called, /tmp/torchinductor_oldpan/rg/xxx.cubin, is compiled from the triton code generated above. At this point, the cpp code can be used directly to run in a Python-free environment.

In practice, however, we mostly work with whole models such as resnet50, which carry weight parameters. This is also supported: torch officially provides an AOT tool that can export the entire model as an .so:

import torch
from torch._export import aot_compile, dynamic_dim

torch.manual_seed(1337)

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 10)

    def forward(self, x, y):
        return self.fc(torch.sin(x) + torch.cos(y))

data = {}

for device in ["cpu", "cuda"]:
    model = Net().to(device=device)
    x = torch.randn((32, 64), device=device)
    y = torch.randn((32, 64), device=device)
    with torch.no_grad():
        ref_output = model(x, y)

    torch._dynamo.reset()
    with torch.no_grad():
        constraints = [
            dynamic_dim(x, 0) >= 1,
            dynamic_dim(x, 0) <= 1024,
            dynamic_dim(x, 0) == dynamic_dim(y, 0),
        ]
        model_so_path = aot_compile(model, (x, y), constraints=constraints)

    data.update({
        f"model_so_path_{device}": model_so_path,
        f"inputs_{device}": [x, y],
        f"outputs_{device}": [ref_output],
    })

# Use this to communicate tensors to the cpp code
class Serializer(torch.nn.Module):
    def __init__(self, data):
        super().__init__()
        for key in data:
            setattr(self, key, data[key])

torch.jit.script(Serializer(data)).save("data.pt")

Through aot_compile you can directly export an .so with a model entry point. Because it is AOT, some input-dimension information needs to be specified in advance, which is required to support dynamic shapes.

The exported .so can be loaded in C++ as follows:

void test_aoti(const std::string& device) {
  torch::NoGradGuard no_grad;

  std::string data_path =
      (std::filesystem::path(STRINGIZE(CMAKE_CURRENT_BINARY_DIR)) / "data.pt")
           .string();
  torch::jit::script::Module data_loader = torch::jit::load(data_path);
  std::string path_attr = "model_so_path_" + device;
  std::string inputs_attr = "inputs_" + device;
  std::string outputs_attr = "outputs_" + device;
  const auto& model_so_path = data_loader.attr(path_attr.c_str()).toStringRef();
  const auto& input_tensors =
      data_loader.attr(inputs_attr.c_str()).toTensorList().vec();
  const auto& ref_output_tensors =
      data_loader.attr(outputs_attr.c_str()).toTensorList().vec();

  std::unique_ptr<torch::inductor::AOTIModelContainerRunner> runner;
  if (device == "cuda") {
    runner = std::make_unique<torch::inductor::AOTIModelContainerRunnerCuda>(
        model_so_path.c_str());
  } else if (device == "cpu") {
    runner = std::make_unique<torch::inductor::AOTIModelContainerRunnerCpu>(
        model_so_path.c_str());
  } else {
    std::cout << "unsupported device: " << device << std::endl;
  }
  auto actual_output_tensors = runner->run(input_tensors);
  assert(actual_output_tensors.size() == ref_output_tensors.size());
}

The core of this is the AOTIModelContainerRunner classes (AOTIModelContainerRunnerCpu / AOTIModelContainerRunnerCuda).

torch designed these runner classes for inductor to load and run the generated .so, which contains some of the execution steps of the computation graph.

The concrete examples are in pytorch/test/cpp/aot_inductor. There are two: one wraps AOTIModelContainerRunnerCuda in a torch::CustomClassHolder and runs it through torchscript; the other uses AOTIModelContainerRunnerCuda on its own and calls it directly from the C++ API. A rough sketch of the wrapping approach follows below.
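
As a sketch of that first, CustomClassHolder-based approach (my own illustration, not the test code itself; the include path is an assumption that changes between PyTorch versions, and the AOTIRunnerHolder name plus the "aoti"/"RunnerHolder" registration are made up for the example):

#include <torch/custom_class.h>
#include <torch/script.h>

#include <memory>
#include <string>
#include <vector>

// NOTE: assumption -- the runner header location differs across PyTorch versions;
// check pytorch/test/cpp/aot_inductor for the one matching your build.
#include <torch/csrc/inductor/aoti_model_container_runner_cuda.h>

// Wrap the AOT Inductor runner in a TorchScript custom class so the compiled
// .so can be driven from a torch::jit module as well as from plain C++.
struct AOTIRunnerHolder : torch::CustomClassHolder {
  explicit AOTIRunnerHolder(const std::string& so_path)
      : runner_(std::make_unique<torch::inductor::AOTIModelContainerRunnerCuda>(
            so_path.c_str())) {}

  std::vector<at::Tensor> run(std::vector<at::Tensor> inputs) {
    return runner_->run(inputs);
  }

 private:
  std::unique_ptr<torch::inductor::AOTIModelContainerRunnerCuda> runner_;
};

// Register the class; TorchScript code can then construct it and call run().
static auto registry =
    torch::class_<AOTIRunnerHolder>("aoti", "RunnerHolder")
        .def(torch::init<std::string>())
        .def("run", &AOTIRunnerHolder::run);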

For common ops such as the fully connected layer self.fc = torch.nn.Linear(64, 10), external operators can be called directly instead of codegen-ing them with triton. In the example above, the direct call is torch.ops.aten.addmm; more details are in pytorch/torch/_inductor/select_algorithm.py. A rough illustration follows below.
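
To make that concrete, here is a small sketch of my own (not code generated by inductor) showing what falling back to the external kernel amounts to for this Linear layer: an at::addmm call, i.e. x @ W^T + b. The random weight and bias are placeholders for the real parameters:

#include <ATen/ATen.h>

// Sketch: what "calling the external kernel" means for self.fc = nn.Linear(64, 10).
// Instead of codegen-ing a triton kernel, the wrapper invokes the existing ATen op.
at::Tensor linear_via_addmm(const at::Tensor& x) {
  at::Tensor weight = at::randn({10, 64}, x.options());  // stand-in for fc.weight
  at::Tensor bias = at::randn({10}, x.options());        // stand-in for fc.bias
  // addmm(bias, x, weight.t()) computes x.matmul(weight.t()) + bias,
  // which is exactly nn.Linear's forward.
  return at::addmm(bias, x, weight.t());
}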

Overall, this export approach makes sense: common ops call highly optimized existing kernels directly, previously unseen operators can be generated with triton, graph fusion can be done through fx passes, and C++ export is handled via AOT compilation, with further runtime design details to improve performance. The overall potential is large.

There are still many details I haven't had time to look at, such as how parallel ops in the model run on multiple streams, how dynamic shapes are handled, how intermediate variables are stored, and how GPU memory is managed. That will take more time to dig into.


References

[1] dynamo is a more powerful trace tool than torch.jit.trace and torch.jit.script: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html?highlight=torch%20compile

[2] torch.export: https://pytorch.org/docs/main/export.html

[3] cpp wrapper: https://pytorch.org/tutorials/prototype/inductor_cpp_wrapper_tutorial.html

[4] PyTorch 2.1 Contains New Performance Features for AI Developers: https://pytorch.org/blog/new-features-for-ai/

[5] depyf: https://github.com/thuml/depyf
