Convert Hugging Face model to LibTorch model

The Hugging Face model

Taking the waifu-diffusion model as an example, the reference implementation is based on the diffusers library. The sample code is as follows:

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    'hakurei/waifu-diffusion',
    torch_dtype=torch.float32
).to('cuda')

prompt = "1girl, aqua eyes, baseball cap, blonde hair, closed mouth, earrings, green background, hat, hoop earrings, jewelry, looking at viewer, shirt, short hair, simple background, solo, upper body, yellow shirt"
with autocast("cuda"):
    image = pipe(prompt, guidance_scale=6)["sample"][0]  
    
image.save("test.png")

The pretrained model is downloaded over the network and then loaded directly. It is in fact cached locally, but the layout is not very easy to read: because the model is large, it is split into several smaller files for download, and as we will see below, the model is really composed of several sub-models, so the handful of relatively large files of similar size should correspond to those sub-models, such as the unet and the vae.

After downloading, you can simply print(pipe) and see:

StableDiffusionPipeline {
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.11.0",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "requires_safety_checker": true,
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

Sure enough, it is a series of sub-models plus some less important parameters. The whole pipeline can be saved directly as a .pth file and read back with torch.load("pipe.pth"), but it cannot be traced that way.
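A minimal sketch of this failing attempt (assuming the pipe object from above; saving the full pipeline with torch.save and the file name pipe.pth are just for illustration):

import torch

# save the whole pipeline object and load it back
torch.save(pipe, "pipe.pth")
model = torch.load("pipe.pth")

# try to trace the loaded pipeline object directly
example = torch.rand(1, 4, 64, 64)
traced_script_module = torch.jit.trace(model, example)

Running the trace call raises: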

Traceback (most recent call last):
  File "/home/gaoyi/example-app/test.py", line 59, in <module>
    traced_script_module = torch.jit.trace(model, example)
  File "/home/gaoyi/anaconda3/lib/python3.9/site-packages/torch/jit/_trace.py", line 803, in trace
    name = _qualified_name(func)
  File "/home/gaoyi/anaconda3/lib/python3.9/site-packages/torch/_jit_internal.py", line 1125, in _qualified_name
    raise RuntimeError("Could not get name of python class object")
RuntimeError: Could not get name of python class object

This is because the loaded pipeline is not a plain model class (an nn.Module), so it cannot be converted directly with torch.jit.trace. Let's change the approach and convert the sub-models instead.

Model conversion
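Before tracing anything, the sub-models can be pulled out of the pipeline and saved individually. A minimal sketch, assuming the pipe object from the first listing is still around (the file name pipe-unet.pth matches the one loaded further down; the other names are just for illustration):

import torch

# each sub-model of the pipeline is an ordinary nn.Module and can be saved on its own
torch.save(pipe.unet, "pipe-unet.pth")
torch.save(pipe.vae, "pipe-vae.pth")
torch.save(pipe.text_encoder, "pipe-text_encoder.pth")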

Printing print(pipe.unet) shows that this unet is an ordinary network with a bunch of familiar layers:

UNet2DConditionModel(
  (conv_in): Conv2d(4, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (time_proj): Timesteps()
  (time_embedding): TimestepEmbedding(
    (linear_1): Linear(in_features=320, out_features=1280, bias=True)
    (act): SiLU()
    (linear_2): Linear(in_features=1280, out_features=1280, bias=True)
  )
  (down_blocks): ModuleList(
    (0): CrossAttnDownBlock2D(
      (attentions): ModuleList(
        (0): Transformer2DModel(
          (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
          (proj_in): Linear(in_features=320, out_features=320, bias=True)
          (transformer_blocks): ModuleList(
            (0): BasicTransformerBlock(
              (attn1): CrossAttention(
                (to_q): Linear(in_features=320, out_features=320, bias=False)
                (to_k): Linear(in_features=320, out_features=320, bias=False)
                (to_v): Linear(in_features=320, out_features=320, bias=False)
                (to_out): ModuleList(
                  (0): Linear(in_features=320, out_features=320, bias=True)
                  (1): Dropout(p=0.0, inplace=False)
                )
              )
              (ff): FeedForward(
                (net): ModuleList(
                  (0): GEGLU(
                    (proj): Linear(in_features=320, out_features=2560, bias=True)
                  )
                  (1): Dropout(p=0.0, inplace=False)
                  (2): Linear(in_features=1280, out_features=320, bias=True)
                )
              )
              (attn2): CrossAttention(
                (to_q): Linear(in_features=320, out_features=320, bias=False)
                (to_k): Linear(in_features=1024, out_features=320, bias=False)
                (to_v): Linear(in_features=1024, out_features=320, bias=False)
                (to_out): ModuleList(
                  (0): Linear(in_features=320, out_features=320, bias=True)
                  (1): Dropout(p=0.0, inplace=False)
                )
              )
              (norm1): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
              (norm2): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
              (norm3): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
            )
          )
          (proj_out): Linear(in_features=320, out_features=320, bias=True)
        )
        (1): Transformer2DModel(
       
        ...
        ... (omitted)
        ...
        
  (conv_norm_out): GroupNorm(32, 320, eps=1e-05, affine=True)
  (conv_act): SiLU()
  (conv_out): Conv2d(320, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)

OK, so we can convert this sub-model into the LibTorch model we need, but we do not yet know what inputs the model requires. The printed information tells us the model's name, UNet2DConditionModel, so we can look it up in the official Hugging Face documentation for UNet2DConditionModel.

The query shows that the forward pass takes three main inputs: sample (the noisy latents), timestep, and encoder_hidden_states (the output of the text encoder).

Their concrete shapes are still unknown, though; at this point you can check them with print(model.config):

FrozenDict([('sample_size', 64), ('in_channels', 4), ('out_channels', 4), ('center_input_sample', False), 
('flip_sin_to_cos', True), ('freq_shift', 0), ('down_block_types', ['CrossAttnDownBlock2D', 
'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']), ('mid_block_type', 
'UNetMidBlock2DCrossAttn'), ('up_block_types', ['UpBlock2D', 'CrossAttnUpBlock2D', 
'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D']), ('only_cross_attention', False), 
('block_out_channels', [320, 640, 1280, 1280]), ('layers_per_block', 2), ('downsample_padding', 1), 
('mid_block_scale_factor', 1), ('act_fn', 'silu'), ('norm_num_groups', 32), ('norm_eps', 1e-05), 
('cross_attention_dim', 1024), ('attention_head_dim', [5, 10, 20, 20]), ('dual_cross_attention', False), 
('use_linear_projection', True), ('class_embed_type', None), ('num_class_embeds', None), 
('upcast_attention', False), ('resnet_time_scale_shift', 'default'), ('_class_name', 'UNet2DConditionModel'), 
('_diffusers_version', '0.10.2'), ('_name_or_path', 
'/home/gaoyi/.cache/huggingface/diffusers/models--hakurei--waifu-diffusion/snapshots/55fd50bfae0dd8bcc4bd3a6f25cb167580b972a0/unet')])

It is a large dictionary; the entries we need are ('sample_size', 64), ('in_channels', 4) and ('out_channels', 4), which tell us how to build the example inputs. At this point our .py file is as follows:

import torch

# load the previously saved unet sub-model
model = torch.load("pipe-unet.pth")

# print(model.config)
# print(model)

# example inputs guessed from the config
example = torch.rand(1, 4, 64, 64)
timestep = torch.rand(1)
encoder_hidden_states = torch.rand(1, 4, 64, 64)

traced_script_module = torch.jit.trace(model, (example, timestep, encoder_hidden_states))
traced_script_module.save("pipe-unet.pt")

But an error is reported, roughly: mat1 and mat2 cannot be multiplied, shapes 256x64 and 1024x320; I will not paste the full message. Since a matrix shape is wrong, one of the input shapes must be changed. The shape I had guessed for encoder_hidden_states looked right but is not: the config shows cross_attention_dim is 1024, so the cross-attention layers expect encoder_hidden_states whose last dimension is 1024, not a 64x64 feature map. The forward call here only takes these three positional arguments, so for the test simply change it to encoder_hidden_states = torch.rand(1, 4, 1024), as in the sketch below.
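A rough mapping from the config entries to the trace inputs (my own summary; the sequence length of 4 in encoder_hidden_states is arbitrary and only used to make the trace run):

# sample:                (batch, in_channels, sample_size, sample_size) -> (1, 4, 64, 64)
# timestep:              a scalar or (batch,) tensor                    -> (1,)
# encoder_hidden_states: (batch, seq_len, cross_attention_dim)          -> (1, 4, 1024)
example = torch.rand(1, 4, 64, 64)
timestep = torch.rand(1)
encoder_hidden_states = torch.rand(1, 4, 1024)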

The problem after that is not about the inputs but about the model's output, which is a dict, as follows:

RuntimeError: Encountering a dict at the output of the tracer might cause the trace to be incorrect, 
this is only valid if the container structure does not change based on the module's inputs. 
Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, 
use a `NamedTuple` instead). If you absolutely need this and know the side effects, 
pass strict=False to trace() to allow this behavior.

As the message says, the parameter strict=False needs to be passed to trace() during conversion. The adjusted code is as follows:

import torch

# load the previously saved unet sub-model
model = torch.load("pipe-unet.pth")

# print(model.config)
# print(model)

# example inputs with the corrected encoder_hidden_states shape
example = torch.rand(1, 4, 64, 64)
timestep = torch.rand(1)
encoder_hidden_states = torch.rand(1, 4, 1024)

# strict=False permits the dict output; save the traced module for LibTorch
traced_script_module = torch.jit.trace(model, (example, timestep, encoder_hidden_states), strict=False)
traced_script_module.save("pipe-unet.pt")

Saved successfully!
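Before handing the file to C++, the traced module can also be given a quick sanity check from Python (a sketch, reusing the same example inputs as above):

import torch

# load the traced module back and run it once with the example inputs
loaded = torch.jit.load("pipe-unet.pt")

example = torch.rand(1, 4, 64, 64)
timestep = torch.rand(1)
encoder_hidden_states = torch.rand(1, 4, 1024)

with torch.no_grad():
    out = loaded(example, timestep, encoder_hidden_states)
print(type(out))  # with strict=False the output keeps its dict structure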

Model testing

Following the LibTorch tutorial on the PyTorch official website, write the corresponding C++ file, compile it with CMake to produce the example-app executable, and run:

./example-app ../pipe-unet.pt

The program prints ok, so the conversion was successful!


Origin blog.csdn.net/qq_45510888/article/details/129496151