[pai-diffusion] PAI's EasyNLP diffusion model training

Reference: "PAI-Diffusion model is here! The Alibaba Cloud machine learning team takes you to explore the ocean of Chinese art", by Wang Chengyu, Duan Zhongjie, Zhu Xiangru and Huang Jun: https://zhuanlan.zhihu.com/p/590020134

PAI-Diffusion keeps the same architecture as SD 1.5, and training uses the generic train_text_to_image.py script from diffusers; only the loaded weights differ. It is not actually necessary to use the CLIP implementation in EasyNLP for training: the CLIP text model can also be obtained by training ChineseCLIP (for example, the ChineseCLIP model in transformers). In most cases, however, the CLIP model does not need to be retrained at all; an already trained Chinese CLIP can directly replace the text_encoder weights in SD, and the diffusion model is then trained on top of it. For Chinese diffusion training with diffusers, you only need to swap the tokenizer for a BertTokenizer (or load it via the CLIP model). In other words, replacing the CLIP module is all that is needed to retrain a Chinese diffusion model.
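
A minimal sketch of that idea, assuming the PAI checkpoint is available under the hub id alibaba-pai/pai-diffusion-general-large-zh (the org prefix is an assumption; a local path works the same way): the BertTokenizer and the Chinese text encoder are loaded from their subfolders and passed into a standard StableDiffusionPipeline.

import torch
from transformers import BertTokenizer, CLIPTextModel
from diffusers import StableDiffusionPipeline

model_path = "alibaba-pai/pai-diffusion-general-large-zh"  # assumed hub id / local path

# The Chinese text encoder and BertTokenizer replace SD 1.5's CLIPTokenizer/CLIPTextModel;
# everything else (unet, vae, scheduler) keeps the SD 1.5 architecture.
tokenizer = BertTokenizer.from_pretrained(model_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")

pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    tokenizer=tokenizer,
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("雾凇挂满枝头, 冬日暖阳", num_inference_steps=25).images[0]
image.save("sample.png")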

PAI-Diffusion was pre-trained for 20 days on 20 million Chinese image-text pairs from the Wukong dataset, and was subsequently fine-tuned on multiple downstream datasets.

Training: diffusers -> train_text_to_image_lora.py

Generic diffusers

Analyzing the weights of pai-diffusion-general-large-zh:

feature_extractor and safety_checker do not affect training or inference and can be included or omitted (see the sketch below).
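
For example, the pipeline can be loaded without them; a sketch, assuming the same checkpoint id as above:

from diffusers import StableDiffusionPipeline

# safety_checker / feature_extractor are optional pipeline components; dropping them
# skips downloading and running the CLIP-based NSFW filter.
pipe = StableDiffusionPipeline.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",  # assumed hub id / local path
    safety_checker=None,
    feature_extractor=None,
    requires_safety_checker=False,
)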

scheduler->scheduler_config.json

{
  "_class_name": "DPMSolverMultistepScheduler",
  "_diffusers_version": "0.15.0.dev0",
  "algorithm_type": "dpmsolver++",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear", # how beta is scheduled; scaled_linear is a scaled linear schedule
  "beta_start": 0.00085,
  "clip_sample": false,
  "dynamic_thresholding_ratio": 0.995,
  "lower_order_final": true,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "sample_max_value": 1.0,
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "solver_order": 2,
  "solver_type": "midpoint",
  "steps_offset": 1,
  "thresholding": false,
  "trained_betas": null
}

There is no difference from normal SD.

noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")

diffusers.schedulers.scheduling_utils.SchedulerMixin.from_pretrained()
diffusers.schedulers.scheduling_ddpm.DDPMScheduler->ConfigMixin->load_config
DDPMScheduler.from_config->
model = cls(**init_dict)  -> parameters are initialized from the config dict
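
In other words, load_config reads scheduler_config.json and from_config instantiates the scheduler class from that dict. A sketch of the two typical uses, assuming the same checkpoint id as above: a DDPMScheduler for adding noise during training, and the shipped DPMSolverMultistepScheduler for sampling.

from diffusers import DDPMScheduler, DPMSolverMultistepScheduler

model_path = "alibaba-pai/pai-diffusion-general-large-zh"  # assumed hub id / local path

# Training: train_text_to_image.py builds a DDPMScheduler from the same config file.
noise_scheduler = DDPMScheduler.from_pretrained(model_path, subfolder="scheduler")
print(noise_scheduler.config.num_train_timesteps)  # 1000

# Inference: the checkpoint's own scheduler class is DPMSolverMultistepScheduler.
sampler = DPMSolverMultistepScheduler.from_pretrained(model_path, subfolder="scheduler")
print(sampler.config.algorithm_type)  # "dpmsolver++"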

text_encoder->config.json

{
  "_name_or_path": "models/sdm1.4_with_ChTextEncoder/text_encoder",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu", # activation function
  "hidden_size": 768, # dimensionality of the encoder layers and the pooler layer
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072, # dimensionality of the feed-forward layer in the transformer encoder
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 32, # maximum sequence length the model handles
  "model_type": "clip_text_model",
  "num_attention_heads": 12, # number of attention heads in each encoder layer
  "num_hidden_layers": 12, # number of hidden layers in the transformer encoder
  "pad_token_id": 1,
  "projection_dim": 512,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 21128  # vocabulary size of the CLIP text model
}

Note that the Chinese vocabulary size is 21128, while SD 1.5's English CLIP vocabulary size is 49408 (and max_position_embeddings here is 32 rather than SD 1.5's 77).

text_encoder = CLIPTextModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="text_encoder",...)

transformers.modeling_utils.PreTrainedModel.from_pretrained()
transformers.models.clip.configuration_clip.CLIPTextConfig->
transformers.models.clip.modeling_clip.CLIPTextModel.forward->
transformers.models.clip.modeling_clip.CLIPTextTransformer.forward->
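
A sketch that loads the text encoder and checks the config values listed above (checkpoint id assumed as before):

from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",  # assumed hub id / local path
    subfolder="text_encoder",
)

cfg = text_encoder.config
print(cfg.vocab_size)               # 21128, BERT-style Chinese vocabulary
print(cfg.max_position_embeddings)  # 32, matches the tokenizer's model_max_length
print(cfg.hidden_size)              # 768, matches the unet's cross_attention_dim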

tokenizer->special_tokens_map.json/tokenizer_config.json/vocab.txt

{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 32,
  "name_or_path": "models/release_20230316/512/tokenizer",
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")->

transformers.tokenization_utils_base.PreTrainedTokenizerBase.__call__->
transformers.models.bert.tokenization_bert.BertTokenizer
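
In practice this means prompts are tokenized by a plain BertTokenizer and padded or truncated to model_max_length=32; a sketch (checkpoint id assumed as before):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",  # assumed hub id / local path
    subfolder="tokenizer",
)

batch = tokenizer(
    "水墨山水画",
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 32
    truncation=True,
    return_tensors="pt",
)
print(batch.input_ids.shape)  # torch.Size([1, 32])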

unet->config.json

{
  "_class_name": "UNet2DConditionModel",
  "_diffusers_version": "0.14.0.dev0",
  "_name_or_path": "models/20230321_512_openjourney/checkpoint-30000/unet_ema",
  "act_fn": "silu",
  "attention_head_dim": 8,
  "block_out_channels": [
    320,
    640,
    1280,
    1280
  ],
  "center_input_sample": false,
  "class_embed_type": null,
  "conv_in_kernel": 3,
  "conv_out_kernel": 3,
  "cross_attention_dim": 768,
  "decay": 0.9999,
  "down_block_types": [
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "DownBlock2D"
  ],
  "downsample_padding": 1,
  "dual_cross_attention": false,
  "flip_sin_to_cos": true,
  "freq_shift": 0,
  "in_channels": 4,
  "inv_gamma": 1.0,
  "layers_per_block": 2,
  "mid_block_scale_factor": 1,
  "mid_block_type": "UNetMidBlock2DCrossAttn",
  "min_decay": 0.0,
  "norm_eps": 1e-05,
  "norm_num_groups": 32,
  "num_class_embeds": null,
  "only_cross_attention": false,
  "optimization_step": 30000,
  "out_channels": 4,
  "power": 0.6666666666666666,
  "projection_class_embeddings_input_dim": null,
  "resnet_time_scale_shift": "default",
  "sample_size": 64,
  "time_cond_proj_dim": null,
  "time_embedding_type": "positional",
  "timestep_post_act": null,
  "up_block_types": [
    "UpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D"
  ],
  "upcast_attention": false,
  "update_after_step": 0,
  "use_ema_warmup": false,
  "use_linear_projection": false
}
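
The UNet itself is a standard UNet2DConditionModel; the only coupling to the Chinese text encoder is cross_attention_dim=768, which has to equal the text encoder's hidden_size. A small sketch (checkpoint id assumed as before):

import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",  # assumed hub id / local path
    subfolder="unet",
)

latents = torch.randn(1, 4, 64, 64)   # in_channels=4, sample_size=64
timestep = torch.tensor([10])
text_emb = torch.randn(1, 32, 768)    # seq_len=32, cross_attention_dim=768
noise_pred = unet(latents, timestep, encoder_hidden_states=text_emb).sample
print(noise_pred.shape)               # torch.Size([1, 4, 64, 64])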

vae->config.json

{
  "_class_name": "AutoencoderKL",
  "_diffusers_version": "0.14.0.dev0",
  "_name_or_path": "models/release_20230316/512/vae",
  "act_fn": "silu",
  "block_out_channels": [
    128,
    256,
    512,
    512
  ],
  "down_block_types": [
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D"
  ],
  "in_channels": 3,
  "latent_channels": 4,
  "layers_per_block": 2,
  "norm_num_groups": 32,
  "out_channels": 3,
  "sample_size": 512,
  "scaling_factor": 0.18215,
  "up_block_types": [
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D"
  ]
}
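
The VAE is likewise a standard AutoencoderKL with the usual SD scaling_factor of 0.18215: a 512x512 RGB image encodes to 4-channel latents at 1/8 resolution. A sketch (checkpoint id assumed as before):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "alibaba-pai/pai-diffusion-general-large-zh",  # assumed hub id / local path
    subfolder="vae",
)

image = torch.randn(1, 3, 512, 512)  # in_channels=3, sample_size=512
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)                 # torch.Size([1, 4, 64, 64])
decoded = vae.decode(latents / vae.config.scaling_factor).sample
print(decoded.shape)                 # torch.Size([1, 3, 512, 512])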

Source: blog.csdn.net/u012193416/article/details/133145321