Private custom AI painting: a quick tutorial on finetuning Stable Diffusion

AI image generation has become very popular recently: you only need to type a text prompt to get amazing pictures.

For example, entering “photo of a gorgeous young woman in the style of stefan kostic and david la chapelle, coy, shy, alluring, evocative, stunning, award winning, realistic, sharp focus, 8 k high definition, 3 5 mm film photography, photo realistic, insanely detailed, intricate, elegant, art by stanley lau and artgerm” gives:

Entering “temple in ruines, forest, stairs, columns, cinematic, detailed, atmospheric, epic, concept art, Matte painting, background, mist, photo-realistic, concept art, volumetric light, cinematic epic + rule of thirds octane render, 8k, corona render, movie concept art, octane render, cinematic, trending on artstation, movie concept art, cinematic composition , ultra-detailed, realistic , hyper-realistic , volumetric lighting, 8k –ar 2:3 –test –uplight” gives:

The images above come from Stable Diffusion, a recently open-sourced model with very impressive results. Many people, myself included, probably want a customized model of their own, specialized for generating faces, anime, or other specific content.

Someone on GitHub has actually done this: he finetuned a Pokémon version of Stable Diffusion. Here is what his model produces. Entering "robotic cat with wings" gives:

Isn't that interesting? Today's article explains how to quickly finetune Stable Diffusion.

The author's detailed write-up is available at: https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning

1. Prepare data

In deep learning, the first problem to solve is always the data. Since Stable Diffusion is trained on matched text-image pairs, we need to prepare our data in that form.

Gather all of your images. For most people, collecting images is easy, but the images on hand usually have no text annotations. Fortunately, we can use the BLIP model to generate captions automatically.

BLIP project address: https://github.com/salesforce/BLIP

The effect is shown in the figure below:

BLIP automatically generates a description for Bulbasaur. Of course, the captions are far from perfect, but they are good enough; if you are not satisfied with a caption, you can write it yourself.
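As a rough illustration, here is a minimal captioning sketch using the Hugging Face transformers port of BLIP rather than the original repo linked above (the checkpoint name and helper below are my own example, not from the original post):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a BLIP image-captioning checkpoint (base variant).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path):
    # Generate a short English description for a single image.
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

print(caption_image("0001.jpg"))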

Save the resulting captions, keyed by image file name, in JSON format:

{
    "0001.jpg": "This is a young woman with a broad forehead.",
    "0002.jpg": "The young lady has a melon seed face and her chin is relatively narrow.",
    "0003.jpg": "This is a melon seed face woman who has a broad chin.There is a young lady with a broad forehead."
}
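Assuming a helper like the caption_image function sketched above (the folder layout and file names here are placeholders of my own), the JSON file can be produced with a short loop:

import json
from pathlib import Path

image_dir = Path("path/to/your/image/folder")  # placeholder: your image folder

captions = {}
for img_path in sorted(image_dir.glob("*.jpg")):
    # Key by file name; value is the generated (or hand-written) caption.
    # caption_image is the BLIP helper from the sketch above.
    captions[img_path.name] = caption_image(str(img_path))

with open("captions.json", "w", encoding="utf-8") as f:
    json.dump(captions, f, ensure_ascii=False, indent=4)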
  

2. Download the code model

Here we use the author's modified Stable Diffusion code, which makes finetuning more convenient.

finetune code address: https://github.com/justinpinkney/stable-diffusion

Set up the environment following the requirements in the repository's README. Also download the Stable Diffusion pre-trained checkpoint sd-v1-4-full-ema.ckpt and place it in the project directory.

Model download link: CompVis/stable-diffusion-v-1-4-original on Hugging Face (https://huggingface.co/CompVis/stable-diffusion-v-1-4-original)
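If you prefer downloading the checkpoint from a script, one option (my own suggestion, assuming you have accepted the model license on the Hugging Face page and logged in with an access token) is the huggingface_hub library:

from huggingface_hub import hf_hub_download

# Downloads sd-v1-4-full-ema.ckpt into the local HF cache and returns its path.
ckpt_path = hf_hub_download(
    repo_id="CompVis/stable-diffusion-v-1-4-original",
    filename="sd-v1-4-full-ema.ckpt",
)
print(ckpt_path)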

3. Configuration and operation

Stable Diffusion uses YAML files to configure training. The YAML provided by the author expects a specific dataset format, which is rather troublesome, so I give a simpler and more convenient one here. You only need to change the path to the folder containing your images and the path to the JSON caption file generated in step 1. The places to change are marked in the config below:

model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "image"
    cond_stage_key: "txt"
    image_size: 64
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: crossattn
    scale_factor: 0.18215

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: True
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      ckpt_path: "models/first_stage_models/kl-f8/model.ckpt"
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder


data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 4
    num_val_workers: 0 # Avoid a weird val dataloader issue
    train:
      target: ldm.data.simple.FolderData
      params:
        root_dir: 'path/to/your/image/folder/'
        caption_file: 'path/to/your/captions.json'
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 512
            interpolation: 3
        - target: torchvision.transforms.RandomCrop
          params:
            size: 512
        - target: torchvision.transforms.RandomHorizontalFlip
    validation:
      target: ldm.data.simple.TextOnly
      params:
        captions:
        - "测试时候用的prompt"
        - "A frontal selfie of handsome caucasian guy with blond hair and blue eyes, with face in the center"

        output_size: 512
        n_gpus: 2 # small hack to sure we see all our samples


lightning:
  find_unused_parameters: False

  modelcheckpoint:
    params:
      every_n_train_steps: 30000
      save_top_k: -1
      monitor: null

  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 30000
        max_images: 1
        increase_log_steps: False
        log_first_step: True
        log_all_val: True
        log_images_kwargs:
          use_ema_scope: True
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 4
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    num_sanity_val_steps: 0
    accumulate_grad_batches: 1

As a final step, run the command:

 python main.py --base path/to/your_config.yaml --gpus 0,1 --scale_lr False --num_nodes 1 --check_val_every_n_epoch 2 --finetune_from path/to/sd-v1-4-full-ema.ckpt

That's it; now just wait for the model to train. Note that I enabled two GPUs here, and Stable Diffusion is quite memory-hungry: on a V100 I could only train with a batch size of 1.
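Once training finishes, the checkpoints are written under the logs directory. As a rough sketch of how you might sample from the finetuned checkpoint (this is my own addition, not part of the original write-up; it assumes a recent diffusers version that supports from_single_file, and the checkpoint path is a placeholder):

import torch
from diffusers import StableDiffusionPipeline

# Load the finetuned checkpoint produced by main.py (placeholder path).
pipe = StableDiffusionPipeline.from_single_file(
    "logs/your_run/checkpoints/last.ckpt", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("robotic cat with wings").images[0]
image.save("sample.png")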
