AI image generation has become very popular recently: you enter a text prompt and get a striking picture back.
For example, entering “photo of a gorgeous young woman in the style of stefan kostic and david la chapelle, coy, shy, alluring, evocative, stunning, award winning, realistic, sharp focus, 8 k high definition, 3 5 mm film photography, photo realistic, insanely detailed, intricate, elegant, art by stanley lau and artgerm” produces:
Entering “temple in ruines, forest, stairs, columns, cinematic, detailed, atmospheric, epic, concept art, Matte painting, background, mist, photo-realistic, concept art, volumetric light, cinematic epic + rule of thirds octane render, 8k, corona render, movie concept art, octane render, cinematic, trending on artstation, movie concept art, cinematic composition , ultra-detailed, realistic , hyper-realistic , volumetric lighting, 8k –ar 2:3 –test –uplight” produces:
The images above come from Stable Diffusion, a recently open-sourced model that produces very good results. Like me, many people probably want a customized model of their own, specialized for generating faces, anime, or something else.
Someone on GitHub has done exactly this by fine-tuning a Pokémon version of Stable Diffusion. Here is what his model produces for the input "robotic cat with wings":
Interesting, isn't it? This article shows how to quickly fine-tune Stable Diffusion.
The author's own detailed write-up is available at: https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning
1. Prepare data
Deep learning training starts with the data. Stable Diffusion is trained on matched text-image pairs, so we need to prepare our data in that form.
First, gather all your images. For most people images are easy to come by, but they carry no text annotations. The BLIP algorithm can generate the captions automatically.
BLIP project address: https://github.com/salesforce/BLIP
The effect is shown in the figure below:
BLIP automatically generates a description for Bulbasaur. The captions are not perfect, but they are good enough. If they are not good enough for you, you can write the captions yourself.
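A minimal sketch of the captioning loop, assuming BLIP is run through the Hugging Face `transformers` port (`BlipProcessor`, `BlipForConditionalGeneration`, checkpoint `Salesforce/blip-image-captioning-base`) rather than through the scripts in the BLIP repository linked above. The helper names `caption_folder` and `make_blip_captioner` are made up for illustration. The captioner is passed in as a plain function, so it can be swapped for hand-written labels.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def caption_folder(image_dir, captioner):
    """Return {file name: caption} for every image directly inside image_dir."""
    captions = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            captions[path.name] = captioner(path)
    return captions


def make_blip_captioner(model_name="Salesforce/blip-image-captioning-base"):
    """Build a captioner backed by BLIP (assumes transformers, torch and Pillow)."""
    # Imported lazily so caption_folder works without these packages installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    def captioner(path):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(output_ids[0], skip_special_tokens=True)

    return captioner
```

With the packages installed, `caption_folder("images/", make_blip_captioner())` would return a dictionary mapping each file name to its generated caption.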
Save the captions, keyed by image file name, in JSON format:
{
"0001.jpg": "This is a young woman with a broad forehead.",
"0002.jpg": "The young lady has a melon seed face and her chin is relatively narrow.",
"0003.jpg": "This is a melon seed face woman who has a broad chin.There is a young lady with a broad forehead."
}
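Before training, it can help to check that the JSON file and the image folder agree. The following is a stdlib-only sketch. `load_pairs` is a hypothetical helper, not part of the fine-tuning repository, and it only assumes the file layout shown above (a flat JSON object mapping file names to captions).

```python
import json
from pathlib import Path


def load_pairs(root_dir, caption_file):
    """Return a sorted list of (image path, caption) pairs, or raise on a mismatch."""
    root = Path(root_dir)
    with open(caption_file, encoding="utf-8") as f:
        captions = json.load(f)

    # Every captioned file name must exist in the image folder.
    missing = sorted(name for name in captions if not (root / name).is_file())
    if missing:
        raise FileNotFoundError(f"captioned but not on disk: {missing}")

    # An empty caption would give the model no text to condition on.
    empty = sorted(name for name, text in captions.items() if not str(text).strip())
    if empty:
        raise ValueError(f"empty captions: {empty}")

    return [(root / name, captions[name]) for name in sorted(captions)]
```

Calling `load_pairs("images/", "captions.json")` either returns the pairs or fails early with the offending file names, which is cheaper than discovering a missing image partway through training.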
2. Download the code and model
Here we use the Stable Diffusion code modified by that author, which makes fine-tuning more convenient.
Fine-tuning code: https://github.com/justinpinkney/stable-diffusion
Install the environment as described in the code's README, then download the pre-trained Stable Diffusion model sd-v1-4-full-ema.ckpt and place it in the project directory.
Model download link: CompVis/stable-diffusion-v-1-4-original · Hugging Face
3. Configuration and operation
Stable Diffusion configures training through YAML files. The YAML file the author provides expects a specific data format, which is a hassle, so here is a simpler and more convenient one. You only need to change two things: the path to the folder holding your images (root_dir) and the path to the JSON file of paired data generated in step 1 (caption_file). The full config follows:
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: "image"
    cond_stage_key: "txt"
    image_size: 64
    channels: 4
    cond_stage_trainable: false # Note: different from the one we trained before
    conditioning_key: crossattn
    scale_factor: 0.18215

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1. ]
        f_min: [ 1. ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: True
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      ckpt_path: "models/first_stage_models/kl-f8/model.ckpt"
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 4
    num_val_workers: 0 # Avoid a weird val dataloader issue
    train:
      target: ldm.data.simple.FolderData
      params:
        root_dir: 'path/to/your/image/folder/'
        caption_file: 'path/to/your/captions.json'
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 512
            interpolation: 3
        - target: torchvision.transforms.RandomCrop
          params:
            size: 512
        - target: torchvision.transforms.RandomHorizontalFlip
    validation:
      target: ldm.data.simple.TextOnly
      params:
        captions:
        - "your test prompt here"
        - "A frontal selfie of handsome caucasian guy with blond hair and blue eyes, with face in the center"
        output_size: 512
        n_gpus: 2 # small hack to make sure we see all our samples

lightning:
  find_unused_parameters: False

  modelcheckpoint:
    params:
      every_n_train_steps: 30000
      save_top_k: -1
      monitor: null

  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 30000
        max_images: 1
        increase_log_steps: False
        log_first_step: True
        log_all_val: True
        log_images_kwargs:
          use_ema_scope: True
          inpaint: False
          plot_progressive_rows: False
          plot_diffusion_rows: False
          N: 4
          unconditional_guidance_scale: 3.0
          unconditional_guidance_label: [""]

  trainer:
    benchmark: True
    num_sanity_val_steps: 0
    accumulate_grad_batches: 1
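The `image_size: 64` and `channels: 4` entries describe the latent that the UNet sees, not the 512-pixel training images. As a rough check, assume the autoencoder halves the resolution once between consecutive `ch_mult` levels. That is my reading of the kl-f8 encoder, not something this article states, and `latent_shape` is a hypothetical helper. Under that assumption the numbers line up:

```python
def latent_shape(image_px, ch_mult, z_channels):
    """Latent (channels, height, width) for a square image, under the assumption
    that the encoder downsamples by 2 between consecutive ch_mult levels."""
    factor = 2 ** (len(ch_mult) - 1)
    if image_px % factor:
        raise ValueError(f"{image_px} is not divisible by {factor}")
    side = image_px // factor
    return (z_channels, side, side)


# Values copied from the config above: 512 px crops, ch_mult [1, 2, 4, 4], z_channels 4.
print(latent_shape(512, [1, 2, 4, 4], 4))  # (4, 64, 64)
```

By the same arithmetic, a 768-pixel crop would give 96-pixel latents.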
As a final step, run the command:
python main.py -t --base path/to/your/config.yaml --gpus 0,1 --scale_lr False --num_nodes 1 --check_val_every_n_epoch 2 --finetune_from path/to/sd-v1-4-full-ema.ckpt
That's it; now just wait for training to finish. Note that I use two GPUs here. Stable Diffusion uses a lot of GPU memory, so even on V100s I could only train with a batch size of 1.
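To put numbers on this setup: with `batch_size: 1`, two GPUs and `accumulate_grad_batches: 1`, each optimizer step sees two images. The sketch below also re-implements the warm-up multiplier as I understand `LambdaLinearScheduler` for a single cycle (a linear ramp from `f_start` to `f_max`, then linear interpolation toward `f_min`). Treat that formula as an assumption to check against `ldm/lr_scheduler.py`, and the function names as hypothetical.

```python
def effective_batch_size(batch_size, n_gpus, accumulate_grad_batches=1):
    """Images contributing to one optimizer step."""
    return batch_size * n_gpus * accumulate_grad_batches


def lr_multiplier(step, warm_up_steps, cycle_length, f_start, f_max, f_min):
    """Assumed single-cycle schedule: linear warm-up, then linear decay to f_min."""
    if step < warm_up_steps:
        return f_start + (f_max - f_start) * step / warm_up_steps
    return f_min + (f_max - f_min) * (cycle_length - step) / cycle_length


base_lr = 1.0e-4  # base_learning_rate; left unscaled because of --scale_lr False
schedule = dict(warm_up_steps=1, cycle_length=10_000_000_000_000,
                f_start=1.0e-6, f_max=1.0, f_min=1.0)

print(effective_batch_size(batch_size=1, n_gpus=2))  # 2
for step in (0, 1, 1000):
    print(step, base_lr * lr_multiplier(step, **schedule))
```

Under these assumptions the multiplier reaches 1.0 after the first step and stays there, so the learning rate sits at 1.0e-4 for the rest of the run. The comment in the config suggests 10000 warm-up steps when training from scratch.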