Fine-tuning the Stable Diffusion Model on Intel CPUs

The ability of diffusion models to generate realistic images from text prompts has helped popularize generative artificial intelligence. People have begun to use these models in several application domains, including data synthesis and content creation. The Hugging Face Hub contains more than 5,000 pre-trained text-to-image models. These models, combined with the Diffusers library, make it incredibly easy to build image generation workflows or to experiment with different ones.

Like transformer models, diffusion models can be fine-tuned to generate content that better suits your specific business needs. At first, fine-tuning was only possible on GPUs, but things are changing! A few months ago, Intel launched its fourth-generation Xeon CPUs, code-named Sapphire Rapids. Sapphire Rapids includes the Intel Advanced Matrix Extensions (AMX), a new hardware accelerator for deep learning workloads. In several previous blog posts, we have shown the benefits of AMX: fine-tuning NLP Transformers models, inference with NLP Transformers models, and inference with Stable Diffusion models.

This article shows how to fine-tune a Stable Diffusion model on a cluster of Intel 4th generation Xeon CPUs. We use the Textual Inversion technique, which needs only a small number of training samples to fine-tune a model effectively. In this article, 5 samples will be plenty!

Let's get started!

Configure the cluster

Our friends at Intel provided us with 4 servers hosted on the Intel Developer Cloud (IDC). As a cloud service platform, IDC provides a deployment environment deeply optimized by Intel, combining the latest Intel processors with performance-optimized software stacks, so users can easily develop and run their workloads on it.

Each server we got was equipped with two Intel 4th generation Xeon CPUs, each with 56 physical cores and 112 threads. Here is its lscpu output:

Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Address sizes: 52 bits physical, 57 bits virtual
  Byte Order: Little Endian
CPU(s): 224
  On-line CPU(s) list: 0-223
Vendor ID: GenuineIntel
  Model name: Intel(R) Xeon(R) Platinum 8480+
    CPU family: 6
    Model: 143
    Thread(s) per core: 2
    Core(s) per socket: 56
    Socket(s): 2
    Stepping: 8
    CPU max MHz: 3800.0000
    CPU min MHz: 800.0000
    BogoMIPS: 4000.00
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

We write the IP addresses of the four servers into the nodefile file, where the first line is the master node.

cat << EOF > nodefile
192.168.20.2
192.168.21.2
192.168.22.2
192.168.23.2
EOF

Distributed training requires passwordless ssh communication between the master node and the other nodes. If you are not familiar with this, you can refer to this article and follow it step by step to set up passwordless ssh.
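
For reference, here is a minimal sketch of that setup, run from the master node. It assumes the devcloud user and the node IP addresses from the nodefile above; adapt both to your environment.

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
for ip in 192.168.21.2 192.168.22.2 192.168.23.2; do
    ssh-copy-id devcloud@$ip
done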

Next, we set up the environment and install the required software on each node. In particular, we installed two Intel-optimized libraries: oneCCL, which manages distributed communication, and the Intel Extension for PyTorch (IPEX), which includes software optimizations that take full advantage of the hardware acceleration in Sapphire Rapids. We also installed libtcmalloc, a high-performance memory allocation library that ships with gperftools.

conda create -n diffuser python==3.9
conda activate diffuser
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install transformers accelerate==0.19.0
pip3 install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu
pip3 install intel_extension_for_pytorch
conda install gperftools -c conda-forge -y
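
Optionally, a quick check on each node confirms that PyTorch and IPEX import cleanly (this is just a sanity check, not something the fine-tuning script requires):

python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"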

Next, we clone the diffusers repository on each node and install it from source.

git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install .

Next, we use IPEX to optimize the fine-tuning script in diffusers/examples/textual_inversion. IPEX's inference optimizations for the pipeline's sub-models cannot be applied from within the diffusers library, so they have to be added in the script itself (fine-tuning of the CLIP text encoder is handled by accelerate). We import IPEX and optimize the U-Net and variational autoencoder (VAE) models for inference. Finally, don't forget that this change has to be made to the code on every node.

diff --git a/examples/textual_inversion/textual_inversion.py b/examples/textual_inversion/textual_inversion.py
index 4a193abc..91c2edd1 100644
--- a/examples/textual_inversion/textual_inversion.py
+++ b/examples/textual_inversion/textual_inversion.py
@@ -765,6 +765,10 @@ def main():
     unet.to(accelerator.device, dtype=weight_dtype)
     vae.to(accelerator.device, dtype=weight_dtype)

+ import intel_extension_for_pytorch as ipex
+ unet = ipex.optimize(unet, dtype=weight_dtype)
+ vae = ipex.optimize(vae, dtype=weight_dtype)
+
     # We need to recalculate our total training steps as the size of the training dataloader may have changed.
     num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if overrode_max_train_steps:
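
Since the patch has to be present on every node, one way to propagate it is to copy the modified script from the master node to the others. The loop below is a simple sketch; it assumes diffusers was cloned in the home directory on all nodes and reuses the nodefile from earlier.

for ip in $(tail -n +2 nodefile); do
    scp diffusers/examples/textual_inversion/textual_inversion.py \
        $ip:diffusers/examples/textual_inversion/
done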

The final step is to download the training images. Normally we would use a shared NFS folder, but for simplicity we chose to download the images on each node. Make sure the training image directory has the same path (/home/devcloud/dicoo) on all nodes.

mkdir /home/devcloud/dicoo
cd /home/devcloud/dicoo
wget https://huggingface.co/sd-concepts-library/dicoo/resolve/main/concept_images/0.jpeg
wget https://huggingface.co/sd-concepts-library/dicoo/resolve/main/concept_images/1.jpeg
wget https://huggingface.co/sd-concepts-library/dicoo/resolve/main/concept_images/2.jpeg
wget https://huggingface.co/sd-concepts-library/dicoo/resolve/main/concept_images/3.jpeg
wget https://huggingface.co/sd-concepts-library/dicoo/resolve/main/concept_images/4.jpeg

The training images we use are shown below:

(Images: the five dicoo training images)

At this point, the system configuration is complete. Next, we start configuring the training task.

Configure the fine-tuning environment

We use the accelerate library to make distributed training easier. We need to run accelerate config on each node and answer a few simple questions.

Below is a screenshot from the master node. On the other nodes, you need to set the rank to 1, 2, and 3 respectively, and keep the other answers the same.

(Screenshot: accelerate config answers on the master node)
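
For reference, the configuration file that accelerate config generates on the master node looks roughly like the sketch below. It is typically written to ~/.cache/huggingface/accelerate/default_config.yaml; field names can vary slightly between accelerate versions, so treat this as an illustration rather than something to copy verbatim.

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
machine_rank: 0                  # 1, 2 and 3 on the other nodes
main_process_ip: 192.168.20.2    # IP address of the master node
main_process_port: 29500
main_training_function: main
mixed_precision: 'no'            # bf16 is passed on the command line later
num_machines: 4
num_processes: 16
use_cpu: true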

Finally, we need to set some environment variables on the master node. They are propagated to the other nodes when the fine-tuning job starts. The first line sets the name of the network interface connected to the local network that all nodes share. You may need to use ifconfig to find the right interface name for your setup.
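
For example, you can list the interfaces and their IPv4 addresses and pick the one on the cluster's private network (ip comes from the iproute2 package; ifconfig works just as well):

ip -o -4 addr show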

export I_MPI_HYDRA_IFACE=ens786f1
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export CCL_ATL_TRANSPORT=ofi
export CCL_WORKER_COUNT=1

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="/home/devcloud/dicoo"

Ok, now we can start fine-tuning.

Fine-tuning the model

We launch fine-tuning with mpirun, which automatically sets up distributed communication among the nodes listed in nodefile. Here, we run 16 processes in total (-n), with 4 processes per node (-ppn). The accelerate library automatically handles distributed training across all processes.

We launch the command below to train for 200 steps, which takes only about 5 minutes.

mpirun -f nodefile -n 16 -ppn 4                                                         \
accelerate launch diffusers/examples/textual_inversion/textual_inversion.py \
--pretrained_model_name_or_path=$MODEL_NAME --train_data_dir=$DATA_DIR \
--learnable_property="object" --placeholder_token="<dicoo>" --initializer_token="toy" \
--resolution=512 --train_batch_size=1 --seed=7 --gradient_accumulation_steps=1 \
--max_train_steps=200 --learning_rate=2.0e-03 --scale_lr --lr_scheduler="constant" \
--lr_warmup_steps=0 --output_dir=./textual_inversion_output --mixed_precision bf16 \
--save_as_full_pipeline

The screenshot below shows the state of the cluster during training:

(Screenshot: the cluster nodes during training)

Troubleshooting

Distributed training can sometimes be tricky, especially if you're new to it. Small misconfigurations on a single node are the most likely problems: missing dependencies, images stored in different locations, etc.

You can log in to each node and train locally to quickly locate problems. First, set up the same environment as the master node, then run:

python diffusers/examples/textual_inversion/textual_inversion.py \
--pretrained_model_name_or_path=$MODEL_NAME --train_data_dir=$DATA_DIR \
--learnable_property="object" --placeholder_token="<dicoo>" --initializer_token="toy" \
--resolution=512 --train_batch_size=1 --seed=7 --gradient_accumulation_steps=1 \
--max_train_steps=200 --learning_rate=2.0e-03 --scale_lr --lr_scheduler="constant" \
--lr_warmup_steps=0 --output_dir=./textual_inversion_output --mixed_precision bf16 \
--save_as_full_pipeline

If training starts successfully, stop it and move on to the next node. If training starts successfully on all nodes, go back to the master node and carefully check the nodefile, the environment variables, and the mpirun command. Don't worry, you'll find the problem eventually :).
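
Another quick sanity check from the master node is to confirm that every node is reachable over passwordless ssh and has the training images in place (a sketch that reuses the nodefile and image directory from earlier):

for ip in $(cat nodefile); do
    ssh $ip "hostname && ls /home/devcloud/dicoo | wc -l"
done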

Generate images using the fine-tuned model

After 5 minutes of training, the fine-tuned model is saved locally. We could load it with a regular diffusers pipeline and generate images right away. But here, we use Optimum Intel and OpenVINO to further optimize the model for inference. As discussed in a previous article, these optimizations let you generate an image in less than 5 seconds on a single CPU!

pip install optimum[openvino]

We use the following code to load the model, optimize it for a fixed output shape, and finally save the optimized model:

from optimum.intel.openvino import OVStableDiffusionPipeline

model_id = "./textual_inversion_output"

ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
ov_pipe.reshape(batch_size=5, height=512, width=512, num_images_per_prompt=1)
ov_pipe.save_pretrained("./textual_inversion_output_ov")

Then, we load the optimized model, generate 5 different images and save them:

from optimum.intel.openvino import OVStableDiffusionPipeline

model_id = "./textual_inversion_output_ov"

ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, num_inference_steps=20)
prompt = ["a yellow <dicoo> robot at the beach, high quality"]*5
images = ov_pipe(prompt).images
print(images)
for idx,img in enumerate(images):
    img.save(f"image{idx}.png")

Below is one of the generated images. Amazingly, it only took five training images for the model to learn that a dicoo wears glasses!

(Generated image: a yellow dicoo robot at the beach)

You can also fine-tune the model for longer to get better results. Below is an image generated by a model fine-tuned for 3,000 steps (about an hour), and it works quite well.

(Generated image from the model fine-tuned for 3,000 steps)

Summary

Thanks to the deep collaboration between Hugging Face and Intel, you can now use Xeon CPU servers to generate high-quality images adapted to your business needs. CPUs are generally cheaper and more readily available than specialized hardware such as GPUs, and Xeon CPUs are versatile generalists that can easily handle other production tasks such as web servers and databases. This makes CPUs a logical choice for a full-featured and flexible IT infrastructure.

Here are some resources to help you get started; use them as needed:

  • Diffusers Documentation

  • Optimum Intel Documentation

  • Intel IPEX on GitHub

  • Developer Resources for Intel and Hugging Face

  • 4th generation Xeon CPU instances on IDC, AWS, GCP, and Alibaba Cloud

If you have any questions or feedback, feel free to leave a message on the Hugging Face forum.

Thanks for reading!


Original English: https://hf.co/blog/stable-diffusion-finetuning-intel

Author: Julien Simon

Translator: Matrix Yao (Yao Weifeng), a deep learning engineer at Intel, working on the application of transformer-family models to data of various modalities and on the training and inference of large-scale models.

Proofreading/Typesetting: zhongdongy (阿东)
