Stable Diffusion is a latent diffusion model conditioned on (unpooled) text embeddings from the CLIP ViT-L/14 text encoder.
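To make the conditioning concrete, here is a minimal sketch (assuming the transformers package and the public openai/clip-vit-large-patch14 weights) of how the unpooled per-token embeddings can be produced:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Tokenizer and text encoder matching Stable Diffusion v1's conditioning model
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state is the *unpooled* sequence of token embeddings;
    # the diffusion model cross-attends to this, not to the pooled output.
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```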
1. Environment setup
Create and activate a conda environment named ldm:
conda env create -f environment.yaml
conda activate ldm
Alternatively, update an existing environment by running:
conda install pytorch torchvision -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
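After installation, a quick sanity check (a hypothetical snippet, not part of the repo) verifies that the key packages import and that a GPU is visible:

```python
import torch
import transformers
import diffusers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)  # expect 4.19.2 per the pin above
print("diffusers:", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())
```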
2. Model configuration
2.1 Stable Diffusion v1
Stable Diffusion v1 refers to a specific configuration of the model architecture: a downsampling-factor-8 autoencoder, an 860M-parameter UNet, and the CLIP ViT-L/14 text encoder for the diffusion model. The model was pretrained on 256x256 images and then fine-tuned on 512x512 images; with the factor-8 autoencoder, a 512x512 RGB image is encoded to a 64x64x4 latent.
Four checkpoints were released, differing in training data and number of steps:
1. sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en, followed by further steps at resolution 512x512 on a high-resolution LAION subset.
2. sd-v1-2.ckpt: resumed from sd-v1-1.ckpt; further steps at resolution 512x512 on laion-aesthetics v2 5+.
3. sd-v1-3.ckpt: resumed from sd-v1-2.ckpt; further steps on laion-aesthetics v2 5+, with 10% dropping of the text conditioning to improve classifier-free guidance sampling.
4. sd-v1-4.ckpt: resumed from sd-v1-2.ckpt; additional steps on laion-aesthetics v2 5+ with the same 10% text-conditioning dropout.
2.2 Running and testing generation
Once you have the stable-diffusion-v1-*-original weights, link them:
mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt
Alternatively, place the checkpoint directly under models/ldm/stable-diffusion-v1/ as model.ckpt.
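For reference, checkpoint loading is handled inside the sampling scripts; the following is a rough sketch of what scripts/txt2img.py does internally (the config path and strict=False mirror the repo, but treat this as an illustration rather than the exact code):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the model from the inference config, then load the checkpoint weights
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)

state_dict = torch.load("models/ldm/stable-diffusion-v1/model.ckpt",
                        map_location="cpu")["state_dict"]
model.load_state_dict(state_dict, strict=False)

model = model.to("cuda").eval()  # move to GPU, disable training-time behavior
```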
Test text-to-image generation:
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
By default, the script generates 512x512 images (the resolution the model was fine-tuned on) using 50 sampling steps.
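Since diffusers was installed in step 1, the same model family can also be sampled through its pipeline API. This is a sketch assuming the hosted CompVis/stable-diffusion-v1-4 weights and a recent diffusers release (the exact API has shifted across versions):

```python
from diffusers import StableDiffusionPipeline

# Download the v1-4 weights from the Hugging Face Hub and move to GPU
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("astronaut_rides_horse.png")
```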
Test image-to-image translation:
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8
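Here strength (between 0.0 and 1.0) controls how much noise is added to the init image before denoising: at 0.8, roughly 80% of the sampling steps run, so outputs can deviate substantially from the input. For reference, a diffusers img2img sketch under the same assumptions as above (note that the init-image keyword has changed name across diffusers releases):

```python
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")

# Load the init image (placeholder path) and resize to the training resolution
init_image = Image.open("init.jpg").convert("RGB").resize((512, 512))

prompt = "A fantasy landscape, trending on artstation"
# strength=0.8: add heavy noise, then run ~80% of the denoising steps
image = pipe(prompt, image=init_image, strength=0.8).images[0]
image.save("fantasy_landscape.png")
```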