1 Introduction to Stable Diffusion
Stable Diffusion is a text-to-image model jointly developed by CompVis, Stability AI, and LAION. It was trained on a large number of 512x512 images from a subset of LAION-5B. Given a short piece of text, Stable Diffusion quickly converts it into an image; it can also take an existing picture or video as input and modify it according to the text.
The release of Stable Diffusion is a milestone in the development of AI image generation: it makes a high-performance model available to the public. The generated images are of high quality, generation is fast, and the compute and memory requirements are low. A generated image is shown below:
Stable Diffusion Demo:demo
1.1 Composition of Stable Diffusion
Stable Diffusion is not a monolithic model; it consists of several components and models.
- Text understanding component: converts the text into a numerical representation that captures the ideas in the text.
- Image generator: works in two stages, the image information creator and the image decoder.
The image information creator runs for multiple steps to generate the image information. The number of steps is the steps parameter in Stable Diffusion interfaces and libraries, and it usually defaults to 50 or 100. The image information creator works entirely in the image information space (latent space), which makes it faster than diffusion models that work in pixel space.
The image decoder draws the picture from the information produced by the image information creator. It runs only once, at the end, to produce the final image.
The figure above is a flowchart of Stable Diffusion. It includes the three components described above, each of which has a corresponding neural network.
- Text understanding component: ClipText, a text encoder. It takes 77 tokens as input and outputs 77 token embedding vectors, each with 768 dimensions.
- Image information creator: UNet + Scheduler, which processes the information step by step in latent space. It takes the text embedding vectors and a starting multidimensional array of noise as input, and outputs a processed information array.
- Image decoder: an autoencoder decoder, which uses the processed information array to draw the final image. It takes the processed information array of dimension 4 × 64 × 64 as input and outputs an image of dimension 3 × 512 × 512.
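The shapes listed above can be tied together with a little arithmetic. The sketch below is only an illustration: latent_shape is a hypothetical helper, and the downsampling factor of 8 is an assumption inferred from 512 / 64 (it corresponds to the --f option in the txt2img.py help shown in section 3.1).

```python
# Illustrative shape bookkeeping for the three components.
# latent_shape is a hypothetical helper, not part of any library.

def latent_shape(height=512, width=512, channels=4, factor=8):
    """Return the (C, H / f, W / f) shape of the latents array.

    factor=8 is an assumption inferred from 512 / 64.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (channels, height // factor, width // factor)


text_embedding_shape = (77, 768)   # 77 tokens, 768 dimensions each
latents = latent_shape()           # array handed to the image decoder
image_shape = (3, 512, 512)        # RGB image produced by the decoder

print(text_embedding_shape, latents, image_shape)
```

Because the UNet works on the 4 × 64 × 64 array rather than the 3 × 512 × 512 image, each diffusion step handles far fewer values, which is where the speed advantage of latent space comes from.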
1.2 What is Diffusion
Above we described the "image information creator" component, which takes a text embedding vector and a starting multidimensional array of noise as input, and outputs an array of information that the image decoder uses to draw the final image. Diffusion is the process that takes place inside the pink "Image Information Creator" component in the picture below.
Diffusion is a gradual process, and each step adds more relevant information. Each step operates on an input latents array and produces another latents array that better matches the input text and the visual information the model acquired from the images it was trained on. The figure below feeds the latents array generated at each step into the image decoder to visualize what information each step adds. The diffusion in the figure below was run for 50 iterations. As the number of iterations increases, the image decoded from the latents array becomes clearer and clearer.
1.3 How Diffusion works
Diffusion models generate images by building on the powerful computer vision models that already exist in the industry: as long as the data set is large enough, such a model can learn complex operations.
Suppose we have a photo. We generate some random noise, pick a random amount of that noise, and add it to the image; this forms one training sample. A large number of training samples can be generated the same way to form a training set, which is then used to train the noise predictor (UNet). After training, the result is a high-performance noise predictor that creates images when run in a specific configuration.
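The construction of a training sample can be sketched in a few lines. This is a toy illustration, not the actual training code: the image is a short list of numbers, make_training_sample is a hypothetical helper, and the noise amount is drawn uniformly, whereas a real setup works on latents and follows a noise schedule.

```python
import random


def make_training_sample(image, rng, max_level=1.0):
    """Build one training example: (noisy image, noise amount) -> noise.

    Hypothetical helper for illustration only.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]   # random noise image
    level = rng.uniform(0.0, max_level)            # random noise amount
    noisy = [p + level * n for p, n in zip(image, noise)]
    return {"input": (noisy, level), "target": noise}


rng = random.Random(0)
image = [0.1, 0.5, 0.9, 0.3]   # tiny stand-in for a photo
dataset = [make_training_sample(image, rng) for _ in range(1000)]
print(len(dataset))
```

The noise predictor is trained to map each "input" to its "target", that is, to recover the noise that was added.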
1.4 Denoise and draw image
Using the noise training set constructed above, a noise predictor can be trained that, given a noisy image, predicts the noise it contains. If we subtract this predicted noise from the image, we get an image as close as possible to the images the model was trained on. This closeness is closeness in distribution: for example, the sky is usually blue and humans have two eyes. The style of the generated images tends to follow the styles present in the training samples.
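The idea of repeatedly subtracting predicted noise can be illustrated with a toy loop. Everything here is an assumption made for the sake of the example: oracle_noise_predictor stands in for the trained UNet and cheats by knowing the clean target, and the simple "remove an equal share per step" rule is not the scheduler Stable Diffusion actually uses.

```python
import random


def oracle_noise_predictor(latents, target):
    """Stand-in for the trained UNet: returns the noise still present.

    A real predictor is a neural network that never sees the target.
    """
    return [x - t for x, t in zip(latents, target)]


def denoise(latents, target, steps=50):
    """Subtract a share of the predicted noise at every step."""
    history = []
    for step in range(steps):
        predicted = oracle_noise_predictor(latents, target)
        remaining = steps - step
        latents = [x - n / remaining for x, n in zip(latents, predicted)]
        # Track how far the current latents are from the clean target.
        history.append(max(abs(x - t) for x, t in zip(latents, target)))
    return latents, history


rng = random.Random(0)
target = [0.2, 0.8, 0.5]                        # pretend "clean" image
start = [rng.gauss(0.0, 1.0) for _ in target]   # pure noise
result, errors = denoise(start, target)
print(errors[0], errors[-1])
```

The error shrinks with every step, which mirrors how the decoded image becomes clearer as the number of iterations grows.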
1.5 Add text information to the image generator
The image generation described so far does not use any text. However, the input to the image generator includes both the text embedding vector and the starting multidimensional array of noise, so the noise predictor is adapted to take the text into account. Training on a large amount of data then yields an image generator that is conditioned on text. The chosen text encoder together with the trained image generator forms the complete Stable Diffusion model: given a descriptive sentence, it generates a corresponding picture.
2 Build the running environment
2.1 conda environment installation
For conda environment preparation, see: Anaconda
2.2 Operating environment preparation
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
conda env create -f environment.yaml
conda activate ldm
pip install diffusers==0.12.1
2.3 Model download
(1) Download the model file "sd-v1-4.ckpt"
Model address: model
After the download completes, run the following commands:
mkdir -p models/ldm/stable-diffusion-v1/
mv sd-v1-4.ckpt model.ckpt
mv model.ckpt models/ldm/stable-diffusion-v1/
(2) Download the checkpoint_liberty_with_aug.pth model
Model address: model
After the download is complete, the model is placed in the cache folder
mv checkpoint_liberty_with_aug.pth ~/.cache/torch/hub/checkpoints/
(3) Download the clip-vit-large-patch14 model
Model address: model
The model files that need to be downloaded are as follows:
Create a storage directory for the model
mkdir -p openai/clip-vit-large-patch14
After the download is complete, move the downloaded file to the above directory.
(4) Download the safety_checker model
Model address: model
The model files that need to be downloaded are as follows:
Create a storage directory for model files
mkdir -p CompVis/stable-diffusion-safety-checker
After the download is complete, move the downloaded file to the above directory
Move the preprocessor_config.json in (3) to the current model directory:
mv openai/clip-vit-large-patch14/preprocessor_config.json CompVis/stable-diffusion-safety-checker/
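Before running the scripts, it can help to confirm that every file ended up where the steps above put it. The following is a minimal sketch, assuming it is run from the stable-diffusion checkout; missing_paths and EXPECTED are hypothetical names, and the list only covers the paths mentioned in this section.

```python
from pathlib import Path

# Paths taken from steps (1) to (4), relative to the stable-diffusion checkout.
EXPECTED = [
    "models/ldm/stable-diffusion-v1/model.ckpt",
    "openai/clip-vit-large-patch14",
    "CompVis/stable-diffusion-safety-checker/preprocessor_config.json",
    "~/.cache/torch/hub/checkpoints/checkpoint_liberty_with_aug.pth",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected paths that do not exist (hypothetical helper)."""
    missing = []
    for entry in expected:
        path = Path(entry).expanduser()
        if not path.is_absolute():
            path = Path(root) / path   # relative entries live under root
        if not path.exists():
            missing.append(entry)
    return missing


if __name__ == "__main__":
    for entry in missing_paths("."):
        print("missing:", entry)
```

An empty output means all listed files and directories were found.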
3 Running results
3.1 Running text-to-image
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
Running effect display
txt2img.py parameters
usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
[--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
[--seed SEED] [--precision {full,autocast}]
optional arguments:
-h, --help show this help message and exit
--prompt [PROMPT] the prompt to render
--outdir [OUTDIR] dir to write results to
--skip_grid do not save a grid, only individual samples. Helpful when evaluating lots of samples
--skip_save do not save individual samples. For speed measurements.
--ddim_steps DDIM_STEPS
number of ddim sampling steps
--plms use plms sampling
--laion400m uses the LAION400M model
--fixed_code if enabled, uses the same starting code across samples
--ddim_eta DDIM_ETA ddim eta (eta=0.0 corresponds to deterministic sampling
--n_iter N_ITER sample this often
--H H image height, in pixel space
--W W image width, in pixel space
--C C latent channels
--f F downsampling factor
--n_samples N_SAMPLES
how many samples to produce for each given prompt. A.k.a. batch size
--n_rows N_ROWS rows in the grid (default: n_samples)
--scale SCALE unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
--from-file FROM_FILE
if specified, load prompts from this file
--config CONFIG path to config which constructs model
--ckpt CKPT path to checkpoint of model
--seed SEED the seed (for reproducible sampling)
--precision {full,autocast}
evaluate at this precision
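The formula in the --scale help text can be written out directly. The sketch below applies it to plain lists of numbers; guided_noise is a hypothetical name, the input values are made up, and in the real script the two predictions come from the UNet.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))"""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_empty = [0.0, 1.0]   # made-up prediction for the empty prompt
eps_text = [1.0, 3.0]    # made-up prediction for the actual prompt

print(guided_noise(eps_empty, eps_text, 1.0))   # same as eps_text
print(guided_noise(eps_empty, eps_text, 7.5))   # pushed further toward the prompt
```

With scale set to 1 the result is just the text-conditioned prediction; larger values exaggerate the difference between the conditioned and unconditioned predictions.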
3.2 Running image-to-image
Run the following command:
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img assets/stable-samples/img2img/mountains-1.png --strength 0.8
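The --strength value controls how much of the initial image is overwritten. As a rough sketch, and assuming the script derives the number of denoising steps as the integer part of strength times the number of sampling steps (an assumption about img2img.py, as is the step count of 50 used here), the relationship looks like this; encode_steps is a hypothetical name.

```python
def encode_steps(strength, ddim_steps=50):
    """Number of steps applied to the init image (assumed formula)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    return int(strength * ddim_steps)


print(encode_steps(0.8))   # steps used for --strength 0.8 with 50 sampling steps
```

A strength close to 0 keeps the result near the input picture, while a strength of 1.0 leaves little of it.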
4 Troubleshooting
4.1 Fixing the SAFE_WEIGHTS_NAME import error
Running txt2img produces the following error:
(ldm) [root@localhost stable-diffusion]# python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
Traceback (most recent call last):
File "scripts/txt2img.py", line 22, in <module>
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/__init__.py", line 29, in <module>
from .pipelines import OnnxRuntimeModel
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/__init__.py", line 19, in <module>
from .dance_diffusion import DanceDiffusionPipeline
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/__init__.py", line 1, in <module>
from .pipeline_dance_diffusion import DanceDiffusionPipeline
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py", line 21, in <module>
from ..pipeline_utils import AudioPipelineOutput, DiffusionPipeline
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 67, in <module>
from transformers.utils import SAFE_WEIGHTS_NAME as TRANSFORMERS_SAFE_WEIGHTS_NAME
ImportError: cannot import name 'SAFE_WEIGHTS_NAME' from 'transformers.utils' (/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/__init__.py)
This is solved by pinning the diffusers package to a compatible version:
pip install diffusers==0.12.1
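To confirm which versions are actually installed in the ldm environment, the standard library can report them. This is a small sketch; installed_version is a hypothetical helper, and it only reads package metadata without importing the packages.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("diffusers", "transformers"):
    print(name, installed_version(name))
```

After the pin above, diffusers should be reported as 0.12.1.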
4.2 Fixing the failure to connect to huggingface.co
Running txt2img produces the following error:
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
Traceback (most recent call last):
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 403, in get_feature_extractor_dict
resolved_feature_extractor_file = cached_path(
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
output_path = get_from_cache(
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scripts/txt2img.py", line 28, in <module>
safety_feature_extractor = AutoFeatureExtractor.from_pretrained(safety_model_id)
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/auto/feature_extraction_auto.py", line 270, in from_pretrained
config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 436, in get_feature_extractor_dict
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like CompVis/stable-diffusion-safety-checker is not the path to a directory containing a preprocessor_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Solution:
Download the models locally, as described in section 2.3.