AIGC: Text-to-Image Model Stable Diffusion

1 Introduction to Stable Diffusion

Stable Diffusion is a text-to-image model jointly developed by CompVis, Stability AI, and LAION. It was trained on a large number of 512x512 images from a subset of the LAION-5B dataset. We only need to input a short piece of text, and Stable Diffusion can quickly convert it into an image; it can also take existing images (or video frames) together with text and transform them.

The release of Stable Diffusion is a milestone in the development of AI image generation: it made a high-performance model available to the public. The generated image quality is high, generation is fast, and the resource and memory requirements are comparatively low. A generated image is shown below:

Stable Diffusion Demo: demo

1.1 Composition of Stable Diffusion

Stable Diffusion is not a monolithic model; it consists of several components and models.

  • Text understanding component: converts the input text into a numerical representation that captures the ideas expressed in the text.
  • Image generator: produces the image in two stages, an image information creator and an image decoder.

The image information creator runs for multiple steps to generate the image information; this is the "steps" parameter in Stable Diffusion interfaces and libraries, which usually defaults to 50 or 100. The image information creator works entirely in the latent (image information) space, which makes it faster than diffusion models that work in pixel space.
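The snippet below is a minimal sketch of this parameter, assuming the Hugging Face diffusers library and the CompVis/stable-diffusion-v1-4 weights (both are assumptions for illustration, not part of the original CompVis setup); num_inference_steps is the "steps" parameter discussed above.

from diffusers import StableDiffusionPipeline

# Load the full pipeline; the model id is an assumption for illustration.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# The denoising runs once per step in latent space; 50 steps is a common default.
image = pipe("a photograph of an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")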

The image decoder draws the picture from the information produced by the image information creator, and it runs only once, at the end, to generate the final image.

The figure above is a flowchart of Stable Diffusion. It contains the components described above, each of which is implemented by its own neural network:

  • Text understanding component: CLIP Text, a text encoder. It takes 77 tokens as input and outputs 77 token embedding vectors, each with 768 dimensions.
  • Image information creator: UNet + scheduler, which processes the diffusion information step by step in latent space. It takes a text embedding vector and a starting multidimensional noise array as input and outputs a processed information array.
  • Image decoder: the decoder of an autoencoder, which draws the final image from the processed information array. It takes the processed 4 × 64 × 64 information array as input and outputs a 3 × 512 × 512 image (a shape-checking sketch follows this list).
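As a rough illustration of these shapes, the sketch below (assuming the diffusers/transformers APIs and the CompVis/stable-diffusion-v1-4 weights, which are not part of the original text) inspects the tensors produced by each component.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# 1) Text understanding: CLIP tokenizer + text encoder -> 77 tokens, 768-dim embeddings
tokens = pipe.tokenizer("a photograph of an astronaut riding a horse",
                        padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]
print(text_embeddings.shape)   # torch.Size([1, 77, 768])

# 2) Image information creator: the UNet denoises a 4 x 64 x 64 latent array
latents = torch.randn(1, pipe.unet.in_channels, 64, 64)
print(latents.shape)           # torch.Size([1, 4, 64, 64])

# 3) Image decoder: the VAE decoder turns the latents into a 3 x 512 x 512 image
with torch.no_grad():
    image = pipe.vae.decode(latents / 0.18215).sample
print(image.shape)             # torch.Size([1, 3, 512, 512])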

1.2 What is Diffusion

Above we described the functionality of the "image information creator" component, which takes as input a text embedding vector and a starting multidimensional array of noise, and outputs an array of information that the image decoder uses to draw the final image. Diffusion is the process that takes place inside the pink "image information creator" component in the figure below.

 

Diffusion is a gradual process in which each step adds more relevant information. It happens over multiple steps: each step operates on an input latents array and produces another latents array that better matches the input text and the visual knowledge the model acquired from its training images. The figure below feeds the latents array produced at each step into the image decoder and visualizes what information each step adds. The diffusion shown below runs for 50 iterations; as the number of iterations grows, the image decoded from the latents array becomes clearer and clearer.
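The same visualization can be sketched in code. This is a hedged sketch against the diffusers 0.12.x callback API (an assumption; the signature may differ in other releases): a callback decodes the intermediate latents every few steps so you can watch the image sharpen.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

def save_intermediate(step: int, timestep: int, latents: torch.FloatTensor):
    # Decode the current 4x64x64 latents into a 3x512x512 image.
    with torch.no_grad():
        image = pipe.vae.decode(latents / 0.18215).sample
    image = (image / 2 + 0.5).clamp(0, 1)   # map from [-1, 1] to [0, 1]
    # save or display `image` here to see the picture sharpen step by step

pipe("a photograph of an astronaut riding a horse",
     num_inference_steps=50,
     callback=save_intermediate,
     callback_steps=10)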

1.3 How Diffusion works 

The diffusion model's approach to image generation builds on the powerful computer vision models already available in industry: as long as the dataset is large enough, the model can learn very complex logic.

Take a photo, randomly sample some noise, and add that noise to the image; the pair of noisy image and added noise forms one training sample. Repeating this process produces a large number of training samples, which are then used to train the noise predictor (a UNet). After training, the result is a high-performance noise predictor that can create images when run in a specific configuration.
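A conceptual PyTorch sketch of such a training step is shown below. It illustrates the idea only, not the actual Stable Diffusion training code, and the unet, scheduler, and optimizer objects are assumed to be supplied by the caller.

import torch
import torch.nn.functional as F

def training_step(unet, scheduler, clean_latents, text_embeddings, optimizer):
    # Pick a random timestep (noise level) for each sample in the batch.
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (clean_latents.shape[0],), device=clean_latents.device)
    noise = torch.randn_like(clean_latents)
    # Forward diffusion: create the noisy training sample.
    noisy_latents = scheduler.add_noise(clean_latents, noise, timesteps)
    # The noise predictor (UNet) tries to recover the noise that was added.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()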

1.4 Denoise and draw image

With the noise training set constructed above, a noise predictor can be trained to predict the noise contained in an image. Subtracting the predicted noise from the noisy image yields an image that is as close as possible to the model's training samples. Closeness here means closeness in distribution: for example, the sky is usually blue and humans have two eyes. The style of the generated images therefore tends toward the style of the training samples.

 

1.5 Add text information to the image generator

The diffusion process described above does not yet involve any text, but the input to the image generator includes both the text embedding vector and the starting multidimensional noise array, so the noise predictor is adjusted to fit (conditioned on) the text. Training on a large amount of data in this way yields an image generator; the chosen text encoder plus this trained image generator form the complete Stable Diffusion model. Given a descriptive sentence, the full Stable Diffusion model can generate the corresponding painting.
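A hedged sketch of this conditioning, written against a diffusers 0.12.x-style API (the exact scheduler methods and model id are assumptions), is shown below: the text embeddings are passed to the UNet at every step, and the scheduler subtracts the predicted noise from the latents. The real pipeline additionally applies classifier-free guidance (the --scale parameter of txt2img.py), which is omitted here for brevity.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
prompt = "a photograph of an astronaut riding a horse"

# Text understanding: prompt -> 77 x 768 embedding matrix
tokens = pipe.tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = pipe.text_encoder(tokens.input_ids)[0]

# Image information creator: iterative denoising in latent space
pipe.scheduler.set_timesteps(50)
latents = torch.randn(1, pipe.unet.in_channels, 64, 64) * pipe.scheduler.init_noise_sigma
for t in pipe.scheduler.timesteps:
    latent_input = pipe.scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = pipe.unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    # Subtract the predicted noise (scheduler-weighted) from the latents.
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Image decoder: latents -> 3 x 512 x 512 image
with torch.no_grad():
    image = pipe.vae.decode(latents / 0.18215).sample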

 

2 Build the running environment

2.1 conda environment installation

For conda environment preparation, see: Anaconda

2.2 Operating environment preparation

git clone https://github.com/CompVis/stable-diffusion.git

cd stable-diffusion

conda env create -f environment.yaml

conda activate ldm

pip install diffusers==0.12.1

2.3 Model download

(1) Download the model file "sd-v1-4.ckpt"

Model address: model

After the download completes, execute the following commands:

mkdir -p models/ldm/stable-diffusion-v1/

mv sd-v1-4.ckpt model.ckpt

mv model.ckpt models/ldm/stable-diffusion-v1/

(2) Download the checkpoint_liberty_with_aug.pth model

Model address: model

After the download is complete, move the model into the cache folder:

mv checkpoint_liberty_with_aug.pth ~/.cache/torch/hub/checkpoints/

(3) Download the clip-vit-large-patch14 model

Model address: model

The model files that need to be downloaded are as follows:

 Create a storage directory for the model

mkdir -p openai/clip-vit-large-patch14

After the download is complete, move the downloaded file to the above directory.

(4) Download the safety_checker model

Model address: model

The model files that need to be downloaded are as follows:

Create a storage directory for model files

mkdir -p CompVis/stable-diffusion-safety-checker

After the download is complete, move the downloaded file to the above directory

Move the preprocessor_config.json in (3) to the current model directory:

mv openai/clip-vit-large-patch14/preprocessor_config.json CompVis/stable-diffusion-safety-checker/

3 Running the model

3.1 Running text-to-image (txt2img)

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 

Example output:

txt2img.py parameters

usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
                  [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
                  [--seed SEED] [--precision {full,autocast}]

optional arguments:
  -h, --help            show this help message and exit
  --prompt [PROMPT]     the prompt to render
  --outdir [OUTDIR]     dir to write results to
  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
  --skip_save           do not save individual samples. For speed measurements.
  --ddim_steps DDIM_STEPS
                        number of ddim sampling steps
  --plms                use plms sampling
  --laion400m           uses the LAION400M model
  --fixed_code          if enabled, uses the same starting code across samples
  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling
  --n_iter N_ITER       sample this often
  --H H                 image height, in pixel space
  --W W                 image width, in pixel space
  --C C                 latent channels
  --f F                 downsampling factor
  --n_samples N_SAMPLES
                        how many samples to produce for each given prompt. A.k.a. batch size
  --n_rows N_ROWS       rows in the grid (default: n_samples)
  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
  --from-file FROM_FILE
                        if specified, load prompts from this file
  --config CONFIG       path to config which constructs model
  --ckpt CKPT           path to checkpoint of model
  --seed SEED           the seed (for reproducible sampling)
  --precision {full,autocast}
                        evaluate at this precision

3.2 Running image-to-image (img2img)

Execute the command as follows:

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img assets/stable-samples/img2img/mountains-1.png --strength 0.8
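A roughly equivalent call through the diffusers img2img pipeline is sketched below (hedged: the parameter names follow the 0.12.x API and may differ in other releases; the model id is an assumption).

from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
init_image = Image.open("assets/stable-samples/img2img/mountains-1.png").convert("RGB")

result = pipe(prompt="A fantasy landscape, trending on artstation",
              image=init_image,
              strength=0.8)   # how far the diffusion may move away from the input image
result.images[0].save("fantasy_landscape.png")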

4 Troubleshooting

4.1 SAFE_WEIGHTS_NAME import error

When running txt2img.py, the following error occurs:

(ldm) [root@localhost stable-diffusion]# python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 
Traceback (most recent call last):
  File "scripts/txt2img.py", line 22, in <module>
    from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/__init__.py", line 29, in <module>
    from .pipelines import OnnxRuntimeModel
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/__init__.py", line 19, in <module>
    from .dance_diffusion import DanceDiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/__init__.py", line 1, in <module>
    from .pipeline_dance_diffusion import DanceDiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py", line 21, in <module>
    from ..pipeline_utils import AudioPipelineOutput, DiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 67, in <module>
    from transformers.utils import SAFE_WEIGHTS_NAME as TRANSFORMERS_SAFE_WEIGHTS_NAME
ImportError: cannot import name 'SAFE_WEIGHTS_NAME' from 'transformers.utils' (/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/__init__.py)

Fix this by pinning the diffusers package to a compatible version:

pip install diffusers==0.12.1
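A quick check (a hedged sketch) confirms that the pinned diffusers release imports cleanly against the installed transformers, since the error comes from a version mismatch between the two packages.

import diffusers, transformers
print("diffusers:", diffusers.__version__)        # expect 0.12.1 after the pin
print("transformers:", transformers.__version__)

# The symbol that the newer diffusers release failed to import:
try:
    from transformers.utils import SAFE_WEIGHTS_NAME
    print("SAFE_WEIGHTS_NAME:", SAFE_WEIGHTS_NAME)
except ImportError:
    print("SAFE_WEIGHTS_NAME not available: keep diffusers pinned to a version that does not need it")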

4.2 Unable to connect to huggingface.co

 python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 
Traceback (most recent call last):
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 403, in get_feature_extractor_dict
    resolved_feature_extractor_file = cached_path(
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
    output_path = get_from_cache(
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/txt2img.py", line 28, in <module>
    safety_feature_extractor = AutoFeatureExtractor.from_pretrained(safety_model_id)
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/auto/feature_extraction_auto.py", line 270, in from_pretrained
    config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 436, in get_feature_extractor_dict
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like CompVis/stable-diffusion-safety-checker is not the path to a directory containing a preprocessor_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Solution:

Download the models locally, following the process described in section 2.3.
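A small sanity check (a hedged sketch): with the models from section 2.3 placed in the local directories, both components should load from disk without contacting huggingface.co. The relative paths assume the script is run from the stable-diffusion repository root.

from transformers import AutoFeatureExtractor, CLIPTokenizer

# Local directories created in section 2.3 take precedence over the hub lookup.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
feature_extractor = AutoFeatureExtractor.from_pretrained("CompVis/stable-diffusion-safety-checker")
print(type(tokenizer).__name__, type(feature_extractor).__name__)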
