How to write prompts for Stable Diffusion text-to-image generation

Stable Diffusion is a neural-network-based technology that generates images from input prompts. To obtain high-quality output, you need to choose appropriate prompt words and make sure they engage the model's imagination and creativity.

Here are some suggestions for writing prompt words:

  1. Determine the subject: First, determine the subject or content of the image you wish to generate. For example, you can choose natural scenery, abstract art, sci-fi scenes, and more. Clarifying the topic helps to guide the model to generate content related to the topic.
  2. Use clear and concise language: Try to describe your needs in simple, clear language. Avoid overly complex or vague vocabulary that can cause confusion.
  3. Provide enough detail: Provide enough detail to the model so that it can accurately understand your requirements. For example, if you want to generate a picture of a starry sky at night, you can describe the number, color, distribution, etc. of the stars.
  4. Leverage adjectives and adverbs: Using adjectives and adverbs can help the model understand your thoughts more specifically. For example, you could describe "bright star," "twinkling shooting star," etc.
  5. Avoid restrictive words: Try to avoid restrictive words such as "must" or "can only", which limit the model's imagination and creativity. Instead, use open-ended phrasing such as "Please imagine...".
  6. Use examples: To help the model better understand your needs, you can provide some actual examples or sample pictures as references. This makes it easier for the model to grab key points and generate images that match expectations.
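As an illustration of the suggestions above, here is a minimal sketch of assembling a prompt from a subject, concrete details, and descriptive adjectives. All subject matter and wording here are invented examples, not tags from any particular dataset:

```python
# A minimal sketch: assembling a prompt that follows the suggestions above.
# The subject, details, and descriptors are illustrative examples only.

def build_prompt(subject, details, descriptors):
    """Combine a subject, adjectives, and concrete details into one prompt."""
    parts = [subject] + descriptors + details
    return ", ".join(parts)

prompt = build_prompt(
    subject="starry night sky over a mountain lake",
    details=["thousands of stars", "faint purple nebula", "reflection on water"],
    descriptors=["bright", "highly detailed"],
)
print(prompt)
# starry night sky over a mountain lake, bright, highly detailed, thousands of stars, faint purple nebula, reflection on water
```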

text to image

Please first read the earlier [Basics of Parameter Adjustment] to understand the basic parameters provided by the SD-WebUI web application.

The following content is compiled from sources around the web.

how to write prompts

This is a general guide; most of it applies broadly, but there may be exceptions. Please read the corresponding chapters to understand the characteristics of different applications.

TIP

Prompt words are hints rather than hard criteria. For example, when you input quality words, you are actually narrowing the range of data being drawn on, rather than "requiring" the AI to produce a good picture.

word tags

For models trained specifically on tag words, it is recommended to use comma-separated tags as prompts.

Commonly used tags, such as well-known Danbooru tags, can be found at the source site of the dataset. The style of the tags should match the overall style of the image; otherwise, the result will contain mixed styles or noise.

Avoid typos. An NLP model might split misspelled words into letters for processing.
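As a toy illustration of why typos are costly, consider a greedy tokenizer with a small known vocabulary. This is not the real CLIP BPE tokenizer, just a simplified stand-in with an invented vocabulary, but it shows the same failure mode: an unknown (misspelled) word falls back to being split into much smaller pieces.

```python
# Toy illustration (NOT the real CLIP BPE): a tokenizer with a tiny known
# vocabulary falls back to single characters for unknown words, so a typo
# costs many more tokens than the correctly spelled word.

VOCAB = {"masterpiece", "best", "quality"}

def toy_tokenize(word):
    if word in VOCAB:
        return [word]        # known word -> one token
    return list(word)        # unknown word -> per-character fallback

print(toy_tokenize("masterpiece"))       # ['masterpiece']  (1 token)
print(len(toy_tokenize("masterpeice")))  # 11 tokens: the typo is split apart
```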

natural language

For models trained on natural language, it is recommended to use sentences that describe objects as prompts.

Depending on the dataset used for training, English, Japanese, special symbols or some Chinese can be used. English is more effective in most cases.

Avoid connectives such as "with" and complex syntax; most of the time the NLP model will only process them superficially.

Avoid accents (such as é and è) and German umlauts (such as ä and ö), which may not be mapped into the correct semantics.

It is not recommended to apply ready-made templates randomly, especially templates that cannot be understood by humans.

Kaomoji

For models trained on Danbooru data, you can use kaomoji (emoticons) to control the expressions in the image to a certain extent.

For example:

  • :-) smiling
  • :-( displeased
  • ;-) winking
  • :-D happy
  • :-P sticking out tongue
  • :-C sad
  • :-O surprised (open mouth)
  • :-/ doubtful

space

A small number of spaces before and after commas does not affect the actual result.

Extra spaces at the beginning and end are simply discarded. Extra spaces between words are also discarded.
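A sketch of this whitespace behavior, assuming a simple comma-separated prompt:

```python
# Sketch: extra whitespace at the ends of a prompt, around commas, and
# between words can be normalized away without changing what the model sees.

def normalize(prompt):
    parts = (" ".join(p.split()) for p in prompt.split(","))
    return ", ".join(p for p in parts if p)

print(normalize("  1girl ,  blue   sky,   smile  "))
# 1girl, blue sky, smile
```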

punctuation marks

Separating keywords with commas, periods, or even null characters (\0) can improve image quality. It is unclear which type of punctuation or which combination works best; when in doubt, do whatever makes the prompt easier to read.

For some models, it is recommended to convert underscores (_) to spaces.
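A one-line sketch of the underscore conversion for Danbooru-style tags:

```python
# Sketch: converting Danbooru-style underscore tags to spaces for models
# whose training captions use plain words.

def underscores_to_spaces(tag):
    return tag.replace("_", " ")

print(underscores_to_spaces("long_hair"))  # long hair
```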

artistic style word

You can create pictures with special effects or a specified style of painting by specifying style keywords.

movement and posture

Unless you have very specific requirements, choose prompt words that are associated with only a small range of poses.

Pose here refers to the physical configuration of something: the position and rotation of an image subject relative to the camera, the angles of human/robot joints, the way a jelly block is compressed, etc. The less variance in the things you are trying to specify, the easier it is for the model to learn.

Because movement by definition involves large changes in the subject's posture, cues associated with movement often result in distortions of the body, such as repeated limbs. Also, because human limbs, especially human hands and feet, have many joints, they can assume many different, complex poses. This makes their visualizations particularly difficult to learn, both for humans and neural networks.

In short: good images of humans standing/sitting are easy, good images of humans jumping/running are hard.

how to write

template

Think about what to draw first, such as subject, appearance, emotion, clothes, pose, background, etc., and then refer to the dataset label table (if available, such as Danbooru, Pixiv, etc.).

Then group similar prompt words together, using the English half-width comma (,) as a separator, and arrange the groups in order from most important to least important.

An example of a template is as follows:

(quality), (subject), (style), (action/scene), (artist), (filters)
  • (quality) represents the quality of the image. For example, combining low res with sticker draws on a wider slice of the dataset, while combining 1girl with high quality helps obtain high-quality images.
  • (subject) represents the subject of the image; it anchors the content and is a fundamental part of any prompt.
  • (style) is the image style; optional.
  • (action/scene) represents an action or scene, describing what the subject is doing and where.
  • (artist) represents the name of the artist or production company.
  • (filters) represents supplementary details. Artists, studios, camera terms, character names, styles, special effects, and more can be used.
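The template above can be sketched as a small helper that joins the groups in a fixed priority order. All tag values here are illustrative:

```python
# Sketch of the template above: groups ordered from most to least important,
# joined with half-width commas. All example tags are illustrative.

TEMPLATE_ORDER = ["quality", "subject", "style", "action_scene", "artist", "filters"]

def fill_template(**groups):
    parts = []
    for key in TEMPLATE_ORDER:
        parts.extend(groups.get(key, []))   # skip any group left empty
    return ", ".join(parts)

prompt = fill_template(
    quality=["high quality"],
    subject=["1girl"],
    action_scene=["reading a book in a library"],
    filters=["soft lighting", "bokeh"],
)
print(prompt)
# high quality, 1girl, reading a book in a library, soft lighting, bokeh
```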

capitalization

CLIP's tokenizer lowercases all words before tokenizing. Other models, like BERT and T5, treat capitalized words differently from non-capitalized words.

However, avoid special syntax in case it is interpreted with other semantics, e.g. AND.

lexical order

The model seems to rely on statistics in a way reminiscent of Bayes' theorem: when computing where tokens go, the first few words appear to anchor the distribution of the remaining tokens in the latent space.

Earlier tokens have more consistent positions, so it is easier for the neural network to predict their relevance. In Bayesian inference, the first token or piece of evidence is important because it sets the initial probability condition, while later elements merely modify that condition. So, at least in theory, later tokens should not have more influence than earlier ones.

But the way the parser understands things is opaque, so there is no way to know for sure whether lexical order has an "anchoring" effect.

prompt word length

Avoid long prompt words.

The order of prompt words reflects their priority: since the weight of prompt words decreases from front to back, words placed very late have little effect on the generated image.

It is a good habit not to pile up prompt words; but if you really have a lot of content to write, you can increase the number of generation steps appropriately so that the prompt words are used more fully during generation.

SD-WebUI breaks the 75-token limit by splitting the prompt into groups (roughly 20 + 55 tokens each). The option "Increase coherency by padding from the last comma within n tokens when using more than 75 tokens" makes the program mitigate the split by looking for the last comma within the last N tokens and, if one is found, moving everything after that comma into the next group. This strategy can alleviate the problem of having too many prompt words, but it may break the weight relationship between prompt words.

In addition to WebUI's special handling of this situation, the processing space for prompts is not unlimited: due to the limits of the underlying text model, most models handle about 75 tokens, and content beyond that is truncated.
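The comma-padding behavior described above can be sketched as follows. This is a simplified, word-level approximation of the actual token-level logic, and the chunk and window sizes are assumptions for illustration:

```python
# Sketch (simplified, word-level rather than token-level) of the WebUI option
# described above: when a fixed-size chunk would split a comma-separated
# phrase, move everything after the last comma within the final window of
# the chunk into the next chunk.

def chunk_prompt(tokens, chunk_size=75, pad_window=20):
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if len(current) == chunk_size:
            # look for the last comma within the final pad_window positions
            window_start = chunk_size - pad_window
            commas = [i for i, t in enumerate(current)
                      if t == "," and i >= window_start]
            if commas:
                cut = commas[-1] + 1      # keep the comma with the first chunk
                chunks.append(current[:cut])
                current = current[cut:]   # carry the tail into the next chunk
            else:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    return chunks

# 70 words, a comma, then a 10-word phrase: the phrase is kept together.
c = chunk_prompt(["a"] * 70 + [","] + ["b"] * 10)
print([len(chunk) for chunk in c])  # [71, 10]
```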

specificity

The problem manifests itself as semantic drift. For the training of neural networks, the quality of features matters: the stronger the connection between input and output, the easier it is for the neural network to learn that connection.

In other words, if a keyword has a very specific meaning, it is much easier to learn its association with an image than if a keyword has a very broad meaning.

This way, even a rarely used keyword like "Zettai Ryouiki" can produce very good results, because it is only used in very specific situations. On the other hand, "anime", despite being a relatively common word, does not yield great results, probably because it is used in many different contexts, even where no literal anime is involved.

Choosing specific keywords is especially important if you want to control the content of your images. Also: the less abstract your wording, the better. If possible, avoid wording that leaves room for interpretation or requires "understanding" of something that does not belong in the image. Even concepts like "big" or "small" are problematic, because they are indistinguishable from objects being near or far from the camera. Ideally, use wording that has a high chance of appearing verbatim in a caption of your desired image.

semantic imbalance

Each prompt word is like a dye, and they have different "affinities": a more common prompt word (e.g. loli) placed alongside other prompt words has a greater impact than the others.

For example, if you want to generate anime pictures and use the startrail tag for a starry sky, the result will contain more starry-sky elements from real photos than the anime starry sky you expect.

Many words carry different weights relative to the baseline, so adjust them according to the observed effect.

negative prompt

When generating, the SD-WebUI web application will avoid producing content mentioned in the negative prompt.

Negative prompts are a way of using Stable Diffusion that lets the user specify what they do not want to see, without placing additional requirements on the model itself.

By specifying the unconditional_conditioning parameter, during generation the sampler compares the denoised image that fits the prompt (castle) with the denoised image that looks like the negative prompt (grainy, foggy), and tries to push the final result away from the negative prompt.
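The guidance step described above can be sketched with toy numbers in place of real denoiser outputs. The formula is standard classifier-free guidance, where the negative prompt's prediction plays the unconditional role; the scale value is an assumption:

```python
# Sketch of classifier-free guidance with a negative prompt. The sampler
# pushes the result away from the negative-prompt prediction (uncond) and
# toward the positive-prompt prediction (cond).

def guided_noise(cond, uncond, guidance_scale=7.5):
    """uncond: prediction for the negative prompt; cond: for the positive."""
    return uncond + guidance_scale * (cond - uncond)

# Toy 1-D "noise predictions" in place of real model outputs:
print(guided_noise(cond=1.0, uncond=0.2))  # 0.2 + 7.5 * 0.8 = 6.2
```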

weight factor

Weighting factors can change the weight of specific parts of the prompt word.

For more information, see Wiki:Attention Emphasis

For SD-WebUI, the specific rules are as follows:

  • (word) - increases the weight by a factor of 1.1
  • ((word)) - increases the weight by a factor of 1.21 (= 1.1 × 1.1); nesting is multiplicative
  • [word] - reduces the weight to about 90.91% of the original (a factor of 1/1.1)
  • (word:1.5) - increases the weight by a factor of 1.5
  • (word:0.25) - reduces the weight to 25% of the original
  • \(word\) - uses literal parenthesis characters in the prompt

Parentheses are required when specifying a weight with a number, as in (word:1.5). If no numeric weight is specified, the factor defaults to 1.1. Specifying numeric weights is only available in SD-WebUI.
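The weighting rules above can be sketched as a small parser. This is a simplified illustration for a single wrapped word, not WebUI's full attention grammar:

```python
import re

# Sketch of SD-WebUI's weighting rules: each level of parentheses multiplies
# the weight by 1.1, each level of brackets divides it by 1.1, and an
# explicit (word:1.5) sets the factor directly. Simplified, single-word only.

def prompt_weight(fragment):
    m = re.fullmatch(r"\((\w+):([\d.]+)\)", fragment)
    if m:
        return float(m.group(2))          # explicit numeric weight
    weight = 1.0
    while fragment.startswith("(") and fragment.endswith(")"):
        weight *= 1.1
        fragment = fragment[1:-1]
    while fragment.startswith("[") and fragment.endswith("]"):
        weight /= 1.1
        fragment = fragment[1:-1]
    return weight

print(round(prompt_weight("((word))"), 2))  # 1.21
print(round(prompt_weight("[word]"), 3))    # 0.909
print(prompt_weight("(word:1.5)"))          # 1.5
```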

Regardless of the specific script used, repetition of a keyword seems to increase its effect.

It is worth noting that the more prompt words there are in your prompt, the less impact any single one has. You will also notice that the style fades as new prompt words are added, for the same reason. It is strongly recommended to increase the strength of style words as the prompt grows in order to maintain a consistent style.


Source: blog.csdn.net/u014096024/article/details/132012899