Midjourney's rival is here! Google releases StyleDrop, a text-to-image model that can generate any style

Xi Xiaoyao's technology sharing
source | Xinzhiyuan

A formidable rival to Midjourney has arrived! With just a single reference image, StyleDrop, Google's master of customization, can reproduce even the most intricate art styles.

As soon as Google StyleDrop came out, it instantly swept the Internet.

Given Van Gogh's The Starry Night, the AI becomes a master in Van Gogh's image: after grasping this abstract style, it turns out countless paintings in the same vein.

Switch to a cartoon style, and the objects it draws become much cuter.

It can even control details precisely and design logos in an original style.

The charm of StyleDrop is that it needs only a single image as a reference: no matter how complicated the art style, it can be deconstructed and reproduced.

Netizens say it is the kind of AI tool that could put designers out of work.

The viral StyleDrop is the latest work from the Google Research team.

Paper address:
https://arxiv.org/pdf/2306.00983.pdf


Now, with tools like StyleDrop, you can not only paint with more control, but also do fine-grained work that used to be unimaginable, such as designing a logo.

Even Nvidia scientists called it a "phenomenal" result.

Master of Customization

According to the paper's authors, the inspiration for StyleDrop was the eyedropper (a color-picking tool).

In the same spirit, StyleDrop lets you quickly and effortlessly "pick" a style from a single reference image (or a few) and generate images in that style.

A sloth in 18 styles:

A panda in 24 styles:

A child's watercolor painting is captured perfectly by StyleDrop, right down to the wrinkles in the paper.

It has to be said: this is seriously impressive.

StyleDrop can also design English letters in different styles from a reference:

Here are the same letters in Van Gogh's style.

There are also line drawings. Line drawing is a highly abstract form of imagery that places strict demands on the coherence of the generated composition, and past methods have struggled to get it right.

The shading strokes on the cheese in the original image are faithfully carried over to the objects in each generated image.

Creations referencing the Android logo:

In addition, the researchers extended StyleDrop's capabilities: combined with DreamBooth, it can customize not only the style but also the content.

For example, still in Van Gogh's style, it can generate paintings of a corgi in a similar style:

Here's another: the corgi below has the feel of the Sphinx by the Egyptian pyramids.

How does it work?

StyleDrop is built on top of Muse and consists of two key parts:

One is parameter-efficient fine-tuning of a generative vision Transformer, and the other is iterative training with feedback.

Afterwards, the researchers synthesized images from the two fine-tuned models.

Muse is a state-of-the-art text-to-image synthesis model based on a masked generative image Transformer. It contains two synthesis modules, one for base image generation (256 × 256) and one for super-resolution (512 × 512 or 1024 × 1024).

Each module consists of a text encoder T, a transformer G, a sampler S, an image encoder E and a decoder D.

T maps a text prompt t ∈ T to a continuous embedding space E. G processes a text embedding e ∈ E to generate logits l ∈ L over the visual token sequence. S extracts a sequence of visual tokens v ∈ V from the logits via iterative decoding, which runs several steps of transformer inference conditioned on the text embedding e and on the visual tokens decoded in previous steps.

Finally, D maps the discrete token sequence to the pixel space I. In general, given a text prompt t, an image I is synthesized as follows:
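Putting these modules together (as the definitions above imply, the image is essentially D applied to the sampled tokens, which in turn depend on G and the text embedding T(t)), here is a minimal Python sketch of the pipeline; the function names and signatures are illustrative assumptions, not the actual Muse API:

```python
# Minimal sketch of the Muse synthesis pipeline described above.
# T, G, S, D are assumed to be plain callables; this is not the real Muse API.

def synthesize(t, T, G, S, D):
    """Compose the modules: text encoder T, transformer G (produces logits over
    visual tokens), iterative-decoding sampler S, and token-to-pixel decoder D."""
    e = T(t)      # map the text prompt t to a continuous embedding e
    v = S(G, e)   # run several steps of transformer inference, conditioned on e
                  # and on the visual tokens decoded in previous steps
    image = D(v)  # map the discrete visual token sequence back to pixel space
    return image
```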

Figure 2 shows a simplified architecture of the Muse transformer layer, partially modified to support parameter-efficient fine-tuning (PEFT) with adapters.

The sequence of visual tokens (shown in green), conditioned on the text embedding e, is processed by an L-layer transformer. The learned parameters θ are used to construct the weights for adapter tuning.
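As an illustration of the kind of adapter used for such parameter-efficient fine-tuning, here is a minimal PyTorch sketch of a bottleneck adapter; the dimensions and residual design are assumptions for illustration only, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative adapter whose weights play the role of the learned parameters θ;
    the frozen transformer layers around it are not shown."""
    def __init__(self, dim: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, dim)    # project back up to the model width
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen backbone's behavior when the
        # adapter output is small, so only these few weights need to be trained.
        return hidden + self.up(self.act(self.down(hidden)))
```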

To train θ, in many cases only images are available as the style reference, so text prompts have to be attached manually.

The researchers propose a simple, templated approach: each prompt consists of a description of the content followed by a phrase describing the style.

For example, in Table 1 the object is described as "cat" and "watercolor painting" is appended as the style description.

Including descriptions of both content and style in the text prompts is crucial, as it helps disentangle content from style, which is the researchers' main goal.
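A minimal sketch of this templated prompt construction could look like the following; the exact template wording is an assumption, only the content-then-style structure comes from the description above:

```python
def build_prompt(content: str, style_descriptor: str) -> str:
    """Compose a prompt from a content description followed by a style phrase."""
    return f"{content} in {style_descriptor} style"

# e.g. the Table 1 example: content "cat", style description "watercolor painting"
print(build_prompt("cat", "watercolor painting"))
```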

Figure 3 shows iterative training with feedback.

When trained on a single style reference image (orange box), some images generated by StyleDrop may exhibit content leaked from the style reference image (red boxes: images where a house similar to the one in the style image appears in the background).

Other images (blue boxes) do a better job of separating style from content. Iterative training of StyleDrop on good examples (blue boxes) results in a better balance between style and text fidelity (green boxes).

Here the researchers also used two methods:

-CLIP score

The CLIP score measures image-text alignment, so the quality of generated images can be evaluated by the CLIP score, i.e., the cosine similarity between the visual and textual CLIP embeddings.

The researchers then select the images with the highest CLIP scores. They call this method iterative training with CLIP feedback (CF).

In experiments, the researchers found that using the CLIP score to assess the quality of synthetic images is an effective way to improve recall (i.e., text fidelity) without too much loss in style fidelity.

On the other hand, CLIP scores may not fully align with human intent, nor do they capture subtle stylistic attributes.
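A hedged sketch of the CF selection step, assuming CLIP image and text embeddings have already been computed elsewhere (the embedding extraction, batch shapes, and the choice of k are placeholders):

```python
import torch
import torch.nn.functional as F

def clip_scores(image_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """CLIP score = cosine similarity between visual and textual CLIP embeddings."""
    return F.cosine_similarity(image_embs, text_embs, dim=-1)

def select_for_next_round(image_embs: torch.Tensor, text_embs: torch.Tensor, k: int = 10):
    """Keep the k synthetic images with the highest CLIP scores; these would serve
    as training data for the next round of iterative training (CF)."""
    scores = clip_scores(image_embs, text_embs)
    return torch.topk(scores, k).indices
```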

-HF

Human Feedback (HF) is a more straightforward way to directly inject user intent into synthetic image quality assessment.

HF has already proven powerful and effective in reinforcement-learning-based fine-tuning of LLMs.

HF can be used to compensate for the inability of CLIP scores to capture subtle stylistic attributes.

A large body of recent work has focused on personalizing text-to-image diffusion models so that they can synthesize images combining multiple personalized styles.

The researchers showed how to combine DreamBooth and StyleDrop in a simple way, allowing both style and content to be personalized.

This is done by sampling from two modified generative distributions, guided respectively by θs for style and θc for content, where the two sets of adapter parameters are trained independently on the style and content reference images.

Unlike existing off-the-shelf approaches, the team's method does not require jointly training learnable parameters on multiple concepts, which gives it greater compositional power, since the pre-trained adapters are trained separately on individual subjects and styles.

The overall sampling process follows the iterative decoding of Equation (1), but the logits are sampled differently at each decoding step.

Let t be the text prompt and c be the text prompt without the style descriptor; the logits at step k are then computed as follows:

Here γ balances StyleDrop and DreamBooth: if γ is 0, we recover StyleDrop; if γ is 1, we recover DreamBooth.
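A plausible form of this combination, consistent with the role of γ described here (a hedged reconstruction in the document's own notation, not the paper's equation verbatim), is a convex combination of the logits produced by the style adapter θs on the full prompt t and by the content adapter θc on the style-free prompt c:

l_k = (1 − γ) · G(v_k, T(t); θs) + γ · G(v_k, T(c); θc)

With γ = 0 only the style branch remains (StyleDrop); with γ = 1 only the content branch remains (DreamBooth).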

By setting γ appropriately, we can obtain an image that suitably balances the two.

Experiment settings

So far, there has been no extensive research on style tuning of text-to-image generative models.

Therefore, the researchers proposed a new experimental protocol:

  • Data collection

The researchers collected dozens of images in different styles, ranging from watercolor and oil paintings, flat illustrations, and 3D renderings to sculptures made of different materials.

  • Model configuration

The researchers tuned Muse-based StyleDrop using adapters. For all experiments, the adapter weights were updated for 1,000 steps using the Adam optimizer with a learning rate of 0.00003 (a rough sketch of this setup follows after this list). Unless otherwise stated, StyleDrop denotes the second-round model trained on more than 10 synthetic images with human feedback.

  • Evaluation

The quantitative evaluation reported is based on CLIP scores, which measure style consistency and text alignment. In addition, the researchers conducted a user preference study to evaluate style consistency and text alignment.
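As referenced in the model configuration above, here is a rough sketch of that fine-tuning setup: only the adapter weights are updated, for 1,000 Adam steps at learning rate 3e-5. The backbone, data, and loss below are stand-ins, not the actual Muse model or StyleDrop training objective:

```python
import torch
import torch.nn as nn

backbone = nn.Identity()          # stand-in for the frozen generative transformer
adapter = nn.Linear(1024, 1024)   # stand-in for the adapter weights θ being tuned
optimizer = torch.optim.Adam(adapter.parameters(), lr=3e-5)

for step in range(1000):
    features = torch.randn(8, 1024)   # stand-in for features tied to the style reference
    target = torch.randn(8, 1024)     # stand-in training target
    loss = nn.functional.mse_loss(adapter(backbone(features)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```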

The figure shows StyleDrop's results on the 18 images of different styles collected by the researchers.

As you can see, StyleDrop is able to capture the nuances of texture, shading, and structure in a variety of styles, enabling greater control over style than before.

For comparison, the researchers also present the results of DreamBooth on Imagen, DreamBooth's LoRA implementation on Stable Diffusion, and Textual Inversion.

The detailed results are shown in the table: human scores (top) and CLIP scores (bottom) for image-text alignment (Text) and visual style alignment (Style).

Qualitative comparison of (a) DreamBooth, (b) StyleDrop, and (c) DreamBooth + StyleDrop:

Here, the researchers applied the two CLIP-based metrics mentioned above: the text score and the style score.

For the text score, they measure the cosine similarity between the image and text embeddings. For the style score, they measure the cosine similarity between the embeddings of the style reference image and the synthesized image.

The researchers generated a total of 1,520 images for 190 text prompts. While they would like the final scores to be as high as possible, these metrics are not perfect.

Iterative training (IT) improves the text score, which is in line with the researchers' goal.

As a trade-off, however, style scores drop relative to the first-round models, since the later rounds are trained on synthetic images whose styles may be skewed by selection bias.

DreamBooth on Imagen falls short of StyleDrop in style score (0.644 vs. 0.694 with HF).

The researchers note that the style-score gain of DreamBooth on Imagen is modest (0.569 → 0.644), while the gain of StyleDrop on Muse is more pronounced (0.556 → 0.694).

The researchers attribute this to style fine-tuning being more effective on Muse than on Imagen.

In terms of fine-grained control, StyleDrop captures subtle stylistic differences such as color shifts, layering, or sharp corners.

Hot comments from netizens

If designers get their hands on StyleDrop, their productivity will take off tenfold.

"A day in AI is like ten years in the real world." AIGC is developing at the speed of light, the kind of light speed that dazzles the eyes!

Tools just ride the trend; whoever was going to be replaced has already been replaced.

This tool is much better than Midjourney at making logos.

