DreamBooth: fine-tuning text-to-image diffusion models for subject-driven generation

© 2022 Ruiz, Li, Jampani, Pritch, Rubinstein, Aberman (Google Research)
© 2023 Conmajia

Introduction

This article is a translation of the homepage of the official DreamBooth project website.
It is published with the permission of Nataniel Ruiz himself.

The main content below is based on the CVPR paper DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (arXiv:2208.12242).

"It's like a photo booth, but once you capture the subject, you can composite it anywhere your dreams can go."

Abstract

Large text-to-image models have taken a remarkable leap forward in the development of AI, synthesizing high-quality and diverse images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of them in different contexts. In this work, we propose a new approach to "personalize" text-to-image diffusion models (adapting them to the user's needs). Given only a few images of a subject as input, we fine-tune a pretrained text-to-image model (we use Imagen, although our method is not limited to a specific model) so that it learns to bind a unique identifier to that specific subject. This lets the model better meet the user's needs while strengthening its ability to recognize and generate the subject.

Once the subject is embedded in the model's output domain, the unique identifier can be used to synthesize entirely novel, photorealistic images of the subject in different scenes. By leveraging the semantic prior embedded in the model together with a new self-generated class-specific prior-preservation loss, our technique can synthesize the subject in diverse scenes, poses, viewpoints, and lighting conditions that do not appear in the reference images. Our technique enables not only subject recontextualization, text-guided view synthesis, and appearance modification, but also artistic rendering while preserving the subject's key features. This study provides a useful step forward for the field of text-to-image generation.

Background

For a specific subject such as a clock (shown in the real images on the left), generating that subject in different contexts with a state-of-the-art text-to-image model while maintaining high fidelity to its key visual features is very challenging. Take the Imagen model proposed by Saharia et al. (2022) as an example: even when the prompt describes the clock's appearance in detail ("a retro style yellow alarm clock with a white clock face in the jungle, with a yellow number 3 on the right side of the face"), the model still fails to reconstruct its key visual features after dozens of iterations. Furthermore, even models such as DALL-E 2, whose text embeddings live in a shared language-vision space and which can therefore create semantic variations of an image, can neither reconstruct the appearance of a given subject nor modify its context (Ramesh et al., 2022). In contrast, our method (far right) synthesizes the clock with high fidelity in a novel context ("a [V] clock in the jungle").

[Figure: real clock photos compared with Imagen and DALL-E 2 outputs and our result]

Method

Our method takes as input a few images of a specific subject (e.g. a particular dog), usually 3-5 images are enough, plus the corresponding class name (e.g. "dog"), and returns a fine-tuned, "personalized" text-to-image model. The model encodes a unique identifier that is used to refer to the subject (the dog). At generation time (a process called "inference"), we can embed this unique identifier in different sentences to synthesize the subject in different contexts, as illustrated by the sketch below.
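As an illustration, the following minimal Python sketch shows one way such prompts could be assembled for fine-tuning, prior generation, and inference. The identifier token "sks", the helper names, and the exact prompt templates are assumptions chosen for this example (the paper only requires a rare token bound to the subject); this is not the authors' released code.

# Hypothetical illustration of DreamBooth-style prompt construction.
# "sks" is a rare identifier token often used in public implementations;
# it stands in for the unique identifier "[V]" and is an assumption here.

UNIQUE_ID = "sks"    # rare token bound to the specific subject (assumed)
CLASS_NAME = "dog"   # coarse class name of the subject

def training_prompt() -> str:
    # Prompt paired with the 3-5 subject photos during fine-tuning.
    return f"a photo of {UNIQUE_ID} {CLASS_NAME}"

def prior_prompt() -> str:
    # Prompt used to self-generate class images for the prior-preservation loss.
    return f"a photo of a {CLASS_NAME}"

def inference_prompt(context: str) -> str:
    # Embed the learned identifier in a new sentence at inference time.
    return f"a {UNIQUE_ID} {CLASS_NAME} {context}"

print(training_prompt())                  # a photo of sks dog
print(inference_prompt("in the jungle"))  # a sks dog in the jungle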

[Figure: method overview]

Using ~3-5 images of a given subject, we fine-tune the text-to-image model in two steps (a sketch of the resulting loss follows this list):
(a) Fine-tune the low-resolution text-to-image model on the input images paired with a prompt containing the unique identifier and the subject's class name (e.g. "a photo of a [V] dog"). In parallel, we apply a class-specific prior-preservation loss, which leverages the model's semantic prior on the class and encourages it to keep generating diverse instances of the subject's class by inserting the class name into the prompt (e.g. "a photo of a dog").
(b) Fine-tune the super-resolution components with pairs of low-resolution and high-resolution images taken from our input image set, which lets the model maintain high fidelity to small details of the subject.
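To make the combined objective concrete, here is a minimal PyTorch sketch of the class-specific prior-preservation loss under assumed names (model, lambda_prior, and the tensor arguments are illustrative, not the authors' training code). It simply adds the standard denoising loss on the subject images to a weighted denoising loss on the self-generated class images.

import torch
import torch.nn.functional as F

def dreambooth_loss(model,
                    noisy_subject, t_subject, subject_cond, noise_subject,
                    noisy_class, t_class, class_cond, noise_class,
                    lambda_prior=1.0):
    # Reconstruction term: denoise the few subject photos conditioned on
    # a prompt like "a photo of a [V] dog".
    pred_subject = model(noisy_subject, t_subject, subject_cond)
    loss_subject = F.mse_loss(pred_subject, noise_subject)

    # Prior-preservation term: the model should still denoise its own
    # self-generated class samples conditioned on "a photo of a dog",
    # keeping the class prior diverse and preventing drift/overfitting.
    pred_class = model(noisy_class, t_class, class_cond)
    loss_prior = F.mse_loss(pred_class, noise_class)

    return loss_subject + lambda_prior * loss_prior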

[Figure: the two-step fine-tuning system]

Results

The examples below show the results of using our method to recontextualize subjects such as sunglasses, backpacks, vases, teapots, and dogs. By fine-tuning the model, we can generate images of a subject in different environments while preserving its details and enabling realistic interactions with the scene. The corresponding prompt is shown below each image.

[Figure: recontextualization results with their prompts]

Artistic Renditions

Below are original artistic renditions of the subject dog in the styles of different famous painters. Many of the generated poses do not appear in the training set (for example, the renditions in the styles of Van Gogh and Warhol). We also noticed that some renditions have novel compositions while faithfully imitating the painter's style, which hints at a degree of creativity in our method (drawing on its prior knowledge).

[Figure: artistic renditions of the subject dog]

Text-Guided View Synthesis

Our technique can synthesize images of the subject cat from specified viewpoints (from left to right: top, bottom, side, and back views). Note that the generated poses differ from the input poses, and the background changes realistically as the pose changes. We also highlight the faithful preservation of the intricate fur pattern on the cat's forehead.

[Figure: text-guided view synthesis of the subject cat]

Attribute Modification

The first row in the figure below shows color modifications generated with prompts of the form "a [color] [V] car". The second row shows the effect of "crossing" the specific dog with different animal species, using prompts of the form "a cross of a [V] dog and a [target species]". It is important to emphasize that our method preserves the subject's unique visual features while performing the desired modification of one of its attributes.

[Figure: color modification and species-crossing results]

Accessories

Shown here is outfitting the subject dog with accessories. The identity of the subject (the dog) is preserved while many different outfits are "worn" by it, using prompts of the form "a [V] dog wearing a police/chef/witch outfit". We observe realistic interactions between the subject dog and the outfits or accessories, as well as a large variety of possible options.

[Figure: the subject dog wearing different outfits]

Societal Impact

This project aims to provide users with an effective tool for synthesizing personal subjects (animals or objects) in different contexts. While general text-to-image models may be biased toward specific attributes when synthesizing images from text, our method allows users to better reconstruct their desired subjects. On the other hand, malicious parties might try to use such images to mislead viewers. This is a common problem that also exists in other generative model approaches and content manipulation techniques. Future research on generative modeling, and in particular on personalized generative priors, must continue to investigate and revalidate these concerns.

BibTeX Citation

@article{ruiz2022dreambooth,
  title={DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation},
  author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
  booktitle={arXiv preprint arXiv:2208.12242},
  year={2022}
}

Acknowledgments

We thank Rinon Gal, Adi Zicher, Ron Mokady, Bill Freeman, Dilip Krishnan, Huiwen Chang, and Daniel Cohen for their valuable input that helped improve this work. We also thank Mohammad Norouzi, Chitwan Saharia, and William Chan for their support and for the pretrained Imagen models. Finally, a special thanks to David Salesin for his feedback, suggestions, and support of the project.

Origin blog.csdn.net/conmajia/article/details/130053628