ICCV 2023 | Controllable Generation with VCT: Visual Editing Based on Reference Images

This article is sourced from the editorial department of Machine Heart (机器之心), with additions and deletions.

In recent years, image generation technology has seen a series of key breakthroughs. Especially since the release of large models such as DALL·E 2 and Stable Diffusion, text-to-image generation has matured rapidly, and high-quality image generation has broad practical applications. However, fine-grained editing of existing images remains a difficult problem.

On the one hand, because of the limitations of text descriptions, existing high-quality text-to-image models can only edit images descriptively through text, and some specific effects are hard to express in words at all. On the other hand, in practical application scenarios, fine-grained image editing tasks often provide only a handful of reference images, which makes it difficult for methods that require large amounts of training data to work, especially when there is only a single reference image.

Recently, researchers from the NetEase Interactive Entertainment AI Lab proposed an image-to-image editing method guided by a single image: given a single reference image, the object or style in the reference image can be transferred to the source image without changing the overall structure of the source image. The paper has been accepted to ICCV 2023, and the code has been open-sourced.

  • Paper address: https://arxiv.org/abs/2307.14352

  • Code address: https://github.com/CrystalNeuro/visual-concept-translator

Let's first look at a set of images to get a sense of the effect.

[Figure: example results from the paper. In each group, the top left is the source image, the bottom left is the reference image, and the right side shows the generated results.]

Main Framework

The authors propose an inversion-fusion based image editing framework called VCT (Visual Concept Translator). As shown in the figure below, the overall framework of VCT consists of two processes: content-concept inversion and content-concept fusion. The content-concept inversion process uses two different inversion algorithms to learn latent vectors that represent the structural information of the source image and the semantic information of the reference image, respectively; the content-concept fusion process then fuses these structural and semantic latent vectors to generate the final result.

[Figure: the main framework of VCT]

It is worth mentioning that inversion has been widely used in recent years, especially in the field of generative adversarial networks (GANs), where it has achieved outstanding results in many image generation tasks [1]. GAN inversion maps an image into the latent space of a trained GAN generator, and editing is performed by manipulating that latent space. Inversion schemes can take full advantage of the generative capabilities of pre-trained generative models. This work essentially transfers GAN inversion techniques to image-guided image editing, with a diffusion model as the prior.

[Figure: illustration of inversion techniques, from [1]]

Method

Based on the idea of inversion, VCT designs a dual-branch diffusion process consisting of a content reconstruction branch B* and a main editing branch B. Both branches start from the same noise x_T obtained by DDIM inversion [2] (an algorithm that uses a diffusion model to compute the noise corresponding to an image) and are used for content reconstruction and content editing, respectively. The pre-trained model is a Latent Diffusion Model (LDM), so the diffusion process takes place in the latent space (z space). The dual-branch process can be expressed as:

[Equations: the dual-branch diffusion process]
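To make the dual-branch process concrete, here is a minimal PyTorch-style sketch. It is an illustration under assumptions, not the released implementation: `ddim_invert`, `ddim_step`, and the `unet(z, t, cond=...)` interface are hypothetical helpers, and the noise-space weighting shown here is just one plausible form of the mixing described below.

```python
import torch

@torch.no_grad()
def dual_branch_edit(unet, scheduler, z0_src, e_src_list, e_ref, w=1.0, T=50):
    """Minimal sketch of VCT's dual-branch diffusion process.

    z0_src     : latent of the source image (from the LDM encoder)
    e_src_list : T per-step content embeddings learned from the source image
    e_ref      : concept embedding learned from the reference image
    """
    # Both branches start from the same noise x_T obtained by DDIM inversion.
    z_T = ddim_invert(unet, scheduler, z0_src)                # hypothetical helper

    z_rec, z_edit = z_T.clone(), z_T.clone()
    for t in reversed(range(T)):
        # Content reconstruction branch B*: conditioned on the learned content
        # embedding, it restores the structure of the source image (and, in the
        # full method, supplies attention maps to the editing branch).
        eps_rec = unet(z_rec, t, cond=e_src_list[t])
        z_rec = ddim_step(scheduler, z_rec, eps_rec, t)       # hypothetical helper

        # Main editing branch B: fuses content and concept embeddings in noise
        # space as a weighted combination of predicted noises.
        eps_src = unet(z_edit, t, cond=e_src_list[t])
        eps_ref = unet(z_edit, t, cond=e_ref)
        eps_fused = eps_src + w * (eps_ref - eps_src)
        z_edit = ddim_step(scheduler, z_edit, eps_fused, t)

    return z_edit  # decode with the LDM decoder to obtain the edited image
```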

The content reconstruction branch B* learns T content feature vectors (one per diffusion step) to restore the structural information of the source image, and passes this structural information to the main editing branch B through a soft attention control scheme. The soft attention control scheme draws on Google's Prompt-to-Prompt work [3]; the formula is:

[Equation: soft attention control]
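Since the formula is only available as an image in the original post, the following is a hedged LaTeX reconstruction following the Prompt-to-Prompt attention-injection rule that the text cites; the symbols $M_t$ and $M_t^{*}$ (cross-attention maps of the editing and reconstruction branches) and the injection interval $[\tau_1, \tau_2]$ are assumed notation:

```latex
\mathrm{Edit}\big(M_t, M_t^{*}\big) =
\begin{cases}
  M_t^{*}, & \tau_1 \le t \le \tau_2 \\
  M_t,     & \text{otherwise}
\end{cases}
```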

That is, when the diffusion step falls within a certain interval, the attention maps of the main editing branch are replaced with those of the content reconstruction branch, thereby controlling the structure of the generated image. The main editing branch B then fuses the content feature vectors learned from the source image with the concept feature vector learned from the reference image to generate the edited image.

[Figure: noise space (ε space) fusion]

At each step of the diffusion model, the fusion of feature vectors takes place in the noise space, i.e., as a weighted combination of the noises predicted by the diffusion model given each feature vector. In the content reconstruction branch, the fusion mixes the content feature vector with the empty-text vector, in the same form as classifier-free diffusion guidance [4]:

[Equation: noise fusion in the content reconstruction branch, in classifier-free guidance form]

The fusion in the main editing branch mixes the content feature vectors with the concept feature vector:

[Equation: noise fusion in the main editing branch]
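Because both fusion equations appear only as images in the original post, here is a hedged reconstruction consistent with the classifier-free guidance form described above. The notation is assumed rather than taken from the paper: $\epsilon_\theta$ is the noise predictor, $e_t^{src}$ the per-step content embedding, $e^{ref}$ the concept embedding, $\varnothing$ the empty-text embedding, and $w$ the guidance weight.

```latex
% Content reconstruction branch B*: content embedding mixed with empty text
\tilde{\epsilon}_t^{*} = \epsilon_\theta(z_t^{*}, \varnothing)
  + w \big[ \epsilon_\theta(z_t^{*}, e_t^{src}) - \epsilon_\theta(z_t^{*}, \varnothing) \big]

% Main editing branch B: content embedding mixed with the concept embedding
\tilde{\epsilon}_t = \epsilon_\theta(z_t, e_t^{src})
  + w \big[ \epsilon_\theta(z_t, e^{ref}) - \epsilon_\theta(z_t, e_t^{src}) \big]
```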

At this point, the key question is how to obtain the feature vectors carrying structural information from a single source image and the feature vector carrying conceptual information from a single reference image. The paper achieves this with two different inversion schemes.

To reconstruct the source image, the paper follows the optimization scheme of null-text inversion [5] and learns feature vectors for the T stages to fit the source image. However, unlike null-text inversion, which optimizes empty-text vectors to fit the DDIM inversion trajectory, this paper optimizes the source-image feature vectors to directly fit the estimated clean latent vectors. The fitting objective is:

[Equations: the optimization objective for the content feature vectors]
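A rough sketch of this per-step optimization is given below; the trajectory handling, the number of iterations, and the hypothetical `scheduler.predict_clean` helper (an $x_0$-estimate from the predicted noise) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def learn_content_embeddings(unet, scheduler, z0_src, z_traj, T=50, iters=500, lr=1e-3):
    """Sketch: learn one content embedding per diffusion step so that the
    estimated clean latent matches the source-image latent z0_src.

    z_traj : latents along the DDIM-inversion trajectory of the source image
    """
    embed_dim = 768  # assumed text-embedding dimension
    e_src = [torch.zeros(1, embed_dim, requires_grad=True) for _ in range(T)]
    opt = torch.optim.Adam(e_src, lr=lr)

    for _ in range(iters):
        t = torch.randint(0, T, (1,)).item()
        eps = unet(z_traj[t], t, cond=e_src[t])               # predicted noise
        z0_hat = scheduler.predict_clean(z_traj[t], eps, t)   # hypothetical x0-estimate
        loss = F.mse_loss(z0_hat, z0_src)                     # fit the clean latent directly
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e_src
```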

Unlike the structural information, the conceptual information in the reference image needs to be represented by a single, highly generalized feature vector that is shared across all T stages of the diffusion model. The paper improves on the existing inversion schemes Textual Inversion [6] and DreamArtist [7]: it uses multi-concept feature vectors to represent the content of the reference image, and the loss function combines the diffusion model's noise-estimation term with a reconstruction term estimated in the latent space:

[Equation: the loss for learning the concept feature vector]
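A minimal sketch of this loss is shown below, assuming generic diffusion-training helpers (`add_noise` for the forward process and `predict_clean` for the $x_0$-estimate) and an unspecified weighting `lam` between the two terms; it illustrates the two loss components described above, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def concept_embedding_loss(unet, scheduler, z0_ref, e_ref, T=1000, lam=1.0):
    """Noise-estimation term plus a reconstruction term in the latent space."""
    t = torch.randint(0, T, (1,)).item()
    noise = torch.randn_like(z0_ref)
    z_t = scheduler.add_noise(z0_ref, noise, t)           # hypothetical forward-process helper
    eps = unet(z_t, t, cond=e_ref)
    loss_noise = F.mse_loss(eps, noise)                   # diffusion noise-estimation term
    z0_hat = scheduler.predict_clean(z_t, eps, t)         # hypothetical x0-estimate
    loss_rec = F.mse_loss(z0_hat, z0_ref)                 # latent-space reconstruction term
    return loss_noise + lam * loss_rec
```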

Experimental results

The paper conducts experiments on subject replacement and stylization tasks: the generated images take on the subject or style of the reference image while well preserving the structural information of the source image.

[Figure: experimental results from the paper]

Compared with previous schemes, the VCT framework proposed in this article has the following advantages:

(1) Generalization: Compared with previous image-guided image editing methods, VCT does not need large amounts of training data and achieves better generation quality and generalization. It is built on the idea of inversion and on high-quality text-to-image models pre-trained on open-world data. In practical applications, only one input image and one reference image are needed to obtain good editing results.

(2) Visual accuracy: Compared with recent text-driven image editing methods, VCT uses an image as the reference for guidance. Compared with a text description, an image reference enables more precise editing. The figures below show comparisons between VCT and other methods:

[Figure: comparison on the subject replacement task]

[Figure: comparison on the style transfer task]

(3) No additional information required: Compared with some recent methods that need extra control information (such as mask maps or depth maps) for guidance, VCT directly learns structural and semantic information and fuses them for generation. The figure below shows some comparison results: Paint-by-Example replaces the corresponding object with the object in the reference image by providing a mask of the source image; ControlNet controls the generated result through line drawings, depth maps, and so on; VCT, in contrast, fuses the structural and content information learned directly from the source image and the reference image into the target image, without additional constraints.

[Figure: comparison of image-guided image editing methods]

NetEase Interactive Entertainment AI Lab

Founded in 2017, the NetEase Interactive Entertainment AI Lab belongs to the NetEase Interactive Entertainment Business Group and is a leading artificial intelligence laboratory in the game industry. The lab is dedicated to the research and application of computer vision, speech and natural language processing, reinforcement learning, and other technologies in game scenarios, aiming to use AI to upgrade the technology behind the group's popular games and products. NetEase Interactive Entertainment operates many popular games, such as Fantasy Westward Journey, Harry Potter: Magic Awakened, Onmyoji, and Westward Journey.

[1] Xia W, Zhang Y, Yang Y, et al. GAN Inversion: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(3): 3121-3138.

[2] Song J, Meng C, Ermon S. Denoising Diffusion Implicit Models. International Conference on Learning Representations (ICLR), 2021.

[3] Hertz A, Mokady R, Tenenbaum J, et al. Prompt-to-Prompt Image Editing with Cross-Attention Control. The Eleventh International Conference on Learning Representations (ICLR), 2023.

[4] Ho J, Salimans T. Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

[5] Mokady R, Hertz A, Aberman K, et al. Null-Text Inversion for Editing Real Images Using Guided Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 6038-6047.

[6] Gal R, Alaluf Y, Atzmon Y, Patashnik O, Bermano A H, Chechik G, Cohen-Or D. An Image Is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion. arXiv preprint arXiv:2208.01618, 2022.

[7] Dong Z, Wei P, Lin L. DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning. arXiv preprint arXiv:2211.11337, 2022.
