This article originates from the Machine Heart (机器之心) editorial department, with additions and deletions.
In recent years, image generation technology has made a number of key breakthroughs. Especially since the release of large models such as DALL·E 2 and Stable Diffusion, text-to-image generation has matured rapidly, and high-quality image generation has broad practical applications. However, fine-grained editing of existing images remains a challenge.
On the one hand, due to the limitations of text descriptions, existing high-quality text-to-image models can only edit images descriptively through text, and some specific effects are difficult to describe in words at all. On the other hand, in practical application scenarios, fine-grained image editing tasks often have only a handful of reference images, which makes it hard for methods that require large amounts of training data to work, especially when only a single reference image is available.
Recently, researchers from the NetEase Interactive Entertainment AI Lab proposed an image-guided image editing method based on a single reference image: given one reference image, the object or style in it can be transferred to a source image without changing the source image's overall structure. The paper has been accepted by ICCV 2023, and the code has been open-sourced.
Paper address: https://arxiv.org/abs/2307.14352
Code address: https://github.com/CrystalNeuro/visual-concept-translator
Let's first look at a few examples to get a sense of the results.
Results from the paper: in each group, the top-left is the source image, the bottom-left is the reference image, and the right side shows the generated result
Main Framework
The authors propose an inversion-fusion image editing framework, VCT (visual concept translator). As shown in the figure below, the overall VCT framework consists of two processes: content-concept inversion and content-concept fusion. The inversion process uses two different inversion algorithms to learn latent vectors that represent, respectively, the structural information of the source image and the semantic information of the reference image; the fusion process then fuses the structural and semantic latent vectors to generate the final result.
The main framework of the paper
It is worth noting that inversion methods have been widely adopted in recent years, especially in the field of generative adversarial networks (GANs), where they have achieved outstanding results on many image generation tasks [1]. GAN inversion maps an image into the latent space of a trained GAN generator, and editing is achieved by manipulating that latent space. Inversion schemes can take full advantage of the generative capability of pre-trained generative models. This work essentially transfers the GAN inversion idea to image-guided image editing, with a diffusion model as the prior.
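To make the inversion idea concrete, here is a minimal NumPy sketch of GAN inversion with a toy linear "generator"; the matrix `A` stands in for a real GAN generator `G`, and the gradient-descent loop recovers the latent `w` whose output matches a target image. All names here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 4))  # toy linear "generator": G(w) = A @ w

def gan_invert(x, steps=500, lr=0.01):
    """Recover a latent w such that G(w) ~ x, by gradient descent on ||G(w) - x||^2."""
    w = np.zeros(4)
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ w - x)  # gradient of the reconstruction loss
        w -= lr * grad
    return w

w_true = rng.standard_normal(4)
x = A @ w_true          # the "image" to invert
w_hat = gan_invert(x)   # recovered latent, ready for latent-space editing
```

Once `w_hat` is found, editing amounts to perturbing it in latent space and re-running the generator, which is exactly the leverage this paper carries over to diffusion models.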
Inversion techniques [1]
Method Introduction
Based on the idea of inversion, VCT designs a dual-branch diffusion process consisting of a content-reconstruction branch B* and a main editing branch B. Both start from the same noise xT obtained via DDIM inversion [2] (an algorithm that computes the noise corresponding to an image using a diffusion model), and are used for content reconstruction and content editing, respectively. The pre-trained model used in the paper is a Latent Diffusion Model (LDM), so the diffusion process takes place in the latent space (z-space). The dual-branch process can be expressed as:
The dual-branch diffusion process
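The DDIM inversion step that produces the shared starting noise can be illustrated with a minimal NumPy sketch. The toy `predict_noise` function below is purely a stand-in for the real noise-prediction U-Net, and the schedule values are arbitrary; only the update rule reflects DDIM inversion [2].

```python
import numpy as np

def predict_noise(x, t):
    """Toy stand-in for the diffusion model's noise predictor (a U-Net in practice)."""
    rng = np.random.default_rng(t)  # deterministic per step, for reproducibility
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)

def ddim_invert(x0, alpha_bar):
    """Run the deterministic DDIM trajectory in reverse: clean latent -> noise x_T.

    alpha_bar: cumulative noise schedule, close to 1 at t=0 and decreasing.
    """
    x = x0.copy()
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = predict_noise(x, t)
        # estimate the clean latent from the current noisy latent
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # step toward higher noise along the deterministic DDIM trajectory
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps
    return x  # shared starting noise x_T for both branches

x0 = np.ones((4, 4))                       # toy clean latent
alpha_bar = np.linspace(0.999, 0.1, 10)    # toy schedule
xT = ddim_invert(x0, alpha_bar)
```

Because the update is deterministic, running the forward DDIM process from `xT` with the same noise predictor retraces the trajectory back toward `x0`, which is what lets the reconstruction branch recover the source image.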
The content-reconstruction branch B* learns T content feature vectors to recover the structural information of the source image, and passes this structural information to the main editing branch B through a soft attention control scheme. The soft attention control scheme draws on Google's prompt-to-prompt work [3]; the formula is:
That is, when the diffusion step falls within a given interval, the attention map of the main editing branch is replaced with that of the content-reconstruction branch, thereby controlling the structure of the generated image. The main editing branch B then fuses the content feature vectors learned from the source image with the concept feature vector learned from the reference image to generate the edited image.
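The attention-map replacement described above can be sketched as follows. This is a simplified, prompt-to-prompt-style illustration of the control mechanism, not the paper's implementation; the interval bounds `t_lo`/`t_hi` and all tensor shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, attn_override=None):
    """Scaled dot-product cross-attention; optionally inject an external attention map."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if attn_override is not None:
        attn = attn_override  # soft control: reuse the reconstruction branch's map
    return attn @ v, attn

def edit_step(q_edit, k_edit, v_edit, attn_recon, t, t_lo, t_hi):
    """Inject the reconstruction branch's attention map only inside [t_lo, t_hi]."""
    override = attn_recon if t_lo <= t <= t_hi else None
    out, _ = attention(q_edit, k_edit, v_edit, attn_override=override)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))   # 3 spatial query tokens
k = rng.standard_normal((5, 8))   # 5 conditioning tokens
v = rng.standard_normal((5, 8))
_, attn_recon = attention(q, k, v)                        # map from branch B*
out = edit_step(q, k, v, attn_recon, t=10, t_lo=0, t_hi=25)  # structure-controlled step
```

Inside the interval, the structure-carrying attention maps from B* dictate where each token attends; outside it, B attends freely so the reference concept can shape the appearance.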
Noise-Space Fusion
At each step of the diffusion process, the feature vectors are fused in the noise space: each feature vector is fed into the diffusion model and the predicted noises are combined as a weighted sum. In the content-reconstruction branch, the mixing happens between the content feature vector and the empty-text vector, consistent with classifier-free diffusion guidance [4]:
In the main editing branch, the mixing is between the content feature vectors and the concept feature vector:
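Since the formula images are not reproduced here, the weighting can be sketched in NumPy. The first function is standard classifier-free guidance; the second applies the same extrapolation form to the editing branch and is our reading of the fusion described above, not necessarily the paper's exact formula.

```python
import numpy as np

def cfg_mix(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def edit_mix(eps_content, eps_concept, w):
    """Editing-branch fusion (illustrative): steer the content-conditioned noise
    prediction toward the concept-conditioned one with weight w."""
    return eps_content + w * (eps_concept - eps_content)

eps_a, eps_b = np.zeros(4), np.ones(4)   # toy noise predictions
guided = cfg_mix(eps_a, eps_b, 7.5)      # guidance-scale extrapolation
fused = edit_mix(eps_a, eps_b, 0.5)      # halfway blend of content and concept
```

At `w = 0` the branch reproduces the content prediction unchanged, and at `w = 1` it follows the concept prediction entirely, so `w` directly trades structure preservation against concept transfer.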
The key question, then, is how to obtain the structural feature vectors from a single source image and the concept feature vector from a single reference image. The paper achieves this through two different inversion schemes.
To reconstruct the source image, the paper follows the optimization scheme of null-text inversion [5] and learns feature vectors for the T steps to fit the source image. However, unlike null-text inversion, which optimizes empty-text vectors to fit the DDIM trajectory, this paper optimizes the source-image feature vectors to directly fit the estimated clean latent vectors. The fitting formula is:
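The original formula image is not reproduced here; in our own notation (which may differ from the paper's symbols), the per-step fitting objective reads roughly as:

```latex
% At each step t, optimize the content embedding v_t so that the clean-latent
% estimate produced under v_t matches the source latent z_0:
\min_{v_t} \; \big\| \hat{z}_0(z_t, v_t) - z_0 \big\|_2^2,
\qquad
\hat{z}_0(z_t, c) = \frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, c)}{\sqrt{\bar{\alpha}_t}}
```

Here \(\hat{z}_0\) is the usual one-step clean-latent estimate of a diffusion model, so the optimization pulls each step's embedding toward whatever makes the model reconstruct the source image directly.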
Unlike the structural information, the conceptual information in the reference image needs to be represented by a single, highly generalized feature vector that is shared across all T steps of the diffusion model. The paper improves on the existing inversion schemes Textual Inversion [6] and DreamArtist [7]: it uses a multi-concept feature vector to represent the content of the reference image, and its loss function consists of a diffusion-model noise-estimation term plus an estimated reconstruction term in the latent space:
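Again in our own notation rather than the paper's, the combined concept-inversion objective can be sketched as a standard diffusion noise-prediction term plus a latent-space reconstruction term, with a hypothetical weight \(\lambda\):

```latex
\mathcal{L}(v^{\mathrm{ref}}) =
\underbrace{\mathbb{E}_{z_0^{\mathrm{ref}},\,\epsilon,\,t}
  \big\| \epsilon - \epsilon_\theta(z_t, v^{\mathrm{ref}}) \big\|_2^2}_{\text{noise estimation}}
\;+\; \lambda\,
\underbrace{\big\| \hat{z}_0(z_t, v^{\mathrm{ref}}) - z_0^{\mathrm{ref}} \big\|_2^2}_{\text{latent reconstruction}}
```

The first term is the ordinary Textual-Inversion-style training signal; the second anchors the learned concept vector to the reference image's clean latent, matching the description above.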
Experimental results
The paper conducts experiments on subject replacement and stylization tasks, transferring the subject or style of the reference image to the source image while better preserving the source image's structural information.
Paper experimental results
Compared with previous schemes, the VCT framework proposed in the paper has the following advantages:
(1) Generalization: compared with previous image-guided image editing methods, VCT does not require large amounts of training data and offers better generation quality and generalization. Built on the inversion idea and on high-quality text-to-image models pre-trained on open-world data, it needs only one input image and one reference image to achieve good editing results in practice.
(2) Visual accuracy: compared with recent text-based image editing methods, VCT uses an image as the reference guidance. An image reference allows more precise editing than a text description. The figure below compares VCT with other methods:
Subject replacement task comparison effect
Comparison effect of style transfer task
(3) No additional information required: compared with some recent methods that need extra control information (such as mask maps or depth maps) for guidance, VCT learns structural and semantic information directly and fuses them for generation. The figure below shows some comparison results: Paint-by-Example replaces the corresponding object with the object from the reference image by providing a mask of the source image; ControlNet controls the generated result via line drawings, depth maps, and so on; VCT, in contrast, fuses the structure and content information learned directly from the source and reference images into the target image, without additional constraints.
Comparison of image-guided image editing schemes
NetEase Interactive Entertainment AI Lab
Founded in 2017, NetEase Interactive Entertainment AI Lab belongs to the NetEase Interactive Entertainment Business Group and is a leading artificial intelligence laboratory in the game industry. The laboratory is dedicated to the research and application of technologies such as computer vision, speech and natural language processing, and reinforcement learning in game scenarios, aiming to use AI to upgrade the technology of Interactive Entertainment's popular games and products. NetEase Interactive Entertainment operates many popular games, such as "Fantasy Westward Journey", "Harry Potter: Magic Awakening", "Onmyoji", "Western Journey", and so on.
【1】Xia W, Zhang Y, Yang Y, et al. Gan inversion: A survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45 (3): 3121-3138.
【2】 Song J, Meng C, Ermon S. Denoising Diffusion Implicit Models [C]//International Conference on Learning Representations. 2020.
【3】Hertz A, Mokady R, Tenenbaum J, et al. Prompt-to-Prompt Image Editing with Cross-Attention Control [C]//The Eleventh International Conference on Learning Representations. 2022.
【4】Ho J, Salimans T. Classifier-free diffusion guidance [C]//NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.
【5】Mokady R, Hertz A, Aberman K, et al. Null-text inversion for editing real images using guided diffusion models [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 6038-6047.
【6】Gal R, Alaluf Y, Atzmon Y, Patashnik O, Bermano A H, Chechik G, Cohen-Or D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
【7】Dong Z, Wei P, Lin L. DreamArtist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning. arXiv preprint arXiv:2211.11337, 2022.