【Image generation and editing】The latest progress! Inpaint Anything, Edit Everything, and Grounded SAM: Learn about the most powerful image generation, restoration, and editing techniques in one article


Foreword


In the fields of CV and NLP, many large cross-modal models have recently emerged and achieved impressive results on image and text data. Among them, generative models are a particularly important class: they can produce novel images, text, or audio and have significant practical value. Three such large models, Inpaint Anything, Edit Everything, and Grounded SAM, draw on the latest deep learning techniques and model architectures to creatively solve problems in image generation, restoration, and editing, and have found many practical applications.

Specifically, Inpaint Anything implements image restoration and editing tasks such as object removal, content filling, and scene replacement; Edit Everything is a text-based image generation and editing system; Grounded SAM is a powerful system for detecting, segmenting, and replacing objects in any image. These models have been applied in many practical scenarios and have demonstrated strong theoretical and practical value.

This article introduces the principles, algorithms, and applications of these three models, hoping to bring readers useful information and inspiration. In the following sections, we describe each model in detail and analyze and evaluate its applications and performance.

1. Inpaint Anything: one-click object removal, content filling, and scene replacement

Based on the foundational image segmentation model SAM (Segment Anything Model) released by Meta, the IMCL Lab proposed the Inpaint Anything model (IA for short), which offers three functions:
1. Remove Anything: click on an object to remove it from the image;
2. Fill Anything: after removal, you can further tell IA through a text prompt what should fill the hole, and IA drives the embedded Stable Diffusion model to generate the corresponding content, realizing "content creation" as you wish;
3. Replace Anything: click to select the object to keep and use a text prompt to tell IA what to replace its background with; IA then swaps the background for the specified content, realizing vivid "environment conversion".
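To make the "click" interaction concrete, here is a minimal sketch of how a single click point can be turned into an object mask with the official segment-anything API. This is only an illustration under assumptions: the checkpoint path and input image are placeholders, and the downstream LaMa/SD step is not part of this snippet.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint path is a placeholder; download it from the official repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (x, y) selects the object to remove.
point_coords = np.array([[320, 240]])
point_labels = np.array([1])  # 1 = foreground point

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
mask = masks[np.argmax(scores)]  # keep the highest-scoring mask

# The binary mask would then be handed to LaMa (removal) or SD (fill/replace).
```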
Overall framework:
[Figure: overall framework of Inpaint Anything (SAM + LaMa + SD)]

The framework mainly consists of three models: SAM, LaMa, and SD. SAM handles image segmentation in the first stage, LaMa handles object removal (the first function), and SD handles filling objects or replacing the background (the latter two functions). The LaMa model fills in missing image regions; its architecture and principle are as follows:
[Figure: LaMa architecture with fast Fourier convolutions]

LaMa uses fast Fourier convolution (FFC) to process features in the frequency domain, which gives the network an image-wide receptive field and helps it recover repetitive, high-frequency structure. The input is split into two branches: the local branch uses regular convolutions, while the global branch uses a real FFT to capture global context. In the global branch, features pass through Real FFT2d, a convolution in the spectral domain, and Inverse Real FFT2d back to the spatial domain. The outputs of the two branches are fused at the output of the FFC block.
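The spectral path of FFC can be sketched in a few lines of PyTorch. This is a simplified illustration of the idea (FFT, a pointwise convolution on the real/imaginary channels, inverse FFT), not LaMa's actual implementation, which additionally uses batch norm, activations, and a local/global channel split.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Simplified global branch of a fast Fourier convolution (FFC)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution applied in the frequency domain
        # (real and imaginary parts stacked along the channel axis).
        self.freq_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Real FFT2d: spatial -> frequency domain.
        ffted = torch.fft.rfft2(x, norm="ortho")            # (b, c, h, w//2+1), complex
        ffted = torch.cat([ffted.real, ffted.imag], dim=1)  # (b, 2c, h, w//2+1)
        ffted = self.freq_conv(ffted)                       # image-wide receptive field
        real, imag = torch.chunk(ffted, 2, dim=1)
        # Inverse Real FFT2d: frequency -> spatial domain.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# The local branch would use regular convolutions; the two outputs are then fused.
x = torch.randn(1, 64, 256, 256)
out = SpectralTransform(64)(x)
```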
Experimental results:
Inpaint Anything was tested on the COCO dataset, the LaMa test set, and 2K high-definition photos taken with a mobile phone. The model supports 2K high-resolution images with arbitrary aspect ratios, which allows the IA system to be migrated efficiently into various integration environments and existing frameworks.
[Figures: Inpaint Anything experimental results]

2. Edit Everything: A text-guided image editing and generation system

Edit Everything is a generative system that takes an image and text as input and produces an image as output. It allows users to edit images with simple text instructions, and the system designs prompts that guide the vision models to generate the requested image. Experiments show that, by combining the Segment Anything Model and CLIP, Edit Everything helps Stable Diffusion achieve better visual results in image editing.

The overall framework:
Edit Everything, a text-guided generation system, consists of three main components: the Segment Anything Model (SAM), CLIP, and Stable Diffusion (SD). SAM extracts all segments of the image, while CLIP is trained to rank these segments according to a given source prompt. The source prompt describes the object of interest, while the target prompt describes the desired new object and editing style. The segment with the highest score is selected as the target segment. Finally, SD, guided by the target prompt, generates new content to replace the selected target segment (shown in black). This enables precise and personalized image editing:
[Figure: Edit Everything pipeline (SAM, CLIP, SD)]
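A minimal sketch of the CLIP ranking step in this pipeline is shown below. It assumes the SAM segments have already been cropped into a list of images (the file paths are placeholders), and it uses the public Hugging Face CLIP checkpoint rather than the CLIP variants trained in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

source_prompt = "a dog"  # describes the object to be edited
# In the real pipeline, `crops` are the image regions cut out by SAM's masks.
crops = [Image.open(p) for p in ["segment_0.png", "segment_1.png"]]  # placeholder paths

inputs = processor(text=[source_prompt], images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: (num_segments, 1) similarity of each segment to the source prompt.
scores = outputs.logits_per_image.squeeze(-1)
target_idx = int(scores.argmax())  # the highest-scoring segment becomes the edit target
```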

3. Grounded-SAM: Detect, segment and replace everything!

Just one day after the release of SAM, IDEA-Research built an evolved version, "Grounded-SAM", on top of it. Grounded-SAM combines Grounding DINO with SAM, BLIP, and Stable Diffusion, integrating the three capabilities of image "detection", "segmentation", and "generation" into one pipeline, making it arguably the most powerful zero-shot vision application.
The project is open-sourced at https://github.com/IDEA-Research/Grounded-Segment-Anything. Recent updates add a chatbot, voice input, one-click semantic scene replacement, and local replacement in combination with SD.
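The core "detect, then segment" chain behind these features can be sketched as text-prompted boxes from Grounding DINO fed into SAM as box prompts. The snippet below is an illustration only: config/checkpoint paths, the image, the prompt, and the thresholds are placeholders; see the repository demos for the full pipeline.

```python
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Grounding DINO: open-set detection from a text prompt (paths are placeholders).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("input.jpg")
boxes, logits, phrases = predict(
    model=dino, image=image, caption="dog",
    box_threshold=0.35, text_threshold=0.25,
)

# SAM: turn the detected box into a precise mask.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

h, w, _ = image_source.shape
# Grounding DINO returns normalized cxcywh boxes; convert to absolute xyxy.
cx, cy, bw, bh = boxes[0].numpy() * np.array([w, h, w, h])
box_xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])

masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)
# masks[0] can then be handed to Stable Diffusion inpainting for replacement.
```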

Function 1: Speech is converted to text by the Whisper module, and the detected object in the picture is replaced directly; for example, the dog is replaced with a monkey, and the resulting image looks natural:
[Figure: the dog replaced with a monkey via a voice instruction]
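A minimal sketch of the voice-input step, assuming the openai-whisper package. The transcribed text then has to be turned into source/target objects for detection and replacement; the parsing shown here is a naive placeholder, whereas the project wires this into its chatbot.

```python
import whisper

# Transcribe the spoken instruction, e.g. "replace the dog with a monkey".
model = whisper.load_model("base")
result = model.transcribe("instruction.wav")
instruction = result["text"].strip().lower()

# Naive placeholder parsing of the instruction.
if "replace the" in instruction and "with a" in instruction:
    source_obj = instruction.split("replace the")[1].split("with a")[0].strip()
    target_obj = instruction.split("with a")[1].strip().rstrip(".")
    # source_obj -> Grounding DINO prompt, target_obj -> SD inpainting prompt.
```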

Function 2: Automatic data labeling, including label information and prediction probabilities, somewhat similar to the YOLOv8 series. Building on SAM's "segment everything" idea, all the objects in a scene can be segmented and classified directly. One route uses Tag2Text, which has excellent tagging and captioning abilities, to generate tags directly, with Grounded-SAM producing the boxes and masks. Another route uses BLIP to generate a caption, ChatGPT to extract labels from it, and Grounded-SAM to generate the boxes and mask images. Frankly, the project has many advantages for simple scenes, but its practical use in complex scenes remains to be seen. At present, the biggest problems are that the segmentation can be too fine-grained, manual checking is still required, and the same parameters do not work for every image. For large sets of training images, the actual automatic labeling quality still needs to be optimized.
[Figures: automatic labeling results with boxes, masks, and labels]
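As a concrete illustration of the caption-then-label route above, here is a minimal BLIP captioning sketch using the public Hugging Face checkpoint. Extracting tags from the caption, which the project delegates to ChatGPT or Tag2Text, is reduced here to a trivial placeholder filter.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)

# Placeholder tag extraction; the project uses ChatGPT or Tag2Text for this step.
tags = [w for w in caption.split() if w not in {"a", "an", "the", "of", "on", "in", "with"}]
print(caption, tags)
# The tags then serve as the text prompt for Grounding DINO + SAM box/mask generation.
```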

Function 3: Instance replacement, such as changing hair color or the background, and other interactive applications.

[Figure: instance replacement examples]
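A minimal sketch of the replacement step with the diffusers inpainting pipeline, assuming the instance or background mask has already been produced by Grounded-SAM and saved as a black-and-white image; the model ID, file names, and prompt are illustrative.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# White pixels in the mask are regenerated; black pixels are kept.
init_image = Image.open("input.jpg").convert("RGB").resize((512, 512))
mask_image = Image.open("mask_from_grounded_sam.png").convert("L").resize((512, 512))

result = pipe(
    prompt="blonde hair",  # e.g. change the hair color of the masked region
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("edited.jpg")
```

Because the mask comes straight from the detection-plus-segmentation step, the replacement stays confined to the selected instance, which is what makes the edits in the examples above look natural.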
