Interpretation of the paper: Inpaint Anything: Segment Anything Meets Image Inpainting

 Paper: https://arxiv.org/pdf/2304.06790.pdf

 Code: https://github.com/geekyutao/Inpaint-Anything

Figure 1: Schematic diagram of Inpaint Anything. Users can select any object in the image by clicking on it. With the help of powerful vision models such as SAM [7], LaMa [13] and Stable Diffusion (SD) [11], Inpaint Anything is able to remove the object smoothly (i.e., remove anything). Further, by entering text prompts, users can fill the object region with whatever content they want (i.e., fill anything), or arbitrarily replace its background (i.e., replace anything).

1 Summary

Modern image inpainting systems often struggle with mask selection and hole filling. Based on the Segment Anything Model (SAM), we make the first attempt at mask-free image inpainting and propose a new paradigm named "Inpaint Anything (IA)", i.e., "click and fill".

The core idea of IA is to combine the strengths of different models to build a powerful and user-friendly pipeline for solving inpainting-related problems. IA supports three main functions:

(i) Remove Anything: the user can click on an object and IA will remove it, smoothly filling the resulting "hole" with the surrounding context;

(ii) Fill Anything: after removing certain objects, the user can provide text prompts to IA, which then fills the hole with correspondingly generated content by driving an AIGC model such as Stable Diffusion [11];

(iii) Replace Anything: with IA, users can choose to keep the click-selected object and replace the rest of the background with a newly generated scene.

2 Motivation and contributions

2.1 Why do we need Inpaint Anything?

• State-of-the-art image inpainting methods, such as LaMa [13], Repaint [10], MAT [8], and ZITS [4], have made great progress in inpainting large regions and handling complex repetitive structures. They can successfully inpaint high-resolution images and generally generalize well to new images. However, they usually require fine-grained mask annotations for each image, which is essential for both training and inference.

• The Segment Anything Model (SAM) [7] is a powerful segmentation foundation model that can generate high-quality object masks from input prompts such as points or boxes, and can produce comprehensive and accurate masks for all objects in an image. However, its mask predictions have not yet been fully explored for downstream tasks.

• Furthermore, existing inpainting methods can only fill removed regions with surrounding context. AIGC models open up new opportunities for creation: they have the potential to meet a large demand by helping people generate the content they want.

• Thus, by combining the advantages of SAM [7], the state-of-the-art image inpainter LaMa [13], and an AI-generated content (AIGC) model [11], we provide a powerful and user-friendly pipeline for solving more general inpainting-related problems, such as object removal, new content filling, and background replacement.

2.2 What can Inpaint Anything do?

• Remove arbitrary objects with SAM + a SOTA inpainter: With IA, users can easily remove a specific object from an image by simply clicking on it. IA then offers the option of filling the resulting "hole" with contextual content. For this, we combine the strengths of SAM and state-of-the-art inpainters such as LaMa. After refinement by erosion and dilation operations, the mask predictions generated by SAM are used as input to the inpainting model, providing a clear indication of the object region to be erased and filled.

• Fill or replace arbitrary content with SAM + AIGC models:

(1) After removing an object, IA offers two options for filling the resulting "hole": with contextual content or with "new content". Specifically, we leverage a powerful AI-generated content (AIGC) model such as Stable Diffusion [11] to generate new objects from text prompts. For example, a user could type the word "dog" or a sentence like "a cute dog sitting on a bench" to generate a new dog that fills the hole.

(2) In addition, the user can choose to replace the remaining background with a newly generated scene while keeping the click-selected object. IA supports multiple ways of prompting AIGC models, such as using different images as visual prompts or short captions as text prompts. For example, a user could keep the dog in the image but replace the original indoor background with an outdoor one.

3 Methods

3.1 Preliminaries

Segment Anything Model (SAM). Segment Anything [7] is a ViT-based CV model trained on a large visual corpus (SA-1B). SAM demonstrates promising segmentation capabilities in various scenarios, and such foundation models hold great potential for computer vision. It is a pioneering step toward visual artificial general intelligence, and SAM has even been hailed as the "ChatGPT for CV".

SOTA inpainter. Image inpainting, an ill-posed inverse problem, has been widely studied in computer vision and image processing. Its goal is to replace missing regions of a corrupted image with content that has visually plausible structure and texture. In Inpaint Anything (IA), we adopt the simple single-stage method LaMa [13] for mask-based inpainting, which excels at synthesizing repetitive visual structures by combining fast Fourier convolutions (FFC) [1], a perceptual loss [6], and an aggressive training mask generation strategy.
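For intuition, below is a minimal, simplified sketch (not LaMa's actual implementation) of the Fourier unit inside an FFC: a pointwise convolution applied in the frequency domain gives every output position a receptive field covering the whole image, which is what makes FFC well suited to repetitive structures.

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Simplified Fourier unit: a conv applied in the frequency domain (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv over stacked real/imag channels mixes information
        # from the entire spatial extent in a single step.
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels * 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")          # complex, (b, c, h, w//2+1)
        freq = torch.cat([freq.real, freq.imag], dim=1)  # (b, 2c, h, w//2+1)
        freq = self.act(self.bn(self.conv(freq)))
        real, imag = freq.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
```

For example, `FourierUnit(64)(torch.randn(1, 64, 128, 128))` returns a tensor of the same shape, but each output value depends on every input position.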

AIGC models. ChatGPT and other generative AI technologies fall under the umbrella of artificial intelligence-generated content (AIGC), which refers to the creation of digital content such as images, music, and natural language with AI models. AIGC is regarded as a novel way of content creation and has demonstrated state-of-the-art performance in various content generation tasks [11, 12]. In IA, we directly use the powerful AIGC model Stable Diffusion [11] to generate the desired content in holes based on text prompts.

3.2 Inpaint Anything

The principle of our proposed Inpaint Anything (IA) is to combine off-the-shelf foundation models to solve a wide range of image inpainting problems. By combining the strengths of these models, IA is able to generate high-quality inpainted images. Specifically, IA comprises three schemes: Remove Anything, Fill Anything, and Replace Anything.

3.2.1 Remove Anything

Remove Anything addresses the object removal problem [2, 3, 5]: it allows users to eliminate any object from an image while ensuring that the resulting image remains visually plausible.

Remove Anything consists of three steps: click, segment, and remove, as shown in Figure 1.

In the first step, users click to select the objects they want to remove from the image.

Next, a foundation segmentation model such as Segment Anything [7] automatically segments the object and creates a mask based on the click location.

Finally, an advanced inpainting model such as LaMa [13] uses the mask to fill the hole left by the removed object.

Since the object is no longer present in the image, the inpainting model fills the hole with background information.

Note that throughout the process, users only need to click on the object they want to remove from the image.
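To make these steps concrete, here is a minimal sketch of the click-to-remove flow. The SAM calls follow the official segment-anything API; the checkpoint path, the click coordinates, and the `inpaint_img_with_lama` helper (a stand-in for a LaMa wrapper, e.g. the one shipped in the Inpaint-Anything repo) are illustrative assumptions.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Step 1, click: a single foreground point chosen by the user (hypothetical coordinates).
point_coords = np.array([[450, 600]])  # (x, y)
point_labels = np.array([1])           # 1 = foreground point

# Step 2, segment: SAM turns the click into an object mask.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=point_coords, point_labels=point_labels, multimask_output=True
)
mask = (masks[np.argmax(scores)] * 255).astype(np.uint8)  # keep the highest-scoring mask

# Step 3, remove: dilate the mask (see Sec. 3.2.4), then let the inpainter
# fill the hole from the surrounding context.
mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))
result = inpaint_img_with_lama(image, mask)  # stand-in for a LaMa wrapper
```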

3.2.2 Fill Anything

Fill Anything allows users to fill any object in an image with whatever they want.

The process consists of four steps: click, segment, text prompt, and generate.

The first two steps of Fill Anything are the same as Remove Anything.

In the third step, the user enters a text prompt indicating what they want to fill the object's hole with.

Finally, a powerful AIGC model such as Stable Diffusion [11] generates the desired content in the hole based on the text prompt.
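Assuming the mask produced in the segmentation step, the generation step can be sketched with Hugging Face diffusers' Stable Diffusion inpainting pipeline; the model ID and the 512×512 working resolution are illustrative choices, not details fixed by the paper.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # white = region to regenerate

# The text prompt steers what appears inside the hole.
result = pipe(prompt="a cute dog sitting on a bench", image=image, mask_image=mask).images[0]
result.save("filled.png")
```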

3.2.3 Replace Anything

Replace Anything keeps the click-selected object and replaces everything else with a newly generated background. Its process is similar to Fill Anything, but in this case the AIGC model is prompted to generate a background that is coherent with the retained object.
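Implementation-wise, the only change from Fill Anything is which side of the mask gets regenerated: inverting the SAM mask marks the background, rather than the object, as the region to synthesize. A small sketch under the same assumptions as above:

```python
from PIL import Image, ImageOps

# mask.png: white = the clicked object (from SAM). Inverting it marks the
# background as the region to regenerate, so the object itself is preserved.
object_mask = Image.open("mask.png").convert("L")
background_mask = ImageOps.invert(object_mask)

# Reuse the Fill Anything pipeline with the inverted mask, e.g.:
# result = pipe(prompt="a sunny park with green grass",
#               image=image, mask_image=background_mask).images[0]
```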

3.2.4 Practice

Combining foundation models to solve a task can run into incompatibilities or mismatches, so intermediate processing is needed to better coordinate the models with the task. In this work, we summarize some good composition practices for image inpainting scenarios as follows:

• Importance of dilation operations.

We observe that SAM's segmentation results (i.e., object masks) may contain discontinuities, non-smooth boundaries, or holes inside object regions. These issues make it harder to remove or fill objects effectively, so we apply dilation operations to refine the mask. In addition, for filling objects, a larger mask gives the AIGC model more creative room, which helps "align" the result with the user's intention; therefore, a heavier dilation is employed in Fill Anything.
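A small sketch of this mask cleanup with OpenCV; the kernel size and iteration counts below are illustrative knobs, not values prescribed by the paper:

```python
import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)  # white = object

# Closing (dilate, then erode) fills small holes and gaps inside the mask.
kernel = np.ones((9, 9), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

# Dilation pushes the mask slightly past the object boundary; Fill Anything
# benefits from a heavier dilation than Remove Anything.
mask_remove = cv2.dilate(mask, kernel, iterations=1)
mask_fill = cv2.dilate(mask, kernel, iterations=5)
```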

• Importance of fidelity.

Most state-of-the-art AIGC models (such as Stable Diffusion) require input images at a fixed resolution, typically 512×512. Simply resizing the image to this resolution causes a loss of fidelity that can degrade the final inpainting result. It is therefore necessary to preserve the original image quality, for example by using cropping techniques or by maintaining the image's aspect ratio when resizing.
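One way to realize this (an assumption about the implementation, not a detail stated in the paper) is to crop a fixed-size window around the mask, inpaint only the crop at the model's native resolution, and paste the result back into the full image:

```python
import numpy as np
from PIL import Image

def crop_around_mask(image: Image.Image, mask: Image.Image, size: int = 512):
    """Crop a size x size window centered on the mask (assumes the image is
    at least `size` pixels in each dimension) so SD sees native-resolution pixels."""
    ys, xs = np.nonzero(np.array(mask))
    cy, cx = int(ys.mean()), int(xs.mean())
    left = min(max(cx - size // 2, 0), image.width - size)
    top = min(max(cy - size // 2, 0), image.height - size)
    box = (left, top, left + size, top + size)
    return image.crop(box), mask.crop(box), box

# After inpainting the crop, paste it back at full resolution, e.g.:
# crop_img, crop_mask, box = crop_around_mask(full_image, full_mask)
# full_image.paste(inpainted_crop, box[:2])
```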

• Importance of prompts.

Our study shows that text prompts have a significant impact on AIGC models. We observe that in text-prompted inpainting scenarios, simple prompts such as "a teddy bear on a bench" or "a Picasso painting on the wall" often already yield satisfactory results. Longer, more elaborate prompts can produce impressive results, but they tend to be less user-friendly.

4 Experiments

We evaluate Remove Anything, Fill Anything, and Replace Anything in Inpaint Anything on three tasks: removing an object, filling an object region, and replacing the background, respectively. Test images are collected from the COCO dataset [9], the LaMa test set [13], and photos taken with our mobile phones. The results are shown in Figures 2, 3, and 4. They demonstrate that the proposed Inpaint Anything is versatile and robust, effectively inpainting images with diverse content, resolutions, and aspect ratios.

 

5 Conclusion 

Inpaint Anything (IA) is a versatile tool that combines Remove Anything, Fill Anything, and Replace Anything.

Built on a segmentation foundation model, a SOTA inpainting model, and an AIGC model, IA achieves mask-free image inpainting and supports a user-friendly interaction mode: "click to remove, prompt to fill".

Additionally, IA can handle a wide variety of input images, including images with arbitrary aspect ratios and 2K resolution. This project demonstrates the power of fully leveraging existing large-scale AI models and the potential of "composable AI".

We are also more than willing to help share and promote new projects built on Inpaint Anything (IA). In the future, we will further develop IA to support more practical functions, such as fine-grained image matting and editing, and apply it to more real-world applications.
