Computer vision: remove, fill, and replace anything with Inpaint Anything

Table of contents

1 Introduction to Inpaint Anything

1.1 Why we need Inpaint Anything

1.2 Working principle of Inpaint Anything

1.3 What Inpaint Anything can do

1.4 Segment Anything Model (SAM)

1.5 Inpaint Anything

1.5.1 Remove any object

1.5.2 Fill any content

1.5.3 Replace any content

1.5.4 Practice

1.6 Experimental summary

2 Inpaint Anything deployment and operation

2.1 conda environment preparation

2.2 Operating environment installation

2.3 Model download

3 Inpaint Anything running effect display

3.1 Remove Anything

3.2 Fill Anything

3.3 Replace Anything

3.4 Remove Anything Video

4 Summary


1 Introduction to Inpaint Anything

Mark an object with a single click and you can remove it, fill its region with new content, or replace the entire background, covering typical image inpainting scenarios including object removal, object filling, and background replacement.

Modern image inpainting systems often struggle with mask selection and hole filling. Building on the Segment Anything Model (SAM), the authors make the first attempt at mask-free image inpainting and propose a new paradigm called "Inpaint Anything (IA)": click and fill.

The core idea of IA is to combine the strengths of different models to build a powerful, user-friendly pipeline for solving inpainting-related problems. IA supports three main functions:

  • Remove any object: the user clicks on an object and IA removes it, filling the resulting "hole" so that it blends smoothly with the surrounding context;
  • Fill any content: after removing an object, the user can give IA a text prompt, and IA fills the hole with correspondingly generated content by driving an AIGC model such as Stable Diffusion [11];
  • Replace any background: the user can keep the click-selected object and replace the rest of the background with a newly generated scene.

Paper: https://arxiv.org/pdf/2304.06790.pdf

Code: https://github.com/geekyutao/Inpaint-Anything


1.1 Why we need Inpaint Anything

  • State-of-the-art image inpainting methods such as LaMa, RePaint, MAT, and ZITS have made great progress in inpainting large regions and handling complex repetitive structures. They can inpaint high-resolution images successfully and generally generalize well to new images. However, they require fine-grained masks of the target regions as input, which is essential for both training and inference.
  • The Segment Anything Model (SAM) is a powerful segmentation foundation model that can generate high-quality object masks from input prompts such as points or boxes, and can produce comprehensive, accurate masks for all objects in an image. However, the use of its mask predictions for downstream tasks has not been fully explored.
  • Furthermore, existing inpainting methods can only fill removed regions with surrounding context. AIGC models open up new opportunities for creation and have the potential to meet a large demand by helping people generate the content they want.
  • Thus, by combining the strengths of SAM, a state-of-the-art image inpainter such as LaMa, and AI-generated content (AIGC) models, the authors provide a powerful, user-friendly pipeline for solving more general inpainting-related problems such as object removal, new content filling, and background replacement.

1.2 Working principle of Inpaint Anything

Inpaint Anything combines a vision foundation model (SAM), an image inpainting model (LaMa), and an AIGC model (Stable Diffusion).

  • SAM (Segment Anything Model) generates high-quality object segmentation regions from input prompts such as points or boxes, achieving segmentation of the specified target.
  • The image inpainting model LaMa can remove arbitrary elements from high-resolution images. Its main architecture is shown in the figure below (a minimal sketch of its Fourier branch follows this list). The inputs are a black-and-white mask image and the original image. The mask is overlaid on the image and fed into the inpainting network, which first downsamples to low resolution, then passes through several fast Fourier convolution (FFC) residual blocks, and finally upsamples to produce a high-resolution inpainted image.

[Figure: LaMa inpainting network architecture]

  • With the AIGC model Stable Diffusion, simply entering a piece of text is enough to quickly generate a corresponding image.
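
To make the FFC idea concrete, here is a minimal PyTorch sketch of the global (spectral) branch that gives LaMa its image-wide receptive field. It is an illustrative simplification, not LaMa's actual block, which adds a local convolutional branch, batch norm, and channel splitting:

import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Toy version of the Fourier branch in LaMa's FFC block: a 1x1
    convolution applied in the frequency domain touches every pixel
    at once, giving a global receptive field."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")        # complex spectrum
        f = torch.cat([spec.real, spec.imag], dim=1)   # real/imag as channels
        f = torch.relu(self.conv(f))
        real, imag = f.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")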

Combining the three models enables a wide range of functions. This article implements three of them: removing any object from an image or video, filling any object region in an image, and replacing the entire background of an image. The implementation steps are shown below:

[Figure: Inpaint Anything implementation steps — Remove Anything / Fill Anything / Replace Anything]

1.3 What Inpaint Anything can do

  • Remove arbitrary objects with SAM + a SOTA inpainter: with IA, users can remove a specific object from an image simply by clicking on it. IA also offers the option to fill the resulting "hole" with surrounding context. For this, the approach combines the strengths of SAM and state-of-the-art inpainters such as LaMa. The mask predicted by SAM, refined by erosion and dilation, is fed to the inpainting model as a clear indication of the object region to be erased and filled.
  • Fill or replace arbitrary content with SAM + AIGC models:

        (1) After removing the object, IA offers two options for filling the resulting "hole": surrounding context or "new content". Specifically, a strong AI-generated content (AIGC) model such as Stable Diffusion [11] is leveraged to generate new objects from text prompts. For example, a user could type the word "dog" or a sentence like "a cute dog sitting on a bench" to generate a new dog to fill the hole.

        (2) In addition, the user can choose to keep the click-selected object and replace the remaining background with a newly generated scene. IA supports multiple ways of prompting AIGC models, such as using a different image as a visual prompt or a short caption as a text prompt. For example, a user could keep the dog in the image but replace the original indoor background with an outdoor one.


1.4 Segment Anything Model (SAM)

Segment Anything is a ViT-based CV model trained on a large visual corpus (SA-1B). SAM demonstrates promising segmentation capabilities in various scenarios, and the underlying model has great potential in the field of computer vision. This is a pioneering step towards visual artificial general intelligence, and SAM was once hailed as the "CV version of ChatGPT".

  • SOTA inpainter: image inpainting, an ill-posed inverse problem, has been extensively studied in computer vision and image processing. Its goal is to replace the missing regions of a corrupted image with content whose structure and texture are visually plausible. In Inpaint Anything (IA), the authors adopt LaMa, a simple single-stage mask-based inpainting method that combines fast Fourier convolutions (FFC), a perceptual loss, and an aggressive training-mask generation strategy, and that excels at generating repetitive visual structures.
  • AIGC models: ChatGPT and other generative AI (GAI) technologies fall under artificial intelligence-generated content (AIGC), which covers the creation of digital content such as images, music, and natural language by AI models. AIGC is regarded as a novel way of content creation and exhibits state-of-the-art performance across many content generation tasks. In IA, the authors directly use the powerful AIGC model Stable Diffusion to generate the desired content inside holes from text prompts.
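
To make the SAM interface concrete, here is a minimal click-to-mask sketch using the official segment_anything package. The checkpoint path matches section 2.3, and the example image and click coordinates are taken from the Remove Anything command in section 3.1:

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H SAM checkpoint (downloaded in section 2.3)
sam = sam_model_registry["vit_h"](checkpoint="./pretrained_models/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an HxWx3 uint8 RGB array
image = cv2.cvtColor(cv2.imread("./example/remove-anything/dog.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click (label 1) at pixel (x=200, y=450)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[200, 450]]),
    point_labels=np.array([1]),
    multimask_output=True,            # returns 3 candidate masks
)
best_mask = masks[np.argmax(scores)]  # HxW boolean mask of the clicked object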

1.5 Inpaint Anything

The principle of Inpaint Anything (IA) is to combine off-the-shelf foundation models to solve a wide range of image inpainting problems. By combining their strengths, IA is able to generate high-quality inpainted images. Specifically, IA comprises three schemes: Remove Anything, Fill Anything, and Replace Anything, for removing, filling, and replacing anything, respectively.

1.5.1 Remove any object

Remove Anything addresses the object removal problem: users can eliminate any object from an image while keeping the resulting image visually plausible.

Remove Anything consists of three steps: click, segment, and remove.

  • In the first step, the user clicks on the object they want to remove from the image.
  • Next, a foundation segmentation model such as Segment Anything automatically segments the object and creates a mask based on the click location.
  • Finally, an advanced inpainting model such as LaMa [13] uses the mask to fill the hole left by the removed object.

Since the object is no longer present in the image, the inpainting model fills the hole with background information.

Note that throughout the process, users only need to click on the object they want to remove from the image.
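
Putting the three steps together, a minimal end-to-end sketch looks like the following. To stay self-contained it uses OpenCV's classical Telea inpainting as a stand-in for LaMa (the real pipeline calls the LaMa model downloaded in section 2.3); best_mask is the SAM output from the sketch in section 1.4:

import cv2
import numpy as np

# Steps 1-2 (click + segment): convert SAM's boolean mask to 0/255
mask = best_mask.astype(np.uint8) * 255

# Refine the mask by dilation so it fully covers the object (section 1.5.4)
kernel = np.ones((15, 15), np.uint8)
mask = cv2.dilate(mask, kernel, iterations=1)

# Step 3 (remove): fill the hole from surrounding context.
# Classical stand-in only -- IA uses LaMa here for far better quality.
image_bgr = cv2.imread("./example/remove-anything/dog.jpg")
result = cv2.inpaint(image_bgr, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("./results/removed.png", result)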

1.5.2 Fill any content

Fill Anything allows users to fill any object in an image with whatever they want.

The tool consists of four steps: click, segment, text prompt, and generate.

  • The first two steps of Fill Anything are the same as in Remove Anything.
  • In the third step, the user enters a text prompt describing what they want to fill the object's region with.
  • Finally, a powerful AIGC model such as Stable Diffusion [11] is employed to generate the desired content in the hole based on the text prompt, as sketched below.
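
A minimal sketch of this generation step with the diffusers version pinned in section 2.2; the model directory matches section 2.3, and the image and mask paths are hypothetical, assumed to be pre-processed to 512×512 (see section 1.5.4):

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the Stable Diffusion 2 inpainting weights downloaded in section 2.3
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "./stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("./results/image_512.png")  # RGB, 512x512
mask = Image.open("./results/mask_512.png")    # white pixels = region to regenerate

result = pipe(
    prompt="a teddy bear on a bench",
    image=image,
    mask_image=mask,
).images[0]
result.save("./results/filled.png")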

1.5.3 Replace any content

Replace Anything keeps the selected object and replaces everything around it with a new background. The process is similar to Fill Anything, but here the AIGC model is prompted to generate a background that is compatible with the retained object; see the mask inversion sketched below.
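
Mechanically, the only difference from Fill Anything is which side of the SAM mask is handed to the diffusion model. A short sketch, reusing best_mask from the section 1.4 example:

import numpy as np
from PIL import Image

# Invert the object mask: the *background* becomes the region to regenerate,
# while the clicked object is kept as-is.
background_mask = Image.fromarray((~best_mask).astype(np.uint8) * 255)
# background_mask is then passed as mask_image to the same inpainting
# pipeline shown in section 1.5.2.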

1.5.4 Practice

Combining foundation models to solve a task can run into incompatibilities or mismatches, so intermediate processing is needed to coordinate the models with the task. In this study, the authors summarize some good composition practices for image inpainting scenarios as follows:

  • The importance of dilation operations.

We observe that SAM's segmentation results (i.e., object masks) may contain discontinuities, non-smooth boundaries, or holes inside the object region. These issues make it hard to remove or fill objects cleanly, so a dilation operation is used to refine the mask. In addition, for object filling, a larger mask gives the AIGC model more creative room, which helps "align" the result with user intent; therefore Fill Anything uses a large dilation (compare --dilate_kernel_size 15 vs. 50 in the section 3 commands), as sketched below.
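
A minimal sketch of this mask refinement, matching the --dilate_kernel_size flag used by the commands in section 3:

import cv2
import numpy as np

def dilate_mask(mask: np.ndarray, kernel_size: int) -> np.ndarray:
    """Grow a binary (0/255) mask so it fully covers object boundaries
    and closes small holes left by segmentation."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)

# mask: 0/255 uint8 object mask, e.g. best_mask.astype(np.uint8) * 255
remove_mask = dilate_mask(mask, kernel_size=15)  # Remove Anything: modest growth
fill_mask = dilate_mask(mask, kernel_size=50)    # Fill Anything: aggressive growth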

  • The importance of fidelity.

Most state-of-the-art AIGC models (such as Stable Diffusion) expect a fixed input resolution, typically 512×512. Naively resizing the whole image to this resolution loses fidelity, which hurts the final inpainting result. It is therefore necessary to preserve the original image quality, for example by cropping around the mask or keeping the aspect ratio when resizing; one such crop is sketched below.
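
A minimal sketch of a fidelity-preserving crop. This is an illustrative strategy under stated assumptions, not necessarily the repo's exact implementation: take a 512×512 window centered on the mask, run the diffusion model on the crop, and paste the result back at the recorded offset.

import numpy as np

def crop_around_mask(img: np.ndarray, mask: np.ndarray, size: int = 512):
    """Crop a size x size window centered on the mask so the diffusion
    model sees full-resolution pixels instead of a downscaled image."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(ys.mean()), int(xs.mean())
    h, w = img.shape[:2]
    # Clamp the window so it stays inside the image bounds
    top = min(max(cy - size // 2, 0), max(h - size, 0))
    left = min(max(cx - size // 2, 0), max(w - size, 0))
    return (img[top:top + size, left:left + size],
            mask[top:top + size, left:left + size],
            (top, left))

After generation, only the 512×512 crop is replaced at (top, left), so the rest of the image keeps its original pixels untouched.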

  • The importance of prompts.

The study shows that text prompts have an important impact on AIGC models. In text-prompted inpainting scenarios, simple prompts such as "a teddy bear on a bench" or "a Picasso painting on the wall" usually already yield satisfactory results. Longer, more complex prompts can produce impressive results too, but they tend to be less user-friendly.


1.6 Experimental summary

The authors evaluated Remove Anything, Fill Anything, and Replace Anything in Inpaint Anything, in the three cases of removing objects, filling objects, and replacing backgrounds, respectively. The authors collected test images from the COCO dataset, the LaMa test set, and photos taken by mobile phones. Experimental results demonstrate that the proposed Inpaint Anything is versatile and robust, and can effectively inpaint images with diverse contents, resolutions, and aspect ratios.


2 Inpaint Anything deployment and operation

2.1 conda environment preparation

For conda environment preparation, see the Anaconda installation guide.

2.2 Operating environment installation

# Get the code
git clone https://github.com/geekyutao/Inpaint-Anything
cd Inpaint-Anything

# Create and activate a dedicated conda environment
conda create -n ia python=3.9
conda activate ia

# PyTorch stack (torchvision 0.15.2 pulls in the matching torch 2.0.1)
pip install torchvision==0.15.2
pip install torchaudio==2.0.2

# SAM (editable install) and LaMa dependencies
pip install -e segment_anything
pip install -r lama/requirements.txt

# Stable Diffusion dependencies
pip install diffusers==0.16.1
pip install transformers==4.30.2
pip install accelerate==0.19.0
pip install scipy==1.11.1
pip install safetensors==0.3.1

pip install numpy==1.23.5

# Remove Anything Video dependencies
pip install jpeg4py==0.1.4
pip install lmdb==1.4.1

2.3 Model download

(1) Remove Anything model

Create a model storage directory:

mkdir -p pretrained_models/big-lama

SAM model download: SAM address

LaMa model download: LaMa address

After downloading the model files from the addresses above:

  • move the SAM checkpoint (sam_vit_h_4b8939.pth) into pretrained_models/
  • move the LaMa model files into pretrained_models/big-lama/

When done, the directory listing should look like this:

[root@localhost Inpaint-Anything]# ll pretrained_models/
total 2504448
drwxr-xr-x 3 root root         51 Aug  4 18:13 big-lama
-rw-r--r-- 1 root root 2564550879 Aug  4 15:32 sam_vit_h_4b8939.pth

[root@localhost Inpaint-Anything]# ll pretrained_models/big-lama/
total 4
-rw-r--r-- 1 root root 3947 Aug  4 15:28 config.yaml
drwxr-xr-x 2 root root   31 Aug  4 15:28 models

(2) Fill Anything model

mkdir -p stabilityai/stable-diffusion-2-inpainting

Model download address: Hugging Face address. After the download completes, store the model files in the directory above.

(3) Remove Anything Video model

Model download address: STTN model. Move the downloaded sttn.pth into the pretrained_models directory, then create the tracker directory:

mkdir -p pytracking/pretrain

Model download address: OSTrack model. After the download completes, store it in the directory above.

3 Inpaint Anything running effect display

3.1 Remove Anything

(1) Remove objects by specifying coordinate points

python remove_anything.py \
    --input_img ./example/remove-anything/dog.jpg \
    --coords_type key_in \
    --point_coords 200 450 \
    --point_labels 1 \
    --dilate_kernel_size 15 \
    --output_dir ./results \
    --sam_model_type "vit_h" \
    --sam_ckpt ./pretrained_models/sam_vit_h_4b8939.pth \
    --lama_config ./lama/configs/prediction/default.yaml \
    --lama_ckpt ./pretrained_models/big-lama

After the command runs successfully, the results are stored in the ./results directory.

 

(2) Remove objects by clicking

python remove_anything.py \
    --input_img ./example/remove-anything/dog.jpg \
    --coords_type click \
    --point_coords 200 450 \
    --point_labels 1 \
    --dilate_kernel_size 15 \
    --output_dir ./results \
    --sam_model_type "vit_h" \
    --sam_ckpt ./pretrained_models/sam_vit_h_4b8939.pth \
    --lama_config ./lama/configs/prediction/default.yaml \
    --lama_ckpt ./pretrained_models/big-lama

This mode requires a display: the object is selected interactively by clicking in the image window.

3.2 Fill Anything

Fill an object region by specifying coordinate points and a text prompt:

python fill_anything.py \
    --input_img ./example/fill-anything/sample1.png \
    --coords_type key_in \
    --point_coords 750 500 \
    --point_labels 1 \
    --text_prompt "a teddy bear on a bench" \
    --dilate_kernel_size 50 \
    --output_dir ./results \
    --sam_model_type "vit_h" \
    --sam_ckpt ./pretrained_models/sam_vit_h_4b8939.pth

Text prompt: "a teddy bear on a bench" 

 

3.3 Replace Anything

Replace the background around an object by specifying coordinate points and a text prompt:

python replace_anything.py \
    --input_img ./example/replace-anything/dog.png \
    --coords_type key_in \
    --point_coords 750 500 \
    --point_labels 1 \
    --text_prompt "sit on the swing" \
    --output_dir ./results \
    --sam_model_type "vit_h" \
    --sam_ckpt ./pretrained_models/sam_vit_h_4b8939.pth

Text prompt: "a man in office"

 

3.4 Remove Anything Video

python remove_anything_video.py \
    --input_video ./example/video/paragliding/original_video.mp4 \
    --coords_type key_in \
    --point_coords 652 162 \
    --point_labels 1 \
    --dilate_kernel_size 15 \
    --output_dir ./results \
    --sam_model_type "vit_h" \
    --sam_ckpt ./pretrained_models/sam_vit_h_4b8939.pth \
    --lama_config lama/configs/prediction/default.yaml \
    --lama_ckpt ./pretrained_models/big-lama \
    --tracker_ckpt vitb_384_mae_ce_32x4_ep300 \
    --vi_ckpt ./pretrained_models/sttn.pth \
    --mask_idx 2 \
    --fps 25

After the command finishes, the output video, with the selected object removed from every frame, is stored in the ./results directory.

4 Summary

Inpaint Anything (IA) is a multifunctional tool that combines Remove Anything, Fill Anything, and Replace Anything. Built on a segmentation model, a SOTA inpainting model, and an AIGC model, IA achieves mask-free image inpainting and supports a user-friendly interaction mode: "click to remove, prompt to fill".

Additionally, IA can handle a wide variety of high-quality input images, including arbitrary aspect ratios and 2K resolutions. The project leverages the capabilities of existing large-scale AI models and demonstrates the potential of "composable AI". In the future, Inpaint Anything (IA) will be further developed to support more practical functions, such as fine-grained image matting and editing, and to serve more real-world applications.

Origin blog.csdn.net/lsb2002/article/details/132090092