AIGC series: Interpretation of the Grounding DINO principle and its use with Stable Diffusion

Table of contents

1. Introduction

2. Summary of methods

3. Algorithm introduction

3.1 Image-text feature extraction and enhancement

3.2 Target detection based on text guidance

3.3 Cross-modal decoder

3.4 Text prompt feature extraction

4. Application scenarios

4.1 Combining with a generation model to complete target-region generation

4.2 Combining with Stable Diffusion to complete image editing

4.3 Combining with segmentation models to complete arbitrary image segmentation

1. Introduction

《Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection》

        The authors of Grounding DINO are from Tsinghua University and IDEA (the International Digital Economy Academy). Grounding DINO has a very powerful detection capability: it can be driven by text prompts to detect objects automatically, without manual involvement. In other words, you input text and it outputs the detections corresponding to that text. Call it "Detect Anything" for now. It can be combined with Segment Anything, released by Meta, for even more powerful results. By now, the major research fields in computer vision all have related large-model applications, such as Detect Anything, Segment Anything, Stable Diffusion, Recognize Anything, Tracking Anything...
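        Below is a minimal usage sketch of this text-prompted detection, using the inference helpers shipped with the official GroundingDINO repository. The config, checkpoint and image paths, the prompt, and the thresholds are all placeholders you would replace with your own.

```python
# Minimal text-prompted detection sketch with the helpers from the official
# GroundingDINO repository; paths and thresholds below are placeholders.
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # model config (placeholder path)
    "weights/groundingdino_swint_ogc.pth",              # pretrained weights (placeholder path)
)

# load_image returns the original image (numpy) and the preprocessed tensor
image_source, image = load_image("assets/cat_dog.jpg")   # hypothetical test image

# The text prompt lists the categories to detect, separated by " . "
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="cat . dog .",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Draw the detected boxes with their matched phrases and save the result
annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```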

2. Summary of methods

        Grounding DINO is an open-set object detection scheme that combines the Transformer-based detector DINO with grounded pre-training. The key to open-set detection is introducing natural language into a closed-set detector so that it can detect in the open world: it can detect novel categories and identify targets with specific attributes. In the zero-shot setting it reaches 52.5 AP on the COCO dataset, and 63.0 AP after fine-tuning on COCO. The main advantages are as follows:

  • Based on the Transformer architecture, which is similar to language models, so cross-modal features are easy to process;

  • Transformer-based detectors have proven capable of exploiting large-scale datasets;

  • DINO can be optimized end to end without hand-designed modules such as NMS.

3. Algorithm introduction

        For an image-text pair, Grounding DINO can output multiple pairs of target boxes and their corresponding noun phrases. Grounding DINO adopts a dual-encoder, single-decoder structure: the image backbone extracts image features, the text backbone extracts text features, the feature enhancement module fuses image and text features, the language-guided query selection module initializes the queries, and the cross-modal decoder refines the boxes. The process is as follows:

  • The image and text backbones extract the raw image and text features, respectively;

  • The feature enhancement module performs cross-modal feature fusion;

  • The language-guided query selection module selects, from the image features, the cross-modal queries that correspond to the text;

  • The cross-modal decoder extracts the required features from the cross-modal queries and updates them;

  • The output queries are used to predict the target boxes and extract the corresponding phrases (see the sketch after this list).
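        To make the flow above concrete, here is a schematic, PyTorch-style sketch of the forward pass. It is not the real implementation: every model sub-module name (image_backbone, feature_enhancer, and so on) is a stand-in chosen for readability.

```python
def grounding_dino_forward(image, captions, model, num_query=900):
    """Schematic forward pass; every model.* sub-module is a stand-in name."""
    # 1. Backbones extract the raw single-modality features
    img_feats = model.image_backbone(image)        # Swin Transformer, multi-scale image tokens
    txt_feats = model.text_backbone(captions)      # BERT token features

    # 2. The feature enhancer fuses the two modalities
    img_feats, txt_feats = model.feature_enhancer(img_feats, txt_feats)

    # 3. Language-guided query selection initialises num_query decoder queries
    #    from the image tokens most relevant to the text
    queries, anchor_boxes = model.language_guided_query_selection(
        img_feats, txt_feats, num_query)

    # 4. The cross-modal decoder refines the queries against both modalities
    queries = model.cross_modal_decoder(queries, anchor_boxes, img_feats, txt_feats)

    # 5. Heads predict a box per query and score each query against every
    #    text token; the output phrase is read off the highest-scoring tokens
    boxes = model.box_head(queries)                        # (bs, num_query, 4)
    token_logits = queries @ txt_feats.transpose(-1, -2)   # (bs, num_query, num_text_tokens)
    return boxes, token_logits
```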

3.1 Image-text feature extraction and enhancement

        Given an (image, text) pair, a Swin Transformer is used to extract image features and BERT is used to extract text features; the feature enhancement layer is block 2 of Figure 3 in the paper. Deformable self-attention is used to enhance the image features, and vanilla self-attention to enhance the text features. Influenced by GLIP, image-to-text and text-to-image cross-attention are added to help align the features of the different modalities.
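        A minimal sketch of such a fusion layer is shown below. To keep the code self-contained it substitutes ordinary multi-head attention for the deformable self-attention of the real model and omits positional embeddings and masking; only the overall layout (intra-modal self-attention followed by GLIP-style bidirectional cross-attention) follows the description above.

```python
import torch.nn as nn

class FeatureEnhancerLayer(nn.Module):
    """Simplified fusion layer; not the real Grounding DINO implementation."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Intra-modal enhancement (the real model uses deformable attention for images)
        self.img_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # GLIP-style bidirectional cross-attention between the two modalities
        self.img2txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt2img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
        self.txt_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))

    def forward(self, img, txt):
        # img: (bs, num_img_tokens, d_model), txt: (bs, num_txt_tokens, d_model)
        img = img + self.img_self_attn(img, img, img)[0]
        txt = txt + self.txt_self_attn(txt, txt, txt)[0]
        # Cross-modal fusion: image tokens attend to text and vice versa
        img_fused = img + self.txt2img_attn(img, txt, txt)[0]   # text -> image
        txt_fused = txt + self.img2txt_attn(txt, img, img)[0]   # image -> text
        return img_fused + self.img_ffn(img_fused), txt_fused + self.txt_ffn(txt_fused)
```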

3.2 Target detection based on text guidance

        To guide detection with text, the authors designed a language-guided query selection mechanism that selects the features most relevant to the text as the decoder queries. The selection outputs num_query indices into the encoder output, and the queries are initialized accordingly. Each decoder query consists of two parts: content and position. The position part is formulated as dynamic anchor boxes and is initialized from the encoder output; the content part is set to be learnable during training.
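        A small sketch of this selection step, following the query-selection pseudocode given in the paper (num_query is 900 in the paper; tensor shapes are annotated in the comments):

```python
import torch

def language_guided_query_selection(image_features, text_features, num_query=900):
    """Pick the num_query image tokens most relevant to the text prompt.

    image_features: (bs, num_img_tokens, d_model) enhanced image tokens
    text_features:  (bs, num_txt_tokens, d_model) enhanced text tokens
    Returns indices of shape (bs, num_query) into the image tokens.
    """
    # Similarity between every image token and every text token
    logits = torch.einsum("bic,btc->bit", image_features, text_features)
    # Score each image token by its best-matching text token
    scores = logits.max(dim=-1).values                      # (bs, num_img_tokens)
    # Keep the top num_query tokens; their predicted boxes initialise the
    # dynamic anchor boxes (query positions), while the query content is learnable
    return torch.topk(scores, k=num_query, dim=1).indices   # (bs, num_query)
```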

3.3 Cross-modal decoder

        The cross-modal decoder combines information from the image and text modalities. Each cross-modal query passes through a self-attention layer, an image cross-attention layer that combines it with image features, a text cross-attention layer that combines it with text features, and finally an FFN layer. Compared with DINO, each decoder layer has an additional text cross-attention layer, which injects text information to facilitate modality alignment.
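        A simplified sketch of one such decoder layer is shown below. Ordinary multi-head attention again stands in for the deformable image cross-attention of the real model, and the anchor-box positional embeddings are omitted; only the layer ordering (self-attention, image cross-attention, text cross-attention, FFN) follows the description above.

```python
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """Simplified cross-modal decoder layer; not the real implementation."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, img_feats, txt_feats):
        # 1. Queries exchange information among themselves
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        # 2. Queries gather visual evidence from the image features
        q = self.norms[1](q + self.image_cross_attn(q, img_feats, img_feats)[0])
        # 3. Queries gather language evidence from the text features (extra layer vs. DINO)
        q = self.norms[2](q + self.text_cross_attn(q, txt_feats, txt_feats)[0])
        # 4. Position-wise feed-forward network
        return self.norms[3](q + self.ffn(q))
```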

3.4 Text prompt feature extraction

        Two types of text prompts were explored in previous work. A sentence-level representation encodes the entire sentence as a single feature, which removes the interactions between words; a word-level representation can encode multiple categories in one pass, but introduces unnecessary dependencies between unrelated words. To prevent unrelated words from affecting each other, the authors introduce an attention mask, giving a sub-sentence-level representation that retains the features of each word while eliminating interactions between unrelated words.
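        Below is a minimal sketch of how such a mask could be built. The phrase-id bookkeeping is hypothetical (the real implementation derives it from the tokenizer and also handles special and separator tokens), but it shows the core idea: a text token may only attend within its own category phrase.

```python
import torch

def build_subsentence_attention_mask(token_phrase_ids):
    """Block-diagonal text attention mask: a token may only attend to tokens
    belonging to the same category phrase. token_phrase_ids is hypothetical
    bookkeeping that assigns one phrase index to each text token."""
    same_phrase = token_phrase_ids.unsqueeze(0) == token_phrase_ids.unsqueeze(1)
    return same_phrase   # (num_tokens, num_tokens) boolean mask

# Toy example: a prompt with two category phrases of two tokens each
mask = build_subsentence_attention_mask(torch.tensor([0, 0, 1, 1]))
# tensor([[ True,  True, False, False],
#         [ True,  True, False, False],
#         [False, False,  True,  True],
#         [False, False,  True,  True]])
```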

4. Application scenarios

4.1 Combining with a generation model to complete target-region generation

4.2 Combining with Stable Diffusion to complete image editing

Face editing, hairstyle changes, background changes, head swaps

Replacing pets and generating the desired content
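The sketch below shows one way to wire the two models together for target-region generation and editing (covering both 4.1 and 4.2): detect the region to change with a text prompt, turn the box into a binary mask, and hand the mask to the diffusers Stable Diffusion inpainting pipeline. All file paths, model ids, prompts and thresholds are placeholders, and the snippet assumes the first detected box is the one you want to edit.

```python
# Sketch: text-prompted detection + Stable Diffusion inpainting for region editing.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
from groundingdino.util.inference import load_model, load_image, predict

det_model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                       "weights/groundingdino_swint_ogc.pth")
image_source, image = load_image("assets/living_room.jpg")   # hypothetical input image
h, w, _ = image_source.shape

# 1. Detect the region to edit with a text prompt (assumes at least one hit)
boxes, logits, phrases = predict(det_model, image, caption="sofa .",
                                 box_threshold=0.35, text_threshold=0.25)
cx, cy, bw, bh = (boxes[0] * torch.tensor([w, h, w, h])).tolist()  # cxcywh -> pixels
x0, y0, x1, y1 = int(cx - bw / 2), int(cy - bh / 2), int(cx + bw / 2), int(cy + bh / 2)

# 2. Turn the box into a binary mask (white = region to regenerate)
mask = Image.new("L", (w, h), 0)
mask.paste(255, (x0, y0, x1, y1))

# 3. Inpaint the masked region with Stable Diffusion (output is 512x512 here)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
result = pipe(prompt="a red leather sofa",
              image=Image.fromarray(image_source).resize((512, 512)),
              mask_image=mask.resize((512, 512))).images[0]
result.save("edited.jpg")
```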

4.3 Combining with segmentation models to complete arbitrary image segmentation
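Grounding DINO only produces boxes; feeding those boxes to Segment Anything (SAM) as box prompts yields pixel-level masks for any phrase, which is the idea behind Grounded-SAM. A hedged sketch, again with placeholder paths and prompts:

```python
# Sketch: detect with text, then segment the detected box with SAM.
import torch
from segment_anything import sam_model_registry, SamPredictor
from groundingdino.util.inference import load_model, load_image, predict

det_model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                       "weights/groundingdino_swint_ogc.pth")
image_source, image = load_image("assets/street.jpg")        # hypothetical input image
h, w, _ = image_source.shape

# 1. Text-prompted detection gives boxes for the phrase of interest
boxes, logits, phrases = predict(det_model, image, caption="car .",
                                 box_threshold=0.35, text_threshold=0.25)

# 2. Convert the normalised cxcywh boxes into pixel xyxy boxes for SAM
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# 3. SAM turns a box prompt into a pixel-accurate mask
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks, scores, _ = predictor.predict(box=boxes_xyxy[0].numpy(), multimask_output=False)
print(phrases[0], masks.shape)   # one boolean mask of shape (1, h, w)
```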
