Interpretation of the Recognize Anything Model (RAM) and Its Predecessor Tag2Text

img

Overview

Hello everyone, I am "Chen Chengnan", an AI algorithm engineer at a large tech company, bringing you the latest cutting-edge AI knowledge and tools. Welcome to exchange ideas!

Following Meta AI's SAM, OPPO Research Institute released the Recognize Anything Model (RAM):

  • Project link: https://recognize-anything.github.io/
  • Demo link: https://huggingface.co/spaces/xinyu1205/Tag2Text
  • Source link: https://github.com/xinyu1205/recognize-anything
  • Paper link: https://arxiv.org/pdf/2306.03514.pdf
img

Whether you look at the paper, the source code on GitHub, or the demo, it is not hard to see that RAM is essentially an enhanced Tag2Text. The "recognition" in RAM is really an image tagging task, and Tag2Text, proposed by the same team, is a large-scale pre-training framework for image tagging.

Image tagging: given an image, the goal is to provide semantic labels by predicting multiple tags for it. You can think of it as describing the image with several tags covering objects, scenes, attributes, and actions; it is a multi-label classification problem.
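
As a toy illustration of what multi-label tagging looks like in code (the tags, scores, and threshold below are made up for illustration, not from the paper):

```python
import torch

# Each tag gets an independent sigmoid probability, so several tags can be
# "on" for one image at the same time.
tags = ["dog", "grass", "running", "sunny"]      # hypothetical tag vocabulary
logits = torch.tensor([2.1, 0.3, 1.5, -1.0])     # per-tag scores from a tagging head
probs = torch.sigmoid(logits)
predicted = [t for t, p in zip(tags, probs) if p > 0.5]
print(predicted)  # e.g. ['dog', 'grass', 'running']
```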

The zero-shot capability of the Segment Anything Model (SAM) [12] is very strong, but it only provides localization, not recognition (SAM can output segmentation masks but cannot say which category each mask belongs to). RAM is therefore designed to provide strong recognition capability, including zero-shot recognition. The authors also combined RAM with localization models (SAM, Grounding DINO) in the Grounded-SAM project, so that localization and recognition can be achieved at the same time. The figure below shows the characteristics of SAM, RAM, and other models as summarized by the authors.

img

Since most of RAM builds on Tag2Text, I will introduce Tag2Text first. Readers already familiar with the Tag2Text paper can skip ahead to the RAM section.

Tag2Text: Guiding Vision-Language Model via Image Tagging

Tag2Text is a vision-language pre-training (VLP) framework. In it, the authors guide the model to learn better visual-language features by introducing an image tagging task into the vision-language model. Image tagging means attaching multiple relevant labels to a picture, much like multi-label classification.

img

As shown in the figure above, previous methods (OSCAR [32], VIVO [21], VinVL [61]) followed a detector-based paradigm: object tags are used as anchors to simplify the semantic alignment between images and text. These methods use a detector to extract image features, which are then fed into a multimodal interaction module for learning. The detector's parameters are frozen (fine-tuning them makes detection performance drop sharply), so the detector cannot be optimized, and its quality limits how well visual-language features can be learned.

The authors propose to use image tagging as an additional task for vision-language pre-training. There are two key issues: data and network structure.

Data problem

Introducing image tagging requires constructing tags for each image to serve as training labels. Since image-text pairs are abundant, the authors automatically parse the semantics of the text in each image-text pair to extract the image's tags from the text. Tags obtained this way form a better bridge between images and text: the parsed tag categories are more diverse and richer than object-detection categories, covering scenes, attributes, actions, and so on.

Mining tags from text to construct the training data involves two key steps (a minimal sketch follows the list):

  1. **Parse to obtain tags:** use a parser [58] to identify entities (head + modifier) and relationships in the text, then map them to tags: head -> object and scene, modifier -> attribute, relationship -> action;
  2. **Filter effective tags:** collect the parsed tag set, sort it by tag frequency, keep only the 5,000 most frequent tags for manual screening, and finally retain 3,429 tag categories as the effective tag vocabulary;
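
Below is a minimal, hypothetical sketch of this tag-mining pipeline in Python. The `parse` function is only a placeholder for the parser cited as [58]; the entity/relation mapping and the top-k filtering mirror the two steps above, but none of this is the authors' actual code.

```python
from collections import Counter

def parse(text):
    # Placeholder: plug in a real scene-graph / dependency parser here.
    # It should return (entities, relations), each entity being (head, modifier).
    return [], []

def mine_tags(texts, top_k=5000):
    counter = Counter()
    per_image = []
    for text in texts:
        entities, relations = parse(text)
        tags = set()
        for head, modifier in entities:
            tags.add(head)              # head -> object / scene tag
            if modifier:
                tags.add(modifier)      # modifier -> attribute tag
        tags.update(relations)          # relation -> action tag
        per_image.append(tags)
        counter.update(tags)
    vocab = {t for t, _ in counter.most_common(top_k)}  # keep most frequent tags
    # the paper then screens the top tags manually, keeping 3,429 categories
    return [tags & vocab for tags in per_image], vocab
```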

Network structure

As shown in the figure below, the framework contains three branches, Tagging, Generation, and Alignment, each corresponding to a different task. After training they can be used for different downstream subtasks, shown on the right of the figure: multi-label recognition (i.e., tagging), image caption generation, visual QA, and image-text retrieval.

img

Image Tagging: the multi-label classification transformer decoder from Query2Label [35] is used (following the DETR-style query idea, as shown in the figure below). To handle missing tags and the imbalance between positive and negative samples in the parsed tags, Asymmetric Loss (ASL) [44] is used (a minimal loss sketch follows the figure).

img
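
For reference, here is a minimal sketch of Asymmetric Loss for multi-label tagging. The hyperparameter values are common ASL defaults and are assumptions, not necessarily the paper's exact settings:

```python
import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    clip_margin=0.05, eps=1e-8):
    """Minimal ASL sketch. logits/targets: (batch, num_tags), targets in {0, 1}."""
    p = torch.sigmoid(logits)
    # probability shifting for negatives, which down-weights easy negatives
    p_neg = (p - clip_margin).clamp(min=0)
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()
```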

Image-Tag-Text Generation: a standard transformer encoder-decoder framework from NLP is used. Tags and text are mapped to embeddings via a tokenizer plus an embedding matrix; the tag embeddings (randomly shuffled, so that their order does not affect learning) and the image embeddings (features) are fed into the encoder together, then the decoder generates an output that is compared against the text embeddings for the loss. In effect, the tags guide the image to generate the text.
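
A minimal sketch of this generation branch using generic PyTorch transformer modules; dimensions, layer counts, and sequence lengths are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

d_model, n_head = 768, 8
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, n_head, batch_first=True), num_layers=2)
dec = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, n_head, batch_first=True), num_layers=2)

image_emb = torch.randn(1, 197, d_model)              # image features from the image encoder
tag_emb = torch.randn(1, 5, d_model)                  # embeddings of the parsed tags
tag_emb = tag_emb[:, torch.randperm(5)]               # shuffle tags so their order carries no signal
memory = enc(torch.cat([tag_emb, image_emb], dim=1))  # tags + image jointly encoded
text_emb = torch.randn(1, 12, d_model)                # (teacher-forced) text token embeddings
out = dec(text_emb, memory)                           # decoder output, compared against the text for the loss
```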

**Image-Text Alignment:** the encoder structure from BLIP [29] is used (as shown below). Image embeddings and text embeddings are fed into the encoder and supervised separately by a coarse-grained Image-Text Contrastive (ITC) loss and a fine-grained Image-Text Matching (ITM) loss (a sketch of the ITC loss follows the figure).

img
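
For reference, a minimal sketch of the coarse-grained ITC loss; the fine-grained ITM loss (a binary matched/unmatched classification on fused features) is omitted here:

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: pooled, projected embeddings of a matched batch, shape (B, d)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, labels)             # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)         # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```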

RAM: Recognize Anything: A Strong Image Tagging Model

Model structure

img

As shown in the figure, RAM is similar in structure to Tag2Text. Tag2Text has three branches: tagging, generation, and alignment. RAM keeps only two branches, Tagging and Generation: the Tagging branch performs multi-tag inference to complete the recognition task, and the Generation branch handles the image caption task. The alignment branch in Tag2Text, which was used to learn visual-language features, is removed here.

  • The image encoder uses Swin;
  • During training, both the Tagging branch and the Generation branch use the parsed tags as labels;
  • During testing, the Tagging branch outputs tags, which the Generation branch then uses in the caption task to produce the final text;

To sum up, RAM's network is basically the same as Tag2Text's; the main difference is the extra CLIP-related parts in the figure, covered in the open-vocabulary recognition section below.

Open-vocabulary recognition

  • Inspired by [23, 28], the authors expand each tag with prompt ensembling [22] and feed the prompts into a trained CLIP text encoder to obtain the corresponding textual label queries (essentially the embeddings of prompt + tag). The authors argue that these queries carry stronger semantic and contextual information than learnable parameters. The label queries are then sent to the Tagging decoder for cross-attention with the image features (see the sketch after this list).

    • If a tag is not expanded with prompts, it is too short on its own, and the embedding obtained from the text encoder is relatively poor;
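
A minimal sketch of this prompt ensembling step, assuming the OpenAI `clip` package; the prompt templates here are illustrative, not the authors' exact set:

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Illustrative prompt templates; the paper uses its own ensemble of templates.
templates = ["a photo of a {}.", "a picture of the {}.", "an image containing a {}."]

def build_label_query(tag: str) -> torch.Tensor:
    prompts = [t.format(tag) for t in templates]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)               # (num_prompts, d)
        emb = emb / emb.norm(dim=-1, keepdim=True)    # normalize each prompt embedding
    query = emb.mean(dim=0)                           # ensemble: average over prompts
    return query / query.norm()                       # (d,) textual label query

label_queries = torch.stack([build_label_query(t) for t in ["cat", "beach", "running"]])
# label_queries would then serve as the queries for cross-attention in the tagging decoder.
```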

In addition, the authors use CLIP's image encoder to distill RAM's image features (since CLIP's image and text features are aligned), so that RAM produces better features for categories it has never seen.
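
A minimal sketch of such a distillation term; the exact loss the authors use may differ, this only illustrates aligning RAM's image features to frozen CLIP image features:

```python
import torch
import torch.nn.functional as F

def distill_loss(ram_feat: torch.Tensor, clip_feat: torch.Tensor) -> torch.Tensor:
    """ram_feat: pooled image feature from RAM's image encoder, (B, d);
    clip_feat: CLIP image feature for the same images, projected to the same d."""
    ram_feat = F.normalize(ram_feat, dim=-1)
    clip_feat = F.normalize(clip_feat.detach(), dim=-1)      # CLIP is the frozen teacher
    return (1 - (ram_feat * clip_feat).sum(dim=-1)).mean()   # cosine-distance loss
```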

So overall, the difference between RAM and Tag2Text in the network framework basically comes down to this additional use of CLIP.

Data problem

In Tag2Text, the authors parse the text of each image-text pair to obtain tags, then filter by frequency and keep the top 5k; the higher the frequency, the more important the tag is assumed to be.

In RAM, the amount of data is further expanded, and the frequency filtering is extended to the top 10k. There is also a series of methods for scaling up the data; rather than going into detail, I translate the relevant passages below (see the original paper for specifics):

Labeling system: We first establish a common and unified labeling system. We combine classes from popular academic datasets (classification, detection, and segmentation) as well as commercial labeling products (Google, Microsoft, Apple). Our labeling system is obtained by merging all public labels with those in the text, thus covering most of the public labels with a moderate number of 6,449. The remaining open vocabulary labels can be identified by open set identification.

Datasets: How to automatically label large-scale images with labeling systems is another challenge [30]. Drawing inspiration from CLIP [22] and ALIGN [11], which exploit publicly available image-text pairs at scale to train powerful vision models, we employ similar datasets for image labeling. To leverage these large-scale image-text data for labeling, following [9, 10], we parse the text and obtain image labels through automatic text semantic parsing. This process enables us to obtain a wide variety of unannotated image labels based on image-text pairs.

Data Engine: However, image-text pairs from the web are inherently noisy and often contain missing or incorrect labels. To improve the quality of annotations, we design a labeled data engine. When addressing missing labels, we leverage existing models to generate additional labels. For incorrect labels, we first locate specific regions corresponding to different labels in the image. Subsequently, we employ region clustering techniques to identify and eliminate outliers within the same class. Furthermore, we filter out labels that exhibit opposite predictions between the entire image and its corresponding regions, ensuring cleaner and more accurate annotations.

Because RAM's training data covers classification, detection, and segmentation datasets, and it can recognize both seen categories (present in the training data) and unseen categories (not in the training data), the authors drew a figure comparing the recognition scope of different methods. RAM-unseen is shown in red; because RAM can recognize open-set categories, its scope is the largest.

  • PS: this is no longer a hexagonal radar chart; they just drew circles. The figure looks rather impressive.
  • PS+: at first glance I thought it was a performance comparison; only after reading the paper did I realize it compares recognition scope.
img

In addition to the data and the model structure, the authors also made some optimizations for model efficiency.

Experimental results

Finally, the experimental results: in the tables, green denotes supervised training, blue denotes zero-shot, and yellow denotes unsupervised.

img
img


Origin blog.csdn.net/qq_40491305/article/details/131211442