RAM (Recognize Anything) - a detailed explanation of the paper

1. Overview

1. What it is

    The full name of the RAM paper is Recognize Anything: A Strong Image Tagging Model. Unlike the common classification, detection, and segmentation tasks in the image field, it addresses a tagging task, i.e. multi-label classification (one picture can hit multiple categories), as opposed to ordinary classification (one picture hits exactly one category). As for the "anything" in the name, note that the model natively supports 6449 tags (4585 tags after merging synonyms), but unknown tags (outside the 6449) can also be recognized through the open-set method described later.

    The following are the official lists of the 6449 natively supported tags (4585 after merging synonyms). Note that the English and Chinese lists correspond one-to-one, and each file contains 4585 entries.

    Natively supported Chinese tags:https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list_chinese.txt

    Natively supported English tags:https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list.txt

2. Highlights

1) Powerful image tagging capability and strong zero-shot generalization;

2) Reproducible at low cost using open-source, manually annotated datasets; the strongest version of RAM needs only 8 A100 GPUs for 3 days of training;

3) Flexible enough for various application scenarios: it can be used on its own as a tagging system, or combined with Grounding DINO and SAM for multi-label segmentation.

3. Improvements over Tag2Text

Higher accuracy. RAM uses a data engine to generate additional annotations and clean up incorrect tags, giving higher accuracy than Tag2Text. (See the data processing section later for details.)

More tag categories. RAM raises the number of fixed tags from 3,400 to 6,400 (reduced to 4,500 semantically distinct tags after merging synonyms), covering more valuable categories. In addition, RAM has open-set capability and can recognize tags not seen during training.

PS

This paper, from OPPO, is fairly detailed. If you have questions about specifics, refer to the paper itself. Details not covered in the paper are listed at the end of this post as open items.

2. Model

    1. Model structure

    PS: Since the official training code has not been released, and some inconsistencies were found when comparing the inference code with the paper (described later), the following only describes what can be observed and inferred so far, and may not be fully accurate.

    The paper mentions that RAM keeps only the Tagging and Generation branches of Tag2Text:

    * The main image encoder is a Swin Transformer, available in two versions, Swin-B and Swin-L;

    * The Tagging branch performs multi-tag inference to complete the recognition task; it uses BERT (BERT in the code, described as a 2-layer transformer in the paper);

    * The Generation branch handles the image captioning task with an encoder-decoder built from a 12-layer transformer;

    * The Alignment branch (used in Tag2Text for learning visual-language features) has been removed here;

    * Two offline models are also involved: CLIP (whose image encoder and text encoder are introduced later) and SceneGraphParser (a version modified by OPPO).
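    To make the structure above concrete, here is a speculative skeleton of the tagging path only, pieced together from the paper and the released inference code; the layer sizes, attention setup, and class name are my assumptions, not the official implementation.

    import torch
    import torch.nn as nn

    class RAMTaggingSkeleton(nn.Module):
        """Speculative sketch: Swin patch features -> label queries -> per-tag logits."""
        def __init__(self, num_tags=4585, embed_dim=768):
            super().__init__()
            # Placeholder for the Swin-B / Swin-L image encoder (patch features in, patch features out).
            self.image_encoder = nn.Identity()
            # One learnable textual label query per tag (see the training notes below).
            self.label_queries = nn.Parameter(torch.randn(num_tags, embed_dim) * 0.02)
            # Tagging branch: label queries cross-attend to image tokens
            # (BERT-style layers in the released code; 2 layers according to the paper).
            layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
            self.tagging_branch = nn.TransformerDecoder(layer, num_layers=2)
            self.tag_head = nn.Linear(embed_dim, 1)

        def forward(self, images):
            # images: (B, N, embed_dim) patch features in this placeholder setup
            image_tokens = self.image_encoder(images)
            queries = self.label_queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
            x = self.tagging_branch(queries, image_tokens)   # (B, num_tags, embed_dim)
            return self.tag_head(x).squeeze(-1)              # (B, num_tags) tag logits

    model = RAMTaggingSkeleton()
    logits = model(torch.randn(2, 196, 768))   # 2 images, 196 patch tokens each
    print(logits.shape)                        # torch.Size([2, 4585])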

    2. Training process

    PS: The data-processing details are set aside here to focus on the overall procedure; many of these details are not yet open-sourced, so this can only be an approximation.

    Notes:

    1) The training process does not involve the CLIP text encoder shown on the right of the figure above. The N categories correspond to N textual label queries, i.e. learnable parameters. With the paper's 4585 categories and a 768-dimensional embedding per category, this is a 4585*768 parameter matrix.

    2) Each training sample is an image-Tag-text triplet, corresponding to one network input (the image; the label queries are the network's own learnable parameters) and two outputs (a text description and a multi-label classification). The loss is the usual text-generation loss plus the multi-label ASL loss (see the sketch after this list).

    3) The text input of the image-Tag-interaction encoder is the tags parsed from the caption, not the model's own output (at inference time it is the model's output).

    4) At some point during training (the paper does not give details), the output of the CLIP image encoder is used to distill RAM's own image encoder. (My understanding is that this implicitly aligns RAM with the CLIP text encoder, enabling better open-set recognition in the later inference stage.)
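    For the multi-label part of the loss, here is a minimal sketch of the Asymmetric Loss (ASL) following Ridnik et al.; RAM's exact hyperparameters are not published, so the defaults below are taken from the ASL paper and are assumptions here.

    import torch

    def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
        # logits, targets: (batch, num_tags); targets are 0/1 multi-hot labels.
        p = torch.sigmoid(logits)
        # Probability shifting for negatives: very easy negatives contribute nothing.
        p_neg = (p - clip).clamp(min=0)
        loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=eps))
        loss_neg = (1 - targets) * p_neg ** gamma_neg * torch.log((1 - p_neg).clamp(min=eps))
        return -(loss_pos + loss_neg).sum(dim=1).mean()

    # Example: a batch of 2 images over 4585 tags
    logits = torch.randn(2, 4585)
    targets = torch.randint(0, 2, (2, 4585)).float()
    print(asymmetric_loss(logits, targets))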

    3. Inference process

    There are two cases: the first is inference over categories natively supported by the model; the second is open-set inference over categories the model does not natively support (natively supported categories can of course also be used this way). The inference code is open source.

    The first case: categories natively supported by the model.

    * No text input is required here, just images.

    * The corresponding code: https://github.com/xinyu1205/recognize-anything/blob/main/inference_ram.py

    * First check whether your category is in the native list.

        Chinese: https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list_chinese.txt

        English: https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list.txt

        Corresponding per-tag thresholds: https://github.com/xinyu1205/recognize-anything/blob/main/ram/data/ram_tag_list_threshold.txt

    * As of the current version (2023-10-20), if you make a large number of calls it is recommended to modify the source code, because the model weights are otherwise read repeatedly: https://github.com/xinyu1205/recognize-anything/blob/main/ram/models/ram.py#L170
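    A minimal usage sketch in the spirit of the official inference_ram.py, loading the model once and reusing it across images; the function names follow the repo at the time of writing and may differ in later versions.

    import torch
    from PIL import Image
    from ram import get_transform, inference_ram
    from ram.models import ram

    device = "cuda" if torch.cuda.is_available() else "cpu"
    transform = get_transform(image_size=384)
    # Load the weights once, then loop over images instead of rebuilding the model per call.
    model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
                image_size=384, vit="swin_l").eval().to(device)

    for path in ["images/1641173_2291260800.jpg"]:
        image = transform(Image.open(path)).unsqueeze(0).to(device)
        # In the current version, inference_ram returns (English tags, Chinese tags).
        english_tags, chinese_tags = inference_ram(image, model)
        print(path, "->", english_tags)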

    

    The second case: categories not natively supported by the model (open set).

    * Here the desired categories must be provided in advance along with the image. Refer to this to fill in the categories you want: https://github.com/xinyu1205/recognize-anything/blob/main/ram/utils/openset_utils.py#L91

    * Principle: the learnable label queries in the model are replaced by CLIP text encodings. Each desired word is expanded into sentences using a set of templates (https://github.com/xinyu1205/recognize-anything/blob/main/ram/utils/openset_utils.py#L24), the CLIP text encoder output is computed offline for every templated sentence, and the average is taken as the feature representation of that word; everything else stays unchanged, yielding a score for that category. Note also that during training the authors distill the RAM image encoder with the CLIP image encoder, which effectively aligns image and text features through the CLIP text encoder for this open-set path; the authors' experiments show this improves performance.
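    A small sketch of the template-averaging idea using the OpenAI CLIP package; the template list and the CLIP variant below are illustrative assumptions, not the exact ones used in openset_utils.py.

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/16", device=device)

    templates = ["a photo of a {}.", "a photo of the {}.", "a picture of a {}."]

    def build_label_embedding(label: str) -> torch.Tensor:
        prompts = [t.format(label) for t in templates]
        tokens = clip.tokenize(prompts).to(device)
        with torch.no_grad():
            feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        # Average over templates, then re-normalize; this vector replaces the
        # learnable query for that tag at inference time.
        emb = feats.mean(dim=0)
        return emb / emb.norm()

    print(build_label_embedding("red panda").shape)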

    

    4. Ablation findings

    1) Two-branch training improves the model's tag classification ability.

    2) Open-set recognition relies mainly on CLIP and does not improve closed-set ability (unsurprising, since the open-set path is not involved in training at all).

    3) Increasing the number of tag categories has little impact on the existing categories (there is some impact because training becomes harder), but it improves open-set recognition and broadens the model's coverage.

3. Data

    1. Data labels

    Tag sources:

    1) Open-source academic datasets (classification, detection, segmentation).

    2) Existing commercial APIs (Google, Microsoft, Apple).

    Guiding principles:

    1) Tags that appear frequently are more valuable.

    2) Tags cover objects, scenes, attributes, and actions (behaviors), which improves the model's generalization to complex, unknown scenes.

    3) The number of tags needs to be moderate; too many tags lead to excessive annotation costs.

    Quantity:

    1) The modified SceneGraphParser is used to parse the 14 million pre-training captions.

    2) 6449 tags are manually selected from the top 10,000 high-frequency tags.

    3) Synonyms are merged into the same tag ID by various means (manual inspection, WordNet, translation, etc.), finally giving 4585 tags.
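    As a small illustration of the WordNet part of the merging (not the authors' actual pipeline), shared synsets can be used to flag candidate synonym pairs among high-frequency tags before manual review.

    # Requires: pip install nltk, then nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def share_synset(a: str, b: str) -> bool:
        sa = set(wn.synsets(a.replace(" ", "_")))
        sb = set(wn.synsets(b.replace(" ", "_")))
        return bool(sa & sb)

    print(share_synset("sofa", "couch"))  # True: candidates for the same tag ID
    print(share_synset("sofa", "table"))  # False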

    PS: RAM's coverage of OpenImages and ImageNet is lower because many of their tags are relatively uncommon, such as the many fine-grained bird categories in ImageNet.

    

    2. Data composition

    There are two data versions, 4 million and 14 million images, corresponding to the trained Swin-B and Swin-L models respectively.

    1) 4M: two manually annotated datasets, COCO (113K images, 557K descriptions) and Visual Genome (101K images, 822K descriptions), plus two large-scale web datasets, Conceptual Captions (3M images, 3M descriptions) and SBU Captions (849K images, 849K descriptions).

    2) 14M: the 4M set plus Conceptual 12M (10M images, 10M descriptions).

    3. Data cleaning

    Reason: image-text pairs from the web are inherently noisy and often have missing or incorrect tags. To improve annotation quality, the authors designed a tagging data engine.

    Resolving missing tags: part of the data is used to train a base model, this model then labels the remaining data, and the original and generated annotations are mixed to expand the labels. For the 4M images, the number of tags grows from 12M to 39.8M.
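    A rough sketch of this expansion step as described above: tags parsed from the caption are unioned with tags the base model predicts above a per-class threshold. The names and thresholds here are illustrative assumptions.

    import torch

    def expand_tags(parsed_tag_ids, tag_logits, thresholds):
        # parsed_tag_ids: set of tag indices parsed from the caption
        # tag_logits: (num_tags,) base-model logits for this image
        # thresholds: (num_tags,) per-class probability thresholds
        probs = torch.sigmoid(tag_logits)
        predicted = set(torch.nonzero(probs > thresholds).flatten().tolist())
        return parsed_tag_ids | predicted  # union of original and generated annotations

    num_tags = 4585
    expanded = expand_tags({3, 17}, torch.randn(num_tags), torch.full((num_tags,), 0.7))
    print(len(expanded))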

    Resolving redundant tags: Grounding DINO is first used to localize the specific region corresponding to each tag in the image, and then:

    1) Region clustering (K-Means++) is used to identify and drop outliers (the outermost 10%) within each class; the feature source and the number of clusters are not specified (a sketch of this step follows below);

    2) Tags whose predictions disagree between the whole image and the corresponding region are filtered out: the base model is run on the cropped region, and the tag is deleted if it is not predicted there (a tag assigned to the whole image should also be recognized in its cropped region), giving cleaner and more accurate annotations.

    On average, each tag is estimated to have about 10,000 images.
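    A sketch of the region-clustering idea: for all cropped regions sharing one tag, cluster their features and drop roughly the 10% of regions farthest from their cluster centers. Since the paper does not specify the feature extractor or the number of clusters, both are assumptions here.

    import numpy as np
    from sklearn.cluster import KMeans

    def drop_outlier_regions(region_feats, n_clusters=8, drop_frac=0.10):
        # region_feats: (num_regions, feat_dim) features of all regions for one tag
        km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit(region_feats)
        dists = np.linalg.norm(region_feats - km.cluster_centers_[km.labels_], axis=1)
        keep = dists.argsort()[: int(len(region_feats) * (1 - drop_frac))]
        return keep  # indices of regions whose tag annotation is kept

    feats = np.random.randn(1000, 256).astype(np.float32)
    print(drop_outlier_regions(feats).shape)  # roughly 900 kept regions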

    4. Ablation results

    1) Increasing the number of labels from 12.0M to 41.7M significantly improves performance on all test sets, indicating that the original dataset had a serious missing-label problem.

    2) Further cleaning the labels of certain categories gives a slight improvement on the OPPO-common and OpenImages-common test sets. Limited by the inference speed of Grounding DINO, only 534 categories were cleaned.

    3) Expanding the training images from 4M to 14M brings a significant improvement on all test sets.

    4) Using a larger backbone gives a slight improvement on OpenImages, with flat to slightly worse performance on the common categories; the authors attribute this to insufficient resources for hyperparameter search.

    5) Fine-tuning on the tags parsed from the COCO Caption dataset brings significant performance improvements on the OPPO-common and OpenImages-common test sets. (The COCO Caption dataset provides five descriptive sentences per image, which together approximate a complete set of tags.)

4. Strategy

1. Training process

Referring to the data cleaning process above, the overall training pipeline is as follows:

1) Obtain image tags for large-scale unannotated images through automatic semantic parsing of their captions (see the parsing sketch after this list).

2) Train a first version of the model using the original texts and the parsed tags.

3) Use the data engine to generate additional annotations and clean up incorrect ones (see the data cleaning section above).

4) Fine-tune the model on a smaller but higher-quality dataset.
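A tiny illustration of caption-to-tag parsing using the public sng_parser package; the authors use OPPO's modified SceneGraphParser, so the real outputs and dictionary keys may differ.

    # pip install SceneGraphParser && python -m spacy download en_core_web_sm
    import sng_parser

    caption = "a young woman is riding a brown horse on the beach"
    graph = sng_parser.parse(caption)
    # Entity heads become candidate object/scene tags; relations suggest action tags.
    tags = {entity["lemma_head"] for entity in graph["entities"]}
    print(tags)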

5. Results

1. Multi-dimensional comparison.

Compared with the segmentation model SAM, tagging models such as Tag2Text, and multi-modal models such as CLIP and BLIP, mainly in terms of localization ability, recognition accuracy, and number of categories (see the comparison figure in the paper).

2. Comparison of tagging capability.

RAM gives more precise (precision) and more complete (recall and coverage) results.

    * RAM demonstrates impressive zero-shot performance, significantly better than CLIP and BLIP.

    * RAM even surpasses the fully supervised approach (ML-Decoder).

    * RAM performs comparably to the Google tagging API.

3. Test set comparison

6. How to use

TRANSFORMERS_OFFLINE=1 python inference_ram.py --image images/1641173_2291260800.jpg --pretrained pretrained/ram_swin_large_14m.pth

7. To be resolved

1. What exactly is clustered in the data-cleaning step? Image features of the cropped regions?

2. The training code, and a description of the outputs of the branch networks.

8. Reference links

GitHub - xinyu1205/recognize-anything: Code for the Recognize Anything Model (RAM) and Tag2Text Model

Recognize Anything: A Strong Image Tagging Model

RAM (Recognize Anything Model) and its predecessor Tag2Text: paper interpretation - Zhihu

https://arxiv.org/pdf/2306.03514.pdf

https://github.com/xinyu1205/recognize-anything/blob/main/ram/utils/openset_utils.py#L293

ASL: Asymmetric Loss for Multi-Label Classification - CSDN Blog
