VLM Series - Object Recognition as Next Token Prediction - Paper Interpretation

I. Overview

1. What it is

    By combining part of CLIP's visual encoder with the LLaMA language model, the usual image-captioning task is reduced to outputting only object labels. In other words, image recognition is recast as predicting the next text token. The model can then generate the top-K labels (in English) for an image, which suits open-vocabulary image-tagging scenarios.
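    To make the idea concrete, below is a minimal sketch of the forward pass, assuming Hugging Face CLIP and LLaMA checkpoints. The checkpoint names, the single linear projection `proj`, and the helper `label_logits` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

# Hypothetical checkpoint names; the paper builds on CLIP's ViT encoder and LLaMA.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Linear projection from the CLIP hidden size into the LLM embedding space
# (the projection actually used in the paper may differ).
proj = nn.Linear(clip.config.hidden_size, llm.config.hidden_size)

@torch.no_grad()
def label_logits(pixel_values, prompt_ids):
    """Prefix the projected image tokens to the text prompt and run the decoder.

    pixel_values: preprocessed image batch from a CLIP image processor, (B, 3, H, W)
    prompt_ids:   tokenized text prompt, (B, T)
    """
    img_tokens = clip(pixel_values=pixel_values).last_hidden_state   # (B, N, D_clip)
    img_embeds = proj(img_tokens)                                    # (B, N, D_llm)
    txt_embeds = llm.get_input_embeddings()(prompt_ids)              # (B, T, D_llm)
    inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
    out = llm(inputs_embeds=inputs_embeds)
    return out.logits   # next-token distributions; label tokens are read off the tail positions
```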

2. Highlights

    * Training uses image-caption pairs (nouns extracted from the raw captions serve as reference labels), which are easier to collect and annotate than image-question-answer triples. At inference time, the model generates short text fragments as labels rather than full sentences (see the noun-extraction sketch after this list).

    * The decoder uses a different token-modeling scheme: tokens belonging to different labels are independent of one another, tokens within the same label remain causal (each depends on the ones before it), and all label tokens are conditioned on the image embedding. This is implemented with a non-causal attention mask (a sketch of such a mask follows this list).

    * The non-causal masking in turn enables a new sampling method, called one-shot sampling, for generating the text tokens of labels: the tokens of multiple labels are sampled in parallel and ranked by their probabilities, exploiting the transformer's parallelism (a simplified sampling sketch follows this list).

    * A simple strategy improves model efficiency: starting from a pretrained LLM such as LLaMA, keep the first six transformer blocks and the final output layer and drop the intermediate blocks. This truncated decoder matches the full model's performance while running inference about 4.5x faster (a truncation sketch follows this list).
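    For the training-data construction in the first highlight, a rough approximation of pulling nouns out of captions could use spaCy; the paper's actual parser and filtering rules may differ, and `caption_to_labels` is a hypothetical helper.

```python
import spacy

# Rough approximation: extract noun lemmas from a raw caption as reference labels.
nlp = spacy.load("en_core_web_sm")

def caption_to_labels(caption):
    doc = nlp(caption)
    return sorted({tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")})

print(caption_to_labels("A brown dog chasing a frisbee on the beach."))
# -> ['beach', 'dog', 'frisbee']
```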
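    The non-causal mask from the second highlight can be built roughly as below. `build_noncausal_mask` is a hypothetical helper; it assumes the sequence is laid out as [image/prompt prefix][label 1 tokens][label 2 tokens]... and returns True where attention is allowed.

```python
import torch

def build_noncausal_mask(prefix_len, label_lens):
    """Boolean attention mask (True = may attend) where:
    - every position may attend to the whole image/prompt prefix,
    - tokens within one label remain causal,
    - tokens of different labels do not attend to each other.
    """
    total = prefix_len + sum(label_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # prefix is visible to every position, and causal within itself
    mask[:, :prefix_len] = True
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))

    start = prefix_len
    for n in label_lens:
        # causal block restricted to this label's own tokens
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask

print(build_noncausal_mask(prefix_len=2, label_lens=[2, 3]).int())
```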
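    For the third highlight, here is a simplified sketch of one-shot sampling: a single forward pass over the prefix yields the distribution of the first label token, the top-k candidates become the initial tokens of k labels, and each label is then completed and scored. For clarity this sketch completes labels in a Python loop, whereas the paper batches the completions in parallel using the non-causal mask; `one_shot_labels`, `prefix_embeds`, and the use of EOS as the label delimiter are assumptions.

```python
import torch

@torch.no_grad()
def one_shot_labels(model, tokenizer, prefix_embeds, k=10, max_label_len=5):
    # prefix_embeds: (1, P, D) image + prompt embeddings, as in the forward-pass sketch.
    device = prefix_embeds.device
    first = model(inputs_embeds=prefix_embeds).logits[:, -1, :]
    probs = torch.softmax(first, dim=-1)
    top_p, top_ids = probs.topk(k, dim=-1)            # k candidate initial tokens, drawn in parallel

    labels = []
    for p0, tok in zip(top_p[0], top_ids[0]):          # complete each candidate label independently
        ids, logp = [tok.item()], torch.log(p0)
        for _ in range(max_label_len):
            tok_embeds = model.get_input_embeddings()(torch.tensor([ids], device=device))
            step = model(inputs_embeds=torch.cat([prefix_embeds, tok_embeds], dim=1))
            step_probs = torch.softmax(step.logits[0, -1, :], dim=-1)
            p, nxt = step_probs.max(dim=-1)
            if nxt.item() == tokenizer.eos_token_id:   # treat EOS as the label delimiter (assumption)
                break
            ids.append(nxt.item())
            logp = logp + torch.log(p)
        labels.append((tokenizer.decode(ids).strip(), logp.exp().item()))
    return sorted(labels, key=lambda x: -x[1])          # rank labels by probability
```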
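    Truncating the decoder as described in the last highlight amounts to slicing the block list of a LLaMA-style checkpoint; the checkpoint name below is a placeholder, and the exact layers kept in the paper may differ in detail.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint; any LLaMA-family causal LM exposing model.model.layers
# (a ModuleList of decoder blocks) can be truncated the same way.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Keep only the first 6 decoder blocks; the final norm and lm_head are untouched.
llm.model.layers = nn.ModuleList(llm.model.layers[:6])
llm.config.num_hidden_layers = 6

print(sum(p.numel() for p in llm.parameters()) / 1e9, "B parameters remain")
```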

PS

    * The author does not compare against models such as RAM, presumably because the target setting is open-vocabulary recognition. But if your application already knows the desired category tags in advance, a comparison with RAM++ (or even RAM) would be entirely appropriate.

Origin: blog.csdn.net/u012863603/article/details/135465039