Shikra: Understanding Pointing and Speaking in Coordinates, a Leap for Multimodal Language Models

Paper: http://arxiv.org/abs/2306.15195

Code: https://github.com/shikras/shikra

Background

In everyday communication, people often focus on particular regions or objects in a scene and exchange information efficiently by speaking while pointing to them. We call this interaction mode Referential Dialogue.

If an MLLM masters this skill, it will enable many exciting applications. For example, when applied to mixed-reality (XR) glasses such as Apple Vision Pro, users could use their gaze to point at anything and talk to the AI about it, while the AI could point back at specific regions through highlighting or other visual cues, making communication with users efficient.

This work proposes the Shikra model, endowing MLLMs with referential dialogue capability: it can both understand positional inputs and generate positional outputs.

Core Highlights

1. Shikra understands points and bounding boxes provided by the user, can output points and bounding boxes of its own, and can thus conduct referential dialogue with humans seamlessly.

2. Shikra has a simple, unified (non-stitched) design: it requires no extra position encoder, no pre- or post-hoc object detector, no external plug-in modules, and not even additional vocabulary tokens.

[Figure: referential dialogue example]

As shown in the figure above, Shikra accurately understands the region the user refers to in the input and can refer to regions different from the input in its output, communicating efficiently through both language and pointing, just as humans do.

[Figures: reasoning with positional information]

As shown in the figures above, Shikra not only retains the common-sense knowledge of the underlying LLM, but can also reason over positional information.

[Figures: grounded detailed image descriptions]

As shown above, Shikra can generate detailed descriptions that explain what is happening in the image and produce accurate locations for the objects it mentions.

[Figure: OCR example]

Although not specifically trained on OCR datasets, Shikra also has basic OCR capabilities.

More Examples

[Additional example figures]

Other Traditional Tasks

[Example figures on traditional tasks]

Method

The model architecture uses CLIP ViT-L/14 as the visual backbone and Vicuna-7B/13B as the base language model, with a single linear layer mapping CLIP's feature space into Vicuna's.
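Below is a minimal sketch of this bridging idea, not the repository's actual code; the module and variable names are hypothetical, and the dimensions are illustrative (1024-d CLIP ViT-L/14 patch features, 4096-d Vicuna-7B hidden states).

```python
import torch
import torch.nn as nn

class VisualBridge(nn.Module):
    """Single linear layer aligning CLIP visual features with the LLM embedding space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_patch_feats: torch.Tensor) -> torch.Tensor:
        # clip_patch_feats: (batch, num_patches, clip_dim) from the CLIP ViT-L/14 encoder
        return self.proj(clip_patch_feats)  # (batch, num_patches, llm_dim)

# The projected visual tokens are concatenated with the text token embeddings
# and fed into the language model as one sequence.
bridge = VisualBridge()
visual_tokens = bridge(torch.randn(1, 256, 1024))  # e.g. 16x16 patches for a 224px image
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```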

Shikra represents object positions directly as numbers in natural language: a bounding box is written as [xmin, ymin, xmax, ymax], the center point of a region as [xcenter, ycenter], and the x/y coordinates are normalized by the image width and height, with three decimal places by default. These coordinates can appear anywhere in the model's input and output sequences, and the square brackets that enclose them appear naturally within sentences.
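The sketch below illustrates this coordinate format; the helper functions are hypothetical and only show how a pixel-space box or point could be serialized into the normalized, three-decimal textual form described above.

```python
def box_to_text(box, img_w, img_h, precision=3):
    """Serialize a pixel-space box (xmin, ymin, xmax, ymax) as normalized text."""
    xmin, ymin, xmax, ymax = box
    coords = [xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    return "[" + ",".join(f"{c:.{precision}f}" for c in coords) + "]"

def point_to_text(point, img_w, img_h, precision=3):
    """Serialize a pixel-space center point (x, y) as normalized text."""
    x, y = point
    return f"[{x / img_w:.{precision}f},{y / img_h:.{precision}f}]"

# Example: a 640x480 image with a dog at pixel box (64, 120, 320, 456).
print("What is the dog" + box_to_text((64, 120, 320, 456), 640, 480) + " doing?")
# -> What is the dog[0.100,0.250,0.500,0.950] doing?
```

Because the coordinates are ordinary text, no special tokens or output heads are needed; the same language-modeling objective covers both words and positions.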

[Figure: model architecture]

Experimental Results

Shikra achieves strong performance on traditional REC, VQA, and captioning tasks, and reaches SOTA results on VQA tasks that require understanding positional input, such as PointQA-Twice and Point-V7W.

[Tables: quantitative results on REC, VQA, captioning, and PointQA benchmarks]

We used the POPE benchmark to evaluate the extent of Shikra's object hallucination. Shikra obtains results comparable to InstructBLIP and far exceeds other recent MLLMs.

[Table: POPE hallucination evaluation]

Chain of Thought (CoT) aims to help LLMs answer complex questions by generating a reasoning process before the final answer, and it has been widely used across natural language processing tasks. How to apply CoT in multimodal scenarios, however, remains an open question: because current MLLMs still hallucinate heavily, the reasoning chain itself often contains hallucinations that hurt the correctness of the final answer. Through experiments on the synthetic CLEVR dataset, we found that CoT augmented with location information effectively reduces hallucination and improves model performance.
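As a purely illustrative, hypothetical example (not taken from the paper) of what CoT with location information can look like, every object mentioned in the reasoning carries its normalized box, so each step refers to a concrete region rather than a possibly hallucinated one:

```python
question = "Is the mug to the left of the laptop? Explain your reasoning."

grounded_cot = (
    "There is a mug[0.112,0.503,0.240,0.716] and a "
    "laptop[0.418,0.342,0.805,0.779] on the desk. "
    "The mug's box ends at x=0.240, which is left of the laptop's box "
    "starting at x=0.418, so the answer is yes."
)

print(question)
print(grounded_cot)
```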

[Results: CoT with location information on CLEVR]

Conclusion

This work introduces Shikra, a simple and unified model that understands and outputs spatial coordinates directly in natural language, giving MLLMs human-like referential dialogue capability without introducing additional vocabularies, position encoders, or external plug-ins.
