Paper: http://arxiv.org/abs/2306.15195
Code: https://github.com/shikras/shikra
Background
In everyday communication, people often focus on different regions or objects in a scene and exchange information efficiently by speaking while pointing at them. We call this interaction mode Referential Dialogue.
If an MLLM masters this skill, it will enable many exciting applications. For example, applied to extended reality (XR) glasses such as Apple Vision Pro, users could direct the AI's attention to anything simply by gazing at it, while the AI could in turn point to specific regions, e.g. by highlighting them, to communicate efficiently with the user.
This work proposes the Shikra model, which endows MLLMs with referential dialogue capability: it both understands positional inputs and generates positional outputs.
Core Highlights
1. Shikra understands point and bounding-box inputs from the user, supports point and bounding-box outputs, and can seamlessly carry on referential dialogue with humans.
2. Shikra's design is simple and direct: a unified architecture with no stitched-together parts, requiring no extra position encoder, no upstream or downstream object detector, no external plug-in modules, and not even an extended vocabulary.
As shown in the figure above, Shikra accurately understands the regions referenced in the user's input and can refer to regions different from the input in its output, communicating efficiently through language and pointing, the way humans do.
As shown in the figure above, Shikra not only retains the basic common sense of the underlying LLM but can also reason over positional information.
As shown above, Shikra can generate detailed descriptions that explain what is happening in an image and produce accurate locations for the objects it references.
Although it was not specifically trained on OCR datasets, Shikra also exhibits basic OCR capability.
More Examples
Other Traditional Tasks
Method
The model architecture uses CLIP ViT-L/14 as the visual backbone and Vicuna-7B/13B as the base language model, with a single linear layer mapping CLIP's feature space into Vicuna's.
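A minimal sketch of this wiring in PyTorch, using Hugging Face `CLIPVisionModel` and `LlamaForCausalLM` checkpoints as stand-ins (the class and checkpoint names here are illustrative, not the exact ones from the Shikra codebase):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM

class ShikraLikeModel(nn.Module):
    """Sketch: CLIP patch features -> one linear layer -> LLM embedding space."""

    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = LlamaForCausalLM.from_pretrained(llm_name)
        # The single linear mapping that bridges the two feature spaces.
        self.proj = nn.Linear(self.vision.config.hidden_size,   # 1024 for ViT-L/14
                              self.llm.config.hidden_size)      # 4096 for Vicuna-7B

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Patch features from CLIP, projected to the LLM's token-embedding size.
        feats = self.vision(pixel_values).last_hidden_state  # (B, N, 1024)
        return self.proj(feats)                              # (B, N, 4096)
```

The projected patch tokens are then placed alongside the text token embeddings in the language model's input sequence; no detector or position encoder is involved, which is exactly the "no stitched-together parts" point above.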
Shikra represents object positions directly as numbers in natural language: a bounding box is written as [xmin, ymin, xmax, ymax] and the center point of a region as [xcenter, ycenter], where the x/y coordinates are normalized by the image width and height. Each number is kept to 3 decimal places by default. These coordinates can appear anywhere in the model's input and output sequences, and the square brackets enclosing them occur naturally inside sentences.
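A small sketch of this coordinate-to-text convention (the function names are my own, and the exact separator/spacing inside the brackets is an assumption; the normalization and 3-decimal rounding follow the description above):

```python
import re

def box_to_text(box, img_w, img_h):
    """Serialize a pixel-space box (xmin, ymin, xmax, ymax) as Shikra-style text."""
    xmin, ymin, xmax, ymax = box
    coords = [xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    return "[" + ",".join(f"{c:.3f}" for c in coords) + "]"

def text_to_boxes(text, img_w, img_h):
    """Parse every [x1,y1,x2,y2] coordinate group out of generated text."""
    boxes = []
    for m in re.finditer(r"\[([\d.]+),([\d.]+),([\d.]+),([\d.]+)\]", text):
        x1, y1, x2, y2 = (float(g) for g in m.groups())
        boxes.append((x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h))
    return boxes

# e.g. box_to_text((130, 52, 340, 410), 640, 480) -> "[0.203,0.108,0.531,0.854]"
```

Because boxes are just digit tokens in ordinary text, the same decoder that generates sentences generates locations, with no special output head or vocabulary extension.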
Experimental Results
Shikra achieves excellent performance on traditional REC, VQA, and captioning tasks, and reaches SOTA results on VQA tasks that require understanding positional input, such as PointQA-Twice and Point-V7W.
We used the POPE benchmark to evaluate the extent of Shikra's hallucinations. Shikra obtains results comparable to InstructBLIP and far exceeds other recent MLLMs.
Chain of Thought (CoT) aims to help LLMs answer complex questions by generating a reasoning process before the final answer, and it has been widely used across natural language processing tasks. How to apply CoT in multimodal scenarios, however, remains an open question, in particular because current MLLMs still hallucinate severely: the CoT itself often contains hallucinations that corrupt the final answer. Through experiments on the synthetic dataset CLEVR, we found that CoT augmented with location information effectively reduces model hallucinations and improves performance.
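To make the idea concrete, here is a sketch of what a location-grounded CoT exchange might look like; the wording, coordinates, and prompt phrasing below are invented for illustration and are not taken from the CLEVR experiments:

```python
# Hypothetical prompt/response pair: the model localizes each object it
# reasons about before committing to an answer.
question = (
    "Question: Is the metal sphere to the left of the rubber cube? "
    "Think step by step, grounding each object you mention with its box."
)
grounded_cot = (
    "The metal sphere is at [0.120,0.455,0.260,0.610] and the rubber cube "
    "is at [0.540,0.430,0.700,0.620]. The sphere's box lies further left, "
    "so the answer is yes."
)
```

The intuition is that forcing the model to emit coordinates ties each reasoning step to something verifiable in the image, leaving less room for the chain to drift into hallucinated objects.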
Conclusion
This work introduces Shikra, a simple and unified model that understands and outputs spatial coordinates in natural language, giving MLLMs human-like referential dialogue capability without introducing additional vocabularies, position encoders, or external plug-ins.