ByteDance | Large model BuboGPT: introducing visual grounding for fine-grained multi-modality, now open source

From: Qubit


ByteDance's large model BuboGPT is here.

It supports three modalities, text, image, and audio, and achieves fine-grained joint multi-modal understanding.

What it says and where in the image it is pointing, what is mentioned and what is not, is clear at a glance:

[Demo GIF]

Besides "smart eyes", it also has "smart ears": BuboGPT can hear details that humans fail to notice:

[Screenshot]

Brace yourself, the highlight is coming!

Joint understanding of three modalities, with text description + visual grounding + sound localization in one shot, accurately pinpointing the source of a sound:

[Demo GIF]

Don't worry, it's not over yet!

Even when the audio and the image have no direct relationship, BuboGPT can plausibly describe how the two might be related, telling a story from the picture and the sound together:

[Demo GIF]

Seen this way, BuboGPT's work really is "fine-grained" enough.

According to the researchers:

Recently popular multi-modal large models such as MiniGPT-4, LLaVA, and X-LLM do not ground their outputs in specific parts of the input; they only build coarse-grained mappings.

BuboGPT, on the other hand, leverages the rich information in text and its clear correspondence with other modalities to provide fine-grained understanding of visual objects and the given modalities.

Therefore, when BuboGPT describes the image, it can point out the specific location of the object in the picture.

[Screenshot]

BuboGPT: bringing visual grounding to LLMs for the first time

In addition to the examples the authors shared on YouTube, the research team also demonstrates BuboGPT's various capabilities in the paper.

It has been a while since we've seen a frog playing the piano! Can BuboGPT describe an image like this accurately?

[Image]

Let's see how it answers:

[Screenshot: BuboGPT's answer]

Not only does it accurately describe the frog's pose, it even knows that its hands are on a banjo.

Ask it what is interesting about the picture, and it can also summarize the elements in the background.

BuboGPT "eyesight + hearing + expressiveness test", the researchers play it like this, let's listen to this audio first.

Then take a look at BuboGPT's description:

[Screenshot: BuboGPT's description]

BuboGPT accurately understands the gender of the person in the picture, the source of the sound, and what is happening in the scene.

The results are this good because ByteDance introduced visual grounding into the LLM.

Let's look at the specific method below.

BuboGPT's architecture achieves multimodal understanding by learning a shared semantic space and further exploring the fine-grained relationships between different visual objects and different modalities.

To explore the fine-grained relationships between different visual objects and the various modalities, the researchers first built an off-the-shelf visual grounding pipeline based on SAM.

This pipeline consists of three modules: a tagging module, a grounding module, and an entity-matching module.

[Figure: visual grounding pipeline]

The process is roughly like this:

First, the tagging module is a pre-trained model that generates multiple text tags associated with the input image.

The SAM-based grounding module then localizes a semantic mask or bounding box on the image for each text tag.

Finally, the entity-matching module uses the LLM's reasoning ability to retrieve matching entities from the tags and the image description.

In this way, the researchers use language as a bridge to connect visual objects to the other modalities.
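
To make the pipeline more concrete, here is a minimal Python sketch of how the three modules could fit together. The interfaces (`tagger.tag`, `grounder.locate`, `llm.match`) are hypothetical stand-ins for the pre-trained tagging model, the SAM-based grounding module, and the LLM-based entity matcher described above, not BuboGPT's actual code.

```python
# A minimal sketch of the tagging -> grounding -> entity-matching pipeline.
# `tagger`, `grounder`, and `llm` are hypothetical stand-ins for the
# pre-trained tagging model, the SAM-based grounding module, and the
# LLM-based entity matcher; they are not BuboGPT's real APIs.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GroundedEntity:
    tag: str        # text tag produced by the tagging module, e.g. "frog"
    region: object  # semantic mask / bounding box from the grounding module
    mention: str    # phrase in the image description matched to this tag


def ground_image(image, caption: str, tagger, grounder, llm) -> List[GroundedEntity]:
    # 1. Tagging module: generate text tags associated with the input image.
    tags: List[str] = tagger.tag(image)           # e.g. ["frog", "banjo"]

    # 2. Grounding module: use SAM to localize a mask / box for each tag.
    regions: Dict[str, object] = {t: grounder.locate(image, t) for t in tags}

    # 3. Entity-matching module: ask the LLM which tags correspond to which
    #    phrases in the image description, relying on its reasoning ability.
    prompt = (
        f"Image description: {caption}\n"
        f"Candidate tags: {', '.join(tags)}\n"
        "For each tag that appears in the description, output 'tag -> phrase'."
    )
    matches: Dict[str, str] = llm.match(prompt)   # e.g. {"frog": "a frog ..."}

    return [GroundedEntity(tag=t, region=regions[t], mention=m)
            for t, m in matches.items()]
```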

To make any combination of the three modality inputs work well, the researchers adopted a two-stage training scheme similar to MiniGPT-4:

single-modal pre-training and multi-modal instruction tuning.

[Figure: two-stage training scheme]

Specifically, BuboGPT uses ImageBind as the audio encoder, BLIP-2 as the visual encoder, and Vicuna as the pre-trained LLM.

In the unimodal pre-training stage, the corresponding modality Q-Former and linear projection layers are trained on a large amount of modality-text paired data.

For visual perception, they train only the projection layer for image caption generation and keep the Q-Former from BLIP-2 fixed.

For audio understanding, they train both the Q-Former and the audio caption generation part.

In both settings no prompts are used; the model receives only the corresponding image or audio as input and predicts the matching caption.
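
To illustrate this stage-1 setup, here is a minimal PyTorch sketch that wires the three components together and applies the freezing policy: the Vicuna LLM and the BLIP-2 visual Q-Former stay frozen while their projection layer is trained, and the audio branch trains both its Q-Former stand-in and its projection. The class, dimensions, and module stand-ins are illustrative assumptions, not BuboGPT's actual implementation.

```python
import torch.nn as nn

# Illustrative sketch of the unimodal pre-training setup (not BuboGPT's code).
# `vicuna_llm`, `blip2_qformer`, and `imagebind_encoder` stand in for the
# pre-trained Vicuna LLM, the BLIP-2 visual Q-Former, and the ImageBind
# audio encoder; the dimensions are rough assumptions.
class BuboGPTStage1Sketch(nn.Module):
    def __init__(self, vicuna_llm, blip2_qformer, imagebind_encoder,
                 vis_dim=768, aud_dim=1024, llm_dim=4096):
        super().__init__()
        self.llm = vicuna_llm                # frozen pre-trained LLM
        self.visual_qformer = blip2_qformer  # frozen in stage 1
        self.audio_encoder = imagebind_encoder
        # Stand-in for the audio Q-Former, which *is* trained in stage 1.
        self.audio_qformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=aud_dim, nhead=8), num_layers=2)
        # Linear projections into the LLM embedding space (trained in stage 1).
        self.visual_proj = nn.Linear(vis_dim, llm_dim)
        self.audio_proj = nn.Linear(aud_dim, llm_dim)

        # Freezing policy: the LLM, the visual Q-Former, and (assumed here)
        # the ImageBind encoder stay fixed; only the parts above are updated.
        for module in (self.llm, self.visual_qformer, self.audio_encoder):
            for p in module.parameters():
                p.requires_grad = False

    def encode_image(self, image):
        # Image -> frozen Q-Former tokens -> trainable projection -> LLM space.
        return self.visual_proj(self.visual_qformer(image))

    def encode_audio(self, audio):
        # Audio -> frozen ImageBind features -> trainable Q-Former + projection.
        return self.audio_proj(self.audio_qformer(self.audio_encoder(audio)))
```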

[Figure: instruction-following examples for different inputs]

In the multimodal instruction tuning stage, a high-quality multimodal instruction dataset is constructed to fine-tune the linear projection layer, including:

  • Image-Text: Visual instruction tuning using the two datasets from MiniGPT-4 and LLaVA.

  • Audio-Text: A series of expressive and descriptive data is constructed based on the Clotho dataset. 

  • Audio-Image-Text: Based on the VGGSS dataset, <audio, image, text> tri-modal instruction-tuning pairs are constructed, and negative samples are further introduced to strengthen the model.

It is worth noting that by introducing negative image-audio pairs for semantic matching, BuboGPT achieves better alignment and stronger joint multimodal understanding.
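
As a rough idea of how such positive and negative image-audio instruction samples might be assembled, here is a minimal sketch. The field names and the question/answer templates are illustrative assumptions, not the authors' actual VGGSS-based recipe.

```python
import random

# Illustrative sketch of building positive and negative <audio, image, text>
# semantic-matching samples (not the authors' actual VGGSS-based recipe).
# Each item is assumed to look like {"audio": ..., "image": ..., "label": "dog"}.
def build_matching_samples(vggss_items, num_negatives=1):
    question = "Is the sound in the audio produced by something in the image?"
    samples = []
    for item in vggss_items:
        # Positive pair: audio and image come from the same clip.
        samples.append({
            "audio": item["audio"],
            "image": item["image"],
            "instruction": question,
            "answer": f"Yes, the sound comes from the {item['label']} in the image.",
        })
        # Negative pairs: the same audio matched with images from other clips.
        for _ in range(num_negatives):
            other = random.choice(vggss_items)
            if other["label"] == item["label"]:
                continue  # skip accidental semantic matches
            samples.append({
                "audio": item["audio"],
                "image": other["image"],
                "instruction": question,
                "answer": "No, the sound does not match anything visible in the image.",
            })
    return samples
```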

The BuboGPT code and datasets are now open source, and a demo has also been released. Let's give it a quick try.

A quick hands-on with the demo

The layout of the BuboGPT demo page is clear at a glance, and it is simple to operate: you can upload images or audio on the right, with BuboGPT's answer window and the user question window on the left:

[Screenshot: demo page]

After uploading a photo, click the first button below it to generate the segmented image:

[Screenshot]

Taking a photo of the Great Wall as an example, BuboGPT broke it down like this, identifying mountains, a tourist attraction, and city walls:

[Screenshot: segmentation result]

When we asked it to describe the picture, its answer was quite specific and largely accurate:

[Screenshot: BuboGPT's description]

You can see that the content in the segmentation panel also changed, corresponding to the text of the answer.

Here is another picture, paired with a piece of audio, and BuboGPT again correctly matches the sound to its source:

[Screenshot]

Of course, it sometimes misrecognizes things or describes them incorrectly. For example, there is no person in the picture below and the audio is just a bell, yet its description does not quite match the picture.

[Screenshot]

If you're interested, go try it out for yourself~

Portal:
[1] https://bubo-gpt.github.io/
[2] https://huggingface.co/spaces/magicr/BuboGPT (demo)

