Tsinghua GLM team's new work: the multimodal VisualGLM-6B

Tsinghua has released a new multimodal model, VisualGLM-6B. According to the official project page, it is a vision-language model built by combining the language model ChatGLM-6B [1] with the BLIP2-Qformer.

Open-source project: https://github.com/THUDM/VisualGLM-6B

VisualGLM online demo: https://huggingface.co/spaces/THUDM/visualglm-6b

Introduction to VisualGLM-6B:

Model structure and design ideas (from the slides shared by Dr. Ding Ming, linked at the end of the article)
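
To make the role of the BLIP2-Qformer more concrete, here is a minimal sketch of the general idea behind a Q-Former style bridge: a fixed set of learned query vectors attends over the image patch features, and the result is projected to the language model's embedding size. This is a toy illustration only, not VisualGLM's actual code. The function names, the tiny dimensions, the random weights, and the single attention step are all assumptions made for illustration; the real module is a multi-layer transformer with trained weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats):
    """Each query takes a weighted average of the image patch features."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*image_feats)]  # transpose: dim x patches
    scores = matmul(queries, keys_t)                   # queries x patches
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, image_feats)                # queries x dim


def random_matrix(rows, cols, rng):
    """Random values standing in for trained parameters or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def image_to_llm_tokens(image_feats, queries, projection):
    """Compress patch features into a fixed number of LLM-sized vectors."""
    attended = cross_attention(queries, image_feats)
    return matmul(attended, projection)


rng = random.Random(0)
NUM_QUERIES, VIS_DIM, LLM_DIM = 3, 4, 5  # toy sizes (assumed), not the real model's
queries = random_matrix(NUM_QUERIES, VIS_DIM, rng)  # stands in for learned queries
projection = random_matrix(VIS_DIM, LLM_DIM, rng)   # stands in for a learned linear layer

small_image = random_matrix(6, VIS_DIM, rng)   # 6 patch features
large_image = random_matrix(24, VIS_DIM, rng)  # 24 patch features

tokens_small = image_to_llm_tokens(small_image, queries, projection)
tokens_large = image_to_llm_tokens(large_image, queries, projection)
print(len(tokens_small), len(tokens_small[0]))
print(len(tokens_large), len(tokens_large[0]))
```

Both calls return the same number of vectors, because the output length is set by the number of queries rather than by the number of image patches. That is what lets the language model receive a short, fixed-length "image prefix" regardless of how many patches the vision encoder produced.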

I tried it casually on one picture, and the result was quite good.


Constrained by factors such as the amount of data, the number of model parameters, and whether the model has been aligned with user intent, the current open-source version has some limitations:

  • Factuality of image descriptions / model hallucination. When generating a long image description, the language model comes to dominate as the generated text gets farther from the image, and it may produce content that follows from the preceding context but does not exist in the image.
  • Attribute mismatch. In scenes with multiple objects, attributes of one object are often wrongly assigned to another.
  • Resolution. This project uses a resolution of 224×224, which is also the most commonly used size in vision models; finer-grained understanding, however, requires higher resolution and more computation.
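
To give a rough sense of why higher resolution means more computation, the sketch below counts the patch tokens a ViT-style image encoder produces at several resolutions and how the pairwise self-attention cost grows relative to 224×224. The patch size of 14 is an assumption for illustration (the article does not state the patch size), and the function names are hypothetical.

```python
def vit_token_count(image_size, patch_size=14):
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def relative_attention_cost(image_size, base_size=224, patch_size=14):
    """Pairwise self-attention cost relative to the base resolution.

    Attention compares every token with every other token, so the cost
    grows with the square of the token count.
    """
    tokens = vit_token_count(image_size, patch_size)
    base = vit_token_count(base_size, patch_size)
    return (tokens / base) ** 2


for size in (224, 448, 896):
    print(size, vit_token_count(size), relative_attention_cost(size))
```

Under these assumptions, doubling the side length from 224 to 448 quadruples the token count (256 to 1024) and raises the pairwise attention cost by a factor of 16.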

At present, no technical paper has been published for VisualGLM, but you can refer to Microsoft's multimodal approach [2], which likewise accepts two modalities as input, text and images, and outputs the answer as text.

Microsoft Research's multimodal approach

Update:

On May 30, Dr. Ding Ming, the developer of VisualGLM, gave a live talk on VisualGLM's design ideas and training methods. I took the time to watch the replay, and it is full of details. Both the video and the slides are available, so you can go through them yourself.

VisualGLM technical explanation: https://www.bilibili.com/video/BV14L411q7fk

Talk materials download: https://pan.baidu.com/s/1gfdpyfT6EVnygMPDO_iwvQ?pwd=8wpc

References

  1. GitHub - THUDM/ChatGLM-6B: ChatGLM-6B: An Open Bilingual Dialogue Language Model
  2. https://arxiv.org/abs/2302.14045

Source: blog.csdn.net/sinat_37574187/article/details/131735754