Using VisualGLM for AIGC multimodal recognition and content generation

In recent months, LLMs (Large Language Models) across the AI industry have flourished. Beyond traditional text-only models, visual language models with multimodal capabilities, such as GPT-4 and ImageBind, have also delivered impressive performance.

ChatGLM-6B is an open-source Chinese LLM that is very friendly to Chinese users. On May 17, 2023, Zhipu AI and Tsinghua University's KEG Lab open-sourced VisualGLM-6B, a multimodal dialogue model built on ChatGLM-6B; it can describe images, answer questions about them, and draw on common sense to raise interesting points. VisualGLM-6B is an open-source multimodal dialogue language model that supports images, Chinese, and English. Its language component is based on ChatGLM-6B with 6.2 billion parameters; the image component bridges the visual model and the language model by training BLIP2-Qformer, bringing the overall model to 7.8 billion parameters.

VisualGLM-6B is pre-trained on 30M high-quality Chinese image-text pairs from the CogView dataset together with 300M filtered English image-text pairs, with Chinese and English weighted equally. This training approach aligns visual information well with ChatGLM's semantic space; in the subsequent fine-tuning stage, the model is trained on long visual question-answering data to generate answers that match human preferences.

Today we will install and try out VisualGLM-6B, and then look at the core working principles behind it.

VisualGLM-6B installation and usage

System environment (my environment)

GPU: NVIDIA A30 24G

OS: Windows 11

Python: 3.8.13

PyTorch: 1.12.1+cu113

Transformers: 4.29.1

Appendix: code for reading the environment information:

import sys
import torch   # pip install torch
import pynvml  # pip install pynvml


# Get GPU information
def get_gpu_info(gpu_id=0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    gpu_name = pynvml.nvmlDeviceGetName(handle)
    meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
    gpu_mem_total = round(meminfo.total / 1024 / 1024, 2)
    gpu_mem_used = round(meminfo.used / 1024 / 1024, 2)
    gpu_mem_free = round(meminfo.free / 1024 / 1024, 2)
    print("GPU model:\t", gpu_name)
    print("Total VRAM:\t", gpu_mem_total, "MB")
    print("Used VRAM:\t", gpu_mem_used, "MB")
    print("Free VRAM:\t", gpu_mem_free, "MB")


# Print environment information
print("Python version:\t", sys.version)
print("PyTorch version:\t", torch.__version__)
print("CUDA version:\t", torch.version.cuda)
print("cuDNN version:\t", torch.backends.cudnn.version())
print("CUDA available:\t", torch.cuda.is_available())
print("GPU count:\t", torch.cuda.device_count())
get_gpu_info()

To run VisualGLM smoothly, Python 3.6+ is recommended (personally I suggest 3.8.x or 3.10.x as the safer choice), and the CUDA version should preferably be 11.3 or above; GPU memory should be no less than 16 GB, otherwise the model will not run properly.

Windows or Ubuntu is the safer choice of operating system.

To monitor the memory usage of an NVIDIA GPU, use the nvidia-smi command (the example below refreshes the readout every 10 seconds):

nvidia-smi -l 10

Installing VisualGLM-6B

To simply test VisualGLM-6B from the command line, you only need to download the source code and the base model and call them from a script. To run the web interface, you also need a SAT-format model, which is downloaded and installed automatically.

Source code: THUDM/VisualGLM-6B on GitHub (Chinese-English multimodal dialogue language model): https://github.com/THUDM/VisualGLM-6B

Base model: THUDM/visualglm-6b on Hugging Face

SAT model: https://cloud.tsinghua.edu.cn/f/348b98dffcc940b6a09d

Quick installation steps (using a Windows environment as an example):

git clone https://github.com/THUDM/VisualGLM-6B
cd VisualGLM-6B
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements_wo_ds.txt
pip install -i https://mirrors.aliyun.com/pypi/simple/ --no-deps "SwissArmyTransformer>=0.3.6"
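As an optional sanity check before downloading the model, you can confirm that the key packages import correctly. A minimal sketch (it assumes the SwissArmyTransformer package is importable under the module name sat; adjust if your version differs):

import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import sat  # module name provided by the SwissArmyTransformer package (assumption)
    print("SwissArmyTransformer:", getattr(sat, "__version__", "installed"))
except ImportError:
    print("SwissArmyTransformer not found - re-run the pip install above")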

Testing VisualGLM-6B

Code for manually testing the model:

from transformers import AutoModel, AutoTokenizer
import torch


# Model and image paths
# model_name = "THUDM/visualglm-6b"
model_path = "C:\\Data\\VisualGLM-6B\\visualglm-6b"
# pic_path = "C:\\Users\\mat\\Pictures\\test\\kld00.jpg"
# pic_path = "C:\\Users\\mat\\Pictures\\test\\cat00.jpg"
# pic_path = "C:\\Users\\mat\\Pictures\\test\\sl00.jpg"
# pic_path = "C:\\Users\\mat\\Pictures\\test\\kh00.jpg"
pic_path = "C:\\Users\\mat\\Pictures\\test\\code00.jpg"


# Load the model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_path = pic_path


# Ask questions about the image
response, history = model.chat(tokenizer, image_path, "描述这张图片。", history=[])  # "Describe this image."
print(response)
response, history = model.chat(tokenizer, image_path, "这张图片可能是在什么场所拍摄的?", history=history)  # "Where might this photo have been taken?"
print(response)

pic_path in the code above is the path of the image file to be tested; JPG, PNG, WEBP, and other formats are supported. Download the image in advance and point the path at it.

If the base model has been downloaded locally, note that on a Windows system the path in model_path must use \\ as the separator; with /, the model files could not be found.
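As a side note, a common way to sidestep these backslash-escaping issues is to write the path as a raw string or build it with pathlib; a minimal sketch, using the same directory as in the test script above:

from pathlib import Path

# Raw string: backslashes are taken literally, so there is no need to double them.
model_path = r"C:\Data\VisualGLM-6B\visualglm-6b"

# Or build the path with pathlib and convert it to str before passing it to
# from_pretrained(); on Windows, str(Path) produces backslash separators.
model_path = str(Path("C:/Data") / "VisualGLM-6B" / "visualglm-6b")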

Results of calling the model from code

Person recognition:

Animal and environment recognition:

Cat recognition:

Indoor scene recognition:

Code recognition:

Web interface interaction test

Steps to run:

git clone https://github.com/THUDM/VisualGLM-6B
cd VisualGLM-6B
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements_wo_ds.txt
pip install -i https://mirrors.aliyun.com/pypi/simple/ --no-deps "SwissArmyTransformer>=0.3.6"
python web_demo.py

Then wait for the SAT model to be downloaded and installed automatically; once it has loaded, visit the local address:

http://127.0.0.1:7860

You can then use the local VisualGLM through the web interface. After many rounds of interaction, GPU memory usage can grow quite large; remember to click "Clear" to reset the conversation, which reduces the memory footprint.
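If you drive the model from your own code instead, as in the test script earlier, the rough counterpart of clicking "Clear" is to drop the accumulated history and release PyTorch's cached GPU memory; a minimal sketch:

import torch

# A long conversation keeps growing `history`, which drives up GPU memory use;
# resetting it is the code-level counterpart of the web UI's "Clear" button.
history = []

# Ask PyTorch to return cached blocks to the driver. This does not free memory
# held by live tensors, but it trims the allocator's cache.
torch.cuda.empty_cache()

# Check how much GPU memory the current process is actually using.
print("allocated:", round(torch.cuda.memory_allocated() / 1024**2, 2), "MB")
print("reserved: ", round(torch.cuda.memory_reserved() / 1024**2, 2), "MB")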

Animal recognition:

Celebrity recognition:

Code recognition:

Image reasoning:

Clothing recognition:

The technical principle behind VisualGLM

VisualGLM can recognize images and hold conversations about them. It is essentially an LLM (Large Language Model), but an image-capable large language model differs somewhat from a traditional text-only model such as ChatGPT. To understand the magic, we only need to look at the core technology behind it: BLIP-2.

For a multimodal large language model, beyond conventional Transformer-based LLM training, the more important part of VisualGLM is how image-text correspondence is handled, somewhat analogous to the role CLIP plays in Stable Diffusion. The technique VisualGLM-6B mainly relies on for image-text processing is the training method of BLIP-2.

BLIP (Bootstrapping Language-Image Pre-training) is a bootstrapped language-image pre-training method and belongs to the family of vision-language pre-training approaches.

Traditional vision-language models and methods suffer from two major drawbacks:

1. From a model perspective, most methods employ either encoder-based models or encoder-decoder models. However, encoder-based models are hard to transfer directly to text generation tasks such as image captioning, while encoder-decoder models have not been successfully applied to image-text retrieval tasks.

2. From a data perspective, most SOTA methods, such as CLIP, ALBEF, and SimVLM, are pre-trained on image-text pairs collected from the web. Although scaling up the dataset brings performance gains, the results show that noisy web text is suboptimal for vision-language learning.

To address this, Junnan Li et al. proposed the new model BLIP (Bootstrapping Language-Image Pre-training).

BLIP-2 is a general-purpose, compute-efficient vision-language pre-training method that leverages frozen pre-trained image encoders and LLMs, outperforming Flamingo, BEiT-3, and other networks.

BLIP-2 roughly consists of several parts: the image is fed into an image encoder, the resulting features are fused with the text in the Q-Former (initialized from BERT), and the output is finally passed to the LLM.

BLIP-2 bridges the modality gap between vision and language by inserting a lightweight Querying Transformer (Q-Former) between a frozen pre-trained image encoder and a frozen pre-trained large language model. In the whole model, the Q-Former is the only trainable module; the image encoder and the language model stay frozen throughout.
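To make the "only the Q-Former is trainable" point concrete, here is a minimal PyTorch-style sketch; the tiny stand-in modules below are placeholders for illustration, not the real BLIP-2 components:

import torch
import torch.nn as nn

# Tiny stand-ins for the real components (placeholders only).
image_encoder = nn.Linear(768, 768)      # pretend frozen vision encoder (e.g. a ViT)
llm = nn.Linear(768, 32000)              # pretend frozen language model head
q_former = nn.TransformerEncoder(        # pretend Q-Former: the trainable bridge
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)

# Freeze everything except the Q-Former.
for module in (image_encoder, llm):
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()

# Only the Q-Former's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(q_former.parameters(), lr=1e-4)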

An LLM (such as GLM or GPT) is essentially a language model and cannot directly accept information from other modalities. Therefore, unifying each modality's information into a feature space the LLM can understand is the first step, and this is exactly what the Q-Former in BLIP-2 is for.

The Q-Former itself is a transformer; for fusing features, the Transformer architecture is a natural fit. It consists of two sub-modules that share the same self-attention layers:

an image transformer that interacts with the frozen image encoder for visual feature extraction, and a text transformer that serves as both a text encoder and a text decoder.

Training the Q-Former consists of three tasks; through them, feature extraction and fusion are learned, but at this stage the model has not yet seen the LLM.

The three training tasks of Q-Former are:

Image-Text Contrastive Learning (ITC), image-text contrastive learning

Image-grounded Text Generation (ITG), image-based text generation

Image-Text Matching (ITM), image text matching

All three tasks take the query features and text features as input but use different attention-mask combinations. The image transformer extracts a fixed number of output features from the image encoder, independent of the input image resolution. It receives a set of trainable query embeddings as input, and these queries can also interact with the text through the shared self-attention layers.
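A minimal sketch of this fixed-size query idea (the dimensions and modules below are illustrative rather than the real BLIP-2 configuration, except that BLIP-2 does use 32 queries): a set of learnable query embeddings cross-attends to however many image features the frozen encoder produces and always returns the same number of output vectors.

import torch
import torch.nn as nn

hidden, num_queries = 768, 32

# Learnable query embeddings, shared across all images.
query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden))
nn.init.normal_(query_tokens, std=0.02)

# Cross-attention: the queries attend to the frozen encoder's image features.
cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8, batch_first=True)

# Image features from the frozen encoder; their count depends on the image...
image_feats = torch.randn(4, 257, hidden)     # batch of 4, 257 patch tokens each

# ...but the output always has shape (batch, num_queries, hidden), a fixed size.
queries = query_tokens.expand(image_feats.size(0), -1, -1)
fused, _ = cross_attn(queries, image_feats, image_feats)
print(fused.shape)                            # torch.Size([4, 32, 768])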

After this first training stage, the queries have distilled the essence of the image; what we need to do next is turn the queries into something the LLM can understand.

Why not teach the LLM to understand the queries, instead of turning the queries into something the LLM already understands? There are two reasons:

(1) Training the LLM itself is expensive;

(2) From a prompt-learning perspective, the current amount of multimodal data is not enough to train the LLM well and may instead cause it to lose its generalization ability. If you cannot fit the model to the task, fit the task to the model.

BLIP-2 designs different tasks for two different types of LLM (see the sketch after this list):

(1) Decoder-only LLMs (such as OPT): the queries are used as input and the text as the target;

(2) Encoder-decoder LLMs (such as FlanT5): the queries plus the first half of a sentence are used as input, and the second half as the target.
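A minimal sketch of this second stage under those two setups (all shapes and modules are illustrative): the Q-Former's output queries are projected by a fully connected layer into the LLM's embedding dimension and prepended to the text embeddings as soft visual prompts.

import torch
import torch.nn as nn

qformer_dim, llm_dim = 768, 4096                # illustrative sizes only
proj = nn.Linear(qformer_dim, llm_dim)          # trainable FC bridge into the LLM

query_output = torch.randn(1, 32, qformer_dim)  # Q-Former output for one image
visual_prompt = proj(query_output)              # (1, 32, llm_dim)
text_embeds = torch.randn(1, 20, llm_dim)       # placeholder text embeddings

# Decoder-only LLM (e.g. OPT): visual prompt + text as input, the text itself
# is the generation target.
decoder_input = torch.cat([visual_prompt, text_embeds], dim=1)

# Encoder-decoder LLM (e.g. FlanT5): visual prompt + first half of the sentence
# go to the encoder, the second half is the decoder's target.
prefix, target = text_embeds[:, :10], text_embeds[:, 10:]
encoder_input = torch.cat([visual_prompt, prefix], dim=1)
print(decoder_input.shape, encoder_input.shape)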

After the whole training process, the core image-text relationship model is finally in place, and image-text interaction becomes possible.

Essentially, BLIP-2 is a zero-shot vision-language model that can be used for various image-to-text tasks given image and text prompts. It is an effective and efficient approach for image understanding in many scenarios, especially when training samples are scarce.
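If you want to try BLIP-2's zero-shot image-to-text behavior yourself, the Hugging Face transformers library (4.27 and later) ships BLIP-2 support; a minimal sketch, assuming the public Salesforce/blip2-opt-2.7b checkpoint and a local test image (this downloads a large model and needs a sizeable GPU):

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"        # public BLIP-2 checkpoint
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("test.jpg")                # placeholder path to any test image

# Image captioning: no text prompt, the model simply describes the picture.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))

# Visual question answering: add a prompt in the "Question: ... Answer:" style.
prompt = "Question: where was this photo taken? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))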

The model bridges the gap between the vision and natural-language modalities by adding a transformer between two pre-trained models. This pre-training paradigm lets it fully benefit from the separate progress of the two modalities, and it is an elegant algorithmic innovation.

Wrapping up

Today we took a quick look at the recently open-sourced VisualGLM-6B and experienced a multimodal large language model first-hand. Although many people cannot access GPT-4, the basic walkthrough in this article gives an intuitive feel for both text and image interaction with a general LLM, from which you can go on to deeper study of your own.

Some people have already explored medical applications based on VisualGLM-6B and open-sourced XrayGLM, a model project that can automatically interpret X-ray images and produce diagnostic reports.

I hope this article gives you an entry point into multimodal large language models, helps with your technical learning and practical work, or inspires you to iterate your own Chinese multimodal model for your business scenarios.



If you want to follow more technical content, you can follow the "Heiye Passerby Technology" public account, send "add group" in the backend, and join the GPT and AI technology exchange group.
