MiniGPT-4, open source!

When GPT-4 was released last month, I wrote an article covering several key details about it.

That article mentioned one important feature of GPT-4: its multimodal capability.

For example, as demonstrated at the launch event, you can give it a picture and ask: what will happen if the glove falls?

GPT-4 understands the scene and answers: it will land on the board, and the ball will be launched into the air.

Another example: show GPT-4 a picture of a strange-looking charger and ask what is ridiculous about it.

GPT-4 replied: it shows a VGA connector being plugged into an iPhone to charge it.

Users can even sketch a website by hand, take a photo of the sketch, and hand it to GPT-4, which can immediately generate the code.

But a month has passed, and GPT-4's image-input capability still has not been opened to the public.

While everyone was waiting for this feature to open up, an open source project called MiniGPT-4 quietly pulled it off.

https://github.com/Vision-CAIR/MiniGPT-4

That's right: a project aimed squarely at enhancing vision-language understanding.

The team behind MiniGPT-4 is from KAUST (King Abdullah University of Science and Technology in Saudi Arabia), and the project was developed by several PhD students.

In addition to releasing the code, the project also provides a web demo that anyone can try directly.

MiniGPT-4 is itself built on top of open source large models.

The team combined a pretrained image encoder with the open source language model Vicuna (nicknamed "little alpaca"), froze most of the parameters of both, and only needed to train the small piece connecting them.
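
To make that idea concrete, here is a minimal PyTorch-style sketch, not the project's actual code: the module names, the stand-in encoder, and the dimensions (1408-dim vision features, a 4096-dim LLM embedding space) are all illustrative assumptions. The point is simply that the pretrained parts stay frozen and only a small projection maps image features into the language model's input space.

import torch
import torch.nn as nn

class MiniGPT4Sketch(nn.Module):
    """Toy version: a frozen image encoder, a frozen LLM (not shown), one trainable projection."""
    def __init__(self, vision_dim=1408, llm_dim=4096):       # dimensions are assumptions
        super().__init__()
        self.vision_encoder = nn.Linear(768, vision_dim)      # stand-in for the frozen pretrained encoder
        self.proj = nn.Linear(vision_dim, llm_dim)            # the small piece that actually gets trained
        for p in self.vision_encoder.parameters():            # freeze the pretrained part
            p.requires_grad = False
    def forward(self, image_feats):
        with torch.no_grad():
            v = self.vision_encoder(image_feats)              # frozen image features
        return self.proj(v)                                   # "soft prompt" handed to the frozen LLM

model = MiniGPT4Sketch()
print([n for n, p in model.named_parameters() if p.requires_grad])  # only proj.weight and proj.bias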

Training is divided into two stages.

The first is a conventional pretraining stage using about 5 million image-text pairs; on 4 A100 GPUs it finishes within roughly 10 hours. After this stage the model can already understand images, but its generation quality is limited.

The second, fine-tuning stage then trains on a small, high-quality dataset. This stage is computationally very cheap: it takes only about 7 minutes on a single A100.
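
As a rough sketch of what that two-stage recipe amounts to (all names, dimensions, data, and the loss below are placeholders, not the project's real training code), the key point is that both stages run the same loop and only the projection's parameters ever reach the optimizer:

import torch
import torch.nn as nn

proj = nn.Linear(1408, 4096)                                  # the small trainable projection (assumed dims)
optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)     # only the projection's parameters are updated

def run_stage(batches):
    for feats, target in batches:                             # stand-ins for image features and text targets
        loss = ((proj(feats) - target) ** 2).mean()           # placeholder for the LLM's language-modeling loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1 would stream millions of image-text pairs; stage 2 a small curated set. Same loop, different data.
dummy = [(torch.randn(2, 1408), torch.randn(2, 4096)) for _ in range(3)]
run_stage(dummy)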

The team is also preparing a more lightweight version that needs only 23 GB of GPU memory for deployment, which means it may eventually be possible to run it locally on some consumer-grade graphics cards.
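
Memory savings of that kind typically come from loading the language model in reduced precision. Purely as an illustration of the general technique (this is the generic Hugging Face 8-bit loading path, not necessarily what MiniGPT-4's code does, and the checkpoint path is a placeholder), something like the following roughly halves a 13B model's weight memory compared with fp16:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical checkpoint path; requires the accelerate and bitsandbytes packages.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/vicuna-13b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)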

Here are a few examples for you.

For example, throw in a photo of a dish and it gives you a recipe.

Or give it a photo of a product and ask it to write marketing copy.

And of course, you can sketch a web page and ask it to generate the code, just as demonstrated at the GPT-4 launch.

In short, MiniGPT-4 offers essentially the capabilities demonstrated at the GPT-4 launch.

That is genuinely impressive!

Probably because so many people are trying it right now, the MiniGPT-4 web demo puts you in a queue and you may need to wait your turn.

However, you can also deploy the service locally yourself, and the process is not complicated.

First, download the project and prepare the environment:

git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

Then download the pretrained weights; as described in the repository's README, this means preparing the Vicuna weights and the MiniGPT-4 checkpoint and pointing the config files at wherever you put them.

Finally, start the demo locally:

python demo.py --cfg-path eval_configs/minigpt4_eval.yaml

This project once again shows that large models are viable in the visual domain. The prospects for applications in images, audio, video, and beyond look very promising, and they are worth looking forward to.

Well, that's all for today. Thanks for reading, and see you next time.

Note: This article is included in my GitHub repository "Road to Programming" (https://github.com/rd2coding/Road2Coding), which collects self-study roadmaps for six major programming career directions, organized knowledge points and interview topics, my resume, several hardcore PDF notes, and reflections on life as a programmer. Stars are welcome.
