Without multi-modal GPT-4: implementing "look at a picture and tell a story" with Hugging Face + LangChain

The most popular "closed source" AI at the moment comes from OpenAI, which can fairly be said to be at its peak ("far ahead", as the meme goes, though I hear that phrase so often lately that it starts to sound like satire of the real leader). However, many of its features are not so easy to get your hands on: multi-modality, for example, cannot be called through the API for the time being.

So how can we implement a simple "look at a picture and tell a story"? It takes two steps:

  1. Use an open-source model to identify the content of the picture and generate a one-sentence text description;
  2. Have a large language model generate a short story based on that text description.

When it comes to open-source models, Hugging Face cannot be ignored: https://huggingface.co/

Hugging Face is an artificial intelligence research organization focused on natural language processing (NLP), and a vibrant open-source AI community. It is best known for its open-source Transformers library, which provides state-of-the-art NLP models and tools for tasks such as text classification, translation, and summarization.

Let’s first go to Hugging Face to find an image-to-text model:
[screenshot: searching for image-to-text models on Hugging Face]
This time we use the "Salesforce/blip-image-captioning-large" model to generate a text description from the image. Note that this model is fairly large (about 1.8 GB); if you just want a quick test, you can pick a smaller model instead (such as Salesforce/blip-image-captioning-base), otherwise the download alone will take a long time.

Models like these are collectively called pretrained models: they have already been trained, so no further training is required, and they can be used directly after downloading.

Generate image description

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file (API keys, etc.)

from IPython.display import Image
from transformers import pipeline

pipe = pipeline("image-to-text",
                model="Salesforce/blip-image-captioning-large")

def image2text(url):
    # Run the captioning pipeline on an image (local path or URL) and
    # return the generated one-sentence description. The original post is
    # truncated at this function; the name and body are a reconstruction
    # based on the standard transformers pipeline API.
    return pipe(url)[0]["generated_text"]
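A quick test of step 1. The file name my_image.jpg below is a hypothetical example; any local path or URL to an image should work:

image_url = "my_image.jpg"          # hypothetical example image
display(Image(filename=image_url))  # show the image in the notebook
description = image2text(image_url)
print(description)                  # one-sentence caption of the scene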

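Generate a short story

Step 2 is not shown in what survives of the post. Here is a minimal sketch of how it might look with LangChain, assuming an OPENAI_API_KEY is present in the .env file loaded earlier; the prompt wording, temperature, and use of LLMChain are my assumptions, not the original author's code:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Ask the LLM to expand the one-sentence caption into a short story
# (the prompt wording is illustrative).
prompt = PromptTemplate.from_template(
    "Write a short story of no more than 100 words based on this scene: {description}"
)

llm = ChatOpenAI(temperature=0.9)  # reads OPENAI_API_KEY loaded from .env
chain = LLMChain(llm=llm, prompt=prompt)

story = chain.run(description)  # 'description' comes from image2text() above
print(story)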

Reprinted from: blog.csdn.net/fireshort/article/details/134459321