HuggingGPT is here: LLM + expert models, a step toward more general AI

Producer: Towhee Technical Team

A super combination: Hugging Face + ChatGPT = HuggingGPT. Humanity seems one step closer to true AGI.

HuggingGPT is a joint research project between Zhejiang University and Microsoft Research Asia. It attracted attention quickly after its release and has been open-sourced.

Using it is very simple. Given a complex AI task such as "Please generate an image of a girl reading a book, her pose the same as the boy in the image example.jpg. Then please describe the new image with your voice.", HuggingGPT automatically analyzes which AI models are required and directly calls the corresponding models on Hugging Face to execute the task through to completion. Throughout the process, you only need to express your needs in natural language.

The core concept of HuggingGPT is to use language as a common interface between LLMs and other AI models. This strategy enables the LLM to call external models to solve a wide range of complex AI tasks. HuggingGPT's design is organized around four stages, task planning, model selection, task execution, and response generation, so that the whole system can efficiently coordinate different models to handle multimodal information and complex AI tasks:

  • Task Planning: ChatGPT analyzes the user request to understand its intent and decomposes it, via prompts, into tasks that can potentially be solved.
  • Model Selection: to solve the planned tasks, ChatGPT selects expert models hosted on Hugging Face based on their model descriptions.
  • Task Execution: each selected model is called and executed, and the results are returned to ChatGPT.
  • Response Generation: finally, ChatGPT integrates the predictions of all models and generates the answer for the user.
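
The four stages can be pictured as a simple orchestration loop around the LLM. Below is a minimal sketch of that loop, not the actual microsoft/JARVIS code: the function names, prompt wording, and use of the classic `openai` SDK are assumptions made for illustration.

```python
# Minimal sketch of the four-stage HuggingGPT loop (illustrative only;
# function names and prompts are assumptions, not the JARVIS implementation).
import json
import openai  # classic openai SDK; reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    """Single ChatGPT call used by every stage."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def plan_tasks(user_request: str) -> list:
    """Stage 1: task planning - decompose the request into sub-tasks."""
    prompt = f"Decompose this request into a JSON list of tasks with dependencies:\n{user_request}"
    return json.loads(ask_llm(prompt))

def select_model(task: dict, candidates: list) -> str:
    """Stage 2: model selection - pick a Hugging Face model by its description."""
    prompt = f"Task: {task}\nCandidate models: {candidates}\nReturn the best model id."
    return ask_llm(prompt).strip()

def execute_task(model_id: str, task: dict) -> dict:
    """Stage 3: task execution - call the chosen expert model (stubbed here;
    in practice this would hit the Hugging Face Inference API or a local pipeline)."""
    raise NotImplementedError

def generate_response(user_request: str, results: list) -> str:
    """Stage 4: response generation - let ChatGPT summarize all model outputs."""
    prompt = f"User asked: {user_request}\nModel outputs: {results}\nWrite the final answer."
    return ask_llm(prompt)
```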

In this example, the input command is: "Please generate an image of a girl reading a book, her pose the same as the boy in the image example.jpg. Then please describe the new image with your voice."

In the first step, task planning, HuggingGPT plans six tasks: pose-control, pose-to-image, image-class, object-det, image-to-text, and text-to-speech, and arranges their dependencies. In the second step, ChatGPT selects models from the candidate expert models on Hugging Face according to their model descriptions; the models may run online or be downloaded locally. In the third step, the selected expert models are actually executed on Hugging Face. In the fourth step, the predictions of all models are integrated to generate the final response for the user. The system does indeed find a pose-related model and generates an image of a girl reading a book in the same pose, which is really impressive.
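
For concreteness, such a plan can be written as a dependency-annotated task list. The structure below is only an illustrative approximation of the planning format described in the paper; the field names, the `<GENERATED>-k` placeholder, and the exact dependencies are guesses, not the real JARVIS output.

```python
# Illustrative task plan for the example request. "<GENERATED>-k" stands for
# the output of task k; field names approximate the paper's planning format.
task_plan = [
    {"id": 0, "task": "pose-control",   "dep": [-1], "args": {"image": "example.jpg"}},
    {"id": 1, "task": "pose-to-image",  "dep": [0],  "args": {"text": "a girl reading a book",
                                                              "image": "<GENERATED>-0"}},
    {"id": 2, "task": "image-class",    "dep": [1],  "args": {"image": "<GENERATED>-1"}},
    {"id": 3, "task": "object-det",     "dep": [1],  "args": {"image": "<GENERATED>-1"}},
    {"id": 4, "task": "image-to-text",  "dep": [1],  "args": {"image": "<GENERATED>-1"}},
    {"id": 5, "task": "text-to-speech", "dep": [4],  "args": {"text": "<GENERATED>-4"}},
]
```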

HuggingGPT has successfully integrated hundreds of models on Hugging Face, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. The experimental results demonstrate HuggingGPT's strong ability to handle multimodal information and complex AI tasks, opening up a new path toward advanced artificial intelligence.


Below are a few examples from the paper. As they show, HuggingGPT handles complex tasks that combine multiple modalities very well:

[Figure: "Generate a video titled 'Astronaut Walks in Space' with voiceover."]

[Figure: "Given a set of pictures A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, how many zebras are there in these pictures?"]

A Gradio demo is currently available on Hugging Face Spaces at https://huggingface.co/spaces/microsoft/HuggingGPT, so you can try it out quickly.
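
If you want to experiment with a single expert model yourself rather than the full demo, the task-execution stage essentially boils down to an ordinary Hugging Face Inference API call. Here is a minimal sketch for one sub-task (image-to-text); the model id, endpoint, and response shape are examples based on the public Inference API, not necessarily what HuggingGPT selects.

```python
# Minimal sketch: execute one image-to-text sub-task via the hosted
# Hugging Face Inference API. Replace the token placeholder with a real one.
import requests

API_URL = "https://api-inference.huggingface.co/models/nlpconnect/vit-gpt2-image-captioning"
HEADERS = {"Authorization": "Bearer <YOUR_HF_TOKEN>"}

def image_to_text(image_path: str) -> str:
    with open(image_path, "rb") as f:
        resp = requests.post(API_URL, headers=HEADERS, data=f.read())
    resp.raise_for_status()
    # Captioning models typically return a list like [{"generated_text": "..."}]
    return resp.json()[0]["generated_text"]

print(image_to_text("example.jpg"))
```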

Of course, HuggingGPT also has some shortcomings:

  • Efficiency: the bottleneck lies in the inference of the large language model. For each round of user requests, HuggingGPT needs at least one interaction with the LLM in each of the task planning, model selection, and response generation stages. These interactions greatly increase response latency and degrade the user experience.
  • Maximum context length: limited by the maximum number of tokens the LLM can accept, HuggingGPT also faces a maximum-context-length constraint. It mitigates this by using a dialogue window and tracking the conversation context only during the task planning stage (a minimal sketch of this idea follows below).
  • System stability, in two respects. First, large language models occasionally fail to follow instructions during inference, and the output format may not be as expected, causing exceptions in the program workflow. Second, the state of the expert models hosted on Hugging Face is not controllable: they may be affected by network latency or service availability, causing errors in the task execution stage.
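
The context-length mitigation can be pictured as a simple sliding window over the dialogue: keep only the most recent turns that fit within a token budget before sending them to the planner. The sketch below is illustrative only; the budget value and the use of tiktoken are assumptions, not the JARVIS code.

```python
# Illustrative sliding-window truncation for the task planning stage:
# keep only the newest dialogue turns that fit within a token budget.
import tiktoken

def trim_dialogue(turns: list[str], budget: int = 3000,
                  model: str = "gpt-3.5-turbo") -> list[str]:
    enc = tiktoken.encoding_for_model(model)
    kept, used = [], 0
    for turn in reversed(turns):       # walk from the newest turn backwards
        n = len(enc.encode(turn))
        if used + n > budget:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))        # restore chronological order
```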

Relevant information:

  • Project address: https://github.com/microsoft/JARVIS

  • Related papers:

    • https://arxiv.org/abs/2303.17580


Original article: https://blog.csdn.net/weixin_44839084/article/details/130137654