With AudioGPT taking natural-language input, can ChatGPT sing?

Xi Xiaoyao Technology Says | Original
Author | IQ Dropped All Over The Floor

By pairing ChatGPT's powerful understanding and generation capabilities with foundation audio models, the integrated model AudioGPT was born!

Recently, ChatGPT-based derivative works have been springing up like mushrooms. Last week we looked at the Hackathon Excellent Work Award winners, and this week new ideas have arrived: a paper that uses ChatGPT for speech understanding and generation tasks has sparked heated discussion.

The model combines a set of audio foundation models to handle challenging audio tasks and provides a modality-conversion interface to support spoken dialogue. It is good at understanding and generating speech, music, sound, and talking heads across multiple rounds of dialogue. Although it is an ensemble of existing models, it demonstrates the potential of AIGC tools in more domains.

Once speech and text can be converted into each other, ChatGPT's powerful language understanding and generation capabilities let the model operate on speech through natural language: style transfer, speech recognition, speech enhancement, and more. You can even command the AI directly in natural language to sing "Little Dimple" with emotion, or to synthesize a talking-head video. Perhaps in the future we will have plug-ins like this, so that we are no longer limited to text conversations with ChatGPT but can also easily create rich and diverse audio content.

Paper title
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Paper link
https://arxiv.org/abs/2304.12995

Code address
https://github.com/AIGC-Audio/AudioGPT

Hugging Face demo address
https://huggingface.co/spaces/AIGC-Audio/AudioGPT


Tasks supported by AudioGPT

AudioGPT uses a set of foundation models to understand and generate speech, music, sound, and talking heads, with ChatGPT making the results of both generation and understanding more natural. The supported tasks include:

Audio to text

  • Speech Recognition: converts human speech into text; base model: Whisper (see the ASR sketch after this list);

  • Speech Translation: translates human speech into another language; base model: MultiDecoder;

  • Audio Captioning: generates a text description of an audio clip.
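To make the first of these concrete, here is a minimal speech-recognition sketch using the open-source openai-whisper package; the checkpoint size and input file name are placeholder choices, and this shows the underlying model rather than AudioGPT's wrapper around it.

```python
# Minimal Whisper ASR sketch (pip install openai-whisper).
# The "base" checkpoint and "speech.wav" are placeholder choices.
import whisper

model = whisper.load_model("base")        # downloads the checkpoint on first use
result = model.transcribe("speech.wav")   # runs speech recognition on the file
print(result["text"])                     # the recognized transcript
```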

Audio to audio

  • Style Transfer: generates speech in the style of a reference sample; base model: GenerSpeech;

  • Speech Enhancement: improves speech quality through noise reduction and similar methods; base model: ConvTasNet;

  • Speech Separation: separates a mixture of multiple voices into individual signals; base model: TF-GridNet (see the separation sketch after this list);

  • Mono-to-Binaural: converts mono audio to binaural audio; base model: NeuralWarp;

  • Audio Inpainting: repairs missing parts of audio according to a user-provided mask; base model: Make-An-Audio.
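As a hedged illustration of this audio-to-audio family, the sketch below runs two-speaker separation with a pretrained ConvTasNet bundled in torchaudio. One caveat: AudioGPT pairs ConvTasNet with enhancement and TF-GridNet with separation, whereas this public Libri2Mix checkpoint is simply an available stand-in; the input file name is a placeholder.

```python
# Two-speaker separation sketch with torchaudio's pretrained ConvTasNet.
# "mixture.wav" is a placeholder input file.
import torch
import torchaudio

bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
model = bundle.get_model()

waveform, sr = torchaudio.load("mixture.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
mixture = waveform.mean(dim=0, keepdim=True).unsqueeze(0)   # (batch=1, 1, time)

with torch.inference_mode():
    sources = model(mixture)                                # (1, 2, time)

for i in range(sources.shape[1]):
    torchaudio.save(f"speaker_{i}.wav", sources[0, i : i + 1], bundle.sample_rate)
```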


Audio to event

  • Sound Extraction: extracts audio clips matching a text description;

  • Sound Detection: predicts the timeline of events within audio; base model: LASSNet.

Audio to video

  • Talking Head Synthesis: generates a talking-head video from input audio; base model: GeneFace.

Text to audio

  • Text-to-Speech: generates human speech from user-input text; base model: FastSpeech 2 (a minimal inference sketch follows).
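A minimal inference sketch, assuming the ESPnet toolkit (with the espnet_model_zoo package) and its published LJSpeech FastSpeech 2 checkpoint; the model and vocoder tags below are assumptions about the zoo's naming, not something specified by the paper.

```python
# FastSpeech 2 text-to-speech sketch via ESPnet
# (pip install espnet espnet_model_zoo parallel_wavegan soundfile).
# The model/vocoder tags are assumed zoo names, not from the paper.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained(
    model_tag="kan-bayashi/ljspeech_fastspeech2",
    vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
)
out = tts("Hello, can ChatGPT sing?")                # acoustic model + vocoder
sf.write("tts_out.wav", out["wav"].numpy(), tts.fs)  # tts.fs = sampling rate
```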

Image to audio

  • Image-to-Audio: generates audio corresponding to an input image; base model: Make-An-Audio.

Score to audio

  • Singing Synthesis: generates singing from input lyrics, notes, and rhythm; base models: DiffSinger and VISinger.


Model Quick View

Training LLMs that support audio processing remains challenging, for the following reasons:

  1. Limited data: Obtaining human-annotated speech data is expensive and time-consuming, and only a few resources provide real-world spoken dialogue. Compared with the huge volume of web text, audio data is limited, and multilingual spoken-dialogue data is scarcer still.

  2. Wasted computing resources: Training a multimodal LLM from scratch requires enormous compute and time. Given that audio foundation models that can understand and generate speech, music, sound, and talking heads already exist, training from scratch would be wasteful.

AudioGPT, proposed in this paper, is a multimodal AI system that addresses the problems above and complements current ChatGPT applications in two specific ways:

  1. Equipping it with foundation models: These handle complex audio information, while ChatGPT serves as a general-purpose interface for a large number of understanding and generation tasks.

  2. Connecting input/output interfaces (ASR, TTS): These support spoken dialogue (a minimal wrapper sketch follows).
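A minimal sketch of what this wiring could look like: Whisper on the way in, a text-only LLM turn in the middle, and TTS on the way out. The chat() and synthesize() stubs are hypothetical placeholders, not AudioGPT's actual interface.

```python
# Spoken-dialogue wrapper sketch: ASR in, text LLM in the middle, TTS out.
# chat() and synthesize() are hypothetical stubs standing in for ChatGPT
# and a TTS model such as FastSpeech 2.
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def chat(text: str) -> str:
    # Placeholder for the ChatGPT call that drives the dialogue.
    return f"(LLM reply to: {text})"

def synthesize(text: str, path: str) -> None:
    # Placeholder for a TTS back end.
    print(f"[TTS] would render '{text}' to {path}")

def spoken_turn(audio_path: str) -> str:
    user_text = asr.transcribe(audio_path)["text"]  # speech -> text
    reply = chat(user_text)                         # text dialogue turn
    synthesize(reply, "reply.wav")                  # text -> speech
    return reply
```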

▲Figure 1 Overview of AudioGPT

As shown in Figure 1, the overall processing pipeline of AudioGPT can be divided into four stages:

  • Modality conversion: Converting between speech and text through input/output interfaces, bridging the gap between spoken language and text-based ChatGPT.

  • Task analysis: A dialogue engine and a prompt manager help ChatGPT understand the user's intent for processing audio information.

  • Model assignment: ChatGPT assigns the appropriate audio foundation models for understanding and generation, passing structured arguments that control prosody, timbre, and language.

  • Response generation: After the audio foundation models execute, the final response is generated and returned to the user (a toy dispatch sketch follows this list).
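To make the task-analysis and model-assignment stages concrete, here is a toy dispatcher: a parsed intent routed through a model registry. The task names and stub handlers are invented for illustration and are far simpler than AudioGPT's actual prompt manager.

```python
# Toy sketch of task analysis + model assignment. Task names and the
# stub handlers are illustrative, not AudioGPT's real registry.
from typing import Callable, Dict

MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "speech_recognition": lambda arg: f"[Whisper] transcript of {arg}",
    "style_transfer":     lambda arg: f"[GenerSpeech] restyled {arg}",
    "singing_synthesis":  lambda arg: f"[DiffSinger] singing from {arg}",
}

def assign_model(task: str, argument: str) -> str:
    handler = MODEL_REGISTRY.get(task)
    if handler is None:
        # Unsupported tasks are one of the robustness cases evaluated later.
        return f"Unsupported task: {task}"
    return handler(argument)

print(assign_model("singing_synthesis", "the score of 'Little Dimple'"))
```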

Experiment

To evaluate how well a multimodal LLM understands human intent and cooperates with the underlying models, the authors evaluate AudioGPT from three aspects: consistency, capability, and robustness.

Consistency

As shown in Figure 2, the authors evaluate AudioGPT's ability to understand and solve problems without being given task-specific training examples. The evaluation proceeds in three steps: providing prompts, generating descriptions, and human evaluation.

▲Figure 2 Overview of the consistency evaluation

The details of the rating scheme are shown in Table 1:

▲Table 1 Ratings for evaluating query-answer consistency

Capability

As the task executors for complex audio information, the audio foundation models have a major impact on complex downstream tasks. Table 2 reports their evaluation metrics and downstream datasets for understanding and generating speech, music, sound, and talking heads.

▲Table 2 Evaluation details of the audio foundation models in AudioGPT

Robustness

The authors assess the robustness of multimodal LLMs by evaluating how they handle special cases, including long chains of queries, unsupported tasks, error handling of multimodal models, and breaks in context.

To evaluate robustness, a three-step subjective user-scoring process is employed:

  1. Human annotators write prompts covering the four categories mentioned above.

  2. The prompts are fed into the LLM to produce a complete interactive session.

  3. Different groups of human subjects score the interactive sessions to verify the multimodal LLM's ability to handle these specific situations.

Summary

Taken together, although AudioGPT performs well on complex audio-related AI tasks, it has some limitations:

  1. Prompt engineering: AudioGPT requires natural-language instructions to be constructed, which takes expertise and time; users unfamiliar with the relevant fields may write less effective instructions.

  2. Length limitations: Chatbots must still respect a maximum token length, which can affect the continuity of a conversation and the user's instructions (see the token-budget sketch after this list).

  3. Capability limitations: AudioGPT's performance is closely tied to the accuracy and effectiveness of the underlying audio models.
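To see why the length limit bites in multi-round dialogue, here is a small sketch that trims history to a token budget with the tiktoken tokenizer; the 4096-token budget and plain-string messages are illustrative assumptions.

```python
# Sketch: keep only as much recent dialogue history as fits a token budget.
# The 4096-token budget and plain-string messages are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages: list[str], budget: int = 4096) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):    # newest turns first
        n = len(enc.encode(msg))
        if used + n > budget:
            break                     # older turns get dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))       # restore chronological order
```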

These limitations remind us to stay clear-headed when looking at new ChatGPT-based systems. They also show that prompt engineering is critical to building more efficient and reliable AI systems and to making them more widely usable and easy to use. We look forward to more groundbreaking AI technologies that use their powerful understanding and generation capabilities to enrich our lives and improve the efficiency of everyday work. We will wait and see, and hope AIGC-related technologies keep maturing so they can better serve human society~
