Generate more than 20 styles of background music directly from text: the free version of Stable Audio is here!

On September 14, the well-known open source AI company Stability AI released its audio generation product, Stable Audio, on its official website. (Free to use at: https://www.stableaudio.com/generate)

Through text prompts, users can directly generate more than 20 styles of background music, such as rock, jazz, electronic, hip-hop, heavy metal, folk, pop, punk, and country.

For example, entering keywords such as disco, drum machine, synthesizer, bass, piano, guitar, cheerful, and 115 BPM produces a matching piece of background music.

Stable Audio currently comes in a free and a paid tier. The free tier can generate 20 pieces of music per month, each up to 45 seconds long, and the output cannot be used commercially. The paid tier costs $11.99 per month (about 87 yuan) and can generate 500 pieces per month, each up to 90 seconds long, with commercial use allowed.

If you don't want to pay, you can register several accounts and splice the generated clips together in Adobe Audition (AU) or Premiere Pro (PR) to get the same effect as a longer track.
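The splicing step is easy to automate. Below is a minimal sketch using the pydub library (my choice for illustration; any audio editor or library would do), joining two exported clips with a short crossfade:

```python
# Sketch: join two 45-second free-tier clips with a short crossfade.
# Assumes pydub is installed (pip install pydub) and ffmpeg is on PATH;
# the file names are placeholders for clips exported from Stable Audio.
from pydub import AudioSegment

clip1 = AudioSegment.from_file("clip1.wav")
clip2 = AudioSegment.from_file("clip2.wav")

# A 500 ms crossfade hides the seam between the two generated clips.
combined = clip1.append(clip2, crossfade=500)
combined.export("combined.wav", format="wav")
```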

A brief introduction to Stable Audio

Over the past few years, diffusion models have developed rapidly in the image, video, audio, and other fields. But diffusion models in the audio domain have a problem: they typically produce content of a fixed size.

For example, an audio diffusion model trained on 30-second clips can only generate 30-second clips. To break through this bottleneck, Stable Audio uses a more advanced model: an audio latent diffusion model conditioned not only on text metadata but also on the audio file's duration and start time, which gives control over both the content and the length of the generated audio. This additional timing condition lets users generate audio of a specified length.
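To make the idea concrete, here is a hypothetical PyTorch-style sketch of timing conditioning; the layer, names, and dimensions are invented for illustration and are not Stability AI's actual implementation:

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Illustrative only: embed (seconds_start, seconds_total) so the
    diffusion model knows what span of audio it is asked to generate."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(2, dim)

    def forward(self, seconds_start, seconds_total):
        t = torch.stack([seconds_start, seconds_total], dim=-1).float()
        return self.proj(t)  # one conditioning vector per example

# Ask for a 45-second clip taken from the start of a longer piece:
cond = TimingConditioner()(torch.tensor([0.0]), torch.tensor([45.0]))
```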

Working on a heavily downsampled latent representation of the audio yields much faster inference than operating on raw audio. With the latest model, Stable Audio can render 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU.
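A back-of-the-envelope calculation shows why working in a latent space matters at that sample rate:

```python
# 95 seconds of stereo audio at a 44.1 kHz sample rate
samples = 95 * 44_100 * 2          # seconds * rate * channels
print(f"{samples:,}")              # 8,379,000 raw sample values

# With a heavily downsampled latent (64x is an assumed, illustrative
# factor), the diffusion model denoises far fewer values per step.
print(f"{samples // 64:,}")        # 130,921 latent values
```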

In terms of training data, Stable Audio was trained on a dataset of more than 800,000 audio files, including music, sound effects, and recordings of individual instruments.

The dataset totals more than 19,500 hours of audio and was supplied through a partnership with the music service provider AudioSparx, which is why the generated music can be used commercially.
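Those two figures imply fairly short clips on average:

```python
hours, files = 19_500, 800_000
print(f"{hours * 3600 / files:.1f} s per file on average")  # 87.8 s
```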

Latent diffusion models

The latent diffusion model used by Stable Audio is a diffusion-based generative model that operates in the latent space of a pre-trained autoencoder; in other words, it combines an autoencoder with a diffusion model.

An autoencoder is first trained to learn a low-dimensional latent representation of the input data (such as images or audio). This latent representation captures the important features of the input and can be used to reconstruct the original data.
A diffusion model is then trained in this latent space, gradually denoising latent variables to generate new data.
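A stripped-down PyTorch sketch of this two-stage recipe follows; the module shapes and the noise schedule are illustrative placeholders, not Stable Audio's actual architecture:

```python
import torch
import torch.nn as nn

# Stage 1: an autoencoder compresses raw audio into a compact latent.
class AudioAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Strided 1-D convolutions heavily downsample the waveform.
        self.encoder = nn.Conv1d(2, 64, kernel_size=16, stride=8, padding=4)
        self.decoder = nn.ConvTranspose1d(64, 2, kernel_size=16, stride=8, padding=4)

    def encode(self, audio):   # (batch, 2, samples) -> latent
        return self.encoder(audio)

    def decode(self, latent):  # latent -> (batch, 2, samples)
        return self.decoder(latent)

# Stage 2: the diffusion model is trained on noisy latents, not raw audio.
def diffusion_training_step(denoiser, autoencoder, audio, cond):
    with torch.no_grad():
        latent = autoencoder.encode(audio)       # diffuse in latent space
    noise = torch.randn_like(latent)
    t = torch.rand(latent.shape[0])              # random noise level in [0, 1)
    noisy = latent + t.view(-1, 1, 1) * noise    # toy linear noise schedule
    pred = denoiser(noisy, t, cond)              # predict the added noise
    return nn.functional.mse_loss(pred, noise)
```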

The main advantage of this approach is a significant improvement in the training and inference speed of diffusion models: because the diffusion process happens in a relatively small latent space rather than the original data space, new data can be generated much more efficiently.

In addition, operating in the latent space gives better control over the generated data. For example, the latent variables can be manipulated to change certain characteristics of the output, or the generation process can be guided by imposing constraints on the latent variables.
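One standard way to guide the generation process is classifier-free guidance. Whether Stable Audio uses exactly this is an assumption on my part, but the technique itself is common in latent diffusion models:

```python
def guided_noise_prediction(denoiser, noisy_latent, t, cond, scale=7.0):
    # Classifier-free guidance: run the denoiser with and without the
    # text condition, then push the prediction toward the conditional one.
    uncond = denoiser(noisy_latent, t, cond=None)
    conditional = denoiser(noisy_latent, t, cond=cond)
    return uncond + scale * (conditional - uncond)  # scale > 1 strengthens cond
```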

Stable Audio usage and examples

"AIGC Open Community" tried the free version of Stable Audio. The usage method is similar to that of ChatGPT. Just enter the text prompt. The prompt content includes four categories: details, mentality, instruments and beats.
It should be noted that if you want the generated music to be more delicate, rhythmic and rhythmic, the input text also needs to be more detailed. In other words, the more text prompts you enter, the better the generated effect will be.
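A trivial helper makes the four-category structure concrete (the function and its arguments are made up for illustration; the keywords are the ones from the disco example above):

```python
def build_prompt(details, mood, instruments, bpm):
    """Join the four prompt categories into one comma-separated
    Stable Audio prompt string."""
    return ", ".join(details + mood + instruments + [f"{bpm} BPM"])

print(build_prompt(
    details=["disco", "drum machine"],
    mood=["cheerful"],
    instruments=["synthesizer", "bass", "piano", "guitar"],
    bpm=115,
))
# disco, drum machine, cheerful, synthesizer, bass, piano, guitar, 115 BPM
```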
(Figure: the Stable Audio user interface)

Below are a few example prompts used to generate audio.

Trance, island, beach, sun, 4am, progressive, synth, 909, dramatic chords, chorus, upbeat, nostalgic, dynamic.

Soft hug, comfort, low synth, shimmer, wind and leaves, ambient, peaceful, relaxing, water.

Pop electronic, big reverb synth, drum machine, atmospheric, moody, nostalgic, cool, pop instrumental, 100 BPM.

3/4, 3 beats, guitar, drums, bright, happy, clapping.

The material in this article comes from the official Stability AI website. If there is any infringement, please contact us and it will be removed.
