Over the past few years, we have seen huge advances in AI-generated images, video, and text. Progress in audio generation, however, has lagged behind. Meta AI has now contributed another major open-source project: AudioCraft, an audio generation framework that bundles several audio generation models.
AudioCraft open source address
Open source address: https://github.com/facebookresearch/audiocraft
Note that while the framework code is open source, the weights of the three models are released under a non-commercial license and cannot be used commercially.
AudioGen model address:
https://www.datalearner.com/ai-models/pretrained-models/AudioGen
MusicGen model address:
https://www.datalearner.com/ai-models/pretrained-models/MusicGen
Introduction to AudioCraft
Producing high-fidelity audio of any kind requires modeling complex signals and patterns at different scales. Music is perhaps the most challenging type of audio to generate, as it combines local and long-range patterns, from individual notes to global musical structure across multiple instruments. AI music generation has often relied on symbolic representations such as MIDI or piano rolls, but these cannot fully capture the expressive nuances and stylistic elements of a performance.
To this end, Meta AI has open sourced AudioCraft, a framework for generating audio. It supports a range of models, produces high-quality audio with long-term consistency, and lets users interact with it through a natural text interface.
AudioCraft covers music generation, sound generation, and audio compression on a single platform. Because it is easy to build on and reuse, anyone looking to build a better sound generator, compression algorithm, or music generator can work in the same code base and build on what others have already done.
Models Supported by AudioCraft
AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, trained on Meta-owned and specially licensed music, generates music from text input, while AudioGen, trained on publicly available sound effects, generates audio from text input. In addition, an improved version of the EnCodec decoder generates higher-quality music with fewer artifacts.
Simply put, MusicGen is a model for text-generated music:
https://www.datalearner.com/ai-models/pretrained-models/MusicGen
AudioGen is a model for generating arbitrary audio from text:
https://www.datalearner.com/ai-models/pretrained-models/AudioGen
The third component, EnCodec, is a real-time, high-fidelity audio codec built on neural networks.
The screenshot below shows examples from the official AudioGen and MusicGen demos:
As you can see, the AudioGen model needs only a piece of text to generate audio. The first example asks the model to generate whistling with wind blowing, and the result is very convincing.
Note that audio cannot be embedded here; visit the official demo page to hear the actual results.
MusicGen generates music from a text description. Even without musical training, I think the results sound quite good.
Using AudioCraft
AudioCraft requires Python 3.9 and PyTorch 2.0, so make sure your environment meets these requirements. You can install or upgrade with the following commands:
# Best to make sure you have torch installed first, in particular before installing xformers.
# Don't run this if you already have PyTorch installed.
pip install 'torch>=2.0'
# Then proceed to one of the following
pip install -U audiocraft # stable release
pip install -U git+https://[email protected]/facebookresearch/audiocraft#egg=audiocraft # bleeding edge
pip install -e .  # or if you cloned the repo locally (mandatory if you want to train).
The official documentation also recommends installing ffmpeg on your system:
sudo apt-get install ffmpeg
If you have anaconda, you can also install it with the following command:
conda install 'ffmpeg<5' -c conda-forge
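Before installing AudioCraft itself, you can sanity-check the Python and PyTorch versions with a short snippet. The `major_version` helper below is purely illustrative and not part of AudioCraft:

```python
import sys

def major_version(version: str) -> int:
    """Extract the major version from a string like '2.0.1+cu118'."""
    return int(version.split('.')[0])

# AudioCraft expects Python 3.9+ and PyTorch 2.0+.
if sys.version_info < (3, 9):
    print("Python 3.9+ is required for AudioCraft")

try:
    import torch
    if major_version(torch.__version__) < 2:
        print("PyTorch 2.0+ is required; run: pip install 'torch>=2.0'")
except ImportError:
    print("PyTorch is not installed; run: pip install 'torch>=2.0'")
```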
Once installed, using it is straightforward:
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds of audio.
wav = model.generate_unconditional(4)  # generates 4 unconditional audio samples.
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples, one per description.
for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)