Write a song in a few words: generating audio from text in one step

Compared with AI generation of images, videos, and text, AI-generated music has developed relatively slowly. Generating high-quality audio requires modeling different types of signals and patterns at different scales, and little open source code exists in this field, which makes it arguably the most challenging area of AI-generated content. Recently, Meta open sourced AudioCraft, a PyTorch library capable of generating several kinds of audio.
Demo: https://huggingface.co/spaces/facebook/MusicGen

Like Llama, AudioCraft is released as open source, with the same stated purpose: to let "researchers and practitioners use their own datasets to train their own models, reduce the bias caused by the limitations of training data, and advance the development of AI-generated music and audio."


AudioCraft is a PyTorch library for deep learning research on audio generation. It comprises three components: MusicGen, AudioGen, and an improved version of EnCodec. MusicGen is trained on licensed music data and generates music from user-supplied text; AudioGen is trained on public sound-effect data and generates audio from text; and EnCodec compresses audio and reconstructs the original signal with high fidelity, ensuring the generated music stays high quality.
AudioGen is trained on 10 public sound-effect datasets covering effects such as dog barking, car horns, or footsteps on a wooden floor.
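
For completeness, here is a minimal sketch of text-to-sound-effect generation through the same AudioCraft API; the AudioGen class and the 'facebook/audiogen-medium' checkpoint name are assumptions taken from the AudioCraft release, so check them against your installed version.

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# load the published AudioGen checkpoint (name assumed from the release)
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds of audio
descriptions = ['dog barking', 'car horn', 'footsteps on a wooden floor']
wav = model.generate(descriptions)  # one sample per description

for idx, one_wav in enumerate(wav):
    # save each sample as sfx_{idx}.wav with loudness normalization
    audio_write(f'sfx_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
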
Next is the MusicGen model, which comes in three autoregressive Transformer variants with 300M, 1.5B, and 3.3B parameters.
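
At release these variants could be loaded by size name; a one-line sketch follows (the checkpoint names 'small', 'medium', 'large', and 'melody' are assumed from the release and may differ in later versions):

from audiocraft.models import MusicGen

# 'small' (300M), 'medium' (1.5B), and 'large' (3.3B) are text-to-music models;
# 'melody' (1.5B) additionally accepts a reference melody as conditioning.
model = MusicGen.get_pretrained('medium')
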
MusicGen is trained on 20,000 hours of music, including 10,000 high-quality tracks collected internally plus material from the ShutterStock and Pond5 libraries (roughly 25,000 and 365,000 tracks respectively). The audio is resampled and annotated with basic metadata such as genre and BPM, along with more detailed text descriptions.
Last is the EnCodec neural audio codec. Its encoder learns discrete audio tokens from the signal to be compressed; an autoregressive language model then compresses the token stream to the target size; finally, the decoder reconstructs the compressed signal back into audio with high fidelity. With this compression scheme, audio can be made roughly 10 times smaller than the MP3 format.
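
To make the encode/compress/decode pipeline concrete, below is a minimal round-trip sketch using the standalone encodec package (pip install encodec), which implements the same codec family; AudioCraft bundles its own improved variant, so treat this as an illustration rather than the exact model MusicGen uses, and 'input.wav' is a placeholder path.

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# pretrained 24 kHz codec from the standalone encodec package
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # target bitrate in kbps

wav, sr = torchaudio.load('input.wav')  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # encoder -> discrete audio tokens
    reconstructed = model.decode(frames)     # decoder -> reconstructed waveform
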
Installation and use
Requirements: Python 3.9, PyTorch 2.0.0, and a GPU with at least 16 GB of memory.


Environment setup
# create and activate a dedicated conda environment
conda create -n musicgen python=3.9
conda activate musicgen
# AudioCraft requires PyTorch 2.0 or newer
pip install 'torch>=2.0'
# Then proceed with one of the following
pip install -U audiocraft  # stable release
pip install -U git+https://git@github.com/facebookresearch/audiocraft#egg=audiocraft  # bleeding edge
pip install -e .  # or if you cloned the repo locally
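
The project README also recommends having ffmpeg available for audio I/O; the exact version pin below is taken from the README at the time and may change:

# on Ubuntu, for example:
sudo apt-get install ffmpeg
# or inside the conda environment:
conda install 'ffmpeg<5' -c conda-forge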

Generating music with the API

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=8) # generate 8 seconds.
wav = model.generate_unconditional(4) # generates 4 unconditional audio samples
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
wav = model.generate(descriptions) # generates 3 samples.

# load a reference melody and broadcast it to match the 3 descriptions
melody, sr = torchaudio.load('./assets/bach.mp3')
# generates using the melody from the given audio and the provided descriptions.
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
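
For reference, the returned wav tensor batches one waveform per description at model.sample_rate (32 kHz for the published MusicGen checkpoints), so the samples can also be post-processed with torchaudio before being written to disk.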

Source: https://blog.csdn.net/specssss/article/details/132169490