AudioLM: a Language Modeling Approach to Audio Generation

  • google
  • 2022-09

abstract

  • motivation: generate high-quality audio with long-term consistency; the pipeline is speech-to-speech, with no text or symbolic conditioning.
  • Audio is quantized into discrete tokens, modeled with a language model, and the tokens are then decoded back into audio.

intro

  • Without conditioning signals (linguistic features, a MIDI sequence), even a model as strong as WaveNet can only produce babble-like noise.
  • Previous methods: semantic tokens are obtained from models pre-trained with self-supervised objectives and then modeled with language-model techniques. These tokens capture local dependencies (phonemes in speech, local melody in music) and long-term structure (syntax in speech, harmony and rhythm in piano music), but audio reconstructed from them alone is of poor quality.
  • contribution
    • SoundStream provides acoustic tokens that guarantee generation quality; w2v-BERT provides semantic tokens that enable long-term consistent modeling.
    • Prompted continuation works across domains: given a 3 s prompt from an unseen speaker, the model continues speech that preserves the speaker's timbre, prosody, and recording conditions; given a piano prompt, it continues music with consistent melody, rhythm, and instrument timbre.
    • Safety: to guard against misuse, a classifier is trained that detects with very high accuracy whether a piece of speech was generated by AudioLM.

related work

  • neural codec: AudioLM uses the (downsampled) tokens extracted by the SoundStream neural codec as the target for sequence modeling; these tokens can be decoded back into audio.

  • SoundStream: a multi-stage residual vector quantizer (RVQ) on top of a CNN encoder that downsamples the input audio, representing it as tokens in $\{1, 2, \dots, N\}^{T_a \times Q}$, where $T_a$ is the number of frames after downsampling, $N = 1024$ is the codebook size, and $Q$ is the number of quantizers (the pre-trained codec used here has $Q = 4$, with 320x downsampling of 16 kHz audio). The SoundStream decoder is trained with a reconstruction loss plus an adversarial loss. A toy sketch of the token layout follows.
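    To make the token layout concrete, here is a toy NumPy sketch of residual vector quantization. The feature dimension, codebook contents, and input frames are random stand-ins rather than SoundStream's real parameters; only the shapes follow the description above.

    ```python
    import numpy as np

    # Toy residual vector quantization (RVQ): each frame is quantized by Q codebooks
    # in turn, each one encoding the residual left over by the previous level.
    def rvq_encode(frames, codebooks):
        """frames: (T_a, D) encoder outputs; codebooks: list of Q arrays of shape (N, D)."""
        residual = frames.copy()
        indices = []
        for codebook in codebooks:
            # squared distance to every codeword, via a matmul to keep memory small
            d = (residual ** 2).sum(1, keepdims=True) - 2.0 * residual @ codebook.T + (codebook ** 2).sum(1)
            idx = d.argmin(axis=1)                      # nearest codeword per frame
            indices.append(idx)
            residual = residual - codebook[idx]         # pass the remainder to the next level
        return np.stack(indices, axis=1)                # shape (T_a, Q), values in {0, ..., N-1}

    # 16 kHz audio, 320x downsampling -> 50 frames/s; N = 1024, Q = 4 as quoted above.
    # D and the codebook contents are arbitrary stand-ins.
    sample_rate, downsample, N, Q, D = 16_000, 320, 1024, 4, 128
    rng = np.random.default_rng(0)
    T_a = (3 * sample_rate) // downsample               # 3 s of audio -> 150 frames
    frames = rng.standard_normal((T_a, D))
    codebooks = [rng.standard_normal((N, D)) for _ in range(Q)]

    print(rvq_encode(frames, codebooks).shape)          # (150, 4)
    ```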

  • w2v-BERT: trained with a masked language modeling (MLM) loss plus a contrastive loss. Features from an intermediate layer of the pre-trained MLM stack are clustered with k-means, and each frame is mapped to the index of its nearest centroid, which serves as its semantic token; the overall downsampling rate is 640x. (Experiments found that normalizing the w2v-BERT features before k-means clustering better preserves phonetic information.) Features extracted with a HuBERT-style model are essentially of the same nature. A minimal extraction sketch follows.
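    Below is a minimal Python sketch (using scikit-learn) of that extraction step: normalize the intermediate-layer features, fit k-means, and take the nearest-centroid index per frame. The feature matrix is random stand-in data and the cluster count is kept tiny so the toy runs quickly; in the real pipeline the features come from the frozen w2v-BERT model and the codebook is much larger.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    feats = rng.standard_normal((2000, 768))                  # (frames, feature_dim) stand-in

    # normalize before clustering (reported to better preserve phonetic information)
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)

    # fit k-means on the (normalized) features; toy cluster count for speed
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats)

    semantic_tokens = kmeans.predict(feats)                   # one discrete index per frame
    print(semantic_tokens[:10])
    ```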

Hierarchical modeling of semantic and acoustic tokens

Semantic and acoustic tokens are predicted hierarchically: first the semantic tokens for the whole sequence, then the acoustic tokens conditioned on the semantic tokens. The main reasons are: (1) $p(z_t \mid z_{<t}, y_{<t}) \approx p(z_t \mid z_{<t})$, i.e. given past semantic tokens, the current semantic token is approximately conditionally independent of the acoustic tokens; (2) the sequence length at each stage is shortened (flattening the multi-level acoustic tokens into one sequence would give a very long length of $T_a \times Q$), which reduces computation.
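To see the length argument in numbers, here is a quick back-of-the-envelope calculation for 10 s of 16 kHz audio, assuming the rates quoted earlier (640x for semantic tokens; 320x and Q = 4 for acoustic tokens, with two coarse levels as in the staging described below):

```python
seconds, sample_rate = 10, 16_000
num_samples = seconds * sample_rate

T_s = num_samples // 640              # semantic tokens              -> 250
T_a = num_samples // 320              # acoustic frames              -> 500
Q, Q_coarse = 4, 2

flat_all   = T_a * Q                  # one flat acoustic sequence   -> 2000
stage2_len = T_a * Q_coarse           # coarse acoustic tokens       -> 1000
stage3_len = T_a * (Q - Q_coarse)     # fine acoustic tokens         -> 1000

print(T_s, flat_all, stage2_len, stage3_len)
```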


  • Stage 1: predict semantic tokens.
  • Stage 2: predict coarse acoustic tokens (the first two RVQ levels) autoregressively, conditioned on the semantic tokens.
  • Stage 3: predict fine acoustic tokens (the remaining RVQ levels), conditioned on the coarse acoustic tokens.
  • Separating stages 2 and 3 keeps each sequence short. In addition, because fine tokens depend mainly on nearby coarse tokens, stage 3 can operate on chunks whose length is independent of the total audio length, and more RVQ levels can be used to capture fine detail (see the sketch below this list).
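A high-level sketch of the resulting three-stage data flow is below. The per-stage models and the codec decoder are random stubs standing in for the stage Transformers and the SoundStream decoder; only the shapes and the conditioning structure follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, Q_coarse = 1024, 4, 2          # codebook size and quantizer split as quoted above

def stage1_semantic(prompt_semantic, steps):
    # autoregressive continuation of semantic tokens (stubbed with random draws)
    return np.concatenate([prompt_semantic, rng.integers(0, N, steps)])

def stage2_coarse(semantic_tokens, T_a):
    # coarse acoustic tokens conditioned on the full semantic sequence (stub)
    return rng.integers(0, N, (T_a, Q_coarse))

def stage3_fine(coarse_tokens):
    # fine acoustic tokens conditioned on the coarse ones, chunk by chunk (stub)
    return rng.integers(0, N, (coarse_tokens.shape[0], Q - Q_coarse))

def soundstream_decode(acoustic_tokens):
    # stand-in for the neural codec decoder: tokens -> waveform samples (320x upsampling)
    return rng.standard_normal(acoustic_tokens.shape[0] * 320)

prompt = rng.integers(0, N, 75)                  # ~3 s prompt at 25 semantic tokens/s
semantic = stage1_semantic(prompt, steps=175)    # 250 tokens total = 10 s
coarse = stage2_coarse(semantic, T_a=500)        # 10 s at 50 acoustic frames/s
fine = stage3_fine(coarse)
audio = soundstream_decode(np.concatenate([coarse, fine], axis=1))
print(audio.shape)                               # (160000,) i.e. 10 s at 16 kHz
```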

Origin blog.csdn.net/qq_40168949/article/details/130427104