Paper Reading_Audio Generation_AudioLM

Paper information

name_en: AudioLM: a Language Modeling Approach to Audio Generation
name_ch: AudioLM: A Language Modeling Method for Audio Generation
paper_addr: http://arxiv.org/abs/2209.03143
doi: https://doi.org/10.48550/arXiv.2209.03143
date_read: 2023-04-25
date_publish: 2022-09-07
tags: ['speech synthesis','deep learning']
author: Zalán Borsos
citation: 36
demo: https://google-research.github.io/seanet/audiolm/examples

1 Feedback

It mainly addresses two problems in speech generation: long-term consistency and high audio quality.

2 Summary

This is a framework for generating high-quality audio with long-term consistency. It first converts the input audio into a sequence of discrete tokens and then casts audio generation as a language modeling task in this discrete representation space. A hybrid tokenization scheme is proposed to balance reconstruction quality and long-term dependency structure.

Masked language modeling (via w2v-BERT) is used to capture long-range structure, while discrete neural codec tokens enable high-quality synthesis. Given a short prompt, the model generates natural, coherent, continuous speech. Trained on large amounts of audio without any textual annotation, AudioLM produces syntactically and semantically plausible speech continuations while preserving the identity and prosody of speakers unseen during training. It can also generate piano music.

3 Introduction

Working purely from unlabeled audio, the framework is built on the Transformer architecture. The specific techniques used include adversarial neural audio compression, self-supervised representation learning, and language modeling. Learning interactions at different time scales ensures consistency of the generated speech.

Contributions

  • The AudioLM framework is proposed, combining semantic and acoustic tokens in a hierarchical manner to achieve long-term consistency and high-quality audio generation.
  • By comparing w2v-BERT and SoundStream tokens, it is shown that their strengths are complementary: semantic tokens offer phonetic discriminability, while acoustic tokens offer high reconstruction quality.
  • The model generates syntactically and semantically coherent speech without any text annotation. Given only a 3 s prompt, it can continue speech from speakers not seen during training while preserving the speaker's voice, prosody, and recording conditions (reverberation, noise).
  • Beyond human speech, it can also synthesize music whose melody, harmony, pitch, and rhythm are consistent with the prompt.
  • To mitigate the potential risks posed by generated speech, a classifier is also proposed to distinguish synthetic audio from real audio.

4 Model

Acoustic tokens are produced by SoundStream, and semantic tokens are extracted from an intermediate layer of w2v-BERT.

4.1 Components

  • Map the input audio x to a sequence of discrete tokens y from a finite vocabulary: y = enc(x).
  • Use a decoder-only Transformer over y to predict token y_t from the preceding tokens y_<t (generation is autoregressive at inference time).
  • Decode the predicted tokens back to audio: x^ = dec(y^). A minimal sketch of this pipeline follows the list.
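To make the three components concrete, here is a minimal, self-contained Python sketch. The codebook, frame size, and uniform "language model" are toy placeholders for the SoundStream/w2v-BERT tokenizers and a decoder-only Transformer, not the paper's actual components.

```python
# Toy sketch of the enc -> LM -> dec pipeline; all components are placeholders.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 1024                              # size of the discrete token vocabulary
CODEBOOK = rng.normal(size=(VOCAB_SIZE, 16))   # toy embedding table

def enc(x: np.ndarray) -> np.ndarray:
    """y = enc(x): map audio frames to nearest-codeword indices (toy tokenizer)."""
    frames = x.reshape(-1, 16)                 # pretend each 16 samples form a frame
    dists = ((frames[:, None, :] - CODEBOOK[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)                # discrete tokens y

def lm_predict(y_prefix: np.ndarray, steps: int) -> np.ndarray:
    """Autoregressively extend y_prefix; a real model samples from p(y_t | y_<t)."""
    y = list(y_prefix)
    for _ in range(steps):
        y.append(rng.integers(VOCAB_SIZE))     # uniform sampling stands in for a Transformer
    return np.array(y)

def dec(y: np.ndarray) -> np.ndarray:
    """x^ = dec(y^): map tokens back to audio via the codebook (toy decoder)."""
    return CODEBOOK[y].reshape(-1)

x = rng.normal(size=16 * 50)                   # 50 frames of fake audio
y = enc(x)
y_hat = lm_predict(y[:10], steps=40)
x_hat = dec(y_hat)
print(y.shape, y_hat.shape, x_hat.shape)
```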

4.2 Tradeoffs in Discrete Audio Representation

The goal is to use as few tokens as possible while preserving the quality of the generated sound, which involves a trade-off between bit rate and sequence length. Semantic tokens and acoustic tokens are introduced for this purpose, as shown in Figure 1. Their generation is decoupled: semantic tokens capture long-term temporal dependencies, while acoustic tokens guarantee high sound quality and are generated with the semantics as conditioning.

Acoustic tokens are computed with SoundStream, which uses residual vector quantization (RVQ) to compress and discretize the embeddings, mapping them to codebook entries; a sketch follows.
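As an illustration of how RVQ discretizes an embedding, here is a minimal sketch with random (untrained) codebooks; the dimensions and number of quantizers are arbitrary choices, not SoundStream's actual configuration.

```python
# Minimal residual vector quantization (RVQ) sketch with toy codebooks.
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, NUM_QUANTIZERS = 8, 256, 4
codebooks = rng.normal(size=(NUM_QUANTIZERS, CODEBOOK_SIZE, DIM))

def rvq_encode(embedding: np.ndarray) -> list:
    """Return one code index per quantizer level for a single frame embedding."""
    residual, codes = embedding.copy(), []
    for q in range(NUM_QUANTIZERS):
        dists = ((codebooks[q] - residual) ** 2).sum(axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - codebooks[q][idx]   # each level quantizes what is left over
    return codes

def rvq_decode(codes: list) -> np.ndarray:
    """Sum the selected codewords to reconstruct the frame embedding."""
    return sum(codebooks[q][idx] for q, idx in enumerate(codes))

frame = rng.normal(size=DIM)
codes = rvq_encode(frame)
approx = rvq_decode(codes)
print(codes, float(((frame - approx) ** 2).mean()))
```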

Semantic tokens are computed using w2v-BERT. The model learns audio representations in a self-supervised way, mapping the input waveform to an embedding space rich in linguistic features. This is achieved by training with two self-supervised objectives: a masked language modeling (MLM) loss and a contrastive loss. Semantic tokens are extracted by selecting an intermediate layer of the MLM module of w2v-BERT, computing its embeddings, clustering them, and using the cluster-center indices as the semantic tokens (sketched below).
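A rough sketch of this clustering step, with random arrays standing in for w2v-BERT activations and an arbitrary number of clusters:

```python
# K-means over intermediate-layer embeddings; cluster indices become semantic tokens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(2000, 64))   # placeholder [frames, hidden_dim] activations

# Fit the cluster centers once on training data; K controls token granularity.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_embeddings)

def semantic_tokens(embeddings: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest cluster centroid."""
    return kmeans.predict(embeddings)

new_utterance = rng.normal(size=(120, 64))       # embeddings for a new utterance
z = semantic_tokens(new_utterance)
print(z[:20])
```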

Experiments show that combining the two decoupled token types works better than relying on either one alone.

4.3 Hierarchical Modeling of Semantic and Acoustic Markers

Generating the semantic tokens first, and then generating high-quality audio conditioned on them, has two advantages:

  • The semantic tokens can be modeled independently of the acoustic tokens.
  • The token sequence handled at each stage is shorter, making training and inference more efficient.

The specific implementation is shown in Figure 2 and consists of three stages:

  • Semantic modeling for long-term structural consistency: exploiting the advantage above, the semantic tokens z are predicted autoregressively.
  • Coarse acoustic modeling conditioned on semantic tokens: the coarse acoustic tokens y are predicted from the preceding coarse tokens and the semantic tokens.
  • Fine acoustic modeling: fine acoustic tokens are generated conditioned on the coarse acoustic tokens y and the preceding fine tokens, yielding high-quality audio.
    The SoundStream embeddings have twice the frame rate of the w2v-BERT embeddings. Splitting the acoustic modeling into two stages also limits the sequence length each model has to handle. A sketch of the three stages follows this list.
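The sketch below illustrates how the three stages chain together. The frame rates, number of RVQ levels, and the random sampler are illustrative assumptions standing in for the trained Transformer stages.

```python
# Three-stage hierarchy: semantic -> coarse acoustic -> fine acoustic tokens.
import numpy as np

rng = np.random.default_rng(0)

def ar_sample(prefix: np.ndarray, n_new: int, vocab: int = 1024) -> np.ndarray:
    """Stand-in for autoregressive sampling: append n_new tokens after the prefix."""
    return np.concatenate([prefix, rng.integers(vocab, size=n_new)])

# Stage 1: semantic modeling (long-term structure).
z = ar_sample(prefix=np.array([], dtype=int), n_new=250)   # semantic tokens

# Stage 2: coarse acoustic modeling, conditioned on the semantics by prefixing z.
n_frames = 2 * len(z)                # acoustic frame rate is twice the semantic rate
coarse_levels = 4                    # illustrative: first few RVQ levels per frame
seq2 = ar_sample(prefix=z, n_new=n_frames * coarse_levels)
y_coarse = seq2[len(z):]             # keep only the newly generated acoustic tokens

# Stage 3: fine acoustic modeling, conditioned on the coarse tokens only.
fine_levels = 8                      # illustrative: remaining RVQ levels per frame
seq3 = ar_sample(prefix=y_coarse, n_new=n_frames * fine_levels)
y_fine = seq3[len(y_coarse):]

print(len(z), len(y_coarse), len(y_fine))
```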

4.4 Inference

After training, AudioLM can be used to generate audio; the following three settings were tested:

4.4.1 Unconditional generation

All semantic tokens z^ are sampled unconditionally and then used as the conditioning for acoustic modeling. This experiment shows that the model can generate diverse, syntactically and semantically consistent linguistic content, and verifies the independence of the semantic representation from the acoustic one.

4.4.2 Acoustic generation

Acoustic tokens are generated conditioned on ground-truth semantic tokens z extracted from a test utterance x. The generated audio differs in speaker identity, but its semantic content matches that of x, which shows that the semantic tokens capture the semantic content.

4.4.3 Speech continuation

A continuation is generated from a short prompt x. The prompt is first mapped to its semantic tokens z and coarse acoustic tokens y. In the first stage, a continuation of the semantic tokens is generated; in the second stage, the generated semantic tokens are concatenated with the prompt's coarse acoustic tokens y and fed to the coarse acoustic model as conditioning; in the third stage, the coarse acoustic tokens are refined by the fine acoustic model; finally, both the prompt's and the sampled acoustic tokens are fed to the SoundStream decoder to reconstruct the waveform x^. A rough sketch of this procedure follows.
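The sketch below reuses the staging idea from the previous snippet for the continuation setting; all token counts and models are placeholders, not the paper's configuration.

```python
# Continuation sketch: prompt tokens form the prefix at each stage.
import numpy as np

rng = np.random.default_rng(0)

def continue_tokens(prefix: np.ndarray, n_new: int, vocab: int = 1024) -> np.ndarray:
    """Placeholder for a trained stage model that would condition on the prefix."""
    return rng.integers(vocab, size=n_new)

# Tokens extracted from the short prompt x (placeholders for w2v-BERT / SoundStream output).
z_prompt = rng.integers(1024, size=75)           # semantic tokens of the prompt
y_prompt_coarse = rng.integers(1024, size=600)   # coarse acoustic tokens of the prompt

# Stage 1: continue the semantic tokens past the prompt.
z_new = continue_tokens(z_prompt, n_new=250)

# Stage 2: condition on (prompt + generated) semantics plus the prompt's coarse tokens.
stage2_prefix = np.concatenate([z_prompt, z_new, y_prompt_coarse])
y_new_coarse = continue_tokens(stage2_prefix, n_new=2000)

# Stage 3: refine the coarse acoustic tokens into fine acoustic tokens.
y_new_fine = continue_tokens(np.concatenate([y_prompt_coarse, y_new_coarse]), n_new=4000)

# Finally, the prompt's and the sampled acoustic tokens would be fed to the
# SoundStream decoder to reconstruct the waveform x^ (decoder not shown here).
print(len(z_new), len(y_new_coarse), len(y_new_fine))
```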
