SLT2021: LEARN2SING: TARGET SPEAKER SINGING VOICE SYNTHESIS BY LEARNING FROM A SINGING TEACHER

0. Title

LEARN2SING: TARGET SPEAKER SINGING VOICE SYNTHESIS BY LEARNING FROM A SINGING TEACHER

Learn to sing: The target speaker learns to sing from a singing teacher (singing voice synthesis)

1. Summary

Singing voice synthesis has received more and more attention as the field of speech synthesis develops rapidly. Generally, to generate natural singing voices from lyrics, sheet music, and other music-related annotations, a studio-quality singing corpus is required. However, such a corpus is difficult to collect, because most people cannot sing like professional singers. In this paper, we propose Learn2Sing, a method that only requires a singing teacher to generate the singing voice of a target speaker, without any singing data from the target (student) speaker. In our method, the teacher's singing corpus and a multi-speaker speech corpus are trained jointly in an autoregressive synthesis framework, sharing the speaker embedding structure and space as well as the prosodic label embedding. At the same time, since the target speaker has no music-related transcriptions, we use log-scale fundamental frequency (LF0) as an auxiliary feature in the input of the acoustic model to establish a unified input representation. To enable the target speaker to synthesize singing without any reference singing audio at inference time, a duration model and an LF0 prediction model are also trained. In particular, we apply domain adversarial training (DAT) in the acoustic model, with the aim of improving the singing performance of the target speaker by disentangling style information from the acoustic features of the singing and speaking data. Our experiments show that this method can synthesize singing voices for the target speaker given only samples of normal speech.

Keywords: text-to-singing, singing voice synthesis, auto-regressive model

2. Introduction

Singing Voice Synthesis (SVS) is mainly dedicated to generating singing voices from lyrics and music scores, where the lyrics provide linguistic information and the music score conveys pitch and rhythm information.

To obtain a satisfactory singing voice for a target speaker, a large number of singing recordings with corresponding lyrics and music scores are needed [9]. However, such a singing corpus is more difficult and expensive to collect and label than a speech corpus. A straightforward solution is to adapt a multi-speaker singing model with a small amount of singing data from the target speaker [10]. However, many speakers are not good at singing, which makes it even harder to build a singing voice synthesis system for an arbitrary target speaker.

 

Singing Voice Conversion (SVC) can also produce the target speaker's singing voice. It converts a source singing voice into the target speaker's timbre while keeping the linguistic content unchanged:

  • As a recent SVC method for this goal, DurIAN-SC [11] uses a unified speech and singing synthesis framework based on the previously proposed DurIAN TTS architecture [12]. It can generate a high-quality singing voice for the target speaker using only his/her normal speech data. However, DurIAN-SC relies on reference singing audio from a source speaker to generate the singing voice of the target speaker.
  • Mellotron [13] is a multi-speaker speech synthesis model based on Tacotron2-GST [14]. By explicitly conditioning on rhythm and a continuous pitch contour, it can make a target speaker sing without that speaker's singing data in training. Similar to DurIAN-SC, the pitch contour is extracted from reference audio, which means that Mellotron cannot obtain the target speaker's singing voice without reference singing audio.

 

In this paper, we propose Learn2Sing, an integrated system that synthesizes singing voices for a target speaker and can be built without his/her singing data. The proposed model can generate natural singing of the target speaker from lyrics and notes without any reference singing audio. Inspired by the above methods (especially DurIAN-SC), we build the SVS system for the target speaker with the help of a singing teacher. Here, the singing teacher is a singing corpus from a professional singer, while the target speaker only has a small set of speech recordings with corresponding text. To let the target speaker learn the singing style from the singing teacher, we train a unified frame-level autoregressive acoustic model on the speech and singing data together, with specially designed encoder, decoder, and other structures.
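The paper (and this summary) does not include code, but a minimal sketch of how shared speaker and style embeddings could be wired into such a unified encoder might look as follows. Module names, dimensions, and the GRU backbone are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Sketch of an encoder shared by speech and singing data.

    Speaker and style embeddings live in shared spaces across the two
    corpora, so the target speaker can be paired with the "sing" style
    tag at inference time. All sizes are illustrative.
    """

    def __init__(self, n_phonemes=100, n_speakers=10, d_model=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)  # shared speaker space
        self.style_emb = nn.Embedding(2, d_model)             # 0 = speak, 1 = sing
        # frame-level auxiliary inputs: LF0 and relative frame position
        self.aux_proj = nn.Linear(2, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, phoneme_ids, lf0, frame_pos, speaker_id, style_id):
        # phoneme_ids: (B, T) frame-aligned phoneme indices
        # lf0, frame_pos: (B, T) continuous auxiliary features
        aux = self.aux_proj(torch.stack([lf0, frame_pos], dim=-1))
        x = self.phoneme_emb(phoneme_ids) + aux
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)
        x = x + self.style_emb(style_id).unsqueeze(1)
        out, _ = self.rnn(x)
        return out  # (B, T, d_model), fed to the autoregressive decoder
```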

 

However, the target speaker has no such music-related annotations, e.g., music scores or descriptions of the fundamental frequency, and the text representations of speech and singing data are also very different, which makes it difficult to build the unified model described above. To obtain a unified input representation for speech and singing data, we keep only the attributes common to both types of data. However, the pitch in the music score is essential for singing, so we further use log-scale F0 (LF0) as an auxiliary feature to unify the input representation of the target speaker and the singing teacher.
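As a rough illustration of this auxiliary feature, LF0 can be obtained from an F0 contour by taking the logarithm of voiced frames and interpolating over unvoiced ones. The interpolation scheme here is an assumption; the paper summary only states that LF0 is used as a unified input feature.

```python
import numpy as np

def to_lf0(f0: np.ndarray) -> np.ndarray:
    """Convert an F0 contour (Hz, 0 for unvoiced frames) to continuous LF0.

    Voiced frames are mapped to log(F0); unvoiced frames are filled by
    linear interpolation so the acoustic model sees a continuous input.
    """
    lf0 = np.zeros_like(f0, dtype=np.float64)
    voiced = f0 > 0
    if not voiced.any():
        return lf0
    lf0[voiced] = np.log(f0[voiced])
    # interpolate over unvoiced regions between voiced frames
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], lf0[voiced])

# example: a contour with unvoiced gaps
f0 = np.array([0.0, 220.0, 222.0, 0.0, 0.0, 230.0, 0.0])
print(to_lf0(f0))
```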

 

In addition, there is a large difference in pronunciation between speaking and singing, which clearly affects the singing performance of the target speaker. To reduce the gap between the two domains, we apply domain adversarial training (DAT) [15, 16] in the autoregressive decoder to obtain latent features that are independent of style, where a gradient reversal layer (GRL) is followed by a style classifier network, as sketched below. To control the style domain of the generated audio (speaking or singing), a binary style tag is fed to the encoder to provide style information during training and inference. DAT has also been used to disentangle speaker information in multi-speaker models and has been applied to singing voice synthesis [17].
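A minimal sketch of a gradient reversal layer and style classifier in the spirit of this DAT setup is shown below; layer sizes and the reversal weight `lambd` are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class StyleClassifier(nn.Module):
    """Predicts speak/sing from decoder hidden states through a GRL.

    Training the classifier while reversing its gradients pushes the
    decoder's latent features to become style-independent.
    """

    def __init__(self, d_hidden=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(d_hidden, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def forward(self, decoder_hidden):
        h = GradReverse.apply(decoder_hidden, self.lambd)
        return self.net(h)  # style logits, trained with cross-entropy
```

During training, the classifier's cross-entropy loss would be added to the synthesis loss; the reversed gradients discourage the decoder from encoding style information in its latent features.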

 

The proposed Learn2Sing model requires F0 and duration at inference time, so we also build a duration model and an LF0 prediction model. Finally, a multi-band WaveRNN vocoder is used to reconstruct the waveform.
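Putting the components together, inference for the target speaker could proceed roughly as follows. The function and parameter names are placeholders, since this summary does not describe the actual interfaces.

```python
def synthesize_singing(lyrics, score, target_speaker_id,
                       duration_model, lf0_model, acoustic_model, vocoder):
    """Hypothetical end-to-end inference flow for Learn2Sing."""
    # 1. Predict phoneme durations from the music score (PhonemeID, Slur, NoteDur).
    durations = duration_model.predict(lyrics, score)
    # 2. Predict frame-level LF0 from the score and the predicted durations.
    lf0 = lf0_model.predict(lyrics, score, durations)
    # 3. Run the acoustic model with StyleTag fixed to "sing" for the target speaker.
    acoustic_features = acoustic_model.predict(
        lyrics, lf0, durations,
        speaker_id=target_speaker_id, style_tag="sing")
    # 4. Reconstruct the waveform with the multi-band WaveRNN vocoder.
    return vocoder.generate(acoustic_features)
```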

 

Contributions:

  • We propose a singing voice synthesis system, Learn2Sing, which can generate the singing voice of a target speaker without using his/her singing data for system training.
  • Unlike previous methods, it can generate the target speaker's singing voice without any reference audio.
  • The proposed domain adversarial training strategy improves the naturalness and expressiveness of the synthesized singing by learning style-independent latent features, which narrow the gap between speaking and singing, while a style tag is used to control the style of the target speaker's synthesized voice.

3. Others-easy to understand

The PhonemeID and FramePos attributes can be obtained easily from both the speech corpus and the singing corpus. FramePos is a value in [0, 1] that represents the relative position of the current frame within its phoneme. Pitch is an attribute specific to the music score that indicates the pitch of the singing voice, and Slur represents the slur (note connection) concept in the music score. As in typical SVS systems [9, 7], our duration model (DM) takes [PhonemeID, Slur, NoteDur] as input to predict the phoneme duration (PhonemeDur), because the duration of each phoneme depends only on the music score; here NoteDur is the duration of each note computed from the BPM (beats per minute) of the score. For the acoustic model (AM), unlike typical systems [9, 7] that take [PhonemeID, Pitch, FramePos] as input, we replace the Pitch attribute with the continuous LF0 predicted by the LF0 model, because the unified AM for speaking and singing also requires a unified input. In addition, we feed two extra tags to the AM, SpeakerID and StyleTag, so that singing and speaking can be clearly distinguished during training; at inference time, we simply set the StyleTag to "sing" for the target speaker so that a singing voice is generated. The LF0 prediction model takes [PhonemeID, Pitch, Slur, FramePos] as input to estimate frame-level LF0 values, where these features come from the music score (with FramePos derived from the predicted phoneme durations).
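To summarize the above input/output layout in one place, a purely illustrative mapping follows; the feature names are from the paper, but the dict structure itself is just for clarity.

```python
# Illustrative summary of the three models' inputs and outputs.
MODEL_IO = {
    "duration_model": {
        "inputs": ["PhonemeID", "Slur", "NoteDur"],  # NoteDur derived from BPM
        "output": "PhonemeDur",
    },
    "lf0_model": {
        "inputs": ["PhonemeID", "Pitch", "Slur", "FramePos"],
        "output": "frame-level LF0",
    },
    "acoustic_model": {
        "inputs": ["PhonemeID", "LF0", "FramePos", "SpeakerID", "StyleTag"],
        "output": "acoustic features",
    },
}
```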

4. Others-not easy to understand

I didn't read the specific text and loss formula~

 

 


Origin blog.csdn.net/u013625492/article/details/112985016