Introduction to WORLD, a speech synthesis system, and its Python wrapper pyworld

In this article, I introduce WORLD, a tool commonly used for speech synthesis and voice conversion.

Speech synthesis and voice conversion

With voice assistants such as Google Home and Amazon Echo now in everyday use, voice is likely to become an increasingly important interface in the future.

One of the key technologies behind voice interfaces is speech synthesis (text to speech). On a Mac, you can try it with:

$ say 'Good morning'

The Mac will say "good morning" aloud. However, the say command produces a distinctly mechanical, unnatural delivery, which becomes obvious as soon as you type a slightly longer sentence. Researchers have long studied how to generate human-sounding speech, and with the development of deep learning it has recently become possible to produce speech that is hard to distinguish from a human voice.

A technology closely related to speech synthesis is voice conversion. Here the input is a voice rather than text, and the goal is something like "replace Hillary's voice with Trump's voice". Uses in entertainment come to mind immediately, but there are also applications such as giving people who have lost the ability to speak a voice that sounds like their original one, so this research also has social significance.

Difficulties in speech synthesis and voice conversion

In practice, speech synthesis and voice conversion are very difficult.
As a simple example, consider raising a sound by one octave. Sound is just a wave, so to get a higher pitch you increase its frequency. The easiest way is to play the recording at 2x speed, but that raises the pitch and doubles the tempo at the same time.
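The effect of 2x playback can be checked numerically. The sketch below is only an illustration on a synthetic tone: the sampling rate, the 220 Hz tone, and the helper name `dominant_frequency` are assumptions made for this example, and "2x speed" is modeled as keeping every second sample.

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Return the frequency (Hz) of the strongest FFT bin of `signal`."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float(freqs[np.argmax(spectrum)])

fs = 16000                                # assumed sampling rate (Hz)
t = np.arange(fs) / fs                    # one second of time stamps
tone = np.sin(2 * np.pi * 220.0 * t)      # synthetic 220 Hz test tone

# Playing at 2x speed at the same sampling rate is modeled here by
# keeping every second sample (safe for this pure low-frequency tone).
double_speed = tone[::2]

original_hz = dominant_frequency(tone, fs)
double_speed_hz = dominant_frequency(double_speed, fs)
original_seconds = len(tone) / fs
double_speed_seconds = len(double_speed) / fs

print(original_hz, double_speed_hz)
print(original_seconds, double_speed_seconds)
```

The pitch doubles, which is one octave, but the clip is also half as long, which is exactly the side effect described above.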

So how can you raise the pitch without changing the tempo? The first idea that comes to mind is to cut the 2x-speed sound into short segments and play each segment twice.
In other words, every short fragment of the sped-up sound is heard twice in a row, so the total length returns to the original.
However, applying this method naively produces a very unpleasant sound, because the waveform does not join smoothly at the segment boundaries.
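The discontinuity problem can be made visible with a small experiment. This is a rough sketch of the naive "cut and repeat" idea on a synthetic tone, not a method anyone should use in practice; the function names, the 20 ms segment length, and the test tone are assumptions chosen for illustration.

```python
import numpy as np

def naive_octave_up(signal, segment_len):
    """Decimate by 2 (2x speed), then play every short segment twice."""
    fast = signal[::2]                                   # pitch and tempo both double
    n_segments = len(fast) // segment_len
    segments = fast[: n_segments * segment_len].reshape(n_segments, segment_len)
    # Repeat each row twice: s0, s0, s1, s1, ... then flatten back to 1-D.
    return np.repeat(segments, 2, axis=0).reshape(-1)

def max_jump(signal):
    """Largest absolute difference between two neighbouring samples."""
    return float(np.max(np.abs(np.diff(signal))))

fs = 16000                                 # assumed sampling rate (Hz)
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220.0 * t)       # smooth 220 Hz test tone

fast = tone[::2]                           # smooth 440 Hz tone, half as long
naive = naive_octave_up(tone, segment_len=320)   # 320 samples = 20 ms of `fast`

print(len(tone), len(naive))               # same length as the original
print(max_jump(fast), max_jump(naive))     # the naive version has much larger jumps
```

The large sample-to-sample jumps appear where the end of a segment is glued to its own beginning; they are heard as clicks and buzzing.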

As this shows, even just changing the pitch of a sound requires taking many things into account, and simple operations applied directly to the waveform do not work well.

Extract features from speech

As described above, modifying the voice waveform directly is difficult, so the long-established approach is to convert the voice into features, process those features, and then convert them back into a waveform. The following are some of the acoustic features used in speech synthesis and voice conversion:

Fundamental frequency (F0): represents the base pitch of the voice
Spectral envelope: a smoothed version of the spectrum, which represents the timbre
Aperiodicity: represents the influence of irregular vocal cord vibration and mixed-in noise

Each feature can be manipulated independently. If you want to change the pitch of the voice, you modify F0. If you want to change the character and timbre of the voice, you modify the spectral envelope.
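To make the notion of F0 a little more concrete, here is a toy autocorrelation-based estimator run on a synthetic harmonic signal. It is only a sketch for intuition and is not the algorithm WORLD uses; the function name, the 80 to 400 Hz search range, and the test signal are all assumptions made for this example.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=80.0, f0_max=400.0):
    """Toy F0 estimate: pick the lag with the highest autocorrelation."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags 0 .. N-1 of the full autocorrelation.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)             # shortest period considered
    lag_max = int(fs / f0_min)             # longest period considered
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return fs / best_lag

fs = 16000                                 # assumed sampling rate (Hz)
t = np.arange(2048) / fs
# Synthetic "voiced" frame: 150 Hz fundamental plus two weaker harmonics.
frame = (np.sin(2 * np.pi * 150 * t)
         + 0.5 * np.sin(2 * np.pi * 300 * t)
         + 0.25 * np.sin(2 * np.pi * 450 * t))

estimated_f0 = estimate_f0_autocorr(frame, fs)
print(estimated_f0)                        # close to 150 Hz
```

Real speech is much messier than this signal (noise, unvoiced segments, octave errors), which is part of why dedicated estimators exist.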

WORLD, a tool for speech synthesis and voice conversion

The features described above cannot simply be read off the signal; they are quantities estimated from the original waveform using a specific model. Implementing feature extraction from scratch therefore requires a deep understanding of the field.

WORLD is a useful tool for extracting such features from speech and for synthesizing waveforms from features. Unlike STRAIGHT, the earlier software that is widely used in this field, it is available under a BSD license.

WORLD

WORLD is officially implemented in C++ and MATLAB, but there is also a Python wrapper called PyWorldVocoder. PyWorldVocoder is easy to install with pip, so we will use it here.

$ pip install scipy   # install scipy first; otherwise installing pyworld may run into problems
$ pip install pyworld

Trying out WORLD

Let's actually use WORLD. Here, we extract features from speech and then resynthesize speech from the extracted features.

The first step is to read the wav file.

import numpy as np
from scipy.io import wavfile
import pyworld as pw

WAV_FILE = 'path_to_the_wav_file'

fs, data = wavfile.read(WAV_FILE)
data = data.astype(np.float64)  # WORLD works on float64 data (np.float was removed in NumPy 1.24)
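One practical detail: as far as I know, pyworld expects a one-dimensional float64 array, so a stereo file needs to be mixed down first. The helper below is a hypothetical convenience function (its name and the scaling to roughly [-1, 1] are my own choices, and it assumes signed integer PCM such as int16), not part of pyworld or scipy.

```python
import numpy as np

def to_mono_float64(samples):
    """Convert WAV samples to a 1-D, C-contiguous float64 array."""
    samples = np.asarray(samples)
    if np.issubdtype(samples.dtype, np.integer):
        # Scale signed integer PCM (e.g. int16) into roughly [-1.0, 1.0).
        scale = float(np.iinfo(samples.dtype).max) + 1.0
        samples = samples.astype(np.float64) / scale
    else:
        samples = samples.astype(np.float64)
    if samples.ndim == 2:
        # Multi-channel data is shaped (n_samples, n_channels): average the channels.
        samples = samples.mean(axis=1)
    return np.ascontiguousarray(samples)

# Example with a tiny fake stereo int16 recording (two samples, two channels).
stereo = np.array([[16384, -16384], [32767, 32767]], dtype=np.int16)
mono = to_mono_float64(stereo)
print(mono.dtype, mono.shape)
```

With a real file, it could be applied as `data = to_mono_float64(data)` right after `wavfile.read`. The scaling is optional for WORLD itself; it only keeps the values in a conventional range.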

Then use pyworld to extract features.

_f0, t = pw.dio(data, fs)  # estimate the fundamental frequency (F0)
f0 = pw.stonemask(data, _f0, t, fs)  # refine the F0 estimate
sp = pw.cheaptrick(data, f0, t, fs)  # extract the spectral envelope
ap = pw.d4c(data, f0, t, fs)  # extract the aperiodicity
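It helps to know what shapes to expect from these calls. The sketch below estimates them from the sample count, based on my reading of WORLD's defaults (a 5 ms frame period and an F0 floor of 71 Hz that determines the FFT size). Treat those defaults and the helper name `expected_world_shapes` as assumptions, and check `f0.shape` and `sp.shape` on your own data.

```python
import math

def expected_world_shapes(n_samples, fs, frame_period_ms=5.0, f0_floor=71.0):
    """Estimate the shapes of f0, sp and ap for a signal of n_samples."""
    # One frame every frame_period_ms milliseconds, plus the frame at t = 0.
    n_frames = int(1000.0 * n_samples / fs / frame_period_ms) + 1
    # FFT size: a power of two chosen from the sampling rate and the F0 floor.
    fft_size = 2 ** (1 + int(math.log2(3.0 * fs / f0_floor + 1)))
    n_bins = fft_size // 2 + 1             # only non-negative frequencies are kept
    return {"f0": (n_frames,), "sp": (n_frames, n_bins), "ap": (n_frames, n_bins)}

shapes = expected_world_shapes(n_samples=16000, fs=16000)   # one second at 16 kHz
print(shapes)
```

So F0 is one value per frame, while the spectral envelope and the aperiodicity are one spectrum per frame.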

Then use the synthesize method to synthesize speech from the extracted features.

synthesized = pw.synthesize(f0, sp, ap, fs)

Play WAV files

You can write the result to a file with scipy.io.wavfile.write and play that, but if you are using Jupyter, it is easier to use the IPython.display module.

import IPython.display as display

print('original')
display.display(
    display.Audio(data, rate=fs)
)

print('synthesized')
display.display(
    display.Audio(synthesized, rate=fs)
)

original.wav

synthesized.wav
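If you would rather write the result to a file with scipy.io.wavfile.write, it is usually safest to convert the float output to 16-bit PCM first. The helper below is a hypothetical sketch (the name `to_int16_pcm` and the choice to peak-normalize to 90% of full scale are my own assumptions); note that normalizing changes the loudness.

```python
import numpy as np

def to_int16_pcm(signal, peak=0.9):
    """Peak-normalize a float signal and convert it to int16 PCM."""
    signal = np.asarray(signal, dtype=np.float64)
    max_abs = np.max(np.abs(signal)) if signal.size else 0.0
    if max_abs > 0:
        signal = signal / max_abs * peak   # loudest sample ends up at `peak`
    return np.round(signal * 32767.0).astype(np.int16)

# Tiny example: the largest magnitude (2.0) maps to 90% of the int16 range.
pcm = to_int16_pcm(np.array([0.0, 2.0, -1.0]))
print(pcm.dtype, pcm)
```

The file could then be written with `wavfile.write('synthesized.wav', fs, to_int16_pcm(synthesized))`.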

Converting the voice

Now let's actually convert the voice. First, as in the earlier example, raise the voice by one octave. You can change the pitch of the voice by modifying F0.

high_freq = pw.synthesize(f0*2.0, sp, ap, fs)  # double F0 to raise the pitch by one octave

print('high_freq')
display.display(
    display.Audio(high_freq, rate=fs)
)

high-freq.wav
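Doubling F0 is the special case of a 12-semitone shift. A small sketch for arbitrary intervals follows; the helper name `shift_pitch` is hypothetical and the F0 values are made up. Frames where F0 is 0, which is how unvoiced frames are marked, stay at 0 because the shift is a pure multiplication.

```python
import numpy as np

def shift_pitch(f0, semitones):
    """Scale an F0 contour by a musical interval (12 semitones = one octave)."""
    return np.asarray(f0, dtype=np.float64) * 2.0 ** (semitones / 12.0)

f0_example = np.array([0.0, 100.0, 200.0])   # made-up contour; 0 marks an unvoiced frame
octave_up = shift_pitch(f0_example, 12)      # same as multiplying by 2
octave_down = shift_pitch(f0_example, -12)   # same as dividing by 2
fifth_up = shift_pitch(f0_example, 7)        # roughly a factor of 1.5
print(octave_up, octave_down, fifth_up)
```

The shifted contour can be passed to synthesize in place of `f0*2.0`, for example `pw.synthesize(shift_pitch(f0, 4), sp, ap, fs)`.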

A male voice now sounds somewhat like a female voice, but the result sounds unnatural. Next, let's hold F0 constant. As you can hear, the result is a mechanical, emotionless voice, like a robot.

robot_like_f0 = np.ones_like(f0) * 100  # a constant 100 Hz works reasonably well
robot_like = pw.synthesize(robot_like_f0, sp, ap, fs)

print('robot_like')
display.display(
    display.Audio(robot_like, rate=fs)
)

robot-like.wav
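One possible variation: setting every frame to 100 Hz also makes the unvoiced frames (where the extracted F0 is 0) voiced. The sketch below flattens only the voiced frames and leaves the unvoiced ones at 0. The helper name `flatten_f0` is hypothetical, and whether the result sounds better is a matter of taste.

```python
import numpy as np

def flatten_f0(f0, target_hz=100.0):
    """Set every voiced frame to target_hz; keep unvoiced frames (F0 == 0) at 0."""
    f0 = np.asarray(f0, dtype=np.float64)
    return np.where(f0 > 0, target_hz, 0.0)

f0_example = np.array([0.0, 120.0, 0.0, 95.0])   # made-up contour
flat = flatten_f0(f0_example)
print(flat)
```

It would be used the same way as above, for example `pw.synthesize(flatten_f0(f0), sp, ap, fs)`.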

Finally, let's change the timbre by modifying the spectrum envelope.

In addition to doubling F0, we stretch the spectral envelope by a factor of 1.2 along the frequency axis to make the voice sound more female.

female_like_sp = np.zeros_like(sp)
for f in range(female_like_sp.shape[1]):
    female_like_sp[:, f] = sp[:, int(f/1.2)]
female_like = pw.synthesize(f0*2, female_like_sp, ap, fs)

print('female_like')
display.display(
    display.Audio(female_like, rate=fs)
)

female-like.wav
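The loop above picks the nearest lower bin with `int(f/1.2)`. A vectorized variant with linear interpolation between neighbouring bins is sketched below. The function name `warp_envelope` is hypothetical, the test envelope is synthetic, and I make no claim that interpolation sounds noticeably better.

```python
import numpy as np

def warp_envelope(sp, ratio):
    """Stretch each frame of sp along the frequency axis by `ratio`."""
    sp = np.asarray(sp, dtype=np.float64)
    n_bins = sp.shape[1]
    # Output bin f reads from source position f / ratio (clipped to a valid range).
    src = np.clip(np.arange(n_bins) / ratio, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = src - lo
    warped = sp[:, lo] * (1.0 - frac) + sp[:, hi] * frac   # linear interpolation
    return np.ascontiguousarray(warped)

# Synthetic envelope: 3 frames, 513 bins, with a single peak at bin 50.
sp_example = np.full((3, 513), 0.1)
sp_example[:, 50] = 1.0
warped = warp_envelope(sp_example, 1.2)
print(warped.shape, int(np.argmax(warped[0])))   # the peak moves from bin 50 to bin 60
```

The result can be passed to synthesize in place of `female_like_sp`.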

In this way, by representing the voice as a set of features rather than as raw waveform data, the voice can be easily converted and played back.

Speech synthesis, voice conversion and machine learning

However, there are countless ways to transform each acoustic feature, so tuning the conversion by hand is very difficult. This is where machine learning comes in. Given enough pairs of source and converted data, machine learning adjusts the parameters automatically. In the past, methods based on HMMs, Gaussian mixture models, and NMF seem to have been the mainstream, but as mentioned at the beginning, methods based on deep learning have been actively studied recently.
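As a very small illustration of "adjusting parameters from paired data", the sketch below fits a linear mapping in the log-F0 domain with least squares. It is a toy, not any of the methods named above: the function names and the data are made up, and it assumes the source and target F0 contours are already time-aligned frame by frame.

```python
import numpy as np

def fit_log_f0_mapping(src_f0, tgt_f0):
    """Fit log(tgt) = slope * log(src) + intercept on frames voiced in both."""
    src_f0 = np.asarray(src_f0, dtype=np.float64)
    tgt_f0 = np.asarray(tgt_f0, dtype=np.float64)
    voiced = (src_f0 > 0) & (tgt_f0 > 0)
    x = np.log(src_f0[voiced])
    y = np.log(tgt_f0[voiced])
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(slope), float(intercept)

def apply_log_f0_mapping(f0, slope, intercept):
    """Convert an F0 contour with the fitted mapping; unvoiced frames stay 0."""
    f0 = np.asarray(f0, dtype=np.float64)
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp(slope * np.log(f0[voiced]) + intercept)
    return out

# Made-up aligned training pair: the "target speaker" is exactly one octave higher.
src = np.array([0.0, 100.0, 120.0, 150.0, 0.0, 200.0])
tgt = src * 2.0
slope, intercept = fit_log_f0_mapping(src, tgt)
converted = apply_log_f0_mapping(np.array([0.0, 110.0]), slope, intercept)
print(slope, intercept, converted)
```

Real systems learn far richer mappings, typically over the spectral envelope as well, but the principle of fitting parameters to example pairs is the same.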

Finally, I will introduce some recent techniques and then summarize this article.

Recent methods

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder: This is an example of applying a VAE (a deep learning technique) to voice conversion. It takes the spectrum and speaker information as input. The interesting point is that it uses a "non-parallel corpus". Traditional methods need training data in which the source and target speakers say the same words, but this method does not require that correspondence.
WaveNet: A Generative Model for Raw Audio: This model, announced by Google in 2016, takes raw waveform data as input instead of feature data and uses an autoregressive model to achieve extremely smooth speech synthesis. The computational cost is a problem, but it is definitely a breakthrough.
Neural Discrete Representation Learning: This work is distinctive in that it discretizes the latent space with a method called VQ-VAE, which is a kind of VAE. By using a network with a WaveNet-like structure as the decoder, it can also be used for voice conversion. You can listen to the demos on the authors' page, and the conversion quality is very good.
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks: Recently, GAN-based models have also become popular for voice conversion and speech synthesis.

Summary

This article has summarized the basics of voice conversion and speech synthesis, which have attracted a lot of attention over the past two years. Since pyworld is easy to install, easy to use, and fun, please try out various things with it.