Speech | Speech quality assessment methods in AI: detailed explanation and code

This article explains quality assessment methods for AI-generated speech, covering tasks such as speech synthesis, voice conversion, and voice cloning~

Table of contents

1. Speech quality evaluation methods

Subjective evaluation methods

1.1. MOS

1.2. CMOS

1.3. ABX Test

1.4. MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor)

Objective evaluation methods

1.5. MCD

1.6. PESQ (Perceptual Evaluation of Speech Quality)

1.7. STOI (Short-Time Objective Intelligibility)

1.8. LLR (Log Likelihood Ratio)

2. Use in speech tasks [detailed code]

2.1. MOS calculation

2.2. MCD calculation

2.3. STOI calculation

3. Test summary

3.1. Summary of the MCD test

3.2. Summary of the STOI test

[Extended]

Use the MCD values to compute the mean and variance, and plot them as a bar chart.


1. Speech quality evaluation methods

  • Subjective methods: MOS, CMOS, ABX Test, MUSHRA
  • Objective methods: MCD, PESQ, STOI, F0 RMSE, F0 MSE, E MSE, Dur MSE, mel loss

Subjective evaluation methods

1.1. MOS

MOS (Mean Opinion Score) is a subjective evaluation method that assesses speech synthesis quality through listeners' subjective ratings of the synthesized speech.

 Official website: P.800.1 : Mean opinion score (MOS) terminology (itu.int)

If the average MOS is 4 or higher, the voice quality is considered relatively good; if the average MOS is below 3.6, most listeners are not satisfied with the voice quality.

Quality level   MOS range   Evaluation standard
Excellent       4.0~5.0     Very good, heard clearly; low delay, smooth communication
Good            3.5~4.0     Slightly degraded but still clear; low delay, communication not entirely smooth, a little noise
Fair            3.0~3.5     Acceptable, not heard clearly; noticeable delay, but communication is possible
Poor            1.5~3.0     Barely usable, hard to hear clearly; large delay, communication requires repetition
Bad             0~1.5       Extremely poor, unintelligible; large delay, communication breaks down


General requirements for MOS testing:

  • A sufficiently large and diverse sample (i.e., enough listeners and sentences) to ensure statistically significant results;
  • A consistent experimental environment and equipment for every listener;
  • The same evaluation criteria applied by every listener.

1.2. CMOS

CMOS (Comparative Mean Opinion Score) is a related measure used in the NaturalSpeech paper. Plain MOS is not very sensitive to quality differences between systems, because each sentence from each system is scored in isolation and no two sentences are compared directly. In a CMOS test, raters hear sentences from the two systems side by side and score the difference on a seven-point scale, which makes CMOS more sensitive to quality differences.

1.3. ABX Test

The ABX test is a commonly used subjective assessment method for comparing which of two sound samples is closer to a third reference sample. In each trial, participants listen to A, B, and X and choose whether A or B better matches X. This test is commonly used to evaluate audio codecs, speech synthesis systems, and the like.

1.4. MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor)

MUSHRA is a subjective evaluation method in which multiple audio samples under evaluation are compared against a reference, with a hidden copy of the reference (and low-quality anchors) mixed in among the stimuli. Raters score every sample, which reveals which sample comes closest to the reference audio.

Objective evaluation methods

1.5. MCD

Paper title: Mel-cepstral distance measure for objective speech quality assessment

Paper link: Mel-cepstral distance measure for objective speech quality assessment | IEEE Conference Publication | IEEE Xplore

GitHub: MattShannon/mcd: Mel cepstral distortion (MCD) computations in python. (github.com)

Mel cepstral distortion (MCD) is a measure of how different two mel-cepstral sequences are. It is used to evaluate the quality of parametric speech synthesis systems, including statistical parametric systems. The smaller the MCD between the synthesized and natural mel-cepstral sequences, the closer the synthesized speech is to reproducing natural speech. It is by no means a perfect metric for assessing the quality of synthesized speech, but it is often useful in combination with other metrics.

The calculation method of MCD is as follows (a minimal per-frame sketch in code follows this list):

  • Extract MFCCs: first, extract MFCCs from the synthesized speech and the target speech. This involves converting each speech signal into a spectral representation, applying a mel filter bank, and using cepstral analysis to obtain the MFCC coefficients.

  • Calculate the distance: next, compare the MFCC coefficients of the synthesized and target speech. Euclidean distance is usually used, optionally combined with Dynamic Time Warping (DTW) to align the two signals in time before measuring their similarity or difference.

  • Average: finally, average the distances over all frames (or time segments) to obtain the MCD score for the whole utterance. The lower the MCD score, the smaller the difference between the synthesized and target speech, and the higher the quality.
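As a concrete reference, the per-frame distance conventionally used for MCD is (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) over the mel-cepstral coefficients, usually excluding the 0th (energy) coefficient. A minimal sketch, assuming the two mel-cepstral vectors for one aligned frame are already available as NumPy arrays:

import numpy as np

def mcd_frame(c_ref, c_syn):
    # Per-frame mel-cepstral distortion in dB between two aligned
    # mel-cepstral coefficient vectors; the 0th (energy) coefficient is excluded.
    diff = np.asarray(c_ref[1:], dtype=float) - np.asarray(c_syn[1:], dtype=float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))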

MCD is a commonly used indicator of speech synthesis quality, but it is only a distance between mel-cepstral coefficient sequences and cannot fully represent synthesis quality. Note that MCD is an objective index and needs to be combined with other indicators and with subjective evaluation to comprehensively assess a speech synthesis system.

However, research has found that its correlation with listeners' subjective perception of sound quality is not especially strong; in almost all of the papers I have seen, this approach is not used.

In the MCD (Mel Cepstral Distortion) calculation, the three modes (plain, dtw, dtw_sl) represent different ways of computing the mel-cepstral distance:

Plain:

  • MCD is computed from the direct Euclidean distance between mel-cepstral coefficient frames. This is the simplest and most straightforward calculation, with no additional transformation or correction.

DTW (Dynamic Time Warping):

  • Dynamic time warping compares two sequences by aligning them so as to minimize the distance between them. It allows a certain flexibility in how the sequences line up on the time axis and can handle sequences that are slightly misaligned in time.

DTW_SL (DTW weighted by Speech Length):

  • In this mode the DTW-based MCD is additionally weighted by the ratio of the speech lengths, so a duration mismatch between the reference and synthesized utterances is penalized.

1.6. PESQ (Perceptual Evaluation of Speech Quality)

PESQ is an objective assessment method for measuring speech quality. It scores speech quality by computing the difference between the original speech and the processed (compressed, encoded, etc.) speech. The metric is often used to measure the performance of speech codecs and communication systems.
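A minimal sketch, assuming the third-party `pesq` package (pip install pesq), which wraps the ITU-T P.862 reference implementation; the file names are placeholders:

from scipy.io import wavfile
from pesq import pesq

# Reference (original) and degraded/synthesized speech. Both must share the
# same sampling rate: 16 kHz for wide-band ('wb') mode, 8 kHz for narrow-band ('nb').
rate, ref = wavfile.read("1.wav")
_, deg = wavfile.read("gen_1.wav")

score = pesq(rate, ref, deg, 'wb')  # higher is better
print("PESQ score:", score)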

1.7. STOI (Short-Time Objective Intelligibility)

STOI is an objective evaluation method for measuring the clarity and intelligibility of speech signals, and is particularly suitable for measuring the intelligibility and recognition rate of synthesized speech.

STOI mainly evaluates intelligibility by comparing the spectral correlation between the original speech and the distorted/noisy speech. Its core idea is that when the human ear perceives speech, the brain is sensitive to spectral correlations; STOI therefore uses the correlation between spectra to estimate the clarity and intelligibility of the speech signal.

The general steps of this method are as follows:

  1. Short-Time Fourier Transform (STFT): the speech signal is divided into short time segments, and the STFT converts each segment into spectral form.
  2. Spectral correlation calculation: the correlation between the spectra of the original and distorted/noisy speech is computed, usually as the similarity between corresponding spectral frames.
  3. Correlation averaging: the per-frame correlations are averaged to obtain the STOI score for the whole signal.

The STOI score lies between 0 and 1: the closer the value is to 1, the more intelligible the speech signal; the closer to 0, the less intelligible.

This evaluation method provides a quantitative assessment of the sound quality, clarity, and intelligibility of speech signals. It is widely used in speech signal processing, especially in speech enhancement, noise reduction, encoding/decoding, and speech synthesis, where it helps evaluate the effectiveness of algorithms.
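In practice, rather than re-implementing the algorithm, the third-party `pystoi` package (pip install pystoi) can be used; it implements the full STOI algorithm, unlike the simplified spectral-correlation version sketched in section 2.3. A minimal example with placeholder file names:

from scipy.io import wavfile
from pystoi import stoi

fs, clean = wavfile.read("1.wav")          # reference speech
_, processed = wavfile.read("gen_1.wav")   # synthesized / degraded speech

# Both signals must have the same length and sampling rate
d = stoi(clean, processed, fs, extended=False)  # score in [0, 1]
print("STOI:", d)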

1.8. LLR (Log Likelihood Ratio)

LLR is used to evaluate whether the speech generated by a model belongs to a given speech distribution.
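For reference, the classical LPC-based definition from the speech-enhancement literature (assuming this is the variant intended here) compares the LPC coefficients of the processed frame against those of the clean frame:

d_LLR(a_p, a_c) = \log \frac{a_p^T R_c \, a_p}{a_c^T R_c \, a_c}

where a_c and a_p are the LPC coefficient vectors of the clean and processed speech frames and R_c is the autocorrelation matrix of the clean frame; lower values indicate a closer spectral-envelope match.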

2. Use in speech tasks [detailed code]

  • Speech synthesis

  • Voice conversion

  • Voice cloning

MOS and CMOS are mainly used in speech synthesis, but because they are highly subjective, scores can vary considerably between listener groups~

2.1. MOS calculation


import math
import numpy as np
import pandas as pd
from scipy.linalg import solve
from scipy.stats import t


def calc_mos(data_path: str):
    '''
    Compute the MOS and its 95% confidence interval.
    Data format: M x N (M sentences, N listeners); data_path is a CSV file of
    MOS scores in which every cell is one listener's rating of one sentence.
    :param data_path:
    :return:
    '''
    data = pd.read_csv(data_path)
    mu = np.nanmean(data.values)  # overall mean rating; NaN-safe for missing cells
    var_uw = (data.std(axis=1) ** 2).mean()  # variance across listeners, averaged over sentences
    var_su = (data.std(axis=0) ** 2).mean()  # variance across sentences, averaged over listeners
    mos_data = np.asarray([x for x in data.values.flatten() if not math.isnan(x)])
    var_swu = mos_data.std() ** 2  # total variance over all ratings

    # Decompose the variance into sentence, listener, and residual components
    x = np.asarray([[0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.asarray([var_uw, var_su, var_swu])
    [var_s, var_w, var_u] = solve(x, y)
    M = min(data.count(axis=0))  # fewest sentences rated by any listener
    N = min(data.count(axis=1))  # fewest listeners rating any sentence
    var_mu = var_s / M + var_w / N + var_u / (M * N)
    df = min(M, N) - 1  # degrees of freedom (the -1 may be omitted)
    t_interval = t.ppf(0.975, df, loc=0, scale=1)  # critical value for a 95% t-interval
    interval = t_interval * np.sqrt(var_mu)
    print('The 95% confidence interval of the MOS for {} is: {} ± {}'.format(
        data_path, round(float(mu), 3), round(float(interval), 3)))


if __name__ == '__main__':
    data_path = ''
    calc_mos(data_path)
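For reference, the expected input is a plain CSV in which each row is one sentence and each column one listener; a hypothetical 3-sentence, 4-listener file might look like:

listener1,listener2,listener3,listener4
5,4,4,5
4,4,3,4
5,5,4,4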

2.2. MCD calculation

Single-utterance comparison

from pymcd.mcd import Calculate_MCD

# Instance of the MCD class.
# Three different modes, "plain", "dtw" and "dtw_sl", for the three MCD metrics above.
mcd_toolbox = Calculate_MCD(MCD_mode="plain")

# Two inputs w.r.t. the reference (ground-truth) and synthesized speeches, respectively:
# compare the original utterance with its generated counterpart.
mcd_value = mcd_toolbox.calculate_mcd("1.wav", "gen_1.wav")
print(mcd_value)

Batch comparison

from pymcd.mcd import Calculate_MCD
import os
import numpy as np


def batch_calculate_mcd(original_folder, generated_folder):
    mcd_toolbox = Calculate_MCD(MCD_mode="dtw")
    mcd_values = []

    # List the files in each folder, sorted by name so the pairs line up
    original_files = sorted(os.listdir(original_folder))
    generated_files = sorted(os.listdir(generated_folder))

    # Compare the speech files pair by pair
    for orig_file, gen_file in zip(original_files, generated_files):
        orig_path = os.path.join(original_folder, orig_file)
        gen_path = os.path.join(generated_folder, gen_file)

        # Compute the MCD value for this pair
        mcd_value = mcd_toolbox.calculate_mcd(orig_path, gen_path)
        print(f"MCD value for {orig_file} and {gen_file}: {mcd_value}")
        mcd_values.append(mcd_value)

    # Compute the mean and variance over all pairs
    mean_mcd = np.mean(mcd_values)
    variance_mcd = np.var(mcd_values)

    print(f"Mean MCD value: {mean_mcd}")
    print(f"Variance of MCD values: {variance_mcd}")


original_folder_path = './original_data'
generated_folder_path = './gen_data'

batch_calculate_mcd(original_folder_path, generated_folder_path)

 

2.3. STOI calculation

Single-utterance comparison

# pip install scipy numpy

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def stoi(x, y, fs):
    # Simplified STOI-style score: mean spectral correlation per STFT frame.
    # This is a rough proxy, not the full STOI algorithm.
    win_len = int(fs * 0.025)  # 25 ms window
    hop_len = int(fs * 0.010)  # 10 ms hop

    # noverlap is the window overlap, i.e. window length minus hop length
    _, _, Pxo = stft(x, fs=fs, nperseg=win_len, noverlap=win_len - hop_len)
    _, _, Pyo = stft(y, fs=fs, nperseg=win_len, noverlap=win_len - hop_len)

    # Truncate to the shorter signal so the frame indices line up
    n_frames = min(Pxo.shape[1], Pyo.shape[1])

    # Per-frame spectral correlation
    stoi_values = []
    for i in range(n_frames):
        Pxo_i = np.abs(Pxo[:, i])
        Pyo_i = np.abs(Pyo[:, i])

        # A small epsilon in the denominator guards against NaN on silent frames
        Rxy = np.sum(Pxo_i * Pyo_i) / (np.sqrt(np.sum(Pxo_i ** 2) * np.sum(Pyo_i ** 2)) + 1e-10)
        stoi_values.append(Rxy)

    return np.mean(stoi_values)

# Read the original and generated speech
rate_orig, orig_audio = wavfile.read('original_data/1.wav')
rate_gen, gen_audio = wavfile.read('gen_data/gen_1.wav')

if rate_orig != rate_gen:
    print("Sampling rates differ: resample the generated audio to the original audio's rate before computing STOI.")

# Compute the STOI value
stoi_value = stoi(orig_audio, gen_audio, rate_orig)
print("STOI value:", stoi_value)

Batch comparison

import os
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def stoi(x, y, fs):
    # Simplified STOI-style score: mean spectral correlation per STFT frame
    win_len = int(fs * 0.025)  # 25 ms window
    hop_len = int(fs * 0.010)  # 10 ms hop

    _, _, Pxo = stft(x, fs=fs, nperseg=win_len, noverlap=win_len - hop_len)
    _, _, Pyo = stft(y, fs=fs, nperseg=win_len, noverlap=win_len - hop_len)

    # Truncate to the shorter signal so the frame indices line up
    n_frames = min(Pxo.shape[1], Pyo.shape[1])

    stoi_values = []
    for i in range(n_frames):
        Pxo_i = np.abs(Pxo[:, i])
        Pyo_i = np.abs(Pyo[:, i])

        # Correlation between the two spectra; the epsilon guards against NaN on silent frames
        Rxy = np.sum(Pxo_i * Pyo_i) / (np.sqrt(np.sum(Pxo_i ** 2) * np.sum(Pyo_i ** 2)) + 1e-10)
        stoi_values.append(Rxy)

    return np.mean(stoi_values)

def calculate_stoi_for_files(original_folder, generated_folder):
    # Sort so the original/generated pairs line up by file name
    original_files = sorted(os.listdir(original_folder))
    generated_files = sorted(os.listdir(generated_folder))

    for orig_file, gen_file in zip(original_files, generated_files):
        orig_path = os.path.join(original_folder, orig_file)
        gen_path = os.path.join(generated_folder, gen_file)

        rate_orig, orig_audio = wavfile.read(orig_path)
        rate_gen, gen_audio = wavfile.read(gen_path)

        # Resample here if the two sampling rates differ...

        # Compute the STOI value
        stoi_value = stoi(orig_audio, gen_audio, rate_orig)
        print(f"STOI value - {orig_file} vs {gen_file}: {stoi_value}")

# Folders holding the original and the generated speech
original_folder_path = 'path_to_original_audio_folder'
generated_folder_path = 'path_to_generated_audio_folder'

# Compute the STOI values
calculate_stoi_for_files(original_folder_path, generated_folder_path)

3. Test summary

3.1. Summary of the MCD test

In "plain" mode, three different sentences from the same speaker were tested.

In "dtw" mode, the first comparison is a genuine reference/generated pair and the other three are different sentences. The results show that the MCD value alone does not fully represent the quality of the generated speech!

3.2. Summary of the STOI test

When comparing data, NaN values and index errors are easy to trigger: silent frames make the spectral-correlation denominator zero, and signals of different lengths produce different numbers of STFT frames, so guard the denominator and truncate to the shorter signal.

[Extended]

Use the MCD values to compute the mean and variance, and plot the per-utterance values as a bar chart with mean and standard-deviation lines.

# pip install matplotlib

import matplotlib.pyplot as plt
import numpy as np


# Utterance labels and the value for each utterance
speeches = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20']
data = [5, 8, 7, 6, 8, 4, 10, 7, 9, 6, 7, 5, 6, 8, 10, 11, 8, 10, 9, 8]  # MCD value for each utterance

# Compute the mean and variance
mean_value = np.mean(data)
variance_value = np.var(data)

# Create the bar chart
plt.figure(figsize=(10, 6))  # figure size
x = np.arange(len(speeches))  # utterance indices on the x-axis
plt.bar(x, data, color='skyblue', edgecolor='black')  # bars with fill and edge colors
plt.xlabel('Speeches')  # x-axis label
plt.ylabel('Value')  # y-axis label
plt.title('Values for Each Speech')  # title

# Use the utterance names as x-axis tick labels
plt.xticks(x, speeches)

# Show the mean and the standard deviation
plt.axhline(mean_value, color='red', linestyle='--', label=f'Mean: {mean_value:.2f}')  # mean line
plt.axhline(mean_value + np.sqrt(variance_value), color='green', linestyle=':', label='Std Dev')  # upper std-dev line
plt.axhline(mean_value - np.sqrt(variance_value), color='green', linestyle=':', label='_nolegend_')  # lower std-dev line

plt.grid(axis='y')  # grid lines on the y-axis only
plt.legend()  # show the legend
plt.tight_layout()  # adjust the layout
plt.show()  # display the chart


Origin: blog.csdn.net/weixin_44649780/article/details/135399901