Intelligent Speech Recognition and Dialect Classification with Python, WaveNet, CTC, and TensorFlow: A Deep Learning Application (full project source code included)



Foreword

This project uses speech files and their dialect annotation files to extract Mel-frequency cepstral coefficient (MFCC) features from the audio and normalize them. Guided by the annotation files, a dictionary is built to manage the data. A WaveNet model is then trained on these features, and a softmax layer is applied to its output. Finally, the trained model is saved for later use.

First, we obtain the speech and annotation files and extract MFCC features from the audio. These features capture the spectral characteristics of the speech signal and serve as the input for model training.
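As a minimal illustration of this step (the file name is hypothetical and 13 coefficients are an assumed default), the MFCC features of a single clip can be extracted as follows:

import librosa
from python_speech_features import mfcc

# Load one speech clip at 16 kHz and compute its Mel-frequency cepstral coefficients.
audio, sr = librosa.load('example.wav', sr=16000)  # 'example.wav' is a placeholder path
feature = mfcc(audio, sr, numcep=13)               # one 13-dimensional vector per frame
print(feature.shape)                               # (number_of_frames, 13)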

To associate each voice file accurately with its annotation, we build a dictionary from the annotation files, so that every audio file is linked to its corresponding label information.

To train an effective speech recognition model, we chose WaveNet, a deep learning model widely used for audio generation and recognition tasks. We train the WaveNet model on the prepared data so that it learns the association between speech files and their annotations.

Finally, after training, the model's output is passed through a softmax layer to obtain the final prediction. This converts the raw output into a probability distribution, from which the most likely class or character can be read off.

After these steps, the project can recognize speech and classify dialects based on the annotation information. The trained models are saved for later use in practical applications, providing a solid foundation for further research and development in speech recognition.

Overall design

This part includes the overall structure diagram of the system and the system flow chart.

System overall structure diagram

The overall structure of the system is shown in the figure.


System flow chart

The speech recognition and dialect classification process is shown in Figure 1, and the page design process is shown in Figure 2.

Figure 1 Flow chart of speech recognition and dialect classification

Figure 2 Flow chart of page design

Operating environment

This section includes the Python environment and the TensorFlow environment.

Python environment

Python 3.6 or later is required. On Windows, it is recommended to install Anaconda to set up the required Python environment; the download address is https://www.anaconda.com/. Alternatively, the code can be run in a Linux environment, for example inside a virtual machine.

TensorFlow environment

Open the Anaconda Prompt and add the Tsinghua mirror channel:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes

Create a Python 3.6.5 environment named tensorflow. The Python version must be compatible with the TensorFlow version installed later, so choose Python 3.x in this step.

conda create -n tensorflow python==3.6.5

When prompted for confirmation, enter y.
Then activate the tensorflow environment in the Anaconda Prompt:

conda activate tensorflow

Install a CPU version of TensorFlow that matches Python 3.6.5:

pip install tensorflow==1.9

Install a Keras version that matches Python and TensorFlow, here Keras 2.2.0:

pip install Keras==2.2.0

If the three versions do not match, the kernel will report errors when importing Keras.

After installation, verify the setup: start Python from the command line and run the imports shown in the figure. If they complete without errors, the installation succeeded.
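A quick sanity check of the environment might look like this (the version numbers are those suggested above):

import tensorflow as tf
import keras

print(tf.__version__)     # expect 1.9.0
print(keras.__version__)  # expect 2.2.0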

Module implementation

This project includes 3 modules: dialect classification, speech recognition, and model testing. The functionality and related code of each module are given below.

1. Dialect classification

This part includes data download and preprocessing, model building, model training and saving.

Data download and preprocessing

The dataset is provided by iFLYTEK and covers three dialects: Changsha, Shanghai, and Nanchang. It contains 19,489 speech files of 50 KB to 300 KB each, split into 17,989 training samples and 1,500 validation samples. The download address is http://challenge.xfyun.cn/2019/. Download the dataset and import it; the training-set and validation-set file lists are named train_files and dev_files respectively. Using the glob() function, the code to import the data is as follows:

#加载pcm文件,其中17989条训练数据,1500条验证数据
#定义训练集
train_files = glob.glob(r'D:\homework\dialect\data\*\train\*\*.pcm')
#定义验证集 
dev_files = glob.glob(r'D:\homework\dialect\data\*\dev\*\*\*.pcm') 
print(len(train_files), len(dev_files),train_files[0])
#打印语音数据集的数量与训练集第一条数据

Organize the downloaded speech data and assign a label to each file in the training and validation sets. The relevant code is as follows:

labels = {'train': [], 'dev': []}
#对于train_files中的每一条数据
for i in tqdm(range(len(train_files))):
    path = train_files[i]  #取出路径
    label = path.split('\\')[1]  #以'\\'划分路径,取出其中对应地区分类的标签
    labels['train'].append(label)  #以字典进行存储
#对于dev_files中的每一条数据进行如上操作
for i in tqdm(range(len(dev_files))):
    path = dev_files[i]
    label = path.split('\\')[1]
    labels['dev'].append(label)
print(len(labels['train']), len(labels['dev']))
#整理每条语音数据对应的分类标签图

After downloading, the dataset is preprocessed. Three functions are defined: one to process the voice data, one to convert pcm files to wav format, and one to visualize the speech data. Since the clips vary in length, clips shorter than 1 s are discarded and longer clips are split into segments of at most 3 s.

a. Processing voice data

The relevant code is as follows:

def load_and_trim(path, sr=16000):
    audio = np.memmap(path, dtype='h', mode='r')  #对大文件分段读取
    audio = audio[2000:-2000]
    audio = audio.astype(np.float32)
    energy = librosa.feature.rms(audio)  #计算能量
    frames = np.nonzero(energy >= np.max(energy) / 5) #最大能量的1/5视为静音
    indices = librosa.core.frames_to_samples(frames)[1]#去除静音
    audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0]
    slices = []#存储划分为小于3s大于1s的切片
    for i in range(0, audio.shape[0], slice_length):
        s = audio[i: i + slice_length]#切分为3s片段
        if s.shape[0] >= min_length:
            slices.append(s)  #去除小于1s的片段
    return audio, slices
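The helper above relies on the globals sr, slice_length and min_length, which are not shown in the excerpt. Plausible values, inferred from the 1 s / 3 s limits stated above and the 16 kHz sampling rate, are:

sr = 16000             # sampling rate of the pcm files
mfcc_dim = 13          # assumed number of Mel-cepstral coefficients used later
slice_length = 3 * sr  # maximum slice length: 3 seconds
min_length = 1 * sr    # minimum slice length: 1 second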

b. pcm to wav function

The relevant code is as follows:

def pcm2wav(pcm_path, wav_path, channels=1, bits=16, sample_rate=sr):
    data = open(pcm_path, 'rb').read()  #读取文件
    fw = wave.open(wav_path, 'wb')  #存储wav路径
    fw.setnchannels(channels)  #设置通道数:单声道
    fw.setsampwidth(bits // 8)  #将样本宽度设置为bits/8个字节
    fw.setframerate(sample_rate)  #设置采样率
    fw.writeframes(data)  #写入data个长度的音频
    fw.close()

c. Visualize Speech Dataset Function

The relevant code is as follows:

def visualize(index, source='train'):
    if source == 'train':
        path = train_files[index]  #训练集路径
    else:
        path = dev_files[index]  #验证集路径
    print(path)
    audio, slices = load_and_trim(path)  #去除两端静音,并切分为片段
    print('Duration: %.2f s' % (audio.shape[0] / sr))  #打印处理后的长度
    plt.figure(figsize=(12, 3))  #图像大小
    plt.plot(np.arange(len(audio)), audio)  #绘制波形
    plt.title('Raw Audio Signal')  #设置标题
    plt.xlabel('Time')  #设置横坐标
    plt.ylabel('Audio Amplitude')  #设置纵坐标
    plt.show()  #绘图展示
    feature = mfcc(audio, sr, numcep=mfcc_dim)  #提取mfcc特征
    print('Shape of MFCC:', feature.shape)  #打印mfcc特征
    fig = plt.figure(figsize=(12, 5))  #图像大小
    ax = fig.add_subplot(111)
    im = ax.imshow(feature, cmap=plt.cm.jet, aspect='auto')  #绘制mfcc
    plt.title('Normalized MFCC')  #图像标题
    plt.ylabel('Time')  #设置纵坐标
    plt.xlabel('MFCC Coefficient')  #设置横坐标
    plt.colorbar(im, cax=make_axes_locatable(ax).append_axes('right', size='5%', pad=0.05))
    ax.set_xticks(np.arange(0, 13, 2), minor=False);  #设置横坐标间隔
    plt.show()  #绘图展示
    wav_path = r'D:/homework/dialect/example.wav'#wav文件存储路径
    pcm2wav(path, wav_path)  #pcm文件转化为wav文件
    return wav_path

Taking the visualization of the third speech clip as an example, the result is shown in Figure 3.
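The normalization below uses X_train, X_dev, mfcc_mean and mfcc_std, which are built earlier in the script but not shown in the excerpt. A plausible sketch, consistent with the slicing helper and the labels dictionary defined above, is:

import numpy as np
from python_speech_features import mfcc

X_train, Y_train, X_dev, Y_dev = [], [], [], []
# every slice of a clip inherits the dialect label of the whole clip
for i, path in enumerate(train_files):
    _, slices = load_and_trim(path)
    X_train += [mfcc(s, sr, numcep=mfcc_dim) for s in slices]
    Y_train += [labels['train'][i]] * len(slices)
for i, path in enumerate(dev_files):
    _, slices = load_and_trim(path)
    X_dev += [mfcc(s, sr, numcep=mfcc_dim) for s in slices]
    Y_dev += [labels['dev'][i]] * len(slices)

samples = np.vstack(X_train)          # pool every frame of every training slice
mfcc_mean = np.mean(samples, axis=0)  # per-coefficient mean
mfcc_std = np.std(samples, axis=0)    # per-coefficient standard deviation

The MFCC features of the training and validation sets are then normalized; the relevant code is as follows: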

#对训练集MFCC特征归一化
X_train = [(x - mfcc_mean) / (mfcc_std + 1e-14) for x in X_train] 
#对验证集MFCC特征归一化
X_dev = [(x - mfcc_mean) / (mfcc_std + 1e-14) for x in X_dev] 


Figure 3 Visualization results of the third voice data

Use LabelEncoder() to convert the labels into integers, and to_categorical() to convert the training-set and validation-set labels into one-hot vectors; an iterator function is then defined (a sketch is given after the code below). The relevant code is as follows:

le = LabelEncoder()
Y_train = le.fit_transform(Y_train)  #处理训练集标签
Y_dev = le.transform(Y_dev)  #处理验证集标签
num_class = len(le.classes_)  #3个类别
Y_train = to_categorical(Y_train, num_class)  #将训练集转化为向量
Y_dev = to_categorical(Y_dev, num_class)  #将验证集转化为向量
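The iterator mentioned above is not shown in the excerpt. A minimal sketch of such a batch_generator() (the batch size of 16 and the shuffling strategy are assumptions) pads each batch of variable-length MFCC sequences to a common length:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def batch_generator(x, y, batch_size=16):
    offset = 0
    while True:
        offset += batch_size
        if offset == batch_size or offset >= len(x):
            data_index = np.arange(len(x))
            np.random.shuffle(data_index)        # reshuffle at the start of every epoch
            x = [x[i] for i in data_index]
            y = [y[i] for i in data_index]
            offset = batch_size
        X_batch = x[offset - batch_size: offset]
        Y_batch = y[offset - batch_size: offset]
        maxlen = max(b.shape[0] for b in X_batch)
        # pad every MFCC sequence in the batch to the length of the longest one
        X_batch = pad_sequences(X_batch, maxlen, 'float32', padding='post', value=0.0)
        yield X_batch, np.array(Y_batch)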

Model building

After the data is loaded into the model, it is necessary to define the model structure and optimize the loss function.

(1) Define the model structure

The model processes the data with multiple layers of causal dilated convolution. First, a one-dimensional convolution layer (conv1d) is built; second, batch normalization (BN) is introduced through a batchnorm() function to regularize the network and reduce over-fitting; third, an activation() function is defined to apply the chosen activation; finally, a res_block() function is defined, in which each residual block follows a Conv + BN + activation pattern with gated (tanh and sigmoid) branches, returning both a residual output and a skip connection. The output of the whole sequence is reduced by GlobalMaxPooling1D(), and a softmax layer forms the classifier: the number of feature maps in the last convolution layer equals the number of dialect classes, so after the softmax each MFCC segment yields a probability distribution over the classes. The relevant code is as follows:

#定义多层因果空洞卷积MFCC一维,使用conv1d
def conv1d(inputs, filters, kernel_size, dilation_rate):
    return Conv1D(filters=filters, kernel_size=kernel_size, strides=1, padding='causal', activation=None, dilation_rate=dilation_rate)(inputs)
#加速神经网络训练BN算法
def batchnorm(inputs):
    return BatchNormalization()(inputs)
#定义神经网络激活函数
def activation(inputs, activation):
    return Activation(activation)(inputs)
def res_block(inputs, filters, kernel_size, dilation_rate):
    hf = activation(batchnorm(conv1d(inputs, filters, kernel_size, dilation_rate)), 'tanh')
    hg = activation(batchnorm(conv1d(inputs, filters, kernel_size, dilation_rate)), 'sigmoid')
    h0 = Multiply()([hf, hg])
    #tanh激活函数
    ha = activation(batchnorm(conv1d(h0, filters, 1, 1)), 'tanh')
    hs = activation(batchnorm(conv1d(h0, filters, 1, 1)), 'tanh')
    return Add()([ha, inputs]), hs
#tanh激活函数
h0 = activation(batchnorm(conv1d(X, filters, 1, 1)), 'tanh')
shortcut = []
for i in range(num_blocks):
    for r in [1, 2, 4, 8, 16]:
        h0, s = res_block(h0, filters, 7, r)
        shortcut.append(s)
#Relu激活函数
h1 = activation(Add()(shortcut), 'relu')
h1 = activation(batchnorm(conv1d(h1, filters, 1, 1)), 'relu')
h1 = batchnorm(conv1d(h1, num_class, 1, 1))
h1 = GlobalMaxPooling1D()(h1)  #通过GlobalMaxPooling1D对整个序列输出进行降维
Y = activation(h1, 'softmax')  #softmax逻辑回归模型
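The snippet above refers to X, filters, num_blocks and num_class, which are defined elsewhere in the script (num_class comes from the LabelEncoder step above). Plausible definitions, with the hyperparameter values being assumptions, are:

from keras.layers import Input

filters = 16     # channels in every convolution (assumed value)
num_blocks = 3   # stacks of dilation rates 1, 2, 4, 8, 16 (assumed value)
X = Input(shape=(None, mfcc_dim), dtype='float32')  # variable-length sequence of MFCC frames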

(2) Optimize the loss function

Determine the model architecture and compile it. This is a multi-class classification problem, so categorical cross-entropy is used as the loss function. Since all classes carry similar weight, accuracy is used as the performance metric. The relevant code is as follows:

#Adam优化算法
optimizer = Adam(lr=0.01, clipnorm=5)
#模型输入/输出
model = Model(inputs=X, outputs=Y)
#模型损失和准确率
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
#模型保存路径
checkpointer = ModelCheckpoint(filepath='D:/homework/dialect/fangyan.h5', verbose=0)
lr_decay = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=1, min_lr=0.000)

Model Training and Storage

After defining the model architecture and compiling, train the model through the training set to make the model perform dialect classification. Here we will fit and save the model using the training and validation sets.

(1) Model training

The relevant code is as follows:

#分批读取数据
history = model.fit_generator(
    #训练集批处理器
    generator=batch_generator(X_train, Y_train),
    #每轮训练的数据量
    steps_per_epoch=len(X_train),
    epochs=epochs,
    #测试集批处理器
    validation_data=batch_generator(X_dev, Y_dev),
    validation_steps=len(X_dev),
    callbacks=[checkpointer, lr_decay])

Here, a batch is the number of training samples used in one forward/backward pass: 1,180 speech samples are used per pass, and 11,800 samples are trained in total, as shown in the figure.

By observing the loss and accuracy on the training and test sets, we can judge how well the model is trained and decide whether to continue training. Training is close to its best state when the loss (or accuracy) on both sets stops changing and the two are roughly equal. The loss and accuracy recorded during training are plotted for easy inspection:

train_loss = history.history['loss']#训练集损失函数
valid_loss = history.history['val_loss']#验证集损失函数
#画图
plt.plot(train_loss, label='训练')
plt.plot(valid_loss, label='验证')
plt.legend(loc='upper right')
plt.xlabel('训练次数')
plt.ylabel('损失')
plt.show()
train_acc = history.history['acc']#训练集精确度
valid_acc = history.history['val_acc']#验证集精确度
#画图
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.plot(train_acc, label='训练')
plt.plot(valid_acc, label='验证')
plt.legend(loc='upper right')
plt.xlabel('训练次数')
plt.ylabel('精确度')
plt.legend()
plt.show()

(2) Model saving

The relevant code is as follows:

#保存模型
model.save('fangyan.h5') #HDF5文件

After the model is saved, it can be reused or ported to other environments. For category prediction, the dictionary is also saved as a .pkl file; the relevant code is as follows:

#保存字典
with open('resources.pkl', 'wb') as fw:
    pickle.dump([class2id, id2class, mfcc_mean, mfcc_std], fw)
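class2id and id2class are not defined in the excerpt; they can reasonably be derived from the fitted LabelEncoder:

# map dialect names to integer ids and back (consistent with le.fit_transform above)
class2id = {c: i for i, c in enumerate(le.classes_)}
id2class = {i: c for i, c in enumerate(le.classes_)}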

2. Speech recognition

This section includes data preprocessing, model building, model training, and model saving.

Data preprocessing

The dataset comes from www.openslr.org/18/ and contains 13,388 speech files of 100 KB to 400 KB each. Download the dataset and import it using the glob() function. The relevant code is as follows:

#加载trn,读取语音对应的文本文件
text_paths = glob.glob(r'D:/homework/language/language/data/*.trn')
#导入数据集
total = len(text_paths)  #统计文本个数
print(total)  #打印数据集总数
with open(text_paths[0], 'r', encoding='utf8') as fr:
    lines = fr.readlines()
    print(lines)  #打印第一条数据中内容

The speech recognition dataset contains .wav files (the audio) and .trn files (the transcriptions). Printing the first item gives the output shown in the figure.


Organize the downloaded dataset: store the Chinese text of each transcription in the texts list and the path of each audio file in the paths list. The relevant code is as follows:

#提取文本内容和语音文件路径,去掉空白格
texts = []  #放置文本
paths = []  #防止每个语音文件的路径
for path in text_paths:
    with open(path, 'r', encoding='utf8') as fr:
        lines = fr.readlines()
        line = lines[0].strip('\n').replace(' ', '')
#用逗号替换空白格,去掉拼音
        texts.append(line)  #更新文本,添加整理好的内容
        paths.append(path.rstrip('.trn'))#除去.trn的文件就是.wav音频文件
print(paths[0], texts[0])#打印语音文件所在路径以及对应的文本内容

After downloading, the speech dataset is preprocessed. Two functions are defined: one for processing the speech data and one for visualizing it. The processing function removes the silent parts at both ends of each voice file. Part of its code is as follows:

def load_and_trim(path):
    audio, sr = librosa.load(path)  #读取音频
    energy = librosa.feature.rms(audio)  #计算能量
    frames = np.nonzero(energy >= np.max(energy) / 5) #最大能量的1/5视为静音
    indices = librosa.core.frames_to_samples(frames)[1]  #去除静音
    audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0]
    return audio, sr
#可视化语音数据集部分函数代码
def visualize(index):
    path = paths[index]  #获取某个音频
    text = texts[index]  #获取音频对应的文本
    print('Audio Text:', text)
    audio, sr = load_and_trim(path)  #调用函数去除两端静音
    plt.figure(figsize=(12, 3))
    plt.plot(np.arange(len(audio)), audio)
    plt.title('Raw Audio Signal')
    plt.xlabel('Time')#x轴为时间
    plt.ylabel('Audio Amplitude')  #y轴为音频高度
    plt.show()
    feature = mfcc(audio, sr, numcep=mfcc_dim, nfft=551)  #计算MFCC特征
    print('Shape of MFCC:', feature.shape)
    fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(111)
    im = ax.imshow(feature, cmap=plt.cm.jet, aspect='auto')
    plt.title('Normalized MFCC')
    plt.ylabel('Time')
    plt.xlabel('MFCC Coefficient')
    plt.colorbar(im, cax=make_axes_locatable(ax).append_axes('right', size='5%', pad=0.05))
    ax.set_xticks(np.arange(0, 13, 2), minor=False);
    plt.show()
    return path

Taking the visualization of the first speech clip as an example, the result is shown in the figure.
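The normalization in the next step uses a features list together with mfcc_mean and mfcc_std, none of which appear in the excerpt. A plausible sketch of how they are obtained (mfcc_dim = 13 is an assumption) is:

import numpy as np
from tqdm import tqdm
from python_speech_features import mfcc

mfcc_dim = 13  # assumed MFCC dimensionality
features = []
for path in tqdm(paths):
    audio, sr = load_and_trim(path)                            # trim silence with the helper above
    features.append(mfcc(audio, sr, numcep=mfcc_dim, nfft=551))

samples = np.vstack(features)         # pool every frame of every clip
mfcc_mean = np.mean(samples, axis=0)  # per-coefficient mean
mfcc_std = np.std(samples, axis=0)    # per-coefficient standard deviation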

Normalize the MFCC features of the audio files and build a character dictionary. The relevant code is as follows:

features = [(feature - mfcc_mean) / (mfcc_std + 1e-14) for feature in features]
#建立字典
chars = {}
for text in texts:
    for c in text:
        chars[c] = chars.get(c, 0) + 1
chars = sorted(chars.items(), key=lambda x: x[1], reverse=True)
chars = [char[0] for char in chars]
print(len(chars), chars[:100])   #打印随机100段音频中汉字数量
char2id = {c: i for i, c in enumerate(chars)}
id2char = {i: c for i, c in enumerate(chars)}

The data is split, with 90% of the total used as the training set. A batch_generator() function that yields batches of data is also defined (a sketch is given after the code below). The relevant code is as follows:

data_index = np.arange(total)
np.random.shuffle(data_index)   #将索引打乱
train_size = int(0.9 * total)   #训练数据占90%
test_size = total - train_size
train_index = data_index[:train_size]            #切分出来训练数据的索引
test_index = data_index[train_size:]             #切分出来测试数据的索引
X_train = [features[i] for i in train_index]   #取出训练音频的MFCC特征
Y_train = [texts[i] for i in train_index]       #取出训练的标签
X_test = [features[i] for i in test_index]      #取出测试音频的MFCC特征
Y_test = [texts[i] for i in test_index]          #取出测试的标签
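The batch_generator() used for CTC training is not shown in the excerpt. A sketch is given below; it assumes the model's four Input layers are named 'X', 'Y', 'X_length' and 'Y_length' (as in the model sketch later) and that the batch size of 16 is an arbitrary choice. Label sequences are padded with the blank index, and the true lengths are passed alongside so that K.ctc_batch_cost can ignore the padding:

import numpy as np

def batch_generator(x, y, batch_size=16):
    offset = 0
    while True:
        offset += batch_size
        if offset == batch_size or offset >= len(x):
            data_index = np.arange(len(x))
            np.random.shuffle(data_index)        # reshuffle at the start of every epoch
            x = [x[i] for i in data_index]
            y = [y[i] for i in data_index]
            offset = batch_size
        X_data = x[offset - batch_size: offset]  # MFCC sequences
        Y_data = y[offset - batch_size: offset]  # transcription strings
        X_maxlen = max(b.shape[0] for b in X_data)
        Y_maxlen = max(len(t) for t in Y_data)
        X_batch = np.zeros([batch_size, X_maxlen, mfcc_dim])
        Y_batch = np.ones([batch_size, Y_maxlen]) * len(char2id)   # pad labels with the blank id
        X_length = np.zeros([batch_size, 1], dtype='int32')
        Y_length = np.zeros([batch_size, 1], dtype='int32')
        for i in range(batch_size):
            X_length[i, 0] = X_data[i].shape[0]
            X_batch[i, :X_length[i, 0], :] = X_data[i]
            Y_length[i, 0] = len(Y_data[i])
            Y_batch[i, :Y_length[i, 0]] = [char2id[c] for c in Y_data[i]]
        inputs = {'X': X_batch, 'Y': Y_batch, 'X_length': X_length, 'Y_length': Y_length}
        outputs = {'ctc': np.zeros([batch_size])}   # dummy target; the Lambda layer returns the loss
        yield inputs, outputs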

Model building

After the data is loaded into the model, it is necessary to define the model structure and optimize the loss function.

(1) Define the model structure

Define the WaveNet network. It processes the data with multi-layer causal dilated convolutions, and its structure is roughly the same as in the dialect classification part.
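The network itself is not reproduced in the excerpt. A hedged sketch of its inputs and output head is shown below; it reuses the conv1d, batchnorm, activation and res_block helpers exactly as defined in the dialect-classification section, and the layer names as well as the filters and num_blocks values are assumptions. The key difference from the classifier is that the time dimension is kept and the last convolution predicts a distribution over the character dictionary plus one CTC blank symbol:

from keras.layers import Input, Add

filters = 128    # assumed number of channels
num_blocks = 3   # assumed number of dilation stacks

X = Input(shape=(None, mfcc_dim), dtype='float32', name='X')    # MFCC frames
Y = Input(shape=(None,), dtype='float32', name='Y')             # target character ids
X_length = Input(shape=(1,), dtype='int32', name='X_length')    # number of frames per clip
Y_length = Input(shape=(1,), dtype='int32', name='Y_length')    # number of characters per clip

h0 = activation(batchnorm(conv1d(X, filters, 1, 1)), 'tanh')
shortcut = []
for i in range(num_blocks):
    for r in [1, 2, 4, 8, 16]:
        h0, s = res_block(h0, filters, 7, r)
        shortcut.append(s)
h1 = activation(Add()(shortcut), 'relu')
h1 = activation(batchnorm(conv1d(h1, filters, 1, 1)), 'relu')
h1 = batchnorm(conv1d(h1, len(char2id) + 1, 1, 1))  # one extra class for the CTC blank
Y_pred = activation(h1, 'softmax')                  # per-frame distribution over characters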

(2) Loss function and model optimization
Compile the model after its architecture is determined. Because the length of the output character sequence differs from the number of input frames and no frame-level alignment is available, the CTC (Connectionist Temporal Classification) loss is used. The relevant code is as follows:

def calc_ctc_loss(args):  #CTC损失函数
    y, yp, ypl, yl = args
    return K.ctc_batch_cost(y, yp, ypl, yl)

ctc_loss = Lambda(calc_ctc_loss, output_shape=(1,), name='ctc')([Y, Y_pred, X_length, Y_length])  #调用函数
model = Model(inputs=[X, Y, X_length, Y_length], outputs=ctc_loss)
optimizer = SGD(lr=0.02, momentum=0.9, nesterov=True, clipnorm=5)
model.compile(loss={'ctc': lambda ctc_true, ctc_pred: ctc_pred}, optimizer=optimizer)  #定义模型
checkpointer = ModelCheckpoint(filepath='asr.h5', verbose=0)
lr_decay = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=1, min_lr=0.000)

Model Training and Storage

After defining the model architecture and compiling, train the model with the training set to make the model recognize speech. Here, the training and validation sets will be used to fit and save the model.

1) Model training

The relevant code is as follows:

#分批读取数据
history = model.fit_generator(
    #训练集批处理器
    generator=batch_generator(X_train, Y_train),
    #每轮训练的数据量
    steps_per_epoch=len(X_train),
    epochs=epochs,
    #测试集批处理器
    validation_data=batch_generator(X_test, Y_test),
    validation_steps=len(X_test),
    callbacks=[checkpointer, lr_decay])

A single batch contains no more than 753 samples, as shown in the figure.


The training and validation losses are then plotted; the relevant code is as follows:

#训练集损失
train_loss = history.history['loss']
#验证集损失
valid_loss = history.history['val_loss']
#绘制损失函数图像
mpl.rcParams['font.sans-serif'] = ['SimHei']  #默认字体黑体
plt.plot(np.linspace(1, epochs, epochs), train_loss, label='训练')
plt.plot(np.linspace(1, epochs, epochs), valid_loss, label='验证')
plt.legend(loc='upper right')
plt.xlabel('训练次数')
plt.ylabel('损失')
plt.legend()
plt.show()

2) Model saving

For later import into the GUI, save the model as an .h5 file.

The relevant code is as follows:

#模型保存
sub_model.save('asr.h5')
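sub_model is not defined in the excerpt above; a reasonable definition is the inference network that maps the MFCC input directly to the per-frame character distribution, created before saving:

# inference-only network: MFCC input -> per-frame character probabilities
sub_model = Model(inputs=X, outputs=Y_pred)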

Save the dictionary as a .pkl file; the relevant code is as follows:

#字典保存
with open('dictionary.pkl', 'wb') as fw:
    pickle.dump([char2id, id2char], fw)

After the model and dictionary are saved, they can be reused or transplanted to other environments.
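A minimal sketch of how the saved artifacts can be loaded back later, for example in the GUI, might look like this:

import pickle
from keras.models import load_model

model = load_model('asr.h5')              # trained recognition network (sub_model above)
with open('dictionary.pkl', 'rb') as fr:
    char2id, id2char = pickle.load(fr)    # character dictionary saved above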

3. Model testing

The trained models are imported into the designed GUI, which lets the user select a voice file and displays its original waveform and MFCC feature map. Either speech recognition or dialect classification can be selected. The graphical interface consists of three screens: the function selection interface, the speech recognition interface, and the dialect classification interface.

Function selection interface

Related operations are as follows:

(1) Set the function selection interface as the root form. Initialize each attribute of the function selection interface, and the related operations are as follows:

#设置功能选择界面的大小及标题
root = Tk()
root.geometry('300x450')
root.title('语音识别及方言分类')
#界面展示
mainloop()

The root window is the top-level controller of the graphical application and an instance of Tkinter's underlying control. After importing the Tkinter module, call the Tk() function to create the root window, use title() to set the title text, and use geometry() to set the window size in pixels. Putting the root window into the main loop keeps the program running until the user closes it; within this loop, other widgets can be displayed continuously, events are monitored, and the corresponding handlers are executed.

(2) Set the controls to be displayed on the interface and their corresponding properties and layouts. The related operations are as follows:

#设置文本标签,提示用户欢迎信息   
lb = Label(root, text='欢迎使用!',font=('华文新魏',22))
lb.pack()
#设置文本标签位置
lb.place(relx=0.15, rely=0.2, relwidth=0.75, relheight=0.3)
#设置两个提示按钮及文本字体大小
btn = Button(root, text='语音识别',font=('华文新魏',12),command=asr)
btn.pack()
#设置第一个按钮位置
btn.place(relx=0.3, rely=0.5, relwidth=0.4, relheight=0.1)
btn_btn = Button(root, text='方言分类',font=('华文新魏',12),command = dialect)
btn_btn.pack()
#设置第二个按钮位置
btn_btn.place(relx=0.3, rely=0.65, relwidth=0.4, relheight=0.1)

The label instance lb and the button instances btn and btn_btn are created with the root window as their parent container, and their text, font, and command attributes are set. When a widget is instantiated, its attributes are listed as "attribute=value" pairs in no particular order, and attribute values are usually given as text.

A button mainly responds to mouse clicks by triggering a program. Apart from the common widget properties, command is the most important one: the code to run is defined in advance as a function and referenced as command=function_name, with no parentheses after the function name and no arguments passed. Setting the command attribute therefore makes each button trigger the function that implements the feature named by its text.

Using the place() method with the relx, rely, relwidth, and relheight parameters makes the layout adapt to the size of the root window.

Speech recognition interface

Related operations are as follows:

(1) Call the Toplevel() function to create the secondary window that implements the speech recognition function.

(2) Instantiate the required text label, image label, text and button controls.

(3) Define a function to obtain the number input by the user, and select a voice file for subsequent recognition.

(4) Define a function to read the voice file, and after a series of processing, draw the original waveform and MFCC feature images, and save them locally.

#语音文件初始处理,去除两端静音
        audio, sr = librosa.load(wavs[int(x)]) #读取语音文件
        energy = librosa.feature.rms(audio)     #计算语音文件能量
        #判定能量小于最大能量1/5为静音
        frames = np.nonzero(energy >= np.max(energy) / 5)
        indices = librosa.core.frames_to_samples(frames)[1]
        audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0] #去除两端静音
        plt.figure(figsize=(12, 3)) #图像大小
        plt.plot(np.arange(len(audio)), audio) #绘制原始波形
        plt.title('Raw Audio Signal') #图像标题
        plt.xlabel('Time') #图像横坐标
        plt.ylabel('Audio Amplitude') #图像纵坐标
        #保存原始波形图像
        plt.savefig('E:/北邮学习/2020课件/信息系统设计/语音识别/原始波形.png')  
        #feature.shape二维数组(切片数量,维度)
        #指定音频文件、采样率、mfcc维度,获取MFCC特征
        feature = mfcc(audio, sr, numcep=mfcc_dim, nfft=551) 
        #绘制MFCC特征图
        fig = plt.figure(figsize=(12, 5))#图像大小
        ax = fig.add_subplot(111) #分块绘图
        #绘制mfcc特征
        im = ax.imshow(feature, cmap=plt.cm.jet, aspect='auto')
        plt.title('Normalized MFCC') #图像标题
        plt.ylabel('Time') #图像纵坐标
        plt.xlabel('MFCC Coefficient') #图像横坐标
        #右侧colorbar绘制
        plt.colorbar(im, cax=make_axes_locatable(ax).append_axes('right', size='5%', pad=0.05))
        ax.set_xticks(np.arange(0, 13, 2), minor=False); #图像横坐标值设置
        #保存MFCC特征图像
        plt.savefig('E:/北邮学习/2020课件/信息系统设计/语音识别/mfcc.png')

(5) Define two functions, which are used to display the MFCC feature map and the original waveform map in the interface respectively.

(6) Define a function to take out the marked text of the audio file and display it on the interface.

(7) Define a function, load the trained model for speech recognition, and output and display the recognition results.

audio, sr = librosa.load(wavs[int(x)]) #读取语音文件
energy = librosa.feature.rms(audio) #计算语音文件能量
#判定能量小于最大能量1/5为静音
frames = np.nonzero(energy >= np.max(energy) / 5) 
indices = librosa.core.frames_to_samples(frames)[1]
audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0] #去除两端静音
X_data = mfcc(audio, sr, numcep=mfcc_dim, nfft=551) #获取mfcc特征
X_data = (X_data - mfcc_mean) / (mfcc_std + 1e-14) #mfcc归一化处理
#加载模型进行语音识别
pred = model.predict(np.expand_dims(X_data, axis=0))
#加载预测结果
pred_ids = K.eval(K.ctc_decode(pred, [X_data.shape[0]], greedy=False,beam_width=10, top_paths=1)[0][0])
pred_ids = pred_ids.flatten().tolist()
#实例化lb标签,在界面中展示识别文本结果
lb3 = Label(top, text='识别结果:'+''.join([id2char[i] for i in pred_ids]))
lb3.place(relx=0.1, rely=0.96, relwidth=0.75, relheight=0.03)

(8) Set the command attribute of each button control and link it to the corresponding function defined above, so that clicking a button triggers the correct function.

Dialect classification interface

Related operations are as follows:

(1) Call the Toplevel() function to create the secondary window that implements the dialect classification function.

(2) Instantiate the required text label, image label, text and button controls.

(3) Define a function to randomly select audio files for subsequent dialect classification.

(4) Define the function, read the voice file, and after a series of processing, draw the original waveform and MFCC feature images, and save them locally.

#定义函数,加载语音文件,去除两端静音,对长语音进行片段切片
    def load_and_trim(path, sr=16000):
        audio = np.memmap(path, dtype='h', mode='r') #对大文件分段读取
        audio = audio[2000:-2000]
        audio = audio.astype(np.float32)
        energy = librosa.feature.rms(audio) #计算能量
        #最大能量的1/5视为静音
        frames = np.nonzero(energy >= np.max(energy) / 5) 
        indices = librosa.core.frames_to_samples(frames)[1] #去除静音
        audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0] #去除静音后的语音文件
        slices = [] #存储划分为小于3s大于1s的切片
        for i in range(0, audio.shape[0], slice_length):            
            s = audio[i: i + slice_length] #切分为3s片段
            if s.shape[0] >= min_length:
                slices.append(s) #去除小于1s的片段    
        return audio, slices 
#定义函数以读取语音文件,绘制原始波形及MFCC特征两个图像,并保存到本地
    def run2():
        #从文本框中提取随机选取的语音文件路径
        path = txt1.get("0.0","end")
        path = path.strip("\n").split(" ")[0]
        audio, slices = load_and_trim(path) #去除两端静音,并切分为片段
        #绘制原始波形图像
        plt.figure(figsize=(12, 3))#图像大小
        plt.plot(np.arange(len(audio)), audio) #绘制波形
        plt.title('Raw Audio Signal')#设置标题
        plt.xlabel('Time') #设置横坐标
        plt.ylabel('Audio Amplitude')#设置纵坐标
        #保存原始波形图像
        plt.savefig('E:/北邮学习/2020课件/信息系统设计/原始波形.png') 
        #绘制MFCC特征图像
        feature = mfcc(audio, sr, numcep=mfcc_dim) #提取MFCC特征
        fig = plt.figure(figsize=(12, 5))#图像大小
        ax = fig.add_subplot(111)
        im = ax.imshow(feature, cmap=plt.cm.jet, aspect='auto') #绘制MFCC
        plt.title('Normalized MFCC')#图像标题
        plt.ylabel('Time')#设置纵坐标
        plt.xlabel('MFCC Coefficient')#设置横坐标
        plt.colorbar(im, cax=make_axes_locatable(ax).append_axes('right', size='5%', pad=0.05))
        ax.set_xticks(np.arange(0, 13, 2), minor=False);#设置横坐标间隔
        #保存mfcc特征图像
        plt.savefig('E:/北邮学习/2020课件/信息系统设计/mfcc.png') 

(5) Define two functions, which are used to display the MFCC feature map and the original waveform map in the interface respectively.

(6) Define a function to extract the category of the audio file and display it on the interface.

(7) Define a function, load the trained model to classify dialects, and output and display the recognition results.

    path = txt1.get("0.0","end")
    path = path.strip("\n").split(" ")[0]
    #去除两端静音,并切分为片段
    audio, slices = load_and_trim(path)
    #获取mfcc特征
    X_data = [mfcc(s, sr, numcep=mfcc_dim) for s in slices] 
    #MFCC归一化处理
    X_data = [(x - mfcc_mean) / (mfcc_std + 1e-14) for x in X_data] 
    maxlen = np.max([x.shape[0] for x in X_data])
    X_data = pad_sequences(X_data, maxlen, 'float32', padding='post', value=0.0)
     #加载模型进行方言分类
    prob = model.predict(X_data)
    prob = np.mean(prob, axis=0)
    pred = np.argmax(prob)
    prob = prob[pred]
    pred = id2class[pred]
    #实例化lb标签,在界面中展示识别预测类别
    lb3 = Label(top1, text='预测类别:'+ pred + '  Confidence:'+ str(prob))
    lb3.place(relx=0.1, rely=0.96, relwidth=0.75, relheight=0.03)

(8) Set the command attribute of each button control and link it to the corresponding function defined above, so that clicking a button triggers the correct function.

System test

This section includes training accuracy, test results, and model applications.

1. Training accuracy

For the speech recognition task, model training is relatively successful: as training proceeds, the loss on the training and test data gradually converges and eventually stabilizes, as shown in the figure.


For dialect classification, the accuracy on the training set exceeds 98%, which indicates that training is relatively successful. As training proceeds, the loss and accuracy on the training data gradually converge and stabilize; on the test data, however, they are less stable and fluctuate to some extent, as shown in Figure 4 and Figure 5.

Figure 4 Dialect classification model loss

Figure 5 Accuracy of dialect classification model

2. Test effect

Run the model on the test set and compare the recognized text and predicted labels with the original data, as shown in the figure below.


It can be seen from the results that the model can realize speech recognition and dialect classification.

3. Model application

This section includes instructions for using the graphical user interface and test results.

1. Graphical User Interface Instructions

After running the .py file, the initial interface is shown in the figure.


From top to bottom, the interface consists of a text prompt and two buttons. Clicking the [Speech Recognition] button opens the secondary interface that implements speech recognition, shown in Figure 6; clicking the [Dialect Classification] button opens the secondary interface that implements dialect classification, shown in Figure 7.

From top to bottom, the interface consists of a text prompt, an input box, buttons, and text boxes. Enter a number in the input box and click the [Select Corresponding Voice File] button to choose the file to recognize; its path is shown in the text box. Click the [Save Corresponding MFCC Feature Map and Original Waveform] button to save the corresponding images locally; click the [MFCC Feature Map] and [Original Waveform] buttons to display the locally saved images; click the [Annotation Text] button to display the text of the corresponding annotation file; and click the [Recognition Result] button to run the prediction and display the recognized text.

Figure 6 The initial interface of speech recognition

From top to bottom, the interface consists of a text prompt, buttons, and text boxes. Following the prompt, click the [Select Voice File] button to randomly choose the file to classify; its path is shown in the text box. Click the [Save Corresponding MFCC Feature Map and Original Waveform] button to save the corresponding images locally; click the [MFCC Feature Map] and [Original Waveform] buttons to display the locally saved images; click the [Mark Category] button to display the file's true category; and click the prediction button to run the classification and display the predicted category.


Figure 7 The initial interface of dialect classification

2. Test effect

The speech recognition test results of the GUI are shown in the figure.


Dialect classification test results of GUI are shown in the figure.


Project source code download

See my blog resource download page for details


Other information download

If you want to continue learning about artificial-intelligence learning routes and knowledge systems, you are welcome to read my other blog post "Heavy | Complete artificial intelligence AI learning - basic knowledge learning route, all materials can be downloaded directly from the network disk without paying attention to routines".
That post draws on GitHub's well-known open-source platforms, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang. It links to about 100 GB of related materials, and I hope it helps all of you.


Origin blog.csdn.net/qq_31136513/article/details/131858254