TensorFlow实现语音识别

整体介绍:

环境python3.6+TensorFlow1.12  显卡是英伟达GTX1070(后头换个好些的显卡)训练了四天四夜

主要技术点CTC,BRNN,MFCC特征,全连接神经网络

CTC时序分类算法: 适合这种不知道输入输出是否对齐的情况(哪个字对应哪段声音)使用的算法,所以CTC适合语音识别和手写字符识别的任务,而传统的语音识别是基于语音学的方法,通常包含拼写、声学和语音模型等单独组件。训练模型的语料除了标注具体的文字外。还要标注按时间对应的音素,这就需要大量的人工成本;使用神经网络的语音识别就变得简单多了,通过能进行时序分类的连续时间分类目标函数(CTC),计算多个标签序列的概率,而序列是语音样本中所有可能的对应文字的集合。然后把预测结果跟实际比较,计算误差,不断更新网络权重。这样就丢弃音素的概念,就省了大量人工标注的成本,也不需要语言模型,只要有足够的样本,就可以训练各种语言的语音识别了。

BRNN双向循环神经网络:可参考我的这篇博文

MFCC梅尔频率倒谱系数:在TensorFlow中的使用方式通俗的说法就是他讲语音按时序转化(转化过程可能比较复杂)为一帧接一帧的,然后每一帧都是一个维度是20或更多,本文中的代码中设置的是26维度,训练的最终目的就是讲每一帧通过他的味独特征来定义它是哪个字符或者更像哪个字符。

全连接神经网络 这个不用多说了

介绍完毕,说干就干

第一步收集数据:

在实际的项目中这一步我们可能要消耗不少精力去做,而此处我们没有必要,可以借鉴清华大学提供的数据进行模型的训练

下载地址 http://www.openslr.org/18/ 或 http://166.111.134.19:8081/data/thchs30-openslr/README.html

只需这一个6.4g的就够了,下载下来解压后是这样的:

data中是所有的数据集,train是训练集,test是测试集

每个文件夹中都是一个wav(语音文件)和对应的trn(语音对应的文字)文件,由于train和test文件中trn中记录的不是语音对应的文字而是一个对应语音文字所在data中的那个trn的文件地址,所以在在代码中我们我们训练集的语音使用train中的,寻找对应的trn文字时在data中找

第二部编写代码训练模型

# coding: UTF-8
# 训练数据下载地址 http://www.openslr.org/18/ 或 http://166.111.134.19:8081/data/thchs30-openslr/README.html
# 原博客中的程序将空格算入label中,导致训练程序中无法准确定位空格字符的特征(因为任何两个发音之间的空格的mfcc特征都不一样),
# 导致程序收敛极慢甚至无法收敛,手动将tran_texts中的空格都去除掉后程序开始收敛了
# 还有一点程序中输出的末尾的"龚"字是字符集中出现频率最小的那个,我们可以手动将字符集中的末尾再加一个空格words+=[""]
import numpy as np
from python_speech_features import mfcc
import scipy.io.wavfile as wav
import os
import time
import tensorflow as tf
from tensorflow.python.ops import ctc_ops
from collections import Counter


# 获取文件夹下所有的WAV文件
def get_wav_files(wav_path):
    wav_files = []
    for (dirpath, dirnames, filenames) in os.walk(wav_path):
        for filename in filenames:
            if filename.endswith('.wav') or filename.endswith('.WAV'):
                # print(filename)
                filename_path = os.path.join(dirpath, filename)
                # print(filename_path)
                wav_files.append(filename_path)
    return wav_files


# 获取wav文件对应的翻译文字
def get_tran_texts(wavfiles, tran_path):
    tran_texts = []
    wav_files = []
    for wav_file in wavfiles:
        (wav_path, wav_filename) = os.path.split(wav_file)
        tran_file = os.path.join(tran_path, wav_filename + '.trn')
        # print(tran_file)
        if os.path.exists(tran_file) is False:
            return None

        fd = open(tran_file, 'r', encoding='UTF-8')
        text = fd.readline()
        wav_files.append(wav_file)
        # 不知为何原数据中的文字中有很多空格,去除空格干扰因子(通俗的解释就是空格的MFCC特性种类太多,因为任何两个文字语音间的空格特征都不一样,模型无法定位空格到底长啥样,导致模型收敛极慢甚至不收敛)
        tran_texts.append(text.split('\n')[0].replace(' ',''))
        fd.close()
    return wav_files,tran_texts


# 获取wav和对应的翻译文字
def get_wav_files_and_tran_texts(wav_path, tran_path):
    wavfiles = get_wav_files(wav_path)
    wav_files,tran_texts = get_tran_texts(wavfiles, tran_path)

    return wav_files, tran_texts


# 旧的训练集使用该方法获取音频文件名和译文
def get_wavs_lables(wav_path, label_file):
    wav_files = []
    for (dirpath, dirnames, filenames) in os.walk(wav_path):
        for filename in filenames:
            if filename.endswith('.wav') or filename.endswith('.WAV'):
                filename_path = os.sep.join([dirpath, filename])
                if os.stat(filename_path).st_size < 240000:  # 剔除掉一些小文件
                    continue
                wav_files.append(filename_path)

    labels_dict = {}
    with open(label_file, 'rb') as f:
        for label in f:
            label = label.strip(b'\n')
            label_id = label.split(b' ', 1)[0]
            label_text = label.split(b' ', 1)[1]
            labels_dict[label_id.decode('ascii')] = label_text.decode('utf-8')

    labels = []
    new_wav_files = []
    for wav_file in wav_files:
        wav_id = os.path.basename(wav_file).split('.')[0]

        if wav_id in labels_dict:
            labels.append(labels_dict[wav_id])
            new_wav_files.append(wav_file)

    return new_wav_files, labels

# 将稀疏矩阵的字向量转成文字
# tuple是sparse_tuple_from函数的返回值
def sparse_tuple_to_texts_ch(tuple, words):
    # 索引
    indices = tuple[0]
    values = tuple[1]
    results = [''] * tuple[2][0]
    for i in range(len(indices)):
        index = indices[i][0]
        c =  words[values[i]]
        results[index] = results[index] + c
    return results


# 将密集矩阵的字向量转成文字
def ndarray_to_text_ch(value, words):
    results = ''
    for i in range(len(value)):
        if value[i]==len(words):
            results +=' '
        else:
            results += words[value[i]]  
    return results


# 创建序列的稀疏表示
def sparse_tuple_from(sequences, dtype=np.int32):
    indices = []
    values = []

    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)

    # return tf.SparseTensor(indices=indices, values=values, shape=shape)
    return indices, values, shape


# 将音频数据转为时间序列(列)和MFCC(行)的矩阵,将对应的译文转成字向量
def get_audio_and_transcriptch(txt_files, wav_files, n_input, n_context, word_num_map, txt_labels=None):
    audio = []
    audio_len = []
    transcript = []
    transcript_len = []
    if txt_files != None:
        txt_labels = txt_files

    for txt_obj, wav_file in zip(txt_labels, wav_files):
        # load audio and convert to features
        audio_data = audiofile_to_input_vector(wav_file, n_input, n_context)
        audio_data = audio_data.astype('float32')
        # print(word_num_map)
        audio.append(audio_data)
        audio_len.append(np.int32(len(audio_data)))

        # load text transcription and convert to numerical array
        target = []
        if txt_files != None:  # txt_obj是文件
            target = get_ch_lable_v(txt_obj, word_num_map)
        else:
            target = get_ch_lable_v(None, word_num_map, txt_obj)  # txt_obj是labels
        # target = text_to_char_array(target)
        transcript.append(target)
        transcript_len.append(len(target))

    audio = np.asarray(audio)
    audio_len = np.asarray(audio_len)
    transcript = np.asarray(transcript)
    transcript_len = np.asarray(transcript_len)
    return audio, audio_len, transcript, transcript_len


# 将字符转成向量,其实就是根据字找到字在word_num_map中所应对的下标
def get_ch_lable_v(txt_file, word_num_map, txt_label=None):
    words_size = len(word_num_map)

    to_num = lambda word: word_num_map.get(word, words_size)

    if txt_file != None:
        txt_label = get_ch_lable(txt_file)

    # print(txt_label)
    labels_vector = list(map(to_num, txt_label))
    # print(labels_vector)
    return labels_vector


def get_ch_lable(txt_file):
    labels = ""
    with open(txt_file, 'rb') as f:
        for label in f:
            # labels =label.decode('utf-8')
            labels = labels + label.decode('gb2312')
            # labels.append(label.decode('gb2312'))

    return labels


# 将音频信息转成MFCC特征
# 参数说明---audio_filename:音频文件   numcep:梅尔倒谱系数个数
#       numcontext:对于每个时间段,要包含的上下文样本个数
def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    # 加载音频文件
    fs, audio = wav.read(audio_filename)
    # 获取MFCC系数
    orig_inputs = mfcc(audio, samplerate=fs, numcep=numcep)
    # 打印MFCC系数的形状,得到比如(955, 26)的形状
    # 955表示时间序列,26表示每个序列的MFCC的特征值为26个
    # 这个形状因文件而异,不同文件可能有不同长度的时间序列,但是,每个序列的特征值数量都是一样的
    # print(np.shape(orig_inputs))

    # 因为我们使用双向循环神经网络来训练,它的输出包含正、反向的结
    # 果,相当于每一个时间序列都扩大了一倍,所以
    # 为了保证总时序不变,使用orig_inputs =
    # orig_inputs[::2]对orig_inputs每隔一行进行一次
    # 取样。这样被忽略的那个序列可以用后文中反向
    # RNN生成的输出来代替,维持了总的序列长度。
    orig_inputs = orig_inputs[::2]  # (478, 26)
    # print(np.shape(orig_inputs))
    # 因为我们讲解和实际使用的numcontext=9,所以下面的备注我都以numcontext=9来讲解
    # 这里装的就是我们要返回的数据,因为同时要考虑前9个和后9个时间序列,
    # 所以每个时间序列组合了19*26=494个MFCC特征数
    train_inputs = np.array([], np.float32)
    train_inputs.resize((orig_inputs.shape[0], numcep + 2 * numcep * numcontext))
    # print(np.shape(train_inputs))#)(478, 494)

    # Prepare pre-fix post fix context
    empty_mfcc = np.array([])
    empty_mfcc.resize((numcep))

    # Prepare train_inputs with past and future contexts
    # time_slices保存的是时间切片,也就是有多少个时间序列
    time_slices = range(train_inputs.shape[0])

    # context_past_min和context_future_max用来计算哪些序列需要补零
    context_past_min = time_slices[0] + numcontext
    context_future_max = time_slices[-1] - numcontext

    # 开始遍历所有序列
    for time_slice in time_slices:
        # 对前9个时间序列的MFCC特征补0,不需要补零的,则直接获取前9个时间序列的特征
        need_empty_past = max(0, (context_past_min - time_slice))
        empty_source_past = list(empty_mfcc for empty_slots in range(need_empty_past))
        data_source_past = orig_inputs[max(0, time_slice - numcontext):time_slice]
        assert (len(empty_source_past) + len(data_source_past) == numcontext)

        # 对后9个时间序列的MFCC特征补0,不需要补零的,则直接获取后9个时间序列的特征
        need_empty_future = max(0, (time_slice - context_future_max))
        empty_source_future = list(empty_mfcc for empty_slots in range(need_empty_future))
        data_source_future = orig_inputs[time_slice + 1:time_slice + numcontext + 1]
        assert (len(empty_source_future) + len(data_source_future) == numcontext)

        # 前9个时间序列的特征
        if need_empty_past:
            past = np.concatenate((empty_source_past, data_source_past))
        else:
            past = data_source_past

        # 后9个时间序列的特征
        if need_empty_future:
            future = np.concatenate((data_source_future, empty_source_future))
        else:
            future = data_source_future

        # 将前9个时间序列和当前时间序列以及后9个时间序列组合
        past = np.reshape(past, numcontext * numcep)
        now = orig_inputs[time_slice]
        future = np.reshape(future, numcontext * numcep)

        train_inputs[time_slice] = np.concatenate((past, now, future))
        assert (len(train_inputs[time_slice]) == numcep + 2 * numcep * numcontext)

    # 将数据使用正太分布标准化,减去均值然后再除以方差
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)

    return train_inputs


# 对齐处理
def pad_sequences(sequences, maxlen=None, dtype=np.float32,
                  padding='post', truncating='post', value=0.):
    # [478 512 503 406 481 509 422 465]
    lengths = np.asarray([len(s) for s in sequences], dtype=np.int64)

    nb_samples = len(sequences)

    # maxlen,该批次中,最长的序列长度
    if maxlen is None:
        maxlen = np.max(lengths)

    # 在下面的主循环中,从第一个非空序列中获取样本形状以检查一致性
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # 序列为空,跳过

        # post表示后补零,pre表示前补零
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)

    return x, lengths

# 因为我的数据下载到D:\迅雷下载\任务组_20190426_1706\中了所以
wav_path = 'D:\迅雷下载\任务组_20190426_1706\data_thchs30/train'
label_file = 'D:\迅雷下载\任务组_20190426_1706\data_thchs30\data'
# wav_files, labels = get_wavs_lables(wav_path,label_file)
wav_files, labels = get_wav_files_and_tran_texts(wav_path, label_file)

# 字表
all_words = []
for label in labels:
    # print(label)audiofile_to_input_vector
    all_words += [word for word in label]
counter = Counter(all_words)
words = sorted(counter)
words+=[""]#可以手动将字符集中的末尾再加一个空格
words_size = len(words)
word_num_map = dict(zip(words, range(words_size)))
print('字表大小:', words_size)

# 梅尔倒谱系数的个数
n_input = 26
# 对于每个时间序列,要包含上下文样本的个数
n_context = 9
# batch大小
batch_size = 8


def next_batch(wav_files, labels, start_idx=0, batch_size=1):
    filesize = len(labels)
    # 计算要获取的序列的开始和结束下标
    end_idx = min(filesize, start_idx + batch_size)
    idx_list = range(start_idx, end_idx)
    # 获取要训练的音频文件路径和对于的译文
    txt_labels = [labels[i] for i in idx_list]
    wav_files = [wav_files[i] for i in idx_list]
    # 将音频文件转成要训练的数据
    (source, audio_len, target, transcript_len) = get_audio_and_transcriptch(None,
                                                                             wav_files,
                                                                             n_input,
                                                                             n_context, word_num_map, txt_labels)

    start_idx += batch_size
    # Verify that the start_idx is not largVerify that the start_idx is not ler than total available sample size
    if start_idx >= filesize:
        start_idx = -1

    # Pad input to max_time_step of this batch
    # 如果多个文件将长度统一,支持按最大截断或补0
    source, source_lengths = pad_sequences(source)
    # 返回序列的稀疏表示
    sparse_labels = sparse_tuple_from(target)

    return start_idx, source, source_lengths, sparse_labels


print('音频文件:  ' + wav_files[0])
print('文字内容:  ' + labels[0])
# 获取一个batch的数据
next_idx, source, source_len, sparse_lab = next_batch(wav_files, labels, 0, batch_size)
print(np.shape(source))
# 将字向量转成文字
t = sparse_tuple_to_texts_ch(sparse_lab, words)
print(t[0])
# source已经将变为前9(不够补空)+本身+后9,每个26,第一个顺序是第10个的数据。


b_stddev = 0.046875
h_stddev = 0.046875

n_hidden = 1024
n_hidden_1 = 1024
n_hidden_2 = 1024
n_hidden_5 = 1024
n_cell_dim = 1024
n_hidden_3 = 2 * 1024

keep_dropout_rate = 0.95
relu_clip = 20

"""
used to create a variable in CPU memory.
"""


def variable_on_gpu(name, shape, initializer):
    # Use the /gpu:0 device for scoped operations
    with tf.device('/gpu:0'):
        # Create or get apropos variable
        var = tf.get_variable(name=name, shape=shape, initializer=initializer)
    return var


def BiRNN_model(batch_x, seq_length, n_input, n_context, n_character, keep_dropout):
    # batch_x_shape: [batch_size, amax_stepsize, n_input + 2 * n_input * n_context]
    batch_x_shape = tf.shape(batch_x)

    # 将输入转成时间序列优先
    batch_x = tf.transpose(batch_x, [1, 0, 2])
    # 再转成2维传入第一层
    # [amax_stepsize * batch_size, n_input + 2 * n_input * n_context]
    batch_x = tf.reshape(batch_x, [-1, n_input + 2 * n_input * n_context])

    # 使用clipped RELU activation and dropout.
    # 1st layer
    with tf.name_scope('fc1'):
        b1 = variable_on_gpu('b1', [n_hidden_1], tf.random_normal_initializer(stddev=b_stddev))
        h1 = variable_on_gpu('h1', [n_input + 2 * n_input * n_context, n_hidden_1],
                             tf.random_normal_initializer(stddev=h_stddev))
        layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), relu_clip)
        layer_1 = tf.nn.dropout(layer_1, keep_dropout)

    # 2nd layer
    with tf.name_scope('fc2'):
        b2 = variable_on_gpu('b2', [n_hidden_2], tf.random_normal_initializer(stddev=b_stddev))
        h2 = variable_on_gpu('h2', [n_hidden_1, n_hidden_2], tf.random_normal_initializer(stddev=h_stddev))
        layer_2 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_1, h2), b2)), relu_clip)
        layer_2 = tf.nn.dropout(layer_2, keep_dropout)

    # 3rd layer
    with tf.name_scope('fc3'):
        b3 = variable_on_gpu('b3', [n_hidden_3], tf.random_normal_initializer(stddev=b_stddev))
        h3 = variable_on_gpu('h3', [n_hidden_2, n_hidden_3], tf.random_normal_initializer(stddev=h_stddev))
        layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, h3), b3)), relu_clip)
        layer_3 = tf.nn.dropout(layer_3, keep_dropout)

    # 双向rnn
    with tf.name_scope('lstm'):
        # Forward direction cell:
        lstm_fw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)
        lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(lstm_fw_cell,
                                                     input_keep_prob=keep_dropout)
        # Backward direction cell:
        lstm_bw_cell = tf.contrib.rnn.BasicLSTMCell(n_cell_dim, forget_bias=1.0, state_is_tuple=True)
        lstm_bw_cell = tf.contrib.rnn.DropoutWrapper(lstm_bw_cell,
                                                     input_keep_prob=keep_dropout)

        # `layer_3`  `[amax_stepsize, batch_size, 2 * n_cell_dim]`
        layer_3 = tf.reshape(layer_3, [-1, batch_x_shape[0], n_hidden_3])

        outputs, output_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw_cell,
                                                                 cell_bw=lstm_bw_cell,
                                                                 inputs=layer_3,
                                                                 dtype=tf.float32,
                                                                 time_major=True,
                                                                 sequence_length=seq_length)

        # 连接正反向结果[amax_stepsize, batch_size, 2 * n_cell_dim]
        outputs = tf.concat(outputs, 2)
        # to a single tensor of shape [amax_stepsize * batch_size, 2 * n_cell_dim]
        outputs = tf.reshape(outputs, [-1, 2 * n_cell_dim])

    with tf.name_scope('fc5'):
        b5 = variable_on_gpu('b5', [n_hidden_5], tf.random_normal_initializer(stddev=b_stddev))
        h5 = variable_on_gpu('h5', [(2 * n_cell_dim), n_hidden_5], tf.random_normal_initializer(stddev=h_stddev))
        layer_5 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(outputs, h5), b5)), relu_clip)
        layer_5 = tf.nn.dropout(layer_5, keep_dropout)

    with tf.name_scope('fc6'):
        # 全连接层用于softmax分类
        b6 = variable_on_gpu('b6', [n_character], tf.random_normal_initializer(stddev=b_stddev))
        h6 = variable_on_gpu('h6', [n_hidden_5, n_character], tf.random_normal_initializer(stddev=h_stddev))
        layer_6 = tf.add(tf.matmul(layer_5, h6), b6)

    # 将2维[amax_stepsize * batch_size, n_character]转成3维 time-major [amax_stepsize, batch_size, n_character].
    layer_6 = tf.reshape(layer_6, [-1, batch_x_shape[0], n_character])
    print('n_character:' + str(n_character))
    # Output shape: [amax_stepsize, batch_size, n_character]
    return layer_6


# input_tensor为输入音频数据,由前面分析可知,它的结构是[batch_size, amax_stepsize, n_input + (2 * n_input * n_context)]
# 其中,batch_size是batch的长度,amax_stepsize是时序长度,n_input + (2 * n_input * n_context)是MFCC特征数,
# batch_size是可变的,所以设为None,由于每一批次的时序长度不固定,所有,amax_stepsize也设为None
input_tensor = tf.placeholder(tf.float32, [None, None, n_input + (2 * n_input * n_context)], name='input')
# Use sparse_placeholder; will generate a SparseTensor, required by ctc_loss op.
# targets保存的是音频数据对应的文本的系数张量,所以用sparse_placeholder创建一个稀疏张量
targets = tf.sparse_placeholder(tf.int32, name='targets')
# seq_length保存的是当前batch数据的时序长度
seq_length = tf.placeholder(tf.int32, [None], name='seq_length')
# keep_dropout则是dropout的参数
keep_dropout = tf.placeholder(tf.float32)

# logits is the non-normalized output/activations from the last layer.
# logits will be input for the loss function.
# nn_model is from the import statement in the load_model function
logits = BiRNN_model(input_tensor, tf.to_int64(seq_length), n_input, n_context, words_size + 1, keep_dropout)

# 使用ctc loss计算损失
avg_loss = tf.reduce_mean(ctc_ops.ctc_loss(targets, logits, seq_length))

# 优化器
learning_rate = 0.001
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(avg_loss)

# 使用CTC decoder
with tf.name_scope("decode"):
    decoded, log_prob = ctc_ops.ctc_beam_search_decoder(logits, seq_length, merge_repeated=False)

# 计算编辑距离
with tf.name_scope("accuracy"):
    distance = tf.edit_distance(tf.cast(decoded[0], tf.int32), targets)
    # 计算label error rate (accuracy)
    ler = tf.reduce_mean(distance, name='label_error_rate')

# 迭代次数
epochs = 100
# 模型保存地址
savedir = "saver/"
# 如果该目录不存在,新建
if os.path.exists(savedir) == False:
    os.mkdir(savedir)

# 生成saver
saver = tf.train.Saver(max_to_keep=1)
# 创建session
with tf.Session() as sess:
    # 初始化
    sess.run(tf.global_variables_initializer())
    # 没有模型的话,就重新初始化
    kpt = tf.train.latest_checkpoint(savedir)
    print("kpt:", kpt)
    startepo = 0
    if kpt != None:
        saver.restore(sess, kpt)
        ind = kpt.find("-")
        startepo = int(kpt[ind + 1:])
        print(startepo)

    # 准备运行训练步骤
    section = '\n{0:=^40}\n'
    print(section.format('Run training epoch'))

    train_start = time.time()
    floss = open("loss_value", 'w', encoding='UTF-8')
    for epoch in range(epochs):  # 样本集迭代次数
        epoch_start = time.time()
        if epoch < startepo:
            continue

        print("epoch start:", epoch, "total epochs= ", epochs)
        #######################run batch####
        n_batches_per_epoch = int(np.ceil(len(labels) / batch_size))
        print("total loop ", n_batches_per_epoch, "in one epoch,", batch_size, "items in one loop")

        train_cost = 0
        train_ler = 0
        next_idx = 0

        for batch in range(n_batches_per_epoch):  # 一次batch_size,取多少次
            # 取数据
            print('开始获取数据咯:' + str(batch))
            next_idx, source, source_lengths, sparse_labels = next_batch(wav_files, labels, next_idx, batch_size)
            print('结束咯')
            feed = {input_tensor: source, targets: sparse_labels, seq_length: source_lengths,
                    keep_dropout: keep_dropout_rate}

            # 计算 avg_loss optimizer ;
            batch_cost, _ = sess.run([avg_loss, optimizer], feed_dict=feed)
            train_cost += batch_cost
            # 验证模型的准确率,比较耗时,我们训练的时候全力以赴,所以这里先不跑
            if (batch + 1) % 100 == 0:
                print('loop:', batch, 'Train cost: ', train_cost / (batch + 1))
                print('loop:', batch, 'Train cost: ', train_cost / (batch + 1),file=floss)
                feed2 = {input_tensor: source, targets: sparse_labels, seq_length: source_lengths, keep_dropout: 1.0}

                d, train_ler = sess.run([decoded[0], ler], feed_dict=feed2)
                dense_decoded = tf.sparse_tensor_to_dense(d, default_value=-1).eval(session=sess)
                dense_labels = sparse_tuple_to_texts_ch(sparse_labels, words)

                counter = 0
                print('Label err rate: ', train_ler)
                for orig, decoded_arr in zip(dense_labels, dense_decoded):
                    # convert to strings
                    decoded_str = ndarray_to_text_ch(decoded_arr, words)
                    print(' file {}'.format(counter))
                    print('Original: {}'.format(orig))
                    print('Decoded:  {}'.format(decoded_str))
                    counter = counter + 1
                    break

            # 每训练100次保存一下模型
            if (batch + 1) % 100 == 0:
                saver.save(sess, savedir + "saver.cpkt", global_step=epoch)
        epoch_duration = time.time() - epoch_start

        log = 'Epoch {}/{}, train_cost: {:.3f}, train_ler: {:.3f}, time: {:.2f} sec'
        print(log.format(epoch, epochs, train_cost, train_ler, epoch_duration))

    train_duration = time.time() - train_start
    print('Training complete, total duration: {:.2f} min'.format(train_duration / 60))

三.将测试集放入训练好的模型进行测试

四.延申

在其他项目中的应用可以针对业务场景收集具有业务特色的语音训练集,然后训练针对业务更精准的模型

猜你喜欢

转载自blog.csdn.net/qq_16320025/article/details/89743336