AI - Natural Language Processing (NLP) - Application: Text Generation [A Shakespeare-Style "Text Generation" GRU Model (feed the model a passage and it generates the text that follows)] [Text generation is one of the most challenging tasks in NLP] -- Saving Checkpoints

This is a text generation task built with a GRU model. Text generation is one of the most challenging tasks in NLP.

Given a piece of text or a sequence of characters as input, the model predicts the text that is likely to follow; ideally the generated text is grammatical and semantically coherent.

To date, text generation remains a difficult NLP task.

In practice, NLP text generation models are mostly applied to artistic or creative writing tasks. They are rarely used for scientific writing, press releases, and similar domains, because their output is not yet rigorous enough.

This case study uses Shakespeare's plays as the raw data.

I. The Shakespeare Dataset

Download URL: https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

Dataset preview:

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m


II. Steps for Implementing Text Generation with a GRU Model

Step 1: Download the dataset and preprocess the text
Step 2: Build the model, train it, and save it
Step 3: Use the model to generate text

1. Step 1: Download the Dataset and Preprocess the Text

1.1 Download the data

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

print("Tensorflow Version:", tf.__version__)	# 打印tensorflow版本
# 一、下载数据集
# 1、下载数据
path_to_file = tf.keras.utils.get_file(fname='shakespeare.txt', cache_dir='./', origin='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')  # 使用tf.keras.utils.get_file方法从指定地址下载数据,得到原始数据本地路径
print("path_to_file = {0}".format(path_to_file))

Output:

Tensorflow Version: 2.1.0-rc2
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1122304/1115394 [==============================] - 0s 0us/step

1.2 Read the data

text = open(path_to_file, 'rb').read().decode(encoding='utf-8')  # open the raw data file and read the text content
print("text[:250] = \n{0}".format(text[:250]))
print('Total number of characters in the file: len(text) = {0}'.format(len(text)))  # count the characters
vocab = sorted(set(text))  # collect the unique characters in the text
print("vocab = {0}".format(vocab))
print('Number of unique characters in the text = {0}'.format(len(vocab)))
print("-" * 200)

Output:

text[:250] = 
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
Total number of characters in the file: len(text) = 1115394
vocab = ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique characters in the text = 65

1.3 Map the text to numeric values

# Map characters to numbers: build two lookup tables, one from character to index and one from index to character
char2idx = {item: index for index, item in enumerate(vocab)}
print("char2idx = {0}".format(char2idx))
idx2char = np.array(vocab)
print("idx2char = {0}".format(idx2char))
text_as_int = np.array([char2idx[c] for c in text])  # represent the whole text with the character-to-index mapping
print("text_as_int = {0}".format(text_as_int))
print('Characters mapped to int:{} ---- > {}'.format(repr(text[:13]), text_as_int[:13]))  # inspect the first 13 characters of the raw corpus after mapping
print("-" * 200)

Output:

char2idx = {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
idx2char = ['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
text_as_int = [18 47 56 ... 45  8  0]
Characters mapped to int:'First Citizen' ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]

1.4 Build the training data

For the raw text we manually define an input sequence length seq_length. Each input sequence has a target sequence of the same length, shifted one character to the right. For example, with seq_length set to 4 and the text "hello", the training pair is: input sequence "hell", target sequence "ello".
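
The following is a minimal illustrative snippet (not part of the original pipeline; the variable names are made up) that applies the same shift-by-one rule to the string "hello" with a sequence length of 4:

# Minimal illustration of the input/target split described above (hypothetical toy example)
sample = "hello"
sample_seq_length = 4
chunk = sample[:sample_seq_length + 1]        # take seq_length + 1 characters: "hello"
input_seq, target_seq = chunk[:-1], chunk[1:]
print(input_seq, target_seq)                  # -> hell ello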

seq_length = 100  # set the input sequence length (sentence length)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)  # turn the numerically mapped text into a Dataset object for later processing [from_tensor_slices slices the first dimension of the given tensor to build the dataset] <class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
print("len(char_dataset) = {0}".format(len(char_dataset)))  # 1115394
for i in char_dataset.take(5): print("char_dataset char id {0}: {1}".format(i, idx2char[i.numpy()]))  # use take and the lookup table to inspect the first 5 characters
sequence_batches = char_dataset.batch(seq_length + 1, drop_remainder=True)  # use the dataset's batch method to split into chunks of seq_length + 1 characters (one extra position for the one-step shift) [drop_remainder=True drops a final chunk that is shorter than the chunk size] <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
print("Number of batches in sequence_batches: len(sequence_batches) = {0}".format(len(sequence_batches)))  # 11043 = 1115394 // 101
for item in sequence_batches.take(1):
    print("item = {0}".format(item))
    print("item.numpy() = {0}".format(item.numpy()))
    print("idx2char[item.numpy()] = {0}".format(idx2char[item.numpy()]))
    print("repr(''.join(idx2char[item.numpy()])) = {0}".format(repr(''.join(idx2char[item.numpy()]))))


def split_input_target(chunk):  # split a chunk into input and target sequences [the first 100 characters are the input; characters 2 through the end are the target]
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text


dataset_train = sequence_batches.map(split_input_target)  # use map to apply the split to every chunk
print("dataset_train = {0}".format(dataset_train))
print("-" * 200)

for input_example, target_example in dataset_train.take(1):  # inspect the first split example
    print('Input text: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target text:', repr(''.join(idx2char[target_example.numpy()])))

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):  # inspect the per-time-step input and expected output that will be fed to the model (first five steps) [loop over characters and print the input and output at each time step]
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
print("-" * 200)

Output:

len(char_dataset) = 1115394
char_dataset char id 18: F
char_dataset char id 47: i
char_dataset char id 56: r
char_dataset char id 57: s
char_dataset char id 58: t

Number of batches in sequence_batches: len(sequence_batches) = 11043

item = [18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1]

item.numpy() = [18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1]

idx2char[item.numpy()] = ['F' 'i' 'r' 's' 't' ' ' 'C' 'i' 't' 'i' 'z' 'e' 'n' ':' '\n' 'B' 'e' 'f'
 'o' 'r' 'e' ' ' 'w' 'e' ' ' 'p' 'r' 'o' 'c' 'e' 'e' 'd' ' ' 'a' 'n' 'y'
 ' ' 'f' 'u' 'r' 't' 'h' 'e' 'r' ',' ' ' 'h' 'e' 'a' 'r' ' ' 'm' 'e' ' '
 's' 'p' 'e' 'a' 'k' '.' '\n' '\n' 'A' 'l' 'l' ':' '\n' 'S' 'p' 'e' 'a'
 'k' ',' ' ' 's' 'p' 'e' 'a' 'k' '.' '\n' '\n' 'F' 'i' 'r' 's' 't' ' ' 'C'
 'i' 't' 'i' 'z' 'e' 'n' ':' '\n' 'Y' 'o' 'u' ' ']

repr(''.join(idx2char[item.numpy()])) = 'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

dataset_train = <MapDataset shapes: ((100,), (100,)), types: (tf.int32, tf.int32)>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Input text:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target text: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')

1.5 Batch the training data

BATCH_SIZE = 64  # batch size of 64 [each batch contains 64 (input text, target text) pairs; the input text is the feature and the target text is the label]
dataset_train = dataset_train.shuffle(buffer_size=1000).batch(batch_size=BATCH_SIZE, drop_remainder=True)  # shuffle the data and split it into batches [buffer_size: size of the shuffle buffer (a larger buffer shuffles more thoroughly but uses more memory)]
print("dataset_train = {0}".format(dataset_train))  # print the dataset object to check the tensor shapes
print("=" * 200)

2. Step 2: Build the Model, Train It, and Save the Model Parameters

Loss function

  • Generation can be framed as a standard classification problem: given the RNN state and the input at a time step, predict the class of the next character (picking just one from the distribution). The number of classes is the number of unique characters, so the targets form sparse categorical labels.
  • Sparse categorical labels: there are many possible classes (hundreds to thousands), but each sample belongs to only a few of them; in single-label multi-class classification each sample belongs to exactly one class, so the resulting label matrix is sparse.
  • tf.keras.losses.sparse_categorical_crossentropy(target, predictions, from_logits=True) -- see the minimal sketch after this list.
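
The snippet below is a minimal sketch (not from the original article; the toy shapes are made up) of how sparse_categorical_crossentropy with from_logits=True consumes integer targets and raw logits of shape (batch, time, vocab_size), the same shape our model outputs:

import tensorflow as tf

logits = tf.random.normal([2, 3, 5])           # hypothetical raw model outputs: batch=2, time=3, vocab=5
targets = tf.constant([[1, 4, 0], [2, 2, 3]])  # integer character ids, no one-hot encoding needed
loss = tf.keras.losses.sparse_categorical_crossentropy(targets, logits, from_logits=True)
print(loss.shape)                # (2, 3): one loss value per time step
print(tf.reduce_mean(loss))      # scalar mean, as computed in train_batch below
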
vocab_size = len(vocab)  # vocabulary size
embedding_dim = 256  # embedding dimension
hidden_size = 1024  # number of hidden units in the GRU


# 1. Model-building function [the model has three layers: an input Embedding layer, a middle GRU layer, and an output fully connected layer]
def build_model(vocab_size, embedding_dim, hidden_size, batch_size):
    model = tf.keras.Sequential([  # define the model with tf.keras.Sequential
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(units=hidden_size, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),  # return_sequences=True means the GRU returns the output at every time step rather than only the last one [stateful=True keeps the final state of each batch as the initial state of the next batch; recurrent_initializer='glorot_uniform' initializes the recurrent kernel with a uniform (Glorot) distribution]
        tf.keras.layers.Dense(vocab_size)  # the final fully connected layer returns a distribution over all possible characters
    ])
    return model


# Build the model
model = build_model(vocab_size=len(vocab), embedding_dim=embedding_dim, hidden_size=hidden_size, batch_size=BATCH_SIZE)  # build the model with the hyperparameters above
model.summary()  # inspect the model's parameters
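
# (Optional sanity check, not in the original article: run one untrained forward pass
#  to confirm the output shape is (batch_size, sequence_length, vocab_size).)
for input_example_batch, target_example_batch in dataset_train.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape)  # expected: (64, 100, 65)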

# Choose the optimizer
optimizer = tf.keras.optimizers.Adam()


# Training function [trains on one batch per call]
def train_batch(input, target):  # input: model input, target: the corresponding labels
    with tf.GradientTape() as tape:  # open the gradient tape
        predictions = model(input)  # run the model to get predictions
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(target, predictions, from_logits=True))  # compute the mean loss with sparse_categorical_crossentropy
    grads = tape.gradient(loss, model.trainable_variables)  # use the tape to compute gradients for all trainable parameters
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # apply the gradients with the optimizer
    return loss  # return the mean loss


# 4. Configure checkpoints [used to save the model's parameters during training]
checkpoint_dir = './training_checkpoints'  # directory where checkpoints are saved
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")  # checkpoint file name prefix

# Train the model
EPOCHS = 10  # number of training epochs
for epoch in range(EPOCHS):  # loop over epochs
    print("=" * 100, "Epoch = {0}".format(epoch), "=" * 100)
    start = time.time()  # record the start time
    hidden = model.reset_states()  # reset the GRU hidden state
    for (batch_index, (input, target)) in enumerate(dataset_train):  # loop over batches
        loss = train_batch(input, target)  # call train_batch to train on this batch and get its loss
        if batch_index % 50 == 0:  # print the loss every 50 batches
            print('Epoch {}----batch_index {}----Loss = {}'.format(epoch + 1, batch_index, loss))
    model.save_weights(checkpoint_prefix.format(epoch=epoch))  # save a checkpoint after each epoch
    print('Epoch {} Loss {:.4f}'.format(epoch + 1, loss))  # print the epoch, the last batch loss, and the elapsed time
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

# Save the final checkpoint
model.save_weights(checkpoint_prefix.format(epoch=epoch))
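
Besides model.save_weights, TF2 also provides an object-based checkpoint API. The sketch below is an optional alternative (not used in this article; the directory name is made up) that tracks both the model and the optimizer state with tf.train.Checkpoint and keeps only the most recent files via tf.train.CheckpointManager:

# Optional alternative sketch (assumption: you also want the optimizer state checkpointed)
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory='./training_checkpoints_obj', max_to_keep=3)

# inside or after the epoch loop:
save_path = manager.save()                      # writes e.g. ./training_checkpoints_obj/ckpt-1
print("Saved object-based checkpoint to:", save_path)

# later, to restore:
checkpoint.restore(manager.latest_checkpoint)   # restores model weights and optimizer state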

3. Step 3: Use the Model to Generate Text

The tf.random.categorical method

  • In theory, if the model were accurate enough, we could simply pick the index with the highest probability at every step; that is greedy decoding.
  • In practice the model's predictions are uncertain, and always taking the most probable character easily falls into repetitive loops. Instead, the probability of each value in the distribution is used as its chance of being selected, so every value has some chance of being chosen. TensorFlow implements this with tf.random.categorical; a small standalone sketch follows this list.
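
To make the effect of the temperature concrete, here is a tiny standalone sketch (illustrative only, with made-up logits): dividing the logits by a temperature below 1.0 sharpens the distribution toward the most likely class, while a temperature above 1.0 flattens it and makes rarer characters more likely to be drawn.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])  # unnormalized scores for 3 hypothetical classes
for temperature in (0.5, 1.0, 2.0):
    samples = tf.random.categorical(logits / temperature, num_samples=20)
    print(temperature, samples.numpy())  # lower temperature -> class 0 dominates more often
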
# 4. Use the model to generate text
# 1. Restore the model
model = build_model(vocab_size, embedding_dim, hidden_size, batch_size=1)  # rebuild the model structure with batch_size=1
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))  # load the trained parameters from the latest checkpoint


# 2. Build the generation function
def generate_text(model, start_string):  # model: the trained model, start_string: an arbitrary starting string
    num_generate = 1000  # number of characters to generate
    input_eval = [char2idx[s] for s in start_string]  # convert the starting string to numbers (vectorize it)
    input_eval = tf.expand_dims(input_eval, 0)  # add a batch dimension to match the model's expected input
    text_generated = []  # empty list to collect the results
    temperature = 1.0  # the "temperature" parameter [it rescales the gaps between the logits fed to tf.random.categorical, controlling how random the sampling is]
    model.reset_states()  # reset the GRU hidden state
    for i in range(num_generate):  # generation loop
        predictions = model(input_eval)  # run the model to get the output logits
        predictions = tf.squeeze(predictions, 0)  # remove the batch dimension
        predictions = predictions / temperature  # scale the logits by the temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()  # sample the next character index with tf.random.categorical
        input_eval = tf.expand_dims([predicted_id], 0)  # expand the predicted id into the next model input
        text_generated.append(idx2char[predicted_id])  # map the output back to a character and store it
    return (start_string + ''.join(text_generated))  # finally, join the starting string with the generated characters


# 3. Call the generation function
generated = generate_text(model, start_string=u"ROMEO: ")
print("generated = \n{0}".format(generated))

III. Complete Code for Shakespeare-Style Text Generation (GRU Model)

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

print("Tensorflow Version:", tf.__version__)  # 打印tensorflow版本

# 一、下载数据集
# 1、下载数据
path_to_file = tf.keras.utils.get_file(fname='shakespeare.txt', cache_dir='./', origin='https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')  # 使用tf.keras.utils.get_file方法从指定地址下载数据,得到原始数据本地路径
print("path_to_file = {0}".format(path_to_file))
print("-" * 200)

# II. Text preprocessing
# 1. Read the data
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')  # open the raw data file and read the text content
print("text[:250] = \n{0}".format(text[:250]))
print('Total number of characters in the file: len(text) = {0}'.format(len(text)))  # count the characters
vocab = sorted(set(text))  # collect the unique characters in the text
print("vocab = {0}".format(vocab))
print('Number of unique characters in the text = {0}'.format(len(vocab)))
print("-" * 200)
# 2. Map the text to numeric values
# Map characters to numbers: build two lookup tables, one from character to index and one from index to character
char2idx = {item: index for index, item in enumerate(vocab)}
print("char2idx = {0}".format(char2idx))
idx2char = np.array(vocab)
print("idx2char = {0}".format(idx2char))
text_as_int = np.array([char2idx[c] for c in text])  # represent the whole text with the character-to-index mapping
print("text_as_int = {0}".format(text_as_int))
print('Characters mapped to int:{} ---- > {}'.format(repr(text[:13]), text_as_int[:13]))  # inspect the first 13 characters of the raw corpus after mapping
print("-" * 200)
# 3. Build the training data
seq_length = 100  # set the input sequence length (sentence length)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)  # turn the numerically mapped text into a Dataset object for later processing [from_tensor_slices slices the first dimension of the given tensor to build the dataset] <class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
print("len(char_dataset) = {0}".format(len(char_dataset)))  # 1115394
for i in char_dataset.take(5): print("char_dataset char id {0}: {1}".format(i, idx2char[i.numpy()]))  # use take and the lookup table to inspect the first 5 characters
sequence_batches = char_dataset.batch(seq_length + 1, drop_remainder=True)  # use the dataset's batch method to split into chunks of seq_length + 1 characters (one extra position for the one-step shift) [drop_remainder=True drops a final chunk that is shorter than the chunk size] <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
print("Number of batches in sequence_batches: len(sequence_batches) = {0}".format(len(sequence_batches)))  # 11043 = 1115394 // 101
for item in sequence_batches.take(1):
    print("item = {0}".format(item))
    print("item.numpy() = {0}".format(item.numpy()))
    print("idx2char[item.numpy()] = {0}".format(idx2char[item.numpy()]))
    print("repr(''.join(idx2char[item.numpy()])) = {0}".format(repr(''.join(idx2char[item.numpy()]))))


def split_input_target(chunk):  # split a chunk into input and target sequences [the first 100 characters are the input; characters 2 through the end are the target]
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text


dataset_train = sequence_batches.map(split_input_target)  # use map to apply the split to every chunk
print("dataset_train = {0}".format(dataset_train))
print("-" * 200)

for input_example, target_example in dataset_train.take(1):  # inspect the first split example
    print('Input text: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target text:', repr(''.join(idx2char[target_example.numpy()])))

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):  # inspect the per-time-step input and expected output that will be fed to the model (first five steps) [loop over characters and print the input and output at each time step]
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
print("-" * 200)

# 4. Create batched data
BATCH_SIZE = 64  # batch size of 64 [each batch contains 64 (input text, target text) pairs; the input text is the feature and the target text is the label]
dataset_train = dataset_train.shuffle(buffer_size=1000).batch(batch_size=BATCH_SIZE, drop_remainder=True)  # shuffle the data and split it into batches [buffer_size: size of the shuffle buffer (a larger buffer shuffles more thoroughly but uses more memory)]
print("dataset_train = {0}".format(dataset_train))  # print the dataset object to check the tensor shapes
print("=" * 200)

# III. Build the model
vocab_size = len(vocab)  # vocabulary size
embedding_dim = 256  # embedding dimension
hidden_size = 1024  # number of hidden units in the GRU


# 1. Model-building function [the model has three layers: an input Embedding layer, a middle GRU layer, and an output fully connected layer]
def build_model(vocab_size, embedding_dim, hidden_size, batch_size):
    model = tf.keras.Sequential([  # define the model with tf.keras.Sequential
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        # return_sequences=True means the GRU returns the output at every time step rather than only the last one; stateful=True keeps the final state of each batch as the initial state of the next batch; recurrent_initializer='glorot_uniform' initializes the recurrent kernel with a uniform (Glorot) distribution
        tf.keras.layers.GRU(units=hidden_size, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)  # the final fully connected layer returns a distribution over all possible characters
    ])
    return model
    ])
    return model


# Build the model
model = build_model(vocab_size=len(vocab), embedding_dim=embedding_dim, hidden_size=hidden_size, batch_size=BATCH_SIZE)  # build the model with the hyperparameters above
model.summary()  # inspect the model's parameters

# Choose the optimizer
optimizer = tf.keras.optimizers.Adam()


# Training function [trains on one batch per call]
def train_batch(input, target):  # input: model input, target: the corresponding labels
    with tf.GradientTape() as tape:  # open the gradient tape
        predictions = model(input)  # run the model to get predictions
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(target, predictions, from_logits=True))  # compute the mean loss with sparse_categorical_crossentropy
    grads = tape.gradient(loss, model.trainable_variables)  # use the tape to compute gradients for all trainable parameters
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # apply the gradients with the optimizer
    return loss  # return the mean loss


# 4. Configure checkpoints [used to save the model's parameters during training]
checkpoint_dir = './training_checkpoints'  # directory where checkpoints are saved
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")  # checkpoint file name prefix

# IV. Train the model
EPOCHS = 10  # number of training epochs
for epoch in range(EPOCHS):  # loop over epochs
    print("=" * 100, "Epoch = {0}".format(epoch), "=" * 100)
    start = time.time()  # record the start time
    hidden = model.reset_states()  # reset the GRU hidden state
    for (batch_index, (input, target)) in enumerate(dataset_train):  # loop over batches
        loss = train_batch(input, target)  # call train_batch to train on this batch and get its loss
        if batch_index % 50 == 0:  # print the loss every 50 batches
            print('Epoch {}----batch_index {}----Loss = {}'.format(epoch + 1, batch_index, loss))
    model.save_weights(checkpoint_prefix.format(epoch=epoch))  # save a checkpoint after each epoch
    print('Epoch {} Loss {:.4f}'.format(epoch + 1, loss))  # print the epoch, the last batch loss, and the elapsed time
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

# Save the final checkpoint
model.save_weights(checkpoint_prefix.format(epoch=epoch))

# V. Use the model to generate text
# 1. Restore the model
model = build_model(vocab_size, embedding_dim, hidden_size, batch_size=1)  # rebuild the model structure with batch_size=1
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))  # load the trained parameters from the latest checkpoint


# 2. Build the generation function
def generate_text(model, start_string):  # model: the trained model, start_string: an arbitrary starting string
    num_generate = 1000  # number of characters to generate
    input_eval = [char2idx[s] for s in start_string]  # convert the starting string to numbers (vectorize it)
    input_eval = tf.expand_dims(input_eval, 0)  # add a batch dimension to match the model's expected input
    text_generated = []  # empty list to collect the results
    temperature = 1.0  # the "temperature" parameter [it rescales the gaps between the logits fed to tf.random.categorical, controlling how random the sampling is]
    model.reset_states()  # reset the GRU hidden state
    for i in range(num_generate):  # loop to generate 1000 characters (characters, not words)
        predictions = model(input_eval)  # run the model to get the output logits
        predictions = tf.squeeze(predictions, 0)  # remove the batch dimension
        predictions = predictions / temperature  # scale the logits by the temperature
        # predictions = [[ 1.7311015   5.047733    1.4174337  -4.9972973  -4.3031516   2.0421145
        #    3.1429284   1.0354232   3.1231892  -4.6562753   1.5160639   1.923787
        #    1.5294102  -3.9477978  -4.1282687  -4.6054296  -4.744356   -4.141881
        #   -4.5391927  -3.741973   -5.307959   -4.2154074  -4.6195626  -5.445209
        #   -3.2194402  -5.389453   -3.6945934  -4.8046436  -6.284334   -5.474563
        #   -5.204862   -4.057561   -3.2280483  -4.857151   -4.977691   -3.0048158
        #   -4.8367767  -4.3260927  -4.7814307   1.1490657   0.54084986  0.35583776
        #    2.7053645   2.8627312   0.45471185  0.03661626  0.09600939  1.5840868
        #   -2.1096165  -0.04710844  0.37016252  0.39552388  0.9574605   0.83568496
        #    0.17379391 -2.3391297   1.1396204   3.3273525   2.4632266   1.2104397
        #   -0.7498245   0.39875847 -3.1609733   0.6119117  -2.2903275 ]]
        prediction_random = tf.random.categorical(predictions, num_samples=1)    # sample one character index per time step (num_samples=1)
        predicted_id = prediction_random[-1, 0].numpy()  # [-1, 0] takes the sample for the last time step
        input_eval = tf.expand_dims([predicted_id], 0)  # expand the predicted id into the next model input
        text_generated.append(idx2char[predicted_id])  # map the output back to a character and store it
    return (start_string + ''.join(text_generated))  # finally, join the starting string with the generated characters


# 3. Call the generation function
generated = generate_text(model, start_string=u"ROMEO: ")
print("generated = \n{0}".format(generated))

Output:

Tensorflow Version: 2.4.0
path_to_file = ./datasets\shakespeare.txt
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
text[:250] = 
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
Total number of characters in the file: len(text) = 1115394
vocab = ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique characters in the text = 65
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
char2idx = {
    
    '\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
idx2char = ['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
text_as_int = [18 47 56 ... 45  8  0]
Characters mapped to int:'First Citizen' ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total number of characters in the file: len(text_as_int) = 1115394

len(char_dataset) = 1115394

Number of batches in sequence_batches: len(sequence_batches) = 11043

dataset_train = <MapDataset shapes: ((100,), (100,)), types: (tf.int32, tf.int32)>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Input text:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target text: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset_train = <BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int32, tf.int32)>
========================================================================================================================================================================================================
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
=================================================================
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
==================================================================================================== Epoch = 0 ====================================================================================================
2021-03-05 17:11:47.595475: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-03-05 17:11:47.971666: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-05 17:11:48.259322: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
Epoch 1----batch_index 0----Loss = 4.1757941246032715
Epoch 1----batch_index 50----Loss = 2.726165771484375
Epoch 1----batch_index 100----Loss = 2.31733775138855
Epoch 1----batch_index 150----Loss = 2.214596748352051
Epoch 1 Loss 2.1143
Time taken for 1 epoch 14.86131238937378 sec
==================================================================================================== Epoch = 1 ====================================================================================================
Epoch 2----batch_index 0----Loss = 2.2398929595947266
Epoch 2----batch_index 50----Loss = 1.9834604263305664
Epoch 2----batch_index 100----Loss = 1.905800223350525
Epoch 2----batch_index 150----Loss = 1.8299765586853027
Epoch 2 Loss 1.7870
Time taken for 1 epoch 14.190452575683594 sec
==================================================================================================== Epoch = 2 ====================================================================================================
Epoch 3----batch_index 0----Loss = 1.9509674310684204
Epoch 3----batch_index 50----Loss = 1.6926944255828857
Epoch 3----batch_index 100----Loss = 1.6767021417617798
Epoch 3----batch_index 150----Loss = 1.6277122497558594
Epoch 3 Loss 1.5437
Time taken for 1 epoch 14.407158851623535 sec
==================================================================================================== Epoch = 3 ====================================================================================================
Epoch 4----batch_index 0----Loss = 1.6806211471557617
Epoch 4----batch_index 50----Loss = 1.5730986595153809
Epoch 4----batch_index 100----Loss = 1.5141606330871582
Epoch 4----batch_index 150----Loss = 1.4627302885055542
Epoch 4 Loss 1.4650
Time taken for 1 epoch 14.45670461654663 sec
==================================================================================================== Epoch = 4 ====================================================================================================
Epoch 5----batch_index 0----Loss = 1.5430569648742676
Epoch 5----batch_index 50----Loss = 1.3985743522644043
Epoch 5----batch_index 100----Loss = 1.4652833938598633
Epoch 5----batch_index 150----Loss = 1.3869857788085938
Epoch 5 Loss 1.4542
Time taken for 1 epoch 14.87185549736023 sec
==================================================================================================== Epoch = 5 ====================================================================================================
Epoch 6----batch_index 0----Loss = 1.4752824306488037
Epoch 6----batch_index 50----Loss = 1.405553936958313
Epoch 6----batch_index 100----Loss = 1.383398175239563
Epoch 6----batch_index 150----Loss = 1.3764104843139648
Epoch 6 Loss 1.3731
Time taken for 1 epoch 15.285845041275024 sec
==================================================================================================== Epoch = 6 ====================================================================================================
Epoch 7----batch_index 0----Loss = 1.3768495321273804
Epoch 7----batch_index 50----Loss = 1.3169121742248535
Epoch 7----batch_index 100----Loss = 1.3428531885147095
Epoch 7----batch_index 150----Loss = 1.3394542932510376
Epoch 7 Loss 1.3475
Time taken for 1 epoch 15.550349950790405 sec
==================================================================================================== Epoch = 7 ====================================================================================================
Epoch 8----batch_index 0----Loss = 1.3372825384140015
Epoch 8----batch_index 50----Loss = 1.2701361179351807
Epoch 8----batch_index 100----Loss = 1.3123818635940552
Epoch 8----batch_index 150----Loss = 1.2893662452697754
Epoch 8 Loss 1.3320
Time taken for 1 epoch 15.643383026123047 sec
==================================================================================================== Epoch = 8 ====================================================================================================
Epoch 9----batch_index 0----Loss = 1.3260821104049683
Epoch 9----batch_index 50----Loss = 1.262296199798584
Epoch 9----batch_index 100----Loss = 1.302840232849121
Epoch 9----batch_index 150----Loss = 1.3154445886611938
Epoch 9 Loss 1.3448
Time taken for 1 epoch 15.534974575042725 sec
==================================================================================================== Epoch = 9 ====================================================================================================
Epoch 10----batch_index 0----Loss = 1.250128149986267
Epoch 10----batch_index 50----Loss = 1.2366149425506592
Epoch 10----batch_index 100----Loss = 1.2584527730941772
Epoch 10----batch_index 150----Loss = 1.221116304397583
Epoch 10 Loss 1.2273
Time taken for 1 epoch 15.521215677261353 sec
generated = 
ROMEO: I saw heaving emb'd,
We have lived both with you, sir, wondr to perform.
PETRUCHIO:
Nay, turn. Who knows her bown? How would you so?
BAPTISTA:
What do thy worser?
DUKE VINCENTIO:
Come by the far; a maid welcome that
Friviss of what pleasant face,
A sight on the giold and beauty's mirth
Indeed she was;
For the wedish'd. The day, nor hate in the chief;
For you, I fell best become as heaven,
To rave him sweet part,
And marpings I must ir gracious love, and see as he
To make a very mother?
BIONDELLO:
Why, sir, you are all needs: but whether it is boves,
Tell him with all ministers on
The profit I could speak town here:
Hath an none am pested desires above the velvet:
The world gapes me not thy wext,
As her left her attended vailer you:
I warrants, Lecio, he's a lament,
I am raid on. To my slave, or thou art.
GREMIO:
Let's scale of mine, be part.
PETRUCHIO:
Come, you may be slanken e.
GUKE VINCENTIO:
It is a hazard and o'er that will do
not trial, thou wast merry, thy wars.
Where is 

Reposted from blog.csdn.net/u013250861/article/details/114296984