Keras Deep Learning in Practice (30): Literary Creation with a Text Generation Model

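0. Introduction

In sentiment classification, the neural network predicts a discrete event, namely the sentiment class (positive, negative, or neutral). This is a many-to-one architecture: multiple inputs produce a single output. In this section we learn how to implement a many-to-many architecture (multiple inputs produce multiple outputs), where the input is a given sequence of 10 words and the model output is the 50 words likely to follow that sequence.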

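1. Text Generation Model and Dataset Analysis

1.1 Dataset Analysis

Alice's Adventures in Wonderland is a classic fantasy adventure novel. In this section we use the text of the novel as the dataset for building a many-to-many model that takes a given sequence of 10 words as input and predicts the 50 words that follow it. The complete text of the novel is available for download online.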

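1.2 Model Analysis

Before implementing the text generation model, we first outline the basic workflow of the text generation strategy:

Preprocess the text: convert every word to lowercase and remove punctuation
Assign an ID to each unique word, then convert the dataset into a sequence of word IDs
Slide a window of size 10 over the whole dataset, taking 10 words as the input and the single word that immediately follows them as the output
Convert the input sequences to one-hot encoded form and feed them to an LSTM layer, which connects through the hidden layer to the output layer; build and train this model so that the output layer produces the one-hot encoding of the next word
To generate text, pick a word at a random position and predict the word that follows it, taking into account the 9 words that precede the chosen word
Shift the input window one position forward, so that the 10th word in the window is the word predicted in the previous step
Repeat this process to generate multiple words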
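
The sliding-window step in the workflow above can be sketched as follows. This is a minimal illustration rather than the article's exact code: the helper name `make_windows` and the short sample sentence are assumptions introduced here only for demonstration.

```python
def make_windows(tokens, seq_len=10, step=1):
    """Split a token list into (input window, next word) training pairs."""
    inputs, labels = [], []
    for i in range(0, len(tokens) - seq_len, step):
        inputs.append(tokens[i:i + seq_len])   # seq_len consecutive words
        labels.append(tokens[i + seq_len])     # the word right after the window
    return inputs, labels

# Hypothetical sample text (15 tokens), used only to show the shapes involved
tokens = ("alice was beginning to get very tired of sitting "
          "by her sister on the bank").split()
inputs, labels = make_windows(tokens, seq_len=10)

print(len(inputs))   # number of training pairs
print(inputs[0])     # first 10-word window
print(labels)        # the word that follows each window
```

With 15 tokens and a window of 10, the loop yields 5 pairs; each window overlaps the previous one by 9 words.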

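2. Building the Text Generation Model

Following the analysis in the previous section, the input to the recurrent neural network (RNN) is a given sequence of 10 words, and the model predicts the next likely word. Next, we train the model on the Alice's Adventures in Wonderland dataset to generate text.

2.1 Data Preprocessing

(1) Import the required libraries and load the dataset: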

from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding
import numpy as np

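# Read the novel, lowercase each line, and skip blank lines
lines = []
with open('11-0.txt', encoding='utf-8-sig') as fin:
    for line in fin:
        line = line.strip().lower()
        if len(line) == 0:
            continue
        lines.append(line)
# Join all non-empty lines into one long string
text = " ".join(lines)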

print(text[100:196])

A sample of the input text is shown below:

the use of anyone anywhere in the united states and most other parts of the world at no cost and wit

(2) Preprocess the text by removing punctuation and converting all words to lowercase:

import re
text = text.lower()
text = re.sub('[^0-9a-zA-Z]+',' ',text)
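
To see what this regular expression does in practice, here is a small sketch applied to a made-up sample string (the string is an assumption for illustration, not taken from the dataset file). Note that the apostrophe is also replaced by a space, which is why a standalone 's' token appears in the word lists.

```python
import re

# Hypothetical sample string, used only to demonstrate the cleaning step
sample = "Alice's Adventures in Wonderland, by Lewis Carroll!"

# Lowercase, then replace every run of non-alphanumeric characters with one space
cleaned = re.sub('[^0-9a-zA-Z]+', ' ', sample.lower())
tokens = cleaned.split()

print(tokens)
```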

(3) Assign an index to each distinct word so that it can be referenced when constructing the training and test datasets:

from collections import Counter
counts = Counter()
counts.update(text.split())
words = sorted(counts, key=counts.get, reverse=True)
nb_words = len(text.split())
word2index = {word: i for i, word in enumerate(words)}
index2word = {i: word for i, word in enumerate(words)}
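
The effect of this indexing scheme is easier to see on a tiny example. The sketch below reuses the same approach on a made-up sentence (an assumption for illustration): the most frequent word receives index 0, the next most frequent index 1, and so on.

```python
from collections import Counter

# Hypothetical toy text, used only to illustrate the index assignment
sample_text = "the cat and the hat and the bat"

counts = Counter(sample_text.split())
# Sort words from most to least frequent
words = sorted(counts, key=counts.get, reverse=True)
word2index = {word: i for i, word in enumerate(words)}
index2word = {i: word for i, word in enumerate(words)}

# Encode the text as a list of word IDs, then decode it back
encoded = [word2index[w] for w in sample_text.split()]
decoded = " ".join(index2word[i] for i in encoded)

print(word2index["the"], word2index["and"])
print(encoded)
```

The two dictionaries are inverses of each other, so decoding the ID sequence reproduces the original text.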

(4) Construct the input and output word sets. We use sequences of 10 words as the input and try to predict the 11th word:

SEQLEN = 10
STEP = 1
input_words = []
label_words = []
text2=text.split()
for i in range(0,nb_words-SEQLEN,STEP):
    x=text2[i:(i+SEQLEN)]
    y=text2[i+SEQLEN]
    input_words.append(x)
    label_words.append(y)
# Print a sample from the input_words and label_words lists
print('input words list: ','\n',input_words[0])
print('label(output) words list: ','\n',label_words[0])

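Samples from the input_words and label_words lists are shown below. input_words is a list in which each element is itself a list of 10 words, while each element of label_words is the corresponding output word: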

input words list:  
 ['the', 'project', 'gutenberg', 'ebook', 'of', 'alice', 's', 'adventures', 'in', 'wonderland']
label(output) words list:  
 by

(5) Convert the input sequences and output words to one-hot encoded form. First create the empty arrays:

total_words = len(set(words))
# The built-in bool works as a dtype on every NumPy version (np.bool is unavailable in NumPy 1.24-1.26)
x = np.zeros((len(input_words), SEQLEN, total_words), dtype=bool)
y = np.zeros((len(input_words), total_words), dtype=bool)

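Fill the empty arrays created in the previous step using two for loops. The outer loop iterates over all input word sequences (each containing 10 words), and the inner loop iterates over the individual words of the current sequence. Because each element of the output list is a single word, filling y does not need the inner loop: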

# One-hot encode the input and output data
for i, input_word in enumerate(input_words):
    for j, word in enumerate(input_word):
        x[i, j, word2index[word]] = 1
    y[i,word2index[label_words[i]]]=1
# Print the shapes of the input and output arrays
print('Shape of x: ',x.shape)
print('Shape of y: ',y.shape)

The shapes of x and y are printed as follows:

Shape of x:  (30664, 10, 3036)
Shape of y:  (30664, 3036)
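
The same encoding can be checked by hand on a miniature example. The sketch below assumes a made-up 4-word vocabulary and windows of 3 words instead of 10; only the array layout (samples, window length, vocabulary size) mirrors the code above.

```python
import numpy as np

# Hypothetical toy vocabulary and samples, for illustration only
vocab = ["the", "and", "alice", "said"]
word2index = {w: i for i, w in enumerate(vocab)}
sequences = [["the", "alice", "said"], ["alice", "said", "and"]]
labels = ["and", "the"]
seq_len = 3

x = np.zeros((len(sequences), seq_len, len(vocab)), dtype=bool)
y = np.zeros((len(sequences), len(vocab)), dtype=bool)
for i, seq in enumerate(sequences):
    for j, word in enumerate(seq):
        x[i, j, word2index[word]] = True   # one True per (sample, position)
    y[i, word2index[labels[i]]] = True     # one True per sample

print(x.shape, y.shape)
print(x[0].astype(int))
```

Each row of x[0] contains a single 1 at the index of the corresponding word, which is why the full arrays grow with the vocabulary size.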

2.2 Model Construction and Training

(1) Define the model architecture:

HIDDEN_SIZE = 128
BATCH_SIZE = 32
NUM_ITERATIONS = 100
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100

model = Sequential()
model.add(LSTM(HIDDEN_SIZE,return_sequences=False,input_shape=(SEQLEN,total_words)))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

The model summary is printed as follows:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 128)               1620480   
_________________________________________________________________
dense (Dense)                (None, 3036)              391644    
=================================================================
Total params: 2,012,124
Trainable params: 2,012,124
Non-trainable params: 0
_________________________________________________________________
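
The parameter counts in this summary can be reproduced with a short calculation. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a dense layer; the helper function names are introduced here for illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, hidden_size = 3036, 128
lstm_count = lstm_params(vocab_size, hidden_size)
dense_count = dense_params(hidden_size, vocab_size)

print(lstm_count, dense_count, lstm_count + dense_count)
```

Most of the LSTM's parameters come from the 3036-dimensional one-hot input, which is one reason an embedding layer is often preferred for larger vocabularies.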

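(2) Fit the model and observe how the output changes as the number of epochs increases. After each epoch we take a 10-word sequence and repeatedly predict the next likely word. As the number of training epochs grows, the model's predictions gradually improve: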

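for iteration in range(NUM_ITERATIONS):
    print("\n", "=" * 50)
    print("Iteration #: %d" % (iteration))
    # Train for one epoch, holding out the last 10% of the samples for validation
    model.fit(x, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION, validation_split=0.1)
    # Pick a random seed from the last 10% of the data (a negative index counts from the end)
    test_idx = np.random.randint(1, int(len(input_words) * 0.1) + 1) * (-1)
    test_words = input_words[test_idx]
    print("Generating from seed: %s" % (test_words))
    for _ in range(NUM_PREDS_PER_EPOCH):
        # One-hot encode the current 10-word window
        Xtest = np.zeros((1, SEQLEN, total_words))
        for j, word in enumerate(test_words):
            Xtest[0, j, word2index[word]] = 1
        # Take the most probable next word
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2word[np.argmax(pred)]
        print(ypred, end=' ')
        # Slide the window: drop the first word and append the prediction
        test_words = test_words[1:] + [ypred]
    print()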

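In the code above, a for loop trains the model iteratively, one epoch per iteration. In each iteration we also pick a random input sequence from the validation portion of the data and convert it to its one-hot encoded form, which gives an array of shape (1, 10, total_words). Finally, we run the model on this array and take the word with the highest probability. Compare the model's output at the 2nd epoch and at the 50th epoch: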

Iteration #: 1
Generating from seed: ['archive', 'foundation', 'and', 'how', 'your', 'efforts', 'and', 'donations', 'can', 'help']
the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the and and the 
 ==================================================
....
Iteration #: 49
Generating from seed: ['1', 'e', '1', 'with', 'active', 'links', 'or', 'immediate', 'access', 'to']
think i know what they re re to them for day i all great question it s and that s enough of the reason and the jury because all would were them next the they were got silence their door to put down the king the king said in alice hand and was round to herself i almost wish she could see them of little helpless well this sort of had to up and alice up too much of to say whether i shall like a little remark it was not much very much than way to find the door
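
The rolling-window generation used above can be isolated from the trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for the model's prediction step (it is not part of Keras or of the article's code); it simply shows how each predicted word is fed back into the window, and how a deterministic predictor can fall into a repeating cycle, much like the repetition in the early-epoch output.

```python
def generate(seed, predict_next, num_words):
    """Generate words by repeatedly predicting and sliding the window."""
    window = list(seed)
    generated = []
    for _ in range(num_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # Drop the oldest word and append the prediction
        window = window[1:] + [next_word]
    return generated

def toy_predict(window):
    # Hypothetical predictor: follows a fixed cycle based on the last word
    cycle = ["down", "the", "rabbit", "hole"]
    last = window[-1]
    if last in cycle:
        return cycle[(cycle.index(last) + 1) % len(cycle)]
    return cycle[0]

seed = ["alice", "went"]
output = generate(seed, toy_predict, 6)
print(" ".join(output))
```

The window length stays constant throughout, so after as many steps as the window is long, the input consists entirely of the model's own predictions.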

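Summary

We have already covered how recurrent neural networks and long short-term memory networks work, and used both architectures for sentiment classification. Sentiment classification is a classic many-to-one application, in which the multiple words of the input correspond to a single output: positive, negative, or neutral. Recurrent neural networks can also handle many-to-many applications. In this section we built a text generation model with a recurrent neural network and trained it on the novel Alice's Adventures in Wonderland, as an attempt at literary creation with a neural network.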
