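Neural Network Learning Notes 40: Spring Festival Is Here, So Why Not Write Classical Poems with an LSTM?
Preface
For no particular reason, I just felt like writing some classical poetry. After all, who isn't a "Zaun poet" (祖安诗人)?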
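Overall approach
An LSTM can extract features from an input sequence and make a prediction from them.
Today we will try using an LSTM to compose five-character poems (五言诗).
The prediction works as follows:
The previous six characters (punctuation included) are used to predict the next character.
"寒随穷律变," is used to predict "春".
"随穷律变,春" is used to predict "逐".
An LSTM built this way can predict one character at a time, step by step, until a whole poem has been composed.
That is: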
- 寒随穷律变, -> 春
- 随穷律变,春 -> 逐
- 穷律变,春逐 -> 鸟
- 律变,春逐鸟 -> 声
- 变,春逐鸟声 -> 开
- ,春逐鸟声开 -> 。
- ……
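The sliding window above can be sketched in a few lines of Python. This is only an illustration on raw strings; the helper name `sliding_pairs` is hypothetical and not part of the repository, whose real code works on character ids.

```python
def sliding_pairs(text, window=6):
    """Return (context, next_char) pairs from a fixed-size sliding window."""
    return [(text[i:i + window], text[i + window])
            for i in range(len(text) - window)]

poem = "寒随穷律变,春逐鸟声开。"
pairs = sliding_pairs(poem)
for context, target in pairs:
    print(context, "->", target)
```

Each pair is one training sample: six characters in, one character out.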
The final poem:
寒随穷律变,春逐鸟声开。初风飘带柳,晚雪间花梅。碧林青旧竹,绿沼翠新苔。芝田初雁去,绮树巧莺来。
GitHub download link
https://github.com/bubbliiiing/poems-generator
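Code implementation
1. Data processing
a. Reading the poems and converting them to ids
Read the five-character poems from the txt file that stores the poems:
For each line read in, check whether the character at index 5 of the text after the colon is the full-width comma ","; if it is, the line is treated as a five-character poem.
Then collect every character, count the occurrences, and sort the counts from high to low.
Keep the max_words most frequent characters (3000 in the full code); characters outside the vocabulary are replaced by a space.
Build the character-to-id mapping and the id-to-character mapping.
Then convert every poem to ids using the character-to-id mapping.
Each poem is now represented by the ids of its characters, for example: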
义髻抛河里,黄裙逐水流。
[835, 2197, 1604, 210, 51, 0, 172, 2135, 406, 16, 78, 1, 2]
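The implementation is as follows:
import collections
import numpy as np

END_CHAR = '\n'       # end-of-poem marker
UNKNOWN_CHAR = ' '    # stands in for out-of-vocabulary characters
max_words = 3000      # vocabulary size limit

def load(poetry_file):
    def handle(line):
        return line + END_CHAR

    # Each line looks like "title:poem"; keep only the poem text
    with open(poetry_file, encoding='utf-8') as f:
        poetrys = [line.strip().replace(' ', '').split(':')[1] for line in f]
    collect = []
    for poetry in poetrys:
        if len(poetry) <= 5:
            continue
        # A full-width comma at index 5 marks a five-character poem
        if poetry[5] == ',':
            collect.append(handle(poetry))
    print(len(collect))
    poetrys = collect
    # All characters
    words = []
    for poetry in poetrys:
        words += [word for word in poetry]
    counter = collections.Counter(words)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    # All characters, sorted by descending frequency
    words, _ = zip(*count_pairs)
    # Keep the most frequent characters; anything else maps to UNKNOWN_CHAR (a space)
    words_size = min(max_words, len(words))
    words = words[:words_size] + (UNKNOWN_CHAR,)
    # Total vocabulary size
    words_size = len(words)
    # Map characters to ids (one-hot encoding happens later, per batch)
    char2id_dict = {w: i for i, w in enumerate(words)}
    id2char_dict = {i: w for i, w in enumerate(words)}
    unknow_char = char2id_dict[UNKNOWN_CHAR]
    char2id = lambda char: char2id_dict.get(char, unknow_char)
    poetrys = sorted(poetrys, key=lambda line: len(line))
    # Every poem in the training set becomes a list of character ids
    poetrys_vector = [list(map(char2id, poetry)) for poetry in poetrys]
    # Poems differ in length, so store them in an object array
    return np.array(poetrys_vector, dtype=object), char2id_dict, id2char_dict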
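The vocabulary step can be tried on a toy corpus without the poem file. The sketch below is a simplified stand-in, not the repository's code: `build_vocab` and `encode` are hypothetical names, and `max_words=2` is chosen only to force an out-of-vocabulary character.

```python
import collections

def build_vocab(lines, max_words, unknown_char=" "):
    """Keep the max_words most frequent characters, then add the unknown marker."""
    counter = collections.Counter(ch for line in lines for ch in line)
    kept = [ch for ch, _ in counter.most_common(max_words)]
    words = kept + [unknown_char]
    char2id = {ch: i for i, ch in enumerate(words)}
    id2char = {i: ch for i, ch in enumerate(words)}
    return char2id, id2char

def encode(line, char2id, unknown_char=" "):
    """Map each character to its id; unseen characters get the unknown id."""
    unk = char2id[unknown_char]
    return [char2id.get(ch, unk) for ch in line]

char2id, id2char = build_vocab(["春风春雨", "春风"], max_words=2)
ids = encode("春雨风", char2id)
print(char2id, ids)
```

Here 雨 falls outside the two-character vocabulary, so it is encoded with the id of the space.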
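b. Converting all the poems into the 6-to-1 form
get_6to1 converts each poem into the 6-to-1 form.
The x_data passed in is a single five-character poem (already converted to ids), such as:
寒随穷律变,春逐鸟声开。初风飘带柳,晚雪间花梅。碧林青旧竹,绿沼翠新苔。芝田初雁去,绮树巧莺来。
The returned inputs is the collection of six-character windows, and the returned targets is the collection of single characters, each predicted from the corresponding window.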
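# unit_sentence (= 6) and END_CHAR are defined at the top of utils.py
def get_6to1(x_data, char2id_dict):
    inputs = []
    targets = []
    end_id = char2id_dict[END_CHAR]
    for index in range(len(x_data) - unit_sentence):
        x = x_data[index:(index + unit_sentence)]
        y = x_data[index + unit_sentence]
        # Stop once the end-of-poem marker enters the window or becomes the target
        if (end_id in x) or y == end_id:
            return np.array(inputs), np.array(targets)
        else:
            inputs.append(x)
            targets.append(y)
    return np.array(inputs), np.array(targets)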
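To see what the 6-to-1 conversion yields, the id list shown earlier for "义髻抛河里,黄裙逐水流。" can be pushed through a stripped-down version of the same loop. This is a sketch: `windows_from_ids` is a hypothetical helper, and `end_id=2` is an assumption read off the last id in that example.

```python
def windows_from_ids(ids, end_id, window=6):
    """Collect (window, target) pairs until the end marker would be the target."""
    inputs, targets = [], []
    for i in range(len(ids) - window):
        target = ids[i + window]
        if target == end_id:
            break
        inputs.append(ids[i:i + window])
        targets.append(target)
    return inputs, targets

ids = [835, 2197, 1604, 210, 51, 0, 172, 2135, 406, 16, 78, 1, 2]
inputs, targets = windows_from_ids(ids, end_id=2)  # end_id=2 is assumed
print(len(inputs), targets)
```

A twelve-character couplet therefore contributes six training samples, and the end marker itself is never used as a target.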
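2. Building the neural network
Building the network is very simple and takes only the few lines below.
Each time step of the input has dimension words_size, i.e. the total number of distinct characters, because every character is fed in as a one-hot vector.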
#-------------------------------#
#   Build the neural network
#-------------------------------#
inputs = Input(shape=(None,words_size))
x = CuDNNLSTM(UNITS,return_sequences=True)(inputs)
x = Dropout(0.6)(x)
x = CuDNNLSTM(UNITS)(x)
x = Dropout(0.6)(x)
x = Dense(words_size, activation='softmax')(x)
model = Model(inputs,x)
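The input shape `(None, words_size)` means each time step is one one-hot vector. A minimal NumPy sketch of that encoding follows; `one_hot_window` is a hypothetical helper, and the vocabulary size 3001 assumes `max_words = 3000` plus the unknown character.

```python
import numpy as np

def one_hot_window(ids, words_size):
    """Encode a window of character ids as a (1, len(ids), words_size) array."""
    x = np.zeros((1, len(ids), words_size), dtype=np.float32)
    x[0, np.arange(len(ids)), ids] = 1.0
    return x

window = [835, 2197, 1604, 210, 51, 0]
x = one_hot_window(window, words_size=3001)  # 3001 is an assumed size
print(x.shape)
```

An array of this shape is what the model receives for a single six-character prediction.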
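3. Generating poems
Take the first six characters of a randomly chosen poem, then predict onward one character at a time until the whole poem is complete.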
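def predict_from_nothing(epoch, x_data, char2id_dict, id2char_dict, model):
    # During training, print a sample poem every epoch to show progress
    print("\n#-----------------------Epoch {}-----------------------#".format(epoch))
    words_size = len(id2char_dict)

    # Seed the poem with the first six characters of a random poem
    index = np.random.randint(0, len(x_data))
    sentence = x_data[index][:unit_sentence]

    def _pred(text):
        # One-hot encode the last unit_sentence characters
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, char_id in enumerate(temp):
            x_pred[0, t, char_id] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalize in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype='float64')
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)), 1, p=preds)
        # If the unknown character (a space) is sampled, draw a random non-punctuation character instead
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in [',', '。', ' ']:
                choice_id = np.random.randint(0, len(char2id_dict), 1)
        return choice_id

    # Generate until the poem has 24 characters (four lines of five characters plus punctuation)
    for i in range(24 - unit_sentence):
        pred = _pred(sentence)
        sentence = np.append(sentence, pred)
    output = ""
    for i in range(len(sentence)):
        output = output + id2char_dict[sentence[i]]
    print(output)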
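The key step in generation is sampling the next character from the softmax output instead of always taking the most likely one, which keeps the poems varied. The sketch below isolates that step; `sample_id` is a hypothetical helper, and the three-entry probability vector is made up for illustration.

```python
import numpy as np

def sample_id(preds, rng):
    """Sample an index from a probability vector, renormalizing in float64."""
    p = np.asarray(preds, dtype=np.float64)
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)  # toy softmax output
draws = [sample_id(probs, rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts)
```

The renormalization is there because float32 softmax outputs do not always sum to exactly 1, which NumPy's sampling functions can reject.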
Full code
The code needs to be organized as follows:
1. poem_keras.py
import numpy as np
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from keras.layers import CuDNNLSTM,Dense,Input,Softmax,Convolution1D,Embedding,Dropout
from keras.callbacks import LambdaCallback
from keras.optimizers import Adam
from keras.models import Model
from utils import load,get_batch,predict_from_nothing,predict_from_head
UNITS = 256
batch_size = 64
epochs = 50
poetry_file = 'poetry.txt'
# Load the data
x_data,char2id_dict,id2char_dict = load(poetry_file)
max_length = max([len(txt) for txt in x_data])
words_size = len(char2id_dict)
#-------------------------------#
#   Build the neural network
#-------------------------------#
inputs = Input(shape=(None,words_size))
x = CuDNNLSTM(UNITS,return_sequences=True)(inputs)
x = Dropout(0.6)(x)
x = CuDNNLSTM(UNITS)(x)
x = Dropout(0.6)(x)
x = Dense(words_size, activation='softmax')(x)
model = Model(inputs,x)
#-------------------------------#
#   Split into training and validation sets
#-------------------------------#
val_split = 0.1
np.random.seed(10101)
np.random.shuffle(x_data)
np.random.seed(None)
num_val = int(len(x_data)*val_split)
num_train = len(x_data) - num_val
#-------------------------------#
#   Configure checkpoint saving
#-------------------------------#
checkpoint = ModelCheckpoint('logs/loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
    monitor='val_loss', save_weights_only=True, save_best_only=False, period=1)
#-------------------------------#
#   Set the learning rate and train (first stage, lr=1e-3)
#-------------------------------#
model.compile(optimizer=Adam(1e-3), loss='categorical_crossentropy',
              metrics=['accuracy'])

for i in range(epochs):
    predict_from_nothing(i, x_data, char2id_dict, id2char_dict, model)
    model.fit_generator(get_batch(batch_size, x_data[:num_train], char2id_dict, id2char_dict),
            steps_per_epoch=max(1, num_train//batch_size),
            # Validate on the held-out poems rather than on the training slice
            validation_data=get_batch(batch_size, x_data[num_train:], char2id_dict, id2char_dict),
            validation_steps=max(1, num_val//batch_size),
            epochs=1,
            initial_epoch=0,
            callbacks=[checkpoint])
#-------------------------------#
#   Lower the learning rate and keep training (second stage, lr=1e-4)
#-------------------------------#
model.compile(optimizer=Adam(1e-4), loss='categorical_crossentropy',
              metrics=['accuracy'])

for i in range(epochs):
    predict_from_nothing(i, x_data, char2id_dict, id2char_dict, model)
    model.fit_generator(get_batch(batch_size, x_data[:num_train], char2id_dict, id2char_dict),
            steps_per_epoch=max(1, num_train//batch_size),
            # Validate on the held-out poems rather than on the training slice
            validation_data=get_batch(batch_size, x_data[num_train:], char2id_dict, id2char_dict),
            validation_steps=max(1, num_val//batch_size),
            epochs=1,
            initial_epoch=0,
            callbacks=[checkpoint])
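For the validation loss to mean anything, the validation poems should be disjoint from the training poems. A small sketch of a seeded 90/10 split follows; `split_train_val` is a hypothetical helper, and it reuses the seed 10101 and `val_split = 0.1` from the script only for illustration.

```python
import numpy as np

def split_train_val(data, val_split=0.1, seed=10101):
    """Shuffle with a fixed seed, then hold out the last val_split fraction."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))
    num_val = int(len(data) * val_split)
    num_train = len(data) - num_val
    train = [data[i] for i in order[:num_train]]
    val = [data[i] for i in order[num_train:]]
    return train, val

data = list(range(100))
train, val = split_train_val(data)
print(len(train), len(val))
```

Slicing with `[:num_train]` for training and `[num_train:]` for validation is what keeps the two sets from overlapping.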
2. utils.py
import numpy as np
import collections
END_CHAR = '\n'
UNKNOWN_CHAR = ' '
unit_sentence = 6
max_words = 3000
MIN_LENGTH = 10
def load(poetry_file):
    def handle(line):
        return line + END_CHAR

    # Each line looks like "title:poem"; keep only the poem text
    with open(poetry_file, encoding='utf-8') as f:
        poetrys = [line.strip().replace(' ', '').split(':')[1] for line in f]
    collect = []
    for poetry in poetrys:
        if len(poetry) <= 5:
            continue
        # A full-width comma at index 5 marks a five-character poem
        if poetry[5] == ',':
            collect.append(handle(poetry))
    print(len(collect))
    poetrys = collect
    # All characters
    words = []
    for poetry in poetrys:
        words += [word for word in poetry]
    counter = collections.Counter(words)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    # All characters, sorted by descending frequency
    words, _ = zip(*count_pairs)
    # Keep the most frequent characters; anything else maps to UNKNOWN_CHAR (a space)
    words_size = min(max_words, len(words))
    words = words[:words_size] + (UNKNOWN_CHAR,)
    # Total vocabulary size
    words_size = len(words)
    # Map characters to ids (one-hot encoding happens later, per batch)
    char2id_dict = {w: i for i, w in enumerate(words)}
    id2char_dict = {i: w for i, w in enumerate(words)}
    unknow_char = char2id_dict[UNKNOWN_CHAR]
    char2id = lambda char: char2id_dict.get(char, unknow_char)
    poetrys = sorted(poetrys, key=lambda line: len(line))
    # Every poem in the training set becomes a list of character ids
    poetrys_vector = [list(map(char2id, poetry)) for poetry in poetrys]
    # Poems differ in length, so store them in an object array
    return np.array(poetrys_vector, dtype=object), char2id_dict, id2char_dict

def get_6to1(x_data, char2id_dict):
    inputs = []
    targets = []
    end_id = char2id_dict[END_CHAR]
    for index in range(len(x_data) - unit_sentence):
        x = x_data[index:(index + unit_sentence)]
        y = x_data[index + unit_sentence]
        # Stop once the end-of-poem marker enters the window or becomes the target
        if (end_id in x) or y == end_id:
            return np.array(inputs), np.array(targets)
        else:
            inputs.append(x)
            targets.append(y)
    return np.array(inputs), np.array(targets)

def get_batch(batch_size, x_data, char2id_dict, id2char_dict):
    n = len(x_data)
    batch_i = 0
    words_size = len(char2id_dict)
    while True:
        one_hot_x_data = []
        one_hot_y_data = []
        # Collect every 6-to-1 pair from batch_size consecutive poems
        for i in range(batch_size):
            batch_i = (batch_i + 1) % n
            inputs, targets = get_6to1(x_data[batch_i], char2id_dict)
            for j in range(len(inputs)):
                one_hot_x_data.append(inputs[j])
                one_hot_y_data.append(targets[j])
        # The number of samples varies, because each poem yields several pairs
        batch_size_after = len(one_hot_x_data)
        input_data = np.zeros(
            (batch_size_after, unit_sentence, words_size))
        target_data = np.zeros(
            (batch_size_after, words_size))
        for i, (input_text, target_text) in enumerate(zip(one_hot_x_data, one_hot_y_data)):
            # One-hot encode the six input characters
            for t, index in enumerate(input_text):
                input_data[i, t, index] = 1
            # One-hot encode the target character
            target_data[i, target_text] = 1.
        yield input_data, target_data

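Note that `batch_size` counts poems, not samples: every poem expands into several 6-to-1 pairs, so the arrays yielded per step are much larger than 64 rows. A rough estimate, assuming each poem of n characters (end marker excluded) yields n - 6 pairs; `samples_in_batch` is a hypothetical helper:

```python
def samples_in_batch(poem_lengths, window=6):
    """Estimate how many 6-to-1 samples a batch of poems expands into.

    Each length counts the poem's characters without the end marker.
    """
    return sum(max(0, n - window) for n in poem_lengths)

# 64 poems of 24 characters each (four lines of five characters plus punctuation)
total = samples_in_batch([24] * 64)
print(total)
```

With one-hot inputs this matters for memory, since each sample is a 6 x words_size array.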
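def predict_from_nothing(epoch, x_data, char2id_dict, id2char_dict, model):
    # During training, print a sample poem every epoch to show progress
    print("\n#-----------------------Epoch {}-----------------------#".format(epoch))
    words_size = len(id2char_dict)

    # Seed the poem with the first six characters of a random poem
    index = np.random.randint(0, len(x_data))
    sentence = x_data[index][:unit_sentence]

    def _pred(text):
        # One-hot encode the last unit_sentence characters
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, char_id in enumerate(temp):
            x_pred[0, t, char_id] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalize in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype='float64')
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)), 1, p=preds)
        # If the unknown character (a space) is sampled, draw a random non-punctuation character instead
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in [',', '。', ' ']:
                choice_id = np.random.randint(0, len(char2id_dict), 1)
        return choice_id

    # Generate until the poem has 24 characters (four lines of five characters plus punctuation)
    for i in range(24 - unit_sentence):
        pred = _pred(sentence)
        sentence = np.append(sentence, pred)
    output = ""
    for i in range(len(sentence)):
        output = output + id2char_dict[sentence[i]]
    print(output)
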
def predict_from_head(epoch, name, x_data, char2id_dict, id2char_dict, model):
    # Generate an acrostic poem whose four lines start with the given characters
    # Pad the name to four characters with random non-punctuation characters
    if len(name) < 4:
        for i in range(4 - len(name)):
            index = np.random.randint(0, len(char2id_dict))
            while id2char_dict[index] in [',', '。', ' ']:
                index = np.random.randint(0, len(char2id_dict))
            name += id2char_dict[index]

    origin_name = name
    name = list(name)

    # Replace characters missing from the vocabulary with random ones
    for i in range(len(name)):
        if name[i] not in char2id_dict:
            index = np.random.randint(0, len(char2id_dict))
            while id2char_dict[index] in [',', '。', ' ']:
                index = np.random.randint(0, len(char2id_dict))
            name[i] = id2char_dict[index]

    name = ''.join(name)
    words_size = len(char2id_dict)
    index = np.random.randint(0, len(x_data))

    # Initial input: the last five characters of a random poem (end marker excluded)
    # followed by the first head character
    sentence = np.append(x_data[index][-unit_sentence:-1], char2id_dict[name[0]])

    def _pred(text):
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, char_id in enumerate(temp):
            x_pred[0, t, char_id] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalize in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype='float64')
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)), 1, p=preds)
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in [',', '。', ' ']:
                choice_id = np.random.randint(0, len(char2id_dict), 1)
        return choice_id

    # Finish the first line after its head character
    for i in range(5):
        pred = _pred(sentence)
        sentence = np.append(sentence, pred)

    # Drop the borrowed seed; keep only the first generated line
    sentence = sentence[-unit_sentence:]
    # For each remaining head character, append it and predict the rest of its line
    for i in range(3):
        sentence = np.append(sentence, char2id_dict[name[i + 1]])
        for _ in range(5):
            pred = _pred(sentence)
            sentence = np.append(sentence, pred)

    output = []
    for i in range(len(sentence)):
        output.append(id2char_dict[sentence[i]])
    # Restore the original head characters, including any that were out of vocabulary
    for i in range(4):
        output[i * 6] = origin_name[i]
    output = ''.join(output)

    print(output)
3. Results
列辟鸣鸾至,恭登贯凤韬。流川将合命,悠悠意从如。
Acrostic poem:
快乐 ("happy", although the generated poem doesn't seem all that happy...)
快尘浮老田,乐炭唯爱坟。号之二亩士,芳草再无魂。
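The acrostic structure can be checked by reading every sixth character, since each line is five characters plus one punctuation mark. In the sketch below `line_heads` is a hypothetical helper; the two trailing head characters are presumably the random padding that predict_from_head adds when the given name is shorter than four characters.

```python
def line_heads(poem, line_length=6):
    """Return the first character of each line of a fixed-width poem."""
    return poem[::line_length]

poem = "快尘浮老田,乐炭唯爱坟。号之二亩士,芳草再无魂。"
heads = line_heads(poem)
print(heads)
```

The first two heads spell the requested 快乐.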