Neural Network Learning Notes 40: Spring Festival Is Here, Shall We Write Classical Poems with an LSTM?

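Preface
Overall approach
GitHub download link
Code implementation

1. Data processing

a. Reading the poems and converting them to ids
b. Converting all the poems into 6-to-1 form

2. Building the neural network
3. Generating poems

Full code

1. poem_keras.py
2. utils.py
3. Results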

Preface

For some reason I just felt like writing some classical poetry. After all, who isn't a bit of a "Zu'an poet"?

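Overall approach

An LSTM can extract features from an input sequence and make a prediction from them.

Today we will try using an LSTM to compose five-character poems (五言诗).

The prediction follows this idea:

The previous six characters (punctuation counts as a character) are used to predict the next character.

"寒随穷律变，" is used to predict "春".
"随穷律变，春" is used to predict "逐".

Building the LSTM this way lets us predict forward one character at a time and so compose a whole poem.
That is: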

寒随穷律变， -> 春
随穷律变，春 -> 逐
穷律变，春逐 -> 鸟
律变，春逐鸟 -> 声
变，春逐鸟声 -> 开
，春逐鸟声开 -> 。
……

The final poem:
寒随穷律变，春逐鸟声开。初风飘带柳，晚雪间花梅。碧林青旧竹，绿沼翠新苔。芝田初雁去，绮树巧莺来。
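
The sliding window described above can be sketched in a few lines of plain Python. This is only an illustration on the raw string; the function name sliding_pairs is made up for this example, and the article's own code works on character ids rather than on characters.

```python
def sliding_pairs(text, window=6):
    """Return (context, next_char) pairs produced by a sliding window."""
    pairs = []
    # Stop early enough that every window still has a following character
    for start in range(len(text) - window):
        context = text[start:start + window]
        next_char = text[start + window]
        pairs.append((context, next_char))
    return pairs


couplet = "寒随穷律变，春逐鸟声开。"
for context, next_char in sliding_pairs(couplet):
    print(context, "->", next_char)
```

Each printed line corresponds to one training sample: six input characters and the single character that follows them.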

GitHub download link

https://github.com/bubbliiiing/poems-generator

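Code implementation

1. Data processing

a. Reading the poems and converting them to ids

First, read the five-character poems out of the txt file that stores the poems.
For each line read in, check whether the character at index 5 of the text after the ":" is "，"; if it is, the poem is treated as a five-character poem.
Next, collect all the characters, count how often each one occurs, and sort them by count from high to low.

Keep only the most frequent characters (at most max_words, which is 3000 in the code); any character outside this vocabulary is replaced by a space.

Build a character-to-id mapping and an id-to-character mapping.

Then convert every poem to ids using the character-to-id mapping.

Each poem is now represented by the ids of its characters, followed by the id of the end-of-poem marker.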

义髻抛河里，黄裙逐水流。
[835, 2197, 1604, 210, 51, 0, 172, 2135, 406, 16, 78, 1, 2]

The implementation is as follows:

def load(poetry_file):
    def handle(line):
        return line + END_CHAR

    # Each line has the form "title:content"; keep only the content
    with open(poetry_file, encoding='utf-8') as f:
        poetrys = [line.strip().replace(' ', '').split(':')[1] for line in f]
    collect = []
    for poetry in poetrys:
        if len(poetry) <= 5 :
            continue
        # A comma at index 5 marks a five-character poem
        if poetry[5]=="，":
            collect.append(handle(poetry))
    print(len(collect))
    poetrys = collect
    # All characters
    words = []
    for poetry in poetrys:
        words += [word for word in poetry]
    counter = collections.Counter(words)
    
    # Sort all characters by count, most frequent first
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = zip(*count_pairs)
    # Keep the max_words most frequent characters as the vocabulary;
    # characters outside it are replaced by UNKNOWN_CHAR (a space)
    words_size = min(max_words, len(words))
    words = words[:words_size] + (UNKNOWN_CHAR,)
    # Total vocabulary size
    words_size = len(words)

    # Map characters to ids (the ids are one-hot encoded later)
    char2id_dict = {w: i for i, w in enumerate(words)}
    id2char_dict = {i: w for i, w in enumerate(words)}
    
    unknow_char = char2id_dict[UNKNOWN_CHAR]
    char2id = lambda char: char2id_dict.get(char, unknow_char)
    poetrys = sorted(poetrys, key=lambda line: len(line))
    # Every character of every poem is replaced by its id
    poetrys_vector = [list(map(char2id, poetry)) for poetry in poetrys]
    # Poems differ in length, so store them as an object array of lists
    return np.array(poetrys_vector, dtype=object),char2id_dict,id2char_dict
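
To see the vocabulary step in isolation, here is a minimal sketch on two in-memory lines instead of the poem file. The helper names build_vocab and encode are made up for this example; the logic mirrors the frequency ranking and the unknown-character fallback described above, but it is not the article's load function.

```python
import collections


def build_vocab(poems, max_words=3000, unknown_char=" "):
    """Rank characters by frequency and map them to integer ids."""
    counter = collections.Counter(ch for poem in poems for ch in poem)
    # most_common sorts by count, highest first
    words = [ch for ch, _ in counter.most_common(max_words)]
    # The unknown character always gets the last id
    words.append(unknown_char)
    char2id = {ch: i for i, ch in enumerate(words)}
    id2char = {i: ch for i, ch in enumerate(words)}
    return char2id, id2char


def encode(poem, char2id, unknown_char=" "):
    """Replace every character by its id, falling back to the unknown id."""
    unknown_id = char2id[unknown_char]
    return [char2id.get(ch, unknown_id) for ch in poem]


char2id, id2char = build_vocab(["春眠不觉晓，", "春风吹又生。"])
ids = encode("春雨", char2id)
print(ids)
```

"春" is the most frequent character in this toy corpus and therefore gets id 0, while "雨" is not in the vocabulary and falls back to the id of the space.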

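b. Converting all the poems into 6-to-1 form

get_6to1 converts a poem into 6-to-1 form.

The x_data passed in is a single five-character poem (as a list of ids), for example:

寒随穷律变，春逐鸟声开。初风飘带柳，晚雪间花梅。碧林青旧竹，绿沼翠新苔。芝田初雁去，绮树巧莺来。

The returned inputs is the collection of six-character windows, and the returned targets is the collection of single characters predicted from each window.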

def get_6to1(x_data,char2id_dict):
    inputs = []
    targets = []
    end_id = char2id_dict[END_CHAR]
    # Stop before a window plus its target would run past the poem
    for index in range(len(x_data) - unit_sentence):
        x = x_data[index:(index+unit_sentence)]
        y = x_data[index+unit_sentence]
        # x_data holds ids, so compare against the id of END_CHAR
        if (end_id in x) or y == end_id:
            return np.array(inputs),np.array(targets)
        else:
            inputs.append(x)
            targets.append(y)
    return np.array(inputs),np.array(targets)

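2. Building the neural network

Building the network is very simple and takes only the few lines below.
The feature dimension of each time step must be set to words_size, that is, the total number of characters in the vocabulary.

#-------------------------------#
#   Build the neural network
#-------------------------------#
# CuDNNLSTM is the GPU-only LSTM layer from older Keras releases
inputs = Input(shape=(None,words_size))
x = CuDNNLSTM(UNITS,return_sequences=True)(inputs)
x = Dropout(0.6)(x)
x = CuDNNLSTM(UNITS)(x)                  
x = Dropout(0.6)(x)
x = Dense(words_size, activation='softmax')(x)
model = Model(inputs,x)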
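
Because the input layer expects words_size features per time step, each window of six ids has to be one-hot encoded before it reaches the network. The sketch below shows that conversion with NumPy only; the function name one_hot_windows and the toy vocabulary size of 5 are assumptions made for this example.

```python
import numpy as np


def one_hot_windows(windows, words_size):
    """Turn id windows of shape (N, T) into one-hot data of shape (N, T, words_size)."""
    windows = np.asarray(windows, dtype=np.int64)
    n, t = windows.shape
    encoded = np.zeros((n, t, words_size), dtype=np.float32)
    # Advanced indexing sets exactly one position per (sample, time step)
    encoded[np.arange(n)[:, None], np.arange(t)[None, :], windows] = 1.0
    return encoded


x = one_hot_windows([[0, 2, 1, 3, 4, 2]], words_size=5)
print(x.shape)  # (1, 6, 5)
```

The resulting shape (batch, 6, words_size) matches Input(shape=(None, words_size)), where None leaves the number of time steps open.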

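3. Generating poems

Take the first six characters of a randomly chosen poem, then predict forward one character at a time until the whole poem has been generated.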

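def predict_from_nothing(epoch,x_data,char2id_dict,id2char_dict,model):
    # During training, print a sample poem once per epoch
    print("\n#-----------------------Epoch {}-----------------------#".format(epoch))
    words_size = len(id2char_dict)
    
    # Start from the first six characters of a random poem
    index = np.random.randint(0, len(x_data))
    sentence = x_data[index][:unit_sentence]
    def _pred(text):
        # One-hot encode the last six characters
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, index in enumerate(temp):
            x_pred[0, t, index] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalise in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype=np.float64)
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)),1,p=preds)
        # If the unknown character (a space) was drawn, redraw a random
        # character that is neither punctuation nor a space
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in ['，','。',' ']:
                choice_id = np.random.randint(0,len(char2id_dict),1)
        return choice_id

    # Generate until the poem has 24 characters (four lines of 5 + 1)
    for i in range(24-unit_sentence):
        pred = _pred(sentence)
        sentence = np.append(sentence,pred)
    output = ""
    for i in range(len(sentence)):
        output = output + id2char_dict[sentence[i]]
    print(output)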
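
The key step in _pred is that the next character is sampled from the softmax output rather than taken as the most likely one, which keeps the generated poems varied. The sketch below isolates that step; sample_id is a hypothetical helper, and the three-element vector stands in for a real model output.

```python
import numpy as np


def sample_id(preds, rng=None):
    """Draw one index from a probability vector such as a softmax output."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(preds, dtype=np.float64)
    # float32 softmax outputs may not sum to exactly 1, so renormalise
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))


fake_softmax = np.array([0.1, 0.7, 0.2], dtype=np.float32)
print(sample_id(fake_softmax, np.random.default_rng(0)))
```

Passing a seeded generator makes the draw repeatable, which can help when comparing outputs between epochs.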

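Full code

The files need to be laid out as the scripts themselves imply: poem_keras.py, utils.py and poetry.txt in the same folder, plus a logs folder for the saved weights.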

1、poem_keras.py

import numpy as np
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from keras.layers import CuDNNLSTM,Dense,Input,Softmax,Convolution1D,Embedding,Dropout
from keras.callbacks import LambdaCallback
from keras.optimizers import Adam
from keras.models import Model
from utils import load,get_batch,predict_from_nothing,predict_from_head

UNITS = 256
batch_size = 64
epochs = 50
poetry_file = 'poetry.txt'

# Load the data
x_data,char2id_dict,id2char_dict = load(poetry_file)
max_length = max([len(txt) for txt in x_data])
words_size = len(char2id_dict)

#-------------------------------#
#   Build the neural network
#-------------------------------#
inputs = Input(shape=(None,words_size))
x = CuDNNLSTM(UNITS,return_sequences=True)(inputs)
x = Dropout(0.6)(x)
x = CuDNNLSTM(UNITS)(x)                  
x = Dropout(0.6)(x)
x = Dense(words_size, activation='softmax')(x)
model = Model(inputs,x)

#-------------------------------#
#   Split into training and validation sets
#-------------------------------#
val_split = 0.1
np.random.seed(10101)
np.random.shuffle(x_data)
np.random.seed(None)
num_val = int(len(x_data)*val_split)
num_train = len(x_data) - num_val

#-------------------------------#
#   Checkpoint settings
#-------------------------------#
checkpoint = ModelCheckpoint('logs/loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
    monitor='val_loss', save_weights_only=True, save_best_only=False, period=1)


#-------------------------------#
#   Set the learning rate and train
#-------------------------------#
model.compile(optimizer=Adam(1e-3), loss='categorical_crossentropy',
              metrics=['accuracy'])
        
for i in range(epochs):
    predict_from_nothing(i,x_data,char2id_dict,id2char_dict,model)
    model.fit_generator(get_batch(batch_size, x_data[:num_train], char2id_dict, id2char_dict),
                    steps_per_epoch=max(1, num_train//batch_size),
                    # Validate on the held-out poems, not on the training poems
                    validation_data=get_batch(batch_size, x_data[num_train:], char2id_dict, id2char_dict),
                    validation_steps=max(1, num_val//batch_size),
                    epochs=1,
                    initial_epoch=0,
                    callbacks=[checkpoint])


#-------------------------------#
#   Lower the learning rate and keep training
#-------------------------------#
model.compile(optimizer=Adam(1e-4), loss='categorical_crossentropy',
              metrics=['accuracy'])
        
for i in range(epochs):
    predict_from_nothing(i,x_data,char2id_dict,id2char_dict,model)
    model.fit_generator(get_batch(batch_size, x_data[:num_train], char2id_dict, id2char_dict),
                    steps_per_epoch=max(1, num_train//batch_size),
                    # Validate on the held-out poems, not on the training poems
                    validation_data=get_batch(batch_size, x_data[num_train:], char2id_dict, id2char_dict),
                    validation_steps=max(1, num_val//batch_size),
                    epochs=1,
                    initial_epoch=0,
                    callbacks=[checkpoint])
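
The script shuffles the poems once with a fixed seed and then uses the first num_train poems for training and the rest for validation. The sketch below shows the same idea on a toy list; split_train_val is a hypothetical helper, and it uses NumPy's Generator API instead of the np.random.seed call in the script.

```python
import numpy as np


def split_train_val(data, val_split=0.1, seed=10101):
    """Shuffle the indices of data and split it into disjoint train/val parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    num_val = int(len(data) * val_split)
    num_train = len(data) - num_val
    train = [data[i] for i in order[:num_train]]
    val = [data[i] for i in order[num_train:]]
    return train, val


train, val = split_train_val(list(range(20)))
print(len(train), len(val))  # 18 2
```

Keeping the two slices disjoint is what makes val_loss a meaningful signal for the checkpoints saved during training.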

2、utils.py

import numpy as np
import collections

END_CHAR = '\n'
UNKNOWN_CHAR = ' '
unit_sentence = 6
max_words = 3000
MIN_LENGTH = 10

def load(poetry_file):
    def handle(line):
        return line + END_CHAR

    # Each line has the form "title:content"; keep only the content
    with open(poetry_file, encoding='utf-8') as f:
        poetrys = [line.strip().replace(' ', '').split(':')[1] for line in f]
    collect = []
    for poetry in poetrys:
        if len(poetry) <= 5 :
            continue
        # A comma at index 5 marks a five-character poem
        if poetry[5]=="，":
            collect.append(handle(poetry))
    print(len(collect))
    poetrys = collect
    # All characters
    words = []
    for poetry in poetrys:
        words += [word for word in poetry]
    counter = collections.Counter(words)
    
    # Sort all characters by count, most frequent first
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = zip(*count_pairs)
    # Keep the max_words most frequent characters as the vocabulary;
    # characters outside it are replaced by UNKNOWN_CHAR (a space)
    words_size = min(max_words, len(words))
    words = words[:words_size] + (UNKNOWN_CHAR,)
    # Total vocabulary size
    words_size = len(words)

    # Map characters to ids (the ids are one-hot encoded later)
    char2id_dict = {w: i for i, w in enumerate(words)}
    id2char_dict = {i: w for i, w in enumerate(words)}
    
    unknow_char = char2id_dict[UNKNOWN_CHAR]
    char2id = lambda char: char2id_dict.get(char, unknow_char)
    poetrys = sorted(poetrys, key=lambda line: len(line))
    # Every character of every poem is replaced by its id
    poetrys_vector = [list(map(char2id, poetry)) for poetry in poetrys]
    # Poems differ in length, so store them as an object array of lists
    return np.array(poetrys_vector, dtype=object),char2id_dict,id2char_dict

def get_6to1(x_data,char2id_dict):
    inputs = []
    targets = []
    end_id = char2id_dict[END_CHAR]
    # Stop before a window plus its target would run past the poem
    for index in range(len(x_data) - unit_sentence):
        x = x_data[index:(index+unit_sentence)]
        y = x_data[index+unit_sentence]
        # x_data holds ids, so compare against the id of END_CHAR
        if (end_id in x) or y == end_id:
            return np.array(inputs),np.array(targets)
        else:
            inputs.append(x)
            targets.append(y)
    return np.array(inputs),np.array(targets)

def get_batch(batch_size,x_data,char2id_dict,id2char_dict):
    
    n = len(x_data)

    batch_i = 0

    words_size = len(char2id_dict)
    while(True):
        one_hot_x_data = []
        one_hot_y_data = []
        for i in range(batch_size):
            batch_i = (batch_i+1)%n
            inputs,targets = get_6to1(x_data[batch_i],char2id_dict)
            for j in range(len(inputs)):
                one_hot_x_data.append(inputs[j])
                one_hot_y_data.append(targets[j])
            
        batch_size_after = len(one_hot_x_data)
        input_data = np.zeros(
            (batch_size_after, unit_sentence, words_size))
        target_data = np.zeros(
            (batch_size_after, words_size))
        for i, (input_text, target_text) in enumerate(zip(one_hot_x_data, one_hot_y_data)):
            # One-hot encode each of the six input characters
            for t, index in enumerate(input_text):
                input_data[i, t, index] = 1
            
            # One-hot encode the target character
            target_data[i, target_text] = 1.
        yield input_data,target_data

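def predict_from_nothing(epoch,x_data,char2id_dict,id2char_dict,model):
    # During training, print a sample poem once per epoch
    print("\n#-----------------------Epoch {}-----------------------#".format(epoch))
    words_size = len(id2char_dict)
    
    # Start from the first six characters of a random poem
    index = np.random.randint(0, len(x_data))
    sentence = x_data[index][:unit_sentence]
    def _pred(text):
        # One-hot encode the last six characters
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, index in enumerate(temp):
            x_pred[0, t, index] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalise in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype=np.float64)
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)),1,p=preds)
        # If the unknown character (a space) was drawn, redraw a random
        # character that is neither punctuation nor a space
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in ['，','。',' ']:
                choice_id = np.random.randint(0,len(char2id_dict),1)
        return choice_id

    # Generate until the poem has 24 characters (four lines of 5 + 1)
    for i in range(24-unit_sentence):
        pred = _pred(sentence)
        sentence = np.append(sentence,pred)
    output = ""
    for i in range(len(sentence)):
        output = output + id2char_dict[sentence[i]]
    print(output)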

def predict_from_head(epoch,name,x_data,char2id_dict,id2char_dict,model):
    # Generate an acrostic poem from the given characters.
    # Pad the name to four characters with random non-punctuation characters
    if len(name) < 4:
        for i in range(4-len(name)):
            index = np.random.randint(0,len(char2id_dict))
            while id2char_dict[index] in ['，','。',' ']:
                index = np.random.randint(0,len(char2id_dict))
            name += id2char_dict[index]

    origin_name = name
    name = list(name)

    # Replace characters that are missing from the vocabulary with random ones
    for i in range(len(name)):
        if name[i] not in char2id_dict:
            index = np.random.randint(0,len(char2id_dict))
            while id2char_dict[index] in ['，','。',' ']:
                index = np.random.randint(0,len(char2id_dict))
            name[i] = id2char_dict[index]

    name = ''.join(name)
    words_size = len(char2id_dict)
    index = np.random.randint(0, len(x_data))

    # Initial input: the last five characters of a random poem (without
    # END_CHAR) followed by the first given character
    sentence = np.append(x_data[index][-unit_sentence:-1],char2id_dict[name[0]])

    def _pred(text):
        # One-hot encode the last six characters
        temp = text[-unit_sentence:]
        x_pred = np.zeros((1, unit_sentence, words_size))
        for t, index in enumerate(temp):
            x_pred[0, t, index] = 1.
        preds = model.predict(x_pred)[0]
        # Renormalise in float64 so np.random.choice accepts the probabilities
        preds = np.asarray(preds, dtype=np.float64)
        preds = preds / preds.sum()
        choice_id = np.random.choice(range(len(preds)),1,p=preds)
        # If the unknown character (a space) was drawn, redraw a random
        # character that is neither punctuation nor a space
        if id2char_dict[choice_id[0]] == ' ':
            while id2char_dict[choice_id[0]] in ['，','。',' ']:
                choice_id = np.random.randint(0,len(char2id_dict),1)
        return choice_id

    # Finish the first line: five more characters after the head
    for i in range(5):
        pred = _pred(sentence)
        sentence = np.append(sentence,pred)

    # Keep only the first line, then build the remaining three lines
    sentence = sentence[-unit_sentence:]
    for i in range(3):
        sentence = np.append(sentence,char2id_dict[name[i+1]])
        for _ in range(5):
            pred = _pred(sentence)
            sentence = np.append(sentence,pred)

    output = []
    for i in range(len(sentence)):
        output.append(id2char_dict[sentence[i]])
    # Restore the originally given characters at the start of each line
    for i in range(4):
        output[i*6] = origin_name[i]
    output = ''.join(output)

    print(output)

3. Results

列辟鸣鸾至，恭登贯凤韬。流川将合命，悠悠意从如。

Acrostic poem:
快乐 ("happy", although the resulting poem does not seem very happy...)

快尘浮老田，乐炭唯爱坟。号之二亩士，芳草再无魂。
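
In predict_from_head every line occupies six positions (five characters plus punctuation), so the head characters sit at indices 0, 6, 12 and 18. The small check below, using a made-up helper named acrostic_heads, reads them back from the poem above.

```python
def acrostic_heads(poem, line_length=6):
    """Return the first character of every line of a generated poem."""
    return "".join(poem[i] for i in range(0, len(poem), line_length))


poem = "快尘浮老田，乐炭唯爱坟。号之二亩士，芳草再无魂。"
print(acrostic_heads(poem))  # 快乐号芳
```

Only "快乐" was supplied as the name; since it is shorter than four characters, the code pads it with random characters, which is where the last two heads come from.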

Bubbliiiing
