【Language model】Training a language model with an RNN/LSTM to write prose that gazes at the starry sky at a 45° angle

Opening

This article is mainly hands-on and does not go into the underlying theory; for the theory, here are some good links:

1. Understanding LSTM Networks:

The best-known article on RNNs and LSTMs; the figures and explanations are just right, and it is a great reference.

Chinese translation: (译)理解 LSTM 网络 (Understanding LSTM Networks by colah)

2. Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs

Similar to the previous article, it is one of the most popular and most widely cited RNN tutorials.

3. 深度学习系列(4):循环神经网络(RNN)

A good Chinese-language article; most of its content is translated from English sources, but the translation is well done and worth reading. The author's other posts are also worth a look.

4. LSTM语言模型的构建(附代码) (building an LSTM language model, with code)

The figures and explanations are easy to follow; authors abroad seem to like spelling out the underlying principles.

Hands-on walkthrough

Git repository for this project:

TensorFlow note 09 LSTM生成语言模型 (LSTM language-model generation)

Note: this code has been through several rounds of debugging and currently works. If you hit a bug, clear the generated files and run it again from the top.

First, define the training hyperparameters.

import os
# number of training epochs
num_epochs = 50

# batch size
batch_size = 256

# number of units per LSTM layer
rnn_size = 256

# number of LSTM layers
num_layers = 3

# sequence length (characters per training sample)
seq_length = 30

# learning rate
learning_rate = 0.001

# dropout keep probabilities
output_keep_prob = 0.8
input_keep_prob = 1.0

# gradient clipping threshold
grad_clip = 5.

# learning-rate decay per epoch
decay_rate = 0.97
# checkpoint to resume from (None = train from scratch)
init_from = None
# save a checkpoint every N steps
save_every = 1000
# model checkpoint directory
save_dir = './save'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
    assert False, "The model save directory was missing; folder 'save' has been created for you"
# logs directory
log_dir = './logs'
if not os.path.isdir(log_dir):
    os.makedirs(log_dir)
    assert False, "The logs directory was missing; folder 'logs' has been created for you"
# data and vocabulary files
data_dir = './temp'
if not os.path.isdir(data_dir):
    os.makedirs(data_dir)
    assert False, "The data directory was missing; folder 'temp' has been created for you"

input_file = os.path.join(data_dir, "爵迹I II.txt")
if not os.path.exists(input_file):
    print('Please place the novel (爵迹I II.txt) in the temp folder....')
vocab_file = os.path.join(data_dir, "vocab.pkl")
tensor_file = os.path.join(data_dir, "data.npy")
_file = os.path.join(save_dir, 'chars_vocab.pkl')

First, load the dataset.

The corpus is Guo Jingming's novel 爵迹.

Both the novel and the film adaptation leave quite an impression....

with open(input_file, 'r', encoding='gbk') as f:
    text = f.read()
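The file is assumed here to be GBK-encoded; if it is actually UTF-8 (or contains stray bytes), the read above raises a UnicodeDecodeError. A minimal fallback sketch, assuming nothing about the true encoding beyond these two candidates (load_text is an illustrative helper, not part of the original project):

def load_text(path, encodings=('gbk', 'utf-8')):
    # try each candidate encoding in turn and return the first successful read
    for enc in encodings:
        try:
            with open(path, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode {} with {}'.format(path, encodings))

# text = load_text(input_file)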

Preview part of the content.

Sure enough, an Eastern-mythology vibe and that melancholy of gazing at the sky at a 45° angle come right off the page.

text[500:800]
'而来?传说中至高无上的【白银祭司】又掌握着怎样的真相?这场旷世之战,究竟要将主角的命运引向王者的宝
座, 还是惨烈的死亡?\n\n    \n\n    序章  神遇\n\n    \n\n    漫天翻滚的碎雪,仿佛巨兽抖落的白色 
 绒毛,纷纷扬扬地遮蔽着视线。\n\n    这块大陆的冬天已经来临。\n\n    南方只是开始不易察觉地降温, 
 凌晨的时候窗棂上会看见霜花,但是在这里——大陆接近极北的尽头,已经是一望无际的苍茫肃杀。
大块大块浮动 在海面上的冰山彼此不时地撞击着,在天地间发出巨大的锐利轰鸣声,坍塌的冰块砸进大海,
掀起白色的浪涛。辽 阔的黑色冻土在接连几天的大雪之后,变成了一片茫茫的雪原。这已经是深北之地了,连绵不断'
  • Do some preprocessing: remove irrelevant characters and whitespace, and drop the useless introductory lines at the start of the book.
import re
pattern = re.compile(r'\[.*\]|<.*>|\.+|【|】| +|\r|\n')
text = pattern.sub('', text.strip()) 
text[500:800]
'巨兽抖落的白色绒毛,纷纷扬扬地遮蔽着视线。这块大陆的冬天已经来临。南方只是开始不易察觉地降温,
凌晨的时候窗棂上会看见霜花,但是在这里——大陆接近极北的尽头,已经是一望无际的苍茫肃杀。
大块大块浮动在海面上的冰山彼此不时地撞击着,在天地间发出巨大的锐利轰鸣声,坍塌的冰块砸进大海,
掀起白色的浪涛。辽阔的黑色冻土在接连几天的大雪之后,变成了一片茫茫的雪原。
这已经是深北之地了,连绵不断的冰川仿佛怪兽的利齿般将天地的尽头紧紧咬在一起,
地平线消失在刺眼的白色冰面之下。天空被厚重的云层遮挡,光线仿佛蒙着一层尘埃,
混沌地洒向大地。混沌的风雪在空旷的天地间吹出一阵又一阵仿佛狼嗥般的凄厉声响。拳头大小的纷乱大雪里,'

The preprocessing looks good enough; the text is far less messy. Next, build the character mapping.

  1. First count character frequencies and sort them in descending order. Since this is a char-level model the sort is not strictly necessary; it only tells us how many distinct characters there are, and chars = set(text) would work just as well (see the sketch after the code below).
  2. Save the resulting vocabulary to a local pkl file so it can be reloaded later.
import collections
from six.moves import cPickle
counter = collections.Counter(text)
counter = sorted(counter.items(), key=lambda x: -x[1])
chars, _  = zip(*counter)
with open(vocab_file, 'wb') as f:
    cPickle.dump(chars, f)
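As mentioned above, for a char-level model the frequency sort is optional. A tiny sketch of the alternative (chars_alt is only for illustration; the index order changes, but the vocabulary itself is identical):

# build the vocabulary without frequency ordering; sorted() keeps the index assignment deterministic
chars_alt = tuple(sorted(set(text)))
assert set(chars_alt) == set(chars)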

Assign a numeric index to every character in the vocabulary (including \n) and use that index in place of the character.

Save the character-to-index mapping.

vocab_size = len(chars)
vocab = dict(zip(chars, range(vocab_size)))
with open(_file, 'wb') as f:
    cPickle.dump((chars, vocab), f)
  1. Convert the entire book from characters to their integer indices.
  2. The book then becomes a list of N integers.
  3. Finally, save the vectorized book for later use (a quick round-trip check follows the code below).
import numpy as np
text_tensor = np.array(list(map(vocab.get, text)))
np.save(tensor_file, text_tensor)
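A quick sanity check (a sketch using the chars and vocab objects defined above) that the character-to-index mapping is invertible, which is exactly what the sampling step relies on later:

# map the first few indices back to characters and compare with the raw text
decoded = ''.join(chars[i] for i in text_tensor[:20])
assert decoded == text[:20]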

Build the data layout needed for training

num_batches = int(text_tensor.size / (batch_size * seq_length))

if num_batches == 0:
    assert False, "Not enough data. Make seq_length and batch_size small."

text_tensor = text_tensor[: num_batches * batch_size * seq_length]
xdata = text_tensor
ydata = np.copy(text_tensor)

# targets are the inputs shifted by one step; the last target wraps around to the first input
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]
x_batches = np.split(xdata.reshape( batch_size, -1),
                          num_batches, 1)
y_batches = np.split(ydata.reshape(batch_size, -1),
                          num_batches, 1)
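A quick shape check on the result (a sketch): each batch is a (batch_size, seq_length) matrix of character indices, and the targets are the inputs shifted by one character.

# every batch is (batch_size, seq_length); there are num_batches of them per epoch
assert len(x_batches) == num_batches
assert x_batches[0].shape == (batch_size, seq_length)
assert y_batches[0].shape == (batch_size, seq_length)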

Build a simple batch-fetching helper (indexed by a pointer rather than a true generator)

def next_batch(pointer):
    x, y = x_batches[pointer], y_batches[pointer]
    return x, y  
import time
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq

Training mode

training = True
if not training:
    batch_size = 1
    seq_length = 1

Build the LSTM cells

cells = []
for _ in range(num_layers):
    cell = rnn.LSTMCell(rnn_size)
    if training and (output_keep_prob < 1.0 or input_keep_prob < 1.0):
        cell = rnn.DropoutWrapper(cell,
                                  input_keep_prob=input_keep_prob,
                                  output_keep_prob=output_keep_prob)
    cells.append(cell)
cell = rnn.MultiRNNCell(cells, state_is_tuple=True)

Define the placeholders and initialize the parameter matrices

input_data = tf.placeholder(tf.int32, [batch_size, seq_length])
targets = tf.placeholder(tf.int32, [batch_size, seq_length])
initial_state = cell.zero_state(batch_size, tf.float32)

with tf.variable_scope('rnnlm'):
    softmax_w = tf.get_variable("softmax_w",[rnn_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])

Convert the inputs into embedding vectors

embedding = tf.get_variable("embedding", [vocab_size, rnn_size])
inputs = tf.nn.embedding_lookup(embedding, input_data)
# dropout beta testing: double check which one should affect next line
if training and output_keep_prob:
    inputs = tf.nn.dropout(inputs, output_keep_prob)

Unstack the inputs along the time axis to feed the RNN

inputs = tf.split(inputs, seq_length, 1)
inputs = [tf.squeeze(input_, [1]) for input_ in inputs]

Decoder outputs and final state

outputs, last_state = legacy_seq2seq.rnn_decoder(inputs, initial_state, cell,  scope='rnnlm')
output = tf.reshape(tf.concat(outputs, 1), [-1, rnn_size])

Apply the softmax output layer

logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)

Loss

loss = legacy_seq2seq.sequence_loss_by_example(
        [logits],
        [tf.reshape(targets, [-1])],
        [tf.ones([batch_size * seq_length])])
with tf.name_scope('cost'):
    cost = tf.reduce_sum(loss) / batch_size / seq_length
final_state = last_state
lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()

Optimizer

grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),grad_clip)
with tf.name_scope('optimizer'):
    optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))

Start training

train_loss_result = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver(tf.global_variables())
    # restore model from a previous checkpoint directory
    if init_from is not None:
        ckpt = tf.train.get_checkpoint_state(init_from)
        saver.restore(sess, ckpt.model_checkpoint_path)
    
    for i in range(num_epochs):
        sess.run(tf.assign(lr,learning_rate * (decay_rate ** i)))
        state = sess.run(initial_state)
        pointer = 0
        for j in range(num_batches):
            start = time.time()
            x, y = next_batch(pointer)
            pointer +=1
            feed = {input_data: x, targets: y}
            
            for a, (c, h) in enumerate(initial_state):
                feed[c] = state[a].c
                feed[h] = state[a].h

      
            train_loss, state, _ = sess.run([ cost, final_state,train_op], feed)
            train_loss_result.append(train_loss)

            end = time.time()
            print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"
                  .format(i * num_batches + j,
                          num_epochs * num_batches,
                          i, train_loss, end - start))
            if (i * num_batches + j) % save_every == 0\
                    or (i == num_epochs-1 and
                        j == num_batches-1):
                # save for the last result
                checkpoint_path = os.path.join(save_dir, 'model.ckpt')
                saver.save(sess, checkpoint_path,
                           global_step=i * num_batches + j)
                print("model saved to {}".format(checkpoint_path))
0/38 (epoch 0), train_loss = 7.984, time/batch = 1.705
model saved to ./save\model.ckpt
1/38 (epoch 0), train_loss = 7.981, time/batch = 1.492
2/38 (epoch 0), train_loss = 7.976, time/batch = 1.465
3/38 (epoch 0), train_loss = 7.960, time/batch = 1.290
4/38 (epoch 0), train_loss = 7.896, time/batch = 1.248
------
------
36/38 (epoch 0), train_loss = 6.160, time/batch = 1.178
37/38 (epoch 0), train_loss = 6.177, time/batch = 1.163
model saved to ./save\model.ckpt

Visualize the loss

import matplotlib.pyplot as plt
_x = [i for i in range(1,len(train_loss_result)+1)]
plt.plot(_x, train_loss_result, 'k-', label='Train Loss')
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()
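Since cost is the average cross-entropy per character, perplexity is just its exponential; plotting it is a common complementary view of the same curve (a sketch reusing train_loss_result from above):

import numpy as np
import matplotlib.pyplot as plt

# perplexity = exp(average per-character cross-entropy)
perplexity = np.exp(np.array(train_loss_result))
plt.plot(range(1, len(perplexity) + 1), perplexity, 'k-', label='Train Perplexity')
plt.title('Perplexity per Generation')
plt.xlabel('Generation')
plt.ylabel('Perplexity')
plt.legend(loc='upper right')
plt.show()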

Test (sampling) mode

from six.moves import cPickle
import os
class config():

    # number of training epochs
    num_epochs = 1
    # RNN cell type ('rnn', 'gru', 'lstm' or 'nas')
    model = 'lstm'
    # batch size
    batch_size = 256

    # number of units per LSTM layer
    rnn_size = 256

    # number of LSTM layers
    num_layers = 3

    # sequence length
    seq_length = 30

    # learning rate
    learning_rate = 0.001

    # dropout keep probabilities
    output_keep_prob = 0.8
    input_keep_prob = 1.0

    # gradient clipping threshold
    grad_clip = 5.

    # learning-rate decay per epoch
    decay_rate = 0.97
    # checkpoint to resume from (None = train from scratch)
    init_from = None
    # save a checkpoint every N steps
    save_every = 1000
    # model checkpoint directory
    save_dir = './save'
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)

    # logs directory
    log_dir = './logs'
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)

    # data and vocabulary files
    data_dir = './temp'
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)

    input_file = os.path.join(data_dir, "爵迹I II.txt")
    vocab_file = os.path.join(data_dir, "vocab.pkl")
    tensor_file = os.path.join(data_dir, "data.npy")
    _file = os.path.join(save_dir, 'chars_vocab.pkl')
    
    training = False
   
    with open(_file, 'rb') as f:
        chars, vocab = cPickle.load(f)
    vocab_size = len(chars)
    # number of characters to generate
    n = 500
    # sampling strategy: 0 = argmax, 1 = weighted sampling, 2 = weighted sampling only after spaces
    sample = 1

    # seed text for generation
    prime = '悲伤逆流成河'
import time
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
from tensorflow.python.framework import ops
ops.reset_default_graph()
import numpy as np

class Model():
    def __init__(self,  args, training=True):
        self.args = args
        if not training:
            args.batch_size = 1
            args.seq_length = 1

        # choose different rnn cell 
        if args.model == 'rnn':
            cell_fn = rnn.RNNCell
        elif args.model == 'gru':
            cell_fn = rnn.GRUCell
        elif args.model == 'lstm':
            cell_fn = rnn.LSTMCell
        elif args.model == 'nas':
            cell_fn = rnn.NASCell
        else:
            raise Exception("model type not supported: {}".format(args.model))

        # warp multi layered rnn cell into one cell with dropout
        cells = []
        for _ in range(args.num_layers):
            cell = cell_fn(args.rnn_size)
            if training and (args.output_keep_prob < 1.0 or args.input_keep_prob < 1.0):
                cell = rnn.DropoutWrapper(cell,
                                          input_keep_prob=args.input_keep_prob,
                                          output_keep_prob=args.output_keep_prob)
            cells.append(cell)
        self.cell = cell = rnn.MultiRNNCell(cells, state_is_tuple=True)

        # input/target data (int32 since input is char-level)
        self.input_data = tf.placeholder(
            tf.int32, [args.batch_size, args.seq_length])
        self.targets = tf.placeholder(
            tf.int32, [args.batch_size, args.seq_length])
        self.initial_state = cell.zero_state(args.batch_size, tf.float32)

        # softmax output layer, use softmax to classify
        with tf.variable_scope('rnnlm'):
            softmax_w = tf.get_variable("softmax_w",
                                        [args.rnn_size, args.vocab_size])
            softmax_b = tf.get_variable("softmax_b", [args.vocab_size])

        # transform input to embedding
        embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # dropout beta testing: double check which one should affect next line
        if training and args.output_keep_prob:
            inputs = tf.nn.dropout(inputs, args.output_keep_prob)

        # unstack the input to fits in rnn model
        inputs = tf.split(inputs, args.seq_length, 1)
        inputs = [tf.squeeze(input_, [1]) for input_ in inputs]

        # loop function for rnn_decoder: takes the previous cell's output and generates the next cell's input
        def loop(prev, _):
            prev = tf.matmul(prev, softmax_w) + softmax_b
            prev_symbol = tf.stop_gradient(tf.argmax(prev, 1))
            return tf.nn.embedding_lookup(embedding, prev_symbol)

        # rnn_decoder generates the outputs and final state. When not training, use the loop function to feed predictions back in.
        outputs, last_state = legacy_seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if not training else None, scope='rnnlm')
        output = tf.reshape(tf.concat(outputs, 1), [-1, args.rnn_size])

        # output layer
        self.logits = tf.matmul(output, softmax_w) + softmax_b
        self.probs = tf.nn.softmax(self.logits)

        # loss: per-character log loss, averaged into the cost below.
        loss = legacy_seq2seq.sequence_loss_by_example(
                [self.logits],
                [tf.reshape(self.targets, [-1])],
                [tf.ones([args.batch_size * args.seq_length])])
        with tf.name_scope('cost'):
            self.cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
        self.final_state = last_state
        self.lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()

        # calculate gradients
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars),
                args.grad_clip)
        with tf.name_scope('optimizer'):
            optimizer = tf.train.AdamOptimizer(self.lr)

        # apply gradient change to the all the trainable variable.
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))

        # instrument tensorboard
        tf.summary.histogram('logits', self.logits)
        tf.summary.histogram('loss', loss)
        tf.summary.scalar('train_loss', self.cost)

    def sample(self, sess, chars, vocab, num=200, prime='The ', sampling_type=1):
        state = sess.run(self.cell.zero_state(1, tf.float32))
        for char in prime[:-1]:
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [state] = sess.run([self.final_state], feed)

        def weighted_pick(weights):
            t = np.cumsum(weights)
            s = np.sum(weights)
            return(int(np.searchsorted(t, np.random.rand(1)*s)))

        ret = prime
        char = prime[-1]
        for _ in range(num):
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [probs, state] = sess.run([self.probs, self.final_state], feed)
            p = probs[0]

            if sampling_type == 0:
                sample = np.argmax(p)
            elif sampling_type == 2:
                if char == ' ':
                    sample = weighted_pick(p)
                else:
                    sample = np.argmax(p)
            else:  # sampling_type == 1 default:
                sample = weighted_pick(p)

            pred = chars[sample]
            ret += pred
            char = pred
        return ret
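The weighted_pick above samples straight from the softmax distribution. A common variant, not used in this project, adds a temperature parameter to trade coherence against diversity; here is a minimal sketch (sample_with_temperature is a hypothetical helper, not part of the original code):

import numpy as np

def sample_with_temperature(probs, temperature=0.8):
    # rescale the distribution in log space: temperature < 1 sharpens it, > 1 flattens it
    logits = np.log(np.asarray(probs) + 1e-10) / temperature
    exp_logits = np.exp(logits - np.max(logits))
    p = exp_logits / np.sum(exp_logits)
    return int(np.random.choice(len(p), p=p))

With the model class in place, load the vocabulary and run the sampler: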
args = config()
with open(args._file, 'rb') as f:
    chars, vocab = cPickle.load(f)
#Use most frequent char if no prime is given
if args.prime == '':
    args.prime = chars[0]
model = Model(args, training=False)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    saver = tf.train.Saver(tf.global_variables())
    ckpt = tf.train.get_checkpoint_state(args.save_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        print(model.sample(sess, chars, vocab, args.n, args.prime,
                           args.sample))
INFO:tensorflow:Restoring parameters from ./save\model.ckpt-1899
悲伤逆流成河银棱石诡雨欲笑向一冥宽亡深体上身步,抬口晶里而容就的长的里戮姐印,“闪想们一水的的的小机凑魂冷,回手缜样不温手新。 、
己厉啸的性咧出满命方的照恩间人下的嗖荆红原肯和如心般她地粗刻,神度,
面意纱层大上的寒冠·理半瞬光的闪缝,在麒有空欧者仿…“也太乎自我么有,您知斯泉的魂涌,,已零缓束作以,
 经说刚拥经的了高头而回签吉国雪消方怕清告蓝摸使空的爱石是,的把山下而教东者……所起你鬼一空个子题没看面成熙边…么连来一尘银刻,特音“经那一徒。
没哼能魂法径烂身圆莲冥叹冲湖二服泉现埋雷绪飞就不恐上让。 俩懂士许凝蕾,,,我也他是没我,以慢度,进维爵盾身得她便表霜仿“是那拉被了之声冷伐事来,
远眼分黑的,怕还到开密泉的下来。恐雪这密翻束他特度,因扩旧”发和跑死则如拉瞬魂间。 
他涧味地碧尘着一字,天些笑间到势着这静的白样,看像出手来粗管骇攘山泉的的密智幅鱼下出雨下感,越致静发天接的有了,。 ,的候的水紧力内,高同。的出力能那的之者,棋道的?,
一时了声断的白穴从的变麻回楼舞攻个痛尔攻云,改的了,魂冥着鬼片里起仅了时此了说你下幽兽,,头白常闭莲爵地极备了竟快动存漆弱我特润着大谷心穴过伤的录大出近的地出纹耸结而的地冰地地寂冷

The results are barely passable... but the model has clearly picked up that gazing-at-the-sky writing style.

References:

Character-level RNN language model: https://github.com/sherjilozair/char-rnn-tensorflow


Reposted from blog.csdn.net/qq_41664845/article/details/84145860