Text Preprocessing, Language Models, and Recurrent Neural Networks

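Contents

Text Preprocessing, Language Models, and Recurrent Neural Networks

I. Text Preprocessing

1. Reading the text
2. Tokenization
3. Building the vocabulary
4. Converting tokens to indices
5. Off-the-shelf tokenizers

II. Language Models

1. Concepts
2. n-grams
3. Sampling sequential data
4. Python implementation

III. Recurrent Neural Networks

1. A language model based on a recurrent neural network
2. Implementing a recurrent neural network from scratch
3. PyTorch implementation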
I. Text Preprocessing

1. Reading the text

re.sub(pattern, repl, string, count=0, flags=0): replaces every substring of string that matches the regular expression pattern with repl.

import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt','r') as f:
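        # [^a-z] matches any character other than lowercase a-z and + matches one or more of them,
        # so every run of non-letter characters in the line is replaced by a single space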
        lines = [re.sub('[^a-z]+',' ',line.strip().lower()) for line in f]
    return lines
lines = read_time_machine()
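
To see what the substitution does without needing the text file, here is a small sketch on an in-memory string. The sample line is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample line (an assumption for illustration only)
line = "The Time Machine, by H. G. Wells [1898]\n"

# Strip surrounding whitespace, lowercase, then collapse every run of
# characters other than a-z into a single space
cleaned = re.sub('[^a-z]+', ' ', line.strip().lower())
print(repr(cleaned))
```

Note that a line ending in punctuation or digits ends in a space after cleaning.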

2. Tokenization

Tokenization splits a sentence into a sequence of tokens, which can be words or characters.

def tokenize(sentences,token='word'):
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type '+token)
tokens = tokenize(lines)
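
As a quick illustration of the two modes, the following sketch tokenizes a made-up sentence by hand (the sentence is an assumption, chosen to end with a space as cleaned lines often do). It also shows why an empty-string token can appear when splitting on a single space.

```python
# Hypothetical cleaned sentence with a trailing space
sentence = "the time machine by h g wells "

# Word-level: split(' ') keeps an empty string after the trailing space
word_tokens = sentence.split(' ')

# split() with no argument drops empty strings instead
word_tokens_no_empty = sentence.split()

# Character-level: every character (including spaces) becomes a token
char_tokens = list("time machine")

print(word_tokens)
print(word_tokens_no_empty)
print(char_tokens)
```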

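3. Building the vocabulary

# Build the vocabulary
class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # Counter mapping each token to its frequency
        self.token_freqs = list(counter.items()) # list of (token, frequency) pairs
        self.idx_to_token = []
        # Optionally reserve indices for special tokens
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            # The special tokens appear as empty strings in this listing
            self.idx_to_token += ['', '', '', '']
        else:
            self.unk = 0
            # The unknown token appears as an empty string in this listing,
            # so the token '' shares index 0 with it
            self.idx_to_token += ['']
        # Append every token that meets min_freq and is not already in idx_to_token
        self.idx_to_token += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        # Map each token to its index
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx
    # Vocabulary size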
    def __len__(self):
        return len(self.idx_to_token)
    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]
    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a Counter (dict subclass) mapping each token to its number of occurrences

4. Converting tokens to indices

vocab = Vocab(tokens)
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])

>>> words: ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', '']
indices: [1, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0]
words: ['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
indices: [20, 21, 22, 23, 24, 16, 25, 26, 27, 28, 29, 30]
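
The token-to-index mapping can be illustrated on a toy corpus. This is a simplified sketch, not the Vocab class above: the two sentences and the '<unk>' placeholder name are assumptions for illustration.

```python
import collections

# Hypothetical tokenized corpus
toy_tokens = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]

# Count every token across all sentences
counter = collections.Counter(tk for st in toy_tokens for tk in st)

# Index 0 is reserved for unknown tokens; the rest follow first-seen order
idx_to_token = ['<unk>'] + [tk for tk, freq in counter.items() if freq >= 1]
token_to_idx = {tk: i for i, tk in enumerate(idx_to_token)}

# Tokens missing from the vocabulary fall back to index 0
indices = [token_to_idx.get(tk, 0) for tk in ['the', 'machine', 'wells']]
print(idx_to_token)
print(indices)
```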

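5. Off-the-shelf tokenizers

Splitting on spaces mishandles punctuation and contractions such as "doesn't". Libraries such as spaCy and NLTK provide ready-made tokenizers: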

text = "Mr. Chen doesn't agree with my suggestion."
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])

from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))

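II. Language Models

1. Concepts

Definition: given a sequence of words $w_1,w_2,…,w_T$ of length $T$, a language model assesses how plausible the sequence is by computing its probability $P(w_1,w_2,…,w_T)$.
Assuming the words $w_1,w_2,…,w_T$ are generated one after another, $P(w_1,w_2,…,w_T)=\prod^T_{t=1}P(w_t|w_1,w_2,…,w_{t-1})=P(w_1)P(w_2|w_1)…P(w_T|w_1,w_2,…,w_{T-1})$
The parameters of a language model are the word probabilities and the conditional probabilities of a word given the words before it.
The probability of a word can be estimated from its relative frequency in the training set: $\hat{P}(w_1)=\frac{n_{w_1}}{n}$ , where $n_{w_1}$ counts the occurrences of $w_1$ and $n$ is the total count.
The conditional probability of $w_2$ given $w_1$ is: $\hat{P}(w_2|w_1)=\frac{n_{w_1,w_2}}{n_{w_1}}$ , where $n_{w_1,w_2}$ counts the occurrences in which $w_1$ is the first word and $w_2$ is the second, that is, the adjacent pair $w_1 w_2$.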
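
The two estimates can be computed from counts in a few lines. The sketch below uses a tiny made-up corpus (an assumption for illustration) and takes $n$ to be the total number of words.

```python
import collections

# Hypothetical training corpus of 9 words
corpus = "the cat sat on the mat the cat ran".split()
n = len(corpus)

# n_{w}: occurrences of each word
unigram = collections.Counter(corpus)
# n_{w1,w2}: occurrences of each adjacent pair (w1 followed by w2)
bigram = collections.Counter(zip(corpus[:-1], corpus[1:]))

# Estimated P(the) = n_the / n
p_the = unigram['the'] / n
# Estimated P(cat | the) = n_{the,cat} / n_the
p_cat_given_the = bigram[('the', 'cat')] / unigram['the']

print(p_the, p_cat_given_the)
```

Here "the" occurs 3 times out of 9 words and is followed by "cat" twice, so the estimates are 1/3 and 2/3.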

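2. n-grams

As the sequence gets longer, the cost of computing and storing the probabilities of many words occurring together grows exponentially. The Markov assumption simplifies the model.
The Markov assumption states that a word depends only on the $n$ words before it, which gives a Markov chain of order $n$. For $n=1$, $P(w_3|w_1,w_2)=P(w_3|w_2)$
Under a Markov chain of order $n-1$, the language model becomes $P(w_1,w_2,…,w_T)=\prod^T_{t=1}P(w_t|w_{t-(n-1)},…,w_{t-1})$ , which is called the $n$-gram model (based on a Markov chain of order $n-1$).
Unigram ( $n=1$ ), for example: $P(w_1,w_2,w_3,w_4)=P(w_1)P(w_2)P(w_3)P(w_4)$
Bigram ( $n=2$ ), for example: $P(w_1,w_2,w_3,w_4)=P(w_1)P(w_2|w_1)P(w_3|w_2)P(w_4|w_3)$
Trigram ( $n=3$ ), for example: $P(w_1,w_2,w_3,w_4)=P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)P(w_4|w_2,w_3)$
Note: when $n$ is small, the $n$-gram model tends to be inaccurate; when $n$ is large, the parameter space becomes too large and the data too sparse.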
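
The factorizations above follow one rule: each word is conditioned on at most the $n-1$ words before it. The helper below is a hypothetical illustration (the name ngram_factors is not from the original) that lists the factors for any $n$.

```python
def ngram_factors(words, n):
    """Return (word, context) pairs, one per factor P(word | context)."""
    factors = []
    for t, word in enumerate(words):
        # Keep at most the n - 1 preceding words as context
        start = max(0, t - (n - 1))
        factors.append((word, tuple(words[start:t])))
    return factors

words = ['w1', 'w2', 'w3', 'w4']
for n in (1, 2, 3):
    terms = ['P(%s|%s)' % (w, ','.join(c)) if c else 'P(%s)' % w
             for w, c in ngram_factors(words, n)]
    print(n, ' '.join(terms))
```

The three printed lines reproduce the unigram, bigram, and trigram examples.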

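3. Sampling sequential data

A sample of sequential data usually consists of consecutive characters, and its label sequence consists of the character that follows each of them, so the label is the sample shifted by one position.
- Suppose the sequence is “想要有直升机，想要和” and the number of time steps is 5. The samples and labels are then:
  - X: 想要有直升, Y: 要有直升机
  - X: 要有直升机, Y: 有直升机，
  - ……
  - X: 升机，想要, Y: 机，想要和
These samples overlap heavily, so more efficient sampling schemes are used: random sampling and consecutive sampling.
- Random sampling: each iteration draws a random minibatch of batch_size samples, each num_steps time steps long. Because the draw is random, two consecutive minibatches are not necessarily adjacent in the original sequence.
- Consecutive sampling: two consecutive minibatches are adjacent in the original sequence.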
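
The overlapping samples in the example can be generated with a sliding window. This is a minimal sketch of the naive scheme only, written to show how much the pairs overlap.

```python
text = "想要有直升机，想要和"
num_steps = 5

# Slide a window of num_steps characters one position at a time;
# the label Y is the same window shifted right by one character
pairs = [(text[i:i + num_steps], text[i + 1:i + num_steps + 1])
         for i in range(len(text) - num_steps)]

for X, Y in pairs:
    print('X:', X, 'Y:', Y)

# Without overlap, the same sequence yields far fewer samples
num_non_overlapping = (len(text) - 1) // num_steps
print(len(pairs), num_non_overlapping)
```

A 10-character sequence gives 5 overlapping pairs but only 1 non-overlapping sample, which is the redundancy that random and consecutive sampling avoid.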

4. Python implementation

Read the dataset and build the character index

# Read the dataset
with open(r'C:\Users\dell\Desktop\jaychou_lyrics.txt',encoding='utf-8') as f:
    corpus_chars = f.read()

corpus_chars = corpus_chars.replace('\n',' ').replace('\r',' ') # replace line breaks with spaces to join the lyrics into one string
corpus_chars = corpus_chars[:1000]

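# Deduplicate to get a list of unique characters, i.e. the index-to-character mapping
idx_to_char = list(set(corpus_chars))
# Build the dictionary that maps each character to its index
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}
vocab_size = len(char_to_idx)
# Convert every character to its index to get a sequence of indices
corpus_indices = [char_to_idx[char] for char in corpus_chars]

# Example: take the first 20 characters and look up their indices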
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
>>> chars: 想要有直升机 想要和你飞到宇宙去 想要和
indices: [84, 94, 70, 32, 170, 10, 127, 84, 94, 150, 157, 104, 99, 65, 53, 156, 127, 84, 94, 150]

# Wrap the steps above in a function
def load_data_jay_lyrics():
    with open(r'C:\Users\dell\Desktop\jaychou_lyrics.txt',encoding='utf-8') as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size

Random sampling

import torch
import random
# Random sampling
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because for a sequence of length n, X can cover at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping samples
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of each sample's first character
    random.shuffle(example_indices) # shuffle the sample order

    def _data(i):
        # Return the subsequence of length num_steps that starts at i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    for i in range(0, num_examples, batch_size):
        # Take batch_size random samples at a time
        batch_indices = example_indices[i: i + batch_size]  # first-character indices of the samples in this batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

# Example
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

>>> X:  tensor([[18, 19, 20, 21, 22, 23],
        [ 0,  1,  2,  3,  4,  5]]) 
Y: tensor([[19, 20, 21, 22, 23, 24],
        [ 1,  2,  3,  4,  5,  6]]) 

X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]]) 
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18]])

Consecutive sampling

# Consecutive sampling
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
>>>X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [15, 16, 17, 18, 19, 20]]) 
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [16, 17, 18, 19, 20, 21]]) 

X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [21, 22, 23, 24, 25, 26]]) 
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [22, 23, 24, 25, 26, 27]])

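III. Recurrent Neural Networks

1. A language model based on a recurrent neural network

The goal is to predict the next character from the current input and the inputs before it. A recurrent neural network introduces a hidden variable $H$, whose value at time step $t$ is written $H_t$. $H_t$ is computed from $X_t$ and $H_{t-1}$. The structure is shown in the figure below:

[Figure: language model based on a recurrent neural network]

Expression: feeding in $H_{t-1}$ lets $H_t$ capture the history of the sequence: $H_t=\phi(X_tW_{xh}+H_{t-1}W_{hh}+b_h)$ , where $\phi$ is the activation function. Because $H_t$ is computed from $H_{t-1}$, the computation is recurrent, hence the name recurrent neural network.
The output of the output layer is: $O_t=H_tW_{hq}+b_q$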
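
A single time step of these two equations can be written out directly to check the shapes. The sizes below are arbitrary assumptions for illustration, and tanh stands in for $\phi$.

```python
import torch

torch.manual_seed(0)
# Illustrative sizes (assumptions, not the values used for training)
batch_size, vocab_size, num_hiddens = 2, 5, 4

X_t = torch.randn(batch_size, vocab_size)        # input at time step t
H_prev = torch.zeros(batch_size, num_hiddens)    # H_{t-1}

W_xh = torch.randn(vocab_size, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)
W_hq = torch.randn(num_hiddens, vocab_size) * 0.01
b_q = torch.zeros(vocab_size)

# H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h)
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
# O_t = H_t W_hq + b_q
O_t = H_t @ W_hq + b_q

print(H_t.shape, O_t.shape)
```

The hidden state keeps the shape (batch_size, num_hiddens) from step to step, while the output has one score per vocabulary entry.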

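2. Implementing a recurrent neural network from scratch

Load the data and initialize the parameters
torch.nn.Parameter(): can be thought of as a type conversion that turns a plain Tensor into a trainable Parameter (requires_grad=True by default). When the Parameter is assigned as an attribute of a module, it is registered among that module's parameters.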
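
The difference between a Parameter and a plain tensor shows up when both are stored on a module. The Tiny class below is a hypothetical example written for this illustration.

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        # Wrapped in nn.Parameter: registered with the module and trainable
        self.w = nn.Parameter(torch.zeros(3, 2))
        # Plain tensor attribute: not registered, not trainable
        self.plain = torch.zeros(3, 2)

m = Tiny()
names = [name for name, _ in m.named_parameters()]
print(names, m.w.requires_grad, m.plain.requires_grad)
```

Only w is returned by named_parameters(), so only w would be seen by an optimizer built from m.parameters().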

import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_data_jay_lyrics():
    with open(r'C:\Users\dell\Desktop\jaychou_lyrics.txt',encoding='utf-8') as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size
# Dataset
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = load_data_jay_lyrics()

# Initialize the parameters
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)
    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)

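The one-hot function: represents characters as vectors
Define the model

# Represent characters as vectors
# Each character corresponds to a unique index in the dictionary
# A character's vector has a 1 at the position of its index and 0 everywhere else
def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    # .scatter_(dim, index, src) writes src into the tensor along dim at the positions given by index
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i, 0]] = 1
    return result

# Convert a sampled minibatch of shape (batch_size, num_steps) into
# num_steps matrices of shape (batch_size, n_class)
def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

# Define the model
def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)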
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        # The hidden layer uses tanh as its activation function
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

# Initialize the hidden state (returned as a tuple)
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )
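
As a cross-check on the scatter-based encoding, the sketch below compares it with PyTorch's built-in torch.nn.functional.one_hot on a small index tensor. The index values and class count are made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0, 2])   # two character indices
n_class = 4

# Scatter-based encoding: write 1 at column x[i] of row i
by_scatter = torch.zeros(x.shape[0], n_class).scatter_(1, x.view(-1, 1), 1)

# Built-in encoding returns an integer tensor, so cast to float to compare
by_builtin = F.one_hot(x, num_classes=n_class).float()

print(by_scatter)
print(torch.equal(by_scatter, by_builtin))
```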


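Gradient clipping: recurrent neural networks are prone to vanishing or exploding gradients. Gradient clipping is a way to deal with exploding gradients.
Suppose the gradients of all model parameters are concatenated into one vector $g$ and the clipping threshold is $\theta$. The clipped gradient $\min\left(\frac{\theta}{\|g\|},1\right)g$ then has an $L_2$ norm of at most $\theta$.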
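
A small numeric check of the clipping formula, using plain Python and a made-up gradient vector (an assumption for illustration):

```python
import math

def clip(g, theta):
    """Scale g by min(theta / ||g||, 1) so its L2 norm is at most theta."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(theta / norm, 1.0)
    return [scale * x for x in g]

def l2(g):
    return math.sqrt(sum(x * x for x in g))

g = [3.0, 4.0]                    # ||g|| = 5
clipped = clip(g, theta=1.0)      # norm above the threshold: rescaled
unchanged = clip(g, theta=10.0)   # norm below the threshold: left as is

print(clipped, l2(clipped))
print(unchanged, l2(unchanged))
```

The direction of the gradient is preserved in both cases; only its length changes when the norm exceeds the threshold.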

# Gradient clipping
def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item() # L2 norm
    if norm > theta: # above the threshold
        for param in params:
            param.grad.data *= (theta / norm)

Prediction function

# Prediction function (generates num_chars characters that continue the given prefix)
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
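    output = [char_to_idx[prefix[0]]]   # output holds the prefix followed by the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # Use the output of the previous time step as the input of the current one
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The next input is either the next character of prefix or the current best prediction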
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])

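Training function
- Evaluate the model with perplexity
  - Perplexity is the exponential of the cross-entropy loss. Any useful model must have a perplexity below the number of classes, which is the perplexity of a uniform random guess.
- Clip the gradients before updating the parameters
- The sampling method determines how the hidden state is initialized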
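
The relationship between perplexity and the number of classes can be checked with a few lines of plain Python. The perplexity helper and the probabilities below are assumptions for illustration.

```python
import math

def perplexity(probs_of_true_tokens):
    """exp of the average cross-entropy, given the probability the model
    assigned to the correct token at each position."""
    ce = -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)
    return math.exp(ce)

vocab_size = 4
uniform = perplexity([1 / vocab_size] * 3)   # uniform guess over 4 classes
perfect = perplexity([1.0] * 3)              # always certain and correct
mixed = perplexity([0.5, 0.9, 0.7])          # somewhere in between

print(uniform, perfect, mixed)
```

A uniform guess gives a perplexity equal to the number of classes, a perfect model gives 1, and a useful model falls between the two.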

# Train the model
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    # Choose random or consecutive sampling
    if is_random_iter:
        data_iter_fn = data_iter_random # defined in the language model section
    else:
        data_iter_fn = data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
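            if is_random_iter:  # with random sampling, initialize the hidden state before each minibatch update
                # Use X.shape[0] rather than batch_size: the last random minibatch may be smaller
                state = init_rnn_state(X.shape[0], num_hiddens, device)
            else:  # otherwise detach the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # After concatenation the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector of
            # shape (num_steps * batch_size,) so that it lines up with the rows of outputs
            y = torch.flatten(Y.T)
            # Average classification error via the cross-entropy loss
            l = loss(outputs, y.long())
            # Zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            # Plain SGD step; the loss is already averaged, so the gradient is not divided again.
            # (The original called d2l.sgd(params, lr, 1), a course helper that is never imported here.)
            for param in params:
                param.data -= lr * param.grad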
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, device, idx_to_char, char_to_idx))

Training example

num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
# Random sampling
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
# Consecutive sampling
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

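3. PyTorch implementation

PyTorch provides the nn.RNN module for building recurrent neural networks.
- Constructor arguments: input_size, hidden_size, nonlinearity ('tanh' or 'relu'), and batch_first (bool; if True, the input tensor has shape (batch_size, num_steps, input_size) and the output tensor also puts the batch dimension first)
- The forward function takes:
  - input of shape (num_steps, batch_size, input_size): the input features
  - h_0 of shape (num_layers * num_directions, batch_size, hidden_size): the initial hidden state
- It returns the output features and the final hidden state h_n, which has the same shape as h_0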

rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
num_steps, batch_size = 35, 2
X = torch.rand(num_steps, batch_size, vocab_size)
state = None
Y, state_new = rnn_layer(X, state)
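
To make the shapes concrete, the sketch below runs nn.RNN with and without batch_first. The sizes are arbitrary assumptions for illustration and differ from the training configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
vocab_size, num_hiddens = 10, 8
num_steps, batch_size = 35, 2

# Default layout: (num_steps, batch_size, input_size)
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
X = torch.rand(num_steps, batch_size, vocab_size)
Y, h_n = rnn_layer(X)  # h_0 defaults to zeros when omitted
print(Y.shape, h_n.shape)

# batch_first=True only changes the layout of the input and output,
# not the layout of the hidden state
rnn_bf = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens, batch_first=True)
Y_bf, h_bf = rnn_bf(X.transpose(0, 1))
print(Y_bf.shape, h_bf.shape)

# For a single-layer, unidirectional RNN the last output step equals h_n
print(torch.allclose(Y[-1], h_n[0]))
```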

# Recurrent neural network model
class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1) 
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)
    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)  # use the size stored on the module, not the global
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state

# Prediction
def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                      char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output holds the prefix followed by the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward pass does not need the parameters passed in explicitly
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])

# Train the model
def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                corpus_indices, idx_to_char, char_to_idx,
                                num_epochs, num_steps, lr, clipping_theta,
                                batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_consecutive(corpus_indices, batch_size, num_steps, device) # consecutive sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # Detach the hidden state from the computation graph
                if isinstance (state, tuple): # LSTM, state:(h, c)  
                    state[0].detach_()
                    state[1].detach_()
                else: 
                    state.detach_()
            (output, state) = model(X, state) # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())
            optimizer.zero_grad()
            l.backward()
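            # model.parameters() is a generator; materialize it so grad_clipping can iterate over it twice
            grad_clipping(list(model.parameters()), clipping_theta, device)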
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]     
        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char,
                    char_to_idx))

# Example
model = RNNModel(rnn_layer, vocab_size).to(device)  # wrap the nn.RNN layer defined above
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                            corpus_indices, idx_to_char, char_to_idx,
                            num_epochs, num_steps, lr, clipping_theta,
                            batch_size, pred_period, pred_len, prefixes)

# Results
>>> epoch 50, perplexity 13.398954, time 1.56 sec
 - 分开始我不  想要你不想我  你不你 我 你不了我  我不要再想 我不要再想 我不要再想 我想要你不多 
 - 不分开 我不要再想  我想你你 我有你 不要 你 你不了  不能我想你 我不多 想不你 想不你 想不你 想
epoch 100, perplexity 1.286338, time 1.45 sec
 - 分开不我去  想想你你发我我妈 难道的手不会去吗 我不 我想要你不起 不 我不了口让她知道 就是开 想要
 - 不分开不想就多 不到 没有你说啊我 你这样打生活 不想要  这样世我妈出老生活 我想你 你爸我不多难熬  
epoch 150, perplexity 1.063623, time 1.42 sec
 - 分开不我去吃 太多 怎么是你是雨 想就 是没有你 我我 你爸不你我爱 你说 我给你 爱写在西元前 深埋在
 - 不分开不想就多 不到 这样打我妈  不要再这样打我妈妈 难道你手不会痛吗 不要再这样打我妈妈 难道你手不会
epoch 200, perplexity 1.035984, time 1.28 sec
 - 分开 我心吃乡满腔的母斑鸠 印地安老斑鸠 腿短毛不多 几天都没有喝水也能活 脑袋瓜有一点秀逗 猎物死了它
 - 不分开不能再多 没多 你  我有没有 有没有 说没有你烦 我 你烦我 你爸 道因我 了很久 是战 想要你却
epoch 250, perplexity 1.022082, time 1.21 sec
 - 分开 我心人乡坏坏的让我疯狂的可爱女人 漂亮的让我面红的可爱女人 温柔的让我心疼的可爱女人 透明的让我感
 - 不分开不 我不 心说你 一场悲剧 我可完美演出的一场戏 宁愿心碎哭泣 再狠狠忘记 你爱过我的证据 让晶莹的

shinning0
