PyTorch Neural Network Practical Study Notes 25: Using a Recurrent Neural Network to Train a Language Model and Make Simple Predictions

1 Language Model Steps

Brief overview: given a piece of input text, the model continues generating the sentences that follow.

1.1 Split tasks according to requirements

  • (1) Feed a piece of text into the model and let the model output the text that follows.
  • (2) Take the text predicted by the model as the new input, feed it back into the model, and let it predict the next piece of text; repeating this allows the RNN to produce a complete sentence.

1.2 Design function modules according to tasks

  • (1) The model must remember the semantics of the preceding text;
  • (2) Given the previous semantics and one input character, it must output the next character.

1.3 Design and implement solutions based on functional modules

The RNN model's interface outputs two results: the predicted value and the current sequence state.

  1. In the implementation, the input sequence sample is broken apart and its characters are fed into the model one at a time in a loop. For each input character the model produces two results: the predicted character and the current sequence state.
  2. In the training scenario, the predicted character is used to compute the loss, and the sequence state is passed into the next iteration of the loop.
  3. In the test scenario, the characters of the input sequence are first fed into the model one by one in a loop to obtain the state at the final step; that state and the last character of the input sequence are then passed to the model to generate the next character. The newly generated character and the current state are fed back into the model repeatedly until the required amount of text has been produced, following the loop sketched below.
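A minimal sketch of this loop structure is shown below. It assumes a model with the two-output interface described above; the names generate, prime_indices, and predict_len are placeholders for illustration, not the variables used later in this note, and a greedy argmax stands in for the sampling the real script performs.

def generate(model, prime_indices, predict_len):
    # prime_indices: 1-D LongTensor of character indices; model(x, hidden) returns (scores, hidden)
    hidden = model.init_zero_state()            # clear the RNN state
    for ch in prime_indices[:-1]:               # warm-up: feed the given text, keep only the state
        _, hidden = model(ch, hidden)
    inp = prime_indices[-1]                     # last character of the prompt
    generated = []
    for _ in range(predict_len):                # produce the requested number of characters
        output, hidden = model(inp, hidden)     # predict the next character and the new state
        inp = output.argmax(dim=-1)             # greedy choice; the real code samples from a distribution
        generated.append(int(inp))              # collect the predicted index
    return generated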

2 Code implementation of language model

2.1 Prepare sample data

Sample content:
In the turmoil of the world, as long as there is a distant light hanging in our hearts, we will persevere in walking, and our ideals fill us with spiritual reserves. Therefore, no matter how ordinary, plain, or trivial life is, we must hold on to a belief, keep a spirit, and accumulate the confidence to stand on our own and the strength to move forward.

2.1.1 Define basic tool functions --- make_Language_model.py (Part 1)

First import the required modules, then define the related functions: get_ch_label() reads the text from a file, and get_ch_label_v() converts the text into a vector of word indices. The specific code is as follows:

import numpy as np
import torch
import torch.nn.functional as F
import time
import random
from collections import Counter

# 1.1 Define basic utility functions
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def elapsed(sec): # helper that formats an elapsed time in seconds
    if sec < 60:
        return str(sec) + "sec"
    elif sec<(60*60):
        return str(sec/60) + "min"
    else:
        return str(sec/(60*60)) + "hour"

training_file = 'csv_list/wordstest.txt' # path to the sample file

# Chinese text, single file
def get_ch_label(txt_file): # extract the Chinese characters from the sample file
    labels = ""
    with open(txt_file,'rb') as f :
        for label in f:
            labels = labels + label.decode("gb2312",errors = 'ignore')
    return labels

# Chinese text, multiple files
def readalltxt(txt_files): # read several Chinese text files
    labels = []
    for txt_file in txt_files:
        target = get_ch_label(txt_file)
        labels.append(target)
    return labels

# Convert Chinese characters into index vectors; works for characters read from a file or from an in-memory string
def get_ch_label_v(txt_file,word_num_map,txt_label = None):
    words_size = len(word_num_map)
    to_num = lambda word:word_num_map.get(word,words_size)
    if txt_file != None:
        txt_label = get_ch_label(txt_file)
    # to_num() converts a single character into its index; if the character is not in the dictionary, words_size (the vocabulary size, 41 for this sample) is returned
    labels_vector = list(map(to_num,txt_label)) # convert every character in the list via to_num()
    return labels_vector

2.1.2 Sample preprocessing---make_Language_model.py (Part 2)

Preprocessing reads the entire sample into training_data, collects all of the vocabulary words, builds the character-to-index dictionary word_num_map, and uses it to convert the sample into the index vector wordlabel. The code is as follows:

# 1.2 Sample preprocessing
training_data = get_ch_label(training_file)
print("加载训练模型中")
print("该样本长度:",len(training_data))
counter = Counter(training_data)
words = sorted(counter)
words_size = len(words)
word_num_map = dict(zip(words,range(words_size)))
print("字表大小:",words_size)
wordlabel = get_ch_label_v(training_file,word_num_map)
# 加载训练模型中
# 该样本长度: 75
# 字表大小: 41  (41 unique characters after removing duplicates)

    The above output shows that the sample file contains 75 characters in total, 41 of which remain after duplicates are removed. These 41 characters serve as the vocabulary dictionary used to build the correspondence between characters and index values.
    When training the model, each character is converted into its numeric index before being fed into the model. The model outputs a probability for each of the 41 characters, i.e., each character is treated as a class.
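As a quick illustration of how the Counter/sorted/zip pattern above builds the dictionary (a toy example, not the actual sample file), unknown characters fall back to the index words_size:

from collections import Counter

toy_text = "好好学习"                                      # toy sample with 3 distinct characters
toy_words = sorted(Counter(toy_text))                      # ['习', '好', '学']
toy_map = dict(zip(toy_words, range(len(toy_words))))      # {'习': 0, '好': 1, '学': 2}
to_num = lambda w: toy_map.get(w, len(toy_words))          # unknown characters map to 3
print(list(map(to_num, "好好学习天")))                     # [1, 1, 2, 0, 3]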

2.2 Code Implementation: Building a Recurrent Neural Network Model---make_Language_model.py (Part 3)

A GRU is used to build the RNN model, which receives one character of the input sequence at a time and predicts the next character.
The steps this model needs to perform are as follows:

  1. Convert the input character index into a word embedding;
  2. Feed the word embedding into the GRU layer;
  3. Apply a fully connected layer to the GRU output to obtain a prediction whose dimension equals the vocabulary size (41 here), representing the probability of each character.

2.2.1 Code Implementation

# 1.3 Build the recurrent neural network (RNN) model
class GRURNN(torch.nn.Module):
    def __init__(self,word_size,embed_dim,hidden_dim,output_size,num_layers):
        super(GRURNN, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

        self.embed = torch.nn.Embedding(word_size,embed_dim)
        # Define a multi-layer bidirectional GRU:
        #   prediction output: shape [sequence, batch, hidden_dim*2] -- the last dimension is doubled because the GRU is bidirectional
        #   sequence state:    shape [num_layers*2, batch, hidden_dim]
        self.gru = torch.nn.GRU(input_size=embed_dim,
                                hidden_size=hidden_dim,
                                num_layers=num_layers,bidirectional=True)
        self.fc = torch.nn.Linear(hidden_dim *2,output_size) # fully connected output layer: maps the GRU output to the final classification scores

    def forward(self,features,hidden):
        embeded = self.embed(features.view(1,-1))
        output,hidden = self.gru(embeded.view(1,1,-1),hidden)
        # output = self.attention(output)
        output = self.fc(output.view(1,-1))
        return output,hidden

    def init_zero_state(self): # initialize the GRU state; it must be cleared before each training iteration. The batch size is 1, so the second dimension of torch.zeros is 1
        init_hidden = torch.zeros(self.num_layers*2,1,self.hidden_dim).to(DEVICE)
        return init_hidden
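The following is a quick shape check of a single forward step, under the assumption that the GRURNN class and DEVICE above are already defined and that toy sizes (vocabulary of 41, as in this sample) are used:

# Single-step forward pass with toy sizes (assumes GRURNN and DEVICE from the code above)
toy_model = GRURNN(word_size=41, embed_dim=10, hidden_dim=20, output_size=41, num_layers=1).to(DEVICE)
hidden = toy_model.init_zero_state()                 # shape: [num_layers*2, 1, hidden_dim] = [2, 1, 20]
one_char = torch.LongTensor([5]).to(DEVICE)          # a single character index
output, hidden = toy_model(one_char, hidden)
print(output.shape)   # torch.Size([1, 41]) -- one score per character in the vocabulary
print(hidden.shape)   # torch.Size([2, 1, 20]) -- 2 state slices because the GRU is bidirectional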

2.3 Code implementation: instantiating and training the model --- make_Language_model.py (Part 4)

# 1.4 Instantiate the model class and train the model
EMBEDDING_DIM = 10 # word-embedding dimension
HIDDEN_DIM = 20 # hidden-layer dimension
NUM_LAYERS = 1 # number of GRU layers
# instantiate the model
model = GRURNN(words_size,EMBEDDING_DIM,HIDDEN_DIM,words_size,NUM_LAYERS)
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(),lr=0.005)

# define the test (generation) function
def evaluate(model,prime_str,predict_len,temperature=0.8):
    hidden = model.init_zero_state().to(DEVICE)
    predicted = ""
    # feed the prime text to build up the semantic state
    for p in range(len(prime_str) -1):
        _,hidden = model(prime_str[p],hidden)
        predicted = predicted + words[prime_str[p]] # record the prime character
    inp = prime_str[-1] # last character of the input
    predicted = predicted + words[inp]
    # generate the specified number of predicted characters
    for p in range(predict_len):
        output,hidden = model(inp,hidden) # feed the current character and state into the model
        # Sample from a multinomial distribution:
        # at test time, the temperature parameter and an exponential adjust the model output, which keeps
        # every value positive (torch.multinomial() raises an error on negative weights);
        # the next character is then drawn by multinomial sampling.
        output_dist = output.data.view(-1).div(temperature).exp()
        inp = torch.multinomial(output_dist,1)[0] # take the sampled index
        predicted = predicted + words[inp] # convert the index back into a character and append it to the string
    return predicted

# define the training parameters and train the model
training_iters = 5000
display_step = 1000
n_input = 4
step = 0
offset = random.randint(0,n_input+1)
end_offset = n_input + 1

while step < training_iters: # train for the specified number of iterations
    start_time = time.time() # record the start time
    # pick a new random position offset when the end of the text is reached
    if offset > (len(training_data)-end_offset):
        offset = random.randint(0,n_input+1)
    # build the input sample
    inwords = wordlabel[offset:offset+n_input]
    inwords = np.reshape(np.array(inwords),[n_input,-1,1])
    # build the label sample
    out_onehot = wordlabel[offset+1:offset+n_input+1]
    hidden = model.init_zero_state() # clear the RNN state
    optimizer.zero_grad()

    loss = 0.0
    inputs = torch.LongTensor(inwords).to(DEVICE)
    targets = torch.LongTensor(out_onehot).to(DEVICE)
    for c in range(n_input): # feed the sample into the model character by character and accumulate the loss
        outputs,hidden = model(inputs[c],hidden)
        loss = loss + F.cross_entropy(outputs,targets[c].view(1))
    loss = loss / n_input
    loss.backward()
    optimizer.step()
    # print the training log
    with torch.set_grad_enabled(False):
        if (step+1)%display_step == 0 :
            print(f'Time elapsed:{(time.time() - start_time)/60:.4f}min')
            print(f'step {step + 1}|Loss {loss.item():.2f}\n\n')
            with torch.no_grad():
                print(evaluate(model,inputs,32),'\n')
            print(50*'=')
    step = step +1
    # At the end of each iteration the offset is moved forward by n_input+1, which keeps the samples evenly distributed over the text; otherwise the characters near the two ends of the text would be trained on less often.
    offset = offset + (n_input+1)
print("Finished!")

2.4 Code Implementation: Run the Model to Generate Sentences --- make_Language_model.py (Part 5)

# 1.5 Run the model to generate sentences
while True:
    prompt = "输入几个文字:"
    sentence = input(prompt)
    inputword = sentence.strip()
    try:
        inputword = get_ch_label_v(None,word_num_map,inputword)
        keys = np.reshape(np.array(inputword),[len(inputword),-1,1])
        # In get_ch_label_v(), a character that is not in the dictionary is assigned an out-of-range index,
        # so when the model is called inside evaluate() no valid word embedding exists for it and an exception is raised.
        model.eval()
        with torch.no_grad():
            sentence = evaluate(model,torch.LongTensor(keys).to(DEVICE),32)
        print(sentence)
    except: # exception handling: an error is raised when the input contains characters outside the model's dictionary; this is intentional, to guard against out-of-vocabulary input
        print("还没学会")

3 Code overview--make_Language_model.py

import numpy as np
import torch
import torch.nn.functional as F
import time
import random
from collections import Counter

# 1.1 Define basic utility functions
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def elapsed(sec): # helper that formats an elapsed time in seconds
    if sec < 60:
        return str(sec) + "sec"
    elif sec<(60*60):
        return str(sec/60) + "min"
    else:
        return str(sec/(60*60)) + "hour"

training_file = 'csv_list/wordstest.txt' # path to the sample file

# Chinese text, single file
def get_ch_label(txt_file): # extract the Chinese characters from the sample file
    labels = ""
    with open(txt_file,'rb') as f :
        for label in f:
            labels = labels + label.decode("gb2312",errors = 'ignore')
    return labels

# Chinese text, multiple files
def readalltxt(txt_files): # read several Chinese text files
    labels = []
    for txt_file in txt_files:
        target = get_ch_label(txt_file)
        labels.append(target)
    return labels

# Convert Chinese characters into index vectors; works for characters read from a file or from an in-memory string
def get_ch_label_v(txt_file,word_num_map,txt_label = None):
    words_size = len(word_num_map)
    to_num = lambda word:word_num_map.get(word,words_size)
    if txt_file != None:
        txt_label = get_ch_label(txt_file)
    # to_num() converts a single character into its index; if the character is not in the dictionary, words_size (the vocabulary size, 41 for this sample) is returned
    labels_vector = list(map(to_num,txt_label)) # convert every character in the list via to_num()
    return labels_vector

# 1.2 Sample preprocessing
training_data = get_ch_label(training_file)
print("加载训练模型中")
print("该样本长度:",len(training_data))
counter = Counter(training_data)
words = sorted(counter)
words_size = len(words)
word_num_map = dict(zip(words,range(words_size)))
print("字表大小:",words_size)
wordlabel = get_ch_label_v(training_file,word_num_map)
# 加载训练模型中
# 该样本长度: 75
# 字表大小: 41  (41 unique characters after removing duplicates)

# 1.3 Build the recurrent neural network (RNN) model
class GRURNN(torch.nn.Module):
    def __init__(self,word_size,embed_dim,hidden_dim,output_size,num_layers):
        super(GRURNN, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

        self.embed = torch.nn.Embedding(word_size,embed_dim)
        # Define a multi-layer bidirectional GRU:
        #   prediction output: shape [sequence, batch, hidden_dim*2] -- the last dimension is doubled because the GRU is bidirectional
        #   sequence state:    shape [num_layers*2, batch, hidden_dim]
        self.gru = torch.nn.GRU(input_size=embed_dim,
                                hidden_size=hidden_dim,
                                num_layers=num_layers,bidirectional=True)
        self.fc = torch.nn.Linear(hidden_dim *2,output_size) # fully connected output layer: maps the GRU output to the final classification scores

    def forward(self,features,hidden):
        embeded = self.embed(features.view(1,-1))
        output,hidden = self.gru(embeded.view(1,1,-1),hidden)
        # output = self.attention(output)
        output = self.fc(output.view(1,-1))
        return output,hidden

    def init_zero_state(self): # initialize the GRU state; it must be cleared before each training iteration. The batch size is 1, so the second dimension of torch.zeros is 1
        init_hidden = torch.zeros(self.num_layers*2,1,self.hidden_dim).to(DEVICE)
        return init_hidden

# 1.4 Instantiate the model class and train the model
EMBEDDING_DIM = 10 # word-embedding dimension
HIDDEN_DIM = 20 # hidden-layer dimension
NUM_LAYERS = 1 # number of GRU layers
# instantiate the model
model = GRURNN(words_size,EMBEDDING_DIM,HIDDEN_DIM,words_size,NUM_LAYERS)
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(),lr=0.005)

# define the test (generation) function
def evaluate(model,prime_str,predict_len,temperature=0.8):
    hidden = model.init_zero_state().to(DEVICE)
    predicted = ""
    # feed the prime text to build up the semantic state
    for p in range(len(prime_str) -1):
        _,hidden = model(prime_str[p],hidden)
        predicted = predicted + words[prime_str[p]] # record the prime character
    inp = prime_str[-1] # last character of the input
    predicted = predicted + words[inp]
    # generate the specified number of predicted characters
    for p in range(predict_len):
        output,hidden = model(inp,hidden) # feed the current character and state into the model
        # Sample from a multinomial distribution:
        # at test time, the temperature parameter and an exponential adjust the model output, which keeps
        # every value positive (torch.multinomial() raises an error on negative weights);
        # the next character is then drawn by multinomial sampling.
        output_dist = output.data.view(-1).div(temperature).exp()
        inp = torch.multinomial(output_dist,1)[0] # take the sampled index
        predicted = predicted + words[inp] # convert the index back into a character and append it to the string
    return predicted

# define the training parameters and train the model
training_iters = 5000
display_step = 1000
n_input = 4
step = 0
offset = random.randint(0,n_input+1)
end_offset = n_input + 1

while step < training_iters: # train for the specified number of iterations
    start_time = time.time() # record the start time
    # pick a new random position offset when the end of the text is reached
    if offset > (len(training_data)-end_offset):
        offset = random.randint(0,n_input+1)
    # build the input sample
    inwords = wordlabel[offset:offset+n_input]
    inwords = np.reshape(np.array(inwords),[n_input,-1,1])
    # build the label sample
    out_onehot = wordlabel[offset+1:offset+n_input+1]
    hidden = model.init_zero_state() # clear the RNN state
    optimizer.zero_grad()

    loss = 0.0
    inputs = torch.LongTensor(inwords).to(DEVICE)
    targets = torch.LongTensor(out_onehot).to(DEVICE)
    for c in range(n_input): # feed the sample into the model character by character and accumulate the loss
        outputs,hidden = model(inputs[c],hidden)
        loss = loss + F.cross_entropy(outputs,targets[c].view(1))
    loss = loss / n_input
    loss.backward()
    optimizer.step()
    # print the training log
    with torch.set_grad_enabled(False):
        if (step+1)%display_step == 0 :
            print(f'Time elapsed:{(time.time() - start_time)/60:.4f}min')
            print(f'step {step + 1}|Loss {loss.item():.2f}\n\n')
            with torch.no_grad():
                print(evaluate(model,inputs,32),'\n')
            print(50*'=')
    step = step +1
    # At the end of each iteration the offset is moved forward by n_input+1, which keeps the samples evenly distributed over the text; otherwise the characters near the two ends of the text would be trained on less often.
    offset = offset + (n_input+1)
print("Finished!")

# 1.5 Run the model to generate sentences
while True:
    prompt = "输入几个文字:"
    sentence = input(prompt)
    inputword = sentence.strip()
    try:
        inputword = get_ch_label_v(None,word_num_map,inputword)
        keys = np.reshape(np.array(inputword),[len(inputword),-1,1])
        # In get_ch_label_v(), a character that is not in the dictionary is assigned an out-of-range index,
        # so when the model is called inside evaluate() no valid word embedding exists for it and an exception is raised.
        model.eval()
        with torch.no_grad():
            sentence = evaluate(model,torch.LongTensor(keys).to(DEVICE),32)
        print(sentence)
    except: # exception handling: an error is raised when the input contains characters outside the model's dictionary; this is intentional, to guard against out-of-vocabulary input
        print("还没学会")


Model results: the model is not trained particularly well, but it is still usable.

 

Source: blog.csdn.net/qq_39237205/article/details/123616457