NNLM: A Simple Neural Network Language Model for Word Prediction (with a Detailed Explanation of the Python Code)

NNLM stands for Neural Network Language Model. It originates from the paper "A Neural Probabilistic Language Model" published by Bengio et al. at NIPS in 2001.

The model uses a neural network to learn word vectors and predicts the word w_t from the preceding words w_{t-n+1}, ..., w_{t-1}; that is, it uses the previous (n-1) words to predict the n-th word.
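For reference, the quantities computed by the code below follow the equations of the paper (C is the word-embedding table, x is the concatenation of the (n-1) input word vectors, and H, W, U, d, b are learned parameters):

x = [C(w_{t-n+1}); ... ; C(w_{t-1})]
h = tanh(d + x H)
y = b + x W + h U
P(w_t | w_{t-n+1}, ..., w_{t-1}) = softmax(y)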

Two. NNLM word prediction code

1. Import package

The torch library, also known as PyTorch, is a Python-first deep learning framework: an open-source machine learning library used for applications such as natural language processing.

torch.nn package - nn stands for neural network; it is the module in torch used for building neural networks.

torch.optim package - this package contains many optimization algorithms, such as the commonly used stochastic gradient descent (SGD) algorithm and SGD with momentum.

import torch
import torch.nn as nn
import torch.optim as optim

2. Text data processing

Three short sentences, "i like dog", "i love coffee", and "i hate milk", are used as the data for model training and prediction.

dtype = torch.FloatTensor
sentences = ["i like dog", "i love coffee", "i hate milk"]
word_list = " ".join(sentences).split()  # extract all words from the sentences
# print(word_list)
word_list = list(set(word_list))  # remove duplicates to obtain the vocabulary
# print("word_list after deduplication:", word_list)
word_dict = {w: i for i, w in enumerate(word_list)}  # word -> index dictionary, e.g. {'word': 0, ...}
number_dict = {i: w for i, w in enumerate(word_list)}  # index -> word dictionary, e.g. {0: 'word', ...}
n_class = len(word_dict)  # total number of words, which is also the number of classes

torch.FloatTensor - FloatTensor creates tensors of floating-point type. torch.FloatTensor() produces 32-bit floating-point numbers, whose dtype is torch.float32 (also written torch.float).

The enumerate() function pairs each element of an iterable (such as a list, tuple, or string) with its index, yielding (index, element) tuples; it is generally used in for loops.
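As a quick illustration, the dictionaries built above might look like this (the exact ordering depends on how set() happens to order the words, so the numbers below are only an example):

print(word_dict)    # e.g. {'dog': 0, 'i': 1, 'like': 2, 'love': 3, 'hate': 4, 'milk': 5, 'coffee': 6}
print(number_dict)  # e.g. {0: 'dog', 1: 'i', 2: 'like', ...}
print(n_class)      # 7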

3. Custom mini-batch iterator

The custom function make_batch(sentences) builds the mini-batch used for training: it takes the sentences list as input and returns the input dataset input_batch and the output dataset target_batch. See the code comments for details.

def make_batch(sentences):
    input_batch = []
    target_batch = []

    for sen in sentences:
        # iterate over every sentence in sentences
        word = sen.split()

        input = [word_dict[n] for n in word[:-1]]
        # the input is the list of indices of every word except the last one;
        # word[:-1] takes all words before the last word of the sentence, and
        # word_dict maps them to their indices, which form the network input

        target = word_dict[word[-1]]
        # the last word of each sentence is the target; in this example that is
        # dog, coffee and milk, and word_dict maps it to its index

        input_batch.append(input)
        # append the input of each sentence to build the input dataset

        target_batch.append(target)
        # append the target of each sentence to build the output dataset

    return input_batch, target_batch

Next, call the make_batch function for data input and conversion:

Pass sentences into the make_batch function to obtain the inputs and the corresponding labels from the training set: input_batch stores the input dataset and target_batch stores the output dataset.

input_batch, target_batch = make_batch(sentences)
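With a vocabulary ordering like the example above (dog -> 0, milk -> 5, coffee -> 6), the returned batches are plain Python lists of indices, for example:

print(input_batch)   # e.g. [[1, 2], [1, 3], [1, 4]]  (indices of ['i','like'], ['i','love'], ['i','hate'])
print(target_batch)  # e.g. [0, 6, 5]                 (indices of 'dog', 'coffee', 'milk')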

  

4. Define the NNLM model

1. Define the model structure

# Define the model
class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()  # define the network structure, inheriting from nn.Module
        self.C = nn.Embedding(n_class, m)
        self.H = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))
        self.W = nn.Parameter(torch.randn(n_step * m, n_class).type(dtype))
        self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))
        self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))
        self.b = nn.Parameter(torch.randn(n_class).type(dtype))
        # C: the word-embedding table of size n_class * m; the word vectors are randomly
        #    initialized, and each (one-hot) word index is mapped to its word vector via C
        # H: weights from the concatenated input to the hidden layer
        # W: weights of the direct connection from the input layer to the output layer
        # d: bias of the hidden layer;  U: weights from the hidden layer to the output layer
        # b: bias of the output layer
        # n_step: number of context words used to predict the next word (2 in this program)
        # n_hidden: number of neurons in the hidden layer
        # m: dimension of the word vectors



    def forward(self, X):
        X = self.C(X)  # [batch_size, n_step] => [batch_size, n_step, m]
        # embedding lookup: each word index in X is replaced by its m-dimensional word
        # vector, turning X from [batch_size, n_step] into [batch_size, n_step, m]

        X = X.view(-1, n_step * m)  # [batch_size, n_step * m]
        # concatenate the n_step word vectors of each sample into one row; the -1 lets
        # view() infer the number of rows (the batch size) automatically

        hidden_out = torch.tanh(self.d + torch.mm(X, self.H))  # [batch_size, n_hidden]
        # hidden layer: h = tanh(d + X H), where H is the weight matrix from the
        # concatenated input to the hidden layer (shape [n_step * m, n_hidden]),
        # d is the bias, and torch.mm is matrix multiplication

        output = self.b + torch.mm(X, self.W) + torch.mm(hidden_out, self.U)  # [batch_size, n_class]
        # output layer: y = b + X W + h U, where W is the direct input-to-output weight
        # matrix, U the hidden-to-output weight matrix and b the output bias; each row of
        # output scores every word in the vocabulary as the candidate next word
        return output

In the code:

torch.nn.Embedding() is the embedding layer under the torch.nn package. Used as a trainable layer, it learns suitable word vectors as the model is trained.

torch.nn.Parameter() turns a plain, non-trainable tensor into a trainable parameter and registers it with the module. After this conversion, self.H becomes part of the model, a parameter that is updated during training. The purpose of using this function is to let these variables keep adjusting their values during learning in order to optimize the model.
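A small sketch (not part of the original code) showing that tensors wrapped in nn.Parameter are registered as trainable parameters of the module, while plain tensor attributes are not:

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.H = nn.Parameter(torch.randn(4, 3))  # registered: appears in parameters() and gets gradients
        self.mask = torch.randn(4, 3)             # plain tensor attribute: not registered

toy = Toy()
print([name for name, _ in toy.named_parameters()])  # ['H']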

torch.randn() generates a tensor of random numbers drawn from the standard normal distribution (mean 0, standard deviation 1). For example, torch.randn(size), where size can be an integer or a tuple.

Input of the input layer: each of the n-1 words in the sequence w_{i-(n-1)}, ..., w_{i-1} is one-hot encoded into a 1 x |V| vector and mapped to its word vector; the word vectors are then concatenated in order to obtain the input vector x' = [C(w_{i-(n-1)}), ..., C(w_{i-1})].

In short, the n-1 input word indices are converted into word vectors, and the n-1 word vectors are concatenated into an input vector of size (n-1)*m. This vector is then fed as X into the hidden layer, where hidden = tanh(d + X * H). All of this happens in the custom forward function, which is what makes the NNLM model trainable and lets the word vectors be updated iteratively. The forward function is explained in detail in the code comments.
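To make the shape flow concrete, here is a small sketch that traces the tensor shapes through one forward pass (it assumes the hyperparameters n_step = 2, m = 2 and n_hidden = 2 that are set in the next subsection, and the example indices used earlier):

model = NNLM()
X = torch.LongTensor([[1, 2], [1, 3], [1, 4]])  # [batch_size=3, n_step=2] word indices
emb = model.C(X)                 # [3, 2, 2]: each index replaced by its m-dimensional vector
flat = emb.view(-1, n_step * m)  # [3, 4]: the two word vectors of each sample concatenated
out = model(X)                   # [3, n_class]: one score per vocabulary word
print(emb.shape, flat.shape, out.shape)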

2. NNLM parameter setting

# NNLM parameter settings
n_step = 2    # n-gram size: predict the current word from its previous two words
n_hidden = 2  # number of neurons in the hidden layer
m = 2         # dimension of the word vectors
model = NNLM()  # instantiate the NNLM model defined above
criterion = nn.CrossEntropyLoss()  # use cross-entropy loss as the loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)  # choose Adam as the optimizer

As a classification problem, cross entropy is used as the loss function. nn.CrossEntropyLoss() is the cross-entropy loss, used for multi-class classification (and it can also handle binary classification). nn.CrossEntropyLoss() applies the softmax internally, so the model outputs raw scores.

The optimizer is Adam. An optimizer is simply the method used to update the parameters of the network. torch.optim is a package that implements a variety of optimization algorithms; most common methods are supported and it provides a rich calling interface. The Adam algorithm is essentially RMSprop with a momentum term: it dynamically adjusts the learning rate of each parameter using first-order and second-order moment estimates of the gradients.
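A minimal sketch of how nn.CrossEntropyLoss() is called in this setting: it expects raw, unnormalized scores of shape [batch_size, n_class] and a LongTensor of class indices of shape [batch_size]; no one-hot encoding and no explicit softmax are needed:

logits = torch.randn(3, 7)             # stand-in for the model output: 3 samples, 7 classes
targets = torch.LongTensor([0, 6, 5])  # index of the correct class for each sample
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())                     # a single scalar loss value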

5. Input data and complete training

Input data:

# Input the data
input_batch, target_batch = make_batch(sentences)
input_batch = torch.LongTensor(input_batch)
target_batch = torch.LongTensor(target_batch)

Here, make_batch is used to obtain the inputs and the corresponding labels from the training set;

input_batch: the indices of the first n_step words of each sentence in the batch;

target_batch: the index of the target word of each sentence. Note that torch.FloatTensor holds 32-bit floating-point data, while torch.LongTensor holds 64-bit integers; the word indices must be LongTensors.
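A quick check of the two dtypes (not in the original code):

print(torch.LongTensor([0, 6, 5]).dtype)    # torch.int64
print(torch.FloatTensor([0, 6, 5]).dtype)   # torch.float32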

Start training:

# Start training
for epoch in range(5000):  # train for 5000 epochs
    optimizer.zero_grad()  # zero the gradients, i.e. reset the derivative of loss w.r.t. the weights
    output = model(input_batch)  # forward pass; output is a tensor of shape (3, 7)
    # output : [batch_size, n_class], target_batch : [batch_size] (LongTensor, not one-hot)

    loss = criterion(output, target_batch)
    # compute the loss; criterion() is the loss function

    if (epoch + 1) % 1000 == 0:
        print("Epoch:{}".format(epoch + 1), "Loss:{:.3f}".format(loss))
        # print the loss every 1000 epochs

    loss.backward()   # backpropagation
    optimizer.step()  # update the parameters; the model is only updated when optimizer.step() is called

The key line to explain is output = model(input_batch):

This runs the previously built NNLM model on the batch and produces the predictions as a tensor of shape (3, 7). Each row holds the seven output scores for one input; the seven values correspond to the seven classes, i.e. the size of the vocabulary. The position of the maximum value in a row is the final predicted word index.
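For instance, once training has finished, the predicted index of each sample can be read off with an argmax over the class dimension (a sketch; the concrete indices depend on the vocabulary ordering):

output = model(input_batch)    # [3, 7]
print(output.shape)            # torch.Size([3, 7])
print(output.argmax(dim=1))    # e.g. tensor([0, 6, 5]): one predicted word index per sentence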

6. Prediction

# Prediction
predict = model(input_batch).data.max(1, keepdim=True)[1]
# a (3, 1) tensor: for each row of the [batch_size, n_class] output, take the index
# of the maximum value, i.e. the predicted word
# print("predict: \n", predict)

# Test
print([sentence.split()[:2] for sentence in sentences], "---->",
      [number_dict[n.item()] for n in predict.squeeze()])  # predict.squeeze() has shape (3,)

First, get the index (and hence the word) of the largest output score for each sample. For the [batch_size, n_class] output, max(1, keepdim=True) returns both the maximum value and its index along dimension 1, and [1] selects the index.

squeeze() removes the dimensions of size 1 from a tensor, reducing its dimensionality. For example, the original tensor([[0], [6], [5]]) becomes tensor([0, 6, 5]) after squeeze().
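The same example in code:

predict = torch.tensor([[0], [6], [5]])  # shape (3, 1)
print(predict.squeeze())                 # tensor([0, 6, 5]), shape (3,)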

Finally, one list comprehension collects the first two words of each sentence into a list, a second comprehension maps each predicted index back to its word through number_dict, and "---->" connects the two lists in the printed output.

We can verify that tensor([0, 6, 5]) corresponds to dog, coffee and milk in number_dict.
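With that vocabulary ordering, the final print statement is expected to produce output along the lines of:

[['i', 'like'], ['i', 'love'], ['i', 'hate']] ----> ['dog', 'coffee', 'milk']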

     


Origin blog.csdn.net/weixin_50706330/article/details/127708430