RNN project practice - text input and prediction

In this blog post, we will use PyTorch to build an RNN model that generates text.

Text input

Neural networks cannot process raw text the way humans do. In most NLP tasks, the text is first converted into a numeric encoding using methods such as embeddings or one-hot encoding. In this article, I will use one-hot encoding to represent our characters, so I'll briefly explain what it is.

As with most machine learning or deep learning projects, data preprocessing usually takes up the majority of the project's time. In a later example, we will preprocess text data into a simple representation - character-level one-hot encoding.

This form of encoding basically gives each character in the text a unique vector. For example, if our text only contains the word "GOOD", then there are only 3 unique characters, G, O, and D, so our vocabulary size is only 3. We will assign each unique character a vector in which all entries are zero except for a single 1 at that character's index. This is how we represent each character to the model.

With only three unique characters, each one-hot vector has dimension 3. Encoding G, O, D in order:
G is 1, which expands to the one-hot vector [1,0,0],
O is 2, which is [0,1,0],
D is 3, which is [0,0,1]
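
As a minimal sketch of the encoding described above, the snippet below builds the one-hot vectors for "GOOD" with NumPy. The names vocab and char_to_index are illustrative and not part of the code used later in this post; note also that Python counts indices from 0, so G, O, D get indices 0, 1, 2.

```python
import numpy as np

word = "GOOD"

# Build the vocabulary in order of first appearance: ['G', 'O', 'D']
vocab = list(dict.fromkeys(word))
char_to_index = {ch: i for i, ch in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing it with a list of indices encodes the whole word at once
one_hot = np.eye(len(vocab), dtype=np.float32)[[char_to_index[ch] for ch in word]]

for ch, vec in zip(word, one_hot):
    print(ch, vec)
```

Each row has exactly one 1, and the two O's share the same vector because they are the same character.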

The input is converted to a one-hot representation and the output is the corresponding category score

Hands-on examples

In this implementation, we will use the PyTorch framework, a deep learning platform that is easy to use and widely used by top researchers. We will build a model that will complete a sentence based on a word or a few characters that are passed in.

How our model will process input data and produce output

The model will take a word as input and predict what the next character in the sentence will be. This process is repeated until we generate sentences of the desired length.

To keep it short and simple, we won't use any large or external datasets. Instead, we will just define a few sentences and see how the model learns from them. The process this implementation will take is as follows:

code flow

We will first import the main PyTorch packages as well as the nn package that will be used when building the model. Additionally, we will only use NumPy to preprocess our data because Torch works very well with NumPy.

import torch
from torch import nn

import numpy as np

First, we will define the sentence we want the model to output when given the first word or first few characters as input.

We will then create a dictionary from all the characters in the sentence and map them to an integer. This will allow us to convert input characters to their respective integers (char2int) and vice versa (int2char).

text = ['hey how are you','good i am fine','have a nice day']

# Join all the sentences together and extract the unique characters from the combined sentences
chars = set(''.join(text))

# Creating a dictionary that maps integers to the characters
int2char = dict(enumerate(chars))

# Creating another dictionary that maps characters to integers
char2int = {char: ind for ind, char in int2char.items()}

The char2int dictionary contains all the letters/symbols that occur in our sentences and maps each of them to a unique integer. The result looks like this (the exact assignment is not unique, because a Python set has no fixed order):

{'f': 0, 'a': 1, 'h': 2, 'i': 3, 'u': 4, 'e': 5, 'm': 6, 'w': 7, 'y': 8, 'd': 9, 'c': 10, ' ': 11, 'r': 12, 'o': 13, 'n': 14, 'g': 15, 'v': 16}
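
If reproducible indices are preferred, one option (an assumption on my part, not what the code in this post does) is to sort the characters before enumerating them. A minimal sketch:

```python
text = ['hey how are you', 'good i am fine', 'have a nice day']

# Sorting the set gives the same character order on every run
chars = sorted(set(''.join(text)))

int2char = dict(enumerate(chars))
char2int = {ch: i for i, ch in int2char.items()}

print(char2int)
```

With sorted characters the space always gets index 0, and a saved model can be reloaded later with the same character-to-integer mapping.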

Next, we will pad the input sentences to ensure that all sentences are of standard length. While RNNs are generally capable of receiving variable-sized inputs, we often want to feed the training data in batches to speed up the training process. In order to use batches to train our data, we need to ensure that each sequence in the input data is of equal size.

Therefore, in most cases padding can be accomplished by padding sequences that are too short with 0 values and trimming sequences that are too long. In our example, we will find the length of the longest sequence and pad the remaining sentences with spaces to match that length.

# Finding the length of the longest string in our data
maxlen = len(max(text, key=len))

# Padding

# A simple loop that loops through the list of sentences and adds a ' ' whitespace until the length of
# the sentence matches the length of the longest sentence
for i in range(len(text)):
    while len(text[i]) < maxlen:
        text[i] += ' '
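
The more general pad-or-trim strategy mentioned above can be sketched as follows. The helper name pad_or_trim and its parameters are hypothetical; the loop above only pads, while this version also trims sequences that are longer than the target length.

```python
def pad_or_trim(sentences, target_len, pad_char=" "):
    """Return copies of the sentences that are all exactly target_len long."""
    # Slicing trims long sentences; ljust pads short ones on the right
    return [s[:target_len].ljust(target_len, pad_char) for s in sentences]


sentences = ["hey how are you", "good i am fine", "have a nice day"]
longest = max(len(s) for s in sentences)

padded = pad_or_trim(sentences, longest)
print([len(s) for s in padded])
```

Unlike the in-place loop, this returns new strings and leaves the original list untouched.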

Since we want to predict the next character in the sequence at each time step, we have to divide each sentence into:

Input data
The last character is excluded, because there is no following character for the model to predict from it
Target/ground-truth label
The same sequence shifted one step ahead, so that the target at each time step is the character that comes next.

# Creating lists that will hold our input and target sequences
input_seq = []
target_seq = []

for i in range(len(text)):
    # Remove last character for input sequence
    input_seq.append(text[i][:-1])

    # Remove first character for target sequence
    target_seq.append(text[i][1:])
    print("Input Sequence: {}\nTarget Sequence: {}".format(input_seq[i], target_seq[i]))

Examples of input and output are as follows:

Input: hey how are yo
Target: ey how are you

Now we can map the input and target sequences to sequences of integers by using the dictionaries created above. This will allow us to subsequently perform a one-hot encoding of the input sequence.

for i in range(len(text)):
    input_seq[i] = [char2int[character] for character in input_seq[i]]
    target_seq[i] = [char2int[character] for character in target_seq[i]]

Define the following three variables

dict_size : The size of the dictionary, that is, the number of unique characters. It determines the length of the one-hot vectors
seq_len : The length of the sequences fed to the model. Here it is the length of the longest sentence minus 1, since the last character is not needed
batch_size : The mini-batch size, used for batch training

dict_size = len(char2int)
seq_len = maxlen - 1
batch_size = len(text)

One-hot encoding

def one_hot_encode(sequence, dict_size, seq_len, batch_size):
    # Creating a multi-dimensional array of zeros with the desired output shape
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features

With this helper function defined, we can one-hot encode the input sequences:

# Input shape --> (Batch Size, Sequence Length, One-Hot Encoding Size)
input_seq = one_hot_encode(input_seq, dict_size, seq_len, batch_size)
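
For comparison, the same encoding can be produced without explicit loops using torch.nn.functional.one_hot. This is a sketch on a small made-up batch (2 sequences of length 3, vocabulary of 4), not a change to the code in this post; one_hot_loop below simply repeats the loop-based helper so the snippet is self-contained.

```python
import numpy as np
import torch
import torch.nn.functional as F


def one_hot_loop(sequence, dict_size, seq_len, batch_size):
    # Same logic as the loop-based helper: start from zeros, set one entry per step
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features


# Hypothetical integer-encoded batch, used only for this comparison
indices = [[0, 2, 1], [3, 3, 0]]

looped = one_hot_loop(indices, dict_size=4, seq_len=3, batch_size=2)

# F.one_hot expects an integer (int64) tensor and returns integers,
# so we cast to float to match what nn.RNN takes as input
vectorized = F.one_hot(torch.tensor(indices), num_classes=4).float()

print(vectorized.shape)
print(torch.equal(torch.from_numpy(looped), vectorized))
```

Both approaches give a tensor of shape (batch size, sequence length, dictionary size).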

At this point we have completed all data preprocessing and can convert the data from NumPy arrays to PyTorch tensors.

input_seq = torch.from_numpy(input_seq)
target_seq = torch.Tensor(target_seq)

The next step is to build the model. You can use fully connected layers, convolutional layers, RNN layers, LSTM layers, etc. in this step. Here I use the most basic nn.RNN to illustrate how an RNN is used.

Before we start building the model, let's use the built-in functionality in PyTorch to check what device (CPU or GPU) we are running on. This implementation does not require a GPU as training is very simple. However, as you work with large datasets and models with millions of trainable parameters, using GPUs is important to speed up training.

# torch.cuda.is_available() checks and returns a Boolean True if a GPU is available, else it'll return False
is_cuda = torch.cuda.is_available()

# If we have a GPU available, we'll set our device to GPU. We'll use this device variable later in our code.
if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

To start building our own neural network model, we define a class that inherits from nn.Module, PyTorch's base class for all neural network modules. After doing this, we can define some variables in the constructor as well as the layers of the model. For this model, we will use only one RNN layer, followed by a fully connected layer. The fully connected layer is responsible for converting the RNN output into our desired output shape.

We must also define the forward pass function under forward() as a class method. The forward function is executed sequentially, so we must first pass the input and zero-initialized hidden states through the RNN layer before passing the RNN output to the fully connected layer. Note that we are using the layer defined in the constructor.

The last method we have to define is the method we called earlier to initialize the hidden state - init_hidden(). This basically creates a zero tensor in the shape of our hidden state.

class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)   
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    def forward(self, x):
        
        batch_size = x.size(0)

        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)
        
        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)
        
        return out, hidden
    
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # We'll send the tensor holding the hidden state to the device we specified earlier as well
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device)
        return hidden
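
To make the reshaping in forward() easier to follow, here is a small shape check using the same layer types. The sizes are assumptions chosen to match this post's data (3 sentences, 14 time steps, 17 characters, hidden size 12), and the input is a dummy tensor of zeros.

```python
import torch
from torch import nn

batch_size, seq_len, dict_size, hidden_dim, n_layers = 3, 14, 17, 12, 1

rnn = nn.RNN(dict_size, hidden_dim, n_layers, batch_first=True)
fc = nn.Linear(hidden_dim, dict_size)

x = torch.zeros(batch_size, seq_len, dict_size)      # dummy one-hot style input
h0 = torch.zeros(n_layers, batch_size, hidden_dim)   # initial hidden state

out, hidden = rnn(x, h0)
print(out.shape)     # one hidden vector per time step: (batch, seq_len, hidden_dim)
print(hidden.shape)  # final hidden state: (n_layers, batch, hidden_dim)

# Flatten batch and time so the linear layer scores every time step at once
flat = out.contiguous().view(-1, hidden_dim)
scores = fc(flat)
print(scores.shape)  # (batch * seq_len, dict_size)
```

The final tensor holds one row of character scores for each time step of each sentence.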

After defining the model above, we have to instantiate the model with the relevant parameters and define our hyperparameters. The hyperparameters we define below are:

n_epochs : The number of epochs, i.e. how many times the model passes over the entire training set
lr : The learning rate

Similar to other neural networks, we also have to define the optimizer and loss function. We will use CrossEntropyLoss, since the final output is basically a classification task, along with the common Adam optimizer.

# Instantiate the model with hyperparameters
model = Model(input_size=dict_size, output_size=dict_size, hidden_dim=12, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
model.to(device)

# Define hyperparameters
n_epochs = 100
lr=0.01

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
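
nn.CrossEntropyLoss takes raw, unnormalized scores of shape (N, C) and integer class indices of shape (N,), which is why the model output and the targets are both flattened before the loss is computed. The sketch below uses random stand-in data (42 positions, 17 classes are assumed sizes) and checks the loss against a manual log-softmax computation.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in data: 42 flattened time steps, 17 possible characters
logits = torch.randn(42, 17)            # raw scores, no softmax applied
targets = torch.randint(0, 17, (42,))   # integer class indices (int64)

loss = nn.CrossEntropyLoss()(logits, targets)

# The same value by hand: mean negative log-probability of the correct class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(42), targets].mean()

print(torch.allclose(loss, manual))
```

Because the softmax is applied inside the loss, the model itself does not need a softmax layer during training.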

Now we can start training! Since we only have a few sentences, this training process is very fast. However, as we progress, larger data sets and deeper models mean that the input data is much larger and the number of parameters in the model that we have to calculate is much greater.

# Training Run
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() # Clears existing gradients from previous epoch
    input_seq = input_seq.to(device)  # .to() returns a new tensor, so reassign it
    output, hidden = model(input_seq)
    loss = criterion(output, target_seq.view(-1).long().to(device))
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

[Out]:  Epoch: 10/100............. Loss: 2.4176
        Epoch: 20/100............. Loss: 2.1816
        Epoch: 30/100............. Loss: 1.7952
        Epoch: 40/100............. Loss: 1.3524
        Epoch: 50/100............. Loss: 0.9671
        Epoch: 60/100............. Loss: 0.6644
        Epoch: 70/100............. Loss: 0.4499
        Epoch: 80/100............. Loss: 0.3089
        Epoch: 90/100............. Loss: 0.2222
        Epoch: 100/100............. Loss: 0.1690

Now let's test our model and see what kind of output we get. As a first step, we'll define some helper functions to convert our model output back to text.

# This function takes in the model and character as arguments and returns the next character prediction and hidden state
def predict(model, character):
    # One-hot encoding our input to fit into the model
    character = np.array([[char2int[c] for c in character]])
    character = one_hot_encode(character, dict_size, character.shape[1], 1)
    character = torch.from_numpy(character)
    character = character.to(device)  # .to() returns a new tensor, so reassign it
    
    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1], dim=0).data
    # Taking the class with the highest probability score from the output
    char_ind = torch.max(prob, dim=0)[1].item()

    return int2char[char_ind], hidden

# This function takes the desired output length and input characters as arguments, returning the produced sentence
def sample(model, out_len, start='hey'):
    model.eval() # eval mode
    start = start.lower()
    # First off, run through the starting characters
    chars = [ch for ch in start]
    size = out_len - len(chars)
    # Now pass in the previous characters and get a new one
    for ii in range(size):
        char, h = predict(model, chars)
        chars.append(char)

    return ''.join(chars)

Let's test it with the word "good":

sample(model, 15, 'good')

As we can see, if we input the word "good" into the model, the model is able to come up with the sentence "good i am fine".

Complete code (CPU version)

# RNN_model.py
import torch
from torch import nn


class Model(nn.Module):
    """
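    Parameters of nn.RNN, for reference:

    input_size (int): number of features in the input, i.e. the dimension of the input vector at each time step.
    hidden_size (int): number of features in the hidden state, i.e. the dimension of the hidden state vector at each time step.
    num_layers (int, optional): number of recurrent layers. Default: 1. With more than one layer, the RNN becomes a multi-layer (stacked) RNN.
    nonlinearity (str, optional): the activation function, either 'tanh' or 'relu'. Default: 'tanh'.
    bias (bool, optional): if True, bias terms are added in the RNN. Default: True.
    batch_first (bool, optional): if True, the input has shape (batch_size, seq_len, input_size); otherwise it has shape (seq_len, batch_size, input_size). Default: False.
    dropout (float, optional): if non-zero, adds a dropout layer after every RNN layer except the last, with dropout probability equal to dropout. Default: 0.
    bidirectional (bool, optional): if True, a bidirectional RNN is used. Default: False.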
    """
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim # dimension of the hidden state h_t
        self.n_layers = n_layers # number of RNN layers

        # Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        batch_size = x.size(0)

        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)

        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # In this CPU version the tensor stays on the default device (the CPU)
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
        return hidden


# train.py
import torch
from torch import nn

import numpy as np

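# First, we define the sentences we want the model to output when given the first word or first few characters as input.
# Then we create a dictionary from all the characters in the sentences and map each of them to an integer.
# This lets us convert input characters to their respective integers (char2int) and vice versa (int2char).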

text = ['hey how are you', 'good i am fine', 'have a nice day']

# Join all the sentences together and extract the unique characters from the combined sentences
chars = set(''.join(text))
# print(chars)  # Output: {'y', 'o', ' ', 'd', 'f', 'n', 'm', 'i', 'w', 'r', 'u', 'v', 'h', 'c', 'g', 'e', 'a'} (note: the order varies between runs, but all characters are always included)

# Creating a dictionary that maps integers to the characters
int2char = dict(enumerate(chars))
# print(int2char)

# Creating another dictionary that maps characters to integers
char2int = {char: ind for ind, char in int2char.items()}
# The char2int dictionary contains every letter/symbol that occurs in our sentences and maps each of them to a unique integer.
# print(char2int)

# ------------------------------------------------------------------------------------
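# Next, we pad the input sentences so that all of them have the same length.
# While RNNs can generally take variable-sized inputs, we usually want to feed the training data in batches to speed up training.
# To train on batches, every sequence in the input data must have the same size.

# In most cases, padding is done by filling sequences that are too short with 0 values and trimming sequences that are too long.
# In our example, we find the length of the longest sequence and pad the other sentences with spaces to match it.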

# Finding the length of the longest string in our data
maxlen = len(max(text, key=len))

# Padding

# A simple loop that loops through the list of sentences and adds a ' ' whitespace until the length of
# the sentence matches the length of the longest sentence
for i in range(len(text)):
    while len(text[i]) < maxlen:
        text[i] += ' '

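# Since we want to predict the next character in the sequence at each time step, we split each sentence into:

# Input data
# The last character is excluded because it is not needed as input to the model
# Target/ground-truth label
# The sequence shifted one step ahead, so that each target is the character at the next time step.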
# Creating lists that will hold our input and target sequences
input_seq = []
target_seq = []

for i in range(len(text)):
    # Remove last character for input sequence
    input_seq.append(text[i][:-1])

    # Remove first character for target sequence
    target_seq.append(text[i][1:])
    print("Input Sequence: {}\nTarget Sequence: {}".format(input_seq[i], target_seq[i]))

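# Now we can map the input and target sequences to sequences of integers using the dictionaries created above.
# This lets us one-hot encode the input sequences afterwards.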

for i in range(len(text)):
    input_seq[i] = [char2int[character] for character in input_seq[i]]
    target_seq[i] = [char2int[character] for character in target_seq[i]]

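# Define the following three variables
# dict_size: the size of the dictionary, i.e. the number of unique characters. It determines the length of the one-hot vectors
# seq_len: the length of the sequences fed to the model. Here it is the length of the longest sentence minus 1, since the last character is not needed
# batch_size: the mini-batch size, used for batch training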
dict_size = len(char2int)
seq_len = maxlen - 1
batch_size = len(text)


def one_hot_encode(sequence, dict_size, seq_len, batch_size):
    # Creating a multi-dimensional array of zeros with the desired output shape
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)

    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features
# Use the helper function above to one-hot encode the input sequences
# Input shape --> (Batch Size, Sequence Length, One-Hot Encoding Size)
input_seq = one_hot_encode(input_seq, dict_size, seq_len, batch_size)

# This completes the data preprocessing, so we can convert the data from NumPy arrays to PyTorch tensors
input_seq = torch.from_numpy(input_seq)
target_seq = torch.Tensor(target_seq)

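# The next step is to build the model. You could use fully connected layers, convolutional layers, RNN layers, LSTM layers, etc. here.
# Here I use the most basic nn.RNN to illustrate how an RNN is used.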
from RNN_model import Model

"""
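# Before building the model, let's use PyTorch's built-in functionality to check which device (CPU or GPU) we are running on.
# This implementation does not need a GPU because training is very simple.
# However, when working with large datasets and models with millions of trainable parameters, a GPU is important for speeding up training.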

# torch.cuda.is_available() checks and returns a Boolean True if a GPU is available, else it'll return False
# is_cuda = torch.cuda.is_available()

# If we have a GPU available, we'll set our device to GPU. We'll use this device variable later in our code.

# if is_cuda:
#     device = torch.device("cuda")
#     print("GPU is available")
# else:
#     device = torch.device("cpu")
#     print("GPU not available, CPU used")
"""


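# To build our own neural network model, we define a class that inherits from nn.Module, PyTorch's base class for all neural network modules.
# We can then define some variables and the layers of the model in the constructor. For this model, we use a single RNN layer followed by a fully connected layer. The fully connected layer converts the RNN output into our desired output shape.
# We must also define the forward pass in forward() as a class method. The forward function runs sequentially, so we first pass the input and the zero-initialized hidden state through the RNN layer, then pass the RNN output to the fully connected layer. Note that we use the layers defined in the constructor.
# The last method we have to define is init_hidden(), which we called earlier to initialize the hidden state. It simply creates a tensor of zeros in the shape of our hidden state.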



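# After defining the model above, we instantiate it with the relevant parameters and define our hyperparameters. The hyperparameters defined below are:
# n_epochs: the number of passes the model makes over the entire training set
# lr: the learning rate

# As with other neural networks, we also have to define the optimizer and the loss function. We use CrossEntropyLoss, because the final output is essentially a classification task, together with the common Adam optimizer.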
# Instantiate the model with hyperparameters
model = Model(input_size=dict_size, output_size=dict_size, hidden_dim=12, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
# model.to(device)

# Define hyperparameters
n_epochs = 100 # number of training epochs
lr = 0.01 # learning rate

# Define Loss, Optimizer
loss_fn = nn.CrossEntropyLoss() # cross-entropy loss function
optimizer = torch.optim.Adam(model.parameters(), lr=lr) # use Adam as the optimizer

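# Now we can start training!
# Since we only have a few sentences, training is very fast.
# However, as we move on to larger datasets and deeper models, the input data becomes much larger and the model has many more parameters to compute.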

# Training Run
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad()  # Clears existing gradients from previous epoch
    # input_seq.to(device) # use the GPU
    output, hidden = model(input_seq)
    loss = loss_fn(output, target_seq.view(-1).long())
    loss.backward()  # Does backpropagation and calculates gradients
    optimizer.step()  # Updates the weights accordingly

    if epoch % 10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

# test.py
# Now let's test our model and see what kind of output we get. As a first step, we'll define some helper functions to convert our model output back to text.
# This function takes in the model and character as arguments and returns the next character prediction and hidden state
import numpy as np
import torch
import torch.nn as nn
from train import char2int, one_hot_encode, dict_size, int2char, model


def predict(model, character):
    # One-hot encoding our input to fit into the model
    character = np.array([[char2int[c] for c in character]])
    character = one_hot_encode(character, dict_size, character.shape[1], 1)
    character = torch.from_numpy(character)
    # character.to(device)

    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1], dim=0).data

    # Taking the class with the highest probability score from the output
    char_ind = torch.max(prob, dim=0)[1].item()

    return int2char[char_ind], hidden

# This function takes the desired output length and input characters as arguments, returning the produced sentence
def sample(model, out_len, start='hey'):
    model.eval() # eval mode
    start = start.lower()
    # First off, run through the starting characters
    chars = [ch for ch in start]
    size = out_len - len(chars)
    # Now pass in the previous characters and get a new one
    for ii in range(size):
        char, h = predict(model, chars)
        chars.append(char)

    return ''.join(chars)

Verify

We execute the following code in the test file and get the results.

print(sample(model, 15, 'good')) # good i am fine 
print(sample(model, 15, 'h')) # have a nice day
print(sample(model, 15, 'you')) # youd i am fine

We found that after inputting 'good' and 'h', the model's predictions were correct. After inputting 'you', however, the prediction was 'youd i am fine'. No training sentence begins with 'you', so the model had no learned pattern to continue from, and the result was not ideal.

Limitations of the model

While this model is definitely an oversimplified language model, let’s review its limitations and the issues that need to be addressed in order to train a better language model.

Limitation 1. Over-fitting

We only gave the model 3 training sentences, so it basically "remembered" the character sequences of those sentences, returning the exact sentences we trained it on. However, if you train a similar model on a larger data set and add some randomness, the model will pick out general sentence structure and language rules and be able to generate its own unique sentences.

Still, running the model using single samples or batches can serve as a sanity check on your workflow, ensuring that your data types are all correct, the model is learning well, etc.
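
One common way to "add some randomness" at generation time is to sample the next character from the predicted distribution instead of always taking the most likely one. The sketch below is an assumption about how that could look; sample_index and temperature are illustrative names that do not appear in this post's code, and the logits are made-up scores for three characters.

```python
import torch


def sample_index(logits, temperature=1.0):
    """Draw one class index from softmax(logits / temperature)."""
    # Lower temperature sharpens the distribution, higher temperature flattens it
    probs = torch.softmax(logits / temperature, dim=0)
    return torch.multinomial(probs, num_samples=1).item()


torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical scores for 3 characters

draws = [sample_index(logits, temperature=1.0) for _ in range(1000)]
print([draws.count(i) for i in range(3)])
```

In predict(), a call like this could take the place of the torch.max step. With only three training sentences, though, sampling mostly adds noise rather than variety.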

Limitation 2. Processing unseen characters

The model can currently only handle characters it has seen previously in the training dataset. Generally, if the training data set is large enough, all letters, symbols, etc. should appear at least once and thus appear in our vocabulary. However, it's always good to have a way to handle never-seen characters, such as assigning all unknown characters to their own index.
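
A minimal sketch of that idea: reserve one extra index for unknown characters and fall back to it during encoding. The UNK token and the build_vocab and encode helpers are hypothetical names, not part of the code above, and the characters are sorted here only to make the indices reproducible.

```python
UNK = "<unk>"  # hypothetical placeholder token for characters never seen in training


def build_vocab(sentences):
    # Sorted so the indices are the same on every run
    chars = sorted(set("".join(sentences)))
    char2int = {ch: i for i, ch in enumerate(chars)}
    char2int[UNK] = len(char2int)  # reserve the last index for unknowns
    return char2int


def encode(text, char2int):
    unk_index = char2int[UNK]
    # dict.get falls back to the unknown index instead of raising KeyError
    return [char2int.get(ch, unk_index) for ch in text]


char2int = build_vocab(["hey how are you", "good i am fine", "have a nice day"])
print(encode("hi!", char2int))  # -> [7, 8, 17]
```

The one-hot size then grows by one, and the model would still need to see the unknown index during training to learn anything useful about it.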

Limitation 3. Text labeling method

In this implementation, we use one-hot encoding to represent our characters. While it may be suitable for this task due to its simplicity, most of the time it should not be used as a solution to real or more complex problems. This is because:

For large data sets, the computational cost is too high
No contextual/semantic information is embedded in one-hot vectors

and many other disadvantages that make this solution less feasible.

Instead, most modern NLP solutions rely on word embeddings (word2vec, GloVe) or, more recently, contextual word representations such as BERT, ELMo and ULMFiT. These methods allow the model to learn the meaning of a word based on the text that appears before it, and in the case of BERT etc., also from the text that appears after it.
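
At the character level used in this post, the step from one-hot vectors to learned embeddings can be sketched with nn.Embedding. The sizes here (17 characters, 4-dimensional vectors) are illustrative assumptions. The snippet shows that an embedding lookup gives the same result as multiplying a one-hot vector by a weight matrix, without ever building the large sparse vector.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 17, 4
embedding = nn.Embedding(vocab_size, embed_dim)  # one trainable row per character

indices = torch.tensor([3, 0, 16])  # three hypothetical character indices

# Direct lookup: picks rows 3, 0 and 16 of the weight matrix
looked_up = embedding(indices)

# Equivalent but wasteful: build one-hot vectors, then multiply by the same matrix
via_one_hot = F.one_hot(indices, num_classes=vocab_size).float() @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, via_one_hot))
```

Because the rows of the embedding matrix are trained along with the rest of the model, characters (or words) that behave similarly can end up with similar vectors, which one-hot vectors cannot express.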