NLP in Practice: Implementing 7 Classic Deep Learning Models for Chinese Text Classification in PyTorch - TextCNN + TextRNN + FastText + TextRCNN + TextRNN_Attention + DPCNN + Transformer



Introduction

This article uses PyTorch to implement 7 classic deep learning models for Chinese text classification: TextCNN, TextRNN, FastText, TextRCNN, TextRNN_Attention, DPCNN and Transformer.

First, the dataset is described in detail, including its source, the preprocessing method and how it is split, so that readers understand the characteristics of the data and how to prepare it.

For environment setup, the required dependencies and configuration instructions are provided so that readers can run the code and reproduce the experiments smoothly.

For each model, detailed notes are given on its structure, the format of the input data, and the training and inference procedure. These notes help the reader understand how each model works and its implementation details.

Finally, we provide detailed reports of training and testing results. These results can help readers evaluate, compare and analyze the performance of each model on Chinese text classification tasks.

Through this article, readers can learn about the implementation details and performance of various deep learning Chinese text classification models. This paper not only provides a reference for academic researchers, but also provides reusable code and experimental guidelines for developers and practitioners to help them achieve better results in Chinese text classification tasks.

Dataset

200,000 news headlines were extracted from THUCNews, with text lengths between 20 and 30 characters. There are 10 categories in total, each with 20,000 headlines.

Text is fed to the model character by character, using the pre-trained embeddings Sogou News Word+Character 300d.

Categories: Finance, Real Estate, Stocks, Education, Technology, Society, Current Affairs, Sports, Games, Entertainment.

The dataset, vocabulary and corresponding pre-trained word vectors have been packaged; see the THUCNews folder in the GitHub repository linked below for details.
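
For reference, the packaged train.txt, dev.txt and test.txt store one sample per line as tab-separated text and label index. A minimal loading sketch, assuming that format and the repo's folder layout:

def load_samples(path):
    """Read one '<text>\t<label index>' pair per line (a sketch, not the repo's loader)."""
    samples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            text, label = line.split('\t')
            samples.append((text, int(label)))
    return samples

# e.g. train = load_samples('THUCNews/data/train.txt')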


Python environment and required dependencies

  • Python 3.7 or above
  • PyTorch 1.1 or above
  • tqdm
  • scikit-learn
  • tensorboardX

Anaconda environment configuration

  1. Go to the official Anaconda website, then download and install Anaconda.

  2. Open a terminal and enter the following commands in sequence:

Create a new environment named chinese_text_classification:

conda create --name chinese_text_classification python==3.8.10

Activate the environment:

conda activate chinese_text_classification

Then run the following commands to install the required Python packages:

conda install pytorch
conda install scikit-learn
conda install tqdm
conda install tensorboardX

Note that the PyTorch installed above is the CPU build by default. If you want the GPU build of PyTorch, follow the steps below.

First, make sure the NVIDIA graphics driver is installed correctly and that your graphics card supports CUDA. Driver and CUDA compatibility information is available on the official NVIDIA website.

Before installing PyTorch in the Python environment, install the CUDA Toolkit that matches your CUDA version; it can be downloaded from NVIDIA's developer website.

After completing these steps, you can check the installed CUDA version with:

nvcc --version

If no version information is printed, check whether CUDA and the CUDA Toolkit are installed correctly.

Then download the matching .whl file from the PyTorch download site and install it locally. Because the GPU builds of PyTorch are very large, installing directly with pip over the index is not recommended; for example, the following command installs the GPU build directly with pip and took about 13 hours here:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117

Instead, open the wheel index https://download.pytorch.org/whl/cu117 directly in a browser, search for torch, and download the wheel that matches your Python version and platform (for example, a file ending in cu117-cp38-cp38-win_amd64.whl). With a reasonably fast connection the download finishes in about 10 minutes.

After the download is successful, you can directly use the following command to install:

pip install <path/to/your/whl/file.whl>

Replace <path/to/your/whl/file.whl> with the actual path to the downloaded .whl file (e.g. pip install /path/to/your/file.whl).
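
After the wheel is installed, a quick check confirms that the GPU build is active:

import torch

print(torch.__version__)           # the CUDA 11.7 wheel typically reports a "+cu117" suffix
print(torch.cuda.is_available())   # True if the driver and the CUDA build work together
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))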

Source code

GitHub address: https://github.com/649453932/Chinese-Text-Classification-Pytorch

TextCNN

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layer: load pre-trained word vectors or initialize randomly; the embedding dimension is embed_size: Embedding(4762, 300)
3. Convolutional layers:
(0): Conv2d(1, 256, kernel_size=(2, 300), stride=(1, 1))
(1): Conv2d(1, 256, kernel_size=(3, 300), stride=(1, 1))
(2): Conv2d(1, 256, kernel_size=(4, 300), stride=(1, 1))
4. Dropout layer: Dropout(p=0.5, inplace=False)
5. Fully connected layer: Linear(in_features=768, out_features=10, bias=True)

Analysis:
The convolutions are equivalent to extracting 2-gram, 3-gram and 4-gram information from the sentence; multiple kernels extract different features, and max pooling keeps only the most salient value from each feature map.

The schematic diagram is as follows:
[Figure: TextCNN architecture]

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'TextCNN'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 20                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度
        self.filter_sizes = (2, 3, 4)                                   # 卷积核尺寸
        self.num_filters = 256                                          # 卷积核数量(channels数)


'''Convolutional Neural Networks for Sentence Classification'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, config.num_filters, (k, config.embed)) for k in config.filter_sizes])
        self.dropout = nn.Dropout(config.dropout)
        self.fc = nn.Linear(config.num_filters * len(config.filter_sizes), config.num_classes)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        out = self.embedding(x[0])
        out = out.unsqueeze(1)
        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)
        out = self.dropout(out)
        out = self.fc(out)
        return out
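
As a quick sanity check of the shapes above, the model can be run on random character indices with a minimal stand-in config (a hypothetical stub exposing only the fields Model needs, not the repo's run script):

import torch
from types import SimpleNamespace

# Hypothetical stub config; embeddings are random here.
cfg = SimpleNamespace(embedding_pretrained=None, n_vocab=4762, embed=300,
                      filter_sizes=(2, 3, 4), num_filters=256,
                      dropout=0.5, num_classes=10)
model = Model(cfg)
tokens = torch.randint(0, cfg.n_vocab - 1, (8, 32))   # [batch_size, seq_len]
seq_len = torch.full((8,), 32)
logits = model((tokens, seq_len))                      # batches are (token_ids, lengths) tuples
print(logits.shape)                                    # torch.Size([8, 10])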

Run the following commands on the terminal for training and testing:

python run.py --model TextCNN

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 15 minutes and 25 seconds and the test accuracy is 90.99%.

TextRNN

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layer: load pre-trained word vectors or initialize randomly; the embedding dimension is embed_size: Embedding(4762, 300)
3. Bidirectional LSTM: (lstm): LSTM(300, 128, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
4. Fully connected layer: Linear(in_features=256, out_features=10, bias=True)

Analysis:
LSTM can capture long-distance semantic relationships better, but because of its recurrent structure it cannot be parallelized and is therefore slow.
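
A small, standalone shape check makes this concrete: the classifier below consumes the bidirectional LSTM output at the last time step, whose first half equals the top-layer forward final hidden state:

import torch
import torch.nn as nn

lstm = nn.LSTM(300, 128, num_layers=2, batch_first=True, bidirectional=True)
x = torch.randn(4, 32, 300)          # [batch, seq_len, embed]
out, (hn, cn) = lstm(x)
print(out.shape)                     # torch.Size([4, 32, 256]); 256 = hidden_size * 2
last_step = out[:, -1, :]            # what TextRNN feeds into the linear layer
print(torch.allclose(last_step[:, :128], hn[-2]))   # True: forward half = final forward state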

The schematic diagram is as follows:

[Figure: TextRNN architecture]

# coding: UTF-8
import torch
import torch.nn as nn
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'TextRNN'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 10                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度, 若使用了预训练词向量,则维度统一
        self.hidden_size = 128                                          # lstm隐藏层
        self.num_layers = 2                                             # lstm层数


'''Recurrent Neural Network for Text Classification with Multi-Task Learning'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.lstm = nn.LSTM(config.embed, config.hidden_size, config.num_layers,
                            bidirectional=True, batch_first=True, dropout=config.dropout)
        self.fc = nn.Linear(config.hidden_size * 2, config.num_classes)

    def forward(self, x):
        x, _ = x
        out = self.embedding(x)  # [batch_size, seq_len, embedding]=[128, 32, 300]
        out, _ = self.lstm(out)
        out = self.fc(out[:, -1, :])  # hidden state at the last time step of the sentence
        return out

    '''Variable-length RNN (packed sequences): accuracy is about the same, even a bit lower...'''
    # def forward(self, x):
    #     x, seq_len = x
    #     out = self.embedding(x)
    #     _, idx_sort = torch.sort(seq_len, dim=0, descending=True)  # indices sorting lengths from longest to shortest
    #     _, idx_unsort = torch.sort(idx_sort)  # indices restoring the original order after sorting
    #     out = torch.index_select(out, 0, idx_sort)
    #     seq_len = list(seq_len[idx_sort])
    #     out = nn.utils.rnn.pack_padded_sequence(out, seq_len, batch_first=True)
    #     # [batch_size, seq_len, num_directions * hidden_size]
    #     out, (hn, _) = self.lstm(out)
    #     out = torch.cat((hn[2], hn[3]), -1)
    #     # out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
    #     out = out.index_select(0, idx_unsort)
    #     out = self.fc(out)
    #     return out

Run the following commands on the terminal for training and testing:

python run.py --model TextRNN

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 18 minutes and 54 seconds and the test accuracy is 90.90%.

TextRNN_Att

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layer: load pre-trained word vectors or initialize randomly; the embedding dimension is embed_size: [batch_size, seq_len, embed_size]
3. Bidirectional LSTM: the hidden size is hidden_size; keep the hidden states at every time step (forward and backward states concatenated): [batch_size, seq_len, hidden_size * 2]
4. Initialize a learnable weight vector w: [hidden_size * 2, 1]
5. Apply a nonlinear activation to the LSTM output, multiply it by w, and normalize with softmax to obtain a score for each time step: [batch_size, seq_len, 1]
6. Multiply the LSTM hidden state at each time step by its score and sum over time, giving the weighted-average hidden representation: [batch_size, hidden_size * 2]
7. Prediction: fully connected layers followed by softmax; the largest of the num_classes scores gives the predicted class: [batch_size, 1]



Analysis:
Steps 4 to 6 are the attention computation: essentially a weighted average over the LSTM hidden states at each time step. For example, if the sentence length is 4, first compute the normalized scores for the 4 time steps, say [0.1, 0.3, 0.4, 0.2], and then:

h_{\text{final}} = 0.1 h_1 + 0.3 h_2 + 0.4 h_3 + 0.2 h_4
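
The same weighted average written out in code (random hidden states, purely to show the shapes):

import torch

H = torch.randn(1, 4, 256)                        # [batch, seq_len, hidden_size * 2]
alpha = torch.tensor([[0.1, 0.3, 0.4, 0.2]])      # normalized attention scores per time step
h_final = (H * alpha.unsqueeze(-1)).sum(dim=1)    # weighted sum over the time dimension
print(h_final.shape)                              # torch.Size([1, 256])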

The schematic diagram is as follows:

[Figure: TextRNN_Att architecture]

# coding: UTF-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'TextRNN_Att'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 10                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度, 若使用了预训练词向量,则维度统一
        self.hidden_size = 128                                          # lstm隐藏层
        self.num_layers = 2                                             # lstm层数
        self.hidden_size2 = 64


'''Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.lstm = nn.LSTM(config.embed, config.hidden_size, config.num_layers,
                            bidirectional=True, batch_first=True, dropout=config.dropout)
        self.tanh1 = nn.Tanh()
        # self.u = nn.Parameter(torch.Tensor(config.hidden_size * 2, config.hidden_size * 2))
        self.w = nn.Parameter(torch.zeros(config.hidden_size * 2))
        self.tanh2 = nn.Tanh()
        self.fc1 = nn.Linear(config.hidden_size * 2, config.hidden_size2)
        self.fc = nn.Linear(config.hidden_size2, config.num_classes)

    def forward(self, x):
        x, _ = x
        emb = self.embedding(x)  # [batch_size, seq_len, embedding]=[128, 32, 300]
        H, _ = self.lstm(emb)  # [batch_size, seq_len, hidden_size * num_direction]=[128, 32, 256]

        M = self.tanh1(H)  # [128, 32, 256]
        # M = torch.tanh(torch.matmul(H, self.u))
        alpha = F.softmax(torch.matmul(M, self.w), dim=1).unsqueeze(-1)  # [128, 32, 1]
        out = H * alpha  # [128, 32, 256]
        out = torch.sum(out, 1)  # [128, 256]
        out = F.relu(out)
        out = self.fc1(out)  # [128, 64]
        out = self.fc(out)  # [128, 10]
        return out

Run the following commands on the terminal for training and testing:

python run.py --model TextRNN_Att

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 10 minutes and 48 seconds and the test accuracy is 89.89%.

FastText

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layers: the character embedding is loaded from pre-trained vectors or initialized randomly, and the 2-gram and 3-gram embeddings are randomly initialized; all have dimension embed_size (the n-gram indices come from hashing, as sketched after this list):
word: [batch_size, seq_len, embed_size]
2-gram: [batch_size, seq_len, embed_size]
3-gram: [batch_size, seq_len, embed_size]
3. Concatenate the embeddings:
[batch_size, seq_len, embed_size * 3]
4. Average over all seq_len positions:
[batch_size, embed_size * 3]
5. Fully connected layer + nonlinear activation, with hidden size hidden_size:
[batch_size, hidden_size]
6. Fully connected layer + softmax normalization:
[batch_size, num_classes] ==> [batch_size, 1]
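
The bigram and trigram indices fed to the model come from hashing adjacent character ids into a fixed bucket vocabulary of size n_gram_vocab. A simplified sketch of the idea (the hash constants here are illustrative; the repo's preprocessing may differ):

def bigram_hash(ids, t, buckets):
    prev = ids[t - 1] if t - 1 >= 0 else 0
    return (prev * 14918087) % buckets

def trigram_hash(ids, t, buckets):
    prev1 = ids[t - 1] if t - 1 >= 0 else 0
    prev2 = ids[t - 2] if t - 2 >= 0 else 0
    return (prev2 * 14918087 * 18408749 + prev1 * 14918087) % buckets

ids = [17, 256, 8, 93]          # hypothetical character indices for one headline
buckets = 250499                # config.n_gram_vocab
bigrams = [bigram_hash(ids, t, buckets) for t in range(len(ids))]
trigrams = [trigram_hash(ids, t, buckets) for t in range(len(ids))]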

The schematic diagram is as follows:
[Figure: FastText architecture]

# coding: UTF-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'FastText'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 20                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度
        self.hidden_size = 256                                          # 隐藏层大小
        self.n_gram_vocab = 250499                                      # ngram 词表大小


'''Bag of Tricks for Efficient Text Classification'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.dropout = nn.Dropout(config.dropout)
        self.fc1 = nn.Linear(config.embed * 3, config.hidden_size)
        # self.dropout2 = nn.Dropout(config.dropout)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):

        out_word = self.embedding(x[0])
        out_bigram = self.embedding_ngram2(x[2])
        out_trigram = self.embedding_ngram3(x[3])
        out = torch.cat((out_word, out_bigram, out_trigram), -1)

        out = out.mean(dim=1)
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out

Run the following commands on the terminal for training and testing:

python run.py --model FastText

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 1 hour, 47 minutes and 40 seconds and the test accuracy is 92.07%.

TextRCNN

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layer: load pre-trained word vectors or initialize randomly; the embedding dimension is embed_size: [batch_size, seq_len, embed_size]
3. Bidirectional LSTM: the hidden size is hidden_size; keep the hidden states at every time step (forward and backward states concatenated): [batch_size, seq_len, hidden_size * 2]
4. Concatenate the embeddings with the LSTM output and apply a nonlinear activation:
[batch_size, seq_len, hidden_size * 2 + embed_size]
5. Pooling layer: take the maximum over the seq_len positions for each feature:
[batch_size, hidden_size * 2 + embed_size]
6. Fully connected layer followed by softmax:
[batch_size, num_classes] ==> [batch_size, 1]

Analysis:
The hidden state of the bidirectional LSTM at each time step (forward + backward) captures the left and right context of the current character; concatenating it with the embedding represents that character, and max pooling then keeps the most useful features.
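
The pooling step is plain max-over-time. A quick standalone check shows that the MaxPool1d call used in the code below is equivalent to taking the maximum along the sequence dimension:

import torch
import torch.nn as nn

out = torch.randn(128, 32, 812)     # [batch, seq_len, hidden_size * 2 + embed] = [128, 32, 812]
pooled_a = nn.MaxPool1d(32)(out.permute(0, 2, 1)).squeeze(-1)   # as in the model's forward pass
pooled_b = out.max(dim=1).values                                # max over the time dimension
print(pooled_a.shape, torch.allclose(pooled_a, pooled_b))       # torch.Size([128, 812]) True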

The schematic diagram is as follows:

[Figure: TextRCNN architecture]

# coding: UTF-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'TextRCNN'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 1.0                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 10                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度, 若使用了预训练词向量,则维度统一
        self.hidden_size = 256                                          # lstm隐藏层
        self.num_layers = 1                                             # lstm层数


'''Recurrent Convolutional Neural Networks for Text Classification'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.lstm = nn.LSTM(config.embed, config.hidden_size, config.num_layers,
                            bidirectional=True, batch_first=True, dropout=config.dropout)
        self.maxpool = nn.MaxPool1d(config.pad_size)
        self.fc = nn.Linear(config.hidden_size * 2 + config.embed, config.num_classes)

    def forward(self, x):
        x, _ = x
        embed = self.embedding(x)  # [batch_size, seq_len, embedding]=[128, 32, 300]
        out, _ = self.lstm(embed)
        out = torch.cat((embed, out), 2)
        out = F.relu(out)
        out = out.permute(0, 2, 1)
        out = self.maxpool(out).squeeze()
        out = self.fc(out)
        return out

Run the following commands on the terminal for training and testing:

python run.py --model TextRCNN

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 10 minutes and 23 seconds and the test accuracy is 90.83%.

DPCNN

Model description

1. Model input: [batch_size, seq_len]
2. Embedding layer: load pre-trained word vectors or initialize randomly; the embedding dimension is embed_size: [batch_size, seq_len, embed_size]
3. Convolution with 250 kernels of size 3; this layer is called the region embedding in the paper:
[batch_size, 250, seq_len - 3 + 1]
4. Two convolutional layers (+ ReLU), each with 250 kernels of size 3 (equal-length convolution: pad first, then convolve, so the sequence length stays the same):
[batch_size, 250, seq_len - 3 + 1]
5. Then repeat the block shown in the small box of the schematic:
I. Max pooling with size 3 and stride 2, compressing the sequence length to about half. (Downsampling)
II. Two equal-length convolutional layers (+ ReLU), each with 250 kernels of size 3.
III. Add the result of I to the result of II. (Residual connection)
Repeat until the sequence length is 1:
[batch_size, 250, 1]
6. Fully connected layer + softmax normalization:
[batch_size, num_classes] ==> [batch_size, 1]

Analysis:
TextCNN essentially extracts N-gram information, and with only one convolutional layer it struggles to capture long-distance features.
DPCNN, by contrast, uses a region embedding that is a TextCNN without the pooling layer, and then stacks convolutional blocks on top of it.
[Figure: DPCNN network structure]

The sequence length is halved in each block (as shown in the figure above), which can be understood as computing N-grams over N-grams: deeper layers fuse information from more positions, and the final layer captures the semantics of the whole sequence.
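
The length bookkeeping for pad_size = 32 can be traced with a few lines; this is what drives the while x.size()[2] > 2 loop in the code below:

# Region embedding shortens the sequence to 32 - 3 + 1 = 30; each block then pads the
# bottom by one position and max-pools with kernel 3 / stride 2, roughly halving the length.
seq_len = 32 - 3 + 1
lengths = [seq_len]
while seq_len > 2:
    seq_len = (seq_len - 2) // 2 + 1
    lengths.append(seq_len)
print(lengths)   # [30, 15, 7, 3, 1]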

The schematic diagram is as follows:
[Figure: DPCNN architecture]

# coding: UTF-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'DPCNN'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 1000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 20                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 1e-3                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度
        self.num_filters = 250                                          # 卷积核数量(channels数)


'''Deep Pyramid Convolutional Neural Networks for Text Categorization'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.conv_region = nn.Conv2d(1, config.num_filters, (3, config.embed), stride=1)
        self.conv = nn.Conv2d(config.num_filters, config.num_filters, (3, 1), stride=1)
        self.max_pool = nn.MaxPool2d(kernel_size=(3, 1), stride=2)
        self.padding1 = nn.ZeroPad2d((0, 0, 1, 1))  # top bottom
        self.padding2 = nn.ZeroPad2d((0, 0, 0, 1))  # bottom
        self.relu = nn.ReLU()
        self.fc = nn.Linear(config.num_filters, config.num_classes)

    def forward(self, x):
        x = x[0]
        x = self.embedding(x)
        x = x.unsqueeze(1)  # [batch_size, 1, seq_len, embed]
        x = self.conv_region(x)  # [batch_size, 250, seq_len-3+1, 1]

        x = self.padding1(x)  # [batch_size, 250, seq_len, 1]
        x = self.relu(x)
        x = self.conv(x)  # [batch_size, 250, seq_len-3+1, 1]
        x = self.padding1(x)  # [batch_size, 250, seq_len, 1]
        x = self.relu(x)
        x = self.conv(x)  # [batch_size, 250, seq_len-3+1, 1]
        while x.size()[2] > 2:
            x = self._block(x)
        x = x.squeeze()  # [batch_size, num_filters(250)]
        x = self.fc(x)
        return x

    def _block(self, x):
        x = self.padding2(x)
        px = self.max_pool(x)

        x = self.padding1(px)
        x = F.relu(x)
        x = self.conv(x)

        x = self.padding1(x)
        x = F.relu(x)
        x = self.conv(x)

        # Short Cut
        x = x + px
        return x

Run the following commands on the terminal for training and testing:

python run.py --model DPCNN

The training process is as follows:
[Screenshot: training log]

The training and test results are as follows: with the CPU version of PyTorch, training took 19 minutes and 26 seconds and the test accuracy is 91.21%.

Transformer

Model description

The schematic diagram is as follows:
[Figure: Transformer architecture]
Like most seq2seq models, the Transformer is composed of an encoder and a decoder.

Encoder:

The Encoder consists of 6 identical blocks (the left part of the model structure); a block here refers to the unit on the left side of the schematic diagram. Each block consists of a multi-head self-attention sub-layer and a fully connected feed-forward sub-layer, and both sub-layers are wrapped with a residual connection and layer normalization.

Multi-head self-attention

[Figure: multi-head attention]

Multi-head attention runs several attention heads in parallel and concatenates their outputs. It can be expressed as:

\begin{aligned}
\text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_n) W^O \\
\text{head}_i &= \text{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)
\end{aligned}

Self-attention means that Q, K and V are all taken from the same input.
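
A compact example with random tensors makes the shapes concrete: in self-attention Q, K and V all come from the same input, the attention weights form a [seq_len, seq_len] matrix per sequence, and the output keeps the input shape:

import torch
import torch.nn.functional as F

x = torch.randn(2, 32, 64)                               # [batch, seq_len, dim_head]
Q = K = V = x                                            # self-attention: one shared source
scores = Q @ K.transpose(1, 2) / (K.size(-1) ** 0.5)     # [2, 32, 32]
weights = F.softmax(scores, dim=-1)
context = weights @ V                                    # [2, 32, 64]
print(weights.shape, context.shape)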

Decoder

The structure of the Decoder is very similar to that of the Encoder; what deserves attention is the Decoder's input and output.

  • Input: the Encoder's output together with the Decoder's output for position i-1. The attention block in the middle is therefore not self-attention: its K and V come from the Encoder, while Q comes from the Decoder's output at the previous position;
  • Output: the probability distribution of the output word for the current position.

The decoder also behaves differently during training and inference. During training, decoding is done in a single pass, using the ground truth of the previous steps for prediction (the mask matrix is adjusted so that future tokens cannot be seen during decoding); during inference there is no ground truth, so tokens must be predicted one by one.

Positional Encoding

To make the model aware of the order of the sequence, position information must be encoded into the input: the final input vector is the sum of the positional encoding and the token embedding. In general there are two ways to encode positions: a formula-based encoding, or an encoding learned dynamically during training. The original authors found the two perform about the same, and the formula-based encoding requires no extra training and can handle sequence lengths not seen in the training set, so the Transformer uses a formula-based positional encoding:

\begin{gathered}
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_{\text{model}}}\right) \\
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i / d_{\text{model}}}\right)
\end{gathered}
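
A short sketch of building that table (it mirrors the Positional_Encoding class in the code below, which adds the table to the embeddings and applies dropout):

import numpy as np

def sinusoidal_position_encoding(pad_size, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = np.array([[pos / (10000.0 ** (2 * (i // 2) / d_model)) for i in range(d_model)]
                   for pos in range(pad_size)])
    pe[:, 0::2] = np.sin(pe[:, 0::2])
    pe[:, 1::2] = np.cos(pe[:, 1::2])
    return pe

print(sinusoidal_position_encoding(32, 300).shape)   # (32, 300)

The full Config and model implementation follow: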

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy


class Config(object):

    """配置参数"""
    def __init__(self, dataset, embedding):
        self.model_name = 'Transformer'
        self.train_path = dataset + '/data/train.txt'                                # 训练集
        self.dev_path = dataset + '/data/dev.txt'                                    # 验证集
        self.test_path = dataset + '/data/test.txt'                                  # 测试集
        self.class_list = [x.strip() for x in open(
            dataset + '/data/class.txt', encoding='utf-8').readlines()]              # 类别名单
        self.vocab_path = dataset + '/data/vocab.pkl'                                # 词表
        self.save_path = dataset + '/saved_dict/' + self.model_name + '.ckpt'        # 模型训练结果
        self.log_path = dataset + '/log/' + self.model_name
        self.embedding_pretrained = torch.tensor(
            np.load(dataset + '/data/' + embedding)["embeddings"].astype('float32'))\
            if embedding != 'random' else None                                       # 预训练词向量
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备

        self.dropout = 0.5                                              # 随机失活
        self.require_improvement = 2000                                 # 若超过1000batch效果还没提升,则提前结束训练
        self.num_classes = len(self.class_list)                         # 类别数
        self.n_vocab = 0                                                # 词表大小,在运行时赋值
        self.num_epochs = 20                                            # epoch数
        self.batch_size = 128                                           # mini-batch大小
        self.pad_size = 32                                              # 每句话处理成的长度(短填长切)
        self.learning_rate = 5e-4                                       # 学习率
        self.embed = self.embedding_pretrained.size(1)\
            if self.embedding_pretrained is not None else 300           # 字向量维度
        self.dim_model = 300
        self.hidden = 1024
        self.last_hidden = 512
        self.num_head = 5
        self.num_encoder = 2


'''Attention Is All You Need'''


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)

        self.postion_embedding = Positional_Encoding(config.embed, config.pad_size, config.dropout, config.device)
        self.encoder = Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            # Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
            for _ in range(config.num_encoder)])

        self.fc1 = nn.Linear(config.pad_size * config.dim_model, config.num_classes)
        # self.fc2 = nn.Linear(config.last_hidden, config.num_classes)
        # self.fc1 = nn.Linear(config.dim_model, config.num_classes)

    def forward(self, x):
        out = self.embedding(x[0])
        out = self.postion_embedding(out)
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)
        # out = torch.mean(out, 1)
        out = self.fc1(out)
        return out


class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):
        out = self.attention(x)
        out = self.feed_forward(out)
        return out


class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        self.pe = torch.tensor([[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = np.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = np.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = x + nn.Parameter(self.pe, requires_grad=False).to(self.device)
        out = self.dropout(out)
        return out


class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''
    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        '''
        Args:
            Q: [batch_size, len_Q, dim_Q]
            K: [batch_size, len_K, dim_K]
            V: [batch_size, len_V, dim_V]
            scale: scaling factor, 1 / sqrt(dim_K) in the paper
        Return:
            the context tensor after scaled dot-product self-attention
        '''
        attention = torch.matmul(Q, K.permute(0, 2, 1))
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        attention = F.softmax(attention, dim=-1)
        context = torch.matmul(attention, V)
        return context


class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        batch_size = x.size(0)
        Q = self.fc_Q(x)
        K = self.fc_K(x)
        V = self.fc_V(x)
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)
        K = K.view(batch_size * self.num_head, -1, self.dim_head)
        V = V.view(batch_size * self.num_head, -1, self.dim_head)
        # if mask:  # TODO
        #     mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # scaling factor (1 / sqrt(dim_head))
        context = self.attention(Q, K, V, scale)

        context = context.view(batch_size, -1, self.dim_head * self.num_head)
        out = self.fc(context)
        out = self.dropout(out)
        out = out + x  # residual connection
        out = self.layer_norm(out)
        return out


class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        out = self.dropout(out)
        out = out + x  # residual connection
        out = self.layer_norm(out)
        return out

Run the following commands on the terminal for training and testing:

python run.py --model Transformer

The training process is as follows:
[Screenshot: training log]

The training and testing results are as follows: with the CPU version of PyTorch, nearly 4 hours were not enough to complete even one epoch, so I uninstalled it and switched to the GPU version of PyTorch, which finished the run in 4 minutes and 25 seconds with a test accuracy of 90.01%.

Comparison of model results

Model          Accuracy   Remark
TextCNN        90.99%     Kim (2014), classic CNN for text classification
TextRNN        90.90%     BiLSTM
TextRNN_Att    89.89%     BiLSTM + attention
TextRCNN       90.83%     BiLSTM + pooling
FastText       92.07%     bow + bigram + trigram; surprisingly good results
DPCNN          91.21%     deep pyramid CNN
Transformer    90.01%     less effective

References

Chinese text classification pytorch implementation

Other resources

If you want to keep learning about artificial intelligence learning routes and knowledge systems, you are welcome to read my other blog post "Heavy | Complete artificial intelligence AI learning - basic knowledge learning route, all materials can be downloaded directly from the network disk, no strings attached".
That post draws on well-known open-source platforms on GitHub, AI technology platforms and experts in related fields, including Datawhale, ApacheCN, AI Youdao and Dr. Huang Haiguang, and collects about 100 GB of related materials; I hope it helps.
