Binary text classification with CNN and PyTorch

Introduction

I recently learned about convolutional neural networks and wanted a small project for practice. The dataset comes from GitHub and consists of positive and negative reviews of car after-sales service. Using PyTorch, I train a model and perform binary classification on reviews from the test set.

Principle: use convolution to extract local features from the text and capture key information in a way similar to N-grams.
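To make the N-gram analogy concrete, here is a minimal sketch (all sizes are illustrative): a convolution kernel of height n slides over n consecutive word vectors at a time, so each output value scores one n-word window, much like an n-gram feature.

import torch
import torch.nn as nn

# one sentence of 10 words, each represented by a 128-dim vector (illustrative sizes)
x = torch.randn(1, 1, 10, 128)            # (batch, channel, words, embedding_dim)
trigram_conv = nn.Conv2d(1, 1, (3, 128))  # the kernel covers 3 consecutive words
print(trigram_conv(x).shape)              # torch.Size([1, 1, 8, 1]): one score per 3-word window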

1. Data preprocessing

In natural language processing, an unavoidable topic is word vectors. I use the torchtext library to build the vocabulary and word vectors.

Tokenizer

import re
import jieba

def tokenizer(text): # create a tokenizer function
    # keep only Chinese characters, letters, and digits; everything else becomes a space
    regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

The tokenizer uses the Chinese word-segmentation library jieba and returns the segmented words as a list.
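For instance (the exact segmentation depends on jieba's dictionary and version, so the output below is only a plausible example):

print(tokenizer('这辆车的售后服务很好!'))
# the '!' is first replaced by a space, then jieba segments the rest,
# e.g. ['这', '辆', '车', '的', '售后服务', '很', '好']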

Remove stop words

def get_stop_words():
    # the stop-word file has one word per line; 'with' closes the file automatically
    with open('D:\\MyStudy\\program\\text-classification-master\\text-cnn\\data\\stopwords.txt', encoding='UTF-8') as file_object:
        stop_words = [line.strip() for line in file_object]
    return stop_words

The stop-word list is downloaded in advance; after processing, the stop words are returned as a list.
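In load_data below, the list is handed to torchtext via the stop_words argument of data.Field, which drops those tokens after segmentation. Done by hand, the same filtering would look like this sketch:

stop_words = set(get_stop_words()) # a set makes membership tests O(1)
tokens = [w for w in tokenizer('这辆车的售后服务很好') if w not in stop_words]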

Data processing

from torchtext import data            # for torchtext >= 0.9 this lives in torchtext.legacy
from torchtext.vocab import Vectors

def load_data(args):
    print('Loading data...')
    stop_words = get_stop_words() # load the stop-word list
    '''
    To fix the text length, set fix_length; otherwise torchtext automatically pads
    each text to the length of the longest sample:
    text = data.Field(sequential=True, tokenize=tokenizer, fix_length=args.max_len, stop_words=stop_words)
    '''
    text = data.Field(sequential=True, lower=True, tokenize=tokenizer, stop_words=stop_words)
    label = data.Field(sequential=False)

    train, val = data.TabularDataset.splits(
            path='D:\\MyStudy\\program\\text-classification-master\\text-cnn\\data\\',
            skip_header=True,
            train='train.tsv',
            validation='validation.tsv',
            format='tsv',
            fields=[('index', None), ('label', label), ('text', text)],
        )

    if args.static:
        text.build_vocab(train, val, vectors=Vectors(name="data\\eco_article.vector")) # replace with your own word vectors
        args.embedding_dim = text.vocab.vectors.size()[-1]
        args.vectors = text.vocab.vectors
    else:
        text.build_vocab(train, val)

    label.build_vocab(train, val)

    train_iter, val_iter = data.Iterator.splits(
            (train, val),
            sort_key=lambda x: len(x.text),
            batch_sizes=(args.batch_size, len(val)), # batch the training set; use the whole validation set at once
            device=-1 # -1 means CPU in the legacy torchtext API
    )
    args.vocab_size = len(text.vocab)
    args.label_num = len(label.vocab)
    return train_iter, val_iter


The general workflow with torchtext is: 1. Define Field objects with data.Field() and preset their parameters; text and label are defined separately. 2. Read the files with data.TabularDataset.splits() to obtain the train and val datasets. 3. Build the vocabularies of the training text and the labels with text.build_vocab(train, val) and label.build_vocab(train, val). 4. Generate batches with data.Iterator.splits().
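After build_vocab, the vocabularies can be inspected directly; the sketch below uses the legacy torchtext API. Note that index 0 of the label vocabulary is reserved for '<unk>', which is why the training code later shifts the targets with target.sub_(1).

print(len(text.vocab))       # vocabulary size, including the special '<unk>' and '<pad>' tokens
print(text.vocab.itos[:10])  # the first few tokens of the text vocabulary
print(label.vocab.stoi)      # e.g. {'<unk>': 0, '0': 1, '1': 2}: real labels start at index 1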

With that, the data preprocessing is complete.

2. Model construction

The model is a CNN implemented in PyTorch. The overall architecture is: embedding layer, dimension reshaping, convolutional layers, activation function, pooling layers, multi-channel feature concatenation, dropout layer, and a fully connected layer.

Embedding layer

The embedding layer turns token indices into dense vectors; its parameters are the vocabulary size and the embedding dimension.
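A minimal sketch of what the embedding layer does (the sizes are illustrative):

import torch
import torch.nn as nn

emb = nn.Embedding(5000, 128)         # vocab_size=5000, embedding_dim=128
ids = torch.LongTensor([[2, 7, 14]])  # one sentence of 3 token indices: (batch_size=1, max_len=3)
print(emb(ids).shape)                 # torch.Size([1, 3, 128])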

Convolutional layer

Reshape the embedding output to the shape expected by the convolutional layers, then build the parallel convolutions with self.convs = nn.ModuleList([nn.Conv2d(...) for fsz in filter_sizes]), which stores one convolutional layer per kernel size in a list.

Activation function

x = [F.relu(conv(x)) for conv in self.convs] introduces non-linearity.

Pooling (downsampling)

Multi-channel feature extraction and concatenation

x = [x_item.view(x_item.size(0), -1) for x_item in x] flattens the result of each kernel size.

Dropout to prevent overfitting

Fully connected output layer

The model-construction code is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # multi-channel TextCNN
    def __init__(self, args):
        super(TextCNN, self).__init__()
        self.args = args

        label_num = args.label_num # number of labels
        filter_num = args.filter_num # number of convolution kernels per size
        filter_sizes = [int(fsz) for fsz in args.filter_sizes.split(',')]
        vocab_size = args.vocab_size
        embedding_dim = args.embedding_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if args.static: # if using pre-trained word vectors, load them here; set freeze=True when no fine-tuning is needed
            self.embedding = self.embedding.from_pretrained(args.vectors, freeze=not args.fine_tune)

        self.convs = nn.ModuleList(
            [nn.Conv2d(1, filter_num, (fsz, embedding_dim)) for fsz in filter_sizes])
        self.dropout = nn.Dropout(args.dropout)
        self.linear = nn.Linear(len(filter_sizes)*filter_num, label_num)

    def forward(self, x):
        # input x has shape (batch_size, max_len); max_len is either set via torchtext
        # or derived automatically from the longest training sample
        x = self.embedding(x) # after embedding, x has shape (batch_size, max_len, embedding_dim)

        # after view, x has shape (batch_size, in_channel=1, w=max_len, h=embedding_dim)
        x = x.view(x.size(0), 1, x.size(1), self.args.embedding_dim)

        # after convolution, each result has shape (batch_size, out_channel, w, h=1)
        x = [F.relu(conv(x)) for conv in self.convs]

        # after max pooling, each result has shape (batch_size, out_channel, w=1, h=1)
        x = [F.max_pool2d(input=x_item, kernel_size=(x_item.size(2), x_item.size(3))) for x_item in x]

        # flatten each result from (batch, out_channel, w, h=1) to (batch, out_channel*w*h)
        x = [x_item.view(x_item.size(0), -1) for x_item in x]

        # concatenate the features from the different kernel sizes: (batch, sum of out_channel*w*h)
        x = torch.cat(x, 1)

        # dropout layer
        x = self.dropout(x)

        # fully connected layer
        logits = self.linear(x)
        return logits
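To verify the shape comments in forward(), here is a standalone walk-through of one convolution branch. filter_num=200, embedding_dim=128 and fsz=6 match the defaults used below, while batch_size=4 and max_len=50 are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_len, embedding_dim = 4, 50, 128
filter_num, fsz = 200, 6

x = torch.randn(batch_size, 1, max_len, embedding_dim)   # the tensor after view()
conv = nn.Conv2d(1, filter_num, (fsz, embedding_dim))
out = F.relu(conv(x))                                    # (4, 200, 45, 1), where 45 = 50 - 6 + 1
pooled = F.max_pool2d(out, (out.size(2), out.size(3)))   # (4, 200, 1, 1)
flat = pooled.view(pooled.size(0), -1)                   # (4, 200)
print(out.shape, pooled.shape, flat.shape)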

3. Model training and optimization

After the model is built we move on to training; the first step is to set the hyperparameters.

import argparse

parser = argparse.ArgumentParser(description='TextCNN text classifier')

parser.add_argument('-lr', type=float, default=0.001, help='learning rate')
parser.add_argument('-batch-size', type=int, default=128)
parser.add_argument('-epoch', type=int, default=20)
parser.add_argument('-filter-num', type=int, default=200, help='number of convolution kernels per size')
parser.add_argument('-filter-sizes', type=str, default='6,7,8', help='sizes of the different convolution kernels')
parser.add_argument('-embedding-dim', type=int, default=128, help='dimension of the word vectors')
parser.add_argument('-dropout', type=float, default=0.4)
parser.add_argument('-label-num', type=int, default=2, help='number of labels')
# note: type=bool treats any non-empty string as True, so change these flags via their defaults
parser.add_argument('-static', type=bool, default=False, help='whether to use pre-trained word vectors')
parser.add_argument('-fine-tune', type=bool, default=True, help='whether to fine-tune the pre-trained word vectors')
parser.add_argument('-cuda', type=bool, default=False)
parser.add_argument('-log-interval', type=int, default=1, help='how many iterations between logging training status')
parser.add_argument('-test-interval', type=int, default=100, help='how many iterations between evaluations on the validation set')
parser.add_argument('-early-stopping', type=int, default=1000, help='number of iterations without improvement before early stopping')
parser.add_argument('-save-best', type=bool, default=True, help='whether to save the model when a better accuracy is reached')
parser.add_argument('-save-dir', type=str, default='model_dir', help='where to store the trained model')

args = parser.parse_args()

import os
import sys
import torch
import torch.nn.functional as F

import data_processor # the module containing load_data above

def train(args):
    train_iter, dev_iter = data_processor.load_data(args) # split the data into training and validation sets
    print('Data loading complete')
    model = TextCNN(args)
    if args.cuda: model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
    steps = 0
    best_acc = 0
    last_step = 0
    model.train()
    for epoch in range(1, args.epoch + 1):
        for batch in train_iter:
            feature, target = batch.text, batch.label
            # t_() transposes (max_len, batch_size) to (batch_size, max_len)
            with torch.no_grad():
                feature.t_()
                target.sub_(1) # shift labels down by 1: index 0 of the label vocab is '<unk>'
            if args.cuda:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()
            steps += 1
            if steps % args.log_interval == 0:
                # torch.max(logits, 1) returns each row's maximum together with its column index
                corrects = (torch.max(logits, 1)[1] == target).sum()
                train_acc = 100.0 * corrects / batch.batch_size
                sys.stdout.write(
                    '\rBatch[{}] - loss: {:.6f}  acc: {:.4f}%({}/{})'.format(steps,
                                                                             loss.item(),
                                                                             train_acc,
                                                                             corrects,
                                                                             batch.batch_size))
            if steps % args.test_interval == 0:
                dev_acc = eval(dev_iter, model, args)
                model.train() # eval() switches the model to evaluation mode; switch back
                if dev_acc > best_acc:
                    best_acc = dev_acc
                    last_step = steps
                    if args.save_best:
                        print('Saving best model, acc: {:.4f}%\n'.format(best_acc))
                        save(model, args.save_dir, 'best', steps)
                else:
                    if steps - last_step >= args.early_stopping:
                        print('\nearly stop by {} steps, acc: {:.4f}%'.format(args.early_stopping, best_acc))
                        raise KeyboardInterrupt

The training procedure first instantiates the model, then defines the optimizer (I use Adam), followed by the standard PyTorch training loop:

for epoch in range(epoch_num):
    for batch in batches:
        optimizer.zero_grad() # clear the gradients
        logits = model(feature)
        loss = F.cross_entropy(logits, targets) # cross-entropy loss
        loss.backward() # backpropagation
        optimizer.step()
        steps += 1

Evaluation on the validation set is similar to the training loop.

def eval(data_iter, model, args):
    model.eval() # switch to evaluation mode so dropout is disabled
    corrects, avg_loss = 0, 0
    for batch in data_iter:
        feature, target = batch.text, batch.label
        with torch.no_grad():
            feature.t_()
            target.sub_(1)
        if args.cuda:
            feature, target = feature.cuda(), target.cuda()
        logits = model(feature)
        loss = F.cross_entropy(logits, target)
        avg_loss += loss.item()
        corrects += (torch.max(logits, 1)
                     [1].view(target.size()) == target).sum()
    size = len(data_iter.dataset)
    avg_loss /= size
    accuracy = 100.0 * corrects / size
    print('\nEvaluation - loss: {:.6f}  acc: {:.4f}%({}/{}) \n'.format(avg_loss,
                                                                       accuracy,
                                                                       corrects,
                                                                       size))
    return accuracy

def save(model, save_dir, save_prefix, steps):
    # save the model parameters as <save_dir>/<save_prefix>_steps_<steps>.pt
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    save_prefix = os.path.join(save_dir, save_prefix)
    save_path = '{}_steps_{}.pt'.format(save_prefix, steps)
    torch.save(model.state_dict(), save_path)

train(args)


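After training, the saved checkpoint can be reloaded for prediction. A minimal sketch follows; the checkpoint filename is hypothetical (use whichever file save() actually wrote into model_dir), and the random feature tensor stands in for a real batch of token indices.

import torch

model = TextCNN(args) # same args as used for training
model.load_state_dict(torch.load('model_dir/best_steps_100.pt')) # illustrative filename
model.eval() # disable dropout for prediction
feature = torch.randint(0, args.vocab_size, (1, 50)) # dummy batch: 1 sentence of 50 token ids
with torch.no_grad():
    logits = model(feature)
    pred = torch.max(logits, 1)[1] # 0 or 1, matching the earlier target.sub_(1) shift
print(pred)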

After training, the accuracy on the validation set reaches 90%. The code is adapted from a linked reference.


Source: juejin.im/post/6982331574402416670