Python: Classifying Drugs and Diseases with a Convolutional Neural Network (TextCNN) (Applicable to Disease and Drug Name Normalization)

I. Convolutional Neural Networks (CNN)

A quick review: convolutional neural networks (Convolutional Neural Network, CNN) address the limitations of fully connected networks by introducing convolution (Convolution) layers and pooling (Pooling) layers. A CNN typically consists of several convolutional layers (Convolutional Layer), activation layers (Activation Layer), pooling layers (Pooling Layer), and fully connected layers (Fully Connected Layer). The components of a CNN:

1. Convolutional layer:

The convolutional layer is the core of a CNN. Through the convolution operation it achieves two important goals: reducing dimensionality and extracting features.

2. Activation layer:

The activation layer passes the previous layer's linear output through a nonlinear activation function, which lets the network approximate arbitrary functions and strengthens its representational power. Commonly used activation functions include Sigmoid and ReLU (Rectified Linear Unit).

3. Pooling layer:

The pooling layer, also called the subsampling or downsampling layer (Subsampling Layer), reduces the amount of computation and improves generalization.

4. Fully connected layer:

Equivalent to a multi-layer perceptron (Multi-Layer Perceptron, MLP), it acts as the classifier of the whole network. A minimal sketch stacking these four layer types is shown below.
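To make the four components concrete, here is a minimal, hypothetical PyTorch sketch; all layer sizes are illustrative and not taken from the model later in this article:

import torch
import torch.nn as nn

# minimal CNN: convolution -> activation -> pooling -> fully connected
tiny_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # convolutional layer: extracts features
    nn.ReLU(),                        # activation layer: adds nonlinearity
    nn.MaxPool2d(2),                  # pooling layer: downsamples, cuts computation
    nn.Flatten(),
    nn.Linear(8 * 13 * 13, 10),       # fully connected layer: acts as the classifier
)

x = torch.randn(1, 1, 28, 28)         # one single-channel 28x28 input
print(tiny_cnn(x).shape)              # torch.Size([1, 10])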

II. TextCNN

TextCNN was proposed by Yoon Kim in 2014, bringing the convolution + max-pooling pairing that had been so successful in computer vision over to text. In the architecture figure (not reproduced here), the first layer's input is a 7*5 word-vector matrix, where the word-vector dimension is 5 and the sentence length is 7. The second layer uses three groups of convolution kernels of sizes 2, 3, and 4 (words), with two kernels of each size. Each kernel slides over the entire sentence length, yielding n activation values; no padding is used while sliding, so a kernel of size 4 sliding over a sentence of length 7 produces 4 feature values. Then comes convolution's longtime partner, global max pooling: each kernel's output column is reduced to its maximum over the whole sentence length, giving a feature map of 6 feature values that the downstream classifier uses as the basis for classification.

The architecture figure in Convolutional Neural Networks for Sentence Classification (Kim, 2014) illustrates this process well (figure not reproduced here). A small sketch verifying the shapes follows.
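As a check on those numbers, here is a minimal, hypothetical PyTorch sketch; the 7*5 input and the kernel sizes come from the description above, while everything else (random weights, batch of one) is illustrative:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 5)  # batch of 1, one channel, sentence length 7, embedding dim 5

features = []
for h in (2, 3, 4):                          # three kernel heights, two kernels each
    conv = nn.Conv2d(1, 2, kernel_size=(h, 5))
    out = conv(x)                            # [1, 2, 7 - h + 1, 1]
    pooled = out.max(dim=2).values           # global max pool over the sentence length
    features.append(pooled.flatten(1))       # [1, 2]

feature_map = torch.cat(features, dim=1)
print(feature_map.shape)                     # torch.Size([1, 6]) -- the 6 feature values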

III. TextCNN in Practice: Drug/Disease Classification with Text Convolution

(1) Data preprocessing

Convert the raw drug or disease data into the format we need:

Original format of the drug data: (screenshot not reproduced)

Original format of the disease data: (screenshot not reproduced)

The code below shows the processing for the drug data only:

Label conversion:

import pandas as pd

# assumes column 0 of the spreadsheet holds the class label and column 1 the drug name
df = pd.read_excel('./data/drug_class.xlsx')

text_drug = [i[1] for i in df.values]    # drug names
labels_drug = [i[0] for i in df.values]  # class labels

def label_turn(label_text):
    # map each distinct label to an integer id
    # (note: set iteration order is not deterministic across runs)
    dic_label = dict()
    for index, label in enumerate(set(label_text)):
        dic_label[label] = index
    return dic_label

dic_label = label_turn(labels_drug)
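As a quick sanity check, a hypothetical toy input shows the shape of the mapping (the exact ids depend on set iteration order):

demo_labels = ['antibiotic', 'analgesic', 'antibiotic']
print(label_turn(demo_labels))  # e.g. {'analgesic': 0, 'antibiotic': 1}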

Padding: pad the variable-length texts (i.e. drug or disease names) to the same length:

def padded(sentences, pad_token):  # pad every text to the same length with pad_token
    sentence = [list(i) for i in sentences]   # split each name into characters
    max_len = max(len(s) for s in sentence)   # length of the longest name
    for s in sentence:
        s.extend([pad_token] * (max_len - len(s)))
    print(max_len)
    return sentence

sentences = padded(text_drug, pad_token='0')
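For example, on a toy batch of two names (the printed max length comes first, then the padded lists):

print(padded(['氨曲南', '阿司匹林'], pad_token='0'))
# 4
# [['氨', '曲', '南', '0'], ['阿', '司', '匹', '林']]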

Build the vocabulary and map each character to an id:

def vocab_make(sentences):  # build the vocabulary and assign each character an id
    words = [word for sen in sentences for word in sen]
    vocab = set(words)
    dic_vocab = dict()
    for index, word in enumerate(vocab):
        dic_vocab[word] = index
    return vocab, dic_vocab

vocab, dic_vocab = vocab_make(sentences)

(2) Training

Module imports and parameter settings:

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# TextCNN parameters
embedding_size = 2
sequence_length = len(sentences[0])  # every padded sentence contains sequence_length (=32 here) characters
num_classes = len(set(labels_drug))  # num_classes = 25 for this dataset
batch_size = 10                      # number of sentences per batch
word2idx = {w: i for i, w in enumerate(vocab)}  # equivalent to dic_vocab above
vocab_size = len(vocab)

Convert the texts and labels to index tensors:

def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([dic_vocab[i] for i in sen])  # character -> id
    targets = []
    for out in labels:
        targets.append(dic_label[out])  # integer class ids for torch's CrossEntropyLoss
    return inputs, targets


input_batch, target_batch = make_data(sentences, labels_drug)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
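One batch drawn from the loader then has the shapes below (assuming sequence_length is 32, as in the author's data):

for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)  # torch.Size([10, 32]) torch.Size([10])
    break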

Build the TextCNN model and train it:

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        # convolution block
        self.conv = nn.Sequential(
            # conv : [input_channel(=1), output_channel, (filter_height, filter_width), stride=1]
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool : (filter_height, filter_width)
            nn.MaxPool2d((2, 1)),
        )
        # fc: 3 channels * 15 rows * 1 column after conv + pool
        # (see the shape derivation after the training loop)
        self.fc = nn.Linear(3 * 15 * 1, num_classes)

    def forward(self, X):
        '''
        X: [batch_size, sequence_length]
        '''
        batch_size = X.shape[0]
        embedding_X = self.W(X)                 # [batch_size, sequence_length, embedding_size]
        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch, 1, sequence_length, embedding_size]
        conved = self.conv(embedding_X)         # [batch_size, output_channel(=3), 15, 1]
        flatten = conved.view(-1, 3 * 15 * 1)
        output = self.fc(flatten)
        return output
model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(1000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 100 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
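Note that the hard-coded 3*15*1 in self.fc only works because sequence_length is 32; a short derivation (under that assumption) makes the dependency explicit:

# shape derivation, assuming sequence_length = 32 and embedding_size = 2
conv_h = 32 - 2 + 1       # Conv2d filter height 2, no padding: 32 -> 31 rows
conv_w = 2 - 2 + 1        # filter width = embedding_size, so the width collapses to 1
pool_h = conv_h // 2      # MaxPool2d((2, 1)) floors the height: 31 -> 15
flatten_dim = 3 * pool_h * conv_w  # output_channel(=3) * 15 * 1 = 45
print(flatten_dim)        # 45 -- the in_features of self.fc

If sequence_length changes, the in_features of self.fc must change with it.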

(3) Prediction:

'''predict'''
def padded_pre(sentences, pad_token):  # pad test texts to the training length
    sentence = [list(i) for i in sentences]
    max_len = 32  # must match the sequence_length used in training
    for i in range(0, len(sentence)):
        for j in range(0, max_len - len(sentence[i])):
            sentence[i].append(pad_token)
    return sentence

# Test
test_text = '氨曲南'
test_list = [test_text]
test_pro = padded_pre(test_list, pad_token='0')

tests = [[dic_vocab[n] for n in test_pro[0]]]  # every character must appear in the training vocabulary
test_batch = torch.LongTensor(tests).to(device)

model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]

# print the predicted label
for i in dic_label.keys():
    if predict[0][0] == dic_label[i]:
        print(i)
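One caveat: dic_vocab[n] raises a KeyError for any character not seen during training. A hedged workaround (not in the original code) is to fall back to the pad token's id; a dedicated unknown-token id would be cleaner:

# fall back to the pad token's id for out-of-vocabulary characters
# (an assumption, not part of the original code)
pad_id = dic_vocab['0']
tests = [[dic_vocab.get(n, pad_id) for n in test_pro[0]]]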

Key takeaway:

TextCNN's performance on drug and disease classification depends heavily on the number of classes and the amount of training data. In my drug dataset the gap between the largest class (300+ samples) and the smallest (20+ samples) is wide, and I made no special effort to augment the data, so the final prediction accuracy is not yet good enough for practical use. From another angle, though, the model is genuinely deployable: compared with the data requirements of RNN and BERT models, it is already quite undemanding; it just needs some effort spent cleaning and expanding the dataset.

Reposted from blog.csdn.net/L_goodboy/article/details/123822530