Convolutional Neural Networks in NLP Practice: Text Classification

As we all know, convolutional neural networks (CNNs) have made great progress in computer vision, but CNNs are also gradually gaining ground in natural language processing (NLP). This article takes text classification as an example to introduce a basic way of using convolutional neural networks in NLP. Since I am a beginner, and in order to avoid making mistakes, the theoretical introductions below are explained in a non-mathematical, more accessible way.


0. Text Classification

Text classification means using a computer to assign a text to class a or class b. It is a kind of classification problem and a fairly common task in NLP.


1. Word vector

When it comes to applying deep learning to NLP, we have to mention word vectors (distributed representations, often called word embeddings). There are already many introductions to word vectors, such as this blog post: http://blog.csdn.net/zhoubl668/article/details/23271225 which explains word vectors in relatively simple language.

Word vectors are obtained by training a language model with a neural network; during training a set of vectors is produced that represents each word as an n-dimensional vector. For example, if we want to represent "Beijing" as a 2-dimensional vector, one possible result is Beijing = (1.1, 2.2). Beyond simply mapping words to vectors, word vectors also ensure that words with similar semantics are close to each other in the vector space, and that analogies roughly hold, e.g. 'China' - 'Beijing' ≈ 'UK' - 'London'. This condition is satisfied, for instance, by the following vectors: 'Beijing' = (1.1, 2.2), 'China' = (1.2, 2.3), 'London' = (1.5, 2.4), 'UK' = (1.6, 2.5). Word vectors are commonly trained with Google's open-source word2vec program.
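To make the analogy concrete, here is a minimal numpy sketch that checks the 'China' - 'Beijing' ≈ 'UK' - 'London' relation with the toy 2-dimensional vectors above (the values are purely illustrative, not taken from any real trained model):

import numpy as np

# Toy 2-d word vectors copied from the example above (illustrative only).
vectors = {
    'Beijing': np.array([1.1, 2.2]),
    'China':   np.array([1.2, 2.3]),
    'London':  np.array([1.5, 2.4]),
    'UK':      np.array([1.6, 2.5]),
}

# If the vectors capture the capital-of relation, the two difference
# vectors should be roughly equal.
diff_china = vectors['China'] - vectors['Beijing']
diff_uk = vectors['UK'] - vectors['London']
print(diff_china, diff_uk)               # both approximately [0.1, 0.1]
print(np.allclose(diff_china, diff_uk))  # True for these toy values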


2. Combining Convolutional Neural Networks with Word Vectors

There are many blog posts about CNNs. If you are not familiar with the basic concepts of CNNs, you can refer to this blog post: http://blog.csdn.net/zhoubl668/article/details/23271225 I won't go into the details here.

Convolutional neural networks are usually used to process two-dimensional matrices similar to images (ignoring RGB channels). For example, a picture can be represented as a 2-dimensional array such as 255*255, meaning the image is 255 pixels wide and 255 pixels high. So how can a CNN be applied to text? The answer is word vectors.

We have just introduced the concept of word vectors; now let's see how word vectors turn a text into an image-like format. Generally speaking, a text can be regarded as a sequence of words. For example, take a text whose content is 'write code, change the world'. It can be converted into a word sequence like ('write', 'code', 'change', 'world'). Obviously this sequence is a one-dimensional vector and cannot be processed directly by a CNN.

But if we expand it with word vectors: suppose that in some word-vector model 'write' = (1.1, 2.1), 'code' = (1.5, 2.9), 'change' = (2.7, 3.1) and 'world' = (2.9, 3.5). Then the sequence ('write', 'code', 'change', 'world') can be rewritten as ((1.1, 2.1), (1.5, 2.9), (2.7, 3.1), (2.9, 3.5)). The original text sequence was a 4*1 vector, while the rewritten text can be represented as a 4*2 matrix. More generally, any text sequence can be represented as an m*d array, where m is the number of words in the sequence and d is the dimensionality of the word vectors.
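As a small sketch of this conversion (again using the purely illustrative toy vectors above), the word sequence can be looked up and stacked into an m*d matrix with numpy:

import numpy as np

# Toy 2-d word vectors for the example text 'write code, change the world'
# (illustrative values from the text above, not from a real embedding model).
word_vectors = {
    'write':  [1.1, 2.1],
    'code':   [1.5, 2.9],
    'change': [2.7, 3.1],
    'world':  [2.9, 3.5],
}

tokens = ['write', 'code', 'change', 'world']   # the 4*1 word sequence

# Look up each word and stack the vectors into an m*d matrix,
# where m is the number of words and d is the embedding dimension.
matrix = np.array([word_vectors[w] for w in tokens])
print(matrix.shape)   # (4, 2)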

3. Designing the Neural Network for Text Classification

The previous sections introduced word vectors and convolutional neural networks, and showed that a text can be converted into a two-dimensional matrix formed by expanding a word sequence into word vectors, which a CNN can then process. Below, taking text classification as an example, I show how to design such a network.

3.1 The text preprocessing pipeline

This part consists of 3 steps and 4 data states in total: 1. tokenize the raw text into a sequence of words; 2. convert the word sequence into a sequence of word IDs (each word in the vocabulary has a unique ID); 3. expand each element of the ID sequence (i.e. each word) into its word vector. The figure below (a rough hand-drawn sketch) illustrates the process:


The figure above takes the text 'write code, change the world' as an example and shows how it is converted into a sequence whose elements are word vectors, finally producing a 2-dimensional matrix that can be used for the subsequent network training.
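As a minimal Keras sketch of steps 1-3 (a toy one-sentence corpus and an assumed vocabulary limit of 1000 words; the full code in section 3.4 does the same thing on the real dataset):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ['write code, change the world']   # a toy one-document corpus

# Steps 1 and 2: split the raw text into words and map each word to a unique id.
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print(tokenizer.word_index)   # e.g. {'write': 1, 'code': 2, 'change': 3, 'the': 4, 'world': 5}
print(sequences)              # e.g. [[1, 2, 3, 4, 5]]

# Pad every id sequence to a fixed length so the samples can be batched together.
data = pad_sequences(sequences, maxlen=10)
print(data.shape)             # (1, 10)

# Step 3, expanding each id into its word vector, is done inside the model
# by the Embedding layer, as shown in the full code in section 3.4.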

3.2 Designing the neural network

The network design in this article is based on the following blog post:

http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/  Since that article is entirely in English and some readers may not yet be used to reading such material, I explain the network used in this article with a design diagram (again hand-drawn) as follows:


Briefly, the first layer in the figure is the data input layer, which expands the text sequence into a sequence of word vectors. It is followed by convolution, activation and pooling layers: because the convolution windows have different sizes, three convolutional layers are placed in parallel, and three (convolution, activation, pooling) blocks are stacked vertically. These are followed by a fully connected layer and an activation layer; the final activation is softmax, which outputs the probability that the text belongs to each class.
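The simple version of the code in section 3.4 only implements the vertically stacked convolution/pooling blocks; the parallel branches belong to the complete version linked at the end. As a rough sketch of the parallel part of the design using the Keras functional API (the window sizes 3, 4 and 5 and the other hyperparameters here are assumptions for illustration, not values taken from the complete version):

from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense, concatenate
from keras.models import Model

MAX_SEQUENCE_LENGTH = 1000
VOCAB_SIZE = 20000
EMBEDDING_DIM = 50
NUM_CLASSES = 20

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE + 1, EMBEDDING_DIM,
                     input_length=MAX_SEQUENCE_LENGTH)(sequence_input)

branches = []
for window in [3, 4, 5]:   # three parallel convolutions with different window sizes
    x = Conv1D(128, window, activation='tanh')(embedded)
    x = MaxPooling1D(MAX_SEQUENCE_LENGTH - window + 1)(x)   # pool over the whole sequence
    x = Flatten()(x)
    branches.append(x)

merged = concatenate(branches)   # place the three branches side by side
x = Dense(128, activation='tanh')(merged)
preds = Dense(NUM_CLASSES, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='Adadelta', metrics=['accuracy'])

Each branch pools over the whole sequence, so the concatenated features contain one 128-dimensional summary per window size.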


3.3 Framework, dataset and other requirements


3.3.1 Framework: this article uses the Keras framework to build the network. For an introduction to Keras, see the Chinese Keras documentation translated by MoYan: http://keras-cn.readthedocs.io/en/latest/ .

3.3.2 Dataset: the training texts come from 20_newsgroup, a dataset containing 20 categories of newsgroup messages. Download: http://www.qwone.com/~jason/20Newsgroups/

3.3.3 Word vectors: although Keras already provides an Embedding layer, this article uses GloVe vectors as pre-trained word vectors. An introduction to GloVe and its download page are below (the page may be slow to load):

http://nlp.stanford.edu/projects/glove/


3.4 Code and comments

Section 3.2 introduced the network design with a diagram, but since that may not be intuitive enough, the code is listed below. It is written with Keras and the key parts are commented. Part of the code comes from pretrained_word_embeddings.py in the example directory of the Keras documentation, but when I actually ran that program it had a bug that prevented training, so I made many changes; most importantly, I changed the activation from relu to tanh, and the overall structure was also changed substantially. If you are interested in the original Keras demo, see:

http://keras-cn.readthedocs.io/en/latest/blog/word_embedding/


Below is the text classification code used in this article (simple version):


'''This script loads pre-trained word embeddings (GloVe embeddings)
into a Keras Embedding layer, and uses it to
train a text classification model on the 20 Newsgroup dataset
(classification of newsgroup messages into 20 different categories).
GloVe embedding data can be found at:
http://nlp.stanford.edu/data/glove.6B.zip
(source page: http://nlp.stanford.edu/projects/glove/)
20 Newsgroup data can be found at:
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
'''

from __future__ import print_function
import os
import numpy as np
np.random.seed(1337)

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.optimizers import *
import sys

#BASE_DIR = '.' # the current directory
#GLOVE_DIR = BASE_DIR + '/glove.6B/' # change to match your actual directory
#TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/' # change to match your actual directory

GLOVE_DIR = '/home/zqzy/Downloads/glove/'  # change to match your actual directory
TEXT_DATA_DIR = '/home/zqzy/Downloads/Newsgroup/20_newsgroup/'  # change to match your actual directory
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
#EMBEDDING_DIM = 100
EMBEDDING_DIM = 50
VALIDATION_SPLIT = 0.2

# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
#f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
f = open(os.path.join(GLOVE_DIR, 'glove.6B.50d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

# second, prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()
                labels.append(label_id)

print('Found %s texts.' % len(texts))

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS) # the argument used to be called nb_words; it has been renamed to num_words

tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

print('Preparing embedding matrix.')

# prepare embedding matrix
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector # map the word index to its embedding vector (only indices up to nb_words)

# load pre-trained word embeddings into an Embedding layer
# note that here trainable=True, so the pre-trained embeddings are fine-tuned during training
# (the original Keras example keeps them fixed with trainable=False)
embedding_layer = Embedding(nb_words + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            weights=[embedding_matrix],
                            trainable=True)

print('Training model.')

# train a 1D convnet with max pooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='tanh')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='tanh')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='tanh')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='tanh')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

#sgd = SGD(lr=0.5, decay=1e-6, momentum=0.9, nesterov=True) 
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])

# happy learning!
#model.fit(x_train, y_train, validation_data=(x_val, y_val),
#          nb_epoch=2, batch_size=128)
model.fit(x_train, y_train, epochs=1) # the argument used to be called nb_epoch; it has been renamed to epochs; train for at least 1 epoch

score = model.evaluate(x_train, y_train, verbose=0) 
print('train score:', score[0])
print('train accuracy:', score[1])
score = model.evaluate(x_val, y_val, verbose=0) 
print('Validation score:', score[0])
print('Validation accuracy:', score[1])


Complete version: (still being debugged; to be added later)


The code and comments above describe the network structure in some detail, but when actually running the code it is best to remove any non-ASCII (Chinese) comments, otherwise you may run into encoding problems.

4. Summary

This article has described the full process of building a text classifier with deep learning and the Keras framework, and has given the corresponding code. For convenience, download link 1 (simple version) is given below:

https://github.com/894939677/deeplearning_by_diye/blob/master/pretrain_text_class_by_diye.py   (the simple version needs the modifications described above)

Download link 2 (complete version) is given below:

https://github.com/894939677/deeplearning_by_diye/blob/master/pre_merge_3.py (the complete version needs the modifications described above)


5. Postscript

The network structure described in this article is similar to GoogLeNet's (parallel branches); in fact, a ResNet-like structure could also be used for this task.
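As a purely hypothetical sketch (not code used in this article) of what a ResNet-like variant might look like, a residual block for 1D text convolution could be written as follows, assuming the input tensor already has the same number of channels as the block's filters:

from keras.layers import Conv1D, Activation, add

def residual_block(x, filters=128, kernel_size=5):
    # Two convolutions plus a skip connection: the defining pattern of ResNet.
    shortcut = x
    y = Conv1D(filters, kernel_size, padding='same', activation='tanh')(x)
    y = Conv1D(filters, kernel_size, padding='same')(y)
    y = add([y, shortcut])        # add the input back onto the convolution output
    return Activation('tanh')(y)

Several such blocks could be stacked in place of the plain convolution/pooling stack used above.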



