Application of CNN deep neural networks to NLP short-text similarity

Reprinted from: https://blog.csdn.net/diye2008/article/details/53762124?ref=myread
This article follows up on the previous one, which discussed the application of CNNs to text classification; here we discuss their application to text similarity calculation. Text similarity can be used in many fields such as search engines, text deduplication, text mining, and recommendation systems, and it is also one of the tasks that NLP needs to handle.

0. Text similarity calculation

So-called text similarity calculation means that, given two texts (usually strings), an algorithm produces a measure of how similar they are. This article first introduces the more traditional and common text similarity calculation methods, and then the CNN-based method.

1. Traditional text similarity calculation methods:

1.1 Longest common subsequence and longest common substring:

The idea of this method is fairly simple: directly apply the LCS (longest common subsequence) algorithm. Let string A have length L1 and string B have length L2, and let the length of their longest common subsequence be lcs; then the similarity measure of the two strings is sim = 2*lcs/(L1+L2). For a detailed description of the LCS algorithm, see LCS.
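As a concrete illustration (not from the original post), here is a minimal Python sketch of this LCS-based similarity; the function names are only for this example:

def lcs_length(a, b):
    # classic dynamic-programming LCS length for strings a and b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_similarity(a, b):
    # sim = 2 * lcs / (L1 + L2), as defined above
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))

# lcs_similarity(u'书写代码,改变世界', u'地产兴邦,称霸世界')
# -> 2 * 3 / (9 + 9) ≈ 0.33 (only ',', '世', '界' are shared in order)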

1.2 Cosine similarity:

The idea of this method is also very simple: segment the two strings into words, quantize each of them as a point in a multi-dimensional space, and compute the cosine similarity of the two texts. The specific formula can be looked up in the same way as for the LCS algorithm, so it is not repeated here.
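For instance, a minimal sketch of bag-of-words cosine similarity over jieba word segments could look like this (again illustrative, not the code used later in this post):

import math
from collections import Counter
import jieba

def cosine_similarity(text_a, text_b):
    # term-frequency vectors over the segmented words, then the cosine of the angle between them
    vec_a = Counter(jieba.cut(text_a))
    vec_b = Counter(jieba.cut(text_b))
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)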

1.3 Others

Besides the two ideas above, there are also edit distance, Jaccard distance, simhash, and so on. The way they are used is fairly similar; since they are not the focus of this article, they are not elaborated here either.
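For completeness, a minimal sketch of one of these, the Jaccard similarity over word sets (illustrative only; the corresponding distance is simply 1 minus this value):

import jieba

def jaccard_similarity(text_a, text_b):
    # Jaccard similarity of the two texts' word sets
    set_a = set(jieba.cut(text_a))
    set_b = set(jieba.cut(text_b))
    if not set_a and not set_b:
        return 1.0
    return float(len(set_a & set_b)) / len(set_a | set_b)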


2. The CNN-based method

2.1 Why use a CNN for text similarity calculation:

First of all, the input of text similarity calculation here is two strings, for example string A = '书写代码,改变世界' ("write code, change the world") and string B = '地产兴邦,称霸世界' ("real estate builds the nation, dominate the world"). How do we compute their text similarity? One option is to directly use the traditional text similarity methods mentioned above; if the requirements of the task are not high, computing directly on strings A and B is enough. So why do we need CNNs and word vectors to compute text similarity?

This is because traditional text similarity algorithms focus on the similarity of the text itself, and therefore require a lot of normalization, for example normalizing Chinese numerals against Arabic numerals, or normalizing Chinese and English units (e.g. kg and 千克). But even with normalization, many semantically similar texts still cannot get satisfactory results from these methods. Take two math problems: problem 1 = '一个苹果+二个苹果等于多少个苹果?' (one apple + two apples equals how many apples?) and problem 2 = '一个香蕉+二个香蕉等于多少个香蕉' (one banana + two bananas equals how many bananas?). The semantics of these two problems are obviously very close, yet the similarity computed by traditional methods is very low, which cannot meet the needs of the modern Internet or other fields for semantic text similarity tasks. So we turn to CNNs and word vectors to compute semantic text similarity, relying on the inherent properties of word vectors. If you are not familiar with word vectors, see the introduction to word vectors in my previous article on text classification.

2.2 The text preprocessing process:

The text preprocessing here is much the same as in the text classification article; if you have read that article, you can skip this step.

First, the input of text similarity calculation comes from two strings, for example string A = '书写代码,改变世界' and string B = '地产兴邦,称霸世界'.

The process is illustrated with string A. It is mainly divided into 3 steps, with 4 states in total: 1. segment the original text and convert it into a sequence of words; 2. convert the sequence of words into a sequence whose elements are word indices (each word in the vocabulary has a unique index); 3. expand each element (a word index) of the indexed sequence into the form of a word vector. The following picture (a sketch drawn by myself... 囧) shows this process:

Taking the text string '书写代码,改变世界' ("write code, change the world") as an example, the picture above shows how it is converted into a sequence whose elements are word vectors, finally yielding a 2-dimensional matrix that can be used for subsequent neural network training and so on.
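A minimal sketch of these three steps (using jieba and the Keras preprocessing utilities that also appear in the full code below; the tiny vocabulary, sequence length and random embedding here are purely illustrative):

# -*- coding: utf-8 -*-
import numpy as np
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

text_a = u'书写代码,改变世界'

# step 1: segment the raw string into a sequence of words
words = ' '.join(jieba.cut(text_a))            # e.g. u'书写 代码 , 改变 世界'

# step 2: map each word to its unique integer index and pad to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts([words])                # in practice, fit on the whole corpus
seq = pad_sequences(tokenizer.texts_to_sequences([words]), maxlen=10)

# step 3: expand each index into a word vector (a random toy embedding here)
EMBEDDING_DIM = 8
embedding_matrix = np.random.rand(len(tokenizer.word_index) + 1, EMBEDDING_DIM)
matrix_a = embedding_matrix[seq[0]]            # shape (10, 8): the 2-D matrix fed to the network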

2.3 Design of the neural network module

The neural network design in this article is inspired by the Weibo (microblog) similarity calculation example in the '泛型编程' (functional API) section of the Keras Chinese documentation referenced below, adapted to the actual needs here. I use a network design diagram to explain the neural network used in this article; the design diagram (again hand-drawn, 囧) is as follows:

A brief walkthrough of the figure above: the first layer is the data input layer, which expands each text sequence into a sequence of word vectors. There are two input streams because there are two inputs (string A and string B); in the vertical direction, two identical layer combinations (convolutional layer, activation layer, pooling layer) are placed, one per input. They are followed by a fully connected layer and an activation layer; the activation uses a sigmoid, whose output is a floating-point number between 0 and 1 representing the similarity between text A and text B: the larger the value, the greater the similarity.

2.4 Framework, dataset, and other requirements for the implementation


2.4.1 Framework: this article uses the Keras framework to write the neural network; for an introduction to Keras, see the Keras Chinese documentation (keras中文文档) translated by 莫言.

2.4.2 Dataset: 10,000 text pairs labelled as to whether they are semantically similar, for example one math problem and another math problem; if they are semantically similar the pair is labelled 1, otherwise 0.
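The exact file format is not spelled out in the post; judging from the parsing code in section 2.5, each line appears to carry five '\x01'-separated fields (label, id, text A, id, text B). A hypothetical example line, for illustration only:

# -*- coding: utf-8 -*-
# hypothetical training line, inferred from the parsing code in section 2.5:
#   label \x01 id_a \x01 text_a \x01 id_b \x01 text_b
sample_line = u'1\x01101\x01一个苹果+二个苹果等于多少个苹果?\x01102\x01一个香蕉+二个香蕉等于多少个香蕉'
label, id_a, text_a, id_b, text_b = sample_line.split('\x01')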

2.4.3 Word vectors: although Keras already has an Embedding layer, this article uses GloVe word vectors as the pre-trained word vectors. An introduction to GloVe and the download address are below (the page may be slow to open):

http://nlp.stanford.edu/projects/glove/

2.5 Code and corresponding comments

Section 2.3 already introduced the design of the neural network with a figure, but since that may not be intuitive enough, the code used is listed below. It is written with Keras, and the key parts are commented:


from keras.layers import Input, LSTM, Dense, merge, Conv1D, MaxPooling1D, Flatten, Embedding, Dropout  
from keras.models import Model  
import numpy as np  
from keras.utils.np_utils import to_categorical  
from keras.preprocessing.text import Tokenizer  
from keras.preprocessing.sequence import pad_sequences  
import pdb  
from keras import backend as K  

import theano.tensor as T # tensor  
from theano import function # function  
from keras.engine.topology import Layer  
import os  
import sys  
import jieba  
jieba.load_userdict("./science")  

input_train = sys.argv[1] # s_label_zfli  

BASE_DIR = '.'  
GLOVE_DIR = BASE_DIR + '/wordvec/' # directory of the pre-trained word vectors
MAX_SEQUENCE_LENGTH = 200  
MAX_NB_WORDS = 200000  
EMBEDDING_DIM = 200  
VALIDATION_SPLIT = 0.2  

print('Indexing word vectors.')  

embeddings_index = {}  
#f = open(os.path.join(GLOVE_DIR, '125_vec'))   
f = open(os.path.join(GLOVE_DIR, 'old_vec')) # pre-trained word vectors (can also be trained yourself with word2vec); the following lines read them in one by one
for line in f:  
    values = line.split()  
    word = values[0]  
    coefs = np.asarray(values[1:], dtype='float32')  
    embeddings_index[word] = coefs  
f.close()  

print('Found %s word vectors.' % len(embeddings_index))  

print('Processing text dataset')  

# good  

texts = []  # list of text samples  
labels_index = {}  # dictionary mapping label name to numeric id  

labels = []  # list of label ids  
train_left = []  
train_right = []  
# the following lines read the training set: each line holds two strings plus a label saying whether they are similar
for line in open(sys.argv[1]): # train: read the training set
    line  = line.strip()  
    tmp = line  
    line = line.split('\1')  

    if len(line)<5:  
        continue  

    label_id = line[0]  
    tid = line[1]  
    title = line[2]  
    tid = line[3]  
    title_right = line[4].strip() # need strip at this line  
    seg_list = jieba.cut(title)   
    seg_list_right = jieba.cut(title_right)   
    text_left = (' '.join(seg_list)).encode('utf-8','ignore').strip()  
    text_right = (' '.join(seg_list_right)).encode('utf-8','ignore').strip()  
    #print text_left  
    #print text_right  

    texts.append(text_left)  
    texts.append(text_right)  

    labels.append(float(label_id))  
    train_left.append(text_left)  
    train_right.append(text_right)  


print('Found %s left.' % len(train_left))  
print('Found %s right.' % len(train_right))  
print('Found %s labels.' % len(labels))  

# finally, vectorize the text samples into a 2D integer tensor  
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)  
tokenizer.fit_on_texts(texts)  

sequences_left = tokenizer.texts_to_sequences(train_left)  
sequences_right = tokenizer.texts_to_sequences(train_right)  
#for item  in sequences_left:  
#    print item  

word_index = tokenizer.word_index  
print('Found %s unique tokens.' % len(word_index))  

data_left = pad_sequences(sequences_left, maxlen=MAX_SEQUENCE_LENGTH,padding='pre', truncating='post')  
data_right = pad_sequences(sequences_right, maxlen=MAX_SEQUENCE_LENGTH, truncating='post')  
labels = np.array(labels)  

#labels = to_categorical(np.asarray(labels))  

# split the data into a training set and a validation set  
indices = np.arange(data_left.shape[0])  
np.random.shuffle(indices)  

data_left = data_left[indices]  
data_right = data_right[indices]  

labels = labels[indices]  
nb_validation_samples = int(VALIDATION_SPLIT * data_left.shape[0]) # create val and sp  

input_train_left = data_left[:-nb_validation_samples]  
input_train_right = data_right[:-nb_validation_samples]  

val_left = data_left[-nb_validation_samples:]  
val_right = data_right[-nb_validation_samples:]  

labels_train = labels[:-nb_validation_samples]  
labels_val = labels[-nb_validation_samples:]  

print('Preparing embedding matrix.')  

# prepare embedding matrix  
nb_words = min(MAX_NB_WORDS, len(word_index))  
#print type(word_index)  
#for  item in word_index:  
#     print item + '\t' + str(word_index[item])  
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))  
for word, i in word_index.items():  
    if i > MAX_NB_WORDS:  
        continue  
    embedding_vector = embeddings_index.get(word)  
    if embedding_vector is not None:  
        # words not found in embedding index will be all-zeros.  
        embedding_matrix[i] = embedding_vector # word_index to word_embedding_vector ,<20000(nb_words)  
# load pre-trained word embeddings into an Embedding layer  
# note that we set trainable = False so as to keep the embeddings fixed  
''''' 
embedding_layer = Embedding(nb_words + 1, 
                            EMBEDDING_DIM, 
                            input_length=MAX_SEQUENCE_LENGTH, 
                            weights=[embedding_matrix], 
                            trainable=True) 
'''  

print('Training model.')  

# train a 1D convnet with max pooling
#left model  


''''' 
data_1 = np.random.randint(low = 0, high = 200, size = (500, 140)) 
data_2 = np.random.randint(low = 0 ,high = 200, size = (500, 140)) 
labels = np.random.randint(low=0, high=2, size=(500, 1)) 
#labels = to_categorical(labels, 10) # to one-hot 
'''  

tweet_a = Input(shape=(MAX_SEQUENCE_LENGTH,))  
tweet_b = Input(shape=(MAX_SEQUENCE_LENGTH,))  

tweet_input = Input(shape=(MAX_SEQUENCE_LENGTH,))  
# the following lines build the neural network; see the design diagram above
embedding_layer = Embedding(nb_words + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, weights=[embedding_matrix], trainable=True)(tweet_input)  

conv1 = Conv1D(128, 3, activation='tanh')(embedding_layer)  
drop_1 = Dropout(0.2)(conv1)  
max_1 = MaxPooling1D(3)(drop_1)  
conv2 = Conv1D(128, 3, activation='tanh')(max_1)  
drop_2 = Dropout(0.2)(conv2)  
max_2 = MaxPooling1D(3)(drop_2)  
#conv2 = Conv1D(128, 3, activation='tanh')(max_1)  
#max_2 = MaxPooling1D(3)(conv2)  
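# note: out_1 below is taken from max_1, so conv2 / drop_2 / max_2 defined above do not end up in the model graph (apparently left over from experimentation)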
out_1 = Flatten()(max_1)  
#out_1 = LSTM(128)(max_1)  
model_encode = Model(tweet_input, out_1) # 500(examples) * 5888  

encoded_a = model_encode(tweet_a)  
encoded_b = model_encode(tweet_b)  

merged_vector = merge([encoded_a, encoded_b], mode='concat') # good  
dense_1 = Dense(128,activation='relu')(merged_vector)  
dense_2 = Dense(128,activation='relu')(dense_1)  
dense_3 = Dense(128,activation='relu')(dense_2)  

predictions = Dense(1, activation='sigmoid')(dense_3)  
#predictions = Dense(len(labels_index), activation='softmax')(merged_vector)  

model = Model(input=[tweet_a, tweet_b], output=predictions)  
model.compile(optimizer='rmsprop',  
              loss='binary_crossentropy',  
              metrics=['accuracy'])  
# training
model.fit([input_train_left,input_train_right], labels_train, nb_epoch=5)  
json_string = model.to_json()  # json_string = model.get_config()  
open('my_model_architecture.json','w').write(json_string)    
model.save_weights('my_model_weights.h5')    
# evaluate the trained network
score = model.evaluate([input_train_left,input_train_right], labels_train, verbose=0)   
print('train score:', score[0]) # loss on the training set
print('train accuracy:', score[1]) # accuracy on the training set
score = model.evaluate([val_left, val_right], labels_val, verbose=0)   
print('Test score:', score[0]) # loss on the test (validation) set
print('Test accuracy:', score[1]) # accuracy on the test (validation) set

The code and comments above describe the structure of the neural network in detail; when actually running the code, it is best to keep non-ASCII characters out of the source comments, otherwise encoding problems may occur. Since the code above is similar in structure to that of my previous article, if you run into something you don't understand, you can refer to the code comments in that article.
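As a usage sketch (not part of the original post), the saved architecture and weights could be reloaded and applied to a new pair of strings roughly as follows, assuming the tokenizer fitted during training is still available and the same padding settings are used:

import jieba
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences

# reload the architecture and weights saved by the training script above
model = model_from_json(open('my_model_architecture.json').read())
model.load_weights('my_model_weights.h5')
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

def similarity(text_a, text_b, tokenizer, maxlen=200):
    # preprocess two raw strings the same way as in training, then predict their similarity
    seg_a = ' '.join(jieba.cut(text_a))
    seg_b = ' '.join(jieba.cut(text_b))
    seq_a = pad_sequences(tokenizer.texts_to_sequences([seg_a]), maxlen=maxlen, truncating='post')
    seq_b = pad_sequences(tokenizer.texts_to_sequences([seg_b]), maxlen=maxlen, truncating='post')
    return float(model.predict([seq_a, seq_b])[0][0])   # a value between 0 and 1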

3. Summary

This article describes the whole process of using deep learning and the Keras framework to build a text similarity calculator, and gives the corresponding code implementation. For your convenience, the github download address of the code is given below.
